Analyzing Large-Scale User Data with Hadoop and HBase

A more technical presentation about WibiData, presented to the Bay Area Search meetup on 1/25/2012.

Analyzing Large-Scale User Data with Hadoop and HBase: Presentation Transcript

  • Analyzing Large-Scale User Data with Hadoop and HBase • Aaron Kimball – CTO, Odiago, Inc.
  • WibiData is… • A large-scale storage, serving, and analysis platform • For user- or other entity-centric data
  • WibiData use cases • Product/content recommendations – “Because you liked book X, you may like book Y” • Ad targeting – “Because of your interest in sports, check out…” • Social network analysis – “You may know these people…” • Fraud detection, anti-spam, search personalization…
  • Use case characteristics • Have a large number of users • Want to store (large) transaction data as well as derived data (e.g., recommendations) • Need to serve recommendations interactively • Require a combination of offline and on-the-fly computation
  • A typical workflow (diagram)
  • Challenges • Support real-time retrieval of profile data • Store a long transactional data history • Keep related data logically and physically close • Update data in a timely fashion without wasting computation • Fault tolerance • Data schema changes over time
  • WibiData architecture (diagram; “Certified Technology product” badge)
  • HBase data model • Data in cells, addressed by four “coordinates” – Row id (primary key) – Column family – Column “qualifier” – Timestamp (see the sketch below)
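
A minimal sketch of the four-coordinate addressing using the standard HBase Java client. The table name "users", family "info", and qualifier "email" are assumptions for illustration, not details from the talk.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCoordinates {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
          // Write one cell: row id + column family + qualifier + timestamp -> value.
          Put put = new Put(Bytes.toBytes("user42"));            // row id
          put.addColumn(Bytes.toBytes("info"),                   // column family
                        Bytes.toBytes("email"),                  // column qualifier
                        System.currentTimeMillis(),              // timestamp
                        Bytes.toBytes("user42@example.com"));    // cell value
          table.put(put);

          // Read the same cell back by its coordinates.
          Get get = new Get(Bytes.toBytes("user42"));
          get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
          Result result = table.get(get);
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        }
      }
    }
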
  • Schema-free: not what you want • HBase may not impose a schema, but your data still has one • It is up to the application to determine how to organize & interpret data • You still need to pick a serialization system
  • Schemas = trade-offs • Different schemas enable efficient storage/retrieval/analysis of different types of data • Physical organization of data still makes a big difference – especially with respect to read/write patterns
  • WibiData workloads • Large number of fat rows (one per user) • Each row updated relatively few times per day – though updates may involve large records • Raw data written to one set of columns • Processed results read from another – often with an interactive latency requirement • Needs to support complex data types
  • Serialization with Avro • Apache Avro provides flexible serialization • All data is written along with its “writer schema” • The reader schema may differ from the writer’s – for example:

        {
          "type": "record",
          "name": "LongList",
          "fields": [
            {"name": "value", "type": "long"},
            {"name": "next", "type": ["LongList", "null"]}
          ]
        }
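
A minimal round-trip sketch of the LongList schema above using Avro's generic Java API; here the reader simply uses the writer's schema, and the migration point on the next slide is sketched separately after it.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    public class AvroRoundTrip {
      static final String WRITER_SCHEMA =
          "{\"type\": \"record\", \"name\": \"LongList\", \"fields\": ["
        + " {\"name\": \"value\", \"type\": \"long\"},"
        + " {\"name\": \"next\", \"type\": [\"LongList\", \"null\"]}]}";

      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(WRITER_SCHEMA);
        GenericRecord head = new GenericData.Record(schema);
        head.put("value", 1L);
        head.put("next", null);

        // Serialize with the writer schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(head, encoder);
        encoder.flush();

        // Deserialize; the reader schema here happens to equal the writer's.
        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord copy =
            new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(copy.get("value"));  // prints 1
      }
    }
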
  • Serialization with Avro • No code generation required • Producers and consumers of data can migrate independently • Data format migrations do not require structural changes to the underlying data (see the evolution sketch below)
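
A minimal sketch of independent migration, using a hypothetical "Event" record invented for illustration: the reader's schema adds a "label" field with a default, so bytes written under the old schema still decode without rewriting the stored data.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    public class AvroEvolution {
      public static void main(String[] args) throws Exception {
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"value\",\"type\":\"long\"}]}");
        // The consumer independently added a field; its default fills old data.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"value\",\"type\":\"long\"},"
          + "{\"name\":\"label\",\"type\":\"string\",\"default\":\"none\"}]}");

        GenericRecord rec = new GenericData.Record(writer);
        rec.put("value", 7L);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();

        // Decode the old bytes with the new reader schema.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
        System.out.println(decoded.get("label"));  // prints "none"
      }
    }
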
  • WibiData: An extended data model • Columns or whole families have common Avro schemas for evolvable storage and retrieval:

        <column>
          <name>email</name>
          <description>Email address</description>
          <schema>"string"</schema>
        </column>
  • WibiData: An extended data model • Column families are a logical concept • Data is physically arranged in locality groups • Row ids are hashed for uniform write pressure (see the key-hashing sketch below)
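
A minimal sketch of row-id hashing as a general HBase technique: prefixing each key with a few bytes of its hash spreads otherwise-sequential user ids across regions. This illustrates the idea only; the talk does not describe WibiData's actual key layout.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class HashedRowKey {
      // Build a row key as [4-byte hash prefix][original user id].
      static byte[] rowKey(String userId) throws Exception {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        byte[] hash = MessageDigest.getInstance("MD5").digest(id);
        byte[] key = new byte[4 + id.length];
        System.arraycopy(hash, 0, key, 0, 4);  // spreads write pressure uniformly
        System.arraycopy(id, 0, key, 4, id.length);
        return key;
      }
    }
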
  • WibiData: An extended data model • Wibi uses 3-D storage • Data is often sorted by timestamp (a versioned-read sketch follows)
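
A minimal sketch of the third (timestamp) dimension in plain HBase: reading several versions of one cell, newest first. It reuses the hypothetical "users"/"info"/"email" layout from the earlier sketch and assumes the family is configured to retain multiple versions.

    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersionedRead {
      static void printHistory(Table table) throws Exception {
        Get get = new Get(Bytes.toBytes("user42"));
        get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
        get.setMaxVersions(10);  // fetch up to 10 timestamped versions
        Result result = table.get(get);
        // Cells come back sorted newest-first within the column.
        List<Cell> history =
            result.getColumnCells(Bytes.toBytes("info"), Bytes.toBytes("email"));
        for (Cell cell : history) {
          System.out.println(cell.getTimestamp() + " -> "
              + Bytes.toString(CellUtil.cloneValue(cell)));
        }
      }
    }
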
  • Analyzing data: Producers • Producers create derived column values • The produce operator works on one row at a time – can be run in MapReduce, or on a one-off basis • Produce is a row-mutation operator (see the sketch below)
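
A hedged sketch of the producer idea, one row in, derived columns out on the same row. The Producer, RowData, and RowMutation types are hypothetical stand-ins invented for illustration; the talk does not show WibiData's actual interfaces.

    // Hypothetical row-access types, not WibiData's real API.
    interface RowData {
      long getLong(String family, String qualifier);
    }

    interface RowMutation {
      void put(String family, String qualifier, long value);
    }

    interface Producer {
      // Called once per row; may run inside MapReduce or on a one-off basis.
      void produce(RowData input, RowMutation output);
    }

    class ClickScoreProducer implements Producer {
      @Override
      public void produce(RowData input, RowMutation output) {
        long clicks = input.getLong("raw", "ad_clicks");
        long views = input.getLong("raw", "ad_views");
        // Mutate the same row: write a derived value to a processed column.
        output.put("derived", "click_score",
            views == 0 ? 0 : 100 * clicks / views);
      }
    }
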
  • Analyzing data: Gatherers • Gatherers aggregate data across all rows • Always run within MapReduce • A bridge between rows and (key, value) pairs (see the sketch below)
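
A hedged sketch of the gatherer idea in the same hypothetical style, reusing the RowData stand-in from the producer sketch: each row is turned into (key, value) pairs that MapReduce then shuffles and aggregates.

    // Hypothetical bridge from rows to MapReduce (key, value) pairs.
    interface KeyValueSink {
      void emit(String key, long value);
    }

    interface Gatherer {
      // Called once per row inside a MapReduce job; emitted pairs feed the
      // shuffle and are aggregated in the reduce phase.
      void gather(RowData input, KeyValueSink sink);
    }

    class ClicksGatherer implements Gatherer {
      @Override
      public void gather(RowData input, KeyValueSink sink) {
        sink.emit("total_ad_clicks", input.getLong("raw", "ad_clicks"));
      }
    }
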
  • Interactive access: REST API (diagram: PUT request / GET request) • The REST API provides interactive access • Producers can be triggered “on demand” to create fresh recommendations (a client sketch follows)
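
A hedged client-side sketch using Java 11's built-in HTTP client. The host and resource paths are made up for illustration; the talk does not specify the actual endpoint layout.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestAccess {
      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // GET: fetch recommendations interactively (hypothetical path).
        HttpRequest get = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/users/user42/recommendations"))
            .GET()
            .build();
        System.out.println(
            client.send(get, HttpResponse.BodyHandlers.ofString()).body());

        // PUT: record a new observation, which could trigger a producer
        // "on demand" to refresh recommendations (hypothetical path/body).
        HttpRequest put = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/users/user42/events"))
            .PUT(HttpRequest.BodyPublishers.ofString("{\"click\": \"sports\"}"))
            .build();
        client.send(put, HttpResponse.BodyHandlers.ofString());
      }
    }
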
  • Example: Ad Targeting (diagram)
  • Example: Ad Targeting Producer (diagram)
  • Example: Ad Targeting Producer (diagram, continued)
  • Gathering Category Associations • Gather observed behavior (diagram)
  • Gathering Category Associations • Associate interests, clicks (diagram)
  • Gathering Category Associations • … for all pairs (diagram)
  • Gathering Category Associations • And aggregate across all users (diagram: map phase, reduce phase; see the MapReduce sketch below)
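
A minimal Hadoop MapReduce sketch of the aggregation step: the map phase emits a count of one per observed (interest, clicked-category) pair from a user's history, and the reduce phase sums those counts across all users. The input is simplified to text lines of the form "interest<TAB>category", an assumption for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CategoryAssociations {
      public static class AssociationMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text pair, Context context)
            throws IOException, InterruptedException {
          // One input line = one observed association for one user.
          context.write(pair, ONE);
        }
      }

      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text pair, Iterable<LongWritable> counts,
            Context context) throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : counts) {
            total += c.get();
          }
          context.write(pair, new LongWritable(total));  // aggregate across users
        }
      }
    }
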
  • Conclusions • Hadoop, HBase, and Avro form the core of a large-scale machine learning/analysis platform • How you set up your schema matters • The producer/gatherer programming model allows computations over tables to be expressed naturally; works with MapReduce
  • www.wibidata.com / @wibidata • Aaron Kimball – aaron@odiago.com