Analyzing Large-Scale User Data with Hadoop and HBase

A more technical presentation about WibiData, given to the Bay Area Search meetup on January 25, 2012.

Presentation Transcript

  • Analyzing Large-Scale User Data with Hadoop and HBase – Aaron Kimball, CTO, Odiago, Inc.
  • WibiData is…
    • A large-scale storage, serving, and analysis platform
    • For user- or other entity-centric data
  • WibiData use cases
    • Product/content recommendations – “Because you liked book X, you may like book Y”
    • Ad targeting – “Because of your interest in sports, check out…”
    • Social network analysis – “You may know these people…”
    • Fraud detection, anti-spam, search personalization…
  • Use case characteristics
    • Have a large number of users
    • Want to store (large) transaction data as well as derived data (e.g., recommendations)
    • Need to serve recommendations interactively
    • Require a combination of offline and on-the-fly computation
  • A typical workflow (diagram)
  • Challenges
    • Support real-time retrieval of profile data
    • Store a long transactional data history
    • Keep related data logically and physically close
    • Update data in a timely fashion without wasting computation
    • Fault tolerance
    • Data schema changes over time
  • WibiData architecture (diagram; Certified Technology product badge)
  • HBase data model
    • Data in cells, addressed by four “coordinates”:
      – Row id (primary key)
      – Column family
      – Column “qualifier”
      – Timestamp
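
As a concrete illustration, here is a minimal sketch of writing and reading one cell at its full four-coordinate address with the Java HBase client. The table name, family, and qualifier are hypothetical, and the API shown is the modern (1.x+) client rather than the one current in 2012.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
          byte[] row = Bytes.toBytes("user123");        // row id (primary key)
          byte[] family = Bytes.toBytes("info");        // column family
          byte[] qualifier = Bytes.toBytes("email");    // column "qualifier"
          long ts = System.currentTimeMillis();         // timestamp

          // Write one cell at the full four-coordinate address.
          Put put = new Put(row);
          put.addColumn(family, qualifier, ts, Bytes.toBytes("user123@example.com"));
          table.put(put);

          // Read the most recent version of the same cell back.
          Result result = table.get(new Get(row).addColumn(family, qualifier));
          System.out.println(Bytes.toString(result.getValue(family, qualifier)));
        }
      }
    }
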
  • Schema free: not what you want
    • HBase may not impose a schema, but your data still has one
    • It is up to the application to determine how to organize & interpret data
    • You still need to pick a serialization system
  • Schemas = trade-offs
    • Different schemas enable efficient storage/retrieval/analysis of different types of data
    • Physical organization of data still makes a big difference
      – Especially with respect to read/write patterns
  • WibiData workloads
    • Large number of fat rows (one per user)
    • Each row updated relatively few times/day
      – Though updates may involve large records
    • Raw data written to one set of columns
    • Processed results read from another
      – Often with an interactive latency requirement
    • Needs to support complex data types
  • Serialization with Apache Avro
    • Apache Avro provides flexible serialization
    • All data is written along with its “writer schema”
    • The reader schema may differ from the writer’s

      {
        "type": "record",
        "name": "LongList",
        "fields": [
          {"name": "value", "type": "long"},
          {"name": "next", "type": ["LongList", "null"]}
        ]
      }
  • Serialization with Apache Avro (continued)
    • No code generation required
    • Producers and consumers of data can migrate independently
    • Data format migrations do not require structural changes to the underlying data
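
The following sketch shows Avro’s schema resolution in the generic (no code generation) API, under the assumption that a reader has added an optional field to the LongList record above. The Avro classes and methods are real; the added "label" field is illustrative.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    public class AvroEvolution {
      public static void main(String[] args) throws Exception {
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LongList\",\"fields\":["
          + "{\"name\":\"value\",\"type\":\"long\"},"
          + "{\"name\":\"next\",\"type\":[\"LongList\",\"null\"]}]}");
        // The reader added a field with a default; no code generation needed.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LongList\",\"fields\":["
          + "{\"name\":\"value\",\"type\":\"long\"},"
          + "{\"name\":\"next\",\"type\":[\"LongList\",\"null\"]},"
          + "{\"name\":\"label\",\"type\":\"string\",\"default\":\"\"}]}");

        // Serialize one record with the writer schema.
        GenericRecord rec = new GenericData.Record(writer);
        rec.put("value", 42L);
        rec.put("next", null);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();

        // Deserialize with a *different* reader schema; "label" takes its default.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
        System.out.println(decoded);  // {"value": 42, "next": null, "label": ""}
      }
    }
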
  • WibiData: An extended data model
    • Columns or whole families have common Avro schemas for evolvable storage and retrieval

      <column>
        <name>email</name>
        <description>Email address</description>
        <schema>"string"</schema>
      </column>
  • WibiData: An extended data model
    • Column families are a logical concept
    • Data is physically arranged in locality groups
    • Row ids are hashed for uniform write pressure
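
A minimal sketch of the general row-id hashing technique: the slides do not show WibiData’s exact scheme, so the digest choice and the two-byte prefix length here are assumptions.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class RowKeys {
      /**
       * Prefix the user id with a short hash so that sequential ids spread
       * uniformly across regions instead of hot-spotting one region server.
       */
      public static byte[] makeRowKey(String userId) throws Exception {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        byte[] digest = MessageDigest.getInstance("MD5").digest(id);
        byte[] key = new byte[2 + id.length];   // 2-byte hash prefix (assumed)
        System.arraycopy(digest, 0, key, 0, 2);
        System.arraycopy(id, 0, key, 2, id.length);
        return key;
      }
    }
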
  • WibiData: An extended data model
    • Wibi uses 3-d storage (row × column × timestamp)
    • Data is often sorted by timestamp
  • Analyzing data: Producers
    • Producers create derived column values
    • The produce operator works on one row at a time
      – Can be run in MapReduce, or on a one-off basis
    • Produce is a row-mutation operator
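
In outline, a producer might look like the following. This is a hypothetical sketch of the model described above, not the actual WibiData API; Row, RowMutation, and the method names are illustrative.

    /** Hypothetical sketch of the producer model; not the actual WibiData API. */
    public interface Producer {
      /** Read-only view of one user's row. */
      interface Row { byte[] get(String family, String qualifier); }
      /** Collects derived values to write back to the same row. */
      interface RowMutation { void put(String family, String qualifier, byte[] value); }

      /**
       * Called once per row, in MapReduce or on a one-off basis.
       * Reads raw input columns and writes derived values back to
       * output columns: a row-mutation operator.
       */
      void produce(Row row, RowMutation output);
    }
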
  • Analyzing data: Gatherers
    • Gatherers aggregate data across all rows
    • Always run within MapReduce
    • A bridge between rows and (key, value) pairs
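
Again as a hypothetical sketch (illustrative names, not the actual API), a gatherer bridges rows to the (key, value) pairs that MapReduce consumes:

    /** Hypothetical sketch of the gatherer model; not the actual WibiData API. */
    public interface Gatherer<K, V> {
      /** Read-only view of one user's row (as in the producer sketch above). */
      interface Row { byte[] get(String family, String qualifier); }
      /** Receives the (key, value) pairs that feed the MapReduce shuffle. */
      interface Emitter<K, V> { void emit(K key, V value); }

      /** Called once per row; a bridge between rows and (key, value) pairs. */
      void gather(Row row, Emitter<K, V> emitter);
    }
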
  • Interactive access: REST API (diagram: PUT request, GET request)
    • The REST API provides interactive access
    • Producers can be triggered “on demand” to create fresh recommendations
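
For flavor, a client-side sketch using Java’s built-in HTTP client; the endpoint URL and resource path are hypothetical, since the slides do not specify the REST API’s routes.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestExample {
      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical route: GET a user's fresh recommendations.
        HttpRequest get = HttpRequest.newBuilder(
            URI.create("http://localhost:8080/users/user123/recommendations"))
            .GET().build();
        HttpResponse<String> response =
            client.send(get, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
      }
    }
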
  • Example: Ad Targeting (diagram)
  • Example: Ad Targeting Producer (diagram)
  • Gathering Category Associations
    • Gather observed behavior
    • Associate interests, clicks
    • … for all pairs
    • And aggregate across all users (diagrams: map phase, reduce phase)
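
A minimal MapReduce sketch of this gather-and-aggregate step: the map phase emits one (interest, clicked-category) pair per combination within a user, and the reduce phase sums co-occurrence counts across all users. The input format (one user per line, interests and clicks as comma-separated fields) and all class names are assumptions for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for one user, emit every (interest, clicked-category) pair.
    public class AssociationMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      private final Text pair = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // Assumed input: one user per line, "interest1,interest2<TAB>click1,click2".
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        for (String interest : parts[0].split(",")) {
          for (String click : parts[1].split(",")) {
            pair.set(interest + "|" + click);
            ctx.write(pair, ONE);
          }
        }
      }
    }

    // Reduce phase: sum each pair's co-occurrence count across all users.
    class AssociationReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text pair, Iterable<LongWritable> counts, Context ctx)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) sum += c.get();
        ctx.write(pair, new LongWritable(sum));
      }
    }
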
  • Conclusions
    • Hadoop, HBase, and Avro form the core of a large-scale machine learning/analysis platform
    • How you set up your schema matters
    • The producer/gatherer programming model allows computations over tables to be expressed naturally; works with MapReduce
  • @wibidata • Aaron Kimball – aaron@odiago.com