Analyzing Large-Scale User Data with Hadoop and HBase


A more technical presentation about WibiData presented to the Bay Area Search meetup 1/25/2012.


  1. Analyzing Large-Scale User Data with Hadoop and HBase
     Aaron Kimball, CTO, Odiago, Inc.
  2. WibiData is…
     • A large-scale storage, serving, and analysis platform
     • For user- or other entity-centric data
  3. WibiData use cases
     • Product/content recommendations – “Because you liked book X, you may like book Y”
     • Ad targeting – “Because of your interest in sports, check out…”
     • Social network analysis – “You may know these people…”
     • Fraud detection, anti-spam, search personalization…
  4. Use case characteristics
     • Have a large number of users
     • Want to store (large) transaction data as well as derived data (e.g., recommendations)
     • Need to serve recommendations interactively
     • Require a combination of offline and on-the-fly computation
  5. A typical workflow
  6. Challenges
     • Support real-time retrieval of profile data
     • Store a long transactional data history
     • Keep related data logically and physically close
     • Update data in a timely fashion without wasting computation
     • Fault tolerance
     • Data schema changes over time
  7. WibiData architecture (Certified Technology product)
  8. HBase data model
     • Data lives in cells, addressed by four “coordinates”:
       – Row id (primary key)
       – Column family
       – Column “qualifier”
       – Timestamp
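The four-coordinate addressing scheme above can be sketched with a toy in-memory table. This is an illustrative model only (the class name, method signatures, and data layout are invented for this sketch, not the real HBase client API): every value is located by (row id, column family, qualifier, timestamp), and a read without an explicit timestamp returns the most recent version, as HBase does.

```python
from collections import defaultdict

class ToyHBaseTable:
    """Illustrative in-memory model of HBase cell addressing (not the
    real HBase client). Each value is located by four coordinates:
    (row id, column family, qualifier, timestamp)."""

    def __init__(self):
        # (row, family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, family, qualifier, value, timestamp):
        self._cells[(row, family, qualifier)][timestamp] = value

    def get(self, row, family, qualifier, timestamp=None):
        versions = self._cells.get((row, family, qualifier), {})
        if not versions:
            return None
        if timestamp is None:
            # No explicit timestamp: the newest version wins, as in HBase.
            timestamp = max(versions)
        return versions.get(timestamp)
```

Writing two versions of the same cell and reading it back without a timestamp returns the newer value, while passing the older timestamp retrieves the historical version.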
  9. Schema-free: not what you want
     • HBase may not impose a schema, but your data still has one
     • It is up to the application to determine how to organize and interpret data
     • You still need to pick a serialization system
  10. Schemas = trade-offs
     • Different schemas enable efficient storage/retrieval/analysis of different types of data
     • Physical organization of data still makes a big difference
       – Especially with respect to read/write patterns
  11. WibiData workloads
     • Large number of fat rows (one per user)
     • Each row updated relatively few times per day
       – Though updates may involve large records
     • Raw data written to one set of columns
     • Processed results read from another
       – Often with an interactive latency requirement
     • Needs to support complex data types
  12. Serialization with Avro
     • Apache Avro provides flexible serialization
     • All data is written along with its “writer schema”
     • The reader schema may differ from the writer’s

     {
       "type": "record",
       "name": "LongList",
       "fields": [
         {"name": "value", "type": "long"},
         {"name": "next", "type": ["LongList", "null"]}
       ]
     }
  13. Serialization with Avro
     • No code generation required
     • Producers and consumers of data can migrate independently
     • Data format migrations do not require structural changes to the underlying data
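The independent-migration point above rests on Avro's schema resolution: a reader with a newer schema fills fields the writer never wrote from the reader schema's defaults, and silently drops fields the reader no longer declares. A hand-rolled toy resolver (not the real avro library; record contents and field names are invented for illustration) makes the rule concrete:

```python
def resolve(record, reader_schema):
    """Toy illustration of Avro-style schema resolution (not the real
    avro library): fields missing from the writer's data are filled
    from the reader schema's defaults; extra writer fields are dropped."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        else:
            # Reader evolved: a new field is backfilled from its default.
            out[name] = field["default"]
    return out

# Writer serialized with an old schema (no "email" field)...
old_record = {"id": 42, "name": "Ada", "legacy_flag": True}

# ...while the reader's newer schema added "email" and dropped "legacy_flag".
new_schema = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}
```

Because resolution happens at read time, the producer and consumer of the data never have to upgrade in lockstep.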
  14. WibiData: An extended data model

     <column>
       <name>email</name>
       <description>Email address</description>
       <schema>"string"</schema>
     </column>

     • Columns or whole families have common Avro schemas for evolvable storage and retrieval
  15. WibiData: An extended data model
     • Column families are a logical concept
     • Data is physically arranged in locality groups
     • Row ids are hashed for uniform write pressure
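Hashing row ids, as the last bullet describes, keeps sequential user ids from hot-spotting a single region server. A minimal sketch of the idea (the key format and function name are invented for illustration, not WibiData's actual layout) prefixes each key with a few hash bytes so writes spread uniformly while the original id stays recoverable:

```python
import hashlib

def hashed_row_id(user_id: str) -> bytes:
    """Illustrative sketch (not WibiData's actual key format): prefix
    the row key with a short hash of the user id so sequential ids
    scatter across the key space instead of piling onto one region."""
    prefix = hashlib.md5(user_id.encode()).digest()[:4]
    return prefix + b":" + user_id.encode()
```

The hash is deterministic, so the same user always maps to the same row, and the plain id is still readable after the separator.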
  16. WibiData: An extended data model
     • Wibi uses 3-D storage
     • Data is often sorted by timestamp
  17. Analyzing data: Producers
     • Producers create derived column values
     • The produce operator works on one row at a time
       – Can be run in MapReduce, or on a one-off basis
     • Produce is a row-mutation operator
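A producer, as described above, reads raw columns from one row and writes a derived column back to the same row. The sketch below is a hypothetical stand-in (the `Row` class, column names, and method signatures are invented for illustration, not WibiData's actual API), but it shows the per-row, row-mutating shape of the operator:

```python
class Row:
    """Minimal stand-in for one user-centric table row."""
    def __init__(self, row_id, cells=None):
        self.row_id = row_id
        self.cells = cells or {}          # (family, qualifier) -> value

    def get(self, family, qualifier, default=None):
        return self.cells.get((family, qualifier), default)

    def put(self, family, qualifier, value):
        self.cells[(family, qualifier)] = value

def produce(row):
    """Toy producer: read raw columns, write a derived column back to
    the SAME row -- a row mutation, runnable per-row inside MapReduce
    or on a one-off basis."""
    views = row.get("raw", "page_views", default=[])
    # Derive a simple "top pages" recommendation from the raw history.
    top = sorted(set(views), key=views.count, reverse=True)[:3]
    row.put("derived", "top_pages", top)
    return row
```

Because produce touches only one row, it parallelizes trivially across the table or can be invoked for a single user on demand.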
  18. Analyzing data: Gatherers
     • Gatherers aggregate data across all rows
     • Always run within MapReduce
     • A bridge between rows and (key, value) pairs
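The "bridge" role of a gatherer can be sketched as a function that turns one row into (key, value) pairs for the MapReduce shuffle, with an ordinary reducer summing per key downstream. The function names and row layout here are invented for illustration, not WibiData's actual interface:

```python
from collections import Counter

def gather(row):
    """Toy gatherer: emit one (category, 1) pair per click recorded in
    a user row -- the bridge from rows to (key, value) pairs."""
    for category in row.get("clicks", []):
        yield (category, 1)

def reduce_counts(pairs):
    """Downstream reduce phase: sum emitted values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Running gather over every row and reducing the combined stream yields a table-wide aggregate, which is exactly the cross-row computation producers cannot express.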
  19. Interactive access: REST API
     • A REST API provides interactive access via GET and PUT requests
     • Producers can be triggered “on demand” to create fresh recommendations
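One plausible shape for such an API is addressing a row (optionally narrowed to one column) by URL, with GET reading cells and PUT writing them. The URL layout below is purely illustrative; it is not WibiData's documented REST API:

```python
def rest_url(base, table, row_id, column=None):
    """Hypothetical URL layout (illustrative only, not WibiData's
    documented REST API): address a row, optionally narrowed to a
    single column such as 'derived:top_pages'."""
    url = f"{base}/tables/{table}/rows/{row_id}"
    if column:
        url += f"/columns/{column}"
    return url
```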
  20. Example: Ad targeting
  21. Example: Ad targeting producer
  22. Example: Ad targeting producer (continued)
  23. Gathering category associations
     • Gather observed behavior
  24. Gathering category associations
     • Associate interests, clicks
  25. Gathering category associations
     • … for all pairs
  26. Gathering category associations
     • And aggregate across all users (map phase, then reduce phase)
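The map/reduce described in slides 23–26 can be sketched end to end: the map side emits every (interest, clicked-category) pair observed in a user row, and the reduce side counts each pair across all users. The row layout and function names are invented for this sketch, not the deck's actual code:

```python
from collections import Counter
from itertools import product

def map_user(row):
    """Map phase sketch: for one user, emit every (interest, click)
    pair -- '... for all pairs', per the slides."""
    for pair in product(row["interests"], row["clicks"]):
        yield pair, 1

def reduce_pairs(users):
    """Reduce phase sketch: aggregate pair counts across all users."""
    totals = Counter()
    for row in users:
        for pair, count in map_user(row):
            totals[pair] += count
    return totals
```

Pairs that co-occur for many users accumulate high counts, which is the raw signal an ad-targeting producer can later turn into per-user recommendations.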
  27. Conclusions
     • Hadoop, HBase, and Avro form the core of a large-scale machine learning/analysis platform
     • How you set up your schema matters
     • The producer/gatherer programming model allows computations over tables to be expressed naturally, and works with MapReduce
  28. @wibidata
     Aaron Kimball – aaron@odiago.com