: Analyzing Large-Scale                User Data with Hadoop and HBase                 Aaron Kimball – CTO                ...
is…      • A large-scale storage, serving, and analysis        platform      • For user- or other entity-centric dataDevel...
use cases      • Product/content recommendations                – “Because you liked book X, you may like book Y”      • A...
Use case characteristics      • Have a large number of users      • Want to store (large) transaction data as well        ...
A typical workflowDeveloped By:
Challenges      • Support real-time retrieval of profile data      • Store a long transactional data history      • Keep r...
architecture                          Certified Technology productDeveloped By:
HBase data model      • Data in cells, addressed by four “coordinates”                – Row Id (primary key)              ...
Schema free: not what you want      • HBase may not impose a schema, but your        data still has one      • Up to the a...
Schemas = trade-offs      • Different schemas enable efficient        storage/retrieval/analysis of different types of    ...
WibiData workloads      • Large number of fat rows (one per user)      • Each row updated relatively few times/day        ...
Serialization with      • Apache Avro provides flexible serialization      • All data written along with its “writer schem...
Serialization with      • No code generation required      • Producers and consumers of data can migrate        independen...
WibiData: An extended data model                <column>                  <name>email</name>                  <description...
WibiData: An extended data model      • Column families are a logical concept      • Data is physically arranged in locali...
WibiData: An extended data model      • Wibi uses 3-d storage      • Data is often sorted by timestampDeveloped By:
Analyzing data: Producers      • Producers create derived column values      • Produce operator works on one row at a time...
Analyzing data: Gatherers   • Gatherers aggregate data across all rows   • Always run within MapReduce   • A bridge betwee...
Interactive access: REST API                PUT request                GET request      • REST API provides interactive ac...
Example: Ad TargetingDeveloped By:
Example: Ad Targeting                       ProducerDeveloped By:
Example: Ad Targeting                       ProducerDeveloped By:
Gathering Category Associations      • Gather observed behaviorDeveloped By:
Gathering Category Associations      • Associate interests, clicksDeveloped By:
Gathering Category Associations      • … for all pairsDeveloped By:
Gathering Category Associations      • And aggregate across all users                Map phase:        Reduce phase:Develo...
Conclusions      • Hadoop, HBase, Avro form the core of a large-        scale machine learning/analysis platform      • Ho...
www.wibidata.com / @wibidata                    Aaron Kimball – aaron@odiago.comDeveloped By:
Analyzing Large-Scale User Data with Hadoop and HBase
Upcoming SlideShare
Loading in...5
×

Analyzing Large-Scale User Data with Hadoop and HBase

3,138

Published on

A more technical presentation about WibiData presented to the Bay Area Search meetup 1/25/2012.

Published in: Technology, Education
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,138
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
44
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Analyzing Large-Scale User Data with Hadoop and HBase

  1. 1. : Analyzing Large-Scale User Data with Hadoop and HBase Aaron Kimball – CTO Odiago, Inc.Developed By:
  2. 2. is… • A large-scale storage, serving, and analysis platform • For user- or other entity-centric dataDeveloped By:
  3. 3. use cases • Product/content recommendations – “Because you liked book X, you may like book Y” • Ad targeting – “Because of your interest in sports, check out…” • Social network analysis – “You may know these people…” • Fraud detection, anti-spam, search personalization…Developed By:
  4. 4. Use case characteristics • Have a large number of users • Want to store (large) transaction data as well as derived data (e.g., recommendations) • Need to serve recommendations interactively • Require a combination of offline and on-the- fly computationDeveloped By:
  5. 5. A typical workflowDeveloped By:
  6. 6. Challenges • Support real-time retrieval of profile data • Store a long transactional data history • Keep related data logically and physically close • Update data in a timely fashion without wasting computation • Fault tolerance • Data schema changes over timeDeveloped By:
  7. 7. architecture Certified Technology productDeveloped By:
  8. 8. HBase data model • Data in cells, addressed by four “coordinates” – Row Id (primary key) – Column family – Column “qualifier” – TimestampDeveloped By:
  9. 9. Schema free: not what you want • HBase may not impose a schema, but your data still has one • Up to the application to determine how to organize & interpret data • You still need to pick a serialization systemDeveloped By:
  10. 10. Schemas = trade-offs • Different schemas enable efficient storage/retrieval/analysis of different types of data • Physical organization of data still makes a big difference – Especially with respect to read/write patternsDeveloped By:
  11. 11. WibiData workloads • Large number of fat rows (one per user) • Each row updated relatively few times/day – Though updates may involve large records • Raw data written to one set of columns • Processed results read from another – Often with an interactive latency requirement • Needs to support complex data typesDeveloped By:
  12. 12. Serialization with • Apache Avro provides flexible serialization • All data written along with its “writer schema” • Reader schema may differ from the writer’s { "type": "record", "name": "LongList", "fields" : [ {"name": "value", "type": "long"}, {"name": "next", "type": ["LongList", "null"]} ] }Developed By:
  13. 13. Serialization with • No code generation required • Producers and consumers of data can migrate independently • Data format migrations do not require structural changes to underlying dataDeveloped By:
  14. 14. WibiData: An extended data model <column> <name>email</name> <description>Email address</description> <schema>"string"</schema> </column> • Columns or whole families have common Avro schemas for evolvable storage and retrievalDeveloped By:
  15. 15. WibiData: An extended data model • Column families are a logical concept • Data is physically arranged in locality groups • Row ids are hashed for uniform write pressureDeveloped By:
  16. 16. WibiData: An extended data model • Wibi uses 3-d storage • Data is often sorted by timestampDeveloped By:
  17. 17. Analyzing data: Producers • Producers create derived column values • Produce operator works on one row at a time – Can be run in MapReduce, or on a one-off basis • Produce is a row mutation operatorDeveloped By:
  18. 18. Analyzing data: Gatherers • Gatherers aggregate data across all rows • Always run within MapReduce • A bridge between rows and (key, value) pairsDeveloped By:
  19. 19. Interactive access: REST API PUT request GET request • REST API provides interactive access • Producers can be triggered “on demand” to create fresh recommendationsDeveloped By:
  20. 20. Example: Ad TargetingDeveloped By:
  21. 21. Example: Ad Targeting ProducerDeveloped By:
  22. 22. Example: Ad Targeting ProducerDeveloped By:
  23. 23. Gathering Category Associations • Gather observed behaviorDeveloped By:
  24. 24. Gathering Category Associations • Associate interests, clicksDeveloped By:
  25. 25. Gathering Category Associations • … for all pairsDeveloped By:
  26. 26. Gathering Category Associations • And aggregate across all users Map phase: Reduce phase:Developed By:
  27. 27. Conclusions • Hadoop, HBase, Avro form the core of a large- scale machine learning/analysis platform • How you set up your schema matters • Producer/gatherer programming model allows computations over tables to be expressed naturally; works with MapReduceDeveloped By:
  28. 28. www.wibidata.com / @wibidata Aaron Kimball – aaron@odiago.comDeveloped By:
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×