
HBase Design Patterns @ Yahoo!


Speaker: Francis Liu (Yahoo!)

HBase's introduction into the Yahoo! Grid has given our users new ways to process and store data. A year after it became available, usage has been varied: event processing for personalization, incremental processing for ingestion, time-based aggregations for analytics, and more. All of these were made possible by features HBase brings beyond working with HDFS files. This talk will review some recurring HBase design patterns at Yahoo! and share our lessons learned and experiences.

Published in: Software, Technology, Education


  1. HBase Design Patterns @ Y! Presented by Francis Liu | toffer@apache.org | May 5, 2014
  2. Y! Grid ▪ Off-Stage Processing ▪ Hosted Service ▪ Multi-tenant
  3. Batch Processing (with HDFS) ▪ Append-only ▪ Efficient full table scans ▪ Process entire data set (or partitions)
  4. HBase ▪ Mutable ▪ Point Access ▪ Range scans ▪ Record-level processing ▪ 7 clusters, 1500 nodes, 6PB
  5. Entity Store: Motivation ▪ Integrate data from multiple data sources ▪ Store historical data ▪ Share data › Analytics › Machine Learning › Consume a data source
  6. Entity Store ▪ Records as Entities › Web pages › Celebrities › etc. ▪ Denormalized as a single table
  7. Entity Store: Content Store [diagram components: Ingestion Service, Sports Enrichment, News Enrichment, Serving, Bulk Ingest, Feed; columns shown: c:content, c:message, m:sports, m:news]
  8. Entity Store: Considerations ▪ Row vs multiple rows as an entity? › Row in most cases ▪ Blob vs Primitives as cell values? › Blobs are more compact › Primitives work better for granular updates › Out of the box filters work better with primitives › Use a compact binary format ▪ Prepare for Schema Changes › Provide a DAO library ▪ Incremental Scan › Batch id (via version) › Size cache for batch
  9. Event Processing: Motivation ▪ Process a stream of events › Ad Targeting › Personalization › etc. ▪ Low average age of a record/model/etc.
  10. Event Processing ▪ Entity Store ▪ Incremental computation › Persist incremental state ▪ Stream processing framework › e.g., Storm ▪ Fit working set in Block Cache [diagram components: Data Collector, Storm, HBase, Serving]
  11. Event Processing: Ad Targeting [diagram components: Data Collector, HDFS, MapReduce, Storm, HBase, Index; paths labeled Batch and Near realtime; stages labeled Collection, Processing, Serving]
  12. Event Processing: Considerations ▪ Limit large compactions ▪ Deferred log flush ▪ Avoid compaction storms ▪ Async Access › HBase work queue › AsyncHBase ▪ Blobs when possible ▪ Cache optimizations
  13. Phased Event Processing: Motivation ▪ Large/Complex event pipeline ▪ Modularization ▪ Dependency between pipelines
  14. Phased Event Processing ▪ Notifications › Separate Table › Separate Column Family [diagram components: Data Collector, Topology1, Topology2, Topology3, Notifications, Serving]
  15. Phased Event Processing: Personalization [diagram components: Data Collector, Fetcher, Ingestion, Enrichment, Notifications, MapReduce, HDFS, HBase, Serving; stages labeled Collection, Processing, Serving]
  16. Phased Event Processing: Considerations ▪ Notifications › Ordered › At least once ▪ Write to multiple regions ▪ Transactions
  17. Time Series DB: Motivation ▪ Track/Monitor changes over time › Application Metrics › User Analytics › System Metrics › etc. ▪ Alerts/Alarms › Thresholds › Changes over time
  18. Time Series DB: Personalization Data Quality [diagram components: Data Collector, Storm, HBase, Web UI, Serving]
  19. Time Series DB: Considerations ▪ Hot metrics › Namespace › Indexed tags ▪ Pre-compute aggregates if they are accessed often ▪ Consider using a block encoding scheme (PREFIX, FAST_DIFF, etc.) ▪ Consider pre-computed aggregates in a separate table ▪ Consider OpenTSDB
  20. HBaseCon 2014: Thank You! (We’re hiring)
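
Slide 8 lists "Incremental Scan › Batch id (via version)" without further detail. One plausible reading is that each ingestion batch writes its cells with the batch id as the cell version, so a later job can read only the cells written in a given range of batches. The sketch below models that idea with a plain in-memory dictionary rather than a real HBase client. The names `put`, `incremental_scan`, the row keys and the half-open range are all assumptions made for illustration, not details from the talk.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical in-memory stand-in for a table:
# row key -> column -> {version (used here as batch id): value}
Table = Dict[str, Dict[str, Dict[int, bytes]]]


def put(table: Table, row: str, column: str, batch_id: int, value: bytes) -> None:
    """Write one cell, storing the batch id as the cell version."""
    table.setdefault(row, {}).setdefault(column, {})[batch_id] = value


def incremental_scan(
    table: Table, min_batch: int, max_batch: int
) -> Iterator[Tuple[str, str, int, bytes]]:
    """Yield the newest version of each cell whose version is in [min_batch, max_batch)."""
    for row in sorted(table):  # rows are visited in key order, as a scan would
        for column in sorted(table[row]):
            versions = [v for v in table[row][column] if min_batch <= v < max_batch]
            if versions:
                newest = max(versions)
                yield row, column, newest, table[row][column][newest]


table: Table = {}
put(table, "page#a", "c:content", 1, b"v1")
put(table, "page#b", "c:content", 1, b"v1")
put(table, "page#a", "m:news", 2, b"enriched")
put(table, "page#b", "c:content", 3, b"v2")

# Only cells written by batches 2 and 3 are returned.
changed_since_batch_2 = list(incremental_scan(table, 2, 4))
```

With a real cluster, the same effect would presumably come from restricting a scan to a version (timestamp) range instead of filtering in the client.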
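
Slide 16 notes that notifications are ordered and delivered at least once, which means a downstream phase may see the same notification more than once. A common way to cope, shown here only as a minimal sketch, is to make the consumer idempotent by remembering the highest sequence number already handled per row. The notification shape `(sequence number, row key)` and the `IdempotentConsumer` class are hypothetical and do not come from the slides.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical notification shape: (sequence number, row key).
Notification = Tuple[int, str]


class IdempotentConsumer:
    """Skips redelivered notifications by tracking a per-row high-water mark."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._last_seq: Dict[str, int] = {}

    def deliver(self, notification: Notification) -> bool:
        """Process the notification once; return False if it was a duplicate."""
        seq, row = notification
        if seq <= self._last_seq.get(row, -1):
            return False  # already handled: at-least-once redelivery
        self._handler(row)
        self._last_seq[row] = seq  # advance the mark only after handling succeeds
        return True


processed: List[str] = []
consumer = IdempotentConsumer(processed.append)

deliveries = [(1, "a"), (2, "b"), (2, "b"), (3, "a"), (1, "a")]
results = [consumer.deliver(n) for n in deliveries]
```

The per-row mark only works because delivery is ordered. In a real pipeline the mark would have to be persisted alongside the processed state, which is where the slide's "Transactions" bullet becomes relevant.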
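
Slide 19 suggests pre-computing aggregates and keeping them in a separate table. The sketch below illustrates one way this could look: a row key built from a metric id and an hour-aligned timestamp, plus a rollup that reduces raw points to (count, sum, min, max) per key. The key layout, the one-hour bucket and the function names are assumptions for illustration. The layout is loosely inspired by OpenTSDB-style keys but is not OpenTSDB's actual schema.

```python
import struct
from typing import Dict, Iterable, Tuple

BUCKET_SECONDS = 3600  # assumed bucket width: one hour

Aggregate = Tuple[int, float, float, float]  # (count, sum, min, max)


def row_key(metric_id: int, timestamp: int) -> bytes:
    """Build a 12-byte key: 4-byte metric id, then 8-byte bucket start time.

    Big-endian unsigned fields make byte order match numeric order, so all
    buckets of one metric are contiguous and sorted by time.
    """
    base = timestamp - (timestamp % BUCKET_SECONDS)
    return struct.pack(">IQ", metric_id, base)


def rollup(points: Iterable[Tuple[int, int, float]]) -> Dict[bytes, Aggregate]:
    """Reduce (metric_id, timestamp, value) points to one aggregate per row key."""
    aggregates: Dict[bytes, Aggregate] = {}
    for metric_id, timestamp, value in points:
        key = row_key(metric_id, timestamp)
        if key in aggregates:
            count, total, low, high = aggregates[key]
            aggregates[key] = (count + 1, total + value, min(low, value), max(high, value))
        else:
            aggregates[key] = (1, value, value, value)
    return aggregates


points = [(7, 3600, 1.0), (7, 3660, 3.0), (7, 7200, 5.0), (2, 3700, 2.0)]
hourly = rollup(points)
ordered_keys = sorted(hourly)  # the order a range scan would return the rows in
```

Storing count and sum rather than a mean keeps the aggregate mergeable, so hourly rows could later be combined into daily ones without going back to the raw points.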
