HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon

4,078 views

Published on

Presented by: Ameya Kantikar, Groupon

Published in: Technology
0 Comments
17 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,078
On SlideShare
0
From Embeds
0
Number of Embeds
101
Actions
Shares
0
Downloads
0
Comments
0
Likes
17
Embeds 0
No embeds

No notes for slide
  • Split this into 2-3 slides
  • Split this into 2-3 slides
  • Split this into 2-3 slides
  • Will be used as a back up slide. If there are specific questions regarding performance tuning.
  • HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon

    1. 1. HBase @ Groupon Deal Personalization Systems with HBase Ameya Kanitkar ameya@groupon.com
    2. 2. Relevance & Personalization Systems @ Groupon
    3. 3. Relevance & Personalization
    4. 4. Scaling: Keeping Up With a Changing Business ChicagoPittsburgh San Francisco 200+ 400+ 2000+ Growing Deals Growing Users • 200 Million+ subscribers • We need to store data like, user click history, email records, service logs etc. This tunes to billions of data points and TB’s of data
    5. 5. Deal Personalization Infrastructure Use Cases Deliver Personalized Emails Deliver Personalized Website & Mobile Experience Offline System Online System Email Personalize billions of emails for hundreds of millions of users Personalize one of the most popular e-commerce mobile & web app for hundreds of millions of users & page views
    6. 6. Earlier System Offline Personalization Map/Reduce Data Pipeline (User Logs, Email Records, User History etc) Online Deal Personalization API MySQL Store Email
    7. 7. Earlier System Email Offline Personalization Map/Reduce Data Pipeline Online Deal Personalization API MySQL Store • Scaling MySQL for data such as user click history, email records was painful (to say the least) • Need to maintain two separate data pipelines for essentially the same data.
    8. 8. Email Offline Personalization Map/Reduce Data Pipeline Online Deal Personalization API Ideal Data Store • Common data store that serves data to both online and offline systems • Data store that scales to hundreds of millions of records • Data store that plays well with our existing hadoop based systems • Data store that supports get() put() access patterns based on a key (User ID). Ideal System
    9. 9. Architecture Options HBase System Online Personalization Email Offline Personalization Map/Reduce Data Pipeline
    10. 10. Architecture Options HBase System Online Personalization Email Offline Personalization Map/Reduce Data Pipeline • Simple design • Consolidated system that serves both online and offline personalization Pros
    11. 11. Architecture Options HBase System Online Personalization Email Offline Personalization Map/Reduce Data Pipeline • We now have same uptime SLA on both offline and online system • Maintaining online latency SLA for bulk writes and bulk reads is hard. Cons And here is why…
    12. 12. Architecture HBase Offline System HBase for Online System Real Time Relevance Email Relevance Map/Reduce Replication Data Pipeline • We can now maintain different SLA on online and offline systems • We can tune HBase cluster differently for online and offline systems
    13. 13. HBase Schema Design User ID Column Family 1 Column Family 2 Unique Identifier for Users User History and Profile Information Email History For Users • Most of our data access patterns are via “User Key” • This makes it easy to design HBase schema • The actual data is kept in JSON Overwrite user history and profile info Append email history for each day as a separate columns. (On avg each row has over 200 columns)
    14. 14. Cluster Sizing Hadoop + HBase Cluster 100+ machine Hadoop cluster, this runs heavy map reduce jobs The same cluster also hosts 15 node HBase cluster Online HBase Cluster HBase Replication 10 Machine dedicated HBase cluster to serve real time SLA • Machine Profile • 96 GB RAM (HBase 25 GB) • 24 Virtual Cores CPU • 8 2TB Disks • Data Profile • 100 Million+ Records • 2TB+ Data • Over 4.2 Billion Data Points
    15. 15. Summary SOLVED Data store that works for online and offline use cases
    16. 16. ameya@groupon.com www.groupon.com/techjobs Questions? Thanks!
    17. 17. HBase Performance Tuning • Pre Split Tables • Increase Lease Timeouts • Increase Scanner Timeouts • Increase region sizes (Ours is 10 GB) • Keep as few regions per region server as possible (We have around 15-35) •For heavy write jobs: • hbase.hregion.memstore.block.multiplier: 4 • hbase.hregion.memstore.flush.size: 134217728 • hbase.hstore.blockingStoreFiles: 100 • hbase.zookeeper.useMulti: false (This was necessary for us, we are on 0.94.2) for stable replication It took some performance tuning to get stability out of Hbase

    ×