HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.


Transcript

  • 1. UNIQUE SETS WITH HBASE - Elliott Clark
  • 2. Problem Summary: The business needs to know how many unique people have done some action.
  • 3. Problem Specifics
    - Need to create a lot of different counts of unique users
      - 100 different counters per day per game (could be per website, or any other group)
      - 1000 different games
    - Some counters require knowledge of old data
      - Count of unique users who joined today
      - Count of unique users who have ever paid
  • 4. 1st Try – Bit Set per Hour
    - Row key is the game and the hour
    - Column qualifiers are the counter names
    - Column values are 1.5 MB bit sets
    - Each hour a new bit set is created for every counter
    - Compute a day’s counter by OR’ing the hourly bit sets and correcting the count of high bits for the probability of a collision
  • 5. 1st Try – Example
    Row: Game1 2012-01-01 0100
      D:DAU          → NUM_IN_SET: 1.5M, bits: 010010001101100100…
      D:new_uniques  → NUM_IN_SET: 0.9M, bits: 1100110100111010…
  • 6. 1st Try – Pluses & Minuses
    + Allows accuracy to drive size
    - Requires a full table scan of all bit sets
    - A lot of data generated
    - Huge number of regions
    - Not 100% accurate
    - Very hard to debug
  • 7. 2nd Try – Bit Sets per User
    - Row key is the user’s ID reversed
      - Reverse the ID to stop hot spotting of regions
    - Column qualifiers are a compound key of game and counter name
    - Column values are a start date-hour and a bit set
      - Each position in the bit set refers to a subsequent hour after the start time
      - 1 means the user performed that action
      - 0 means the user did not perform that action
  • 8. 2nd Try – Example
    Row: Game1 2012-01-01 0100
      D:game1_active      → Start Date: 2012-01-01 0500, bits: 010010001101100100…
      D:game2_paid_money  → Start Date: 2012-01-01 0500, bits: 00000000000000000100….
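The per-user column value from slides 7–8 can be sketched as a (start hour, bit set) pair, where bit i marks activity in the i-th hour after the start. This is a hypothetical reconstruction (names and layout are assumptions, not the original implementation):

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

def set_active(value, event_time):
    """Mark the hour of event_time in a (start_hour, bitset) column value.
    The bit set grows as needed, which is why the slides note that very
    active users can make rows grow without bound."""
    start, bits = value
    idx = int((event_time - start) / HOUR)  # hours since the start hour
    byte, bit = divmod(idx, 8)
    if byte >= len(bits):
        bits.extend(b"\x00" * (byte - len(bits) + 1))
    bits[byte] |= 1 << bit
    return start, bits

def reversed_user_key(user_id: str) -> str:
    """Reverse the user ID so sequential IDs spread across regions,
    avoiding the hot spotting mentioned on slide 7."""
    return user_id[::-1]
```

Counting uniques then means scanning every user row and testing the relevant bit range, which is the full-table-scan cost called out on the next slide.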
  • 9. 2nd Try – Pluses & Minuses
    + Easier to debug
    + Size grows with the number of users, not with the accuracy required
    - Requires a full table scan of all users
    - Scales with the number of users ever seen, not the number active on a given day
    - Very active users can make rows grow without bound
    - Very hard to undo any mistakes; dirty data is very hard to correct
  • 10. 3rd Try – Multi-Pass Group
    - Group all log data for a day by user ID
    - Join log data with historic data in HBase by doing a GET on the user’s row
    - Compute new information about the user
    - Emit new data about the user and +1s for every action the user did from the log data
  • 11. 3rd Try – Data Flow
    [Diagram: multiple log data streams are grouped and joined with HBase user data, producing recomputed user data and "Count: +1" emissions]
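The group/join/emit pipeline from slide 10 can be sketched single-machine, with a plain dict standing in for the HBase GET and write-back (in practice this runs as a Hadoop job; everything here is an illustrative assumption, not the actual code):

```python
from collections import defaultdict

def count_uniques(log_events, historic):
    """Slide 10's multi-pass approach in miniature.
    log_events: iterable of (user_id, action) pairs for one day.
    historic: dict of user_id -> set of actions ever seen (stands in
    for the per-user HBase row fetched with a GET)."""
    # Pass 1: group the day's log data by user ID.
    by_user = defaultdict(set)
    for user_id, action in log_events:
        by_user[user_id].add(action)

    # Pass 2: join with historic data, emit +1s, recompute user state.
    counters = defaultdict(int)
    for user_id, actions in by_user.items():
        seen = historic.get(user_id, set())
        for action in actions:
            counters[action + "_dau"] += 1       # unique user active today
            if action not in seen:
                counters[action + "_new"] += 1   # never seen doing this before
        historic[user_id] = seen | actions       # write back recomputed state
    return counters
```

Because the shuffle groups only the users active that day, the cost scales with active users rather than all users ever seen, which is the key plus on slide 12.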
  • 12. 3rd Try – Pluses & Minuses
    + Easy to debug
    + Scales with the number of users that are active
    + Allows for a more holistic view of the users
    - Requires a large amount of data to be shuffled and sorted
  • 13. Conclusions
    - Try to get the best upper bound on runtime
    - More and more flexibility will be required as time goes on
    - Store more data now; when new features are requested, development will be easier
    - Choose a good serialization framework and stick with it
    - Always clean your data before inserting
  • 14. Questions?
