UNIQUE SETS WITH
HBASE
Elliott Clark
Problem Summary



The business needs to know how many unique
        people have done some action
Problem Specifics
   Need to create a lot of different counts of
    unique users
     100 different counters per day per game (could
      be per website, or any other group)
     1000 different games

   Some counters require knowledge of old data
     Count of unique users who joined today
     Count of unique users who have ever paid
1st Try – Bit Set per Hour
   Row Key is the game and the hour
   Column qualifiers are the counter names
   Column values are 1.5Mb bit sets
   Each hour a new bloom filter is created for
    every counter
   Compute a day’s counter by OR’ing the hourly bits
     and correcting the count of high bits for the
     probability of collisions
1st Try – Example Row



                        D:DAU                  D:new_uniques

Game1 2012-01-01 0100   NUM_IN_SET: 1.5M       NUM_IN_SET: 0.9M
                        010010001101100100….   1100110100111010….
1st Try – Pluses & Minuses
   Allows accuracy to drive size
   Requires a full table scan of all bit sets
   A lot of data generated
   Huge number of regions
   Not 100% accurate
   Very hard to debug
2nd Try – Bit Sets per User
   Row Key is the user’s ID reversed
     Reverse the ID to stop hot-spotting of regions
   Column qualifiers are a compound key of
    game and counter name
   Column values are a start date-hour and a bit
    set
     Each position in a bit set refers to a subsequent
      hour after the start time
     1 means the user performed that action
     0 means the user did not perform that action
2nd Try – Example Row



                        D:game1_active                D:game2_paid_money
                        Start Date: 2012-01-01 0500   Start Date: 2012-01-01 0500
[reversed user ID]      010010001101100100….          00000000000000000100….
2nd Try – Pluses & Minuses
   Easier to debug
   Size grows with the number of users, not with
     the accuracy required
   Requires a full table scan of all users
   Scales with the number of users ever seen, not
     the number of users active on a given day
   Very active users can make rows grow without
     bound
   Very hard to undo any mistakes; dirty data is
     very hard to correct
3rd Try – Multi Pass Group
   Group all log data for a day by user ID
   Join the log data with historic data in HBase by
     doing a GET on the user’s row
   Compute new information about the user
   Emit new data about the user and +1s for
    every action that the user did from the log data
3rd Try – Data Flow
[Diagram: Log Data (×3) grouped and joined with HBase User Data, emitting Count: +1 records and Recomputed User Data]
3rd Try – Pluses & Minuses
   Easy to debug
   Scales with number of users that are active
   Allows for a more holistic view of the users
   Requires a large amount of data to be shuffled
    and sorted
Conclusions
   Try to get the best upper bound on runtime
   More and more flexibility will be required as
    time goes on
   Store more data now; when new features are
     requested, development will be easier
   Choose a good serialization framework and
    stick with it
   Always clean your data before inserting
Questions?

HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
