UNIQUE SETS WITH
HBASE
Elliott Clark
Problem Summary



The business needs to know how many unique
        people have done some action
Problem Specifics
   Need to create a lot of different counts of
    unique users
     100 different counters per day per game (could
      be per website, or any other group)
     1000 different games

   Some counters require knowledge of old data
     Count of unique users who joined today
     Count of unique users who have ever paid
1st Try – Bit Set per Hour
   Row Key is the game and the hour
   Column qualifiers are the counter names
   Column values are 1.5Mb bit sets
   Each hour a new bloom filter is created for
    every counter
   Compute a day’s counter by OR’ing the hourly bits
     and correcting the count of high bits for the
     probability of collisions
1st Try – Example Row



                        D:DAU                  D:new_uniques

Game1 2012-01-01 0100   NUM_IN_SET: 1.5M       NUM_IN_SET: 0.9M
                        010010001101100100….   1100110100111010….
1st Try – Pluses & Minuses
   Allows accuracy to drive size
   Requires a full table scan of all bit sets
   A lot of data generated
   Huge number of regions
   Not 100% accurate
   Very hard to debug
2nd Try – Bit Sets per User
   Row Key is the user’s ID reversed
     Reverse the ID to stop hot-spotting of regions
   Column qualifiers are a compound key of
    game and counter name
   Column values are a start date-hour and a bit
    set
     Each position in a bit set refers to a subsequent
      hour after the start time
     1 means the user performed that action
     0 means the user did not perform that action
2nd Try – Example Row



                        D:game1_active                D:game2_paid_money
                        Start Date: 2012-01-01 0500   Start Date: 2012-01-01 0500
[reversed user ID]      010010001101100100….          00000000000000000100….
2nd Try – Pluses & Minuses
   Easier to debug
   Size grows with the number of users, not with
     the accuracy required
   Requires a full table scan of all users
   Scales with the number of users ever seen, not
     the number of users active on a given day
   Very active users can make rows grow without
     bound
   Very hard to undo any mistakes; dirty data is
     very hard to correct
3rd Try – Multi Pass Group
   Group all log data for a day by user ID
   Join the log data with historic data in HBase by
     doing a GET on the user’s row
   Compute new information about the user
   Emit new data about the user and +1s for
    every action that the user did from the log data
3rd Try – Data Flow
[Diagram: Log Data (×3) grouped and joined with HBase User Data, emitting Count: +1 records and Recomputed User Data]
3rd Try – Pluses & Minuses
   Easy to debug
   Scales with number of users that are active
   Allows for a more holistic view of the users
   Requires a large amount of data to be shuffled
    and sorted
Conclusions
   Try to get the best upper bound on runtime
   More and more flexibility will be required as
    time goes on
   Store more data now; when new features are
     requested, development will be easier
   Choose a good serialization framework and
    stick with it
   Always clean your data before inserting
Questions?

HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
