HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon

Cloudera, Inc.
Cloudera, Inc.Cloudera, Inc.
UNIQUE SETS WITH
HBASE
Elliott Clark
Problem Summary



The business needs to know how many unique
        people have done some action
Problem Specifics
   Need to create a lot of different counts of
    unique users
     100 different counters per day per game (could
      be per website, or any other group)
     1000 different games

   Some counters require knowledge of old data
     Count of unique users who joined today
     Count of unique users who have ever paid
1st Try – Bit Set per Hour
   Row Key is the game and the hour
   Column qualifiers are the counter names
   Column values are 1.5Mb bit sets
   Each hour a new bloom filter is created for
    every counter
   Compute a day’s counter by OR’ing the bits
    and dividing the count of high bits by
    probability of a collision
1st Try – Example Row



                        D:DAU                  D:new_uniques

Game1 2012-01-01 0100   NUM_IN_SET: 1.5M       NUM_IN_SET: 0.9M
                        010010001101100100….   1100110100111010….
1st Try – Pluses & Minuses
   Allows accuracy to drive size
   Requires a full table scan of all bit sets
   A lot of data generated
   Huge number of regions
   Not 100% accurate
   Very hard to debug
2nd Try – Bit Sets per User
   Row Key is the user’s ID reversed
     Reverse   the ID to stop hot spotting of regions
   Column qualifiers are a compound key of
    game and counter name
   Column values are a start date-hour and a bit
    set
     Each position in a bit set refers to a subsequent
      hour after the start time
     1 means the user performed that action
     0 means the user did not perform that action
2nd Try – Example Row



                        D:game1_active                D:game2_paid_money
                        Start Date: 2012-01-01 0500   Start Date: 2012-01-01 0500
Game1 2012-01-01 0100   010010001101100100….          00000000000000000100….
2nd Try – Pluses & Minuses
   Easier to debug
   Size grows with the number of users not with
    the accuracy required
   Requires a full table scan of all users
   Scales with number of users ever seen; not
    number of users active on a given day
   Very active users can make rows grow without
    bound
   Very hard to un-do any mistakes. Dirty data is
    very hard to correct.
3rd Try – Multi Pass Group
   Group all log data for a day by user ID
   Join log data with historic data in Hbase, by
    doing a GET on the user’s row
   Compute new information about the user
   Emit new data about the user and +1s for
    every action that the user did from the log data
3rd Try – Data Flow


                             Count: +1


Log Data
Log Data
Log Data
                                 Recomputed User
                                 Data
           Hbase User Data
3rd Try – Pluses & Minuses
   Easy to debug
   Scales with number of users that are active
   Allows for a more holistic view of the users
   Requires a large amount of data to be shuffled
    and sorted
Conclusions
   Try to get the best upper bound on runtime
   More and more flexibility will be required as
    time goes on
   Store more data now, and when new features
    are requested development will be easier
   Choose a good serialization framework and
    stick with it
   Always clean your data before inserting
Questions?
1 of 14

Recommended

HBaseCon 2015: Running ML Infrastructure on HBase by
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon
4K views25 slides
A High Performance Mutable Engagement Activity Delta Lake by
A High Performance Mutable Engagement Activity Delta LakeA High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta LakeDatabricks
177 views23 slides
Real Time Machine Learning Visualization With Spark by
Real Time Machine Learning Visualization With SparkReal Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With SparkChester Chen
1.2K views40 slides
Hadoop summit 2010, HONU by
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONUJerome Boulon
1.1K views22 slides
Quark Virtualization Engine for Analytics by
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics DataWorks Summit/Hadoop Summit
942 views21 slides
Explore big data at speed of thought with Spark 2.0 and Snappydata by
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
1.4K views48 slides

More Related Content

What's hot

Software architecture for data applications by
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
192 views30 slides
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase by
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
1.1K views38 slides
The Future of Real-Time in Spark by
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in SparkDatabricks
4.9K views30 slides
Getting Started with Real-time Analytics by
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
4K views42 slides
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013 by
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013Amazon Web Services
30.5K views48 slides
Google Dremel. Concept and Implementations. by
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Vicente Orjales
7.7K views19 slides

What's hot(20)

Software architecture for data applications by Ding Li
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
Ding Li192 views
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase by Michael Stack
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
Michael Stack1.1K views
The Future of Real-Time in Spark by Databricks
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
Databricks4.9K views
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013 by Amazon Web Services
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
Amazon Web Services30.5K views
Google Dremel. Concept and Implementations. by Vicente Orjales
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
Vicente Orjales7.7K views
Amazon Redshift by Jeff Patti
Amazon RedshiftAmazon Redshift
Amazon Redshift
Jeff Patti1.3K views
Data Modeling Basics for the Cloud with DataStax by DataStax
Data Modeling Basics for the Cloud with DataStaxData Modeling Basics for the Cloud with DataStax
Data Modeling Basics for the Cloud with DataStax
DataStax557 views
Improving Organizational Knowledge with Natural Language Processing Enriched ... by DataWorks Summit
Improving Organizational Knowledge with Natural Language Processing Enriched ...Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...
DataWorks Summit363 views
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J... by Dataconomy Media
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
Dataconomy Media352 views
Practical data science by Ding Li
Practical data sciencePractical data science
Practical data science
Ding Li142 views
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ... by Data Con LA
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
Data Con LA596 views
Delta from a Data Engineer's Perspective by Databricks
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks1.1K views

Viewers also liked

HBaseCon 2013: 1500 JIRAs in 20 Minutes by
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesCloudera, Inc.
4.1K views112 slides
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC by
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
3.9K views12 slides
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo! by
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!Cloudera, Inc.
3.2K views24 slides
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase by
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon
3.3K views20 slides
HBase Read High Availability Using Timeline-Consistent Region Replicas by
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
4.1K views38 slides
HBaseCon 2013: Rebuilding for Scale on Apache HBase by
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseCloudera, Inc.
3.9K views17 slides

Viewers also liked(20)

HBaseCon 2013: 1500 JIRAs in 20 Minutes by Cloudera, Inc.
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
Cloudera, Inc.4.1K views
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC by Cloudera, Inc.
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Cloudera, Inc.3.9K views
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo! by Cloudera, Inc.
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
Cloudera, Inc.3.2K views
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase by HBaseCon
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon3.3K views
HBase Read High Availability Using Timeline-Consistent Region Replicas by HBaseCon
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBaseCon4.1K views
HBaseCon 2013: Rebuilding for Scale on Apache HBase by Cloudera, Inc.
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBase
Cloudera, Inc.3.9K views
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb... by Cloudera, Inc.
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
Cloudera, Inc.3.2K views
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,... by Cloudera, Inc.
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
Cloudera, Inc.3.8K views
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics by Cloudera, Inc.
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
Cloudera, Inc.4.8K views
HBaseCon 2013: Being Smarter Than the Smart Meter by Cloudera, Inc.
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
Cloudera, Inc.4.3K views
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data... by Cloudera, Inc.
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
Cloudera, Inc.3.5K views
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase by Cloudera, Inc.
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.3.2K views
Cross-Site BigTable using HBase by HBaseCon
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
HBaseCon3.5K views
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN by HBaseCon
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon2.9K views
HBaseCon 2012 | Building Mobile Infrastructure with HBase by Cloudera, Inc.
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
Cloudera, Inc.2.6K views
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second... by Cloudera, Inc.
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
Cloudera, Inc.4.2K views
HBaseCon 2013: Apache HBase on Flash by Cloudera, Inc.
HBaseCon 2013: Apache HBase on FlashHBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on Flash
Cloudera, Inc.4.3K views
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase. by Cloudera, Inc.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
Cloudera, Inc.7.1K views
HBaseCon 2012 | Scaling GIS In Three Acts by Cloudera, Inc.
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three Acts
Cloudera, Inc.3.6K views
Tales from the Cloudera Field by HBaseCon
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
HBaseCon4K views

Similar to HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon

Towards a Practice of Token Engineering by
Towards a Practice of Token EngineeringTowards a Practice of Token Engineering
Towards a Practice of Token EngineeringTrent McConaghy
8.7K views55 slides
Game analytics - The challenges of mobile free-to-play games by
Game analytics - The challenges of mobile free-to-play gamesGame analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play gamesChristian Beckers
1.6K views30 slides
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift by
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon RedshiftAmazon Web Services
5.2K views63 slides
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB by
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAmazon Web Services
7.7K views81 slides
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics by
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAmazon Web Services
5.5K views48 slides
Monomi: Practical Analytical Query Processing over Encrypted Data by
Monomi: Practical Analytical Query Processing over Encrypted DataMonomi: Practical Analytical Query Processing over Encrypted Data
Monomi: Practical Analytical Query Processing over Encrypted DataShurenBi1
7 views25 slides

Similar to HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon(20)

Towards a Practice of Token Engineering by Trent McConaghy
Towards a Practice of Token EngineeringTowards a Practice of Token Engineering
Towards a Practice of Token Engineering
Trent McConaghy8.7K views
Game analytics - The challenges of mobile free-to-play games by Christian Beckers
Game analytics - The challenges of mobile free-to-play gamesGame analytics - The challenges of mobile free-to-play games
Game analytics - The challenges of mobile free-to-play games
Christian Beckers1.6K views
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift by Amazon Web Services
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
Amazon Web Services5.2K views
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB by Amazon Web Services
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
Amazon Web Services7.7K views
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics by Amazon Web Services
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
Amazon Web Services5.5K views
Monomi: Practical Analytical Query Processing over Encrypted Data by ShurenBi1
Monomi: Practical Analytical Query Processing over Encrypted DataMonomi: Practical Analytical Query Processing over Encrypted Data
Monomi: Practical Analytical Query Processing over Encrypted Data
ShurenBi17 views
20.project inventry management system by Lapi Mics
20.project inventry management system20.project inventry management system
20.project inventry management system
Lapi Mics2.7K views
Big data use cases in the cloud presentation by TUSHAR GARG
Big data use cases in the cloud presentationBig data use cases in the cloud presentation
Big data use cases in the cloud presentation
TUSHAR GARG283 views
Forecasting database performance by Shenglin Du
Forecasting database performanceForecasting database performance
Forecasting database performance
Shenglin Du1K views
Admin Tech Clash: Discussing Best (and Worst) Administration Practices from ... by Christoph Adler
Admin Tech Clash: Discussing Best (and Worst) Administration Practices from  ...Admin Tech Clash: Discussing Best (and Worst) Administration Practices from  ...
Admin Tech Clash: Discussing Best (and Worst) Administration Practices from ...
Christoph Adler226 views
Sqlmaterial 120414024230-phpapp01 by Lalit009kumar
Sqlmaterial 120414024230-phpapp01Sqlmaterial 120414024230-phpapp01
Sqlmaterial 120414024230-phpapp01
Lalit009kumar239 views
How to not fail at security data analytics (by CxOSidekick) by Dinis Cruz
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
Dinis Cruz329 views
Admin Tech Clash: Discussing Best (and Worst) Administration Practices from ... by Christoph Adler
Admin Tech Clash: Discussing Best (and Worst) Administration Practices from  ...Admin Tech Clash: Discussing Best (and Worst) Administration Practices from  ...
Admin Tech Clash: Discussing Best (and Worst) Administration Practices from ...
Christoph Adler243 views
Azure Stream Analytics : Analyse Data in Motion by Ruhani Arora
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
Ruhani Arora694 views
#TwitterRealTime - Real time processing @twitter by Twitter Developers
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx by
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
109 views55 slides
Cloudera Data Impact Awards 2021 - Finalists by
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
6.5K views34 slides
2020 Cloudera Data Impact Awards Finalists by
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
6.3K views43 slides
Edc event vienna presentation 1 oct 2019 by
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
4.5K views67 slides
Machine Learning with Limited Labeled Data 4/3/19 by
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
3.6K views36 slides
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 by
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
2.5K views21 slides

More from Cloudera, Inc.(20)

Partner Briefing_January 25 (FINAL).pptx by Cloudera, Inc.
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.109 views
Cloudera Data Impact Awards 2021 - Finalists by Cloudera, Inc.
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.6.5K views
2020 Cloudera Data Impact Awards Finalists by Cloudera, Inc.
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.6.3K views
Edc event vienna presentation 1 oct 2019 by Cloudera, Inc.
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.4.5K views
Machine Learning with Limited Labeled Data 4/3/19 by Cloudera, Inc.
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.3.6K views
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 by Cloudera, Inc.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.2.5K views
Introducing Cloudera DataFlow (CDF) 2.13.19 by Cloudera, Inc.
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.4.9K views
Introducing Cloudera Data Science Workbench for HDP 2.12.19 by Cloudera, Inc.
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.2.7K views
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 by Cloudera, Inc.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.1.6K views
Leveraging the cloud for analytics and machine learning 1.29.19 by Cloudera, Inc.
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.1.6K views
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 by Cloudera, Inc.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.2.5K views
Leveraging the Cloud for Big Data Analytics 12.11.18 by Cloudera, Inc.
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.1.7K views
Modern Data Warehouse Fundamentals Part 3 by Cloudera, Inc.
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.1.3K views
Modern Data Warehouse Fundamentals Part 2 by Cloudera, Inc.
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.2.3K views
Modern Data Warehouse Fundamentals Part 1 by Cloudera, Inc.
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.1.5K views
Extending Cloudera SDX beyond the Platform by Cloudera, Inc.
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.966 views
Federated Learning: ML with Privacy on the Edge 11.15.18 by Cloudera, Inc.
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.2.2K views
Analyst Webinar: Doing a 180 on Customer 360 by Cloudera, Inc.
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.1.4K views
Build a modern platform for anti-money laundering 9.19.18 by Cloudera, Inc.
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.1K views
Introducing the data science sandbox as a service 8.30.18 by Cloudera, Inc.
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.1.2K views

Recently uploaded

The Power of Heat Decarbonisation Plans in the Built Environment by
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built EnvironmentIES VE
69 views20 slides
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TShapeBlue
112 views34 slides
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...ShapeBlue
120 views13 slides
Qualifying SaaS, IaaS.pptx by
Qualifying SaaS, IaaS.pptxQualifying SaaS, IaaS.pptx
Qualifying SaaS, IaaS.pptxSachin Bhandari
897 views8 slides
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...ShapeBlue
88 views13 slides
DRBD Deep Dive - Philipp Reisner - LINBIT by
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBITShapeBlue
140 views21 slides

Recently uploaded(20)

The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE69 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue112 views
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue120 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue88 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue140 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue117 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson156 views
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue by ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue176 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue94 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue93 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue154 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue79 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash153 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue146 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue163 views
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue101 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue181 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue253 views

HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon

  • 2. Problem Summary The business needs to know how many unique people have done some action
  • 3. Problem Specifics  Need to create a lot of different counts of unique users  100 different counters per day per game (could be per website, or any other group)  1000 different games  Some counters require knowledge of old data  Count of unique users who joined today  Count of unique users who have ever paid
  • 4. 1st Try – Bit Set per Hour  Row Key is the game and the hour  Column qualifiers are the counter names  Column values are 1.5Mb bit sets  Each hour a new bloom filter is created for every counter  Compute a day’s counter by OR’ing the bits and dividing the count of high bits by probability of a collision
  • 5. 1st Try – Example Row D:DAU D:new_uniques Game1 2012-01-01 0100 NUM_IN_SET: 1.5M NUM_IN_SET: 0.9M 010010001101100100…. 1100110100111010….
  • 6. 1st Try – Pluses & Minuses  Allows accuracy to drive size  Requires a full table scan of all bit sets  A lot of data generated  Huge number of regions  Not 100% accurate  Very hard to debug
  • 7. 2nd Try – Bit Sets per User  Row Key is the user’s ID reversed  Reverse the ID to stop hot spotting of regions  Column qualifiers are a compound key of game and counter name  Column values are a start date-hour and a bit set  Each position in a bit set refers to a subsequent hour after the start time  1 means the user performed that action  0 means the user did not perform that action
  • 8. 2nd Try – Example Row D:game1_active D:game2_paid_money Start Date: 2012-01-01 0500 Start Date: 2012-01-01 0500 Game1 2012-01-01 0100 010010001101100100…. 00000000000000000100….
  • 9. 2nd Try – Pluses & Minuses  Easier to debug  Size grows with the number of users not with the accuracy required  Requires a full table scan of all users  Scales with number of users ever seen; not number of users active on a given day  Very active users can make rows grow without bound  Very hard to un-do any mistakes. Dirty data is very hard to correct.
  • 10. 3rd Try – Multi Pass Group  Group all log data for a day by user ID  Join log data with historic data in Hbase, by doing a GET on the user’s row  Compute new information about the user  Emit new data about the user and +1s for every action that the user did from the log data
  • 11. 3rd Try – Data Flow Count: +1 Log Data Log Data Log Data Recomputed User Data Hbase User Data
  • 12. 3rd Try – Pluses & Minuses  Easy to debug  Scales with number of users that are active  Allows for a more holistic view of the users  Requires a large amount of data to be shuffled and sorted
  • 13. Conclusions  Try to get the best upper bound on runtime  More and more flexibility will be required as time goes on  Store more data now, and when new features are requested development will be easier  Choose a good serialization framework and stick with it  Always clean your data before inserting