Realtime analytics

probabilistic data structures

Multidimensional probabilistic real-time analytics at scale
VALENTIN BAZAREVSKY
Questions to the audience
 HLL
 MinHash
 Uniform distribution
 Inclusion-Exclusion principle
 Bitmap
Web Analytics Questions
 How big is your audience?
 Where does it come from?
 How active is it?
 What is the gender split?
 Which browsers / devices does it use?
 How similar are audiences to each other?
 Which audience is the most similar to yours?
 What are the dynamics?
Advanced Web Analytics Questions
 What characteristics will my audience have if I build it by a particular rule?
 If a KPI can be described by a given rule, give me the audience that fits it better than others
Numbers
 2B cookie profiles
 50k segments
 35B cookie-segment pairs
 150M transaction predicate sets
 15 TB of transactional data
 50k requests per second
Segment size    Number of segments
> 1k            6k
> 10k           6k
> 100k          6k
> 1M            6k
> 10M           6k
> 100M          2k
> 1B            25
Estimation pipeline
HyperLogLogs
MinHashes
1% Bitmaps
1%, 0.01% samples as sets


  • 7. Probabilistic data structures landscape
     HLL, zipped, 2% error: 400b
     MinHash: 32 kb
     1% bitmap: 2-5 Mb
     1% sets: depends on size (in our case up to 150 Mb, a rare case)
  • 8. HyperLogLog intuition
     Allows estimating the number of unique users in a set
     The probability that a hash has 0 in the first position is 50%
     Two zeros in a row: 25%
     Three: 12.5%
     And so on
     What can you say about the set if you know that the longest sequence of zeros was 10?
  • 9. HLL intuition pt. 2
     0011001010100
     1010010010100
     1101101010100
     1100111010100
     0111000010100
     0101001010100
     0001000000100
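To make the intuition concrete: a single longest run of 10 zeros suggests a set on the order of 2^10 distinct items, and HyperLogLog reduces the variance of that guess by splitting hashes across many registers and averaging. The following is a minimal Python sketch of that idea, not the implementation behind these slides. The class name, the use of SHA-1, and the register count are assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with 2**p registers (assumed layout, not the deck's)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64 uniformly distributed bits derived from the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits,
        # i.e. the number of leading zeros plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union is native: take the register-wise maximum.
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50_000):
        hll.add(f"cookie-{i}")
    print(round(hll.count()))
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), roughly 1.6%, which is in line with the 2% figure quoted in the deck.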
  • 10. Set operations on HLL
     Union
     Intersection
     Subtraction
     Inclusion-exclusion principle
     Accuracy degradation
     Binomial coefficients
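One way to read the last three bullets together: an intersection of n sets can be written purely in terms of union sizes, with C(n, k) union terms of size k, so 2^n - 1 estimates in total, each contributing its own error. The Python sketch below illustrates this under the assumption that only a union-size oracle is available. Exact Python sets stand in for HLL sketches here, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def intersection_size(sets, union_size):
    """|A1 and ... and An| computed from union sizes only (inclusion-exclusion)."""
    total = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1
        for combo in combinations(sets, k):
            total += sign * union_size(combo)
    return total


def exact_union_size(combo):
    # Stand-in for "merge the HLLs and read the estimate".
    return len(set().union(*combo))


def union_terms(n):
    # Number of union estimates needed: the sum of C(n, k) for k = 1..n.
    return sum(comb(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    a = {1, 2, 3, 4, 5, 6}
    b = {4, 5, 6, 7, 8, 9}
    c = {5, 6, 7, 20}
    print(intersection_size([a, b, c], exact_union_size))  # 2, i.e. {5, 6}
    print(union_terms(3))                                  # 7
```

With real sketches each of those terms carries an error proportional to the size of the union, which is why a small intersection of large sets is estimated poorly.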
  • 11. Calculation tree transformation
     An HLL can be unioned only with another HLL
     To intersect an HLL with another HLL, you need the inclusion-exclusion principle:
     |A and B| = |A| + |B| - |A or B|, which yields a number, not an HLL
     So how do we estimate expressions like:
     (A and B) or C => (A or C) and (B or C)
     This needs a recursive tree transformation that leaves only one final intersection and subtraction
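The transformation can be sketched as pushing every union below the intersections, so that each leaf group is unioned natively and a single inclusion-exclusion step runs at the top. The Python below is an illustrative sketch covering only "and" and "or" (subtraction is omitted). The tuple-based expression format and the function names are assumptions, and exact sets stand in for HLLs.

```python
from itertools import combinations


def to_clauses(expr):
    """Rewrite an and/or tree as an intersection of unions.

    expr is a leaf name (str) or a tuple (op, left, right) with op in
    {"and", "or"}. Returns a list of clauses; each clause is a frozenset
    of leaf names that are to be unioned together.
    """
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, left, right = expr
    lhs, rhs = to_clauses(left), to_clauses(right)
    if op == "and":
        return lhs + rhs
    if op == "or":
        # Distribute: (X1 and X2) or Y  =>  (X1 or Y) and (X2 or Y)
        return [a | b for a in lhs for b in rhs]
    raise ValueError(f"unsupported operator: {op}")


def estimate(expr, data):
    """Cardinality of expr using only unions plus one inclusion-exclusion."""
    unions = [set().union(*(data[name] for name in clause))
              for clause in to_clauses(expr)]
    total = 0
    for k in range(1, len(unions) + 1):
        sign = 1 if k % 2 == 1 else -1
        for combo in combinations(unions, k):
            total += sign * len(set().union(*combo))
    return total


if __name__ == "__main__":
    data = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {9}}
    expr = ("or", ("and", "A", "B"), "C")
    print(to_clauses(expr))      # two clauses: {A, C} and {B, C}
    print(estimate(expr, data))  # 3, i.e. {3, 4, 9}
```

Note that the number of clauses can grow quickly for deeply nested expressions, which ties back to the accuracy degradation mentioned on the previous slide.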
  • 12. MinHash vs K Min Values
     Jaccard index: J(A, B) = |A and B| / |A or B|
     Sampling ratio normalization
     Cardinality estimation via KMinValues
     Accuracy degrades when the estimated result is much smaller than the bigger set
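As a rough illustration of the K Min Values variant: keep the k smallest normalized hash values of a set, estimate cardinality from the k-th smallest value, and estimate the Jaccard index from the k smallest values of the combined sketches. The Python below is a sketch under these assumptions. The hash function, the value of k, and all function names are chosen for the example and are not taken from the slides.

```python
import hashlib


def h01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def kmv(items, k):
    """K Min Values sketch: the k smallest distinct hash values, sorted."""
    return sorted({h01(x) for x in items})[:k]


def kmv_cardinality(sketch, k):
    # If fewer than k values were seen, the sketch holds the whole set.
    if len(sketch) < k:
        return float(len(sketch))
    # The k-th smallest of n uniform values sits near k / n.
    return (k - 1) / sketch[k - 1]


def kmv_jaccard(s1, s2, k):
    # The k smallest values of the union form a uniform sample of A or B;
    # the share of them present in both sketches estimates J(A, B).
    union_k = sorted(set(s1) | set(s2))[:k]
    both = set(s1) & set(s2)
    return sum(1 for v in union_k if v in both) / len(union_k)


if __name__ == "__main__":
    k = 1024
    a = kmv(range(0, 20_000), k)
    b = kmv(range(10_000, 30_000), k)
    union = sorted(set(a) | set(b))[:k]
    j = kmv_jaccard(a, b, k)
    print("Jaccard ~", round(j, 3))                        # true value is 1/3
    print("|A and B| ~", round(j * kmv_cardinality(union, k)))
```

The intersection estimate is the Jaccard estimate times the union estimate, so when the true intersection is tiny relative to the bigger set, only a handful of the k sampled values fall into it and the relative error grows.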
  • 13. Bitmaps
     Each bit corresponds to a particular set item
     Good estimation accuracy and performance
     Not memory-efficient if the underlying set is small
     Requires a mapping from element id to sequence number in the bitmap (a sync challenge for a distributed application)
     Improvement: compressed bitmaps
     Still a big overhead, as we need to store all the items
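A small Python sketch of the bitmap idea follows, using arbitrary-precision integers as uncompressed bitmaps. The `BitmapIndex` class and its in-memory dictionary are hypothetical; in a distributed setup this id-to-position mapping is the shared state that has to be kept in sync.

```python
class BitmapIndex:
    """Assigns each element id a stable bit position (sequence number)."""

    def __init__(self):
        self.position = {}  # element id -> bit position

    def _pos(self, item):
        # New ids get the next free sequence number.
        return self.position.setdefault(item, len(self.position))

    def bitmap(self, items):
        bits = 0
        for item in items:
            bits |= 1 << self._pos(item)
        return bits


def cardinality(bits):
    """Number of set bits, i.e. the number of items in the set."""
    return bin(bits).count("1")


if __name__ == "__main__":
    index = BitmapIndex()
    a = index.bitmap(["u1", "u2", "u3"])
    b = index.bitmap(["u2", "u3", "u4"])
    print(cardinality(a | b))   # union: 4
    print(cardinality(a & b))   # intersection: 2
    print(cardinality(a & ~b))  # subtraction: 1
```

When the bitmap is built from a 1% sample, the counts would be multiplied by 100 to estimate the full audience.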
  • 14. Sampled audience as sets
     Huge memory consumption for big audiences
     Set operation performance depends on the smaller set
     So operations on two big sets are slow
     Resample big sets to 0.01% and use that sample only when all sets in the equation are big
     No need to store the id-to-sequence-number mapping
     Efficient for small audiences
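The sampling scheme can be sketched with a deterministic hash: an id is kept if its hash falls into the first few of 10,000 buckets, so every segment samples the same ids and the 0.01% sample is a subset of the 1% sample. This is an assumption about how such samples could be built, not a description of the production system, and the function names are hypothetical.

```python
import hashlib

BUCKETS = 10_000


def bucket(item):
    """Deterministic bucket in [0, BUCKETS) for an id."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def sample(items, per_10k):
    """Keep ids whose bucket is below per_10k (100 -> 1%, 1 -> 0.01%)."""
    return {x for x in items if bucket(x) < per_10k}


def estimate_intersection(sa, sb, per_10k):
    # Iterate over the smaller sample and probe the bigger one,
    # so the cost depends on the smaller set.
    small, big = (sa, sb) if len(sa) <= len(sb) else (sb, sa)
    hits = sum(1 for x in small if x in big)
    return hits * BUCKETS / per_10k  # scale back to the full population


if __name__ == "__main__":
    a = sample(range(0, 200_000), per_10k=100)
    b = sample(range(100_000, 300_000), per_10k=100)
    print(estimate_intersection(a, b, per_10k=100))  # true value is 100,000
```

Because the same hash decides membership for every segment, intersecting two samples is equivalent to sampling the intersection, which is what makes the scaled count a valid estimate.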
  • 15. To sum up (2B audience)
     HLL
        Size: 2kb (400b packed)
        Accuracy: 2% on average for cardinality
        Restrictions: significant degradation if set sizes differ by more than 10 times
        Supported operations: union natively; intersect and subtract via the inclusion-exclusion principle. Not every calculation tree can be estimated.
     MinHash (8k)
        Size: 32 kb
        Accuracy: 2% if sets cardinality less than 100
        Restrictions: set sizes difference > 1000 times
        Supported operations: union, intersect, subtract. Recursive disjoint and intersection leads to accuracy degradation. Requires tree transformation.
     Bitmaps 1%
        Size: 5 Mb
        Accuracy: 2% if set size > 10k
        Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
        Supported operations: union, intersect, subtract
     Sets (1% + 0.01%)
        Size: 0 to 200 Mb
        Accuracy: 2% if set size > 10k
        Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
        Supported operations: union, intersect, subtract
  • 16. Combination of different approaches
     HLL + MH
     Use MH for intersection and subtraction
     Bitmaps + Sets
     I.e. sparse and dense representations of a set
     Store items as sets, then convert them to bitmaps after a certain threshold
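The sparse/dense combination could look roughly like the Python sketch below: a container starts as a plain set and switches to a bitmap once it grows past a threshold. The class name, the threshold value, and the assumption that items are already integer sequence numbers are all illustrative choices rather than details from the slides.

```python
def to_bitmap(seq_numbers):
    """Pack integer sequence numbers into an int used as a bitmap."""
    bits = 0
    for s in seq_numbers:
        bits |= 1 << s
    return bits


class HybridSet:
    """Sparse set of sequence numbers that turns into a dense bitmap when large."""

    def __init__(self, threshold=1024):
        self.threshold = threshold
        self.sparse = set()   # used while the set is small
        self.dense = None     # int bitmap once the threshold is exceeded

    def add(self, seq):
        if self.dense is not None:
            self.dense |= 1 << seq
            return
        self.sparse.add(seq)
        if len(self.sparse) > self.threshold:
            # One-way conversion from sparse to dense representation.
            self.dense = to_bitmap(self.sparse)
            self.sparse = set()

    def __len__(self):
        if self.dense is not None:
            return bin(self.dense).count("1")
        return len(self.sparse)

    def intersection_size(self, other):
        if self.dense is None and other.dense is None:
            return len(self.sparse & other.sparse)
        if self.dense is None:
            # Sparse vs dense: probe the bitmap once per sparse item.
            return sum(1 for s in self.sparse if (other.dense >> s) & 1)
        if other.dense is None:
            return other.intersection_size(self)
        return bin(self.dense & other.dense).count("1")


if __name__ == "__main__":
    small, big = HybridSet(threshold=3), HybridSet(threshold=3)
    for s in (1, 2, 3):
        small.add(s)
    for s in range(10):
        big.add(s)
    print(small.dense is None, big.dense is not None)  # True True
    print(small.intersection_size(big))                # 3
```

The design keeps small audiences cheap in memory while letting large ones benefit from fast bitwise operations.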
  • 17. What we store
     Segment data (near real time)
     Segment stats per day (HLL + MinHash): 14 Gb, 1 Gb per day
     Affinities report (daily recount + near-real-time deltas)
     1% sample bitmap (no compression in Redis, 190 Gb)
     1% + 0.01% sample sets (40 Gb)
     Transaction predicate sets (daily)
     HLL (compressed; 150M HLLs in 40 Gb)