Multidimensional probabilistic real-time analytics at scale
VALENTIN BAZAREVSKY
Questions for the audience
 HLL
 MinHash
 Uniform distribution
 Inclusion-Exclusion principle
 Bitmap
Web Analytics Questions
 How big is your audience?
 Where does it come from?
 How active is it?
 What is its gender split?
 Which browsers and devices does it use?
 How similar are two audiences?
 Which audience is most similar to yours?
 How does it change over time?
Advanced Web Analytics Questions
 What characteristics will my audience have if I build it by a particular rule?
 If a KPI can be described by a given rule, which audience fits it better than the others?
Numbers
 2B cookie profiles
 50k segments
 35B cookie-segment pairs
 150M transaction predicate sets
 15 TB of transactional data
 50k requests per second
Segment size    Segments
> 1k            6k
> 10k           6k
> 100k          6k
> 1M            6k
> 10M           6k
> 100M          2k
> 1B            25
Estimation pipeline
HyperLogLogs
MinHashes
1% Bitmaps
1%, 0.01% samples as sets
Probabilistic data structures landscape
 HLL, zipped, 2% error: 400 B
 MinHash: 32 KB
 1% bitmap: 2-5 MB
 1% sets: depends on set size (up to 150 MB in our case, which is rare)
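A rough check of where sizes like these can come from. The 2B profile count is from the Numbers slide; the element widths (one bit per sampled profile, 8192 four-byte MinHash values, 2048 one-byte HLL registers) are assumptions chosen for illustration, not figures confirmed by the deck.

```python
# Back-of-envelope sizes. Element widths below are assumptions, not from the slides.
PROFILES = 2_000_000_000                  # 2B cookie profiles

# 1% bitmap: one bit per sampled profile
sampled_profiles = PROFILES // 100        # 20M profiles in a 1% sample
bitmap_bytes = sampled_profiles // 8      # 8 profiles per byte

# MinHash: assumed 8192 retained hash values, 4 bytes each
minhash_bytes = 8192 * 4

# HLL: assumed 2048 registers, 1 byte each, before packing
hll_bytes = 2048 * 1

print(bitmap_bytes, minhash_bytes, hll_bytes)
```

Under these assumptions the bitmap is 2.5 MB, the MinHash 32 KiB and the unpacked HLL 2 KiB, which is in line with the sizes listed above.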
HyperLogLog intuition
 Estimates the number of unique users in a set
 Hash each user id to a uniformly distributed bit string
 The probability that a hash has 0 in the first position is 50%
 Two zeros in a row: 25%
 Three: 12.5%
 Etc.
 What can you say about the set if you know that the longest run of zeros was 10?
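The intuition above can be sketched in a few lines: a run of k leading zeros shows up with probability 2^-k, so seeing a longest run of 10 suggests roughly 2^10, about a thousand, distinct items. This is a minimal sketch; the names hash64, leading_zeros and crude_estimate are hypothetical, and BLAKE2b is just one convenient way to get uniform bits.

```python
import hashlib

def hash64(item: str) -> int:
    """Map an item to a 64-bit integer that behaves like a uniform random value."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def leading_zeros(value: int, width: int = 64) -> int:
    """Count the zero bits before the first 1 bit of a fixed-width value."""
    return width - value.bit_length()

def crude_estimate(items) -> int:
    """Single-observation estimate: 2 ** (longest run of leading zeros seen)."""
    longest = max(leading_zeros(hash64(x)) for x in items)
    return 2 ** longest
```

A single observation like this is very noisy (it can only return powers of two), which is why HyperLogLog splits the stream into many registers and averages them.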
HLL intuition pt. 2
 0011001010100
 1010010010100
 1101101010100
 1100111010100
 0111000010100
 0101001010100
 0001000000100
Set operations on HLL
 Union
 Intersection
 Subtraction
 Inclusion-exclusion principle
 Accuracy degradation
 Binomial coefficients
Calculation tree transformation
 An HLL can only be unioned with another HLL
 To intersect an HLL with another HLL, you need the inclusion-exclusion principle:
 |A and B| = |A| + |B| - |A or B|, which yields a number, not an HLL
 So how do we estimate expressions like:
 (A and B) or C => (A or C) and (B or C)
 This needs a recursive tree transformation that leaves only one final intersection and subtraction
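The transformation can be checked with a small sketch in which the only primitive is "cardinality of a union", the one operation an HLL supports natively. Exact Python sets stand in for the sketches here so the identities can be verified; the function names are hypothetical.

```python
def union_card(*sets) -> int:
    """Cardinality of a union: the only primitive an HLL-style sketch offers."""
    merged = set()
    for s in sets:
        merged |= s
    return len(merged)

def intersect_card(a, b) -> int:
    # |A and B| = |A| + |B| - |A or B|
    return union_card(a) + union_card(b) - union_card(a, b)

def subtract_card(a, b) -> int:
    # |A minus B| = |A or B| - |B|
    return union_card(a, b) - union_card(b)

def and_or_card(a, b, c) -> int:
    # (A and B) or C  =>  (A or C) and (B or C)
    # Apply |X and Y| = |X| + |Y| - |X or Y| with X = A or C, Y = B or C,
    # so every term on the right is the cardinality of a plain union.
    return union_card(a, c) + union_card(b, c) - union_card(a, b, c)
```

With real sketches each union_card call carries its own error, and the errors add up across the terms, which is one reason accuracy degrades for deep expressions.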
MinHash vs K Min Values
 Jaccard index: J(A, B) = |A and B| / |A or B|
 Sampling ratio normalization
 Cardinality estimation via KMinValues
 Accuracy degrades when the estimated result is much smaller than the bigger set
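A minimal K Min Values sketch, assuming 64-bit BLAKE2b hashes and k = 1024 (both illustrative choices, as are the function names). Each set keeps only its k smallest hash values; the k-th smallest value gives a cardinality estimate, and the share of the union's k smallest values that appear in both sketches estimates the Jaccard index.

```python
import hashlib

K = 1024              # sketch size; an illustrative choice
HASH_SPACE = 2 ** 64

def hash64(item: str) -> int:
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def kmv_sketch(items, k: int = K) -> list:
    """Keep the k smallest distinct hash values, sorted ascending."""
    return sorted({hash64(x) for x in items})[:k]

def kmv_union(s1, s2, k: int = K) -> list:
    """The k smallest values of the union are contained in the two sketches."""
    return sorted(set(s1) | set(s2))[:k]

def kmv_cardinality(sketch, k: int = K) -> float:
    if len(sketch) < k:
        return float(len(sketch))                  # fewer than k values: exact count
    return (k - 1) * HASH_SPACE / sketch[k - 1]    # (k - 1) / normalized k-th minimum

def kmv_jaccard(s1, s2, k: int = K) -> float:
    union = kmv_union(s1, s2, k)
    if not union:
        return 0.0
    common = set(s1) & set(s2)
    return sum(1 for v in union if v in common) / len(union)
```

An intersection size then follows as kmv_jaccard(a, b) * kmv_cardinality(kmv_union(a, b)). When the intersection is tiny compared with the bigger set, very few of the k values land in it, so the estimate gets noisy.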
Bitmaps
 Each bit corresponds to a particular set item
 Good estimation accuracy and performance
 Memory-inefficient if the underlying set is small
 Requires a mapping from element id to sequence number in the bitmap (a sync challenge for a distributed application)
 Improvement: compressed bitmaps
 Still a big overhead, as we need to store all the items
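A small sketch of the bitmap approach, using a Python integer as an uncompressed bitset. BitmapIndex is a hypothetical stand-in for the id-to-sequence-number mapping; in a distributed setup this dictionary is the shared state that has to stay in sync.

```python
class BitmapIndex:
    """Assigns each element id a stable bit position (the id-to-sequence mapping)."""

    def __init__(self):
        self.position = {}

    def bit(self, element_id: str) -> int:
        if element_id not in self.position:
            self.position[element_id] = len(self.position)   # next free position
        return 1 << self.position[element_id]

def build_bitmap(index: BitmapIndex, element_ids) -> int:
    bitmap = 0
    for element_id in element_ids:
        bitmap |= index.bit(element_id)
    return bitmap

def popcount(bitmap: int) -> int:
    """Number of set bits, i.e. the cardinality of the set."""
    return bin(bitmap).count("1")

index = BitmapIndex()
a = build_bitmap(index, ["u1", "u2", "u3"])
b = build_bitmap(index, ["u2", "u3", "u4"])
print(popcount(a | b), popcount(a & b), popcount(a & ~b))   # union, intersection, subtraction
```

All three set operations are single bitwise operations followed by a bit count, which is where the good accuracy and performance come from; the cost is one bit for every known item, even for small sets.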
Sampled audience as Sets
 Huge memory consumption for big audiences
 Set operation performance depends on the smaller set
 So operations on two big sets are slow
 Resample big sets to 0.01% and use that sample only when all sets in the expression are big
 No need to store the id-to-sequence-number mapping
 Efficient for small audiences
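One way to build such samples, shown as a sketch: the deck does not say how the samples are drawn, so hash-based sampling is an assumption here, as are the function names. Keeping an id when its hash is divisible by N gives a 1-in-N sample that is consistent across segments, so intersecting two samples and scaling by N estimates the real intersection.

```python
import hashlib

def in_sample(cookie_id: str, one_in: int) -> bool:
    """Keep an id if its hash falls into the 1-in-`one_in` slice (assumed scheme)."""
    digest = hashlib.blake2b(cookie_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % one_in == 0

def sample_set(ids, one_in: int = 100) -> set:
    """one_in=100 gives a 1% sample, one_in=10000 a 0.01% sample."""
    return {i for i in ids if in_sample(i, one_in)}

def estimate_intersection(sample_a: set, sample_b: set, one_in: int = 100) -> int:
    # Iterate over the smaller sample only, so cost depends on the smaller set.
    small, big = sorted((sample_a, sample_b), key=len)
    return sum(1 for i in small if i in big) * one_in
```

Because 100 divides 10000, every id in the 0.01% sample is also in the 1% sample, so a big set can be resampled from its existing 1% sample without going back to the raw data.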
To sum up (2B audience)
HLL
 Size: 2 KB (400 B packed)
 Accuracy: 2% on average for cardinality
 Restrictions: significant degradation if set sizes differ by more than 10 times
 Supported operations: union natively; intersect and subtract via the inclusion-exclusion principle. Not every calculation tree can be estimated.
MinHash (8k)
 Size: 32 KB
 Accuracy: 2% if sets cardinality less than 100
 Restrictions: set sizes differ by more than 1000 times
 Supported operations: union, intersect, subtract. Recursive disjoint and intersection lead to accuracy degradation. Requires tree transformation.
Bitmaps 1%
 Size: 5 MB
 Accuracy: 2% if set size > 10k
 Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
 Supported operations: union, intersect, subtract
Sets (1% + 0.01%)
 Size: 0 to 200 MB
 Accuracy: 2% if set size > 10k
 Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
 Supported operations: union, intersect, subtract
Combination of different approaches
 HLL + MH
 Use MH for intersection and subtraction
 Bitmaps + Sets
 That is, sparse and dense representations of a set
 Store items as sets, then convert them to bitmaps after a certain threshold
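The Bitmaps + Sets combination can be sketched as a container that starts sparse and switches to a dense bitmap once it grows past a threshold. HybridSet and its threshold of 4 are hypothetical and kept tiny for illustration; items are assumed to be already mapped to integer positions.

```python
class HybridSet:
    """Sparse (set of positions) while small, dense (int bitmap) after a threshold."""

    def __init__(self, threshold: int = 4):
        self.threshold = threshold
        self.sparse = set()     # item positions while the set is small
        self.dense = None       # integer bitmap once the threshold is exceeded

    def add(self, position: int) -> None:
        if self.dense is not None:
            self.dense |= 1 << position
            return
        self.sparse.add(position)
        if len(self.sparse) > self.threshold:
            # Convert: set one bit per stored position, then drop the sparse form.
            self.dense = sum(1 << p for p in self.sparse)
            self.sparse = set()

    def is_dense(self) -> bool:
        return self.dense is not None

    def cardinality(self) -> int:
        if self.dense is not None:
            return bin(self.dense).count("1")
        return len(self.sparse)
```

A real threshold would be picked where the bitmap becomes smaller than the explicit set, which depends on the id width and on how large the position space is.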
What we store
 Segment data (near real time)
 Segment stats for each day (HLL + MinHash): 14 GB, 1 GB per day
 Affinities report (daily recount + near-real-time deltas)
 1% sample bitmap (uncompressed in Redis, 190 GB)
 1% + 0.01% sample sets (40 GB)
 Transaction predicate sets (daily)
 HLL (compressed; 150M HLLs in 40 GB)
Questions?
