SlideShare a Scribd company logo
Probabilistic
Data Structures
By,
Shrinivas Vasala
Outline
• Basic Idea
• Hyperloglog
• Bloom filters
• Count-min sketches
• References & further reading
2
Basic Idea
• Problem: you have a lot of data to count,
track or otherwise analyze
• Often an approximate answer is sufficient,
especially if you can place bounds on how
wrong the approximation is likely to be
• IO is the most expensive
3
Hyperloglog
• Originally described by Flajolet and colleagues in 2007
• Can estimate cardinalities well beyond 109 with a relative accuracy
(standard error) of 2% while only using 1.5kb of memory
• Hashing turns anything into uniform distribution
• If the maximum number of leading zeros observed is ’n’, an estimate
for the number of distinct elements in the set is 2n
• Tuning Precision : set is split into multiple subsets
• Harmonic mean + low/high sampling adjustments  Results
• Increasing the number of bits of your hash increases the highest
possible number you can accurately approximate
• Commonly used hashing algorithm: Murmur
• Implementation in Redis
4
Bloom filters
• Conceived by Burton Howard Bloom in 1970
• Used to test whether an element is a member of a set
• Query returns either "possibly in set" or "definitely not in set“
• An empty Bloom filter is a bit array of ‘m’ bits, all set to 0
• ‘k’ hash functions map the set element to one of the m array
positions with a uniform random distribution
• The bits at all these positions are set to 1
• To query, hash with same hash functions and check if all positions
are set, if not its surely not in the set else maybe present in set
• For an optimal value of k with 1% error each element requires only
about 9.6 bits — regardless of the size of the elements
5
1. Choose a ballpark value for no. of elements in the set ‘n’
2. Choose a value for m
3. Calculate the optimal value of k [ k = (m/n) ln 2 ]
4. Calculate the error rate ‘p’ [ p = (1-e-kn/m)k ]
If p is unacceptable, return to step 2 and change m;
otherwise we're done.
Count-min sketches
• Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003
• How many of each item there is in an collection
• Sketch is a compact summary of a large amount of data, is a 2D array
of w columns and d rows
• Each box is a counter
• Each row is indexed by a
corresponding hash function
• Estimated frequency for
‘Something’ is min(a,b,c,d)
• ‘w’ limits the magnitude of the error [ error <= 2 * n/w ]
• ‘d’ controls the probability that the estimation is greater than the error
[probability limit exceeded = 1 – (1/2) ** d]
• Works best on Skewed data
References & further reading
Other Topics
• MinHash is a technique for quickly estimating how similar two sets are
• Quotient filters are AMQs (approximate membership query) and, provide many of the same
benefits as Bloom filters
• Skip list is a data structure that allows fast search within an ordered sequence of elements
Hyperloglog
1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms
3. http://en.wikipedia.org/wiki/HyperLogLog
Bloom filters
1. http://billmill.org/bloomfilter-tutorial/
2. http://en.wikipedia.org/wiki/Bloom_filter
Count-min sketches
1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon-
2012
2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
3. https://sites.google.com/site/countminsketch/home
7

More Related Content

What's hot

Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
Kira
 
Bloom filter
Bloom filterBloom filter
Bloom filter
wang ping
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
xlight
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
Albert Bifet
 
Bloom filter
Bloom filterBloom filter
Bloom filter
Hamid Feizabadi
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
Flink Forward
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
Albert Bifet
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
Joe Kelley
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
Devesh Maru
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Christopher Conlan
 
Bloom filter
Bloom filterBloom filter
Bloom filter
feng lee
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data Scientists
Christopher Conlan
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.
Albert Bifet
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Hashing
HashingHashing
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
Concept of hashing
Concept of hashingConcept of hashing
Concept of hashing
Rafi Dar
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
Albert Bifet
 
Hash tables
Hash tablesHash tables
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
Albert Bifet
 

What's hot (20)

Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data Scientists
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Hashing
HashingHashing
Hashing
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Concept of hashing
Concept of hashingConcept of hashing
Concept of hashing
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Hash tables
Hash tablesHash tables
Hash tables
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 

Viewers also liked

Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
nathanmarz
 
ReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений Сафроненко
PechaKucha Ukraine
 
Big Data aggregation techniques
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniques
Valentin Logvinskiy
 
HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?
bzamecnik
 
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Qrator Labs
 
Hyper loglog
Hyper loglogHyper loglog
Hyper loglog
nybon
 
Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017
Roman Elizarov
 

Viewers also liked (7)

Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
ReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений Сафроненко
 
Big Data aggregation techniques
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniques
 
HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?
 
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]
 
Hyper loglog
Hyper loglogHyper loglog
Hyper loglog
 
Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017
 

Similar to Probabilistic data structures

Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiMachine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Professor Lili Saghafi
 
greedy algorithm.pptx good for understanding
greedy algorithm.pptx good for understandinggreedy algorithm.pptx good for understanding
greedy algorithm.pptx good for understanding
HUSNAINAHMAD39
 
Amortized analysis
Amortized analysisAmortized analysis
Amortized analysis
Dr Shashikant Athawale
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
Debasish Ghosh
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
mark_landry
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
Sudhakar Chavan
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means Clustering
Junghoon Kim
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
Open Analytics
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 
Random finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensorsRandom finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensors
Daniel Hauschildt
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
roshmat
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
sandeepsandy494692
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees
Kush Kulshrestha
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering Algorithm
IOSR Journals
 
F017132529
F017132529F017132529
F017132529
IOSR Journals
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
Student
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
Yoav chernobroda
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Maninda Edirisooriya
 
K-means and GMM
K-means and GMMK-means and GMM
K-means and GMM
Sanghyuk Chun
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
JK970901
 

Similar to Probabilistic data structures (20)

Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiMachine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
 
greedy algorithm.pptx good for understanding
greedy algorithm.pptx good for understandinggreedy algorithm.pptx good for understanding
greedy algorithm.pptx good for understanding
 
Amortized analysis
Amortized analysisAmortized analysis
Amortized analysis
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means Clustering
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Random finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensorsRandom finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensors
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering Algorithm
 
F017132529
F017132529F017132529
F017132529
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
 
K-means and GMM
K-means and GMMK-means and GMM
K-means and GMM
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 

Recently uploaded

一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 

Recently uploaded (20)

一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 

Probabilistic data structures

  • 2. Outline • Basic Idea • Hyperloglog • Bloom filters • Count-min sketches • References & further reading 2
  • 3. Basic Idea • Problem: you have a lot of data to count, track or otherwise analyze • Often an approximate answer is sufficient, especially if you can place bounds on how wrong the approximation is likely to be • IO is the most expensive 3
  • 4. Hyperloglog • Originally described by Flajolet and colleagues in 2007 • Can estimate cardinalities well beyond 109 with a relative accuracy (standard error) of 2% while only using 1.5kb of memory • Hashing turns anything into uniform distribution • If the maximum number of leading zeros observed is ’n’, an estimate for the number of distinct elements in the set is 2n • Tuning Precision : set is split into multiple subsets • Harmonic mean + low/high sampling adjustments  Results • Increasing the number of bits of your hash increases the highest possible number you can accurately approximate • Commonly used hashing algorithm: Murmur • Implementation in Redis 4
  • 5. Bloom filters • Conceived by Burton Howard Bloom in 1970 • Used to test whether an element is a member of a set • Query returns either "possibly in set" or "definitely not in set“ • An empty Bloom filter is a bit array of ‘m’ bits, all set to 0 • ‘k’ hash functions map the set element to one of the m array positions with a uniform random distribution • The bits at all these positions are set to 1 • To query, hash with same hash functions and check if all positions are set, if not its surely not in the set else maybe present in set • For an optimal value of k with 1% error each element requires only about 9.6 bits — regardless of the size of the elements 5 1. Choose a ballpark value for no. of elements in the set ‘n’ 2. Choose a value for m 3. Calculate the optimal value of k [ k = (m/n) ln 2 ] 4. Calculate the error rate ‘p’ [ p = (1-e-kn/m)k ] If p is unacceptable, return to step 2 and change m; otherwise we're done.
  • 6. Count-min sketches • Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003 • How many of each item there is in an collection • Sketch is a compact summary of a large amount of data, is a 2D array of w columns and d rows • Each box is a counter • Each row is indexed by a corresponding hash function • Estimated frequency for ‘Something’ is min(a,b,c,d) • ‘w’ limits the magnitude of the error [ error <= 2 * n/w ] • ‘d’ controls the probability that the estimation is greater than the error [probability limit exceeded = 1 – (1/2) ** d] • Works best on Skewed data
  • 7. References & further reading Other Topics • MinHash is a technique for quickly estimating how similar two sets are • Quotient filters are AMQs (approximate membership query) and, provide many of the same benefits as Bloom filters • Skip list is a data structure that allows fast search within an ordered sequence of elements Hyperloglog 1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html 2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms 3. http://en.wikipedia.org/wiki/HyperLogLog Bloom filters 1. http://billmill.org/bloomfilter-tutorial/ 2. http://en.wikipedia.org/wiki/Bloom_filter Count-min sketches 1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon- 2012 2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch 3. https://sites.google.com/site/countminsketch/home 7