
- 1. Andrii Gakhov, PhD Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019 Bilbao, Spain
- 2. Andrii Gakhov Senior Software Engineer at Ferret Go GmbH, Germany Ph.D. in Mathematical Modelling, M.Sc. in Applied Mathematics Twitter: @gakhov | Website: gakhov.com Probabilistic Data Structures and Algorithms for Big Data Applications ISBN: 9783748190486 https://pdsa.gakhov.com
- 3. 0. Motivation
- 4. Bioinformatics: Counting k-mers in DNA Counting substrings of length k in DNA sequence data (k-mers) is essential in bioinformatics, for instance, for metagenomic sequencing. A large fraction of the storage is spent on k-mers that contain sequencing errors and are observed only a single time in the data*. Can we efficiently avoid persisting such invalid substrings? Can we efficiently count the valid ones? For example, the team that sequenced the giant panda genome needed to count 8.62 billion 27-mers, of which 68% were low-coverage k-mers. * Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12(1), 333, 2011
- 5. 1. Data-Intensive Applications in the Big Data Epoch
- 6. What is Big Data? In 2001, Doug Laney described Big Data as datasets that contain greater variety, arriving in increasing volumes and with ever-higher velocity. Today this is known as the famous 3V's of Big Data: Volume expresses the amount of data; Velocity describes the speed at which data is arriving; Variety refers to the number of types of data.
- 7. What is Big Data? Big Data is more than simply a matter of size. Big Data does not refer to data; it refers to technology. The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle. Image: https://www.freepngimg.com/electronics/technology
- 8. 2. Probabilistic Data Structures and Algorithms
- 9. Probabilistic Data Structures and Algorithms (PDSA) A family of advanced approaches that are optimized to use sublinear memory and constant execution time. They cannot provide exact answers and have some probability of error. The trade-off between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
- 10. PDSA in Big Data Ecosystem
  Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
  Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
  Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
  Rank (approximate percentiles and quantiles): Random Sampling, q-digest, t-digest, Greenwald-Khanna
  Similarity (find similar documents): MinHash, SimHash, LSH
  Big Data 3V's: Volume, Velocity, Variety
- 11. PDSA in Apache Spark SQL (PySpark interface)
  Spark SQL is Apache Spark's module for working with structured data.
  q-quantile estimation (Greenwald-Khanna):
  # pyspark.sql.DataFrameStatFunctions(df).approxQuantile
  df.approxQuantile("language", [0.5], 0.25)
  Approximate number of distinct elements (HyperLogLog++):
  # pyspark.sql.functions.approx_count_distinct
  df.agg(approx_count_distinct(df.language).alias('lang')).collect()
- 12. PDSA in Production
- 13. 3. Frequency
- 14. Frequency: Challenge A hashtag is used to index a topic on Twitter and allows people to easily follow items they are interested in. Hashtags are usually written with a # symbol in front. Find the most trending hashtags on Twitter. Every second, about 6,000 tweets are created on Twitter, that is roughly 500 million items daily; most tweets are linked with one or more hashtags. https://www.internetlivestats.com/twitter-statistics/
- 15. Frequency: Traditional Approach Build a table that lists all elements seen so far with corresponding counters. Increment the counter when a known element comes in, or add the new element to the table and initialize its counter. Return the value of the counter that corresponds to the element as its frequency. Requires linear memory; requires O(n) lookup time (worst case); huge overhead for heavy-hitters search.
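The traditional approach above can be sketched in a few lines of Python with a hash table; the hashtag values below are invented for illustration:

```python
from collections import Counter

# Exact frequency counting with a hash table: every distinct
# element gets its own counter, so memory grows linearly with
# the number of unique elements seen so far.
counts = Counter()
for hashtag in ["python", "python", "scipy", "python", "bilbao"]:
    counts[hashtag] += 1  # increment, or create the counter on first sight

print(counts["python"])       # exact frequency: 3
print(counts.most_common(1))  # heavy-hitters search scans the whole table
```

This is exact but does not scale: the table holds one entry per unique hashtag, and finding heavy hitters requires touching every entry.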
- 16. Frequency: Challenges for Big Data Continuous data streams: potentially unbounded number of unique elements ➡ sublinear (polylogarithmic at most) space; not feasible to re-process data streams ➡ one-pass algorithms preferred; high-frequency throughput ➡ fast updates. Image: https://www.pngfind.com
- 17. Count-Min Sketch A simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the heavy-hitters problem. Presented by Graham Cormode and Shan Muthukrishnan in 2003.
- 18. Frequency: Estimation with a single counter [figure] An array of m counters, all initialized to zero; a hash function h maps each incoming element to one counter, which is incremented on every occurrence.
- 19. Frequency: Estimation with Count-Min Sketch [figure] The sketch consists of k counter arrays of length m. On update, the hash functions h1, h2, ..., hk map the element to one counter in each array, and each of those k counters is incremented.
- 20. Frequency: Estimation with Count-Min Sketch [figure] To estimate the frequency, the element is hashed with the same k functions and the minimum of the k addressed counters is returned, e.g. f(x) = min(1, 3, ..., 5) = 1.
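The update/estimate logic on these two slides can be sketched as follows. This is a minimal illustration, not the pdsa library's implementation: salting a single SHA-1 digest with the row number stands in for k independent hash functions.

```python
import hashlib

class SimpleCountMinSketch:
    """A minimal Count-Min Sketch: k rows of m counters each."""

    def __init__(self, k=5, m=2000):
        self.k, self.m = k, m
        self.table = [[0] * m for _ in range(k)]

    def _indexes(self, element):
        # k hash functions are simulated by salting one SHA-1 digest
        # with the row number (an illustrative shortcut, not the
        # canonical pairwise-independent construction).
        for row in range(self.k):
            digest = hashlib.sha1(f"{row}:{element}".encode()).hexdigest()
            yield row, int(digest, 16) % self.m

    def add(self, element):
        for row, idx in self._indexes(element):
            self.table[row][idx] += 1

    def frequency(self, element):
        # Taking the minimum over all rows limits the overestimation
        # introduced by hash collisions; the true count is never missed.
        return min(self.table[row][idx] for row, idx in self._indexes(element))
```

Because counters are only ever incremented, the estimate may overshoot when collisions occur, but it can never undershoot the true frequency.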
- 21. Frequency: Invoking Count-Min Sketch from Python
  import json
  from pdsa.frequency.count_min_sketch import CountMinSketch

  cms = CountMinSketch(5, 2000)
  with open('tweets.txt') as f:
      for line in f:
          hashtag = json.loads(line)['hashtag']
          cms.add(hashtag)

  print('Frequency of #Python', cms.frequency("Python"))
  size_in_bytes = cms.sizeof()
  print('Size in bytes', size_in_bytes)  # ~40 KB / 32-bit counters
- 22. 4. Counting
- 23. Counting: Challenge Count the number of unique visitors. Amazon and eBay had about 3.375 billion* visitors in June 2019. Assume 337 million unique IP addresses (128 bits per IPv6 record): 5.4 GB of memory just to store them all. What if we could count them with 12 KB only? * SimilarWeb.com data for June 2019. Image: https://www.cleanpng.com
- 24. Counting: Traditional Approach Build a list of all unique elements; sort/search to avoid listing elements twice; count the elements in the list. Requires linear memory; requires O(n·log n) time.
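In Python the traditional approach is usually a hash set rather than an explicit sort, but the memory cost is the same: every unique element must be stored in full. The IP values below are made up for illustration:

```python
# Exact distinct counting: a set keeps every unique element,
# so memory grows linearly with the cardinality.
visits = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
unique_visitors = set(visits)

print(len(unique_visitors))  # 3
```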
- 25. Counting: Approximate Counting @katyperry has 107,287,629 followers. Would you really care whether she has 107.2, 108.0, or 106.7 million followers?
- 26. HyperLogLog a hash-based probabilistic algorithm for counting the number of distinct values in the presence of duplicates proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007
- 27. Counting: Estimation with a single counter (Flajolet, Martin) [figure] Each element is hashed to a binary string (LSB-0), and its rank, the position of the least-significant 1-bit, is marked in an m-bit bitmap (the FM Sketch). With R the position of the lowest unset bit in the bitmap, the cardinality is estimated as n ≈ 2^R / 0.77351.
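The rank used on this slide (the LSB-0 position of the lowest set bit) can be computed directly; this is a sketch of the idea only, not the full FM algorithm:

```python
def rank(value, max_bits=32):
    """LSB-0 rank: the position of the least-significant 1-bit,
    i.e. the number of trailing zeros (max_bits for zero, by convention)."""
    if value == 0:
        return max_bits
    r = 0
    while value & 1 == 0:
        value >>= 1
        r += 1
    return r

# rank(h(x)) = r occurs with probability 2^-(r+1), so high ranks are
# exponentially rare. The FM sketch marks each observed rank in a bitmap;
# with R the position of the lowest unset bit, n ≈ 2^R / 0.77351.
print(rank(0b10000))  # 4
print(rank(0b10001))  # 0
```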
- 28. Counting: Estimation with HyperLogLog [figure] The HLL Sketch is an array of k registers. Each element is hashed with h1, h2, ..., hk; the rank of hj(x), i.e. the LSB-0 position of its least-significant 1-bit, is stored in register j iff it is bigger than the existing value. The cardinality is then estimated as n ≈ α · k · 2^AVG(HLLi).
- 29. Counting: HyperLogLog Algorithm Based on a single 32-bit hash function: hash(x) = a 32-bit value split into p addressing bits and (32 - p) rank-computation bits. Simulates k hash functions using the stochastic averaging approach. Stores only k = 2^p counters (registers), about 4 bytes each. The memory is always fixed, regardless of the number of unique elements. More counters provide less error (memory/accuracy trade-off).
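A single register update following the addressing scheme on this slide can be sketched as below. SHA-1 truncated to 32 bits stands in for the hash function, and the rank convention follows the slides (LSB-0 trailing zeros); the pdsa library's internals differ.

```python
import hashlib

def hll_add(registers, element, p=10):
    """One HyperLogLog update: split a 32-bit hash value into p
    addressing bits and (32 - p) rank-computation bits."""
    h = int(hashlib.sha1(element.encode()).hexdigest(), 16) & 0xFFFFFFFF
    idx = h & ((1 << p) - 1)  # p low bits select one of the 2^p registers
    rest = h >> p             # the remaining bits are used for the rank
    rank = 0
    while rank < 32 - p and (rest >> rank) & 1 == 0:
        rank += 1             # count trailing zeros (LSB-0 rank)
    # a value is kept only iff it is bigger than the stored one
    registers[idx] = max(registers[idx], rank)
```

Because each register only keeps the maximum observed rank, adding the same element twice never changes the sketch, which is why duplicates do not affect the estimate.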
- 30. Counting: Invoking HyperLogLog from Python
  import json
  from pdsa.cardinality.hyperloglog import HyperLogLog

  hll = HyperLogLog(precision=10)  # 2^10 = 1024 counters
  with open('visitors.txt') as f:
      for line in f:
          ip = json.loads(line)['ip']
          hll.add(ip)

  num_of_unique_visitors = hll.count()
  print('Unique visitors', num_of_unique_visitors)
  size_in_bytes = hll.sizeof()
  print('Size in bytes', size_in_bytes)  # ~4 KB
- 31. Counting: Distinct Count in Redis
  Redis uses the HyperLogLog data structure to count unique elements in a set:
  it requires a small constant amount of memory of 12 KB per data structure
  and approximates the exact cardinality with a standard error of 0.81%.
  redis> PFADD hll python java ruby
  (integer) 1
  redis> PFADD hll python python python
  (integer) 0
  redis> PFADD hll java ruby
  (integer) 0
  redis> PFCOUNT hll
  (integer) 3
  http://antirez.com/news/75
- 32. 5. Final Notes
- 33. Final Notes Think about Big Data as a technology challenge. Instead of buying new servers, learn new algorithms. Believe in hashing! Sampling vs. hashing. Probabilistic Data Structures and Algorithms become useful when your problem fits. Image: https://longfordpc.com/
- 34. Read More
  [book] Probabilistic Data Structures and Algorithms for Big Data Applications https://pdsa.gakhov.com
  [repo] Probabilistic Data Structures and Algorithms in Python https://github.com/gakhov/pdsa
  Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
  Redis new data structure: the HyperLogLog http://antirez.com/news/75
  Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
  Big Data with Sketchy Structures https://towardsdatascience.com/b73fb3a33e2a
  Count-Min Sketch http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
- 35. Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019 | Andrii Gakhov @gakhov Website: www.gakhov.com Twitter: @gakhov Probabilistic Data Structures and Algorithms for Big Data Applications pdsa.gakhov.com Eskerrik asko! (Basque: "Thank you very much!")
- 36. 6. Additional Slides (for that person who wants more)
- 37. Counting: Interactive Presentation of HyperLogLog
- 38. Counting: Accuracy vs Memory Tradeoff in HyperLogLog More counters require more memory (4 bytes per counter); more counters need more bits for addressing them (k = 2^p).
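The trade-off can be made concrete with the commonly cited HyperLogLog standard-error bound of about 1.04/√k (this bound is background knowledge, not stated on the slide):

```python
import math

# Memory vs. accuracy for k = 2^p counters, 4 bytes each.
# Standard error ≈ 1.04 / sqrt(k).
for p in (10, 12, 14):
    k = 2 ** p
    memory_kb = k * 4 / 1024
    error_pct = 100 * 1.04 / math.sqrt(k)
    print(f"p={p}: {k} counters, {memory_kb:.0f} KB, ~{error_pct:.2f}% error")
```

For p = 14 (k = 16384) the bound gives about 0.81%, matching the Redis figure on an earlier slide; Redis fits that sketch into 12 KB by packing registers into 6 bits instead of 4 bytes each.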
- 39. Counting: HyperLogLog++ Algorithm HyperLogLog++ is an improved version of HyperLogLog developed at Google and proposed in 2013. It uses a 64-bit hash function, which allows counting more values; it provides better bias correction using empirically pre-computed data; and it proposes a sparse representation of the counters (registers) to reduce memory requirements.