
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applications

We interact with ever-increasing amounts of data, but classical data structures and algorithms can no longer meet our requirements. This talk presents probabilistic algorithms and data structures and describes the main areas of their application.


  1. 1. Andrii Gakhov, PhD Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019
 Bilbao, Spain
  2. 2. Andrii Gakhov Senior Software Engineer
 at Ferret Go GmbH, Germany Ph.D. in Mathematical Modelling, 
 M.Sc. in Applied Mathematics Twitter: @gakhov | Website: gakhov.com Probabilistic Data Structures and Algorithms
 for Big Data Applications ISBN: 9783748190486
 https://pdsa.gakhov.com
  3. 3. 0. Motivation Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019Andrii Gakhov @gakhov
  4. 4. Bioinformatics: Counting k-mers in DNA Counting substrings of length k in DNA sequence data (k-mers) is essential in bioinformatics, for instance, for metagenomic sequencing. A large fraction of the storage is spent on k-mers that contain sequencing errors and are observed only a single time in the data*. Can we efficiently avoid persisting such invalid substrings? Can we efficiently count the valid substrings? * Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12(1), 333, 2011 For example, the team that sequenced the giant panda genome needed to count 8.62 billion 27-mers, of which 68% were low-coverage k-mers.
  5. 5. 1. Data-Intensive Applications 
 in Big Data epoch Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019Andrii Gakhov @gakhov
  6. 6. What is Big Data? In 2001, Doug Laney described Big Data datasets as those that contain greater variety, arriving in increasing volumes and with ever-higher velocity. Today this is known as the famous 3V's of Big Data: Volume expresses the amount of data; Velocity describes the speed at which data is arriving; Variety refers to the number of types of data.
  7. 7. What is Big Data? Big Data is more than simply a matter of size. Big Data does not refer to data, it refers to technology. The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle. Image: https://www.freepngimg.com/electronics/technology
  8. 8. 2. Probabilistic Data Structures
 and Algorithms Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019Andrii Gakhov @gakhov
  9. 9. Probabilistic Data Structures and Algorithms (PDSA) A family of advanced approaches that are optimized to use sublinear memory and constant execution time. They cannot provide exact answers and have some probability of error. The trade-off between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
  10. 10. PDSA in Big Data Ecosystem
 Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
 Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
 Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
 Rank (approximate percentiles and quantiles): Random Sampling, q-digest, t-digest, Greenwald-Khanna
 Similarity (find similar documents): MinHash, SimHash, LSH
  11. 11. PDSA in Apache Spark SQL (PySpark interface) Spark SQL is Apache Spark's module for working with structured data.

 q-quantile estimation (Greenwald-Khanna):
 # pyspark.sql.DataFrameStatFunctions(df).approxQuantile
 df.approxQuantile("language", [0.5], 0.25)

 Approximate number of distinct elements (HyperLogLog++):
 # pyspark.sql.functions.approx_count_distinct
 df.agg(approx_count_distinct(df.language).alias('lang')).collect()
  12. 12. PDSA in Production
  13. 13. 3. Frequency Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019Andrii Gakhov @gakhov
  14. 14. Frequency: Challenge A hashtag is used to index a topic on Twitter and allows people to easily follow items they are interested in. Hashtags are usually written with a # symbol in front. The task: find the most trending hashtags on Twitter. Every second, about 6,000 tweets are created on Twitter, roughly 500 million items daily, and most tweets are linked with one or more hashtags. https://www.internetlivestats.com/twitter-statistics/
  15. 15. Frequency: Traditional Approach Build a table that lists all elements seen so far with corresponding counters. Increment the counter when a known element arrives, or add the element to the table and initialize its counter. Return the value of the counter that corresponds to the element as its frequency. Drawbacks: requires linear memory; requires O(n) lookup time (worst case); huge overhead for heavy-hitters search.
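The traditional approach above can be sketched in a few lines of Python (the hashtag list here is an illustrative stand-in for a real stream); note that the dictionary holds one entry per distinct element, which is exactly the linear-memory drawback:

```python
from collections import Counter

# Naive exact frequency counting: one counter per distinct element,
# so memory grows linearly with the number of distinct hashtags.
hashtags = ["python", "bigdata", "python", "euroscipy", "python"]

counts = Counter()
for tag in hashtags:
    counts[tag] += 1  # create the counter on first sight, increment afterwards

print(counts["python"])  # exact frequency: 3
```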
  16. 16. Frequency: Challenges for Big Data Continuous data streams with a potentially unbounded number of unique elements
 ➡ sublinear (polylogarithmic at most) space
 not feasible to re-process data streams
 ➡ one-pass algorithms preferred
 high-frequency throughput
 ➡ fast updates Image: https://www.pngfind.com
  17. 17. Count-Min Sketch A simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the heavy hitters problem. Presented by Graham Cormode and S. Muthukrishnan in 2003.
  18. 18. Frequency: Estimation with a single counter [Diagram: a single array of m counters indexed 0..m-1; a hash function h maps each incoming element to one counter, which is incremented by 1 on every occurrence, so colliding elements share, and inflate, the same counter.]
  19. 19. Frequency: Estimation with Count-Min Sketch [Diagram: a CMSketch of k counter arrays of length m; on update, each hash function h1, h2, ..., hk maps the element to one counter in its own array, and each of those k counters is incremented by 1.]
  20. 20. Frequency: Estimation with Count-Min Sketch [Diagram: on query, the same k hash functions locate one counter per array and the estimate is the minimum over them, e.g. f( ) = min(1, 3, ..., 5) = 1.]
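The update and query logic of the two slides above can be condensed into a minimal pure-Python sketch. This is an illustration, not the pdsa library implementation; the class name and the blake2b-based row-salting trick for deriving k hash functions are assumptions for this example:

```python
import hashlib

class SimpleCountMinSketch:
    """Minimal Count-Min Sketch: k rows of m counters, one hash per row."""

    def __init__(self, k, m):
        self.k, self.m = k, m
        self.table = [[0] * m for _ in range(k)]

    def _index(self, row, x):
        # Simulate k independent hash functions by salting one hash with the row id.
        digest = hashlib.blake2b(f"{row}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def add(self, x):
        # Update: increment one counter in every row.
        for row in range(self.k):
            self.table[row][self._index(row, x)] += 1

    def frequency(self, x):
        # Query: the minimum over rows limits the overestimate from collisions;
        # the estimate is never below the true frequency.
        return min(self.table[row][self._index(row, x)] for row in range(self.k))
```

Because collisions can only inflate counters, the returned estimate is an upper bound on the true frequency, which is what makes the min over rows safe.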
  21. 21. Frequency: Invoking Count-Min Sketch from Python

 import json
 from pdsa.frequency.count_min_sketch import CountMinSketch

 cms = CountMinSketch(5, 2000)

 with open('tweets.txt') as f:
     for line in f:
         hashtag = json.loads(line)['hashtag']
         cms.add(hashtag)

 print('Frequency of #Python', cms.frequency("Python"))

 size_in_bytes = cms.sizeof()
 print('Size in bytes', size_in_bytes)  # ~40 KB with 32-bit counters
  22. 22. 4. Counting Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019Andrii Gakhov @gakhov
  23. 23. Counting: Challenge Count the number of unique visitors. Amazon and eBay had about 3.375 billion* visitors in June 2019. Assume 337 million unique IP addresses (128 bits per IPv6 record): that is 5.4 GB of memory just to store them all. *SimilarWeb.Com data for June 2019 What if we could count them with 12 KB only? Image: https://www.cleanpng.com
  24. 24. Counting: Traditional Approach Build a list of all unique elements, sorting / searching to avoid listing elements twice, then count the elements in the list. Drawbacks: requires linear memory; requires O(n·log n) time.
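In Python the same exact-counting idea is usually written with a set, which replaces the sort/search step but keeps the linear-memory cost (the visitor list here is an illustrative stand-in for real traffic data):

```python
# Naive exact distinct counting: every unique element must be stored,
# so memory grows linearly with the cardinality.
visitors = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "9.9.9.9"]

unique_ips = set()
for ip in visitors:
    unique_ips.add(ip)  # the set deduplicates instead of sort/search

print(len(unique_ips))  # 3 unique visitors
```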
  25. 25. Counting: Approximate Counting @katyperry has 107,287,629 followers. Would you really care if she had 107.2, 108.0, or 106.7 million followers?
  26. 26. HyperLogLog A hash-based probabilistic algorithm for counting the number of distinct values in the presence of duplicates. Proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007.
  27. 27. Counting: Estimation with a single counter (Flajolet, Martin) [Diagram: the FM Sketch hashes each element and computes rank( ), the position of the least-significant 1-bit of the hash value (LSB-0 numbering); the observed ranks are recorded in a bitmap of m bits, and from the resulting pattern (here R = 1) the cardinality is estimated as n ≈ 2^R / 0.77351.]
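The rank operation in the diagram above is a one-line bit trick; a sketch of it, with the function name chosen for this example:

```python
def rank(h, bits=32):
    """Position of the least-significant set bit of h (LSB-0 numbering).

    For a uniform hash value, rank(h) == r happens with probability 2^-(r+1),
    which is what the estimate n ≈ 2^R / 0.77351 relies on.
    """
    if h == 0:
        return bits  # no set bit at all: use the hash width as a sentinel
    return (h & -h).bit_length() - 1  # h & -h isolates the lowest set bit

print(rank(0b10100))  # 2: the lowest set bit is at position 2
print(rank(0b00001))  # 0
```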
  28. 28. Counting: Estimation with HyperLogLog [Diagram: an HLL Sketch of k counters; conceptually, each hash function h1, h2, ..., hk computes rank_i( ) of the element's hash value (LSB-0), and counter i is updated iff the new rank is bigger than the existing value; the cardinality is estimated as n ≈ α · k · 2^AVG(HLL_i).]
  29. 29. Counting: HyperLogLog Algorithm Based on a single 32-bit hash function. Simulates k hash functions using the stochastic averaging approach: hash(x) = 32-bit hash value, split into p addressing bits and (32 - p) rank-computation bits. Stores only k = 2^p counters (registers), about 4 bytes each. The memory is always fixed, regardless of the number of unique elements. More counters provide less error (memory/accuracy trade-off).
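The stochastic-averaging split described above can be sketched as a single update step. This is an illustration of the bit layout, not the pdsa implementation; the function name and the use of blake2b as the 32-bit hash are assumptions for this example:

```python
import hashlib

def hll_update(registers, p, x):
    """One HyperLogLog update with stochastic averaging (illustrative sketch).

    The top p bits of a single 32-bit hash address one of k = 2**p registers;
    the remaining (32 - p) bits feed the rank computation.
    """
    h = int.from_bytes(hashlib.blake2b(x.encode(), digest_size=4).digest(), "big")
    idx = h >> (32 - p)                      # addressing bits select a register
    rest = h & ((1 << (32 - p)) - 1)         # rank-computation bits
    rank = (32 - p) - rest.bit_length() + 1  # 1-based position of leftmost 1-bit
    registers[idx] = max(registers[idx], rank)  # keep the maximum rank seen

registers = [0] * (1 << 10)  # p = 10 -> 1024 registers; memory stays fixed
hll_update(registers, 10, "1.2.3.4")
```

Each element touches exactly one register, so the k registers behave like k independent estimators that are averaged at query time, while only one 32-bit hash is ever computed.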
  30. 30. Counting: Invoking HyperLogLog from Python

 import json
 from pdsa.cardinality.hyperloglog import HyperLogLog

 hll = HyperLogLog(precision=10)  # 2^10 = 1024 counters

 with open('visitors.txt') as f:
     for line in f:
         ip = json.loads(line)['ip']
         hll.add(ip)

 num_of_unique_visitors = hll.count()
 print('Unique visitors', num_of_unique_visitors)

 size_in_bytes = hll.sizeof()
 print('Size in bytes', size_in_bytes)  # ~4 KB
  31. 31. Counting: Distinct Count in Redis Redis uses the HyperLogLog data structure to count unique elements in a set. It requires a small constant amount of memory of 12 KB per data structure and approximates the exact cardinality with a standard error of 0.81%.

 redis> PFADD hll python java ruby
 (integer) 1
 redis> PFADD hll python python python
 (integer) 0
 redis> PFADD hll java ruby
 (integer) 0
 redis> PFCOUNT hll
 (integer) 3

 http://antirez.com/news/75
  32. 32. 5. Final Notes Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019Andrii Gakhov @gakhov
  33. 33. Final Notes Think about Big Data as a technology challenge. Instead of buying new servers, learn new algorithms. Believe in hashing! (Sampling vs. hashing.) Probabilistic Data Structures and Algorithms become useful when your problem fits. Image: https://longfordpc.com/
  34. 34. Read More
 [book] Probabilistic Data Structures and Algorithms for Big Data Applications
 https://pdsa.gakhov.com
 [repo] Probabilistic Data Structures and Algorithms in Python
 https://github.com/gakhov/pdsa
 Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
 https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
 Redis new data structure: the HyperLogLog
 http://antirez.com/news/75
 Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
 https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
 Big Data with Sketchy Structures
 https://towardsdatascience.com/b73fb3a33e2a
 Count-Min Sketch
 http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
  35. 35. Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019 Andrii Gakhov @gakhov Website: www.gakhov.com Twitter: @gakhov Probabilistic Data Structures and Algorithms for Big Data Applications pdsa.gakhov.com Eskerrik asko! (Thank you!)
  36. 36. 6. Additional Slides Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019 Andrii Gakhov, @gakhov (for that person who wants more)
  37. 37. Counting: Interactive Presentation of HyperLogLog
  38. 38. Counting: Accuracy vs Memory Trade-off in HyperLogLog More counters require more memory (4 bytes per counter). More counters need more bits for addressing them (k = 2^p).
  39. 39. Counting: HyperLogLog++ Algorithm HyperLogLog++ is an improved version of HyperLogLog, developed at Google and proposed in 2013. It uses a 64-bit hash function, which allows counting more values; offers better bias correction using pre-trained data; and proposes a sparse representation of the counters (registers) to reduce memory requirements.
