
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures Jan 28 2016 @ Galvanize


Published in: Software


  1. Power of data. Simplicity of design. Speed of innovation. IBM Spark, spark.tc
     and Approximations: Samples, Hashes, Approximates, and Probabilistic Data Structures
     Advanced Apache Spark Meetup, Jan 28th, 2016. Thanks, Galvanize!!
     Chris Fregly, Principal Data Solutions Engineer
     We’re Hiring! Only *Nice* People!! advancedspark.com
  2. Who Am I?
     Streaming Data Engineer: Netflix OSS Committer
     Data Solutions Engineer: Apache Contributor
     Principal Data Solutions Engineer: IBM Technology Center
     Meetup Organizer: Advanced Apache Meetup
     Book Author: Advanced . Due 2016
  3. Advanced Apache Spark Meetup (http://advancedspark.com)
     Meetup Metrics
       Top 5 Most-active Spark Meetup!
       2300+ Members in just 6 mos!!
       2500+ Docker image downloads
     Meetup Mission
       Code dive deep into Spark and related open source code bases
       Study integrations with Cassandra, ElasticSearch, Kafka, NiFi
       Surface and share patterns and idioms of well-designed,
  4. Presentation Outline
     Scaling with Parallelism and Composability
     When to Approximate
     Common Algorithms and Data Structures
     Common Libraries and Tools
     Demos!
  5. Scaling with Parallelism
     O(log n)
     Peter
  6. Scaling with Composability
     Max: (a max b max c max d) == (a max b) max (c max d)
     Set Union: (a U b U c U d) == (a U b) U (c U d)
     Addition: (a + b + c + d) == (a + b) + (c + d)
     Multiply: (a * b * c * d) == (a * b) * (c * d)
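The regrouping property on this slide can be checked with a small sketch. The helper name `chunked_reduce` and the sample values are made up for illustration; the idea is only that an associative operation gives the same answer whether it is applied left to right or applied to independent chunks whose partial results are then combined.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, chunk_size):
    """Reduce each chunk independently, then reduce the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Each chunk could be reduced on a different worker.
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


values = [3, 9, 4, 7, 1, 8, 2, 6]  # illustrative data

for name, op in [("max", max), ("add", operator.add), ("mul", operator.mul)]:
    sequential = reduce(op, values)
    regrouped = chunked_reduce(op, values, chunk_size=2)
    print(name, sequential, regrouped, sequential == regrouped)

sets = [{1, 2}, {2, 3}, {4}, {1, 5}]
union_sequential = reduce(operator.or_, sets)
union_regrouped = chunked_reduce(operator.or_, sets, chunk_size=2)
print("union", union_sequential == union_regrouped)
```

Because every comparison comes out equal, the chunks can be processed in any grouping, which is what lets a framework split the work across partitions.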
  7. What about Division?
     Division: (a / b / c / d) != (a / b) / (c / d)
     (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
     (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
     0.0134 != 0.857
     Not Composable. What were they thinking?!
     “Divide like an Egyptian”
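The division example can be verified with exact arithmetic. This is a minimal sketch using Python's `fractions` module so that no floating-point rounding is involved; the variable names are illustrative.

```python
from fractions import Fraction
from functools import reduce
import operator

a, b, c, d = (Fraction(n) for n in (3, 4, 7, 8))

# Left-to-right evaluation: ((3 / 4) / 7) / 8
left_to_right = reduce(operator.truediv, [a, b, c, d])

# Regrouped evaluation: (3 / 4) / (7 / 8)
regrouped = (a / b) / (c / d)

print(left_to_right, float(left_to_right))
print(regrouped, float(regrouped))
print("same result:", left_to_right == regrouped)
```

The two groupings produce different fractions, so partial results of a division cannot be combined the way partial sums or partial maxima can.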
  8. What about Average?
     Average over [value, count] pairs: a[3, 1], b[5, 1], b[5, 1], c[7, 1]
       ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5
     Single Divide: Composable! AVG(3, 5, 5, 7) == 5
     Pairwise Average
       (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
     Divide, Add, Divide: Not Composable
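One way to read this slide is that an average becomes composable once it is carried as a (sum, count) pair and divided only once at the end. The sketch below assumes that reading; the function names `to_pair`, `merge`, and `average` are hypothetical, and the uneven partition at the end is an added illustration of why averaging partial averages is unsafe.

```python
def to_pair(value):
    """Lift a single value into a (sum, count) pair."""
    return (value, 1)


def merge(p, q):
    """Combine two (sum, count) pairs; this operation is associative."""
    return (p[0] + q[0], p[1] + q[1])


def average(pair):
    """The single divide, done once at the very end."""
    total, count = pair
    return total / count


values = [3, 5, 5, 7]
pairs = [to_pair(v) for v in values]

left = merge(pairs[0], pairs[1])    # (8, 2)
right = merge(pairs[2], pairs[3])   # (12, 2)
combined = merge(left, right)       # (20, 4)
avg = average(combined)
print(combined, avg)

# Averaging per-partition averages breaks when partitions differ in size.
part_a, part_b = [3], [5, 5, 7]
avg_of_avgs = (sum(part_a) / len(part_a) + sum(part_b) / len(part_b)) / 2
print(avg_of_avgs, avg_of_avgs == avg)
```

Keeping the count alongside the sum is what preserves enough information to merge partial results in any grouping.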
  9. Presentation Outline
     Scaling with Parallelism and Composability
     When to Approximate
     Common Algorithms and Data Structures
     Common Libraries and Tools
     Demos!
  10. When to Approximate?
      Memory or time constrained queries
      Relative vs. exact counts are OK (# errors between then and now)
      Using machine learning or graph algos
        Inherently probabilistic and approximate
        Finding topics in documents (LDA)
        Finding similar pairs of users, items, words at scale (LSH)
        Finding top influencers (PageRank)
      Streaming aggregations (distinct count or top k)
        Inherently sloppy means of collecting (at least once delivery)
      Approximate as much as you can get away with! Ask for forgiveness later!!
  11. When NOT to Approximate?
      If you’ve ever heard the term… “Sarbanes-Oxley” …in-that-order, at the office, after 2002.
  12. Presentation Outline
      Scaling with Parallelism and Composability
      When to Approximate
      Common Algorithms and Data Structures
      Common Libraries and Tools
      Demos!
  13. A Few Good Algorithms
      You can’t handle the approximate!
  14. Common to These Algos & Data Structs
      Low, fixed size in memory
      Known error bounds
      Store large amount of data
      Less memory than Java/Scala collections
      Tunable tradeoff between size and error
      Rely on multiple hash functions or operations
      Size of hash range defines error
  15. Bloom Filter
      Set.contains(key): Boolean
      “Hash Multiple Times and Flip the Bits Wherever You Land”
  16. Bloom Filter
      Approximate set membership for key
      False positive: expect contains(), actual !contains()
      True negative: expect !contains(), actual !contains()
      Elements only added, never removed
  17. Bloom Filter in Action (images by @avibryant)
      set(key)
      contains(key): Boolean
      TRUE -> maybe contains
      FALSE -> definitely does not contain
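The "hash multiple times and flip the bits" idea from the Bloom filter slides can be sketched in a few lines. This is an illustrative toy, not the implementation used in the talk or in Algebird: the class, the method names `add` and `might_contain`, the default sizes, and the use of seeded SHA-256 as the hash family are all assumptions.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: fixed-size bit array plus several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by prefixing the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True means "maybe"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(key))


movies = BloomFilter()
for title in ["Top Gun", "A Few Good Men", "Taps"]:
    movies.add(title)

print(movies.might_contain("Top Gun"))   # added keys always report True
print(BloomFilter().might_contain("Top Gun"))  # empty filter always reports False
```

Bits are only ever set, never cleared, which is why added keys can never be reported as absent and why elements cannot be removed.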
  18. CountMin Sketch
      Frequency Count and TopK
      “Hash Multiple Times and Add 1 Wherever You Land”
  19. CountMin Sketch (CMS)
      Approximate frequency count and TopK for key
      i.e. “Heavy Hitters” on Twitter: Johnny Hallyday, Martin Odersky, Donald Trump
  20. CountMin Sketch In Action (images derived from @avibryant)
      Use Case: TopK movies using total views
      add(Top Gun, 2) (2 occurrences of “Top Gun” for slightly additional complexity)
      add(A Few Good Men, 1)
      add(Taps, 1)
      getCount(Top Gun): Long
      Multiple hash functions (1 hash function per row)
      Binary hash output (1 element per column)
      Find minimum of all rows
      Can overestimate, but never underestimate
      [Figure: sketch grid with cells for Top Gun (x 2), A Few Good Men, and Taps, including cells marked Overlap]
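The add and getCount steps on this slide can be sketched as follows. This is a toy under stated assumptions: the class name, the `depth` and `width` defaults, and the seeded SHA-256 row hashes are illustrative choices, not the talk's or Algebird's implementation.

```python
import hashlib


class CountMinSketch:
    """Toy CountMin Sketch: `depth` rows, one hash function per row."""

    def __init__(self, depth=4, width=64):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        # A different hash per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only ever inflate a cell, so the minimum across rows
        # is the tightest estimate and can never be below the true count.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)

for title in ["Top Gun", "A Few Good Men", "Taps"]:
    print(title, views.get_count(title))
```

Widening the table reduces collisions and therefore overestimation, which is the size-versus-error tradeoff the deck describes.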
  21. HyperLogLog
      Count Distinct
      “Hash Multiple Times and Uniformly Distribute Where You Land”
  22. HyperLogLog (HLL)
      Approximate count distinct
      Slight twist: special hash function creates uniform distribution
      Error estimate
        14 bits for size of range
        m = 2^14 = 16,384 slots
        error = 1.04 / sqrt(16,384) = 0.81%
      Not many of these
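The error estimate above follows the usual HyperLogLog rule of thumb, 1.04 / sqrt(m). A short sketch (the function name is made up) shows how the error shrinks as more bits are spent on the range:

```python
import math


def hll_standard_error(bits):
    """Approximate relative standard error for m = 2^bits slots."""
    m = 2 ** bits
    return 1.04 / math.sqrt(m)


for bits in (10, 12, 14, 16):
    print(f"bits={bits:2d}  m={2 ** bits:6d}  error={hll_standard_error(bits):.2%}")
```

Each extra two bits quadruples the number of slots and halves the error, so memory grows much faster than accuracy improves.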
  23. HyperLogLog In Action
      Use Case: Distinct number of views per movie
      Uniform Distribution: Estimate distinct # of users in smaller space
      Composable! (a bit of precision loss)
      [Figure: user IDs for Top Gun: Hour 1 and Top Gun: Hour 2 hashed onto ranges labeled 0 to 16 and 0 to 32]
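The composability of per-hour sketches can be illustrated with a simplified HyperLogLog. This is a teaching sketch under assumptions, not the talk's code: the class, the 12-bit default (the slide's example uses 14), the SHA-256-based 64-bit hash, and the synthetic user IDs are all made up, and only the standard small-range correction is applied.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog; assumes bits >= 7 so the alpha formula applies."""

    def __init__(self, bits=12):
        self.bits = bits
        self.m = 1 << bits
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        index = x >> (64 - self.bits)                # leading bits pick the register
        rest = x & ((1 << (64 - self.bits)) - 1)     # remaining bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.bits) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Union of two sketches: element-wise max of the registers.
        merged = HyperLogLog(self.bits)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Synthetic viewers: hours overlap on users 4000..5999, 10,000 distinct overall.
hour1, hour2, direct = HyperLogLog(), HyperLogLog(), HyperLogLog()
for user in range(0, 6000):
    hour1.add(f"user{user}")
    direct.add(f"user{user}")
for user in range(4000, 10000):
    hour2.add(f"user{user}")
    direct.add(f"user{user}")

merged = hour1.merge(hour2)
print(round(hour1.count()), round(hour2.count()), round(merged.count()))
```

In this sketch the merged registers are identical to those of a sketch built over all events directly, so hourly sketches can be stored separately and unioned later without rescanning the raw data.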
  24. Locality Sensitive Hashing
      Set Similarity
      “Pre-process Items into Buckets, Compare Within Buckets”
  25. Locality Sensitive Hashing (LSH)
      Approximate set similarity
      Hash designed to cluster similar items
      Avoids cartesian all-pairs comparison
      Pre-process m rows into b buckets
        b << m
        Hash items multiple times
        Similar items hash to overlapping buckets
      Compare just contents of buckets
        Much smaller cartesian … and parallel !!
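The bucket-then-compare idea can be sketched with MinHash signatures split into bands, which is one common LSH scheme for set similarity; the deck does not say which scheme its demo uses, so treat the choice as an assumption. The constants, helper names, and sample viewers below are made up for illustration, and input sets are assumed to be non-empty.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 20
BANDS = 10                      # 10 bands of 2 rows each
ROWS = NUM_HASHES // BANDS
PRIME = 2_147_483_647           # 2^31 - 1

# Fixed seed so the hash family is the same on every run.
_rng = random.Random(42)
COEFFS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]


def minhash_signature(items):
    """One minimum per hash function h(x) = (a * x + b) mod PRIME."""
    hashed = [zlib.crc32(str(item).encode("utf-8")) for item in items]
    return tuple(min((a * x + b) % PRIME for x in hashed) for a, b in COEFFS)


def candidate_pairs(named_sets):
    """Bucket every band of every signature; pair up names sharing a bucket."""
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        sig = minhash_signature(items)
        for band in range(BANDS):
            buckets[(band, sig[band * ROWS:(band + 1) * ROWS])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


# Hypothetical viewers and the movies they watched.
viewers = {
    "alice": {"Top Gun", "Taps", "A Few Good Men"},
    "bob": {"Top Gun", "Taps", "A Few Good Men"},
    "carol": {"Movie X", "Movie Y", "Movie Z"},
}

candidates = candidate_pairs(viewers)
for left, right in sorted(candidates):
    print(left, right, jaccard(viewers[left], viewers[right]))
```

Only the candidate pairs are compared exactly, so the expensive similarity computation runs on a small subset of the full cartesian product, and each bucket can be processed independently.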
  26. Presentation Outline
      Scaling with Parallelism and Composability
      When to Approximate
      Common Algorithms and Data Structures
      Common Libraries and Tools
      Demos!
  27. Common Tools to Approximate
      Twitter Algebird: Composable Library
      Redis: Distributed Cache
      Apache Spark: Big Data Processing
  28. Twitter Algebird
      Rooted in Algebraic Fundamentals!
        Parallel
        Associative
        Composable
      Examples
        Min, Max, Avg
        BloomFilter (Set.contains(key))
        HyperLogLog (Count Distinct)
        CountMin Sketch (TopK Count)
  29. Redis
      Implementation of HyperLogLog (Count Distinct)
        12KB per item count
        2^64 max # of items
        0.81% error (Tunable)
      Add user views for given movie
        PFADD TopGun_HLL user1001 user2009 user3005
        PFADD TopGun_HLL user3003 user1001    (ignore duplicates)
      Get distinct count (cardinality) of set
        PFCOUNT TopGun_HLL
        Returns: 4 (distinct users viewed this movie)
      Union 2 HyperLogLog Data Structures
        PFMERGE TopGun_HLL Taps_HLL
  30. Spark Approximations
      Spark Core
        RDD.count*Approx()
        PartialResult
      Spark SQL
        HyperLogLogPlus
        approxCountDistinct(column)
      Spark ML
        Stratified sampling
          PairRDD.sampleByKey(withReplacement, fractions: Map[K, Double])
        DIMSUM sampling
          Probabilistic sampling reduces amount of comparison shuffle
          RowMatrix.columnSimilarities(threshold)
      Spark Streaming
        A/B testing
          StreamingTest.setTestMethod(“welch”).registerStream(dstream)
  31. Presentation Outline
      Scaling with Parallelism and Composability
      When to Approximate
      Common Algorithms and Data Structures
      Common Libraries and Tools
      Demos!
  32. Demo! Twitter Algebird
      Fixed Memory, Large Number of Counts
  33. HashSet vs. HyperLogLog
  34. HashSet vs. CountMin Sketch
  35. Demo! Brute Force vs. Locality Sensitive Hashing
      Similar Items
  36. Brute Force Cartesian All Pair Similarity
      90 mins!
  37. All Pairs & Locality Sensitive Hashing
      << 90 mins!
  38. Thank You!!
      Chris Fregly, IBM Spark Tech Center, http://spark.tc
      San Francisco, California, USA
      http://advancedspark.com
        Sign up for the Meetup and Book
        Contribute to Github Repo
        Run all Demos using Docker
      Find me: LinkedIn, Twitter, Github, Email, Fax
  39. advancedspark.com @cfregly
