- 1. Sketching Big Data with Spark Reynold Xin @rxin Sep 29, 2015 @ Strata NY
- 2. About Databricks Founded by creators of Spark in 2013 Cloud service for end-to-end data processing • Interactive notebooks, dashboards, and production jobs We are hiring!
- 3. Spark
- 4. Count-min sketch
- 5. Approximate frequent items
- 6. Taylor Swift
- 7. “Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune
- 8. Who is this guy? Co-founder & architect for Spark at Databricks Former PhD student at UC Berkeley AMPLab A “systems” guy, which means I won’t be showing equations and this talk might be the easiest to consume in HDS
- 9. This talk 1. Develop intuitions on these sketches so you know when to use them 2. Understand how certain parts of distributed data processing systems (e.g. Spark) work
- 10. Sketch: Reynold’s not-so-scientific definition 1. Use a small amount of space to summarize a large dataset. 2. Go over each data point once, a.k.a. “streaming algorithm”, or “online algorithm” 3. Parallelizable, with only a small amount of communication
- 11. What for? Exploratory analysis Feature engineering Combine sketch and exact to speed up processing
- 12. Sketches in Spark Set membership (Bloom filter) Cardinality (HyperLogLog) Histogram (count-min sketch) Frequent pattern mining Frequent items Stratified Sampling …
- 13. This Talk Set membership (Bloom filter) Cardinality (HyperLogLog) Histogram (count-min sketch) Frequent pattern mining Frequent items Stratified Sampling …
- 14. Set membership
- 15. Set membership Identify whether an item is in a set e.g. “You have bought this item before”
- 16. Exact set membership Track every member of the set • Space: size of data • One pass: yes • Parallelizable & communication: size of data
- 17. Approximate set membership Take 1. Use a 32-bit integer hash map to track • ~4 bytes per record • Max 4 billion items Take 2. Hash items to 256 buckets • Memory usage only 256 bits • Good if num records is small • Bad if num records is large (256+ items, collision rate 100%!)
- 18. Bloom filter Bloom filter algorithm • k hash functions • hash item into k separate positions • if any of the k positions is not set, then item is not in set Properties • ~500MB needed to have 10% error rate on 1 billion items • See http://hur.st/bloomfilter?n=1000000000&p=0.1 • False positives possible
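
To make the k-hash-function description concrete, here is a minimal Bloom filter sketch in Python. It is an illustration, not Spark's implementation: the class name, the salted SHA-256 hashing scheme, and the `optimal_size` helper are assumptions chosen for readability. The sizing formulas are the standard ones for Bloom filters.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k positions by salting one hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # If any of the k positions is unset, the item was never added.
        # If all are set, the item is probably present (false positives possible).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def optimal_size(num_items, false_positive_rate):
    """Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes."""
    num_bits = math.ceil(
        -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes
```

With `optimal_size(10**9, 0.1)` the formula asks for roughly 4.8 bits per item and 3 hash functions, which is the same order of magnitude as the memory figure quoted on the slide.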
- 19. Use case beyond exploration SELECT * FROM A JOIN B ON A.key = B.key 1. Assume A and B are both large, i.e. “shuffle join” 2. Some rows in A might not have matched rows in B 3. Wouldn’t it be nice if we only need to shuffle rows that match? Answer: use a bloom filter to filter the ones that don’t match
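
The join idea can be sketched on plain Python lists: build a filter over B's keys, drop rows of A that the filter rules out, and join only what is left. This is a single-machine illustration of the idea, not how Spark plans joins; `NUM_BITS`, `NUM_HASHES`, and all function names are assumptions made up for the example.

```python
import hashlib

NUM_BITS = 1 << 16   # assumed filter size, for illustration only
NUM_HASHES = 3       # assumed number of hash functions


def positions(key):
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_filter(keys):
    bits = 0  # a Python int used as a bit set
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    return all((bits >> pos) & 1 for pos in positions(key))


def prefiltered_join(rows_a, rows_b):
    """rows_a, rows_b: lists of (key, value) pairs.

    Returns (joined rows, rows of A that survived the filter).
    """
    b_filter = build_filter(key for key, _ in rows_b)
    # Only these rows of A would need to be shuffled.
    a_kept = [(k, v) for k, v in rows_a if might_contain(b_filter, k)]
    b_index = {}
    for k, v in rows_b:
        b_index.setdefault(k, []).append(v)
    # False positives in a_kept find no partner here, so the result stays exact.
    joined = [(k, va, vb) for k, va in a_kept for vb in b_index.get(k, [])]
    return joined, a_kept
```

Because a Bloom filter has no false negatives, no matching row of A is ever dropped; false positives only cost some extra shuffled rows.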
- 20. Frequent items
- 21. Frequent Items Find items more frequent than 1/k
- 22. Source: http://www.macfreek.nl/memory/Letter_Distribution
- 23. [Bar chart] Twitter Followers of NBA teams (in 1,000s), September 2015. The values range from 4,474 for the most-followed team down to 366. Source: http://www.statista.com/statistics/240386/twitter-followers-of-national-basketball-association-teams/
- 24. Frequent Items Exploration • Identify important members in a network • E.g. “the”, LA Lakers, Taylor Swift Feature Engineering • Identify outliers • Ignore low frequency items
- 25. Frequent Items: Exact Algorithm SELECT item, count(*) cnt FROM corpus GROUP BY item HAVING cnt > (SELECT count(*) FROM corpus) / k • Space: linear to |item| • One pass: no (two passes) • Parallelizable & communication: linear to |item|
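
For comparison with the sketch that follows in the talk, the exact approach can be written in a few lines of Python. The function name is an assumption; the logic keeps one counter per distinct item and reports the items whose count exceeds a 1/k share of the total.

```python
from collections import Counter


def exact_frequent_items(items, k):
    """Return {item: count} for items occurring in more than 1/k of the input."""
    items = list(items)
    counts = Counter(items)          # memory grows with the number of distinct items
    threshold = len(items) / k       # needs the total count as well
    return {item: c for item, c in counts.items() if c > threshold}
```

For example, `exact_frequent_items("aabac", 2)` keeps only `"a"`, which appears 3 times out of 5.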
- 26. Example 1: Find Items Frequency > ½ (k=2)
- 27. draw Put back if any pair of balls are the same color
- 28. draw Remove if balls are all different colors
- 29. Example 1: Find Items Frequency > 1/2 Blue ball left (frequent item)
- 30. Example 2: Find Items Frequency > ½ (k=2)
- 31. draw
- 32. draw
- 33. draw
- 34. 1 ball left (frequent item)
- 35. How do we implement this? Maintain a hash table of counts
- 36. Increment for every ball we see 0 => 1
- 37. Increment for every ball we see 1 => 2
- 38. Increment for every ball we see 0 => 4
- 39. Increment for every ball we see 0 => 4
- 40. Increment for every ball we see 4 0 => 1
- 41. When the hash table has k items, remove 1 from each item and remove the item if count = 0 4 => 3 1 => 0
- 42. 3
- 43. 3 0 => 1
- 44. 2
- 45. 2 0 => 1
- 46. 1
- 47. Implementation Maintains a hash table of counts • For each item, increment its count • If hash table size == k: – decrement 1 from each item; and – remove items whose count == 0 Parallelization: merge hash tables of max size k
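
The steps on this slide translate almost line for line into Python. The sketch below follows the slide for the per-item update. The slide does not spell out the merge rule, so `merge_sketches` is an assumption: it adds the counts and then subtracts the k-th largest count, which is one known way to keep the merged table below k entries.

```python
def frequent_items_sketch(items, k):
    """One pass over items, keeping fewer than k counters at any time."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            # Decrement every counter by 1 and drop the ones that reach 0.
            counts = {key: c - 1 for key, c in counts.items() if c > 1}
    return counts


def merge_sketches(left, right, k):
    """Assumed merge rule: add counts, then shrink back below k entries."""
    merged = dict(left)
    for key, c in right.items():
        merged[key] = merged.get(key, 0) + c
    if len(merged) >= k:
        # Subtract the k-th largest count so at most k - 1 entries stay positive.
        cutoff = sorted(merged.values(), reverse=True)[k - 1]
        merged = {key: c - cutoff for key, c in merged.items() if c > cutoff}
    return merged
```

With k = 2 this reproduces the ball example: `frequent_items_sketch(list("bbrbgbb"), 2)` ends with only `"b"` in the table. The surviving counts are lower bounds, not exact frequencies.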
- 48. Comparing Exact vs Approximate • Naïve Exact: 2 passes, memory |item|, communication |item| • Sketch: 1 pass, memory k, communication k
- 49. Comparing Exact vs Approximate • Naïve Exact: 2 passes, memory |item|, communication |item| • Sketch: 1 pass, memory k, communication k • Smart Exact: 2 passes (1st pass using sketch), memory k, communication k
- 50. Quiz: an example with false positive? K = 3
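
One possible answer to the quiz, as a small Python sketch. The stream below is an assumed example: with K = 3, four distinct items leave `"d"` in the table even though it occurs only once out of four times (1/4 is not more than 1/3). A second pass that counts only the candidates removes the false positive, which is the "smart exact" idea from the comparison slide.

```python
from collections import Counter


def frequent_items_sketch(items, k):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            counts = {key: c - 1 for key, c in counts.items() if c > 1}
    return counts


stream = ["a", "b", "c", "d"]   # assumed example input
k = 3

# Pass 1: the sketch yields candidates, possibly with false positives.
candidates = frequent_items_sketch(stream, k)

# Pass 2: exact counts, but only for the few candidates.
exact = Counter(item for item in stream if item in candidates)
confirmed = {item for item in candidates if exact[item] > len(stream) / k}
```

Here `candidates` contains `"d"`, while `confirmed` is empty.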
- 51. How to use it in Spark? Frequent items for multiple columns independently • df.stat.freqItems([“columnA”, “columnB”, …]) Frequent items for composite keys • df.stat.freqItems(struct(“columnA”, “columnB”))
- 52. Stratified sampling
- 53. Bernoulli sampling & Variance Sample US population (300m) using rate 0.000002 (~600) • Wyoming (0.5m) should have 1 • Bernoulli sampling likely leads to Wyoming having 0 Intuition: uniform sampling leads to ~ 600 samples. • i.e. it might be 600, or 601, or 599, or … • Impact on WY when going from 600 to 601 is much larger than that on CA’s
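
The numbers on this slide can be checked with a few lines of arithmetic. The variable names are assumptions; the populations and the rate are the ones from the slide.

```python
import math

rate = 0.000002
us_population = 300_000_000
wy_population = 500_000

expected_total = us_population * rate   # about 600 samples overall
expected_wy = wy_population * rate      # about 1 sample from Wyoming

# Under Bernoulli sampling each person is kept independently with
# probability `rate`, so Wyoming ends up with zero samples with probability:
p_wy_zero = (1 - rate) ** wy_population  # close to e^-1

# Standard deviation of the total sample size (binomial).
std_total = math.sqrt(us_population * rate * (1 - rate))
```

`p_wy_zero` comes out near 0.37, so roughly one run in three would leave Wyoming unrepresented, and the total sample size fluctuates by about 24 around 600.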
- 54. Stratified sampling Existing “exact” algorithms • Draw-by-draw • Selection-rejection • Reservoir • Random sort Either sequential or expensive (full global sort)
- 55. Random sort Example: sampling probability p = 0.1 on 100 items. 1. Generate random keys • (0.644, t1), (0.378, t2), … (0.500, t99), (0.471, t100) 2. Sort and select the smallest 10 items • (0.028, t94), (0.029, t44), …, (0.137, t69), …, (0.980, t26), (0.988, t60)
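
A minimal Python sketch of the random-sort approach, assuming the whole dataset fits in one list. The function name and the `seed` parameter are assumptions for the example.

```python
import random


def random_sort_sample(items, p, seed=None):
    """Exact-size sample: attach random keys, sort, keep the smallest p * n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), item) for item in items]   # step 1: random keys
    keyed.sort(key=lambda pair: pair[0])               # step 2: full sort
    sample_size = round(p * len(keyed))
    return [item for _, item in keyed[:sample_size]]
```

With 100 items and p = 0.1 this always returns exactly 10 items, but the full sort is the expensive part that the next slides try to avoid.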
- 56. Heuristics Qualitatively speaking • If u is “much larger” than p, then t is “unlikely” to be selected • If u is “much smaller” than p, then it is “likely” to be selected Set two thresholds q1 and q2, such that: • If u < q1, accept t directly • If u > q2, reject t directly • Otherwise, put t in a buffer to be sorted
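
The two-threshold heuristic can be sketched as follows. The way q1 and q2 are chosen here (p plus or minus a multiple of the binomial standard deviation, controlled by a hypothetical `width` parameter) is an assumption for illustration and not the formula from the cited paper. The input is assumed to be non-empty.

```python
import math
import random


def threshold_sample(items, p, seed=None, width=5.0):
    """Exact-size sample that sorts only the items with u between q1 and q2.

    Returns (sample, number of buffered items).
    """
    items = list(items)
    n = len(items)
    target = round(p * n)
    delta = width * math.sqrt(p * (1 - p) / n)   # assumed margin around p
    q1, q2 = max(0.0, p - delta), min(1.0, p + delta)

    rng = random.Random(seed)
    accepted, buffered = [], []
    for item in items:
        u = rng.random()
        if u < q1:
            accepted.append(item)          # accept directly
        elif u <= q2:
            buffered.append((u, item))     # undecided: sort later
        # u > q2: reject directly

    missing = target - len(accepted)
    if missing < 0 or missing > len(buffered):
        # Rare with wide thresholds; a real implementation needs a fallback.
        raise RuntimeError("thresholds too tight for this draw; widen and retry")

    buffered.sort(key=lambda pair: pair[0])   # sort only the small buffer
    return accepted + [item for _, item in buffered[:missing]], len(buffered)
```

When the thresholds hold, the result is the same set of items a full random sort would select, but only the buffer (a small fraction of the data) is sorted.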
- 57. Spark’s stratified sampling algorithm Combines “exact” and “sketch” to achieve parallelization & low memory overhead df.stat.sampleByKeyExact(col, fractions, seed) Xiangrui Meng. Scalable Simple Random Sampling and Stratified Sampling. ICML 2013
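
To show what an exact per-key sample size means, here is a single-machine Python sketch that applies a random sort inside each stratum. It is not Spark's algorithm: the function name, the `key_fn` parameter, and the choice of `ceil(fraction * count)` as the per-stratum size are assumptions for the example.

```python
import math
import random
from collections import defaultdict


def stratified_sample_exact(rows, key_fn, fractions, seed=None):
    """Sample each stratum separately so small strata are not lost."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for row in rows:
        by_key[key_fn(row)].append((rng.random(), row))

    sample = []
    for key, keyed in by_key.items():
        keyed.sort(key=lambda pair: pair[0])
        # Assumed size rule: round up, so a tiny stratum still gets a sample.
        size = math.ceil(fractions.get(key, 0.0) * len(keyed))
        sample.extend(row for _, row in keyed[:size])
    return sample
```

With 1,000 rows keyed `"CA"`, 10 rows keyed `"WY"`, and a fraction of 0.1 for both, the sample always contains 100 `"CA"` rows and 1 `"WY"` row, which is the guarantee Bernoulli sampling cannot give.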
- 58. This Talk Set membership (Bloom filter) Cardinality (HyperLogLog) Histogram (count-min sketch) Frequent pattern mining Frequent items Stratified Sampling …
- 59. Conclusion Sketches can be useful in exploration, feature engineering, as well as building faster exact algorithms. We are building a lot of these into Spark so you don’t need to reinvent the wheel!
- 60. Thank you. Meetup tonight @ Civic Hall, 6:30pm 156 5th Avenue, 2nd floor, New York, NY