Talk given by Reynold Xin (@rxin) at Strata New York 2015


- 1. Sketching Big Data with Spark Reynold Xin @rxin Sep 29, 2015 @ Strata NY
- 2. About Databricks Founded by creators of Spark in 2013 Cloud service for end-to-end data processing • Interactive notebooks, dashboards, and production jobs We are hiring!
- 3. Spark
- 4. Count-min sketch
- 5. Approximate frequent items
- 6. Taylor Swift
- 7. “Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune
- 8. Who is this guy? Co-founder & architect for Spark at Databricks Former PhD student at UC Berkeley AMPLab A “systems” guy, which means I won’t be showing equations and this talk might be the easiest to consume in HDS
- 9. This talk 1. Develop intuitions on these sketches so you know when to use them 2. Understand how certain parts of distributed data processing (e.g. Spark) work
- 10. Sketch: Reynold’s not-so-scientific definition 1. Use a small amount of space to summarize a large dataset. 2. Go over each data point once, a.k.a. “streaming algorithm”, or “online algorithm” 3. Parallelizable, with only a small amount of communication
- 11. What for? Exploratory analysis Feature engineering Combine sketch and exact to speed up processing
- 12. Sketches in Spark Set membership (Bloom filter) Cardinality (HyperLogLog) Histogram (count-min sketch) Frequent pattern mining Frequent items Stratified Sampling …
- 13. This Talk Set membership (Bloom filter) Cardinality (HyperLogLog) Histogram (count-min sketch) Frequent pattern mining Frequent items Stratified Sampling …
- 14. Set membership
- 15. Set membership Identify whether an item is in a set e.g. “You have bought this item before”
- 16. Exact set membership Track every member of the set • Space: size of data • One pass: yes • Parallelizable & communication: size of data
- 17. Approximate set membership Take 1. Use a 32-bit integer hash map to track • ~4 bytes per record • Max 4 billion items Take 2. Hash items to 256 buckets • Memory usage only 256 bits • Good if num records is small • Bad if num records is large (256+ items, collision rate 100%!)
- 18. Bloom filter Bloom filter algorithm • k hash functions • hash item into k separate positions • if any of the k positions is not set, then item is not in set Properties • ~600MB needed to have 10% error rate on 1 billion items • See http://hur.st/bloomfilter?n=1000000000&p=0.1 • False positives possible, false negatives are not
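
The algorithm on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Spark's implementation: the class name `BloomFilter`, the helper `bloom_parameters`, and the use of SHA-256 with double hashing to derive the k positions are all assumptions made for the example. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, which for 1 billion items at a 10% error rate give about 4.8 billion bits, the same order of magnitude as the figure on the slide.

```python
import hashlib
import math


def bloom_parameters(n_items, error_rate):
    """Return (bits, hash_count) from the standard Bloom filter sizing formulas."""
    bits = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n_items * math.log(2)))
    return bits, hash_count


class BloomFilter:
    """Hypothetical minimal Bloom filter; names and hashing scheme are illustrative."""

    def __init__(self, n_items, error_rate):
        self.bits, self.hash_count = bloom_parameters(n_items, error_rate)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit slices of one SHA-256 digest
        # (double hashing). This is an assumption, not what Spark does.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Any unset position proves the item was never added;
        # all positions set means "probably present" (false positives possible).
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Sizing for the slide's scenario: 1 billion items, 10% error rate.
bits, k = bloom_parameters(1_000_000_000, 0.1)
print(f"{bits / 8 / 1e6:.0f} MB with {k} hash functions")
```

Only the sizing is computed for the billion-item case; actually allocating that filter would need the full bit array in memory.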
- 19. Use case beyond exploration SELECT * FROM A JOIN B ON A.key = B.key 1. Assume A and B are both large, i.e. “shuffle join” 2. Some rows in A might not have matched rows in B 3. Wouldn’t it be nice if we only need to shuffle rows that match? Answer: use a bloom filter to filter the ones that don’t match
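
The join idea can be illustrated without a cluster. The sketch below is plain Python over in-memory lists and only mimics the data flow: `TinyBloom` and `bloom_filtered_join` are hypothetical names, the filter size and hash scheme are arbitrary choices, and the "shuffle" is represented by the list of surviving rows. It is not how Spark plans joins.

```python
import hashlib


class TinyBloom:
    """Fixed-size Bloom filter backed by one Python int used as a bitset."""

    def __init__(self, bits=1 << 16, hash_count=3):
        self.bits, self.hash_count, self.mask = bits, hash_count, 0

    def _positions(self, key):
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hash_count)]

    def add(self, key):
        for pos in self._positions(key):
            self.mask |= 1 << pos

    def might_contain(self, key):
        return all((self.mask >> pos) & 1 for pos in self._positions(key))


def bloom_filtered_join(rows_a, rows_b):
    """Join (key, value) rows, dropping rows of A that cannot match B."""
    # Step 1: summarize B's keys in a small filter that is cheap to ship around.
    bloom = TinyBloom()
    for key, _ in rows_b:
        bloom.add(key)

    # Step 2: only rows of A that pass the filter would need to be shuffled.
    survivors = [(key, val) for key, val in rows_a if bloom.might_contain(key)]

    # Step 3: the exact join discards any false positives that slipped through.
    index_b = {}
    for key, val in rows_b:
        index_b.setdefault(key, []).append(val)
    joined = [(key, a_val, b_val)
              for key, a_val in survivors
              for b_val in index_b.get(key, [])]
    return survivors, joined


rows_a = [(i, f"a{i}") for i in range(1000)]
rows_b = [(i, f"b{i}") for i in range(0, 1000, 100)]
survivors, joined = bloom_filtered_join(rows_a, rows_b)
print(len(rows_a), "rows in A,", len(survivors), "pass the filter,", len(joined), "joined")
```

Because the filter never produces false negatives, the join result is still exact; false positives only cost a few extra shuffled rows.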
- 20. Frequent items
- 21. Frequent Items Find items more frequent than 1/k
- 22. Source: http://www.macfreek.nl/memory/Letter_Distribution
- 23. [Bar chart] Twitter Followers of NBA teams (in 1,000s), September 2015. Y-axis: Twitter followers in thousands; the 30 bars range from 4,474 down to 366. Source: http://www.statista.com/statistics/240386/twitter-followers-of-national-basketball-association-teams/
- 24. Frequent Items Exploration • Identify important members in a network • E.g. “the”, LA Lakers, Taylor Swift Feature Engineering • Identify outliers • Ignore low frequency items
- 25. Frequent Items: Exact Algorithm SELECT item, count(*) cnt FROM corpus GROUP BY item HAVING cnt > (SELECT count(*) FROM corpus) / k • Space: linear in |item| • One pass: no (two passes) • Parallelizable & communication: linear in |item|
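
For comparison with the sketch that follows in the talk, here is the exact approach in plain Python. It reads the threshold as "count greater than total / k", consistent with the "more frequent than 1/k" definition on slide 21; the function name `exact_frequent_items` and the toy corpus are made up for illustration.

```python
from collections import Counter


def exact_frequent_items(items, k):
    """Return items whose count exceeds len(items) / k, with their counts."""
    items = list(items)
    total = len(items)        # pass 1: total number of records
    counts = Counter(items)   # pass 2: one counter per distinct item
    # Compare c * k > total to stay in integer arithmetic.
    return {item: c for item, c in counts.items() if c * k > total}


corpus = ["the", "cat", "the", "dog", "the", "the", "bird"]
print(exact_frequent_items(corpus, k=2))
```

The `Counter` holds one entry per distinct item, which is the linear space cost the slide refers to.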
- 26. Example 1: Find Items Frequency > ½ (k=2)
- 27. draw Put back if any pair of balls are the same color
- 28. draw Remove if balls are all different color
- 29. Example 1: Find Items Frequency > 1/2 Blue ball left (frequent item)
- 30. Example 2: Find Items Frequency > ½ (k=2)
- 31. draw
- 32. draw
- 33. draw
- 34. 1 ball left (frequent item)
- 35. How do we implement this? Maintain a hash table of counts
- 36. Increment for every ball we see 0 => 1
- 37. Increment for every ball we see 1 => 2
- 38. Increment for every ball we see 0 => 4
- 39. Increment for every ball we see 0 => 4
- 40. Increment for every ball we see 4 0 => 1
- 41. When the hash table has k items, remove 1 from each item and remove the item if count = 0 4 => 3 1 => 0
- 42. 3
- 43. 3 0 => 1
- 44. 2
- 45. 2 0 => 1
- 46. 1
- 47. Implementation Maintains a hash table of counts • For each item, increment its count • If hash table size == k: – decrement 1 from each item; and – remove items whose count == 0 Parallelization: merge hash tables of max size k
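
The steps on this slide translate almost directly into Python. The sketch below follows the slide's rule (increment, then decrement everything once the table reaches k entries); the function names are hypothetical. The merge step is an assumption on my part: it sums the two tables and then subtracts the k-th largest count so that fewer than k counters remain, which is one known way to combine such summaries and not necessarily what Spark does.

```python
def frequent_items_sketch(items, k):
    """One-pass summary that holds fewer than k counters between items."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            # Table is full: subtract 1 everywhere and drop exhausted entries.
            counts = {key: c - 1 for key, c in counts.items() if c > 1}
    return counts


def merge_sketches(left, right, k):
    """Combine two summaries and shrink the result back below k counters."""
    merged = dict(left)
    for key, c in right.items():
        merged[key] = merged.get(key, 0) + c
    if len(merged) >= k:
        # Subtract the k-th largest count so at most k - 1 counters stay positive.
        cutoff = sorted(merged.values(), reverse=True)[k - 1]
        merged = {key: c - cutoff for key, c in merged.items() if c > cutoff}
    return merged


balls = ["blue", "red", "blue", "green", "blue",
         "blue", "yellow", "blue", "red", "blue"]
print(frequent_items_sketch(balls, k=2))

# Parallel version: sketch each half separately, then merge.
left = frequent_items_sketch(balls[:5], k=2)
right = frequent_items_sketch(balls[5:], k=2)
print(merge_sketches(left, right, k=2))
```

The counts kept in the table are lower bounds, not true frequencies; any item occurring more than n/k times is guaranteed to survive, but survivors are not guaranteed to be frequent.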
- 48. Comparing Exact vs Approximate • Naïve Exact: 2 passes, memory |item|, communication |item| • Sketch: 1 pass, memory k, communication k
- 49. Comparing Exact vs Approximate • Naïve Exact: 2 passes, memory |item|, communication |item| • Sketch: 1 pass, memory k, communication k • Smart Exact: 2 passes (1st pass using sketch), memory k, communication k
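
The "smart exact" column can be read as: use the sketch in a first pass to get a small candidate set, then count only those candidates exactly in a second pass. A minimal Python sketch of that reading follows; `candidate_items` and `smart_exact_frequent_items` are hypothetical names, and the toy stream is chosen so that the sketch returns one false positive that the second pass removes.

```python
from collections import Counter


def candidate_items(items, k):
    """Pass 1: one-pass sketch whose keys are a superset of the frequent items."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            counts = {key: c - 1 for key, c in counts.items() if c > 1}
    return set(counts)


def smart_exact_frequent_items(items, k):
    """Two passes, but memory stays bounded by the number of candidates (< k)."""
    items = list(items)
    candidates = candidate_items(items, k)
    # Pass 2: exact counts, restricted to the candidates from the sketch.
    exact = Counter(item for item in items if item in candidates)
    return {item: c for item, c in exact.items() if c * k > len(items)}


stream = ["a", "b", "a", "c", "a", "d", "b", "e"]
print(candidate_items(stream, k=3))             # includes a false positive
print(smart_exact_frequent_items(stream, k=3))  # false positive removed
```

With k = 3, the last item of the stream is still sitting in the table when the pass ends even though it occurs only once, which is the kind of false positive the second pass filters out.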
- 50. Quiz: an example with false positive? K = 3
- 51. How to use it in Spark? Frequent items for multiple columns independently • df.stat.freqItems(["columnA", "columnB", …]) Frequent items for composite keys • df.stat.freqItems(struct("columnA", "columnB"))
- 52. Stratified sampling
- 53. Bernoulli sampling & Variance Sample US population (300m) using rate 0.000002 (~600 samples) • Wyoming (0.5m) should have 1 • Bernoulli sampling gives Wyoming 0 samples about 37% of the time Intuition: uniform sampling leads to ~600 samples in total • i.e. it might be 600, or 601, or 599, or … • One sample more or less changes WY’s representation far more than CA’s
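
The numbers on this slide can be checked with a few lines of arithmetic. The sketch below treats each person as an independent coin flip with the slide's rate; the helper name `bernoulli_stats` is hypothetical, and the California population (about 39 million) is an assumed round figure that does not appear in the talk.

```python
import math

rate = 0.000002
us_population = 300_000_000
wy_population = 500_000
ca_population = 39_000_000  # assumption: approximate figure, not from the slides


def bernoulli_stats(population, rate):
    """Mean, standard deviation and relative spread of the sample count."""
    mean = population * rate
    std = math.sqrt(population * rate * (1 - rate))
    return mean, std, std / mean


# Probability that not a single Wyoming resident is sampled.
p_wy_empty = (1 - rate) ** wy_population

for name, population in [("US", us_population),
                         ("CA", ca_population),
                         ("WY", wy_population)]:
    mean, std, rel = bernoulli_stats(population, rate)
    print(f"{name}: expected {mean:.1f} samples, std {std:.1f}, relative {rel:.0%}")
print(f"P(WY gets no sample) = {p_wy_empty:.3f}")
```

The relative spread is what matters: a fluctuation of one sample is a small fraction of the count for a large state and the whole count for Wyoming, which is the motivation for sampling each stratum separately.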
- 54. Stratified sampling Existing “exact” algorithms • Draw-by-draw • Selection-rejection • Reservoir • Random sort Either sequential or expensive (full global sort)
- 55. Random sort Example: sampling probability p = 0.1 on 100 items. 1. Generate random keys • (0.644, t1), (0.378, t2), … (0.500, t99), (0.471, t100) 2. Sort and select the smallest 10 items • (0.028, t94), (0.029, t44), …, (0.137, t69), …, (0.980, t26), (0.988, t60)
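
A minimal Python version of the random-sort approach, assuming the sample size is the rounded value of p times the number of items; the function name `random_sort_sample` and the `seed` parameter are illustrative.

```python
import random


def random_sort_sample(items, p, seed=None):
    """Exact-size sample: tag every item with a random key, keep the smallest."""
    rng = random.Random(seed)
    items = list(items)
    sample_size = round(p * len(items))
    # Step 1: generate one uniform random key per item.
    keyed = [(rng.random(), item) for item in items]
    # Step 2: sort by key and keep the items with the smallest keys.
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:sample_size]]


items = [f"t{i}" for i in range(1, 101)]
sample = random_sort_sample(items, p=0.1, seed=7)
print(len(sample), sample)
```

The full sort is what makes this expensive at scale: every item has to take part in a global ordering even though only a small fraction is kept.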
- 56. Heuristics (u is the random key drawn for item t, p is the sampling probability) Qualitatively speaking • If u is “much larger” than p, then t is “unlikely” to be selected • If u is “much smaller” than p, then t is “likely” to be selected Set two thresholds q1 and q2, such that: • If u < q1, accept t directly • If u > q2, reject t directly • Otherwise, put t in a buffer to be sorted
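
The two-threshold heuristic can be sketched as follows. The thresholds here are simply p minus and plus a fixed `margin`, which is a placeholder assumption for illustration; the slides do not say how q1 and q2 are chosen. The function name `threshold_sample` is also hypothetical. When the thresholds hold, the result is the same sample that a full random sort with the same keys would produce, but only the buffer is sorted.

```python
import random


def threshold_sample(items, p, margin=0.05, seed=None):
    """Exact-size sample that sorts only the items with keys between q1 and q2."""
    rng = random.Random(seed)
    items = list(items)
    target = round(p * len(items))
    # Assumption: symmetric thresholds around p; real choices would be tuned.
    q1, q2 = max(0.0, p - margin), min(1.0, p + margin)

    accepted, waitlist = [], []
    for item in items:
        u = rng.random()
        if u < q1:
            accepted.append(item)        # key far below p: accept directly
        elif u <= q2:
            waitlist.append((u, item))   # undecided: keep in the buffer
        # u > q2: key far above p, reject directly

    missing = target - len(accepted)
    if missing < 0 or missing > len(waitlist):
        # The thresholds failed for this draw; a caller could retry or widen them.
        raise RuntimeError("thresholds too tight; retry with a larger margin")
    waitlist.sort(key=lambda pair: pair[0])
    return accepted + [item for _, item in waitlist[:missing]]


items = list(range(10_000))
sample = threshold_sample(items, p=0.1, seed=42)
print(len(sample), "items sampled")
```

With these settings only about a tenth of the items land in the buffer, so the sort is much smaller than a global sort over all items.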
- 57. Spark’s stratified sampling algorithm Combines “exact” and “sketch” to achieve parallelization & low memory overhead df.stat.sampleByKeyExact(col, fractions, seed) Xiangrui Meng. Scalable Simple Random Sampling and Stratified Sampling. ICML 2013
- 58. This Talk Set membership (Bloom filter) Cardinality (HyperLogLog) Histogram (count-min sketch) Frequent pattern mining Frequent items Stratified Sampling …
- 59. Conclusion Sketches can be useful in exploration, feature engineering, as well as building faster exact algorithms. We are building a lot of these into Spark so you don’t need to reinvent the wheel!
- 60. Thank you. Meetup tonight @ Civic Hall, 6:30pm 156 5th Avenue, 2nd floor, New York, NY
