2. About Databricks
Founded by creators of Spark in 2013
Cloud service for end-to-end data processing
• Interactive notebooks, dashboards, and production jobs
We are hiring!
8. “Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
9. Who is this guy?
Co-founder & architect for Spark at Databricks
Former PhD student at UC Berkeley AMPLab
A “systems” guy, which means I won’t be showing equations and this
talk might be the easiest to consume in HDS
10. This talk
1. Develop intuitions about these sketches so you know when to use them
2. Understand how certain parts in distributed data processing (e.g.
Spark) work
12. Sketch: Reynold’s not-so-scientific definition
1. Use small amount of space to summarize a large dataset.
2. Go over each data point once, a.k.a. “streaming algorithm”, or
“online algorithm”
3. Parallelizable, with only a small amount of communication
18. Exact set membership
Track every member of the set
• Space: size of data
• One pass: yes
• Parallelizable & communication: size of data
19. Approximate set membership
Take 1. Use a 32-bit integer hash map to track
• ~4 bytes per record
• Max 4 billion items
Take 2. Hash items to 256 buckets
• Memory usage is only 256 bits
• Good if the number of records is small
• Bad if the number of records is large (256+ items, collision rate 100%!)
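A minimal sketch of "Take 2", assuming the bucket is taken from the first byte of an MD5 digest (the slide does not name a hash function) and that a Python int stands in for the 256-bit bitmap. The names `bucket`, `build_bitmap` and `might_contain` are made up for illustration.

```python
import hashlib

NUM_BUCKETS = 256  # one bit per bucket, as on the slide


def bucket(item: str) -> int:
    # Assumption: the first byte of an MD5 digest picks a bucket in 0..255.
    return hashlib.md5(item.encode("utf-8")).digest()[0]


def build_bitmap(items) -> int:
    bitmap = 0  # a Python int used as a 256-bit bitmap
    for item in items:
        bitmap |= 1 << bucket(item)
    return bitmap


def might_contain(bitmap: int, item: str) -> bool:
    # True means "maybe present"; False means "definitely absent".
    return bool((bitmap >> bucket(item)) & 1)


small = build_bitmap(f"user{i}" for i in range(10))
large = build_bitmap(f"user{i}" for i in range(5000))

print("bits set with 10 items:  ", bin(small).count("1"))
print("bits set with 5000 items:", bin(large).count("1"))
```

With a few items most bits stay clear, so lookups for absent items usually answer "no". Once thousands of items are inserted, almost every bit is set and the bitmap answers "maybe" for nearly everything.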
20. Bloom filter
Bloom filter algorithm
• k hash functions
• to insert, hash the item into k separate positions and set those bits
• to look up, if any of the k positions is not set, then the item is not in the set
Properties
• ~600MB needed to have 10% error rate on 1 billion items (about 4.8 bits per item)
• See http://hur.st/bloomfilter?n=1000000000&p=0.1
• False positives possible
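The algorithm and sizing above can be sketched in a few lines. This is an illustrative implementation, not the one used in Spark: the k positions are derived by double hashing from a single SHA-256 digest (an assumption), and `optimal_size` applies the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. For n = 1 billion and p = 0.1 this works out to about 4.8 bits per item, i.e. a few hundred megabytes.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: double hashing, using two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # If any of the k bits is clear, the item was never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def optimal_size(n: int, p: float):
    """Return (number of bits, number of hash functions) for n items at error rate p."""
    num_bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    num_hashes = max(1, round(num_bits / n * math.log(2)))
    return num_bits, num_hashes


bits, hashes = optimal_size(1_000_000_000, 0.1)
print(f"{bits / 8 / 1e6:.0f} MB, {hashes} hash functions")
```

There are no false negatives by construction: every bit set by `add` is still set at lookup time.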
21. Use case beyond exploration
SELECT * FROM A JOIN B ON A.key = B.key
1. Assume A and B are both large, i.e. “shuffle join”
2. Some rows in A might not have matched rows in B
3. Wouldn’t it be nice if we only need to shuffle rows that match?
Answer: build a Bloom filter on B's keys and use it to drop the rows of A that cannot match before the shuffle
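A single-machine sketch of the idea, with plain Python lists standing in for the two tables. The tables, the filter size (`NUM_BITS`, `NUM_HASHES`) and the helper names are all hypothetical; in a real cluster the filter built from B would be broadcast to the tasks scanning A.

```python
import hashlib

NUM_BITS = 1 << 16   # assumed filter size, chosen for this toy example
NUM_HASHES = 3


def positions(key):
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(keys) -> int:
    bits = 0  # Python int used as a bitmap
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, key) -> bool:
    return all((bits >> pos) & 1 for pos in positions(key))


# Hypothetical tables of (key, payload) rows; only keys 900..999 overlap.
table_a = [(k, f"a{k}") for k in range(0, 1000)]
table_b = [(k, f"b{k}") for k in range(900, 1100)]

# Step 1: summarize B's keys in a small filter.
b_filter = build_filter(k for k, _ in table_b)

# Step 2: keep only the rows of A that might match; only these get shuffled.
a_to_shuffle = [row for row in table_a if might_contain(b_filter, row[0])]

# Step 3: the actual join. False positives are removed here, so the result is exact.
b_index = dict(table_b)
joined = [(k, a, b_index[k]) for k, a in a_to_shuffle if k in b_index]

print(len(table_a), "rows in A,", len(a_to_shuffle), "shuffled,", len(joined), "joined")
```

The filter never drops a matching row, so the join result is unchanged. It only reduces how much of A has to move.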
26. Frequent Items
Exploration
• Identify important members in a network
• E.g. “the”, LA Lakers, Taylor Swift
Feature Engineering
• Identify outliers
• Ignore low frequency items
27. Frequent Items: Exact Algorithm
SELECT item, count(*) cnt
FROM corpus
GROUP BY item
HAVING cnt > k * (SELECT count(*) FROM corpus)
• Space: linear in |item|
• One pass: no (two passes)
• Parallelizable & communication: linear in |item|
52. Implementation
Maintains a hash table of counts
• For each item, increment its count
• If hash table size == k:
– decrement 1 from each item; and
– remove items whose count == 0
Parallelization: merge hash tables of max size k
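The steps above can be sketched as follows. `update` follows the slide literally. `merge` is one possible way to combine two tables (add the counts, then subtract the k-th largest count so fewer than k entries remain); it is an assumption, not necessarily how Spark merges them. All function names are made up for illustration.

```python
def update(counts: dict, item, k: int) -> None:
    # For each item, increment its count.
    counts[item] = counts.get(item, 0) + 1
    # If the table has reached k entries, decrement every count by 1
    # and remove entries whose count dropped to 0.
    if len(counts) == k:
        for key in list(counts):
            counts[key] -= 1
            if counts[key] == 0:
                del counts[key]


def sketch(items, k: int) -> dict:
    counts = {}
    for item in items:
        update(counts, item, k)
    return counts


def merge(left: dict, right: dict, k: int) -> dict:
    # Assumed merge strategy: sum the counts, then shrink back below k entries.
    merged = dict(left)
    for item, cnt in right.items():
        merged[item] = merged.get(item, 0) + cnt
    if len(merged) >= k:
        threshold = sorted(merged.values(), reverse=True)[k - 1]
        merged = {i: c - threshold for i, c in merged.items() if c > threshold}
    return merged


# "a" makes up half of the stream; every other item appears once.
stream = []
for i in range(50):
    stream += ["a", f"rare{i}"]

whole = sketch(stream, k=4)
halves = merge(sketch(stream[:50], k=4), sketch(stream[50:], k=4), k=4)
print(sorted(whole), sorted(halves))
```

The counts in the table are underestimates, but any item that occurs more than n/k times in a stream of n items is guaranteed to still be in the table at the end.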
53. Comparing Exact vs Approximate
                 Naïve Exact    Sketch
# Passes         2              1
Memory           |item|         k
Communication    |item|         k
54. Comparing Exact vs Approximate
                 Naïve Exact    Sketch    Smart Exact
# Passes         2              1         2 (1st pass using sketch)
Memory           |item|         k         k
Communication    |item|         k         k
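A rough sketch of the "Smart Exact" column, assuming "frequent" means occurring more than n/k times in a corpus of n items. Pass 1 uses the sketch to produce fewer than k candidates; pass 2 counts only those candidates exactly. The function names are hypothetical and `items` must be re-iterable (here, a list).

```python
from collections import Counter


def candidates(items, k: int) -> set:
    # Pass 1: the frequent-items sketch, keeping fewer than k counters.
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            counts = {i: c - 1 for i, c in counts.items() if c > 1}
    return set(counts)


def smart_exact(items, k: int) -> dict:
    cand = candidates(items, k)
    # Pass 2: exact counts, but only for the candidates, so memory stays O(k).
    exact = Counter(item for item in items if item in cand)
    n = len(items)
    # Drop candidates that the sketch kept but that are not actually frequent.
    return {item: cnt for item, cnt in exact.items() if cnt * k > n}


corpus = ["a"] * 50 + ["b"] * 30 + [f"rare{i}" for i in range(20)]
print(smart_exact(corpus, k=4))
```

The sketch never misses a truly frequent item, so the second pass only has to weed out false candidates and fix up the counts.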
56. How to use it in Spark?
Frequent items for multiple columns independently
• df.stat.freqItems(["columnA", "columnB", …])
Frequent items for composite keys
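• df.withColumn("ab", struct("columnA", "columnB")).stat.freqItems(["ab"])
  (freqItems takes column names, so the struct is first added as a column; "ab" is an arbitrary name)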
58. Bernoulli sampling & Variance
Sample the US population (300m) at rate 0.000002 (~600 samples expected)
• Wyoming (0.5m) should get 1 sample
• Bernoulli sampling has a good chance of leaving Wyoming with 0
Intuition: uniform sampling leads to ~ 600 samples.
• i.e. it might be 600, or 601, or 599, or …
• Going from 600 to 601 samples matters much more for WY (1 expected sample) than for CA
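The numbers behind this intuition can be checked with a few lines of arithmetic. Under Bernoulli sampling each person is kept independently with probability p, so a state with n people gets Binomial(n, p) samples. The Wyoming figure is from the slide; the California population (39m) is a rough figure assumed here for comparison.

```python
import math

RATE = 0.000002


def bernoulli_stats(n: int, p: float):
    """Mean, standard deviation, and P(zero samples) for Binomial(n, p)."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    p_zero = (1 - p) ** n
    return mean, std, p_zero


# CA population is an assumption made for this example, not from the slide.
for state, population in [("WY", 500_000), ("CA", 39_000_000)]:
    mean, std, p_zero = bernoulli_stats(population, RATE)
    print(f"{state}: expect {mean:.1f} +/- {std:.1f} samples, "
          f"P(no samples) = {p_zero:.3g}")
```

For Wyoming the expected count is 1 with a standard deviation of about 1, and the chance of getting no sample at all is close to 1/e, roughly 37%. For a large state the same relative noise is negligible.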
59. Stratified sampling
Existing “exact” algorithms
• Draw-by-draw
• Selection-rejection
• Reservoir
• Random sort
Either sequential or expensive (full global sort)
60. Random sort
Example: sampling probability p = 0.1 on 100 items.
1. Generate random keys
• (0.644, t1), (0.378, t2), … (0.500, t99), (0.471, t100)
2. Sort and select the smallest 10 items
• (0.028, t94), (0.029, t44), …, (0.137, t69), …, (0.980, t26), (0.988, t60)
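A minimal single-machine sketch of random sort, assuming the sample size is p times the number of items, rounded. The function name and the `seed` parameter are illustrative.

```python
import random


def random_sort_sample(items, p: float, seed=None):
    rng = random.Random(seed)
    # 1. Attach a uniform random key to every item.
    keyed = [(rng.random(), item) for item in items]
    # 2. Sort by key and keep the items with the smallest keys.
    keyed.sort(key=lambda pair: pair[0])
    sample_size = round(p * len(items))
    return [item for _, item in keyed[:sample_size]]


items = [f"t{i}" for i in range(1, 101)]
sample = random_sort_sample(items, p=0.1, seed=7)
print(len(sample), sample)
```

The sample size is exact, unlike Bernoulli sampling, but the price is a full sort of the data, which in a distributed setting means a global shuffle.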
61. Heuristics
Qualitatively speaking
• If u (the random key of item t) is “much larger” than p, then t is “unlikely” to be selected
• If u is “much smaller” than p, then t is “likely” to be selected
Set two thresholds q1 and q2, such that:
• If u < q1, accept t directly
• If u > q2, reject t directly
• Otherwise, put t in a buffer to be sorted
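The thresholds can be sketched as below. The choice q1 = p - margin and q2 = p + margin is a placeholder assumed for illustration; the slide does not say how q1 and q2 are derived, and the paper cited on the next slide chooses them so that failure is very unlikely. `threshold_sample` and `margin` are hypothetical names.

```python
import random


def threshold_sample(items, p: float, margin: float, seed=None):
    rng = random.Random(seed)
    k = round(p * len(items))          # exact sample size we want
    q1 = max(0.0, p - margin)          # assumed lower threshold
    q2 = min(1.0, p + margin)          # assumed upper threshold
    accepted, waitlist = [], []
    for t in items:
        u = rng.random()
        if u < q1:
            accepted.append(t)         # accept t directly
        elif u <= q2:
            waitlist.append((u, t))    # undecided: buffer for sorting
        # u > q2: reject t directly
    need = k - len(accepted)
    if need < 0 or need > len(waitlist):
        # The thresholds were too tight for this run of random keys.
        raise RuntimeError("thresholds failed; retry with a larger margin")
    # Only the small buffer is sorted, not the whole dataset.
    waitlist.sort(key=lambda pair: pair[0])
    return accepted + [t for _, t in waitlist[:need]]


population = list(range(10_000))
sample = threshold_sample(population, p=0.1, margin=0.02, seed=42)
print(len(sample))
```

When the thresholds hold, the result is the same set random sort would pick for the same keys: every key below q1 is among the k smallest, and the remainder comes from the smallest keys in the buffer.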
62. Spark’s stratified sampling algorithm
Combines “exact” and “sketch” to achieve parallelization & low
memory overhead
df.stat.sampleByKeyExact(col, fractions, seed)
Xiangrui Meng. Scalable Simple Random Sampling and Stratified
Sampling. ICML 2013
63. This Talk
Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…
64. Conclusion
Sketches can be useful in exploration, feature engineering, as
well as building faster exact algorithms.
We are building a lot of these into Spark so you don’t need to
reinvent the wheel!