1. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Spark and Approximations: Samples, Hashes, Approximates, and Probabilistic Data Structures
DFW Data Science
Thanks, Amazon!!
Jan 21st, 2016
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring! Only *Nice* People!!
advancedspark.com!
2. Who Am I?
Streaming Data Engineer
Netflix OSS Committer
Data Solutions Engineer
Apache Spark Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced Spark
Due 2016
3. Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 5 Most-active Spark Meetup!
2300+ Members in just 6 mos!!
2500+ Docker image downloads
Meetup Mission
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch, Kafka, NiFi
Surface and share patterns and idioms of well-designed,
4. Presentation Outline
Scaling with Parallelism and Composability
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools
Demos!
5. Scaling with Parallelism
O(log n)
Peter
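The O(log n) on this slide can be illustrated with a pairwise (tree) reduction: neighbors are combined in rounds, and each round halves the number of partial results. This is a minimal sketch, not a Spark or Algebird API; `tree_reduce` is a hypothetical helper, and it assumes the combine function is associative.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(values: List[T], combine: Callable[[T, T], T]) -> Tuple[T, int]:
    """Combine values pairwise in rounds; return (result, number_of_rounds)."""
    if not values:
        raise ValueError("need at least one value")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Every pair in a round is independent, so a round could run in parallel.
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over to the next round
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
print(total, rounds)  # 36 3
```

Eight inputs need three rounds (log2 of 8), compared with seven sequential steps for a left-to-right fold.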
6. Scaling with Composability
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)
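The regrouping identities above are what let each partition reduce its own chunk before the partial results are combined. A small sketch of that idea, assuming two-element chunks stand in for partitions; `chunked_reduce` is a hypothetical helper name, not part of any library.

```python
from functools import reduce
import operator


def chunked_reduce(values, combine, chunk_size=2):
    """Reduce each chunk independently, then combine the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(combine, chunk) for chunk in chunks]
    return reduce(combine, partials)


a, b, c, d = 3, 4, 7, 8
print(chunked_reduce([a, b, c, d], max))           # 8
print(chunked_reduce([a, b, c, d], operator.add))  # 22
print(chunked_reduce([a, b, c, d], operator.mul))  # 672
print(chunked_reduce([{a}, {b}, {c}, {d}], operator.or_) == {a, b, c, d})  # True
```

For each of these operations the chunked result matches a plain left-to-right reduce, which is the property the slide calls composable.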
7. What about Division?
Division: (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
Not Composable
What were they thinking?!
“Divide like an Egyptian”
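The arithmetic on this slide can be checked directly. The sketch below reproduces both groupings and, as an assumption that goes beyond the slide, shows one way to make the computation composable: multiply the divisors (a composable operation) and divide once at the end.

```python
import math
import operator
from functools import reduce

values = [3, 4, 7, 8]

# Left-to-right: ((3 / 4) / 7) / 8
left_to_right = reduce(operator.truediv, values)

# Regrouped as if division were associative: (3 / 4) / (7 / 8)
regrouped = (values[0] / values[1]) / (values[2] / values[3])

print(round(left_to_right, 4))  # 0.0134
print(round(regrouped, 3))      # 0.857

# Composable variant: combine the divisors with multiplication, divide once.
divisor = reduce(operator.mul, values[1:])  # 4 * 7 * 8 = 224
single_divide = values[0] / divisor
print(math.isclose(single_divide, left_to_right))  # True
```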
8. What about Average?
Average over [value, count] pairs: a[3, 1], b[5, 1], b[5, 1], c[7, 1]
Single Divide: Composable!
$$\mathrm{AVG}(3, 5, 5, 7) = \frac{(3 + 5) + (5 + 7)}{(1 + 2) + 1} = \frac{20}{4} = 5$$
Pairwise Average (Divide, Add, Divide): Not Composable
$$\frac{3 + 5}{2} + \frac{5 + 7}{2} = \frac{8}{2} + \frac{12}{2} = \frac{20}{2} = 10 \neq 5$$
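The composable form of average carries a (sum, count) pair through every merge and divides only once at the end. A minimal sketch using the slide's values; the helper names `to_pair`, `merge`, and `finish` are illustrative, not taken from Algebird or Spark.

```python
from functools import reduce


def to_pair(value):
    """Lift a raw value into a (sum, count) pair."""
    return (value, 1)


def merge(p, q):
    """Associative merge: add sums and add counts."""
    return (p[0] + q[0], p[1] + q[1])


def finish(pair):
    """The single divide, applied after all merging is done."""
    return pair[0] / pair[1]


values = [3, 5, 5, 7]
left = reduce(merge, map(to_pair, values[:2]))   # (8, 2)
right = reduce(merge, map(to_pair, values[2:]))  # (12, 2)

print(finish(merge(left, right)))     # 5.0  (single divide: composable)
print(finish(left) + finish(right))   # 10.0 (divide, then add: not the average)
```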
10. When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)
Streaming aggregations (distinct count or top k)
Inherently sloppy means of collecting (at least once delivery)
Approximate as much as you can get away with!
Ask for forgiveness later !!
11. When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…in-that-order, at the office, after 2002.
13. A Few Good Algorithms
You can’t handle
the approximate!
14. Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error
15. Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
16. Bloom Filter
Approximate set membership for key
False positives possible: filter reports contains(), actual !contains()
No false negatives: if filter reports !contains(), actual is always !contains()
Elements only added, never removed
17. Bloom Filter in Action
set(key)
contains(key): Boolean
Images by @avibryant
TRUE -> maybe contains
FALSE -> definitely does not contain.
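The "hash multiple times and flip the bits" idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the Algebird implementation: it derives its hash functions by salting SHA-256 with a seed, stores one byte per bit for readability, and names the insert method `add` where the slide says set(key).

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, key):
        # One salted hash per "hash function"; each maps the key to a bit position.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def contains(self, key):
        # True  -> maybe contains (other keys may have set the same bits)
        # False -> definitely does not contain
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("user1001")
print(bf.contains("user1001"))  # True
```

Because bits are only ever set, a key that was added can never be reported as absent, which is why removal is not supported in this basic form.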
18. CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
19. CountMin Sketch (CMS)
Approximate frequency count and TopK for key
e.g. “Heavy Hitters” on Twitter
Johnny Hallyday
Martin Odersky
Donald Trump
20. CountMin Sketch In Action
Images derived from @avibryant
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch grid. Top Gun is added twice (x 2) for slightly additional complexity; Top Gun, A Few Good Men, and Taps each land in one cell per row, and the cells for Top Gun and A Few Good Men overlap in some rows.]
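The add and getCount calls above can be sketched as a small grid of counters with one salted hash per row. This is a toy under stated assumptions, not the Algebird CMS: the default depth and width are arbitrary, and the row hashes are derived from SHA-256 for illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=64):
        self.depth = depth  # number of rows, one hash function per row
        self.width = width  # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        # Add the count wherever the key lands in each row.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows
        # is the least-contaminated estimate and never undercounts.
        return min(self.table[row][self._column(row, key)] for row in range(self.depth))


cms = CountMinSketch()
cms.add("Top Gun", 2)
cms.add("A Few Good Men", 1)
cms.add("Taps", 1)
print(cms.get_count("Top Gun") >= 2)  # True
```

Two sketches with the same depth, width, and hash functions can be merged by adding their tables cell by cell, which keeps the structure composable.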
21. HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
22. HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Error estimate
14 bits for size of range
m = 2^14 = 16,384 slots
error = 1.04 / sqrt(16,384) = 0.81%
Not many of these
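The error estimate above follows from the number of slots alone, so the size/error tradeoff is easy to tabulate. A short sketch of the slide's formula; the function name `hll_standard_error` is illustrative.

```python
import math


def hll_standard_error(precision_bits):
    """Standard error 1.04 / sqrt(m) for m = 2^precision_bits slots."""
    m = 2 ** precision_bits
    return 1.04 / math.sqrt(m)


print(f"{hll_standard_error(14):.2%}")  # 0.81%

# Each extra 2 bits quadruples the slots and halves the error.
for bits in (10, 12, 14, 16):
    print(bits, 2 ** bits, f"{hll_standard_error(bits):.2%}")
```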
23. HyperLogLog In Action
Use Case: Distinct number of views per movie
[Figure: viewers of Top Gun in Hour 1 and Hour 2, hashed uniformly onto two number lines (0 to 16 and 0 to 32). User IDs shown: 1001, 1005, 2001, 2009, 3001, 3002, 3003, 3005, 4009, 6001, 7002, 7009, 8001, 8002.]
Uniform Distribution: Estimate distinct # of users in smaller space
Composable! (a bit of precision loss)
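The per-hour sketches compose because merging two HyperLogLogs is just a register-wise max. Below is a simplified sketch of the textbook estimator, not the Algebird or Redis implementation. Its assumptions: a single 64-bit hash taken from SHA-256 and split into a register index plus a remainder, the usual bias constant for 128 or more registers, and only the small-range (linear counting) correction.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, precision_bits=10):
        if precision_bits < 7:
            raise ValueError("bias constant below assumes at least 128 registers")
        self.p = precision_bits
        self.m = 1 << precision_bits
        self.registers = [0] * self.m

    def add(self, key):
        h = int.from_bytes(hashlib.sha256(str(key).encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                 # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

    def merge(self, other):
        if self.p != other.p:
            raise ValueError("precision must match to merge")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


hour1, hour2 = HyperLogLog(), HyperLogLog()
for user in range(0, 5000):
    hour1.add(f"user{user}")
for user in range(2500, 7500):
    hour2.add(f"user{user}")
print(round(hour1.merge(hour2).count()))  # close to 7500 distinct users
```

The merged sketch has exactly the registers it would have had if every view had been added to one sketch, so users seen in both hours are not double-counted.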
24. Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
25. Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets
b << m
Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel !!
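One common way to realize the steps above for set similarity is MinHash signatures with banding; the slide does not name a specific scheme, so treat this choice as an assumption. In the sketch, each band of a signature acts as a bucket key, and only items that share a bucket become candidate pairs. The function names and the parameters (20 hashes, 10 bands) are illustrative, and every input set must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes=20):
    """For each salted hash function, keep the minimum hash over the set."""
    return [
        min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        )
        for seed in range(num_hashes)
    ]


def candidate_pairs(sets_by_name, num_hashes=20, bands=10):
    """Bucket each band of each signature; pairs sharing any bucket are candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, items in sets_by_name.items():
        sig = minhash_signature(items, num_hashes)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))  # small per-bucket cartesian
    return pairs


def jaccard(a, b):
    """Exact similarity, computed only for the candidate pairs."""
    return len(a & b) / len(a | b)


views = {
    "alice": {"Top Gun", "Taps", "A Few Good Men"},
    "bob": {"Top Gun", "Taps", "A Few Good Men"},
    "carol": {"Frozen", "Cars"},
}
for x, y in sorted(candidate_pairs(views)):
    print(x, y, jaccard(views[x], views[y]))
```

Sets with no items in common essentially never share a bucket, so the expensive exact comparison runs on far fewer pairs than the full cartesian product, and each bucket can be processed in parallel.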
27. Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
28. Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)
29. Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)
Add user views for given movie
PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001
Get distinct count (cardinality) of set
PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie; the duplicate user1001 is ignored)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL
30. Spark Approximations
Spark Core
RDD.count*Approx()
Spark SQL
PartialResult
HyperLogLogPlus
approxCountDistinct(column)
Spark ML
Stratified sampling
PairRDD.sampleByKey(fractions: Double[ ])
DIMSUM sampling
Probabilistic sampling reduces amount of comparison shuffle
RowMatrix.columnSimilarities(threshold)
Spark Streaming
A/B testing
new StreamingTest().setTestMethod("welch").registerStream(dstream)
32. Demo!
Twitter Algebird
Fixed Memory, Large Number of Counts
33. HashSet vs. HyperLogLog
34. HashSet vs. CountMin Sketch
35. Demo!
Brute Force vs. Locality Sensitive Hashing
Similar Items
36. Brute Force Cartesian All Pair Similarity
90 mins!
37. All Pairs & Locality Sensitive Hashing
<< 90 mins!
38. Thank You, Dallas!!
Chris Fregly
IBM Spark Tech Center
http://spark.tc
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
39. advancedspark.com
@cfregly