Sam BESSALAH
Algebird is an abstract algebra library for Scala developed at Twitter and released under the ASL 2.0 license. It supports algebraic structures such as semigroups, monoids, groups, rings and fields, as well as standard functional abstractions like monads. More interesting, though, are the probabilistic data structures and their accompanying monoids, which come out of the box.
I'll talk a bit about Algebird in general and how it eases building large-scale analytics systems, whether on MapReduce or in a stream processing context.
4. What do we want?
•We want to build scalable systems.
•Preferably by leveraging distributed computing
•A lot of analytics amount to counting or adding in some sort of way.
5. • Example : Finding TopK Elements
Read Input
Sort, Filter and take top K records
Write Output
Input: 11, 12, 0, 3, 56, 48 with K = 3
Output: 56, 48, 12
6. • Example : Finding TopK Elements
Read Input
Sort, Filter and take top K records
Write Output
Hadoop Map-Reduce
10. Problems
•Curse of the last reducer
•Network chatter, which hinders performance
•Inefficient ordering of map and reduce steps
•Multiple jobs, with a sync barrier at the reducer
12. But in Scalding, "sortWithTake" uses:
Priority Queue
Can be empty
Two Priority Queues can be added in any order
Associative + Commutative
PQ1 : 55, 45, 21, 3
PQ2: 100, 80, 40, 3
K = 4
PQ1 (+) PQ2 : 100, 80, 55, 45
In a single Pass
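The priority-queue addition above can be sketched in a few lines. This is a plain Python illustration, not Scalding's or Algebird's actual implementation; the names `EMPTY`, `pq_plus` and `top_k` are hypothetical. Each partition builds its own bounded queue, and the partial queues are then added in any order.

```python
import heapq
from functools import reduce

EMPTY = []  # the zero element: adding it changes nothing


def pq_plus(k, left, right):
    """Add two bounded queues: keep the k largest values of both, descending."""
    return heapq.nlargest(k, list(left) + list(right))


def top_k(k, partitions):
    """Build one bounded queue per partition, then add the partial queues."""
    partials = [pq_plus(k, EMPTY, part) for part in partitions]
    return reduce(lambda a, b: pq_plus(k, a, b), partials, EMPTY)


# The two queues from the slide, with K = 4.
pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
merged = pq_plus(4, pq1, pq2)  # [100, 80, 55, 45]

# The earlier TopK input, split into two partitions, with K = 3.
result = top_k(3, [[11, 12, 0], [3, 56, 48]])  # [56, 48, 12]
```

Because the addition is associative and commutative, the partial queues could be combined map-side, which is what makes a single pass possible.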
17. Do we have data structures that are intrinsically parallelizable?
18. Abstract Algebra Redux
•Semigroup
A set with an associative operation (grouping doesn’t matter)
•Monoid
Semigroup with a zero (zeros get ignored)
•Group
Monoid where every element has an inverse
•Abelian Group
Group whose operation is commutative (ordering doesn’t matter)
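To make the definitions above concrete, here is a minimal sketch in plain Python. It is not Algebird's Scala API; the `Monoid` class and the `chunked_sum` helper are hypothetical names used only for illustration. The point is that associativity lets each chunk be reduced independently before the partial results are combined.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    zero: Any                        # identity element
    plus: Callable[[Any, Any], Any]  # associative operation

    def sum(self, items):
        return reduce(self.plus, items, self.zero)


int_add = Monoid(0, lambda a, b: a + b)      # also commutative
str_concat = Monoid("", lambda a, b: a + b)  # associative, not commutative


def chunked_sum(monoid, items, chunk_size):
    """Reduce each chunk on its own (as a mapper would), then combine the partials."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [monoid.sum(chunk) for chunk in chunks]
    return monoid.sum(partials)


numbers = list(range(10))
letters = list("abcdef")
same_total = chunked_sum(int_add, numbers, 3) == int_add.sum(numbers)
same_text = chunked_sum(str_concat, letters, 4) == str_concat.sum(letters)
```

String concatenation is only associative, so the chunks have to be combined in their original order; integer addition is also commutative, so the partial sums could arrive in any order.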
21. Stream mining challenges
•Update predictions after every observation
•Single pass : can’t read old data or replay the stream
•Limited time for computation per observation
•Memory is bounded: O(n) storage of the stream is not an option
22. Existing solutions
•Knuth’s reservoir sampling works on an evolving stream of data, in fixed memory.
•Stream subsampling
•Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding trees
•Use time series analysis methods …
•Etc
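Of the solutions listed above, reservoir sampling is the easiest to show in code. The following is a minimal sketch of the classic replace-with-probability-k/n scheme (often called Algorithm R); the function name and the optional `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 10)
```

Memory stays at k items regardless of how long the stream runs, and each observation is looked at exactly once.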
25. Bloom filters
•Approximate data structure for set membership
•Like an approximate set
BloomFilter.contains(x) => Maybe | NO
P(False Positive) > 0
P(False Negative) = 0
26. •Bit Array of fixed size
add(x) : for each hash function i, set b[h(x,i)] = 1
contains(x) : TRUE if b[h(x,i)] == 1 for all i.
29. •Bloom Filters
Adding an element uses a boolean OR
Querying uses a boolean AND
Both are Monoids
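The add, contains and OR-based merge described above can be sketched as follows. This is a plain Python illustration, not Algebird's Bloom filter; the class layout, the default sizes and the use of SHA-256 to derive the hash functions h(x, i) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # a Python int used as the fixed-size bit array

    def _positions(self, item):
        # h(x, i): one bit position per hash function i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # boolean OR

    def contains(self, item):
        # True means "maybe", False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        """Monoid addition: bitwise OR of two filters with identical parameters."""
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


left, right = BloomFilter(), BloomFilter()
left.add("alice")
right.add("bob")
both = left.union(right)
```

An empty filter acts as the zero, and because OR is associative and commutative, filters built on separate partitions can be merged in any order.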
31. Intuition
•Long runs of trailing 0s in a random bit string are rare
•But the more bit strings you look at, the more likely you are to find a long one
•The longest run of trailing 0-bits seen can be used as an estimator of the number of unique bit strings observed.
32. HyperLogLog
•Popular sketch for cardinality estimation
HLL.size = Approx[Number]
The distribution of the error is known.
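A simplified HyperLogLog can be sketched as below. This is an illustration in plain Python, not Algebird's implementation: the class name, the default of 2^10 registers and the SHA-1 based 64-bit hash are assumptions. It counts leading rather than trailing zeros, which follows the same intuition, and it includes only the small-range correction.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Monoid addition: register-wise maximum."""
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


hll = HyperLogLog()
for user_id in range(10_000):
    hll.add(user_id)
approx = hll.estimate()
```

With m registers the relative standard error is roughly 1.04 / sqrt(m), so about 3% for the 1024 registers used here. Merging is a register-wise maximum, so sketches built on different partitions combine into exactly the sketch of the union.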
42. -Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.
-Many exist : Q-Tree, Q-Digest, T-Digest
-All of those are associative.
-Another neat thing : types your data uniformly.
43. Many more sketches and tricks
•FM Counters, KMV
•Histograms
•Ball Sketches : streaming k-means, clustering
•SGD : fit online machine learning algorithms
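Of the sketches listed above, KMV (k minimum values) is probably the shortest to write down. The following is a minimal Python sketch under stated assumptions: the class name, the default k and the SHA-256 based hash into [0, 1) are illustrative, and nothing here is taken from Algebird. The idea is to keep the k smallest hash values; if the k-th smallest is v, the number of distinct items is about (k - 1) / v.

```python
import hashlib


def _hash01(item):
    """Hash an item to a pseudo-uniform float in [0, 1)."""
    h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
    return h / 2 ** 64


class KMV:
    def __init__(self, k=256, values=()):
        self.k = k
        # Keep only the k smallest distinct hash values, sorted ascending.
        self.values = sorted(set(values))[:k]

    def add(self, item):
        return KMV(self.k, self.values + [_hash01(item)])

    def merge(self, other):
        """Monoid addition: union the hash values, keep the k smallest."""
        return KMV(self.k, self.values + other.values)

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k distinct hashes: exact count
        return (self.k - 1) / self.values[-1]


sketch = KMV()
for word in ["a", "b", "a", "c", "b"]:
    sketch = sketch.add(word)
distinct = sketch.estimate()
```

Rebuilding the sorted list on every `add` keeps the example short; a real implementation would use a bounded heap or tree instead.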
46. Conclusion
•Hashed data structures map onto the usual data structures (Set, Map, etc.), which are easier for developers to reason about
•As data size grows, sampling becomes painful; hashing provides a more cost-effective solution
•Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.
http://speakerdeck.com/samklr