OWF14 - Big Data Track : Abstract Algebra for Analytics

Abstract Algebra for Analytics
Sam BESSALAH
@samklr

What do we want?
•We want to build scalable systems.
•Preferably by leveraging distributed computing.
•A lot of analytics amounts to counting or adding in one form or another.

• Example : Finding TopK Elements
Read Input
Sort, Filter and take top K records
Write Output
Input : 11, 12, 0, 3, 56, 48 with K = 3
Output : 56, 48, 12

Hadoop Map-Reduce
Read Input
Sort, Filter and take top K records
Write Output

Problems
•Curse of the last reducer
•Network chatter, which hinders performance
•Inefficient ordering of the map and reduce steps
•Multiple jobs, with a sync barrier at the reducer

But in Scalding, « sortWithTake » uses :

Priority Queue
Can be empty
Two Priority Queues can be added in any order
Associative + Commutative
PQ1 : 55, 45, 21, 3
PQ2: 100, 80, 40, 3
K = 4
PQ1 (+) PQ2 : 100, 80, 55, 45

In a single Pass
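
The merge above can be sketched in a few lines. This is a minimal illustration in Python, not Scalding's actual implementation; the names pq_zero and pq_plus are hypothetical.

```python
import heapq
from functools import reduce

K = 4  # size bound shared by every queue, as in the example above


def pq_zero():
    """The empty queue: the identity element of the merge (hypothetical name)."""
    return []


def pq_plus(a, b, k=K):
    """Merge two bounded queues and keep only the k largest values."""
    return heapq.nlargest(k, a + b)


pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
merged = pq_plus(pq1, pq2)
print(merged)  # [100, 80, 55, 45]

# Each partition builds its own queue; the partial queues can then be
# merged in any order or grouping.
partitions = [[11, 12, 0], [3, 56], [48]]
partials = [pq_plus(pq_zero(), p, k=3) for p in partitions]
top3 = reduce(lambda a, b: pq_plus(a, b, k=3), partials, pq_zero())
print(top3)  # [56, 48, 12]
```

Because the merge is associative and commutative, every partition can compute its own bounded queue and the partial results combine without a global sort.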

Associativity allows parallelism
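
A small sketch of why this holds, assuming plain integer addition as the operation and a thread pool standing in for distributed workers: each chunk is folded independently, then the partial results are folded again.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))

# Split the input into independent chunks, one per (simulated) worker.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Each worker folds its own chunk; no coordination is needed because
# (a + b) + c == a + (b + c).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Combining the partial results gives the same answer as a sequential fold.
total = sum(partial_sums)
print(total == sum(data))  # True
```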

Do we have data structures that are intrinsically parallelizable?

Abstract Algebra Redux
•Semigroup
A set with an associative operation (grouping doesn’t matter)
•Monoid
Semigroup with a zero, i.e. an identity element (zeros get ignored)
•Group
Monoid in which every element has an inverse
•Abelian Group
Group whose operation is commutative (ordering doesn’t matter)
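
These definitions can be spot-checked on sample values. The sketch below is only an illustration: the Monoid class and the helper names are hypothetical (they are not Algebird's API), and checking a handful of samples is a sanity test, not a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    zero: Any                        # identity element
    plus: Callable[[Any, Any], Any]  # associative binary operation


def is_associative(m, samples):
    """Grouping doesn't matter: (a + b) + c == a + (b + c)."""
    return all(m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))
               for a, b, c in product(samples, repeat=3))


def has_identity(m, samples):
    """Zeros get ignored: 0 + a == a == a + 0."""
    return all(m.plus(m.zero, a) == a == m.plus(a, m.zero) for a in samples)


def is_commutative(m, samples):
    """Ordering doesn't matter: a + b == b + a."""
    return all(m.plus(a, b) == m.plus(b, a)
               for a, b in product(samples, repeat=2))


int_add = Monoid(0, lambda a, b: a + b)
str_concat = Monoid("", lambda a, b: a + b)
set_union = Monoid(frozenset(), lambda a, b: a | b)

print(is_associative(str_concat, ["a", "b", "ab"]))  # True
print(is_commutative(str_concat, ["a", "b", "ab"]))  # False
```

Integer addition also has inverses (a + (-a) == 0), so it forms an abelian group; string concatenation is a monoid but is not commutative.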

Stream mining challenges
•Update predictions after every observation
•Single pass : can’t read old data or replay the stream
•Limited time for computation per observation
•Limited memory : keeping all n observations (O(n) space) is not an option

Existing solutions
•Reservoir Sampling (as described by Knuth) works on an evolving stream of data in fixed memory.
•Stream subsampling
•Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees
•Use time series analysis methods …
•Etc
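
As an illustration of the first item, here is a minimal sketch of reservoir sampling in its usual "Algorithm R" form. The function name and the rng parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(42))
print(len(sample))  # 10
```

Memory stays at k items no matter how long the stream runs, and the stream is read exactly once.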

Approximate algorithms for stream analytics

Bloom filters
•Approximate data structure for set membership
•Like an approximate set
BloomFilter.contains(x) => Maybe | NO
P(False Positive) > 0
P(False Negative) = 0

•Bit Array of fixed size
add(x) : for each hash function i, set b[h(x,i)] = 1
contains(x) : TRUE if b[h(x,i)] == 1 for all i.
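
A minimal sketch of these two operations, assuming h(x, i) is obtained by salting SHA-256 with the index i (a choice made for this example; production filters typically use faster hashes). The sizes are arbitrary.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # fixed-size bit array b

    def _positions(self, item):
        # h(x, i): salt the item with the hash index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item):
        # False means "definitely not added"; True means "maybe added".
        return all(self.bits[pos] == 1 for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice")
print(bf.contains("alice"))  # True: an added element is never reported missing
```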

•Bloom Filters
Adding an element uses a boolean OR
Querying uses a boolean AND
Both are Monoids
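
To make the monoid view concrete, the sketch below stores the bit array in a single Python integer, so merging two filters is one bitwise OR. The hash scheme (salted SHA-256) and the sizes are assumptions of this example.

```python
import hashlib

NUM_BITS = 256
NUM_HASHES = 3


def positions(item):
    """Bit positions h(x, i) for each hash index i."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big")
        % NUM_BITS
        for i in range(NUM_HASHES)
    ]


def bloom_of(items):
    bits = 0  # the empty filter is the identity element for OR
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def maybe_contains(bits, item):
    # Querying: boolean AND over the bits selected by the hashes.
    return all((bits >> pos) & 1 for pos in positions(item))


left = bloom_of(["a", "b"])
right = bloom_of(["c"])
merged = left | right

# Merging two filters gives exactly the filter of the combined input.
print(merged == bloom_of(["a", "b", "c"]))  # True
```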

Intuition
•Long runs of trailing 0s in a random bit string are rare
•But the more bit strings you look at, the more likely you are to find a long one
•The longest run of trailing 0-bits seen can be used as an estimator of the number of unique bit strings observed.
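
A rough sketch of this intuition, assuming items are hashed to 32-bit values with SHA-256 (an arbitrary choice for this example). A single estimator like this one is very noisy; it only illustrates the idea.

```python
import hashlib


def trailing_zeros(n, width=32):
    """Number of trailing 0-bits in n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


def hash32(item):
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")


def rough_cardinality(items):
    # A run of r trailing zeros shows up about once every 2**r distinct
    # hashes, so 2**(longest run) is a crude estimate of the distinct count.
    longest = max((trailing_zeros(hash32(x)) for x in items), default=0)
    return 2 ** longest


# Duplicates hash to the same value, so they cannot change the estimate.
once = rough_cardinality(range(100))
repeated = rough_cardinality(list(range(100)) * 5)
print(once == repeated)  # True
```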

HyperLogLog
•Popular sketch for cardinality estimation
HLL.size = Approx[Number]
We know the distribution on the error.

http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

•HyperLogLog
Adding an element uses MAX, which is a monoid (an ordered semigroup, really ...)
Querying uses a harmonic sum : also a monoid.
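
A simplified sketch of both operations. It assumes 1024 registers, SHA-256 as the hash and trailing zeros for the rank, and it keeps only the small-range correction from the HyperLogLog paper; it is not Algebird's implementation and all function names are hypothetical.

```python
import hashlib
import math

P = 10                            # 2**P registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)  # bias constant, valid for M >= 128


def hash64(item):
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")


def hll_zero():
    return [0] * M


def hll_add(registers, item):
    h = hash64(item)
    index = h & (M - 1)  # low P bits choose the register
    rest = h >> P        # remaining 64 - P bits
    # rank = 1 + number of trailing zeros of the remaining bits
    rank = (rest & -rest).bit_length() if rest else 64 - P + 1
    registers[index] = max(registers[index], rank)


def hll_of(items):
    registers = hll_zero()
    for item in items:
        hll_add(registers, item)
    return registers


def hll_plus(a, b):
    # Merging two sketches: register-wise MAX.
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    # Querying: harmonic sum over the registers.
    harmonic = sum(2.0 ** -r for r in registers)
    raw = ALPHA * M * M / harmonic
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range correction
    return raw


left = hll_of(range(0, 6000))
right = hll_of(range(4000, 10000))
merged = hll_plus(left, right)
print(merged == hll_of(range(10000)))  # True: merging equals sketching the union
print(round(hll_estimate(merged)))     # an estimate of the 10000 distinct values
```

With 1024 registers the standard error is about 1.04 / sqrt(1024), roughly 3%.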

Min Hash
•Estimates how similar two sets are.
•Essentially amounts to estimating
|A ∩ B| / |A ∪ B|
•This ratio is the Jaccard Similarity
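
A minimal sketch of the idea, assuming 256 salted SHA-256 hash functions (an arbitrary choice for this example) and non-empty sets. The fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import hashlib


def h(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash over the set."""
    return [min(h(seed, x) for x in items) for seed in range(num_hashes)]


def sig_plus(sig_a, sig_b):
    # Signature of the union: element-wise MIN, associative and commutative.
    return [min(x, y) for x, y in zip(sig_a, sig_b)]


def estimate_jaccard(sig_a, sig_b):
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


A = set(range(0, 100))
B = set(range(50, 150))
sig_a, sig_b = signature(A), signature(B)

print(exact_jaccard(A, B))             # 0.3333333333333333
print(estimate_jaccard(sig_a, sig_b))  # an approximation of the value above
```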

Count-Min Sketch
Gives an approximation of the number of occurrences of an element in a stream (a multiset).

•Count min sketch
Adding an element is a numerical addition
Querying uses a MIN function.
Both are associative.
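
A minimal sketch of these operations, assuming a 4 x 256 table and salted SHA-256 hashes (both arbitrary choices for this example); the function names are hypothetical.

```python
import hashlib

DEPTH = 4    # number of hash rows
WIDTH = 256  # counters per row


def cms_zero():
    return [[0] * WIDTH for _ in range(DEPTH)]


def column(row, item):
    digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH


def cms_add(table, item, count=1):
    # Adding an element: numerical addition in one counter per row.
    for row in range(DEPTH):
        table[row][column(row, item)] += count


def cms_of(items):
    table = cms_zero()
    for item in items:
        cms_add(table, item)
    return table


def cms_plus(a, b):
    # Merging two sketches: cell-wise addition.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cms_query(table, item):
    # Querying: MIN over the rows; collisions can only inflate a counter,
    # so the result never underestimates the true count.
    return min(table[row][column(row, item)] for row in range(DEPTH))


left = cms_of(["a", "b", "a"])
right = cms_of(["a", "c"])
merged = cms_plus(left, right)

print(merged == cms_of(["a", "b", "a", "a", "c"]))  # True
print(cms_query(merged, "a") >= 3)                  # True
```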

-Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.
-Many exist : Q-Tree, Q-Digest, T-Digest
-All of those are associative.
-Another neat thing : it types your data uniformly.

Many more sketches and tricks
•FM Counters, KMV
•Histograms
•Ball Sketches : streaming k-means, clustering
•SGD : fit online machine learning algorithms

Conclusion
•Hashed data structures can be resolved to usual data structures like Set, Map, etc., which are easier for developers to reason about
•As data size grows, sampling becomes painful; hashing provides a better, more cost-effective solution
•Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.
http://speakerdeck.com/samklr

Bibliography
•Great intro into Algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
•Aggregate Knowledge
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
•Probabilistic data structures for web analytics
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
•Algebird
github.com/twitter/algebird
•Algebra for analytics
https://speakerdeck.com/johnynek/algebra-for-analytics
•http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
