Algebird: Abstract Algebra for Big Data Analytics. Devoxx 2014

Algebird: abstract algebra for analytics.

Devoxx 2014, Antwerp, Belgium

Published in: Software

  1. Algebird: Abstract Algebra for Analytics. Sam BESSALAH @samklr. Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr
  5. Abstract Algebra
  6. From Wikipedia
  7. Algebraic Structure
     “Set of values, coupled with one or more finite operations, and a set of laws those operations must obey.”
  8. Algebraic Structure
     “Set of values, coupled with one or more finite operations, and a set of laws those operations must obey.”
     E.g. Sum, Magma, Semigroup, Group, Monoid, Abelian Group, Semilattice, Ring, Monad, etc.
  9. Semigroup
     Semigroup law:
     (x <> y) <> z = x <> (y <> z) (associativity)
  10. Semigroup
      Semigroup law:
      (x <> y) <> z = x <> (y <> z) (associativity)
      trait Semigroup[T] { def aggregate(x: T, y: T): T }
  11. Monoids
      Monoid laws:
      (x <> y) <> z = x <> (y <> z) (associativity)
      identity <> x = x
      x <> identity = x (identity)
  12. Monoids
      Monoid laws:
      (x <> y) <> z = x <> (y <> z) (associativity)
      identity <> x = x
      x <> identity = x (identity / zero)
      trait Monoid[T] { def identity: T; def aggregate(x: T, y: T): T }
  13. Monoids
      Monoid laws:
      (x <> y) <> z = x <> (y <> z) (associativity)
      identity <> x = x
      x <> identity = x
      trait Monoid[T] extends Semigroup[T] { def identity: T }
  14. Groups
      Group laws:
      (x <> y) <> z = x <> (y <> z) (associativity)
      identity <> x = x
      x <> identity = x (identity)
      x <> inverse x = identity
      inverse x <> x = identity (invertibility)
  15. Groups
      Group laws:
      (x <> y) <> z = x <> (y <> z)
      identity <> x = x
      x <> identity = x
      x <> inverse x = identity
      inverse x <> x = identity
      trait Group[T] extends Monoid[T] { def inverse(v: T): T }
  16. Many More
      - Abelian groups (commutative groups)
      - Rings
      - Semilattices
      - Ordered semigroups
      - Fields
      ... Many of those are in Algebird.
  17. Examples
      - (a min b) min c = a min (b min c) with Int
      - a max (b max c) = (a max b) max c
      - a or (b or c) = (a or b) or c
      - a and (b and c) = (a and b) and c
      - Int addition
      - Set union
      - Harmonic sum
      - Integer mean
      - Priority queue
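A few of the examples above can be written as instances of the traits from the earlier slides. This is only a sketch: the traits are copied from the slides rather than from Algebird's real API, and the names `AlgebraInstances`, `intAddition`, `intMax`, `setUnion` and `combineAll` are made up for illustration.

```scala
// Traits as written on the slides (not Algebird's actual definitions).
trait Semigroup[T] { def aggregate(x: T, y: T): T }
trait Monoid[T] extends Semigroup[T] { def identity: T }
trait Group[T] extends Monoid[T] { def inverse(v: T): T }

// Hypothetical container for a few example instances.
object AlgebraInstances {
  // Int addition forms a group: associative, identity 0, inverse is negation.
  val intAddition: Group[Int] = new Group[Int] {
    def aggregate(x: Int, y: Int): Int = x + y
    def identity: Int = 0
    def inverse(v: Int): Int = -v
  }

  // Int max forms a monoid if Int.MinValue is used as the identity.
  val intMax: Monoid[Int] = new Monoid[Int] {
    def aggregate(x: Int, y: Int): Int = math.max(x, y)
    def identity: Int = Int.MinValue
  }

  // Set union forms a monoid with the empty set as identity.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def aggregate(x: Set[A], y: Set[A]): Set[A] = x union y
    def identity: Set[A] = Set.empty[A]
  }

  // Fold a collection with a monoid; an empty collection yields the identity.
  def combineAll[T](xs: Iterable[T], m: Monoid[T]): T =
    xs.foldLeft(m.identity)(m.aggregate)
}
```

Because `combineAll` only depends on the `Monoid` interface, the same aggregation code works for sums, maxima and set unions alike.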
  19. Why do we need those algebraic structures?
  20. We want to:
      - Build scalable analytics systems
      - Leverage distributed computing to perform aggregation on really large data sets
      After all, a lot of operations in analytics are just sorting and counting at the end of the day.
  21. Distributed Computing → Parallelism
  22. Distributed Computing → Parallelism
      Associativity → enables parallelism
  24. Distributed Computing → Parallelism
      - Associativity enables parallelism
      - Identity means we can ignore some data
      - Commutativity helps us ignore order
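The claim that associativity enables parallelism can be illustrated with a small sketch. The chunks below are processed sequentially, standing in for independent workers; `ParallelFold` and `chunkedReduce` are hypothetical names, not part of Algebird or Scalding.

```scala
// Hypothetical helper: reduce each chunk on its own, then reduce the partial results.
object ParallelFold {
  // For an associative `op` with identity `zero`, the result is the same as a
  // plain left fold over `xs`, whatever the (positive) chunk size.
  def chunkedReduce[T](xs: Vector[T], chunkSize: Int, zero: T)(op: (T, T) => T): T =
    xs.grouped(chunkSize)                      // split the input into chunks, keeping their order
      .map(chunk => chunk.foldLeft(zero)(op))  // each chunk could be handled by a separate worker
      .foldLeft(zero)(op)                      // combine the per-chunk partial results
}
```

String concatenation is associative but not commutative, and it still gives the right answer here because the chunk order is preserved. Subtraction is not associative, so its chunked result differs from the sequential one.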
  25. Typical Map Reduce ...
  26. Finding Top-K Elements in Scalding ...
      class TopKJob(args: Args) extends Job(args) {
        Tsv(args("input"), visitScheme)
          .filter(…)
          .leftJoinWithTiny(…)
          .filter(…)
          .groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }
          .write(Tsv(...))
      }
  27. .sortWithTake( … )
      Looking into .sortWithTake in Scalding, there’s one nice thing:
      class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]
  28. class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]
      Let’s take a look:
      PQ1: 55, 45, 21, 3
      PQ2: 100, 80, 40, 3
      top-4 (PQ1 U PQ2): 100, 80, 55, 45
      Priority Queue:
      - Can be empty
      - Two Priority Queues can be “added” in any order
      - Associative + Commutative
  29. class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]
      Let’s take a look:
      PQ1: 55, 45, 21, 3
      PQ2: 100, 80, 40, 3
      top-4 (PQ1 U PQ2): 100, 80, 55, 45
      Priority Queue:
      - Makes Scalding go fast, by doing sorting, filtering and extracting in one single “map” step
      - Can be empty
      - Two Priority Queues can be “added” in any order
      - Associative + Commutative
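The PQ1/PQ2 example above can be reproduced with a minimal stand-in for a bounded priority queue. This is a sketch under stated assumptions, not Algebird's `PriorityQueueMonoid`: it uses a descending `List` capped at `k` items, and `TopKSketch`, `build` and `plus` are hypothetical names.

```scala
// Hypothetical top-k monoid over descending lists of at most k elements.
final class TopKSketch[T](k: Int)(implicit ord: Ordering[T]) {
  // The empty queue is the identity element.
  val zero: List[T] = Nil

  // Keep only the k largest items, largest first.
  def build(items: Iterable[T]): List[T] =
    items.toList.sorted(ord.reverse).take(k)

  // "Adding" two queues: merge them and keep the k largest items again.
  def plus(a: List[T], b: List[T]): List[T] = build(a ++ b)
}
```

With `k = 4`, combining the two queues from the slide in either order gives `100, 80, 55, 45`, and combining with `zero` leaves a queue unchanged.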
  30. Stream Mining Challenges
      - Update predictions after each observation
      - Single pass: can’t read old data or replay the stream
      - Full size of the stream often unknown
      - Limited time for computation per observation
      - O(1) memory size
  31. Stream Mining Challenges
      http://radar.oreilly.com/2013/10/stream-mining-essentials.html
  32. Tradeoff: space and speed over accuracy.
  33. Tradeoff: space and speed over accuracy. Use sketches.
  34. Sketches
      Probabilistic data structures that store a (mostly hashed) summary of a data set that would be costly to store in its entirety, and so usually offer sublinear space or time costs.
      E.g. Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes
  35. Bloom filters
      Approximate data structure for set membership. Behaves like an approximate set.
      BloomFilter.contains(x) => NO | Maybe
      P(False Positive) > 0
      P(False Negative) = 0
  36. Internally:
      Bit array of fixed size
      add(x): for each hash function i, set b[h(x, i)] = 1
      contains(x): TRUE if b[h(x, i)] == 1 for all i (Boolean AND => associative)
      Both are associative => BF can be designed as a Monoid
  37. Bloom filters
      import com.twitter.algebird._
      import com.twitter.algebird.Operators._
      // generate a list
      val A = (1 to 300).toList
      // generate a Bloom filter
      val NUM_HASHES = 6
      val WIDTH = 6000 // bits
      val SEED = 1
      implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
      // approximate set with a Bloom filter
      val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)
      val approxBool = A_bf.contains("150")
      // ---> ApproximateBoolean(true, 0.9995…)
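The internals described on the Bloom filter slides (a fixed-width bit array, several hash positions per item, merge by bitwise OR) can be sketched without Algebird. This is not how Algebird implements `BloomFilterMonoid`; `BloomSketch` and its methods are hypothetical, and seeding `MurmurHash3` with the hash index is just one assumed way to get several hash functions.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical Bloom filter monoid over immutable bit sets.
final class BloomSketch(numHashes: Int, width: Int) {
  // The empty filter is the identity element.
  val zero: BitSet = BitSet.empty

  // One bit position per hash function; the hash index is used as the seed.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map(i => Math.floorMod(MurmurHash3.stringHash(item, i), width))

  // A filter holding a single item.
  def create(item: String): BitSet =
    positions(item).foldLeft(BitSet.empty)(_ + _)

  // Merging two filters is a bitwise OR, which is associative and commutative.
  def plus(a: BitSet, b: BitSet): BitSet = a | b

  // True means "maybe present"; false means "definitely absent".
  def contains(bf: BitSet, item: String): Boolean =
    positions(item).forall(bf.contains)
}
```

Folding `create` results with `plus` never produces a false negative, since every bit set for an item stays set after any number of merges.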
  38. Count Min Sketch
      Gives an approximation of the number of occurrences of an element in a data stream.
  39. Count Min Sketch
      Adding an element is a numerical addition. Querying uses a MIN function. Both are associative.
      Useful for detecting heavy hitters, top-K, LSH.
      We have in Algebird:
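The two operations named above (addition to update, MIN to query) can be sketched from scratch as follows. This is an illustration under assumptions, not Algebird's Count-Min Sketch API: `Cms`, `CmsSketch`, `create`, `plus` and `frequency` are hypothetical names, and the per-row hash is a seeded `MurmurHash3`.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical counter table: `depth` rows of `width` counters.
final case class Cms(table: Vector[Vector[Long]])

final class CmsSketch(depth: Int, width: Int) {
  // The all-zero table is the identity element.
  val zero: Cms = Cms(Vector.fill(depth, width)(0L))

  // Column hit by `item` in a given row; the row index is used as the hash seed.
  private def column(item: String, row: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(item, row), width)

  // A sketch holding a single occurrence of `item`.
  def create(item: String): Cms =
    Cms(Vector.tabulate(depth)(r => Vector.fill(width)(0L).updated(column(item, r), 1L)))

  // Merging is element-wise addition of the counters.
  def plus(a: Cms, b: Cms): Cms =
    Cms(a.table.zip(b.table).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } })

  // Estimated count: the minimum over the rows, which never underestimates.
  def frequency(cms: Cms, item: String): Long =
    (0 until depth).map(r => cms.table(r)(column(item, r))).min
}
```

Hash collisions can only inflate a counter, so taking the minimum across rows gives the tightest available upper bound on the true count.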
  40. HyperLogLog
      Popular sketch for cardinality estimation. Estimates the number of distinct values in a data set, within a probabilistic error bound.
      HLL.size = Approx[Number]
      Intuition:
      - Long runs of trailing 0s in a random bit string are rare.
      - But the more bit strings you look at, the more likely you are to find a long one.
      - The longest run of trailing 0-bits seen can be used as an estimator of the number of unique bit strings observed.
  41. Adding an element uses Max and Sum functions. Both are associative and form Monoids. (Max is really an ordered semigroup in Algebird.)
      Querying the cardinality uses a harmonic mean, which is a Monoid.
      In Algebird:
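The intuition above can be turned into a simplified HyperLogLog. This is a sketch only, not Algebird's `HyperLogLogMonoid`: `HllSketch` and its methods are hypothetical, it uses a 32-bit `MurmurHash3` hash, and it applies the standard raw estimator with only the small-range correction.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical simplified HyperLogLog with 2^bits registers.
final class HllSketch(bits: Int) {
  require(bits >= 7 && bits <= 16, "the alpha constant used below assumes at least 128 registers")
  val m: Int = 1 << bits

  // All-zero registers are the identity element.
  val zero: Vector[Int] = Vector.fill(m)(0)

  // Record one item: the top `bits` bits pick a register, and the register
  // keeps the longest run of trailing zeros (plus one) seen in the remaining bits.
  def add(regs: Vector[Int], item: String): Vector[Int] = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - bits)
    val rest = h & ((1 << (32 - bits)) - 1)
    val rank = if (rest == 0) 32 - bits + 1 else Integer.numberOfTrailingZeros(rest) + 1
    regs.updated(idx, math.max(regs(idx), rank))
  }

  // Merging two sketches is an element-wise max.
  def plus(a: Vector[Int], b: Vector[Int]): Vector[Int] =
    a.zip(b).map { case (x, y) => math.max(x, y) }

  // Harmonic-mean estimator, with linear counting while many registers are still empty.
  def estimate(regs: Vector[Int]): Double = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val raw = alpha * m * m / regs.map(r => math.pow(2.0, -r)).sum
    val zeros = regs.count(_ == 0)
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

Because each register only ever takes a max, sketches built on separate partitions of the data merge into exactly the sketch that a single pass over all the data would have produced, even when the partitions overlap.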
  42. Many more juicy sketches ...
      - MinHashes to compute Jaccard similarity
      - QTree for quantile estimation. Neat for anomaly detection.
      - SpaceSaverMonoid. Awesome for finding the approximate most frequent and top-K elements.
      - TopKMonoid
      - SGD, PriorityQueues, Histograms, etc.
  43. SummingBird: Lambda in a box
  44. Heard of Lambda Architecture?
  45. SummingBird
      Same code for both batch and real-time processing.
  46. SummingBird
      Same code for both batch and real-time processing. But works only on Monoids. Uses Storehaus as a mergeable store layer.
  47. http://github.com/twitter/algebird
  49. These slides: http://bit.ly/1szncAZ http://slidesha.re/1zhhXKU
  51. Links
      - Algebra for analytics by Oscar Boykin (creator of Algebird): http://speakerdeck.com/johnynek/algebra-for-analytics
      - Take a look into HLearn: https://github.com/mikeizbicki/HLearn
      - Great intro into Algebird by Michael Noll: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
      - Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
      - Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
      - http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
      - http://infolab.stanford.edu/~ullman/mmds/ch3.pdf