Introduction To Algebird: Abstract algebra for analytics
1. Presented By :
Simarpreet Kaur Monga
Software Consultant
Knoldus Software LLP
2. Agenda
● What is Algebird ?
● Problems in computation in Big Data Analysis
● Advantages of Abstract Algebra in computing world.
● Why Algebird ?
● Projects using Algebird
● Example Usage - Counting With Algebird and Spark
● Demo
● References
3. What is Algebird ?
● Algebird is a library that provides abstractions for abstract
algebra in the Scala programming language.
● Twitter recently open-sourced Algebird as a JVM library for
working with algebraic data structures.
● Algebird was originally developed to support Scalding,
Twitter's Scala library for Hadoop.
4. Problems in Computation in Big Data
Analysis
➢ How do you combine or add two instances of a complex data
structure in an easy way?
val first = BloomFilter(...)
val second = BloomFilter(...)
first + second == uh?
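One way to picture what "first + second" could mean is the sketch below. It is a toy stand-in written in plain Scala, not Algebird's BloomFilter: the names ToyBloom, numBits, numHashes, positions and of are all hypothetical, and the set bits are modelled as a Set[Int] so that merging two filters is simply the union (bitwise OR) of their bits.

```scala
import scala.util.hashing.MurmurHash3

// Toy stand-in for a Bloom filter (hypothetical, NOT Algebird's BloomFilter).
// The set bits are stored as a Set[Int] to keep the sketch short.
final case class ToyBloom(bits: Set[Int]) {
  // Merging two filters is a bitwise OR, modelled here as set union.
  def +(other: ToyBloom): ToyBloom = ToyBloom(bits | other.bits)

  // true  => the item may be present (false positives are possible)
  // false => the item is definitely absent
  def mightContain(item: String): Boolean =
    ToyBloom.positions(item).forall(bits.contains)
}

object ToyBloom {
  val numBits = 1024   // assumed filter width
  val numHashes = 3    // assumed number of hash functions

  // The empty filter acts as the identity element for +.
  val empty: ToyBloom = ToyBloom(Set.empty[Int])

  // Bit positions for an item, one per seeded hash.
  def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      Math.floorMod(MurmurHash3.stringHash(item, seed), numBits)
    }

  def of(items: String*): ToyBloom =
    ToyBloom(items.flatMap(item => positions(item)).toSet)
}

val first = ToyBloom.of("alice", "bob")
val second = ToyBloom.of("carol")
val merged = first + second   // contains everything from both filters
```

Because set union is associative and commutative and has an empty element, this toy filter already behaves like a monoid, which is the property the following slides build on.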
5. Problems in Computation in Big Data
Analysis
➢ What about performing other operations on those Bloom
filter instances, notably in data processing pipelines built
from common functions such as map, flatMap, foldLeft and
reduceLeft?
val filters = Seq[BloomFilter](...)
val summary = filters
  .flatMap { /* magic happens here */ }
  .reduceLeft { /* more magic */ } ...
7. Why was Ted being asked whether t-digest is associative?
And how does all this relate to semigroups and monoids?
8. Advantages Of Abstract Algebra in
Computing World
● Abstract algebra allows you to write reusable code.
● Data analysis involves lots of concepts that can be described
using abstract algebra.
● Using abstractions from abstract algebra (specifically
associative and commutative operations) allows you to merge
distributed computations without concern for the order in
which they are merged.
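The order-independence claim can be illustrated with a small plain-Scala sketch. The data and the helper mergeCounts are made up for illustration; the point is only that an associative and commutative merge gives the same answer however the partial results are ordered or grouped.

```scala
// Merge two partial word-count maps by adding counts key by key.
// (Hypothetical helper, written for this illustration.)
def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
  b.foldLeft(a) { case (acc, (word, n)) =>
    acc.updated(word, acc.getOrElse(word, 0) + n)
  }

// Partial results, as if produced by three independent workers (made-up data).
val partials = List(
  Map("spark" -> 2, "scala" -> 1),
  Map("scala" -> 4),
  Map("spark" -> 1, "hadoop" -> 3)
)

// Same merge, three different orders/groupings.
val inOrder = partials.reduce((a, b) => mergeCounts(a, b))
val reversed = partials.reverse.reduce((a, b) => mergeCounts(a, b))
val regrouped = mergeCounts(partials(0), mergeCounts(partials(1), partials(2)))
```

All three values are the same map, so a scheduler is free to combine worker outputs in whatever order they arrive.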
9. Why Algebird ?
If you can turn a data structure into a monoid (or semigroup, or
…), then Algebird allows you to put it to good use. You can then
work with your data structure as conveniently as you already do
with Int, Double or List, and you can also use it with large-scale
data processing tools such as Hadoop and Storm.
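A minimal sketch of the idea, in plain Scala and without the Algebird dependency: the hand-rolled SimpleMonoid type class below mirrors the two members that Algebird's Monoid exposes (zero and plus), while Stats and sumAll are hypothetical names introduced only for this example. Once an instance exists for a type, the same generic code works for it as for Int.

```scala
// Hand-rolled monoid type class (illustration only, not Algebird's code).
trait SimpleMonoid[T] {
  def zero: T                // identity element
  def plus(a: T, b: T): T    // associative combine
}

// Hypothetical custom structure: enough state to compute a mean later.
final case class Stats(count: Long, sum: Double)

object SimpleMonoid {
  implicit val intAddition: SimpleMonoid[Int] = new SimpleMonoid[Int] {
    val zero: Int = 0
    def plus(a: Int, b: Int): Int = a + b
  }

  implicit val statsMonoid: SimpleMonoid[Stats] = new SimpleMonoid[Stats] {
    val zero: Stats = Stats(0L, 0.0)
    def plus(a: Stats, b: Stats): Stats =
      Stats(a.count + b.count, a.sum + b.sum)
  }
}

// Generic aggregation: works for any type that has a SimpleMonoid instance.
def sumAll[T](items: Seq[T])(implicit m: SimpleMonoid[T]): T =
  items.foldLeft(m.zero)((acc, x) => m.plus(acc, x))

val total = sumAll(Seq(1, 2, 3, 4))
val stats = sumAll(Seq(Stats(1L, 2.0), Stats(1L, 4.0)))
```

The design choice worth noting is that sumAll never mentions Int or Stats; supporting a new data structure means writing one instance, not a new aggregation function.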
10. Monoids & Monads
As a grossly simplified rule of thumb:
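➢ Monoid: If you want to “attach” operations such as +, -, *,
/ or <= to data objects – say, adding two Bloom filters –
then you want to provide monoid forms for those data
objects (e.g. a monoid for your Bloom filter data structure).
This way you can combine and juggle your custom data
structures just like you would do with plain integer
numbers. Strictly speaking, a monoid supplies a single
associative combining operation (such as +) together with
an identity element; operations like - or / need richer
structures such as groups and fields.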
11. ➢ Monad: If you want to create data processing pipelines that turn
data objects step-by-step into the desired, final output (e.g.
aggregating raw records into summary statistics), then you want to
build one or more monads to model these data pipelines. This is
particularly useful if you want to run those pipelines on large-scale
data processing platforms such as Hadoop or Storm.
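As a rough illustration of such a pipeline, the sketch below uses two monads from the Scala standard library, List and Option, rather than anything Algebird-specific. The record format ("user,latencyMillis"), the sample data and the parse helper are assumptions made for this example.

```scala
import scala.util.Try

// Hypothetical raw records in the form "user,latencyMillis" (made-up data).
val rawRecords = List("alice,120", "bob,not-a-number", "carol,80", "broken-line")

// Parsing may fail, so the step returns an Option instead of throwing.
def parse(line: String): Option[(String, Int)] =
  line.split(",") match {
    case Array(user, latency) =>
      Try(latency.trim.toInt).toOption.map(ms => (user, ms))
    case _ => None
  }

// The for-comprehension desugars to flatMap/map: each step feeds the next,
// and records that fail to parse simply drop out of the pipeline.
val latencies: List[Int] =
  for {
    line <- rawRecords
    record <- parse(line).toList
  } yield record._2

// Final summary statistics: (number of valid records, total latency).
val summary = (latencies.size, latencies.sum)
```

Only the two well-formed records survive, so the summary covers "alice" and "carol"; the malformed lines never reach the aggregation step.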
17. Counting With Algebird and Spark
Algebraic Properties Of Integers and Addition
Associative: (a + b) + c = a + (b + c) => can be partitioned/parallelized
Closed: Int + Int = Int => can be reduced in multiple stages
Identity: a + 0 = a => supports “empty”/“zero” items
Integers form a Monoid under Addition
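The properties listed above can be spot-checked in plain Scala. This is a check over a handful of arbitrary sample values, not a proof, and the value names are chosen only for this example. Closure needs no runtime check because the compiler already types Int + Int as Int.

```scala
// Arbitrary sample values; a spot check, not a proof.
val samples = List(-7, 0, 3, 42, 1000)

// Associativity: (a + b) + c == a + (b + c) for every sampled triple.
val additionAssociative =
  samples.forall(a => samples.forall(b => samples.forall(c =>
    (a + b) + c == a + (b + c))))

// Identity: adding 0 on either side leaves the value unchanged.
val zeroIsIdentity = samples.forall(a => a + 0 == a && 0 + a == a)

// Closure is enforced by the type system: the result is again an Int.
val closed: Int = samples(2) + samples(3)

// Counter-example: subtraction is not associative,
// so integers under subtraction do not form a monoid.
val subtractionAssociative =
  samples.forall(a => samples.forall(b => samples.forall(c =>
    (a - b) - c == a - (b - c))))
```

The contrast with subtraction shows why the choice of operation matters as much as the choice of type.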
Monoid computations can be effectively run on Spark
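To see why this matters for a system like Spark, the following sketch simulates partitioned counting with ordinary Scala collections. No Spark API is used: the event data is made up, and grouped(3) is only a stand-in for how a cluster might split the data into partitions.

```scala
// Made-up events; in Spark this would be a distributed dataset, here a List.
val events = List("click", "view", "click", "click", "view", "purchase", "click")

// Stand-in for partitioning: split the data into chunks of three.
val partitions = events.grouped(3).toList

// Stage 1: each "partition" counts its own clicks, starting from the identity 0.
val partialCounts = partitions.map { part =>
  part.foldLeft(0)((acc, e) => if (e == "click") acc + 1 else acc)
}

// Stage 2: combine the partial counts; associativity makes the grouping irrelevant.
val total = partialCounts.foldLeft(0)(_ + _)

// Reference result computed in a single pass over all the data.
val sequential = events.count(_ == "click")
```

The two-stage result matches the single-pass count, and an empty partition would contribute the identity 0 without changing the total.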