Abstract Algebra for Analytics 
Sam BESSALAH 
@samklr
What do we want? 
•We want to build scalable systems. 
•Preferably by leveraging distributed computing 
•Many analytics tasks boil down to counting or adding in one form or another.
• Example : Finding TopK Elements 
Read Input 
Sort, Filter and take top K records 
Write Output 
Input : 11, 12, 0, 3, 56, 48 with K = 3 
Output : 56, 48, 12
Hadoop Map-Reduce
In Scalding
Problems 
•Curse of the last reducer 
•Network chatter, which hinders performance 
•Inefficient ordering of map and reduce steps 
•Multiple jobs, with a sync barrier at the reducer
But in Scalding, « sortWithTake » uses : 
Priority Queue 
Can be empty 
Two Priority Queues can be added in any order 
Associative + Commutative 
PQ1 : 55, 45, 21, 3 
PQ2: 100, 80, 40, 3 
K = 4 
PQ1 (+) PQ2 : 100, 80, 55, 45 
In a single Pass
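The priority-queue addition above can be sketched in a few lines of Python. This is only an illustration of the idea, not Scalding's implementation: the names `topk_zero`, `topk_plus` and `topk_of_partitions` are made up for this example, and a sorted list stands in for the priority queue.

```python
import heapq
from functools import reduce

K = 4  # same K as in the PQ1 / PQ2 example above


def topk_zero():
    """The empty queue: adding it to any queue changes nothing."""
    return []


def topk_plus(a, b, k=K):
    """Add two top-K queues: pool the values and keep the k largest."""
    return heapq.nlargest(k, a + b)


def topk_of_partitions(partitions, k):
    """Take a local top-k per partition, then add the partial results."""
    partials = [heapq.nlargest(k, p) for p in partitions]  # map side
    return reduce(lambda a, b: topk_plus(a, b, k), partials, topk_zero())


pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
print(topk_plus(pq1, pq2))  # [100, 80, 55, 45]
print(topk_of_partitions([[11, 12, 0], [3, 56, 48]], 3))  # [56, 48, 12]
```

Because `topk_plus` is associative and commutative, the partial queues can be combined in any grouping and any order, which is what lets each partition be reduced locally before anything crosses the network.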
Why is it better and faster?
Associativity allows parallelism
Do we have data structures that are intrinsically parallelizable?
Abstract Algebra Redux 
•Semi Group 
A set with an associative operation (grouping doesn’t matter) 
•Monoid 
Semi Group with a zero (zeros get ignored) 
•Group 
Monoid where every element has an inverse 
•Abelian Group 
Group whose operation is commutative (ordering doesn’t matter)
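To make the definitions above concrete, here is a minimal Python sketch of a monoid as "a zero plus an associative operation". The `Monoid` class, its `fold` method and the `chunked_sum` helper are names invented for this illustration; they are not Algebird's API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    zero: T                    # identity element: plus(zero, x) == x
    plus: Callable[[T, T], T]  # associative binary operation

    def fold(self, items: Iterable[T]) -> T:
        """Combine all items left to right, starting from zero."""
        return reduce(self.plus, items, self.zero)


int_add = Monoid(0, lambda a, b: a + b)
num_max = Monoid(float("-inf"), max)
set_union = Monoid(frozenset(), lambda a, b: a | b)


def chunked_sum(monoid, items, n_chunks):
    """Fold each chunk independently, then fold the partial results.

    Associativity guarantees the same answer as one sequential fold,
    so each chunk could be handled by a different machine.
    """
    items = list(items)
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    partials = [monoid.fold(chunk) for chunk in chunks]
    return monoid.fold(partials)


print(chunked_sum(int_add, range(1, 101), 4))          # 5050
print(chunked_sum(num_max, [11, 12, 0, 3, 56, 48], 3))  # 56
```

The zero matters in practice: an empty chunk simply folds to `zero` and drops out of the final result.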
Stream mining challenges 
•Update predictions after every observation 
•Single pass : can’t read old data or replay the stream 
•Limited time for computation per observation 
•Limited memory : keeping all n observations (O(n) memory) is not affordable
Existing solutions 
•Knuth’s Reservoir Sampling works on an evolving stream of data in fixed memory. 
•Stream subsampling 
•Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees 
•Use time series analysis methods … 
•Etc
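As a reference point for the sampling approach listed above, here is a minimal sketch of reservoir sampling in Python (the classic "Algorithm R" variant). The function name and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Memory stays at k items no matter how long the stream is.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # uniform over 0..i, both ends included
            if j < k:
                reservoir[j] = item  # item i is kept with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(1_000_000), 5, random.Random(42))
print(sample)
```

Note that two reservoirs cannot simply be "added" the way the sketches in the following slides can: merging samples needs extra bookkeeping such as the number of items each one has seen.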
Approximate algorithms for stream analytics
Idea : Hash, don’t Sample
Bloom filters 
•Approximate data structure for set membership 
•Like an approximate set 
BloomFilter.contains(x) => Maybe | NO 
P(False Positive) > 0 
P(False Negative) = 0
•Bit array of fixed size 
add(x) : for every hash function i, set b[h(x,i)] = 1 
contains(x) : TRUE if b[h(x,i)] == 1 for all i.
•Bloom Filters 
Adding an element uses a boolean OR 
Querying uses a boolean AND 
Both are Monoids
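A minimal Bloom filter sketch in Python shows where the OR and the AND appear. Everything here is an assumption made for illustration: the class layout, the default sizes, and the use of SHA-256 with a per-function prefix to simulate h(x,i). A Python integer plays the role of the bit array.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3, bits=0):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bits  # integer used as a fixed-size bit array

    def _positions(self, item):
        """Yield h(x, i) for each of the n_hashes hash functions."""
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # boolean OR sets the bit

    def __contains__(self, item):
        # boolean AND over all probed bits: True means "maybe", False means "no"
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def __or__(self, other):
        """Union of two filters built with the same parameters."""
        assert (self.n_bits, self.n_hashes) == (other.n_bits, other.n_hashes)
        return BloomFilter(self.n_bits, self.n_hashes, self.bits | other.bits)


a, b = BloomFilter(), BloomFilter()
a.add("alice")
b.add("bob")
union = a | b
print("alice" in union, "bob" in union)  # True True
```

The empty filter (all bits 0) is the zero of the OR monoid, so filters built on separate partitions can be merged in any order.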
HyperLogLog
Intuition 
•Long runs of trailing 0s in a random bit string are rare 
•But the more bit strings you look at, the more likely you are to find a long one 
•The longest run of trailing 0-bits seen can serve as an estimator of the number of unique bit strings observed : a longest run of n suggests about 2^n distinct values.
HyperLogLog 
•Popular sketch for cardinality estimation 
HLL.size = Approx[Number] 
The distribution of the error is known.
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
•HyperLogLog 
Adding an element uses MAX, which is a monoid (Ordered Semi Group really ...) 
Querying uses a harmonic sum : Monoid.
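The MAX and harmonic-sum steps can be seen in a simplified HyperLogLog written in Python. This is a teaching sketch under stated assumptions, not a production implementation: it uses the first 64 bits of SHA-256 as the hash, the standard bias constant for 128 or more registers, and only the small-range (linear counting) correction. The class and method names are chosen for this example.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10, registers=None):
        self.p = p       # number of bits used to pick a register
        self.m = 1 << p  # number of registers
        self.registers = list(registers) if registers is not None else [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        index = h & (self.m - 1)               # low p bits choose the register
        rest = h >> self.p                     # remaining 64 - p bits
        if rest == 0:
            rank = 64 - self.p + 1
        else:
            rank = (rest & -rest).bit_length()  # trailing zeros + 1
        self.registers[index] = max(self.registers[index], rank)  # MAX

    def merge(self, other):
        """Register-wise MAX: associative and commutative."""
        assert self.p == other.p
        merged = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return HyperLogLog(self.p, merged)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        harmonic_sum = sum(2.0 ** -r for r in self.registers)
        raw = alpha * self.m ** 2 / harmonic_sum
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting for small sets
        return raw


a, b = HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    a.add(i)
for i in range(4000, 10000):
    b.add(i)
print(round(a.merge(b).estimate()))  # close to 10000, the size of the union
```

With 1024 registers the expected relative error is about 1.04 / sqrt(1024), roughly 3%. Merging two sketches gives exactly the registers that a single sketch over the combined data would hold.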
Min Hash 
•Estimates how similar two sets are. 
•The probability that two sets share the same minimum hash value equals 
|A ∩ B| / |A ∪ B| 
•This ratio is the Jaccard Similarity
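A small MinHash sketch in Python, assuming seeded SHA-256 as a stand-in for a family of independent hash functions. The function names are invented for this example, and the input sets must be non-empty.

```python
import hashlib


def _hash(seed, item):
    """One member of the hash family, selected by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, n_hashes=256):
    """For each hash function keep only the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(n_hashes)]


def merge_signatures(sig_a, sig_b):
    """Signature of the union: element-wise MIN (associative, commutative)."""
    return [min(x, y) for x, y in zip(sig_a, sig_b)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


A = set(range(0, 100))
B = set(range(50, 150))
sig_a, sig_b = minhash_signature(A), minhash_signature(B)
print(estimate_jaccard(sig_a, sig_b))  # near the true value 50 / 150
```

The estimate sharpens as `n_hashes` grows: its standard error is about sqrt(J(1 - J) / n_hashes) for a true similarity J.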
Count-Min Sketch 
Gives an approximation of the number of occurrences of an element in a stream.
•Count min sketch 
Adding an element is a numerical addition 
Querying uses a MIN function. 
Both are associative.
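The addition and the MIN can be shown with a compact Count-Min sketch in Python. The class name, default width and depth, and the row-prefixed SHA-256 hashing are all assumptions made for this illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=5, table=None):
        self.width = width  # counters per row
        self.depth = depth  # number of rows, one hash function each
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield the (row, column) cell touched by item in every row."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count  # numerical addition

    def estimate(self, item):
        # MIN over rows: collisions can only inflate a counter, never shrink it
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        """Cell-wise addition of two sketches with identical dimensions."""
        assert (self.width, self.depth) == (other.width, other.depth)
        table = [[x + y for x, y in zip(r1, r2)]
                 for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, table)


s1, s2 = CountMinSketch(), CountMinSketch()
for word in ["a", "a", "a", "b"]:
    s1.add(word)
for word in ["a", "a", "c"]:
    s2.add(word)
merged = s1.merge(s2)
print(merged.estimate("a"))  # at least 5, the true count
```

The estimate never undercounts. It can overcount when other items collide with the queried one in every row, which becomes less likely as `width` and `depth` grow.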
Anomaly Detection
-Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data. 
-Many exist : Q-Tree, Q-Digest, T-Digest 
-All of those are associative. 
-Another neat thing : it types your data uniformly.
Many more sketches and tricks 
•FM Counters, KMV 
•Histograms 
•Ball Sketches : streaming k-means, clustering 
•SGD : fit online machine learning algorithms
Algebird
Conclusion 
•Hashed data structures can be resolved to usual data structures like Set, Map, etc., which are easier for developers to reason about 
•As data size grows, sampling becomes painful; hashing provides a more cost-effective solution 
•Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems. 
http://speakerdeck.com/samklr
DON’T BE SCARED ANYMORE.
Bibliography 
•Great intro to Algebird 
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/ 
•Aggregate Knowledge 
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ 
•Probabilistic data structures for web analytics 
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ 
•Algebird 
github.com/twitter/algebird 
•Algebra for analytics 
https://speakerdeck.com/johnynek/algebra-for-analytics 
•Mining of Massive Datasets, chapter 3 
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

OWF14 - Big Data Track : Abstract Algebra for Analytics
