Abstract Algebra for Analytics 
Sam BESSALAH 
@samklr
OWF14 - Big Data Track : Abstract Algebra for Analytics
What do we want? 
•We want to build scalable systems. 
•Preferably by leveraging distributed computing. 
•A lot of analytics amounts to counting or adding in some way.
• Example : Finding TopK Elements 
Read Input 
Sort, Filter and take top K records 
Write Output 
Input : 11, 12, 0, 3, 56, 48 with K = 3 
Output : 56, 48, 12
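The three steps above can be sketched in a few lines of Python. This is only an illustration of the single-machine version of the job; the function name `top_k` is hypothetical and not taken from the slides.

```python
def top_k(records, k):
    """Sort the records in descending order and keep the first k."""
    return sorted(records, reverse=True)[:k]

# The slide's example: six input values, K = 3.
print(top_k([11, 12, 0, 3, 56, 48], 3))  # [56, 48, 12]
```

The full sort is what becomes expensive once the input no longer fits on one machine.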
Hadoop Map-Reduce
In Scalding
Problems 
•Curse of the last reducer 
•Network chatter hinders performance 
•Inefficient ordering of map and reduce steps 
•Multiple jobs, with a sync barrier at the reducer
But in Scalding, « sortWithTake » uses : 
Priority Queue 
Can be empty 
Two Priority Queues can be added in any order 
Associative + Commutative 
PQ1 : 55, 45, 21, 3 
PQ2: 100, 80, 40, 3 
K = 4 
PQ1 (+) PQ2 : 100, 80, 55, 45 
In a single Pass
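A minimal sketch of the bounded priority queue as a monoid, assuming a plain Python list stands in for the queue. The names `pq_zero` and `pq_plus` are hypothetical; this is not Scalding's or Algebird's actual implementation.

```python
import heapq
from functools import reduce

K = 4

def pq_zero():
    """The empty queue: adding it to any queue changes nothing."""
    return []

def pq_plus(a, b, k=K):
    """Merge two bounded queues, keeping only the k largest values."""
    return heapq.nlargest(k, a + b)

pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
print(pq_plus(pq1, pq2))  # [100, 80, 55, 45]

# Each partition builds its own queue; the partial queues are then merged.
partitions = [[11, 12, 0], [3, 56, 48]]
partials = [pq_plus(pq_zero(), part, k=3) for part in partitions]
print(reduce(lambda a, b: pq_plus(a, b, k=3), partials, pq_zero()))  # [56, 48, 12]
```

Because the merge is associative and commutative, the partial queues can be combined in whatever order they arrive, so no global sort is needed.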
Why is it better and faster?
Associativity allows parallelism
Do we have data structures that are intrinsically parallelizable?
Abstract Algebra Redux 
•Semigroup 
A set with an associative operation (grouping doesn’t matter) 
•Monoid 
A semigroup with a zero, i.e. an identity element (zeros get ignored) 
•Group 
A monoid in which every element has an inverse 
•Abelian Group 
A group whose operation is commutative (ordering doesn’t matter)
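To make the definitions concrete, here is a small sketch of a monoid as a zero plus an associative operation, and of why associativity permits splitting the work into chunks. The `Monoid` class and `chunked_sum` helper are hypothetical names for illustration, not Algebird's API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Monoid:
    zero: Any                        # identity element
    plus: Callable[[Any, Any], Any]  # associative binary operation

    def sum(self, items):
        """Fold the items with plus, starting from zero."""
        return reduce(self.plus, items, self.zero)

int_add = Monoid(0, lambda a, b: a + b)
int_max = Monoid(float("-inf"), max)

def chunked_sum(monoid, items, chunk_size):
    """Sum each chunk independently (as parallel workers could), then sum the partials."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [monoid.sum(chunk) for chunk in chunks]
    return monoid.sum(partials)

data = [11, 12, 0, 3, 56, 48]
print(chunked_sum(int_add, data, 2) == int_add.sum(data))  # True
```

The chunked result matches the sequential one for any chunk size, which is exactly the property a distributed reducer relies on.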
Stream mining challenges 
•Update predictions after every observation 
•Single pass : can’t read old data or replay the stream 
•Limited time for computation per observation 
•Bounded memory : storing all n observations, i.e. O(n) memory, is not an option
Existing solutions 
•Knuth’s Reservoir Sampling works on an evolving stream of data, in fixed memory. 
•Stream subsampling 
•Adaptive sliding windows : build decision trees on these windows, e.g. Hoeffding Trees 
•Use time series analysis methods … 
•Etc.
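As an illustration of the first item, reservoir sampling can be sketched as follows. This is the classic "Algorithm R" variant; the function name and the optional `rng` parameter are assumptions made for the example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(iter(range(10_000)), 5, random.Random(42))
print(len(sample))  # 5
```

Memory stays at k items however long the stream runs, and each item is inspected exactly once.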
Approximate algorithms for stream analytics
Idea : Hash, don’t Sample
Bloom filters 
•Approximate data structure for set membership 
•Like an approximate set 
BloomFilter.contains(x) => Maybe | NO 
P(False Positive) > 0 
P(False Negative) = 0
•Bit Array of fixed size 
add(x) : for each hash function i, set b[h(x,i)] = 1 
contains(x) : TRUE if b[h(x,i)] == 1 for all i.
•Bloom Filters 
Adding an element uses a boolean OR 
Querying uses a boolean AND 
Both are Monoids
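A minimal Bloom filter sketch along these lines, assuming salted SHA-256 digests play the role of the hash functions h(x, i) and a Python integer serves as the bit array. The class and its parameters are hypothetical and untuned, not Algebird's implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a fixed-size bit array

    def _positions(self, item):
        # One bit position per hash function, derived from a salted digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # boolean OR sets the bit

    def __contains__(self, item):
        # Boolean AND over all positions: any unset bit means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def __or__(self, other):
        """Monoid plus: the union of two filters is the bitwise OR of their bits."""
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged

a, b = BloomFilter(), BloomFilter()
a.add("alice")
b.add("bob")
union = a | b
print("alice" in union, "bob" in union)  # True True
```

The empty filter is the zero of the monoid, so filters built on separate partitions can be merged in any order.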
HyperLogLog
Intuition 
•Long runs of trailing 0s in a random bit string are rare 
•But the more bit strings you look at, the more likely you are to find a long one 
•The longest run of trailing 0-bits seen can serve as an estimator of the number of unique bit strings observed.
HyperLogLog 
•Popular sketch for cardinality estimation 
HLL.size = Approx[Number] 
We know the distribution on the error.
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
•HyperLogLog 
Adding an element uses MAX, which is a 
monoid (an ordered semigroup, really ...) 
Querying uses a harmonic sum : also a monoid.
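A simplified HyperLogLog sketch showing both operations. It assumes SHA-1 as the hash, uses trailing zeros as in the intuition above, and applies the standard bias constant for 128 or more registers; the class name and the default `p=10` are choices made for this example only.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                  # p >= 7 so the alpha formula below applies
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)      # low p bits choose a register
        rest = h >> self.p          # remaining 64 - p bits
        # Rank = number of trailing zero bits + 1.
        rank = (rest & -rest).bit_length() if rest else 64 - self.p + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Monoid plus: register-wise MAX."""
        assert self.p == other.p
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Harmonic sum over the registers.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range correction
        return raw

left, right, whole = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    left.add(i)
for i in range(4000, 10000):
    right.add(i)
for i in range(10000):
    whole.add(i)
print(left.merge(right).registers == whole.registers)  # True
```

Merging the two partial sketches gives exactly the registers of a sketch built over the whole stream, even though the two halves overlap.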
Min Hash 
•Estimates how similar two sets are. 
•Essentially amounts to 
|A ∩ B| / |A ∪ B| 
•i.e. the Jaccard similarity
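A small MinHash sketch, assuming salted SHA-256 digests stand in for independent hash functions. The function names and the choice of 128 hashes are hypothetical; the estimate is the fraction of signature positions on which two sets agree.

```python
import hashlib

NUM_HASHES = 128

def _hash(i, item):
    digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def signature(items, num_hashes=NUM_HASHES):
    """For each hash function, keep the minimum hash over a non-empty set."""
    items = list(items)
    return [min(_hash(i, x) for x in items) for i in range(num_hashes)]

def merge(sig_a, sig_b):
    """Signature of the union: position-wise MIN, which is associative."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: share of positions where the minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

set_a = set(range(0, 200))
set_b = set(range(100, 300))   # true Jaccard similarity is 100 / 300
sig_a, sig_b = signature(set_a), signature(set_b)
print(round(similarity(sig_a, sig_b), 2))
```

As with the other sketches, signatures computed on separate partitions can be merged without revisiting the data.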
Count min Sketch 
Gives an approximation of the number of occurrences of an element in a stream (a multiset).
•Count min sketch 
Adding an element is a numerical addition 
Querying uses a MIN function. 
Both are associative.
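A minimal count-min sketch illustrating the two operations, again assuming salted SHA-256 digests as the row hashes. The class name and the default width and depth are arbitrary choices for this example.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Numerical addition in one cell per row.
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # MIN over the rows: collisions can only inflate a cell, never shrink it.
        return min(self.table[row][self._col(row, item)] for row in range(self.depth))

    def merge(self, other):
        """Monoid plus: cell-wise addition of two sketches of the same shape."""
        assert (self.width, self.depth) == (other.width, other.depth)
        out = CountMinSketch(self.width, self.depth)
        out.table = [[a + b for a, b in zip(r1, r2)]
                     for r1, r2 in zip(self.table, other.table)]
        return out

words = "to be or not to be that is the question".split()
first, second, whole = CountMinSketch(), CountMinSketch(), CountMinSketch()
for w in words[:5]:
    first.add(w)
for w in words[5:]:
    second.add(w)
for w in words:
    whole.add(w)
print(first.merge(second).table == whole.table)  # True
```

The estimate never falls below the true count; it can only overshoot when hash collisions occur in every row.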
Anomaly Detection
-Online Summarizer : approximate data structure for finding quantiles in a continuous stream of data. 
-Many exist : Q-Tree, Q-Digest, T-Digest 
-All of those are associative. 
-Another neat thing : it types your data uniformly.
Many more sketches and tricks 
•FM Counters, KMV 
•Histograms 
•Ball Sketches : streaming k-means, clustering 
•SGD : fit online machine learning algorithms
Algebird
Conclusion 
•Hashed data structures can be mapped onto usual data structures like Set, Map, etc., which are easier for developers to reason about 
•As data size grows, sampling becomes painful; hashing provides a more cost-effective solution 
•Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems. 
http://speakerdeck.com/samklr
DON’T BE SCARED ANYMORE.
Bibliography 
•Great intro to Algebird : http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/ 
•Aggregate Knowledge : http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ 
•Probabilistic data structures for web analytics : http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ 
•Algebird : github.com/twitter/algebird 
•Algebra for analytics : https://speakerdeck.com/johnynek/algebra-for-analytics 
•http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
