Distributed Machine Learning 101 using Apache Spark from the Browser

Andy Petrella, CEO & Founder at Kensu
Distributed Machine Learning 101
using Apache Spark
from the Browser
Scala Days 2015, Amsterdam
● What is Machine Learning?
◦ Variables, Variance and Bias
◦ Model selection
● Why Spark for machine learning?
● Spark MLlib by examples
◦ Genomics clustering and classification example
● What for the future?
◦ Streaming
◦ Human Learning
Outline
Andy Petrella
Maths
scala
Apache Spark
Spark Notebook
Trainer
Data Banana
Xavier Tordoir
Physics
Bioinformatics
Scala
Spark
you cannot prove a vague theory is wrong
[…] Also, if the process of computing the
consequences is indefinite, then with a little
skill any experimental result can be made to
look like the expected consequences.
—Richard Feynman [1964]
What is Machine Learning?
Science with data
Surely You’re Joking
Mr…
● Modelling without first principles…
What is Machine Learning?
Overview
2nd law neither...
● Modelling without first principles…
What is Machine Learning?
Overview
Machine learning you
do with a Learning
Machine
Take that Newton...
● Modelling without first principles…
● Modelling dependencies from the data
What is Machine Learning?
Overview
With some “a priori”
knowledge
● What is the problem?
● Hypothesis?
● Data Generation Process?
● Collection and Preprocessing
● Interpretation
What is Machine Learning?
Learning Machine…
You still need a
domain expert…
Like me!
Learning
Machine
● Estimate dependencies from data
What is Machine Learning?
Overview
Machine learning you
do with a Learning
Machine
[Diagram: Generator of samples, System, Learning Machine; signals x, y, ỹ, z (?)]
● Estimate dependencies from data
● Minimize a risk functional over the
set given the data
What is Machine Learning?
Overview
I like them so much
in LaTeX2e
[Diagram: Generator of samples, System, Learning Machine; signals x, y, ỹ, z (?)]
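The transcript does not spell out the "risk functional" on this slide. In the usual statistical-learning notation it could be written as below. The symbols are assumptions, not taken from the slide: f(x, w) is the output of the learning machine for parameters w, L is a loss function, and p(x, y) is the unknown joint density of the samples. Since p is unknown, the risk is estimated from the n observed samples.

```latex
R(w) = \int L\bigl(y, f(x, w)\bigr)\, p(x, y)\, dx\, dy
\qquad \text{estimated by} \qquad
R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr)
```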
● Regression: continuous output
○ Risk = Prediction error
● Classification: categorical output
○ Risk = Probability of misclassification
What is Machine Learning?
Supervised learning
L(y, f(x, w)) = (y − f(x, w))²…
WTF?
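The two supervised risks on this slide can be estimated from labelled samples. A minimal sketch in plain Python (not Spark MLlib), assuming mean squared error as the regression "prediction error"; the function names and toy data are hypothetical:

```python
def squared_error_risk(y_true, y_pred):
    """Regression risk: mean squared prediction error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def misclassification_risk(labels, predicted):
    """Classification risk: fraction of misclassified samples."""
    return sum(1 for l, p in zip(labels, predicted) if l != p) / len(labels)


# Hypothetical toy data, for illustration only.
regression_risk = squared_error_risk([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
classification_risk = misclassification_risk(
    ["a", "b", "a", "b"], ["a", "b", "b", "b"]
)
print(regression_risk, classification_risk)
```

Both functions are empirical averages over the observed samples, so they only approximate the true risk over the unknown data distribution.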
What is Machine Learning?
Unsupervised learning: no output
I like clusters,
especially with
roasted nuts
● Clustering
○ Risk = Error Distortion (distances to center)
● Density estimation (probability densities)
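The clustering risk can be illustrated as the average squared distance from each point to its nearest center. A minimal sketch in plain Python, assuming squared Euclidean distance; the points and centers are made up for illustration:

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def distortion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(
        min(squared_distance(p, c) for c in centers) for p in points
    ) / len(points)


# Hypothetical data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
one_center = [(5.0, 5.5)]
two_centers = [(0.0, 0.5), (10.0, 10.5)]

print(distortion(points, one_center), distortion(points, two_centers))
```

Adding centers can only lower the distortion, which is why distortion alone cannot choose the number of clusters.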
What is Machine Learning?
Bias - Variance, Regression illustration
Playtime!
Notebook!
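The notebook itself is not part of this transcript. As a stand-in, here is a minimal plain-Python sketch of the bias-variance trade-off. Everything in it is an assumption made for illustration: the target function, the noise level, and the two models (a rigid one that always predicts the sample mean, a flexible one that copies the nearest neighbour).

```python
import math
import random


def true_f(x):
    """Hypothetical target function."""
    return math.sin(2 * math.pi * x)


def sample(rng, xs, noise_sd):
    """Draw one noisy sample of the target at the fixed inputs xs."""
    return [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]


def predict_mean(xs, ys, x0):
    """Rigid model: ignores x0 and predicts the mean output."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Flexible model: copies the output of the nearest input."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias2_and_variance(model, x0, xs, runs=500, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a model's prediction at x0."""
    rng = random.Random(seed)
    preds = [model(xs, sample(rng, xs, noise_sd), x0) for _ in range(runs)]
    mean_pred = sum(preds) / runs
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / runs
    return bias2, variance


xs = [i / 20 for i in range(21)]
x0 = 0.25
rigid = bias2_and_variance(predict_mean, x0, xs)
flexible = bias2_and_variance(predict_1nn, x0, xs)
print("rigid    (bias^2, variance):", rigid)
print("flexible (bias^2, variance):", flexible)
```

The rigid model should show high bias and low variance, and the flexible one the reverse, because each prediction of the flexible model depends on a single noisy observation.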
What is Machine Learning?
Model selection
all work and no play
makes Jack a dull
boy
Model Complexity control: Resampling
Because we only see one sample of the universe
Replay it!
Spark for Machine Learning?
Model selection
Enough theory boy!
[Figure labels: f0, f1, f2]
Spark for Machine Learning?
Model selection
Enough theory boy!
[Figure labels: f0, f1, f2, F3; More Samples]
Spark for Machine Learning?
Model selection
Enough theory boy!
[Figure labels: f0, f1, f2, F3; Bigger Samples]
Spark for Machine Learning?
Model selection
Nice flag
K-Fold
K = 4
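K-fold resampling with K = 4 can be sketched as follows. This is plain Python rather than Spark code, and the function name is hypothetical. Indices are split in order with no shuffling, which a real run would usually add.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; return (train, validation) pairs."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds take one additional index.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


splits = k_fold_indices(10, 4)
for train, validation in splits:
    print(len(train), validation)
```

Each sample is used for validation exactly once and for training K - 1 times. The K training runs are independent of each other, so they can be run in parallel.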
Genomics
The data
So… that’s what
separates us huh?
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics
The data
Please, don’t mind
the colors...
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics
The data
Woooow, really, you
must be kidding
me… ahahahahah
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
Genomics
The data
Oh… damned… hum
huh
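A back-of-the-envelope calculation shows why "Lots of Data" follows from the figures on the slide. The storage format is an assumption: a dense matrix with one 8-byte double per genotype, which overstates what a compact encoding would need.

```python
samples = 1_000                    # ~1000 samples (from the slide)
genotypes_per_sample = 30_000_000  # ~30M genotypes per sample (from the slide)

values = samples * genotypes_per_sample
bytes_as_doubles = values * 8      # assumption: dense 8-byte doubles

print(values, "feature values")
print(bytes_as_doubles / 1e9, "GB as dense doubles")
```

With about 30,000 times more features than samples, the problem is both statistically hard and too large to hold comfortably in one machine's memory.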
Data continues to flow
Models must be trained continuously
=> Streaming Machine learning algorithms
Models must be validated
=> Batch machine learning
→ ƛambda ML
What else?
Streaming
Lambada?
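The split between a continuously trained model and a batch validation step can be sketched in plain Python. This is not Spark's streaming API. The data source, learning rate, and linear model are assumptions for illustration.

```python
import random


def sgd_update(w, b, x, y, lr=0.05):
    """One online gradient step for squared error on a single observation."""
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err


def batch_mse(w, b, data):
    """Batch validation: mean squared error over a held-out set."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(42)


def observe():
    """Hypothetical data source: y = 2x + 1 plus small Gaussian noise."""
    x = rng.random()
    return x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)


# Streaming side: the model is updated as each observation arrives.
w, b = 0.0, 0.0
for _ in range(20_000):
    x, y = observe()
    w, b = sgd_update(w, b, x, y)

# Batch side: the current model is validated on a held-out batch.
holdout = [observe() for _ in range(1_000)]
mse = batch_mse(w, b, holdout)
print(w, b, mse)
```

The streaming loop never stores past observations, while the validation step needs a stored batch. That is why the slide pairs the two.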
Learning probabilistic models
Not only learning which features are important...
but also Learning interactions effectively
explaining observations
What else?
Probabilistic Programming
I’ll probably program
too
That’s all folks
Roooaaar
Q / Option[A] / beers
THANKS!
Xavier Tordoir
@xtordoir
Andy Petrella
@noootsab
http://data-fellas.guru
https://github.com/andypetrella/spark-notebook/
Frank Nothaft
Matt Massie
Matt Gianni
Venkat Krishnamurthy
Look at the Code
The browser part is powered by the Spark Notebook.
The 3 notebooks are:
● mllib/Variance - Bias.snb
● adam/Clustering Genomes using Adam with LDA.snb
● adam/Classifying Genomes using Adam with RF.snb
So grab a Spark Notebook on http://spark-notebook.io/.
Yeaaaaah!