Distributed machine learning 101 using apache spark from the browser

Talk given by Xavier Tordoir and myself at Scala Days Amsterdam 2015.

Contains an intro to ML, focusing on what it is and on model selection via the bias-variance trade-off.
Then switches gears to show how genomic data can be learned from using LDA, KMeans and Random Forest.

Finishes with some insight on what we'll change in the future regarding machine learning and modeling.
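
As a reminder of what the bias-variance trade-off refers to, here is the standard textbook decomposition of the expected squared prediction error. It is not taken from the slides. The notation is an assumption: the data are generated as $y = g(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $f(x, \hat{w}_D)$ is the model fitted on a random training sample $D$ drawn independently of the test point.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - f(x, \hat{w}_D)\bigr)^2\right]
  = \underbrace{\bigl(g(x) - \mathbb{E}_D[f(x, \hat{w}_D)]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(f(x, \hat{w}_D) - \mathbb{E}_D[f(x, \hat{w}_D)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

More complex models tend to lower the bias term and raise the variance term, which is why model complexity has to be selected rather than maximized.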

  1. Distributed Machine Learning 101 using Apache Spark from the Browser Scala Days 2015, Amsterdam
  2. ● What is Machine Learning? ◦ Variables, Variance and Bias ◦ Model selection ● Why Spark for machine learning? ● Spark MLlib by examples ◦ Genomics clustering and classification example ● What for the future? ◦ Streaming ◦ Human Learning Outline
  3. Andy Petrella Maths Scala Apache Spark Spark Notebook Trainer Data Banana Xavier Tordoir Physics Bioinformatics Scala Spark
  4. you cannot prove a vague theory is wrong […] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences. - Richard Feynman [1964] What is Machine Learning? Science with data Surely You’re Joking Mr…
  5. ● Modelling without first principle… What is Machine Learning? Overview 2nd law neither...
  6. ● Modelling without first principle… What is Machine Learning? Overview Machine learning you do with a Learning Machine Take that Newton...
  7. ● Modelling without first principle… ● Modelling dependencies from the data What is Machine Learning? Overview With some “a priori” knowledge
  8. ● What is the problem? ● Hypothesis? ● Data Generation Process? ● Collection and Preprocessing ● Interpretation What is Machine Learning? Learning Machine… You still need a domain expert… Like me! Learning Machine
  9. ● Estimate dependencies from data What is Machine Learning? Overview Machine learning you do with a Learning Machine Samples Generator System x y ỹ z ? Learning Machine
  10. ● Estimate dependencies from data ● Minimize a risk functional over the set given the data What is Machine Learning? Overview I like them so much in LaTeX2e Samples Generator System x y ỹ z ? Learning Machine
  11. ● Regression: continuous output ○ Risk = Prediction error ● Classification: categorical output ○ Risk = Probability of misclassification What is Machine Learning? Supervised learning $L(y, f(x, w)) = (y - f(x, w))^2$… WTF?
  12. What is Machine Learning? Unsupervised learning: no output I like clusters, specially with roasted nuts ● Clustering ○ Risk = Error Distortion (distances to center) ● Density estimation (probability densities)
  13. What is Machine Learning? Bias - Variance, Regression illustration Playtime! Notebook!
  14. What is Machine Learning? Model selection all work and no play makes Jack a dull boy Model Complexity control: Resampling Because we only see one sample of the universe Replay it!
  15. Spark for Machine Learning? Model selection Enough theory boy! f0 f1 f2
  16. Spark for Machine Learning? Model selection Enough theory boy! f0 f1 f2
  17. Spark for Machine Learning? Model selection Enough theory boy! f0 f1 f2 F3 More Samples
  18. Spark for Machine Learning? Model selection Enough theory boy! f0 f1 f2 F3 More Samples
  19. Spark for Machine Learning? Model selection Enough theory boy! f0 f1 f2 F3 Bigger Samples
  20. Spark for Machine Learning? Model selection Enough theory boy! f0 f1 f2 F3 Bigger Samples
  21. Spark for Machine Learning? Model selection Nice flag K-Fold K = 4
  22. Genomics The data So… that’s what separates us huh?
  23. 1000 genomes: http://www.1000genomes.org/ ~1000 samples ~30M Genotypes per sample (features) Genomics The data Please, don’t mind the colors...
  24. 1000 genomes: http://www.1000genomes.org/ ~1000 samples Few samples => Machine Learning Genomics The data Woooow, really, you must be kidding me… ahahahahah
  25. 1000 genomes: http://www.1000genomes.org/ ~1000 samples ~30M Genotypes per sample (features) Few samples => Machine Learning Lots of Data => Distributed computing Genomics The data Oh… damned… hum huh
  26. Data continues to flow Models must be trained continuously => Streaming Machine learning algorithms Models must be validated => Batch machine learning → ƛambda ML What else? Streaming Lambada?
  27. Learning probabilistic models Not only learning which features are important... but also Learning interactions effectively explaining observations What else? Probabilistic Programming I’ll probably program too
  28. That’s all folks Roooaaar
  29. Q / Option[A] / beers THANKS! Xavier Tordoir @xtordoir Andy Petrella @noootsab http://data-fellas.guru https://github.com/andypetrella/spark-notebook/ Frank Nothaft Matt Massie Matt Gianni Venkat Krishnamurthy
  30. Look at the Code The browser part is powered by the Spark Notebook. The 3 notebooks are: ● mllib/Variance - Bias.snb ● adam/Clustering Genomes using Adam with LDA.snb ● adam/Classifying Genomes using Adam with RF.snb So grab a Spark Notebook on http://spark-notebook.io/. Yeaaaaah!
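
The resampling idea from slides 14 to 21 (K-Fold with K = 4) can be illustrated with a minimal sketch in plain Python. This is not the code from the talk's notebooks, which run on Spark. The function names `k_fold_splits`, `cross_validated_risk`, `fit_constant` and `fit_line` are hypothetical, and the risk is assumed to be the squared prediction error from the regression slide.

```python
import random


def k_fold_splits(n_samples, k=4, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validated_risk(xs, ys, fit, k=4):
    """Average squared prediction error over the k held-out folds."""
    fold_risks = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        # Fit on the training folds only, then score on the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in test_idx]
        fold_risks.append(sum(errors) / len(errors))
    return sum(fold_risks) / k


def fit_constant(xs, ys):
    """Simplest candidate model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Second candidate: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept
```

Model selection then amounts to computing `cross_validated_risk` for each candidate (here `fit_constant` and `fit_line`) and keeping the one with the lowest held-out risk.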
