OpenML
Networked Machine Learning for Algorithm Selection and Optimisation
Joaquin Vanschoren, TU Eindhoven, 2015
Meta-learning:
• Learn from experience how to select + optimize learning algorithms + workflows
Requires:
• Large amounts of real datasets
• Wide range of state-of-the-art algorithms
• Huge amounts of experimentation: explore how algorithms/params behave on many kinds of data
Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text,…
Extensive toolboxes exist to analyse data
• MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML,…
Massive amounts of experiments are run, but most of this information is lost forever (in people’s heads)
• No learning across people, labs, fields
• Everyone starts from scratch, which slows research and innovation
Let’s connect machine learning environments to a network, so that we can organize and learn from experience
Connect minds, collaborate globally in real time
Train algorithms to automate data science
Frictionless, Networked Machine Learning
Easy to use: Integrated in ML environments. Automated, reproducible sharing
Organized data: Experiments connected to data, code, people anywhere
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (e-citations, social interaction)
OpenML
Data (ARFF) uploaded or referenced, versioned, analysed, characterized, organised online
Tasks contain data, goals, procedures.
Readable by tools, automates experimentation
Train-test splits, evaluation measure
Results organized online: realtime overview
Flows (code) from various tools, versioned.
Integrations + APIs (REST, Java, R, Python,…)
Experiments auto-uploaded, evaluated online: reproducible, linked to data, flows and authors
Explore
Reuse
Share
Data
• Search by keywords or properties
• Filters
• Tagging
• Wiki-like descriptions
• Analysis and visualisation of features
• Auto-calculation of a large range of meta-features
Meta-features (120+)
• Simple: Number/perc/log of instances, classes, features, dimensionality, NAs, …
• Statistical: Default acc, Min/max/mean distinct values, skewness, kurtosis, stdev, mutual information, …
• Information-theoretic: Entropy (class, num/nom feats), signal-noise ratio,…
• Landmarkers (AUC, ErrRate, Kappa): Decision stumps, Naive Bayes, kNN, RandomTree, J48(pruned), JRip, NBTree, lin.SVM, …
• Subsampling landmarkers (partial learning curves)
• Streaming landmarkers: Evaluations in previous window, previous best, per-algorithm significant win/loss
• Change detectors: HoeffdingAdwin, HoeffdingDDM
• Domain specific meta-features (e.g. molecular descriptors)
• Many not included yet
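A few of the simple, statistical and information-theoretic meta-features listed above can be sketched in plain Python. This is a minimal illustration, not the OpenML implementation: the function names, the dictionary keys and the tiny dataset are all made up for the example.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def default_accuracy(labels):
    """Accuracy obtained by always predicting the majority class."""
    return max(Counter(labels).values()) / len(labels)


def skewness(values):
    """Population skewness (third standardized moment) of one numeric feature."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


def simple_meta_features(rows, labels):
    """rows: equal-length numeric feature vectors; labels: one class label per row."""
    n_instances = len(rows)
    n_features = len(rows[0])
    columns = list(zip(*rows))
    return {
        "n_instances": n_instances,
        "n_features": n_features,
        "n_classes": len(set(labels)),
        "dimensionality": n_features / n_instances,
        "log_n_instances": math.log(n_instances),
        "default_accuracy": default_accuracy(labels),
        "class_entropy": class_entropy(labels),
        "mean_skewness": sum(skewness(col) for col in columns) / n_features,
    }


# Hypothetical toy dataset: 4 instances, 2 numeric features, 2 classes.
rows = [[1.0, 2.0], [2.0, 2.5], [3.0, 9.0], [4.0, 3.0]]
labels = ["a", "a", "a", "b"]
features = simple_meta_features(rows, labels)
```

Landmarkers and streaming meta-features would need trained models and are left out of this sketch.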
Tasks
• Example: Classification on click prediction dataset, using 10-fold CV and AUC
• People submit results (e.g. predictions)
• Server-side evaluation (many measures)
• All results organized online, per algorithm, parameter setting
• Online visualizations: every dot is a run plotted by score
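Server-side evaluation means that the score is computed from the submitted predictions rather than reported by the submitter. A minimal sketch of that idea for AUC, assuming a submission is a list of per-fold (true labels, predicted scores) pairs; the function names and the data format are hypothetical, not the OpenML server code.

```python
def auc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    y_true: 0/1 labels; scores: predicted confidence for class 1.
    Tied scores receive the average of their ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def evaluate_submission(folds):
    """Mean AUC over cross-validation folds of submitted predictions."""
    per_fold = [auc(y_true, scores) for y_true, scores in folds]
    return sum(per_fold) / len(per_fold)


# Hypothetical submission with two folds (a real task would have ten).
submission = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
]
mean_auc = evaluate_submission(submission)
```

Because every run on a task is scored by the same code on the same splits, results from different people and tools stay comparable.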
• Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others
• Real-time: clear who submitted first, others can improve immediately
Classroom challenges
• All results obtained with same flow organised online
• Results linked to data sets, parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameters)
• Detailed run info
  • Author, data, flow, parameter settings, result files, …
  • Evaluation details (e.g., results per sample)
REST API (new): requires an OpenML account
Predictive URLs
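With predictable URLs, the address of an entity follows from its type and numeric id alone. A small sketch of that idea, assuming a `<base>/<entity>/<id>` layout; the base URL, the set of entity names and the helper `entity_url` are assumptions for illustration, and the layout served in 2015 may have differed.

```python
# Assumed base URL of the JSON flavour of the REST API.
BASE_URL = "https://www.openml.org/api/v1/json"

# Entity types named in this talk: data, tasks, flows, runs.
ENTITIES = {"data", "task", "flow", "run"}


def entity_url(entity, entity_id):
    """Build the URL of one entity from its type and numeric id."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity type: {entity!r}")
    return f"{BASE_URL}/{entity}/{int(entity_id)}"


dataset_url = entity_url("data", 61)
task_url = entity_url("task", 59)
```

Fetching such a URL would additionally need an HTTP client and, for uploads, the API key tied to an OpenML account.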
R API (the same for Java, Python)
• List datasets
• List tasks
• List flows
• List runs and results (soon)
Tutorial: http://www.openml.org/guide
Share
Explore
Reuse
Website
1) Search
2) Table view
3) CSV export*
R API (the same for Java, Python)
• Download datasets
• Download tasks
• Download flows
• Download run
• Download run with predictions
Reuse
Explore
Share
WEKA: OpenML extension in plugin manager
MOA
RapidMiner: 3 new operators
R API
Run a task (with mlr):
And upload:
Online collaboration (soon)
Studies (e-papers)
- Online counterpart of a paper, linkable
- Add data, code, experiments (new or old)
- Public or shared within circle
Circles
- Create collaborations with trusted researchers
- Share results within team prior to publication
Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects,…
- Online reputation (more sharing)
OpenML Community
Jan-Jun 2015
Used all over the world
400 7-day, 1700 30-day active users, growing
1000s of datasets, flows, 450000+ runs
Opportunities for automated algorithm selection + configuration
• Many datasets, flows, experiments: much larger meta-learning studies than possible before
• More meta-features, better meta-models
• Meta-data to speed up algorithm configuration, learn over time
• APIs: Algorithm selection and configuration can be built on top of OpenML, reusing and sharing data
OpenML in drug discovery
ChEMBL database
SMILES
Molecular properties (e.g. MW, LogP)
Fingerprints (e.g. FP2, FP3, FP4, MACCS)
MW       LogP    TPSA    b1 b2 b3 b4 b5 b6 b7 b8 b9 …
377.435   3.883   77.85  1  1  0  0  0  0  0  0  0
341.361   3.411   74.73  1  1  0  1  0  0  0  0  0
197.188  -2.089  103.78  1  1  0  1  0  0  0  1  0
346.813   4.705   50.70  1  0  0  1  0  0  0  0  0
…
10,000+ regression datasets
1.4M compounds, 10k proteins, 12.8M activities
Predict which compounds (drugs) will inhibit certain proteins (and hence viruses, parasites,…)
2750 targets have >10 compounds, x 4 fingerprints
OpenML in drug discovery
Predict best algorithm with meta-models
Algorithms: cforest, ctree, earth, fnn, gbm, glmnet, lasso, lm, nnet, pcr, plsr, rforest, ridge, rtree
[Figure: decision-tree meta-model. Split conditions: msdfeat < 0.1729, mutualinfo < 0.7182, skewresp < 0.3315, nentropyfeat < 1.926, netcharge1 >= 0.9999, netcharge3 < 13.85, hydphob21 < −0.09666. Leaves: glmnet, fnn, pcr, rforest, ridge, rforest, rforest, ridge. Legend: ECFP4_1024* 0.697*, FCFP4_1024* 0.427*, ECFP6_1024+ 0.725+, FCFP6_1024* 0.627*]
Metafeatures:
- simple, statistical, info-theoretic, landmarkers
- target: aliphatic index, hydrophobicity, net charge, mol. weight, sequence length, …
I. Olier et al. MetaSel@ECMLPKDD 2015
[Figure: Best algorithm overall (10-fold CV)]
[Figure: Random Forest meta-classifier (10-fold CV)]
[Figure: Random Forest stacker (10-fold CV)]
Frugal learning
Which algorithms are fast (and low-memory) but accurate?
In Python Notebook:
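One way to frame this question is as a Pareto front over runtime and accuracy: a flow is worth considering if no other flow is both at least as fast and at least as accurate. The sketch below is not the notebook's code. It assumes run records with hypothetical `flow`, `seconds` and `accuracy` fields, and the numbers are made up.

```python
def pareto_front(runs):
    """Return the runs that are not dominated on (runtime, accuracy).

    A run is dominated when another run is at least as fast and at least as
    accurate, and strictly better on one of the two.
    """
    front = []
    for r in runs:
        dominated = any(
            o["seconds"] <= r["seconds"]
            and o["accuracy"] >= r["accuracy"]
            and (o["seconds"] < r["seconds"] or o["accuracy"] > r["accuracy"])
            for o in runs
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r["seconds"])


# Hypothetical run records; real ones would come from runs stored in OpenML.
runs = [
    {"flow": "NaiveBayes", "seconds": 0.5, "accuracy": 0.80},
    {"flow": "DecisionTree", "seconds": 2.0, "accuracy": 0.85},
    {"flow": "kNN", "seconds": 3.0, "accuracy": 0.82},
    {"flow": "RandomForest", "seconds": 30.0, "accuracy": 0.90},
    {"flow": "SVM", "seconds": 120.0, "accuracy": 0.88},
]
frugal = [r["flow"] for r in pareto_front(runs)]
```

Memory use could be added as a third objective by extending the dominance check in the same way.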
Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time
Concept drift
• Use meta-learning to select the best models at each point in time
Evaluation
• Base-level:
  • Interleaved test-then-train (prequential evaluation)
  • Accuracy on data streams using predicted method [default, best]
• Meta-level:
  • Leave one stream out
  • Accuracy of predicting best model [0,1]
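Prequential evaluation uses each stream instance first for testing and only then for training, so every prediction is made on unseen data. A minimal sketch, assuming models expose `predict(x)` and `learn(x, y)`; the interface and the majority-class learner are stand-ins chosen for the example.

```python
from collections import Counter


class MajorityClass:
    """Toy stream learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing seen yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Interleaved test-then-train accuracy over a stream of (x, y) pairs."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:  # test on the instance first
            correct += 1
        model.learn(x, y)          # then use it for training
        total += 1
    return correct / total if total else 0.0


# Hypothetical stream whose majority label shifts from "a" towards "b".
stream = [(None, label) for label in ["a", "a", "b", "a", "b", "b"]]
accuracy = prequential_accuracy(MajorityClass(), stream)
```

The meta-level evaluation (leave one stream out) would wrap this loop: train the meta-model on all streams but one and score its choices on the held-out stream.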
• Streaming ensembles (current work)
  • Train multiple models, use meta-learning to weight their votes
  • Stacking / Cascading
• BLast (Best-Last), J. van Rijn et al. ICDM 2015
  • Simply choose the model that performed best in the previous window. Equivalent to state-of-the-art (Leveraging Bagging)
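The Best-Last idea can be sketched in a few lines: track how often each base model was right over the previous window and let the current leader make the prediction. This is a simplified illustration under assumed interfaces (`predict`/`learn`), a fixed window size and first-wins tie-breaking, not the published BLast implementation.

```python
from collections import deque


class Constant:
    """Toy base model that always predicts the same label."""

    def __init__(self, label):
        self.label = label

    def predict(self, x):
        return self.label

    def learn(self, x, y):
        pass


def best_last(models, stream, window=2):
    """For each instance, return (chosen model name, its prediction).

    models: dict mapping a name to an object with predict(x) and learn(x, y).
    The chosen model is the one with the most correct predictions over the
    previous `window` instances (ties go to the first model in the dict).
    """
    recent = {name: deque(maxlen=window) for name in models}
    chosen = []
    for x, y in stream:
        preds = {name: m.predict(x) for name, m in models.items()}
        best = max(models, key=lambda name: sum(recent[name]))
        chosen.append((best, preds[best]))
        for name, m in models.items():
            recent[name].append(preds[name] == y)  # record hit or miss
            m.learn(x, y)                          # all models keep training
    return chosen


models = {"always_a": Constant("a"), "always_b": Constant("b")}
stream = [(None, label) for label in ["a", "a", "b", "b", "b", "a"]]
picks = [name for name, _ in best_last(models, stream, window=2)]
```

After the labels switch to "b", the selection follows with a delay of roughly one window, which is the price of relying only on recent performance.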
Algorithm selection (ranking)
J. van Rijn et al. IDA 2015
Learning curves in OpenML
• Identify k nearest prior datasets by distance between partial curves:
• Build a ranking of algorithms to run, start with overall best a_best
• Draw random algorithm a_competitor
• If a_competitor wins most, add to ranking, repeat
For new dataset: evaluate learning curves up to T (e.g. 256 instances)
• Fraction of full CV
• No other meta-features
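The steps above can be sketched loosely as follows. Several details are assumptions made for the example: a Euclidean distance over the partial curves (the distance actually used is not given here), a simple majority vote over the nearest datasets, and made-up accuracies.

```python
import math
import random


def curve_distance(a, b):
    """Euclidean distance between two sets of partial learning curves.

    a, b: dict mapping algorithm name to accuracies at increasing sample sizes.
    """
    return math.sqrt(
        sum((x - y) ** 2 for algo in a for x, y in zip(a[algo], b[algo]))
    )


def nearest_datasets(new_curves, prior_curves, k):
    """Names of the k prior datasets whose partial curves are closest."""
    return sorted(
        prior_curves, key=lambda d: curve_distance(new_curves, prior_curves[d])
    )[:k]


def wins_most(competitor, best, neighbours, full_scores):
    """True if the competitor beats the current best on most nearest datasets."""
    wins = sum(full_scores[d][competitor] > full_scores[d][best] for d in neighbours)
    return wins > len(neighbours) / 2


def build_ranking(algorithms, start, neighbours, full_scores, rng):
    """Start from the overall best, then add randomly drawn competitors that win."""
    ranking = [start]
    best = start
    candidates = [a for a in algorithms if a != start]
    rng.shuffle(candidates)  # draw competitors in random order
    for competitor in candidates:
        if wins_most(competitor, best, neighbours, full_scores):
            ranking.append(competitor)
            best = competitor
    return ranking


# Hypothetical prior knowledge: partial curves and full-data accuracies.
prior_curves = {
    "d1": {"nb": [0.60, 0.65], "tree": [0.55, 0.70]},
    "d2": {"nb": [0.62, 0.66], "tree": [0.54, 0.72]},
    "d3": {"nb": [0.90, 0.92], "tree": [0.50, 0.55]},
}
full_scores = {
    "d1": {"nb": 0.70, "tree": 0.80, "knn": 0.75},
    "d2": {"nb": 0.72, "tree": 0.78, "knn": 0.74},
    "d3": {"nb": 0.95, "tree": 0.60, "knn": 0.70},
}
new_curves = {"nb": [0.61, 0.66], "tree": [0.55, 0.71]}

neighbours = nearest_datasets(new_curves, prior_curves, k=2)
ranking = build_ranking(["nb", "tree", "knn"], "nb", neighbours, full_scores,
                        random.Random(0))
```

Only the cheap partial curves are computed on the new dataset; everything else is looked up from earlier experiments.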
[Figure: accuracy loss (y-axis, 0 to 0.1) against time in seconds (x-axis, 1 to 65536) for Best on Sample, Average Rank, PCC, and PCC (A3R, r = 1). For 53 classifiers on 39 datasets]
Multi-objective function A3R:
J. van Rijn et al. IDA 2015
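A3R combines accuracy and runtime into a single score. A minimal sketch, assuming the form commonly given in the A3R literature: the ratio of success rates divided by the r-th root of the ratio of runtimes. The exact variant behind the plot may differ, and the function and parameter names are made up for the example.

```python
def a3r(sr_candidate, sr_reference, t_candidate, t_reference, r=1):
    """Accuracy/runtime trade-off of a candidate relative to a reference algorithm.

    sr_*: success rates (accuracies); t_*: runtimes in the same unit.
    Larger r shrinks the runtime penalty; r = 1 applies it in full.
    """
    accuracy_ratio = sr_candidate / sr_reference
    time_ratio = t_candidate / t_reference
    return accuracy_ratio / time_ratio ** (1.0 / r)


# Hypothetical numbers: slightly more accurate, but ten times slower.
strict = a3r(0.90, 0.85, 100.0, 10.0, r=1)
lenient = a3r(0.90, 0.85, 100.0, 10.0, r=4)
```

A score above 1 favours the candidate, a score below 1 favours the reference.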
Towards automating machine learning
Human scientists
meta-data, models,
evaluations
Automated processes: data cleaning, algorithm selection, parameter optimization, workflow construction, post processing
API
Connect your tools and services to OpenML, so they may learn
"When everything is connected to everything else, for better or for worse, everything matters." (Bruce Mau)
Join OpenML
- Open Source, on GitHub
- Regular workshops, hackathons
Next workshop:
- Lorentz Center (Leiden), 14-18 March 2016
Thank you
Joaquin Vanschoren
Jan van Rijn
Bernd Bischl
Matthias Feurer
Michel Lang
Nenad Tomašev
Giuseppe Casalicchio
Luis Torgo
You?
#OpenML
Farzan Majdani
Jakob Bossek

OpenML for Algorithm Selection and Configuration
