OpenML
NETWORKED MACHINE LEARNING
Metalearning and algorithm selection require:
• Large amounts of real datasets
• Wide range of state-of-the-art algorithms
• Huge amounts of experimentation: explore how
algorithms behave on many kinds of data
• Integrated tools, databases, services to organise
all this
Gregg Gordon
Data by itself has no value. It’s the ever-changing
ecosystem surrounding data that gives it meaning.
So much data and so many experiments are going around.
Can we connect online to
share data, code, results?
NETWORKED SCIENCE
Polymaths: Mathematicians solved
centuries-old problems within weeks
by collaborating online, via blog
SDSS: Thousands of papers published
on data server from single telescope.
Also: GenBank, ChEMBL, GEO, …
Galaxy Zoo: Amateur astronomers
make new discoveries by looking
through thousands of images
FRICTIONLESS, NETWORKED ENVIRONMENT
FOR MACHINE LEARNING RESEARCH
Organized data: Experiments connected to data, code, people. Data and
code can live on existing repositories, services (e.g. GitHub)
Easy to use: Integrated in existing ML environments for easy reuse, and
automated, reproducible sharing
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (auto-tracking of reuse, likes)
Data (ARFF) uploaded or referenced, versioned
analysed, characterized, organised online
Tasks contain data, goals, procedures.
Readable by tools, automates experimentation
Results organized online: realtime challenge
Train-test splits
Evaluation measure
Flows (code) from various tools, versioned.
Integrations + APIs (REST, Java, R, Python,…)
Experiments auto-uploaded, evaluated online,
reproducible, linked to data, flows and authors
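The four building blocks above (data, tasks, flows, experiments) and the links between them can be sketched as plain Python records. This is an illustration only, not OpenML's actual schema: every class and field name below is an assumption chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: int
    name: str
    version: int  # datasets are versioned


@dataclass
class Task:
    task_id: int
    dataset: Dataset
    goal: str                # e.g. "classification"
    procedure: str           # e.g. "10-fold cross-validation"
    evaluation_measure: str  # e.g. "AUC"


@dataclass
class Flow:
    flow_id: int
    name: str
    version: str  # flows (code) are versioned as well


@dataclass
class Run:
    """One experiment: a flow applied to a task, linked to its author."""
    run_id: int
    task: Task
    flow: Flow
    author: str
    parameters: dict = field(default_factory=dict)


# Hypothetical example values, for illustration only
clicks = Dataset(dataset_id=1, name="click_prediction", version=1)
task = Task(1, clicks, "classification", "10-fold cross-validation", "AUC")
flow = Flow(1, "example.RandomForest", "1.0")
run = Run(1, task, flow, author="alice", parameters={"n_trees": 100})
```

Because each run points at its task and flow, and each task points at its dataset, any result can be traced back to the exact data, code and author that produced it.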
Explore
Reuse
Share
• Search by
keywords or
properties
• Filters
• Tagging
Data
• Wiki-like
descriptions
• Analysis and
visualisation of
features
• Auto-calculation
of large range
of meta-features
Data
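As a rough idea of what auto-calculated meta-features look like, the sketch below computes a handful of simple ones for a labelled table. The selection of meta-features and the function name are assumptions for illustration; the slides do not list which meta-features the server actually computes.

```python
import math
from collections import Counter


def simple_meta_features(rows, labels):
    """Compute a few basic meta-features of a labelled tabular dataset."""
    if not rows or len(rows) != len(labels):
        raise ValueError("rows and labels must be non-empty and of equal length")
    counts = Counter(labels)
    n = len(labels)
    # Shannon entropy of the class distribution, in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "n_instances": n,
        "n_features": len(rows[0]),
        "n_classes": len(counts),
        "class_entropy": entropy,
        "majority_class_fraction": max(counts.values()) / n,
    }


meta = simple_meta_features(
    rows=[[1.0, 2.0], [0.5, 1.5], [3.0, 0.1], [2.2, 0.9]],
    labels=["a", "b", "a", "b"],
)
```

Values like these describe a dataset independently of any algorithm, which is what makes them usable as inputs for meta-learning.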
• Example: Classification
on click prediction
dataset, using 10-fold
CV and AUC
• People submit results
(e.g. predictions)
• Server-side evaluation
(many measures)
• All results organized
online, per algorithm,
parameter setting
• Online visualizations:
every dot is a run
plotted by score
Tasks
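The example task above (10-fold CV, scored by AUC) can be sketched in plain Python to show what evaluating submitted predictions involves. This is a simplified stand-in, not the server's implementation: the function names are hypothetical, folds are contiguous rather than stratified, and labels are assumed to be 0/1.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def mean_fold_auc(labels, scores, k=10):
    """Average the AUC of the submitted scores over the k test folds."""
    fold_aucs = []
    for _, test in k_fold_indices(len(labels), k):
        fold_aucs.append(auc([labels[i] for i in test], [scores[i] for i in test]))
    return sum(fold_aucs) / len(fold_aucs)
```

Fixing the splits and the measure in the task is what makes results from different people directly comparable.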
• Leaderboards visualize progress over time: who delivered breakthroughs
when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others, form teams
• Real-time: who submits first gets credit, others can improve immediately
Classroom challenges
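The "progress over time" view of a leaderboard can be sketched as a running best over time-ordered submissions. The tuple layout and function name are assumptions for illustration, and higher scores are assumed to be better.

```python
def progress_over_time(submissions):
    """Return the submissions that improved on the best score so far.

    `submissions` is an iterable of (timestamp, author, score) tuples.
    """
    best = float("-inf")
    breakthroughs = []
    for timestamp, author, score in sorted(submissions):
        if score > best:  # strict comparison: on a tie, the earlier submitter keeps credit
            best = score
            breakthroughs.append((timestamp, author, score))
    return breakthroughs


history = [(3, "carol", 0.80), (1, "alice", 0.70), (2, "bob", 0.70), (4, "dave", 0.90)]
steps = progress_over_time(history)
```

Here `bob` matches `alice`'s score but submitted later, so only `alice`, `carol` and `dave` appear as steps on the progress curve.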
• All results obtained with same flow organised online
• Results linked to data sets, parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameters)
• Detailed run info
• Author, data, flow,
parameter settings,
result files, …
• Evaluation details
(e.g., results per
sample)
REST API (requires an OpenML account)
Guide on: www.openml.org/guide
List datasets: /data/list
List tasks: /task/list
List flows: /flow/list
List runs: /run/list
Get dataset: /data/<id>
Get task: /task/<id>
Get flow: /flow/<id>
Get run: /run/<id>
…
Endpoint: api.openml.org/v1/
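The calls listed above follow one pattern per entity, which a small helper can capture. This sketch only builds URLs from the paths and endpoint shown on the slide; the `https` scheme and the helper names are assumptions, authentication is not shown, and the endpoint may have changed since 2015, so check the guide before relying on it.

```python
# Endpoint as given on the slide; the https scheme is an assumption
BASE_URL = "https://api.openml.org/v1"

ENTITIES = ("data", "task", "flow", "run")


def list_url(entity):
    """URL that lists all entities of one kind, e.g. /data/list."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity: {entity!r}")
    return f"{BASE_URL}/{entity}/list"


def get_url(entity, entity_id):
    """URL for a single entity, e.g. /task/<id>."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity: {entity!r}")
    return f"{BASE_URL}/{entity}/{int(entity_id)}"
```

For example, `list_url("flow")` gives the flow listing and `get_url("run", 42)` the page for run 42 (an arbitrary id used only for illustration).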
R API
Tutorial:
https://github.com/openml/r/blob/master/doc/knitted/1-Introduction.md
List datasets
R API
List tasks
R API
List flows
R API
List runs and results (under construction)
Java API
Guide: https://www.openml.org/guide
Python API
Guide: https://www.openml.org/guide
Share
Explore
Reuse
Website
Search,
CSV export,
query
R API
Download data
R API
Download task
R API
Download task, data
R API
Download flow
R API
Download run
Download run with predictions
Python API
Simple example:
Reuse
Explore
Share
WEKA
OpenML extension
in plugin manager
MOA
RapidMiner: 3 new operators
See talk by Jan van Rijn
R API
Run a task
(with mlr):
And upload:
R API
Simple example:
Studies (e-papers)
- Online counterpart of a paper, linkable
- Add data, code, experiments (new or old)
- Public or shared within circle
Circles
Create collaborations with trusted researchers
Share results within team prior to publication
Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects,…
- Online reputation (more sharing)
Online collaboration (soon)
Collaboratory: bring data
scientists and domain scientists
together online (and their data
and tools)
Easy, large-scale collab:
Extract actionable datasets, key
tools. Scientists share data and
get help; data scientists can test techniques
on many current datasets.
Real-time collab: share
experiments automatically,
discuss online. Automate
experimentation.
Data scientist
Domain scientist
Tool developer
data + code
Bridging gaps in data-driven science
OpenML Community
Jan-Jun 2015
Used all over the world
400 7-day, 1700 30-day active users, growing
1000s of datasets, flows, 450000+ runs
Algorithm selection, configuration
OpenML offers major opportunities for automated
algorithm selection and configuration
• Many datasets, flows, experiments: allows much larger
meta-learning studies than possible before
• APIs: Algorithm selection and configuration services can
be built on top of OpenML, reusing and sharing data
• These services can be integrated, accessible through
OpenML, continuously improving the more they are used.
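One simple way such an algorithm selection service could work is nearest-neighbour meta-learning: find the past datasets most similar to a new one and recommend the flow that scored best on them. The sketch below is one possible approach under stated assumptions, not a method described in the slides; the function name, data layout and example values are all hypothetical.

```python
import math


def recommend_flow(new_meta, history, k=3):
    """Recommend the flow with the best mean score on the k most similar datasets.

    `new_meta` is a tuple of meta-feature values for the new dataset.
    `history` is a list of (meta_features, {flow_name: score}) pairs.
    """
    if not history:
        raise ValueError("history must contain at least one dataset")

    def distance(a, b):
        # Euclidean distance on raw meta-features (no scaling, for brevity)
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(history, key=lambda item: distance(new_meta, item[0]))[:k]
    scores_per_flow = {}
    for _, scores in neighbours:
        for flow, score in scores.items():
            scores_per_flow.setdefault(flow, []).append(score)
    # Pick the flow with the highest mean score among the neighbours
    return max(scores_per_flow, key=lambda f: sum(scores_per_flow[f]) / len(scores_per_flow[f]))


# Hypothetical past results: (n_instances, n_features) -> score per flow
past = [
    ((100, 5), {"tree": 0.70, "forest": 0.90}),
    ((120, 6), {"tree": 0.75, "forest": 0.85}),
    ((10000, 200), {"tree": 0.95, "forest": 0.60}),
]
choice = recommend_flow((110, 5), past, k=2)
```

In practice the meta-features would need scaling so that no single one dominates the distance, and the more runs are shared, the more neighbours such a service has to learn from.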
Automating machine learning
Human scientists
meta-data, models,
evaluations
Automated processes
Data cleaning
Algorithm Selection
Parameter Optimization
Workflow construction
Post processing
API
SEMI-AUTOMATED MACHINE LEARNING
Data Collection
Data Cleaning
Metric Selection
Algorithm Selection
Parameter Optimization
Post-processing
Online evaluation
Data Coding
Thanks to R. Caruana
Bruce Mau
When everything is connected to everything else,
for better or for worse, everything matters.
- Open Source, on GitHub
- Regular workshops, hackathons
Join OpenML
Next workshop:
- Lorentz Center (Leiden),
14-18 March 2016
THANK YOU
Joaquin Vanschoren
Jan van Rijn
Bernd Bischl
Matthias Feurer
Michel Lang
Nenad Tomašev
Giuseppe Casalicchio
Luis Torgo
You?
#OpenML
Farzan Majdani
Jakob Bossek

OpenML Tutorial ECMLPKDD 2015

 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 

OpenML Tutorial ECMLPKDD 2015

  • 1. OpenML: Networked Machine Learning
  • 2. Metalearning and algorithm selection require: • Large amounts of real datasets • Wide range of state-of-the-art algorithms • Huge amounts of experimentation: explore how algorithms behave on many kinds of data • Integrated tools, databases, services to organise all this
  • 3. "Data by itself has no value. It’s the ever-changing ecosystem surrounding data that gives it meaning." (Gregg Gordon) So much data and so many experiments are going around. Can we connect online to share data, code, results?
  • 4. Networked science. Polymaths: Mathematicians solved centuries-old problems within weeks by collaborating online, via blog. SDSS: Thousands of papers published on data server from single telescope. Also: GenBank, ChEMBL, GEO, … Galaxy Zoo: Amateur astronomers make new discoveries by looking through thousands of images.
  • 5. Frictionless, networked environment for machine learning research. Organized data: Experiments connected to data, code, people. Data and code can live on existing repositories, services (e.g. GitHub). Easy to use: Integrated in existing ML environments for easy reuse, and automated, reproducible sharing. Easy to contribute: Post single dataset, algorithm, experiment, comment. Reward structure*: Build reputation and trust (auto-tracking of reuse, likes).
  • 6. Data (ARFF) uploaded or referenced, versioned, analysed, characterized, organised online
  • 7. Tasks contain data, goals, procedures. Readable by tools, automates experimentation Results organized online: realtime challenge Train-test splits Evaluation measure
  • 8. Flows (code) from various tools, versioned. Integrations + APIs (REST, Java, R, Python,…)
  • 9. Experiments auto-uploaded, evaluated online, reproducible, linked to data, flows and authors
  • 13. • Search by keywords or properties • Filters • Tagging Data
  • 14. • Wiki-like descriptions • Analysis and visualisation of features Data
  • 15. • Wiki-like descriptions • Analysis and visualisation of features • Auto-calculation of large range of meta-features Data
  • 16. • Example: Classification on click prediction dataset, using 10-fold CV and AUC • People submit results (e.g. predictions) • Server-side evaluation (many measures) • All results organized online, per algorithm, parameter setting • Online visualizations: every dot is a run plotted by score Tasks
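The evaluation setup named on this slide (10-fold CV scored by AUC) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `k_fold_indices` and `auc` are hypothetical, the folds are unshuffled and unstratified, and nothing here is taken from the OpenML server's actual evaluation code.

```python
def k_fold_indices(n_samples, k=10):
    """Assign sample indices 0..n_samples-1 to k folds of near-equal size.

    Illustrative only: no shuffling or stratification is applied.
    """
    return [list(range(i, n_samples, k)) for i in range(k)]


def auc(labels, scores):
    """Area under the ROC curve for binary labels (0/1).

    Computed as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical submitted predictions for four test samples.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form of AUC is quadratic in the number of samples, so it suits small illustrations; a rank-based computation would be the usual choice at scale.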
  • 17. • Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions • Collaborative: all code and data available, learn from others, form teams • Real-time: who submits first gets credit, others can improve immediately
  • 20. • All results obtained with same flow organised online • Results linked to data sets, parameter settings -> trends/comparisons • Visualisations (dots are models, ranked by score, colored by parameters)
  • 21. • Detailed run info • Author, data, flow, parameter settings, result files, … • Evaluation details (e.g., results per sample)
  • 22. REST API Requires OpenML account.
  • 23. REST API. Guide: www.openml.org/guide. Endpoint: api.openml.org/v1/. List datasets: /data/list; list tasks: /task/list; list flows: /flow/list; list runs: /run/list. Get dataset: /data/<id>; get task: /task/<id>; get flow: /flow/<id>; get run: /run/<id>; …
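As a rough sketch of how the paths on this slide combine with the endpoint, the helper below only builds URLs and sends no requests. The `https` scheme and the helper names `list_url` and `get_url` are assumptions, and the slide does not say how account credentials are passed, so authentication is left out.

```python
from urllib.parse import urljoin

# Endpoint as given on the slide; the https scheme is an assumption.
BASE_URL = "https://api.openml.org/v1/"

# Entity types named on the slide.
ENTITIES = ("data", "task", "flow", "run")


def list_url(entity):
    """Build the listing URL for one entity type, e.g. /data/list."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity: {entity!r}")
    return urljoin(BASE_URL, f"{entity}/list")


def get_url(entity, item_id):
    """Build the URL for a single item, e.g. /task/<id>."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity: {entity!r}")
    return urljoin(BASE_URL, f"{entity}/{item_id}")


print(list_url("data"))
print(get_url("task", 31))  # 31 is an arbitrary example id
```

Keeping URL construction separate from the HTTP call makes it easy to check the paths against the guide before sending anything.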
  • 27. R API List runs and results (under construction)
  • 28. Java API Guide: https://www.openml.org/guide Python API Guide: https://www.openml.org/guide
  • 35. R API Download run Download run with predictions
  • 39. MOA
  • 40. RapidMiner: 3 new operators See talk by Jan van Rijn
  • 41. R API Run a task (with mlr):
  • 42. R API Run a task (with mlr): And upload:
  • 44. Online collaboration (soon). Studies (e-papers): online counterpart of a paper, linkable; add data, code, experiments (new or old); public or shared within circle. Circles: create collaborations with trusted researchers; share results within team prior to publication. Altmetrics: measure real impact of your work; reuse, downloads, likes of data, code, projects, …; online reputation (more sharing).
  • 45. Bridging gaps in data-driven science. Collaboratory: bring data scientists and domain scientists together online (and their data and tools). Easy, large-scale collaboration: extract actionable datasets, key tools. Scientists share data and get help; data scientists can test techniques on many current datasets. Real-time collaboration: share experiments automatically, discuss online. Automate experimentation. (Diagram: data scientist, domain scientist, tool developer; data + code.)
  • 47. OpenML Community, Jan-Jun 2015. Used all over the world. 400 7-day and 1700 30-day active users, growing. 1000s of datasets, flows; 450000+ runs.
  • 48. Algorithm selection, configuration. OpenML offers major opportunities for automated algorithm selection and configuration. • Many datasets, flows, experiments: allows much larger meta-learning studies than possible before • APIs: Algorithm selection and configuration services can be built on top of OpenML, reusing and sharing data • These services can be integrated, accessible through OpenML, continuously improving the more they are used.
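One simple way a meta-learning service of this kind could work is nearest-neighbour algorithm selection over dataset meta-features. The sketch below is an assumption for illustration, not a method described in the slides: the function name `recommend_flow`, the meta-feature choice, and every value in `known` are hypothetical.

```python
import math


def recommend_flow(new_meta, known):
    """Return the best recorded flow of the known dataset nearest to new_meta.

    Distance is plain Euclidean distance over the meta-feature tuples;
    features are not rescaled, which a real service would likely do.
    """
    nearest = min(known, key=lambda entry: math.dist(entry["meta"], new_meta))
    return nearest["best_flow"]


# Hypothetical records: (log10 of #instances, #features, class entropy)
known = [
    {"meta": (3.0, 10.0, 1.0), "best_flow": "random_forest"},
    {"meta": (5.0, 200.0, 0.3), "best_flow": "linear_model"},
]

print(recommend_flow((3.2, 12.0, 0.9), known))
```

With many datasets and runs available, the `known` table would be filled from stored experiment results rather than written by hand.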
  • 49. Automating machine learning. (Diagram labels: Human scientists; meta-data, models, evaluations; Automated processes: data cleaning, algorithm selection, parameter optimization, workflow construction, post-processing; API.)
  • 50. Semi-automated machine learning: Data Collection, Data Cleaning, Metric Selection, Algorithm Selection, Parameter Optimization, Post-processing, Online evaluation, Data Coding. Thanks to R. Caruana
  • 53. "When everything is connected to everything else, for better or for worse, everything matters." (Bruce Mau)
  • 54. Join OpenML. Open source, on GitHub. Regular workshops, hackathons. Next workshop: Lorentz Center (Leiden), 14-18 March 2016
  • 55. Thank you: Joaquin Vanschoren, Jan van Rijn, Bernd Bischl, Matthias Feurer, Michel Lang, Nenad Tomašev, Giuseppe Casalicchio, Luis Torgo, Farzan Majdani, Jakob Bossek. You? #OpenML