OpenML is an online platform for sharing machine learning data, code, and experiments. This talk includes a tutorial on how to use it and leverage it for algorithm selection and configuration.
OpenML for Algorithm Selection and Configuration
1. OpenML
Networked Machine Learning for Algorithm Selection and Optimisation
Joaquin Vanschoren, TU Eindhoven, 2015
2. Meta-learning:
• Learn from experience how to select + optimize
learning algorithms + workflows
Requires:
• Large amounts of real datasets
• Wide range of state-of-the-art algorithms
• Huge amounts of experimentation: explore how
algorithms/params behave on many kinds of data
3. Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text,…
Extensive toolboxes exist to analyse data
• MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML,…
Massive numbers of experiments are run, but most of this
information is lost forever (or stays in people’s heads)
• No learning across people, labs, fields
• Start from scratch, slows research and innovation
4. Let’s connect machine learning environments to a network, so
that we can organise experiments and learn from experience
Connect minds, collaborate globally in real time
Train algorithms to automate data science
5. Frictionless, Networked Machine Learning
Easy to use: Integrated into ML environments. Automated, reproducible sharing
Organized data: Experiments connected to data, code, people anywhere
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (e-citations, social interaction)
OpenML
6. Data (ARFF) is uploaded or referenced, versioned,
analysed, characterised, and organised online
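ARFF (Attribute-Relation File Format) is WEKA's plain-text tabular format: a header declaring the attributes, followed by the data rows. A minimal illustrative file (the relation and attribute names are made up here):

```
% Minimal ARFF file: header declares attributes, @DATA holds the rows
@RELATION clicks

@ATTRIBUTE ad_position NUMERIC
@ATTRIBUTE device {mobile, desktop}
@ATTRIBUTE clicked {yes, no}

@DATA
1, mobile, yes
3, desktop, no
```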
20. • Example: classification on a click-prediction dataset, using 10-fold CV and AUC
• People submit results (e.g. predictions)
• Server-side evaluation (many measures)
• All results organized online, per algorithm and parameter setting
• Online visualizations: every dot is a run, plotted by score
Tasks
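The example task above (10-fold cross-validation scored by AUC) can be reproduced locally with scikit-learn; a synthetic binary dataset stands in for the click-prediction data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the click-prediction dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10-fold CV scored by AUC, mirroring the task definition
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="roc_auc")
print(scores.mean())
```

An OpenML task fixes exactly these choices (dataset, estimation procedure, evaluation measure), so submitted results stay comparable.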
21. • Leaderboards visualize progress over time: who delivered breakthroughs
when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others
• Real-time: clear who submitted first, others can improve immediately
23. • All results obtained with same flow organised online
• Results linked to data sets, parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameters)
24. • Detailed run info: author, data, flow, parameter settings, result files, …
• Evaluation details (e.g., results per sample)
37. Studies (e-papers)
- Online counterpart of a paper, linkable
- Add data, code, experiments (new or old)
- Public or shared within circle
Circles
Create collaborations with trusted researchers
Share results within team prior to publication
Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects,…
- Online reputation (more sharing)
Online collaboration (soon)
38. OpenML Community
Jan-Jun 2015
Used all over the world
400 seven-day and 1,700 thirty-day active users, and growing
Thousands of datasets and flows, 450,000+ runs
39. Opportunities for automated algorithms selection + configuration
• Many datasets, flows, experiments: much larger meta-learning
studies than possible before
• More meta-features, better meta-models
• Meta-data to speed up algorithm configuration, learn over time
• APIs: Algorithm selection and configuration can be built on top
of OpenML, reusing and sharing data
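As a minimal sketch of building on the API: OpenML exposes datasets, tasks, flows, and runs as numbered resources under a public REST endpoint. The helper below only assembles v1 JSON URLs (the helper itself is ours, not part of any client library):

```python
# Base of OpenML's public REST API (JSON variant)
BASE = "https://www.openml.org/api/v1/json"

def endpoint(resource: str, resource_id: int) -> str:
    """Build the URL of a numbered OpenML resource (data, task, flow, run)."""
    return f"{BASE}/{resource}/{resource_id}"

# e.g. a dataset, a task, and a run, each addressable by id
urls = [endpoint(r, i) for r, i in [("data", 31), ("task", 59), ("run", 100)]]
```

Because every experiment is addressable this way, a meta-learner can fetch prior runs programmatically instead of re-running them.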
48. Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time
Concept drift
• Use meta-learning to select the best models at each
point in time
49. Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time
Evaluation
• Base-level:
• Interleaved test-then-train (prequential evaluation)
• Accuracy on data streams using predicted method [default, best]
• Meta-level:
• Leave one stream out
• Accuracy of predicting best model [0,1]
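The base-level protocol can be sketched in a few lines: prequential evaluation tests the current model on each incoming example and only then trains on it. A toy majority-class learner stands in for the real base models:

```python
from collections import Counter

def prequential_accuracy(stream):
    """Test-then-train over a label stream using a majority-class model."""
    counts, correct, seen = Counter(), 0, 0
    for label in stream:
        if counts:                           # test first ...
            prediction = counts.most_common(1)[0][0]
            correct += (prediction == label)
            seen += 1
        counts[label] += 1                   # ... then train on the example
    return correct / max(seen, 1)

acc = prequential_accuracy(["a", "a", "b", "a", "a", "a"])
```

Every example is used for both testing and training exactly once, so no separate hold-out set is needed on a stream.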
50. Meta-learning on streams
• Streaming ensembles (current work)
• Train multiple models, use meta-learning to weight their votes
• Stacking / Cascading
• BLast (Best-Last), J. van Rijn et al. ICDM 2015
• Simply choose the model that performed best in the previous window.
Performs on par with the state of the art (Leveraging Bagging)
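A minimal sketch of the Best-Last idea, under our own simplifications (a fixed-size window of per-example correctness, and trivial stand-in models):

```python
from collections import deque

def blast_predict(models, stream, window=3):
    """Best-Last: at each step, predict with the model that was most
    accurate over the last `window` examples, then score all models."""
    history = [deque(maxlen=window) for _ in models]
    predictions = []
    for x, label in stream:
        # pick the model with the best recent record (ties -> first model)
        scores = [sum(h) / len(h) if h else 0.0 for h in history]
        best = max(range(len(models)), key=lambda i: scores[i])
        predictions.append(models[best](x))
        for i, m in enumerate(models):       # update every model's record
            history[i].append(m(x) == label)
    return predictions

# Trivial stand-in "models" with fixed predictions
always_pos = lambda x: "pos"
always_neg = lambda x: "neg"
preds = blast_predict([always_pos, always_neg], [(0, "neg")] * 4)
```

The window makes the selector adapt to concept drift: a model that was strong before the drift loses its lead within a few examples.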
51. Algorithm selection (ranking)
J. van Rijn et al. IDA 2015
Learning curves in OpenML
• For a new dataset: evaluate partial learning curves up to T (e.g. 256 instances), a fraction of a full CV; no other meta-features are used
• Identify the k nearest prior datasets by the distance between partial curves
• Build a ranking of algorithms to run, starting with the overall best a_best
• Draw a random algorithm a_competitor
• If a_competitor wins on most of the k nearest datasets, add it to the ranking; repeat
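The dataset-similarity step can be sketched as follows: each prior dataset is represented by its partial learning curve, and the k nearest ones are found by (here) L1 distance between curves; the curve values below are invented for illustration:

```python
def curve_distance(c1, c2):
    """L1 distance between two partial learning curves of equal length."""
    return sum(abs(a - b) for a, b in zip(c1, c2))

def nearest_datasets(new_curve, prior_curves, k=2):
    """Return the k prior datasets whose partial curves are closest."""
    ranked = sorted(prior_curves,
                    key=lambda name: curve_distance(new_curve, prior_curves[name]))
    return ranked[:k]

# Invented partial curves: accuracy at three growing sample sizes
prior = {
    "d1": [0.60, 0.65, 0.70],
    "d2": [0.50, 0.52, 0.55],
    "d3": [0.61, 0.66, 0.71],
}
closest = nearest_datasets([0.62, 0.66, 0.70], prior, k=2)
```

The algorithm ranking is then read off from how candidates performed on those nearest neighbours, which is why no other meta-features are required.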
52. 0
0.02
0.04
0.06
0.08
0.1
1 4 16 64 256 1024 4096 16384 65536
AccuracyLoss
Time (seconds)
Best on Sample
Average Rank
PCC
PCC (A3R, r = 1)
For 53 classifiers on 39 datasets
Multi-objective function A3R:
J. van Rijn et al. IDA 2015
Algorithm selection (ranking)
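The A3R measure referred to above trades predictive performance against run time. One common formulation (from the literature on A3R; the exact variant used in the talk should be checked against the cited paper) compares algorithms $a_p$ and $a_q$ on dataset $d_i$:

```latex
A3R^{d_i}_{a_p, a_q} =
  \frac{SR^{d_i}_{a_p} \, / \, SR^{d_i}_{a_q}}
       {\left( T^{d_i}_{a_p} \, / \, T^{d_i}_{a_q} \right)^{1/r}}
```

Here $SR$ is the success rate (accuracy), $T$ the run time, and $r$ controls how heavily run time is weighted ($r = 1$ in the plot's legend).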
53. Towards automating machine learning
[Diagram: human scientists and automated processes exchange meta-data, models, and evaluations through the OpenML API]
Automated processes: data cleaning, algorithm selection, parameter optimization, workflow construction, post-processing
Connect your tools and services to OpenML, so they may learn
54. “When everything is connected to everything else, for better or for worse, everything matters.” (Bruce Mau)
55. Join OpenML
- Open Source, on GitHub
- Regular workshops, hackathons
Next workshop: Lorentz Center (Leiden), 14-18 March 2016
56. Thank You
Joaquin Vanschoren
Jan van Rijn
Bernd Bischl
Matthias Feurer
Michel Lang
Nenad Tomašev
Giuseppe Casalicchio
Luis Torgo
You?
Farzan Majdani
Jakob Bossek
#OpenML