4. Computer Science
• The scientific method
• Make a hypothesis about the world
• Generate predictions based on this hypothesis
• Design experiments to verify/falsify these predictions
• Predictions verified: hypothesis might be true
• Predictions falsified: hypothesis is wrong
5. Computer Science
• The scientific method (for ML)
• Make a hypothesis about (the structure of) given data
• Generate models based on this hypothesis
• Design experiments to measure accuracy of the models
• Good performance: It works (on this data)
• Bad performance: It doesn’t work on this data
• Aggregates ("it works 60% of the time") are not useful
6. Computer Science
• The scientific method (for ML), as above, with the key question overlaid:
• How can we characterize on which data the algorithm works well?
7. Computer Science
• The scientific method (for ML), as above, with a second question overlaid:
• What is the effect of the parameter settings?
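These questions call for per-dataset results rather than a single aggregate. A minimal sketch of such an experiment, using Weka's Java API (the dataset file names are placeholder assumptions):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Evaluate one algorithm on several datasets and report per-dataset
    // accuracy, so we can see where it works well, not "60% of the time".
    public class PerDatasetExperiment {
        public static void main(String[] args) throws Exception {
            String[] datasets = {"iris.arff", "diabetes.arff"}; // placeholders
            for (String file : datasets) {
                Instances data = DataSource.read(file);
                data.setClassIndex(data.numAttributes() - 1); // class = last attribute
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new SMO(), data, 10, new Random(1)); // 10-fold CV
                System.out.printf("%s: %.2f%% correct%n", file, eval.pctCorrect());
            }
        }
    }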
8. Meta-Learning
• The science of understanding which algorithms work
well on which types of data
• Hard: requires a thorough understanding of data and algorithms
• Requires good data: the results of extensive experimentation
• Why is this separate from other ML research?
• A thorough algorithm evaluation = a meta-learning study
• Original authors know algorithms and data best, have large sets
of experiments, are (presumably) interested in knowing on
which data their algorithms work well (or not)
9. Meta-Learning
With the right tools, can we make everyone a meta-learner?
[Diagram: large sets of experiments, dataset and algorithm characterizations, source code, learning curves, and bias-variance analyses feed a meta-learning process; its outputs include algorithm comparison, algorithm selection, algorithm insight, data insight, and ML algorithm design.]
14. Open machine learning?
• We can also be 'open'
• Simple, common formats to describe experiments, workflows,
algorithms,...
• Platform to share, store, query, interact
• We can go (much) further
• Share experiments automatically (open source ML tools)
• Experiment on-the-fly (cheap, no expensive instruments)
• Controlled experimentation (experimentation engine)
15. Formalizing machine learning
• Unique names for algorithms, datasets, evaluation
measures, data characterizations,... (ontology)
• Based on DMOP, OntoDM, KDOntology, EXPO,...
• Simple, structured way to describe algorithm setups,
workflows and experiment runs
• Detailed enough to reproduce all experiments
40. Workflow Setup
[Diagram: a workflow setup is composed ("part of") of algorithm setups and connections; each connection has a source and a target.]
Workflow: components, connections, and parameters (inputs)
41. Workflow Setup
[Diagram: as on the previous slide, but connections now also carry ports and a datatype.]
Workflow: components, connections, and parameters (inputs)
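Read as a data model, this suggests a handful of types. A minimal sketch, with all type and field names assumed for illustration rather than taken from the actual schema:

    import java.util.List;

    // Illustrative model of a workflow setup: components (algorithm setups),
    // connections between them, and parameter settings as inputs.
    record ParameterSetting(String name, String value) {}

    record AlgorithmSetup(String algorithm, List<ParameterSetting> parameters) {}

    // A connection links a source port of one component to a target port of
    // another, and carries a datatype.
    record Connection(String source, String sourcePort,
                      String target, String targetPort, String datatype) {}

    record WorkflowSetup(List<AlgorithmSetup> components,
                         List<Connection> connections) {}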
42. Workflow Example
[Diagram: a Weka workflow (1:mainFlow) built from 2:loadData (Weka.ARFFLoader, location=http://...), 3:crossValidate (Weka.Evaluation, F=10, S=1), and 4:learner (Weka.SMO, C=0.01) with 5:kernel (Weka.RBF, G=0.01); data flows from the loader into the cross-validation, which produces evaluations (eval) and predictions (pred); components carry logRuns=true/false flags.]
43. Workflow Example
[Diagram: the same workflow, now with its outputs shown explicitly: 6:Evaluations (eval), 7:Predictions (pred), and 8:Weka.Instances (data).]
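For reference, a sketch of this same workflow written directly against Weka's Java API; the dataset URL is elided on the slide and stays elided here:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.RBFKernel;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // 1:mainFlow - the workflow of slides 42-43 as plain Weka code.
    public class MainFlow {
        public static void main(String[] args) throws Exception {
            // 2:loadData - Weka.ARFFLoader (URL elided in the slide)
            Instances data = DataSource.read("http://...");
            data.setClassIndex(data.numAttributes() - 1);

            // 4:learner - Weka.SMO (C=0.01) with 5:kernel - Weka.RBF (G=0.01)
            SMO learner = new SMO();
            learner.setC(0.01);
            RBFKernel kernel = new RBFKernel();
            kernel.setGamma(0.01);
            learner.setKernel(kernel);

            // 3:crossValidate - Weka.Evaluation, F=10 folds, seed S=1
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(learner, data, 10, new Random(1));

            // 6:Evaluations and 7:Predictions
            System.out.println(eval.toSummaryString());
            System.out.println(eval.predictions().size() + " predictions stored");
        }
    }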
44. Setup
[Diagram: a setup can be an algorithm setup, a function setup f(x), a workflow, or an experiment setup; setups can be part of other setups.]
45. Experiment Setup
[Diagram: an experiment setup is composed of algorithm setups and workflows, and can contain experiment variables <X>.]
46. Experiment Setup
[Diagram: as on the previous slide.]
Also: experiment design, description, literature reference, author,...
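An experiment variable <X> could, for example, range over values of a learner parameter. A minimal sketch, assuming SMO's complexity constant C as the variable and a placeholder dataset:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // One experiment setup, one experiment variable <X>: the same workflow
    // is run once per value of X (here, SMO's parameter C).
    public class ParameterSweep {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);
            double[] valuesOfX = {0.01, 0.1, 1, 10};
            for (double c : valuesOfX) {
                SMO learner = new SMO();
                learner.setC(c);
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(learner, data, 10, new Random(1));
                System.out.printf("C=%.2f: %.2f%% correct%n", c, eval.pctCorrect());
            }
        }
    }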
61. Taking it further
Seamless integration
• Web service for sharing and querying experiments
• Integrate experiment sharing in ML tools (WEKA,
KNIME, RapidMiner, R, ...)
• Mapping implementations, evaluation measures,...
• Online platform for custom querying, community
interaction
• Semantic wiki: algorithm/data descriptions, rankings, ...
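What querying such a web service might look like from client code; the endpoint and its query parameter are invented for illustration and are not the actual service API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Hypothetical client: fetch all shared experiment results for one algorithm.
    public class ExperimentQuery {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://expdb.example.org/query?algorithm=SMO")) // hypothetical endpoint
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }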
62. Experimentation Engine
• Controlled experimentation (Delve, MLComp)
• Download datasets, build training/test sets
• Feed training and test sets to algorithms, retrieve predictions/
models
• Run broad set of evaluation measures
• Benchmarking (Cross-Validation), learning curve analysis,
bias-variance analysis, workflows(!)
• Compute data properties for new datasets
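A sketch of that last step: computing a few simple data properties (meta-features) with Weka. The dataset name is a placeholder, and the entropy computation assumes a nominal class attribute:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Simple dataset characterization: size, dimensionality, class entropy.
    public class DataCharacterizer {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);  // nominal class assumed
            System.out.println("instances:  " + data.numInstances());
            System.out.println("attributes: " + data.numAttributes());
            System.out.println("classes:    " + data.numClasses());

            // class entropy: -sum_i p_i * log2(p_i) over the class distribution
            int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
            double entropy = 0;
            for (int c : counts) {
                if (c == 0) continue;
                double p = (double) c / data.numInstances();
                entropy -= p * Math.log(p) / Math.log(2);
            }
            System.out.printf("class entropy: %.3f bits%n", entropy);
        }
    }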
63. Why would you use it? (seeding)
• Let the system run the experiments for you
• Immediate, highly detailed benchmarks (no repeats)
• Up to date, detailed results (vs. static, aggregated in journals)
• All your results organized online (private?), anytime, anywhere
• Interact with people (weird results?)
• Get credit for all your results (e.g. citations), unexpected results
• Visibility, new collaborations
• Check whether your algorithm is really the best (e.g. active testing)
• On which datasets does it perform well/badly?
65. Merci
Danke · Thanks · Xie Xie · Diolch · Toda · Dank U · Grazie · Spasiba · Efharisto · Gracias · Arigato · Köszönöm · Teşekkürler · Kia ora · Dhanyavaad · Hvala
http://expdb.cs.kuleuven.be