Quick presentation for the OpenML workshop in Eindhoven 2014

Manuel Martín Salvador
@draxus
msalvador@bournemouth.ac.uk
OpenML workshop
Eindhoven 21/10/2014

Background
● MSc. Computer Engineering
● Master in Soft Computing and Intelligent Systems
Currently
● PhD Student – Automatic and adaptive pre-processing for building predictive models
● Teaching – Data Mining lab

Data preparation and pre-processing

Labour-intensive tasks (up to 80% of a data mining process)

Automating pre-processing
A lot of available techniques
No free lunch
Multiple combinations
Order of pre-processing methods matters
No semantics → some approaches use ontologies
Meta-learning → needs a good database of experiments
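
The point that the order of pre-processing methods matters can be illustrated with a minimal Python sketch. The two toy steps (clipping and min-max scaling), the clipping bounds and the sample values are assumptions chosen for illustration, not methods taken from the talk.

```python
def min_max_scale(values):
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Cap every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical attribute with one extreme value.
data = [1.0, 2.0, 3.0, 100.0]

# Same two methods, applied in opposite order.
clip_then_scale = min_max_scale(clip(data, 0.0, 10.0))
scale_then_clip = clip(min_max_scale(data), 0.0, 10.0)

print(clip_then_scale)
print(scale_then_clip)
```

Clipping first removes the extreme value before the range is computed, so the remaining values spread over [0, 1]; scaling first squeezes them close to 0 and the later clipping has no effect.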

Scientific workflow platforms and repositories with experiments
Software                   Repository              Applications
DiscoveryNet (inactive)    -
Kepler                     -                       Various
Taverna                    MyExperiment (open)     Bioinformatics
Pegasus                    -                       Various
Galaxy                     -                       Biomedical
Pipeline Pilot             Accelrys (commercial)
*                          MLComp (“open”)         Machine Learning
Weka, MOA, R, RapidMiner   OpenML (open)           Machine Learning

OpenML statistics
Datasets: 1042
Tasks: 3025
Flows: 640
Runs: 31540
Valid: 24410
With errors: 7130
Datasets: 300
Individual components: 136
Paired components: 635
“Flow size”:
1 – 8198
2 – 12178
3 – 1993
4 – 1533
5 – 502
6 – 6
[Figure: Distribution of components]
[Figure: Distribution of datasets]
Only 3 Weka filters:
Principal Components, Discretize, PLSFilter
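
The counts on this slide can be cross-checked with a few lines of Python. Reading the “flow size” counts as referring to the valid runs is an assumption, although the totals are consistent with it.

```python
# Figures copied from the slide.
runs_total = 31540
runs_valid = 24410
runs_with_errors = 7130
flow_size_counts = {1: 8198, 2: 12178, 3: 1993, 4: 1533, 5: 502, 6: 6}

# Valid runs plus runs with errors should give the total number of runs.
assert runs_valid + runs_with_errors == runs_total

# Assumption: the flow size counts are taken over the valid runs.
counted = sum(flow_size_counts.values())

# Share of runs per flow size.
shares = {size: count / counted for size, count in flow_size_counts.items()}
for size, share in shares.items():
    print(f"flow size {size}: {share:.1%}")
```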

TO DO
How to increase the number of pre-processing methods in OpenML?
- The only way right now is using FilteredClassifier in Weka
- What about R, MOA, RapidMiner?
Improving flow representation
- Right now it is difficult to see how components are connected
- Clear distinction of parameters
- What about including Weka flows (XML based) and ADAMS flows?
- PMML support?
Statistics for available data, tasks, flows and runs
Flow recommendation system for a given dataset
[dataset, data characteristics, prediction accuracy, flow_id]
Flow validation before executing it
[dataset, data characteristics, flow characteristics, failure]
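
One possible reading of the flow recommendation item is a nearest-neighbour lookup over records of the form [dataset, data characteristics, prediction accuracy, flow_id]. The sketch below assumes that reading; the characteristic names, the distance measure, the records and the function name `recommend_flow` are all hypothetical and do not come from OpenML.

```python
import math

# Hypothetical data characteristics for three known datasets.
SMALL_A = {"n_instances": 150, "n_features": 4, "missing_ratio": 0.0}
SMALL_B = {"n_instances": 200, "n_features": 5, "missing_ratio": 0.0}
LARGE = {"n_instances": 20000, "n_features": 60, "missing_ratio": 0.1}

# Hypothetical meta-database rows:
# (dataset, data characteristics, prediction accuracy, flow_id)
META_DB = [
    ("d1", SMALL_A, 0.95, 11),
    ("d1", SMALL_A, 0.90, 12),
    ("d2", LARGE, 0.80, 12),
    ("d2", LARGE, 0.85, 13),
    ("d3", SMALL_B, 0.88, 12),
    ("d3", SMALL_B, 0.70, 11),
]


def to_vector(ch):
    """Map characteristics to a numeric vector (log scale for sizes)."""
    return (
        math.log10(ch["n_instances"]),
        math.log10(ch["n_features"]),
        ch["missing_ratio"],
    )


def recommend_flow(characteristics, meta_db, k=2):
    """Return the flow_id with the best mean accuracy on the k most similar datasets."""
    target = to_vector(characteristics)
    distance = {}
    for dataset, ch, _, _ in meta_db:
        distance[dataset] = math.dist(target, to_vector(ch))
    nearest = set(sorted(distance, key=distance.get)[:k])

    accuracies = {}
    for dataset, _, accuracy, flow_id in meta_db:
        if dataset in nearest:
            accuracies.setdefault(flow_id, []).append(accuracy)
    return max(accuracies, key=lambda f: sum(accuracies[f]) / len(accuracies[f]))


new_dataset = {"n_instances": 180, "n_features": 4, "missing_ratio": 0.0}
print(recommend_flow(new_dataset, META_DB, k=2))
```

The same table with a failure flag in place of the accuracy could feed a similar lookup for flow validation, again as an assumption rather than a described design.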

A little bit further
Adapting flows while processing data streams
- Detecting changes in data characteristics
- Locally checking input/output in each flow component
- Change propagation
- Reducing cost of adaptation
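
Detecting changes in data characteristics on a stream could look roughly like the sketch below. It monitors a single characteristic (the mean of one attribute) with a sliding window; the class name `MeanShiftDetector`, the window size, the threshold and the synthetic stream are assumptions for illustration, and this simple test stands in for whatever detector would actually be used.

```python
from collections import deque
from statistics import mean, stdev


class MeanShiftDetector:
    """Flags a change when the sliding-window mean drifts from a reference mean."""

    def __init__(self, reference, window_size=20, threshold=3.0):
        self.ref_mean = mean(reference)
        self.ref_std = stdev(reference)  # reference must not be constant
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        """Add one value; return True if the window mean is out of bounds."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        # Allowed deviation: threshold times the standard error of the window mean.
        limit = self.threshold * self.ref_std / len(self.window) ** 0.5
        return abs(mean(self.window) - self.ref_mean) > limit


# Synthetic stream: values alternate 0/1, then shift to 5/6 at position 40.
stream = [i % 2 for i in range(40)] + [5 + i % 2 for i in range(40)]

detector = MeanShiftDetector(reference=[0.0, 1.0] * 25)
alarms = [i for i, value in enumerate(stream) if detector.update(value)]
print(alarms[0] if alarms else None)
```

In an adaptive flow, an alarm like this would be the trigger for checking the affected component and propagating the change to the components downstream.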

Photos CC by Cristina Granados
Visit us!
Data Science Institute @ Bournemouth University
