Manuel Martín Salvador 
@draxus 
msalvador@bournemouth.ac.uk 
OpenML workshop 
Eindhoven 21/10/2014
Background 
● MSc. Computer Engineering 
● Master in Soft Computing and Intelligent Systems 
Currently 
● PhD Student – Automatic and adaptive pre-processing for building 
predictive models 
● Teaching – Data Mining lab
Data preparation and pre-processing
Labour-intensive tasks (up to 80% of a data mining process)
Automating pre-processing 
A lot of available techniques 
No free lunch 
Multiple combinations 
Order of pre-processing methods matters 
No semantics → some approaches use ontologies 
Meta-learning → needs a good database of 
experiments
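
The point that the order of pre-processing methods matters can be illustrated with a minimal sketch. The two steps (mean imputation and upper clipping), the function names, and the toy values are assumptions made for illustration; they are not taken from the slides.

```python
from typing import List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def clip_upper(values: List[Optional[float]], upper: float) -> List[Optional[float]]:
    """Cap every observed value at `upper`; missing entries stay missing."""
    return [v if v is None else min(v, upper) for v in values]


# Toy column with one missing value and one outlier (hypothetical data).
raw = [1.0, 2.0, None, 3.0, 100.0]

impute_then_clip = clip_upper(impute_mean(raw), upper=10.0)
clip_then_impute = impute_mean(clip_upper(raw, upper=10.0))

print(impute_then_clip)  # [1.0, 2.0, 10.0, 3.0, 10.0]
print(clip_then_impute)  # [1.0, 2.0, 4.0, 3.0, 10.0]
```

Imputing first lets the outlier inflate the mean (26.5, later clipped to 10), whereas clipping first yields an imputed value of 4. The same two methods produce different data depending on their order.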
Scientific workflow platforms and 
repositories with experiments 
Software                  Repository             Applications
DiscoveryNet              (inactive)             -
Kepler                    -                      Various
Taverna                   MyExperiment (open)    Bioinformatics
Pegasus                   -                      Various
Galaxy                    -                      Biomedical
Pipeline Pilot            Accelrys (commercial)
*                         MLComp (“open”)        Machine Learning
Weka, MOA, R, RapidMiner  OpenML (open)          Machine Learning
OpenML statistics 
Datasets: 1042 
Tasks: 3025 
Flows: 640 
Runs: 31540 
Valid: 24410 
With errors: 7130 
Datasets: 300 
Individual components: 136 
Paired components: 635 
“Flow size”:
  1 – 8198
  2 – 12178
  3 – 1993
  4 – 1533
  5 – 502
  6 – 6
[Chart: Distribution of components]
[Chart: Distribution of datasets]
Only 3 Weka filters: 
Principal Components, Discretize, PLSFilter
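
The run counts on this slide can be cross-checked with a few lines of Python. The figures are copied from the slide; reading the “flow size” breakdown as a count of valid runs per flow size is an assumption, suggested by the fact that the six figures add up to the valid-run total.

```python
# Figures copied from the slide.
runs_total = 31540
runs_valid = 24410
runs_with_errors = 7130
flow_size_counts = {1: 8198, 2: 12178, 3: 1993, 4: 1533, 5: 502, 6: 6}


def check_totals() -> tuple:
    """Return (valid + errors == total, flow-size counts sum to valid runs)."""
    return (
        runs_valid + runs_with_errors == runs_total,
        sum(flow_size_counts.values()) == runs_valid,
    )


def share_by_size() -> dict:
    """Fraction of the flow-size total that falls into each flow size."""
    total = sum(flow_size_counts.values())
    return {size: count / total for size, count in flow_size_counts.items()}


print(check_totals())
for size, share in share_by_size().items():
    print(f"flow size {size}: {share:.1%}")
print(f"runs with errors: {runs_with_errors / runs_total:.1%}")
```

Under that reading, flows of size 1 or 2 account for most of the valid runs, and roughly a fifth of all runs ended with errors.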
TO DO 
How to increase the number of pre-processing methods in OpenML? 
- The only way right now is using FilteredClassifier in Weka 
- What about R, MOA, RapidMiner? 
Improving flow representation 
- Right now it is difficult to see how components are connected 
- Clear distinction of parameters 
- What about including Weka flows (XML based) and ADAMS flows? 
- PMML support? 
Statistics for available data, tasks, flows and runs 
Flow recommendation system for a given dataset 
[dataset, data characteristics, prediction accuracy, flow_id] 
Flow validation before executing it 
[dataset, data characteristics, flow characteristics, failure]
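
One way the flow recommendation idea could be approached is a nearest-neighbour lookup over records of the form [dataset, data characteristics, prediction accuracy, flow_id]. The sketch below is only an illustration under stated assumptions: the `MetaRecord` type, the `recommend_flow` function, the choice of characteristics, and all values are hypothetical, and it does not use the OpenML API.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MetaRecord:
    dataset: str
    characteristics: Tuple[float, ...]  # hypothetical numeric meta-features
    accuracy: float
    flow_id: int


def recommend_flow(records: List[MetaRecord], query: Tuple[float, ...], k: int = 2) -> int:
    """Return the flow with the best mean accuracy on the k datasets nearest to `query`."""
    ranked = sorted(records, key=lambda r: math.dist(r.characteristics, query))
    nearest: List[str] = []
    for record in ranked:
        if record.dataset not in nearest:
            nearest.append(record.dataset)
        if len(nearest) == k:
            break
    scores: Dict[int, List[float]] = defaultdict(list)
    for record in records:
        if record.dataset in nearest:
            scores[record.flow_id].append(record.accuracy)
    return max(scores, key=lambda fid: sum(scores[fid]) / len(scores[fid]))


# Made-up meta-database: dataset names, characteristics, accuracies and flow ids are illustrative.
records = [
    MetaRecord("toy_a", (3.0, 1.0, 0.0), 0.90, 101),
    MetaRecord("toy_a", (3.0, 1.0, 0.0), 0.80, 102),
    MetaRecord("toy_b", (3.2, 1.1, 0.1), 0.70, 101),
    MetaRecord("toy_b", (3.2, 1.1, 0.1), 0.85, 102),
    MetaRecord("toy_c", (5.0, 3.0, 0.4), 0.60, 101),
    MetaRecord("toy_c", (5.0, 3.0, 0.4), 0.95, 103),
]

print(recommend_flow(records, query=(3.1, 1.0, 0.05), k=2))  # 102
```

The same record layout with a failure flag in place of accuracy could feed a flow validation check, although that variant is not sketched here.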
A little bit further 
Adapting flows while processing data streams 
- Detecting changes in data characteristics 
- Locally checking input/output in each flow component 
- Change propagation 
- Reducing cost of adaptation
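
Detecting changes in data characteristics on a stream could, for instance, start from a sliding-window comparison against reference statistics. The sketch below is a simple mean-shift heuristic chosen for illustration, not a method named in the slides; the class name, window size, threshold, and stream values are all assumptions.

```python
from collections import deque
from statistics import mean, stdev
from typing import List


class MeanShiftDetector:
    """Flag a change when the window mean drifts too far from the reference mean."""

    def __init__(self, reference: List[float], window_size: int = 5, threshold: float = 3.0):
        self.ref_mean = mean(reference)
        self.ref_std = stdev(reference)
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Add one observation; return True once the full window looks shifted."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        shift = abs(mean(self.window) - self.ref_mean)
        return shift > self.threshold * self.ref_std


# Hypothetical values of one feature: reference batch, then a stream whose level jumps.
reference = [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0]
stream = [10.0, 9.0, 11.0, 10.0, 10.0, 20.0, 21.0, 19.0, 20.0, 20.0]

detector = MeanShiftDetector(reference)
flags = [detector.update(x) for x in stream]
print(flags.index(True))  # 6
```

A flag raised at one component could then trigger the local input/output check and the change propagation listed above; how to do that at low cost is the open question the slide raises.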
Photos CC by Cristina Granados 
Visit us! 
Data Science Institute @ Bournemouth University

Quick presentation for the OpenML workshop in Eindhoven 2014
