Quick presentation for the OpenML workshop in Eindhoven 2014

Personal introduction and work plan for the OpenML workshop that took place at TU/e (Eindhoven)

Published in: Data & Analytics

  1. Manuel Martín Salvador (@draxus, msalvador@bournemouth.ac.uk). OpenML workshop, Eindhoven, 21/10/2014
  2. Background
     ● MSc in Computer Engineering
     ● Master in Soft Computing and Intelligent Systems
     Currently
     ● PhD student: automatic and adaptive pre-processing for building predictive models
     ● Teaching: Data Mining lab
  3. Data preparation and pre-processing
  4. Data preparation and pre-processing
  5. Data preparation and pre-processing: labour-intensive tasks (up to 80% of a data mining process)
  6. Automating pre-processing
     - A lot of available techniques
     - No free lunch
     - Multiple combinations
     - Order of pre-processing methods matters
     - No semantics → some approaches use ontologies
     - Meta-learning → needs a good database of experiments
  7. Scientific workflow platforms and repositories with experiments
     Software / Repository / Applications
     - DiscoveryNet (inactive) / -
     - Kepler / - / Various
     - Taverna / MyExperiment (open) / Bioinformatics
     - Pegasus / - / Various
     - Galaxy / - / Biomedical
     - Pipeline Pilot / Accelrys (commercial)
     - * / MLComp (“open”) / Machine Learning
     - Weka, MOA, R, RapidMiner / OpenML (open) / Machine Learning
  8. OpenML statistics
     - Datasets: 1042. Tasks: 3025. Flows: 640.
     - Runs: 31540 (valid: 24410; with errors: 7130)
     - Datasets: 300. Individual components: 136. Paired components: 635.
     - “Flow size”: 1: 8198; 2: 12178; 3: 1993; 4: 1533; 5: 502; 6: 6
     - [Charts: Distribution of components; Distribution of datasets]
     - Only 3 Weka filters: Principal Components, Discretize, PLSFilter
  9. TO DO
     How to increase the number of pre-processing methods in OpenML?
     - The only way right now is using FilteredClassifier in Weka
     - What about R, MOA, RapidMiner?
     Improving flow representation
     - Right now it is difficult to see how components are connected
     - Clear distinction of parameters
     - What about including Weka flows (XML based) and ADAMS flows?
     - PMML support?
     Statistics for available data, tasks, flows and runs
     Flow recommendation system for a given dataset
     - [dataset, data characteristics, prediction accuracy, flow_id]
     Flow validation before executing it
     - [dataset, data characteristics, flow characteristics, failure]
  10. A little bit further
      Adapting flows while processing data streams
      - Detecting changes in data characteristics
      - Locally checking input/output in each flow component
      - Change propagation
      - Reducing cost of adaptation
  11. Photos CC by Cristina Granados. Visit us! Data Science Institute @ Bournemouth University
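
Slide 6 states that the order of pre-processing methods matters. The following is a minimal sketch of that point, not taken from the slides: the two helper functions and the toy values are assumptions chosen for illustration. It applies the same two steps (clipping and standardisation) in both orders and shows that the results differ.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def clip(values, low, high):
    """Limit every value to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]


# Toy data with one outlier (illustrative values, not from the slides).
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Same two steps, same parameters, opposite order.
clip_then_standardize = standardize(clip(data, 0.0, 10.0))
standardize_then_clip = clip(standardize(data), 0.0, 10.0)

print(clip_then_standardize)
print(standardize_then_clip)
```

In the first order the outlier is tamed before the mean and deviation are computed; in the second it dominates both statistics, so the remaining values end up below zero and are clipped to the lower bound.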
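
Slide 9 proposes a flow recommendation system built on records of the form [dataset, data characteristics, prediction accuracy, flow_id]. One possible reading of that idea, sketched under assumptions, is a nearest-neighbour lookup over past runs. The record layout follows the slide, but `META_RUNS`, `recommend_flow`, the two characteristics used and all numbers are hypothetical; this is not OpenML's API nor the author's method.

```python
import math
from collections import defaultdict

# Hypothetical meta-dataset: one record per past run, laid out as on slide 9.
# "characteristics" is assumed to be (fraction of missing values,
# fraction of numeric attributes); all values are made up for illustration.
META_RUNS = [
    {"dataset": "toy_a", "characteristics": (0.00, 0.90), "accuracy": 0.91, "flow_id": 11},
    {"dataset": "toy_a", "characteristics": (0.00, 0.90), "accuracy": 0.85, "flow_id": 12},
    {"dataset": "toy_b", "characteristics": (0.05, 0.80), "accuracy": 0.88, "flow_id": 11},
    {"dataset": "toy_c", "characteristics": (0.40, 0.10), "accuracy": 0.70, "flow_id": 11},
    {"dataset": "toy_c", "characteristics": (0.40, 0.10), "accuracy": 0.82, "flow_id": 13},
    {"dataset": "toy_d", "characteristics": (0.35, 0.20), "accuracy": 0.79, "flow_id": 13},
]


def recommend_flow(characteristics, runs, k=3):
    """Return the flow_id with the best mean accuracy among the k nearest runs."""
    # Rank past runs by Euclidean distance between data characteristics.
    nearest = sorted(
        runs, key=lambda r: math.dist(characteristics, r["characteristics"])
    )[:k]
    # Group the accuracies of those runs by flow.
    by_flow = defaultdict(list)
    for run in nearest:
        by_flow[run["flow_id"]].append(run["accuracy"])
    # Pick the flow whose neighbouring runs scored best on average.
    return max(by_flow, key=lambda f: sum(by_flow[f]) / len(by_flow[f]))


print(recommend_flow((0.02, 0.85), META_RUNS))
print(recommend_flow((0.38, 0.15), META_RUNS))
```

The same record layout with a failure flag in place of accuracy would support the flow validation item on the slide; a real system would also need to scale the characteristics before measuring distance.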
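
Slide 10 lists detecting changes in data characteristics as a step towards adapting flows on data streams. As a rough illustration only, the sketch below tracks a single characteristic (the mean of a sliding window) and flags a change when it moves too far from a reference sample. The class name `MeanShiftDetector`, the window size, the threshold and the toy stream are all assumptions, not something the slides specify.

```python
from collections import deque
from statistics import mean, pstdev


class MeanShiftDetector:
    """Flag a change when the sliding-window mean drifts from a reference mean."""

    def __init__(self, reference, window_size=5, threshold=3.0):
        self.ref_mean = mean(reference)
        self.ref_std = pstdev(reference)
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        """Add one stream value; return True if a change is detected."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data yet
        shift = abs(mean(self.window) - self.ref_mean)
        return shift > self.threshold * self.ref_std


# Toy reference sample and stream (illustrative values only).
reference = [1.0, 1.2, 0.8, 1.1, 0.9]
stream = [1.0, 0.9, 1.1, 1.0, 1.0, 5.0, 5.2, 4.9, 5.1, 5.0]

detector = MeanShiftDetector(reference)
flags = [detector.update(v) for v in stream]
print(flags)
```

A detector like this could sit at the input of each flow component, which is one way to read the "locally checking input/output" item on the same slide.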