
OpenML data@Sheffield


Presentation on the OpenML initiative to enable open, collaborative machine learning during the data@Sheffield event. We discuss how data, machine learning algorithms and experiments can be analysed collaboratively by data scientists and domain scientists, as well as citizen scientists.

Published in: Science


  1. OpenML: Collaborative Machine Learning. Joaquin Vanschoren, TU Eindhoven, 2015
  2. What if we could explore data collaboratively?
  3. What if we could explore data collaboratively, on web scale?
  4. What if we could explore data collaboratively, on web scale, in real time?
  5. Data(-driven) science ecosystem: problem definition, data collection; algorithm design, selection; implementing, running code. Few of us are experts in all crafts at once (we collaborate).
  6. Gaps in the ecosystem. Domain experts: learning and trying the latest/best data science techniques takes a lot of time. Algorithm experts: learning the domain language and finding the latest/relevant data takes a lot of time. Unnecessary friction: time is lost on tasks that others can do in a fraction of the time, or that could be automated altogether.
  7. In some subfields, up to 85% of medical research resources are wasted: underpowered, irreproducible research; improper analysis and false associations; flexibility in design and analysis; results not translatable to applications. Research is better if it is large-scale and interdisciplinary, easy to replicate, with open sharing of data, protocols and software, and better methods and tests.
  8. Plasma physics: surface waves and processes (figure: solar wind, bow shock, magnetosheath, magnetopause). Nicolas Aunai.
  10. Gaps in the ecosystem: they slow down and underpower research. Ioannidis: 85% of medical research is ineffective; similar findings in other fields. Enterprises lack access to expertise and data: duplicate work, suboptimal solutions; value from data could come faster and cheaper. Shortage of data science expertise (McKinsey: 190k data scientists needed by 2018). Data science needs to scale: frictionless collaboration across fields and labs, open data, democratisation, automation of drudge work.
  11. Small-scale collaboration: necessary, but people's attention doesn't scale. Time is limited; talking is slow and labour-intensive; and it is likely biased: asking N experts gives you N different answers.
  12. Literature? Yes, but slow: too many papers, domain-specific jargon, little cross-domain relevance; it is often faster to just try things yourself. Mining the literature? OK, but information in papers is often imprecise, aggregated, and hard to reproduce. The paper is a 300-year-old medium; the internet is much better.
  13. Networked science: broadcast data so that many minds analyse it in different ways; broadcast code so that many minds can apply it to their own data; organize everything on a collaboration platform.
  14. Research, differently. Polymath: solving math problems through massive collaboration. Broadcast a question and combine many minds to solve it. Solved hard problems in weeks; many (joint) publications.
  15. Research, differently. SDSS: a robotic telescope with its data publicly online (SkyServer). Over 1 million distinct users vs. 10,000 professional astronomers: broadcast the data, and many minds will ask the right questions. Thousands of papers and citations, plus citizen-science discoveries. Next: the Large Synoptic Survey Telescope (15 TB of data per night, 10-year survey).
  16. Research, differently. How do you label a million galaxies? Offer the right tools so that anybody can contribute, instantly. Many novel discoveries by scientists and citizens. More? Read 'Networked Science' by M. Nielsen.
  17. Designed serendipity: create a sandbox for trying out the latest algorithms on the latest data, and analyze the results together. What's hard or surprising for one scientist is easy for another. Data is the new soil, and algorithms are like seeds that you sow in it; we all have our own bits of soil and seed.
  18. Remove friction: contribute in seconds, not days. Use the infrastructure and tools that scientists already use. An organized body of compatible and actionable scientific data and tools. Easy, organised communication. Track who did what, and give credit.
  19. Millions of real, open datasets are generated • Drug activity, gene expressions, astronomical observations, text, … Extensive toolboxes exist to analyse data • MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML, … Massive numbers of experiments are run, but most of this information is lost forever (it stays in people's heads) • No learning across people, labs or fields • Everyone starts from scratch, which slows research and innovation
  20. Let's connect machine learning environments to a network, so that we can organize and learn from experience. Connect minds and collaborate globally in real time. Train algorithms to automate data science.
  21. OpenML: frictionless, networked machine learning. Easy to use: integrated in ML environments; automated, reproducible sharing. Organized data: experiments connected to data, code and people anywhere. Easy to contribute: post a single dataset, algorithm, experiment or comment. Reward structure*: build reputation and trust (e-citations, social interaction).
  22. Data (ARFF): uploaded or referenced, versioned; analysed, characterized and organised online.
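Datasets arrive in ARFF, a plain-text format with declared attributes followed by rows. A tiny, illustrative reader (the real format has more features; in practice one would use a library such as scipy.io.arff or liac-arff):

```python
def parse_arff(text):
    """Minimal ARFF sketch: collect attribute names, then data rows."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blanks and comments
        low = line.lower()
        if low.startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

# Hypothetical toy file in the style OpenML serves.
ARFF_TEXT = """\
@RELATION toy
@ATTRIBUTE sepal_length NUMERIC
@ATTRIBUTE class {a,b}
@DATA
5.1,a
6.2,b
"""

attrs, rows = parse_arff(ARFF_TEXT)
print(attrs)  # ['sepal_length', 'class']
print(rows)   # [['5.1', 'a'], ['6.2', 'b']]
```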
  24. Tasks contain data, goals and procedures (e.g. train-test splits, evaluation measure). Machine-readable by tools, which automates experimentation. Results are organized online in a real-time overview.
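A task fixes the estimation procedure so that everyone evaluates on identical splits. A minimal sketch of deterministic 10-fold splits (illustrative only; `ten_fold_splits` is a hypothetical name, not OpenML's actual splitting code):

```python
import random

def ten_fold_splits(n_rows, seed=0):
    """Deterministic 10-fold cross-validation splits.

    Everyone who uses the same seed gets identical folds, which is
    what makes results on a task directly comparable.
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    splits = []
    for k in range(10):
        test = folds[k]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        splits.append((train, test))
    return splits

splits = ten_fold_splits(100)
print(len(splits))                           # 10 folds
print(len(splits[0][0]), len(splits[0][1]))  # 90 train / 10 test
```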
  26. Flows (code): run locally, registered and versioned. Integrations and APIs (REST, Java, R, Python, …).
  28. Experiments: auto-uploaded and evaluated online; reproducible, linked to data, flows and authors.
  30. Explore, Reuse, Share
  31. Data • Search by keywords or properties • Filters • Tagging
  32. Data • Wiki-like descriptions • Analysis and visualisation of features
  33. Data • Wiki-like descriptions • Analysis and visualisation of features • Auto-calculation of a large range of meta-features, to discover similar datasets and learn across datasets
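Meta-features describe a dataset by numbers so datasets can be compared. A sketch of a few "simple" and information-theoretic ones (OpenML computes a much larger set server-side; the function name here is hypothetical):

```python
import math
from collections import Counter

def simple_meta_features(rows, labels):
    """A handful of dataset meta-features: sizes and class entropy."""
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "n_instances": n,
        "n_features": len(rows[0]),
        "n_classes": len(counts),
        "class_entropy": entropy,
    }

# Toy data: 4 instances, 2 features, perfectly balanced classes.
rows = [[5.1, 3.5], [6.2, 2.9], [5.9, 3.0], [4.7, 3.2]]
labels = ["a", "b", "b", "a"]
print(simple_meta_features(rows, labels))
```

A balanced two-class dataset has class entropy 1.0 bit, the maximum for two classes.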
  34. Tasks • Example: classification on a click-prediction dataset, using 10-fold CV and AUC • People submit results (e.g. predictions) • Server-side evaluation (many measures) • All results organized online, per algorithm and parameter setting • Online visualizations: every dot is a run, plotted by score
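Server-side evaluation means the server recomputes measures such as AUC from the uploaded predictions. A sketch of one such measure, AUC via the rank-sum (Mann-Whitney) formulation:

```python
def auc(y_true, y_score):
    """Area under the ROC curve: fraction of positive/negative pairs
    ranked correctly, counting ties as half."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly ranked predictions score 1.0.
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```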
  35. • Leaderboards visualize progress over time: who delivered breakthroughs when, and who built on top of previous solutions • Collaborative: all code and data are available, learn from others • Real-time: clear who submitted first; others can improve immediately
  36. Classroom challenges
  37. • All results obtained with the same flow organised online • Results linked to datasets and parameter settings -> trends/comparisons • Visualisations (dots are models, ranked by score, coloured by parameters)
  38. • Detailed run info • Author, data, flow, parameter settings, result files, … • Evaluation details (e.g., results per sample)
  39. REST API (new). Requires an OpenML account. Predictive URLs.
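The idea behind predictive URLs is that resource ids map to URLs by a fixed pattern, so clients can construct them without discovery calls. A sketch (the exact v1 JSON endpoint layout shown here is an assumption for illustration):

```python
# Hypothetical helpers that build OpenML-style predictive URLs.
BASE = "https://www.openml.org/api/v1/json"

def data_url(data_id):
    """URL for a dataset's metadata, derived purely from its id."""
    return f"{BASE}/data/{data_id}"

def task_url(task_id):
    """URL for a task description, derived purely from its id."""
    return f"{BASE}/task/{task_id}"

print(data_url(61))
print(task_url(59))
```

Because the pattern is fixed, a tool that knows only an id can fetch, link, or cache the resource without any prior API call.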
  40. R API (idem for Java, Python; see the tutorial): list datasets, list tasks, list flows, list runs and results.
  41. Share, Explore, Reuse
  42. Website: 1) search, 2) table view, 3) CSV export*
  43. R API (idem for Java, Python; see the tutorial): download datasets, download tasks, download flows.
  44. R API (idem for Java, Python; see the tutorial): download a run, download a run with predictions.
  45. Reuse, Explore, Share
  46. WEKA: OpenML extension in the plugin manager
  47. MOA
  48. RapidMiner: 3 new operators
  49. R API: run a task (with mlr).
  50. R API: run a task (with mlr), and upload the result.
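The slides above run a task with mlr and upload the run. As a language-neutral sketch, the loop a run executes locally looks roughly like this (toy majority-vote "flow" and hand-written splits, both hypothetical; the collected predictions are the artifact a run uploads):

```python
def majority_classifier(train_labels):
    """Stand-in 'flow': predicts the most common training label."""
    return max(set(train_labels), key=train_labels.count)

labels = ["a", "a", "a", "b", "a", "a"]
# Task-defined folds: (train indices, test indices) pairs.
splits = [([0, 1, 2, 3], [4, 5]), ([2, 3, 4, 5], [0, 1])]

run_predictions = []
for train_idx, test_idx in splits:
    model = majority_classifier([labels[i] for i in train_idx])
    # Record one prediction per held-out instance.
    run_predictions += [(i, model) for i in test_idx]

print(run_predictions)
```

Uploading then amounts to posting these per-instance predictions (plus flow and parameter metadata) to the server, which evaluates them.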
  51. Collaboration tools (in progress). Studies (e-papers): start with a question, invite others (or everyone); the online counterpart of a paper, linkable. Circles: create collaborations with trusted researchers; share results within the team prior to publication. Reputation: auto-track reuse of shared resources (data, code); prove your activity, reach and impact. Notebooks: easy sharing of, and collaboration on, scripts.
  52. Data science, differently. Change scale: invite anyone/everyone to work with your data; global organization: the state of the art online; discover interesting people, data, code, … Change speed: easy data/code reuse, automated sharing; real-time collaboration (seconds, not days). Automation: bots that automatically run algorithms on data; discover similar datasets, do basic analyses; learn from lots of data how to select/optimize algorithms.
  53. AutoML: automating machine learning. Learn from many datasets and experiments how to do data analysis; create services. (Diagram: human scientists exchange meta-data, models and evaluations through an API with automated processes: data cleaning, algorithm selection, parameter optimization, workflow construction, post-processing.)
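The algorithm-selection step of the diagram can be sketched as: try each candidate on held-out data and keep the best (real AutoML systems add parameter optimization, workflow construction, etc.; the toy threshold rules below are hypothetical):

```python
def accuracy(model, rows, labels):
    """Fraction of held-out instances the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

# Two toy 'algorithms': threshold rules on the first feature.
candidates = {
    "rule_low":  lambda r: "a" if r[0] < 3.0 else "b",
    "rule_high": lambda r: "a" if r[0] < 6.0 else "b",
}

valid_rows = [[2.0], [4.0], [5.0], [7.0]]
valid_labels = ["a", "a", "a", "b"]

best_name = max(
    candidates,
    key=lambda name: accuracy(candidates[name], valid_rows, valid_labels),
)
print(best_name)  # rule_high: classifies all 4 validation instances correctly
```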
  54. OpenML in drug discovery. ChEMBL database: SMILES, molecular properties (e.g. MW, LogP), fingerprints (e.g. FP2, FP3, FP4, MACCS). (Table of molecular descriptors MW, LogP, TPSA and fingerprint bits omitted.) 10,000+ regression datasets; 1.4M compounds, 10k proteins, 12.8M activities. Predict which drugs will inhibit certain proteins (and hence viruses, parasites, …). 2,750 targets have >10 compounds, × 4 fingerprints.
  55. OpenML in drug discovery: predict the best algorithm with meta-models. Algorithms compared: cforest, ctree, earth, fnn, gbm, glmnet, lasso, lm, nnet, pcr, plsr, rforest, ridge, rtree; fingerprints: ECFP4_1024, FCFP4_1024, ECFP6_1024, FCFP6_1024 (0.697, 0.427, 0.725, 0.627). (Figure: a meta-decision tree splitting on meta-features such as msdfeat, mutualinfo, skewresp, nentropyfeat, netcharge and hydphob, predicting glmnet, fnn, pcr, rforest or ridge.) Meta-features: simple, statistical, info-theoretic, landmarkers; target descriptors: aliphatic index, hydrophobicity, net charge, molecular weight, sequence length, …
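The idea of a meta-model: given meta-features of past datasets and which algorithm won on each, predict the best algorithm for a new dataset. A 1-nearest-neighbour rule stands in here for the decision tree used in the study; all numbers are made up for illustration:

```python
# Past datasets: (meta-features [net charge, hydrophobicity], winner).
# These values and the nearest-neighbour rule are illustrative only.
past = [
    ([0.9, -0.1], "glmnet"),
    ([14.0, 0.3], "rforest"),
    ([1.1, -0.2], "ridge"),
]

def predict_best(meta):
    """Recommend the algorithm that won on the most similar past dataset."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(past, key=lambda p: dist(p[0], meta))[1]

print(predict_best([13.0, 0.2]))   # nearest to the rforest dataset
```

Real meta-learning replaces the toy distance rule with a model trained on many datasets' meta-features, but the input/output contract is the same.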
  56. OpenML in drug discovery: best algorithms? Best features? (I. Olier et al., MetaSel@ECMLPKDD 2015.)
  57. OpenML in drug discovery: a new technique, stacked generalisation, gives the best model by far. (I. Olier et al., MetaSel@ECMLPKDD 2015.)
  58. OpenML in drug discovery: a new technique, multi-target learning. When few drugs are tested on a given target, combine with the data on 'related' targets (same family, similar gene sequence). We have just scratched the surface: the data will be available on OpenML for many more studies and ideas, and to test new algorithms, …
  59. OpenML community, Jan-Jun 2015: used all over the world; 400 seven-day and 1,700 thirty-day active users, and growing; thousands of datasets and flows, 450,000+ runs.
  60. Join OpenML: open source, on GitHub; regular workshops and hackathons. Next workshop: Lorentz Center (Leiden), 14-18 March 2016.
  61. Thank you. Joaquin Vanschoren, Jan van Rijn, Bernd Bischl, Matthias Feurer, Michel Lang, Nenad Tomašev, Giuseppe Casalicchio, Luis Torgo, Farzan Majdani, Jakob Bossek. You? #OpenML