OpenML Tutorial: Networked Science in Machine Learning


Many sciences have made significant breakthroughs by adopting online tools that help organize, structure and mine information that is too detailed to be printed in journals. In this presentation, we introduce OpenML, a place for machine learning researchers to share and organize data in fine detail, so that they can collaborate more effectively with others to tackle harder problems. We discuss what benefits it brings for machine learning research, individual scientists, as well as students and practitioners. We show practical use cases and APIs for interacting with the system from machine learning software.

Published in: Science
  1. 1. Networked Machine Learning. Joaquin Vanschoren (TU/e), 2014. #OpenML
  2. 2. Research different.
  3. 3. 1610: Galileo Galilei discovers Saturn’s rings. ‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’
  4. 4. Research different. Royal Society: “Take nobody’s word for it”. Scientific journal: reputation-based culture.
  5. 5. 300 years later, journals show limits: • Complex code not included • Large data sets not included • Experiment details scant • Results hard to reproduce • Papers not updatable • Slow, incomplete tracking of paper impact • Publication bias • No online public discussion • Open access?
  6. 6. Journals: long-term memory. Internet: short-term working memory. Networked science: online databases, open source code, web services, APIs, collaborative tools! Open, scalable collaboration. Real-time discussion. Combine, reuse scientific results. Citizen science.
  7. 7. Research different. Polymaths: solve math problems through massive collaboration (not competition). Broadcast a question, combine many minds to solve it. Solved hard problems in weeks. Many (joint) publications.
  8. 8. Research different. SDSS: robotic telescope, data publicly online (SkyServer). Over 1 million distinct users vs. 10,000 astronomers. Broadcast data, allow many minds to ask the right questions. Thousands of papers.
  9. 9. Research different. Galaxy Zoo: citizen scientists classify a million galaxies. Offer the right tools so that anybody can be a scientist. Many novel discoveries by scientists and citizens.
  10. 10. Research different. Sharing data sparks discovery. Designed serendipity: what’s hard for one scientist is easy for another; surprising ideas and observations can spark new discoveries. Share and organise data for easy, large-scale collaboration. Data is exploding in all sciences: collaborative data analysis is needed.
  11. 11. Building reputation. Authorship: easy to contribute, and contributions are stored and visible online. Collaboration: build trust, work with new people. Citation: more people see, build upon, and cite your work; tell people how to cite data and code. Altmetrics: track reuse/interest online (arXiv).
  12. 12. Networked Machine Learning
  13. 13. Machine learning. Complex code, large-scale data, experiments (impossible to print). Experiments are not shared online, so it is impossible to build on prior work, which inhibits deeper analysis (e.g. meta-learning). Low reproducibility and generalisability (studies contradict each other). Few collaborative tools to speed up research. What if we could all connect with each other, and with other scientists, to explore and apply machine learning?
  14. 14. OpenML. A place to share data, code, and experiments in full detail. All results are organised and linked together for further (meta-)analysis, reuse, discussion, study, and education. Links to (open-source) code and open data anywhere online. Anyone can post data to analyse; anyone can share code and results (models, predictions, evaluations). Integrated in ML platforms (R, Weka, RapidMiner, …) to automatically load data and upload results. Scientists can work in teams, but results are only publicly visible if data and code are shared.
  15. 15. OpenML: benefits for scientists. More time, by automating routinizable work: find data and/or code; set up and run large-scale experiments; compare results to the state of the art; log experiment details for future reference. More control: state how others should cite your work; track reuse; share results more easily. More knowledge: more time for actual research; build directly on prior work; easier, large-scale collaboration and interaction.
  16. 16. Plugins: WEKA
  17. 17. Plugins: MOA
  18. 18. Plugins: RapidMiner. (1) Operator to download a task (task-type specific). (2) Sub-workflow that solves the task and generates results. (3) Operator for uploading results.
  19. 19. OpenML: under development. OpenML studies: a collection of datasets, flows, runs, and results in a study; the online counterpart of a paper (with a URL); constructed by simply tagging resources; easily include (build on) data of others. Reputation building: profile page with statistics of activity and impact on OpenML; collaborative leaderboards of the best contributors to solving a task. Teams: add scientists to teams (circles); share resources and results within the team only; make public at any time (e.g. after publication). Meta-learning support: data/flow qualities (easy adding, better overviews); algorithm selection techniques running on the website (vs. humans?).
  20. 20. Join the club
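
The three plugin steps listed on the RapidMiner slide (download a task, solve it in a sub-workflow, upload the results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InMemoryServer`, `Task`, `majority_class_flow`, and `run_task` are hypothetical names invented here, the server is an in-memory stand-in, and none of it is the actual OpenML API or plugin code.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task: labelled training rows plus unlabelled test rows."""
    task_id: int
    train: list  # list of (features, label) pairs
    test: list   # list of feature tuples


@dataclass
class InMemoryServer:
    """Hypothetical stand-in for a sharing server (not the OpenML API)."""
    tasks: dict = field(default_factory=dict)
    runs: list = field(default_factory=list)

    def download_task(self, task_id):
        # Step 1: fetch the task description and its data.
        return self.tasks[task_id]

    def upload_run(self, task_id, flow_name, predictions):
        # Step 3: store the results so that others can reuse them.
        self.runs.append(
            {"task_id": task_id, "flow": flow_name, "predictions": list(predictions)}
        )
        return len(self.runs)  # run id, counted from 1


def majority_class_flow(task):
    """Step 2: a deliberately trivial workflow that always predicts
    the most common training label."""
    counts = Counter(label for _, label in task.train)
    majority_label = counts.most_common(1)[0][0]
    return [majority_label for _ in task.test]


def run_task(server, task_id):
    """Chain the three steps: download, solve, upload."""
    task = server.download_task(task_id)
    predictions = majority_class_flow(task)
    return server.upload_run(task_id, "majority_class", predictions)
```

To try it, register a `Task` in `server.tasks` and call `run_task(server, task_id)`; the stored run then appears in `server.runs`. Keeping the solving step as a plain function mirrors the slide's separation between the task-specific download and upload operators and the interchangeable sub-workflow in the middle.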