OpenML 2014
1. Networked Science and Machine Learning
Joaquin Vanschoren (TU/e), 2014
#OpenML
2. 1610
Galileo Galilei discovers Saturn’s rings
‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’
3. How do you convince scientists to
share their discoveries?
4. 17th Century
Journal System
Reputation-Based Economy
5.
6. Networked Science Today
Online scholarly tools
Share data, code impossible to print in journals
Collect, organise, analyse all data
Collaborate in real time with hundreds of scientists
7. Scaling Up Collaboration
• Large-scale collaborations change the way we
make discoveries
• Massively collaborative science
• Open data: mapping and mining
• Citizen science
8. Designed Serendipity
• Many scientists have complementary
expertise
• Right expertise at the right time
• Ideas spark new ideas, questions get
answered, data and tools reused in
unexpected ways
• ‘Happy accidents’ are common in large collaborations
9. Dynamic Distribution of Labor
• Scientists have complementary skills:
generate ideas, experiment, analyse,
interpret
• Right skills, resources, time at the right
time
• Dramatically speeds up progress
• What is impossibly hard for one scientist is routine
for another
10. Scaling Up Collaboration
• Online tools: contribute any amount at any time
• Encourage small contributions
• Subtasks that can be attacked independently
• Rich, structured information commons
• Architecture of attention
• Honor code
11. How do you convince scientists to share
their ideas, data, code?
12. Massively Collaborative Science
Polymaths
13. Polymath Projects
• Designed serendipity
• Broadcast question hoping that many minds may find a
solution
• “find myself having thoughts I would not have had
without some chance remark of another contributor”
• Dynamic division of labor
• Throwing out ideas, criticising, testing ideas,
synthesising, reformulating, coordinating,…
14. Why Share Ideas?
• Authorship: contributions clearly visible,
self-reporting publication
• Visibility: earn respect from notable peers
• Scalability: over many projects, concentrate on
where you have special insight and advantage
• Interaction: share ideas early (before others),
ideas are quickly developed, corrected
16. Sloan Digital Sky Survey
• Designed serendipity
• Broadcast data, believing that many minds will ask
unanticipated questions
• More data than a single person can comprehend:
the challenge is asking the right questions
• Dynamic division of labor
• Collect data, ask questions, mine the data
17. Why Share Data?
• Fame: releasing the data yields more citations:
people more likely to build on it
• Funding: sharing data increases value of
research to community as a whole, increasing
chances of continued funding
19. Galaxy Zoo
• Designed serendipity
• Unexpected observations reported on forum.
• Accidental discovery of new classes of objects: green
pea galaxies, passive red spirals, Hanny’s Voorwerp
• Dynamic division of labor
• Huge task subdivided in many small tasks which can
be easily learned
20. Why Volunteer?
• Discovery: being the first to see a galaxy
• Progress: understanding universe, beating
cancer,…
• Fun: gamification
• Learning: learning more about a science/topic
• Community: meeting like-minded people
21. Machine Learning
• Good candidate for networked science
• Highly complex data, code, workflows, yet most work
published in papers (graphs, pseudocode)
• Experiments are not shared online: impossible to
build on prior work, start each time from scratch
• Low generalisability: studies contradict
• Low reproducibility: code, experiment details missing
22. • Place to share data in fine detail, and organise it to work more
effectively, be more visible, collaborate, tackle hard problems
• Links to data available anywhere online, integrated in popular
machine learning environments (WEKA, R, MOA, RapidMiner)
• Website to find data, code, results; discuss, compare, visualise
39. RapidMiner
1. Operator to download task (task-type specific)
2. Subworkflow that solves the task, generates results
3. Operator for uploading results
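The three steps on this slide (download a task, solve it, upload the results) can be sketched outside any plugin. What follows is a minimal Python sketch under stated assumptions, not the RapidMiner operators and not any OpenML client API: `Task`, `download_task`, `solve_task`, and `upload_results` are hypothetical names, the task is built in memory instead of being fetched from a server, and the "upload" step only packages the predictions into a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a task: data plus a fixed train/test split."""
    task_id: int
    features: list
    labels: list
    train_idx: list
    test_idx: list


def download_task(task_id):
    # Step 1 (assumption): a real client would fetch the task from a server.
    # Here a tiny toy task is built in memory so the sketch runs offline.
    features = [[0.1], [0.2], [0.9], [0.8], [0.15], [0.85]]
    labels = ["a", "a", "b", "b", "a", "b"]
    return Task(task_id, features, labels,
                train_idx=[0, 1, 2, 3], test_idx=[4, 5])


def solve_task(task):
    # Step 2: any learner could go here. This one is a 1-nearest-neighbour
    # classifier on the single feature, using only the training split.
    predictions = {}
    for i in task.test_idx:
        nearest = min(
            task.train_idx,
            key=lambda j: abs(task.features[j][0] - task.features[i][0]),
        )
        predictions[i] = task.labels[nearest]
    return predictions


def upload_results(task, predictions):
    # Step 3 (assumption): package the results. A real client would send
    # them to the server instead of returning them.
    return {"task_id": task.task_id, "predictions": predictions}


if __name__ == "__main__":
    task = download_task(1)
    print(upload_results(task, solve_task(task)))
```

Keeping the split inside the task, rather than letting each solver choose its own, is what would make results from different contributors comparable.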
40. OpenML Connect
• Library for Java
• Package for R
• In progress: Module for Python
• In progress: Command-line tools
42. Designed Serendipity
• ‘Impossible’ questions become possible by reusing
prior experiments
• Answer routine questions in minutes
• Mine all collected results for patterns: meta-learning
• Browse all data for unexpected results
• Reuse code, data in novel ways
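Mining collected results for patterns can be illustrated with a very small example. The sketch below assumes prior runs are available as `(dataset, algorithm, accuracy)` tuples; the names `PRIOR_RUNS`, `best_per_dataset`, and `win_counts` are hypothetical, and every value is a made-up placeholder, not a real measurement. It only shows the simplest kind of question that becomes cheap once results are shared: which algorithm wins on which dataset.

```python
from collections import defaultdict

# Hypothetical prior results: (dataset, algorithm, accuracy).
# All values are made-up placeholders for illustration only.
PRIOR_RUNS = [
    ("dataset_a", "algo_x", 0.80),
    ("dataset_a", "algo_y", 0.90),
    ("dataset_b", "algo_x", 0.70),
    ("dataset_b", "algo_y", 0.60),
    ("dataset_c", "algo_x", 0.50),
    ("dataset_c", "algo_y", 0.75),
]


def best_per_dataset(runs):
    """Return {dataset: (algorithm, accuracy)} for the top-scoring run."""
    best = {}
    for dataset, algorithm, accuracy in runs:
        if dataset not in best or accuracy > best[dataset][1]:
            best[dataset] = (algorithm, accuracy)
    return best


def win_counts(runs):
    """Count on how many datasets each algorithm scores best."""
    wins = defaultdict(int)
    for algorithm, _accuracy in best_per_dataset(runs).values():
        wins[algorithm] += 1
    return dict(wins)


if __name__ == "__main__":
    print(best_per_dataset(PRIOR_RUNS))
    print(win_counts(PRIOR_RUNS))
```

A fuller meta-learning study would relate such outcomes to dataset properties; this sketch stops at counting wins.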
43. Dynamic Division of Labor
• Scientists can focus attention on important problems by
adding data and collaborating with the community
• Large collaborations: OpenML organizes all results to
follow progress
• Benchmark studies: only run algorithms you know well,
reuse all other results
• Students, citizen scientists can contribute data, runs
through plugins
44. Example: Meta-QSAR Project
• Large amounts of QSAR data available
• Not known which machine learning techniques are best
• OpenML used to try many algorithms and learn when
to use which techniques
• Applications in fighting malaria
45. Beyond Journals
• Enriches research output, linked to papers
• Freely accessible
• Organized online
• Low threshold for students
• Continuously updated
• Immensely detailed
• Reproducible
• Stimulates online discussion
• Diminishes publication bias
46. Scalability
• Easy to make small contributions: add data, code, run
experiments using plugins, leave comments
• Split up complex studies: OpenML tasks
• Rich, structured data: all data, flows, runs, users linked.
Keyword search, filters, SQL endpoint
• Data easily filtered: easy to focus on your interests
• Enforce scientific standards: task types, verifiability,
server-side evaluations, clear attribution, honor code
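To make the idea of linked, queryable records concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (tables `dataset`, `flow`, `run`), the helper names `build_demo_db` and `runs_on_dataset`, and all rows are assumptions invented for illustration; they do not describe the actual OpenML database or its SQL endpoint.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical linked schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE flow (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE run (
            id INTEGER PRIMARY KEY,
            dataset_id INTEGER REFERENCES dataset(id),
            flow_id INTEGER REFERENCES flow(id),
            uploader TEXT,
            score REAL
        );
        -- Placeholder rows, not real results.
        INSERT INTO dataset VALUES (1, 'dataset_a'), (2, 'dataset_b');
        INSERT INTO flow VALUES (1, 'flow_x'), (2, 'flow_y');
        INSERT INTO run VALUES
            (1, 1, 1, 'user_1', 0.8),
            (2, 1, 2, 'user_2', 0.9),
            (3, 2, 1, 'user_1', 0.7);
        """
    )
    return conn


def runs_on_dataset(conn, dataset_name):
    """Return (run id, flow name, uploader, score) rows, best score first."""
    return conn.execute(
        """
        SELECT run.id, flow.name, run.uploader, run.score
        FROM run
        JOIN dataset ON dataset.id = run.dataset_id
        JOIN flow ON flow.id = run.flow_id
        WHERE dataset.name = ?
        ORDER BY run.score DESC
        """,
        (dataset_name,),
    ).fetchall()


if __name__ == "__main__":
    print(runs_on_dataset(build_demo_db(), "dataset_a"))
```

Because every run points at its dataset, flow, and uploader, one join answers "who ran what on this data, and how well" with attribution attached.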
48. More Time
• OpenML assists in most routinizable work:
• Find code and data online
• Set up, run & organize experiments
• Relate to state-of-the-art (benchmarks)
• Annotate code and data
• Full log of your research
• Keep control of your data, code, experiments
• Follow experiments on the go (mobile devices)
49. More Knowledge
• Your results linked to everybody else’s
• Larger, more general studies
• Answer more questions
• Mine all combined results
• Find unexpected results
• Interact with others on global scale, get help
• Collaborate with scientists from other fields
51. More Credit
• Citation
• OpenML attributes data, flows, runs, tells others how to cite it
• Easier for others to find
• Altmetrics: track how often your work is reused
• Productivity: contribute efficiently to many studies
• Visibility: collaborate, climb leaderboards, self-publish (tweet)
• Funding: convincing way to make data open
• No publication bias: unexpected results can be shared too
52. Future Work
• OpenML studies: online representation of paper:
data, code, runs, discussions,…
• Social layer: control visibility: public, friends, private
• Collaborative leaderboards: all top-3 contributors
• Discussion forum for unexpected results
• More data types, tasks
53. Spread the Word, Work Openly
#OpenML