OpenML 2014

NETWORKED SCIENCE AND MACHINE LEARNING
JOAQUIN VANSCHOREN (TU/E), 2014
#OpenML

1610
GALILEO GALILEI
DISCOVERS SATURN’S RINGS
‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’

How do you convince scientists to
share their discoveries?

17TH CENTURY
JOURNAL SYSTEM
REPUTATION-BASED
ECONOMY

NETWORKED SCIENCE
TODAY
Online scholarly tools
Share data, code impossible to print in journals
Collect, organise, analyse all data
Collaborate in real time with hundreds of scientists

SCALING UP COLLABORATION
• Large-scale collaborations change the way we
make discoveries
• Massively collaborative science
• Open data: mapping and mining
• Citizen science

DESIGNED SERENDIPITY
• Many scientists have complementary
expertise
• Right expertise at the right time
• Ideas spark new ideas, questions get
answered, data and tools reused in
unexpected ways
• ‘Happy accidents’ common in large collaborations

DYNAMIC DISTRIBUTION OF LABOR
• Scientists have complementary skills:
generate ideas, experiment, analyse,
interpret
• Right skills, resources, time at the right
time
• Dramatically speeds up progress
• What is impossibly hard for one scientist is routine
for another

SCALING UP COLLABORATION
• Online tools: contribute any amount at any time
• Encourage small contributions
• Subtasks that can be attacked independently
• Rich, structured information commons
• Architecture of attention
• Honor code

How do you convince scientists to share
their ideas, data, code?

MASSIVELY COLLABORATIVE
SCIENCE
POLYMATHS

POLYMATH PROJECTS
• Designed serendipity
• Broadcast question hoping that many minds may find a
solution
• “find myself having thoughts I would not have had
without some chance remark of another contributor”
• Dynamic division of labor
• Throwing out ideas, criticising, testing ideas,
synthesising, reformulating, coordinating,…

WHY SHARE IDEAS?
• Authorship: contributions clearly visible,
self-reporting publication
• Visibility: earn respect from notable peers
• Scalability: over many projects, concentrate on
where you have special insight and advantage
• Interaction: share ideas early (before others),
ideas are quickly developed, corrected

SLOAN DIGITAL SKY SURVEY
• Broadcast data, believing that many minds will ask
unanticipated questions
• More data than a single person can comprehend:
challenge is asking the right questions
• Collect data, ask questions, mine the data

WHY SHARE DATA?
• Fame: releasing the data yields more citations:
people more likely to build on it
• Funding: sharing data increases value of
research to community as a whole, increasing
chances of continued funding

CITIZEN SCIENCE
GALAXY ZOO

GALAXY ZOO
• Unexpected observations reported on forum.
• Accidental discovery of new classes of objects: green
pea galaxies, passive red spirals, Hanny’s Voorwerp
• Huge task subdivided in many small tasks which can
be easily learned

WHY VOLUNTEER?
• Discovery: being the first to see a galaxy
• Progress: understanding universe, beating
cancer,…
• Fun: gamification
• Learning: learning more about a science/topic
• Community: meeting like-minded people

MACHINE LEARNING
• Good candidate for networked science
• Highly complex data, code, workflows, yet most work
published in papers (graphs, pseudocode)
• Experiments are not shared online: impossible to
build on prior work, start each time from scratch
• Low generalisability: studies contradict
• Low reproducibility: code, experiment details missing

• Place to share data in fine detail, and organise it to work more
effectively, be more visible, collaborate, tackle hard problems
• Links to data available anywhere online, integrated in popular
machine learning environments (WEKA, R, MOA, RapidMiner)
• Website to find data, code, results; discuss, compare, visualise

RAPIDMINER
1. OPERATOR TO DOWNLOAD TASK (TASK TYPE SPECIFIC)
2. SUBWORKFLOW THAT SOLVES THE TASK, GENERATES RESULTS
3. OPERATOR FOR UPLOADING RESULTS
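
The three steps above (download a task, solve it, upload the results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Task`, `FakeServer`, `download_task`, `upload_run` and `majority_class_flow` are hypothetical names invented here, the server is an in-memory stand-in, and none of it is the OpenML or RapidMiner API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a downloaded task: data plus a fixed split."""
    task_id: int
    rows: list        # list of (features, label) pairs
    train_idx: list   # row indices to train on
    test_idx: list    # row indices to predict


@dataclass
class FakeServer:
    """In-memory placeholder for a results server (an assumption, not a real API)."""
    tasks: dict
    runs: list = field(default_factory=list)

    def download_task(self, task_id):
        # Step 1: fetch the task definition (data and train/test split).
        return self.tasks[task_id]

    def upload_run(self, task_id, flow_name, predictions):
        # Step 3: store the predictions so they can be evaluated and compared.
        self.runs.append(
            {"task_id": task_id, "flow": flow_name, "predictions": predictions}
        )
        return len(self.runs)  # hypothetical run id


def majority_class_flow(task):
    # Step 2: a trivial "subworkflow" that predicts the most common training label.
    labels = [task.rows[i][1] for i in task.train_idx]
    majority = Counter(labels).most_common(1)[0][0]
    return {i: majority for i in task.test_idx}


# Made-up toy data, for illustration only.
toy_rows = [((0,), "a"), ((1,), "a"), ((2,), "b"), ((3,), "a")]
server = FakeServer(tasks={1: Task(1, toy_rows, train_idx=[0, 1, 2], test_idx=[3])})

task = server.download_task(1)
predictions = majority_class_flow(task)
run_id = server.upload_run(task.task_id, "majority_class", predictions)
```

Keeping the split inside the task, rather than inside the flow, is what would let results from different people be compared on equal terms.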

OPENML CONNECT
• Library for Java
• Package for R
• In progress: Module for Python
• In progress: Command-line tools

DESIGNED SERENDIPITY
• ‘Impossible’ questions become possible by reusing
prior experiments
• Answer routine questions in minutes
• Mine all collected results for patterns: meta-learning
• Browse all data for unexpected results
• Reuse code, data in novel ways
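
Mining collected results for patterns can be illustrated with a small sketch. It assumes a flat list of run records carrying a dataset name, one meta-feature (`n_rows`), an algorithm name and an accuracy; all records, field names and the size threshold below are made up for illustration and do not come from OpenML.

```python
from collections import defaultdict

# Made-up run records, for illustration only.
runs = [
    {"dataset": "d1", "n_rows": 100, "algorithm": "tree", "accuracy": 0.80},
    {"dataset": "d1", "n_rows": 100, "algorithm": "knn", "accuracy": 0.85},
    {"dataset": "d2", "n_rows": 50000, "algorithm": "tree", "accuracy": 0.91},
    {"dataset": "d2", "n_rows": 50000, "algorithm": "knn", "accuracy": 0.88},
    {"dataset": "d3", "n_rows": 200, "algorithm": "tree", "accuracy": 0.70},
    {"dataset": "d3", "n_rows": 200, "algorithm": "knn", "accuracy": 0.75},
]


def best_per_dataset(runs):
    """Return the highest-accuracy run for each dataset."""
    best = {}
    for run in runs:
        current = best.get(run["dataset"])
        if current is None or run["accuracy"] > current["accuracy"]:
            best[run["dataset"]] = run
    return best


def wins_by_size(runs, threshold=1000):
    """Count which algorithm wins on 'small' versus 'large' datasets."""
    wins = defaultdict(lambda: defaultdict(int))
    for run in best_per_dataset(runs).values():
        bucket = "small" if run["n_rows"] < threshold else "large"
        wins[bucket][run["algorithm"]] += 1
    return {bucket: dict(counts) for bucket, counts in wins.items()}


summary = wins_by_size(runs)
```

With real shared results, the same grouping could be done over many meta-features at once; a single size threshold is only the simplest possible pattern to look for.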

DYNAMIC DIVISION OF LABOR
• Scientists can focus attention on important problems by
adding data, collaborate with community
• Large collaborations: OpenML organizes all results to
follow progress
• Benchmark studies: only run algorithms you know well,
reuse all other results
• Students, citizen scientists can contribute data, runs
through plugins

EXAMPLE: META-QSAR PROJECT
• Large amounts of QSAR data available
• Not known which machine learning techniques are best
• OpenML used to try many algorithms and learn when
to use which techniques
• Applications in fighting malaria

BEYOND JOURNALS
• Enriches research output, linked to papers
• Freely accessible
• Organized online
• Low threshold for students
• Continuously updated
• Immensely detailed
• Reproducible
• Stimulates online discussion
• Diminishes publication bias

SCALABILITY
• Easy to make small contributions: add data, code, run
experiments using plugins, leave comments
• Split up complex studies: OpenML tasks
• Rich, structured data: all data, flows, runs, users linked.
Keyword search, filters, SQL endpoint
• Data easily filtered: easy to focus on your interests
• Enforce scientific standards: task types, verifiability,
server-side evaluations, clear attribution, honor code
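
What "all data, flows, runs, users linked" plus SQL access might look like can be sketched with an in-memory SQLite database. The schema, table names and rows below are assumptions made for illustration; they are not the actual OpenML database layout.

```python
import sqlite3

# Hypothetical schema: every run links a user, a dataset and a flow.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE flows (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE runs (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        dataset_id INTEGER REFERENCES datasets(id),
        flow_id INTEGER REFERENCES flows(id),
        accuracy REAL
    );
    -- Made-up rows, for illustration only.
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO datasets VALUES (1, 'toy_a'), (2, 'toy_b');
    INSERT INTO flows VALUES (1, 'tree'), (2, 'knn');
    INSERT INTO runs VALUES (1, 1, 1, 1, 0.80), (2, 2, 1, 2, 0.85), (3, 1, 2, 1, 0.90);
    """
)


def runs_on_dataset(conn, dataset_name):
    """List (flow, user, accuracy) for one dataset, best result first."""
    query = """
        SELECT f.name, u.name, r.accuracy
        FROM runs r
        JOIN datasets d ON d.id = r.dataset_id
        JOIN flows f ON f.id = r.flow_id
        JOIN users u ON u.id = r.user_id
        WHERE d.name = ?
        ORDER BY r.accuracy DESC
    """
    return conn.execute(query, (dataset_name,)).fetchall()


rows = runs_on_dataset(conn, "toy_a")
```

Because each run row points at its user, dataset and flow, filtering on any one of them and attributing the result to a contributor is a single join.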

MORE TIME
• OpenML assists in most routinizable work:
• Find code and data online
• Set up, run & organize experiments
• Relate to state-of-the-art (benchmarks)
• Annotate code and data
• Full log of your research
• Keep control of your data, code, experiments
• Follow experiments on the go (mobile devices)

MORE KNOWLEDGE
• Your results linked to everybody else’s
• Larger, more general studies
• Answer more questions
• Mine all combined results
• Find unexpected results
• Interact with others on global scale, get help
• Collaborate with scientists from other fields

MORE CREDIT
• Citation
• OpenML attributes data, flows, runs, tells others how to cite it
• Easier for others to find
• Altmetrics: track how often your work is reused
• Productivity: contribute efficiently to many studies
• Visibility: collaborate, climb leaderboards, self-publish (tweet)
• Funding: convincing way to make data open
• No publication bias: unexpected results

FUTURE WORK
• OpenML studies: online representation of a paper:
data, code, runs, discussions,…
• Social layer: control visibility: public, friends, private
• Collaborative leaderboards: all top-3 contributors
• Discussion forum for unexpected results
• More data types, tasks

SPREAD THE WORD, WORK OPENLY
#OpenML
