SlideShare a Scribd company logo
1 of 63
Download to read offline
OpenML
C O L L A B O R A T I V E M A C H I N E L E A R N I N G
J O A Q U I N VA N S C H O R E N , T U E I N D H O V E N , 2 0 1 5
W H A T I F W E C A N E X P L O R E D A TA
C O L L A B O R AT I V E LY
W H A T I F W E C A N E X P L O R E D A TA
C O L L A B O R AT I V E LY
O N W E B S C A L E
W H A T I F W E C A N E X P L O R E D A TA
C O L L A B O R AT I V E LY
O N W E B S C A L E I N R E A L T I M E
Problem definition,
data collection
Algorithm design,
selection
Implementing,
running code
Data(driven) science ecosystem
Few of us are experts in all crafts at once (we collaborate)
Problem definition,
data collection
Algorithm design,
selection
Gaps in the
ecosystem
Domain experts: learning and trying latest/best
data science techniques takes lots of time
Algorithm experts: learning domain language,
finding latest/relevant data takes lots of time
Unnecessary friction: time lost on tasks that
others do in a fraction, automate altogether
In subfields, up to 85% medical research resources are wasted
- underpowered, irreproducible research
- improper analysis: false associations
- flexibility in design, analysis
- not translatable to applications
Research better if:
- Large-scale, interdisciplinary
- Easy replication
- Open sharing of data,
protocols, software
- Better methods, tests
magnetopause
bow
shock
solar wind
solar wind
magnetosheath
PLASMA PHYSICS
SURFACE WAVES AND
PROCESSES
Nicolas Aunai
LABELLING (CATALOGUING) OF EVENTS STILL DONE MANUALLY
Problem definition,
data collection
Algorithm design,
selection
Gaps in the
ecosystem
Slows down, underpowers research
Ioannidis: 85% medical research ineffective
Similar findings in other fields
Enterprises lack access to expertise, data
Duplicate work, suboptimal solutions
Value from data could be faster, cheaper
Shortage of data science expertise
McKinsey: 190k data scientists needed by 2018
Data science needs to scale: frictionless collaboration across
fields and labs, open data, democratisation, automate drudge work
Problem definition,
data collection
Algorithm design,
selection
Gaps in the
ecosystem
Necessary, but:
People’s attention doesn’t scale.
Time is limited
Talking is slow and labour-intensive
Likely biased: asking N experts gives you
N different answers
Small-scale collaboration
Problem definition,
data collection
Algorithm design,
selection
Gaps in the
ecosystem
Yes, but:
Slow: too many papers, domain-specific jargon,
little cross-domain relevance. Faster to just try
things yourself
Mining the literature? OK, but information in
papers is often imprecise, aggregated, hard to
reproduce
Paper is a 300 year old medium, internet is
much better
Literature
Problem definition,
data collection
Algorithm design,
selection
Gaps in the
ecosystem
Broadcast data so that many minds
analyse the data in different ways
Broadcast code so that many minds can
apply it on their own data
Organize everything on a collaboration
platform
Networked Science
Research different.
Polymaths: Solve math problems
through massive collaboration
Broadcast question, combine
many minds to solve it
Solved hard problems in weeks
Many (joint) publications
Research different.
SDSS: Robotic telescope, data publicly online (SkyServer)
+1 million distinct users
vs. 10.000 astronomers
Broadcast data, many minds ask the right questions
Thousands of papers, citations
Next: SynopticTelescope (15TB data per night, 10 year survey)
Citizen science
discoveries
Research different.
How do you label a million galaxies?
Offer right tools so that anybody can contribute, instantly
Many novel discoveries by scientists and citizens
More? Read ‘Networked science’ by M. Nielsen
Designed serendipity
Create a sandbox for trying out the
latest algorithms on the latest data,
and analyze the results together
What’s hard/surprising for one
scientist is easy for another
Data is the new soil
Algorithms are like seeds that
you sow on them
We all have our own bits of
soil and seed
Remove friction
Contribute in seconds, not days
Use infrastructure, tools that
scientists already use
Organized body of compatible
and actionable scientific data
and tools
Easy, organised communication
Track who did what, give credit
Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text,…
Extensive toolboxes exist to analyse data
• MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML,…
Massive amounts of experiments are run, most of this
information is lost forever (in people’s heads)
• No learning across people, labs, fields
• Start from scratch, slows research and innovation
Let’s connect machine learning environments to network, so
that we can organize, learn from experience
Connect minds, collaborate globally in real time
Train algorithms to automate data science
F R I C T I O N L E S S , N E T W O R K E D M A C H I N E L E A R N I N G
Easy to use: Integrated in ML environments.Automated, reproducible sharing
Organized data: Experiments connected to data, code, people anywhere
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (e-citations, social interaction)
OpenML
Data (ARFF) uploaded or referenced, versioned
analysed, characterized, organised online
analysed, characterized, organised online
Tasks contain data, goals, procedures.
Readable by tools, automates experimentation
Train-test
splits
Evaluation
measure
Results organized online: realtime overview
Results organized online: realtime overview
Train-test
splits
Evaluation
measure
Flows (code) run locally, registered + versioned.
Integrations + APIs (REST, Java, R, Python,…)
Integrations + APIs (REST, Java, R, Python,…)
reproducible, linked to data, flows and authors
Experiments auto-uploaded, evaluated online
Experiments auto-uploaded, evaluated online
Explore
Reuse
Share
• Search by
keywords or
properties
• Filters
• Tagging
Data
• Wiki-like
descriptions
• Analysis and
visualisation of
features
Data
• Wiki-like
descriptions
• Analysis and
visualisation of
features
• Auto-calculation of
large range of
meta-features
• discover similar
datasets
• learn across
datasets
Data
• Example: Classification
on click prediction
dataset, using 10-fold
CV and AUC
• People submit results
(e.g. predictions)
• Server-side evaluation
(many measures)
• All results organized
online, per algorithm,
parameter setting
• Online visualizations:
every dot is a run
plotted by score
Tasks
• Leaderboards visualize progress over time: who delivered breakthroughs
when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others
• Real-time: clear who submitted first, others can improve immediately
Classroom challenges
• All results obtained with same flow organised online
• Results linked to data sets, parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameters)
• Detailed run info
• Author, data, flow,
parameter settings,
result files, …
• Evaluation details
(e.g., results per
sample)
REST API (new) Requires OpenML account
Predictive URLs
R API
Idem for Java, Python
Tutorial: http://www.openml.org/guide
List datasets List tasks
List flows
List runs and results
Share
Explore
Reuse
Website
1) Search,
2) Table view
3) CVS export*
R API
Idem for Java, Python
Tutorial: http://www.openml.org/guide
Download datasets Download tasks
Download flows
R API
Download run
Download run with predictions
Idem for Java, Python
Tutorial: http://www.openml.org/guide
Reuse
Explore
Share
WEKA
OpenML extension
in plugin manager
MOA
RapidMiner: 3 new operators
R API
Run a task
(with mlr):
R API
Run a task
(with mlr):
And upload:
Studies (e-papers)
- Start a question, invite others (or everyone)
- Online counterpart of a paper, linkable
Circles
Create collaborations with trusted researchers
Share results within team prior to publication
Reputation
- Auto-track reuse of shared resources (data, code)
- Prove your activity, reach, impact
Collaboration tools (in progress)
Notebooks
- Easy sharing, collaboration on scripts
Change scale
- Invite anyone / everyone to work with your data
- Global organization: state-of-the-art online
- Discover interesting people, data, code,…
Change speed
- Easy data/code reuse, automated sharing
- Real-time collaboration (seconds, not days)
Automation
- Bots that automatically run algorithms on data
- Discover similar datasets, do basic analysis
- Learn from lots of data: select/optimize algorithms
Data science, differently
AutoML: automating machine learning
Human scientists
meta-data, models,
evaluations
Automated processes
Data cleaning
Algorithm Selection
Parameter Optimization
Workflow construction
Post processing
API
Learn from many datasets and experiments how to do data analysis
Create services
OpenML in drug discovery
ChEMBL database
SMILEs'
Molecular'
proper0es'
(e.g.'MW,'LogP)'
Fingerprints'
(e.g.'FP2,'FP3,'
FP4,'MACSS)'
MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9#
!!
377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !!
341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…!
197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !!
346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0!
! ! ! ! ! ! !.!
! ! ! ! ! ! !: !!
10.000+ regression datasets
1.4M compounds,10k proteins,
12,8M activities
Predict which drugs will inhibit certain proteins (and hence viruses, parasites,…)
2750 targets have >10 compounds, x 4 fingerprints
OpenML in drug discovery
cforest
ctree
earth
fnn
gbm
glmnet
lassolmnnetpcr
plsr
rforest
ridge
rtree
ECFP4_1024* FCFP4_1024* ECFP6_1024+ FCFP6_1024*
0.697* 0.427* 0.725+ 0.627*
|
msdfeat< 0.1729
mutualinfo< 0.7182
skewresp< 0.3315
nentropyfeat< 1.926
netcharge1>=0.9999
netcharge3< 13.85
hydphob21< −0.09666
glmnet
fnn pcr
rforest
ridge rforest
rforest ridge
Predict best algorithm with meta-models
MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9#
!!
377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !!
341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…!
197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !!
346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0!
! ! ! ! ! ! !.!
! ! ! ! ! ! !: !!
Metafeatures:
- simple, statistical, info-theoretic, landmarkers
- target: aliphatic index, hydrophobicity, net
charge, mol. weight, sequence length, …
I. Olier et al. MetaSel@ECMLPKDD 2015
Best algorithms?
Best features?
OpenML in drug discovery
I. Olier et al. MetaSel@ECMLPKDD 2015
New technique: stacked generalisation
Best model by far
OpenML in drug discovery
New technique: multi-target learning
When few drugs are tested on a given target, combine with the
data on ‘related’ targets (same family, similar gene sequence)
OpenML in drug discovery
We just scratched the surface. Data will available on OpenML for
many more studies and ideas, to test new algorithms,…
OpenML Community
Jan-Jun 2015
Used all over the world
400 7-day, 1700 30-day active users, growing
1000s of datasets, flows, 450000+ runs
- Open Source, on GitHub
- Regular workshops, hackathons
Join OpenML
Next workshop:
- Lorentz Center (Leiden),
14-18 March 2016
T H A N K Y O U
Joaquin Vanschoren
Jan van Rijn
Bernd Bischl
Matthias Feurer
Michel Lang
Nenad Tomašev
Giuseppe Casalicchio
Luis Torgo
You?
#OpenMLFarzan Majdani
Jakob Bossek

More Related Content

What's hot

Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2
Lukas Mandrake
 

What's hot (20)

Open Science for sustainability and inclusiveness: the SKA role model
 Open Science for sustainability and inclusiveness: the SKA role model Open Science for sustainability and inclusiveness: the SKA role model
Open Science for sustainability and inclusiveness: the SKA role model
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 
Data Scientist Enablement roadmap 1.0
Data Scientist Enablement roadmap 1.0Data Scientist Enablement roadmap 1.0
Data Scientist Enablement roadmap 1.0
 
Open Data, Big Data and Machine Learning
Open Data, Big Data and Machine LearningOpen Data, Big Data and Machine Learning
Open Data, Big Data and Machine Learning
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining Projects
 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic Perspectives
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2
 
Slides chase 2019 connected health conference - thursday 26 september 2019 -...
Slides chase 2019  connected health conference - thursday 26 september 2019 -...Slides chase 2019  connected health conference - thursday 26 september 2019 -...
Slides chase 2019 connected health conference - thursday 26 september 2019 -...
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introduction
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school students
 
Data Science
Data ScienceData Science
Data Science
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
In search of lost knowledge: joining the dots with Linked Data
In search of lost knowledge: joining the dots with Linked DataIn search of lost knowledge: joining the dots with Linked Data
In search of lost knowledge: joining the dots with Linked Data
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015
 
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the table
 

Viewers also liked

Viewers also liked (19)

OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
 
Open Machine Learning
Open Machine LearningOpen Machine Learning
Open Machine Learning
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Stock Prediction Using NLP and Deep Learning
Stock Prediction Using NLP and Deep Learning Stock Prediction Using NLP and Deep Learning
Stock Prediction Using NLP and Deep Learning
 
인공지능: 변화와 능력개발
인공지능: 변화와 능력개발인공지능: 변화와 능력개발
인공지능: 변화와 능력개발
 
Deep Learning for Stock Prediction
Deep Learning for Stock PredictionDeep Learning for Stock Prediction
Deep Learning for Stock Prediction
 
Prediction of Exchange Rate Using Deep Neural Network
Prediction of Exchange Rate Using Deep Neural Network  Prediction of Exchange Rate Using Deep Neural Network
Prediction of Exchange Rate Using Deep Neural Network
 
RNN, LSTM and Seq-2-Seq Models
RNN, LSTM and Seq-2-Seq ModelsRNN, LSTM and Seq-2-Seq Models
RNN, LSTM and Seq-2-Seq Models
 
인공지능, 기계학습 그리고 딥러닝
인공지능, 기계학습 그리고 딥러닝인공지능, 기계학습 그리고 딥러닝
인공지능, 기계학습 그리고 딥러닝
 
Deeplearning in finance
Deeplearning in financeDeeplearning in finance
Deeplearning in finance
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
 
Continuous Sentiment Intensity Prediction based on Deep Learning
Continuous Sentiment Intensity Prediction based on Deep LearningContinuous Sentiment Intensity Prediction based on Deep Learning
Continuous Sentiment Intensity Prediction based on Deep Learning
 
Electricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksElectricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural Networks
 
쫄지말자딥러닝2 - CNN RNN 포함버전
쫄지말자딥러닝2 - CNN RNN 포함버전쫄지말자딥러닝2 - CNN RNN 포함버전
쫄지말자딥러닝2 - CNN RNN 포함버전
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & Tricks
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 

Similar to OpenML data@Sheffield

Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
javed75
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Carole Goble
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data Scientist
Rohit Dubey
 

Similar to OpenML data@Sheffield (20)

eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Informatics in Context: Managing Sample-to-Answer Multi-Omics Workflows
Informatics in Context: Managing Sample-to-Answer Multi-Omics WorkflowsInformatics in Context: Managing Sample-to-Answer Multi-Omics Workflows
Informatics in Context: Managing Sample-to-Answer Multi-Omics Workflows
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
How To Create an Integrated Lab?
How To Create an Integrated Lab? How To Create an Integrated Lab?
How To Create an Integrated Lab?
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
Introduction
IntroductionIntroduction
Introduction
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science Services
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine Learning
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
OpenML DALI
OpenML DALIOpenML DALI
OpenML DALI
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
LUISS - Deep Learning and data analyses - 09/01/19
LUISS - Deep Learning and data analyses - 09/01/19LUISS - Deep Learning and data analyses - 09/01/19
LUISS - Deep Learning and data analyses - 09/01/19
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data Scientist
 

More from Joaquin Vanschoren

More from Joaquin Vanschoren (14)

Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
OpenML 2019
OpenML 2019OpenML 2019
OpenML 2019
 
Exposé Ontology
Exposé OntologyExposé Ontology
Exposé Ontology
 
Designed Serendipity
Designed SerendipityDesigned Serendipity
Designed Serendipity
 
Learning how to learn
Learning how to learnLearning how to learn
Learning how to learn
 
OpenML NeurIPS2018
OpenML NeurIPS2018OpenML NeurIPS2018
OpenML NeurIPS2018
 
OpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine LearningOpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine Learning
 
Data science
Data scienceData science
Data science
 
OpenML 2014
OpenML 2014OpenML 2014
OpenML 2014
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop sensordata part2
Hadoop sensordata part2Hadoop sensordata part2
Hadoop sensordata part2
 
Hadoop sensordata part1
Hadoop sensordata part1Hadoop sensordata part1
Hadoop sensordata part1
 
Hadoop sensordata part3
Hadoop sensordata part3Hadoop sensordata part3
Hadoop sensordata part3
 

Recently uploaded

Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Silpa
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 

Recently uploaded (20)

Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Genetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsGenetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditions
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 

OpenML data@Sheffield

  • 1. OpenML C O L L A B O R A T I V E M A C H I N E L E A R N I N G J O A Q U I N VA N S C H O R E N , T U E I N D H O V E N , 2 0 1 5
  • 2. W H A T I F W E C A N E X P L O R E D A TA C O L L A B O R AT I V E LY
  • 3. W H A T I F W E C A N E X P L O R E D A TA C O L L A B O R AT I V E LY O N W E B S C A L E
  • 4. W H A T I F W E C A N E X P L O R E D A TA C O L L A B O R AT I V E LY O N W E B S C A L E I N R E A L T I M E
  • 5. Problem definition, data collection Algorithm design, selection Implementing, running code Data(driven) science ecosystem Few of us are experts in all crafts at once (we collaborate)
  • 6. Problem definition, data collection Algorithm design, selection Gaps in the ecosystem Domain experts: learning and trying latest/best data science techniques takes lots of time Algorithm experts: learning domain language, finding latest/relevant data takes lots of time Unnecessary friction: time lost on tasks that others do in a fraction, automate altogether
  • 7.
  • 8. In subfields, up to 85% medical research resources are wasted - underpowered, irreproducible research - improper analysis: false associations - flexibility in design, analysis - not translatable to applications Research better if: - Large-scale, interdisciplinary - Easy replication - Open sharing of data, protocols, software - Better methods, tests
  • 9. magnetopause bow shock solar wind solar wind magnetosheath PLASMA PHYSICS SURFACE WAVES AND PROCESSES Nicolas Aunai
  • 10. LABELLING (CATALOGUING) OF EVENTS STILL DONE MANUALLY
  • 11. Problem definition, data collection Algorithm design, selection Gaps in the ecosystem Slows down, underpowers research Ioannidis: 85% medical research ineffective Similar findings in other fields Enterprises lack access to expertise, data Duplicate work, suboptimal solutions Value from data could be faster, cheaper Shortage of data science expertise McKinsey: 190k data scientists needed by 2018 Data science needs to scale: frictionless collaboration across fields and labs, open data, democratisation, automate drudge work
  • 12. Problem definition, data collection Algorithm design, selection Gaps in the ecosystem Necessary, but: People’s attention doesn’t scale. Time is limited Talking is slow and labour-intensive Likely biased: asking N experts gives you N different answers Small-scale collaboration
  • 13. Problem definition, data collection Algorithm design, selection Gaps in the ecosystem Yes, but: Slow: too many papers, domain-specific jargon, little cross-domain relevance. Faster to just try things yourself Mining the literature? OK, but information in papers is often imprecise, aggregated, hard to reproduce Paper is a 300 year old medium, internet is much better Literature
  • 14. Problem definition, data collection Algorithm design, selection Gaps in the ecosystem Broadcast data so that many minds analyse the data in different ways Broadcast code so that many minds can apply it on their own data Organize everything on a collaboration platform Networked Science
  • 15. Research different. Polymaths: Solve math problems through massive collaboration Broadcast question, combine many minds to solve it Solved hard problems in weeks Many (joint) publications
  • 16. Research different. SDSS: Robotic telescope, data publicly online (SkyServer) +1 million distinct users vs. 10.000 astronomers Broadcast data, many minds ask the right questions Thousands of papers, citations Next: SynopticTelescope (15TB data per night, 10 year survey) Citizen science discoveries
  • 17. Research different. How do you label a million galaxies? Offer right tools so that anybody can contribute, instantly Many novel discoveries by scientists and citizens More? Read ‘Networked science’ by M. Nielsen
  • 18. Designed serendipity Create a sandbox for trying out the latest algorithms on the latest data, and analyze the results together What’s hard/surprising for one scientist is easy for another Data is the new soil Algorithms are like seeds that you sow on them We all have our own bits of soil and seed
  • 19. Remove friction Contribute in seconds, not days Use infrastructure, tools that scientists already use Organized body of compatible and actionable scientific data and tools Easy, organised communication Track who did what, give credit
  • 20. Millions of real, open datasets are generated • Drug activity, gene expressions, astronomical observations, text,… Extensive toolboxes exist to analyse data • MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML,… Massive amounts of experiments are run, most of this information is lost forever (in people’s heads) • No learning across people, labs, fields • Start from scratch, slows research and innovation
  • 21. Let’s connect machine learning environments to network, so that we can organize, learn from experience Connect minds, collaborate globally in real time Train algorithms to automate data science
  • 22. F R I C T I O N L E S S , N E T W O R K E D M A C H I N E L E A R N I N G Easy to use: Integrated in ML environments.Automated, reproducible sharing Organized data: Experiments connected to data, code, people anywhere Easy to contribute: Post single dataset, algorithm, experiment, comment Reward structure*: Build reputation and trust (e-citations, social interaction) OpenML
  • 23. Data (ARFF) uploaded or referenced, versioned analysed, characterized, organised online
  • 25. Tasks contain data, goals, procedures. Readable by tools, automates experimentation Train-test splits Evaluation measure Results organized online: realtime overview
  • 26. Results organized online: realtime overview Train-test splits Evaluation measure
  • 27. Flows (code) run locally, registered + versioned. Integrations + APIs (REST, Java, R, Python,…)
  • 28. Integrations + APIs (REST, Java, R, Python,…)
  • 29. reproducible, linked to data, flows and authors Experiments auto-uploaded, evaluated online
  • 32.
  • 33. • Search by keywords or properties • Filters • Tagging Data
  • 34. • Wiki-like descriptions • Analysis and visualisation of features Data
  • 35. • Wiki-like descriptions • Analysis and visualisation of features • Auto-calculation of large range of meta-features • discover similar datasets • learn across datasets Data
  • 36. • Example: Classification on click prediction dataset, using 10-fold CV and AUC • People submit results (e.g. predictions) • Server-side evaluation (many measures) • All results organized online, per algorithm, parameter setting • Online visualizations: every dot is a run plotted by score Tasks
  • 37. • Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions • Collaborative: all code and data available, learn from others • Real-time: clear who submitted first, others can improve immediately
  • 39. • All results obtained with same flow organised online • Results linked to data sets, parameter settings -> trends/comparisons • Visualisations (dots are models, ranked by score, colored by parameters)
  • 40. • Detailed run info • Author, data, flow, parameter settings, result files, … • Evaluation details (e.g., results per sample)
  • 41. REST API (new) Requires OpenML account Predictive URLs
  • 42. R API Idem for Java, Python Tutorial: http://www.openml.org/guide List datasets List tasks List flows List runs and results
  • 44. Website 1) Search, 2) Table view 3) CVS export*
  • 45. R API Idem for Java, Python Tutorial: http://www.openml.org/guide Download datasets Download tasks Download flows
  • 46. R API Download run Download run with predictions Idem for Java, Python Tutorial: http://www.openml.org/guide
  • 49. MOA
  • 50. RapidMiner: 3 new operators
  • 51. R API Run a task (with mlr):
  • 52. R API Run a task (with mlr): And upload:
  • 53. Studies (e-papers) - Start a question, invite others (or everyone) - Online counterpart of a paper, linkable Circles Create collaborations with trusted researchers Share results within team prior to publication Reputation - Auto-track reuse of shared resources (data, code) - Prove your activity, reach, impact Collaboration tools (in progress) Notebooks - Easy sharing, collaboration on scripts
  • 54. Change scale - Invite anyone / everyone to work with your data - Global organization: state-of-the-art online - Discover interesting people, data, code,… Change speed - Easy data/code reuse, automated sharing - Real-time collaboration (seconds, not days) Automation - Bots that automatically run algorithms on data - Discover similar datasets, do basic analysis - Learn from lots of data: select/optimize algorithms Data science, differently
  • 55. AutoML: automating machine learning Human scientists meta-data, models, evaluations Automated processes Data cleaning Algorithm Selection Parameter Optimization Workflow construction Post processing API Learn from many datasets and experiments how to do data analysis Create services
  • 56. OpenML in drug discovery ChEMBL database SMILEs' Molecular' proper0es' (e.g.'MW,'LogP)' Fingerprints' (e.g.'FP2,'FP3,' FP4,'MACSS)' MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9# !! 377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !! 341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…! 197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !! 346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0! ! ! ! ! ! ! !.! ! ! ! ! ! ! !: !! 10.000+ regression datasets 1.4M compounds,10k proteins, 12,8M activities Predict which drugs will inhibit certain proteins (and hence viruses, parasites,…) 2750 targets have >10 compounds, x 4 fingerprints
  • 57. OpenML in drug discovery cforest ctree earth fnn gbm glmnet lassolmnnetpcr plsr rforest ridge rtree ECFP4_1024* FCFP4_1024* ECFP6_1024+ FCFP6_1024* 0.697* 0.427* 0.725+ 0.627* | msdfeat< 0.1729 mutualinfo< 0.7182 skewresp< 0.3315 nentropyfeat< 1.926 netcharge1>=0.9999 netcharge3< 13.85 hydphob21< −0.09666 glmnet fnn pcr rforest ridge rforest rforest ridge Predict best algorithm with meta-models MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9# !! 377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !! 341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…! 197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !! 346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0! ! ! ! ! ! ! !.! ! ! ! ! ! ! !: !! Metafeatures: - simple, statistical, info-theoretic, landmarkers - target: aliphatic index, hydrophobicity, net charge, mol. weight, sequence length, …
  • 58. I. Olier et al. MetaSel@ECMLPKDD 2015 Best algorithms? Best features? OpenML in drug discovery
  • 59. I. Olier et al. MetaSel@ECMLPKDD 2015 New technique: stacked generalisation Best model by far OpenML in drug discovery
  • 60. New technique: multi-target learning When few drugs are tested on a given target, combine with the data on ‘related’ targets (same family, similar gene sequence) OpenML in drug discovery We just scratched the surface. Data will available on OpenML for many more studies and ideas, to test new algorithms,…
  • 61. OpenML Community Jan-Jun 2015 Used all over the world 400 7-day, 1700 30-day active users, growing 1000s of datasets, flows, 450000+ runs
  • 62. - Open Source, on GitHub - Regular workshops, hackathons Join OpenML Next workshop: - Lorentz Center (Leiden), 14-18 March 2016
  • 63. T H A N K Y O U Joaquin Vanschoren Jan van Rijn Bernd Bischl Matthias Feurer Michel Lang Nenad Tomašev Giuseppe Casalicchio Luis Torgo You? #OpenMLFarzan Majdani Jakob Bossek