SlideShare a Scribd company logo
1 of 61
Download to read offline
1
Université Paris-Saclay / CNRS
BALÁZS KÉGL
MACHINE LEARNING IN SCIENTIFIC WORKFLOWS
2
The concept of interpretation is all here: there is no
experience of truth that is not interpretative. I do not know
anything that does not interest me. If it does interest me, it
is evident that I do not look at it in a noninterested way.
!
Gianni Vattimo: After the Death of God
(talking about Heidegger)
3
We are all problem solvers.
!
But how do we decide what problem to work on?
• ML in sciences	

• challenges	

• reproducible workflow building	

• examples	

• the RAMP tool	

• Towards automatic hypothesis generation
4
OUTLINE
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
Biology & bioinformatics
IBISC/UEvry
LRI/UPSud
Hepatinov
CESP/UPSud-UVSQ-Inserm
IGM-I2BC/UPSud
MIA/Agro
MIAj-MIG/INRA
LMAS/Centrale
Chemistry
EA4041/UPSud
Earth sciences
LATMOS/UVSQ
GEOPS/UPSud
IPSL/UVSQ
LSCE/UVSQ
LMD/Polytechnique
Economy
LM/ENSAE
RITM/UPSud
LFA/ENSAE
Neuroscience
UNICOG/Inserm
U1000/Inserm
NeuroSpin/CEA
Particle physics
astrophysics &
cosmology
LPP/Polytechnique
DMPH/ONERA
CosmoStat/CEA
IAS/UPSud
AIM/CEA
LAL/UPSud
250researchers in 35laboratories
Machine learning
LRI/UPSud
LTCI/Telecom
CMLA/Cachan
LS/ENSAE
LIX/Polytechnique
MIA/Agro
CMA/Polytechnique
LSS/Supélec
CVN/Centrale
LMAS/Centrale
DTIM/ONERA
IBISC/UEvry
Visualization
INRIA
LIMSI
Signal processing
LTCI/Telecom
CMA/Polytechnique
CVN/Centrale
LSS/Supélec
CMLA/Cachan
LIMSI
DTIM/ONERA
Statistics
LMO/UPSud
LS/ENSAE
LSS/Supélec
CMA/Polytechnique
LMAS/Centrale
MIA/AgroParisTech
machine learning
information retrieval
signal processing
data visualization
databases
Domain science
human society
life
brain
earth
universe
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
Domain scientistSoftware engineer
datascience-paris-saclay.fr
LIST/CEA
5
Center for Data Science
Paris-Saclay
A multi-disciplinary initiative, building interfaces, matching
people, helping them launching projects
345 affiliated researchers, 50 laboratories
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 6
Data scientist
Data Value Architect
Domain expertSoftware engineer
Data engineer
Tool building Data domains
Data science
statistics

machine learning
information retrieval
signal processing
data visualization
databases
software engineering

clouds/grids
high-performance

computing
optimization
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
THE DATA SCIENCE ECOSYSTEM
https://medium.com/@balazskegl
• Lack of manpower, misplaced incentives	

• hammers & nails	

• engineering: who deals with production?	

• Lack of collaboration/innovation management tools	

• Bottleneck is sometimes data collection/annotation
• since domain scientists do not know ML, they do not collect the right
data
7
MANAGEMENT AND ORGANIZATIONAL
CHALLENGES
• Workflows and metrics	

• Designing the workflow, interaction with the rest of the pipeline, metrics
is often more important than “hyperopting” the predictor	

• Data generation	

• training is often done on simulations, so we need to design data
generation
• systematic uncertainties	

• the iid oracle is a fairy tale, happening only in machine learning textbooks	

• opportunity for diversifying ML benchmarks
8
TECHNICAL CHALLENGES
B. Kégl Data driven generation
SYSTEMATICS
• Your estimators f(x) should not only be efficient but also
insensitive to variables/parameters you don’t know	

• 90% of the time of an experimental physics PhD is spent on this	

• unsolved and mostly unknown in machine learning	

• “Fair ML”	

• Simulation-based training is biased by design	

• Because if we new all the distributions and parameters, we would not
need to simulate
9
B. Kégl Data driven generation
SYSTEMATICS
10
y
f(x)
simulation
real data
B. Kégl Data driven generation
SYSTEMATICS
11
y
f(x)
simulation
real data
B. Kégl Data driven generation
SYSTEMATICS
• ML is exacerbating the problem because it is so efficient in
optimizing the score	

• unless the score contains systematics, which is hard because systematics is usually
not an event-wise metrics	

• makes it similar to adversarial generative models, see the works of Kyle Cranmer
and Gilles Louppe	

• The classical approach: vary unknown parameters within their
known range, train on one extreme, evaluate on the other	

• exponential explosion which makes computation-heavy deep learning even
heavier	

• doesn’t minimize systematics, but at least measures it
12
• Data collection: replace human or algorithmic collector or
annotator	

• label insect photos, detect Mars craters, detect particle tracks	

• Inference: to invert the generative model	

• “predict” a particle, detect an anomaly, infer a parameter y from observation x	

• Generation, model reduction: to replace expensive simulations	

• “learn” a physics simulation or an agent based micro-economical model with a
neural net	

• Hypothesis generation: to “replace” theoreticians	

• learn, represent structural knowledge and generate novelty in model space,
e.g., molecule generation in drug discovery
13
ML USE CASES IN SCIENCES
https://www.ramp.studio/problems
14
Data collection
15
CLASSIFYING POLLENATING INSECT PHOTOS
• collaboration with ecologists at the Paris Museum of Natural History	

• 400 classes, 150K photos, long tail	

• great benchmark for transfer and few-shot learning
16
DETECTING MARS CRATERS
• collaboration with planetary geologists at Paris-Saclay
• new metrics and workflow	

• great benchmark for detection in satellite images
17
Inference
Center for Data Science
Paris-Saclay
CLASSIFYING VARIABLE STARS
18
• collaboration with astrophysicists at Paris-Saclay
• variable-length functional data
19
PREDICT AUTISM FROM BRAIN SCANS
• collaboration with neurologists of Institut Pasteur
• 2000+ subjects: a major major data collection effort	

• heavy preprocessing and quality control
• 9.5K€ prizes, deadline July 1
20
CLASSIFYING AND REGRESSING ON
MOLECULAR SPECTRA
• collaboration with the pharmacy
department of Georges Pompidou
Hospital
• major data collection effort
• functional data	

• probably the first ever research paper
where the ML workflow
optimization was entirely
crowdsourced
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
CLASSIFYING AND REGRESSING ON
MOLECULAR SPECTRA
21
chemotherapy
drug in
elastic pocket
laser
spectrometer
molecular
spectra
feature
extractor 1
feature
extractor 2
regressor
concentration
classifier
drug type
22
B. Kégl / AppStat@LAL Learning to discover
THE LHC IN GENEVA
23
B. Kégl / AppStat@LAL Learning to discover
THE ATLAS DETECTOR
24
B. Kégl / AppStat@LAL Learning to discover
DATA COLLECTION
• Hundreds of millions of proton-proton collisions per
second	

• Filtered down to 400 events per second	

• still petabytes per year 	

• real-time (budgeted) classification: trigger 	

• a research theme on its own
25
B. Kégl Data driven generation
FEATURE ENGINEERING
• Each collision is an event	

• hundreds of particles: decay products 	

• hundreds of thousands of sensors (but sparse) 	

• for each particle: type, energy, direction is measured 	

• a fixed-length list of ~30-40 extracted features: x	

• e.g., angles, energies, directions, reconstructed mass	

• based on 50 years of accumulated domain knowledge
26
B. Kégl Data driven generation
count (per year)
background
signal
probability
background
signal
CLASSIFICATION FOR DISCOVERY
27
Goal: optimize the expected discovery significance
flux × time
selection
expected background	

say, b = 100 events
total count,	

say, 150 events
excess is s = 50 events
AMS = = 5 sigma
ground expectation µb. When optimizing the design of
gion G = {x : g(x) = s}, we do not know n and µb. As
we estimate the expectation µb by its empirical counter-
+ b to obtain the approximate median significance
⇣
(s + b) ln
⇣
1 +
s
b
⌘
s
⌘
. (14)
x + 1) = x + x2/2 + O(x3), AMS2 can be rewritten as
MS3 ⇥
s
1 + O
✓⇣ s
b
⌘3
◆
,
AMS3 =
s
p
b
. (15)
tically indistinguishable when b s. This approxima-
nding on the chosen search region, be a valid surrogate
selection 	

thresholdselection threshold
28
GENERATION AND MODEL REDUCTION
Why?
• Cost cutting 1: looking at the form of f, I can place my
fixed number of temperature sensors optimally	

• Cost cutting 2: f can replace costly simulation in a
detector optimization loop	

• Cost cutting 3: if I can generate realistic galaxy images, I
can replace costly manual labeling of real photos
29
Generation and model reduction
FORECASTING EL NINO: SPATIOTEMPORAL TIME SERIES
30
…
300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50… …
time series feature
extractor
x
(a fixed length feature vector)
regressor
• collaboration with the Climate Informatics workshop
• also on Arctic sea ice and California rainfall prediction
31
GENERATION AND MODEL REDUCTION
Why?
• Cost cutting 1: looking at the form of f, I can place my
fixed number of temperature sensors optimally	

• Cost cutting 2: f can replace costly simulation in a
detector optimization loop	

• Cost cutting 3: if I can generate realistic galaxy images, I
can replace costly manual labeling of real photos
32
We have “industrialized”
workflow-building and optimization
By separating them
Then optimizing
“graduate student descent”
33
Funded by Université Paris-Saclay and CNRS
34
RAMP.STUDIO	

DATA CHALLENGE WITH CODE SUBMISSION
35
RAMP is a tool for
1. Collaborative prototyping
2. Teaching
3. Data science process management
36
A SINGLE SCRIPT TO DEFINE THE BUNDLE
X ypred score
type
score
cross-validation scheme
dataconnectors
FE CLF
workflow
37
Code submission
1. lets us deliver a working prototype
2. lets the participants collaborate
38
RAMP.STUDIO	

DATA CHALLENGE WITH CODE SUBMISSION
20+ challenges
40+ events
1500+ users
8000+ predictive models
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 39
what you achieved with a well tuned deep net
the diversity gap
the human blender gap
competitive phase
collaborative phase
THE POWER OF THE (COLLABORATING) CROWD	

OPTIMIZING GRADUATE STUDENT DESCENT
Center for Data Science
Paris-Saclay
B. Kégl (CNRS) 40
COMMUNICATION AND REUSE
41
toolkit:
github.com/paris-saclay-cds/ramp-workflow
examples:
github.com/ramp-kits
blogs:
medium.com/@balazskegl
slack:
ramp-studio.slack.com
frontend:
www.ramp.studio
mail:
balazs.kegl@gmail.com
42
Novelty generation with
generative models
43
MACHINE LEARNING IN SCIENCE
Generation/simulation and model reduction
Inference
• We can automate almost everything	

• simulation, inference, experimental design	

• this is not even controversial, just an extension of the current
paradigm	

• But not the hypothesis generation: what model to test?
44
Hypothesis generation is crucial
and, at the same time,
not covered by the scientific method
45
ROBOT SCIENTIST
46
ROBOT SCIENTIST
“Robot scientists are a natural extension of the trend of
increased involvement of automation in science. They can
automatically develop and test hypotheses to explain
observations, run experiments using laboratory robotics,
interpret the results to amend their hypotheses, and
then repeat the cycle, automating high-throughput
hypothesis-led research.”
http://www.cam.ac.uk/research/news/artificially-intelligent-robot-scientist-eve-could-boost-search-for-new-drugs
47
Hypothesis generation is crucial
and, at the same time,
not covered by the scientific method
This ignorance has already bitten us,
but with the appearance of the robot
scientist, it is unavoidable
• Come up with a hypothesis	

• Design an experiment to exclude it	

• Use a statistical test to show that the data is unlikely to
be generated by a world in which the hypothesis does
not hold (“background”)
48
THE SCIENTIFIC METHOD IN THE
TRENCHES
• Rutherford: “If your experiment needs statistics, you ought to
have done a better experiment”	

• Without statistics, science would be over	

• we went out of slam dunk infinite significance (“background free”)
hypotheses	

• phenomena are inherently noisy: nobody has seen or will ever see a
Higgs boson
49
THE SCIENTIFIC METHOD IN THE
TRENCHES
50
51
THE P-VALUE CONTROVERSY
But the main problem is a tautology:
if none of your hypotheses are true,
all your positives are false
But of course: if all your hypotheses are true,
you are not exploring
• Register all experiments and publish negatives	

• Don’t do underpowered experiments	

• Put the significance bar high enough	

• Test only “plausible” hypotheses
52
GUIDELINES
• What is a plausible but non-trivial hypothesis?	

• How to measure plausibility?	

• How to generate them (automatically)?	

• How are hypotheses related to prior/current
knowledge?
53
QUESTIONS
54
GENERATIVE MODELS IN ML
Interesting tools but it’s a whole new ballgame
and paradigmatically we are in the dark
55
Train on digits,
test on letters
56
Train on all music up to the Beatles,
test on Sex Pistols
57
Train on all phones up to 2006,
test on the iPhone
58
Train on all scientific knowledge up to Einstein,
test on relativity theory
59
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. New types?
learning
generation
B. Kégl / AppStat@LAL Learning to discover
Center for Data Science
Paris-Saclay60
SOME WRITTEN STUFF
http://openreview.net/forum?id=ByEPMj5el
https://arxiv.org/abs/1606.04345
https://medium.com/@balazskegl/the-epistemological-challenges-of-automating-a-
b-testing-or-how-will-ai-do-science-b724f8217811#.q041gyvkt
61
Thank you!

More Related Content

What's hot

Crowd Counting from UAVs (ECCV2020)
Crowd Counting from UAVs (ECCV2020)Crowd Counting from UAVs (ECCV2020)
Crowd Counting from UAVs (ECCV2020)Gennaro Vessio
 
Quantum Computing in Cloud
Quantum Computing in CloudQuantum Computing in Cloud
Quantum Computing in CloudAnil Loutombam
 
Cyberinfrastructure And Astronomical Research
Cyberinfrastructure And Astronomical ResearchCyberinfrastructure And Astronomical Research
Cyberinfrastructure And Astronomical ResearchCybera Inc.
 
QuantumComputersPresentation
QuantumComputersPresentationQuantumComputersPresentation
QuantumComputersPresentationVinayak Suresh
 
Jpl pervasive cloud_now_and_future_aws_sep_2010_v2
Jpl pervasive cloud_now_and_future_aws_sep_2010_v2Jpl pervasive cloud_now_and_future_aws_sep_2010_v2
Jpl pervasive cloud_now_and_future_aws_sep_2010_v2ReadMaloney
 
LinkedIn Slide presentation_DSTL_179
LinkedIn Slide presentation_DSTL_179LinkedIn Slide presentation_DSTL_179
LinkedIn Slide presentation_DSTL_179Winson Lam, Ph.D.
 

What's hot (7)

Crowd Counting from UAVs (ECCV2020)
Crowd Counting from UAVs (ECCV2020)Crowd Counting from UAVs (ECCV2020)
Crowd Counting from UAVs (ECCV2020)
 
Quantum Computing in Cloud
Quantum Computing in CloudQuantum Computing in Cloud
Quantum Computing in Cloud
 
Cyberinfrastructure And Astronomical Research
Cyberinfrastructure And Astronomical ResearchCyberinfrastructure And Astronomical Research
Cyberinfrastructure And Astronomical Research
 
QuantumComputersPresentation
QuantumComputersPresentationQuantumComputersPresentation
QuantumComputersPresentation
 
Jpl pervasive cloud_now_and_future_aws_sep_2010_v2
Jpl pervasive cloud_now_and_future_aws_sep_2010_v2Jpl pervasive cloud_now_and_future_aws_sep_2010_v2
Jpl pervasive cloud_now_and_future_aws_sep_2010_v2
 
Quantum computing presentation 200115
Quantum computing presentation 200115Quantum computing presentation 200115
Quantum computing presentation 200115
 
LinkedIn Slide presentation_DSTL_179
LinkedIn Slide presentation_DSTL_179LinkedIn Slide presentation_DSTL_179
LinkedIn Slide presentation_DSTL_179
 

Similar to Machine learning in scientific workflows

Data-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsData-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsBalázs Kégl
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge Proto204
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningKAMAL CHOUDHARY
 
Performance characterization in computer vision
Performance characterization in computer visionPerformance characterization in computer vision
Performance characterization in computer visionpotaters
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML modelaimsnist
 
Mathematical Modeling
Mathematical ModelingMathematical Modeling
Mathematical ModelingMohsen EM
 
The Philosophical Aspects of Data Modelling
The Philosophical Aspects of Data ModellingThe Philosophical Aspects of Data Modelling
The Philosophical Aspects of Data ModellingEmir Muñoz
 
1. Intro DS.pptx
1. Intro DS.pptx1. Intro DS.pptx
1. Intro DS.pptxAnusuya123
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Overview of DuraMat software tool development
Overview of DuraMat software tool developmentOverview of DuraMat software tool development
Overview of DuraMat software tool developmentAnubhav Jain
 
Numerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parametersNumerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parametersMilan Milošević
 
Symbolic Background Knowledge for Machine Learning
Symbolic Background Knowledge for Machine LearningSymbolic Background Knowledge for Machine Learning
Symbolic Background Knowledge for Machine LearningSteffen Staab
 

Similar to Machine learning in scientific workflows (20)

Data-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural netsData-driven hypothesis generation using deep neural nets
Data-driven hypothesis generation using deep neural nets
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Performance characterization in computer vision
Performance characterization in computer visionPerformance characterization in computer vision
Performance characterization in computer vision
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
Mathematical Modeling
Mathematical ModelingMathematical Modeling
Mathematical Modeling
 
The Philosophical Aspects of Data Modelling
The Philosophical Aspects of Data ModellingThe Philosophical Aspects of Data Modelling
The Philosophical Aspects of Data Modelling
 
1. Intro DS.pptx
1. Intro DS.pptx1. Intro DS.pptx
1. Intro DS.pptx
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Collins seattle-2014-final
Collins seattle-2014-finalCollins seattle-2014-final
Collins seattle-2014-final
 
SMS Module-4(theory) ppt.pptx
SMS Module-4(theory) ppt.pptxSMS Module-4(theory) ppt.pptx
SMS Module-4(theory) ppt.pptx
 
CLIM: Transition Workshop - Optimization Methods in Remote Sensing - Jessica...
CLIM: Transition Workshop - Optimization Methods in Remote Sensing  - Jessica...CLIM: Transition Workshop - Optimization Methods in Remote Sensing  - Jessica...
CLIM: Transition Workshop - Optimization Methods in Remote Sensing - Jessica...
 
AI for Science
AI for ScienceAI for Science
AI for Science
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
 
Overview of DuraMat software tool development
Overview of DuraMat software tool developmentOverview of DuraMat software tool development
Overview of DuraMat software tool development
 
Numerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parametersNumerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parameters
 
Symbolic Background Knowledge for Machine Learning
Symbolic Background Knowledge for Machine LearningSymbolic Background Knowledge for Machine Learning
Symbolic Background Knowledge for Machine Learning
 
01-pengantar.pdf
01-pengantar.pdf01-pengantar.pdf
01-pengantar.pdf
 

More from Balázs Kégl

Model-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsModel-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsBalázs Kégl
 
Managing the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopManaging the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopBalázs Kégl
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...Balázs Kégl
 
Build your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBuild your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBalázs Kégl
 
RAMP: Collaborative challenge with code submission
RAMP: Collaborative challenge with code submissionRAMP: Collaborative challenge with code submission
RAMP: Collaborative challenge with code submissionBalázs Kégl
 
Deep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesDeep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesBalázs Kégl
 
What is wrong with data challenges
What is wrong with data challengesWhat is wrong with data challenges
What is wrong with data challengesBalázs Kégl
 
The systemic challenges in data science initiatives (and some solutions)
The systemic challenges in data science initiatives (and some solutions)The systemic challenges in data science initiatives (and some solutions)
The systemic challenges in data science initiatives (and some solutions)Balázs Kégl
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsBalázs Kégl
 
The Paris-Saclay Center for Data Science
The Paris-Saclay Center for Data ScienceThe Paris-Saclay Center for Data Science
The Paris-Saclay Center for Data ScienceBalázs Kégl
 

More from Balázs Kégl (10)

Model-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systemsModel-based reinforcement learning and self-driving engineering systems
Model-based reinforcement learning and self-driving engineering systems
 
Managing the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loopManaging the AI process: putting humans (back) in the loop
Managing the AI process: putting humans (back) in the loop
 
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...   DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
DARMDN: Deep autoregressive mixture density nets for dynamical system mode...
 
Build your own data challenge, or just organize team work
Build your own data challenge, or just organize team workBuild your own data challenge, or just organize team work
Build your own data challenge, or just organize team work
 
RAMP: Collaborative challenge with code submission
RAMP: Collaborative challenge with code submissionRAMP: Collaborative challenge with code submission
RAMP: Collaborative challenge with code submission
 
Deep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesDeep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiatives
 
What is wrong with data challenges
What is wrong with data challengesWhat is wrong with data challenges
What is wrong with data challenges
 
The systemic challenges in data science initiatives (and some solutions)
The systemic challenges in data science initiatives (and some solutions)The systemic challenges in data science initiatives (and some solutions)
The systemic challenges in data science initiatives (and some solutions)
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physics
 
The Paris-Saclay Center for Data Science
The Paris-Saclay Center for Data ScienceThe Paris-Saclay Center for Data Science
The Paris-Saclay Center for Data Science
 

Recently uploaded

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Machine learning in scientific workflows

  • 1. 1 Université Paris-Saclay / CNRS BALÁZS KÉGL MACHINE LEARNING IN SCIENTIFIC WORKFLOWS
  • 2. 2 The concept of interpretation is all here: there is no experience of truth that is not interpretative. I do not know anything that does not interest me. If it does interest me, it is evident that I do not look at it in a noninterested way. ! Gianni Vattimo: After the Death of God (talking about Heidegger)
  • 3. 3 We are all problem solvers. ! But how do we decide what problem to work on?
  • 4. • ML in sciences • challenges • reproducible workflow building • examples • the RAMP tool • Towards automatic hypothesis generation 4 OUTLINE
  • 5. Center for Data Science Paris-Saclay B. Kégl (CNRS) Biology & bioinformatics IBISC/UEvry LRI/UPSud Hepatinov CESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/Agro MIAj-MIG/INRA LMAS/Centrale Chemistry EA4041/UPSud Earth sciences LATMOS/UVSQ GEOPS/UPSud IPSL/UVSQ LSCE/UVSQ LMD/Polytechnique Economy LM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA CosmoStat/CEA IAS/UPSud AIM/CEA LAL/UPSud 250researchers in 35laboratories Machine learning LRI/UPSud LTCI/Telecom CMLA/Cachan LS/ENSAE LIX/Polytechnique MIA/Agro CMA/Polytechnique LSS/Supélec CVN/Centrale LMAS/Centrale DTIM/ONERA IBISC/UEvry Visualization INRIA LIMSI Signal processing LTCI/Telecom CMA/Polytechnique CVN/Centrale LSS/Supélec CMLA/Cachan LIMSI DTIM/ONERA Statistics LMO/UPSud LS/ENSAE LSS/Supélec CMA/Polytechnique LMAS/Centrale MIA/AgroParisTech machine learning information retrieval signal processing data visualization databases Domain science human society life brain earth universe Tool building software engineering clouds/grids high-performance computing optimization Domain scientistSoftware engineer datascience-paris-saclay.fr LIST/CEA 5 Center for Data Science Paris-Saclay A multi-disciplinary initiative, building interfaces, matching people, helping them launching projects 345 affiliated researchers, 50 laboratories
  • 6. Center for Data Science Paris-Saclay B. Kégl (CNRS) 6 Data scientist Data Value Architect Domain expertSoftware engineer Data engineer Tool building Data domains Data science statistics
 machine learning information retrieval signal processing data visualization databases software engineering
 clouds/grids high-performance
 computing optimization energy and physical sciences health and life sciences Earth and environment economy and society brain THE DATA SCIENCE ECOSYSTEM https://medium.com/@balazskegl
  • 7. • Lack of manpower, misplaced incentives • hammers & nails • engineering: who deals with production? • Lack of collaboration/innovation management tools • Bottleneck is sometimes data collection/annotation • since domain scientists do not know ML, they do not collect the right data 7 MANAGEMENT AND ORGANIZATIONAL CHALLENGES
  • 8. • Workflows and metrics • Designing the workflow, interaction with the rest of the pipeline, metrics is often more important than “hyperopting” the predictor • Data generation • training is often done on simulations, so we need to design data generation • systematic uncertainties • the iid oracle is a fairy tale, happening only in machine learning textbooks • opportunity for diversifying ML benchmarks 8 TECHNICAL CHALLENGES
  • 9. B. Kégl Data driven generation SYSTEMATICS • Your estimators f(x) should not only be efficient but also insensitive to variables/parameters you don’t know • 90% of the time of an experimental physics PhD is spent on this • unsolved and mostly unknown in machine learning • “Fair ML” • Simulation-based training is biased by design • Because if we new all the distributions and parameters, we would not need to simulate 9
  • 10. B. Kégl Data driven generation SYSTEMATICS 10 y f(x) simulation real data
  • 11. B. Kégl Data driven generation SYSTEMATICS 11 y f(x) simulation real data
  • 12. B. Kégl Data driven generation SYSTEMATICS • ML is exacerbating the problem because it is so efficient in optimizing the score • unless the score contains systematics, which is hard because systematics is usually not an event-wise metrics • makes it similar to adversarial generative models, see the works of Kyle Cranmer and Gilles Louppe • The classical approach: vary unknown parameters within their known range, train on one extreme, evaluate on the other • exponential explosion which makes computation-heavy deep learning even heavier • doesn’t minimize systematics, but at least measures it 12
  • 13. • Data collection: replace human or algorithmic collector or annotator • label insect photos, detect Mars craters, detect particle tracks • Inference: to invert the generative model • “predict” a particle, detect an anomaly, infer a parameter y from observation x • Generation, model reduction: to replace expensive simulations • “learn” a physics simulation or an agent based micro-economical model with a neural net • Hypothesis generation: to “replace” theoreticians • learn, represent structural knowledge and generate novelty in model space, e.g., molecule generation in drug discovery 13 ML USE CASES IN SCIENCES https://www.ramp.studio/problems
  • 15. 15 CLASSIFYING POLLENATING INSECT PHOTOS • collaboration with ecologists at the Paris Museum of Natural History • 400 classes, 150K photos, long tail • great benchmark for transfer and few-shot learning
  • 16. 16 DETECTING MARS CRATERS • collaboration with planetary geologists at Paris-Saclay • new metrics and workflow • great benchmark for detection in satellite images
  • 18. Center for Data Science Paris-Saclay CLASSIFYING VARIABLE STARS 18 • collaboration with astrophysicists at Paris-Saclay • variable-length functional data
  • 19. 19 PREDICT AUTISM FROM BRAIN SCANS • collaboration with neurologists of Institut Pasteur • 2000+ subjects: a major major data collection effort • heavy preprocessing and quality control • 9.5K€ prizes, deadline July 1
  • 20. 20 CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA • collaboration with the pharmacy department of Georges Pompidou Hospital • major data collection effort • functional data • probably the first ever research paper where the ML workflow optimization was entirely crowdsourced
  • 21. Center for Data Science Paris-Saclay B. Kégl (CNRS) CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA 21 chemotherapy drug in elastic pocket laser spectrometer molecular spectra feature extractor 1 feature extractor 2 regressor concentration classifier drug type
  • 22. 22
  • 23. B. Kégl / AppStat@LAL Learning to discover THE LHC IN GENEVA 23
  • 24. B. Kégl / AppStat@LAL Learning to discover THE ATLAS DETECTOR 24
  • 25. B. Kégl / AppStat@LAL Learning to discover DATA COLLECTION • Hundreds of millions of proton-proton collisions per second • Filtered down to 400 events per second • still petabytes per year • real-time (budgeted) classification: trigger • a research theme on its own 25
  • 26. B. Kégl Data driven generation FEATURE ENGINEERING • Each collision is an event • hundreds of particles: decay products • hundreds of thousands of sensors (but sparse) • for each particle: type, energy, direction is measured • a fixed-length list of ~30-40 extracted features: x • e.g., angles, energies, directions, reconstructed mass • based on 50 years of accumulated domain knowledge 26
  • 27. B. Kégl Data driven generation count (per year) background signal probability background signal CLASSIFICATION FOR DISCOVERY 27 Goal: optimize the expected discovery significance flux × time selection expected background say, b = 100 events total count, say, 150 events excess is s = 50 events AMS = = 5 sigma ground expectation µb. When optimizing the design of gion G = {x : g(x) = s}, we do not know n and µb. As we estimate the expectation µb by its empirical counter- + b to obtain the approximate median significance ⇣ (s + b) ln ⇣ 1 + s b ⌘ s ⌘ . (14) x + 1) = x + x2/2 + O(x3), AMS2 can be rewritten as MS3 ⇥ s 1 + O ✓⇣ s b ⌘3 ◆ , AMS3 = s p b . (15) tically indistinguishable when b s. This approxima- nding on the chosen search region, be a valid surrogate selection thresholdselection threshold
  • 28. 28 GENERATION AND MODEL REDUCTION Why? • Cost cutting 1: looking at the form of f, I can place my fixed number of temperature sensors optimally • Cost cutting 2: f can replace costly simulation in a detector optimization loop • Cost cutting 3: if I can generate realistic galaxy images, I can replace costly manual labeling of real photos
  • 30. FORECASTING EL NINO: SPATIOTEMPORAL TIME SERIES 30 … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50… … time series feature extractor x (a fixed length feature vector) regressor • collaboration with the Climate Informatics workshop • also on Arctic sea ice and California rainfall prediction
  • 31. 31 GENERATION AND MODEL REDUCTION Why? • Cost cutting 1: looking at the form of f, I can place my fixed number of temperature sensors optimally • Cost cutting 2: f can replace costly simulation in a detector optimization loop • Cost cutting 3: if I can generate realistic galaxy images, I can replace costly manual labeling of real photos
  • 32. 32 We have “industrialized” workflow-building and optimization By separating them Then optimizing “graduate student descent”
  • 33. 33 Funded by Université Paris-Saclay and CNRS
  • 35. 35 RAMP is a tool for 1. Collaborative prototyping 2. Teaching 3. Data science process management
  • 36. 36 A SINGLE SCRIPT TO DEFINE THE BUNDLE X ypred score type score cross-validation scheme dataconnectors FE CLF workflow
  • 37. 37 Code submission 1. lets us deliver a working prototype 2. lets the participants collaborate
  • 38. 38 RAMP.STUDIO DATA CHALLENGE WITH CODE SUBMISSION 20+ challenges 40+ events 1500+ users 8000+ predictive models
  • 39. Center for Data Science Paris-Saclay B. Kégl (CNRS) 39 what you achieved with a well tuned deep net the diversity gap the human blender gap competitive phase collaborative phase THE POWER OF THE (COLLABORATING) CROWD OPTIMIZING GRADUATE STUDENT DESCENT
  • 40. Center for Data Science Paris-Saclay B. Kégl (CNRS) 40 COMMUNICATION AND REUSE
  • 43. 43 MACHINE LEARNING IN SCIENCE Generation/simulation and model reduction Inference • We can automate almost everything • simulation, inference, experimental design • this is not even controversial, just an extension of the current paradigm • But not the hypothesis generation: what model to test?
  • 44. 44 Hypothesis generation is crucial and, at the same time, not covered by the scientific method
  • 46. 46 ROBOT SCIENTIST “Robot scientists are a natural extension of the trend of increased involvement of automation in science. They can automatically develop and test hypotheses to explain observations, run experiments using laboratory robotics, interpret the results to amend their hypotheses, and then repeat the cycle, automating high-throughput hypothesis-led research.” http://www.cam.ac.uk/research/news/artificially-intelligent-robot-scientist-eve-could-boost-search-for-new-drugs
  • 47. 47 Hypothesis generation is crucial and, at the same time, not covered by the scientific method This ignorance has already bitten us, but with the appearance of the robot scientist, it is unavoidable
  • 48. • Come up with a hypothesis • Design an experiment to exclude it • Use a statistical test to show that the data is unlikely to be generated by a world in which the hypothesis does not hold (“background”) 48 THE SCIENTIFIC METHOD IN THE TRENCHES
  • 49. • Rutherford: “If your experiment needs statistics, you ought to have done a better experiment” • Without statistics, science would be over • we went out of slam dunk infinite significance (“background free”) hypotheses • phenomena are inherently noisy: nobody has seen or will ever see a Higgs boson 49 THE SCIENTIFIC METHOD IN THE TRENCHES
  • 50. 50
  • 51. 51 THE P-VALUE CONTROVERSY But the main problem is a tautology: if none of your hypotheses are true, all your positives are false But of course: if all your hypotheses are true, you are not exploring
  • 52. • Register all experiments and publish negatives • Don’t do underpowered experiments • Put the significance bar high enough • Test only “plausible” hypotheses 52 GUIDELINES
  • 53. • What is a plausible but non-trivial hypothesis? • How to measure plausibility? • How to generate them (automatically)? • How are hypotheses related to prior/current knowledge? 53 QUESTIONS
  • 54. 54 GENERATIVE MODELS IN ML Interesting tools but it’s a whole new ballgame and paradigmatically we are in the dark
  • 56. 56 Train on all music up to the Beatles, test on Sex Pistols
  • 57. 57 Train on all phones up to 2006, test on the iPhone
  • 58. 58 Train on all scientific knowledge up to Einstein, test on relativity theory
  • 59. 59 CAN WE GENERATE NEW TYPES? Existing objects of known types. generative model New objects. New types? learning generation
  • 60. B. Kégl / AppStat@LAL Learning to discover Center for Data Science Paris-Saclay60 SOME WRITTEN STUFF http://openreview.net/forum?id=ByEPMj5el https://arxiv.org/abs/1606.04345 https://medium.com/@balazskegl/the-epistemological-challenges-of-automating-a- b-testing-or-how-will-ai-do-science-b724f8217811#.q041gyvkt