B. Kégl Data driven generation
DATA-DRIVEN HYPOTHESIS GENERATION USING DEEP NEURAL NETS
BALÁZS KÉGL
CNRS & Université Paris-Saclay, Center for Data Science
Epistemology of Big Data in Physics
Bremen, March 2017
OUTLINE
• Machine learning in science
  • induction, inference, simulation, generation
• Stretching the scientific method
  • the p-value controversy and the problem of automated hypothesis generation
• Generative models and novelty generation
Machine learning is an engineering toolkit for induction
DATA-DRIVEN INFERENCE
• Classification problem y = f(x)
[Figure: two example inputs x, each mapped by f to a label y: ‘Stomorhina’ and ‘Scaeva’]
DATA-DRIVEN INFERENCE
• Classification problem y = f(x)
• No model to fit, but a large set of (x, y) pairs
  • The source is typically observation + human labeling
  • In science (or industry) it may also be simulation
• And a loss function L(y, ypred)
DATA-DRIVEN INDUCTION
• A learning algorithm takes a set of (x, y) pairs and induces (learns) a function f: x ⟶ y
• Generalization: f must work well on previously unseen (x, y) pairs
• Algorithms need to minimize error (expected loss), which involves avoiding overfitting
  • regularization, smoothing, capacity/complexity control
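The ingredients listed on this slide (a set of (x, y) pairs, a loss function, a learning algorithm, and evaluation on unseen pairs) can be sketched in a few lines. This is a toy illustration, not taken from the talk: the 1-D synthetic data, the threshold learner, and all function names are assumptions chosen for brevity, and it does not cover regularization.

```python
import random

def zero_one_loss(y, y_pred):
    """L(y, y_pred): 1 for a mistake, 0 for a correct prediction."""
    return 0 if y == y_pred else 1

def expected_loss(f, pairs):
    """Average loss of f over a set of (x, y) pairs."""
    return sum(zero_one_loss(y, f(x)) for x, y in pairs) / len(pairs)

def learn_threshold(train_pairs):
    """Induce f: x -> y by picking the threshold with the lowest training loss."""
    candidates = sorted(x for x, _ in train_pairs)
    best_t = min(
        candidates,
        key=lambda t: expected_loss(lambda x: int(x > t), train_pairs),
    )
    return lambda x: int(x > best_t)

def make_pairs(n, rng):
    """Toy data (assumption): class 0 centred at 0, class 1 centred at 2."""
    pairs = []
    for _ in range(n):
        y = rng.randint(0, 1)
        pairs.append((rng.gauss(2.0 * y, 1.0), y))
    return pairs

rng = random.Random(0)
train, test = make_pairs(200, rng), make_pairs(200, rng)
f = learn_threshold(train)
train_error = expected_loss(f, train)
test_error = expected_loss(f, test)  # estimate of the generalization error
print(f"train error {train_error:.3f}, held-out error {test_error:.3f}")
```

The held-out error, not the training error, is the quantity the generalization bullet refers to.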
THE PERCEPTRON (ROSENBLATT 1957)
Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.
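In software, the weight updates that the motors performed correspond to the perceptron learning rule. The following is a minimal sketch of that rule, not a model of Rosenblatt's hardware; the function names and the logical-AND toy problem are assumptions made for illustration.

```python
def perceptron_train(samples, epochs=20, lr=1.0):
    """Perceptron learning rule on (x, y) pairs with y in {-1, +1}."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # converged: every sample is on the correct side
            break
    return w, b

def perceptron_predict(w, b, x):
    """Sign of the linear activation, as a label in {-1, +1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Logical AND: a linearly separable toy problem (assumption, for illustration).
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = perceptron_train(and_data)
predictions = [perceptron_predict(w, b, x) for x, _ in and_data]
print(w, b, predictions)
```

Weights change only on mistakes, so on linearly separable data the loop stops once a full pass produces no update.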
THE PERCEPTRON (ROSENBLATT 1957)
Based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
BACK PROPAGATION
THE AT&T CHECK READER (90S)
2006: A NEW WAVE BEGINS
• NNs are back on the research agenda
2009: IMAGENET
“We believe that a large-scale ontology of images is a critical resource for developing advanced, large-scale content-based image search and image understanding algorithms, as well as for providing critical training and benchmarking data for such algorithms.” (Fei-Fei Li et al., CVPR 2009)
2009: IMAGENET
• 80K hierarchical categories
• 80M images of size >100x100
• labeled by 50K Amazon Mechanical Turk workers
TECHNIQUES & TRICKS
• Krizhevsky, Sutskever, Hinton (2012): 1.2M images, 60M parameters, 6 days training on two GPUs
IMAGENET COMPETITIONS
TODAY: EASY-TO-USE LIBRARIES
• Theano
• TensorFlow
• Keras
• Caffe
• Torch
TODAY: HARDWARE
Google TPU
COMMERCIAL APPLICATIONS
GOOGLE IMAGE SEARCH
FACE RECOGNITION/DETECTION
A $6B MARKET IN 2020
SELF-DRIVING CARS
MACHINE LEARNING IN SCIENCE
Inference: inverting the generative chain
exciting engineering feats but epistemologically boring
RAPID ANALYTICS AND MODEL PROTOTYPING
Classifying variable stars
VARIABLE STARS
accuracy improvement: 89% to 96%
THE ATLAS DETECTOR
FEATURE ENGINEERING
• Each collision is an event
  • hundreds of particles: decay products
  • hundreds of thousands of sensors (but sparse)
  • for each particle: type, energy, and direction are measured
• a fixed-length list of ~30-40 extracted features: x
  • e.g., angles, energies, directions, reconstructed mass
  • based on 50 years of accumulated domain knowledge
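As a sketch of what such feature extraction looks like, the code below maps a variable-length list of particles to a fixed-length vector x that includes an angle and a reconstructed mass. It is illustrative only: the dictionary field names, the choice of five features, and the massless-particle approximation in the invariant-mass formula are assumptions, not the ATLAS feature set.

```python
import math

def invariant_mass(p1, p2):
    """Invariant mass of two (approximately massless) particles.

    Each particle is a dict with transverse momentum 'pt', pseudorapidity
    'eta' and azimuthal angle 'phi' (hypothetical field names).
    """
    d_eta = p1["eta"] - p2["eta"]
    d_phi = p1["phi"] - p2["phi"]
    m2 = 2.0 * p1["pt"] * p2["pt"] * (math.cosh(d_eta) - math.cos(d_phi))
    return math.sqrt(max(m2, 0.0))

def extract_features(particles):
    """Map a variable-length list of particles to a fixed-length vector x."""
    leading = sorted(particles, key=lambda p: p["pt"], reverse=True)[:2]
    d_phi = abs(leading[0]["phi"] - leading[1]["phi"])
    d_phi = min(d_phi, 2.0 * math.pi - d_phi)  # wrap the angle into [0, pi]
    return [
        leading[0]["pt"],                        # energy-like features
        leading[1]["pt"],
        d_phi,                                   # angle between the two leaders
        invariant_mass(leading[0], leading[1]),  # reconstructed mass
        float(len(particles)),                   # multiplicity
    ]

event = [
    {"pt": 50.0, "eta": 0.0, "phi": 0.0},
    {"pt": 50.0, "eta": 0.0, "phi": math.pi},
    {"pt": 5.0, "eta": 1.2, "phi": 2.0},
]
x = extract_features(event)
print(x)
```

Whatever the number of particles in the event, the classifier always receives a vector of the same length.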
CLASSIFIER
• Training on simulated data
• Signal (Higgs) vs background (everything else)
• The goal is to find a good discriminator: maximizing the power (sensitivity, expected significance) of the test
CLASSIFICATION FOR DISCOVERY
Goal: optimize the expected discovery significance
[Figure: signal and background curves, shown as probability and, after scaling by flux × time, as count (per year); a selection threshold defines the selection]
• expected background in the selection: say, b = 100 events
• total count: say, 150 events
• excess is s = 50 events
• AMS = s/√b = 50/√100 = 5 sigma
[Paper excerpt, clipped on the slide] … we estimate the expectation µ_b by its empirical counterpart … to obtain the approximate median significance
$$\mathrm{AMS}_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)} \qquad (14)$$
Expanding the logarithm for small s/b shows that AMS_2 reduces at leading order to
$$\mathrm{AMS}_3 = \frac{s}{\sqrt{b}} \qquad (15)$$
The two are practically indistinguishable when b ≫ s. This approximation … depending on the chosen search region, be a valid surrogate …
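The slide's arithmetic and the role of the selection threshold can be checked numerically. The sketch below implements the two significance formulas, Eqs. (14) and (15), and scans thresholds on a toy set of scored events; the event scores, labels, weights, and the `best_threshold` helper are hypothetical and only serve the illustration.

```python
import math

def ams2(s, b):
    """Approximate median significance, Eq. (14)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def ams3(s, b):
    """Simplified form s / sqrt(b), Eq. (15), close to Eq. (14) when b >> s."""
    return s / math.sqrt(b)

def best_threshold(scores, labels, weights):
    """Scan selection thresholds; return (threshold, AMS2) with the highest AMS2.

    labels: 1 for signal, 0 for background; weights: expected event counts.
    Thresholds that leave no background are skipped to avoid dividing by zero.
    """
    best = None
    for t in sorted(set(scores)):
        selected = [(y, w) for sc, y, w in zip(scores, labels, weights) if sc >= t]
        s = sum(w for y, w in selected if y == 1)
        b = sum(w for y, w in selected if y == 0)
        if b <= 0.0:
            continue
        value = ams2(s, b)
        if best is None or value > best[1]:
            best = (t, value)
    return best

# The slide's example: b = 100 expected background events, s = 50 excess events.
print(ams3(50.0, 100.0))            # 5.0
print(round(ams2(50.0, 100.0), 2))  # 4.65

# Toy scored events (hypothetical numbers).
scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
weights = [60.0, 30.0, 10.0, 10.0, 20.0, 20.0]
best = best_threshold(scores, labels, weights)
print(best)
```

With s = 50 and b = 100 the condition b ≫ s does not really hold, which is why the two formulas differ noticeably (5.0 versus about 4.65) in this example.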
MACHINE LEARNING IN SCIENCE
Simulation / generation / forecasting: replacing the generative chain
epistemologically more interesting
FORECASTING EL NIÑO SIX MONTHS AHEAD
[Figure: a temperature series (… 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 …) is passed to a feature extractor, which produces x (a fixed-length feature vector), which is passed to a regressor]
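The feature extractor plus regressor pipeline in the figure can be sketched as follows. Everything concrete here is an assumption made for illustration: the three hand-made features, the k-nearest-neighbour regressor, and the synthetic sinusoidal "temperature" series stand in for the real data and models, which the slide does not specify.

```python
import math

def extract_features(window):
    """Fixed-length feature vector x from a window of temperatures."""
    return [window[-1], sum(window) / len(window), window[-1] - window[0]]

def make_training_set(series, window_size, horizon):
    """Build (x, y) pairs: features of a window, value `horizon` steps later."""
    pairs = []
    for end in range(window_size, len(series) - horizon + 1):
        x = extract_features(series[end - window_size:end])
        y = series[end - 1 + horizon]
        pairs.append((x, y))
    return pairs

def knn_regress(pairs, x, k=3):
    """Predict y as the mean target of the k nearest training vectors."""
    nearest = sorted(pairs, key=lambda p: math.dist(p[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Synthetic monthly series with a 12-month cycle around 300 K (assumption).
series = [300.0 + math.sin(2.0 * math.pi * t / 12.0) for t in range(120)]

pairs = make_training_set(series[:100], window_size=8, horizon=6)
query = extract_features(series[100:108])  # most recent 8 months
forecast = knn_regress(pairs, query)       # estimate of series[113]
print(round(forecast, 3), round(series[113], 3))
```

On this perfectly periodic toy series the forecast is essentially exact; real climate data would of course be far noisier.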
MACHINE LEARNING IN SCIENCE
Simulation / generation / forecasting: why?
• Cost cutting 1: looking at the form of f, I can place my fixed number of temperature sensors optimally
• Cost cutting 2: computing f in real time may be much cheaper/faster than running the full simulation
• Cost cutting 3: if I can generate realistic galaxy images, I can replace costly manual labeling of real photos
MACHINE LEARNING IN SCIENCE
Simulation / generation / forecasting
Inference
• We can automate almost everything
  • simulation, inference, experimental design
  • this is not even controversial, just an extension of the current paradigm
• But not the hypothesis generation: what model to test?
Hypothesis generation is crucial and, at the same time, not covered by the scientific method
ROBOT SCIENTIST
“Robot scientists are a natural extension of the trend of increased involvement of automation in science. They can automatically develop and test hypotheses to explain observations, run experiments using laboratory robotics, interpret the results to amend their hypotheses, and then repeat the cycle, automating high-throughput hypothesis-led research.”
http://www.cam.ac.uk/research/news/artificially-intelligent-robot-scientist-eve-could-boost-search-for-new-drugs
Hypothesis generation is crucial and, at the same time, not covered by the scientific method
This ignorance has already bitten us, but with the appearance of the robot scientist, it is unavoidable
THE SCIENTIFIC METHOD IN THE TRENCHES
• Come up with a hypothesis
• Design an experiment to exclude it
• Use a statistical test to show that the data is unlikely to be generated by a world in which the hypothesis does not hold (“background”)
THE SCIENTIFIC METHOD IN THE TRENCHES
• Rutherford: “If your experiment needs statistics, you ought to have done a better experiment”
• Without statistics, science would be over
  • we have run out of slam-dunk, infinite-significance (“background-free”) hypotheses
  • phenomena are inherently noisy: nobody has seen or will ever see a Higgs boson
THE P-VALUE CONTROVERSY
“My position when I wrote “Thinking, Fast and Slow” was
that if a large body of evidence published in reputable
journals supports an initially implausible conclusion, then
scientific norms require us to believe that conclusion.
Implausibility is not sufficient to justify disbelief, and belief in
well-supported scientific conclusions is not optional. This
position still seems reasonable to me — it is why I think
people should believe in climate change. But the
argument only holds when all relevant results
are published.”
Daniel Kahneman
2002 Nobel Memorial Prize in Economic Sciences
THE P-VALUE CONTROVERSY
But the main problem is a tautology: if none of your hypotheses are true, all your positives are false
But of course: if all your hypotheses are true, you are not exploring
GUIDELINES
• Register all experiments and publish negatives
• Don’t do underpowered experiments
• Put the significance bar high enough
• Test only “plausible” hypotheses
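The tautology and the guidelines can be made quantitative with a short calculation. The sketch below computes the expected fraction of significant results that are false positives from three quantities: the fraction of tested hypotheses that are true, the significance level, and the power. It is a back-of-the-envelope model, not from the talk; it assumes independent tests and that every result is published, and the function and parameter names are hypothetical.

```python
def false_positive_fraction(prior_true, alpha, power):
    """Expected fraction of significant results that are false positives.

    prior_true: fraction of tested hypotheses that are actually true
    alpha:      significance level (false positive rate of the test)
    power:      probability of detecting an effect that is really there
    """
    false_pos = (1.0 - prior_true) * alpha
    true_pos = prior_true * power
    if false_pos + true_pos == 0.0:
        raise ValueError("no positives expected with these settings")
    return false_pos / (false_pos + true_pos)

# The tautology: if no hypothesis is true, every positive is false.
print(false_positive_fraction(0.0, alpha=0.05, power=0.8))    # 1.0

# One hypothesis in ten is true, conventional threshold, good power.
print(false_positive_fraction(0.1, alpha=0.05, power=0.8))    # about 0.36

# Underpowered experiments make it worse ...
print(false_positive_fraction(0.1, alpha=0.05, power=0.2))    # about 0.69

# ... while a higher significance bar or more plausible hypotheses help.
print(false_positive_fraction(0.1, alpha=0.005, power=0.8))   # about 0.05
print(false_positive_fraction(0.5, alpha=0.05, power=0.8))    # about 0.06
```

Each guideline acts on one argument: power, the significance bar, or the plausibility of what is tested. Publishing negatives is what makes the published record match this model at all.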
QUESTIONS
• What is a plausible but non-trivial hypothesis?
• How to measure plausibility?
• How to generate them (automatically)?
• How are hypotheses related to prior/current knowledge?
GENERATIVE MODELS IN ML
Interesting tools but it’s a whole new ballgame and paradigmatically we are in the dark
GENERATIVE MODELS IN ML
• Feed a set of known objects to an algorithm
• Ask it to generate similar objects
• But different from the training set
GENERATIVE MODELS IN ML
• The current likelihood-based paradigm is fundamentally flawed
• The trivial sampling of the training set needs to be excluded by heuristics
• The value of novelty is not even raised as a question
Our goal was to generate new objects from new types, grounded in the knowledge learned from examples
“Plausible” characters that could be part of an alphabet in another universe
CAN WE GENERATE NEW TYPES?
[Diagram: existing objects of known types → (learning) → generative model → (generation) → new objects. New types?]
THE UNKNOWN HAS A STRUCTURE
Selected semi-manually: t-SNE + clustering
HOW TO EVALUATE THE CAPACITY OF THESE MODELS TO GENERATE NEW TYPES?
Idea: validate on hold-out types
Train on known types, test on types known to the experimenter but unknown to the model
Train on digits, test on letters
Train on all music up to the Beatles, test on Sex Pistols
Train on all phones up to 2006, test on the iPhone
Train on all scientific knowledge up to Einstein, test on relativity theory
CAN WE GENERATE NEW TYPES?
[Diagram: existing objects of known types → (learning) → generative model → (generation) → new objects. Are some of those letters?]
CAN WE GENERATE NEW TYPES?
Are some of those letters? This we know how to do.
THE EVALUATOR MODEL
Train a good discriminator on digits + letters: 10 + 26 = 36 classes
[Diagram: digits + letters → (learning) → discriminator]
COUNT THE NUMBER OF LETTERS
Use the discriminator to count letters
[Figure: examples with low and high letter count]
OBJECTNESS = POSTERIOR ENTROPY
Use objectness to discard noise
[Figure: examples with high and low objectness]
COMBINING THE TWO OBJECTIVES
[Figure: examples arranged by objectness (low to high) and letter count (low to high)]
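The two evaluation scores can be sketched from the discriminator's 36-class posterior alone. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes classes 0-9 are digits and 10-35 are letters, treats low posterior entropy as high objectness, and uses a hypothetical entropy cut-off; the posteriors are hand-made rather than produced by a trained discriminator.

```python
import math

N_DIGITS, N_LETTERS = 10, 26  # 36 classes; assumed layout: 0-9 digits, 10-35 letters

def is_letter(posterior):
    """True if the most probable of the 36 classes is a letter."""
    best = max(range(len(posterior)), key=lambda k: posterior[k])
    return best >= N_DIGITS

def entropy(posterior):
    """Shannon entropy (in nats) of the discriminator's posterior."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def evaluate(posteriors, max_entropy):
    """Discard high-entropy samples as noise, then count the letters.

    Returns (letter_count, number_of_samples_kept).
    """
    kept = [p for p in posteriors if entropy(p) <= max_entropy]
    letter_count = sum(is_letter(p) for p in kept)
    return letter_count, len(kept)

def peaked(k, p=0.97, n=N_DIGITS + N_LETTERS):
    """Hand-made posterior with probability p on class k (for the demo only)."""
    rest = (1.0 - p) / (n - 1)
    return [p if i == k else rest for i in range(n)]

uniform = [1.0 / 36] * 36                      # looks like nothing in particular
samples = [peaked(3), peaked(12), uniform]     # a digit, a letter, noise
result = evaluate(samples, max_entropy=1.0)    # hypothetical cut-off
print(result)                                  # (1, 2)
print(round(entropy(uniform), 3))              # 3.584, the maximum for 36 classes
```

A generative model trained on digits scores well when many of its samples pass the entropy filter and are classified as letters, that is, as types it was never shown.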
PANGRAMS
hand-picked letters
top models found automatically
SOME WRITTEN STUFF
http://openreview.net/forum?id=ByEPMj5el
https://arxiv.org/abs/1606.04345
https://medium.com/@balazskegl/the-epistemological-challenges-of-automating-a-b-testing-or-how-will-ai-do-science-b724f8217811#.q041gyvkt

Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Data-driven hypothesis generation using deep neural nets

  • 1. B. Kégl Data driven generation 1 CNRS & Université Paris-Saclay Center for Data Science BALÁZS KÉGL DATA-DRIVEN HYPOTHESIS GENERATION USING DEEP NEURAL NETS Epistemology of Big Data in Physics Bremen, March 2017
  • 2. OUTLINE • Machine learning in science • induction, inference, simulation, generation • Stretching the scientific method • the p-value controversy and the problem of automated hypothesis generation • Generative models and novelty generation
  • 3. Machine learning is an engineering toolkit for induction
  • 4. DATA-DRIVEN INFERENCE • Classification problem y = f(x) [Figure: two input images x, each mapped by f to a label y: ‘Stomorhina’ and ‘Scaeva’]
  • 5. DATA-DRIVEN INFERENCE • Classification problem y = f(x) • No model to fit, but a large set of (x, y) pairs • The source is typically observation + human labeling • In science (or industry) it may also be simulation • And a loss function L(y, ypred)
  • 6. DATA-DRIVEN INDUCTION • A learning algorithm takes a set of (x, y) pairs and induces (learns) a function f: x ⟶ y • Generalization: f must work well on previously unseen (x, y) pairs • Algorithms need to minimize error (expected loss), which involves avoiding overfitting • regularization, smoothing, capacity/complexity control
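
As a minimal sketch of the induction setup on slide 6, the toy example below learns f from noisy (x, y) pairs with a k-nearest-neighbour rule and compares the training error with the error on previously unseen pairs. The data generator, the 20% label noise, the choice of k-NN and the values of k are all assumptions made for illustration; none of them come from the talk. Here k plays the role of capacity control: k = 1 memorizes the training set, while larger k smooths the decision.

```python
import random


def make_data(n, rng, noise=0.2):
    """Hypothetical toy problem: y = 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k should be odd)."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(y for _, y in neighbours)
    return int(2 * votes > k)


def zero_one_loss(y, y_pred):
    """Loss function L(y, y_pred): 1 for a mistake, 0 otherwise."""
    return int(y != y_pred)


def mean_loss(train, data, k):
    """Average loss of the k-NN rule induced from `train`, evaluated on `data`."""
    return sum(zero_one_loss(y, knn_predict(train, x, k)) for x, y in data) / len(data)


rng = random.Random(0)
train = make_data(200, rng)
held_out = make_data(2000, rng)  # previously unseen (x, y) pairs

results = {}
for k in (1, 15, 51):
    results[k] = (mean_loss(train, train, k), mean_loss(train, held_out, k))
    print(f"k={k:2d}  train error={results[k][0]:.3f}  held-out error={results[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is zero by construction; only the held-out column says anything about generalization.
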
  • 7. THE PERCEPTRON (ROSENBLATT 1957) Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.
  • 8. THE PERCEPTRON (ROSENBLATT 1957) Based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
  • 9. BACK PROPAGATION
  • 10. THE AT&T CHECK READER (90S)
  • 11. THE AT&T CHECK READER (90S)
  • 12. 2006: A NEW WAVE BEGINS • NNs are back on the research agenda
  • 13. 2009: IMAGENET “We believe that a large-scale ontology of images is a critical resource for developing advanced, large-scale content-based image search and image understanding algorithms, as well as for providing critical training and benchmarking data for such algorithms.” (Fei-Fei Li et al., CVPR09)
  • 14. 2009: IMAGENET • 80K hierarchical categories • 80M images of size >100x100 • labeled by 50K Amazon Mechanical Turk workers
  • 15. TECHNIQUES & TRICKS • Krizhevsky, Sutskever, Hinton (2012): 1.2M images, 60M parameters, 6 days training on two GPUs
  • 16. IMAGENET COMPETITIONS
  • 17. TODAY: EASY-TO-USE LIBRARIES • Theano • TensorFlow • Keras • Caffe • Torch
  • 18. TODAY: HARDWARE Google TPU
  • 19. COMMERCIAL APPLICATIONS
  • 20. GOOGLE IMAGE SEARCH
  • 21. FACE RECOGNITION/DETECTION A 6B$ MARKET IN 2020
  • 22. SELF-DRIVING CARS
  • 24. MACHINE LEARNING IN SCIENCE Inference: inverting the generative chain; exciting engineering feats but epistemologically boring
  • 25. RAPID ANALYTICS AND MODEL PROTOTYPING Classifying variable stars
  • 26. VARIABLE STARS
  • 27. VARIABLE STARS accuracy improvement: 89% to 96%
  • 28. THE ATLAS DETECTOR
  • 29. FEATURE ENGINEERING • Each collision is an event • hundreds of particles: decay products • hundreds of thousands of sensors (but sparse) • for each particle: type, energy, direction are measured • a fixed-length list of ~30-40 extracted features: x • e.g., angles, energies, directions, reconstructed mass • based on 50 years of accumulated domain knowledge
  • 30. CLASSIFIER • Training on simulated data • Signal (Higgs) vs background (everything else) • The goal is to find a good discriminator: maximizing the power (sensitivity, expected significance) of the test
  • 31. CLASSIFICATION FOR DISCOVERY Goal: optimize the expected discovery significance [Figure: count (per year) and probability of background and signal, plotted against the selection threshold] flux × time, selection: expected background, say, b = 100 events; total count, say, 150 events; excess is s = 50 events; AMS = s/√b = 50/√100 = 5 sigma [Cropped paper excerpt] […] the approximate median significance $\mathrm{AMS}_2 = \sqrt{2\left((s+b)\ln\left(1+\frac{s}{b}\right)-s\right)}$ (14). Since $(x+1)\ln(x+1) = x + x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ can be rewritten as $\mathrm{AMS}_2 = \mathrm{AMS}_3 \times \sqrt{1 + O\left(\left(\frac{s}{b}\right)^3\right)}$, where $\mathrm{AMS}_3 = \frac{s}{\sqrt{b}}$ (15). […]tically indistinguishable when $b \gg s$. […]
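
The worked numbers on slide 31 (b = 100, s = 50, 5 sigma) can be checked against the two approximate median significance formulas that the slide excerpts. The sketch below is only an illustration: the function names are hypothetical, and AMS2 is written in its usual form sqrt(2((s + b) ln(1 + s/b) - s)), which is an assumption insofar as the excerpt on the slide is cropped.

```python
import math


def ams2(s, b):
    """Approximate median significance, Eq. (14) on the slide (assumed standard form)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


def ams3(s, b):
    """Simplified significance s / sqrt(b), Eq. (15) on the slide."""
    return s / math.sqrt(b)


# Slide 31: expected background b = 100 events, excess s = 50 events.
s, b = 50.0, 100.0
print(f"AMS3 = {ams3(s, b):.2f} sigma")
print(f"AMS2 = {ams2(s, b):.2f} sigma")

# The two criteria converge when the background dominates (b >> s).
for s_small, b_large in ((5.0, 1e3), (5.0, 1e4), (5.0, 1e5)):
    gap = ams3(s_small, b_large) - ams2(s_small, b_large)
    print(f"s={s_small:g}, b={b_large:g}: AMS3 - AMS2 = {gap:.2e}")
```

For the slide's counts, s/√b gives exactly 5, while AMS2 comes out somewhat lower (about 4.65), since s is not small compared with b in that example.
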
  • 32. MACHINE LEARNING IN SCIENCE Inference: inverting the generative chain; exciting engineering feats but epistemologically boring
  • 33. MACHINE LEARNING IN SCIENCE Simulation / generation / forecasting: replacing the generative chain; epistemologically more interesting
  • 34. FORECASTING EL NINO SIX MONTHS AHEAD … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 … feature extractor x (a fixed length feature vector) regressor
  • 35. MACHINE LEARNING IN SCIENCE Simulation / generation / forecasting: why? • Cost cutting 1: looking at the form of f, I can place my fixed number of temperature sensors optimally • Cost cutting 2: computing f in real time may be much cheaper/faster than running the full simulation • Cost cutting 3: if I can generate realistic galaxy images, I can replace costly manual labeling of real photos
  • 36. MACHINE LEARNING IN SCIENCE Simulation / generation / forecasting; Inference • We can automate almost everything • simulation, inference, experimental design • this is not even controversial, just an extension of the current paradigm • But not the hypothesis generation: what model to test?
  • 37. Hypothesis generation is crucial and, at the same time, not covered by the scientific method
  • 38. ROBOT SCIENTIST
  • 39. ROBOT SCIENTIST “Robot scientists are a natural extension of the trend of increased involvement of automation in science. They can automatically develop and test hypotheses to explain observations, run experiments using laboratory robotics, interpret the results to amend their hypotheses, and then repeat the cycle, automating high-throughput hypothesis-led research.” http://www.cam.ac.uk/research/news/artificially-intelligent-robot-scientist-eve-could-boost-search-for-new-drugs
  • 40. Hypothesis generation is crucial and, at the same time, not covered by the scientific method. This ignorance has already bitten us, but with the appearance of the robot scientist, it is unavoidable
  • 41. THE SCIENTIFIC METHOD IN THE TRENCHES • Come up with a hypothesis • Design an experiment to exclude it • Use a statistical test to show that the data is unlikely to be generated by a world in which the hypothesis does not hold (“background”)
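
The last step on slide 41 can be illustrated with a counting experiment. The sketch below reuses the illustrative counts from slide 31 (expected background b = 100, observed total of 150 events) and assumes, as a modelling choice not stated on the slide, that the background-only count is Poisson distributed. The function name is hypothetical, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def poisson_tail(n_obs, mu):
    """P(N >= n_obs) for N ~ Poisson(mu), summed term by term from n_obs upward."""
    k = n_obs
    # First term computed in log space to avoid overflow in mu**k / k!.
    term = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    total = 0.0
    # Terms grow until k is about mu, then shrink; stop once they are negligible.
    while k <= mu or term > 1e-17 * total:
        total += term
        k += 1
        term *= mu / k
    return total


background = 100.0  # expected count in a world without the signal
observed = 150      # total count actually seen

# p-value: probability that background alone produces at least this many events.
p_value = poisson_tail(observed, background)

# Equivalent one-sided Gaussian significance, in units of sigma.
z = NormalDist().inv_cdf(1.0 - p_value)

print(f"p-value = {p_value:.3g}")
print(f"significance = {z:.2f} sigma")
```

A small p-value here only says the data are unlikely under the background-only world; it says nothing about how plausible the tested hypothesis was to begin with.
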
  • 42. THE SCIENTIFIC METHOD IN THE TRENCHES • Rutherford: “If your experiment needs statistics, you ought to have done a better experiment” • Without statistics, science would be over • we ran out of slam-dunk, infinite-significance (“background-free”) hypotheses • phenomena are inherently noisy: nobody has seen or will ever see a Higgs boson
  • 43. B. Kégl Data driven generation 43 THE P-VALUE CONTROVERSY “My position when I wrote “Thinking, Fast and Slow” was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me — it is why I think people should believe in climate change. But the argument only holds when all relevant results are published.” Daniel Kahneman 2002 Nobel Memorial Prize in Economic Sciences
  • 45. THE P-VALUE CONTROVERSY But the main problem is a tautology: if none of your hypotheses are true, all your positives are false. But of course: if all your hypotheses are true, you are not exploring
  • 46. GUIDELINES • Register all experiments and publish negatives • Don’t do underpowered experiments • Put the significance bar high enough • Test only “plausible” hypotheses
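
Slides 45 and 46 can be made quantitative with a short calculation: among the hypotheses that pass the test, the fraction of false positives depends on how many of the tested hypotheses are true to begin with (plausibility), on the power of the experiment, and on the significance bar. The sketch below is a standard Bayes-rule computation rather than something shown in the talk; the function name and all numbers are hypothetical.

```python
def false_positive_fraction(prior_true, power, alpha):
    """Fraction of positive results that are false.

    prior_true: fraction of tested hypotheses that are actually true
    power:      probability that a true hypothesis yields a positive
    alpha:      probability that a false hypothesis yields a positive
    """
    true_positives = power * prior_true
    false_positives = alpha * (1.0 - prior_true)
    positives = true_positives + false_positives
    if positives == 0.0:
        raise ValueError("no positives are possible with these settings")
    return false_positives / positives


# Hypothetical scenarios: (label, prior_true, power, alpha)
scenarios = [
    ("none of the hypotheses are true", 0.0, 0.8, 0.05),
    ("implausible, underpowered", 0.01, 0.2, 0.05),
    ("plausible, well powered", 0.1, 0.8, 0.05),
    ("plausible, higher significance bar", 0.1, 0.8, 0.001),
    ("all hypotheses true (no exploration)", 1.0, 0.8, 0.05),
]

for label, prior_true, power, alpha in scenarios:
    fraction = false_positive_fraction(prior_true, power, alpha)
    print(f"{label:40s} false positives among positives: {fraction:.1%}")
```

The first and last rows reproduce the two extremes on slide 45: with no true hypotheses every positive is false, and with only true hypotheses there are no false positives but also nothing being explored.
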
  • 47. QUESTIONS • What is a plausible but non-trivial hypothesis? • How to measure plausibility? • How to generate them (automatically)? • How are hypotheses related to prior/current knowledge?
  • 48. GENERATIVE MODELS IN ML Interesting tools, but it’s a whole new ballgame and paradigmatically we are in the dark
  • 49. GENERATIVE MODELS IN ML • Feed a set of known objects to an algorithm • Ask it to generate similar objects • But different from the training set
  • 50. GENERATIVE MODELS IN ML • The current likelihood-based paradigm is fundamentally flawed • The trivial sampling of the training set needs to be excluded by heuristics • The value of novelty is not even raised as a question
  • 51. Our goal was to generate new objects of new types, grounded in the knowledge learned from examples: “plausible” characters that could be part of an alphabet in another universe
  • 52. CAN WE GENERATE NEW TYPES? Existing objects of known types → (learning) → generative model → (generation) → New objects. New types?
  • 53. THE UNKNOWN HAS A STRUCTURE Selected semi-manually: t-SNE + clustering
  • 54. HOW TO EVALUATE THE CAPACITY OF THESE MODELS TO GENERATE NEW TYPES? Idea: validate on hold-out types. Train on known types, test on types known to the experimenter but unknown to the model
  • 55. Train on digits, test on letters
  • 56. Train on all music up to the Beatles, test on Sex Pistols
  • 57. Train on all phones up to 2006, test on the iPhone
  • 58. Train on all scientific knowledge up to Einstein, test on relativity theory
  • 59. CAN WE GENERATE NEW TYPES? Existing objects of known types → (learning) → generative model → (generation) → New objects. New types?
  • 60. CAN WE GENERATE NEW TYPES? Existing objects of known types → (learning) → generative model → (generation) → New objects. Are some of those letters?
  • 61. CAN WE GENERATE NEW TYPES? Are some of those letters? This we know how to do.
  • 62. THE EVALUATOR MODEL Train a good discriminator on digits + letters: 10 + 26 = 36 classes [Diagram labels: learning, discriminator]
  • 63. COUNT THE NUMBER OF LETTERS discriminator: use to count letters [Figure labels: low, high]
  • 64. COUNT THE NUMBER OF LETTERS discriminator: use to count letters [Figure labels: low, high]
  • 65. OBJECTNESS = POSTERIOR ENTROPY objectness: use to discard noise [Figure labels: high, low]
  • 66. OBJECTNESS = POSTERIOR ENTROPY objectness: use to discard noise [Figure labels: high, low]
  • 67. COMBINING THE TWO OBJECTIVES [Figure: objectness and letter count, each marked from low to high]
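
Slides 62 to 67 describe two scores computed from the 36-class discriminator's posterior over each generated sample. The sketch below assumes the posterior is available as a list of 36 probabilities, with digits in positions 0-9 and letters in positions 10-35 (an indexing convention chosen here, not given in the talk). It counts a sample as a letter when its most probable class is a letter, and uses the Shannon entropy of the posterior as the quantity behind objectness. Reading low entropy as "object-like" is an interpretation of the slide title, and the example posteriors are made up.

```python
import math

N_DIGITS, N_LETTERS = 10, 26
N_CLASSES = N_DIGITS + N_LETTERS  # 36 classes, as on slide 62


def entropy(posterior):
    """Shannon entropy (in nats) of a discriminator posterior."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)


def is_letter(posterior):
    """True when the most probable class is one of the 26 letter classes."""
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best >= N_DIGITS


def score_model(posteriors):
    """Return (fraction of samples classified as letters, mean posterior entropy)."""
    n = len(posteriors)
    letter_fraction = sum(is_letter(p) for p in posteriors) / n
    mean_entropy = sum(entropy(p) for p in posteriors) / n
    return letter_fraction, mean_entropy


def peaked(k, confidence=0.9):
    """Hypothetical posterior that puts `confidence` on class k, the rest spread evenly."""
    rest = (1.0 - confidence) / (N_CLASSES - 1)
    return [confidence if i == k else rest for i in range(N_CLASSES)]


uniform = [1.0 / N_CLASSES] * N_CLASSES  # what pure noise might look like

# Made-up outputs of one generative model: a confident letter, a confident digit, noise.
samples = [peaked(12), peaked(3), uniform]
letter_fraction, mean_entropy = score_model(samples)

print(f"letter fraction = {letter_fraction:.2f}")
print(f"mean entropy    = {mean_entropy:.2f} nats (maximum is {math.log(N_CLASSES):.2f})")
```

Under these assumptions, a model whose samples are mostly confident letters would score high on letter count and low on entropy, which is one way to read the combination of the two objectives on slide 67.
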
  • 68. PANGRAMS hand-picked letters; top models found automatically
  • 69. SOME WRITTEN STUFF http://openreview.net/forum?id=ByEPMj5el https://arxiv.org/abs/1606.04345 https://medium.com/@balazskegl/the-epistemological-challenges-of-automating-a-b-testing-or-how-will-ai-do-science-b724f8217811#.q041gyvkt