Data-driven hypothesis generation using deep neural nets
1. B. Kégl Data driven generation
1
CNRS & Université Paris-Saclay
Center for Data Science
BALÁZS
KÉGL
DATA-DRIVEN HYPOTHESIS GENERATION
USING DEEP NEURAL NETS
Epistemology of Big Data in Physics
Bremen, March 2017
2. B. Kégl Data driven generation
• Machine learning in science
• induction, inference, simulation, generation
• Stretching the scientific method
• the p-value controversy and the problem of automated hypothesis
generation
• Generative models and novelty generation
2
OUTLINE
3. B. Kégl Data driven generation
3
Machine learning is an
engineering toolkit
for induction
4. B. Kégl Data driven generation
• Classification problem y = f(x)
4
DATA-DRIVEN INFERENCE
x
f y
‘Stomorhina’
f y
‘Scaeva’
x
5. B. Kégl Data driven generation
• Classification problem y = f(x)
• No model to fit, but a large set of (x, y)
pairs
• The source is typically observation + human labeling
• In science (or industry) it may also be simulation
• And a loss function L(y, ypred)
5
DATA-DRIVEN INFERENCE
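The setup on this slide can be sketched in a few lines. The wing-length feature and the threshold below are invented for illustration (only the two genus labels come from the talk):

```python
# A classification problem: predict the label y from the input x,
# given only (x, y) pairs and a loss L(y, y_pred).
def zero_one_loss(y, y_pred):
    return 0 if y == y_pred else 1

# a toy labeled set: hypothetical wing length (mm) -> fly genus
pairs = [(6.1, "Stomorhina"), (5.8, "Stomorhina"),
         (8.9, "Scaeva"), (9.4, "Scaeva")]

# a trivial hand-made f; a learning algorithm would induce it from the pairs
def f(x):
    return "Scaeva" if x > 7.5 else "Stomorhina"

avg_loss = sum(zero_one_loss(y, f(x)) for x, y in pairs) / len(pairs)
print(avg_loss)  # 0.0 on the training pairs
```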
6. B. Kégl Data driven generation
• A learning algorithm takes a set of (x, y)
pairs and induces (learns) a function f: x ⟶ y
• Generalization: f must work well on
previously unseen (x, y) pairs
• Algorithms need to minimize error (expected
loss), which involves avoiding overfitting
• regularization, smoothing, capacity/complexity control
6
DATA-DRIVEN INDUCTION
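The overfitting the slide warns about can be seen in a minimal polynomial-fit sketch; all data and degrees below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, 20)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.1, 200)

def fit_and_eval(degree):
    # least-squares polynomial fit: a simple learning algorithm
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coefs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

tr3, te3 = fit_and_eval(3)     # controlled capacity
tr15, te15 = fit_and_eval(15)  # high capacity: also fits the noise
# the high-capacity model achieves a lower training error,
# but pays for it on previously unseen points
print(tr3, te3)
print(tr15, te15)
```

Capacity control (here: the polynomial degree) is exactly the regularization knob the slide refers to.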
7. B. Kégl Data driven generation
7
THE PERCEPTRON (ROSENBLATT 1957)
Weights were encoded in potentiometers, and
weight updates during learning were performed by
electric motors.
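Rosenblatt's update rule itself is tiny; here is a sketch in NumPy on synthetic separable data (the data and the margin filter are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.5]      # keep a margin so training converges
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # linearly separable labels

w, b = np.zeros(2), 0.0
for _ in range(200):                  # epochs
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:    # misclassified point
            w += yi * xi              # Rosenblatt's weight update
            b += yi                   # (motors turning potentiometers, in 1957)
            mistakes += 1
    if mistakes == 0:                 # converged: every point classified correctly
        break

print((np.sign(X @ w + b) == y).mean())
```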
8. B. Kégl Data driven generation
8
THE PERCEPTRON (ROSENBLATT 1957)
Based on Rosenblatt's
statements, The New York
Times reported the
perceptron to be "the
embryo of an electronic
computer that [the Navy]
expects will be able to
walk, talk, see, write,
reproduce itself and be
conscious of its existence."
9. B. Kégl Data driven generation
9
BACK PROPAGATION
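Backpropagation is the chain rule applied layer by layer. A minimal two-layer sketch, with the gradient checked against a finite difference (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)   # input
t = rng.normal(size=2)   # target
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def forward(W1, W2, x):
    h = np.tanh(W1 @ x)   # hidden layer
    y = W2 @ h            # linear output layer
    return h, y

def loss(W1, W2):
    _, y = forward(W1, W2, x)
    return 0.5 * np.sum((y - t) ** 2)

# backward pass: propagate the error back through each layer
h, y = forward(W1, W2, x)
delta2 = y - t                         # dL/dy
gW2 = np.outer(delta2, h)              # gradient for the output weights
delta1 = (W2.T @ delta2) * (1 - h**2)  # back through tanh
gW1 = np.outer(delta1, x)              # gradient for the input weights

# sanity check one entry against a finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (loss(W1p, W2) - loss(W1, W2)) / eps
print(abs(num - gW1[0, 0]))
```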
10. B. Kégl Data driven generation
10
THE AT&T CHECK READER (90S)
11. B. Kégl Data driven generation
11
THE AT&T CHECK READER (90S)
12. B. Kégl Data driven generation
• NNs are back on the research agenda
12
2006: A NEW WAVE BEGINS
13. B. Kégl Data driven generation
13
2009: IMAGENET
“We believe that a large-scale ontology of images is a
critical resource for developing advanced, large-scale
content-based image search and image understanding algorithms,
as well as for providing critical training and benchmarking
data for such algorithms.” (Fei-Fei Li et al., CVPR 2009)
14. B. Kégl Data driven generation
• 80K hierarchical categories
• 80M images of size >100×100
• labeled by 50K Amazon Mechanical Turk workers
14
2009: IMAGENET
15. B. Kégl Data driven generation
• Krizhevsky, Sutskever, Hinton (2012): 1.2M images, 60M
parameters, 6 days training on two GPUs
15
TECHNIQUES & TRICKS
16. B. Kégl Data driven generation
16
IMAGENET COMPETITIONS
17. B. Kégl Data driven generation
• Theano
• TensorFlow
• Keras
• Caffe
• Torch
17
TODAY: EASY-TO-USE LIBRARIES
18. B. Kégl Data driven generation
18
TODAY: HARDWARE
Google TPU
19. B. Kégl Data driven generation
19
COMMERCIAL APPLICATIONS
20. B. Kégl Data driven generation
20
GOOGLE IMAGE SEARCH
21. B. Kégl Data driven generation
21
FACE RECOGNITION/DETECTION
A $6B MARKET IN 2020
22. B. Kégl Data driven generation
22
SELF-DRIVING CARS
24. B. Kégl Data driven generation
24
MACHINE LEARNING IN SCIENCE
inverting the generative chain
exciting engineering feats but
epistemologically boring
Inference
25. Center for Data Science
Paris-Saclay
RAPID ANALYTICS AND MODEL PROTOTYPING
Classifying variable stars
25
27. B. Kégl Data driven generation
VARIABLE STARS
27
accuracy improvement: 89% to 96%
28. B. Kégl Data driven generation
THE ATLAS DETECTOR
28
29. B. Kégl Data driven generation
FEATURE ENGINEERING
• Each collision is an event
• hundreds of particles: decay products
• hundreds of thousands of sensors (but sparse)
• for each particle: type, energy, direction is measured
• a fixed-length list of ~30-40 extracted features: x
• e.g., angles, energies, directions, reconstructed mass
• based on 50 years of accumulated domain knowledge
29
30. B. Kégl Data driven generation
CLASSIFIER
• Training on simulated data
• Signal (Higgs) vs background (everything else)
• The goal is to find a good discriminator: maximizing
the power (sensitivity, expected significance) of the
test
30
31. B. Kégl Data driven generation
CLASSIFICATION FOR DISCOVERY
31
Goal: optimize the expected discovery significance
[Figure: class-conditional probability densities of signal and background, and the corresponding yearly counts (probability × flux × time), with the selection threshold marked on both panels.]
Expected background: say, b = 100 events. Total count: say, 150 events. The excess is s = 50 events, so AMS = s/√b = 50/√100 = 5 sigma.
From the paper excerpt on the slide: when optimizing the selection region G = {x : g(x) = s}, n and the background expectation µ_b are unknown; µ_b is estimated by its empirical counterpart b to obtain the approximate median significance
AMS₂ = √( 2( (s + b) ln(1 + s/b) − s ) ). (14)
Using ln(x + 1) = x − x²/2 + O(x³), AMS₂ can be rewritten as AMS₃ × (1 + O(s/b)), where
AMS₃ = s/√b. (15)
The two are asymptotically indistinguishable when b ≫ s and, depending on the chosen search region, AMS₃ can be a valid surrogate.
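The two significance measures on this slide can be sketched in a few lines; the function names are mine, not from the talk, and the numbers are the slide's toy example:

```python
import math

def ams2(s, b):
    # approximate median significance, eq. (14)
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def ams3(s, b):
    # simple s / sqrt(b) approximation, eq. (15)
    return s / math.sqrt(b)

# the slide's example: b = 100 expected background events, 150 observed
s, b = 50, 100
print(ams3(s, b))  # 5.0 sigma
print(ams2(s, b))  # a bit lower here, since b >> s does not hold
```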
32. B. Kégl Data driven generation
32
MACHINE LEARNING IN SCIENCE
inverting the generative chain
exciting engineering feats but
epistemologically boring
Inference
33. B. Kégl Data driven generation
33
MACHINE LEARNING IN SCIENCE
replacing the generative chain
epistemologically more interesting
Simulation / generation /
forecasting
34. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
FORECASTING EL NIÑO SIX MONTHS AHEAD
34
[Figure: a sequence of global sea-surface temperature maps; sample readings (K): 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50]
feature extractor → x (a fixed-length feature vector) → regressor
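The pipeline on the slide can be sketched end to end: a feature extractor maps a window of past temperature readings to a fixed-length vector x, and a regressor maps x to the target six months ahead. Everything below (the synthetic series, window length, and choice of features) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# a synthetic monthly temperature series (K), a stand-in for the real data
series = 300 + np.sin(np.arange(300) * 2 * np.pi / 12) + rng.normal(0, 0.1, 300)

LAG, HORIZON = 12, 6  # use one year of history, predict 6 months ahead

def feature_extractor(window):
    # fixed-length feature vector x extracted from a raw window
    return np.array([window[-1], window.mean(),
                     window.std(), window[-1] - window[0]])

X = np.array([feature_extractor(series[t - LAG:t])
              for t in range(LAG, len(series) - HORIZON)])
y = series[LAG + HORIZON:]

# a linear regressor fit by least squares
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
print(rmse)
```

On this toy seasonal series the linear regressor recovers the six-month-ahead value down to the noise level; the real El Niño problem replaces both the features and the regressor with learned components.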
35. B. Kégl Data driven generation
35
MACHINE LEARNING IN SCIENCE
Why?
Simulation / generation /
forecasting
• Cost cutting 1: looking at the form of f, I can place my fixed
number of temperature sensors optimally
• Cost cutting 2: computing f in real time may be much
cheaper/faster than running the full simulation
• Cost cutting 3: if I can generate realistic galaxy images, I can
replace costly manual labeling of real photos
36. B. Kégl Data driven generation
36
MACHINE LEARNING IN SCIENCE
Simulation / generation /
forecasting
Inference
• We can automate almost everything
• simulation, inference, experimental design
• this is not even controversial, just an extension of the current
paradigm
• But not the hypothesis generation: what model to
test?
37. B. Kégl Data driven generation
37
Hypothesis generation is crucial
and, at the same time,
not covered by the scientific
method
38. B. Kégl Data driven generation
38
ROBOT SCIENTIST
39. B. Kégl Data driven generation
39
ROBOT SCIENTIST
“Robot scientists are a natural extension of the trend of
increased involvement of automation in science. They can
automatically develop and test hypotheses to
explain observations, run experiments using
laboratory robotics, interpret the results to amend
their hypotheses, and then repeat the cycle,
automating high-throughput hypothesis-led
research.”
http://www.cam.ac.uk/research/news/artificially-intelligent-robot-scientist-eve-could-boost-search-for-new-drugs
40. B. Kégl Data driven generation
40
Hypothesis generation is crucial
and, at the same time,
not covered by the scientific
method
This ignorance has already bitten
us; with the appearance of the
robot scientist, facing it
becomes unavoidable
41. B. Kégl Data driven generation
• Come up with a hypothesis
• Design an experiment to exclude it
• Use a statistical test to show that the data is unlikely
to be generated by a world in which the hypothesis does
not hold (“background”)
41
THE SCIENTIFIC METHOD IN THE
TRENCHES
42. B. Kégl Data driven generation
• Rutherford: “If your experiment needs statistics, you ought to
have done a better experiment”
• Without statistics, science would be over
• we have run out of slam-dunk, infinite-significance (“background-free”)
hypotheses
• phenomena are inherently noisy: nobody has seen or will ever
see a Higgs boson
42
THE SCIENTIFIC METHOD IN THE
TRENCHES
43. B. Kégl Data driven generation
43
THE P-VALUE CONTROVERSY
“My position when I wrote “Thinking, Fast and Slow” was
that if a large body of evidence published in reputable
journals supports an initially implausible conclusion, then
scientific norms require us to believe that conclusion.
Implausibility is not sufficient to justify disbelief, and belief in
well-supported scientific conclusions is not optional. This
position still seems reasonable to me — it is why I think
people should believe in climate change. But the
argument only holds when all relevant results
are published.”
Daniel Kahneman
2002 Nobel Memorial Prize in Economic Sciences
45. B. Kégl Data driven generation
45
THE P-VALUE CONTROVERSY
But the main problem is a tautology:
if none of your hypotheses are true,
all your positives are false.
But of course: if all your hypotheses are true,
you are not exploring.
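The tautology shows up in a two-line simulation: when every tested hypothesis is null, a test at level α still fires at rate α, and every one of those positives is false. The numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, n = 1000, 30
# every experiment samples pure noise: none of the hypotheses is true
data = rng.normal(0.0, 1.0, size=(n_tests, n))
# one-sample t statistic per experiment
t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))
positives = np.abs(t) > 2.045   # two-sided t threshold, alpha ≈ 0.05, 29 dof
print(positives.mean())         # close to 0.05 -- and all of them false
```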
46. B. Kégl Data driven generation
• Register all experiments and publish negatives
• Don’t do underpowered experiments
• Put the significance bar high enough
• Test only “plausible” hypotheses
46
GUIDELINES
47. B. Kégl Data driven generation
• What is a plausible but non-trivial hypothesis?
• How to measure plausibility?
• How to generate them (automatically)?
• How are hypotheses related to prior/current
knowledge?
47
QUESTIONS
48. B. Kégl Data driven generation
48
GENERATIVE MODELS IN ML
Interesting tools but it’s a whole new
ballgame and paradigmatically we are in
the dark
49. B. Kégl Data driven generation
• Feed a set of known objects
to an algorithm
• Ask it to generate similar
objects
• But different from the
training set
49
GENERATIVE MODELS IN ML
50. B. Kégl Data driven generation
• The current likelihood-based
paradigm is fundamentally flawed
• The trivial sampling of the
training set needs to be excluded
by heuristics
• The value of novelty is not
even raised as a question
50
GENERATIVE MODELS IN ML
51. B. Kégl Data driven generation
51
Our goal was to generate new objects
from new types, grounded in the
knowledge learned from examples
“Plausible” characters that could be
part of an alphabet in another
universe
52. B. Kégl Data driven generation
52
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. New types?
learning
generation
53. B. Kégl Data driven generation
53
THE UNKNOWN HAS A STRUCTURE
Selected semi-manually:
t-SNE + clustering
54. B. Kégl Data driven generation
54
HOW TO EVALUATE THE CAPACITY OF
THESE MODELS TO GENERATE NEW TYPES?
Idea: validate on hold-out types
Train on known types,
test on types known to the
experimenter
but unknown to the model
55. B. Kégl Data driven generation
55
Train on digits,
test on letters
56. B. Kégl Data driven generation
56
Train on all music up to the Beatles,
test on Sex Pistols
57. B. Kégl Data driven generation
57
Train on all phones up to 2006,
test on the iPhone
58. B. Kégl Data driven generation
58
Train on all scientific knowledge up to
Einstein,
test on relativity theory
59. B. Kégl Data driven generation
59
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. New types?
learning
generation
60. B. Kégl Data driven generation
60
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. Are some of those letters?
learning
generation
61. B. Kégl Data driven generation
61
CAN WE GENERATE NEW TYPES?
Are some of those letters?
This we know how to do.
62. B. Kégl Data driven generation
62
THE EVALUATOR MODEL
Train a good discriminator on digits + letters
10 + 26 = 36 classes
discriminator
learning
63. B. Kégl Data driven generation
63
COUNT THE NUMBER OF LETTERS
[Figure: the discriminator is used to count letters among generated samples; examples ranked from low to high letter count.]
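With the 36-class discriminator in hand, the letter count is just an argmax over the posterior. A sketch, in which random Dirichlet vectors stand in for real discriminator outputs:

```python
import numpy as np

N_DIGITS, N_LETTERS = 10, 26

def letter_count(posteriors):
    # a generated sample counts as a letter if its most probable
    # class among the 36 is one of the 26 letter classes
    preds = np.argmax(posteriors, axis=1)
    return int(np.sum(preds >= N_DIGITS))

rng = np.random.default_rng(0)
fake = rng.dirichlet(np.ones(N_DIGITS + N_LETTERS), size=100)
print(letter_count(fake))  # roughly 26/36 of the 100 samples, by chance
```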
65
OBJECTNESS = POSTERIOR ENTROPY
[Figure: generated samples ranked by objectness from high to low; low-objectness samples are discarded as noise.]
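Objectness here is measured through the entropy of the discriminator's posterior: a peaked posterior means the sample confidently looks like some object, a flat one means noise. A minimal sketch, assuming objectness is defined as negative entropy so that higher is better:

```python
import numpy as np

def objectness(p, eps=1e-12):
    # negative Shannon entropy of the 36-class posterior:
    # peaked posterior -> high objectness, flat posterior -> low
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p + eps)))

peaked = np.array([0.965] + [0.001] * 35)  # confidently some class
flat = np.full(36, 1 / 36)                 # looks like nothing: noise
# peaked scores higher, so flat, noise-like samples can be discarded
print(objectness(peaked) > objectness(flat))
```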
67. B. Kégl Data driven generation
67
COMBINING THE TWO OBJECTIVES
[Figure: generated samples arranged along two axes, objectness (low → high) and letter count (low → high); the best candidates score high on both.]
68. B. Kégl Data driven generation
68
PANGRAMS
hand-picked letters
top models found automatically
69. B. Kégl Data driven generation
69
SOME WRITTEN STUFF
http://openreview.net/forum?id=ByEPMj5el
https://arxiv.org/abs/1606.04345
https://medium.com/@balazskegl/the-epistemological-challenges-of-automating-a-b-testing-or-how-will-ai-do-science-b724f8217811#.q041gyvkt