Data-driven hypothesis generation using deep neural nets
1. B. Kégl Data driven generation
1
CNRS & Université Paris-Saclay
Center for Data Science
BALÁZS
KÉGL
DATA-DRIVEN HYPOTHESIS GENERATION
USING DEEP NEURAL NETS
Epistemology of Big Data in Physics
Bremen, March 2017
2. B. Kégl Data driven generation
• Machine learning in science
• induction, inference, simulation, generation
• Stretching the scientific method
• the p-value controversy and the problem of automated hypothesis
generation
• Generative models and novelty generation
2
OUTLINE
3. B. Kégl Data driven generation
3
Machine learning is an
engineering toolkit
for induction
4. B. Kégl Data driven generation
• Classification problem y = f(x)
4
DATA-DRIVEN INFERENCE
x
f y
‘Stomorhina’
f y
‘Scaeva’
x
5. B. Kégl Data driven generation
• Classification problem y = f(x)
• No model to fit, but a large set of (x, y)
pairs
• The source is typically observation + human labeling
• In science (or industry) it may also be simulation
• And a loss function L(y, ypred)
5
DATA-DRIVEN INFERENCE
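The setup on this slide can be sketched in a few lines. The wing-length feature and the threshold below are invented for illustration (only the two genus labels come from the talk):

```python
# A classification problem: predict the label y from the input x,
# given only (x, y) pairs and a loss L(y, y_pred).
def zero_one_loss(y, y_pred):
    return 0 if y == y_pred else 1

# a toy labeled set: hypothetical wing length (mm) -> fly genus
pairs = [(6.1, "Stomorhina"), (5.8, "Stomorhina"),
         (8.9, "Scaeva"), (9.4, "Scaeva")]

# a trivial hand-made f; a learning algorithm would induce it from the pairs
def f(x):
    return "Scaeva" if x > 7.5 else "Stomorhina"

avg_loss = sum(zero_one_loss(y, f(x)) for x, y in pairs) / len(pairs)
print(avg_loss)  # 0.0 on the training pairs
```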
6. B. Kégl Data driven generation
• A learning algorithm takes a set of (x, y)
pairs and induces (learns) a function f: x ⟶ y
• Generalization: f must work well on
previously unseen (x, y) pairs
• Algorithms need to minimize error (expected
loss), which involves avoiding overfitting
• regularization, smoothing, capacity/complexity control
6
DATA-DRIVEN INDUCTION
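The overfitting the slide warns about can be seen in a minimal polynomial-fit sketch; all data and degrees below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, 20)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.1, 200)

def fit_and_eval(degree):
    # least-squares polynomial fit: a simple learning algorithm
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coefs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

tr3, te3 = fit_and_eval(3)     # controlled capacity
tr15, te15 = fit_and_eval(15)  # high capacity: also fits the noise
# the high-capacity model achieves a lower training error,
# but pays for it on previously unseen points
print(tr3, te3)
print(tr15, te15)
```

Capacity control (here: the polynomial degree) is exactly the regularization knob the slide refers to.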
7. B. Kégl Data driven generation
7
THE PERCEPTRON (ROSENBLATT 1957)
Weights were encoded in potentiometers, and
weight updates during learning were performed by
electric motors.
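Rosenblatt's update rule itself is tiny; here is a sketch in NumPy on synthetic separable data (the data and the margin filter are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.5]      # keep a margin so training converges
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # linearly separable labels

w, b = np.zeros(2), 0.0
for _ in range(200):                  # epochs
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:    # misclassified point
            w += yi * xi              # Rosenblatt's weight update
            b += yi                   # (motors turning potentiometers, in 1957)
            mistakes += 1
    if mistakes == 0:                 # converged: every point classified correctly
        break

print((np.sign(X @ w + b) == y).mean())
```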
8. B. Kégl Data driven generation
8
THE PERCEPTRON (ROSENBLATT 1957)
Based on Rosenblatt's
statements, The New York
Times reported the
perceptron to be "the
embryo of an electronic
computer that [the Navy]
expects will be able to
walk, talk, see, write,
reproduce itself and be
conscious of its existence."
9. B. Kégl Data driven generation
9
BACK PROPAGATION
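Backpropagation is the chain rule applied layer by layer. A minimal two-layer sketch, with the gradient checked against a finite difference (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)   # input
t = rng.normal(size=2)   # target
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def forward(W1, W2, x):
    h = np.tanh(W1 @ x)   # hidden layer
    y = W2 @ h            # linear output layer
    return h, y

def loss(W1, W2):
    _, y = forward(W1, W2, x)
    return 0.5 * np.sum((y - t) ** 2)

# backward pass: propagate the error back through each layer
h, y = forward(W1, W2, x)
delta2 = y - t                         # dL/dy
gW2 = np.outer(delta2, h)              # gradient for the output weights
delta1 = (W2.T @ delta2) * (1 - h**2)  # back through tanh
gW1 = np.outer(delta1, x)              # gradient for the input weights

# sanity check one entry against a finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (loss(W1p, W2) - loss(W1, W2)) / eps
print(abs(num - gW1[0, 0]))
```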
10. B. Kégl Data driven generation
10
THE AT&T CHECK READER (90S)
11. B. Kégl Data driven generation
11
THE AT&T CHECK READER (90S)
12. B. Kégl Data driven generation
• NNs are back on the research agenda
12
2006: A NEW WAVE BEGINS
13. B. Kégl Data driven generation
13
2009: IMAGENET
“We believe that a large-scale ontology of images is a
critical resource for developing advanced, large-scale
content-based image search and image understanding algorithms,
as well as for providing critical training and benchmarking
data for such algorithms.” (Fei-Fei Li et al., CVPR 2009)
14. B. Kégl Data driven generation
• 80K hierarchical categories
• 80M images of size >100×100
• labeled by 50K Amazon Mechanical Turk workers
14
2009: IMAGENET
15. B. Kégl Data driven generation
• Krizhevsky, Sutskever, Hinton (2012): 1.2M images, 60M
parameters, 6 days training on two GPUs
15
TECHNIQUES & TRICKS
16. B. Kégl Data driven generation
16
IMAGENET COMPETITIONS
17. B. Kégl Data driven generation
• Theano
• TensorFlow
• Keras
• Caffe
• Torch
17
TODAY: EASY-TO-USE LIBRARIES
18. B. Kégl Data driven generation
18
TODAY: HARDWARE
Google TPU
19. B. Kégl Data driven generation
19
COMMERCIAL APPLICATIONS
20. B. Kégl Data driven generation
20
GOOGLE IMAGE SEARCH
21. B. Kégl Data driven generation
21
FACE RECOGNITION/DETECTION
A $6B MARKET IN 2020
22. B. Kégl Data driven generation
22
SELF-DRIVING CARS
24. B. Kégl Data driven generation
24
MACHINE LEARNING IN SCIENCE
inverting the generative chain
exciting engineering feats but
epistemologically boring
Inference
25. Center for Data Science
Paris-Saclay
RAPID ANALYTICS AND MODEL PROTOTYPING
Classifying variable stars
25
27. B. Kégl Data driven generation
VARIABLE STARS
27
accuracy improvement: 89% to 96%
28. B. Kégl Data driven generation
THE ATLAS DETECTOR
28
29. B. Kégl Data driven generation
FEATURE ENGINEERING
• Each collision is an event
• hundreds of particles: decay products
• hundreds of thousands of sensors (but sparse)
• for each particle: type, energy, direction is measured
• a fixed-length list of ~30-40 extracted features: x
• e.g., angles, energies, directions, reconstructed mass
• based on 50 years of accumulated domain knowledge
29
30. B. Kégl Data driven generation
CLASSIFIER
• Training on simulated data
• Signal (Higgs) vs background (everything else)
• The goal is to find a good discriminator: maximizing
the power (sensitivity, expected significance) of the
test
30
31. B. Kégl Data driven generation
CLASSIFICATION FOR DISCOVERY
31
Goal: optimize the expected discovery significance
[Figure: class-conditional probability densities of signal and background, and the corresponding yearly counts (probability × flux × time), with the selection threshold marked on both panels.]
Expected background: say, b = 100 events. Total count: say, 150 events. The excess is s = 50 events, so AMS = s/√b = 50/√100 = 5 sigma.
From the paper excerpt on the slide: when optimizing the selection region G = {x : g(x) = s}, n and the background expectation µ_b are unknown; µ_b is estimated by its empirical counterpart b to obtain the approximate median significance
AMS₂ = √( 2( (s + b) ln(1 + s/b) − s ) ). (14)
Using ln(x + 1) = x − x²/2 + O(x³), AMS₂ can be rewritten as AMS₃ × (1 + O(s/b)), where
AMS₃ = s/√b. (15)
The two are asymptotically indistinguishable when b ≫ s and, depending on the chosen search region, AMS₃ can be a valid surrogate.
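The two significance measures on this slide can be sketched in a few lines; the function names are mine, not from the talk, and the numbers are the slide's toy example:

```python
import math

def ams2(s, b):
    # approximate median significance, eq. (14)
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def ams3(s, b):
    # simple s / sqrt(b) approximation, eq. (15)
    return s / math.sqrt(b)

# the slide's example: b = 100 expected background events, 150 observed
s, b = 50, 100
print(ams3(s, b))  # 5.0 sigma
print(ams2(s, b))  # a bit lower here, since b >> s does not hold
```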
32. B. Kégl Data driven generation
32
MACHINE LEARNING IN SCIENCE
inverting the generative chain
exciting engineering feats but
epistemologically boring
Inference
33. B. Kégl Data driven generation
33
MACHINE LEARNING IN SCIENCE
replacing the generative chain
epistemologically more interesting
Simulation / generation /
forecasting
34. Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
FORECASTING EL NIÑO SIX MONTHS AHEAD
34
[Figure: a sequence of global sea-surface temperature maps; sample readings (K): 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50]
feature extractor → x (a fixed-length feature vector) → regressor
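The pipeline on the slide can be sketched end to end: a feature extractor maps a window of past temperature readings to a fixed-length vector x, and a regressor maps x to the target six months ahead. Everything below (the synthetic series, window length, and choice of features) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# a synthetic monthly temperature series (K), a stand-in for the real data
series = 300 + np.sin(np.arange(300) * 2 * np.pi / 12) + rng.normal(0, 0.1, 300)

LAG, HORIZON = 12, 6  # use one year of history, predict 6 months ahead

def feature_extractor(window):
    # fixed-length feature vector x extracted from a raw window
    return np.array([window[-1], window.mean(),
                     window.std(), window[-1] - window[0]])

X = np.array([feature_extractor(series[t - LAG:t])
              for t in range(LAG, len(series) - HORIZON)])
y = series[LAG + HORIZON:]

# a linear regressor fit by least squares
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
print(rmse)
```

On this toy seasonal series the linear regressor recovers the six-month-ahead value down to the noise level; the real El Niño problem replaces both the features and the regressor with learned components.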
35. B. Kégl Data driven generation
35
MACHINE LEARNING IN SCIENCE
Why?
Simulation / generation /
forecasting
• Cost cutting 1: looking at the form of f, I can place my fixed
number of temperature sensors optimally
• Cost cutting 2: computing f in real time may be much
cheaper/faster than running the full simulation
• Cost cutting 3: if I can generate realistic galaxy images, I can
replace costly manual labeling of real photos
36. B. Kégl Data driven generation
36
MACHINE LEARNING IN SCIENCE
Simulation / generation /
forecasting
Inference
• We can automate almost everything
• simulation, inference, experimental design
• this is not even controversial, just an extension of the current
paradigm
• But not the hypothesis generation: what model to
test?
37. B. Kégl Data driven generation
37
Hypothesis generation is crucial
and, at the same time,
not covered by the scientific
method
38. B. Kégl Data driven generation
38
ROBOT SCIENTIST
39. B. Kégl Data driven generation
39
ROBOT SCIENTIST
“Robot scientists are a natural extension of the trend of
increased involvement of automation in science. They can
automatically develop and test hypotheses to
explain observations, run experiments using
laboratory robotics, interpret the results to amend
their hypotheses, and then repeat the cycle,
automating high-throughput hypothesis-led
research.”
http://www.cam.ac.uk/research/news/artificially-intelligent-robot-scientist-eve-could-boost-search-for-new-drugs
40. B. Kégl Data driven generation
40
Hypothesis generation is crucial
and, at the same time,
not covered by the scientific
method
This ignorance has already bitten
us; with the appearance of the
robot scientist, facing it
becomes unavoidable
41. B. Kégl Data driven generation
• Come up with a hypothesis
• Design an experiment to exclude it
• Use a statistical test to show that the data is unlikely
to be generated by a world in which the hypothesis does
not hold (“background”)
41
THE SCIENTIFIC METHOD IN THE
TRENCHES
42. B. Kégl Data driven generation
• Rutherford: “If your experiment needs statistics, you ought to
have done a better experiment”
• Without statistics, science would be over
• we have run out of slam-dunk, infinite-significance (“background-free”)
hypotheses
• phenomena are inherently noisy: nobody has seen or will ever
see a Higgs boson
42
THE SCIENTIFIC METHOD IN THE
TRENCHES
43. B. Kégl Data driven generation
43
THE P-VALUE CONTROVERSY
“My position when I wrote “Thinking, Fast and Slow” was
that if a large body of evidence published in reputable
journals supports an initially implausible conclusion, then
scientific norms require us to believe that conclusion.
Implausibility is not sufficient to justify disbelief, and belief in
well-supported scientific conclusions is not optional. This
position still seems reasonable to me — it is why I think
people should believe in climate change. But the
argument only holds when all relevant results
are published.”
Daniel Kahneman
2002 Nobel Memorial Prize in Economic Sciences
45. B. Kégl Data driven generation
45
THE P-VALUE CONTROVERSY
But the main problem is a tautology:
if none of your hypotheses are true,
all your positives are false.
But of course: if all your hypotheses are true,
you are not exploring.
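The tautology shows up in a two-line simulation: when every tested hypothesis is null, a test at level α still fires at rate α, and every one of those positives is false. The numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, n = 1000, 30
# every experiment samples pure noise: none of the hypotheses is true
data = rng.normal(0.0, 1.0, size=(n_tests, n))
# one-sample t statistic per experiment
t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))
positives = np.abs(t) > 2.045   # two-sided t threshold, alpha ≈ 0.05, 29 dof
print(positives.mean())         # close to 0.05 -- and all of them false
```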
46. B. Kégl Data driven generation
• Register all experiments and publish negatives
• Don’t do underpowered experiments
• Put the significance bar high enough
• Test only “plausible” hypotheses
46
GUIDELINES
47. B. Kégl Data driven generation
• What is a plausible but non-trivial hypothesis?
• How to measure plausibility?
• How to generate them (automatically)?
• How are hypotheses related to prior/current
knowledge?
47
QUESTIONS
48. B. Kégl Data driven generation
48
GENERATIVE MODELS IN ML
Interesting tools but it’s a whole new
ballgame and paradigmatically we are in
the dark
49. B. Kégl Data driven generation
• Feed a set of known objects
to an algorithm
• Ask it to generate similar
objects
• But different from the
training set
49
GENERATIVE MODELS IN ML
50. B. Kégl Data driven generation
• The current likelihood-based
paradigm is fundamentally flawed
• The trivial sampling of the
training set needs to be excluded
by heuristics
• The value of novelty is not
even raised as a question
50
GENERATIVE MODELS IN ML
51. B. Kégl Data driven generation
51
Our goal was to generate new objects
from new types, grounded in the
knowledge learned from examples
“Plausible” characters that could be
part of an alphabet in another
universe
52. B. Kégl Data driven generation
52
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. New types?
learning
generation
53. B. Kégl Data driven generation
53
THE UNKNOWN HAS A STRUCTURE
Selected semi-manually:
t-SNE + clustering
54. B. Kégl Data driven generation
54
HOW TO EVALUATE THE CAPACITY OF
THESE MODELS TO GENERATE NEW TYPES?
Idea: validate on hold-out types
Train on known types,
test on types known to the
experimenter
but unknown to the model
55. B. Kégl Data driven generation
55
Train on digits,
test on letters
56. B. Kégl Data driven generation
56
Train on all music up to the Beatles,
test on Sex Pistols
57. B. Kégl Data driven generation
57
Train on all phones up to 2006,
test on the iPhone
58. B. Kégl Data driven generation
58
Train on all scientific knowledge up to
Einstein,
test on relativity theory
59. B. Kégl Data driven generation
59
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. New types?
learning
generation
60. B. Kégl Data driven generation
60
CAN WE GENERATE NEW TYPES?
Existing objects of known types.
generative model
New objects. Are some of those letters?
learning
generation
61. B. Kégl Data driven generation
61
CAN WE GENERATE NEW TYPES?
Are some of those letters?
This we know how to do.
62. B. Kégl Data driven generation
62
THE EVALUATOR MODEL
Train a good discriminator on digits + letters
10 + 26 = 36 classes
discriminator
learning
63. B. Kégl Data driven generation
63
COUNT THE NUMBER OF LETTERS
[Figure: the discriminator is used to count letters among generated samples; examples ranked from low to high letter count.]
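With the 36-class discriminator in hand, the letter count is just an argmax over the posterior. A sketch, in which random Dirichlet vectors stand in for real discriminator outputs:

```python
import numpy as np

N_DIGITS, N_LETTERS = 10, 26

def letter_count(posteriors):
    # a generated sample counts as a letter if its most probable
    # class among the 36 is one of the 26 letter classes
    preds = np.argmax(posteriors, axis=1)
    return int(np.sum(preds >= N_DIGITS))

rng = np.random.default_rng(0)
fake = rng.dirichlet(np.ones(N_DIGITS + N_LETTERS), size=100)
print(letter_count(fake))  # roughly 26/36 of the 100 samples, by chance
```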
65
OBJECTNESS = POSTERIOR ENTROPY
[Figure: generated samples ranked by objectness from high to low; low-objectness samples are discarded as noise.]
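Objectness here is measured through the entropy of the discriminator's posterior: a peaked posterior means the sample confidently looks like some object, a flat one means noise. A minimal sketch, assuming objectness is defined as negative entropy so that higher is better:

```python
import numpy as np

def objectness(p, eps=1e-12):
    # negative Shannon entropy of the 36-class posterior:
    # peaked posterior -> high objectness, flat posterior -> low
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p + eps)))

peaked = np.array([0.965] + [0.001] * 35)  # confidently some class
flat = np.full(36, 1 / 36)                 # looks like nothing: noise
# peaked scores higher, so flat, noise-like samples can be discarded
print(objectness(peaked) > objectness(flat))
```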
67. B. Kégl Data driven generation
67
COMBINING THE TWO OBJECTIVES
[Figure: generated samples arranged along two axes, objectness (low → high) and letter count (low → high); the best candidates score high on both.]
68. B. Kégl Data driven generation
68
PANGRAMS
hand-picked letters
top models found automatically
69. B. Kégl Data driven generation
69
SOME WRITTEN STUFF
http://openreview.net/forum?id=ByEPMj5el
https://arxiv.org/abs/1606.04345
https://medium.com/@balazskegl/the-epistemological-challenges-of-automating-a-b-testing-or-how-will-ai-do-science-b724f8217811#.q041gyvkt