This document discusses the history and recent developments in artificial intelligence and deep learning. It covers early work in neural networks from the 1950s through the 1990s, including perceptrons, autoencoders, and connectionism. More recent progress is attributed to greater computing power, larger datasets, and the development of automatic differentiation techniques. Applications discussed include computer vision, natural language processing using word embeddings, and recurrent neural networks for tasks like handwriting generation.
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Ian Morgan
Professor Steve Roberts, Machine learning research group and Oxford-Man Institute + Alan Turing Institute. Steve gave this talk on the 24th January at the London Bayes Nets meetup.
[PR12] understanding deep learning requires rethinking generalizationJaeJun Yoo
The document discusses a paper that argues traditional theories of generalization may not fully explain why large neural networks generalize well in practice. It summarizes the paper's key points:
1) The paper shows neural networks can easily fit random labels, calling into question traditional measures of complexity.
2) Regularization helps but is not the fundamental reason for generalization. Neural networks have sufficient capacity to memorize data.
3) Implicit biases in algorithms like SGD may better explain generalization by driving solutions toward minimum norm.
4) The paper suggests rethinking generalization as the effective capacity of neural networks may differ from theoretical measures. Understanding finite sample expressivity is important.
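The random-label experiment summarized above is easy to reproduce in miniature: any model with enough capacity to memorize, here a 1-nearest-neighbour classifier standing in for an over-parameterized network, reaches perfect training accuracy on labels that are pure noise, so training error alone cannot explain generalization. A minimal sketch (the toy data and the choice of 1-NN are illustrative, not from the paper):

```python
import random

def nearest_neighbour_predict(train_x, train_y, x):
    """Predict the label of the closest training point (1-NN)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

random.seed(0)
# Distinct inputs with completely random labels -- there is no signal to learn.
train_x = [float(i) for i in range(100)]
train_y = [random.randint(0, 1) for _ in train_x]

# A memorizing model still achieves perfect *training* accuracy ...
train_acc = sum(
    nearest_neighbour_predict(train_x, train_y, x) == y
    for x, y in zip(train_x, train_y)
) / len(train_x)

# ... while accuracy against fresh random labels is only chance level.
test_y = [random.randint(0, 1) for _ in train_x]
test_acc = sum(
    nearest_neighbour_predict(train_x, train_y, x) == y
    for x, y in zip(train_x, test_y)
) / len(train_x)
```

The gap between the two numbers is the point: fitting the training set tells us nothing here, which is exactly why the paper argues complexity measures based on fitting capacity need rethinking.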
This paper proposes AmbientGAN, which trains a generative adversarial network using partial or noisy observations rather than fully observed samples. AmbientGAN trains the discriminator on the measurement domain rather than the raw data domain, allowing the generator to be trained without needing large amounts of good training data. The paper proves it is theoretically possible to recover the original data distribution even when the measurement process is not invertible. It presents experimental results showing AmbientGAN can generate high quality samples and recover the underlying data distribution from various types of lossy and noisy measurements.
Introduction to Interpretable Machine LearningNguyen Giang
This document discusses interpretable machine learning and explainable AI. It begins with definitions of key terms and an overview of interpretable methods. Deep learning models are often treated as "black boxes" that are difficult to interpret. Interpretability can be achieved by using inherently interpretable models like linear models or decision trees, adding attention mechanisms, or interpreting models before, during or after building them. Later sections discuss specific interpretable techniques like understanding data through examples, MMD-Critic for learning prototypes and criticisms, and visualizing convolutional neural networks to understand predictions. The document emphasizes the importance of interpretability and explains several approaches to make machine learning models more transparent to humans.
A curated list of GAN variants that provided insights to the community (GANs, Improved GANs, DCGAN, Unrolled GAN, InfoGAN, f-GAN, EBGAN, WGAN).
After a short introduction to GANs, we look through the remaining difficulties of standard GANs and their temporary solutions (Improved GANs). Following the slides, we see other solutions that tried to resolve these problems in various ways, e.g. careful architecture selection (DCGAN), a slight change in the update (Unrolled GAN), an additional constraint (InfoGAN), generalization of the loss function using various divergences (f-GAN), a new framework based on energy-based models (EBGAN), and another step in generalizing the loss function (WGAN).
Deep generative models can generate synthetic images, speech, text and other data types. There are three popular types: autoregressive models which generate data step-by-step; variational autoencoders which learn the distribution of latent variables to generate data; and generative adversarial networks which train a generator and discriminator in an adversarial game to generate high quality samples. Generative models have applications in image generation, translation between domains, and simulation.
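The autoregressive idea mentioned above, generating data one step at a time with each step conditioned on what came before, can be illustrated with a toy character-level bigram model; the tiny corpus and start/end markers are illustrative choices, not any particular published model:

```python
import random
from collections import defaultdict

def fit_bigram(corpus):
    """Count next-character frequencies, i.e. estimate p(next | current)."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus:
        token = "^" + word + "$"          # "^" = start marker, "$" = end marker
        for a, b in zip(token, token[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, rng, max_len=20):
    """Autoregressive sampling: each character is drawn conditioned on the previous one."""
    out, cur = [], "^"
    for _ in range(max_len):
        choices = counts[cur]
        chars = list(choices)
        weights = [choices[c] for c in chars]
        cur = rng.choices(chars, weights=weights)[0]
        if cur == "$":                    # end-of-word marker sampled: stop
            break
        out.append(cur)
    return "".join(out)

corpus = ["deep", "learning", "generative", "model", "data"]
counts = fit_bigram(corpus)
rng = random.Random(0)
word = sample(counts, rng)
```

Deep autoregressive models replace the count table with a neural network, but the generation loop is the same step-by-step conditioning shown here.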
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Recommendation system using collaborative deep learningRitesh Sawant
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach, tightly coupling the two components that learn from the two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
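CDL couples a deep content model with the collaborative-filtering component; the CF side alone can be illustrated by plain matrix factorization on a sparse ratings matrix, the baseline whose sparsity problem the abstract describes. This is a sketch of that baseline, not of CDL itself, and the tiny ratings data and hyperparameters are made up for illustration:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Matrix factorization via SGD: approximate r_ui by the dot product p_u . q_i.
    `ratings` is a sparse list of observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Tiny sparse ratings matrix: 4 users x 3 items, only 8 of 12 entries observed.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 4), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
P, Q = factorize(ratings, n_users=4, n_items=3)
rmse = (sum((r - sum(P[u][f] * Q[i][f] for f in range(2))) ** 2
            for u, i, r in ratings) / len(ratings)) ** 0.5
```

CDL's contribution is to tie the item factors `Q` to a deep representation of item content, so that items with few or no ratings still get informative factors.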
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Deep neural networks have achieved outstanding results in various applications such as vision, language, audio, speech, and reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. Over the last year, multiple solutions have been proposed to alleviate this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
Machine Learning: Generative and Discriminative Modelsbutest
The document discusses machine learning models, specifically generative and discriminative models. It provides examples of generative models like Naive Bayes classifiers and hidden Markov models. Discriminative models discussed include logistic regression and conditional random fields. The document contrasts how generative models estimate class-conditional probabilities while discriminative models directly estimate posterior probabilities. It also compares how hidden Markov models model sequential data generatively while conditional random fields model sequential data discriminatively.
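The generative/discriminative contrast above fits in one formula: a generative model estimates the class prior p(y) and the class-conditional p(x|y) and applies Bayes' rule, whereas a discriminative model fits p(y|x) directly. A minimal Bernoulli Naive Bayes sketch of the generative route, with made-up toy features for illustration:

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """Estimate class priors p(y) and per-feature likelihoods p(x_j = 1 | y)
    with Laplace smoothing -- the generative quantities."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for x, y in examples:
        class_counts[y] += 1
        for j, v in enumerate(x):
            feat_counts[y][j] += v
    n = len(examples)
    priors = {y: c / n for y, c in class_counts.items()}
    likelihoods = {
        y: [(feat_counts[y][j] + 1) / (class_counts[y] + 2)
            for j in range(len(examples[0][0]))]
        for y in class_counts
    }
    return priors, likelihoods

def posterior(priors, likelihoods, x):
    """Bayes' rule: p(y | x) is proportional to p(y) * prod_j p(x_j | y)."""
    scores = {}
    for y in priors:
        p = priors[y]
        for j, v in enumerate(x):
            pj = likelihoods[y][j]
            p *= pj if v else (1 - pj)
        scores[y] = p
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Toy binary features: [contains "free", contains "meeting"]
data = [([1, 0], "spam"), ([1, 0], "spam"), ([1, 1], "spam"),
        ([0, 1], "ham"), ([0, 1], "ham"), ([0, 0], "ham")]
priors, likelihoods = train_naive_bayes(data)
post = posterior(priors, likelihoods, [1, 0])
```

A discriminative model such as logistic regression would skip the class-conditional estimates entirely and optimize the posterior directly, which is the trade-off the document contrasts.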
Deep Learning and Reinforcement Learning summer schools summary
26th June-6th July 2017, Montreal, Quebec
Things I learned. What was your favourite lesson?
Deep Learning: concepts and use cases (October 2018)Julien SIMON
An introduction to Deep Learning theory
Neurons & Neural Networks
The Training Process
Backpropagation
Optimizers
Common network architectures and use cases
Convolutional Neural Networks
Recurrent Neural Networks
Long Short Term Memory Networks
Generative Adversarial Networks
Getting started
https://telecombcn-dl.github.io/2018-dlai/
https://mcv-m6-video.github.io/deepvideo-2019/
This lecture provides an overview of how the temporal information encoded in video sequences can be exploited to learn visual features from a self-supervised perspective. Self-supervised learning is a type of unsupervised learning in which the data itself provides the necessary supervision to estimate the parameters of a machine learning algorithm.
Master in Computer Vision Barcelona 2019.
http://pagines.uab.cat/mcv/
These slides summarize the main trends in deep neural networks for video encoding, including single-frame models, spatiotemporal convolutions, long-term sequence modeling with RNNs, and their combination with optical flow.
Model-based reinforcement learning techniques were presented that use learned models to improve upon model-free deep reinforcement learning. Several papers augmented deep networks with model-based components like planners or simulators to leverage predictions and reduce sample complexity. Techniques included using model rollouts to augment state representations, learning abstract state representations to simplify value prediction, and optimizing policies on ensemble models. While model-based methods show promise in addressing deep RL limitations, challenges remain in learning accurate models and developing policies robust to model errors.
The transformer is the neural architecture that has received the most attention in the early 2020s. It removed the recurrence of RNNs, replacing it with an attention mechanism across the input and output tokens of a sequence (cross-attention) and between the tokens composing the input (and output) sequences, named self-attention.
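The self-attention mechanism described above is compact enough to write out: each output token is a weighted average of all value vectors, with weights given by softmax(QK^T / sqrt(d)). A minimal sketch with toy 2-d token vectors (real transformers add learned projections, multiple heads, and positional information, all omitted here):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of d-dimensional token vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Each output token is a convex combination of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Self-attention: queries, keys and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
```

Cross-attention uses the same function but takes `Q` from one sequence (e.g. the decoder) and `K`, `V` from another (the encoder).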
"You Can Do It" by Louis Monier (Altavista Co-Founder & CTO) & Gregory Renard (CTO & Artificial Intelligence Lead Architect at Xbrain) for Deep Learning keynote #0 at Holberton School (http://www.meetup.com/Holberton-School/events/228364522/)
If you want to attend similar keynotes for free, check out http://www.meetup.com/Holberton-School/
[DL Reading Group] Physion: Evaluating Physical Prediction from Vision in Humans and Mach...Deep Learning JP
This document summarizes a research paper that proposes a new dataset called Physion for evaluating how well machine learning models can predict physical interactions from vision, similar to humans. The dataset contains videos of common physical phenomena. Several state-of-the-art models were evaluated on the dataset, including particle-based simulators and vision-based models. Particle-based simulators achieved performance on par with humans, while vision-based models performed poorly. The document provides background on the motivation for the dataset and describes the different models and their approaches.
“Automatically learning multiple levels of representations of the underlying distribution of the data to be modelled”
Deep learning algorithms have shown superior learning and classification performance in areas such as transfer learning, speech and handwritten character recognition, and face recognition, among others.
(I have referred to many articles and experimental results provided by Stanford University.)
https://mcv-m6-video.github.io/deepvideo-2019/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...John Mathon
AI has gone through a number of mini boom-and-bust periods. The current one may be short-lived as well, but I have reasons to think AI is finally making sustained progress that will see its way into mainstream technology.
Scene Description From Images To SentencesIRJET Journal
This document presents an approach for generating sentences to describe images using distributed intelligence. It involves detecting objects in images using YOLO detection, finding relative positions of objects, labeling background scenes, generating tuples of objects/scenes/relations, extracting candidate sentences from Wikipedia containing tuple elements, searching images for each sentence and selecting the sentence whose images most closely match the input image. The approach is compared to the Babytalk model using BLEU and ROUGE scores, showing comparable performance. Future work to improve object detection and use larger knowledge sources is discussed.
The document discusses image captioning using deep neural networks. It begins by providing examples of how humans can easily describe images but generating image captions with a computer program was previously very difficult. Recent advances in deep learning, specifically using convolutional neural networks (CNNs) to recognize objects in images and recurrent neural networks (RNNs) to generate captions, have enabled automated image captioning. The document discusses CNN and RNN architectures for image captioning and provides examples of pre-trained models that can be used, such as VGG-16.
Chatbots are growing in popularity as developers face the limitations of the mobile app. User interfaces that simulate a human conversation have a history going back to the late 18th century. I'll take you on a tour of that history with an eye on finding insights into what is possible today and in the near future with chatbots. Issues covered: Amazon Alexa, Facebook Messenger chatbots, Alan Turing, and much more.
The document discusses generative models and their applications in artificial intelligence. Generative adversarial networks (GANs) use two neural networks, a generator and discriminator, that compete against each other. The generator learns to generate new data that looks real by fooling the discriminator, while the discriminator learns to better identify real from fake data. GANs have been used for tasks like image generation and neural style transfer. They show potential to generate art, music and other creative forms through machine learning.
2. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
6. Logical Intelligence
1968 Risch’s algorithm for integration in calculus
1972 Prolog for general logical reasoning
1997 Deep Blue defeats Kasparov
7. Other forms of intelligence
But is this getting us to where we’d like to be?
Selfridge-Shannon film clip
Speech Recognition
Visual Processing
Natural Language modelling
Planning and decision in uncertain environments
Perhaps a different approach would be useful.
8. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
9. Astonishing Hypothesis: Crick
“A person’s mental activities are entirely due to the behaviour of nerve
cells and the molecules that make them up and influence them.”
12. Information Processing in Brains
[Figure: neurons map the real world through feature layers (Layer 1, Layer 2) up to high-level concepts]
Hierarchical; Modular; Binary; Parallel; Noisy
13. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
17. Connectionism
1960 Realised a perceptron can only solve simple tasks.
1970 Decline in interest.
1980 New computing power made training multilayer networks feasible.
[Figure: feed-forward network from inputs to output]
Each node (or ‘neuron’) computes a function of a weighted combination of its
parental nodes: h_j = σ(Σ_i w_ij h_i)
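As a sketch of this node computation (the weights, layer sizes and sigmoid choice here are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic activation: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def layer(h_parents, W):
    """Each unit j computes h_j = sigma(sum_i W[i, j] * h_i)."""
    return sigmoid(h_parents @ W)

# Two-layer feed-forward pass on a toy input (illustrative values).
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
W1 = rng.standard_normal((3, 4))   # inputs -> hidden
W2 = rng.standard_normal((4, 1))   # hidden -> output
output = layer(layer(x, W1), W2)
print(output.shape)  # (1,)
```

Stacking `layer` calls like this is all a multilayer feed-forward net is; training is then a matter of choosing the weight matrices.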
18. Neural Networks and Deep Learning
Historical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima?).
Particularly difficult to train a NN with a large number of layers (say larger
than around 10).
‘Gradient Diffusion Problem’ – difficult to assign responsibility of errors to
individual ‘neurons’.
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs.
More principled and computationally better understood techniques (SVMs
and related convex methods) replaced them.
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is not
sufficient for all AI purposes.
Uncertainty and reasoning are not naturally representable using standard
feed-forward nets.
Explosion in more ‘symbolic’ Bayesian AI.
19. Deep Learning
NNs have resurged in interest in the last few years (Hinton, Bengio, . . . )
Also called ‘deep learning’.
Sense that very complex tasks (object recognition, learning complex structure
in data) requires going beyond simple (convex) statistical techniques.
The brain uses hierarchical distributed processing and it is likely to be for a
good reason.
Many problems have a hierarchical structure: images are made of parts;
language is hierarchical, etc.
Why now?
New computing resources (GPU processing)
Availability of large amounts of data means that we can train nets with many
parameters (10^10).
Recent evidence suggests local optima are not particularly problematic.
20. Autoencoder
[Figure: autoencoder with inputs y1–y5, hidden bottleneck layers, and reconstructed outputs y1–y5]
The bottleneck forces the network to try to find a low dimensional
representation of the data.
Useful for unsupervised learning.
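A minimal sketch of the bottleneck idea, assuming toy data and a linear autoencoder trained by plain gradient descent (all sizes, data and the learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 100 five-dimensional points that really lie on a 2-D subspace.
Y = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 5))

# Linear autoencoder: 5 inputs -> 3-unit bottleneck -> 5 outputs.
W_enc = rng.standard_normal((5, 3)) * 0.1
W_dec = rng.standard_normal((3, 5)) * 0.1

def loss():
    """Mean squared reconstruction error through the bottleneck."""
    return np.mean(((Y @ W_enc) @ W_dec - Y) ** 2)

initial = loss()
lr = 0.01
for _ in range(500):
    H = Y @ W_enc                    # low-dimensional code
    G = 2 * (H @ W_dec - Y) / Y.size # d(loss)/d(reconstruction)
    g_dec = H.T @ G                  # gradient w.r.t. decoder weights
    g_enc = Y.T @ (G @ W_dec.T)      # gradient w.r.t. encoder weights
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
final = loss()
print(initial, final)                # reconstruction error falls with training
```

Because the data lie near a 2-D subspace, the 3-unit bottleneck can represent them almost losslessly; with fewer bottleneck units than the data's intrinsic dimension, the error would stay high.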
21. Autoencoder on MNIST digits (Hinton 2006 Science)
Figure : Reconstructions using H = 30 components. From the Top: Original image,
Autoencoder1, Autoencoder2, PCA
60,000 training images (28 × 28 = 784 pixels).
Use a form of autoencoder to find a lower (30) dimensional representation.
At the time, the special layerwise training procedure was considered
fundamental to the success of this approach. Now not deemed necessary,
provided we use a sensible initialisation.
22. Google Cats
10 million YouTube video frames (200×200 pixel images).
Use a specialised autoencoder with 9 layers (1 billion weights).
2000 computers + two weeks of computing.
Examine units to see what images they most respond to.
24. Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond not to macro features (such as bicycles)
but to micro features.
For example, in handwritten digit recognition they correspond to small
constituent parts of the digits.
These are used then to process the image into a representation that is better
for recognition.
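The feature-map computation itself is just a small filter slid over the image; a minimal sketch with a hypothetical vertical-edge ‘micro feature’ detector:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation, the operation behind CNN feature maps."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# A vertical-edge filter on a toy image: left half dark, right half bright.
image = np.zeros((5, 5))
image[:, 2:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]])  # responds where brightness jumps
fmap = conv2d(image, edge_kernel)
print(fmap)  # each row reads [0, 1, 0, 0]: the filter fires only at the edge
```

A CNN learns many such small filters and reuses each one across the whole image, which is why its feature maps tend to pick out small constituent parts rather than whole objects.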
25. NNs in NLP
Bag of Words
We have D words in a dictionary, aardvark,. . .,zorro so that we can relate
each word with its dictionary index.
We can also think of this as a Euclidean embedding e:
aardvark → e_aardvark = (1, 0, . . . , 0)ᵀ, zorro → e_zorro = (0, 0, . . . , 1)ᵀ
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) v
that are learned.
Objective is, for example, next word prediction accuracy.
These are often called ‘neural language models’.
26. NNs in NLP
Each word w in the dictionary has an associated embedding vector vw.
Usually around 200 dimensional vectors are used.
Consider the sentence:
the cat sat on the mat
and that we wish to predict the word on given the two preceding words cat sat
and the two succeeding words the mat
We can use a network that has inputs vcat, vsat, vthe, vmat
The output of the network is a probability over all words in the dictionary
p(w| {vinputs}).
We want p(w = on|vcat, vsat, vthe, vmat) to be high.
The overall objective is then to learn all the word embeddings and network
parameters subject to predicting the word correctly based on the context.
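The training objective above can be sketched end-to-end on a toy vocabulary. Everything here (vocabulary, dimensions, learning rate) is illustrative, and the single-layer softmax network is a simplification of the neural language models the slides describe:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                      # toy sizes; the slides mention ~200 dims

E = rng.standard_normal((V, D)) * 0.1     # word embeddings v_w (to be learned)
W = rng.standard_normal((4 * D, V)) * 0.1 # concatenated context -> word scores

context = [idx[w] for w in ["cat", "sat", "the", "mat"]]
target = idx["on"]                        # want p(on | cat, sat, the, mat) high

def probs():
    h = E[context].reshape(-1)            # concatenate the four context vectors
    s = h @ W
    e = np.exp(s - s.max())
    return e / e.sum()                    # p(w | context) over the vocabulary

p0 = probs()[target]
lr = 0.5
for _ in range(100):
    h = E[context].reshape(-1)
    g = probs()
    g[target] -= 1.0                      # gradient of -log p(target) w.r.t. scores
    gh = W @ g                            # gradient w.r.t. the concatenated context
    W -= lr * np.outer(h, g)
    for k, wi in enumerate(context):      # update each context word's embedding
        E[wi] -= lr * gh[k * D:(k + 1) * D]
p1 = probs()[target]
print(p0, p1)                             # p(on | context) rises with training
```

Training over a large corpus rather than one sentence is what gives the learned embeddings their semantic structure.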
27. Word Embeddings
Given a word (France, for example) we can find which words w have embedding
vectors closest to vFrance. From Ronan Collobert (2011).
28. Word Embeddings
There appears to be a natural ‘geometry’ to the embeddings. For example, there
are directions that correspond to gender.
vwoman − vman ≈ vaunt − vuncle
vwoman − vman ≈ vqueen − vking
From Mikolov (2013).
29. Word Embeddings: Analogies
Given a relationship, France-Paris, we get the ‘relationship’ embedding
v = vParis − vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary which
has closest embedding to this (it turns out to be Rome!). From Mikolov (2013).
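The analogy lookup can be illustrated with hand-built toy vectors, laid out so that country → capital is an exact fixed offset; real embeddings (Mikolov 2013) only approximately satisfy these relations:

```python
import numpy as np

# Toy 2-D embeddings (hand-built for illustration).
emb = {
    "France": np.array([1.0, 0.0]), "Paris":  np.array([1.0, 1.0]),
    "Italy":  np.array([2.0, 0.0]), "Rome":   np.array([2.0, 1.0]),
    "Spain":  np.array([3.0, 0.0]), "Madrid": np.array([3.0, 1.0]),
}

def nearest(v, exclude):
    """Word whose embedding is closest (Euclidean) to v."""
    words = [w for w in emb if w not in exclude]
    return min(words, key=lambda w: np.linalg.norm(emb[w] - v))

rel = emb["Paris"] - emb["France"]        # the 'capital-of' direction
answer = nearest(emb["Italy"] + rel, exclude={"Italy"})
print(answer)  # Rome
```

Excluding the query word itself from the search is the usual convention, since it is often the nearest vector to the shifted point.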
30. Word Embeddings: Constrained Embeddings
We can learn embeddings for English words and embeddings for Chinese
words.
However, when we know that a Chinese and English word have a similar
meaning, we add a constraint that the word embeddings vChineseWord and
vEnglishWord should be close.
We have only a small amount of labelled ‘similar’ Chinese-English words
(these are the green border boxes in the above; they are standard translations
of the corresponding Chinese character).
We can visualise the embedding vectors in 2D (using t-SNE). See Socher
(2013).
32. Recursive Nets and Embeddings
Stanford Sentiment Treebank. Consists of parsed sentences with sentiment labels
(−−, −, 0, +, ++) for each node (phrase) in the tree. 215,000 labelled phrases
(obtained from three humans).
33. Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predict
the sentiment at each node.
34. Recursive Nets and Embeddings
Training
We have a softmax classifier for each node in the tree, to predict the
sentiment of the phrase beneath this node in the tree.
The weights of this classifier are shared across all nodes.
At the leaf nodes at the bottom of the tree, the inputs to the classifiers are
the word embeddings.
The embeddings are combined by another network g with common
parameters, which forms the input to the sentiment classifier.
We then learn all the embeddings, shared classifier parameters and shared
combination parameters to maximise the classification accuracy.
Prediction
For a new movie review, the review is first parsed using a standard grammar
tree parser.
This forms the tree which can be used to recursively form the sentiment class
label for the review.
Currently the best sentiment classifier. Socher (2013)
36. Recurrent Nets
[Figure: RNN unrolled through time; inputs x1–x3, hidden units h1–h3, outputs y1–y3, with shared weight matrices A, B, C]
RNNs are used in timeseries applications
The basic idea is that the hidden units at time ht (and possibly output yt)
depend on the previous state of the network ht−1, xt−1, yt−1 for inputs xt and
outputs yt.
In the above network, I ‘unrolled the net through time’ to give a standard NN
diagram.
I omitted the potential links from xt−1, yt−1 to ht.
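A minimal sketch of the unrolled computation, with shared matrices named A, B, C after the diagram (their roles and all sizes here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out, T = 3, 4, 2, 5             # illustrative sizes

A = rng.standard_normal((D_in, D_h)) * 0.5   # input  -> hidden (shared)
B = rng.standard_normal((D_h, D_h)) * 0.5    # hidden -> hidden (shared)
C = rng.standard_normal((D_h, D_out)) * 0.5  # hidden -> output (shared)

def rnn_forward(xs):
    """Unroll through time: h_t depends on x_t and the previous h_{t-1}."""
    h = np.zeros(D_h)
    ys = []
    for x_t in xs:
        h = np.tanh(x_t @ A + h @ B)   # new hidden state carries the memory
        ys.append(h @ C)               # output at time t
    return np.array(ys)

xs = rng.standard_normal((T, D_in))
ys = rnn_forward(xs)
print(ys.shape)  # (5, 2)
```

Because h_t is fed forward, perturbing an early input changes every later output: that recurrence is the network's memory.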
38. Handwriting Generation using a RNN
Some generated examples. Top line is real handwriting, for comparison. See Alex
Graves’ work.
39. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
40. Reasons research in deep learning has exploded
Much greater compute power. (GPU)
Much larger datasets.
AutoDiff.
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine
accuracy) for the gradient
g_i(x) ≡ ∂f(x)/∂x_i
Note that this is not the same as a numerical approximation (such as central
differences) for the gradient.
One can show that, if done efficiently, one can always calculate the gradient in
less than 5 times the time it takes to compute f(x).
41. Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x
is given by the sum over all path values from x to f, where each path value is the
product of the partial derivatives of the functions on the edges:
df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)

[Figure: graph with nodes x, g, f and edges labelled ∂f/∂x, dg/dx, ∂f/∂g]
Example
For f(x) = x^2 + xgh, where g = x^2 and h = xg^2:

[Figure: graph with nodes x, g, h, f; edges labelled 2x + gh, 2x, xh, 2gx, xg, g^2]

f′(x) = (2x + gh) + (g^2 · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x^7
42. Reverse Differentiation
Consider
f(x1, x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST):
[Figure: Abstract Syntax Tree with leaves x1, x2 feeding f1, then f2, then f3]
f1(x1, x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1, x2, we first run forwards through the tree so that we can
associate each node with an actual function value.
44. Reverse Differentiation
[Figure: the same AST, now annotated with local derivatives]
∂f1/∂x1 = x2,  ∂f1/∂x2 = x1,  ∂f2/∂f1 = cos(f1),  ∂f3/∂f2 = − sin(f2)
1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n).
   Then define
   t_n = Σ_{c ∈ ch(n)} (∂f_c/∂f_n) t_c
4. The total derivatives of f with respect to the root nodes of the tree (here
   x1 and x2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine
to efficiently compute the gradient. It is efficient because information is collected
at nodes in the tree and split between parents only when required.
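Applied to f(x1, x2) = cos(sin(x1·x2)), the forward sweep and the reverse schedule above can be written out by hand and checked against the closed-form derivative:

```python
import math

def grad_f(x1, x2):
    """Reverse-mode differentiation of f(x1, x2) = cos(sin(x1 * x2))."""
    # Forward sweep: evaluate every node in the AST.
    f1 = x1 * x2
    f2 = math.sin(f1)
    f3 = math.cos(f2)
    # Backward sweep over the reverse schedule (f3, f2, f1, x1, x2):
    t_f3 = 1.0                         # seed the output node with t = 1
    t_f2 = -math.sin(f2) * t_f3        # df3/df2 = -sin(f2)
    t_f1 = math.cos(f1) * t_f2         # df2/df1 =  cos(f1)
    t_x1 = x2 * t_f1                   # df1/dx1 = x2
    t_x2 = x1 * t_f1                   # df1/dx2 = x1
    return f3, (t_x1, t_x2)

x1, x2 = 0.7, 1.9
val, (g1, g2) = grad_f(x1, x2)
# Closed form: df/dx1 = -sin(sin(x1*x2)) * cos(x1*x2) * x2
exact = -math.sin(math.sin(x1 * x2)) * math.cos(x1 * x2) * x2
print(g1, exact)                       # identical up to rounding
```

Note that one forward sweep and one backward sweep yield the derivatives with respect to both inputs at once, which is the source of reverse mode's efficiency.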
45. Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence.
Solving chess problems is another and requires complex reasoning using some
form of internal model.
The world is noisy and information may be conflicting.
Recognised that new approaches are required.
46. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
47. Limitations of forward reasoning
World Representation
Models help us to fantasise about the world.
58. Stubby Fingers: errors
[Figure: 26×26 heatmap over letters a–z; colour scale 0.05–0.55]
59. Stubby Fingers: language
[Figure: 26×26 heatmap over letters a–z; colour scale 0–0.9]
60. Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that this
corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
61. Speech Recognition: raw signal
[Figure: raw speech signal, amplitude (−0.2 to 0.3) against time (0–0.9 s)]
64. Medical Diagnosis
[Figure: belief network with diseases (tumour, flu, meningitis) as parents of
symptoms/tests (headache, fever, appetite, x-ray)]
Combine known medical knowledge with patient specific information.
65. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
66. Probability
Why Probability?
Probability is a logical calculus of uncertainty.
Natural framework to use in models of physical systems, such as the Ising
Model (1920) and in AI applications, such as the HMM (Baum 1966,
Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron
spins, neurons, customers, etc. ).
Typically the representational and computational cost of probabilistic models
grows exponentially with the number of objects represented.
Without introducing strong structural limitations about how these objects can
interact, probability is a non-starter.
For this reason, computationally ‘simpler’ alternatives (such as fuzzy logic)
were introduced to try to avoid some of these difficulties – however, these are
typically frowned upon by purists.
67. Graphical Models
We can use graphs to represent how objects can probabilistically interact with
each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability
distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the
structure of the graph.
Graphical Models are now used as a standard framework in Engineering,
Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty and are
therefore widely applicable.
68. Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games
(the world’s largest graphical model?!).
Hospitals use Belief Nets to encode knowledge about diseases and symptoms
to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising,
video game prediction, speech recognition.
Used to estimate inherent desirability of products in consumer retail.
Microsoft and others: Attempt to go beyond simple A/B testing by using
Graphical Models to model the whole company/user relationship.
69. Conditional Probability and Bayes’ Rule
The probability of event x conditioned on knowing event y (or more shortly, the
probability of x given y) is defined as
p(x|y) ≡ p(x, y)/p(y) = p(y|x)p(x)/p(y)   (Bayes’ rule)
Throwing darts
p(region 5|not region 20) = p(region 5, not region 20)/p(not region 20)
= p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19
Interpretation
p(A = a|B = b) should not be interpreted as ‘Given the event B = b has occurred,
p(A = a|B = b) is the probability of the event A = a occurring’. The correct
interpretation should be ‘p(A = a|B = b) is the probability of A being in state a
under the constraint that B is in state b’.
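The darts calculation as code, using exact rational arithmetic (the twenty equally likely regions are the slide's implicit assumption):

```python
from fractions import Fraction

# Twenty equally likely dart regions.
p_region = {r: Fraction(1, 20) for r in range(1, 21)}

# p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
p_joint = p_region[5]           # region 5 already excludes region 20
p_not20 = 1 - p_region[20]
posterior = p_joint / p_not20
print(posterior)  # 1/19
```

Using `Fraction` keeps the arithmetic exact, matching the 1/19 on the slide with no rounding.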
70. Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5
pixels each.
Can be placed anywhere on the 10×10 grid, but cannot overlap.
Let s1 be the origin of ship 1 and s2 the origin of ship 2.
Data D is a collection of query ‘hit’ or ‘miss’ responses.
p(s1, s2|D) = p(D|s1, s2)p(s1, s2)/p(D)
Let X be the matrix of pixel occupancy
p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2)p(s1, s2|D)
demoBattleships.m
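demoBattleships.m is MATLAB and is not reproduced here; the same posterior can be sketched in Python by brute-force enumeration over placements (the query data below is invented for illustration):

```python
import numpy as np

N, L = 10, 5                      # 10x10 grid, ships of length 5

def occupancy(s1, s2):
    """Pixel occupancy X for vertical ship at s1 and horizontal ship at s2."""
    X = np.zeros((N, N), dtype=bool)
    r1, c1 = s1
    X[r1:r1 + L, c1] = True       # vertical ship
    r2, c2 = s2
    X[r2, c2:c2 + L] = True       # horizontal ship
    return X

# All non-overlapping placements under a uniform prior p(s1, s2).
placements = []
for s1 in [(r, c) for r in range(N - L + 1) for c in range(N)]:
    for s2 in [(r, c) for r in range(N) for c in range(N - L + 1)]:
        X = occupancy(s1, s2)
        if X.sum() == 2 * L:      # exactly 10 occupied pixels => no overlap
            placements.append(X)

# Data D: invented query responses ('miss' at (0,0), 'hit' at (4,4)).
D = [((0, 0), False), ((4, 4), True)]

# The likelihood of a placement is 0 or 1, so the posterior is uniform over
# the placements consistent with D; p(X|D) is their average occupancy.
consistent = [X for X in placements if all(X[q] == hit for q, hit in D)]
pX = np.mean(consistent, axis=0)  # marginal pixel occupancy p(pixel occupied|D)
print(pX[4, 4], pX[0, 0])         # 1.0 and 0.0 by construction of D
```

Enumeration is feasible here because there are only a few thousand placements; with more ships one would sum over placements more cleverly or sample.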
71. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
72. Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the
conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional
probabilities:
p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)
[Graph: A → C, B → C, C → D, B → E, C → E]
73. Example – Part I
Sally’s burglar Alarm is sounding. Has she been Burgled, or was the alarm
triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write
p(A, R, E, B) = p(A|R, E, B)p(R, E, B)
= p(A|R, E, B)p(R|E, B)p(E, B)
= p(A|R, E, B)p(R|E, B)p(E|B)p(B)
Assumptions:
The alarm is not directly influenced by any report on the radio,
p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable,
p(R|E, B) = p(R|E)
Burglaries don’t directly ‘cause’ earthquakes, p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)
74. Example – Part II: Specifying the Tables
[Graph: B → A, E → A, E → R]
p(A|B, E):
  Alarm = 1  Burglar  Earthquake
  0.9999     1        1
  0.99       1        0
  0.99       0        1
  0.0001     0        0

p(R|E):
  Radio = 1  Earthquake
  1          1
  0          0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables
and graphical structure fully specify the distribution.
75. Example Part III: Inference
Initial Evidence: The alarm is sounding
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E) / Σ_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E)
≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning:
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she’s been burgled.
However, this probability drops dramatically when she hears that there has
been an earthquake.
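Both numbers can be reproduced by brute-force enumeration over the joint distribution; the tables below are transcribed directly from Part II:

```python
# Burglar-alarm network inference by enumeration.
import itertools

p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_A1_BE = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}
p_R1_E = {1: 1.0, 0: 0.0}

def joint(b, e, a, r):
    # p(A, R, E, B) = p(A|E, B) p(R|E) p(E) p(B)
    pa = p_A1_BE[(b, e)] if a == 1 else 1 - p_A1_BE[(b, e)]
    pr = p_R1_E[e] if r == 1 else 1 - p_R1_E[e]
    return pa * pr * p_B[b] * p_E[e]

# p(B = 1|A = 1): sum out E and R.
num = sum(joint(1, e, 1, r) for e, r in itertools.product([0, 1], repeat=2))
den = sum(joint(b, e, 1, r) for b, e, r in itertools.product([0, 1], repeat=3))
print(num / den)  # ≈ 0.99

# p(B = 1|A = 1, R = 1): the earthquake report 'explains away' the alarm.
num2 = sum(joint(1, e, 1, 1) for e in [0, 1])
den2 = sum(joint(b, e, 1, 1) for b, e in itertools.product([0, 1], repeat=2))
print(num2 / den2)  # ≈ 0.01
```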
76. Markov Models
For timeseries data v1, . . . , vT , we need a model p(v1:T ). For causal consistency, it
is meaningful to consider the decomposition
p(v1:T) = Π_{t=1}^{T} p(vt|v1:t−1)
with the convention p(vt|v1:t−1) = p(v1) for t = 1.
[Graph: cascade v1 → v2 → v3 → v4, with arcs from every earlier variable to each vt]
Independence assumptions
It is often natural to assume that the influence of the immediate past is more
relevant than the remote past and in Markov models only a limited number of
previous observations are required to predict the future.
77. Markov Chain
Only the recent past is relevant:
p(vt|v1, . . . , vt−1) = p(vt|vt−L, . . . , vt−1)
where L ≥ 1 is the order of the Markov chain
p(v1:T ) = p(v1)p(v2|v1)p(v3|v2) . . . p(vT |vT −1)
For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are
time-independent (‘homogeneous’).
Figure : (a): First order Markov chain, v1 → v2 → v3 → v4. (b): Second order
Markov chain, with additional arcs from vt−2 to vt.
78. Markov Chains
[Graph: v1 → v2 → v3 → v4]
p(v1, . . . , vT ) = p(v1) Π_{t=2}^{T} p(vt|vt−1)
where p(v1) is the initial distribution and p(vt|vt−1) the transition.
State transition diagram
Nodes represent states of the variable v and arcs non-zero elements of the
transition p(vt|vt−1)
[State transition diagram over states 1, . . . , 9]
79. Most probable and shortest paths
[State transition diagram over states 1, . . . , 9]
The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.
The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming
uniform transition probabilities). The latter path is longer but more probable
since for the path 1 − 2 − 7, the probability of exiting state 2 into state 7 is
1/5.
80. Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time:
p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j), with Mij ≡ p(xt = i|xt−1 = j)
p(xt = i) is the frequency that we visit state i at time t, given we started
from p(x1) and randomly drew samples from the transition p(xτ |xτ−1).
As we repeatedly sample a new state from the chain, the distribution at time
t, for an initial distribution p1(i) is
pt = M^{t−1} p1
If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is
called the equilibrium distribution of the chain:
p∞ = Mp∞
The equilibrium distribution is proportional to the eigenvector of the
transition matrix with unit eigenvalue.
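A minimal numerical sketch of both characterisations, using a hypothetical 3-state transition matrix (columns sum to 1, Mij = p(xt = i|xt−1 = j)):

```python
# Equilibrium distribution by power iteration and by eigendecomposition.
import numpy as np

M = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

p = np.ones(3) / 3          # any initial distribution p1
for _ in range(200):
    p = M @ p               # p_t = M^{t-1} p_1

# At convergence p satisfies M p = p: the equilibrium distribution.
print(p)

# Equivalently: the eigenvector of M with unit eigenvalue, normalised.
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
v = v / v.sum()
```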
81. PageRank
Define the matrix
Aij = 1 if website j has a hyperlink to website i, and 0 otherwise.
From this we can define a Markov transition matrix with elements
Mij = Aij / Σ_{i′} Ai′j
If we jump from website to website, the equilibrium distribution component
p∞(i) is the relative number of times we will visit website i. This has a
natural interpretation as the ‘importance’ of website i.
For each website i a list of words associated with that website is collected.
After doing this for all websites, one can make an ‘inverse’ list of which
websites contain word w. When a user searches for word w, the list of
websites that contain word w is then returned, ranked according to the
importance of the site.
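A sketch of this construction for a hypothetical three-site web (the link structure is invented for illustration):

```python
# PageRank as the equilibrium distribution of the hyperlink chain.
import numpy as np

# A[i][j] = 1 if site j links to site i:
# site 0 links to 1; site 1 links to 0 and 2; site 2 links to 0.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)       # M_ij = A_ij / sum_i' A_i'j (column-stochastic)

p = np.ones(3) / 3
for _ in range(500):
    p = M @ p               # power iteration to the equilibrium

print(p)                    # [0.4, 0.4, 0.2]: site 2 is least 'important'
```

Note this assumes the chain is irreducible and aperiodic; real PageRank adds a damping/teleportation term to guarantee that.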
82. Hidden Markov Models
The HMM defines a Markov chain on hidden (or ‘latent’) variables h1:T . The
observed (or ‘visible’) variables are dependent on the hidden variables through an
emission p(vt|ht). This defines a joint distribution
p(h1:T, v1:T) = p(v1|h1)p(h1) Π_{t=2}^{T} p(vt|ht)p(ht|ht−1)
For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions
are constant through time.
Figure : A first order hidden Markov model, h1 → h2 → h3 → h4, with emissions
ht → vt. The ‘hidden’ variables have dom(ht) = {1, . . . , H}, t = 1 : T; the
‘visible’ variables vt can be either discrete or continuous.
83. The classical inference problems
Filtering (Inferring the present) p(ht|v1:t)
Prediction (Inferring the future) p(ht|v1:s) t > s
Smoothing (Inferring the past) p(ht|v1:u) t < u
Likelihood p(v1:T )
Most likely path (Viterbi alignment) argmax_{h1:T} p(h1:T|v1:T)
For prediction, one is also often interested in p(vt|v1:s) for t > s.
84. Inference in Hidden Markov Models
Belief network representation of an HMM:
[Graph: h1 → h2 → h3 → h4, with emissions ht → vt]
Filtering, Smoothing and Viterbi are all computationally efficient, scaling
linearly with the length of the timeseries (but quadratically with the number
of hidden states).
The algorithms are variants of ‘message passing on factor graphs’
The algorithms are guaranteed to be exact if the graph is singly connected.
Huge research effort in the last 15 years to apply message passing for
approximate inference in multiply-connected graphs (eg low-density
parity-check codes).
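The filtering recursion can be sketched with the standard forward algorithm; the transition, emission and prior tables below are hypothetical. Each step costs O(H²), so the whole pass is linear in T and quadratic in the number of hidden states:

```python
# Forward (filtering) recursion for a small 2-state HMM:
# alpha_t(h) ∝ p(v_t|h) * sum_h' p(h|h') alpha_{t-1}(h').
import numpy as np

trans = np.array([[0.8, 0.3],   # trans[i, j] = p(h_t = i | h_{t-1} = j)
                  [0.2, 0.7]])
emit = np.array([[0.9, 0.2],    # emit[v, h] = p(v_t = v | h_t = h)
                 [0.1, 0.8]])
prior = np.array([0.5, 0.5])    # p(h_1)

obs = [0, 0, 1, 0]              # observed sequence v_{1:T}

log_lik = 0.0
alpha = emit[obs[0]] * prior
filtered = []                   # p(h_t | v_{1:t}) for each t
for t, v in enumerate(obs):
    if t > 0:
        alpha = emit[v] * (trans @ alpha)
    norm = alpha.sum()          # p(v_t | v_{1:t-1})
    log_lik += np.log(norm)     # accumulates log p(v_{1:T})
    alpha = alpha / norm
    filtered.append(alpha.copy())
```

Normalising at each step keeps the recursion numerically stable and yields the likelihood as a by-product.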
85. HMMs for speech recognition
ht is the phoneme at time t. p(ht|ht−1) – language model. p(vt|ht) – speech
signal model.
86. Deep Nets and HMMs
[Graph: h1 → h2 → h3 → h4, with emissions ht → vt]
Recently companies including Google have made big advances in speech
recognition.
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some
function of the phoneme µ(ht; θ).
This function is a deep neural network, trained on a large amount of data.
Goldrush at the moment to find similar breakthrough applications of deep
networks in reasoning systems.
87. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
88. Generative Model
[Graph: latent variables h1, h2, each with arcs to v1, v2, v3, v4]
It is natural to consider that objects (images for example) can be constructed
on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled using p(h), and an image then sampled
from p(v|h).
One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v) and parameter learning) is intractable in these
models.
Statisticians typically use sampling as an approximation.
Very popular in ML to use a variational method – much faster for inference.
89. Variational Inference
Consider a distribution
p(v|θ) = Σ_h p(v|h, θ)p(h)
and that we wish to learn θ to maximise the probability that this model
assigns to the observed data.
log p(v|θ) ≥ −Σ_h q(h|v, φ) log q(h|v, φ) + Σ_h q(h|v, φ) log p(v|h, θ) + const.
The idea is to choose a ‘variational’ distribution q(h|v, φ) such that we can
either calculate the bound analytically, or sample it efficiently.
We then jointly maximise the bound wrt φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see ‘variational autoencoder’ and also attention
mechanisms.
Extension to semi-supervised learning using p(v) = Σ_{h,c} p(v|h, c)p(c)p(h)
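The bound can be checked numerically on a tiny discrete model (the tables below are hypothetical, and the p(h) term is kept explicitly rather than absorbed into the constant): any q gives a lower bound, and the bound is tight when q equals the exact posterior p(h|v).

```python
# Variational lower bound on log p(v) for a 2-state latent variable model.
import numpy as np

p_h = np.array([0.6, 0.4])          # prior p(h)
p_v_given_h = np.array([0.9, 0.2])  # p(v = 1 | h), for the observed v = 1

log_p_v = np.log((p_v_given_h * p_h).sum())  # exact log p(v) by enumeration

def elbo(q):
    # sum_h q(h) [log p(v|h) + log p(h) - log q(h)]
    return (q * (np.log(p_v_given_h) + np.log(p_h) - np.log(q))).sum()

# An arbitrary q gives a strict lower bound ...
assert elbo(np.array([0.5, 0.5])) < log_p_v

# ... while the exact posterior q(h) = p(h|v) makes the bound tight.
posterior = p_v_given_h * p_h / (p_v_given_h * p_h).sum()
assert abs(elbo(posterior) - log_p_v) < 1e-9
```

In the variational autoencoder, q(h|v, φ) and p(v|h, θ) are parameterised by neural networks and the same bound is maximised jointly in φ and θ.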
91. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
93. Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to
decide which action to take in each state of W so as to best serve our
long-term goals.
Problem is that the number of pixel states is enormous.
Need to learn a low dimensional representation of the screen (use a deep
generative model).
Learn then which action to take given the low dimensional representation.
Tetris
Google
94. Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
95. Outlook
Machine Learning is in a boom period.
Renewed interest and hope in creating AI.
Combine new computational power with suitable hierarchical representations.
Impressive state of the art results in Speech Recognition, Image Analysis,
Game Playing.
Challenges
Improve understanding of optimisation for deep learning.
Learn how to more efficiently exploit computational resources.
Learn how to exploit massive databases.
Improve interaction between reinforcement learning and representation
learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability.
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io