An introduction to Reinforcement Learning
by Xander Steenbrugge
“The difference in mind between man and the higher animals, great as it is, certainly is one of degree and not of kind.”
-- Charles Darwin
Overview
1. A brief history of AI
2. Machine Learning today
3. Introduction to Reinforcement Learning
4. Problems in Reinforcement Learning
5. Promising Research
6. A look into the future
A brief history of AI
1950: Alan Turing’s “Turing test”
1997: IBM’s Deep Blue defeats world chess champion Garry Kasparov
Chess branching factor ~ 35
Chessbot basics: MiniMax Search
● Concept: “Maximize the evaluation of your own moves while minimizing the evaluation of your opponent’s moves”
● Examines every possible move sequence
● Computationally expensive, since every possible future board position must be evaluated
MiniMax strategy + brute-force, heuristic-based state evaluations
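As a sketch, the MiniMax idea above can be written down in a few lines. This toy version (my own illustration, not Deep Blue's engine) works on an abstract game tree whose leaves are heuristic evaluations; a real chess engine would add alpha-beta pruning and a board evaluator:

```python
def minimax(node, maximizing):
    """Evaluate an abstract game tree.

    Leaves are heuristic evaluations (ints); internal nodes are lists of
    child subtrees. The maximizer picks the child with the highest value,
    the minimizer the child with the lowest value.
    """
    if isinstance(node, int):                    # leaf: static evaluation
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Two moves for us, each answered by two opponent replies:
tree = [[3, 5], [2, 9]]
print(minimax(tree, True))  # 3: the opponent would punish the 9-branch
```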
2011:
Apple’s Siri
IBM’s Watson wins Jeopardy!
The Deep Learning Revolution: ImageNet 2012
Deep Learning on Google Trends
Ultimate Machine Learning with Google Cloud
The old, algorithmic approach
“apple” “orange” “banana”
IF (round) THEN
    IF (orange AND coarse) THEN
        “orange”
    ELSE IF (green AND smooth) THEN
        “apple”
    ELSE IF ...
    ...
ELSE IF …
    “banana”
Let the machine find the rules
“apple”
“orange”
“banana”
?
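To make the contrast with the IF/ELSE approach concrete: instead of hand-writing rules, fit a tiny classifier on labelled examples and let it find the decision rule itself. Everything below (the two features, the toy data, the perceptron learner) is invented for illustration:

```python
# Toy fruit data: (roundness, yellowness) -> label (0 = apple, 1 = banana).
data = [((0.9, 0.1), 0), ((0.8, 0.2), 0),
        ((0.2, 0.9), 1), ((0.1, 0.8), 1)]

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):                          # perceptron training loop
    for (x1, x2), y in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = y - pred                        # 0 when the rule already fits
        w[0] += lr * err * x1                 # nudge the rule toward the data
        w[1] += lr * err * x2
        b += lr * err

def classify(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

print(classify(0.85, 0.15), classify(0.15, 0.85))  # 0 1
```

The "rules" end up encoded in the learned weights rather than written by a programmer; deep networks do the same thing with millions of weights on raw pixels.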
Confidential & Proprietary
ConvNets
Keys to Successful ML
● Large Datasets
● Good Models
● Lots of Computation
Machine Learning Today
Machine Learning is everywhere already…
Rapidly Accelerating Use of Deep Learning at Google
[Chart: number of projects using some form of deep learning, from near 0 in 2012 to roughly 1500 in 2015]
Used across products:
Speech recognition
[Diagram: audio input (“How cold is it outside?”) → deep recurrent neural network → text output]
● Reduced word errors by more than 30%
● 20% of Mobile queries are Voice Search
Google Research blog, August 2012, August 2015
Google Translate
Machine learning in popular culture:
+ ‘The next big thing’
+ Sentient AI in the next 10 years
+ Will put humans out of a job
+ Foolproof
Machine learning, really:
+ Been around for 60 years now
+ ‘Sentient next year’, every year, for the last 60 years
+ AI winters: 1970, 1990, … ?
+ Not foolproof
A person on a beach
flying a kite.
A person skiing down a
snow covered slope.
A group of giraffe standing
next to each other.
A woman riding a horse
on a dirt road.
An airplane is parked on
the tarmac at an airport.
A group of people standing
on top of a beach.
Image classifiers are easily fooled!
Practical limitations of current AI systems
● Supervised: need large amounts of annotated training data
● Static inference machines
● Bad transfer learning capabilities to new tasks
Introduction to
Reinforcement Learning
Policy
Policy networks
Input: raw pixels
Policy Gradients
Run a policy for a while. See what actions led to high rewards. Increase their probability.
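The one-line recipe above can be sketched as actual code. This is a minimal REINFORCE-style update on a toy two-action problem; the bandit setup, learning rate and one-parameter policy are all illustrative choices, not from the talk:

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.0      # one policy parameter: P(action = 1) = sigmoid(theta)
lr = 0.1
for _ in range(2000):
    p = sigmoid(theta)
    action = 1 if random.random() < p else 0    # run the policy for a while
    reward = 1.0 if action == 1 else 0.0        # see which action earned reward
    grad_logp = (1 - p) if action == 1 else -p  # d/dtheta of log pi(action)
    theta += lr * reward * grad_logp            # increase that action's probability

print(round(sigmoid(theta), 2))   # probability of the rewarded action, near 1
```

Only rewarded actions push the parameter, so the policy drifts toward whatever happened to work; that is the whole trick, and also why policy gradients need so many samples.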
Practical Applications of RL
Data Centre cooling
Chess branching factor ~ 35
Go branching factor ~ 250
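A quick back-of-the-envelope calculation shows why the jump from 35 to 250 matters. Assuming typical game lengths of roughly 80 plies for chess and 150 moves for Go (common ballpark figures, not from the slides):

```python
import math

# log10 of the naive game-tree size: branching_factor ** game_length
chess = 80 * math.log10(35)     # roughly 10^124 positions
go = 150 * math.log10(250)      # roughly 10^360 positions

print(f"chess: 10^{chess:.1f}, go: 10^{go:.1f}")
```

Brute-force search that is (barely) tractable for chess is hopeless for Go, which is why learned evaluation functions were needed.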
Neural nets to the rescue!
Deep Learning enhanced MiniMax
Advantages that AlphaGo can leverage
1. Fully deterministic: no noise in the game
2. Fully observed: each player has complete information and there are
no hidden variables. (unlike Poker for example)
3. Discrete action space.
4. Each game is relatively short (approximately 200 actions).
5. Target function is clear (win/lose) & fast to evaluate.
6. Huge datasets of human gameplay are available to bootstrap the
learning, so AlphaGo doesn’t have to start from scratch.
Image Segmentation
Deepdrive in GTA V
OpenAI Universe
Universe Starter Agent
Problems in
Reinforcement Learning
Poor sample efficiency
Cold Start
Exploration vs Exploitation
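A standard (if simplistic) answer to this dilemma is epsilon-greedy action selection: mostly exploit the best-known option, but explore a random one a small fraction of the time. The three-armed bandit below, with made-up payout probabilities, is my own illustration:

```python
import random

random.seed(1)

# Epsilon-greedy on a toy 3-armed bandit; payout probabilities are invented.
true_probs = [0.2, 0.5, 0.8]
counts = [0, 0, 0]            # pulls per arm
values = [0.0, 0.0, 0.0]      # running estimate of each arm's payout
epsilon = 0.1

for _ in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(3)           # explore: try a random arm
    else:
        arm = values.index(max(values))     # exploit: best arm so far
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("best arm:", values.index(max(values)))
```

With epsilon = 0 the agent can lock onto the first arm that ever paid out; the occasional random pull is what lets it discover the better arm.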
Subgoal creation
Promising Research
Attention
Memory
Domain Transfer
PathNet
Physical Intuition
[Figure: Simulation / Real, Ground Truth / Prediction]
Auxiliary Learning Signals
Auxiliary Learning Signals (continued)
Divide observations into 3 classes:
1. Things that the agent can control
2. Things the agent cannot control but affect it
3. Things the agent cannot control but do not affect it
A good feature space for curiosity should model (1) and (2) and
be unaffected by (3).
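A related curiosity signal, sketched here in the simplest possible form with invented linear dynamics (not the feature-space model described above), rewards the agent for states its learned forward model predicts badly; as the model improves, the surprise signal fades:

```python
# Curiosity as forward-model surprise: the agent's intrinsic reward is the
# prediction error of a learned model of the environment.

def true_next_state(state, action):
    return 0.9 * state + action          # hypothetical linear dynamics

w_s, w_a = 0.0, 0.0                      # learned linear forward model
lr = 0.1
errors = []                              # intrinsic reward ("surprise") per step
state = 1.0
for step in range(200):
    action = 1.0 if step % 2 == 0 else -1.0
    pred = w_s * state + w_a * action    # model's guess at the next state
    nxt = true_next_state(state, action)
    err = nxt - pred
    errors.append(abs(err))
    w_s += lr * err * state              # fit the model online (LMS update)
    w_a += lr * err * action
    state = nxt

# As the model masters the dynamics, surprise (the curiosity bonus) fades.
print(f"surprise: first step {errors[0]:.2f}, last step {errors[-1]:.4f}")
```

The three-class split above refines exactly this idea: measure surprise only in a feature space covering (1) and (2), so the agent is not rewarded for staring at unpredictable-but-irrelevant noise.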
Learning to communicate
[Diagram: observations, memory state and incoming message feed a policy network, which outputs a new memory state and an outgoing message]
Third person imitation learning
Adversarial Networks
Generative Networks
Unsupervised Learning’s potential
[Figure: sample digits 7, 2, 1, 0, 4]
FFN < Auto-Encoder < GAN
Personal thoughts
▪ “Intelligent software will become the main driver of most technological advances in the next decade”
➢ Self driving cars
➢ Personal, digital assistants (Siri, Viv, Alexa, ...)
➢ Machine generated/augmented content
▪ “Virtual Reality (VR) will become a mainstream experience sharing platform”
▪ “Natural language processing will be fundamental to interacting with all of these new technologies”
Plenty of problems to solve...
Plenty of solutions around...
Thanks!
We are hiring!
datatonic.com
