Fun With Neural Nets (Jason Yosinski, Researcher at Geometric Intelligence)
Jason Yosinski is a researcher at Geometric Intelligence, where he uses neural networks and machine learning to build better AI. He was previously a PhD student and NASA Space Technology Research Fellow working at the Cornell Creative Machines Lab, the University of Montreal, the Caltech Jet Propulsion Laboratory, and Google DeepMind. His work on AI has been featured by NPR, Fast Company, the Economist, TEDx, and the BBC. When not doing research, Mr. Yosinski enjoys tricking middle school students into learning math while they play with robots.
Jason will talk about how deep neural networks have recently been making a bit of a splash, enabling machines to learn to solve problems that had previously been easy for humans but hard for machines, like playing Atari games or identifying lions or jaguars in photos. But how do these neural nets actually work? What do they learn? This turns out to be a surprisingly tricky question to answer: surprising because we built the darn things, but tricky because the networks are so large, with many millions of connections effecting complex computation that is hard to interpret. Trickiness notwithstanding, in this talk we'll see what we can learn about neural nets by looking at a few examples of networks in action and at experiments designed to elucidate network behavior.
Website: http://yosinski.com/
2. Fun with Neural Nets
NYAI meetup
24 August 2016
Jason Yosinski
Original slides available under Creative Commons Attribution-ShareAlike 3.0
Geometric Intelligence
4. Neural nets start working
[Timeline: Progress in AI, 1950–2020 and beyond]
Chen et al., 2014
Speech recognition, natural language conversation
5. Neural nets start working
[Timeline: Progress in AI, 1950–2020 and beyond]
Chen et al., 2014
We are interested in enabling users to have a fully hands-free experience by developing a system that listens continuously for specific keywords to initiate voice input. This could be especially useful in situations like driving. The proposed system must be highly accurate, low-latency, small-footprint, and run in computationally constrained environments such as modern mobile devices. Running the system on the device avoids latency and power implications with connecting to the server for recognition.

Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream, and it is a potential technique to provide the desired hands-free interface. There is an extensive literature in KWS, although most of the proposed methods are not suitable for low-latency applications in computationally constrained environments. For example, several KWS systems [2, 3, 4] assume offline processing of the audio using large vocabulary continuous speech recognition systems (LVCSR) to generate rich lattices. In this case, their task focuses on efficient indexing and search for keywords in the lattices. These systems are often used to search large databases of audio content. We focus instead on detecting keywords in the audio stream without any latency.

A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM) [5, 6, 7, 8, 9]. Despite being initially proposed over two decades ago, it remains highly competitive. In this generative approach, an HMM model is trained …

… The experimental setup, results and some discussion follow in Section 4. Section 5 closes with the conclusions.

2. DEEP KWS SYSTEM

The proposed Deep KWS framework is illustrated in Figure 1. The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module. The feature extraction module (i) performs voice-activity detection and generates a vector of features every frame (10 ms). These features are stacked using the left and right context to create …

Fig. 1. Framework of Deep KWS system, components from left to right: (i) Feature Extraction (ii) Deep Neural Network (iii) Posterior Handling

*The author performed the work as a summer intern at Google, MTV.
Speech recognition, natural language conversation
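The three-stage pipeline in the excerpt above is simple enough to sketch end to end. Below is a minimal, illustrative Python version of the three components; the context sizes, network shapes, and smoothing window are placeholder assumptions, not the configuration from Chen et al.

```python
import numpy as np

# Minimal sketch of the three Deep KWS stages (feature extraction, DNN,
# posterior handling). All shapes and windows are illustrative assumptions,
# not the paper's actual configuration.

def extract_features(audio_frames, left=10, right=5):
    """(i) Stack each 10 ms frame's features with `left` and `right`
    context frames to form the DNN input vector."""
    T, D = audio_frames.shape
    padded = np.pad(audio_frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

def dnn_posteriors(features, weights):
    """(ii) A toy feed-forward DNN producing per-frame posteriors over
    {filler, keyword-word-1, keyword-word-2, ...}."""
    h = features
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)            # ReLU hidden layers
    W, b = weights[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # softmax

def keyword_confidence(posteriors, window=30):
    """(iii) Posterior handling: smooth the per-frame posteriors over a
    sliding window, then combine the keyword-word scores."""
    kw = posteriors[:, 1:]                        # drop the filler class
    smoothed = np.stack([kw[max(0, t - window):t + 1].mean(axis=0)
                         for t in range(len(kw))])
    # Confidence: product of each keyword word's maximum smoothed score.
    return smoothed.max(axis=0).prod()
```

A keyword detection fires whenever the confidence exceeds a tuned threshold; keeping all three stages on-device is what avoids the server round-trip the excerpt mentions.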
6. Neural nets start working
[Timeline: Progress in AI, 1950–2020 and beyond]
Chen et al., 2014
Speech recognition, natural language conversation
Reinforcement Learning
Silver et al., 2016
24. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
5 convolutional layers, 3 FC layers
25. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
5 convolutional layers, 3 FC layers
27. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
5 convolutional layers, 3 FC layers
ImageNet, Deng et al. 2009
28. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
5 convolutional layers, 3 FC layers
ImageNet, Deng et al. 2009
[Image grid of example ImageNet classes: jaguar, gibbon, great white shark, water bottle, golden retriever, orangutan, fireboat, bubble, tobacco shop, ambulance, cowboy hat, mixing bowl]
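Labels like those in the grid above are read off the top of AlexNet's output distribution over the 1000 ImageNet classes. Here is a hedged sketch of that inference step using torchvision's pretrained AlexNet; "lion.jpg" and "imagenet_classes.txt" are hypothetical local files, not assets from the talk.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, center-crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(pretrained=True).eval()
image = preprocess(Image.open("lion.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)[0]

# "imagenet_classes.txt" is assumed to hold the 1000 class names, one per line.
labels = [line.strip() for line in open("imagenet_classes.txt")]
values, indices = probs.topk(5)
for p, idx in zip(values.tolist(), indices.tolist()):
    print(f"{labels[idx]}: {p:.3f}")
```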
30. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)
5 convolutional layers, 3 FC layers
? ? ?
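Those question marks are the talk's motivating puzzle; the architecture itself, at least, is easy to write down. Below is a hedged PyTorch sketch of an AlexNet-style network with the 5 convolutional and 3 FC layers from the slide. Layer sizes follow the common single-GPU torchvision variant, which differs in detail from Krizhevsky et al.'s original two-GPU layout, and the parameter count it prints lands near the 60m quoted above.

```python
import torch.nn as nn

# AlexNet-style network: 5 convolutional layers, then 3 fully connected
# layers. Sizes follow the single-GPU torchvision variant, not the exact
# two-GPU original from Krizhevsky et al., 2012.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # fc6
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # fc7
            nn.Linear(4096, num_classes),                                       # fc8
        )

    def forward(self, x):
        x = self.features(x)            # 224x224 input -> 256 x 6 x 6
        return self.classifier(x.flatten(1))

net = AlexNetSketch()
print(sum(p.numel() for p in net.parameters()))  # ~61 million parameters
```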
51. Nguyen, Dosovitskiy, Yosinski, Brox, Clune.
“Synthesizing the preferred inputs for neurons in neural networks via deep generator networks”
[Figure: a deep generator network (the prior) feeds the DNN being visualized. A code (fc6) passes through upconvolutional layers (u9 … u1) to produce an image; the image passes through convolutional layers (c1–c5) and fully connected layers (fc6, fc7, fc8) to class units such as candle, banana, convertible. Forward and backward passes run through both networks.]
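The optimization behind this figure is compact: rather than optimizing pixels directly, optimize the code so that the generated image strongly activates a chosen unit. Below is a hedged sketch, where `G` (the upconvolutional generator) and `dnn` (the network being visualized) stand in for pretrained, frozen models not defined here, and all hyperparameters are illustrative rather than the paper's.

```python
import torch

def synthesize_preferred_input(G, dnn, unit, code_dim=4096, steps=200, lr=0.05):
    """Activation maximization through a generator prior, in the spirit of
    Nguyen et al.: optimize a code z so that G(z) strongly activates `unit`
    in `dnn`. G and dnn are assumed pretrained and frozen; code_dim, steps,
    lr, and the L2 weight are illustrative placeholders."""
    z = torch.zeros(1, code_dim, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = G(z)                      # forward pass through the generator
        activation = dnn(image)[0, unit]  # e.g., an fc8 class logit
        loss = -activation + 1e-3 * z.norm() ** 2  # small L2 prior on the code
        loss.backward()                   # backward pass through both networks
        opt.step()
    return G(z).detach()
```

The forward and backward passes labeled in the figure correspond to the `G(z)` evaluation and the `loss.backward()` call: gradients flow from the chosen class unit back through the DNN and the generator into the code.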