Fun With Neural Nets (Jason Yosinski, Researcher at Geometric Intelligence)
Jason Yosinski is a researcher at Geometric Intelligence, where he uses neural networks and machine learning to build better AI. He was previously a PhD student and NASA Space Technology Research Fellow working at the Cornell Creative Machines Lab, the University of Montreal, the Caltech Jet Propulsion Laboratory, and Google DeepMind. His work on AI has been featured by NPR, Fast Company, the Economist, TEDx, and the BBC. When not doing research, Mr. Yosinski enjoys tricking middle school students into learning math while they play with robots.
Jason will talk about how deep neural networks have recently been making a bit of a splash, enabling machines to learn to solve problems that had previously been easy for humans but hard for machines, like playing Atari games or identifying lions or jaguars in photos. But how do these neural nets actually work? What do they learn? This turns out to be a surprisingly tricky question to answer: surprising because we built the darn things, but tricky because the networks are so large, with many millions of connections effecting complex computation that is hard to interpret. Trickiness notwithstanding, in this talk we'll see what we can learn about neural nets by looking at a few examples of networks in action and at experiments designed to elucidate network behavior.
Website: http://yosinski.com/
2. Fun with Neural Nets
NYAI meetup
24 August 2016
Jason Yosinski
Original slides available under Creative Commons Attribution-ShareAlike 3.0
Geometric Intelligence
4. Neural nets start working
[Timeline: Progress in AI, 1950–2020 and beyond]
Chen et al., 2014
Speech recognition, natural language conversation
5. Neural nets start working
[Timeline: Progress in AI, 1950–2020 and beyond]
Chen et al., 2014
We are interested in enabling users to have a fully hands-free experience by developing a system that listens continuously for specific keywords to initiate voice input. This could be especially useful in situations like driving. The proposed system must be highly accurate, low-latency, small-footprint, and run in computationally constrained environments such as modern mobile devices. Running the system on the device avoids latency and power implications with connecting to the server for recognition.

Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream, and it is a potential technique to provide the desired hands-free interface. There is an extensive literature in KWS, although most of the proposed methods are not suitable for low-latency applications in computationally constrained environments. For example, several KWS systems [2, 3, 4] assume offline processing of the audio using large vocabulary continuous speech recognition systems (LVCSR) to generate rich lattices. In this case, their task focuses on efficient indexing and search for keywords in the lattices. These systems are often used to search large databases of audio content. We focus instead on detecting keywords in the audio stream without any latency.

A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM) [5, 6, 7, 8, 9]. Despite being initially proposed over two decades ago, it remains highly competitive. In this generative approach, an HMM model is trained …

… The experimental setup, results and some discussion follow in Section 4. Section 5 closes with the conclusions.

2. DEEP KWS SYSTEM

The proposed Deep KWS framework is illustrated in Figure 1. The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module. The feature extraction module (i) performs voice-activity detection and generates a vector of features every frame (10 ms). These features are stacked using the left and right context to create …

Fig. 1. Framework of Deep KWS system, components from left to right: (i) Feature Extraction (ii) Deep Neural Network (iii) Posterior Handling

*The author performed the work as a summer intern at Google, MTV.
Speech recognition, natural language conversation
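The three-stage pipeline in the excerpt above is simple enough to sketch end to end. Below is a minimal, illustrative Python version of the three components; the context sizes, network shapes, and smoothing window are placeholder assumptions, not the configuration from Chen et al.

```python
import numpy as np

# Minimal sketch of the three Deep KWS stages (feature extraction, DNN,
# posterior handling). All shapes and windows are illustrative assumptions,
# not the paper's actual configuration.

def extract_features(audio_frames, left=10, right=5):
    """(i) Stack each 10 ms frame's features with `left` and `right`
    context frames to form the DNN input vector."""
    T, D = audio_frames.shape
    padded = np.pad(audio_frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

def dnn_posteriors(features, weights):
    """(ii) A toy feed-forward DNN producing per-frame posteriors over
    {filler, keyword-word-1, keyword-word-2, ...}."""
    h = features
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)            # ReLU hidden layers
    W, b = weights[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # softmax

def keyword_confidence(posteriors, window=30):
    """(iii) Posterior handling: smooth the per-frame posteriors over a
    sliding window, then combine the keyword-word scores."""
    kw = posteriors[:, 1:]                        # drop the filler class
    smoothed = np.stack([kw[max(0, t - window):t + 1].mean(axis=0)
                         for t in range(len(kw))])
    # Confidence: product of each keyword word's maximum smoothed score.
    return smoothed.max(axis=0).prod()
```

A keyword detection fires whenever the confidence exceeds a tuned threshold; keeping all three stages on-device is what avoids the server round-trip the excerpt mentions.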
6. Neural nets start working
[Timeline: Progress in AI, 1950–2020 and beyond]
Chen et al., 2014
Speech recognition, natural language conversation
Reinforcement Learning
Silver et al., 2016
24. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
5 convolutional layers, 3 FC layers
25. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
5 convolutional layers, 3 FC layers
27. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
5 convolutional layers, 3 FC layers
ImageNet, Deng et al. 2009
28. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
5 convolutional layers, 3 FC layers
ImageNet, Deng et al. 2009
[Image grid of example ImageNet classes: jaguar, gibbon, great white shark, water bottle, golden retriever, orangutan, fireboat, bubble, tobacco shop, ambulance, cowboy hat, mixing bowl]
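Labels like those in the grid above are read off the top of AlexNet's output distribution over the 1000 ImageNet classes. Here is a hedged sketch of that inference step using torchvision's pretrained AlexNet; "lion.jpg" and "imagenet_classes.txt" are hypothetical local files, not assets from the talk.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, center-crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(pretrained=True).eval()
image = preprocess(Image.open("lion.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)[0]

# "imagenet_classes.txt" is assumed to hold the 1000 class names, one per line.
labels = [line.strip() for line in open("imagenet_classes.txt")]
values, indices = probs.topk(5)
for p, idx in zip(values.tolist(), indices.tolist()):
    print(f"{labels[idx]}: {p:.3f}")
```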
30. Lion
Krizhevsky et al. 2012
AlexNet
Lion
Recipe for understanding:
• architecture
• dataset (big: 250b)
• parameters (big: 60m)
5 convolutional layers, 3 FC layers
? ? ?
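Those question marks are the talk's motivating puzzle; the architecture itself, at least, is easy to write down. Below is a hedged PyTorch sketch of an AlexNet-style network with the 5 convolutional and 3 FC layers from the slide. Layer sizes follow the common single-GPU torchvision variant, which differs in detail from Krizhevsky et al.'s original two-GPU layout, and the parameter count it prints lands near the 60m quoted above.

```python
import torch.nn as nn

# AlexNet-style network: 5 convolutional layers, then 3 fully connected
# layers. Sizes follow the single-GPU torchvision variant, not the exact
# two-GPU original from Krizhevsky et al., 2012.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # fc6
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # fc7
            nn.Linear(4096, num_classes),                                       # fc8
        )

    def forward(self, x):
        x = self.features(x)            # 224x224 input -> 256 x 6 x 6
        return self.classifier(x.flatten(1))

net = AlexNetSketch()
print(sum(p.numel() for p in net.parameters()))  # ~61 million parameters
```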
51. Nguyen, Dosovitskiy, Yosinski, Brox, Clune.
“Synthesizing the preferred inputs for neurons in neural networks via deep generator networks”
[Figure: a deep generator network (the prior) feeds the DNN being visualized. A code (fc6) passes through upconvolutional layers (u9 … u1) to produce an image; the image passes through convolutional layers (c1–c5) and fully connected layers (fc6, fc7, fc8) to class units such as candle, banana, convertible. Forward and backward passes run through both networks.]
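The optimization behind this figure is compact: rather than optimizing pixels directly, optimize the code so that the generated image strongly activates a chosen unit. Below is a hedged sketch, where `G` (the upconvolutional generator) and `dnn` (the network being visualized) stand in for pretrained, frozen models not defined here, and all hyperparameters are illustrative rather than the paper's.

```python
import torch

def synthesize_preferred_input(G, dnn, unit, code_dim=4096, steps=200, lr=0.05):
    """Activation maximization through a generator prior, in the spirit of
    Nguyen et al.: optimize a code z so that G(z) strongly activates `unit`
    in `dnn`. G and dnn are assumed pretrained and frozen; code_dim, steps,
    lr, and the L2 weight are illustrative placeholders."""
    z = torch.zeros(1, code_dim, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = G(z)                      # forward pass through the generator
        activation = dnn(image)[0, unit]  # e.g., an fc8 class logit
        loss = -activation + 1e-3 * z.norm() ** 2  # small L2 prior on the code
        loss.backward()                   # backward pass through both networks
        opt.step()
    return G(z).detach()
```

The forward and backward passes labeled in the figure correspond to the `G(z)` evaluation and the `loss.backward()` call: gradients flow from the chosen class unit back through the DNN and the generator into the code.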