Talk


  1. DEEP LEARNING: Fundamentals and Research Trends. Taichi Kiwaki, Aihara Lab.
  2. Introduction to Machine Learning
  3. Classifying good vs. defective electrical wires • Parameters: wire resistance, minimum wire thickness • Keywords: supervised learning, classification (discrimination) problem • [Plot of thickness vs. resistance: linearly separable!]
  4. The image classification problem • Parameters: pixel values • Dimensionality: thousands to several million (on benchmarks) • Characteristics: the data lie on a low-dimensional manifold in R^N and are not linearly separable
  5. The sequence prediction problem • Learning f : R^{N×T} → R^N • Representative methods: N-gram, back-off models, Hidden Markov Models (HMMs), Conditional Random Fields (CRFs) • Example: given “A quick brown fox”, predict “jumps over ...”
  6. Outline • Introduction to machine learning • Neural nets and Deep Learning (DL) • Case studies • Deep Convolutional Networks • Recurrent Neural Networks
  7. Neural Nets (NNs) • Error signals (Rumelhart et al., 1986) • (McCulloch and Pitts, 1943) • (LeCun et al., 1989) • [Slide reproduces the LeCun et al. (1989) digit-recognition architecture — 256 input units, hidden layers H1 (12×64 = 768 units), H2 (12×16 = 192 units), H3 (30 units), 10 output units, with shared 5×5 convolution kernels and fully connected links — and the paper's Figure 3 (log MSE and error rate vs. number of training passes).]
  8. History of Neural Nets (NNs) • Perceptron (Rosenblatt, 1957) • Backpropagation (Rumelhart et al., 1986) and the boom of neural networks research • Deep Learning (2006~) • [Timeline axis: 1960, 1990, 2010.]
  9. Perceptron • A classifier for linearly separable problems (Rosenblatt, 1957) • A linear discriminant model: y = Sign(Wx + b) • Cannot be used on problems that are not linearly separable, e.g. XOR • The XOR affair (Minsky and Papert, 1969)
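As a concrete illustration of y = Sign(Wx + b), here is a minimal sketch of the classic perceptron learning rule (a toy implementation with illustrative names; it converges only when the data are linearly separable, which is exactly why it fails on XOR):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Classic perceptron rule: update W, b only on misclassified examples.
    X: (n_samples, n_features), y: labels in {-1, +1}."""
    W = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (W @ x_i + b) <= 0:      # misclassified (or on the boundary)
                W += y_i * x_i
                b += y_i
    return W, b

# XOR is not linearly separable, so no linear classifier can get all four right:
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])
W, b = train_perceptron(X_xor, y_xor)
print(np.sign(X_xor @ W + b), y_xor)          # at least one label is always wrong
```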
  12. Multilayer Perceptron (MLP) • [Diagram with units labelled X1, X2 (inputs), P1, P2, H1, H2 (hidden), and O (output).]
  13. Back Propagation (BP, Back-prop) (Rumelhart et al., 1986) • A fast way to compute gradient descent updates for MLPs • Differentiate the prediction error with respect to the parameters • The chain rule propagates the derivatives (error signals) from upper layers down to lower layers • Code at http://deeplearning.net/tutorial/mlp.html
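The following is not the deeplearning.net tutorial code but a minimal NumPy sketch of the chain rule propagating the error signal from the upper layer to the lower one in a one-hidden-layer MLP (names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_backprop(x, t, W1, b1, W2, b2):
    """Gradients for one example of a 1-hidden-layer MLP with squared error.
    Shapes: x (D,), t (K,), W1 (H, D), W2 (K, H)."""
    # Forward pass
    h = sigmoid(W1 @ x + b1)          # hidden activations
    y = W2 @ h + b2                   # linear output
    err = y - t                       # dE/dy for E = 0.5 * ||y - t||^2

    # Backward pass: the chain rule carries the error signal downward
    dW2 = np.outer(err, h)
    db2 = err
    dh = W2.T @ err                   # error signal at the hidden layer
    dz1 = dh * h * (1.0 - h)          # through the sigmoid derivative
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return dW1, db1, dW2, db2
```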
  14. Conv Nets • LeCun et al. (1989) • An MLP tailored to image recognition • Weight sharing • Convolution • Pooling • [Slide again shows the LeCun et al. (1989) architecture and its Figure 3 (log MSE and error rate vs. training passes).]
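To make weight sharing, convolution, and pooling concrete, here is a minimal NumPy sketch of one valid 2-D convolution followed by 2×2 max pooling (illustrative only; real conv nets add multiple channels, biases, and stacked layers):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' convolution):
    every output position reuses the same weights (weight sharing)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling of a feature map."""
    H, W = fmap.shape
    H2, W2 = H // 2, W // 2
    return fmap[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))

# Example: a random 5x5 kernel applied to a random 28x28 "image"
feature_map = conv2d_valid(np.random.rand(28, 28), np.random.randn(5, 5))
pooled = max_pool2x2(np.maximum(feature_map, 0.0))   # nonlinearity, then pool
```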
  15. Elman Nets • Elman, 1990 • Learn sequential data (e.g., text streams) • The context layer feeds back the hidden state from one time step earlier • Trained by Backpropagation Through Time (BPTT) • https://github.com/pascanur/trainingRNNs • [Figure 2 of Elman (1990): a simple recurrent network in which hidden-layer activations are copied to the context layer one-for-one with a fixed weight of 1.0; dotted lines are trainable connections; the context units are initially set to 0.5. Prediction example: “quick brown fox”.]
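A minimal sketch of the Elman recurrence (hypothetical names and shapes): the context is simply the previous hidden state copied back in as an extra input.

```python
import numpy as np

def elman_forward(inputs, W_in, W_ctx, W_out, b_h, b_y):
    """Run a simple Elman network over a sequence of input vectors.
    The context at each step is the hidden state from the previous step."""
    h = np.full(W_ctx.shape[0], 0.5)      # context units start at 0.5 (Elman, 1990)
    outputs = []
    for x in inputs:
        # hidden state sees both the current input and the copied context
        h = 1.0 / (1.0 + np.exp(-(W_in @ x + W_ctx @ h + b_h)))
        outputs.append(W_out @ h + b_y)   # e.g. scores for next-symbol prediction
    return outputs, h

# toy usage: a 10-step sequence of 8-dim inputs, 16 hidden units, 8 outputs
rng = np.random.default_rng(0)
W_in, W_ctx = 0.1 * rng.standard_normal((16, 8)), 0.1 * rng.standard_normal((16, 16))
W_out, b_h, b_y = 0.1 * rng.standard_normal((8, 16)), np.zeros(16), np.zeros(8)
ys, h_last = elman_forward([rng.standard_normal(8) for _ in range(10)],
                           W_in, W_ctx, W_out, b_h, b_y)
```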
  16. Problems with NNs (in the ’80s–’90s) • Making the network deeper barely improves performance under Back-Prop • It is unclear what is happening inside the NN • Computation is too heavy
  18. Vanishing Gradient • Bengio et al., 1994; Hochreiter et al., 2001 • In a deep MLP the error signal decays as it is propagated downward: ∂(f_N ∘ ⋯ ∘ f_1)/∂θ_1 = (∂f_N/∂f_{N−1}) ⋯ (∂f_1/∂θ_1)
  19. Vanishing/Exploding Gradients in RNNs • Bengio et al., 1994 • With x_2 = f(x_1), …, x_T = f(x_{T−1}): ∂(f ∘ ⋯ ∘ f)/∂θ_1 = (∂f/∂x_T) ⋯ (∂f/∂x_2)(∂f/∂θ_1) • If the RNN with its input removed is a stable dynamical system, the gradient vanishes • If it is unstable (chaotic), the gradient explodes
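A small numerical illustration of both effects (an assumed setup, not from the slides; the same product-of-Jacobians argument applies to the deep MLP case on the previous slide): repeatedly multiplying by the Jacobian of a tanh recurrence tends to shrink the gradient when the recurrent weights are small (stable dynamics) and to blow it up when they are large (unstable dynamics).

```python
import numpy as np

def backprop_norms(W, T=50, seed=0):
    """Norm of the accumulated Jacobian product d x_t / d x_1
    for x_t = tanh(W x_{t-1}), as backprop-through-time would form it."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(W.shape[0])
    J = np.eye(W.shape[0])
    norms = []
    for _ in range(T):
        x = np.tanh(W @ x)
        J = (np.diag(1.0 - x**2) @ W) @ J   # chain rule: one more Jacobian factor
        norms.append(np.linalg.norm(J))
    return norms

n = 20
W_small = 0.5 * np.random.default_rng(1).standard_normal((n, n)) / np.sqrt(n)
W_large = 3.0 * np.random.default_rng(1).standard_normal((n, n)) / np.sqrt(n)
print(backprop_norms(W_small)[-1])   # tends toward 0: vanishing gradient
print(backprop_norms(W_large)[-1])   # typically grows very large: exploding gradient
```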
  21. Problems with NNs (in the ’80s–’90s) • Making the network deeper barely improves performance under Back-Prop → Pretraining / ReL / Initialization • It is unclear what is happening inside the NN • Computation is too heavy
  22. Problems with NNs (in the ’80s–’90s) • Making the network deeper barely improves performance under Back-Prop → Pretraining / ReL / Initialization • It is unclear what is happening inside the NN → Visualization Techniques • Computation is too heavy
  24. Outline • Introduction to machine learning • Neural nets and Deep Learning (DL) • Case studies • Deep Convolutional Networks • Recurrent Neural Networks
  25. Key Persons and Research Institutes • Montréal: Bengio • Toronto: Hinton • New York: Le Cun • Ng, Manning • [From the ICML ’12 tutorial by Y. Bengio: “Major breakthrough in 2006 — ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed. Unsupervised feature learners: RBMs, auto-encoder variants, sparse coding variants.”]
  26. Deep NNs, Deep Belief Nets, & Deep Auto-Encoders • Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2007 • Recipe: pretrain a network in a layer-wise manner, stack the networks, and fine-tune (e.g., by BP)
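A minimal sketch of that recipe using denoising autoencoders as the layer-wise learner (an assumed choice; the cited papers also use RBMs), with hypothetical helper names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dae_layer(X, n_hidden, lr=0.1, epochs=5, noise=0.3, seed=0):
    """Train one denoising autoencoder (tied weights) on X (n_samples x n_visible);
    a toy stand-in for the RBM/DAE layer-wise learners used in the papers."""
    rng = np.random.default_rng(seed)
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.random(n_visible) > noise)       # corrupt the input
            h = sigmoid(x_tilde @ W + b_h)                       # encode
            r = sigmoid(h @ W.T + b_v)                           # reconstruct
            dv = (r - x) * r * (1 - r)                           # output error signal
            dh = (dv @ W) * h * (1 - h)                          # backprop to hidden
            W -= lr * (np.outer(dv, h) + np.outer(x_tilde, dh))  # tied-weight gradient
            b_v -= lr * dv
            b_h -= lr * dh
    return W, b_h

def greedy_pretrain(X, layer_sizes=(500, 500, 200)):
    """Recipe: pretrain layer by layer, stacking each new layer on the
    representation produced by the layers below; then fine-tune the stack by BP."""
    stack, data = [], X
    for n_hidden in layer_sizes:
        W, b_h = pretrain_dae_layer(data, n_hidden)
        stack.append((W, b_h))
        data = sigmoid(data @ W + b_h)   # feed the representation upward
    return stack
```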
  27. DBNs/DAEs • (Hinton et al., 2006; Hinton and Salakhutdinov, 2006) • [Slide reproduces Table 1 of Hinton, Osindero & Teh (2006) — MNIST test error rates on the permutation-invariant task: their generative model (784 → 500 → 500 ↔ 2000 ↔ 10) 1.25%, SVM with a degree-9 polynomial kernel 1.4%, various backprop nets 1.51–2.95%, nearest neighbor 2.8–4.4% — together with excerpts describing the RBM learning rule and Figs. 3–4 from Hinton and Salakhutdinov (2006).]
  28. Effect of pretraining • “Effective deep learning became possible through unsupervised pre-training” [Erhan et al., JMLR 2010] • [Figure (from Y. Bengio’s tutorial, slide 47) compares a purely supervised neural net with one using unsupervised pre-training with RBMs and denoising auto-encoders.] (Erhan et al., 2010)
  29. How does pre-training help learning of deep nets? • Analysis of deep linear networks by Saxe et al., 2014 • Pre-training initializes the weight matrices to be orthogonal matrices • The strength of both the error and feedforward signals is then preserved across layers • [Figure 3 of Saxe et al. (2014): learning dynamics of a three-layer network, analytical curves vs. simulation, and the delay in learning due to competitive dynamics.]
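A minimal sketch of orthogonal weight initialization (the property Saxe et al. analyze), built from the QR decomposition of a random Gaussian matrix; names and layer sizes are illustrative:

```python
import numpy as np

def orthogonal_init(n_out, n_in, gain=1.0, seed=0):
    """Return an (n_out, n_in) matrix with orthonormal rows or columns,
    so forward and backward signal norms are (approximately) preserved."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # fix column signs
    W = q[:n_out, :n_in] if n_out >= n_in else q[:n_in, :n_out].T
    return gain * W

layers = [784, 500, 500, 10]
weights = [orthogonal_init(m, n) for n, m in zip(layers[:-1], layers[1:])]
```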
  30. Deep Conv Net • Krizhevsky et al. (2012) • Key points: Rectified Linear Units (ReLU), Dropout, GPGPU • https://code.google.com/p/cuda-convnet/ • [From the authors’ slides: max-pooling layers follow the first, second, and fifth convolutional layers; the number of neurons in each layer is 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000.]
  31. ILSVRC12 / ILSVRC13 [result figures]
  32. ReLU (Rectified Linear Units) • ReL(x) = max(0, x) • [Plot comparing the sigmoid and ReL activation functions.]
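A small illustrative sketch (an assumption about why ReLU helps, not from the slide): the ReL derivative is exactly 1 wherever the unit is active, whereas the sigmoid derivative is at most 0.25 and saturates, so it shrinks backpropagated signals multiplicatively.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)         # 1 where active, 0 elsewhere

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                 # peaks at 0.25, saturates toward 0

x = np.linspace(-3, 3, 7)
print(relu_grad(x))      # [0, 0, 0, 0, 1, 1, 1]
print(sigmoid_grad(x))   # all values <= 0.25
```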
  33. Dropout/DropConnect • Dropout (Krizhevsky et al., 2012): randomly select units and temporarily turn them off • DropConnect (Wan et al., 2013): a generalization of dropout applied to connections • [Diagram: shaded units are the units turned off.]
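A minimal sketch of (inverted) dropout at training time and of the DropConnect variant, with hypothetical names; at test time no units or weights are dropped:

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, rng=None):
    """Randomly zero units with probability p_drop during training and
    rescale the survivors so the expected activation stays unchanged."""
    if not train or p_drop == 0.0:
        return h                          # test time: use all units
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)      # inverted-dropout scaling

def dropconnect_matmul(W, x, p_drop=0.5, rng=None):
    """DropConnect (Wan et al., 2013): mask individual weights instead of units."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(W.shape) >= p_drop).astype(W.dtype)
    return (W * mask) @ x
```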
  37. How does dropout work so well? • Wager et al., 2013; Baldi and Sadowski, 2013 • Dropout is L2-regularization over parameters normalized by the Fisher information • [Figure A.2 of Wager et al. (2013): level surfaces of the likelihood (solid) and of the regularizer (dashed); a classical spherical L2 regularizer versus an L2 regularizer rescaled by diag(I), where I is the Fisher information matrix, so each axis is weighted by the curvature of the likelihood.]
  38. Speech/Audio Processing with Deep CNNs • Zeiler et al. (ICASSP 2013) showed that deep CNNs with ReLU can also be applied to speech data for utterance recognition • Oord and Dieleman (2013) also used deep CNNs to classify music categories from audio data
  39. Visualization of Features (1) • (Le et al., 2012, “Building high-level features using large-scale unsupervised learning”) • [Slide reproduces figures and excerpts from the paper: the architecture of one network layer (local receptive fields, L2 pooling, local contrast normalization); the best neuron detects faces with 81.7% accuracy despite purely unsupervised training (vs. 64.8% for guessing all-negative and 74% for the best random linear filter); histograms of activations for faces vs. random images; the top 48 stimuli and the numerically optimized optimal stimulus; invariance to scale, translation, and out-of-plane rotation; visualizations of the cat-face and human-body neurons.]
  40. Visualization of Features (2) • (Zeiler and Fergus, 2013) • A deconvnet attached to each convnet layer reconstructs an approximate version of the convnet features from the layer beneath; unpooling uses “switches” that record the location of each local max during pooling • Because the model is trained discriminatively, the reconstructions implicitly show which parts of the input image are discriminative; they are not samples from a generative model • [Figure 1 of the paper (deconvnet/convnet layers and the unpooling operation) and feature visualizations for layers 1–3.]
  41. [Feature visualizations for layers 1–3 (Zeiler and Fergus, 2013).]
  42. [Feature visualizations for layers 4–5; per the paper’s Figure 2 caption, the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space with the deconvolutional network — reconstructed patterns from the validation set, not samples from the model.]
  43. Outline • Introduction to machine learning • Neural nets and Deep Learning (DL) • Case studies • Deep Convolutional Networks • Recurrent Neural Networks
  44. RNNLM • Mikolov et al., 2010 • Learning was stabilized by truncating “explosive” gradient vectors • [Slide reproduces Tables 1–4 of the paper: on WSJ, perplexity and WER improve with more training data and when the RNN is interpolated with a KN5 back-off LM (e.g., with 6.4M words: KN5 221 PPL / 13.5% WER vs. KN5 + RNN 156 / 11.7%); mixing static and dynamic RNN LMs helps further; on NIST RT05, adding RNN LMs to the RT09 LM improves WER from 24.1 to 22.8.]
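A minimal sketch of the kind of gradient truncation/clipping used to tame exploding gradients in RNN training (element-wise capping and norm clipping are two common variants; the exact scheme and thresholds here are assumptions, not taken from the paper):

```python
import numpy as np

def clip_elementwise(grad, threshold=15.0):
    """Cap each gradient component to the range [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the whole gradient if its norm exceeds max_norm
    (the clipping strategy analyzed by Pascanu et al., 2013)."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```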
  45. Initialization of RNNs • Sutskever et al. (2013) empirically showed that initialization and momentum critically improve RNN performance • Echo state network based initialization • [Slide reproduces material from Jaeger and Haas (2004): an echo state network keeps a large random “reservoir” (1000 neurons) with fixed internal and feedback connections and trains only the output weights; the internal neurons must act as “echo functions” of the driving signal, a property that can be built into the reservoir.]
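A sketch of echo-state-style recurrent weight initialization: draw a sparse random matrix and rescale it to a target spectral radius so the untrained dynamics neither die out nor explode (parameter values are illustrative, not from the cited work):

```python
import numpy as np

def esn_recurrent_init(n_hidden, spectral_radius=0.95, density=0.1, seed=0):
    """Random sparse recurrent matrix rescaled to the desired spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_hidden, n_hidden))
    W *= (rng.random((n_hidden, n_hidden)) < density)    # sparsify
    eigvals = np.linalg.eigvals(W)
    W *= spectral_radius / np.max(np.abs(eigvals))        # rescale the dynamics
    return W

W_rec = esn_recurrent_init(500)   # e.g., use as the initial recurrent weights of an RNN
```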
  46. Regularization of RNNs (Pascanu et al., 2013) • Exploding gradients are handled by clipping, following the same approach as Mikolov et al. 2010 • Vanishing gradients are handled by introducing a regularizer that prefers solutions in which the error signal preserves its norm as it travels back in time: Ω = Σ_k Ω_k = Σ_k ( ‖(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)‖ / ‖∂E/∂x_{k+1}‖ − 1 )² • [Slide reproduces results from the paper: on the temporal-order task, clipping (SGD-C) and clipping plus regularization (SGD-CR) succeed on sequences up to 200 steps where plain SGD fails; Tables 1–2 show gains on polyphonic music prediction (Piano-midi.de, Nottingham, MuseData) and Penn Treebank character-level language modelling.]
  47. Conclusion • Summarized the history of NNs and how it led to DL • Surveyed the latest DL research trends • Conv NNs hold the best performance in image recognition • RNNs hold the best performance in text prediction • Research into why DL became possible is progressing
  48. Information sources • Conferences: NIPS, ICML, AISTATS, ICLR, ICASSP • Tutorial pages: http://deeplearning.net/, http://deeplearning.net/tutorial/contents.html • Google+: many DL researchers post there, led by Yann LeCun and Yoshua Bengio
