- 1. Deep Learning: Filling the gap between practice and theory. Daisuke Okanohara (Preferred Networks), hillbig@preferred.jp. Aug. 3rd 2017, Summer School of Correspondence and Fusion of AI and Brain Science
- 2. Background: Unreasonable success of deep learning
  - DL has succeeded in solving many complex tasks
    - Image recognition, speech recognition, natural language processing, robot control, computational chemistry, etc.
  - But we don't understand why DL works so well
    - Its success runs far ahead of our understanding
- 3. Background: The DL research process is becoming close to the scientific process
  - Try first, examine next
    - First, we obtain an unexpectedly good result experimentally
    - We then look for a theory that explains why it works so well
  - This process differs from previous ML research
    - Careful design of new algorithms sometimes (or often) doesn't work
    - Many results contradict our intuition
- 4. Outline: Three main unsolved problems in deep learning
  - Why can DL learn?
  - Why can DL recognize and generate real-world data?
  - Why can DL keep and manipulate complex information?
- 5. Why can DL learn ?
- 6. Optimization in training DL
  - Learn a NN model $f(x; \theta)$ by minimizing the training error $L(\theta) = \sum_i l(f(x_i; \theta), y_i)$, where $l(f(x_i; \theta), y_i)$ is a loss function and $\theta$ is the set of parameters
  - E.g. a two-layer feed-forward NN $f(x; \theta) = a(W_2\, a(W_1 x))$, where $a$ is an element-wise activation function such as $a(z) = \max(0, z)$, with the L2 loss $l(f(x_i; \theta), y_i) = \|f(x_i; \theta) - y_i\|^2$
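As a rough illustration of the model and training error on this slide, here is a minimal NumPy sketch. The layer sizes, the random toy data and the helper names (`relu`, `forward`, `training_error`) are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def relu(z):
    # Element-wise activation a(z) = max(0, z).
    return np.maximum(0.0, z)

def forward(x, W1, W2):
    # Two-layer feed-forward network f(x; theta) = a(W2 a(W1 x)).
    return relu(W2 @ relu(W1 @ x))

def training_error(X, Y, W1, W2):
    # L(theta) = sum_i ||f(x_i; theta) - y_i||^2 over all training pairs.
    return sum(np.sum((forward(x, W1, W2) - y) ** 2) for x, y in zip(X, Y))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden width 4, input dimension 3 (illustrative sizes)
W2 = rng.normal(size=(2, 4))   # output dimension 2
X = rng.normal(size=(5, 3))    # 5 toy inputs
Y = rng.normal(size=(5, 2))    # 5 toy targets
loss = training_error(X, Y, W1, W2)
```

Here the parameter set is simply the pair of weight matrices, and `loss` is the quantity that training would minimize.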
- 7. Gradient descent and stochastic gradient descent
  - Gradient descent
    - Compute the gradient $g(\theta)$ of $L(\theta)$ with respect to $\theta$, then update $\theta$ as $\theta_{t+1} := \theta_t - \alpha_t\, g(\theta_t)$, where $\alpha_t > 0$ is the learning rate
  - Stochastic gradient descent
    - Since computing the exact gradient is expensive, we instead use an approximate gradient computed on a sampled subset of the data (a mini-batch $B$): $g'(\theta_t) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta\, l(\theta_t, x_i, y_i)$
  - [Figure: contour of $L(\theta)$ over $(\theta_1, \theta_2)$ with an update step $-\alpha g$]
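The mini-batch update can be sketched on a toy problem. The example below assumes a noiseless linear least-squares task, a constant learning rate of 0.05 and a batch size of 16. All of these are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                      # noiseless toy targets

def minibatch_grad(theta, idx):
    # g'(theta) = (1/|B|) * sum_{i in B} gradient of (x_i . theta - y_i)^2
    residual = X[idx] @ theta - y[idx]
    return 2.0 * X[idx].T @ residual / len(idx)

theta = np.zeros(d)
alpha = 0.05                            # constant learning rate (assumed)
for t in range(500):
    batch = rng.choice(n, size=16, replace=False)         # sample a mini-batch B
    theta = theta - alpha * minibatch_grad(theta, batch)  # theta_{t+1} = theta_t - alpha * g'
```

Each step touches only 16 of the 200 samples, which is the cost saving the slide refers to. On this convex toy problem the iterate approaches `theta_true`.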
- 8. Optimization in deep learning
  - $L(\theta)$ is highly non-convex and includes many local optima, plateaus and saddle points
    - In plateau regions, the gradient becomes almost zero and convergence becomes very slow
    - At saddle points, only a few directions decrease $L(\theta)$, and it is hard to escape from such points
  - [Figure: loss surfaces showing saddle points, a plateau and a local optimum]
- 9. Miracle of deep learning training
  - It was believed that we cannot train large NNs using SGD
    - It seemed impossible to optimize a non-convex problem with over a million dimensions
  - However, SGD can find a solution with low training error
    - With a large model, it often finds a solution with zero training error
    - Moreover, the initialization doesn't matter (cf. K-means, which requires a good initializer)
  - More surprisingly, SGD can find a solution with low test error
    - Although the model is over-parameterized, it does not overfit and it generalizes
  - Practically OK, but we want to know why
- 10. Why can DL learn?
  - Why does DL succeed in finding a solution with low training error?
    - Although the optimization is a highly non-convex problem
  - Why does DL succeed in finding a solution with low test error?
    - Although the NN is over-parameterized and uses no effective regularization
- 11. Loss surface analysis using spherical spin glass model (1/5) [Choromanska+ 2015]
  - Consider a DNN with ReLU $\sigma(x) = \max(0, x)$, where $q$ is the normalization factor [equation not captured in the transcript]
  - We can re-express this [equation not captured in the transcript], where $A_{i,j} = 1$ if the path $(i, j)$ is active and $A_{i,j} = 0$ if the path is inactive
    - ReLU can be considered as a switch: a path is active if all ReLUs on it are active and inactive otherwise
  - [Figure: paths from input $x_i$ to output $Y$ through active and inactive ReLUs; a path is active if all its ReLUs are active]
- 12. Loss surface analysis using spherical spin glass model (2/5)
  - After several assumptions, this function can be re-expressed as an $H$-spin spherical spin-glass model [equation and constraint not captured in the transcript]
  - Now we can use the analysis of the spherical spin-glass model
    - The distribution of critical points is known
    - $k$: index (the number of negative eigenvalues of the Hessian); $k = 0$: local minimum, $k > 0$: saddle point
- 13. Loss surface analysis using spherical spin glass model (3/5): Distribution of critical points
  - Almost no critical points with large $k$ above $LE_{\infty}$ -> few local minima
  - In the band $[LE_0, LE_{\infty}]$, many critical points with small $k$ are found near $LE_0$ -> local minima are close to the global minimum
- 14. Loss surface analysis using spherical spin glass model (4/5): Distribution of test losses
- 15. Loss surface analysis using spherical spin glass model (5/5): Remaining problem
  - This analysis relies on several unrealistic assumptions
    - Such as "each activation is independent of the inputs" and "each path's input is independent"
  - Can we remove these assumptions, or show that they hold in almost all training cases?
- 16. Depth creates no bad local minima [Lu+ 2017]
  - Non-convexity comes from depth and nonlinearity
  - Depth alone creates non-convexity
    - Weight-space symmetry means that there are many distinct configurations with the same loss value, which results in a non-convex epigraph
  - Consider the following feed-forward linear NN: $\min_W L(W) = \|W_H W_{H-1} \cdots W_1 X - Y\|^2$
  - Then, if $X$ and $Y$ have full row rank, all local minima of $L(W)$ are global minima [Theorem 2.3, Lu, Kawaguchi 2017]
- 17. Deep and wide NNs also create no bad local minima [Nguyen+ 2017]
  - If the following conditions hold
    - (1) The activation function $\sigma$ is analytic on $\mathbb{R}$ and strictly monotonically increasing
    - (2) $\sigma$ is bounded*
    - (3) The loss function $l(a)$ is twice differentiable, and $l'(a) = 0$ implies that $a$ is a global minimum
    - (4) The training samples are linearly independent
  - then every critical point at which the weight matrices have full column rank is a global minimum
    - These conditions are met if we use sigmoid, tanh or softplus for $\sigma$ and the squared loss for $l$
    - -> Solved in the case of non-linear NNs under some conditions
- 18. Why can DL learn?
  - Why does DL succeed in finding a solution with low training error?
    - Although the optimization is a highly non-convex problem
  - Why does DL succeed in finding a solution with low test error?
    - Although the NN is over-parameterized and uses no effective regularization
- 19. NN is over-parameterized but achieves generalization
  - Although the number of parameters of a DNN is much larger than the number of samples, the DNN does not overfit and it generalizes
  - Large models tend to achieve low test error
  - [Figure: test error (lower is better) vs. number of parameters. Conventional ML models: "overfitting" is observed when the number of parameters exceeds the number of training samples. DNN: no overfitting is observed; moreover, the test error decreases as the number of parameters increases]
- 20. Random labeling experiment [Zhang+ 17]
  - Model capacity should be restricted to achieve generalization
    - Cf. Rademacher complexity, VC-dimension, uniform stability
  - Conduct an experiment on a copy of the training data where the true labels are replaced by random labels -> the NN model easily fits even the random labels
  - Compare the result with that obtained using regularization techniques -> no significant difference
  - Therefore the NN model has enough capacity to fit random labels, yet it generalizes well without regularization
    - For random labels the NN memorizes the samples, but for true labels the NN learns patterns that generalize [Arpit+ 17]
  - ... WHY?
- 21. SGD plays a significant role in generalization
  - SGD performs approximate Bayesian inference [Mandt+ 17]
    - Bayesian inference provides a sample $\theta \sim P(\theta \mid D)$
  - SGD's noise removes information about the input that is unnecessary for estimating the output [Shwartz-Ziv+ 17]
    - During training, the mutual information between the input and the network decreases, while that between the network and the output is kept
  - Sharpness and the norms of the weights also relate to generalization
    - Flat minima generalize, but flatness depends on the scale of the weights
    - If we find a flat minimum with a small weight norm, it generalizes [Neyshabur+ 17]
  - [Figure: a flat minimum and a sharp minimum]
- 22. Training always converges to a solution with low test error [Wu+ 17]
  - Even when we optimize the model from different initializations, training always converges to a solution with low test error
  - Flat minima have large basins while sharp minima have small basins
    - Almost all parameters converge to flat minima
  - Flat minima correspond to low model complexity = low test error
  - Question: why does NN learning induce flat minima?
  - [Figure: flat minima have large basins; sharp minima have small basins]
- 23. Why can DL recognize and generate real world data ?
- 24. Why does deep learning work? Lin's hypothesis [Lin+ 16]
  - Real-world phenomena have the following characteristics
    - 1. Low-order polynomials: known physical interactions are polynomials of at most 4th order
    - 2. Local interaction: the number of interactions between objects increases linearly
    - 3. Symmetry: few degrees of freedom
    - 4. Markovian: most generation processes depend only on the previous state
  - -> DNNs can exploit these characteristics
- 25. Generation and recognition (1/2)
  - Data $x$ is generated from unknown factors $z$
  - Generation and recognition are inverse operations
    - Generation: $z \to x$
    - Recognition (inference): $x \to z$, i.e. infer the posterior $P(z \mid x)$
  - E.g. image generation and recognition
    - $z$: object, position of camera, lighting condition, e.g. (Dragon, [10, 2, -4], white)
    - $x$: image
- 26. Generation and recognition (2/2)
  - Data is often generated from multiple factors
    - Uninteresting factors are sometimes called covariates or disturbance variables among the hidden variables
  - The generation process can be very complex
    - Each step can be non-linear
    - Gaussian and non-Gaussian noise is added at several steps
    - E.g. image rendering requires dozens of steps
  - In general, the generation process is unknown
    - Any modeled generation process is an approximation of the actual process
  - [Figure: graphical model over $z_1$, $z_2$, $c$, $h$, $h_m$ and $x$]
- 27. Why do we consider generative models?
  - For more accurate recognition and inference
    - If we know the generation process, we can improve recognition and inference
      - "What I cannot create, I do not understand" Richard Feynman
      - "Computer vision is inverse computer graphics" Geoffrey Hinton
    - By inverting the generation process, we obtain the recognition process
  - For transfer learning
    - By changing covariates, we can transfer the learned model to other environments
  - For sampling examples to compute statistics and for validation
- 28. E.g. Mapping of hand-written data into 2D using VAE
  - The original hand-written data is high-dimensional (784-dim)
  - If we map the data into a 2-dim space, digit types and shapes change smoothly
  - To classify "1", we only need to find a simple boundary in this space
- 29. Representation learning is more powerful than the nearest neighbor method and manifold learning
  - Representation learning can significantly reduce the number of training samples required [Arora+ 2017]
  - A distance metric or neighborhood notion defined on the original space may not work
    - Ideally, nearby samples help to determine the label
    - In reality, samples with the same label (e.g. "man with glasses") are located in very different places in the original space; their region may not even be connected there
- 30. Real-world data is distributed on a low-dimensional manifold
  - Each point corresponds to a possible data sample; the data is distributed in a low-dimensional space
  - Cf. the distribution of galaxies in the universe
  - Why does a low-dimensional manifold appear? Low-dimensional factors are converted to high-dimensional data without increasing the complexity [Lin+ 16]
- 31. Original space and latent space
  - In the latent space, the meaning of the data changes smoothly
  - [Figure: generation maps the latent space to the original space; recognition maps back]
- 32. Learning is easy in the latent space
  - Since many tasks are related to the factors, the classification boundary becomes simple in the latent space
  - Many training examples are required in the original space; few are required in the latent space
  - [Figure: generation maps the latent space to the original space; recognition maps back]
- 33. How to learn a generative and inference model?
  - The generation process and its counterpart, the recognition process, are highly non-linear and complex
  - -> Use a deep neural network to approximate them
    - Generation: $x = f(z)$
    - Recognition: $z = g(x)$
- 34. Deep generative models

  | Model | Fast sampling of x | Compute the likelihood P(x) | Produce sharp image | Stable training |
  |---|---|---|---|---|
  | VAE [Kingma+ 14] | √ | △ Lower bound (IW-VAE [Burda+ 15]) | X | √ |
  | GAN [Goodfellow+ 14, 16] (IPM) | √ | X | √ | X-△ |
  | Autoregressive [Oord+ 16ab] | △-√ (Parallel multi-scale [Reed+ 17]) | √ | √ | √ |
  | Energy model [Zhao+ 16] [Dai+ 17] | △-√ | △ Up to constant | √ | △ |
- 35. VAE: Variational AutoEncoder [Kingma+ 14]
  - A neural network outputs the mean and covariance: $(\mu, \sigma) = \mathrm{Dec}(z; \phi)$
  - Generate $x$ in the following steps
    - (1) Sample $z \sim N(0, I)$
    - (2) Compute $(\mu, \sigma) = \mathrm{Dec}(z; \phi)$
    - (3) Sample $x \sim N(\mu, \sigma I)$
  - Defined distribution: $p(x) = \int p(x \mid z)\, p(z)\, dz$
  - [Figure: $z$ -> $(\mu, \sigma)$ -> $x$]
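The three generation steps can be sketched as follows. The decoder here is a hypothetical stand-in (two random linear maps, with an exponential to keep $\sigma$ positive), and $\sigma$ is treated as a per-dimension standard deviation. Both are assumptions for illustration, since the slides do not specify the decoder architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5            # illustrative sizes

# Hypothetical stand-in for Dec(z; phi): one random linear map per output head.
W_mu = rng.normal(size=(data_dim, latent_dim))
W_sigma = rng.normal(size=(data_dim, latent_dim))

def dec(z):
    mu = W_mu @ z
    sigma = np.exp(0.1 * (W_sigma @ z))   # exponentiate so that sigma > 0
    return mu, sigma

def sample_x():
    z = rng.standard_normal(latent_dim)              # (1) z ~ N(0, I)
    mu, sigma = dec(z)                               # (2) (mu, sigma) = Dec(z; phi)
    x = mu + sigma * rng.standard_normal(data_dim)   # (3) x ~ N(mu, sigma), sigma as std dev
    return x

samples = np.stack([sample_x() for _ in range(1000)])
```

Because every draw of `z` gives a different Gaussian over `x`, the collection `samples` follows the mixture $p(x) = \int p(x \mid z)\, p(z)\, dz$ rather than a single Gaussian.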
- 36. VAE: Variational AutoEncoder, induced distribution
  - $p(x \mid z)$ is a Gaussian, and $p(x) = \int p(x \mid z)\, p(z)\, dz$ corresponds to a mixture of (infinitely many) Gaussians
    - The neural network can model a complex relation between $z$ and $x$
- 37. VAE: Variational AutoEncoder
  - Use maximum likelihood estimation to learn the parameter $\theta$
  - Since the exact likelihood is intractable, we instead maximize a lower bound of the likelihood known as the ELBO (evidence lower bound)
  - The proposal distribution $q(z \mid x)$ should be close to the true posterior $p(z \mid x)$
  - Maximizing with respect to $q(z \mid x)$ corresponds to minimizing $\mathrm{KL}(q(z \mid x) \,\|\, p(z \mid x))$ = the encoder is learned as a side effect
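The bound itself is not captured in the transcript. For reference, the standard form of the ELBO from [Kingma+ 14] can be written as below, where $\theta$ denotes the parameter of the generative model (a notational assumption here):

```latex
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q(z|x)}\right]
     + \mathrm{KL}\big(q(z|x) \,\|\, p_\theta(z|x)\big) \\
  &\ge \mathbb{E}_{q(z|x)}\!\left[\log p_\theta(x|z)\right]
     - \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)
   = \mathrm{ELBO}(\theta, q; x)
\end{aligned}
```

The gap between $\log p_\theta(x)$ and the ELBO is exactly $\mathrm{KL}(q(z|x) \,\|\, p_\theta(z|x))$, which is why maximizing the bound over $q$ pulls the proposal toward the true posterior.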
- 38. Reparameterization trick
  - Since we take an expectation with respect to $q(z \mid x)$, it is difficult to compute the gradient of the ELBO with respect to $q(z \mid x)$ -> we can use the reparameterization trick!
  - The converted computation graph can be regarded as an auto-encoder in which noise $\epsilon\sigma$ is added to the latent variable $\mu$
  - [Figure: computation graph $x$ -> $(\mu, \sigma)$, $\epsilon$ -> $z$ -> $(\mu', \sigma')$ -> $x'$]
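A small numerical sketch of the trick: writing $z = \mu + \sigma\epsilon$ with $\epsilon \sim N(0, 1)$ moves the randomness outside the parameters, so gradients pass through $z$ by the chain rule. The toy integrand $f(z) = z^2$ and the values of $\mu$ and $\sigma$ are assumptions chosen because the exact answer is known: $E[z^2] = \mu^2 + \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                     # illustrative encoder outputs for one x

eps = rng.standard_normal(200_000)       # noise that does not depend on (mu, sigma)
z = mu + sigma * eps                     # reparameterized sample, z ~ N(mu, sigma^2)

# Toy integrand f(z) = z^2 stands in for the term inside the ELBO expectation.
df_dz = 2.0 * z
grad_mu = np.mean(df_dz * 1.0)           # chain rule: dz/dmu = 1
grad_sigma = np.mean(df_dz * eps)        # chain rule: dz/dsigma = eps
```

The Monte Carlo estimates `grad_mu` and `grad_sigma` should be close to the exact derivatives $2\mu$ and $2\sigma$ of $\mu^2 + \sigma^2$.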
- 39. The problem of maximum likelihood estimation against low-dimensional manifold data (1/2) [Arjovsky+ 17ab]
  - Maximum likelihood estimation (MLE) estimates a distribution $P(x)$ using a model $Q(x)$: $L_{\mathrm{MLE}}(P, Q) = \sum_x P(x) \log Q(x)$
    - Usually $P$ is replaced with the empirical distribution: $\frac{1}{N} \sum_i \log Q(x_i)$
  - For low-dimensional manifold data, $P(x) = 0$ for most $x$
  - To model such $P$, $Q(x)$ should also satisfy $Q(x) = 0$ for most $x$
  - With such $Q(x)$, $\log Q(x_i)$ is undefined (or NaN) when $Q(x_i) = 0$, so we cannot optimize $Q(x)$ using MLE
  - To solve this -> use $Q(x)$ such that $Q(x_i) > 0$ for all $\{x_i\}$
    - E.g. $Q(x) = N(\mu, \sigma)$, which means a sample is $\mu$ with added noise $\sigma$
- 40. The problem of maximum likelihood estimation against low-dimensional manifold data (2/2)
  - MLE requires $Q(x_i) > 0$ for all $\{x_i\}$
  - To solve this -> use $Q(x)$ such that $Q(x_i) > 0$ for all $\{x_i\}$
  - $Q(x) = N(\mu, \sigma)$ means a sample is $\mu$ with added noise $\sigma$
    - This makes images blurry
  - Another difficulty is that MLE has no notion of closeness with respect to the geometry of the space
    - When the areas of the intersections are the same, MLE gives the same score: although the left distribution is closer to the true distribution, the MLE scores are the same
- 41. GAN (Generative Adversarial Net) [Goodfellow+ 14, 16]
  - Two neural networks compete to learn a distribution
  - Generator (counterfeiter)
    - Goal: deceive the discriminator
    - Learns to generate realistic samples that can deceive the discriminator
  - Discriminator (police)
    - Goal: detect samples generated by the generator
    - Learns to detect the difference between real and generated samples
  - [Figure: a sample chosen randomly from the generator or the real data is passed to the discriminator, which outputs real or fake]
- 42. GAN: Generative adversarial net
  - Sample $x$ in the following steps
    - (1) Sample $z \sim U(0, I)$
    - (2) Compute $x = G(z)$
  - No noise is added at the last step
- 43. Training of GAN
  - Use a discriminator $D(x)$
    - It outputs 1 if $x$ is estimated to be real and 0 otherwise
  - Train $D$ to maximize $V$ and $G$ to minimize $V$
    - If learning succeeds, it reaches the following Nash equilibrium: $\int p(z)\, G(z)\, dz = P(x)$, $D(x) = 1/2$
    - Since $D$ provides $dD(x)/dx$ to update $G$, the two actually cooperate to learn $P(x)$
  - [Figure: $z$ -> $x' = G(z)$; a real $x$ or a generated $x'$ -> $y = D(x)$ in {1 (real), 0 (fake)}]
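The value function $V$ is not captured in the transcript. For reference, the standard objective from [Goodfellow+ 14] is shown below, with $P_G$ denoting the distribution of $G(z)$ (a symbol introduced here for illustration):

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim P(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big],
\qquad
D^*(x) = \frac{P(x)}{P(x) + P_G(x)}
```

For a fixed generator the optimal discriminator is $D^*$, so when the generated distribution matches the data ($P_G = P$) it outputs $1/2$ everywhere, which is the equilibrium stated on the slide.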
- 44. Modeling low-dimensional manifolds
  - When $z$ is low-dimensional, the deterministic function $x = F(z)$ outputs a low-dimensional manifold in the space of $x$ (e.g. $z \in \mathbb{R}^1$, $x \in \mathbb{R}^2$)
  - Using CNNs for $G(z)$ and $D(x)$ is also important
    - $D(x)$ gives similar scores when $x$ and $x'$ are similar
  - A recent study showed that training without a discriminator can also generate realistic data [Bojanowski+ 17]
  - These two factors are important for producing realistic data
- 45. Demonstration of GAN training http://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/
  - Each generated sample follows $dD(x)/dx$
- 46. Training GAN https://github.com/mattya/chainer-DCGAN After 30 minutes
- 47. After 2 hours
- 48. After 1 day
- 49. [Figure only]
- 50. LSGAN [Mao+ 16]
- 51. Stacked GAN http://mtyka.github.io/machine/learning/2017/06/06/highres-gan-faces.html
- 52. New GAN papers are coming out every week
  - GAN Zoo https://github.com/hindupuravinash/the-gan-zoo
  - Since GAN provides a new way to train a probabilistic model, many GAN papers are coming out (20 papers/month as of Jul. 2017)
  - Interpretation of the GAN framework
    - Wasserstein distance, integral probability metric, inverse RL
  - New stable training methods
    - Lipschitzness of D, ensemble of Ds, etc.
  - New applications
    - Speech, text, inference model ($q(z \mid x)$)
  - Conditional GAN
    - Multi-class, super-resolution
- 53. Super resolution + regression loss for perception network [Chen+ 17]
  - Generate a photo-realistic image from a segmentation result
    - High resolution, globally consistent, stable training
  - [Figure: input: segmentation; output: photo-realistic image]
- 54. ICA: Independent component analysis. Reference: [Hyvärinen+ 01]
  - Find components $z$ that generate data $x$: $x = f(z)$, where $f$ is an unknown function called the mixing function and the components are independent of each other, $p(z) = \prod_i p(z_i)$
  - When $f$ is linear and $p(z_i)$ is non-Gaussian, we can identify $f$ and $z$ correctly
  - However, when $f$ is nonlinear, we cannot identify $f$ and $z$
    - There are infinitely many possible $f$ and $z$
  - -> When the data is a time series $x(1), x(2), \ldots, x(n)$ generated from $z$ that are (1) non-stationary or (2) stationary independent sources, we can identify the non-linear $f$ and $z$
- 55. Non-linear ICA for non-stationary time series data [Hyvärinen+ 16]
  - When the sources are independent and non-stationary, we can identify a non-linear mixing function $f$ and $z$
  - Assumption: the sources change slowly
    - The sources can be considered stationary within a short time segment
    - Many interesting data sets have this property
  - Procedure
    - 1. Divide the time series data into segments
    - 2. Train a multi-class classifier to classify each data point into its segment
    - 3. The last layer's features correspond to (a linear mixture of) the independent sources
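Step 1 of this procedure, together with the labels needed for step 2, might look like the sketch below. The helper name `segment_labels`, the toy series and the choice of 10 equal-length segments are assumptions for illustration. The classifier itself is omitted.

```python
import numpy as np

def segment_labels(n_samples, n_segments):
    # Step 1: split time indices 0..n_samples-1 into contiguous, near-equal segments
    # and return the segment index of each time point (the class label for step 2).
    return (np.arange(n_samples) * n_segments) // n_samples

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 4))          # toy observed series with 4 channels
labels = segment_labels(len(x), n_segments=10)
# Step 2 (not run here): fit any multi-class classifier on the pairs (x[t], labels[t]).
# Step 3: read out the features of its last hidden layer.
```

The only supervision is the segment index, so the classifier has to pick up on how the statistics of the data differ between segments.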
- 56. Non-linear ICA for stationary time series data [Hyvärinen+ 17]
  - When the sources are independent and stationary, we can also identify a non-linear mixing function $f$ and $z$
  - The sources should be uniformly dependent
    - for $x = s(t)$ and $y = s(t-1)$ [condition not captured in the transcript]
  - Procedure
    - 1. Train a binary classifier to decide whether a given pair of data points is adjacent, $(x(t), x(t+1))$, or random, $(x(t), x(u))$
    - 2. The last layer's features correspond to (a linear mixture of) the independent sources
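The training set for step 1 can be built as in this sketch. The helper name `make_pairs` and the uniform choice of the random index $u$ are assumptions for illustration. The binary classifier itself is omitted.

```python
import numpy as np

def make_pairs(x, rng):
    # Label 1: temporally adjacent pairs (x(t), x(t+1)).
    # Label 0: pairs (x(t), x(u)) with u drawn uniformly at random.
    n = len(x)
    t = np.arange(n - 1)
    u = rng.integers(0, n, size=n - 1)
    adjacent = np.concatenate([x[t], x[t + 1]], axis=1)
    shuffled = np.concatenate([x[t], x[u]], axis=1)
    pairs = np.concatenate([adjacent, shuffled], axis=0)
    labels = np.concatenate([np.ones(n - 1, dtype=int), np.zeros(n - 1, dtype=int)])
    return pairs, labels

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))           # toy observed series with 3 channels
pairs, labels = make_pairs(x, rng)
```

Each row of `pairs` holds two concatenated observations, so a classifier trained on it has to detect temporal dependence between the two halves.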
- 57. Conjectures [Okanohara]
  - Train a multi-class classifier with a very large number of classes (e.g. ImageNet). Then the features of the last layer correspond to (a mixture of) independent components
    - To show this, we need a reasonable model relating the set of labels to the independent components
    - Dark knowledge [Hinton14] is effective for transferring the model because it reveals the independent components
  - Similarly, GAN discriminators (or energy functions) also extract the independent components
- 58. Why can DL keep and manipulate complex information ?
- 59. Information abstraction level
  - Abstract knowledge
    - Text, relations
  - Model
    - Simulator / generative model
  - Raw experience
    - Sensory stream
  - [Figure: axis from abstract (small volume, independent of problem/task/context) to detailed (large volume, dependent on problem/task/context)]
- 60. Local representation vs distributed representation
  - Local representation
    - Each concept is represented by one symbol
    - E.g. Giraffe=1, Panda=2, Lion=3, Tiger=4
    - No interference, noise immunity, precise
  - Distributed representation
    - Each concept is represented by a set of symbols, and each symbol participates in representing many concepts
    - Generalizable
    - Less accurate
    - Interference
  - [Table: features (long neck, four legs, body hair, paw pad) marked for each of Giraffe, Panda, Lion, Tiger]
- 61. High-dimensional vector vs low-dimensional vector
  - High-dimensional vector
    - Two random vectors are almost always nearly orthogonal
    - Many concepts can be stored within one vector
      - $w = x + y + z$
    - Same characteristics as local representation
  - Low-dimensional vector
    - Concepts interfere with each other
    - Cannot keep precise memory
    - Beneficial for generalization
  - Interference and generalization are strongly related
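The near-orthogonality claim and the superposition $w = x + y + z$ can be checked numerically. The dimensions (10,000 vs 3), the number of vectors (50) and the helper names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(count, dim):
    v = rng.standard_normal((count, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def max_abs_cosine(vectors):
    # Largest |cosine| between any two distinct vectors in the set.
    gram = vectors @ vectors.T
    off_diag = gram[~np.eye(len(vectors), dtype=bool)]
    return np.max(np.abs(off_diag))

high = random_unit_vectors(50, 10_000)
low = random_unit_vectors(50, 3)
cos_high = max_abs_cosine(high)   # small: all pairs are nearly orthogonal
cos_low = max_abs_cosine(low)     # large: some pairs are nearly parallel

# Store three "concepts" in one vector and probe it with dot products.
w = high[0] + high[1] + high[2]
stored_score = w @ high[0]        # concept that was stored
other_score = w @ high[10]        # concept that was not stored
```

In high dimension the probe separates stored from unstored concepts cleanly. In the 3-dimensional case the large cosines mean stored items would interfere with one another.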
- 62. Two-layer feedforward network = memory-augmented network [Vaswani+ 17]
  - Memory-augmented network: $a = V\,\mathrm{Softmax}(Kq)$
    - $K$ is a key matrix (the $i$-th row is the key for the $i$-th memory)
    - $V$ is a value matrix (the $i$-th column is the value for the $i$-th memory)
    - We may use winner-take-all instead of Softmax
  - Two-layer feedforward network: $a = W_2\,\mathrm{ReLU}(W_1 x)$
    - The $i$-th row of $W_1$ corresponds to the key for the $i$-th memory
    - The $i$-th column of $W_2$ corresponds to the value for the $i$-th memory
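The correspondence can be made concrete with a small sketch. The sizes, the unit-norm random keys and the scaled query are assumptions chosen so that the softmax becomes nearly one-hot. The function names are hypothetical.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def memory_read(K, V, q):
    # a = V softmax(K q): rows of K are keys, columns of V are values.
    return V @ softmax(K @ q)

def two_layer(W1, W2, x):
    # a = W2 ReLU(W1 x): the same computation with ReLU as the gate.
    return W2 @ np.maximum(0.0, W1 @ x)

rng = np.random.default_rng(0)
n_mem, d_key, d_val = 8, 64, 4
K = rng.standard_normal((n_mem, d_key))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys
V = rng.standard_normal((d_val, n_mem))

q = 50.0 * K[3]                 # query strongly aligned with key 3
a_soft = memory_read(K, V, q)   # softmax is nearly one-hot, so this recalls V[:, 3]
a_relu = two_layer(K, V, q)     # reuse K as W1 and V as W2
```

With softmax the read returns approximately the single value `V[:, 3]`. With ReLU every key that matches positively contributes, so the output is a weighted blend of several stored values.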
- 63. Three-layer feed-forward network is also a memory-augmented network [Okanohara unpublished]
  - A three-layer feed-forward network $a = W_3\,\mathrm{ReLU}(W_2\,\mathrm{ReLU}(W_1 x))$ can be read as follows: the first layer is used for computing keys and the second stores key and t...
  - Key: $\mathrm{ReLU}(W_1 x)$
  - The $i$-th row of $W_2$ corresponds to the key of the $i$-th memory cell
  - The $i$-th column of $W_3$ corresponds to the value of the $i$-th memory cell
- 64. Two-layer NN update rule interpretation [Okanohara unpublished]
  - The update rule of a two-layer feedforward network with $h = \mathrm{ReLU}(W_1 x)$, $a = W_2 h$ is
    - $dh = W_2^T\, da$
    - $dW_2 = da\, h^T$
    - $dW_1 = \mathrm{diag}(\mathrm{ReLU}'(W_1 x))\, dh\, x^T = \mathrm{diag}(\mathrm{ReLU}'(W_1 x))\, W_2^T\, da\, x^T$
  - These update rules correspond to storing the error ($da$) as a value and storing the input ($x$) as a key of the memory network
    - Only active memories ($\mathrm{ReLU}'(W_1 x) = 1$) are updated
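One way to sanity-check these update rules is to compare them with finite differences. The sketch below assumes a squared loss $\frac{1}{2}\|a - \text{target}\|^2$, so that $da = a - \text{target}$, and uses small random matrices. The loss and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)
target = rng.standard_normal(2)

def loss(W1_, W2_):
    h_ = np.maximum(0.0, W1_ @ x)
    return 0.5 * np.sum((W2_ @ h_ - target) ** 2)

# Backward pass written in the memory-network reading of the update rules.
pre = W1 @ x
h = np.maximum(0.0, pre)
a = W2 @ h
da = a - target                       # error signal for the assumed squared loss
dh = W2.T @ da
dW2 = np.outer(da, h)                 # value update: error da written per activation h
gate = (pre > 0).astype(float)        # ReLU'(W1 x): 1 for active memories, else 0
dW1 = np.outer(gate * dh, x)          # key update: input x written only to active rows

def numeric_grad_W1(eps=1e-6):
    # Central finite differences, one entry of W1 at a time.
    g = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            Wp, Wm = W1.copy(), W1.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (loss(Wp, W2) - loss(Wm, W2)) / (2 * eps)
    return g
```

Rows of `dW1` that belong to inactive units are exactly zero, which matches the "update only for active memories" reading.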
- 65. ResNet is a memory-augmented network [Okanohara unpublished]
  - Since a ResNet block has the form $h = h + \mathrm{Resnet}(h)$, where $\mathrm{Resnet}(h)$ consists of two layers, we can interpret it as recalling a memory and adding it to the current vector
    - The squeeze operation corresponds to limiting the number of memory cells
  - ResNet looks up memory iteratively
    - Large number of steps = large number of memory lookups
  - This interpretation differs from the shortcut view [He+ 15] and from unrolled iterative estimation [Greff+ 16]
- 66. Infinite memory network
  - What happens if we increase the number of hidden units incrementally for each training sample?
    - This is similar to "Memory Networks", where previous hidden activations are stored in an explicit memory, or to "Progressive Networks" [Rusu+ 16], where a new network is added incrementally (and the old networks are fixed) for each new task
  - We expect that this can prevent catastrophic forgetting and achieve one-shot learning
    - How can we ensure generalization?
- 67. Conclusion
  - There are still many unsolved problems in DNN
    - Why can DNNs learn in general settings?
    - How to represent real-world information?
  - There are still many unsolved problems in AI
    - Disentanglement of information
    - One-shot learning using attention and memory mechanisms
      - Avoid catastrophic forgetting and interference
    - Stable, data-efficient reinforcement learning
    - How to abstract information
      - Grounding (language), strong noise (e.g. dropout), extracting hidden factors by using (non-)stationarity or commonality among tasks
- 68. References
  - [Choromanska+ 2015] "The loss surface of multilayer networks", A. Choromanska et al., AISTATS 2015
  - [Lu+ 2017] "Depth Creates No Bad Local Minima", H. Lu et al., arXiv:1702.08580
  - [Nguyen+ 2017] "The loss surface of deep and wide neural networks", Q. Nguyen et al., arXiv:1704.08045
  - [Zhang+ 2017] "Understanding deep learning requires rethinking generalization", C. Zhang et al., ICLR 2017
  - [Arpit+ 2017] "A Closer Look at Memorization in Deep Networks", D. Arpit et al., ICML 2017
  - [Mandt+ 2017] "Stochastic Gradient Descent as Approximate Bayesian Inference", S. Mandt et al., arXiv:1704.04289
  - [Shwartz-Ziv+ 2017] "Opening the Black Box of Deep Neural Networks via Information", R. Shwartz-Ziv et al., arXiv:1703.00810
- 69. References (continued)
  - [Neyshabur+ 17] "Exploring Generalization in Deep Learning", B. Neyshabur et al., arXiv:1706.08947
  - [Wu+ 17] "Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes", L. Wu et al., arXiv:1706.10239
  - [Lin+ 16] "Why does deep and cheap learning work so well?", H. W. Lin et al., arXiv:1608.08225
  - [Arora+ 17] "Provable benefits of representation learning", S. Arora et al., arXiv:1706.04601
  - [Kingma+ 14] "Auto-Encoding Variational Bayes", D. P. Kingma et al., ICLR 2014
  - [Burda+ 15] "Importance Weighted Autoencoders", Y. Burda et al., arXiv:1509.00519
- 70. References (continued)
  - [Goodfellow+ 14] "Generative Adversarial Nets", I. Goodfellow et al., NIPS 2014
  - [Goodfellow 16] "NIPS 2016 Tutorial: Generative Adversarial Networks", I. Goodfellow, arXiv:1701.00160
  - [Oord+ 16a] "Conditional Image Generation with PixelCNN Decoders", A. Oord et al., NIPS 2016
  - [Oord+ 16b] "WaveNet: A Generative Model for Raw Audio", A. Oord et al., arXiv:1609.03499
  - [Reed+ 17] "Parallel Multiscale Autoregressive Density Estimation", S. Reed et al., arXiv:1703.03664
  - [Zhao+ 16] "Energy-based Generative Adversarial Network", J. Zhao et al., arXiv:1609.03126
  - [Dai+ 17] "Calibrating Energy-based Generative Adversarial Networks", Z. Dai et al., ICLR 2017
- 71. References (continued)
  - [Arjovsky+ 17a] "Towards principled methods for training generative adversarial networks", M. Arjovsky et al., arXiv:1701.04862
  - [Arjovsky+ 17b] "Wasserstein Generative Adversarial Networks", M. Arjovsky et al., ICML 2017
  - [Bojanowski+ 17] "Optimizing the Latent Space of Generative Networks", P. Bojanowski et al., arXiv:1707.05776
  - [Chen+ 17] "Photographic Image Synthesis with Cascaded Refinement Networks", Q. Chen et al., arXiv:1707.09405
  - [Hyvärinen+ 01] "Independent Component Analysis", A. Hyvärinen et al., John Wiley & Sons, 2001
  - [Hyvärinen+ 16] "Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA", A. Hyvärinen et al., NIPS 2016
  - [Hyvärinen+ 17] "Nonlinear ICA of Temporally Dependent Stationary Sources", A. Hyvärinen et al., AISTATS 2017
- 72. References (continued)
  - [Vaswani+ 17] "Attention is all you need", A. Vaswani et al., arXiv:1706.03762 (the idea appears only in version 3, https://arxiv.org/abs/1706.03762v3)
  - [He+ 15] "Deep Residual Learning for Image Recognition", K. He et al., arXiv:1512.03385
  - [Rusu+ 16] "Progressive Neural Networks", A. Rusu et al., arXiv:1606.04671