- 1. Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods. Anima Anandkumar, U.C. Irvine
- 2. Learning with Big Data: Learning is finding a needle in a haystack.
- 3. Learning with Big Data: Learning is finding a needle in a haystack. High-dimensional regime: as data grows, so does the number of variables! Useful information: low-dimensional structures. Learning with big data: an ill-posed problem.
- 4. Learning with Big Data: Learning is finding a needle in a haystack. High-dimensional regime: as data grows, so does the number of variables! Useful information: low-dimensional structures. Learning with big data: an ill-posed problem. Learning with big data: statistically and computationally challenging!
- 5. Optimization for Learning: Most learning problems can be cast as optimization. Unsupervised learning: clustering (k-means, hierarchical, ...); maximum likelihood estimation for probabilistic latent variable models. Supervised learning: optimizing a neural network with respect to a loss function. [Figure: neural network with input, neuron, and output.]
- 6. Convex vs. Non-convex Optimization: Progress is only the tip of the iceberg. Images taken from https://www.facebook.com/nonconvex
- 7. Convex vs. Non-convex Optimization: Progress is only the tip of the iceberg. The real world is mostly non-convex! Images taken from https://www.facebook.com/nonconvex
- 8. Convex vs. Non-convex Optimization: Convex: unique optimum (global/local). Non-convex: multiple local optima.
- 9. Convex vs. Non-convex Optimization: Convex: unique optimum (global/local). Non-convex: multiple local optima; in high dimensions, possibly exponentially many local optima.
- 10. Convex vs. Non-convex Optimization: Convex: unique optimum (global/local). Non-convex: multiple local optima; in high dimensions, possibly exponentially many local optima. How to deal with non-convexity?
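To make the "multiple local optima" point concrete, here is a minimal sketch that is not taken from the slides: plain gradient descent on a hypothetical one-dimensional non-convex function, started from two different points. The function `f`, the step size, and the starting points are all illustrative assumptions.

```python
# Hypothetical non-convex objective with two local minima (not from the slides).
def f(x):
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_left = gradient_descent(-2.0)   # ends in the minimum on the left
x_right = gradient_descent(2.0)   # ends in a different, worse minimum on the right
print(x_left, f(x_left))
print(x_right, f(x_right))
```

The two runs stop at different stationary points with different objective values, so the answer returned by a local search depends on its initialization.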
- 11. Outline: 1. Introduction; 2. Spectral Methods: Matrices and Tensors; 3. Applications of Tensor Methods; 4. Conclusion
- 12. Matrices and Tensors as Data Structures: Modern data: multi-modal, multi-relational data. Matrices: pairwise relations. Tensors: higher-order relations. Multi-modal data figure from Lise Getoor's slides.
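- 13. Tensor Representation of Higher-Order Moments: Matrix (pairwise moments): $\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second-order tensor, with $\mathbb{E}[x \otimes x]_{i_1,i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$. For matrices: $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$. Tensor (higher-order moments): $\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third-order tensor, with $\mathbb{E}[x \otimes x \otimes x]_{i_1,i_2,i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.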
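The moment definitions on this slide can be illustrated with empirical averages. The following is a small sketch using NumPy; the synthetic Gaussian data and the sizes `n` and `d` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 4                      # illustrative sample count and dimension
X = rng.normal(size=(n, d))         # rows are samples x_1, ..., x_n

# Empirical second-order moment: M2[i, j] averages x_i * x_j over the samples.
M2 = np.einsum("ni,nj->ij", X, X) / n

# Empirical third-order moment: M3[i, j, k] averages x_i * x_j * x_k.
M3 = np.einsum("ni,nj,nk->ijk", X, X, X) / n

print(M2.shape, M3.shape)
```

The second-order moment coincides with the usual matrix form `X.T @ X / n`, while the third-order moment has no matrix equivalent and needs a three-way array.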
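- 14. Spectral Decomposition of Tensors: Matrix $M_2 = \sum_i \lambda_i u_i \otimes v_i = \lambda_1 u_1 \otimes v_1 + \lambda_2 u_2 \otimes v_2 + \cdots$.
- 15. Spectral Decomposition of Tensors: Matrix $M_2 = \sum_i \lambda_i u_i \otimes v_i = \lambda_1 u_1 \otimes v_1 + \lambda_2 u_2 \otimes v_2 + \cdots$. Tensor $M_3 = \sum_i \lambda_i u_i \otimes v_i \otimes w_i = \lambda_1 u_1 \otimes v_1 \otimes w_1 + \lambda_2 u_2 \otimes v_2 \otimes w_2 + \cdots$. We have developed efficient methods to solve tensor decomposition.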
- 16. Tensor Decomposition Methods: SVD on pairwise moments to obtain $W$; fast SVD via random projection. Dimensionality reduction of third-order moments using $W$. Power method on the reduced tensor $T$: $v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}$. Guaranteed to converge to the correct solution!
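The power update on this slide can be sketched in a few lines. The code below is an illustrative sketch, not the speaker's implementation: it assumes the tensor is already symmetric and orthogonally decomposable (which is what the SVD-based reduction with $W$ is meant to provide), and the function name, restart count, and deflation step are assumptions.

```python
import numpy as np


def tensor_power_method(T, n_components, n_restarts=10, n_iters=100, seed=0):
    """Recover (eigenvalue, eigenvector) pairs of a symmetric, orthogonally
    decomposable third-order tensor by power iteration with deflation."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    d = T.shape[0]
    values, vectors = [], []
    for _ in range(n_components):
        best_val, best_vec = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum("ijk,j,k->i", T, v, v)   # T(I, v, v)
                v /= np.linalg.norm(v)                 # normalize
            val = np.einsum("ijk,i,j,k->", T, v, v, v)  # T(v, v, v)
            if val > best_val:
                best_val, best_vec = val, v
        values.append(best_val)
        vectors.append(best_vec)
        # Deflation: subtract the recovered rank-1 component.
        T -= best_val * np.einsum("i,j,k->ijk", best_vec, best_vec, best_vec)
    return np.array(values), np.array(vectors)


# Synthetic test tensor with orthonormal components (an assumption of the sketch).
rng = np.random.default_rng(1)
d, k = 4, 3
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns u_1..u_k
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum("r,ir,jr,kr->ijk", lam, U, U, U)

vals, vecs = tensor_power_method(T, n_components=k)
print(np.sort(vals))
```

Several random restarts are used per component because a single start may converge to any of the components; deflation then removes the one that was found before the next is extracted.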
- 17. Summary on Tensor Methods: Tensor decomposition: iterative algorithm with tensor contractions. Tensor contraction: $T(W_1, W_2, W_3)$, where the $W_i$ are matrices/vectors. Strengths of the tensor method: Guaranteed convergence to the globally optimal solution! Fast and accurate, orders of magnitude faster than previous methods. Embarrassingly parallel and suited for distributed systems, e.g., Spark. Exploits optimized BLAS operations on CPUs and GPUs.
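The contraction $T(W_1, W_2, W_3)$ can be written as a single `einsum` call. This sketch assumes the common multilinear-form convention $[T(W_1, W_2, W_3)]_{j_1 j_2 j_3} = \sum_{i_1, i_2, i_3} T_{i_1 i_2 i_3} [W_1]_{i_1 j_1} [W_2]_{i_2 j_2} [W_3]_{i_3 j_3}$, which the slide does not spell out; the helper name `multilinear` is hypothetical.

```python
import numpy as np


def multilinear(T, W1, W2, W3):
    """Contract each mode of T with the matching matrix (assumed convention)."""
    return np.einsum("abc,ai,bj,ck->ijk", T, W1, W2, W3)


rng = np.random.default_rng(0)
d, k = 5, 2
T = rng.normal(size=(d, d, d))
W = rng.normal(size=(d, k))
v = rng.normal(size=d)

# Dimensionality reduction: a d x d x d tensor becomes k x k x k.
reduced = multilinear(T, W, W, W)

# The power-method map T(I, v, v): vectors are treated as d x 1 matrices.
power_step = multilinear(T, np.eye(d), v[:, None], v[:, None])[:, 0, 0]

print(reduced.shape, power_step.shape)
```

Both the dimensionality reduction and the power update are instances of the same contraction, which is why the method maps well onto optimized linear-algebra routines.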
- 18. Preliminary Results on Spark: Alternating Least Squares for tensor decomposition: $\min_{\lambda, A, B, C} \left\| T - \sum_{i=1}^{k} \lambda_i A(:, i) \otimes B(:, i) \otimes C(:, i) \right\|_F^2$. [Figure: rows are updated independently; tensor slices and the factors $B$, $C$ are distributed across workers $1, 2, \ldots, i, \ldots, k$.] Results on the NYTimes corpus ($3 \times 10^5$ documents, $10^8$ words): Spark 26 min, Map-Reduce 4 hrs. https://github.com/FurongHuang/SpectralLDA-TensorSpark
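A single-machine sketch of Alternating Least Squares for this objective is shown below. It is not the Spark implementation from the linked repository; the function names are hypothetical, and the weights $\lambda_i$ are absorbed into the columns of the factors for simplicity.

```python
import numpy as np


def khatri_rao(U, V):
    """Column-wise Kronecker product; row (i, j) holds U[i, :] * V[j, :]."""
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, U.shape[1])


def cp_als(T, rank, n_iters=50, seed=0):
    """Fit T ~ sum_r A[:, r] (x) B[:, r] (x) C[:, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    # Unfoldings of T along each mode.
    T1 = T.reshape(I, J * K)
    T2 = T.transpose(1, 0, 2).reshape(J, I * K)
    T3 = T.transpose(2, 0, 1).reshape(K, I * J)
    errors = []
    for _ in range(n_iters):
        # Each update is a linear least-squares problem whose rows are
        # independent, so rows could be solved by different workers.
        A = T1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = T2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = T3 @ np.linalg.pinv(khatri_rao(A, B)).T
        approx = np.einsum("ir,jr,kr->ijk", A, B, C)
        errors.append(np.linalg.norm(T - approx) / np.linalg.norm(T))
    return A, B, C, errors


# Synthetic rank-3 tensor (illustrative data, not the NYTimes corpus).
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(n, 3)) for n in (6, 7, 8))
T = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C, errors = cp_als(T, rank=3)
print(errors[0], errors[-1])
```

Because every factor update exactly minimizes the objective with the other two factors held fixed, the relative error never increases from one sweep to the next.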
- 19. Outline: 1. Introduction; 2. Spectral Methods: Matrices and Tensors; 3. Applications of Tensor Methods; 4. Conclusion
- 20. Tensor Methods for Learning LVMs: GMM, HMM, ICA, multiview and topic models. [Figures: HMM over $h_1, h_2, h_3$ and $x_1, x_2, x_3$; ICA over $h_1, h_2, \ldots, h_k$ and $x_1, x_2, \ldots, x_d$.]
- 21. Overview of Our Approach: Tensor-based model estimation. [Figure: tensor written as a sum of rank-1 components.] Unlabeled data
- 22. Overview of Our Approach: Tensor-based model estimation. [Figure: tensor written as a sum of rank-1 components.] Unlabeled data → probabilistic admixture models
- 23. Overview of Our Approach: Tensor-based model estimation. [Figure: tensor written as a sum of rank-1 components.] Unlabeled data → probabilistic admixture models → tensor decomposition
- 24. Overview of Our Approach: Tensor-based model estimation. [Figure: tensor written as a sum of rank-1 components.] Unlabeled data → probabilistic admixture models → tensor decomposition → inference
- 25. Overview of Our Approach: Tensor-based model estimation. [Figure: tensor written as a sum of rank-1 components.] Unlabeled data → probabilistic admixture models → tensor decomposition → inference. Tensor decomposition → correct model.
- 26. Overview of Our Approach: Tensor-based model estimation. [Figure: tensor written as a sum of rank-1 components.] Unlabeled data → probabilistic admixture models → tensor decomposition → inference. Tensor decomposition → correct model. Highly parallelized CPU/GPU/cloud implementation. Guaranteed global convergence for tensor decomposition.
- 27. Community Detection on Yelp and DBLP Datasets: Facebook $n \sim 20{,}000$; Yelp $n \sim 40{,}000$; DBLP $n \sim 1$ million. [Figure: Tensor vs. Variational. Running time in seconds (log scale, $10^0$ to $10^5$) on the datasets FB, YP, DBLP-sub, and DBLP.]
- 28. Tensor Methods for Training Neural Networks: Tremendous practical impact with deep learning. Algorithm: backpropagation. Highly non-convex optimization.
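- 29. Toy Example: Failure of Backpropagation. Goal: binary classification. [Figure: labeled input samples in the $(x_1, x_2)$ plane with labels $y = 1$ and $y = -1$; network with input $x = (x_1, x_2)$, two units $\sigma(\cdot)$, weights $w_1$, $w_2$, and output $y$.] Our method: guaranteed risk bounds for training neural networks.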
- 32. Backpropagation vs. Our Method: Weights $w_2$ randomly drawn and fixed. [Figure: backprop (quadratic) loss surface over $w_1(1)$ and $w_1(2)$, each from $-4$ to $4$; loss axis from 200 to 650.]
- 33. Backpropagation vs. Our Method: Weights $w_2$ randomly drawn and fixed. [Figure: backprop (quadratic) loss surface over $w_1(1)$ and $w_1(2)$, each from $-4$ to $4$; loss axis from 200 to 650.] [Figure: loss surface for our method over the same $w_1(1)$, $w_1(2)$ range; loss axis from 0 to 200.]
- 34. Feature Transformation for Training Neural Networks: Feature learning: learn $\phi(\cdot)$ from input data. How to use $\phi(\cdot)$ to train neural networks? [Figure: $x$, $\phi(x)$, $y$.]
- 35. Feature Transformation for Training Neural Networks: Feature learning: learn $\phi(\cdot)$ from input data. How to use $\phi(\cdot)$ to train neural networks? [Figure: $x$, $\phi(x)$, $y$.] Invert $\mathbb{E}[\phi(x) \otimes y]$ to learn neural network parameters. In general, $\mathbb{E}[\phi(x) \otimes y]$ is a tensor. What class of $\phi(\cdot)$ is useful for training neural networks? What algorithms can invert moment tensors to obtain parameters?
- 36. Score Function Transformations: Score function for $x \in \mathbb{R}^d$ with pdf $p(\cdot)$: $S_1(x) := -\nabla_x \log p(x)$. Input: $x \in \mathbb{R}^d$; $S_1(x) \in \mathbb{R}^d$.
- 39. Score Function Transformations: Score function for $x \in \mathbb{R}^d$ with pdf $p(\cdot)$: $S_1(x) := -\nabla_x \log p(x)$. $m$th-order score function: Input: $x \in \mathbb{R}^d$; $S_1(x) \in \mathbb{R}^d$.
- 40. Score Function Transformations: Score function for $x \in \mathbb{R}^d$ with pdf $p(\cdot)$: $S_1(x) := -\nabla_x \log p(x)$. $m$th-order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$. Input: $x \in \mathbb{R}^d$; $S_1(x) \in \mathbb{R}^d$.
- 41. Score Function Transformations: Score function for $x \in \mathbb{R}^d$ with pdf $p(\cdot)$: $S_1(x) := -\nabla_x \log p(x)$. $m$th-order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$. Input: $x \in \mathbb{R}^d$; $S_2(x) \in \mathbb{R}^{d \times d}$.
- 42. Score Function Transformations: Score function for $x \in \mathbb{R}^d$ with pdf $p(\cdot)$: $S_1(x) := -\nabla_x \log p(x)$. $m$th-order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$. Input: $x \in \mathbb{R}^d$; $S_3(x) \in \mathbb{R}^{d \times d \times d}$.
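As a worked instance of the definition $S_m(x) := (-1)^m \nabla^{(m)} p(x) / p(x)$, consider the special case of a standard Gaussian density, which is an assumption made here for illustration and not something the slides restrict to. Differentiating $p(x) \propto \exp(-\|x\|^2/2)$ gives $S_1(x) = x$, $S_2(x) = x x^\top - I$, and $[S_3(x)]_{ijk} = x_i x_j x_k - \delta_{ij} x_k - \delta_{ik} x_j - \delta_{jk} x_i$. The helper name below is hypothetical.

```python
import numpy as np


def gaussian_score_functions(x):
    """Score functions S1, S2, S3 for a standard Gaussian density (assumed)."""
    d = x.shape[0]
    I = np.eye(d)
    S1 = x.copy()                                   # vector in R^d
    S2 = np.outer(x, x) - I                         # matrix in R^{d x d}
    S3 = (np.einsum("i,j,k->ijk", x, x, x)          # tensor in R^{d x d x d}
          - np.einsum("ij,k->ijk", I, x)
          - np.einsum("ik,j->ijk", I, x)
          - np.einsum("jk,i->ijk", I, x))
    return S1, S2, S3


x = np.array([1.0, 2.0])
S1, S2, S3 = gaussian_score_functions(x)
print(S1.shape, S2.shape, S3.shape)
```

The output shapes match the slide: a vector, a matrix, and a third-order tensor for $m = 1, 2, 3$. For other input densities the score functions generally have to be estimated rather than written in closed form.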
- 43. NN-LiFT: Neural Network LearnIng using Feature Tensors. Input: $x \in \mathbb{R}^d$. Score function: $S_3(x) \in \mathbb{R}^{d \times d \times d}$.
- 44. NN-LiFT: Neural Network LearnIng using Feature Tensors. Input: $x \in \mathbb{R}^d$. Score function: $S_3(x) \in \mathbb{R}^{d \times d \times d}$. Estimating $M_3$ using labeled data $\{(x_i, y_i)\}$: cross-moment $\frac{1}{n} \sum_{i=1}^{n} y_i \otimes S_3(x_i)$.
- 45. NN-LiFT: Neural Network LearnIng using Feature Tensors. Input: $x \in \mathbb{R}^d$. Score function: $S_3(x) \in \mathbb{R}^{d \times d \times d}$. Estimating $M_3$ using labeled data $\{(x_i, y_i)\}$: cross-moment $\frac{1}{n} \sum_{i=1}^{n} y_i \otimes S_3(x_i)$. CP tensor decomposition: the rank-1 components are the estimates of the first-layer weights.
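The cross-moment step can be illustrated on a deliberately simplified toy problem. The sketch below is not the NN-LiFT algorithm itself. It assumes standard Gaussian inputs (so $S_3$ has the closed form $x_i x_j x_k - \delta_{ij} x_k - \delta_{ik} x_j - \delta_{jk} x_i$), a scalar label produced by a single hypothetical neuron $y = \tanh(w^\top x)$, and it replaces the CP decomposition by an SVD of the unfolded tensor, which is enough when only one rank-1 component is present. Under these assumptions, Stein's identity suggests the cross-moment is proportional to $w \otimes w \otimes w$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3
w_true = np.array([0.6, 0.8, 0.0])    # hypothetical first-layer weight (unit norm)

X = rng.normal(size=(n, d))           # assumed standard Gaussian inputs
y = np.tanh(X @ w_true)               # labels from a single toy neuron

# Empirical cross-moment (1/n) * sum_i y_i * S3(x_i), using the Gaussian S3.
I = np.eye(d)
raw = np.einsum("n,ni,nj,nk->ijk", y, X, X, X) / n
m1 = (y[:, None] * X).mean(axis=0)
M3 = raw - (np.einsum("ij,k->ijk", I, m1)
            + np.einsum("ik,j->ijk", I, m1)
            + np.einsum("jk,i->ijk", I, m1))

# For a single rank-1 component, the leading left singular vector of the
# mode-1 unfolding recovers the direction of w up to sign.
U, s, _ = np.linalg.svd(M3.reshape(d, d * d))
w_hat = U[:, 0]
cosine = abs(w_hat @ w_true) / np.linalg.norm(w_true)
print(cosine)
```

With several hidden units the cross-moment would be a sum of rank-1 terms, one per unit, and a CP decomposition would be needed in place of the single SVD used here.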
- 46. Reinforcement Learning (RL) of POMDPs: Partially observable Markov decision processes. Proposed method: consider memoryless policies. Episodic learning: indirect exploration. Tensor methods: careful conditioning required for learning. First RL method for POMDPs with logarithmic regret bounds. [Figure: graphical model over $x_i, x_{i+1}, x_{i+2}$, $y_i, y_{i+1}$, $r_i, r_{i+1}$, and $a_i, a_{i+1}$.] [Figure: average reward (2 to 2.7) vs. number of trials (0 to 7000) for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and Random Policy.] Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azzizade, A. Lazaric, A. , under preparation.
- 47. Convolutional Tensor Decomposition: [Figure: (a) convolutional dictionary model with $x$, $f_i^*$, $w_i^*$; (b) reformulated model with $x$, $F^*$, $w^*$.] Cumulant $= \lambda_1 (F_1^*)^{\otimes 3} + \lambda_2 (F_2^*)^{\otimes 3} + \cdots$. Efficient methods for tensor decomposition with circulant constraints. Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A. , June 2015.
- 48. Hierarchical Tensor Decomposition: [Figure: several tensors, each written as a sum of rank-1 components.] Relevant for learning latent tree graphical models. We propose the first algorithm for integrated learning of the hierarchy and the components of the tensor decomposition. Highly parallelized operations, but with global consistency guarantees. Scalable Latent Tree Model and its Application to Health Analytics by F. Huang, U. N. Niranjan, J. Perros, R. Chen, J. Sun, A. , Preprint, June 2014.
- 49. Outline: 1. Introduction; 2. Spectral Methods: Matrices and Tensors; 3. Applications of Tensor Methods; 4. Conclusion
- 50. Summary and Outlook: Summary: Tensor methods are a powerful paradigm for guaranteed large-scale machine learning. First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!
- 51. Summary and Outlook: Summary: Tensor methods are a powerful paradigm for guaranteed large-scale machine learning. First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs! Outlook: Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, ... A unified framework for tractable non-convex methods with guaranteed convergence to global optima?
- 52. My Research Group and Resources: Podcast, lectures, papers, and software available at http://newport.eecs.uci.edu/anandkumar/