MLconf NYC Animashree Anandkumar

  1. Tensor Decompositions for Guaranteed Learning of Latent Variable Models. Anima Anandkumar, U.C. Irvine.
  2. Application 1: Topic Modeling. Document modeling. Observed: words in a document corpus. Hidden: topics. Goal: carry out document summarization.
  3. Application 2: Understanding Human Communities. Social networks. Observed: network of social ties, e.g. friendships, co-authorships. Hidden: groups/communities of actors.
  4. Application 3: Recommender Systems. Observed: ratings of users for various products, e.g. Yelp reviews. Goal: predict new recommendations. Modeling: find groups/communities of users and products.
  5. Application 4: Feature Learning. Feature engineering: learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low-dimensional hidden structures.
  6. Application 5: Computational Biology. Observed: gene expression levels. Goal: discover gene groups. Hidden variables: regulators controlling gene groups. Reference: “Unsupervised Learning of Transcriptional Regulatory Networks via Latent Tree Graphical Model” by A. Gitter, F. Huang, R. Valluvan, E. Fraenkel and A. Anandkumar, submitted to BMC Bioinformatics, Jan. 2014.
  7. Statistical Framework. In all applications the goal is to discover hidden structure in data: unsupervised learning. Latent variable models give a concise statistical description through graphical modeling: conditional independence relationships or a hierarchy of variables. (Figure: a hidden variable h with an observed variable x.)
  8. Statistical Framework (text as on slide 7). Figure: a single hidden variable h generating observed variables x_1, ..., x_5.
  9. Statistical Framework (text as on slide 7). Figure: hidden variables h_1, h_2, h_3 generating observed variables x_1, ..., x_5.
  10. Computational Framework. Challenge: efficient learning of latent variable models. Maximum likelihood is NP-hard. In practice, EM and Variational Bayes have no consistency guarantees. Can we achieve efficient computational and sample complexities?
  11. Computational Framework (text as on slide 10). Fast methods such as matrix factorization are not statistical: we cannot learn the latent variable model through such methods.
  12. Computational Framework (text as on slides 10-11). Tensor-based estimation: estimate moment tensors from the data (higher-order relationships) and compute a decomposition of the moment tensor through iterative updates, e.g. tensor power iterations or alternating minimization. The problem is non-convex: convergence only to a local optimum, with no guarantees.
  13. Computational Framework (text as on slides 10-12). Innovation: guaranteed convergence to the correct model.
  14. Computational Framework (text as on slides 10-13). In this talk: tensor decompositions and applications.
  15. Outline: 1. Introduction, 2. Topic Models, 3. Efficient Tensor Decomposition, 4. Experimental Results, 5. Conclusion.
  16. Topic Models: Bag of Words.
  17. Probabilistic Topic Models. Bag of words: the order of words does not matter. Graphical model representation: l words in a document, x_1, ..., x_l; h gives the proportions of topics in the document; word x_i is generated from topic y_i; A(i, j) := P[x_m = i | y_m = j] is the topic-word matrix. (Figure: topic mixture h, topics y_1, ..., y_5, words x_1, ..., x_5, each word drawn through A.)
  18. Geometric Picture for Topic Models. Topic proportions vector h (figure: a document in the topic simplex). Linear model: E[x_i | h] = A h. Multiview model: h is fixed and multiple words x_i are generated.
  19. Geometric Picture for Topic Models. Single topic h (figure: the single-topic case). Linear model: E[x_i | h] = A h. Multiview model: h is fixed and multiple words x_i are generated.
  20. Geometric Picture for Topic Models. Topic proportions vector h. Linear model: E[x_i | h] = A h. Multiview model: h is fixed and multiple words x_i are generated.
  21. Geometric Picture for Topic Models. Topic proportions vector h; word generation x_1, x_2, ... through A (figure: h mapped through A to words x_1, x_2, x_3). Linear model: E[x_i | h] = A h. Multiview model: h is fixed and multiple words x_i are generated.
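The linear model above can be checked by simulation. Here is a minimal sketch, not from the slides, of the single-topic multiview model: the vocabulary size d, number of topics k, column-stochastic topic-word matrix A, and topic prior lam below are made-up values for illustration. Each document draws one topic h and then draws its words independently from the corresponding column of A, so that E[x_i | h] = A h for one-hot encoded words.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3                              # hypothetical vocabulary size and number of topics

# Random column-stochastic topic-word matrix A: column r is the word distribution of topic r.
A = rng.random((d, k))
A /= A.sum(axis=0, keepdims=True)

lam = np.array([0.5, 0.3, 0.2])           # hypothetical topic probabilities P[h = r]

def sample_document(n_words=3):
    """Single-topic model: draw one topic, then draw words i.i.d. from A[:, topic]."""
    topic = rng.choice(k, p=lam)
    words = rng.choice(d, size=n_words, p=A[:, topic])
    return np.eye(d)[words], topic        # one-hot word vectors x_1, ..., x_l

# With one-hot encoding, E[x_i | h] = A h, so averaging x_1 over documents approaches A @ lam.
x1_mean = np.mean([sample_document()[0][0] for _ in range(20000)], axis=0)
print(np.allclose(x1_mean, A @ lam, atol=0.02))
```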
  22. Moment Tensors. Consider the single topic model: E[x_i | h] = A h and λ_i := [E[h]]_i. Goal: learn the topic-word matrix A and the vector λ = P[h]. M_2, the co-occurrence of two words in a document: M_2 := E[x_1 x_2^⊤] = E[E[x_1 x_2^⊤ | h]] = A E[h h^⊤] A^⊤ = Σ_{r=1}^k λ_r a_r a_r^⊤.
  23. Moment Tensors (continued). Tensor M_3, the co-occurrence of three words: M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_r λ_r a_r ⊗ a_r ⊗ a_r.
  24. Moment Tensors (continued). Matrix and tensor forms, with a_r := the r-th column of A: M_2 = Σ_{r=1}^k λ_r a_r ⊗ a_r and M_3 = Σ_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r.
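As a rough illustration, the moments above can be estimated by averaging over documents. The sketch below uses hypothetical sizes and a synthetic A and lam (not data from the talk): it forms empirical estimates of M_2 and M_3 from one-hot word triples and compares them to the closed forms Σ_r λ_r a_r a_r^⊤ and Σ_r λ_r a_r ⊗ a_r ⊗ a_r.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_docs = 8, 3, 100000               # hypothetical sizes
A = rng.random((d, k))
A /= A.sum(axis=0, keepdims=True)         # column-stochastic topic-word matrix
lam = np.array([0.5, 0.3, 0.2])           # topic probabilities

M2_hat = np.zeros((d, d))
M3_hat = np.zeros((d, d, d))
for _ in range(n_docs):
    topic = rng.choice(k, p=lam)
    # Three exchangeable words from the same document, one-hot encoded.
    x1, x2, x3 = np.eye(d)[rng.choice(d, size=3, p=A[:, topic])]
    M2_hat += np.outer(x1, x2)
    M3_hat += np.einsum('i,j,l->ijl', x1, x2, x3)
M2_hat /= n_docs
M3_hat /= n_docs

# Closed forms from the slides: M_2 = sum_r lam_r a_r a_r^T, M_3 = sum_r lam_r a_r (x) a_r (x) a_r.
M2 = (A * lam) @ A.T
M3 = np.einsum('r,ir,jr,lr->ijl', lam, A, A, A)
print(np.abs(M2_hat - M2).max(), np.abs(M3_hat - M3).max())   # both should be small
```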
  25. Tensor Decomposition Problem. M_2 = Σ_{r=1}^k λ_r a_r ⊗ a_r and M_3 = Σ_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r = λ_1 a_1 ⊗ a_1 ⊗ a_1 + λ_2 a_2 ⊗ a_2 ⊗ a_2 + ... (figure: the tensor M_3 drawn as a sum of rank-1 terms). Here u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)-th entry is u_i v_j w_k. With k topics and d words in the vocabulary, M_3 is a d × d × d tensor of rank k. Learning topic models through tensor decomposition.
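To make the rank-1 notation concrete, here is a tiny sketch with made-up vectors and dimensions: the (i, j, k)-th entry of u ⊗ v ⊗ w is u_i v_j w_k, and M_3 is a weighted sum of k such symmetric rank-1 terms.

```python
import numpy as np

rng = np.random.default_rng(2)
u, v, w = rng.random(4), rng.random(4), rng.random(4)

# Rank-1 tensor: (u (x) v (x) w)[i, j, l] = u[i] * v[j] * w[l].
rank1 = np.einsum('i,j,l->ijl', u, v, w)
print(np.isclose(rank1[1, 2, 3], u[1] * v[2] * w[3]))    # True

# M_3 as a weighted sum of k symmetric rank-1 terms lam_r * a_r (x) a_r (x) a_r.
d, k = 4, 2
A, lam = rng.random((d, k)), np.array([0.7, 0.3])
M3 = sum(lam[r] * np.einsum('i,j,l->ijl', A[:, r], A[:, r], A[:, r]) for r in range(k))
print(M3.shape)                                          # (4, 4, 4): a d x d x d tensor of rank k
```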
  26. Detecting Communities in Networks.
  27. Detecting Communities in Networks. Stochastic block model: non-overlapping communities.
  28. Detecting Communities in Networks. Stochastic block model: non-overlapping communities. Mixed membership model: overlapping communities.
  29. Detecting Communities in Networks. Stochastic block model: non-overlapping communities. Mixed membership model: overlapping communities.
  30. Detecting Communities in Networks. Stochastic block model: non-overlapping communities. Mixed membership model: overlapping communities. Unifying assumption: edges are conditionally independent given the community memberships.
  31. Multi-view Mixture Models.
  32. Tensor Forms in Other Models. Independent component analysis: independent sources, unknown mixing; blind source separation of speech, images, video (figure: sources h_1, h_2, ..., h_k mixed through A into observations x_1, x_2, ..., x_d). Gaussian mixtures. Hidden Markov models / latent trees (figure: hidden variables h_1, h_2, h_3 with observations x_1, ..., x_5). All of these reduce to similar moment forms.
  33. Outline: 1. Introduction, 2. Topic Models, 3. Efficient Tensor Decomposition, 4. Experimental Results, 5. Conclusion.
  34. Tensor Decomposition Problem. M_3 = Σ_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r = λ_1 a_1 ⊗ a_1 ⊗ a_1 + λ_2 a_2 ⊗ a_2 ⊗ a_2 + ... (figure: the tensor M_3 as a sum of rank-1 terms). Here u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)-th entry is u_i v_j w_k. M_3 is a d × d × d tensor of rank k, where d is the vocabulary size for topic models or n is the network size for community models.
  35. Dimensionality Reduction for Tensor Decomposition. M_3 = Σ_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r. Dimensionality reduction (whitening): convert M_3 of size d × d × d to a tensor T of size k × k × k, then carry out the decomposition of T (figure: tensor M_3 mapped to tensor T). The dimensionality reduction is a multilinear transform computed from data, e.g. from the pairwise moments. T = Σ_i ρ_i r_i^⊗3 is a symmetric orthogonal tensor: the {r_i} are orthonormal.
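A possible sketch of this whitening step, assuming the exact moments M_2 and M_3 are available (in practice they are estimated from data; the A and λ below are synthetic): build W with W^⊤ M_2 W = I from the top-k eigenpairs of M_2, then T = M_3(W, W, W) is a k × k × k tensor with an orthogonal symmetric decomposition.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 20, 4                                    # hypothetical sizes, k << d
A = rng.random((d, k))
lam = rng.random(k) + 0.5

# Exact moments in the form given on the slides (in practice these are estimated from data).
M2 = (A * lam) @ A.T                            # sum_r lam_r a_r a_r^T
M3 = np.einsum('r,ir,jr,lr->ijl', lam, A, A, A)

# Whitening matrix W (d x k) with W^T M2 W = I, built from the top-k eigenpairs of M2.
eigval, eigvec = np.linalg.eigh(M2)
eigval, eigvec = eigval[-k:], eigvec[:, -k:]
W = eigvec / np.sqrt(eigval)

# Multilinear transform: T[i, j, l] = sum_{a,b,c} M3[a, b, c] W[a, i] W[b, j] W[c, l].
T = np.einsum('abc,ai,bj,cl->ijl', M3, W, W, W)
print(T.shape)                                  # (k, k, k)

# T = sum_r rho_r r_r^(x3) with r_r = sqrt(lam_r) * W^T a_r and rho_r = 1 / sqrt(lam_r);
# the whitened components r_r are orthonormal.
R = (W.T @ A) * np.sqrt(lam)
print(np.allclose(R.T @ R, np.eye(k)))          # True
```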
  36. Orthogonal/Eigen Decomposition. Orthogonal symmetric tensor: T = Σ_{j ∈ [k]} ρ_j r_j^⊗3. Then T(I, r_1, r_1) = Σ_{j ∈ [k]} ρ_j ⟨r_1, r_j⟩² r_j = ρ_1 r_1.
  37. Orthogonal/Eigen Decomposition (continued). Obtaining eigenvectors through power iterations: u → T(I, u, u) / ‖T(I, u, u)‖.
  38. Orthogonal/Eigen Decomposition (continued). Basic algorithm: random initialization, run power iterations, and deflate.
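A sketch of this basic algorithm on a small synthetic orthogonal tensor (the sizes, number of restarts, and iteration counts below are arbitrary choices, not the talk's settings): apply the map u → T(I, u, u) / ‖T(I, u, u)‖ from several random starts, keep the best fixed point, deflate, and repeat.

```python
import numpy as np

def tensor_apply(T, u):
    """T(I, u, u): contract the last two modes of T against u."""
    return np.einsum('ijl,j,l->i', T, u, u)

def power_iteration_with_deflation(T, k, n_starts=10, n_iters=100, rng=None):
    """Recover (rho_j, r_j) from T = sum_j rho_j r_j^(x3) by power iterations plus deflation."""
    if rng is None:
        rng = np.random.default_rng(0)
    rhos, vecs = [], []
    for _ in range(k):
        best_u, best_val = None, -np.inf
        for _ in range(n_starts):                     # random restarts
            u = rng.standard_normal(T.shape[0])
            u /= np.linalg.norm(u)
            for _ in range(n_iters):                  # u -> T(I, u, u) / ||T(I, u, u)||
                u = tensor_apply(T, u)
                u /= np.linalg.norm(u)
            val = np.einsum('ijl,i,j,l->', T, u, u, u)
            if val > best_val:
                best_u, best_val = u, val
        rhos.append(best_val)
        vecs.append(best_u)
        # Deflate: subtract the recovered rank-1 term and repeat on the remainder.
        T = T - best_val * np.einsum('i,j,l->ijl', best_u, best_u, best_u)
    return np.array(rhos), np.array(vecs)

# Synthetic orthogonal symmetric tensor with known orthonormal components.
rng = np.random.default_rng(4)
R, _ = np.linalg.qr(rng.standard_normal((5, 3)))      # orthonormal columns r_1, r_2, r_3
rho = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,lr->ijl', rho, R, R, R)
est_rho, est_vecs = power_iteration_with_deflation(T, k=3, rng=rng)
print(np.round(est_rho, 3))                           # approximately [3, 2, 1]
```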
  39. Practical Considerations. k communities, n nodes, k ≪ n. Steps: k-SVD of an n × n matrix via randomized techniques; online k × k × k tensor decomposition, with no tensor explicitly formed; parallelization: the method is inherently parallelizable and deployable on GPUs; sparse implementation, since real-world networks are sparse. Validation metric: a p-value test based on “soft pairing.” Parallel time complexity: O(nsk/c + k³), where s is the maximum degree in the graph and c is the number of cores. Reference: Huang, Niranjan, Hakeem and Anandkumar, “Fast Detection of Overlapping Communities via Online Tensor Methods,” preprint, Sept. 2013.
  40. Scaling of the Stochastic Iterations. Stochastic update: v_i^{t+1} ← v_i^t − 3θβ^t Σ_{j=1}^k ⟨v_j^t, v_i^t⟩² v_j^t + β^t ⟨v_i^t, y_A^t⟩ ⟨v_i^t, y_B^t⟩ y_C^t + ... Parallelize across eigenvectors. STGD is iterative: device code reuses buffers for the updates. (Figure: standard interface vs. device interface, showing how v_i^t and y_A^t, y_B^t, y_C^t move between CPU and GPU.)
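A serial, simplified sketch of one such stochastic update, based only on the rule displayed above (the GPU parallelization across eigenvectors and the symmetrized terms hidden behind the "..." are omitted; the step sizes and the random "samples" below are placeholders, not the talk's settings):

```python
import numpy as np

def stgd_step(V, yA, yB, yC, beta, theta=1.0):
    """One stochastic update of the k eigenvector estimates (rows of V), following the
    displayed rule; the trailing '...' on the slide denotes symmetrized sample terms,
    omitted here for brevity:
        v_i <- v_i - 3*theta*beta * sum_j <v_j, v_i>^2 v_j + beta * <v_i, yA> <v_i, yB> yC
    """
    G = V @ V.T                                            # G[i, j] = <v_i, v_j>
    penalty = 3.0 * theta * beta * (G ** 2) @ V            # row i: sum_j <v_j, v_i>^2 v_j
    signal = beta * np.outer((V @ yA) * (V @ yB), yC)      # row i: <v_i, yA><v_i, yB> yC
    return V - penalty + signal

# Hypothetical usage on a stream of whitened samples (random stand-ins here).
rng = np.random.default_rng(5)
k = 4
V = rng.standard_normal((k, k))
V /= np.linalg.norm(V, axis=1, keepdims=True)
for t in range(1000):
    yA, yB, yC = rng.standard_normal((3, k))               # placeholders for whitened data
    V = stgd_step(V, yA, yB, yC, beta=1.0 / (100 + t))     # made-up decaying step size
```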
  41. Scaling of the Stochastic Iterations. (Figure: running time in seconds, log scale, versus the number of communities k, comparing MATLAB Tensor Toolbox, CULA standard interface, CULA device interface, and Eigen Sparse.)
  42. Outline: 1. Introduction, 2. Topic Models, 3. Efficient Tensor Decomposition, 4. Experimental Results, 5. Conclusion.
  43. Experimental Results. Datasets: Facebook friendship network among users (n ≈ 20,000); Yelp bipartite network of users reviewing businesses (n ≈ 40,000); DBLP co-authorship network (n ≈ 1 million). Error (E) and recovery ratio (R):
      Dataset           | k̂   | Method      | Running time | E      | R
      Facebook (k=360)  | 500 | ours        | 468          | 0.0175 | 100%
      Facebook (k=360)  | 500 | variational | 86,808       | 0.0308 | 100%
      Yelp (k=159)      | 100 | ours        | 287          | 0.046  | 86%
      Yelp (k=159)      | 100 | variational | N.A.         |        |
      DBLP (k=6000)     | 100 | ours        | 5407         | 0.105  | 95%
  44. Experimental Results on Yelp. Lowest-error business categories and largest-weight businesses:
      Rank | Category       | Business                  | Stars | Review count
      1    | Latin American | Salvadoreno Restaurant    | 4.0   | 36
      2    | Gluten Free    | P.F. Chang’s China Bistro | 3.5   | 55
      3    | Hobby Shops    | Make Meaning              | 4.5   | 14
      4    | Mass Media     | KJZZ 91.5FM               | 4.0   | 13
      5    | Yoga           | Sutra Midtown             | 4.5   | 31
  45. Experimental Results on Yelp (continued; table as on slide 44). Bridgeness: distance from the vector [1/k̂, ..., 1/k̂]^⊤. Top-5 bridging nodes (businesses):
      Business             | Categories
      Four Peaks Brewing   | Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
      Pizzeria Bianco      | Restaurants, Pizza, Phoenix
      FEZ                  | Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
      Matt’s Big Breakfast | Restaurants, Phoenix, Breakfast & Brunch
      Cornish Pasty Co     | Restaurants, Bars, Nightlife, Pubs, Tempe
  46. Outline: 1. Introduction, 2. Topic Models, 3. Efficient Tensor Decomposition, 4. Experimental Results, 5. Conclusion.
  47. Conclusion. Guaranteed learning of latent variable models: guaranteed to recover the correct model; efficient sample and computational complexities; better performance compared to EM, Variational Bayes, etc.; applies to mixed-membership communities, topic models, ICA, Gaussian mixtures, and more. Current and future goals: guaranteed online learning in high dimensions; large-scale cloud-based implementations of tensor approaches. Code is available on the author's website and on GitHub.
