Furong Huang, Ph.D. Candidate, UC Irvine at MLconf NYC - 4/15/16

Discovery of Latent Factors in High-dimensional Data Using Tensor Methods: Learning latent variable mixture models in high dimensions is applicable in numerous domains where low-dimensional latent factors must be extracted from high-dimensional observations. Popular likelihood-based methods optimize a non-convex likelihood, which is computationally challenging due to the high dimensionality of the data and is usually not guaranteed to converge to a global or even a local optimum without additional assumptions. We propose a framework that overcomes the unscalable, non-convex likelihood by leveraging the inverse method of moments. By matching moments, the problem of learning latent variable mixture models is reduced to a tensor (higher-order matrix) decomposition problem in a low-dimensional space. This framework incorporates dimensionality reduction and is therefore scalable to high-dimensional data. Moreover, we show that the algorithm is highly parallel, and we have implemented a distributed version on Apache Spark in Scala.


  1. 1. Discovery of Latent Factors in High-dimensional Data Using Tensor Methods Furong Huang University of California, Irvine Machine Learning Conference 2016 New York City 1 / 25
  2. 2. Machine Learning - Modern Challenges Big Data Challenging Tasks Success of Supervised Learning Image classification Speech recognition Text processing Computation power growth Enormous labeled data 2 / 25
  3. 3. Machine Learning - Modern Challenges Big Data Challenging Tasks Real AI requires Unsupervised Learning Filter bank learning Feature extraction Embeddings; Topics 2 / 25
  4. 4. Machine Learning - Modern Challenges Big Data Challenging Tasks Real AI requires Unsupervised Learning Filter bank learning Feature extraction Embeddings; Topics Summarize key features in data: Machines vs Humans Foundation for successful supervised learning 2 / 25
  5. 5. Unsupervised Learning with Big Data Information Extraction High dimension observation vs Low dimension representation Cell Types Topics Communities 3 / 25
  6. 6. Unsupervised Learning with Big Data Information Extraction High dimension observation vs Low dimension representation Cell Types Topics Communities Finding Needle In the Haystack Is Challenging 3 / 25
  7. 7. Unsupervised Learning with Big Data Information Extraction Solution for Unsupervised Learning A Unified Tensor Decomposition Framework 3 / 25
  8. 8. Automated Categorization of Documents Mixed topics Topics Education Crime Sports 4 / 25
  9. 9. Community Extraction From Connectivity Graph Mixed memberships 5 / 25
  10. 10. Tensor Methods Compared with Variational Inference PubMed on Spark: 8 million docs [plots: perplexity and running time (s), tensor vs. variational] 6 / 25
  11. 11. Tensor Methods Compared with Variational Inference PubMed on Spark: 8 million docs [plots: perplexity and running time (s), tensor vs. variational] Facebook: n ∼ 20k Yelp: n ∼ 40k DBLP: n ∼ 1 million [plots: error per group and running time (s) on FB, YP, DBLPsub, DBLP] 6 / 25
  12. 12. Tensor Methods Compared with Variational Inference PubMed on Spark: 8 million docs [plots: perplexity and running time (s), tensor vs. variational] Facebook: n ∼ 20k Yelp: n ∼ 40k DBLP: n ∼ 1 million [plots: error per group and running time (s) on FB, YP, DBLPsub, DBLP] Orders of Magnitude Faster & More Accurate “Online Tensor Methods for Learning Latent Variable Models”, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. “Tensor Methods on Apache Spark”, F. Huang, A. Anandkumar, Oct. 2015. 6 / 25
  13. 13. Cataloging Neuronal Cell Types In the Brain 7 / 25
  14. 14. Cataloging Neuronal Cell Types In the Brain Our method vs Average expression level [Grange 14’] [plot: spatial point process (ours) vs. average expression level (previous), over k] Recovered known cell types 1 astrocytes 2 interneurons 3 oligodendrocytes “Discovering Neuronal Cell Types and Their Gene Expression Profiles Using a Spatial Point Process Mixture Model” by F. Huang, A. Anandkumar, C. Borgs, J. Chayes, E. Fraenkel, M. Hawrylycz, E. Lein, A. Ingrosso, S. Turaga, NIPS 2015 BigNeuro workshop. 8 / 25
  15. 15. Word Sequence Embedding Extraction football soccer tree Word Embedding The weather is good. Her life spanned years of incredible change for women. Mary lived through an era of liberating reform for women. Word Sequence Embedding 9 / 25
  16. 16. Word Sequence Embedding Extraction football soccer tree Word Embedding The weather is good. Her life spanned years of incredible change for women. Mary lived through an era of liberating reform for women. Word Sequence Embedding Paraphrase Detection MSR paraphrase data: 5800 pairs of sentences Method Outside Information F score Vector Similarity (Baseline) word similarity 75.3% Convolutional Tensor (Proposed) none 80.7% Skip-thought (NIPS’15) train on large corpus 81.9% “Convolutional Dictionary Learning through Tensor Factorization”, by F. Huang, A. Anandkumar, conference and workshop proceeding of JMLR, vol.44, Dec 2015. 9 / 25
  17. 17. Human Disease Hierarchy Discovery CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases. “Scalable Latent Tree Model and its Application to Health Analytics” by F. Huang, U. N. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop. 10 / 25
  18. 18. Unsupervised Learning via Probabilistic Models [graphical model: choice variable h, topics k1-k5, observed words (life, gene, data, DNA, RNA)] Unlabeled data Probabilistic admixture model Learning Algorithm Inference 11 / 25
  19. 19. Unsupervised Learning via Probabilistic Models [admixture model diagram] Unlabeled data Probabilistic admixture model MCMC Inference MCMC: random sampling, slow ◮ Exponential mixing time 11 / 25
  20. 20. Unsupervised Learning via Probabilistic Models [admixture model diagram] Unlabeled data Probabilistic admixture model Likelihood Methods Inference MCMC: random sampling, slow ◮ Exponential mixing time Likelihood: non-convex, not scalable ◮ Exponential critical points 11 / 25
  21. 21. Unsupervised Learning via Probabilistic Models [admixture model diagram] Unlabeled data Probabilistic admixture model Likelihood Methods Inference MCMC: random sampling, slow ◮ Exponential mixing time Likelihood: non-convex, not scalable ◮ Exponential critical points Solution A unified tensor decomposition framework 11 / 25
  22. 22. Unsupervised Learning via Probabilistic Models [admixture model diagram] Unlabeled data Probabilistic admixture model Tensor Decomposition Inference tensor decomposition → correct model 12 / 25
  23. 23. Unsupervised Learning via Probabilistic Models [admixture model diagram] Unlabeled data Probabilistic admixture model Tensor Decomposition Inference tensor decomposition → correct model Contributions Guaranteed online algorithm with global convergence guarantee Highly scalable, highly parallel, random projection Tensor library on CPU/GPU/Spark Interdisciplinary applications Extension to model with group invariance 12 / 25
  24. 24. What is a tensor? Matrix: Second Order Moments M2: pair-wise relationship. [x ⊗ x]_{i1,i2} = x_{i1} x_{i2} → [M2]_{i1,i2} = E[x_{i1} x_{i2}] Tensor: Third Order Moments M3: triple-wise relationship. [x ⊗ x ⊗ x]_{i1,i2,i3} = x_{i1} x_{i2} x_{i3} → [M3]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}] 13 / 25
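To make the notation on the slide above concrete, here is a minimal numpy sketch (my own illustration; variable names are not from the talk) that forms the second- and third-order moments as averages of the outer products x ⊗ x and x ⊗ x ⊗ x over n observations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000
X = rng.exponential(size=(n, d))      # n observations, each a d-dimensional vector x

# Second-order moment M2 = E[x ⊗ x]: average of outer products, a d x d matrix.
M2 = np.einsum('ni,nj->ij', X, X) / n

# Third-order moment M3 = E[x ⊗ x ⊗ x]: average of triple outer products, a d x d x d tensor.
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n

print(M2.shape, M3.shape)             # (5, 5) (5, 5, 5)
```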
  25. 25. Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without eigenvalue gap: [1 0; 0 1] = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, where u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2] 14 / 25
  26. 26. Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without eigenvalue gap: [1 0; 0 1] = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, where u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2] Tensor Orthogonal Decomposition Unique: eigenvalue gap not needed 14 / 25
  27. 27. Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without eigenvalue gap: [1 0; 0 1] = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, where u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2] Tensor Orthogonal Decomposition Unique: eigenvalue gap not needed Slice of tensor has eigenvalue gap 14 / 25
  28. 28. Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without eigenvalue gap: [1 0; 0 1] = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, where u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2] Tensor Orthogonal Decomposition Unique: eigenvalue gap not needed Slice of tensor has eigenvalue gap 14 / 25
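The non-uniqueness example on the slide above is easy to check numerically; this small sketch (my own illustration) verifies that both the standard basis e1, e2 and the rotated pair u1, u2 give valid rank-one decompositions of the 2×2 identity, exactly the ambiguity that an eigenvalue gap would remove.

```python
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u1 = np.array([np.sqrt(2) / 2, -np.sqrt(2) / 2])
u2 = np.array([np.sqrt(2) / 2,  np.sqrt(2) / 2])

I_from_e = np.outer(e1, e1) + np.outer(e2, e2)
I_from_u = np.outer(u1, u1) + np.outer(u2, u2)

# Both sums equal the 2x2 identity: the matrix decomposition is not unique
# because the two eigenvalues (1 and 1) have no gap.
print(np.allclose(I_from_e, np.eye(2)), np.allclose(I_from_u, np.eye(2)))  # True True
```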
  29. 29. Outline 1 Introduction 2 LDA and Community Models From Data Aggregates to Model Parameters Guaranteed Online Algorithm 3 Conclusion 15 / 25
  30. 30. Outline 1 Introduction 2 LDA and Community Models From Data Aggregates to Model Parameters Guaranteed Online Algorithm 3 Conclusion 16 / 25
  31. 31. Probabilistic Topic Models - LDA Bag of words Topics Topic Proportion police witness campus police witness campus police witness campus 17 / 25
  32. 32. Probabilistic Topic Models - LDA Bag of words Topics Topic Proportion police witness campus police witness campus police witness campus police witness crime Sports Education campus 17 / 25
  33. 33. Probabilistic Topic Models - LDA Bag of words Topics Topic Proportion police witness campus police witness campus police witness campus police witness crime Sports Education campus Goal campus police witness Topic-word matrix P[word = i|topic = j] 17 / 25
  34. 34. Mixture Form of Moments Goal: Linearly independent topic-word table campus police witness 18 / 25
  35. 35. Mixture Form of Moments Goal: Linearly independent topic-word table campus police witness M1: Occurrence of Words = + + campus police witness crime Sports Education campus police witness campus police witness 18 / 25
  36. 36. Mixture Form of Moments Goal: Linearly independent topic-word table campus police witness M1: Occurrence of Words = + + campus police witness crime Sports Education campus police witness campus police witness No unique decomposition of vectors 18 / 25
  37. 37. Mixture Form of Moments Goal: Linearly independent topic-word table campus police witness M2: Modified Co-occurrence of Word Pairs = + + campus police witness crime Sports Education campus police witness campus police witness 18 / 25
  38. 38. Mixture Form of Moments Goal: Linearly independent topic-word table campus police witness M2: Modified Co-occurrence of Word Pairs = + + campus police witness crime Sports Education campus police witness campus police witness Matrix decomposition recovers subspace, not actual model 18 / 25
  39. 39. Mixture Form of Moments Goal: Linearly independent topic-word table Find a W W W W such that M2: Modified Co-occurrence of Word Pairs = + + campus police witness crime Sports Education campus police witness campus police witness Many such W’s, find one, project data with W 18 / 25
  40. 40. Mixture Form of Moments Goal: Linearly independent topic-word table Know a W W W W such that M3: Modified Co-occurrence of Word Triplets = + + campus police witness crime Sports Education campus police witness campus police witness W W W Unique orthogonal tensor decomposition, project result with W† 18 / 25
  41. 41. Mixture Form of Moments Goal: Linearly independent topic-word table Know a W W W W such that M3: Modified Co-occurrence of Word Triplets = + + campus police witness crime Sports Education campus police witness campus police witness W W W Tensor decomposition uniquely discovers the correct model Learning Topic Models through Matrix/Tensor Decomposition 18 / 25
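A minimal numpy sketch of the pipeline on the slides above, under simplifying assumptions of my own (exact population moments of a single-topic mixture rather than the modified LDA co-occurrence moments, and a small synthetic vocabulary): a whitening matrix W computed from M2 makes the third-order moment orthogonally decomposable, tensor power iteration recovers its components uniquely, and un-projecting with the pseudo-inverse of W⊤ returns the topic-word columns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                                    # vocabulary size, number of topics

# Hypothetical ground truth: topic-word matrix A (columns sum to 1) and topic weights w.
A = rng.dirichlet(np.ones(d), size=k).T        # d x k
w = np.array([0.5, 0.3, 0.2])

# Population moments of a single-topic mixture (a simplification of the LDA moments):
#   M2 = sum_k w_k a_k a_k^T,   M3 = sum_k w_k a_k ⊗ a_k ⊗ a_k
M2 = (A * w) @ A.T
M3 = np.einsum('k,ik,jk,lk->ijl', w, A, A, A)

# Whitening matrix W (d x k) with W^T M2 W = I, from the top-k eigenpairs of M2.
vals, vecs = np.linalg.eigh(M2)
W = vecs[:, -k:] / np.sqrt(vals[-k:])

# Whitened third moment: a small k x k x k tensor with an orthogonal decomposition.
T = np.einsum('ijl,ia,jb,lc->abc', M3, W, W, W)

# Tensor power iteration with deflation extracts one orthonormal component at a time.
topics = []
for _ in range(k):
    v = rng.normal(size=k); v /= np.linalg.norm(v)
    for _ in range(100):
        v = np.einsum('ijl,j,l->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijl,i,j,l->', T, v, v, v)
    topics.append(lam * (np.linalg.pinv(W.T) @ v))   # un-whiten: one topic-word column
    T = T - lam * np.einsum('i,j,l->ijl', v, v, v)   # deflate

print(np.round(np.array(topics).T, 3))  # columns match A up to column order
```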
  42. 42. Mixed Membership Community Models Mixed memberships 19 / 25
  43. 43. Mixed Membership Community Models Mixed memberships What ensures guaranteed learning? Alice Bob Charlie Mathematicians Vegetarians Musicians David Ellen Frank Grace Jack Kathy 19 / 25
  44. 44. Mixed Membership Community Models Mixed memberships What ensures guaranteed learning? Alice Bob Charlie Mathematicians Vegetarians Musicians David Ellen Frank Grace Jack Kathy 19 / 25
  45. 45. Outline 1 Introduction 2 LDA and Community Models From Data Aggregates to Model Parameters Guaranteed Online Algorithm 3 Conclusion 20 / 25
  46. 46. How to do tensor decomposition? Model is uniquely identifiable! How to identify? 21 / 25
  47. 47. How to do tensor decomposition? How to find components? Non-convex optimization problem! 21 / 25
  48. 48. How to do tensor decomposition? How to find components? Non-convex optimization problem! Objective Function Theorem: We propose an objective function with equivalent local optima. 21 / 25
  49. 49. How to do tensor decomposition? How to find components? Non-convex optimization problem! Objective Function Theorem: We propose an objective function with equivalent local optima. Saddle point: enemy of SGD Saddle Point Saddle point has 0 gradient 21 / 25
  50. 50. How to do tensor decomposition? How to find components? Non-convex optimization problem! Objective Function Theorem: We propose an objective function with equivalent local optima. Saddle point: enemy of SGD Saddle Point Saddle point has 0 gradient Non-degenerate saddle: Hessian has ± eigenvalue 21 / 25
  51. 51. How to do tensor decomposition? How to find components? Non-convex optimization problem! Objective Function Theorem: We propose an objective function with equivalent local optima. Saddle point: enemy of SGD escape stuck Saddle point has 0 gradient Non-degenerate saddle: Hessian has ± eigenvalue Negative eigenvalue: direction of escape 21 / 25
  52. 52. How to do tensor decomposition? How to find components? Non-convex optimization problem! Objective Function Theorem: We propose an objective function with equivalent local optima. Saddle point: enemy of SGD escape stuck Saddle point has 0 gradient Non-degenerate saddle: Hessian has ± eigenvalues Negative eigenvalue: direction of escape Guaranteed Global Convergence Online Tensor Decomposition Theorem: For smooth fn. with non-degenerate saddle points, noisy SGD converges to a local minimum in polynomial steps. 21 / 25
  53. 53. How to do tensor decomposition? How to find components? Non-convex optimization problem! Objective Function Theorem: We propose an objective function with equivalent local optima. Saddle point: enemy of SGD escape stuck Saddle point has 0 gradient Non-degenerate saddle: Hessian has ± eigenvalues Negative eigenvalue: direction of escape Guaranteed Global Convergence Online Tensor Decomposition Theorem: For smooth fn. with non-degenerate saddle points, noisy SGD converges to a local minimum in polynomial steps. Noise could help! “Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition”, by R. Ge, F. Huang, C. Jin, Y. Yuan, COLT 2015. 21 / 25
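The saddle-point story on the slides above can be illustrated with a toy sketch (my own example, not the objective or algorithm from the cited paper): a smooth function with a non-degenerate saddle at the origin, where plain gradient descent started exactly at the saddle never moves, while adding noise to the gradient lets the iterate slide off along the negative-curvature direction and settle into a local minimum, in the spirit of the noisy-SGD result quoted on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-convex objective f(x, y) = x^2 / 2 + (y^2 - 1)^2 / 4:
# non-degenerate saddle at (0, 0) (Hessian eigenvalues +1 and -1), minima at (0, +/-1).
def grad(p):
    x, y = p
    return np.array([x, y * (y**2 - 1)])

def descend(noise_scale, steps=2000, lr=0.05):
    p = np.zeros(2)                   # start exactly at the saddle point (zero gradient)
    for _ in range(steps):
        p = p - lr * (grad(p) + noise_scale * rng.normal(size=2))
    return p

print(descend(noise_scale=0.0))      # plain gradient descent: stuck at the saddle, stays ~[0, 0]
print(descend(noise_scale=0.1))      # noisy SGD: escapes the saddle and lands near (0, +/-1)
```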
  54. 54. Outline 1 Introduction 2 LDA and Community Models From Data Aggregates to Model Parameters Guaranteed Online Algorithm 3 Conclusion 22 / 25
  55. 55. Contributions Spectral methods reveal hidden structure Text/Image processing Social networks Neuroscience, healthcare ... 23 / 25
  56. 56. Contributions Spectral methods reveal hidden structure Text/Image processing Social networks Neuroscience, healthcare ... Versatile for latent variable models Flat model → hierarchical model Sparse coding → convolutional model Efficient, convergence guarantee 23 / 25
  57. 57. Thank You Collaborators Anima Anandkumar UC Irvine Rong Ge Duke University Srini Turaga Janelia Research Chi Jin UC Berkeley Jennifer Chayes MSR Christian Borgs MSR Ernest Fraenkel MIT Yang Yuan Cornell U U. N. Niranjan UC Irvine 24 / 25
