Published in:
Data & Analytics


- 1. Low-rank matrix approximations with Python
  - Christian Thurau
- 2. Table of Contents
  1. Intro
  2. The Basics
  3. Matrix approximation
  4. Some methods
  5. Matrix Factorization with Python
  6. Example & Conclusion
- 3. For Starters... Observations
  - Data matrix factorization has become an important tool in information retrieval, data mining, and pattern recognition
  - Nowadays, typical data matrices are HUGE
  - Examples include:
    - Gene expression data and microarrays
    - Digital images
    - Term-by-document matrices
    - User ratings for movies, products, ...
    - Graph adjacency matrices
- 4. Matrix Factorization
  - given a matrix V
  - determine matrices W and H
  - such that V = WH or V ≈ WH
  - characteristics such as entries, shape, and rank of V, W, and H will depend on the application context
- 5. The Basics
  - matrix factorization allows for:
    - solving linear equations
    - transforming data
    - compressing data
  - matrix factorization facilitates subsequent processing in:
    - information retrieval
    - pattern recognition
    - data mining
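- 6. Low-rank Matrix Approximations
  - Approximate $V$: $V \approx WH$
  - where $V \in \mathbb{R}^{m \times n}$, $W \in \mathbb{R}^{m \times k}$, $H \in \mathbb{R}^{k \times n}$
  - and $\operatorname{rank}(W) \ll \operatorname{rank}(V)$, $k \ll \min(m, n)$
  - [Figure: $V = W H$]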
- 7. Matrix Approximation
  - If $V = WH$
  - then $v_{i,j} = w_{i,*} h_{*,j} = \sum_{x=1}^{k} w_{i,x} h_{x,j}$
  - [Figure: $V = W H$]
- 8. Matrix Approximation
  - More importantly: $v_{*,j} = W h_{*,j} = \sum_{x=1}^{k} w_{*,x} h_{x,j}$
  - therefore $W$ ↔ "basis" matrix, $H$ ↔ coefficient matrix
  - [Figure: $V = W H$, with a column of $V$ shown as a sum of terms]
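The entry-wise and column-wise identities on slides 7 and 8 can be checked numerically. The following is a minimal NumPy sketch; the sizes `m`, `n`, `k` and the random factors are assumptions chosen only for illustration and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 4, 2  # hypothetical sizes with k smaller than min(m, n)

# Hypothetical factors: W is the m x k "basis", H holds the k x n coefficients.
W = rng.random((m, k))
H = rng.random((k, n))
V = W @ H  # exact product, V = WH

# Entry view: v_ij is the inner product of row i of W and column j of H.
i, j = 3, 1
v_ij = sum(W[i, x] * H[x, j] for x in range(k))

# Column view: column j of V is a weighted sum of the columns of W.
v_col = sum(H[x, j] * W[:, x] for x in range(k))

entry_ok = np.isclose(V[i, j], v_ij)
column_ok = np.allclose(V[:, j], v_col)
rank_V = np.linalg.matrix_rank(V)  # cannot exceed k for a product of this shape
```

The column view is the one the later slides rely on: each data point is rebuilt from the same `k` basis vectors, and only its coefficients in `H` change.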
- 9. On Matrix Factorization Methods
  - matrix factorization ↔ data transformation
  - matrix rank reduction ↔ data compression
  - Common form: V = WH
  - Broad range of methods:
    - K-means clustering
    - SVD/PCA
    - Non-negative Matrix Factorization
    - Archetypal Analysis
    - Binary matrix factorization
    - CUR decomposition
    - ...
  - Each method yields a unique view on data ...
  - ... and is suited for different tasks
- 10. K-means Clustering [1]
  - Baseline clustering method
  - Constrained quadratic optimization problem: $\min_{W,H} \|V - WH\|^2$ s.t. $H = [0; 1]$ (binary entries), $\sum_k h_{k,i} = 1$
  - Find $W$, $H$ using expectation maximization
  - Optimal k-means partitioning is NP-hard
  - Goal: group similar data points
  - Interesting: K-means clustering is matrix factorization

  [1] J.B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Berkeley Symposium on Mathematical Statistics and Probability, 1967
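- 11. K-means Clustering is Matrix Factorization!

  $$
  \begin{pmatrix}
  x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,n} \\
  x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,n} \\
  x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,n} \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  x_{m,1} & x_{m,2} & x_{m,3} & \cdots & x_{m,n}
  \end{pmatrix}
  \begin{pmatrix}
  b_{1,1} & b_{1,2} & b_{1,3} \\
  b_{2,1} & b_{2,2} & b_{2,3} \\
  b_{3,1} & b_{3,2} & b_{3,3} \\
  \vdots & \vdots & \vdots \\
  b_{n,1} & b_{n,2} & b_{n,3}
  \end{pmatrix}
  \begin{pmatrix}
  0 & 1 & 1 & \cdots & 0 \\
  1 & 0 & 0 & \cdots & 0 \\
  0 & 0 & 0 & \cdots & 1
  \end{pmatrix}
  $$

  - i.e. for $X \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times 3}$, and $A \in \mathbb{R}^{3 \times n}$ as above, the product $XBA = MA$ realizes an assignment $x_i \to m_j$, where $m_j = X b_j$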
- 12. Example: K-means
  - [Figure: an image approximated as a weighted sum of cluster means, with weights 0.0, 0.0, ..., 1.0, ..., 0.0]
  - Similar images are grouped into k groups
  - Approximate data by mapping each data point onto the mean of a cluster region
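To make the claim that k-means is a matrix factorization concrete, here is a minimal NumPy sketch of Lloyd-style k-means that returns explicit factors `W` (cluster means) and `H` (binary assignments). It is an illustration only and not PyMF's implementation; the function name `kmeans_factorize`, the evenly spaced deterministic initialisation, and the toy data are assumptions made for this example (random or k-means++ initialisation is more common in practice).

```python
import numpy as np

def kmeans_factorize(V, k, niter=10):
    """Lloyd-style k-means on the columns of V, returned as factors W and H."""
    m, n = V.shape
    # Assumed deterministic initialisation: k evenly spaced data columns.
    init = np.linspace(0, n - 1, k).astype(int)
    W = V[:, init].astype(float)
    for _ in range(niter):
        # Squared distance from every basis vector (rows) to every column (cols).
        dist = ((W[:, :, None] - V[:, None, :]) ** 2).sum(axis=0)  # shape (k, n)
        labels = dist.argmin(axis=0)
        # H is binary with exactly one 1 per column: the cluster assignment.
        H = np.zeros((k, n))
        H[labels, np.arange(n)] = 1.0
        # Each basis vector becomes the mean of its assigned columns.
        for c in range(k):
            if H[c].sum() > 0:
                W[:, c] = V[:, H[c] == 1].mean(axis=1)
    return W, H

# Toy data: six 2-D points (as columns) forming two well-separated groups.
V = np.array([[0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
              [0.0, 0.1, 0.0, 5.0, 5.1, 5.0]])
W, H = kmeans_factorize(V, k=2)
V_approx = W @ H  # every column is replaced by the mean of its cluster
```

Because each column of `H` contains a single 1, the product `W @ H` simply copies the matching cluster mean into each column, which is the approximation described on slide 12.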
- 13. Python Matrix Factorization Toolbox (PyMF) [2]
  - Started in 2010 at Fraunhofer IAIS/University of Bonn
  - Vast number of different methods!
  - Supports hdf5/h5py and sparse matrices

  How to factorize a data matrix V:

      >>> import pymf
      >>> import numpy as np
      >>> data = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
      >>> mdl = pymf.kmeans.Kmeans(data, num_bases=2)
      >>> mdl.factorize(niter=10)  # optimize W and H
      >>> V_approx = np.dot(mdl.W, mdl.H)  # V = WH

  [2] http://github.com/cthurau/pymf
- 14. Python Matrix Factorization Toolbox (PyMF) [2]
  - Restarted development a few weeks back ;)
  - Looking for contributors!

  How to map data onto W:

      >>> import pymf
      >>> import numpy as np
      >>> test_data = np.array([[1.0], [0.3]])
      >>> mdl_test = pymf.kmeans.Kmeans(test_data, num_bases=2)
      >>> mdl_test.W = mdl.W  # mdl.W -> existing basis W
      >>> mdl_test.factorize(compute_w=False)  # only H is computed
      >>> test_data_approx = np.dot(mdl.W, mdl_test.H)

  [2] http://github.com/cthurau/pymf
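For k-means, mapping new data onto a fixed basis presumably comes down to assigning each new column to its nearest basis vector. The sketch below shows that step in plain NumPy. The matrix `W` here is a hypothetical basis (the means of columns 0 and 2, and of column 1, of the slide's `data`), not output produced by PyMF.

```python
import numpy as np

# Hypothetical basis: means of two clusters of the slide's `data` matrix
# (columns 0 and 2 in one cluster, column 1 in the other). Not PyMF output.
W = np.array([[1.5, 0.0],
              [0.5, 1.0]])
test_data = np.array([[1.0], [0.3]])

# Squared distances between each basis vector (rows) and each test column (cols).
dist = ((W[:, :, None] - test_data[:, None, :]) ** 2).sum(axis=0)

# One-hot coefficients: each test column selects its nearest basis vector.
H_test = np.zeros((W.shape[1], test_data.shape[1]))
H_test[dist.argmin(axis=0), np.arange(test_data.shape[1])] = 1.0

# W stays fixed; only the coefficients were computed for the new data.
test_data_approx = W @ H_test
```

Keeping `W` fixed is what makes the new coefficients comparable with those of the original data, since both are expressed in the same basis.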
- 15. Principal Component Analysis (PCA) [3]
  - SVD/PCA are baseline matrix factorization methods
  - Optimize: $\min_{W,H} \|V - WH\|^2$ s.t. $W^T W = I$
  - Restrict $W$ to singular vectors of $V$ (orthogonal matrix)
  - Can (usually does) violate non-negativity
  - Goal: best possible matrix approximation for a given $k$
  - Great for compression or filtering out noise!

  [3] K. Pearson, "On Lines and Planes of Closest Fit to Systems of Points in Space", Philosophical Magazine, 1901
- 16. Example PCA

      >>> from pymf.pca import PCA
      >>> import numpy as np
      >>> mdl = PCA(data, num_bases=2)  # data as defined on slide 13
      >>> mdl.factorize()
      >>> V_approx = np.dot(mdl.W, mdl.H)

  - Usage for data analysis questionable
  - Basis vectors usually not interpretable
  - [Figure: $V \approx V_{\text{approx}}$ and the basis vectors $W = \ldots$]
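The SVD route described on slide 15 can be sketched directly with `numpy.linalg.svd`. This is an illustration under stated assumptions, not PyMF's `PCA` class: the random matrix `V` and the choice `k = 2` are made up for the example, and the data is not mean-centred (PCA in the usual statistical sense centres the data first).

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 8))  # hypothetical non-negative data matrix
k = 2

# Thin SVD: V = U diag(s) Vt, with singular values sorted in descending order.
U, s, Vt = np.linalg.svd(V, full_matrices=False)

W = U[:, :k]                 # orthonormal basis, so W^T W = I
H = s[:k, None] * Vt[:k, :]  # coefficients for that basis
V_approx = W @ H             # rank-k approximation of V

# Frobenius norm of the residual versus the discarded singular values;
# by the Eckart-Young theorem these two quantities agree.
err = np.linalg.norm(V - V_approx)
tail = np.sqrt((s[k:] ** 2).sum())

# Even though V is non-negative, the basis contains negative entries.
has_negative = bool((W < 0).any())
```

The negative entries in `W` are what the slide means by violating non-negativity, and they are one reason the basis vectors are hard to interpret as parts of the data.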
- 17. Non-negative Matrix Factorization [4]
  - For $V \geq 0$, constrained quadratic optimization problem: $\min_{W,H} \|V - WH\|^2$ s.t. $W \geq 0$, $H \geq 0$
  - A globally optimal solution provably exists, but algorithms guaranteed to find it remain elusive; exact NMF is NP-hard
  - Often $W$ converges to partial representations
  - Active area of research
  - Goal: reconstruct data by independent parts

  [4] D.D. Lee and H.S. Seung, "Learning the Parts of Objects by Non-Negative Matrix Factorization", Nature, 401(6755), 1999
- 18. Example NMF

      >>> from pymf.nmf import NMF
      >>> import numpy as np
      >>> mdl = NMF(data, num_bases=2, iter=50)  # data as defined on slide 13
      >>> mdl.factorize()
      >>> V_approx = np.dot(mdl.W, mdl.H)

  - Additive combination of parts
  - Interesting options for data analysis
  - [Figure: $V \approx V_{\text{approx}}$ and the basis vectors $W = \ldots$]
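One widely used way to attack the NMF objective is the multiplicative update scheme usually attributed to Lee and Seung. The sketch below is a NumPy illustration of that scheme and is not claimed to match PyMF's implementation; the function name `nmf_multiplicative`, the `eps` guard against division by zero, and the random test matrix are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(V, k, niter=200, seed=0, eps=1e-9):
    """Approximate V >= 0 by W @ H with W, H >= 0 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Strictly positive random start; the updates never change the sign.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(niter):
        # Rescale H, then W, by ratios of non-negative matrices.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))  # Frobenius residual
    return W, H, errors

rng = np.random.default_rng(1)
V = rng.random((6, 8))  # hypothetical non-negative data
W, H, errors = nmf_multiplicative(V, k=2)
V_approx = W @ H
```

The updates only multiply by non-negative ratios, so non-negativity is preserved without any projection step. They converge to a local solution at best, which is consistent with the slide's remark that exact NMF is NP-hard.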
- 19. Archetypal Analysis [5]
  - Convexity-constrained quadratic optimization problem: $\min_{W,H} \|V - VWH\|^2$ s.t. $w_{l,i} \geq 0$, $\sum_l w_{l,i} = 1$, $h_{k,i} \geq 0$, $\sum_k h_{k,i} = 1$
  - Reconstruct data by its archetypes, i.e. convex combinations of polar opposites
  - Yields novel and intuitive insights into data
  - Great for interpretable data representations!
  - $O(n^2)$, but: efficient approximations for large data exist

  [5] A. Cutler and L. Breiman, "Archetypal Analysis", Technometrics 36(4), 1994
- 20. Example Archetypal Analysis

      >>> from pymf.aa import AA
      >>> import numpy as np
      >>> mdl = AA(data, num_bases=2, iter=50)  # data as defined on slide 13
      >>> mdl.factorize()
      >>> V_approx = np.dot(mdl.W, mdl.H)

  - Existing data points as basis vectors
  - Convex combination allows a probabilistic interpretation
  - [Figure: $V \approx V_{\text{approx}}$ and the basis vectors $W = \ldots$]
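The constraint structure of archetypal analysis, $V \approx VWH$ with non-negative columns summing to one, can be illustrated without a solver. The sketch below does not optimize anything: it draws random feasible `W` and `H` and checks what the constraints alone imply. The helper `column_stochastic` and all sizes are assumptions made for this example.

```python
import numpy as np

def column_stochastic(shape, rng):
    """Random non-negative matrix whose columns each sum to 1."""
    A = rng.random(shape)
    return A / A.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
m, n, k = 3, 20, 4  # hypothetical sizes
V = rng.random((m, n))

W = column_stochastic((n, k), rng)  # archetypes as convex mixes of data points
H = column_stochastic((k, n), rng)  # data points as convex mixes of archetypes

Z = V @ W         # m x k archetypes, inside the convex hull of the data
V_approx = Z @ H  # m x n reconstruction of the form V W H

# Convexity keeps every coordinate within the range spanned by the data.
lo = V.min(axis=1, keepdims=True)
hi = V.max(axis=1, keepdims=True)
inside = bool(((V_approx >= lo - 1e-12) & (V_approx <= hi + 1e-12)).all())
```

Since each column of `H` is non-negative and sums to one, it can be read as a set of mixing proportions over the archetypes, which is the probabilistic interpretation mentioned on slide 20. An actual solver would additionally adjust `W` and `H` to minimize the reconstruction error.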
- 21. Method Summary
  - Common form: $V = WH$ (or $V = VWH$)

  | Method | $W$ constraint | $H$ constraint | Outcome |
  |---|---|---|---|
  | PCA | - | - | compressed $V$ |
  | K-means | - | $H = [0; 1]$, $\sum_k h_{k,i} = 1$ | groups |
  | NMF | $W \geq 0$ | $H \geq 0$ | parts |
  | AA | $W \geq 0$, $\sum_l w_{l,i} = 1$ | $H \geq 0$, $\sum_k h_{k,i} = 1$ | opposites |

  - Doesn't only work for images ;)
  - More complex constraints usually result in more complex solvers
  - Active area of research deals with approximations for large data
- 22. Large matrices: PyMF and h5py

      >>> import h5py
      >>> import numpy as np
      >>> from pymf.sivm import SIVM  # uses [6]
      >>> file = h5py.File('myfile.hdf5', 'w')
      >>> file['dataset'] = np.random.random((100, 1000))
      >>> file['W'] = np.random.random((100, 10))
      >>> file['H'] = np.random.random((10, 1000))
      >>> sivm_mdl = SIVM(file['dataset'], num_bases=10)
      >>> sivm_mdl.W = file['W']
      >>> sivm_mdl.H = file['H']
      >>> sivm_mdl.factorize()

  [6] Thurau, Kersting, and Bauckhage, "Simplex volume maximization for descriptive web scale matrix factorization", CIKM 2010
- 23. [7] Science, 2010: Vol. 330
- 24. Take Home Message
  - Most clustering and data analysis methods are matrix approximations
  - Imposed constraints shape the factorization
  - Imposed constraints yield different views on data
  - One of the most effective and versatile tools for data exploration!
  - Python implementation → http://github.com/cthurau/pymf
- 25. Thank you for your attention! christian.thurau@unbelievable-machine.com
