Big, Practical Recommendations with Alternating Least Squares


  1. Big, Practical Recommendations with Alternating Least Squares
     Sean Owen • Apache Mahout / Myrrix.com
  2. WHERE'S BIG LEARNING?
     [Stack diagram: Application Layer / Analytics / Processing / Database / Storage; next up: the Application Layer]
      Machine Learning Applications, like Apache Mahout, are a common Big Data app today
      Clustering, recommenders, classifiers on Hadoop
      Free, open source; not mature / commercialized
      Where's Big Learning?
  3. A RECOMMENDER SHOULD …
      Answer in real time: ingest new data, now; modify recommendations based on the newest data; no "cold start" for new data
      Accept diverse input: not just people and products; not just explicit ratings; clicks, views, buys; side information
      Scale horizontally: for queries per second; for size of data set
      Be "pretty accurate"
  4. NEED: 2-TIER ARCHITECTURE
      Real-time Serving Layer: quick results based on a precomputed model; incremental update; partitionable for scale
      Batch Computation Layer: builds the model; scales out (on Hadoop?); asynchronous, occasional, long-lived runs
  5. A PRACTICAL ALGORITHM: MATRIX FACTORIZATION
      Factor the user-item matrix into a user-feature matrix times a feature-item matrix
      Well understood in ML, as: Principal Component Analysis, Latent Semantic Indexing
      Several algorithms, like: Singular Value Decomposition, Alternating Least Squares
      Benefits: models intuition; factorization is batch parallelizable; reconstruction (recs) in low dimension is fast; allows projection of new data; cold start solution; approximate update solution
  6. A PRACTICAL IMPLEMENTATION: ALTERNATING LEAST SQUARES
      Simple factorization P ≈ X Yᵀ; approximate: X and Y are very "skinny" (low-rank)
      Faster than the SVD: trivially parallel, iterative
      Dumber than the SVD: no singular values, no orthonormal basis
      Benefits: parallelizable by row (Hadoop-friendly); iterative, so an OK answer comes fast and refines as long as desired; yields to a "binary" input model, with ratings as regularization instead; sparseness / 0s no longer a problem
  7. ALS ALGORITHM 1
      Input: (user, item, strength) tuples; anything you can quantify is input; strength is positive
      Many tuples per user-item pair
      R is the sparse user-item interaction matrix: rij = total strength of interaction between user i and item j (a sketch of building R follows this slide)
     [Figure: example sparse user-item matrix R]
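Below is a minimal Python/NumPy sketch (not from the slides; the tuple values and variable names are invented for illustration) of assembling (user, item, strength) tuples into R, with repeated tuples for the same pair accumulating into rij; a real system would keep R sparse rather than dense:

```python
import numpy as np

# Hypothetical (user, item, strength) tuples; indices and values are invented.
tuples = [(0, 0, 1.0), (0, 1, 4.0), (0, 2, 3.0), (1, 2, 3.0),
          (2, 0, 4.0), (2, 3, 3.0), (2, 4, 2.0), (0, 1, 1.0)]  # note the repeated (0, 1) pair

n_users = 1 + max(u for u, _, _ in tuples)
n_items = 1 + max(i for _, i, _ in tuples)

# r_ij = total strength of interaction between user i and item j,
# so multiple tuples for the same pair simply add up.
# Dense only for readability; real data would use a sparse representation.
R = np.zeros((n_users, n_items))
for u, i, s in tuples:
    R[u, i] += s
```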
  8. ALS ALGORITHM 2
      Follow "Collaborative Filtering for Implicit Feedback Datasets": www2.research.att.com/~yifanhu/PUB/cf.pdf
      Construct the "binary" matrix P: 1 where R > 0, 0 where R = 0 (see the sketch after this slide)
      Factor P, not R; R returns in the regularization
      Still sparse; implicit 0s are fine
     Example P:
        1 1 1 0 0
        0 0 1 0 0
        0 1 0 1 1
        1 0 1 0 1
        0 0 0 1 0
        1 1 0 0 0
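Continuing the same sketch, the "binary" P is just an indicator of where R is nonzero (illustrative code, not Myrrix's implementation):

```python
# P is 1 wherever any interaction was observed, 0 elsewhere.
P = (R > 0).astype(float)
```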
  9. ALS ALGORITHM 3
      P is m x n; choose k << m, n
      Factor P as Q = X Yᵀ, with Q ≈ P; X is m x k, Yᵀ is k x n
      Find the best approximation Q: minimize the L2 norm of the difference, ||P - Q||²; minimal squared error: "Least Squares" (a sketch of this error follows)
      Recommendations are the largest values in Q
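As a small check on this objective, the squared reconstruction error can be computed directly once X and Y are available; a sketch assuming dense NumPy arrays:

```python
import numpy as np

def reconstruction_error(P, X, Y):
    """Squared Frobenius (L2) norm of P - Q, where Q = X @ Y.T."""
    Q = X @ Y.T
    return float(np.sum((P - Q) ** 2))
```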
  10. ALS ALGORITHM 4
      Optimizing X and Y simultaneously is non-convex, hard
      If X or Y is fixed, it becomes a system of linear equations: convex, easy
      Initialize Y with random values; solve for X; fix X, solve for Y; repeat ("Alternating") (the loop is sketched after this slide)
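The alternation itself is a short loop; this sketch assumes dense NumPy arrays, and solve_rows is a hypothetical per-row closed-form update, sketched after the "ALS ALGORITHM 6" slide:

```python
import numpy as np

def als(P, R, k, lam, alpha, iterations=10, seed=0):
    """Alternate: fix Y and solve each row of X, then fix X and solve each row of Y."""
    n_items = P.shape[1]
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n_items, k)) * 0.01      # random initial Y
    for _ in range(iterations):
        X = solve_rows(P, R, Y, lam, alpha)           # Y fixed: solve for X
        Y = solve_rows(P.T, R.T, X, lam, alpha)       # X fixed: solve for Y
    return X, Y
```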
  11. ALS ALGORITHM 5
      Define regularization weights cui = 1 + α rui
      Minimize: Σ cui (pui - xuᵀyi)² + λ (Σ ||xu||² + Σ ||yi||²)
      This is a simple least-squares regression objective, plus: squared-error terms weighted by strength, so the penalty for not reconstructing a 1 at a "strong" association is higher; and a standard L2 regularization term (a sketch of the objective follows)
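A sketch of the objective itself, with the weights cui = 1 + α rui applied elementwise (dense arrays for readability; not reference code):

```python
import numpy as np

def weighted_objective(P, R, X, Y, lam, alpha):
    """Sum of c_ui * (p_ui - x_u . y_i)^2 plus L2 penalties on X and Y."""
    C = 1.0 + alpha * R                    # stronger interactions get larger weights
    squared_err = (P - X @ Y.T) ** 2
    return float(np.sum(C * squared_err) + lam * (np.sum(X ** 2) + np.sum(Y ** 2)))
```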
  12. ALS ALGORITHM 6
      With Y fixed, compute the optimal X; each row xu is independent
      Define Cu as the diagonal matrix of cu (the user's strength weights)
      xu = (YᵀCuY + λI)⁻¹ YᵀCupu
      Compare to the plain least-squares regression solution (YᵀY)⁻¹Yᵀpu: this adds the Tikhonov / ridge regression regularization term λI and attaches the cu weights to Yᵀ
      See the paper for how YᵀCuY is computed efficiently; skipping the engineering! (a dense sketch of the row solve follows)
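A dense, per-row sketch of this solve (the paper's trick for computing YᵀCuY cheaply over only the observed items is noted in a comment but not implemented; solve_rows is the helper assumed by the loop sketch above):

```python
import numpy as np

def solve_rows(P, R, Y, lam, alpha):
    """With Y fixed, compute each row x_u = (Y^T C^u Y + lam*I)^-1 Y^T C^u p_u."""
    m, k = P.shape[0], Y.shape[1]
    X = np.zeros((m, k))
    identity = np.eye(k)
    for u in range(m):
        c_u = 1.0 + alpha * R[u]                   # diagonal of C^u as a vector
        weighted_Y = Y * c_u[:, None]              # C^u Y without forming C^u explicitly
        A = weighted_Y.T @ Y + lam * identity      # Y^T C^u Y + lam*I
        b = weighted_Y.T @ P[u]                    # Y^T C^u p_u
        X[u] = np.linalg.solve(A, b)
        # The paper computes Y^T C^u Y as Y^T Y + Y^T (C^u - I) Y, where C^u - I
        # is nonzero only at the user's observed items, which is much cheaper.
    return X
```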
  13. EXAMPLE FACTORIZATION
      k = 3, λ = 2, α = 40, 10 iterations
      Q = X·Yᵀ ≈ P:

         0.96  0.99  0.99  0.38  0.93        1 1 1 0 0
         0.44  0.39  0.98 -0.11  0.39        0 0 1 0 0
         0.70  0.99  0.42  0.98  0.98   ≈    0 1 0 1 1
         1.00  1.04  0.99  0.44  0.98        1 0 1 0 1
         0.11  0.51 -0.13  1.00  0.57        0 0 0 1 0
         0.97  1.00  0.68  0.47  0.91        1 1 0 0 0
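Tying the sketches together, a hypothetical driver with the slide's parameters; the exact Q values depend on the random initialization, so this will not reproduce the numbers above exactly:

```python
# Assumes P and R built as in the earlier sketches.
X, Y = als(P, R, k=3, lam=2.0, alpha=40.0, iterations=10)
Q = X @ Y.T    # each user's recommendations are the largest values in their row of Q
```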
  14. FOLD-IN
      Need immediate, if approximate, updates for new data
      A new user u needs a new row Qu = XuYᵀ; we have Pu ≈ Qu
      Compute Xu via a right inverse: Q = XYᵀ, so X = Q(Yᵀ)⁻¹; but what is (Yᵀ)⁻¹?
      Note (YᵀY)(YᵀY)⁻¹ = I, which gives Yᵀ's right inverse: Yᵀ(Y(YᵀY)⁻¹) = I
      So Xu = QuY(YᵀY)⁻¹ ≈ PuY(YᵀY)⁻¹ (a sketch follows this slide)
      Recommend as usual: Qu = XuYᵀ
      For an existing user, instead add to the existing row Xu
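A sketch of the fold-in step: a new user's row pu is projected into the latent space via the right inverse Y(YᵀY)⁻¹. The helper name is invented, and a production version might add regularization to the (YᵀY) solve:

```python
import numpy as np

def fold_in(p_u, Y):
    """Approximate a new user's latent row: x_u ≈ p_u Y (Y^T Y)^-1."""
    # Solve (Y^T Y) x_u = Y^T p_u rather than forming the inverse explicitly.
    x_u = np.linalg.solve(Y.T @ Y, Y.T @ p_u)
    q_u = x_u @ Y.T        # reconstructed row; recommend the largest values
    return x_u, q_u
```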
  15. THIS IS MYRRIX
      Soft-launched: Serving Layer available as an open source download; Computation Layer available as a beta; ready on Amazon EC2 / EMR
      Full launch Q4 2012
      srowen@myrrix.com • myrrix.com
  16. APPENDIX
  17. EXAMPLES
      StackOverflow tags: recommend tags to questions; tag questions automatically, improve tag coverage; 3.5M questions x 30K tags; 4.3 hours x 5 machines on Amazon EMR; $3.03 ≈ $0.08 per 100,000 recs
      Wikipedia links: recommend new linked articles from existing links; propose missing, related links; 2.5M articles x 1.8M articles; 28 hours x 2 PCs on Apache Hadoop 1.0.3
