Big Practical Recommendations with Alternating Least Squares

Sean Owen • Apache Mahout / Myrrix.com

WHERE’S BIG LEARNING?
• Next: Application Layer
  • Analytics
  • Machine Learning
• Like Apache Mahout
  • Common Big Data app today
  • Clustering, recommenders, classifiers on Hadoop
  • Free, open source; not mature
• Where’s commercialized Big Learning?
[Diagram: stack of layers labeled Applications, Processing, Database, Storage]

A RECOMMENDER SHOULD …
• Answer in Real-time
  • Ingest new data, now
  • Modify recommendations based on newest data
  • No “cold start” for new data
• Scale Horizontally
  • For queries per second
  • For size of data set
• Accept Diverse Input
  • Not just people and products
  • Not just explicit ratings
  • Clicks, views, buys
  • Side information
• Be “Pretty Accurate”

NEED: 2-TIER ARCHITECTURE
• Real-time Serving Layer
  • Quick results based on precomputed model
  • Incremental update
  • Partitionable for scale
• Batch Computation Layer
  • Builds model
  • Scales out (on Hadoop?)
  • Asynchronous, occasional, long-lived runs

A PRACTICAL ALGORITHM

MATRIX FACTORIZATION BENEFITS
• Factor user-item matrix to user-feature + feature-item matrix
• Well understood in ML, as:
  • Principal Component Analysis
  • Latent Semantic Indexing
• Several algorithms, like:
  • Singular Value Decomposition
  • Alternating Least Squares
• Models intuition
• Factorization is batch parallelizable
• Reconstruction (recs) in low-dimension is fast
• Allows projection of new data
  • Cold start solution
  • Approximate update solution

A PRACTICAL IMPLEMENTATION

ALTERNATING LEAST SQUARES BENEFITS
• Simple factorization $P \approx X Y^T$
  • Approximate: X, Y are very “skinny” (low-rank)
• Faster than the SVD
  • Trivially parallel, iterative
• Dumber than the SVD
  • No singular values, orthonormal basis
• Parallelizable by row -- Hadoop-friendly
• Iterative: OK answer fast, refine as long as desired
• Yields to “binary” input model
  • Ratings as regularization instead
  • Sparseness / 0s no longer a problem

ALS ALGORITHM 1
• Input: (user, item, strength) tuples
  • Anything you can quantify is input
  • Strength is positive
  • Many tuples per user-item
• R is sparse user-item interaction matrix
• $r_{ui}$ = total strength of interaction between user u and item i

R =
1 4 3 . .
. . 3 . .
. 4 . 3 2
5 . 2 . 3
. . . 5 .
2 4 . . .
(“.” marks an empty cell; cell positions follow the nonzero pattern of P)

ALS ALGORITHM 2
• Follow “Collaborative Filtering for Implicit Feedback Datasets”
  www2.research.att.com/~yifanhu/PUB/cf.pdf
• Construct “binary” matrix P
  • 1 where R > 0
  • 0 where R = 0
• Factor P, not R
  • R returns in regularization
• Still sparse; implicit 0s fine

P =
1 1 1 0 0
0 0 1 0 0
0 1 0 1 1
1 0 1 0 1
0 0 0 1 0
1 1 0 0 0
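The step from tuples to R and then to P can be sketched as follows. This is a minimal illustration, not the Myrrix implementation: the tuples, the function name `build_r_and_p`, and the use of a dense NumPy array (a real data set would need a sparse structure) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical (user, item, strength) tuples; indices and values are illustrative.
tuples = [
    (0, 0, 1.0), (0, 1, 4.0), (0, 2, 3.0),
    (1, 2, 3.0),
    (2, 1, 2.0), (2, 1, 2.0),   # several tuples for the same user-item pair
    (2, 3, 3.0), (2, 4, 2.0),
]

def build_r_and_p(tuples, num_users, num_items):
    """Sum strengths into R, then threshold R into the binary matrix P."""
    r = np.zeros((num_users, num_items))
    for user, item, strength in tuples:
        r[user, item] += strength   # total strength for this user-item pair
    p = (r > 0).astype(float)       # 1 where R > 0, 0 where R = 0
    return r, p

R, P = build_r_and_p(tuples, num_users=3, num_items=5)
```

Here the two tuples for user 2 and item 1 add up to a single entry of 4.0 in R, and P keeps only the fact that an interaction exists.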

ALS ALGORITHM 3
• P is m x n
• Choose k << m, n
• Factor P as $Q = X Y^T$, $Q \approx P$
  • X is m x k; $Y^T$ is k x n
• Find best approximation Q
  • Minimize L2 norm of diff: $\|P - Q\|_2$
  • Minimal squared error: “Least Squares”
• Recommendations are largest values in Q
[Figure: matrices X and $Y^T$]
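Reading recommendations off Q might look like the sketch below. The function name `recommend` and its parameters are hypothetical, and hiding items the user has already interacted with is an added assumption; the slide only says that recommendations are the largest values in Q.

```python
import numpy as np

def recommend(x, y, p, user, how_many=2):
    """Return item indices with the largest values in one row of Q = X Y^T.

    Assumption: items with p == 1 for this user are hidden, since the user
    has already interacted with them.
    """
    q_row = x[user] @ y.T                           # one row of Q, length n
    q_row = np.where(p[user] > 0, -np.inf, q_row)   # hide known items
    order = np.argsort(-q_row)                      # largest values first
    return [int(i) for i in order[:how_many] if np.isfinite(q_row[i])]
```

Only one row of Q is ever formed, which is why reconstruction in the low-dimensional space stays cheap at serving time.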

ALS ALGORITHM 4
• Optimizing X, Y simultaneously is non-convex, hard
• If X or Y is fixed, system of linear equations: convex, easy
• Initialize Y with random values
• Solve for X
• Fix X, solve for Y
• Repeat (“Alternating”)
[Figure: matrices X and $Y^T$]
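The alternating steps above can be sketched with plain, unweighted least squares. This deliberately leaves out the weights and regularization of the full objective; the function name `alternate`, the seed, and the use of `numpy.linalg.lstsq` are choices made for this illustration only.

```python
import numpy as np

def alternate(p, k, iterations=10, seed=0):
    """Alternate two linear least-squares solves so that X Y^T approximates P."""
    rng = np.random.default_rng(seed)
    m, n = p.shape
    y = rng.random((n, k))          # initialize Y with random values
    for _ in range(iterations):
        # Fix Y and solve for X: least-squares solution of Y X^T = P^T.
        x = np.linalg.lstsq(y, p.T, rcond=None)[0].T
        # Fix X and solve for Y: least-squares solution of X Y^T = P.
        y = np.linalg.lstsq(x, p, rcond=None)[0].T
    return x, y
```

Each half-step minimizes the squared error over one factor while the other is held fixed, so the reconstruction error cannot increase from one iteration to the next.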

ALS ALGORITHM 5
• Define regularization weights $c_{ui} = 1 + \alpha r_{ui}$ (called “confidence” in the paper)
• Minimize:

$$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$

• Simple least-squares regression objective, plus
  • Squared-error terms weighted by strength: the penalty for not reconstructing 1 at a “strong” association is higher
  • Standard L2 regularization term
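As a concrete reading of the objective, the sketch below evaluates it for given factors. The function name `als_loss` and the dense arrays are assumptions for illustration; X is taken as m x k and Y as n x k, so that X Y^T has the shape of R.

```python
import numpy as np

def als_loss(r, x, y, alpha, lam):
    """Weighted squared error plus L2 penalty, as in the objective above."""
    p = (r > 0).astype(float)      # binary matrix P
    c = 1.0 + alpha * r            # weights c_ui = 1 + alpha * r_ui
    err = p - x @ y.T              # p_ui - x_u^T y_i for every (u, i)
    penalty = lam * (np.sum(x ** 2) + np.sum(y ** 2))
    return float(np.sum(c * err ** 2) + penalty)
```

Cells with r = 0 still contribute, with weight 1, which is how the implicit zeros take part in the fit.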

ALS ALGORITHM 6
• With fixed Y, compute optimal X
• Each row $x_u$ is independent
• Define $C^u$ as diagonal matrix of $c_u$ (user strength weights)
• $x_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p_u$
• Compare to simple least-squares regression solution $(Y^T Y)^{-1} Y^T p_u$
  • Adds Tikhonov / ridge regression regularization term $\lambda I$
  • Attaches $c_u$ weights to $Y^T$
• See paper for how $Y^T C^u Y$ is computed efficiently; skipping the engineering!
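A sketch of the row-by-row solve follows. It uses the identity $Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y$ described in the cited paper, so that $Y^T Y$ is computed once and the per-user correction only touches items with $r_{ui} > 0$. The function name `solve_user_rows`, the dense R, and the plain Python loop are assumptions for illustration, not the Myrrix code.

```python
import numpy as np

def solve_user_rows(r, y, alpha, lam):
    """With Y fixed, compute every x_u = (Y^T C^u Y + lam I)^-1 Y^T C^u p_u."""
    m, _ = r.shape
    k = y.shape[1]
    yty = y.T @ y                        # shared by all users, computed once
    x = np.zeros((m, k))
    for u in range(m):
        items = np.nonzero(r[u])[0]      # items with r_ui > 0, so p_ui = 1
        y_u = y[items]                   # rows of Y for those items
        extra = alpha * r[u, items]      # c_ui - 1 for those items
        # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; the second term uses few rows.
        a = yty + (y_u.T * extra) @ y_u + lam * np.eye(k)
        # Y^T C^u p_u is the sum of c_ui * y_i over items with p_ui = 1.
        b = y_u.T @ (1.0 + extra)
        x[u] = np.linalg.solve(a, b)
    return x
```

Because the problem is symmetric in users and items, the Y step can reuse the same function as `solve_user_rows(r.T, x, alpha, lam)`. Each row is solved independently, which is what makes the computation parallelizable by row.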

EXAMPLE FACTORIZATION
• k = 3, λ = 2, α = 40, 10 iterations

P            ≈   Q = X Y^T
1 1 1 0 0        0.96  0.99  0.99  0.38  0.93
0 0 1 0 0        0.44  0.39  0.98 -0.11  0.39
0 1 0 1 1        0.70  0.99  0.42  0.98  0.98
1 0 1 0 1        1.00  1.04  0.99  0.44  0.98
0 0 0 1 0        0.11  0.51 -0.13  1.00  0.57
1 1 0 0 0        0.97  1.00  0.68  0.47  0.91

FOLD-IN
• Need immediate, if approximate, updates for new data
• New user u needs new row: $Q_u = X_u Y^T$
• We have $P_u \approx Q_u$
• Compute $X_u$ via right inverse: $X Y^T (Y^T)^{-1} = Q (Y^T)^{-1}$, so $X = Q (Y^T)^{-1}$
• What is $(Y^T)^{-1}$?
• Note $(Y^T Y)(Y^T Y)^{-1} = I$
• Gives $Y^T$’s right inverse: $Y^T \left( Y (Y^T Y)^{-1} \right) = I$
• $X_u = Q_u Y (Y^T Y)^{-1}$
• $X_u \approx P_u Y (Y^T Y)^{-1}$
• Recommend as usual: $Q_u = X_u Y^T$
• For existing user, instead add to existing row $X_u$
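The fold-in formula can be sketched as below. The function names are hypothetical, Y is assumed to have full column rank so that $Y^T Y$ is invertible, and the existing-user variant is one possible reading of “add to existing row”: fold in only the change in $P_u$ and add the result to the stored row.

```python
import numpy as np

def fold_in_new_user(p_u, y):
    """Approximate a new user's row: x_u = p_u Y (Y^T Y)^-1."""
    # Solve (Y^T Y) z = Y^T p_u rather than forming the inverse explicitly.
    return np.linalg.solve(y.T @ y, y.T @ p_u)

def fold_in_existing_user(x_u, delta_p_u, y):
    """Assumed reading of 'add to existing row': fold in only the change in P_u."""
    return x_u + fold_in_new_user(delta_p_u, y)
```

Neither function touches X or re-runs the factorization, so an update costs one small k x k solve; the batch layer can later replace the approximate row with a freshly computed one.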

THIS IS MYRRIX
• Soft-launched
• Serving Layer available as open source download
• Computation Layer available as beta
• Ready on Amazon EC2 / EMR
• Full launch Q4 2012
• myrrix.com
• srowen@myrrix.com

EXAMPLES

STACKOVERFLOW TAGS
• Recommend tags to questions
• Tag questions automatically, improve tag coverage
• 3.5M questions x 30K tags
• 4.3 hours x 5 machines on Amazon EMR
• $3.03 ≈ $0.08 per 100,000 recs

WIKIPEDIA LINKS
• Recommend new linked articles from existing links
• Propose missing, related links
• 2.5M articles x 1.8M articles
• 28 hours x 2 PCs on Apache Hadoop 1.0.3
