2. Agenda
Basic Counting Models: EMP
Feature Based Models: OLR
RLFM: Feature Model + Collaborative Filtering
Bonus: Tutorial on Collaborative Filtering
Note:
› Will focus on the science framework
› Will not focus on the optimization problem
3. EMP + OLR:
Basic Counting Models: EMP
› Simple CTR model based on counting clicks/views
Feature Based Models: OLR
RLFM: Feature Model + Collaborative Filtering
Bonus: Tutorial on Collaborative Filtering
5. Counting Models: CTR
Estimate CTR for each article independently
CTR = Click-Thru-Rate = Total Clicks / Total Views
Online Model: Update every 5 mins:
CTR = (C_t + C_{t−1} + ... + C_1) / (V_t + V_{t−1} + ... + V_1) = (C_t + Σ_{s<t} C_s) / (V_t + Σ_{s<t} V_s)
C_t = clicks during period “t”
V_t = views during period “t”
6. CTR Curves for Two Days
Each curve is the CTR of an item in the Today Module over time
Traffic obtained from a controlled randomized experiment
Things to note:
(a) Short lifetimes, (b) temporal effects, (c) often breaking news stories
7. Counting Models: Most Popular
EMP: Estimated Most Popular ( aka GMP ):
› Decay = Forget about old clicks and views ( gamma ~ 0.95-0.99 )
CTR_EMP = (C_t + γ C_{t−1} + γ² C_{t−2} + ...) / (V_t + γ V_{t−1} + γ² V_{t−2} + ...)
Segmented Most Popular:
› Separate model for each segment of the population
CTR_EMP-Male = (C_t^Male + γ C_{t−1}^Male + γ² C_{t−2}^Male + ...) / (V_t^Male + γ V_{t−1}^Male + γ² V_{t−2}^Male + ...)
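The decayed counts above can be maintained incrementally. A minimal sketch (the list-based interface and the example gamma values are my own, not from the deck):

```python
def emp_ctr(clicks, views, gamma=0.95):
    """Decayed CTR over per-period counts, oldest period first.

    CTR_EMP = (C_t + g*C_{t-1} + g^2*C_{t-2} + ...)
            / (V_t + g*V_{t-1} + g^2*V_{t-2} + ...)
    """
    num = den = 0.0
    for c, v in zip(clicks, views):
        # Folding in a new period multiplies every older term by one more gamma.
        num = gamma * num + c
        den = gamma * den + v
    return num / den
```

In a 5-minute online loop one would keep just the pair (num, den) per article and apply the same two update lines on each new period.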
8. Tracking behavior of Estimated Most Popular model
Low click rate articles – More temporal smoothing
9. OLR: Online Logistic Regression
Basic Counting Models: EMP
Feature Based Models: OLR
› Motivation for using regression
› Logistic Regression framework
› Online Logistic Regression: general case
› Per-item OLR Use Case: Today Module
› Improving Model
RLFM: Feature Model + Collaborative Filtering
Affinity Models: Log Odds
Bonus: Tutorial on Collaborative Filtering
10. Motivation for using Regression:
• EMP: Breaks down if segment is too small:
• e.g. 40-year-old male in New York
CTR_Male_40_NY = (C_t^Male_40_NY + γ C_{t−1}^Male_40_NY + ...) / (V_t^Male_40_NY + γ V_{t−1}^Male_40_NY + ...)
Logistic Regression:
› Natural framework to include more features:
› Age, Gender, Location, User Interests, …
log(P_click / (1 − P_click)) = b + Σ_{k={features}} W_k · X_{k,u}
› X_{k,u} = value for feature k and user u: e.g. age of a user
› W_k = weight parameter to be learned for each feature
11. Linear Regression: One Dimension
Linear Fit: Y = a*X + b
[Scatter plot: Y = Weight vs X = Height, with linear fit and errors]
SSE = Σ_{i={examples}} (Y_i − a · X_i − b)²
• Find values of “a” and “b” that minimize the Sum of Squared Errors (SSE)
• Take the derivatives of SSE with respect to “a” and “b” and set them equal to 0
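Setting those two derivatives to zero has a closed-form solution; a small sketch (variable names are mine):

```python
def fit_line(xs, ys):
    """Minimize SSE = sum_i (y_i - a*x_i - b)^2 in closed form.

    dSSE/da = 0 and dSSE/db = 0 reduce to the usual normal equations.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance over variance; intercept: line passes through the means.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b
```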
12. Can’t Apply Linear Model to Click Prediction
For example: probability of click for an article on Retirement as a function of Age
[Plot: Probability of Click vs Age; a linear model doesn’t represent the data points well]
13. Logistic Model for Click Prediction
Probability of click for an article on Retirement as a function of Age
The logistic model is much better:
P(Click) = 1 / (1 + Exp(−a * Age − b))
[Plot: Probability of Click vs Age; the logistic model fits the data points well]
14. Logistic Regression: One Dimension
P(Y_i) = 1 / (1 + exp(−Y_i · (a · Age_i − b)))
Y_i = +1 => P(Y_i=+1) = Prob user Clicked on article
Y_i = −1 => P(Y_i=−1) = Prob user Didn’t Click
• How to find parameters “a” and “b” for many training examples (Y_i, Age_i):
• Maximize the Product of Probabilities (Likelihood):
Likelihood = P(Y_1) · P(Y_2) · P(Y_3) ··· P(Y_n)
• “Hard” to solve
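“Hard” here means there is no closed form; the likelihood surface can still be explored numerically. A sketch with a brute-force scan over “a”, holding “b” fixed as the next slides do (the toy data and grid are mine):

```python
import math

def likelihood(a, b, data):
    """Product of P(Y_i) = 1 / (1 + exp(-Y_i * (a*Age_i - b))) over examples."""
    p = 1.0
    for y, age in data:
        p *= 1.0 / (1.0 + math.exp(-y * (a * age - b)))
    return p

def best_a(data, b=0.0):
    """Brute-force grid scan for the 'a' that maximizes the likelihood."""
    grid = [i / 1000.0 for i in range(-300, 301)]
    return max(grid, key=lambda a: likelihood(a, b, data))

# Toy data: clicks (+1) from older users, non-clicks (-1) from younger ones.
data = [(+1, 70), (+1, 60), (-1, 30), (-1, 20)]
```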
15. Optimize Logistic Likelihood for 4 Data Points:
Likelihood(1..4) = P(Y_1) · P(Y_2) · P(Y_3) · P(Y_4)
P(Y_i) = 1 / (1 + exp(−Y_i · (a · Age_i − b)))
[Plot: Likelihood and Prob(Y_1), Prob(Y_2) vs parameter “a”]
For simplicity: I assume that I know the value of “b”
16. Optimize Logistic Likelihood for 40 Data Points:
Likelihood(1..40) = P(Y_1) · P(Y_2) ··· P(Y_40)
[Plot: rescaled Likelihood vs parameter “a”, for 40 data points and for 4 data points]
For simplicity: I assume that I know the value of “b”
17. Gaussian Approximation to Likelihood:
Exp(−(a − m_40)² / 2σ_40²) ≈ Likelihood_40(a)
• Replace Likelihood with a simple Gaussian with two Hyperparameters:
* Mean: m_40 (what is the average value for “a”)
* Standard deviation: σ_40 (what is the error around the mean)
[Plot: Likelihood_40 and its Gaussian approximation vs parameter “a”, marking m_40 and σ_40]
18. Gaussian approx allow for Update for one data point at a time:
Exp(−(a − m_40)² / 2σ_40²) ≈ Likelihood_40(a)
Exp(−(a − m_40)² / 2σ_40²) ≈ P(Y_40) · {P(Y_39) · P(Y_38) ··· P(Y_1)}
Exp(−(a − m_40)² / 2σ_40²) ≈ P(Y_40) · Exp(−(a − m_39)² / 2σ_39²)
Posterior ≈ Likelihood * Prior
• Note: for simplicity I ignored all normalizations
19. OLR: Online Logistic Regression: one parameter
• Solve Bayesian update for each new event: (Y, Age)
Posterior ≈ Likelihood * Prior
Exp(−(a − m_t)² / 2σ_t²) ≈ P(Y) · Exp(−(a − m_{t−1})² / 2σ_{t−1}²)
P(Y) = 1 / (1 + exp(−Y · (a · Age − b)))
• Yrank update formulas:
m_t = m_{t−1} + …
1/σ_t² = 1/σ_{t−1}² + …
• Yrank approximate solution: Scott Roy talk:
http://twiki.corp.yahoo.com/pub/Personalization/YRank/YRankLearning.ppt
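The deck leaves the right-hand sides of the Yrank update formulas blank, so they are not reproduced here. The sketch below instead uses a generic Gaussian (Laplace-style) approximation of Posterior ≈ Likelihood * Prior, which has the same shape: the mean moves toward the data and the precision grows with each event.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def olr_update_1d(m, var, y, age, b=0.0):
    """One Bayesian update of the Gaussian over the single parameter 'a'.

    NOT the exact Yrank formula (the slide elides it); a generic
    Laplace-style step: shift the mean along the gradient of log P(Y),
    grow the precision by the curvature of -log P(Y).
    """
    p = sigmoid(y * (m * age - b))        # P(Y) evaluated at the prior mean
    grad = y * age * (1.0 - p)            # d/da log P(Y) at a = m
    curv = age * age * p * (1.0 - p)      # -d^2/da^2 log P(Y) at a = m
    new_var = 1.0 / (1.0 / var + curv)    # 1/sigma_t^2 = 1/sigma_{t-1}^2 + curv
    new_m = m + new_var * grad            # m_t = m_{t-1} + step toward the data
    return new_m, new_var
```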
20. OLR: Online Logistic Regression: General Case
• Replace one parameter “a” by a set of parameters: {w_f}
• Replace one feature “Age” by a set of features: {X_f}
• Solve Bayesian update for each new event: (Y, {X_f})
Posterior ≈ Likelihood * Prior
Exp(−Σ_f (w_f − m_{f,t})² / 2σ_{f,t}²) ≈ P(Y) · Exp(−Σ_f (w_f − m_{f,t−1})² / 2σ_{f,t−1}²)
P(Y) = 1 / (1 + exp(−Y · Σ_f w_f · X_f))
• Yrank update formulas:
m_{f,t} = m_{f,t−1} + …
1/σ_{f,t}² = 1/σ_{f,t−1}² + …
21. OLR: General Case: Features
• Multi-dimensional logistic regression model:
P(Y) = 1 / (1 + exp(−Y · Σ_{f={features}} w_f · X_f))
Σ_f w_f · X_f = w_1 · 1                                                       <= Baseline
  + w_2·X_{u=Male} + w_3·X_{u=Age40s} + w_4·X_{u=SanJose} + w_5·X_{u=likeSports}  <= User Features
  + w_6·X_{i=about_Sports} + w_7·X_{i=about_NBA}                              <= Article Features
  + w_8·X_{(u=likeSports & i=about_Sports)}                                   <= User*Article Features
• More on Features:
http://twiki.corp.yahoo.com/view/SRelevance/NewsRecommendationFeatures
http://twiki.corp.yahoo.com/view/SRelevance/COREUserProfilesSparsePolarity
22. OLR: Online Logistic Regression
Basic Counting Models: EMP
Feature Based Models: OLR
› Motivation for using regression
› Logistic Regression framework
› Online Logistic Regression: General Case
› Per-item OLR Use Case: Today Module
› Improving Model
RLFM: Feature Model + Collaborative Filtering
Affinity Models: Log Odds
Bonus: Tutorial on Collaborative Filtering
24. Per item-OLR use Case: Yahoo FP Today Module
• Front Page Module:
• Articles don’t live very long ( < day )
• Many clicks/views for each article
• Each article treated independently:
• A new OLR model for each new article
• Trying to predict CTR for each user & article pair: u,i
P(Y_ui = 1) = 1 / (1 + exp(−Σ_{f={user_features}} w_{i,f} · X_{u,f}))
Σ_f w_{i,f} · X_{u,f} = w_1                                                   <= Baseline
  + w_2·X_{u=Male} + w_3·X_{u=Age20s} + w_4·X_{u=NewYork}
  + w_5·X_{u=likeSports} + w_6·X_{u=likeNFL} + w_7·X_{u=likeMusic}            <= User Features
26. Per item-OLR use Case: Yahoo FP Today Module
• Each article has its own OLR model and its own set of weights: {w_{i,f}}
P(Y_ui = 1) = 1 / (1 + exp(−Σ_{f={user_features}} w_{i,f} · X_{u,f}))
• Each article has its own:
Prior ~ Exp(−Σ_{f={features}} (w_{i,f} − m_{i,f})² / 2σ_{i,f}²)
• For each event (Y_ui, {X_uf}) update the hyperparameters for that article:
Yrank Update Formula:
m_{i,f,t} = m_{i,f,t−1} + …
1/σ_{i,f,t}² = 1/σ_{i,f,t−1}² + …
27. Per item-OLR use Case: Yahoo FP Today Module
• How to use the OLR model:
• Choose a candidate pool:
• Roughly 50-100 articles picked by editors
• Explore:
• In a small bucket: try all 50-100 articles randomly
• Modeling: for each event (click/view) apply Yrank for that article
• Exploit:
• For the remainder (larger bucket)
• Scoring: predict article CTR, and order by decreasing CTR:
CTR = P(Y_ui = 1) = 1 / (1 + exp(−Σ_{f={user_features}} m_{i,f} · X_{u,f}))
28. Improving Online Learning:
• Correlated OLR: Include interactions between hyperparameters: improvement
Prior ~ Exp(−(1/2) Σ_{f1,f2} (w_{f1} − m_{f1}) A⁻¹_{f1,f2} (w_{f2} − m_{f2}))
• Mini-Batch: Update multiple data points at once: no gain in CTR
Likelihood(mini batch) = P(Y_1) ··· P(Y_n)
• TechPulse 2011: Taesup Moon, Pradheep Elango, Su-Lin Wu
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/techpulse.pdf
29. Improving Explore/Exploit: UCB
• UCB: improve Explore/Exploit strategy: improvement
• Old strategy: ε-greedy
• Explore: update OLR only from events in a small random bucket
• Exploit:
• Order articles in decreasing value of predicted CTR
• New strategy: UCB (aka Upper Confidence Bound)
• Single bucket
• Explore:
• Update OLR with all events
• Exploit:
• Order articles in decreasing value of “optimistic” CTR_UCB
• TechPulse 2011: Taesup Moon, Pradheep Elango, Su-Lin Wu
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/techpulse.pdf
30. Improving Explore/Exploit: UCB
• Upper Confidence Bound strategy: improvement
• Exploit:
• Order articles in decreasing value of “optimistic” CTR_UCB
• ONE-DIMENSION EXAMPLE:
• Replace normal CTR:
CTR = 1 / (1 + exp(−m · X))
• With optimistic CTR:
CTR_UCB = 1 / (1 + exp(−(m · X + z · σ · X²)))
z = tunable parameter
[Plot: CTR and CTR_UCB vs X]
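A sketch of the one-dimension example (the z·σ·X² uncertainty term is taken from the slide; the function names are mine):

```python
import math

def ctr(m, x):
    """Normal CTR, evaluated at the posterior mean m."""
    return 1.0 / (1.0 + math.exp(-m * x))

def ctr_ucb(m, sigma, x, z=1.0):
    """'Optimistic' CTR: add z uncertainty units to the log-odds."""
    return 1.0 / (1.0 + math.exp(-(m * x + z * sigma * x * x)))
```

Ordering articles by ctr_ucb favors items whose estimate is still uncertain (large σ), which is what lets a single bucket replace the separate random explore bucket.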
31. RLFM: Regression based Latent Factor Model
Basic Counting Models: EMP
Feature Based Models: OLR
RLFM: Feature Model + Collaborative Filtering
› RLFM components
› Using RLFM: Offline & Online update
Bonus: Tutorial on Collaborative Filtering
32. RLFM: Regression based Latent Factor Model
• RLFM: basic idea
* Build a single logistic regression model for all users “u” and
articles “i”
* Add Collaborative Filtering using Matrix Factorization
• Modeling:
• Most of it is done offline in big batch mode ( millions of events )
• One part of the model is also updated online ( one event at a time, using the Yrank update )
⇒ Latent Factor Models are a work in progress:
• Original Y Labs Paper: Deepak Agarwal, Bee-Chung Chen
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/featfact.pdf
• Implementation for Coke:
http://twiki.corp.yahoo.com/view/YResearch/RLFMForCoke
33. RLFM: Regression based Latent Factor Model
RLFM components:
1) Build a logistic regression model for all users “u” and
articles “i”
2) Add user bias and article bias
3) Collaborative Filtering using Matrix Factorization
4) Predict factors for new user/article: Cold Start
5) Add Logistic Regression + Bias + Matrix Factorization
34. 1) Build logistic regression for all user/articles:
• Build a single logistic regression model for all users {u}, articles {i}:
P(Y_ui = 1) = 1 / (1 + exp(−Σ_{f={all_features}} w_f · X_{ui,f}))
Σ_f w_f · X_{ui,f} = w_1                                                      <= Baseline
  + w_2·X_{u=Male} + w_3·X_{u=Age40s} + w_4·X_{u=SanJose} + w_5·X_{u=likeSports}  <= User Features
  + w_6·X_{i=about_Sports} + w_7·X_{i=about_NBA}                              <= Article Features
  + w_8·X_{(u=likeSports & i=about_Sports)}                                   <= User*Article Features
• A single set of parameters {w_f} for all users, articles
• Learned offline in batch mode
35. 2) Add per user and per article baseline:
• Add bias parameters:
● Some articles are more/less popular than others
● Some users read more/fewer stories than others
P(Y_ui = 1) = 1 / (1 + exp(−α_u − β_i − Σ_{f={all_features}} w_f · X_{ui,f}))
• Baseline is not the same for every user/article:
• Old baseline: w_1
• New baseline: w_1 + α_u + β_i
• More parameters to optimize: {w_f}, {α_u}, {β_i}
• Better with some priors – to be described later
36. 3) Matrix Factorization Motivation
• How to deal with:
• Article about disaster preparedness:
• Hurricanes: need users from coastline: Texas => Northeast
• Earthquakes: need users from West coast
• Would need: X_user_WestCoast * X_about_earthquakes
• I don’t have that …
• But if I have many views/clicks over many such articles, I can discover that pattern !!!
38. 3) Matrix Factorization Motivation
• Most clicks explained by: U_1 * V_1 + U_2 * V_2
P(Click_ui) = P(Y_ui = 1) = 1 / (1 + exp(−U_{u,1} * V_{i,1} − U_{u,2} * V_{i,2}))
• The general case:
P(Y_ui = 1) = 1 / (1 + exp(−Σ_{k={factors}} U_uk * V_ik))
• Note:
• Number of factors ~ 50-200 << N_users & N_articles
39. 3) Matrix Factorization Model
• Matrix Factorization Model: aka Collaborative Filtering
P(Y_ui = 1) = 1 / (1 + exp(−Σ_{k={factors}} U_uk * V_ik))
• Obtain U’s and V’s: maximize the following likelihood
Likelihood = Π_{ui={examples}} 1 / (1 + exp(−Y_ui · Σ_{k={factors}} U_uk * V_ik))
• Π_ui => product over all past events (clicks/views)
• Y_ui => clicks = +1 and views = −1
• Better with some priors ...
40. 3)Matrix Factorization Model
• Better with some Priors:
· prior for each U_uk ~ Exp(−(U_uk − m_a)² / 2σ_a²)
· prior for each V_ik ~ Exp(−(V_ik − m_b)² / 2σ_b²)
· Choose: m_a = 0 and m_b = 0
· σ_a is the same for all U_uk
· σ_b is the same for all V_ik
• Get U’s and V’s: Maximize Likelihood * Prior
• Note: above priors are uncorrelated
• Original RLFM paper used correlated priors
41. 4) Matrix Factorization Model – Cold Start Problem
• Matrix Factorization Model:
P(Y_ui) = 1 / (1 + exp(−Σ_{k={factors}} U_uk * V_ik))
• Cold start problem:
=> for a new user U = 0, or for a new article V = 0
42. 4) Matrix Factorization Model – Cold Start Problem
• Matrix Factorization Model:
P(Y_ui) = 1 / (1 + exp(−Σ_{k={factors}} U_uk * V_ik))
• Cold start problem:
=> for a new user U = 0, or for a new article V = 0
• Solution: choose a different prior:
· for each U_uk ~ Exp(−(U_uk − Σ_{a={user_features}} G_{k,a} X_{u,a})² / 2σ_a²)
· for each V_ik ~ Exp(−(V_ik − Σ_{b={item_features}} D_{k,b} X_{i,b})² / 2σ_b²)
• Parameters G’s & D’s obtained from maximizing: Likelihood * Prior
43. 5) RLFM: Regression based Latent Factor Model
• Putting it back together: Bias + Regression + Matrix Factorization:
P(Y_ui = 1) = 1 / (1 + exp(−α_u − β_i − Σ_{f={features}} w_f · X_{ui,f} − Σ_{k={factors}} U_uk * V_ik))
• Priors:
· for each U_uk ~ Exp(−(U_uk − Σ_{a={user_features}} G_{k,a} X_{u,a})² / 2σ_a²)
· for each V_ik ~ Exp(−(V_ik − Σ_{b={item_features}} D_{k,b} X_{i,b})² / 2σ_b²)
· for each α_u ~ Exp(−(α_u − Σ_{a={user_features}} g_a X_{u,a})² / 2σ_α²)
· for each β_i ~ Exp(−(β_i − Σ_{b={item_features}} d_b X_{i,b})² / 2σ_β²)
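The combined predictor is just the four pieces summed inside one sigmoid; a minimal scoring sketch (the argument layout with aligned lists is my own):

```python
import math

def rlfm_score(alpha_u, beta_i, weights, x_ui, u_factors, v_factors):
    """P(Y_ui = 1) = sigmoid(alpha_u + beta_i + sum_f w_f*x_f + sum_k U_uk*V_ik).

    weights/x_ui are aligned lists over features; u_factors/v_factors are
    the latent factor vectors of the user and the article.
    """
    s = alpha_u + beta_i                                   # bias part
    s += sum(w * x for w, x in zip(weights, x_ui))         # regression part
    s += sum(u * v for u, v in zip(u_factors, v_factors))  # factorization part
    return 1.0 / (1.0 + math.exp(-s))
```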
44. RLFM: Regression based Latent Factor Model
Basic Counting Models: EMP
Feature Based Models: OLR
RLFM: Feature Model + Collaborative Filtering
› RLFM components
› Using RLFM: Offline & Online update
Bonus: Tutorial on Collaborative Filtering
45. Using RLFM: Offline Modeling:
• Offline Modeling:
• Batch mode: Maximize: Likelihood * Prior
• Millions to Billions of examples processed at once
• Input: {Y’s, X’s} all events and features
• Output:
• factors: {α_u}, {β_i}, {U_uk}, {V_ik}
• parameters: {w_f}, {g_a}, {d_b}, {G_{k,a}}, {D_{k,b}}
46. Using RLFM: Online Modeling and Scoring:
• Online Scoring + some Modeling
• For a new user or new article: compute factors from g, d, G, D
• new user bias: α_u = Σ_{a={user_features}} g_a X_{u,a}
• For an old user or old article: get factors from offline batch mode
• For each event (click/view) on article “i”:
• Update V_ik using the per-item OLR approach
• Predict score using updated V_ik:
P(Y_ui = 1) = 1 / (1 + exp(−α_u − β_i − Σ_{f={features}} w_f · X_{ui,f} − Σ_{k={factors}} U_uk * V_ik))
47. RLFM: Offline Results on Coke Data: Today Module
• RLFM results on Offline experiment
• Y! Front Page – Today Module
• CTR relative lift for RLFM vs Feature-Only as function of clicks/user
http://twiki.corp.yahoo.com/view/YResearch/RLFMReplayExperiments
50. Collaborative Filtering: Introduction
Goal: predict ratings r_ui for a movie “i” that a user “u” hasn’t seen yet
› Prediction based on Matrix of User/Movie Ratings:
● r_ui = 1 through 5 stars
› Prediction equations for integer Ratings are simpler than for binary Clicks
› Rating matrix is a large, very sparse matrix:
● 10M-100M users and 10K-100K movies but with ~99% blank entries
Based on: http://research.yahoo.com/files/korenBellChapterSpringer.pdf
› This talk: focus on the most relevant models & ignore some improvements:
● Baseline adjustment: user bias, movie bias and overall average rating
● time-aware model, binary features ( rated, rented )
This talk:
› Adjusted Ratings: r_ui^adjusted <= r_ui^raw − Baseline(ui)
51. Collaborative Filtering: the models
Correlated Neighborhood Model
› Predict new rating based on ratings of similar movies
Global Neighborhood Model
› Enlarge Neighborhood to be “global”
› Introduce adjustable weight parameters
Factorized Neighborhood Model
› Apply matrix factorization to weight parameters
SVD Model
› Apply matrix factorization to rating matrix itself
52. Collaborative Filtering: Correlated Neighborhood Model
• Predict unknown r_ui based on known ratings r_uj of similar movies
• You will like movie “i” because you liked movies “j”
[Diagram: item “i” connected to neighboring items j1..j6 with similarity S_ij]
• Define movie-movie Similarity measure: S_ij based on correlation
S_ij ∝ Σ_{u=Union(i,j)} r_ui * r_uj / Normalization
• Define Correlated Neighborhood:
• set of ~20 movies with largest S_ij that are rated by “u”
• Define Weight: normalized S_ij
53. Collaborative Filtering: Correlated Neighborhood Model
Ratings = r_ui: [toy Users × Movies matrix with entries ±1]
• Movies:
• i=1 Star Trek
• i=2 Star Wars
• i=3 Action movie
• i=4 Horror movie
Movie-Movie Similarity S_ij:
S_ij ∝ Σ_{u=Union(i,j)} r_ui * r_uj / Normalization
S_ij ≈
  1    1    0.5  0
  1    1    0.5  0
  0.5  0.5  1    0.5
  0    0    0.5  1
54. Collaborative Filtering: Correlated Neighborhood Model
• Predict unknown r_ui based on known ratings r_uj of similar movies
• You will like movie “i” because you liked movies “j”
• Similarity measure:
S_ij ∝ Σ_{u=Union(i,j)} r_ui * r_uj
• Correlated Neighborhood:
• set of ~20 movies with largest S_ij that are rated by “u”
• Weight:
• normalized S_ij
• Scoring:
r̃_ui = Σ_{j={correlated_neighbors}} r_uj * S_ij / Σ_j S_ij
• Simple, intuitive model with ability to explain why we predict a new movie
• Modeling:
• Need to precompute and store S_ij: 10K * 10K = 100M
• Weights are fixed to normalized value of S_ij
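The whole pipeline (similarity, neighborhood, weighted score) fits in a few lines. A sketch over adjusted ratings stored as per-user dicts; the data layout and the co-rater-count normalization are my own choices, since the slide leaves the normalization implicit:

```python
def similarity(all_ratings, i, j):
    """S_ij ~ sum of r_ui * r_uj over users who rated both i and j,
    normalized here by the number of co-raters."""
    common = [(r[i], r[j]) for r in all_ratings if i in r and j in r]
    if not common:
        return 0.0
    return sum(ri * rj for ri, rj in common) / len(common)

def predict(all_ratings, user, i, k=20):
    """r~_ui = sum_j r_uj * S_ij / sum_j S_ij over the k most similar
    movies that the user has rated (positive similarities only)."""
    neighbors = sorted(((similarity(all_ratings, i, j), j)
                        for j in user if j != i), reverse=True)[:k]
    num = sum(s * user[j] for s, j in neighbors if s > 0)
    den = sum(s for s, j in neighbors if s > 0)
    return num / den if den else 0.0
```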
55. Collaborative Filtering: Global Neighborhood Model
• Extend Neighborhood to All Known Ratings for User “u”: R(u) = {r_uj known}
• Let weights W_ij be free parameters
• Scoring:
r̃_ui = Σ_{j∈R(u)} r_uj * w_ij / |R(u)|
• Modeling:
Pick W_ij to minimize regularized Sum of Errors:
SSE = Σ_{ui={past_ratings}} (r_ui − Σ_{j∈R(u)} r_uj * w_ij / |R(u)|)² + λ Σ_ij w_ij²
λ = regularization parameter
• Better predictive power than previous model
• Not easy to explain recommendation
• Expensive Modeling, Scoring and Storage of W_ij: Size = 100M
• Could try to limit based on S_ij but there is a better approach
56. Reduce Number of Free Parameters: Matrix Factorization
• Want to reduce the number of free parameters in W_ij:
• Current size: 10K * 10K = 100M
• Matrix factorization:
• Goal: reduce number of free parameters to ~1M
• Toy example #1:
• Weight matrix is uniform:
Weight = [matrix of all 1’s] = (column of 1’s) × (row of 1’s)
• Replace my matrix(10K,10K) with the outer product of two vectors:
• each 10K long
57. Reduce Number of Free Parameters: Matrix Factorization
• Want to reduce the number of free parameters in W_ij:
• Current size: 10K * 10K = 100M
• Matrix factorization:
• Goal: reduce number of free parameters to ~1M
• Toy example #1:
• Weight matrix is uniform:
Weight = [matrix of all 1’s] = (U)(V) = (column of 1’s) × (row of 1’s)
• Replace my matrix(10K,10K) with the outer product of two vectors:
• each 10K long: U(10K), V(10K)
• U & V are called factors
58. Reduce Number of Free Parameters: Matrix Factorization
• Toy example #2:
• Weight matrix is almost uniform (entries 1.0 and 0.8):
Weight = 0.9 · (1 1 1 1)ᵀ(1 1 1 1) + 0.1 · (+1 −1 +1 −1)ᵀ(+1 −1 +1 −1)
       = d_1 · (U_1)(V_1) + d_2 · (U_2)(V_2)
Weights: W_ij = Σ_{k={1,2}} d_k · U_ik · V_jk
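The decomposition in toy example #2 can be checked numerically; a sketch with plain lists (helper names are mine):

```python
def outer(u, v):
    """Rank-one matrix: the outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]

# Toy example #2: W = 0.9*(1 1 1 1)^T(1 1 1 1) + 0.1*(+1 -1 +1 -1)^T(+1 -1 +1 -1)
u1 = v1 = [1, 1, 1, 1]
u2 = v2 = [1, -1, 1, -1]
weight = [[0.9 * a + 0.1 * b for a, b in zip(ra, rb)]
          for ra, rb in zip(outer(u1, v1), outer(u2, v2))]
# W_ij = sum_k d_k * U_ik * V_jk with d1 = 0.9, d2 = 0.1:
# entries are 1.0 when i and j carry matching signs in u2, else 0.8.
```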
60. Note on convention for Matrix Factorization:
• Last equation is the definition of SVD (Singular Value Decomposition):
w_ij = Σ_k U_ik d_k V_jk
• Where factors U’s, V’s are chosen to be normalized: Σ_i U_ik U_ik' = 1 if k = k'
● and independent from each other: Σ_i U_ik U_ik' = 0 if k ≠ k'
• In this talk and in Koren & Bell’s chapter:
• The d_k’s are incorporated inside the U_k, V_k:
• Just a convention difference
w_ij = Σ_k U_ik V_jk
• Where the factors are now normalized as:
Σ_i U_ik U_ik' = d_k if k = k'
61. Collaborative Filtering: Factorized Neighborhood Model
• Recall Global Neighborhood Model:
r̃_ui = Σ_{j∈R(u)} r_uj * w_ij / |R(u)|
• Where W_ij are free parameters
• Apply Matrix Factorization to W_ij:
w_ij => Σ_{k={factors}} U_ik V_jk
Choose: N_k (number of factors) << N (number of movies): ~200 << 10K-100K
• Scoring: Factorized Neighborhood Model:
r̃_ui = Σ_{k={factors}} U_ik · (Σ_{j∈R(u)} r_uj V_jk) / |R(u)|
Free parameters: U_ik and V_jk
• Cheaper computation with same predictive power
• Modeling:
SSE = Σ_{ui={past_ratings}} (r_ui − Σ_{k={factors}} U_ik · Σ_{j∈R(u)} r_uj V_jk / |R(u)|)² + λ Σ_ik U_ik² + λ Σ_jk V_jk²
62. Collaborative Filtering: SVD Model
SVD: historical name for Matrix Factorization applied to the Rating matrix
• Matrix Factorization applied to r_ui:
r_ui => Σ_{k={factors}} U_uk V_ik
Choose: N_k (number of factors) << N (number of movies): ~200 << 10K-100K
• Scoring:
r̃_ui = Σ_{k={factors}} U_uk * V_ik
Free parameters: U_uk and V_ik
• Same predictive power
• Not easy to explain recommendation
• Modeling:
SSE = Σ_{ui={past_ratings}} (r_ui − Σ_{k={factors}} U_uk * V_ik)² + λ Σ_uk U_uk² + λ Σ_ik V_ik²
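Minimizing that regularized SSE is commonly done with stochastic gradient descent; a small sketch (the SGD optimizer, hyperparameters, and toy data are my own additions, the slide only states the objective):

```python
import random

def fit_svd(ratings, n_users, n_items, k=2, lam=0.01, lr=0.05, epochs=500, seed=0):
    """Learn U (n_users x k) and V (n_items x k) minimizing
    sum (r_ui - U_u . V_i)^2 + lam * (sum U^2 + sum V^2)."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # Gradient step on the squared error plus the L2 penalty.
                U[u][f] += lr * (err * vf - lam * uf)
                V[i][f] += lr * (err * uf - lam * vf)
    return U, V

def score(U, V, u, i):
    return sum(a * b for a, b in zip(U[u], V[i]))
```

With a rank-2 toy rating matrix the learned factors reproduce the known ratings up to the small shrinkage introduced by λ.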