Introduction to Core Science Models
Yahoo! Labs
2011/11/19
Agenda

 Basic Counting Models: EMP
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
 Bonus: Tutorial on Collaborative Filtering

 Note:
› Will focus on the science framework
› Will not focus on the optimization problem
EMP + OLR:

 Basic Counting Models: EMP
› Simple CTR model based on counting clicks/views
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
 Bonus: Tutorial on Collaborative Filtering
Today Module on Yahoo FP:
Counting Models: CTR
 Estimate CTR for each article independently
 CTR = Click-Thru-Rate = Total Clicks / Total Views
 Online Model: update every 5 mins:

   CTR = ( C_t + C_{t-1} + ... + C_1 ) / ( V_t + V_{t-1} + ... + V_1 )
       = ( C_t + ∑_{s<t} C_s ) / ( V_t + ∑_{s<t} V_s )

   C_t = clicks during period "t"
   V_t = views during period "t"
CTR Curves for Two Days
Each curve is the CTR of an item in the Today Module over time

Traffic obtained from a controlled randomized experiment
Things to note:
(a) Short lifetimes, (b) temporal effects, (c) often breaking news stories
Counting Models: Most Popular
 EMP: Estimated Most Popular ( aka GMP ):
› Decay = forget about old clicks and views ( γ ~ 0.95-0.99 )

   CTR_EMP = ( C_t + γ C_{t-1} + γ² C_{t-2} + ... ) / ( V_t + γ V_{t-1} + γ² V_{t-2} + ... )

 Segmented Most Popular:
› Separate model for each segment of the population

   CTR_EMP-Male = ( C_t^Male + γ C_{t-1}^Male + γ² C_{t-2}^Male + ... ) / ( V_t^Male + γ V_{t-1}^Male + γ² V_{t-2}^Male + ... )
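To make the decayed counters concrete, here is a minimal Python sketch (class and variable names are invented, and the γ value is illustrative, not the production setting); setting γ = 1 recovers the plain counting CTR from the previous slide.

```python
# Minimal sketch of the EMP decayed-count CTR estimator.
class DecayedCTR:
    def __init__(self, gamma=0.98):
        self.gamma = gamma   # decay: how fast old clicks/views are forgotten
        self.clicks = 0.0    # C_t + gamma*C_{t-1} + gamma^2*C_{t-2} + ...
        self.views = 0.0     # V_t + gamma*V_{t-1} + gamma^2*V_{t-2} + ...

    def update(self, clicks_t, views_t):
        # Fold in one 5-minute period of raw counts.
        self.clicks = clicks_t + self.gamma * self.clicks
        self.views = views_t + self.gamma * self.views

    def ctr(self):
        return self.clicks / self.views if self.views > 0 else 0.0

emp = DecayedCTR(gamma=0.98)
emp.update(clicks_t=12, views_t=1000)   # one 5-minute bucket
emp.update(clicks_t=8, views_t=900)     # next bucket
print(emp.ctr())
```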
Tracking behavior of the Estimated Most Popular model
 Low click-rate articles – more temporal smoothing
OLR: Online Logistic Regression

 Basic Counting Models: EMP
 Feature Based Models: OLR
› Motivation for using regression
› Logistic Regression framework
› Online Logistic Regression: general case
› Per item-OLR Use Case: Today Module
› Improving Model
 RLFM: Feature Model + Collaborative Filtering
 Affinity Models: Log Odds
 Bonus: Tutorial on Collaborative Filtering
Motivation for using Regression:
• EMP breaks down if the segment is too small:
• e.g. a 40-year-old male in New York

   CTR_Male_40_NY = ( C_t^Male_40_NY + γ C_{t-1}^Male_40_NY + ... ) / ( V_t^Male_40_NY + γ V_{t-1}^Male_40_NY + ... )

 Logistic Regression:
› Natural framework to include more features:
› Age, Gender, Location, User Interests, …

   log( P_click / (1 − P_click) ) = b + ∑_{k={features}} W_k ∗ X_{k,u}

› X_{k,u} = value of feature k for user u: e.g. the age of a user
› W_k = weight parameter to be learned for each feature
Linear Regression: One Dimension
Linear Fit: Y = a * X + b

[Scatter plot: Y = Weight vs X = Height, data points with a fitted line; ERROR marks the vertical residual of one point]

   SSE = ∑_{i={examples}} ( Y_i − a ∗ X_i − b )²

• Find the values of "a" and "b" that minimize the Sum of Squared Errors (SSE)
• Take the derivative of SSE with respect to "a" and "b" and set it equal to 0
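Setting the two derivatives to zero gives the standard closed-form least-squares solution; a small numpy sketch of that calculation, with made-up height/weight data:

```python
import numpy as np

# Made-up height/weight data for illustration.
X = np.array([70, 80, 90, 100, 110, 120], dtype=float)   # X = Height
Y = np.array([65, 84, 96, 118, 140, 155], dtype=float)   # Y = Weight

# dSSE/da = 0 and dSSE/db = 0 yield the usual closed-form solution:
a = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - a * X.mean()

sse = np.sum((Y - a * X - b) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.2f}")
```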
Can’t Apply Linear Model to Click Prediction
 For example: probability of click for an article on retirement, as a function of age

[Plot: "Linear Doesn't Represent the Data Well" – Probability of Click (0.0-1.0) vs Age (0-100), data points vs linear model]
Logistic Model for Click Prediction
 Probability of click for an article on retirement, as a function of age

[Plot: "Logistic Model is much better" – Probability of Click (0.0-1.0) vs Age (0-100), data points vs logistic model]

   P(Click) = 1 / ( 1 + exp( − a ∗ Age − b ) )
Logistic Regression: One Dimension

   P(Y_i) = 1 / ( 1 + exp( − Y_i ⋅ ( a ⋅ Age_i − b ) ) )

   Y_i = +1 => P(Y_i = +1) = prob user clicked on the article
   Y_i = −1 => P(Y_i = −1) = prob user didn't click

• How to find parameters "a" and "b" for many training examples ( Y_i, Age_i )?
• Maximize the product of probabilities (Likelihood):

   Likelihood = P(Y_1) ⋅ P(Y_2) ⋅ P(Y_3) ⋅⋅⋅ P(Y_n)

• "Hard" to solve
Optimize Logistic Likelihood for 4 Data Points:
   Likelihood(1..4) = P(Y_1) ⋅ P(Y_2) ⋅ P(Y_3) ⋅ P(Y_4)

[Plot: the likelihood and the individual Prob(Y_1), Prob(Y_2), ... as functions of parameter "a"]

   P(Y_i) = 1 / ( 1 + exp( − Y_i ⋅ ( a ⋅ Age_i − b ) ) )

X-axis: parameter "a"
For simplicity: I assume that I know the value of "b"
Optimize Logistic Likelihood for 40 Data Points:

   Likelihood(1..40) = P(Y_1) ⋅ P(Y_2) ⋅⋅⋅ P(Y_40)

[Plot: rescaled likelihood for 40 data points vs 4 data points, as a function of parameter "a"; more data gives a narrower peak]

X-axis: parameter "a"
For simplicity: I assume that I know the value of "b"
Gaussian Approximation to Likelihood:

   Exp( −(a − m_40)² / 2σ_40² ) ≈ Likelihood_40(a)

• Replace the Likelihood with a simple Gaussian with two hyperparameters:
  * Mean: m_40 (what is the average value for "a")
  * Standard deviation: σ_40 (what is the error around the mean)

[Plot: Gaussian_Max overlaid on Likelihood_40 as functions of parameter "a", with m_40 and σ_40 marked]
Gaussian approx allows updating one data point at a time:

   Exp( −(a − m_40)² / 2σ_40² ) ≈ Likelihood_40(a)
   Exp( −(a − m_40)² / 2σ_40² ) ≈ P(Y_40) ⋅ { P(Y_39) ⋅ P(Y_38) ⋅⋅⋅ P(Y_1) }
   Exp( −(a − m_40)² / 2σ_40² ) ≈ P(Y_40) ⋅ Exp( −(a − m_39)² / 2σ_39² )

   Posterior ≈ Likelihood ∗ Prior

• Note: for simplicity I ignored all normalizations
OLR: Online Logistic Regression: one parameter
• Solve the Bayesian update for each new event ( Y, Age ):

   Posterior ≈ Likelihood ∗ Prior

   Exp( −(a − m_t)² / 2σ_t² ) ≈ P(Y) ⋅ Exp( −(a − m_{t-1})² / 2σ_{t-1}² )

   P(Y) = 1 / ( 1 + exp( − Y ⋅ ( a ⋅ Age − b ) ) )

• Yrank update formulas:

   m_t = m_{t-1} + …
   1/σ_t² = 1/σ_{t-1}² + …

• Yrank approximate solution: Scott Roy talk:
http://twiki.corp.yahoo.com/pub/Personalization/YRank/YRankLearning.ppt
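The actual Yrank update terms are elided above ("+ …"), so as an assumed stand-in only, here is a generic Laplace-style update (one Newton step of the Gaussian posterior, taken at the prior mean) for the one-parameter case. It illustrates the Posterior ≈ Likelihood ∗ Prior mechanics, not the Yrank formulas.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def online_update(m, var, y, x, b):
    """One Bayesian update of a single weight "a" with prior N(m, var).

    NOT the Yrank formula (which the slides elide): this is a generic
    Laplace-style step shown purely for illustration. y is +1 or -1.
    """
    p = sigmoid(y * (m * x - b))                       # likelihood at a = m
    precision = 1.0 / var + (x ** 2) * p * (1.0 - p)   # add likelihood curvature
    new_var = 1.0 / precision
    new_m = m + new_var * y * x * (1.0 - p)            # move mean along the gradient
    return new_m, new_var

# Feature scaled to ~[0, 1] (age / 100) to keep the toy numbers well behaved.
m, var = 0.0, 1.0
for y, age in [(+1, 0.62), (-1, 0.25), (+1, 0.70)]:
    m, var = online_update(m, var, y, age, b=0.5)
print(m, var)   # posterior mean shifts toward clicks, variance shrinks
```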
OLR: Online Logistic Regression: General Case
• Replace the one parameter "a" by a set of parameters { w_f }
• Replace the one feature "Age" by a set of features { X_f }
• Solve the Bayesian update for each new event ( Y, {X_f} ):

   Posterior ≈ Likelihood ∗ Prior

   Exp( −∑_f (w_f − m_{f,t})² / 2σ_{f,t}² ) ≈ P(Y) ⋅ Exp( −∑_f (w_f − m_{f,t-1})² / 2σ_{f,t-1}² )

   P(Y) = 1 / ( 1 + exp( − Y ⋅ ∑_f w_f ⋅ X_f ) )

• Yrank update formulas:

   m_{f,t} = m_{f,t-1} + …
   1/σ_{f,t}² = 1/σ_{f,t-1}² + …
OLR: General Case: Features
• Multi-dimensional logistic regression model:

   P(Y) = 1 / ( 1 + exp( − Y ⋅ ∑_{f={features}} w_f ⋅ X_f ) )

   ∑_f w_f ∗ X_f = w_1 ∗ 1                                                       <= Baseline
                 + w_2 X_{u=Male} + w_3 X_{u=Age40s} + w_4 X_{u=SanJose} + w_5 X_{u=likeSports}   <= User Features
                 + w_6 ∗ X_{i=about_Sports} + w_7 ∗ X_{i=about_NBA}              <= Article Features
                 + w_8 ∗ X_{(u=likeSports & i=about_Sports)}                     <= User*Article Features

• More on Features:
http://twiki.corp.yahoo.com/view/SRelevance/NewsRecommendationFeatures
http://twiki.corp.yahoo.com/view/SRelevance/COREUserProfilesSparsePolarity
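A hedged sketch of assembling and scoring the sparse feature vector above; the dict layout, the `score` helper, and the weights are my own illustration, not the CORE feature pipeline.

```python
import math

def score(weights, features):
    """P(Y=1) = sigmoid(sum_f w_f * x_f) over a sparse feature dict."""
    z = sum(weights.get(f, 0.0) * x for f, x in features.items())
    return 1.0 / (1.0 + math.exp(-z))

features = {
    "baseline": 1.0,                      # w1 * 1
    "u=Male": 1.0,                        # user features
    "u=Age40s": 1.0,
    "u=SanJose": 1.0,
    "u=likeSports": 1.0,
    "i=about_Sports": 1.0,                # article features
    "i=about_NBA": 1.0,
    "u=likeSports&i=about_Sports": 1.0,   # user*article interaction
}
weights = {"baseline": -3.0, "u=Male": 0.2, "u=likeSports": 0.3,
           "i=about_Sports": 0.5, "u=likeSports&i=about_Sports": 0.8}
print(score(weights, features))           # predicted click probability
```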
OLR: Online Logistic Regression

 Basic Counting Models: EMP
 Feature Based Models: OLR
› Motivation for using regression
› Logistic Regression framework
› Online Logistic Regression: General Case
› Per item-OLR Use Case: Today Module
› Improving Model
 RLFM: Feature Model + Collaborative Filtering
 Affinity Models: Log Odds
 Bonus: Tutorial on Collaborative Filtering
Per item-OLR use Case: Yahoo FP Today Module
• Front Page Module:
• Articles don't live very long ( < a day )
• Many clicks/views for each article
• Each article is treated independently:
• A new OLR model for each new article
• Trying to predict the CTR for each user & article pair: u,i

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} w_{i,f} ⋅ X_{u,f} ) )

   ∑_f w_f ∗ X_{u,f} = w_1                                                       <= Baseline
                     + w_2 ∗ X_{u=Male} + w_3 ∗ X_{u=Age20s} + w_4 ∗ X_{NewYork}
                     + w_5 ∗ X_{u=likeSports} + w_6 ∗ X_{u=likeNFL} + w_7 ∗ X_{u=likeMusic}   <= User Features
Per item-OLR use Case: Yahoo FP Today Module

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} w_{i,f} ⋅ X_{u,f} ) )

[Plot: P(Y_i = 1) as a function of ∑ w ⋅ X, a sigmoid from 0.0 to 1.0; w_1 sets the baseline shift]
Per item-OLR use Case: Yahoo FP Today Module
• Each article has its own OLR model and its own set of weights { w_{i,f} }:

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} w_{i,f} ⋅ X_{u,f} ) )

• Each article has its own:

   Prior ~ Exp( −∑_{f={features}} (w_{i,f} − m_{i,f})² / 2σ_{i,f}² )

• For each event ( Y_ui, {X_uf} ), update the hyperparameters for that article with the Yrank update formulas:

   m_{i,f,t} = m_{i,f,t-1} + …
   1/σ_{i,f,t}² = 1/σ_{i,f,t-1}² + …
Per item-OLR use Case: Yahoo FP Today Module
• How to use the OLR model:
• Choose a candidate pool:
• Roughly 50-100 articles picked by editors
• Explore:
• In a small bucket: try all 50-100 articles randomly
• Modeling: for each event (click/view), apply Yrank for that article
• Exploit:
• For the remainder (larger bucket)
• Scoring: predict each article's CTR, and order by decreasing CTR (see the sketch below):

   CTR = P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} m_{i,f} ⋅ X_{u,f} ) )
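A minimal sketch of this explore/exploit serving loop; the ε split, the names, and the weights are illustrative assumptions, and in production the explore events would feed the Yrank updates.

```python
import math
import random

def predicted_ctr(means, user_features):
    """CTR using the posterior means m_{i,f} of an article's OLR model."""
    z = sum(means.get(f, 0.0) * x for f, x in user_features.items())
    return 1.0 / (1.0 + math.exp(-z))

def serve(articles, user_features, epsilon=0.05):
    if random.random() < epsilon:
        # Explore bucket: show a random candidate; its click/view event
        # later drives the Yrank update for that article's model.
        return random.choice(list(articles))
    # Exploit bucket: pick the candidate with the highest predicted CTR.
    return max(articles, key=lambda i: predicted_ctr(articles[i], user_features))

articles = {"a1": {"u=likeSports": 0.4}, "a2": {"u=likeMusic": 0.9}}
print(serve(articles, {"u=likeSports": 1.0}))
```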
Improving Online Learning:
• Correlated OLR: include interactions between hyperparameters: → improvement

   Prior ~ Exp( −(1/2) ∑_{f1,f2} (w_f1 − m_f1) A⁻¹_{f1,f2} (w_f2 − m_f2) )

• Mini-Batch: update multiple data points at once: → no gain in CTR

   Likelihood(mini batch) = P(Y_1) ⋅⋅⋅ P(Y_n)

• TechPulse 2011: Taesup Moon, Pradheep Elango, Su-Lin Wu
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/techpulse.pdf
Improving Explore/Exploit: UCB
• UCB: improve the Explore/Exploit strategy: → improvement
• Old strategy: ε-greedy
• Explore: update OLR only from events in a small random bucket
• Exploit: order articles in decreasing value of predicted CTR
• New strategy: UCB (aka Upper Confidence Bound)
• Single bucket
• Explore: update OLR with all events
• Exploit: order articles in decreasing value of the "optimistic" CTR_UCB
• TechPulse 2011: Taesup Moon, Pradheep Elango, Su-Lin Wu
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/techpulse.pdf
Improving Explore/Exploit: UCB
• Upper Confidence Bound strategy: → improvement
• Exploit: order articles in decreasing value of the "optimistic" CTR_UCB
• ONE DIMENSION EXAMPLE:

• Replace the normal CTR:

   CTR = 1 / ( 1 + exp( − m ⋅ X ) )

• With the optimistic CTR:

   CTR_UCB = 1 / ( 1 + exp( −( m ⋅ X + z ⋅ σ ⋅ X² ) ) )

   z = tunable parameter

[Plot: CTR and CTR_UCB as functions of X; CTR_UCB sits above CTR]
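A one-dimension sketch of the optimistic score, taking the slide's z ⋅ σ ⋅ X² term literally; the numbers are illustrative.

```python
import math

def ctr(m, x):
    """Normal CTR from the posterior mean m."""
    return 1.0 / (1.0 + math.exp(-m * x))

def ctr_ucb(m, sigma, x, z=1.0):
    """Optimistic CTR: inflate the logit by z * sigma * x^2 before squashing.
    Articles whose weight is still uncertain (large sigma) get a boost,
    so they keep being shown until the uncertainty shrinks."""
    return 1.0 / (1.0 + math.exp(-(m * x + z * sigma * x ** 2)))

print(ctr(0.5, 1.0), ctr_ucb(0.5, 0.3, 1.0))   # optimistic score is higher
```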
RLFM: Regression based Latent Factor Model

 Basic Counting Models: EMP
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
› RLFM components
› Using RLFM: Offline & Online update
 Bonus: Tutorial on Collaborative Filtering
RLFM: Regression based Latent Factor Model
• RLFM: basic idea
  * Build a single logistic regression model for all users "u" and articles "i"
  * Add Collaborative Filtering using Matrix Factorization
• Modeling:
• Most of it is done offline in big batch mode ( millions of events )
• One part of the model is also updated online ( one event at a time, using the Yrank update )
⇒ Latent Factor Models are a work in progress:
• Original Y! Labs Paper: Deepak Agarwal, Bee-Chung Chen
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/featfact.pdf
• Implementation for Coke:
http://twiki.corp.yahoo.com/view/YResearch/RLFMForCoke
RLFM: Regression based Latent Factor Model
RLFM components:
1) Build a logistic regression model for all users "u" and articles "i"
2) Add user bias and article bias
3) Collaborative Filtering using Matrix Factorization
4) Predict factors for new user/article: Cold Start
5) Add Logistic Regression + Bias + Matrix Factorization
1) Build logistic regression for all user/articles:
• Build a single logistic regression model for all users {u}, articles {i}:

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={all_features}} w_f ∗ X_{ui,f} ) )

   ∑_f w_f ∗ X_{ui,f} = w_1                                                      <= Baseline
                      + w_2 X_{u=Male} + w_3 X_{u=Age40s} + w_4 X_{u=SanJose} + w_5 X_{u=likeSports}   <= User Features
                      + w_6 ∗ X_{i=about_Sports} + w_7 ∗ X_{i=about_NBA}         <= Article Features
                      + w_8 ∗ X_{(u=likeSports & i=about_Sports)}                <= User*Article Features

• A single set of parameters {w_f} for all users, articles
• Learned offline in batch mode
2) Add per user and per article baseline:
• Add bias parameters:
● Some articles are more/less popular than others
● Some users read more/fewer stories than others

   P(Y_ui = 1) = 1 / ( 1 + exp( −α_u − β_i − ∑_{f={all_features}} w_f ∗ X_{ui,f} ) )

• The baseline is not the same for every user/article:
• Old baseline: w_1
• New baseline: w_1 + α_u + β_i
• More parameters to optimize: {w_f}, {α_u}, {β_i}
• Better with some priors – to be described later
3) Matrix Factorization Motivation
• How to deal with:
• An article about disaster preparedness:
• Hurricanes: need users from the coastline: Texas => Northeast
• Earthquakes: need users from the West coast
• Would need: X_user_WestCoast ∗ X_about_earthquakes
• I don't have that …
• But if I have many views/clicks over many such articles, I can discover that pattern !!!
3) Matrix Factorization Motivation

• I can discover patterns within clicks:
• SIMPLE EXAMPLE:

   Clicks = a Users × Articles matrix of 0/1 entries
   Rows: users from SanJose, Oakland, NewYork, DC
   Columns: articles about Earthquakes and Politics

[Click matrix: roughly, the West-coast users click the Earthquake articles and the East-coast users click the Politics articles, up to a little noise]

• Clicks mostly explained by:

   Clicks ≈ U_1 ∗ V_1 + U_2 ∗ V_2

   where each U_k is a 0/1 column vector over users and each V_k is a 0/1 row vector over articles
3) Matrix Factorization Motivation
• Most clicks explained by: U_1 ∗ V_1 + U_2 ∗ V_2

   P(Click_ui) = P(Y_ui = 1) = 1 / ( 1 + exp( − U_{u,1} ∗ V_{i,1} − U_{u,2} ∗ V_{i,2} ) )

• The general case:

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{k={factors}} U_uk ∗ V_ik ) )

• Note:
• Number of factors ~ 50-200 << N_users & N_articles
3) Matrix Factorization Model
• Matrix Factorization Model: aka Collaborative Filtering

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{k={factors}} U_uk ∗ V_ik ) )

• Obtain the U's and V's by maximizing the following likelihood:

   Likelihood = Π_{ui={examples}} 1 / ( 1 + exp( − Y_ui ⋅ ∑_{k={factors}} U_uk ∗ V_ik ) )

• Π_ui => product over all past events (clicks/views)
• Y_ui => clicks = +1 and views = −1
• Better with some priors ...
3) Matrix Factorization Model
• Better with some priors:

   prior for each U_uk ~ Exp( −(U_uk − m_a)² / 2σ_a² )
   prior for each V_ik ~ Exp( −(V_ik − m_b)² / 2σ_b² )

   Choose: m_a = 0 and m_b = 0,
   σ_a is the same for all U_uk,
   σ_b is the same for all V_ik

• Get the U's and V's: maximize Likelihood ∗ Prior (see the sketch below)

• Note: the above priors are uncorrelated
• The original RLFM paper used correlated priors
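A sketch of fitting this model by stochastic gradient ascent on log(Likelihood ∗ Prior); with m_a = m_b = 0 the Gaussian priors act as L2 penalties on U and V. The sizes, learning rate, and synthetic events are illustrative assumptions (the slides do not specify the optimizer).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 100, 50, 5
U = 0.01 * rng.standard_normal((n_users, n_factors))
V = 0.01 * rng.standard_normal((n_items, n_factors))
lam = 0.1    # prior strength ~ 1 / (2 sigma^2), same for all U's and V's
lr = 0.05

# events: (user u, item i, y = +1 click / -1 view), synthetic here
events = [(rng.integers(n_users), rng.integers(n_items), rng.choice([-1, 1]))
          for _ in range(5000)]

for epoch in range(10):
    for u, i, y in events:
        z = y * (U[u] @ V[i])
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-z)))   # d log-likelihood / d(U.V)
        uu = U[u].copy()
        U[u] += lr * (g * V[i] - lam * U[u])       # gradient + prior pull to 0
        V[i] += lr * (g * uu - lam * V[i])

print(1.0 / (1.0 + np.exp(-(U[0] @ V[0]))))        # P(Y_01 = 1)
```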
4) Matrix Factorization Model – Cold Start Problem
• Matrix Factorization Model:

   P(Y_ui) = 1 / ( 1 + exp( −∑_{k={factors}} U_uk ∗ V_ik ) )

• Cold start problem:
   => for a new user U = 0, or for a new article V = 0
• Solution: choose a different prior:

   for each U_uk: Exp( −( U_uk − ∑_{a={user_features}} G_{k,a} X_{u,a} )² / 2σ_a² )
   for each V_ik: Exp( −( V_ik − ∑_{b={item_features}} D_{k,b} X_{i,b} )² / 2σ_b² )

• Parameters G's & D's obtained from maximizing: Likelihood ∗ Prior
5) RLFM: Regression based Latent Factor Model
• Putting it back together: Bias + Regression + Matrix Factorization:

   P(Y_ui = 1) = 1 / ( 1 + exp( −α_u − β_i − ∑_{f={features}} w_f ∗ X_{ui,f} − ∑_{k={factors}} U_uk ∗ V_ik ) )

• Priors:

   for each U_uk: Exp( −( U_uk − ∑_{a={user_features}} G_{k,a} X_{u,a} )² / 2σ_a² )
   for each V_ik: Exp( −( V_ik − ∑_{b={item_features}} D_{k,b} X_{i,b} )² / 2σ_b² )
   for each α_u: Exp( −( α_u − ∑_{a={user_features}} g_a X_{u,a} )² / 2σ_α² )
   for each β_i: Exp( −( β_i − ∑_{b={item_features}} d_b X_{i,b} )² / 2σ_β² )
RLFM: Regression based Latent Factor Model

 Basic Counting Models: EMP
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
› RLFM components
› Using RLFM: Offline & Online update
 Bonus: Tutorial on Collaborative Filtering
Using RLFM: Offline Modeling:
• Offline Modeling:
• Batch mode: maximize Likelihood ∗ Prior
• Millions to billions of examples processed at once
• Input: all events and features {Y's, X's}
• Output:

   factors: {α_u}, {β_i}, {U_uk}, {V_ik}
   parameters: {w_f}, {g_a}, {d_b}, {G_k,a}, {D_k,b}
Using RLFM: Online Modeling and Scoring:
• Online Scoring + some Modeling
• For a new user or a new article: compute the factors from g, d, G, D
• e.g. new user bias:

   α_u = ∑_{a={user_features}} g_a X_{u,a}

• For an old user or old article: get the factors from the offline batch mode
• For each event (click/view) on article "i":
• Update V_ik using the per-item OLR approach
• Predict the score using the updated V_ik (see the sketch below):

   P(Y_ui = 1) = 1 / ( 1 + exp( −α_u − β_i − ∑_{f={features}} w_f ∗ X_{ui,f} − ∑_{k={factors}} U_uk ∗ V_ik ) )
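A small sketch of the online scoring path above: a cold-start user gets α_u and U_u from the feature-based priors (g, G), then the full RLFM formula is evaluated. All shapes and values are toy assumptions.

```python
import numpy as np

def rlfm_score(alpha_u, beta_i, w, x_ui, U_u, V_i):
    """P(Y_ui = 1) from bias + regression + matrix factorization terms."""
    z = alpha_u + beta_i + w @ x_ui + U_u @ V_i
    return 1.0 / (1.0 + np.exp(-z))

def cold_start_user(g, G, x_u):
    """New user: bias and factors predicted from user features."""
    alpha_u = g @ x_u          # alpha_u = sum_a g_a * X_{u,a}
    U_u = G @ x_u              # U_{u,k} = sum_a G_{k,a} * X_{u,a}
    return alpha_u, U_u

x_u = np.array([1.0, 0.0, 1.0])                       # toy user features
g, G = np.array([0.1, 0.2, -0.3]), 0.05 * np.ones((4, 3))
alpha_u, U_u = cold_start_user(g, G, x_u)
print(rlfm_score(alpha_u, beta_i=-0.5, w=np.zeros(3), x_ui=x_u,
                 U_u=U_u, V_i=0.1 * np.ones(4)))
```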
RLFM: Offline Results on Coke Data: Today Module
• RLFM results on an offline experiment
• Y! Front Page – Today Module
• CTR relative lift for RLFM vs Feature-Only, as a function of clicks/user

http://twiki.corp.yahoo.com/view/YResearch/RLFMReplayExperiments
Q&A
Contributors:
Pradheep Elango, Su-Lin Wu, Taesup Moon, Pranam Kolari
Deepak Agarwal, Bee-Chung Chen, Scott Roy
Jean-Marc Langlois

• Coke Science Papers:
http://twiki.corp.yahoo.com/view/YResearch/CokeLabDiary
Tutorial on Collaborative Filtering
Based on the following chapter:
http://research.yahoo.com/files/korenBellChapterSpringer.pdf
By two of the Netflix Prize winners
Collaborative Filtering: Introduction
 Goal: predict the rating r_ui for a movie "i" that a user "u" hasn't seen yet
› Prediction based on a matrix of User/Movie ratings:
● r_ui = 1 through 5 stars
› Prediction equations for integer ratings are simpler than for binary clicks
› The rating matrix is a large, very sparse matrix:
● 10M-100M users and 10k-100k movies, but with ~99% blank entries
 Based on: http://research.yahoo.com/files/korenBellChapterSpringer.pdf
› This talk: focus on the most relevant models & ignore some improvements:
● Baseline adjustment: user bias, movie bias and overall average rating
● Time-aware model, binary features ( rated, rented )
 This talk:
› Adjusted ratings:

   r_ui^adjusted <= r_ui^raw − Baseline(ui)
Collaborative Filtering: the models
 Correlated Neighborhood Model
› Predict a new rating based on the ratings of similar movies
 Global Neighborhood Model
› Enlarge the neighborhood to be "global"
› Introduce adjustable weight parameters
 Factorized Neighborhood Model
› Apply matrix factorization to the weight parameters
 SVD Model
› Apply matrix factorization to the rating matrix itself
Collaborative Filtering: Correlated Neighborhood Model
• Predict an unknown r_ui based on the known ratings r_uj of similar movies
• You will like movie "i" because you liked movies "j"

[Diagram: movie "i" surrounded by its rated neighbors j1..j6, linked by similarities S_ij]

• Define a movie-movie similarity measure S_ij based on correlation:

   S_ij ∝ ∑_{u=Union(i,j)} r_ui ∗ r_uj / Normalization

• Define the Correlated Neighborhood:
• the set of ~20 movies with the largest S_ij that are rated by "u"
• Define Weight: normalized S_ij
Collaborative Filtering: Correlated Neighborhood Model

   Ratings r_ui = a Users × Movies matrix of ±1 entries

• Movies:
• i=1: Star Trek
• i=2: Star Wars
• i=3: Action movie
• i=4: Horror movie

• Movie-Movie Similarity S_ij:

   S_ij ∝ ∑_{u=Union(i,j)} r_ui ∗ r_uj / Normalization

   S_ij ≈ [ 1    1    0.5  0   ]
          [ 1    1    0.5  0   ]   (Movies × Movies)
          [ 0.5  0.5  1    0.5 ]
          [ 0    0    0.5  1   ]
Collaborative Filtering: Correlated Neighborhood Model
• Predict an unknown r_ui based on the known ratings r_uj of similar movies
• You will like movie "i" because you liked movies "j"
• Similarity measure:

   S_ij ∝ ∑_{u=Union(i,j)} r_ui ∗ r_uj

[Diagram: movie "i" and its correlated neighbors j1..j6]

• Correlated Neighborhood:
• the set of ~20 movies with the largest S_ij that are rated by "u"
• Weight: normalized S_ij
• Scoring:

   r̃_ui = ∑_{j={correlated_neighbors}} r_uj ∗ S_ij / ∑_j S_ij

• (+) Simple, intuitive model with the ability to explain why we recommend a movie
• Modeling:
• (−) Need to precompute and store S_ij: 10k ∗ 10k = 100M
• Weights are fixed to the normalized value of S_ij
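A toy sketch of the whole pipeline on a small dense matrix (cosine-style normalization; the baseline adjustment from the introduction is skipped). Real rating matrices are huge and sparse, so this is for illustration only.

```python
import numpy as np

def similarity(R):
    """S_ij proportional to sum_u r_ui * r_uj, cosine-normalized."""
    S = R.T @ R
    norms = np.sqrt(np.diag(S))
    return S / np.outer(norms, norms)

def predict(R, S, u, i, k=2):
    """Weighted average of user u's ratings over the top-k neighbors of movie i."""
    rated = np.flatnonzero(R[u])                   # movies user u has rated
    nbrs = rated[np.argsort(-S[i, rated])[:k]]     # top-k most similar
    return R[u, nbrs] @ S[i, nbrs] / S[i, nbrs].sum()

R = np.array([[5, 4, 1, 0],     # star ratings, 0 = unknown
              [4, 5, 0, 2],
              [1, 0, 5, 4]], dtype=float)
S = similarity(R)
print(predict(R, S, u=0, i=3))  # predict user 0's rating of movie 4
```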
Collaborative Filtering: Global Neighborhood Model
• Extend the neighborhood to all known ratings for user "u": R(u) = { r_uj known }
• Let the weights W_ij be free parameters
• Scoring:

   r̃_ui = ∑_{j∈R(u)} r_uj ∗ w_ij / |R(u)|

• Modeling: pick the W_ij that minimize the regularized sum of squared errors:

   SSE = ∑_{ui={past_ratings}} ( r_ui − ∑_{j∈R(u)} r_uj ∗ w_ij / |R(u)| )² + λ ∑_{ij} w_ij²

   λ = regularization parameter

• (+) Better predictive power than the previous model
• (−) Not easy to explain recommendations
• (−) Expensive modeling, scoring and storage of W_ij: size = 100M
• Could try to limit based on S_ij, but there is a better approach
Reduce Number of Free Parameters: Matrix Factorization
• Want to reduce the number of free parameters in W_ij:
• Current size: 10k ∗ 10k = 100M
• Matrix factorization:
• Goal: reduce the number of free parameters to ~1M
• Toy example #1:
• Weight matrix is uniform:

   Weight = [ 1 1 1 ... ]   [ 1 ]
            [ 1 1 1 ... ] = [ 1 ] (1 1 1 ...) = U ( V )
            [ 1 1 1 ... ]   [ 1 ]
            [ ...       ]   [...]

• Replace my matrix (10k, 10k) with the outer product of two vectors:
• each 10k long: U(10k), V(10k)
• U & V are called factors
Reduce Number of Free Parameters: Matrix Factorization
• Toy example #2:
• Weight matrix is almost uniform:

   Weight = [ 1.0 0.8 1.0 0.8 ]         [ 1 ]                   [ +1 ]
            [ 0.8 1.0 0.8 1.0 ] = 0.9 ∗ [ 1 ] (1 1 1 1) + 0.1 ∗ [ −1 ] (+1 −1 +1 −1)
            [ 1.0 0.8 1.0 0.8 ]         [ 1 ]                   [ +1 ]
            [ 0.8 1.0 0.8 1.0 ]         [ 1 ]                   [ −1 ]

          = d_1 ⋅ U_1 ( V_1 ) + d_2 ⋅ U_2 ( V_2 )

   Weights: W_ij = ∑_{k={1,2}} d_k ⋅ U_ik ⋅ V_jk
Reduce Number of Free Parameters: Matrix Factorization
• Toy example #3:
• Arbitrary weight matrix:

   Weight = [ 0.74 0.61 0.55 0.10 ]
            [ 0.51 0.42 0.30 0.13 ]
            [ 0.50 0.35 0.84 0.83 ]
            [ 0.14 0.75 0.71 0.94 ]

          = 2.18 ⋅ U_1 ( V_1 ) + 0.79 ⋅ U_2 ( V_2 ) + 0.36 ⋅ U_3 ( V_3 ) + 0.04 ⋅ U_4 ( V_4 )

   with, e.g., U_1 = ( 0.44, 0.30, 0.59, 0.61 ), V_1 = ( 0.39, 0.48, 0.58, 0.53 )
               U_2 = ( −0.71, −0.42, +0.19, +0.53 ), V_2 = ( −0.72, −0.18, 0.03, 0.67 )

• Notice that:
• An arbitrary N∗N matrix can be decomposed using N sets of factors
• Note that the amplitudes are decreasing: d_1 = 2.18 >> d_4 = 0.04
• Can approximate the weight matrix with a small set of factors
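Toy example #3 is exactly what numpy's SVD computes; a sketch using the (reconstructed) matrix above: the singular values come out in decreasing order, and truncating to the top factors already approximates the matrix well.

```python
import numpy as np

W = np.array([[0.74, 0.61, 0.55, 0.10],
              [0.51, 0.42, 0.30, 0.13],
              [0.50, 0.35, 0.84, 0.83],
              [0.14, 0.75, 0.71, 0.94]])

U, d, Vt = np.linalg.svd(W)
print(d)                                  # decreasing amplitudes d_1 >> d_4

# Rank-2 approximation: keep only the two largest factors.
W2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2, :]
print(np.abs(W - W2).max())               # small residual
```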
Note on convention for Matrix Factorization:
• The last equation is the definition of the SVD (Singular Value Decomposition):

   w_ij = ∑_k U_ik d_k V_jk

• where the factors U's, V's are chosen to be normalized and independent from each other:

   ∑_i U_ik U_ik' = 1 if k = k'
   ∑_i U_ik U_ik' = 0 if k ≠ k'

• In this talk and in Koren & Bell's chapter:
• The d_k's are incorporated inside the U_k, V_k:
• Just a convention difference

   w_ij = ∑_k U_ik V_jk

• where the factors are now normalized as:

   ∑_i U_ik U_ik' = d_k if k = k'
Collaborative Filtering: Factorized Neighborhood Model
• Recall the Global Neighborhood Model, where the W_ij are free parameters:

   r̃_ui = ∑_{j∈R(u)} r_uj ∗ w_ij / |R(u)|

• Apply Matrix Factorization to W_ij:

   w_ij => ∑_{k={factors}} U_ik V_jk

   Choose: N_k (number of factors) << N (number of movies): ~200 << 10k-100k

• Scoring: Factorized Neighborhood Model:

   r̃_ui = ∑_{k={factors}} U_ik ∗ ( ∑_{j∈R(u)} r_uj V_jk ) / |R(u)|

   Free parameters: U_ik and V_jk

• (+) Cheaper computation with the same predictive power
• Modeling:

   SSE = ∑_{ui={past_ratings}} ( r_ui − ∑_{k={factors}} U_ik ∗ ∑_{j∈R(u)} r_uj V_jk / |R(u)| )² + λ ∑_{ik} U_ik² + λ ∑_{jk} V_jk²
Collaborative Filtering: SVD Model
SVD: historical name for Matrix Factorization applied to the Rating matrix
• Matrix Factorization applied to r_ui:

   r_ui => ∑_{k={factors}} U_uk V_ik

   Choose: N_k (number of factors) << N (number of movies): ~200 << 10k-100k

• Scoring:

   r̃_ui = ∑_{k={factors}} U_uk ∗ V_ik

   Free parameters: U_uk and V_ik

• (+) Same predictive power
• (−) Not easy to explain recommendations

• Modeling (see the sketch below):

   SSE = ∑_{ui={past_ratings}} ( r_ui − ∑_{k={factors}} U_uk ∗ V_ik )² + λ ∑_{uk} U_uk² + λ ∑_{ik} V_ik²
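A standard SGD sketch for minimizing this regularized SSE (the usual "Funk SVD" recipe, not code from the chapter); the sizes and synthetic ratings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 200, 100, 8
U = 0.1 * rng.standard_normal((n_users, n_factors))
V = 0.1 * rng.standard_normal((n_movies, n_factors))
lam, lr = 0.05, 0.01

# past ratings: (user u, movie i, rating r_ui in 1..5), synthetic here
ratings = [(rng.integers(n_users), rng.integers(n_movies), rng.integers(1, 6))
           for _ in range(5000)]

for epoch in range(20):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                  # residual of r_ui
        uu = U[u].copy()
        U[u] += lr * (err * V[i] - lam * U[u]) # gradient + L2 shrinkage
        V[i] += lr * (err * uu - lam * V[i])

print(U[0] @ V[0])                             # predicted rating for (u=0, i=0)
```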
The End
