Introduction to Core Science Models
Yahoo! Labs
2011/11/19
Agenda

 Basic Counting Models: EMP
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
 Bonus: Tutorial on Collaborative Filtering

 Note:
› Will focus on the science framework
› Will not focus on the optimization problem
EMP + OLR:

 Basic Counting Models: EMP
› Simple CTR model based on counting clicks/views
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
 Bonus: Tutorial on Collaborative Filtering
Today Module on Yahoo FP:
Counting Models: CTR
 Estimate CTR for each article independently
 CTR = Click-Thru-Rate = Total Clicks / Total Views
 Online Model: update every 5 mins:

   CTR = ( C_t + C_{t-1} + ... + C_1 ) / ( V_t + V_{t-1} + ... + V_1 )
       = ( C_t + ∑_{s<t} C_s ) / ( V_t + ∑_{s<t} V_s )

   C_t = clicks during period "t"
   V_t = views during period "t"
CTR Curves for Two Days
Each curve is the CTR of an item in the Today Module over time

Traffic obtained from a controlled randomized experiment
Things to note:
(a) Short lifetimes, (b) temporal effects, (c) often breaking news stories
Counting Models: Most Popular
 EMP: Estimated Most Popular ( aka GMP ):
› Decay = forget about old clicks and views ( γ ~ 0.95-0.99 )

   CTR_EMP = ( C_t + γ C_{t-1} + γ² C_{t-2} + ... ) / ( V_t + γ V_{t-1} + γ² V_{t-2} + ... )

 Segmented Most Popular:
› Separate model for each segment of the population

   CTR_EMP-Male = ( C_t^Male + γ C_{t-1}^Male + γ² C_{t-2}^Male + ... ) / ( V_t^Male + γ V_{t-1}^Male + γ² V_{t-2}^Male + ... )
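To make the decayed counters concrete, here is a minimal Python sketch (class and variable names are invented, and the γ value is illustrative, not the production setting); setting γ = 1 recovers the plain counting CTR from the previous slide.

```python
# Minimal sketch of the EMP decayed-count CTR estimator.
class DecayedCTR:
    def __init__(self, gamma=0.98):
        self.gamma = gamma   # decay: how fast old clicks/views are forgotten
        self.clicks = 0.0    # C_t + gamma*C_{t-1} + gamma^2*C_{t-2} + ...
        self.views = 0.0     # V_t + gamma*V_{t-1} + gamma^2*V_{t-2} + ...

    def update(self, clicks_t, views_t):
        # Fold in one 5-minute period of raw counts.
        self.clicks = clicks_t + self.gamma * self.clicks
        self.views = views_t + self.gamma * self.views

    def ctr(self):
        return self.clicks / self.views if self.views > 0 else 0.0

emp = DecayedCTR(gamma=0.98)
emp.update(clicks_t=12, views_t=1000)   # one 5-minute bucket
emp.update(clicks_t=8, views_t=900)     # next bucket
print(emp.ctr())
```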
Tracking behavior of the Estimated Most Popular model
 Low click-rate articles – more temporal smoothing
OLR: Online Logistic Regression

 Basic Counting Models: EMP
 Feature Based Models: OLR
› Motivation for using regression
› Logistic Regression framework
› Online Logistic Regression: general case
› Per item-OLR Use Case: Today Module
› Improving Model
 RLFM: Feature Model + Collaborative Filtering
 Affinity Models: Log Odds
 Bonus: Tutorial on Collaborative Filtering
Motivation for using Regression:
• EMP breaks down if the segment is too small:
• e.g. a 40-year-old male in New York

   CTR_Male_40_NY = ( C_t^Male_40_NY + γ C_{t-1}^Male_40_NY + ... ) / ( V_t^Male_40_NY + γ V_{t-1}^Male_40_NY + ... )

 Logistic Regression:
› Natural framework to include more features:
› Age, Gender, Location, User Interests, …

   log( P_click / (1 − P_click) ) = b + ∑_{k={features}} W_k ∗ X_{k,u}

› X_{k,u} = value of feature k for user u: e.g. the age of a user
› W_k = weight parameter to be learned for each feature
Linear Regression: One Dimension
Linear Fit: Y = a * X + b

[Scatter plot: Y = Weight vs X = Height, data points with a fitted line; ERROR marks the vertical residual of one point]

   SSE = ∑_{i={examples}} ( Y_i − a ∗ X_i − b )²

• Find the values of "a" and "b" that minimize the Sum of Squared Errors (SSE)
• Take the derivative of SSE with respect to "a" and "b" and set it equal to 0
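Setting the two derivatives to zero gives the standard closed-form least-squares solution; a small numpy sketch of that calculation, with made-up height/weight data:

```python
import numpy as np

# Made-up height/weight data for illustration.
X = np.array([70, 80, 90, 100, 110, 120], dtype=float)   # X = Height
Y = np.array([65, 84, 96, 118, 140, 155], dtype=float)   # Y = Weight

# dSSE/da = 0 and dSSE/db = 0 yield the usual closed-form solution:
a = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - a * X.mean()

sse = np.sum((Y - a * X - b) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.2f}")
```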
Can’t Apply Linear Model to Click Prediction
 For example: probability of click for an article on retirement, as a function of age

[Plot: "Linear Doesn't Represent the Data Well" – Probability of Click (0.0-1.0) vs Age (0-100), data points vs linear model]
Logistic Model for Click Prediction
 Probability of click for an article on retirement, as a function of age

[Plot: "Logistic Model is much better" – Probability of Click (0.0-1.0) vs Age (0-100), data points vs logistic model]

   P(Click) = 1 / ( 1 + exp( − a ∗ Age − b ) )
Logistic Regression: One Dimension

   P(Y_i) = 1 / ( 1 + exp( − Y_i ⋅ ( a ⋅ Age_i − b ) ) )

   Y_i = +1 => P(Y_i = +1) = prob user clicked on the article
   Y_i = −1 => P(Y_i = −1) = prob user didn't click

• How to find parameters "a" and "b" for many training examples ( Y_i, Age_i )?
• Maximize the product of probabilities (Likelihood):

   Likelihood = P(Y_1) ⋅ P(Y_2) ⋅ P(Y_3) ⋅⋅⋅ P(Y_n)

• "Hard" to solve
Optimize Logistic Likelihood for 4 Data Points:
   Likelihood(1..4) = P(Y_1) ⋅ P(Y_2) ⋅ P(Y_3) ⋅ P(Y_4)

[Plot: the likelihood and the individual Prob(Y_1), Prob(Y_2), ... as functions of parameter "a"]

   P(Y_i) = 1 / ( 1 + exp( − Y_i ⋅ ( a ⋅ Age_i − b ) ) )

X-axis: parameter "a"
For simplicity: I assume that I know the value of "b"
Optimize Logistic Likelihood for 40 Data Points:

   Likelihood(1..40) = P(Y_1) ⋅ P(Y_2) ⋅⋅⋅ P(Y_40)

[Plot: rescaled likelihood for 40 data points vs 4 data points, as a function of parameter "a"; more data gives a narrower peak]

X-axis: parameter "a"
For simplicity: I assume that I know the value of "b"
Gaussian Approximation to Likelihood:

   Exp( −(a − m_40)² / 2σ_40² ) ≈ Likelihood_40(a)

• Replace the Likelihood with a simple Gaussian with two hyperparameters:
  * Mean: m_40 (what is the average value for "a")
  * Standard deviation: σ_40 (what is the error around the mean)

[Plot: Gaussian_Max overlaid on Likelihood_40 as functions of parameter "a", with m_40 and σ_40 marked]
Gaussian approx allows updating one data point at a time:

   Exp( −(a − m_40)² / 2σ_40² ) ≈ Likelihood_40(a)
   Exp( −(a − m_40)² / 2σ_40² ) ≈ P(Y_40) ⋅ { P(Y_39) ⋅ P(Y_38) ⋅⋅⋅ P(Y_1) }
   Exp( −(a − m_40)² / 2σ_40² ) ≈ P(Y_40) ⋅ Exp( −(a − m_39)² / 2σ_39² )

   Posterior ≈ Likelihood ∗ Prior

• Note: for simplicity I ignored all normalizations
OLR: Online Logistic Regression: one parameter
• Solve the Bayesian update for each new event ( Y, Age ):

   Posterior ≈ Likelihood ∗ Prior

   Exp( −(a − m_t)² / 2σ_t² ) ≈ P(Y) ⋅ Exp( −(a − m_{t-1})² / 2σ_{t-1}² )

   P(Y) = 1 / ( 1 + exp( − Y ⋅ ( a ⋅ Age − b ) ) )

• Yrank update formulas:

   m_t = m_{t-1} + …
   1/σ_t² = 1/σ_{t-1}² + …

• Yrank approximate solution: Scott Roy talk:
http://twiki.corp.yahoo.com/pub/Personalization/YRank/YRankLearning.ppt
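The actual Yrank update terms are elided above ("+ …"), so as an assumed stand-in only, here is a generic Laplace-style update (one Newton step of the Gaussian posterior, taken at the prior mean) for the one-parameter case. It illustrates the Posterior ≈ Likelihood ∗ Prior mechanics, not the Yrank formulas.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def online_update(m, var, y, x, b):
    """One Bayesian update of a single weight "a" with prior N(m, var).

    NOT the Yrank formula (which the slides elide): this is a generic
    Laplace-style step shown purely for illustration. y is +1 or -1.
    """
    p = sigmoid(y * (m * x - b))                       # likelihood at a = m
    precision = 1.0 / var + (x ** 2) * p * (1.0 - p)   # add likelihood curvature
    new_var = 1.0 / precision
    new_m = m + new_var * y * x * (1.0 - p)            # move mean along the gradient
    return new_m, new_var

# Feature scaled to ~[0, 1] (age / 100) to keep the toy numbers well behaved.
m, var = 0.0, 1.0
for y, age in [(+1, 0.62), (-1, 0.25), (+1, 0.70)]:
    m, var = online_update(m, var, y, age, b=0.5)
print(m, var)   # posterior mean shifts toward clicks, variance shrinks
```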
OLR: Online Logistic Regression: General Case
• Replace the one parameter "a" by a set of parameters { w_f }
• Replace the one feature "Age" by a set of features { X_f }
• Solve the Bayesian update for each new event ( Y, {X_f} ):

   Posterior ≈ Likelihood ∗ Prior

   Exp( −∑_f (w_f − m_{f,t})² / 2σ_{f,t}² ) ≈ P(Y) ⋅ Exp( −∑_f (w_f − m_{f,t-1})² / 2σ_{f,t-1}² )

   P(Y) = 1 / ( 1 + exp( − Y ⋅ ∑_f w_f ⋅ X_f ) )

• Yrank update formulas:

   m_{f,t} = m_{f,t-1} + …
   1/σ_{f,t}² = 1/σ_{f,t-1}² + …
OLR: General Case: Features
• Multi-dimensional logistic regression model:

   P(Y) = 1 / ( 1 + exp( − Y ⋅ ∑_{f={features}} w_f ⋅ X_f ) )

   ∑_f w_f ∗ X_f = w_1 ∗ 1                                                       <= Baseline
                 + w_2 X_{u=Male} + w_3 X_{u=Age40s} + w_4 X_{u=SanJose} + w_5 X_{u=likeSports}   <= User Features
                 + w_6 ∗ X_{i=about_Sports} + w_7 ∗ X_{i=about_NBA}              <= Article Features
                 + w_8 ∗ X_{(u=likeSports & i=about_Sports)}                     <= User*Article Features

• More on Features:
http://twiki.corp.yahoo.com/view/SRelevance/NewsRecommendationFeatures
http://twiki.corp.yahoo.com/view/SRelevance/COREUserProfilesSparsePolarity
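A hedged sketch of assembling and scoring the sparse feature vector above; the dict layout, the `score` helper, and the weights are my own illustration, not the CORE feature pipeline.

```python
import math

def score(weights, features):
    """P(Y=1) = sigmoid(sum_f w_f * x_f) over a sparse feature dict."""
    z = sum(weights.get(f, 0.0) * x for f, x in features.items())
    return 1.0 / (1.0 + math.exp(-z))

features = {
    "baseline": 1.0,                      # w1 * 1
    "u=Male": 1.0,                        # user features
    "u=Age40s": 1.0,
    "u=SanJose": 1.0,
    "u=likeSports": 1.0,
    "i=about_Sports": 1.0,                # article features
    "i=about_NBA": 1.0,
    "u=likeSports&i=about_Sports": 1.0,   # user*article interaction
}
weights = {"baseline": -3.0, "u=Male": 0.2, "u=likeSports": 0.3,
           "i=about_Sports": 0.5, "u=likeSports&i=about_Sports": 0.8}
print(score(weights, features))           # predicted click probability
```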
OLR: Online Logistic Regression

 Basic Counting Models: EMP
 Feature Based Models: OLR
› Motivation for using regression
› Logistic Regression framework
› Online Logistic Regression: General Case
› Per item-OLR Use Case: Today Module
› Improving Model
 RLFM: Feature Model + Collaborative Filtering
 Affinity Models: Log Odds
 Bonus: Tutorial on Collaborative Filtering
Per item-OLR use Case: Yahoo FP Today Module
• Front Page Module:
• Articles don't live very long ( < a day )
• Many clicks/views for each article
• Each article is treated independently:
• A new OLR model for each new article
• Trying to predict the CTR for each user & article pair: u,i

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} w_{i,f} ⋅ X_{u,f} ) )

   ∑_f w_f ∗ X_{u,f} = w_1                                                       <= Baseline
                     + w_2 ∗ X_{u=Male} + w_3 ∗ X_{u=Age20s} + w_4 ∗ X_{NewYork}
                     + w_5 ∗ X_{u=likeSports} + w_6 ∗ X_{u=likeNFL} + w_7 ∗ X_{u=likeMusic}   <= User Features
Per item-OLR use Case: Yahoo FP Today Module

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} w_{i,f} ⋅ X_{u,f} ) )

[Plot: P(Y_i = 1) as a function of ∑ w ⋅ X, a sigmoid from 0.0 to 1.0; w_1 sets the baseline shift]
Per item-OLR use Case: Yahoo FP Today Module
• Each article has its own OLR model and its own set of weights { w_{i,f} }:

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} w_{i,f} ⋅ X_{u,f} ) )

• Each article has its own:

   Prior ~ Exp( −∑_{f={features}} (w_{i,f} − m_{i,f})² / 2σ_{i,f}² )

• For each event ( Y_ui, {X_uf} ), update the hyperparameters for that article with the Yrank update formulas:

   m_{i,f,t} = m_{i,f,t-1} + …
   1/σ_{i,f,t}² = 1/σ_{i,f,t-1}² + …
Per item-OLR use Case: Yahoo FP Today Module
• How to use the OLR model:
• Choose a candidate pool:
• Roughly 50-100 articles picked by editors
• Explore:
• In a small bucket: try all 50-100 articles randomly
• Modeling: for each event (click/view), apply Yrank for that article
• Exploit:
• For the remainder (larger bucket)
• Scoring: predict each article's CTR, and order by decreasing CTR (see the sketch below):

   CTR = P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={user_features}} m_{i,f} ⋅ X_{u,f} ) )
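A minimal sketch of this explore/exploit serving loop; the ε split, the names, and the weights are illustrative assumptions, and in production the explore events would feed the Yrank updates.

```python
import math
import random

def predicted_ctr(means, user_features):
    """CTR using the posterior means m_{i,f} of an article's OLR model."""
    z = sum(means.get(f, 0.0) * x for f, x in user_features.items())
    return 1.0 / (1.0 + math.exp(-z))

def serve(articles, user_features, epsilon=0.05):
    if random.random() < epsilon:
        # Explore bucket: show a random candidate; its click/view event
        # later drives the Yrank update for that article's model.
        return random.choice(list(articles))
    # Exploit bucket: pick the candidate with the highest predicted CTR.
    return max(articles, key=lambda i: predicted_ctr(articles[i], user_features))

articles = {"a1": {"u=likeSports": 0.4}, "a2": {"u=likeMusic": 0.9}}
print(serve(articles, {"u=likeSports": 1.0}))
```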
Improving Online Learning:
• Correlated OLR: include interactions between hyperparameters: → improvement

   Prior ~ Exp( −(1/2) ∑_{f1,f2} (w_f1 − m_f1) A⁻¹_{f1,f2} (w_f2 − m_f2) )

• Mini-Batch: update multiple data points at once: → no gain in CTR

   Likelihood(mini batch) = P(Y_1) ⋅⋅⋅ P(Y_n)

• TechPulse 2011: Taesup Moon, Pradheep Elango, Su-Lin Wu
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/techpulse.pdf
Improving Explore/Exploit: UCB
• UCB: improve the Explore/Exploit strategy: → improvement
• Old strategy: ε-greedy
• Explore: update OLR only from events in a small random bucket
• Exploit: order articles in decreasing value of predicted CTR
• New strategy: UCB (aka Upper Confidence Bound)
• Single bucket
• Explore: update OLR with all events
• Exploit: order articles in decreasing value of the "optimistic" CTR_UCB
• TechPulse 2011: Taesup Moon, Pradheep Elango, Su-Lin Wu
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/techpulse.pdf
Improving Explore/Exploit: UCB
• Upper Confidence Bound strategy: → improvement
• Exploit: order articles in decreasing value of the "optimistic" CTR_UCB
• ONE DIMENSION EXAMPLE:

• Replace the normal CTR:

   CTR = 1 / ( 1 + exp( − m ⋅ X ) )

• With the optimistic CTR:

   CTR_UCB = 1 / ( 1 + exp( −( m ⋅ X + z ⋅ σ ⋅ X² ) ) )

   z = tunable parameter

[Plot: CTR and CTR_UCB as functions of X; CTR_UCB sits above CTR]
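A one-dimension sketch of the optimistic score, taking the slide's z ⋅ σ ⋅ X² term literally; the numbers are illustrative.

```python
import math

def ctr(m, x):
    """Normal CTR from the posterior mean m."""
    return 1.0 / (1.0 + math.exp(-m * x))

def ctr_ucb(m, sigma, x, z=1.0):
    """Optimistic CTR: inflate the logit by z * sigma * x^2 before squashing.
    Articles whose weight is still uncertain (large sigma) get a boost,
    so they keep being shown until the uncertainty shrinks."""
    return 1.0 / (1.0 + math.exp(-(m * x + z * sigma * x ** 2)))

print(ctr(0.5, 1.0), ctr_ucb(0.5, 0.3, 1.0))   # optimistic score is higher
```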
RLFM: Regression based Latent Factor Model

 Basic Counting Models: EMP
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
› RLFM components
› Using RLFM: Offline & Online update
 Bonus: Tutorial on Collaborative Filtering
RLFM: Regression based Latent Factor Model
• RLFM: basic idea
  * Build a single logistic regression model for all users "u" and articles "i"
  * Add Collaborative Filtering using Matrix Factorization
• Modeling:
• Most of it is done offline in big batch mode ( millions of events )
• One part of the model is also updated online ( one event at a time, using the Yrank update )
⇒ Latent Factor Models are a work in progress:
• Original Y! Labs Paper: Deepak Agarwal, Bee-Chung Chen
http://twiki.corp.yahoo.com/pub/YResearch/CokeLabDiary/featfact.pdf
• Implementation for Coke:
http://twiki.corp.yahoo.com/view/YResearch/RLFMForCoke
RLFM: Regression based Latent Factor Model
RLFM components:
1) Build a logistic regression model for all users "u" and articles "i"
2) Add user bias and article bias
3) Collaborative Filtering using Matrix Factorization
4) Predict factors for new user/article: Cold Start
5) Add Logistic Regression + Bias + Matrix Factorization
1) Build logistic regression for all user/articles:
• Build a single logistic regression model for all users {u}, articles {i}:

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{f={all_features}} w_f ∗ X_{ui,f} ) )

   ∑_f w_f ∗ X_{ui,f} = w_1                                                      <= Baseline
                      + w_2 X_{u=Male} + w_3 X_{u=Age40s} + w_4 X_{u=SanJose} + w_5 X_{u=likeSports}   <= User Features
                      + w_6 ∗ X_{i=about_Sports} + w_7 ∗ X_{i=about_NBA}         <= Article Features
                      + w_8 ∗ X_{(u=likeSports & i=about_Sports)}                <= User*Article Features

• A single set of parameters {w_f} for all users, articles
• Learned offline in batch mode
2) Add per user and per article baseline:
• Add bias parameters:
● Some articles are more/less popular than others
● Some users read more/fewer stories than others

   P(Y_ui = 1) = 1 / ( 1 + exp( −α_u − β_i − ∑_{f={all_features}} w_f ∗ X_{ui,f} ) )

• The baseline is not the same for every user/article:
• Old baseline: w_1
• New baseline: w_1 + α_u + β_i
• More parameters to optimize: {w_f}, {α_u}, {β_i}
• Better with some priors – to be described later
3) Matrix Factorization Motivation
• How to deal with:
• An article about disaster preparedness:
• Hurricanes: need users from the coastline: Texas => Northeast
• Earthquakes: need users from the West coast
• Would need: X_user_WestCoast ∗ X_about_earthquakes
• I don't have that …
• But if I have many views/clicks over many such articles, I can discover that pattern !!!
3) Matrix Factorization Motivation

• I can discover patterns within clicks:
• SIMPLE EXAMPLE:

   Clicks = a Users × Articles matrix of 0/1 entries
   Rows: users from SanJose, Oakland, NewYork, DC
   Columns: articles about Earthquakes and Politics

[Click matrix: roughly, the West-coast users click the Earthquake articles and the East-coast users click the Politics articles, up to a little noise]

• Clicks mostly explained by:

   Clicks ≈ U_1 ∗ V_1 + U_2 ∗ V_2

   where each U_k is a 0/1 column vector over users and each V_k is a 0/1 row vector over articles
3) Matrix Factorization Motivation
• Most clicks explained by: U_1 ∗ V_1 + U_2 ∗ V_2

   P(Click_ui) = P(Y_ui = 1) = 1 / ( 1 + exp( − U_{u,1} ∗ V_{i,1} − U_{u,2} ∗ V_{i,2} ) )

• The general case:

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{k={factors}} U_uk ∗ V_ik ) )

• Note:
• Number of factors ~ 50-200 << N_users & N_articles
3) Matrix Factorization Model
• Matrix Factorization Model: aka Collaborative Filtering

   P(Y_ui = 1) = 1 / ( 1 + exp( −∑_{k={factors}} U_uk ∗ V_ik ) )

• Obtain the U's and V's by maximizing the following likelihood:

   Likelihood = Π_{ui={examples}} 1 / ( 1 + exp( − Y_ui ⋅ ∑_{k={factors}} U_uk ∗ V_ik ) )

• Π_ui => product over all past events (clicks/views)
• Y_ui => clicks = +1 and views = −1
• Better with some priors ...
3) Matrix Factorization Model
• Better with some priors:

   prior for each U_uk ~ Exp( −(U_uk − m_a)² / 2σ_a² )
   prior for each V_ik ~ Exp( −(V_ik − m_b)² / 2σ_b² )

   Choose: m_a = 0 and m_b = 0,
   σ_a is the same for all U_uk,
   σ_b is the same for all V_ik

• Get the U's and V's: maximize Likelihood ∗ Prior (see the sketch below)

• Note: the above priors are uncorrelated
• The original RLFM paper used correlated priors
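A sketch of fitting this model by stochastic gradient ascent on log(Likelihood ∗ Prior); with m_a = m_b = 0 the Gaussian priors act as L2 penalties on U and V. The sizes, learning rate, and synthetic events are illustrative assumptions (the slides do not specify the optimizer).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 100, 50, 5
U = 0.01 * rng.standard_normal((n_users, n_factors))
V = 0.01 * rng.standard_normal((n_items, n_factors))
lam = 0.1    # prior strength ~ 1 / (2 sigma^2), same for all U's and V's
lr = 0.05

# events: (user u, item i, y = +1 click / -1 view), synthetic here
events = [(rng.integers(n_users), rng.integers(n_items), rng.choice([-1, 1]))
          for _ in range(5000)]

for epoch in range(10):
    for u, i, y in events:
        z = y * (U[u] @ V[i])
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-z)))   # d log-likelihood / d(U.V)
        uu = U[u].copy()
        U[u] += lr * (g * V[i] - lam * U[u])       # gradient + prior pull to 0
        V[i] += lr * (g * uu - lam * V[i])

print(1.0 / (1.0 + np.exp(-(U[0] @ V[0]))))        # P(Y_01 = 1)
```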
4) Matrix Factorization Model – Cold Start Problem
• Matrix Factorization Model:

   P(Y_ui) = 1 / ( 1 + exp( −∑_{k={factors}} U_uk ∗ V_ik ) )

• Cold start problem:
   => for a new user U = 0, or for a new article V = 0
• Solution: choose a different prior:

   for each U_uk: Exp( −( U_uk − ∑_{a={user_features}} G_{k,a} X_{u,a} )² / 2σ_a² )
   for each V_ik: Exp( −( V_ik − ∑_{b={item_features}} D_{k,b} X_{i,b} )² / 2σ_b² )

• Parameters G's & D's obtained from maximizing: Likelihood ∗ Prior
5) RLFM: Regression based Latent Factor Model
• Putting it back together: Bias + Regression + Matrix Factorization:

   P(Y_ui = 1) = 1 / ( 1 + exp( −α_u − β_i − ∑_{f={features}} w_f ∗ X_{ui,f} − ∑_{k={factors}} U_uk ∗ V_ik ) )

• Priors:

   for each U_uk: Exp( −( U_uk − ∑_{a={user_features}} G_{k,a} X_{u,a} )² / 2σ_a² )
   for each V_ik: Exp( −( V_ik − ∑_{b={item_features}} D_{k,b} X_{i,b} )² / 2σ_b² )
   for each α_u: Exp( −( α_u − ∑_{a={user_features}} g_a X_{u,a} )² / 2σ_α² )
   for each β_i: Exp( −( β_i − ∑_{b={item_features}} d_b X_{i,b} )² / 2σ_β² )
RLFM: Regression based Latent Factor Model

 Basic Counting Models: EMP
 Feature Based Models: OLR
 RLFM: Feature Model + Collaborative Filtering
› RLFM components
› Using RLFM: Offline & Online update
 Bonus: Tutorial on Collaborative Filtering
Using RLFM: Offline Modeling:
• Offline Modeling:
• Batch mode: maximize Likelihood ∗ Prior
• Millions to billions of examples processed at once
• Input: all events and features {Y's, X's}
• Output:

   factors: {α_u}, {β_i}, {U_uk}, {V_ik}
   parameters: {w_f}, {g_a}, {d_b}, {G_k,a}, {D_k,b}
Using RLFM: Online Modeling and Scoring:
• Online Scoring + some Modeling
• For a new user or a new article: compute the factors from g, d, G, D
• e.g. new user bias:

   α_u = ∑_{a={user_features}} g_a X_{u,a}

• For an old user or old article: get the factors from the offline batch mode
• For each event (click/view) on article "i":
• Update V_ik using the per-item OLR approach
• Predict the score using the updated V_ik (see the sketch below):

   P(Y_ui = 1) = 1 / ( 1 + exp( −α_u − β_i − ∑_{f={features}} w_f ∗ X_{ui,f} − ∑_{k={factors}} U_uk ∗ V_ik ) )
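A small sketch of the online scoring path above: a cold-start user gets α_u and U_u from the feature-based priors (g, G), then the full RLFM formula is evaluated. All shapes and values are toy assumptions.

```python
import numpy as np

def rlfm_score(alpha_u, beta_i, w, x_ui, U_u, V_i):
    """P(Y_ui = 1) from bias + regression + matrix factorization terms."""
    z = alpha_u + beta_i + w @ x_ui + U_u @ V_i
    return 1.0 / (1.0 + np.exp(-z))

def cold_start_user(g, G, x_u):
    """New user: bias and factors predicted from user features."""
    alpha_u = g @ x_u          # alpha_u = sum_a g_a * X_{u,a}
    U_u = G @ x_u              # U_{u,k} = sum_a G_{k,a} * X_{u,a}
    return alpha_u, U_u

x_u = np.array([1.0, 0.0, 1.0])                       # toy user features
g, G = np.array([0.1, 0.2, -0.3]), 0.05 * np.ones((4, 3))
alpha_u, U_u = cold_start_user(g, G, x_u)
print(rlfm_score(alpha_u, beta_i=-0.5, w=np.zeros(3), x_ui=x_u,
                 U_u=U_u, V_i=0.1 * np.ones(4)))
```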
RLFM: Offline Results on Coke Data: Today Module
• RLFM results on an offline experiment
• Y! Front Page – Today Module
• CTR relative lift for RLFM vs Feature-Only, as a function of clicks/user

http://twiki.corp.yahoo.com/view/YResearch/RLFMReplayExperiments
Q&A
Contributors:
Pradheep Elango, Su-Lin Wu, Taesup Moon, Pranam Kolari
Deepak Agarwal, Bee-Chung Chen, Scott Roy
Jean-Marc Langlois

• Coke Science Papers:
http://twiki.corp.yahoo.com/view/YResearch/CokeLabDiary
Tutorial on Collaborative Filtering
Based on the following chapter:
http://research.yahoo.com/files/korenBellChapterSpringer.pdf
By two of the Netflix Prize winners
Collaborative Filtering: Introduction
 Goal: predict the rating r_ui for a movie "i" that a user "u" hasn't seen yet
› Prediction based on a matrix of User/Movie ratings:
● r_ui = 1 through 5 stars
› Prediction equations for integer ratings are simpler than for binary clicks
› The rating matrix is a large, very sparse matrix:
● 10M-100M users and 10k-100k movies, but with ~99% blank entries
 Based on: http://research.yahoo.com/files/korenBellChapterSpringer.pdf
› This talk: focus on the most relevant models & ignore some improvements:
● Baseline adjustment: user bias, movie bias and overall average rating
● Time-aware model, binary features ( rated, rented )
 This talk:
› Adjusted ratings:

   r_ui^adjusted <= r_ui^raw − Baseline(ui)
Collaborative Filtering: the models
 Correlated Neighborhood Model
› Predict a new rating based on the ratings of similar movies
 Global Neighborhood Model
› Enlarge the neighborhood to be "global"
› Introduce adjustable weight parameters
 Factorized Neighborhood Model
› Apply matrix factorization to the weight parameters
 SVD Model
› Apply matrix factorization to the rating matrix itself
Collaborative Filtering: Correlated Neighborhood Model
• Predict an unknown r_ui based on the known ratings r_uj of similar movies
• You will like movie "i" because you liked movies "j"

[Diagram: movie "i" surrounded by its rated neighbors j1..j6, linked by similarities S_ij]

• Define a movie-movie similarity measure S_ij based on correlation:

   S_ij ∝ ∑_{u=Union(i,j)} r_ui ∗ r_uj / Normalization

• Define the Correlated Neighborhood:
• the set of ~20 movies with the largest S_ij that are rated by "u"
• Define Weight: normalized S_ij
Collaborative Filtering: Correlated Neighborhood Model

   Ratings r_ui = a Users × Movies matrix of ±1 entries

• Movies:
• i=1: Star Trek
• i=2: Star Wars
• i=3: Action movie
• i=4: Horror movie

• Movie-Movie Similarity S_ij:

   S_ij ∝ ∑_{u=Union(i,j)} r_ui ∗ r_uj / Normalization

   S_ij ≈ [ 1    1    0.5  0   ]
          [ 1    1    0.5  0   ]   (Movies × Movies)
          [ 0.5  0.5  1    0.5 ]
          [ 0    0    0.5  1   ]
Collaborative Filtering: Correlated Neighborhood Model
• Predict an unknown r_ui based on the known ratings r_uj of similar movies
• You will like movie "i" because you liked movies "j"
• Similarity measure:

   S_ij ∝ ∑_{u=Union(i,j)} r_ui ∗ r_uj

[Diagram: movie "i" and its correlated neighbors j1..j6]

• Correlated Neighborhood:
• the set of ~20 movies with the largest S_ij that are rated by "u"
• Weight: normalized S_ij
• Scoring:

   r̃_ui = ∑_{j={correlated_neighbors}} r_uj ∗ S_ij / ∑_j S_ij

• (+) Simple, intuitive model with the ability to explain why we recommend a movie
• Modeling:
• (−) Need to precompute and store S_ij: 10k ∗ 10k = 100M
• Weights are fixed to the normalized value of S_ij
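A toy sketch of the whole pipeline on a small dense matrix (cosine-style normalization; the baseline adjustment from the introduction is skipped). Real rating matrices are huge and sparse, so this is for illustration only.

```python
import numpy as np

def similarity(R):
    """S_ij proportional to sum_u r_ui * r_uj, cosine-normalized."""
    S = R.T @ R
    norms = np.sqrt(np.diag(S))
    return S / np.outer(norms, norms)

def predict(R, S, u, i, k=2):
    """Weighted average of user u's ratings over the top-k neighbors of movie i."""
    rated = np.flatnonzero(R[u])                   # movies user u has rated
    nbrs = rated[np.argsort(-S[i, rated])[:k]]     # top-k most similar
    return R[u, nbrs] @ S[i, nbrs] / S[i, nbrs].sum()

R = np.array([[5, 4, 1, 0],     # star ratings, 0 = unknown
              [4, 5, 0, 2],
              [1, 0, 5, 4]], dtype=float)
S = similarity(R)
print(predict(R, S, u=0, i=3))  # predict user 0's rating of movie 4
```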
Collaborative Filtering: Global Neighborhood Model
• Extend the neighborhood to all known ratings for user "u": R(u) = { r_uj known }
• Let the weights W_ij be free parameters
• Scoring:

   r̃_ui = ∑_{j∈R(u)} r_uj ∗ w_ij / |R(u)|

• Modeling: pick the W_ij that minimize the regularized sum of squared errors:

   SSE = ∑_{ui={past_ratings}} ( r_ui − ∑_{j∈R(u)} r_uj ∗ w_ij / |R(u)| )² + λ ∑_{ij} w_ij²

   λ = regularization parameter

• (+) Better predictive power than the previous model
• (−) Not easy to explain recommendations
• (−) Expensive modeling, scoring and storage of W_ij: size = 100M
• Could try to limit based on S_ij, but there is a better approach
Reduce Number of Free Parameters: Matrix Factorization
• Want to reduce the number of free parameters in W_ij:
• Current size: 10k ∗ 10k = 100M
• Matrix factorization:
• Goal: reduce the number of free parameters to ~1M
• Toy example #1:
• Weight matrix is uniform:

   Weight = [ 1 1 1 ... ]   [ 1 ]
            [ 1 1 1 ... ] = [ 1 ] (1 1 1 ...) = U ( V )
            [ 1 1 1 ... ]   [ 1 ]
            [ ...       ]   [...]

• Replace my matrix (10k, 10k) with the outer product of two vectors:
• each 10k long: U(10k), V(10k)
• U & V are called factors
Reduce Number of Free Parameters: Matrix Factorization
• Toy example #2:
• Weight matrix is almost uniform:

   Weight = [ 1.0 0.8 1.0 0.8 ]         [ 1 ]                   [ +1 ]
            [ 0.8 1.0 0.8 1.0 ] = 0.9 ∗ [ 1 ] (1 1 1 1) + 0.1 ∗ [ −1 ] (+1 −1 +1 −1)
            [ 1.0 0.8 1.0 0.8 ]         [ 1 ]                   [ +1 ]
            [ 0.8 1.0 0.8 1.0 ]         [ 1 ]                   [ −1 ]

          = d_1 ⋅ U_1 ( V_1 ) + d_2 ⋅ U_2 ( V_2 )

   Weights: W_ij = ∑_{k={1,2}} d_k ⋅ U_ik ⋅ V_jk
Reduce Number of Free Parameters: Matrix Factorization
• Toy example #3:
• Arbitrary weight matrix:

   Weight = [ 0.74 0.61 0.55 0.10 ]
            [ 0.51 0.42 0.30 0.13 ]
            [ 0.50 0.35 0.84 0.83 ]
            [ 0.14 0.75 0.71 0.94 ]

          = 2.18 ⋅ U_1 ( V_1 ) + 0.79 ⋅ U_2 ( V_2 ) + 0.36 ⋅ U_3 ( V_3 ) + 0.04 ⋅ U_4 ( V_4 )

   with, e.g., U_1 = ( 0.44, 0.30, 0.59, 0.61 ), V_1 = ( 0.39, 0.48, 0.58, 0.53 )
               U_2 = ( −0.71, −0.42, +0.19, +0.53 ), V_2 = ( −0.72, −0.18, 0.03, 0.67 )

• Notice that:
• An arbitrary N∗N matrix can be decomposed using N sets of factors
• Note that the amplitudes are decreasing: d_1 = 2.18 >> d_4 = 0.04
• Can approximate the weight matrix with a small set of factors
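Toy example #3 is exactly what numpy's SVD computes; a sketch using the (reconstructed) matrix above: the singular values come out in decreasing order, and truncating to the top factors already approximates the matrix well.

```python
import numpy as np

W = np.array([[0.74, 0.61, 0.55, 0.10],
              [0.51, 0.42, 0.30, 0.13],
              [0.50, 0.35, 0.84, 0.83],
              [0.14, 0.75, 0.71, 0.94]])

U, d, Vt = np.linalg.svd(W)
print(d)                                  # decreasing amplitudes d_1 >> d_4

# Rank-2 approximation: keep only the two largest factors.
W2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2, :]
print(np.abs(W - W2).max())               # small residual
```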
Note on convention for Matrix Factorization:
• The last equation is the definition of the SVD (Singular Value Decomposition):

   w_ij = ∑_k U_ik d_k V_jk

• where the factors U's, V's are chosen to be normalized and independent from each other:

   ∑_i U_ik U_ik' = 1 if k = k'
   ∑_i U_ik U_ik' = 0 if k ≠ k'

• In this talk and in Koren & Bell's chapter:
• The d_k's are incorporated inside the U_k, V_k:
• Just a convention difference

   w_ij = ∑_k U_ik V_jk

• where the factors are now normalized as:

   ∑_i U_ik U_ik' = d_k if k = k'
Collaborative Filtering: Factorized Neighborhood Model
• Recall the Global Neighborhood Model, where the W_ij are free parameters:

   r̃_ui = ∑_{j∈R(u)} r_uj ∗ w_ij / |R(u)|

• Apply Matrix Factorization to W_ij:

   w_ij => ∑_{k={factors}} U_ik V_jk

   Choose: N_k (number of factors) << N (number of movies): ~200 << 10k-100k

• Scoring: Factorized Neighborhood Model:

   r̃_ui = ∑_{k={factors}} U_ik ∗ ( ∑_{j∈R(u)} r_uj V_jk ) / |R(u)|

   Free parameters: U_ik and V_jk

• (+) Cheaper computation with the same predictive power
• Modeling:

   SSE = ∑_{ui={past_ratings}} ( r_ui − ∑_{k={factors}} U_ik ∗ ∑_{j∈R(u)} r_uj V_jk / |R(u)| )² + λ ∑_{ik} U_ik² + λ ∑_{jk} V_jk²
Collaborative Filtering: SVD Model
SVD: historical name for Matrix Factorization applied to the Rating matrix
• Matrix Factorization applied to r_ui:

   r_ui => ∑_{k={factors}} U_uk V_ik

   Choose: N_k (number of factors) << N (number of movies): ~200 << 10k-100k

• Scoring:

   r̃_ui = ∑_{k={factors}} U_uk ∗ V_ik

   Free parameters: U_uk and V_ik

• (+) Same predictive power
• (−) Not easy to explain recommendations

• Modeling (see the sketch below):

   SSE = ∑_{ui={past_ratings}} ( r_ui − ∑_{k={factors}} U_uk ∗ V_ik )² + λ ∑_{uk} U_uk² + λ ∑_{ik} V_ik²
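A standard SGD sketch for minimizing this regularized SSE (the usual "Funk SVD" recipe, not code from the chapter); the sizes and synthetic ratings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 200, 100, 8
U = 0.1 * rng.standard_normal((n_users, n_factors))
V = 0.1 * rng.standard_normal((n_movies, n_factors))
lam, lr = 0.05, 0.01

# past ratings: (user u, movie i, rating r_ui in 1..5), synthetic here
ratings = [(rng.integers(n_users), rng.integers(n_movies), rng.integers(1, 6))
           for _ in range(5000)]

for epoch in range(20):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                  # residual of r_ui
        uu = U[u].copy()
        U[u] += lr * (err * V[i] - lam * U[u]) # gradient + L2 shrinkage
        V[i] += lr * (err * uu - lam * V[i])

print(U[0] @ V[0])                             # predicted rating for (u=0, i=0)
```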
The End
