This is part 1 of the tutorial Xavier and Deepak gave at RecSys 2016. You can find the second part at http://www.slideshare.net/xamat/recsys-2016-tutorial-lessons-learned-from-building-reallife-recommender-systems
Recommender Systems from A to Z – Real-Time Deployment (Crossing Minds)
This fourth meetup will present good practices and tips about deploying a recommender system in production. We will cover a wide range of the day-to-day work of machine learning engineers and devops: from test-driven development to continuous integration and cloud architecture design. We will see how machine learning, and recommender systems in particular, differ from traditional software development, how this impacts deployment pipelines, and what tools you can use to solve these problems.
How to Build a Recommendation Engine on Spark (Caserta)
How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a packed two-day event with thought-provoking keynotes, knowledge-filled sessions, intense workshops, insightful panels, and real-world case studies, engaging the analytics community with the latest methodologies and trends. The conference offers one of the largest speaker-to-attendee ratios for an unmatched networking and learning opportunity.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing (CleverTap)
Join Almitra Karnik, Head of Marketing at CleverTap, and Jessie Paul, CEO of Paul Writer, as they share their insights on how AI and ML are fundamentally changing the way we approach marketing and how we can harness these changes to further our businesses.
How to create a cutting-edge recommender that is fast and scalable, can use almost any applicable data, and is extremely flexible for use in many different contexts. It uses Spark, Mahout, and a search engine.
This presentation covers how all aspects of marketing have evolved over the years, how AI will shape the marketing landscape in the years to come, and why marketers need AI to assist them in their jobs. The future lies in working toward a better customer experience, and customer retention in particular seems to be the key.
Marketers have to stay on the lookout throughout: they need to keep learning and keep a continuous tab on the customer’s pulse in order to deliver the best.
Distributed Representation-based Recommender Systems in E-commerce (Rakuten Group, Inc.)
The Intelligence Domain Group at the Rakuten Institute of Technology develops various kinds of solutions that utilize Rakuten data in order to assist Rakuten services.
In this presentation, we propose a novel item recommender algorithm based on distributed representations. We confirmed that the proposed algorithm outperformed conventional recommender algorithms such as collaborative filtering and matrix factorization.
Summary: Graphs are structures commonly used in computer science that model the interactions among entities. I will start by introducing the basic formulations of graph-based machine learning, which has been a popular topic of research in the past decade and has led to a powerful set of techniques. In particular, I will show examples of how it acts as a generic data mining and predictive analytics tool. In the second part, I will discuss applications of such learning techniques in media analytics: (1) image analysis, where visually coherent objects are isolated from images; (2) social analysis of videos, where actors' social properties are predicted from videos. Materials in this part are based on our recent publications in highly selective venues (papers on https://sites.google.com/site/leiding2010/ ).
Bio: Lei Ding is a researcher making sense of large amounts of data in all media types. He currently works at Intent Media as a scientist, focusing on data analytics and applied machine learning in online advertising. Previously, he worked at several research institutions, including Columbia University, UIUC, and IBM Research, on digital/social media analysis and understanding. He received a Ph.D. in Computer Science and Engineering from The Ohio State University, where he was a Distinguished University Fellow.
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender systems or other prediction-based systems. Typically, such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit at query time, as well as the presence of inefficiencies in the system due to the decoupling of retrieval and ranking.
To address these issues the authors created ML-Scoring, an open-source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information-retrieval ranking function with a custom supervised model, trained through Spark, Weka, or R, that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but will also walk through practical examples, from loading a dataset into Elasticsearch, to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
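To make the two-phase architecture concrete, here is a minimal, self-contained sketch (not ML-Scoring itself; the corpus, features, and model weights are invented for illustration). Phase I retrieves a top-k candidate set by query match, and Phase II re-ranks it with a model score. Note how an item pruned in Phase I can never be recovered in Phase II, which is exactly the sub-optimality described above.

```python
def phase1_retrieve(corpus, query_terms, k):
    """Phase I: score documents by simple term overlap and keep the top-k."""
    scored = [(sum(t in doc["text"] for t in query_terms), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def phase2_rerank(candidates, weights):
    """Phase II: re-rank the candidates with a (toy) linear model."""
    def model_score(doc):
        return sum(weights.get(f, 0.0) * v for f, v in doc["features"].items())
    return sorted(candidates, key=model_score, reverse=True)

corpus = [
    {"id": 1, "text": "spark mllib recommender", "features": {"ctr": 0.9}},
    {"id": 2, "text": "spark tutorial",          "features": {"ctr": 0.2}},
    {"id": 3, "text": "gardening tips",          "features": {"ctr": 0.8}},
]
top_k = phase1_retrieve(corpus, ["spark"], k=2)
ranked = phase2_rerank(top_k, weights={"ctr": 1.0})
```

Document 3 has a high model score, but because it never matches the query it is cut in Phase I and the re-ranker never sees it; a tightly coupled system scores everything in one pass instead.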
In this presentation I will talk about the design of scalable recommender systems and their similarity to advertising systems. The problem of generating and delivering recommendations of content/products to appropriate audiences, and ultimately to individual users at scale, is largely similar to the matching problem in computational advertising, especially in the context of dealing with self- and cross-promotional content. In this analogy with online advertising, a display opportunity triggers a recommendation. The actors are the publisher (website/medium/app owner) and the advertiser (content owner or promoter), whereas the ads or creatives represent the items being recommended, which compete for the display opportunity and may have different monetary value to the actors. To effectively control what is recommended to whom, targeting constraints need to be defined over an attribute space, typically grouped by type (Audience, Content, Context, etc.), where some associated values are not known until decisioning time. In addition to constraints, there are business objectives (e.g. delivery quotas) defined by the actors. Both constraints and objectives can be encapsulated into and expressed as campaigns. Finally, there is the concept of relevance, directly related to users' response prediction, which is computed using the same attribute space as signals.
As in advertising, recommendation systems require a serving platform where decisioning happens in real time (a few milliseconds), typically selecting an optimal set of items to display to the user from hundreds, sometimes thousands or millions, of items. User actions are then taken as feedback and used to learn models that dynamically adjust in order to meet business objectives.
This is a radical departure from the traditional item-based and user-based collaborative filtering approach to recommender systems, which fails to factor in context, such as time of day, geo-location, or the category of the surrounding content, to generate more accurate recommendations. Traditional approaches also fail to recognize that recommendations don't happen in a vacuum and as such may require the evaluation of business constraints and objectives. All of this should be considered when designing and developing true commercial recommender/advertising systems.
Speaker Bio
Joaquin A. Delgado is currently Director of Advertising Technology at Intel Media (a wholly owned subsidiary of Intel Corp.), working on disruptive technologies in the Internet TV space. Prior to that he held CTO positions at AdBrite, Lending Club, and TripleHop Technologies (acquired by Oracle). He was also Director of Engineering and Sr. Principal Architect at Yahoo! His expertise lies in distributed systems, advertising technology, machine learning, recommender systems, and search. He holds a Ph.D. in computer science and artificial intelligence from the Nagoya Institute of Technology, Japan.
Recommender Systems Tutorial (Part 3) -- Online Components (Bee-Chung Chen)
This is a tutorial given at the International Conference on Machine Learning. The slides consist of four parts. Please look for Part 1, Part 2, and Part 4 to get a complete picture of this technology.
Recommender Systems from A to Z – The Right Dataset (Crossing Minds)
In recent years a lot of improvements have been made in the field of machine learning and in the tools that support the community of developers. But implementing a recommender system is still very hard.
That is why at Crossing Minds, we decided to create a series of 4 meetups to discuss how to implement a recommender system end-to-end:
Part 1 – The Right Dataset
Part 2 – Model Training
Part 3 – Model Evaluation
Part 4 – Real-Time Deployment
This first meetup will be about building the right dataset and doing all the preprocessing needed to create different models. We will talk about explicit vs. implicit feedback, dataset analysis, likes/dislikes vs. ratings, user and item features, normalization, and similarities.
PredictionIO - Building Applications That Predict User Behavior Through Big Data (predictionio)
Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Presented by PredictionIO at Big Data TechCon (Oct 17, 2013)
1. Statistical Models for Massive Web Data
Deepak Agarwal, LinkedIn, USA
Director, Applied Relevance Science (ARS)
CATS Big Data Panel, October 11, 2012
Hosted by National Academy of Sciences
Washington D.C., USA
2. Disclaimer
The opinions expressed here are mine and in no way represent the official position of LinkedIn.
The case studies presented today were work done while I was at Yahoo!
NRC BIG DATA PANEL, AGARWAL, 2012
3. Big Data Applications in Business
Big Data: competitive advantage, innovation, reduces uncertainty in decision making
High-frequency data
– Large number of heterogeneous transactions per unit time: web visits, financial trading, credit card transactions, telephone calls, packet flows in an IP network, ...
I will focus on statistical modeling for one such data source
– User visits to websites
4. Example 1: Yahoo! front page Today module
Recommend content links F1 F2 F3 F4 (out of 30-40, editorially programmed)
4 slots exposed; F1 has maximum exposure
Routes traffic to other Y! properties
7. Data Generation
[diagram: an http request carrying user information reaches the ranking service, which selects news and ads to serve; logged responses feed back as model updates]
8. Data
Context: select item j with item covariates x_j (keywords, content categories, ...)
User i visits: (user, context) covariates x_it (profile information, device id, first-degree connections, browse information, ...)
Response for (i, j): y_ij (click/no-click)
9. Statistical Problem
Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR * P[Share | Click])
– Revenue per page-view = CTR * bid (more complex due to the second-price auction)
CTR is a fundamental measure that opens the door to a more principled approach to ranking items
Converge rapidly to maximum-utility items
– Sequential decision-making process
– Models: help cope with data sparseness (curse of dimensionality)
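The utility functions above can be sketched as a tiny scoring-and-ranking routine. This is a hedged illustration only: the items, rates, and bids are invented, and auction effects on revenue are ignored.

```python
def utility(item, objective):
    """Score an item under one of the utility functions on this slide."""
    if objective == "clicks":
        return item["ctr"]                                # click-rate (CTR)
    if objective == "shares":
        return item["ctr"] * item["p_share_given_click"]  # CTR * P[Share | Click]
    if objective == "revenue":
        return item["ctr"] * item["bid"]                  # CTR * bid (ignoring auction effects)
    raise ValueError(objective)

def rank(items, objective):
    """Rank the admissible pool by the chosen utility, best first."""
    return sorted(items, key=lambda it: utility(it, objective), reverse=True)

items = [
    {"id": "a", "ctr": 0.10, "p_share_given_click": 0.2, "bid": 0.2},
    {"id": "b", "ctr": 0.05, "p_share_given_click": 0.9, "bid": 1.0},
]
```

The same pool can rank differently under different utilities: here item "a" wins on raw CTR, while item "b" wins on shares and revenue.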
10. Illustration with the Y! front page application
Simplify: maximize CTR on the first slot (F1)
Article pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the pool is dynamic
We want to provide personalized recommendations
– Users with many prior visits see recommendations “tailored” to their taste; others see the best for the “group” they belong to
11. Types of user covariates
Demographics, geo:
– Not useful in the front-page application
Browse behavior: activity on the Y! network (x_it)
– Previous visits to properties, searches, ad views, clicks, ...
– This is useful for the front-page application
Latent user factors based on previous clicks on the module (u_it)
– Useful for active module users; obtained via factor models
12. Approach: Online + Offline
Offline computation
– Intensive computations done infrequently (once a day/week) to update parameters that are less time-sensitive
Online computation
– Lightweight computations done frequently (once every 5-10 minutes) to update parameters that are time-sensitive
– Adaptive experiments (explore-exploit) are also done online
13. Online computation: per-item online logistic regression
For item j, the state-space model is
y_ijt ~ Ber(p_ijt)
logit(p_ijt) = u_i' v_jt + x_it' b_jt
(v_{j,t+1}, b_{j,t+1}) = (v_{j,t}, b_{j,t}) + delta_{j,t+1}, delta_{j,t+1} ~ N(0, tau^2)
(v_{j,0}, b_{j,0}) = (D x_j, 0) + e_{j,0}, e_{j,0} ~ N(0, sigma^2)
Item coefficients are updated online via a Kalman filter (discounting approach of West and Harrison)
– Item covariates are used to initialize the coefficients at epoch zero
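A minimal sketch of the flavor of this update, under simplifying assumptions that are mine, not the deck's: a single scalar coefficient per item, a Gaussian posterior, and a one-step Laplace-style correction standing in for the full Kalman-filter recursion. All numbers are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(mean, var, x, y, drift_var=0.01):
    """One click/no-click observation (feature x, label y in {0,1}) for a scalar coefficient."""
    var = var + drift_var            # random-walk state evolution (the discounting step)
    p = sigmoid(mean * x)            # predicted click probability at the current mean
    grad = (y - p) * x               # gradient of the Bernoulli log-likelihood
    info = p * (1 - p) * x * x       # observed Fisher information of this observation
    var = 1.0 / (1.0 / var + info)   # posterior variance shrinks with information
    mean = mean + var * grad         # posterior mean moves toward the observation
    return mean, var

mean, var = 0.0, 1.0                 # prior, playing the role of the epoch-zero init D x_j
for x, y in [(1.0, 1), (1.0, 1), (1.0, 0), (1.0, 1)]:   # three clicks, one skip
    mean, var = online_update(mean, var, x, y)
```

After mostly positive feedback the coefficient's mean is pulled above zero while its variance contracts below the prior, which is the qualitative behavior the slide's recursion delivers at scale.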
14. Closer look at the online model
Different components of logit(p_ijt) = u_i' v_jt + x_it' b_jt:
– u_i (r x 1): user latent factors, useful for heavy users
– x_it' b_jt: residual item affinity to user covariates (old items)
15. Online Adaptive Experimentation (Explore/Exploit)
Three schemes (all work reasonably well for the front-page application)
– epsilon-greedy: show the article with maximum posterior mean, except with a small probability epsilon choose an article at random
– Upper confidence bound (UCB): show the article with maximum score, where score = posterior mean + k * posterior std
– Thompson sampling: draw a sample of the coefficients from the posterior, compute each article's CTR, and show the article with maximum CTR
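The three schemes can be sketched as follows, assuming each article's CTR posterior is summarized by a (mean, std) pair; the article pool and the constants are invented for illustration.

```python
import random

# item -> (posterior mean, posterior std) of its CTR
pool = {"art1": (0.05, 0.01), "art2": (0.04, 0.03)}

def epsilon_greedy(pool, eps=0.1, rng=random):
    """Exploit the max posterior mean; explore uniformly with probability eps."""
    if rng.random() < eps:
        return rng.choice(list(pool))
    return max(pool, key=lambda a: pool[a][0])

def ucb(pool, k=2.0):
    """Optimism in the face of uncertainty: mean + k * std."""
    return max(pool, key=lambda a: pool[a][0] + k * pool[a][1])

def thompson(pool, rng=random):
    """Draw one sample per article from its (Gaussian-approximated) posterior."""
    return max(pool, key=lambda a: rng.gauss(*pool[a]))
```

Note how the schemes can disagree: art1 has the higher mean, so greedy exploitation picks it, but art2's wider posterior gives it the higher UCB score and a real chance under Thompson sampling.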
16. Offline computation
Computing user latent factors and the item coefficient prior
– Computed offline once a day using retrospective (user, item) interaction data for the last X days (X = 30 in our case)
– Computations are done on Hadoop
17. Offline: Regression-based Latent Factor Model
y_ij ~ Ber(p_ij) (the number of observations per user has wide variation)
logit(p_ij) = sum_k u_ik v_jk = u_i' v_j (need shrinkage on factors)
u_i = G x_i + e_i^u, e_i^u ~ N(0, diag(sigma_1^2, sigma_2^2, ..., sigma_r^2))
v_j = D x_j + e_j^v, e_j^v ~ N(0, I), with v_jk >= 0
G and D are regression weight matrices; the e terms are user/item-specific correction terms (learnt from data)
18. Role of shrinkage (consider the Gaussian case for simplicity)
For a new user/article, factor estimates are based on covariates alone:
u_new = G x_new, v_new = D x_new
For an old user:
E(u_i | Rest) = (lambda I + sum_{j in N_i} v_j v_j')^{-1} (lambda G x_i + sum_{j in N_i} y_ij v_j)
– A linear combination of the prior regression function and the user's feedback on items
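A small numpy check of this shrinkage formula (all inputs invented for illustration): with no observed items the posterior mean falls back to the prior regression G x_i, and observed feedback pulls the estimate toward the responses.

```python
import numpy as np

def posterior_user_factor(G, x_i, V, y, lam):
    """E(u_i | Rest) = (lam*I + sum_j v_j v_j')^-1 (lam*G x_i + sum_j y_ij v_j)."""
    r = G.shape[0]
    A = lam * np.eye(r) + V.T @ V     # V: one row per rated item j in N_i, holding v_j
    b = lam * (G @ x_i) + V.T @ y     # y: the responses y_ij
    return np.linalg.solve(A, b)

G = np.array([[0.5, 0.0], [0.0, 0.5]])   # prior regression weights
x_i = np.array([1.0, 1.0])               # user covariates
V = np.array([[1.0, 0.0], [0.0, 1.0]])   # factors of two rated items
y = np.array([1.0, 0.0])                 # one positive, one negative response
u_hat = posterior_user_factor(G, x_i, V, y, lam=1.0)

# A brand-new user (empty N_i) should recover the prior G x_i exactly.
cold_start = posterior_user_factor(G, x_i, np.zeros((0, 2)), np.zeros(0), lam=1.0)
```

With these numbers the prior puts the factor at (0.5, 0.5); the click on item 1 and skip of item 2 shrink the estimate to (0.75, 0.25), the precision-weighted blend the slide describes.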
19. Estimating the regression function via EM
Maximize prod_{ij} int f(u_i, v_j, Data) g(u_i, G) g(v_j, D) du_i dv_j
The integral cannot be computed in closed form; it is approximated by Monte Carlo using Gibbs sampling.
For the logistic model, we use adaptive rejection sampling (ARS; Gilks and Wild) to sample the latent factors within the Gibbs sampler.
20. Scaling to large data on Hadoop
Randomly partition by users in the Map step
Run a separate model on each partition
– Care is taken to initialize each partition's model with the same values; constraints on the factors ensure identifiability within each partition
Create ensembles by using different user partitions; average across ensembles to obtain estimates of user factors and regression functions
– Estimates of user factors across ensembles are uncorrelated, so averaging reduces variance
21. Data Example
1B events, 8M users, 6K articles
Offline training produced the user factors u_i
Baseline: online logistic regression without u_i
– Covariate-only online logistic model: logit(p_ijt) = x_it' b_jt
Overall click lift: 9.7%
Heavy users (> 10 clicks last month): 26%
Cold users (not seen in the past): 3%
22. Click-lift for heavy users
[figure: CTR lift relative to the covariate-only logistic model]
23. Computational Advertising: Matching ads to opportunities
[diagram: a user's visit to a publisher's page creates a display opportunity; the ad network picks the best ads from the advertisers. Examples of networks: Yahoo, Google, MSN, ad exchanges (networks of "networks"), ...]
24. Ad exchange (RightMedia) [Agarwal et al., KDD 2010]
Advertisers participate in different ways
– CPM (pay per ad-view)
– CPC (pay per click)
– CPA (pay per conversion)
To conduct an auction, normalize across pricing types by computing eCPM (expected CPM)
– Click-based: eCPM = click-rate * CPC
– Conversion-based: eCPM = conv-rate * CPA
Similar strategy of computing offline and online components
– Process 90B records for each model fit
– The model has hundreds of millions of parameters
– The model is fully deployed on RightMedia today
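The eCPM normalization can be sketched as follows. This is a hedged illustration, not RightMedia's code: the per-1000-impressions scaling and all rates and prices are my assumptions, added so the three pricing types land on one comparable scale.

```python
def ecpm(ad):
    """Expected revenue per 1000 impressions, normalized across pricing types."""
    if ad["pricing"] == "CPM":
        return ad["price"]                            # already priced per 1000 views
    if ad["pricing"] == "CPC":
        return 1000 * ad["click_rate"] * ad["price"]  # eCPM = click-rate * CPC
    if ad["pricing"] == "CPA":
        return 1000 * ad["conv_rate"] * ad["price"]   # eCPM = conv-rate * CPA
    raise ValueError(ad["pricing"])

ads = [
    {"id": "m", "pricing": "CPM", "price": 2.0},
    {"id": "c", "pricing": "CPC", "price": 0.5, "click_rate": 0.01},
    {"id": "a", "pricing": "CPA", "price": 20.0, "conv_rate": 0.0002},
]
winner = max(ads, key=ecpm)   # the auction compares all ads on the eCPM scale
```

Once every bid is an eCPM, a single auction can compare a $2 CPM ad against a $0.50 CPC ad and a $20 CPA ad; the predicted click-rate and conversion-rate are exactly where the response-prediction models of the earlier slides plug in.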
25. Summary
Estimating interactions in high-dimensional sparse data is important in web applications
Scaling such models to Big Data is a challenging statistical problem
Combining offline + online modeling with explore/exploit is a good practical strategy
26. Some Challenges
Very high-dimensional modeling with very large and noisy data
– A few categorical variables with large numbers of levels interact with each other to produce the response
– Scalability
Designing sequential experiments
– Multi-armed bandits are back in a big way
Data fusion
– From multiple and disparate sources
Availability of data, and the ability to run experiments, to researchers
Editor's Notes
Important module: hundreds of millions of user visits.