Statistical Computing
For Big Data
Deepak Agarwal
LinkedIn Applied Relevance Science
dagarwal@linkedin.com
ENAR 2014, Baltimore, USA
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop
System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Big Data becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
Big Data: Some size estimates
• 1000 human genomes: > 100TB of data (1000
genomes project)
• Sloan Digital Sky Survey: 200GB data per night
(>140TB aggregated)
• Facebook: A billion monthly active users
• LinkedIn: 225M members worldwide
• Twitter: 500 million tweets a day
• Over 6 billion mobile phones in the world
generating data every day
Big Data: Paradigm shift
• Classical Statistics
– Generalize using small data
• Paradigm Shift with Big Data
– We now have an almost infinite supply of data
– Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
– Not quite
• More data comes with more heterogeneity
• Need to change our statistical thinking to adapt
– Classical statistics still invaluable to think about big data analytics
Some Statistical Challenges
• Exploratory Analysis (EDA), Visualization
– Retrospective (on Terabytes)
– More Real Time (streaming computations every
few minutes/hours)
• Statistical Modeling
– Scale (computational challenge)
– Curse of dimensionality
• Millions of predictors, heterogeneity
– Temporal and Spatial correlations
Statistical Challenges continued
• Experiments
– To test new methods, test hypothesis from
randomized experiments
– Adaptive experiments
• Forecasting
– Planning, advertising
• Many more that I am not fully versed in
Defining Big Data
• How do you know you have a big data problem?
– Is it only the number of terabytes?
– What about dimensionality, structured/unstructured, computations required, …
• No clear definition, different points of view
– When desired computation cannot be completed
in the stipulated time with current best algorithm
using cores available on a commodity PC
Distributed Computing for Big Data
• Distributed computing invaluable tool to scale
computations for big data
• Some distributed computing models
– Multi-threading
– Graphics Processing Units (GPU)
– Message Passing Interface (MPI)
– Map-Reduce
Evaluating a method for a problem
• Scalability
– Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
– Especially in an industrial environment
• Cost
– Hardware and maintenance cost
• Good for the computations required?
– E.g., Iterative versus one pass
• Resource sharing
Multithreading
• Multiple threads take advantage of multiple
CPUs
• Shared memory
• Threads can execute independently and
concurrently
• Can only handle Gigabytes of data
• Reliable
Graphics Processing Units (GPU)
• Number of cores:
– CPU: Order of 10
– GPU: smaller cores
• Order of 1000
• Can be >100x faster than CPU
– Parallel computationally intensive tasks off-loaded to GPU
• Good for certain computationally-intensive tasks
• Can only handle Gigabytes of data
• Not trivial to use; requires a good understanding of the low-level architecture
for efficient use
– But things are changing; it is getting more user-friendly
Message Passing Interface (MPI)
• Language independent communication
protocol among processes (e.g. computers)
• Most suitable for master/slave model
• Can handle Terabytes of data
• Good for iterative processing
• Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)
[Diagram: Data → Mappers → Reducers → Output]
• Computation split into Map (scatter) and Reduce (gather) stages
• Easy to use:
– User needs to implement only two functions: Mapper and Reducer
• Easily handles Terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods

                             | Multithreading | GPU       | MPI       | Map-Reduce
Scalability (data size)      | Gigabytes      | Gigabytes | Terabytes | Terabytes
Fault tolerance              | High           | High      | Low       | High
Maintenance cost             | Low            | Medium    | Medium    | Medium-High
Iterative process complexity | Cheap          | Cheap     | Cheap     | Usually expensive
Resource sharing             | Hard           | Hard      | Easy      | Easy
Easy to implement?           | Easy           | Needs understanding of low-level GPU architecture | Easy | Easy
Example Problem
• Tabulating word counts in a corpus of documents
• Similar to table function in R
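For intuition, here is the sequential analogue in R; a minimal sketch, where docs is a hypothetical character vector of documents:

docs <- c("Hello World", "Bye World", "Hello Hadoop", "Goodbye Hadoop")
words <- unlist(strsplit(docs, " "))  # split each document into words
table(words)                          # counts per word, like the combined reducer output

Map-Reduce distributes exactly this computation across machines, as the next slide shows.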
Word Count Through Map-Reduce

[Diagram]
Mapper 1 input: "Hello World", "Bye World" → emits <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>
Mapper 2 input: "Hello Hadoop", "Goodbye Hadoop" → emits <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1>
Reducer 1 (words from A-G) outputs: <Bye, 1>, <Goodbye, 1>
Reducer 2 (words from H-Z) outputs: <Hello, 2>, <World, 2>, <Hadoop, 2>
Key Ideas about Map-Reduce

[Diagram: Big Data → Partition 1, Partition 2, …, Partition N → Mapper 1, Mapper 2, …, Mapper N → <Key, Value> pairs → Reducer 1, Reducer 2, …, Reducer M → Output 1, …, Output M]
Key Ideas about Map-Reduce
• Data are split into partitions and stored in many
different machines on disk (distributed storage)
• Mappers process data chunks independently and
emit <Key, Value> pairs
• Data with the same key are sent to the same
reducer. One reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values
corresponding to the key according to the
custom reducer function and outputs the result
Compute Mean for Each Group

ID | Group No. | Score
1  | 1 | 0.5
2  | 3 | 1.0
3  | 1 | 0.8
4  | 2 | 0.7
5  | 2 | 1.5
6  | 3 | 1.2
7  | 1 | 0.8
8  | 2 | 0.9
9  | 4 | 1.3
…  | … | …
Key Ideas about Map-Reduce
• Data are split into partitions and stored in many different machines on
disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
– For each row:
• Key = Group No.
• Value = Score
• Data with the same key are sent to the same reducer. One reducer can
receive multiple keys
– E.g. 2 reducers
– Reducer 1 receives data with key = 1, 2
– Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
– E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]>
• For each key, the reducer processes the values corresponding to the key
according to the custom reducer function and outputs the result
– E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
Pseudo Code (in R)

What you need to implement:

Mapper:
# Input: Data (rows with groupNo and score)
for (i in 1:nrow(Data)) {
  groupNo <- Data$groupNo[i]
  score <- Data$score[i]
  Output(c(groupNo, score))
}

Reducer:
# Input: Key (groupNo), Values (a list of scores that belong to the Key)
total <- 0
count <- 0
for (v in Values) {
  total <- total + v
  count <- count + 1
}
Output(c(Key, total / count))
Exercise 1
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this?

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
• Problem: Average height per Grade and Gender?
• What should be the mapper output key?
– {Grade, Gender}
• What should be the mapper output value?
– Height
• What is the reducer input?
– Key: {Grade, Gender}, Value: List of Heights
• What is the reducer output?
– {Grade, Gender, mean(Heights)}

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
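A minimal R sketch of this solution, in the same pseudo-code style as before (Output() and the map-to-reduce shuffling are assumed to be supplied by the framework):

Mapper:
for (i in 1:nrow(Data)) {
  key <- paste(Data$grade[i], Data$gender[i])  # composite key {Grade, Gender}
  Output(c(key, Data$height[i]))               # value = height
}

Reducer:
# Input: Key ({Grade, Gender}), Values (list of heights for the Key)
heights <- unlist(Values)
Output(c(Key, mean(heights)))                  # {Grade, Gender, mean(Heights)}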
Exercise 2
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this?

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
– {Grade, Gender}
• What should be the mapper output value?
– 1
• What is the reducer input?
– Key: {Grade, Gender}, Value: List of 1's
• What is the reducer output?
– {Grade, Gender, sum(value list)}
– OR: {Grade, Gender, length(value list)}

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
More on Map-Reduce
• Depends on distributed file systems
• Typically mappers run on the data-storage nodes (data locality)
• Map/Reduce tasks automatically get restarted
when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
– Data transmission from mappers to reducers is
through disk copy
• Iterative process through Map-Reduce
– Each iteration becomes a map-reduce job
– Can be expensive since map-reduce overhead is high
The Apache Hadoop System
• Open-source software for
reliable, scalable, distributed computing
• The most popular distributed computing
system in the world
• Key modules:
– Hadoop Distributed File System (HDFS)
– Hadoop YARN (job scheduling and cluster resource
management)
– Hadoop MapReduce
Major Tools on Hadoop
• Pig
– A high-level language for Map-Reduce computation
• Hive
– A SQL-like query language for data querying via Map-Reduce
• HBase
– A distributed & scalable database on Hadoop
– Allows random, real-time read/write access to big data
– Voldemort is similar to HBase
• Mahout
– A scalable machine learning library
• …
Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
– http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
– http://hadoop.apache.org/docs/stable/cluster_setup.html
Hadoop Distributed File System (HDFS)
• Master/Slave architecture
• NameNode: a single master node that controls which
data block is stored where.
• DataNodes: slave nodes that store data and do R/W
operations
• Clients (Gateway): Allow users to log in and interact
with HDFS and submit Map-Reduce jobs
• Big data is split into equal-sized blocks; each block can be
stored on different DataNodes
• Disk failure tolerance: data is replicated multiple times
Load the Data into Pig
• A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
– The path of the data on HDFS goes right after LOAD
• USING PigStorage() means the data are delimited by tab (can be omitted)
• If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop
System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Bag of Little Bootstraps
Kleiner et al. 2012
Bootstrap (Efron, 1979)
• A re-sampling based method to obtain statistical
distribution of sample estimators
• Why are we interested ?
– Re-sampling is embarrassingly parallelizable
• For example:
– Standard deviation of the mean of N samples (μ)
– For i = 1 to r do
• Randomly sample with replacement N times from the original
sample -> bootstrap data i
• Compute mean of the i-th bootstrap data -> μi
– Estimate of Sd(μ) = Sd([μ1, …, μr])
– r is usually a large number, e.g. 200
Bootstrap for Big Data
• Can have r nodes running in parallel, each
sampling one bootstrap data
• However…
– N can be very large
– Data may not fit into memory
– Collecting N samples with replacement on each
node can be computationally expensive
M out of N Bootstrap
(Bickel et al. 1997)
• Obtain SdM(μ) by sampling M samples with
replacement for each bootstrap, where M<N
• Apply analytical correction to SdM(μ) to obtain
Sd(μ) using prior knowledge of convergence rate
of sample estimates
• However…
– Prior knowledge is required
– Choice of M is critical to performance
– Finding optimal value of M needs more computation
Bag of Little Bootstraps (BLB)
• Example: Standard deviation of the mean
• Generate S sampled data sets, each obtained by random
sampling without replacement a subset of size b (or
partition the original data into S partitions, each with size
b)
• For each subset p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the subset of size b
• Compute the mean of the resampled data -> μpi
– Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
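A self-contained R sketch of this procedure for Sd(μ) on simulated data; b = N^0.6, S = 5, and r = 100 are illustrative choices, and the multinomial weights avoid materializing the N resampled points:

set.seed(1)
x <- rnorm(1e5)                          # the "big" data, size N
N <- length(x); b <- floor(N^0.6)        # subset size b = N^gamma
S <- 5; r <- 100
sd.p <- numeric(S)
for (p in 1:S) {
  xb <- sample(x, b)                     # subset of size b, without replacement
  mu <- numeric(r)
  for (i in 1:r) {
    w <- rmultinom(1, N, rep(1 / b, b))  # how often each of the b points is drawn
    mu[i] <- sum(w * xb) / N             # mean of the implicit size-N resample
  }
  sd.p[p] <- sd(mu)
}
mean(sd.p)                               # BLB estimate; compare with sd(x) / sqrt(N)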
Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from
size N data
– ξ is some function of θ, such as standard deviation, …
• Generate S sampled data sets, each obtained from random
sampling without replacement a subset of size b (or partition
the original data into S partitions, each with size b)
• For each subset p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the subset of size b
• Compute the estimate θpi from the resampled data
– Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
Bag of Little Bootstraps (BLB)
[Diagram: the same algorithm mapped onto Hadoop; the inner resampling loops for each subset run in Mappers/Reducers, and the Gateway averages the S estimates ξp(θ).]
Why is BLB Efficient
• Before:
– N samples with replacement from size N data is
expensive when N is large
• Now:
– N samples with replacement from size b data
– b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ in [0.5, 1))
– Equivalent to: a multinomial sampler with dim = b
– Storage = O(b), computational complexity = O(b)
Simulation Experiment
• 95% CI of Logistic Regression Coefficients
• N = 20000, 10 explanatory variables
• Relative Error = |Estimated CI width – True CI
width | / True CI width
• BLB-γ: BLB with b = Nγ
• BOFN-γ: b out of N sampling with b = Nγ
• BOOT: Naïve bootstrap
Simulation Experiment
Real Data
• 95% CI of Logistic Regression Coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB, r = 50, S = 5, γ = 0.7
Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
– Fast and efficient
– Easy to parallelize
– Easy to understand and implement
– Friendly to Hadoop, makes it routine to perform
statistical calculations on Big data
Large Scale Logistic Regression
Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 − pi)) = Xi' β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)
Large Scale Logistic Regression
• Binary response: Y
– E.g., Click / Non-click on an ad on a webpage
• Covariates: X
– User covariates:
• Age, gender, industry, education, job, job title, …
– Item covariates:
• Categories, keywords, topics, …
– Context covariates:
• Time, page type, position, …
– 2-way interaction:
• User covariates X item covariates
• Context covariates X item covariates
• …
Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a
single machine is not feasible
• Model fitting is iterative, using methods like
gradient descent, Newton's method, etc.
– Multiple passes over the data
Recap on Optimization method
• Problem: Find x to min F(x)
• Iteration n: x_n = x_{n−1} − b_{n−1} F′(x_{n−1})
• b_{n−1} is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, LBFGS, Newton trust region, … are all of this kind
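A toy R illustration of this recipe on F(x) = (x − 3)², with a fixed step size (both are assumptions for the example):

grad <- function(x) 2 * (x - 3)      # F(x) = (x - 3)^2, so F'(x) = 2(x - 3)
x <- 0; b <- 0.1                     # starting point and step size
for (n in 1:1000) {
  x.new <- x - b * grad(x)           # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
  if (abs(x.new - x) < 1e-10) break  # iterate until convergence
  x <- x.new
}
x                                    # converges to the minimizer, 3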
Iterative Process with Hadoop
[Diagram: each iteration is a chain Disk → Mappers → Disk → Reducers → Disk, repeated across iterations]
Limitations of Hadoop for fitting a big
logistic regression
• Iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• I/O of mapper and reducers are both through
disk
• Plus: Waiting in queue time
• Q: Can we find a fitting method that scales
with Hadoop ?
Large Scale Logistic Regression
• Naïve:
– Partition the data and run logistic regression for each partition
– Take the mean of the learned coefficients
– Problem: Not guaranteed to converge to the model from a single
machine!
• Alternating Direction Method of Multipliers (ADMM)
– Boyd et al. 2011
– Set up constraints: each partition's coefficient = global
consensus
– Solve the optimization problem using Lagrange multipliers
– Advantage: guaranteed to converge to the single-machine logistic
regression on the entire data within a reasonable number of
iterations
Large Scale Logistic Regression via ADMM
[Diagram: BIG DATA is split into Partition 1, Partition 2, Partition 3, …, Partition K. In each iteration, a logistic regression is fit on every partition in parallel, followed by a consensus computation that combines the partition fits; the consensus is fed back into the next iteration (Iteration 1, Iteration 2, …).]
Details of ADMM
Dual Ascent Method
• Consider a convex optimization problem: min f(x) subject to Ax = b
• Lagrangian for the problem: L(x, y) = f(x) + y'(Ax − b)
• Dual Ascent: x_{k+1} = argmin_x L(x, y_k), then y_{k+1} = y_k + α_k (A x_{k+1} − b)
Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict
convexity or finiteness of f
• L_ρ(x, y) = f(x) + y'(Ax − b) + (ρ/2) ||Ax − b||₂²
• The value of ρ influences the convergence rate
Alternating Direction Method of Multipliers (ADMM)
• Problem: min f(x) + g(z) subject to Ax + Bz = c
• Augmented Lagrangian: L_ρ(x, z, y) = f(x) + g(z) + y'(Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||₂²
• ADMM:
  x_{k+1} = argmin_x L_ρ(x, z_k, y_k)
  z_{k+1} = argmin_z L_ρ(x_{k+1}, z, y_k)
  y_{k+1} = y_k + ρ (A x_{k+1} + B z_{k+1} − c)
Large Scale Logistic Regression via ADMM
• Notation
– (Xi, yi): data in the ith partition
– βi: coefficient vector for partition i
– β: consensus coefficient vector
– r(β): penalty component such as ||β||₂²
• Optimization problem: minimize Σi [logistic loss of βi on (Xi, yi)] + r(β), subject to βi = β for all partitions i
ADMM updates
[Diagram: each iteration alternates local regressions (the βi updates, with shrinkage towards the current best global estimate) and an updated consensus computed from them.]
An example implementation
• ADMM for Logistic regression model fitting with
L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
– Mapper: partition the data into K partitions
– Reducer: For each partition, use liblinear/glmnet to fit
a L1/L2 logistic regression
– Gateway: computes the consensus from the results of all
reducers, and sends the consensus back to each
reducer node
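A single-machine R simulation of this loop, as a rough sketch: optim() stands in for liblinear/glmnet on each partition, the consensus step plays the gateway's role, and rho, lambda, and the data shapes are illustrative assumptions.

admm.logistic <- function(X, y, K = 4, rho = 1, lambda = 1, iters = 20) {
  d <- ncol(X)
  part <- split(seq_len(nrow(X)), rep(1:K, length.out = nrow(X)))  # K partitions
  beta <- matrix(0, K, d)   # per-partition coefficients
  u    <- matrix(0, K, d)   # scaled dual variables
  z    <- rep(0, d)         # consensus coefficients
  for (it in 1:iters) {
    for (k in 1:K) {        # "reducers": fit each partition with a proximal term
      Xk <- X[part[[k]], , drop = FALSE]; yk <- y[part[[k]]]
      obj <- function(b) {
        eta <- drop(Xk %*% b)
        sum(log(1 + exp(eta)) - yk * eta) +      # logistic log-loss
          (rho / 2) * sum((b - z + u[k, ])^2)    # shrink towards consensus
      }
      beta[k, ] <- optim(beta[k, ], obj, method = "BFGS")$par
    }
    # "gateway": consensus update (closed form for the L2 penalty lambda * ||z||^2)
    z <- colMeans(beta + u) * (K * rho) / (K * rho + 2 * lambda)
    u <- u + beta - matrix(z, K, d, byrow = TRUE)  # dual updates
  }
  z
}
# e.g.: X <- matrix(rnorm(2000), 500); y <- rbinom(500, 1, plogis(X %*% c(1, -1, 0.5, 0)))
# admm.logistic(X, y)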
KDD CUP 2010 Data
• Bridge to Algebra 2008-2009 data in
https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keep covariates with >= 10 occurrences
=> 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
Avg Training Loglikelihood vs Number
of Iterations
Test AUC vs Number of Iterations
Better Convergence Can
Be Achieved By
• Better Initialization
– Use results from Naïve method to initialize the
parameters
• Adaptively change step size (ρ) for each
iteration based on the convergence status of
the consensus
Recommender Problems for
Web Applications
Agenda
• Topic of Interest
– Recommender problems for dynamic, time-sensitive applications
• Content Optimization, Online Advertising, Movie
recommendation, shopping,…
• Introduction
• Offline components
– Regression, Collaborative filtering (CF), …
• Online components + initialization
– Time-series, online/incremental
methods, explore/exploit (bandit)
• Evaluation methods + Multi-Objective
• Challenges
Three components we will focus on
• Defining the problem
– Formulate objectives whose optimization achieves some long-
term goals for the recommender system
• E.g. How to serve content to optimize audience reach and
engagement, or optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
– Predict rates of some positive user interaction(s) with items
based on data obtained from historical user-item interactions
• E.g. Click rates, average time-spent on page, etc
• Could be explicit feedback like ratings
• Experimentation
– Create experiments to collect data proactively to improve
models, helps in converging to the best choice(s) cheaply and
rapidly.
• Explore and Exploit (continuous experimentation)
• DOE (testing hypotheses by avoiding bias inherent in data)
Modern Recommendation Systems
• Goal
– Serve the right item to a user in a given context to
optimize long-term business objectives
• A scientific discipline that involves
– Large scale Machine Learning & Statistics
• Offline Models (capture global & stable characteristics)
• Online Models (incorporates dynamic components)
• Explore/Exploit (active and adaptive experimentation)
– Multi-Objective Optimization
• Click-rates (CTR), Engagement, advertising revenue, diversity, etc
– Inferring user interest
• Constructing User Profiles
– Natural Language Processing to understand content
• Topics, “aboutness”, entities, follow-up of something, breaking news,…
Some examples from content
optimization
• Simple version
– I have a content module on my page, content inventory is
obtained from a third party source which is further refined
through editorial oversight. Can I algorithmically
recommend content on this module? I want to improve
overall click-rate (CTR) on this module
• More advanced
– I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I
increase downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
• Recommend applications
• Recommend search queries
• Recommend news articles
• Recommend packages: image, title, summary, links to other pages
– Pick 4 out of a pool of K (K = 20 ~ 40), dynamic
– Routes traffic to other pages
Problems in this example
• Optimize CTR on multiple modules
– Today Module, Trending Now, Personal Assistant,
News
– Simple solution: Treat modules as independent,
optimize separately. May not be the best when
there are strong correlations.
• For any single module
– Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
Online Advertising
[Diagram: Advertisers submit ads and bids to an Ad Network; for each page view, an auction selects argmax f(bid, response rates) and the best ad(s) are recommended to the user on the publisher's page. An ML/statistical model estimates response rates (click, conversion, ad-view) from observed clicks and conversions. Examples: Yahoo, Google, MSN, …; ad exchanges (RightMedia, DoubleClick, …).]
LinkedIn Today: Content Module
Objective: Serve content to maximize engagement
metrics like CTR (or weighted CTR)
LinkedIn Ads: Match ads to users
visiting LinkedIn
Right Media Ad Exchange: Unified
Marketplace
Match ads to page views on publisher sites
[Diagram: a publisher has an ad impression to sell, which goes to auction. Bidders include AdSense and Ad.com; bids of $0.50 and $0.60 compete, another bidder bids $0.75 via a network, which becomes a $0.45 bid after fees, and a $0.65 bid wins.]
Recommender problems in general
[Diagram: A USER arrives in some context (query, page, …). An automated algorithm selects item(s) to show from the item inventory (articles, web pages, ads, …); feedback (click, time spent, …) is collected; the models are refined; repeat (a large number of times), optimizing metric(s) of interest (total clicks, total revenue, …).]
• Example applications: Search (web, vertical), online advertising, content, …
Important Factors
• Items: Articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement)
– Currently, most applications are single-objective
– Could be multi-objective optimization (maximize X subject to Y, Z, …)
• Properties of the item pool
– Size (e.g., all web pages vs. 40 stories)
– Quality of the pool (e.g., anything vs. editorially selected)
– Lifetime (e.g., mostly old items vs. mostly new items)
Factors affecting Solution
(continued)
• Properties of the context
– Pull: Specified by explicit, user-driven query (e.g., keywords, a form)
– Push: Specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on a continuum of pull and push
• Properties of the feedback on the matches
made
– Types and semantics of feedback (e.g., click, vote)
– Latency (e.g., available in 5 minutes vs. 1 day)
– Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches
– e.g., business rules, diversity rules, editorial voice
– Multiple objectives
• Available Metadata (e.g., link graph, various user/item attributes)
Predicting User-Item Interactions
(e.g. CTR)
• Myth: We have so much data on the web that if we
could only process it, the problem would be solved
– Number of things to learn increases with sample size
• Rate of increase is not slow
– Dynamic nature of systems make things worse
– We want to learn things quickly and react fast
• Data is sparse in web recommender problems
– We lack enough data to learn all we want to learn,
as quickly as we would like to learn it
– Several Power laws interacting with each other
• E.g. User visits power law, items served power law
– Bivariate Zipf: Owen & Dyer, 2011
Can Machine Learning help?
• Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable
– E.g. Users in San Francisco tend to read more baseball news
• Key issue: Estimating such groups
– Coarse group: more stable but does not generalize that well
– Granular group: less stable, with few individuals
– Getting a good grouping structure means hitting the "sweet spot"
• Another big advantage on the web
– Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choice(s)
• We don't need to learn all user-item interactions, only those that are good.
Predicting user-item interaction rates
[Diagram: three stages feeding each other.]
• Offline models (capture stable characteristics at coarse resolutions): logistic, boosting, …
– Feature construction: content (IR, clustering, taxonomy, entity, …), user profiles (clicks, views, social, community, …)
• Near-online models (finer-resolution corrections at the item/user level; quick updates), initialized by the offline models
• Explore/exploit (adaptive sampling; helps rapid convergence to best choices)
Post-click: An example in Content Optimization
[Diagram: recommender and editorial content on the front page (FP); clicks on FP links influence the downstream supply distribution, which drives display advertising revenue (ad server) and downstream engagement (time spent).]
Serving Content on Front Page: Click Shaping
• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
– Article 1: CTR = 5%, utility per click = 5
– Article 2: CTR = 4.9%, utility per click = 10
• By promoting 2, we lose 0.1 clicks per 100 visits but gain 24 utils (49 vs. 25 per 100 visits)
• If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
– E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc.)
High level picture
[Diagram: an http request hits the Server; the item recommendation system performs thousands of computations in sub-seconds; statistical models are updated in batch mode (e.g., once every 30 mins); the user interacts (e.g., clicks, or does nothing), feeding data back.]
High level overview: Item Recommendation System
[Diagram: User info and the item index (id, meta-data) feed the scoring pipeline. Items pass through a pre-filter (spam, editorial, …) and feature extraction (NLP, clustering, …). ML/statistical models score items (P(click), P(share), semantic-relevance score, …); items are then ranked: sort by score (CTR, bid*CTR, …), combine scores using multi-objective optimization, threshold on some scores, …. User-item interaction data is batch processed; activity and profile are updated in batch.]
ML/Statistical models for scoring
[Figure: applications arranged by number of items scored by ML (from ~100 up to ~100M) and traffic volume, with item lifetimes from a few hours to several days: LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, LinkedIn Ads.]
Summary of deployments
• Yahoo! Front page Today Module (2008-2011): 300% improvement
in click-through rates
– Similar algorithms delivered via a self-serve platform, adopted by
several Yahoo! Properties (2011): Significant improvement in
engagement across Yahoo! Network
• Fully deployed on LinkedIn Today Module (2012): Significant
improvement in click-through rates (numbers not revealed due to
reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): Fully deployed algorithms to
estimate response rates (CTR, conversion rates). Significant
improvement in revenue (numbers not revealed due to reasons of
confidentiality)
• LinkedIn self-serve ads (2012-2013): Fully deployed
• LinkedIn News Feed (2013-2014): Fully deployed
• Several others in progress….
Broad Themes
• Curse of dimensionality
– Large number of observations (rows), large number of potential features
(columns)
– Use domain knowledge and machine learning to reduce the “effective”
dimension (constraints on parameters reduce degrees of freedom)
• I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have
control over what data to collect through clever experimentation
– This can fundamentally change solutions
• Think of computation and models together for Big Data
• Optimization: What we are trying to optimize is often complex; models need
to work in harmony with the optimization
– Pareto optimality with competing objectives
Statistical Problem
• Rank items (from an admissible pool) for user visits in some
context to maximize a utility of interest
• Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR * P(Share | Click))
– Revenue per page-view = CTR*bid (more complex due to second price
auction)
• CTR is a fundamental measure that opens the door to a more
principled approach to rank items
• Converge rapidly to maximum utility items
– Sequential decision making process (explore/exploit)
[Diagram: User i (with user features, e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; response yij (click or not) is observed.]
Which item should we select?
• The item with the highest predicted CTR → Exploit
• An item for which we need data to predict its CTR → Explore
LinkedIn Today, Yahoo! Today Module:
Choose Items to maximize CTR
This is an “Explore/Exploit” Problem
The Explore/Exploit Problem (to
maximize CTR)
• Problem definition: Pick k items from a pool of N for a large
number of serves to maximize the number of clicks on the
picked items
• Easy!? Pick the items having the highest click-through rates
(CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item may change over time
– How much traffic should be allocated to explore new items to
achieve optimal performance?
• Too little → Unreliable CTR estimates due to "starvation"
• Too much → Little traffic to exploit the high-CTR items
Y! front Page Application
• Simplify: Maximize CTR on first slot (F1)
• Item Pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the item pool is dynamic
CTR Curves of Items on LinkedIn Today
[Figure: CTR of several items over time]
Impact of repeat item views on a given
user
• Same user is shown an item multiple times
(despite not clicking)
Simple algorithm to estimate most popular
item with small but dynamic item pool
• Simple explore/exploit scheme
– ε% explore: with a small probability (e.g. 5%), choose an item at
random from the pool
– (100 − ε)% exploit: with large probability (e.g. 95%), choose the
highest-scoring CTR item
• Temporal Smoothing
– Item CTRs change over time, provide more weight to recent data in
estimating item CTRs
• Kalman filter, moving average
• Discount item score with repeat views
– CTR(item) for a given user drops with repeat views by some “discount”
factor (estimated from data)
• Segmented most popular
– Perform separate most-popular for each user segment
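A minimal R sketch of the ε-greedy serve decision (the per-item CTR estimates and their temporal smoothing are assumed to be maintained elsewhere):

serve.item <- function(ctr.est, eps = 0.05) {
  if (runif(1) < eps)
    sample.int(length(ctr.est), 1)   # explore: random item from the pool
  else
    which.max(ctr.est)               # exploit: highest estimated CTR
}
# e.g. serve.item(c(0.020, 0.031, 0.025))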
Time series Model: Kalman filter
• Dynamic Gamma-Poisson: click-rate evolves over time
in a multiplicative fashion
• Estimated click-rate distribution at time t+1
– Prior mean: μ_{t+1} = μ_{t|t}
– Prior variance: σ²_{t+1} = σ²_{t|t} + η (σ²_{t|t} + μ²_{t|t})
• High CTR items are more adaptive
More economical exploration? Better bandit solutions
• Consider a two-armed problem with unknown payoff probabilities p1 > p2
• The gambler has 1000 plays; what is the best way to experiment
(to maximize total expected reward)?
• This is called the "multi-armed bandit" problem and has been studied for a long time
• Optimal solution: Play the arm that has the maximum potential of being good
– Optimism in the face of uncertainty
Item Recommendation: Bandits?
• Two Items: Item 1 CTR= 2/100 ; Item 2 CTR= 250/10000
– Greedy: Show Item 2 to all; not a good idea
– Item 1 CTR estimate noisy; item could be potentially
better
• Invest in Item 1 for better overall performance on average
– Exploit what is known to be good, explore what is potentially good
[Figure: posterior CTR densities; Item 1's is wide (noisy estimate), Item 2's is narrow. x-axis: CTR; y-axis: probability density.]
Next few hours

                           | Most Popular Recommendation | Personalized Recommendation
Offline Models             | -                   | Collaborative filtering (cold-start problem)
Online Models              | Time-series models  | Incremental CF, online regression
Intelligent Initialization | Prior estimation    | Prior estimation, dimension reduction
Explore/Exploit            | Multi-armed bandits | Bandits with covariates
Offline Components:
Collaborative Filtering in Cold-start
Situations
Problem
[Diagram: User i (with user features xi: demographics, browse history, search history, …) visits; the algorithm selects item j (with item features xj: keywords, content categories, …); response yij is observed (explicit rating, implicit click/no-click).]
Predict the unobserved entries based on features and the observed entries
Model Choices
• Feature-based (or content-based) approach
– Use features to predict response
• (regression, Bayes Net, mixture models, …)
– Limitation: need predictive features
• Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF aka Memory based)
– Make recommendation based on past user-item interaction
• User-user, item-item, matrix factorization, …
• See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
– Better performance for old users and old items
– Does not naturally handle new users and new items (cold-
start)
Collaborative Filtering (Memory based methods)
• User-user similarity; item-item similarities; methods incorporating both
• Estimating similarities:
– Pearson's correlation
– Optimization based (Koren et al)
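A small R sketch of the user-user variant (R is a users × items rating matrix with NAs for unrated entries; a toy version, not the optimization-based estimator):

user.sim <- function(R, i, j) {                  # Pearson's correlation on co-rated items
  common <- !is.na(R[i, ]) & !is.na(R[j, ])
  if (sum(common) < 2) return(0)
  s <- suppressWarnings(cor(R[i, common], R[j, common]))
  if (is.na(s)) 0 else s
}
predict.rating <- function(R, i, item) {         # predicted rating of `item` for user i
  others <- setdiff(which(!is.na(R[, item])), i)
  base <- mean(R[i, ], na.rm = TRUE)             # user i's mean rating
  if (length(others) == 0) return(base)
  w <- sapply(others, function(j) user.sim(R, i, j))
  if (sum(abs(w)) == 0) return(base)
  dev <- R[cbind(others, item)] - rowMeans(R[others, , drop = FALSE], na.rm = TRUE)
  base + sum(w * dev) / sum(abs(w))              # weighted deviations from user means
}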
How to Deal with the Cold-Start
Problem
• Heuristic-based approaches
– Linear combination of regression and CF models
– Filterbot
• Add user features as pseudo users and do collaborative filtering
– Hybrid approaches
• Use content-based methods to fill up entries, then use CF
• Matrix Factorization
– Good performance on Netflix (Koren, 2009)
• Model-based approaches
– Bilinear random-effects model (probabilistic matrix factorization)
• Good on Netflix data [Ruslan et al ICML, 2009]
– Add feature-based regression to matrix factorization
• (Agarwal and Chen, 2009)
– Add topic discovery (from textual items) to matrix factorization
• (Agarwal and Chen, 2009; Chun and Blei, 2011)
Per-item regression models
• When tracking users by cookies, the distribution of
visit patterns can get extremely skewed
– The majority of cookies have 1-2 visits
• Per item models (regression) based on user
covariates attractive in such cases
Several per-item regressions: Multi-task learning
[Diagram: per-item regressions tied together through a low-dimensional (5-10) matrix B, estimated from retrospective data; captures affinity to old items.]
• Agarwal, Chen and Elango, KDD, 2010
Per-user, per-item models
via bilinear random-effects
model
Motivation
• Data measuring k-way interactions pervasive
– Consider k = 2 for all our discussions
• E.g. User-Movie, User-content, User-Publisher-Ads,….
– Power law on both user and item degrees
• Classical Techniques
– Approximate matrix through a singular value
decomposition (SVD)
• After adjusting for marginal effects (user pop, movie pop,..)
– Does not work
• Matrix highly incomplete, severe over-fitting
– Key issue
• Regularization of eigenvectors (factors) to avoid overfitting
Early work on complete matrices
• Tukey’s 1-df model (1956)
– Rank 1 approximation of small nearly complete
matrix
• Criss-cross regression (Gabriel, 1978)
• Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s)
• Modern day recommender problems
– Highly incomplete, large, noisy.
Latent Factor Models
[Diagram: users and items embedded in a latent space with interpretable directions such as "newsy" and "sporty"; affinity = u'v for factors in Euclidean space, affinity = s'z for factors on the simplex.]
Factorization – Brief Overview
• Latent user factors: (αi, ui = (ui1, …, uin))
• Latent movie factors: (βj, vj = (vj1, …, vjm))
• Interaction: E[yij] = αi + βj + ui' B vj
• (Nn + Mm) parameters will overfit for moderate values of n, m
• Key technical issue: regularization
Latent Factor Models: Different
Aspects
• Matrix Factorization
– Factors in Euclidean space
– Factors on the simplex
• Incorporating features and ratings
simultaneously
• Online updates
Maximum Margin Matrix Factorization (MMMF)
• Complete the matrix by minimizing loss (hinge, squared error)
on observed entries subject to constraints on the trace norm
– Srebro, Rennie, Jaakkola (NIPS 2004)
– Convex; semi-definite programming (expensive, not scalable)
• Fast MMMF (Rennie & Srebro, ICML, 2005)
– Constrain the Frobenius norm of the left and right
eigenvector matrices; not convex but becomes scalable
• Other variation: Ensemble MMMF (DeCoste, ICML 2005)
– Ensembles of partially trained MMMF (some improvements)
Matrix Factorization for Netflix prize data
• Minimize the objective function:
  Σ_{(i,j) ∈ obs} (rij − ui'vj)² + λu Σi ||ui||² + λv Σj ||vj||²
• Simon Funk: Stochastic Gradient Descent
• Koren et al (KDD 2007): Alternating Least Squares
– They moved to SGD later in the competition
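A toy R version of the SGD fit for this objective (ratings given as (i, j, r) triplets; the learning rate and penalty are illustrative settings):

sgd.mf <- function(ratings, N, M, k = 10, lambda = 0.05, lr = 0.01, epochs = 20) {
  U <- matrix(rnorm(N * k, sd = 0.1), N, k)   # user factors
  V <- matrix(rnorm(M * k, sd = 0.1), M, k)   # item factors
  for (e in 1:epochs) {
    for (t in sample(nrow(ratings))) {        # one pass in random order
      i <- ratings[t, 1]; j <- ratings[t, 2]
      err <- ratings[t, 3] - sum(U[i, ] * V[j, ])   # r_ij - u_i'v_j
      ui <- U[i, ]
      U[i, ] <- ui + lr * (err * V[j, ] - lambda * ui)
      V[j, ] <- V[j, ] + lr * (err * ui - lambda * V[j, ])
    }
  }
  list(U = U, V = V)
}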
Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
• Model: rij ~ N(ui'vj, σ²);  ui ~ MVN(0, au I);  vj ~ MVN(0, av I)
• Optimization is through iterated conditional modes
• Other variations: constraining the mean through a sigmoid, using "who-rated-whom"
• Combining with Boltzmann machines also improved performance
Bayesian Probabilistic Matrix Factorization
(Ruslan and Minh, ICML 2008)
• Fully Bayesian treatment using an MCMC approach
– Significant improvement
• Interpretation as a fully Bayesian hierarchical model
shows why that is the case
– Failing to incorporate uncertainty leads to bias in
estimates
– Multi-modal posterior, MCMC helps in converging to a better one
[Figure: graphical model with ratings r and variance components au, av.]
• MCEM is also more resistant to over-fitting
Non-parametric Bayesian matrix completion
(Zhou et al, SAM, 2010)
• Specify rank probabilistically (automatic rank selection):
  yij ~ N(Σ_{k=1}^r zk uik vjk, σ²)
  zk ~ Ber(πk),  πk ~ Beta(a/r, b(r−1)/r)
• Marginally zk ~ Ber(a / (a + b(r−1))), so E(#Factors) = ra / (a + b(r−1))
How to incorporate features:
Deal with both warm start and cold-start
• Models to predict ratings for new pairs
– Warm-start: (user, movie) present in the training data with large
sample size
– Cold-start: At least one of (user, movie) new or has small sample
size
• Rough definition, warm-start/cold-start is a continuum.
• Challenges
– Highly incomplete (user, movie) matrix
– Heavy tailed degree distributions for users/movies
• Large fraction of ratings from small fraction of
users/movies
– Handling both warm-start and cold-start effectively in the
presence of predictive features
Possible approaches
• Large scale regression based on covariates
– Does not provide good estimates for heavy users/movies
– Large number of predictors to estimate interactions
• Collaborative filtering
– Neighborhood based
– Factorization
• Good for warm-start; cold-start dealt with separately
• Single model that handles cold-start and warm-start
– Heavy users/movies → User/movie specific model
– Light users/movies → fallback on regression model
– Smooth fallback mechanism for good performance
Add Feature-based Regression
into
Matrix Factorization
RLFM: Regression-based Latent
Factor Model
Regression-based Factorization
Model (RLFM)
• Main idea: Flexible prior, predict factors
through regressions
• Seamlessly handles cold-start and warm-
start
• Modified state equation to incorporate
covariates
RLFM: Model
Rating that user i gives item j:
  yij ~ N(μij, σ²)  (Gaussian model)
  yij ~ Bernoulli(μij)  (Logistic model, for binary ratings)
  yij ~ Poisson(μij Nij)  (Poisson model, for counts)
where t(μij) = xij' b + αi + βj + ui' vj
• Bias of user i: αi = g0' xi + εi^α, εi^α ~ N(0, σα²)
• Popularity of item j: βj = d0' xj + εj^β, εj^β ~ N(0, σβ²)
• Factors of user i: ui = G xi + εi^u, εi^u ~ N(0, σu² I)
• Factors of item j: vj = D xj + εj^v, εj^v ~ N(0, σv² I)
• Could use other classes of regression models

Graphical representation of the model
[Figure: graphical model relating covariates, latent factors, and ratings]
Advantages of RLFM
• Better regularization of factors
– Covariates “shrink” towards a better centroid
• Cold-start: Fallback regression model (FeatureOnly)
RLFM: Illustration of Shrinkage
[Figure: the first factor value plotted for each user, fitted using Yahoo! FP data]
Model fitting: EM for our class of models
The parameters for RLFM
• Latent parameters: Δ = ({αi}, {βj}, {ui}, {vj})
• Hyper-parameters: Θ = (b, G, D, Au = au I, Av = av I)
The EM algorithm
[Equations lost: computing the posterior mode and the objective being minimized]
Computing the E-step
• Often hard to compute in closed form
• Stochastic EM (Markov Chain EM; MCEM)
– Compute the expectation by drawing samples from the posterior
– Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM)
– Faster but biased hyper-parameter estimates
Monte Carlo E-step
• Through a vanilla Gibbs sampler (conditionals closed form)
• Other conditionals also Gaussian and closed form
• Conditionals of users (movies) sampled simultaneously
• Small number of samples in early iterations, large numbers in
later iterations
M-step (Why MCEM is better than ICM)
• Update G, optimize the regression objective
• Update Au = au I
• ICM ignores the posterior variance terms and underestimates factor variability
– Factors over-shrunk; posterior not explored well
Experiment 1: Better regularization
• MovieLens-100K, avg RMSE using pre-specified splits
• ZeroMean, RLFM and FeatureOnly (no cold-start
issues)
• Covariates:
– Users : age, gender, zipcode (1st digit only)
– Movies: genres
Experiment 2: Better handling of
Cold-start
• MovieLens-1M; EachMovie
• Training-test split based on timestamp
• Same covariates as in Experiment 1.
Experiment 4: Predicting click-rate
on articles
• Goal: Predict click-rate on articles for a user on F1
position
• Article lifetimes short, dynamic updates important
• User covariates:
– Age, Gender, Geo, Browse behavior
• Article covariates
– Content Category, keywords
• 2M ratings, 30K users, 4.5 K articles
Results on Y! FP data
Some other related approaches
• Stern, Herbrich and Graepel, WWW, 2009
– Similar to RLFM, different parametrization and
expectation propagation used to fit the models
• Porteus, Asuncion and Welling, AAAI, 2011
– Non-parametric approach using a Dirichlet process
• Agarwal, Zhang and Mazumdar, Annals of Applied
Statistics, 2011
– Regression + random effects per user regularized
through a Graphical Lasso
Add Topic Discovery into
Matrix Factorization
fLDA: Matrix Factorization through Latent
Dirichlet Allocation
fLDA: Introduction
• Model the rating yij that user i gives to item j as the user's
affinity to the topics that the item has:
  yij = … + Σk sik z̄jk
  (sik = user i's affinity to topic k; z̄jk = Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j)
– Unlike regular unsupervised LDA topic modeling, here the LDA
topics are learnt in a supervised manner based on past rating
data
– fLDA can be thought of as a "multi-task learning" version of the
supervised LDA model [Blei'07] for cold-start recommendation
• Old items: z̄jk's are item latent factors learnt from data with the LDA prior
• New items: z̄jk's are predicted based on the bag of words in the items
[Diagram: per-topic word distributions Φk = [Φk1, …, ΦkW] for Topic 1, …, Topic k, …, Topic K.]
LDA Topic Modeling (1)
• LDA is effective for unsupervised topic discovery [Blei'03]
– It models the generating process of a corpus of items (articles)
– For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η)
– For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)
– For each word, say the nth word, in item j:
• Draw a topic zjn for that word from θj
• Draw a word wjn from Φk with topic k = zjn
[Diagram: item j has topic distribution [θj1, …, θjK]; observed words wj1, …, wjn, …; per-word topics zj1, …, zjn, …; assume zjn = topic k.]
LDA Topic Modeling (2)
• Model training:
– Estimate the prior parameters (λ, η) and the posterior topic × word
distribution Φ based on a training corpus of items
– EM + Gibbs sampling is a popular method
• Inference for new items
– Compute the item topic distribution based on the prior
parameters and Φ estimated in the training phase
• Supervised LDA [Blei'07]
– Predict a target value for each item based on supervised LDA topics:
  yj = Σk sk z̄jk
  (yj = target value of item j; z̄jk = Pr(item j has topic k), estimated by averaging the topic of each word in item j; sk = regression weight for topic k)
– vs. fLDA: yij = … + Σk sik z̄jk
• One regression per user, with the same set of topics across different regressions
fLDA: Model
Rating that user i gives item j:
  yij ~ N(μij, σ²)  (Gaussian model)
  yij ~ Bernoulli(μij)  (Logistic model, for binary ratings)
  yij ~ Poisson(μij Nij)  (Poisson model, for counts)
where t(μij) = xij' b + αi + βj + Σk sik z̄jk
• Bias of user i: αi = g0' xi + εi^α, εi^α ~ N(0, σα²)
• Popularity of item j: βj = d0' xj + εj^β, εj^β ~ N(0, σβ²)
• Topic affinity of user i: si = H xi + εi^s, εi^s ~ N(0, σs² I)
• Pr(item j has topic k): z̄jk = Σn 1(zjn = k) / (#words in item j),
where zjn is the LDA topic of the nth word in item j
• Observed words: wjn ~ LDA(λ, η, zjn), the nth word in item j
Model Fitting
• Given:
– Features X = {xi, xj, xij}
– Observed ratings y = {yij} and words w = {wjn}
• Estimate:
– Parameters: Θ = [b, g0, d0, H, σ², aα, aβ, As, λ, η]
• Regression weights and prior parameters
– Latent factors: Δ = {αi, βj, si} and z = {zjn}
• User factors, item factors and per-word topic assignment
• Empirical Bayes approach:
– Maximum likelihood estimate of the parameters:
  Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
– The posterior distribution of the factors: Pr[Δ, z | y, Θ̂]
The EM Algorithm
• Iterate through the E and M steps until convergence
– Let Θ̂n be the current estimate
– E-step: Compute fn(Θ) = E_{(Δ, z | y, w, Θ̂n)}[log Pr(y, w, Δ, z | Θ)]
• The expectation is not in closed form
• We draw Gibbs samples and compute the Monte Carlo mean
– M-step: Find Θ̂_{n+1} = argmax_Θ fn(Θ)
• It consists of solving a number of regression and
optimization problems
Supervised Topic Assignment
Pr(zjn = k | Rest) ∝ (Zjk^{−jn} + λ) · (Z_{k,wjn}^{−jn} + η) / (Zk^{−jn} + Wη) · Π_{i rated j} f(yij | zjn = k)
• The first factor is the same as in unsupervised LDA
• The product term is the likelihood of observed ratings by users who
rated item j when zjn is set to topic k
• f(yij | zjn = k): probability of observing yij given the model
• zjn: the topic of the nth word in item j
fLDA: Experimental Results (Movie)
• Task: Predict the rating that a user would give a movie
• Training/test split:
– Sort observations by time
– First 75% → Training data
– Last 25% → Test data
• Item warm-start scenario
– Only 2% new items in test data

Model | Test RMSE
RLFM | 0.9363
fLDA | 0.9381
Factor-Only | 0.9422
FilterBot | 0.9517
unsup-LDA | 0.9520
MostPopular | 0.9726
Feature-Only | 1.0906
Constant | 1.1190

fLDA is as strong as the best method
It does not reduce the performance in warm-start scenarios
fLDA: Experimental Results (Yahoo! Buzz)
• Task: Predict whether a user would buzz-up an article
• Severe item cold-start: all items are new in test data
• Data statistics: 1.2M observations, 4K users, 10K articles
• fLDA significantly outperforms other models
Experimental Results: Buzzing Topics

Top Terms (after stemming) | Topic
bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al | CIA interrogation
mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain | Swine flu
NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week | NFL games
court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow | Gay marriage
palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska | Sarah Palin
idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama | American idol
economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month | Recession
north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia | North Korea issues

3/4 topics are interpretable; 1/2 are similar to unsupervised topics
fLDA Summary
• fLDA is a useful model for cold-start item recommendation
• It also provides interpretable recommendations for users
– User’s preference to interpretable LDA topics
• Future directions:
– Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm
– Apply fLDA to other multi-task prediction problems
• fLDA can be used as a tool to generate supervised
features (topics) from text data
Summary
• Regularizing factors through covariates effective
• Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a
single framework in a seamless way looks attractive
• Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step
can be done with any off-the-shelf scalable linear
regression routine
• Distributed computing on Hadoop: Multiple models
and average across partitions (more later)
Online Components:
Online Models, Intelligent
Initialization, Explore / Exploit
Why Online Components?
• Cold start
– New items or new users come to the system
– How to obtain data for new items/users (explore/exploit)
– Once data becomes available, how to quickly update the model
• Periodic rebuild (e.g., daily): Expensive
• Continuous online update (e.g., every minute): Cheap
• Concept drift
– Item popularity, user interest, mood, and user-to-item affinity may
change over time
– How to track the most recent behavior
• Down-weight old data
– How to model temporal patterns for better prediction
• … may not need to be online if the patterns are stationary
Big Picture

                                               | Most Popular Recommendation | Personalized Recommendation
Offline Models                                 | -                   | Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic)       | Time-series models  | Incremental CF, online regression
Intelligent Initialization (do not start cold) | Prior estimation    | Prior estimation, dimension reduction
Explore/Exploit (actively acquire data)        | Multi-armed bandits | Bandits with covariates

Extension: Segmented Most Popular Recommendation
Online Components for
Most Popular Recommendation
Online models, intelligent initialization &
explore/exploit
Most popular recommendation:
Outline
• Most popular recommendation (no
personalization, all users see the same thing)
– Time-series models (online models)
– Prior estimation (initialization)
– Multi-armed bandits (explore/exploit)
– Sometimes hard to beat!!
• Segmented most popular recommendation
– Create user segments/clusters based on user
features
– Do most popular recommendation for each segment
Most Popular Recommendation
• Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on
the picked items
• Easy!? Pick the items having the highest click-
through rates (CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item changes over time
– How much traffic should be allocated to explore new
items to achieve optimal performance
• Too little → Unreliable CTR estimates
• Too much → Little traffic to exploit the high-CTR items
CTR Curves for Two Days on Yahoo! Front Page
[Figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time. Traffic obtained from a controlled randomized experiment (no confounding).]
Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
For Simplicity, Assume …
• Pick only one item for each user visit
– Multi-slot optimization later
• No user segmentation, no personalization
(discussion later)
• The pool of candidate items is predetermined
and is relatively small (~1000)
– E.g., selected by human editors or by a first-phase
filtering method
– Ideally, there should be a feedback loop
– Large item pool problem later
• Effects like user-fatigue, diversity in
recommendations, multi-objective optimization
not considered (discussion later)
Online Models
• How to track the changing CTR of an item
• Data: for each item, at time t, we observe
– Number of times the item nt was displayed (i.e., #views)
– Number of clicks ct on the item
• Problem Definition: Given c1, n1, …, ct, nt, predict the CTR
(click-through rate) pt+1 at time t+1
• Potential solutions:
– Observed CTR at t: ct / nt → highly unstable (nt is usually small)
– Cumulative CTR: (Σ_{all i} ci) / (Σ_{all i} ni) → reacts to changes very slowly
– Moving-window CTR: (Σ_{i ∈ last K} ci) / (Σ_{i ∈ last K} ni) → reasonable
• But, no estimation of Var[p_{t+1}] (useful for explore/exploit)
Online Models: Dynamic Gamma-Poisson
• Model-based approach
– (ct | nt, pt) ~ Poisson(nt pt)
– pt = p_{t−1} εt, where εt ~ Gamma(mean = 1, var = η)
– Model parameters:
• p1 ~ Gamma(mean = μ0, var = σ0²) is the offline CTR estimate
• η specifies how dynamic/smooth the CTR is over time
– Posterior distribution (p_{t+1} | c1, n1, …, ct, nt) ~ Gamma(?, ?)
• Solve this recursively (online update rule)
[Diagram: state-space chain p1 → p2 → …; at each time t the item is shown nt times and receives ct clicks; pt = CTR at time t.]
Online Models: Derivation
• Let γt = μt / σt² (effective sample size). Estimated CTR distribution at time t (prior):
  (pt | c1, n1, …, c_{t−1}, n_{t−1}) ~ Gamma(mean = μt, var = σt²)
• Posterior after observing (ct, nt):
  (pt | c1, n1, …, ct, nt) ~ Gamma(mean = μ_{t|t}, var = σ²_{t|t})
  γ_{t|t} = γt + nt;  μ_{t|t} = (γt μt + ct) / (γt + nt);  σ²_{t|t} = μ_{t|t} / γ_{t|t}
• Estimated CTR distribution at time t+1 (after the multiplicative drift ε_{t+1}):
  (p_{t+1} | c1, n1, …, ct, nt) ~ Gamma(mean = μ_{t+1}, var = σ²_{t+1})
  μ_{t+1} = μ_{t|t};  σ²_{t+1} = σ²_{t|t} + η (σ²_{t|t} + μ²_{t|t})
• High CTR items are more adaptive: the variance inflation grows with μ_{t|t}
Tracking behavior of Gamma-Poisson model
• Low click-rate articles → more temporal smoothing
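The resulting online update rule is a few lines of R; a minimal sketch, where mu and var are the current Gamma mean and variance, and eta the drift parameter:

gp.update <- function(mu, var, clicks, views, eta) {
  gamma <- mu / var                                  # effective sample size
  mu.post  <- (gamma * mu + clicks) / (gamma + views)  # posterior mean at t
  var.post <- mu.post / (gamma + views)                # posterior variance at t
  list(mu  = mu.post,                                  # prior mean at t+1
       var = var.post + eta * (var.post + mu.post^2))  # variance inflated by drift
}
# e.g. state <- list(mu = 0.02, var = 1e-4)
# state <- gp.update(state$mu, state$var, clicks = 5, views = 200, eta = 0.01)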
Intelligent Initialization: Prior Estimation
• Prior CTR distribution: Gamma(mean = μ0, var = σ0²)
– N historical items:
• ni = #views of item i in its first time interval
• ci = #clicks on item i in its first time interval
– Model: ci ~ Poisson(ni pi) and pi ~ Gamma(μ0, σ0²)
⇒ ci ~ NegBinomial(μ0, σ0², ni)
– Maximum likelihood estimate (MLE) of (μ0, σ0²) from the marginal
negative-binomial likelihood over the N historical items
• Better prior: Cluster items and find the MLE for each cluster
– Agarwal & Chen, 2011 (SIGMOD)
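A sketch of this MLE in R: marginally ci ~ NegBinomial with size α = μ0²/σ0² and mean ni·μ0, so the fit is a two-parameter optimization (the clustering refinement is omitted):

prior.mle <- function(clicks, views) {
  nll <- function(par) {                        # par = (log mu0, log alpha)
    mu0 <- exp(par[1]); alpha <- exp(par[2])
    -sum(dnbinom(clicks, size = alpha, mu = views * mu0, log = TRUE))
  }
  fit <- optim(c(log(sum(clicks) / sum(views)), 0), nll)
  mu0 <- exp(fit$par[1])
  c(mu0 = mu0, var0 = mu0^2 / exp(fit$par[2]))  # (mu_0, sigma_0^2)
}
# e.g. prior.mle(clicks = rpois(100, 5), views = rep(200, 100))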
Explore/Exploit: Problem Definition
[Diagram: over time (…, t−2, t−1, t = now), items 1, 2, …, K receive x1%, x2%, …, xK% of page views; clicks accrue in the future.]
• Determine (x1, x2, …, xK) based on clicks and views observed before t in
order to maximize the expected total number of clicks in the future
Modeling the Uncertainty, NOT just the Mean
Simplified setting: Two items
[Figure: probability density of CTR for Item A (known precisely; shown 1 million times) and Item B (uncertain; shown only 100 times).]
• If we only make a single decision, give 100% of page views to Item A
• If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher
• Potential of Item B: ∫_{p > q} (p − q) f(p) dp,
where q is the CTR of Item A, p is the CTR of Item B, and f(p) is the probability density function of Item B's CTR
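For a concrete feel, with a Beta posterior for Item B this potential is one numerical integral in R (q and the Beta(3, 97) posterior are made-up numbers):

q <- 0.04                         # known CTR of Item A (assumed)
f <- function(p) dbeta(p, 3, 97)  # hypothetical posterior density of Item B's CTR
integrate(function(p) (p - q) * f(p), lower = q, upper = 1)$value  # Potential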
Multi-Armed Bandits: Introduction (1)
[Diagram: bandit "arms" with unknown payoff probabilities p1, p2, p3.]
• "Pulling" arm i yields a reward:
reward = 1 with probability pi (success)
reward = 0 otherwise (failure)
• For now, we are attacking the problem of choosing the best article/arm for all users
Multi-Armed Bandits: Introduction (2)
[Diagram: the same bandit arms with unknown payoff probabilities p1, p2, p3.]
• Goal: Pull arms sequentially to maximize the total reward
• Bandit scheme/policy: Sequential algorithm to play arms (items)
• Regret of a scheme = Expected loss relative to the "oracle" optimal scheme
that always plays the best arm
– "Best" means highest success probability
– But the best arm is not known … unless you have an oracle
– Regret is the price of exploration
– Low regret implies quick convergence to the best
Multi-Armed Bandits: Introduction
(3)
• Bayesian approach
– Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about
probability distributions
– Representative work: Gittins’ index, Whittle’s index
– Very computationally intensive
• Minimax approach
– Seeks to find a scheme that incurs bounded regret (with
no or mild assumptions about probability distributions)
– Representative work: UCB by Lai, Auer
– Usually, computationally easy
– But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
Skip details
Multi-Armed Bandits: Markov Decision Process (1)
• Select an arm now at time t=0, to maximize expected total number
of clicks in t=0,…,T
• State at time t: Θ_t = (θ_1t, …, θ_Kt)
– θ_it = state of arm i at time t (captures all we know about arm i at t)
• Reward function R_i(Θ_t, Θ_{t+1})
– Reward of pulling arm i that brings the state from Θ_t to Θ_{t+1}
• Transition probability Pr[ Θ_{t+1} | Θ_t, pulling arm i ]
• Policy π: A function that maps a state to an arm (action)
– π(Θ_t) returns an arm (to pull)
• Value of policy π starting from the current state Θ_0 with horizon T:

V_T(Θ_0) = E[ R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) ]
         = ∫ Pr[ Θ_1 | Θ_0, π(Θ_0) ] { R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) } dΘ_1

(immediate reward + value of the remaining T−1 time slots if we start from state Θ_1)
Multi-Armed Bandits: MDP (2)
• Optimal policy: π_opt = argmax_π V_T(Θ_0)
• Things to notice:
– Value is defined recursively (actually T high-dim integrals)
– Dynamic programming can be used to find the optimal policy
– But, just evaluating the value of a fixed policy can be very expensive
• Bandit Problem: The pull of one arm does not change the state of
other arms and the set of arms do not change over time
V_T(Θ_0) = E[ R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) ]
         = ∫ Pr[ Θ_1 | Θ_0, π(Θ_0) ] { R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) } dΘ_1

(immediate reward + value of the remaining T−1 time slots if we start from state Θ_1)
Multi-Armed Bandits: MDP (3)
• Which arm should be pulled next?
– Not necessarily what looks best right now, since it might have had a few
lucky successes
– Looks like it will be a function of successes and failures of all arms
• Consider a slightly different problem setting
– Infinite time horizon, but
– Future rewards are geometrically discounted
R_total = R(0) + γ·R(1) + γ²·R(2) + …  (0 < γ < 1)
• Theorem [Gittins 1979]: The optimal policy decouples and solves a
bandit problem for each arm independently
Policy π(Θ_t) is a function of (θ_1t, …, θ_Kt): one K-dimensional problem
Policy π(Θ_t) = argmax_i { g(θ_it) }: K one-dimensional problems
(g is Gittins' index; still computationally expensive!!)
Multi-Armed Bandits: MDP (4)
Bandit Policy
1. Compute the priority
(Gittins’ index) of each
arm based on its state
2. Pull arm with max
priority, and observe
reward
3. Update the state of the
pulled arm
Priority
1
Priority
2
Priority
3
Multi-Armed Bandits: MDP (5)
• Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm
independently
– Many proofs and different interpretations of Gittins’ index
exist
• The index of an arm is the fixed charge per pull for a game with two options, whether to
pull the arm or not, so that the charge makes the optimal play of the game have zero
net reward
– Significantly reduces the dimension of the problem space
– But, Gittins’ index g(θ_it) is still hard to compute
• For the Gamma-Poisson or Beta-Binomial models,
θ_it = (#successes, #pulls) for arm i up to time t
• g maps each possible (#successes, #pulls) pair to a number
– Approximate methods are used in practice
– Lai et al. have derived these for exponential family
distributions
Multi-Armed Bandits: Minimax Approach (1)
• Compute the priority of each arm i in a way that the
regret is bounded
– Lowest regret in the worst case
• One common policy is UCB1 [Auer 2002]
Number of successes of
arm i
Number of pulls
of arm i
Total number of pulls
of all arms
Observed
success rate
Factor representing
uncertainty
Priority_i = c_i / n_i + sqrt( 2 ln n / n_i )

where c_i = number of successes of arm i, n_i = number of pulls of arm i,
and n = total number of pulls of all arms; the first term is the observed
success rate, the second a factor representing uncertainty.
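A compact sketch of UCB1 on simulated Bernoulli arms, following the priority formula above (the simulation itself is illustrative):

import math
import random

def ucb1(success_probs, horizon):
    """Run UCB1 on Bernoulli arms with (unknown) success_probs."""
    k = len(success_probs)
    pulls, wins = [0] * k, [0] * k
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                  # pull each arm once to initialize
        else:
            arm = max(range(k),
                      key=lambda i: wins[i] / pulls[i]
                                    + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = 1 if random.random() < success_probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total_reward += reward
    return total_reward, pulls

random.seed(0)
reward, pulls = ucb1([0.04, 0.05, 0.06], horizon=10000)
print(reward, pulls)   # most pulls should go to the 0.06 arm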
Multi-Armed Bandits: Minimax Approach (2)
• As total observations n becomes large:
– Observed payoff tends asymptotically towards the
true payoff probability
– The system never completely “converges” to one
best arm; only the rate of exploration tends to
zero
Priority_i = c_i / n_i + sqrt( 2 ln n / n_i )
(observed payoff + factor representing uncertainty)
Multi-Armed Bandits: Minimax Approach (3)
• Sub-optimal arms are pulled O(log n) times
• Hence, UCB1 has O(log n) regret
• This is the lowest possible regret (but the constants matter )
• E.g. Regret after n plays is bounded by
Observed
payoff
Factor
representing
uncertainty
ii
i
i
n
n
n
c log2
Priority
ibesti
K
j
j
i ibesti
n
where,
3
1
ln
8
1
2
:
Classical Multi-Armed Bandits: Summary
• Classical multi-armed bandits
– A fixed set of arms with fixed rewards
– Observe the reward before the next pull
• Bayesian approach (Markov decision process)
– Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
• Pull the arm currently having the highest index value
– Whittle’s index [Whittle 1988]: Extension to a changing reward function
– Computationally intensive
• Minimax approach (providing guaranteed regret bounds)
– UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval
• Index of arm i = c_i/n_i + sqrt(2 ln n / n_i)
• Heuristics
– ε-Greedy: Random exploration using an ε fraction of traffic
– Softmax: Pick arm i with probability exp(μ̂_i/τ) / Σ_j exp(μ̂_j/τ),
where μ̂_i = predicted CTR of item i and τ = temperature
– Posterior draw: Index = a draw from the posterior CTR distribution of the arm
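Minimal sketches of the three heuristics as arm-selection rules (the Beta posteriors used for the posterior draw are an assumed modeling choice for Bernoulli rewards):

import math
import random

def eps_greedy_choice(means, eps=0.1):
    """With prob. eps pick a random arm, else the empirically best one."""
    if random.random() < eps:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def softmax_choice(means, tau=0.01):
    """Pick arm i with probability proportional to exp(mean_i / tau)."""
    weights = [math.exp(m / tau) for m in means]
    return random.choices(range(len(means)), weights=weights)[0]

def posterior_draw_choice(wins, losses):
    """Thompson sampling: score each arm by a draw from its Beta posterior."""
    draws = [random.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return max(range(len(draws)), key=lambda i: draws[i])

random.seed(1)
means = [0.04, 0.05, 0.06]
print(eps_greedy_choice(means), softmax_choice(means),
      posterior_draw_choice([4, 5, 6], [96, 95, 94]))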
Do Classical Bandits Apply to Web Recommenders?
Traffic obtained from a controlled randomized experiment (no confounding)
Things to note:
(a) Short lifetimes, (b) temporal effects, (c) often breaking news stories
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
Characteristics of Real Recommender Systems
• Dynamic set of items (arms)
– Items come and go with short lifetimes (e.g., a day)
– Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short
• Non-stationary CTR
– CTR of an item can change dramatically over time
• Different user populations at different times
• Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.)
• Attention to breaking news stories decays over time
• Batch serving for scalability
– Making a decision and updating the model for each user visit in real time
is expensive
– Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i
[Agarwal et al., ICDM, 2009]
Explore/Exploit in Recommender
Systems
time
Item 1
Item 2
…
Item K
x1% page views
x2% page views
…
xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in
order to maximize the expected total number of clicks in the future
(timeline: …, t−2, t−1, t = now; maximize clicks in the future)
Let’s solve this from first principles
Bayesian Solution: Two Items, Two
Time Slots (1)
• Two time slots: t = 0 and t = 1
– Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1
– Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
• To determine x, we need to estimate what would happen in the future
Question:
What fraction x of N0 views to item P
(1-x) to item Q
t=0 t=1
Now
time
N0 views N1 views
End
Obtain c clicks after serving x
(not yet observed; random variable)
Assume we observe c; we can update p1
[Figure: CTR densities at t=0 (Item P centered at p_0, Item Q at q_0) and at
t=1 after observing c (Item P centered at p̂_1(x,c), Item Q at q_1)]
If x and c are given, the optimal solution: give all views to Item P iff
p̂_1(x,c) = E[ p_1 | x, c ] > q_1
• Expected total number of clicks in the two time slots:
N_0 x p̂_0 + N_0 (1−x) q_0 + N_1 E_c[ max{ p̂_1(x,c), q_1 } ]
Gain(x, q0, q1) = Expected number of additional clicks if we explore
the uncertain item P with fraction x of views in slot 0, compared to
a scheme that only shows the certain item Q in both slots
Solution: argmaxx Gain(x, q0, q1)
Bayesian Solution: Two Items, Two
Time Slots (2)
N_0 x p̂_0 + N_0 (1−x) q_0 + N_1 E_c[ max{ p̂_1(x,c), q_1 } ]
  = (N_0 q_0 + N_1 q_1) + x N_0 (p̂_0 − q_0) + N_1 E_c[ max{ p̂_1(x,c) − q_1, 0 } ]

The first two terms on the left are E[#clicks] at t=0; at t=1 we show the item
with the higher E[CTR], i.e., max{ p̂_1(x,c), q_1 }. On the right, (N_0 q_0 + N_1 q_1)
is E[#clicks] if we always show Item Q; the remainder is Gain(x, q_0, q_1),
the gain of exploring the uncertain Item P using x.
• Approximate p̂_1(x,c) by a normal distribution
– Reasonable approximation because of the central limit theorem
• Proposition: Using the approximation, the Bayes optimal solution x can be
found in time O(log N_0)

With prior p_1 ~ Beta(a, b):
p̂_1 = E_c[ p̂_1(x,c) ] = a / (a + b)
σ²(x) = Var_c[ p̂_1(x,c) ] = [ N_0 x / (a + b + N_0 x) ] · ab / [ (a + b)² (a + b + 1) ]

Gain(x, q_0, q_1) ≈ x N_0 (p̂_0 − q_0)
    + N_1 [ (p̂_1 − q_1) Φ( (p̂_1 − q_1) / σ(x) ) + σ(x) φ( (p̂_1 − q_1) / σ(x) ) ],

where Φ and φ are the standard normal CDF and density.
Bayesian Solution: Two Items, Two
Time Slots (3)
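A sketch of this computation in Python (using the reconstructed normal-approximation formulas above; the grid search stands in for the O(log N_0) search, and the numbers are illustrative):

import numpy as np
from scipy.stats import norm

def gain(x, q0, q1, a, b, n0, n1):
    """Expected extra clicks from giving fraction x of slot-0 views to
    the uncertain Item P (Beta(a, b) prior), vs. always showing Item Q."""
    p_hat = a / (a + b)              # prior mean: p_hat_0 = p_hat_1 = a/(a+b)
    var_x = (n0 * x / (a + b + n0 * x)) * a * b / ((a + b) ** 2 * (a + b + 1))
    sd = np.sqrt(var_x)
    z = (p_hat - q1) / sd if sd > 0 else -np.inf
    explore_value = (p_hat - q1) * norm.cdf(z) + sd * norm.pdf(z)
    return n0 * x * (p_hat - q0) + n1 * explore_value

q0, q1 = 0.055, 0.05                 # Item Q's known CTRs at t=0 and t=1
xs = np.linspace(1e-4, 1.0, 1000)    # grid search over the serving fraction x
gains = [gain(x, q0, q1, a=5, b=95, n0=10000, n1=10000) for x in xs]
print("optimal x =", round(float(xs[int(np.argmax(gains))]), 3))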
Bayesian Solution: Two Items, Two Time Slots (4)
• Quiz: Is it correct that the more we are uncertain about the CTR of
an item, the more we should explore the item?
[Figure: fraction of views to give to the item as its uncertainty ranges from
high to low; different curves are for different prior mean settings]
Bayesian Solution: General Case (1)
• From two items to K items
– Very difficult problem:
max_x  N_0 Σ_i x_i p̂_{i0} + N_1 E_c[ max_i { p̂_{i1}(x_i, c_i) } ],
subject to Σ_i x_i = 1
Note: c = [c_1, …, c_K], where c_i is a random variable representing the
#clicks on item i we may get
– Apply Whittle’s Lagrange relaxation (1988) to this problem setting
• Rewrite E_c[ max_i p̂_{i1}(x_i, c_i) ] as
max_z E_c[ Σ_i z_i(c) p̂_{i1}(x_i, c_i) ], with Σ_i z_i(c) = 1 for all possible c
• Relax Σ_i z_i(c) = 1 for all c, to E_c[ Σ_i z_i(c) ] = 1
• Apply Lagrange multipliers (q_0 and q_1) to enforce the constraints:
min_{q_0, q_1} ( N_0 q_0 + N_1 q_1 + Σ_i max_{x_i} Gain_i(x_i, q_0, q_1) )
– We essentially reduce the K-item case to K independent two-item
sub-problems (which we have solved)
Bayesian Solution: General Case
(2)
• From two intervals to multiple time slots
– Approximate multiple time slots by two stages
• Non-stationary CTR
– Use the Dynamic Gamma-Poisson model to
estimate the CTR distribution for each item
Simulation Experiment: Different Traffic Volume
Simulation with ground truth estimated based on Yahoo! Front Page data
Setting: 16 live items per interval
Scenarios: Web sites with different traffic volume (x-axis)
Simulation Experiment: Different Sizes of the Item Pool
Simulation with ground truth estimated based on Yahoo! Front Page data
Setting: 1000 views per interval; average item lifetime = 20 intervals
Scenarios: Different sizes of the item pool (x-axis)
Characteristics of Different Explore/Exploit Schemes (1)
• Why the Bayesian solution has better performance
• Characterize each scheme by three dimensions:
– Exploitation regret: The regret of a scheme when it is showing the item
which it thinks is the best (may not actually be the best)
• 0 means the scheme always picks the actual best
• It quantifies the scheme’s ability of finding good
items
– Exploration regret: The regret of a scheme when it is exploring the items
which it feels uncertain about
• It quantifies the price of exploration (lower better)
– Fraction of exploitation (higher better)
• Fraction of exploration = 1 – fraction of exploitation
[Diagram: all traffic to a web site = exploitation traffic + exploration traffic]
Characteristics of Different Explore/Exploit Schemes (2)
Exploitation regret: Ability of finding good items (lower better)
Exploration regret: Price of exploration (lower better)
Fraction of Exploitation (higher better)
[Figures: exploitation regret vs. exploration regret, and exploitation regret
vs. exploitation fraction; good schemes sit at low regret and high exploitation fraction]
Discussion: Large Content Pool
• The Bayesian solution looks promising
– ~10% from true optimal for a content pool of 1000 live
items
• 1000 views per interval; item lifetime ~20 intervals
• Intelligent initialization (offline modeling)
– Use item features to reduce the prior variance of an item
• E.g., Var[ item CTR | Sport ] < Var[ item CTR ]
– Require a CTR model that outputs both mean and
variance
• Linear regression model
• Segmented model: Estimate the CTR distribution of a random
article in an item category
– Existing taxonomies, decision tree, LDA topics
• Feature-based explore/exploit
– Estimate model parameters, instead of per-item CTR
– More later
Discussion: Multiple Positions,
Ranking
• Feature-based approach
– reward(page) = model( (item 1 at position 1, … item k at position k))
– Apply feature-based explore/exploit
• Online optimization for ranked list
– Ranked bandits [Radlinski et al., 2008]: Run an
independent bandit algorithm for each position
– Dueling bandit [Yue & Joachims, 2009]: Actions are
pairwise comparisons
• Online optimization of submodular functions
– For all S_1, S_2 and items a: ∂_a f(S_1 ∪ S_2) ≤ ∂_a f(S_1)
• where ∂_a f(S) = f(S ∪ {a}) − f(S)
– Streeter & Golovin (2008)
Discussion: Segmented Most
Popular
• Partition users into segments, and then for each
segment, provide most popular recommendation
• How to segment users
– Hand-created segments: AgeGroup × Gender
– Clustering or decision tree based on user features
• Users in the same cluster like similar items
• Segments can be organized by taxonomies/hierarchies
– Better CTR models can be built by hierarchical smoothing
• Shrink the CTR of a segment toward its parent
• Introduce bias to reduce uncertainty/variance
– Bandits for taxonomies (Pandey et al., 2008)
• First explore/exploit categories/segments
• Then, switch to individual items
Most Popular Recommendation:
Summary
• Online model:
– Estimate the mean and variance of the CTR of each item over
time
– Dynamic Gamma-Poisson model
• Intelligent initialization:
– Estimate the prior mean and variance of the CTR of each item
cluster using historical data
• Cluster items → maximum likelihood estimates of the priors
• Explore/exploit:
– Bayesian: Solve a Markov decision process problem
• Gittins’ index, Whittle’s index, approximations
• Better performance, computation intensive
• Thompson sampling: Sample from the posterior (simple)
– Minimax: Bound the regret
• UCB1: Easy to compute
• Explore more than necessary in practice
– ε-Greedy: Empirically competitive when ε is tuned
Online Components for
Personalized Recommendation
Online models, intelligent initialization &
explore/exploit
Intelligent Initialization for Linear
Model (1)
• Linear/factorization model
y_ij ~ N(u_i′θ_j, σ²)   (rating that user i gives item j)
θ_j ~ N(μ_j, Σ)
where u_i = feature/factor vector of user i and θ_j = factor vector of item j
– How to estimate the prior parameters μ_j and Σ
• Important for cold start: Predictions are made using the prior
• Leverage available features
– How to learn the weights/factors quickly
• High dimensional θ_j → slow convergence
• Reduce the dimensionality
Feature-based model initialization
• Dimensionality reduction for fast model convergence
• FOBFM: Fast Online Bilinear Factor Model
– Per-item online model: y_ij ~ u_i′θ_j, with prior θ_j ~ N(Ax_j, Σ)
– Equivalently: y_ij ~ u_i′Ax_j + u_i′v_j, with v_j ~ N(0, Σ)
• u_i′Ax_j is the part predicted by features
– Dimensionality reduction: v_j = B δ_j, δ_j ~ N(0, σ² I)
• B is an n×k linear projection matrix (k << n): projects high-dim v_j to low-dim δ_j
• Low-rank approximation of Var[v_j]: θ_j ~ N(Ax_j, σ² BB′)
– Offline training: Determine A, B, σ² through the EM algorithm (once per day or hour)
Data: y_ij = rating that user i gives item j; u_i = offline factor vector of
user i; x_j = feature vector of item j
Feature-based model initialization (continued)
• Dimensionality reduction for fast model convergence
• Fast, parallel online learning:
y_ij ~ u_i′Ax_j + (u_i′B) δ_j
where u_i′Ax_j is an offset, u_i′B is a new (low-dimensional) feature vector,
and δ_j is updated in an online manner
• Online selection of dimensionality (k = dim(δ_j))
– Maintain an ensemble of models, one for each candidate dimensionality
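A minimal sketch of the per-item online step, assuming the offline EM fit has produced A, B, and the offsets (the conjugate Gaussian update here is one standard way to realize "updated in an online manner"; it is not necessarily the exact FOBFM update):

import numpy as np

def online_update(delta_mean, delta_prec, u, B, Ax, y, noise_var=1.0):
    """One Bayesian linear-regression update of an item's low-dim factor.

    Observation model: y ~ u.(A x) + (u.B) delta, delta ~ N(mean, prec^-1).
    Returns the updated posterior mean and precision of delta.
    """
    z = u @ B                      # low-dimensional feature vector (k,)
    offset = u @ Ax                # feature-predicted part of the rating
    resid = y - offset - z @ delta_mean
    delta_prec = delta_prec + np.outer(z, z) / noise_var
    # Kalman-style mean correction using the updated precision
    delta_mean = delta_mean + np.linalg.solve(delta_prec, z) * resid / noise_var
    return delta_mean, delta_prec

# Toy usage with random offline quantities (illustrative only)
rng = np.random.default_rng(0)
n, k = 50, 5
B, Ax = rng.normal(size=(n, k)), rng.normal(size=n)
mean, prec = np.zeros(k), np.eye(k)          # prior: delta ~ N(0, I)
for _ in range(100):
    u, y = rng.normal(size=n), rng.normal()
    mean, prec = online_update(mean, prec, u, B, Ax, y)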
Experimental Results: My Yahoo!
Dataset (1)
• My Yahoo! is a personalized news reading
site
– Users manually select news/RSS feeds
• ~12M “ratings” from ~3M users on ~13K
articles
– Click = positive
– View without click = negative
Experimental Results: My Yahoo! Dataset (2)
Item-based data split: Every item is new in the test data
– First 8K articles are in the training data (offline training)
– Remaining articles are in the test data (online prediction & learning)
Supervised dimensionality reduction (reduced rank regression) significantly
outperforms other methods
Methods:
No-init: Standard online
regression with ~1000
parameters for each item
Offline: Feature-based model
without online update
PCR, PCR+: Two principal
component methods to
estimate B
FOBFM: Our fast online method
Experimental Results: My Yahoo! Dataset (3)
• A small number of factors (low dimensionality) is better when the amount of data
for online learning is small
• Large number of factors is better when the data for learning becomes large
• The online selection method usually selects the best dimensionality
# factors =
Number of
parameters per
item updated
online
Intelligent Initialization: Summary
• For online learning, whenever historical data
is available, do not start cold
• For linear/factorization models
– Use available features to setup the starting point
– Reduce dimensionality to facilitate fast learning
• Next
– Explore/exploit for personalization
– Users are represented by covariates
• Features, factors, clusters, etc
– Covariate bandits
Explore/Exploit for Personalized Recommendation
• One extreme problem formulation
– One bandit problem per user with one arm per item
– Bandit problems are correlated: “Similar” users like similar
items
– Arms are correlated: “Similar” items have similar CTRs
• Model this correlation through covariates/features
– Input: User feature/factor vector, item feature/factor vector
– Output: Mean and variance of the CTR of this (user, item)
pair based on the data collected so far
• Covariate bandits
– Also known as contextual bandits, bandits with side
observations
– Provide a solution to
• Large content pool (correlated arms)
• Personalized recommendation (hint before pulling an arm)
Methods for Covariate Bandits
• Priority-based methods
– Rank items according to the user-specific “score” of each item; then,
update the model based on the user’s response
– UCB (upper confidence bound)
• Score of an item = E[posterior CTR] + k StDev[posterior CTR]
– Posterior draw (Thompson sampling)
• Score of an item = a number drawn from the posterior CTR distribution
– Softmax
• Score of an item = a number drawn according to exp(ŝ_i/τ) / Σ_j exp(ŝ_j/τ),
where ŝ_i is the predicted CTR of item i and τ is a temperature
• ε-Greedy
– Allocate an ε fraction of traffic for random exploration (ε may be adaptive)
– Robust when the exploration pool is small
• Bayesian scheme
– Close to optimal if it can be solved efficiently
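A sketch of the two priority-based scores for a Gaussian linear CTR model, in the spirit of UCB and posterior-draw scoring over covariates (the model and names here are illustrative assumptions, not the exact algorithms from any one cited paper):

import numpy as np

rng = np.random.default_rng(0)

def ucb_score(theta_mean, theta_cov, features, k=2.0):
    """E[posterior CTR score] + k * StDev, for a Gaussian weight posterior."""
    mean = features @ theta_mean
    sd = np.sqrt(features @ theta_cov @ features)
    return mean + k * sd

def thompson_score(theta_mean, theta_cov, features):
    """Score from a single posterior draw of the weight vector."""
    theta = rng.multivariate_normal(theta_mean, theta_cov)
    return features @ theta

d = 4
theta_mean, theta_cov = np.zeros(d), np.eye(d)
x_user_item = rng.normal(size=d)            # joint (user, item) feature vector
print(ucb_score(theta_mean, theta_cov, x_user_item),
      thompson_score(theta_mean, theta_cov, x_user_item))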
Covariate Bandits: Some
References
• Just a small sample of papers
– Hierarchical explore/exploit (Pandey et al., 2008)
• Explore/exploit categories/segments first; then, switch to individuals
– Variants of ε-greedy
• Epoch-greedy (Langford & Zhang, 2007): ε is determined based on
the generalization bound of the current model
• Banditron (Kakade et al., 2008): Linear model with binary response
• Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time;
example models: histogram, nearest neighbor
– Variants of UCB methods
• Linearly parameterized bandits (Rusmevichientong et al., 2008):
minimax, based on uncertainty ellipsoid
• LinUCB (Li et al., 2010): Gaussian linear regression model
• Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009):
– Similar arms have similar rewards: | reward(i) − reward(j) | ≤ distance(i, j)
Online Components: Summary
• Real systems are dynamic
• Cold-start problem
– Incremental online update (online linear regression)
– Intelligent initialization (use features to predict initial factor
values)
– Explore/exploit (UCB, posterior draw, softmax, ε-greedy)
• Concept-drift problem
– Tracking the most recent behavior (state-space models,
Kalman filter)
– Modeling temporal patterns (tensor factorization, spline)
Evaluation Methods and
Challenges
Evaluation Methods
• Ideal method
– Experimental Design: Run side-by-side experiments on a small
fraction of randomly selected traffic with new method (treatment)
and status quo (control)
– Limitation
• Often expensive and difficult to test large number of methods
• Problem: How do we evaluate methods offline on
logged data?
– Goal: To maximize clicks/revenue on the entire system, not prediction
accuracy. The cost of predictive inaccuracy varies across instances.
• E.g. 100% error on a low CTR article may not matter much
because it always co-occurs with a high CTR article that is
predicted accurately
Usual Metrics
• Predictive accuracy
– Root Mean Squared Error (RMSE)
– Mean Absolute Error (MAE)
– Area under the Curve, ROC
• Other rank based measures based on retrieval accuracy for top-k
– Recall in test data
• What fraction of the items that a user actually liked in the test data were
among the top-k recommended by the algorithm (fraction of hits, e.g.,
Karypis, CIKM 2001)
• One flaw in several papers
– Training and test splits are not based on time.
• Information leakage
• Even in Netflix, this is the case to some extent
– Time split per user, not per event. For instance, information may leak if
models are based on user-user similarity.
Metrics continued..
• Recall per event based on Replay-Match
method
– Fraction of clicked events where the top
recommended item matches the clicked one.
• This is good if the logged data was collected from a
randomized serving scheme; with biased data this could be a problem
– We would be inventing algorithms that provide
recommendations similar to the current scheme's
• No reward for novel recommendations
Details on Replay-Match method (Li, Langford, et al)
• x: feature vector for a visit
• r = [r1,r2,…,rK]: reward vector for the K items in inventory
• h(x): recommendation algorithm to be evaluated
• Goal: Estimate expected reward for h(x)
• s(x): recommendation scheme that generated logged-data
• x1,..,xT: visits in the logged data
• rti: reward for visit t, where i = s(xt)
Replay-Match continued
• Estimator: average the rewards over the visits where the evaluated
algorithm's recommendation matches the logged item, weighting each
match by an importance weight
• If the importance weights are proportional to the inverse probabilities
with which s(x) served each item, it can be shown the estimator is unbiased
• E.g. if s(x) is random serving scheme, importance weights are
uniform over the item set
• If s(x) is not random, importance weights have to be
estimated through a model
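A simplified sketch of the replay idea for a uniformly random logging scheme, where the importance weights are uniform (field names and data are illustrative):

def replay_estimate(log, h):
    """Estimate per-visit reward of policy h from uniformly-random logs.

    log: list of (x, served_item, reward) tuples, where served_item was
         chosen uniformly at random over the item set.
    h:   recommendation algorithm mapping a visit's features x to an item.
    """
    matched, total_reward = 0, 0.0
    for x, served, reward in log:
        if h(x) == served:        # keep only visits where h agrees with the log
            matched += 1
            total_reward += reward
    return total_reward / matched if matched else float("nan")

# Toy usage: logs from a 3-item random scheme, policy always picks item 2
log = [({"hour": 9}, 2, 1), ({"hour": 10}, 0, 0), ({"hour": 11}, 2, 0)]
print(replay_estimate(log, h=lambda x: 2))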
Back to Multi-Objective
Optimization
[Diagram: the recommender selects from editorial content on the front page;
clicks on FP links influence the downstream supply distribution, which feeds
the ad server (premium/guaranteed display and the cheaper spot market) and
downstream engagement (time spent)]
Serving Content on Front Page:
Click Shaping
• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
– Article 1: CTR=5%, utility per click = 5
– Article 2: CTR=4.9%, utility per click=10
• By promoting Article 2, we lose 1 click per 1,000 visits but gain 240 utils per 1,000 visits
• If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
– E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
Why call it Click Shaping?
[Figure: supply distribution across Yahoo! properties (autos, finance, health,
hotjobs, movies, news, omg, real estate, sports, tech, travel, tv, video,
videogames, …) BEFORE vs. AFTER shaping, with changes ranging roughly from
−10% to +10%]
SHAPING can happen with respect to any downstream metric (like engagement)
Multi-Objective Optimization
[Diagram: n articles (A_1, …, A_n), each tagged with one of K properties
(news, finance, omg, …), served to m user segments (S_1, …, S_m)]
CTR of user segment i on article j: p_ij
Time duration of i on j: d_ij
Multi-Objective Program Scalarization
Goal Programming
Simplex constraints on xiJ is always applied
Constraints are linear
Every 10 mins, solve x
Use this x as the serving scheme in the next 10 mins
Pareto-optimal solution (more in KDD 2011)
224
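A sketch of one scalarized linear program under the definitions above (the weight lam on time spent and the simulated p_ij, d_ij are illustrative assumptions):

import numpy as np
from scipy.optimize import linprog

# Illustrative sizes: m user segments, n articles
m, n = 3, 4
rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.06, size=(m, n))     # CTR p_ij
d = rng.uniform(10, 60, size=(m, n))         # time duration d_ij
lam = 0.001                                  # weight on engagement (assumed)

# Variables x_ij (flattened): fraction of segment i's views given to article j.
# Maximize sum_ij x_ij (p_ij + lam * d_ij)  ==  minimize the negative.
c = -(p + lam * d).ravel()

# Simplex constraint per segment: sum_j x_ij = 1, with 0 <= x_ij <= 1
A_eq = np.zeros((m, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0
res = linprog(c, A_eq=A_eq, b_eq=np.ones(m), bounds=[(0, 1)] * (m * n))
print(res.x.reshape(m, n).round(2))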
Summary
• Modern recommendation systems on the web crucially depend on
extracting intelligence from massive amounts of data collected on a
routine basis
• Lots of data and processing power are not enough; the number of things we
need to learn grows with data size
• Extracting grouping structures at coarser resolutions based on similarity
(correlations) is important
– ML has a big role to play here
• Continuous and adaptive experimentation in a judicious manner crucial
to maximize performance
– Again, ML has a big role to play
• Multi-objective optimization is often required, the objectives are
application dependent.
– ML has to work in close collaboration with
engineering, product & business execs
Challenges
Recall: Some examples
• Simple version
– I have an important module on my page; content inventory
is obtained from a third-party source and further
refined through editorial oversight. Can I algorithmically
recommend content on this module? I want to drive up
total CTR on this module
• More advanced
– I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. dwell time). Can I increase
downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my website. How
do I take a holistic approach and perform a simultaneous
optimization?
For the simple version
• Multi-position optimization
– Explore/exploit, optimal subset selection
• Explore/Exploit strategies for large content pool and
high dimensional problems
– Some work on hierarchical bandits but more needs to be
done
• Constructing user profiles from multiple sources with
less than full coverage
– Couple of papers at KDD 2011
• Content understanding
• Metrics to measure user engagement (other than
CTR)
Other problems
• Whole page optimization
– Incorporating correlations
• Incentivizing user-generated content
• Incorporating Social information for better
recommendation (News Feed Recommendation)
• Multi-context Learning
Case Studies
Recommendations and Advertising on LinkedIn HP
EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
LinkedIn Advertising: Flow
[Diagram] An ad request arrives with the member profile (e.g., region = US,
age = 20) and context (e.g., profile page, 300 x 250 ad slot). Campaigns are
filtered for eligibility (targeting criteria, frequency cap, budget pacing),
scored by the response prediction engine, passed through automatic format
selection, and sorted by Bid * CTR for the auction.
Click Cost = Bid_3 × CTR_3 / CTR_2 (generalized second price)
Serving constraint: < 100 millisec
CTR Prediction Model for Ads
• Feature vectors
– Member feature vector: x_i (identity, behavioral, network)
– Campaign feature vector: c_j (text, adv-id, …)
– Context feature vector: z_k (page type, device, …)
• Model: logistic regression with two parts
– Cold-start component: a function of (x_i, c_j, z_k), shared across campaigns
– Warm-start per-campaign component: campaign-specific coefficients
– Both can have L2 penalties.
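The slides do not show the exact LASER parameterization, so the decomposition below is an assumed sketch of a cold-start plus warm-start logistic score:

import numpy as np

def predict_ctr(features, theta_cold, theta_warm):
    """Two-part logistic CTR model: shared cold-start weights plus
    per-campaign warm-start weights on the same feature vector."""
    logit = features @ theta_cold + features @ theta_warm
    return 1.0 / (1.0 + np.exp(-logit))

d = 8
rng = np.random.default_rng(0)
features = rng.normal(size=d)           # joint (member, campaign, context) features
theta_cold = rng.normal(size=d) * 0.1   # stable; retrained weekly/bi-weekly
theta_warm = np.zeros(d)                # per-campaign; updated frequently
print(predict_ctr(features, theta_cold, theta_warm))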
Model Fitting
• Single machine (well understood)
– Conjugate gradient
– L-BFGS
– Trust region
– …
• Model training with large-scale data
– Cold-start component Θw is more stable
• Weekly/bi-weekly training is good enough
• However: difficulty comes from the need for large-scale logistic regression
– Warm-start per-campaign model Θc is more dynamic
• New items can get generated at any time
• Big loss if opportunities are missed
• Need to update the warm-start component as frequently as possible
Large Scale Logistic
Regression
Per-item logistic regression
given Θc
Explore/Exploit with Logistic
Regression
[Figure: logistic-regression decision boundaries separating + and − examples:
a COLD START line, the COLD + WARM START line for an Ad-id, and the
POSTERIOR of the WARM-START COEFFICIENTS]
E/E: Sample a line from the posterior (Thompson Sampling)
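A sketch of the E/E step, assuming a Gaussian (e.g., Laplace-approximated) posterior for the warm-start coefficients; the numbers are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def thompson_logistic_score(features, theta_cold, warm_mean, warm_cov):
    """Thompson sampling for the warm-start part: draw the per-campaign
    coefficients from their (assumed Gaussian) posterior, then score."""
    theta_warm = rng.multivariate_normal(warm_mean, warm_cov)
    logit = features @ theta_cold + features @ theta_warm
    return 1.0 / (1.0 + np.exp(-logit))

d = 8
features = rng.normal(size=d)
theta_cold = rng.normal(size=d) * 0.1
warm_mean, warm_cov = np.zeros(d), 0.05 * np.eye(d)
# Each ad request gets a fresh draw, so uncertain campaigns get explored
print([round(thompson_logistic_score(features, theta_cold, warm_mean, warm_cov), 4)
       for _ in range(3)])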
Models Considered
• CONTROL: per-campaign CTR counting model
• COLD-ONLY: only cold-start component
• LASER: our model (cold-start + warm-start)
• LASER-EE: our model with Explore-Exploit
using Thompson sampling
Metrics
• Model metrics (offline)
– Test Log-likelihood
– AUC/ROC
– Observed/Expected ratio
• Online metrics (Online A/B Test)
– CTR
– CPM (Revenue per impression)
– Unique ads per user (diversity)
Observed / Expected Ratio
• Offline replay is difficult with large item pools (randomization is costly)
• Observed: #Clicks in the data , Expected: Sum of predicted CTR for
all impressions
• Not a “standard” classifier metric, but useful for this application
• What we usually see: Observed / Expected < 1
– Quantifies the “winner’s curse” aka selection bias in auctions
• When choosing from among thousands of candidates, an item with mistakenly
over-estimated CTR may end up winning the auction
• Particularly helpful in spotting inefficiencies by segment
– E.g. by bid, number of impressions in training (warmness), geo, etc.
– Allows us to see where the model might be giving too much weight to
the wrong campaigns
• High correlation between O/E ratio and model performance online
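A small sketch of computing the O/E ratio by segment (the field names and simulated data are illustrative):

import numpy as np

def oe_ratio(clicks, predicted_ctr, segments):
    """Observed/Expected clicks per segment: #clicks / sum of predicted CTRs."""
    out = {}
    for seg in np.unique(segments):
        mask = segments == seg
        out[seg] = clicks[mask].sum() / predicted_ctr[mask].sum()
    return out

rng = np.random.default_rng(0)
pred = rng.uniform(0.01, 0.05, size=1000)   # model's predicted CTR per impression
clicks = rng.binomial(1, pred * 0.9)        # simulate slight over-prediction
segs = rng.integers(1, 4, size=1000)        # e.g., campaign warmness segment
print({k: round(v, 2) for k, v in oe_ratio(clicks, pred, segs).items()})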
Online A/B Test
• Three models
– CONTROL (10%)
– LASER (85%)
– LASER-EE (5%)
• Segmented Analysis
– 8 segments by campaign warmness
• Degree of warmness: the number of training samples available in
the training data for the campaign
• Segment #1: Campaigns with almost no data in training
• Segment #8: Campaigns that are served most heavily in the
previous batches so that their CTR estimate can be quite accurate
Daily CTR Lift Over Control
[Figure: percentage CTR lift over CONTROL for LASER and LASER-EE across Days 1-7; exact values omitted]
Daily CPM Lift Over Control
[Figure: percentage eCPM lift over CONTROL for LASER and LASER-EE across Days 1-7; exact values omitted]
CPM Lift By Campaign Warmness Segments
[Figure: CPM lift percentage for LASER and LASER-EE across warmness segments 1-8]
O/E Ratio By Campaign Warmness Segments
[Figure: observed/expected clicks (roughly 0.5-1.0) for CONTROL, LASER, and LASER-EE across warmness segments 1-8]
[Figure: number of campaigns served; improvement from E/E]
Insights
• Overall performance:
– LASER and LASER-EE are both much better than
control
– LASER and LASER-EE performance are very similar
• Great news! We get exploration without much additional
cost
• Exploration has other benefits
– LASER-EE serves significantly more campaigns than
LASER
– Provides healthier market place, more ad diversity per
user (better experience)
Solutions to Practical Problems
• Rapid model development cycle
– Quick reaction to changes in data, product
– Write once for training, testing, inference
• Can adapt to changing data
– Integrated Thompson sampling explore/exploit
– Automatic training
– Multiple training frequencies for different parts of model
• Good tools yield good models
– Reusable components for feature extraction and transformation
– Very high-performance inference engine for deployment
– Modelers can concentrate on building models, not re-writing
common functions or worrying about production issues
Summary
• Reducing dimension through logistic regression, coupled with explore/exploit
schemes like Thompson sampling, is an effective mechanism to solve response
prediction problems in advertising
• Partitioning model components into cold-start (stable) and warm-start
(non-stationary) parts with different training frequencies is an effective
mechanism to scale computations
• ADMM, with a few modifications, is an effective model-training strategy for
large data with high dimensionality
• These methods work well for LinkedIn advertising, with significant
improvements
Theory vs. Practice
Textbook
• Data is stationary
• Training data is clean
• Training is hard, testing and
inference are easy
• Models don’t change
• Complex algorithms work best
Reality
• Features, items changing
constantly
• Fraud, bugs, tracking delays, online/offline inconsistencies, etc.
• All aspects have challenges at
web scale
• Never-ending processes of
improvement
• Simple models with good features
and lots of data win
Current Work: Feed Recommendation
Network Updates
Job change
Job anniversaries
Connections
Endorsements
Upload Photos
…
Content
Articles by Influencer
Shares by friends
Content in Channels followed
Content by companies followed
Jobs recommendation
…
Sponsored Updates
Company updates
Jobs
Tiered Approach to Ranking
• A second-pass ranker (BLENDER) blends disparate results returned by
first-pass rankers (Network Updates, Content, Ads, Jobs) to produce the top k
Challenges
• Personalization
– Viewer-actor affinity by type (depends on strength of
connections in multiple contexts)
– Blending identity and behavioral data
• Frequency discounting, freshness, diversification
• Multi-objectives (Revenue, Engagement)
• A/B tests with interference
• Engagement metrics
– Function of various actions that optimize long-term
engagement metric like return visits
• Summarization and adding new content types
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course

More Related Content

What's hot

Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models ananth
 
IEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialIEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialSrinath Perera
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligenceananth
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionWill Johnson
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelinesRamesh Sampath
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Lucidworks
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017MLconf
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsKrishna Sankar
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Measuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and PythonMeasuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and PythonSujit Pal
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Turi, Inc.
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Chris Ohk
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Lucidworks
 

What's hot (20)

Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models
 
IEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialIEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On Tutorial
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligence
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS Function
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelines
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Machine learning yearning
Machine learning yearningMachine learning yearning
Machine learning yearning
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Ironwood4_Tuesday_Medasani_1PM
Ironwood4_Tuesday_Medasani_1PMIronwood4_Tuesday_Medasani_1PM
Ironwood4_Tuesday_Medasani_1PM
 
Measuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and PythonMeasuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and Python
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
 

Viewers also liked

Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Wayne Lee
 
Contexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMiningContexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMining正志 坪坂
 
Reading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrapReading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrapChristian Robert
 
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsReading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsChristian Robert
 
Heart and voluntary
Heart and voluntaryHeart and voluntary
Heart and voluntaryhpinn
 
หน้าที่พิเศษของใบ
หน้าที่พิเศษของใบหน้าที่พิเศษของใบ
หน้าที่พิเศษของใบBiobiome
 
gabrovo1
gabrovo1gabrovo1
gabrovo1niod
 
Types de fonds d'investiment au Luxembourg
Types de fonds d'investiment au LuxembourgTypes de fonds d'investiment au Luxembourg
Types de fonds d'investiment au LuxembourgBridgeWest.eu
 
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...Dyplast Products
 
หน้าที่พิเศษของราก
หน้าที่พิเศษของรากหน้าที่พิเศษของราก
หน้าที่พิเศษของรากBiobiome
 
07 bio มข
07 bio มข07 bio มข
07 bio มขBiobiome
 

Viewers also liked (12)

Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap
 
Contexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMiningContexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMining
 
Reading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrapReading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrap
 
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsReading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
 
Heart and voluntary
Heart and voluntaryHeart and voluntary
Heart and voluntary
 
หน้าที่พิเศษของใบ
หน้าที่พิเศษของใบหน้าที่พิเศษของใบ
หน้าที่พิเศษของใบ
 
Dpa acto 2015_eng
Dpa acto 2015_engDpa acto 2015_eng
Dpa acto 2015_eng
 
gabrovo1
gabrovo1gabrovo1
gabrovo1
 
Types de fonds d'investiment au Luxembourg
Types de fonds d'investiment au LuxembourgTypes de fonds d'investiment au Luxembourg
Types de fonds d'investiment au Luxembourg
 
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
 
หน้าที่พิเศษของราก
หน้าที่พิเศษของรากหน้าที่พิเศษของราก
หน้าที่พิเศษของราก
 
07 bio มข
07 bio มข07 bio มข
07 bio มข
 

Similar to Enar short course

Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Map reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsMap reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsNishant Gandhi
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computingjayavignesh86
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAtner Yegorov
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisHiye Biniam
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
Building Scalable Aggregation Systems
Building Scalable Aggregation SystemsBuilding Scalable Aggregation Systems
Building Scalable Aggregation SystemsJared Winick
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 

Similar to Enar short course (20)

Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Lecture 1 (bce-7)
Lecture   1 (bce-7)Lecture   1 (bce-7)
Lecture 1 (bce-7)
 
Map reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsMap reduce programming model to solve graph problems
Map reduce programming model to solve graph problems
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computing
 
DynamodbDB Deep Dive
DynamodbDB Deep DiveDynamodbDB Deep Dive
DynamodbDB Deep Dive
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
try
trytry
try
 
Deep Dive - DynamoDB
Deep Dive - DynamoDBDeep Dive - DynamoDB
Deep Dive - DynamoDB
 
Connected Components Labeling
Connected Components LabelingConnected Components Labeling
Connected Components Labeling
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Building Scalable Aggregation Systems
Building Scalable Aggregation SystemsBuilding Scalable Aggregation Systems
Building Scalable Aggregation Systems
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 

Recently uploaded

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backElena Simperl
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsVlad Stirbu
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Key Trends Shaping the Future of Infrastructure.pdf
IoT Analytics Company Presentation May 2024
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Knowledge engineering: from people to machines and back
Neuro-symbolic is not enough, we need neuro-*semantic*
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Search and Society: Reimagining Information Access for Radical Futures
"Impact of front-end architecture on development cost", Viktor Turskyi
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
UiPath Test Automation using UiPath Test Suite series, part 1
When stars align: studies in data quality, knowledge graphs, and machine lear...
Quantum Computing: Current Landscape and the Future Role of APIs
Bits & Pixels using AI for Good.........
Essentials of Automations: Optimizing FME Workflows with Parameters
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Accelerate your Kubernetes clusters with Varnish Caching
ODC, Data Fabric and Architecture User Group

Enar short course

  • 1.
  • 2. Statistical Computing For Big Data Deepak Agarwal LinkedIn Applied Relevance Science dagarwal@linkedin.com ENAR 2014, Baltimore, USA
  • 3. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  • 4. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  • 5. Big Data becoming Ubiquitous • Bioinformatics • Astronomy • Internet • Telecommunications • Climatology • …
  • 6. Big Data: Some size estimates • 1000 human genomes: > 100TB of data (1000 genomes project) • Sloan Digital Sky Survey: 200GB data per night (>140TB aggregated) • Facebook: A billion monthly active users • LinkedIn: 225M members worldwide • Twitter: 500 million tweets a day • Over 6 billion mobile phones in the world generating data everyday
  • 7. Big Data: Paradigm shift • Classical Statistics – Generalize using small data • Paradigm Shift with Big Data – We now have an almost infinite supply of data – Easy Statistics ? Just appeal to asymptotic theory? • So the issue is mostly computational? – Not quite • More data comes with more heterogeneity • Need to change our statistical thinking to adapt – Classical statistics still invaluable to think about big data analytics
  • 8. Some Statistical Challenges • Exploratory Analysis (EDA), Visualization – Retrospective (on Terabytes) – More Real Time (streaming computations every few minutes/hours) • Statistical Modeling – Scale (computational challenge) – Curse of dimensionality • Millions of predictors, heterogeneity – Temporal and Spatial correlations
  • 9. Statistical Challenges continued • Experiments – To test new methods, test hypotheses from randomized experiments – Adaptive experiments • Forecasting – Planning, advertising • Many more that I am not fully versed in
  • 10. Defining Big Data • How do you know you have a big data problem? – Is it only the number of terabytes? – What about dimensionality, structured/unstructured data, computations required, … • No clear definition, different points of view – When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
  • 11. Distributed Computing for Big Data • Distributed computing invaluable tool to scale computations for big data • Some distributed computing models – Multi-threading – Graphics Processing Units (GPU) – Message Passing Interface (MPI) – Map-Reduce
  • 12. Evaluating a method for a problem • Scalability – Process X GB in Y hours • Ease of use for a statistician • Reliability (fault tolerance) – Especially in an industrial environment • Cost – Hardware and cost of maintaining • Good for the computations required? – E.g., Iterative versus one pass • Resource sharing
  • 13. Multithreading • Multiple threads take advantage of multiple CPUs • Shared memory • Threads can execute independently and concurrently • Can only handle Gigabytes of data • Reliable
  • 14. Graphics Processing Units (GPU) • Number of cores: – CPU: Order of 10 – GPU: smaller cores • Order of 1000 • Can be >100x faster than CPU – Parallel computationally intensive tasks off-loaded to GPU • Good for certain computationally-intensive tasks • Can only handle Gigabytes of data • Not trivial to use, requires good understanding of low-level architecture for efficient use – But things changing, it is getting more user friendly
  • 15. Message Passing Interface (MPI) • Language independent communication protocol among processes (e.g. computers) • Most suitable for master/slave model • Can handle Terabytes of data • Good for iterative processing • Fault tolerance is low
  • 16. Map-Reduce (Dean & Ghemawat, 2004) [Diagram: Data → Mappers → Reducers → Output] • Computation split into Map (scatter) and Reduce (gather) stages • Easy to use: – User needs to implement only two functions: Mapper and Reducer • Easily handles Terabytes of data • Very good fault tolerance (failed tasks automatically get restarted)
  • 17. Comparison of Distributed Computing Methods

                                  Multithreading  GPU                             MPI        Map-Reduce
    Scalability (data size)       Gigabytes       Gigabytes                       Terabytes  Terabytes
    Fault Tolerance               High            High                            Low        High
    Maintenance Cost              Low             Medium                          Medium     Medium-High
    Iterative Process Complexity  Cheap           Cheap                           Cheap      Usually expensive
    Resource Sharing              Hard            Hard                            Easy       Easy
    Easy to Implement?            Easy            Needs low-level GPU knowledge   Easy       Easy
  • 18. Example Problem • Tabulating word counts in corpus of documents • Similar to table function in R
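On a single machine the same computation is one line of R; a minimal sketch with a hypothetical word vector:

  words <- c("Hello", "World", "Bye", "World", "Hello", "Hadoop", "Goodbye", "Hadoop")
  table(words)  # counts each distinct word, like the Map-Reduce word count that follows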
  • 19. Word Count Through Map-Reduce • Input: "Hello World Bye World" → Mapper 1; "Hello Hadoop Goodbye Hadoop" → Mapper 2 • Mapper 1 emits: <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1> • Mapper 2 emits: <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1> • Reducer 1 (words from A-G) outputs: <Bye, 1>, <Goodbye, 1> • Reducer 2 (words from H-Z) outputs: <Hello, 2>, <World, 2>, <Hadoop, 2>
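In the same R-style pseudo code used on slide 25 below (Output() is the assumed emit function), a word-count mapper and reducer might look like this sketch:

  Mapper:
    Input: a chunk of documents
    for (doc in Data) {
      for (word in strsplit(doc, " ")[[1]]) {
        Output(c(word, 1));   # Key = word, Value = 1
      }
    }

  Reducer:
    Input: Key (a word), Value (a list of 1's, one per occurrence)
    count = 0;
    for (v in Value) { count = count + v; }
    Output(c(Key, count));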
  • 20. Key Ideas about Map-Reduce [Diagram: Big Data → Partition 1, Partition 2, …, Partition N → Mapper 1, Mapper 2, …, Mapper N → <Key, Value> pairs → Reducer 1, Reducer 2, …, Reducer M → Output 1, …, Output M]
  • 21. Key Ideas about Map-Reduce • Data are split into partitions and stored in many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs • Data with the same key are sent to the same reducer. One reducer can receive multiple keys • Every reducer sorts its data by key • For each key, the reducer processes the values corresponding to the key according to the customized reducer function and output
  • 22. Compute Mean for Each Group

    ID  Group No.  Score
    1   1          0.5
    2   3          1.0
    3   1          0.8
    4   2          0.7
    5   2          1.5
    6   3          1.2
    7   1          0.8
    8   2          0.9
    9   4          1.3
    …   …          …
  • 23. Key Ideas about Map-Reduce • Data are split into partitions and stored in many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs – For each row: • Key = Group No. • Value = Score • Data with the same key are sent to the same reducer. One reducer can receive multiple keys – E.g. 2 reducers – Reducer 1 receives data with key = 1, 2 – Reducer 2 receives data with key = 3, 4 • Every reducer sorts its data by key – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]> • For each key, the reducer processes the values corresponding to the key according to the customized reducer function and outputs the result – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
  • 24. Key Ideas about Map-Reduce • Data are split into partitions and stored in many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs – For each row: • Key = Group No. • Value = Score • Data with the same key are sent to the same reducer. One reducer can receive multiple keys – E.g. 2 reducers – Reducer 1 receives data with key = 1, 2 – Reducer 2 receives data with key = 3, 4 • Every reducer sorts its data by key – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]> • For each key, the reducer processes the values corresponding to the key according to the customized reducer function and outputs the result – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)> – What you need to implement: the mapper and the reducer functions
  • 25. Pseudo Code (in R)

    Mapper:
      Input: Data
      for (row in Data) {
        groupNo = row$groupNo;
        score = row$score;
        Output(c(groupNo, score));
      }

    Reducer:
      Input: Key (groupNo), List Value (a list of scores that belong to the Key)
      count = 0;
      sum = 0;
      for (v in Value) {
        sum = sum + v;       # R has no += operator
        count = count + 1;   # or ++
      }
      Output(c(Key, sum / count));
  • 26. Exercise 1 • Problem: Average height per {Grade, Gender}? • What should be the mapper output key? • What should be the mapper output value? • What are the reducer inputs? • What are the reducer outputs? • Write the mapper and reducer for this.

    Student ID  Grade  Gender  Height (cm)
    1           3      M       120
    2           2      F       115
    3           2      M       116
    …           …      …       …
  • 27. • Problem: Average height per {Grade, Gender}? • What should be the mapper output key? – {Grade, Gender} • What should be the mapper output value? – Height • What are the reducer inputs? – Key: {Grade, Gender}, Value: List of Heights • What are the reducer outputs? – {Grade, Gender, mean(Heights)} (same student table as in Exercise 1)
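A sketch of the solution in the same pseudo-code style (Output() as before; writing the composite key as a pair):

  Mapper:
    Input: Data
    for (row in Data) {
      Output(key = c(row$grade, row$gender), value = row$height);
    }

  Reducer:
    Input: Key ({Grade, Gender}), Value (a list of heights for that key)
    sum = 0; count = 0;
    for (v in Value) {
      sum = sum + v;
      count = count + 1;
    }
    Output(c(Key, sum / count));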
  • 28. Exercise 2 • Problem: Number of students per {Grade, Gender}? • What should be the mapper output key? • What should be the mapper output value? • What are the reducer inputs? • What are the reducer outputs? • Write the mapper and reducer for this. (same student table as in Exercise 1)
  • 29. • Problem: Number of students per {Grade, Gender}? • What should be the mapper output key? – {Grade, Gender} • What should be the mapper output value? – 1 • What are the reducer inputs? – Key: {Grade, Gender}, Value: List of 1's • What are the reducer outputs? – {Grade, Gender, sum(value list)} – OR: {Grade, Gender, length(value list)} (same student table as in Exercise 1)
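A sketch of the counting solution in the same pseudo-code style; only the value and the reduce step change:

  Mapper:
    Input: Data
    for (row in Data) {
      Output(key = c(row$grade, row$gender), value = 1);
    }

  Reducer:
    Input: Key ({Grade, Gender}), Value (a list of 1's)
    count = 0;
    for (v in Value) { count = count + v; }
    Output(c(Key, count));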
  • 30. More on Map-Reduce • Depends on distributed file systems • Typically mappers are the data storage nodes • Map/Reduce tasks automatically get restarted when they fail (good fault tolerance) • Map and Reduce I/O are all on disk – Data transmission from mappers to reducers is through disk copy • Iterative process through Map-Reduce – Each iteration becomes a map-reduce job – Can be expensive since map-reduce overhead is high
  • 31. The Apache Hadoop System • An open-source software framework for reliable, scalable, distributed computing • The most popular distributed computing system in the world • Key modules: – Hadoop Distributed File System (HDFS) – Hadoop YARN (job scheduling and cluster resource management) – Hadoop MapReduce
  • 32. Major Tools on Hadoop • Pig – A high-level language for Map-Reduce computation • Hive – A SQL-like query language for data querying via Map-Reduce • Hbase – A distributed & scalable database on Hadoop – Allows random, real time read/write access to big data – Voldemort is similar to Hbase • Mahout – A scalable machine learning library • …
  • 33. Hadoop Installation • Setting up Hadoop on your desktop/laptop: – http://hadoop.apache.org/docs/stable/single_node_setup.html • Setting up Hadoop on a cluster of machines: – http://hadoop.apache.org/docs/stable/cluster_setup.html
  • 34. Hadoop Distributed File System (HDFS) • Master/Slave architecture • NameNode: a single master node that controls which data block is stored where. • DataNodes: slave nodes that store data and do R/W operations • Clients (Gateway): Allow users to login and interact with HDFS and submit Map-Reduce jobs • Big data is split to equal-sized blocks, each block can be stored in different DataNodes • Disk failure tolerance: data is replicated multiple times
  • 35. Load the Data into Pig • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float); – The path after LOAD is the location of the data on HDFS • USING PigStorage() means the data are delimited by tabs (can be omitted) • If data are delimited by other characters, e.g. space, use USING PigStorage(' ') • Data schema is defined after AS • Variable types: int, long, float, double, chararray, …
  • 36. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  • 37. Bag of Little Bootstraps Kleiner et al. 2012
  • 38. Bootstrap (Efron, 1979) • A re-sampling based method to obtain the statistical distribution of sample estimators • Why are we interested? – Re-sampling is embarrassingly parallelizable • For example: – Standard deviation of the mean of N samples (μ) – For i = 1 to r do • Randomly sample with replacement N times from the original sample -> bootstrap data i • Compute mean of the i-th bootstrap data -> μi – Estimate of Sd(μ) = Sd([μ1, …, μr]) – r is usually a large number, e.g. 200
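On a single machine this is a few lines of R; a minimal sketch with a hypothetical sample x:

  r <- 200                          # number of bootstrap replicates
  x <- rnorm(1000)                  # hypothetical original sample of size N
  mu <- replicate(r, mean(sample(x, length(x), replace = TRUE)))
  sd(mu)                            # bootstrap estimate of Sd(mean)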
  • 39. Bootstrap for Big Data • Can have r nodes running in parallel, each sampling one bootstrap data • However… – N can be very large – Data may not fit into memory – Collecting N samples with replacement on each node can be computationally expensive
  • 40. M out of N Bootstrap (Bickel et al. 1997) • Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N • Apply an analytical correction to SdM(μ) to obtain Sd(μ) using prior knowledge of the convergence rate of sample estimates • However… – Prior knowledge is required – Choice of M is critical to performance – Finding the optimal value of M needs more computation
  • 41. Bag of Little Bootstraps (BLB) • Example: standard deviation of the mean • Generate S sampled data sets, each obtained by random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b) • For each data set p = 1 to S do – For i = 1 to r do • Sample N samples with replacement from the data of size b • Compute the mean of the resampled data μpi – Compute Sdp(μ) = Sd([μp1, …, μpr]) • Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
  • 42. Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size N data – ξ is some function of θ, such as standard deviation, … • Generate S sampled data sets, each obtained from random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b) • For each data set p = 1 to S do – For i = 1 to r do • Sample N samples with replacement from the data of size b • Compute the estimate θpi from the resampled data – Compute ξp(θ) = ξ([θp1, …, θpr]) • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
  • 43. Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size N data – ξ is some function of θ, such as standard deviation, … • Generate S sampled data sets, each obtained from random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b) • For each data set p = 1 to S do (Mapper) – For i = 1 to r do • Sample N samples with replacement from the data of size b • Compute the estimate θpi from the resampled data – Compute ξp(θ) = ξ([θp1, …, θpr]) (Reducer) • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)]) (Gateway)
  • 44. Why is BLB Efficient • Before: – N samples with replacement from size N data is expensive when N is large • Now: – N samples with replacement from size b data – b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1)) – Equivalent to: a multinomial sampler with dim = b – Storage = O(b), computational complexity = O(b)
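A single-machine sketch in R of the multinomial view, estimating the Sd of the mean; the data x and the choices of S, r, γ are illustrative:

  x <- rnorm(1e5)                      # hypothetical full data of size N
  N <- length(x)
  gamma <- 0.6                         # subset-size exponent, b = N^gamma
  b <- floor(N^gamma)
  S <- 5; r <- 100
  xi <- numeric(S)
  for (p in 1:S) {
    sub <- sample(x, b)                              # subsample without replacement
    mu <- numeric(r)
    for (i in 1:r) {
      w <- as.vector(rmultinom(1, N, rep(1/b, b)))   # counts: N draws over b points
      mu[i] <- sum(w * sub) / N                      # mean of resampled data, O(b) work
    }
    xi[p] <- sd(mu)                                  # Sd_p(mean)
  }
  mean(xi)                                           # BLB estimate of Sd(mean)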
  • 45. Simulation Experiment • 95% CI of Logistic Regression Coefficients • N = 20000, 10 explanatory variables • Relative Error = |Estimated CI width – True CI width | / True CI width • BLB-γ: BLB with b = Nγ • BOFN-γ: b out of N sampling with b = Nγ • BOOT: Naïve bootstrap
  • 47. Real Data • 95% CI of Logistic Regression Coefficients • N = 6M, 3000 explanatory variables • Data size = 150GB, r = 50, s = 5, γ = 0.7
  • 48. Summary of BLB • A new algorithm for bootstrapping on big data • Advantages – Fast and efficient – Easy to parallelize – Easy to understand and implement – Friendly to Hadoop, makes it routine to perform statistical calculations on Big data
  • 49. Large Scale Logistic Regression
  • 50. Logistic Regression • Binary response: Y • Covariates: X • Yi ~ Bernoulli(pi) • log(pi/(1-pi)) = Xi Tβ ; β ~ MVN(0 , 1/λ I ) • Widely used (research and applications)
  • 51. Large Scale Logistic Regression • Binary response: Y – E.g., Click / Non-click on an ad on a webpage • Covariates: X – User covariates: • Age, gender, industry, education, job, job title, … – Item covariates: • Categories, keywords, topics, … – Context covariates: • Time, page type, position, … – 2-way interaction: • User covariates X item covariates • Context covariates X item covariates • …
  • 52. Computational Challenge • Hundreds of millions/billions of observations • Hundreds of thousands/millions of covariates • Fitting such a logistic regression model on a single machine is not feasible • Model fitting is iterative, using methods like gradient descent, Newton's method, etc. – Multiple passes over the data
  • 53. Recap on Optimization Methods • Problem: find x to minimize F(x) • Iteration n: x_n = x_{n-1} − b_{n-1} F′(x_{n-1}) • b_{n-1} is the step size, which can change every iteration • Iterate until convergence • Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
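A minimal gradient-descent sketch in R; the objective F, its gradient gradF, the start point and the step size are all hypothetical:

  F     <- function(x) sum((x - 2)^2)       # hypothetical objective
  gradF <- function(x) 2 * (x - 2)          # its gradient F'(x)
  x <- c(0, 0); b <- 0.1                    # start point and step size
  for (n in 1:100) {
    x_new <- x - b * gradF(x)               # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
    if (sum(abs(x_new - x)) < 1e-8) break   # iterate until convergence
    x <- x_new
  }
  x                                         # approaches the minimizer (2, 2)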
  • 54. Iterative Process with Hadoop [Diagram: Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → …]
  • 55. Limitations of Hadoop for fitting a big logistic regression • Iterative process is expensive and slow • Every iteration = a Map-Reduce job • I/O of mapper and reducers are both through disk • Plus: Waiting in queue time • Q: Can we find a fitting method that scales with Hadoop ?
  • 56. Large Scale Logistic Regression • Naïve: – Partition the data and run logistic regression for each partition – Take the mean of the learned coefficients – Problem: Not guaranteed to converge to the model from single machine! • Alternating Direction Method of Multipliers (ADMM) – Boyd et al. 2011 – Set up constraints: each partition’s coefficient = global consensus – Solve the optimization problem using Lagrange Multipliers – Advantage: guaranteed to converge to a single machine logistic regression on the entire data with reasonable number of iterations
  • 57. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Logistic Regression Logistic Regression Logistic Regression Consensus Computation Iteration 1
  • 58. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Consensus Computation Logistic Regression Logistic Regression Logistic Regression Iteration 1
  • 59. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Logistic Regression Logistic Regression Logistic Regression Consensus Computation Iteration 2
  • 61. Dual Ascent Method • Consider a convex optimization problem • Lagrangian for the problem: • Dual Ascent:
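The equations on this slide are images in the original deck; following Boyd et al. 2011 (cited on slide 56), a sketch of the missing formulas:

  \min_x f(x) \quad \text{subject to} \quad Ax = b
  L(x, y) = f(x) + y^\top (Ax - b)
  x^{k+1} = \arg\min_x L(x, y^k), \qquad y^{k+1} = y^k + \alpha^k (A x^{k+1} - b)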
  • 62. Augmented Lagrangians • Bring robustness to the dual ascent method • Yield convergence without assumptions like strict convexity or finiteness of f • The value of ρ influences the convergence rate
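The augmented Lagrangian itself is an image in the deck; in the same notation it is:

  L_\rho(x, y) = f(x) + y^\top (Ax - b) + \frac{\rho}{2} \|Ax - b\|_2^2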
  • 63. Alternating Direction Method of Multipliers (ADMM) • Problem: • Augmented Lagrangians • ADMM:
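The problem statement and updates on this slide are also images; following Boyd et al. 2011, a sketch:

  \min_{x, z} \ f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c
  x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)
  z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)
  y^{k+1} = y^k + \rho (A x^{k+1} + B z^{k+1} - c)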
  • 64. Large Scale Logistic Regression via ADMM • Notation – (Xi , yi): data in the ith partition – βi: coefficient vector for partition i – β: Consensus coefficient vector – r(β): penalty component such as ||β||2 2 • Optimization problem
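The optimization problem is an image in the deck; in the slide's notation, a sketch of the consensus form is:

  \min_{\beta_1, \ldots, \beta_K, \beta} \ \sum_{i=1}^{K} \ell(X_i, y_i; \beta_i) + \lambda\, r(\beta) \quad \text{subject to} \quad \beta_i = \beta, \ i = 1, \ldots, K

where ℓ is the logistic log-loss on partition i.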
  • 65. ADMM updates LOCAL REGRESSIONS Shrinkage towards current best global estimate UPDATED CONSENSUS
  • 66. An example implementation • ADMM for logistic regression model fitting with L2/L1 penalty • Each iteration of ADMM is a Map-Reduce job – Mapper: partition the data into K partitions – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression – Gateway: computes the consensus from the results of all reducers and sends it back to each reducer node
  • 67. KDD CUP 2010 Data • Bridge to Algebra 2008-2009 data in https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp • Binary response, 20M covariates • Only keep covariates with >= 10 occurrences => 2.2M covariates • Training data: 8,407,752 samples • Test data: 510,302 samples
  • 68. Avg Training Loglikelihood vs Number of Iterations
  • 69. Test AUC vs Number of Iterations
  • 70. Better Convergence Can Be Achieved By • Better Initialization – Use results from Naïve method to initialize the parameters • Adaptively change step size (ρ) for each iteration based on the convergence status of the consensus
  • 72. Agenda • Topic of Interest – Recommender problems for dynamic, time-sensitive applications • Content optimization, online advertising, movie recommendation, shopping, … • Introduction • Offline components – Regression, collaborative filtering (CF), … • Online components + initialization – Time-series, online/incremental methods, explore/exploit (bandit) • Evaluation methods + multi-objective • Challenges
  • 73. Three components we will focus on • Defining the problem – Formulate objectives whose optimization achieves some long- term goals for the recommender system • E.g. How to serve content to optimize audience reach and engagement, optimize some combination of engagement and revenue ? • Modeling (to estimate some critical inputs) – Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions • E.g. Click rates, average time-spent on page, etc • Could be explicit feedback like ratings • Experimentation – Create experiments to collect data proactively to improve models, helps in converging to the best choice(s) cheaply and rapidly. • Explore and Exploit (continuous experimentation) • DOE (testing hypotheses by avoiding bias inherent in data)
  • 74. Modern Recommendation Systems • Goal – Serve the right item to a user in a given context to optimize long-term business objectives • A scientific discipline that involves – Large scale Machine Learning & Statistics • Offline Models (capture global & stable characteristics) • Online Models (incorporates dynamic components) • Explore/Exploit (active and adaptive experimentation) – Multi-Objective Optimization • Click-rates (CTR), Engagement, advertising revenue, diversity, etc – Inferring user interest • Constructing User Profiles – Natural Language Processing to understand content • Topics, “aboutness”, entities, follow-up of something, breaking news,…
  • 75. Some examples from content optimization • Simple version – I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my webpage. How do I perform a simultaneous optimization?
  • 76. [Diagram: front-page modules] • Recommend applications • Recommend search queries • Recommend news articles • Recommend packages: image, title, summary, links to other pages – Pick 4 out of a pool of K (K = 20 ~ 40, dynamic) – Routes traffic to other pages
  • 77. Problems in this example • Optimize CTR on multiple modules – Today Module, Trending Now, Personal Assistant, News – Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations. • For any single module – Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.
  • 78. Online Advertising Advertisers Ad Network Ads Page Recommend Best ad(s) User Publisher Response rates (click, conversion, ad-view) Bids Auction Click conversion Select argmax f(bid,response rates) ML /Statistical model Examples: Yahoo, Google, MSN, … Ad exchanges (RightMedia, DoubleClick, … )
  • 79. LinkedIn Today: Content Module Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
  • 80. LinkedIn Ads: Match ads to users visiting LinkedIn
  • 81. Right Media Ad Exchange: Unified Marketplace Match ads to page views on publisher sites Has ad impression to sell -- AUCTIONS Bids $0.50 Bids $0.75 via Network… … which becomes $0.45 bid Bids $0.65—WINS! AdSense Ad.com Bids $0.60
  • 82. Recommender problems in general USER Item Inventory Articles, web page, ads, … Use an automated algorithm to select item(s) to show Get feedback (click, time spent,..) Refine the models Repeat (large number of times) Optimize metric(s) of interest (Total clicks, Total revenue,…) Example applications Search: Web, Vertical Online Advertising Content ….. Context query, page, …
  • 83. • Items: Articles, ads, modules, movies, users, updates, etc. • Context: query keywords, pages, mobile, social media, etc. • Metric to optimize (e.g., relevance score, CTR, revenue, engagement) – Currently, most applications are single-objective – Could be multi-objective optimization (maximize X subject to Y, Z,..) • Properties of the item pool – Size (e.g., all web pages vs. 40 stories) – Quality of the pool (e.g., anything vs. editorially selected) – Lifetime (e.g., mostly old items vs. mostly new items) Important Factors
  • 84. Factors affecting Solution (continued) • Properties of the context – Pull: Specified by explicit, user-driven query (e.g., keywords, a form) – Push: Specified by implicit context (e.g., a page, a user, a session) • Most applications are somewhere on continuum of pull and push • Properties of the feedback on the matches made – Types and semantics of feedback (e.g., click, vote) – Latency (e.g., available in 5 minutes vs. 1 day) – Volume (e.g., 100K per day vs. 300M per day) • Constraints specifying legitimate matches – e.g., business rules, diversity rules, editorial Voice – Multiple objectives • Available Metadata (e.g., link graph, various user/item attributes)
  • 85. Predicting User-Item Interactions (e.g. CTR) • Myth: We have so much data on the web, if we can only process it the problem is solved – Number of things to learn increases with sample size • Rate of increase is not slow – Dynamic nature of systems make things worse – We want to learn things quickly and react fast • Data is sparse in web recommender problems – We lack enough data to learn all we want to learn and as quickly as we would like to learn – Several Power laws interacting with each other • E.g. User visits power law, items served power law – Bivariate Zipf: Owen & Dyer, 2011
  • 86. Can Machine Learning help? • Fortunately, there are group behaviors that generalize to individuals & they are relatively stable – E.g. Users in San Francisco tend to read more baseball news • Key issue: estimating such groups – Coarse group: more stable but does not generalize that well – Granular group: less stable, with few individuals – Getting a good grouping structure is to hit the "sweet spot" • Another big advantage on the web – Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choice(s) • We don't need to learn all user-item interactions, only those that are good
  • 87. Predicting user-item interaction rates Offline ( Captures stable characteristics at coarse resolutions) (Logistic, Boosting,….) Feature construction Content: IR, clustering, taxonomy, entity,.. User profiles: clicks, views, social, community,.. Near Online (Finer resolution Corrections) (item, user level) (Quick updates) Explore/Exploit (Adaptive sampling) (helps rapid convergence to best choices) Initialize
  • 88. Post-click: An example in Content Optimization Recommender EDITORIAL content Clicks on FP links influence downstream supply distribution AD SERVER DISPLAY ADVERTISING Revenue Downstream engagement (Time spent)
  • 89. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: maximize clicks (maximize downstream supply from FP) • But consider the following – Article 1: CTR = 5%, utility per click = 5 – Article 2: CTR = 4.9%, utility per click = 10 • By promoting Article 2, we lose 0.1 clicks per 100 visits but gain 24 utils per 100 visits (49 vs. 25) • If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc)
  • 90. High level picture http request Statistical Models updated in Batch mode: e.g. once every 30 mins Server Item Recommendation system: thousands of computations in sub-seconds User Interacts e.g. click, does nothing
  • 91. High level overview: Item Recommendation System • User info: updated in batch (activity, profile) • Item index: id, meta-data; pre-filter (SPAM, editorial, …); feature extraction (NLP, clustering, …) • ML/Statistical models score items: P(click), P(share), semantic-relevance score, … • Rank items: sort by score (CTR, bid*CTR, …), combine scores using multi-objective optimization, threshold on some scores, … • User-item interaction data: batch process
  • 92. ML/Statistical models for scoring [Chart: number of items scored by ML (roughly 100 up to 100M) vs. traffic volume, with model update latency from a few hours to several days; example systems: LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, LinkedIn Ads]
  • 93. Summary of deployments • Yahoo! Front page Today Module (2008-2011): 300% improvement in click-through rates – Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! Properties (2011): Significant improvement in engagement across Yahoo! Network • Fully deployed on LinkedIn Today Module (2012): Significant improvement in click-through rates (numbers not revealed due to reasons of confidentiality) • Yahoo! RightMedia exchange (2012): Fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed due to reasons of confidentiality) • LinkedIn self-serve ads (2012-2013):Fully deployed • LinkedIn News Feed (2013-2014): Fully deployed • Several others in progress….
  • 94. Broad Themes • Curse of dimensionality – Large number of observations (rows), large number of potential features (columns) – Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom) • I will give examples as we move along • We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation – This can fundamentally change solutions • Think of computation and models together for big data • Optimization: what we are trying to optimize is often complex; models must work in harmony with the optimization – Pareto optimality with competing objectives
  • 95. Statistical Problem • Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest • Examples of utility functions – Click-rates (CTR) – Share-rates (CTR* [Share|Click] ) – Revenue per page-view = CTR*bid (more complex due to second price auction) • CTR is a fundamental measure that opens the door to a more principled approach to rank items • Converge rapidly to maximum utility items – Sequential decision making process (explore/exploit)
  • 96. LinkedIn Today, Yahoo! Today Module: Choose Items to Maximize CTR • User i with user features (e.g., industry, behavioral features, demographic features, …) visits • Algorithm selects item j from a set of candidates • Response y_ij (click or not) • Which item should we select? – The item with the highest predicted CTR (Exploit) – An item for which we need data to predict its CTR (Explore) • This is an "Explore/Exploit" problem
  • 97. The Explore/Exploit Problem (to maximize CTR) • Problem definition: pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items • Easy!? Pick the items having the highest click-through rates (CTRs) • But … – The system is highly dynamic: • Items come and go with short lifetimes • CTR of each item may change over time – How much traffic should be allocated to explore new items to achieve optimal performance? • Too little → unreliable CTR estimates due to "starvation" • Too much → little traffic to exploit the high CTR items
  • 98. Y! front Page Application • Simplify: Maximize CTR on first slot (F1) • Item Pool – Editorially selected for high quality and brand image – Few articles in the pool but item pool dynamic
  • 99. CTR Curves of Items on LinkedIn Today [figure: CTR of each item over time]
  • 100. Impact of repeat item views on a given user • Same user is shown an item multiple times (despite not clicking)
  • 101. Simple algorithm to estimate most popular item with small but dynamic item pool • Simple explore/exploit scheme – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool – (100 − ε)% exploit: with large probability (e.g. 95%), choose the highest scoring CTR item • Temporal smoothing – Item CTRs change over time; provide more weight to recent data in estimating item CTRs • Kalman filter, moving average • Discount item score with repeat views – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data) • Segmented most popular – Perform separate most-popular for each user segment
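A sketch of the ε-greedy scheme in R; the clicks/views state vectors and the +1/+2 smoothing are illustrative assumptions:

  serve_item <- function(clicks, views, eps = 0.05) {
    if (runif(1) < eps) {
      sample(length(clicks), 1)            # explore: random item from the pool
    } else {
      ctr <- (clicks + 1) / (views + 2)    # smoothed CTR estimate
      which.max(ctr)                       # exploit: current best item
    }
  }

  clicks <- rep(0, 10); views <- rep(0, 10)  # hypothetical pool of 10 items
  item <- serve_item(clicks, views)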
  • 102. Time series model: Kalman filter • Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative fashion • Estimated click-rate distribution at time t+1 – Prior mean: μ_{t+1} = μ_{t|t} – Prior variance: σ²_{t+1} = σ²_{t|t}(1 + η) + μ²_{t|t} η (see the derivation on slide 167) • High CTR items are more adaptive
  • 103. More economical exploration? Better bandit solutions • Consider a two-armed problem with unknown payoff probabilities p1 > p2 • The gambler has 1000 plays; what is the best way to experiment? (to maximize total expected reward) • This is called the "multi-armed bandit" problem and has been studied for a long time • Optimal solution: play the arm that has the maximum potential of being good • Optimism in the face of uncertainty
  • 104. Item Recommendation: Bandits? • Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000 – Greedy: show Item 2 to all; not a good idea – Item 1's CTR estimate is noisy; the item could potentially be better • Invest in Item 1 for better overall performance on average – Exploit what is known to be good, explore what is potentially good [figure: probability density of CTR for Item 1 (wide) and Item 2 (narrow)]
  • 105. Next few hours

                                  Most Popular Recommendation   Personalized Recommendation
    Offline Models                                              Collaborative filtering (cold-start problem)
    Online Models                 Time-series models            Incremental CF, online regression
    Intelligent Initialization    Prior estimation              Prior estimation, dimension reduction
    Explore/Exploit               Multi-armed bandits           Bandits with covariates
  • 106. Offline Components: Collaborative Filtering in Cold-start Situations
  • 107. Problem • User i with user features x_i (demographics, browse history, search history, …) visits • Algorithm selects item j with item features x_j (keywords, content categories, …) • Response y_ij (explicit rating, implicit click/no-click) • Predict the unobserved entries based on features and the observed entries
  • 108. Model Choices • Feature-based (or content-based) approach – Use features to predict response • (regression, Bayes Net, mixture models, …) – Limitation: need predictive features • Bias often high, does not capture signals at granular levels • Collaborative filtering (CF aka Memory based) – Make recommendation based on past user-item interaction • User-user, item-item, matrix factorization, … • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc. – Better performance for old users and old items – Does not naturally handle new users and new items (cold- start)
  • 109. Collaborative Filtering (memory based methods) • User-user similarity • Item-item similarities, incorporating both • Estimating similarities: Pearson's correlation, optimization based (Koren et al)
  • 110. How to Deal with the Cold-Start Problem • Heuristic-based approaches – Linear combination of regression and CF models – Filterbot • Add user features as pseudo users and do collaborative filtering – Hybrid approaches • Use content based to fill up entries, then use CF • Matrix factorization – Good performance on Netflix (Koren, 2009) • Model-based approaches – Bilinear random-effects model (probabilistic matrix factorization) • Good on Netflix data [Ruslan et al ICML, 2009] – Add feature-based regression to matrix factorization • (Agarwal and Chen, 2009) – Add topic discovery (from textual items) to matrix factorization • (Agarwal and Chen, 2009; Chun and Blei, 2011)
  • 111. Per-item regression models • When tracking users by cookies, the distribution of visit patterns can get extremely skewed – The majority of cookies have 1-2 visits • Per-item models (regression) based on user covariates are attractive in such cases
  • 112. Several per-item regressions: multi-task learning • Low dimension (5-10), B estimated from retrospective data • Affinity to old items • Agarwal, Chen and Elango, KDD, 2010
  • 113. Per-user, per-item models via bilinear random-effects model
  • 114. Motivation • Data measuring k-way interactions pervasive – Consider k = 2 for all our discussions • E.g. User-Movie, User-content, User-Publisher-Ads,…. – Power law on both user and item degrees • Classical Techniques – Approximate matrix through a singular value decomposition (SVD) • After adjusting for marginal effects (user pop, movie pop,..) – Does not work • Matrix highly incomplete, severe over-fitting – Key issue • Regularization of eigenvectors (factors) to avoid overfitting
  • 115. Early work on complete matrices • Tukey’s 1-df model (1956) – Rank 1 approximation of small nearly complete matrix • Criss-cross regression (Gabriel, 1978) • Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) • Modern day recommender problems – Highly incomplete, large, noisy.
  • 117. Factorization – Brief Overview • Interaction: E[y_ij] = α_i + β_j + u_i′ B v_j • Latent user factors: (α_i, u_i = (u_i1, …, u_in)) • Latent movie factors: (β_j, v_j = (v_j1, …, v_jn)) • (Nn + Mm) parameters • Key technical issue: will overfit for moderate values of n, m → regularization
  • 118. Latent Factor Models: Different Aspects • Matrix Factorization – Factors in Euclidean space – Factors on the simplex • Incorporating features and ratings simultaneously • Online updates
  • 119. Maximum Margin Matrix Factorization (MMMF) • Complete matrix by minimizing loss (hinge, squared-error) on observed entries subject to constraints on trace norm – Srebro, Rennie, Jaakkola (NIPS 2004) – Convex, semi-definite programming (expensive, not scalable) • Fast MMMF (Rennie & Srebro, ICML, 2005) – Constrain the Frobenius norm of left and right eigenvector matrices; not convex but becomes scalable • Other variation: Ensemble MMMF (DeCoste, ICML 2005) – Ensembles of partially trained MMMF (some improvements)
  • 120. Matrix Factorization for Netflix prize data • Minimize the objective function: Σ_{ij ∈ obs} (r_ij − u_i^T v_j)² + λ (Σ_i ||u_i||² + Σ_j ||v_j||²) • Simon Funk: stochastic gradient descent • Koren et al (KDD 2007): alternating least squares – They moved to SGD later in the competition
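A single-machine SGD sketch in R for this objective; the toy data, dimensions, step size and λ are all illustrative:

  set.seed(1)
  N <- 50; M <- 40; k <- 5                  # users, items, latent dimension (toy sizes)
  obs <- cbind(sample(N, 500, replace = TRUE), sample(M, 500, replace = TRUE))
  R <- matrix(NA, N, M)
  R[obs] <- rnorm(500, mean = 3)            # toy observed ratings
  U <- matrix(rnorm(N * k, sd = 0.1), N, k) # latent user factors u_i
  V <- matrix(rnorm(M * k, sd = 0.1), M, k) # latent item factors v_j
  lambda <- 0.05; step <- 0.01
  for (epoch in 1:50) {
    for (idx in sample(nrow(obs))) {
      i <- obs[idx, 1]; j <- obs[idx, 2]
      e <- R[i, j] - sum(U[i, ] * V[j, ])   # residual r_ij - u_i' v_j
      U[i, ] <- U[i, ] + step * (e * V[j, ] - lambda * U[i, ])
      V[j, ] <- V[j, ] + step * (e * U[i, ] - lambda * V[j, ])
    }
  }
  sqrt(mean((R[obs] - rowSums(U[obs[, 1], ] * V[obs[, 2], ]))^2))  # training RMSE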
  • 121. Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS) • Model: r_ij ~ N(u_i^T v_j, σ²), u_i ~ MVN(0, a_u I), v_j ~ MVN(0, a_v I) • Optimization is through iterated conditional modes • Other variations like constraining the mean through a sigmoid, using "who-rated-whom" • Combining with Boltzmann machines also improved performance
  • 122. Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008) • Fully Bayesian treatment using an MCMC approach, placing priors on the variance components (e.g. a_u) – Significant improvement • Interpretation as a fully Bayesian hierarchical model shows why that is the case – Failing to incorporate uncertainty leads to bias in estimates – Multi-modal posterior; MCMC helps in converging to a better one • MCEM also more resistant to over-fitting
  • 123. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) • Specify rank probabilistically (automatic rank selection): y_ij ~ N(Σ_{k=1}^{r} z_k u_ik v_jk, σ²), z_k ~ Ber(π_k), π_k ~ Beta(a/r, b(r−1)/r) • E(#Factors) = ra / (a + b(r−1))
  • 124. How to incorporate features: Deal with both warm start and cold-start • Models to predict ratings for new pairs – Warm-start: (user, movie) present in the training data with large sample size – Cold-start: At least one of (user, movie) new or has small sample size • Rough definition, warm-start/cold-start is a continuum. • Challenges – Highly incomplete (user, movie) matrix – Heavy tailed degree distributions for users/movies • Large fraction of ratings from small fraction of users/movies – Handling both warm-start and cold-start effectively in the presence of predictive features
  • 125. Possible approaches • Large scale regression based on covariates – Does not provide good estimates for heavy users/movies – Large number of predictors to estimate interactions • Collaborative filtering – Neighborhood based – Factorization • Good for warm-start; cold-start dealt with separately • Single model that handles cold-start and warm-start – Heavy users/movies → User/movie specific model – Light users/movies → fallback on regression model – Smooth fallback mechanism for good performance
  • 126. Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model
  • 127. Regression-based Factorization Model (RLFM) • Main idea: Flexible prior, predict factors through regressions • Seamlessly handles cold-start and warm- start • Modified state equation to incorporate covariates
  • 128. RLFM: Model • Rating that user i gives item j: y_ij ~ N(μ_ij, σ²) (Gaussian model); y_ij ~ Bernoulli(μ_ij) (logistic model, for binary ratings); y_ij ~ Poisson(μ_ij N_ij) (Poisson model, for counts) • t(μ_ij) = x_ij^T b + α_i + β_j + u_i^T v_j • Bias of user i: α_i = g_0^T x_i + ε_i^α, ε_i^α ~ N(0, σ_α²) • Popularity of item j: β_j = d_0^T x_j + ε_j^β, ε_j^β ~ N(0, σ_β²) • Factors of user i: u_i = G x_i + ε_i^u, ε_i^u ~ N(0, σ_u² I) • Factors of item j: v_j = D x_j + ε_j^v, ε_j^v ~ N(0, σ_v² I) • Could use other classes of regression models
  • 130. Advantages of RLFM • Better regularization of factors – Factors "shrink" towards a better, covariate-based centroid • Cold-start: fallback regression model (FeatureOnly)
  • 131. RLFM: Illustration of Shrinkage Plot the first factor value for each user (fitted using Yahoo! FP data)
  • 132. Model fitting: EM for our class of models
  • 133. The parameters for RLFM • Latent parameters: Δ = ({α_i}, {β_j}, {u_i}, {v_j}) • Hyper-parameters: Θ = (b, G, D, A_u = a_u I, A_v = a_v I)
  • 136. Computing the E-step • Often hard to compute in closed form • Stochastic EM (Markov Chain EM; MCEM) – Compute expectation by drawing samples from – Effective for multi-modal posteriors but more expensive • Iterated Conditional Modes algorithm (ICM) – Faster but biased hyper-parameter estimates
  • 137. Monte Carlo E-step • Through a vanilla Gibbs sampler (conditionals closed form) • Other conditionals also Gaussian and closed form • Conditionals of users (movies) sampled simultaneously • Small number of samples in early iterations, large numbers in later iterations
  • 138. M-step (Why MCEM is better than ICM) • Update G by optimizing the regression of the sampled factors on the covariates • Update A_u = a_u I – The posterior variance term is ignored by ICM, which underestimates factor variability – Factors get over-shrunk; the posterior is not explored well
  • 139. Experiment 1: Better regularization • MovieLens-100K, avg RMSE using pre-specified splits • ZeroMean, RLFM and FeatureOnly (no cold-start issues) • Covariates: – Users : age, gender, zipcode (1st digit only) – Movies: genres
  • 140. Experiment 2: Better handling of Cold-start • MovieLens-1M; EachMovie • Training-test split based on timestamp • Same covariates as in Experiment 1.
  • 141. Experiment 4: Predicting click-rate on articles • Goal: Predict click-rate on articles for a user on F1 position • Article lifetimes short, dynamic updates important • User covariates: – Age, Gender, Geo, Browse behavior • Article covariates – Content Category, keywords • 2M ratings, 30K users, 4.5 K articles
  • 142. Results on Y! FP data
  • 143. Some other related approaches • Stern, Herbrich and Graepel, WWW, 2009 – Similar to RLFM, different parametrization and expectation propagation used to fit the models • Porteus, Asuncion and Welling, AAAI, 2011 – Non-parametric approach using a Dirichlet process • Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 – Regression + random effects per user regularized through a Graphical Lasso
  • 144. Add Topic Discovery into Matrix Factorization fLDA: Matrix Factorization through Latent Dirichlet Allocation
  • 145. fLDA: Introduction • Model the rating y_ij that user i gives to item j as the user's affinity to the topics that the item has: y_ij = … + Σ_k s_ik z̄_jk, where s_ik is user i's affinity to topic k and z̄_jk is Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j – Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data – fLDA can be thought of as a "multi-task learning" version of the supervised LDA model [Blei'07] for cold-start recommendation • Old items: the z̄_jk's are item latent factors learnt from data with the LDA prior • New items: the z̄_jk's are predicted based on the bag of words in the items
  • 146. LDA Topic Modeling (1) • LDA is effective for unsupervised topic discovery [Blei'03] – It models the generating process of a corpus of items (articles) – For each topic k, draw a word distribution Φ_k = [Φ_k1, …, Φ_kW] ~ Dir(η) – For each item j, draw a topic distribution θ_j = [θ_j1, …, θ_jK] ~ Dir(λ) – For each word, say the nth word, in item j: • Draw a topic z_jn for that word from θ_j • Draw a word w_jn from Φ_k with topic k = z_jn • Observed: the words w_j1, …, w_jn, … of item j; the per-word topics z_j1, …, z_jn, … and the topic distribution [θ_j1, …, θ_jK] are latent
  • 147. LDA Topic Modeling (2) • Model training: – Estimate the prior parameters and the posterior topic-word distribution Φ based on a training corpus of items – EM + Gibbs sampling is a popular method • Inference for new items – Compute the item topic distribution based on the prior parameters and the Φ estimated in the training phase • Supervised LDA [Blei'07] – Predict a target value for each item based on supervised LDA topics: y_j = Σ_k s_k z̄_jk, where y_j is the target value of item j, z̄_jk is Pr(item j has topic k) estimated by averaging the topic of each word in item j, and s_k is the regression weight for topic k – vs. fLDA: y_ij = … + Σ_k s_ik z̄_jk • One regression per user • Same set of topics across different regressions
  • 148. fLDA: Model • Rating that user i gives item j: y_ij ~ N(μ_ij, σ²) (Gaussian model); y_ij ~ Bernoulli(μ_ij) (logistic model, for binary ratings); y_ij ~ Poisson(μ_ij N_ij) (Poisson model, for counts) • t(μ_ij) = x_ij^T b + α_i + β_j + Σ_k s_ik z̄_jk • Bias of user i: α_i = g_0^T x_i + ε_i^α, ε_i^α ~ N(0, σ_α²) • Popularity of item j: β_j = d_0^T x_j + ε_j^β, ε_j^β ~ N(0, σ_β²) • Topic affinity of user i: s_i = H x_i + ε_i^s, ε_i^s ~ N(0, σ_s² I) • Pr(item j has topic k): z̄_jk = Σ_n 1(z_jn = k) / (#words in item j), where z_jn is the LDA topic of the nth word in item j • Observed words: w_jn ~ LDA(λ, η, z_jn), the nth word in item j
  • 149. Model Fitting • Given: – Features X = {x_i, x_j, x_ij} – Observed ratings y = {y_ij} and words w = {w_jn} • Estimate: – Parameters: Θ = [b, g_0, d_0, H, σ², a_α, a_β, A_s, λ, η] • Regression weights and prior parameters – Latent factors: Δ = {α_i, β_j, s_i} and z = {z_jn} • User factors, item factors and per-word topic assignment • Empirical Bayes approach: – Maximum likelihood estimate of the parameters: Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz – The posterior distribution of the factors: Pr[Δ, z | y, Θ̂]
  • 150. The EM Algorithm • Iterate through the E and M steps until convergence – Let Θ̂^(n) be the current estimate – E-step: Compute f(Θ) = E_{Δ,z | y,w,Θ̂^(n)} [log Pr(y, w, Δ, z | Θ)] • The expectation is not in closed form • We draw Gibbs samples and compute the Monte Carlo mean – M-step: Find Θ̂^(n+1) = argmax_Θ f(Θ) • It consists of solving a number of regression and optimization problems
  • 151. Supervised Topic Assignment • The topic of the nth word in item j: Pr(z_jn = k | Rest) ∝ (Z_jk^¬jn + λ) × (Z_k,w_jn^¬jn + η) / (Z_k^¬jn + Wη) × Π_{i rated j} f(y_ij | z_jn = k) • The first two factors are the same as in unsupervised LDA • The last factor is the likelihood of observed ratings by users who rated item j when z_jn is set to topic k (the probability of observing y_ij given the model)
  • 152. fLDA: Experimental Results (Movie) • Task: Predict the rating that a user would give a movie • Training/test split: – Sort observations by time – First 75% → training data; last 25% → test data • Item warm-start scenario – Only 2% new items in test data

    Model         Test RMSE
    RLFM          0.9363
    fLDA          0.9381
    Factor-Only   0.9422
    FilterBot     0.9517
    unsup-LDA     0.9520
    MostPopular   0.9726
    Feature-Only  1.0906
    Constant      1.1190

  • fLDA is as strong as the best method • It does not reduce the performance in warm-start scenarios
  • 153. fLDA: Experimental Results (Yahoo! Buzz) • Task: Predict whether a user would buzz-up an article • Severe item cold-start – All items are new in test data • Data statistics: 1.2M observations, 4K users, 10K articles • fLDA significantly outperforms other models
  • 154. Experimental Results: Buzzing Topics • 3/4 topics are interpretable; 1/2 are similar to unsupervised topics

    Topic               Top terms (after stemming)
    CIA interrogation   bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
    Swine flu           mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
    NFL games           NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
    Gay marriage        court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
    Sarah Palin         palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
    American idol       idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
    Recession           economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
    North Korea issues  north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia
  • 155. fLDA Summary • fLDA is a useful model for cold-start item recommendation • It also provides interpretable recommendations for users – User’s preference to interpretable LDA topics • Future directions: – Investigate Gibbs sampling chains and the convergence properties of the EM algorithm – Apply fLDA to other multi-task prediction problems • fLDA can be used as a tool to generate supervised features (topics) from text data
  • 156. Summary • Regularizing factors through covariates effective • Regression based factor model that regularizes better and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive • Fitting method scalable; Gibbs sampling for users and movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine • Distributed computing on Hadoop: Multiple models and average across partitions (more later)
  • 157. Online Components: Online Models, Intelligent Initialization, Explore / Exploit
  • 158. Why Online Components? • Cold start – New items or new users come to the system – How to obtain data for new items/users (explore/exploit) – Once data becomes available, how to quickly update the model • Periodic rebuild (e.g., daily): Expensive • Continuous online update (e.g., every minute): Cheap • Concept drift – Item popularity, user interest, mood, and user-to-item affinity may change over time – How to track the most recent behavior • Down-weight old data – How to model temporal patterns for better prediction • … may not need to be online if the patterns are stationary
  • 159. Big Picture

                                                    Most Popular Recommendation   Personalized Recommendation
    Offline Models                                                                Collaborative filtering (cold-start problem)
    Online Models (real systems are dynamic)        Time-series models            Incremental CF, online regression
    Intelligent Initialization (do not start cold)  Prior estimation              Prior estimation, dimension reduction
    Explore/Exploit (actively acquire data)         Multi-armed bandits           Bandits with covariates

  • Extension: Segmented Most Popular Recommendation
  • 160. Online Components for Most Popular Recommendation Online models, intelligent initialization & explore/exploit
  • 161. Most popular recommendation: Outline • Most popular recommendation (no personalization, all users see the same thing) – Time-series models (online models) – Prior estimation (initialization) – Multi-armed bandits (explore/exploit) – Sometimes hard to beat!! • Segmented most popular recommendation – Create user segments/clusters based on user features – Do most popular recommendation for each segment
  • 162. Most Popular Recommendation • Problem definition: pick k items (articles) from a pool of N to maximize the total number of clicks on the picked items • Easy!? Pick the items having the highest click-through rates (CTRs) • But … – The system is highly dynamic: • Items come and go with short lifetimes • CTR of each item changes over time – How much traffic should be allocated to explore new items to achieve optimal performance • Too little → unreliable CTR estimates • Too much → little traffic to exploit the high CTR items
  • 163. CTR Curves for Two Days on Yahoo! Front Page Traffic obtained from a controlled randomized experiment (no confounding) Things to note: (a) Short lifetimes, (b) temporal effects, (c) often breaking news stories Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
  • 164. For Simplicity, Assume … • Pick only one item for each user visit – Multi-slot optimization later • No user segmentation, no personalization (discussion later) • The pool of candidate items is predetermined and is relatively small (≈ 1000) – E.g., selected by human editors or by a first-phase filtering method – Ideally, there should be a feedback loop – Large item pool problem later • Effects like user fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)
  • 165. Online Models • How to track the changing CTR of an item • Data: for each item, at time t, we observe – Number of times the item was displayed, n_t (i.e., #views) – Number of clicks on the item, c_t • Problem definition: Given c_1, n_1, …, c_t, n_t, predict the CTR (click-through rate) p_{t+1} at time t+1 • Potential solutions: – Observed CTR at t: c_t / n_t → highly unstable (n_t is usually small) – Cumulative CTR: (Σ_{all i} c_i) / (Σ_{all i} n_i) → reacts to changes very slowly – Moving-window CTR: (Σ_{i ∈ last K} c_i) / (Σ_{i ∈ last K} n_i) → reasonable • But gives no estimate of Var[p_{t+1}] (useful for explore/exploit)
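To make the trade-offs concrete, here is a minimal Python sketch (illustrative, not from the tutorial) of the three baseline trackers; the class and function names are made up for this example:

    from collections import deque

    def observed_ctr(c_t, n_t):
        # CTR in the current interval: unbiased but highly unstable
        # when n_t is small.
        return c_t / n_t if n_t > 0 else 0.0

    class CumulativeCTR:
        # Pools all history: very stable but reacts to change very slowly.
        def __init__(self):
            self.c = self.n = 0
        def update(self, c_t, n_t):
            self.c += c_t
            self.n += n_t
            return self.c / self.n if self.n > 0 else 0.0

    class MovingWindowCTR:
        # Pools the last K intervals: a reasonable compromise, but it
        # provides no estimate of Var[p_{t+1}] for explore/exploit.
        def __init__(self, K):
            self.window = deque(maxlen=K)
        def update(self, c_t, n_t):
            self.window.append((c_t, n_t))
            c = sum(ci for ci, _ in self.window)
            n = sum(ni for _, ni in self.window)
            return c / n if n > 0 else 0.0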
  • 166. Online Models: Dynamic Gamma-Poisson • Model-based approach – (c_t | n_t, p_t) ~ Poisson(n_t p_t) – p_t = p_{t-1} ε_t, where ε_t ~ Gamma(mean=1, var=η²) – Model parameters: • p_1 ~ Gamma(mean=μ₀, var=σ₀²) is the offline CTR estimate • η² specifies how dynamic/smooth the CTR is over time – Posterior distribution (p_{t+1} | c_1, n_1, …, c_t, n_t) ~ Gamma(?, ?) • Solve this recursively (online update rule) [State-space diagram: latent CTRs p_1, p_2, … evolve over time; at each step t the item is shown n_t times and receives c_t clicks; the chain starts from the prior Gamma(mean=μ₀, var=σ₀²)]
  • 167. Online Models: Derivation – Let γ_{t|t-1} = μ_{t|t-1} / σ²_{t|t-1} (effective sample size) – Prior at t: (p_t | c_1, n_1, …, c_{t-1}, n_{t-1}) ~ Gamma(mean=μ_{t|t-1}, var=σ²_{t|t-1}) – Update with (c_t, n_t): (p_t | c_1, n_1, …, c_t, n_t) ~ Gamma(mean=μ_{t|t}, var=σ²_{t|t}) • γ_{t|t} = γ_{t|t-1} + n_t (effective sample size grows with views) • μ_{t|t} = (γ_{t|t-1} μ_{t|t-1} + c_t) / γ_{t|t}, σ²_{t|t} = μ_{t|t} / γ_{t|t} – Predict t+1: (p_{t+1} | c_1, n_1, …, c_t, n_t) ~ Gamma(mean=μ_{t+1|t}, var=σ²_{t+1|t}) • μ_{t+1|t} = μ_{t|t}, σ²_{t+1|t} = σ²_{t|t} + η²(μ²_{t|t} + σ²_{t|t}) – High-CTR items are more adaptive: the variance inflation scales with μ²_{t|t}
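The recursion above translates directly into a small online tracker. The following Python sketch is an illustrative implementation of the update rule under the Gamma(mean, var) parameterization used on the slide; the class name is hypothetical:

    class DynamicGammaPoisson:
        """Tracks one item's CTR p_t under the dynamic Gamma-Poisson model:
        c_t | n_t, p_t ~ Poisson(n_t * p_t); p_t = p_{t-1} * eps_t with
        eps_t ~ Gamma(mean=1, var=eta2). State: posterior mean mu and
        variance var; gamma = mu / var is the effective sample size."""

        def __init__(self, mu0, var0, eta2):
            self.mu, self.var, self.eta2 = mu0, var0, eta2

        def update(self, c_t, n_t):
            # Conjugate update with the new observation (c_t, n_t).
            gamma_prev = self.mu / self.var          # effective sample size
            gamma_post = gamma_prev + n_t
            self.mu = (gamma_prev * self.mu + c_t) / gamma_post
            self.var = self.mu / gamma_post
            # Evolution p_{t+1} = p_t * eps_t inflates the variance; the
            # inflation scales with mu^2, so high-CTR items adapt faster.
            self.var += self.eta2 * (self.mu ** 2 + self.var)
            return self.mu, self.var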
  • 168. Tracking Behavior of the Gamma-Poisson Model [Figure: estimated vs. actual CTR curves over time] • Low click-rate articles – more temporal smoothing
  • 169. Intelligent Initialization: Prior Estimation • Prior CTR distribution: Gamma(mean=μ₀, var=σ₀²) – N historical items: • n_i = #views of item i in its first time interval • c_i = #clicks on item i in its first time interval – Model: c_i ~ Poisson(n_i p_i) and p_i ~ Gamma(mean=μ₀, var=σ₀²) ⇒ marginally c_i ~ NegBinomial(μ₀, σ₀², n_i) – Maximum likelihood estimate (MLE): (μ̂₀, σ̂₀²) = argmax_{μ₀, σ₀²} Σ_i log NegBinomial(c_i; μ₀, σ₀², n_i) • Better prior: Cluster items and find the MLE for each cluster – Agarwal & Chen, 2011 (SIGMOD)
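As a sketch of this MLE step: writing the Gamma prior as Gamma(shape=α, rate=β), so that μ₀ = α/β and σ₀² = α/β², the marginal of c_i is negative binomial, and the prior can be fit by numerical maximization of the marginal likelihood. An illustrative Python sketch (assumed parameterization; not the paper's code):

    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize

    def neg_log_lik(log_params, c, n):
        # Gamma(shape=alpha, rate=beta) prior, so mu0 = alpha / beta and
        # var0 = alpha / beta**2; the marginal of c_i is negative binomial.
        alpha, beta = np.exp(log_params)
        ll = (gammaln(c + alpha) - gammaln(alpha) - gammaln(c + 1)
              + alpha * np.log(beta / (beta + n))
              + c * np.log(n / (beta + n)))
        return -ll.sum()

    def fit_prior(c, n):
        # c, n: first-interval clicks and views (n > 0) for N historical items.
        c, n = np.asarray(c, float), np.asarray(n, float)
        res = minimize(neg_log_lik, x0=np.log([1.0, 100.0]), args=(c, n))
        alpha, beta = np.exp(res.x)
        return alpha / beta, alpha / beta**2   # (mu0, var0)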
  • 170. Explore/Exploit: Problem Definition [Diagram: timeline with past intervals t−2, t−1 and the current time t ("now"); items 1…K receive fractions x₁%, x₂%, …, x_K% of page views] • Determine (x₁, x₂, …, x_K) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
  • 171. Modeling the Uncertainty, NOT Just the Mean • Simplified setting: two items [Figure: probability density of CTR; Item A is a spike (known CTR), Item B a wide density f(p)] • We know the CTR of Item A (say, shown 1 million times); we are uncertain about the CTR of Item B (shown only 100 times) • If we only make a single decision, give 100% of page views to Item A • If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher – Potential of B over A: ∫_q^∞ (p − q) f(p) dp, where q is the CTR of Item A, p the CTR of Item B, and f(p) the probability density of Item B's CTR
  • 172. Multi-Armed Bandits: Introduction (1) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) “Pulling” arm i yields a reward: reward = 1 with probability pi (success) reward = 0 otherwise (failure) For now, we are attacking the problem of choosing best article/arm for all users
  • 173. Multi-Armed Bandits: Introduction (2) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) Goal: Pull arms sequentially to maximize the total reward Bandit scheme/policy: Sequential algorithm to play arms (items) Regret of a scheme = Expected loss relative to the “oracle” optimal scheme that always plays the best arm – “best” means highest success probability – But, the best arm is not known … unless you have an oracle – Regret is the price of exploration – Low regret implies quick convergence to the best
  • 174. Multi-Armed Bandits: Introduction (3) • Bayesian approach – Seeks the Bayes optimal solution to a Markov decision process (MDP) with assumptions about probability distributions – Representative work: Gittins' index, Whittle's index – Very computationally intensive • Minimax approach – Seeks a scheme that incurs bounded regret (with no or mild assumptions about probability distributions) – Representative work: UCB by Lai, Auer – Usually computationally easy – But they tend to explore too much in practice (probably because the bounds are based on worst-case analysis)
  • 175. Multi-Armed Bandits: Markov Decision Process (1) • Select an arm now at time t=0 to maximize the expected total number of clicks in t=0,…,T • State at time t: θ_t = (θ_{1t}, …, θ_{Kt}) – θ_{it} = state of arm i at time t (captures all we know about arm i at t) • Reward function R_i(θ_t, θ_{t+1}) – Reward of pulling arm i that brings the state from θ_t to θ_{t+1} • Transition probability Pr[θ_{t+1} | θ_t, pulling arm i] • Policy π: a function that maps a state to an arm (action) – π(θ_t) returns an arm (to pull) • Value of policy π starting from the current state θ_0 with horizon T: V^π_T(θ_0) = E[ R_{π(θ_0)}(θ_0, θ_1) + V^π_{T−1}(θ_1) ] = ∫ Pr[θ_1 | θ_0, π(θ_0)] ( R_{π(θ_0)}(θ_0, θ_1) + V^π_{T−1}(θ_1) ) dθ_1 – First term: immediate reward; second term: value of the remaining T−1 time slots, starting from state θ_1
  • 176. Multi-Armed Bandits: MDP (2) • Optimal policy: π* = argmax_π V^π_T(θ_0) • Things to notice: – Value is defined recursively (actually T high-dimensional integrals) – Dynamic programming can be used to find the optimal policy – But just evaluating the value of a fixed policy can be very expensive • Bandit problem: the pull of one arm does not change the state of other arms, and the set of arms does not change over time
  • 177. Multi-Armed Bandits: MDP (3) • Which arm should be pulled next? – Not necessarily what looks best right now, since it might have had a few lucky successes – Looks like it will be a function of the successes and failures of all arms • Consider a slightly different problem setting – Infinite time horizon, but – Future rewards are geometrically discounted: R_total = R(0) + γR(1) + γ²R(2) + … (0<γ<1) • Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently – In general, the policy π(θ_t) is a function of the joint state (θ_{1t}, …, θ_{Kt}); the decoupled policy is π(θ_t) = argmax_i { g(θ_{it}) } (Gittins' index) – One K-dimensional problem → K one-dimensional problems – Still computationally expensive!!
  • 178. Multi-Armed Bandits: MDP (4) • Bandit policy: 1. Compute the priority (Gittins' index) of each arm based on its state 2. Pull the arm with max priority and observe the reward 3. Update the state of the pulled arm [Diagram: each arm carries its own priority]
  • 179. Multi-Armed Bandits: MDP (5) • Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently – Many proofs and different interpretations of Gittins' index exist • The index of an arm is the fixed charge per pull for a game with two options, whether to pull the arm or not, such that the charge makes the optimal play of the game have zero net reward – Significantly reduces the dimension of the problem space – But Gittins' index g(θ_{it}) is still hard to compute • For the Gamma-Poisson or Beta-Binomial models, θ_{it} = (#successes, #pulls) for arm i up to time t • g maps each possible (#successes, #pulls) pair to a number – Approximate methods are used in practice – Lai et al. have derived these for exponential family distributions
  • 180. Multi-Armed Bandits: Minimax Approach (1) • Compute the priority of each arm i in a way that the regret is bounded – Lowest regret in the worst case • One common policy is UCB1 [Auer 2002]: Priority_i = c_i / n_i + √(2 log n / n_i) – c_i = number of successes of arm i; n_i = number of pulls of arm i; n = total number of pulls of all arms – First term: observed success rate; second term: factor representing uncertainty
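A minimal sketch of the UCB1 policy in Python (illustrative function names; each arm must be pulled once before its priority is defined):

    import math

    def ucb1_priority(c_i, n_i, n_total):
        # Observed success rate plus an uncertainty bonus that shrinks
        # as arm i accumulates pulls [Auer 2002].
        return c_i / n_i + math.sqrt(2.0 * math.log(n_total) / n_i)

    def ucb1_choose(clicks, pulls):
        # Pull each arm once first; then pick the arm with max priority.
        for i, n_i in enumerate(pulls):
            if n_i == 0:
                return i
        n_total = sum(pulls)
        return max(range(len(pulls)),
                   key=lambda i: ucb1_priority(clicks[i], pulls[i], n_total))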
  • 181. Multi-Armed Bandits: Minimax Approach (2) • As the total number of pulls n becomes large: – The observed payoff c_i / n_i tends asymptotically towards the true payoff probability – Because of the uncertainty factor √(2 log n / n_i), the system never completely "converges" to one best arm; only the rate of exploration tends to zero
  • 182. Multi-Armed Bandits: Minimax Approach (3) • Sub-optimal arms are pulled O(log n) times • Hence, UCB1 has O(log n) regret • This is the lowest possible regret rate (but the constants matter) • E.g., the regret after n plays is bounded by [ 8 Σ_{i: μ_i < μ_best} (ln n) / Δ_i ] + (1 + π²/3) Σ_{j=1}^K Δ_j, where Δ_i = μ_best − μ_i
  • 183. Classical Multi-Armed Bandits: Summary • Classical multi-armed bandits – A fixed set of arms with fixed rewards – Observe the reward before the next pull • Bayesian approach (Markov decision process) – Gittins' index [Gittins 1979]: Bayes optimal for classical bandits • Pull the arm currently having the highest index value – Whittle's index [Whittle 1988]: extension to a changing reward function – Computationally intensive • Minimax approach (providing guaranteed regret bounds) – UCB1 [Auer 2002]: upper bound of a model-agnostic confidence interval • Index of arm i = c_i / n_i + √(2 log n / n_i) • Heuristics – ε-Greedy: random exploration using an ε fraction of traffic – Softmax: pick arm i with probability exp(p̂_i/τ) / Σ_j exp(p̂_j/τ), where p̂_i is the predicted CTR of item i and τ is a temperature – Posterior draw: index = a draw from the posterior CTR distribution of each arm
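The three heuristics are a few lines each. The sketch below is illustrative (the Beta(a0, b0) prior in the posterior-draw variant is an assumption for this example):

    import numpy as np
    rng = np.random.default_rng()

    def epsilon_greedy(ctr_hat, eps=0.05):
        # With probability eps explore a random arm; otherwise exploit.
        if rng.random() < eps:
            return int(rng.integers(len(ctr_hat)))
        return int(np.argmax(ctr_hat))

    def softmax_pick(ctr_hat, tau=0.01):
        # P(pick i) = exp(ctr_hat_i / tau) / sum_j exp(ctr_hat_j / tau).
        z = np.exp((np.asarray(ctr_hat) - np.max(ctr_hat)) / tau)
        return int(rng.choice(len(z), p=z / z.sum()))

    def posterior_draw(clicks, views, a0=1.0, b0=100.0):
        # Thompson sampling with a Beta(a0, b0) prior on each arm's CTR:
        # the index of each arm is one draw from its posterior.
        clicks, views = np.asarray(clicks), np.asarray(views)
        return int(np.argmax(rng.beta(a0 + clicks, b0 + views - clicks)))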
  • 184. Do Classical Bandits Apply to Web Recommenders? [Figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time; traffic obtained from a controlled randomized experiment (no confounding)] Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
  • 185. Characteristics of Real Recommender Systems • Dynamic set of items (arms) – Items come and go with short lifetimes (e.g., a day) – Asymptotically optimal policies may fail to achieve good performance when item lifetimes are short • Non-stationary CTR – The CTR of an item can change dramatically over time • Different user populations at different times • The same user behaves differently at different times (e.g., morning, lunch time, at work, in the evening, etc.) • Attention to breaking news stories decays over time • Batch serving for scalability – Making a decision and updating the model for each user visit in real time is expensive – Batch serving is more feasible: create time slots (e.g., 5 min); for each slot, decide the fraction x_i of the visits in the slot to give to item i [Agarwal et al., ICDM, 2009]
  • 186. Explore/Exploit in Recommender Systems [Diagram: timeline with past intervals t−2, t−1 and the current time t ("now"); items 1…K receive fractions x₁%, x₂%, …, x_K% of page views] • Determine (x₁, x₂, …, x_K) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future • Let's solve this from first principles
  • 187. Bayesian Solution: Two Items, Two Time Slots (1) • Two time slots: t = 0 and t = 1 – Item P: we are uncertain about its CTR, p_0 at t = 0 and p_1 at t = 1 – Item Q: we know its CTR exactly, q_0 at t = 0 and q_1 at t = 1 • Question: what fraction x of the N_0 views at t = 0 to give to item P (and 1−x to item Q)? • To determine x, we need to estimate what would happen in the future – We obtain c clicks after serving fraction x (not yet observed; a random variable) – Assume we observe c; we can then update the posterior of p_1: define p̂_1(x, c) = E[p_1 | x, c] • If x and c are given, the optimal slot-1 decision is: give all N_1 views to item P iff p̂_1(x, c) > q_1 [Figure: CTR densities of item P (wide) and item Q (point mass) at t = 0 and t = 1]
  • 188. Bayesian Solution: Two Items, Two Time Slots (2) • Expected total number of clicks in the two time slots: N_0 x p̂_0 + N_0 (1−x) q_0 + N_1 E_c[ max{p̂_1(x,c), q_1} ] – The first two terms are E[#clicks] at t = 0 from items P and Q; the last term is E[#clicks] at t = 1, showing the item with higher E[CTR] • Rewriting: = N_0 q_0 + N_1 q_1 + N_0 x (p̂_0 − q_0) + N_1 E_c[ max{p̂_1(x,c) − q_1, 0} ] – (N_0 q_0 + N_1 q_1) = E[#clicks] if we always show item Q; the remainder is Gain(x, q_0, q_1) • Gain(x, q_0, q_1) = expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots • Solution: argmax_x Gain(x, q_0, q_1)
  • 189. Bayesian Solution: Two Items, Two Time Slots (3) • Approximate p̂_1(x, c) by a normal distribution – Reasonable approximation because of the central limit theorem • With prior p_1 ~ Beta(a, b): – p̂_1 = E_c[ p̂_1(x, c) ] = a / (a + b) – σ²(x) = Var_c[ p̂_1(x, c) ] = [ x N_0 / (a + b + x N_0) ] · ab / ( (a+b)² (a+b+1) ) • Then Gain(x, q_0, q_1) ≈ N_0 x (p̂_0 − q_0) + N_1 [ σ(x) φ( (q_1 − p̂_1)/σ(x) ) + (p̂_1 − q_1) ( 1 − Φ( (q_1 − p̂_1)/σ(x) ) ) ], where φ and Φ are the standard normal pdf and cdf • Proposition: using the approximation, the Bayes optimal solution x can be found in time O(log N_0)
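Putting the normal approximation together, here is an illustrative Python sketch of Gain(x, q_0, q_1) with a simple grid maximization (the slide's O(log N_0) search would replace the grid; all numeric inputs below are made-up examples):

    import numpy as np
    from scipy.stats import norm

    def gain(x, q0, q1, N0, N1, a, b):
        # Expected extra clicks from giving item P a fraction x of the N0
        # slot-0 views, vs. always showing the certain item Q (normal
        # approximation; Beta(a, b) prior on P's CTR).
        p_hat = a / (a + b)                           # prior mean CTR of P
        var_x = (x * N0 / (a + b + x * N0)) * a * b / ((a + b)**2 * (a + b + 1))
        sigma = np.sqrt(var_x) + 1e-12
        z = (q1 - p_hat) / sigma
        exploit = N0 * x * (p_hat - q0)               # slot-0 clicks gained/lost
        explore = N1 * (sigma * norm.pdf(z) + (p_hat - q1) * (1 - norm.cdf(z)))
        return exploit + explore

    # Grid maximization over x:
    xs = np.linspace(0.0, 1.0, 1001)
    gains = [gain(x, q0=0.05, q1=0.05, N0=10000, N1=10000, a=2.0, b=40.0)
             for x in xs]
    best_x = xs[int(np.argmax(gains))]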
  • 190. Bayesian Solution: Two Items, Two Time Slots (4) • Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item? [Figure: fraction of views to give to the item (y-axis), under high vs. low uncertainty; different curves are for different prior mean settings]
  • 191. Bayesian Solution: General Case (1) • From two items to K items – Very difficult problem: max_x ( N_0 Σ_i x_i p̂_{i0} + N_1 E_c[ max_z Σ_i z_i(c) p̂_{i1}(x_i, c_i) ] ), subject to Σ_i x_i = 1 and Σ_i z_i(c) = 1 for all possible c – Note: c = [c_1, …, c_K]; c_i is a random variable representing the #clicks on item i we may get – Apply Whittle's Lagrange relaxation (1988) to this problem setting: • Relax Σ_i z_i(c) = 1, for all c, to E_c[ Σ_i z_i(c) ] = 1 • Apply Lagrange multipliers (q_0 and q_1) to enforce the constraints: min_{q_0, q_1} ( N_0 q_0 + N_1 q_1 + Σ_i max_{x_i} Gain(x_i, q_0, q_1) ) – We essentially reduce the K-item case to K independent two-item sub-problems (which we have solved)
  • 192. Bayesian Solution: General Case (2) • From two intervals to multiple time slots – Approximate multiple time slots by two stages • Non-stationary CTR – Use the Dynamic Gamma-Poisson model to estimate the CTR distribution for each item
  • 193. Simulation Experiment: Different Traffic Volume [Figure: simulation with ground truth estimated from Yahoo! Front Page data; setting: 16 live items per interval; scenarios: web sites with different traffic volume (x-axis)]
  • 194. Simulation Experiment: Different Sizes of the Item Pool [Figure: simulation with ground truth estimated from Yahoo! Front Page data; setting: 1000 views per interval, average item lifetime = 20 intervals; scenarios: different sizes of the item pool (x-axis)]
  • 195. Characteristics of Different Explore/Exploit Schemes (1) • Why does the Bayesian solution have better performance? • Characterize each scheme by three dimensions: – Exploitation regret: the regret of a scheme when it is showing the item it thinks is the best (which may not actually be the best) • 0 means the scheme always picks the actual best • Quantifies the scheme's ability to find good items (lower is better) – Exploration regret: the regret of a scheme when it is exploring items it feels uncertain about • Quantifies the price of exploration (lower is better) – Fraction of exploitation (higher is better) • Fraction of exploration = 1 − fraction of exploitation [Diagram: all traffic to a web site splits into exploitation traffic and exploration traffic]
  • 196. Characteristics of Different Explore/Exploit Schemes (2) [Figure: schemes plotted by exploitation regret vs. exploration regret, and by exploitation regret vs. exploitation fraction; the "good" corner has low regret and a high exploitation fraction] – Exploitation regret: ability to find good items (lower better) – Exploration regret: price of exploration (lower better) – Fraction of exploitation (higher better)
  • 197. Discussion: Large Content Pool • The Bayesian solution looks promising – ~10% from true optimal for a content pool of 1000 live items • 1000 views per interval; item lifetime ~20 intervals • Intelligent initialization (offline modeling) – Use item features to reduce the prior variance of an item • E.g., Var[ item CTR | Sport ] < Var[ item CTR ] – Require a CTR model that outputs both mean and variance • Linear regression model • Segmented model: Estimate the CTR distribution of a random article in an item category – Existing taxonomies, decision tree, LDA topics • Feature-based explore/exploit – Estimate model parameters, instead of per-item CTR – More later
  • 198. Discussion: Multiple Positions, Ranking • Feature-based approach – reward(page) = model( (item 1 at position 1, …, item k at position k) ) – Apply feature-based explore/exploit • Online optimization for a ranked list – Ranked bandits [Radlinski et al., 2008]: run an independent bandit algorithm for each position – Dueling bandit [Yue & Joachims, 2009]: actions are pairwise comparisons • Online optimization of submodular functions – For all sets S₁, S₂ and items a: ∇f_a(S₁ ∪ S₂) ≤ ∇f_a(S₁), where ∇f_a(S) = f_a(S ∪ {a}) − f_a(S) is the marginal gain – Streeter & Golovin (2008)
  • 199. Discussion: Segmented Most Popular • Partition users into segments, and then for each segment, provide most popular recommendation • How to segment users – Hand-created segments: AgeGroup × Gender – Clustering or decision tree based on user features • Users in the same cluster like similar items • Segments can be organized by taxonomies/hierarchies – Better CTR models can be built by hierarchical smoothing • Shrink the CTR of a segment toward its parent • Introduce bias to reduce uncertainty/variance – Bandits for taxonomies (Pandey et al., 2008) • First explore/exploit categories/segments • Then switch to individual items
  • 200. Most Popular Recommendation: Summary • Online model: – Estimate the mean and variance of the CTR of each item over time – Dynamic Gamma-Poisson model • Intelligent initialization: – Estimate the prior mean and variance of the CTR of each item cluster using historical data • Cluster items → maximum likelihood estimates of the priors • Explore/exploit: – Bayesian: solve a Markov decision process problem • Gittins' index, Whittle's index, approximations • Better performance, computation intensive • Thompson sampling: sample from the posterior (simple) – Minimax: bound the regret • UCB1: easy to compute • Explores more than necessary in practice – ε-Greedy: empirically competitive when ε is tuned
  • 201. Online Components for Personalized Recommendation Online models, intelligent initialization & explore/exploit
  • 202. Intelligent Initialization for Linear Model (1) • Linear/factorization model (subscripts: user i, item j): – y_ij ~ N(u_i'θ_j, σ²), where y_ij is the rating that user i gives item j – u_i = feature/factor vector of user i; θ_j = factor vector of item j – Prior: θ_j ~ N(μ_j, Σ) • How to estimate the prior parameters μ_j and Σ? – Important for cold start: predictions are made using the prior – Leverage available features • How to learn the weights/factors quickly? – High-dimensional θ_j → slow convergence – Reduce the dimensionality
  • 203. Feature-based model initialization • FOBFM: Fast Online Bilinear Factor Model • Dimensionality reduction for fast model convergence – Per-item online model: y_ij ~ u_i'θ_j + noise, with prior θ_j ~ N(Ax_j, Σ) – Decomposition: y_ij ~ u_i'Ax_j + u_i'v_j, where Ax_j is predicted by features and v_j ~ N(0, Σ) is a per-item correction – Low-rank parameterization: v_j = Bδ_j with δ_j ~ N(0, σ²I), where B is an n×k linear projection matrix (k << n) • Projects high-dim v_j to low-dim δ_j; low-rank approximation of Var[θ_j]: Σ ≈ σ²BB', i.e., θ_j ~ N(Ax_j, σ²BB') – Offline training: determine A, B, σ² through the EM algorithm (once per day or hour) (Notation: y_ij = rating that user i gives item j; u_i = offline factor vector of user i; x_j = feature vector of item j)
  • 204. Feature-based model initialization (continued) • FOBFM: Fast Online Bilinear Factor Model • Dimensionality reduction for fast model convergence – Per-item online model: y_ij ~ u_i'θ_j + noise, with prior θ_j ~ N(Ax_j, σ²BB') – Equivalently y_ij ~ u_i'Ax_j + (u_i'B)δ_j, where u_i'Ax_j is an offset, u_i'B is a new (low-dimensional) feature vector, and δ_j is updated in an online manner • Fast, parallel online learning: one small regression per item • Online selection of dimensionality (k = dim(δ_j)) – Maintain an ensemble of models, one for each candidate dimensionality (Notation: y_ij = rating that user i gives item j; u_i = offline factor vector of user i; x_j = feature vector of item j; B is an n×k linear projection matrix, k << n)
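A hedged sketch of the per-item online step: given offline A and B, each item reduces to a small Bayesian linear regression on the k-dimensional features u_i'B with offset u_i'Ax_j. The class below is illustrative (not the paper's implementation) and assumes a known Gaussian noise variance:

    import numpy as np

    class FOBFMItemModel:
        """Illustrative per-item online update for FOBFM.
        Offline: A (n x d) and B (n x k) are fixed. Online, item j is a
        Bayesian linear regression of the residual on z = B'u_i, with
        offset u_i'Ax_j, prior delta_j ~ N(0, s2_prior * I), and known
        observation noise variance noise2."""

        def __init__(self, A, B, x_j, s2_prior, noise2):
            self.A, self.B, self.x_j, self.noise2 = A, B, x_j, noise2
            k = B.shape[1]
            self.P = np.eye(k) / s2_prior   # posterior precision of delta_j
            self.b = np.zeros(k)            # precision-weighted mean

        def predict(self, u_i):
            offset = u_i @ self.A @ self.x_j          # cold-start part
            delta = np.linalg.solve(self.P, self.b)   # posterior mean
            return offset + (u_i @ self.B) @ delta

        def update(self, u_i, y_ij):
            # Rank-one conjugate update with one observed rating y_ij.
            z = u_i @ self.B                          # low-dim online features
            r = y_ij - u_i @ self.A @ self.x_j        # residual after offset
            self.P += np.outer(z, z) / self.noise2
            self.b += z * r / self.noise2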
  • 205. Experimental Results: My Yahoo! Dataset (1) • My Yahoo! is a personalized news reading site – Users manually select news/RSS feeds • ~12M “ratings” from ~3M users on ~13K articles – Click = positive – View without click = negative
  • 206. Experimental Results: My Yahoo! Dataset (2) Item-based data split: Every item is new in the test data – First 8K articles are in the training data (offline training) – Remaining articles are in the test data (online prediction & learning) Supervised dimensionality reduction (reduced rank regression) significantly outperforms other methods Methods: No-init: Standard online regression with ~1000 parameters for each item Offline: Feature-based model without online update PCR, PCR+: Two principal component methods to estimate B FOBFM: Our fast online method
  • 207. Experimental Results: My Yahoo! Dataset (3) • A small number of factors (low dimensionality) is better when the amount of data for online learning is small • A large number of factors is better when the data for learning becomes large • The online selection method usually selects the best dimensionality (# factors = number of parameters per item updated online)
  • 208. Intelligent Initialization: Summary • For online learning, whenever historical data is available, do not start cold • For linear/factorization models – Use available features to setup the starting point – Reduce dimensionality to facilitate fast learning • Next – Explore/exploit for personalization – Users are represented by covariates • Features, factors, clusters, etc – Covariate bandits
  • 209. Explore/Exploit for Personalized Recommendation • One extreme problem formulation – One bandit problem per user with one arm per item – Bandit problems are correlated: “Similar” users like similar items – Arms are correlated: “Similar” items have similar CTRs • Model this correlation through covariates/features – Input: User feature/factor vector, item feature/factor vector – Output: Mean and variance of the CTR of this (user, item) pair based on the data collected so far • Covariate bandits – Also known as contextual bandits, bandits with side observations – Provide a solution to • Large content pool (correlated arms) • Personalized recommendation (hint before pulling an arm)
  • 210. Methods for Covariate Bandits • Priority-based methods – Rank items according to the user-specific "score" of each item; then update the model based on the user's response – UCB (upper confidence bound) • Score of an item = E[posterior CTR] + k · StDev[posterior CTR] – Posterior draw (Thompson sampling) • Score of an item = a number drawn from the posterior CTR distribution – Softmax • Score of an item = a number drawn with probability exp(p̂_i/τ) / Σ_j exp(p̂_j/τ) • ε-Greedy – Allocate an ε fraction of traffic for random exploration (ε may be adaptive) – Robust when the exploration pool is small • Bayesian scheme – Close to optimal if it can be solved efficiently
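For a linear CTR model with a Gaussian posterior N(m, S) over the weights, the UCB and posterior-draw scores reduce to the sketch below. This is an illustration in the spirit of LinUCB and Thompson sampling, not a specific published implementation; the feature construction and the multiplier k are assumptions:

    import numpy as np
    rng = np.random.default_rng()

    def ucb_score(f, m, S, k=2.0):
        # Score = E[posterior CTR proxy] + k * StDev, for a linear model
        # with weight posterior N(m, S) and (user, item) feature vector f.
        return f @ m + k * np.sqrt(f @ S @ f)

    def thompson_score(f, m, S):
        # Posterior draw: score under one sampled weight vector.
        return f @ rng.multivariate_normal(m, S)

    def choose_item(feats, m, S, method="ucb"):
        # feats: list of (user, item) feature vectors, one per candidate.
        score = ucb_score if method == "ucb" else thompson_score
        return max(range(len(feats)), key=lambda j: score(feats[j], m, S))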
  • 211. Covariate Bandits: Some References • Just a small sample of papers – Hierarchical explore/exploit (Pandey et al., 2008) • Explore/exploit categories/segments first; then switch to individuals – Variants of ε-greedy • Epoch-greedy (Langford & Zhang, 2007): ε is determined based on the generalization bound of the current model • Banditron (Kakade et al., 2008): linear model with binary response • Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time; example models: histogram, nearest neighbor – Variants of UCB methods • Linearly parameterized bandits (Rusmevichientong et al., 2008): minimax, based on an uncertainty ellipsoid • LinUCB (Li et al., 2010): Gaussian linear regression model • Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009): – Similar arms have similar rewards: | reward(i) − reward(j) | ≤ distance(i, j)
  • 212. Online Components: Summary • Real systems are dynamic • Cold-start problem – Incremental online update (online linear regression) – Intelligent initialization (use features to predict initial factor values) – Explore/exploit (UCB, posterior draw, softmax, ε-greedy) • Concept-drift problem – Tracking the most recent behavior (state-space models, Kalman filter) – Modeling temporal patterns (tensor factorization, splines)
  • 214. Evaluation Methods • Ideal method – Experimental design: run side-by-side experiments on a small fraction of randomly selected traffic with the new method (treatment) and the status quo (control) – Limitation • Often expensive and difficult to test a large number of methods • Problem: How do we evaluate methods offline on logged data? – Goal: to maximize clicks/revenue, not prediction accuracy across the entire system; the cost of predictive inaccuracy varies by instance • E.g., 100% error on a low-CTR article may not matter much because it always co-occurs with a high-CTR article that is predicted accurately
  • 215. Usual Metrics • Predictive accuracy – Root Mean Squared Error (RMSE) – Mean Absolute Error (MAE) – Area under the ROC Curve (AUC) • Other rank-based measures based on retrieval accuracy for top-k – Recall in test data • What fraction of the items that a user actually liked in the test data were among the top-k recommended by the algorithm (fraction of hits, e.g., Karypis, CIKM 2001) • One flaw in several papers – Training and test splits are not based on time • Information leakage • Even in Netflix, this is the case to some extent – Time split per user, not per event; for instance, information may leak if models are based on user-user similarity
  • 216. Metrics continued.. • Recall per event based on the Replay-Match method – Fraction of clicked events where the top recommended item matches the clicked one • This works well if the logged data was collected from a randomized serving scheme; with biased data it could be a problem – We would end up favoring algorithms whose recommendations are similar to the current serving scheme • No reward for novel recommendations
  • 217. Details on Replay-Match method (Li, Langford, et al) • x: feature vector for a visit • r = [r1,r2,…,rK]: reward vector for the K items in inventory • h(x): recommendation algorithm to be evaluated • Goal: Estimate expected reward for h(x) • s(x): recommendation scheme that generated logged-data • x1,..,xT: visits in the logged data • rti: reward for visit t, where i = s(xt)
  • 218. Replay-Match continued • Estimator: average the rewards r_{t, h(x_t)} over visits where the evaluated policy matches the logged item, i.e., h(x_t) = s(x_t), with each matched event importance-weighted • If the importance weights are correct (inversely proportional to the probability that s served that item), it can be shown that the estimator is unbiased • E.g., if s(x) is a random serving scheme, importance weights are uniform over the item set • If s(x) is not random, importance weights have to be estimated through a model
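A minimal sketch of the importance-weighted replay estimator, assuming the log records which item was served and that the serving probabilities (propensities) are known or modeled; all names are illustrative:

    def replay_estimate(logged, h, propensity):
        # logged: list of (x_t, i_t, r_t): context, served item i_t = s(x_t),
        # and its observed reward. propensity(x, i) = Pr[s(x) = i].
        # Only events where the evaluated policy h matches the log count;
        # with correct propensities the estimate is unbiased.
        total = 0.0
        for x, i, r in logged:
            if h(x) == i:
                total += r / propensity(x, i)
        return total / len(logged)

    # For a uniformly random logging scheme over K items:
    #   propensity = lambda x, i: 1.0 / K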
  • 219. Back to Multi-Objective Optimization [Diagram: the recommender serves editorial content on the front page; clicks on FP links influence the downstream supply distribution to the ad server (premium guaranteed display and the cheaper spot market) as well as downstream engagement (time spent)]
  • 220. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: maximize clicks (maximize downstream supply from FP) • But consider the following – Article 1: CTR = 5%, utility per click = 5 – Article 2: CTR = 4.9%, utility per click = 10 • By promoting Article 2, we lose 1 click per 1000 visits but gain 240 utils per 1000 visits • If we do this for a large number of visits, we lose some clicks but obtain significant gains in utility – E.g., lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc.)
  • 221. Why call it Click Shaping? [Figure: supply distribution across Yahoo! properties (autos, buzz, finance, gmy.news, health, hotjobs, movies, new.music, news, omg, realestate, rivals, shine, shopping, sports, tech, travel, tv, video, videogames, other) before and after shaping, with changes in the range of roughly −10% to +10%] • Shaping can happen with respect to any downstream metric (like engagement)
  • 222. Multi-Objective Optimization [Diagram: n articles A_1…A_n grouped into K properties (news, finance, omg, …); m user segments S_1…S_m] – CTR of user segment i on article j: p_ij – Time duration of segment i on article j: d_ij
  • 223. Multi-Objective Program – Approaches: scalarization, goal programming – Simplex constraints on x_ij are always applied; all constraints are linear – Every 10 minutes, solve for x; use this x as the serving scheme for the next 10 minutes
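As an illustration of one possible scalarization (an assumption; the talk does not spell out the exact program): maximize total time spent subject to retaining at least a fixed fraction of the maximum achievable clicks, with simplex constraints on x_ij. A Python sketch using scipy's linprog:

    import numpy as np
    from scipy.optimize import linprog

    def click_shaping(p, d, visits, click_loss=0.05):
        # p[i, j]: CTR of user segment i on article j; d[i, j]: time duration;
        # visits[i]: expected visits of segment i in the next interval.
        p, d = np.asarray(p, float), np.asarray(d, float)
        w = np.asarray(visits, float)[:, None]
        m, n = p.shape
        max_clicks = (w * p).max(axis=1).sum()  # best article per segment
        c = -(w * d).ravel()                    # linprog minimizes, so negate
        A_ub = [-(w * p).ravel()]               # clicks >= (1 - loss) * max
        b_ub = [-(1.0 - click_loss) * max_clicks]
        A_eq = np.zeros((m, m * n))             # simplex: sum_j x_ij = 1
        for i in range(m):
            A_eq[i, i * n:(i + 1) * n] = 1.0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                      b_eq=np.ones(m), bounds=(0, 1))
        return res.x.reshape(m, n)              # serving fractions x_ij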
  • 224. Pareto-optimal solution (more in KDD 2011)
  • 225. Summary • Modern recommendation systems on the web crucially depend on extracting intelligence from massive amounts of data collected on a routine basis • Lots of data and processing power are not enough; the number of things we need to learn grows with data size • Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here • Continuous and adaptive experimentation in a judicious manner is crucial to maximize performance – Again, ML has a big role to play • Multi-objective optimization is often required; the objectives are application dependent – ML has to work in close collaboration with engineering, product & business execs
  • 227. Recall: Some examples • Simple version – I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my website. How do I take a holistic approach and perform a simultaneous optimization?
  • 228. For the simple version • Multi-position optimization – Explore/exploit, optimal subset selection • Explore/Exploit strategies for large content pool and high dimensional problems – Some work on hierarchical bandits but more needs to be done • Constructing user profiles from multiple sources with less than full coverage – Couple of papers at KDD 2011 • Content understanding • Metrics to measure user engagement (other than CTR)
  • 229. Other problems • Whole page optimization – Incorporating correlations • Incentivizing User generated content • Incorporating Social information for better recommendation (News Feed Recommendation) • Multi-context Learning
  • 232. EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
  • 234. LinkedIn Advertising: Flow – Ad request with member profile (e.g., region = US, age = 20) and context (e.g., profile page, 300×250 ad slot) – Filter campaigns (targeting criteria, frequency cap, budget pacing) to obtain the campaigns eligible for auction – Response prediction engine scores eligible campaigns; automatic format selection – Auction sorted by Bid × CTR; click cost = Bid₃ × CTR₃ / CTR₂ – Serving constraint: < 100 millisec
  • 235. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi (identity, behavioral, network) – Campaign feature vector: cj (text, adv-id,…) – Context feature vector: zk (page type, device, …) • Model:
  • 236. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi – Campaign feature vector: cj – Context feature vector: zk • Model: Cold-start component Warm-start per-campaign component
  • 237. CTR Prediction Model for Ads • Feature vectors – Member feature vector: x_i – Campaign feature vector: c_j – Context feature vector: z_k • Model: the log-odds of a click decompose into a cold-start component (a function of member, campaign, and context features) plus a warm-start per-campaign component; both components can have L2 penalties
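A hedged sketch of how this decomposition might be scored at serving time; the feature functions and model form below are simplified assumptions for illustration, not LinkedIn's implementation:

    import numpy as np

    def predict_ctr(f_cold, g_warm, theta_w, theta_c, campaign_id):
        # Illustrative cold-start + warm-start scoring:
        #   log-odds = theta_w . f_cold              (global cold-start)
        #            + theta_c[campaign_id] . g_warm (per-campaign warm-start)
        # A brand-new campaign has no warm-start coefficients yet, so the
        # prediction falls back to the cold-start component alone.
        theta_cj = theta_c.get(campaign_id, np.zeros_like(g_warm))
        logit = f_cold @ theta_w + g_warm @ theta_cj
        return 1.0 / (1.0 + np.exp(-logit))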
  • 238. Model Fitting • Single machine (well understood) – Conjugate gradient – L-BFGS – Trust region – … • Model training with large-scale data – Cold-start component Θw is more stable • Weekly/bi-weekly training is good enough • However: difficulty comes from the need for large-scale logistic regression – Warm-start per-campaign model Θc is more dynamic • New items can get generated at any time • Big loss if opportunities are missed • Need to update the warm-start component as frequently as possible
  • 239. Model Fitting (continued) • The two training problems called out: – Cold-start component Θw: one large-scale logistic regression – Warm-start component: per-item logistic regression for Θc, given the cold-start component
  • 240. Explore/Exploit with Logistic Regression [Figure: positive (+) and negative (−) examples in feature space, with the cold-start decision boundary, the cold + warm-start boundary for an ad-id, and the posterior of the warm-start coefficients] • E/E: sample a line (coefficient vector) from the posterior (Thompson sampling)
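A minimal Thompson-sampling sketch for this setting, assuming a Gaussian (e.g., Laplace) approximation to the warm-start posterior; all names are illustrative:

    import numpy as np
    rng = np.random.default_rng()

    def thompson_ctr(f_cold, g_warm, theta_w, warm_mean, warm_cov):
        # Draw one warm-start coefficient vector ("a line") from its
        # approximate Gaussian posterior N(warm_mean, warm_cov) and
        # score the impression with it.
        theta_cj = rng.multivariate_normal(warm_mean, warm_cov)
        logit = f_cold @ theta_w + g_warm @ theta_cj
        return 1.0 / (1.0 + np.exp(-logit))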
  • 241. Models Considered • CONTROL: per-campaign CTR counting model • COLD-ONLY: only cold-start component • LASER: our model (cold-start + warm-start) • LASER-EE: our model with Explore-Exploit using Thompson sampling
  • 242. Metrics • Model metrics (offline) – Test Log-likelihood – AUC/ROC – Observed/Expected ratio • Online metrics (Online A/B Test) – CTR – CPM (Revenue per impression) – Unique ads per user (diversity)
  • 243. Observed / Expected Ratio • Offline replay is difficult with a large item pool (randomization is costly) • Observed: #clicks in the data; Expected: sum of predicted CTR over all impressions • Not a "standard" classifier metric, but useful for this application • What we usually see: Observed / Expected < 1 – Quantifies the "winner's curse," a.k.a. selection bias in auctions • When choosing from among thousands of candidates, an item with mistakenly over-estimated CTR may end up winning the auction • Particularly helpful in spotting inefficiencies by segment – E.g., by bid, number of impressions in training (warmness), geo, etc. – Allows us to see where the model might be giving too much weight to the wrong campaigns • High correlation between the O/E ratio and online model performance
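Computing the O/E ratio by segment is straightforward; a small illustrative sketch:

    import numpy as np

    def oe_ratio_by_segment(clicks, predicted_ctr, segment):
        # clicks: 0/1 outcomes per impression; predicted_ctr: model scores;
        # segment: a label per impression (e.g., campaign-warmness bucket).
        # O/E < 1 in a segment suggests over-prediction (winner's curse).
        clicks = np.asarray(clicks, float)
        predicted_ctr = np.asarray(predicted_ctr, float)
        segment = np.asarray(segment)
        return {s: clicks[segment == s].sum() / predicted_ctr[segment == s].sum()
                for s in np.unique(segment)}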
  • 244. Online A/B Test • Three models – CONTROL (10%) – LASER (85%) – LASER-EE (5%) • Segmented Analysis – 8 segments by campaign warmness • Degree of warmness: the number of training samples available in the training data for the campaign • Segment #1: Campaigns with almost no data in training • Segment #8: Campaigns that are served most heavily in the previous batches so that their CTR estimate can be quite accurate
  • 245. Daily CTR Lift Over Control [Figure: percentage CTR lift over control for LASER and LASER-EE on each of seven days; both lifts are positive (absolute values withheld)]
  • 246. Daily CPM Lift Over Control [Figure: percentage eCPM lift over control for LASER and LASER-EE on each of seven days; both lifts are positive (absolute values withheld)]
  • 247. CPM Lift By Campaign Warmness Segments [Figure: percentage CPM lift over control by campaign warmness segment (1–8) for LASER and LASER-EE; axis values withheld]
  • 248. O/E Ratio By Campaign Warmness Segments [Figure: observed clicks / expected clicks by campaign warmness segment (1–8) for CONTROL, LASER, and LASER-EE; y-axis from 0.5 to 1.0]
  • 249. Number of Campaigns Served [Figure: improvement from explore/exploit in the number of campaigns served]
  • 250. Insights • Overall performance: – LASER and LASER-EE are both much better than control – LASER and LASER-EE performance are very similar • Great news! We get exploration without much additional cost • Exploration has other benefits – LASER-EE serves significantly more campaigns than LASER – Provides healthier market place, more ad diversity per user (better experience)
  • 251. Solutions to Practical Problems • Rapid model development cycle – Quick reaction to changes in data, product – Write once for training, testing, inference • Can adapt to changing data – Integrated Thompson sampling explore/exploit – Automatic training – Multiple training frequencies for different parts of model • Good tools yield good models – Reusable components for feature extraction and transformation – Very high-performance inference engine for deployment – Modelers can concentrate on building models, not re-writing common functions or worrying about production issues
  • 252. Summary • Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective mechanism to solve response prediction problems in advertising • Partitioning model components into cold-start (stable) and warm-start (non-stationary) parts with different training frequencies is an effective mechanism to scale computations • ADMM with a few modifications is an effective model training strategy for large data with high dimensionality • The methods work well for LinkedIn advertising, with significant improvements
  • 253. Theory vs. Practice • Textbook: – Data is stationary – Training data is clean – Training is hard; testing and inference are easy – Models don't change – Complex algorithms work best • Reality: – Features and items change constantly – Fraud, bugs, tracking delays, online/offline inconsistencies, etc. – All aspects have challenges at web scale – Never-ending processes of improvement – Simple models with good features and lots of data win
  • 254. Current Work: Feed Recommendation – Network updates: job changes, job anniversaries, connections, endorsements, photo uploads, … – Content: articles by influencers, shares by friends, content in followed channels, content by followed companies, job recommendations, … – Sponsored updates: company updates, jobs
  • 255. Tiered Approach to Ranking – A second-pass ranker (BLENDER) blends the disparate results returned by first-pass rankers and outputs the top k [Diagram: first-pass rankers for network updates, content, ads, and jobs feed into the BLENDER]
  • 256. Challenges • Personalization – Viewer-actor affinity by type (depends on strength of connections in multiple contexts) – Blending identity and behavioral data • Frequency discounting, freshness, diversification • Multi-objectives (Revenue, Engagement) • A/B tests with interference • Engagement metrics – Function of various actions that optimize long-term engagement metric like return visits • Summarization and adding new content types