Statistical Computing
For Big Data
Deepak Agarwal
LinkedIn Applied Relevance Science
dagarwal@linkedin.com
ENAR 2014, Baltimore, USA
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop
System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Big Data becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
Big Data: Some size estimates
• 1000 human genomes: > 100TB of data (1000
genomes project)
• Sloan Digital Sky Survey: 200GB data per night
(>140TB aggregated)
• Facebook: A billion monthly active users
• LinkedIn: 225M members worldwide
• Twitter: 500 million tweets a day
• Over 6 billion mobile phones in the world
generating data every day
Big Data: Paradigm shift
• Classical Statistics
– Generalize using small data
• Paradigm Shift with Big Data
– We now have an almost infinite supply of data
– Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
– Not quite
• More data comes with more heterogeneity
• Need to change our statistical thinking to adapt
– Classical statistics still invaluable to think about big data analytics
Some Statistical Challenges
• Exploratory Analysis (EDA), Visualization
– Retrospective (on Terabytes)
– More Real Time (streaming computations every
few minutes/hours)
• Statistical Modeling
– Scale (computational challenge)
– Curse of dimensionality
• Millions of predictors, heterogeneity
– Temporal and Spatial correlations
Statistical Challenges continued
• Experiments
– To test new methods, test hypothesis from
randomized experiments
– Adaptive experiments
• Forecasting
– Planning, advertising
• Many more that I am not fully versed in
Defining Big Data
• How do you know you have a big data problem?
– Is it only the number of terabytes?
– What about dimensionality, structured/unstructured, computations required, …
• No clear definition, different points of view
– When desired computation cannot be completed
in the stipulated time with current best algorithm
using cores available on a commodity PC
Distributed Computing for Big Data
• Distributed computing invaluable tool to scale
computations for big data
• Some distributed computing models
– Multi-threading
– Graphics Processing Units (GPU)
– Message Passing Interface (MPI)
– Map-Reduce
Evaluating a method for a problem
• Scalability
– Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
– Especially in an industrial environment
• Cost
– Hardware and maintenance cost
• Good for the computations required?
– E.g., Iterative versus one pass
• Resource sharing
Multithreading
• Multiple threads take advantage of multiple
CPUs
• Shared memory
• Threads can execute independently and
concurrently
• Can only handle Gigabytes of data
• Reliable
Graphics Processing Units (GPU)
• Number of cores:
– CPU: Order of 10
– GPU: smaller cores
• Order of 1000
• Can be >100x faster than CPU
– Parallel computationally intensive tasks off-loaded to GPU
• Good for certain computationally-intensive tasks
• Can only handle Gigabytes of data
• Not trivial to use; requires a good understanding of the low-level architecture
for efficient use
– But things are changing; it is getting more user-friendly
Message Passing Interface (MPI)
• Language independent communication
protocol among processes (e.g. computers)
• Most suitable for master/slave model
• Can handle Terabytes of data
• Good for iterative processing
• Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)
[Diagram: Data → Mappers → Reducers → Output]
• Computation split into Map (scatter) and Reduce (gather) stages
• Easy to use:
– User needs to implement only two functions: Mapper and Reducer
• Easily handles Terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods

                             | Multithreading | GPU       | MPI       | Map-Reduce
Scalability (data size)      | Gigabytes      | Gigabytes | Terabytes | Terabytes
Fault tolerance              | High           | High      | Low       | High
Maintenance cost             | Low            | Medium    | Medium    | Medium-High
Iterative process complexity | Cheap          | Cheap     | Cheap     | Usually expensive
Resource sharing             | Hard           | Hard      | Easy      | Easy
Easy to implement?           | Easy           | Needs understanding of low-level GPU architecture | Easy | Easy
Example Problem
• Tabulating word counts in a corpus of documents
• Similar to table function in R
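For intuition, here is the sequential analogue in R; a minimal sketch, where docs is a hypothetical character vector of documents:

docs <- c("Hello World", "Bye World", "Hello Hadoop", "Goodbye Hadoop")
words <- unlist(strsplit(docs, " "))  # split each document into words
table(words)                          # counts per word, like the combined reducer output

Map-Reduce distributes exactly this computation across machines, as the next slide shows.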
Word Count Through Map-Reduce

[Diagram]
Mapper 1 input: "Hello World", "Bye World" → emits <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>
Mapper 2 input: "Hello Hadoop", "Goodbye Hadoop" → emits <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1>
Reducer 1 (words from A-G) outputs: <Bye, 1>, <Goodbye, 1>
Reducer 2 (words from H-Z) outputs: <Hello, 2>, <World, 2>, <Hadoop, 2>
Key Ideas about Map-Reduce

[Diagram: Big Data → Partition 1, Partition 2, …, Partition N → Mapper 1, Mapper 2, …, Mapper N → <Key, Value> pairs → Reducer 1, Reducer 2, …, Reducer M → Output 1, …, Output M]
Key Ideas about Map-Reduce
• Data are split into partitions and stored in many
different machines on disk (distributed storage)
• Mappers process data chunks independently and
emit <Key, Value> pairs
• Data with the same key are sent to the same
reducer. One reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values
corresponding to the key according to the
custom reducer function and outputs the result
Compute Mean for Each Group

ID | Group No. | Score
1  | 1 | 0.5
2  | 3 | 1.0
3  | 1 | 0.8
4  | 2 | 0.7
5  | 2 | 1.5
6  | 3 | 1.2
7  | 1 | 0.8
8  | 2 | 0.9
9  | 4 | 1.3
…  | … | …
Key Ideas about Map-Reduce
• Data are split into partitions and stored in many different machines on
disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
– For each row:
• Key = Group No.
• Value = Score
• Data with the same key are sent to the same reducer. One reducer can
receive multiple keys
– E.g. 2 reducers
– Reducer 1 receives data with key = 1, 2
– Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
– E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]>
• For each key, the reducer processes the values corresponding to the key
according to the custom reducer function and outputs the result
– E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
Pseudo Code (in R)

What you need to implement:

Mapper:
# Input: Data (rows with groupNo and score)
for (i in 1:nrow(Data)) {
  groupNo <- Data$groupNo[i]
  score <- Data$score[i]
  Output(c(groupNo, score))
}

Reducer:
# Input: Key (groupNo), Values (a list of scores that belong to the Key)
total <- 0
count <- 0
for (v in Values) {
  total <- total + v
  count <- count + 1
}
Output(c(Key, total / count))
Exercise 1
• Problem: Average height per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this?

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
• Problem: Average height per Grade and Gender?
• What should be the mapper output key?
– {Grade, Gender}
• What should be the mapper output value?
– Height
• What is the reducer input?
– Key: {Grade, Gender}, Value: List of Heights
• What is the reducer output?
– {Grade, Gender, mean(Heights)}

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
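A minimal R sketch of this solution, in the same pseudo-code style as before (Output() and the map-to-reduce shuffling are assumed to be supplied by the framework):

Mapper:
for (i in 1:nrow(Data)) {
  key <- paste(Data$grade[i], Data$gender[i])  # composite key {Grade, Gender}
  Output(c(key, Data$height[i]))               # value = height
}

Reducer:
# Input: Key ({Grade, Gender}), Values (list of heights for the Key)
heights <- unlist(Values)
Output(c(Key, mean(heights)))                  # {Grade, Gender, mean(Heights)}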
Exercise 2
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
• What should be the mapper output value?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this?

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
• Problem: Number of students per {Grade, Gender}?
• What should be the mapper output key?
– {Grade, Gender}
• What should be the mapper output value?
– 1
• What is the reducer input?
– Key: {Grade, Gender}, Value: List of 1's
• What is the reducer output?
– {Grade, Gender, sum(value list)}
– OR: {Grade, Gender, length(value list)}

Student ID | Grade | Gender | Height (cm)
1 | 3 | M | 120
2 | 2 | F | 115
3 | 2 | M | 116
… | … | … | …
More on Map-Reduce
• Depends on distributed file systems
• Typically mappers run on the data-storage nodes (data locality)
• Map/Reduce tasks automatically get restarted
when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
– Data transmission from mappers to reducers is
through disk copy
• Iterative process through Map-Reduce
– Each iteration becomes a map-reduce job
– Can be expensive since map-reduce overhead is high
The Apache Hadoop System
• Open-source software for
reliable, scalable, distributed computing
• The most popular distributed computing
system in the world
• Key modules:
– Hadoop Distributed File System (HDFS)
– Hadoop YARN (job scheduling and cluster resource
management)
– Hadoop MapReduce
Major Tools on Hadoop
• Pig
– A high-level language for Map-Reduce computation
• Hive
– A SQL-like query language for data querying via Map-Reduce
• HBase
– A distributed & scalable database on Hadoop
– Allows random, real-time read/write access to big data
– Voldemort is similar to HBase
• Mahout
– A scalable machine learning library
• …
Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
– http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
– http://hadoop.apache.org/docs/stable/cluster_setup.html
Hadoop Distributed File System (HDFS)
• Master/Slave architecture
• NameNode: a single master node that controls which
data block is stored where.
• DataNodes: slave nodes that store data and do R/W
operations
• Clients (Gateway): Allow users to log in and interact
with HDFS and submit Map-Reduce jobs
• Big data is split into equal-sized blocks; each block can be
stored on different DataNodes
• Disk failure tolerance: data is replicated multiple times
Load the Data into Pig
• A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
– The path of the data on HDFS goes right after LOAD
• USING PigStorage() means the data are delimited by tab (can be omitted)
• If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop
System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– The Pig Language
– A Deep Dive of Hadoop Map-Reduce
• Part II: Examples of Statistical Computing for Big Data
– Bag of Little Bootstraps
– Large Scale Logistic Regression
– Parallel Matrix Factorization
• Part III: The Future of Cloud Computing
Bag of Little Bootstraps
Kleiner et al. 2012
Bootstrap (Efron, 1979)
• A re-sampling based method to obtain statistical
distribution of sample estimators
• Why are we interested ?
– Re-sampling is embarrassingly parallelizable
• For example:
– Standard deviation of the mean of N samples (μ)
– For i = 1 to r do
• Randomly sample with replacement N times from the original
sample -> bootstrap data i
• Compute mean of the i-th bootstrap data -> μi
– Estimate of Sd(μ) = Sd([μ1, …, μr])
– r is usually a large number, e.g. 200
Bootstrap for Big Data
• Can have r nodes running in parallel, each
sampling one bootstrap data
• However…
– N can be very large
– Data may not fit into memory
– Collecting N samples with replacement on each
node can be computationally expensive
M out of N Bootstrap
(Bickel et al. 1997)
• Obtain SdM(μ) by sampling M samples with
replacement for each bootstrap, where M<N
• Apply analytical correction to SdM(μ) to obtain
Sd(μ) using prior knowledge of convergence rate
of sample estimates
• However…
– Prior knowledge is required
– Choice of M is critical to performance
– Finding optimal value of M needs more computation
Bag of Little Bootstraps (BLB)
• Example: Standard deviation of the mean
• Generate S sampled data sets, each obtained by random
sampling without replacement a subset of size b (or
partition the original data into S partitions, each with size
b)
• For each subset p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the subset of size b
• Compute the mean of the resampled data -> μpi
– Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
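A self-contained R sketch of this procedure for Sd(μ) on simulated data; b = N^0.6, S = 5, and r = 100 are illustrative choices, and the multinomial weights avoid materializing the N resampled points:

set.seed(1)
x <- rnorm(1e5)                          # the "big" data, size N
N <- length(x); b <- floor(N^0.6)        # subset size b = N^gamma
S <- 5; r <- 100
sd.p <- numeric(S)
for (p in 1:S) {
  xb <- sample(x, b)                     # subset of size b, without replacement
  mu <- numeric(r)
  for (i in 1:r) {
    w <- rmultinom(1, N, rep(1 / b, b))  # how often each of the b points is drawn
    mu[i] <- sum(w * xb) / N             # mean of the implicit size-N resample
  }
  sd.p[p] <- sd(mu)
}
mean(sd.p)                               # BLB estimate; compare with sd(x) / sqrt(N)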
Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from
size N data
– ξ is some function of θ, such as standard deviation, …
• Generate S sampled data sets, each obtained from random
sampling without replacement a subset of size b (or partition
the original data into S partitions, each with size b)
• For each subset p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the subset of size b
• Compute the estimate θpi from the resampled data
– Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
Bag of Little Bootstraps (BLB)
[Diagram: the same algorithm mapped onto Hadoop; the inner resampling loops for each subset run in Mappers/Reducers, and the Gateway averages the S estimates ξp(θ).]
Why is BLB Efficient
• Before:
– N samples with replacement from size N data is
expensive when N is large
• Now:
– N samples with replacement from size b data
– b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ in [0.5, 1))
– Equivalent to: a multinomial sampler with dim = b
– Storage = O(b), computational complexity = O(b)
Simulation Experiment
• 95% CI of Logistic Regression Coefficients
• N = 20000, 10 explanatory variables
• Relative Error = |Estimated CI width – True CI
width | / True CI width
• BLB-γ: BLB with b = Nγ
• BOFN-γ: b out of N sampling with b = Nγ
• BOOT: Naïve bootstrap
Simulation Experiment
Real Data
• 95% CI of Logistic Regression Coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB, r = 50, S = 5, γ = 0.7
Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
– Fast and efficient
– Easy to parallelize
– Easy to understand and implement
– Friendly to Hadoop, makes it routine to perform
statistical calculations on Big data
Large Scale Logistic Regression
Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 − pi)) = Xi' β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)
Large Scale Logistic Regression
• Binary response: Y
– E.g., Click / Non-click on an ad on a webpage
• Covariates: X
– User covariates:
• Age, gender, industry, education, job, job title, …
– Item covariates:
• Categories, keywords, topics, …
– Context covariates:
• Time, page type, position, …
– 2-way interaction:
• User covariates X item covariates
• Context covariates X item covariates
• …
Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a
single machine is not feasible
• Model fitting is iterative, using methods like
gradient descent, Newton's method, etc.
– Multiple passes over the data
Recap on Optimization method
• Problem: Find x to min F(x)
• Iteration n: x_n = x_{n−1} − b_{n−1} F′(x_{n−1})
• b_{n−1} is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, LBFGS, Newton trust region, … are all of this kind
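A toy R illustration of this recipe on F(x) = (x − 3)², with a fixed step size (both are assumptions for the example):

grad <- function(x) 2 * (x - 3)      # F(x) = (x - 3)^2, so F'(x) = 2(x - 3)
x <- 0; b <- 0.1                     # starting point and step size
for (n in 1:1000) {
  x.new <- x - b * grad(x)           # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
  if (abs(x.new - x) < 1e-10) break  # iterate until convergence
  x <- x.new
}
x                                    # converges to the minimizer, 3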
Iterative Process with Hadoop
[Diagram: each iteration is a chain Disk → Mappers → Disk → Reducers → Disk, repeated across iterations]
Limitations of Hadoop for fitting a big
logistic regression
• Iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• I/O of mapper and reducers are both through
disk
• Plus: Waiting in queue time
• Q: Can we find a fitting method that scales
with Hadoop ?
Large Scale Logistic Regression
• Naïve:
– Partition the data and run logistic regression for each partition
– Take the mean of the learned coefficients
– Problem: Not guaranteed to converge to the model from a single
machine!
• Alternating Direction Method of Multipliers (ADMM)
– Boyd et al. 2011
– Set up constraints: each partition's coefficient = global
consensus
– Solve the optimization problem using Lagrange multipliers
– Advantage: guaranteed to converge to the single-machine logistic
regression on the entire data within a reasonable number of
iterations
Large Scale Logistic Regression via ADMM
[Diagram: BIG DATA is split into Partition 1, Partition 2, Partition 3, …, Partition K. In each iteration, a logistic regression is fit on every partition in parallel, followed by a consensus computation that combines the partition fits; the consensus is fed back into the next iteration (Iteration 1, Iteration 2, …).]
Details of ADMM
Dual Ascent Method
• Consider a convex optimization problem: min f(x) subject to Ax = b
• Lagrangian for the problem: L(x, y) = f(x) + y'(Ax − b)
• Dual Ascent: x_{k+1} = argmin_x L(x, y_k), then y_{k+1} = y_k + α_k (A x_{k+1} − b)
Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict
convexity or finiteness of f
• L_ρ(x, y) = f(x) + y'(Ax − b) + (ρ/2) ||Ax − b||₂²
• The value of ρ influences the convergence rate
Alternating Direction Method of Multipliers (ADMM)
• Problem: min f(x) + g(z) subject to Ax + Bz = c
• Augmented Lagrangian: L_ρ(x, z, y) = f(x) + g(z) + y'(Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||₂²
• ADMM:
  x_{k+1} = argmin_x L_ρ(x, z_k, y_k)
  z_{k+1} = argmin_z L_ρ(x_{k+1}, z, y_k)
  y_{k+1} = y_k + ρ (A x_{k+1} + B z_{k+1} − c)
Large Scale Logistic Regression via ADMM
• Notation
– (Xi, yi): data in the ith partition
– βi: coefficient vector for partition i
– β: consensus coefficient vector
– r(β): penalty component such as ||β||₂²
• Optimization problem: minimize Σi [logistic loss of βi on (Xi, yi)] + r(β), subject to βi = β for all partitions i
ADMM updates
[Diagram: each iteration alternates local regressions (the βi updates, with shrinkage towards the current best global estimate) and an updated consensus computed from them.]
An example implementation
• ADMM for Logistic regression model fitting with
L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
– Mapper: partition the data into K partitions
– Reducer: For each partition, use liblinear/glmnet to fit
a L1/L2 logistic regression
– Gateway: computes the consensus from the results of all
reducers, and sends the consensus back to each
reducer node
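A single-machine R simulation of this loop, as a rough sketch: optim() stands in for liblinear/glmnet on each partition, the consensus step plays the gateway's role, and rho, lambda, and the data shapes are illustrative assumptions.

admm.logistic <- function(X, y, K = 4, rho = 1, lambda = 1, iters = 20) {
  d <- ncol(X)
  part <- split(seq_len(nrow(X)), rep(1:K, length.out = nrow(X)))  # K partitions
  beta <- matrix(0, K, d)   # per-partition coefficients
  u    <- matrix(0, K, d)   # scaled dual variables
  z    <- rep(0, d)         # consensus coefficients
  for (it in 1:iters) {
    for (k in 1:K) {        # "reducers": fit each partition with a proximal term
      Xk <- X[part[[k]], , drop = FALSE]; yk <- y[part[[k]]]
      obj <- function(b) {
        eta <- drop(Xk %*% b)
        sum(log(1 + exp(eta)) - yk * eta) +      # logistic log-loss
          (rho / 2) * sum((b - z + u[k, ])^2)    # shrink towards consensus
      }
      beta[k, ] <- optim(beta[k, ], obj, method = "BFGS")$par
    }
    # "gateway": consensus update (closed form for the L2 penalty lambda * ||z||^2)
    z <- colMeans(beta + u) * (K * rho) / (K * rho + 2 * lambda)
    u <- u + beta - matrix(z, K, d, byrow = TRUE)  # dual updates
  }
  z
}
# e.g.: X <- matrix(rnorm(2000), 500); y <- rbinom(500, 1, plogis(X %*% c(1, -1, 0.5, 0)))
# admm.logistic(X, y)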
KDD CUP 2010 Data
• Bridge to Algebra 2008-2009 data in
https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keep covariates with >= 10 occurrences
=> 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
Avg Training Loglikelihood vs Number
of Iterations
Test AUC vs Number of Iterations
Better Convergence Can
Be Achieved By
• Better Initialization
– Use results from Naïve method to initialize the
parameters
• Adaptively change step size (ρ) for each
iteration based on the convergence status of
the consensus
Recommender Problems for
Web Applications
Agenda
• Topic of Interest
– Recommender problems for dynamic, time-sensitive applications
• Content Optimization, Online Advertising, Movie
recommendation, shopping,…
• Introduction
• Offline components
– Regression, Collaborative filtering (CF), …
• Online components + initialization
– Time-series, online/incremental
methods, explore/exploit (bandit)
• Evaluation methods + Multi-Objective
• Challenges
Three components we will focus on
• Defining the problem
– Formulate objectives whose optimization achieves some long-
term goals for the recommender system
• E.g. How to serve content to optimize audience reach and
engagement, or optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
– Predict rates of some positive user interaction(s) with items
based on data obtained from historical user-item interactions
• E.g. Click rates, average time-spent on page, etc
• Could be explicit feedback like ratings
• Experimentation
– Create experiments to collect data proactively to improve
models, helps in converging to the best choice(s) cheaply and
rapidly.
• Explore and Exploit (continuous experimentation)
• DOE (testing hypotheses by avoiding bias inherent in data)
Modern Recommendation Systems
• Goal
– Serve the right item to a user in a given context to
optimize long-term business objectives
• A scientific discipline that involves
– Large scale Machine Learning & Statistics
• Offline Models (capture global & stable characteristics)
• Online Models (incorporates dynamic components)
• Explore/Exploit (active and adaptive experimentation)
– Multi-Objective Optimization
• Click-rates (CTR), Engagement, advertising revenue, diversity, etc
– Inferring user interest
• Constructing User Profiles
– Natural Language Processing to understand content
• Topics, “aboutness”, entities, follow-up of something, breaking news,…
Some examples from content
optimization
• Simple version
– I have a content module on my page, content inventory is
obtained from a third party source which is further refined
through editorial oversight. Can I algorithmically
recommend content on this module? I want to improve
overall click-rate (CTR) on this module
• More advanced
– I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I
increase downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
• Recommend applications
• Recommend search queries
• Recommend news articles
• Recommend packages: image, title, summary, links to other pages
– Pick 4 out of a pool of K (K = 20 ~ 40), dynamic
– Routes traffic to other pages
Problems in this example
• Optimize CTR on multiple modules
– Today Module, Trending Now, Personal Assistant,
News
– Simple solution: Treat modules as independent,
optimize separately. May not be the best when
there are strong correlations.
• For any single module
– Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
Online Advertising
[Diagram: Advertisers submit ads and bids to an Ad Network; for each page view, an auction selects argmax f(bid, response rates) and the best ad(s) are recommended to the user on the publisher's page. An ML/statistical model estimates response rates (click, conversion, ad-view) from observed clicks and conversions. Examples: Yahoo, Google, MSN, …; ad exchanges (RightMedia, DoubleClick, …).]
LinkedIn Today: Content Module
Objective: Serve content to maximize engagement
metrics like CTR (or weighted CTR)
LinkedIn Ads: Match ads to users
visiting LinkedIn
Right Media Ad Exchange: Unified
Marketplace
Match ads to page views on publisher sites
[Diagram: a publisher has an ad impression to sell, which goes to auction. Bidders include AdSense and Ad.com; bids of $0.50 and $0.60 compete, another bidder bids $0.75 via a network, which becomes a $0.45 bid after fees, and a $0.65 bid wins.]
Recommender problems in general
[Diagram: A USER arrives in some context (query, page, …). An automated algorithm selects item(s) to show from the item inventory (articles, web pages, ads, …); feedback (click, time spent, …) is collected; the models are refined; repeat (a large number of times), optimizing metric(s) of interest (total clicks, total revenue, …).]
• Example applications: Search (web, vertical), online advertising, content, …
Important Factors
• Items: Articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement)
– Currently, most applications are single-objective
– Could be multi-objective optimization (maximize X subject to Y, Z, …)
• Properties of the item pool
– Size (e.g., all web pages vs. 40 stories)
– Quality of the pool (e.g., anything vs. editorially selected)
– Lifetime (e.g., mostly old items vs. mostly new items)
Factors affecting Solution
(continued)
• Properties of the context
– Pull: Specified by explicit, user-driven query (e.g., keywords, a form)
– Push: Specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on a continuum of pull and push
• Properties of the feedback on the matches
made
– Types and semantics of feedback (e.g., click, vote)
– Latency (e.g., available in 5 minutes vs. 1 day)
– Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches
– e.g., business rules, diversity rules, editorial voice
– Multiple objectives
• Available Metadata (e.g., link graph, various user/item attributes)
Predicting User-Item Interactions
(e.g. CTR)
• Myth: We have so much data on the web that if we
could only process it, the problem would be solved
– Number of things to learn increases with sample size
• Rate of increase is not slow
– Dynamic nature of systems make things worse
– We want to learn things quickly and react fast
• Data is sparse in web recommender problems
– We lack enough data to learn all we want to learn,
as quickly as we would like to learn it
– Several Power laws interacting with each other
• E.g. User visits power law, items served power law
– Bivariate Zipf: Owen & Dyer, 2011
Can Machine Learning help?
• Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable
– E.g. Users in San Francisco tend to read more baseball news
• Key issue: Estimating such groups
– Coarse group: more stable but does not generalize that well
– Granular group: less stable, with few individuals
– Getting a good grouping structure means hitting the "sweet spot"
• Another big advantage on the web
– Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choice(s)
• We don't need to learn all user-item interactions, only those that are good.
Predicting user-item interaction rates
[Diagram: three stages feeding each other.]
• Offline models (capture stable characteristics at coarse resolutions): logistic, boosting, …
– Feature construction: content (IR, clustering, taxonomy, entity, …), user profiles (clicks, views, social, community, …)
• Near-online models (finer-resolution corrections at the item/user level; quick updates), initialized by the offline models
• Explore/exploit (adaptive sampling; helps rapid convergence to best choices)
Post-click: An example in Content Optimization
[Diagram: recommender and editorial content on the front page (FP); clicks on FP links influence the downstream supply distribution, which drives display advertising revenue (ad server) and downstream engagement (time spent).]
Serving Content on Front Page: Click Shaping
• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
– Article 1: CTR = 5%, utility per click = 5
– Article 2: CTR = 4.9%, utility per click = 10
• By promoting 2, we lose 0.1 clicks per 100 visits but gain 24 utils (49 vs. 25 per 100 visits)
• If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
– E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc.)
High level picture
[Diagram: an http request hits the Server; the item recommendation system performs thousands of computations in sub-seconds; statistical models are updated in batch mode (e.g., once every 30 mins); the user interacts (e.g., clicks, or does nothing), feeding data back.]
High level overview: Item Recommendation System
[Diagram: User info and the item index (id, meta-data) feed the scoring pipeline. Items pass through a pre-filter (spam, editorial, …) and feature extraction (NLP, clustering, …). ML/statistical models score items (P(click), P(share), semantic-relevance score, …); items are then ranked: sort by score (CTR, bid*CTR, …), combine scores using multi-objective optimization, threshold on some scores, …. User-item interaction data is batch processed; activity and profile are updated in batch.]
ML/Statistical models for scoring
[Figure: applications arranged by number of items scored by ML (from ~100 up to ~100M) and traffic volume, with item lifetimes from a few hours to several days: LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, LinkedIn Ads.]
Summary of deployments
• Yahoo! Front page Today Module (2008-2011): 300% improvement
in click-through rates
– Similar algorithms delivered via a self-serve platform, adopted by
several Yahoo! Properties (2011): Significant improvement in
engagement across Yahoo! Network
• Fully deployed on LinkedIn Today Module (2012): Significant
improvement in click-through rates (numbers not revealed due to
reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): Fully deployed algorithms to
estimate response rates (CTR, conversion rates). Significant
improvement in revenue (numbers not revealed due to reasons of
confidentiality)
• LinkedIn self-serve ads (2012-2013): Fully deployed
• LinkedIn News Feed (2013-2014): Fully deployed
• Several others in progress….
Broad Themes
• Curse of dimensionality
– Large number of observations (rows), large number of potential features
(columns)
– Use domain knowledge and machine learning to reduce the “effective”
dimension (constraints on parameters reduce degrees of freedom)
• I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have
control over what data to collect through clever experimentation
– This can fundamentally change solutions
• Think of computation and models together for Big Data
• Optimization: What we are trying to optimize is often complex; models need
to work in harmony with the optimization
– Pareto optimality with competing objectives
Statistical Problem
• Rank items (from an admissible pool) for user visits in some
context to maximize a utility of interest
• Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR * P(Share | Click))
– Revenue per page-view = CTR*bid (more complex due to second price
auction)
• CTR is a fundamental measure that opens the door to a more
principled approach to rank items
• Converge rapidly to maximum utility items
– Sequential decision making process (explore/exploit)
[Diagram: User i (with user features, e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; response yij (click or not) is observed.]
Which item should we select?
• The item with the highest predicted CTR → Exploit
• An item for which we need data to predict its CTR → Explore
LinkedIn Today, Yahoo! Today Module:
Choose Items to maximize CTR
This is an “Explore/Exploit” Problem
The Explore/Exploit Problem (to
maximize CTR)
• Problem definition: Pick k items from a pool of N for a large
number of serves to maximize the number of clicks on the
picked items
• Easy!? Pick the items having the highest click-through rates
(CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item may change over time
– How much traffic should be allocated to explore new items to
achieve optimal performance?
• Too little → Unreliable CTR estimates due to "starvation"
• Too much → Little traffic to exploit the high-CTR items
Y! front Page Application
• Simplify: Maximize CTR on first slot (F1)
• Item Pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the item pool is dynamic
CTR Curves of Items on LinkedIn Today
[Figure: CTR of several items over time]
Impact of repeat item views on a given
user
• Same user is shown an item multiple times
(despite not clicking)
Simple algorithm to estimate most popular
item with small but dynamic item pool
• Simple explore/exploit scheme
– ε% explore: with a small probability (e.g. 5%), choose an item at
random from the pool
– (100 − ε)% exploit: with large probability (e.g. 95%), choose the
highest-scoring CTR item
• Temporal Smoothing
– Item CTRs change over time, provide more weight to recent data in
estimating item CTRs
• Kalman filter, moving average
• Discount item score with repeat views
– CTR(item) for a given user drops with repeat views by some “discount”
factor (estimated from data)
• Segmented most popular
– Perform separate most-popular for each user segment
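A minimal R sketch of the ε-greedy serve decision (the per-item CTR estimates and their temporal smoothing are assumed to be maintained elsewhere):

serve.item <- function(ctr.est, eps = 0.05) {
  if (runif(1) < eps)
    sample.int(length(ctr.est), 1)   # explore: random item from the pool
  else
    which.max(ctr.est)               # exploit: highest estimated CTR
}
# e.g. serve.item(c(0.020, 0.031, 0.025))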
Time series Model: Kalman filter
• Dynamic Gamma-Poisson: click-rate evolves over time
in a multiplicative fashion
• Estimated click-rate distribution at time t+1
– Prior mean: μ_{t+1} = μ_{t|t}
– Prior variance: σ²_{t+1} = σ²_{t|t} + η (σ²_{t|t} + μ²_{t|t})
• High CTR items are more adaptive
More economical exploration? Better bandit solutions
• Consider a two-armed problem with unknown payoff probabilities p1 > p2
• The gambler has 1000 plays; what is the best way to experiment
(to maximize total expected reward)?
• This is called the "multi-armed bandit" problem and has been studied for a long time
• Optimal solution: Play the arm that has the maximum potential of being good
– Optimism in the face of uncertainty
Item Recommendation: Bandits?
• Two Items: Item 1 CTR= 2/100 ; Item 2 CTR= 250/10000
– Greedy: Show Item 2 to all; not a good idea
– Item 1 CTR estimate noisy; item could be potentially
better
• Invest in Item 1 for better overall performance on average
– Exploit what is known to be good, explore what is potentially good
[Figure: posterior CTR densities; Item 1's is wide (noisy estimate), Item 2's is narrow. x-axis: CTR; y-axis: probability density.]
Next few hours

                           | Most Popular Recommendation | Personalized Recommendation
Offline Models             | -                   | Collaborative filtering (cold-start problem)
Online Models              | Time-series models  | Incremental CF, online regression
Intelligent Initialization | Prior estimation    | Prior estimation, dimension reduction
Explore/Exploit            | Multi-armed bandits | Bandits with covariates
Offline Components:
Collaborative Filtering in Cold-start
Situations
Problem
[Diagram: User i (with user features xi: demographics, browse history, search history, …) visits; the algorithm selects item j (with item features xj: keywords, content categories, …); response yij is observed (explicit rating, implicit click/no-click).]
Predict the unobserved entries based on features and the observed entries
Model Choices
• Feature-based (or content-based) approach
– Use features to predict response
• (regression, Bayes Net, mixture models, …)
– Limitation: need predictive features
• Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF aka Memory based)
– Make recommendation based on past user-item interaction
• User-user, item-item, matrix factorization, …
• See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
– Better performance for old users and old items
– Does not naturally handle new users and new items (cold-
start)
Collaborative Filtering (Memory based methods)
• User-user similarity; item-item similarities; methods incorporating both
• Estimating similarities:
– Pearson's correlation
– Optimization based (Koren et al)
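A small R sketch of the user-user variant (R is a users × items rating matrix with NAs for unrated entries; a toy version, not the optimization-based estimator):

user.sim <- function(R, i, j) {                  # Pearson's correlation on co-rated items
  common <- !is.na(R[i, ]) & !is.na(R[j, ])
  if (sum(common) < 2) return(0)
  s <- suppressWarnings(cor(R[i, common], R[j, common]))
  if (is.na(s)) 0 else s
}
predict.rating <- function(R, i, item) {         # predicted rating of `item` for user i
  others <- setdiff(which(!is.na(R[, item])), i)
  base <- mean(R[i, ], na.rm = TRUE)             # user i's mean rating
  if (length(others) == 0) return(base)
  w <- sapply(others, function(j) user.sim(R, i, j))
  if (sum(abs(w)) == 0) return(base)
  dev <- R[cbind(others, item)] - rowMeans(R[others, , drop = FALSE], na.rm = TRUE)
  base + sum(w * dev) / sum(abs(w))              # weighted deviations from user means
}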
How to Deal with the Cold-Start
Problem
• Heuristic-based approaches
– Linear combination of regression and CF models
– Filterbot
• Add user features as pseudo users and do collaborative filtering
– Hybrid approaches
• Use content-based methods to fill up entries, then use CF
• Matrix Factorization
– Good performance on Netflix (Koren, 2009)
• Model-based approaches
– Bilinear random-effects model (probabilistic matrix factorization)
• Good on Netflix data [Ruslan et al ICML, 2009]
– Add feature-based regression to matrix factorization
• (Agarwal and Chen, 2009)
– Add topic discovery (from textual items) to matrix factorization
• (Agarwal and Chen, 2009; Chun and Blei, 2011)
Per-item regression models
• When tracking users by cookies, the distribution of
visit patterns can get extremely skewed
– The majority of cookies have 1-2 visits
• Per item models (regression) based on user
covariates attractive in such cases
Several per-item regressions: Multi-task learning
[Diagram: per-item regressions tied together through a low-dimensional (5-10) matrix B, estimated from retrospective data; captures affinity to old items.]
• Agarwal, Chen and Elango, KDD, 2010
Per-user, per-item models
via bilinear random-effects
model
Motivation
• Data measuring k-way interactions pervasive
– Consider k = 2 for all our discussions
• E.g. User-Movie, User-content, User-Publisher-Ads,….
– Power law on both user and item degrees
• Classical Techniques
– Approximate matrix through a singular value
decomposition (SVD)
• After adjusting for marginal effects (user pop, movie pop,..)
– Does not work
• Matrix highly incomplete, severe over-fitting
– Key issue
• Regularization of eigenvectors (factors) to avoid overfitting
Early work on complete matrices
• Tukey’s 1-df model (1956)
– Rank 1 approximation of small nearly complete
matrix
• Criss-cross regression (Gabriel, 1978)
• Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s)
• Modern day recommender problems
– Highly incomplete, large, noisy.
Latent Factor Models
[Diagram: users and items embedded in a latent space with interpretable directions such as "newsy" and "sporty"; affinity = u'v for factors in Euclidean space, affinity = s'z for factors on the simplex.]
Factorization – Brief Overview
• Latent user factors: (αi, ui = (ui1, …, uin))
• Latent movie factors: (βj, vj = (vj1, …, vjm))
• Interaction: E[yij] = αi + βj + ui' B vj
• (Nn + Mm) parameters will overfit for moderate values of n, m
• Key technical issue: regularization
Latent Factor Models: Different
Aspects
• Matrix Factorization
– Factors in Euclidean space
– Factors on the simplex
• Incorporating features and ratings
simultaneously
• Online updates
Maximum Margin Matrix Factorization (MMMF)
• Complete the matrix by minimizing loss (hinge, squared error)
on observed entries subject to constraints on the trace norm
– Srebro, Rennie, Jaakkola (NIPS 2004)
– Convex; semi-definite programming (expensive, not scalable)
• Fast MMMF (Rennie & Srebro, ICML, 2005)
– Constrain the Frobenius norm of the left and right
eigenvector matrices; not convex but becomes scalable
• Other variation: Ensemble MMMF (DeCoste, ICML 2005)
– Ensembles of partially trained MMMF (some improvements)
Matrix Factorization for Netflix prize data
• Minimize the objective function:
  Σ_{(i,j) ∈ obs} (rij − ui'vj)² + λu Σi ||ui||² + λv Σj ||vj||²
• Simon Funk: Stochastic Gradient Descent
• Koren et al (KDD 2007): Alternating Least Squares
– They moved to SGD later in the competition
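A toy R version of the SGD fit for this objective (ratings given as (i, j, r) triplets; the learning rate and penalty are illustrative settings):

sgd.mf <- function(ratings, N, M, k = 10, lambda = 0.05, lr = 0.01, epochs = 20) {
  U <- matrix(rnorm(N * k, sd = 0.1), N, k)   # user factors
  V <- matrix(rnorm(M * k, sd = 0.1), M, k)   # item factors
  for (e in 1:epochs) {
    for (t in sample(nrow(ratings))) {        # one pass in random order
      i <- ratings[t, 1]; j <- ratings[t, 2]
      err <- ratings[t, 3] - sum(U[i, ] * V[j, ])   # r_ij - u_i'v_j
      ui <- U[i, ]
      U[i, ] <- ui + lr * (err * V[j, ] - lambda * ui)
      V[j, ] <- V[j, ] + lr * (err * ui - lambda * V[j, ])
    }
  }
  list(U = U, V = V)
}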
Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
• Model: rij ~ N(ui'vj, σ²);  ui ~ MVN(0, au I);  vj ~ MVN(0, av I)
• Optimization is through iterated conditional modes
• Other variations: constraining the mean through a sigmoid, using "who-rated-whom"
• Combining with Boltzmann machines also improved performance
Bayesian Probabilistic Matrix Factorization
(Ruslan and Minh, ICML 2008)
• Fully Bayesian treatment using an MCMC approach
– Significant improvement
• Interpretation as a fully Bayesian hierarchical model
shows why that is the case
– Failing to incorporate uncertainty leads to bias in
estimates
– Multi-modal posterior, MCMC helps in converging to a better one
[Figure: graphical model with ratings r and variance components au, av.]
• MCEM is also more resistant to over-fitting
Non-parametric Bayesian matrix completion
(Zhou et al, SAM, 2010)
• Specify rank probabilistically (automatic rank selection):
  yij ~ N(Σ_{k=1}^r zk uik vjk, σ²)
  zk ~ Ber(πk),  πk ~ Beta(a/r, b(r−1)/r)
• Marginally zk ~ Ber(a / (a + b(r−1))), so E(#Factors) = ra / (a + b(r−1))
How to incorporate features:
Deal with both warm start and cold-start
• Models to predict ratings for new pairs
– Warm-start: (user, movie) present in the training data with large
sample size
– Cold-start: At least one of (user, movie) new or has small sample
size
• Rough definition, warm-start/cold-start is a continuum.
• Challenges
– Highly incomplete (user, movie) matrix
– Heavy tailed degree distributions for users/movies
• Large fraction of ratings from small fraction of
users/movies
– Handling both warm-start and cold-start effectively in the
presence of predictive features
Possible approaches
• Large scale regression based on covariates
– Does not provide good estimates for heavy users/movies
– Large number of predictors to estimate interactions
• Collaborative filtering
– Neighborhood based
– Factorization
• Good for warm-start; cold-start dealt with separately
• Single model that handles cold-start and warm-start
– Heavy users/movies → User/movie specific model
– Light users/movies → fallback on regression model
– Smooth fallback mechanism for good performance
Add Feature-based Regression
into
Matrix Factorization
RLFM: Regression-based Latent
Factor Model
Regression-based Factorization
Model (RLFM)
• Main idea: Flexible prior, predict factors
through regressions
• Seamlessly handles cold-start and warm-
start
• Modified state equation to incorporate
covariates
RLFM: Model
Rating that user i gives item j:
  yij ~ N(μij, σ²)  (Gaussian model)
  yij ~ Bernoulli(μij)  (Logistic model, for binary ratings)
  yij ~ Poisson(μij Nij)  (Poisson model, for counts)
where t(μij) = xij' b + αi + βj + ui' vj
• Bias of user i: αi = g0' xi + εi^α, εi^α ~ N(0, σα²)
• Popularity of item j: βj = d0' xj + εj^β, εj^β ~ N(0, σβ²)
• Factors of user i: ui = G xi + εi^u, εi^u ~ N(0, σu² I)
• Factors of item j: vj = D xj + εj^v, εj^v ~ N(0, σv² I)
• Could use other classes of regression models

Graphical representation of the model
[Figure: graphical model relating covariates, latent factors, and ratings]
Advantages of RLFM
• Better regularization of factors
– Covariates “shrink” towards a better centroid
• Cold-start: Fallback regression model (FeatureOnly)
RLFM: Illustration of Shrinkage
[Figure: the first factor value plotted for each user, fitted using Yahoo! FP data]
Model fitting: EM for our class of models
The parameters for RLFM
• Latent parameters: Δ = ({αi}, {βj}, {ui}, {vj})
• Hyper-parameters: Θ = (b, G, D, Au = au I, Av = av I)
The EM algorithm
[Equations lost: computing the posterior mode and the objective being minimized]
Computing the E-step
• Often hard to compute in closed form
• Stochastic EM (Markov Chain EM; MCEM)
– Compute the expectation by drawing samples from the posterior
– Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM)
– Faster but biased hyper-parameter estimates
Monte Carlo E-step
• Through a vanilla Gibbs sampler (conditionals closed form)
• Other conditionals also Gaussian and closed form
• Conditionals of users (movies) sampled simultaneously
• Small number of samples in early iterations, large numbers in
later iterations
M-step (Why MCEM is better than ICM)
• Update G, optimize the regression objective
• Update Au = au I
• ICM ignores the posterior variance terms and underestimates factor variability
– Factors over-shrunk; posterior not explored well
Experiment 1: Better regularization
• MovieLens-100K, avg RMSE using pre-specified splits
• ZeroMean, RLFM and FeatureOnly (no cold-start
issues)
• Covariates:
– Users : age, gender, zipcode (1st digit only)
– Movies: genres
Experiment 2: Better handling of
Cold-start
• MovieLens-1M; EachMovie
• Training-test split based on timestamp
• Same covariates as in Experiment 1.
Experiment 4: Predicting click-rate
on articles
• Goal: Predict click-rate on articles for a user on F1
position
• Article lifetimes short, dynamic updates important
• User covariates:
– Age, Gender, Geo, Browse behavior
• Article covariates
– Content Category, keywords
• 2M ratings, 30K users, 4.5 K articles
Results on Y! FP data
Some other related approaches
• Stern, Herbrich and Graepel, WWW, 2009
– Similar to RLFM, different parametrization and
expectation propagation used to fit the models
• Porteus, Asuncion and Welling, AAAI, 2011
– Non-parametric approach using a Dirichlet process
• Agarwal, Zhang and Mazumdar, Annals of Applied
Statistics, 2011
– Regression + random effects per user regularized
through a Graphical Lasso
Add Topic Discovery into
Matrix Factorization
fLDA: Matrix Factorization through Latent
Dirichlet Allocation
fLDA: Introduction
• Model the rating yij that user i gives to item j as the user's
affinity to the topics that the item has:
  yij = … + Σk sik z̄jk
  (sik = user i's affinity to topic k; z̄jk = Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j)
– Unlike regular unsupervised LDA topic modeling, here the LDA
topics are learnt in a supervised manner based on past rating
data
– fLDA can be thought of as a "multi-task learning" version of the
supervised LDA model [Blei'07] for cold-start recommendation
• Old items: z̄jk's are item latent factors learnt from data with the LDA prior
• New items: z̄jk's are predicted based on the bag of words in the items
[Diagram: per-topic word distributions Φk = [Φk1, …, ΦkW] for Topic 1, …, Topic k, …, Topic K.]
LDA Topic Modeling (1)
• LDA is effective for unsupervised topic discovery [Blei'03]
– It models the generating process of a corpus of items (articles)
– For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η)
– For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)
– For each word, say the nth word, in item j:
• Draw a topic zjn for that word from θj
• Draw a word wjn from Φk with topic k = zjn
[Diagram: item j has topic distribution [θj1, …, θjK]; observed words wj1, …, wjn, …; per-word topics zj1, …, zjn, …; assume zjn = topic k.]
LDA Topic Modeling (2)
• Model training:
– Estimate the prior parameters (λ, η) and the posterior topic × word
distribution Φ based on a training corpus of items
– EM + Gibbs sampling is a popular method
• Inference for new items
– Compute the item topic distribution based on the prior
parameters and Φ estimated in the training phase
• Supervised LDA [Blei'07]
– Predict a target value for each item based on supervised LDA topics:
  yj = Σk sk z̄jk
  (yj = target value of item j; z̄jk = Pr(item j has topic k), estimated by averaging the topic of each word in item j; sk = regression weight for topic k)
– vs. fLDA: yij = … + Σk sik z̄jk
• One regression per user, with the same set of topics across different regressions
fLDA: Model
Rating that user i gives item j:
  yij ~ N(μij, σ²)  (Gaussian model)
  yij ~ Bernoulli(μij)  (Logistic model, for binary ratings)
  yij ~ Poisson(μij Nij)  (Poisson model, for counts)
where t(μij) = xij' b + αi + βj + Σk sik z̄jk
• Bias of user i: αi = g0' xi + εi^α, εi^α ~ N(0, σα²)
• Popularity of item j: βj = d0' xj + εj^β, εj^β ~ N(0, σβ²)
• Topic affinity of user i: si = H xi + εi^s, εi^s ~ N(0, σs² I)
• Pr(item j has topic k): z̄jk = Σn 1(zjn = k) / (#words in item j),
where zjn is the LDA topic of the nth word in item j
• Observed words: wjn ~ LDA(λ, η, zjn), the nth word in item j
Model Fitting
• Given:
– Features X = {xi, xj, xij}
– Observed ratings y = {yij} and words w = {wjn}
• Estimate:
– Parameters: Θ = [b, g0, d0, H, σ², aα, aβ, As, λ, η]
• Regression weights and prior parameters
– Latent factors: Δ = {αi, βj, si} and z = {zjn}
• User factors, item factors and per-word topic assignment
• Empirical Bayes approach:
– Maximum likelihood estimate of the parameters:
  Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
– The posterior distribution of the factors: Pr[Δ, z | y, Θ̂]
The EM Algorithm
• Iterate through the E and M steps until convergence
– Let Θ̂n be the current estimate
– E-step: Compute fn(Θ) = E_{(Δ, z | y, w, Θ̂n)}[log Pr(y, w, Δ, z | Θ)]
• The expectation is not in closed form
• We draw Gibbs samples and compute the Monte Carlo mean
– M-step: Find Θ̂_{n+1} = argmax_Θ fn(Θ)
• It consists of solving a number of regression and
optimization problems
Supervised Topic Assignment
Pr(zjn = k | Rest) ∝ (Zjk^{−jn} + λ) · (Z_{k,wjn}^{−jn} + η) / (Zk^{−jn} + Wη) · Π_{i rated j} f(yij | zjn = k)
• The first factor is the same as in unsupervised LDA
• The product term is the likelihood of observed ratings by users who
rated item j when zjn is set to topic k
• f(yij | zjn = k): probability of observing yij given the model
• zjn: the topic of the nth word in item j
fLDA: Experimental Results (Movie)
• Task: Predict the rating that a user would give a movie
• Training/test split:
– Sort observations by time
– First 75% → Training data
– Last 25% → Test data
• Item warm-start scenario
– Only 2% new items in test data

Model | Test RMSE
RLFM | 0.9363
fLDA | 0.9381
Factor-Only | 0.9422
FilterBot | 0.9517
unsup-LDA | 0.9520
MostPopular | 0.9726
Feature-Only | 1.0906
Constant | 1.1190

fLDA is as strong as the best method
It does not reduce the performance in warm-start scenarios
fLDA: Experimental Results (Yahoo! Buzz)
• Task: Predict whether a user would buzz-up an article
• Severe item cold-start: all items are new in test data
• Data statistics: 1.2M observations, 4K users, 10K articles
• fLDA significantly outperforms other models
Experimental Results: Buzzing Topics

Top Terms (after stemming) | Topic
bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al | CIA interrogation
mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain | Swine flu
NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week | NFL games
court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow | Gay marriage
palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska | Sarah Palin
idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama | American idol
economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month | Recession
north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia | North Korea issues

3/4 topics are interpretable; 1/2 are similar to unsupervised topics
fLDA Summary
• fLDA is a useful model for cold-start item recommendation
• It also provides interpretable recommendations for users
– User’s preference to interpretable LDA topics
• Future directions:
– Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm
– Apply fLDA to other multi-task prediction problems
• fLDA can be used as a tool to generate supervised
features (topics) from text data
Summary
• Regularizing factors through covariates effective
• Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a
single framework in a seamless way looks attractive
• Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step
can be done with any off-the-shelf scalable linear
regression routine
• Distributed computing on Hadoop: Multiple models
and average across partitions (more later)
Online Components:
Online Models, Intelligent
Initialization, Explore / Exploit
Why Online Components?
• Cold start
– New items or new users come to the system
– How to obtain data for new items/users (explore/exploit)
– Once data becomes available, how to quickly update the model
• Periodic rebuild (e.g., daily): Expensive
• Continuous online update (e.g., every minute): Cheap
• Concept drift
– Item popularity, user interest, mood, and user-to-item affinity may
change over time
– How to track the most recent behavior
• Down-weight old data
– How to model temporal patterns for better prediction
• … may not need to be online if the patterns are stationary
Big Picture

                                               | Most Popular Recommendation | Personalized Recommendation
Offline Models                                 | -                   | Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic)       | Time-series models  | Incremental CF, online regression
Intelligent Initialization (do not start cold) | Prior estimation    | Prior estimation, dimension reduction
Explore/Exploit (actively acquire data)        | Multi-armed bandits | Bandits with covariates

Extension: Segmented Most Popular Recommendation
Online Components for
Most Popular Recommendation
Online models, intelligent initialization &
explore/exploit
Most popular recommendation:
Outline
• Most popular recommendation (no
personalization, all users see the same thing)
– Time-series models (online models)
– Prior estimation (initialization)
– Multi-armed bandits (explore/exploit)
– Sometimes hard to beat!!
• Segmented most popular recommendation
– Create user segments/clusters based on user
features
– Do most popular recommendation for each segment
Most Popular Recommendation
• Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on
the picked items
• Easy!? Pick the items having the highest click-
through rates (CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item changes over time
– How much traffic should be allocated to explore new
items to achieve optimal performance
• Too little → Unreliable CTR estimates
• Too much → Little traffic to exploit the high-CTR items
CTR Curves for Two Days on Yahoo! Front Page
[Figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time. Traffic obtained from a controlled randomized experiment (no confounding).]
Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
For Simplicity, Assume …
• Pick only one item for each user visit
– Multi-slot optimization later
• No user segmentation, no personalization
(discussion later)
• The pool of candidate items is predetermined
and is relatively small (~1000)
– E.g., selected by human editors or by a first-phase
filtering method
– Ideally, there should be a feedback loop
– Large item pool problem later
• Effects like user-fatigue, diversity in
recommendations, multi-objective optimization
not considered (discussion later)
Online Models
• How to track the changing CTR of an item
• Data: for each item, at time t, we observe
– Number of times the item nt was displayed (i.e., #views)
– Number of clicks ct on the item
• Problem Definition: Given c1, n1, …, ct, nt, predict the CTR
(click-through rate) pt+1 at time t+1
• Potential solutions:
– Observed CTR at t: ct / nt → highly unstable (nt is usually small)
– Cumulative CTR: (Σ_{all i} ci) / (Σ_{all i} ni) → reacts to changes very slowly
– Moving-window CTR: (Σ_{i ∈ last K} ci) / (Σ_{i ∈ last K} ni) → reasonable
• But, no estimation of Var[p_{t+1}] (useful for explore/exploit)
Online Models: Dynamic Gamma-Poisson
• Model-based approach
– (ct | nt, pt) ~ Poisson(nt pt)
– pt = p_{t−1} εt, where εt ~ Gamma(mean = 1, var = η)
– Model parameters:
• p1 ~ Gamma(mean = μ0, var = σ0²) is the offline CTR estimate
• η specifies how dynamic/smooth the CTR is over time
– Posterior distribution (p_{t+1} | c1, n1, …, ct, nt) ~ Gamma(?, ?)
• Solve this recursively (online update rule)
[Diagram: state-space chain p1 → p2 → …; at each time t the item is shown nt times and receives ct clicks; pt = CTR at time t.]
Online Models: Derivation
• Let γt = μt / σt² (effective sample size). Estimated CTR distribution at time t (prior):
  (pt | c1, n1, …, c_{t−1}, n_{t−1}) ~ Gamma(mean = μt, var = σt²)
• Posterior after observing (ct, nt):
  (pt | c1, n1, …, ct, nt) ~ Gamma(mean = μ_{t|t}, var = σ²_{t|t})
  γ_{t|t} = γt + nt;  μ_{t|t} = (γt μt + ct) / (γt + nt);  σ²_{t|t} = μ_{t|t} / γ_{t|t}
• Estimated CTR distribution at time t+1 (after the multiplicative drift ε_{t+1}):
  (p_{t+1} | c1, n1, …, ct, nt) ~ Gamma(mean = μ_{t+1}, var = σ²_{t+1})
  μ_{t+1} = μ_{t|t};  σ²_{t+1} = σ²_{t|t} + η (σ²_{t|t} + μ²_{t|t})
• High CTR items are more adaptive: the variance inflation grows with μ_{t|t}
Tracking behavior of Gamma-Poisson model
• Low click-rate articles → more temporal smoothing
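The resulting online update rule is a few lines of R; a minimal sketch, where mu and var are the current Gamma mean and variance, and eta the drift parameter:

gp.update <- function(mu, var, clicks, views, eta) {
  gamma <- mu / var                                  # effective sample size
  mu.post  <- (gamma * mu + clicks) / (gamma + views)  # posterior mean at t
  var.post <- mu.post / (gamma + views)                # posterior variance at t
  list(mu  = mu.post,                                  # prior mean at t+1
       var = var.post + eta * (var.post + mu.post^2))  # variance inflated by drift
}
# e.g. state <- list(mu = 0.02, var = 1e-4)
# state <- gp.update(state$mu, state$var, clicks = 5, views = 200, eta = 0.01)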
Intelligent Initialization: Prior Estimation
• Prior CTR distribution: Gamma(mean = μ0, var = σ0²)
– N historical items:
• ni = #views of item i in its first time interval
• ci = #clicks on item i in its first time interval
– Model: ci ~ Poisson(ni pi) and pi ~ Gamma(μ0, σ0²)
⇒ ci ~ NegBinomial(μ0, σ0², ni)
– Maximum likelihood estimate (MLE) of (μ0, σ0²) from the marginal
negative-binomial likelihood over the N historical items
• Better prior: Cluster items and find the MLE for each cluster
– Agarwal & Chen, 2011 (SIGMOD)
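A sketch of this MLE in R: marginally ci ~ NegBinomial with size α = μ0²/σ0² and mean ni·μ0, so the fit is a two-parameter optimization (the clustering refinement is omitted):

prior.mle <- function(clicks, views) {
  nll <- function(par) {                        # par = (log mu0, log alpha)
    mu0 <- exp(par[1]); alpha <- exp(par[2])
    -sum(dnbinom(clicks, size = alpha, mu = views * mu0, log = TRUE))
  }
  fit <- optim(c(log(sum(clicks) / sum(views)), 0), nll)
  mu0 <- exp(fit$par[1])
  c(mu0 = mu0, var0 = mu0^2 / exp(fit$par[2]))  # (mu_0, sigma_0^2)
}
# e.g. prior.mle(clicks = rpois(100, 5), views = rep(200, 100))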
Explore/Exploit: Problem Definition
[Diagram: over time (…, t−2, t−1, t = now), items 1, 2, …, K receive x1%, x2%, …, xK% of page views; clicks accrue in the future.]
• Determine (x1, x2, …, xK) based on clicks and views observed before t in
order to maximize the expected total number of clicks in the future
Modeling the Uncertainty, NOT just the Mean
Simplified setting: Two items
[Figure: probability density of CTR for Item A (known precisely; shown 1 million times) and Item B (uncertain; shown only 100 times).]
• If we only make a single decision, give 100% of page views to Item A
• If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher
• Potential of Item B: ∫_{p > q} (p − q) f(p) dp,
where q is the CTR of Item A, p is the CTR of Item B, and f(p) is the probability density function of Item B's CTR
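For a concrete feel, with a Beta posterior for Item B this potential is one numerical integral in R (q and the Beta(3, 97) posterior are made-up numbers):

q <- 0.04                         # known CTR of Item A (assumed)
f <- function(p) dbeta(p, 3, 97)  # hypothetical posterior density of Item B's CTR
integrate(function(p) (p - q) * f(p), lower = q, upper = 1)$value  # Potential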
Multi-Armed Bandits: Introduction (1)
[Diagram: bandit "arms" with unknown payoff probabilities p1, p2, p3.]
• "Pulling" arm i yields a reward:
reward = 1 with probability pi (success)
reward = 0 otherwise (failure)
• For now, we are attacking the problem of choosing the best article/arm for all users
Multi-Armed Bandits: Introduction (2)
[Diagram: the same bandit arms with unknown payoff probabilities p1, p2, p3.]
• Goal: Pull arms sequentially to maximize the total reward
• Bandit scheme/policy: Sequential algorithm to play arms (items)
• Regret of a scheme = Expected loss relative to the "oracle" optimal scheme
that always plays the best arm
– "Best" means highest success probability
– But the best arm is not known … unless you have an oracle
– Regret is the price of exploration
– Low regret implies quick convergence to the best
Multi-Armed Bandits: Introduction
(3)
• Bayesian approach
– Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about
probability distributions
– Representative work: Gittins’ index, Whittle’s index
– Very computationally intensive
• Minimax approach
– Seeks to find a scheme that incurs bounded regret (with
no or mild assumptions about probability distributions)
– Representative work: UCB by Lai, Auer
– Usually, computationally easy
– But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
Skip details
Multi-Armed Bandits: Markov Decision Process (1)
• Select an arm now at time t=0, to maximize expected total number
of clicks in t=0,…,T
• State at time t: Θ_t = (θ_1t, …, θ_Kt)
– θ_it = state of arm i at time t (captures all we know about arm i at t)
• Reward function R_i(Θ_t, Θ_{t+1})
– Reward of pulling arm i that brings the state from Θ_t to Θ_{t+1}
• Transition probability Pr[ Θ_{t+1} | Θ_t, pulling arm i ]
• Policy π: A function that maps a state to an arm (action)
– π(Θ_t) returns an arm (to pull)
• Value of policy π starting from the current state Θ_0 with horizon T:

V_T(Θ_0) = E[ R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) ]
         = ∫ Pr[ Θ_1 | Θ_0, π(Θ_0) ] { R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) } dΘ_1

(immediate reward + value of the remaining T−1 time slots if we start from state Θ_1)
Multi-Armed Bandits: MDP (2)
• Optimal policy: π_opt = argmax_π V_T(Θ_0)
• Things to notice:
– Value is defined recursively (actually T high-dim integrals)
– Dynamic programming can be used to find the optimal policy
– But, just evaluating the value of a fixed policy can be very expensive
• Bandit Problem: The pull of one arm does not change the state of
other arms and the set of arms do not change over time
V_T(Θ_0) = E[ R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) ]
         = ∫ Pr[ Θ_1 | Θ_0, π(Θ_0) ] { R_{π(Θ_0)}(Θ_0, Θ_1) + V_{T−1}(Θ_1) } dΘ_1

(immediate reward + value of the remaining T−1 time slots if we start from state Θ_1)
Multi-Armed Bandits: MDP (3)
• Which arm should be pulled next?
– Not necessarily what looks best right now, since it might have had a few
lucky successes
– Looks like it will be a function of successes and failures of all arms
• Consider a slightly different problem setting
– Infinite time horizon, but
– Future rewards are geometrically discounted
R_total = R(0) + γ·R(1) + γ²·R(2) + …  (0 < γ < 1)
• Theorem [Gittins 1979]: The optimal policy decouples and solves a
bandit problem for each arm independently
Policy π(Θ_t) is a function of (θ_1t, …, θ_Kt): one K-dimensional problem
Policy π(Θ_t) = argmax_i { g(θ_it) }: K one-dimensional problems
(g is Gittins' index; still computationally expensive!!)
Multi-Armed Bandits: MDP (4)
Bandit Policy
1. Compute the priority
(Gittins’ index) of each
arm based on its state
2. Pull arm with max
priority, and observe
reward
3. Update the state of the
pulled arm
Priority
1
Priority
2
Priority
3
Multi-Armed Bandits: MDP (5)
• Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm
independently
– Many proofs and different interpretations of Gittins’ index
exist
• The index of an arm is the fixed charge per pull for a game with two options, whether to
pull the arm or not, so that the charge makes the optimal play of the game have zero
net reward
– Significantly reduces the dimension of the problem space
– But, Gittins’ index g(θ_it) is still hard to compute
• For the Gamma-Poisson or Beta-Binomial models,
θ_it = (#successes, #pulls) for arm i up to time t
• g maps each possible (#successes, #pulls) pair to a number
– Approximate methods are used in practice
– Lai et al. have derived these for exponential family
distributions
Multi-Armed Bandits: Minimax Approach (1)
• Compute the priority of each arm i in a way that the
regret is bounded
– Lowest regret in the worst case
• One common policy is UCB1 [Auer 2002]
Number of successes of
arm i
Number of pulls
of arm i
Total number of pulls
of all arms
Observed
success rate
Factor representing
uncertainty
Priority_i = c_i / n_i + sqrt( 2 ln n / n_i )

where c_i = number of successes of arm i, n_i = number of pulls of arm i,
and n = total number of pulls of all arms; the first term is the observed
success rate, the second a factor representing uncertainty.
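A compact sketch of UCB1 on simulated Bernoulli arms, following the priority formula above (the simulation itself is illustrative):

import math
import random

def ucb1(success_probs, horizon):
    """Run UCB1 on Bernoulli arms with (unknown) success_probs."""
    k = len(success_probs)
    pulls, wins = [0] * k, [0] * k
    total_reward = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                  # pull each arm once to initialize
        else:
            arm = max(range(k),
                      key=lambda i: wins[i] / pulls[i]
                                    + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = 1 if random.random() < success_probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total_reward += reward
    return total_reward, pulls

random.seed(0)
reward, pulls = ucb1([0.04, 0.05, 0.06], horizon=10000)
print(reward, pulls)   # most pulls should go to the 0.06 arm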
Multi-Armed Bandits: Minimax Approach (2)
• As total observations n becomes large:
– Observed payoff tends asymptotically towards the
true payoff probability
– The system never completely “converges” to one
best arm; only the rate of exploration tends to
zero
Priority_i = c_i / n_i + sqrt( 2 ln n / n_i )
(observed payoff + factor representing uncertainty)
Multi-Armed Bandits: Minimax Approach (3)
• Sub-optimal arms are pulled O(log n) times
• Hence, UCB1 has O(log n) regret
• This is the lowest possible regret (but the constants matter )
• E.g. Regret after n plays is bounded by
Observed
payoff
Factor
representing
uncertainty
ii
i
i
n
n
n
c log2
Priority
ibesti
K
j
j
i ibesti
n
where,
3
1
ln
8
1
2
:
Classical Multi-Armed Bandits: Summary
• Classical multi-armed bandits
– A fixed set of arms with fixed rewards
– Observe the reward before the next pull
• Bayesian approach (Markov decision process)
– Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
• Pull the arm currently having the highest index value
– Whittle’s index [Whittle 1988]: Extension to a changing reward function
– Computationally intensive
• Minimax approach (providing guaranteed regret bounds)
– UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval
• Index of arm i = c_i/n_i + sqrt(2 ln n / n_i)
• Heuristics
– ε-Greedy: Random exploration using an ε fraction of traffic
– Softmax: Pick arm i with probability exp(μ̂_i/τ) / Σ_j exp(μ̂_j/τ),
where μ̂_i = predicted CTR of item i and τ = temperature
– Posterior draw: Index = a draw from the posterior CTR distribution of the arm
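Minimal sketches of the three heuristics as arm-selection rules (the Beta posteriors used for the posterior draw are an assumed modeling choice for Bernoulli rewards):

import math
import random

def eps_greedy_choice(means, eps=0.1):
    """With prob. eps pick a random arm, else the empirically best one."""
    if random.random() < eps:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def softmax_choice(means, tau=0.01):
    """Pick arm i with probability proportional to exp(mean_i / tau)."""
    weights = [math.exp(m / tau) for m in means]
    return random.choices(range(len(means)), weights=weights)[0]

def posterior_draw_choice(wins, losses):
    """Thompson sampling: score each arm by a draw from its Beta posterior."""
    draws = [random.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return max(range(len(draws)), key=lambda i: draws[i])

random.seed(1)
means = [0.04, 0.05, 0.06]
print(eps_greedy_choice(means), softmax_choice(means),
      posterior_draw_choice([4, 5, 6], [96, 95, 94]))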
Do Classical Bandits Apply to Web Recommenders?
Traffic obtained from a controlled randomized experiment (no confounding)
Things to note:
(a) Short lifetimes, (b) temporal effects, (c) often breaking news stories
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
Characteristics of Real Recommender Systems
• Dynamic set of items (arms)
– Items come and go with short lifetimes (e.g., a day)
– Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short
• Non-stationary CTR
– CTR of an item can change dramatically over time
• Different user populations at different times
• Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.)
• Attention to breaking news stories decays over time
• Batch serving for scalability
– Making a decision and updating the model for each user visit in real time
is expensive
– Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i
[Agarwal et al., ICDM, 2009]
Explore/Exploit in Recommender
Systems
time
Item 1
Item 2
…
Item K
x1% page views
x2% page views
…
xK% page views
Determine (x1, x2, …, xK) based on clicks and views observed before t in
order to maximize the expected total number of clicks in the future
(timeline: …, t−2, t−1, t = now; maximize clicks in the future)
Let’s solve this from first principles
Bayesian Solution: Two Items, Two
Time Slots (1)
• Two time slots: t = 0 and t = 1
– Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1
– Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
• To determine x, we need to estimate what would happen in the future
Question:
What fraction x of N0 views to item P
(1-x) to item Q
t=0 t=1
Now
time
N0 views N1 views
End
Obtain c clicks after serving x
(not yet observed; random variable)
Assume we observe c; we can update p1
[Figure: CTR densities at t=0 (Item P centered at p_0, Item Q at q_0) and at
t=1 after observing c (Item P centered at p̂_1(x,c), Item Q at q_1)]
If x and c are given, the optimal solution: give all views to Item P iff
p̂_1(x,c) = E[ p_1 | x, c ] > q_1
• Expected total number of clicks in the two time slots:
N_0 x p̂_0 + N_0 (1−x) q_0 + N_1 E_c[ max{ p̂_1(x,c), q_1 } ]
Gain(x, q0, q1) = Expected number of additional clicks if we explore
the uncertain item P with fraction x of views in slot 0, compared to
a scheme that only shows the certain item Q in both slots
Solution: argmaxx Gain(x, q0, q1)
Bayesian Solution: Two Items, Two
Time Slots (2)
N_0 x p̂_0 + N_0 (1−x) q_0 + N_1 E_c[ max{ p̂_1(x,c), q_1 } ]
  = (N_0 q_0 + N_1 q_1) + x N_0 (p̂_0 − q_0) + N_1 E_c[ max{ p̂_1(x,c) − q_1, 0 } ]

The first two terms on the left are E[#clicks] at t=0; at t=1 we show the item
with the higher E[CTR], i.e., max{ p̂_1(x,c), q_1 }. On the right, (N_0 q_0 + N_1 q_1)
is E[#clicks] if we always show Item Q; the remainder is Gain(x, q_0, q_1),
the gain of exploring the uncertain Item P using x.
• Approximate p̂_1(x,c) by a normal distribution
– Reasonable approximation because of the central limit theorem
• Proposition: Using the approximation, the Bayes optimal solution x can be
found in time O(log N_0)

With prior p_1 ~ Beta(a, b):
p̂_1 = E_c[ p̂_1(x,c) ] = a / (a + b)
σ²(x) = Var_c[ p̂_1(x,c) ] = [ N_0 x / (a + b + N_0 x) ] · ab / [ (a + b)² (a + b + 1) ]

Gain(x, q_0, q_1) ≈ x N_0 (p̂_0 − q_0)
    + N_1 [ (p̂_1 − q_1) Φ( (p̂_1 − q_1) / σ(x) ) + σ(x) φ( (p̂_1 − q_1) / σ(x) ) ],

where Φ and φ are the standard normal CDF and density.
Bayesian Solution: Two Items, Two
Time Slots (3)
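A sketch of this computation in Python (using the reconstructed normal-approximation formulas above; the grid search stands in for the O(log N_0) search, and the numbers are illustrative):

import numpy as np
from scipy.stats import norm

def gain(x, q0, q1, a, b, n0, n1):
    """Expected extra clicks from giving fraction x of slot-0 views to
    the uncertain Item P (Beta(a, b) prior), vs. always showing Item Q."""
    p_hat = a / (a + b)              # prior mean: p_hat_0 = p_hat_1 = a/(a+b)
    var_x = (n0 * x / (a + b + n0 * x)) * a * b / ((a + b) ** 2 * (a + b + 1))
    sd = np.sqrt(var_x)
    z = (p_hat - q1) / sd if sd > 0 else -np.inf
    explore_value = (p_hat - q1) * norm.cdf(z) + sd * norm.pdf(z)
    return n0 * x * (p_hat - q0) + n1 * explore_value

q0, q1 = 0.055, 0.05                 # Item Q's known CTRs at t=0 and t=1
xs = np.linspace(1e-4, 1.0, 1000)    # grid search over the serving fraction x
gains = [gain(x, q0, q1, a=5, b=95, n0=10000, n1=10000) for x in xs]
print("optimal x =", round(float(xs[int(np.argmax(gains))]), 3))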
Bayesian Solution: Two Items, Two Time Slots (4)
• Quiz: Is it correct that the more we are uncertain about the CTR of
an item, the more we should explore the item?
[Figure: fraction of views to give to the item as its uncertainty ranges from
high to low; different curves are for different prior mean settings]
Bayesian Solution: General Case (1)
• From two items to K items
– Very difficult problem:
max_x  N_0 Σ_i x_i p̂_{i0} + N_1 E_c[ max_i { p̂_{i1}(x_i, c_i) } ],
subject to Σ_i x_i = 1
Note: c = [c_1, …, c_K], where c_i is a random variable representing the
#clicks on item i we may get
– Apply Whittle’s Lagrange relaxation (1988) to this problem setting
• Rewrite E_c[ max_i p̂_{i1}(x_i, c_i) ] as
max_z E_c[ Σ_i z_i(c) p̂_{i1}(x_i, c_i) ], with Σ_i z_i(c) = 1 for all possible c
• Relax Σ_i z_i(c) = 1 for all c, to E_c[ Σ_i z_i(c) ] = 1
• Apply Lagrange multipliers (q_0 and q_1) to enforce the constraints:
min_{q_0, q_1} ( N_0 q_0 + N_1 q_1 + Σ_i max_{x_i} Gain_i(x_i, q_0, q_1) )
– We essentially reduce the K-item case to K independent two-item
sub-problems (which we have solved)
Bayesian Solution: General Case
(2)
• From two intervals to multiple time slots
– Approximate multiple time slots by two stages
• Non-stationary CTR
– Use the Dynamic Gamma-Poisson model to
estimate the CTR distribution for each item
Simulation Experiment: Different Traffic Volume
Simulation with ground truth estimated based on Yahoo! Front Page data
Setting: 16 live items per interval
Scenarios: Web sites with different traffic volume (x-axis)
Simulation Experiment: Different Sizes of the Item Pool
Simulation with ground truth estimated based on Yahoo! Front Page data
Setting: 1000 views per interval; average item lifetime = 20 intervals
Scenarios: Different sizes of the item pool (x-axis)
Characteristics of Different Explore/Exploit Schemes (1)
• Why the Bayesian solution has better performance
• Characterize each scheme by three dimensions:
– Exploitation regret: The regret of a scheme when it is showing the item
which it thinks is the best (may not actually be the best)
• 0 means the scheme always picks the actual best
• It quantifies the scheme’s ability of finding good
items
– Exploration regret: The regret of a scheme when it is exploring the items
which it feels uncertain about
• It quantifies the price of exploration (lower better)
– Fraction of exploitation (higher better)
• Fraction of exploration = 1 – fraction of exploitation
[Diagram: all traffic to a web site = exploitation traffic + exploration traffic]
Characteristics of Different Explore/Exploit Schemes (2)
Exploitation regret: Ability of finding good items (lower better)
Exploration regret: Price of exploration (lower better)
Fraction of Exploitation (higher better)
[Figures: exploitation regret vs. exploration regret, and exploitation regret
vs. exploitation fraction; good schemes sit at low regret and high exploitation fraction]
Discussion: Large Content Pool
• The Bayesian solution looks promising
– ~10% from true optimal for a content pool of 1000 live
items
• 1000 views per interval; item lifetime ~20 intervals
• Intelligent initialization (offline modeling)
– Use item features to reduce the prior variance of an item
• E.g., Var[ item CTR | Sport ] < Var[ item CTR ]
– Require a CTR model that outputs both mean and
variance
• Linear regression model
• Segmented model: Estimate the CTR distribution of a random
article in an item category
– Existing taxonomies, decision tree, LDA topics
• Feature-based explore/exploit
– Estimate model parameters, instead of per-item CTR
– More later
Discussion: Multiple Positions,
Ranking
• Feature-based approach
– reward(page) = model( (item 1 at position 1, … item k at position k))
– Apply feature-based explore/exploit
• Online optimization for ranked list
– Ranked bandits [Radlinski et al., 2008]: Run an
independent bandit algorithm for each position
– Dueling bandit [Yue & Joachims, 2009]: Actions are
pairwise comparisons
• Online optimization of submodular functions
– For all S_1, S_2 and items a: ∂_a f(S_1 ∪ S_2) ≤ ∂_a f(S_1)
• where ∂_a f(S) = f(S ∪ {a}) − f(S)
– Streeter & Golovin (2008)
Discussion: Segmented Most
Popular
• Partition users into segments, and then for each
segment, provide most popular recommendation
• How to segment users
– Hand-created segments: AgeGroup × Gender
– Clustering or decision tree based on user features
• Users in the same cluster like similar items
• Segments can be organized by taxonomies/hierarchies
– Better CTR models can be built by hierarchical smoothing
• Shrink the CTR of a segment toward its parent
• Introduce bias to reduce uncertainty/variance
– Bandits for taxonomies (Pandey et al., 2008)
• First explore/exploit categories/segments
• Then, switch to individual items
Most Popular Recommendation:
Summary
• Online model:
– Estimate the mean and variance of the CTR of each item over
time
– Dynamic Gamma-Poisson model
• Intelligent initialization:
– Estimate the prior mean and variance of the CTR of each item
cluster using historical data
• Cluster items → maximum likelihood estimates of the priors
• Explore/exploit:
– Bayesian: Solve a Markov decision process problem
• Gittins’ index, Whittle’s index, approximations
• Better performance, computation intensive
• Thompson sampling: Sample from the posterior (simple)
– Minimax: Bound the regret
• UCB1: Easy to compute
• Explore more than necessary in practice
– ε-Greedy: Empirically competitive when ε is tuned
Online Components for
Personalized Recommendation
Online models, intelligent initialization &
explore/exploit
Intelligent Initialization for Linear
Model (1)
• Linear/factorization model
y_ij ~ N(u_i′θ_j, σ²)   (rating that user i gives item j)
θ_j ~ N(μ_j, Σ)
where u_i = feature/factor vector of user i and θ_j = factor vector of item j
– How to estimate the prior parameters μ_j and Σ
• Important for cold start: Predictions are made using the prior
• Leverage available features
– How to learn the weights/factors quickly
• High dimensional θ_j → slow convergence
• Reduce the dimensionality
Feature-based model initialization
• Dimensionality reduction for fast model convergence
• FOBFM: Fast Online Bilinear Factor Model
– Per-item online model: y_ij ~ u_i′θ_j, with prior θ_j ~ N(Ax_j, Σ)
– Equivalently: y_ij ~ u_i′Ax_j + u_i′v_j, with v_j ~ N(0, Σ)
• u_i′Ax_j is the part predicted by features
– Dimensionality reduction: v_j = B δ_j, δ_j ~ N(0, σ² I)
• B is an n×k linear projection matrix (k << n): projects high-dim v_j to low-dim δ_j
• Low-rank approximation of Var[v_j]: θ_j ~ N(Ax_j, σ² BB′)
– Offline training: Determine A, B, σ² through the EM algorithm (once per day or hour)
Data: y_ij = rating that user i gives item j; u_i = offline factor vector of
user i; x_j = feature vector of item j
Feature-based model initialization (continued)
• Dimensionality reduction for fast model convergence
• Fast, parallel online learning:
y_ij ~ u_i′Ax_j + (u_i′B) δ_j
where u_i′Ax_j is an offset, u_i′B is a new (low-dimensional) feature vector,
and δ_j is updated in an online manner
• Online selection of dimensionality (k = dim(δ_j))
– Maintain an ensemble of models, one for each candidate dimensionality
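A minimal sketch of the per-item online step, assuming the offline EM fit has produced A, B, and the offsets (the conjugate Gaussian update here is one standard way to realize "updated in an online manner"; it is not necessarily the exact FOBFM update):

import numpy as np

def online_update(delta_mean, delta_prec, u, B, Ax, y, noise_var=1.0):
    """One Bayesian linear-regression update of an item's low-dim factor.

    Observation model: y ~ u.(A x) + (u.B) delta, delta ~ N(mean, prec^-1).
    Returns the updated posterior mean and precision of delta.
    """
    z = u @ B                      # low-dimensional feature vector (k,)
    offset = u @ Ax                # feature-predicted part of the rating
    resid = y - offset - z @ delta_mean
    delta_prec = delta_prec + np.outer(z, z) / noise_var
    # Kalman-style mean correction using the updated precision
    delta_mean = delta_mean + np.linalg.solve(delta_prec, z) * resid / noise_var
    return delta_mean, delta_prec

# Toy usage with random offline quantities (illustrative only)
rng = np.random.default_rng(0)
n, k = 50, 5
B, Ax = rng.normal(size=(n, k)), rng.normal(size=n)
mean, prec = np.zeros(k), np.eye(k)          # prior: delta ~ N(0, I)
for _ in range(100):
    u, y = rng.normal(size=n), rng.normal()
    mean, prec = online_update(mean, prec, u, B, Ax, y)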
Experimental Results: My Yahoo!
Dataset (1)
• My Yahoo! is a personalized news reading
site
– Users manually select news/RSS feeds
• ~12M “ratings” from ~3M users on ~13K
articles
– Click = positive
– View without click = negative
Experimental Results: My Yahoo! Dataset (2)
Item-based data split: Every item is new in the test data
– First 8K articles are in the training data (offline training)
– Remaining articles are in the test data (online prediction & learning)
Supervised dimensionality reduction (reduced rank regression) significantly
outperforms other methods
Methods:
No-init: Standard online
regression with ~1000
parameters for each item
Offline: Feature-based model
without online update
PCR, PCR+: Two principal
component methods to
estimate B
FOBFM: Our fast online method
Experimental Results: My Yahoo! Dataset (3)
• A small number of factors (low dimensionality) is better when the amount of data
for online learning is small
• Large number of factors is better when the data for learning becomes large
• The online selection method usually selects the best dimensionality
# factors =
Number of
parameters per
item updated
online
Intelligent Initialization: Summary
• For online learning, whenever historical data
is available, do not start cold
• For linear/factorization models
– Use available features to setup the starting point
– Reduce dimensionality to facilitate fast learning
• Next
– Explore/exploit for personalization
– Users are represented by covariates
• Features, factors, clusters, etc
– Covariate bandits
Explore/Exploit for Personalized Recommendation
• One extreme problem formulation
– One bandit problem per user with one arm per item
– Bandit problems are correlated: “Similar” users like similar
items
– Arms are correlated: “Similar” items have similar CTRs
• Model this correlation through covariates/features
– Input: User feature/factor vector, item feature/factor vector
– Output: Mean and variance of the CTR of this (user, item)
pair based on the data collected so far
• Covariate bandits
– Also known as contextual bandits, bandits with side
observations
– Provide a solution to
• Large content pool (correlated arms)
• Personalized recommendation (hint before pulling an arm)
Methods for Covariate Bandits
• Priority-based methods
– Rank items according to the user-specific “score” of each item; then,
update the model based on the user’s response
– UCB (upper confidence bound)
• Score of an item = E[posterior CTR] + k StDev[posterior CTR]
– Posterior draw (Thompson sampling)
• Score of an item = a number drawn from the posterior CTR distribution
– Softmax
• Score of an item = a number drawn according to exp(ŝ_i/τ) / Σ_j exp(ŝ_j/τ),
where ŝ_i is the predicted CTR of item i and τ is a temperature
• ε-Greedy
– Allocate an ε fraction of traffic for random exploration (ε may be adaptive)
– Robust when the exploration pool is small
• Bayesian scheme
– Close to optimal if it can be solved efficiently
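A sketch of the two priority-based scores for a Gaussian linear CTR model, in the spirit of UCB and posterior-draw scoring over covariates (the model and names here are illustrative assumptions, not the exact algorithms from any one cited paper):

import numpy as np

rng = np.random.default_rng(0)

def ucb_score(theta_mean, theta_cov, features, k=2.0):
    """E[posterior CTR score] + k * StDev, for a Gaussian weight posterior."""
    mean = features @ theta_mean
    sd = np.sqrt(features @ theta_cov @ features)
    return mean + k * sd

def thompson_score(theta_mean, theta_cov, features):
    """Score from a single posterior draw of the weight vector."""
    theta = rng.multivariate_normal(theta_mean, theta_cov)
    return features @ theta

d = 4
theta_mean, theta_cov = np.zeros(d), np.eye(d)
x_user_item = rng.normal(size=d)            # joint (user, item) feature vector
print(ucb_score(theta_mean, theta_cov, x_user_item),
      thompson_score(theta_mean, theta_cov, x_user_item))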
Covariate Bandits: Some
References
• Just a small sample of papers
– Hierarchical explore/exploit (Pandey et al., 2008)
• Explore/exploit categories/segments first; then, switch to individuals
– Variants of ε-greedy
• Epoch-greedy (Langford & Zhang, 2007): ε is determined based on
the generalization bound of the current model
• Banditron (Kakade et al., 2008): Linear model with binary response
• Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time;
example models: histogram, nearest neighbor
– Variants of UCB methods
• Linearly parameterized bandits (Rusmevichientong et al., 2008):
minimax, based on uncertainty ellipsoid
• LinUCB (Li et al., 2010): Gaussian linear regression model
• Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009):
– Similar arms have similar rewards: | reward(i) − reward(j) | ≤ distance(i, j)
Online Components: Summary
• Real systems are dynamic
• Cold-start problem
– Incremental online update (online linear regression)
– Intelligent initialization (use features to predict initial factor
values)
– Explore/exploit (UCB, posterior draw, softmax, ε-greedy)
• Concept-drift problem
– Tracking the most recent behavior (state-space models,
Kalman filter)
– Modeling temporal patterns (tensor factorization, spline)
Evaluation Methods and
Challenges
Evaluation Methods
• Ideal method
– Experimental Design: Run side-by-side experiments on a small
fraction of randomly selected traffic with new method (treatment)
and status quo (control)
– Limitation
• Often expensive and difficult to test large number of methods
• Problem: How do we evaluate methods offline on
logged data?
– Goal: To maximize clicks/revenue on the entire system, not prediction
accuracy. The cost of predictive inaccuracy varies across instances.
• E.g. 100% error on a low CTR article may not matter much
because it always co-occurs with a high CTR article that is
predicted accurately
Usual Metrics
• Predictive accuracy
– Root Mean Squared Error (RMSE)
– Mean Absolute Error (MAE)
– Area under the Curve, ROC
• Other rank based measures based on retrieval accuracy for top-k
– Recall in test data
• What fraction of the items that a user actually liked in the test data were
among the top-k recommended by the algorithm (fraction of hits, e.g.,
Karypis, CIKM 2001)
• One flaw in several papers
– Training and test splits are not based on time.
• Information leakage
• Even in Netflix, this is the case to some extent
– Time split per user, not per event. For instance, information may leak if
models are based on user-user similarity.
Metrics continued..
• Recall per event based on Replay-Match
method
– Fraction of clicked events where the top
recommended item matches the clicked one.
• This is good if the logged data was collected from a
randomized serving scheme; with biased data this could be a problem
– We would be inventing algorithms that provide
recommendations similar to the current scheme's
• No reward for novel recommendations
Details on Replay-Match method (Li, Langford, et al)
• x: feature vector for a visit
• r = [r1,r2,…,rK]: reward vector for the K items in inventory
• h(x): recommendation algorithm to be evaluated
• Goal: Estimate expected reward for h(x)
• s(x): recommendation scheme that generated logged-data
• x1,..,xT: visits in the logged data
• rti: reward for visit t, where i = s(xt)
Replay-Match continued
• Estimator: average the rewards over the visits where the evaluated
algorithm's recommendation matches the logged item, weighting each
match by an importance weight
• If the importance weights are proportional to the inverse probabilities
with which s(x) served each item, it can be shown the estimator is unbiased
• E.g. if s(x) is random serving scheme, importance weights are
uniform over the item set
• If s(x) is not random, importance weights have to be
estimated through a model
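A simplified sketch of the replay idea for a uniformly random logging scheme, where the importance weights are uniform (field names and data are illustrative):

def replay_estimate(log, h):
    """Estimate per-visit reward of policy h from uniformly-random logs.

    log: list of (x, served_item, reward) tuples, where served_item was
         chosen uniformly at random over the item set.
    h:   recommendation algorithm mapping a visit's features x to an item.
    """
    matched, total_reward = 0, 0.0
    for x, served, reward in log:
        if h(x) == served:        # keep only visits where h agrees with the log
            matched += 1
            total_reward += reward
    return total_reward / matched if matched else float("nan")

# Toy usage: logs from a 3-item random scheme, policy always picks item 2
log = [({"hour": 9}, 2, 1), ({"hour": 10}, 0, 0), ({"hour": 11}, 2, 0)]
print(replay_estimate(log, h=lambda x: 2))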
Back to Multi-Objective
Optimization
[Diagram: the recommender selects from editorial content on the front page;
clicks on FP links influence the downstream supply distribution, which feeds
the ad server (premium/guaranteed display and the cheaper spot market) and
downstream engagement (time spent)]
Serving Content on Front Page:
Click Shaping
• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
– Article 1: CTR=5%, utility per click = 5
– Article 2: CTR=4.9%, utility per click=10
• By promoting Article 2, we lose 1 click per 1,000 visits but gain 240 utils per 1,000 visits
• If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
– E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
Why call it Click Shaping?
[Figure: supply distribution across Yahoo! properties (autos, finance, health,
hotjobs, movies, news, omg, real estate, sports, tech, travel, tv, video,
videogames, …) BEFORE vs. AFTER shaping, with changes ranging roughly from
−10% to +10%]
SHAPING can happen with respect to any downstream metric (like engagement)
Multi-Objective Optimization
[Diagram: n articles (A_1, …, A_n), each tagged with one of K properties
(news, finance, omg, …), served to m user segments (S_1, …, S_m)]
CTR of user segment i on article j: p_ij
Time duration of i on j: d_ij
Multi-Objective Program Scalarization
Goal Programming
Simplex constraints on xiJ is always applied
Constraints are linear
Every 10 mins, solve x
Use this x as the serving scheme in the next 10 mins
Pareto-optimal solution (more in KDD 2011)
224
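A sketch of one scalarized linear program under the definitions above (the weight lam on time spent and the simulated p_ij, d_ij are illustrative assumptions):

import numpy as np
from scipy.optimize import linprog

# Illustrative sizes: m user segments, n articles
m, n = 3, 4
rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.06, size=(m, n))     # CTR p_ij
d = rng.uniform(10, 60, size=(m, n))         # time duration d_ij
lam = 0.001                                  # weight on engagement (assumed)

# Variables x_ij (flattened): fraction of segment i's views given to article j.
# Maximize sum_ij x_ij (p_ij + lam * d_ij)  ==  minimize the negative.
c = -(p + lam * d).ravel()

# Simplex constraint per segment: sum_j x_ij = 1, with 0 <= x_ij <= 1
A_eq = np.zeros((m, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0
res = linprog(c, A_eq=A_eq, b_eq=np.ones(m), bounds=[(0, 1)] * (m * n))
print(res.x.reshape(m, n).round(2))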
Summary
• Modern recommendation systems on the web crucially depend on
extracting intelligence from massive amounts of data collected on a
routine basis
• Lots of data and processing power are not enough; the number of things we
need to learn grows with data size
• Extracting grouping structures at coarser resolutions based on similarity
(correlations) is important
– ML has a big role to play here
• Continuous and adaptive experimentation in a judicious manner crucial
to maximize performance
– Again, ML has a big role to play
• Multi-objective optimization is often required, the objectives are
application dependent.
– ML has to work in close collaboration with
engineering, product & business execs
Challenges
Recall: Some examples
• Simple version
– I have an important module on my page; content inventory
is obtained from a third-party source and further
refined through editorial oversight. Can I algorithmically
recommend content on this module? I want to drive up
total CTR on this module
• More advanced
– I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. dwell time). Can I increase
downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my website. How
do I take a holistic approach and perform a simultaneous
optimization?
For the simple version
• Multi-position optimization
– Explore/exploit, optimal subset selection
• Explore/Exploit strategies for large content pool and
high dimensional problems
– Some work on hierarchical bandits but more needs to be
done
• Constructing user profiles from multiple sources with
less than full coverage
– Couple of papers at KDD 2011
• Content understanding
• Metrics to measure user engagement (other than
CTR)
Other problems
• Whole page optimization
– Incorporating correlations
• Incentivizing user-generated content
• Incorporating Social information for better
recommendation (News Feed Recommendation)
• Multi-context Learning
Case Studies
Recommendations and Advertising on LinkedIn HP
EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
LinkedIn Advertising: Flow
[Diagram] An ad request arrives with the member profile (e.g., region = US,
age = 20) and context (e.g., profile page, 300 x 250 ad slot). Campaigns are
filtered for eligibility (targeting criteria, frequency cap, budget pacing),
scored by the response prediction engine, passed through automatic format
selection, and sorted by Bid * CTR for the auction.
Click Cost = Bid_3 × CTR_3 / CTR_2 (generalized second price)
Serving constraint: < 100 millisec
CTR Prediction Model for Ads
• Feature vectors
– Member feature vector: x_i (identity, behavioral, network)
– Campaign feature vector: c_j (text, adv-id, …)
– Context feature vector: z_k (page type, device, …)
• Model: logistic regression with two parts
– Cold-start component: a function of (x_i, c_j, z_k), shared across campaigns
– Warm-start per-campaign component: campaign-specific coefficients
– Both can have L2 penalties.
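The slides do not show the exact LASER parameterization, so the decomposition below is an assumed sketch of a cold-start plus warm-start logistic score:

import numpy as np

def predict_ctr(features, theta_cold, theta_warm):
    """Two-part logistic CTR model: shared cold-start weights plus
    per-campaign warm-start weights on the same feature vector."""
    logit = features @ theta_cold + features @ theta_warm
    return 1.0 / (1.0 + np.exp(-logit))

d = 8
rng = np.random.default_rng(0)
features = rng.normal(size=d)           # joint (member, campaign, context) features
theta_cold = rng.normal(size=d) * 0.1   # stable; retrained weekly/bi-weekly
theta_warm = np.zeros(d)                # per-campaign; updated frequently
print(predict_ctr(features, theta_cold, theta_warm))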
Model Fitting
• Single machine (well understood)
– Conjugate gradient
– L-BFGS
– Trust region
– …
• Model training with large-scale data
– Cold-start component Θw is more stable
• Weekly/bi-weekly training is good enough
• However: difficulty comes from the need for large-scale logistic regression
– Warm-start per-campaign model Θc is more dynamic
• New items can get generated at any time
• Big loss if opportunities are missed
• Need to update the warm-start component as frequently as possible
Large Scale Logistic
Regression
Per-item logistic regression
given Θc
Explore/Exploit with Logistic
Regression
[Figure: logistic-regression decision boundaries separating + and − examples:
a COLD START line, the COLD + WARM START line for an Ad-id, and the
POSTERIOR of the WARM-START COEFFICIENTS]
E/E: Sample a line from the posterior (Thompson Sampling)
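A sketch of the E/E step, assuming a Gaussian (e.g., Laplace-approximated) posterior for the warm-start coefficients; the numbers are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def thompson_logistic_score(features, theta_cold, warm_mean, warm_cov):
    """Thompson sampling for the warm-start part: draw the per-campaign
    coefficients from their (assumed Gaussian) posterior, then score."""
    theta_warm = rng.multivariate_normal(warm_mean, warm_cov)
    logit = features @ theta_cold + features @ theta_warm
    return 1.0 / (1.0 + np.exp(-logit))

d = 8
features = rng.normal(size=d)
theta_cold = rng.normal(size=d) * 0.1
warm_mean, warm_cov = np.zeros(d), 0.05 * np.eye(d)
# Each ad request gets a fresh draw, so uncertain campaigns get explored
print([round(thompson_logistic_score(features, theta_cold, warm_mean, warm_cov), 4)
       for _ in range(3)])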
Models Considered
• CONTROL: per-campaign CTR counting model
• COLD-ONLY: only cold-start component
• LASER: our model (cold-start + warm-start)
• LASER-EE: our model with Explore-Exploit
using Thompson sampling
Metrics
• Model metrics (offline)
– Test Log-likelihood
– AUC/ROC
– Observed/Expected ratio
• Online metrics (Online A/B Test)
– CTR
– CPM (Revenue per impression)
– Unique ads per user (diversity)
Observed / Expected Ratio
• Offline replay is difficult with large item pools (randomization is costly)
• Observed: #Clicks in the data , Expected: Sum of predicted CTR for
all impressions
• Not a “standard” classifier metric, but useful for this application
• What we usually see: Observed / Expected < 1
– Quantifies the “winner’s curse” aka selection bias in auctions
• When choosing from among thousands of candidates, an item with mistakenly
over-estimated CTR may end up winning the auction
• Particularly helpful in spotting inefficiencies by segment
– E.g. by bid, number of impressions in training (warmness), geo, etc.
– Allows us to see where the model might be giving too much weight to
the wrong campaigns
• High correlation between O/E ratio and model performance online
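A small sketch of computing the O/E ratio by segment (the field names and simulated data are illustrative):

import numpy as np

def oe_ratio(clicks, predicted_ctr, segments):
    """Observed/Expected clicks per segment: #clicks / sum of predicted CTRs."""
    out = {}
    for seg in np.unique(segments):
        mask = segments == seg
        out[seg] = clicks[mask].sum() / predicted_ctr[mask].sum()
    return out

rng = np.random.default_rng(0)
pred = rng.uniform(0.01, 0.05, size=1000)   # model's predicted CTR per impression
clicks = rng.binomial(1, pred * 0.9)        # simulate slight over-prediction
segs = rng.integers(1, 4, size=1000)        # e.g., campaign warmness segment
print({k: round(v, 2) for k, v in oe_ratio(clicks, pred, segs).items()})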
Online A/B Test
• Three models
– CONTROL (10%)
– LASER (85%)
– LASER-EE (5%)
• Segmented Analysis
– 8 segments by campaign warmness
• Degree of warmness: the number of training samples available in
the training data for the campaign
• Segment #1: Campaigns with almost no data in training
• Segment #8: Campaigns that are served most heavily in the
previous batches so that their CTR estimate can be quite accurate
Daily CTR Lift Over Control
[Figure: percentage CTR lift over CONTROL for LASER and LASER-EE across Days 1-7; exact values omitted]
Daily CPM Lift Over Control
[Figure: percentage eCPM lift over CONTROL for LASER and LASER-EE across Days 1-7; exact values omitted]
CPM Lift By Campaign Warmness Segments
[Figure: CPM lift percentage for LASER and LASER-EE across warmness segments 1-8]
O/E Ratio By Campaign Warmness Segments
[Figure: observed/expected clicks (roughly 0.5-1.0) for CONTROL, LASER, and LASER-EE across warmness segments 1-8]
[Figure: number of campaigns served; improvement from E/E]
Insights
• Overall performance:
– LASER and LASER-EE are both much better than
control
– LASER and LASER-EE performance are very similar
• Great news! We get exploration without much additional
cost
• Exploration has other benefits
– LASER-EE serves significantly more campaigns than
LASER
– Provides healthier market place, more ad diversity per
user (better experience)
Solutions to Practical Problems
• Rapid model development cycle
– Quick reaction to changes in data, product
– Write once for training, testing, inference
• Can adapt to changing data
– Integrated Thompson sampling explore/exploit
– Automatic training
– Multiple training frequencies for different parts of model
• Good tools yield good models
– Reusable components for feature extraction and transformation
– Very high-performance inference engine for deployment
– Modelers can concentrate on building models, not re-writing
common functions or worrying about production issues
Summary
• Reducing dimension through logistic regression, coupled with explore/exploit
schemes like Thompson sampling, is an effective mechanism to solve response
prediction problems in advertising
• Partitioning model components into cold-start (stable) and warm-start
(non-stationary) parts with different training frequencies is an effective
mechanism to scale computations
• ADMM, with a few modifications, is an effective model-training strategy for
large data with high dimensionality
• These methods work well for LinkedIn advertising, with significant
improvements
Theory vs. Practice
Textbook
• Data is stationary
• Training data is clean
• Training is hard, testing and
inference are easy
• Models don’t change
• Complex algorithms work best
Reality
• Features, items changing
constantly
• Fraud, bugs, tracking delays, online/offline inconsistencies, etc.
• All aspects have challenges at
web scale
• Never-ending processes of
improvement
• Simple models with good features
and lots of data win
Current Work: Feed Recommendation
Network Updates
Job change
Job anniversaries
Connections
Endorsements
Upload Photos
…
Content
Articles by Influencer
Shares by friends
Content in Channels followed
Content by companies followed
Jobs recommendation
…
Sponsored Updates
Company updates
Jobs
Tiered Approach to Ranking
• A second-pass ranker (BLENDER) blends disparate results returned by
first-pass rankers (Network Updates, Content, Ads, Jobs) to produce the top k
Challenges
• Personalization
– Viewer-actor affinity by type (depends on strength of
connections in multiple contexts)
– Blending identity and behavioral data
• Frequency discounting, freshness, diversification
• Multi-objectives (Revenue, Engagement)
• A/B tests with interference
• Engagement metrics
– Function of various actions that optimize long-term
engagement metric like return visits
• Summarization and adding new content types
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course
Enar short course

More Related Content

What's hot

Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models ananth
 
IEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialIEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialSrinath Perera
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligenceananth
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionWill Johnson
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelinesRamesh Sampath
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Lucidworks
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017MLconf
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsKrishna Sankar
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Measuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and PythonMeasuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and PythonSujit Pal
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Turi, Inc.
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Chris Ohk
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Lucidworks
 

What's hot (20)

Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models
 
IEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialIEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On Tutorial
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligence
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS Function
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelines
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Machine learning yearning
Machine learning yearningMachine learning yearning
Machine learning yearning
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Ironwood4_Tuesday_Medasani_1PM
Ironwood4_Tuesday_Medasani_1PMIronwood4_Tuesday_Medasani_1PM
Ironwood4_Tuesday_Medasani_1PM
 
Measuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and PythonMeasuring Search Engine Quality using Spark and Python
Measuring Search Engine Quality using Spark and Python
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
 

Viewers also liked

Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Wayne Lee
 
Contexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMiningContexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMining正志 坪坂
 
Reading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrapReading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrapChristian Robert
 
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsReading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsChristian Robert
 
Heart and voluntary
Heart and voluntaryHeart and voluntary
Heart and voluntaryhpinn
 
หน้าที่พิเศษของใบ
หน้าที่พิเศษของใบหน้าที่พิเศษของใบ
หน้าที่พิเศษของใบBiobiome
 
gabrovo1
gabrovo1gabrovo1
gabrovo1niod
 
Types de fonds d'investiment au Luxembourg
Types de fonds d'investiment au LuxembourgTypes de fonds d'investiment au Luxembourg
Types de fonds d'investiment au LuxembourgBridgeWest.eu
 
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...Dyplast Products
 
หน้าที่พิเศษของราก
หน้าที่พิเศษของรากหน้าที่พิเศษของราก
หน้าที่พิเศษของรากBiobiome
 
07 bio มข
07 bio มข07 bio มข
07 bio มขBiobiome
 

Viewers also liked (12)

Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap Introduction to Bag of Little Bootstrap
Introduction to Bag of Little Bootstrap
 
Contexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMiningContexual bandit @TokyoWebMining
Contexual bandit @TokyoWebMining
 
Reading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrapReading Efron's 1979 paper on bootstrap
Reading Efron's 1979 paper on bootstrap
 
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsReading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
 
Heart and voluntary
Heart and voluntaryHeart and voluntary
Heart and voluntary
 
หน้าที่พิเศษของใบ
หน้าที่พิเศษของใบหน้าที่พิเศษของใบ
หน้าที่พิเศษของใบ
 
Dpa acto 2015_eng
Dpa acto 2015_engDpa acto 2015_eng
Dpa acto 2015_eng
 
gabrovo1
gabrovo1gabrovo1
gabrovo1
 
Types de fonds d'investiment au Luxembourg
Types de fonds d'investiment au LuxembourgTypes de fonds d'investiment au Luxembourg
Types de fonds d'investiment au Luxembourg
 
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
Technical Bulletin 1128A Mechanical Insulation In Typical Refrigeration Appli...
 
หน้าที่พิเศษของราก
หน้าที่พิเศษของรากหน้าที่พิเศษของราก
หน้าที่พิเศษของราก
 
07 bio มข
07 bio มข07 bio มข
07 bio มข
 

Similar to Enar short course

Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Map reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsMap reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsNishant Gandhi
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computingjayavignesh86
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAtner Yegorov
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisHiye Biniam
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
Building Scalable Aggregation Systems
Building Scalable Aggregation SystemsBuilding Scalable Aggregation Systems
Building Scalable Aggregation SystemsJared Winick
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 

Similar to Enar short course (20)

Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Lecture 1 (bce-7)
Lecture   1 (bce-7)Lecture   1 (bce-7)
Lecture 1 (bce-7)
 
Map reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsMap reduce programming model to solve graph problems
Map reduce programming model to solve graph problems
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computing
 
DynamodbDB Deep Dive
DynamodbDB Deep DiveDynamodbDB Deep Dive
DynamodbDB Deep Dive
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
try
trytry
try
 
Deep Dive - DynamoDB
Deep Dive - DynamoDBDeep Dive - DynamoDB
Deep Dive - DynamoDB
 
Connected Components Labeling
Connected Components LabelingConnected Components Labeling
Connected Components Labeling
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Building Scalable Aggregation Systems
Building Scalable Aggregation SystemsBuilding Scalable Aggregation Systems
Building Scalable Aggregation Systems
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 

Recently uploaded

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backElena Simperl
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsVlad Stirbu
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Key Trends Shaping the Future of Infrastructure.pdf
IoT Analytics Company Presentation May 2024
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Knowledge engineering: from people to machines and back
Neuro-symbolic is not enough, we need neuro-*semantic*
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Search and Society: Reimagining Information Access for Radical Futures
"Impact of front-end architecture on development cost", Viktor Turskyi
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
UiPath Test Automation using UiPath Test Suite series, part 1
When stars align: studies in data quality, knowledge graphs, and machine lear...
Quantum Computing: Current Landscape and the Future Role of APIs
Bits & Pixels using AI for Good.........
Essentials of Automations: Optimizing FME Workflows with Parameters
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Accelerate your Kubernetes clusters with Varnish Caching
ODC, Data Fabric and Architecture User Group

Enar short course

  • 1.
  • 2. Statistical Computing For Big Data Deepak Agarwal LinkedIn Applied Relevance Science dagarwal@linkedin.com ENAR 2014, Baltimore, USA
  • 3. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  • 4. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  • 5. Big Data becoming Ubiquitous • Bioinformatics • Astronomy • Internet • Telecommunications • Climatology • …
  • 6. Big Data: Some size estimates • 1000 human genomes: > 100TB of data (1000 genomes project) • Sloan Digital Sky Survey: 200GB data per night (>140TB aggregated) • Facebook: A billion monthly active users • LinkedIn: 225M members worldwide • Twitter: 500 million tweets a day • Over 6 billion mobile phones in the world generating data everyday
  • 7. Big Data: Paradigm shift • Classical Statistics – Generalize using small data • Paradigm Shift with Big Data – We now have an almost infinite supply of data – Easy Statistics ? Just appeal to asymptotic theory? • So the issue is mostly computational? – Not quite • More data comes with more heterogeneity • Need to change our statistical thinking to adapt – Classical statistics still invaluable to think about big data analytics
  • 8. Some Statistical Challenges • Exploratory Analysis (EDA), Visualization – Retrospective (on Terabytes) – More Real Time (streaming computations every few minutes/hours) • Statistical Modeling – Scale (computational challenge) – Curse of dimensionality • Millions of predictors, heterogeneity – Temporal and Spatial correlations
  • 9. Statistical Challenges continued • Experiments – To test new methods, test hypotheses from randomized experiments – Adaptive experiments • Forecasting – Planning, advertising • Many more that I am not fully versed in
  • 10. Defining Big Data • How do you know you have a big data problem? – Is it only the number of terabytes? – What about dimensionality, structured/unstructured data, computations required, … • No clear definition, different points of view – When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
  • 11. Distributed Computing for Big Data • Distributed computing invaluable tool to scale computations for big data • Some distributed computing models – Multi-threading – Graphics Processing Units (GPU) – Message Passing Interface (MPI) – Map-Reduce
  • 12. Evaluating a method for a problem • Scalability – Process X GB in Y hours • Ease of use for a statistician • Reliability (fault tolerance) – Especially in an industrial environment • Cost – Hardware and cost of maintaining • Good for the computations required? – E.g., Iterative versus one pass • Resource sharing
  • 13. Multithreading • Multiple threads take advantage of multiple CPUs • Shared memory • Threads can execute independently and concurrently • Can only handle Gigabytes of data • Reliable
  • 14. Graphics Processing Units (GPU) • Number of cores: – CPU: Order of 10 – GPU: smaller cores • Order of 1000 • Can be >100x faster than CPU – Parallel computationally intensive tasks off-loaded to GPU • Good for certain computationally-intensive tasks • Can only handle Gigabytes of data • Not trivial to use, requires good understanding of low-level architecture for efficient use – But things changing, it is getting more user friendly
  • 15. Message Passing Interface (MPI) • Language independent communication protocol among processes (e.g. computers) • Most suitable for master/slave model • Can handle Terabytes of data • Good for iterative processing • Fault tolerance is low
  • 16. Map-Reduce (Dean & Ghemawat, 2004) [Diagram: Data → Mappers → Reducers → Output] • Computation split into Map (scatter) and Reduce (gather) stages • Easy to use: – User needs to implement only two functions: Mapper and Reducer • Easily handles Terabytes of data • Very good fault tolerance (failed tasks automatically get restarted)
  • 17. Comparison of Distributed Computing Methods

                                  Multithreading  GPU                             MPI        Map-Reduce
    Scalability (data size)       Gigabytes       Gigabytes                       Terabytes  Terabytes
    Fault Tolerance               High            High                            Low        High
    Maintenance Cost              Low             Medium                          Medium     Medium-High
    Iterative Process Complexity  Cheap           Cheap                           Cheap      Usually expensive
    Resource Sharing              Hard            Hard                            Easy       Easy
    Easy to Implement?            Easy            Needs low-level GPU knowledge   Easy       Easy
  • 18. Example Problem • Tabulating word counts in corpus of documents • Similar to table function in R
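On a single machine the same computation is one line of R; a minimal sketch with a hypothetical word vector:

  words <- c("Hello", "World", "Bye", "World", "Hello", "Hadoop", "Goodbye", "Hadoop")
  table(words)  # counts each distinct word, like the Map-Reduce word count that follows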
  • 19. Word Count Through Map-Reduce • Input: "Hello World Bye World" → Mapper 1; "Hello Hadoop Goodbye Hadoop" → Mapper 2 • Mapper 1 emits: <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1> • Mapper 2 emits: <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1> • Reducer 1 (words from A-G) outputs: <Bye, 1>, <Goodbye, 1> • Reducer 2 (words from H-Z) outputs: <Hello, 2>, <World, 2>, <Hadoop, 2>
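In the same R-style pseudo code used on slide 25 below (Output() is the assumed emit function), a word-count mapper and reducer might look like this sketch:

  Mapper:
    Input: a chunk of documents
    for (doc in Data) {
      for (word in strsplit(doc, " ")[[1]]) {
        Output(c(word, 1));   # Key = word, Value = 1
      }
    }

  Reducer:
    Input: Key (a word), Value (a list of 1's, one per occurrence)
    count = 0;
    for (v in Value) { count = count + v; }
    Output(c(Key, count));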
  • 20. Key Ideas about Map-Reduce [Diagram: Big Data → Partition 1, Partition 2, …, Partition N → Mapper 1, Mapper 2, …, Mapper N → <Key, Value> pairs → Reducer 1, Reducer 2, …, Reducer M → Output 1, …, Output M]
  • 21. Key Ideas about Map-Reduce • Data are split into partitions and stored in many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs • Data with the same key are sent to the same reducer. One reducer can receive multiple keys • Every reducer sorts its data by key • For each key, the reducer processes the values corresponding to the key according to the customized reducer function and output
  • 22. Compute Mean for Each Group

    ID  Group No.  Score
    1   1          0.5
    2   3          1.0
    3   1          0.8
    4   2          0.7
    5   2          1.5
    6   3          1.2
    7   1          0.8
    8   2          0.9
    9   4          1.3
    …   …          …
  • 23. Key Ideas about Map-Reduce • Data are split into partitions and stored in many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs – For each row: • Key = Group No. • Value = Score • Data with the same key are sent to the same reducer. One reducer can receive multiple keys – E.g. 2 reducers – Reducer 1 receives data with key = 1, 2 – Reducer 2 receives data with key = 3, 4 • Every reducer sorts its data by key – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]> • For each key, the reducer processes the values corresponding to the key according to the customized reducer function and outputs the result – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
  • 24. Key Ideas about Map-Reduce • Data are split into partitions and stored in many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs – For each row: • Key = Group No. • Value = Score • Data with the same key are sent to the same reducer. One reducer can receive multiple keys – E.g. 2 reducers – Reducer 1 receives data with key = 1, 2 – Reducer 2 receives data with key = 3, 4 • Every reducer sorts its data by key – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]> • For each key, the reducer processes the values corresponding to the key according to the customized reducer function and outputs the result – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)> – What you need to implement: the mapper and the reducer functions
  • 25. Pseudo Code (in R)

    Mapper:
      Input: Data
      for (row in Data) {
        groupNo = row$groupNo;
        score = row$score;
        Output(c(groupNo, score));
      }

    Reducer:
      Input: Key (groupNo), List Value (a list of scores that belong to the Key)
      count = 0;
      sum = 0;
      for (v in Value) {
        sum = sum + v;       # R has no += operator
        count = count + 1;   # or ++
      }
      Output(c(Key, sum / count));
  • 26. Exercise 1 • Problem: Average height per {Grade, Gender}? • What should be the mapper output key? • What should be the mapper output value? • What are the reducer inputs? • What are the reducer outputs? • Write the mapper and reducer for this.

    Student ID  Grade  Gender  Height (cm)
    1           3      M       120
    2           2      F       115
    3           2      M       116
    …           …      …       …
  • 27. • Problem: Average height per {Grade, Gender}? • What should be the mapper output key? – {Grade, Gender} • What should be the mapper output value? – Height • What are the reducer inputs? – Key: {Grade, Gender}, Value: List of Heights • What are the reducer outputs? – {Grade, Gender, mean(Heights)} (same student table as in Exercise 1)
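A sketch of the solution in the same pseudo-code style (Output() as before; writing the composite key as a pair):

  Mapper:
    Input: Data
    for (row in Data) {
      Output(key = c(row$grade, row$gender), value = row$height);
    }

  Reducer:
    Input: Key ({Grade, Gender}), Value (a list of heights for that key)
    sum = 0; count = 0;
    for (v in Value) {
      sum = sum + v;
      count = count + 1;
    }
    Output(c(Key, sum / count));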
  • 28. Exercise 2 • Problem: Number of students per {Grade, Gender}? • What should be the mapper output key? • What should be the mapper output value? • What are the reducer inputs? • What are the reducer outputs? • Write the mapper and reducer for this. (same student table as in Exercise 1)
  • 29. • Problem: Number of students per {Grade, Gender}? • What should be the mapper output key? – {Grade, Gender} • What should be the mapper output value? – 1 • What are the reducer inputs? – Key: {Grade, Gender}, Value: List of 1's • What are the reducer outputs? – {Grade, Gender, sum(value list)} – OR: {Grade, Gender, length(value list)} (same student table as in Exercise 1)
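A sketch of the counting solution in the same pseudo-code style; only the value and the reduce step change:

  Mapper:
    Input: Data
    for (row in Data) {
      Output(key = c(row$grade, row$gender), value = 1);
    }

  Reducer:
    Input: Key ({Grade, Gender}), Value (a list of 1's)
    count = 0;
    for (v in Value) { count = count + v; }
    Output(c(Key, count));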
  • 30. More on Map-Reduce • Depends on distributed file systems • Typically mappers are the data storage nodes • Map/Reduce tasks automatically get restarted when they fail (good fault tolerance) • Map and Reduce I/O are all on disk – Data transmission from mappers to reducers is through disk copy • Iterative process through Map-Reduce – Each iteration becomes a map-reduce job – Can be expensive since map-reduce overhead is high
  • 31. The Apache Hadoop System • An open-source software framework for reliable, scalable, distributed computing • The most popular distributed computing system in the world • Key modules: – Hadoop Distributed File System (HDFS) – Hadoop YARN (job scheduling and cluster resource management) – Hadoop MapReduce
  • 32. Major Tools on Hadoop • Pig – A high-level language for Map-Reduce computation • Hive – A SQL-like query language for data querying via Map-Reduce • Hbase – A distributed & scalable database on Hadoop – Allows random, real time read/write access to big data – Voldemort is similar to Hbase • Mahout – A scalable machine learning library • …
  • 33. Hadoop Installation • Setting up Hadoop on your desktop/laptop: – http://hadoop.apache.org/docs/stable/single_node_setup.html • Setting up Hadoop on a cluster of machines: – http://hadoop.apache.org/docs/stable/cluster_setup.html
  • 34. Hadoop Distributed File System (HDFS) • Master/Slave architecture • NameNode: a single master node that controls which data block is stored where. • DataNodes: slave nodes that store data and do R/W operations • Clients (Gateway): Allow users to login and interact with HDFS and submit Map-Reduce jobs • Big data is split to equal-sized blocks, each block can be stored in different DataNodes • Disk failure tolerance: data is replicated multiple times
  • 35. Load the Data into Pig • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float); – The path after LOAD is the location of the data on HDFS • USING PigStorage() means the data are delimited by tabs (can be omitted) • If data are delimited by other characters, e.g. space, use USING PigStorage(' ') • Data schema is defined after AS • Variable types: int, long, float, double, chararray, …
  • 36. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – The Pig Language – A Deep Dive of Hadoop Map-Reduce • Part II: Examples of Statistical Computing for Big Data – Bag of Little Bootstraps – Large Scale Logistic Regression – Parallel Matrix Factorization • Part III: The Future of Cloud Computing
  • 37. Bag of Little Bootstraps Kleiner et al. 2012
  • 38. Bootstrap (Efron, 1979) • A re-sampling based method to obtain the statistical distribution of sample estimators • Why are we interested? – Re-sampling is embarrassingly parallelizable • For example: – Standard deviation of the mean of N samples (μ) – For i = 1 to r do • Randomly sample with replacement N times from the original sample -> bootstrap data i • Compute mean of the i-th bootstrap data -> μi – Estimate of Sd(μ) = Sd([μ1, …, μr]) – r is usually a large number, e.g. 200
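On a single machine this is a few lines of R; a minimal sketch with a hypothetical sample x:

  r <- 200                          # number of bootstrap replicates
  x <- rnorm(1000)                  # hypothetical original sample of size N
  mu <- replicate(r, mean(sample(x, length(x), replace = TRUE)))
  sd(mu)                            # bootstrap estimate of Sd(mean)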
  • 39. Bootstrap for Big Data • Can have r nodes running in parallel, each sampling one bootstrap data • However… – N can be very large – Data may not fit into memory – Collecting N samples with replacement on each node can be computationally expensive
  • 40. M out of N Bootstrap (Bickel et al. 1997) • Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N • Apply an analytical correction to SdM(μ) to obtain Sd(μ) using prior knowledge of the convergence rate of sample estimates • However… – Prior knowledge is required – Choice of M is critical to performance – Finding the optimal value of M needs more computation
  • 41. Bag of Little Bootstraps (BLB) • Example: standard deviation of the mean • Generate S sampled data sets, each obtained by random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b) • For each data set p = 1 to S do – For i = 1 to r do • Sample N samples with replacement from the data of size b • Compute the mean of the resampled data μpi – Compute Sdp(μ) = Sd([μp1, …, μpr]) • Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
  • 42. Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size N data – ξ is some function of θ, such as standard deviation, … • Generate S sampled data sets, each obtained from random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b) • For each data set p = 1 to S do – For i = 1 to r do • Sample N samples with replacement from the data of size b • Compute the estimate θpi from the resampled data – Compute ξp(θ) = ξ([θp1, …, θpr]) • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
  • 43. Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size N data – ξ is some function of θ, such as standard deviation, … • Generate S sampled data sets, each obtained from random sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b) • For each data set p = 1 to S do (Mapper) – For i = 1 to r do • Sample N samples with replacement from the data of size b • Compute the estimate θpi from the resampled data – Compute ξp(θ) = ξ([θp1, …, θpr]) (Reducer) • Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)]) (Gateway)
  • 44. Why is BLB Efficient • Before: – N samples with replacement from size N data is expensive when N is large • Now: – N samples with replacement from size b data – b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1)) – Equivalent to: a multinomial sampler with dim = b – Storage = O(b), computational complexity = O(b)
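A single-machine sketch in R of the multinomial view, estimating the Sd of the mean; the data x and the choices of S, r, γ are illustrative:

  x <- rnorm(1e5)                      # hypothetical full data of size N
  N <- length(x)
  gamma <- 0.6                         # subset-size exponent, b = N^gamma
  b <- floor(N^gamma)
  S <- 5; r <- 100
  xi <- numeric(S)
  for (p in 1:S) {
    sub <- sample(x, b)                              # subsample without replacement
    mu <- numeric(r)
    for (i in 1:r) {
      w <- as.vector(rmultinom(1, N, rep(1/b, b)))   # counts: N draws over b points
      mu[i] <- sum(w * sub) / N                      # mean of resampled data, O(b) work
    }
    xi[p] <- sd(mu)                                  # Sd_p(mean)
  }
  mean(xi)                                           # BLB estimate of Sd(mean)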
  • 45. Simulation Experiment • 95% CI of Logistic Regression Coefficients • N = 20000, 10 explanatory variables • Relative Error = |Estimated CI width – True CI width | / True CI width • BLB-γ: BLB with b = Nγ • BOFN-γ: b out of N sampling with b = Nγ • BOOT: Naïve bootstrap
  • 47. Real Data • 95% CI of Logistic Regression Coefficients • N = 6M, 3000 explanatory variables • Data size = 150GB, r = 50, s = 5, γ = 0.7
  • 48. Summary of BLB • A new algorithm for bootstrapping on big data • Advantages – Fast and efficient – Easy to parallelize – Easy to understand and implement – Friendly to Hadoop, makes it routine to perform statistical calculations on Big data
  • 49. Large Scale Logistic Regression
  • 50. Logistic Regression • Binary response: Y • Covariates: X • Yi ~ Bernoulli(pi) • log(pi/(1-pi)) = Xi Tβ ; β ~ MVN(0 , 1/λ I ) • Widely used (research and applications)
  • 51. Large Scale Logistic Regression • Binary response: Y – E.g., Click / Non-click on an ad on a webpage • Covariates: X – User covariates: • Age, gender, industry, education, job, job title, … – Item covariates: • Categories, keywords, topics, … – Context covariates: • Time, page type, position, … – 2-way interaction: • User covariates X item covariates • Context covariates X item covariates • …
  • 52. Computational Challenge • Hundreds of millions/billions of observations • Hundreds of thousands/millions of covariates • Fitting such a logistic regression model on a single machine is not feasible • Model fitting is iterative, using methods like gradient descent, Newton's method, etc. – Multiple passes over the data
  • 53. Recap on Optimization Methods • Problem: find x to minimize F(x) • Iteration n: x_n = x_{n-1} − b_{n-1} F′(x_{n-1}) • b_{n-1} is the step size, which can change every iteration • Iterate until convergence • Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
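A minimal gradient-descent sketch in R; the objective F, its gradient gradF, the start point and the step size are all hypothetical:

  F     <- function(x) sum((x - 2)^2)       # hypothetical objective
  gradF <- function(x) 2 * (x - 2)          # its gradient F'(x)
  x <- c(0, 0); b <- 0.1                    # start point and step size
  for (n in 1:100) {
    x_new <- x - b * gradF(x)               # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
    if (sum(abs(x_new - x)) < 1e-8) break   # iterate until convergence
    x <- x_new
  }
  x                                         # approaches the minimizer (2, 2)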
  • 54. Iterative Process with Hadoop [Diagram: Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → …]
  • 55. Limitations of Hadoop for fitting a big logistic regression • Iterative process is expensive and slow • Every iteration = a Map-Reduce job • I/O of mapper and reducers are both through disk • Plus: Waiting in queue time • Q: Can we find a fitting method that scales with Hadoop ?
  • 56. Large Scale Logistic Regression • Naïve: – Partition the data and run logistic regression for each partition – Take the mean of the learned coefficients – Problem: Not guaranteed to converge to the model from single machine! • Alternating Direction Method of Multipliers (ADMM) – Boyd et al. 2011 – Set up constraints: each partition’s coefficient = global consensus – Solve the optimization problem using Lagrange Multipliers – Advantage: guaranteed to converge to a single machine logistic regression on the entire data with reasonable number of iterations
  • 57. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Logistic Regression Logistic Regression Logistic Regression Consensus Computation Iteration 1
  • 58. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Consensus Computation Logistic Regression Logistic Regression Logistic Regression Iteration 1
  • 59. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Logistic Regression Logistic Regression Logistic Regression Consensus Computation Iteration 2
  • 61. Dual Ascent Method • Consider a convex optimization problem • Lagrangian for the problem: • Dual Ascent:
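The equations on this slide are images in the original deck; following Boyd et al. 2011 (cited on slide 56), a sketch of the missing formulas:

  \min_x f(x) \quad \text{subject to} \quad Ax = b
  L(x, y) = f(x) + y^\top (Ax - b)
  x^{k+1} = \arg\min_x L(x, y^k), \qquad y^{k+1} = y^k + \alpha^k (A x^{k+1} - b)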
  • 62. Augmented Lagrangians • Bring robustness to the dual ascent method • Yield convergence without assumptions like strict convexity or finiteness of f • The value of ρ influences the convergence rate
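The augmented Lagrangian itself is an image in the deck; in the same notation it is:

  L_\rho(x, y) = f(x) + y^\top (Ax - b) + \frac{\rho}{2} \|Ax - b\|_2^2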
  • 63. Alternating Direction Method of Multipliers (ADMM) • Problem: • Augmented Lagrangians • ADMM:
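The problem statement and updates on this slide are also images; following Boyd et al. 2011, a sketch:

  \min_{x, z} \ f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c
  x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)
  z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)
  y^{k+1} = y^k + \rho (A x^{k+1} + B z^{k+1} - c)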
  • 64. Large Scale Logistic Regression via ADMM • Notation – (Xi , yi): data in the ith partition – βi: coefficient vector for partition i – β: Consensus coefficient vector – r(β): penalty component such as ||β||2 2 • Optimization problem
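The optimization problem is an image in the deck; in the slide's notation, a sketch of the consensus form is:

  \min_{\beta_1, \ldots, \beta_K, \beta} \ \sum_{i=1}^{K} \ell(X_i, y_i; \beta_i) + \lambda\, r(\beta) \quad \text{subject to} \quad \beta_i = \beta, \ i = 1, \ldots, K

where ℓ is the logistic log-loss on partition i.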
  • 65. ADMM updates LOCAL REGRESSIONS Shrinkage towards current best global estimate UPDATED CONSENSUS
  • 66. An example implementation • ADMM for logistic regression model fitting with L2/L1 penalty • Each iteration of ADMM is a Map-Reduce job – Mapper: partition the data into K partitions – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression – Gateway: computes the consensus from the results of all reducers and sends it back to each reducer node
  • 67. KDD CUP 2010 Data • Bridge to Algebra 2008-2009 data in https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp • Binary response, 20M covariates • Only keep covariates with >= 10 occurrences => 2.2M covariates • Training data: 8,407,752 samples • Test data: 510,302 samples
  • 68. Avg Training Loglikelihood vs Number of Iterations
  • 69. Test AUC vs Number of Iterations
  • 70. Better Convergence Can Be Achieved By • Better Initialization – Use results from Naïve method to initialize the parameters • Adaptively change step size (ρ) for each iteration based on the convergence status of the consensus
  • 72. Agenda • Topic of Interest – Recommender problems for dynamic, time-sensitive applications • Content optimization, online advertising, movie recommendation, shopping, … • Introduction • Offline components – Regression, collaborative filtering (CF), … • Online components + initialization – Time-series, online/incremental methods, explore/exploit (bandit) • Evaluation methods + multi-objective • Challenges
  • 73. Three components we will focus on • Defining the problem – Formulate objectives whose optimization achieves some long- term goals for the recommender system • E.g. How to serve content to optimize audience reach and engagement, optimize some combination of engagement and revenue ? • Modeling (to estimate some critical inputs) – Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions • E.g. Click rates, average time-spent on page, etc • Could be explicit feedback like ratings • Experimentation – Create experiments to collect data proactively to improve models, helps in converging to the best choice(s) cheaply and rapidly. • Explore and Exploit (continuous experimentation) • DOE (testing hypotheses by avoiding bias inherent in data)
  • 74. Modern Recommendation Systems • Goal – Serve the right item to a user in a given context to optimize long-term business objectives • A scientific discipline that involves – Large scale Machine Learning & Statistics • Offline Models (capture global & stable characteristics) • Online Models (incorporates dynamic components) • Explore/Exploit (active and adaptive experimentation) – Multi-Objective Optimization • Click-rates (CTR), Engagement, advertising revenue, diversity, etc – Inferring user interest • Constructing User Profiles – Natural Language Processing to understand content • Topics, “aboutness”, entities, follow-up of something, breaking news,…
  • 75. Some examples from content optimization • Simple version – I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my webpage. How do I perform a simultaneous optimization?
  • 76. [Diagram: front-page modules] • Recommend applications • Recommend search queries • Recommend news articles • Recommend packages: image, title, summary, links to other pages – Pick 4 out of a pool of K (K = 20 ~ 40, dynamic) – Routes traffic to other pages
  • 77. Problems in this example • Optimize CTR on multiple modules – Today Module, Trending Now, Personal Assistant, News – Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations. • For any single module – Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.
  • 78. Online Advertising Advertisers Ad Network Ads Page Recommend Best ad(s) User Publisher Response rates (click, conversion, ad-view) Bids Auction Click conversion Select argmax f(bid,response rates) ML /Statistical model Examples: Yahoo, Google, MSN, … Ad exchanges (RightMedia, DoubleClick, … )
  • 79. LinkedIn Today: Content Module Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
  • 80. LinkedIn Ads: Match ads to users visiting LinkedIn
  • 81. Right Media Ad Exchange: Unified Marketplace Match ads to page views on publisher sites Has ad impression to sell -- AUCTIONS Bids $0.50 Bids $0.75 via Network… … which becomes $0.45 bid Bids $0.65—WINS! AdSense Ad.com Bids $0.60
  • 82. Recommender problems in general USER Item Inventory Articles, web page, ads, … Use an automated algorithm to select item(s) to show Get feedback (click, time spent,..) Refine the models Repeat (large number of times) Optimize metric(s) of interest (Total clicks, Total revenue,…) Example applications Search: Web, Vertical Online Advertising Content ….. Context query, page, …
  • 83. • Items: Articles, ads, modules, movies, users, updates, etc. • Context: query keywords, pages, mobile, social media, etc. • Metric to optimize (e.g., relevance score, CTR, revenue, engagement) – Currently, most applications are single-objective – Could be multi-objective optimization (maximize X subject to Y, Z,..) • Properties of the item pool – Size (e.g., all web pages vs. 40 stories) – Quality of the pool (e.g., anything vs. editorially selected) – Lifetime (e.g., mostly old items vs. mostly new items) Important Factors
  • 84. Factors affecting Solution (continued) • Properties of the context – Pull: Specified by explicit, user-driven query (e.g., keywords, a form) – Push: Specified by implicit context (e.g., a page, a user, a session) • Most applications are somewhere on continuum of pull and push • Properties of the feedback on the matches made – Types and semantics of feedback (e.g., click, vote) – Latency (e.g., available in 5 minutes vs. 1 day) – Volume (e.g., 100K per day vs. 300M per day) • Constraints specifying legitimate matches – e.g., business rules, diversity rules, editorial Voice – Multiple objectives • Available Metadata (e.g., link graph, various user/item attributes)
  • 85. Predicting User-Item Interactions (e.g. CTR) • Myth: We have so much data on the web, if we can only process it the problem is solved – Number of things to learn increases with sample size • Rate of increase is not slow – Dynamic nature of systems make things worse – We want to learn things quickly and react fast • Data is sparse in web recommender problems – We lack enough data to learn all we want to learn and as quickly as we would like to learn – Several Power laws interacting with each other • E.g. User visits power law, items served power law – Bivariate Zipf: Owen & Dyer, 2011
  • 86. Can Machine Learning help? • Fortunately, there are group behaviors that generalize to individuals & they are relatively stable – E.g. Users in San Francisco tend to read more baseball news • Key issue: estimating such groups – Coarse group: more stable but does not generalize that well – Granular group: less stable, with few individuals – Getting a good grouping structure is to hit the "sweet spot" • Another big advantage on the web – Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choice(s) • We don't need to learn all user-item interactions, only those that are good
  • 87. Predicting user-item interaction rates Offline ( Captures stable characteristics at coarse resolutions) (Logistic, Boosting,….) Feature construction Content: IR, clustering, taxonomy, entity,.. User profiles: clicks, views, social, community,.. Near Online (Finer resolution Corrections) (item, user level) (Quick updates) Explore/Exploit (Adaptive sampling) (helps rapid convergence to best choices) Initialize
  • 88. Post-click: An example in Content Optimization Recommender EDITORIAL content Clicks on FP links influence downstream supply distribution AD SERVER DISPLAY ADVERTISING Revenue Downstream engagement (Time spent)
  • 89. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: maximize clicks (maximize downstream supply from FP) • But consider the following – Article 1: CTR = 5%, utility per click = 5 – Article 2: CTR = 4.9%, utility per click = 10 • By promoting Article 2, we lose 0.1 clicks per 100 visits but gain 24 utils per 100 visits (49 vs. 25) • If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc)
  • 90. High level picture http request Statistical Models updated in Batch mode: e.g. once every 30 mins Server Item Recommendation system: thousands of computations in sub-seconds User Interacts e.g. click, does nothing
  • 91. High level overview: Item Recommendation System • User info: updated in batch (activity, profile) • Item index: id, meta-data; pre-filter (SPAM, editorial, …); feature extraction (NLP, clustering, …) • ML/Statistical models score items: P(click), P(share), semantic-relevance score, … • Rank items: sort by score (CTR, bid*CTR, …), combine scores using multi-objective optimization, threshold on some scores, … • User-item interaction data: batch process
  • 92. ML/Statistical models for scoring [Chart: number of items scored by ML (roughly 100 up to 100M) vs. traffic volume, with model update latency from a few hours to several days; example systems: LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, LinkedIn Ads]
  • 93. Summary of deployments • Yahoo! Front page Today Module (2008-2011): 300% improvement in click-through rates – Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! Properties (2011): Significant improvement in engagement across Yahoo! Network • Fully deployed on LinkedIn Today Module (2012): Significant improvement in click-through rates (numbers not revealed due to reasons of confidentiality) • Yahoo! RightMedia exchange (2012): Fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed due to reasons of confidentiality) • LinkedIn self-serve ads (2012-2013):Fully deployed • LinkedIn News Feed (2013-2014): Fully deployed • Several others in progress….
  • 94. Broad Themes • Curse of dimensionality – Large number of observations (rows), large number of potential features (columns) – Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom) • I will give examples as we move along • We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation – This can fundamentally change solutions • Think of computation and models together for big data • Optimization: what we are trying to optimize is often complex; models must work in harmony with the optimization – Pareto optimality with competing objectives
  • 95. Statistical Problem • Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest • Examples of utility functions – Click-rates (CTR) – Share-rates (CTR* [Share|Click] ) – Revenue per page-view = CTR*bid (more complex due to second price auction) • CTR is a fundamental measure that opens the door to a more principled approach to rank items • Converge rapidly to maximum utility items – Sequential decision making process (explore/exploit)
  • 96. LinkedIn Today, Yahoo! Today Module: Choose Items to Maximize CTR • User i with user features (e.g., industry, behavioral features, demographic features, …) visits • Algorithm selects item j from a set of candidates • Response y_ij (click or not) • Which item should we select? – The item with the highest predicted CTR (Exploit) – An item for which we need data to predict its CTR (Explore) • This is an "Explore/Exploit" problem
  • 97. The Explore/Exploit Problem (to maximize CTR) • Problem definition: pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items • Easy!? Pick the items having the highest click-through rates (CTRs) • But … – The system is highly dynamic: • Items come and go with short lifetimes • CTR of each item may change over time – How much traffic should be allocated to explore new items to achieve optimal performance? • Too little → unreliable CTR estimates due to "starvation" • Too much → little traffic to exploit the high CTR items
  • 98. Y! front Page Application • Simplify: Maximize CTR on first slot (F1) • Item Pool – Editorially selected for high quality and brand image – Few articles in the pool but item pool dynamic
  • 99. CTR Curves of Items on LinkedIn Today [figure: CTR of each item over time]
  • 100. Impact of repeat item views on a given user • Same user is shown an item multiple times (despite not clicking)
  • 101. Simple algorithm to estimate most popular item with small but dynamic item pool • Simple explore/exploit scheme – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool – (100 − ε)% exploit: with large probability (e.g. 95%), choose the highest scoring CTR item • Temporal smoothing – Item CTRs change over time; provide more weight to recent data in estimating item CTRs • Kalman filter, moving average • Discount item score with repeat views – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data) • Segmented most popular – Perform separate most-popular for each user segment
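A sketch of the ε-greedy scheme in R; the clicks/views state vectors and the +1/+2 smoothing are illustrative assumptions:

  serve_item <- function(clicks, views, eps = 0.05) {
    if (runif(1) < eps) {
      sample(length(clicks), 1)            # explore: random item from the pool
    } else {
      ctr <- (clicks + 1) / (views + 2)    # smoothed CTR estimate
      which.max(ctr)                       # exploit: current best item
    }
  }

  clicks <- rep(0, 10); views <- rep(0, 10)  # hypothetical pool of 10 items
  item <- serve_item(clicks, views)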
  • 102. Time series model: Kalman filter • Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative fashion • Estimated click-rate distribution at time t+1 – Prior mean: μ_{t+1} = μ_{t|t} – Prior variance: σ²_{t+1} = σ²_{t|t}(1 + η) + μ²_{t|t} η (see the derivation on slide 167) • High CTR items are more adaptive
  • 103. More economical exploration? Better bandit solutions • Consider a two-armed problem with unknown payoff probabilities p1 > p2 • The gambler has 1000 plays; what is the best way to experiment? (to maximize total expected reward) • This is called the "multi-armed bandit" problem and has been studied for a long time • Optimal solution: play the arm that has the maximum potential of being good • Optimism in the face of uncertainty
  • 104. Item Recommendation: Bandits? • Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000 – Greedy: show Item 2 to all; not a good idea – Item 1's CTR estimate is noisy; the item could potentially be better • Invest in Item 1 for better overall performance on average – Exploit what is known to be good, explore what is potentially good [figure: probability density of CTR for Item 1 (wide) and Item 2 (narrow)]
  • 105. Next few hours

                                  Most Popular Recommendation   Personalized Recommendation
    Offline Models                                              Collaborative filtering (cold-start problem)
    Online Models                 Time-series models            Incremental CF, online regression
    Intelligent Initialization    Prior estimation              Prior estimation, dimension reduction
    Explore/Exploit               Multi-armed bandits           Bandits with covariates
  • 106. Offline Components: Collaborative Filtering in Cold-start Situations
  • 107. Problem • User i with user features x_i (demographics, browse history, search history, …) visits • Algorithm selects item j with item features x_j (keywords, content categories, …) • Response y_ij (explicit rating, implicit click/no-click) • Predict the unobserved entries based on features and the observed entries
  • 108. Model Choices • Feature-based (or content-based) approach – Use features to predict response • (regression, Bayes Net, mixture models, …) – Limitation: need predictive features • Bias often high, does not capture signals at granular levels • Collaborative filtering (CF aka Memory based) – Make recommendation based on past user-item interaction • User-user, item-item, matrix factorization, … • See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc. – Better performance for old users and old items – Does not naturally handle new users and new items (cold- start)
  • 109. Collaborative Filtering (memory based methods) • User-user similarity • Item-item similarities, incorporating both • Estimating similarities: Pearson's correlation, optimization based (Koren et al)
  • 110. How to Deal with the Cold-Start Problem • Heuristic-based approaches – Linear combination of regression and CF models – Filterbot • Add user features as pseudo users and do collaborative filtering – Hybrid approaches • Use content based to fill up entries, then use CF • Matrix factorization – Good performance on Netflix (Koren, 2009) • Model-based approaches – Bilinear random-effects model (probabilistic matrix factorization) • Good on Netflix data [Ruslan et al ICML, 2009] – Add feature-based regression to matrix factorization • (Agarwal and Chen, 2009) – Add topic discovery (from textual items) to matrix factorization • (Agarwal and Chen, 2009; Chun and Blei, 2011)
  • 111. Per-item regression models • When tracking users by cookies, the distribution of visit patterns can get extremely skewed – The majority of cookies have 1-2 visits • Per-item models (regression) based on user covariates are attractive in such cases
  • 112. Several per-item regressions: multi-task learning • Low dimension (5-10), B estimated from retrospective data • Affinity to old items • Agarwal, Chen and Elango, KDD, 2010
  • 113. Per-user, per-item models via bilinear random-effects model
  • 114. Motivation • Data measuring k-way interactions pervasive – Consider k = 2 for all our discussions • E.g. User-Movie, User-content, User-Publisher-Ads,…. – Power law on both user and item degrees • Classical Techniques – Approximate matrix through a singular value decomposition (SVD) • After adjusting for marginal effects (user pop, movie pop,..) – Does not work • Matrix highly incomplete, severe over-fitting – Key issue • Regularization of eigenvectors (factors) to avoid overfitting
  • 115. Early work on complete matrices • Tukey’s 1-df model (1956) – Rank 1 approximation of small nearly complete matrix • Criss-cross regression (Gabriel, 1978) • Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) • Modern day recommender problems – Highly incomplete, large, noisy.
  • 117. Factorization – Brief Overview • Interaction: E[y_ij] = α_i + β_j + u_i′ B v_j • Latent user factors: (α_i, u_i = (u_i1, …, u_in)) • Latent movie factors: (β_j, v_j = (v_j1, …, v_jn)) • (Nn + Mm) parameters • Key technical issue: will overfit for moderate values of n, m → regularization
  • 118. Latent Factor Models: Different Aspects • Matrix Factorization – Factors in Euclidean space – Factors on the simplex • Incorporating features and ratings simultaneously • Online updates
  • 119. Maximum Margin Matrix Factorization (MMMF) • Complete matrix by minimizing loss (hinge, squared-error) on observed entries subject to constraints on trace norm – Srebro, Rennie, Jaakkola (NIPS 2004) – Convex, semi-definite programming (expensive, not scalable) • Fast MMMF (Rennie & Srebro, ICML, 2005) – Constrain the Frobenius norm of left and right eigenvector matrices; not convex but becomes scalable • Other variation: Ensemble MMMF (DeCoste, ICML 2005) – Ensembles of partially trained MMMF (some improvements)
  • 120. Matrix Factorization for Netflix prize data • Minimize the objective function: Σ_{ij ∈ obs} (r_ij − u_i^T v_j)² + λ (Σ_i ||u_i||² + Σ_j ||v_j||²) • Simon Funk: stochastic gradient descent • Koren et al (KDD 2007): alternating least squares – They moved to SGD later in the competition
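A single-machine SGD sketch in R for this objective; the toy data, dimensions, step size and λ are all illustrative:

  set.seed(1)
  N <- 50; M <- 40; k <- 5                  # users, items, latent dimension (toy sizes)
  obs <- cbind(sample(N, 500, replace = TRUE), sample(M, 500, replace = TRUE))
  R <- matrix(NA, N, M)
  R[obs] <- rnorm(500, mean = 3)            # toy observed ratings
  U <- matrix(rnorm(N * k, sd = 0.1), N, k) # latent user factors u_i
  V <- matrix(rnorm(M * k, sd = 0.1), M, k) # latent item factors v_j
  lambda <- 0.05; step <- 0.01
  for (epoch in 1:50) {
    for (idx in sample(nrow(obs))) {
      i <- obs[idx, 1]; j <- obs[idx, 2]
      e <- R[i, j] - sum(U[i, ] * V[j, ])   # residual r_ij - u_i' v_j
      U[i, ] <- U[i, ] + step * (e * V[j, ] - lambda * U[i, ])
      V[j, ] <- V[j, ] + step * (e * U[i, ] - lambda * V[j, ])
    }
  }
  sqrt(mean((R[obs] - rowSums(U[obs[, 1], ] * V[obs[, 2], ]))^2))  # training RMSE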
  • 121. Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS) • Model: r_ij ~ N(u_i^T v_j, σ²), u_i ~ MVN(0, a_u I), v_j ~ MVN(0, a_v I) • Optimization is through iterated conditional modes • Other variations like constraining the mean through a sigmoid, using "who-rated-whom" • Combining with Boltzmann machines also improved performance
  • 122. Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008) • Fully Bayesian treatment using an MCMC approach, placing priors on the variance components (e.g. a_u) – Significant improvement • Interpretation as a fully Bayesian hierarchical model shows why that is the case – Failing to incorporate uncertainty leads to bias in estimates – Multi-modal posterior; MCMC helps in converging to a better one • MCEM also more resistant to over-fitting
  • 123. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) • Specify rank probabilistically (automatic rank selection): y_ij ~ N(Σ_{k=1}^{r} z_k u_ik v_jk, σ²), z_k ~ Ber(π_k), π_k ~ Beta(a/r, b(r−1)/r) • E(#Factors) = ra / (a + b(r−1))
  • 124. How to incorporate features: Deal with both warm start and cold-start • Models to predict ratings for new pairs – Warm-start: (user, movie) present in the training data with large sample size – Cold-start: At least one of (user, movie) new or has small sample size • Rough definition, warm-start/cold-start is a continuum. • Challenges – Highly incomplete (user, movie) matrix – Heavy tailed degree distributions for users/movies • Large fraction of ratings from small fraction of users/movies – Handling both warm-start and cold-start effectively in the presence of predictive features
  • 125. Possible approaches • Large scale regression based on covariates – Does not provide good estimates for heavy users/movies – Large number of predictors to estimate interactions • Collaborative filtering – Neighborhood based – Factorization • Good for warm-start; cold-start dealt with separately • Single model that handles cold-start and warm-start – Heavy users/movies → User/movie specific model – Light users/movies → fallback on regression model – Smooth fallback mechanism for good performance
  • 126. Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model
  • 127. Regression-based Factorization Model (RLFM) • Main idea: Flexible prior, predict factors through regressions • Seamlessly handles cold-start and warm- start • Modified state equation to incorporate covariates
  • 128. RLFM: Model • Rating that user i gives item j: y_ij ~ N(μ_ij, σ²) (Gaussian model); y_ij ~ Bernoulli(μ_ij) (logistic model, for binary ratings); y_ij ~ Poisson(μ_ij N_ij) (Poisson model, for counts) • t(μ_ij) = x_ij^T b + α_i + β_j + u_i^T v_j • Bias of user i: α_i = g_0^T x_i + ε_i^α, ε_i^α ~ N(0, σ_α²) • Popularity of item j: β_j = d_0^T x_j + ε_j^β, ε_j^β ~ N(0, σ_β²) • Factors of user i: u_i = G x_i + ε_i^u, ε_i^u ~ N(0, σ_u² I) • Factors of item j: v_j = D x_j + ε_j^v, ε_j^v ~ N(0, σ_v² I) • Could use other classes of regression models
  • 130. Advantages of RLFM • Better regularization of factors – Factors "shrink" towards a better, covariate-based centroid • Cold-start: fallback regression model (FeatureOnly)
  • 131. RLFM: Illustration of Shrinkage Plot the first factor value for each user (fitted using Yahoo! FP data)
  • 132. Model fitting: EM for our class of models
  • 133. The parameters for RLFM • Latent parameters: Δ = ({α_i}, {β_j}, {u_i}, {v_j}) • Hyper-parameters: Θ = (b, G, D, A_u = a_u I, A_v = a_v I)
  • 136. Computing the E-step • Often hard to compute in closed form • Stochastic EM (Markov Chain EM; MCEM) – Compute expectation by drawing samples from – Effective for multi-modal posteriors but more expensive • Iterated Conditional Modes algorithm (ICM) – Faster but biased hyper-parameter estimates
  • 137. Monte Carlo E-step • Through a vanilla Gibbs sampler (conditionals closed form) • Other conditionals also Gaussian and closed form • Conditionals of users (movies) sampled simultaneously • Small number of samples in early iterations, large numbers in later iterations
  • 138. M-step (Why MCEM is better than ICM) • Update G by optimizing the regression of the sampled factors on the covariates • Update A_u = a_u I – The posterior variance term is ignored by ICM, which underestimates factor variability – Factors get over-shrunk; the posterior is not explored well
  • 139. Experiment 1: Better regularization • MovieLens-100K, avg RMSE using pre-specified splits • ZeroMean, RLFM and FeatureOnly (no cold-start issues) • Covariates: – Users : age, gender, zipcode (1st digit only) – Movies: genres
  • 140. Experiment 2: Better handling of Cold-start • MovieLens-1M; EachMovie • Training-test split based on timestamp • Same covariates as in Experiment 1.
  • 141. Experiment 4: Predicting click-rate on articles • Goal: Predict click-rate on articles for a user on F1 position • Article lifetimes short, dynamic updates important • User covariates: – Age, Gender, Geo, Browse behavior • Article covariates – Content Category, keywords • 2M ratings, 30K users, 4.5 K articles
  • 142. Results on Y! FP data
  • 143. Some other related approaches • Stern, Herbrich and Graepel, WWW, 2009 – Similar to RLFM, different parametrization and expectation propagation used to fit the models • Porteus, Asuncion and Welling, AAAI, 2011 – Non-parametric approach using a Dirichlet process • Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 – Regression + random effects per user regularized through a Graphical Lasso
  • 144. Add Topic Discovery into Matrix Factorization fLDA: Matrix Factorization through Latent Dirichlet Allocation
  • 145. fLDA: Introduction • Model the rating y_ij that user i gives to item j as the user's affinity to the topics that the item has: y_ij = … + Σ_k s_ik z̄_jk, where s_ik is user i's affinity to topic k and z̄_jk is Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j – Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data – fLDA can be thought of as a "multi-task learning" version of the supervised LDA model [Blei'07] for cold-start recommendation • Old items: the z̄_jk's are item latent factors learnt from data with the LDA prior • New items: the z̄_jk's are predicted based on the bag of words in the items
  • 146. LDA Topic Modeling (1) • LDA is effective for unsupervised topic discovery [Blei'03] – It models the generating process of a corpus of items (articles) – For each topic k, draw a word distribution Φ_k = [Φ_k1, …, Φ_kW] ~ Dir(η) – For each item j, draw a topic distribution θ_j = [θ_j1, …, θ_jK] ~ Dir(λ) – For each word, say the nth word, in item j: • Draw a topic z_jn for that word from θ_j • Draw a word w_jn from Φ_k with topic k = z_jn • Observed: the words w_j1, …, w_jn, … of item j; the per-word topics z_j1, …, z_jn, … and the topic distribution [θ_j1, …, θ_jK] are latent
  • 147. LDA Topic Modeling (2) • Model training: – Estimate the prior parameters and the posterior topic-word distribution Φ based on a training corpus of items – EM + Gibbs sampling is a popular method • Inference for new items – Compute the item topic distribution based on the prior parameters and the Φ estimated in the training phase • Supervised LDA [Blei'07] – Predict a target value for each item based on supervised LDA topics: y_j = Σ_k s_k z̄_jk, where y_j is the target value of item j, z̄_jk is Pr(item j has topic k) estimated by averaging the topic of each word in item j, and s_k is the regression weight for topic k – vs. fLDA: y_ij = … + Σ_k s_ik z̄_jk • One regression per user • Same set of topics across different regressions
  • 148. fLDA: Model • Rating that user i gives item j: y_ij ~ N(μ_ij, σ²) (Gaussian model); y_ij ~ Bernoulli(μ_ij) (logistic model, for binary ratings); y_ij ~ Poisson(μ_ij N_ij) (Poisson model, for counts) • t(μ_ij) = x_ij^T b + α_i + β_j + Σ_k s_ik z̄_jk • Bias of user i: α_i = g_0^T x_i + ε_i^α, ε_i^α ~ N(0, σ_α²) • Popularity of item j: β_j = d_0^T x_j + ε_j^β, ε_j^β ~ N(0, σ_β²) • Topic affinity of user i: s_i = H x_i + ε_i^s, ε_i^s ~ N(0, σ_s² I) • Pr(item j has topic k): z̄_jk = Σ_n 1(z_jn = k) / (#words in item j), where z_jn is the LDA topic of the nth word in item j • Observed words: w_jn ~ LDA(λ, η, z_jn), the nth word in item j
  • 149. Model Fitting • Given: – Features X = {x_i, x_j, x_ij} – Observed ratings y = {y_ij} and words w = {w_jn} • Estimate: – Parameters: Θ = [b, g_0, d_0, H, σ², a_α, a_β, A_s, λ, η] • Regression weights and prior parameters – Latent factors: Δ = {α_i, β_j, s_i} and z = {z_jn} • User factors, item factors and per-word topic assignment • Empirical Bayes approach: – Maximum likelihood estimate of the parameters: Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz – The posterior distribution of the factors: Pr[Δ, z | y, Θ̂]
  • 150. The EM Algorithm • Iterate through the E and M steps until convergence – Let Θ̂^(n) be the current estimate – E-step: Compute f(Θ) = E_{Δ,z | y,w,Θ̂^(n)} [log Pr(y, w, Δ, z | Θ)] • The expectation is not in closed form • We draw Gibbs samples and compute the Monte Carlo mean – M-step: Find Θ̂^(n+1) = argmax_Θ f(Θ) • It consists of solving a number of regression and optimization problems
  • 151. Supervised Topic Assignment • The topic of the nth word in item j: Pr(z_jn = k | Rest) ∝ (Z_jk^¬jn + λ) × (Z_k,w_jn^¬jn + η) / (Z_k^¬jn + Wη) × Π_{i rated j} f(y_ij | z_jn = k) • The first two factors are the same as in unsupervised LDA • The last factor is the likelihood of observed ratings by users who rated item j when z_jn is set to topic k (the probability of observing y_ij given the model)
  • 152. fLDA: Experimental Results (Movie) • Task: Predict the rating that a user would give a movie • Training/test split: – Sort observations by time – First 75% → training data; last 25% → test data • Item warm-start scenario – Only 2% new items in test data

    Model         Test RMSE
    RLFM          0.9363
    fLDA          0.9381
    Factor-Only   0.9422
    FilterBot     0.9517
    unsup-LDA     0.9520
    MostPopular   0.9726
    Feature-Only  1.0906
    Constant      1.1190

  • fLDA is as strong as the best method • It does not reduce the performance in warm-start scenarios
  • 153. fLDA: Experimental Results (Yahoo! Buzz) • Task: Predict whether a user would buzz-up an article • Severe item cold-start – All items are new in test data • Data statistics: 1.2M observations, 4K users, 10K articles • fLDA significantly outperforms other models
  • 154. Experimental Results: Buzzing Topics • 3/4 topics are interpretable; 1/2 are similar to unsupervised topics

    Topic               Top terms (after stemming)
    CIA interrogation   bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
    Swine flu           mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
    NFL games           NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
    Gay marriage        court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
    Sarah Palin         palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
    American idol       idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
    Recession           economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
    North Korea issues  north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia
  • 155. fLDA Summary • fLDA is a useful model for cold-start item recommendation • It also provides interpretable recommendations for users – User’s preference to interpretable LDA topics • Future directions: – Investigate Gibbs sampling chains and the convergence properties of the EM algorithm – Apply fLDA to other multi-task prediction problems • fLDA can be used as a tool to generate supervised features (topics) from text data
  • 156. Summary • Regularizing factors through covariates effective • Regression based factor model that regularizes better and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive • Fitting method scalable; Gibbs sampling for users and movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine • Distributed computing on Hadoop: Multiple models and average across partitions (more later)
  • 157. Online Components: Online Models, Intelligent Initialization, Explore / Exploit
  • 158. Why Online Components? • Cold start – New items or new users come to the system – How to obtain data for new items/users (explore/exploit) – Once data becomes available, how to quickly update the model • Periodic rebuild (e.g., daily): Expensive • Continuous online update (e.g., every minute): Cheap • Concept drift – Item popularity, user interest, mood, and user-to-item affinity may change over time – How to track the most recent behavior • Down-weight old data – How to model temporal patterns for better prediction • … may not need to be online if the patterns are stationary
  • 159. Big Picture

                                                    Most Popular Recommendation   Personalized Recommendation
    Offline Models                                                                Collaborative filtering (cold-start problem)
    Online Models (real systems are dynamic)        Time-series models            Incremental CF, online regression
    Intelligent Initialization (do not start cold)  Prior estimation              Prior estimation, dimension reduction
    Explore/Exploit (actively acquire data)         Multi-armed bandits           Bandits with covariates

  • Extension: Segmented Most Popular Recommendation
  • 160. Online Components for Most Popular Recommendation Online models, intelligent initialization & explore/exploit
  • 161. Most popular recommendation: Outline • Most popular recommendation (no personalization, all users see the same thing) – Time-series models (online models) – Prior estimation (initialization) – Multi-armed bandits (explore/exploit) – Sometimes hard to beat!! • Segmented most popular recommendation – Create user segments/clusters based on user features – Do most popular recommendation for each segment
  • 162. Most Popular Recommendation • Problem definition: pick k items (articles) from a pool of N to maximize the total number of clicks on the picked items • Easy!? Pick the items having the highest click-through rates (CTRs) • But … – The system is highly dynamic: • Items come and go with short lifetimes • CTR of each item changes over time – How much traffic should be allocated to explore new items to achieve optimal performance • Too little → unreliable CTR estimates • Too much → little traffic to exploit the high CTR items
  • 163. CTR Curves for Two Days on Yahoo! Front Page Traffic obtained from a controlled randomized experiment (no confounding) Things to note: (a) Short lifetimes, (b) temporal effects, (c) often breaking news stories Each curve is the CTR of an item in the Today Module on www.yahoo.com over time
  • 164. For Simplicity, Assume … • Pick only one item for each user visit – Multi-slot optimization later • No user segmentation, no personalization (discussion later) • The pool of candidate items is predetermined and is relatively small (≈ 1000) – E.g., selected by human editors or by a first-phase filtering method – Ideally, there should be a feedback loop – Large item pool problem later • Effects like user fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)
  • 165. Online Models • How to track the changing CTR of an item • Data: for each item, at time t, we observe – Number of times the item was displayed, n_t (i.e., #views) – Number of clicks on the item, c_t • Problem definition: Given c_1, n_1, …, c_t, n_t, predict the CTR (click-through rate) p_{t+1} at time t+1 • Potential solutions: – Observed CTR at t: c_t / n_t → highly unstable (n_t is usually small) – Cumulative CTR: (Σ_{all i} c_i) / (Σ_{all i} n_i) → reacts to changes very slowly – Moving-window CTR: (Σ_{i ∈ last K} c_i) / (Σ_{i ∈ last K} n_i) → reasonable • But gives no estimate of Var[p_{t+1}] (useful for explore/exploit)
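To make the trade-offs concrete, here is a minimal Python sketch (illustrative, not from the tutorial) of the three baseline trackers; the class and function names are made up for this example:

    from collections import deque

    def observed_ctr(c_t, n_t):
        # CTR in the current interval: unbiased but highly unstable
        # when n_t is small.
        return c_t / n_t if n_t > 0 else 0.0

    class CumulativeCTR:
        # Pools all history: very stable but reacts to change very slowly.
        def __init__(self):
            self.c = self.n = 0
        def update(self, c_t, n_t):
            self.c += c_t
            self.n += n_t
            return self.c / self.n if self.n > 0 else 0.0

    class MovingWindowCTR:
        # Pools the last K intervals: a reasonable compromise, but it
        # provides no estimate of Var[p_{t+1}] for explore/exploit.
        def __init__(self, K):
            self.window = deque(maxlen=K)
        def update(self, c_t, n_t):
            self.window.append((c_t, n_t))
            c = sum(ci for ci, _ in self.window)
            n = sum(ni for _, ni in self.window)
            return c / n if n > 0 else 0.0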
  • 166. Online Models: Dynamic Gamma-Poisson • Model-based approach – (c_t | n_t, p_t) ~ Poisson(n_t p_t) – p_t = p_{t-1} ε_t, where ε_t ~ Gamma(mean=1, var=η²) – Model parameters: • p_1 ~ Gamma(mean=μ₀, var=σ₀²) is the offline CTR estimate • η² specifies how dynamic/smooth the CTR is over time – Posterior distribution (p_{t+1} | c_1, n_1, …, c_t, n_t) ~ Gamma(?, ?) • Solve this recursively (online update rule) [State-space diagram: latent CTRs p_1, p_2, … evolve over time; at each step t the item is shown n_t times and receives c_t clicks; the chain starts from the prior Gamma(mean=μ₀, var=σ₀²)]
  • 167. Online Models: Derivation – Let γ_{t|t-1} = μ_{t|t-1} / σ²_{t|t-1} (effective sample size) – Prior at t: (p_t | c_1, n_1, …, c_{t-1}, n_{t-1}) ~ Gamma(mean=μ_{t|t-1}, var=σ²_{t|t-1}) – Update with (c_t, n_t): (p_t | c_1, n_1, …, c_t, n_t) ~ Gamma(mean=μ_{t|t}, var=σ²_{t|t}) • γ_{t|t} = γ_{t|t-1} + n_t (effective sample size grows with views) • μ_{t|t} = (γ_{t|t-1} μ_{t|t-1} + c_t) / γ_{t|t}, σ²_{t|t} = μ_{t|t} / γ_{t|t} – Predict t+1: (p_{t+1} | c_1, n_1, …, c_t, n_t) ~ Gamma(mean=μ_{t+1|t}, var=σ²_{t+1|t}) • μ_{t+1|t} = μ_{t|t}, σ²_{t+1|t} = σ²_{t|t} + η²(μ²_{t|t} + σ²_{t|t}) – High-CTR items are more adaptive: the variance inflation scales with μ²_{t|t}
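The recursion above translates directly into a small online tracker. The following Python sketch is an illustrative implementation of the update rule under the Gamma(mean, var) parameterization used on the slide; the class name is hypothetical:

    class DynamicGammaPoisson:
        """Tracks one item's CTR p_t under the dynamic Gamma-Poisson model:
        c_t | n_t, p_t ~ Poisson(n_t * p_t); p_t = p_{t-1} * eps_t with
        eps_t ~ Gamma(mean=1, var=eta2). State: posterior mean mu and
        variance var; gamma = mu / var is the effective sample size."""

        def __init__(self, mu0, var0, eta2):
            self.mu, self.var, self.eta2 = mu0, var0, eta2

        def update(self, c_t, n_t):
            # Conjugate update with the new observation (c_t, n_t).
            gamma_prev = self.mu / self.var          # effective sample size
            gamma_post = gamma_prev + n_t
            self.mu = (gamma_prev * self.mu + c_t) / gamma_post
            self.var = self.mu / gamma_post
            # Evolution p_{t+1} = p_t * eps_t inflates the variance; the
            # inflation scales with mu^2, so high-CTR items adapt faster.
            self.var += self.eta2 * (self.mu ** 2 + self.var)
            return self.mu, self.var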
  • 168. Tracking Behavior of the Gamma-Poisson Model [Figure: estimated vs. actual CTR curves over time] • Low click-rate articles – more temporal smoothing
  • 169. Intelligent Initialization: Prior Estimation • Prior CTR distribution: Gamma(mean=μ₀, var=σ₀²) – N historical items: • n_i = #views of item i in its first time interval • c_i = #clicks on item i in its first time interval – Model: c_i ~ Poisson(n_i p_i) and p_i ~ Gamma(mean=μ₀, var=σ₀²) ⇒ marginally c_i ~ NegBinomial(μ₀, σ₀², n_i) – Maximum likelihood estimate (MLE): (μ̂₀, σ̂₀²) = argmax_{μ₀, σ₀²} Σ_i log NegBinomial(c_i; μ₀, σ₀², n_i) • Better prior: Cluster items and find the MLE for each cluster – Agarwal & Chen, 2011 (SIGMOD)
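As a sketch of this MLE step: writing the Gamma prior as Gamma(shape=α, rate=β), so that μ₀ = α/β and σ₀² = α/β², the marginal of c_i is negative binomial, and the prior can be fit by numerical maximization of the marginal likelihood. An illustrative Python sketch (assumed parameterization; not the paper's code):

    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize

    def neg_log_lik(log_params, c, n):
        # Gamma(shape=alpha, rate=beta) prior, so mu0 = alpha / beta and
        # var0 = alpha / beta**2; the marginal of c_i is negative binomial.
        alpha, beta = np.exp(log_params)
        ll = (gammaln(c + alpha) - gammaln(alpha) - gammaln(c + 1)
              + alpha * np.log(beta / (beta + n))
              + c * np.log(n / (beta + n)))
        return -ll.sum()

    def fit_prior(c, n):
        # c, n: first-interval clicks and views (n > 0) for N historical items.
        c, n = np.asarray(c, float), np.asarray(n, float)
        res = minimize(neg_log_lik, x0=np.log([1.0, 100.0]), args=(c, n))
        alpha, beta = np.exp(res.x)
        return alpha / beta, alpha / beta**2   # (mu0, var0)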
  • 170. Explore/Exploit: Problem Definition [Diagram: timeline with past intervals t−2, t−1 and the current time t ("now"); items 1…K receive fractions x₁%, x₂%, …, x_K% of page views] • Determine (x₁, x₂, …, x_K) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future
  • 171. Modeling the Uncertainty, NOT Just the Mean • Simplified setting: two items [Figure: probability density of CTR; Item A is a spike (known CTR), Item B a wide density f(p)] • We know the CTR of Item A (say, shown 1 million times); we are uncertain about the CTR of Item B (shown only 100 times) • If we only make a single decision, give 100% of page views to Item A • If we make multiple decisions in the future, explore Item B, since its CTR can potentially be higher – Potential of B over A: ∫_q^∞ (p − q) f(p) dp, where q is the CTR of Item A, p the CTR of Item B, and f(p) the probability density of Item B's CTR
  • 172. Multi-Armed Bandits: Introduction (1) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) “Pulling” arm i yields a reward: reward = 1 with probability pi (success) reward = 0 otherwise (failure) For now, we are attacking the problem of choosing best article/arm for all users
  • 173. Multi-Armed Bandits: Introduction (2) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) Goal: Pull arms sequentially to maximize the total reward Bandit scheme/policy: Sequential algorithm to play arms (items) Regret of a scheme = Expected loss relative to the “oracle” optimal scheme that always plays the best arm – “best” means highest success probability – But, the best arm is not known … unless you have an oracle – Regret is the price of exploration – Low regret implies quick convergence to the best
  • 174. Multi-Armed Bandits: Introduction (3) • Bayesian approach – Seeks the Bayes optimal solution to a Markov decision process (MDP) with assumptions about probability distributions – Representative work: Gittins' index, Whittle's index – Very computationally intensive • Minimax approach – Seeks a scheme that incurs bounded regret (with no or mild assumptions about probability distributions) – Representative work: UCB by Lai, Auer – Usually computationally easy – But they tend to explore too much in practice (probably because the bounds are based on worst-case analysis)
  • 175. Multi-Armed Bandits: Markov Decision Process (1) • Select an arm now at time t=0 to maximize the expected total number of clicks in t=0,…,T • State at time t: θ_t = (θ_{1t}, …, θ_{Kt}) – θ_{it} = state of arm i at time t (captures all we know about arm i at t) • Reward function R_i(θ_t, θ_{t+1}) – Reward of pulling arm i that brings the state from θ_t to θ_{t+1} • Transition probability Pr[θ_{t+1} | θ_t, pulling arm i] • Policy π: a function that maps a state to an arm (action) – π(θ_t) returns an arm (to pull) • Value of policy π starting from the current state θ_0 with horizon T: V^π_T(θ_0) = E[ R_{π(θ_0)}(θ_0, θ_1) + V^π_{T−1}(θ_1) ] = ∫ Pr[θ_1 | θ_0, π(θ_0)] ( R_{π(θ_0)}(θ_0, θ_1) + V^π_{T−1}(θ_1) ) dθ_1 – First term: immediate reward; second term: value of the remaining T−1 time slots, starting from state θ_1
  • 176. Multi-Armed Bandits: MDP (2) • Optimal policy: π* = argmax_π V^π_T(θ_0) • Things to notice: – Value is defined recursively (actually T high-dimensional integrals) – Dynamic programming can be used to find the optimal policy – But just evaluating the value of a fixed policy can be very expensive • Bandit problem: the pull of one arm does not change the state of other arms, and the set of arms does not change over time
  • 177. Multi-Armed Bandits: MDP (3) • Which arm should be pulled next? – Not necessarily what looks best right now, since it might have had a few lucky successes – Looks like it will be a function of the successes and failures of all arms • Consider a slightly different problem setting – Infinite time horizon, but – Future rewards are geometrically discounted: R_total = R(0) + γR(1) + γ²R(2) + … (0<γ<1) • Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently – In general, the policy π(θ_t) is a function of the joint state (θ_{1t}, …, θ_{Kt}); the decoupled policy is π(θ_t) = argmax_i { g(θ_{it}) } (Gittins' index) – One K-dimensional problem → K one-dimensional problems – Still computationally expensive!!
  • 178. Multi-Armed Bandits: MDP (4) • Bandit policy: 1. Compute the priority (Gittins' index) of each arm based on its state 2. Pull the arm with max priority and observe the reward 3. Update the state of the pulled arm [Diagram: each arm carries its own priority]
  • 179. Multi-Armed Bandits: MDP (5) • Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently – Many proofs and different interpretations of Gittins' index exist • The index of an arm is the fixed charge per pull for a game with two options, whether to pull the arm or not, such that the charge makes the optimal play of the game have zero net reward – Significantly reduces the dimension of the problem space – But Gittins' index g(θ_{it}) is still hard to compute • For the Gamma-Poisson or Beta-Binomial models, θ_{it} = (#successes, #pulls) for arm i up to time t • g maps each possible (#successes, #pulls) pair to a number – Approximate methods are used in practice – Lai et al. have derived these for exponential family distributions
  • 180. Multi-Armed Bandits: Minimax Approach (1) • Compute the priority of each arm i in a way that the regret is bounded – Lowest regret in the worst case • One common policy is UCB1 [Auer 2002]: Priority_i = c_i / n_i + √(2 log n / n_i) – c_i = number of successes of arm i; n_i = number of pulls of arm i; n = total number of pulls of all arms – First term: observed success rate; second term: factor representing uncertainty
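A minimal sketch of the UCB1 policy in Python (illustrative function names; each arm must be pulled once before its priority is defined):

    import math

    def ucb1_priority(c_i, n_i, n_total):
        # Observed success rate plus an uncertainty bonus that shrinks
        # as arm i accumulates pulls [Auer 2002].
        return c_i / n_i + math.sqrt(2.0 * math.log(n_total) / n_i)

    def ucb1_choose(clicks, pulls):
        # Pull each arm once first; then pick the arm with max priority.
        for i, n_i in enumerate(pulls):
            if n_i == 0:
                return i
        n_total = sum(pulls)
        return max(range(len(pulls)),
                   key=lambda i: ucb1_priority(clicks[i], pulls[i], n_total))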
  • 181. Multi-Armed Bandits: Minimax Approach (2) • As the total number of pulls n becomes large: – The observed payoff c_i / n_i tends asymptotically towards the true payoff probability – Because of the uncertainty factor √(2 log n / n_i), the system never completely "converges" to one best arm; only the rate of exploration tends to zero
  • 182. Multi-Armed Bandits: Minimax Approach (3) • Sub-optimal arms are pulled O(log n) times • Hence, UCB1 has O(log n) regret • This is the lowest possible regret rate (but the constants matter) • E.g., the regret after n plays is bounded by [ 8 Σ_{i: μ_i < μ_best} (ln n) / Δ_i ] + (1 + π²/3) Σ_{j=1}^K Δ_j, where Δ_i = μ_best − μ_i
  • 183. Classical Multi-Armed Bandits: Summary • Classical multi-armed bandits – A fixed set of arms with fixed rewards – Observe the reward before the next pull • Bayesian approach (Markov decision process) – Gittins' index [Gittins 1979]: Bayes optimal for classical bandits • Pull the arm currently having the highest index value – Whittle's index [Whittle 1988]: extension to a changing reward function – Computationally intensive • Minimax approach (providing guaranteed regret bounds) – UCB1 [Auer 2002]: upper bound of a model-agnostic confidence interval • Index of arm i = c_i / n_i + √(2 log n / n_i) • Heuristics – ε-Greedy: random exploration using an ε fraction of traffic – Softmax: pick arm i with probability exp(p̂_i/τ) / Σ_j exp(p̂_j/τ), where p̂_i is the predicted CTR of item i and τ is a temperature – Posterior draw: index = a draw from the posterior CTR distribution of each arm
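The three heuristics are a few lines each. The sketch below is illustrative (the Beta(a0, b0) prior in the posterior-draw variant is an assumption for this example):

    import numpy as np
    rng = np.random.default_rng()

    def epsilon_greedy(ctr_hat, eps=0.05):
        # With probability eps explore a random arm; otherwise exploit.
        if rng.random() < eps:
            return int(rng.integers(len(ctr_hat)))
        return int(np.argmax(ctr_hat))

    def softmax_pick(ctr_hat, tau=0.01):
        # P(pick i) = exp(ctr_hat_i / tau) / sum_j exp(ctr_hat_j / tau).
        z = np.exp((np.asarray(ctr_hat) - np.max(ctr_hat)) / tau)
        return int(rng.choice(len(z), p=z / z.sum()))

    def posterior_draw(clicks, views, a0=1.0, b0=100.0):
        # Thompson sampling with a Beta(a0, b0) prior on each arm's CTR:
        # the index of each arm is one draw from its posterior.
        clicks, views = np.asarray(clicks), np.asarray(views)
        return int(np.argmax(rng.beta(a0 + clicks, b0 + views - clicks)))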
  • 184. Do Classical Bandits Apply to Web Recommenders? [Figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time; traffic obtained from a controlled randomized experiment (no confounding)] Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
  • 185. Characteristics of Real Recommender Systems • Dynamic set of items (arms) – Items come and go with short lifetimes (e.g., a day) – Asymptotically optimal policies may fail to achieve good performance when item lifetimes are short • Non-stationary CTR – The CTR of an item can change dramatically over time • Different user populations at different times • The same user behaves differently at different times (e.g., morning, lunch time, at work, in the evening, etc.) • Attention to breaking news stories decays over time • Batch serving for scalability – Making a decision and updating the model for each user visit in real time is expensive – Batch serving is more feasible: create time slots (e.g., 5 min); for each slot, decide the fraction x_i of the visits in the slot to give to item i [Agarwal et al., ICDM, 2009]
  • 186. Explore/Exploit in Recommender Systems [Diagram: timeline with past intervals t−2, t−1 and the current time t ("now"); items 1…K receive fractions x₁%, x₂%, …, x_K% of page views] • Determine (x₁, x₂, …, x_K) based on clicks and views observed before t in order to maximize the expected total number of clicks in the future • Let's solve this from first principles
  • 187. Bayesian Solution: Two Items, Two Time Slots (1) • Two time slots: t = 0 and t = 1 – Item P: we are uncertain about its CTR, p_0 at t = 0 and p_1 at t = 1 – Item Q: we know its CTR exactly, q_0 at t = 0 and q_1 at t = 1 • Question: what fraction x of the N_0 views at t = 0 to give to item P (and 1−x to item Q)? • To determine x, we need to estimate what would happen in the future – We obtain c clicks after serving fraction x (not yet observed; a random variable) – Assume we observe c; we can then update the posterior of p_1: define p̂_1(x, c) = E[p_1 | x, c] • If x and c are given, the optimal slot-1 decision is: give all N_1 views to item P iff p̂_1(x, c) > q_1 [Figure: CTR densities of item P (wide) and item Q (point mass) at t = 0 and t = 1]
  • 188. Bayesian Solution: Two Items, Two Time Slots (2) • Expected total number of clicks in the two time slots: N_0 x p̂_0 + N_0 (1−x) q_0 + N_1 E_c[ max{p̂_1(x,c), q_1} ] – The first two terms are E[#clicks] at t = 0 from items P and Q; the last term is E[#clicks] at t = 1, showing the item with higher E[CTR] • Rewriting: = N_0 q_0 + N_1 q_1 + N_0 x (p̂_0 − q_0) + N_1 E_c[ max{p̂_1(x,c) − q_1, 0} ] – (N_0 q_0 + N_1 q_1) = E[#clicks] if we always show item Q; the remainder is Gain(x, q_0, q_1) • Gain(x, q_0, q_1) = expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots • Solution: argmax_x Gain(x, q_0, q_1)
  • 189. Bayesian Solution: Two Items, Two Time Slots (3) • Approximate p̂_1(x, c) by a normal distribution – Reasonable approximation because of the central limit theorem • With prior p_1 ~ Beta(a, b): – p̂_1 = E_c[ p̂_1(x, c) ] = a / (a + b) – σ²(x) = Var_c[ p̂_1(x, c) ] = [ x N_0 / (a + b + x N_0) ] · ab / ( (a+b)² (a+b+1) ) • Then Gain(x, q_0, q_1) ≈ N_0 x (p̂_0 − q_0) + N_1 [ σ(x) φ( (q_1 − p̂_1)/σ(x) ) + (p̂_1 − q_1) ( 1 − Φ( (q_1 − p̂_1)/σ(x) ) ) ], where φ and Φ are the standard normal pdf and cdf • Proposition: using the approximation, the Bayes optimal solution x can be found in time O(log N_0)
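Putting the normal approximation together, here is an illustrative Python sketch of Gain(x, q_0, q_1) with a simple grid maximization (the slide's O(log N_0) search would replace the grid; all numeric inputs below are made-up examples):

    import numpy as np
    from scipy.stats import norm

    def gain(x, q0, q1, N0, N1, a, b):
        # Expected extra clicks from giving item P a fraction x of the N0
        # slot-0 views, vs. always showing the certain item Q (normal
        # approximation; Beta(a, b) prior on P's CTR).
        p_hat = a / (a + b)                           # prior mean CTR of P
        var_x = (x * N0 / (a + b + x * N0)) * a * b / ((a + b)**2 * (a + b + 1))
        sigma = np.sqrt(var_x) + 1e-12
        z = (q1 - p_hat) / sigma
        exploit = N0 * x * (p_hat - q0)               # slot-0 clicks gained/lost
        explore = N1 * (sigma * norm.pdf(z) + (p_hat - q1) * (1 - norm.cdf(z)))
        return exploit + explore

    # Grid maximization over x:
    xs = np.linspace(0.0, 1.0, 1001)
    gains = [gain(x, q0=0.05, q1=0.05, N0=10000, N1=10000, a=2.0, b=40.0)
             for x in xs]
    best_x = xs[int(np.argmax(gains))]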
  • 190. Bayesian Solution: Two Items, Two Time Slots (4) • Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item? [Figure: fraction of views to give to the item (y-axis), under high vs. low uncertainty; different curves are for different prior mean settings]
  • 191. Bayesian Solution: General Case (1) • From two items to K items – Very difficult problem: max_x ( N_0 Σ_i x_i p̂_{i0} + N_1 E_c[ max_z Σ_i z_i(c) p̂_{i1}(x_i, c_i) ] ), subject to Σ_i x_i = 1 and Σ_i z_i(c) = 1 for all possible c – Note: c = [c_1, …, c_K]; c_i is a random variable representing the #clicks on item i we may get – Apply Whittle's Lagrange relaxation (1988) to this problem setting: • Relax Σ_i z_i(c) = 1, for all c, to E_c[ Σ_i z_i(c) ] = 1 • Apply Lagrange multipliers (q_0 and q_1) to enforce the constraints: min_{q_0, q_1} ( N_0 q_0 + N_1 q_1 + Σ_i max_{x_i} Gain(x_i, q_0, q_1) ) – We essentially reduce the K-item case to K independent two-item sub-problems (which we have solved)
  • 192. Bayesian Solution: General Case (2) • From two intervals to multiple time slots – Approximate multiple time slots by two stages • Non-stationary CTR – Use the Dynamic Gamma-Poisson model to estimate the CTR distribution for each item
  • 193. Simulation Experiment: Different Traffic Volume [Figure: simulation with ground truth estimated from Yahoo! Front Page data; setting: 16 live items per interval; scenarios: web sites with different traffic volume (x-axis)]
  • 194. Simulation Experiment: Different Sizes of the Item Pool [Figure: simulation with ground truth estimated from Yahoo! Front Page data; setting: 1000 views per interval, average item lifetime = 20 intervals; scenarios: different sizes of the item pool (x-axis)]
  • 195. Characteristics of Different Explore/Exploit Schemes (1) • Why does the Bayesian solution have better performance? • Characterize each scheme by three dimensions: – Exploitation regret: the regret of a scheme when it is showing the item it thinks is the best (which may not actually be the best) • 0 means the scheme always picks the actual best • Quantifies the scheme's ability to find good items (lower is better) – Exploration regret: the regret of a scheme when it is exploring items it feels uncertain about • Quantifies the price of exploration (lower is better) – Fraction of exploitation (higher is better) • Fraction of exploration = 1 − fraction of exploitation [Diagram: all traffic to a web site splits into exploitation traffic and exploration traffic]
  • 196. Characteristics of Different Explore/Exploit Schemes (2) [Figure: schemes plotted by exploitation regret vs. exploration regret, and by exploitation regret vs. exploitation fraction; the "good" corner has low regret and a high exploitation fraction] – Exploitation regret: ability to find good items (lower better) – Exploration regret: price of exploration (lower better) – Fraction of exploitation (higher better)
  • 197. Discussion: Large Content Pool • The Bayesian solution looks promising – ~10% from true optimal for a content pool of 1000 live items • 1000 views per interval; item lifetime ~20 intervals • Intelligent initialization (offline modeling) – Use item features to reduce the prior variance of an item • E.g., Var[ item CTR | Sport ] < Var[ item CTR ] – Require a CTR model that outputs both mean and variance • Linear regression model • Segmented model: Estimate the CTR distribution of a random article in an item category – Existing taxonomies, decision tree, LDA topics • Feature-based explore/exploit – Estimate model parameters, instead of per-item CTR – More later
  • 198. Discussion: Multiple Positions, Ranking • Feature-based approach – reward(page) = model( (item 1 at position 1, …, item k at position k) ) – Apply feature-based explore/exploit • Online optimization for a ranked list – Ranked bandits [Radlinski et al., 2008]: run an independent bandit algorithm for each position – Dueling bandit [Yue & Joachims, 2009]: actions are pairwise comparisons • Online optimization of submodular functions – For all sets S₁, S₂ and items a: ∇f_a(S₁ ∪ S₂) ≤ ∇f_a(S₁), where ∇f_a(S) = f_a(S ∪ {a}) − f_a(S) is the marginal gain – Streeter & Golovin (2008)
  • 199. Discussion: Segmented Most Popular • Partition users into segments, and then for each segment, provide most popular recommendation • How to segment users – Hand-created segments: AgeGroup × Gender – Clustering or decision tree based on user features • Users in the same cluster like similar items • Segments can be organized by taxonomies/hierarchies – Better CTR models can be built by hierarchical smoothing • Shrink the CTR of a segment toward its parent • Introduce bias to reduce uncertainty/variance – Bandits for taxonomies (Pandey et al., 2008) • First explore/exploit categories/segments • Then switch to individual items
  • 200. Most Popular Recommendation: Summary • Online model: – Estimate the mean and variance of the CTR of each item over time – Dynamic Gamma-Poisson model • Intelligent initialization: – Estimate the prior mean and variance of the CTR of each item cluster using historical data • Cluster items → maximum likelihood estimates of the priors • Explore/exploit: – Bayesian: solve a Markov decision process problem • Gittins' index, Whittle's index, approximations • Better performance, computation intensive • Thompson sampling: sample from the posterior (simple) – Minimax: bound the regret • UCB1: easy to compute • Explores more than necessary in practice – ε-Greedy: empirically competitive when ε is tuned
  • 201. Online Components for Personalized Recommendation Online models, intelligent initialization & explore/exploit
  • 202. Intelligent Initialization for Linear Model (1) • Linear/factorization model (subscripts: user i, item j): – y_ij ~ N(u_i'θ_j, σ²), where y_ij is the rating that user i gives item j – u_i = feature/factor vector of user i; θ_j = factor vector of item j – Prior: θ_j ~ N(μ_j, Σ) • How to estimate the prior parameters μ_j and Σ? – Important for cold start: predictions are made using the prior – Leverage available features • How to learn the weights/factors quickly? – High-dimensional θ_j → slow convergence – Reduce the dimensionality
  • 203. Feature-based model initialization • FOBFM: Fast Online Bilinear Factor Model • Dimensionality reduction for fast model convergence – Per-item online model: y_ij ~ u_i'θ_j + noise, with prior θ_j ~ N(Ax_j, Σ) – Decomposition: y_ij ~ u_i'Ax_j + u_i'v_j, where Ax_j is predicted by features and v_j ~ N(0, Σ) is a per-item correction – Low-rank parameterization: v_j = Bδ_j with δ_j ~ N(0, σ²I), where B is an n×k linear projection matrix (k << n) • Projects high-dim v_j to low-dim δ_j; low-rank approximation of Var[θ_j]: Σ ≈ σ²BB', i.e., θ_j ~ N(Ax_j, σ²BB') – Offline training: determine A, B, σ² through the EM algorithm (once per day or hour) (Notation: y_ij = rating that user i gives item j; u_i = offline factor vector of user i; x_j = feature vector of item j)
  • 204. Feature-based model initialization (continued) • FOBFM: Fast Online Bilinear Factor Model • Dimensionality reduction for fast model convergence – Per-item online model: y_ij ~ u_i'θ_j + noise, with prior θ_j ~ N(Ax_j, σ²BB') – Equivalently y_ij ~ u_i'Ax_j + (u_i'B)δ_j, where u_i'Ax_j is an offset, u_i'B is a new (low-dimensional) feature vector, and δ_j is updated in an online manner • Fast, parallel online learning: one small regression per item • Online selection of dimensionality (k = dim(δ_j)) – Maintain an ensemble of models, one for each candidate dimensionality (Notation: y_ij = rating that user i gives item j; u_i = offline factor vector of user i; x_j = feature vector of item j; B is an n×k linear projection matrix, k << n)
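A hedged sketch of the per-item online step: given offline A and B, each item reduces to a small Bayesian linear regression on the k-dimensional features u_i'B with offset u_i'Ax_j. The class below is illustrative (not the paper's implementation) and assumes a known Gaussian noise variance:

    import numpy as np

    class FOBFMItemModel:
        """Illustrative per-item online update for FOBFM.
        Offline: A (n x d) and B (n x k) are fixed. Online, item j is a
        Bayesian linear regression of the residual on z = B'u_i, with
        offset u_i'Ax_j, prior delta_j ~ N(0, s2_prior * I), and known
        observation noise variance noise2."""

        def __init__(self, A, B, x_j, s2_prior, noise2):
            self.A, self.B, self.x_j, self.noise2 = A, B, x_j, noise2
            k = B.shape[1]
            self.P = np.eye(k) / s2_prior   # posterior precision of delta_j
            self.b = np.zeros(k)            # precision-weighted mean

        def predict(self, u_i):
            offset = u_i @ self.A @ self.x_j          # cold-start part
            delta = np.linalg.solve(self.P, self.b)   # posterior mean
            return offset + (u_i @ self.B) @ delta

        def update(self, u_i, y_ij):
            # Rank-one conjugate update with one observed rating y_ij.
            z = u_i @ self.B                          # low-dim online features
            r = y_ij - u_i @ self.A @ self.x_j        # residual after offset
            self.P += np.outer(z, z) / self.noise2
            self.b += z * r / self.noise2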
  • 205. Experimental Results: My Yahoo! Dataset (1) • My Yahoo! is a personalized news reading site – Users manually select news/RSS feeds • ~12M “ratings” from ~3M users on ~13K articles – Click = positive – View without click = negative
  • 206. Experimental Results: My Yahoo! Dataset (2) Item-based data split: Every item is new in the test data – First 8K articles are in the training data (offline training) – Remaining articles are in the test data (online prediction & learning) Supervised dimensionality reduction (reduced rank regression) significantly outperforms other methods Methods: No-init: Standard online regression with ~1000 parameters for each item Offline: Feature-based model without online update PCR, PCR+: Two principal component methods to estimate B FOBFM: Our fast online method
  • 207. Experimental Results: My Yahoo! Dataset (3) • A small number of factors (low dimensionality) is better when the amount of data for online learning is small • A large number of factors is better when the data for learning becomes large • The online selection method usually selects the best dimensionality (# factors = number of parameters per item updated online)
  • 208. Intelligent Initialization: Summary • For online learning, whenever historical data is available, do not start cold • For linear/factorization models – Use available features to setup the starting point – Reduce dimensionality to facilitate fast learning • Next – Explore/exploit for personalization – Users are represented by covariates • Features, factors, clusters, etc – Covariate bandits
  • 209. Explore/Exploit for Personalized Recommendation • One extreme problem formulation – One bandit problem per user with one arm per item – Bandit problems are correlated: “Similar” users like similar items – Arms are correlated: “Similar” items have similar CTRs • Model this correlation through covariates/features – Input: User feature/factor vector, item feature/factor vector – Output: Mean and variance of the CTR of this (user, item) pair based on the data collected so far • Covariate bandits – Also known as contextual bandits, bandits with side observations – Provide a solution to • Large content pool (correlated arms) • Personalized recommendation (hint before pulling an arm)
  • 210. Methods for Covariate Bandits • Priority-based methods – Rank items according to the user-specific "score" of each item; then update the model based on the user's response – UCB (upper confidence bound) • Score of an item = E[posterior CTR] + k · StDev[posterior CTR] – Posterior draw (Thompson sampling) • Score of an item = a number drawn from the posterior CTR distribution – Softmax • Score of an item = a number drawn with probability exp(p̂_i/τ) / Σ_j exp(p̂_j/τ) • ε-Greedy – Allocate an ε fraction of traffic for random exploration (ε may be adaptive) – Robust when the exploration pool is small • Bayesian scheme – Close to optimal if it can be solved efficiently
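For a linear CTR model with a Gaussian posterior N(m, S) over the weights, the UCB and posterior-draw scores reduce to the sketch below. This is an illustration in the spirit of LinUCB and Thompson sampling, not a specific published implementation; the feature construction and the multiplier k are assumptions:

    import numpy as np
    rng = np.random.default_rng()

    def ucb_score(f, m, S, k=2.0):
        # Score = E[posterior CTR proxy] + k * StDev, for a linear model
        # with weight posterior N(m, S) and (user, item) feature vector f.
        return f @ m + k * np.sqrt(f @ S @ f)

    def thompson_score(f, m, S):
        # Posterior draw: score under one sampled weight vector.
        return f @ rng.multivariate_normal(m, S)

    def choose_item(feats, m, S, method="ucb"):
        # feats: list of (user, item) feature vectors, one per candidate.
        score = ucb_score if method == "ucb" else thompson_score
        return max(range(len(feats)), key=lambda j: score(feats[j], m, S))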
  • 211. Covariate Bandits: Some References • Just a small sample of papers – Hierarchical explore/exploit (Pandey et al., 2008) • Explore/exploit categories/segments first; then switch to individuals – Variants of ε-greedy • Epoch-greedy (Langford & Zhang, 2007): ε is determined based on the generalization bound of the current model • Banditron (Kakade et al., 2008): linear model with binary response • Non-parametric bandit (Yang & Zhu, 2002): ε decreases over time; example models: histogram, nearest neighbor – Variants of UCB methods • Linearly parameterized bandits (Rusmevichientong et al., 2008): minimax, based on an uncertainty ellipsoid • LinUCB (Li et al., 2010): Gaussian linear regression model • Bandits in metric spaces (Kleinberg et al., 2008; Slivkins et al., 2009): – Similar arms have similar rewards: | reward(i) − reward(j) | ≤ distance(i, j)
  • 212. Online Components: Summary • Real systems are dynamic • Cold-start problem – Incremental online update (online linear regression) – Intelligent initialization (use features to predict initial factor values) – Explore/exploit (UCB, posterior draw, softmax, ε-greedy) • Concept-drift problem – Tracking the most recent behavior (state-space models, Kalman filter) – Modeling temporal patterns (tensor factorization, splines)
  • 214. Evaluation Methods • Ideal method – Experimental design: run side-by-side experiments on a small fraction of randomly selected traffic with the new method (treatment) and the status quo (control) – Limitation • Often expensive and difficult to test a large number of methods • Problem: How do we evaluate methods offline on logged data? – Goal: to maximize clicks/revenue, not prediction accuracy across the entire system; the cost of predictive inaccuracy varies by instance • E.g., 100% error on a low-CTR article may not matter much because it always co-occurs with a high-CTR article that is predicted accurately
  • 215. Usual Metrics • Predictive accuracy – Root Mean Squared Error (RMSE) – Mean Absolute Error (MAE) – Area under the ROC Curve (AUC) • Other rank-based measures based on retrieval accuracy for top-k – Recall in test data • What fraction of the items that a user actually liked in the test data were among the top-k recommended by the algorithm (fraction of hits, e.g., Karypis, CIKM 2001) • One flaw in several papers – Training and test splits are not based on time • Information leakage • Even in Netflix, this is the case to some extent – Time split per user, not per event; for instance, information may leak if models are based on user-user similarity
  • 216. Metrics continued.. • Recall per event based on the Replay-Match method – Fraction of clicked events where the top recommended item matches the clicked one • This works well if the logged data was collected from a randomized serving scheme; with biased data it could be a problem – We would end up favoring algorithms whose recommendations are similar to the current serving scheme • No reward for novel recommendations
  • 217. Details on Replay-Match method (Li, Langford, et al) • x: feature vector for a visit • r = [r1,r2,…,rK]: reward vector for the K items in inventory • h(x): recommendation algorithm to be evaluated • Goal: Estimate expected reward for h(x) • s(x): recommendation scheme that generated logged-data • x1,..,xT: visits in the logged data • rti: reward for visit t, where i = s(xt)
  • 218. Replay-Match continued • Estimator: average the rewards r_{t, h(x_t)} over visits where the evaluated policy matches the logged item, i.e., h(x_t) = s(x_t), with each matched event importance-weighted • If the importance weights are correct (inversely proportional to the probability that s served that item), it can be shown that the estimator is unbiased • E.g., if s(x) is a random serving scheme, importance weights are uniform over the item set • If s(x) is not random, importance weights have to be estimated through a model
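A minimal sketch of the importance-weighted replay estimator, assuming the log records which item was served and that the serving probabilities (propensities) are known or modeled; all names are illustrative:

    def replay_estimate(logged, h, propensity):
        # logged: list of (x_t, i_t, r_t): context, served item i_t = s(x_t),
        # and its observed reward. propensity(x, i) = Pr[s(x) = i].
        # Only events where the evaluated policy h matches the log count;
        # with correct propensities the estimate is unbiased.
        total = 0.0
        for x, i, r in logged:
            if h(x) == i:
                total += r / propensity(x, i)
        return total / len(logged)

    # For a uniformly random logging scheme over K items:
    #   propensity = lambda x, i: 1.0 / K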
  • 219. Back to Multi-Objective Optimization [Diagram: the recommender serves editorial content on the front page; clicks on FP links influence the downstream supply distribution to the ad server (premium guaranteed display and the cheaper spot market) as well as downstream engagement (time spent)]
  • 220. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: maximize clicks (maximize downstream supply from FP) • But consider the following – Article 1: CTR = 5%, utility per click = 5 – Article 2: CTR = 4.9%, utility per click = 10 • By promoting Article 2, we lose 1 click per 1000 visits but gain 240 utils per 1000 visits • If we do this for a large number of visits, we lose some clicks but obtain significant gains in utility – E.g., lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc.)
  • 221. Why call it Click Shaping? [Figure: supply distribution across Yahoo! properties (autos, buzz, finance, gmy.news, health, hotjobs, movies, new.music, news, omg, realestate, rivals, shine, shopping, sports, tech, travel, tv, video, videogames, other) before and after shaping, with changes in the range of roughly −10% to +10%] • Shaping can happen with respect to any downstream metric (like engagement)
  • 222. Multi-Objective Optimization [Diagram: n articles A_1…A_n grouped into K properties (news, finance, omg, …); m user segments S_1…S_m] – CTR of user segment i on article j: p_ij – Time duration of segment i on article j: d_ij
  • 223. Multi-Objective Program – Approaches: scalarization, goal programming – Simplex constraints on x_ij are always applied; all constraints are linear – Every 10 minutes, solve for x; use this x as the serving scheme for the next 10 minutes
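As an illustration of one possible scalarization (an assumption; the talk does not spell out the exact program): maximize total time spent subject to retaining at least a fixed fraction of the maximum achievable clicks, with simplex constraints on x_ij. A Python sketch using scipy's linprog:

    import numpy as np
    from scipy.optimize import linprog

    def click_shaping(p, d, visits, click_loss=0.05):
        # p[i, j]: CTR of user segment i on article j; d[i, j]: time duration;
        # visits[i]: expected visits of segment i in the next interval.
        p, d = np.asarray(p, float), np.asarray(d, float)
        w = np.asarray(visits, float)[:, None]
        m, n = p.shape
        max_clicks = (w * p).max(axis=1).sum()  # best article per segment
        c = -(w * d).ravel()                    # linprog minimizes, so negate
        A_ub = [-(w * p).ravel()]               # clicks >= (1 - loss) * max
        b_ub = [-(1.0 - click_loss) * max_clicks]
        A_eq = np.zeros((m, m * n))             # simplex: sum_j x_ij = 1
        for i in range(m):
            A_eq[i, i * n:(i + 1) * n] = 1.0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                      b_eq=np.ones(m), bounds=(0, 1))
        return res.x.reshape(m, n)              # serving fractions x_ij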
  • 224. Pareto-optimal solution (more in KDD 2011)
  • 225. Summary • Modern recommendation systems on the web crucially depend on extracting intelligence from massive amounts of data collected on a routine basis • Lots of data and processing power are not enough; the number of things we need to learn grows with data size • Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here • Continuous and adaptive experimentation in a judicious manner is crucial to maximize performance – Again, ML has a big role to play • Multi-objective optimization is often required; the objectives are application dependent – ML has to work in close collaboration with engineering, product & business execs
  • 227. Recall: Some examples • Simple version – I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my website. How do I take a holistic approach and perform a simultaneous optimization?
  • 228. For the simple version • Multi-position optimization – Explore/exploit, optimal subset selection • Explore/Exploit strategies for large content pool and high dimensional problems – Some work on hierarchical bandits but more needs to be done • Constructing user profiles from multiple sources with less than full coverage – Couple of papers at KDD 2011 • Content understanding • Metrics to measure user engagement (other than CTR)
  • 229. Other problems • Whole page optimization – Incorporating correlations • Incentivizing User generated content • Incorporating Social information for better recommendation (News Feed Recommendation) • Multi-context Learning
  • 232. EXAMPLE: DISPLAY AD PLACEMENTS ON LINKEDIN
  • 234. LinkedIn Advertising: Flow – Ad request with member profile (e.g., region = US, age = 20) and context (e.g., profile page, 300×250 ad slot) – Filter campaigns (targeting criteria, frequency cap, budget pacing) to obtain the campaigns eligible for auction – Response prediction engine scores eligible campaigns; automatic format selection – Auction sorted by Bid × CTR; click cost = Bid₃ × CTR₃ / CTR₂ – Serving constraint: < 100 millisec
  • 235. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi (identity, behavioral, network) – Campaign feature vector: cj (text, adv-id,…) – Context feature vector: zk (page type, device, …) • Model:
  • 236. CTR Prediction Model for Ads • Feature vectors – Member feature vector: xi – Campaign feature vector: cj – Context feature vector: zk • Model: Cold-start component Warm-start per-campaign component
  • 237. CTR Prediction Model for Ads • Feature vectors – Member feature vector: x_i – Campaign feature vector: c_j – Context feature vector: z_k • Model: the log-odds of a click decompose into a cold-start component (a function of member, campaign, and context features) plus a warm-start per-campaign component; both components can have L2 penalties
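A hedged sketch of how this decomposition might be scored at serving time; the feature functions and model form below are simplified assumptions for illustration, not LinkedIn's implementation:

    import numpy as np

    def predict_ctr(f_cold, g_warm, theta_w, theta_c, campaign_id):
        # Illustrative cold-start + warm-start scoring:
        #   log-odds = theta_w . f_cold              (global cold-start)
        #            + theta_c[campaign_id] . g_warm (per-campaign warm-start)
        # A brand-new campaign has no warm-start coefficients yet, so the
        # prediction falls back to the cold-start component alone.
        theta_cj = theta_c.get(campaign_id, np.zeros_like(g_warm))
        logit = f_cold @ theta_w + g_warm @ theta_cj
        return 1.0 / (1.0 + np.exp(-logit))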
  • 238. Model Fitting • Single machine (well understood) – Conjugate gradient – L-BFGS – Trust region – … • Model training with large-scale data – Cold-start component Θw is more stable • Weekly/bi-weekly training is good enough • However: difficulty comes from the need for large-scale logistic regression – Warm-start per-campaign model Θc is more dynamic • New items can get generated at any time • Big loss if opportunities are missed • Need to update the warm-start component as frequently as possible
  • 239. Model Fitting (continued) • The two training problems called out: – Cold-start component Θw: one large-scale logistic regression – Warm-start component: per-item logistic regression for Θc, given the cold-start component
  • 240. Explore/Exploit with Logistic Regression [Figure: positive (+) and negative (−) examples in feature space, with the cold-start decision boundary, the cold + warm-start boundary for an ad-id, and the posterior of the warm-start coefficients] • E/E: sample a line (coefficient vector) from the posterior (Thompson sampling)
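A minimal Thompson-sampling sketch for this setting, assuming a Gaussian (e.g., Laplace) approximation to the warm-start posterior; all names are illustrative:

    import numpy as np
    rng = np.random.default_rng()

    def thompson_ctr(f_cold, g_warm, theta_w, warm_mean, warm_cov):
        # Draw one warm-start coefficient vector ("a line") from its
        # approximate Gaussian posterior N(warm_mean, warm_cov) and
        # score the impression with it.
        theta_cj = rng.multivariate_normal(warm_mean, warm_cov)
        logit = f_cold @ theta_w + g_warm @ theta_cj
        return 1.0 / (1.0 + np.exp(-logit))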
  • 241. Models Considered • CONTROL: per-campaign CTR counting model • COLD-ONLY: only cold-start component • LASER: our model (cold-start + warm-start) • LASER-EE: our model with Explore-Exploit using Thompson sampling
  • 242. Metrics • Model metrics (offline) – Test Log-likelihood – AUC/ROC – Observed/Expected ratio • Online metrics (Online A/B Test) – CTR – CPM (Revenue per impression) – Unique ads per user (diversity)
  • 243. Observed / Expected Ratio • Offline replay is difficult with a large item pool (randomization is costly) • Observed: #clicks in the data; Expected: sum of predicted CTR over all impressions • Not a "standard" classifier metric, but useful for this application • What we usually see: Observed / Expected < 1 – Quantifies the "winner's curse," a.k.a. selection bias in auctions • When choosing from among thousands of candidates, an item with mistakenly over-estimated CTR may end up winning the auction • Particularly helpful in spotting inefficiencies by segment – E.g., by bid, number of impressions in training (warmness), geo, etc. – Allows us to see where the model might be giving too much weight to the wrong campaigns • High correlation between the O/E ratio and online model performance
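Computing the O/E ratio by segment is straightforward; a small illustrative sketch:

    import numpy as np

    def oe_ratio_by_segment(clicks, predicted_ctr, segment):
        # clicks: 0/1 outcomes per impression; predicted_ctr: model scores;
        # segment: a label per impression (e.g., campaign-warmness bucket).
        # O/E < 1 in a segment suggests over-prediction (winner's curse).
        clicks = np.asarray(clicks, float)
        predicted_ctr = np.asarray(predicted_ctr, float)
        segment = np.asarray(segment)
        return {s: clicks[segment == s].sum() / predicted_ctr[segment == s].sum()
                for s in np.unique(segment)}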
  • 244. Online A/B Test • Three models – CONTROL (10%) – LASER (85%) – LASER-EE (5%) • Segmented Analysis – 8 segments by campaign warmness • Degree of warmness: the number of training samples available in the training data for the campaign • Segment #1: Campaigns with almost no data in training • Segment #8: Campaigns that are served most heavily in the previous batches so that their CTR estimate can be quite accurate
  • 245. Daily CTR Lift Over Control [Figure: percentage CTR lift over control for LASER and LASER-EE on each of seven days; both lifts are positive (absolute values withheld)]
  • 246. Daily CPM Lift Over Control [Figure: percentage eCPM lift over control for LASER and LASER-EE on each of seven days; both lifts are positive (absolute values withheld)]
  • 247. CPM Lift By Campaign Warmness Segments [Figure: percentage CPM lift over control by campaign warmness segment (1–8) for LASER and LASER-EE; axis values withheld]
  • 248. O/E Ratio By Campaign Warmness Segments [Figure: observed clicks / expected clicks by campaign warmness segment (1–8) for CONTROL, LASER, and LASER-EE; y-axis from 0.5 to 1.0]
  • 249. Number of Campaigns Served [Figure: improvement from explore/exploit in the number of campaigns served]
  • 250. Insights • Overall performance: – LASER and LASER-EE are both much better than control – LASER and LASER-EE performance are very similar • Great news! We get exploration without much additional cost • Exploration has other benefits – LASER-EE serves significantly more campaigns than LASER – Provides healthier market place, more ad diversity per user (better experience)
  • 251. Solutions to Practical Problems • Rapid model development cycle – Quick reaction to changes in data, product – Write once for training, testing, inference • Can adapt to changing data – Integrated Thompson sampling explore/exploit – Automatic training – Multiple training frequencies for different parts of model • Good tools yield good models – Reusable components for feature extraction and transformation – Very high-performance inference engine for deployment – Modelers can concentrate on building models, not re-writing common functions or worrying about production issues
  • 252. Summary • Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective mechanism to solve response prediction problems in advertising • Partitioning model components into cold-start (stable) and warm-start (non-stationary) parts with different training frequencies is an effective mechanism to scale computations • ADMM with a few modifications is an effective model training strategy for large data with high dimensionality • The methods work well for LinkedIn advertising, with significant improvements
  • 253. Theory vs. Practice • Textbook: – Data is stationary – Training data is clean – Training is hard; testing and inference are easy – Models don't change – Complex algorithms work best • Reality: – Features and items change constantly – Fraud, bugs, tracking delays, online/offline inconsistencies, etc. – All aspects have challenges at web scale – Never-ending processes of improvement – Simple models with good features and lots of data win
  • 254. Current Work: Feed Recommendation – Network updates: job changes, job anniversaries, connections, endorsements, photo uploads, … – Content: articles by influencers, shares by friends, content in followed channels, content by followed companies, job recommendations, … – Sponsored updates: company updates, jobs
  • 255. Tiered Approach to Ranking – A second-pass ranker (BLENDER) blends the disparate results returned by first-pass rankers and outputs the top k [Diagram: first-pass rankers for network updates, content, ads, and jobs feed into the BLENDER]
  • 256. Challenges • Personalization – Viewer-actor affinity by type (depends on strength of connections in multiple contexts) – Blending identity and behavioral data • Frequency discounting, freshness, diversification • Multi-objectives (Revenue, Engagement) • A/B tests with interference • Engagement metrics – Function of various actions that optimize long-term engagement metric like return visits • Summarization and adding new content types