ENAR short course
1. Statistical Computing For Big Data
Deepak Agarwal
LinkedIn Applied Relevance Science
dagarwal@linkedin.com
ENAR 2014, Baltimore, USA
2. Main Collaborators: several others at both Y! and LinkedIn
• I wouldn't be here without them; extremely lucky to work with such talented individuals
Bee-Chung Chen, Liang Zhang, Bo Long, Jonathan Traupman, Paul Ogilvie
3. Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Some statistical computations using Map-Reduce
• Bootstrap, Logistic Regression
• Part II: Recommender Systems for Web Applications
– Introduction
– Content Recommendation
– Online Advertising
4. Big Data Becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
5. Big Data: Some Size Estimates
• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (> 140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: > 280M members worldwide
• Twitter: > 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day
6. Big Data: Paradigm Shift
• Classical statistics
– Generalize using small data
• Paradigm shift with big data
– We now have an almost infinite supply of data
– Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
– Not quite
• More data comes with more heterogeneity
• We need to change our statistical thinking to adapt
– Classical statistics is still invaluable for thinking about big-data analytics
7. Some Statistical Challenges
• Exploratory data analysis (EDA), visualization
– Retrospective (on terabytes)
– More real-time (streaming computations every few minutes/hours)
• Statistical modeling
– Scale (computational challenge)
– Curse of dimensionality
• Millions of predictors, heterogeneity
– Temporal and spatial correlations
8. Statistical Challenges (continued)
• Experiments
– To test new methods and hypotheses via randomized experiments
– Adaptive experiments
• Forecasting
– Planning, advertising
• Many more that I am not fully well versed in
9. Defining Big Data
• How do you know you have a big-data problem?
– Is it only the number of terabytes?
– What about dimensionality, structured vs. unstructured data, the computations required, …?
• No clear definition; different points of view
– One view: when the desired computation cannot be completed in the stipulated time with the current best algorithm, using the cores available on a commodity PC
10. Distributed Computing for Big Data
• Distributed computing is an invaluable tool for scaling computations to big data
• Some distributed computing models:
– Multi-threading
– Graphics Processing Units (GPU)
– Message Passing Interface (MPI)
– Map-Reduce
11. Evaluating a Method for a Problem
• Scalability
– Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
– Especially in an industrial environment
• Cost
– Hardware and cost of maintenance
• Good for the computations required?
– E.g., iterative versus one-pass
• Resource sharing
12. Multithreading
• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle gigabytes of data
• Reliable
13. Graphics Processing Units (GPU)
• Number of cores:
– CPU: order of 10
– GPU: smaller cores, order of 1000
• Can be > 100x faster than a CPU
– Parallel, computationally intensive tasks are off-loaded to the GPU
• Good for certain computationally intensive tasks
• Can only handle gigabytes of data
• Not trivial to use; efficient use requires a good understanding of the low-level architecture
– But things are changing; it is getting more user friendly
14. Message Passing Interface (MPI)
• Language-independent communication protocol among processes (e.g., computers)
• Most suitable for the master/slave model
• Can handle terabytes of data
• Good for iterative processing
• Fault tolerance is low
15. Map-Reduce (Dean & Ghemawat, 2004)
Data → Mappers → Reducers → Output
• Computation is split into Map (scatter) and Reduce (gather) stages
• Easy to use:
– The user only needs to implement two functions: Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
16. Comparison of Distributed Computing Methods

                          Multithreading  GPU                            MPI        Map-Reduce
Scalability (data size)   Gigabytes       Gigabytes                      Terabytes  Terabytes
Fault tolerance           High            High                           Low        High
Maintenance cost          Low             Medium                         Medium     Medium-High
Iterative process cost    Cheap           Cheap                          Cheap      Usually expensive
Resource sharing          Hard            Hard                           Easy       Easy
Easy to implement?        Easy            Needs low-level GPU knowledge  Easy       Easy
17. Example Problem
• Tabulating word counts in a corpus of documents
• Similar to the table function in R
18. Word Count Through Map-Reduce
• Mapper 1 input: "Hello World Bye World" → emits <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>
• Mapper 2 input: "Hello Hadoop Goodbye Hadoop" → emits <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1>
• Reducer 1 (words from A-G) outputs: <Bye, 1>, <Goodbye, 1>
• Reducer 2 (words from H-Z) outputs: <Hello, 2>, <World, 2>, <Hadoop, 2>
19. Key Ideas about Map-Reduce
Big Data → Partition 1, Partition 2, …, Partition N
→ Mapper 1, Mapper 2, …, Mapper N, each emitting <Key, Value> pairs
→ Reducer 1, Reducer 2, …, Reducer M
→ Output 1, Output 2, …, Output M
20. Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values corresponding to that key according to the user-defined reducer function and outputs the result
21. Compute the Mean for Each Group

ID  Group No.  Score
 1      1       0.5
 2      3       1.0
 3      1       0.8
 4      2       0.7
 5      2       1.5
 6      3       1.2
 7      1       0.8
 8      2       0.9
 9      4       1.3
 …      …       …
22. Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
– For each row: Key = Group No., Value = Score
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
– E.g., with 2 reducers:
– Reducer 1 receives data with key = 1, 2
– Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
– E.g., Reducer 1: <key=1, values=[0.5, 0.8, 0.8]>, <key=2, values=[0.7, 1.5, 0.9]>
• For each key, the reducer processes the values corresponding to that key according to the user-defined reducer function and outputs the result
– E.g., Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
23. Key Ideas about Map-Reduce
(Repeats the previous slide; the callout marks the mapper and reducer functions as what you need to implement.)
24. Pseudo Code (in R)

Mapper:
  Input: Data
  for (row in Data) {
    groupNo <- row$groupNo
    score <- row$score
    Output(c(groupNo, score))
  }

Reducer:
  Input: Key (groupNo), Value (a list of scores that belong to the Key)
  count <- 0
  sum <- 0
  for (v in Value) {
    sum <- sum + v
    count <- count + 1
  }
  Output(c(Key, sum / count))
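The same logic as a runnable single-machine sketch (Python rather than R, purely for illustration, with the shuffle stage written out; the rows are from the earlier table):

```python
from collections import defaultdict

# Rows of the table above: (ID, group no., score)
data = [(1, 1, 0.5), (2, 3, 1.0), (3, 1, 0.8), (4, 2, 0.7), (5, 2, 1.5),
        (6, 3, 1.2), (7, 1, 0.8), (8, 2, 0.9), (9, 4, 1.3)]

def mapper(row):
    _id, group_no, score = row
    return (group_no, score)              # key = group no., value = score

grouped = defaultdict(list)               # the shuffle stage
for row in data:
    key, value = mapper(row)
    grouped[key].append(value)

def reducer(key, values):
    return (key, sum(values) / len(values))

means = dict(reducer(k, v) for k, v in grouped.items())
# means[1] == 0.7, means[3] == 1.1, means[4] == 1.3
```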
25. Exercise 1
• Problem: average height per {Grade, Gender}?
• What should the mapper output key be?
• What should the mapper output value be?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this.

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
26. Exercise 1 (Solution)
• Problem: average height per Grade and Gender?
• Mapper output key?
– {Grade, Gender}
• Mapper output value?
– Height
• Reducer input?
– Key: {Grade, Gender}; Value: list of heights
• Reducer output?
– {Grade, Gender, mean(Heights)}

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
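A possible solution sketch in Python (the same single-machine simulation as before; the three sample rows stand in for real data):

```python
from collections import defaultdict

# The sample rows from the slide: (student id, grade, gender, height in cm)
students = [(1, 3, "M", 120), (2, 2, "F", 115), (3, 2, "M", 116)]

def mapper(row):
    _sid, grade, gender, height = row
    return ((grade, gender), height)      # composite key, height as value

grouped = defaultdict(list)               # the shuffle stage
for row in students:
    key, value = mapper(row)
    grouped[key].append(value)

def reducer(key, heights):
    grade, gender = key
    return (grade, gender, sum(heights) / len(heights))

results = [reducer(k, v) for k, v in grouped.items()]
```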
27. Exercise 2
• Problem: number of students per {Grade, Gender}?
• What should the mapper output key be?
• What should the mapper output value be?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this.

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
28. Exercise 2 (Solution)
• Problem: number of students per {Grade, Gender}?
• Mapper output key?
– {Grade, Gender}
• Mapper output value?
– 1
• Reducer input?
– Key: {Grade, Gender}; Value: list of 1's
• Reducer output?
– {Grade, Gender, sum(value list)}
– OR: {Grade, Gender, length(value list)}

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
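The count version differs only in the emitted value; a minimal Python sketch:

```python
from collections import defaultdict

students = [(1, 3, "M", 120), (2, 2, "F", 115), (3, 2, "M", 116)]

counts = defaultdict(int)
for _sid, grade, gender, _height in students:
    # Mapper emits key = (grade, gender), value = 1;
    # the reducer sums the 1's for each key.
    counts[(grade, gender)] += 1
```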
29. More on Map-Reduce
• Depends on a distributed file system
• Typically the mappers are the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
– Data transmission from mappers to reducers goes through disk copy
• Iterative processing through Map-Reduce
– Each iteration becomes a Map-Reduce job
– Can be expensive, since Map-Reduce overhead is high
30. The Apache Hadoop System
• Open-source software for reliable, scalable, distributed computing
• The most popular distributed computing system in the world
• Key modules:
– Hadoop Distributed File System (HDFS)
– Hadoop YARN (job scheduling and cluster resource management)
– Hadoop MapReduce
31. Major Tools on Hadoop
• Pig
– A high-level language for Map-Reduce computation
• Hive
– A SQL-like query language for data querying via Map-Reduce
• HBase
– A distributed, scalable database on Hadoop
– Allows random, real-time read/write access to big data
– Voldemort is similar to HBase
• Mahout
– A scalable machine learning library
• …
32. Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
– http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
– http://hadoop.apache.org/docs/stable/cluster_setup.html
33. Hadoop Distributed File System (HDFS)
• Master/slave architecture
• NameNode: a single master node that controls which data block is stored where
• DataNodes: slave nodes that store data and do R/W operations
• Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
• Big data are split into equal-sized blocks; each block can be stored on a different DataNode
• Disk failure tolerance: data are replicated multiple times
34. Load the Data into Pig

A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);

– The path of the data on HDFS comes after LOAD
• USING PigStorage() means the data are delimited by tabs (can be omitted)
• If the data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
35. Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– Examples of Statistical Computing for Big Data
• Bag of Little Bootstraps
• Large-Scale Logistic Regression
37. Bootstrap (Efron, 1979)
• A resampling-based method to obtain the statistical distribution of sample estimators
• Why are we interested?
– Resampling is embarrassingly parallelizable
• Example: standard deviation of the mean of N samples (μ)
– For i = 1 to r do
• Randomly sample with replacement N times from the original sample → bootstrap data i
• Compute the mean of the i-th bootstrap data → μi
– Estimate of Sd(μ) = Sd([μ1, …, μr])
– r is usually a large number, e.g. 200
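A minimal Python illustration of this procedure, using standard normal data as a stand-in so the answer can be checked against the theoretical value σ/√N = 1/√1000 ≈ 0.032:

```python
import random
import statistics

random.seed(0)
N, r = 1000, 200
sample = [random.gauss(0, 1) for _ in range(N)]   # stand-in data

boot_means = []
for _ in range(r):
    # Sample with replacement N times from the original sample.
    resample = random.choices(sample, k=N)
    boot_means.append(statistics.fmean(resample))

sd_of_mean = statistics.stdev(boot_means)
# Should come out close to 1 / sqrt(1000) ~ 0.032
```

Because the r resamples are independent, each iteration of the outer loop could run on a different node, which is the parallelization the slide refers to.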
38. Bootstrap for Big Data
• Can have r nodes running in parallel, each sampling one bootstrap data set
• However…
– N can be very large
– The data may not fit into memory
– Collecting N samples with replacement on each node can be computationally expensive
39. M out of N Bootstrap (Bickel et al., 1997)
• Obtain SdM(μ) by drawing M samples with replacement for each bootstrap, where M < N
• Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of the sample estimates
• However…
– Prior knowledge is required
– The choice of M is critical to performance
– Finding the optimal value of M needs more computation
40. Bag of Little Bootstraps (BLB)
• Example: standard deviation of the mean
• Generate S sampled data sets, each obtained by random sampling without replacement of a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the data of size b
• Compute the mean of the resampled data → μpi
– Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
41. Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from data of size N
– ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by random sampling without replacement of a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the data of size b
• Compute the estimate on the resampled data → θpi
– Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
42. Bag of Little Bootstraps (BLB)
(Repeats the previous slide, annotated with where each step runs on Hadoop: the subset resampling and per-resample estimates in the Mappers, the per-subset ξp(θ) in the Reducers, and the final average on the Gateway.)
43. Why is BLB Efficient?
• Before:
– N samples with replacement from data of size N is expensive when N is large
• Now:
– N samples with replacement from data of size b
– b can be several orders of magnitude smaller than N (e.g., b = N^γ, γ in [0.5, 1))
– Equivalent to a multinomial sampler with dim = b
– Storage = O(b), computational complexity = O(b)
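The multinomial trick can be sketched in a few lines of numpy (all sizes here are illustrative; `rng.multinomial` draws the whole count vector at once, so each size-N resample costs O(b) rather than O(N)):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
data = rng.normal(0.0, 1.0, size=N)   # stand-in data

gamma = 0.7
b = int(N ** gamma)                   # subset size ~3162, far smaller than N
S, r = 5, 50

subset_sds = []
for _ in range(S):
    subset = rng.choice(data, size=b, replace=False)
    boot_means = []
    for _ in range(r):
        # One size-N resample of the b points is just a multinomial
        # count vector over the b points: storage O(b), work O(b).
        weights = rng.multinomial(N, np.ones(b) / b)
        boot_means.append(np.dot(weights, subset) / N)
    subset_sds.append(np.std(boot_means, ddof=1))

sd_of_mean = float(np.mean(subset_sds))
# Compare with sigma / sqrt(N) = 1 / sqrt(100000) ~ 0.0032
```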
44. Simulation Experiment
• 95% CI of logistic regression coefficients
• N = 20000, 10 explanatory variables
• Relative error = |estimated CI width − true CI width| / true CI width
• BLB-γ: BLB with b = N^γ
• BOFN-γ: b out of N sampling with b = N^γ
• BOOT: naïve bootstrap
46. Real Data
• 95% CI of logistic regression coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB; r = 50, s = 5, γ = 0.7
47. Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
– Fast and efficient
– Easy to parallelize
– Easy to understand and implement
– Friendly to Hadoop; makes it routine to perform statistical calculations on big data
49. Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 − pi)) = Xi^T β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)
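A quick numpy sketch of this generative model (the sizes and λ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 3, 1.0

# beta ~ MVN(0, (1/lambda) I), i.e. independent N(0, 1/lambda) coordinates
beta = rng.normal(0.0, np.sqrt(1.0 / lam), size=d)
X = rng.normal(size=(n, d))

# log(p / (1 - p)) = X beta  <=>  p = 1 / (1 + exp(-X beta))
p = 1.0 / (1.0 + np.exp(-X @ beta))
Y = rng.binomial(1, p)       # Y_i ~ Bernoulli(p_i)
```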
50. Large-Scale Logistic Regression
• Binary response: Y
– E.g., click / non-click on an ad on a webpage
• Covariates: X
– User covariates:
• Age, gender, industry, education, job, job title, …
– Item covariates:
• Categories, keywords, topics, …
– Context covariates:
• Time, page type, position, …
– 2-way interactions:
• User covariates × item covariates
• Context covariates × item covariates
• …
51. Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a single machine is not feasible
• Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
– Multiple passes over the data
52. Recap on Optimization Methods
• Problem: find x to minimize F(x)
• Iteration n: xn = xn−1 − bn−1 F′(xn−1)
• bn−1 is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
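A minimal sketch of this iteration with a fixed step size, on a one-dimensional toy objective:

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    # x_n = x_{n-1} - b_{n-1} * F'(x_{n-1}); here b is held fixed.
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:          # "iterate until convergence"
            break
        x = x - step * g
    return x

# Toy objective F(x) = (x - 3)^2, so F'(x) = 2(x - 3); minimizer is x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

On big data the expensive part is evaluating F′, which needs a full pass over the data per iteration; this is what makes the Hadoop cost model of the next slides relevant.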
53. Iterative Process with Hadoop
Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → …
(each iteration reads its input from disk and writes its output to disk)
54. Limitations of Hadoop for Fitting a Big Logistic Regression
• The iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• Mapper and reducer I/O both go through disk
• Plus: time spent waiting in the job queue
• Q: Can we find a fitting method that scales with Hadoop?
55. Large-Scale Logistic Regression
• Naïve approach:
– Partition the data and run logistic regression for each partition
– Take the mean of the learned coefficients
– Problem: not guaranteed to converge to the model from a single machine!
• Alternating Direction Method of Multipliers (ADMM)
– Boyd et al., 2011
– Set up constraints: each partition's coefficients = global consensus
– Solve the optimization problem using Lagrange multipliers
– Advantage: guaranteed to converge to a single-machine logistic regression on the entire data, within a reasonable number of iterations
56. Large-Scale Logistic Regression via ADMM (Iteration 1)
BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K
→ a logistic regression is fit on each partition → consensus computation
57. Large-Scale Logistic Regression via ADMM (Iteration 1, continued)
The consensus is fed back to each partition's logistic regression
58. Large-Scale Logistic Regression via ADMM (Iteration 2)
Each partition refits its logistic regression and the consensus is recomputed
60. Dual Ascent Method
• Consider the equality-constrained convex optimization problem
    minimize f(x) subject to Ax = b,
  with variable x ∈ R^n, where A ∈ R^{m×n} and f: R^n → R is convex
• Lagrangian for the problem:
    L(x, y) = f(x) + y^T (Ax − b)
  with dual function g(y) = inf_x L(x, y) = −f*(−A^T y) − b^T y,
  where y is the dual variable (Lagrange multiplier) and f* is the convex conjugate of f
• Dual ascent solves the dual problem by gradient ascent: the gradient is the constraint residual A x⁺ − b evaluated at x⁺ = argmin_x L(x, y), giving the iterations
    x^{k+1} := argmin_x L(x, y^k)
    y^{k+1} := y^k + α^k (A x^{k+1} − b),
  where α^k > 0 is a step size and the superscript is the iteration counter
61. Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• The augmented Lagrangian is
    Lρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ||Ax − b||²₂,
  where ρ > 0 is called the penalty parameter
• The value of ρ influences the convergence rate
62. Alternating Direction Method of Multipliers (ADMM)
• Problem:
    minimize f(x) + g(z) subject to Ax + Bz = c,
  with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, c ∈ R^p, and f and g are convex; the variable is split into two parts, x and z, with the objective separable across this splitting
• Augmented Lagrangian:
    Lρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||²₂
• ADMM consists of the iterations
    x^{k+1} := argmin_x Lρ(x, z^k, y^k)
    z^{k+1} := argmin_z Lρ(x^{k+1}, z, y^k)
    y^{k+1} := y^k + ρ (A x^{k+1} + B z^{k+1} − c),
  where ρ > 0
63. Large-Scale Logistic Regression via ADMM
• Notation
– (Xi, yi): data in the i-th partition
– βi: coefficient vector for partition i
– β: consensus coefficient vector
– r(β): penalty component, such as ||β||²₂
• Optimization problem
    min Σ_{i=1}^{N} l_i(y_i, X_i^T β_i) + r(β)
    subject to β_i = β
64. ADMM Updates
• Local regressions, with shrinkage towards the current best global estimate
• Updated consensus
65. An Example Implementation
• ADMM for logistic regression model fitting with L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
– Mapper: partition the data into K partitions
– Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
– Gateway: consensus computation from the results of all reducers; sends the consensus back to each reducer node
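A toy single-machine sketch of the three roles (plain numpy, with a gradient-descent inner loop standing in for liblinear/glmnet; the partition count, ρ, λ, and all names are illustrative assumptions, not the deployed system):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 4000, 5, 4
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# "Mapper": split the data into K partitions.
Xs, ys = np.array_split(X, K), np.array_split(y, K)

def local_fit(Xp, yp, z, u, rho, steps=200, lr=0.1):
    # "Reducer": minimize average logistic loss on this partition
    # plus the ADMM proximal term (rho/2)||beta - z + u||^2.
    beta = z.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xp @ beta))
        grad = Xp.T @ (p - yp) / len(yp) + rho * (beta - z + u)
        beta = beta - lr * grad
    return beta

rho, lam = 1.0, 0.1
z = np.zeros(d)                       # consensus coefficients
us = [np.zeros(d) for _ in range(K)]  # scaled dual variables
for _ in range(30):                   # each pass = one Map-Reduce job
    betas = [local_fit(Xs[k], ys[k], z, us[k], rho) for k in range(K)]
    # "Gateway": consensus update for the L2 penalty lam * ||z||^2.
    z = rho * sum(b + u for b, u in zip(betas, us)) / (2 * lam + K * rho)
    us = [u + b - z for u, b in zip(us, betas)]
# z now approximates a single-machine regularized logistic regression fit
```

Only the small vectors βi and ui cross machine boundaries in the real deployment; the raw data never leave their partitions.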
66. KDD Cup 2010 Data
• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keeping covariates with >= 10 occurrences leaves 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
69. Better Convergence Can Be Achieved By
• Better initialization
– Use results from the naïve method to initialize the parameters
• Adaptively changing the step size (ρ) for each iteration, based on the convergence status of the consensus
72. Three components we will focus on
• Defining the problem
– Formulate objectives whose optimization achieves some long-
term goals for the recommender system
• E.g. How to serve content to optimize audience reach and engagement, or
optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
– Predict rates of some positive user interaction(s) with items
based on data obtained from historical user-item interactions
• E.g. Click rates, average time-spent on page, etc
• Could be explicit feedback like ratings
• Experimentation
– Create experiments to collect data proactively to improve models;
this helps converge to the best choice(s) cheaply and rapidly.
• Explore and Exploit (continuous experimentation)
• DOE (testing hypotheses by avoiding bias inherent in data)
73. Modern Recommendation Systems
• Goal
– Serve the right item to a user in a given context to
optimize long-term business objectives
• A scientific discipline that involves
– Large scale Machine Learning & Statistics
• Offline Models (capture global & stable characteristics)
• Online Models (incorporates dynamic components)
• Explore/Exploit (active and adaptive experimentation)
– Multi-Objective Optimization
• Click-rates (CTR), Engagement, advertising revenue, diversity, etc
– Inferring user interest
• Constructing User Profiles
– Natural Language Processing to understand content
• Topics, “aboutness”, entities, follow-up of something, breaking news,…
74. Some examples from content
optimization
• Simple version
– I have a content module on my page, content inventory is
obtained from a third party source which is further refined
through editorial oversight. Can I algorithmically
recommend content on this module? I want to improve
overall click-rate (CTR) on this module
• More advanced
– I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I
increase downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
75. Example modules
• Recommend applications
• Recommend search queries
• Recommend news articles
• Recommend packages: image, title, summary, links to other pages
– Pick 4 out of a pool of K (K = 20-40)
– Dynamic: routes traffic to other pages
76. Problems in this example
• Optimize CTR on multiple modules
– Today Module, Trending Now, Personal Assistant,
News
– Simple solution: Treat modules as independent,
optimize separately. May not be the best when
there are strong correlations.
• For any single module
– Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
77. Online Advertising
• Advertisers submit ads and bids to an ad network (e.g., Yahoo, Google, MSN, or ad exchanges such as RightMedia, DoubleClick, …)
• A user visits a publisher page; an auction selects argmax f(bid, response rates)
• An ML/statistical model estimates the response rates (click, conversion, ad-view)
• The network recommends the best ad(s) for the page; the user's clicks and conversions feed back into the model
78. LinkedIn Today: Content Module
• Objective: serve content to maximize engagement metrics like CTR (or weighted CTR)
80. Right Media Ad Exchange: Unified Marketplace
• Matches ads to page views on publisher sites
• A publisher with an ad impression to sell runs an auction among bidders (e.g., AdSense, Ad.com)
• Example bids: $0.50; $0.60; $0.75 placed via a network, which becomes a $0.45 bid; $0.65 wins the auction
81. Recommender Problems in General
• Item inventory: articles, web pages, ads, …
• Context: query, page, …
• Use an automated algorithm to select item(s) to show to the user
• Get feedback (click, time spent, …); refine the models; repeat (a large number of times)
• Optimize metric(s) of interest (total clicks, total revenue, …)
• Example applications: search (web, vertical), online advertising, content, …
82. Important Factors
• Items: articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement)
– Currently, most applications are single-objective
– Could be multi-objective optimization (maximize X subject to Y, Z, …)
• Properties of the item pool
– Size (e.g., all web pages vs. 40 stories)
– Quality of the pool (e.g., anything vs. editorially selected)
– Lifetime (e.g., mostly old items vs. mostly new items)
83. Factors Affecting the Solution (continued)
• Properties of the context
– Pull: specified by an explicit, user-driven query (e.g., keywords, a form)
– Push: specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on the continuum of pull and push
• Properties of the feedback on the matches made
– Types and semantics of feedback (e.g., click, vote)
– Latency (e.g., available in 5 minutes vs. 1 day)
– Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches
– E.g., business rules, diversity rules, editorial voice
– Multiple objectives
• Available metadata (e.g., link graph, various user/item attributes)
84. Predicting User-Item Interactions
(e.g. CTR)
• Myth: we have so much data on the web that if only we can
process it, the problem is solved
– Number of things to learn increases with sample size
• Rate of increase is not slow
– Dynamic nature of systems make things worse
– We want to learn things quickly and react fast
• Data is sparse in web recommender problems
– We lack enough data to learn all we want to learn and
as quickly as we would like to learn
– Several Power laws interacting with each other
• E.g. User visits power law, items served power law
– Bivariate Zipf: Owen & Dyer, 2011
85. Can Machine Learning help?
• Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable
– E.g. Users in San Francisco tend to read more baseball news
• Key issue: Estimating such groups
– Coarse group : more stable but does not generalize that well.
– Granular group: less stable with few individuals
– Getting a good grouping structure means hitting the “sweet spot”
• Another big advantage on the web
– Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choice(s)
• We don’t need to learn all user-item interactions, only those that are good.
86. Predicting User-Item Interaction Rates
• Offline models (capture stable characteristics at coarse resolutions; logistic regression, boosting, …)
• Feature construction
– Content: IR, clustering, taxonomy, entities, …
– User profiles: clicks, views, social, community, …
• Near-online models (finer-resolution corrections at the item and user level; quick updates), initialized from the offline models
• Explore/exploit (adaptive sampling; helps rapid convergence to the best choices)
87. Post-Click: An Example in Content Optimization
• A recommender serves editorial content; clicks on front-page links influence the downstream supply distribution
• Downstream pages carry display advertising (ad server), producing revenue and downstream engagement (time spent)
88. Serving Content on Front Page: Click Shaping
• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
– Article 1: CTR=5%, utility per click = 5
– Article 2: CTR=4.9%, utility per click=10
• By promoting article 2, we lose 0.1 clicks per 100 visits but gain 24 utils (49 vs. 25)
• If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
– E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
89. High-Level Picture
• An http request arrives at the server
• The item recommendation system performs thousands of computations in sub-seconds
• The user interacts (e.g., clicks, or does nothing)
• Statistical models are updated in batch mode, e.g. once every 30 minutes
90. High-Level Overview: Item Recommendation System
• Inputs: user info (activity, profile; updated in batch) and an item index (id, metadata)
• Pre-filter: SPAM, editorial rules, …
• Feature extraction: NLP, clustering, …
• ML/statistical models score items: P(click), P(share), semantic-relevance score, …
• Rank items: sort by score (CTR, bid*CTR, …); combine scores using multi-objective optimization; threshold on some scores; …
• User-item interaction data are batch-processed to update the models
91. ML/Statistical Models for Scoring
(Chart: number of items scored by ML (100, 1000, 100k, 1M, 100M) versus traffic volume and item lifetime (a few hours, a few days, several days), placing LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, and LinkedIn Ads along this spectrum.)
92. Summary of Deployments
• Yahoo! front page Today Module (2008-2011): 300% improvement in click-through rates
– Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! properties (2011): significant improvement in engagement across the Yahoo! network
• Fully deployed on the LinkedIn Today module (2012): significant improvement in click-through rates (numbers not revealed for reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): fully deployed algorithms to estimate response rates (CTR, conversion rates); significant improvement in revenue (numbers not revealed for reasons of confidentiality)
• LinkedIn self-serve ads (2012-2013): fully deployed
• LinkedIn news feed (2013-2014): fully deployed
• Several others in progress…
93. Broad Themes
• Curse of dimensionality
– Large number of observations (rows), large number of potential features
(columns)
– Use domain knowledge and machine learning to reduce the "effective"
dimension (constraints on parameters reduce degrees of freedom)
– I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have control
over what data to collect through clever experimentation
– This can fundamentally change solutions
• Think of computation and models together for Big Data
• Optimization: what we are trying to optimize is often complex; models have
to work in harmony with the optimization
– Pareto optimality with competing objectives
94. Statistical Problem
• Rank items (from an admissible pool) for user visits in some context to
maximize a utility of interest
• Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR * P[Share|Click])
– Revenue per page-view = CTR*bid (more complex due to second price auction)
• CTR is a fundamental measure that opens the door to a more principled
approach to rank items
• Converge rapidly to maximum utility items
– Sequential decision making process (explore/exploit)
95. LinkedIn Today, Yahoo! Today Module: Choose Items to Maximize CTR
• User i visits, with user features (e.g., industry, behavioral features,
demographic features, …)
• Algorithm selects item j from a set of candidates
• Response y_ij observed (click or not)
• Which item should we select?
– The item with highest predicted CTR (exploit)
– An item for which we need data to predict its CTR (explore)
• This is an "Explore/Exploit" problem
96. The Explore/Exploit Problem (to
maximize CTR)
• Problem definition: Pick k items from a pool of N for a large
number of serves to maximize the number of clicks on the
picked items
• Easy!? Pick the items having the highest click-through rates
(CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item may change over time
– How much traffic should be allocated to explore new items to
achieve optimal performance ?
• Too little → Unreliable CTR estimates due to “starvation”
• Too much → Little traffic to exploit the high CTR items
97. Y! Front Page Application
• Simplify: maximize CTR on first slot (F1)
• Item pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the item pool is dynamic
99. Impact of repeat item views on a given user
• The same user is shown an item multiple times (despite not clicking)
100. Simple algorithm to estimate most popular item with small but dynamic
item pool
• Simple Explore/Exploit scheme
– ε% explore: with a small probability (e.g. 5%), choose an item at random
from the pool
– (100−ε)% exploit: with large probability (e.g. 95%), choose the highest
scoring CTR item
• Temporal smoothing
– Item CTRs change over time; give more weight to recent data in estimating
item CTRs
• Kalman filter, moving average
• Discount item score with repeat views
– CTR(item) for a given user drops with repeat views by some "discount"
factor (estimated from data)
• Segmented most popular
– Perform separate most-popular recommendation for each user segment
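A minimal sketch of the ε-greedy scheme above, using an exponentially weighted moving average as the temporal smoothing of item CTRs; the pool contents, ε = 5%, and the decay factor are illustrative assumptions.

```python
import random

class EpsilonGreedy:
    def __init__(self, items, epsilon=0.05, decay=0.9):
        self.epsilon = epsilon
        self.decay = decay                      # weight on past data
        self.ctr = {i: 0.0 for i in items}      # smoothed CTR per item

    def choose(self):
        if random.random() < self.epsilon:      # ε% explore: random item
            return random.choice(list(self.ctr))
        return max(self.ctr, key=self.ctr.get)  # (100−ε)% exploit

    def update(self, item, clicked):
        # temporal smoothing: recent observations get more weight
        self.ctr[item] = self.decay * self.ctr[item] + (1 - self.decay) * clicked

random.seed(0)
policy = EpsilonGreedy(["art1", "art2"])
for _ in range(2000):
    item = policy.choose()
    p = 0.06 if item == "art1" else 0.03        # simulated true CTRs
    policy.update(item, 1.0 if random.random() < p else 0.0)
print(policy.ctr)
```

The decay factor plays the role of the moving average mentioned above; a Kalman-filter update would also track an uncertainty estimate, not just the mean.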
101. Time series Model: Kalman filter
• Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative
fashion
• Estimated click-rate distribution at time t+1
– Prior mean:
– Prior variance:
• High CTR items are more adaptive
102. More economical exploration? Better bandit solutions
• Consider a two-armed problem with unknown payoff probabilities p1 > p2
• The gambler has 1000 plays; what is the best way to experiment (to maximize
total expected reward)?
• This is called the "multi-armed bandit" problem and has been studied for a
long time
• Optimal solution: play the arm that has maximum potential of being good
– Optimism in the face of uncertainty
103. Item Recommendation: Bandits?
• Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
– Greedy: show Item 2 to all; not a good idea
– Item 1's CTR estimate is noisy; the item could potentially be better
• Invest in Item 1 for better overall performance on average
– Exploit what is known to be good, explore what is potentially good
[Figure] Probability density of CTR: Item 2's density is narrow, Item 1's is
wide
104. Next few hours
                            Most Popular Recommendation   Personalized Recommendation
Offline Models              —                             Collaborative filtering (cold-start problem)
Online Models               Time-series models            Incremental CF, online regression
Intelligent Initialization  Prior estimation              Prior estimation, dimension reduction
Explore/Exploit             Multi-armed bandits           Bandits with covariates
106. Problem
• User i visits, with user features x_i (demographics, browse history, search
history, …)
• Algorithm selects item j with item features x_j (keywords, content
categories, …)
• Response y_ij observed (explicit rating, implicit click/no-click)
• Predict the unobserved entries based on features and the observed entries
107. Model Choices
• Feature-based (or content-based) approach
– Use features to predict response
• (regression, Bayes Net, mixture models, …)
– Limitation: need predictive features
• Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF aka Memory based)
– Make recommendation based on past user-item interaction
• User-user, item-item, matrix factorization, …
• See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
– Better performance for old users and old items
– Does not naturally handle new users and new items (cold-
start)
108. Collaborative Filtering (Memory based methods)
• User-user similarity; item-item similarities; incorporating both
• Estimating similarities
– Pearson's correlation
– Optimization based (Koren et al)
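The Pearson-correlation similarity above can be sketched for item-item similarity over co-rating users; the ratings dictionary below is a toy assumption.

```python
from math import sqrt

def pearson_item_sim(ratings, a, b):
    """ratings: {user: {item: rating}}; Pearson similarity of items a and b
    over the users who rated both."""
    common = [u for u in ratings if a in ratings[u] and b in ratings[u]]
    if len(common) < 2:
        return 0.0
    ra = [ratings[u][a] for u in common]
    rb = [ratings[u][b] for u in common]
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = sqrt(sum((x - ma) ** 2 for x in ra)) * sqrt(sum((y - mb) ** 2 for y in rb))
    return num / den if den else 0.0

ratings = {
    "u1": {"i1": 5, "i2": 4, "i3": 1},
    "u2": {"i1": 4, "i2": 5, "i3": 2},
    "u3": {"i1": 1, "i2": 2, "i3": 5},
}
print(round(pearson_item_sim(ratings, "i1", "i2"), 3))  # strongly positive
print(round(pearson_item_sim(ratings, "i1", "i3"), 3))  # strongly negative
```

In practice the similarities would be shrunk toward zero when the number of co-rating users is small.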
109. How to Deal with the Cold-Start Problem
• Heuristic-based approaches
– Linear combination of regression and CF models
– Filterbot
• Add user features as pseudo users and do collaborative filtering
– Hybrid approaches
• Use content based methods to fill up entries, then use CF
• Matrix Factorization
– Good performance on Netflix (Koren, 2009)
• Model-based approaches
– Bilinear random-effects model (probabilistic matrix factorization)
• Good on Netflix data [Ruslan et al ICML, 2009]
– Add feature-based regression to matrix factorization
• (Agarwal and Chen, 2009)
– Add topic discovery (from textual items) to matrix factorization
• (Agarwal and Chen, 2009; Chun and Blei, 2011)
110. Per-item regression models
• When tracking users by cookies, the distribution of visit patterns can get
extremely skewed
– The majority of cookies have 1-2 visits
• Per-item models (regression) based on user covariates are attractive in
such cases
111. Several per-item regressions: Multi-task learning
• Low-dimensional (5-10) matrix B estimated from retrospective data; captures
affinity to old items
• Agarwal, Chen and Elango, KDD, 2010
113. Motivation
• Data measuring k-way interactions pervasive
– Consider k = 2 for all our discussions
• E.g. User-Movie, User-content, User-Publisher-Ads,….
– Power law on both user and item degrees
• Classical Techniques
– Approximate matrix through a singular value
decomposition (SVD)
• After adjusting for marginal effects (user pop, movie pop,..)
– Does not work
• Matrix highly incomplete, severe over-fitting
– Key issue
• Regularization of eigenvectors (factors) to avoid overfitting
114. Early work on complete matrices
• Tukey’s 1-df model (1956)
– Rank 1 approximation of small nearly complete
matrix
• Criss-cross regression (Gabriel, 1978)
• Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s)
• Modern day recommender problems
– Highly incomplete, large, noisy.
116. Factorization – Brief Overview
• Latent user factors: (α_i, u_i = (u_i1, …, u_in))
• Latent movie factors: (β_j, v_j = (v_j1, …, v_jn))
• Interaction: E(y_ij) = µ + α_i + β_j + u_i' B v_j
• (Nn + Mm) parameters
• Key technical issue: will overfit for moderate values of n, m →
regularization
117. Latent Factor Models: Different
Aspects
• Matrix Factorization
– Factors in Euclidean space
– Factors on the simplex
• Incorporating features and ratings
simultaneously
• Online updates
118. Maximum Margin Matrix Factorization (MMMF)
• Complete the matrix by minimizing loss (hinge, squared error) on observed
entries subject to constraints on the trace norm
– Srebro, Rennie, Jaakkola (NIPS 2004)
– Convex, semi-definite programming (expensive, not scalable)
• Fast MMMF (Rennie & Srebro, ICML, 2005)
– Constrain the Frobenius norm of the left and right eigenvector matrices;
not convex, but becomes scalable
• Other variation: Ensemble MMMF (DeCoste,
ICML2005)
– Ensembles of partially trained MMMF (some
improvements)
119. Matrix Factorization for Netflix prize data
• Minimize the objective function
Σ_{ij ∈ obs} (r_ij − u_i' v_j)² + λ (Σ_i ‖u_i‖² + Σ_j ‖v_j‖²)
• Simon Funk: stochastic gradient descent
• Koren et al (KDD 2007): alternating least squares
– They moved to SGD later in the competition
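A toy sketch of the SGD approach to this objective, in the spirit of Simon Funk's solver; the data, factor dimension, and hyper-parameters below are made-up, not Netflix settings.

```python
import random

def sgd_mf(obs, n_users, n_items, k=2, lam=0.02, lr=0.05, epochs=500, seed=0):
    """Minimize sum over observed (i,j) of (r_ij - u_i·v_j)^2
    + lam*(||u_i||^2 + ||v_j||^2) by stochastic gradient descent."""
    rng = random.Random(seed)
    u = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    v = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in obs:
            err = r - sum(ui * vj for ui, vj in zip(u[i], v[j]))
            for f in range(k):
                ui, vj = u[i][f], v[j][f]
                u[i][f] += lr * (err * vj - lam * ui)   # gradient step on u_i
                v[j][f] += lr * (err * ui - lam * vj)   # gradient step on v_j
    return u, v

obs = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 5.0)]
u, v = sgd_mf(obs, n_users=3, n_items=2)
rmse = (sum((r - sum(a * b for a, b in zip(u[i], v[j]))) ** 2
            for i, j, r in obs) / len(obs)) ** 0.5
print(round(rmse, 3))  # small training error on observed entries
```

Note this measures training error only; the regularizer λ is what controls over-fitting on held-out entries.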
120. Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
• Model:
r_ij ~ N(u_i' v_j, σ²)
u_i ~ MVN(0, a_u I)
v_j ~ MVN(0, a_v I)
• Optimization is through iterated conditional modes
• Other variations: constraining the mean through a sigmoid, using
"who-rated-whom"
• Combining with Boltzmann machines also improved performance
121. Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008)
• Fully Bayesian treatment using an MCMC approach
– Significant improvement
• Interpretation as a fully Bayesian hierarchical model shows why that is the
case
– Failing to incorporate uncertainty (e.g., in the variance component a_u)
leads to bias in estimates
– Multi-modal posterior; MCMC helps in converging to a better one
• MCEM also more resistant to over-fitting
122. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010)
• Specify rank probabilistically (automatic rank selection)
y_ij ~ N(Σ_{k=1..r} z_k u_ik v_jk, σ²)
z_k ~ Bernoulli(π_k)
π_k ~ Beta(a/r, b(r−1)/r)
• Marginally, z_k ~ Bernoulli(a/(a + b(r−1)))
• E(#Factors) = ra/(a + b(r−1))
123. How to incorporate features:
Deal with both warm start and cold-start
• Models to predict ratings for new pairs
– Warm-start: (user, movie) present in the training data with large
sample size
– Cold-start: At least one of (user, movie) new or has small sample
size
• Rough definition, warm-start/cold-start is a continuum.
• Challenges
– Highly incomplete (user, movie) matrix
– Heavy tailed degree distributions for users/movies
• Large fraction of ratings from small fraction of users/
movies
– Handling both warm-start and cold-start effectively in the
presence of predictive features
124. Possible approaches
• Large scale regression based on covariates
– Does not provide good estimates for heavy users/movies
– Large number of predictors to estimate interactions
• Collaborative filtering
– Neighborhood based
– Factorization
• Good for warm-start; cold-start dealt with separately
• Single model that handles cold-start and warm-start
– Heavy users/movies → User/movie specific model
– Light users/movies → fallback on regression model
– Smooth fallback mechanism for good performance
126. Regression-based Factorization
Model (RLFM)
• Main idea: Flexible prior, predict factors
through regressions
• Seamlessly handles cold-start and warm-
start
• Modified state equation to incorporate
covariates
127. RLFM: Model
• Rating that user i gives item j:
y_ij ~ N(µ_ij, σ²)           Gaussian model
y_ij ~ Bernoulli(µ_ij)       Logistic model (for binary rating)
y_ij ~ Poisson(µ_ij N_ij)    Poisson model (for counts)
t(µ_ij) = x_ij' b + α_i + β_j + u_i' v_j
• Bias of user i: α_i = g_0' x_i + ε_i^α, ε_i^α ~ N(0, σ_α²)
• Popularity of item j: β_j = d_0' x_j + ε_j^β, ε_j^β ~ N(0, σ_β²)
• Factors of user i: u_i = G x_i + ε_i^u, ε_i^u ~ N(0, σ_u² I)
• Factors of item j: v_j = D x_j + ε_j^v, ε_j^v ~ N(0, σ_v² I)
• Could use other classes of regression models
129. Advantages of RLFM
• Better regularization of factors
– Covariates “shrink” towards a better centroid
• Cold-start: Fallback regression model (FeatureOnly)
130. RLFM: Illustration of Shrinkage
Plot the first factor
value for each user
(fitted using Yahoo! FP
data)
135. Computing the E-step
• Often hard to compute in closed form
• Stochastic EM (Markov Chain EM; MCEM)
– Compute expectation by drawing samples from
– Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM)
– Faster but biased hyper-parameter estimates
136. Monte Carlo E-step
• Through a vanilla Gibbs sampler (conditionals closed form)
• Other conditionals also Gaussian and closed form
• Conditionals of users (movies) sampled simultaneously
• Small number of samples in early iterations, large numbers in
later iterations
137. M-step (Why MCEM is better than
ICM)
• Update G, optimize
• Update Au=au I
Ignored by ICM, underestimates factor variability
Factors over-shrunk, posterior not explored well
139. Experiment 2: Better handling of
Cold-start
• MovieLens-1M; EachMovie
• Training-test split based on timestamp
• Same covariates as in Experiment 1.
140. Experiment 4: Predicting click-rate
on articles
• Goal: Predict click-rate on articles for a user on F1
position
• Article lifetimes short, dynamic updates important
• User covariates:
– Age, Gender, Geo, Browse behavior
• Article covariates
– Content Category, keywords
• 2M ratings, 30K users, 4.5 K articles
142. Some other related approaches
• Stern, Herbrich and Graepel, WWW, 2009
– Similar to RLFM, different parametrization and
expectation propagation used to fit the models
• Porteus, Asuncion and Welling, AAAI, 2011
– Non-parametric approach using a Dirichlet process
• Agarwal, Zhang and Mazumdar, Annals of Applied
Statistics, 2011
– Regression + random effects per user regularized
through a Graphical Lasso
143. Add Topic Discovery into
Matrix Factorization
fLDA: Matrix Factorization through Latent
Dirichlet Allocation
144. fLDA: Introduction
• Model the rating yij that user i gives to item j as the user’s
affinity to the topics that the item has
– Unlike regular unsupervised LDA topic modeling, here the LDA
topics are learnt in a supervised manner based on past rating
data
– fLDA can be thought of as a “multi-task learning” version of the
supervised LDA model [Blei’07] for cold-start recommendation
y_ij = … + Σ_k s_ik z_jk
– s_ik: user i's affinity to topic k
– z_jk: Pr(item j has topic k), estimated by averaging the LDA topic of each
word in item j
– Old items: z_jk's are item latent factors learnt from data with the LDA
prior
– New items: z_jk's are predicted based on the bag of words in the items
145. LDA Topic Modeling (1)
• LDA is effective for unsupervised topic discovery [Blei'03]
– It models the generating process of a corpus of items (articles)
– For each topic k, draw a word distribution Φ_k = [Φ_k1, …, Φ_kW] ~ Dir(η)
– For each item j, draw a topic distribution θ_j = [θ_j1, …, θ_jK] ~ Dir(λ)
– For each word, say the nth word, in item j:
• Draw a topic z_jn for that word from θ_j = [θ_j1, …, θ_jK]
• Draw a word w_jn from Φ_k = [Φ_k1, …, Φ_kW] with topic k = z_jn
[Figure] Topics 1, …, K with word distributions Φ_k = [Φ_k1, …, Φ_kW]; item j
with topic distribution [θ_j1, …, θ_jK], observed words w_j1, …, w_jn, …, and
per-word topics z_j1, …, z_jn, …
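The generative process above can be simulated directly. This toy sketch uses fixed topic and word distributions instead of Dirichlet draws, and the vocabularies and probabilities are illustrative assumptions.

```python
import random

def sample(dist, rng):
    """Draw a key from a {key: probability} dict by inverse CDF."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate_item(theta, phi, n_words, rng):
    """theta: item topic distribution; phi[k]: word distribution of topic k."""
    words, topics = [], []
    for _ in range(n_words):
        z = sample(theta, rng)              # per-word topic z_jn ~ theta_j
        topics.append(z)
        words.append(sample(phi[z], rng))   # word w_jn ~ Phi_{z_jn}
    return words, topics

rng = random.Random(42)
phi = {0: {"goal": 0.6, "team": 0.4}, 1: {"stock": 0.7, "bank": 0.3}}
words, topics = generate_item({0: 0.8, 1: 0.2}, phi, n_words=10, rng=rng)
print(words)
```

Model fitting (next slide) runs this process in reverse: given only the words, infer the per-word topics and the distributions θ and Φ.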
146. LDA Topic Modeling (2)
• Model training:
– Estimate the prior parameters and the posterior topic×word distribution Φ
based on a training corpus of items
– EM + Gibbs sampling is a popular method
• Inference for new items
– Compute the item topic distribution based on the prior parameters and Φ
estimated in the training phase
• Supervised LDA [Blei'07]
– Predict a target value for each item based on supervised LDA topics:
y_j = Σ_k s_k z_jk
• s_k: regression weight for topic k
• z_jk: Pr(item j has topic k), estimated by averaging the topic of each word
in item j
– vs. fLDA: y_ij = … + Σ_k s_ik z_jk
• One regression per user; same set of topics across different regressions
147. fLDA: Model
• Rating that user i gives item j:
y_ij ~ N(µ_ij, σ²)           Gaussian model
y_ij ~ Bernoulli(µ_ij)       Logistic model (for binary rating)
y_ij ~ Poisson(µ_ij N_ij)    Poisson model (for counts)
t(µ_ij) = x_ij' b + α_i + β_j + Σ_k s_ik z_jk
• Bias of user i: α_i = g_0' x_i + ε_i^α, ε_i^α ~ N(0, σ_α²)
• Popularity of item j: β_j = d_0' x_j + ε_j^β, ε_j^β ~ N(0, σ_β²)
• Topic affinity of user i: s_i = H x_i + ε_i^s, ε_i^s ~ N(0, σ_s² I)
• Pr(item j has topic k): z_jk = Σ_n 1(z_jn = k) / (#words in item j)
– z_jn is the LDA topic of the nth word in item j
• Observed words: w_jn ~ LDA(λ, η, z_jn), the nth word in item j
148. Model Fitting
• Given:
– Features X = {x_i, x_j, x_ij}
– Observed ratings y = {y_ij} and words w = {w_jn}
• Estimate:
– Parameters: Θ = [b, g_0, d_0, H, σ², a_α, a_β, A_s, λ, η]
• Regression weights and prior parameters
– Latent factors: Δ = {α_i, β_j, s_i} and z = {z_jn}
• User factors, item factors and per-word topic assignment
• Empirical Bayes approach:
– Maximum likelihood estimate of the parameters:
Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
– The posterior distribution of the factors: Pr[z, Δ | y, Θ̂]
149. The EM Algorithm
• Iterate through the E and M steps until convergence
– Let Θ̂^(n) be the current estimate
– E-step: Compute
f_n(Θ) = E_{(z, Δ | y, w, Θ̂^(n))} [ log Pr(y, w, Δ, z | Θ) ]
• The expectation is not in closed form
• We draw Gibbs samples and compute the Monte Carlo mean
– M-step: Find
Θ̂^(n+1) = argmax_Θ f_n(Θ)
• It consists of solving a number of regression and optimization problems
150. Supervised Topic Assignment
• Gibbs sampling of the topic z_jn of the nth word in item j:
Pr(z_jn = k | Rest) ∝ (Z_jk^¬jn + λ) · (Z_{k,w_jn}^¬jn + η) / (Z_k^¬jn + Wη)
· Π_{i rated j} f(y_ij | z_jn = k)
– The first factors are the same as in unsupervised LDA
– The last factor is the likelihood of the observed ratings by users who
rated item j when z_jn is set to topic k (the probability of observing y_ij
given the model)
151. fLDA: Experimental Results (Movie)
• Task: Predict the rating that a user would give a movie
• Training/test split:
– Sort observations by time
– First 75% → Training data
– Last 25% → Test data
• Item warm-start scenario
– Only 2% new items in test data
Model Test RMSE
RLFM 0.9363
fLDA 0.9381
Factor-Only 0.9422
FilterBot 0.9517
unsup-LDA 0.9520
MostPopular 0.9726
Feature-Only 1.0906
Constant 1.1190
fLDA is as strong as the best method
It does not reduce the performance in warm-start scenarios
152. fLDA: Experimental Results (Yahoo! Buzz)
• Task: Predict whether a user would buzz-up an article
• Severe item cold-start
– All items are new in test data
Data Statistics
1.2M observations
4K users
10K articles
fLDA significantly
outperforms other
models
153. Experimental Results: Buzzing Topics
Top terms (after stemming) → topic label:
• bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas,
investig, georg, memo, al → CIA interrogation
• mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag,
offici, somalia, captain → Swine flu
• NFL, player, team, suleman, game, nadya, star, high, octuplet,
nadya_suleman, michael, week → NFL games
• court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court,
appeal, ban, legal, allow → Gay marriage
• palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid,
sarah_palin, sai, gov, alaska → Sarah Palin
• idol, american, night, star, look, michel, win, dress, susan, danc, judg,
boyl, michelle_obama → American Idol
• economi, recess, job, percent, econom, bank, expect, rate, jobless, year,
unemploy, month → Recession
• north, korea, china, north_korea, launch, nuclear, rocket, missil, south,
said, russia → North Korea issues
3/4 of topics are interpretable; 1/2 are similar to unsupervised topics
154. fLDA Summary
• fLDA is a useful model for cold-start item recommendation
• It also provides interpretable recommendations for users
– User’s preference to interpretable LDA topics
• Future directions:
– Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm
– Apply fLDA to other multi-task prediction problems
• fLDA can be used as a tool to generate supervised
features (topics) from text data
155. Summary
• Regularizing factors through covariates effective
• Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a
single framework in a seamless way looks attractive
• Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step
can be done with any off-the-shelf scalable linear
regression routine
• Distributed computing on Hadoop: Multiple models
and average across partitions (more later)
157. Why Online Components?
• Cold start
– New items or new users come to the system
– How to obtain data for new items/users (explore/exploit)
– Once data becomes available, how to quickly update the model
• Periodic rebuild (e.g., daily): Expensive
• Continuous online update (e.g., every minute): Cheap
• Concept drift
– Item popularity, user interest, mood, and user-to-item affinity may
change over time
– How to track the most recent behavior
• Down-weight old data
– How to model temporal patterns for better prediction
• … may not need to be online if the patterns are stationary
158. Big Picture
                            Most Popular Recommendation   Personalized Recommendation
Offline Models              —                             Collaborative filtering (cold-start problem)
Online Models               Time-series models            Incremental CF, online regression
(real systems are dynamic)
Intelligent Initialization  Prior estimation              Prior estimation, dimension reduction
(do not start cold)
Explore/Exploit             Multi-armed bandits           Bandits with covariates
(actively acquire data)
Extension: Segmented Most Popular Recommendation
159. Online Components for Most Popular Recommendation
• Online models, intelligent initialization & explore/exploit
160. Most popular recommendation:
Outline
• Most popular recommendation (no
personalization, all users see the same thing)
– Time-series models (online models)
– Prior estimation (initialization)
– Multi-armed bandits (explore/exploit)
– Sometimes hard to beat!!
• Segmented most popular recommendation
– Create user segments/clusters based on user
features
– Do most popular recommendation for each segment
161. Most Popular Recommendation
• Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on
the picked items
• Easy!? Pick the items having the highest click-
through rates (CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item changes over time
– How much traffic should be allocated to explore new
items to achieve optimal performance
• Too little → Unreliable CTR estimates
• Too much → Little traffic to exploit the high CTR items
162. CTR Curves for Two Days on Yahoo! Front Page
• Each curve is the CTR of an item in the Today Module on www.yahoo.com over
time
• Traffic obtained from a controlled randomized experiment (no confounding)
• Things to note: (a) short lifetimes, (b) temporal effects, (c) often
breaking news stories
163. For Simplicity, Assume …
• Pick only one item for each user visit
– Multi-slot optimization later
• No user segmentation, no personalization
(discussion later)
• The pool of candidate items is predetermined
and is relatively small (≤ 1000)
– E.g., selected by human editors or by a first-phase
filtering method
– Ideally, there should be a feedback loop
– Large item pool problem later
• Effects like user-fatigue, diversity in
recommendations, multi-objective optimization
not considered (discussion later)
164. Online Models
• How to track the changing CTR of an item
• Data: for each item, at time t, we observe
– Number of times the item nt was displayed (i.e., #views)
– Number of clicks ct on the item
• Problem Definition: Given c1, n1, …, ct, nt, predict the CTR
(click-through rate) pt+1 at time t+1
• Potential solutions:
– Observed CTR at t: ct / nt → highly unstable (nt is usually small)
– Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very
slowly
– Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable
• But, no estimation of Var[pt+1] (useful for explore/exploit)
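The three candidate CTR trackers above, side by side on a made-up click/view stream in which the CTR drops midway:

```python
from collections import deque

def observed_ctr(c, n):
    """Raw per-interval CTR; highly unstable when n is small."""
    return c / n if n else 0.0

class CumulativeCTR:
    """All data pooled; reacts to changes very slowly."""
    def __init__(self):
        self.c = self.n = 0
    def update(self, c, n):
        self.c += c
        self.n += n
        return self.c / self.n

class MovingWindowCTR:
    """CTR over the last k intervals; a reasonable compromise."""
    def __init__(self, k):
        self.window = deque(maxlen=k)   # keeps the last k (clicks, views)
    def update(self, c, n):
        self.window.append((c, n))
        cs = sum(c for c, _ in self.window)
        ns = sum(n for _, n in self.window)
        return cs / ns

stream = [(5, 100), (4, 100), (0, 100), (0, 100)]  # CTR drops midway
cum, win = CumulativeCTR(), MovingWindowCTR(k=2)
for c, n in stream:
    obs_est = observed_ctr(c, n)
    cum_est, win_est = cum.update(c, n), win.update(c, n)
print(obs_est, cum_est, win_est)  # → 0.0 0.0225 0.0
```

As the slide notes, none of these provides Var[p_{t+1}]; the Gamma-Poisson model on the next slide tracks a full distribution instead of a point estimate.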
165. Online Models: Dynamic Gamma-Poisson
• Model-based approach
– (c_t | n_t, p_t) ~ Poisson(n_t p_t)
– p_t = p_{t−1} ε_t, where ε_t ~ Gamma(mean=1, var=η)
– Model parameters:
• p_1 ~ Gamma(mean=µ_0, var=σ_0²) is the offline CTR estimate
• η specifies how dynamic/smooth the CTR is over time
– Posterior distribution (p_{t+1} | c_1, n_1, …, c_t, n_t) ~ Gamma(?, ?)
• Solve this recursively (online update rule)
• Notation: p_t = CTR at time t; the item is shown n_t times and receives c_t
clicks
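A moment-matching sketch of the recursive update above. The click update uses exact Gamma-Poisson conjugacy; the multiplicative evolution step is approximated by keeping the mean and inflating the variance (an assumption — the exactly evolved distribution is no longer Gamma).

```python
class GammaPoissonCTR:
    def __init__(self, mean, var, eta):
        self.mean, self.var, self.eta = mean, var, eta

    def update(self, clicks, views):
        # Gamma(shape=a, rate=b) prior + Poisson(views * p) likelihood:
        # conjugacy gives posterior Gamma(a + clicks, b + views)
        a = self.mean ** 2 / self.var + clicks
        b = self.mean / self.var + views
        self.mean, self.var = a / b, a / b ** 2
        # evolution p_{t+1} = p_t * eps with E[eps]=1, Var[eps]=eta:
        # mean unchanged, variance inflated (moment matching)
        self.var = self.var * (1 + self.eta) + self.mean ** 2 * self.eta
        return self.mean, self.var

# offline prior mean/var and smoothness eta are illustrative
model = GammaPoissonCTR(mean=0.04, var=0.0004, eta=0.01)
for clicks, views in [(6, 100), (5, 100), (2, 100)]:
    mean, var = model.update(clicks, views)
print(round(mean, 4))  # tracks the recent empirical CTR
```

The variance inflation is what keeps high-CTR items "more adaptive": their posterior never collapses to a point, so recent data always moves the estimate.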
167. Tracking behavior of Gamma-
Poisson model
• Low click rate articles – More temporal
smoothing
168. Intelligent Initialization: Prior Estimation
• Prior CTR distribution: Gamma(mean=µ_0, var=σ_0²)
– N historical items:
• n_i = #views of item i in its first time interval
• c_i = #clicks on item i in its first time interval
– Model
• c_i ~ Poisson(n_i p_i) and p_i ~ Gamma(µ_0, σ_0²)
⇒ c_i ~ NegBinomial(µ_0, σ_0², n_i)
– Maximum likelihood estimate (MLE) of (µ_0, σ_0²):
argmax_{µ_0, σ_0²}  Σ_i log Γ(µ_0²/σ_0² + c_i) − N log Γ(µ_0²/σ_0²)
+ N (µ_0²/σ_0²) log(µ_0/σ_0²) − Σ_i (µ_0²/σ_0² + c_i) log(µ_0/σ_0² + n_i)
• Better prior: Cluster items and find the MLE for each cluster
– Agarwal & Chen, 2011 (SIGMOD)
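A grid-search sketch of this MLE: with p_i ~ Gamma(mean=µ0, var=σ0²), i.e. shape a = µ0²/σ0² and rate b = µ0/σ0², the marginal of c_i is negative binomial. The data are simulated and the grid is coarse; a real system would use a proper optimizer.

```python
from math import lgamma, log

def neg_bin_loglik(mu, s2, clicks, views):
    """Marginal log-likelihood of the click counts, up to terms constant
    in (mu, s2)."""
    a, b = mu * mu / s2, mu / s2
    ll = 0.0
    for c, n in zip(clicks, views):
        ll += lgamma(a + c) - lgamma(a) + a * log(b) - (a + c) * log(b + n)
    return ll

# simulated first-interval counts for 6 historical items
clicks = [3, 5, 2, 8, 4, 6]
views = [100, 120, 90, 150, 110, 130]

best = max(
    ((mu, s2) for mu in [0.01, 0.02, 0.04, 0.08] for s2 in [1e-5, 1e-4, 1e-3]),
    key=lambda p: neg_bin_loglik(p[0], p[1], clicks, views),
)
print(best)
```

The "better prior" variant on the slide would run this same fit once per item cluster instead of once globally.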
169. Explore/Exploit: Problem Definition
• Time slots …, t−2, t−1, t (now), with clicks in the future
• Serve Item 1 to x_1% of page views, Item 2 to x_2%, …, Item K to x_K%
• Determine (x_1, x_2, …, x_K) based on clicks and views observed before t in
order to maximize the expected total number of clicks in the future
170. Modeling the Uncertainty, NOT just the Mean
• Simplified setting: two items
– We know the CTR of Item A (say, shown 1 million times)
– We are uncertain about the CTR of Item B (only 100 times)
• If we only make a single decision, give 100% of page views to Item A
• If we make multiple decisions in the future, explore Item B since its CTR
can potentially be higher
• Potential of Item B: let q be the CTR of Item A, p the CTR of Item B, and
f(p) the probability density of Item B's CTR; then
Potential = ∫_{p > q} (p − q) f(p) dp
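The potential integral above can be approximated by Monte Carlo. Here f is taken to be a Beta posterior for Item B's CTR (an assumption — the slides use a Gamma model elsewhere), with illustrative counts.

```python
import random

def potential(q, clicks, views, n_samples=100_000, seed=1):
    """Monte Carlo estimate of integral over p > q of (p - q) f(p) dp,
    with f = Beta(clicks + 1, views - clicks + 1)."""
    rng = random.Random(seed)
    a, b = clicks + 1, views - clicks + 1
    total = 0.0
    for _ in range(n_samples):
        p = rng.betavariate(a, b)   # draw from Item B's CTR posterior
        if p > q:
            total += p - q          # only the upside counts
    return total / n_samples

# Item A: known CTR q = 0.05. Item B: 8 clicks in 100 views (uncertain).
print(round(potential(q=0.05, clicks=8, views=100), 4))
```

Note how the potential shrinks as Item B accumulates data at the same rate: with less posterior spread, there is less upside left to explore.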
171. Multi-Armed Bandits: Introduction (1)
• Bandit "arms" with unknown payoff probabilities p_1, p_2, p_3
• "Pulling" arm i yields a reward:
– reward = 1 with probability p_i (success)
– reward = 0 otherwise (failure)
• For now, we are attacking the problem of choosing the best article/arm for
all users
172. Multi-Armed Bandits: Introduction (2)
• Bandit "arms" with unknown payoff probabilities p_1, p_2, p_3
• Goal: pull arms sequentially to maximize the total reward
• Bandit scheme/policy: sequential algorithm to play arms (items)
• Regret of a scheme = expected loss relative to the "oracle" optimal scheme
that always plays the best arm
– "best" means highest success probability
– But, the best arm is not known … unless you have an oracle
– Regret is the price of exploration
– Low regret implies quick convergence to the best arm
173. Multi-Armed Bandits: Introduction
(3)
• Bayesian approach
– Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about
probability distributions
– Representative work: Gittins’ index, Whittle’s index
– Very computationally intensive
• Minimax approach
– Seeks to find a scheme that incurs bounded regret (with no
or mild assumptions about probability distributions)
– Representative work: UCB by Lai, Auer
– Usually, computationally easy
– But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
174. Multi-Armed Bandits: Markov Decision Process (1)
• Select an arm now at time t=0, to maximize expected total number
of clicks in t=0,…,T
• State at time t: Θt = (θ1t, …, θKt)
– θit = State of arm i at time t (that captures all we know about arm i at t)
• Reward function Ri(Θt, Θt+1)
– Reward of pulling arm i that brings the state from Θt to Θt+1
• Transition probability Pr[Θt+1 | Θt, pulling arm i ]
• Policy π: A function that maps a state to an arm (action)
– π(Θt) returns an arm (to pull)
• Value of policy π starting from the current state Θ0 with horizon T
V_π^T(Θ_0) = E[ R_{π(Θ_0)}(Θ_0, Θ_1) + V_π^{T−1}(Θ_1) ]
= ∫ Pr[Θ_1 | Θ_0, π(Θ_0)] · ( R_{π(Θ_0)}(Θ_0, Θ_1) + V_π^{T−1}(Θ_1) ) dΘ_1
– First term: immediate reward; second term: value of the remaining T−1 time
slots if we start from state Θ_1
175. Multi-Armed Bandits: MDP (2)
• Optimal policy: argmax_π V_π^T(Θ_0)
• Things to notice:
– Value is defined recursively (actually T high-dimensional integrals)
– Dynamic programming can be used to find the optimal policy
– But, just evaluating the value of a fixed policy can be very expensive
• Bandit problem: the pull of one arm does not change the state of other
arms, and the set of arms does not change over time
176. Multi-Armed Bandits: MDP (3)
• Which arm should be pulled next?
– Not necessarily what looks best right now, since it might have had a few
lucky successes
– Looks like it will be a function of successes and failures of all arms
• Consider a slightly different problem setting
– Infinite time horizon, but
– Future rewards are geometrically discounted
Rtotal = R(0) + γ.R(1) + γ2.R(2) + … (0<γ<1)
• Theorem [Gittins 1979]: The optimal policy decouples and solves a
bandit problem for each arm independently
• Policy π(Θ_t) is a function of (θ_1t, …, θ_Kt): one K-dimensional problem
• Gittins' index: policy π(Θ_t) = argmax_i { g(θ_it) }: K one-dimensional
problems
• Still computationally expensive!!
177. Multi-Armed Bandits: MDP (4)
• Bandit policy:
1. Compute the priority (Gittins' index) of each arm based on its state
2. Pull the arm with max priority, and observe the reward
3. Update the state of the pulled arm
178. Multi-Armed Bandits: MDP (5)
• Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm
independently
– Many proofs and different interpretations of Gittins’ index
exist
• The index of an arm is the fixed charge per pull for a game with two options, whether
to pull the arm or not, so that the charge makes the optimal play of the game have
zero net reward
– Significantly reduces the dimension of the problem space
– But, Gittins’ index g(θit) is still hard to compute
• For the Gamma-Poisson or Beta-Binomial models
θit = (#successes, #pulls) for arm i up to time t
• g maps each possible (#successes, #pulls) pair to a number
– Approximate methods are used in practice
– Lai et al. have derived these for exponential family
distributions
179. Multi-Armed Bandits: Minimax Approach (1)
• Compute the priority of each arm i in a way that the regret is bounded
– Lowest regret in the worst case
• One common policy is UCB1 [Auer 2002]:
Priority_i = c_i / n_i + sqrt( 2 log(n) / n_i )
– c_i: number of successes of arm i; n_i: number of pulls of arm i; n: total
number of pulls of all arms
– First term: observed success rate; second term: factor representing
uncertainty
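A direct implementation of the UCB1 priority above, run on a small simulation; the reward probabilities and horizon are illustrative.

```python
import random
from math import log, sqrt

class UCB1:
    def __init__(self, n_arms):
        self.c = [0.0] * n_arms   # successes per arm
        self.n = [0] * n_arms     # pulls per arm

    def choose(self):
        for i, ni in enumerate(self.n):
            if ni == 0:
                return i          # pull each arm once before using the index
        total = sum(self.n)
        # priority_i = c_i/n_i + sqrt(2 log n / n_i)
        return max(range(len(self.n)),
                   key=lambda i: self.c[i] / self.n[i]
                   + sqrt(2 * log(total) / self.n[i]))

    def update(self, arm, reward):
        self.c[arm] += reward
        self.n[arm] += 1

random.seed(7)
true_p = [0.8, 0.3, 0.1]          # simulated payoff probabilities
policy = UCB1(len(true_p))
for _ in range(2000):
    arm = policy.choose()
    policy.update(arm, 1.0 if random.random() < true_p[arm] else 0.0)
print(policy.n)                   # the best arm accumulates the most pulls
```

As the next slides note, exploration never fully stops: the uncertainty term keeps every arm's priority drifting upward while it goes unpulled.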
180. Multi-Armed Bandits: Minimax Approach (2)
Priority_i = c_i / n_i + sqrt( 2 log(n) / n_i )
(observed payoff + factor representing uncertainty)
• As the total number of observations n becomes large:
– Observed payoff tends asymptotically towards the true payoff probability
– The system never completely "converges" to one best arm; only the rate of
exploration tends to zero
181. Multi-Armed Bandits: Minimax Approach (3)
Priority_i = c_i / n_i + sqrt( 2 log(n) / n_i )
(observed payoff + factor representing uncertainty)
• Sub-optimal arms are pulled O(log n) times
• Hence, UCB1 has O(log n) regret
• This is the lowest possible regret (but the constants matter!)
• E.g., the regret after n plays is bounded by
[ 8 Σ_{i: µ_i < µ_best} (ln n / Δ_i) ] + (1 + π²/3) Σ_{j=1..K} Δ_j,
where Δ_i = µ_best − µ_i
:
182. Classical Multi-Armed Bandits: Summary
• Classical multi-armed bandits
– A fixed set of arms with fixed rewards
– Observe the reward before the next pull
• Bayesian approach (Markov decision process)
– Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
• Pull the arm currently having the highest index value
– Whittle’s index [Whittle 1988]: Extension to a changing reward function
– Computationally intensive
• Minimax approach (providing guaranteed regret bounds)
– UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval
• Index of arm i = μ̂_i + c·√(2 log n / n_i)
• Heuristics
– ε-Greedy: Random exploration using fraction ε of traffic
– Softmax: Pick arm i with probability exp{μ̂_i/τ} / Σ_j exp{μ̂_j/τ} (τ = temperature)
– Posterior draw: Index = drawing from posterior CTR distribution of an arm
where μ̂_i = predicted CTR of item i
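The three heuristics in the summary above can each be written in a few lines. A minimal sketch, assuming made-up predicted CTRs μ̂_i and click/view counts:

```python
import math
import random

mu_hat = {"A": 0.10, "B": 0.12, "C": 0.08}   # predicted CTRs (hypothetical)

def epsilon_greedy(mu_hat, eps=0.1):
    """With probability eps explore a random arm, else exploit the best."""
    if random.random() < eps:
        return random.choice(list(mu_hat))
    return max(mu_hat, key=mu_hat.get)

def softmax(mu_hat, tau=0.05):
    """Pick arm i with probability exp(mu_hat_i / tau) / sum_j exp(mu_hat_j / tau)."""
    weights = {i: math.exp(m / tau) for i, m in mu_hat.items()}
    r, acc = random.random() * sum(weights.values()), 0.0
    for i, w in weights.items():
        acc += w
        if r <= acc:
            return i
    return i  # guard against floating-point rounding at the boundary

def posterior_draw(counts):
    """Index = a draw from each arm's Beta posterior CTR distribution
    (a uniform Beta(1, 1) prior is assumed). counts: arm -> (clicks, views)."""
    return max(counts, key=lambda i: random.betavariate(
        1 + counts[i][0], 1 + counts[i][1] - counts[i][0]))
```

The temperature τ controls how sharply softmax concentrates on the best arm; as τ → 0 it approaches greedy selection, and as τ → ∞ it approaches uniform exploration.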
183. Do Classical Bandits Apply to Web Recommenders?
[Figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time]
– Traffic obtained from a controlled randomized experiment (no confounding)
– Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
184. Characteristics of Real Recommender Systems
• Dynamic set of items (arms)
– Items come and go with short lifetimes (e.g., a day)
– Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short
• Non-stationary CTR
– CTR of an item can change dramatically over time
• Different user populations at different times
• Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.)
• Attention to breaking news stories decays over time
• Batch serving for scalability
– Making a decision and updating the model for each user visit in real time
is expensive
– Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i
[Agarwal et al., ICDM, 2009]
185. Explore/Exploit in Recommender Systems
[Figure: timeline …, t–2, t–1, t (now), future; items 1, 2, …, K receive x1%, x2%, …, xK% of page views]
• Determine (x1, x2, …, xK) based on clicks and views observed before t, in order to maximize the expected total number of clicks in the future
• Let’s solve this from first principles
186. Bayesian Solution: Two Items, Two
Time Slots (1)
• Two time slots: t = 0 and t = 1
– Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1
– Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
• To determine x, we need to estimate what would happen in the future
• Question: What fraction x of the N0 views at t = 0 should go to item P, and (1−x) to item Q?
[Figure: timeline from now (t = 0, N0 views) to end (t = 1, N1 views); CTR densities of item P (p0, then p̂1(x, c)) and item Q (q0, then q1)]
– We obtain c clicks after serving x (not yet observed; a random variable)
– Assume we observe c; we can then update the density of p1 to p̂1(x, c)
– If x and c are given, the optimal solution at t = 1 is: give all views to item P iff E[p1 | x, c] > q1
187. Bayesian Solution: Two Items, Two Time Slots (2)
• Expected total number of clicks in the two time slots:
  N0·x·p̂0 + N0·(1−x)·q0 + N1·E_c[max{p̂1(x, c), q1}]
  = N0·q0 + N1·q1 + N0·x·(p̂0 − q0) + N1·E_c[max{p̂1(x, c) − q1, 0}]
– At t = 1, show the item with higher E[CTR]: max{p̂1(x, c), q1}
– N0·q0 + N1·q1 = E[#clicks] if we always show item Q
– The remaining terms = Gain(x, q0, q1), the gain of exploring the uncertain item P using x
• Gain(x, q0, q1) = Expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots
• Solution: argmax_x Gain(x, q0, q1)
188. Bayesian Solution: Two Items, Two Time Slots (3)
• Approximate p̂1(x, c) by the normal distribution
– Reasonable approximation because of the central limit theorem
• Prior of p1 ~ Beta(a, b)
– p̂1 = E_c[p̂1(x, c)] = a/(a + b)
– σ²(x) = Var_c[p̂1(x, c)] = [N0·x / (a + b + N0·x)] · [a·b / ((a + b)²·(a + b + 1))]
• Using the approximation,
  Gain(x, q0, q1) = N0·x·(p̂0 − q0)
    + N1·[ σ(x)·φ((q1 − p̂1)/σ(x)) + (1 − Φ((q1 − p̂1)/σ(x)))·(p̂1 − q1) ]
  where φ and Φ are the standard normal pdf and cdf
• Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
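The closed-form Gain above is straightforward to evaluate. A minimal numerical sketch, with made-up values for q0, q1, N0, N1 and the Beta(a, b) prior, and a simple grid search over x in place of the O(log N0) procedure from the slide:

```python
import math

def gain(x, q0, q1, N0, N1, a, b):
    """Gain(x, q0, q1) from the slide, using the normal approximation
    to p1_hat(x, c) under a Beta(a, b) prior on item P's CTR."""
    p0_hat = a / (a + b)                 # prior mean CTR of item P at t = 0
    p1_hat = a / (a + b)                 # E_c[p1_hat(x, c)] = prior mean
    var = (N0 * x / (a + b + N0 * x)) * (a * b / ((a + b) ** 2 * (a + b + 1)))
    if var == 0:                         # x = 0: no exploration, no learning
        return N0 * x * (p0_hat - q0) + N1 * max(p1_hat - q1, 0.0)
    sigma = math.sqrt(var)
    z = (q1 - p1_hat) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # N(0,1) pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))          # N(0,1) cdf
    explore_value = sigma * phi + (1 - Phi) * (p1_hat - q1)
    return N0 * x * (p0_hat - q0) + N1 * explore_value

# Hypothetical setting: both items currently look identical (CTR 0.10),
# but item P's CTR is uncertain and the future slot has 10x the traffic.
q0, q1, N0, N1, a, b = 0.10, 0.10, 1000, 10000, 1, 9
best_x = max((i / 100 for i in range(101)),
             key=lambda x: gain(x, q0, q1, N0, N1, a, b))
```

In this setting exploring item P costs nothing in slot 0 (p̂0 = q0) while shrinking the posterior uncertainty exploited in the much larger slot 1, so the gain increases monotonically in x and full exploration is optimal.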
189. Bayesian Solution: Two Items, Two Time Slots (4)
• Quiz: Is it correct that the more we are uncertain about the CTR of
an item, the more we should explore the item?
[Figure: fraction of views to give to the item (y-axis) vs. uncertainty, from low to high (x-axis); different curves are for different prior mean settings]