ENAR short course
1. Statistical Computing For Big Data
Deepak Agarwal
LinkedIn Applied Relevance Science
dagarwal@linkedin.com
ENAR 2014, Baltimore, USA
2. Main Collaborators: several others at both Y! and LinkedIn
• I wouldn't be here without them; extremely lucky to work with such talented individuals
Bee-Chung Chen, Liang Zhang, Bo Long, Jonathan Traupman, Paul Ogilvie
3. Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Some statistical computations using Map-Reduce
• Bootstrap, Logistic Regression
• Part II: Recommender Systems for Web Applications
– Introduction
– Content Recommendation
– Online Advertising
4. Big Data Becoming Ubiquitous
• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …
5. Big Data: Some Size Estimates
• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (> 140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: > 280M members worldwide
• Twitter: > 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day
6. Big Data: Paradigm Shift
• Classical statistics
– Generalize using small data
• Paradigm shift with big data
– We now have an almost infinite supply of data
– Easy statistics? Just appeal to asymptotic theory?
• So the issue is mostly computational?
– Not quite
• More data comes with more heterogeneity
• We need to change our statistical thinking to adapt
– Classical statistics is still invaluable for thinking about big-data analytics
7. Some Statistical Challenges
• Exploratory data analysis (EDA), visualization
– Retrospective (on terabytes)
– More real-time (streaming computations every few minutes/hours)
• Statistical modeling
– Scale (computational challenge)
– Curse of dimensionality
• Millions of predictors, heterogeneity
– Temporal and spatial correlations
8. Statistical Challenges (continued)
• Experiments
– To test new methods and hypotheses via randomized experiments
– Adaptive experiments
• Forecasting
– Planning, advertising
• Many more that I am not fully well versed in
9. Defining Big Data
• How do you know you have a big-data problem?
– Is it only the number of terabytes?
– What about dimensionality, structured vs. unstructured data, the computations required, …?
• No clear definition; different points of view
– One view: when the desired computation cannot be completed in the stipulated time with the current best algorithm, using the cores available on a commodity PC
10. Distributed Computing for Big Data
• Distributed computing is an invaluable tool for scaling computations to big data
• Some distributed computing models:
– Multi-threading
– Graphics Processing Units (GPU)
– Message Passing Interface (MPI)
– Map-Reduce
11. Evaluating a Method for a Problem
• Scalability
– Process X GB in Y hours
• Ease of use for a statistician
• Reliability (fault tolerance)
– Especially in an industrial environment
• Cost
– Hardware and cost of maintenance
• Good for the computations required?
– E.g., iterative versus one-pass
• Resource sharing
12. Multithreading
• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle gigabytes of data
• Reliable
13. Graphics Processing Units (GPU)
• Number of cores:
– CPU: order of 10
– GPU: smaller cores, order of 1000
• Can be > 100x faster than a CPU
– Parallel, computationally intensive tasks are off-loaded to the GPU
• Good for certain computationally intensive tasks
• Can only handle gigabytes of data
• Not trivial to use; efficient use requires a good understanding of the low-level architecture
– But things are changing; it is getting more user friendly
14. Message Passing Interface (MPI)
• Language-independent communication protocol among processes (e.g., computers)
• Most suitable for the master/slave model
• Can handle terabytes of data
• Good for iterative processing
• Fault tolerance is low
15. Map-Reduce (Dean & Ghemawat, 2004)
Data → Mappers → Reducers → Output
• Computation is split into Map (scatter) and Reduce (gather) stages
• Easy to use:
– The user only needs to implement two functions: Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
16. Comparison of Distributed Computing Methods

                          Multithreading  GPU                            MPI        Map-Reduce
Scalability (data size)   Gigabytes       Gigabytes                      Terabytes  Terabytes
Fault tolerance           High            High                           Low        High
Maintenance cost          Low             Medium                         Medium     Medium-High
Iterative process cost    Cheap           Cheap                          Cheap      Usually expensive
Resource sharing          Hard            Hard                           Easy       Easy
Easy to implement?        Easy            Needs low-level GPU knowledge  Easy       Easy
17. Example Problem
• Tabulating word counts in a corpus of documents
• Similar to the table function in R
18. Word Count Through Map-Reduce
• Mapper 1 input: "Hello World Bye World" → emits <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>
• Mapper 2 input: "Hello Hadoop Goodbye Hadoop" → emits <Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1>
• Reducer 1 (words from A-G) outputs: <Bye, 1>, <Goodbye, 1>
• Reducer 2 (words from H-Z) outputs: <Hello, 2>, <World, 2>, <Hadoop, 2>
19. Key Ideas about Map-Reduce
Big Data → Partition 1, Partition 2, …, Partition N
→ Mapper 1, Mapper 2, …, Mapper N, each emitting <Key, Value> pairs
→ Reducer 1, Reducer 2, …, Reducer M
→ Output 1, Output 2, …, Output M
20. Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values corresponding to that key according to the user-defined reducer function and outputs the result
21. Compute the Mean for Each Group

ID  Group No.  Score
 1      1       0.5
 2      3       1.0
 3      1       0.8
 4      2       0.7
 5      2       1.5
 6      3       1.2
 7      1       0.8
 8      2       0.9
 9      4       1.3
 …      …       …
22. Key Ideas about Map-Reduce
• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
– For each row: Key = Group No., Value = Score
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
– E.g., with 2 reducers:
– Reducer 1 receives data with key = 1, 2
– Reducer 2 receives data with key = 3, 4
• Every reducer sorts its data by key
– E.g., Reducer 1: <key=1, values=[0.5, 0.8, 0.8]>, <key=2, values=[0.7, 1.5, 0.9]>
• For each key, the reducer processes the values corresponding to that key according to the user-defined reducer function and outputs the result
– E.g., Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
23. Key Ideas about Map-Reduce
(Repeats the previous slide; the callout marks the mapper and reducer functions as what you need to implement.)
24. Pseudo Code (in R)

Mapper:
  Input: Data
  for (row in Data) {
    groupNo <- row$groupNo
    score <- row$score
    Output(c(groupNo, score))
  }

Reducer:
  Input: Key (groupNo), Value (a list of scores that belong to the Key)
  count <- 0
  sum <- 0
  for (v in Value) {
    sum <- sum + v
    count <- count + 1
  }
  Output(c(Key, sum / count))
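The same logic as a runnable single-machine sketch (Python rather than R, purely for illustration, with the shuffle stage written out; the rows are from the earlier table):

```python
from collections import defaultdict

# Rows of the table above: (ID, group no., score)
data = [(1, 1, 0.5), (2, 3, 1.0), (3, 1, 0.8), (4, 2, 0.7), (5, 2, 1.5),
        (6, 3, 1.2), (7, 1, 0.8), (8, 2, 0.9), (9, 4, 1.3)]

def mapper(row):
    _id, group_no, score = row
    return (group_no, score)              # key = group no., value = score

grouped = defaultdict(list)               # the shuffle stage
for row in data:
    key, value = mapper(row)
    grouped[key].append(value)

def reducer(key, values):
    return (key, sum(values) / len(values))

means = dict(reducer(k, v) for k, v in grouped.items())
# means[1] == 0.7, means[3] == 1.1, means[4] == 1.3
```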
25. Exercise 1
• Problem: average height per {Grade, Gender}?
• What should the mapper output key be?
• What should the mapper output value be?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this.

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
26. Exercise 1 (Solution)
• Problem: average height per Grade and Gender?
• Mapper output key?
– {Grade, Gender}
• Mapper output value?
– Height
• Reducer input?
– Key: {Grade, Gender}; Value: list of heights
• Reducer output?
– {Grade, Gender, mean(Heights)}

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
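A possible solution sketch in Python (the same single-machine simulation as before; the three sample rows stand in for real data):

```python
from collections import defaultdict

# The sample rows from the slide: (student id, grade, gender, height in cm)
students = [(1, 3, "M", 120), (2, 2, "F", 115), (3, 2, "M", 116)]

def mapper(row):
    _sid, grade, gender, height = row
    return ((grade, gender), height)      # composite key, height as value

grouped = defaultdict(list)               # the shuffle stage
for row in students:
    key, value = mapper(row)
    grouped[key].append(value)

def reducer(key, heights):
    grade, gender = key
    return (grade, gender, sum(heights) / len(heights))

results = [reducer(k, v) for k, v in grouped.items()]
```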
27. Exercise 2
• Problem: number of students per {Grade, Gender}?
• What should the mapper output key be?
• What should the mapper output value be?
• What is the reducer input?
• What is the reducer output?
• Write the mapper and reducer for this.

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
28. Exercise 2 (Solution)
• Problem: number of students per {Grade, Gender}?
• Mapper output key?
– {Grade, Gender}
• Mapper output value?
– 1
• Reducer input?
– Key: {Grade, Gender}; Value: list of 1's
• Reducer output?
– {Grade, Gender, sum(value list)}
– OR: {Grade, Gender, length(value list)}

Student ID  Grade  Gender  Height (cm)
    1         3      M        120
    2         2      F        115
    3         2      M        116
    …         …      …         …
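The count version differs only in the emitted value; a minimal Python sketch:

```python
from collections import defaultdict

students = [(1, 3, "M", 120), (2, 2, "F", 115), (3, 2, "M", 116)]

counts = defaultdict(int)
for _sid, grade, gender, _height in students:
    # Mapper emits key = (grade, gender), value = 1;
    # the reducer sums the 1's for each key.
    counts[(grade, gender)] += 1
```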
29. More on Map-Reduce
• Depends on a distributed file system
• Typically the mappers are the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
– Data transmission from mappers to reducers goes through disk copy
• Iterative processing through Map-Reduce
– Each iteration becomes a Map-Reduce job
– Can be expensive, since Map-Reduce overhead is high
30. The Apache Hadoop System
• Open-source software for reliable, scalable, distributed computing
• The most popular distributed computing system in the world
• Key modules:
– Hadoop Distributed File System (HDFS)
– Hadoop YARN (job scheduling and cluster resource management)
– Hadoop MapReduce
31. Major Tools on Hadoop
• Pig
– A high-level language for Map-Reduce computation
• Hive
– A SQL-like query language for data querying via Map-Reduce
• HBase
– A distributed, scalable database on Hadoop
– Allows random, real-time read/write access to big data
– Voldemort is similar to HBase
• Mahout
– A scalable machine learning library
• …
32. Hadoop Installation
• Setting up Hadoop on your desktop/laptop:
– http://hadoop.apache.org/docs/stable/single_node_setup.html
• Setting up Hadoop on a cluster of machines:
– http://hadoop.apache.org/docs/stable/cluster_setup.html
33. Hadoop Distributed File System (HDFS)
• Master/slave architecture
• NameNode: a single master node that controls which data block is stored where
• DataNodes: slave nodes that store data and do R/W operations
• Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
• Big data are split into equal-sized blocks; each block can be stored on a different DataNode
• Disk failure tolerance: data are replicated multiple times
34. Load the Data into Pig

A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);

– The path of the data on HDFS comes after LOAD
• USING PigStorage() means the data are delimited by tabs (can be omitted)
• If the data are delimited by other characters, e.g. space, use USING PigStorage(' ')
• The data schema is defined after AS
• Variable types: int, long, float, double, chararray, …
35. Structure of This Tutorial
• Part I: Introduction to Map-Reduce and the Hadoop System
– Overview of Distributed Computing
– Introduction to Map-Reduce
– Introduction to the Hadoop System
– Examples of Statistical Computing for Big Data
• Bag of Little Bootstraps
• Large-Scale Logistic Regression
37. Bootstrap (Efron, 1979)
• A resampling-based method to obtain the statistical distribution of sample estimators
• Why are we interested?
– Resampling is embarrassingly parallelizable
• Example: standard deviation of the mean of N samples (μ)
– For i = 1 to r do
• Randomly sample with replacement N times from the original sample → bootstrap data i
• Compute the mean of the i-th bootstrap data → μi
– Estimate of Sd(μ) = Sd([μ1, …, μr])
– r is usually a large number, e.g. 200
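A minimal Python illustration of this procedure, using standard normal data as a stand-in so the answer can be checked against the theoretical value σ/√N = 1/√1000 ≈ 0.032:

```python
import random
import statistics

random.seed(0)
N, r = 1000, 200
sample = [random.gauss(0, 1) for _ in range(N)]   # stand-in data

boot_means = []
for _ in range(r):
    # Sample with replacement N times from the original sample.
    resample = random.choices(sample, k=N)
    boot_means.append(statistics.fmean(resample))

sd_of_mean = statistics.stdev(boot_means)
# Should come out close to 1 / sqrt(1000) ~ 0.032
```

Because the r resamples are independent, each iteration of the outer loop could run on a different node, which is the parallelization the slide refers to.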
38. Bootstrap for Big Data
• Can have r nodes running in parallel, each sampling one bootstrap data set
• However…
– N can be very large
– The data may not fit into memory
– Collecting N samples with replacement on each node can be computationally expensive
39. M out of N Bootstrap (Bickel et al., 1997)
• Obtain SdM(μ) by drawing M samples with replacement for each bootstrap, where M < N
• Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of the sample estimates
• However…
– Prior knowledge is required
– The choice of M is critical to performance
– Finding the optimal value of M needs more computation
40. Bag of Little Bootstraps (BLB)
• Example: standard deviation of the mean
• Generate S sampled data sets, each obtained by random sampling without replacement of a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the data of size b
• Compute the mean of the resampled data → μpi
– Compute Sdp(μ) = Sd([μp1, …, μpr])
• Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])
41. Bag of Little Bootstraps (BLB)
• Interest: ξ(θ), where θ is an estimate obtained from data of size N
– ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by random sampling without replacement of a subset of size b (or partition the original data into S partitions, each of size b)
• For each data set p = 1 to S do
– For i = 1 to r do
• Draw N samples with replacement from the data of size b
• Compute the estimate on the resampled data → θpi
– Compute ξp(θ) = ξ([θp1, …, θpr])
• Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
42. Bag of Little Bootstraps (BLB)
(Repeats the previous slide, annotated with where each step runs on Hadoop: the subset resampling and per-resample estimates in the Mappers, the per-subset ξp(θ) in the Reducers, and the final average on the Gateway.)
43. Why is BLB Efficient?
• Before:
– N samples with replacement from data of size N is expensive when N is large
• Now:
– N samples with replacement from data of size b
– b can be several orders of magnitude smaller than N (e.g., b = N^γ, γ in [0.5, 1))
– Equivalent to a multinomial sampler with dim = b
– Storage = O(b), computational complexity = O(b)
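The multinomial trick can be sketched in a few lines of numpy (all sizes here are illustrative; `rng.multinomial` draws the whole count vector at once, so each size-N resample costs O(b) rather than O(N)):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
data = rng.normal(0.0, 1.0, size=N)   # stand-in data

gamma = 0.7
b = int(N ** gamma)                   # subset size ~3162, far smaller than N
S, r = 5, 50

subset_sds = []
for _ in range(S):
    subset = rng.choice(data, size=b, replace=False)
    boot_means = []
    for _ in range(r):
        # One size-N resample of the b points is just a multinomial
        # count vector over the b points: storage O(b), work O(b).
        weights = rng.multinomial(N, np.ones(b) / b)
        boot_means.append(np.dot(weights, subset) / N)
    subset_sds.append(np.std(boot_means, ddof=1))

sd_of_mean = float(np.mean(subset_sds))
# Compare with sigma / sqrt(N) = 1 / sqrt(100000) ~ 0.0032
```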
44. Simulation Experiment
• 95% CI of logistic regression coefficients
• N = 20000, 10 explanatory variables
• Relative error = |estimated CI width − true CI width| / true CI width
• BLB-γ: BLB with b = N^γ
• BOFN-γ: b out of N sampling with b = N^γ
• BOOT: naïve bootstrap
46. Real Data
• 95% CI of logistic regression coefficients
• N = 6M, 3000 explanatory variables
• Data size = 150GB; r = 50, s = 5, γ = 0.7
47. Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages
– Fast and efficient
– Easy to parallelize
– Easy to understand and implement
– Friendly to Hadoop; makes it routine to perform statistical calculations on big data
49. Logistic Regression
• Binary response: Y
• Covariates: X
• Yi ~ Bernoulli(pi)
• log(pi / (1 − pi)) = Xi^T β;  β ~ MVN(0, (1/λ) I)
• Widely used (research and applications)
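A quick numpy sketch of this generative model (the sizes and λ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 3, 1.0

# beta ~ MVN(0, (1/lambda) I), i.e. independent N(0, 1/lambda) coordinates
beta = rng.normal(0.0, np.sqrt(1.0 / lam), size=d)
X = rng.normal(size=(n, d))

# log(p / (1 - p)) = X beta  <=>  p = 1 / (1 + exp(-X beta))
p = 1.0 / (1.0 + np.exp(-X @ beta))
Y = rng.binomial(1, p)       # Y_i ~ Bernoulli(p_i)
```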
50. Large-Scale Logistic Regression
• Binary response: Y
– E.g., click / non-click on an ad on a webpage
• Covariates: X
– User covariates:
• Age, gender, industry, education, job, job title, …
– Item covariates:
• Categories, keywords, topics, …
– Context covariates:
• Time, page type, position, …
– 2-way interactions:
• User covariates × item covariates
• Context covariates × item covariates
• …
51. Computational Challenge
• Hundreds of millions/billions of observations
• Hundreds of thousands/millions of covariates
• Fitting such a logistic regression model on a single machine is not feasible
• Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
– Multiple passes over the data
52. Recap on Optimization Methods
• Problem: find x to minimize F(x)
• Iteration n: xn = xn−1 − bn−1 F′(xn−1)
• bn−1 is the step size, which can change every iteration
• Iterate until convergence
• Conjugate gradient, L-BFGS, Newton trust region, … are all of this kind
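A minimal sketch of this iteration with a fixed step size, on a one-dimensional toy objective:

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    # x_n = x_{n-1} - b_{n-1} * F'(x_{n-1}); here b is held fixed.
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:          # "iterate until convergence"
            break
        x = x - step * g
    return x

# Toy objective F(x) = (x - 3)^2, so F'(x) = 2(x - 3); minimizer is x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

On big data the expensive part is evaluating F′, which needs a full pass over the data per iteration; this is what makes the Hadoop cost model of the next slides relevant.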
53. Iterative Process with Hadoop
Disk → Mappers → Disk → Reducers → Disk → Mappers → Disk → Reducers → …
(each iteration reads its input from disk and writes its output to disk)
54. Limitations of Hadoop for Fitting a Big Logistic Regression
• The iterative process is expensive and slow
• Every iteration = a Map-Reduce job
• Mapper and reducer I/O both go through disk
• Plus: time spent waiting in the job queue
• Q: Can we find a fitting method that scales with Hadoop?
55. Large-Scale Logistic Regression
• Naïve approach:
– Partition the data and run logistic regression for each partition
– Take the mean of the learned coefficients
– Problem: not guaranteed to converge to the model from a single machine!
• Alternating Direction Method of Multipliers (ADMM)
– Boyd et al., 2011
– Set up constraints: each partition's coefficients = global consensus
– Solve the optimization problem using Lagrange multipliers
– Advantage: guaranteed to converge to a single-machine logistic regression on the entire data, within a reasonable number of iterations
56. Large-Scale Logistic Regression via ADMM (Iteration 1)
BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K
→ a logistic regression is fit on each partition → consensus computation
57. Large-Scale Logistic Regression via ADMM (Iteration 1, continued)
The consensus is fed back to each partition's logistic regression
58. Large-Scale Logistic Regression via ADMM (Iteration 2)
Each partition refits its logistic regression and the consensus is recomputed
60. Dual Ascent Method
• Consider the equality-constrained convex optimization problem
    minimize f(x) subject to Ax = b,
  with variable x ∈ R^n, where A ∈ R^{m×n} and f: R^n → R is convex
• Lagrangian for the problem:
    L(x, y) = f(x) + y^T (Ax − b)
  with dual function g(y) = inf_x L(x, y) = −f*(−A^T y) − b^T y,
  where y is the dual variable (Lagrange multiplier) and f* is the convex conjugate of f
• Dual ascent solves the dual problem by gradient ascent: the gradient is the constraint residual A x⁺ − b evaluated at x⁺ = argmin_x L(x, y), giving the iterations
    x^{k+1} := argmin_x L(x, y^k)
    y^{k+1} := y^k + α^k (A x^{k+1} − b),
  where α^k > 0 is a step size and the superscript is the iteration counter
61. Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• The augmented Lagrangian is
    Lρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ||Ax − b||²₂,
  where ρ > 0 is called the penalty parameter
• The value of ρ influences the convergence rate
62. Alternating Direction Method of Multipliers (ADMM)
• Problem:
    minimize f(x) + g(z) subject to Ax + Bz = c,
  with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, c ∈ R^p, and f and g are convex; the variable is split into two parts, x and z, with the objective separable across this splitting
• Augmented Lagrangian:
    Lρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||²₂
• ADMM consists of the iterations
    x^{k+1} := argmin_x Lρ(x, z^k, y^k)
    z^{k+1} := argmin_z Lρ(x^{k+1}, z, y^k)
    y^{k+1} := y^k + ρ (A x^{k+1} + B z^{k+1} − c),
  where ρ > 0
63. Large-Scale Logistic Regression via ADMM
• Notation
– (Xi, yi): data in the i-th partition
– βi: coefficient vector for partition i
– β: consensus coefficient vector
– r(β): penalty component, such as ||β||²₂
• Optimization problem
    min Σ_{i=1}^{N} l_i(y_i, X_i^T β_i) + r(β)
    subject to β_i = β
64. ADMM Updates
• Local regressions, with shrinkage towards the current best global estimate
• Updated consensus
65. An Example Implementation
• ADMM for logistic regression model fitting with L2/L1 penalty
• Each iteration of ADMM is a Map-Reduce job
– Mapper: partition the data into K partitions
– Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
– Gateway: consensus computation from the results of all reducers; sends the consensus back to each reducer node
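A toy single-machine sketch of the three roles (plain numpy, with a gradient-descent inner loop standing in for liblinear/glmnet; the partition count, ρ, λ, and all names are illustrative assumptions, not the deployed system):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 4000, 5, 4
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# "Mapper": split the data into K partitions.
Xs, ys = np.array_split(X, K), np.array_split(y, K)

def local_fit(Xp, yp, z, u, rho, steps=200, lr=0.1):
    # "Reducer": minimize average logistic loss on this partition
    # plus the ADMM proximal term (rho/2)||beta - z + u||^2.
    beta = z.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xp @ beta))
        grad = Xp.T @ (p - yp) / len(yp) + rho * (beta - z + u)
        beta = beta - lr * grad
    return beta

rho, lam = 1.0, 0.1
z = np.zeros(d)                       # consensus coefficients
us = [np.zeros(d) for _ in range(K)]  # scaled dual variables
for _ in range(30):                   # each pass = one Map-Reduce job
    betas = [local_fit(Xs[k], ys[k], z, us[k], rho) for k in range(K)]
    # "Gateway": consensus update for the L2 penalty lam * ||z||^2.
    z = rho * sum(b + u for b, u in zip(betas, us)) / (2 * lam + K * rho)
    us = [u + b - z for u, b in zip(us, betas)]
# z now approximates a single-machine regularized logistic regression fit
```

Only the small vectors βi and ui cross machine boundaries in the real deployment; the raw data never leave their partitions.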
66. KDD Cup 2010 Data
• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keeping covariates with >= 10 occurrences leaves 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
69. Better Convergence Can Be Achieved By
• Better initialization
– Use results from the naïve method to initialize the parameters
• Adaptively changing the step size (ρ) for each iteration, based on the convergence status of the consensus
72. Three components we will focus on
• Defining the problem
– Formulate objectives whose optimization achieves some long-
term goals for the recommender system
• E.g. How to serve content to optimize audience reach and engagement, or
optimize some combination of engagement and revenue?
• Modeling (to estimate some critical inputs)
– Predict rates of some positive user interaction(s) with items
based on data obtained from historical user-item interactions
• E.g. Click rates, average time-spent on page, etc
• Could be explicit feedback like ratings
• Experimentation
– Create experiments to collect data proactively to improve models;
this helps converge to the best choice(s) cheaply and rapidly.
• Explore and Exploit (continuous experimentation)
• DOE (testing hypotheses by avoiding bias inherent in data)
73. Modern Recommendation Systems
• Goal
– Serve the right item to a user in a given context to
optimize long-term business objectives
• A scientific discipline that involves
– Large scale Machine Learning & Statistics
• Offline Models (capture global & stable characteristics)
• Online Models (incorporates dynamic components)
• Explore/Exploit (active and adaptive experimentation)
– Multi-Objective Optimization
• Click-rates (CTR), Engagement, advertising revenue, diversity, etc
– Inferring user interest
• Constructing User Profiles
– Natural Language Processing to understand content
• Topics, “aboutness”, entities, follow-up of something, breaking news,…
74. Some examples from content
optimization
• Simple version
– I have a content module on my page, content inventory is
obtained from a third party source which is further refined
through editorial oversight. Can I algorithmically
recommend content on this module? I want to improve
overall click-rate (CTR) on this module
• More advanced
– I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I
increase downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
75. Example modules
• Recommend applications
• Recommend search queries
• Recommend news articles
• Recommend packages: image, title, summary, links to other pages
– Pick 4 out of a pool of K (K = 20-40)
– Dynamic: routes traffic to other pages
76. Problems in this example
• Optimize CTR on multiple modules
– Today Module, Trending Now, Personal Assistant,
News
– Simple solution: Treat modules as independent,
optimize separately. May not be the best when
there are strong correlations.
• For any single module
– Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
77. Online Advertising
• Advertisers submit ads and bids to an ad network (e.g., Yahoo, Google, MSN, or ad exchanges such as RightMedia, DoubleClick, …)
• A user visits a publisher page; an auction selects argmax f(bid, response rates)
• An ML/statistical model estimates the response rates (click, conversion, ad-view)
• The network recommends the best ad(s) for the page; the user's clicks and conversions feed back into the model
78. LinkedIn Today: Content Module
• Objective: serve content to maximize engagement metrics like CTR (or weighted CTR)
80. Right Media Ad Exchange: Unified Marketplace
• Matches ads to page views on publisher sites
• A publisher with an ad impression to sell runs an auction among bidders (e.g., AdSense, Ad.com)
• Example bids: $0.50; $0.60; $0.75 placed via a network, which becomes a $0.45 bid; $0.65 wins the auction
81. Recommender Problems in General
• Item inventory: articles, web pages, ads, …
• Context: query, page, …
• Use an automated algorithm to select item(s) to show to the user
• Get feedback (click, time spent, …); refine the models; repeat (a large number of times)
• Optimize metric(s) of interest (total clicks, total revenue, …)
• Example applications: search (web, vertical), online advertising, content, …
82. Important Factors
• Items: articles, ads, modules, movies, users, updates, etc.
• Context: query keywords, pages, mobile, social media, etc.
• Metric to optimize (e.g., relevance score, CTR, revenue, engagement)
– Currently, most applications are single-objective
– Could be multi-objective optimization (maximize X subject to Y, Z, …)
• Properties of the item pool
– Size (e.g., all web pages vs. 40 stories)
– Quality of the pool (e.g., anything vs. editorially selected)
– Lifetime (e.g., mostly old items vs. mostly new items)
83. Factors Affecting the Solution (continued)
• Properties of the context
– Pull: specified by an explicit, user-driven query (e.g., keywords, a form)
– Push: specified by implicit context (e.g., a page, a user, a session)
• Most applications are somewhere on the continuum of pull and push
• Properties of the feedback on the matches made
– Types and semantics of feedback (e.g., click, vote)
– Latency (e.g., available in 5 minutes vs. 1 day)
– Volume (e.g., 100K per day vs. 300M per day)
• Constraints specifying legitimate matches
– E.g., business rules, diversity rules, editorial voice
– Multiple objectives
• Available metadata (e.g., link graph, various user/item attributes)
84. Predicting User-Item Interactions
(e.g. CTR)
• Myth: we have so much data on the web that if only we can
process it, the problem is solved
– Number of things to learn increases with sample size
• Rate of increase is not slow
– Dynamic nature of systems make things worse
– We want to learn things quickly and react fast
• Data is sparse in web recommender problems
– We lack enough data to learn all we want to learn and
as quickly as we would like to learn
– Several Power laws interacting with each other
• E.g. User visits power law, items served power law
– Bivariate Zipf: Owen & Dyer, 2011
85. Can Machine Learning help?
• Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable
– E.g. Users in San Francisco tend to read more baseball news
• Key issue: Estimating such groups
– Coarse group : more stable but does not generalize that well.
– Granular group: less stable with few individuals
– Getting a good grouping structure means hitting the “sweet spot”
• Another big advantage on the web
– Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choice(s)
• We don’t need to learn all user-item interactions, only those that are good.
86. Predicting User-Item Interaction Rates
• Offline models (capture stable characteristics at coarse resolutions; logistic regression, boosting, …)
• Feature construction
– Content: IR, clustering, taxonomy, entities, …
– User profiles: clicks, views, social, community, …
• Near-online models (finer-resolution corrections at the item and user level; quick updates), initialized from the offline models
• Explore/exploit (adaptive sampling; helps rapid convergence to the best choices)
87. Post-Click: An Example in Content Optimization
• A recommender serves editorial content; clicks on front-page links influence the downstream supply distribution
• Downstream pages carry display advertising (ad server), producing revenue and downstream engagement (time spent)
88. Serving Content on Front Page: Click Shaping
• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
– Article 1: CTR=5%, utility per click = 5
– Article 2: CTR=4.9%, utility per click=10
• By promoting article 2, we lose 0.1 clicks per 100 visits but gain 24 utils (49 vs. 25)
• If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
– E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
89. High-Level Picture
• An http request arrives at the server
• The item recommendation system performs thousands of computations in sub-seconds
• The user interacts (e.g., clicks, or does nothing)
• Statistical models are updated in batch mode, e.g. once every 30 minutes
90. High-Level Overview: Item Recommendation System
• Inputs: user info (activity, profile; updated in batch) and an item index (id, metadata)
• Pre-filter: SPAM, editorial rules, …
• Feature extraction: NLP, clustering, …
• ML/statistical models score items: P(click), P(share), semantic-relevance score, …
• Rank items: sort by score (CTR, bid*CTR, …); combine scores using multi-objective optimization; threshold on some scores; …
• User-item interaction data are batch-processed to update the models
91. ML/Statistical Models for Scoring
(Chart: number of items scored by ML (100, 1000, 100k, 1M, 100M) versus traffic volume and item lifetime (a few hours, a few days, several days), placing LinkedIn Today, Yahoo! Front Page, Right Media Ad Exchange, and LinkedIn Ads along this spectrum.)
92. Summary of Deployments
• Yahoo! front page Today Module (2008-2011): 300% improvement in click-through rates
– Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! properties (2011): significant improvement in engagement across the Yahoo! network
• Fully deployed on the LinkedIn Today module (2012): significant improvement in click-through rates (numbers not revealed for reasons of confidentiality)
• Yahoo! RightMedia exchange (2012): fully deployed algorithms to estimate response rates (CTR, conversion rates); significant improvement in revenue (numbers not revealed for reasons of confidentiality)
• LinkedIn self-serve ads (2012-2013): fully deployed
• LinkedIn news feed (2013-2014): fully deployed
• Several others in progress…
93. Broad Themes
• Curse of dimensionality
– Large number of observations (rows), large number of potential features
(columns)
– Use domain knowledge and machine learning to reduce the "effective"
dimension (constraints on parameters reduce degrees of freedom)
– I will give examples as we move along
• We often assume our job is to analyze "Big Data", but we often have control
over what data to collect through clever experimentation
– This can fundamentally change solutions
• Think of computation and models together for Big Data
• Optimization: what we are trying to optimize is often complex; models have
to work in harmony with the optimization
– Pareto optimality with competing objectives
94. Statistical Problem
• Rank items (from an admissible pool) for user visits in some context to
maximize a utility of interest
• Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR * P[Share|Click])
– Revenue per page-view = CTR*bid (more complex due to second price auction)
• CTR is a fundamental measure that opens the door to a more principled
approach to rank items
• Converge rapidly to maximum utility items
– Sequential decision making process (explore/exploit)
95. LinkedIn Today, Yahoo! Today Module: Choose Items to Maximize CTR
• User i visits, with user features (e.g., industry, behavioral features,
demographic features, …)
• Algorithm selects item j from a set of candidates
• Response y_ij observed (click or not)
• Which item should we select?
– The item with highest predicted CTR (exploit)
– An item for which we need data to predict its CTR (explore)
• This is an "Explore/Exploit" problem
96. The Explore/Exploit Problem (to
maximize CTR)
• Problem definition: Pick k items from a pool of N for a large
number of serves to maximize the number of clicks on the
picked items
• Easy!? Pick the items having the highest click-through rates
(CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item may change over time
– How much traffic should be allocated to explore new items to
achieve optimal performance ?
• Too little → Unreliable CTR estimates due to “starvation”
• Too much → Little traffic to exploit the high CTR items
97. Y! Front Page Application
• Simplify: maximize CTR on first slot (F1)
• Item pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the item pool is dynamic
99. Impact of repeat item views on a given user
• The same user is shown an item multiple times (despite not clicking)
100. Simple algorithm to estimate most popular item with small but dynamic
item pool
• Simple Explore/Exploit scheme
– ε% explore: with a small probability (e.g. 5%), choose an item at random
from the pool
– (100−ε)% exploit: with large probability (e.g. 95%), choose the highest
scoring CTR item
• Temporal smoothing
– Item CTRs change over time; give more weight to recent data in estimating
item CTRs
• Kalman filter, moving average
• Discount item score with repeat views
– CTR(item) for a given user drops with repeat views by some "discount"
factor (estimated from data)
• Segmented most popular
– Perform separate most-popular recommendation for each user segment
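A minimal sketch of the ε-greedy scheme above, using an exponentially weighted moving average as the temporal smoothing of item CTRs; the pool contents, ε = 5%, and the decay factor are illustrative assumptions.

```python
import random

class EpsilonGreedy:
    def __init__(self, items, epsilon=0.05, decay=0.9):
        self.epsilon = epsilon
        self.decay = decay                      # weight on past data
        self.ctr = {i: 0.0 for i in items}      # smoothed CTR per item

    def choose(self):
        if random.random() < self.epsilon:      # ε% explore: random item
            return random.choice(list(self.ctr))
        return max(self.ctr, key=self.ctr.get)  # (100−ε)% exploit

    def update(self, item, clicked):
        # temporal smoothing: recent observations get more weight
        self.ctr[item] = self.decay * self.ctr[item] + (1 - self.decay) * clicked

random.seed(0)
policy = EpsilonGreedy(["art1", "art2"])
for _ in range(2000):
    item = policy.choose()
    p = 0.06 if item == "art1" else 0.03        # simulated true CTRs
    policy.update(item, 1.0 if random.random() < p else 0.0)
print(policy.ctr)
```

The decay factor plays the role of the moving average mentioned above; a Kalman-filter update would also track an uncertainty estimate, not just the mean.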
101. Time series Model: Kalman filter
• Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative
fashion
• Estimated click-rate distribution at time t+1
– Prior mean:
– Prior variance:
• High CTR items are more adaptive
102. More economical exploration? Better bandit solutions
• Consider a two-armed problem with unknown payoff probabilities p1 > p2
• The gambler has 1000 plays; what is the best way to experiment (to maximize
total expected reward)?
• This is called the "multi-armed bandit" problem and has been studied for a
long time
• Optimal solution: play the arm that has maximum potential of being good
– Optimism in the face of uncertainty
103. Item Recommendation: Bandits?
• Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
– Greedy: show Item 2 to all; not a good idea
– Item 1's CTR estimate is noisy; the item could potentially be better
• Invest in Item 1 for better overall performance on average
– Exploit what is known to be good, explore what is potentially good
[Figure] Probability density of CTR: Item 2's density is narrow, Item 1's is
wide
104. Next few hours
                            Most Popular Recommendation   Personalized Recommendation
Offline Models              —                             Collaborative filtering (cold-start problem)
Online Models               Time-series models            Incremental CF, online regression
Intelligent Initialization  Prior estimation              Prior estimation, dimension reduction
Explore/Exploit             Multi-armed bandits           Bandits with covariates
106. Problem
• User i visits, with user features x_i (demographics, browse history, search
history, …)
• Algorithm selects item j with item features x_j (keywords, content
categories, …)
• Response y_ij observed (explicit rating, implicit click/no-click)
• Predict the unobserved entries based on features and the observed entries
107. Model Choices
• Feature-based (or content-based) approach
– Use features to predict response
• (regression, Bayes Net, mixture models, …)
– Limitation: need predictive features
• Bias often high, does not capture signals at granular levels
• Collaborative filtering (CF aka Memory based)
– Make recommendation based on past user-item interaction
• User-user, item-item, matrix factorization, …
• See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
– Better performance for old users and old items
– Does not naturally handle new users and new items (cold-
start)
108. Collaborative Filtering (Memory based methods)
• User-user similarity; item-item similarities; incorporating both
• Estimating similarities
– Pearson's correlation
– Optimization based (Koren et al)
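The Pearson-correlation similarity above can be sketched for item-item similarity over co-rating users; the ratings dictionary below is a toy assumption.

```python
from math import sqrt

def pearson_item_sim(ratings, a, b):
    """ratings: {user: {item: rating}}; Pearson similarity of items a and b
    over the users who rated both."""
    common = [u for u in ratings if a in ratings[u] and b in ratings[u]]
    if len(common) < 2:
        return 0.0
    ra = [ratings[u][a] for u in common]
    rb = [ratings[u][b] for u in common]
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = sqrt(sum((x - ma) ** 2 for x in ra)) * sqrt(sum((y - mb) ** 2 for y in rb))
    return num / den if den else 0.0

ratings = {
    "u1": {"i1": 5, "i2": 4, "i3": 1},
    "u2": {"i1": 4, "i2": 5, "i3": 2},
    "u3": {"i1": 1, "i2": 2, "i3": 5},
}
print(round(pearson_item_sim(ratings, "i1", "i2"), 3))  # strongly positive
print(round(pearson_item_sim(ratings, "i1", "i3"), 3))  # strongly negative
```

In practice the similarities would be shrunk toward zero when the number of co-rating users is small.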
109. How to Deal with the Cold-Start Problem
• Heuristic-based approaches
– Linear combination of regression and CF models
– Filterbot
• Add user features as pseudo users and do collaborative filtering
– Hybrid approaches
• Use content based methods to fill up entries, then use CF
• Matrix Factorization
– Good performance on Netflix (Koren, 2009)
• Model-based approaches
– Bilinear random-effects model (probabilistic matrix factorization)
• Good on Netflix data [Ruslan et al ICML, 2009]
– Add feature-based regression to matrix factorization
• (Agarwal and Chen, 2009)
– Add topic discovery (from textual items) to matrix factorization
• (Agarwal and Chen, 2009; Chun and Blei, 2011)
110. Per-item regression models
• When tracking users by cookies, the distribution of visit patterns can get
extremely skewed
– The majority of cookies have 1-2 visits
• Per-item models (regression) based on user covariates are attractive in
such cases
111. Several per-item regressions: Multi-task learning
• Low-dimensional (5-10) matrix B estimated from retrospective data; captures
affinity to old items
• Agarwal, Chen and Elango, KDD, 2010
113. Motivation
• Data measuring k-way interactions pervasive
– Consider k = 2 for all our discussions
• E.g. User-Movie, User-content, User-Publisher-Ads,….
– Power law on both user and item degrees
• Classical Techniques
– Approximate matrix through a singular value
decomposition (SVD)
• After adjusting for marginal effects (user pop, movie pop,..)
– Does not work
• Matrix highly incomplete, severe over-fitting
– Key issue
• Regularization of eigenvectors (factors) to avoid overfitting
114. Early work on complete matrices
• Tukey’s 1-df model (1956)
– Rank 1 approximation of small nearly complete
matrix
• Criss-cross regression (Gabriel, 1978)
• Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s)
• Modern day recommender problems
– Highly incomplete, large, noisy.
116. Factorization – Brief Overview
• Latent user factors: (α_i, u_i = (u_i1, …, u_in))
• Latent movie factors: (β_j, v_j = (v_j1, …, v_jn))
• Interaction: E(y_ij) = µ + α_i + β_j + u_i' B v_j
• (Nn + Mm) parameters
• Key technical issue: will overfit for moderate values of n, m →
regularization
117. Latent Factor Models: Different
Aspects
• Matrix Factorization
– Factors in Euclidean space
– Factors on the simplex
• Incorporating features and ratings
simultaneously
• Online updates
118. Maximum Margin Matrix Factorization (MMMF)
• Complete the matrix by minimizing loss (hinge, squared error) on observed
entries subject to constraints on the trace norm
– Srebro, Rennie, Jaakkola (NIPS 2004)
– Convex, semi-definite programming (expensive, not scalable)
• Fast MMMF (Rennie & Srebro, ICML, 2005)
– Constrain the Frobenius norm of the left and right eigenvector matrices;
not convex, but becomes scalable
• Other variation: Ensemble MMMF (DeCoste,
ICML2005)
– Ensembles of partially trained MMMF (some
improvements)
119. Matrix Factorization for Netflix prize data
• Minimize the objective function
Σ_{ij ∈ obs} (r_ij − u_i' v_j)² + λ (Σ_i ‖u_i‖² + Σ_j ‖v_j‖²)
• Simon Funk: stochastic gradient descent
• Koren et al (KDD 2007): alternating least squares
– They moved to SGD later in the competition
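A toy sketch of the SGD approach to this objective, in the spirit of Simon Funk's solver; the data, factor dimension, and hyper-parameters below are made-up, not Netflix settings.

```python
import random

def sgd_mf(obs, n_users, n_items, k=2, lam=0.02, lr=0.05, epochs=500, seed=0):
    """Minimize sum over observed (i,j) of (r_ij - u_i·v_j)^2
    + lam*(||u_i||^2 + ||v_j||^2) by stochastic gradient descent."""
    rng = random.Random(seed)
    u = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    v = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in obs:
            err = r - sum(ui * vj for ui, vj in zip(u[i], v[j]))
            for f in range(k):
                ui, vj = u[i][f], v[j][f]
                u[i][f] += lr * (err * vj - lam * ui)   # gradient step on u_i
                v[j][f] += lr * (err * ui - lam * vj)   # gradient step on v_j
    return u, v

obs = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 5.0)]
u, v = sgd_mf(obs, n_users=3, n_items=2)
rmse = (sum((r - sum(a * b for a, b in zip(u[i], v[j]))) ** 2
            for i, j, r in obs) / len(obs)) ** 0.5
print(round(rmse, 3))  # small training error on observed entries
```

Note this measures training error only; the regularizer λ is what controls over-fitting on held-out entries.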
120. Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
• Model:
r_ij ~ N(u_i' v_j, σ²)
u_i ~ MVN(0, a_u I)
v_j ~ MVN(0, a_v I)
• Optimization is through iterated conditional modes
• Other variations: constraining the mean through a sigmoid, using
"who-rated-whom"
• Combining with Boltzmann machines also improved performance
121. Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008)
• Fully Bayesian treatment using an MCMC approach
– Significant improvement
• Interpretation as a fully Bayesian hierarchical model shows why that is the
case
– Failing to incorporate uncertainty (e.g., in the variance component a_u)
leads to bias in estimates
– Multi-modal posterior; MCMC helps in converging to a better one
• MCEM also more resistant to over-fitting
122. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010)
• Specify rank probabilistically (automatic rank selection)
y_ij ~ N(Σ_{k=1..r} z_k u_ik v_jk, σ²)
z_k ~ Bernoulli(π_k)
π_k ~ Beta(a/r, b(r−1)/r)
• Marginally, z_k ~ Bernoulli(a/(a + b(r−1)))
• E(#Factors) = ra/(a + b(r−1))
123. How to incorporate features:
Deal with both warm start and cold-start
• Models to predict ratings for new pairs
– Warm-start: (user, movie) present in the training data with large
sample size
– Cold-start: At least one of (user, movie) new or has small sample
size
• Rough definition, warm-start/cold-start is a continuum.
• Challenges
– Highly incomplete (user, movie) matrix
– Heavy tailed degree distributions for users/movies
• Large fraction of ratings from small fraction of users/
movies
– Handling both warm-start and cold-start effectively in the
presence of predictive features
124. Possible approaches
• Large scale regression based on covariates
– Does not provide good estimates for heavy users/movies
– Large number of predictors to estimate interactions
• Collaborative filtering
– Neighborhood based
– Factorization
• Good for warm-start; cold-start dealt with separately
• Single model that handles cold-start and warm-start
– Heavy users/movies → User/movie specific model
– Light users/movies → fallback on regression model
– Smooth fallback mechanism for good performance
126. Regression-based Factorization
Model (RLFM)
• Main idea: Flexible prior, predict factors
through regressions
• Seamlessly handles cold-start and warm-
start
• Modified state equation to incorporate
covariates
127. RLFM: Model
• Rating that user i gives item j:
y_ij ~ N(µ_ij, σ²)           Gaussian model
y_ij ~ Bernoulli(µ_ij)       Logistic model (for binary rating)
y_ij ~ Poisson(µ_ij N_ij)    Poisson model (for counts)
t(µ_ij) = x_ij' b + α_i + β_j + u_i' v_j
• Bias of user i: α_i = g_0' x_i + ε_i^α, ε_i^α ~ N(0, σ_α²)
• Popularity of item j: β_j = d_0' x_j + ε_j^β, ε_j^β ~ N(0, σ_β²)
• Factors of user i: u_i = G x_i + ε_i^u, ε_i^u ~ N(0, σ_u² I)
• Factors of item j: v_j = D x_j + ε_j^v, ε_j^v ~ N(0, σ_v² I)
• Could use other classes of regression models
129. Advantages of RLFM
• Better regularization of factors
– Covariates “shrink” towards a better centroid
• Cold-start: Fallback regression model (FeatureOnly)
130. RLFM: Illustration of Shrinkage
Plot the first factor
value for each user
(fitted using Yahoo! FP
data)
135. Computing the E-step
• Often hard to compute in closed form
• Stochastic EM (Markov Chain EM; MCEM)
– Compute expectation by drawing samples from
– Effective for multi-modal posteriors but more expensive
• Iterated Conditional Modes algorithm (ICM)
– Faster but biased hyper-parameter estimates
136. Monte Carlo E-step
• Through a vanilla Gibbs sampler (conditionals closed form)
• Other conditionals also Gaussian and closed form
• Conditionals of users (movies) sampled simultaneously
• Small number of samples in early iterations, large numbers in
later iterations
137. M-step (Why MCEM is better than
ICM)
• Update G, optimize
• Update Au=au I
Ignored by ICM, underestimates factor variability
Factors over-shrunk, posterior not explored well
139. Experiment 2: Better handling of
Cold-start
• MovieLens-1M; EachMovie
• Training-test split based on timestamp
• Same covariates as in Experiment 1.
140. Experiment 4: Predicting click-rate
on articles
• Goal: Predict click-rate on articles for a user on F1
position
• Article lifetimes short, dynamic updates important
• User covariates:
– Age, Gender, Geo, Browse behavior
• Article covariates
– Content Category, keywords
• 2M ratings, 30K users, 4.5 K articles
142. Some other related approaches
• Stern, Herbrich and Graepel, WWW, 2009
– Similar to RLFM, different parametrization and
expectation propagation used to fit the models
• Porteus, Asuncion and Welling, AAAI, 2011
– Non-parametric approach using a Dirichlet process
• Agarwal, Zhang and Mazumdar, Annals of Applied
Statistics, 2011
– Regression + random effects per user regularized
through a Graphical Lasso
143. Add Topic Discovery into
Matrix Factorization
fLDA: Matrix Factorization through Latent
Dirichlet Allocation
144. fLDA: Introduction
• Model the rating yij that user i gives to item j as the user’s
affinity to the topics that the item has
– Unlike regular unsupervised LDA topic modeling, here the LDA
topics are learnt in a supervised manner based on past rating
data
– fLDA can be thought of as a “multi-task learning” version of the
supervised LDA model [Blei’07] for cold-start recommendation
y_ij = … + Σ_k s_ik z_jk
– s_ik: user i's affinity to topic k
– z_jk: Pr(item j has topic k), estimated by averaging the LDA topic of each
word in item j
– Old items: z_jk's are item latent factors learnt from data with the LDA
prior
– New items: z_jk's are predicted based on the bag of words in the items
145. LDA Topic Modeling (1)
• LDA is effective for unsupervised topic discovery [Blei'03]
– It models the generating process of a corpus of items (articles)
– For each topic k, draw a word distribution Φ_k = [Φ_k1, …, Φ_kW] ~ Dir(η)
– For each item j, draw a topic distribution θ_j = [θ_j1, …, θ_jK] ~ Dir(λ)
– For each word, say the nth word, in item j:
• Draw a topic z_jn for that word from θ_j = [θ_j1, …, θ_jK]
• Draw a word w_jn from Φ_k = [Φ_k1, …, Φ_kW] with topic k = z_jn
[Figure] Topics 1, …, K with word distributions Φ_k = [Φ_k1, …, Φ_kW]; item j
with topic distribution [θ_j1, …, θ_jK], observed words w_j1, …, w_jn, …, and
per-word topics z_j1, …, z_jn, …
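The generative process above can be simulated directly. This toy sketch uses fixed topic and word distributions instead of Dirichlet draws, and the vocabularies and probabilities are illustrative assumptions.

```python
import random

def sample(dist, rng):
    """Draw a key from a {key: probability} dict by inverse CDF."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate_item(theta, phi, n_words, rng):
    """theta: item topic distribution; phi[k]: word distribution of topic k."""
    words, topics = [], []
    for _ in range(n_words):
        z = sample(theta, rng)              # per-word topic z_jn ~ theta_j
        topics.append(z)
        words.append(sample(phi[z], rng))   # word w_jn ~ Phi_{z_jn}
    return words, topics

rng = random.Random(42)
phi = {0: {"goal": 0.6, "team": 0.4}, 1: {"stock": 0.7, "bank": 0.3}}
words, topics = generate_item({0: 0.8, 1: 0.2}, phi, n_words=10, rng=rng)
print(words)
```

Model fitting (next slide) runs this process in reverse: given only the words, infer the per-word topics and the distributions θ and Φ.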
146. LDA Topic Modeling (2)
• Model training:
– Estimate the prior parameters and the posterior topic×word distribution Φ
based on a training corpus of items
– EM + Gibbs sampling is a popular method
• Inference for new items
– Compute the item topic distribution based on the prior parameters and Φ
estimated in the training phase
• Supervised LDA [Blei'07]
– Predict a target value for each item based on supervised LDA topics:
y_j = Σ_k s_k z_jk
• s_k: regression weight for topic k
• z_jk: Pr(item j has topic k), estimated by averaging the topic of each word
in item j
– vs. fLDA: y_ij = … + Σ_k s_ik z_jk
• One regression per user; same set of topics across different regressions
147. fLDA: Model
• Rating that user i gives item j:
y_ij ~ N(µ_ij, σ²)           Gaussian model
y_ij ~ Bernoulli(µ_ij)       Logistic model (for binary rating)
y_ij ~ Poisson(µ_ij N_ij)    Poisson model (for counts)
t(µ_ij) = x_ij' b + α_i + β_j + Σ_k s_ik z_jk
• Bias of user i: α_i = g_0' x_i + ε_i^α, ε_i^α ~ N(0, σ_α²)
• Popularity of item j: β_j = d_0' x_j + ε_j^β, ε_j^β ~ N(0, σ_β²)
• Topic affinity of user i: s_i = H x_i + ε_i^s, ε_i^s ~ N(0, σ_s² I)
• Pr(item j has topic k): z_jk = Σ_n 1(z_jn = k) / (#words in item j)
– z_jn is the LDA topic of the nth word in item j
• Observed words: w_jn ~ LDA(λ, η, z_jn), the nth word in item j
148. Model Fitting
• Given:
– Features X = {x_i, x_j, x_ij}
– Observed ratings y = {y_ij} and words w = {w_jn}
• Estimate:
– Parameters: Θ = [b, g_0, d_0, H, σ², a_α, a_β, A_s, λ, η]
• Regression weights and prior parameters
– Latent factors: Δ = {α_i, β_j, s_i} and z = {z_jn}
• User factors, item factors and per-word topic assignment
• Empirical Bayes approach:
– Maximum likelihood estimate of the parameters:
Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
– The posterior distribution of the factors: Pr[z, Δ | y, Θ̂]
149. The EM Algorithm
• Iterate through the E and M steps until convergence
– Let Θ̂^(n) be the current estimate
– E-step: Compute
f_n(Θ) = E_{(z, Δ | y, w, Θ̂^(n))} [ log Pr(y, w, Δ, z | Θ) ]
• The expectation is not in closed form
• We draw Gibbs samples and compute the Monte Carlo mean
– M-step: Find
Θ̂^(n+1) = argmax_Θ f_n(Θ)
• It consists of solving a number of regression and optimization problems
150. Supervised Topic Assignment
• Gibbs sampling of the topic z_jn of the nth word in item j:
Pr(z_jn = k | Rest) ∝ (Z_jk^¬jn + λ) · (Z_{k,w_jn}^¬jn + η) / (Z_k^¬jn + Wη)
· Π_{i rated j} f(y_ij | z_jn = k)
– The first factors are the same as in unsupervised LDA
– The last factor is the likelihood of the observed ratings by users who
rated item j when z_jn is set to topic k (the probability of observing y_ij
given the model)
151. fLDA: Experimental Results (Movie)
• Task: Predict the rating that a user would give a movie
• Training/test split:
– Sort observations by time
– First 75% → Training data
– Last 25% → Test data
• Item warm-start scenario
– Only 2% new items in test data
Model Test RMSE
RLFM 0.9363
fLDA 0.9381
Factor-Only 0.9422
FilterBot 0.9517
unsup-LDA 0.9520
MostPopular 0.9726
Feature-Only 1.0906
Constant 1.1190
fLDA is as strong as the best method
It does not reduce the performance in warm-start scenarios
152. fLDA: Experimental Results (Yahoo! Buzz)
• Task: Predict whether a user would buzz-up an article
• Severe item cold-start
– All items are new in test data
Data Statistics
1.2M observations
4K users
10K articles
fLDA significantly
outperforms other
models
153. Experimental Results: Buzzing Topics
Top terms (after stemming) → topic label:
• bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas,
investig, georg, memo, al → CIA interrogation
• mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag,
offici, somalia, captain → Swine flu
• NFL, player, team, suleman, game, nadya, star, high, octuplet,
nadya_suleman, michael, week → NFL games
• court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court,
appeal, ban, legal, allow → Gay marriage
• palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid,
sarah_palin, sai, gov, alaska → Sarah Palin
• idol, american, night, star, look, michel, win, dress, susan, danc, judg,
boyl, michelle_obama → American Idol
• economi, recess, job, percent, econom, bank, expect, rate, jobless, year,
unemploy, month → Recession
• north, korea, china, north_korea, launch, nuclear, rocket, missil, south,
said, russia → North Korea issues
3/4 of topics are interpretable; 1/2 are similar to unsupervised topics
154. fLDA Summary
• fLDA is a useful model for cold-start item recommendation
• It also provides interpretable recommendations for users
– User’s preference to interpretable LDA topics
• Future directions:
– Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm
– Apply fLDA to other multi-task prediction problems
• fLDA can be used as a tool to generate supervised
features (topics) from text data
155. Summary
• Regularizing factors through covariates effective
• Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a
single framework in a seamless way looks attractive
• Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step
can be done with any off-the-shelf scalable linear
regression routine
• Distributed computing on Hadoop: Multiple models
and average across partitions (more later)
157. Why Online Components?
• Cold start
– New items or new users come to the system
– How to obtain data for new items/users (explore/exploit)
– Once data becomes available, how to quickly update the model
• Periodic rebuild (e.g., daily): Expensive
• Continuous online update (e.g., every minute): Cheap
• Concept drift
– Item popularity, user interest, mood, and user-to-item affinity may
change over time
– How to track the most recent behavior
• Down-weight old data
– How to model temporal patterns for better prediction
• … may not need to be online if the patterns are stationary
158. Big Picture
                            Most Popular Recommendation   Personalized Recommendation
Offline Models              —                             Collaborative filtering (cold-start problem)
Online Models               Time-series models            Incremental CF, online regression
(real systems are dynamic)
Intelligent Initialization  Prior estimation              Prior estimation, dimension reduction
(do not start cold)
Explore/Exploit             Multi-armed bandits           Bandits with covariates
(actively acquire data)
Extension: Segmented Most Popular Recommendation
159. Online Components for Most Popular Recommendation
• Online models, intelligent initialization & explore/exploit
160. Most popular recommendation:
Outline
• Most popular recommendation (no
personalization, all users see the same thing)
– Time-series models (online models)
– Prior estimation (initialization)
– Multi-armed bandits (explore/exploit)
– Sometimes hard to beat!!
• Segmented most popular recommendation
– Create user segments/clusters based on user
features
– Do most popular recommendation for each segment
161. Most Popular Recommendation
• Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on
the picked items
• Easy!? Pick the items having the highest click-
through rates (CTRs)
• But …
– The system is highly dynamic:
• Items come and go with short lifetimes
• CTR of each item changes over time
– How much traffic should be allocated to explore new
items to achieve optimal performance
• Too little → Unreliable CTR estimates
• Too much → Little traffic to exploit the high CTR items
162. CTR Curves for Two Days on Yahoo! Front Page
• Each curve is the CTR of an item in the Today Module on www.yahoo.com over
time
• Traffic obtained from a controlled randomized experiment (no confounding)
• Things to note: (a) short lifetimes, (b) temporal effects, (c) often
breaking news stories
163. For Simplicity, Assume …
• Pick only one item for each user visit
– Multi-slot optimization later
• No user segmentation, no personalization
(discussion later)
• The pool of candidate items is predetermined
and is relatively small (≤ 1000)
– E.g., selected by human editors or by a first-phase
filtering method
– Ideally, there should be a feedback loop
– Large item pool problem later
• Effects like user-fatigue, diversity in
recommendations, multi-objective optimization
not considered (discussion later)
164. Online Models
• How to track the changing CTR of an item
• Data: for each item, at time t, we observe
– Number of times the item nt was displayed (i.e., #views)
– Number of clicks ct on the item
• Problem Definition: Given c1, n1, …, ct, nt, predict the CTR
(click-through rate) pt+1 at time t+1
• Potential solutions:
– Observed CTR at t: ct / nt → highly unstable (nt is usually small)
– Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very
slowly
– Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable
• But, no estimation of Var[pt+1] (useful for explore/exploit)
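The three candidate CTR trackers above, side by side on a made-up click/view stream in which the CTR drops midway:

```python
from collections import deque

def observed_ctr(c, n):
    """Raw per-interval CTR; highly unstable when n is small."""
    return c / n if n else 0.0

class CumulativeCTR:
    """All data pooled; reacts to changes very slowly."""
    def __init__(self):
        self.c = self.n = 0
    def update(self, c, n):
        self.c += c
        self.n += n
        return self.c / self.n

class MovingWindowCTR:
    """CTR over the last k intervals; a reasonable compromise."""
    def __init__(self, k):
        self.window = deque(maxlen=k)   # keeps the last k (clicks, views)
    def update(self, c, n):
        self.window.append((c, n))
        cs = sum(c for c, _ in self.window)
        ns = sum(n for _, n in self.window)
        return cs / ns

stream = [(5, 100), (4, 100), (0, 100), (0, 100)]  # CTR drops midway
cum, win = CumulativeCTR(), MovingWindowCTR(k=2)
for c, n in stream:
    obs_est = observed_ctr(c, n)
    cum_est, win_est = cum.update(c, n), win.update(c, n)
print(obs_est, cum_est, win_est)  # → 0.0 0.0225 0.0
```

As the slide notes, none of these provides Var[p_{t+1}]; the Gamma-Poisson model on the next slide tracks a full distribution instead of a point estimate.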
165. Online Models: Dynamic Gamma-Poisson
• Model-based approach
– (c_t | n_t, p_t) ~ Poisson(n_t p_t)
– p_t = p_{t−1} ε_t, where ε_t ~ Gamma(mean=1, var=η)
– Model parameters:
• p_1 ~ Gamma(mean=µ_0, var=σ_0²) is the offline CTR estimate
• η specifies how dynamic/smooth the CTR is over time
– Posterior distribution (p_{t+1} | c_1, n_1, …, c_t, n_t) ~ Gamma(?, ?)
• Solve this recursively (online update rule)
• Notation: p_t = CTR at time t; the item is shown n_t times and receives c_t
clicks
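A moment-matching sketch of the recursive update above. The click update uses exact Gamma-Poisson conjugacy; the multiplicative evolution step is approximated by keeping the mean and inflating the variance (an assumption — the exactly evolved distribution is no longer Gamma).

```python
class GammaPoissonCTR:
    def __init__(self, mean, var, eta):
        self.mean, self.var, self.eta = mean, var, eta

    def update(self, clicks, views):
        # Gamma(shape=a, rate=b) prior + Poisson(views * p) likelihood:
        # conjugacy gives posterior Gamma(a + clicks, b + views)
        a = self.mean ** 2 / self.var + clicks
        b = self.mean / self.var + views
        self.mean, self.var = a / b, a / b ** 2
        # evolution p_{t+1} = p_t * eps with E[eps]=1, Var[eps]=eta:
        # mean unchanged, variance inflated (moment matching)
        self.var = self.var * (1 + self.eta) + self.mean ** 2 * self.eta
        return self.mean, self.var

# offline prior mean/var and smoothness eta are illustrative
model = GammaPoissonCTR(mean=0.04, var=0.0004, eta=0.01)
for clicks, views in [(6, 100), (5, 100), (2, 100)]:
    mean, var = model.update(clicks, views)
print(round(mean, 4))  # tracks the recent empirical CTR
```

The variance inflation is what keeps high-CTR items "more adaptive": their posterior never collapses to a point, so recent data always moves the estimate.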
167. Tracking behavior of Gamma-
Poisson model
• Low click rate articles – More temporal
smoothing
168. Intelligent Initialization: Prior Estimation
• Prior CTR distribution: Gamma(mean=µ_0, var=σ_0²)
– N historical items:
• n_i = #views of item i in its first time interval
• c_i = #clicks on item i in its first time interval
– Model
• c_i ~ Poisson(n_i p_i) and p_i ~ Gamma(µ_0, σ_0²)
⇒ c_i ~ NegBinomial(µ_0, σ_0², n_i)
– Maximum likelihood estimate (MLE) of (µ_0, σ_0²):
argmax_{µ_0, σ_0²}  Σ_i log Γ(µ_0²/σ_0² + c_i) − N log Γ(µ_0²/σ_0²)
+ N (µ_0²/σ_0²) log(µ_0/σ_0²) − Σ_i (µ_0²/σ_0² + c_i) log(µ_0/σ_0² + n_i)
• Better prior: Cluster items and find the MLE for each cluster
– Agarwal & Chen, 2011 (SIGMOD)
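A grid-search sketch of this MLE: with p_i ~ Gamma(mean=µ0, var=σ0²), i.e. shape a = µ0²/σ0² and rate b = µ0/σ0², the marginal of c_i is negative binomial. The data are simulated and the grid is coarse; a real system would use a proper optimizer.

```python
from math import lgamma, log

def neg_bin_loglik(mu, s2, clicks, views):
    """Marginal log-likelihood of the click counts, up to terms constant
    in (mu, s2)."""
    a, b = mu * mu / s2, mu / s2
    ll = 0.0
    for c, n in zip(clicks, views):
        ll += lgamma(a + c) - lgamma(a) + a * log(b) - (a + c) * log(b + n)
    return ll

# simulated first-interval counts for 6 historical items
clicks = [3, 5, 2, 8, 4, 6]
views = [100, 120, 90, 150, 110, 130]

best = max(
    ((mu, s2) for mu in [0.01, 0.02, 0.04, 0.08] for s2 in [1e-5, 1e-4, 1e-3]),
    key=lambda p: neg_bin_loglik(p[0], p[1], clicks, views),
)
print(best)
```

The "better prior" variant on the slide would run this same fit once per item cluster instead of once globally.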
169. Explore/Exploit: Problem Definition
• Time slots …, t−2, t−1, t (now), with clicks in the future
• Serve Item 1 to x_1% of page views, Item 2 to x_2%, …, Item K to x_K%
• Determine (x_1, x_2, …, x_K) based on clicks and views observed before t in
order to maximize the expected total number of clicks in the future
170. Modeling the Uncertainty, NOT just the Mean
• Simplified setting: two items
– We know the CTR of Item A (say, shown 1 million times)
– We are uncertain about the CTR of Item B (only 100 times)
• If we only make a single decision, give 100% of page views to Item A
• If we make multiple decisions in the future, explore Item B since its CTR
can potentially be higher
• Potential of Item B: let q be the CTR of Item A, p the CTR of Item B, and
f(p) the probability density of Item B's CTR; then
Potential = ∫_{p > q} (p − q) f(p) dp
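The potential integral above can be approximated by Monte Carlo. Here f is taken to be a Beta posterior for Item B's CTR (an assumption — the slides use a Gamma model elsewhere), with illustrative counts.

```python
import random

def potential(q, clicks, views, n_samples=100_000, seed=1):
    """Monte Carlo estimate of integral over p > q of (p - q) f(p) dp,
    with f = Beta(clicks + 1, views - clicks + 1)."""
    rng = random.Random(seed)
    a, b = clicks + 1, views - clicks + 1
    total = 0.0
    for _ in range(n_samples):
        p = rng.betavariate(a, b)   # draw from Item B's CTR posterior
        if p > q:
            total += p - q          # only the upside counts
    return total / n_samples

# Item A: known CTR q = 0.05. Item B: 8 clicks in 100 views (uncertain).
print(round(potential(q=0.05, clicks=8, views=100), 4))
```

Note how the potential shrinks as Item B accumulates data at the same rate: with less posterior spread, there is less upside left to explore.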
171. Multi-Armed Bandits: Introduction (1)
• Bandit "arms" with unknown payoff probabilities p_1, p_2, p_3
• "Pulling" arm i yields a reward:
– reward = 1 with probability p_i (success)
– reward = 0 otherwise (failure)
• For now, we are attacking the problem of choosing the best article/arm for
all users
172. Multi-Armed Bandits: Introduction (2)
• Bandit "arms" with unknown payoff probabilities p_1, p_2, p_3
• Goal: pull arms sequentially to maximize the total reward
• Bandit scheme/policy: sequential algorithm to play arms (items)
• Regret of a scheme = expected loss relative to the "oracle" optimal scheme
that always plays the best arm
– "best" means highest success probability
– But, the best arm is not known … unless you have an oracle
– Regret is the price of exploration
– Low regret implies quick convergence to the best arm
173. Multi-Armed Bandits: Introduction
(3)
• Bayesian approach
– Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about
probability distributions
– Representative work: Gittins’ index, Whittle’s index
– Very computationally intensive
• Minimax approach
– Seeks to find a scheme that incurs bounded regret (with no
or mild assumptions about probability distributions)
– Representative work: UCB by Lai, Auer
– Usually, computationally easy
– But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
174. Multi-Armed Bandits: Markov Decision Process (1)
• Select an arm now at time t=0, to maximize expected total number
of clicks in t=0,…,T
• State at time t: Θt = (θ1t, …, θKt)
– θit = State of arm i at time t (that captures all we know about arm i at t)
• Reward function Ri(Θt, Θt+1)
– Reward of pulling arm i that brings the state from Θt to Θt+1
• Transition probability Pr[Θt+1 | Θt, pulling arm i ]
• Policy π: A function that maps a state to an arm (action)
– π(Θt) returns an arm (to pull)
• Value of policy π starting from the current state Θ0 with horizon T
V_π^T(Θ_0) = E[ R_{π(Θ_0)}(Θ_0, Θ_1) + V_π^{T−1}(Θ_1) ]
= ∫ Pr[Θ_1 | Θ_0, π(Θ_0)] · ( R_{π(Θ_0)}(Θ_0, Θ_1) + V_π^{T−1}(Θ_1) ) dΘ_1
– First term: immediate reward; second term: value of the remaining T−1 time
slots if we start from state Θ_1
175. Multi-Armed Bandits: MDP (2)
• Optimal policy: argmax_π V_π^T(Θ_0)
• Things to notice:
– Value is defined recursively (actually T high-dimensional integrals)
– Dynamic programming can be used to find the optimal policy
– But, just evaluating the value of a fixed policy can be very expensive
• Bandit problem: the pull of one arm does not change the state of other
arms, and the set of arms does not change over time
176. Multi-Armed Bandits: MDP (3)
• Which arm should be pulled next?
– Not necessarily what looks best right now, since it might have had a few
lucky successes
– Looks like it will be a function of successes and failures of all arms
• Consider a slightly different problem setting
– Infinite time horizon, but
– Future rewards are geometrically discounted
Rtotal = R(0) + γ.R(1) + γ2.R(2) + … (0<γ<1)
• Theorem [Gittins 1979]: The optimal policy decouples and solves a
bandit problem for each arm independently
• Policy π(Θ_t) is a function of (θ_1t, …, θ_Kt): one K-dimensional problem
• Gittins' index: policy π(Θ_t) = argmax_i { g(θ_it) }: K one-dimensional
problems
• Still computationally expensive!!
177. Multi-Armed Bandits: MDP (4)
• Bandit policy:
1. Compute the priority (Gittins' index) of each arm based on its state
2. Pull the arm with max priority, and observe the reward
3. Update the state of the pulled arm
178. Multi-Armed Bandits: MDP (5)
• Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm
independently
– Many proofs and different interpretations of Gittins’ index
exist
• The index of an arm is the fixed charge per pull for a game with two options, whether
to pull the arm or not, so that the charge makes the optimal play of the game have
zero net reward
– Significantly reduces the dimension of the problem space
– But, Gittins’ index g(θit) is still hard to compute
• For the Gamma-Poisson or Beta-Binomial models
θit = (#successes, #pulls) for arm i up to time t
• g maps each possible (#successes, #pulls) pair to a number
– Approximate methods are used in practice
– Lai et al. have derived these for exponential family
distributions
179. Multi-Armed Bandits: Minimax Approach (1)
• Compute the priority of each arm i in a way that the regret is bounded
– Lowest regret in the worst case
• One common policy is UCB1 [Auer 2002]:
Priority_i = c_i / n_i + sqrt( 2 log(n) / n_i )
– c_i: number of successes of arm i; n_i: number of pulls of arm i; n: total
number of pulls of all arms
– First term: observed success rate; second term: factor representing
uncertainty
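A direct implementation of the UCB1 priority above, run on a small simulation; the reward probabilities and horizon are illustrative.

```python
import random
from math import log, sqrt

class UCB1:
    def __init__(self, n_arms):
        self.c = [0.0] * n_arms   # successes per arm
        self.n = [0] * n_arms     # pulls per arm

    def choose(self):
        for i, ni in enumerate(self.n):
            if ni == 0:
                return i          # pull each arm once before using the index
        total = sum(self.n)
        # priority_i = c_i/n_i + sqrt(2 log n / n_i)
        return max(range(len(self.n)),
                   key=lambda i: self.c[i] / self.n[i]
                   + sqrt(2 * log(total) / self.n[i]))

    def update(self, arm, reward):
        self.c[arm] += reward
        self.n[arm] += 1

random.seed(7)
true_p = [0.8, 0.3, 0.1]          # simulated payoff probabilities
policy = UCB1(len(true_p))
for _ in range(2000):
    arm = policy.choose()
    policy.update(arm, 1.0 if random.random() < true_p[arm] else 0.0)
print(policy.n)                   # the best arm accumulates the most pulls
```

As the next slides note, exploration never fully stops: the uncertainty term keeps every arm's priority drifting upward while it goes unpulled.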
180. Multi-Armed Bandits: Minimax Approach (2)
Priority_i = c_i / n_i + sqrt( 2 log(n) / n_i )
(observed payoff + factor representing uncertainty)
• As the total number of observations n becomes large:
– Observed payoff tends asymptotically towards the true payoff probability
– The system never completely "converges" to one best arm; only the rate of
exploration tends to zero
181. Multi-Armed Bandits: Minimax Approach (3)
Priority_i = c_i / n_i + sqrt( 2 log(n) / n_i )
(observed payoff + factor representing uncertainty)
• Sub-optimal arms are pulled O(log n) times
• Hence, UCB1 has O(log n) regret
• This is the lowest possible regret (but the constants matter!)
• E.g., the regret after n plays is bounded by
[ 8 Σ_{i: µ_i < µ_best} (ln n / Δ_i) ] + (1 + π²/3) Σ_{j=1..K} Δ_j,
where Δ_i = µ_best − µ_i
:
182. Classical Multi-Armed Bandits: Summary
• Classical multi-armed bandits
– A fixed set of arms with fixed rewards
– Observe the reward before the next pull
• Bayesian approach (Markov decision process)
– Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
• Pull the arm currently having the highest index value
– Whittle’s index [Whittle 1988]: Extension to a changing reward function
– Computationally intensive
• Minimax approach (providing guaranteed regret bounds)
– UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval
• Index of arm i = μ̂_i + c·√(2 log n / n_i)
• Heuristics
– ε-Greedy: Random exploration using fraction ε of traffic
– Softmax: Pick arm i with probability exp{μ̂_i/τ} / Σ_j exp{μ̂_j/τ} (τ = temperature)
– Posterior draw: Index = drawing from posterior CTR distribution of an arm
where μ̂_i = predicted CTR of item i
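The three heuristics in the summary above can each be written in a few lines. A minimal sketch, assuming made-up predicted CTRs μ̂_i and click/view counts:

```python
import math
import random

mu_hat = {"A": 0.10, "B": 0.12, "C": 0.08}   # predicted CTRs (hypothetical)

def epsilon_greedy(mu_hat, eps=0.1):
    """With probability eps explore a random arm, else exploit the best."""
    if random.random() < eps:
        return random.choice(list(mu_hat))
    return max(mu_hat, key=mu_hat.get)

def softmax(mu_hat, tau=0.05):
    """Pick arm i with probability exp(mu_hat_i / tau) / sum_j exp(mu_hat_j / tau)."""
    weights = {i: math.exp(m / tau) for i, m in mu_hat.items()}
    r, acc = random.random() * sum(weights.values()), 0.0
    for i, w in weights.items():
        acc += w
        if r <= acc:
            return i
    return i  # guard against floating-point rounding at the boundary

def posterior_draw(counts):
    """Index = a draw from each arm's Beta posterior CTR distribution
    (a uniform Beta(1, 1) prior is assumed). counts: arm -> (clicks, views)."""
    return max(counts, key=lambda i: random.betavariate(
        1 + counts[i][0], 1 + counts[i][1] - counts[i][0]))
```

The temperature τ controls how sharply softmax concentrates on the best arm; as τ → 0 it approaches greedy selection, and as τ → ∞ it approaches uniform exploration.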
183. Do Classical Bandits Apply to Web Recommenders?
[Figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time]
– Traffic obtained from a controlled randomized experiment (no confounding)
– Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
184. Characteristics of Real Recommender Systems
• Dynamic set of items (arms)
– Items come and go with short lifetimes (e.g., a day)
– Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short
• Non-stationary CTR
– CTR of an item can change dramatically over time
• Different user populations at different times
• Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.)
• Attention to breaking news stories decays over time
• Batch serving for scalability
– Making a decision and updating the model for each user visit in real time
is expensive
– Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i
[Agarwal et al., ICDM, 2009]
185. Explore/Exploit in Recommender Systems
[Figure: timeline …, t–2, t–1, t (now), future; items 1, 2, …, K receive x1%, x2%, …, xK% of page views]
• Determine (x1, x2, …, xK) based on clicks and views observed before t, in order to maximize the expected total number of clicks in the future
• Let’s solve this from first principles
186. Bayesian Solution: Two Items, Two
Time Slots (1)
• Two time slots: t = 0 and t = 1
– Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1
– Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
• To determine x, we need to estimate what would happen in the future
• Question: What fraction x of the N0 views at t = 0 should go to item P, and (1−x) to item Q?
[Figure: timeline from now (t = 0, N0 views) to end (t = 1, N1 views); CTR densities of item P (p0, then p̂1(x, c)) and item Q (q0, then q1)]
– We obtain c clicks after serving x (not yet observed; a random variable)
– Assume we observe c; we can then update the density of p1 to p̂1(x, c)
– If x and c are given, the optimal solution at t = 1 is: give all views to item P iff E[p1 | x, c] > q1
187. Bayesian Solution: Two Items, Two Time Slots (2)
• Expected total number of clicks in the two time slots:
  N0·x·p̂0 + N0·(1−x)·q0 + N1·E_c[max{p̂1(x, c), q1}]
  = N0·q0 + N1·q1 + N0·x·(p̂0 − q0) + N1·E_c[max{p̂1(x, c) − q1, 0}]
– At t = 1, show the item with higher E[CTR]: max{p̂1(x, c), q1}
– N0·q0 + N1·q1 = E[#clicks] if we always show item Q
– The remaining terms = Gain(x, q0, q1), the gain of exploring the uncertain item P using x
• Gain(x, q0, q1) = Expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots
• Solution: argmax_x Gain(x, q0, q1)
188. Bayesian Solution: Two Items, Two Time Slots (3)
• Approximate p̂1(x, c) by the normal distribution
– Reasonable approximation because of the central limit theorem
• Prior of p1 ~ Beta(a, b)
– p̂1 = E_c[p̂1(x, c)] = a/(a + b)
– σ²(x) = Var_c[p̂1(x, c)] = [N0·x / (a + b + N0·x)] · [a·b / ((a + b)²·(a + b + 1))]
• Using the approximation,
  Gain(x, q0, q1) = N0·x·(p̂0 − q0)
    + N1·[ σ(x)·φ((q1 − p̂1)/σ(x)) + (1 − Φ((q1 − p̂1)/σ(x)))·(p̂1 − q1) ]
  where φ and Φ are the standard normal pdf and cdf
• Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
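The closed-form Gain above is straightforward to evaluate. A minimal numerical sketch, with made-up values for q0, q1, N0, N1 and the Beta(a, b) prior, and a simple grid search over x in place of the O(log N0) procedure from the slide:

```python
import math

def gain(x, q0, q1, N0, N1, a, b):
    """Gain(x, q0, q1) from the slide, using the normal approximation
    to p1_hat(x, c) under a Beta(a, b) prior on item P's CTR."""
    p0_hat = a / (a + b)                 # prior mean CTR of item P at t = 0
    p1_hat = a / (a + b)                 # E_c[p1_hat(x, c)] = prior mean
    var = (N0 * x / (a + b + N0 * x)) * (a * b / ((a + b) ** 2 * (a + b + 1)))
    if var == 0:                         # x = 0: no exploration, no learning
        return N0 * x * (p0_hat - q0) + N1 * max(p1_hat - q1, 0.0)
    sigma = math.sqrt(var)
    z = (q1 - p1_hat) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # N(0,1) pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))          # N(0,1) cdf
    explore_value = sigma * phi + (1 - Phi) * (p1_hat - q1)
    return N0 * x * (p0_hat - q0) + N1 * explore_value

# Hypothetical setting: both items currently look identical (CTR 0.10),
# but item P's CTR is uncertain and the future slot has 10x the traffic.
q0, q1, N0, N1, a, b = 0.10, 0.10, 1000, 10000, 1, 9
best_x = max((i / 100 for i in range(101)),
             key=lambda x: gain(x, q0, q1, N0, N1, a, b))
```

In this setting exploring item P costs nothing in slot 0 (p̂0 = q0) while shrinking the posterior uncertainty exploited in the much larger slot 1, so the gain increases monotonically in x and full exploration is optimal.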
189. Bayesian Solution: Two Items, Two Time Slots (4)
• Quiz: Is it correct that the more we are uncertain about the CTR of
an item, the more we should explore the item?
[Figure: fraction of views to give to the item (y-axis) vs. uncertainty, from low to high (x-axis); different curves are for different prior mean settings]