Statistical Computing For Big Data

Deepak Agarwal
LinkedIn Applied Relevance Science
dagarwal@linkedin.com
ENAR 2014, Baltimore, USA
Main Collaborators: several others at both Y! and LinkedIn
•  I won't be here without them; extremely lucky to work with such talented individuals

Bee-Chung Chen, Liang Zhang, Bo Long, Jonathan Traupman, Paul Ogilvie
Structure of This Tutorial
•  Part I: Introduction to Map-Reduce and the Hadoop System
  –  Overview of Distributed Computing
  –  Introduction to Map-Reduce
  –  Some statistical computations using Map-Reduce
    •  Bootstrap, Logistic Regression
•  Part II: Recommender Systems for Web Applications
  –  Introduction
  –  Content Recommendation
  –  Online Advertising
Big Data becoming Ubiquitous
•  Bioinformatics
•  Astronomy
•  Internet
•  Telecommunications
•  Climatology
•  …
Big Data: Some size estimates
•  1000 human genomes: > 100TB of data (1000 Genomes Project)
•  Sloan Digital Sky Survey: 200GB of data per night (>140TB aggregated)
•  Facebook: a billion monthly active users
•  LinkedIn: roughly > 280M members worldwide
•  Twitter: > 500 million tweets a day
•  Over 6 billion mobile phones in the world generating data every day
Big Data: Paradigm shift
•  Classical Statistics
  –  Generalize using small data
•  Paradigm Shift with Big Data
  –  We now have an almost infinite supply of data
  –  Easy Statistics? Just appeal to asymptotic theory?
•  So the issue is mostly computational?
  –  Not quite
    •  More data comes with more heterogeneity
    •  Need to change our statistical thinking to adapt
  –  Classical statistics is still invaluable for thinking about big data analytics
Some Statistical Challenges
•  Exploratory Analysis (EDA), Visualization
  –  Retrospective (on Terabytes)
  –  More Real Time (streaming computations every few minutes/hours)
•  Statistical Modeling
  –  Scale (computational challenge)
  –  Curse of dimensionality
    •  Millions of predictors, heterogeneity
  –  Temporal and spatial correlations
Statistical Challenges continued
•  Experiments
  –  To test new methods, test hypotheses from randomized experiments
  –  Adaptive experiments
•  Forecasting
  –  Planning, advertising
•  Many more I am not fully well versed in
Defining Big Data
•  How to know you have the big data problem?
  –  Is it only the number of terabytes?
  –  What about dimensionality, structured/unstructured data, the computations required, …?
•  No clear definition; different points of view
  –  One working definition: when the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
Distributed Computing for Big Data
•  Distributed computing is an invaluable tool to scale computations for big data
•  Some distributed computing models
  –  Multi-threading
  –  Graphics Processing Units (GPU)
  –  Message Passing Interface (MPI)
  –  Map-Reduce
Evaluating a method for a problem
•  Scalability
  –  Process X GB in Y hours
•  Ease of use for a statistician
•  Reliability (fault tolerance)
  –  Especially in an industrial environment
•  Cost
  –  Hardware and the cost of maintaining it
•  Good for the computations required?
  –  E.g., iterative versus one pass
•  Resource sharing
Multithreading
•  Multiple threads take advantage of multiple CPUs
•  Shared memory
•  Threads can execute independently and concurrently
•  Can only handle Gigabytes of data
•  Reliable
Graphics Processing Units (GPU)
•  Number of cores:
  –  CPU: order of 10
  –  GPU: smaller cores
    •  Order of 1000
•  Can be >100x faster than CPU
  –  Parallel, computationally intensive tasks are off-loaded to the GPU
•  Good for certain computationally-intensive tasks
•  Can only handle Gigabytes of data
•  Not trivial to use; requires a good understanding of the low-level architecture for efficient use
  –  But things are changing; it is getting more user friendly
Message Passing Interface (MPI)
•  Language-independent communication protocol among processes (e.g. computers)
•  Most suitable for the master/slave model
•  Can handle Terabytes of data
•  Good for iterative processing
•  Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)

Data -> Mappers -> Reducers -> Output

•  Computation is split into Map (scatter) and Reduce (gather) stages
•  Easy to use:
  –  User needs to implement two functions: Mapper and Reducer
•  Easily handles Terabytes of data
•  Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods

                              Multithreading   GPU                      MPI         Map-Reduce
Scalability (data size)       Gigabytes        Gigabytes                Terabytes   Terabytes
Fault Tolerance               High             High                     Low         High
Maintenance Cost              Low              Medium                   Medium      Medium-High
Iterative Process Complexity  Cheap            Cheap                    Cheap       Usually expensive
Resource Sharing              Hard             Hard                     Easy        Easy
Easy to Implement?            Easy             Needs understanding of   Easy        Easy
                                               low-level GPU architecture
Example Problem
•  Tabulating word counts in a corpus of documents
•  Similar to the table function in R
Word Count Through Map-Reduce

Input documents:
  Doc 1: "Hello World / Bye World"       -> Mapper 1
  Doc 2: "Hello Hadoop / Goodbye Hadoop" -> Mapper 2

Mapper 1 emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Mapper 2 emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Reducer 1 (words from A-G) outputs: <Bye, 1> <Goodbye, 1>
Reducer 2 (words from H-Z) outputs: <Hello, 2> <World, 2> <Hadoop, 2>
Key Ideas about Map-Reduce

Big Data -> Partition 1, Partition 2, …, Partition N
Partition i -> Mapper i -> <Key, Value> pairs
<Key, Value> pairs -> Reducer 1, Reducer 2, …, Reducer M -> Output 1, Output 2, …, Output M
Key Ideas about Map-Reduce
•  Data are split into partitions and stored on many different machines on disk (distributed storage)
•  Mappers process data chunks independently and emit <Key, Value> pairs
•  Data with the same key are sent to the same reducer; one reducer can receive multiple keys
•  Every reducer sorts its data by key
•  For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
Compute Mean for Each Group

ID   Group No.   Score
1    1           0.5
2    3           1.0
3    1           0.8
4    2           0.7
5    2           1.5
6    3           1.2
7    1           0.8
8    2           0.9
9    4           1.3
…    …           …
Key Ideas about Map-Reduce
•  Data are split into partitions and stored on many different machines on disk (distributed storage)
•  Mappers process data chunks independently and emit <Key, Value> pairs
  –  For each row:
    •  Key = Group No.
    •  Value = Score
•  Data with the same key are sent to the same reducer; one reducer can receive multiple keys
  –  E.g. 2 reducers
  –  Reducer 1 receives data with key = 1, 2
  –  Reducer 2 receives data with key = 3, 4
•  Every reducer sorts its data by key
  –  E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]>
•  For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
  –  E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
Pseudo Code (in R)

What you need to implement:

Mapper:
Input: Data
for (row in Data)
{
  groupNo <- row$groupNo
  score   <- row$score
  Output(c(groupNo, score))
}

Reducer:
Input: Key (groupNo), Value (a list of scores that belong to the Key)
count <- 0
sum   <- 0
for (v in Value)
{
  sum   <- sum + v
  count <- count + 1
}
Output(c(Key, sum / count))
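As a sanity check, the same group-mean job can be simulated on a single machine in plain R. This is a minimal sketch, not the Hadoop API: the toy data frame, the list of emitted pairs, and the split() step are stand-ins for what the framework does between Map and Reduce.

# Minimal local simulation of the group-mean Map-Reduce job (toy data assumed)
data <- data.frame(groupNo = c(1, 3, 1, 2, 2, 3, 1, 2, 4),
                   score   = c(0.5, 1.0, 0.8, 0.7, 1.5, 1.2, 0.8, 0.9, 1.3))

# Map: emit a <key = groupNo, value = score> pair for every row
pairs <- lapply(seq_len(nrow(data)),
                function(i) list(key = data$groupNo[i], value = data$score[i]))

# Shuffle: group the emitted values by key (done by the framework in practice)
byKey <- split(sapply(pairs, function(p) p$value),
               sapply(pairs, function(p) p$key))

# Reduce: average the values for each key
result <- data.frame(groupNo = as.numeric(names(byKey)),
                     mean    = sapply(byKey, mean))
print(result)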
Exercise 1
•  Problem: Average height per {Grade, Gender}?
•  What should be the mapper output key?
•  What should be the mapper output value?
•  What is the reducer input?
•  What is the reducer output?
•  Write the mapper and reducer for this?

Student ID   Grade   Gender   Height (cm)
1            3       M        120
2            2       F        115
3            2       M        116
…            …       …        …
•  Problem: Average height per Grade and Gender?
•  What should be the mapper output key?
  –  {Grade, Gender}
•  What should be the mapper output value?
  –  Height
•  What is the reducer input?
  –  Key: {Grade, Gender}, Value: List of Heights
•  What is the reducer output?
  –  {Grade, Gender, mean(Heights)}

Student ID   Grade   Gender   Height (cm)
1            3       M        120
2            2       F        115
3            2       M        116
…            …       …        …
Exercise 2
•  Problem: Number of students per {Grade, Gender}?
•  What should be the mapper output key?
•  What should be the mapper output value?
•  What is the reducer input?
•  What is the reducer output?
•  Write the mapper and reducer for this?

Student ID   Grade   Gender   Height (cm)
1            3       M        120
2            2       F        115
3            2       M        116
…            …       …        …
•  Problem: Number of students per {Grade, Gender}?
•  What should be the mapper output key?
  –  {Grade, Gender}
•  What should be the mapper output value?
  –  1
•  What is the reducer input?
  –  Key: {Grade, Gender}, Value: List of 1's
•  What is the reducer output?
  –  {Grade, Gender, sum(value list)}
  –  OR: {Grade, Gender, length(value list)}

Student ID   Grade   Gender   Height (cm)
1            3       M        120
2            2       F        115
3            2       M        116
…            …       …        …
More on Map-Reduce
•  Depends on distributed file systems
•  Typically the mappers are the data storage nodes
•  Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
•  Map and Reduce I/O are all on disk
  –  Data transmission from mappers to reducers is through disk copy
•  Iterative processing through Map-Reduce
  –  Each iteration becomes a map-reduce job
  –  Can be expensive since map-reduce overhead is high
The Apache Hadoop System
•  Open-source software for reliable, scalable, distributed computing
•  The most popular distributed computing system in the world
•  Key modules:
  –  Hadoop Distributed File System (HDFS)
  –  Hadoop YARN (job scheduling and cluster resource management)
  –  Hadoop MapReduce
Major Tools on Hadoop
•  Pig
  –  A high-level language for Map-Reduce computation
•  Hive
  –  A SQL-like query language for data querying via Map-Reduce
•  HBase
  –  A distributed & scalable database on Hadoop
  –  Allows random, real-time read/write access to big data
  –  Voldemort is similar to HBase
•  Mahout
  –  A scalable machine learning library
•  …
Hadoop Installation
•  Setting up Hadoop on your desktop/laptop:
  –  http://hadoop.apache.org/docs/stable/single_node_setup.html
•  Setting up Hadoop on a cluster of machines:
  –  http://hadoop.apache.org/docs/stable/cluster_setup.html
Hadoop Distributed File System (HDFS)
•  Master/Slave architecture
•  NameNode: a single master node that controls which data block is stored where
•  DataNodes: slave nodes that store data and do R/W operations
•  Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
•  Big data is split into equal-sized blocks; each block can be stored on different DataNodes
•  Disk failure tolerance: data is replicated multiple times
Load the Data into Pig
•  A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
  –  The path of the data on HDFS goes after LOAD
•  USING PigStorage() means the data are delimited by tab (can be omitted)
•  If data are delimited by other characters, e.g. space, use USING PigStorage(' ')
•  The data schema is defined after AS
•  Variable types: int, long, float, double, chararray, …
Structure of This Tutorial
•  Part I: Introduction to Map-Reduce and the Hadoop System
  –  Overview of Distributed Computing
  –  Introduction to Map-Reduce
  –  Introduction to the Hadoop System
  –  Examples of Statistical Computing for Big Data
    •  Bag of Little Bootstraps
    •  Large Scale Logistic Regression
Bag of Little Bootstraps
Kleiner et al. 2012
Bootstrap (Efron, 1979)
•  A re-sampling based method to obtain the statistical distribution of sample estimators
•  Why are we interested?
  –  Re-sampling is embarrassingly parallelizable
•  For example:
  –  Standard deviation of the mean of N samples (µ)
  –  For i = 1 to r do
    •  Randomly sample with replacement N times from the original sample -> bootstrap data i
    •  Compute the mean of the i-th bootstrap data -> µi
  –  Estimate of Sd(µ) = Sd([µ1, …, µr])
  –  r is usually a large number, e.g. 200
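In R this procedure is only a few lines. A minimal sketch, using simulated data just to make it concrete (the sample, r, and N below are illustrative):

# Naive bootstrap estimate of Sd(mean), assuming the data fit in memory
set.seed(1)
x <- rnorm(10000)          # the original sample (simulated here)
N <- length(x)
r <- 200                   # number of bootstrap replicates
mu_boot <- replicate(r, mean(sample(x, N, replace = TRUE)))
sd(mu_boot)                # estimate of Sd(mean)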
Bootstrap for Big Data
•  Can have r nodes running in parallel, each sampling one bootstrap data set
•  However…
  –  N can be very large
  –  Data may not fit into memory
  –  Collecting N samples with replacement on each node can be computationally expensive
M out of N Bootstrap (Bickel et al. 1997)
•  Obtain SdM(µ) by sampling M samples with replacement for each bootstrap, where M < N
•  Apply an analytical correction to SdM(µ) to obtain Sd(µ), using prior knowledge of the convergence rate of the sample estimates
•  However…
  –  Prior knowledge is required
  –  The choice of M is critical to performance
  –  Finding the optimal value of M needs more computation
Bag of Little Bootstraps (BLB)
•  Example: standard deviation of the mean
•  Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
•  For each data set p = 1 to S do
  –  For i = 1 to r do
    •  Draw N samples with replacement from the data of size b
    •  Compute the mean of the resampled data µpi
  –  Compute Sdp(µ) = Sd([µp1, …, µpr])
•  Estimate of Sd(µ) = Avg([Sd1(µ), …, SdS(µ)])
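A single-machine sketch of this BLB recipe for Sd(mean); in practice each of the S subsets would be handled by a different reducer, and b, S, and r below are illustrative choices rather than the tutorial's:

set.seed(1)
x <- rnorm(100000)                    # original data (simulated)
N <- length(x)
b <- floor(N^0.7); S <- 5; r <- 100   # subset size, number of subsets, resamples

sd_p <- numeric(S)
for (p in 1:S) {
  subset_p <- sample(x, b)            # size-b subset, sampled without replacement
  mu_pi <- replicate(r, mean(sample(subset_p, N, replace = TRUE)))
  sd_p[p] <- sd(mu_pi)                # Sd_p(mean) from r resamples of this subset
}
mean(sd_p)                            # BLB estimate of Sd(mean)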
Bag of Little Bootstraps (BLB)
•  Interest: ξ(θ), where θ is an estimate obtained from data of size N
  –  ξ is some function of θ, such as the standard deviation, …
•  Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)
•  For each data set p = 1 to S do
  –  For i = 1 to r do
    •  Draw N samples with replacement from the data of size b
    •  Compute the estimate θpi from the resampled data
  –  Compute ξp(θ) = ξ([θp1, …, θpr])
•  Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])
Bag of Little Bootstraps (BLB) on Hadoop
•  The same procedure maps naturally onto Map-Reduce: the per-subset work (resampling and computing the θpi and ξp(θ)) runs in Mappers/Reducers, and the final averaging of the S subset estimates is done on the Gateway.
Why is BLB Efficient
•  Before:
  –  N samples with replacement from size-N data is expensive when N is large
•  Now:
  –  N samples with replacement from size-b data
  –  b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ in [0.5, 1))
  –  Equivalent to: a multinomial sampler with dim = b
  –  Storage = O(b), computational complexity = O(b)
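The "multinomial sampler" remark can be made concrete: instead of materializing N resampled points, draw one multinomial vector of counts over the b distinct points and reweight. A hedged R sketch (function name is illustrative):

# Resampling N points from a size-b subset without storing N values:
# draw multinomial counts w (length b, summing to N) and use them as weights.
weighted_boot_mean <- function(subset_b, N) {
  b <- length(subset_b)
  w <- as.vector(rmultinom(1, N, rep(1 / b, b)))   # O(b) storage and work
  sum(w * subset_b) / N
}
# Weighted estimators (means, regressions, etc.) then only ever touch b rows.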
Simulation Experiment
•  95% CI of Logistic Regression Coefficients
•  N = 20000, 10 explanatory variables
•  Relative Error = |Estimated CI width - True CI width| / True CI width
•  BLB-γ: BLB with b = N^γ
•  BOFN-γ: b out of N sampling with b = N^γ
•  BOOT: Naive bootstrap
Simulation Experiment (results figure)
Real Data
•  95% CI of Logistic Regression Coefficients
•  N = 6M, 3000 explanatory variables
•  Data size = 150GB, r = 50, s = 5, γ = 0.7
Summary of BLB
•  A new algorithm for bootstrapping on big data
•  Advantages
  –  Fast and efficient
  –  Easy to parallelize
  –  Easy to understand and implement
  –  Friendly to Hadoop; makes it routine to perform statistical calculations on big data
Large Scale Logistic Regression

Logistic Regression
•  Binary response: Y
•  Covariates: X
•  Yi ~ Bernoulli(pi)
•  log(pi / (1 - pi)) = Xi^T β;  β ~ MVN(0, (1/λ) I)
•  Widely used (research and applications)
Large Scale Logistic Regression
•  Binary response: Y
  –  E.g., Click / Non-click on an ad on a webpage
•  Covariates: X
  –  User covariates:
    •  Age, gender, industry, education, job, job title, …
  –  Item covariates:
    •  Categories, keywords, topics, …
  –  Context covariates:
    •  Time, page type, position, …
  –  2-way interactions:
    •  User covariates x item covariates
    •  Context covariates x item covariates
    •  …
Computational Challenge
•  Hundreds of millions/billions of observations
•  Hundreds of thousands/millions of covariates
•  Fitting such a logistic regression model on a single machine is not feasible
•  Model fitting is iterative, using methods like gradient descent, Newton's method, etc.
  –  Multiple passes over the data
Recap on Optimization methods
•  Problem: find x to min F(x)
•  Iteration n: x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
•  b_{n-1} is the step size, which can change every iteration
•  Iterate until convergence
•  Conjugate gradient, LBFGS, Newton trust region, … are all of this kind
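For concreteness, the generic iteration x_n = x_{n-1} - b_{n-1} F'(x_{n-1}) looks like this in R when F is the (unpenalized) logistic-regression negative log-likelihood; the fixed step size and stopping rule are illustrative choices, and each gradient evaluation is one full pass over the data:

# Plain gradient descent for logistic regression: minimize F(beta) = -loglik(beta)
gradient_descent_logistic <- function(X, y, step = 0.1, max_iter = 500, tol = 1e-6) {
  beta <- rep(0, ncol(X))
  for (n in seq_len(max_iter)) {
    p    <- 1 / (1 + exp(-X %*% beta))       # fitted probabilities
    grad <- t(X) %*% (p - y) / nrow(X)       # F'(beta): one pass over the data
    beta_new <- beta - step * grad           # x_n = x_{n-1} - b_{n-1} F'(x_{n-1})
    if (sum(abs(beta_new - beta)) < tol) break
    beta <- as.vector(beta_new)
  }
  beta
}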
Iterative Process with Hadoop
(Each iteration is a full pass: Disk -> Mappers -> Disk -> Reducers -> Disk, repeated iteration after iteration.)
Limitations of Hadoop for fitting a big logistic regression
•  The iterative process is expensive and slow
•  Every iteration = a Map-Reduce job
•  I/O of mappers and reducers is through disk
•  Plus: waiting time in the job queue
•  Q: Can we find a fitting method that scales with Hadoop?
Large Scale Logistic Regression
•  Naive:
  –  Partition the data and run logistic regression for each partition
  –  Take the mean of the learned coefficients
  –  Problem: not guaranteed to converge to the model from a single machine!
•  Alternating Direction Method of Multipliers (ADMM)
  –  Boyd et al. 2011
  –  Set up constraints: each partition's coefficient = global consensus
  –  Solve the optimization problem using Lagrange multipliers
  –  Advantage: guaranteed to converge to the single-machine logistic regression on the entire data within a reasonable number of iterations
Large Scale Logistic Regression via ADMM
(Diagram, repeated for each iteration: BIG DATA is split into Partition 1, 2, 3, …, K; a logistic regression is fit on each partition; the per-partition fits feed a consensus computation, whose result is fed back to every partition at the next iteration. Iterations 1, 2, … repeat this cycle.)
Details of ADMM

Dual Ascent Method
•  Consider the equality-constrained convex optimization problem
     minimize f(x)   subject to   Ax = b,
   with variable x ∈ R^n, where A ∈ R^{m×n} and f : R^n -> R is convex.
•  Lagrangian for the problem:
     L(x, y) = f(x) + y^T (Ax - b)
•  Dual Ascent: solve the dual problem by gradient ascent. With x^{k+1} = argmin_x L(x, y^k), the dual gradient is the residual Ax^{k+1} - b, giving the iterations
     x^{k+1} := argmin_x L(x, y^k)
     y^{k+1} := y^k + α^k (A x^{k+1} - b),
   where α^k > 0 is a step size and k is the iteration counter.
Augmented Lagrangians
•  Bring robustness to the dual ascent method
•  Yield convergence without assumptions like strict convexity or finiteness of f
•  The augmented Lagrangian is
     L_ρ(x, y) = f(x) + y^T (Ax - b) + (ρ/2) ||Ax - b||_2^2,
   where ρ > 0 is called the penalty parameter
•  The value of ρ influences the convergence rate
Alternating Direction Method of Multipliers (ADMM)
•  Problem:
     minimize f(x) + g(z)   subject to   Ax + Bz = c,
   with variables x ∈ R^n and z ∈ R^m, where f and g are convex; the variable is split into two parts, x and z, with the objective separable across this splitting
•  Augmented Lagrangian:
     L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (ρ/2) ||Ax + Bz - c||_2^2
•  ADMM iterations:
     x^{k+1} := argmin_x L_ρ(x, z^k, y^k)
     z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)
     y^{k+1} := y^k + ρ (A x^{k+1} + B z^{k+1} - c),
   where ρ > 0. The algorithm blends the decomposability of dual ascent with the superior convergence properties of the method of multipliers.
Large Scale Logistic Regression via ADMM
•  Notation
  –  (Xi, yi): data in the i-th partition
  –  βi: coefficient vector for partition i
  –  β: consensus coefficient vector
  –  r(β): penalty component, such as ||β||_2^2
•  Optimization problem
     min  Σ_{i=1}^{N} l_i(y_i, X_i^T β_i) + r(β)
     subject to  β_i = β
ADMM updates
•  Local regressions on each partition, with shrinkage towards the current best global estimate
•  Updated consensus computed from the local fits
An example implementation
•  ADMM for logistic regression model fitting with L2/L1 penalty
•  Each iteration of ADMM is a Map-Reduce job
  –  Mapper: partition the data into K partitions
  –  Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression
  –  Gateway: consensus computation from the results of all reducers; sends the consensus back to each reducer node
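Below is a minimal single-machine sketch of consensus ADMM for L2-regularized logistic regression, assuming the data have already been split into K partitions held as lists Xs[[k]], ys[[k]]. It is not the deployed implementation: on Hadoop each local fit would run in a reducer (e.g. via liblinear/glmnet) and the consensus step on the gateway, and lambda/rho/iteration counts here are illustrative.

# Local step: argmin_b { -loglik(b; X, y) + (rho/2) ||b - z + u||^2 } via Newton steps
logistic_local_fit <- function(X, y, z, u, rho, iters = 25) {
  b <- z
  for (it in seq_len(iters)) {
    p <- 1 / (1 + exp(-X %*% b))
    grad <- t(X) %*% (p - y) + rho * (b - z + u)
    H <- t(X) %*% (X * as.vector(p * (1 - p))) + rho * diag(ncol(X))
    b <- b - solve(H, grad)
  }
  b
}

admm_logistic <- function(Xs, ys, lambda = 1, rho = 1, n_iter = 20) {
  K <- length(Xs); d <- ncol(Xs[[1]])
  z <- rep(0, d)                       # consensus coefficients
  B <- matrix(0, d, K)                 # per-partition coefficients
  U <- matrix(0, d, K)                 # scaled dual variables
  for (it in seq_len(n_iter)) {
    for (k in seq_len(K))              # "reducer" step: local regressions
      B[, k] <- logistic_local_fit(Xs[[k]], ys[[k]], z, U[, k], rho)
    # "gateway" step: consensus update for r(beta) = (lambda/2) ||beta||^2
    z <- (K * rho / (lambda + K * rho)) * (rowMeans(B) + rowMeans(U))
    U <- U + B - matrix(z, d, K)       # dual update
  }
  z
}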
KDD CUP 2010 Data
•  Bridge to Algebra 2008-2009 data at https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
•  Binary response, 20M covariates
•  Only keep covariates with >= 10 occurrences => 2.2M covariates
•  Training data: 8,407,752 samples
•  Test data: 510,302 samples
Avg Training Loglikelihood vs Number of Iterations

Test AUC vs Number of Iterations
Better Convergence Can Be Achieved By
•  Better initialization
  –  Use results from the Naive method to initialize the parameters
•  Adaptively changing the step size (ρ) for each iteration based on the convergence status of the consensus
Recommender Problems for
Web Applications
Agenda
•  Topic of Interest
–  Recommender problems for dynamic, time-
sensitive applications
•  Content Optimization, Online Advertising, Movie
recommendation, shopping,…
•  Introduction
•  Offline components
–  Regression, Collaborative filtering (CF), …
•  Online components + initialization
–  Time-series, online/incremental methods, explore/
exploit (bandit)
•  Evaluation methods + Multi-Objective
•  Challenges
Three components we will focus on
•  Defining the problem
–  Formulate objectives whose optimization achieves some long-
term goals for the recommender system
•  E.g. How to serve content to optimize audience reach and engagement,
optimize some combination of engagement and revenue ?
•  Modeling (to estimate some critical inputs)
–  Predict rates of some positive user interaction(s) with items
based on data obtained from historical user-item interactions
•  E.g. Click rates, average time-spent on page, etc
•  Could be explicit feedback like ratings
•  Experimentation
–  Create experiments to collect data proactively to improve
models, helps in converging to the best choice(s) cheaply and
rapidly.
•  Explore and Exploit (continuous experimentation)
•  DOE (testing hypotheses by avoiding bias inherent in data)
Modern Recommendation Systems
•  Goal
–  Serve the right item to a user in a given context to
optimize long-term business objectives
•  A scientific discipline that involves
–  Large scale Machine Learning & Statistics
•  Offline Models (capture global & stable characteristics)
•  Online Models (incorporates dynamic components)
•  Explore/Exploit (active and adaptive experimentation)
–  Multi-Objective Optimization
•  Click-rates (CTR), Engagement, advertising revenue, diversity, etc
–  Inferring user interest
•  Constructing User Profiles
–  Natural Language Processing to understand content
•  Topics, “aboutness”, entities, follow-up of something, breaking news,…
Some examples from content
optimization
•  Simple version
–  I have a content module on my page, content inventory is
obtained from a third party source which is further refined
through editorial oversight. Can I algorithmically
recommend content on this module? I want to improve
overall click-rate (CTR) on this module
•  More advanced
–  I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue). Can I
increase downstream utility without losing too many clicks?
•  Highly advanced
–  There are multiple modules running on my webpage. How
do I perform a simultaneous optimization?
Recommend applications
Recommend search queries
Recommend news article
Recommend packages:
Image
Title, summary
Links to other pages
Pick 4 out of a pool of K
    K = 20 ~ 40
    Dynamic
Routes traffic to other pages
Problems in this example
•  Optimize CTR on multiple modules
–  Today Module, Trending Now, Personal Assistant,
News
–  Simple solution: Treat modules as independent,
optimize separately. May not be the best when
there are strong correlations.
•  For any single module
–  Optimize some combination of CTR, downstream
engagement, and perhaps advertising revenue.
Online Advertising
(Diagram: Advertisers place Ads with an Ad Network; when a User visits a Publisher's Page, the network recommends the best ad(s) by selecting argmax f(bid, response rates), where response rates (click, conversion, ad-view) come from an ML/statistical model and bids come from an auction. Examples: Yahoo, Google, MSN, …; ad exchanges such as RightMedia, DoubleClick, …)
LinkedIn Today: Content Module
Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
LinkedIn Ads: Match ads to users visiting LinkedIn
Right Media Ad Exchange: Unified Marketplace
Match ads to page views on publisher sites
(Diagram: a publisher with an ad impression to sell runs an auction; bids arrive from several sources: $0.50, $0.60 from Ad.com, $0.75 via a network (which becomes a $0.45 bid), AdSense, and a $0.65 bid that wins.)
Recommender problems in general
(Diagram: a USER arrives in some Context (query, page, …); an automated algorithm selects item(s) to show from the Item Inventory (articles, web pages, ads, …); feedback (click, time spent, …) is collected and used to refine the models; this repeats a large number of times to optimize metric(s) of interest (total clicks, total revenue, …).)

Example applications
  Search: Web, Vertical
  Online Advertising
  Content
  …
Important Factors
•  Items: Articles, ads, modules, movies, users, updates, etc.
•  Context: query keywords, pages, mobile, social media, etc.
•  Metric to optimize (e.g., relevance score, CTR, revenue, engagement)
  –  Currently, most applications are single-objective
  –  Could be multi-objective optimization (maximize X subject to Y, Z, ..)
•  Properties of the item pool
  –  Size (e.g., all web pages vs. 40 stories)
  –  Quality of the pool (e.g., anything vs. editorially selected)
  –  Lifetime (e.g., mostly old items vs. mostly new items)
Factors affecting Solution
(continued)
•  Properties of the context
–  Pull: Specified by explicit, user-driven query (e.g., keywords, a form)
–  Push: Specified by implicit context (e.g., a page, a user, a session)
•  Most applications are somewhere on continuum of pull and push
•  Properties of the feedback on the matches
made
–  Types and semantics of feedback (e.g., click, vote)
–  Latency (e.g., available in 5 minutes vs. 1 day)
–  Volume (e.g., 100K per day vs. 300M per day)
•  Constraints specifying legitimate matches
–  e.g., business rules, diversity rules, editorial Voice
–  Multiple objectives
•  Available Metadata (e.g., link graph, various user/item attributes)
Predicting User-Item Interactions
(e.g. CTR)
•  Myth: We have so much data on the web, if we
can only process it the problem is solved
–  Number of things to learn increases with sample size
•  Rate of increase is not slow
–  Dynamic nature of systems make things worse
–  We want to learn things quickly and react fast
•  Data is sparse in web recommender problems
–  We lack enough data to learn all we want to learn and
as quickly as we would like to learn
–  Several Power laws interacting with each other
•  E.g. User visits power law, items served power law
–  Bivariate Zipf: Owen & Dyer, 2011
Can Machine Learning help?
•  Fortunately, there are group behaviors that generalize to
individuals & they are relatively stable
–  E.g. Users in San Francisco tend to read more baseball news
•  Key issue: Estimating such groups
–  Coarse group : more stable but does not generalize that well.
–  Granular group: less stable with few individuals
–  Getting a good grouping structure is to hit the “sweet spot”
•  Another big advantage on the web
–  Intervene and run small experiments on a small population to
collect data that helps rapid convergence to the best choices(s)
•  We don’t need to learn all user-item interactions, only those that are good.
Predicting user-item interaction rates
(Diagram: an Offline layer (logistic regression, boosting, …) captures stable characteristics at coarse resolutions, built on feature construction: content features from IR, clustering, taxonomy, entities; user profiles from clicks, views, social/community data. It initializes a Near Online layer (finer-resolution corrections at the item/user level, with quick updates), which in turn feeds Explore/Exploit (adaptive sampling that helps rapid convergence to the best choices).)
Post-click: An example in Content Optimization
(Diagram: a recommender, together with editorial input, selects content for the front page; clicks on front-page links influence the downstream supply distribution; downstream pages carry display advertising served by an ad server, generating revenue, and drive downstream engagement such as time spent.)
Serving Content on Front Page: Click Shaping
•  What do we want to optimize?
•  Current: Maximize clicks (maximize downstream supply from FP)
•  But consider the following
–  Article 1: CTR=5%, utility per click = 5
–  Article 2: CTR=4.9%, utility per click=10
•  By promoting 2, we lose 1 click/100 visits, gain 5 utils
•  If we do this for a large number of visits --- lose some clicks but
obtain significant gains in utility?
–  E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement,
etc)
High level picture
(Diagram: an http request hits the Server; the item recommendation system performs thousands of computations in sub-seconds; statistical models are updated in batch mode, e.g. once every 30 mins; the user interacts, e.g. clicks or does nothing, and that feedback flows back into the models.)
High level overview: Item Recommendation System
(Diagram: User Info and an Item Index (id, meta-data) feed ML/statistical models that score items: P(click), P(share), semantic-relevance score, …. Items are pre-filtered (SPAM, editorial, …) and features are extracted (NLP, clustering, …). Items are then ranked: sort by score (CTR, bid*CTR, …), combine scores using multi-objective optimization, threshold on some scores, …. User-item interaction data are batch processed; activity and profile data are updated in batch.)
ML/Statistical models for scoring
(Figure: number of items scored by ML vs. traffic volume (100, 1000, 100k, 1M, 100M) and item lifetime (few hours, few days, several days); LinkedIn Today, Yahoo! Front Page, Right Media Ad exchange, and LinkedIn Ads occupy different regions of this space.)
Summary of deployments
•  Yahoo! Front page Today Module (2008-2011): 300% improvement in click-through rates
  –  Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! Properties (2011): significant improvement in engagement across the Yahoo! Network
•  Fully deployed on LinkedIn Today Module (2012): significant improvement in click-through rates (numbers not revealed for reasons of confidentiality)
•  Yahoo! RightMedia exchange (2012): fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed for reasons of confidentiality)
•  LinkedIn self-serve ads (2012-2013): fully deployed
•  LinkedIn News Feed (2013-2014): fully deployed
•  Several others in progress…
Broad Themes
•  Curse of dimensionality
  –  Large number of observations (rows), large number of potential features (columns)
  –  Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom)
    •  I will give examples as we move along
•  We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation
  –  This can fundamentally change solutions
•  Think of computation and models together for big data
•  Optimization: what we are trying to optimize is often complex; models have to work in harmony with optimization
  –  Pareto optimality with competing objectives
Statistical Problem
•  Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
•  Examples of utility functions
  –  Click-rates (CTR)
  –  Share-rates (CTR * [Share|Click])
  –  Revenue per page-view = CTR * bid (more complex due to second price auction)
•  CTR is a fundamental measure that opens the door to a more principled approach to rank items
•  Converge rapidly to maximum utility items
  –  Sequential decision making process (explore/exploit)
(Diagram: User i, with user features (e.g., industry, behavioral features, demographic features, …), visits; the algorithm selects item j from a set of candidates; the response yij (click or not) is observed.)

Which item should we select?
  •  The item with the highest predicted CTR -> Exploit
  •  An item for which we need data to predict its CTR -> Explore
LinkedIn Today, Yahoo! Today Module:
Choose Items to maximize CTR
This is an “Explore/Exploit” Problem
The Explore/Exploit Problem (to
maximize CTR)
•  Problem definition: Pick k items from a pool of N for a large
number of serves to maximize the number of clicks on the
picked items
•  Easy!? Pick the items having the highest click-through rates
(CTRs)
•  But …
–  The system is highly dynamic:
•  Items come and go with short lifetimes
•  CTR of each item may change over time
–  How much traffic should be allocated to explore new items to
achieve optimal performance ?
•  Too little → Unreliable CTR estimates due to “starvation”
•  Too much → Little traffic to exploit the high CTR items
Y! Front Page Application
•  Simplify: maximize CTR on the first slot (F1)
•  Item Pool
  –  Editorially selected for high quality and brand image
  –  Few articles in the pool, but the item pool is dynamic
CTR Curves of Items on LinkedIn Today
(Figure: item CTR over time)

Impact of repeat item views on a given user
•  The same user is shown an item multiple times (despite not clicking)
Simple algorithm to estimate the most popular item with a small but dynamic item pool
•  Simple Explore/Exploit scheme
  –  ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool
  –  (100−ε)% exploit: with large probability (e.g. 95%), choose the highest scoring CTR item
•  Temporal Smoothing
  –  Item CTRs change over time; give more weight to recent data in estimating item CTRs
    •  Kalman filter, moving average
•  Discount item score with repeat views
  –  CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data)
•  Segmented most popular
  –  Perform separate most-popular estimation for each user segment
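A toy R sketch of the ε-greedy scheme in the first bullet; the pool size, ε, and true CTRs are made-up values, and temporal smoothing and repeat-view discounting are omitted:

set.seed(1)
K <- 10; eps <- 0.05
true_ctr <- runif(K, 0.01, 0.05)      # unknown item CTRs (simulated)
clicks <- rep(0, K); views <- rep(0, K)

for (serve in 1:100000) {
  est  <- ifelse(views > 0, clicks / views, Inf)   # unseen items get priority
  item <- if (runif(1) < eps) sample(K, 1)         # explore with prob eps
          else which.max(est)                      # exploit otherwise
  views[item]  <- views[item] + 1
  clicks[item] <- clicks[item] + rbinom(1, 1, true_ctr[item])
}
which.max(clicks / views)              # item the scheme converges to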
Time series Model: Kalman filter
•  Dynamic Gamma-Poisson: the click-rate evolves over time in a multiplicative fashion
•  Estimated click-rate distribution at time t+1
  –  Prior mean:
  –  Prior variance:
•  High CTR items are more adaptive
More economical exploration? Better bandit solutions
•  Consider a two-armed problem with unknown payoff probabilities p1 > p2

The gambler has 1000 plays; what is the best way to experiment (to maximize total expected reward)?
This is called the "multi-armed bandit" problem and has been studied for a long time.
Optimal solution: play the arm that has the maximum potential of being good.
  Optimism in the face of uncertainty
Item Recommendation: Bandits?
•  Two Items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
  –  Greedy: show Item 2 to all; not a good idea
  –  Item 1's CTR estimate is noisy; the item could potentially be better
•  Invest in Item 1 for better overall performance on average
  –  Exploit what is known to be good, explore what is potentially good
(Figure: posterior CTR densities, a wide one for Item 1 and a narrow one for Item 2)
Next few hours

                            Most Popular Recommendation   Personalized Recommendation
Offline Models              -                             Collaborative filtering (cold-start problem)
Online Models               Time-series models            Incremental CF, online regression
Intelligent Initialization  Prior estimation              Prior estimation, dimension reduction
Explore/Exploit             Multi-armed bandits           Bandits with covariates
Offline Components: Collaborative Filtering in Cold-start Situations

Problem
(Diagram: User i, with user features xi (demographics, browse history, search history, …), visits; the algorithm selects Item j, with item features xj (keywords, content categories, …); the response yij (explicit rating, implicit click/no-click) is observed.)
Predict the unobserved entries based on features and the observed entries
Model Choices
•  Feature-based (or content-based) approach
–  Use features to predict response
•  (regression, Bayes Net, mixture models, …)
–  Limitation: need predictive features
•  Bias often high, does not capture signals at granular levels
•  Collaborative filtering (CF aka Memory based)
–  Make recommendation based on past user-item interaction
•  User-user, item-item, matrix factorization, …
•  See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc.
–  Better performance for old users and old items
–  Does not naturally handle new users and new items (cold-
start)
Collaborative Filtering (Memory
based methods)
User-User Similarity
Item-Item similarities, incorporating both
Estimating Similarities
Pearson’s correlation
Optimization based (Koren et al)
How to Deal with the Cold-Start Problem
•  Heuristic-based approaches
  –  Linear combination of regression and CF models
  –  Filterbot
    •  Add user features as pseudo users and do collaborative filtering
  –  Hybrid approaches
    –  Use content-based methods to fill up entries, then use CF
•  Matrix Factorization
  –  Good performance on Netflix (Koren, 2009)
•  Model-based approaches
  –  Bilinear random-effects model (probabilistic matrix factorization)
    •  Good on Netflix data [Ruslan et al ICML, 2009]
  –  Add feature-based regression to matrix factorization
    •  (Agarwal and Chen, 2009)
  –  Add topic discovery (from textual items) to matrix factorization
    •  (Agarwal and Chen, 2009; Chun and Blei, 2011)
Per-item regression models
•  When tracking users by cookies, the distribution of visit patterns can get extremely skewed
  –  Majority of cookies have 1-2 visits
•  Per-item models (regression) based on user covariates are attractive in such cases

Several per-item regressions: Multi-task learning
(Figure: per-item coefficients tied through a low-dimensional (5-10) matrix B estimated from retrospective data, including affinity to old items.)
•  Agarwal, Chen and Elango, KDD, 2010
Per-user, per-item models
via bilinear random-effects
model
Motivation
•  Data measuring k-way interactions pervasive
–  Consider k = 2 for all our discussions
•  E.g. User-Movie, User-content, User-Publisher-Ads,….
–  Power law on both user and item degrees
•  Classical Techniques
–  Approximate matrix through a singular value
decomposition (SVD)
•  After adjusting for marginal effects (user pop, movie pop,..)
–  Does not work
•  Matrix highly incomplete, severe over-fitting
–  Key issue
•  Regularization of eigenvectors (factors) to avoid overfitting
Early work on complete matrices
•  Tukey’s 1-df model (1956)
–  Rank 1 approximation of small nearly complete
matrix
•  Criss-cross regression (Gabriel, 1978)
•  Incomplete matrices: Psychometrics (1-factor
model only; small data sets; 1960s)
•  Modern day recommender problems
–  Highly incomplete, large, noisy.
Latent Factor Models
(Figure: users and items embedded as latent vectors, e.g. a "sporty" user u and a "newsy" item v, with affinity = u'v; another user s and item z with affinity = s'z.)
Factorization – Brief Overview
•  Latent user factors: (αi, ui = (ui1, …, uin))
•  Latent movie factors: (βj, vj = (vj1, …, vjn))
•  Interaction: E(yij) = µ + αi + βj + ui' B vj
•  (Nn + Mm) parameters
•  Key technical issue: will overfit for moderate values of n, m
  –  Regularization
Latent Factor Models: Different
Aspects
•  Matrix Factorization
– Factors in Euclidean space
– Factors on the simplex
•  Incorporating features and ratings
simultaneously
•  Online updates
Maximum Margin Matrix Factorization (MMMF)
•  Complete matrix by minimizing loss (hinge,squared-
error) on observed entries subject to constraints on trace
norm
–  Srebro, Rennie, Jaakkola (NIPS 2004)
–  Convex, Semi-definite programming (expensive, not
scalable)
•  Fast MMMF (Rennie & Srebro, ICML, 2005)
–  Constrain the Frobenius norm of the left and right
eigenvector matrices, not convex but becomes
scalable.
•  Other variation: Ensemble MMMF (DeCoste,
ICML2005)
–  Ensembles of partially trained MMMF (some
improvements)
Matrix Factorization for Netflix prize data
•  Minimize the objective function
     min_{u,v}  Σ_{(i,j) ∈ obs} (r_ij - u_i^T v_j)^2 + λ ( Σ_i ||u_i||^2 + Σ_j ||v_j||^2 )
•  Simon Funk: Stochastic Gradient Descent
•  Koren et al (KDD 2007): Alternating Least Squares
  –  They moved to SGD later in the competition
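A minimal SGD sketch for the objective above, with ratings stored as (i, j, r) triples; the learning rate, rank, and λ are illustrative choices, and bias/mean terms are omitted:

sgd_mf <- function(ratings, n_users, n_items, n_factors = 10,
                   lambda = 0.05, lrate = 0.01, n_epochs = 20) {
  U <- matrix(rnorm(n_users * n_factors, sd = 0.1), n_users, n_factors)
  V <- matrix(rnorm(n_items * n_factors, sd = 0.1), n_items, n_factors)
  for (epoch in seq_len(n_epochs)) {
    for (k in sample(nrow(ratings))) {            # one pass in random order
      i <- ratings[k, 1]; j <- ratings[k, 2]; r <- ratings[k, 3]
      e <- r - sum(U[i, ] * V[j, ])               # residual r_ij - u_i' v_j
      u_old <- U[i, ]
      U[i, ] <- U[i, ] + lrate * (e * V[j, ] - lambda * U[i, ])
      V[j, ] <- V[j, ] + lrate * (e * u_old - lambda * V[j, ])
    }
  }
  list(U = U, V = V)
}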
Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)
•  Model:
     r_ij ~ N(u_i^T v_j, σ^2)
     u_i ~ MVN(0, a_u I)
     v_j ~ MVN(0, a_v I)
•  Optimization is through Iterated Conditional Modes
•  Other variations: constraining the mean through a sigmoid, using "who-rated-whom"
•  Combining with Boltzmann Machines also improved performance
Bayesian Probabilistic Matrix Factorization
(Ruslan and Minh, ICML 2008)
•  Fully Bayesian treatment using an MCMC approach
–  Significant improvement
•  Interpretation as a fully Bayesian hierarchical model
shows why that is the case
–  Failing to incorporate uncertainty leads to bias in
estimates
–  Multi-modal posterior, MCMC helps in converging to a better one
(Figure: posterior of the variance component a_u)
MCEM also more resistant to over-fitting
Non-parametric Bayesian matrix completion
(Zhou et al, SAM, 2010)
•  Specify rank probabilistically (automatic rank
selection)
•  Model:
     y_ij ~ N( Σ_{k=1}^{r} z_k u_ik v_jk, σ^2 )
     z_k ~ Ber(π_k)
     π_k ~ Beta(a/r, b(r-1)/r)
•  Marginally, z_k ~ Ber( a / (a + b(r-1)) ), so E(#Factors) = ra / (a + b(r-1))
How to incorporate features:
Deal with both warm start and cold-start
•  Models to predict ratings for new pairs
–  Warm-start: (user, movie) present in the training data with large
sample size
–  Cold-start: At least one of (user, movie) new or has small sample
size
•  Rough definition, warm-start/cold-start is a continuum.
•  Challenges
–  Highly incomplete (user, movie) matrix
–  Heavy tailed degree distributions for users/movies
•  Large fraction of ratings from small fraction of users/
movies
–  Handling both warm-start and cold-start effectively in the
presence of predictive features
Possible approaches
•  Large scale regression based on covariates
–  Does not provide good estimates for heavy users/movies
–  Large number of predictors to estimate interactions
•  Collaborative filtering
–  Neighborhood based
–  Factorization
•  Good for warm-start; cold-start dealt with separately
•  Single model that handles cold-start and warm-start
–  Heavy users/movies → User/movie specific model
–  Light users/movies → fallback on regression model
–  Smooth fallback mechanism for good performance
Add Feature-based Regression
into
Matrix Factorization
RLFM: Regression-based Latent
Factor Model
Regression-based Factorization
Model (RLFM)
•  Main idea: Flexible prior, predict factors
through regressions
•  Seamlessly handles cold-start and warm-
start
•  Modified state equation to incorporate
covariates
RLFM: Model
Rating:  yij ~ N(µij, σ²)          (Gaussian model)
         yij ~ Bernoulli(µij)      (Logistic model, for binary ratings)
         yij ~ Poisson(Nij µij)    (Poisson model, for counts)
Mean:    t(µij) = xij' b + αi + βj + ui' vj        (user i gives item j)
Bias of user i:        αi = g0' xi + εiα,   εiα ~ N(0, σα²)
Popularity of item j:  βj = d0' xj + εjβ,   εjβ ~ N(0, σβ²)
Factors of user i:     ui = G xi + εiu,     εiu ~ N(0, σu² I)
Factors of item j:     vj = D xj + εjv,     εjv ~ N(0, σv² I)
Could use other classes of regression models
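As a concrete illustration, a small Python sketch that assembles the RLFM mean for one (user, item) pair from the state equations above; all names, dimensions and values are hypothetical, and the regressions (b, g0, d0, G, D) are assumed to have been fit already.

import numpy as np

def rlfm_mean(x_ij, x_i, x_j, b, g0, d0, G, D,
              eps_alpha=0.0, eps_beta=0.0, eps_u=None, eps_v=None):
    """Gaussian RLFM mean: mu_ij = x_ij'b + alpha_i + beta_j + u_i'v_j."""
    alpha_i = g0 @ x_i + eps_alpha                         # user bias = regression + residual
    beta_j = d0 @ x_j + eps_beta                           # item popularity = regression + residual
    u_i = G @ x_i + (eps_u if eps_u is not None else 0.0)  # user factors
    v_j = D @ x_j + (eps_v if eps_v is not None else 0.0)  # item factors
    return x_ij @ b + alpha_i + beta_j + u_i @ v_j

# toy example: 3 user features, 2 item features, 2 pair features, rank-2 factors
rng = np.random.default_rng(0)
x_i, x_j, x_ij = rng.random(3), rng.random(2), rng.random(2)
b, g0, d0 = rng.random(2), rng.random(3), rng.random(2)
G, D = rng.random((2, 3)), rng.random((2, 2))
print(rlfm_mean(x_ij, x_i, x_j, b, g0, d0, G, D))

For a cold-start user or item the residuals are simply zero, so the prediction falls back on the feature regressions.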
Graphical representation of the
model
Advantages of RLFM
•  Better regularization of factors
–  Covariates “shrink” towards a better centroid
•  Cold-start: Fallback regression model (FeatureOnly)
RLFM: Illustration of Shrinkage
Plot the first factor
value for each user
(fitted using Yahoo! FP
data)
Model fitting: EM for our class of
models
The parameters for RLFM
•  Latent parameters:  Δ = ( {αi}, {βj}, {ui}, {vj} )
•  Hyper-parameters:   Θ = ( b, G, D, Au = au I, Av = av I )
Computing the mode
Minimized
The EM algorithm
Computing the E-step
•  Often hard to compute in closed form
•  Stochastic EM (Markov Chain EM; MCEM)
–  Compute expectation by drawing samples from
–  Effective for multi-modal posteriors but more expensive
•  Iterated Conditional Modes algorithm (ICM)
–  Faster but biased hyper-parameter estimates
Monte Carlo E-step
•  Through a vanilla Gibbs sampler (conditionals closed form)
•  Other conditionals also Gaussian and closed form
•  Conditionals of users (movies) sampled simultaneously
•  Small number of samples in early iterations, large numbers in
later iterations
M-step (Why MCEM is better than
ICM)
•  Update G, optimize
•  Update Au=au I
Ignored by ICM, underestimates factor variability
Factors over-shrunk, posterior not explored well
Experiment 1: Better regularization
•  MovieLens-100K, avg RMSE using pre-specified splits
•  ZeroMean, RLFM and FeatureOnly (no cold-start
issues)
•  Covariates:
–  Users : age, gender, zipcode (1st digit only)
–  Movies: genres
Experiment 2: Better handling of
Cold-start
•  MovieLens-1M; EachMovie
•  Training-test split based on timestamp
•  Same covariates as in Experiment 1.
Experiment 4: Predicting click-rate
on articles
•  Goal: Predict click-rate on articles for a user on F1
position
•  Article lifetimes short, dynamic updates important
•  User covariates:
–  Age, Gender, Geo, Browse behavior
•  Article covariates
–  Content Category, keywords
•  2M ratings, 30K users, 4.5 K articles
Results on Y! FP data
Some other related approaches
•  Stern, Herbrich and Graepel, WWW, 2009
–  Similar to RLFM, different parametrization and
expectation propagation used to fit the models
•  Porteus, Asuncion and Welling, AAAI, 2011
–  Non-parametric approach using a Dirichlet process
•  Agarwal, Zhang and Mazumdar, Annals of Applied
Statistics, 2011
–  Regression + random effects per user regularized
through a Graphical Lasso
Add Topic Discovery into
Matrix Factorization
fLDA: Matrix Factorization through Latent
Dirichlet Allocation
fLDA: Introduction
•  Model the rating yij that user i gives to item j as the user’s
affinity to the topics that the item has
–  Unlike regular unsupervised LDA topic modeling, here the LDA
topics are learnt in a supervised manner based on past rating
data
–  fLDA can be thought of as a “multi-task learning” version of the
supervised LDA model [Blei’07] for cold-start recommendation
yij = … + Σ_k sik zjk
  sik = user i's affinity to topic k
  zjk = Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j
Old items: zjk's are item latent factors learnt from data with the LDA prior
New items: zjk's are predicted based on the bag of words in the items
[Figure: topic × word probability matrix; row k = (Φk1, …, ΦkW) is the word distribution of Topic k, for Topics 1, …, K.]
LDA Topic Modeling (1)
•  LDA is effective for unsupervised topic discovery [Blei’03]
–  It models the generating process of a corpus of items (articles)
–  For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η)
–  For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ)
–  For each word, say the nth word, in item j,
•  Draw a topic zjn for that word from θj = [θj1, …, θjK]
•  Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn
Item j
Topic distribution: [θj1, …, θjK]
Words: wj1, …, wjn, …
Per-word topic: zj1, …, zjn, …
Assume zjn = topic k
Observed
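The generative process above can be written directly as a few lines of sampling code; a minimal Python sketch in which the number of topics, vocabulary size and Dirichlet parameters are made up for illustration.

import numpy as np

def generate_corpus(n_items=5, n_words_per_item=20, K=3, W=50, eta=0.1, lam=0.5, seed=1):
    """Sample items from the LDA generative process described above."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(W, eta), size=K)          # topic k -> word distribution Phi_k
    corpus = []
    for _ in range(n_items):
        theta_j = rng.dirichlet(np.full(K, lam))          # item j -> topic distribution theta_j
        z = rng.choice(K, size=n_words_per_item, p=theta_j)      # per-word topics z_jn
        words = [rng.choice(W, p=phi[k]) for k in z]              # word w_jn drawn from Phi_{z_jn}
        corpus.append((z, words))
    return phi, corpus

phi, corpus = generate_corpus()
print(corpus[0][1][:10])   # first 10 word ids of item 0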
LDA Topic Modeling (2)
•  Model training:
–  Estimate the prior parameters and the posterior topic×word
distribution Φ based on a training corpus of items
–  EM + Gibbs sampling is a popular method
•  Inference for new items
–  Compute the item topic distribution based on the prior
parameters and Φ estimated in the training phase
•  Supervised LDA [Blei’07]
–  Predict a target value for each item based on supervised LDA topics
Supervised LDA:   yj = Σ_k sk zjk
  yj  = target value of item j
  zjk = Pr(item j has topic k), estimated by averaging the topic of each word in item j
  sk  = regression weight for topic k
vs. fLDA:   yij = … + Σ_k sik zjk
  One regression per user; same set of topics across different regressions
fLDA: Model
Rating:  yij ~ N(µij, σ²)          (Gaussian model)
         yij ~ Bernoulli(µij)      (Logistic model, for binary ratings)
         yij ~ Poisson(Nij µij)    (Poisson model, for counts)
Mean:    t(µij) = xij' b + αi + βj + Σ_k sik zjk       (user i gives item j)
Bias of user i:            αi = g0' xi + εiα,   εiα ~ N(0, σα²)
Popularity of item j:      βj = d0' xj + εjβ,   εjβ ~ N(0, σβ²)
Topic affinity of user i:  si = H xi + εis,     εis ~ N(0, σs² I)
Pr(item j has topic k):    zjk = Σ_n 1(zjn = k) / (#words in item j),
                           where zjn is the LDA topic of the nth word in item j
Observed words:            wjn ~ LDA(λ, η, zjn),  wjn = the nth word in item j
Model Fitting
•  Given:
–  Features X = {xi, xj, xij}
–  Observed ratings y = {yij} and words w = {wjn}
•  Estimate:
–  Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η]
•  Regression weights and prior parameters
–  Latent factors: Δ = {αi, βj, si} and z = {zjn}
•  User factors, item factors and per-word topic assignment
•  Empirical Bayes approach:
–  Maximum likelihood estimate of the parameters:
   Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
–  The posterior distribution of the factors:  Pr[Δ, z | y, Θ̂]
The EM Algorithm
•  Iterate through the E and M steps until convergence
– Let be the current estimate
– E-step: Compute
•  The expectation is not in closed form
•  We draw Gibbs samples and compute the Monte
Carlo mean
– M-step: Find
•  It consists of solving a number of regression and
optimization problems
E-step:  f_n(Θ) = E_{ Δ, z | y, w, Θ̂(n) } [ log Pr(y, w, Δ, z | Θ) ]
M-step:  Θ̂(n+1) = argmax_Θ f_n(Θ),   where Θ̂(n) is the current estimate
Supervised Topic Assignment
Pr( zjn = k | Rest )  ∝  ( Zjk,¬jn + λ ) · ( Zk,wjn,¬jn + η ) / ( Zk,¬jn + W η ) · Π_{i rated j} f( yij | zjn = k )
  zjn = the topic of the nth word in item j
  The first two factors are the same as in unsupervised LDA
  The last factor is the likelihood of the observed ratings, by users who rated item j, when zjn is set to topic k
  (i.e., the probability of observing yij given the model)
fLDA: Experimental Results (Movie)
•  Task: Predict the rating that a user would give a movie
•  Training/test split:
–  Sort observations by time
–  First 75% → Training data
–  Last 25% → Test data
•  Item warm-start scenario
–  Only 2% new items in test data
Model Test RMSE
RLFM 0.9363
fLDA 0.9381
Factor-Only 0.9422
FilterBot 0.9517
unsup-LDA 0.9520
MostPopular 0.9726
Feature-Only 1.0906
Constant 1.1190
fLDA is as strong as the best method
It does not reduce the performance in warm-start scenarios
fLDA: Experimental Results (Yahoo! Buzz)
•  Task: Predict whether a user would buzz-up an article
•  Severe item cold-start
–  All items are new in test data
Data Statistics
1.2M observations
4K users
10K articles
fLDA significantly
outperforms other
models
Experimental Results: Buzzing
Topics
Top Terms (after stemming) Topic
bush, tortur, interrog, terror, administr, CIA, offici,
suspect, releas, investig, georg, memo, al
CIA interrogation
mexico, flu, pirat, swine, drug, ship, somali, border,
mexican, hostag, offici, somalia, captain
Swine flu
NFL, player, team, suleman, game, nadya, star, high,
octuplet, nadya_suleman, michael, week
NFL games
court, gai, marriag, suprem, right, judg, rule, sex, pope,
supreme_court, appeal, ban, legal, allow
Gay marriage
palin, republican, parti, obama, limbaugh, sarah, rush,
gop, presid, sarah_palin, sai, gov, alaska
Sarah Palin
idol, american, night, star, look, michel, win, dress,
susan, danc, judg, boyl, michelle_obama
American idol
economi, recess, job, percent, econom, bank, expect,
rate, jobless, year, unemploy, month
Recession
north, korea, china, north_korea, launch, nuclear,
rocket, missil, south, said, russia
North Korea issues
3/4 topics are interpretable; 1/2 are similar to unsupervised topics
fLDA Summary
•  fLDA is a useful model for cold-start item recommendation
•  It also provides interpretable recommendations for users
–  User’s preference to interpretable LDA topics
•  Future directions:
–  Investigate Gibbs sampling chains and the convergence properties of
the EM algorithm
–  Apply fLDA to other multi-task prediction problems
•  fLDA can be used as a tool to generate supervised
features (topics) from text data
Summary
•  Regularizing factors through covariates effective
•  Regression based factor model that regularizes better
and deals with both cold-start and warm-start in a
single framework in a seamless way looks attractive
•  Fitting method scalable; Gibbs sampling for users and
movies can be done in parallel. Regressions in M-step
can be done with any off-the-shelf scalable linear
regression routine
•  Distributed computing on Hadoop: Multiple models
and average across partitions (more later)
Online Components:
Online Models, Intelligent Initialization, Explore / Exploit
Why Online Components?
•  Cold start
–  New items or new users come to the system
–  How to obtain data for new items/users (explore/exploit)
–  Once data becomes available, how to quickly update the model
•  Periodic rebuild (e.g., daily): Expensive
•  Continuous online update (e.g., every minute): Cheap
•  Concept drift
–  Item popularity, user interest, mood, and user-to-item affinity may
change over time
–  How to track the most recent behavior
•  Down-weight old data
–  How to model temporal patterns for better prediction
•  … may not need to be online if the patterns are stationary
Big Picture
                                                Most Popular Recommendation    Personalized Recommendation
Offline Models                                                                 Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic)        Time-series models             Incremental CF, online regression
Intelligent Initialization (do not start cold)  Prior estimation               Prior estimation, dimension reduction
Explore/Exploit (actively acquire data)         Multi-armed bandits            Bandits with covariates
Online Components for Most Popular Recommendation
Online models, intelligent initialization & explore/exploit
Extension: Segmented Most Popular Recommendation
Most popular recommendation:
Outline
•  Most popular recommendation (no
personalization, all users see the same thing)
–  Time-series models (online models)
–  Prior estimation (initialization)
–  Multi-armed bandits (explore/exploit)
–  Sometimes hard to beat!!
•  Segmented most popular recommendation
–  Create user segments/clusters based on user
features
–  Do most popular recommendation for each segment
Most Popular Recommendation
•  Problem definition: Pick k items (articles) from a
pool of N to maximize the total number of clicks on
the picked items
•  Easy!? Pick the items having the highest click-
through rates (CTRs)
•  But …
–  The system is highly dynamic:
•  Items come and go with short lifetimes
•  CTR of each item changes over time
–  How much traffic should be allocated to explore new
items to achieve optimal performance
•  Too little → Unreliable CTR estimates
•  Too much → Little traffic to exploit the high CTR items
CTR Curves for Two Days on Yahoo! Front Page
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time.
Traffic obtained from a controlled randomized experiment (no confounding).
Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories.
For Simplicity, Assume …
•  Pick only one item for each user visit
–  Multi-slot optimization later
•  No user segmentation, no personalization
(discussion later)
•  The pool of candidate items is predetermined
and is relatively small (≤ 1000)
–  E.g., selected by human editors or by a first-phase
filtering method
–  Ideally, there should be a feedback loop
–  Large item pool problem later
•  Effects like user-fatigue, diversity in
recommendations, multi-objective optimization
not considered (discussion later)
Online Models
•  How to track the changing CTR of an item
•  Data: for each item, at time t, we observe
–  Number of times the item nt was displayed (i.e., #views)
–  Number of clicks ct on the item
•  Problem Definition: Given c1, n1, …, ct, nt, predict the CTR
(click-through rate) pt+1 at time t+1
•  Potential solutions:
–  Observed CTR at t: ct / nt → highly unstable (nt is usually small)
–  Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very
slowly
–  Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable
•  But, no estimation of Var[pt+1] (useful for explore/exploit)
Online Models: Dynamic Gamma-Poisson
•  Model-based approach
–  (ct | nt, pt) ~ Poisson(nt pt)
–  pt = pt-1 εt, where εt ~ Gamma(mean=1, var=η)
–  Model parameters:
•  p1 ~ Gamma(mean=µ0, var=σ0
2) is the offline CTR estimate
•  η specifies how dynamic/smooth the CTR is over time
–  Posterior distribution (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?,?)
•  Solve this recursively (online update rule)
	
  	
  Show	
  the	
  item	
  nt	
  $mes	
  
	
  	
  Receive	
  ct	
  clicks	
  
	
  	
  pt	
  =	
  CTR	
  at	
  $me	
  t	
  
Nota$on:	
  
p1	
  µ0,	
  σ0
2
	
  
p2	
   …
n1	
  
c1	
  
n2	
  
c2	
  
η	
  
Online Models: Derivation
Prior at time t:       ( pt | c1, n1, …, ct−1, nt−1 ) ~ Gamma( mean = µt, var = σt² )
                       Let γt = µt / σt²   (effective sample size)
Posterior at time t:   ( pt | c1, n1, …, ct, nt ) ~ Gamma( mean = µt|t, var = σt|t² )
                       γt|t  = γt + nt            (effective sample size)
                       µt|t  = ( γt µt + ct ) / γt|t
                       σt|t² = µt|t / γt|t
Prediction for t+1:    ( pt+1 | c1, n1, …, ct, nt ) ~ Gamma( mean = µt+1, var = σt+1² )
                       µt+1  = µt|t
                       σt+1² = σt|t² + η ( µt|t² + σt|t² )
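The recursion above amounts to a few lines of code; a minimal Python sketch that tracks the Gamma by its mean and variance, with the drift parameter η and the toy click/view counts as assumptions.

def gamma_poisson_update(mu, var, c, n, eta):
    """One step of the dynamic Gamma-Poisson filter.

    (mu, var): prior mean/variance of p_t; (c, n): clicks/views observed at t;
    eta: variance of the multiplicative drift. Returns prior mean/variance of p_{t+1}.
    """
    gamma = mu / var                       # effective sample size of the prior
    gamma_post = gamma + n
    mu_post = (gamma * mu + c) / gamma_post
    var_post = mu_post / gamma_post        # posterior of p_t given data up to time t
    mu_next = mu_post                      # p_{t+1} = p_t * eps with E[eps] = 1, Var[eps] = eta
    var_next = var_post + eta * (mu_post ** 2 + var_post)
    return mu_next, var_next

# toy usage: offline estimate mu0 = 2% CTR, then two time intervals of data
mu, var = 0.02, 0.0001
for c, n in [(12, 500), (3, 400)]:
    mu, var = gamma_poisson_update(mu, var, c, n, eta=0.01)
    print(round(mu, 4), round(var, 8))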
[Figure: estimated CTR distribution at time t vs. at time t+1; high-CTR items are more adaptive.]
Tracking behavior of Gamma-
Poisson model
•  Low click rate articles – More temporal
smoothing
Intelligent Initialization: Prior
Estimation
•  Prior CTR distribution: Gamma(mean=µ0, var=σ0²)
–  N historical items:
•  ni = #views of item i in its first time interval
•  ci = #clicks on item i in its first time interval
–  Model
•  ci ~ Poisson(ni pi) and pi ~ Gamma(µ0, σ0²)  ⇒  ci ~ NegBinomial(µ0, σ0², ni)
–  Maximum likelihood estimate (MLE) of (µ0, σ0²)
•  Better prior: Cluster items and find MLE for each cluster
–  Agarwal & Chen, 2011 (SIGMOD)
(µ0, σ0²)MLE = argmax_{µ0, σ0²}   − N log Γ( µ0²/σ0² )  +  N (µ0²/σ0²) log( µ0/σ0² )
               + Σ_i [ log Γ( ci + µ0²/σ0² )  −  ( ci + µ0²/σ0² ) log( ni + µ0/σ0² ) ]
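A sketch of this MLE in Python using the negative-binomial marginal likelihood; the use of scipy's Nelder-Mead optimizer, the log-scale parametrization and the toy data are implementation choices, not part of the slides.

import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def neg_loglik(params, c, n):
    """Negative log-likelihood of (mu0, sigma0^2) under c_i ~ NegBinomial(mu0, sigma0^2, n_i)."""
    log_mu0, log_s2 = params                 # optimize on the log scale to keep both positive
    mu0, s2 = np.exp(log_mu0), np.exp(log_s2)
    a, b = mu0 ** 2 / s2, mu0 / s2           # Gamma(shape=a, rate=b) has mean mu0, var s2
    # terms not involving (mu0, s2), e.g. log(c_i!) and c_i log n_i, are dropped (constant in the argmax)
    ll = gammaln(a + c) - gammaln(a) + a * np.log(b) - (a + c) * np.log(b + n)
    return -np.sum(ll)

# toy data: first-interval (clicks, views) for a set of historical items
c = np.array([5, 0, 12, 3, 7, 1, 20, 2])
n = np.array([400, 350, 600, 500, 450, 300, 900, 420])
res = minimize(neg_loglik, x0=np.log([0.01, 1e-4]), args=(c, n), method="Nelder-Mead")
mu0_hat, s2_hat = np.exp(res.x)
print(mu0_hat, s2_hat)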
Explore/Exploit: Problem Definition
[Figure: at each time slot, a fraction xi% of page views is allocated to each item i = 1, …, K.]
Determine (x1, x2, …, xK), based on the clicks and views observed before time t (now),
in order to maximize the expected total number of clicks in the future.
Modeling the Uncertainty, NOT just
the Mean
Simplified setting: two items
  –  We know the CTR of Item A well (say, it has been shown 1 million times)
  –  We are uncertain about the CTR of Item B (shown only 100 times)
  –  If we only make a single decision, give 100% of page views to Item A
  –  If we make multiple decisions in the future, explore Item B since its CTR can potentially be higher
Potential of Item B:   ∫_{p > q} (p − q) f(p) dp
  where q = CTR of Item A, p = CTR of Item B, and f(p) = probability density of Item B's CTR
[Figure: posterior CTR densities of Item A (narrow) and Item B (wide).]
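The potential integral is easy to evaluate numerically; a short sketch assuming Item B's CTR has a Beta posterior (the click/view counts and prior are made up).

import numpy as np
from scipy import stats

def potential(q, dist, grid=10000):
    """Expected gain of item B over a known CTR q: integral over p > q of (p - q) f(p) dp."""
    p = np.linspace(0.0, 1.0, grid)
    f = dist.pdf(p)
    gain = np.where(p > q, p - q, 0.0)
    return np.sum(gain * f) * (p[1] - p[0])   # simple Riemann sum

q_A = 0.05                                    # well-estimated CTR of Item A
post_B = stats.beta(3 + 1, 97 + 1)            # Item B: 3 clicks in 100 views, Beta(1, 1) prior
print(potential(q_A, post_B))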
Multi-Armed Bandits: Introduction
(1)
Bandit “arms”
p1 p2 p3
(unknown payoff
probabilities)
“Pulling” arm i yields a reward:
reward = 1 with probability pi (success)
reward = 0 otherwise (failure)
For now, we are attacking the problem of choosing the best article/arm for all users.
Multi-Armed Bandits: Introduction
(2)
Bandit “arms”
p1 p2 p3
(unknown payoff
probabilities)
Goal: Pull arms sequentially to maximize the total reward.
Bandit scheme/policy: a sequential algorithm to play arms (items).
Regret of a scheme = expected loss relative to the "oracle" optimal scheme that always plays the best arm
  –  "best" means highest success probability
  –  But, the best arm is not known … unless you have an oracle
  –  Regret is the price of exploration
  –  Low regret implies quick convergence to the best arm
Multi-Armed Bandits: Introduction
(3)
•  Bayesian approach
–  Seeks to find the Bayes optimal solution to a Markov
decision process (MDP) with assumptions about
probability distributions
–  Representative work: Gittins’ index, Whittle’s index
–  Very computationally intensive
•  Minimax approach
–  Seeks to find a scheme that incurs bounded regret (with no
or mild assumptions about probability distributions)
–  Representative work: UCB by Lai, Auer
–  Usually, computationally easy
–  But, they tend to explore too much in practice (probably
because the bounds are based on worst-case analysis)
Skip details
Multi-Armed Bandits: Markov Decision Process (1)
•  Select an arm now at time t=0, to maximize expected total number
of clicks in t=0,…,T
•  State at time t: Θt = (θ1t, …, θKt)
–  θit = State of arm i at time t (that captures all we know about arm i at t)
•  Reward function Ri(Θt, Θt+1)
–  Reward of pulling arm i that brings the state from Θt to Θt+1
•  Transition probability Pr[Θt+1 | Θt, pulling arm i ]
•  Policy π: A function that maps a state to an arm (action)
–  π(Θt) returns an arm (to pull)
•  Value of policy π starting from the current state Θ0 with horizon T
Vπ(Θ0, T) = E[ Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ]
          = ∫ Pr[ Θ1 | Θ0, π(Θ0) ] · ( Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ) dΘ1
  –  Rπ(Θ0)(Θ0, Θ1): immediate reward
  –  Vπ(Θ1, T−1): value of the remaining T−1 time slots if we start from state Θ1
Multi-Armed Bandits: MDP (2)
•  Optimal policy:
•  Things to notice:
–  Value is defined recursively (actually T high-dim integrals)
–  Dynamic programming can be used to find the optimal policy
–  But, just evaluating the value of a fixed policy can be very expensive
•  Bandit Problem: The pull of one arm does not change the state of
other arms and the set of arms do not change over time
Vπ(Θ0, T) = E[ Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ]
          = ∫ Pr[ Θ1 | Θ0, π(Θ0) ] · ( Rπ(Θ0)(Θ0, Θ1) + Vπ(Θ1, T−1) ) dΘ1
  (immediate reward + value of the remaining T−1 time slots starting from state Θ1)
Optimal policy:  argmax_π Vπ(Θ0, T)
Multi-Armed Bandits: MDP (3)
•  Which arm should be pulled next?
–  Not necessarily what looks best right now, since it might have had a few
lucky successes
–  Looks like it will be a function of successes and failures of all arms
•  Consider a slightly different problem setting
–  Infinite time horizon, but
–  Future rewards are geometrically discounted
Rtotal = R(0) + γ·R(1) + γ²·R(2) + …   (0 < γ < 1)
•  Theorem [Gittins 1979]: The optimal policy decouples and solves a
bandit problem for each arm independently
Policy π(Θt) is a function of (θ1t, …, θKt)  →  one K-dimensional problem
Policy π(Θt) = argmax_i { g(θit) }  (Gittins' index)  →  K one-dimensional problems
Still computationally expensive!!
Multi-Armed Bandits: MDP (4)
Bandit Policy
1.  Compute the priority
(Gittins’ index) of each
arm based on its state
2.  Pull arm with max
priority, and observe
reward
3.  Update the state of the
pulled arm
Multi-Armed Bandits: MDP (5)
•  Theorem [Gittins 1979]: The optimal policy decouples
and solves a bandit problem for each arm
independently
–  Many proofs and different interpretations of Gittins’ index
exist
•  The index of an arm is the fixed charge per pull for a game with two options, whether
to pull the arm or not, so that the charge makes the optimal play of the game have
zero net reward
–  Significantly reduces the dimension of the problem space
–  But, Gittins’ index g(θit) is still hard to compute
•  For the Gamma-Poisson or Beta-Binomial models
θit = (#successes, #pulls) for arm i up to time t
•  g maps each possible (#successes, #pulls) pair to a number
–  Approximate methods are used in practice
–  Lai et al. have derived these for exponential family
distributions
Multi-Armed Bandits: Minimax Approach (1)
•  Compute the priority of each arm i in a way that the
regret is bounded
–  Lowest regret in the worst case
•  One common policy is UCB1 [Auer 2002]
Priority_i  =  ci / ni  +  sqrt( 2 log n / ni )
  –  ci = number of successes (clicks) of arm i;  ni = number of pulls of arm i;  n = total number of pulls of all arms
  –  First term: observed success rate;  second term: factor representing uncertainty
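A minimal Python sketch of the UCB1 policy with simulated Bernoulli arms; the true CTRs and horizon are illustrative.

import numpy as np

def ucb1(true_ctrs, horizon=10000, seed=0):
    """UCB1: play every arm once, then pull the arm with max priority = c_i/n_i + sqrt(2 ln n / n_i)."""
    rng = np.random.default_rng(seed)
    K = len(true_ctrs)
    clicks = np.zeros(K)
    pulls = np.zeros(K)
    for t in range(horizon):
        if t < K:
            arm = t                                    # initialization: play each arm once
        else:
            n = pulls.sum()
            priority = clicks / pulls + np.sqrt(2.0 * np.log(n) / pulls)
            arm = int(np.argmax(priority))
        clicks[arm] += rng.random() < true_ctrs[arm]   # Bernoulli reward
        pulls[arm] += 1
    return pulls, clicks

pulls, clicks = ucb1([0.04, 0.05, 0.06])
print(pulls)   # sub-optimal arms receive O(log n) pulls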
Multi-Armed Bandits: Minimax Approach (2)
•  As total observations n becomes large:
–  Observed payoff tends asymptotically towards the
true payoff probability
–  The system never completely “converges” to one
best arm; only the rate of exploration tends to
zero
Priority_i = ci / ni + sqrt( 2 log n / ni )   (observed payoff + factor representing uncertainty)
Multi-Armed Bandits: Minimax Approach (3)
•  Sub-optimal arms are pulled O(log n) times
•  Hence, UCB1 has O(log n) regret
•  This is the lowest possible regret (but the constants matter)
•  E.g. Regret after n plays is bounded by
Regret after n plays  ≤  8 Σ_{i : µi < µbest} ( ln n / Δi )  +  ( 1 + π²/3 ) Σ_{j=1..K} Δj ,   where Δi = µbest − µi
Classical Multi-Armed Bandits: Summary
•  Classical multi-armed bandits
–  A fixed set of arms with fixed rewards
–  Observe the reward before the next pull
•  Bayesian approach (Markov decision process)
–  Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits
•  Pull the arm currently having the highest index value
–  Whittle’s index [Whittle 1988]: Extension to a changing reward function
–  Computationally intensive
•  Minimax approach (providing guaranteed regret bounds)
–  UCB1 [Auer 2002]: Upper bound of a model agnostic confidence interval
•  Index of arm i = ci / ni + sqrt( 2 log n / ni )
•  Heuristics
–  ε-Greedy: Random exploration using fraction ε of traffic
–  Softmax: Pick arm i with probability  exp{ µ̂i / τ } / Σ_j exp{ µ̂j / τ } ,  where µ̂i = predicted CTR of item i and τ = temperature
–  Posterior draw: Index = a draw from the posterior CTR distribution of the arm
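For comparison with UCB1, here is a sketch of two of the heuristics above with Beta-Bernoulli arms: the posterior-draw index (Thompson sampling) and ε-greedy. The priors, ε and the toy counts are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def posterior_draw_index(clicks, views, a0=1.0, b0=1.0):
    """Index of each arm = one draw from its Beta posterior CTR distribution."""
    return rng.beta(a0 + clicks, b0 + views - clicks)

def epsilon_greedy_index(clicks, views, eps=0.1):
    """With probability eps explore a random arm, otherwise exploit the empirical CTR."""
    if rng.random() < eps:
        return rng.random(len(clicks))            # random ranking -> a random arm wins
    return clicks / np.maximum(views, 1)

clicks = np.array([40.0, 3.0, 0.0])
views = np.array([1000.0, 50.0, 0.0])
print(np.argmax(posterior_draw_index(clicks, views)))   # serve the arm with the top index
print(np.argmax(epsilon_greedy_index(clicks, views)))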
Do Classical Bandits Apply to Web Recommenders?
[Recall the figure: each curve is the CTR of an item in the Today Module on www.yahoo.com over time;
traffic obtained from a controlled randomized experiment (no confounding).
Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories.]
Characteristics of Real Recommender Systems
•  Dynamic set of items (arms)
–  Items come and go with short lifetimes (e.g., a day)
–  Asymptotically optimal policies may fail to achieve good performance
when item lifetimes are short
•  Non-stationary CTR
–  CTR of an item can change dramatically over time
•  Different user populations at different times
•  Same user behaves differently at different times (e.g., morning, lunch
time, at work, in the evening, etc.)
•  Attention to breaking news stories decays over time
•  Batch serving for scalability
–  Making a decision and updating the model for each user visit in real time
is expensive
–  Batch serving is more feasible: Create time slots (e.g., 5 min); for each
slot, decide the fraction xi of the visits in the slot to give to item i
[Agarwal et al., ICDM, 2009]
Explore/Exploit in Recommender
Systems
[Figure: at each time slot, a fraction xi% of page views is allocated to each item i = 1, …, K.]
Determine (x1, x2, …, xK), based on the clicks and views observed before time t (now),
in order to maximize the expected total number of clicks in the future.
Let's solve this from first principles.
Bayesian Solution: Two Items, Two
Time Slots (1)
•  Two time slots: t = 0 and t = 1
–  Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1
–  Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1
•  To determine x, we need to estimate what would happen in the future
Question: what fraction x of the N0 views at t = 0 to give to Item P, and (1−x) to Item Q?
(N0 views arrive in slot t = 0 (now), N1 views in slot t = 1.)
After serving with fraction x, we obtain c clicks on Item P (not yet observed; a random variable).
Once we observe c, we can update the posterior of p1, giving the estimate p̂1(x, c).
[Figure: CTR densities of Item P and Item Q at t = 0 (p0 vs. q0) and at t = 1 (p1(x, c) vs. q1).]
If x and c are given, the optimal decision at t = 1 is:
  give all views to Item P  iff  E[ p1 | x, c ] = p̂1(x, c) > q1
•  Expected total number of clicks in the two time slots:
     x N0 p̂0 + (1−x) N0 q0 + N1 Ec[ max{ p̂1(x, c), q1 } ]
Gain(x, q0, q1) = Expected number of additional clicks if we explore the uncertain Item P with
fraction x of views in slot 0, compared to a scheme that only shows the certain Item Q in both slots
Solution: argmax_x Gain(x, q0, q1)
Bayesian Solution: Two Items, Two
Time Slots (2)
   =  N0 q0 + N1 q1  +  x N0 ( p̂0 − q0 )  +  N1 Ec[ max{ p̂1(x, c) − q1, 0 } ]
  –  N0 q0 + N1 q1: E[#clicks] at t = 0 and t = 1 if we always show Item Q
  –  The remaining two terms: Gain(x, q0, q1), the gain of exploring the uncertain Item P using x
  –  At t = 1, show the item with the higher E[CTR]:  max{ p̂1(x, c), q1 }
•  Approximate p̂1(x, c) by a normal distribution
–  Reasonable approximation because of the central limit theorem
•  Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
With prior  p1 ~ Beta(a, b):
  p̂1     = Ec[ p̂1(x, c) ] = a / (a + b)
  σ1²(x) = Var[ p̂1(x, c) ] = N0 x · a b / ( (a + b)² (a + b + 1) (a + b + N0 x) )
  Gain(x, q0, q1) = x N0 ( p̂0 − q0 )
                    + N1 [ σ1(x) φ( (q1 − p̂1) / σ1(x) )  +  ( p̂1 − q1 ) ( 1 − Φ( (q1 − p̂1) / σ1(x) ) ) ]
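A sketch that evaluates Gain(x, q0, q1) under the normal approximation above; the slides obtain the optimizer in O(log N0) time, while this illustration simply scans a grid of x values, and the prior and traffic numbers are made up.

import numpy as np
from scipy.stats import norm

def gain(x, q0, q1, N0, N1, a, b):
    """Expected extra clicks from giving fraction x of slot-0 views to the uncertain item P."""
    p0_hat = a / (a + b)                                # prior mean CTR of item P
    p1_hat = p0_hat                                     # E_c[ p1_hat(x, c) ]
    var1 = N0 * x * a * b / ((a + b) ** 2 * (a + b + 1) * (a + b + N0 * x))
    sd1 = np.sqrt(np.maximum(var1, 1e-12))
    z = (q1 - p1_hat) / sd1
    explore_value = sd1 * norm.pdf(z) + (p1_hat - q1) * (1.0 - norm.cdf(z))
    return x * N0 * (p0_hat - q0) + N1 * explore_value

N0, N1 = 10000, 10000
q0 = q1 = 0.05                                          # known CTR of item Q in both slots
a, b = 4.0, 96.0                                        # Beta prior for item P's CTR (mean 0.04)
xs = np.linspace(1e-6, 1.0, 1000)
best_x = xs[np.argmax(gain(xs, q0, q1, N0, N1, a, b))]
print(best_x)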
Bayesian Solution: Two Items, Two
Time Slots (3)
Bayesian Solution: Two Items, Two Time Slots (4)
•  Quiz: Is it correct that the more we are uncertain about the CTR of
an item, the more we should explore the item?
[Figure: optimal fraction of views to give to the item, as uncertainty about its CTR goes from low to high;
different curves are for different prior mean settings.]
ENAR short course

• 1. Statistical Computing For Big Data Deepak Agarwal LinkedIn Applied Relevance Science dagarwal@linkedin.com ENAR 2014, Baltimore, USA
  • 2. Main  Collaborators:  several  others  at  both  Y!   and  LinkedIn   •  I  won’t  be  here  without  them,  extremely  lucky  to  work  with  such  talented   individuals   Bee-Chung Chen Liang Zhang Bo Long Jonathan Traupman Paul Ogilvie
• 3. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Some statistical computations using Map-Reduce • Bootstrap, Logistic Regression • Part II: Recommender Systems for Web Applications – Introduction – Content Recommendation – Online Advertising
• 4. Big Data becoming Ubiquitous • Bioinformatics • Astronomy • Internet • Telecommunications • Climatology • …
• 5. Big Data: Some size estimates • 1000 human genomes: > 100TB of data (1000 genomes project) • Sloan Digital Sky Survey: 200GB data per night (>140TB aggregated) • Facebook: A billion monthly active users • LinkedIn: roughly > 280M members worldwide • Twitter: > 500 million tweets a day • Over 6 billion mobile phones in the world generating data everyday
• 6. Big Data: Paradigm shift • Classical Statistics – Generalize using small data • Paradigm Shift with Big Data – We now have an almost infinite supply of data – Easy Statistics? Just appeal to asymptotic theory? • So the issue is mostly computational? – Not quite • More data comes with more heterogeneity • Need to change our statistical thinking to adapt – Classical statistics still invaluable to think about big data analytics
• 7. Some Statistical Challenges • Exploratory Analysis (EDA), Visualization – Retrospective (on Terabytes) – More Real Time (streaming computations every few minutes/hours) • Statistical Modeling – Scale (computational challenge) – Curse of dimensionality • Millions of predictors, heterogeneity – Temporal and Spatial correlations
• 8. Statistical Challenges continued • Experiments – To test new methods, test hypotheses from randomized experiments – Adaptive experiments • Forecasting – Planning, advertising • Many more I am not fully well versed in
• 9. Defining Big Data • How to know you have the big data problem? – Is it only the number of terabytes? – What about dimensionality, structured/unstructured, computations required,… • No clear definition, different points of view – When the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC
• 10. Distributed Computing for Big Data • Distributed computing is an invaluable tool to scale computations for big data • Some distributed computing models – Multi-threading – Graphics Processing Units (GPU) – Message Passing Interface (MPI) – Map-Reduce
• 11. Evaluating a method for a problem • Scalability – Process X GB in Y hours • Ease of use for a statistician • Reliability (fault tolerance) – Especially in an industrial environment • Cost – Hardware and cost of maintenance • Good for the computations required? – E.g., Iterative versus one pass • Resource sharing
• 12. Multithreading • Multiple threads take advantage of multiple CPUs • Shared memory • Threads can execute independently and concurrently • Can only handle Gigabytes of data • Reliable
• 13. Graphics Processing Units (GPU) • Number of cores: – CPU: order of 10 – GPU: smaller cores, order of 1000 • Can be >100x faster than CPU – Parallel computationally intensive tasks off-loaded to GPU • Good for certain computationally-intensive tasks • Can only handle Gigabytes of data • Not trivial to use, requires good understanding of low-level architecture for efficient use – But things are changing, it is getting more user friendly
• 14. Message Passing Interface (MPI) • Language-independent communication protocol among processes (e.g. computers) • Most suitable for the master/slave model • Can handle Terabytes of data • Good for iterative processing • Fault tolerance is low
• 15. Map-Reduce (Dean & Ghemawat, 2004) Mappers Reducers Data Output • Computation split into Map (scatter) and Reduce (gather) stages • Easy to Use: – User needs to implement two functions: Mapper and Reducer • Easily handles Terabytes of data • Very good fault tolerance (failed tasks automatically get restarted)
• 16. Comparison of Distributed Computing Methods (Multithreading / GPU / MPI / Map-Reduce) • Scalability (data size): Gigabytes / Gigabytes / Terabytes / Terabytes • Fault Tolerance: High / High / Low / High • Maintenance Cost: Low / Medium / Medium / Medium-High • Iterative Process Complexity: Cheap / Cheap / Cheap / Usually expensive • Resource Sharing: Hard / Hard / Easy / Easy • Easy to Implement?: Easy / Needs understanding of low-level GPU architecture / Easy / Easy
• 17. Example Problem • Tabulating word counts in a corpus of documents • Similar to the table function in R
  • 18. Word  Count  Through  Map-­‐Reduce   Hello  World   Bye  World     Hello  Hadoop   Goodbye  Hadoop   Mapper  1   <Hello,  1>   <Hadoop,  1>   <Goodbye,  1>   <Hadoop,1>   <Hello,  1>   <World,  1>   <Bye,  1>   <World,1>   Mapper  2   Reducer  1   Words  from  A-­‐G   Reducer  2   Words  from  H-­‐Z   <Bye,  1>   <Goodbye,  1>   <Hello,  2>   <World,  2>   <Hadoop,  2>  
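To make the word-count flow concrete, here is a minimal local R sketch of the same map / shuffle / reduce steps; it runs in a single process, and the helper names (map_fn, reduce_fn) are illustrative, not part of any Hadoop API.

# Toy corpus: one document per element
docs <- c("Hello World Bye World", "Hello Hadoop Goodbye Hadoop")

# Map: emit a <word, 1> pair for every word in every document
map_fn <- function(doc) {
  words <- unlist(strsplit(doc, "\\s+"))
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(docs, map_fn))

# Shuffle: group values by key (what the framework does between map and reduce)
grouped <- split(pairs$value, pairs$key)

# Reduce: sum the values for each key
reduce_fn <- function(key, values) c(key, sum(values))
result <- t(sapply(names(grouped), function(k) reduce_fn(k, grouped[[k]])))
print(result)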
• 19. Key Ideas about Map-Reduce Big Data Partition 1 Partition 2 … Partition N Mapper 1 Mapper 2 … Mapper N <Key, Value> <Key, Value> <Key, Value> <Key, Value> Reducer 1 Reducer 2 … Reducer M Output 1 Output 2 … Output M
• 20. Key Ideas about Map-Reduce • Data are split into partitions and stored on many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs • Data with the same key are sent to the same reducer. One reducer can receive multiple keys • Every reducer sorts its data by key • For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
  • 21. Compute  Mean  for  Each  Group   ID   Group  No.   Score   1   1   0.5   2   3   1.0   3   1   0.8   4   2   0.7   5   2   1.5   6   3   1.2   7   1   0.8   8   2   0.9   9   4   1.3   …   …   …  
• 22. Key Ideas about Map-Reduce • Data are split into partitions and stored on many different machines on disk (distributed storage) • Mappers process data chunks independently and emit <Key, Value> pairs – For each row: • Key = Group No. • Value = Score • Data with the same key are sent to the same reducer. One reducer can receive multiple keys – E.g. 2 reducers – Reducer 1 receives data with key = 1, 2 – Reducer 2 receives data with key = 3, 4 • Every reducer sorts its data by key – E.g. Reducer 1: <key = 1, values = [0.5, 0.8, 0.8]>, <key = 2, values = [0.7, 1.5, 0.9]> • For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result – E.g. Reducer 1 output: <1, mean(0.5, 0.8, 0.8)>, <2, mean(0.7, 1.5, 0.9)>
• 23. Key Ideas about Map-Reduce [repeat of the previous slide, with the mapper and the per-key reducer function highlighted as "what you need to implement"]
• 24. Pseudo Code (in R)
Mapper:
  Input: Data
  for (row in Data) {
    groupNo <- row$groupNo
    score <- row$score
    Output(c(groupNo, score))
  }
Reducer:
  Input: Key (groupNo), Value (a list of scores that belong to the Key)
  count <- 0
  sum <- 0
  for (v in Value) {
    sum <- sum + v
    count <- count + 1
  }
  Output(c(Key, sum / count))
  • 25. Exercise  1   •  Problem:  Average  height  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   •  What  should  be  the  mapper  output  value?   •  What  are  the  reducer  input?   •  What  are  the  reducer  output?   •  Write  mapper  and  reducer  for  this?   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  • 26. •  Problem:  Average  height  per  Grade  and  Gender?   •  What  should  be  the  mapper  output  key?   –  {Grade,  Gender}   •  What  should  be  the  mapper  output  value?   –  Height   •  What  are  the  reducer  input?   –  Key:  {Grade,  Gender},  Value:  List  of  Heights     •  What  are  the  reducer  output?   –  {Grade,  Gender,  mean(Heights)}   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  • 27. Exercise  2   •  Problem:  Number  of  students  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   •  What  should  be  the  mapper  output  value?   •  What  are  the  reducer  input?   •  What  are  the  reducer  output?   •  Write  mapper  and  reducer  for  this?   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  • 28. •  Problem:  Number  of  students  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   –  {Grade,  Gender}   •  What  should  be  the  mapper  output  value?   –  1   •  What  are  the  reducer  input?   –  Key:  {Grade,  Gender},  Value:  List  of  1’s   •  What  are  the  reducer  output?   –  {Grade,  Gender,  sum(value  list)}   –  OR:  {Grade,  Gender,  length(value  list)}   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
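Putting the two exercises together, here is a hedged local R simulation of the mapper/reducer logic for both the average height and the student count per {Grade, Gender}; the tiny data frame is made up for illustration, and the split()-based "shuffle" stands in for what Hadoop would do across machines.

# Toy student table (illustrative values)
students <- data.frame(id = 1:6, grade = c(3, 2, 2, 3, 2, 3),
                       gender = c("M", "F", "M", "F", "F", "M"),
                       height = c(120, 115, 116, 118, 117, 121))

# Mapper (both exercises): key = {Grade, Gender}; value = height (Ex. 1) or 1 (Ex. 2)
keys <- paste(students$grade, students$gender, sep = ",")

# Shuffle: group values by key
heights_by_key <- split(students$height, keys)
ones_by_key <- split(rep(1, nrow(students)), keys)

# Reducer, Exercise 1: mean of the heights for each key
ex1 <- sapply(heights_by_key, mean)

# Reducer, Exercise 2: sum (or length) of the 1's for each key
ex2 <- sapply(ones_by_key, sum)

print(ex1); print(ex2)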
• 29. More on Map-Reduce • Depends on distributed file systems • Typically mappers are the data storage nodes • Map/Reduce tasks automatically get restarted when they fail (good fault tolerance) • Map and Reduce I/O are all on disk – Data transmission from mappers to reducers is through disk copy • Iterative processing through Map-Reduce – Each iteration becomes a map-reduce job – Can be expensive since map-reduce overhead is high
• 30. The Apache Hadoop System • Open-source software for reliable, scalable, distributed computing • The most popular distributed computing system in the world • Key modules: – Hadoop Distributed File System (HDFS) – Hadoop YARN (job scheduling and cluster resource management) – Hadoop MapReduce
• 31. Major Tools on Hadoop • Pig – A high-level language for Map-Reduce computation • Hive – A SQL-like query language for data querying via Map-Reduce • HBase – A distributed & scalable database on Hadoop – Allows random, real-time read/write access to big data – Voldemort is similar to HBase • Mahout – A scalable machine learning library • …
• 32. Hadoop Installation • Setting up Hadoop on your desktop/laptop: – http://hadoop.apache.org/docs/stable/single_node_setup.html • Setting up Hadoop on a cluster of machines: – http://hadoop.apache.org/docs/stable/cluster_setup.html
• 33. Hadoop Distributed File System (HDFS) • Master/Slave architecture • NameNode: a single master node that controls which data block is stored where • DataNodes: slave nodes that store data and do R/W operations • Clients (Gateway): allow users to log in, interact with HDFS and submit Map-Reduce jobs • Big data is split into equal-sized blocks; each block can be stored on different DataNodes • Disk failure tolerance: data is replicated multiple times
• 34. Load the Data into Pig • A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID : int, groupNo: int, score: float); – The path of the data on HDFS goes after LOAD • USING PigStorage() means the data are delimited by tab (can be omitted) • If the data are delimited by other characters, e.g. space, use USING PigStorage(' ') • The data schema is defined after AS • Variable types: int, long, float, double, chararray, …
• 35. Structure of This Tutorial • Part I: Introduction to Map-Reduce and the Hadoop System – Overview of Distributed Computing – Introduction to Map-Reduce – Introduction to the Hadoop System – Examples of Statistical Computing for Big Data • Bag of Little Bootstraps • Large Scale Logistic Regression
• 36. Bag of Little Bootstraps Kleiner et al. 2012
• 37. Bootstrap (Efron, 1979) • A re-sampling based method to obtain the statistical distribution of sample estimators • Why are we interested? – Re-sampling is embarrassingly parallelizable • For example: – Standard deviation of the mean of N samples (μ) – For i = 1 to r do • Randomly sample with replacement N times from the original sample -> bootstrap data i • Compute mean of the i-th bootstrap data -> μi – Estimate of Sd(μ) = Sd([μ1,…,μr]) – r is usually a large number, e.g. 200
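As a reference point before BLB, a minimal single-machine R sketch of this bootstrap for the standard deviation of the mean; N, r and the simulated data are illustrative choices, not from the slides.

set.seed(1)
x <- rnorm(10000)        # original sample of size N
N <- length(x)
r <- 200                 # number of bootstrap replicates

mu_boot <- replicate(r, {
  xb <- sample(x, size = N, replace = TRUE)  # resample N points with replacement
  mean(xb)                                   # estimator on the bootstrap sample
})

sd(mu_boot)              # bootstrap estimate of Sd(mu)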
• 38. Bootstrap for Big Data • Can have r nodes running in parallel, each sampling one bootstrap data set • However… – N can be very large – Data may not fit into memory – Collecting N samples with replacement on each node can be computationally expensive
• 39. M out of N Bootstrap (Bickel et al. 1997) • Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N • Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of the sample estimates • However… – Prior knowledge is required – Choice of M is critical to performance – Finding the optimal value of M needs more computation
• 40. Bag of Little Bootstraps (BLB) • Example: Standard deviation of the mean • Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b) • For each data set p = 1 to S do – For i = 1 to r do • Draw N samples with replacement from the data of size b • Compute the mean of the resampled data, μpi – Compute Sdp(μ) = Sd([μp1,…,μpr]) • Estimate of Sd(μ) = Avg([Sd1(μ),…, SdS(μ)])
• 41. Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from the size-N data – ξ is some function of θ, such as standard deviation, … • Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b) • For each data set p = 1 to S do – For i = 1 to r do • Draw N samples with replacement from the data of size b • Compute the estimate θpi on the resampled data – Compute ξp(θ) = ξ([θp1,…,θpr]) • Estimate of ξ(θ) = Avg([ξ1(θ),…, ξS(θ)])
• 42. Bag of Little Bootstraps (BLB) [same algorithm as the previous slide, annotated with where each step runs on Hadoop: the per-subset resampling and estimates on the Mappers/Reducers, the final averaging on the Gateway]
• 43. Why is BLB Efficient • Before: – N samples with replacement from size-N data is expensive when N is large • Now: – N samples with replacement from size-b data – b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ in [0.5, 1)) – Equivalent to: a multinomial sampler with dim = b – Storage = O(b), Computational complexity = O(b)
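A rough single-machine R sketch of BLB for Sd(mean), under the assumptions b = N^0.7, S = 5 and r = 50 used only for illustration; the multinomial draw plays the role of "N samples with replacement from the size-b subset" without ever materializing N rows.

set.seed(1)
x <- rnorm(1e5)                    # pretend this is the full size-N data
N <- length(x)
b <- floor(N^0.7); S <- 5; r <- 50

sd_p <- sapply(1:S, function(p) {          # in Map-Reduce, one subset per node
  xb <- sample(x, b, replace = FALSE)      # subset of size b
  mu_pi <- sapply(1:r, function(i) {
    w <- rmultinom(1, size = N, prob = rep(1/b, b))  # counts of each of the b points
    sum(w * xb) / N                        # weighted mean = mean of N resampled points
  })
  sd(mu_pi)                                # xi_p: Sd of the r replicate means
})
mean(sd_p)                                 # BLB estimate of Sd(mu)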
• 44. Simulation Experiment • 95% CI of Logistic Regression Coefficients • N = 20000, 10 explanatory variables • Relative Error = |Estimated CI width – True CI width| / True CI width • BLB-γ: BLB with b = N^γ • BOFN-γ: b out of N sampling with b = N^γ • BOOT: Naïve bootstrap
• 46. Real Data • 95% CI of Logistic Regression Coefficients • N = 6M, 3000 explanatory variables • Data size = 150GB, r = 50, s = 5, γ = 0.7
• 47. Summary of BLB • A new algorithm for bootstrapping on big data • Advantages – Fast and efficient – Easy to parallelize – Easy to understand and implement – Friendly to Hadoop; makes it routine to perform statistical calculations on Big data
• 48. Large Scale Logistic Regression
• 49. Logistic Regression • Binary response: Y • Covariates: X • Yi ~ Bernoulli(pi) • log(pi/(1−pi)) = Xi^T β ; β ~ MVN(0, (1/λ) I) • Widely used (research and applications)
• 50. Large Scale Logistic Regression • Binary response: Y – E.g., Click / Non-click on an ad on a webpage • Covariates: X – User covariates: • Age, gender, industry, education, job, job title, … – Item covariates: • Categories, keywords, topics, … – Context covariates: • Time, page type, position, … – 2-way interactions: • User covariates X item covariates • Context covariates X item covariates • …
• 51. Computational Challenge • Hundreds of millions/billions of observations • Hundreds of thousands/millions of covariates • Fitting such a logistic regression model on a single machine is not feasible • Model fitting is iterative, using methods like gradient descent, Newton's method, etc. – Multiple passes over the data
• 52. Recap on Optimization Methods • Problem: Find x to min F(x) • Iteration n: xn = xn-1 − bn-1 F'(xn-1) • bn-1 is the step size, which can change every iteration • Iterate until convergence • Conjugate gradient, LBFGS, Newton trust region, … are all of this kind
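For concreteness, a tiny R sketch of this kind of iteration: gradient descent with a fixed step size on a simple quadratic; the function, step size and tolerance are illustrative.

F_grad <- function(x) 2 * (x - 3)     # F(x) = (x - 3)^2, so F'(x) = 2(x - 3)

x <- 0; b <- 0.1                      # starting point and step size
for (n in 1:1000) {
  x_new <- x - b * F_grad(x)          # x_n = x_{n-1} - b * F'(x_{n-1})
  if (abs(x_new - x) < 1e-8) break    # stop when the iterates have converged
  x <- x_new
}
x                                      # should be close to the minimizer, 3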
• 53. Iterative Process with Hadoop Disk Mappers Disk Reducers Disk Mappers Disk Reducers Disk Mappers Disk Reducers
• 54. Limitations of Hadoop for fitting a big logistic regression • Iterative process is expensive and slow • Every iteration = a Map-Reduce job • I/O of mappers and reducers is all through disk • Plus: waiting time in the queue • Q: Can we find a fitting method that scales with Hadoop?
• 55. Large Scale Logistic Regression • Naïve: – Partition the data and run logistic regression for each partition – Take the mean of the learned coefficients – Problem: Not guaranteed to converge to the model from a single machine! • Alternating Direction Method of Multipliers (ADMM) – Boyd et al. 2011 – Set up constraints: each partition's coefficients = global consensus – Solve the optimization problem using Lagrange multipliers – Advantage: guaranteed to converge to the single-machine logistic regression on the entire data in a reasonable number of iterations
• 56. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Logistic Regression Logistic Regression Logistic Regression Consensus Computation Iteration 1
• 57. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Consensus Computation Logistic Regression Logistic Regression Logistic Regression Iteration 1
• 58. Large Scale Logistic Regression via ADMM BIG DATA Partition 1 Partition 2 Partition 3 Partition K Logistic Regression Logistic Regression Logistic Regression Logistic Regression Consensus Computation Iteration 2
• 60. Dual Ascent Method • Consider an equality-constrained convex optimization problem: minimize f(x) subject to Ax = b, with variable x in R^n, A in R^{m×n}, and f convex • Lagrangian for the problem: L(x, y) = f(x) + y^T (Ax − b) • Dual ascent iterates: x^(k+1) := argmin_x L(x, y^k); y^(k+1) := y^k + α^k (A x^(k+1) − b), where α^k > 0 is a step size and k is the iteration counter
• 61. Augmented Lagrangians • Bring robustness to the dual ascent method • Yield convergence without assumptions like strict convexity or finiteness of f • Augmented Lagrangian: Lρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2) ||Ax − b||^2, where ρ > 0 is the penalty parameter • The value of ρ influences the convergence rate
• 62. Alternating Direction Method of Multipliers (ADMM) • Problem: minimize f(x) + g(z) subject to Ax + Bz = c, with f and g convex • Augmented Lagrangian: Lρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||^2 • ADMM iterates: x^(k+1) := argmin_x Lρ(x, z^k, y^k); z^(k+1) := argmin_z Lρ(x^(k+1), z, y^k); y^(k+1) := y^k + ρ (A x^(k+1) + B z^(k+1) − c), where ρ > 0
• 63. Large Scale Logistic Regression via ADMM • Notation – (Xi, yi): data in the i-th partition – βi: coefficient vector for partition i – β: consensus coefficient vector – r(β): penalty component, such as ||β||^2 • Optimization problem: min Σ_{i=1..N} li(yi, Xi^T βi) + r(β), subject to βi = β
• 64. ADMM updates: local regressions (one per partition), shrinkage towards the current best global estimate, updated consensus
• 65. An example implementation • ADMM for logistic regression model fitting with an L2/L1 penalty • Each iteration of ADMM is a Map-Reduce job – Mapper: partition the data into K partitions – Reducer: for each partition, use liblinear/glmnet to fit an L1/L2 logistic regression – Gateway: consensus computation from the results of all reducers; the consensus is sent back to each reducer node
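A toy, single-process R sketch of the ADMM iteration described above, with simulated data and optim() standing in for the per-partition solver (the slides use liblinear/glmnet on Hadoop); K, rho, lambda and the convergence threshold are illustrative assumptions, and r(beta) is taken to be an L2 penalty.

set.seed(1)
n <- 4000; p <- 5; K <- 4                       # observations, features, partitions
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, -1, 0.5, 0, 2)
y <- rbinom(n, 1, plogis(X %*% beta_true))
part <- rep(1:K, length.out = n)

logloss <- function(b, X, y) -sum(y * (X %*% b) - log1p(exp(X %*% b)))

rho <- 1; lambda <- 0.1
beta_i <- matrix(0, K, p); u <- matrix(0, K, p); beta <- rep(0, p)

for (iter in 1:50) {
  # Local step (one per partition; on Hadoop this runs in the reducers)
  for (k in 1:K) {
    obj <- function(b) logloss(b, X[part == k, ], y[part == k]) +
      (rho / 2) * sum((b - beta + u[k, ])^2)
    beta_i[k, ] <- optim(beta_i[k, ], obj, method = "BFGS")$par
  }
  # Consensus step with r(beta) = lambda * ||beta||^2 (gateway)
  beta_new <- (rho * colSums(beta_i + u)) / (2 * lambda + K * rho)
  # Dual update
  u <- u + beta_i - matrix(beta_new, K, p, byrow = TRUE)
  if (sum((beta_new - beta)^2) < 1e-8) break
  beta <- beta_new
}
beta                                             # consensus coefficients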
• 66. KDD CUP 2010 Data • Bridge to Algebra 2008-2009 data at https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp • Binary response, 20M covariates • Only keep covariates with >= 10 occurrences => 2.2M covariates • Training data: 8,407,752 samples • Test data: 510,302 samples
• 67. Avg Training Log-likelihood vs Number of Iterations
• 68. Test AUC vs Number of Iterations
• 69. Better Convergence Can Be Achieved By • Better Initialization – Use results from the Naïve method to initialize the parameters • Adaptively change the step size (ρ) for each iteration based on the convergence status of the consensus
  • 71. Agenda •  Topic of Interest –  Recommender problems for dynamic, time- sensitive applications •  Content Optimization, Online Advertising, Movie recommendation, shopping,… •  Introduction •  Offline components –  Regression, Collaborative filtering (CF), … •  Online components + initialization –  Time-series, online/incremental methods, explore/ exploit (bandit) •  Evaluation methods + Multi-Objective •  Challenges
  • 72. Three components we will focus on •  Defining the problem –  Formulate objectives whose optimization achieves some long- term goals for the recommender system •  E.g. How to serve content to optimize audience reach and engagement, optimize some combination of engagement and revenue ? •  Modeling (to estimate some critical inputs) –  Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions •  E.g. Click rates, average time-spent on page, etc •  Could be explicit feedback like ratings •  Experimentation –  Create experiments to collect data proactively to improve models, helps in converging to the best choice(s) cheaply and rapidly. •  Explore and Exploit (continuous experimentation) •  DOE (testing hypotheses by avoiding bias inherent in data)
  • 73. Modern Recommendation Systems •  Goal –  Serve the right item to a user in a given context to optimize long-term business objectives •  A scientific discipline that involves –  Large scale Machine Learning & Statistics •  Offline Models (capture global & stable characteristics) •  Online Models (incorporates dynamic components) •  Explore/Exploit (active and adaptive experimentation) –  Multi-Objective Optimization •  Click-rates (CTR), Engagement, advertising revenue, diversity, etc –  Inferring user interest •  Constructing User Profiles –  Natural Language Processing to understand content •  Topics, “aboutness”, entities, follow-up of something, breaking news,…
  • 74. Some examples from content optimization •  Simple version –  I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module •  More advanced –  I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? •  Highly advanced –  There are multiple modules running on my webpage. How do I perform a simultaneous optimization?
• 75. Recommend applications Recommend search queries Recommend news article Recommend packages: Image Title, summary Links to other pages Pick 4 out of a pool of K K = 20 ~ 40 Dynamic Routes traffic to other pages
  • 76. Problems in this example •  Optimize CTR on multiple modules –  Today Module, Trending Now, Personal Assistant, News –  Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations. •  For any single module –  Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.
  • 77. Online Advertising Advertisers Ad Network Ads Page Recommend Best ad(s) User Publisher Response rates (click, conversion, ad-view) Bids Auction Click conversion Select argmax f(bid,response rates) ML /Statistical model Examples: Yahoo, Google, MSN, … Ad exchanges (RightMedia, DoubleClick, …)
  • 78. LinkedIn  Today:  Content  Module   Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
• 79. LinkedIn Ads: Match ads to users visiting LinkedIn
• 80. Right Media Ad Exchange: Unified Marketplace Match ads to page views on publisher sites Has ad impression to sell -- AUCTIONS Bids $0.50 Bids $0.75 via Network… … which becomes $0.45 bid Bids $0.65—WINS! AdSense Ad.com Bids $0.60
• 81. Recommender problems in general USER Item Inventory Articles, web page, ads, … Use an automated algorithm to select item(s) to show Get feedback (click, time spent,..) Refine the models Repeat (large number of times) Optimize metric(s) of interest (Total clicks, Total revenue,…) Example applications Search: Web, Vertical Online Advertising Content ….. Context: query, page, …
  • 82. •  Items: Articles, ads, modules, movies, users, updates, etc. •  Context: query keywords, pages, mobile, social media, etc. •  Metric to optimize (e.g., relevance score, CTR, revenue, engagement) –  Currently, most applications are single-objective –  Could be multi-objective optimization (maximize X subject to Y, Z,..) •  Properties of the item pool –  Size (e.g., all web pages vs. 40 stories) –  Quality of the pool (e.g., anything vs. editorially selected) –  Lifetime (e.g., mostly old items vs. mostly new items) Important Factors
  • 83. Factors affecting Solution (continued) •  Properties of the context –  Pull: Specified by explicit, user-driven query (e.g., keywords, a form) –  Push: Specified by implicit context (e.g., a page, a user, a session) •  Most applications are somewhere on continuum of pull and push •  Properties of the feedback on the matches made –  Types and semantics of feedback (e.g., click, vote) –  Latency (e.g., available in 5 minutes vs. 1 day) –  Volume (e.g., 100K per day vs. 300M per day) •  Constraints specifying legitimate matches –  e.g., business rules, diversity rules, editorial Voice –  Multiple objectives •  Available Metadata (e.g., link graph, various user/item attributes)
  • 84. Predicting User-Item Interactions (e.g. CTR) •  Myth: We have so much data on the web, if we can only process it the problem is solved –  Number of things to learn increases with sample size •  Rate of increase is not slow –  Dynamic nature of systems make things worse –  We want to learn things quickly and react fast •  Data is sparse in web recommender problems –  We lack enough data to learn all we want to learn and as quickly as we would like to learn –  Several Power laws interacting with each other •  E.g. User visits power law, items served power law –  Bivariate Zipf: Owen & Dyer, 2011
• 85. Can Machine Learning help? • Fortunately, there are group behaviors that generalize to individuals & they are relatively stable – E.g. Users in San Francisco tend to read more baseball news • Key issue: Estimating such groups – Coarse group: more stable but does not generalize that well – Granular group: less stable, with few individuals – Getting a good grouping structure is to hit the "sweet spot" • Another big advantage on the web – Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choice(s) • We don't need to learn all user-item interactions, only those that are good.
• 86. Predicting user-item interaction rates Offline (captures stable characteristics at coarse resolutions) (Logistic, Boosting,….) Feature construction Content: IR, clustering, taxonomy, entity,.. User profiles: clicks, views, social, community,.. Near Online (finer-resolution corrections) (item, user level) (quick updates) Explore/Exploit (adaptive sampling) (helps rapid convergence to best choices) Initialize
  • 87. Post-click: An example in Content Optimization Recommender          EDITORIAL               content Clicks on FP links influence downstream supply distribution    AD  SERVER                DISPLAY          ADVERTISING   Revenue   Downstream   engagement      (Time  spent)  
  • 88. Serving Content on Front Page: Click Shaping •  What do we want to optimize? •  Current: Maximize clicks (maximize downstream supply from FP) •  But consider the following –  Article 1: CTR=5%, utility per click = 5 –  Article 2: CTR=4.9%, utility per click=10 •  By promoting 2, we lose 1 click/100 visits, gain 5 utils •  If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? –  E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc)
• 89. High level picture http request Statistical Models updated in Batch mode: e.g. once every 30 mins Server Item Recommendation system: thousands of computations in sub-seconds User Interacts e.g. click, does nothing
• 90. High level overview: Item Recommendation System User Info Item Index Id, meta-data ML/Statistical Models Score Items P(Click), P(share), Semantic-relevance score,…. Rank Items: sort by score (CTR, bid*CTR,..), combine scores using multi-objective optimization, threshold on some scores,…. User-item interaction Data: batch process Updated in batch: Activity, profile Pre-filter: SPAM, editorial,.. Feature extraction: NLP, clustering,..
• 91. ML/Statistical models for scoring [Chart: number of items scored by ML (roughly 100, 1000, 100K, 1M, 100M) against traffic volume, with available scoring time ranging from a few hours to several days; example applications placed on the chart: LinkedIn Today, Yahoo! Front Page, Right Media Ad exchange, LinkedIn Ads]
• 92. Summary of deployments • Yahoo! Front page Today Module (2008-2011): 300% improvement in click-through rates – Similar algorithms delivered via a self-serve platform, adopted by several Yahoo! properties (2011): significant improvement in engagement across the Yahoo! Network • Fully deployed on LinkedIn Today module (2012): significant improvement in click-through rates (numbers not revealed due to reasons of confidentiality) • Yahoo! RightMedia exchange (2012): fully deployed algorithms to estimate response rates (CTR, conversion rates). Significant improvement in revenue (numbers not revealed due to reasons of confidentiality) • LinkedIn self-serve ads (2012-2013): fully deployed • LinkedIn News Feed (2013-2014): fully deployed • Several others in progress….
• 93. Broad Themes • Curse of dimensionality – Large number of observations (rows), large number of potential features (columns) – Use domain knowledge and machine learning to reduce the "effective" dimension (constraints on parameters reduce degrees of freedom) • I will give examples as we move along • We often assume our job is to analyze "Big Data", but we often have control over what data to collect through clever experimentation – This can fundamentally change solutions • Think of computation and models together for Big data • Optimization: what we are trying to optimize is often complex; models must work in harmony with the optimization – Pareto optimality with competing objectives
• 94. Statistical Problem • Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest • Examples of utility functions – Click-rates (CTR) – Share-rates (CTR * [Share|Click]) – Revenue per page-view = CTR * bid (more complex due to second price auction) • CTR is a fundamental measure that opens the door to a more principled approach to rank items • Converge rapidly to maximum utility items – Sequential decision making process (explore/exploit)
• 95. User i with user features (e.g., industry, behavioral features, demographic features, ……) visits; the algorithm selects item j from a set of candidates; (i, j): response yij (click or not). Which item should we select? • The item with highest predicted CTR (Exploit) • An item for which we need data to predict its CTR (Explore) LinkedIn Today, Yahoo! Today Module: Choose Items to maximize CTR. This is an "Explore/Exploit" Problem
  • 96. The Explore/Exploit Problem (to maximize CTR) •  Problem definition: Pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items •  Easy!? Pick the items having the highest click-through rates (CTRs) •  But … –  The system is highly dynamic: •  Items come and go with short lifetimes •  CTR of each item may change over time –  How much traffic should be allocated to explore new items to achieve optimal performance ? •  Too little → Unreliable CTR estimates due to “starvation” •  Too much → Little traffic to exploit the high CTR items
• 97. Y! Front Page Application • Simplify: Maximize CTR on the first slot (F1) • Item Pool – Editorially selected for high quality and brand image – Few articles in the pool, but the item pool is dynamic
• 98. CTR Curves of Items on LinkedIn Today [Figure: CTR of several items over time]
• 99. Impact of repeat item views on a given user • Same user is shown an item multiple times (despite not clicking)
• 100. Simple algorithm to estimate the most popular item with a small but dynamic item pool • Simple Explore/Exploit scheme – ε% explore: with a small probability (e.g. 5%), choose an item at random from the pool – (100−ε)% exploit: with large probability (e.g. 95%), choose the highest scoring CTR item • Temporal Smoothing – Item CTRs change over time; give more weight to recent data when estimating item CTRs • Kalman filter, moving average • Discount item score with repeat views – CTR(item) for a given user drops with repeat views by some "discount" factor (estimated from data) • Segmented most popular – Perform separate most-popular recommendation for each user segment
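A small R simulation of the ε% explore / (100−ε)% exploit scheme above, with made-up true CTRs and a fixed ε; the temporal smoothing and repeat-view discounting from the slide are omitted to keep the sketch short.

set.seed(1)
true_ctr <- c(0.02, 0.035, 0.01, 0.05)   # unknown in practice; used only to simulate clicks
K <- length(true_ctr)
eps <- 0.05
clicks <- rep(0, K); views <- rep(0, K)

for (t in 1:50000) {
  if (runif(1) < eps) {
    item <- sample(K, 1)                              # explore: random item from the pool
  } else {
    est <- (clicks + 1) / (views + 2)                 # smoothed CTR estimates
    item <- which.max(est)                            # exploit: current best item
  }
  views[item] <- views[item] + 1
  clicks[item] <- clicks[item] + rbinom(1, 1, true_ctr[item])
}
rbind(views = views, est_ctr = round(clicks / views, 4))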
• 101. Time series Model: Kalman filter • Dynamic Gamma-Poisson: click-rate evolves over time in a multiplicative fashion • Estimated click-rate distribution at time t+1 – Prior mean: – Prior variance: High CTR items more adaptive
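The exact prior-mean and prior-variance formulas are on the (image-only) slide, so the following R sketch is only an assumed form of a common dynamic Gamma-Poisson tracker: keep a Gamma(alpha, gamma) state for the click-rate, update it with each interval's clicks and views, and discount the state by a factor delta < 1 so the estimate can adapt over time.

# Assumed dynamic Gamma-Poisson tracker (illustrative, not the slide's exact formulas)
track_ctr <- function(clicks, views, alpha0 = 1, gamma0 = 1000, delta = 0.95) {
  alpha <- alpha0; gamma <- gamma0
  est <- numeric(length(clicks))
  for (t in seq_along(clicks)) {
    alpha <- delta * alpha + clicks[t]   # discount old evidence, add new clicks
    gamma <- delta * gamma + views[t]    # discount old evidence, add new views
    est[t] <- alpha / gamma              # posterior mean click-rate after interval t
  }
  est
}

# Example: an item whose CTR drifts upward over 20 intervals
views <- rep(5000, 20)
true_ctr <- seq(0.01, 0.03, length.out = 20)
clicks <- rbinom(20, views, true_ctr)
round(track_ctr(clicks, views), 4)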
• 102. More economical exploration? Better bandit solutions • Consider a two-armed problem with unknown payoff probabilities p1 > p2. The gambler has 1000 plays; what is the best way to experiment? (to maximize total expected reward) This is called the "multi-armed bandit" problem and has been studied for a long time. Optimal solution: play the arm that has the maximum potential of being good (optimism in the face of uncertainty)
• 103. Item Recommendation: Bandits? • Two Items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000 – Greedy: show Item 2 to all; not a good idea – Item 1's CTR estimate is noisy; the item could potentially be better • Invest in Item 1 for better overall performance on average – Exploit what is known to be good, explore what is potentially good [Figure: probability density of CTR for Item 1 and Item 2]
• 104. Next few hours [table: Most Popular Recommendation / Personalized Recommendation] Offline Models: / Collaborative filtering (cold-start problem); Online Models: Time-series models / Incremental CF, online regression; Intelligent Initialization: Prior estimation / Prior estimation, dimension reduction; Explore/Exploit: Multi-armed bandits / Bandits with covariates
  • 105. Offline Components: Collaborative Filtering in Cold-start Situations
• 106. Problem Item j with item features xj (keywords, content categories, ...) User i with user features xi (demographics, browse history, search history, …) visits; the algorithm selects; (i, j): response yij (explicit rating, implicit click/no-click) Predict the unobserved entries based on features and the observed entries
  • 107. Model Choices •  Feature-based (or content-based) approach –  Use features to predict response •  (regression, Bayes Net, mixture models, …) –  Limitation: need predictive features •  Bias often high, does not capture signals at granular levels •  Collaborative filtering (CF aka Memory based) –  Make recommendation based on past user-item interaction •  User-user, item-item, matrix factorization, … •  See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc. –  Better performance for old users and old items –  Does not naturally handle new users and new items (cold- start)
  • 108. Collaborative Filtering (Memory based methods) User-User Similarity Item-Item similarities, incorporating both Estimating Similarities Pearson’s correlation Optimization based (Koren et al)
• 109. How to Deal with the Cold-Start Problem • Heuristic-based approaches – Linear combination of regression and CF models – Filterbot • Add user features as pseudo users and do collaborative filtering – Hybrid approaches – Use content-based to fill up entries, then use CF • Matrix Factorization – Good performance on Netflix (Koren, 2009) • Model-based approaches – Bilinear random-effects model (probabilistic matrix factorization) • Good on Netflix data [Ruslan et al ICML, 2009] – Add feature-based regression to matrix factorization • (Agarwal and Chen, 2009) – Add topic discovery (from textual items) to matrix factorization • (Agarwal and Chen, 2009; Chun and Blei, 2011)
• 110. Per-item regression models • When tracking users by cookies, the distribution of visit patterns can get extremely skewed – Majority of cookies have 1-2 visits • Per-item models (regression) based on user covariates are attractive in such cases
• 111. Several per-item regressions: Multi-task learning Low dimension (5-10), B estimated from retrospective data • Agarwal, Chen and Elango, KDD, 2010 Affinity to old items
  • 112. Per-user, per-item models via bilinear random-effects model
  • 113. Motivation •  Data measuring k-way interactions pervasive –  Consider k = 2 for all our discussions •  E.g. User-Movie, User-content, User-Publisher-Ads,…. –  Power law on both user and item degrees •  Classical Techniques –  Approximate matrix through a singular value decomposition (SVD) •  After adjusting for marginal effects (user pop, movie pop,..) –  Does not work •  Matrix highly incomplete, severe over-fitting –  Key issue •  Regularization of eigenvectors (factors) to avoid overfitting
  • 114. Early work on complete matrices •  Tukey’s 1-df model (1956) –  Rank 1 approximation of small nearly complete matrix •  Criss-cross regression (Gabriel, 1978) •  Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) •  Modern day recommender problems –  Highly incomplete, large, noisy.
• 116. Factorization – Brief Overview • Latent user factors: (αi, ui = (ui1,…,uin)) • Latent movie factors: (βj, vj = (vj1,…,vjn)) • Model: E(yij) = µ + αi + βj + ui' B vj • (Nn + Mm) parameters • Key technical issue: will overfit for moderate values of n, m; hence regularization of the interaction term
  • 117. Latent Factor Models: Different Aspects •  Matrix Factorization – Factors in Euclidean space – Factors on the simplex •  Incorporating features and ratings simultaneously •  Online updates
• 118. Maximum Margin Matrix Factorization (MMMF) • Complete the matrix by minimizing loss (hinge, squared-error) on observed entries subject to constraints on the trace norm – Srebro, Rennie, Jaakkola (NIPS 2004) – Convex, semi-definite programming (expensive, not scalable) • Fast MMMF (Rennie & Srebro, ICML, 2005) – Constrain the Frobenius norm of the left and right eigenvector matrices; not convex, but becomes scalable • Other variation: Ensemble MMMF (DeCoste, ICML 2005) – Ensembles of partially trained MMMF (some improvements)
• 119. Matrix Factorization for Netflix prize data • Minimize the objective function: Σ_{ij in obs} (rij − ui^T vj)^2 + λ (Σ_i ||ui||^2 + Σ_j ||vj||^2) • Simon Funk: Stochastic Gradient Descent • Koren et al (KDD 2007): Alternating Least Squares – They moved to SGD later in the competition
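A compact R sketch of SGD on this objective, with a made-up ratings triplet list and illustrative learning rate, rank and λ; each pass over the data updates ui and vj from the gradient of the squared residual plus the L2 terms.

set.seed(1)
n_users <- 50; n_items <- 40; rank <- 5
ratings <- data.frame(i = sample(n_users, 2000, TRUE),
                      j = sample(n_items, 2000, TRUE),
                      r = sample(1:5, 2000, TRUE))

U <- matrix(rnorm(n_users * rank, sd = 0.1), n_users, rank)
V <- matrix(rnorm(n_items * rank, sd = 0.1), n_items, rank)
eta <- 0.01; lambda <- 0.05

for (epoch in 1:20) {
  for (t in sample(nrow(ratings))) {            # shuffle and visit each observed rating
    i <- ratings$i[t]; j <- ratings$j[t]
    err <- ratings$r[t] - sum(U[i, ] * V[j, ])  # residual r_ij - u_i' v_j
    u_old <- U[i, ]
    U[i, ] <- U[i, ] + eta * (err * V[j, ] - lambda * U[i, ])
    V[j, ] <- V[j, ] + eta * (err * u_old - lambda * V[j, ])
  }
}
# RMSE on the (training) ratings after fitting
pred <- rowSums(U[ratings$i, ] * V[ratings$j, ])
sqrt(mean((ratings$r - pred)^2))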
• 120. Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS) • Model: rij ~ N(ui^T vj, σ^2), ui ~ MVN(0, au I), vj ~ MVN(0, av I) • Optimization is through Iterated Conditional Modes • Other variations, like constraining the mean through a sigmoid, or using "who-rated-whom" • Combining with Boltzmann Machines also improved performance
• 121. Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008) • Fully Bayesian treatment using an MCMC approach – Significant improvement • Interpretation as a fully Bayesian hierarchical model shows why that is the case – Failing to incorporate uncertainty leads to bias in estimates (e.g. in the variance component au) – Multi-modal posterior, MCMC helps in converging to a better one • MCEM also more resistant to over-fitting
• 122. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) • Specify rank probabilistically (automatic rank selection): yij ~ N(Σ_{k=1..r} zk uik vjk, σ^2), zk ~ Ber(πk), πk ~ Beta(a/r, b(r−1)/r) • Marginally zk ~ Ber(a/(a + b(r−1))), so E(#Factors) = ra/(a + b(r−1))
  • 123. How to incorporate features: Deal with both warm start and cold-start •  Models to predict ratings for new pairs –  Warm-start: (user, movie) present in the training data with large sample size –  Cold-start: At least one of (user, movie) new or has small sample size •  Rough definition, warm-start/cold-start is a continuum. •  Challenges –  Highly incomplete (user, movie) matrix –  Heavy tailed degree distributions for users/movies •  Large fraction of ratings from small fraction of users/ movies –  Handling both warm-start and cold-start effectively in the presence of predictive features
  • 124. Possible approaches •  Large scale regression based on covariates –  Does not provide good estimates for heavy users/movies –  Large number of predictors to estimate interactions •  Collaborative filtering –  Neighborhood based –  Factorization •  Good for warm-start; cold-start dealt with separately •  Single model that handles cold-start and warm-start –  Heavy users/movies → User/movie specific model –  Light users/movies → fallback on regression model –  Smooth fallback mechanism for good performance
  • 125. Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model
  • 126. Regression-based Factorization Model (RLFM) •  Main idea: Flexible prior, predict factors through regressions •  Seamlessly handles cold-start and warm- start •  Modified state equation to incorporate covariates
• 127. RLFM: Model • Rating yij that user i gives item j: yij ~ N(µij, σ^2) (Gaussian model); yij ~ Bernoulli(µij) (logistic model, for binary ratings); yij ~ Poisson(µij Nij) (Poisson model, for counts) • t(µij) = xij^T b + αi + βj + ui^T vj • Bias of user i: αi = g0^T xi + εiα, εiα ~ N(0, σα^2) • Popularity of item j: βj = d0^T xj + εjβ, εjβ ~ N(0, σβ^2) • Factors of user i: ui = G xi + εiu, εiu ~ N(0, σu^2 I) • Factors of item j: vj = D xj + εjv, εjv ~ N(0, σv^2 I) • Could use other classes of regression models
  • 129. Advantages of RLFM •  Better regularization of factors –  Covariates “shrink” towards a better centroid •  Cold-start: Fallback regression model (FeatureOnly)
  • 130. RLFM: Illustration of Shrinkage Plot the first factor value for each user (fitted using Yahoo! FP data)
  • 131. Model fitting: EM for our class of models
• 132. The parameters for RLFM • Latent parameters: Δ = ({αi}, {βj}, {ui}, {vj}) • Hyper-parameters: Θ = (b, G, D, Au = au I, Av = av I)
• 135. Computing the E-step • Often hard to compute in closed form • Stochastic EM (Markov Chain EM; MCEM) – Compute the expectation by drawing samples from the posterior – Effective for multi-modal posteriors but more expensive • Iterated Conditional Modes algorithm (ICM) – Faster but biased hyper-parameter estimates
  • 136. Monte Carlo E-step •  Through a vanilla Gibbs sampler (conditionals closed form) •  Other conditionals also Gaussian and closed form •  Conditionals of users (movies) sampled simultaneously •  Small number of samples in early iterations, large numbers in later iterations
  • 137. M-step (Why MCEM is better than ICM) •  Update G, optimize •  Update Au=au I Ignored by ICM, underestimates factor variability Factors over-shrunk, posterior not explored well
  • 138. Experiment 1: Better regularization •  MovieLens-100K, avg RMSE using pre-specified splits •  ZeroMean, RLFM and FeatureOnly (no cold-start issues) •  Covariates: –  Users : age, gender, zipcode (1st digit only) –  Movies: genres
  • 139. Experiment 2: Better handling of Cold-start •  MovieLens-1M; EachMovie •  Training-test split based on timestamp •  Same covariates as in Experiment 1.
  • 140. Experiment 4: Predicting click-rate on articles •  Goal: Predict click-rate on articles for a user on F1 position •  Article lifetimes short, dynamic updates important •  User covariates: –  Age, Gender, Geo, Browse behavior •  Article covariates –  Content Category, keywords •  2M ratings, 30K users, 4.5 K articles
  • 141. Results on Y! FP data
  • 142. Some other related approaches •  Stern, Herbrich and Graepel, WWW, 2009 –  Similar to RLFM, different parametrization and expectation propagation used to fit the models •  Porteus, Asuncion and Welling, AAAI, 2011 –  Non-parametric approach using a Dirichlet process •  Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 –  Regression + random effects per user regularized through a Graphical Lasso
  • 143. Add Topic Discovery into Matrix Factorization fLDA: Matrix Factorization through Latent Dirichlet Allocation
• 144. fLDA: Introduction • Model the rating yij that user i gives to item j as the user's affinity to the topics that the item has: yij = … + Σ_k sik zjk, where sik is user i's affinity to topic k, and zjk is Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j – Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data – fLDA can be thought of as a "multi-task learning" version of the supervised LDA model [Blei'07] for cold-start recommendation • Old items: zjk's are item latent factors learnt from data with the LDA prior • New items: zjk's are predicted based on the bag of words in the items
  • 145. Φ11,  …,  Φ1W                …   Φk1,  …,  ΦkW                …   ΦK1,  …,  ΦKW   Topic  1   Topic  k   Topic  K   LDA Topic Modeling (1) •  LDA is effective for unsupervised topic discovery [Blei’03] –  It models the generating process of a corpus of items (articles) –  For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η) –  For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ) –  For each word, say the nth word, in item j, •  Draw a topic zjn for that word from θj = [θj1, …, θjK] •  Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn Item j Topic distribution: [θj1, …, θjK] Words: wj1, …, wjn, … Per-word topic: zj1, …, zjn, … Assume zjn = topic k Observed
• 146. LDA Topic Modeling (2) • Model training: – Estimate the prior parameters and the posterior topic×word distribution Φ based on a training corpus of items – EM + Gibbs sampling is a popular method • Inference for new items – Compute the item topic distribution based on the prior parameters and Φ estimated in the training phase • Supervised LDA [Blei'07] – Predict a target value for each item based on supervised LDA topics: yj = Σ_k sk zjk, where sk is the regression weight for topic k and zjk is Pr(item j has topic k), estimated by averaging the topic of each word in item j – vs. fLDA: yij = … + Σ_k sik zjk, i.e. one regression per user, with the same set of topics across different regressions
  • 147. fLDA: Model •  Rating that user i gives item j: yij ~ N(µij, σ²) (Gaussian model); yij ~ Bernoulli(µij) (Logistic model, for binary ratings); yij ~ Poisson(µij Nij) (Poisson model, for counts) •  t(µij) = xij' b + αi + βj + Σk sik zjk •  Bias of user i: αi = g0' xi + εiα, εiα ~ N(0, σα²) •  Popularity of item j: βj = d0' xj + εjβ, εjβ ~ N(0, σβ²) •  Topic affinity of user i: si = H xi + εis, εis ~ N(0, σs² I) •  Pr(item j has topic k): zjk = Σn 1(zjn = k) / (#words in item j), where zjn is the LDA topic of the nth word in item j •  Observed words: wjn ~ LDA(λ, η, zjn), the nth word in item j
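To make the predictor concrete, here is a small sketch of how a score would be computed from fitted pieces under the Gaussian link (t = identity). The argument names are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def predicted_score(x_ij, b, alpha_i, beta_j, s_i, z_counts_j):
    """Sketch of the fLDA predictor: t(mu_ij) = x_ij'b + alpha_i + beta_j + sum_k s_ik * zbar_jk.
    z_counts_j[k] = number of words in item j currently assigned to topic k, so
    zbar_jk is the fraction of item j's words with topic k (the z_jk on the slide)."""
    zbar_j = z_counts_j / z_counts_j.sum()
    return x_ij @ b + alpha_i + beta_j + s_i @ zbar_j

# Toy usage; with the Gaussian link the score is the predicted rating directly,
# with the logistic link it would be passed through a sigmoid.
score = predicted_score(x_ij=np.array([1.0, 0.5]), b=np.array([0.2, -0.1]),
                        alpha_i=0.3, beta_j=-0.1,
                        s_i=np.array([0.4, 0.0, -0.2]),
                        z_counts_j=np.array([5.0, 3.0, 2.0]))
```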
  • 148. Model Fitting •  Given: –  Features X = {xi, xj, xij} –  Observed ratings y = {yij} and words w = {wjn} •  Estimate: –  Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η] •  Regression weights and prior parameters –  Latent factors: Δ = {αi, βj, si} and z = {zjn} •  User factors, item factors and per-word topic assignment •  Empirical Bayes approach: –  Maximum likelihood estimate of the parameters: Θ̂ = argmaxΘ Pr[y, w | Θ] = argmaxΘ ∫ Pr[y, w, Δ, z | Θ] dΔ dz –  The posterior distribution of the factors: Pr[Δ, z | y, Θ̂]
  • 149. The EM Algorithm •  Iterate through the E and M steps until convergence – Let Θ̂(n) be the current estimate – E-step: Compute f(Θ) = E{Δ,z | y, w, Θ̂(n)}[ log Pr(y, w, Δ, z | Θ) ] •  The expectation is not in closed form •  We draw Gibbs samples and compute the Monte Carlo mean – M-step: Find Θ̂(n+1) = argmaxΘ f(Θ) •  It consists of solving a number of regression and optimization problems
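The E/M structure is easier to see on a toy model. The sketch below runs Monte Carlo EM on a simple one-way random-effects model (not fLDA itself): posterior draws of the latent group means play the role of the Gibbs samples in the E-step, and closed-form updates stand in for the regression/optimization problems in the M-step. All numbers are made up.

```python
import numpy as np

# Toy MCEM: y_ij ~ N(theta_i, sigma^2), theta_i ~ N(mu, tau^2); estimate (mu, tau^2, sigma^2).
rng = np.random.default_rng(0)
n_groups, n_per = 50, 20
true_theta = rng.normal(2.0, 1.0, n_groups)
y = true_theta[:, None] + rng.normal(0, 0.5, (n_groups, n_per))

mu, tau2, sigma2 = 0.0, 1.0, 1.0
for _ in range(50):
    # E-step: the posterior of theta_i is Gaussian here; draw Monte Carlo samples from it
    post_var = 1.0 / (n_per / sigma2 + 1.0 / tau2)
    post_mean = post_var * (y.sum(axis=1) / sigma2 + mu / tau2)
    draws = rng.normal(post_mean, np.sqrt(post_var), size=(200, n_groups))
    # M-step: maximize the Monte Carlo estimate of the expected complete-data log-likelihood
    mu = draws.mean()
    tau2 = ((draws - mu) ** 2).mean()
    sigma2 = ((y[None, :, :] - draws[:, :, None]) ** 2).mean()

print(mu, tau2, sigma2)
```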
  • 150. Supervised Topic Assignment •  Gibbs sampling of zjn, the topic of the nth word in item j: Pr(zjn = k | Rest) ∝ (Zjk¬jn + λ) · (Zk,wjn¬jn + η) / (Zk¬jn + Wη) · Π{i rated j} f(yij | zjn = k) •  The first part (topic and topic-word counts, excluding the current word) is the same as in unsupervised LDA •  The product is the likelihood of the observed ratings by users who rated item j when zjn is set to topic k, i.e., the probability of observing each yij given the model
  • 151. fLDA: Experimental Results (Movie) •  Task: Predict the rating that a user would give a movie •  Training/test split: –  Sort observations by time –  First 75% → Training data –  Last 25% → Test data •  Item warm-start scenario –  Only 2% new items in test data
Model          Test RMSE
RLFM           0.9363
fLDA           0.9381
Factor-Only    0.9422
FilterBot      0.9517
unsup-LDA      0.9520
MostPopular    0.9726
Feature-Only   1.0906
Constant       1.1190
fLDA is as strong as the best method; it does not reduce performance in warm-start scenarios
  • 152. fLDA: Experimental Results (Yahoo! Buzz) •  Task: Predict whether a user would buzz-up an article •  Severe item cold-start –  All items are new in test data Data Statistics 1.2M observations 4K users 10K articles fLDA significantly outperforms other models
  • 153. Experimental Results: Buzzing Topics (topic label ← top terms after stemming)
CIA interrogation: bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
Swine flu: mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
NFL games: NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
Gay marriage: court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
Sarah Palin: palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
American idol: idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
Recession: economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
North Korea issues: north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia
3/4 of the topics are interpretable; 1/2 are similar to unsupervised topics
  • 154. fLDA Summary •  fLDA is a useful model for cold-start item recommendation •  It also provides interpretable recommendations for users –  User’s preference to interpretable LDA topics •  Future directions: –  Investigate Gibbs sampling chains and the convergence properties of the EM algorithm –  Apply fLDA to other multi-task prediction problems •  fLDA can be used as a tool to generate supervised features (topics) from text data
  • 155. Summary •  Regularizing factors through covariates is effective •  A regression-based factor model that regularizes better and handles both cold-start and warm-start seamlessly in a single framework is attractive •  The fitting method is scalable: Gibbs sampling for users and movies can be done in parallel, and the regressions in the M-step can be done with any off-the-shelf scalable linear regression routine •  Distributed computing on Hadoop: fit multiple models and average across partitions (more later)
  • 156. Online Components: Online Models, Intelligent Initialization, Explore / Exploit
  • 157. Why Online Components? •  Cold start –  New items or new users come to the system –  How to obtain data for new items/users (explore/exploit) –  Once data becomes available, how to quickly update the model •  Periodic rebuild (e.g., daily): Expensive •  Continuous online update (e.g., every minute): Cheap •  Concept drift –  Item popularity, user interest, mood, and user-to-item affinity may change over time –  How to track the most recent behavior •  Down-weight old data –  How to model temporal patterns for better prediction •  … may not need to be online if the patterns are stationary
  • 158. Big Picture
                                                  Most Popular Recommendation   Personalized Recommendation
Offline Models                                                                  Collaborative filtering (cold-start problem)
Online Models (real systems are dynamic)          Time-series models            Incremental CF, online regression
Intelligent Initialization (do not start cold)    Prior estimation              Prior estimation, dimension reduction
Explore/Exploit (actively acquire data)           Multi-armed bandits           Bandits with covariates
Extension: Segmented Most Popular Recommendation
  • 159. Online Components for Most Popular Recommendation: online models, intelligent initialization & explore/exploit
  • 160. Most popular recommendation: Outline •  Most popular recommendation (no personalization, all users see the same thing) –  Time-series models (online models) –  Prior estimation (initialization) –  Multi-armed bandits (explore/exploit) –  Sometimes hard to beat!! •  Segmented most popular recommendation –  Create user segments/clusters based on user features –  Do most popular recommendation for each segment
  • 161. Most Popular Recommendation •  Problem definition: Pick k items (articles) from a pool of N to maximize the total number of clicks on the picked items •  Easy!? Pick the items having the highest click-through rates (CTRs) •  But … –  The system is highly dynamic: •  Items come and go with short lifetimes •  CTR of each item changes over time –  How much traffic should be allocated to explore new items to achieve optimal performance •  Too little → Unreliable CTR estimates •  Too much → Little traffic to exploit the high CTR items
  • 162. CTR Curves for Two Days on Yahoo! Front Page •  Each curve is the CTR of an item in the Today Module on www.yahoo.com over time •  Traffic obtained from a controlled randomized experiment (no confounding) •  Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
  • 163. For Simplicity, Assume … •  Pick only one item for each user visit –  Multi-slot optimization later •  No user segmentation, no personalization (discussion later) •  The pool of candidate items is predetermined and is relatively small (≤ 1000) –  E.g., selected by human editors or by a first-phase filtering method –  Ideally, there should be a feedback loop –  Large item pool problem later •  Effects like user-fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)
  • 164. Online Models •  How to track the changing CTR of an item •  Data: for each item, at time t, we observe –  Number of times the item nt was displayed (i.e., #views) –  Number of clicks ct on the item •  Problem Definition: Given c1, n1, …, ct, nt, predict the CTR (click-through rate) pt+1 at time t+1 •  Potential solutions: –  Observed CTR at t: ct / nt → highly unstable (nt is usually small) –  Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very slowly –  Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable •  But, no estimation of Var[pt+1] (useful for explore/exploit)
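A small sketch of the three simple CTR trackers compared above (observed, cumulative, moving window); the class and the window size are illustrative, not from the tutorial.

```python
from collections import deque

class CTRTrackers:
    """Illustrative sketch of simple CTR estimates from per-interval (clicks, views) counts."""
    def __init__(self, window=24):
        self.window = deque(maxlen=window)   # keeps only the last K (clicks, views) pairs
        self.total_clicks = 0
        self.total_views = 0

    def update(self, clicks, views):
        self.window.append((clicks, views))
        self.total_clicks += clicks
        self.total_views += views

    def observed_ctr(self):                  # c_t / n_t: highly unstable when n_t is small
        if not self.window:
            return 0.0
        c, n = self.window[-1]
        return c / n if n else 0.0

    def cumulative_ctr(self):                # reacts to changes very slowly
        return self.total_clicks / self.total_views if self.total_views else 0.0

    def window_ctr(self):                    # moving-window estimate: a reasonable compromise
        c = sum(ci for ci, _ in self.window)
        n = sum(ni for _, ni in self.window)
        return c / n if n else 0.0
```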
  • 165. Online Models: Dynamic Gamma-Poisson •  Model-based approach –  (ct | nt, pt) ~ Poisson(nt pt) –  pt = pt-1 εt, where εt ~ Gamma(mean=1, var=η) –  Model parameters: •  p1 ~ Gamma(mean=µ0, var=σ0²) is the offline CTR estimate •  η specifies how dynamic/smooth the CTR is over time –  Posterior distribution (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?,?) •  Solve this recursively (online update rule) •  Notation: pt = CTR at time t; at each time t we show the item nt times and receive ct clicks (State-space diagram: p1 ← (µ0, σ0²) → p2 → … with observations (n1, c1), (n2, c2), … and evolution parameter η)
  • 166. Online Models: Derivation •  Let γt = µt / σt² (effective sample size) •  Estimated CTR distribution at time t: (pt | c1, n1, …, ct-1, nt-1) ~ Gamma(mean=µt, var=σt²) •  After observing (ct, nt): (pt | c1, n1, …, ct, nt) ~ Gamma(mean=µt|t, var=σ²t|t), where γt|t = γt + nt (effective sample size), µt|t = (γt µt + ct) / γt|t, and σ²t|t = µt|t / γt|t •  Estimated CTR distribution at time t+1: (pt+1 | c1, n1, …, ct, nt) ~ Gamma(mean=µt+1, var=σ²t+1), where µt+1 = µt|t and σ²t+1 = σ²t|t + η(µ²t|t + σ²t|t) •  High-CTR items are more adaptive
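The recursion above translates directly into a one-function online update. The sketch below is illustrative (parameter values are made up), assuming the mean/variance parameterization of the Gamma used on the slide.

```python
def gamma_poisson_update(mu, sigma2, clicks, views, eta):
    """One step of the dynamic Gamma-Poisson update from the derivation above.

    (mu, sigma2): current Gamma prior on the CTR, parameterized by mean/variance.
    eta: variance of the multiplicative Gamma(mean=1, var=eta) evolution noise.
    """
    gamma = mu / sigma2                       # effective sample size
    gamma_post = gamma + views
    mu_post = (gamma * mu + clicks) / gamma_post
    sigma2_post = mu_post / gamma_post        # posterior at time t
    # Evolve to time t+1: the multiplicative noise inflates the variance
    mu_next = mu_post
    sigma2_next = sigma2_post + eta * (mu_post ** 2 + sigma2_post)
    return mu_next, sigma2_next

# Example: start from an offline estimate and fold in one interval of data (made-up numbers)
mu, sigma2 = 0.02, 1e-4
mu, sigma2 = gamma_poisson_update(mu, sigma2, clicks=12, views=500, eta=0.01)
```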
  • 167. Tracking behavior of Gamma- Poisson model •  Low click rate articles – More temporal smoothing
  • 168. Intelligent Initialization: Prior Estimation •  Prior CTR distribution: Gamma(mean=µ0, var=σ0²) –  N historical items: •  ni = #views of item i in its first time interval •  ci = #clicks on item i in its first time interval –  Model •  ci ~ Poisson(ni pi) and pi ~ Gamma(µ0, σ0²) ⇒ ci ~ NegBinomial(µ0, σ0², ni) –  Maximum likelihood estimate (MLE) of (µ0, σ0²): argmax{µ0, σ0²} −N log Γ(µ0²/σ0²) + N (µ0²/σ0²) log(µ0/σ0²) + Σi [ log Γ(ci + µ0²/σ0²) − (ci + µ0²/σ0²) log(ni + µ0/σ0²) ] •  Better prior: Cluster items and find the MLE for each cluster –  Agarwal & Chen, 2011 (SIGMOD)
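Under the NegBinomial marginal above, the MLE of (µ0, σ0²) can be obtained with a generic optimizer. A minimal sketch, assuming the shape/rate reparameterization a = µ0²/σ0², b = µ0/σ0² and made-up data; constants that do not depend on the parameters are dropped.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def neg_loglik(params, clicks, views):
    """Negative NegBinomial marginal log-likelihood in (mu0, sigma0^2), up to constants."""
    mu0, sig2 = np.exp(params)               # optimize on the log scale to keep both positive
    a, b = mu0**2 / sig2, mu0 / sig2         # Gamma shape/rate implied by (mean, variance)
    ll = (gammaln(clicks + a) - gammaln(a)
          + a * np.log(b) - (clicks + a) * np.log(views + b)).sum()
    return -ll

# clicks[i], views[i]: item i's clicks and views in its first time interval (made-up data)
clicks = np.array([3, 10, 0, 5, 2])
views = np.array([200, 450, 120, 300, 180])
res = minimize(neg_loglik, x0=np.log([0.02, 1e-4]), args=(clicks, views))
mu0_hat, sigma0sq_hat = np.exp(res.x)
```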
  • 169. Explore/Exploit: Problem Definition •  Items 1, 2, …, K receive x1%, x2%, …, xK% of page views in the current time interval •  Determine (x1, x2, …, xK) based on the clicks and views observed before the current time t (…, t−2, t−1) in order to maximize the expected total number of clicks in the future
  • 170. Modeling the Uncertainty, NOT just the Mean •  Simplified setting: two items –  We know the CTR of Item A (say, shown 1 million times) –  We are uncertain about the CTR of Item B (shown only 100 times) •  If we only make a single decision, give 100% of page views to Item A •  If we make multiple decisions in the future, explore Item B since its CTR can potentially be higher •  Potential = ∫{p > q} (p − q) f(p) dp, where q is the (known) CTR of Item A, p is the CTR of Item B, and f(p) is the probability density function of Item B's CTR
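For a concrete feel, the "potential" integral can be evaluated numerically; the sketch below assumes a Beta density for Item B's CTR and made-up numbers.

```python
from scipy import stats, integrate

# "Potential" of exploring Item B: expected CTR lift over the known CTR q of Item A,
# integrating only over the region where B turns out to be better.
q = 0.030                                   # known CTR of Item A (illustrative)
f = stats.beta(a=3, b=97)                   # uncertain CTR of Item B after ~100 views (illustrative)
potential, _ = integrate.quad(lambda p: (p - q) * f.pdf(p), q, 1.0)
```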
  • 171. Multi-Armed Bandits: Introduction (1) •  Bandit "arms" with unknown payoff probabilities p1, p2, p3 •  "Pulling" arm i yields a reward: reward = 1 with probability pi (success), reward = 0 otherwise (failure) •  For now, we are attacking the problem of choosing the best article/arm for all users
  • 172. Multi-Armed Bandits: Introduction (2) •  Bandit "arms" with unknown payoff probabilities p1, p2, p3 •  Goal: Pull arms sequentially to maximize the total reward •  Bandit scheme/policy: sequential algorithm to play arms (items) •  Regret of a scheme = expected loss relative to the "oracle" optimal scheme that always plays the best arm –  "best" means highest success probability –  But the best arm is not known … unless you have an oracle –  Regret is the price of exploration –  Low regret implies quick convergence to the best arm
  • 173. Multi-Armed Bandits: Introduction (3) •  Bayesian approach –  Seeks to find the Bayes optimal solution to a Markov decision process (MDP) with assumptions about probability distributions –  Representative work: Gittins' index, Whittle's index –  Very computationally intensive •  Minimax approach –  Seeks to find a scheme that incurs bounded regret (with no or mild assumptions about probability distributions) –  Representative work: UCB by Lai, Auer –  Usually computationally easy –  But they tend to explore too much in practice (probably because the bounds are based on worst-case analysis) Skip details
  • 174. Multi-Armed Bandits: Markov Decision Process (1) •  Select an arm now at time t=0, to maximize expected total number of clicks in t=0,…,T •  State at time t: Θt = (θ1t, …, θKt) –  θit = State of arm i at time t (that captures all we know about arm i at t) •  Reward function Ri(Θt, Θt+1) –  Reward of pulling arm i that brings the state from Θt to Θt+1 •  Transition probability Pr[Θt+1 | Θt, pulling arm i ] •  Policy π: A function that maps a state to an arm (action) –  π(Θt) returns an arm (to pull) •  Value of policy π starting from the current state Θ0 with horizon T: VπT(Θ0) = E[ Rπ(Θ0)(Θ0, Θ1) + VπT−1(Θ1) ] = ∫ Pr[Θ1 | Θ0, π(Θ0)] ( Rπ(Θ0)(Θ0, Θ1) + VπT−1(Θ1) ) dΘ1 — the first term is the immediate reward; the second is the value of the remaining T−1 time slots if we start from state Θ1
  • 175. Multi-Armed Bandits: MDP (2) •  Optimal policy: argmaxπ VπT(Θ0), with VπT(Θ0) defined recursively as on the previous slide (immediate reward plus the value of the remaining T−1 time slots starting from state Θ1) •  Things to notice: –  Value is defined recursively (actually T high-dimensional integrals) –  Dynamic programming can be used to find the optimal policy –  But just evaluating the value of a fixed policy can be very expensive •  Bandit Problem: The pull of one arm does not change the state of other arms, and the set of arms does not change over time
  • 176. Multi-Armed Bandits: MDP (3) •  Which arm should be pulled next? –  Not necessarily what looks best right now, since it might have had a few lucky successes –  Looks like it will be a function of successes and failures of all arms: policy π(Θt) is a function of (θ1t, …, θKt) — one K-dimensional problem, still computationally expensive!! •  Consider a slightly different problem setting –  Infinite time horizon, but –  Future rewards are geometrically discounted: Rtotal = R(0) + γ·R(1) + γ²·R(2) + … (0<γ<1) •  Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently: π(Θt) = argmaxi { g(θit) } — K one-dimensional problems, where g(θit) is the Gittins' index
  • 177. Multi-Armed Bandits: MDP (4) •  Bandit Policy: 1.  Compute the priority (Gittins' index) of each arm based on its state 2.  Pull the arm with max priority, and observe the reward 3.  Update the state of the pulled arm
  • 178. Multi-Armed Bandits: MDP (5) •  Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently –  Many proofs and different interpretations of Gittins’ index exist •  The index of an arm is the fixed charge per pull for a game with two options, whether to pull the arm or not, so that the charge makes the optimal play of the game have zero net reward –  Significantly reduces the dimension of the problem space –  But, Gittins’ index g(θit) is still hard to compute •  For the Gamma-Poisson or Beta-Binomial models θit = (#successes, #pulls) for arm i up to time t •  g maps each possible (#successes, #pulls) pair to a number –  Approximate methods are used in practice –  Lai et al. have derived these for exponential family distributions
  • 179. Multi-Armed Bandits: Minimax Approach (1) •  Compute the priority of each arm i in a way that the regret is bounded –  Lowest regret in the worst case •  One common policy is UCB1 [Auer 2002]: Priorityi = ci/ni + sqrt(2 log n / ni), where ci = number of successes of arm i, ni = number of pulls of arm i, and n = total number of pulls of all arms; the first term is the observed success rate and the second is a factor representing uncertainty
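A minimal UCB1 sketch following the priority formula above (made-up true CTRs; arms with no pulls are tried once first, a common convention not spelled out on the slide).

```python
import math, random

def ucb1_choose(clicks, pulls):
    """UCB1 priority: observed success rate plus an uncertainty bonus per arm."""
    n = sum(pulls)
    best, best_score = None, -1.0
    for i in range(len(pulls)):
        if pulls[i] == 0:
            return i                              # try each arm at least once
        score = clicks[i] / pulls[i] + math.sqrt(2.0 * math.log(n) / pulls[i])
        if score > best_score:
            best, best_score = i, score
    return best

# Tiny simulation with made-up true CTRs
true_ctr = [0.02, 0.035, 0.01]
clicks, pulls = [0, 0, 0], [0, 0, 0]
for _ in range(10000):
    i = ucb1_choose(clicks, pulls)
    pulls[i] += 1
    clicks[i] += random.random() < true_ctr[i]
```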
  • 180. Multi-Armed Bandits: Minimax Approach (2) •  Priorityi = ci/ni (observed payoff) + sqrt(2 log n / ni) (factor representing uncertainty) •  As the total number of observations n becomes large: –  The observed payoff tends asymptotically towards the true payoff probability –  The system never completely "converges" to one best arm; only the rate of exploration tends to zero
  • 181. Multi-Armed Bandits: Minimax Approach (3) •  Priorityi = ci/ni + sqrt(2 log n / ni) •  Sub-optimal arms are pulled O(log n) times •  Hence, UCB1 has O(log n) regret •  This is the lowest possible regret (but the constants matter ☺) •  E.g., the regret after n plays is bounded by [ 8 Σ{i: µi < µbest} (ln n)/Δi ] + (1 + π²/3) Σj Δj, where Δi = µbest − µi
  • 182. Classical Multi-Armed Bandits: Summary •  Classical multi-armed bandits –  A fixed set of arms with fixed rewards –  Observe the reward before the next pull •  Bayesian approach (Markov decision process) –  Gittins' index [Gittins 1979]: Bayes optimal for classical bandits •  Pull the arm currently having the highest index value –  Whittle's index [Whittle 1988]: Extension to a changing reward function –  Computationally intensive •  Minimax approach (providing guaranteed regret bounds) –  UCB1 [Auer 2002]: Upper bound of a model-agnostic confidence interval •  Index of arm i = ci/ni + sqrt(2 log n / ni) •  Heuristics –  ε-Greedy: Random exploration using fraction ε of traffic –  Softmax: Pick arm i with probability exp(µ̂i/τ) / Σj exp(µ̂j/τ), where µ̂i = predicted CTR of item i and τ = temperature –  Posterior draw: Index = a draw from the posterior CTR distribution of the arm
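For completeness, here are small sketches of two of the heuristics listed above: the posterior draw (with an assumed Beta posterior and illustrative prior pseudo-counts) and the softmax rule. Neither is the tutorial's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_draw_choose(clicks, views, a0=1.0, b0=1.0):
    """'Posterior draw' heuristic (Thompson sampling): sample one CTR per arm from its
    Beta posterior and play the argmax. a0, b0 are illustrative prior pseudo-counts."""
    samples = rng.beta(a0 + clicks, b0 + views - clicks)
    return int(np.argmax(samples))

def softmax_choose(predicted_ctr, tau=0.01):
    """Softmax heuristic: play arm i with probability proportional to exp(mu_hat_i / tau)."""
    z = np.exp((predicted_ctr - np.max(predicted_ctr)) / tau)   # shift for numerical stability
    p = z / z.sum()
    return int(rng.choice(len(p), p=p))

arm = posterior_draw_choose(np.array([12.0, 30.0, 4.0]), np.array([600.0, 900.0, 450.0]))
```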
  • 183. Do Classical Bandits Apply to Web Recommenders? •  Each curve is the CTR of an item in the Today Module on www.yahoo.com over time •  Traffic obtained from a controlled randomized experiment (no confounding) •  Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking news stories
  • 184. Characteristics of Real Recommender Systems •  Dynamic set of items (arms) –  Items come and go with short lifetimes (e.g., a day) –  Asymptotically optimal policies may fail to achieve good performance when item lifetimes are short •  Non-stationary CTR –  CTR of an item can change dramatically over time •  Different user populations at different times •  Same user behaves differently at different times (e.g., morning, lunch time, at work, in the evening, etc.) •  Attention to breaking news stories decays over time •  Batch serving for scalability –  Making a decision and updating the model for each user visit in real time is expensive –  Batch serving is more feasible: Create time slots (e.g., 5 min); for each slot, decide the fraction xi of the visits in the slot to give to item i [Agarwal et al., ICDM, 2009]
  • 185. Explore/Exploit in Recommender Systems •  Items 1, 2, …, K receive x1%, x2%, …, xK% of page views in the current time interval •  Determine (x1, x2, …, xK) based on the clicks and views observed before the current time t (…, t−2, t−1) in order to maximize the expected total number of clicks in the future •  Let's solve this from first principles
  • 186. Bayesian Solution: Two Items, Two Time Slots (1) •  Two time slots: t = 0 and t = 1 –  Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1 –  Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1 •  Question: What fraction x of the N0 views at t = 0 (now) should go to item P, and (1−x) to item Q? N1 views arrive at t = 1, then the period ends •  To determine x, we need to estimate what would happen in the future –  After serving x, we obtain c clicks on item P (not yet observed; a random variable) –  Assume we observe c; we can then update the estimate of p1 to p̂1(x, c) –  If x and c are given, the optimal solution is: give all t = 1 views to Item P iff E[ p1 | x, c ] > q1 (Figures: CTR densities of Item P and Item Q at t = 0 around p0, q0 and at t = 1 around p̂1(x, c), q1)
  • 187. Bayesian Solution: Two Items, Two Time Slots (2) •  Expected total number of clicks in the two time slots: N0 x p̂0 + N0 (1−x) q0 + N1 Ec[ max{ p̂1(x, c), q1 } ] — at t = 1 we show the item with the higher E[CTR], i.e., max{ p̂1(x, c), q1 } = N0 q0 + N1 q1 + N0 x (p̂0 − q0) + N1 Ec[ max{ p̂1(x, c) − q1, 0 } ] — the first two terms are E[#clicks] if we always show item Q; the remainder is Gain(x, q0, q1) •  Gain(x, q0, q1) = expected number of additional clicks if we explore the uncertain item P with fraction x of the views in slot 0, compared to a scheme that only shows the certain item Q in both slots •  Solution: argmaxx Gain(x, q0, q1)
  • 188. Bayesian Solution: Two Items, Two Time Slots (3) •  Approximate p̂1(x, c) by a normal distribution –  A reasonable approximation because of the central limit theorem –  Prior of p1 ~ Beta(a, b); p̂1 = Ec[ p̂1(x, c) ] = a/(a+b); σ1²(x) = Var[ p̂1(x, c) ] = N0 x a b / ( (a+b)² (a+b+1) (a+b+N0 x) ) •  Gain(x, q0, q1) = N0 x (p̂0 − q0) + N1 [ σ1(x) φ( (q1 − p̂1)/σ1(x) ) + (p̂1 − q1) ( 1 − Φ( (q1 − p̂1)/σ1(x) ) ) ], where φ and Φ are the standard normal pdf and cdf •  Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0)
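The gain formula above can be evaluated and optimized directly. The sketch below uses a generic bounded 1-D optimizer rather than the O(log N0) procedure referenced in the proposition; all input values are made up.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def gain(x, q0, q1, p0_hat, a, b, N0, N1):
    """Gain(x, q0, q1) under the normal approximation on the slide.
    Item P's CTR prior is Beta(a, b); p1_hat and sigma1(x) follow the slide's formulas."""
    p1_hat = a / (a + b)
    sigma1 = np.sqrt(N0 * x * a * b / ((a + b) ** 2 * (a + b + 1) * (a + b + N0 * x)))
    if sigma1 == 0:
        future = max(p1_hat - q1, 0.0)
    else:
        t = (q1 - p1_hat) / sigma1
        future = sigma1 * norm.pdf(t) + (p1_hat - q1) * (1.0 - norm.cdf(t))
    return N0 * x * (p0_hat - q0) + N1 * future

# Pick the exploration fraction x that maximizes the gain (illustrative inputs)
res = minimize_scalar(lambda x: -gain(x, q0=0.03, q1=0.03, p0_hat=0.025,
                                      a=2.5, b=97.5, N0=1e5, N1=1e5),
                      bounds=(1e-6, 1.0), method="bounded")
x_opt = res.x
```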
  • 189. Bayesian Solution: Two Items, Two Time Slots (4) •  Quiz: Is it correct that the more uncertain we are about the CTR of an item, the more we should explore it? (Figure: Gain as a function of the fraction of views given to the item, for low vs. high uncertainty; different curves correspond to different prior mean settings)