Storm Users Group Real Time Hadoop
 

More about how Lambda architecture can be improved with exact computations and correct failover.

Presentation Transcript

    • 1  ©MapR Technologies - Confidential
      Real-time Learning with Hadoop
      The "λ + ε" architecture
    • 2
      § Contact:
        – tdunning@maprtech.com
        – @ted_dunning
      § Slides and such (available late tonight):
        – http://slideshare.net/tdunning
      § Hash tags: #mapr #storm
    • 3  The Challenge
      § Hadoop is great for processing vats of data
        – But sucks for real-time (by design!)
      § Storm is great for real-time processing
        – But lacks any way to deal with batch processing
      § It sounds like there isn't a solution
        – Neither fashionable solution handles everything
    • 4  This is not a problem. It's an opportunity!
    • 5  Hadoop is Not Very Real-time
      [Diagram: a timeline from fully processed data through the latest full period to now; a Hadoop job takes this long for this data, so the most recent data remains unprocessed]
    • 6  Real-time and Long-time Together
      [Diagram: the same timeline, with Hadoop working great on older data and Storm working on recent data near now; the two are served together as a blended view]
    • 7  An Example
    • 8  The Same Problem
    • 9  What Does the Lambda Architecture Do?
      § The idea is that we want to compute a function of all history up to time n:
            f(x_1 ... x_n)
      § In order to get real-time response, we divide this into two parts:
            f(x_1 ... x_m ... x_n) = f(x_1 ... x_m) + f(x_{m+1} ... x_n)
      § Where the addition may not really be addition
      § The idea is that if we lose the history from m+1 until n, things get better soon enough
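    For counting, the decomposition on this slide can be checked in a few lines. This is an illustrative sketch, not code from the talk; `f`, `lambda_f`, and the event list are made-up names:

    ```python
    def f(events):
        # the function of all history: here, just a count
        return len(events)

    def lambda_f(events, m):
        # split history at m: batch part (Hadoop) + real-time tail (Storm)
        batch = f(events[:m])
        realtime = f(events[m:])
        return batch + realtime      # the "addition" really is addition for counts

    history = ["click"] * 1000 + ["click"] * 7   # old events plus a recent tail
    assert lambda_f(history, 1000) == f(history) == 1007
    ```

    The same check works for any split point m, which is exactly why losing the real-time tail is recoverable: the batch part catches up on the next run.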
    • 10  Can We Do Better?
      § Can we minimize or avoid failure transients?
      § Can we guarantee precise boundaries?
      § Can we synchronize computations accurately?
    • 11  Alternative without Lambda
      [Diagram: a consumer reads from a search engine for real-time and from the NoSQL de jour for long-time; how the two combine is marked "?"]
    • 12  Problems
      § Simply dumping into a NoSQL engine doesn't quite work
      § Insert rate is limited
      § No load isolation
        – Big retrospective jobs kill real-time
      § Low scan performance
        – HBase pretty good, but not stellar
      § Difficult to set boundaries
        – Where does real-time end and long-time begin?
    • 13  Almost a Solution
      § Lambda architecture talks about a function of long-time state
        – A real-time approximate accelerator adjusts the previous result to the current state
      § Sounds good, but ...
        – How does the real-time accelerator combine with long-time?
        – What algorithms can do this?
        – How can we avoid gaps, overlaps, and other errors?
      § Needs more work
      § We need a "λ + ε" architecture!
    • 14  A Simple Example
      § Let's start with the simplest case ... counting
      § Counting = addition
        – Addition is associative
        – Addition is on-line
        – We can generalize these results to all associative, on-line functions
        – But let's start simple
    • 15  Rough Design - Data Flow
      [Diagram: data sources send query events to a catcher cluster, which writes raw logs; a ProtoSpout reads the logs and feeds logger and counter bolts; the counter bolts emit semi-aggregates, and a Hadoop aggregator folds snapshots into the long aggregate]
    • 16  Closer Look - Catcher Protocol
      The data sources and catchers communicate with a very simple protocol:

        Hello() => list of catchers
        Log(topic, message) => (OK|FAIL, redirect-to-catcher)
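    The two-call protocol on this slide can be mocked up in a few lines. Everything below is an illustrative assumption (the in-memory `CatcherCluster` class, the round-robin topic ownership); the real catchers are networked services:

    ```python
    class CatcherCluster:
        """In-memory stand-in for the catcher cluster and its protocol."""

        def __init__(self, catchers):
            self.catchers = catchers     # catcher host names
            self.owners = {}             # topic -> owning catcher
            self.files = {}              # topic -> appended messages

        def hello(self):
            # Hello() => list of catchers
            return list(self.catchers)

        def log(self, topic, message):
            # Log(topic, message) => (OK|FAIL, redirect-to-catcher)
            owner = self.owners.setdefault(
                topic, self.catchers[len(self.owners) % len(self.catchers)])
            # the owning catcher appends; the reply names that catcher so the
            # client can send future messages straight to it (no extra hop)
            self.files.setdefault(topic, []).append(message)
            return ("OK", owner)

    cluster = CatcherCluster(["catcher-1", "catcher-2"])
    status, redirect = cluster.log("clicks", "page=/landing")
    assert status == "OK" and redirect in cluster.hello()
    ```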
    • 17  Closer Look - Catcher Queues
      The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.
      Each topic file is appended by exactly one catcher.
      Topic files are kept in shared file storage.
    • 18  Closer Look - ProtoSpout
      The ProtoSpout tails the topic files, parses log records into tuples, and injects them into the Storm topology.
      The last fully acked position is stored in a shared, transactionally correct file system.
    • 19  Closer Look - Counter Bolt
      § Critical design goals:
        – fast ack for all tuples
        – fast restart of counter
      § Ack happens when a tuple hits the replay log (10's of milliseconds, group commit)
      § Restart involves replaying semi-agg's + replay log (very fast)
      § Replay log only lasts until the next semi-aggregate goes out
      [Diagram: incoming records enter the counter bolt; the replay log backs the real-time state, and semi-aggregated records go out to long-time]
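    The ack-on-replay-log and restart behavior described on this slide can be sketched with in-memory stand-ins. The class and helper names are hypothetical, and durability is simulated with plain Python lists:

    ```python
    from collections import Counter

    class CounterBolt:
        def __init__(self, replay_log, semi_aggs):
            self.replay_log = replay_log   # durable append-only record list
            self.semi_aggs = semi_aggs     # semi-aggregates already emitted
            self.counts = Counter()

        def execute(self, label):
            self.replay_log.append(label)  # ack as soon as this append lands
            self.counts[label] += 1

        def emit_semi_agg(self):
            # push current counts downstream; the replay log is now redundant,
            # so it only lasts until the next semi-aggregate goes out
            self.semi_aggs.append(dict(self.counts))
            self.replay_log.clear()
            self.counts = Counter()

        @classmethod
        def restart(cls, replay_log, semi_aggs):
            # recovery: rebuild in-flight state from the (short) replay log
            bolt = cls(replay_log, semi_aggs)
            for label in list(replay_log):
                bolt.counts[label] += 1
            return bolt

    def total_counts(semi_aggs, live_counts):
        # what a downstream reader sees: semi-aggs plus live state
        total = Counter()
        for agg in semi_aggs:
            total.update(agg)
        total.update(live_counts)
        return total

    log, aggs = [], []
    bolt = CounterBolt(log, aggs)
    for label in ["a", "a", "b"]:
        bolt.execute(label)
    bolt.emit_semi_agg()               # counts move downstream, log truncated
    bolt.execute("a")                  # acked via the log; then the bolt dies
    recovered = CounterBolt.restart(log, aggs)
    assert total_counts(aggs, recovered.counts) == Counter(a=3, b=1)
    ```

    No acked tuple is lost across the simulated crash, because every acked tuple is either in an emitted semi-aggregate or still in the replay log.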
    • 20  A Frozen Moment in Time
      § Snapshot defines the dividing line
      § All data in the snap is long-time, all after is real-time
      § Semi-agg strategy allows clean combination of both kinds of data
      § A data-synchronized snap is not needed (if the snap is really a snap)
      [Diagram: semi-aggregates flow through the Hadoop aggregator; the snap separates them from the long aggregate]
    • 21  Guarantees
      § Counter output volume is small-ish
        – the greater of k tuples per 100K inputs or k tuples/s
        – 1 tuple/s/label/bolt for this exercise
      § Persistence layer must provide guarantees
        – distributed against node failure
        – must have either readable flush or closed-append
      § HDFS is distributed, but provides no guarantees and strange semantics
      § MapRfs is distributed, provides all necessary guarantees
    • 22  Presentation Layer
      § Presentation must
        – read recent output of the Logger bolt
        – read relevant output of Hadoop jobs
        – combine semi-aggregated records
      § User will see
        – counts that increment within 0-2 s of events
        – a seamless and accurate meld of short- and long-term data
    • 23  The Basic Idea
      § Online algorithms generally have relatively small state (like counting)
      § Online algorithms generally have a simple update (like counting)
      § If we can do this with counting, we can do it with all kinds of algorithms
    • 24  Summary - Part 1
      § Semi-agg strategy + snapshots allows correct real-time counts
        – because addition is on-line and associative
      § Other on-line associative operations include:
        – k-means clustering (see Dan Filimon's talk at 16.)
        – count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse)
        – top-k values
        – top-k (count(*)) (see streamlib)
        – contextual Bayesian bandits (see part 2 of this talk)
    • 25  Example 2 - A/B Testing in Real-time
      § I have 15 versions of my landing page
      § Each visitor is assigned to a version
        – Which version?
      § A conversion or sale or whatever can happen
        – How long to wait?
      § Some versions of the landing page are horrible
        – Don't want to give them traffic
    • 26  A Quick Diversion
      § You see a coin
        – What is the probability of heads?
        – Could it be larger or smaller than that?
      § I flip the coin and while it is in the air ask again
      § I catch the coin and ask again
      § I look at the coin (and you don't) and ask again
      § Why does the answer change?
        – And did it ever have a single value?
    • 27  A Philosophical Conclusion
      § Probability as expressed by humans is subjective and depends on information and experience
    • 28  A Practical Application
    • 29  I Dunno
      [Plot: a flat distribution Prob(p) over p in [0, 1]]
    • 30  5 heads out of 10 throws
      [Plot: Prob(p) now peaked around p = 0.5]
    • 31  2 heads out of 12 throws
      [Plot: Prob(p) skewed toward small p, with the mean marked]
      Using any single number as a "best" estimate denies the uncertain nature of a distribution.
      Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails.
    • 32  Bayesian Bandit
      § Compute distributions based on data
      § Sample p1 and p2 from these distributions
      § Put a coin in bandit 1 if p1 > p2
      § Else, put the coin in bandit 2
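    The four steps above are Thompson sampling with Beta posteriors, and they run in a few lines. The conversion rates and trial count below are made-up illustration values, not figures from the talk:

    ```python
    import random

    def select(wins, losses):
        # sample a plausible conversion rate for each bandit, pick the best
        samples = [random.betavariate(w + 1, l + 1)
                   for w, l in zip(wins, losses)]
        return samples.index(max(samples))

    random.seed(1)
    true_rates = [0.05, 0.15]        # bandit 1 is genuinely better
    wins, losses = [0, 0], [0, 0]
    for _ in range(2000):
        i = select(wins, losses)
        if random.random() < true_rates[i]:
            wins[i] += 1             # conversion observed
        else:
            losses[i] += 1

    # traffic should have drifted toward the better bandit
    assert wins[1] + losses[1] > wins[0] + losses[0]
    ```

    Note that the per-bandit state is just two counters, which is why this fits the counting machinery from part 1.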
    • 33  And it works!
      [Plot: regret versus n from 0 to 1000; the Bayesian Bandit with Gamma-Normal shows markedly lower regret than ε-greedy with ε = 0.05]
    • 34  Video Demo
    • 35  The Code
      § Select an alternative:

          n = dim(k)[1]
          p0 = rep(0, length.out=n)
          for (i in 1:n) {
            p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
          }
          return (which(p0 == max(p0)))

      § Select and learn:

          for (z in 1:steps) {
            i = select(k)
            j = test(i)
            k[i,j] = k[i,j]+1
          }
          return (k)

      § But we already know how to count!
    • 36  The Basic Idea
      § We can encode a distribution by sampling
      § Sampling allows unification of exploration and exploitation
      § Can be extended to more general response models
      § Note that learning here = counting = an on-line algorithm
    • 37  Generalized Banditry
      § Suppose we have an infinite number of bandits
        – suppose they are each labeled by two real numbers x and y in [0,1]
        – also that expected payoff is a parameterized function of x and y:
              E[z] = f(x, y | θ)
        – now assume a distribution for θ that we can learn online
      § Selection works by sampling θ, then computing f
      § Learning works by propagating updates back to θ
        – If f is linear, this is very easy
      § Don't just have to have two labels, could have labels and context
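    A minimal sketch of this generalized bandit for a linear f, assuming ad-hoc Gaussian posteriors with a fixed decay and made-up arm labels (none of these modeling choices come from the talk):

    ```python
    import random

    # hidden truth and illustrative arms
    true_theta = [1.0, -0.5]
    arms = [(0.1, 0.9), (0.9, 0.2), (0.5, 0.5)]   # (x, y) label per bandit

    mean = [0.0, 0.0]    # current belief about theta
    var = [1.0, 1.0]     # sampling spread; decays as we learn

    def select():
        # sample theta from its posterior, then score every arm with f
        theta = [random.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
        scores = [theta[0] * x + theta[1] * y for x, y in arms]
        return scores.index(max(scores))

    def learn(arm, payoff, rate=0.05):
        # propagate the prediction error back to theta (f is linear, so easy)
        x, y = arms[arm]
        err = payoff - (mean[0] * x + mean[1] * y)
        mean[0] += rate * err * x
        mean[1] += rate * err * y
        var[0] *= 0.999              # grow gradually more confident
        var[1] *= 0.999

    random.seed(2)
    counts = [0, 0, 0]
    for _ in range(3000):
        i = select()
        counts[i] += 1
        x, y = arms[i]
        learn(i, true_theta[0] * x + true_theta[1] * y + random.gauss(0, 0.1))

    # the arm with the highest true payoff (index 1) should dominate
    assert counts.index(max(counts)) == 1
    ```

    Sampling θ rather than using its mean is what unifies exploration and exploitation here, exactly as on the previous slide.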
    • 38  Caveats
      § Original Bayesian Bandit only requires real-time
      § Generalized Bandit may require access to long history for learning
        – Pseudo-online learning may be easier than true online
      § Bandit variables can include content, time of day, day of week
      § Context variables can include user id, user features
      § Bandit × context variables provide the real power
    • 39
      § Contact:
        – tdunning@maprtech.com
        – @ted_dunning
      § Slides and such (just don't believe the metrics):
        – http://slideshare.net/tdunning
      § Hash tags: #mapr #storm
    • 40  Thank You