Ephemeral Hadoop clusters in the cloud
Presentation Transcript

  • Ephemeral Hadoop Clusters in the Cloud [1] Greg Fodor, Etsy, gfodor@etsy.com
  • about me: gfodor@etsy.com, @gfodor, Data Wrangler
  • about etsy
  • the world’s handmade marketplace
  • total members: 9,000,000; total active shops: 800,000; items listed: 9.5M; page views per month: >1B; 2010 sales: $314.3M
  • lots of data
  • about this talk
  • ephemeral?
  • [5]
  • “elastic” to the extreme
  • how did we get here?
  • wanted to dip our toes; stop hitting the database; stop grepping log files
  • 2 data sources -> S3
  • database snapshots. input: nightly diffs (SELECT * FROM <table> WHERE update_date > 1 day ago); output: full tables as sequence files
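
The snapshot slide above folds a nightly diff into the previous full table. A minimal sketch of that merge follows, assuming rows are dictionaries keyed by a primary-key column named `id`; the function name `apply_nightly_diff` and the sample rows are hypothetical, and the talk's real job wrote Hadoop sequence files, not Python lists.

```python
from typing import Dict, Iterable, List


def apply_nightly_diff(snapshot: Iterable[dict], diff: Iterable[dict], key: str = "id") -> List[dict]:
    """Return a new full table in which rows from `diff` replace or extend `snapshot`."""
    merged: Dict[object, dict] = {row[key]: row for row in snapshot}
    for row in diff:
        merged[row[key]] = row  # the more recently updated row wins
    return sorted(merged.values(), key=lambda row: row[key])


# Hypothetical sample data: yesterday's full table plus one night's changes.
yesterday = [
    {"id": 1, "name": "shop-a", "update_date": "2011-05-01"},
    {"id": 2, "name": "shop-b", "update_date": "2011-05-01"},
]
nightly_diff = [
    {"id": 2, "name": "shop-b-renamed", "update_date": "2011-05-02"},
    {"id": 3, "name": "shop-c", "update_date": "2011-05-02"},
]
today = apply_nightly_diff(yesterday, nightly_diff)
```

A diff selected by `update_date` alone does not see deleted rows, so a scheme like this would need a separate way to handle deletes.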
  • visit logs. input: Akamai access logs (event beacons); output: [visit_id, [event]]
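
The output shape on the visit-logs slide, `[visit_id, [event]]`, amounts to a group-by on the visit identifier. The sketch below assumes each beacon has already been parsed into a `(visit_id, timestamp, event)` tuple; that tuple layout, the event names, and the function name are illustrative assumptions, not the Akamai log format.

```python
from collections import defaultdict


def group_events_by_visit(beacons):
    """Group parsed beacons into [visit_id, [event, ...]] records, events in time order."""
    visits = defaultdict(list)
    for visit_id, timestamp, event in beacons:
        visits[visit_id].append((timestamp, event))
    return [
        [visit_id, [event for _, event in sorted(events)]]
        for visit_id, events in sorted(visits.items())
    ]


# Hypothetical beacons, deliberately out of order.
beacons = [
    ("v2", 5, "view_listing"),
    ("v1", 2, "search"),
    ("v1", 1, "home"),
    ("v2", 7, "add_to_cart"),
]
visit_records = group_events_by_visit(beacons)
```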
  • processing the data
  • [2]
  • data flow: joins, group bys, etc.
  • cascading: Chris Wensel, http://www.cascading.org/
  • great implementation
  • Java syntax [10]
  • cascading.jruby: Grégoire Marabout (Qualtera), Matt Walker (Etsy), Stefan Karpinski (Etsy), Steve Mardenfeld (Etsy); github: http://bit.ly/o3DNtC; blog: http://etsy.me/cFytuL
  • “push” job binaries to S3; run on Elastic Map/Reduce: starts cluster, runs, shuts down; access results on S3
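
The start, run, shut down lifecycle above can be described as a single request for a cluster that terminates once its steps finish. The sketch below only builds such a request as a dictionary. The key names follow the shape of the present-day boto3 EMR `run_job_flow` call, which is an assumption here and not the tooling named in the talk, and a real request needs further fields such as a release label and IAM roles. The bucket, jar path, and instance type are hypothetical.

```python
def build_ephemeral_job_flow(name, jar_s3_uri, args, instance_count, instance_type="m1.large"):
    """Describe a cluster that runs one jar step and then shuts itself down."""
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # False means the cluster terminates once no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": name + "-step",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {"Jar": jar_s3_uri, "Args": list(args)},
            }
        ],
    }


# Hypothetical job: the binary was pushed to S3 beforehand, results land on S3.
request = build_ephemeral_job_flow(
    name="suggested-shops",
    jar_s3_uri="s3://example-bucket/jobs/suggested_shops.jar",
    args=["s3://example-bucket/input/", "s3://example-bucket/output/"],
    instance_count=10,
)
```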
  • next project: shop recommendations
  • 3 steps: ✔ data preparation - Cascading; ✖ analysis/training; ✖ prediction
  • sparse implementation of SVD
  • 3 steps: ✔ data preparation - Cascading; ✖ analysis/training - MATLAB; ✖ prediction - MATLAB
  • “MATLAB, in my Hadoop cluster?”
  • hadoop streaming
  • arbitrary scripts for map & reduce
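
Hadoop Streaming runs any executable as the mapper or reducer: each reads lines on standard input and writes tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The sketch below shows that contract with a count-per-key job, simulating the shuffle locally with `sorted`. The input layout and the function names are assumptions for illustration; in a real job each function would be wrapped in a script that reads `sys.stdin` and prints its output.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'key<TAB>1' for each well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[0]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical input: one event per line, keyed by shop.
sample_input = ["shop_1\tfavorite\n", "shop_2\tfavorite\n", "shop_1\tpurchase\n"]
mapped = sorted(map_lines(sample_input))  # stands in for Hadoop's sort/shuffle
reduced = list(reduce_lines(mapped))
```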
  • Swiss army knife. Full dataset analysis: Matlab, Ruby scripts; ‘Artifact’ outputs: Tokyo Cabinet, Lucene, SQLite; Side-effects: MySQL, CloudFront [3]
  • 3 steps: ✔ data preparation - Cascading; ✔ analysis/training - MATLAB; ✔ prediction - MATLAB
  • Job 2, Job 1 [4]
  • Barnum
  • Sinatra web service on EC2
  • barnum starts job and passes callback URL; when job finishes, hadoop hits callback URL to barnum to proceed
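
The callback pattern above, launch a job with a callback URL and advance only when that URL is hit, can be sketched as a small in-memory coordinator. This is a Python illustration under stated assumptions, not Barnum itself (which the talk describes as a Sinatra service): the class name, the `/callback/<step>` path, and the `launch` function are hypothetical, and the step names simply echo the talk's three steps.

```python
class WorkflowCoordinator:
    """Launch steps one at a time, advancing when the running step calls back."""

    def __init__(self, steps, launch):
        self.steps = list(steps)
        self.launch = launch  # callable(step, callback_url=...)
        self.next_index = 0
        self.finished = False

    def start(self):
        self._launch_next()

    def handle_callback(self, step_name):
        """Called when a finished job hits its callback URL."""
        if self.next_index == 0 or self.finished:
            raise RuntimeError("no step is currently running")
        expected = self.steps[self.next_index - 1]
        if step_name != expected:
            raise ValueError(f"expected callback from {expected!r}, got {step_name!r}")
        self._launch_next()

    def _launch_next(self):
        if self.next_index == len(self.steps):
            self.finished = True
            return
        step = self.steps[self.next_index]
        self.next_index += 1
        self.launch(step, callback_url=f"/callback/{step}")


launched = []
coordinator = WorkflowCoordinator(
    ["prepare", "train", "predict"],
    launch=lambda step, callback_url: launched.append((step, callback_url)),
)
coordinator.start()
for finished_step in ["prepare", "train", "predict"]:
    coordinator.handle_callback(finished_step)  # simulates the job hitting the URL
```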
  • Barnum constructs
  • 3 steps: ✔ data preparation - Cascading; ✔ analysis/training - MATLAB; ✔ prediction - MATLAB
  • suggested_shops.yaml:  
  • getting data back to web stack?
  • [6] v1
  • ad-hoc shell scripts; TSV into unsharded MySQL; not re-usable [6]
  • v2
  • datasets are versioned based upon job execution time
  • MySQL Tables: Memcache Cluster:
  • Output dataset <-> ORM Model
  • PHP:
  • PHP: Cascading:
  • PHP: Cascading: PHP:
  • Old tables regularly dropped
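
The v2 slides above version each output dataset by job execution time and drop old tables regularly. One way to picture that is a table name with a timestamp suffix, plus a helper that picks the newest version and lists the ones old enough to drop. The naming scheme, the `keep` count, and both function names are assumptions; the talk does not show the actual convention.

```python
from datetime import datetime


def versioned_table_name(dataset, run_time):
    """Build a table name whose suffix sorts in job-execution order."""
    return f"{dataset}_{run_time.strftime('%Y%m%d%H%M%S')}"


def latest_and_stale(table_names, dataset, keep=2):
    """Return (newest version, versions that can be dropped), keeping `keep` newest."""
    prefix = dataset + "_"
    versions = sorted(
        name for name in table_names
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    if not versions:
        raise LookupError(f"no versions found for {dataset!r}")
    return versions[-1], versions[:-keep]


# Hypothetical nightly runs on three consecutive days.
tables = [
    versioned_table_name("suggested_shops", datetime(2011, 5, day, 3, 0, 0))
    for day in (1, 2, 3)
]
current, droppable = latest_and_stale(tables, "suggested_shops", keep=2)
```

Readers always resolve the newest version, so a job can finish writing a new table before anything points at it.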
  • how we’re using this stack: analytics (internal), products (external)
  • analytics
  • products
  • search quality recommendations
  • May 2011: 4,926 successful job runs
  • [5]
  • scale up from zero
  • isolation
  • isolation across runs: fresh machine each time
  • isolation between developers: no toe-stepping
  • heterogeneous clusters
  • big RAM when you need it (but not when you don’t)
  • need one machine? use one machine.
  • writing jobs
  • PHENOMENAL COSMIC POWERS [7]
  • prototyping: run slow, unoptimized version on 500 machines for < $100
  • parameter tuning: try N=1, 2, 5, 10 and see which results in best output
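
The parameter-tuning slide above is a small sweep: run the same job once per candidate value and keep the value whose output scores best. A minimal sketch follows, assuming a `run_job` function that produces an output for a given N and a `score` function that rates it; both are hypothetical stand-ins, and the toy scoring rule exists only to make the example runnable.

```python
def pick_best_parameter(candidates, run_job, score):
    """Run the job once per candidate and return (best candidate, all scores)."""
    results = {n: score(run_job(n)) for n in candidates}
    best = max(results, key=results.get)
    return best, results


# Toy stand-ins: the "job" echoes N and the score peaks at N=5.
best_n, scores = pick_best_parameter(
    candidates=[1, 2, 5, 10],
    run_job=lambda n: n,
    score=lambda output: -(output - 5) ** 2,
)
```

With ephemeral clusters, each candidate could in principle run on its own cluster at the same time, since the runs share no state.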
  • [9]  
  • questions?
  • photo credits
    [1] by elfike, http://www.flickr.com/photos/elfike/157439707/
    [2] by Dan4th, http://www.flickr.com/photos/43264265@N00/5371557240/
    [3] by mandolux, http://www.flickr.com/photos/73935252@N00/34418046/
    [4] by The Suss-Man, http://www.flickr.com/photos/8692813@N06/4580254188/
    [5] by Stephen Rees, http://www.flickr.com/photos/60142746@N00/214461223/
    [6] by Let Ideas Compete, http://www.flickr.com/photos/question_everything/3414827746/
    [7] by funkandjazz, http://www.flickr.com/photos/phunk/2484159004/
    [8] by ViaMoi, http://www.flickr.com/photos/12187843@N07/3343619603/
    [9] by kreg.steppe, http://www.flickr.com/photos/spyndle/500305000/
    [10] clipart (really)
    [11] by Chris Pirillo, http://www.flickr.com/photos/49503157467@N01/34588230/