Ephemeral Hadoop clusters in the cloud

Published in: Technology

Ephemeral Hadoop clusters in the cloud

  1. 1. Ephemeral  Hadoop  Clusters  in  the  Cloud   [1]   Greg  Fodor,  Etsy   gfodor@etsy.com  
  2. 2. about  me   gfodor@etsy.com   @gfodor   Data  Wrangler  
  3. 3. about  etsy  
  4. 4. the  world’s  handmade  marketplace  
  5. 5. total members: 9,000,000; total active shops: 800,000; items listed: 9.5M; page views per month: >1B; 2010 sales: $314.3M
  6. 6. lots  of  data  
  7. 7. about  this  talk  
  8. 8. ephemeral?  
  9. 9. [5]  
  10. 10. “elastic” to the extreme
  11. 11. how  did  we  get  here?  
  12. 12. wanted to dip our toes; stop hitting the database; stop grepping log files
  13. 13. 2 data sources -> S3
  14. 14. database snapshots. input: nightly diffs (SELECT * FROM <table> WHERE update_date > 1 day ago); output: full tables as sequence files
  15. 15. visit logs. input: Akamai access logs (event beacons); output: [visit_id, [event]]
  16. 16. processing  the  data  
  17. 17. [2]  
  18. 18. data  flow   joins,  group  bys,  etc.  
  19. 19. cascading: Chris Wensel, http://www.cascading.org/
  20. 20. great implementation
  21. 21. Java  syntax   [10]  
  22. 22. cascading.jruby: Grégoire Marabout (Qualtera), Matt Walker (Etsy), Stefan Karpinski (Etsy), Steve Mardenfeld (Etsy); github: http://bit.ly/o3DNtC; blog: http://etsy.me/cFytuL
  23. 23. “push” job binaries to S3; run on Elastic Map/Reduce: starts cluster, runs, shuts down; access results on S3
  24. 24. next project: shop recommendations
  25. 25. 3 steps: ✔ data preparation - Cascading; ✖ analysis/training; ✖ prediction
  26. 26. sparse implementation of SVD
  27. 27. 3 steps: ✔ data preparation - Cascading; ✖ analysis/training - MATLAB; ✖ prediction - MATLAB
  28. 28. “MATLAB, in my Hadoop cluster?”
  29. 29. hadoop  streaming  
  30. 30. arbitrary  scripts  for  map  &  reduce  
  31. 31. Swiss army knife. Full dataset analysis: Matlab, Ruby scripts. ‘Artifact’ outputs: Tokyo Cabinet, Lucene, SQLite. Side-effects: MySQL, CloudFront [3]
  32. 32. 3 steps: ✔ data preparation - Cascading; ✔ analysis/training - MATLAB; ✔ prediction - MATLAB
  33. 33. Job  2   Job  1   [4]  
  34. 34. Barnum  
  35. 35. Sinatra  web  service  on  EC2  
  36. 36. barnum starts job and passes callback URL; when job finishes, hadoop hits callback URL to barnum to proceed
  37. 37. Barnum  constructs  
  38. 38. 3 steps: ✔ data preparation - Cascading; ✔ analysis/training - MATLAB; ✔ prediction - MATLAB
  39. 39. suggested_shops.yaml:  
  40. 40. suggested_shops.yaml:  
  41. 41. suggested_shops.yaml:  
  42. 42. suggested_shops.yaml:  
  43. 43. suggested_shops.yaml:  
  44. 44. suggested_shops.yaml:  
  45. 45. suggested_shops.yaml:  
  46. 46. suggested_shops.yaml:  
  47. 47. suggested_shops.yaml:  
  48. 48. suggested_shops.yaml:  
  49. 49. suggested_shops.yaml:  
  50. 50. suggested_shops.yaml:  
  51. 51. suggested_shops.yaml:  
  52. 52. suggested_shops.yaml:  
  53. 53. suggested_shops.yaml:  
  54. 54. getting data back to web stack?
  55. 55. [6] v1
  56. 56. ad-hoc shell scripts; TSV into unsharded MySQL; not re-usable [6]
  57. 57. v2  
  58. 58. datasets are versioned based upon job execution time
  59. 59. MySQL  Tables:  Memcache  Cluster:  
  60. 60. Output dataset <-> ORM Model
  61. 61. PHP:  
  62. 62. PHP:  Cascading:  
  63. 63. PHP:  Cascading:  PHP:  
  64. 64. Old  tables  regularly  dropped  
  65. 65. how we’re using this stack: analytics (internal); products (external)
  66. 66. analytics
  67. 67. products
  68. 68. search quality, recommendations
  69. 69. May  2011:    4,926  successful  job  runs  
  70. 70. [5]  
  71. 71. scale  up  from  zero  
  72. 72. isolation
  73. 73. isolation across runs: fresh machine each time
  74. 74. isolation between developers: no toe-stepping
  75. 75. heterogeneous  clusters  
  76. 76. big  RAM  when  you  need  it   (but  not  when  you  don’t)  
  77. 77. need  one  machine?     use  one  machine.  
  78. 78. writing jobs
  79. 79. PHENOMENAL COSMIC POWERS [7]
  80. 80. prototyping: run slow, unoptimized version on 500 machines for < $100
  81. 81. parameter tuning: try N=1, 2, 5, 10 and see which results in best output
  82. 82. [9]  
  83. 83. questions?
  84. 84. photo credits: [1] by elfike http://www.flickr.com/photos/elfike/157439707/; [2] by Dan4th http://www.flickr.com/photos/43264265@N00/5371557240/; [3] by mandolux http://www.flickr.com/photos/73935252@N00/34418046/; [4] by The Suss-Man http://www.flickr.com/photos/8692813@N06/4580254188/; [5] by Stephen Rees http://www.flickr.com/photos/60142746@N00/214461223/; [6] by Let Ideas Compete http://www.flickr.com/photos/question_everything/3414827746/; [7] by funkandjazz http://www.flickr.com/photos/phunk/2484159004/; [8] by ViaMoi http://www.flickr.com/photos/12187843@N07/3343619603/; [9] by kreg.steppe http://www.flickr.com/photos/spyndle/500305000/; [10] clipart (really); [11] by Chris Pirillo http://www.flickr.com/photos/49503157467@N01/34588230/
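
Slide 14 describes rebuilding full table snapshots from nightly diffs. The merge step might look roughly like the Python sketch below. The primary-key column `id`, the list-of-dicts row representation, and the function name `merge_nightly_diff` are assumptions made for illustration; only `update_date` comes from the slide, and the talk's actual jobs wrote sequence files rather than in-memory lists.

```python
def merge_nightly_diff(snapshot, diff, key="id", updated="update_date"):
    """Return a new full snapshot in which diff rows replace older rows.

    snapshot, diff: iterables of dict rows (hypothetical representation).
    key: assumed primary-key column; updated: last-modified column.
    """
    # Index yesterday's full table by primary key.
    merged = {row[key]: row for row in snapshot}
    for row in diff:
        current = merged.get(row[key])
        # Take the diff row if the key is new or the row is at least as recent.
        if current is None or row[updated] >= current[updated]:
            merged[row[key]] = row
    # Sort by key so the output is deterministic.
    return sorted(merged.values(), key=lambda r: r[key])
```

Rows are compared on `update_date` so that a stale diff row cannot overwrite a newer snapshot row; ISO-formatted date strings compare correctly under this scheme.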

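Slides 15 and 29-30 mention turning event beacons into `[visit_id, [event]]` records and using arbitrary scripts for map and reduce. As a rough illustration of that pattern (not the talk's actual code, which used Cascading for data preparation), the Python sketch below groups events by visit in the map, sort, reduce style that Hadoop streaming scripts follow. The tab-separated `visit_id<TAB>event` input format and all function names are assumptions.

```python
import itertools
import json


def mapper(lines):
    """Emit (visit_id, event) pairs from hypothetical tab-separated lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[0]:
            continue  # skip malformed lines
        yield parts[0], parts[1]


def reducer(pairs):
    """Collect events per visit; input must already be sorted by visit_id,
    as it would be after the framework's shuffle/sort phase."""
    for visit_id, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield visit_id, [event for _, event in group]


def run_local(lines):
    """Simulate map -> sort -> reduce in one process for local testing."""
    shuffled = sorted(mapper(lines), key=lambda kv: kv[0])  # stable sort
    return [json.dumps([vid, events]) for vid, events in reducer(shuffled)]
```

In a real streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output; `run_local` only stands in for the framework's sort step.
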
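
Slides 58 and 64 say that datasets are versioned by job execution time and that old tables are regularly dropped. One way such a scheme could work is sketched below in Python. The `<dataset>_<YYYYMMDDHHMMSS>` naming format, the `keep` retention count, and both function names are hypothetical; the slides do not show the actual naming or retention policy.

```python
from datetime import datetime


def versioned_table_name(dataset, run_time):
    """Build a table name from the dataset name and job execution time."""
    return f"{dataset}_{run_time.strftime('%Y%m%d%H%M%S')}"


def tables_to_drop(table_names, dataset, keep=2):
    """Return versions of `dataset` older than the newest `keep` versions."""
    prefix = dataset + "_"
    # Fixed-width numeric timestamps sort chronologically as strings.
    versions = sorted(
        name for name in table_names
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    if keep <= 0:
        return versions
    return versions[:-keep]
```

Keeping more than one version lets readers finish with the previous table while a new one is being loaded, which is one plausible reason to drop tables on a delay rather than immediately.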