Amazon Elastic MapReduce

       Peter Sirota
Amazon Elastic MapReduce

• Enables customers to easily and cost-effectively process vast amounts of data.
• Utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon.
• Launched in the US in April and in the EU in July of 2009.

Amazon Elastic MapReduce

• Large-scale data processing has a lot of MUCK, and we want to remove it for our customers:
  - Hard to manage compute clusters
  - Hard to tune Hadoop
  - Hadoop issues preventing smooth operation in the cloud

Amazon.com Confidential

Hadoop made simple and easy

Amazon Elastic MapReduce

[Diagram: job flow. Input Data goes into an input S3 bucket in Amazon S3. Deploy Application (Web Console, Command line tools) starts Elastic MapReduce, which runs Hadoop on Amazon EC2 Instances. The instances read the input dataset and write the output results to an output S3 bucket. Remaining steps shown: Notify, Get Results, End.]
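
The flow in the diagram can be sketched as a toy, in-memory model. This is only an illustration of the steps: read from an input bucket, split the work across several Hadoop instances, write results to an output bucket, notify, and release the instances. The function `run_job_flow`, the bucket names, the event labels, and the dict standing in for Amazon S3 are all hypothetical; none of this is the Elastic MapReduce API.

```python
from collections import defaultdict


def run_job_flow(s3, input_bucket, output_bucket, mapper, reducer, num_instances=3):
    """Toy model of one job flow; `s3` is a plain dict standing in for Amazon S3."""
    events = ["STARTED"]  # hypothetical event labels, for illustration only
    records = s3[input_bucket]

    # Split the input dataset across the simulated Hadoop instances.
    shards = [records[i::num_instances] for i in range(num_instances)]

    # Map phase: every instance processes its own shard independently.
    grouped = defaultdict(list)
    for shard in shards:
        for record in shard:
            for key, value in mapper(record):
                grouped[key].append(value)

    # Reduce phase: combine all values that share a key.
    results = {key: reducer(key, values) for key, values in grouped.items()}

    # Write results to the output bucket, then notify and release the instances.
    s3[output_bucket] = results
    events += ["RESULTS_WRITTEN", "NOTIFIED", "INSTANCES_RELEASED"]
    return events


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


s3 = {"input-bucket": ["Hadoop made simple", "Hadoop made easy"]}
events = run_job_flow(s3, "input-bucket", "output-bucket", word_mapper, count_reducer)
```

After the call, the word counts sit under `s3["output-bucket"]`, mirroring the "Get Results" step in the diagram.
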
Amazon Elastic MapReduce
Benefits

Elastic         Uses as many or as few EC2 instances as needed.
                Spin up large or small job flows in minutes.

Easy to use     Get up and running quickly with easy-to-use web
                console, robust command line clients and sample
                jobs. No configuration necessary.

Reliable        Fault-tolerant service built on top of battle-tested
                AWS infrastructure. Automatically retries failed tasks.

Cost Effective  We monitor progress of your jobs and turn off
                resources when job flow is done.

Problems customers solve with Elastic MapReduce

• Data mining (log processing, click stream analysis, similarities, etc.)
• Bioinformatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize JPEGs)
• Web indexing

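To illustrate why Monte Carlo simulation fits the MapReduce model, here is a minimal sketch in which each map task runs independent random trials and a single reduce step combines the partial counts. Estimating pi stands in for a real financial model, and the function names, seeds, and trial counts are assumptions made for this example.

```python
import random


def map_trials(seed, num_trials):
    """One map task: run independent trials and emit (hits, trials)."""
    rng = random.Random(seed)  # a per-task seed keeps tasks independent and repeatable
    hits = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point falls inside the quarter circle
            hits += 1
    return hits, num_trials


def reduce_estimates(partials):
    """Reduce step: sum hits and trials, then scale the hit ratio to estimate pi."""
    total_hits = sum(hits for hits, _ in partials)
    total_trials = sum(trials for _, trials in partials)
    return 4.0 * total_hits / total_trials


# Four simulated map tasks; on a cluster each would run on a separate instance.
partials = [map_trials(seed, 50_000) for seed in range(4)]
pi_estimate = reduce_estimates(partials)
```

Because the tasks share no state, adding instances only means adding more `(seed, num_trials)` pairs to the map phase.
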
Customer Feedback

• Pros:
  - Amazon Elastic MapReduce makes it easy to run Hadoop applications.
  - Reliable platform for production data processing
• Challenges:
  - Simple tasks such as log processing require fluency in MapReduce
  - Hadoop applications are difficult to develop

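The first challenge above can be made concrete with a small sketch: even counting HTTP status codes in an access log has to be expressed as a mapper, a sort that mimics the shuffle, and a reducer. The log layout (Common Log Format with the status code in the ninth whitespace-separated field), the sample lines, and the function names are assumptions for illustration; the code runs locally and does not use Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(log_line):
    """Emit (status_code, 1) for one access-log line; skip malformed lines."""
    fields = log_line.split()
    if len(fields) >= 9:
        yield fields[8], 1  # ninth field holds the HTTP status in this layout


def reducer(status, counts):
    """Sum the counts for one status code."""
    return status, sum(counts)


def run(lines):
    """Map every line, sort by key to mimic the shuffle, then reduce each group."""
    pairs = sorted(pair for line in lines for pair in mapper(line))
    return dict(
        reducer(status, (count for _, count in group))
        for status, group in groupby(pairs, key=itemgetter(0))
    )


sample_log = [
    '10.0.0.1 - - [02/Oct/2009:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [02/Oct/2009:10:00:01 -0700] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [02/Oct/2009:10:00:02 -0700] "GET /about.html HTTP/1.1" 200 1024',
    "malformed line",
]
status_counts = run(sample_log)
```

A higher-level language can express the same aggregation as a single grouped count, which is the gap the Pig and Hive support on the following slides is meant to close.
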
New Features

• Support for Apache Pig (August 2009)
  - Batch and interactive mode
  - Concurrent access to multiple file systems
  - Loading resources from Amazon S3
  - Additional Piggybank functions
  - Integration with Elastic MapReduce Client and Web Console

New Features

• Support for Apache Hive 0.4 (today)
  - Batch and interactive mode
  - Integration with Elastic MapReduce Client and Web Console
  - Additions to Hive
    * Load table partitions automatically from Amazon S3
    * Specify an off-instance metadata store
    * Optimized data writes to Amazon S3
    * Reference resources on Amazon S3

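As a rough illustration of what loading table partitions from Amazon S3 involves, the sketch below derives partition specs from object keys, assuming the data follows Hive's `name=value` directory layout. The function `discover_partitions`, the key names, and the table prefix are hypothetical; this is not how Elastic MapReduce or Hive implements the feature.

```python
def discover_partitions(keys, table_prefix):
    """Return sorted, de-duplicated partition specs found under table_prefix."""
    partitions = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue  # the object belongs to a different table
        relative = key[len(table_prefix):]
        # Directory components look like "name=value"; the last component is the file.
        spec = tuple(
            tuple(part.split("=", 1))
            for part in relative.split("/")[:-1]
            if "=" in part
        )
        if spec:
            partitions.add(spec)
    return sorted(partitions)


# Hypothetical object keys, as a bucket listing might return them.
keys = [
    "tables/impressions/dt=2009-10-01/part-00000",
    "tables/impressions/dt=2009-10-01/part-00001",
    "tables/impressions/dt=2009-10-02/part-00000",
    "tables/clicks/dt=2009-10-01/part-00000",
]
partitions = discover_partitions(keys, "tables/impressions/")
```

Each resulting spec, such as `(("dt", "2009-10-01"),)`, corresponds to one partition that would otherwise have to be registered by hand.
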
Amazon Elastic MapReduce Ecosystem

• Karmasphere Studio for Hadoop: NetBeans IDE for development, debugging, deployment and management of Hadoop jobs
  - Deploy Hadoop jobs to Elastic MapReduce
  - Monitor progress of Elastic MapReduce job flows
  - Amazon S3 file browser
  - Elastic MapReduce HDFS browser

Amazon Elastic MapReduce Ecosystem

• Support for Cloudera's Hadoop distribution (private beta)
  - Optionally use Cloudera's Hadoop while executing Elastic MapReduce job flows
  - Get support from Cloudera for the Elastic MapReduce job flows

Q&A

Hw09 Making Hadoop Easy On Amazon Web Services
