An Introduction to MapReduce 2 and YARN

Tom White, Cloudera
@tom_e_white
June 4, 2012
Chicago HUG
Tuesday, June 5, 2012
Road Trip




About me
• Apache Hadoop Committer, PMC Member, Apache Member
• Engineer at Cloudera working on core Hadoop
• Founder of Apache Whirr
• Author of “Hadoop: The Definitive Guide”
  • http://hadoopbook.com

First, what's MapReduce 1?


What's wrong with MR1?


Motivation 1
• Scaling beyond 4,000 nodes
• Fewer, larger clusters



Motivation 2
• High availability (HA) of the Job Tracker
• Large, complex state



Motivation 3
• Poor resource utilization
• Slots in MR1 are for either map or reduce
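
A rough illustration of why typed slots waste capacity. This is a hypothetical model, not Hadoop code: the class name, method names, and slot counts are assumptions chosen for the example. It counts how many tasks can run on one node during a map-heavy phase, first with separate map and reduce slots, then with the same capacity pooled.

```java
// Hypothetical model, not Hadoop code: typed slots versus pooled capacity.
final class SlotUtilization {

    // MR1-style node: map tasks may only use map slots,
    // reduce tasks may only use reduce slots.
    static int runningWithTypedSlots(int mapSlots, int reduceSlots,
                                     int pendingMaps, int pendingReduces) {
        return Math.min(mapSlots, pendingMaps)
             + Math.min(reduceSlots, pendingReduces);
    }

    // Pooled capacity: any free unit can host either kind of task.
    static int runningWithPooledCapacity(int totalUnits,
                                         int pendingMaps, int pendingReduces) {
        return Math.min(totalUnits, pendingMaps + pendingReduces);
    }

    public static void main(String[] args) {
        int mapSlots = 4, reduceSlots = 2;        // assumed per-node configuration
        int pendingMaps = 10, pendingReduces = 0; // map-heavy phase of a job

        // Typed slots: the 2 reduce slots sit idle, so only 4 tasks run.
        System.out.println(runningWithTypedSlots(
                mapSlots, reduceSlots, pendingMaps, pendingReduces));

        // Pooled capacity: all 6 units can be used by map tasks.
        System.out.println(runningWithPooledCapacity(
                mapSlots + reduceSlots, pendingMaps, pendingReduces));
    }
}
```

With these assumed numbers the typed-slot node runs 4 tasks while the pooled node runs 6; the gap is the idle reduce capacity.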


Yet Another Resource Negotiator




Node Manager is a generalized Task Tracker
• Task Tracker
  • fixed number of map or reduce slots
• Node Manager
  • containers with variable resource limits
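
To make "containers with variable resource limits" concrete, here is a minimal sketch. It is a hypothetical model and does not use the YARN API: the class `NodeModel`, its methods, and the memory figures are assumptions for illustration. The node tracks a single resource (memory) and grants a container request only if it fits in what remains.

```java
// Hypothetical model, not the YARN API: a node that grants
// containers of varying size out of a fixed memory budget.
final class NodeModel {
    private final int capacityMb;
    private int usedMb = 0;

    NodeModel(int capacityMb) {
        this.capacityMb = capacityMb;
    }

    // Grants the request only if it fits in the remaining memory.
    boolean allocate(int requestMb) {
        if (requestMb <= 0 || usedMb + requestMb > capacityMb) {
            return false;
        }
        usedMb += requestMb;
        return true;
    }

    // Returns memory to the node when a container finishes.
    void release(int mb) {
        usedMb = Math.max(0, usedMb - mb);
    }

    int freeMb() {
        return capacityMb - usedMb;
    }

    public static void main(String[] args) {
        NodeModel node = new NodeModel(8192);    // assumed 8 GB node
        System.out.println(node.allocate(1024)); // small container fits
        System.out.println(node.allocate(4096)); // large container fits
        System.out.println(node.allocate(4096)); // rejected: only 3072 MB left
        System.out.println(node.freeMb());
    }
}
```

Unlike a fixed slot count, the number of containers a node hosts here depends on the size of each request.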

MR is user space
YARN is kernel


Bonus Apps
• Distributed shell
• MPI (MAPREDUCE-2911)
• Master-worker (MAPREDUCE-3315)
• Apache Giraph, Hama

Old API ≠ MR1
New API ≠ MR2


         Old API          New API
         o.a.h.mapred     o.a.h.mapreduce
MR1      ✓                ✓
MR2      ✓                ✓

Try out MR2
• Apache Hadoop 2.0.0-alpha
  • hadoop.apache.org
• CDH4 and Cloudera Manager
  • cloudera.com
• Cloud - Apache Whirr

MR1
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>1.0.3</version>
   </dependency>

MR2
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>2.0.0-alpha</version>
   </dependency>

TODO
• Still alpha status
• Performance tuning
• Usability bug fixes
• RM recovery
• Security in MR2 not complete

Questions?


