
Apache Hadoop MapReduce - What Next? Hadoop Summit 2012


Apache Hadoop has made giant strides in the last 12 months: the community released hadoop-1.0 after nearly 6 years, and now hadoop-2. In hadoop-2, Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. This recasts Hadoop as a much more powerful data-processing system than it was 12 months ago. This talk covers the next set of features, such as preemption, web services, and near-real-time analysis, and how we are working on them in the near future. It also covers the roadmap for the MapReduce framework itself.

Published in: Technology, Education

  1. Apache Hadoop MapReduce - What next? Arun C. Murthy, Founder & Architect, @acmurthy (@hortonworks)
  2. Hello! I’m Arun
     • Founder/Architect at Hortonworks Inc.
       – Lead, Map-Reduce
       – Formerly, Architect Hadoop MapReduce, Yahoo
       – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint)
     • Apache Hadoop, ASF
       – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
       – Long-term Committer/PMC member (full time >6 years)
       – Release Manager for hadoop-2
  3. Agenda
     • Hadoop MapReduce, State of the Art
     • Hadoop YARN
       – Overview
       – State of the art
     • Art of the possible
       – YARN Runtime
       – MapReduce Framework
     • Q&A
  4. Hadoop MapReduce: State of the Art
  5. Hadoop MapReduce Classic
     • JobTracker
       – Manages cluster resources and job scheduling
     • TaskTracker
       – Per-node agent
       – Manages tasks
  6. Hadoop 1 – Enterprise Ready
     • Hadoop 1.x is the most stable & reliable version of Hadoop MapReduce ever
       – Proven to be reliable at the most demanding Hadoop clusters in the world
     • CapacityScheduler for Multi-tenancy
       – Share clusters at scale
       – Resource & User limits for fine-grained
       – Queue & Job ACLs
       – Resilient to misbehaving/rogue applications, users etc., helping drive SLA for applications, pipelines etc.
  7. Hadoop 1 – Availability for MR
     • JobTracker Restart
       – Enhanced to restart all jobs on rare JT failures
     • JobTracker Safemode
       – Admin driven for known issues
       – Auto-monitoring of HDFS for full-stack availability
  8. Hadoop YARN: Overview & Status Quo
  9. MapReduce - Areas for Improvement
     • Utilization
     • Scalability
       – Maximum Cluster size – 4,000 nodes
       – Maximum concurrent tasks – 40,000
     • Hard partition of resources into map and reduce slots
     • Lacks support for alternate paradigms
     • Lack of wire-compatible protocols
  10. Requirements
     • Reliability
     • Availability
     • Utilization
     • Wire Compatibility
     • Agility & Evolution
       – Ability for customers to control upgrades to the grid software stack.
     • Scalability - Clusters of 6,000-10,000 machines
       – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
       – 100,000+ concurrent tasks
       – 10,000 concurrent jobs
  11. Design Centre
     • Split up the two major functions of JobTracker
       – Cluster resource management
       – Application life-cycle management
     • MapReduce becomes user-land library
  12. Concepts
     • Application
       – Application is a job submitted to the framework
       – Example – Map Reduce Job
     • Container
       – Basic unit of allocation
       – Example – container A = 2GB, 1CPU
       – Replaces the fixed map/reduce slots
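The move from fixed map/reduce slots to containers can be illustrated with a toy model. This is a minimal sketch, not YARN code: the node size, slot counts, and function names are hypothetical, chosen only to show how a hard partition strands capacity when the pending work is all maps.

```python
from dataclasses import dataclass


@dataclass
class Node:
    memory_gb: int
    cpus: int


def runnable_with_slots(map_slots: int, reduce_slots: int,
                        pending_maps: int, pending_reduces: int) -> int:
    """Slot model: a task can only run in a slot of its own type."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def runnable_with_containers(node: Node, container_gb: int, container_cpus: int,
                             pending_tasks: int) -> int:
    """Container model: any task fits wherever memory and CPU are available."""
    by_memory = node.memory_gb // container_gb
    by_cpu = node.cpus // container_cpus
    return min(by_memory, by_cpu, pending_tasks)


# Hypothetical node and workload: 16 GB, 8 cores, 8 pending maps, no reduces.
node = Node(memory_gb=16, cpus=8)
slots = runnable_with_slots(map_slots=4, reduce_slots=4,
                            pending_maps=8, pending_reduces=0)
containers = runnable_with_containers(node, container_gb=2, container_cpus=1,
                                      pending_tasks=8)
print(slots, containers)
```

With these assumed numbers the slot model leaves the four reduce slots idle, while the container model can use the whole node for whatever work is pending.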
  13. Architecture
     • Resource Manager
       – Global resource scheduler
       – Hierarchical queues
     • Node Manager
       – Per-machine agent
       – Manages the life-cycle of container
       – Container resource monitoring
     • Application Master
       – Per-application
       – Manages application scheduling and task execution
       – E.g. MapReduce Application Master
  14. Architecture (diagram). Components shown: Clients, the Resource Manager, and Node Managers hosting App Masters and Containers. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.
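The division of labour among the three components on the architecture slides can be sketched as a toy simulation. This is an illustration under stated assumptions, not the YARN API: the class and method names are hypothetical, only memory is tracked, and placement is a simple first-fit.

```python
class NodeManager:
    """Per-machine agent: tracks free memory and launches containers."""

    def __init__(self, name, memory_gb):
        self.name = name
        self.free_gb = memory_gb
        self.running = []

    def launch(self, app_id, memory_gb):
        self.free_gb -= memory_gb
        self.running.append((app_id, memory_gb))


class ResourceManager:
    """Global scheduler: knows the nodes, grants containers on request."""

    def __init__(self):
        self.nodes = []

    def register(self, node):
        self.nodes.append(node)

    def allocate(self, app_id, memory_gb):
        # First-fit placement (an assumption made to keep the sketch short).
        for node in self.nodes:
            if node.free_gb >= memory_gb:
                node.launch(app_id, memory_gb)
                return node.name
        return None  # nothing free; the caller must retry later


class ApplicationMaster:
    """Per-application: asks the RM for containers and tracks its own tasks."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.placements = []

    def run_tasks(self, count, memory_gb):
        for _ in range(count):
            where = self.rm.allocate(self.app_id, memory_gb)
            if where is None:
                break
            self.placements.append(where)
        return self.placements


rm = ResourceManager()
rm.register(NodeManager("node-1", memory_gb=4))
rm.register(NodeManager("node-2", memory_gb=4))
am = ApplicationMaster("job-1", rm)
placements = am.run_tasks(count=3, memory_gb=2)
print(placements)
```

The sketch keeps application logic (how many tasks, what size) inside the Application Master, so the Resource Manager only answers allocation requests, which is the split the Design Centre slide describes.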
  15. How do I get it?
     • Available in hadoop-2.0.0-alpha release
  16. Performance
     • 2x+ across the board (HDFS, YARN, MapReduce)
     • MapReduce
       – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
         – Shuffle 30%+
         – Merge improvements
       – Small Jobs
         – Uber AM
       – Re-use task slots (containers)
     http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/
  17. Resources
     hadoop-2.0.0 (alpha release): http://hadoop.apache.org/common/releases.html
     Release Documentation: http://hadoop.apache.org/common/docs/r2.0.0-alpha/
  18. Art of the possible: YARN Runtime, MapReduce Framework
  19. Looking ahead
     • YARN
       – Runtime Improvements
       – Alternate programming models
       – Long(er) running services
     • MapReduce
       – Framework enhancements
       – Unpack!
  20. YARN - Roadmap
     • Scheduler
       – Multi-dimensional resource scheduling (MAPREDUCE-4327)
       – Preemption (MAPREDUCE-3938)
       – Gang scheduling
     • Runtime improvements
       – Container Isolation (MAPREDUCE-4334)
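Multi-dimensional resource scheduling means a request is admitted only when every resource it names has room, rather than memory alone. The following is a minimal sketch of that idea, not the design in MAPREDUCE-4327: the dimension names, the greedy in-order policy, and the sample requests are all assumptions.

```python
def fits(free, request):
    """A request fits only if every resource dimension has room."""
    return all(free[dim] >= amount for dim, amount in request.items())


def place(free, requests):
    """Greedily admit requests in order, charging every dimension."""
    admitted = []
    remaining = dict(free)
    for name, request in requests:
        if fits(remaining, request):
            for dim, amount in request.items():
                remaining[dim] -= amount
            admitted.append(name)
    return admitted, remaining


# Hypothetical node with 8 GB and 2 cores free.
node_free = {"memory_gb": 8, "cpus": 2}
requests = [
    ("cpu-heavy", {"memory_gb": 1, "cpus": 2}),
    ("mem-heavy", {"memory_gb": 6, "cpus": 1}),
    ("small", {"memory_gb": 1, "cpus": 0}),
]
admitted, remaining = place(node_free, requests)
print(admitted, remaining)
```

A scheduler that looked at memory only would also admit `mem-heavy` here and oversubscribe the CPUs; checking both dimensions rejects it while still admitting `small`.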
  21. YARN - Data Processing Applications
     • OpenMPI on Hadoop
     • Spark (UC Berkeley)
       – Shark is Hive-on-Spark
     • Real-time data processing
       – Storm (Twitter)
       – Apache S4
     • Graph processing
       – Apache Giraph
  22. YARN - Beyond Data Processing Apps
     • Apache HBase
       – Deployment via YARN (HBASE-4329)
       – Co-processors via YARN (HBASE-4047)
     • Simple deployment for cluster services
  23. MapReduce – Way Forward
     • MapReduce Framework Runtime
       – Monolithic software
     • MR Runtime?
       – Sort, Merge, Shuffle et al
     • Unpack into smaller building blocks!
       – Allow applications and Pig/Hive to ‘plug-n-play’
       – MR framework, as we know it today, becomes a particular configuration of the building blocks
  24. MapReduce – Pluggable Sort
     • Pig & Hive benefit from hash-based aggregation
       – Several queries don’t need full-sort of map-outputs
       – Aggregation suffices
       – Allow for pluggable MapOutputBuffer in MapTask
       – Sort Avoidance - MAPREDUCE-4039
       – External sort plugin - MAPREDUCE-2454
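Why aggregation queries do not need a full sort of map output can be shown with a small comparison. This is a sketch in plain Python, not the MapOutputBuffer interface: the function names and sample records are hypothetical, and the sort-based path only mimics what a sort-then-group pipeline does.

```python
from collections import defaultdict
from itertools import groupby


def aggregate_with_sort(records):
    """Classic path: sort all map output by key, then sum each run."""
    ordered = sorted(records, key=lambda kv: kv[0])
    return {key: sum(v for _, v in group)
            for key, group in groupby(ordered, key=lambda kv: kv[0])}


def aggregate_with_hash(records):
    """Sort-avoiding path: one pass, a hash table keyed by group."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


map_output = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]
sorted_result = aggregate_with_sort(map_output)
hashed_result = aggregate_with_hash(map_output)
print(sorted_result == hashed_result, hashed_result)
```

Both paths produce the same totals; the hash path simply never pays for ordering the keys, which is the saving a pluggable map-output buffer would let Pig and Hive claim.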
  25. MapReduce – Pluggable Shuffle
     • Push vs. Pull shuffle
     • Plug shuffle implementation (already in hadoop-2)
       – E.g. RDMA for shuffle
       – MAPREDUCE-4049
     • Collation tasks
       – Sailfish - Yahoo Research (includes auto-tuning of reduces)
  26. MapReduce – More ideas
     • Allow for Map-Reduce-Reduce
       – Allow for reduce output to be sorted/shuffled
         – JOIN followed by ORDER BY
         – Really big deal for Pig/Hive
     • DAG Management for Pig/Hive
       – Scheduling improvements
       – Restart semantics
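The JOIN followed by ORDER BY case can be sketched as a reduce whose output is shuffled straight into a second reduce. This is a toy, in-memory illustration under stated assumptions, not a proposed Hadoop API: `shuffle`, `join_reduce`, `order_reduce`, and the sample tables are all hypothetical.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and hand groups out in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def join_reduce(user_id, tagged):
    """First reduce: join a user's name with that user's order totals."""
    names = [v for tag, v in tagged if tag == "user"]
    amounts = [v for tag, v in tagged if tag == "order"]
    for name in names:
        yield (sum(amounts), name)  # re-keyed by total for the next stage


def order_reduce(total, names):
    """Second reduce: keys arrive sorted, so emitting them gives ORDER BY."""
    for name in sorted(names):
        yield (name, total)


# Hypothetical inputs, already keyed by user id and tagged by source table.
users = [(1, ("user", "ann")), (2, ("user", "bob"))]
orders = [(1, ("order", 30)), (2, ("order", 5)), (1, ("order", 10))]

joined = [out for key, vals in shuffle(users + orders)
          for out in join_reduce(key, vals)]
ranked = [out for key, vals in shuffle(joined)
          for out in order_reduce(key, vals)]
print(ranked)
```

In a strict map-then-reduce model the second stage would be a separate job with its own map phase in front of it; letting reduce output feed a shuffle directly removes that extra step.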
  27. MapReduce – How do we get there?
     • Multiple, concurrent implementations of MapReduce
       – YARN is a really big deal…
       – Allows for safe experiments, much less risky!
       – Exposure surface is highly limited
  28. Questions? Thank You. @acmurthy