Apache Hadoop MapReduce: What's Next

Apache Hadoop has made giant strides since the last Hadoop Summit: the community released hadoop-1.0 after nearly 6 years and is now on the cusp of Hadoop.next (think of it as hadoop-2.0). With the next generation of MapReduce out in 0.23.0 and 0.23.1, the community has requested a new set of features. This talk covers those features, such as preemption, web services, and near-real-time analysis, and how we are working to tackle them in the near future. It also covers the roadmap for next-generation MapReduce and the release schedule for Apache Hadoop.



Apache Hadoop MapReduce: What's Next Presentation Transcript

  • 1. Apache Hadoop MapReduce: What is next? Arun C. Murthy, Founder & Architect, @acmurthy (@hortonworks)
  • 2. Hello! I’m Arun
      • Founder/Architect at Hortonworks Inc.
          – Lead, Map-Reduce
          – Formerly, Architect Hadoop MapReduce, Yahoo
          – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint)
      • Apache Hadoop, ASF
          – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
          – Long-term Committer/PMC member (full time >6 years)
          – Release Manager for hadoop-2
  • 3. Agenda
      • Yesterday: Hadoop MapReduce, circa 2011
      • Today: Hadoop YARN
          – Overview
          – State of the art
      • Art of the possible
          – YARN Runtime
          – MapReduce Framework
      • Q&A
  • 4. Hadoop MapReduce, circa 2011
  • 5. Hadoop MapReduce Classic
      • JobTracker
          – Manages cluster resources and job scheduling
      • TaskTracker
          – Per-node agent
          – Manages tasks
  • 6. Current Limitations
      • Utilization
      • Scalability
          – Maximum cluster size: 4,000 nodes
          – Maximum concurrent tasks: 40,000
          – Coarse synchronization in JobTracker
      • Single point of failure
          – Failure kills all queued and running jobs
          – Jobs restarted on bounce
  • 7. Current Limitations
      • Hard partition of resources into map and reduce slots
          – Low resource utilization
      • Lacks support for alternate paradigms
          – Iterative applications implemented using MapReduce are 10x slower
          – Hacks for the likes of MPI/Graph Processing
      • Lack of wire-compatible protocols
          – Client and cluster must be of same version
          – Applications and workflows cannot migrate to different clusters
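The "hard partition" limitation can be illustrated with a little arithmetic. The sketch below is a toy model, not Hadoop code: `SlotNode` and `slot_utilization` are hypothetical names, and the slot counts are made-up example values. It shows how a node with a fixed split of map and reduce slots leaves capacity idle whenever the workload is all maps or all reduces.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """Toy model of a classic node with a fixed map/reduce slot split."""
    map_slots: int
    reduce_slots: int


def slot_utilization(node: SlotNode, pending_maps: int, pending_reduces: int) -> float:
    """Fraction of the node's slots that the pending tasks can keep busy."""
    busy_maps = min(node.map_slots, pending_maps)
    busy_reduces = min(node.reduce_slots, pending_reduces)
    return (busy_maps + busy_reduces) / (node.map_slots + node.reduce_slots)


node = SlotNode(map_slots=8, reduce_slots=4)  # example values, not from the talk

# Map-heavy phase: many map tasks waiting, no reduce task ready yet.
map_phase = slot_utilization(node, pending_maps=100, pending_reduces=0)
# Reduce-heavy phase: maps are done, only reduce tasks remain.
reduce_phase = slot_utilization(node, pending_maps=0, pending_reduces=100)

print(f"map phase: {map_phase:.2f}, reduce phase: {reduce_phase:.2f}")
```

In this model a map slot can never run a reduce task (or the reverse), so utilization is capped by the slot split however much work is waiting.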
  • 8. Hadoop YARN: Overview
  • 9. Requirements
      • Reliability
      • Availability
      • Utilization
      • Wire Compatibility
      • Agility & Evolution
          – Ability for customers to control upgrades to the grid software stack
      • Scalability: clusters of 6,000-10,000 machines
          – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
          – 100,000+ concurrent tasks
          – 10,000 concurrent jobs
  • 10. Design Centre
      • Split up the two major functions of JobTracker
          – Cluster resource management
          – Application life-cycle management
      • MapReduce becomes user-land library
  • 11. Architecture
      • Application
          – Application is a job submitted to the framework
          – Example: MapReduce job
      • Container
          – Basic unit of allocation
          – Example: container A = 2GB, 1CPU
          – Replaces the fixed map/reduce slots
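As a rough illustration of a container as the "basic unit of allocation", the sketch below models a node that grants requests of arbitrary size as long as they fit in every resource dimension. `Resource`, `Node`, and `try_allocate` are hypothetical names for this sketch and are not the YARN API; the 8 GB / 4 core node capacity is an assumed example value, while "container A = 2GB, 1CPU" comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


class Node:
    """Toy node that hands out containers until memory or CPU runs out."""

    def __init__(self, capacity: Resource):
        self.capacity = capacity
        self.used = Resource(0, 0)

    def try_allocate(self, request: Resource) -> bool:
        mem = self.used.memory_mb + request.memory_mb
        cpu = self.used.vcores + request.vcores
        # A request is granted only if it fits in both dimensions.
        if mem > self.capacity.memory_mb or cpu > self.capacity.vcores:
            return False
        self.used = Resource(mem, cpu)
        return True


node = Node(Resource(memory_mb=8192, vcores=4))   # assumed node size
container_a = Resource(memory_mb=2048, vcores=1)  # "container A = 2GB, 1CPU"
bigger = Resource(memory_mb=4096, vcores=2)       # a differently sized request

granted = [node.try_allocate(r) for r in (container_a, bigger, container_a, container_a)]
print(granted, node.used)
```

Unlike fixed map and reduce slots, nothing here cares what the container will run; the only question is whether the request fits.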
  • 12. Architecture
      • Resource Manager
          – Global resource scheduler
          – Hierarchical queues
      • Node Manager
          – Per-machine agent
          – Manages the life-cycle of container
          – Container resource monitoring
      • Application Master
          – Per-application
          – Manages application scheduling and task execution
          – E.g. MapReduce Application Master
  • 13. Architecture [diagram: Clients, Resource Manager, Node Managers, App Mstrs and Containers, connected by arrows labeled MapReduce Status, Job Submission, Node Status and Resource Request]
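The division of labor described in the architecture slides can be sketched as a small simulation. This is a simplified model under stated assumptions, not the real YARN protocol: the class and method names (`allocate`, `submit`, `run`, `launch`) are invented for the example, containers are counted rather than sized, and placement is "first node with free capacity".

```python
class NodeManager:
    """Per-machine agent: launches and tracks containers on one node."""

    def __init__(self, name, free_containers):
        self.name = name
        self.free = free_containers
        self.running = []

    def launch(self, what):
        self.free -= 1
        self.running.append(what)


class ResourceManager:
    """Global scheduler: only hands out containers, knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, what):
        """Grant one container on the first node with free capacity, else None."""
        for nm in self.nodes:
            if nm.free > 0:
                nm.launch(what)
                return nm.name
        return None

    def submit(self, app_name, num_tasks):
        # Job submission: start the per-application master in its own container.
        if self.allocate(f"{app_name}/AppMaster") is None:
            raise RuntimeError("no capacity for the Application Master")
        return ApplicationMaster(app_name, num_tasks, self)


class ApplicationMaster:
    """Per-application: requests containers and decides what runs in them."""

    def __init__(self, app_name, num_tasks, rm):
        self.app_name, self.num_tasks, self.rm = app_name, num_tasks, rm

    def run(self):
        # Resource requests go to the Resource Manager, one per task.
        return [self.rm.allocate(f"{self.app_name}/task-{i}")
                for i in range(self.num_tasks)]


rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
am = rm.submit("wordcount", num_tasks=3)
placements = am.run()
print(placements)
```

The point of the split is visible in the code: `ResourceManager` never inspects task names or counts, so a different `ApplicationMaster` could run a non-MapReduce workload on the same cluster.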
  • 14. How do I get it?
      • Available in hadoop-2.0.0-alpha release
  • 15. Performance
      • 2x+ across the board
      • MapReduce
          – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
          – Shuffle 30%+
          – Merge improvements
          – Small Jobs – Uber AM
          – Re-use task slots (containers)
      More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/
  • 16. Resources
      • hadoop-2.0.0 (alpha release): http://hadoop.apache.org/common/releases.html
      • Release Documentation: http://hadoop.apache.org/common/docs/r2.0.0-alpha/
  • 17. Art of the possible: YARN Runtime, MapReduce Framework
  • 18. Looking ahead
      • YARN
          – Runtime Improvements
          – Alternate programming models
          – Long(er) running services
      • MapReduce
          – Framework enhancements
          – Unpack!
  • 19. YARN - Roadmap
      • Scheduler
          – Multi-dimensional resource scheduling (MAPREDUCE-4327)
          – Preemption (MAPREDUCE-3938)
          – Gang scheduling
      • Runtime improvements
          – Container Isolation (MAPREDUCE-4334)
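The preemption item on the roadmap can be pictured with a toy policy. The sketch below is an assumption-laden illustration, not the design in MAPREDUCE-3938: `select_preemptions` is a hypothetical function, queues are described only by container counts, and the policy (reclaim from the queue furthest over its guarantee, never pushing a queue below its guarantee) is one simple choice among many.

```python
def select_preemptions(usage, guarantee, demand_queue, needed):
    """Return a list of (queue, containers_to_preempt) pairs.

    usage and guarantee map queue name -> container count. The starved queue
    can reclaim at most the gap between its guarantee and its current usage.
    """
    shortfall = min(needed, guarantee[demand_queue] - usage[demand_queue])
    # Candidate victims: queues running above their guaranteed share,
    # most over-capacity first.
    over = sorted(
        (q for q in usage if q != demand_queue and usage[q] > guarantee[q]),
        key=lambda q: usage[q] - guarantee[q],
        reverse=True,
    )
    plan = []
    for q in over:
        if shortfall <= 0:
            break
        take = min(shortfall, usage[q] - guarantee[q])
        plan.append((q, take))
        shortfall -= take
    return plan


# Example values, made up for illustration.
usage = {"prod": 2, "adhoc": 14, "research": 4}
guarantee = {"prod": 10, "adhoc": 6, "research": 4}

plan = select_preemptions(usage, guarantee, demand_queue="prod", needed=5)
print(plan)
```

Here "adhoc" has borrowed idle capacity beyond its guarantee, so it is the only queue asked to give containers back; "research" sits exactly at its guarantee and is left alone.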
  • 20. YARN - Data Processing Applications
      • OpenMPI on Hadoop
      • Spark (UC Berkeley)
          – Shark is Hive-on-Spark
      • Real-time data processing
          – Storm (Twitter)
          – Apache S4
      • Graph processing
          – Apache Giraph
  • 21. YARN - Beyond Data Processing Apps
      • Apache HBase
          – Deployment via YARN (HBASE-4329)
          – Co-processors via YARN (HBASE-4047)
      • Simple deployment for cluster services
  • 22. MapReduce – Way Forward
      • MapReduce Framework Runtime
          – Monolithic software
      • MR Runtime?
          – Sort, Merge, Shuffle et al
      • Unpack into smaller building blocks!
          – Allow applications and Pig/Hive to ‘plug-n-play’
          – MR framework, as we know today, becomes a particular configuration of the building blocks
  • 23. MapReduce – Pluggable Sort
      • Pig & Hive benefit from hash-based aggregation
          – Several queries don’t need full-sort of map-outputs
          – Aggregation suffices
          – Allow for pluggable MapOutputBuffer in MapTask
          – Sort Avoidance – MAPREDUCE-4039
          – External sort plugin – MAPREDUCE-2454
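Why "aggregation suffices" for some queries can be seen by computing the same per-key sum two ways. This is a plain-Python sketch of the idea, not the MapTask code path; `sort_based_sum` and `hash_based_sum` are hypothetical names and the input pairs are invented.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Classic path: fully sort map outputs by key, then sum each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(ordered, key=itemgetter(0))}


def hash_based_sum(pairs):
    """Sort-avoiding path: keep a running total per key in a hash table."""
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return dict(totals)


pairs = [("pig", 1), ("hive", 1), ("pig", 1), ("yarn", 1), ("hive", 1), ("pig", 1)]

by_sort = sort_based_sum(pairs)
by_hash = hash_based_sum(pairs)
print(by_hash)
```

Both functions return the same totals, but the hash-based version makes one pass and never orders the keys, which is the work a GROUP BY style query does not need.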
  • 24. MapReduce – Pluggable Shuffle
      • Push vs. pull shuffle
      • Pluggable shuffle implementation (already in hadoop-2)
          – E.g. RDMA for shuffle – MAPREDUCE-4049
      • Collation tasks
          – Sailfish - Yahoo Research (includes auto-tuning of reduces)
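The push versus pull distinction comes down to who drives the data movement. The following sketch models both in memory under simplifying assumptions: `partition`, `pull_shuffle`, and `push_shuffle` are hypothetical names, there is no network or disk, and the partitioner is a deterministic toy (sum of character codes) rather than a real hash.

```python
def partition(key, num_reducers):
    # Toy deterministic partitioner; real frameworks hash the key.
    return sum(ord(c) for c in key) % num_reducers


def pull_shuffle(map_outputs, num_reducers):
    """Reducer-driven: each reducer fetches its partition from every map output."""
    inboxes = [[] for _ in range(num_reducers)]
    for r in range(num_reducers):
        for output in map_outputs:
            inboxes[r].extend(kv for kv in output
                              if partition(kv[0], num_reducers) == r)
    return inboxes


def push_shuffle(map_outputs, num_reducers):
    """Mapper-driven: each mapper sends every record to the owning reducer."""
    inboxes = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for kv in output:
            inboxes[partition(kv[0], num_reducers)].append(kv)
    return inboxes


map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
pulled = pull_shuffle(map_outputs, num_reducers=2)
pushed = push_shuffle(map_outputs, num_reducers=2)
print(pulled)
```

Both strategies deliver the same records to the same reducers; they differ in who initiates the transfer, which is why the shuffle can be treated as a swappable component.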
  • 25. MapReduce – More ideas
      • Allow for Map-Reduce-Reduce
          – Allow for reduce output to be sorted/shuffled
          – JOIN followed by ORDER BY
          – Really big deal for Pig/Hive
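A JOIN followed by ORDER BY is a natural fit for a Map-Reduce-Reduce pipeline: the first reduce performs the join, and its output is shuffled again straight into a second reduce that orders the rows. The sketch below imitates that flow in memory; the `shuffle` helper and the sample `users` and `orders` tables are made up for illustration and do not reflect any Hadoop API.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and return the groups in key order, like a sorted shuffle."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())


users = [(1, "ann"), (2, "bob")]      # (user_id, name)
orders = [(1, 30), (2, 10), (1, 20)]  # (user_id, amount)

# Map: tag each record with its source table and emit it under the join key.
mapped = ([(uid, ("user", name)) for uid, name in users]
          + [(uid, ("order", amt)) for uid, amt in orders])

# Reduce 1 (JOIN): pair each user name with that user's order amounts,
# emitting the amount as the key for the next stage.
joined = []
for uid, tagged in shuffle(mapped):
    names = [v for tag, v in tagged if tag == "user"]
    amounts = [v for tag, v in tagged if tag == "order"]
    joined.extend((amt, name) for name in names for amt in amounts)

# Reduce 2 (ORDER BY amount): the reduce output is shuffled again directly,
# with no intermediate identity-map job in between.
ordered = [(name, amt) for amt, names in shuffle(joined) for name in names]
print(ordered)
```

With only Map-Reduce available, the second stage would need a separate job whose map phase does nothing except re-read and re-emit the joined rows.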
  • 26. MapReduce – How do we get there?
      • Multiple, concurrent implementations of MapReduce
          – YARN is a really big deal…
          – Allows for safe experiments, much less risky!
          – Exposure surface is highly limited
  • 27. Questions? Thank you. @acmurthy