
Apache Hadoop MapReduce: What's Next


Apache Hadoop has made giant strides since the last Hadoop Summit: the community has released hadoop-1.0 after nearly 6 years and is now on the cusp of the next release (think of it as hadoop-2.0). With the next generation of MapReduce out in 0.23.0 and 0.23.1, the community has requested a new set of features. This talk covers those features, such as preemption, web services and near-real-time analysis, and how we are working on tackling them in the near future. It also covers the roadmap for Next Gen MapReduce and the release schedule for Apache Hadoop.

Published in: Technology, Education

  1. Apache Hadoop MapReduce: What is next?
     Arun C. Murthy, Founder & Architect
     @acmurthy (@hortonworks)
  2. Hello! I’m Arun
     • Founder/Architect at Hortonworks Inc.
       – Lead, Map-Reduce
       – Formerly, Architect Hadoop MapReduce, Yahoo
       – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint)
     • Apache Hadoop, ASF
       – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
       – Long-term Committer/PMC member (full time >6 years)
       – Release Manager for hadoop-2
  3. Agenda
     • Yesterday: Hadoop MapReduce, circa 2011
     • Today: Hadoop YARN
       – Overview
       – State of the art
     • Art of the possible
       – YARN Runtime
       – MapReduce Framework
     • Q&A
  4. Hadoop MapReduce, circa 2011
  5. Hadoop MapReduce Classic
     • JobTracker
       – Manages cluster resources and job scheduling
     • TaskTracker
       – Per-node agent
       – Manages tasks
  6. Current Limitations
     • Utilization
     • Scalability
       – Maximum cluster size: 4,000 nodes
       – Maximum concurrent tasks: 40,000
       – Coarse synchronization in JobTracker
     • Single point of failure
       – Failure kills all queued and running jobs
       – Jobs restarted on bounce
  7. Current Limitations
     • Hard partition of resources into map and reduce slots
       – Low resource utilization
     • Lacks support for alternate paradigms
       – Iterative applications implemented using MapReduce are 10x slower
       – Hacks for the likes of MPI/Graph Processing
     • Lack of wire-compatible protocols
       – Client and cluster must be of same version
       – Applications and workflows cannot migrate to different clusters
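
The "hard partition ... low resource utilization" point can be illustrated with a toy calculation. This is only a sketch: the slot counts and task counts below are invented for illustration and do not come from the talk, and the two functions are hypothetical helpers, not Hadoop APIs.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots busy when slots are typed (map-only or reduce-only)."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_units, map_tasks, reduce_tasks):
    """Fraction busy when any free unit may run any kind of task."""
    busy = min(total_units, map_tasks + reduce_tasks)
    return busy / total_units


# Hypothetical node with 8 map slots and 4 reduce slots, during the
# map-heavy phase of a job (20 runnable maps, no runnable reduces yet).
typed = slot_utilization(map_slots=8, reduce_slots=4, map_tasks=20, reduce_tasks=0)
pooled = pooled_utilization(total_units=12, map_tasks=20, reduce_tasks=0)
print(f"typed slots: {typed:.0%}, pooled: {pooled:.0%}")
```

With typed slots the four reduce slots sit idle even though maps are waiting; with a shared pool the same hardware is fully used.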
  8. Hadoop YARN: Overview
  9. Requirements
     • Reliability
     • Availability
     • Utilization
     • Wire Compatibility
     • Agility & Evolution: ability for customers to control upgrades to the grid software stack
     • Scalability: clusters of 6,000-10,000 machines
       – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
       – 100,000+ concurrent tasks
       – 10,000 concurrent jobs
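
To put the scalability targets next to the classic limits, here is a small back-of-the-envelope calculation. It uses only the figures quoted on the slides; treating "100,000+" as exactly 100,000 and comparing on a tasks-per-node basis are simplifying assumptions of this sketch.

```python
# Figures quoted on the slides.
CLASSIC_MAX_NODES = 4_000
CLASSIC_MAX_TASKS = 40_000
TARGET_NODES = (6_000, 10_000)
TARGET_TASKS = 100_000  # the slide says "100,000+"; treated here as a floor
CORES_PER_NODE = 16


def tasks_per_node(tasks, nodes):
    """Average number of concurrent tasks each node would have to host."""
    return tasks / nodes


classic = tasks_per_node(CLASSIC_MAX_TASKS, CLASSIC_MAX_NODES)
target = [round(tasks_per_node(TARGET_TASKS, n), 1) for n in TARGET_NODES]
total_cores = [n * CORES_PER_NODE for n in TARGET_NODES]

print(classic)      # 10.0
print(target)       # [16.7, 10.0]
print(total_cores)  # [96000, 160000]
```

On these numbers the target implies at least as many concurrent tasks per node as the classic ceiling, on a cluster up to 2.5 times larger.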
  10. Design Centre
      • Split up the two major functions of JobTracker
        – Cluster resource management
        – Application life-cycle management
      • MapReduce becomes user-land library
  11. Architecture
      • Application
        – Application is a job submitted to the framework
        – Example: MapReduce job
      • Container
        – Basic unit of allocation
        – Example: container A = 2GB, 1CPU
        – Replaces the fixed map/reduce slots
  12. Architecture
      • Resource Manager
        – Global resource scheduler
        – Hierarchical queues
      • Node Manager
        – Per-machine agent
        – Manages the life-cycle of container
        – Container resource monitoring
      • Application Master
        – Per-application
        – Manages application scheduling and task execution
        – E.g. MapReduce Application Master
  13. Architecture [diagram: Clients, the Resource Manager, Node Managers, App Mstrs and Containers, with arrows labelled Job Submission, Node Status, Resource Request and MapReduce Status]
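
The division of labour described on the Architecture slides can be sketched as a toy model. Everything here is hypothetical: the class names mirror the slide vocabulary but are not the YARN API, the first-fit policy is an assumption, and the sketch leaves out queues, status reporting and the fact that the Application Master itself runs in a container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-machine agent: tracks what is still free on one node."""
    name: str
    free_memory_mb: int
    free_vcores: int

    def can_fit(self, r: Resource) -> bool:
        return r.memory_mb <= self.free_memory_mb and r.vcores <= self.free_vcores

    def launch(self, r: Resource) -> None:
        self.free_memory_mb -= r.memory_mb
        self.free_vcores -= r.vcores


@dataclass
class ResourceManager:
    """Global scheduler: here a naive first-fit over all nodes."""
    nodes: list

    def allocate(self, r: Resource):
        for node in self.nodes:
            if node.can_fit(r):
                node.launch(r)
                return node.name
        return None  # nothing free; a real scheduler would queue the request


class ApplicationMaster:
    """Per-application: asks the RM for containers for its own tasks."""

    def __init__(self, rm: ResourceManager, container: Resource):
        self.rm = rm
        self.container = container

    def run_tasks(self, n_tasks: int) -> list:
        placements = [self.rm.allocate(self.container) for _ in range(n_tasks)]
        return [p for p in placements if p is not None]


rm = ResourceManager([NodeManager("node-1", 4096, 2), NodeManager("node-2", 4096, 4)])
am = ApplicationMaster(rm, Resource(memory_mb=2048, vcores=1))  # "2GB, 1CPU" from the slide
placements = am.run_tasks(5)
print(placements)  # ['node-1', 'node-1', 'node-2', 'node-2']
```

Only four of the five requested containers are placed because both toy nodes run out of memory first. The scheduler knows nothing about maps or reduces, which is the point of moving MapReduce into a user-land library.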
  14. How do I get it?
      • Available in hadoop-2.0.0-alpha release
  15. Performance
      • 2x+ across the board
      • MapReduce
        – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
        – Shuffle 30%+
        – Merge improvements
        – Small Jobs: Uber AM
        – Re-use task slots (containers)
      • More details:
  16. Resources
      • hadoop-2.0.0 (alpha release):
      • Documentation:
  17. Art of the possible
      • YARN Runtime
      • MapReduce Framework
  18. Looking ahead
      • YARN
        – Runtime Improvements
        – Alternate programming models
        – Long(er) running services
      • MapReduce
        – Framework enhancements
        – Unpack!
  19. YARN - Roadmap
      • Scheduler
        – Multi-dimensional resource scheduling (MAPREDUCE-4327)
        – Preemption (MAPREDUCE-3938)
        – Gang scheduling
      • Runtime improvements
        – Container Isolation (MAPREDUCE-4334)
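
Why multi-dimensional scheduling matters can be shown with a small packing example. This is an illustrative sketch, not the design in MAPREDUCE-4327: the node size (16 cores, 48G) comes from the Requirements slide, while the CPU-heavy request and the helper functions are assumptions made up for the example.

```python
def fits_memory_only(free, request):
    """Single-dimension check: only memory is accounted for."""
    return request["memory_mb"] <= free["memory_mb"]


def fits_all_dimensions(free, request):
    """Multi-dimensional check: every requested resource must be available."""
    return all(request[k] <= free[k] for k in request)


def pack(node, request, fits):
    """Greedily place identical requests on one node until `fits` says no."""
    free = dict(node)
    placed = 0
    while fits(free, request):
        for k in request:
            free[k] -= request[k]
        placed += 1
    return placed, free


node = {"memory_mb": 48 * 1024, "vcores": 16}
cpu_heavy = {"memory_mb": 1024, "vcores": 2}  # hypothetical request shape

print(pack(node, cpu_heavy, fits_memory_only))     # (48, {'memory_mb': 0, 'vcores': -80})
print(pack(node, cpu_heavy, fits_all_dimensions))  # (8, {'memory_mb': 40960, 'vcores': 0})
```

Accounting for memory alone places 48 containers and oversubscribes the CPUs six-fold; checking both dimensions stops at 8.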
  20. YARN - Data Processing Applications
      • OpenMPI on Hadoop
      • Spark (UC Berkeley)
        – Shark is Hive-on-Spark
      • Real-time data processing
        – Storm (Twitter)
        – Apache S4
      • Graph processing
        – Apache Giraph
  21. YARN - Beyond Data Processing Apps
      • Apache HBase
        – Deployment via YARN (HBASE-4329)
        – Co-processors via YARN (HBASE-4047)
      • Simple deployment for cluster services
  22. MapReduce – Way Forward
      • MapReduce Framework Runtime
        – Monolithic software
      • MR Runtime?
        – Sort, Merge, Shuffle et al
      • Unpack into smaller building blocks!
        – Allow applications and Pig/Hive to ‘plug-n-play’
        – MR framework, as we know today, becomes a particular configuration of the building blocks
  23. MapReduce – Pluggable Sort
      • Pig & Hive benefit from hash-based aggregation
        – Several queries don’t need full-sort of map-outputs
        – Aggregation suffices
        – Allow for pluggable MapOutputBuffer in MapTask
        – Sort Avoidance: MAPREDUCE-4039
        – External sort plugin: MAPREDUCE-2454
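
The claim that "aggregation suffices" for many queries can be seen in a few lines. This sketch compares a sort-then-group aggregation with a hash-based one on the same records; it is plain Python for illustration and says nothing about how MapOutputBuffer or the linked JIRAs implement it. The function names and sample records are made up.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def aggregate_with_sort(records):
    """Classic path: sort all records by key, then reduce each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def aggregate_with_hash(records):
    """Sort-avoiding path: one pass, one hash-table update per record."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


records = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]
print(aggregate_with_sort(records) == aggregate_with_hash(records))  # True
```

Both paths produce the same totals, but the hash version never orders the records, which is the work a sum-style query in Pig or Hive does not need.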
  24. MapReduce – Pluggable Shuffle
      • Push vs. pull shuffle
      • Plug shuffle implementation (already in hadoop-2)
        – E.g. RDMA for shuffle: MAPREDUCE-4049
      • Collation tasks
        – Sailfish, Yahoo Research (includes auto-tuning of reduces)
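
The push vs. pull distinction, and what "plugging" a shuffle might look like, can be sketched in memory. This is a toy under stated assumptions: the partitioner, the two shuffle functions and the `SHUFFLES` registry are hypothetical names, and nothing here reflects the actual hadoop-2 plugin interface.

```python
from collections import defaultdict


def partition(key, n_reducers):
    # Deterministic toy partitioner for string keys.
    return sum(key.encode()) % n_reducers


def pull_shuffle(map_outputs, n_reducers):
    """Each map buckets its output locally; reducers fetch their bucket afterwards."""
    local = []
    for records in map_outputs:  # map side: bucket and keep locally
        buckets = defaultdict(list)
        for key, value in records:
            buckets[partition(key, n_reducers)].append((key, value))
        local.append(buckets)
    return [
        [rec for buckets in local for rec in buckets.get(r, [])]  # reducer r pulls
        for r in range(n_reducers)
    ]


def push_shuffle(map_outputs, n_reducers):
    """Maps send each record to the owning reducer's inbox as it is produced."""
    inboxes = [[] for _ in range(n_reducers)]
    for records in map_outputs:
        for key, value in records:
            inboxes[partition(key, n_reducers)].append((key, value))
    return inboxes


SHUFFLES = {"pull": pull_shuffle, "push": push_shuffle}  # the hypothetical plug point

map_outputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
results = {name: fn(map_outputs, 2) for name, fn in SHUFFLES.items()}
print(results["pull"] == results["push"])  # True
```

Because both functions honour the same contract (map outputs in, per-reducer record lists out), the rest of the pipeline does not need to know which one is configured.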
  25. MapReduce – More ideas
      • Allow for Map-Reduce-Reduce
        – Allow for reduce output to be sorted/shuffled
        – JOIN followed by ORDER BY
        – Really big deal for Pig/Hive
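
The "JOIN followed by ORDER BY" case can be sketched as a Map-Reduce-Reduce pipeline. This is an in-memory illustration only: the tables, the `shuffle` helper and the reducer are invented for the example, and a real implementation would distribute each stage.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key and hand the groups over in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 30), (2, 10), (3, 20), (1, 5)]

# Map: tag each record with its source table, keyed by user id.
mapped = [(uid, ("user", name)) for uid, name in users] + [
    (uid, ("order", amount)) for uid, amount in orders
]


def join_reduce(values):
    """Reduce 1 (JOIN): emit (total, name), re-keyed for the next stage."""
    names = [v for tag, v in values if tag == "user"]
    total = sum(v for tag, v in values if tag == "order")
    return [(total, name) for name in names]


joined = [out for _, values in shuffle(mapped) for out in join_reduce(values)]

# Reduce 2 (ORDER BY total): a second shuffle/sort applied directly to reduce output.
ordered = [(name, total) for total, names in shuffle(joined) for name in names]
print(ordered)  # [('bob', 10), ('cy', 20), ('ann', 35)]
```

With plain MapReduce the second stage would need a separate job with its own map phase; allowing reduce output to be shuffled again removes that extra step.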
  26. MapReduce – How do we get there?
      • Multiple, concurrent implementations of MapReduce
        – YARN is a really big deal…
        – Allows for safe experiments, much less risky!
        – Exposure surface is highly limited
  27. Questions? Thank You. @acmurthy