YARN: Future of Data Processing with Apache Hadoop

Published in: Education

Transcript

  • 1. Future of Data Processing with Apache Hadoop
      Vinod Kumar Vavilapalli, Lead Developer @ Hortonworks
      @Tshooter (@hortonworks)
  • 2. Hello! I’m Vinod
      • Lead Software Developer @ Hortonworks
          – MapReduce and YARN
          – Hadoop-1.0, Hadoop security, CapacityScheduler and HadoopOnDemand before that.
          – Previously at Yahoo! stabilizing Hadoop and helping take it to a scale of tens of thousands of nodes, the fruits of which the rest of the world is enjoying :)
      • Apache Hadoop, ASF
          – Committer and PMC member of Apache Hadoop.
          – Lurking in MapReduce and YARN mailing lists and bug trackers.
          – Hadoop YARN project lead developer.
  • 3. Agenda
      • Overview
      • Status Quo
      • Architecture
      • Improvements and Updates
      • Q&A
  • 4. Hadoop MapReduce Classic
      • JobTracker
          – Manages cluster resources and job scheduling
      • TaskTracker
          – Per-node agent
          – Manages tasks
  • 5. Current Limitations
      • Scalability
          – Maximum cluster size: 4,000 nodes
          – Maximum concurrent tasks: 40,000
          – Coarse synchronization in JobTracker
      • Single point of failure
          – Failure kills all queued and running jobs
          – Jobs need to be re-submitted by users
      • Restart is very tricky due to complex state
  • 6. Current Limitations
      • Hard partition of resources into map and reduce slots
          – Low resource utilization
      • Lacks support for alternate paradigms
          – Iterative applications implemented using MapReduce are 10x slower
          – Hacks for the likes of MPI/Graph Processing
      • Lack of wire-compatible protocols
          – Client and cluster must be of same version
          – Applications and workflows cannot migrate to different clusters
  • 7. Requirements
      • Reliability
      • Availability
      • Utilization
      • Wire Compatibility
      • Agility & Evolution: ability for customers to control upgrades to the grid software stack.
      • Scalability - Clusters of 6,000-10,000 machines
          – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
          – 100,000+ concurrent tasks
          – 10,000 concurrent jobs
  • 8. Design Centre
      • Split up the two major functions of JobTracker
          – Cluster resource management
          – Application life-cycle management
      • MapReduce becomes user-land library
  • 9. Architecture
      • Application
          – Application is a job submitted to the framework
          – Example: MapReduce job
      • Container
          – Basic unit of allocation
          – Example: container A = 2GB, 1CPU
          – Replaces the fixed map/reduce slots
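The difference between fixed slots and sized containers can be made concrete with a small sketch. Everything here is a hypothetical illustration: the node size, the slot counts, the workload, and the function names are assumptions and are not part of any Hadoop API.

```python
# Toy comparison of fixed map/reduce slots versus sized containers.
# All numbers and names are illustrative assumptions, not Hadoop APIs.

def run_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Classic model: a map task may only use a map slot, a reduce only a reduce slot."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def run_with_containers(node_mem_gb, node_cpus, requests):
    """Container model: grant each (mem_gb, cpus) request while the node has room."""
    free_mem, free_cpus, granted = node_mem_gb, node_cpus, 0
    for mem_gb, cpus in requests:
        if mem_gb <= free_mem and cpus <= free_cpus:
            free_mem -= mem_gb
            free_cpus -= cpus
            granted += 1
    return granted


# A hypothetical node with 16 GB and 8 CPUs, carved into 4 map + 4 reduce slots.
# The workload is map-heavy: 8 pending maps and no reduces.
slots_running = run_with_slots(4, 4, pending_maps=8, pending_reduces=0)

# The same node handing out containers of 2 GB, 1 CPU (the slide's example size).
containers_running = run_with_containers(16, 8, [(2, 1)] * 8)
```

Under these assumptions the slot model leaves the four reduce slots idle, while the container model can place all eight tasks on the same hardware.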
  • 10. Architecture
      • Resource Manager
          – Global resource scheduler
          – Hierarchical queues
      • Node Manager
          – Per-machine agent
          – Manages the life-cycle of container
          – Container resource monitoring
      • Application Master
          – Per-application
          – Manages application scheduling and task execution
          – E.g. MapReduce Application Master
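The division of labour among the three components can be sketched as a toy model. This is only an illustration under stated assumptions: the class and method names are invented for the example, placement is plain first-fit with no queues or locality, and the Resource Manager launches the container directly, whereas in YARN the Application Master contacts the Node Manager to start a granted container.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it hosts."""
    name: str
    free_mem_gb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, mem_gb):
        self.free_mem_gb -= mem_gb
        self.containers.append((app_id, mem_gb))


class ResourceManager:
    """Global scheduler: decides which node satisfies a resource request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mem_gb):
        # First-fit placement; a real scheduler also weighs queues and locality.
        for node in self.nodes:
            if node.free_mem_gb >= mem_gb:
                node.launch(app_id, mem_gb)
                return node.name
        return None  # nothing fits right now


class ApplicationMaster:
    """Per-application: requests one container per task from the RM."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, task_mem_gb):
        return [self.rm.allocate(self.app_id, mem) for mem in task_mem_gb]


rm = ResourceManager([NodeManager("node1", 4), NodeManager("node2", 4)])
am = ApplicationMaster("app_1", rm)
placements = am.run_tasks([2, 2, 2, 4])
```

The last request in the example cannot be placed because neither hypothetical node has 4 GB free, so the sketch returns `None` for it; an Application Master would typically keep the request outstanding until resources free up.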
  • 11. Architecture
      [Diagram: Clients, a Resource Manager, and Node Managers hosting App Masters and Containers. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.]
  • 12. Improvements vis-à-vis classic MapReduce
      • Utilization
          – Generic resource model
              • Data Locality (node, rack etc.)
              • Memory
              • CPU
              • Disk b/w
              • Network b/w
          – Remove fixed partition of map and reduce slots
      • Scalability
          – Application life-cycle management is very expensive
          – Partition resource management and application life-cycle management
          – Application management is distributed
          – Hardware trends - Currently run clusters of 4,000 machines
              • 6,000 2012 machines > 12,000 2009 machines
              • <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB>
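The hardware-trend claim on this slide can be checked with simple arithmetic. The sketch below takes the lower bound of each 2012 figure (16 cores, 48 GB, 24 TB), which is an assumption; the function name is illustrative.

```python
# Aggregate capacity computed from the slide's per-machine figures.
# Using the lower bounds (16 cores, 48 GB, 24 TB) for 2012 machines is an assumption.

def cluster_capacity(machines, cores, ram_gb, disk_tb):
    """Multiply per-machine resources by the machine count."""
    return {
        "cores": machines * cores,
        "ram_gb": machines * ram_gb,
        "disk_tb": machines * disk_tb,
    }


cluster_2012 = cluster_capacity(6000, cores=16, ram_gb=48, disk_tb=24)
cluster_2009 = cluster_capacity(12000, cores=8, ram_gb=16, disk_tb=4)
```

Even with the lower-bound figures, the 6,000-machine cluster matches the 12,000-machine one on core count (96,000 each) and exceeds it on memory (288,000 GB versus 192,000 GB) and disk (144,000 TB versus 48,000 TB).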
  • 13. Improvements vis-à-vis classic MapReduce
      • Fault Tolerance and Availability
          – Resource Manager
              • No single point of failure: state saved in ZooKeeper (coming soon)
              • Application Masters are restarted automatically on RM restart
          – Application Master
              • Optional failover via application-specific checkpoint
              • MapReduce applications pick up where they left off via state saved in HDFS
      • Wire Compatibility
          – Protocols are wire-compatible
          – Old clients can talk to new servers
          – Rolling upgrades
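The idea of an Application Master picking up where it left off can be sketched with a simple checkpoint file. This is a generic illustration of checkpoint-based recovery, not the MapReduce Application Master's actual mechanism: a local JSON file stands in for state saved in HDFS, and `run_job`, `fail_after`, and the task names are hypothetical.

```python
import json
import os
import tempfile


def run_job(tasks, checkpoint_path, fail_after=None):
    """Run tasks in order, recording each finished task in a checkpoint file.

    fail_after simulates a crash once that many tasks have run in this attempt.
    Returns the list of tasks actually executed in this attempt.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # recover state left by a previous attempt

    executed = []
    for task in tasks:
        if task in done:
            continue  # finished before the crash; do not redo it
        if fail_after is not None and len(executed) == fail_after:
            raise RuntimeError("simulated Application Master failure")
        executed.append(task)
        done.append(task)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist progress after every task
    return executed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "job_state.json")  # local stand-in for HDFS
    tasks = ["map_0", "map_1", "map_2", "reduce_0"]
    try:
        run_job(tasks, path, fail_after=2)
    except RuntimeError:
        pass  # the first attempt dies after two tasks
    second_attempt = run_job(tasks, path)
```

In this sketch the restarted attempt runs only the tasks that had not been recorded as complete, rather than re-running the whole job.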
  • 14. Improvements vis-à-vis classic MapReduce
      • Innovation and Agility
          – MapReduce now becomes a user-land library
          – Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
              • Faster deployment cycles for improvements
          – Customers upgrade MapReduce versions on their schedule
          – Users can customize MapReduce
  • 15. Improvements vis-à-vis classic MapReduce
      • Support for programming paradigms other than MapReduce
          – MPI
          – Master-Worker
          – Machine Learning
          – Iterative processing
          – Enabled by allowing the use of paradigm-specific application master
          – Run all on the same Hadoop cluster
  • 16. Any Performance Gains?
      • Significant gains across the board, 2x in some cases.
      • MapReduce
          – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
          – Shuffle 30%+
          – Small Jobs: Uber AM
          – Re-use task slots (containers)
      More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/
  • 17. Testing?
      • Testing, *lots* of it
      • Benchmarks (every release should be at least as good as the last one)
      • Integration testing
          – HBase
          – Pig
          – Hive
          – Oozie
      • Functional tests
          – Nightly
          – Over 1000 functional tests for Map-Reduce alone
          – Several hundred for Pig/Hive etc.
      • Deployment discipline
  • 18. Benchmarks
      • Benchmark every part of the HDFS & MR pipeline
          – HDFS read/write throughput
          – NN operations
          – Scan, Shuffle, Sort
      • GridMixv3
          – Run production traces in test clusters
          – Thousands of jobs
          – Stress mode v/s Replay mode
  • 19. Deployment
      • Alpha/Test (early UAT) in November 2011
          – Small scale (500-800 nodes)
      • Alpha in February 2012
          – Majority of users
          – ~1000 nodes per cluster, > 2,000 nodes in all
      • Beta
          – Misnomer: 10s of PB of storage
          – Significantly wide variety of applications and load
          – 4000+ nodes per cluster, > 15000 nodes in all
          – Q3, 2012
      • Production
          – Well, it’s production
  • 20. Resources
      hadoop-2.0.2-alpha: http://hadoop.apache.org/common/releases.html
      Release Documentation: http://hadoop.apache.org/common/docs/r2.0.2-alpha/
      Blogs:
          Blog series: http://hortonworks.com/blog/category/apache-hadoop/yarn/
          http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
          http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
  • 21. What next?
      1 Download Hortonworks Data Platform 2.0 (for preview only): hortonworks.com/download
      2 Use the getting started guide: hortonworks.com/get-started
      3 Learn more….
          • Training (hortonworks.com/training)
              – Expert role based training
              – Course for admins, developers and operators
              – Certification program
              – Custom onsite options
          • Hortonworks Support (hortonworks.com/support)
              – Full lifecycle technical support across four service levels
              – Delivered by Apache Hadoop Experts/Committers
              – Forward-compatible
      © Hortonworks Inc. 2012
  • 22. Questions & Answers
      • TRY: download at hortonworks.com
      • LEARN: Hortonworks University
      • FOLLOW: twitter: @hortonworks, Facebook: facebook.com/hortonworks
      • MORE EVENTS: hortonworks.com/events
      Further questions & comments: events@hortonworks.com
