NextGen Apache Hadoop MapReduce

Arun C Murthy, Founder and Architect at Hortonworks Inc., talks about the upcoming Next Generation Apache Hadoop MapReduce framework at the Hadoop Summit, 2011.

Comments

  • It's correct, of course.
    It means that computers nowadays can do more than 2009 computers could.
  • slide 10 7th line:
    6,000 2012 machines > 12,000 2009 machines
    Is '2009' correct? It seems it should be '2013' or later.

Transcript

  • 1. Next Generation of Apache Hadoop MapReduce
    Arun C. Murthy - Hortonworks Founder and Architect
    @acmurthy (@hortonworks)
    Formerly Architect, MapReduce @ Yahoo!
    8 years @ Yahoo!
    © Hortonworks Inc. 2011
    June 29, 2011
  • 2. Hello! I’m Arun…
    Architect & Lead, Apache Hadoop MapReduce Development Team at Hortonworks (formerly at Yahoo!)
    Apache Hadoop Committer and Member of PMC
    Full-time contributor to Apache Hadoop since early 2006
  • 3. Hadoop MapReduce Today
    JobTracker
    Manages cluster resources and job scheduling
    TaskTracker
    Per-node agent
    Manages tasks
  • 4. Current Limitations
    Scalability
    Maximum Cluster size – 4,000 nodes
    Maximum concurrent tasks – 40,000
    Coarse synchronization in JobTracker
    Single point of failure
    Failure kills all queued and running jobs
    Jobs need to be re-submitted by users
    Restart is very tricky due to complex state
    Hard partition of resources into map and reduce slots
  • 5. Current Limitations
    Lacks support for alternate paradigms
    Iterative applications implemented using MapReduce are 10x slower.
    Example: K-Means, PageRank
    Lack of wire-compatible protocols
    Client and cluster must be of same version
    Applications and workflows cannot migrate to different clusters
  • 6. Requirements
    Reliability
    Availability
    Scalability - Clusters of 6,000-10,000 machines
    Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
    100,000+ concurrent tasks
    10,000 concurrent jobs
    Wire Compatibility
    Agility & Evolution – Ability for customers to control upgrades to the grid software stack.
  • 7. Design Centre
    Split up the two major functions of JobTracker
    Cluster resource management
    Application life-cycle management
    MapReduce becomes user-land library
  • 8. Architecture
  • 9. Architecture
    Resource Manager
    Global resource scheduler
    Hierarchical queues
    Node Manager
    Per-machine agent
    Manages the container life-cycle
    Container resource monitoring
    Application Master
    Per-application
    Manages application scheduling and task execution
    E.g. MapReduce Application Master
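The division of labour on this slide can be sketched as a toy simulation. This is only an illustration under stated assumptions: the class names, method names, first-fit placement, and memory-only accounting are all hypothetical and are not the actual NextGen MapReduce API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: launches containers and tracks free memory."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, mb):
        if mb > self.free_mb:
            raise ValueError("not enough memory on " + self.name)
        self.free_mb -= mb
        self.containers.append((app_id, mb))


class ResourceManager:
    """Global scheduler: hands out containers, knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mb):
        # First-fit placement (an assumption made for brevity).
        for node in self.nodes:
            if node.free_mb >= mb:
                node.launch(app_id, mb)
                return node.name
        return None  # cluster is full; the caller has to retry later


class MapReduceAppMaster:
    """Per-application: decides which tasks to run and asks for containers."""

    def __init__(self, app_id, rm, num_tasks, task_mb):
        self.app_id = app_id
        self.rm = rm
        self.num_tasks = num_tasks
        self.task_mb = task_mb

    def run(self):
        # Map each task index to the node that received its container.
        return {
            task: self.rm.allocate(self.app_id, self.task_mb)
            for task in range(self.num_tasks)
        }


nodes = [NodeManager("node-1", 4096), NodeManager("node-2", 2048)]
rm = ResourceManager(nodes)
am = MapReduceAppMaster("app-1", rm, num_tasks=4, task_mb=2048)
placements = am.run()
print(placements)
```

In this sketch the Resource Manager never sees a map or a reduce task, only container requests. Task-level bookkeeping lives entirely in the per-application master, which is the split the slide describes.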
  • 10. Improvements vis-à-vis current MapReduce
    Scalability
    Application life-cycle management is very expensive
    Partition resource management and application life-cycle management
    Application management is distributed
    Hardware trends - Currently run clusters of 4,000 machines
    6,000 machines of 2012 vintage > 12,000 machines of 2009 vintage
    <16+ cores, 48/96 GB RAM, 24 TB disk> vs. <8 cores, 16 GB RAM, 4 TB disk>
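The hardware comparison on this slide can be checked with a little arithmetic. The sketch below uses the per-machine figures from the slide. Taking 16 cores and 48 GB as the 2012 values is an assumption, since the slide says "16+" cores and "48/96G", so the 2012 totals are lower bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    cores: int
    ram_gb: int
    disk_tb: int


def fleet_totals(count, machine):
    """Aggregate capacity of `count` identical machines."""
    return {
        "cores": count * machine.cores,
        "ram_gb": count * machine.ram_gb,
        "disk_tb": count * machine.disk_tb,
    }


machine_2012 = Machine(cores=16, ram_gb=48, disk_tb=24)  # lower-bound spec
machine_2009 = Machine(cores=8, ram_gb=16, disk_tb=4)

new_fleet = fleet_totals(6_000, machine_2012)
old_fleet = fleet_totals(12_000, machine_2009)

for key in new_fleet:
    ratio = new_fleet[key] / old_fleet[key]
    print(f"{key}: {new_fleet[key]:,} vs {old_fleet[key]:,} (x{ratio:.1f})")
```

Under these assumptions the smaller fleet matches the older one on cores and exceeds it on memory (1.5x) and disk (3x), which is consistent with the slide's inequality.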
  • 11. Improvements vis-à-vis current MapReduce
    Fault Tolerance and Availability
    Resource Manager
    No single point of failure – state saved in ZooKeeper
    Application Masters are restarted automatically on RM restart
    Applications continue to progress with existing resources during restart; new resources aren’t allocated
    Application Master
    Optional failover via application-specific checkpoint
    MapReduce applications pick up where they left off via state saved in HDFS
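The "pick up where they left off" behaviour can be illustrated with a minimal checkpointing sketch. Everything here is hypothetical: the class name, the JSON state format, and the use of a local temporary file as a stand-in for state saved in HDFS.

```python
import json
import os
import tempfile


class CheckpointingAppMaster:
    """Toy application master that records finished tasks after each one."""

    def __init__(self, tasks, state_path):
        self.tasks = list(tasks)
        self.state_path = state_path
        self.done = self._load()

    def _load(self):
        # Recover the set of finished tasks left by a previous attempt, if any.
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                return set(json.load(f))
        return set()

    def _save(self):
        # Write to a temp file and rename so a crash never leaves a partial file.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(self.done), f)
        os.replace(tmp, self.state_path)

    def run(self, fail_after=None):
        executed = []
        for task in self.tasks:
            if task in self.done:
                continue  # finished before the restart; do not redo it
            if fail_after is not None and len(executed) == fail_after:
                raise RuntimeError("simulated Application Master crash")
            executed.append(task)
            self.done.add(task)
            self._save()
        return executed


state_path = os.path.join(tempfile.mkdtemp(), "am-state.json")
tasks = ["map-0", "map-1", "map-2", "reduce-0"]

try:
    CheckpointingAppMaster(tasks, state_path).run(fail_after=2)
except RuntimeError:
    pass  # the first attempt dies after two tasks

resumed = CheckpointingAppMaster(tasks, state_path).run()
print(resumed)
```

The second attempt runs only the tasks that the first attempt did not finish, because the restarted master rebuilds its view from the saved state rather than from memory.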
  • 12. Improvements vis-à-vis current MapReduce
    Wire Compatibility
    Protocols are wire-compatible
    Old clients can talk to new servers
    Rolling upgrades
  • 13. Improvements vis-à-vis current MapReduce
    Innovation and Agility
    MapReduce now becomes a user-land library
    Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
    Faster deployment cycles for improvements
    Customers upgrade MapReduce versions on their schedule
    Users can customize MapReduce e.g. HOP without affecting everyone!
  • 14. Improvements vis-à-vis current MapReduce
    Utilization
    Generic resource model
    Memory
    CPU
    Disk b/w
    Network b/w
    Remove fixed partition of map and reduce slots
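A small calculation shows why removing the fixed map/reduce partition can help utilization. The node size, per-task memory, and slot counts below are hypothetical numbers chosen for illustration, and memory is used as the only resource to keep the sketch short.

```python
def runnable_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Tasks that can run when capacity is split into fixed slot types."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def runnable_with_memory(node_mb, task_mb, pending_maps, pending_reduces):
    """Tasks that can run when any task may use any free memory."""
    return min(node_mb // task_mb, pending_maps + pending_reduces)


# Hypothetical node: 24 GB for tasks at 2 GB each, which a slot-based
# configuration splits statically into 8 map slots and 4 reduce slots.
NODE_MB, TASK_MB = 24 * 1024, 2 * 1024

# Map-heavy phase of a job: many maps waiting, no reduces ready yet.
with_slots = runnable_with_slots(8, 4, pending_maps=20, pending_reduces=0)
with_memory = runnable_with_memory(NODE_MB, TASK_MB, pending_maps=20, pending_reduces=0)

print(with_slots, with_memory)
```

With fixed slots, the four reduce slots sit idle during the map-heavy phase. With a generic model the same memory is available to whichever tasks are waiting.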
  • 15. Improvements vis-à-vis current MapReduce
    Support for programming paradigms other than MapReduce
    MPI
    Master-Worker
    Machine Learning
    Iterative processing
    Enabled by allowing use of paradigm-specific Application Master
    Run all on the same Hadoop cluster
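One way to picture a paradigm-specific Application Master is as a pluggable planner sitting behind a common interface. The sketch below is an assumption-laden illustration: the interface, class names, and the `plan` method are invented here and do not come from the talk or from Hadoop.

```python
from abc import ABC, abstractmethod


class ApplicationMaster(ABC):
    """Hypothetical common interface: each paradigm plans its own stages."""

    @abstractmethod
    def plan(self):
        """Return a list of stages; each stage is a list of task names."""


class MapReduceMaster(ApplicationMaster):
    def __init__(self, maps, reduces):
        self.maps = maps
        self.reduces = reduces

    def plan(self):
        # Two stages: every map task, then every reduce task.
        return [
            [f"map-{i}" for i in range(self.maps)],
            [f"reduce-{i}" for i in range(self.reduces)],
        ]


class IterativeMaster(ApplicationMaster):
    def __init__(self, workers, iterations):
        self.workers = workers
        self.iterations = iterations

    def plan(self):
        # One stage per iteration, reusing the same set of workers each time.
        return [
            [f"worker-{w}/iter-{i}" for w in range(self.workers)]
            for i in range(self.iterations)
        ]


def run_on_cluster(master):
    """The cluster side sees only opaque tasks, whatever the paradigm."""
    return [task for stage in master.plan() for task in stage]


mapreduce_tasks = run_on_cluster(MapReduceMaster(maps=2, reduces=1))
iterative_tasks = run_on_cluster(IterativeMaster(workers=2, iterations=2))
print(mapreduce_tasks)
print(iterative_tasks)
```

The point of the design is that `run_on_cluster` does not change when a new paradigm is added; only a new master class is needed.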
  • 16. Summary
    MapReduce .Next takes Hadoop to the next level
    Scale-out even further
    High availability
    Cluster Utilization
    Support for paradigms other than MapReduce
  • 17. Status – June, 2011
    Feature complete
    Rigorous testing cycle underway
    Scale testing at ~500 nodes
    Sort/Scan/Shuffle benchmarks
    GridMixV3!
    Integration testing
    Pig integration complete!
    Coming in the next release of Apache Hadoop!
    Beta deployments of next release of Apache Hadoop at Yahoo! in Q4, 2011
  • 18. Questions?
    http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
  • 19. Thank You.