Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal
Upcoming SlideShare
Loading in...5
×
 

Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal

on

  • 2,626 views

 

Statistics

Views

Total Views
2,626
Views on SlideShare
2,437
Embed Views
189

Actions

Likes
2
Downloads
78
Comments
0

4 Embeds 189

http://d.hatena.ne.jp 162
http://funnel.hasgeek.com 24
https://funnel.hasgeek.com 2
https://twitter.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal Presentation Transcript

  • The Next Generation of
    Hadoop Map-Reduce
    Sharad Agarwal
    sharadag@yahoo-inc.com
    sharad@apache.org
  • About Me
    Hadoop Committer and PMC member
    Architect at Yahoo!
  • Hadoop Map-Reduce Today
    JobTracker
    Manages cluster resources and job scheduling
    TaskTracker
    Per-node agent
    Manage tasks
  • Current Limitations
    Scalability
    Maximum Cluster size – 4,000 nodes
    Maximum concurrent tasks – 40,000
    Coarse synchronization in JobTracker
    Single point of failure
    Failure kills all queued and running jobs
    Jobs need to be re-submitted by users
    Restart is very tricky due to complex state
    Hard partition of resources into map and reduce slots
  • Current Limitations
    Lacks support for alternate paradigms
    Iterative applications implemented using Map-Reduce are 10x slower.
    Example: K-Means, PageRank
    Lack of wire-compatible protocols
    Client and cluster must be of same version
    Applications and workflows cannot migrate to different clusters
  • Next Generation Map-Reduce Requirements
    Reliability
    Availability
    Scalability - Clusters of 6,000 machines
    Each machine with 16 cores, 48G RAM, 24TB disks
    100,000 concurrent tasks
    10,000 concurrent jobs
    Wire Compatibility
    Agility & Evolution – Ability for customers to control upgrades to the grid software stack.
  • Next Generation Map-Reduce – Design Centre
    Split up the two major functions of JobTracker
    Cluster resource management
    Application life-cycle management
    Map-Reduce becomes user-land library
  • Architecture
  • Architecture
    Resource Manager
    Global resource scheduler
    Hierarchical queues
    Node Manager
    Per-machine agent
    Manages the life-cycle of container
    Container resource monitoring
    Application Master
    Per-application
    Manages application scheduling and task execution
    E.g. Map-Reduce Application Master
  • Improvements vis-à-vis current Map-Reduce
    Scalability
    Application life-cycle management is very expensive
    Partition resource management and application life-cycle management
    Application management is distributed
    Hardware trends - Currently run clusters of 4,000 machines
    6,000 2012 machines > 12,000 2009 machines
    <8 cores, 16G, 4TB> v/s <16+ cores, 48/96G, 24TB>
  • Improvements vis-à-vis current Map-Reduce
    Availability
    Application Master
    Optional failover via application-specific checkpoint
    Map-Reduce applications pick up where they left off
    Resource Manager
    No single point of failure - failover via ZooKeeper
    Application Masters are restarted automatically
  • Improvements vis-à-vis current Map-Reduce
    Wire Compatibility
    Protocols are wire-compatible
    Old clients can talk to new servers
    Rolling upgrades
  • Improvements vis-à-vis current Map-Reduce
    Agility / Evolution
    Map-Reduce now becomes a user-land library
    Multiple versions of Map-Reduce can run in the same cluster (ala Apache Pig)
    Faster deployment cycles for improvements
    Customers upgrade Map-Reduce versions on their schedule
  • Improvements vis-à-vis current Map-Reduce
    Utilization
    Generic resource model
    Memory
    CPU
    Disk b/w
    Network b/w
    Remove fixed partition of map and reduce slots
  • Improvements vis-à-vis current Map-Reduce
    Support for programming paradigms other than Map-Reduce
    MPI
    Master-Worker
    Machine Learning
    Iterative processing
    Enabled by allowing use of paradigm-specific Application Master
    Run all on the same Hadoop cluster
  • Summary
    The next generation of Map-Reduce takes Hadoop to the next level
    Scale-out even further
    High availability
    Cluster Utilization
    Support for paradigms other than Map-Reduce
  • Questions?