NextGen Apache Hadoop MapReduce
Arun C Murthy, Founder and Architect at Hortonworks Inc., talks about the upcoming Next Generation Apache Hadoop MapReduce framework at the Hadoop Summit, 2011.

NextGen Apache Hadoop MapReduce: Presentation Transcript

  • Next Generation of Apache Hadoop MapReduce
    Arun C. Murthy - Hortonworks Founder and Architect
    @acmurthy (@hortonworks)
    Formerly Architect, MapReduce @ Yahoo!
    8 years @ Yahoo!
    © Hortonworks Inc. 2011
    June 29, 2011
  • Hello! I’m Arun…
    Architect & Lead, Apache Hadoop MapReduce Development Team at Hortonworks (formerly at Yahoo!)
    Apache Hadoop Committer and Member of PMC
    Full-time contributor to Apache Hadoop since early 2006
  • Hadoop MapReduce Today
    JobTracker
    Manages cluster resources and job scheduling
    TaskTracker
    Per-node agent
    Manages tasks
  • Current Limitations
    Scalability
    Maximum Cluster size – 4,000 nodes
    Maximum concurrent tasks – 40,000
    Coarse synchronization in JobTracker
    Single point of failure
    Failure kills all queued and running jobs
    Jobs need to be re-submitted by users
    Restart is very tricky due to complex state
    Hard partition of resources into map and reduce slots
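
The cost of the hard partition into map and reduce slots can be illustrated with a small model. This is only a sketch: the `SlotNode` class and the slot counts are hypothetical and are not taken from the talk or from Hadoop's code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SlotNode:
    """Hypothetical worker whose capacity is pre-divided into map and reduce slots."""
    map_slots: int
    reduce_slots: int

    def runnable(self, pending_maps: int, pending_reduces: int) -> Tuple[int, int]:
        # Each task type may only occupy slots of its own kind.
        return (min(pending_maps, self.map_slots),
                min(pending_reduces, self.reduce_slots))


node = SlotNode(map_slots=8, reduce_slots=4)
# Early in a job there are many map tasks waiting and no reduce tasks yet.
maps, reduces = node.runnable(pending_maps=20, pending_reduces=0)
idle = (node.map_slots - maps) + (node.reduce_slots - reduces)
print(f"running maps={maps}, reduces={reduces}, idle slots={idle}")
```

In this toy case 12 map tasks are still waiting while 4 reduce slots sit idle, because a map task cannot borrow a reduce slot.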
  • Current Limitations
    Lacks support for alternate paradigms
    Iterative applications implemented using MapReduce are 10x slower.
    Example: K-Means, PageRank
    Lack of wire-compatible protocols
    Client and cluster must be of same version
    Applications and workflows cannot migrate to different clusters
  • Requirements
    Reliability
    Availability
    Scalability - Clusters of 6,000-10,000 machines
    Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
    100,000+ concurrent tasks
    10,000 concurrent jobs
    Wire Compatibility
    Agility & Evolution – Ability for customers to control upgrades to the grid software stack.
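
A quick back-of-the-envelope check relates the task target to the cluster sizes on this slide. The machine counts, core count, and task target come from the slide; treating the target as exactly 100,000 tasks is an assumption made for the arithmetic.

```python
CORES_PER_MACHINE = 16   # from the slide
TARGET_TASKS = 100_000   # "100,000+ concurrent tasks", taken at its lower bound

summary = {}
for machines in (6_000, 10_000):  # cluster sizes from the slide
    cores = machines * CORES_PER_MACHINE
    tasks_per_machine = TARGET_TASKS / machines
    summary[machines] = (cores, tasks_per_machine)
    print(f"{machines:>6,} machines: {cores:,} cores, "
          f"{tasks_per_machine:.1f} tasks per machine")
```

At 6,000 machines the target works out to roughly one concurrent task per core.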
  • Design Centre
    Split up the two major functions of JobTracker
    Cluster resource management
    Application life-cycle management
    MapReduce becomes user-land library
  • Architecture
    Resource Manager
    Global resource scheduler
    Hierarchical queues
    Node Manager
    Per-machine agent
    Manages the life-cycle of containers
    Container resource monitoring
    Application Master
    Per-application
    Manages application scheduling and task execution
    E.g. MapReduce Application Master
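
The division of labour between the three components can be sketched as a toy model. All class and method names below are hypothetical, memory is the only resource tracked, and the Resource Manager launches containers directly for brevity; none of this is the actual NextGen MapReduce API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Container:
    node: str
    memory_mb: int


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it runs."""
    name: str
    free_mb: int
    running: List[Container] = field(default_factory=list)

    def launch(self, memory_mb: int) -> Container:
        self.free_mb -= memory_mb
        container = Container(self.name, memory_mb)
        self.running.append(container)
        return container


class ResourceManager:
    """Global scheduler: decides which node can satisfy a request."""

    def __init__(self, nodes: List[NodeManager]):
        self.nodes = nodes

    def allocate(self, memory_mb: int) -> Optional[Container]:
        for node in self.nodes:  # first fit; a real scheduler is far richer
            if node.free_mb >= memory_mb:
                return node.launch(memory_mb)
        return None  # nothing fits right now


class ApplicationMaster:
    """Per-application: requests containers and tracks its own tasks."""

    def __init__(self, rm: ResourceManager, tasks: List[str], memory_mb: int):
        self.rm = rm
        self.pending = list(tasks)
        self.memory_mb = memory_mb
        self.assigned: Dict[str, Container] = {}

    def schedule(self) -> None:
        while self.pending:
            container = self.rm.allocate(self.memory_mb)
            if container is None:
                break  # keep the task pending until resources free up
            self.assigned[self.pending.pop(0)] = container


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
am = ApplicationMaster(rm, ["map-0", "map-1", "map-2", "reduce-0"], memory_mb=2048)
am.schedule()
placement = {task: c.node for task, c in am.assigned.items()}
print(placement, "pending:", am.pending)
```

The point of the sketch is that task bookkeeping lives in the per-application master, so the global scheduler only ever answers resource requests.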
  • Improvements vis-à-vis current MapReduce
    Scalability
    Application life-cycle management is very expensive
    Partition resource management and application life-cycle management
    Application management is distributed
    Hardware trends - Currently run clusters of 4,000 machines
    6,000 machines of 2012 vintage > 12,000 machines of 2009 vintage
    <16+ cores, 48/96G, 24TB> vs. <8 cores, 16G, 4TB>
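
The hardware comparison can be checked with simple arithmetic. The per-machine figures come from the slide; using the lower bounds of the 2012 specification (16 cores, 48G RAM) is an assumption, and the `Machine` type is only an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    cores: int
    ram_gb: int
    disk_tb: int


m2012 = Machine(cores=16, ram_gb=48, disk_tb=24)  # lower bounds of the 2012 spec
m2009 = Machine(cores=8, ram_gb=16, disk_tb=4)


def totals(count: int, m: Machine) -> dict:
    """Aggregate capacity of `count` identical machines."""
    return {"cores": count * m.cores,
            "ram_gb": count * m.ram_gb,
            "disk_tb": count * m.disk_tb}


new, old = totals(6_000, m2012), totals(12_000, m2009)
for key in new:
    print(f"{key:8} 2012: {new[key]:>9,}   2009: {old[key]:>9,}")
```

Under these assumptions the 6,000 newer machines match the 12,000 older ones on cores and exceed them on RAM (1.5x) and disk (3x).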
  • Improvements vis-à-vis current MapReduce
    Fault Tolerance and Availability
    Resource Manager
    No single point of failure – state saved in ZooKeeper
    Application Masters are restarted automatically on RM restart
    Applications continue to progress with existing resources during restart, new resources aren’t allocated
    Application Master
    Optional failover via application-specific checkpoint
    MapReduce applications pick up where they left off via state saved in HDFS
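
The idea of picking up where the application left off can be sketched with a checkpoint file. This is a minimal illustration, not the MapReduce Application Master's recovery code: a local temporary file stands in for HDFS, and `run_job` and its `fail_after` parameter are hypothetical.

```python
import json
import os
import tempfile


def run_job(tasks, checkpoint_path, fail_after=None):
    """Run tasks in order, recording each finished task in a checkpoint file."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # state left behind by a previous attempt
    executed = []
    for task in tasks:
        if task in done:
            continue  # finished before the restart; do not redo it
        if fail_after is not None and len(executed) >= fail_after:
            raise RuntimeError("application master crashed")  # simulated failure
        executed.append(task)
        done.append(task)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return executed


tasks = ["map-0", "map-1", "map-2", "reduce-0"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "job.checkpoint")
    try:
        run_job(tasks, path, fail_after=2)  # first attempt dies after two tasks
    except RuntimeError:
        pass
    resumed = run_job(tasks, path)  # restarted attempt reads the checkpoint
print(resumed)
```

The restarted attempt runs only the tasks that had not completed before the simulated crash.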
  • Improvements vis-à-vis current MapReduce
    Wire Compatibility
    Protocols are wire-compatible
    Old clients can talk to new servers
    Rolling upgrades
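
One common way to let old clients talk to new servers is to make new fields optional and have each side ignore fields it does not know. The sketch below illustrates that idea with JSON messages; the field names and functions are hypothetical, and the talk does not say which serialization format is used.

```python
import json


def old_client_request(job_name):
    # Version-1 client: knows only the "job" field.
    return json.dumps({"job": job_name})


def new_server_handle(raw):
    msg = json.loads(raw)
    # "priority" was added in a later version; default it when absent.
    priority = msg.get("priority", "normal")
    return json.dumps({"job": msg["job"], "status": "accepted", "priority": priority})


def old_client_parse(raw):
    msg = json.loads(raw)
    # Read only the fields this client version knows and ignore the rest.
    return msg["job"], msg["status"]


reply = old_client_parse(new_server_handle(old_client_request("wordcount")))
print(reply)
```

Because neither side fails on a missing or unknown field, servers can be upgraded one at a time while older clients keep working, which is what makes rolling upgrades possible.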
  • Improvements vis-à-vis current MapReduce
    Innovation and Agility
    MapReduce now becomes a user-land library
    Multiple versions of MapReduce can run in the same cluster (à la Apache Pig)
    Faster deployment cycles for improvements
    Customers upgrade MapReduce versions on their schedule
    Users can customize MapReduce (e.g. HOP) without affecting everyone!
  • Improvements vis-à-vis current MapReduce
    Utilization
    Generic resource model
    Memory
    CPU
    Disk b/w
    Network b/w
    Remove fixed partition of map and reduce slots
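
A generic resource model can be sketched as a vector of the four resources named on this slide, with a request placed only if every dimension fits. The `Resources` class, the units, and all the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Resources:
    memory_gb: int
    cpu_cores: int
    disk_mbps: int
    network_mbps: int

    def fits_in(self, free: "Resources") -> bool:
        # A request fits only if every dimension is available.
        return all(need <= have for need, have in zip(astuple(self), astuple(free)))

    def subtract_from(self, free: "Resources") -> "Resources":
        return Resources(*(have - need for need, have in zip(astuple(self), astuple(free))))


free = Resources(memory_gb=48, cpu_cores=16, disk_mbps=400, network_mbps=1000)
requests = {
    "map-0": Resources(2, 1, 50, 10),
    "map-1": Resources(2, 1, 50, 10),
    "reduce-0": Resources(8, 2, 100, 400),
    "reduce-1": Resources(8, 2, 100, 400),
    "reduce-2": Resources(8, 2, 100, 400),
}
placed = []
for name, need in requests.items():
    if need.fits_in(free):
        free = need.subtract_from(free)
        placed.append(name)
print(placed, free)
```

Here map and reduce tasks draw from the same pool, and the last reduce task is refused because network bandwidth, not a slot count, has run out.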
  • Improvements vis-à-vis current MapReduce
    Support for programming paradigms other than MapReduce
    MPI
    Master-Worker
    Machine Learning
    Iterative processing
    Enabled by allowing use of paradigm-specific Application Master
    Run all on the same Hadoop cluster
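
The role of a paradigm-specific Application Master can be sketched as one interface with one implementation per paradigm. The class names below are hypothetical and everything runs in a single process; the sketch only shows that an iterative application can loop inside one application instead of submitting one MapReduce job per pass.

```python
from abc import ABC, abstractmethod


class ApplicationMaster(ABC):
    """Hypothetical base class: one subclass per programming paradigm."""

    @abstractmethod
    def run(self, data):
        ...


class MapReduceMaster(ApplicationMaster):
    def __init__(self, mapper, reducer):
        self.mapper, self.reducer = mapper, reducer

    def run(self, data):
        return self.reducer([self.mapper(x) for x in data])


class IterativeMaster(ApplicationMaster):
    """Repeats a step until convergence within a single application."""

    def __init__(self, step, tolerance=1e-9, max_iterations=100):
        self.step = step
        self.tolerance = tolerance
        self.max_iterations = max_iterations

    def run(self, data):
        value = data
        for _ in range(self.max_iterations):
            new_value = self.step(value)
            if abs(new_value - value) < self.tolerance:
                return new_value
            value = new_value
        return value


# Both applications share the same "cluster" (here, the same process).
sum_of_squares = MapReduceMaster(mapper=lambda x: x * x, reducer=sum).run([1, 2, 3])
sqrt_two = IterativeMaster(step=lambda v: (v + 2 / v) / 2).run(1.0)
print(sum_of_squares, sqrt_two)
```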
  • Summary
    MapReduce .Next takes Hadoop to the next level
    Scale-out even further
    High availability
    Cluster Utilization
    Support for paradigms other than MapReduce
  • Status – June, 2011
    Feature complete
    Rigorous testing cycle underway
    Scale testing at ~500 nodes
    Sort/Scan/Shuffle benchmarks
    GridMixV3!
    Integration testing
    Pig integration complete!
    Coming in the next release of Apache Hadoop!
    Beta deployments of next release of Apache Hadoop at Yahoo! in Q4, 2011
  • Questions?
    http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
  • Thank You.