Next Generation of Hadoop MapReduce

Presentation Transcript

    • Next Generation of Apache Hadoop MapReduce Owen O’Malley oom@yahoo-inc.com @owen_omalley
    • What is Hadoop? A framework for storing and processing big data on lots of commodity machines. - Up to 4,000 machines in a cluster - Up to 20 PB in a cluster Open Source Apache project High reliability done in software - Automated failover for data and computation Implemented in Java Primary data analysis platform at Yahoo! - 40,000+ machines running Hadoop
    • What is Hadoop? HDFS – Distributed File System - Combines cluster’s local storage into a single namespace. - All data is replicated to multiple machines. - Provides locality information to clients MapReduce - Batch computation framework - Tasks re-executed on failure - User code wrapped around a distributed sort - Optimizes for data locality of input
    • Case Study: Yahoo Front Page
      Personalized for each visitor
      Result: twice the engagement
      - Recommended links: +79% clicks vs. randomly selected
      - News Interests: +160% clicks vs. one size fits all
      - Top Searches: +43% clicks vs. editor selected
    • Hadoop MapReduce Today
      JobTracker
      - Manages cluster resources and job scheduling
      TaskTracker
      - Per-node agent
      - Manages tasks
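    In this architecture every job submission and every TaskTracker heartbeat goes through that one JobTracker. A small sketch of a client talking to it through the classic org.apache.hadoop.mapred API; the JobTracker host and port are illustrative.

      import org.apache.hadoop.mapred.ClusterStatus;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;

      public class ClassicClusterInfo {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf();
          // Point the client at the central JobTracker (host:port made up here).
          conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

          JobClient client = new JobClient(conf);
          ClusterStatus status = client.getClusterStatus();
          // The cluster is carved into fixed map and reduce slots.
          System.out.println("TaskTrackers: " + status.getTaskTrackers());
          System.out.println("Map slots:    " + status.getMaxMapTasks());
          System.out.println("Reduce slots: " + status.getMaxReduceTasks());
        }
      }

    The fixed map and reduce slot counts reported here are exactly the hard partition that the next-generation design removes.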
    • Current Limitations Scalability - Maximum Cluster size – 4,000 nodes - Maximum concurrent tasks – 40,000 - Coarse synchronization in JobTracker Single point of failure - Failure kills all queued and running jobs - Jobs need to be re-submitted by users Restart is very tricky due to complex state Hard partition of resources into map and reduce slots
    • Current Limitations Lacks support for alternate paradigms - Iterative applications implemented using MapReduce are 10x slower. - Users use MapReduce to run arbitrary code - Example: K-Means, PageRank Lack of wire-compatible protocols - Client and cluster must be of same version - Applications and workflows cannot migrate to different clusters
    • MapReduce Requirements for 2011
      Reliability
      Availability
      Scalability
      - Clusters of 6,000 machines
      - Each machine with 16 cores, 48 GB RAM, 24 TB disks
      - 100,000 concurrent tasks
      - 10,000 concurrent jobs
      Wire Compatibility
      Agility & Evolution – Ability for customers to control upgrades to the grid software stack
    • MapReduce – Design Focus
      Split up the two major functions of JobTracker
      - Cluster resource management
      - Application life-cycle management
      MapReduce becomes user-land library
    • Architecture
    • Architecture Resource Manager - Global resource scheduler - Hierarchical queues Node Manager - Per-machine agent - Manages the life-cycle of container - Container resource monitoring Application Master - Per-application - Manages application scheduling and task execution - E.g. MapReduce Application Master
    • Improvements vis-à-vis current MapReduce: Scalability
      - Application life-cycle management is very expensive
      - Partition resource management and application life-cycle management
      - Application management is distributed
      - Hardware trends
        • Machines are getting bigger and faster
        • Moving toward 12 x 2 TB disks instead of 4 x 1 TB disks
        • Enables more tasks per machine
    • Improvements vis-à-vis current MapReduce: Availability
      - Application Master
        • Optional failover via application-specific checkpoint
        • MapReduce applications pick up where they left off
      - Resource Manager
        • No single point of failure – failover via ZooKeeper
        • Application Masters are restarted automatically
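    In the Hadoop 2.x releases that implemented this design, the ZooKeeper-backed Resource Manager recovery is switched on through configuration. A minimal sketch, assuming the property names from those later releases; the ZooKeeper quorum hosts are illustrative.

      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      public class RmRecoveryConfig {
        public static YarnConfiguration withZkRecovery() {
          YarnConfiguration conf = new YarnConfiguration();
          // Persist application and attempt state so a restarted Resource Manager
          // can resume and restart Application Masters automatically.
          conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
          conf.set("yarn.resourcemanager.store.class",
              "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
          // ZooKeeper quorum backing the state store (hosts made up here).
          conf.set("yarn.resourcemanager.zk-address",
              "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
          return conf;
        }
      }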
    • Improvements vis-à-vis current MapReduce: Wire Compatibility
      - Protocols are wire-compatible
      - Old clients can talk to new servers
      - Evolution toward rolling upgrades
    • Improvements vis-à-vis current MapReduce: Innovation and Agility
      - MapReduce now becomes a user-land library
      - Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
        • Faster deployment cycles for improvements
      - Customers upgrade MapReduce versions on their schedule
      - Users can use customized MapReduce versions without affecting everyone!
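    One mechanism the later Hadoop 2.x releases provide for this is shipping a complete MapReduce framework build with the job via the distributed cache, so each job picks its own MapReduce version. A sketch, assuming the property names from those releases; the HDFS path, the alias after "#", and the classpath entries depend on how the tarball is laid out and are illustrative here.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class PerJobMapReduceVersion {
        public static Job newJob() throws Exception {
          Configuration conf = new Configuration();
          conf.set("mapreduce.framework.name", "yarn");

          // Ship a specific MapReduce build to the cluster alongside the job.
          conf.set("mapreduce.application.framework.path",
              "hdfs:///apps/mapreduce/hadoop-mapreduce-2.x.tar.gz#mr-framework");
          conf.set("mapreduce.application.classpath",
              "$PWD/mr-framework/share/hadoop/mapreduce/*,"
                  + "$PWD/mr-framework/share/hadoop/mapreduce/lib/*");

          return Job.getInstance(conf, "job-with-pinned-mapreduce-version");
        }
      }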
    • Improvements vis-à-vis current MapReduce: Utilization
      - Generic resource model
        • Memory
        • CPU
        • Disk bandwidth
        • Network bandwidth
      - Remove fixed partition of map and reduce slots
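    With the generic resource model, an application asks the scheduler for containers sized in resources instead of claiming a map or reduce slot. A minimal sketch using the request record from the Hadoop 2.x AMRMClient API; the memory and vcore numbers are arbitrary, and the initial API covers memory and CPU while disk and network bandwidth remained design goals.

      import org.apache.hadoop.yarn.api.records.Priority;
      import org.apache.hadoop.yarn.api.records.Resource;
      import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

      public class ResourceAsk {
        // A container ask expressed as explicit resources rather than a slot.
        public static ContainerRequest smallTask() {
          Resource capability = Resource.newInstance(2048 /* MB */, 2 /* vcores */);
          Priority priority = Priority.newInstance(0);
          // No node or rack constraints here; locality can be expressed per ask.
          return new ContainerRequest(capability, null, null, priority);
        }
      }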
    • Improvements vis-à-vis current MapReduce: Support for programming paradigms other than MapReduce
      - MPI
      - Master-Worker
      - Machine Learning and Iterative processing
      - Enabled by paradigm-specific Application Master
      - All can run on the same Hadoop cluster
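    The hook that enables other paradigms is the per-application Application Master: each framework brings its own AM, negotiates containers from the Resource Manager, and runs whatever it likes in them. A heavily simplified sketch against the Hadoop 2.x AMRMClient API; the registration parameters, container count, and the launch step are placeholders.

      import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
      import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
      import org.apache.hadoop.yarn.api.records.Priority;
      import org.apache.hadoop.yarn.api.records.Resource;
      import org.apache.hadoop.yarn.client.api.AMRMClient;
      import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      public class ParadigmSpecificAppMaster {
        public static void main(String[] args) throws Exception {
          AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
          rm.init(new YarnConfiguration());
          rm.start();

          // Register with the Resource Manager (host, RPC port, tracking URL omitted).
          rm.registerApplicationMaster("", 0, "");

          // Ask for workers however the paradigm needs them: MPI ranks,
          // a master-worker pool, one wave of tasks per iteration, and so on.
          int wanted = 4;
          for (int i = 0; i < wanted; i++) {
            rm.addContainerRequest(new ContainerRequest(
                Resource.newInstance(2048, 1), null, null, Priority.newInstance(0)));
          }

          int received = 0;
          while (received < wanted) {
            // Heartbeat to the scheduler: report progress, collect granted containers.
            AllocateResponse response = rm.allocate((float) received / wanted);
            received += response.getAllocatedContainers().size();
            // Launching work in each granted container goes through NMClient (omitted).
            Thread.sleep(1000);
          }

          rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "all workers ran", "");
          rm.stop();
        }
      }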
    • Summary Takes Hadoop to the next level - Scale-out even further - High availability - Cluster Utilization - Support for paradigms other than MapReduce
    • Questions?
      http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
      http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/