SlideShare a Scribd company logo
1 of 17
Next Generation Apache Hadoop Map
Reduce


Mahadev Konar
Co Founder
@mahadevkonar (@hortonworks)




                                    Page 1
Bio
•  Working on Apache Hadoop since 2006 - committer and PMC member
•  Working on Apache ZooKeeper since 2008 – committer and PMC
   member




                                                                Page 2
Hadoop MapReduce Classic


•  JobTracker
  –  Manages cluster
     resources and job
     scheduling
•  TaskTracker
  –  Per-node agent
  –  Manage tasks
Current Limitations

•  Hard partition of resources into map and reduce slots
   –  Low resource utilization
•  Lacks support for alternate paradigms
    –  Iterative applications implemented using
       MapReduce are 10x slower.
    –  Hacks for the likes of MPI/others
•  Lack of wire-compatible protocols
    –  Client and cluster must be of same version
    –  Applications and workflows cannot migrate to
       different clusters


                             4
Current Limitations

•  Scalability
    –  Maximum Cluster size – 4,000 nodes
    –  Maximum concurrent tasks – 40,000
    –  Coarse synchronization in JobTracker
•  Single point of failure
    –  Failure kills all queued and running jobs
    –  Jobs need to be re-submitted by users
•  Restart is very tricky due to complex state




                          5
Requirements

•  Reliability
•  Availability
•  Scalability - Clusters of 6,000-10,000 machines
    –  Each machine with 16 cores, 48G/96G RAM, 24TB/
       36TB disks
    –  100,000+ concurrent tasks
    –  10,000 concurrent jobs
•  Wire Compatibility
•  Agility & Evolution – Ability for customers to control
   upgrades to the grid software stack.


                        6
Design Centre

•  Split up the two major functions of JobTracker
    –  Cluster resource management
    –  Application life-cycle management
•  MapReduce becomes user-land library




                         7
Architecture

                                           Node
                                           Node
                                          Manager
                                          Manager


                                    Container   App Mstr
                                                App Mstr


     Client

                         Resource          Node
                                           Node
                         Resource
                         Manager
                         Manager          Manager
                                          Manager
     Client
      Client

                                    App Mstr    Container
                                                Container




      MapReduce Status                     Node
                                           Node
      MapReduce Status
                                          Manager
                                          Manager
        Job Submission
       Job Submission
         Node Status
        Node Status
      Resource Request
      Resource Request              Container   Container
Architecture

•  Resource Manager
    –  Global resource scheduler
    –  Hierarchical queues
•  Node Manager
    –  Per-machine agent
    –  Manages the life-cycle of container
    –  Container resource monitoring
•  Application Master
    –  Per-application
    –  Manages application scheduling and task execution
    –  E.g. MapReduce Application Master
                        9
Improvements vis-à-vis current MapReduce


•  Scalability
    –  Application life-cycle management is very expensive
    –  Partition resource management and application life-
       cycle management
    –  Application management is distributed
    –  Hardware trends - Currently run clusters of 4,000
       machines
        •  6,000 2012 machines > 12,000 2009 machines
        •  <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G,
           4TB>


                        10
Improvements vis-à-vis current MapReduce



•  Fault Tolerance and Availability
    –  Resource Manager
        •  No single point of failure – state saved in ZooKeeper
        •  Application Masters are restarted automatically on RM
           restart
        •  Applications continue to progress with existing
           resources during restart, new resources aren’t
           allocated
    –  Application Master
        •  Optional failover via application-specific checkpoint
        •  MapReduce applications pick up where they left off
           via state saved in HDFS

                          11
Improvements vis-à-vis current MapReduce


•    Wire Compatibility
     –  Protocols are wire-compatible
     –  Old clients can talk to new servers
     –  Rolling upgrades




                                12
Improvements vis-à-vis current MapReduce


•    Innovation and Agility
     –  MapReduce now becomes a user-land library
     –  Multiple versions of MapReduce can run in the same cluster (a la
        Apache Pig)
         •  Faster deployment cycles for improvements
     –  Customers upgrade MapReduce versions on their schedule
     –  Users can customize MapReduce e.g. HOP without affecting
        everyone!




                                  13
Improvements vis-à-vis current MapReduce


•    Utilization
      –  Generic resource model
          •  Memory
          •  CPU
          •  Disk b/w
          •  Network b/w
      –  Remove fixed partition of map and reduce slots




                                  14
Improvements vis-à-vis current MapReduce


•    Support for programming paradigms other than MapReduce
     –  MPI
     –  Master-Worker
     –  Machine Learning
     –  Iterative processing
     –  Enabled by allowing use of paradigm-specific Application Master
     –  Run all on the same Hadoop cluster




                               15
Summary


•    MapReduce .Next takes Hadoop to the next level
     –  Scale-out even further
     –  High availability
     –  Cluster Utilization
     –  Support for paradigms other than MapReduce




                             16
Thank You.
@mahadevkonar

More Related Content

What's hot

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
Jason Shao
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
MapR Technologies
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Frane Bandov
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 

What's hot (20)

Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
10c introduction
10c introduction10c introduction
10c introduction
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop Yarn
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
Resource scheduling
Resource schedulingResource scheduling
Resource scheduling
 

Viewers also liked (6)

An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)
 
Hadoop Installation and basic configuration
Hadoop Installation and basic configurationHadoop Installation and basic configuration
Hadoop Installation and basic configuration
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep Insight
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similar to Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, Hortonworks

Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hortonworks
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
Hortonworks
 
YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011
Sharad Agarwal
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
huguk
 
Apache Hadoop YARN - Hortonworks Meetup Presentation
Apache Hadoop YARN - Hortonworks Meetup PresentationApache Hadoop YARN - Hortonworks Meetup Presentation
Apache Hadoop YARN - Hortonworks Meetup Presentation
Hortonworks
 
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Yahoo Developer Network
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
hdhappy001
 

Similar to Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, Hortonworks (20)

Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
 
Hadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgenHadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgen
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 
Apache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's Next
 
YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Apache Hadoop YARN - Hortonworks Meetup Presentation
Apache Hadoop YARN - Hortonworks Meetup PresentationApache Hadoop YARN - Hortonworks Meetup Presentation
Apache Hadoop YARN - Hortonworks Meetup Presentation
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce...
 
Yarnthug2014
Yarnthug2014Yarnthug2014
Yarnthug2014
 
Yarn
YarnYarn
Yarn
 
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
 
A sdn based application aware and network provisioning
A sdn based application aware and network provisioningA sdn based application aware and network provisioning
A sdn based application aware and network provisioning
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
YARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User GroupYARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User Group
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 

More from Cloudera, Inc.

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, Hortonworks

  • 1. Next Generation Apache Hadoop Map Reduce Mahadev Konar Co Founder @mahadevkonar (@hortonworks) Page 1
  • 2. Bio •  Working on Apache Hadoop since 2006 - committer and PMC member •  Working on Apache ZooKeeper since 2008 – committer and PMC member Page 2
  • 3. Hadoop MapReduce Classic •  JobTracker –  Manages cluster resources and job scheduling •  TaskTracker –  Per-node agent –  Manage tasks
  • 4. Current Limitations •  Hard partition of resources into map and reduce slots –  Low resource utilization •  Lacks support for alternate paradigms –  Iterative applications implemented using MapReduce are 10x slower. –  Hacks for the likes of MPI/others •  Lack of wire-compatible protocols –  Client and cluster must be of same version –  Applications and workflows cannot migrate to different clusters 4
  • 5. Current Limitations •  Scalability –  Maximum Cluster size – 4,000 nodes –  Maximum concurrent tasks – 40,000 –  Coarse synchronization in JobTracker •  Single point of failure –  Failure kills all queued and running jobs –  Jobs need to be re-submitted by users •  Restart is very tricky due to complex state 5
  • 6. Requirements •  Reliability •  Availability •  Scalability - Clusters of 6,000-10,000 machines –  Each machine with 16 cores, 48G/96G RAM, 24TB/ 36TB disks –  100,000+ concurrent tasks –  10,000 concurrent jobs •  Wire Compatibility •  Agility & Evolution – Ability for customers to control upgrades to the grid software stack. 6
  • 7. Design Centre •  Split up the two major functions of JobTracker –  Cluster resource management –  Application life-cycle management •  MapReduce becomes user-land library 7
  • 8. Architecture Node Node Manager Manager Container App Mstr App Mstr Client Resource Node Node Resource Manager Manager Manager Manager Client Client App Mstr Container Container MapReduce Status Node Node MapReduce Status Manager Manager Job Submission Job Submission Node Status Node Status Resource Request Resource Request Container Container
  • 9. Architecture •  Resource Manager –  Global resource scheduler –  Hierarchical queues •  Node Manager –  Per-machine agent –  Manages the life-cycle of container –  Container resource monitoring •  Application Master –  Per-application –  Manages application scheduling and task execution –  E.g. MapReduce Application Master 9
  • 10. Improvements vis-à-vis current MapReduce •  Scalability –  Application life-cycle management is very expensive –  Partition resource management and application life- cycle management –  Application management is distributed –  Hardware trends - Currently run clusters of 4,000 machines •  6,000 2012 machines > 12,000 2009 machines •  <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB> 10
  • 11. Improvements vis-à-vis current MapReduce •  Fault Tolerance and Availability –  Resource Manager •  No single point of failure – state saved in ZooKeeper •  Application Masters are restarted automatically on RM restart •  Applications continue to progress with existing resources during restart, new resources aren’t allocated –  Application Master •  Optional failover via application-specific checkpoint •  MapReduce applications pick up where they left off via state saved in HDFS 11
  • 12. Improvements vis-à-vis current MapReduce •  Wire Compatibility –  Protocols are wire-compatible –  Old clients can talk to new servers –  Rolling upgrades 12
  • 13. Improvements vis-à-vis current MapReduce •  Innovation and Agility –  MapReduce now becomes a user-land library –  Multiple versions of MapReduce can run in the same cluster (a la Apache Pig) •  Faster deployment cycles for improvements –  Customers upgrade MapReduce versions on their schedule –  Users can customize MapReduce e.g. HOP without affecting everyone! 13
  • 14. Improvements vis-à-vis current MapReduce •  Utilization –  Generic resource model •  Memory •  CPU •  Disk b/w •  Network b/w –  Remove fixed partition of map and reduce slots 14
  • 15. Improvements vis-à-vis current MapReduce •  Support for programming paradigms other than MapReduce –  MPI –  Master-Worker –  Machine Learning –  Iterative processing –  Enabled by allowing use of paradigm-specific Application Master –  Run all on the same Hadoop cluster 15
  • 16. Summary •  MapReduce .Next takes Hadoop to the next level –  Scale-out even further –  High availability –  Cluster Utilization –  Support for paradigms other than MapReduce 16