SlideShare a Scribd company logo
1 of 22
YARN
Apache Hadoop Next Generation
Compute Platform
Bikas Saha
@bikassaha

© Hortonworks Inc. 2013

Page 1
Apache Hadoop & YARN
• Apache Hadoop
– De facto Big Data open source platform
– Running for about 5 years in production at hundreds of companies
like Yahoo, Ebay and Facebook

• Hadoop 2
– Significant improvements in HDFS distributed storage layer. High
Availability, NFS, Snapshots
– YARN – next generation compute framework for Hadoop designed
from the ground up based on experience gained from Hadoop 1
– YARN running in production at Yahoo for about a year
– YARN awarded Best Paper at SOCC 2013

© Hortonworks Inc. 2013 - Confidential

Page 2
1st Generation Hadoop: Batch Focus
HADOOP 1.0
Built for Web-Scale Batch Apps

Single App

Single App

INTERACTIVE

ONLINE

Single App

Single App

Single App

BATCH

BATCH

BATCH

HDFS

HDFS

All other usage patterns
MUST leverage same
infrastructure

HDFS

© Hortonworks Inc. 2013 - Confidential

Forces Creation of Silos to
Manage Mixed Workloads

Page 3
Hadoop 1 Architecture
JobTracker
Manage Cluster Resources & Job Scheduling

TaskTracker
Per-node agent
Manage Tasks

© Hortonworks Inc. 2013 - Confidential

Page 4
Hadoop 1 Limitations
Lacks Support for Alternate Paradigms and Services
Force everything needs to look like Map Reduce
Iterative applications in MapReduce are 10x slower

Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000

Availability
Failure Kills Queued & Running Jobs

Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
© Hortonworks Inc. 2013 - Confidential

Page 5
Our Vision: Hadoop as Next-Gen Platform

Single Use System

Multi Purpose Platform

Batch Apps

Batch, Interactive, Online, Streaming, …

HADOOP 1.0

HADOOP 2.0
MapReduce

Others

(data processing)

MapReduce

YARN

(cluster resource management
& data processing)

(cluster resource management)

HDFS

HDFS2

(redundant, reliable storage)

(redundant, highly-available & reliable storage)

© Hortonworks Inc. 2013 - Confidential

Page 6
Hadoop 2 - YARN Architecture
ResourceManager (RM)
Central agent - Manages and allocates
cluster resources

Node
Manager

NodeManager (NM)
Per-Node agent - Manages and

App Mstr

enforces node resource allocations

ApplicationMaster (AM)
Per-Application –

Resource
Manager

Node
Manager

Client
Container

Manages application
lifecycle and task

scheduling

MapReduce Status
Job Submission

Node
Manager

Node Status
Resource Request

© Hortonworks Inc. 2013 - Confidential

Page 7
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Applications Run Natively in Hadoop
BATCH
INTERACTIVE
(MapReduce)
(Tez)

ONLINE
(HBase)

STREAMING
(Storm, S4,…)

GRAPH
(Giraph)

IN-MEMORY
(Spark)

HPC MPI
(OpenMPI)

OTHER
(Search)
(Weave…)

YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)

© Hortonworks Inc. 2013 - Confidential

Page 8
5 Key Benefits of YARN
1.

New Applications & Services

2.

Improved cluster utilization

3.

Scale

4.

Experimental Agility

5.

Shared Services

© Hortonworks Inc. 2013 - Confidential

Page 9
Key Improvements in YARN
Framework supporting multiple applications
– Separate generic resource brokering from application logic
– Define protocols/libraries and provide a framework for custom
application development
– Share same Hadoop Cluster across applications

Cluster Utilization
– Generic resource container model replaces fixed Map/Reduce
slots. Container allocations based on locality, memory (CPU
coming soon)
– Sharing cluster among multiple application

© Hortonworks Inc. 2013 - Confidential

Page 10
Key Improvements in YARN
Scalability
– Removed complex app logic from RM, scale further
– State machine, message passing based loosely coupled design
– Compact scheduling protocol

Application Agility and Innovation
– Use Protocol Buffers for RPC gives wire compatibility
– Map Reduce becomes an application in user space unlocking
safe innovation
– Multiple versions of an app can co-exist leading to
experimentation
– Easier upgrade of framework and application

© Hortonworks Inc. 2013 - Confidential

Page 11
Key Improvements in YARN
Shared Services
– Common services needed to build distributed application are
included in a pluggable framework
– Distributed file sharing service
– Remote data read service
– Log Aggregation Service

© Hortonworks Inc. 2013 - Confidential

Page 12
YARN: Efficiency with Shared Services

Yahoo! leverages YARN
40,000+ nodes running YARN across over 365PB of data
~400,000 jobs per day for about 10 million hours of compute
time
Estimated a 60% – 150% improvement on node usage per
day using YARN
Eliminated Colo (~10K nodes) due to increased utilization
For more details check out the YARN SOCC 2013 paper
© Hortonworks Inc. 2013 - Confidential

Page 13
YARN as Cluster Operating System
ResourceManager

Scheduler

NodeManager

NodeManager

NodeManager

NodeManager

map 1.1
nimbus0

vertex1.1.1

vertex1.2.2

NodeManager

NodeManager

NodeManager

NodeManager

map1.2
Batch

Interactive SQL

vertex1.1.2

nimbus2

NodeManager

NodeManager

NodeManager

NodeManager

nimbus1
Real-Time

vertex1.2.1

reduce1.1

© Hortonworks Inc. 2013 - Confidential

Page 14
Multi-Tenancy is Built-in
• Queues
• Economics as queue-capacity
– Hierarchical Queues

• SLAs

ResourceManager

– Cooperative Preemption

Scheduler

• Resource Isolation
– Linux: cgroups
– Roadmap: Virtualization (Xen, KVM)

• Administration
– Queue ACLs
– Run-time re-configuration for queues
Default Capacity Scheduler supports
all features
© Hortonworks Inc. 2013 - Confidential

Hierarchical
Queues

root

Mrkting
20%

Dev
20%

Adhoc
10%

Prod
80%

DW
70%

Dev Reserved Prod
10%
20%
70%

P0
70%

P1
30%

Capacity Scheduler
Page 15
YARN Eco-system
Applications Powered by YARN
Apache Giraph – Graph Processing
Apache Hama - BSP
Apache Hadoop MapReduce – Batch
Apache Tez – Batch/Interactive
Apache S4 – Stream Processing
Apache Samza – Stream Processing
Apache Storm – Stream Processing
Apache Spark – Iterative applications
Elastic Search – Scalable Search
Cloudera Llama – Impala on YARN
DataTorrent – Data Analysis
HOYA – HBase on YARN

© Hortonworks Inc. 2013 - Confidential

There's an app for that...
YARN App Marketplace!

Frameworks Powered By YARN
Apache Twill
REEF by Microsoft
Spring support for Hadoop 2

Page 16
YARN Application Lifecycle
Application Client
Protocol

Application Client

YarnClient
App
Specific API

Resource
Manager
NodeManager
Application Master
Protocol

App
Container

Application Master

AMRMClient

Container
Management
Protocol

NMClient

© Hortonworks Inc. 2013 - Confidential

Page 17
BYOA – Bring Your Own App
Application Client Protocol: Client to RM interaction
– Library: YarnClient
– Application Lifecycle control
– Access Cluster Information

Application Master Protocol: AM – RM interaction
– Library: AMRMClient / AMRMClientAsync
– Resource negotiation
– Heartbeat to the RM

Container Management Protocol: AM to NM interaction
– Library: NMClient/NMClientAsync
– Launching allocated containers
– Stop Running containers

Use external frameworks like Twill/REEF/Spring
© Hortonworks Inc. 2013 - Confidential

Page 18
YARN Future Work
• ResourceManager High Availability
– Automatic failover
– Work preserving failover

• Scheduler Enhancements
– SLA Driven Scheduling, Low latency allocations
– Multiple resource types – disk/network/GPUs/affinity

• Rolling upgrades
• Generic History Service
• Long running services
– Better support to running services like HBase
– Service Discovery

• More utilities/libraries for Application Developers
– Failover/Checkpointing

© Hortonworks Inc. 2013 - Confidential

Page 19
Key Take-Aways
• YARN is a platform to build/run Multiple Distributed Applications
in Hadoop
• YARN is completely Backwards Compatible for existing
MapReduce apps
• YARN enables Fine Grained Resource Management via Generic
Resource Containers.
• YARN has built-in support for multi-tenancy to share cluster
resources and increase cost efficiency
• YARN provides a cluster operating system like abstraction for a
modern data architecture

© Hortonworks Inc. 2013 - Confidential

Page 20
Apache YARN
The Data Operating System for Hadoop 2.0
Flexible

Efficient

Shared

Enables other purpose-built data
processing models beyond
MapReduce (batch), such as
interactive and streaming

Increase processing IN Hadoop
on the same hardware while
providing predictable
performance & quality of service

Provides a stable, reliable,
secure foundation and
shared operational services
across multiple workloads

Data Processing Engines Run Natively IN Hadoop
BATCH
MapReduce

INTERACTIVE
Tez

ONLINE
HBase

STREAMING
Storm, S4, …

GRAPH
Giraph

MICROSOFT
REEF

SAS
LASR, HPA

OTHERS

YARN: Cluster Resource Management
HDFS2: Redundant, Reliable Storage

© Hortonworks Inc. 2013 - Confidential

Page 21
Thank you!

http://hortonworks.com/products/hortonworks-sandbox/

Download Sandbox: Experience Apache Hadoop
Both 2.0 and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/

Questions?
© Hortonworks Inc. 2013 - Confidential

Page 22

More Related Content

What's hot

Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databasesAshwani Kumar
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 
NoSQL Data Architecture Patterns
NoSQL Data ArchitecturePatternsNoSQL Data ArchitecturePatterns
NoSQL Data Architecture PatternsMaynooth University
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaEdureka!
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Advanced Operating System- Introduction
Advanced Operating System- IntroductionAdvanced Operating System- Introduction
Advanced Operating System- IntroductionDebasis Das
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 

What's hot (20)

Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
NoSQL Data Architecture Patterns
NoSQL Data ArchitecturePatternsNoSQL Data ArchitecturePatterns
NoSQL Data Architecture Patterns
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Advanced Operating System- Introduction
Advanced Operating System- IntroductionAdvanced Operating System- Introduction
Advanced Operating System- Introduction
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 

Similar to Apache Hadoop YARN: Understanding the Data Operating System of Hadoop

Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 
YARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopYARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopHortonworks
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
YARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User GroupYARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User GroupRommel Garcia
 
Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Hortonworks
 
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Hakka Labs
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopPOSSCON
 
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionDataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionWangda Tan
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionDataWorks Summit
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNHortonworks
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider projectSteve Loughran
 
Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0Big Data Joe™ Rossi
 

Similar to Apache Hadoop YARN: Understanding the Data Operating System of Hadoop (20)

Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
YARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopYARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
YARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User GroupYARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User Group
 
Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014
 
Yarnthug2014
Yarnthug2014Yarnthug2014
Yarnthug2014
 
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionDataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARN
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0
 

More from Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

More from Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 

Recently uploaded (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 

Apache Hadoop YARN: Understanding the Data Operating System of Hadoop

  • 1. YARN Apache Hadoop Next Generation Compute Platform Bikas Saha @bikassaha © Hortonworks Inc. 2013 Page 1
  • 2. Apache Hadoop & YARN • Apache Hadoop – De facto Big Data open source platform – Running for about 5 years in production at hundreds of companies like Yahoo, Ebay and Facebook • Hadoop 2 – Significant improvements in HDFS distributed storage layer. High Availability, NFS, Snapshots – YARN – next generation compute framework for Hadoop designed from the ground up based on experience gained from Hadoop 1 – YARN running in production at Yahoo for about a year – YARN awarded Best Paper at SOCC 2013 © Hortonworks Inc. 2013 - Confidential Page 2
  • 3. 1st Generation Hadoop: Batch Focus HADOOP 1.0 Built for Web-Scale Batch Apps Single App Single App INTERACTIVE ONLINE Single App Single App Single App BATCH BATCH BATCH HDFS HDFS All other usage patterns MUST leverage same infrastructure HDFS © Hortonworks Inc. 2013 - Confidential Forces Creation of Silos to Manage Mixed Workloads Page 3
  • 4. Hadoop 1 Architecture JobTracker Manage Cluster Resources & Job Scheduling TaskTracker Per-node agent Manage Tasks © Hortonworks Inc. 2013 - Confidential Page 4
  • 5. Hadoop 1 Limitations Lacks Support for Alternate Paradigms and Services Force everything needs to look like Map Reduce Iterative applications in MapReduce are 10x slower Scalability Max Cluster size ~5,000 nodes Max concurrent tasks ~40,000 Availability Failure Kills Queued & Running Jobs Hard partition of resources into map and reduce slots Non-optimal Resource Utilization © Hortonworks Inc. 2013 - Confidential Page 5
  • 6. Our Vision: Hadoop as Next-Gen Platform Single Use System Multi Purpose Platform Batch Apps Batch, Interactive, Online, Streaming, … HADOOP 1.0 HADOOP 2.0 MapReduce Others (data processing) MapReduce YARN (cluster resource management & data processing) (cluster resource management) HDFS HDFS2 (redundant, reliable storage) (redundant, highly-available & reliable storage) © Hortonworks Inc. 2013 - Confidential Page 6
  • 7. Hadoop 2 - YARN Architecture ResourceManager (RM) Central agent - Manages and allocates cluster resources Node Manager NodeManager (NM) Per-Node agent - Manages and App Mstr enforces node resource allocations ApplicationMaster (AM) Per-Application – Resource Manager Node Manager Client Container Manages application lifecycle and task scheduling MapReduce Status Job Submission Node Manager Node Status Resource Request © Hortonworks Inc. 2013 - Confidential Page 7
  • 8. YARN: Taking Hadoop Beyond Batch Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service Applications Run Natively in Hadoop BATCH INTERACTIVE (MapReduce) (Tez) ONLINE (HBase) STREAMING (Storm, S4,…) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) OTHER (Search) (Weave…) YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) © Hortonworks Inc. 2013 - Confidential Page 8
  • 9. 5 Key Benefits of YARN 1. New Applications & Services 2. Improved cluster utilization 3. Scale 4. Experimental Agility 5. Shared Services © Hortonworks Inc. 2013 - Confidential Page 9
  • 10. Key Improvements in YARN Framework supporting multiple applications – Separate generic resource brokering from application logic – Define protocols/libraries and provide a framework for custom application development – Share same Hadoop Cluster across applications Cluster Utilization – Generic resource container model replaces fixed Map/Reduce slots. Container allocations based on locality, memory (CPU coming soon) – Sharing cluster among multiple application © Hortonworks Inc. 2013 - Confidential Page 10
  • 11. Key Improvements in YARN Scalability – Removed complex app logic from RM, scale further – State machine, message passing based loosely coupled design – Compact scheduling protocol Application Agility and Innovation – Use Protocol Buffers for RPC gives wire compatibility – Map Reduce becomes an application in user space unlocking safe innovation – Multiple versions of an app can co-exist leading to experimentation – Easier upgrade of framework and application © Hortonworks Inc. 2013 - Confidential Page 11
  • 12. Key Improvements in YARN Shared Services – Common services needed to build distributed application are included in a pluggable framework – Distributed file sharing service – Remote data read service – Log Aggregation Service © Hortonworks Inc. 2013 - Confidential Page 12
  • 13. YARN: Efficiency with Shared Services Yahoo! leverages YARN 40,000+ nodes running YARN across over 365PB of data ~400,000 jobs per day for about 10 million hours of compute time Estimated a 60% – 150% improvement on node usage per day using YARN Eliminated Colo (~10K nodes) due to increased utilization For more details check out the YARN SOCC 2013 paper © Hortonworks Inc. 2013 - Confidential Page 13
  • 14. YARN as Cluster Operating System ResourceManager Scheduler NodeManager NodeManager NodeManager NodeManager map 1.1 nimbus0 vertex1.1.1 vertex1.2.2 NodeManager NodeManager NodeManager NodeManager map1.2 Batch Interactive SQL vertex1.1.2 nimbus2 NodeManager NodeManager NodeManager NodeManager nimbus1 Real-Time vertex1.2.1 reduce1.1 © Hortonworks Inc. 2013 - Confidential Page 14
  • 15. Multi-Tenancy is Built-in • Queues • Economics as queue-capacity – Hierarchical Queues • SLAs ResourceManager – Cooperative Preemption Scheduler • Resource Isolation – Linux: cgroups – Roadmap: Virtualization (Xen, KVM) • Administration – Queue ACLs – Run-time re-configuration for queues Default Capacity Scheduler supports all features © Hortonworks Inc. 2013 - Confidential Hierarchical Queues root Mrkting 20% Dev 20% Adhoc 10% Prod 80% DW 70% Dev Reserved Prod 10% 20% 70% P0 70% P1 30% Capacity Scheduler Page 15
  • 16. YARN Eco-system Applications Powered by YARN Apache Giraph – Graph Processing Apache Hama - BSP Apache Hadoop MapReduce – Batch Apache Tez – Batch/Interactive Apache S4 – Stream Processing Apache Samza – Stream Processing Apache Storm – Stream Processing Apache Spark – Iterative applications Elastic Search – Scalable Search Cloudera Llama – Impala on YARN DataTorrent – Data Analysis HOYA – HBase on YARN © Hortonworks Inc. 2013 - Confidential There's an app for that... YARN App Marketplace! Frameworks Powered By YARN Apache Twill REEF by Microsoft Spring support for Hadoop 2 Page 16
  • 17. YARN Application Lifecycle Application Client Protocol Application Client YarnClient App Specific API Resource Manager NodeManager Application Master Protocol App Container Application Master AMRMClient Container Management Protocol NMClient © Hortonworks Inc. 2013 - Confidential Page 17
  • 18. BYOA – Bring Your Own App Application Client Protocol: Client to RM interaction – Library: YarnClient – Application Lifecycle control – Access Cluster Information Application Master Protocol: AM – RM interaction – Library: AMRMClient / AMRMClientAsync – Resource negotiation – Heartbeat to the RM Container Management Protocol: AM to NM interaction – Library: NMClient/NMClientAsync – Launching allocated containers – Stop Running containers Use external frameworks like Twill/REEF/Spring © Hortonworks Inc. 2013 - Confidential Page 18
  • 19. YARN Future Work • ResourceManager High Availability – Automatic failover – Work preserving failover • Scheduler Enhancements – SLA Driven Scheduling, Low latency allocations – Multiple resource types – disk/network/GPUs/affinity • Rolling upgrades • Generic History Service • Long running services – Better support to running services like HBase – Service Discovery • More utilities/libraries for Application Developers – Failover/Checkpointing © Hortonworks Inc. 2013 - Confidential Page 19
  • 20. Key Take-Aways • YARN is a platform to build/run Multiple Distributed Applications in Hadoop • YARN is completely Backwards Compatible for existing MapReduce apps • YARN enables Fine Grained Resource Management via Generic Resource Containers. • YARN has built-in support for multi-tenancy to share cluster resources and increase cost efficiency • YARN provides a cluster operating system like abstraction for a modern data architecture © Hortonworks Inc. 2013 - Confidential Page 20
  • 21. Apache YARN The Data Operating System for Hadoop 2.0 Flexible Efficient Shared Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming Increase processing IN Hadoop on the same hardware while providing predictable performance & quality of service Provides a stable, reliable, secure foundation and shared operational services across multiple workloads Data Processing Engines Run Natively IN Hadoop BATCH MapReduce INTERACTIVE Tez ONLINE HBase STREAMING Storm, S4, … GRAPH Giraph MICROSOFT REEF SAS LASR, HPA OTHERS YARN: Cluster Resource Management HDFS2: Redundant, Reliable Storage © Hortonworks Inc. 2013 - Confidential Page 21
  • 22. Thank you! http://hortonworks.com/products/hortonworks-sandbox/ Download Sandbox: Experience Apache Hadoop Both 2.0 and 1.x Versions Available! http://hortonworks.com/products/hortonworks-sandbox/ Questions? © Hortonworks Inc. 2013 - Confidential Page 22

Editor's Notes

  1. As Arun mentioned there are less JVMs to spin up per job management (1 instead of 3) as well as the RM and NM provisioning being fasterOriginally conceived & architected by the team at Yahoo!Arun Murthy created the original JIRA in 2008 and led the PMCThe team at Hortonworks has been working on YARN for 4 years: 90% of code from Hortonworks & Yahoo!YARN based architecture running at scale at Yahoo!Deployed on 35,000 nodes for 6+ monthsMultitude of YARN applications*********************On great public example of in production use of YARN, is at Yahoo!. They outlined some performance gains in a keynote address at Hadoop Summit this year. Yahoo uses YARN for three use cases, stream processing, iterative processing and shared storage. With Storm on YARN they stream data into a cluster and execute 5 second analytics windows. This cluster is only 320 nodes, but is processing 133,000 events per second and is executing 12000 threads. Their shared data cluster uses 1900 nodes to store 2PB of data.In all, Yahoo has over 30000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time. They also have estimated a 60% – 150% improvement on node usage per day.ANDAt this point, over 50,000 Hadoop nodes have been upgraded at Yahoo from Hadoop 1.0 to Hadoop 2, yielding 50% improvement in cluster utilization & efficiency.This should be a big deal in terms of potential ROI.
  2. Reconfig already works (HDP-1.3 and HDP-2.0).Chargeback - coming in HDP-2.1https://issues.apache.org/jira/browse/YARN-415----- Meeting Notes (11/14/13 11:07) -----POINT AND CLICK TIME BASED RESOURCE ALLOCATION
  3. Graph processing – Giraph, HamaStream proessing – Smaza, Storm, Spark, DataTorrentMapReduceTez – fast query executionWeave/REEF – frameworks to help with writing applicationsList of some of the applications which already support YARN, in some form.Smaza, Storm, S4 and DataTorrent are streaming frameworksVarious types of graph processing frameworks – Giraph and Hama are graph processing systemsThere’s some github projects – caching systems, on-demand web-server spin up Wave and REEF are frameworks on top of YARN to make writing applications easier
  4. YarnClientApplication Lifecycle Control – submit, query state, kill - .e.g Submit, followed by poll state, and optionally connect to AM for further stateCluster Information – Query node status, Scheduler queues etcAMRMCLientRequest Resources from the RM – specifies a priority (within app only), location information, and whether strict locality is required, also number of such containers.Hides the details of the protocol – which is absolute (i.e. each call to the RM must provide complete information about a priority level)Once a container is allocated, provides a utility method to link it back to the entity which generated the request (entity optionally passed in while asking for containers)Allocate call doubles as a heartbeat – must be sent out ever so often.Also provides some node level information, available resources, etcNMClientOnce resources are obtained from the RM, an App uses this to communicate with the NM to start containers. Setup the ContainerLuanchContext – cmdLine, environment, resources.Authorization information is part of the contianers allocated by the RM – to prevent unauthorized usageOnly a single process can be launched on an allocated container, and one must be within a certain period otherwise the allocation times outContainer must launch a process with a certain time after being allocatedOnly a single process can be launched on an allocated container
  5. HA and work preserving – being actively worked upoin by the communitiy.Scheduler – Additional resources – specifically disk / network. Gang schedulingRolling upgrades – upgrading a cluster typically involves downtime. NM forgets containers across restartsLong Running – Enhandcement to log handling, security, multiple tasks per container, container resizingHA and work preserving restart are still being worked on in the community – YARN-128 and YARN-149.On scheduling – there’ve been requests for gang scheduling, meeting SLAs. Also TBD is support for scheduling additional resource types – disk/ network.Rolling Upgrades – some work pending. Big piece here, which ties in with work preserving restart – restarting a NodeManager should not cause processes started by the previous NM to be killedLong Running Services support – handling logs, security – specifically token expiryAdditional utility libraries to help AppWriters – primarily geared towards checkpointing in the AM, app history handling
  6. The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future.For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.