LESSONS LEARNED FROM
SCALING YARN TO 40K
MACHINES IN A MULTI
TENANCY ENVIRONMENT
Roni Burd : Principal Software Eng Mgr
Hitesh Sharma: Principal Software Eng
Sarvesh Sakalanaga: Senior Software Eng Mgr
Cosmos: World’s Biggest YARN Cluster!
Cosmos in numbers
• Single DC >40,000 machines
• Multiple DCs (do the math)
• >500,000 jobs / day
• >2,000,000 containers / hour
• Up to 90% CPU utilization
• 99.9% reliability/availability
• Several Exabytes in storage
• Hundreds of petabytes processed per day
Journey
Migrate Cosmos to YARN
Run OSS workloads
Multiplier effect from the community
Cosmos environment
• Running different YARN applications with different characteristics
• SCOPE (AKA U-SQL) is the most common application today
• Spark, custom REEF code and custom YARN AMs
• As we migrate existing workloads to YARN, we have to remain compatible with legacy apps
• Cosmos supports running SLA and non-SLA jobs in the same datacenter
• Customers are allocated a Virtual Cluster (VC) with a max number of containers
• Machines allocated to each VC are shared across all tenants
• Virtual clusters have guaranteed capacity and can use idle capacity from other tenants
• Minimize COGS and reduce latency
• Maximize data locality and prevent data movement/copying
• Don't leave any resource unused – CPU, memory, IOs and bandwidth
Scope request characteristics
[Diagram: the Scope AM submits container requests to NodeManagers NM1…NM3904 spread across racks; a graph of parallel containers over time shows waves capped at the max containers allowed]
• Scope submits containers in ‘waves’
• Waves are capped by max allocation
• Waves can be very big (2000 or more)
• Jobs achieve SLA by avoiding contention
• Allocated capacity < Total Cluster Capacity
• This leads to natural underutilization
Achieving 90% CPU utilization: Mercury (YARN-2877)
Opportunistic containers
• Scope allocates OPPORTUNISTIC containers
• OPP containers are queued in the NM based on the NM's capacity
• Scheduled after exhausting the max GUARANTEED allotment
• Once GUARANTEED containers finish, SCOPE can either:
• Promote running OPPORTUNISTIC containers, or
• Schedule a new GUARANTEED wave instead
• OPPORTUNISTIC containers get paused (YARN-5972)
• This improves latencies, reduces COGS and preserves SLAs
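As a rough sketch of what this looks like from an AM's point of view, the Java below requests a wave of OPPORTUNISTIC containers and later promotes one to GUARANTEED using the Hadoop 2.9+ AMRMClient API. This is illustrative, not the actual Scope AM code; constructor shapes vary across Hadoop versions, and the sizes, priority and wave size are made-up values.

    import org.apache.hadoop.yarn.api.records.*;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Illustrative only: request a wave of OPPORTUNISTIC containers, then
    // promote a running one to GUARANTEED once capacity frees up.
    public class OpportunisticWaveSketch {
      static void submitWave(AMRMClient<ContainerRequest> amrmClient, int waveSize) {
        Resource capability = Resource.newInstance(4096 /* MB */, 2 /* vcores */);
        Priority priority = Priority.newInstance(1);
        ExecutionTypeRequest opportunistic =
            ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true);
        for (int i = 0; i < waveSize; i++) {
          // nodes/racks null: let the scheduler place; the allocationRequestId
          // ties each allocation back to this request.
          amrmClient.addContainerRequest(new ContainerRequest(
              capability, null /* nodes */, null /* racks */, priority,
              i /* allocationRequestId */, true /* relaxLocality */,
              null /* nodeLabels */, opportunistic));
        }
      }

      static void promote(AMRMClient<ContainerRequest> amrmClient, Container running) {
        // Ask the RM to upgrade a running OPPORTUNISTIC container to GUARANTEED.
        amrmClient.requestContainerUpdate(running,
            UpdateContainerRequest.newInstance(running.getVersion(),
                running.getId(), ContainerUpdateType.PROMOTE_EXECUTION_TYPE,
                null /* capability unchanged */, ExecutionType.GUARANTEED));
      }
    }

Pairing enforced OPPORTUNISTIC requests with explicit promotion is what lets the AM choose between promoting running work and scheduling a fresh GUARANTEED wave.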
In aggregate, SCOPE generates 4,000 QPS (avg) / 10,000 QPS (max)
[Diagram: the Scope AM calls Allocate(G) on the YARN RM, which responds only when a GUARANTEED container is allocated, and Allocate(OPP) on the NM scheduler, which responds immediately; the NM keeps opportunistic and paused queues alongside running GUARANTEED containers started via StartContainer(G), and the SCOPE Aux Service feeds load information back. "Promote" is based on a 2.7 patch; working with OSS for 2.9]
Latency-sensitive workloads
• Durations of Scope containers vary widely, but most are short – delays are expensive!
• Any delays in Allocate() or StartContainer() can lead to job slowdowns and failures
• A 20 sec delay can mean millions of US dollars
• Locality is extremely important for latency and COGS
• Scope has visibility into all 40,000 machines through an Aux Service
• The Scope AM is smart and makes several optimizations (e.g. future JOINs, RACK aggregations)
• RACK locality is "almost" as good as NODE locality – OFF_SWITCH is bad
Container allocation needs to be <5 sec @ 95th %tile
Node locality on every request across the DC is important for latency
How did we scale YARN to deal with our QPS, latency and locality requirements?
How did we scale YARN to 40K machines?
Hard lessons (that we have time to cover today)
YARN RM Scalability
Hitesh Sharma
Principal Software Engineer
Scope – main workload
In aggregate, SCOPE generates 4,000 QPS (avg) / 10,000 QPS (max)
Container allocation needs to be <5 sec @ 95th %tile
Node locality on every request across the DC is important for latency
Challenges
• A few Scope jobs would overwhelm the YARN RM in a few minutes
• Long delays in allocating containers would result in job failures
• Scale testing in big clusters is a very expensive process
• Lack of metrics and telemetry to understand what's happening in the YARN RM
Setup
• Test cluster with 3000 nodes
• Running Hadoop 2.7.1
• CapacityScheduler with a single queue
[Chart: Allocated vs Pending containers – red line: pending containers; blue line: allocated containers. Note: the cluster had resources to satisfy all the requests being made]
Scheduling in capacity-scheduler
[Diagram: AMs send Allocate calls to the YARN RM while NMs heartbeat (HB) in; the scheduler loop runs on threads driven by node heartbeats]
• On each HB, the YARN RM takes a lock and looks at all the outstanding container requests in the queue
• If the node can satisfy a pending request then the YARN RM allocates; otherwise it just looks at the next one
• The loop looks at _every_ pending request (potentially thousands) – the algorithm also uses this scan to downgrade locality
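To make the cost concrete, here is a deliberately simplified model of that loop. It is illustrative Java, not the actual CapacityScheduler source, and the types are hypothetical stand-ins:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified model of the heartbeat-driven loop described above.
    class NaiveSchedulerSketch {
      static class PendingRequest { int memoryMb; int missedOpportunities; }
      static class Node { int freeMemoryMb; }

      final List<PendingRequest> allPendingRequests = new ArrayList<>();

      // Called once per node heartbeat, under a global scheduler lock.
      synchronized void onNodeHeartbeat(Node node) {
        for (PendingRequest req : allPendingRequests) { // scans EVERY pending request
          if (node.freeMemoryMb >= req.memoryMb) {
            node.freeMemoryMb -= req.memoryMb;          // allocate on this node
          } else {
            req.missedOpportunities++;                  // heartbeat counting later
          }                                             // relaxes locality to rack/ANY
        }
      }
    }

With thousands of pending requests, thousands of nodes heartbeating every second, and everything serialized behind one lock, this scan is where the time goes.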
Bottlenecks
• The scheduler loop is very expensive as it iterates over all the outstanding requests
• Counting missed opportunities to relax locality hurts scale and is hard to tune
• Creating immutable "Resource" objects during heartbeat processing is very expensive
• Log4j is synchronous by default
• Only one allocation per node heartbeat
Improvements
1. Scheduler key pruning
• Each node heartbeat looks only at the outstanding allocation requests for that node
2. Time-based decay for locality
• Use elapsed time to decide whether to downgrade to rack or anywhere in the cluster
• Reduces the work done in each node heartbeat
3. Switched to the async log4j appender
• Reduces expensive lock and IO contention
4. Metrics to track allocation latencies and QPS
• TTD (time to detect) and TTM (time to mitigate) are critical to achieve 99.9% availability/reliability
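A minimal sketch of how the first two improvements fit together, assuming hypothetical types and made-up decay thresholds (the real change lives inside the CapacityScheduler; this only shows the shape of the idea):

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Pending requests are indexed by scheduler key so each heartbeat only
    // inspects requests relevant to that node, and locality relaxes by
    // elapsed time instead of counting missed heartbeats.
    class PrunedSchedulerSketch {
      static final long NODE_TO_RACK_MS = 1_000; // hypothetical decay thresholds
      static final long RACK_TO_ANY_MS = 3_000;

      static class PendingRequest {
        final long createdMs = System.currentTimeMillis();
        boolean allocated; // set once served; stale index entries are skipped
      }

      // Scheduler key (node name, rack name, or "*") -> requests wanting it.
      final Map<String, Queue<PendingRequest>> index = new HashMap<>();

      synchronized void add(PendingRequest req, String node, String rack) {
        // One request is indexed under all three keys it could be served from.
        for (String key : new String[] {node, rack, "*"}) {
          index.computeIfAbsent(key, k -> new ArrayDeque<>()).add(req);
        }
      }

      // Called per node heartbeat: O(requests for this node), not O(all pending).
      synchronized PendingRequest onNodeHeartbeat(String node, String rack) {
        long now = System.currentTimeMillis();
        PendingRequest req = poll(node, now, 0);                  // node-local first
        if (req == null) req = poll(rack, now, NODE_TO_RACK_MS);  // decayed to rack
        if (req == null) req = poll("*", now, RACK_TO_ANY_MS);    // decayed to ANY
        return req; // null: nothing schedulable here on this heartbeat
      }

      private PendingRequest poll(String key, long now, long minAgeMs) {
        Queue<PendingRequest> q = index.get(key);
        while (q != null && !q.isEmpty()) {
          PendingRequest head = q.peek();
          if (head.allocated) { q.poll(); continue; }       // already served elsewhere
          if (now - head.createdMs < minAgeMs) return null; // not decayed yet
          head.allocated = true;
          return q.poll();
        }
        return null;
      }
    }

The key properties are that a heartbeat now touches only the queues for its own node, its rack, and ANY, and that locality relaxes on a wall-clock schedule independent of how often any particular node happens to heartbeat.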
Test results
• Test cluster with 4000 nodes
• RM-NM and RM-AM heartbeats set to 1 sec
• Using the log4j async logger

Stage: Before improvements + relaxLocality ON
• Allocation latency @ 95th %tile < 10s
• Promotion latency @ 95th %tile < 10s
• Node locality: <10%
• ANY locality: >80%
• Sustained load < 500 QPS

Stage: Scheduler key pruning
• Allocation latency @ 95th %tile < 4s
• Promotion latency @ 95th %tile < 4s
• Node locality: 99.51%
• Rack locality: 0.23%
• Sustained load < 2000 QPS

Stage: Time-based decay for locality
• Allocation latency @ 95th %tile < 3s
• Promotion latency @ 95th %tile < 3s
• Node locality: 99.84%
• Rack locality: 0.11%
• Sustained load < 3000 QPS
Test results
[Charts: Allocated vs Pending containers, before and after the improvements]
Example dashboard
[Screenshot: dashboard panels showing pending/allocated, paused, and queued containers; percentile latency and QPS for Allocate and Promote; locality metrics; and running containers by type]
Scheduler Load Simulator (aka SLS)
• Single-machine setup to stress test the YARN scheduler (part of OSS since 2.6)
• Allows us to try out clusters with any number of nodes and different configurations (NM-RM heartbeat frequency, scheduler configurations, etc.)
• Updated SLS to generate load similar to our workloads:
• Container request waves
• Container duration
• Multiple apps
• Allocation ID
• Relax locality
• Opportunistic allocations
• Container promotions
We can now try out different settings and fixes in a matter of minutes!
[Diagram: a simulator thread mimicking the Scope AM drives a real YARN RM, with mock NM threads heartbeating (HB) into it]
Scaling YARN to 40K+ nodes and beyond
Sarvesh Sakalanaga
Senior Engineering Manager
Federation – High level architecture
[Diagram: a YARN Client submits an app through the Router Service; Federation Services (Router, Policy, State) front YARN Sub-Clusters #1–#3, each with its own RM running tasks on the servers in the datacenter; an AM RM Proxy Service (per NodeManager) hosts the Federation Interceptor, a UAM pool and a smart policy, and starts containers across sub-clusters]
• Router Service: implements the Client-RM protocol; a stateless, scalable service run as multiple instances behind a load balancer
• State store: a centralized, highly-available repository
• AM RM Proxy Service: implements the AM-RM protocol; hosted in the NM; intercepts all AM-RM communications
• Sub-clusters are unmodified standalone YARN clusters with about 4K nodes
• Voila! Applications can transparently span multiple YARN sub-clusters and scale to datacenter level
• No code change in any application
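For reference, federation is switched on through ordinary yarn-site.xml properties; the sketch below sets them in code for brevity. The property names follow the Hadoop YARN Federation documentation, but treat them as something to verify against your Hadoop version:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Sketch of the configuration that wires up federation.
    public class FederationConfigSketch {
      public static YarnConfiguration subClusterRmConf() {
        YarnConfiguration conf = new YarnConfiguration();
        conf.setBoolean("yarn.federation.enabled", true);
        // Each sub-cluster RM identifies itself to the federation state store.
        conf.set("yarn.resourcemanager.cluster-id", "subcluster-1");
        return conf;
      }

      public static YarnConfiguration nodeManagerConf() {
        YarnConfiguration conf = new YarnConfiguration();
        // The AMRMProxy runs inside every NM and intercepts AM<->RM traffic.
        conf.setBoolean("yarn.nodemanager.amrmproxy.enabled", true);
        conf.set("yarn.nodemanager.amrmproxy.interceptor-class.pipeline",
            "org.apache.hadoop.yarn.server.nodemanager.amrmproxy.FederationInterceptor");
        return conf;
      }
    }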
Scope – main workload
In aggregate, SCOPE generates 4,000 QPS (avg) / 10,000 QPS (max)
Container allocation needs to be <5 sec @ 95th %tile
Node locality on every request across the DC is important for latency
Production challenges
1. Load shaping
2. Cluster maintenance
3. Log management
1: Load shaping
Limitations of BroadcastAMRMProxyPolicy
• RM scalability: it increases QPS in all the sub-clusters
• Cluster utilization: duplicate allocations for each sub-cluster
Solution: LocalityMulticastAMRMProxyPolicy
1. Create UAMs on demand
2. Route allocations to the sub-clusters that own the requested racks
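The intuition behind the locality-aware policy fits in a few lines. The sketch below is hypothetical Java that does not use the real FederationAMRMProxyPolicy interface; it only illustrates the routing idea of sending each request to the sub-cluster that owns it:

    import java.util.HashMap;
    import java.util.Map;

    // Instead of broadcasting every request to all sub-clusters, route each
    // request to the sub-cluster that owns the requested rack, falling back
    // to the job's home sub-cluster for ANY requests.
    class RackAwareRoutingSketch {
      final Map<String, String> rackToSubCluster = new HashMap<>(); // rack -> sub-cluster id
      final String homeSubCluster;

      RackAwareRoutingSketch(String homeSubCluster) {
        this.homeSubCluster = homeSubCluster;
      }

      /** Returns the sub-cluster that should serve a request for this resource name. */
      String route(String resourceName) {
        if ("*".equals(resourceName)) {
          return homeSubCluster;          // ANY: keep load on the home sub-cluster
        }
        String owner = rackToSubCluster.get(resourceName);
        return owner != null ? owner : homeSubCluster; // unknown rack: fall back
      }
    }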
2: Cluster maintenance
Each DC has constant machine movement
• 13 sub-clusters split our 40K+ machine DC
• ~800 machines/day need some form of maintenance
• Clusters keep growing and changing (e.g. decommissioning RACKs)
Solution: use a Sub-Cluster Manager for balancing
1. Node-to-sub-cluster resolver service
2. Dynamic balancing of sub-cluster capacity
3. NM maintenance mode: container draining
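A hypothetical sketch of the resolver idea, with made-up types: a small service owns the node-to-sub-cluster mapping so it can be changed at runtime as machines are rebalanced or drained:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Maps every node to its current sub-cluster and maintenance state.
    class SubClusterResolverSketch {
      enum State { HEALTHY, MAINTENANCE }

      static final class Assignment {
        final String subCluster; final State state;
        Assignment(String subCluster, State state) {
          this.subCluster = subCluster; this.state = state;
        }
      }

      private final Map<String, Assignment> byNode = new ConcurrentHashMap<>();

      /** Rebalancing: move a node to another sub-cluster. */
      void assign(String node, String subCluster) {
        byNode.put(node, new Assignment(subCluster, State.HEALTHY));
      }

      /** Maintenance mode: stop placing new work; running containers drain first. */
      void drain(String node) {
        byNode.computeIfPresent(node,
            (n, a) -> new Assignment(a.subCluster, State.MAINTENANCE));
      }

      String subClusterOf(String node) {
        Assignment a = byNode.get(node);
        return a == null ? null : a.subCluster;
      }
    }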
2: Cluster maintenance scenario: adding machines
[Diagram: the Sub-cluster Manager moves racks of machines among Yarn Sub-Clusters #1–#3; racks are marked HEALTHY or MAINTENANCE, each hosting opportunistic (O) and guaranteed (G) containers]
3: Log handling
Log volume per DC: 2 PB/day; 2.5 GB/hour per machine
1. Custom log aggregator that works with our Scope AM to keep logs only for critical-path containers and failed containers
2. Custom tools to aggregate container logs outside of YARN on an as-needed basis
3. Custom log search tool (Helios – internal code name) that indexes YARN logs based on keywords
4. Scope NM AUX service to keep track of key information on container stats
Next Steps
Scalability
• Multi-cast policy tuning for better placements, container reuse, lightweight "Resource" objects, multiple node allocations in the same heartbeat
Utilization
• Container resize, improved opportunistic container utilization with system priorities, Federation Global Policy Generator enhancements, and relaxed locality for opportunistic containers
Operability
• ATS v2, more metrics and logs, and better log management
And many more…!
Fully committed to YARN and Open Source
Committed
• Federation: YARN-2915
• AllocationID: YARN-4879
• Support distributed scheduling (AKA Mercury): YARN-2877
• AMRMProxy: YARN-2884
• UnmanagedAM pool manager: YARN-5531
• Federation Interceptor: YARN-6511
• Locality – Multicast Policy: YARN-5325
In progress/Open
• Federation phase 2: YARN-5597
• GPG: YARN-3660
• Router: YARN-3659
• Pausing of Opportunistic containers: YARN-5972
• Container Promotion: YARN-5085
• Scheduling of Opp through YARN RM: YARN-5220
Conclusion
Largest YARN deployment in the world!
...and we are growing 5X next year
…with more OSS workloads
Our Journey has just started!
Editor's Notes
1. 1) When we started to run YARN in a cluster with over 3000 nodes, a few Scope jobs would easily overwhelm the YARN RM. For instance, just 10 terasorts could cause the YARN RM to fall over. 2) We would see long delays in allocation latencies and in most cases never got a container back from the YARN RM. 3) Another challenge was that telemetry and logging were somewhat missing, and it was incredibly hard to diagnose issues in production. 4) It is also quite expensive to test in big clusters, as the cycle of deployment and testing can be a few hours in the best case. 5) This graph shows the number of allocated and pending containers in the YARN RM. We use these metrics as a proxy to know whether the YARN RM can keep up with the rate at which AMs are requesting containers. 6) The Y axis is the # of containers and the X axis shows time. The blue line shows the # of allocated containers while the red line shows the # of pending containers. If the number of outstanding containers becomes too high, the YARN RM won't be able to do any allocations.
2. At a very high level, here is a quick recap of how the capacity scheduler does scheduling. The application masters send container requests to the YARN RM, and these get added to a "pending queue" inside the RM. The YARN RM also receives heartbeats from all the nodes in the cluster. Upon receiving a node heartbeat, the YARN RM takes a lock and looks at the pending queue to see what can be satisfied by that node. If the node can satisfy a request then an allocation is done; otherwise the YARN RM looks through the other pending requests in the queue. One major issue here is that upon each node heartbeat the YARN RM looks at all the outstanding requests, which may not even be satisfiable by that node. Further, the algorithm used to decay locality from node to rack and ANY is based on heartbeat counting. The summary is that we spend a lot of time in each heartbeat processing, and with a high allocation request rate and 1 sec node heartbeats, it becomes hard for the YARN RM to do allocations fast enough.
3. Testing on real clusters is quite expensive as it involves deploying the bits, generating the load, and then analyzing GBs of logs. This is a long cycle, and to improve on it we started to leverage SLS. To recap, SLS is the Scheduler Load Simulator and has been a tool in Hadoop since 2.6. It allows you to start the YARN RM with a given config and simulates multiple NMs heartbeating into it. It can then simulate submitting different apps with containers starting and exiting. We have extended SLS to generate load that is similar to the workload we run. This includes sending waves of container requests, containers living for different durations, multiple apps, allocations with allocation IDs, relaxed locality, and OPPORTUNISTIC allocations. A big shout out to everyone who worked on and contributed to SLS. This tool has been a life saver! Today SLS is one of the key tools we use for testing any changes being made to the YARN RM.
4. Add callout: http://hadoopsummit.org/san-jose/agenda/
5. Talk about CISL. Router: add multiple instances. Talk about policies.
6. Talk about why we did not use YARN log aggregation. Talk about our custom AM talking to PN to handle logs. Talk about sub-cluster to node mapping (CWS app id). Timeline server.
7. Remove links.