Multi-Tenant Data Cloud with YARN & Helix

Kishore Gopalakrishna (@kishore_b_g)
LinkedIn - Data infra: Helix, Espresso
Yahoo - Ads infra: S4
What is YARN?
Next-Generation Compute Platform

Hadoop 1.0: MapReduce (batch) runs directly on HDFS and is the only
programming model.
Hadoop 2.0: YARN (cluster resource management) sits between HDFS and the
applications, so MapReduce and others (batch, interactive, online,
streaming) share the same cluster.
This enables containers from many applications (A1-A3, B1-B5, C1-C5) to be
packed side by side onto the same cluster.
YARN Architecture

The client submits a job to the Resource Manager, placing the app package
in HDFS (the common area). Node Managers report node status to the
Resource Manager, which launches the Application Master in a container.
The Application Master sends container requests back to the Resource
Manager, and the Node Managers launch the resulting containers.
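To make the submission flow concrete, here is a minimal client-side sketch
using the stock Hadoop 2.x YarnClient API; the application name, AM launch
command, and container size are illustrative placeholders.

```java
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApp {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the RM for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("partitioned-data-server");

    // Launch context for the container that will run the Application
    // Master; the app package sitting in HDFS would be added as a local
    // resource here.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        java.util.Collections.singletonList("$JAVA_HOME/bin/java AppMaster"));
    ctx.setAMContainerSpec(amContainer);

    // Resources the AM container needs.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);
    capability.setVirtualCores(1);
    ctx.setResource(capability);

    yarnClient.submitApplication(ctx);  // RM launches the AM for us
  }
}
```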
So, let’s build something
Example System

- Generate data in Hadoop: an M/R job writes partitioned data to HDFS
- Use it for serving: Redis servers serve the generated partitions
Example System

Requirements:
- Big Data :-)
- Partitioned, replicated
- Fault tolerant, scalable
- Efficient resource utilization

The Application Master therefore has to request containers, assign work,
handle failures, and handle workload changes.
Allocation + Assignment

The M/R job generates partitioned data (p1-p6) in HDFS, and multiple
servers serve it: for example, Server 1 holds p1, p2 plus replicas p5, p4;
Server 2 holds p3, p4 plus replicas p1, p6; Server 3 holds p5, p6 plus
replicas p3, p2.

- Container Allocation - data affinity, rack-aware placement
- Partition Assignment - affinity, even distribution
- Replica Placement - on different physical machines
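A small sketch of the placement rules above, in plain Java rather than any
Helix/YARN API; the modular replica rule is one simple way to keep the
distribution even and replicas on different machines (all names are
illustrative):

```java
import java.util.*;

public class PartitionPlacement {
  // Replica r of partition p lands on server (p + r) mod numServers, so
  // replicas of the same partition never share a machine (assuming
  // replicationFactor <= numServers) and load stays even.
  static Map<Integer, List<Integer>> assignReplicas(
      int numPartitions, int numServers, int replicationFactor) {
    Map<Integer, List<Integer>> serverToPartitions = new TreeMap<>();
    for (int s = 0; s < numServers; s++) {
      serverToPartitions.put(s, new ArrayList<>());
    }
    for (int p = 0; p < numPartitions; p++) {
      for (int r = 0; r < replicationFactor; r++) {
        serverToPartitions.get((p + r) % numServers).add(p);
      }
    }
    return serverToPartitions;
  }

  public static void main(String[] args) {
    // 6 partitions, 3 servers, 2 replicas -> 4 partitions per server,
    // matching the example layout above.
    System.out.println(assignReplicas(6, 3, 2));
  }
}
```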
Failure Handling

- On failure - even load distribution while waiting for a new container:
  the failed server's partitions (p1, p2) are spread across the surviving
  Server 2 and Server 3
- Acquire a new container (Server 4) close to the data if possible
- Assign the failed partitions to the new container
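The interim "even load distribution" step could look like the following
sketch: spread the failed server's partitions across the least-loaded
survivors. Plain Java with illustrative names, not a Helix API:

```java
import java.util.*;

public class FailoverRedistribution {
  // Remove the failed server and hand each of its partitions to whichever
  // survivor currently holds the fewest partitions.
  static void redistribute(Map<String, List<Integer>> assignment,
                           String failedServer) {
    List<Integer> orphaned = assignment.remove(failedServer);
    if (orphaned == null) return;

    // Survivors ordered by current load; poll/offer keeps the ordering
    // correct as loads change.
    PriorityQueue<String> survivors = new PriorityQueue<>(
        Comparator.comparingInt(s -> assignment.get(s).size()));
    survivors.addAll(assignment.keySet());

    for (int p : orphaned) {
      String target = survivors.poll();
      assignment.get(target).add(p);
      survivors.offer(target);  // re-insert with its new load
    }
  }
}
```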
Workload Changes

- Monitor - CPU, memory, latency, TPS
- Workload change - acquire/release containers
- Container change - re-distribute work: when load grows, a fourth server
  is added (taking p4, p6, p2) so every server serves fewer partitions
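A hedged sketch of a monitor-driven scaling decision: derive a container
count from observed TPS against a per-container capacity estimate. The 30%
headroom mirrors the SLA example later in the deck; all names and
thresholds here are illustrative:

```java
public class ScalingPolicy {
  final double tpsPerContainer;  // measured capacity of one container
  final double headroom = 0.3;   // keep 30% throughput headroom

  ScalingPolicy(double tpsPerContainer) {
    this.tpsPerContainer = tpsPerContainer;
  }

  // Desired container count for the observed workload, clamped to the
  // configured min/max (cf. min 1 / max 25 in the example config).
  int targetContainers(double observedTps, int min, int max) {
    double needed = observedTps / (tpsPerContainer * (1.0 - headroom));
    int target = (int) Math.ceil(needed);
    return Math.max(min, Math.min(max, target));
  }
}
```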
Service Discovery

- Discover everything: what is running where
- Dynamically updated on changes
- Clients resolve partitions to servers through service discovery
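Helix ships a spectator-side helper for exactly this. A minimal sketch
using RoutingTableProvider, assuming a ZooKeeper at zkhost:2181 and the
cluster/resource/partition names used in this example:

```java
import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

public class DiscoveryClient {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "DATA_SERVING_CLUSTER", "client-1", InstanceType.SPECTATOR,
        "zkhost:2181");
    manager.connect();

    // Helix pushes external-view changes to the provider as they happen,
    // so the routing table stays dynamically up to date.
    RoutingTableProvider routing = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routing);

    // Which live instances currently hold partition p1 in state "Serve"?
    List<InstanceConfig> serving =
        routing.getInstances("PartitionedDataServer", "p1", "Serve");
    for (InstanceConfig i : serving) {
      System.out.println(i.getHostName() + ":" + i.getPort());
    }
  }
}
```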
Building a YARN Application

Writing an AM is hard and error-prone; handling faults and workload
changes is non-trivial and often overlooked.

- Request containers: how many containers, and where
- Assign work: place partitions & replicas, affinity
- Workload changes: acquire/release containers, minimize movement
- Fault handling: detect non-trivial failures, new vs. reused containers
- Other: service discovery, monitoring

Is there something that can make this easy?
Apache Helix
What is Helix?

- Built at LinkedIn, 2+ years in production
- Generic cluster management framework
- Contributed to Apache, now a TLP: helix.apache.org
- Decouples cluster management from core functionality
Helix at LinkedIn
In Production

[Diagram: user writes land in Oracle databases; change capture feeds
change consumers, a search index, and a data replicator; ETL moves the
data into HDFS for analytics.]
Helix at LinkedIn
In Production

- Over 1000 instances covering over 30,000 partitions
- Over 1000 instances for change-capture consumers
- As many as 500 instances in a single Helix cluster

(All numbers are per datacenter.)
Others Using Helix
Helix Concepts

- Resource (database, index, topic, task)
- Partitions: p1, p2 .. p6
- Replicas: r1, r2, r3 of each partition
- Container processes that host the replicas
- Assignment? Someone has to decide which replica runs in which container.
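These concepts map directly onto the Helix admin API. A minimal sketch,
assuming a ZooKeeper at zkhost:2181; it uses the stock OnlineOffline state
model as a stand-in (the custom Stop/bootstrap/Serve model for our example
is defined next):

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;

public class SetupCluster {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("zkhost:2181");
    admin.addCluster("DATA_SERVING_CLUSTER");

    // A resource with 6 partitions governed by a state model.
    admin.addResource("DATA_SERVING_CLUSTER", "PartitionedDataServer",
        6, "OnlineOffline");

    // Ask Helix to compute an assignment with 2 replicas per partition
    // (participants must have joined the cluster for this to take effect).
    admin.rebalance("DATA_SERVING_CLUSTER", "PartitionedDataServer", 2);
  }
}
```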
Helix Concepts
State Model and Constraints

State model: Stop -> bootstrap -> Serve (and Serve -> Stop).

            State constraints           Transition constraints
Partition   Serve: 3, bootstrap: 0      Max T1 transitions in parallel
Resource    -                           Max T2 transitions in parallel
Node        No more than 10 replicas    Max T3 transitions in parallel
Cluster     -                           Max T4 transitions in parallel

StateCount = replication factor (3)
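A sketch of declaring this state model with Helix's
StateModelDefinition.Builder. The counts mirror the table above: Serve is
bounded by the replication factor "R", and bootstrap is pinned to a
steady-state count of 0 (it is only a transient state). The model name is
illustrative:

```java
import org.apache.helix.model.StateModelDefinition;

public class ServeStateModel {
  public static StateModelDefinition build() {
    StateModelDefinition.Builder b =
        new StateModelDefinition.Builder("StopBootstrapServe");

    // States, in priority order (lower number = higher priority).
    b.addState("Serve", 0);
    b.addState("bootstrap", 1);
    b.addState("Stop", 2);
    b.initialState("Stop");

    // Legal transitions of the FSM.
    b.addTransition("Stop", "bootstrap");
    b.addTransition("bootstrap", "Serve");
    b.addTransition("Serve", "Stop");

    // Per-partition state counts: "R" = replication factor (3 here);
    // bootstrap: 0 per the table, i.e. no replica rests in bootstrap.
    b.dynamicUpperBound("Serve", "R");
    b.upperBound("bootstrap", 0);

    return b.build();
  }
}
```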
Helix Architecture

The Controller (Target Provider, Provisioner, Rebalancer) assigns work to
Participants via callbacks. Each participant hosts a set of partitions
(P1-P8) and moves them through the stop/bootstrap/serve state machine,
reporting metrics back to the controller. Clients are spectators that use
service discovery to find where partitions live.
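On the participant side, Helix drives the state machine by invoking
annotated transition callbacks. A minimal handler for the
Stop/bootstrap/Serve model; the method bodies are placeholders for the
real bootstrap/serve logic:

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

@StateModelInfo(initialState = "Stop", states = {"Stop", "bootstrap", "Serve"})
public class ServeStateModelHandler extends StateModel {

  @Transition(from = "Stop", to = "bootstrap")
  public void onBecomeBootstrapFromStop(Message msg, NotificationContext ctx) {
    // e.g. pull msg.getPartitionName()'s data files from HDFS
  }

  @Transition(from = "bootstrap", to = "Serve")
  public void onBecomeServeFromBootstrap(Message msg, NotificationContext ctx) {
    // open the partition for reads
  }

  @Transition(from = "Serve", to = "Stop")
  public void onBecomeStopFromServe(Message msg, NotificationContext ctx) {
    // stop serving and release resources
  }
}
```

A participant registers a factory for this handler when it connects, and
the controller's computed transitions arrive as these callbacks.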
Helix Controller
High-Level Overview

Inputs: resource config, constraints, objectives. The controller is built
from three pluggable pieces: the TargetProvider decides the number of
containers, the Provisioner talks to the YARN RM, and the Rebalancer
computes the task-to-container mapping.
Helix Controller: Target Provider

Determines how many containers are required, along with their spec.
Default implementations are provided - Fixed, CPU, Memory, Bin Packing
(a monitoring system supplies the usage information); Bin Packing can be
customized further.

Inputs:
- Resources: p1, p2 .. pn
- Existing containers: c1, c2 .. cn
- Health of tasks and containers: cpu, memory, health
- Allocation constraints: affinity, rack locality
- SLA, e.g. fixed: 10 containers; CPU headroom: 30%; memory usage: 70%;
  time: 5h

Outputs:
- Number of containers, with an acquire list and a release list
- Container spec: cpu: x, memory: y, location: L
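A hypothetical sketch of what the TargetProvider contract could look like,
shown with the simplest default (Fixed); the interface and types here are
illustrative, not Helix's actual API:

```java
import java.util.List;

public class TargetProviderSketch {
  // Container spec the provisioner will request from YARN.
  record ContainerSpec(int cpuVcores, int memoryMb, String location) {}

  // Decision: how many containers, which to acquire, which to release.
  record Target(int numContainers,
                List<ContainerSpec> acquire,
                List<String> release) {}

  interface TargetProvider {
    Target computeTarget(List<String> partitions,
                         List<String> existingContainers,
                         double cpuUsage, double memoryUsage);
  }

  // Fixed-count provider: ignore health signals, converge on N containers.
  static class FixedTargetProvider implements TargetProvider {
    final int fixed;
    FixedTargetProvider(int fixed) { this.fixed = fixed; }

    @Override
    public Target computeTarget(List<String> partitions,
                                List<String> existing,
                                double cpu, double mem) {
      int delta = fixed - existing.size();
      List<ContainerSpec> acquire = new java.util.ArrayList<>();
      for (int i = 0; i < Math.max(0, delta); i++) {
        acquire.add(new ContainerSpec(1, 1024, "ANY"));
      }
      List<String> release =
          delta < 0 ? existing.subList(0, -delta) : List.of();
      return new Target(fixed, acquire, release);
    }
  }
}
```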
Helix Controller: Provisioner

Given the container spec, interacts with the YARN RM to acquire and
release containers, and with the NMs to start and stop them. The YARN
provisioner also subscribes to RM notifications.
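On the YARN side, the provisioner's acquire/start path maps onto the stock
AM-RM and AM-NM client libraries. A condensed Hadoop 2.x sketch; the
launch command and container size are placeholders:

```java
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnProvisionerSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // Ask the RM for one container matching the TargetProvider's spec.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);
    capability.setVirtualCores(1);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // Poll for allocations, then start each container through its NM.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();
    for (Container c : rmClient.allocate(0.1f).getAllocatedContainers()) {
      ContainerLaunchContext ctx =
          Records.newRecord(ContainerLaunchContext.class);
      ctx.setCommands(java.util.Collections.singletonList(
          "$JAVA_HOME/bin/java ParticipantLauncher"));
      nmClient.startContainer(c, ctx);
    }
  }
}
```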
Helix Controller: Rebalancer

Based on the current nodes in the cluster and the constraints, finds an
assignment of tasks to nodes, then computes and fires the FSM transitions
to the Participants. Modes: Auto, Semi-Auto, Static, User-defined.

Inputs:
- Tasks: t1, t2 .. tn
- Existing containers: c1, c2 .. cn
- Allocation constraints & objectives: affinity, rack locality, even
  distribution of tasks, minimize movement while expanding

Output - an assignment such as:
  C1: t1, t2
  C2: t3, t4
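The "minimize movement while expanding" objective can be illustrated in a
few lines: keep every task on its current container where possible, and
place only orphaned or new tasks on the least-loaded containers. Plain
Java, not Helix's actual Rebalancer interface:

```java
import java.util.*;

public class MinimalMovementRebalancer {
  static Map<String, List<String>> rebalance(
      List<String> tasks, List<String> containers,
      Map<String, String> currentOwner) {
    Map<String, List<String>> assignment = new TreeMap<>();
    for (String c : containers) assignment.put(c, new ArrayList<>());

    List<String> unplaced = new ArrayList<>();
    for (String t : tasks) {
      String owner = currentOwner.get(t);
      if (owner != null && assignment.containsKey(owner)) {
        assignment.get(owner).add(t);  // keep placement: zero movement
      } else {
        unplaced.add(t);               // owner gone, or task is new
      }
    }
    for (String t : unplaced) {
      // Place on the currently least-loaded container.
      String target = Collections.min(assignment.keySet(),
          Comparator.comparingInt(c -> assignment.get(c).size()));
      assignment.get(target).add(t);
    }
    return assignment;
  }
}
```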
Example System: Helix-Based Solution

Solution:
- Configure App
- Configure Target Provider
- Configure Provisioner
- Configure Rebalancer
Example System: Helix-Based Solution
app_config_spec.yaml

Configure App
  App Name            Partitioned Data Server
  App Master Package  /path/to/GenericHelixAppMaster.tar
  App Package         /path/to/RedisServerLauncher.tar
  App Config          DataDirectory: hdfs:/path/to/data

Configure Target Provider
  TargetProvider      RedisTargetProvider
  Goal                Target TPS: 1 million
  Min containers      1
  Max containers      25

Configure Provisioner
  YARN RM             host:port

Configure Rebalancer
  Partitions                    6
  Replicas                      2
  Max partitions per container  4
  Rebalancer.Mode               AUTO
  Placement                     Data Affinity
  FailureHandling               Even distribution
  Scaling                       Minimize Movement
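A plausible rendering of this spec as YAML; the field names mirror the
table above, but the exact schema is illustrative, not the actual launcher
format:

```yaml
# app_config_spec.yaml - illustrative layout
app:
  name: PartitionedDataServer
  appMasterPackage: /path/to/GenericHelixAppMaster.tar
  appPackage: /path/to/RedisServerLauncher.tar
  config:
    DataDirectory: hdfs:/path/to/data
targetProvider:
  class: RedisTargetProvider
  goal: { targetTps: 1000000 }
  minContainers: 1
  maxContainers: 25
provisioner:
  yarnRM: host:port
rebalancer:
  partitions: 6
  replicas: 2
  maxPartitionsPerContainer: 4
  mode: AUTO
  placement: DATA_AFFINITY
  failureHandling: EVEN_DISTRIBUTION
  scaling: MINIMIZE_MOVEMENT
```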
Launch Application

  yarn_app_launcher.sh app_config_spec.yaml
Helix + YARN

1. The client submits the job to the YARN Resource Manager
2. The RM launches the Application Master, which embeds the Helix
   Controller (Target Provider, Provisioner, Rebalancer)
3. The controller requests containers from the RM
4. Node Managers launch the containers, which join the cluster as Helix
   participants
5. The controller assigns work: participant 1 gets p1, p2 (replicas p5,
   p4); participant 2 gets p3, p4 (replicas p1, p6); participant 3 gets
   p5, p6 (replicas p3, p2)
Auto Scaling

[Chart: non-linear scaling from 0 to 1M TPS and back]
Failure Handling: Random Faults

[Chart: recovering from faults at 1M TPS (5%, 10%, 20% failures/min)]
Summary

The stack: applications (batch, interactive, online, streaming) run on
HELIX (container + task management), which runs on YARN (cluster resource
management), which runs on HDFS.

- Generic Application Master
- Fault tolerance and expansion handled transparently
- Efficient resource utilization via the task model
Questions?
We love helping & being helped!

Website: helix.apache.org, #apachehelix
Twitter: @apachehelix, @kishore_b_g
Mail: user@helix.apache.org
Team: Kanak Biscuitwala, Zhen Zhang
