Targeted Resource Management in Multi-tenant Distributed Systems

Jonathan Mace (Brown University)
Peter Bodik (MSR Redmond)
Rodrigo Fonseca (Brown University)
Madanlal Musuvathi (MSR Redmond)
Resource Management in Multi-Tenant Systems

• April 2011 – Amazon EBS failure: a failure in one availability zone cascades to the shared control plane, causing thread-pool starvation for all zones
• Aug. 2012 – Azure Storage outage: an aggressive background task responds to increased hardware capacity with a deluge of warnings and logging
• Nov. 2014 – Visual Studio Online outage: a code change increases database usage, shifting the bottleneck to an unmanaged application-level lock
• 2014 – Communication with Cloudera: shared storage-layer bottlenecks circumvent the resource management layer

The result: degraded performance, violated SLOs, and system outages.
Retro:
• Monitors the resource usage of each tenant in near real time
• Actively schedules tenants and activities

High-level, centralized policies:
• Encapsulate resource management logic
• Use abstractions that are not specific to a resource type or system
• Achieve different goals: guarantee average latencies, fair-share a resource, etc.
Goals
• Coordinated control across processes, machines, and services
• Handle system-level and application-level resources
• Principals: tenants and background tasks
• Real-time and reactive
• Efficient: only control what is needed
Workflows
Purpose: identify requests from different users and background activities
• e.g., all requests from a tenant over time
• e.g., data balancing in HDFS

• The unit of resource measurement, attribution, and enforcement
• Tracks a request across varying levels of granularity
• Orthogonal to threads, processes, network flows, etc.
Resources
Purpose: cope with the diversity of resources. What we need:

1. Identify overloaded resources
   Slowdown: the ratio of how slow the resource is now compared to its baseline performance with no contention
   Slowdown = (queue time + execute time) / execute time
   e.g., 100 ms queue + 10 ms execute ⇒ slowdown 11

2. Identify culprit workflows
   Load: the fraction of current utilization that we can attribute to each workflow
   Load = time spent executing
   e.g., 10 ms execute ⇒ load 10
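The slowdown and load definitions above reduce to a few lines each; the sketch below is illustrative (the function names and the per-workflow dict are assumptions, not Retro's API):

```python
def slowdown(queue_ms: float, execute_ms: float) -> float:
    # (queue time + execute time) / execute time, per the definition above
    return (queue_ms + execute_ms) / execute_ms

def load_fractions(execute_ms_by_workflow: dict) -> dict:
    # Fraction of current utilization attributable to each workflow
    total = sum(execute_ms_by_workflow.values())
    return {w: t / total for w, t in execute_ms_by_workflow.items()}

# The slide's example: 100 ms queue, 10 ms execute
print(slowdown(100, 10))  # 11.0
```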
Retro Controller API
(built over workflows, resources, and control points)

1. Pervasive measurement
   Aggregated locally, then reported centrally once per second
2. Centralized controller
   Global, abstracted view of the system; policies run in a continuous control loop
3. Distributed enforcement
   Coordinates enforcement using a distributed token bucket
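The enforcement primitive named above is a token bucket; a minimal local bucket can be sketched as follows. This shows only the single-node piece, with assumed rate/capacity semantics; how Retro coordinates buckets across control points into a distributed token bucket is not shown here.

```python
import time

class TokenBucket:
    """Minimal local token bucket: `rate` tokens/sec refill, `capacity` burst."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A control point would call `try_acquire` per request for the requesting workflow, and the controller would adjust `rate` as policies dictate.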
Example: LatencySLO
• High-priority workflows (H): policy guarantees "200 ms average request latency"
• Low-priority workflows (L): use spare capacity

Control loop: monitor latencies → attribute interference → throttle interfering workflows
Example: LatencySLO policy pseudocode

foreach candidate in H
    miss[candidate] = latency(candidate) / guarantee[candidate]
W = candidate in H with max miss[candidate]
foreach rsrc in resources()               // calculate importance of each resource to W
    importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))
foreach lopri in L                        // calculate low-priority workflow interference
    interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)
foreach lopri in L                        // normalize interference
    interference[lopri] /= Σk interference[k]
foreach lopri in L
    if miss[W] > 1                        // throttle
        scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]
    else                                  // release
        scalefactor = 1 + β
    foreach cpoint in controlpoints()     // apply new rates
        set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri))

In three steps:
1. Select the high-priority workflow W with the worst performance
2. Weight low-priority workflows by their interference with W
3. Throttle low-priority workflows proportionally to their weight
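The pseudocode translates almost line-for-line into Python. In this sketch the measurement and enforcement calls are passed in as callbacks, and the α, β defaults are illustrative placeholders; none of these names are Retro's actual API.

```python
import math

def latency_slo_step(H, L, resources, control_points,
                     total_latency, resource_latency, guarantee,
                     slowdown, resource_load, workflow_load,
                     get_rate, set_rate, alpha=0.2, beta=0.05):
    """One iteration of the LatencySLO control loop (transcription of the
    slide's pseudocode; callback names are stand-ins for Retro's measurement API)."""
    miss = {c: total_latency(c) / guarantee[c] for c in H}
    W = max(H, key=lambda c: miss[c])  # worst-performing high-priority workflow
    # Importance of each resource to W: weight W's time there by contention
    importance = {r: resource_latency(W, r) * math.log(slowdown(r))
                  for r in resources}
    # Interference of each low-priority workflow, normalized to sum to 1
    interference = {lo: sum(importance[r] * workflow_load(lo, r) / resource_load(r)
                            for r in resources)
                    for lo in L}
    total = sum(interference.values())
    interference = {lo: v / total for lo, v in interference.items()}
    for lo in L:
        if miss[W] > 1:  # SLO missed: throttle proportionally to interference
            scale = 1 - alpha * (miss[W] - 1) * interference[lo]
        else:            # SLO met: gradually release throttled workflows
            scale = 1 + beta
        for cp in control_points:
            set_rate(cp, lo, scale * get_rate(cp, lo))
```

With a single bottleneck resource, a workflow causing 75% of the interference gets throttled three times as hard as one causing 25%, matching step 3 above.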
Other types of policy…
• Bottleneck Fairness: detect the most overloaded resource, then fair-share that resource between the tenants using it
• Dominant Resource Fairness: estimate demands and capacities from measurements

These policies are concise, not system-specific, and any resource can be the bottleneck (the policy doesn't care which).
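A minimal sketch of the Bottleneck Fairness idea, assuming a slowdown threshold and an even split of capacity (both are assumptions; the slide specifies neither, and the callback names are illustrative):

```python
def bottleneck_fairness_step(resources, tenants, slowdown, capacity,
                             tenant_load, set_rate, threshold=2.0):
    # Detect the most overloaded resource
    bottleneck = max(resources, key=slowdown)
    if slowdown(bottleneck) < threshold:
        return None  # nothing overloaded; leave rates alone
    # Fair-share it between the tenants actually using it
    users = [t for t in tenants if tenant_load(t, bottleneck) > 0]
    share = capacity(bottleneck) / len(users)
    for t in users:
        set_rate(t, bottleneck, share)
    return bottleneck
```

Note the policy never names a specific resource type: whatever measured resource has the highest slowdown becomes the bottleneck.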
Instrumentation
Retro implementation in Java: an instrumentation library plus a central controller implementation.

To enable Retro:
• Propagate the workflow ID within the application (like X-Trace, Dapper)
• Instrument resources with wrapper classes

Overheads:
• Resource instrumentation is automatic using AspectJ
• Overall 50–200 lines per system to modify RPCs
• Retro overhead: at most 1–2% on throughput and latency
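Workflow-ID propagation in the X-Trace/Dapper style can be sketched with context-local state (Retro does this in Java; the names below are illustrative, not its API):

```python
import contextvars

# The active workflow ID, propagated implicitly down the call chain
current_workflow = contextvars.ContextVar("workflow_id", default=None)

def handle_request(workflow_id, operation):
    """Tag a request with its workflow on entry; restore on exit."""
    token = current_workflow.set(workflow_id)
    try:
        return operation()
    finally:
        current_workflow.reset(token)

def charge_resource(usage, ledger):
    """Deep in the system: attribute resource usage to the active workflow."""
    wid = current_workflow.get()
    ledger[wid] = ledger.get(wid, 0) + usage
```

Instrumented resources then attribute usage without any explicit workflow argument threading through the call stack, which is what keeps per-system changes small.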
Experiments
Policies for a mixture of systems, workflows, and resources; results on clusters of up to 200 nodes.

Systems: HDFS, HBase, YARN MapReduce, ZooKeeper
Workflows: MapReduce jobs (HiBench), HBase (YCSB), HDFS clients, background data replication, background heartbeats
Resources: CPU, disk, network (all systems); locks, queues (HDFS, HBase)
Policies: LatencySLO, Bottleneck Fairness, Dominant Resource Fairness

This talk covers the LatencySLO policy results; see the paper for full experiment results.