This presentation covers scheduling in operating systems, focusing on contention-aware scheduling, a promising area of research. It presents the outcome of a recent project in which a different approach led to the implementation of a contention-aware scheduler, called LCN, that aims to compete with state-of-the-art schedulers and the current Linux scheduler.
- The document discusses compilation analysis and performance analysis of Feel++ scientific applications using Scalasca.
- It presents compilation analysis of Feel++ using examples of mesh manipulation and discusses performance analysis using Feel++'s TIME class or Scalasca instrumentation.
- The document analyzes the laplacian case study in Feel++ using different compilation options and polynomial dimensions and presents results from performance analysis with Scalasca.
This document presents an Integrative Model for Parallelism (IMP) that aims to provide a unified treatment of different types of parallelism. It describes the key concepts of the IMP including the programming model using sequential semantics, the execution model using a data flow virtual machine, and the data model using distributions to describe data placement. It demonstrates the IMP concepts using a motivating example of 3-point averaging and discusses tasks, processes, and research opportunities around the IMP approach.
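The 3-point averaging example can be stated in the sequential semantics the IMP programming model starts from. The sketch below is only that sequential statement, written as an illustration; the IMP's distributed, data-flow execution of it is not reproduced here:

```python
def three_point_average(x):
    """Sequential-semantics statement of the 3-point averaging kernel used
    as a motivating example: each interior point becomes the mean of itself
    and its two neighbors; endpoints are left unchanged."""
    return [x[i] if i in (0, len(x) - 1)
            else (x[i - 1] + x[i] + x[i + 1]) / 3.0
            for i in range(len(x))]
```

In the IMP view, the data model's distributions would decide which process owns each `x[i]`, and the halo exchange implied by `x[i - 1]` and `x[i + 1]` would be derived automatically from the data-flow graph.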
Timo Klerx and Kalman Graffi. Bootstrapping Skynet: Calibration and Autonomic Self-Control of Structured Peer-to-Peer Networks. In IEEE P2P ’13: Proceedings of the International Conference on Peer-to-Peer Computing, 2013.
Abstract—Peer-to-peer systems scale to millions of nodes and provide routing and storage functions with best effort quality. In order to provide a guaranteed quality of the overlay functions, even under strong dynamics in the network with regard to peer capacities, online participation and usage patterns, we propose to calibrate the peer-to-peer overlay and to autonomously learn which qualities can be reached. For that, we simulate the peer- to-peer overlay systematically under a wide range of parameter configurations and use neural networks to learn the effects of the configurations on the quality metrics. Thus, by choosing a specific quality setting by the overlay operator, the network can tune itself to the learned parameter configurations that lead to the desired quality. Evaluation shows that the presented self-calibration succeeds in learning the configuration-quality interdependencies and that peer-to-peer systems can learn and adapt their behavior according to desired quality goals.
Anup Mathur has experience as an engineering intern at Spirent Communications where he worked on enhancing channel modeling and developing test cases for channel emulators. He has also conducted research projects on SDN controllers for wireless networks, relay attacks on NFC, content-based routing, and power management in wireless sensors. Mathur is skilled in programming languages like C, C++, C#, Matlab, and Python and has an MS in Electrical and Computer Engineering from Rutgers University.
Formal estimation of worst case communication latency in a network on chip fi... (Vinita Palaniveloo)
This document discusses formal estimation of worst-case communication latency in a Network-on-Chip (NoC) using model checking. It presents a SPIN model of an NoC with routers and describes modeling packet injection rates and latency. Simulation results show average and per-router worst-case latencies increase with packet rate and NoC size. The approach verifies functional correctness and estimates worst-case latency to help identify suitable NoC architectures for applications.
Low Power High-Performance Computing on the BeagleBoard Platform (a3labdsp)
The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to give deeper consideration to the energy efficiency of computing equipment. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the other boards executing the actual processing. The software platform is based on the Angstrom GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, and with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured using a clamp meter. Experimental results obtained in the speaker diarization task showed that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, once the bottleneck due to the Ethernet interface is removed, the BeagleBoard-xM cluster achieves superior energy efficiency.
The SPL-IT Query by Example Search on Speech system for MediaEval 2014 (multimediaeval)
This document briefly describes the system submitted by the Speech Processing Lab of Instituto de Telecomunicações, pole of Coimbra (SPL-IT), to the Query by Example Search on Speech Task (QUESST) of MediaEval 2014. Our approach is based on merging the results of a phoneme recognition system using three different languages. A version of Dynamic Time Warping (DTW) using posteriorgram distances was created to allow finding some of the peculiar search cases of this task. Our primary submission merges two approaches: simple DTW for detecting entire queries, and a version where cutting final portions of queries is allowed. The late submission merges five approaches that account for all the search possibilities described for the task, though improved results were only observed in the evaluation dataset for type 3 queries.
http://ceur-ws.org/Vol-1263/mediaeval2014_submission_74.pdf
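As a rough sketch of the DTW idea behind such a system (not the SPL-IT implementation itself), plain DTW over posteriorgram frames can be written as follows; the choice of negative-log dot product as the frame distance is an assumption, one common option for phoneme posteriors:

```python
import math

def posteriorgram_distance(p, q):
    """Distance between two posteriorgram frames: -log of their dot
    product, clipped to avoid log(0). An illustrative choice, not
    necessarily the one used by SPL-IT."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, 1e-10))

def dtw_cost(query, utterance):
    """Plain DTW alignment cost of `query` against `utterance`, both
    given as lists of posterior vectors (lower cost = better match)."""
    n, m = len(query), len(utterance)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = posteriorgram_distance(query[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a query frame
                                 cost[i][j - 1],      # skip an utterance frame
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]
```

The "cutting final portions of queries" variant mentioned above would amount to allowing the alignment to terminate before the last query row, which this plain version does not do.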
Area, Delay and Power Comparison of Adder Topologies (VLSICS Design)
This document compares the area, delay, and power characteristics of different adder topologies, including the ripple carry adder, carry look-ahead adder, carry skip adder, carry select adder, carry increment adder, carry save adder, and carry bypass adder. It analyzes the functionality and performance of 8-bit implementations of each adder type using Microwind simulation software at a 0.12μm CMOS technology node. The ripple carry adder has the simplest design but the longest propagation delay, which scales with the number of bits, while other topologies such as the carry look-ahead adder and carry skip adder reduce delay at the cost of more complex circuitry.
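The delay contrast described above is easy to see in code. A minimal bit-level ripple-carry adder, written as an illustrative sketch rather than any circuit from the document, makes the serial carry chain explicit: each stage must wait for the previous stage's carry, so worst-case delay grows linearly with the bit width.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists. The carry ripples
    through every stage in sequence, which is exactly why the ripple
    carry adder's critical path scales with the number of bits."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry
```

Carry look-ahead and carry-skip designs shorten that chain by computing or bypassing carries for groups of bits at once.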
CHI 2014 talk by Antti Oulasvirta: Automated Nonlinear Regression Modeling fo... (Aalto University)
This document proposes an automated method for acquiring nonlinear regression models to support model building in human-computer interaction research. It focuses on using automated nonlinear regression modeling to generate models that fit datasets, building on prior work in symbolic programming. The goal is to address the inefficiencies of manual exploration of large model spaces by iteratively searching the space of possible models using a dataset to find high-quality fitting models. If successful, the approach could help with applications like engineering performance models, developing adaptive interfaces, and optimizing interface designs.
MATLAB and Simulink for Communications System Design (Design Conference 2013) (Analog Devices, Inc.)
This session will show how Model-Based Design with MATLAB® and Simulink® can be used to model, simulate, and implement communications systems. Attendees will learn how multidomain modeling with continuous verification and automatic code generation can dramatically reduce system design time. A QPSK receiver model will be used as an example to highlight the design flow.
A High-Level Programming Approach for using FPGAs in HPC using Functional Des... (waqarnabi)
(1) The authors present an approach for using FPGAs in high-performance computing (HPC) that involves using functional descriptions, vector type-transformations, and cost-modeling. (2) Their approach uses type transformations to generate design variants from a functional program and develops an intermediate language and cost model. (3) The cost model provides fast, lightweight estimates of performance and resource usage for different design variants to enable automated design space exploration for FPGA-based HPC applications.
Maximum Likelihood Estimation of Closed Queueing Network Demands from Queue L... (Weikun Wang)
This document summarizes a presentation on maximum likelihood estimation of demands from queue length data in closed queueing networks. It introduces a queueing maximum likelihood estimation (QMLE) approach that can efficiently estimate service demands using only mean queue length observations. The QMLE approach is validated on random models and a commercial application case study, showing errors below 4% and matching observed throughputs. An extension is also presented to handle load-dependent demand scaling.
Why I need to learn so much math for my PhD research (Crypto Cg)
The document discusses why a PhD student needs to learn advanced mathematics for research into implementing elliptic curve cryptography (ECC) on constrained devices. The research aims to develop efficient and secure implementations of ECC algorithms in tiny areas of around 1 mm². This requires knowledge of number theory, finite fields, and abstract algebra (groups, rings, fields, polynomials, and their properties) to understand ECC and implement the algorithms securely while targeting constrained devices. The student hopes to improve architectures, develop configurable ECC modules, and create efficient software and hardware implementations to measure the security, efficiency, and performance of different approaches.
A DISTRIBUTED ALGORITHM FOR THE DEAD-END PROBLEM IN WSNs (Priyanka Jacob)
The document outlines a project that aims to provide a solution to the dead-end problem in location-based routing for wireless sensor networks. It proposes an algorithm that can generate loop-free short paths with higher delivery ratios and lower energy consumption to handle large-scale networks. The algorithm uses greedy forwarding and perimeter forwarding to calculate paths and route around voids. It utilizes a shadow spreading function and cost spreading function to establish paths from shadow nodes to the base station.
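The greedy-forwarding step that such location-based routing builds on can be sketched as follows; this is a generic, hypothetical illustration of the dead-end (void) situation, and the proposed algorithm's shadow-spreading and cost-spreading functions are not reproduced here:

```python
import math

def greedy_next_hop(current, neighbors, sink):
    """Greedy geographic step: forward to the neighbor geometrically
    closest to the sink. Returns None when no neighbor is closer than the
    current node, i.e. a dead end / void, where a recovery mode such as
    perimeter forwarding must take over."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    best = min(neighbors, key=lambda n: dist(n, sink), default=None)
    if best is None or dist(best, sink) >= dist(current, sink):
        return None  # greedy forwarding fails here: the dead-end problem
    return best
```

The dead-end problem is precisely the `None` case: a packet reaches a node with no neighbor closer to the sink, even though a path around the void exists.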
Chapter 3: Instruction level parallelism and its exploitation (subramaniam shankar)
Instruction-level parallelism (ILP) aims to improve performance by overlapping the execution of instructions. There are two main approaches: 1) relying on hardware to dynamically discover and exploit parallelism, and 2) relying on software to statically find parallelism at compile-time. Exploiting ILP across multiple basic blocks is needed to achieve substantial performance gains, as basic block ILP is typically small due to frequent branches. Data dependencies between instructions limit the amount of parallelism that can be exploited, as true dependencies must be preserved to maintain program correctness. Hardware and software aim to exploit parallelism while preserving program order where it affects the program outcome.
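The true (read-after-write) dependencies that limit ILP can be found mechanically from a straight-line instruction sequence. The encoding below, with instructions as (destination, sources) tuples, is a made-up illustration rather than anything from the chapter:

```python
def raw_dependencies(instrs):
    """Find true (read-after-write) dependencies in a straight-line
    sequence of instructions given as (dest_reg, (src_regs...)) tuples.
    Returns (producer_index, consumer_index) pairs: a consumer cannot
    issue before its producer completes, which bounds the available ILP."""
    last_writer = {}  # register -> index of the most recent instruction writing it
    deps = []
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in last_writer:
                deps.append((last_writer[s], i))
        last_writer[dest] = i
    return deps
```

Instructions with no dependency edge between them (like the third one in the test below) are exactly the ones hardware or a compiler may execute in parallel or reorder.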
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi... (inside-BigData.com)
In this deck from PASC18, Robert Searles from the University of Delaware presents: Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures.
"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models among other components in order to migrate large scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling.
We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif."
Watch the video: https://wp.me/p3RLHQ-iPU
Read the Full Paper: https://doi.org/10.1145/3218176.3218228
and
https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Parallel-prefix adders offer a highly efficient solution to the binary addition problem and are well-suited for VLSI implementations. In this paper, a novel framework is introduced which allows the design of parallel-prefix Ling adders. The proposed approach saves one logic level of implementation compared to the parallel-prefix structures proposed for the traditional definition of carry look-ahead equations, and reduces the fan-out requirements of the design. Experimental results reveal that the proposed adders achieve delay reductions of up to 14 percent when compared to the fastest parallel-prefix architectures presented for the traditional definition of carry equations.
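For contrast with the Ling formulation (which is not reproduced here), the traditional parallel-prefix carry computation can be sketched as follows: per-bit generate/propagate pairs are combined with an associative operator in a logarithmic number of levels, Kogge-Stone style. This is an illustrative baseline, not the paper's design:

```python
def prefix_carries(a, b, cin=0):
    """Carry computation via parallel prefix over (generate, propagate)
    pairs, Kogge-Stone style. `a` and `b` are little-endian bit lists;
    returns carries[0..n], where carries[i] is the carry into bit i and
    carries[n] is the carry out."""
    n = len(a)
    g = [ai & bi for ai, bi in zip(a, b)]  # per-bit generate
    p = [ai ^ bi for ai, bi in zip(a, b)]  # per-bit propagate
    G, P = g[:], p[:]
    d = 1
    while d < n:  # log2(n) prefix levels, each combining pairs d apart
        for i in range(n - 1, d - 1, -1):
            # (G,P)[i] o (G,P)[i-d]: the associative prefix operator
            G[i] = G[i] | (P[i] & G[i - d])
            P[i] = P[i] & P[i - d]
        d *= 2
    # after the prefix pass, G[i]/P[i] cover bit range [0..i]
    carries = [cin]
    for i in range(n):
        carries.append(G[i] | (P[i] & cin))
    return carries
```

The sum bits then follow as `p[i] ^ carries[i]`; the Ling variant reshuffles the carry equations to remove one logic level from exactly this prefix structure.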
The document summarizes a final-year defense presentation on performance analysis and prevention of security attacks in routing protocols of vehicular ad-hoc networks (VANETs). The presentation covers the objectives of establishing an algorithm for the stated problems in VANETs and simulating two scenarios. It then describes the methodology, which involves proposing an algorithm to detect malicious nodes and minimize black-hole attacks. Two scenarios are simulated: a free highway with low vehicle density and high speeds, and a busy highway with high density and low speeds. The results of applying different routing protocols with the proposed algorithm in each scenario are then compared.
The new integer factorization algorithm based on Fermat's factorization algor... (IJECEIAES)
Although integer factorization is one of the hard problems underlying RSA, many factoring techniques are still being developed. Fermat's Factorization Algorithm (FFA), which performs very well when the prime factors are close to each other, is one such integer factorization algorithm. There are two ways to implement FFA. The first, called FFA-1, finds the integer via square root computation; because this operation has a high computational cost, it takes considerable time to find the result. The other, called FFA-2, is a different technique for finding the prime factors: although its computation loops are quite large, no square root computation is involved. In this paper, a new efficient factorization algorithm is introduced. Euler's theorem is applied together with FFA to find the sum of the two prime factors. The advantage of the proposed method is that almost all square root operations are removed from the computation while the number of loops does not increase; it equals that of the first method. Therefore, compared with FFA-1, the computation time decreases, because there is no square root operation and the loop count is the same. On the other hand, the proposed method uses fewer loops than the second method, so time is also reduced. Furthermore, the proposed method can also be applied to many methods derived from FFA to reduce their cost further.
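The baseline FFA-1 flavor described above can be sketched as follows; the paper's Euler-theorem variant, which removes most of the square root operations, is not reproduced here:

```python
import math

def fermat_factor(n):
    """Classic Fermat factorization (the FFA-1 flavor): search for x with
    x*x - n a perfect square y*y, so that n = (x - y)(x + y). Assumes n
    is odd and composite; fastest when the factors are close together."""
    x = math.isqrt(n)
    if x * x < n:
        x += 1  # start at ceil(sqrt(n))
    while True:
        y2 = x * x - n
        y = math.isqrt(y2)  # the square root step FFA-2 and the
        if y * y == y2:     # proposed method try to avoid
            return x - y, x + y
        x += 1
```

Each loop iteration performs an integer square root, which is exactly the cost the proposed Euler-theorem variant eliminates while keeping the same number of iterations.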
Hardware simulation for exponential blind equal throughput algorithm using sy... (IJECEIAES)
Scheduling is the process of allocating radio resources to User Equipment (UE) transmitting different flows at the same time. It is performed by the scheduling algorithm implemented in the Long Term Evolution base station, the Evolved Node B. Most proposed algorithms do not focus on handling real-time and non-real-time traffic simultaneously, so a UE with bad channel quality may starve because no resources are allocated to it for a long time. To solve this, the Exponential Blind Equal Throughput (EXP-BET) algorithm is proposed: the user with the highest priority metric, calculated using the EXP-BET metric equation, is allocated resources first. This study investigates the implementation of the EXP-BET scheduling algorithm on an FPGA platform. The metric equation of EXP-BET is modelled and simulated using System Generator. The design utilizes only 10% of the available resources on the FPGA, and fixed-point numbers are used for all inputs to the scheduler. System verification is performed through hardware co-simulation of the EXP-BET metric computation; the co-simulation produces metric values matching the Simulink environment. Thus, the algorithm is ready for prototyping, and a Virtex-6 FPGA is chosen as the platform.
This document provides an outline for a course on modeling wireless communication systems using MATLAB. The course aims to cover both theoretical concepts and practical simulations. MATLAB will be used to illustrate key concepts and visualize signals. Students will learn the basics of MATLAB, including how to represent signals as vectors, perform vector operations, and use built-in functions to manipulate signals. Both theory and MATLAB simulations will be presented in parallel to make concepts concrete.
The document proposes translating rules expressed in Spatio-Temporal Reach and Escape Logic (STREL) to streaming-based monitoring applications. STREL allows expressing properties over attributes that vary in space and time. The contribution is defining streaming operators whose semantics enforce STREL rules by composing base streaming operators. An evaluation on Apache Flink shows the approach can achieve throughput of 500 to 1,000 tuples per second and sub-millisecond latency, depending on the spatial and temporal resolution of the data. Future work includes further evaluation, additional temporal operators, path-based spatial analysis, and compilation optimizations.
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut... (Pooyan Jamshidi)
Modern cyber-physical systems (e.g., robotics systems) are typically composed of physical and software components, the characteristics of which are likely to change over time. Assumptions about parts of the system made at design time may not hold at run time, especially when a system is deployed for long periods (e.g., over decades). Self-adaptation is designed to find reconfigurations of systems to handle such run-time inconsistencies. Planners can be used to find and enact optimal reconfigurations in such an evolving context. However, for systems that are highly configurable, such planning becomes intractable due to the size of the adaptation space. To overcome this challenge, in this paper we explore an approach that (a) uses machine learning to find Pareto-optimal configurations without needing to explore every configuration and (b) restricts the search space to such configurations to make planning tractable. We explore this in the context of robot missions that need to consider task timeliness and energy consumption. An independent evaluation shows that our approach results in high-quality adaptation plans in uncertain and adversarial environments.
Paper: https://arxiv.org/abs/1903.03920
This document provides a comparative study of several artificial intelligence algorithms: Heuristics, A* algorithm, K-Nearest Neighbors (KNN), and Linear Regression. It describes each algorithm, provides an example problem that each algorithm could solve, and discusses their applications. The document aims to illustrate the variety of problems that can be solved by these AI algorithms and help readers understand which algorithm may be best suited to different AI problems.
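One of the surveyed algorithms, K-Nearest Neighbors, is small enough to sketch in full. This is an illustrative minimal version, not code from the document:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Minimal k-nearest-neighbors classifier: take the k training points
    closest to the query (Euclidean distance) and return the majority
    label among them. `train` is a list of (point, label) pairs."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The example problems the document pairs with each algorithm fit this pattern: KNN needs only labeled points and a distance, whereas A* additionally needs a graph and an admissible heuristic, which is why the two suit different problem shapes.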
Automated Piecewise-Linear Fitting of S-Parameters step-response (PWLFIT) for... (Piero Belforte)
An innovative full time-domain macromodeling technique for general, linear multiport systems is described. The methodology is defined in a digital wave framework and time-domain simulations are performed via an efficient method called Segment Fast Convolution (SFC). It is based on a piecewise-constant (PWC) model of the impulse response of scattering parameters, computed starting from a piecewise-linear fitting of their step response (PWLFIT). Such a step response is directly available from time-domain reflectometer measurements (TDR/TDT) or equivalent simulations. The model-building phase is performed in a fast automated framework, and an analytic formulation of the computational efficiency of the SFC with respect to standard time-domain convolution is given. Two application examples are used to verify the PWLFIT performance and to perform a comparison with macromodeling methods defined in the frequency domain, such as Vector Fitting (VF).
Index Terms—Digital wave models, time-domain macromodeling, S-parameters, step response.
Speed Up Synchronization Locks: How and Why? (psteinb)
A brief introduction to synchronization primitives used on gaming consoles and Windows platforms, and ways to identify potential problems with locks using Intel tools. The talk discusses an alternate, optimized implementation of the Windows CRITICAL_SECTION, with Scaleform as a case study highlighting the importance of using optimized locks.
Area, Delay and Power Comparison of Adder TopologiesVLSICS Design
This document compares the area, delay, and power characteristics of different adder topologies, including ripple carry adder, carry look-ahead adder, carry skip adder, carry select adder, carry increment adder, carry save adder, and carry bypass adder. It analyzes the functionality and performance of 8-bit implementations of each adder type using Microwind simulation software at a 0.12μm CMOS technology node. The ripple carry adder has the simplest design but the longest propagation delay that scales with the number of bits, while other topologies like the carry look-ahead adder and carry skip adder reduce delay but have more complex circuitry. The document aims to
CHI 2014 talk by Antti Oulasvirta: Automated Nonlinear Regression Modeling fo...Aalto University
This document proposes an automated method for acquiring nonlinear regression models to support model building in human-computer interaction research. It focuses on using automated nonlinear regression modeling to generate models that fit datasets, building on prior work in symbolic programming. The goal is to address the inefficiencies of manual exploration of large model spaces by iteratively searching the space of possible models using a dataset to find high-quality fitting models. If successful, the approach could help with applications like engineering performance models, developing adaptive interfaces, and optimizing interface designs.
MATLAB and Simulink for Communications System Design (Design Conference 2013)Analog Devices, Inc.
This session will show how Model-Based Design with MATLAB® and Simulink® can be used to model, simulate, and implement communications systems. Attendees will learn how multidomain modeling with continuous verification and automatic code generation can dramatically reduce system design time. A QPSK receiver model will be used as an example to highlight the design flow.
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...waqarnabi
(1) The authors present an approach for using FPGAs in high-performance computing (HPC) that involves using functional descriptions, vector type-transformations, and cost-modeling. (2) Their approach uses type transformations to generate design variants from a functional program and develops an intermediate language and cost model. (3) The cost model provides fast, lightweight estimates of performance and resource usage for different design variants to enable automated design space exploration for FPGA-based HPC applications.
Maximum Likelihood Estimation of Closed Queueing Network Demands from Queue L...Weikun Wang
This document summarizes a presentation on maximum likelihood estimation of demands from queue length data in closed queueing networks. It introduces a queueing maximum likelihood estimation (QMLE) approach that can efficiently estimate service demands using only mean queue length observations. The QMLE approach is validated on random models and a commercial application case study, showing errors below 4% and matching observed throughputs. An extension is also presented to handle load-dependent demand scaling.
Why i need to learn so much math for my phd researchCrypto Cg
The document discusses the need for a PhD student to learn advanced mathematics for research into implementing elliptic curve cryptography on constrained devices. The research aims to develop efficient and secure implementations of elliptic curve cryptography algorithms for tiny spaces around 1mm2. This requires knowledge of number theory, finite fields, algebra, groups, rings, fields, polynomials and their properties to understand elliptic curve cryptography and implement the algorithms securely while targeting constrained devices. The student hopes to improve architectures, develop configurable ECC modules, and create efficient software and hardware implementations to measure the security, efficiency and performance of different approaches.
A DISTRIBUTED ALGORITHM FOR THE DEAD-END PROBLEM IN WSNsPriyanka Jacob
The document outlines a project that aims to provide a solution to the dead-end problem in location-based routing for wireless sensor networks. It proposes an algorithm that can generate loop-free short paths with higher delivery ratios and lower energy consumption to handle large-scale networks. The algorithm uses greedy forwarding and perimeter forwarding to calculate paths and route around voids. It utilizes a shadow spreading function and cost spreading function to establish paths from shadow nodes to the base station.
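The greedy-forwarding step at the heart of such location-based routing is simple to state: hand the packet to the neighbor geographically closest to the destination, and declare a dead end when no neighbor makes progress — the point where perimeter (or shadow-based) routing must take over. A minimal sketch, with made-up coordinates:

```python
# Greedy geographic forwarding with dead-end detection, the basic
# step that perimeter/shadow routing schemes extend. Coordinates and
# topology here are illustrative only.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def greedy_next_hop(pos, neighbors, dest):
    """Return the neighbor strictly closer to dest than pos, or None
    when greedy forwarding is stuck at a void (the dead-end case)."""
    best = min(neighbors, key=lambda n: dist(n, dest), default=None)
    if best is None or dist(best, dest) >= dist(pos, dest):
        return None
    return best
```

Returning `None` is exactly the situation the proposed algorithm's shadow and cost spreading functions are designed to route around.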
Chapter 3: Instruction-Level Parallelism and Its Exploitation, by Subramaniam Shankar
Instruction-level parallelism (ILP) aims to improve performance by overlapping the execution of instructions. There are two main approaches: 1) relying on hardware to dynamically discover and exploit parallelism, and 2) relying on software to statically find parallelism at compile-time. Exploiting ILP across multiple basic blocks is needed to achieve substantial performance gains, as basic block ILP is typically small due to frequent branches. Data dependencies between instructions limit the amount of parallelism that can be exploited, as true dependencies must be preserved to maintain program correctness. Hardware and software aim to exploit parallelism while preserving program order where it affects the program outcome.
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi..., by inside-BigData.com
In this deck from PASC18, Robert Searles from the University of Delaware presents: Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures.
"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models among other components in order to migrate large scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling.
We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif."
Watch the video: https://wp.me/p3RLHQ-iPU
Read the Full Paper: https://doi.org/10.1145/3218176.3218228
and
https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Parallel-prefix adders offer a highly efficient solution to the binary addition problem and are well suited to VLSI implementation. In this paper, a novel framework is introduced that allows the design of parallel-prefix Ling adders. The proposed approach saves one logic level of implementation compared to the parallel-prefix structures proposed for the traditional definition of the carry look-ahead equations, and it reduces the fan-out requirements of the design. Experimental results reveal that the proposed adders achieve delay reductions of up to 14 percent compared with the fastest parallel-prefix architectures presented for the traditional definition of the carry equations.
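The structure being optimized can be sketched in software: a parallel-prefix adder computes per-bit generate/propagate signals and combines them in log₂(n) prefix levels (Kogge-Stone style below), and it is this carry network that Ling-carry formulations shave a level from. This models the logic for illustration; it is not a hardware description.

```python
# Software model of a Kogge-Stone parallel-prefix adder: generate (g)
# and propagate (p) bits are combined over doubling distances, giving
# all carries in log2(width) levels.

def prefix_add(a, b, width=8):
    g = [(a >> i & 1) & (b >> i & 1) for i in range(width)]   # generate
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]   # propagate
    G, P = g[:], p[:]
    d = 1
    while d < width:                      # one prefix level per doubling
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(width)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(width)]
        d *= 2
    carry = [0] + G[:-1]                  # carry into each bit position
    s = 0
    for i in range(width):
        s |= (p[i] ^ carry[i]) << i       # sum bit = p XOR carry-in
    return s
```

The result wraps modulo 2^width, as a fixed-width hardware adder would (the carry out of the top bit is dropped).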
The document summarizes a final year defense presentation on performance analysis and prevention of security attacks in routing protocols of vehicular ad-hoc networks. The presentation covers the objectives of establishing an algorithm for state problems in VANETs and simulating two scenarios. It then describes the methodology, which involves proposing an algorithm to determine malicious nodes and minimize black hole attacks. Two scenarios are simulated: a free highway with low vehicle density and high speeds, and a busy highway with high density and low speeds. The results of applying different routing protocols with the proposed algorithm in each scenario are then compared.
The new integer factorization algorithm based on Fermat's factorization algor..., by IJECEIAES
Although integer factorization is one of the hard problems underlying RSA, many factoring techniques continue to be developed. Fermat's Factorization Algorithm (FFA), an integer factorization algorithm, performs very well when the prime factors are close to each other. There are two ways to implement FFA. The first, FFA-1, searches for an integer via square-root computation; because this operation is expensive, finding the result takes considerable time. The other, FFA-2, finds the prime factors by a different technique: although its computation loops are quite large, no square-root computation is involved. In this paper, a new efficient factorization algorithm is introduced: Euler's theorem is applied together with FFA to find the sum of the two prime factors. The advantage of the proposed method is that almost all square-root operations are removed from the computation while the number of loops is unchanged, equal to that of the first method. Compared with FFA-1, the computation time therefore decreases, since the square-root operations are gone and the loop count is the same; and since the proposed method uses fewer loops than the second method, time is reduced there as well. Furthermore, the proposed method can also be applied to the many methods derived from FFA to reduce their cost further.
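The FFA-1 variant described above is short enough to show in full: search upward from ⌈√n⌉ for an x with x² − n a perfect square y², giving n = (x − y)(x + y). The per-iteration square root is exactly the cost the proposed method removes.

```python
# Fermat's factorization (the FFA-1 variant): fast when the two
# prime factors of n are close together.
import math

def fermat_factor(n):
    """Return a factor pair (p, q) of an odd composite n, p <= q."""
    x = math.isqrt(n)
    if x * x < n:
        x += 1                          # start at ceil(sqrt(n))
    while True:
        y2 = x * x - n
        y = math.isqrt(y2)              # the costly per-loop step
        if y * y == y2:                 # x^2 - n is a perfect square
            return x - y, x + y         # n = (x - y)(x + y)
        x += 1
```

For example, `fermat_factor(5959)` finds the pair (59, 101) in three iterations, since the factors are close; factors far apart make the loop long, which is the regime FFA handles poorly.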
Hardware simulation for exponential blind equal throughput algorithm using sy..., by IJECEIAES
Scheduling is the process of allocating radio resources to User Equipment (UE) transmitting different flows at the same time; it is performed by the scheduling algorithm implemented in the Long Term Evolution base station, the Evolved Node B. Most proposed algorithms do not handle real-time and non-real-time traffic simultaneously, so a UE with bad channel quality may starve because no resources are allocated to it for a long time. To solve this, the Exponential Blind Equal Throughput (EXP-BET) algorithm is proposed: the user with the highest priority metric, computed with the EXP-BET metric equation, is allocated resources first. This study investigates the implementation of the EXP-BET scheduling algorithm on an FPGA platform. The metric equation of EXP-BET is modelled and simulated using System Generator, and the design utilizes only 10% of the available FPGA resources. Fixed numbers are used for all inputs to the scheduler. The system is verified by hardware co-simulation of the EXP-BET metric computation; the co-simulation output matches the metric values produced in the Simulink environment. The algorithm is thus ready for prototyping, and a Virtex-6 FPGA is chosen as the platform.
This document provides an outline for a course on modeling wireless communication systems using MATLAB. The course aims to cover both theoretical concepts and practical simulations. MATLAB will be used to illustrate key concepts and visualize signals. Students will learn the basics of MATLAB, including how to represent signals as vectors, perform vector operations, and use built-in functions to manipulate signals. Both theory and MATLAB simulations will be presented in parallel to make concepts concrete.
The document proposes translating rules expressed in Spatio-Temporal Reach and Escape Logic (STREL) to streaming-based monitoring applications. STREL allows expressing properties over attributes that vary in space and time. The contribution is defining streaming operators whose semantics enforce STREL rules by composing base streaming operators. An evaluation on Apache Flink shows the approach can achieve throughput from 1,000 down to 500 tuples per second and sub-millisecond latency, depending on the spatial and temporal resolution of the data. Future work includes further evaluation, additional temporal operators, path-based spatial analysis, and compilation optimizations.
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut..., by Pooyan Jamshidi
Modern cyber-physical systems (e.g., robotics systems) are typically composed of physical and software components, the characteristics of which are likely to change over time. Assumptions about parts of the system made at design time may not hold at run time, especially when a system is deployed for long periods (e.g., over decades). Self-adaptation is designed to find reconfigurations of systems to handle such run-time inconsistencies. Planners can be used to find and enact optimal reconfigurations in such an evolving context. However, for systems that are highly configurable, such planning becomes intractable due to the size of the adaptation space. To overcome this challenge, in this paper we explore an approach that (a) uses machine learning to find Pareto-optimal configurations without needing to explore every configuration and (b) restricts the search space to such configurations to make planning tractable. We explore this in the context of robot missions that need to consider task timeliness and energy consumption. An independent evaluation shows that our approach results in high-quality adaptation plans in uncertain and adversarial environments.
Paper: https://arxiv.org/abs/1903.03920
This document provides a comparative study of several artificial intelligence algorithms: Heuristics, A* algorithm, K-Nearest Neighbors (KNN), and Linear Regression. It describes each algorithm, provides an example problem that each algorithm could solve, and discusses their applications. The document aims to illustrate the variety of problems that can be solved by these AI algorithms and help readers understand which algorithm may be best suited to different AI problems.
Automated Piecewise-Linear Fitting of S-Parameters step-response (PWLFIT) for..., by Piero Belforte
An innovative full time-domain macromodeling technique for general, linear multiport systems is described. The methodology is defined in a digital wave framework, and time-domain simulations are performed via an efficient method called Segment Fast Convolution (SFC). It is based on a piecewise-constant (PWC) model of the impulse response of the scattering parameters, computed starting from a piecewise-linear fitting of their step response (PWLFIT). Such a step response is directly available from time-domain reflectometry measurements (TDR/TDT) or equivalent simulations. The model-building phase is performed in a fast automated framework, and an analytic formulation of the computational efficiency of the SFC with respect to standard time-domain convolution is given. Two application examples are used to verify the PWLFIT performance and to compare it with macromodeling methods defined in the frequency domain, such as Vector Fitting (VF).
Index Terms: Digital wave models, time-domain macromodeling, S-parameters, step response.
Speed Up Synchronization Locks: How and Why?, by psteinb
A brief introduction on synchronization primitives used for gaming consoles and Windows platforms and ways to identify potential problems with locks using Intel tools. The talk will discuss an alternate optimized implementation of the Windows Critical_Section with Scaleform as a case study highlighting the importance of using optimized locks.
Most scalability bottlenecks come from code containing locks, which produces significant contention under heavy load. We'll cover striping, copy-on-write, ring buffers, and spinning to reduce this contention or obtain lock-free code, and explain concepts like compare-and-swap and memory barriers.
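Lock striping, the first technique listed, replaces one global lock guarding a whole structure with a small array of locks keyed by hash, so operations on unrelated keys rarely contend. A minimal sketch (not a production container):

```python
# Lock striping: hash each key onto one of a fixed set of locks so
# that only operations on the same stripe contend with each other.
import threading

class StripedCounter:
    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._maps = [{} for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def incr(self, key, delta=1):
        i = self._stripe(key)
        with self._locks[i]:            # only this stripe is blocked
            self._maps[i][key] = self._maps[i].get(key, 0) + delta

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._maps[i].get(key, 0)
```

With 16 stripes, two threads updating different keys usually take different locks and proceed in parallel, whereas a single-lock design would serialize them.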
The document discusses 12 enhancements to wait event monitoring and analysis in Oracle 10g, including more descriptive wait event names, new columns in views like v$session and v$sqlarea, and new views such as v$event_histogram and v$session_wait_history that provide additional insight. It focuses on improvements that help DBAs more easily understand what sessions are waiting for and identify potential performance bottlenecks through better organized wait event classification and more granular wait time statistics.
Lock-Free Programming: Techniques of the Pros (Part 2), by Jean-Philippe Bempel
Application scalability is a major concern. Much scalability is lost to code containing locks, which produce significant contention under heavy load.
In this presentation we cover several techniques (striping, copy-on-write, ring buffers, spinning, ...) that let us reduce this contention or obtain lock-free code. We also explain the concepts of compare-and-swap and memory barriers.
This document summarizes a project on pavement design conducted by a group of students at the Buddha Institute of Technology under the guidance of Prof. ALAK ROY. It discusses the different types of pavement, including flexible and rigid pavement. It also describes tests conducted on aggregates, bitumen, and soil to determine their properties for pavement design and construction. These include aggregate impact value, aggregate abrasion value, determining water content in soil, and determining the ductility of bitumen. The key differences between flexible and rigid pavement are highlighted. Flexible pavement has lower strength and lifespan but is easier to repair, while rigid pavement has higher strength and lifespan but may develop faults at joints.
Kyle Hailey is an Oracle expert who has worked with Oracle since 1990. He has experience with Oracle support, porting versions of Oracle, benchmarking, and real world performance. He has also worked with startups, Quest Software, Oracle OEM, and Embarcadero. The document discusses row locks in Oracle and how to find blocking sessions and SQL using tools like ASH, v$lock, and Logminer. It provides examples of creating row lock waits and how to investigate them using these tools.
Juniper Networks provides WX/WXC platforms to accelerate enterprise applications over the WAN. The platforms compress, cache, and accelerate applications to improve performance. This allows organizations to consolidate servers, simplify administration, and provide instant response times to users while reducing costs, increasing productivity and ensuring regulatory compliance. Over 1,400 customers use the WX/WXC platforms to achieve these business and IT objectives.
This document provides a tutorial on argument visualization using a dialogue to illustrate transitive inference. It introduces the concept of transitivity of predication, where the predicates of one statement can be affirmed of the subject of another statement. Diagrams are used to visually represent arguments, with premises connected by dotted lines to link the subject of one statement to the predicate of another in a transitive pattern. The dialogue demonstrates using premises to connect the subject and predicate of a main contention when the connection is not immediately clear to the reader.
This document provides a project report on the design of a flexible pavement for the SDITS campus. It was submitted by a team of 5 civil engineering students at Shri Dadaji Institute of Technology and Science in Khandwa, India, in partial fulfillment of their Bachelor of Engineering degree. The report includes chapters on literature review, proposed methodology, surveying and leveling of the site, laboratory tests conducted, design and results, conclusions, and references. The team conducted a topographic survey of the existing road, took soil samples for testing, designed the pavement structure using the California Bearing Ratio method, and provided a cost estimate for constructing the flexible pavement on the SDITS campus.
Construction of flexible pavement in brief, by Ajinkya Thakre
This document provides an overview of flexible pavement construction. It defines flexible pavement as those that reflect deformation through their layers to the surface. The main components of a flexible pavement are described as the wearing course, base course, subbase, and subgrade. Details are given on materials and construction methods for each layer, including bituminous mixtures for the wearing course, aggregates for the base course, and drainage and load distribution functions of the subbase and subgrade layers. Construction steps are outlined as preparation, mixing, spreading, compacting, and allowing the pavement to dry before opening to traffic.
Runways are paved surfaces on airports designed for aircraft landing and takeoff. Runways have markings and lighting to guide pilots. Key markings include runway numbers, centerline, edge lines, and threshold markings. Runway lighting includes edge lights, centerline lights, and approach lighting systems. Factors like surface type, length, width, and wind direction determine which runway is active. Strict procedures are in place in and around runways to prevent incursions and ensure safety.
This document provides information on geometric design considerations for airport runways, taxiways, and terminals. It discusses factors that influence runway orientation such as wind conditions and aircraft performance. It also describes guidelines for determining basic runway length based on elevation, temperature, and aircraft characteristics. Additional topics covered include runway configuration, geometry standards for length, width, gradients and sight distances, taxiway design standards, and concepts for terminal area layout and space requirements.
10 - Runway Design (Highway and Airport Engineering, Dr. Sherif El-Badawy), by Hossam Shafiq I
The document discusses various aspects of runway design including:
1. The components that make up a runway system such as the structural pavement, shoulders, blast pad, runway safety area, object free zone, and obstacle free zone.
2. Factors considered for runway length such as elevation, temperature, and gradient that require corrections to the basic runway length.
3. Examples are provided to demonstrate how to calculate the corrected runway length based on elevation, temperature, and gradient at the airport site.
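The corrections described above follow well-known ICAO-style rules of thumb, sketched below: add 7% of the basic length per 300 m of elevation, 1% per degree Celsius by which the airport reference temperature exceeds the standard-atmosphere temperature at that elevation, and 20% per 1% of effective gradient, applied to the already-corrected length. The numeric factors are the commonly taught values; consult the governing standard before using real figures.

```python
# Corrected runway length from basic length using the commonly
# taught elevation, temperature and gradient corrections.

def corrected_runway_length(basic_m, elev_m, ref_temp_c, gradient_pct):
    # elevation: +7% of basic length per 300 m above sea level
    length = basic_m * (1 + 0.07 * elev_m / 300.0)
    # standard-atmosphere temperature: 15 C at sea level, -6.5 C/km
    std_temp = 15.0 - 0.0065 * elev_m
    # temperature: +1% per deg C above standard at this elevation
    length *= 1 + 0.01 * max(0.0, ref_temp_c - std_temp)
    # gradient: +20% per 1% effective gradient, applied last
    length *= 1 + 0.20 * gradient_pct
    return length
```

For instance, a 1,500 m basic length at 300 m elevation, 25 °C reference temperature and 0.5% effective gradient corrects to roughly 1,976 m.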
Runways are paved surfaces built for takeoffs and landings of aircraft. Runway orientation is primarily determined by prevailing winds, with additional considerations for airspace, environmental factors, and obstructions. There are four main runway configurations: single, parallel, open-V, and intersecting. Runways are named based on their magnetic heading and are marked with lights and painted lines to guide aircraft. Safety incidents can occur if aircraft exit a runway, overrun its length, use the wrong runway, or land short of the pavement.
The Public Works Department was established in 1854 to oversee construction projects including roads, buildings, railways, flood control, irrigation, and military works. The document then discusses the purpose and types of pavement structures for roads, including flexible pavements made of asphalt and rigid pavements made of concrete. It also describes the construction process for subgrades and different bituminous pavement layers using materials like crushed aggregate, bitumen binder, and compaction.
This document discusses the design of flexible granular pavements. It outlines the different types of pavement, including flexible pavements made of unbound granular materials and sometimes bituminous or cement stabilized materials. It also discusses rigid pavements made of Portland cement concrete. The document then focuses on analyzing the structural capacity of pavements and the factors considered in design, such as subgrade strength, pavement materials, and design traffic loading over the life of the pavement. Case studies are also presented.
This document provides guidelines for the design of highway pavements in India. It discusses different types of pavements, including flexible and rigid pavements. For rigid pavement design, it outlines factors like traffic, climate, materials properties. It describes the components and types of joints in concrete roads. For flexible pavement design, it discusses the group index and CBR methods, which consider soil properties and traffic volumes to determine layer thicknesses. The document provides details on mix design methods for bituminous concrete like Marshall and Hveem.
This document discusses the design principles, components, and methods for designing both flexible and rigid pavements according to IRC standards, describing the roles of subgrade soil, pavement layers, traffic characteristics, and materials used for flexible pavements consisting of granular bases and bituminous surfaces, as well as jointed concrete slabs for rigid pavements. It also provides an example of designing a two-lane bypass pavement based on initial traffic volume, design life, growth rate, and subgrade CBR value.
Cost-effective software reliability through autonomic tuning of system resources, by Vincenzo De Florio
This document summarizes a seminar on achieving cost-effective software reliability through autonomic tuning of system resources. It discusses closed world systems which assume immutable environments and platforms. However, assumptions can clash with changing contexts over time. Open world software senses contexts and adapts assumptions. Examples given are adjusting redundancy and fault tolerance strategies like changing protocols or design patterns based on detected environmental conditions. Autonomic techniques like adaptive redundant data structures and normalized dissent in N-version programming are presented as ways to dynamically tune redundancy based on failure risk assessments. Simulations show such approaches improve reliability over static redundancy configurations.
EENC: Energy Efficient Nested Clustering in Underwater Acoustic Sensor Networks, by Syed Hassan Ahmed
The document proposes an Energy Efficient Nested Clustering (EENC) approach for underwater acoustic sensor networks to address issues like high energy consumption, propagation delay, and redundant data sensing. EENC divides the network area into virtual grids and selects cluster heads based on residual energy. It has three phases: initial grouping, cluster formation, and data transmission. Simulation results show EENC improves network lifetime and throughput ratio while reducing packet transmission ratio compared to LEACH and LEACH-L clustering protocols. Future work areas include optimizing node switching and testing EENC with different routing protocols.
The document discusses the State-Space Nodal (SSN) solver and its applications in real-time power system simulation. SSN allows large power systems to be simulated in real-time by reducing the number of network nodes through grouping, enabling parallelization. SSN has been used successfully for distribution grids, more electric aircraft, and super-large fusion reactor converter simulations. New developments include iterative methods for modeling switches and surge arresters in real-time.
IRJET - Fault Detection and Classification in Transmission Line by using KNN ..., by IRJET Journal
This document presents a machine learning approach using K-Nearest Neighbors (KNN) and Decision Tree (DT) classifiers to detect and classify faults on a transmission line. Discrete Wavelet Transform is used to extract features from fault current and voltage signals. These features are input to the KNN and DT classifiers, which are compared to determine the most suitable technique for fault analysis. KNN classifies based on closest data points while DT recursively splits data based on attribute choices until classification is reached. The proposed approach uses semi-supervised learning to process both labeled and unlabeled power system data for fault detection and classification.
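The KNN side of this pipeline is small enough to sketch: given labeled feature vectors (stand-ins here for the wavelet-derived features), classify a query by majority vote among its k nearest training points. The feature values below are made up for illustration, not taken from the paper.

```python
# Minimal k-nearest-neighbors classifier: majority vote among the
# k training samples closest to the query under Euclidean distance.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    by_dist = sorted(train, key=lambda s: math.dist(s[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```

The decision-tree alternative the paper compares against would instead split the same feature space on one attribute at a time; KNN defers all work to query time, which matters when fault detection must run online.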
Xiangyu Zhang is a graduate student studying electronic and computer engineering at the University of Massachusetts Amherst. He has experience in formal equivalence checking, SAT solving, and circuit security research. His research focuses on developing oracle-guided incremental SAT solvers and logic circuit camouflage tools. He has also worked as a teaching assistant for computer systems lab courses.
This is a presentation by Prof. Anne Elster at the International Workshop on Open Source Supercomputing held in conjunction with the 2017 ISC High Performance Computing Conference.
"Can programming of multi-core systems be easier, please? The ALMA Approach"
By Oliver Oey, Karlsruhe Institute of Technology (KIT), for ScilabTEC 2015
Conference: 13th IEEE International Conference on Industrial Informatics, INDIN 2015. Cambridge, UK – July 22-24 2015
Title of the paper: Towards processing and reasoning streams of events in knowledge-driven manufacturing execution systems
Authors: Borja Ramis Ferrer, Sergii Iarovyi, Andrei Lobov, José L. Martinez Lastra
PERFORMANCE OF VEHICULAR AD-HOC NETWORKS (VANET), by Limon Prince
The document summarizes a student's final year defense presentation on analyzing and preventing security attacks in routing protocols for vehicular ad-hoc networks (VANETs). The presentation covers introducing VANETs and wireless ad-hoc networks, stating the aims to establish an algorithm and simulate scenarios to compare routing protocols. It describes simulating two highway scenarios using OMNeT++ and comparing the proposed algorithm to AODV, DSR and B-AODV based on packet collisions, packets dropped and throughput. The results show the proposed algorithm performs better, and future work could make the network immune to black hole attacks.
Reliability analysis of wireless automotive applications with transceiver red..., by rchulyada
This document summarizes a master's thesis presentation on reliability analysis of wireless automotive applications with transceiver redundancy. The presentation covers:
1) Problems with increasing sensors/integration in cars and the solution of using wireless transmission
2) Challenges of wireless communication in vehicles, including interferences and lack of dedicated protocols
3) A safety analysis of an existing wireless system in an electric car, including FMEA, MTTF calculation, and reliability block diagram
4) An approach and analysis to design a reliable redundant system in a car using parallel transceivers and channels analyzed through reliability block diagram and calculations.
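The arithmetic behind a reliability block diagram is compact: blocks in series all have to work, while a parallel (redundant) group fails only if every branch fails, so two independent transceivers of reliability r survive with probability 1 − (1 − r)². The numbers below are illustrative, not taken from the thesis.

```python
# Reliability block diagram arithmetic for series chains and
# parallel (redundant) groups of independent components.

def series(*rs):
    out = 1.0
    for r in rs:
        out *= r            # a series chain needs every block working
    return out

def parallel(*rs):
    fail = 1.0
    for r in rs:
        fail *= (1.0 - r)   # a parallel group fails only if all fail
    return 1.0 - fail

# hypothetical example: one sensor (0.95) feeding two redundant
# transceiver (0.9) + channel (0.98) paths
r_link = series(0.95, parallel(series(0.9, 0.98), series(0.9, 0.98)))
```

Here each single path has reliability 0.882, but the redundant pair reaches about 0.986, illustrating why the thesis adds parallel transceivers and channels.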
This document describes a parallel algorithm for batched range searching on coarse-grained multicomputers. The algorithm is based on the range-tree method and solves the d-dimensional batched range searching problem in O(T_s(n log^(d-1) p; p) + T_s(m log^(d-1) p; p) + ((m + n) log^(d-1)(n/p) + m log^(d-1) p log(n/p) + k)/p) time. It constructs a range tree in parallel by sorting points and building subtrees on each processor. It then partitions query ranges and determines contained points in parallel by traversing the range tree.
This talk presents the results from one of our papers on the use of an evolutionary algorithm for an "inverse problem" on self-organised nanoparticles.
Short Term Electrical Load Forecasting by Artificial Neural Network, by IJERA Editor
This paper presents an application of artificial neural networks to short-term time-series electrical load forecasting. An adaptive learning algorithm is derived from a system-stability analysis to ensure convergence of the training process. Historical data on hourly power load and hourly wind power generation are sourced from the European Open Power System Platform. The simulations demonstrate that, with the adaptive learning factor, errors decrease steadily regardless of the starting value, whereas with constant learning factors of different values the errors behave erratically.
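The adaptive-learning-factor idea can be sketched on a scalar linear model: shrink the step when the error grows (a sign of instability) and grow it slightly when the error falls. The 0.5/1.05 adjustment factors and the toy data below are illustrative choices, not the paper's stability-derived rule.

```python
# Gradient descent with a simple adaptive learning factor on a
# one-parameter linear model y = w * x (mean-squared-error loss).

def train_adaptive(xs, ys, lr=0.5, epochs=100):
    w, prev_err = 0.0, float("inf")
    for _ in range(epochs):
        err = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        # error grew: halve the step; error fell: grow it slightly
        lr = lr * 0.5 if err > prev_err else lr * 1.05
        prev_err = err
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w, prev_err
```

Even with a deliberately unstable starting learning rate of 0.5, the halving rule pulls training back into the stable region and the fit converges, which is the qualitative behaviour the paper reports for its adaptive factor.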
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-sze
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Vivienne Sze, Associate Professor at MIT, presents the "Approaches for Energy Efficient Implementation of Deep Neural Networks" tutorial at the May 2018 Embedded Vision Summit.
Deep neural networks (DNNs) are proving very effective for a variety of challenging machine perception tasks. But these algorithms are very computationally demanding. To enable DNNs to be used in practical applications, it’s critical to find efficient ways to implement them.
This talk explores how DNNs are being mapped onto today’s processor architectures, and how these algorithms are evolving to enable improved efficiency. Sze explores the energy consumption of commonly used CNNs versus their accuracy, and provides insights on "energy-aware" pruning of these networks.
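The simplest baseline for the pruning discussed above is magnitude-based: zero out the smallest-magnitude fraction of the weights. Energy-aware pruning, as described in the talk, ranks weights by estimated energy cost rather than magnitude alone; the sketch below shows only the magnitude baseline.

```python
# Magnitude-based weight pruning: zero the smallest-magnitude
# fraction of a weight list. Ties at the threshold are also pruned.

def prune_by_magnitude(weights, sparsity):
    """Return a copy of weights with the smallest `sparsity` fraction
    (by absolute value) set to zero."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Zeroed weights let hardware skip multiply-accumulates and weight fetches, which is where the energy saving comes from.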
This document provides a laboratory manual for an EE0405 Simulation Lab course. It includes:
1. A list of 12 experiments involving MATLAB/SIMULINK simulations of power electronics circuits like single and three-phase rectifiers and power system studies using software like ETAP.
2. Instructions on laboratory policies and report format, with the goal of developing skills in using computer packages for power electronics and power system analysis.
3. A session plan mapping the listed experiments to 12 weeks, with objectives to acquire MATLAB/SIMULINK and software skills relevant to power electronics and power systems.
Hardback solution to accelerate multimedia computation through mgp in cmp, by eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Sumit Ravi Naidu is seeking an entry-level position in electrical and electronic engineering. He has a Master's degree in electrical engineering from Texas A&M University and a Bachelor's from Bhilai Institute of Technology in India. His experience includes teaching assistant roles at Texas A&M and internships at power plants in India where he worked on PLCs, DCS, and developed control programs. His projects involve power system analysis, renewable energy control systems, and modeling a thermal power plant. He is proficient in programming languages, power and control systems tools, and office software.
Performance prediction of PV & PV/T systems using Artificial Neural Networks ..., by Ali Al-Waeli
This presentation offers insight into use of ANN and machine learning for various applications in solar energy. Prepared and presented by Dr. Ali H. A. Alwaeli.
Similar to Contention-Aware Scheduling (a different approach):
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
SMS API Integration in Saudi Arabia| Best SMS API ServiceYara Milbes
Discover the benefits and implementation of SMS API integration in the UAE and Middle East. This comprehensive guide covers the importance of SMS messaging APIs, the advantages of bulk SMS APIs, and real-world case studies. Learn how CEQUENS, a leader in communication solutions, can help your business enhance customer engagement and streamline operations with innovative CPaaS, reliable SMS APIs, and omnichannel solutions, including WhatsApp Business. Perfect for businesses seeking to optimize their communication strategies in the digital age.
Artificia Intellicence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Découvrez les dernières innovations de Neo4j, et notamment les dernières intégrations cloud et les améliorations produits qui font de Neo4j un choix essentiel pour les développeurs qui créent des applications avec des données interconnectées et de l’IA générative.
What is Master Data Management by PiLog Groupaymanquadri279
PiLog Group's Master Data Record Manager (MDRM) is a sophisticated enterprise solution designed to ensure data accuracy, consistency, and governance across various business functions. MDRM integrates advanced data management technologies to cleanse, classify, and standardize master data, thereby enhancing data quality and operational efficiency.
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemPeter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesQuickdice ERP
Explore the seamless transition to e-invoicing with this comprehensive guide tailored for Saudi Arabian businesses. Navigate the process effortlessly with step-by-step instructions designed to streamline implementation and enhance efficiency.
8 Best Automated Android App Testing Tool and Framework in 2024.pdfkalichargn70th171
Regarding mobile operating systems, two major players dominate our thoughts: Android and iPhone. With Android leading the market, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
Atelier - Innover avec l’IA Générative et les graphes de connaissancesNeo4j
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Allez au-delà du battage médiatique autour de l’IA et découvrez des techniques pratiques pour utiliser l’IA de manière responsable à travers les données de votre organisation. Explorez comment utiliser les graphes de connaissances pour augmenter la précision, la transparence et la capacité d’explication dans les systèmes d’IA générative. Vous partirez avec une expérience pratique combinant les relations entre les données et les LLM pour apporter du contexte spécifique à votre domaine et améliorer votre raisonnement.
Amenez votre ordinateur portable et nous vous guiderons sur la mise en place de votre propre pile d’IA générative, en vous fournissant des exemples pratiques et codés pour démarrer en quelques minutes.
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
Contention - Aware Scheduling (a different approach)
1. LCN : Design and Implementation of a Contention-Aware Scheduler
Raptis Dimos-Dimitrios
8th SFHMMY Conference 2015
April 4, 2015
National Technical University of Athens
School of Electrical and Computer Engineering
2. Outline
Motivation
Background
Similar Research
Scheduler Overview
Classification Scheme
Prediction Model
Scheduling Algorithm
Comparison with Similar Research
Conclusion
Future Work
3. Motivation
Memory Wall
SMPs & CMPs
Cache Coherency Protocols
Multithreaded Programming
Parallel Processing
4. Motivation
Cache Coherence Problems
Legacy PC applications (not benefiting)
Applications benefiting from multithreaded environments
"Embarrassingly parallel" applications (GPU etc.)
Leveraging Parallelism
5. Motivation
Problems
Memory Contention Problem
Cache Coherence Problem
Missing existing infrastructure to detect and restrict contention for system resources
Approaches
What if it were not the programmer's responsibility to "allocate" resources?
What if the Operating System were responsible for judging applications' parallelism?
→ Contention-Aware Scheduling
6. Contention-Aware Scheduling
Background components:
Classification (based on locality and degree of contention)
Scheduling Algorithm
HPC Monitoring
+ Our approach contains an additional component : a prediction model
7. Similar Research
Various Approaches
Simple heuristic approaches (LLC misses & memory bandwidth)
Stack Distance Profiling approaches
Dynamic scheduling approaches using supervised learning (linear regression, fuzzy-rule models, K-nearest neighbour)
Differences from our approach
Not covering the whole memory hierarchy
Using additional hardware not currently available to the OS
Targeting the same problem from a different view
Pre-defined resource allocations in applications
8. Scheduler Overview
Scheduler Main Components
Classification Scheme
4 categories of applications, based on the memory hierarchy
Prediction Model
prediction of contention under varying resource allocations
Scheduling Algorithm
schedules a workload of applications based on the
classification scheme (co-scheduling combinations) and the
prediction model (for ideal resource management)
9. Classification Scheme
4 main categories of applications : L, LC, C, N
10. Classification Scheme
Co-scheduling interference
N - * : no interference
L - L : contention on the same resource, bandwidth "divided"
L - C : contention in different resources; severe performance degradation in C, no impact on L
L - LC : performance degradation for both; LC faces bigger degradation than L
LC - LC : contention in 2 resources (memory link and LLC); both degrade, but at low levels
LC - C : mediocre contention, mainly in C
C - C : most difficult to predict - based on data access patterns (MESI protocol)
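The qualitative rules above can be encoded as a lookup table. This is an illustrative sketch, not code from the presentation; the severity strings paraphrase the slide's descriptions and the function name is mine.

```python
# Co-scheduling interference rules for the 4 classes (L, LC, C, N),
# keyed by unordered pairs so that e.g. (L, C) and (C, L) are the same.
INTERFERENCE = {
    frozenset(["N"]):       "none",                        # N - N
    frozenset(["N", "L"]):  "none",                        # N - * : no interference
    frozenset(["N", "LC"]): "none",
    frozenset(["N", "C"]):  "none",
    frozenset(["L"]):       "bandwidth divided",           # L - L : same resource
    frozenset(["L", "C"]):  "severe in C, none in L",
    frozenset(["L", "LC"]): "both degrade, LC more",
    frozenset(["LC"]):      "low degradation in both",     # memory link + LLC
    frozenset(["LC", "C"]): "mediocre, mainly in C",
    frozenset(["C"]):       "unpredictable (data access patterns)",
}

def interference(class_a: str, class_b: str) -> str:
    """Qualitative interference when co-scheduling two application classes."""
    return INTERFERENCE[frozenset([class_a, class_b])]

print(interference("L", "C"))   # severe in C, none in L
print(interference("N", "LC"))  # none
```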
11. Classification Scheme
Co-scheduling interference
Analysis from a workload of 16 applications
4 applications belonging to each class
Co-scheduling of all possible combinations
Average slowdown calculated for each combination
Table : Average slowdown in co-execution
12. Classification Scheme
Classification tree
13. Prediction Model
Linear Regression Model
Target : prediction of scaling
HPCs monitored for a 1-core allocation
capability to predict scaling for any possible allocation
use of a threshold value for defining optimal scaling
Use the suitable counters for each class
Class L : memory link (bandwidth)
Class LC : LLC reuse (MESI protocol)
Class C : L2 and LLC reuse (MESI protocol)
Class N : private part of the memory hierarchy
14. Prediction Model
L class
R_p = (Mem_1 ∗ p) / (Maximum Memory Bandwidth)
p_optimum = max{p}, R_p < 1.15
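The L-class rule can be sketched as follows. This is an illustrative sketch, not the presentation's implementation; the function name, the bandwidth units, and the 8-core cap are assumptions (Mem_1 is taken as the bandwidth the application consumes on 1 core, in the same units as the platform's maximum).

```python
def optimal_cores_L(mem_1: float, max_bandwidth: float, max_cores: int = 8) -> int:
    """Largest p whose projected bandwidth ratio R_p = Mem_1 * p / max_bw
    stays under the 1.15 threshold; at least 1 core is always granted."""
    best = 1
    for p in range(1, max_cores + 1):
        r_p = (mem_1 * p) / max_bandwidth
        if r_p < 1.15:
            best = p
    return best

# e.g. an application using 3 GB/s on one core, on a 12.8 GB/s machine:
print(optimal_cores_L(3.0, 12.8))  # 4  (3*4/12.8 = 0.94 < 1.15; 3*5/12.8 = 1.17)
```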
LC class
Completion(LC) = 0.01799 ∗ f_LC + 0.50119 (p = 2 cores)
              = 0.02516 ∗ f_LC + 0.34286 (p = 3 cores)
              = 0.02846 ∗ f_LC + 0.26028 (p = 4 cores)
              = 0.03199 ∗ f_LC + 0.21584 (p = 5 cores)
              = 0.03404 ∗ f_LC + 0.18296 (p = 6 cores)
              = 0.03621 ∗ f_LC + 0.16410 (p = 7 cores)
              = 0.03751 ∗ f_LC + 0.13969 (p = 8 cores)
Ideal_Completion_p = 1/p , f_LC = L2 RFO Requests / (L3 reuse ∗ 10^5)
R_p = (Ideal_Completion_p / Completion_p) ∗ 100
p_optimum = max{p}, R_p > 70
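The LC-class predictor reduces to a table lookup plus the 70% efficiency threshold. An illustrative sketch (the coefficients are copied from the slide; how f_LC is obtained from the counters is outside this snippet):

```python
# (slope, intercept) of the LC-class regression for each core count p.
LC_COEFFS = {
    2: (0.01799, 0.50119), 3: (0.02516, 0.34286), 4: (0.02846, 0.26028),
    5: (0.03199, 0.21584), 6: (0.03404, 0.18296), 7: (0.03621, 0.16410),
    8: (0.03751, 0.13969),
}

def optimal_cores_LC(f_lc: float) -> int:
    """Largest p where predicted scaling efficiency R_p exceeds 70%."""
    best = 1
    for p, (slope, intercept) in sorted(LC_COEFFS.items()):
        completion = slope * f_lc + intercept      # predicted completion fraction
        r_p = (1.0 / p) / completion * 100.0       # ideal / predicted, in %
        if r_p > 70:
            best = p
    return best

print(optimal_cores_LC(5.0))  # 3 (p = 4 already drops below the 70% threshold)
```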
15. Prediction Model
C class
Completion(C) = 0.3447 ∗ f_C + 0.4947 (p = 2 cores)
             = 0.46974 ∗ f_C + 0.34415 (p = 3 cores)
             = 0.5155 ∗ f_C + 0.2478 (p = 4 cores)
             = 0.63609 ∗ f_C + 0.22492 (p = 5 cores)
             = 0.61403 ∗ f_C + 0.18127 (p = 6 cores)
             = 0.65915 ∗ f_C + 0.15864 (p = 7 cores)
             = 0.6095 ∗ f_C + 0.1263 (p = 8 cores)
Ideal_Completion_p = 1/p , f_C = (L2 Shared ∗ 10^4) / Instructions Retired
R_p = (Ideal_Completion_p / Completion_p) ∗ 100
p_optimum = max{p}, R_p > 70
N class
Completion(N)_p = Ideal_Completion_p
p_optimum = max{p}
17. Prediction Model
Evaluation – Verification
Relative Errors in Predictions of C class
18. Prediction Model
Evaluation – Verification
Relative Errors in Predictions of LC class
19. Prediction Model
LC - C prediction model improvement
Integration of the 7 relationships into a single one
Coefficients follow a logarithmic trendline
Results after analysis:
Completion(LC)_p = [0.0139536 ∗ log(p) + 0.0090562] ∗ f_LC + [−0.252533 ∗ log(p) + 0.6407058]
Completion(C)_p = [0.2151318 ∗ log(p) + 0.2239032] ∗ f_C + [−0.25468 ∗ log(p) + 0.6397947]
Ideal_Completion_p = 1/p
R_p = (Ideal_Completion_p / Completion_p) ∗ 100
p_optimum = max{p}, R_p > 70
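The refined model collapses the 7 per-core regressions into one formula per class, sketched below. Assumption flagged: the slide does not state the logarithm base, so natural log is used here; the coefficients are copied from the formulas above.

```python
import math

# class -> ((a1, b1), (a2, b2)): trendlines for the slope and the intercept.
REFINED = {
    "LC": ((0.0139536, 0.0090562), (-0.252533, 0.6407058)),
    "C":  ((0.2151318, 0.2239032), (-0.25468,  0.6397947)),
}

def completion(cls: str, p: int, f: float) -> float:
    """Predicted completion fraction for class cls on p cores, given f_LC/f_C."""
    (a1, b1), (a2, b2) = REFINED[cls]
    slope = a1 * math.log(p) + b1          # log base: assumed natural
    intercept = a2 * math.log(p) + b2
    return slope * f + intercept

def optimal_cores(cls: str, f: float, max_cores: int = 8) -> int:
    """Largest p whose predicted efficiency R_p exceeds the 70% threshold."""
    best = 1
    for p in range(2, max_cores + 1):
        r_p = (1.0 / p) / completion(cls, p, f) * 100.0
        if r_p > 70:
            best = p
    return best
```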
20. Prediction Model
Evaluation – Verification of Refinement
Deviation in C coefficients Deviation in LC coefficients
21. Prediction Model
Experimentation Platform
cores : 8
L1 D,I : 32KB, 8-way
L2 : 256KB, 8-way
L3 : 16MB, 16-way, 64-byte line
Mem : 64GB DDR3 1.3GHz
Debian 6.06
*(Prediction Model also tested on the Nehalem architecture)
22. Scheduling Algorithm
Executed after the first 2 steps have finished for each application
Step 1 has classified each application
Step 2 has predicted the optimum number of cores that should be allocated by the scheduler to each application
The algorithm tries to co-schedule the applications in pairs so that
the sum of cores does not exceed the package cores
contention is avoided as much as possible (using conclusions from the Classification step)
The approach can be extended for co-execution of more than 2 applications
N applications are allocated half the cores and scheduled twice (their profile implies that they are not affected by this)
23. Scheduling Algorithm
Lists of applications separated by class : L, LC, C, N

while (N not empty) {
    x = current N application;
    y = popMatchFromTheEnd(C, L, LC, N);
    coschedule(x, y);
}
while (LC not empty) {
    x = current LC application;
    y = popMatchFromTheEnd(C, LC, L);
    coschedule(x, y);
}
while (L not empty) {
    x = current L application;
    y = popMatchFromTheEnd(L);
    coschedule(x, y);
}
while (C not empty) {
    x = current C application;
    y = popMatchFromTheEnd(C);
    coschedule(x, y);
}
scheduleRemainingApplications();
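A runnable Python sketch of the pairing pseudocode above, under two assumptions the slide leaves implicit: `popMatchFromTheEnd` pops from the end of the first non-empty candidate list in the given preference order, and "current" means the head of the class list. `coschedule` is replaced by collecting pairs; an application paired with `None` runs alone.

```python
def pop_match_from_the_end(*lists):
    """Pop the last item of the first non-empty candidate list."""
    for lst in lists:
        if lst:
            return lst.pop()
    return None  # no partner available

def schedule(L, LC, C, N):
    pairs = []
    while N:                               # N tolerates anyone
        x = N.pop(0)
        y = pop_match_from_the_end(C, L, LC, N)
        pairs.append((x, y))
    while LC:                              # LC prefers C, then another LC, then L
        x = LC.pop(0)
        y = pop_match_from_the_end(C, LC, L)
        pairs.append((x, y))
    while L:                               # leftover L paired together
        x = L.pop(0)
        y = pop_match_from_the_end(L)
        pairs.append((x, y))
    while C:                               # leftover C paired together
        x = C.pop(0)
        y = pop_match_from_the_end(C)
        pairs.append((x, y))
    return pairs                           # (x, None) = runs alone

print(schedule(["l1"], ["lc1"], ["c1", "c2"], ["n1"]))
# [('n1', 'c2'), ('lc1', 'c1'), ('l1', None)]
```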
24. Comparison with Similar Research
The other state-of-the-art schedulers
Sorting by heuristic
Distributing load by combining an application from the top with an application from the bottom
LLC-MRB : LLC misses
LBB : memory bandwidth
25. Comparison with Similar Research
Experiments – Comparison Process
Linux CFS, LCN, LLC-MRB, LLB to be compared
Workload of 17 applications (equally shared among classes)
Whole workload executed for 1 hour
Time quanta of 1 second defined in all schedulers
When an application finishes, it is respawned and re-executed
Comparison between schedulers with 2 criteria
Throughput
Total number of executions of all applications
Number of improved applications
Fairness
Standard deviation between the gains of the applications
*gain compared to a Gang scheduler
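The two criteria can be made concrete with a small sketch (illustrative, not the evaluation harness; `runs` and `gang_runs` are assumed inputs holding completed executions per application under the tested scheduler and under the Gang baseline).

```python
import statistics

def throughput(runs: dict) -> int:
    """Total number of completed executions across all applications."""
    return sum(runs.values())

def fairness(runs: dict, gang_runs: dict) -> float:
    """Lower is fairer: spread of per-application gain over the Gang baseline."""
    gains = [runs[app] / gang_runs[app] for app in runs]
    return statistics.pstdev(gains)

runs = {"a": 12, "b": 9, "c": 9}
gang = {"a": 10, "b": 10, "c": 10}
print(throughput(runs))               # 30
print(round(fairness(runs, gang), 3))  # 0.141
```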
26. Comparison with Similar Research
Most Improved Applications
Linux : 5
LLC-Balance : 7
MEM-Balance : 5
LCN : 8
27. Comparison with Similar Research
Criteria :
- Throughput : LCN
- Fairness : LLC-Balance *
* fairness can be misinterpreted
28. Comparison with Similar Research
Major Drawbacks of the other schedulers
Linux Scheduler (CFS)
Cannot locate contention
Does not identify threads of the same application → parallelism benefits lost
MEM-Balance Scheduler
Uses an over-generic heuristic
Does not take all parts of the memory hierarchy into account
LLC-Balance Scheduler
Cannot differentiate between class N and class C applications, since both exhibit low LLC misses
Results in co-scheduling L with C applications → contention
29. Conclusion
Proposed a contention-aware scheduler that
Does not require additional hardware or OS adjustments
Is simple and easily integratable as a component in a modern OS
Consists of 3 parts
Compared to the other state-of-the-art schedulers and the CFS, it
Presents the best throughput
Presents fairness equal to the CFS (and lower than the other contention-aware schedulers)
Can be integrated into real-life scheduling with 2 approaches:
Applications are executed for 2-3 quanta when inserted in the queue
Start scheduling and monitoring simultaneously (dynamic adaptation)
30. Future Work
Major Improvements
Improvements in the prediction model
Stepwise regression models to add more variables
Decrease the error
Caution : limitation in the number of monitored counters
Other methods, such as machine learning
Investigation of the added overhead
Extension of the approach to NUMA architectures
Implemented and tested for 1 package only
Extensible to multiple packages, using thread migrations
• Initially try to allocate threads of the same application in the same package
• Thread migrations executed when a class change is observed, along with memory migrations
31. THE END
Thank you!
Any questions?