Abstract: Optimizing the energy efficiency of parallel execution on computing systems, ranging from server farms and mobile devices to embedded systems, is increasingly a first-order concern. A common way to express a parallel application is as a directed acyclic graph (DAG) in which each node represents a task. Scheduling such tasks on a multiprocessor system means finding suitable processors on which to execute them. Asymmetric multiprocessor systems in particular feature different types of cores with different performance and power consumption, e.g. Arm big.LITTLE and Intel Lakefield. Naive task assignment that ignores core types and task features can result in inefficient resource utilization and detrimentally impact overall energy consumption. Dynamic task scheduling is a widely used strategy that requires no prior knowledge (e.g. of architecture heterogeneity or task DAG structure) before execution and instead makes its decisions at runtime. Among dynamic task scheduling methods, work stealing has proven effective and scales better on larger systems. Dynamic voltage and frequency scaling (DVFS) is a common technique for improving energy efficiency, but exploiting it incurs a reconfiguration overhead ranging from tens of microseconds to one millisecond. With fine-grained tasks as small as milliseconds, as required to expose large parallelism, it is not realistic to use DVFS at a per-task level. In addition, the energy consumed while cores are under-utilized is significant.
Based on these problem statements, we propose a low-energy work-stealing task scheduling runtime based on XiTAO, in which the system environment configurations are either fixed or managed by the OS power governors or system administrators. The runtime contains a dynamic performance tracing module, an idleness tracing module, a power profiling module and a task mapping algorithm. The dynamic performance model gives accurate predictions for future tasks given a set of resources. It is independent of platforms and frequencies, which makes it scalable and portable. Power profiling helps the runtime understand CPU power consumption trends with respect to the number and type of cores and their frequencies. Idleness tracing exposes the real-time status of cores and helps conserve energy during under-utilized periods. It also provides the real-time parallel slackness of active cores, which allows the task mapping algorithm to attribute the corresponding power consumption to each concurrently running task. The task mapping algorithm integrates the information from the three modules above and outputs the predicted best resource placement for each ready task.
Poster presented by Jing Chen at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
Low Energy Task Scheduling based on Work Stealing
The LEGaTO project has received funding from the European Union’s Horizon 2020 research and innovation programme under the grant agreement No 780681. www.legato-project.eu
Jing Chen
Chalmers University of Technology
Directed Acyclic Graph (DAG)
• A task-based way to express multithreading applications.
• Nodes are tasks.
• Edges are dependencies.
Asymmetric platforms feature
• High-performance, power-hungry cores
• Small, energy-efficient cores
Dynamic Task Scheduling
Work stealing: better scalability on larger systems and less communication contention than a centralized scheme.
Performance improvement is NOT enough for energy reduction
DVFS: dynamic voltage and frequency scaling
• Users are usually not permitted to manipulate DVFS settings.
Overhead: tens of μs to over one ms + Multithreading application: μs-level fine-grained tasks => NOT realistic to use DVFS per task
State of the art: per-core DVFS => Significant hardware cost (inductors and capacitors) => Most systems only feature cluster-based DVFS
State of the art: complete platform control => If other applications run on the same cluster => Badly influences the energy of these applications
Low Energy Runtime Design
Power Profiling
• Helps the runtime understand CPU power consumption trends (number/type of cores, different frequencies)
• We evaluate two power profiling techniques:
(a) Directly sampling power from the onboard power sensor, e.g. the INA3221 on the NVIDIA Jetson TX2.
(b) The Intel RAPL energy model: sample energy at a fixed interval, then:
$\mathrm{Power}_{n+1} = (\mathrm{Energy}_{n+1} - \mathrm{Energy}_{n}) / (t_{n+1} - t_{n})$
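The energy-difference formula in technique (b) can be sketched as follows. This is a minimal illustration, not the runtime's code: the function name `interval_power` and the sample values are hypothetical, samples are assumed to be (timestamp in seconds, cumulative energy in joules) pairs, and counter wraparound is deliberately ignored.

```python
from typing import List, Tuple


def interval_power(samples: List[Tuple[float, float]]) -> List[float]:
    """Average power (W) over each interval between consecutive samples.

    Each sample is (timestamp_seconds, cumulative_energy_joules).
    Implements Power_{n+1} = (Energy_{n+1} - Energy_n) / (t_{n+1} - t_n).
    Assumption: the energy counter does not wrap around between samples.
    """
    powers = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        powers.append((e1 - e0) / dt)
    return powers


# Hypothetical samples taken every 0.5 s.
samples = [(0.0, 100.0), (0.5, 110.0), (1.0, 125.0)]
print(interval_power(samples))  # [20.0, 30.0]
```

A real sampler would also need to handle the energy counter wrapping around, which this sketch leaves out.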
Dynamic Performance Modeling
• Provides accurate predictions for future tasks given a set of resources
• Independent of platforms and frequencies
• Achieves scalability and portability goals
Idleness Tracing
• Gives information about the real-time status of cores
• Puts cores to "sleep" when they are under-utilized
• Sleep time follows an exponential backoff strategy
• Provides the real-time parallel slackness of active cores => calculation of the share of board static power attributed to each running task
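One way to picture the exponential backoff sleep is the sketch below. The class name, the initial and maximum sleep durations, and the doubling factor are all assumptions made for illustration; the poster does not give the values the runtime uses.

```python
class ExponentialBackoff:
    """Sleep duration that grows while a core keeps finding no work.

    All parameters are hypothetical: the sleep starts at `initial_us`,
    is multiplied by `factor` after every unsuccessful attempt, and is
    capped at `max_us`. Finding work resets it.
    """

    def __init__(self, initial_us: int = 10, max_us: int = 1000, factor: int = 2):
        self.initial_us = initial_us
        self.max_us = max_us
        self.factor = factor
        self._current_us = initial_us

    def next_sleep_us(self) -> int:
        """Return the sleep time for this idle attempt, then grow it."""
        current = self._current_us
        self._current_us = min(self._current_us * self.factor, self.max_us)
        return current

    def reset(self) -> None:
        """Call when the core finds work again."""
        self._current_us = self.initial_us


backoff = ExponentialBackoff(initial_us=10, max_us=100)
print([backoff.next_sleep_us() for _ in range(6)])  # [10, 20, 40, 80, 100, 100]
backoff.reset()
print(backoff.next_sleep_us())  # 10
```

The cap bounds how long a sleeping core takes to notice new work, and the reset keeps busy cores responsive.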
Task Mapping Algorithm (per-task level)
For a given configuration (start core, number of cores):
• Performance Tracer => Execution Time Prediction
• Power Profiles => Dynamic Power Prediction
• Power Profiles + Idleness Tracer => Static Power Prediction
Energy Prediction = (Static Power + Dynamic Power) × Execution Time
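The energy prediction above can be applied to candidate configurations as in the following sketch. The names (`Prediction`, `static_share`, `lowest_energy_config`) and all numbers are hypothetical. In particular, splitting the board static power equally among concurrently running tasks is an assumption, since the poster only states that the parallel slackness is used for this attribution.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Config = Tuple[int, int]  # (start core, number of cores)


@dataclass
class Prediction:
    exec_time_s: float      # from the performance tracer
    dynamic_power_w: float  # from the power profiles
    static_power_w: float   # static power attributed to this task


def static_share(board_static_w: float, concurrent_tasks: int) -> float:
    """Assumption: board static power is split equally among running tasks."""
    return board_static_w / max(concurrent_tasks, 1)


def predicted_energy(p: Prediction) -> float:
    """Energy = (static power + dynamic power) x execution time."""
    return (p.static_power_w + p.dynamic_power_w) * p.exec_time_s


def lowest_energy_config(candidates: Dict[Config, Prediction]) -> Config:
    """Pick the configuration with the lowest predicted energy."""
    return min(candidates, key=lambda c: predicted_energy(candidates[c]))


# Hypothetical predictions: 2.0 W board static power, 2 concurrent tasks.
share = static_share(2.0, 2)
candidates = {
    (0, 1): Prediction(exec_time_s=4.0, dynamic_power_w=1.0, static_power_w=share),
    (0, 2): Prediction(exec_time_s=2.5, dynamic_power_w=1.8, static_power_w=share),
    (0, 4): Prediction(exec_time_s=2.0, dynamic_power_w=3.5, static_power_w=share),
}
print(lowest_energy_config(candidates))  # (0, 2)
```

In this made-up example the fastest configuration (4 cores) is not the one with the lowest predicted energy, which is the point of mapping by energy instead of execution time.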
Experimental Results
Name | Acronym | Description
Random Work Stealing (+Sleep) | RWS (+S) | Typical greedy scheduling (enhanced with Sleep)
Fastest Cores with Criticality (+Sleep) | FCC (+S) | Critical tasks are mapped to the set of cores that minimizes execution time, and work stealing is not allowed for them; noncritical tasks follow the parent queue and only search for the number of cores that minimizes the execution time of the task (enhanced with Sleep)
Lowest Cost with Criticality (+Sleep) | LCC (+S) | Same as FCC, except that parallel cost is minimized instead of execution time. Parallel cost means "execution time * number of cores" (enhanced with Sleep)
Lowest Energy without Criticality | LENC | Task scheduling targets the lowest energy; no need for criticality awareness
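The difference between the FCC and LCC selection criteria can be sketched as below. The function names and the execution times are invented for illustration; only the definition of parallel cost (execution time * number of cores) comes from the table.

```python
from typing import Dict


def fastest_width(exec_time_s: Dict[int, float]) -> int:
    """FCC-style choice: number of cores that minimizes execution time."""
    return min(exec_time_s, key=lambda n: exec_time_s[n])


def lowest_cost_width(exec_time_s: Dict[int, float]) -> int:
    """LCC-style choice: number of cores that minimizes parallel cost,
    where parallel cost = execution time * number of cores."""
    return min(exec_time_s, key=lambda n: exec_time_s[n] * n)


# Hypothetical predicted execution times (seconds) per number of cores.
times = {1: 8.0, 2: 4.5, 4: 3.0}
print(fastest_width(times))      # 4
print(lowest_cost_width(times))  # 1
```

With imperfect scaling, as in these made-up numbers, the two criteria pick different numbers of cores for the same task.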
[Figure: 2D-Heat on one Haswell node, 1000 iterations, resolution=10240. Y-axis: energy consumption [J] (x1000); x-axis: RWS, RWS+Sleep, FCC, FCC+Sleep, LCC, LCC+Sleep, LENC.]
[Figure: VGG-16 on NVIDIA Jetson TX2. Y-axis: energy [J]; x-axis: MAX&MAX, MAX&MIN, MIN&MAX, MIN&MIN; series: RWS, RWS+S, FCC, FCC+S, LCC, LCC+S, LENC.]
• MAX&MIN (x-axis) means that on the TX2 the Denver cluster frequency is at its maximum and the A57 cluster frequency is at its minimum.
• LENC achieves the lowest energy, e.g. 31%-74% less energy than RWS, 19%-68% less than FCC and 25%-73% less than LCC.
• Haswell is a symmetric platform. 2D-Heat includes two kernels: copy (memory-bound) and stencil (compute-bound).
• The sleep strategy brings a 38% energy reduction for RWS vs. RWS+S, 9% for FCC vs. FCC+S and 33% for LCC vs. LCC+S.
• LENC achieves low energy through task type awareness:
(a) Copy tasks choose number of cores = 5.
(b) Stencil tasks choose number of cores = 10.
Background
The importance of task feature awareness:
• Naive assignment causes a mismatch between task types and core types, e.g. running compute-bound kernels on only the powerful Denver cluster of the TX2 is more energy efficient than using all cores.