2. Problem
• upper bound on the performance of single core processors
• future embedded systems must be faster while consuming less
energy
• reducing energy consumption usually results in reduced
performance
• Solution: multiple processors per die
• multi-core system on chip (MCSOC).
• combining processors with different architectures
• heterogeneity creates opportunity for optimization
• highly effective for large scale data centers
• task mapping grows increasingly complex
• reliable and fast task mapping is needed
3. Project Goal
• develop a static/offline method of assigning incoming tasks to
the various cores of a heterogeneous MCSOC (also known as
mapping)
• the mapping will minimize the energy consumed to fully
execute a workload, such that all tasks are executed
6. Simulated Annealing
• task mapping is NP-complete
• simulated annealing is an iterative search heuristic
• allows escape from local minima
• solution not ideal, but is “good enough”
13. Mapping Algorithm Setup
• Global arrival rate for each task type
• Each core:
• p-state
• task flow rate for each task type
• Global execution rate for each task type
• Simulated annealing loop
• Fitness function (evaluated energy)
• Solution repair function
17. Conclusion
• Heterogeneity in MCSOCs creates opportunities for
optimization
• Simulated annealing is an effective optimization heuristic
• Proper mapping of workloads in heterogeneous MCSOCs can
greatly reduce total energy consumption when compared to a
non-energy aware mapping methodology
Editor's Notes
Certain tasks will run faster on one architecture than on another. This reduction in runtime lowers the energy required to execute that task (energy is the integral of power over time, so less time means less energy, given constant power consumption). Properly matching tasks to machines can therefore reduce total system energy consumption.
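The energy argument above can be checked with a small calculation. This is only a sketch: the function name and the power and runtime values are illustrative assumptions, not measurements from this work.

```python
def energy_joules(power_watts: float, runtime_seconds: float) -> float:
    """Energy at constant power: the integral of power over time reduces to P * t."""
    return power_watts * runtime_seconds


# Hypothetical numbers: the same task, at the same constant power,
# on an architecture where it runs slowly and on one where it runs quickly.
slow = energy_joules(power_watts=2.0, runtime_seconds=10.0)
fast = energy_joules(power_watts=2.0, runtime_seconds=6.0)

# Relative energy saved by running the task on the faster architecture.
saving = (slow - fast) / slow
```

With these made-up values the shorter runtime uses 40% less energy, purely because the power draw is held constant.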
Square grid. The MCSOC is assumed to have two types of processor cores: high efficiency and high performance. There is an equal number of each core type (N²/2) where possible. For cases with an odd number of total cores
Simulated annealing is an iterative heuristic search algorithm that mimics the formation of structures in metals during cooling. The nature of the structures is a function of the rate of cooling. Faster cooling results in more irregular structures (i.e. higher total energy); slower cooling results in more regular structures (i.e. lower total energy).
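To make the "rate of cooling" concrete, here is a minimal sketch of a geometric cooling schedule. The notes do not say which schedule was used, so the geometric form, the function name, and the parameters `t_start` and `alpha` are all assumptions made for illustration.

```python
from typing import List


def geometric_schedule(t_start: float, alpha: float, steps: int) -> List[float]:
    """Return the temperature at each step when T is multiplied by alpha every step."""
    temps = []
    t = t_start
    for _ in range(steps):
        temps.append(t)
        t *= alpha  # alpha close to 1 cools slowly, small alpha cools quickly
    return temps


# Hypothetical settings: same starting temperature, different cooling rates.
slow_cooling = geometric_schedule(t_start=100.0, alpha=0.99, steps=500)
fast_cooling = geometric_schedule(t_start=100.0, alpha=0.50, steps=500)
```

A slowly cooled search keeps a usable temperature for many more iterations, which gives it more chances to settle into a low-energy solution.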
A neighbor is generated by modifying some of the parameters of the current solution.
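One way such a neighbor move could look is sketched below. The solution layout (a list of per-core dicts holding a p-state and a list of flow rates), the 50/50 choice between the two move types, and the 0.05 step size are hypothetical choices for illustration, not the moves used in the actual simulation.

```python
import copy
import random

NUM_PSTATES = 4  # the notes state that each core type has four p-states


def make_neighbor(solution, rng):
    """Return a copy of `solution` with one randomly chosen parameter changed.

    `solution` is a list of per-core dicts: {"pstate": int, "flow": [float, ...]}.
    """
    neighbor = copy.deepcopy(solution)  # never mutate the current solution
    core = rng.choice(neighbor)
    if rng.random() < 0.5:
        # Move the core to a different, randomly chosen p-state.
        choices = [p for p in range(NUM_PSTATES) if p != core["pstate"]]
        core["pstate"] = rng.choice(choices)
    else:
        # Shift a small share of the core's time from one task type to another,
        # so the core's total flow stays the same.
        src, dst = rng.sample(range(len(core["flow"])), 2)
        delta = min(core["flow"][src], 0.05)
        core["flow"][src] -= delta
        core["flow"][dst] += delta
    return neighbor
```

Copying before modifying keeps the current solution intact, so a rejected neighbor can simply be discarded.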
A proposed solution is accepted if z > y. The smaller the change in energy (fitness function) and the higher the temperature, the greater the chance of accepting the proposed solution.
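The acceptance rule can be sketched as follows. The notes do not define z and y, so this assumes the common Metropolis form: z = exp(-ΔE / T) and y drawn uniformly from [0, 1). The function names are hypothetical.

```python
import math
import random


def acceptance_probability(delta_energy: float, temperature: float) -> float:
    """Metropolis-style probability of accepting a candidate solution (assumed form)."""
    if delta_energy <= 0:
        return 1.0  # a candidate that is no worse is always accepted
    return math.exp(-delta_energy / temperature)


def accept(delta_energy: float, temperature: float, rng: random.Random) -> bool:
    """Accept the candidate if z > y, with y uniform in [0, 1)."""
    z = acceptance_probability(delta_energy, temperature)
    y = rng.random()
    return z > y
```

Under this assumption, a smaller energy increase or a higher temperature both push z toward 1, which matches the behaviour described in the note.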
Two architectures were chosen to create a heterogeneous computing environment. The A7 is high efficiency, while the A15, with its much more complex pipeline, is higher performance but draws much more power.
The A7 and A15 each have 4 p-states. Power and performance are normalized for simplicity. Relative power is used for p-state power in the simulation (code snippet shown). Relative performance of each p-state is used in the Gem5 workload simulations to build the ECS matrix.
Synthetically generated workload. Five standard benchmarks (FFT, ocean with contiguous partitions, ocean with non-contiguous partitions, radix, LU) were simulated on the ARM architecture included in Gem5 at four different clock speeds (1 GHz, 866 MHz, 650 MHz, 434 MHz) taken from the real-world ARM p-states shown in the previous slide. The FFT benchmark was run twice with two different problem sizes. While the simulated workloads are not a comprehensive survey of all possible tasks for an embedded system, they vary sufficiently in runtime and computational intensity for the purposes of this investigation.
Runtimes are used to generate the ECS matrix for all task/core/p-state combinations. The actual ECS matrix from the code is shown. ECS is the inverse of runtime.
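Building the ECS values from runtimes can be sketched as below. The dictionary layout keyed by (task, core type, p-state) and the runtime values are hypothetical; they are not the measured Gem5 results or the matrix from the actual code.

```python
def ecs_from_runtimes(runtimes):
    """Map each (task, core_type, pstate) runtime to its ECS, the inverse of runtime."""
    return {key: 1.0 / seconds for key, seconds in runtimes.items()}


# Hypothetical runtimes in seconds for a few task/core/p-state combinations.
runtimes = {
    ("fft", "A7", 0): 4.0,
    ("fft", "A15", 0): 2.0,
    ("radix", "A7", 0): 8.0,
}
ecs = ecs_from_runtimes(runtimes)
```

Halving the runtime doubles the ECS, so faster task/core/p-state combinations get proportionally larger entries.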
Tasks are assigned to cores as flow rates, rather than as individual tasks. Flow rates are the fraction of time that the core spends executing that task. The execution rate for a task type on a core in a given p-state is the product of the flow rate and the ECS for that task/core/p-state. The global execution rate for each task is the sum of the individual execution rates on each core. Energy is calculated by summing the energy of each core. The energy of each core is a function of its core type and its current p-state. The relative energy for each p-state on each core type was obtained from the previously mentioned ARM whitepaper. A generated solution might not be valid, so the repair function randomly increases p-states until all task types are fully executed.
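The bookkeeping described above can be sketched as follows. The data layout (per-core dicts, ECS and power tables keyed by tuples), the function names, and the convention that a higher p-state index means a faster state are all assumptions made for this sketch; the actual simulation code may differ.

```python
import random


def global_execution_rates(cores, ecs, tasks):
    """For each task type, sum flow_rate * ECS over all cores."""
    rates = {t: 0.0 for t in tasks}
    for core in cores:
        for t in rates:
            rates[t] += core["flow"][t] * ecs[(t, core["type"], core["pstate"])]
    return rates


def total_energy(cores, power):
    """Sum the relative energy of each core, looked up by core type and p-state."""
    return sum(power[(c["type"], c["pstate"])] for c in cores)


def is_valid(cores, ecs, arrival):
    """A solution is valid when every task type is executed at least as fast as it arrives."""
    rates = global_execution_rates(cores, ecs, list(arrival))
    return all(rates[t] >= arrival[t] for t in arrival)


def repair(cores, ecs, arrival, max_pstate, rng):
    """Raise the p-state of randomly chosen cores until the solution is valid.

    Assumes a higher p-state index is faster. Returns True once the solution is
    valid, or False if every core reached max_pstate first.
    """
    while not is_valid(cores, ecs, arrival):
        upgradable = [c for c in cores if c["pstate"] < max_pstate]
        if not upgradable:
            return False
        rng.choice(upgradable)["pstate"] += 1
    return True
```

Returning False when no core can be raised any further is a choice made for this sketch so the loop always terminates; the notes do not say how an unrepairable solution is handled.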
Collecting the necessary number of trials for all data points took several hours.
Five different MCSOC sizes were considered. Each configuration was simulated five times for each of the four iteration limits of the SA algorithm (1,000, 10,000, 100,000, and 1,000,000 iterations). Percent decrease is calculated as the relative decrease in total device energy from a randomly generated initial solution. The five trials are averaged to accurately represent performance, since the initial solution is random. Explain stress factor: the stress factor is the fraction of the maximum workload that the device can support. For example, a stress factor of 0.8 means that the workload is 80% of the maximum. Higher stress factors allow less opportunity for optimization, as more of the device resources are utilized, limiting the number of available allocation options. During simulations, 4 stress factors were tested to simulate a full spectrum of MCSOC workload conditions. A stress factor of 1 was not simulated, as this would mean that the entire device was fully utilized and therefore there would be no opportunity for optimization.
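The two quantities in this note, the stress factor and the averaged percent decrease, can be sketched as below. The function names, the dictionary of maximum arrival rates, and the choice to reject stress factors outside (0, 1) are assumptions for illustration.

```python
def scaled_workload(max_arrival_rates, stress_factor):
    """Scale the maximum supportable arrival rate of each task type by the stress factor."""
    if not 0.0 < stress_factor < 1.0:
        # A stress factor of 1 leaves no room for optimization, so it is rejected here.
        raise ValueError("stress factor must be strictly between 0 and 1")
    return {task: rate * stress_factor for task, rate in max_arrival_rates.items()}


def percent_decrease(initial_energy, final_energy):
    """Relative decrease in total energy from the initial solution, in percent."""
    return 100.0 * (initial_energy - final_energy) / initial_energy


def mean_percent_decrease(trials):
    """Average the percent decrease over a list of (initial, final) energy pairs."""
    decreases = [percent_decrease(initial, final) for initial, final in trials]
    return sum(decreases) / len(decreases)
```

Averaging the per-trial percentages, rather than the raw energies, keeps each trial's result relative to its own random starting point.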
While this simulation is intended to be a static mapping, it is important to consider how long it takes for the mapping to complete. The mapping times are very small except for large MCSOCs with a large number of iterations of the SA algorithm.