Energy Aware Task Allocation on Heterogeneous Multi-Core SOC (EA-TA-HMCS

Eric Jonardi

Problem
• upper bound on the performance of single-core processors
• future embedded systems must be faster while consuming less
energy
• reducing energy consumption usually results in reduced
performance

• Solution: multiple processors per die
• multi-core system on chip (MCSOC).
• combining processors with different architectures
• heterogeneity creates opportunity for optimization

• highly effective for large scale data centers
• task mapping grows increasingly complex
• reliable and fast task mapping is needed

Project Goal
• develop a static/offline method of assigning incoming tasks
(also known as mapping) to the various cores of a heterogeneous
MCSOC
• the mapping will minimize the energy consumed to fully
execute a workload, such that all tasks are executed

Overview
• MCSOC model
• Simulated annealing algorithm
• ARM A7 & A15 Architecture
• Gem5 task simulations
• mapping algorithm setup
• Results

Simulated Annealing
• task mapping is NP-complete
• simulated annealing is an iterative search heuristic
• allows escape from local minima
• solution not ideal, but is “good enough”

Simulated Annealing
• iterative search heuristic
• allows escape from local minima
• primary variables
• Initial temperature
• Cooling rate
• Parameter mutation rate
• P-state mutation rate
• Task flow mutation rate

P-states for A7 & A15
[Figure: Normalized P-states. Power (0 to 1) plotted against Performance (0 to 1) for the A15 and A7.]

Gem5 Simulated Tasks
[Figure: Runtime (seconds, 0 to 0.016) plotted against Clock Speed (400 to 1,000 MHz) for the fft, ocean, oceanNC, fft2, radix, and lu tasks.]

Mapping Algorithm Setup
• Global arrival rate for each task type
• Each core:
• p-state
• task flow rate for each task type

• Global execution rate for each task type
• Simulated annealing loop
• Fitness function (evaluated energy)
• Solution repair function

Results
[Figure: four panels, Energy Decrease at stress = 0.1, 0.4, 0.6, and 0.8. Each panel plots Avg % Decrease in Energy against # Iterations (1,000 to 1,000,000) for MCSOC sizes 3x3, 4x4, 5x5, 6x6, and 10x10.]

Simulation Runtimes
Size     # Iterations    Avg Runtime (s)
3x3      1,000           <1
3x3      10,000          <1
3x3      100,000         <1
3x3      1,000,000       14
4x4      1,000           <1
4x4      10,000          <1
4x4      100,000         2
4x4      1,000,000       20
5x5      1,000           <1
5x5      10,000          <1
5x5      100,000         4
5x5      1,000,000       36
6x6      1,000           <1
6x6      10,000          <1
6x6      100,000         5
6x6      1,000,000       48
10x10    1,000           <1
10x10    10,000          2
10x10    100,000         25
10x10    1,000,000       244

Conclusion
• Heterogeneity in MCSOCs creates opportunities for
optimization
• Simulated annealing is an effective optimization heuristic
• Proper mapping of workloads in heterogeneous MCSOCs can
greatly reduce total energy consumption when compared to a
non-energy aware mapping methodology

1. Energy Aware Task Allocation on a Large Scale Heterogeneous Multi-Core SOC (Eric Jonardi)
2. Problem
3. Project Goal
4. Overview
5. MCSOC Device Model
6. Simulated Annealing
7. Simulated Annealing
8. Simulated Annealing
9. ARM A7 & A15 Architectures
10. P-states for A7 & A15
11. Gem5 Simulated Tasks
12. ECS Matrix
13. Mapping Algorithm Setup
14. Running the simulations
15. Results
16. Simulation Runtimes
17. Conclusion

Editor's Notes

Certain tasks will run faster on one architecture than on another. This reduction in runtime lowers the amount of energy required to execute the task (energy is the integral of power over time, so less time means less energy at constant power consumption). Properly matching tasks to machines can reduce total system power.
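The energy argument in this note can be illustrated with a short sketch. The power and runtime values are hypothetical placeholders rather than figures from the slides, and the function name is only illustrative.

```python
def task_energy(power_watts: float, runtime_s: float) -> float:
    """Energy in joules for a task run at constant power: E = P * t."""
    return power_watts * runtime_s


# Hypothetical numbers: the same task at the same constant power,
# finishing sooner on the better-matched architecture.
slow = task_energy(power_watts=1.0, runtime_s=0.010)
fast = task_energy(power_watts=1.0, runtime_s=0.004)

# Relative energy saving from the shorter runtime, in percent.
saving_pct = 100.0 * (slow - fast) / slow
```

At constant power the saving depends only on the runtime ratio, which is why matching each task to the architecture that runs it fastest matters.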
Square grid. The MCSOC is assumed to have two types of processor cores: high efficiency and high performance. There is an equal number of each core type (N²/2) where possible. For cases with an odd number of total cores
Simulated annealing is an iterative heuristic search algorithm that mimics the formation of structures in metals during cooling. The nature of the structures is a function of the rate of cooling. Faster cooling results in more irregular structures (i.e., higher total energy); slower cooling results in more regular structures (i.e., lower total energy).
A neighbor is generated by modifying some of the parameters of the current solution.
The proposed solution is accepted if z > y. The smaller the change in energy (fitness function) and the higher the temperature, the greater the chance of accepting the proposed solution.
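A minimal sketch of the annealing loop described in these notes. The text does not define z and y, so the sketch assumes the common Metropolis form: z = exp(-ΔE / T) and y is a uniform random draw in [0, 1). The toy energy function, the neighbor move, and all parameter values are hypothetical and are not the project's fitness function or settings.

```python
import math
import random


def anneal(initial, energy, neighbor, t0=10.0, cooling=0.95, iters=1000, seed=0):
    """Generic simulated annealing; returns the best solution seen and its energy."""
    rng = random.Random(seed)
    current, e_cur = initial, energy(initial)
    best, e_best = current, e_cur
    temp = t0
    for _ in range(iters):
        cand = neighbor(current, rng)  # modify some parameters of the current solution
        e_cand = energy(cand)
        delta = e_cand - e_cur
        if delta <= 0:
            accept = True  # never reject an improvement
        else:
            z = math.exp(-delta / temp)  # assumed acceptance probability
            y = rng.random()             # assumed uniform draw in [0, 1)
            accept = z > y               # worse moves pass more often when hot
        if accept:
            current, e_cur = cand, e_cand
        if e_cur < e_best:
            best, e_best = current, e_cur
        temp = max(temp * cooling, 1e-12)  # geometric cooling with a small floor
    return best, e_best


# Toy problem (hypothetical): minimize (x - 3)^2 starting far from the minimum.
toy_energy = lambda x: (x - 3.0) ** 2
toy_neighbor = lambda x, rng: x + rng.uniform(-1.0, 1.0)
best_x, best_e = anneal(20.0, toy_energy, toy_neighbor)
```

Accepting some worse neighbors while the temperature is high is what lets the search leave a local minimum; as the temperature falls, the loop behaves more like plain descent.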
Two architectures were chosen to create a heterogeneous computing environment. The A7 is high efficiency, while the A15, with its much more complex pipeline, is higher performance but much higher power.
The A7 and A15 each have four p-states. Power and performance are normalized for simplicity. Relative power is used for p-state power in the simulation (code snippet shown). The relative performance of each p-state is used in the Gem5 workload simulations to build the ECS matrix.
Synthetically generated workload. Five standard benchmarks (FFT, ocean with contiguous partitions, ocean with non-contiguous partitions, radix, LU) were simulated on the ARM architecture included in Gem5. Four different clock speeds (1 GHz, 866 MHz, 650 MHz, 434 MHz) were taken from the real-world ARM p-states shown in the previous slide. The FFT benchmark was run twice with two different problem sizes. While the simulated workloads are not a comprehensive survey of all possible tasks for an embedded system, they vary sufficiently in runtime and computational intensity for the purposes of this investigation.
The runtimes were used to generate the ECS matrix for all task/core/p-state combinations. The actual ECS matrix from the code is shown. ECS is the inverse of runtime.
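As a sketch of the relationship stated here, an ECS table can be derived from a runtime table by taking reciprocals. The runtimes below are hypothetical placeholders, not the Gem5 results, and the dictionary layout keyed by (task, core type, p-state) is an assumption made for illustration.

```python
# Hypothetical runtimes in seconds, keyed by (task, core type, p-state).
runtimes = {
    ("fft", "A15", 3): 0.004,
    ("fft", "A7", 3): 0.008,
    ("lu", "A15", 3): 0.002,
}


def build_ecs(runtime_table):
    """ECS is the inverse of runtime, i.e. tasks completed per second."""
    return {key: 1.0 / seconds for key, seconds in runtime_table.items()}


ecs = build_ecs(runtimes)
```

A shorter runtime gives a larger ECS entry, so the faster task/core/p-state combinations stand out directly in the matrix.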
Tasks are assigned to cores as flow rates rather than as individual tasks. A flow rate is the fraction of time that the core spends executing that task. The execution rate for a task type on a core in a given p-state is the product of the flow rate and the ECS for that task/core/p-state. The global execution rate for each task is the sum of the individual execution rates on each core. Energy is calculated by summing the energy of each core; the energy of each core is a function of its core type and its current p-state. The relative energy for each p-state on each core type was obtained from the previously mentioned ARM whitepaper. A generated solution might not be valid, so the repair function randomly increases p-states until all task types are fully executed.
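The bookkeeping in this note can be sketched as follows. Everything concrete here is an assumption for illustration: the data layout, the ECS and relative power values, the arrival rate, and the convention that a higher p-state index means higher performance. Per-core energy is modeled as a plain lookup by core type and p-state.

```python
import random

MAX_PSTATE = 3  # four p-states per core, indexed 0..3 (higher = faster, assumed)


def global_execution_rates(cores, ecs):
    """Sum flow rate * ECS over all cores, per task type."""
    rates = {}
    for core in cores:
        for task, flow in core["flows"].items():
            rate = flow * ecs[(task, core["type"], core["pstate"])]
            rates[task] = rates.get(task, 0.0) + rate
    return rates


def is_valid(cores, ecs, arrival):
    """A mapping is valid when every task type executes at least as fast as it arrives."""
    rates = global_execution_rates(cores, ecs)
    return all(rates.get(task, 0.0) >= need for task, need in arrival.items())


def total_energy(cores, rel_power):
    """Fitness: sum of per-core energy, looked up by core type and p-state."""
    return sum(rel_power[(c["type"], c["pstate"])] for c in cores)


def repair(cores, ecs, arrival, rng):
    """Randomly raise p-states until all task types are fully executed."""
    while not is_valid(cores, ecs, arrival):
        candidates = [c for c in cores if c["pstate"] < MAX_PSTATE]
        if not candidates:
            raise ValueError("workload cannot be met even at the highest p-states")
        rng.choice(candidates)["pstate"] += 1
    return cores


# Hypothetical tables for one task type on the two core types.
ecs = {("fft", "A7", p): 50.0 + 25.0 * p for p in range(4)}
ecs.update({("fft", "A15", p): 100.0 + 50.0 * p for p in range(4)})
rel_power = {("A7", p): 0.1 * (p + 1) for p in range(4)}
rel_power.update({("A15", p): 0.25 * (p + 1) for p in range(4)})

cores = [
    {"type": "A7", "pstate": 0, "flows": {"fft": 1.0}},
    {"type": "A15", "pstate": 0, "flows": {"fft": 1.0}},
]
arrival = {"fft": 300.0}

before = global_execution_rates(cores, ecs)["fft"]
repair(cores, ecs, arrival, random.Random(0))
after = global_execution_rates(cores, ecs)["fft"]
energy_after = total_energy(cores, rel_power)
```

In this toy case the initial mapping executes 150 tasks per second against an arrival rate of 300, so the repair step raises p-states until the mapping becomes valid. The fitness function would then score the repaired mapping by its total energy.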
It took several hours to collect the necessary number of trials for all data points.
Five different MCSOC sizes were considered. Each configuration was simulated five times for each of the four iteration limits (1,000, 10,000, 100,000, and 1,000,000 iterations) of the SA algorithm. The percent decrease is calculated as the relative decrease in total device energy from a randomly generated initial solution. The five trials are averaged to accurately represent performance, because the initial solution is random. The stress factor is the workload expressed as a fraction of the maximum workload that the device can support. For example, a stress factor of 0.8 means that the workload is 80% of the maximum. Higher stress factors allow less opportunity for optimization, as more of the device resources are utilized, limiting the number of available allocation options. During simulations, four stress factors (0.1, 0.4, 0.6, and 0.8) were tested to simulate a full spectrum of MCSOC workload conditions. A stress factor of 1 was not simulated, as this would mean that the entire device was fully utilized and therefore there would be no opportunity for optimization.
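The reported metric and the stress factor can be sketched in a few lines. The energy pairs and the maximum workload below are hypothetical placeholders, not results from the simulations, and the function names are illustrative only.

```python
def percent_decrease(initial_energy, final_energy):
    """Relative decrease in total device energy from the initial solution, in percent."""
    return 100.0 * (initial_energy - final_energy) / initial_energy


def average_percent_decrease(trials):
    """Average the per-trial decrease over (initial, final) energy pairs."""
    decreases = [percent_decrease(e0, e1) for e0, e1 in trials]
    return sum(decreases) / len(decreases)


def workload_for_stress(max_workload, stress):
    """Workload implied by a stress factor, e.g. 0.8 means 80% of the maximum."""
    return max_workload * stress


# Five hypothetical trials, each starting from a different random initial solution.
trials = [(100.0, 60.0), (120.0, 90.0), (80.0, 40.0), (100.0, 75.0), (90.0, 45.0)]
avg = average_percent_decrease(trials)
load = workload_for_stress(1000.0, 0.8)
```

Because each trial starts from a different random solution, the average is taken over per-trial percentages rather than over raw energies.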
While this simulation is intended to be a static mapping, it is important to consider how long the mapping takes to complete. The mapping times are very small except for large MCSOCs with a large number of iterations of the SA algorithm.
