2. Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant,
Karthikeyan Sankaralingam, Doug Burger
Number of citations = 1622
Dark Silicon and the
End of Multicore
Scaling (2011)
2
3. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
3
4. Summary
Modelling of multicore scaling limits for the next five
technology generations, by combining:
● Device Scaling
● Single-Core Scaling
● Multicore Scaling
4
5. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
5
6. Motivation
With the failure of Dennard scaling, core count scaling may
be in jeopardy, which would leave the community with no
clear scaling path to exploit continued transistor count
increases.
How good will multicore performance be in the long term?
In 2024, will processors have 32 times the performance of
processors from 2008, exploiting five generations of core
doubling?
6
7. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
7
8. Device Scaling Model
Provides:
Frequency, area, and power scaling factors at technology
nodes from 45 nm to 8 nm. Both ITRS Roadmap projections
and conservative scaling parameters are considered.
8
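The per-node factors compound across generations; a minimal sketch of how such a model applies them (the factor values are illustrative placeholders, not the paper's ITRS or conservative numbers):

```python
# Sketch: compound per-generation scaling factors across technology nodes.
# Factor values here are placeholders, NOT the numbers from the paper.
NODES = ["45nm", "32nm", "22nm", "16nm", "11nm", "8nm"]

def compound(per_gen_factors):
    """Turn per-generation factors into cumulative factors relative to 45 nm."""
    cumulative, acc = [1.0], 1.0
    for f in per_gen_factors:
        acc *= f
        cumulative.append(acc)
    return cumulative

# e.g. a (hypothetical) 1.1x frequency gain per generation:
freq = compound([1.1] * (len(NODES) - 1))
```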
10. Multicore Scaling Model
Models:
Multicore CPUs and many-thread GPUs, which are two
extreme points in the threads-per-core spectrum. Both
models use the A(q) and P(q) frontiers from the
core-scaling model.
Topologies:
Symmetric, Asymmetric, Dynamic, Composed
10
11. Device x Core x CMP Scaling
11
The process for choosing the optimal core configuration for
the symmetric topology at a given technology node:
● The area/performance frontier is investigated, and all
processor design points along the frontier are considered.
● For each area/performance design point, the multicore is
constructed starting with a single core. One core is added
per iteration, and the new speedup and power consumption
are computed using the power/performance Pareto frontier.
● After some number of iterations, the area limit or the
power wall is hit, or performance starts to degrade. At this
point the optimal speedup and the optimal number of cores
are found. The fraction of dark silicon can then be computed
by subtracting the area occupied by these cores from the
total die area allocated to processor cores.
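The iterative search above can be sketched in Python. This is a simplified stand-in (Amdahl-style speedup with a parallel fraction f, made-up inputs), not the paper's full frontier machinery:

```python
def best_symmetric(die_area, tdp, pareto_cores, f):
    """Simplified sketch of the symmetric-topology search.
    pareto_cores: (area, power, perf) core design points on the frontier.
    f: parallel fraction of the workload (Amdahl's law)."""
    best_speedup, best_n, dark_fraction = 0.0, 0, 0.0
    for area, power, perf in pareto_cores:
        # add one core per iteration until the area or power budget is hit
        n_max = min(int(die_area // area), int(tdp // power))
        for n in range(1, n_max + 1):
            # serial part runs on one core, parallel part on all n cores
            speedup = 1.0 / ((1.0 - f) / perf + f / (n * perf))
            if speedup > best_speedup:
                best_speedup, best_n = speedup, n
                # dark silicon: die area not occupied by the chosen cores
                dark_fraction = (die_area - n * area) / die_area
    return best_speedup, best_n, dark_fraction
```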
12. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
12
13. Findings
Using PARSEC benchmarks and ITRS
scaling projections, this study predicts a
best-case average speedup of 7.9x
between now and 2024 at 8 nm.
When conservative scaling projections are
applied, half of that ideal gain vanishes:
the path to 8 nm in 2018 results in a
best-case average speedup of 3.7x.
13
14. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
14
15. Limitations
● SMT (Simultaneous Multithreading) support was
not considered.
● The power impact of “uncore” components such
as the memory subsystem was ignored.
● ARM or Tilera cores were not considered
because they are designed for different
application domains.
● Validation against real and simulated systems
shows the model underpredicts performance for
two benchmarks.
15
16. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
16
17. Conclusion
“If multicore scaling ceases
to be the primary driver of
performance gains at
16nm (in 2014), the
“multicore era” will have
lasted a mere nine years.”
17
19. M. Aater Suleman, Onur Mutlu, Jose A. Joao,
Khubaib, Yale N. Patt
Number of citations = 58
Data Marshaling for
Multicore
Architectures (2010)
19
20. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
20
21. Applying the Data
Marshaling Concept
to Multicore
Architectures
To improve the performance of Staged
Execution (SE) models:
● Accelerated Critical Sections
● Producer-Consumer pipeline parallelism
on both homogeneous and heterogeneous
multicore systems.
21
22. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
22
23. Motivation
Previous research has shown that Staged Execution, i.e.,
dividing a program into segments and executing each
segment at the core that has the data and/or functionality to
best run that segment, can improve performance and save
power.
BUT SE’s benefit is limited because most segments access
data generated by the previous segment, which causes
cache misses.
Can we apply the data marshaling concept from network
programming to eliminate these cache misses?
23
24. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
24
25. Staged Execution Model
Goal
Speed up a program by dividing it into segments
and running each segment on the core best suited
to run it.
Examples
● Accelerated critical sections [Suleman et al.,
ASPLOS 2009]
● Producer-consumer pipeline parallelism
● Task parallelism (Cilk, Intel TBB, Apple Grand
Central Dispatch)
● Special-purpose cores and functional units
Benefits
● Accelerates segments/critical paths using
specialized/heterogeneous cores.
● Exploits inter-segment parallelism.
● Improves locality of within-segment data.
25
27. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
27
28. The Problem
Accelerated Critical Sections
Idea: Ship critical sections to a large core.
Problem: The critical section incurs a cache miss when it
touches data produced in the non-critical section (i.e.,
thread-private data).
Producer-Consumer Pipeline
Idea: Split a loop iteration into multiple “pipeline stages”;
each stage runs on a different core.
Problem: A stage incurs a cache miss when it touches
data produced by the previous stage.
28
29. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
29
30. Data Marshaling
30
Observation
The set of generator instructions is stable over execution
time and across input sets.
Idea
● Identify the generator instructions.
● Record cache blocks produced by generator
instructions.
● Proactively send such cache blocks to the next
segment’s core before initiating the next segment.
33. Data Marshaling Support/Cost
33
Support
● Profiler/Compiler: Generators, marshal instructions.
● ISA: Generator prefix, marshal instructions
● Library/Hardware: Bind next segment ID to a
physical core.
Hardware
● Marshal Buffer
○ Stores physical addresses of cache blocks to be
marshaled.
○ 16 entries are enough for almost all workloads
(96 bytes per core).
● Ability to execute generator prefixes and marshal
instructions.
● Ability to push data to another cache.
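A small software model of the mechanism may help; this is a hypothetical Python sketch (the real marshal buffer is hardware, and the class and method names here are made up):

```python
class MarshalBuffer:
    """Toy model of a per-core marshal buffer (in hardware: 16 entries
    of physical cache-block addresses, about 96 bytes per core)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.blocks = []

    def record(self, block_addr):
        # a generator-prefixed store records its cache block's address;
        # drop silently when full (DM is best-effort, not for correctness)
        if block_addr not in self.blocks and len(self.blocks) < self.capacity:
            self.blocks.append(block_addr)

    def marshal(self, target_cache):
        # the marshal instruction pushes the recorded blocks to the cache
        # of the core running the next segment, then clears the buffer
        sent, self.blocks = self.blocks, []
        target_cache.update(sent)
        return sent
```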
34. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
34
35. Accelerated Critical
Sections
Ship critical sections to a large core in an asymmetric
CMP.
Idea
Faster execution of critical sections, reduced
serialization, and improved lock and shared-data locality.
Benefit
35
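The idea can be sketched in software: a dedicated server thread stands in for the large core, and worker threads ship their critical sections to it instead of running them locally. This is an illustration only, not the paper's hardware CSCALL/CSRET mechanism:

```python
import queue
import threading

class CriticalSectionServer:
    """Ship critical sections to one 'large core' (here: a server thread),
    so the lock and shared data stay resident in a single cache."""

    def __init__(self):
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._serve)
        self.worker.start()

    def _serve(self):
        while True:
            fn, done = self.requests.get()
            if fn is None:      # shutdown sentinel
                return
            fn()                # run the shipped critical section
            done.set()          # resume the requesting 'small core'

    def execute(self, fn):
        done = threading.Event()
        self.requests.put((fn, done))
        done.wait()

    def shutdown(self):
        self.requests.put((None, None))
        self.worker.join()
```

Because a single thread executes every critical section, no lock is needed and serialization is limited to the server's request queue.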
36. Methodology
36
12 critical-section-intensive applications.
● Data mining kernels, sorting, database, web,
networking
● Different training and simulation input sets
Workloads
Multi-core x86 simulator.
● 1 large and 28 small cores
● Aggressive stream prefetcher employed
at each core.
Simulator
● Large core: 2GHz, out-of-order, 128-entry
ROB, 4-wide, 12-stage
● Small core: 2GHz, in-order, 2-wide, 5-stage
● Private 32 KB L1, private 256KB L2, 8MB
shared L3
● On-chip interconnect: Bi-directional ring,
5-cycle hop latency
Details
38. Producer-Consumer
Pipeline
Idea
Split a loop iteration into multiple “pipeline stages”,
where each stage consumes data produced by the
previous stage and each stage runs on a different core.
Benefit
Stage-level parallelism, better locality, faster execution.
38
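A minimal sketch of this model in Python, with threads standing in for cores and FIFO queues carrying inter-stage data (illustrative only; the function and names are made up):

```python
import queue
import threading

def run_pipeline(items, stages):
    """Each stage runs on its own thread (standing in for a core) and
    consumes what the previous stage produced, via a FIFO queue."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def stage_loop(fn, qin, qout):
        while True:
            item = qin.get()
            if item is None:        # end-of-stream marker
                qout.put(None)
                return
            qout.put(fn(item))

    threads = [threading.Thread(target=stage_loop, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for item in items:
        qs[0].put(item)
    qs[0].put(None)

    results = []
    while (out := qs[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Every queue hand-off between stages is exactly the inter-segment data transfer that Data Marshaling targets.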
39. Methodology
39
9 applications with pipeline parallelism.
● Financial, compression, multimedia,
encoding/decoding.
● Different training and simulation input sets
Workloads
Multi-core x86 simulator.
● 32-core CMP: 2GHz, in-order, 2-wide,
5-stage
● Aggressive stream prefetcher employed
at each core.
● Private 32 KB L1, private 256KB L2, 8MB
shared L3.
● On-chip interconnect: Bi-directional ring,
5-cycle hop latency
Simulator
41. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
41
42. Conclusion
42
● Inter-segment data transfers between cores limit
the benefit of promising Staged Execution (SE)
models.
● Data Marshaling is a hardware/software
cooperative solution: detect inter-segment data
generator instructions and push their data to the
next segment’s core.
○ Significantly reduces cache misses for
inter-segment data.
○ Low-cost, high-coverage, and timely for
arbitrary address sequences.
○ Achieves most of the potential of eliminating
such misses.
● Applicable to several existing Staged Execution
models.
○ Accelerated Critical Sections: 9% performance
benefit.
○ Pipeline Parallelism: 16% performance benefit.
● Can enable new models, e.g., very fine-grained
remote execution.