2. Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant,
Karthikeyan Sankaralingam, Doug Burger
Number of citations = 1622
Dark Silicon and the
End of Multicore
Scaling (2011)
2
3. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
3
4. Summary
Modelling of multicore scaling limits for the next five
technology generations, by combining:
● Device Scaling
● Single-Core Scaling
● Multicore Scaling
4
5. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
5
6. Motivation
With the failure of Dennard scaling, core count scaling may
be in jeopardy, which would leave the community with no
clear scaling path to exploit continued transistor count
increases.
How good will multicore performance be in the long term?
In 2024, will processors have 32 times the performance of
processors from 2008, exploiting five generations of core
doubling?
6
7. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
7
8. Device Scaling Model
Provides:
Frequency, area, and power scaling factors at technology
nodes from 45 nm to 8 nm. Both ITRS Roadmap projections
and conservative scaling parameters are considered.
8
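The per-node factors compound across generations; a minimal sketch of how such a model applies them (the factor values are illustrative placeholders, not the paper's ITRS or conservative numbers):

```python
# Sketch: compound per-generation scaling factors across technology nodes.
# Factor values here are placeholders, NOT the numbers from the paper.
NODES = ["45nm", "32nm", "22nm", "16nm", "11nm", "8nm"]

def compound(per_gen_factors):
    """Turn per-generation factors into cumulative factors relative to 45 nm."""
    cumulative, acc = [1.0], 1.0
    for f in per_gen_factors:
        acc *= f
        cumulative.append(acc)
    return cumulative

# e.g. a (hypothetical) 1.1x frequency gain per generation:
freq = compound([1.1] * (len(NODES) - 1))
```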
10. Multicore Scaling Model
Models:
Multicore CPUs and many-thread GPUs, which are two
extreme points in the threads-per-core spectrum. Both
models use the A(q) and P(q) frontiers from the
core-scaling model.
Topologies:
Symmetric, Asymmetric, Dynamic, Composed
10
11. Device x Core x CMP Scaling
11
The process for choosing the optimal core configuration for
the symmetric topology at a given technology node:
● The area/performance frontier is investigated, and all
processor design points along the frontier are considered.
● For each area/performance design point, the multicore is
constructed starting with a single core. One core is added
per iteration, and the new speedup and power consumption
are computed using the power/performance Pareto frontier.
● After some number of iterations, the area limit or the
power wall is hit, or performance starts to degrade. At this
point the optimal speedup and the optimal number of cores
are found. The fraction of dark silicon can then be computed
by subtracting the area occupied by these cores from the
total die area allocated to processor cores.
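The iterative search above can be sketched in Python. This is a simplified stand-in (Amdahl-style speedup with a parallel fraction f, made-up inputs), not the paper's full frontier machinery:

```python
def best_symmetric(die_area, tdp, pareto_cores, f):
    """Simplified sketch of the symmetric-topology search.
    pareto_cores: (area, power, perf) core design points on the frontier.
    f: parallel fraction of the workload (Amdahl's law)."""
    best_speedup, best_n, dark_fraction = 0.0, 0, 0.0
    for area, power, perf in pareto_cores:
        # add one core per iteration until the area or power budget is hit
        n_max = min(int(die_area // area), int(tdp // power))
        for n in range(1, n_max + 1):
            # serial part runs on one core, parallel part on all n cores
            speedup = 1.0 / ((1.0 - f) / perf + f / (n * perf))
            if speedup > best_speedup:
                best_speedup, best_n = speedup, n
                # dark silicon: die area not occupied by the chosen cores
                dark_fraction = (die_area - n * area) / die_area
    return best_speedup, best_n, dark_fraction
```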
12. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
12
13. Findings
Using PARSEC benchmarks and ITRS
scaling projections, this study predicts a
best-case average speedup of 7.9x
between now and 2024 at 8 nm.
When conservative scaling projections are
applied, half of that ideal gain vanishes:
the path to 8 nm in 2018 results in a
best-case average speedup of 3.7x.
13
14. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
14
15. Limitations
● SMT (Simultaneous Multithreading) support was
not considered.
● The power impact of “uncore” components such
as the memory subsystem was ignored.
● ARM or Tilera cores were not considered
because they are designed for different
application domains.
● Validation against real and simulated systems
shows the model underpredicts performance for
two benchmarks.
15
16. Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
16
17. Conclusion
“If multicore scaling ceases
to be the primary driver of
performance gains at
16nm (in 2014), the
“multicore era” will have
lasted a mere nine years.”
17
19. M. Aater Suleman, Onur Mutlu, Jose A. Joao,
Khubaib, Yale N. Patt
Number of citations = 58
Data Marshaling for
Multicore
Architectures (2010)
19
20. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
20
21. Applying the Data
Marshaling Concept
to Multicore
Architectures
To improve the performance of Staged
Execution (SE) models:
● Accelerated Critical Sections
● Producer-Consumer pipeline parallelism
on both homogeneous and heterogeneous
multicore systems.
21
22. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
22
23. Motivation
Previous research has shown that Staged Execution, i.e.,
dividing a program into segments and executing each
segment at the core that has the data and/or functionality to
best run that segment, can improve performance and save
power.
BUT SE’s benefit is limited because most segments access
data generated by the previous segment, which causes
cache misses.
Can we apply the data marshaling concept from network
programming to eliminate these cache misses?
23
24. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
24
25. Staged Execution Model
Goal
Speed up a program by dividing it into segments
and running each segment on the core best suited
to run it.
Examples
● Accelerated critical sections [Suleman et al.,
ASPLOS 2009]
● Producer-consumer pipeline parallelism
● Task parallelism (Cilk, Intel TBB, Apple Grand
Central Dispatch)
● Special-purpose cores and functional units
Benefits
● Accelerates segments/critical paths using
specialized/heterogeneous cores.
● Exploits inter-segment parallelism.
● Improves locality of within-segment data.
25
27. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
27
28. The Problem
Accelerated Critical Sections
Idea: Ship critical sections to a large core.
Problem: The critical section incurs a cache miss when it
touches data produced in the non-critical section (i.e.,
thread-private data).
Producer-Consumer Pipeline
Idea: Split a loop iteration into multiple “pipeline stages”;
each stage runs on a different core.
Problem: A stage incurs a cache miss when it touches
data produced by the previous stage.
28
29. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
29
30. Data Marshaling
30
Observation
The set of generator instructions is stable over execution
time and across input sets.
Idea
● Identify the generator instructions.
● Record cache blocks produced by generator
instructions.
● Proactively send such cache blocks to the next
segment’s core before initiating the next segment.
33. Data Marshaling Support/Cost
33
Support
● Profiler/Compiler: Generators, marshal instructions.
● ISA: Generator prefix, marshal instructions
● Library/Hardware: Bind next segment ID to a
physical core.
Hardware
● Marshal Buffer
○ Stores physical addresses of cache blocks to be
marshaled.
○ 16 entries are enough for almost all workloads
(96 bytes per core).
● Ability to execute generator prefixes and marshal
instructions.
● Ability to push data to another cache.
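A small software model of the mechanism may help; this is a hypothetical Python sketch (the real marshal buffer is hardware, and the class and method names here are made up):

```python
class MarshalBuffer:
    """Toy model of a per-core marshal buffer (in hardware: 16 entries
    of physical cache-block addresses, about 96 bytes per core)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.blocks = []

    def record(self, block_addr):
        # a generator-prefixed store records its cache block's address;
        # drop silently when full (DM is best-effort, not for correctness)
        if block_addr not in self.blocks and len(self.blocks) < self.capacity:
            self.blocks.append(block_addr)

    def marshal(self, target_cache):
        # the marshal instruction pushes the recorded blocks to the cache
        # of the core running the next segment, then clears the buffer
        sent, self.blocks = self.blocks, []
        target_cache.update(sent)
        return sent
```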
34. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
34
35. Accelerated Critical
Sections
Ship critical sections to a large core in an asymmetric
CMP.
Idea
Faster execution of critical sections, reduced
serialization, and improved lock and shared-data locality.
Benefit
35
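The idea can be sketched in software: a dedicated server thread stands in for the large core, and worker threads ship their critical sections to it instead of running them locally. This is an illustration only, not the paper's hardware CSCALL/CSRET mechanism:

```python
import queue
import threading

class CriticalSectionServer:
    """Ship critical sections to one 'large core' (here: a server thread),
    so the lock and shared data stay resident in a single cache."""

    def __init__(self):
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._serve)
        self.worker.start()

    def _serve(self):
        while True:
            fn, done = self.requests.get()
            if fn is None:      # shutdown sentinel
                return
            fn()                # run the shipped critical section
            done.set()          # resume the requesting 'small core'

    def execute(self, fn):
        done = threading.Event()
        self.requests.put((fn, done))
        done.wait()

    def shutdown(self):
        self.requests.put((None, None))
        self.worker.join()
```

Because a single thread executes every critical section, no lock is needed and serialization is limited to the server's request queue.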
36. Methodology
36
12 critical-section-intensive applications.
● Data mining kernels, sorting, database, web,
networking
● Different training and simulation input sets
Workloads
Multi-core x86 simulator.
● 1 large and 28 small cores
● Aggressive stream prefetcher employed
at each core.
Simulator
● Large core: 2GHz, out-of-order, 128-entry
ROB, 4-wide, 12-stage
● Small core: 2GHz, in-order, 2-wide, 5-stage
● Private 32 KB L1, private 256KB L2, 8MB
shared L3
● On-chip interconnect: Bi-directional ring,
5-cycle hop latency
Details
38. Producer-Consumer
Pipeline
Idea
Split a loop iteration into multiple “pipeline stages”,
where each stage consumes data produced by the
previous stage and each stage runs on a different core.
Benefit
Stage-level parallelism, better locality, faster execution.
38
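A minimal sketch of this model in Python, with threads standing in for cores and FIFO queues carrying inter-stage data (illustrative only; the function and names are made up):

```python
import queue
import threading

def run_pipeline(items, stages):
    """Each stage runs on its own thread (standing in for a core) and
    consumes what the previous stage produced, via a FIFO queue."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def stage_loop(fn, qin, qout):
        while True:
            item = qin.get()
            if item is None:        # end-of-stream marker
                qout.put(None)
                return
            qout.put(fn(item))

    threads = [threading.Thread(target=stage_loop, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for item in items:
        qs[0].put(item)
    qs[0].put(None)

    results = []
    while (out := qs[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Every queue hand-off between stages is exactly the inter-segment data transfer that Data Marshaling targets.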
39. Methodology
39
9 applications with pipeline parallelism.
● Financial, compression, multimedia,
encoding/decoding.
● Different training and simulation input sets
Workloads
Multi-core x86 simulator.
● 32-core CMP: 2GHz, in-order, 2-wide,
5-stage
● Aggressive stream prefetcher employed
at each core.
● Private 32 KB L1, private 256KB L2, 8MB
shared L3.
● On-chip interconnect: Bi-directional ring,
5-cycle hop latency
Simulator
41. Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
41
42. Conclusion
42
● Inter-segment data transfers between cores limit
the benefit of promising Staged Execution (SE)
models.
● Data Marshaling is a hardware/software
cooperative solution: detect inter-segment data
generator instructions and push their data to the
next segment’s core.
○ Significantly reduces cache misses for
inter-segment data.
○ Low-cost, high-coverage, and timely for
arbitrary address sequences.
○ Achieves most of the potential of eliminating
such misses.
● Applicable to several existing Staged Execution
models.
○ Accelerated Critical Sections: 9% performance
benefit.
○ Pipeline Parallelism: 16% performance benefit.
● Can enable new models, e.g., very fine-grained
remote execution.