Multicore Architectures
Muhammet Abdullah Soytürk
1
Dark Silicon and the End of Multicore Scaling (2011)
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, Doug Burger
Number of citations = 1622
2
Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
3
Summary
Modelling of multicore scaling limits for the next five technology generations, by combining:
● Device Scaling
● Single-Core Scaling
● Multicore Scaling
4
Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
5
Motivation
With the failure of Dennard scaling, core count scaling may
be in jeopardy, which would leave the community with no
clear scaling path to exploit continued transistor count
increases.
How good will multicore performance be in the long term?
In 2024, will processors have 32 times the performance of
processors from 2008, exploiting five generations of core
doubling?
6
Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
7
Device Scaling Model
Provides: frequency, area, and power scaling factors at technology nodes from 45 nm to 8 nm. Both ITRS Roadmap projections and conservative scaling parameters are considered.
8
Single-core Scaling Model
Provides:
● The maximum performance that a single core can sustain for any given area.
● The minimum power that must be consumed to sustain this level of performance.
9
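The paper builds these frontiers by fitting Pareto curves to a large pool of real processor design points. As a rough illustration of the frontier idea only (hypothetical numbers, not the paper's fitting procedure), a minimal Python sketch:

```python
# Minimal sketch: extract an area/performance Pareto frontier from
# hypothetical (area_mm2, performance) design points. A point is on
# the frontier if no other point delivers more performance in
# less-or-equal area.

def pareto_frontier(points):
    """Return the design points not dominated by any other point."""
    frontier = []
    for area, perf in points:
        dominated = any(a <= area and p > perf for a, p in points)
        if not dominated:
            frontier.append((area, perf))
    return sorted(frontier)

# Illustrative design points only.
designs = [(5, 1.0), (10, 1.6), (10, 1.2), (20, 2.1), (30, 2.2)]
print(pareto_frontier(designs))  # -> [(5, 1.0), (10, 1.6), (20, 2.1), (30, 2.2)]
```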
Multicore Scaling Model
Models: multicore CPUs and many-thread GPUs, which are two extreme points in the threads-per-core spectrum. Both models use the A(q) and P(q) frontiers from the core-scaling model.
Topologies: symmetric, asymmetric, dynamic, and composed.
10
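For orientation, the speedup calculations in such models are Amdahl's-law style. A simplified form for the symmetric CPU case (a paraphrase that omits the paper's memory-bandwidth and cache terms), where f is the parallel fraction of the workload, N the number of cores that fit within the area and power budgets, and S_serial, S_parallel the per-core performance values read off the Pareto frontiers:

\[
\text{Speedup} \;=\; \frac{1}{\dfrac{1-f}{S_{\text{serial}}} + \dfrac{f}{N \cdot S_{\text{parallel}}}}
\]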
Device x Core x CMP Scaling
The process for choosing the optimal core configuration for the symmetric topology at a given technology node (sketched in code below):
● The area/performance frontier is investigated, and all processor design points along the frontier are considered.
● For each area/performance design point, the multicore is constructed starting with a single core. One core is added per iteration, and the new speedup and power consumption are computed using the power/performance Pareto frontier.
● Speedups are computed at each step.
● After some number of iterations, the area limit is hit, the power wall is hit, or performance starts to degrade. At that point the optimal speedup and the optimal number of cores are found. The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores.
11
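A minimal sketch of this search loop, assuming a plain Amdahl-style speedup model and purely hypothetical area, power, and performance numbers (the paper's full framework also covers asymmetric, dynamic, and composed topologies as well as a GPU-style model):

```python
# Hedged sketch of the symmetric-topology search described above.
# All numbers are hypothetical; the real model takes per-core area,
# power, and performance from the device- and core-scaling models.

def best_symmetric_config(core_area, core_power, core_perf, f,
                          die_area, power_budget):
    """Add one core per iteration until the area or power limit is hit;
    return (best_speedup, best_core_count, dark_silicon_fraction)."""
    best_speedup, best_n = 0.0, 0
    n = 1
    while n * core_area <= die_area and n * core_power <= power_budget:
        # Amdahl-style speedup relative to a single baseline core.
        speedup = 1.0 / ((1 - f) / core_perf + f / (n * core_perf))
        if speedup <= best_speedup:      # performance degradation
            break
        best_speedup, best_n = speedup, n
        n += 1
    dark_fraction = 1.0 - (best_n * core_area) / die_area
    return best_speedup, best_n, dark_fraction

# Example: 10 mm^2 / 5 W / perf-1.0 cores, 90% parallel code,
# 111 mm^2 die, 80 W budget (all values illustrative only).
print(best_symmetric_config(10.0, 5.0, 1.0, 0.90, 111.0, 80.0))
```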
Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
12
Findings
Using PARSEC benchmarks and ITRS scaling projections, this study predicts a best-case average speedup of 7.9x between now and 2024 at 8 nm.
When conservative scaling projections are applied, half of that ideal gain vanishes: the path to 8 nm in 2018 results in a best-case average speedup of 3.7x.
13
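As a back-of-envelope comparison with the question on the motivation slide: five generations of core doubling would suggest 2^5 = 32x, so even the optimistic ITRS-based 7.9x projection falls roughly 4x short of that expectation, and the conservative 3.7x projection falls nearly 9x short.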
Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
14
Limitations
● SMT (Simultaneous Multithreading) support was not considered.
● The power impact of “uncore” components such as the memory subsystem was ignored.
● ARM or Tilera cores were not considered
because they are designed for different
application domains.
● Validation against real and simulated systems
shows the model underpredicts performance for
two benchmarks.
15
Outline
➢Dark Silicon and the End of Multicore Scaling
○ Summary
○ Motivation
○ Models
■ Device Scaling Model
■ Single-core Scaling Model
■ Multicore Scaling Model
■ Device x Core x CMP Scaling
○ Findings
○ Limitations
○ Conclusion
16
“If multicore scaling ceases
to be the primary driver of
performance gains at
16nm (in 2014) the
“multicore era” will have
lasted a mere nine years.”
Conclusion
17
Thanks!
Any questions so far?
18
Data Marshaling for Multicore Architectures (2010)
M. Aater Suleman, Onur Mutlu, Jose A. Joao, Khubaib, Yale N. Patt
Number of citations = 58
19
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
20
Applying the Data Marshaling concept to multicore architectures, to improve the performance of Staged Execution (SE) models:
● Accelerated Critical Sections
● Producer-Consumer pipeline parallelism
on both homogeneous and heterogeneous multicore systems.
21
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
22
Motivation
Previous research has shown that Staged Execution, i.e., dividing a program into segments and executing each segment on the core that has the data and/or functionality to best run that segment, can improve performance and save power.
BUT SE’s benefit is limited because most segments access data generated by the previous segment, which causes cache misses.
Can we apply the data marshaling concept from network programming to eliminate these cache misses?
23
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
24
Staged Execution Model
Goal
Speed up a program by dividing it into segments and running each segment on the core best-suited to run it.
Examples
● Accelerated critical sections [Suleman et al., ASPLOS 2009]
● Producer-consumer pipeline parallelism
● Task parallelism (Cilk, Intel TBB, Apple Grand Central Dispatch)
● Special-purpose cores and functional units
Benefit
● Accelerates segments/critical paths using specialized/heterogeneous cores.
● Exploits inter-segment parallelism.
● Improves locality of within-segment data.
25
Staged Execution Model
26
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
27
The Problem
Accelerated Critical Sections
Idea: Ship critical sections to a large core.
Problem: A critical section incurs a cache miss when it touches data produced in the non-critical section (i.e., thread-private data).
Producer-Consumer Pipeline
Idea: Split a loop iteration into multiple “pipeline stages”; each stage runs on a different core.
Problem: A stage incurs a cache miss when it touches data produced by the previous stage.
28
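As a toy illustration of the pipeline case (a software analogy only; the paper's concern is the hardware cache miss when the stage-2 core first touches data resident in the stage-1 core's cache), a two-stage sketch in which stage 2 consumes records produced by stage 1 on another core:

```python
# Hedged illustration of inter-segment data transfer in a two-stage
# pipeline. In hardware, record["payload"] would live in core 1's
# cache when stage 2 (on core 2) first touches it, causing a miss;
# the queue below only models the hand-off, not the miss itself.
import threading, queue

work = queue.Queue()

def stage1_producer():
    for i in range(4):
        record = {"id": i, "payload": i * i}   # inter-segment data
        work.put(record)                       # hand off to stage 2
    work.put(None)                             # end-of-stream marker

def stage2_consumer():
    while (record := work.get()) is not None:
        print("stage 2 consumed", record["payload"])  # first touch

t1 = threading.Thread(target=stage1_producer)
t2 = threading.Thread(target=stage2_consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```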
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
29
Data Marshaling
30
Observation
The set of generator instructions is stable over execution time and across input sets.
Idea
● Identify the generator instructions.
● Record cache blocks produced by generator instructions.
● Proactively send such cache blocks to the next segment’s core before initiating the next segment.
31
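A hedged software sketch of this idea, with a 16-entry marshal buffer as on the later support/cost slide; in the actual proposal the generator instructions are identified by the compiler and the marshaling is done by hardware pushing cache blocks, so the class and method names below are purely illustrative:

```python
# Hedged sketch of Data Marshaling. In the real design a hardware
# marshal buffer holds cache-block addresses, is filled by
# generator-prefixed stores, and is drained by a marshal instruction
# before the next segment starts on another core.

MARSHAL_BUFFER_ENTRIES = 16   # per the support/cost slide

class Core:
    def __init__(self, name):
        self.name = name
        self.cache = {}             # address -> cache block
        self.marshal_buffer = []    # addresses of generated blocks

    def generator_store(self, addr, block):
        """A store marked with the generator prefix: record its block."""
        self.cache[addr] = block
        if len(self.marshal_buffer) < MARSHAL_BUFFER_ENTRIES:
            self.marshal_buffer.append(addr)

    def marshal_to(self, next_core):
        """Push the recorded blocks to the next segment's core."""
        for addr in self.marshal_buffer:
            next_core.cache[addr] = self.cache[addr]
        self.marshal_buffer.clear()

core0, core1 = Core("small"), Core("large")
core0.generator_store(0x100, "inter-segment data")
core0.marshal_to(core1)   # before initiating the next segment
print(core1.cache)        # the data is already local on core1
```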
Data Marshaling
32
Data Marshaling
33
Data Marshaling Support/Cost
Support
● Profiler/Compiler: generators, marshal instructions.
● ISA: generator prefix, marshal instructions.
● Library/Hardware: bind the next segment ID to a physical core.
Hardware
● Marshal Buffer
○ Stores physical addresses of cache blocks to be marshaled.
○ 16 entries (96 bytes per core) are enough for almost all workloads.
● Ability to execute generator prefixes and marshal instructions.
● Ability to push data to another cache.
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
34
Accelerated Critical Sections
Idea
Ship critical sections to a large core in an asymmetric CMP.
Benefit
Faster execution of critical sections, reduced serialization, improved lock and shared data locality.
35
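A minimal software analogy of the ACS flow, with a dedicated server thread standing in for the large core (the published design ships requests to the large core with CSCALL/CSRET-style hardware support; the names and structure below are illustrative only):

```python
# Hedged sketch of Accelerated Critical Sections: small-core threads
# ship their critical-section work to one "large core" instead of
# executing it locally, so the lock and shared data stay resident in
# that core's cache. Illustrative only, not the hardware mechanism.
import threading, queue

requests = queue.Queue()
shared_counter = 0

def large_core_server():
    """Executes all shipped critical sections serially."""
    global shared_counter
    while (req := requests.get()) is not None:
        shared_counter += req            # the critical-section body
        requests.task_done()

def small_core_worker(n):
    for _ in range(n):
        requests.put(1)                  # ship the critical section

server = threading.Thread(target=large_core_server)
server.start()
workers = [threading.Thread(target=small_core_worker, args=(100,))
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
requests.join()                          # wait for shipped work
requests.put(None)                       # shut the server down
server.join()
print(shared_counter)                    # 400
```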
36
Methodology
Workloads
12 critical-section-intensive applications.
● Data mining kernels, sorting, database, web, networking
● Different training and simulation input sets
Simulator
Multi-core x86 simulator.
● 1 large and 28 small cores
● Aggressive stream prefetcher employed at each core
Details
● Large core: 2 GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage
● Small core: 2 GHz, in-order, 2-wide, 5-stage
● Private 32 KB L1, private 256 KB L2, 8 MB shared L3
● On-chip interconnect: bi-directional ring, 5-cycle hop latency
37
Results
Producer-Consumer Pipeline
Idea
Split a loop iteration into multiple “pipeline stages”, where each stage consumes data produced by the previous stage and runs on a different core.
Benefit
Stage-level parallelism, better locality, faster execution.
38
39
Methodology
Workloads
9 applications with pipeline parallelism.
● Financial, compression, multimedia, encoding/decoding
● Different training and simulation input sets
Simulator
Multi-core x86 simulator.
● 32-core CMP: 2 GHz, in-order, 2-wide, 5-stage
● Aggressive stream prefetcher employed at each core
● Private 32 KB L1, private 256 KB L2, 8 MB shared L3
● On-chip interconnect: bi-directional ring, 5-cycle hop latency
40
Results
Outline
➢Data Marshaling for Multicore Architectures
○ Summary
○ Motivation
○ Staged Execution Model
■ Two examples
○ The Problem: Inter-segment Data Transfers
○ Data Marshaling
○ Applications
■ Accelerated Critical Sections
■ Pipeline Parallelism
○ Conclusion
41
Conclusion
42
● Inter-segment data transfers between cores limit the benefit of promising Staged Execution (SE) models.
● Data Marshaling is a hardware/software cooperative solution: detect inter-segment data generator instructions and push their data to the next segment’s core.
○ Significantly reduces cache misses for inter-segment data.
○ Low cost, high coverage, and timely for arbitrary address sequences.
○ Achieves most of the potential of eliminating such misses.
● Applicable to several existing Staged Execution models.
○ Accelerated Critical Sections: 9% performance benefit.
○ Pipeline Parallelism: 16% performance benefit.
● Can enable new models, e.g., very fine-grained remote execution.
Thanks!
Any questions?
43