Enhancing Cache Coherent Architectures
with Access Patterns for Embedded
Manycore Systems
Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader
stephane.louise@cea.fr
Oct 11-12, 2012, SoC 2012
Outline: Introduction · CoCCA approach · Evaluation · Conclusion and perspective
Background
Multicore and manycore systems, their architecture and its future:
The single-processor era is over
Multiprocessors are here and will remain
Down to embedded systems (e.g. my cellphone)
Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way)
The future is manycore, even in the embedded world
We have to prepare for this.
Programmability?
Programmability
What are the programming paradigms for manycores? How do we program them?
New paradigms (e.g. stream programming) are still young and require learning a new way of programming. Bad for legacy software (porting costs!)
MPI (OK for HPC applications, but also heavy work for parallelization)
OpenMP and the like: “only” adding some pragmas to parallelize an application.
OpenMP relies on a shared-memory model, so shared-memory behavior must be provided, if possible in hardware (because it is faster)
Shared memory consistency for multicore/manycores
For manycore systems, memory consistency = cache-coherence mechanisms, based on the four-state MESI protocol:
Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory
Exclusive: the value is only in one core’s cache and wasn’t modified since it was accessed from memory
Shared: multiple copies of the value exist in the system, and only read operations were done
Invalid: the current copy that the core has must not be used and will be discarded
Use of Home Nodes to keep the state consistent:
For a given address in memory, only one core of the system keeps the coherence state
The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots
A processor mask tracks the cores that share the cache line
This baseline protocol is the base for all state-of-the-art memory consistency systems
Baseline implementation in a manycore system
[Diagram: the L2 cache-coherence directory, connected to the memory and network interfaces; each directory entry holds an address plus coherence info (state and vector bit fields).]
Modification of a shared value by a given core
[Animated diagram, steps ➀–➃, showing a processor with a shared copy of the data modifying it.]
Issues of baseline protocol and memory access patterns
Sometimes a single write to a shared value triggers a lot of coherence traffic on the NoC
For regular but non-contiguous accesses, many cache-line accesses are needed
Typical example: reading an image by column
But the accesses are simple and deterministic
In some areas the baseline protocol does not work as well as it could, and it lacks a bit of scalability
In the embedded world, lots of low-level data processing displays a regular behavior with respect to its memory accesses:
Convolutions on images
usual transformations (e.g. FFT, DCT)
vector operations
etc.
The idea: take advantage of these regular memory access patterns to reduce the coherence traffic and enable memory prefetch
State of the Art, Memory patterns and shared memory
coherence
Use of memory patterns:
Intel: use of special instructions to perform regular accesses to memory, limited to a single core; Patent US 7,143,264 (2006)
IBM: special instructions used to detect and apply patterns, also limited to a single cache; Patent US 7,395,407 (2008)
Other platforms:
The STAR project aims to provide a scalable manycore with a coherent shared memory
Cache Coherence Architecture with patterns
Our enhancement to Cache Coherence Architecture (CCA):
Relies on the baseline protocol (adds to it, does not replace it)
Updates it with special cases for pattern management
Adds storage to each core for pattern storage and detection
Patterns are a result of the compilation process
It cannot work worse than baseline, because baseline is still the default.
Modifies:
the core IP, with the pattern storage and matching
the baseline protocol, by adding the speculative protocol
The patterns (and the speculative protocol) have their own determination of the Home Node (which can be the same as, or differ from, the baseline Home Node)
We call this modified system CoCCA (Codesigned CCA)
CoCCA architecture scheme
[Diagram: next to the baseline cache-coherence directory (address + coherence info: state and vector bit fields), a CoCCA pattern table is added (address + pattern + coherence info: state and bit fields), connected to the memory and network interfaces.]
Chip area overhead: ~+3%?
Pattern definition and storage
Patterns are not stored the same way on user nodes and home nodes
The minimum implementation uses a 2D strided access shape:
a start address
a stride length
a pattern length
On the home node: a pattern size
A speculative access fetches cache lines (as the baseline protocol does), but the access pattern may need to be more fine-grained in its specification (overlaps)
Definition of triggers: a way of detecting the signature of a pattern to fetch
the simplest trigger is the first address of the pattern access
Triggers and pattern definition
Pattern matching principle (hardware):
Pattern calculation (simple case): Desc = fn(Baddr, s, δ)
Baddr: base address
s: size of the pattern
δ: interval (stride) between two consecutive accesses
E.g.: Pat(1, 4, 2)(@1) = { @2, @5, @8, @11 }
Base of the modified protocol
[Flowchart: the requester performs a pattern-table (PT) lookup. On a PT hit it sends SPEC_RQ to the hybrid (CoCCA) home node; on a miss it sends RD_RQ to the baseline home node. Each home node does a directory (DIR) lookup, then an L2 cache read or a memory access, and answers with RD_RQ_AK; on the hybrid home node, a pattern lookup drives one RD_RQ_AK per element of the pattern length.]
Without pattern information or in case of pattern miss: the
system acts as an ordinary baseline architecture
In case of pattern hit: the speculative protocol is fired
Hardware tables and special instructions
A C-language description of pattern storing tables:
unsigned long capacity; /* sizeof(address) */
unsigned long size; /* address number */
unsigned long * offset; /* pattern offset */
unsigned long * length; /* pattern length */
unsigned long * stride; /* pattern stride */
So it is possible to have a rough estimate of the size of an entry in
the pattern table
14 / 23
Introduction CoCCA approach evaluation Conclusion and perspective
Hardware tables and special instructions
A C-language description of the pattern-storing tables:

struct pattern {
    unsigned long capacity;   /* sizeof(address) */
    unsigned long size;       /* address number */
    unsigned long *offset;    /* pattern offset */
    unsigned long *length;    /* pattern length */
    unsigned long *stride;    /* pattern stride */
};

So it is possible to make a rough estimate of the size of an entry in the pattern table.
A few specialized instructions to manage pattern tables:
PatternNew: to create a pattern,
PatternAddOffset: to add an offset entry,
PatternAddLength: to add a length entry,
PatternAddStride: to add a stride entry,
PatternFree: to release the pattern after use.
A first benchmark program for early evaluation
The choice of a benchmark program for our speculative protocol:
be representative of typical embedded applications
stress the protocol proposal on several aspects
We chose a two-step cascaded image filtering:
the first filter’s result is the source of the second filter
5x5 filters
applied on chunks of the image, one per core, with cache lines shared in both read and write mode
the result of the second filter is written back over the source (write invalidation)
Memory mapping of our benchmark program
Instrumentation choice: Pin/Pintools
Pin/Pintools:
Pin is an instrumentation framework of binaries based on JIT
technique to accelerate the instrumentation. It is a Intel
project
Pin acts in association with the instrumentation tool called
Pintool which is programmable
Several Pintools are provided in the basic distribution of Pin
17 / 23
Introduction CoCCA approach evaluation Conclusion and perspective
Instrumentation choice: Pin/Pintools
Pin/Pintools:
Pin is a binary-instrumentation framework based on JIT techniques to accelerate the instrumentation. It is an Intel project.
Pin acts in association with an instrumentation tool, called a Pintool, which is programmable
Several Pintools are provided in the basic distribution of Pin
We used:
inscount: a Pintool which gives the number of executed instructions
pinatrace: a Pintool which traces and logs all the memory accesses (load/store operations)
See paper for details.
Data sharing and prefetch
[Figure: read data sharing in conterminous rectangles (Rect. i, i+1, i+7, i+8): data exclusive to one core, shared by 2 cores, or shared by 4 cores.]
We can define three kinds of patterns on this benchmark:
Source image prefetch, and setting of old Shared values (S) to Exclusive values (E) when the source image becomes the destination (2 patterns per core)
False concurrency of write accesses between two rectangles of the destination image. This happens because the frontiers are not aligned with L2 cache lines. The associated pattern is 6 vertical lines with 0 bytes in common
Shared read data (because convolution kernels read pixels in conterminous rectangles, see the figure). There are 6 vertical lines and 3 sets of two horizontal lines for these patterns
After simplification, only 6 patterns are required
Evaluation results
Condition                           MESI    CoCCA
Shared line invalidation           34560    17283
Exclusive line sharing (2 cores)   12768    12768
Exclusive line sharing (4 cores)    1344      772
Total throughput                   48672    30723

37% reduction in coherence message throughput
Prefetch accounts for 10% of cache accesses
This means that without prefetch the application would run 67% slower (20 cycles for an on-chip shared cache access, 80 cycles for an external memory access)
Contributions
Shared memory and coherence are important for the programmability of CMPs
State-of-the-art cache coherence mechanisms fall into worst-case behaviors for scenarios that seem simple: regular, patterned accesses to memory
We defined an extension of the cores to store patterns
We extended the baseline protocol with a speculative protocol
For embedded systems: the tables are produced as part of the compilation process
Only a few pattern entries are necessary for each typical low-level filter
Patterns can significantly reduce coherence message throughput
Patterns allow early and efficient cache preloading, which significantly accelerates applications
May provide a path to cache coherency in massive many-cores
Future work and perspective
extend the number of benchmark applications, to draw more general conclusions
apply our ideas in a NoC simulator, to do cycle-accurate simulations
include it in a full-scale simulator (e.g. SoCLib)
extend our work toward an HPC-friendly architecture that would determine patterns dynamically at runtime
Thank you for your attention
Questions?
ALCHEMY workshop @ ICCS 2013 (Barcelona)
The International Conference on Computational Science (ICCS) can be a good place to talk with people using HPC architectures for their needs.
Loïc Cudennec and I are organizing a workshop on the issues arising with future manycore systems (number of cores > 1000 and beyond):
ALCHEMY: Architecture, Language, Compilation and Hardware support for Emerging ManYcore systems
Topics:
Advanced architecture support for massive parallelism management
Advanced architecture support for enhanced communication for manycores
Full paper submission: December 15th. Notification: Feb. 10.
 
Introducing Parallel Pixie Dust
Introducing Parallel Pixie DustIntroducing Parallel Pixie Dust
Introducing Parallel Pixie Dust
 
A Survey on in-a-box parallel computing and its implications on system softwa...
A Survey on in-a-box parallel computing and its implications on system softwa...A Survey on in-a-box parallel computing and its implications on system softwa...
A Survey on in-a-box parallel computing and its implications on system softwa...
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problems
 

SoC-2012-pres-2

  • 1. Enhancing Cache Coherent Architectures with Access Patterns for Embedded Manycore Systems Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader stephane.louise@cea.fr Oct 11-12, 2012, SoC 2012 1 / 23
  • 2. Introduction CoCCA approach evaluation Conclusion and perspective Background Multicore and manycore systems: architecture and its future: The single-processor era is over Multiprocessors are here and will remain Down to embedded systems (e.g. my cellphone) Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way) The future is manycore, even in the embedded world We have to prepare for this. Programmability? 2 / 23
  • 6. Introduction CoCCA approach evaluation Conclusion and perspective Programmability What are the programming paradigms for manycores? How do we program them? New paradigms (e.g. stream programming) are still young; we need to learn new ways of programming. Bad for legacy software (porting costs!) MPI (OK for HPC applications, but also heavy work for parallelization) OpenMP and the like: “only” adding some pragmas to parallelize an application. OpenMP relies on a shared memory model. So a shared memory behavior must be provided, and if possible in hardware (because it is faster) 3 / 23
  • 8. Introduction CoCCA approach evaluation Conclusion and perspective Shared memory consistency for multicore/manycores For manycore systems, memory consistency = cache coherence mechanisms: Based on the four-state MESI protocol Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory Exclusive: the value is only in one core’s cache and wasn’t modified since it was fetched from memory Shared: multiple copies of the value exist in the system, and only read operations were done Invalid: the current copy the core holds must not be used and will be discarded Use of Home Nodes to keep the state consistent: For a given address in memory, only one core of the system keeps the coherence state The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots A processor mask tracks the cores that share the cache line This baseline protocol is the basis of all state-of-the-art memory consistency systems 4 / 23
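The round-robin home-node distribution described above can be sketched in a few lines of C. The line size and core count below are illustrative assumptions, not values from the slides:

```c
#include <stdint.h>

/* Illustrative parameters (assumptions, not from the slides). */
#define LINE_SIZE 64u   /* bytes per cache line */
#define NUM_CORES 16u   /* cores acting as home nodes */

/* Round-robin home node: consecutive cache lines map to
 * consecutive cores, spreading directory traffic across the
 * chip and avoiding hot spots. */
static uint32_t home_node(uintptr_t addr)
{
    return (uint32_t)((addr / LINE_SIZE) % NUM_CORES);
}
```

Two addresses in the same cache line always resolve to the same home node, which is what makes a single core authoritative for each line's coherence state.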
  • 9. Introduction CoCCA approach evaluation Conclusion and perspective Baseline implementation in a manycore system [Figure: per-core cache coherence directory holding entries of address + coherence info (state and vector bit fields), connected to the memory interface, the network interface, and the L2 cache] 5 / 23
  • 10–13. Introduction CoCCA approach evaluation Conclusion and perspective Modification of a shared value by a given core [Figure: four animation steps ➀–➃ involving the processors holding a shared copy of the data] 6 / 23
  • 15. Introduction CoCCA approach evaluation Conclusion and perspective Issues of the baseline protocol, and memory access patterns Sometimes a single write on a shared value triggers a lot of coherence traffic on the NoC For regular but non-contiguous accesses, lots of accesses Typical example: reading an image by column But the accesses are simple and deterministic In some areas the baseline protocol does not work as well as it could and lacks scalability In the embedded world, a lot of low-level data processing displays a regular behavior with respect to memory accesses Convolutions on images, usual transformations (e.g. FFT, DCT), vector operations, etc. The idea: take advantage of these regular memory access patterns to reduce the coherence traffic and enable memory prefetch 7 / 23
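The column-read example above is worth making concrete: for a row-major image, each access in a column scan lands exactly one row width past the previous one. A minimal sketch, with an assumed image width:

```c
#include <stddef.h>

/* Illustrative image width (an assumption, not from the slides). */
enum { WIDTH = 512 };

/* Byte offset of the i-th access when reading column `col` of a
 * row-major 8-bit image: a regular strided pattern whose stride is
 * the row width. Every access is deterministic, which is what makes
 * such code a perfect candidate for pattern-based prefetch. */
static size_t column_access(size_t col, size_t i)
{
    return col + i * (size_t)WIDTH;
}
```

With a 64-byte cache line, each of these accesses touches a different line, so the baseline protocol pays one directory transaction per pixel read.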
  • 16. Introduction CoCCA approach evaluation Conclusion and perspective State of the Art: memory patterns and shared memory coherence Use of memory patterns: Intel: use of special instructions to perform regular accesses to memory, limited to a single core; Patent US 7,143,264 (2006) IBM: special instructions used to detect and apply patterns, also limited to a single cache; Patent US 7,395,407 (2008) Other platforms: the STAR project aims to provide a scalable manycore with a coherent shared memory 8 / 23
  • 18. Introduction CoCCA approach evaluation Conclusion and perspective Cache Coherence Architecture with patterns Our enhancement to Cache Coherence Architecture (CCA) Relies on the baseline protocol (adds to it, does not replace it) Updates it with special cases for pattern management Adds storage to each core for pattern storage and detection Patterns are a result of the compilation process It cannot perform worse than the baseline, because the baseline is still the default. Modifies: the core IP with pattern storage and matching Adds the speculative protocol to the baseline protocol The patterns (and the speculative protocol) have their own determination of the Home Node (which may be the same as or differ from the baseline Home Node) We call this modified system CoCCA (Codesigned CCA) 9 / 23
  • 19. Introduction CoCCA approach evaluation Conclusion and perspective CoCCA architecture scheme [Figure: per-core cache coherence directory (address + coherence info: state and vector bit fields) plus a CoCCA pattern table (address, pattern, coherence info: state and bit fields), connected to the memory and network interfaces] Chip area overhead: ~+3%? 10 / 23
  • 20. Introduction CoCCA approach evaluation Conclusion and perspective Pattern definition and storage Patterns are not stored the same way on use nodes and home nodes The minimum implementation uses a 2D strided access shape: a start address, a stride length, a pattern length On the home node: a pattern size A speculative access fetches cache lines (as the baseline protocol does), but the access pattern may need to be more fine-grained in its specification (overlaps) Definition of triggers: a way of detecting the signature of a pattern to fetch; the simplest trigger is the first address of the pattern access 11 / 23
  • 25. Introduction CoCCA approach evaluation Conclusion and perspective Triggers and pattern definition Pattern matching principle (hw): Pattern calculation (simple case): Desc = fn(Baddr, s, δ) Baddr base address s size of the pattern δ interval (stride) between 2 consecutive accesses E.g.: Pat(1, 4, 2)(@1) = { @2, @5, @8, @11} 12 / 23
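A minimal software model of the descriptor Desc = fn(Baddr, s, δ) and of the simplest trigger (the pattern's first address) might look as follows. The enumeration rule addr_i = Baddr + i·δ is one plausible reading of the notation, not a confirmed hardware encoding (the slide's own example suggests a slightly different address arithmetic):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal 2D strided pattern descriptor: Desc = fn(Baddr, s, delta).
 * Field meanings follow the slide; the address arithmetic below is
 * an illustrative assumption. */
typedef struct {
    uintptr_t base;   /* Baddr: first address of the pattern       */
    size_t    size;   /* s: number of accesses in the pattern      */
    size_t    stride; /* delta: gap between consecutive accesses   */
} pattern_desc;

/* i-th address generated by the pattern (i < size). */
static uintptr_t pattern_addr(const pattern_desc *p, size_t i)
{
    return p->base + i * p->stride;
}

/* Simplest trigger: an access matches when it hits the pattern's
 * first address, firing the speculative fetch of the rest. */
static bool trigger_hit(const pattern_desc *p, uintptr_t addr)
{
    return addr == p->base;
}
```

Once the trigger fires, the home node can enumerate and prefetch all `size` addresses without waiting for the core to request them one by one.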
  • 28. Introduction CoCCA approach evaluation Conclusion and perspective Base of the modified protocol [Flow chart: the requester performs DIR and PT lookups; a PT hit sends SPEC_RQ, a miss sends RD_RQ; the baseline Home Node and the hybrid (CoCCA) Home Node answer with RD_RQ_AK after an L2 cache read, a pattern lookup, or a memory access, over the pattern length] Without pattern information or in case of a pattern miss: the system acts as an ordinary baseline architecture In case of a pattern hit: the speculative protocol is fired 13 / 23
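The requester-side decision of the flow chart reduces to a single branch; this sketch names the two requests as in the chart and abstracts the pattern-table lookup into a boolean (a simplifying assumption):

```c
#include <stdbool.h>

/* The two request kinds of the flow chart. */
typedef enum { SEND_RD_RQ, SEND_SPEC_RQ } request_kind;

/* Requester side of the hybrid protocol, reduced to its decision:
 * a pattern-table (PT) hit fires the speculative request; a miss
 * falls back to the ordinary baseline read request. This fallback
 * is why the system can never behave worse than the baseline. */
static request_kind issue_request(bool pattern_table_hit)
{
    return pattern_table_hit ? SEND_SPEC_RQ : SEND_RD_RQ;
}
```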
  • 30. Introduction CoCCA approach evaluation Conclusion and perspective Hardware tables and special instructions A C-language description of pattern storing tables: unsigned long capacity; /* sizeof(address) */ unsigned long size; /* address number */ unsigned long * offset; /* pattern offset */ unsigned long * length; /* pattern length */ unsigned long * stride; /* pattern stride */ So it is possible to have a rough estimate of the size of an entry in the pattern table A few specialized instructions to manage pattern tables: PatternNew: to create a pattern, PatternAddOffset: to add an offset entry, PatternAddLength: to add a length entry, PatternAddStride: to add a stride entry, PatternFree: to release the pattern after use. 14 / 23
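Putting the fields and instructions together, a software model of the table could look like this. In CoCCA these would be core instructions backed by hardware storage; `malloc`/`calloc` here only emulate their effect, and the three Add* instructions are collapsed into one helper for brevity (an illustrative simplification):

```c
#include <stdlib.h>

/* Software model of the hardware pattern table described above. */
typedef struct {
    unsigned long  capacity; /* sizeof(address)           */
    unsigned long  size;     /* number of address entries */
    unsigned long *offset;   /* pattern offsets           */
    unsigned long *length;   /* pattern lengths           */
    unsigned long *stride;   /* pattern strides           */
    unsigned long  used;     /* entries filled so far     */
} pattern_table;

/* Models PatternNew: create a table with room for n entries. */
static pattern_table *PatternNew(unsigned long n)
{
    pattern_table *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->capacity = sizeof(unsigned long);
    p->size = n;
    p->used = 0;
    p->offset = calloc(n, sizeof *p->offset);
    p->length = calloc(n, sizeof *p->length);
    p->stride = calloc(n, sizeof *p->stride);
    return p;
}

/* Models PatternAddOffset/AddLength/AddStride in one call. */
static void PatternAddEntry(pattern_table *p, unsigned long off,
                            unsigned long len, unsigned long str)
{
    if (p->used < p->size) {
        p->offset[p->used] = off;
        p->length[p->used] = len;
        p->stride[p->used] = str;
        p->used++;
    }
}

/* Models PatternFree: release the pattern after use. */
static void PatternFree(pattern_table *p)
{
    free(p->offset);
    free(p->length);
    free(p->stride);
    free(p);
}
```

Each entry costs three `unsigned long` words plus bookkeeping, which is the "rough estimate" the slide refers to.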
  • 31. Introduction CoCCA approach evaluation Conclusion and perspective A first benchmark program for early evaluation The choice of a benchmark program for our speculative protocol: be representative of a typical embedded application, stress the protocol proposal on several aspects We chose a 2-step cascading image filtering: the first filter’s result is the source of the second filter; a 5x5 filter applied on chunks of the image, one chunk per core, with cache lines shared both in read mode and in write mode; the result of the second filter is written back over the source (write invalidation) 15 / 23
  • 32. Introduction CoCCA approach evaluation Conclusion and perspective Memory mapping of our benchmark program 16 / 23
  • 34. Introduction CoCCA approach evaluation Conclusion and perspective Instrumentation choice: Pin/Pintools Pin/Pintools: Pin is an instrumentation framework for binaries based on a JIT technique to accelerate the instrumentation. It is an Intel project Pin acts in association with a programmable instrumentation tool called a Pintool Several Pintools are provided in the basic distribution of Pin We used: inscount: a pintool which gives the number of executed instructions pinatrace: a pintool which traces and logs all the memory accesses (load/store operations) See paper for details. 17 / 23
  • 35–38. Introduction CoCCA approach evaluation Conclusion and perspective Data sharing and prefetch [Figure: read data sharing in conterminous rectangles Rect. i, i+1, i+7, i+8 — exclusive data (1 core only), data shared by 2 cores, data shared by 4 cores] We can define three kinds of patterns on this benchmark: Source image prefetch and setting of old Shared values (S) to Exclusive values (E) when the source image becomes the destination (2 patterns per core) False concurrency of write accesses between two rectangles of the destination image; this happens because the frontiers are not aligned with L2 cache lines; the associated pattern is 6 vertical lines with 0 bytes in common Shared read data (because convolution kernels read pixels in conterminous rectangles, see figure); there are 6 vertical lines and 3 sets of two horizontal lines for these patterns After simplification, only 6 patterns are required 18 / 23
  • 42. Introduction CoCCA approach evaluation Conclusion and perspective Evaluation results Condition MESI CoCCA Shared line invalidation 34560 17283 Exclusive line sharing (2 cores) 12768 12768 Exclusive line sharing (4 cores) 1344 772 Total throughput 48672 30723 Reduction of 37% of the coherence message throughput Prefetch accounts for 10% of cache accesses This means that without prefetch the application runs 67% slower (20 cycles for an on-chip shared cache access and 80 cycles for external memory accesses) 19 / 23
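As a sanity check, the quoted 37% figure follows directly from the table's total message counts:

```c
/* Coherence-message throughput reduction from the table above:
 * (48672 - 30723) / 48672 ~= 36.9%, i.e. the quoted ~37%. */
static double reduction_percent(double baseline, double cocca)
{
    return 100.0 * (baseline - cocca) / baseline;
}
```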
  • 45. Introduction CoCCA approach evaluation Conclusion and perspective Contributions Shared memory and coherence are important for the programmability of CMPs State-of-the-art cache coherence mechanisms fall into worst-case behaviors for scenarios that seem simple: regular accesses to memory with patterns We defined an extension of the cores to store patterns We extended the baseline protocol with a speculative protocol For embedded systems: tables are part of the compilation process Only a few pattern entries are necessary for each typical low-level filter Patterns can significantly reduce coherence message throughput Patterns allow for early and efficient cache preloading, which significantly accelerates applications May provide a path to cache coherency in massive manycores 20 / 23
  • 48. Introduction CoCCA approach evaluation Conclusion and perspective Future work and perspective Extend the number of benchmark applications to draw more general conclusions Apply our ideas in a NoC simulator to do cycle-accurate simulations Include it in a full-scale simulator (e.g. SoCLib) Extend our work toward an HPC-friendly architecture that would determine patterns dynamically at runtime 21 / 23
  • 49. Introduction CoCCA approach evaluation Conclusion and perspective Thank you for your attention Questions? 22 / 23
  • 51. Introduction CoCCA approach evaluation Conclusion and perspective ALCHEMY workshop @ ICCS 2013 (Barcelona) The International Conference on Computational Science (ICCS) can be a good place to talk with people using HPC architectures for their needs. Loïc Cudennec and I are organizing a workshop on the issues arising with future manycore systems (more than 1000 cores) Architecture, Language, Compilation and Hardware support for Emerging ManYcore systems: the ALCHEMY workshop Topics: Advanced architecture support for massive parallelism management Advanced architecture support for enhanced communication for manycores Full paper submission: December 15th Notification: Feb. 10 23 / 23