This document discusses enhancing cache coherent architectures for manycore embedded systems by taking advantage of regular memory access patterns. It proposes adding pattern storage and detection capabilities to cores to reduce coherence traffic. Called CoCCA (Codesigned Cache Coherent Architecture), it modifies the baseline cache coherence protocol to allow speculative fetching of cache lines according to detected patterns, defined during compilation. This could improve scalability over the baseline approach by reducing traffic from repetitive accesses to shared data following predictable patterns.
The document proposes a new multithreaded execution model and multi-ring architecture to exploit instruction-level parallelism. The model uses multiple instruction threads that are scheduled for execution based on data availability. The instructions from different threads are grouped together for parallel execution. The proposed multi-ring architecture features large resident thread activations, a high-speed buffer to avoid load/store stalls, and a dynamic instruction scheduler that selects instructions from multiple threads each cycle for execution on multiple pipelines. Initial simulation results show the architecture can achieve parallel execution of 7 instructions per cycle.
This document proposes a framework for modeling source traffic in a Network on Chip (NoC) that originates from a single source but is destined for multiple destinations, known as multicasting. It presents a model to characterize such traffic as a single stream at the source based on the probabilistic demultiplexing of that stream into multiple streams. The model shows that the burst parameters of the demultiplexed streams are related to those of the original stream. The model is implemented in an NoC simulator and experimental results validate that the demultiplexed streams remain bursty even as their burst parameters change according to the model.
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util... (VLSICS Design)
With shrinking technology, reduced scale and power-hungry chip IO lead to the System on Chip. The design of an SOC using a traditional standard bus scheme encounters issues like non-uniform delay and routing problems. Crossbars scale better than buses but tend to become huge with an increasing number of nodes. NOC has become the design paradigm for SOC design for its highly regularized interconnect structure, good scalability and linear design effort. The main components of an NoC topology are the network adapters, routing nodes, and network interconnect links. This paper mainly deals with the implementation of full custom SRAM based arrays over D FF based register arrays in the design of the input module of a routing node in a 2D mesh NOC topology. The custom SRAM blocks replace D FF (D flip flop) memory implementations to optimize the area and power of the input block. Full custom design of the SRAMs has been carried out with MILKYWAY, while physical implementation of the input module with SRAMs has been carried out with IC Compiler from SYNOPSYS. The improved design occupies approximately 30% of the area of the original design, in conformity with the ratio of the area of an SRAM cell to the area of a D flip flop, which is approximately 6:28. The power consumption is almost halved, to 1.5 mW. The maximum operating frequency is improved from 50 MHz to 200 MHz. It is intended to study and quantify the behavior of the single packet array design in relation to the multiple packet array design. Intuitively, a common packet buffer would result in better utilization of available buffer space. This in turn would translate into lower delays in transmission. A MATLAB model is used to show quantitatively how performance is improved in a common packet array design.
This document summarizes several dynamic cache replication mechanisms: Victim Replication replicates cache lines evicted from the local cache to reduce access latency. Adaptive Selective Replication dynamically adjusts replication based on estimated costs and benefits. Adaptive Probability Replication replicates blocks based on predicted reuse probabilities. Dynamic Reusability-based Replication replicates blocks with high reuse. Locality-Aware Data Replication only replicates high-locality blocks to reduce misses while maintaining low replication overhead. The document provides details on these schemes and compares their approaches to dynamic cache block replication.
This document summarizes a paper that proposes and evaluates the performance of a multithreaded architecture capable of exploiting both coarse-grained parallelism and fine-grained instruction-level parallelism. The architecture distributes processing across multiple processing elements connected by an interconnection network. Each processing element supports multiple concurrently executing threads by grouping instructions from different threads. The architecture introduces a distributed data structure cache to reduce network latency when accessing remote data. Simulation results indicate the architecture achieves high processor throughput and the data structure cache significantly reduces network latency.
The document provides an overview of basic concepts related to parallelization and data locality optimization. It discusses loop-level parallelism as a major target for parallelization, especially in applications using arrays. Long running applications tend to have large arrays, which lead to loops that have many iterations that can be divided among processors. The document also covers data locality and how the organization of computations can significantly impact performance by improving cache usage. It introduces concepts like symmetric multiprocessors and affine transform theory that are useful for parallelization and locality optimizations.
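As a concrete illustration of loop-level parallelism (a generic sketch, not code from the summarized document), the iterations below are independent, so a long-running loop over a large array can be divided among processors:

```cpp
#include <vector>

// Loop-level parallelism: compile with -fopenmp and OpenMP hands each
// thread a chunk of the iteration space; without it, the pragma is
// ignored and the loop simply runs serially.
void scale(std::vector<double>& a, double k) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        a[i] *= k; // iterations are independent: safe to parallelize
}
```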
This document presents an analytical model to evaluate the performance of multithreaded multiprocessors with distributed shared memory. The model uses a multi-chain closed queuing network to model the processor, memory, and network subsystems in an integrated manner. This captures the strong coupling between subsystems. The model shows that high performance is achieved when the memory request rate matches the weighted sum of memory bandwidth and average remote memory access distance. The model is validated using a stochastic timed Petri net model.
Characteristics of an on chip cache on nec sx (Léia de Sousa)
The document discusses characteristics of an on-chip cache, called a vector cache, for the NEC SX vector architecture. It evaluates the performance of the vector cache with varying memory bandwidth rates from 1 to 4 bytes per flop. The evaluation uses kernel loops and five leading scientific applications from areas like GPR simulation, APFA simulation, PRF simulation, and SFHT simulation. The results show that the vector cache can boost computational efficiency for lower memory bandwidth systems, with a 2 bytes per flop system achieving performance comparable to a 4 bytes per flop system when cache hit rates exceed 50%.
This document summarizes a lecture on speculative multithreading. The lecture discussed two papers: one that categorized hardware support for speculative multithreading, and one that described a software-only approach using transactional memory. The class discussion covered dividing responsibilities between hardware and software, sources of parallelism for speculation beyond just loops, scaling speculation, and programming support options.
This thesis proposes a design methodology for dynamically reconfigurable multi-FPGA systems. The methodology includes three main phases: design extraction from VHDL, static global layout partitioning and placement, and reuse of blocks through dynamic reconfiguration when needed to minimize delays. The major contribution is a multi-FPGA design flow that exploits dynamic reconfiguration to reuse blocks and reduce the application area requirements. Experimental results show the proposed approaches partition and place designs efficiently. Future work includes improving clustering metrics, routing algorithms, and time estimation for dynamic block reuse.
This document summarizes a research paper that proposes a new cache coherence protocol called Phase-Priority Based (PPB) cache coherence. PPB aims to optimize directory-based cache coherence protocols for multicore processors. It introduces the concepts of "phase" and "priority" for coherence messages to reduce unnecessary transient states and message stalling. PPB differentiates messages into inner and outer phases based on their place in the coherence transaction ordering. It also prioritizes messages in the on-chip network to improve efficiency. Analysis shows PPB outperforms traditional MESI, reducing transient states and stalls by up to 24% with a 7.4% speedup.
Performance evaluation of ECC in single and multi (elliptic curve) (Danilo Calle)
The document discusses performance evaluation of ECC (Elliptic Curve Cryptography) implementation on FPGA-based embedded systems using single and dual processor architectures. It explores implementing ECC using a single MicroBlaze soft processor core and a dual MicroBlaze core design with shared memory for inter-processor communication. Experimental results show the dual core design encrypts data 3.3 times faster than the single core design, but utilizes more resources and power due to the additional processor core.
This document evaluates the performance of four memory consistency models (sequential consistency, processor consistency, weak consistency, and release consistency) for shared-memory multiprocessors using simulation studies of three applications (MP3D, LU, and PTHOR). The results show that sequential consistency performs significantly worse than the other models. Surprisingly, processor consistency performs almost as well as release consistency and better than weak consistency for one application, indicating that allowing reads to bypass pending writes provides more benefit than allowing writes to pipeline.
CRDOM: cell re-ordering based domino on-the-fly mapping (VLSICS Design)
Domino logic is often the choice for designing high speed CMOS circuits. VLSI designers often choose library based approaches to perform technology mapping of large scale circuits involving the static CMOS logic style. Cells designed in the Domino logic style have the flexibility to accommodate a wide range of functions. Hence, there is scope to adopt a library-free synthesis approach for circuits designed using Domino logic. In this work, we present an approach for mapping a Domino logic circuit using an On-the-fly technique. First, we present a node mapping algorithm which maps a given Domino logic netlist using the On-the-fly technique. Next, using an Equivalence Table, we re-order the cells along the critical path for delay and area benefit. Finally, we find an optimum re-ordering set which obtains maximum delay savings for a minimum area penalty. We have tested the efficacy of our approach with a set of standard benchmark circuits. Our proposed mapping approach (CRDOM) obtained 21% improvement in area and 17% improvement in delay compared to existing work.
Parallel platforms can be organized in various ways, from an ideal parallel random access machine (PRAM) to more conventional architectures. PRAMs allow concurrent access to shared memory and can be divided into subclasses based on how simultaneous memory accesses are handled. Physical parallel computers use interconnection networks to provide communication between processing elements and memory. These networks include bus-based, crossbar, multistage, and various topologies like meshes and hypercubes. Maintaining cache coherence across multiple processors is important and can be achieved using invalidate protocols, directories, and snooping.
An octa core processor with shared memory and message-passing (eSAT Journals)
Abstract: In this era of fast, high-performance computing, there is a need for efficient optimizations in the processor architecture and in the memory hierarchy as well. Every day, advancing applications in communication and multimedia systems compel an increase in the number of cores in the main processor, viz. dual-core, quad-core, octa-core and so on. But for enhancing the overall performance of a multiprocessor chip, there are stringent requirements to improve inter-core synchronization. Thus, an MPSoC with 8 cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on a Virtex 5 LX110T FPGA. Each core is based on the MIPS (Microprocessor without Interlocked Pipelined Stages) III ISA, handles only integer instructions, and has a six-stage pipeline with a data hazard detection unit and forwarding logic. The eight processing cores and one central shared memory core are interconnected using a 3x3 2-D mesh topology based Network-on-Chip (NoC) with a virtual channel router. The router is four-stage pipelined, supports the DOR X-Y routing algorithm, and uses round-robin arbitration. For verification and functionality testing of the fully synthesized multicore processor, a matrix multiplication operation is mapped onto it. Partitioning and scheduling of the multiple multiplications and additions for each element of the resultant matrix has been done among the eight cores to get maximum throughput. All the processor design code is written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
All new computers have multicore processors. To exploit this hardware parallelism for improved performance, the predominant approach today is multithreading using shared variables and locks. This approach has potential data races that can create a nondeterministic program. This paper presents a promising new approach to parallel programming that is both lock-free and deterministic. The standard forall primitive for parallel execution of for-loop iterations is extended into a more highly structured primitive called a Parallel Operation (POP). Each parallel process created by a POP may read shared variables (or shared collections) freely. Shared collections modified by a POP must be selected from a special set of predefined Parallel Access Collections (PAC). Each PAC has several Write Modes that govern parallel updates in a deterministic way. This paper presents an overview of a Prototype Library that implements this POP-PAC approach for the C++ language, including performance results for two benchmark parallel programs.
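The abstract does not show the library's actual API, so the following C++ sketch is purely hypothetical: parallelOp stands in for a POP, and ReducePAC for a PAC whose single Write Mode is a commutative reduction, which is what makes the parallel update deterministic. All names are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical PAC: a shared accumulator whose only write mode is a
// commutative, associative "reduce", so concurrent updates yield the
// same final value regardless of thread interleaving.
struct ReducePAC {
    std::int64_t value = 0;
    void reduceAdd(std::int64_t x) {
        #pragma omp atomic
        value += x; // commutative update: deterministic final sum
    }
};

// Hypothetical POP: a structured forall whose body may read shared
// data freely but may write shared state only through a PAC.
template <typename Body>
void parallelOp(std::size_t n, Body body) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i)
        body(static_cast<std::size_t>(i));
}

std::int64_t dot(const std::vector<int>& a, const std::vector<int>& b) {
    ReducePAC acc;
    parallelOp(a.size(), [&](std::size_t i) {
        acc.reduceAdd(static_cast<std::int64_t>(a[i]) * b[i]);
    });
    return acc.value;
}
```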
Artificial Neural Network Implementation on FPGA – a Modular Approach (Roee Levy)
This document presents an FPGA implementation of an artificial neural network using a modular approach. Key points:
- The implementation uses a multilayer perceptron topology trained with the backpropagation algorithm. It allows networks of any size to be synthesized quickly.
- The design achieves peak performance of 5.46 million connection updates per second during training and 8.24 million predictions per second during computation.
- It was tested on a breast cancer classification problem, achieving 96% accuracy.
- The paper emphasizes important FPGA design principles that make neural network development modular and parameterized. This allows the system to solve various neural network problems efficiently.
This document summarizes the key aspects of a design flow framework that aims to simplify the development of partially reconfigurable systems. The framework hides complexity related to reconfiguration from designers and supports different architectural paradigms and communication infrastructures. It was developed in three phases: studying existing approaches, realizing the framework based on separated tools, and validating it with a new communication protocol. The framework generates architectures from a system description and allows designers to focus on writing modules while handling reconfiguration details.
A talk on Transformers at GDG DevParty
27.06.2020
Link to Google Slides version: https://docs.google.com/presentation/d/1N7ayCRqgsFO7TqSjN4OWW-dMOQPT5DZcHXsZvw8-6FU/edit?usp=sharing
Migration To Multi Core - Parallel Programming Models (Zvi Avraham)
The document discusses multi-core and many-core processors and parallel programming models. It provides an overview of hardware trends including increasing numbers of cores in CPUs and GPUs. It also covers parallel programming approaches like shared memory, message passing, data parallelism and task parallelism. Specific APIs discussed include Win32 threads, OpenMP, and Intel TBB.
The document proposes a new framework for hardware/software co-design that raises the abstraction level to allow uniform system description and faster simulation. The framework uses parameterizable hardware cores along with estimation to explore design spaces and generate optimized implementations. It has been applied to the RoadRunner project and focuses on further developing the hardware aspects. Future work includes expanding to software domains and refining system simulation and estimation techniques.
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In... (Ritu Arora)
Often, HPC software outlives the HPC systems for which they are initially developed. The innovations in the HPC platforms’ hardware and parallel programming standards drive the modernization of HPC applications so that they continue being performant. While such code modernization efforts may not be very challenging for HPC experts and well-funded research groups, many domain-experts may find it challenging to adapt their applications for latest HPC platforms due to lack of expertise, time, and funds. The challenges of such domain-experts can be mitigated by providing them high-level tools for code modernization and migration.
This document reviews Network-on-Chip (NoC) architectures that prioritize selected data streams to reduce communication latency. It categorizes the architectures based on the effect of prioritization (per end-to-end connection, per router, or per path segment) and discusses their pros and cons. Architectures that prioritize at the core-to-core level provide the highest latency reduction by bypassing the NoC, while those prioritizing per router or path segment require redetermining priority at each hop.
System on Chip Design and Modelling, Dr. David J Greaves (Satya Harish)
The document provides an overview of a course on system on chip design and modeling techniques. The course covers topics like register transfer language, SystemC components, basic SoC components, assertion-based design, network on chip structures, and architectural design exploration. It aims to cover the front end of the design automation process, including specification, modeling at different levels of abstraction, and logic synthesis. A running example evolves over the lectures to demonstrate a simple SoC.
The document summarizes research on parallelizing genetic algorithms to improve scalability when solving concept location problems. Four distributed architectures were developed and tested: 1) a simple client-server model with no data sharing, 2) a database configuration, 3) a hash-database configuration, and 4) a hash configuration where each server caches its own data locally. Experimental results showed the hash configuration performed best, reducing computation time by over 140 times compared to a single machine by efficiently storing and accessing already-computed data locally on each server. Future work aims to test different communication protocols and problems to validate the findings.
Assisting User’s Transition to Titan’s Accelerated Architecture (inside-BigData.com)
Oak Ridge National Lab is home of Titan, the largest GPU accelerated supercomputer in the world. This fact alone can be an intimidating experience for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools to successfully port applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
The document discusses using a regular expression matching architecture called ReCPU for network intrusion detection systems (NIDS). ReCPU can efficiently match regular expressions in hardware and is well-suited for the high-speed regular expression matching needs of NIDS. It describes the ReCPU architecture, which uses parallel comparators to match multiple characters simultaneously, and how its design can be adapted for NIDS computation.
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS (cscpconf)
Our main aim of research is to find the limit of Amdahl's Law for multicore processors: the number of cores that gives the most efficiency to the overall architecture of the CMP (Chip Multiprocessor, a.k.a. Multicore Processor). As expected, this limit lies either in the architecture of the Multicore Processor or in the programming. We surveyed the architectures of the Multicore processors of various chip manufacturers, namely INTEL™, AMD™, IBM™ etc., and the various techniques they follow for improving the performance of Multicore Processors. We conducted cluster experiments to find this limit. In this paper we propose an alternate design of Multicore processor based on the results of our cluster experiment.
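For reference, Amdahl's Law (a standard result, not specific to the summarized paper) gives the limit the authors are probing: for a program whose parallelizable fraction is p, running on n cores,

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.
```

With p = 0.95, for instance, no number of cores can push the speedup past 20.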
Concurrent Replication of Parallel and Distributed Simulations (Gabriele D'Angelo)
Parallel and distributed simulations enable the analysis of complex systems by concurrently exploiting the aggregate computation power and memory of clusters of execution units. In this work we investigate a new direction for increasing both the speedup of a simulation process and the utilization of computation and communication resources. Many simulation-based investigations require to collect independent observations for a correct and significant statistical analysis of results. The execution of many independent parallel or distributed simulation runs may suffer the speedup reduction due to rollbacks under the optimistic approach, and due to idle CPU times originated by synchronization and communication bottlenecks under the conservative approach. We present a parallel and distributed simulation framework supporting concurrent replication of parallel and distributed simulations (CR-PADS), as an alternative to the execution of a linear sequence of multiple parallel or distributed simulation runs. Results obtained from tests executed under variable scenarios show that speedup and resource utilization gains could be obtained by adopting the proposed replication approach in addition to the pure parallel and distributed simulation.
The document describes a methodology for designing dynamically reconfigurable multi-FPGA systems. It presents an intermediate representation for hierarchical circuits and a design flow with three main phases: design extraction from VHDL, static global layout partitioning and placement, and reuse through dynamic reconfiguration to minimize delays. Experimental results validate the partitioning, placement and block-reuse approaches. Future work includes improving clustering metrics, time estimation, and adding routing algorithms.
The document summarizes research on distributing the computation of a genetic algorithm's fitness function to parallelize concept location. Four distributed architectures - simple client-server, database, hash-database, and hash configurations - were developed, tested, and compared. The experiments found that a simple architecture where each server tracked its own computations without data sharing had the fastest execution time, reducing the genetic algorithm computation by around 140 times compared to a single-machine implementation. Future work aims to experiment with different communication protocols and synchronization strategies on additional traces from other systems.
Improving Efficiency of Machine Learning Algorithms using HPCC Systems (HPCC Systems)
1) The document discusses improving the efficiency of machine learning algorithms using the HPCC Systems platform through parallelization.
2) It describes the HPCC Systems architecture and its advantages for distributed machine learning.
3) A parallel DBSCAN algorithm is implemented on the HPCC platform which shows improved performance over the serial algorithm, with execution times decreasing as more nodes are used.
The document discusses parallel computing platforms and trends in microprocessor architectures that enable implicit parallelism. It covers topics like pipelining, superscalar execution, limitations of memory performance, and how caches can improve effective memory latency. The key points are:
1) Microprocessor clock speeds have increased dramatically but limitations remain regarding memory latency and bandwidth. Parallelism addresses performance bottlenecks in processors, memory, and communication.
2) Techniques like pipelining and superscalar execution exploit implicit parallelism by executing multiple instructions concurrently, but dependencies and branch prediction limit performance gains.
3) Memory latency is often the bottleneck, but caches can reduce effective latency through data reuse and temporal locality.
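As a small illustration of point 3 (a sketch of the general technique, not code from the summarized lecture), cache blocking restructures a traversal so each fetched line is reused many times before eviction:

```cpp
#include <cstddef>

// Cache blocking (tiling): process the matrix in BxB tiles so the
// working set fits in cache; each fetched cache line is then reused
// about B times instead of once, hiding most of the memory latency.
constexpr std::size_t B = 64; // tile size, tuned to cache capacity

void transpose(const double* in, double* out, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```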
An Adaptive Load Balancing Middleware for Distributed Simulation (Gabriele D'Angelo)
The simulation is useful to support the design and performance evaluation of complex systems, possibly composed of a massive number of interacting entities. For this reason, the simulation of such systems may need aggregate computation and memory resources obtained from clusters of parallel and distributed execution units. Shared computer clusters composed of available Commercial-Off-the-Shelf hardware are preferable to dedicated systems, mainly for cost reasons. The performance of distributed simulations is influenced by the heterogeneity of execution units and by their respective background CPU load. Adaptive load balancing mechanisms could improve resource utilization and the simulation process execution by dynamically tuning the simulation load, with an eye to reducing synchronization and communication overheads. This work presents the GAIA+ framework: a new load balancing mechanism for distributed simulation. The framework has been evaluated by performing testbed simulations of a wireless ad hoc network model. Results confirm the effectiveness of the proposed solutions.
The document introduces Parallel Pixie Dust (PPD), a cross-platform thread library that aims to guarantee deadlock-free and race-condition free schedules that are optimal. It discusses the need for multiple threads due to factors like the memory wall. Current threading models are problematic because testing and debugging threaded code is difficult. PPD uses futures and thread pools to simulate data flow and generate tree-like thread schedules. It provides parallel versions of functions and thread-safe containers to enable multi-threaded standard library algorithms. The goal is to make writing correct multi-threaded programs easier.
A Survey on in-a-box parallel computing and its implications on system softwa... (ChangWoo Min)
1) The document surveys research on parallel computing using multicore CPUs and GPUs, and its implications for system software.
2) It discusses parallel programming models like OpenMP, Intel TBB, CUDA, and OpenCL. It also covers research on optimizing memory allocation, reducing system call overhead, and revisiting OS architecture for manycore systems.
3) The document reviews work on supporting GPUs in virtualized environments through techniques like GPU virtualization. It also summarizes projects that utilize the GPU in middleware for tasks like network packet processing.
This document summarizes a research paper that proposes a new heuristic called PAUSE for investigating the producer-consumer problem in distributed systems. The paper motivates the need to study this problem, describes PAUSE's approach of using compact configurations and decentralized components, outlines its implementation in Lisp and Java, and presents experimental results showing PAUSE outperforms previous methods. Related work investigating similar challenges is also discussed.
1. Enhancing Cache Coherent Architectures with Access Patterns for Embedded Manycore Systems
Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader
stephane.louise@cea.fr
Oct 11-12, 2012, SoC 2012
2. Background
Multicore and manycore systems: architecture and its future:
Single-processor time is over
Multiprocessors are here and will remain
Down to embedded systems (e.g. my cellphone)
Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way)
The future is manycore, even in the embedded world
We have to prepare for this
Programmability?
3–6. Programmability
What are the programming paradigms for manycores? How do we program them?
New paradigms (e.g. stream programming): still young, need to learn a new way of programming. Bad for legacy software (porting costs!)
MPI (OK for HPC applications, but also heavy work for parallelization)
OpenMP and the like: “only” adding some pragmas to parallelize an application
OpenMP relies on a shared memory model, so a shared memory behavior must be provided, and if possible done in hardware (because it is faster)
7–8. Shared memory consistency for multicore/manycores
For manycore systems, memory consistency = cache coherence mechanisms, based on the four-state MESI protocol:
Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory
Exclusive: the value is in only one core’s cache and wasn’t modified since it was accessed from memory
Shared: multiple copies of the value exist in the system, and only read operations were done
Invalid: the current copy that the core has must not be used and will be discarded
Use of Home Nodes to keep the state consistency:
For a given address in memory, only one core of the system keeps the coherence state
The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots
A processor mask tracks the cores that share the cache line
This baseline protocol is the base for all memory consistency systems in the state of the art
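As a concrete reading of the round-robin home-node rule above, here is a minimal C++ sketch; the 64-byte line size and the names are illustrative assumptions, not taken from the slides:

```cpp
#include <cstdint>

// The four MESI states listed on the slide.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Illustrative cache line size; the real mask width is a design choice.
constexpr std::uint64_t kLineSize = 64;

// Round-robin home-node selection: drop the in-line offset, then take
// the cache-line index modulo the core count, so consecutive lines map
// to consecutive cores and no single directory becomes a hot spot.
inline unsigned homeNode(std::uint64_t addr, unsigned numCores) {
    return static_cast<unsigned>((addr / kLineSize) % numCores);
}
```

With 16 cores, line 0 is tracked by core 0, line 1 by core 1, and so on, wrapping every 16 lines.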
9. Baseline implementation in a manycore system
[Figure: a tile with its L2 cache and cache coherence directory, attached to the memory interface and network interface; each directory entry holds an address plus coherence info (state and vector bit fields)]
10–13. Modification of a shared value by a given core
[Figure, animated over four steps ➀–➃, showing the coherence messages exchanged with the processors that hold a shared copy of the data when one core modifies it]
14. Introduction CoCCA approach evaluation Conclusion and perspective
Issues of baseline protocol and memory access patterns
Sometime a single write on a shared value triggers lot of
coherence traffic on the NoC
For regular but non conterminous access, lot accesses
Typical example: reading an image by column
But the accesses are simple and deterministic
In some areas the baseline protocol does not work as well as it
could and lacks a bit of scalability
In the embedded world lots of low level data processing display a
regular behavior WRT their memory accesses
Convolutions on images
usual transformations (e.g. FFT, DCT)
vector operation
etc.
7 / 23
15. Introduction CoCCA approach evaluation Conclusion and perspective
Issues of baseline protocol and memory access patterns
Sometimes a single write to a shared value triggers a lot of
coherence traffic on the NoC
Regular but non-contiguous accesses generate many coherence messages
Typical example: reading an image column by column
Yet these accesses are simple and deterministic
In such cases the baseline protocol performs below its potential
and lacks scalability
In the embedded world, much low-level data processing exhibits
regular memory access behavior:
convolutions on images
usual transformations (e.g. FFT, DCT)
vector operations
etc.
The idea: take advantage of these regular memory access patterns
to reduce coherence traffic and enable memory prefetch
State of the Art, Memory patterns and shared memory
coherence
Use of memory patterns:
Intel: special instructions to perform regular accesses to
memory, limited to a single core; Patent US 7,143,264 (2006)
IBM: special instructions used to detect and apply patterns,
also limited to a single cache; Patent US 7,395,407 (2008)
Other platforms:
the STAR project aims to provide a scalable manycore with a
coherent shared memory
Cache Coherence Architecture with patterns
Our enhancement to Cache Coherence Architecture (CCA)
Relies on the baseline protocol (adds to it, does not replace it)
Updates it with special cases for pattern management
Adds storage to each core for pattern storage and detection
Patterns are a result of the compilation process
It cannot work worse than the baseline, because the baseline
remains the default.
Modifies:
the core IP, with pattern storage and matching
the protocol, adding the speculative protocol to the baseline one
The patterns (and the speculative protocol) have their own Home
Node determination (which can be the same as or differ from the
baseline Home Node)
We call this modified system CoCCA (Codesigned CCA)
CoCCA architecture scheme
[Figure: CoCCA node architecture. Beside the baseline Coherence
Directory (entries: address, coherence info with state and vector
bit fields) and the memory and network interfaces, each node adds a
CoCCA Pattern Table (entries: address, pattern, coherence info with
state and bit fields).]
Estimated chip area overhead: ~+3%
Pattern definition and storage
Patterns are not stored the same way on use nodes and home
nodes
The minimum implementation uses a 2D strided access shape:
a start address
a stride length
a pattern length
On the home node, additionally: a pattern size
A speculative access fetches cache lines (as the baseline protocol
does), but the access pattern may need a more fine-grained
specification (overlaps)
Definition of triggers: a way of detecting the signature of a
pattern to fetch
the simplest trigger is the first address of the pattern access
Triggers and pattern definition
Pattern matching principle (hardware):
Pattern calculation (simple case):
Desc = fn(B_addr, s, δ)
B_addr: base address
s: size of the pattern
δ: interval (stride) between two consecutive accesses
E.g.: Pat(1, 4, 2)(@1) = { @2, @5, @8, @11 } (see the sketch below)
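A small C sketch of this expansion follows, under one plausible reading of the example: the first parameter is an offset from the trigger address, each access covers one address unit, and δ units separate two consecutive accesses, giving a step of δ + 1. This reading reproduces the addresses above; the actual hardware encoding may differ.

#include <stdio.h>

/* Expand Pat(offset, s, delta) fired at a trigger address. */
static void expand_pattern(unsigned long trigger, unsigned long offset,
                           unsigned long s, unsigned long delta)
{
    for (unsigned long i = 0; i < s; i++)
        printf("@%lu ", trigger + offset + i * (delta + 1));
    printf("\n");
}

int main(void)
{
    expand_pattern(1, 1, 4, 2);  /* prints: @2 @5 @8 @11 */
    return 0;
}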
Base of the modified protocol
[Flow chart: on the requester, a directory (DIR) lookup hit is
served by an L2 cache read; on a miss, a Pattern Table (PT) lookup
decides between sending a SPEC_RQ to the hybrid (CoCCA) Home Node on
a hit, and a plain RD_RQ to the baseline Home Node on a miss. The
baseline Home Node performs a DIR lookup and a memory access, then
answers with RD_RQ_AK. The hybrid Home Node first performs a pattern
lookup: on a hit it accesses memory and answers with RD_RQ_AK
messages covering the whole pattern length; on a miss it falls back
to the baseline path (DIR lookup, memory access, RD_RQ_AK).]
Without pattern information, or in case of a pattern miss, the
system acts as an ordinary baseline architecture
In case of a pattern hit, the speculative protocol is fired, as
sketched below
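Here is a C sketch of the requester-side decision, as read from the flow above; dir_lookup, pt_lookup and the send primitives are hypothetical placeholders for the hardware.

#include <stdbool.h>
#include <stdint.h>

extern bool dir_lookup(uint64_t addr);   /* local directory hit? */
extern bool pt_lookup(uint64_t addr);    /* pattern-table trigger hit? */
extern void read_l2(uint64_t addr);
extern void send_spec_rq(uint64_t addr); /* to the hybrid (CoCCA) Home Node */
extern void send_rd_rq(uint64_t addr);   /* to the baseline Home Node */

void request_line(uint64_t addr)
{
    if (dir_lookup(addr))
        read_l2(addr);       /* directory hit: read from the L2 cache */
    else if (pt_lookup(addr))
        send_spec_rq(addr);  /* pattern hit: fire the speculative protocol */
    else
        send_rd_rq(addr);    /* pattern miss: ordinary baseline request */
}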
Hardware tables and special instructions
A C-language description of the pattern-storing tables:

struct pattern_table {
    unsigned long capacity;   /* sizeof(address) */
    unsigned long size;       /* number of addresses */
    unsigned long *offset;    /* pattern offsets */
    unsigned long *length;    /* pattern lengths */
    unsigned long *stride;    /* pattern strides */
};

So it is possible to make a rough estimate of the size of an entry
in the pattern table.
A few specialized instructions manage the pattern tables (a usage
sketch follows):
PatternNew: to create a pattern,
PatternAddOffset: to add an offset entry,
PatternAddLength: to add a length entry,
PatternAddStride: to add a stride entry,
PatternFree: to release the pattern after use.
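A sketch of how compiler-emitted code might drive these instructions, treating them as intrinsics; the prototypes, operand lists and the column-read use case are assumptions.

/* Hypothetical intrinsic prototypes for the instructions above. */
extern unsigned long PatternNew(void);
extern void PatternAddOffset(unsigned long pat, unsigned long offset);
extern void PatternAddLength(unsigned long pat, unsigned long length);
extern void PatternAddStride(unsigned long pat, unsigned long stride);
extern void PatternFree(unsigned long pat);

/* Describe a column-wise read of an image of the given dimensions. */
void setup_column_pattern(unsigned long base, unsigned long width,
                          unsigned long height)
{
    unsigned long pat = PatternNew();   /* create the pattern entry */
    PatternAddOffset(pat, base);        /* first access at the image base */
    PatternAddLength(pat, height);      /* one access per image row */
    PatternAddStride(pat, width);       /* jump one row between accesses */
    /* ... the column traversal runs here, triggering the pattern ... */
    PatternFree(pat);                   /* release the entry after use */
}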
A first benchmark program for early evaluation
The benchmark program for our speculative protocol must:
be representative of typical embedded applications
stress several aspects of the protocol proposal
We chose a two-step cascading image filter (see the sketch below):
the result of the first filter is the source of the second filter
a 5x5 filter
applied to chunks of the image, one per core, with cache lines
shared both in read mode and in write mode
the result of the second filter is written back over the source
(write invalidation)
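A C sketch of the benchmark kernel under stated assumptions: the image size, kernel values and chunking are illustrative, and border handling and the inter-step barrier are omitted.

#define W 512                 /* assumed image width */
#define H 512                 /* assumed image height */

static float src[H][W], tmp[H][W];
extern const float k1[5][5], k2[5][5];  /* the two 5x5 kernels */

/* One 5x5 convolution tap; callers must stay 2 pixels from borders. */
static float conv5x5(const float img[H][W], const float k[5][5],
                     int y, int x)
{
    float acc = 0.0f;
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            acc += k[i][j] * img[y + i - 2][x + j - 2];
    return acc;
}

/* Each core filters its own chunk [y0,y1)x[x0,x1); rows near the
 * chunk frontier are also read by neighbouring cores (sharing). */
void filter_chunk(int y0, int y1, int x0, int x1)
{
    for (int y = y0; y < y1; y++)        /* first filter: src -> tmp */
        for (int x = x0; x < x1; x++)
            tmp[y][x] = conv5x5(src, k1, y, x);
    /* (synchronization barrier between the two steps omitted) */
    for (int y = y0; y < y1; y++)        /* second filter: tmp -> src */
        for (int x = x0; x < x1; x++)
            src[y][x] = conv5x5(tmp, k2, y, x);  /* write back over source */
}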
Memory mapping of our benchmark program
[Figure: memory mapping of the benchmark program.]
Instrumentation choice: Pin/Pintools
Pin/Pintools:
Pin is a binary instrumentation framework that relies on JIT
techniques to accelerate instrumentation. It is an Intel project
Pin works in association with a programmable instrumentation
tool called a Pintool
Several Pintools are provided in the basic distribution of Pin
We used:
inscount: a Pintool that counts the executed instructions
pinatrace: a Pintool that traces and logs all the memory
accesses (load/store operations)
See paper for details.
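Typical invocations look as follows, assuming the stock example Pintools have been built; exact paths and output file names depend on the Pin version and platform.

$ pin -t obj-intel64/inscount0.so -- ./benchmark   # writes inscount.out
$ pin -t obj-intel64/pinatrace.so -- ./benchmark   # writes pinatrace.out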
Data sharing and prefetch
[Figure: read data sharing in conterminous rectangles Rect. i,
Rect. i+1, Rect. i+7 and Rect. i+8; legend: exclusive data (1 core
only), data shared by 2 cores, data shared by 4 cores.]
We can define three kinds of patterns on this benchmark:
Source image prefetch and setting of old Shared values (S) to
Exclusive values (E) when the source image becomes the
destination (2 patterns per core)
False concurrency of write accesses between two rectangles of
the destination image. This happens because the frontiers are
not aligned with L2 cache lines. The associated pattern is 6
vertical lines with 0 bytes in common
Shared read data (because convolution kernels read pixels in
conterminous rectangles, see the figure above). There are 6
vertical lines and 3 sets of two horizontal lines for these
patterns
After simplification, only 6 patterns are required
Evaluation results
Condition                          MESI    CoCCA
Shared line invalidation          34560    17283
Exclusive line sharing (2 cores)  12768    12768
Exclusive line sharing (4 cores)   1344      772
Total throughput                  48672    30723

37% reduction in coherence message throughput
prefetch accounts for 10% of cache accesses
this means that without prefetch the application runs 67% slower
(20 cycles for an on-chip shared cache access, 80 cycles for an
external memory access)
Contributions
Shared memory and coherence are important for the
programmability of CMPs
State-of-the-art cache coherence mechanisms fall into worst-case
behaviors for scenarios that seem simple: regular, patterned
accesses to memory
We defined an extension of the cores to store patterns
We extended the baseline protocol with a speculative protocol
For embedded systems, the pattern tables are built as part of
the compilation process
Only a few pattern entries are necessary for each typical
low-level filter
Patterns can significantly reduce coherence message throughput
Patterns allow early and efficient cache preloading, which
significantly accelerates applications
May provide a path to cache coherency in massive manycores
Future work and perspective
extend the number of benchmark applications to draw more
general conclusions
apply our ideas in a NoC simulator to run cycle-accurate
simulations
include it in a full-scale simulator (e.g. SoCLib)
extend our work toward an HPC-friendly architecture that would
determine patterns dynamically at runtime
Thank you for your attention
Questions?
ALCHEMY workshop @ ICCS 2013 (Barcelona)
The International Conference on Computational Science (ICCS) can be
a good place to talk with people using HPC architectures for their
needs.
Loïc Cudennec and I are organizing a workshop on the issues that
are arising with future manycore systems (number of cores > 1000
and beyond):
Architecture, Language, Compilation and Hardware support
for Emerging ManYcore systems: the ALCHEMY workshop
Topics:
Advanced architecture support for massive parallelism
management
Advanced architecture support for enhanced communication
for manycores
Full paper submission: December 15th. Notification: February 10.