Process Synchronization in Multi-core Systems Using On-Chip Memories
Arun Joseph, Nagu Dhanwada
arujosep@in.ibm.com, nagu@us.ibm.com
Systems & Technology Group, IBM
SUMMARY
We present a novel process synchronization mechanism and the
application of on-chip memories for process synchronization in
multi-core systems.
A multi-core processor architecture and a signaling scheme that
support the novel process synchronization mechanism are
presented.
The validity of the proposed synchronization mechanism is
demonstrated by experiments on a virtual prototyping platform.
Comparison against external memory based schemes shows that
the proposed use of on-chip memories in multi-core process
synchronization is an effective solution to reduce synchronization
overheads.
INTRODUCTION
• Multi-core applications need to synchronize the computations in the different
processor cores, so that the computations can proceed with integrity.
• A wide range of working solutions is available, both lock-based and lock-free.
• Lock-based techniques lock a shared variable to get exclusive access to
the data. Another process that needs the shared variable remains in a
busy-wait state, frequently checking whether the lock has become free,
and then competes for the lock once it is released. [1]
• Lock-free techniques allow multiple threads to concurrently read and write
shared data without corrupting it. [2]
• These techniques make use of atomic operations provided by the
processor architecture, which allow a process to test whether the lock is
free and, if it is, acquire the lock in a single atomic operation. [3, 4]
• We introduce a multi-core process synchronization mechanism which is
based on a novel signaling scheme, which does not need the support of
atomic operations or disabling of interrupts.
• Performance overhead of synchronization operations is dependent on the
number of remote accesses required, and also the latency of each remote
access.
• A significant amount of on-chip memory is available in recent multi-core
architectures like the Cell BE, and on-chip memory is known to improve overall
system performance by reducing access time significantly.
• We present a first-of-its-kind approach to exploit the available on-chip
memory for efficient process synchronization.
PREVIOUS RELATED WORK
• Commonly used lock-based schemes are semaphores and condition
variables. Non-blocking synchronization algorithms are designed in such a
way that a critical section is not required. Their implementation requires
specific atomic operations such as compare-and-swap (CAS). Herlihy shows
that any lock-free mechanism can be implemented using the CAS atomic
primitive together with other primitive operations [2].
• The proposed mechanism is based on an on-chip memory which is non-
caching and shared by all the processor cores and provides a memory
region for each processor core with exclusive write access, while all the
cores have read access.
• To our knowledge, the proposed signaling scheme is fundamentally different
from prior approaches, and does not require any atomic instructions or the
need for disabling interrupts.
• Though on-chip memories have been used for a wide range of applications,
including speed-up [9, 10], to our knowledge, this is the first work to study
the use of on-chip, shared, non-cached memories to reduce multi-core
process synchronization overheads.
COMPONENTS OF THE SCHEME
• The main components of the proposed process synchronization mechanism
are:
• (a) n-core multi-core processor, with an On-chip, Shared, Non-caching
(OSN) memory.
• (b) A novel signaling scheme.
• The OSN memory is not essential to the proposed scheme, and in its
absence an External, Shared, Non-Caching (ESN) memory can be used for
the same purpose, with a penalty in performance.
THE MULTI-CORE PROCESSOR
• Similar architectures have been
explored in processors like the
Cell [6], and other academic work
[10].
• Efficient usage of on-chip memory
is important [9].
• OSN memory is used for building
a signaling scheme and hence
only a small amount of the
memory is required.
• While all processor cores have
read access to the OSN memory,
each core has dedicated regions
in the OSN memory, where it has
exclusive write access.
Figure 1. Multi-core processor with on-chip memory.
[Diagram labels: Core 0, Core 1, ..., Core n-1, each with a Cache; On-Chip Memory; External Memory; System Bus.]
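The access rule described above (every core may read, each core writes only to its own region) can be illustrated with a small software model. This is only a sketch: the class name OsnMemory, the back-to-back region layout and the use of PermissionError are assumptions made for illustration, not part of the presented hardware.

```python
class OsnMemory:
    """Toy model of a shared, non-caching memory with one write region per core."""

    def __init__(self, n_cores, region_size):
        self.n_cores = n_cores
        self.region_size = region_size
        self.cells = bytearray(n_cores * region_size)  # reset state: all zero

    def owner(self, addr):
        # Assumption: regions are laid out back to back, one per core.
        return addr // self.region_size

    def read(self, core, addr):
        # Every core has read access to every location.
        return self.cells[addr]

    def write(self, core, addr, value):
        # Only the owning core may write inside its region.
        if self.owner(addr) != core:
            raise PermissionError(f"core {core} cannot write address {addr}")
        self.cells[addr] = value


mem = OsnMemory(n_cores=4, region_size=8)
mem.write(core=1, addr=8, value=0xFE)     # address 8 lies in core 1's region
assert mem.read(core=3, addr=8) == 0xFE   # any other core can read it
```

In real hardware the write restriction would be enforced by the memory or bus logic rather than by a software check.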
PROCESS SYNCHRONIZATION MECHANISM
Signaling Scheme
• The proposed signaling scheme provides a two-state system to support a
continuous signaling mechanism.
• Processor cores are logically numbered from 0 to (n-1), where n is the total
number of processor cores.
• The basic signaling mechanism from a signal generator to a signal receiver
is based on the proposed concepts of a signal location and two value
locations.
• The signal location is a specific location in the on-chip memory for which
only the signal generator has write access and all others have read access.
• Of the two value locations, one location is managed by the signal generator
in a location on the on-chip memory for which it has write access. The
second value location is managed by the receiving side in a location on the
on-chip memory for which the receiver has write access.
• The value locations have only two states other than the initial/reset state,
which is a zero value state.
• These two non-zero states are values
that can be set by the core managing the location.
• For example, these states can be 0xfe and 0xff, and a state toggle can be
obtained by an 'exclusive or' operation with 0x01.
• A signal from the generator to the receiver is set when the signal location and
the receiver value location hold the same value.
• After setting the signal, the signal generator toggles its generator value
location. The receiver, after receiving the signal, toggles its receiver
value location, so that a new state is formed for the next signal.
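The two-state toggle and the signal test can be written down in a few lines. This is a minimal sketch using the example values 0xfe and 0xff from the text. The function names are hypothetical, and the guard against the zero reset state is an assumption added so that two freshly reset locations do not look like a pending signal.

```python
INITIAL = 0x00                   # initial/reset state of every location
STATE_A, STATE_B = 0xFE, 0xFF    # the two example states from the text


def toggle(value):
    """Switch between the two non-zero states with an exclusive or with 0x01."""
    return value ^ 0x01


def signal_pending(signal_location, receiver_value):
    """A signal is set when the signal location equals the receiver value location.

    Assumption: a receiver value still in the reset state never counts as a match.
    """
    return receiver_value != INITIAL and signal_location == receiver_value


assert toggle(STATE_A) == STATE_B
assert toggle(STATE_B) == STATE_A
assert signal_pending(0xFE, 0xFE)
assert not signal_pending(0xFE, 0xFF)   # receiver already toggled: nothing new
```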
PROCESS SYNCHRONIZATION MECHANISM
Signaling Sequence and System
• At the end of the signaling phase,
a new state is formed and the
signaling process can continue.
• An acknowledgment can be
obtained by a reply signal. With
this signal mechanism the full
implementation of the signaling
system can be built.
• The signaling system in a
processor with n cores is
implemented using one n×n
Signal Location Matrix and two
n×n Value Location Matrices.
• These three matrices are
maintained in the OSN memory
(or the ESN memory, if the
external memory scheme is used).
Phase            Step                                                              Signal Location   Generator Value Location   Receiver Value Location
Initialization   Step 1: Initial / Reset State                                     0x00              0x00                       0x00
Initialization   Step 2: Cores Set Value Locations                                 0x00              0xfe                       0xfe
Signaling        Step 3: Generator Sets Signals                                    0xfe              0xfe                       0xfe
Signaling        Step 4: Generator Toggles Value Locations                         0xfe              0xff                       0xfe
Signaling        Step 5: Receiver Receives Signal and Toggles its Value Location   0xfe              0xff                       0xff
Figure 2. Signaling Sequence.
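The five steps of the signaling sequence can be replayed in software to check the values at each step. The sketch below is illustrative only: the Channel class and its field names are hypothetical stand-ins for one signal location and its two value locations.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    signal: int = 0x00       # signal location, written only by the generator
    gen_value: int = 0x00    # generator value location, written only by the generator
    recv_value: int = 0x00   # receiver value location, written only by the receiver

    def snapshot(self):
        return (self.signal, self.gen_value, self.recv_value)


def run_sequence():
    ch = Channel()
    trace = [ch.snapshot()]             # Step 1: initial / reset state
    ch.gen_value = 0xFE                 # Step 2: generator sets its value location
    ch.recv_value = 0xFE                #         receiver sets its value location
    trace.append(ch.snapshot())
    ch.signal = ch.gen_value            # Step 3: generator sets the signal
    trace.append(ch.snapshot())
    ch.gen_value ^= 0x01                # Step 4: generator toggles its value location
    trace.append(ch.snapshot())
    if ch.signal == ch.recv_value:      # Step 5: receiver sees the signal...
        ch.recv_value ^= 0x01           #         ...and toggles its value location
    trace.append(ch.snapshot())
    return trace


for step, state in enumerate(run_sequence(), start=1):
    print(step, [hex(v) for v in state])
```

After step 5 the signal location (0xfe) no longer matches the receiver value (0xff), so the channel is ready for the next signal, which the generator would set by writing 0xff.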
PROCESS SYNCHRONIZATION MECHANISM
Signaling System
• We refer to the signal location matrix as 'S' and the two value location
matrices as 'G' and 'R'.
• While G holds the value for setting the signal on the generator side, R holds
the expected value on the receiver side.
• Each row of the S matrix holds the signal locations of one of the n
processor cores. In other words, the ith row vector of Matrix S corresponds
to the ith core, and its elements are locations in the on-chip memory for which
core-i has write access.
• Rows of G and R are also placed in the on-chip memory. The jth location in the
ith row vector of Matrix S is the signal location that core-i uses to set a signal for
core-j. Core-i uses the jth location of the ith row vector of Matrix G as the value
for setting the signal to core-j.
• In a similar way, core-j looking for a signal from core-i looks at the jth location of
the ith row of S for a value equal to the ith location of the jth row of Matrix R.
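The indexing rules above can be made concrete with three n x n lists. This is a sketch under stated assumptions: the class and method names are hypothetical, and G and R start directly at 0xfe (the state after the initialization phase) rather than at the zero reset state.

```python
class SignalSystem:
    """Signal location matrix S and value location matrices G and R for n cores."""

    def __init__(self, n):
        self.n = n
        # Row i of S and G is written only by core i; row j of R only by core j.
        self.S = [[0x00] * n for _ in range(n)]
        self.G = [[0xFE] * n for _ in range(n)]   # assumption: already initialized
        self.R = [[0xFE] * n for _ in range(n)]   # assumption: already initialized

    def set_signal(self, i, j):
        """Core i signals core j: copy G[i][j] into S[i][j], then toggle G[i][j]."""
        self.S[i][j] = self.G[i][j]
        self.G[i][j] ^= 0x01

    def check_signal(self, j, i):
        """Core j polls, without blocking, for a signal from core i."""
        if self.S[i][j] == self.R[j][i]:
            self.R[j][i] ^= 0x01    # form the new state for the next signal
            return True
        return False


system = SignalSystem(4)
assert not system.check_signal(2, 0)   # nothing sent yet
system.set_signal(0, 2)                # core 0 signals core 2
assert system.check_signal(2, 0)       # core 2 receives it once
assert not system.check_signal(2, 0)   # and does not see it a second time
```

One consequence of the two-state design is visible in this model: if core i sets a second signal before core j has consumed the first, S[i][j] no longer matches R[j][i] and the signal is missed, so the generator has to wait for the reply before signaling again.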
PROCESS SYNCHRONIZATION MECHANISM:
Signaling process from core i to core j
[Figures 3 to 6: each panel shows the ith row of the S, G and R matrices (Core = i) and the jth row of the S, G and R matrices (Core = j), with locations 0 to n. Values shown in each panel:]

Figure   State                                                         G (row i, location j)   S (row i, location j)   R (row j, location i)
3        Current state                                                 0xfe                    0xff                    0xfe
4        i sets its signal to j                                        0xfe                    0xfe                    0xfe
5        i toggles its G location                                      0xff                    0xfe                    0xfe
6        j identifies and receives signal and toggles its R location   0xff                    0xfe                    0xff

Figure 3. Core i to j - Current State.
Figure 4. Core i to j - i sets its signal to j.
Figure 5. Core i to j - i toggles its G location.
Figure 6. Core i to j - j identifies and receives signal and toggles its R location.
PROCESS SYNCHRONIZATION MECHANISM
• The process synchronization between two cores, say 'Core-i' and 'Core-j' is
implemented as follows:
– Core-i sets signal to Core-j.
– Core-i waits for signal from Core-j.
– Core-j waits for signal from Core-i.
– Core-j gets the signal from Core-i.
– Core-j sets reply signal to Core-i.
– Core-i gets the reply signal from Core-j.
• The basic synchronization scheme is built on three matrices of order n×n,
where n is the number of cores. Hence, for example, assuming one byte per
location, the scheme for a 1000 core system needs 3 x 1000 x 1000 bytes, i.e.
about 3MB of on-chip memory.
• The scheme has the potential to be extended to multiple types of signals
and inter-core communication, which requires extra memory to implement.
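The signal and reply sequence between Core-i and Core-j can be tried out with two threads standing in for the two cores. This is a rough sketch, not the authors' implementation: the function names, the mailbox variable and the use of Python threads are assumptions made for illustration. Each thread writes only to its own rows of S, G and R. The last helper reproduces the memory estimate, assuming one byte per location.

```python
import threading
import time


def run_handshake(rounds):
    """Run `rounds` signal/reply handshakes between core 0 (i) and core 1 (j)."""
    n = 2
    S = [[0x00] * n for _ in range(n)]   # signal locations, row i written by core i
    G = [[0xFE] * n for _ in range(n)]   # generator values, row i written by core i
    R = [[0xFE] * n for _ in range(n)]   # receiver values, row j written by core j
    mailbox = [None]                     # hypothetical shared data guarded by the handshake
    received = []

    def set_signal(i, j):
        S[i][j] = G[i][j]
        G[i][j] ^= 0x01

    def wait_signal(j, i):
        while S[i][j] != R[j][i]:
            time.sleep(0)                # yield to the other thread; a core would just poll
        R[j][i] ^= 0x01

    def core_i():
        for k in range(rounds):
            mailbox[0] = k
            set_signal(0, 1)             # Core-i sets signal to Core-j
            wait_signal(0, 1)            # Core-i waits for the reply signal from Core-j

    def core_j():
        for _ in range(rounds):
            wait_signal(1, 0)            # Core-j waits for and gets the signal from Core-i
            received.append(mailbox[0])
            set_signal(1, 0)             # Core-j sets reply signal to Core-i

    threads = [threading.Thread(target=core_i), threading.Thread(target=core_j)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


def matrices_bytes(n_cores, bytes_per_location=1):
    """Memory for the three n x n matrices (one byte per location is an assumption)."""
    return 3 * n_cores * n_cores * bytes_per_location


assert run_handshake(20) == list(range(20))
assert matrices_bytes(1000) == 3_000_000   # about 3MB for 1000 cores
```

Because Core-i does not overwrite the mailbox until it has received the reply, Core-j always reads the value that belongs to the current round.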
EXPERIMENTAL SETUP
• The mechanism was simulated on
a multi-core System-On-Chip
(SOC) virtual prototyping platform,
as shown in Figure 11.
• The platform also provides a
mechanism for plugging in user-
defined modules to support
abstraction of additionally defined
hardware components.
• The CoreConnect-based [16] SOC
has 8 processor cores and 1MB of OSN
memory, in addition to several
other peripherals and bus
components.
Figure 11. Virtual Multi-core SOC.
• Different experiments were carried out, while running a parallel multiplication
of two 16x16 matrices.
• In scenario 1: Process synchronization was achieved using the proposed
OSN-based synchronization technique.
• In scenario 2: Process synchronization was achieved using the proposed
ESN-based synchronization technique.
• In scenario 3: Process synchronization was achieved using an external
memory based semaphore, BetaSemaphore, which was implemented using
an atomic Test and Set operation, as defined in [14].
• Performance comparisons between scenarios 1 and 2 indicate that even for
applications like matrix multiplication, where the number of synchronization
operations is small, the impact of the OSN memory on reducing
synchronization overhead is reasonably significant, especially as the
number of cores increases.
EXPERIMENTAL RESULTS
• The proposed process
synchronization scheme is not
expected to reduce
synchronization overheads
unless used with the OSN memory.
• For an 8-core SOC, a speed-up of
7.5 was seen with the OSN-based
technique vs. 5.5 with the ESN-based
technique.
• The performance of scenarios 2 and
3 is more comparable. The delta
between the two can potentially be
attributed to the differences in the
approach used to implement them.
Scenario 1: Proposed Scheme using OSN Memory
Number of processors   Execution time (us)   Idle time (us)   Speed-Up
1                      5205753               54662            1.0
2                      2632011               92797            2.0
4                      1338610               171432           3.9
8                      695988                344361           7.5

Scenario 2: Proposed Scheme using ESN Memory
Number of processors   Execution time (us)   Idle time (us)   Speed-Up
1                      5205733               54662            1.0
2                      2632086               106798           2.0
4                      1462357               332871           3.6
8                      820631                469267           5.5

Scenario 3: Semaphore using ESN Memory
Number of processors   Execution time (us)   Idle time (us)   Speed-Up
1                      5205733               54662            1.0
2                      2632094               106802           2.0
4                      1607896               213437           3.2
8                      912465                450986           5.7

Figure 13. Scenarios 1-3
EXPERIMENTAL SETUP
• In another study, a micro benchmark was created to force 10000
synchronization operations, and the time taken for those 10000
operations was extracted using selective profiling functions provided in the
virtual prototyping platform.
• The study was done on the same SOC as before, but with 2 and 4
processor cores, in 3 different scenarios.
• In scenario 4 the OSN-based proposed synchronization scheme was used.
• In scenario 5, the OSN-based BetaSemaphore implementation was used.
• In scenario 6 the ESN-based BetaSemaphore implementation was used.
EXPERIMENTAL RESULTS
• Synchronization overheads from scenarios 4 and 5 are in a comparable
range, and again the overhead in scenario 4 was lower than in scenario 5.
• With 4 cores, the overhead of scenario 4 was approximately 1/4th the
synchronization overhead of scenario 6 (about 2/5th with 2 cores), and the gap
is expected to widen further as the number of cores increases.
• This strongly suggests that, irrespective of the process synchronization scheme
used, the OSN memory significantly reduces the process synchronization
overheads, especially as the number of processor cores increases.
Synchronization Overhead for 10000 synchronizations (in us)
No. of cores   Scenario 4: Proposed Scheme (OSN)   Scenario 5: BetaSemaphore (OSN)   Scenario 6: BetaSemaphore (ESN)
2              1500723                             1862737                           3751627
4              2775356                             3184996                           12185251
Figure 14. Scenarios 4-6
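The relative overheads can be recomputed directly from the figures reported for scenarios 4 to 6. The snippet below only restates those measurements; the dictionary layout and the helper name are illustrative choices.

```python
# Synchronization overhead for 10000 synchronizations, in microseconds,
# copied from the reported measurements for 2 and 4 cores.
overhead_us = {
    2: {"scenario4": 1500723, "scenario5": 1862737, "scenario6": 3751627},
    4: {"scenario4": 2775356, "scenario5": 3184996, "scenario6": 12185251},
}


def ratio(cores, a, b):
    """Overhead of scenario `a` relative to scenario `b` for a given core count."""
    return overhead_us[cores][a] / overhead_us[cores][b]


for cores in sorted(overhead_us):
    vs_esn = ratio(cores, "scenario4", "scenario6")
    vs_osn = ratio(cores, "scenario4", "scenario5")
    print(f"{cores} cores: scenario 4 / scenario 6 = {vs_esn:.2f}, "
          f"scenario 4 / scenario 5 = {vs_osn:.2f}")
```

The scenario 4 to scenario 6 ratio comes out at roughly 0.40 for 2 cores and 0.23 for 4 cores, which is where the "approximately 1/4th" figure for the 4-core case comes from.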
CONCLUSION & FUTURE WORK
• A novel multi-core signaling scheme and a process synchronization
mechanism are presented.
• We have also presented the notion of using on-chip, shared, non-cached
memories to reduce the process synchronization overheads in multi-core
systems.
• The basic signaling scheme presented is a two-state mechanism. However,
the scheme can be extended further as a signaling system with multiple
states.
• We are investigating how multiple types of signals can be implemented by
providing a specified number of locations, maintained by the generator and
read by the receiver, to further classify the signal.
• The scheme can be extended to enable inter-processor communication.
REFERENCES
1. J. M. Mellor-Crummey and M. L. Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors,
ACM Trans. On Computer Systems, 9(1), February 1991.
2. M.P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124--149,
January 1991.
3. Intel Corp. Intel Itanium 2 processor reference manual.
4. C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture: A Specification for a New Family of
Processors, 2nd edition. Morgan Kaufmann, May 1994.
5. Zhen Fang, Lixin Zhang, John B. Carter, Liqun Cheng, and Michael Parker. 2005. Fast synchronization on shared-
memory multiprocessors: An architectural approach. J. Parallel Distrib. Comput. 65, 10 (October 2005), 1158-1170.
6. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. 2005. Introduction to the cell
multiprocessor. IBM J. Res. Dev. 49, 4/5 (July 2005), 589-604.
7. L. A. Polka et al., Intel Technology Journal, vol. 11, 197 (2007).
8. A. Silberschatz, P. B. Galvin, G. Gagne, “Operating System Concepts”, 7th ed.: John Wiley & Sons, Inc., 2005.
9. Preeti Ranjan Panda, Nikil D. Dutt, and Alexandru Nicolau. 2000. On-chip vs. off-chip memory: the data partitioning
problem in embedded processor-based systems. ACM Trans. Des. Autom. Electron. Syst. 5, 3 (July 2000), 682-704
10. C. Villavieja, I. Gelado, A. Ramirez, and N. Navarro, "Memory Management on Chip-MultiProcessors with on-chip
Memories", Proc. workshop on the Interaction between Operating Systems and Computer Architecture, 2008.
11. N.R. Dhanwada, R.A. Bergamaschi, W.W. Dungan, I. Nair, P. Gramann, W.E. Dougherty, and I. Lin, "Transaction-
level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems", presented at Design
Autom. for Emb. Sys., 2005, pp. 105-125.
12. Meet the PowerPC 405 Evaluation Kit, 2005.
13. The Open SystemC Initiative. http://www.systemc.org.
14. Benini, L., D. Bertozzi, D. Bruni, N. Drago, F. Fummi, M. Poncino. Legacy SystemC Co-Simulation of Multi-Processor
Systems-on-Chip. In Proceedings 2002 IEEE International Conference on Computer Design: VLSI in Computers and
Processors (ICCD), IEEE, 494, 2002.
15. PowerPC User Instruction Set Architecture Book I Version 2.02
16. The CoreConnect™ Bus Architecture, 1999
PROCESS SYNCHRONIZATION MECHANISM
Appendix A
• Though the proposed scheme is a blocking scheme, it need not be so if the program
logic permits.
– Core-i can set the signal to Core-j and continue rather than wait, until it needs
the acknowledgment or is about to set the next signal. In a similar way, Core-j need not
wait for a signal from Core-i. If the program logic permits, it can check for
the signal, continue if the signal is not available, and wait for the signal only when it
is really needed.
– If there is a synchronization point but it is not clear which core should initiate the
signal, it is possible to adopt the convention that the lower numbered core sets the
signal, and the other waits for and acknowledges the signal.
• In a similar way, synchronization of a group of cores, or a barrier point, can be
implemented.
– The highest numbered core scans for signals from all the lower numbered
cores, while each lower numbered core sets a signal to the highest numbered
core and waits for an acknowledgment from it. When
the highest numbered core has received signals from all other cores, it sets
an acknowledgment to all other cores.
– Since a core can check for a signal without blocking, signals from a number of
cores arriving in a random sequence can be handled by searching in a cyclic
manner. It can also be seen that the synchronization is built on a scheme in
which each core writes only to locations where it has exclusive write access. Hence,
servicing of interrupts has no adverse effects on the synchronization scheme.
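The barrier described above can be sketched with one thread per core. This is an illustrative model only: the function names, the event log and the use of Python threads are assumptions. The highest numbered core scans cyclically with a non-blocking check and then acknowledges every other core.

```python
import threading
import time


def run_barrier(n):
    """Run n threads through one barrier; return the ordered event log."""
    S = [[0x00] * n for _ in range(n)]   # row i written only by core i
    G = [[0xFE] * n for _ in range(n)]   # row i written only by core i
    R = [[0xFE] * n for _ in range(n)]   # row j written only by core j
    top = n - 1                          # highest numbered core collects the signals
    events = []

    def set_signal(i, j):
        S[i][j] = G[i][j]
        G[i][j] ^= 0x01

    def check_signal(j, i):
        # Non-blocking: core j looks for a signal from core i.
        if S[i][j] == R[j][i]:
            R[j][i] ^= 0x01
            return True
        return False

    def barrier(core):
        if core != top:
            set_signal(core, top)                 # announce arrival
            while not check_signal(core, top):    # wait for the acknowledgment
                time.sleep(0)
        else:
            pending = set(range(top))
            while pending:                        # cyclic scan, any arrival order
                for c in sorted(pending):
                    if check_signal(top, c):
                        pending.discard(c)
                time.sleep(0)
            for c in range(top):
                set_signal(top, c)                # acknowledge every other core

    def worker(core):
        events.append(("before", core))
        barrier(core)
        events.append(("after", core))

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


log = run_barrier(4)
assert [kind for kind, _ in log[:4]] == ["before"] * 4   # nobody leaves early
```

No core records an "after" event until every core has recorded its "before" event, which is the property a barrier point has to provide.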