Process Synchronization
in Multi-core Systems
Using On-Chip Memories
Arun Joseph, Nagu Dhanwada
arujosep@in.ibm.com, nagu@us.ibm.com
Systems & Technology Group, IBM
SUMMARY
• We present a novel process synchronization mechanism and the application of on-chip memories for process synchronization in multi-core systems.
• A multi-core processor architecture and a signaling scheme that support the novel process synchronization mechanism are presented.
• The validity of the proposed synchronization mechanism is demonstrated by experiments on a virtual prototyping platform.
• Comparison against external-memory-based schemes shows that the proposed use of on-chip memories for multi-core process synchronization is an effective way to reduce synchronization overheads.
INTRODUCTION
• Multi-core applications need to synchronize the computations running on the different processor cores so that the computations can proceed with integrity.
• A wide range of working solutions is available: lock-based and lock-free.
• Lock-based techniques lock a shared variable to get exclusive access to the data; another process that needs the shared variable remains in a busy-wait state, repeatedly checking whether the lock has become free, and then competes for the lock once the variable is released. [1]
• Lock-free techniques allow multiple threads to concurrently read and write shared data without corrupting it. [2]
• These techniques rely on atomic operations provided by the processor architecture, which allow a single process to test whether the lock is free and, if so, acquire it in a single atomic operation (a minimal sketch of such a lock is shown below). [3, 4]
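For illustration, a minimal test-and-set spinlock built on C11 atomics. This is a generic sketch of the kind of lock-based primitive referenced above, not code from the paper; the type and function names are assumptions for this example.

```c
#include <stdatomic.h>

/* Minimal test-and-set spinlock built on a C11 atomic_flag. This is the
 * style of primitive the proposed scheme avoids: acquiring the lock needs
 * an atomic read-modify-write, and waiters busy-wait until it is released. */
typedef struct { atomic_flag locked; } spinlock_t;

/* A lock starts in the cleared (free) state. */
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void spinlock_acquire(spinlock_t *l) {
    /* The atomic test-and-set returns the previous value: spin while the
     * flag was already set, i.e. while another core holds the lock. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* busy-wait */
}

static inline void spinlock_release(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Usage: a critical section protected by the lock. */
static spinlock_t lock = SPINLOCK_INIT;
static int shared_counter;

void update_shared(void) {
    spinlock_acquire(&lock);
    shared_counter++;          /* exclusive access to the shared data */
    spinlock_release(&lock);
}
```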
INTRODUCTION
• We introduce a multi-core process synchronization mechanism based on a novel signaling scheme that requires neither atomic operations nor the disabling of interrupts.
• The performance overhead of synchronization operations depends on the number of remote accesses required and on the latency of each remote access.
• A significant amount of on-chip memory is available in recent multi-core architectures such as the Cell BE; using it is known to improve overall system performance by significantly reducing access time.
• We present a first-of-its-kind approach to exploit the available on-chip memory for efficient process synchronization.
PREVIOUS RELATED WORK
• Commonly used lock-based schemes are semaphores and condition variables. Non-blocking synchronization algorithms are designed in such a way that a critical section is not required; their implementation requires specific atomic operations such as compare-and-swap (CAS). Herlihy shows that, using the CAS primitive together with other primitive operations, any lock-free mechanism can be implemented [2] (a minimal CAS-based sketch follows this list).
• The proposed mechanism is based on an on-chip memory that is non-caching and shared by all the processor cores, and that provides each processor core with a memory region to which it has exclusive write access, while all the cores have read access.
• To our knowledge, the proposed signaling scheme is fundamentally different from prior approaches, and requires neither atomic instructions nor the disabling of interrupts.
• Though on-chip memories have been used for a wide range of applications, including application speed-up [9, 10], to our knowledge this is the first work to study the use of on-chip, shared, non-cached memories to reduce multi-core process synchronization overheads.
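As a concrete illustration of the CAS-based lock-free style referred to above, a minimal C11 sketch (illustrative only, not code from the paper):

```c
#include <stdatomic.h>

/* A lock-free counter increment built on compare-and-swap (CAS): retry
 * until the CAS succeeds, with no lock held and no interrupts disabled. */
void lockfree_increment(atomic_int *counter) {
    int observed = atomic_load(counter);
    /* On failure, atomic_compare_exchange_weak stores the freshly observed
     * value into 'observed', so the loop retries with the current value. */
    while (!atomic_compare_exchange_weak(counter, &observed, observed + 1))
        ;  /* retry */
}
```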
COMPONENTS OF THE SCHEME
• The main components of the proposed process synchronization mechanism are:
• (a) An n-core multi-core processor with an On-chip, Shared, Non-caching (OSN) memory.
• (b) A novel signaling scheme.
• The OSN memory is not essential to the proposed scheme; in its absence, an External, Shared, Non-caching (ESN) memory can be used for the same purpose, at a performance penalty.
THE MULTI-CORE PROCESSOR
• Similar architectures have been explored in processors like the Cell [6] and in other academic work [10].
• Efficient usage of on-chip memory is important [9].
• The OSN memory is used for building a signaling scheme, and hence only a small amount of memory is required.
• While all processor cores have read access to the OSN memory, each core has dedicated regions in the OSN memory to which it has exclusive write access.
Figure 1. Multi-core processor with on-chip memory: n cores (Core 0 to Core n-1), each with a private cache, connected over the system bus to the on-chip memory and the external memory.
PROCESS SYNCHRONIZATION MECHANISM
Signaling Scheme
• The proposed signaling scheme provides a two-state system to support a continuous signaling mechanism.
• Processor cores are logically numbered from 0 to n-1, where n is the total number of processor cores.
• The basic signaling mechanism from a signal generator to a signal receiver is based on the proposed concepts of a signal location and two value locations.
• The signal location is a specific location in the on-chip memory for which only the signal generator has write access and all other cores have read access.
• Of the two value locations, one is managed by the signal generator, in an on-chip memory location to which it has write access; the second is managed by the receiver, in an on-chip memory location to which the receiver has write access.
PROCESS SYNCHRONIZATION MECHANISM
Signaling Scheme
• The value locations have only two states other than the initial/reset state, which is a zero-value state.
• The two non-zero value location states are two values that can be set by the core managing the location.
• For example, these states can be 0xfe and 0xff, and a state toggle can be obtained by an 'exclusive or' operation with 0x01.
• A signal is considered set from the generator to the receiver when the signal location and the receiver's value location hold the same value.
• After setting the signal, the generator toggles its generator value location, and the receiver, after receiving the signal, toggles its receiver value location, so that a new state is formed for the next signal (a minimal sketch of one such channel follows).
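The per-channel behavior described above can be sketched in C as follows. The struct, field names and STATE_A/STATE_B constants are illustrative assumptions; in the actual system each field would be a fixed, non-cached location in the OSN memory, writable only by its owning core.

```c
#include <stdint.h>

#define STATE_A 0xfe   /* first non-zero value-location state  */
#define STATE_B 0xff   /* second non-zero value-location state */

/* One generator-to-receiver signaling channel: only the generator writes
 * 'signal' and 'gen_value'; only the receiver writes 'recv_value'. */
typedef struct {
    volatile uint8_t signal;      /* signal location (generator-owned)   */
    volatile uint8_t gen_value;   /* generator's value location          */
    volatile uint8_t recv_value;  /* receiver's value location           */
} channel_t;

static void channel_init(channel_t *c) {
    c->signal = 0x00;             /* reset state                         */
    c->gen_value = STATE_A;       /* both sides set their value locations */
    c->recv_value = STATE_A;
}

/* Generator side: the signal becomes "set" once the signal location equals
 * the receiver's value location; the generator then toggles its own value
 * location so a fresh state exists for the next signal. */
static void send_signal(channel_t *c) {
    c->signal = c->gen_value;
    c->gen_value ^= 0x01;         /* 0xfe <-> 0xff toggle */
}

/* Receiver side: returns 1 if a signal was received; on receipt the
 * receiver toggles its value location. No atomic operations are needed. */
static int poll_signal(channel_t *c) {
    if (c->signal == c->recv_value) {
        c->recv_value ^= 0x01;
        return 1;
    }
    return 0;
}
```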
PROCESS SYNCHRONIZATION MECHANISM
Signaling Sequence and System
• At the end of the signaling phase, a new state is formed and the signaling process can continue.
• An acknowledgment can be obtained by a reply signal. With this signaling mechanism, the full signaling system can be built.
• The signaling system in a processor with n cores is implemented using one n×n Signal Location Matrix and two n×n Value Location Matrices.
• These three matrices are maintained in the OSN memory (or the ESN memory, if the external memory scheme is used).
Initialization phase:
  Step 1: Initial/reset state - Signal location: 0x00; Generator value location: 0x00; Receiver value location: 0x00.
  Step 2: Cores set value locations - Signal location: 0x00; Generator value location: 0xfe; Receiver value location: 0xfe.
Signaling phase:
  Step 3: Generator sets the signal - Signal location: 0xfe; Generator value location: 0xfe; Receiver value location: 0xfe.
  Step 4: Generator toggles its value location - Signal location: 0xfe; Generator value location: 0xff; Receiver value location: 0xfe.
  Step 5: Receiver receives the signal and toggles its value location - Signal location: 0xfe; Generator value location: 0xff; Receiver value location: 0xff.
Figure 2. Signaling sequence.
PROCESS SYNCHRONIZATION MECHANISM
Signaling System
• We refer to the signal location matrix as 'S' and the two value location matrices as 'G' and 'R'.
• While G holds the value for setting the signal on the generator side, R holds the expected value on the receiver side.
• Each row of the S matrix holds the signal locations for one of the n processor cores. In other words, the ith row vector of matrix S corresponds to core i, and its entries are locations in the on-chip memory to which core i has write access.
• Rows of G and R are also placed in the on-chip memory. The jth location of the ith row of S is the signal location through which core i sets a signal for core j; core i uses the jth location of the ith row of G as the value it writes when setting that signal.
• In a similar way, core j, when looking for a signal from core i, checks the jth location of the ith row of S for a value equal to the ith location of the jth row of R (a C sketch of these operations follows).
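A minimal sketch of the matrix-based signaling operations, under the assumption that each matrix entry is one byte and that the arrays are mapped into the OSN (or ESN) memory; N_CORES, set_signal, check_signal and wait_signal are illustrative names, not from the paper.

```c
#include <stdint.h>

#define N_CORES 8   /* illustrative; 'n' in the text */

/* S, G and R as described above. Row i of each matrix is writable only by
 * core i. In a real system these arrays would be mapped to fixed addresses
 * in the non-cached OSN (or ESN) memory; here they are plain arrays. */
static volatile uint8_t S[N_CORES][N_CORES]; /* S[i][j]: core i's signal location toward core j        */
static volatile uint8_t G[N_CORES][N_CORES]; /* G[i][j]: core i's value for its next signal to core j  */
static volatile uint8_t R[N_CORES][N_CORES]; /* R[j][i]: value core j expects for the signal from i    */

/* Initialization phase (Steps 1-2 of Figure 2). */
static void signaling_init(void) {
    for (int i = 0; i < N_CORES; i++)
        for (int j = 0; j < N_CORES; j++) {
            S[i][j] = 0x00;
            G[i][j] = 0xfe;
            R[i][j] = 0xfe;
        }
}

/* Core i sets a signal for core j, then toggles its G entry (Steps 3-4). */
static void set_signal(int i, int j) {
    S[i][j] = G[i][j];
    G[i][j] ^= 0x01;
}

/* Core j checks, without blocking, for a signal from core i; on receipt it
 * toggles its R entry (Step 5) and returns 1. Only ordinary loads and
 * stores to locations the calling core owns are used; no atomics. */
static int check_signal(int i, int j) {
    if (S[i][j] == R[j][i]) {
        R[j][i] ^= 0x01;
        return 1;
    }
    return 0;
}

/* Core j waits (busy-waits on non-cached reads) for a signal from core i. */
static void wait_signal(int i, int j) {
    while (!check_signal(i, j))
        ;  /* spin */
}
```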
PROCESS SYNCHRONIZATION MECHANISM:
Signaling process from core i to core j
Figures 3-6. Signaling process from core i to core j, shown on the ith and jth rows of the S, G and R matrices: Figure 3 - current state; Figure 4 - i sets its signal to j; Figure 5 - i toggles its G location; Figure 6 - j identifies and receives the signal and toggles its R location.
PROCESS SYNCHRONIZATION MECHANISM
• The process synchronization between two cores, say Core-i and Core-j, is implemented as follows (a sketch of this handshake, built on the signaling operations above, follows this list):
– Core-i sets a signal to Core-j.
– Core-i waits for a signal from Core-j.
– Core-j waits for the signal from Core-i.
– Core-j gets the signal from Core-i.
– Core-j sets a reply signal to Core-i.
– Core-i gets the reply signal from Core-j.
• The basic synchronization scheme is built on three matrices of order n×n, where n is the number of cores. With one byte per location, this is 3·n² bytes; for example, a 1000-core system can be supported with about 3 MB of on-chip memory.
• The scheme has the potential to be extended to multiple types of signals and to inter-core communication, at the cost of additional memory.
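A sketch of this handshake using the set_signal/wait_signal helpers from the earlier sketch; sync_with and the lower-core-initiates convention (taken from Appendix A) are assumptions for illustration.

```c
/* Synchronization point between two cores. Each core calls this with its
 * own id ('self') and the peer's id ('other'). Follows the sequence listed
 * above, using the convention that the lower-numbered core initiates. */
void sync_with(int self, int other) {
    if (self < other) {
        set_signal(self, other);   /* Core-i sets a signal to Core-j     */
        wait_signal(other, self);  /* Core-i waits for the reply signal  */
    } else {
        wait_signal(other, self);  /* Core-j waits for the signal from i */
        set_signal(self, other);   /* Core-j sets the reply signal to i  */
    }
}
```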
EXPERIMENTAL SETUP
• The mechanism was simulated on a multi-core System-On-Chip (SOC) virtual prototyping platform, as shown in Figure 11.
• The platform also provides a mechanism for plugging in user-defined modules to model additional hardware components.
• The CoreConnect-based [16] SOC has 8 processor cores and 1 MB of OSN memory, in addition to several other peripherals and bus components.
Figure 11. Virtual Multi-core SOC.
EXPERIMENTAL SETUP
• Different experiments were carried out while running a parallel multiplication of two 16x16 matrices.
• In scenario 1, process synchronization was achieved using the proposed OSN-based synchronization technique.
• In scenario 2, process synchronization was achieved using the proposed ESN-based synchronization technique.
• In scenario 3, process synchronization was achieved using an external-memory-based semaphore, BetaSemaphore, implemented using an atomic Test-and-Set operation, as defined in [14].
• Performance comparison between scenarios 1 and 2 indicates that even for applications like matrix multiplication, where the number of synchronization operations is small, the impact of the OSN memory on reducing synchronization overhead is reasonably significant, especially as the number of cores increases.
EXPERIMENTAL RESULTS
• The proposed process synchronization scheme is not expected to reduce synchronization overheads unless it is used with the OSN memory.
• For an 8-core SOC, a speed-up of 7.5 was seen with the OSN-based technique vs. 5.5 with the ESN-based technique.
• The performance of scenarios 2 and 3 is more comparable; the delta between the two can potentially be attributed to differences in how they were implemented.
Scenario 1: Proposed scheme using OSN memory
  Number of processors | Execution time (us) | Idle time (us) | Speed-up
  1                    | 5205753             | 54662          | 1.0
  2                    | 2632011             | 92797          | 2.0
  4                    | 1338610             | 171432         | 3.9
  8                    | 695988              | 344361         | 7.5

Scenario 2: Proposed scheme using ESN memory
  Number of processors | Execution time (us) | Idle time (us) | Speed-up
  1                    | 5205733             | 54662          | 1.0
  2                    | 2632086             | 106798         | 2.0
  4                    | 1462357             | 332871         | 3.6
  8                    | 820631              | 469267         | 5.5

Scenario 3: Semaphore using ESN memory
  Number of processors | Execution time (us) | Idle time (us) | Speed-up
  1                    | 5205733             | 54662          | 1.0
  2                    | 2632094             | 106802         | 2.0
  4                    | 1607896             | 213437         | 3.2
  8                    | 912465              | 450986         | 5.7

Figure 13. Scenarios 1-3.
EXPERIMENTAL SETUP
• In another study, a micro-benchmark was created to force 10000 synchronization operations, and the time taken for those 10000 operations was extracted using selective profiling functions provided by the virtual prototyping platform.
• The study was done on the same SOC as before, but with 2 and 4 processor cores, in 3 different scenarios.
• In scenario 4, the proposed OSN-based synchronization scheme was used.
• In scenario 5, the OSN-based BetaSemaphore implementation was used.
• In scenario 6, the ESN-based BetaSemaphore implementation was used.
EXPERIMENTAL RESULTS
• Synchronization overheads in scenarios 4 and 5 are in a comparable range, and again the overhead in scenario 4 was lower than in scenario 5.
• The overhead of scenario 4 was approximately one quarter of the synchronization overhead of scenario 6 (at 4 cores), and the gap is expected to widen further as the number of cores increases.
• This strongly suggests that, irrespective of the process synchronization scheme used, the OSN memory significantly reduces process synchronization overheads, especially as the number of processor cores increases.
Synchronization overhead for 10000 synchronizations (in usec)
  No. of cores | Scenario 4: Proposed scheme (OSN) | Scenario 5: BetaSemaphore (OSN) | Scenario 6: BetaSemaphore (ESN)
  2            | 1500723                           | 1862737                         | 3751627
  4            | 2775356                           | 3184996                         | 12185251

Figure 14. Scenarios 4-6.
CONCLUSION & FUTURE WORK
• A novel multi-core signaling scheme and a process synchronization mechanism are presented.
• We have also presented the notion of using on-chip, shared, non-cached memories to reduce process synchronization overheads in multi-core systems.
• The basic signaling scheme presented is a two-state mechanism; however, it can be extended to a signaling system with multiple states.
• We are investigating how multiple types of signals can be implemented by providing a specified number of locations maintained by the generator and read by the receiver, to further classify the signal.
• The scheme can be extended to enable inter-processor communication.
REFERENCES
1. J. M. Mellor-Crummey and M. L. Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors,
ACM Trans. On Computer Systems, 9(1), February 1991.
2. M. P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.
3. Intel Corp. Intel Itanium 2 processor reference manual.
4. C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture: A Specification for a New Family of Processors, 2nd edition. Morgan Kaufmann, May 1994.
5. Zhen Fang, Lixin Zhang, John B. Carter, Liqun Cheng, and Michael Parker. 2005. Fast synchronization on shared-
memory multiprocessors: An architectural approach. J. Parallel Distrib. Comput. 65, 10 (October 2005), 1158-1170.
6. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. 2005. Introduction to the cell
multiprocessor. IBM J. Res. Dev. 49, 4/5 (July 2005), 589-604.
7. L. A. Polka et al., Intel Technology Journal, vol. 11, 197 (2007).
8. A. Silberschatz, P. B. Galvin, G. Gagne, “Operating System Concepts”, 7th ed.: John Wiley & Sons, Inc., 2005.
9. Preeti Ranjan Panda, Nikil D. Dutt, and Alexandru Nicolau. 2000. On-chip vs. off-chip memory: the data partitioning
problem in embedded processor-based systems. ACM Trans. Des. Autom. Electron. Syst. 5, 3 (July 2000), 682-704
10. C. Villavieja, I. Gelado, A. Ramirez, and N. Navarro, "Memory Management on Chip-MultiProcessors with on-chip Memories", Proc. Workshop on the Interaction between Operating Systems and Computer Architecture, 2008.
11. N. R. Dhanwada, R. A. Bergamaschi, W. W. Dungan, I. Nair, P. Gramann, W. E. Dougherty, and I. Lin, "Transaction-level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems", Design Automation for Embedded Systems, 2005, pp. 105-125.
12. Meet the PowerPC 405 Evaluation Kit, 2005.
13. The Open SystemC Initiative. http://www.systemc.org.
14. Benini, L., D. Bertozzi, D. Bruni, N. Drago, F. Fummi, M. Poncino. Legacy SystemC Co-Simulation of Multi-Processor
Systems-on-Chip. In Proceedings 2002 IEEE International Conference on Computer Design: VLSI in Computers and
Processors (ICCD), IEEE, 494, 2002.
15. PowerPC User Instruction Set Architecture Book I Version 2.02
16. The CoreConnect™ Bus Architecture, 1999
PROCESS SYNCHRONIZATION MECHANISM
Appendix A
• Though the proposed scheme is described as blocking, it need not be, if the program logic permits.
– Core-i can set the signal to Core-j and continue, rather than waiting, until it needs the acknowledgment or is about to send the next signal. In a similar way, Core-j need not wait for a signal from Core-i: if the program logic permits, it can check for the signal, continue if the signal is not yet available, and wait for it only when it is really needed.
– At a synchronization point where it is not clear which core should initiate, a convention can be adopted that the lower-numbered core sets the signal and the other core waits and acknowledges it.
• In a similar way, synchronization of a group of cores, or a barrier point, can be implemented (a sketch follows).
– The highest-numbered core scans for signals from all the lower-numbered cores, while all the lower-numbered cores set signals to the highest-numbered core and wait for an acknowledgment from it. When the highest-numbered core has received signals from all other cores, it sets acknowledgments to all of them.
– Since a core can check for a signal without blocking, signals from a number of cores arriving in a random sequence can be handled by scanning in a cyclic manner. Note also that the synchronization is built on a scheme in which each core writes only to locations where it has exclusive write access; hence, servicing of interrupts has no adverse effect on the synchronization scheme.
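A sketch of such a barrier, again built on the illustrative set_signal/check_signal/wait_signal helpers and the N_CORES constant from the earlier sketches; barrier() and the collector role are assumptions for illustration.

```c
/* Barrier across all N_CORES cores, following the convention described
 * above: the highest-numbered core collects signals from every other core
 * and then acknowledges all of them. 'self' is the calling core's id. */
void barrier(int self) {
    int last = N_CORES - 1;
    if (self == last) {
        int arrived[N_CORES] = {0};
        int count = 0;
        /* Scan in a cyclic manner; signals may arrive in any order. */
        while (count < last) {
            for (int i = 0; i < last; i++) {
                if (!arrived[i] && check_signal(i, last)) {
                    arrived[i] = 1;
                    count++;
                }
            }
        }
        /* All cores have arrived: acknowledge each of them. */
        for (int i = 0; i < last; i++)
            set_signal(last, i);
    } else {
        set_signal(self, last);    /* announce arrival at the barrier */
        wait_signal(last, self);   /* wait for the release signal     */
    }
}
```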

More Related Content

What's hot (18)

Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technology
 
Pentium microprocessor
Pentium microprocessorPentium microprocessor
Pentium microprocessor
 
Pentium
PentiumPentium
Pentium
 
Intel Pentium Pro
Intel Pentium ProIntel Pentium Pro
Intel Pentium Pro
 
Report on hyperthreading
Report on hyperthreadingReport on hyperthreading
Report on hyperthreading
 
TMS320C6X Architecture
TMS320C6X ArchitectureTMS320C6X Architecture
TMS320C6X Architecture
 
Computer System Architecture Lecture Note 4: intel microprocessors
Computer System Architecture Lecture Note 4: intel microprocessorsComputer System Architecture Lecture Note 4: intel microprocessors
Computer System Architecture Lecture Note 4: intel microprocessors
 
Hyper thread technology
Hyper thread technologyHyper thread technology
Hyper thread technology
 
Memory protection unit
Memory protection unit Memory protection unit
Memory protection unit
 
Unit4.tms320c54x
Unit4.tms320c54xUnit4.tms320c54x
Unit4.tms320c54x
 
Hyper threading
Hyper threadingHyper threading
Hyper threading
 
Rtos
RtosRtos
Rtos
 
SOC Peripheral Components & SOC Tools
SOC Peripheral Components & SOC ToolsSOC Peripheral Components & SOC Tools
SOC Peripheral Components & SOC Tools
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
Microprocessor - Intel Pentium Series
Microprocessor - Intel Pentium SeriesMicroprocessor - Intel Pentium Series
Microprocessor - Intel Pentium Series
 
SOC Processors Used in SOC
SOC Processors Used in SOCSOC Processors Used in SOC
SOC Processors Used in SOC
 
Al2ed chapter3
Al2ed chapter3Al2ed chapter3
Al2ed chapter3
 
Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technology
 

Viewers also liked

Process synchronization
Process synchronizationProcess synchronization
Process synchronizationAli Ahmad
 
Process Synchronization-R.D.Sivakumar
Process Synchronization-R.D.SivakumarProcess Synchronization-R.D.Sivakumar
Process Synchronization-R.D.SivakumarSivakumar R D .
 
Ch7: Process Synchronization
Ch7: Process SynchronizationCh7: Process Synchronization
Ch7: Process SynchronizationAhmar Hashmi
 
Process Synchronization And Deadlocks
Process Synchronization And DeadlocksProcess Synchronization And Deadlocks
Process Synchronization And Deadlockstech2click
 
Unit II - 3 - Operating System - Process Synchronization
Unit II - 3 - Operating System - Process SynchronizationUnit II - 3 - Operating System - Process Synchronization
Unit II - 3 - Operating System - Process Synchronizationcscarcas
 
Process Synchronization
Process SynchronizationProcess Synchronization
Process SynchronizationSonali Chauhan
 
OS Process Synchronization, semaphore and Monitors
OS Process Synchronization, semaphore and MonitorsOS Process Synchronization, semaphore and Monitors
OS Process Synchronization, semaphore and Monitorssgpraju
 
Process synchronization in Operating Systems
Process synchronization in Operating SystemsProcess synchronization in Operating Systems
Process synchronization in Operating SystemsRitu Ranjan Shrivastwa
 
Chapter 6 - Process Synchronization
Chapter 6 - Process SynchronizationChapter 6 - Process Synchronization
Chapter 6 - Process SynchronizationWayne Jones Jnr
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2Amir Payberah
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1Amir Payberah
 

Viewers also liked (11)

Process synchronization
Process synchronizationProcess synchronization
Process synchronization
 
Process Synchronization-R.D.Sivakumar
Process Synchronization-R.D.SivakumarProcess Synchronization-R.D.Sivakumar
Process Synchronization-R.D.Sivakumar
 
Ch7: Process Synchronization
Ch7: Process SynchronizationCh7: Process Synchronization
Ch7: Process Synchronization
 
Process Synchronization And Deadlocks
Process Synchronization And DeadlocksProcess Synchronization And Deadlocks
Process Synchronization And Deadlocks
 
Unit II - 3 - Operating System - Process Synchronization
Unit II - 3 - Operating System - Process SynchronizationUnit II - 3 - Operating System - Process Synchronization
Unit II - 3 - Operating System - Process Synchronization
 
Process Synchronization
Process SynchronizationProcess Synchronization
Process Synchronization
 
OS Process Synchronization, semaphore and Monitors
OS Process Synchronization, semaphore and MonitorsOS Process Synchronization, semaphore and Monitors
OS Process Synchronization, semaphore and Monitors
 
Process synchronization in Operating Systems
Process synchronization in Operating SystemsProcess synchronization in Operating Systems
Process synchronization in Operating Systems
 
Chapter 6 - Process Synchronization
Chapter 6 - Process SynchronizationChapter 6 - Process Synchronization
Chapter 6 - Process Synchronization
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
 

Similar to Process synchronization in multi core systems using on-chip memories

Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerAmrutaMehata
 
4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdfarpowersarps
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsSabidur Rahman
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxAkshitAgiwal1
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisationKarunamoorthy B
 
Unit 2 processor&memory-organisation
Unit 2 processor&memory-organisationUnit 2 processor&memory-organisation
Unit 2 processor&memory-organisationPavithra S
 
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptxParallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptxSumalatha A
 
Memory Requirements for Convolutional Neural Network Hardware Accelerators
Memory Requirements for Convolutional Neural Network Hardware AcceleratorsMemory Requirements for Convolutional Neural Network Hardware Accelerators
Memory Requirements for Convolutional Neural Network Hardware AcceleratorsSepidehShirkhanzadeh
 
Structure of processes ppt
Structure of processes pptStructure of processes ppt
Structure of processes pptYojana Nanaware
 
Multiprocessor Systems
Multiprocessor SystemsMultiprocessor Systems
Multiprocessor Systemsvampugani
 
The structure of process
The structure of processThe structure of process
The structure of processAbhaysinh Surve
 

Similar to Process synchronization in multi core systems using on-chip memories (20)

Computer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and MicrocontrollerComputer Organization: Introduction to Microprocessor and Microcontroller
Computer Organization: Introduction to Microprocessor and Microcontroller
 
4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Parallel Processing.pptx
Parallel Processing.pptxParallel Processing.pptx
Parallel Processing.pptx
 
Microprocessor
MicroprocessorMicroprocessor
Microprocessor
 
Mod 3.pptx
Mod 3.pptxMod 3.pptx
Mod 3.pptx
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
 
Unit 2 processor&memory-organisation
Unit 2 processor&memory-organisationUnit 2 processor&memory-organisation
Unit 2 processor&memory-organisation
 
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptxParallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
 
Memory Requirements for Convolutional Neural Network Hardware Accelerators
Memory Requirements for Convolutional Neural Network Hardware AcceleratorsMemory Requirements for Convolutional Neural Network Hardware Accelerators
Memory Requirements for Convolutional Neural Network Hardware Accelerators
 
shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
 
Structure of processes ppt
Structure of processes pptStructure of processes ppt
Structure of processes ppt
 
Array Processor
Array ProcessorArray Processor
Array Processor
 
Multiprocessor Systems
Multiprocessor SystemsMultiprocessor Systems
Multiprocessor Systems
 
UNIT 2.pptx
UNIT 2.pptxUNIT 2.pptx
UNIT 2.pptx
 
The structure of process
The structure of processThe structure of process
The structure of process
 

More from Arun Joseph

Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...Arun Joseph
 
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Arun Joseph
 
FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...Arun Joseph
 
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...Arun Joseph
 
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...Arun Joseph
 
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...Arun Joseph
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysisArun Joseph
 

More from Arun Joseph (9)

Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
Rapidly Building Next Generation Web-based EDA Applications and Platforms fro...
 
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
 
FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...FVCAG: A framework for formal verification driven power modelling and verific...
FVCAG: A framework for formal verification driven power modelling and verific...
 
FreqLeak
FreqLeakFreqLeak
FreqLeak
 
FirmLeak
FirmLeakFirmLeak
FirmLeak
 
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
A Hybrid Approach to Standard Cell Power Characterization based on PVT Indepe...
 
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...
 
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...End to End Self-Heating Analysis Methodology and Toolset for High Performance...
End to End Self-Heating Analysis Methodology and Toolset for High Performance...
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysis
 

Recently uploaded

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Process synchronization in multi core systems using on-chip memories

  • 1. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru1 Process Synchronization in Multi-core Systems Using On-Chip Memories Arun Joseph, Nagu Dhanwada arujosep@in.ibm.com, nagu@us.ibm.com Systems & Technology Group, IBM
  • 2. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru2 SUMMARY  We present a novel process synchronization mechanism and the application of on-chip memories for process synchronization in multi-core systems.  A multi-core processor architecture and a signaling scheme which supports the novel process synchronization mechanism is presented.  The validity of the proposed synchronization mechanism is demonstrated by experiments on a virtual prototyping platform.  Comparison against external memory based schemes shows that the proposed use of on-chip memories in multi-core process synchronization is an effective solution to reduce synchronization overheads.
  • 3. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru3 INTRODUCTION • Multi-core applications need to synchronize the computations in the different processor cores, so that the computations can proceed with integrity. • A wide range of working solutions are available: lock-based and lock-free. • Lock-based techniques locks a shared variable to get exclusive access to the data, and another process that needs to use the shared variable, remains in busy-wait state, frequently checking if the lock has become free, and then competes for the lock once the variable becomes free. [1] • Lock-free techniques allow multiple threads to concurrently read and write shared data without corrupting it. [2] • These techniques make use of atomic operations, provided by the processor architecture, which allow a single process to test if the lock is free, and if free, acquires the lock in a single atomic operation. [3, 4]
  • 4. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru4 INTRODUCTION • We introduce a multi-core process synchronization mechanism which is based on a novel signaling scheme, which does not need the support of atomic operations or disabling of interrupts. • Performance overhead of synchronization operations is dependent on the number of remote accesses required, and also the latency of each remote access. • Significant amount of on-chip memory is available in recent multi-core architectures like Cell BE, which is known to improve overall system performance by reducing access time significantly. • We present a first of its kind approach to exploit the available on-chip memory for efficient process synchronization.
  • 5. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru5 PREVIOUS RELATED WORK • Commonly used lock-based schemes are semaphores and condition variables. Non-blocking synchronization algorithms are designed in such a way that a critical section is not required. Their implementation requires specific atomic operations like, compare-and-swap (CAS). Maurice teaches that using the CAS atomic primitive and other primitive operations any lock- free mechanism can be implemented [2]. • The proposed mechanism is based on an on-chip memory which is non- caching and shared by all the processor cores and provides a memory region for each processor core with exclusive write access, while all the cores have read access. • To our knowledge, the proposed signaling scheme is fundamentally different from prior approaches, and does not require any atomic instructions or the need for disabling interrupts. • Though on-chip memories has been used for a wide range of applications, including speed up [9, 10], to our knowledge, this is the first work to study the use of on-chip, shared, non-cached memories to reduce multi-core process synchronization overheads.
  • 6. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru6 COMPONENTS OF THE SCHEME • The main components of the proposed process synchronization mechanism are: • (a) n-core multi-core processor, with an On-chip, Shared, Non-caching (OSN) memory. • (b) A novel signaling scheme. • The OSN memory is not essential to the proposed scheme, and in its absence an External, Shared, Non-Caching (ESN) memory can be used for the same purpose, with a penalty in performance.
  • 7. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru7 THE MULTI-CORE PROCESSOR • Similar architectures have been explored in processors like the Cell [6], and other academic work [10]. • Efficient usage of on-chip memory is important [9]. • OSN memory is used for building a signaling scheme and hence only a small amount of the memory is required. • While all processor cores have read access to the OSN memory, each core has dedicated regions in the OSN memory, where it has exclusive write access Cache Core 1 ……………. On-Chip Memory Core 0 Core n-1 Cache Cache External Memory SYSTEM BUS Cache Core 1 ……………. On-Chip Memory Core 0 Core n-1 Cache Cache External Memory SYSTEM BUS Figure 1. Multi-core processor with on-chip memory.
  • 8. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru8 PROCESS SYNCHRONIZATION MECHANISM Signaling Scheme • The proposed signaling scheme provides a two state system to support a continuous signaling mechanism. • Processors cores are logically numbered from 0 to (n-1) where; n is the total number of processor cores. • The basic signaling mechanism from a signal generator to a signal receiver is based on the proposed concepts of a signal location and two value locations. • The signal location is a specific location in the on-chip memory for which only the signal generator has write access and all others have read access. • Of the two value locations, one location is managed by the signal generator in a location on the on-chip memory for which it has write access. The second value location is managed by the receiving side in a location on the on-chip memory for which the receiver has write access.
  • 9. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru9 PROCESS SYNCHRONIZATION MECHANISM Signaling Scheme • The value locations have only two states other than the initial/reset state, which is a zero value state. • The two value location states, other than the zero state, are two values which can be set by the core managing the location. • For example, these states can be 0xfe and 0xff, and a state toggle can be obtained by an 'exclusive or' operation with 0x01. • A signal is set by the generator to the receiver when the signal location and receiver value locations have the same value. • After setting the signal, the signal generator toggles its generator value location, and the receiver after receiving the signal, toggles its receiver value location so that a new state is formed for a new signal.
  • 10. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru10 PROCESS SYNCHRONIZATION MECHANISM Signaling Sequence and System • At the end of the signaling phase, a new state is formed and the signaling process can continue. • An acknowledgment can be obtained by a reply signal. With this signal mechanism the full implementation of the signaling system can be built. • The signaling system in a processor with 'n' cores is implemented using one ‘nxn’ Signal Location Matrix and two ‘nxn’ Value Location Matrices. • These three matrices are maintained in the OSN memory (or the ESN memory, if the external memory scheme is used). Initialization Phase: Step 1: Initial / Reset State Signal Location: 0x00 Generator Value Location: 0x00 Receiver Value Location: 0x00 Step 2: Cores Set Value Locations Signal Location: 0x00 Generator Value Location: 0xfe Receiver Value Location: 0xfe Signaling Phase: Step 3: Generator Sets Signals Signal Location: 0xfe Generator Value Location: 0xfe Receiver Value Location: 0xfe Step 4: Generator Toggles Value Locations Signal Location: 0xfe Generator Value Location: 0xff Receiver Value Location: 0xfe Step 5: Receiver Receives Signal and Toggles its Value Location Signal Location: 0xfe Generator Value Location: 0xff Receiver Value Location: 0xff Figure 2. Signaling Sequence.
  • 11. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru11 PROCESS SYNCHRONIZATION MECHANISM Signaling System • We refer to the signal location matrix as 'S' and the two value location matrices as 'G' and 'R'. • While G holds the value for setting the signal on the generator side, R holds the expected value on the receiver side. • Each of the rows of the S matrix are the signal locations for each of the n processors cores. In other words, the ith row vector of Matrix S corresponds to ith core, and are locations in the on-chip memory for which core-i has write access. • Rows of G and R are also placed in the on-chip memory. jth location in the ith row vector corresponds to the signal location for core-i to set signal for core-j. It uses the jth location of ith row vector of Matrix G for setting the signal to core-j. • In a similar way, core-j looking for signal from core-i looks at jth location of ith row of S for a value equal to ith location of jth row of Matrix R.
  • 12. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru12 PROCESS SYNCHRONIZATION MECHANISM: Signaling process from core i to core j 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j Current State 0xfe0xff 0xfe 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j Current State 0xfe0xff 0xfe 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j i sets its signal to j 0xfe0xfe 0xfe 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j i sets its signal to j 0xfe0xfe 0xfe 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j i toggles its G location 0xff0xfe 0xfe 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j i toggles its G location 0xff0xfe 0xfe 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j j identifies and receives signal and toggles its R location 0xff0xfe 0xff 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R ithRowofS,GandRMatrices Core=i 0 nji S G R jthRowofS,GandRMatrices Core=j j identifies and receives signal and toggles its R location 0xff0xfe 0xff Figure 3. Core i to j - Current State. Figure 4. Core i to j – i sets its signal to j. Figure 5. Core i to j – i toggles its G location. Figure 6. Core i to j - j identifies and receives signal and toggles its R location.
  • 13. Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru13 PROCESS SYNCHRONIZATION MECHANISM • The process synchronization between two cores, say 'Core-i' and 'Core-j' is implemented as follows: – Core-i sets signal to Core-j. – Core-i waits for signal from Core-j. – Core-j waits for signal from Core-i. – Core-j gets the signal from Core-i. – Core-j sets reply signal to Core-i. – Core-i gets the reply signal from Core-j. • The basic synchronization scheme is built on three matrices of order nxn, where n is the number of cores. Hence, for example, the scheme for a 1000 core system can be implemented using 3MB of on-chip-memory. • The scheme has the potential to be extended for multiple types of signals and inter core communication, which requires extra memory to implement.
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru14
EXPERIMENTAL SETUP
• The mechanism was simulated on a multi-core System-on-Chip (SOC) virtual prototyping platform, as shown in Figure 11.
• The platform also provides a mechanism for plugging in user-defined modules to support abstraction of additionally defined hardware components.
• The CoreConnect-based [16] SOC has 8 processor cores and 1 MB of OSN memory, in addition to several other peripherals and bus components.
Figure 11. Virtual Multi-core SOC.
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru15
EXPERIMENTAL SETUP
• Different experiments were carried out while running a parallel multiplication of two 16x16 matrices.
• Scenario 1: Process synchronization was achieved using the proposed OSN-based synchronization technique.
• Scenario 2: Process synchronization was achieved using the proposed ESN-based synchronization technique.
• Scenario 3: Process synchronization was achieved using an external-memory-based semaphore, BetaSemaphore, implemented using an atomic test-and-set operation, as defined in [14] (a generic sketch of such a lock follows below).
• Performance comparison between scenarios 1 and 2 indicates that even for applications like matrix multiplication, where the number of synchronization operations is small, the impact of the OSN memory on reducing synchronization overhead is reasonably significant, especially as the number of cores increases.
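For reference, a generic test-and-set spinlock of the kind scenario 3 relies on might look as follows. This is only an illustrative sketch using the standard C11 atomic_flag interface, not the actual BetaSemaphore implementation from [14].

    #include <stdatomic.h>

    /* Illustrative test-and-set lock (not the BetaSemaphore from [14]).
     * The flag is assumed to live in external (ESN) memory, so every
     * acquisition attempt is a remote access on which contended cores spin.
     * The flag must be initialized with ATOMIC_FLAG_INIT. */
    static void tas_acquire(atomic_flag *lock)
    {
        /* atomic_flag_test_and_set atomically sets the flag and returns its
         * previous value; spin while another core already holds the lock. */
        while (atomic_flag_test_and_set(lock))
            ;                        /* busy-wait */
    }

    static void tas_release(atomic_flag *lock)
    {
        atomic_flag_clear(lock);     /* release the lock */
    }

Unlike the proposed signaling scheme, every core contends for the same lock word here, which is where the atomic hardware support and the extra remote traffic come in.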
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru16
EXPERIMENTAL RESULTS
• The proposed process synchronization scheme is not expected to reduce synchronization overheads unless it is used with the OSN memory.
• For an 8-core SOC, a speed-up of 7.5 was seen with the OSN-based technique vs. 5.5 with the ESN-based technique.
• The performance of scenarios 2 and 3 is more comparable; the delta between the two can potentially be attributed to differences in how they were implemented.

Scenario 1: Proposed Scheme using OSN Memory
  Number of processors   Execution time (us)   Idle time (us)   Speed-Up
  1                      5205753               54662            1.0
  2                      2632011               92797            2.0
  4                      1338610               171432           3.9
  8                      695988                344361           7.5

Scenario 2: Proposed Scheme using ESN Memory
  Number of processors   Execution time (us)   Idle time (us)   Speed-Up
  1                      5205733               54662            1.0
  2                      2632086               106798           2.0
  4                      1462357               332871           3.6
  8                      820631                469267           5.5

Scenario 3: Semaphore using ESN Memory
  Number of processors   Execution time (us)   Idle time (us)   Speed-Up
  1                      5205733               54662            1.0
  2                      2632094               106802           2.0
  4                      1607896               213437           3.2
  8                      912465                450986           5.7

Figure 13. Scenarios 1-3.
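The speed-up column follows from the execution times, assuming the usual definition of speed-up as single-core execution time divided by n-core execution time (the slides do not state the formula explicitly). For the 8-core OSN case, in LaTeX notation:

    \text{Speed-up}(n) = \frac{T_1}{T_n}, \qquad
    \text{Speed-up}(8) = \frac{5205753\ \mu\text{s}}{695988\ \mu\text{s}} \approx 7.5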
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru17
EXPERIMENTAL SETUP
• In another study, a micro-benchmark was created to forcefully create 10000 synchronization operations, and the time taken for those 10000 operations was extracted using selective profiling functions provided by the virtual prototyping platform.
• The study was done on the same SOC as before, but with 2 and 4 processor cores, in 3 different scenarios (a possible shape of the benchmark loop is sketched below).
• In scenario 4, the proposed OSN-based synchronization scheme was used.
• In scenario 5, the OSN-based BetaSemaphore implementation was used.
• In scenario 6, the ESN-based BetaSemaphore implementation was used.
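The slides do not show the benchmark code itself. A minimal sketch of what such a loop could look like for scenario 4, assuming the sync_initiate/sync_respond helpers sketched earlier and hypothetical profile_start()/profile_stop() hooks standing in for the platform's selective profiling functions:

    #define NUM_SYNC_OPS 10000

    /* Hypothetical micro-benchmark loop for scenario 4: core 0 initiates and
     * core 1 responds, so each iteration is one synchronization operation.
     * profile_start()/profile_stop() are placeholders for the virtual
     * platform's profiling functions. */
    static void sync_benchmark(int me)
    {
        profile_start();
        for (int k = 0; k < NUM_SYNC_OPS; k++) {
            if (me == 0)
                sync_initiate(0, 1);   /* core 0 signals core 1 and waits */
            else
                sync_respond(1, 0);    /* core 1 waits and replies        */
        }
        profile_stop();
    }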
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru18
EXPERIMENTAL RESULTS
• Synchronization overheads in scenarios 4 and 5 are in a comparable range, and again the overhead in scenario 4 was lower than in scenario 5.
• The overhead of scenario 4 was approximately one fourth of the synchronization overhead of scenario 6, and is expected to improve even further as the number of cores increases.
• This strongly suggests that, irrespective of the process synchronization scheme used, the OSN memory significantly reduces process synchronization overheads, especially as the number of processor cores increases.

Synchronization overhead for 10000 synchronizations (in usec)
  No. of cores   Scenario 4: Proposed Scheme (OSN)   Scenario 5: BetaSemaphore (OSN)   Scenario 6: BetaSemaphore (ESN)
  2              1500723                             1862737                           3751627
  4              2775356                             3184996                           12185251

Figure 14. Scenarios 4-6.
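The one-fourth figure can be checked directly from the 4-core row of the table; in LaTeX notation:

    \frac{T_{\text{scenario 4}}}{T_{\text{scenario 6}}}
      = \frac{2775356\ \mu\text{s}}{12185251\ \mu\text{s}} \approx 0.23 \approx \frac{1}{4}

At 2 cores the same ratio is about 0.40, which is consistent with the observation that the advantage of the OSN-based scheme grows with the number of cores.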
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru19
CONCLUSION & FUTURE WORK
• A novel multi-core signaling scheme and a process synchronization mechanism are presented.
• We have also presented the notion of using on-chip, shared, non-cached memories to reduce process synchronization overheads in multi-core systems.
• The basic signaling scheme presented is a two-state mechanism; however, it can be extended into a signaling system with multiple states.
• We are investigating how multiple types of signals can be implemented by providing a specified number of locations, maintained by the generator and read by the receiver, to further classify the signal.
• The scheme can also be extended to enable inter-processor communication.
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru20
REFERENCES
1. J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1), February 1991.
2. M. P. Herlihy. Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.
3. Intel Corp. Intel Itanium 2 Processor Reference Manual.
4. C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture: A Specification for a New Family of Processors, 2nd edition. Morgan Kaufmann, May 1994.
5. Z. Fang, L. Zhang, J. B. Carter, L. Cheng, and M. Parker. Fast Synchronization on Shared-Memory Multiprocessors: An Architectural Approach. Journal of Parallel and Distributed Computing, 65(10):1158-1170, October 2005.
6. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell Multiprocessor. IBM Journal of Research and Development, 49(4/5):589-604, July 2005.
7. L. A. Polka et al. Intel Technology Journal, vol. 11, p. 197, 2007.
8. A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts, 7th edition. John Wiley & Sons, Inc., 2005.
9. P. R. Panda, N. D. Dutt, and A. Nicolau. On-Chip vs. Off-Chip Memory: The Data Partitioning Problem in Embedded Processor-Based Systems. ACM Transactions on Design Automation of Electronic Systems, 5(3):682-704, July 2000.
10. C. Villavieja, I. Gelado, A. Ramirez, and N. Navarro. Memory Management on Chip-MultiProcessors with On-Chip Memories. In Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture, 2008.
11. N. R. Dhanwada, R. A. Bergamaschi, W. W. Dungan, I. Nair, P. Gramann, W. E. Dougherty, and I. Lin. Transaction-Level Modeling for Architectural and Power Analysis of PowerPC and CoreConnect-Based Systems. Design Automation for Embedded Systems, pp. 105-125, 2005.
12. Meet the PowerPC 405 Evaluation Kit, 2005.
13. The Open SystemC Initiative. http://www.systemc.org.
14. L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. Legacy SystemC Co-Simulation of Multi-Processor Systems-on-Chip. In Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD), p. 494, 2002.
15. PowerPC User Instruction Set Architecture, Book I, Version 2.02.
16. The CoreConnect Bus Architecture, 1999.
Process Synchronization in Multi-core Systems Using On-Chip Memories: Aru21
PROCESS SYNCHRONIZATION MECHANISM
Appendix A
• Though the proposed scheme is a blocking scheme, it need not be used that way if the program logic permits.
  – Core-i can set the signal to Core-j and continue, rather than waiting, until it actually needs the acknowledgment or is about to send the next signal. Similarly, Core-j need not wait for a signal from Core-i: if the program logic permits, it can check for the signal, continue if the signal is not yet available, and wait for it only when it is really needed.
  – If it is a synchronization point but it is unclear which core should initiate the signal, a convention can be adopted in which the lower-numbered core sets the signal and the other core waits and acknowledges it.
• In a similar way, synchronization of a group of cores, or a barrier point, can be implemented (see the sketch below).
  – The highest-numbered core scans for signals from all the lower-numbered cores, while all the lower-numbered cores set signals to the highest-numbered core and wait for its acknowledgment. When the highest-numbered core has received a signal from every other core, it sets an acknowledgment to each of them.
  – Since a core can check for a signal without blocking, signals from a number of cores arriving in a random order can be handled by scanning in a cyclic manner.
• The synchronization is built on a scheme in which each core writes only to locations where it has exclusive write access. Hence, servicing of interrupts has no adverse effect on the synchronization scheme.
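A minimal C sketch of the barrier described above, for the 8-core SOC. It reuses the set_signal, check_signal and wait_signal helpers assumed in the earlier matrix sketch; the function decomposition and names are illustrative, not the authors' code.

    #define NCORES 8   /* assumed core count */

    /* Barrier built on the signaling primitives sketched earlier.
     * Convention: the highest-numbered core (NCORES - 1) collects signals
     * from all lower-numbered cores and then acknowledges each of them. */
    static void barrier(int me)
    {
        int last = NCORES - 1;

        if (me == last) {
            int pending = NCORES - 1;
            int done[NCORES] = { 0 };

            /* Scan the lower-numbered cores cyclically; the non-blocking
             * check lets their signals be consumed in whatever order they
             * arrive. */
            while (pending > 0) {
                for (int c = 0; c < last; c++) {
                    if (!done[c] && check_signal(last, c)) {
                        done[c] = 1;
                        pending--;
                    }
                }
            }
            /* All cores have arrived: acknowledge each of them. */
            for (int c = 0; c < last; c++)
                set_signal(last, c);
        } else {
            set_signal(me, last);    /* announce arrival to the collector */
            wait_signal(me, last);   /* wait for the collector's ack      */
        }
    }

As in the two-core case, every write in this barrier goes to a location owned exclusively by the writing core, so no atomic operations or interrupt disabling are required.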