The document discusses different cache coherence protocols used in multi-core systems, including snooping-based and directory-based approaches. It describes the MSI protocol, a 3-state snooping protocol, and the MESI protocol, a 4-state optimization. It also covers the Dragon update protocol, a 4-state write-back update snooping approach, and compares the number of bus transactions required for different protocols and memory access patterns.
Caches in multiprocessing environment introduce the Cache Coherence problem.
When multiple processors maintain locally cached copies of a unique shared memory location, any local modification of the location can result in a globally inconsistent view of memory. This is called Cache Coherence Problem.
A brief discussion about its solutions are given.
Coherence and consistency models in multiprocessor architectureUniversity of Pisa
Cache coherence and consistency model in multiprocessor architecture. These slide show the introduction of multiprocessor and cache multilevel and then describe the basic mechanism of coherence and consistency protocols. In particular the protocols describe are the following: snooping and directory protocols for the coherence part and sequential protocol for the consistency part. There are also example of (in)consistency and (in)coherence.
Caches in multiprocessing environment introduce the Cache Coherence problem.
When multiple processors maintain locally cached copies of a unique shared memory location, any local modification of the location can result in a globally inconsistent view of memory. This is called Cache Coherence Problem.
A brief discussion about its solutions are given.
Coherence and consistency models in multiprocessor architectureUniversity of Pisa
Cache coherence and consistency model in multiprocessor architecture. These slide show the introduction of multiprocessor and cache multilevel and then describe the basic mechanism of coherence and consistency protocols. In particular the protocols describe are the following: snooping and directory protocols for the coherence part and sequential protocol for the consistency part. There are also example of (in)consistency and (in)coherence.
Audio Version available in YouTube Link : https://www.youtube.com/AKSHARAM?sub_confirmation=1
subscribe the channel
Computer Architecture and Organization
V semester
Anna University
By
Babu M, Assistant Professor
Department of ECE
RMK College of Engineering and Technology
Chennai
Very long instruction word or VLIW refers to a processor architecture designed to take advantage of instruction level parallelism
This type of processor architecture is intended to allow higher performance without the inherent complexity of some other approaches.
1) Design and Implementation of Multicore Processors
2) Coherence and Consistency
3) Power and Temperature
4) Interconnects
5) Multicore Caches
6) Security
7) Real world examples
Audio Version available in YouTube Link : https://www.youtube.com/AKSHARAM?sub_confirmation=1
subscribe the channel
Computer Architecture and Organization
V semester
Anna University
By
Babu M, Assistant Professor
Department of ECE
RMK College of Engineering and Technology
Chennai
Very long instruction word or VLIW refers to a processor architecture designed to take advantage of instruction level parallelism
This type of processor architecture is intended to allow higher performance without the inherent complexity of some other approaches.
1) Design and Implementation of Multicore Processors
2) Coherence and Consistency
3) Power and Temperature
4) Interconnects
5) Multicore Caches
6) Security
7) Real world examples
Blocks is a cool concept and is very much needed for performance improvements and responsiveness. GCD helps run blocks effortlessly by scheduling on a desired queue, priority and lots more.
MODULE III Parallel Processors and Memory Organization 15 Hours
Parallel Processors: Introduction to parallel processors, Concurrent access to memory and cache
coherency. Introduction to multicore architecture. Memory system design: semiconductor memory
technologies, memory organization. Memory interleaving, concept of hierarchical memory
organization, cache memory, cache size vs. block size, mapping functions, replacement
algorithms, write policies.
Case Study: Instruction sets of some common CPUs - Design of a simple hypothetical CPU- A
sequential Y86-64 design-Sun Ultra SPARC II pipeline structure
Presentation about the Spil Storage Platform (SSP) written in Erlang. This talk was first given at the Erlang User Group Netherlands in July 2012 hosted at Spilgames in Hilversum.
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsHeechul Yun
In this paper, we show that cache partitioning does
not necessarily ensure predictable cache performance in modern
COTS multicore platforms that use non-blocking caches to exploit
memory-level-parallelism (MLP).
Through carefully designed experiments using three real COTS
multicore platforms (four distinct CPU architectures) and a cycleaccurate
full system simulator, we show that special hardware
registers in non-blocking caches, known as Miss Status Holding
Registers (MSHRs), which track the status of outstanding cachemisses,
can be a significant source of contention; we observe up
to 21X WCET increase in a real COTS multicore platform due
to MSHR contention.
We propose a hardware and system software (OS) collaborative
approach to efficiently eliminate MSHR contention for
multicore real-time systems. Our approach includes a low-cost
hardware extension that enables dynamic control of per-core
MLP by the OS. Using the hardware extension, the OS scheduler
then globally controls each core’s MLP in such a way that
eliminates MSHR contention and maximizes overall throughput
of the system.
We implement the hardware extension in a cycle-accurate fullsystem
simulator and the scheduler modification in Linux 3.14
kernel. We evaluate the effectiveness of our approach using a set
of synthetic and macro benchmarks. In a case study, we achieve
up to 19% WCET reduction (average: 13%) for a set of EEMBC
benchmarks compared to a baseline cache partitioning setup.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
How to Split Bills in the Odoo 17 POS ModuleCeline George
Bills have a main role in point of sale procedure. It will help to track sales, handling payments and giving receipts to customers. Bill splitting also has an important role in POS. For example, If some friends come together for dinner and if they want to divide the bill then it is possible by POS bill splitting. This slide will show how to split bills in odoo 17 POS.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
The Indian economy is classified into different sectors to simplify the analysis and understanding of economic activities. For Class 10, it's essential to grasp the sectors of the Indian economy, understand their characteristics, and recognize their importance. This guide will provide detailed notes on the Sectors of the Indian Economy Class 10, using specific long-tail keywords to enhance comprehension.
For more information, visit-www.vavaclasses.com
This is a presentation by Dada Robert in a Your Skill Boost masterclass organised by the Excellence Foundation for South Sudan (EFSS) on Saturday, the 25th and Sunday, the 26th of May 2024.
He discussed the concept of quality improvement, emphasizing its applicability to various aspects of life, including personal, project, and program improvements. He defined quality as doing the right thing at the right time in the right way to achieve the best possible results and discussed the concept of the "gap" between what we know and what we do, and how this gap represents the areas we need to improve. He explained the scientific approach to quality improvement, which involves systematic performance analysis, testing and learning, and implementing change ideas. He also highlighted the importance of client focus and a team approach to quality improvement.
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
3. 3
Multi-Core Cache Organizations
Private L1 caches
Shared L2 cache, but physically distributed
Scalable network
Directory-based coherence between L1s
P
C
P
C
P
C
P
C
P
C
P
C
P
C
P
C
4. 4
Multi-Core Cache Organizations
Private L1 caches
Shared L2 cache, but physically distributed
Bus connecting the four L1s and four L2 banks
Snooping-based coherence between L1s
P
C
P
C
P
C
P
C
5. 5
Multi-Core Cache Organizations
Private L1 caches
Private L2 caches
Scalable network
Directory-based coherence between L2s
(through a separate directory)
P
C
P
C
P
C
P
C
P
C
P
C
P
C
P
C
D
6. 6
Shared-Memory Vs. Message Passing
• Shared-memory
single copy of (shared) data in memory
threads communicate by reading/writing to a shared
location
• Message-passing
each thread has a copy of data in its own private
memory that other threads cannot access
threads communicate by passing values with SEND/
RECEIVE message pairs
7. 7
Ocean Kernel
Procedure Solve(A)
begin
diff = done = 0;
while (!done) do
diff = 0;
for i 1 to n do
for j 1 to n do
temp = A[i,j];
A[i,j] 0.2 * (A[i,j] + neighbors);
diff += abs(A[i,j] – temp);
end for
end for
if (diff < TOL) then done = 1;
end while
end procedure
8. 8
Shared Address Space Model
int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);
main()
begin
read(n); read(nprocs);
A G_MALLOC();
initialize (A);
CREATE (nprocs,Solve,A);
WAIT_FOR_END (nprocs);
end main
procedure Solve(A)
int i, j, pid, done=0;
float temp, mydiff=0;
int mymin = 1 + (pid * n/procs);
int mymax = mymin + n/nprocs -1;
while (!done) do
mydiff = diff = 0;
BARRIER(bar1,nprocs);
for i mymin to mymax
for j 1 to n do
…
endfor
endfor
LOCK(diff_lock);
diff += mydiff;
UNLOCK(diff_lock);
BARRIER (bar1, nprocs);
if (diff < TOL) then done = 1;
BARRIER (bar1, nprocs);
endwhile
9. 9
Message Passing Model
main()
read(n); read(nprocs);
CREATE (nprocs-1, Solve);
Solve();
WAIT_FOR_END (nprocs-1);
procedure Solve()
int i, j, pid, nn = n/nprocs, done=0;
float temp, tempdiff, mydiff = 0;
myA malloc(…)
initialize(myA);
while (!done) do
mydiff = 0;
if (pid != 0)
SEND(&myA[1,0], n, pid-1, ROW);
if (pid != nprocs-1)
SEND(&myA[nn,0], n, pid+1, ROW);
if (pid != 0)
RECEIVE(&myA[0,0], n, pid-1, ROW);
if (pid != nprocs-1)
RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
for i 1 to nn do
for j 1 to n do
…
endfor
endfor
if (pid != 0)
SEND(mydiff, 1, 0, DIFF);
RECEIVE(done, 1, 0, DONE);
else
for i 1 to nprocs-1 do
RECEIVE(tempdiff, 1, *, DIFF);
mydiff += tempdiff;
endfor
if (mydiff < TOL) done = 1;
for i 1 to nprocs-1 do
SEND(done, 1, I, DONE);
endfor
endif
endwhile
10. 10
Models for SEND and RECEIVE
• Synchronous: SEND returns control back to the program
only when the RECEIVE has completed
• Blocking Asynchronous: SEND returns control back to the
program after the OS has copied the message into its space
-- the program can now modify the sent data structure
• Nonblocking Asynchronous: SEND and RECEIVE return
control immediately – the message will get copied at some
point, so the process must overlap some other computation
with the communication – other primitives are used to
probe if the communication has finished or not
11. 11
Deterministic Execution
• Need synch after every anti-diagonal
• Potential load imbalance
• Shared-memory vs. message passing
• Function of the model for SEND-RECEIVE
• Function of the algorithm: diagonal, red-black ordering
12. 12
Cache Coherence
A multiprocessor system is cache coherent if
• a value written by a processor is eventually visible to
reads by other processors – write propagation
• two writes to the same location by two processors are
seen in the same order by all processors – write
serialization
13. 13
Cache Coherence Protocols
• Directory-based: A single location (directory) keeps track
of the sharing status of a block of memory
• Snooping: Every cache block is accompanied by the sharing
status of that block – all cache controllers monitor the
shared bus so they can update the sharing status of the
block, if necessary
Write-invalidate: a processor gains exclusive access of
a block before writing by invalidating all other copies
Write-update: when a processor writes, it updates other
shared copies of that block
14. 14
Protocol-I MSI
• 3-state write-back invalidation bus-based snooping protocol
• Each block can be in one of three states – invalid, shared,
modified (exclusive)
• A processor must acquire the block in exclusive state in
order to write to it – this is done by placing an exclusive
read request on the bus – every other cached copy is
invalidated
• When some other processor tries to read an exclusive
block, the block is demoted to shared
15. 15
Design Issues, Optimizations
• When does memory get updated?
demotion from modified to shared?
move from modified in one cache to modified in another?
• Who responds with data? – memory or a cache that has
the block in exclusive state – does it help if sharers respond?
• We can assume that bus, memory, and cache state
transactions are atomic – if not, we will need more states
• A transition from shared to modified only requires an upgrade
request and no transfer of data
• Is the protocol simpler for a write-through cache?
16. 16
4-State Protocol
• Multiprocessors execute many single-threaded programs
• A read followed by a write will generate bus transactions
to acquire the block in exclusive state even though there
are no sharers
• Note that we can optimize protocols by adding more
states – increases design/verification complexity
17. 17
MESI Protocol
• The new state is exclusive-clean – the cache can service
read requests and no other cache has the same block
• When the processor attempts a write, the block is
upgraded to exclusive-modified without generating a bus
transaction
• When a processor makes a read request, it must detect
if it has the only cached copy – the interconnect must
include an additional signal that is asserted by each
cache if it has a valid copy of the block
18. 18
Design Issues
• When caches evict blocks, they do not inform other
caches – it is possible to have a block in shared state
even though it is an exclusive-clean copy
• Cache-to-cache sharing: SRAM vs. DRAM latencies,
contention in remote caches, protocol complexities
(memory has to wait, which cache responds), can be
especially useful in distributed memory systems
• The protocol can be improved by adding a fifth
state (owner – MOESI) – the owner services reads
(instead of memory)
19. 19
Update Protocol (Dragon)
• 4-state write-back update protocol, first used in the
Dragon multiprocessor (1984)
• Write-back update is not the same as write-through –
on a write, only caches are updated, not memory
• Goal: writes may usually not be on the critical path, but
subsequent reads may be
20. 20
4 States
• No invalid state
• Modified and Exclusive-clean as before: used when there
is a sole cached copy
• Shared-clean: potentially multiple caches have this block
and main memory may or may not be up-to-date
• Shared-modified: potentially multiple caches have this
block, main memory is not up-to-date, and this cache
must update memory – only one block can be in Sm state
• In reality, one state would have sufficed – more states
to reduce traffic
21. 21
Design Issues
• If the update is also sent to main memory, the Sm
state can be eliminated
• If all caches are informed when a block is evicted, the
block can be moved from shared to M or E – this can
help save future bus transactions
• Having an extra wire to determine exclusivity seems
like a worthy trade-off in update systems
22. 22
Examples
P1 P2
MSI MESI Dragon MSI MESI Dragon
• P1: Rd X
• P1: Wr X
• P2: Rd X
• P1: Wr X
• P1: Wr X
• P2: Rd X
• P2: Wr X
Total transfers: