1. Implementation of Non-Uniform
Memory Access (NUMA) Systems
Project Submitted
by
PALLAB KUMAR RAY
(ME 2014-10001)
Under the supervision of
Mr. Somak Das
(Dept. of CSE/IT)
2. INTRODUCTION
• In a shared memory multiprocessor, all main memory is accessible to and shared
by all processors. When the cost of accessing shared memory is the same for every
processor, the system is called, from a memory access viewpoint, a UMA or
Uniform Memory Access system.
• NUMA, or Non-Uniform Memory Access, is a particular category of shared memory
multiprocessor. It is a shared memory architecture characterized by the placement
of main memory modules with respect to processors in a multiprocessor system.
As with most other processor architectural features, ignoring NUMA can
result in sub-par application memory performance.
3. NUMA
In the NUMA shared memory architecture, each processor has its own local memory module that it
can access directly, with a distinct performance advantage. At the same time, it can also access
any memory module belonging to another processor, using a shared bus (or some other type of
interconnect), as seen in the diagram.
4. Shared Memory
In contrast to "shared nothing" architectures, memory is globally accessible under shared memory. Communication is
anonymous; there is no explicit recipient of a shared memory access, as in message passing, and processors may
communicate without necessarily being aware of one another. Shared memory provides 2 services:
a. Direct access to another processor's local memory.
b. Automatic address mapping of a (virtual) memory address onto a (processor, local memory address) pair.
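Service (b) can be sketched as follows. The per-node memory size and the contiguous (non-interleaved) layout are assumptions chosen for illustration; real systems may interleave at page or cache-line granularity.

```python
# Sketch of service (b): mapping a global memory address onto a
# (processor, local memory address) pair. Assumes each node contributes
# one contiguous, equally sized region of the global address space.
NODE_MEM_SIZE = 1 << 30   # assumption: 1 GiB of local memory per node

def map_address(global_addr):
    """Return the (node, local address) pair backing a global address."""
    node = global_addr // NODE_MEM_SIZE      # which processor's module
    local_addr = global_addr % NODE_MEM_SIZE # offset within that module
    return node, local_addr

node, local = map_address(3 * NODE_MEM_SIZE + 0x1000)
assert (node, local) == (3, 0x1000)   # lands on node 3, offset 0x1000
```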
1. Convergence of parallel architectures :- While cc-NUMA architectures add specialized support for shared
memory, e.g. coherence control, they still rely on fine-grained message passing involving short messages, as do
single-sided architectures. So it appears that designs are converging, with the important details handled through
a combination of software and specialized support.
2. The Cache Coherence Problem:- Owing to the use of cache memories in modern computer architectures, shared
memory introduces the cache coherence problem. Cache coherence arises with shared data that is both written
and read. If one processor modifies a shared cached value, then the other processor(s) must get the latest value.
Coherence says nothing about when changes propagate through the memory subsystem, only that they will
eventually happen. Other steps must be taken (usually in software) to avoid race conditions that could lead to
non-deterministic program behavior.
a. Program order :- If a processor writes and then reads the same location X, and there are no other intervening writes by other
processors to X , then the read will always return the value previously written.
b. Definition of a coherent view of memory :- If a processor P reads from location X that was previously written by a processor Q,
then the read will return the value previously written, if a sufficient amount of time has elapsed between the read and the write.
c. Serialization of writes :- Multiple writes to a location X happen sequentially. If two processors write to the same location, then
other processors reading X will observe the same sequence of values in the order written. If a 10 and then a 20 is written to X, then
it is not possible for any processor to read 20 and then 10.
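Property (c) can be sketched with a toy model. The class name and the single global write history are illustrative assumptions; the point is only that serialization means all writes to one location fall into one total order that every reader observes consistently.

```python
class CoherentLocation:
    """Toy model of one coherent memory location: every write is appended
    to a single global history, so all readers see values in the same order."""
    def __init__(self, initial=0):
        self.history = [initial]   # the one serialized order of values

    def write(self, value):
        self.history.append(value)

    def read(self):
        return self.history[-1]    # latest value in the serialized order

x = CoherentLocation()
x.write(10)
x.write(20)
# Property (c): 10 precedes 20 in the single history, so no reader can
# observe 20 and then later observe 10 at this location.
assert x.history.index(10) < x.history.index(20)
```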
5. 3. Managing coherence:- There are two major strategies for managing coherence
a. Snooping protocol. In this bus-based scheme, processors passively listen for bus activity, updating or invalidating cache entries as
necessary. The scheme is ultimately non-scalable, and isn't appropriate for machines with tens of processors or more.
b. Directory-based. This is a scalable scheme employing point-to-point messages to handle coherence. A memory structure called
a directory maintains information about data sharing. This scheme was first applied to cache-coherent multiprocessors by the DASH
project at Stanford and is used in the SGI Altix 3000, which is a cc-NUMA architecture: memory access time is non-uniform,
depending on the location of the processor and the address accessed.
The three goals of DASH are :-
• Scalable memory bandwidth
• Scalable cost (use commodity parts)
• Deal with large memory latencies
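A directory-based protocol can be sketched as a toy MSI-flavored model. The class and method names are illustrative assumptions, and the point-to-point invalidation/downgrade messages are elided to comments; this is not the actual DASH or Altix protocol.

```python
class Directory:
    """Toy directory entry for one cache line: tracks which nodes hold a
    shared copy and which (if any) holds the exclusive copy."""
    def __init__(self):
        self.sharers = set()   # nodes holding a clean, read-only copy
        self.owner = None      # node holding the exclusive (modified) copy
        self.value = 0         # memory's current view of the line

    def read(self, node):
        if self.owner is not None and self.owner != node:
            # point-to-point message: fetch latest data, downgrade the owner
            self.sharers.add(self.owner)
            self.owner = None
        self.sharers.add(node)
        return self.value

    def write(self, node, value):
        # point-to-point invalidations go to every other copy holder
        self.sharers.clear()   # (invalidation messages elided in this sketch)
        self.owner = node
        self.value = value

d = Directory()
d.read(0)                 # nodes 0 and 1 take shared copies
d.read(1)
d.write(2, 42)            # node 2 writes: 0 and 1 are invalidated
assert d.read(0) == 42    # node 0 misses and re-fetches the new value
```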
6. Memory Architecture block diagram
The memory unit has an 8-bit address bus, two 8-bit data buses (one for input and one for output), a clock
input, and a 1-bit write-enable input.
When write_enable is set HIGH (1), the incoming data (arriving on the data_in bus) is first stored at the
memory address specified by the address bus; the newly written data is then fetched from the same address
and output on the data_out bus.
7. Circuit diagram
The chip has a write_enable input. Three more inputs are connected to the chip: the clk, which is held
HIGH during operation; an 8-bit data input bus; and an 8-bit address bus.
When data is written into the memory, write_enable is HIGH: the data passes through the chip and is stored at the
location given by the address bus. When data is fetched, the output data is placed on the data_out bus; the clk must
be high (enabled) at that time.
8. Memory chip algorithm
• This is the algorithm used for the memory chip, written out as a complete Verilog module (the module
wrapper is a sketch; blocking assignments are used so that a write is visible on data_out in the same cycle,
matching the write-then-read-back behavior described above):
module memory_chip (input clk, input write_enable,
                    input [7:0] address, input [7:0] data_in,
                    output reg [7:0] data_out);
    reg [7:0] memory [0:255];          // 256 x 8-bit storage
    always @(posedge clk)
    begin
        if (write_enable)
            memory[address] = data_in; // blocking: the write completes first
        data_out = memory[address];    // so data_out carries the new value
    end
endmodule
9. An Example of the Memory Architecture
When the Write Enable is LOW
At this time clk = HIGH
Data_in = X (don't care)
Address = (A6)H [10100110]
Write_enable = 0
Then:
Data_out = (8B)H [10001011]
Data_out = memory[A6]
This follows from:
data_out <= memory[address];
The diagram shows how the chip behaves when write_enable is low.
10. When the Write Enable is HIGH
At this time clk = HIGH
Data_in = (9F)H [10011111]
Address = (A6)H [10100110]
Write_enable = 1
Then:
mem[A6] = 9F (data_in)
Data_out = mem[A6]
Data_out = (9F)H [10011111]
This follows from:
data_out <= memory[address];
Here, because Write Enable is HIGH, the data stored at address A6 is changed: 9FH is now the
contents of address A6, and it is driven out on data_out.
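The two scenarios above can be reproduced with a small behavioral model in Python. Address (A6)H is assumed to hold (8B)H initially, as in the write-enable-low example; the model follows the write-first behavior described for the chip.

```python
# Behavioral sketch of the memory chip: 256 x 8-bit storage, one clock tick
# per call. A HIGH write_enable stores data_in first; data_out then reflects
# the (possibly just-written) contents of the addressed cell.
memory = [0] * 256
memory[0xA6] = 0x8B            # assumption: initial contents of address A6

def clock(address, data_in, write_enable):
    if write_enable:
        memory[address] = data_in & 0xFF   # store first (write-first RAM)
    return memory[address]                 # data_out <= memory[address]

# Write Enable LOW: data_in is don't-care, data_out = existing contents (8B)H
assert clock(0xA6, 0x00, 0) == 0x8B
# Write Enable HIGH: (9F)H is stored at A6 and driven out on data_out
assert clock(0xA6, 0x9F, 1) == 0x9F
```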