ABSTRACT
NUMA (Non-Uniform Memory Access) refers to a memory design choice for multiprocessors in which access to some regions of memory takes longer than access to others. This work explains what NUMA is, the background developments behind it, and how memory access time depends on the location of the memory relative to a processor. We first present a background of multiprocessor architectures and some hardware trends that accompany NUMA. We then briefly discuss the changes NUMA demands in two key areas: the policies the operating system should implement for scheduling and run-time memory allocation for threads, and the programming approach developers should take to harness NUMA's full potential. Finally, we present some numbers comparing UMA and NUMA performance.
Keywords: NUMA, Intel i7, NUMA Awareness, NUMA Distance
SECTIONS
In the following sections we first describe the background, hardware trends, operating system goals, and changes in programming paradigms, and then conclude after giving some numbers for comparison.
Background
Hardware Goals / Performance Criteria
The performance of a multiprocessor system can be judged on three criteria: scalability, latency, and bandwidth. Scalability is the ability of a system to demonstrate a proportionate increase in parallel speedup as more processors are added. Latency is the time taken to send a message from node A to node B, while bandwidth is the amount of data that can be communicated per unit of time. The goal of a multiprocessor system, then, is to be highly scalable with low latency and high bandwidth.
Parallel Architectures
Two major types of parallel architecture are prevalent in the industry: Shared Memory Architecture and Distributed Memory Architecture. Shared Memory Architecture, in turn, is of 2 types: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
Shared Memory Architecture
As seen in Figure 1 (more details in the “Hardware Trends” section), all processors share the same memory and treat it as a global address space. The major challenge to overcome in such an architecture is cache coherency: every read must reflect the latest write.
Figure 1 Shared Memory Architecture (from [1])
This architecture is typically adopted in general-purpose CPUs found in laptops and desktops.
Distributed Memory Architecture
In the type of architecture shown in Figure 2 (more details in the “Hardware Trends” section), each processor has its own local memory, and there is no mapping of memory addresses across processors. There is therefore no concept of a global address space or of cache coherency. To access data held by another processor, processors must communicate explicitly. One example of this architecture is a cluster, with the different nodes connected over a network.
Shared Memory Architecture – UMA
Shared Memory Architecture, again, is of 2 distinct types,
Uniform Memory Access (UMA), and Non-Uniform Memory
Access (NUMA).
Figure 2 Distributed Memory (from [1])
Figure 3 UMA Architecture Layout (from [3])
Non-Uniform Memory Access (NUMA)
Nakul Manchanda and Karan Anand
New York University
{nm1157, ka804}@cs.nyu.edu
Figure 3 shows a sample layout of processors and memory across a bus interconnect. All the processors are identical and have equal access times to all memory regions. Such machines are also known as Symmetric Multiprocessor (SMP) machines. Architectures that maintain cache coherency at the hardware level are known as CC-UMA (cache-coherent UMA).
Shared Memory Architecture – NUMA
Figure 4 shows this type of shared memory architecture: identical processors are connected to a scalable network, and each processor has a portion of memory attached directly to it. The primary difference between NUMA and a distributed memory architecture is that in distributed memory no processor can map memory connected to other processors, whereas in NUMA a processor may have such mappings. NUMA also introduces a classification into local memory and remote memory, based on the access latency to each memory region as seen from each processor. Such systems are often built by physically linking SMP machines. UMA, by contrast, has the major disadvantage of not scaling beyond a certain number of processors [6].
Hardware Trends
We now discuss two practical implementations of the memory architectures just described: the traditional Front Side Bus (FSB), and Intel's QuickPath Interconnect (QPI).
Traditional FSB Architecture (used in UMA)
As shown in Figure 5, the FSB-based UMA architecture has a Memory Controller Hub (MCH) to which all the memory is connected. The CPUs interact with the MCH whenever they need to access memory. The I/O controller hub is also connected to the MCH. The major bottleneck in this implementation is the bus itself, which has a finite speed and scalability issues: for any communication, a CPU needs to take control of the bus, which leads to contention.
QuickPath Interconnect Architecture (used in NUMA)
The key point in this implementation is that memory is connected directly to the CPUs rather than to a shared memory controller: instead of accessing memory via a Memory Controller Hub, each CPU has a memory controller embedded inside it. The CPUs are also connected to an I/O hub and to each other. In effect, this implementation addresses the common-channel contention problem.
New Cache Coherency Protocol
The QPI-based implementation also introduces a new cache coherency protocol, MESIF, in place of MESI. The new state “F” stands for Forward and denotes the one cache that acts as the designated responder for requests to a shared line.
Operating System Policies
OS Design Goals
Operating systems fundamentally try to achieve two major goals: usability and utilization. By usability, we mean that the OS should abstract the hardware for the programmer's convenience. The other goal is optimal resource management and the ability to multiplex the hardware amongst different applications.
Figure 4 NUMA Architecture Layout (from [3])
Figure 5 Intel's FSB based UMA Arch. (from [4])
Figure 6 Intel's QPI based NUMA Arch. (from [4])
Features of NUMA aware OS
The basic requirements of a NUMA-aware OS are to discover the underlying hardware topology and to calculate NUMA distances accurately. NUMA distances tell the processors (and/or the programmer) how expensive it is to access a particular memory region.
Besides this, the OS should provide a mechanism for processor affinity, which ensures that particular threads are scheduled on particular processor(s) to preserve data locality. This not only avoids remote access but can also take advantage of a hot cache. The operating system also needs to exploit the first-touch memory allocation policy, under which a page is placed on the node of the processor that first touches it.
Optimized Scheduling Decisions
The operating system needs to make sure that load is balanced amongst the different processors (by distributing data amongst the CPUs for large jobs), and also to implement dynamic page migration (i.e., use the latency topology to make page migration decisions).
Conflicting Goals
The goals the operating system is trying to achieve conflict with each other: on one hand we are trying to optimize memory placement (for load balancing), and on the other hand we would like to minimize migration of data (to limit resource contention). Ultimately there is a trade-off, decided on the basis of the type of application.
Programming Paradigms
NUMA Aware Programming Approach
The main goals of a NUMA-aware programming approach are to reduce lock contention and to maximize memory allocation on the local node. Programmers may also need to manage memory themselves for maximum portability. This can prove quite a challenge, since most languages do not have a built-in NUMA-aware memory manager.
Support for Programmers
Programmers rely on tools and libraries for application development, so the tools and libraries need to help programmers achieve maximum efficiency and implement implicit parallelism. The user or system interface, in turn, needs programming constructs for associating virtual memory address ranges with particular nodes, and functions for querying page residency.
Programming Approach
Programmers should explore the various NUMA libraries available to help simplify these tasks. If the data allocation pattern is analyzed properly, “first touch” allocation can be exploited fully. There are also several lock-free approaches available that can be used.
Besides these approaches, programmers can exploit various parallel programming paradigms, such as threads, message passing, and data parallelism.
Performance Comparison
Scalability – UMA vs NUMA
Figure 7 shows that UMA-based implementations have scalability issues. Initially both architectures scale linearly, until the bus reaches its limit and the UMA curve stagnates. Since there is no shared bus in NUMA, it is more scalable.
Cache Latency
Figure 8 UMA vs NUMA - Cache Latency (from [4])
The figure shows a comparison of cache latency numbers for UMA and NUMA. The UMA system has no L3 cache. For main memory and the L2 cache, NUMA shows a considerable improvement; only for the L1 cache does UMA marginally beat NUMA.
CONCLUSION
The hardware industry has adopted NUMA as an architectural design choice, primarily because of its characteristics of scalability and low latency. However, these hardware changes also demand changes in programming approaches (development libraries, data-access analysis) as well as in operating system policies (processor affinity, page migration). Without these changes, the full potential of NUMA cannot be exploited.
Figure 7 UMA vs. NUMA – Scalability (from [6])
REFERENCES
[1] “Introduction to Parallel Computing.”:
https://computing.llnl.gov/tutorials/parallel_comp/
[2] “Optimizing software applications for NUMA”:
http://software.intel.com/en-us/articles/optimizing-software-
applications-for-numa/
[3] “Parallel Computer Architecture - Slides”:
http://www.eecs.berkeley.edu/~culler/cs258-s99/
[4] “Cache Latency Comparison”:
http://arstechnica.com/hardware/reviews/2008/11/nehalem-
launch-review.ars/3
[5] “Intel – Processor Specifications”:
http://www.intel.com/products/processor/index.htm
[6] “UMA-NUMA Scalability”:
www.cs.drexel.edu/~wmm24/cs281/lectures/ppt/cs282_lec12.ppt

 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 

Non-Uniform Memory Access (NUMA)

Nakul Manchanda and Karan Anand
New York University
{nm1157, ka804}@cs.nyu.edu

Shared Memory Architecture, again, is of 2 types: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).

Shared Memory Architecture
As seen in Figure 1 (more details in the "Hardware Trends" section), all processors share the same memory and treat it as a global address space. The major challenge to overcome in such an architecture is cache coherency, i.e., every read must reflect the latest write. This architecture is usually adopted in the hardware model of general-purpose CPUs in laptops and desktops.

Figure 1: Shared Memory Architecture (from [1])

Distributed Memory Architecture
In this type of architecture (Figure 2; more details in the "Hardware Trends" section), each processor has its own local memory, and there is no mapping of memory addresses across processors, so there is no concept of a global address space or of cache coherency. To access data held by another processor, processors use explicit communication. One example of this architecture is a cluster, with different nodes connected over a network.

Figure 2: Distributed Memory Architecture (from [1])

Shared Memory Architecture – UMA

Figure 3: UMA Architecture Layout (from [3])
Figure 3 shows a sample layout of processors and memory across a bus interconnect. All the processors are identical and have equal access times to all memory regions; such machines are also known as Symmetric Multiprocessor (SMP) machines. Architectures that take care of cache coherency at the hardware level are known as CC-UMA (cache-coherent UMA).

Shared Memory Architecture – NUMA
Figure 4 shows this type of shared memory architecture: identical processors are connected to a scalable network, and each processor has a portion of memory attached directly to it. The primary difference between NUMA and a distributed memory architecture is that in a distributed memory architecture no processor can map memory attached to another processor, whereas in NUMA a processor may have such mappings. NUMA also introduces a classification of memory into local and remote, based on the access latency to each memory region as seen from each processor. Such systems are often made by physically linking SMP machines. UMA, by contrast, has the major disadvantage of not being scalable beyond a certain number of processors [6].

Hardware Trends
We now discuss 2 practical implementations of the memory architectures described above: one based on the Front Side Bus (FSB), the other on Intel's QuickPath Interconnect (QPI).

Traditional FSB Architecture (used in UMA)
As shown in Figure 5, the FSB-based UMA architecture has a Memory Controller Hub (MCH) to which all memory is connected. The CPUs interact with the MCH whenever they need to access memory, and the I/O controller hub is also connected to the MCH. The major bottleneck in this implementation is the bus, which has a finite speed and scalability issues: for any communication, the CPUs need to take control of the bus, which leads to contention problems.
Figure 4: NUMA Architecture Layout (from [3])
Figure 5: Intel's FSB-based UMA Architecture (from [4])

QuickPath Interconnect Architecture (used in NUMA)
The key point in this implementation is that memory is connected directly to the CPUs rather than to a shared memory controller: instead of reaching memory via a Memory Controller Hub, each CPU has a memory controller embedded inside it. The CPUs are also connected to an I/O hub and to each other. In effect, this implementation addresses the common-channel contention problems.

Figure 6: Intel's QPI-based NUMA Architecture (from [4])

New Cache Coherency Protocol
The QPI-based implementation also introduces a new cache coherency protocol, MESIF, in place of MESI. The new state "F" stands for Forward, and denotes that a cache should act as the designated responder for any requests.

Operating System Policies

OS Design Goals
Operating systems try to achieve 2 major goals: usability and utilization. By usability we mean that the OS should abstract the hardware for the programmer's convenience. The other goal is optimal resource management and the ability to multiplex the hardware among different applications.
Features of a NUMA-Aware OS
The basic requirements of a NUMA-aware OS are to discover the underlying hardware topology and to calculate NUMA distances accurately. NUMA distances tell the processors (and/or the programmer) how long it will take to access a particular region of memory. Besides these, the OS should provide a mechanism for processor affinity, which ensures that certain threads are scheduled on certain processor(s) to preserve data locality. This not only avoids remote accesses but also takes advantage of a hot cache. The OS also needs to exploit the first-touch memory allocation policy.

Optimized Scheduling Decisions
The operating system needs to make sure that load is balanced among the different processors (by distributing the data of large jobs among the CPUs), and to implement dynamic page migration (i.e., use the latency topology to make page migration decisions).

Conflicting Goals
The goals the operating system is trying to achieve conflict with each other: on one hand we are trying to optimize memory placement (for load balancing), while on the other we would like to minimize the migration of data (to overcome resource contention). The resulting trade-off is decided by the type of application.

Programming Paradigms

NUMA-Aware Programming Approach
The main goals of a NUMA-aware programming approach are to reduce lock contention and to maximize memory allocation on the local node. Programmers may also need to manage their own memory for maximum portability, which can prove to be quite a challenge, since most languages do not have a built-in memory manager.

Support for Programmers
Programmers rely on tools and libraries for application development, so the tools and libraries need to help programmers achieve maximum efficiency and implement implicit parallelism.
The user or system interface, in turn, needs to offer programming constructs for associating virtual memory addresses with NUMA nodes, and functions for obtaining page residency.

Programming Approach
Programmers should explore the various NUMA libraries that are available to simplify the task. If the data allocation pattern is analyzed properly, "first touch" access can be exploited fully. Several lock-free approaches are available as well. Besides these, programmers can exploit various parallel programming paradigms, such as threads, message passing, and data parallelism.

Performance Comparison

Scalability – UMA vs. NUMA
As Figure 7 shows, UMA-based implementations have scalability issues: initially both architectures scale linearly, until the bus reaches its limit and UMA performance stagnates. Since there is no shared bus in NUMA, it is more scalable.

Figure 7: UMA vs. NUMA – Scalability (from [6])

Cache Latency
Figure 8 compares the cache latency numbers of UMA and NUMA. The UMA part has no level 3 cache; for main memory and the level 2 cache, NUMA shows a considerable improvement, and only for the level 1 cache does UMA marginally beat NUMA.

Figure 8: UMA vs. NUMA – Cache Latency (from [4])

CONCLUSION
The hardware industry has adopted NUMA as an architecture design choice, primarily because of characteristics like scalability and low latency. However, this hardware change also demands changes in programming approaches (development libraries, data analysis) as well as in Operating System
policies (processor affinity, page migration). Without these changes, the full potential of NUMA cannot be exploited.

REFERENCES
[1] "Introduction to Parallel Computing": https://computing.llnl.gov/tutorials/parallel_comp/
[2] "Optimizing software applications for NUMA": http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/
[3] "Parallel Computer Architecture - Slides": http://www.eecs.berkeley.edu/~culler/cs258-s99/
[4] "Cache Latency Comparison": http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/3
[5] "Intel - Processor Specifications": http://www.intel.com/products/processor/index.htm
[6] "UMA-NUMA Scalability": www.cs.drexel.edu/~wmm24/cs281/lectures/ppt/cs282_lec12.ppt