Introduction to Symmetric Multiprocessors
Süha TUNA
Informatics Institute
UHeM Summer Workshop - 21.06.2012

Outline
• Shared Memory Architecture
  – SMP Architectures (NUMA, ccNUMA)
  – Cache & Cache Coherency Protocols
    • Snoopy
    • Directory Based
  – What is a Thread?
  – What is a Process?
  – Thread vs. Process
• OpenMP vs. MPI

Shared Memory Architecture (SMP)
• CPUs access shared memory through a bus
• All processors share a single view of data, and communication between processors can be as fast as memory accesses to the same location
• The CPU-to-memory connection becomes a bottleneck (requires high-speed interconnects)
[Figure: Distributed memory (four nodes, each with a processor P, its own memory M, and a NIC, connected by a network) vs. shared memory (four processors P connected to a single memory by a bus)]

Shared Memory Architecture
• UMA (Uniform Memory Access): individual processors share memory (and I/O) in such a way that each of them can access any memory location at the same speed
– Many small shared-memory machines are symmetric
– Larger shared-memory machines do not satisfy this definition (NUMA or ccNUMA)
• NUMA (Non-Uniform Memory Access) architecture was designed to overcome the scalability limits of the SMP (Shared Memory Processor / Symmetric Multiprocessor) architecture.
[Figure: Distributed shared memory: several nodes, each with processors P sharing a local memory over a bus, joined by a network bus]

• Log in to your UYBHM node using ssh:
bash: $ ssh du??@wsl-node??.uybhm.itu.edu.tr
• Run the cpuinfo command:
bash: $ cpuinfo
Architecture   : x86_64
Hyperthreading : disabled
Packages       : 2
Cores          : 4
Processors     : 4
===== Processor identification =====
Processor  Thread  Core  Package
0          0       0     0
1          0       0     3
2          0       1     0
3          0       1     3

===== Processor placement =====
Package  Cores  Processors
0        0,1    0,2
3        0,1    1,3
===== Cache sharing =====
Cache  Size   Processors
L1     32 KB  no sharing
L2     4 MB   (0,2)(1,3)
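
The processor identification rows can be regrouped by package to recover the placement table. A minimal Python sketch, assuming the four-column layout shown in the output; `ROWS` is copied from the slide and `group_by_package` is a hypothetical helper, not part of cpuinfo:

```python
from collections import defaultdict

# Rows copied from the "Processor identification" table:
# Processor  Thread  Core  Package
ROWS = """\
0 0 0 0
1 0 0 3
2 0 1 0
3 0 1 3
"""

def group_by_package(table):
    """Map each package id to the logical processors placed on it."""
    placement = defaultdict(list)
    for line in table.splitlines():
        processor, _thread, _core, package = map(int, line.split())
        placement[package].append(processor)
    return dict(placement)

placement = group_by_package(ROWS)
for package, processors in sorted(placement.items()):
    print(f"package {package}: processors {processors}")
```

Processors 0 and 2 land on package 0 and processors 1 and 3 on package 3, the same pairs that share an L2 cache in the cache sharing table.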

• What is cache?
– Extremely fast and relatively small memory unit
  • L1 Cache: built into the CPU itself
  • L2 Cache: historically on a separate chip next to the CPU; on most modern processors it is on the CPU die
– The CPU does not use the motherboard system bus for data transfer
– Reduces memory access time
– Decreases the bandwidth requirement of the local memory module and the global interconnect
[Figure: Memory hierarchy: CPU, registers, L1 cache, L2 cache, memory, disk]
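
How much a cache reduces memory access time can be estimated with the standard average access time formula: hit time + miss rate × miss penalty. A small Python sketch; the latencies and the miss rate are made-up illustrative numbers, not measurements of any machine:

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical numbers chosen only for illustration.
CACHE_HIT_NS = 1.0     # time to read a value that is in the cache
MEMORY_NS = 100.0      # extra time to fetch from main memory on a miss

# Without a cache, every access pays the full memory latency.
no_cache = average_access_time(0.0, 1.0, MEMORY_NS)

# With a cache that misses on 5% of accesses.
with_cache = average_access_time(CACHE_HIT_NS, 0.05, MEMORY_NS)

print(f"no cache: {no_cache:.1f} ns, with cache: {with_cache:.1f} ns")
```

Fewer accesses reaching main memory also means less traffic on the memory module and the interconnect, which is the bandwidth point made in the list above.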

• NUMA Architecture Types:
– ccNUMA means cache-coherent NUMA architecture.
– Cache coherence is the consistency of shared data that is held in multiple local caches.

• Coherence defines the behavior of reads and writes to the
same memory location.
– If each processor has a cache that reflects the state of various
parts of memory, it is possible that two or more caches may
have copies of the same line.
– If two threads make appropriately serialized changes to those
data items, the result could be that both caches end up with
different, incorrect versions of the line of memory.
– The system's state is no longer coherent!
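
The situation described above can be mimicked with a toy model. This is a sketch under simplifying assumptions: `Cache` is a hypothetical private cache with no coherence protocol at all, and the values 7, 5 and 2 are only illustrative.

```python
class Cache:
    """Toy private cache with no coherence protocol (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory: address -> value
        self.lines = {}        # private copies: address -> value

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value               # only the local copy changes


memory = {"x": 7}
cache_a, cache_b = Cache(memory), Cache(memory)

cache_a.read("x")        # both CPUs load x = 7 into their own caches
cache_b.read("x")
cache_a.write("x", 5)    # CPU A changes its copy
cache_b.write("x", 2)    # CPU B changes its copy

# Each cache and main memory now hold a different value for x.
print(cache_a.read("x"), cache_b.read("x"), memory["x"])
```

Nothing tells either cache that the other one wrote to the same line, so the three copies drift apart.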

Cache Coherence
• Solution: Directory-Based protocol or Snooping protocol
(Invalidate or Update techniques)
[Figure, panels (a) to (d): CPU A and CPU B, each with its own cache, connected to a shared memory over a bus; the panels show the value held in memory and in each cache]

• Solution: Cache Coherence Protocols
• A protocol takes one of two kinds of action when a cache line (L) is written:
– Invalidate: remove all copies of L from the other caches of the machine
– Update: refresh those copies with the new value being written
• Most modern cache-coherent multiprocessors use the invalidation technique rather than the update technique, since it is easier to implement in hardware
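
The invalidation technique can be sketched as a toy snooping model. Assumptions: `Bus` and `SnoopyCache` are hypothetical classes, and the cache is write-through only to keep the example short; real hardware protocols track a state per cache line and are considerably more involved.

```python
class Bus:
    """Toy shared bus that every cache snoops (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory: address -> value
        self.caches = []       # caches attached to this bus

    def broadcast_invalidate(self, addr, sender):
        for cache in self.caches:
            if cache is not sender:
                cache.lines.pop(addr, None)    # drop the stale copy


class SnoopyCache:
    """Write-through cache that invalidates other copies on every write."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch current value
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, sender=self)
        self.lines[addr] = value
        self.bus.memory[addr] = value          # write-through keeps memory current


bus = Bus({"x": 7})
cache_a, cache_b = SnoopyCache(bus), SnoopyCache(bus)

cache_a.read("x")
cache_b.read("x")                  # both caches hold x = 7
cache_a.write("x", 5)              # B's copy is invalidated by the broadcast
invalidated = "x" not in cache_b.lines
reloaded = cache_b.read("x")       # B misses and fetches the new value
print(invalidated, reloaded)
```

An update protocol would instead push the new value into B's cache at the time of the write, at the cost of bus traffic on every write.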

Main Definitions
• Process
– It is the "heaviest" unit of kernel scheduling.
– It is the unit of resource allocation.
– Processes execute independently and interact with each other via interprocess communication mechanisms.
– Each process has its own resources allocated by the operating system, including memory (address space) and state information.
– Each process has its own register set (temporary memory cells).

Main Definitions
• Thread
– It is the "lightest" unit of kernel scheduling.
– It is the unit of execution.
– At least one thread exists within each process. If multiple threads exist within a process, they share the same memory and file resources.
– Threads share the address space of their process; each thread has its own register set and stack.
– Threads do not own resources.
"An execution entity having a serial flow of control, a set of private variables, and access to shared variables." (OpenMP Architecture Review Board)

Process vs. Thread
• A thread is a flow of control within a process.
• It is a basic unit of CPU utilization.
• It comprises a thread ID, a program counter, a register set, and a stack.
• If two threads belong to the same process, they share its code section, data section, and other operating system resources.
• A traditional process has a single thread of control.
• If a process has multiple threads of control, it can do more than one task at a time.
[Figure: a process containing two threads]
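
A short sketch of what "sharing the data section" means in practice, using Python's standard threading module. The names `shared` and `worker` are illustrative. Each thread keeps its loop total in a local variable on its own stack, while all threads update the same shared dictionary.

```python
import threading

shared = {"counter": 0, "thread_ids": []}   # one copy, visible to every thread
lock = threading.Lock()

def worker(n_increments):
    local_total = 0                  # local variable: private to this thread
    for _ in range(n_increments):
        local_total += 1
    with lock:                       # serialize updates to the shared data
        shared["counter"] += local_total
        shared["thread_ids"].append(threading.get_ident())

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["counter"], len(shared["thread_ids"]))
```

Separate processes would each get their own copy of `shared` and would need an interprocess communication mechanism to combine their results.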

OpenMP vs. MPI
Pros of OpenMP
• considered by some to be easier to program and debug
(compared to MPI)
• data layout and decomposition are handled automatically by directives.
• allows incremental parallelism: directives can be added
incrementally, so the program can be parallelized one portion after
another and thus no dramatic change to code is needed.
• unified code for both serial and parallel applications: OpenMP
constructs are treated as comments when sequential compilers
are used.
• original (serial) code statements need not, in general, be
modified when parallelized with OpenMP. This reduces the chance
of inadvertently introducing bugs and helps maintenance as well.
• both coarse-grained and fine-grained parallelism are possible

OpenMP vs. MPI
Cons of OpenMP
• currently runs efficiently only on shared-memory multiprocessor platforms
• requires a compiler that supports OpenMP.
• scalability is limited by memory architecture.
• reliable error handling is missing.
• lacks fine-grained mechanisms to control thread-processor
mapping.
• synchronization between subsets of threads is not allowed.
• mostly used for loop parallelization
• can be difficult to debug, due to implicit communication between
threads via shared variables.

OpenMP vs. MPI
Pros of MPI
• does not require shared memory architectures, which are more expensive than distributed memory architectures
• can be used on a wider range of problems since it exploits both
task parallelism and data parallelism
• can run on both shared memory and distributed memory
architectures
• highly portable, with implementations specifically optimized for most hardware

OpenMP vs. MPI
Cons of MPI
• requires more programming changes to go from serial to parallel
version
• can be harder to debug

OpenMP vs. MPI
Different MPI and OpenMP applications for matrix multiplication

MPI vs. OpenMP Programming
Message-Passing Parallelism (MPI) | Shared-Memory Parallelism (OpenMP)
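
The two styles can be contrasted on a small matrix multiplication. This is a sketch in plain Python, not OpenMP or MPI code: threads stand in for OpenMP threads in the first function and for MPI ranks in the second, where queues play the role of explicit send and receive operations. All function names and the matrices are illustrative.

```python
import queue
import threading

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

def row_times_matrix(row, matrix):
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*matrix)]

def multiply_shared(a, b, n_workers=2):
    """Shared-memory style: workers read a and b and write rows of c directly."""
    c = [None] * len(a)                        # result shared by all workers

    def work(worker_id):
        for i in range(worker_id, len(a), n_workers):
            c[i] = row_times_matrix(a[i], b)   # rows are disjoint, no lock needed

    workers = [threading.Thread(target=work, args=(w,)) for w in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return c

def multiply_messages(a, b, n_workers=2):
    """Message-passing style: data reaches workers only through messages."""
    tasks, results = queue.Queue(), queue.Queue()

    def work():
        while True:
            msg = tasks.get()                  # "receive" a task
            if msg is None:                    # shutdown message
                return
            i, row, matrix = msg
            results.put((i, row_times_matrix(row, matrix)))   # "send" the result

    workers = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in workers:
        t.start()
    for i, row in enumerate(a):
        tasks.put((i, row, b))                 # explicit distribution of the data
    for _ in workers:
        tasks.put(None)
    c = [None] * len(a)
    for _ in a:
        i, c_row = results.get()               # explicit collection of the results
        c[i] = c_row
    for t in workers:
        t.join()
    return c

print(multiply_shared(A, B))
print(multiply_messages(A, B))
```

The shared-memory version only adds a work split around the serial loop, while the message-passing version has to spell out how the data is distributed and collected, which mirrors the pros and cons listed for OpenMP and MPI above.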
