21/12/05 r.innocente 1
Nodes and Networks
hardware
Roberto Innocente
inno@sissa.it
Kumasi, GHANA
Joint KNUST/ICTP school on
HPC on linux clusters
21/12/05 r.innocente 2
Overview
• Nodes:
– CPU
– Processor Bus
– I/O bus
• Networks
– boards
– switches
21/12/05 r.innocente 3
Computer families/1
• CISC (Complex Instruction Set Computer)
– large set of complex instructions (VAX, x86)
– many instructions can have operands in memory
– variable length instructions
– many instructions require many cycles to complete
• RISC (Reduced Instruction Set Computer)
– small set of simple instructions (Mips, Alpha)
– also called Load/Store architecture because operations are done in registers and
only load/store instructions can access memory
– instructions are typically hardwired and require few cycles
– instructions have a fixed length so that it is easy to parse them
• VLIW (Very Long Instruction Word) Computers
– A fixed number of instructions (3-5) scheduled by the compiler to be executed together
– Itanium (IA64), Transmeta Crusoe
21/12/05 r.innocente 4
Computer families/2
It is very difficult to optimize computers with CISC
instruction sets, because it is difficult to predict
what the effect of the instructions will be.
For this reason today's high performance processors
have a RISC core even if the external instruction
set is CISC.
Starting with the Pentium Pro, Intel x86 processors
in fact translate x86 instructions into 1 or more
RISC micro-ops (uops).
21/12/05 r.innocente 5
Micro architecture features
for HPC
• Superscalar
• Pipelining: super/hyper pipelining
• OOO (Out Of Order) execution
• Branch prediction/speculative execution
• Hyperthreading
• EPIC
21/12/05 r.innocente 6
Superscalar
A superscalar CPU has multiple functional execution units and
is able to dispatch multiple instructions per cycle (double
issue, quad issue, ...).
Pentium 4 has 7 distinct functional units:
Load, Store, 2 x double speed ALUs, normal speed ALU, FP,
FP Move. It can issue up to 6 uops per cycle (it has 4
dispatch ports but the 2 simple ALUs are double speed).
The Athlon has 9 functional units.
The AMD64/Opteron has a 3-issue core with 11 FU : 3 int,
3 address generation, 3 FP, 2 ld/st.
The IA64/Itanium2 can issue 6 instructions and has 11 FU :
2 int, 4 mem, 3 branch, 2 FP.
21/12/05 r.innocente 7
Pipelining/1
• The division of the work necessary to execute
instructions in stages to allow more instructions in
execution at the same time (at different stages)
• Previous generation architectures had 5/6 stages
• Now there are 10/20 stages, Intel called this
superpipelining or hyperpipelining
• Nocona(EM64T) Pentium 4 has 32 stages
• Opteron has around 30 stages
21/12/05 r.innocente 8
Pipelining/2
(Diagram: the 5 classic pipeline stages working against the
memory hierarchy — registers, caches, memory:)
Instruction Fetch (IF)
Instruction Decode (ID)
Execute (EX)
Data Access (DA)
Write Back results (WB)
21/12/05 r.innocente 9
Pipelining/3
Clock cycle  1   2   3   4   5   6   7   8   9   10
Instr 1      IF  ID  EX  DA  WB
Instr 2          IF  ID  EX  DA  DA  DA  DA  WB
Instr 3              IF  ID  EX  EX  DA  WB
Instr 4                  IF  ID  ID  EX  DA
Instr 5                      IF  IF  ID  EX  DA
On the 5th cycle there are 5 instr. simultaneously executing,
2nd instr causes a stall of the pipe during DA
21/12/05 r.innocente 10
P3/P4 superpipelining
picture from Intel
21/12/05 r.innocente 11
Opteron pipeline
21/12/05 r.innocente 12
Itanium2 pipeline
21/12/05 r.innocente 13
Out Of Order (OOO) execution/1
• In-order execution (as stated by the program)
under-uses the functional units
• Algorithms for out of order (OOO) execution of
instructions were devised such that :
– Data consistency is maintained
– High utilization of the FUs is achieved
• Instructions are executed when their inputs are
ready, without regard to the original program
order
21/12/05 r.innocente 14
OOO/2
• Hazards :
– RAW (Read After Write): result forwarding
– WAW/WAR (Write After Write, Write After Read)
: skip previous WB or ROB
• Tomasulo's scheduler (IBM 360/91, 1967)
• Reorder Buffer (ROB) : to allow precise
interrupts
21/12/05 r.innocente 15
OOO/3
Tomasulo additions for OOO execution with ROB
21/12/05 r.innocente 16
OOO/4
In the Tomasulo approach, a set of buffers called reservation stations
is added to each resource (functional units, load and store units,
register file). This is a distributed algorithm: each station is in
charge of watching for its operands to become available, possibly
snooping on the bus, and of executing its operation as soon as possible.
• Issue
• Execute
• WriteBack
21/12/05 r.innocente 17
Branch prediction
Speculative execution
To avoid pipeline starvation, the most probable branch of a
conditional jump (for which the condition register is not yet
ready) is guessed (branch prediction) and the following
instructions are executed (speculative execution), their
results being stored in hidden registers.
If later the condition turns out as predicted, the
instructions are retired and the hidden registers are
renamed to the real registers; otherwise we had a
misprediction and the pipe following the branch is flushed.
21/12/05 r.innocente 18
Hyperthreading/1
• Introduced under this name by Intel on their Xeon and
Pentium 4 processors. Similar proposals were
previously known in the literature as Simultaneous
Multithreading (SMT)
• With only ~5% more gates it achieves a claimed
20-30% performance increase
• The processor has two copies of the architectural
state (registers, IP and so on) and so appears to
the software as 2 logical processors
21/12/05 r.innocente 19
Hyperthreading/2
• Multithreaded applications gain the most
• The large cache footprint of typical
scientific applications makes them benefit
much less from this feature
(Diagram: execution resources shared between two
architectural states, State1 and State2; logical
processor numbering 0-5.)
21/12/05 r.innocente 20
Hyperthreading/3
• Different approaches could have been used :
partition (fixed assignment to logical
processors), full sharing, threshold (a
resource is shared but there is an upper
limit on its use by each logical processor)
• Intel for the most part used the full sharing
policy
21/12/05 r.innocente 21
Hyperthreading/4
Pic from intel
21/12/05 r.innocente 22
Hyperthreading/5
Operating modes : ST0,ST1,MT
Pic from Intel
21/12/05 r.innocente 23
x86 uarchitectures
Intel:
• P6 uarchitecture
• NetBurst
• Nocona (EM64T)
AMD:
• Thunderbird
• Palomino
• AMD64/Hammer
• SledgeHammer
21/12/05 r.innocente 24
Intel x86 family
• Pentium III
– Katmai (0.25 u)
– Coppermine (0.18 u)
• 500 Mhz – 1.13 Ghz
– Tualatin (0.13 u)
• 1.13 – 1.33 Ghz
• Pentium 4
– Willamette (0.18 u)
• 1.4 – 2.0 Ghz
– Northwood (0.13 u)
• 2.0 – 2.2 Ghz
– Gallatin (0.13 u) – 3.4 Ghz
– Prescott (90 nm) – 2.8 Ghz
– Nocona (EM64T, 90 nm)
21/12/05 r.innocente 25
AMD Athlon family
• K7 Athlon 1999, 0.25 u technology w/3DNow!
• K75 Athlon, 0.18 u technology
• Thunderbird (Athlon 3), 0.18 u technology
(on die L2 cache)
• Palomino (Athlon 4), 0.18 u QuantiSpeed
uarchitecture (SSE support), 15 stage
pipeline
• Opteron/AMD64, 0.18 and then 0.13 u
21/12/05 r.innocente 26
Pentium 4 0.13u uarchitecture
Pic from Intel
21/12/05 r.innocente 27
Netburst 90nm uarch
Pic from Intel
21/12/05 r.innocente 28
Athlon uarchitecture
21/12/05 r.innocente 29
x86 Architecture extensions/1
x86 architecture was conceived in 1978. Since then
it underwent many architecture extensions.
• Intel MMX (Matrix Math eXtensions):
– introduced in 1997 supported by all current processors
• Intel SSE (Streaming SIMD Extensions):
– introduced on the Pentium III in 1999
• Intel SSE2 (Streaming SIMD Extensions 2):
– introduced on the Pentium 4 in Dec 2000
21/12/05 r.innocente 30
x86 Architecture extensions/2
• AMD 3DNow! :
– introduced in 1998 (extends MMX)
• AMD 3DNow!+ (or 3DNow! Professional, or
3DNow! Athlon):
– introduced with the Athlon (includes SSE)
21/12/05 r.innocente 31
x86 architecture extensions/3
The so called feature set can be obtained using the
assembly instruction
cpuid
which was introduced with the Pentium and returns information
about the processor in the processor's registers:
processor family, model, revision, features supported, size
and structure of the internal caches.
On Linux the kernel uses this instruction at startup and much
of this information is available by typing:
cat /proc/cpuinfo
21/12/05 r.innocente 32
x86 architecture extensions/4
Part of a typical output can be :
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III(Coppermine)
cache size : 256 KB
flags : fpu pae tsc mtrr pse36 mmx sse
21/12/05 r.innocente 33
SIMD technology
• A way to increase processor performance
is to group together equal instructions on
different data (Data Level Parallelism: DLP)
• SIMD (Single Instruction Multiple Data)
comes from Flynn's taxonomy of parallel
machines (1966)
• Intel proposed its :
– SWAR ( SIMD Within A Register)
21/12/05 r.innocente 34
typical SIMD operation
X4 | X3 | X2 | X1
op   op   op   op
Y4 | Y3 | Y2 | Y1
=
X4 op Y4 | X3 op Y3 | X2 op Y2 | X1 op Y1
21/12/05 r.innocente 35
MMX
• Adds 8 new 64-bit registers :
– MM0 – MM7
• MMX allows computations on packed byte,
word or doubleword integers contained in
the MM registers (the MM registers overlap
the FP registers !)
• Not useful for scientific computing
21/12/05 r.innocente 36
SSE
• Adds 8 new 128-bit registers :
– XMM0 – XMM7
• SSE allows computations on operands that
contain 4 Single Precision floating point
values, either in memory or in the XMM registers
• Very limited use for scientific computing,
because of the lack of precision
21/12/05 r.innocente 37
SSE2
• Works with operands either in memory or in the
XMM registers
• Allows operations on packed Double Precision
Floating Points or 128-bit integers
• Using a linear algebra kernel written with SSE2
instructions, matrix multiplication can achieve 1.8
Gflops on a P4 at 1.4 Ghz : http://hpc.sissa.it/p4/
21/12/05 r.innocente 38
Intel SSE3/1
• SSE3 introduces 13 new instructions:
– x87 conversion to integer (fisttp)
– Complex arithmetic (addsubps, addsubpd,
movsldup, movshdup, …)
– Video encoding
– Thread synchronization
21/12/05 r.innocente 39
SSE3/2
• Conversion to integer (fisttp) :
– this was a long awaited feature. It is a
conversion to integer that always performs
truncation (as required by programming
languages). Previously the fistp instruction
converted according to the rounding mode
specified in the FCW, and so usually required
an expensive load/reload of the FCW.
21/12/05 r.innocente 40
Cache memory
Cache=a place where you can safely hide
something
It is a high-speed memory that holds a small
subset of main memory that is in frequent
use.
CPU cache Memory
21/12/05 r.innocente 41
Cache memory/1
• Processor /memory gap is the motivation :
– 1980 no cache at all
– 1995 2 level cache
– 2002 3 level cache
21/12/05 r.innocente 42
Cache memory/2
(Chart: CPU vs. memory performance, 1995-2000.
The CPU/DRAM gap grows about 50% per year.)
1975 : cpu cycle 1 usec, SRAM access 1 usec
2001 : cpu cycle 0.5 ns, DRAM access 50 ns
21/12/05 r.innocente 43
Cache memory/3
As the cache contains only part of the main
memory, we need to identify which portions
of the main memory are cached. This is
done by tagging the data that is stored in
the cache.
The data in the cache is organized in lines
that are the unit of transfer from/to the CPU
and the memory.
21/12/05 r.innocente 44
Cache memory/4
(Diagram: the cache is an array of rows, each
holding a tag and a data line.)
Typical line sizes are 32 bytes (Pentium III),
64 bytes (Athlon) and 128 bytes (Pentium 4).
21/12/05 r.innocente 45
Cache memory/5
Block placement algorithm :
• Direct Mapped : a block can be placed in just one row of the cache
• Fully Associative : a block can be placed on any row
• n-Way set associative: a block can be placed on n places in a cache
row
With direct mapping or n-way placement the row is determined using a
simple hash function such as the least significant bits of the block address.
E.g. if the cache line is 64 bytes (bits [5:0] of the address are used only to
address the byte inside the line) and there are 1k rows, then bits
[15:6] of the address select the cache row. The tag is then
the remaining most significant bits [:16].
21/12/05 r.innocente 46
Cache memory/6
(Diagram: two side-by-side tag/data arrays.)
2-way cache : a block can be placed in either
of the two positions in a row.
21/12/05 r.innocente 47
Cache memory/7
• With fully associative or n-way caches an
algorithm for block replacement needs to be
implemented.
• Usually some approximation of the LRU (least
recently used) algorithm is used to determine the
victim line.
• With large associativity (16 or more ways)
choosing at random the line is almost equally
efficient
21/12/05 r.innocente 48
Cache memory/8
• When during a memory access we find the
data in the cache we say we had a hit,
otherwise we had a miss.
• The hit ratio is a very important parameter
for caches in that it lets us predict the AMAT
(Average Memory Access Time)
21/12/05 r.innocente 49
Cache memory/9
If
• tc is the time to access the cache
• tm is the time to access the data from
memory
• h is the hit ratio (between 0 and 1)
then:
t = h * tc + (1-h) * tm
21/12/05 r.innocente 50
Cache memory/10
In a typical case we could have :
• tc = 2 ns
• tm = 50 ns
• h = 0.90
then the AMAT would be :
AMAT = 0.9 * 2 + 0.1 * 50 = 6.8 ns
21/12/05 r.innocente 51
Cache memory/11
Typical split L1/unified L2 cache organization:
L2
unified
L1
data
L1
code
CPU
21/12/05 r.innocente 52
Cache memory/12
• Classification of misses (3C)
– Compulsory : can't be avoided, happens the 1st
time the data is loaded (would happen even with
an infinite cache)
– Capacity : the data was loaded, but was evicted
because the cache filled up (would happen even
with a fully associative cache)
– Conflict : in an n-way cache, a miss that
happens because the data was evicted due to
low associativity
21/12/05 r.innocente 53
Cache memory/13
• Recent processors can prefetch cache lines
under program control or under hardware
control, snooping the access patterns of
multiple flows. The Netburst architecture
introduced hw prefetch on the Pentium 4.
21/12/05 r.innocente 54
Cache memories/14
• Netburst hw prefetch is invoked after 2-3 cache misses
with a constant stride
• It prefetches 256 bytes ahead
• Up to 8 streams
• Doesn't prefetch past the 4 KB page boundary
21/12/05 r.innocente 55
Real Caches/1
• Pentium III : line size 32 bytes
– split L1 : 16 KB inst/16 KB data, 4-way, write through
– unified L2 : 256 KB (Coppermine), 8-way, write back
• Pentium 4 :
– split L1 :
• instruction L1 : 12 K uops, 8-way (trace execution cache)
• data L1 : 8 KB, 4-way, 64 byte line size, write through
– unified L2 : 256 KB, 8-way, 128 byte line size, write back (512 KB on
Northwood)
• Athlon :
– split L1 : 64 KB/64 KB, 64 byte lines
– unified L2 : 256 KB, 16-way, 64 byte line size
21/12/05 r.innocente 56
Real Caches/2
• Opteron : exclusive organization (L1 not a subset of L2)
– L1 : instruction 64 KB/64 bytes per line, data 64 KB/64 bytes per line
– L2 : 1 MB, 16-way, 128 bytes per line
• Itanium 2 0.13 u 1.5 Ghz (Madison) :
– L1I : 16 KB, 4-way, 64 byte line
– L1D : 16 KB, 4-way, 64 byte line
– unified L2 : 256 KB, 8-way, 128 byte line
– unified L3 : 9 MB, 24-way, 128 byte line
21/12/05 r.innocente 57
Real Caches/3
An intelligent RAM (IRAM)?
(Die photo: Itanium2 (Madison), 6 MB L3 cache,
455 M transistors.)
21/12/05 r.innocente 58
Memory performance
• stream (McCalpin) : http://www.cs.virginia.edu/stream
• memperf (T.Stricker) : http://www.cs.inf.ethz.ch/CoPs/ECT
21/12/05 r.innocente 59
MTRR/1
Memory Type Range Registers were introduced
with the P6 uarchitecture. They are programmed
to define how the processor should behave
regarding the use of the cache in different
memory areas. The following types of behaviour
can be defined:
UC (Uncacheable), WC (Write Combining), WT (Write
Through), WB (Write Back), WP (Write Protect)
21/12/05 r.innocente 60
MTRR/2
• UC=uncacheable:
– no cache lookups,
– reads are not performed as line reads, but as is
– writes are posted to the write buffer and performed in
order
• WC=write combining:
– no cache lookups
– read requests performed as is
– a Write Combining Buffer of 32 bytes is used to buffer
modifications until a different line is addressed or a
serializing instruction is executed
21/12/05 r.innocente 61
MTRR/3
• WT=write through:
– cache lookups are performed
– reads are performed as line reads
– writes are also performed in memory in any case (the L1 cache is
updated and L2 may be invalidated)
• WB=write back
– cache lookups are performed
– reads are performed as line reads
– writes are performed on the cache line, possibly reading in a full
line first. Only when a modified line is evicted from the cache is it
written to memory
• WP=write protect
– writes are never performed in the cache line
21/12/05 r.innocente 62
MTRR/4
There are 8 MTRRs for variable size areas
on the P6 uarchitecture.
On Linux you can display them with:
cat /proc/mtrr
21/12/05 r.innocente 63
MTRR/5
A typical output of the command would be:
reg00: base=0x00000000 (0MB),
size=256MB: write-back
reg01: base=0xf0000000 (3840MB),
size=32MB:write-combining
the first entry is for the DRAM, the other for the
graphic board framebuffer
21/12/05 r.innocente 64
MTRR/6
It is possible to remove and add regions
using these simple commands:
echo ‘disable=1’ >| /proc/mtrr
echo ‘base=0xf0000000 size=0x4000000
type=write-combining’ >|/proc/mtrr
( Suggestion: Try to use xengine while disabling and
re-enabling the write-combining region of the
framebuffer.)
21/12/05 r.innocente 65
Explicit cache control/1
Caches were introduced as an h/w only optimization
and as such were destined to be completely hidden
from the programmer.
This is no longer true, and all recent architectures
have introduced some instructions for explicit
cache control by the application programmer. On
the x86 this was done together with the
introduction of the SSE instructions.
21/12/05 r.innocente 66
Explicit cache control/2
On the Intel x86 architecture the following instructions
provide explicit cache control :
• prefetch : these instructions load a cache line before
the data is actually needed, to hide latency
• non temporal stores : move data from registers to
memory without writing it in the caches, when it is known
the data will not be re-used
• fence : to be sure the order between prior/following memory
operations is respected
• flush : to write back (and invalidate) a cache line, when it
is known it will not be re-used soon
21/12/05 r.innocente 67
Performance and timestamp
counters/1
These counters were initially introduced on
the Pentium but documented only on the
PPro. They are also supported on the
Athlon.
Their aim is fine tuning and profiling of
applications. They were adopted after many
other processors had shown their
usefulness.
21/12/05 r.innocente 68
Performance and timestamp
counters/2
Timestamp counter (TSC):
• it is a 64 bit counter
• it counts clock cycles (processor cycles: now up to 2
Ghz = 0.5 ns) since reset or since the programmer zeroed it
• when it reaches a count of all 1s, it wraps around without
generating any interrupt
• it is an x86 extension that is reported by the cpuid
instruction, and as such can be found in /proc/cpuinfo
• Intel assures that even on future processors it will never
wrap around in less than 10 years
21/12/05 r.innocente 69
Performance and timestamp
counters/3
• the TSC can be read with the RDTSC (Read TSC)
instruction or the RDMSR (Read Model Specific Register
0x10) instruction
• it is possible to zero the lower 32 bits of the TSC using the
WRMSR (Write MSR 0x10) instruction (the upper 32 bits
will be zeroed automatically by the h/w)
• RDTSC is not a serializing instruction and as such doesn't
prevent Out of Order execution. Serialization can be
obtained using the cpuid instruction
21/12/05 r.innocente 70
Performance and timestamp
counters/4
The TSC is used by Linux during the startup phase
to determine the clock frequency:
• the PIT (Programmable Interval Timer) is
programmed to generate a 50 millisecond interval
• the elapsed clock ticks are computed reading the
TSC before and after the interval
The TSC can be read by any user or only by the
kernel according to a cpu flag in the CR4 register
that can be set e.g. by a module.
21/12/05 r.innocente 71
Performance and timestamp
counters/5
On the P6 uarchitecture there are 4 registers used for performance
monitoring:
• 2 event select registers: PerfEvtSel0, PerfEvtSel1
• 2 performance counters 40 bits wide : PerfCtr0, PerfCtr1
You have to enable and specify the events to monitor setting the event
select registers.
These registers can be accessed using the RDMSR and WRMSR
instructions at kernel level (MSR 0x186,0x187, 0x193 0x194).
The performance counters can also be accessed using the RDPMC
(Read Performance Monitoring Counters) at any privilege level if
allowed by the CR4 flag.
21/12/05 r.innocente 72
Performance and timestamp
counters/6
The Performance Monitoring mechanism has been
vastly changed and expanded on P4 and Xeons.
This new feature is called : PEBS (Precise Event-Based Sampling).
There are 45 event selection control (ESCR)
registers , 18 performance monitor counters and
18 counter configuration control (CCCR) MSR
registers.
21/12/05 r.innocente 73
Performance and timestamp
counters/7
A useful performance counters library to
instrument the code and a tool called rabbit
to monitor programs not instrumented with
the library can be found at:
http://www.scl.ameslab.gov/Projects/Rabbit/
21/12/05 r.innocente 74
Intel Itanium/1
• EPIC (Explicit Parallel Instruction Computing) : a joint
HP/Intel project started at the end of the '90s :
– its goal was to overcome the limitations of current
architectures. Recognizing that it was hard to continue
exploiting ILP (Instruction Level Parallelism)
exclusively in hardware, it was thought that compilers
could decide much better on dependencies and
suggest parallelism.
• In-Order core !!!!!!!
21/12/05 r.innocente 75
Intel Itanium/2
• It's a 64 bit architecture : addresses are 64 bits
• It supports the following data types:
– Integers of 1, 2, 4 and 8 bytes
– Floating point single, double and extended (80 bits)
– Pointers of 64 bits
• With a few exceptions the basic data type is 64 bits
and operations are done on 64 bits. 1, 2 and 4 byte
operands are zero extended when loaded.
21/12/05 r.innocente 76
Intel Itanium/3
• A typical Itanium instruction is a 3 operand
instruction designated with this syntax :
– [(qp)]mnemonic[.comp1][.comp2] dsts = srcs
• Examples :
– Simple instruction : add r1 = r2,r3
– Predicated instruction:(p5)add r3 = r4,r5
– Completer: cmp.eq p4 = r2,r3
21/12/05 r.innocente 77
Intel Itanium/4
• (qp) : a qualifying predicate is a 1 bit predicate
register p0-p63. The instruction is executed if the
register is true (= 1). Otherwise a NOP is
executed. If omitted p0 is assumed which is
always true
• mnemonic :
• comp1,comp2 : some instr include 1 or more
completers
• dests = srcs : most instructions have 2 src opnds
and 1 dest opnd
21/12/05 r.innocente 78
Intel Itanium/4bis
If (i == 3)
x = 5
Else
x = 6
becomes:
• pT,pF = compare(i,3)
• (pT) x = 5
• (pF) x = 6
21/12/05 r.innocente 79
Intel Itanium/5
• Memory in Itanium :
– single, uniform, linear address space of 2^64 bytes
• single : data and code together, not like Harvard
arch.
• uniform: no predefined zones, like interrupt vectors ..
• linear: not segmented like x86
• Code stored in little endian order, usually also data, but
there is support for big endian OS too
• Load and store arch.
21/12/05 r.innocente 80
Intel Itanium/6
• Improves ILP(instruction level
parallelism) :
– Enabling the code writer to
explicitly indicate it
– Providing a 3 instruction-wide
word called a bundle
– Providing a large number of
registers to avoid contention
21/12/05 r.innocente 81
Intel Itanium/7
• Instructions are bound in
instruction groups. These are
sets of instructions that don't have
WAW or RAW dependencies and
so can execute in parallel
• An instruction group must contain
at least one instruction; there is no
limit on the number of instructions in an
instruction group. They are
delimited in the code by cycle
breaks (;;)
• Instruction groups are composed of
instruction bundles. Each bundle
contains 3 instructions and a template field
21/12/05 r.innocente 82
Intel Itanium/8
• Code generation assures there is no RAW or WAW race
between instructions in the bundle, and sets the template field
that maps each instruction to an execution unit. In this way
the instructions in the bundle can be dispatched in parallel
• Bundles are 16 byte aligned
• The template field can end the instruction group at the end
of the bundle or in the middle :
21/12/05 r.innocente 83
Intel Itanium/9
• Registers:
• 128 GPR
• 128 FP reg.
• 64 pred.reg
• 8 branch reg.
• 128 Applic. Reg
• IP reg
21/12/05 r.innocente 84
Intel Itanium/10
• Reg.Validity :
– Each GR has an associated
NaT(Not a Thing) bit
– FP reg use a special
pseudozero value called
NaTVal
21/12/05 r.innocente 85
Intel Itanium/11
• Branches :
– Relative direct : the instruction contains a 21 bit displacement that
is added to the current IP of the bundle containing the branch.
This is a 16 byte instruction bundle address (= 25 bits)
– Indirect branches : using one of the 8 branch regs
• It allows multiple branches to be evaluated in parallel,
taking the first predicated true
21/12/05 r.innocente 86
Intel Itanium/12
• Memory Latency hiding because of large register file used
for :
– Data speculation : execution of an op before its data dependency
is resolved
– Control speculation : execution of an op before its control
dependency is resolved
– Elimination of memory accesses through the use of regs
21/12/05 r.innocente 87
Intel Itanium/13
• Proc calls and register stack :
– Usually proc calls were implemented using a procedure stack in
main memory
– Itanium uses the register stack for that
• Rotating regs :
– Static : the 1st 32 regs are permanent regs visible to all procs, in
which global vars are placed
– Stacked : the other 96 regs behave like a stack, a proc can
allocate up to 96 regs for a proc frame
• Reg Stack Engine (RSE) :
– A hardware mechanism operating transparently in the background
ensures that no overflow can happen, saving regs to memory
21/12/05 r.innocente 88
Intel Itanium/14
Pic from Intel
21/12/05 r.innocente 89
Intel Itanium/15
• FP registers are 82 bits :
– 80 bits like the extended format
used since the x87 coprocessor
– 2 guard bits to allow computing
of valid 80 bits results for divide
and square roots implemented
in s/w
• FP status register (FPSR) :
– Provides 4 separate status
fields sf0-3, enabling 4 different
computational environments
21/12/05 r.innocente 90
Intel Itanium/16
• Itanium 2 (Madison):
– 1.6 Ghz, 1.5 Ghz, 1.4 Ghz
– L3 cache : 9 MB, 6 MB, 4 MB
– L2 cache : 256 KB
– L1 cache : 32 KB instr, 32 KB data
– 200/266 Mhz bus, 400/533 MT/s @ 128 bits
system bus (6.4 GB/s, 8.5 GB/s)
21/12/05 r.innocente 91
Intel Itanium/17
Pic from Intel
21/12/05 r.innocente 92
Intel Itanium/18
Pic from Intel
21/12/05 r.innocente 93
Intel Itanium/19
21/12/05 r.innocente 94
Opteron uarch
Pic from AMD
21/12/05 r.innocente 95
64 bit architectures
• AMD – Opteron
• Intel – Itanium
• Intel - EMT64T
21/12/05 r.innocente 96
AMD-64/1
– AMD extensions to IA-32 to allow 64 bit
addressing, aka Direct Connect Architecture
(DCA)
– Implementations :
• Opteron for servers
• Athlon64, Athlon64 FX for desktops
21/12/05 r.innocente 97
AMD Hammer/Opteron
• x86-64 architecture specified in :
– http://www.x86-64.org
21/12/05 r.innocente 98
Intel EM64T/1
• Extended Memory 64 Technology : an extension
to IA-32 compatible with the proposed AMD-64
• Introduced with Nocona Xeons. 3 operating
modes :
– 64 bit mode : OS, drivers, applications recompiled
– Compatibility mode: OS & drivers 64 bits,appl.
unchanged
– Legacy mode: runs 32 bits OS, drivers and application
21/12/05 r.innocente 99
x86-64 or AMD64
Pic from AMD
21/12/05 r.innocente 100
Processor bus/1
• Intel (AGTL+):
– bus based (max 5 loads)
– explicit in band arbitration
– short bursts (4 data txfers)
– 8 bytes wide (64 bits), up to 133 Mhz
• Compaq Alpha (EV6)/Athlon :
– point to point
– DDR double data rate (2 transfers per clock)
– licensed by AMD for the Athlon (133 Mhz x 2)
– source synchronous (up to 400 Mhz)
– 8 bytes wide (64 bits)
21/12/05 r.innocente 101
Processor bus/2
• Intel Netburst (Pentium 4):
– source synchronous (like EV6)
– 8 bytes wide
– 100 Mhz clock
– quad data rate for data transfers (4 x 100 Mhz x 8
bytes = 3.2 GB/s, up to 4 x 200 Mhz x 8 = 6.4 GB/s)
– double data rate for address transfers
– max 128 bytes (a P4 cache line) in a transaction (4 data
transfer cycles)
21/12/05 r.innocente 102
Processor bus/3
• Itanium2 :
– 128 bits wide
– 200/266 Mhz
– dual pumped
– 400/533 MT/s x 16 bytes = 6.4/8.5 GB/s
• Opteron (not a bus) :
– 2, 4, 8, 16, 32 bits wide
– 800 Mhz
– double pumped
– 800 Mhz x 2 DDR x 2 bytes = 3.2 GB/s per direction
21/12/05 r.innocente 103
Intel IA32 node
(Diagram: two Pentium 4 processors sharing a 64 bit
FSB @ 100/133/200 Mhz to the North Bridge and memory.)
21/12/05 r.innocente 104
Intel PIII/P4 processor bus
• Bus phases :
– Arbitration : 2 or 3 clks
– Request phase : 2 clks, packet A, packet B (size)
– Error phase : 2 clks, check parity on pkts, drive AERR
– Snoop phase : variable length, 1 or more clks
– Response phase : 2 clks
– Data phase : up to 32 bytes (4 clks, 1 cache line), 128
bytes with quad data rate on Pentium 4
• 13 clks to txfer 32 bytes, or 128 bytes on the P4
21/12/05 r.innocente 105
Alpha node
(Diagram: two 21264 processors connected through the
Tsunami crossbar switch to memory; AlphaEV6 bus
64 bits at 4 x 83 Mhz, memory bus 256 bits at 83 Mhz;
measured stream memory b/w > 1 GB/s.)
21/12/05 r.innocente 106
Alpha/Athlon EV6 interconnect
• 3 high speed channels :
– Unidirectional processor request channel
– Unidirectional snoop channel
– 72-bit data channel (ECC)
• source synchronous
• up to 400 Mhz (4 x 100 Mhz: quad pumped,
Athlon :133Mhz x 2 ddr )
21/12/05 r.innocente 107
Opteron integrated memory
controller
• Opteron block diagram
• The integrated
memory controller
frees the interconnect
from local memory
access
• Remote memory
access and I/O
happens through the
Hypertransport
21/12/05 r.innocente 108
2P Hammer
21/12/05 r.innocente 109
4P Hammer
21/12/05 r.innocente 110
4P Opteron detailed view
21/12/05 r.innocente 111
4P Opteron
21/12/05 r.innocente 112
8P Hammer
21/12/05 r.innocente 113
Hyper Transport hops
21/12/05 r.innocente 114
HT local/xfire bandwidth
(Chart: local memory b/w vs. crossfire (xfire) memory b/w.)
21/12/05 r.innocente 115
Double/Quad core
Pics From Broadcom : dualcore and quadcore
MIPS64 cpus
21/12/05 r.innocente 116
SMPs
Symmetrical Multi Processing denotes a
multiprocessing architecture in which there
is no master CPU, but all CPUs co-operate.
• processor bus arbitration
• cache coherency/consistency
• atomic RMW operations
• mp interrupts
21/12/05 r.innocente 117
Intel MP
Processor bus arbitration
• As we have seen there is a special phase for each bus
transaction devoted to arbitration (it takes 2 or 3 cycles)
• At startup each processor is assigned a cluster number
from 0 to 3 and a processor number from 0 to 3 (the Intel
MP specification covers up to 16 processors)
• In the arbitration phase each processor knows who owned
the bus during the last transaction (the rotating ID) and who is
requesting it. Each processor then knows who will own
the current transaction, because the bus is granted
according to the fixed sequence 0-1-2-3.
21/12/05 r.innocente 118
Cache coherency
Coherence : all processors should see the same data.
This is a problem because each processor can have its own
copy of the data in its cache.
There are essentially 3 ways to assure cache coherency:
• directory based : a central directory keeps
track of the shared regions (ccNUMA: Sun Enterprise,
SGI)
• snooping : all the caches monitor the bus to determine if
they have a copy of the data requested (UMA: Intel MP)
• broadcast based
21/12/05 r.innocente 119
Cache consistency
• Consistency : maintain the order in which
writes are executed.
Writes are serialized by the processor bus.
21/12/05 r.innocente 120
Snooping
Two protocols can be used by caches to snoop the
bus :
• write-invalidate : when a cache hears on the bus a
write request for one of its lines from another bus
agent, it invalidates its copy (Intel MP)
• write-update : when a cache hears on the bus a
write request for one of its lines from another
agent, it reloads the line
21/12/05 r.innocente 121
Intel MP snooping
• If a memory transaction (memory read and invalidate/read code/read
data/write cache line/write) was not cancelled by an error, then the
error phase of the bus is followed by the snoop phase(2 or more
cycles)
• In this phase the caches will check if they have a copy of the line
• if it is a read and a processor has a modified copy then it will supply
its copy that is also written to memory
• if it is a write and a processor has a modified copy then the memory
will store first the modified line and then will merge the write data
MESI protocol
Each cache line can be in 1 of the 4 states:
• Invalid (I): the line is not valid and should not be used
• Exclusive(E): the line is valid, is the same as main
memory and no other processor has a copy of it, it can be
read and written in cache w/o problems
• Shared(S): the line is valid, the same as memory, one or
more other processors have a copy of it, it can be read
from memory, it should be written-through (even if
declared write back!)
• Modified(M): the line is valid, has been updated by the
local processor, no other cache has a copy, it can be read
and written in cache
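The four states above can be sketched as a table-driven state machine (a simplified single-line model assuming write-invalidate snooping; e.g. a read miss is taken to state S here, while a real controller enters E when no other cache holds the line):

```python
# Simplified MESI transitions (single cache line, write-invalidate snooping).
MESI = {
    ('I', 'local_read'):  'S',  # simplification: assume another copy exists
    ('I', 'local_write'): 'M',  # read-for-ownership, then modify
    ('E', 'local_write'): 'M',  # exclusive line written in cache, no bus traffic
    ('E', 'snoop_read'):  'S',  # another cache now shares the line
    ('S', 'local_write'): 'M',  # invalidates the other copies
    ('S', 'snoop_write'): 'I',  # write-invalidate
    ('M', 'snoop_read'):  'S',  # modified data supplied and written back
    ('M', 'snoop_write'): 'I',
}

def mesi_next(state, event):
    return MESI.get((state, event), state)  # other events leave the state alone

print(mesi_next('S', 'local_write'))  # M
```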
MESI states
(State diagram: transitions among the I, E, M, S states driven by Read (R), Write (W) and Snoop (S) events.)
L2/L1 coherence
• An easy way to keep coherent the 2 levels
of caches is to require inclusion ( L1 subset
of L2)
• Otherwise each cache can perform its
snooping
Atomic Read Modify Write
The x86 instruction set has the possibility to prefix some
instructions with a LOCK prefix:
– bit test and modify
– exchange
– increment, decrement, not, add, sub, and, or
These will cause the processor to assert the LOCK# bus
signal for the duration of the read and write.
The processor automatically asserts the LOCK# signal
during execution of an XCHG, during a task switch, and
while reading a segment descriptor.
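Why the bus must be locked for the whole read-modify-write can be shown with a toy interleaving model (an illustration, not real bus timing):

```python
# A locked increment is one indivisible read-modify-write; without LOCK#
# two processors can both load the old value and one update is lost.
def interleaved_increments(lock_bus):
    mem = 0
    if lock_bus:
        for _ in range(2):      # each RMW completes before the next starts
            mem = mem + 1
        return mem
    p0 = mem                    # both processors read the old value...
    p1 = mem
    mem = p0 + 1                # ...and both store old+1
    mem = p1 + 1
    return mem

print(interleaved_increments(True), interleaved_increments(False))  # 2 1
```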
Intel MP interrupts
Intel has introduced an I/O APIC (Advanced Programmable Interrupt
controller) which replaces the old 8259A.
• each processor of an SMP has its own integrated local APIC
• an Interrupt Controller Communication (ICC) bus connects an
external I/O APIC (front-end) to these local APICs
• external IRQ lines are connected to the I/O APIC that acts as a router
• the I/O APIC can dispatch interrupts to a fixed processor or to the one
executing lowest priority activities (the priority table has to be updated
by the kernel at each context switch)
Broadcast cache coherence
• In a 4-proc cluster p3 wants to access m0:
– The read cache line request is sent on the link p3-p1
– The request is forwarded from p1 to p0
– P0 reads the line from memory, snooping itself, and sends out
snoop requests to p1 and p2
– P0 sends out its snoop response to p3, and p2 forwards the snoop
request from p0
– P0 gets the local memory copy and p2 sends its snoop response to
p3
– The memory line is forwarded to p3
PCI Bus
• PCI32/33 4 bytes@33Mhz=132MBytes/s (on i440BX,...)
• PCI64/33 8 bytes@33Mhz=264Mbytes/s
• PCI64/66 8 bytes@66Mhz=528Mbytes/s (on i840)
• PCI-X 8 bytes@133Mhz=1064Mbytes/s
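The peak figures above are just bus width times clock; a quick check:

```python
# Peak PCI bandwidth = bytes per transfer x clock (ideal burst, no overhead).
def pci_peak_mb_s(bytes_wide, mhz):
    return bytes_wide * mhz

print(pci_peak_mb_s(4, 33))    # PCI32/33  -> 132
print(pci_peak_mb_s(8, 66))    # PCI64/66  -> 528
print(pci_peak_mb_s(8, 133))   # PCI-X     -> 1064
```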
PCI-X 1.0
• 100/133 Mhz, 32/64 bit
– Max 1064 MB/s
• Split transactions
• Backward compatible with PCI 2.1/2.2/2.3 :
– PCI-X adapters work in existing PCI slots
– Older PCI cards slow down PCI-X bus
PCI-X 2.0
• 100% backward compatible
• PCI-X 266 and PCI-X 533
– 2.13 GB/s and 4.26 GB/s
– DDR and QDR using 133 Mhz base freq.
PCI efficiency
• Multimaster bus but arbitration is performed out of
band
• Multiplexed but in burst mode (implicit
addressing) only start address is txmitted
• Fairness guaranteed by MLT (Maximum Latency
Timer)
• 3 / 4 cycles overhead on 64 data txfers < 5 %
• on Linux use: lspci -vvv to look at the setup of
the boards
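The "< 5 %" claim follows directly from the burst format:

```python
# 3 overhead cycles amortized over a 64-transfer burst.
def overhead_fraction(data_cycles, overhead_cycles):
    return overhead_cycles / (data_cycles + overhead_cycles)

print(round(overhead_fraction(64, 3), 3))   # ~0.045, i.e. < 5 %
```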
PCI 2.2/X timing diagram
(Timing diagram: address phase and attribute phase on AD/C-BE#, target response, then data phases paced by FRAME#, IRDY# and TRDY#.)
common chipsets
PCI performance
Chipset                        Read MB/s   Write MB/s
Serverworks HE                 303         372
Intel 860                      227         315
AMD760MPX                      315         328
Tsunami (alpha)                205         248
Serverworks champion II HE     309         372
Serverset III HE               315         372
Intel 460GX (Itanium)          372         399
Titan (Alpha)                  407         488
Serverworks Serverset III LE   455         512
chipsets
PCI-X performance
System (chipset)                                  Read MB/s   Write MB/s
HP rx2000 dual 900Mhz Itanium2 (HP zx1)           784         1044
Supermicro XDL8-GG dual 2.4 Ghz Xeon              932         1044
2x Opteron 244 (AMD 8131)                         946         1036
Memory buses
• SDRAM 8 bytes wide (64 bits): these memories are pipelined DRAM.
Industry found it much more appealing to quote their clock
frequency than their access times. The number of clock cycles to wait
for access is written in a small ROM on the module (SPD)
– PC-100, PC-133
– DDR-200, DDR-266; QDR on the horizon
• RDRAM 2 bytes wide (16 bits): these memories have a completely new
signaling technology. Their buses must be terminated at both ends
(continuity module required if a slot is not used!)
– RDRAM 600/800/1066: double data rate at 300, 400, 533 Mhz
Interconnects /1
• The technology used to interconnect processors overlaps today on
one end with IC and PCB (printed circuit board) technologies and on
the other end with WAN (Wide Area Network) technologies.
• SAN (System Area Network) and LAN technologies are at a midpoint.
LogP metrics (Culler)
• L = Latency: time data is on flight between the 2 nodes
• o = overhead: time during which the processor is engaged
in sending or receiving
• g = gap : minimum time interval between consecutive
message txmissions(or receptions)
• P = # of Processors
This metric was introduced to characterize a distributed system
with its most important parameters; a bit outdated, but still useful
(e.g. it doesn't take into account pipelining).
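A toy use of the parameters (illustrative microsecond numbers, not measured values):

```python
# End-to-end time of one LogP message and the rate ceiling set by the gap.
def logp_message_time(os_, L, or_):
    return os_ + L + or_            # time = os + L + or

def logp_max_rate(g):
    return 1.0 / g                  # at most one message per gap interval

print(logp_message_time(2.0, 5.0, 2.0))  # 9.0 (us)
print(logp_max_rate(4.0))                # 0.25 messages per us
```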
LogP diagram
(Diagram: sender overhead os, network latency L, receiver overhead or, with gap g between successive messages; time = os + L + or.)
Interconnects /2
• single mode fiber (SONET 10 Gb/s for some km)
• multimode fiber (1 Gb/s, 300 m: GigE)
• coaxial cable (800 Mb/s, 1 km: CATV)
• twisted pair (1 Gb/s, 100 m: GigE)
Interconnects /3
• shared / switched media:
– bus (shared communication):
• coaxial
• pcb trace
• backplane
– switch (1-1 connection):
• point to point
The required communication speed is forcing
a shift towards the pt-to-pt paradigm
Interconnects
• Full interconnection
network :
– Ideal but unattainable, and overkill
– D = 1
– Links grow like n^2
Interconnects /4
Hypercube:
• A k-cube has 2**k nodes,
each node is labelled by a
k-dim binary coordinate
• 2 nodes differing in only 1
dim are connected
• there are at most k hops
between 2 nodes, and
k*2**(k-1) links
(Figures: 1-cube (nodes 0, 1), 2-cube (nodes 00, 01, 10, 11), 3-cube.)
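The labelling rule above makes neighbours trivial to compute: flipping one coordinate bit gives a neighbour, and the hop count is the Hamming distance.

```python
# k-cube neighbours of a node: flip each of the k coordinate bits.
def hypercube_neighbours(node, k):
    return [node ^ (1 << d) for d in range(k)]

def hops(a, b):
    return bin(a ^ b).count('1')     # Hamming distance = shortest path length

print(hypercube_neighbours(0b101, 3))  # [4, 7, 1]
print(hops(0b000, 0b111))              # 3, the maximum for a 3-cube
```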
Interconnects /4
Mesh:
• an extension of hypercube
• nodes are given a k-dim
coordinate (in the 0:N-1 range)
• nodes that differ by 1 in only 1
coordinate are connected
• a k-dim mesh has N**k nodes
• there are at most k(N-1) hops
between nodes and wires are ~
k*N**k
(Figure: 2-dim 3x3 mesh with nodes labelled 00..22.)
Interconnects /4
Crossbar (xbar):
• typical telephone network
technology
• organized by rows and columns
(NxN)
• requires N**2 switches
• any permutation w/o blocking
• in the pictures the i-row is the
sender and the i-col is the
receiver of node i
(Figure: 5x5 xbar switch.)
Interconnect/4bis
• 1D Torus(Ring)
• 2D Torus
Interconnects/5
• Diameter : maximum distance in hops between two
processors
• Bisection width : number of links to cut to sever the
network in two halves
• Bisection bandwidth : minimum sum of b/w of links to be
cut in order to divide the net in 2 halves
• Node degree : number of links from a node
Bisection /1
Bisection width: given an N-node net, dividing the nodes into
two sets of N/2 nodes, the number of links that go from
one set to the other.
An important parameter is the minimum bisection width: the
minimum number of links whatever cut you use to bisect
the nodes.
An upper bound on the minimum bisection is N/2 because
whatever the topology would be you can always find a cut
across half of the node links. If a network has minimum
bisection of N/2 we say it has full bisection.
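For tiny networks the minimum bisection can be found by brute force, straight from the definition above (exponential enumeration, fine only for small N):

```python
from itertools import combinations

# Minimum bisection width: try every split into two halves of N/2 nodes
# and count the links crossing the cut.
def min_bisection(n_nodes, links):
    best = None
    for half in combinations(range(n_nodes), n_nodes // 2):
        half = set(half)
        crossing = sum(1 for a, b in links if (a in half) != (b in half))
        best = crossing if best is None else min(best, crossing)
    return best

ring8 = [(i, (i + 1) % 8) for i in range(8)]
print(min_bisection(8, ring8))   # 2: cutting a ring always severs >= 2 links
```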
Bisection /2
Bisection cut:
1 link
Bisection cut:
4 links
2 different 8 nodes nets
Bisection /3
It can be difficult to find the min bisection.
It can be shown that if a network is re-arrangeable
(it can route any permutation w/o blocking) then it
has full bisection.
An example permutation to be routed:
From:  0  1  2  3  4  5  6
To:    3  0  4  5  2  6  1
Interconnects /4
• Topology:
– crossbar (xbar) : any to any
– omega
– fat tree
• Bisection / ports per switch (values for a 64-node network):

Topology          Bisection   Ports/switch
bus               1           -
ring              2           3
star              32          64
grid/mesh         8           5
2-d torus         16          5
hypercube         32          7
fully connected   1024        64
Interconnects /5
• Connection/connectionless:
– circuit switched
– packet switched
• routing
– source based routing (SBR)
– virtual circuit
– destination based
Interconnects /6
• switch mechanism (buffer management):
– store and forward
• each msg is received completely by a switch before being forwarded to the
next
• pros: easy to design because there is no handling of partial msgs
• cons: long msg latency, requires large buffer
– cut-through (virtual cut-through)
• msg forwarded to the next node as soon as its header arrives
• if the next node cannot receive the msg then the full msg needs to be stored
– wormhole routing
• the same as cut-through except that the msg can be buffered in pieces
across multiple successive switches
• the msg is like a worm crawling through a worm-hole
Interconnects /7
• congestion control
– packet discarding
– flow control
• window /credit based
• start/stop
Gilder’s law
• proposed by G.Gilder
(1997) :
The total bandwidth of
communication systems
triples every 12 months
• compare with Moore’s law
(1974):
The processing power of a
microchip doubles every
18 months
• compare with:
memory performance
increases 10% per year
(source Cisco)
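Compounding the three quoted rates over the same period shows how fast the gap opens (assuming smooth exponential growth):

```python
# Growth factor after `years`, given a doubling/tripling period in months.
def growth(factor, period_months, years):
    return factor ** (years * 12 / period_months)

print(round(growth(3,   12, 6)))     # bandwidth, 3x/12mo -> 729
print(round(growth(2,   18, 6)))     # Moore,     2x/18mo -> 16
print(round(growth(1.1, 12, 6), 1))  # memory,    10%/yr  -> 1.8
```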
Optical networks
• Large fiber bundles
multiply the fibers per
route: 144, 432 and now
864 fibers per bundle
• DWDM (Dense
Wavelength Division
Multiplexing) multiplies the
b/w per fiber :
– Nortel announced a 80 x
80G per fiber (6.4Tb/s)
– Alcatel announced a 128 x
40G per fiber(5.1Tb/s)
NIC Interconnection point
(from D.Culler)
NI location    Controller         Special uproc       General uproc
Register       -                  -                   TMC CM-5
Memory         T3E annex          Meiko CS-2          Intel Paragon
Graphics bus   HP Medusa          -                   -
I/O bus        many ether cards   Myrinet, 3Com GbE   SP2, Fore ATM cards
LVDS/1
• Low Voltage Differential Signaling
(ANSI/TIA/EIA 644-1995)
• Defines only the electrical characteristics of
drivers and receivers
• Transmission media can be copper cable or
PCB traces
LVDS/2
• Differential :
– Instead of measuring the voltage Vref+U between a signal line
and a common GROUND, a pair is used and Vref+U and Vref–U
are transmitted on the 2 wires
– In this way the transmission is immune to Common Mode Noise
(the electrical noise induced in the same way on the 2 wires:
EMI,..)
LVDS/3
• Low Voltage:
– voltage swing is just 300 mV, with a driver offset of +1.2V
– receivers are able to detect signals as low as 20 mV, in the 0 to 2.4 V
range (supporting +/- 1 V of noise)
(Figure: each wire swings between 1.05 V and 1.35 V around the +1.2 V offset, giving a +/-300 mV differential signal.)
LVDS/4
• It consumes only 1 mW(330mV swing constant
current): GTL would consume 40 mW(1V swing)
and TTL much more
• consumer chips for connecting displays (OpenLDI
DS90C031) are already txmitting 6 Gb/s
• The low slew rate (300mV in 333 ps is only
0.9V/ns) minimizes crosstalk and distortion
VCSEL/1
VCSELs: Vertical Cavity
Surface Emitting Lasers:
• the laser cavity is vertical
to the semiconductor
wafer
• the light comes out from
the surface of the wafer
Distributed Bragg Reflectors
(DBR):
• 20/30 pairs of
semiconductor layers
Active region
p-DBR
n-DBR
VCSEL/2-EEL (Edge Emitting)
picture from Honeywell
VCSEL/3- Surface.Emission
picture from Honeywell
VCSEL/4
pict from Honeywell
VCSEL/5
At costs similar to LEDs they have the characteristics of lasers.
EELs:
• it's difficult to produce a geometry w/o problems when cutting
the chips
• it's not possible to know in advance if a chip is good
or not
VCSELs:
• they can be produced and tested with standard I.C.
procedures
• arrays of 10 or 1000 can be produced easily
Ethernet history
• 1976 Metcalfe invented a 1 Mb/s ether
• 1980 Ethernet DIX (Digital, Intel, Xerox)
standard
• 1989 Synoptics invented twisted pair
Ethernet
• 1995 Fast Ethernet
• 1998 Gigabit Ethernet
• 2002 10 Gigabit Ethernet
Ethernet
• 10 Mb/s
• Fast Ethernet
• Gigabit Ethernet
• 10GB Ethernet
Ethernet Frames
• 6 bytes dst address
• 6 bytes src address :3 bytes vendor codes, 3 bytes serial
#
• 2 bytes ethernet type: ip, ...
• data from 46 up to 1500 bytes
• 4 bytes FCS (checksum)
dst address | src address | type | data | FCS
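The layout above can be built byte by byte (a sketch: the FCS is computed here with zlib's CRC-32 for illustration; the real FCS bit ordering on the wire differs):

```python
import struct, zlib

# Minimal Ethernet II frame: dst(6) src(6) type(2) payload(46-1500) FCS(4).
def ethernet_frame(dst, src, ethertype, payload):
    if len(payload) < 46:
        payload = payload + bytes(46 - len(payload))   # pad to the minimum
    header = dst + src + struct.pack('!H', ethertype)
    fcs = struct.pack('<I', zlib.crc32(header + payload))
    return header + payload + fcs

f = ethernet_frame(b'\xff' * 6, b'\x02\x00\x00\x00\x00\x01', 0x0800, b'hi')
print(len(f))   # 64: the minimum Ethernet frame size
```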
Hubs
• Repeaters: layer 1
– like a bus they just repeat all the data across all the
connections (not very useful for clusters)
• Switches (bridges) layer 2
– they filter traffic and repeat only necessary traffic on
connections (very cheap switches can easily switch at
full speed many Fast Ethernet connections)
• Routers : layer 3
Ethernet flow/control
• With large switches it is necessary to be able to limit the flow of packets to
avoid packet discarding following an overflow
• Vendors tried to implement non-standard mechanisms to overcome the
problem
• One of these mechanisms, usable for half-duplex links, is sending
carrier bursts to simulate a busy channel
• The pseudo busy channel mechanism doesn't apply to full-duplex links
• MAC control protocol : to support flow control on f/d links
– a new ethernet type = 0x8808, for frames of 46 bytes (min length)
– first 2 bytes are the opcode other are the parameters
– opcode = 0x0001 PAUSE stop txmitting
– dst address = 01-80-c2-00-00-01 (an address not forwarded by bridges)
– following 2 bytes represent the number of 512 bit times to stop txmission
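A PAUSE frame built field by field from the description above (padding to the 46-byte minimum and the FCS are left out of the sketch; the source MAC is a placeholder):

```python
import struct

# MAC control PAUSE frame: reserved multicast dst, type 0x8808, opcode 1,
# then the pause time in units of 512 bit times.
def pause_frame(quanta, src=bytes(6)):          # src: placeholder MAC
    dst = bytes.fromhex('0180c2000001')         # not forwarded by bridges
    return (dst + src
            + struct.pack('!H', 0x8808)         # MAC control ethertype
            + struct.pack('!H', 0x0001)         # PAUSE opcode
            + struct.pack('!H', quanta))        # stop for quanta*512 bit times

f = pause_frame(512)
print(f[12:14].hex(), f[14:16].hex(), f[16:18].hex())  # 8808 0001 0200
```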
Auto Negotiation
• On twisted pairs each station transmits a series of pulses (quite
different from data) to signal link active. These pulses are called
Normal Link Pulses (NLP)
• A set of Fast Link Pulses (FLP) is used to code the station
capabilities
• There are some special repeaters that connect 10mb/s links together
and 100mb/s together and then the two sets to ports of a mixed
speed bridge
• Parallel detection: when autonegotiation is not implemented, this
mechanism tries to discover speed used from NLP pulses (will not
detect duplex links).
GigE
Peculiarities :
• Flow control necessary on full duplex
• 1000base-T uses all 4 pairs of cat 5 cable in both directions (250
Mb/s per pair, with echo cancellation); 10base-T and 100base-T used only 2 pairs
• 1000base-SX on multimode fiber uses a laser (usually a VCSEL).
Some multimode fibers have a discontinuity in the refraction index for
some microns just in the middle of the fiber. This is not important with
a LED launch where the light fills the core but is essential with a laser
launch. An offset patch cord (a small length of single mode fiber that
moves away from the center the spot) has been proposed as solution.
10 GbE /1
• IEEE 802.3ae 10GbE ratified on June 17, 2002
• Optical Media ONLY : MMF 50u/400Mhz 66m, new
50u/2000Mhz 300m, 62.5u/160Mhz 26m, WWDM 4x
62.5u/160Mhz 300m, SMF 10Km/40Km
• Full Duplex only (for 1 GbE a CSMA/CD mode of
operation was still specified, though no one implemented
it)
• LAN/WAN different :10.3 Gbaud – 9.95Gbaud (SONET)
• 802.3 frame format (including min/max size)
10GbE /2
from Bottor
(Nortel
Networks) :
marriage of
Ethernet and
DWDM
VLAN/1
Virtual LAN :
– IEEE 802.1Q extension to the ethernet std
– frames can be tagged with 4 more bytes (so
that now an ethernet frame can be up to 1522
bytes):
• TPID tagged protocol identifier, 2 bytes with value
0x8100
• TCI Tag control information: specifies the priority and the
Virtual LAN (12 bits) this frame belongs to
• remaining bytes with standard content
VLAN/2
• VLAN tagged ethernet
frame :
Infiniband/1
• It represents the convergence of 2 separate
proposals:
– NGIO (Next Generation IO: Intel, Microsoft, Sun)
– Future IO (Compaq, IBM, HP)
• Infiniband: Compaq, Dell, HP, IBM, Intel,
Microsoft, Sun
Infiniband/2
• Std channel at 2.5 Gb/s (copper LVDS or
fiber):
– 1x width 2.5 Gb/s
– 4x width 10 Gb/s
– 12x width 30 Gb/s
Infiniband/3
• Highlights:
– point to point switched interconnect
– channel based message passing
– computer room interconnect, diameter < 100-
300 m
– one connection for all I/O : ipc, storage I/O,
network I/O
– up to thousands of nodes
Infiniband/4
Infiniband/5
Myrinet/1
• comes from a USC/ISI research project : ATOMIC
• flow control and error control on all links
• cut-through xbar switches, intelligent boards
• Myrinet packet format :
source route bytes | type | data (no length limit) | CRC
(allows multiple protocols)
Myrinet/2
Myrinet/3
Every node can discover the topology with
probe packets (mapper).
In this way it gets a routing table that can be
used to compute headers for any
destination.
There is 1 byte for each hop traversed, the
first byte with routing information is
removed by each switch on the path.
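The per-hop behaviour can be modelled in a few lines (a simplified sketch: the route byte is taken here as an absolute output port, while real Myrinet route bytes are relative to the incoming port):

```python
# Each switch consumes the first source-route byte to pick an output port
# and forwards the shortened packet.
def switch_forward(packet):
    out_port, rest = packet[0], packet[1:]
    return out_port, rest

pkt = bytes([3, 1]) + b'payload'        # route: port 3, then port 1
port1, pkt = switch_forward(pkt)
port2, pkt = switch_forward(pkt)
print(port1, port2, pkt)                # 3 1 b'payload'
```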
Myrinet/4
• Currently Myrinet uses a 2 Gb/s link
• There are double link cards for the PCI-X so that
a theoretical b/w of 4 Gb/s per node can be
achieved : in this way you have to use 2 ports on
the switch too
• Their current TCP/IP implementation is very well
tuned: it reaches about 80% of the b/w of the
proprietary protocol, though with a much larger
latency
Clos networks
Named after Charles Clos who introduced them in
1953. It’s a kind of fat tree network.
These networks have full bisection.
Large Myrinet clusters are frequently arranged as a
Clos network.
Myricom base block is their 16 port xbar switch.
From this brick it's possible to build 128-node 3-level
(16+8 switches) and 1024-node 5-level
networks.
Clos networks/2
from myri.com
SCI/1
• Acronym for Scalable Coherent Interface,
ANSI/ISO/IEEE std 1596-1992
• Packet-based protocol for SAN (System Area Network)
• It uses pt-to-pt links of 5.3 Gb/s
• Topology : Ring, Torus 2D or 3D
• It uses a shared memory approach. SCI addresses are 64
bits:
– the most significant 16 bits specify the node
– the lower 48 bits address memory on the node
• Up to 64K nodes / 256 TB addressing space
SCI/2
• Dolphin SISCI API is the std API used to control the interconnect
using a shared memory approach. To enable communications :
– The receiving node sets aside a part of its memory to be used by
the SCI interconnect
– The sender imports the memory area in its address space
– The sender can now read or write directly in that memory area
• Because it is difficult to streamline reads through the
memory/PCI bus, writes are 10 times faster than reads
• In many implementations of upper-level protocols it is
quite convenient to use only writes
SCI/2
• The link b/w between nodes is effectively shared
• This puts an upper limit on the number of nodes
you can put in a loop. Currently up to 8-10 nodes.
So 2d torus up to 64 nodes, 3d over 64 nodes ..
• Because there are no switches the cost scales
linearly
• Cable length is a big limitation with a preferred
length of 1 m or less
SCI/3
SCI/4 3D Torus
Interconnect/1GbE
• Very cheap almost
commodity connection
now
• Broadcom chipset
based < USD 100 per
card
• Switch for 64-128
configurations around
USD 200-300 per port
Interconnect/SCI
Interconnect/Myrinet
• Currently 2Gb/s links,
with the possibility of a
dual link card
• SL card 800 USD, 2
links card 1600 USD
• 1 port on a switch
Interconnect/Infiniband
Interconnect/10GbE
• cost of optics alone is
around USD 3000
• 1 port on a switch
around 10,000 USD
• Up to only 48 ports
switches
Software
Despite great advances in network technology(2-3
orders of magnitude), much communication s/w
remained almost unchanged for many years
(e.g. BSD networking).
There is a lot of ongoing research on this theme and
very different solutions are proposed (zero-copy,
page remapping, VIA, ...)
Software overhead
(Figure: going from 10 Mb/s to 100 Mb/s to 1 Gb/s Ethernet the transmission time shrinks, while the software overhead stays the same.)
Being a constant, it is becoming more and more important!!
Standard data flow /1
(Figure: processor, North Bridge, memory, NIC and network, joined by the processor bus, the memory bus and the PCI I/O bus; data moves in three steps — (1) user buffer to kernel buffer, (2) DMA from kernel buffer to NIC memory, (3) DMA from NIC memory to the network.)
Standard data flow /2
• Copies :
1. Copy from user space to kernel buffer
2. DMA copy from kernel buffer to NIC memory
3. DMA copy from NIC memory to network
• Loads:
– 2x on the processor bus
– 3x on the memory bus
– 2x on the NIC memory
Zero-copy data flow
(Figure: (1) DMA directly from the user buffer to NIC memory over the PCI bus, (2) DMA from NIC memory to the network — the intermediate kernel-buffer copy disappears.)
Zero Copy Research
• Shared memory between user/kernel:
– Fbufs(Druschel,1993)
– I/O-Lite (Druschel,1999)
• Page remapping with copy on write (Chu,1996)
• Blast: hardware splits headers from data
(Carter,O’Malley,1990)
• ULNI (User-level Network Interface): implementation of
communication s/w inside libraries in user space
High speed networks, I/O systems and memory have comparable
bandwidths -> it is essential to avoid any unnecessary copy of data!
OS bypass – User level
networking
• Active Messages (AM) – von Eicken, Culler
(1992)
• U-Net –von Eicken, Basu, Vogels (1995)
• PM – Tezuka, Hori, Ishikawa, Sato (1997)
• Illinois FastMessages (FM) – Pakin,
Karamcheti, Chien (1997)
• Virtual Interface Architecture (VIA) –
Compaq,Intel,Microsoft (1998)
Active Messages (AM)
• 1-sided communication paradigm(no
receive op)
• each message as soon as received triggers
a receive handler that acts as a separate
thread (in current implementations it is
sender based)
FastMessages (FM)
• FM_send(dest,handler,buf,size)
sends a long message
• FM_send_4(dest,handler,i0,i1,i2,i3)
sends a 4 words msg (reg to reg)
• FM_extract()
process a received msg
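A toy in-process model of the API above (an illustration of the handler-dispatch idea only; the real FM library delivers across the network and its signatures differ):

```python
# Messages carry their receive handler; FM_extract() runs handlers on receipt.
queue = []

def FM_send(dest, handler, *args):
    queue.append((dest, handler, args))

def FM_extract():
    out = []
    while queue:
        _dest, handler, args = queue.pop(0)
        out.append(handler(*args))      # receipt triggers the handler
    return out

FM_send(1, lambda a, b: a + b, 2, 3)
print(FM_extract())   # [5]
```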
VIA/1
To achieve at the industrial level the advantages obtained
from the various User Level Networking (ULN) initiatives,
Compaq, Intel and Microsoft proposed an industry
standard called VIA (Virtual Interface Architecture). This
proposal specifies an API (Application Programming
Interface).
In this proposal network reads and writes are done
bypassing the OS, while open/close/map are done with
the kernel intervention. It requires memory registering.
VIA/2
(Figure: applications sit on the VI Provider API; Send/Receive/RDMA operations go straight to the VI hardware through per-VI send/receive queues and doorbells, while Open/Close/Map operations go through the VI kernel agent.)
VIA/3
• VIA is better suited to network cards
implementing advanced mechanisms like doorbells
(a queue of transactions in the address space of
the memory card, that are remembered by the
card)
• Anyway it can also be implemented completely in
s/w, though less efficiently (look at the M-VIA
UCB project, and MVICH)
Fbufs (Druschel 1993)
• Shared buffers between user processes and kernel
• It is a general scheme that can be used for either network or I/O operations
• Specific API without copy semantics
• A buffer pool specific to a process is created as part of process initialization
• This buffer pool can be pre-mapped in both the user and kernel address space
API:
• uf_read(int fd, void **bufp, size_t nbytes)
• uf_get(int fd, void **bufpp)
• uf_write(int fd, void *bufp, size_t nbytes)
• uf_allocate(size_t size)
• uf_deallocate(void *ptr, size_t size)
Network technologies
• Extended frames :
– Alteon Jumbograms
(MTU up to 9000)
• Interrupt suppression
(coalescing)
• Checksum offloading
(from Gallatin 1999)
Trapeze driver for Myrinet2000
on freeBSD
• zero-copy TCP by page
remapping and copy-on-write
• split header/payload
• gather/scatter DMA
• checksum offload to NIC
• adaptive message pipelining
• 220 MB/s (1760 Mb/s) on
Myrinet2000 and Dell Pentium III
dual processor
The application doesn't touch the data
(from Chase, Gallatin, Yocum
2001)
Bibliography
• Patterson/Hennessy: Computer
Architecture – A Quantitative Approach
• Hwang – Advanced Computer Architecture
• Schimmel – Unix Systems for Modern
Architectures
Processors and its Types
 
AMD K6
AMD K6AMD K6
AMD K6
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
 
Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programming
 
WEEK6_COMPUTER_ORGANIZATION.pptx
WEEK6_COMPUTER_ORGANIZATION.pptxWEEK6_COMPUTER_ORGANIZATION.pptx
WEEK6_COMPUTER_ORGANIZATION.pptx
 
Lec05
Lec05Lec05
Lec05
 
unit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxunit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptx
 
Not bridge south bridge archexture
Not bridge  south bridge archextureNot bridge  south bridge archexture
Not bridge south bridge archexture
 
Lesson 5a
Lesson 5aLesson 5a
Lesson 5a
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
 
Sparc t4 2 system technical overview
Sparc t4 2 system technical overviewSparc t4 2 system technical overview
Sparc t4 2 system technical overview
 
Introduction to 8085svv
Introduction to 8085svvIntroduction to 8085svv
Introduction to 8085svv
 
Amd Athlon Processors
Amd Athlon ProcessorsAmd Athlon Processors
Amd Athlon Processors
 
Sparc t4 1 system technical overview
Sparc t4 1 system technical overviewSparc t4 1 system technical overview
Sparc t4 1 system technical overview
 
BURA Supercomputer
BURA SupercomputerBURA Supercomputer
BURA Supercomputer
 
Microprocessor.ppt
Microprocessor.pptMicroprocessor.ppt
Microprocessor.ppt
 
Intel Core i7 Processors
Intel Core i7 ProcessorsIntel Core i7 Processors
Intel Core i7 Processors
 
SUN主机产品介绍.ppt
SUN主机产品介绍.pptSUN主机产品介绍.ppt
SUN主机产品介绍.ppt
 
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
 

More from rinnocente

Random Number Generators 2018
Random Number Generators 2018Random Number Generators 2018
Random Number Generators 2018rinnocente
 
Docker containers : introduction
Docker containers : introductionDocker containers : introduction
Docker containers : introductionrinnocente
 
An FPGA for high end Open Networking
An FPGA for high end Open NetworkingAn FPGA for high end Open Networking
An FPGA for high end Open Networkingrinnocente
 
WiFi placement, can we use Maxwell ?
WiFi placement, can we use Maxwell ?WiFi placement, can we use Maxwell ?
WiFi placement, can we use Maxwell ?rinnocente
 
TLS, SPF, DKIM, DMARC, authenticated email
TLS, SPF, DKIM, DMARC, authenticated emailTLS, SPF, DKIM, DMARC, authenticated email
TLS, SPF, DKIM, DMARC, authenticated emailrinnocente
 
Fpga computing
Fpga computingFpga computing
Fpga computingrinnocente
 
Refreshing computer-skills: markdown, mathjax, jupyter, docker, microkernels
Refreshing computer-skills: markdown, mathjax, jupyter, docker, microkernelsRefreshing computer-skills: markdown, mathjax, jupyter, docker, microkernels
Refreshing computer-skills: markdown, mathjax, jupyter, docker, microkernelsrinnocente
 
features of tcp important for the web
features of tcp  important for the webfeatures of tcp  important for the web
features of tcp important for the webrinnocente
 
Public key cryptography
Public key cryptography Public key cryptography
Public key cryptography rinnocente
 
End nodes in the Multigigabit era
End nodes in the Multigigabit eraEnd nodes in the Multigigabit era
End nodes in the Multigigabit erarinnocente
 
Mosix : automatic load balancing and migration
Mosix : automatic load balancing and migration Mosix : automatic load balancing and migration
Mosix : automatic load balancing and migration rinnocente
 
Comp architecture : branch prediction
Comp architecture : branch predictionComp architecture : branch prediction
Comp architecture : branch predictionrinnocente
 
Data mining : rule mining algorithms
Data mining : rule mining algorithmsData mining : rule mining algorithms
Data mining : rule mining algorithmsrinnocente
 
FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)rinnocente
 
radius dhcp dot1.x (802.1x)
radius dhcp dot1.x (802.1x)radius dhcp dot1.x (802.1x)
radius dhcp dot1.x (802.1x)rinnocente
 

More from rinnocente (16)

Random Number Generators 2018
Random Number Generators 2018Random Number Generators 2018
Random Number Generators 2018
 
Docker containers : introduction
Docker containers : introductionDocker containers : introduction
Docker containers : introduction
 
An FPGA for high end Open Networking
An FPGA for high end Open NetworkingAn FPGA for high end Open Networking
An FPGA for high end Open Networking
 
WiFi placement, can we use Maxwell ?
WiFi placement, can we use Maxwell ?WiFi placement, can we use Maxwell ?
WiFi placement, can we use Maxwell ?
 
TLS, SPF, DKIM, DMARC, authenticated email
TLS, SPF, DKIM, DMARC, authenticated emailTLS, SPF, DKIM, DMARC, authenticated email
TLS, SPF, DKIM, DMARC, authenticated email
 
Fpga computing
Fpga computingFpga computing
Fpga computing
 
Refreshing computer-skills: markdown, mathjax, jupyter, docker, microkernels
Refreshing computer-skills: markdown, mathjax, jupyter, docker, microkernelsRefreshing computer-skills: markdown, mathjax, jupyter, docker, microkernels
Refreshing computer-skills: markdown, mathjax, jupyter, docker, microkernels
 
features of tcp important for the web
features of tcp  important for the webfeatures of tcp  important for the web
features of tcp important for the web
 
Public key cryptography
Public key cryptography Public key cryptography
Public key cryptography
 
End nodes in the Multigigabit era
End nodes in the Multigigabit eraEnd nodes in the Multigigabit era
End nodes in the Multigigabit era
 
Mosix : automatic load balancing and migration
Mosix : automatic load balancing and migration Mosix : automatic load balancing and migration
Mosix : automatic load balancing and migration
 
Comp architecture : branch prediction
Comp architecture : branch predictionComp architecture : branch prediction
Comp architecture : branch prediction
 
Data mining : rule mining algorithms
Data mining : rule mining algorithmsData mining : rule mining algorithms
Data mining : rule mining algorithms
 
Ipv6 course
Ipv6  courseIpv6  course
Ipv6 course
 
FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)
 
radius dhcp dot1.x (802.1x)
radius dhcp dot1.x (802.1x)radius dhcp dot1.x (802.1x)
radius dhcp dot1.x (802.1x)
 

Recently uploaded

Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Nodes and Networks for HPC computing

  • 6. 21/12/05 r.innocente 6 Superscalar It’s a CPU having multiple functional execution units, able to dispatch multiple instructions per cycle (double issue, quad issue, ...). Pentium 4 has 7 distinct functional units: Load, Store, 2 x double speed ALUs, normal speed ALU, FP, FP Move. It can issue up to 6 uops per cycle (it has 4 dispatch ports but the 2 simple ALUs are double speed). The Athlon has 9 functional units. The AMD64/Opteron has a 3 issue core with 11 FU : 3 int, 3 Address Generation, 3 FP, 2 ld/st. The IA64/Itanium2 can issue 6 instr, has 11 FU : 2 int, 4 mem, 3 branch, 2 FP
  • 7. 21/12/05 r.innocente 7 Pipelining/1 • The division of the work necessary to execute instructions in stages to allow more instructions in execution at the same time (at different stages) • Previous generation architectures had 5/6 stages • Now there are 10/20 stages, Intel called this superpipelining or hyperpipelining • Nocona(EM64T) Pentium 4 has 32 stages • Opteron has around 30 stages
  • 8. 21/12/05 r.innocente 8 Pipelining/2 Memory hierarchy (Registers, caches, memory) Instruction Fetch (IF) Instruction decode (ID) Execute(EX) Data access(DA) Write back results(WB)
  • 9. 21/12/05 r.innocente 9 Pipelining/3 [Pipeline diagram: 5 instructions flowing through the IF, ID, EX, DA, WB stages over clock cycles 1-10.] On the 5th cycle there are 5 instr. simultaneously executing, 2nd instr causes a stall of the pipe during DA
  • 10. 21/12/05 r.innocente 10 P3/P4 superpipelining picture from Intel
  • 13. 21/12/05 r.innocente 13 Out Of Order (OOO) execution/1 • In-order execution (as stated by the program) under-uses functional units • Algorithms for out of order(OOO) execution of instruction were devised such that : – Data consistency is maintained – High utilization of FU is achieved • Instructions are executed when the inputs are ready, without regard to the original program order
  • 14. 21/12/05 r.innocente 14 OOO/2 • Hazards : – RAW (Read After Write): result forwarding – WAW/WAR (Write After Write, Write After Read) : skip previous WB or ROB • Tomasulo’s scheduler (IBM 360/91, 1967) • Reorder Buffer (ROB) : to allow precise interrupts
  • 15. 21/12/05 r.innocente 15 OOO/3 Tomasulo additions for OOO execution with ROB
  • 16. 21/12/05 r.innocente 16 OOO/4 In the Tomasulo approach, to each resource (functional units, load and store units, register file) a set of buffers called reservation stations is added. This is a distributed algorithm and each station is in charge of watching out when the operands are available, eventually snooping on the bus, and executing its operation asap. • Issue : • Execute • WriteBack
  • 17. 21/12/05 r.innocente 17 Branch prediction Speculative execution To avoid pipeline starvation, the most probable branch of a conditional jump (for which the condition register is not yet ready) is guessed (branch prediction), and the following instructions are executed (speculative execution) with their results stored in hidden registers. If the condition later turns out as predicted, the instructions are retired and the hidden registers renamed to the real registers; otherwise we had a misprediction and the pipe following the branch is cleared.
  • 18. 21/12/05 r.innocente 18 Hyperthreading/1 • Introduced with this name by Intel on their Xeon Pentium4 processors. Similar proposals were previously known in literature as Simultaneous multithreading • With 5% more gates it could achieve a 20-30% performance increase ? • The processor has two copies of the architectural state (registers, IP and so on) and so appears to the software as 2 logical processors
  • 19. 21/12/05 r.innocente 19 Hyperthreading/2 • Multithreaded applications gain the most • The large cache footprint of typical scientific applications make them quite less favoured by this feature Resources State1 State2 0 1 23 4 5 Logical proc numbering
  • 20. 21/12/05 r.innocente 20 Hyperthreading/3 • Different approaches could have been used : partition(fixed assignement to logical processors), full sharing, threshold (resource is shared but there is an upper limit on its use by each logical processor) • Intel for the most part used the full sharing policy
  • 22. 21/12/05 r.innocente 22 Hyperthreading/5 Operating modes : ST0,ST1,MT Pic from Intel
  • 23. 21/12/05 r.innocente 23 x86 uarchitectures Intel: • P6 uarchitecture • NetBurst • Nocona (EM64T) AMD: • Thunderbird • Palomino • AMD64/Hammer • SledgeHammer
  • 24. 21/12/05 r.innocente 24 Intel x86 family • Pentium III – Katmai (0.25 u) – Coppermine (0.18 u) • 500 Mhz – 1.13 Ghz – Tualatin (0.13 u) • 1.13 – 1.33 Ghz • Pentium 4 – Willamette (0.18 u) • 1.4-2.0 Ghz – Northwood (0.13 u) • 2.0–2.2 Ghz – Gallatin • 3.4 Ghz – Prescott (90 nm) • 2.8 Ghz – Nocona (EM64T, 90 nm)
  • 25. 21/12/05 r.innocente 25 AMD Athlon family • K7 Athlon 1999 0.25 u technology w/3DNow! • K75 Athlon 0.18 u technology • Thunderbird (Athlon 3) 0.18 u technology (on die L2 cache) • Palomino (Athlon 4) 0.18 u Quantispeed uarchitecture (SSE support) 15 stages pipeline • Opteron/AMD64 0.18 and then 0.13 u
  • 26. 21/12/05 r.innocente 26 Pentium 4 0.13u uarchitecture Pic from Intel
  • 27. 21/12/05 r.innocente 27 Netburst 90nm uarch Pic from Intel
  • 29. 21/12/05 r.innocente 29 x86 Architecture extensions/1 x86 architecture was conceived in 1978. Since then it underwent many architecture extensions. • Intel MMX (Matrix Math eXtensions): – introduced in 1997 supported by all current processors • Intel SSE (Streaming SIMD Extensions): – introduced on the Pentium III in 1999 • Intel SSE2 (Streaming SIMD Extensions 2): – introduced on the Pentium 4 in Dec 2000
  • 30. 21/12/05 r.innocente 30 x86 Architecture extensions/2 • AMD 3DNow! : – introduced in 1998 (extends MMX) • AMD 3DNow!+ (or 3DNow! Professional, or 3DNow! Athlon): – introduced with the Athlon (includes SSE)
  • 31. 21/12/05 r.innocente 31 x86 architecture extensions/3 The so-called feature set can be obtained using the assembly instruction cpuid, introduced with the Pentium, which returns information about the processor in the processor’s registers: processor family, model, revision, features supported, size and structure of the internal caches. On linux the kernel uses this instruction at startup and much of this info is available typing: cat /proc/cpuinfo
  • 32. 21/12/05 r.innocente 32 x86 architecture extensions/4 Part of a typical output can be : vendor_id : GenuineIntel cpu family : 6 model : 8 model name : Pentium III(Coppermine) cache size : 256 KB flags : fpu pae tsc mtrr pse36 mmx sse
  • 33. 21/12/05 r.innocente 33 SIMD technology • A way to increase processor performance is to group together equal instructions on different data (Data Level Parallelism: DLP) • SIMD (Single Instruction Multiple Data) comes from Flynn taxonomy of parallel machines(1966 ) • Intel proposed its : – SWAR ( SIMD Within A Register)
  • 34. 21/12/05 r.innocente 34 typical SIMD operation op X4 X3 X2 X1 Y4 Y3 Y2 Y1 op op op X4 op Y4 X3 op Y3 X2 op Y2 X1 op Y1
  • 35. 21/12/05 r.innocente 35 MMX • Adds 8 64-bits new registers : – MM0 – MM7 • MMX allows computations on packed bytes, word or doubleword integers contained in the MM registers (the MM registers overlap the FP registers !) • Not useful for scientific computing
  • 36. 21/12/05 r.innocente 36 SSE • Adds 8 128-bits registers : – XMM0 – XMM7 • SSE allows computations on operands that contain 4 Single Precision Floating Points either in memory or in the XMM registers • Very limited use for scientific computing, because of lack of precision
  • 37. 21/12/05 r.innocente 37 SSE2 • Works with operands either in memory or in the XMM registers • Allows operations on packed Double Precision Floating Points or 128-bit integers • Using a linear algebra kernel with SSE2 instructions matrix multiplication can achieve 1.8 Gflops on a P4 at 1.4 Ghz : http://hpc.sissa.it/p4/
  • 38. 21/12/05 r.innocente 38 Intel SSE3/1 • SSE3 introduces 13 new instructions: – x87 conversion to integer (fisttp) – Complex arithmetic (addsubps, addsubpd, movesldup,moveshdup,…) – Video encoding – Thread synchronization
  • 39. 21/12/05 r.innocente 39 SSE3/2 • Conversion to integer (fisttp) : – this was a long awaited feature. It is a conversion to integer that always perform truncation (required by programming languages). Previously the fistp instruction converted according to the rounding mode specified in the FCW. And so usually required an expensive load/reload of the FCW.
  • 40. 21/12/05 r.innocente 40 Cache memory Cache=a place where you can safely hide something It is a high-speed memory that holds a small subset of main memory that is in frequent use. CPU cache Memory
  • 41. 21/12/05 r.innocente 41 Cache memory/1 • Processor /memory gap is the motivation : – 1980 no cache at all – 1995 2 level cache – 2002 3 level cache
  • 42. 21/12/05 r.innocente 42 Cache memory/2 0 100 200 300 400 500 600 700 800 1995 1996 1997 1998 1999 2000 Year Performance Mem perf CPU perf CPU/DRAM gap : 50 %/year 1975 cpu cycle 1 usec, SRAM access 1 usec 2001 cpu cycle 0.5 ns, DRAM access 50 ns
  • 43. 21/12/05 r.innocente 43 Cache memory/3 As the cache contains only part of the main memory, we need to identify which portions of the main memory are cached. This is done by tagging the data that is stored in the cache. The data in the cache is organized in lines that are the unit of transfer from/to the CPU and the memory.
  • 44. 21/12/05 r.innocente 44 Cache memory/4 Tags Data cache line Typical line sizes are 32 (Pentium III), 64 (Athlon), 128 bytes (Pentium 4)
  • 45. 21/12/05 r.innocente 45 Cache memory/5 Block placement algorithm : • Direct Mapped : a block can be placed in just one row of the cache • Fully Associative : a block can be placed on any row • n-Way set associative: a block can be placed on n places in a cache row With direct mapping or n-way the row is determined using a simple hash function such as the least significant bits of the row address. E.g. If the cache line is 64 bytes(bits [5:0] of address are used only to address the byte inside the row) and there are 1k rows, then bits [15:6] of the address are used to select the cache row. The tag is then the remaining most significant bits [:16] .
  • 46. 21/12/05 r.innocente 46 Cache memory/6 Tags Data cache line Tags Data cache line 2-way cache : a block can be placed on any of the two positions in a row
  • 47. 21/12/05 r.innocente 47 Cache memory/7 • With fully associative or n-way caches an algorithm for block replacement needs to be implemented. • Usually some approximation of the LRU (least recently used) algorithm is used to determine the victim line. • With large associativity (16 or more ways) choosing at random the line is almost equally efficient
  • 48. 21/12/05 r.innocente 48 Cache memory/8 • When during a memory access we find the data in the cache we say we had a hit, otherwise we say we had a miss. • The hit ratio is a very important parameter for caches in that it let us predict the AMAT (Average Memory Access Time)
  • 49. 21/12/05 r.innocente 49 Cache memory/9 If • tc is the time to access the cache • tm the time to access the data from memory • h the unitary hit ratio then: AMAT = h * tc + (1 - h) * tm
  • 50. 21/12/05 r.innocente 50 Cache memory/10 In a typical case we could have : • tc=2 ns • tm=50 ns • h=0.90 then the AMAT would be : AMAT = 0.9 * 2+0.1*50= 6.8 ns
  • 51. 21/12/05 r.innocente 51 Cache memory/11 Typical split L1/unified L2 cache organization: L2 unified L1 data L1 code CPU
  • 52. 21/12/05 r.innocente 52 Cache memory/12 • Classification of misses (3C) – Compulsory : can’t be avoided, 1st time data is loaded(would happen to an infinite cache) – Capacity : data was loaded, but was unloaded because the cache filled up(would happen to a fully associative cache) – Conflict : in an N-way cache, the miss that happens because data was unloaded due to low associativity
  • 53. 21/12/05 r.innocente 53 Cache memory/13 • Recent processors can prefetch cache lines under program control or under hardware control, snooping the patterns of access for multiple flows. Netburst architecture introduced hw prefetch on pentium.
  • 54. 21/12/05 r.innocente 54 Cache memories/14 • Netburst hw prefetch is invoked after 2-3 cache misses with constant stride • It prefetches 256 bytes ahead • Up to 8 streams • Does’nt prefetch past the 4KB page limit
  • 55. 21/12/05 r.innocente 55 Real Caches/1 • Pentium III : line size 32 bytes – split L1 – 16KB inst/16KBdata 4-way write through – Unified L2 256 KB (coppermine) 8-way write back • Pentium 4 : – split L1 : • Instruction L1 : 12 K uops 8-ways (trace execution cache) • Data L1 : 8 KB – 4 way 64 bytes line size write through – unified L2: 256 KB 8-way line size 128 bytes write back (512KB on Northwood) • Athlon : – Split L1 64 KB/64 KB line 64 bytes – Unified L2 256 KB 16-ways 64 bytes line size
  • 56. 21/12/05 r.innocente 56 Real Caches/2 • Opteron : exclusive organization (L1 not subset of L2) – L1 : instruction 64 KB/ 64 bytes per line, data 64 KB /64 bytes per line – L2 : 1 MB 16 ways 128 bytes per line • Itanium 2 0.13 u 1.5 Ghz (Madison) : – L1I : 16 KB 4 ways/64 bytes line – L1D : 16 KB 4 ways/64 bytes line – Unified L2 : 256 KB 8 ways/ 128 bytes line – Unified L3 : 9 MB 24 ways/ 128 bytes line
  • 57. 21/12/05 r.innocente 57 Real Caches/3 An intelligent RAM (IRAM)? Itanium2 (Madison) die photo 6MB L3 cache 455 M transistors
  • 58. 21/12/05 r.innocente 58 Memory performance • stream (McCalpin) : http:// www.cs.virginia.edu/stream • memperf (T.Stricker): http:// www.cs.inf.ethz.ch/CoPs/ECT
  • 59. 21/12/05 r.innocente 59 Memory Type and Range Registers were introduced with the P6 uarchitecture, they should be programmed to define how the processor should behave regarding the use of the cache in different memory areas. The following types of behaviour can be defined: UC(Uncacheable), WC(Write Combining), WT(Write through), WB(Write back), WP(write Protect) MTRR/1
  • 60. 21/12/05 r.innocente 60 MTRR/2 • UC=uncacheable: – no cache lookups, – reads are not performed as line reads, but as is – writes are posted to the write buffer and performed in order • WC=write combining: – no cache lookups – read requests performed as is – a Write Combining Buffer of 32 bytes is used to buffer modifications until a different line is addressed or a serializing instruction is executed
  • 61. 21/12/05 r.innocente 61 MTRR/3 • WT=write through: – cache lookups are performed – reads are performed as line reads – writes are performed also in memory in any case (L1 cache is updated and L2 is invalidated if present) • WB=write back – cache lookups are performed – reads are performed as line reads – writes are performed on the cache line, if needed reading in a full line. Only when the line has to be evicted from the cache, if modified, will it be written to memory • WP=write protect – writes are never performed in the cache line
  • 62. 21/12/05 r.innocente 62 MTRR/4 There are 8 MTRRs for variable size areas on the P6 uarchitecture. On Linux you can display them with: cat /proc/mtrr
  • 63. 21/12/05 r.innocente 63 MTRR/5 A typical output of the command would be: reg00: base=0x00000000 (0MB), size=256MB: write-back reg01: base=0xf0000000 (3840MB), size=32MB:write-combining the first entry is for the DRAM, the other for the graphic board framebuffer
  • 64. 21/12/05 r.innocente 64 MTRR/6 It is possible to remove and add regions using these simple commands: echo ‘disable=1’ >| /proc/mtrr echo ‘base=0xf0000000 size=0x4000000 type=write-combining’ >|/proc/mtrr ( Suggestion: Try to use xengine while disabling and re-enabling the write-combining region of the framebuffer.)
  • 65. 21/12/05 r.innocente 65 Explicit cache control/1 Caches were introduced as an h/w only optimization and as such destined to be completely hidden from the programmer. This is no longer true, and all recent architectures have introduced some instructions for explicit cache control by the application programmer. On the x86 this was done together with the introduction of the SSE instructions.
  • 66. 21/12/05 r.innocente 66 Explicit cache control/2 On the Intel x86 architecture the following instructions provide explicit cache control : • prefetch : these instructions load a cache line before the data is actually needed, to hide latency • non temporal stores: to move data from registers to memory without writing it in the caches, when it is known data will not be re-used • fence: to be sure order between prior/following memory operation is respected • flush : to write back a cache line, when it is known it will not be re-used soon
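A rough model of why a prefetch hides latency: a line requested d iterations ahead of its use overlaps its memory time with d iterations of useful work. The helper name and the numbers are illustrative, not from any manual.

```python
def effective_miss_latency(t_mem_ns, work_per_iter_ns, prefetch_distance):
    """Latency still visible to the program when a line is prefetched
    `prefetch_distance` loop iterations before it is used: the memory
    access overlaps with the work done in those iterations."""
    hidden = prefetch_distance * work_per_iter_ns
    return max(0, t_mem_ns - hidden)
```

With 50 ns of memory latency and 10 ns of work per iteration, prefetching 5 or more iterations ahead hides the miss completely; prefetching too early only risks the line being evicted before use.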
  • 67. 21/12/05 r.innocente 67 Performance and timestamp counters/1 These counters were initially introduced on the Pentium but documented only on the PPro. They are supported also on the Athlon. Their aim was fine tuning and profiling of the applications. They were adopted after many other processors had shown their usefulness.
  • 68. 21/12/05 r.innocente 68 Performance and timestamp counters/2 Timestamp counter (TSC): • it is a 64 bit counter • counts the clock cycles (processor cycles: now up to 2 Ghz=0.5ns) since reset or since the programmer zeroed it • when it reaches a count of all 1s, it wraps around without generating any interrupt • it is an x86 extension that is reported by the cpuid instruction, and as such can be found in /proc/cpuinfo • Intel assured that even on future processors it will never wrap around in less than 10 years
  • 69. 21/12/05 r.innocente 69 Performance and timestamp counters/3 • the TSC can be read with the RDTSC (Read TSC) instruction or the RDMSR (Read ModelSpecificRegister 0x10) instruction • it is possible to zero the lower 32 bits of the TSC using the WRMSR (Write MSR 0x10) instruction (the upper 32 bits will be zeroed automatically by the h/w) • RDTSC is not a serializing instruction and as such doesn't prevent Out of Order execution. Serialization can be obtained using the cpuid instruction
  • 70. 21/12/05 r.innocente 70 Performance and timestamp counters/4 The TSC is used by Linux during the startup phase to determine the clock frequency: • the PIT (Programmable Interval Timer) is programmed to generate a 50 millisecond interval • the elapsed clock ticks are computed reading the TSC before and after the interval The TSC can be read by any user or only by the kernel according to a cpu flag in the CR4 register that can be set e.g. by a module.
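The boot-time calibration described above reduces to one division once the two TSC reads are taken around the PIT-timed interval; a sketch of that arithmetic (the real kernel does the reads in assembly, here we only model the math):

```python
def cpu_hz_from_tsc(tsc_before, tsc_after, interval_s=0.050):
    """Clock frequency inferred from two TSC reads taken around a known
    PIT interval (50 ms by default, as in the Linux startup code)."""
    return (tsc_after - tsc_before) / interval_s
```

For example, 50 million elapsed ticks over 50 ms imply a 1 GHz clock.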
  • 71. 21/12/05 r.innocente 71 Performance and timestamp counters/5 On the P6 uarchitecture there are 4 registers used for performance monitoring: • 2 event select registers: PerfEvtSel0, PerfEvtSel1 • 2 performance counters 40 bits wide : PerfCtr0, PerfCtr1 You have to enable and specify the events to monitor setting the event select registers. These registers can be accessed using the RDMSR and WRMSR instructions at kernel level (MSR 0x186,0x187, 0x193 0x194). The performance counters can also be accessed using the RDPMC (Read Performance Monitoring Counters) at any privilege level if allowed by the CR4 flag.
  • 72. 21/12/05 r.innocente 72 Performance and timestamp counters/6 The Performance Monitoring mechanism has been vastly changed and expanded on P4 and Xeons. This new feature is called PEBS (Precise Event-Based Sampling). There are 45 event selection control (ESCR) registers, 18 performance monitor counters and 18 counter configuration control (CCCR) MSR registers.
  • 73. 21/12/05 r.innocente 73 Performance and timestamp counters/7 A useful performance counters library to instrument the code and a tool called rabbit to monitor programs not instrumented with the library can be found at: http://www.scl.ameslab.gov/Projects/Rabbit/
  • 74. 21/12/05 r.innocente 74 Intel Itanium/1 • EPIC (Explicit Parallel Instruction Computing) : a joint HP/Intel project started at the end of the '90s : – its goal was overcoming the limitations of current architectures. Recognizing that it was hard to continue exploiting ILP (Instruction Level Parallelism) exclusively in hardware, it was thought that compilers could decide much better on dependencies and suggest parallelism. • In-Order core !!!!!!!
  • 75. 21/12/05 r.innocente 75 Intel Itanium/2 • It's a 64 bit architecture : addresses are 64 bits • It supports the following data types: – Integers 1,2,4 and 8 bytes – Floating point single, double and extended(80 bits) – Pointers 64 bits • Except for a few exceptions the basic data type is 64 bits and operations are done on 64 bits. 1,2,4 bytes operands are zero extended before being loaded.
  • 76. 21/12/05 r.innocente 76 Intel Itanium/3 • A typical Itanium instruction is a 3 operand instruction designated with this syntax : – [(qp)]mnemonic[.comp1][.comp2] dsts = srcs • Examples : – Simple instruction : add r1 = r2,r3 – Predicated instruction:(p5)add r3 = r4,r5 – Completer: cmp.eq p4 = r2,r3
  • 77. 21/12/05 r.innocente 77 Intel Itanium/4 • (qp) : a qualifying predicate is a 1 bit predicate register p0-p63. The instruction is executed if the register is true (= 1). Otherwise a NOP is executed. If omitted p0 is assumed which is always true • mnemonic : • comp1,comp2 : some instr include 1 or more completers • dests = srcs : most instructions have 2 src opnds and 1 dest opnd
  • 78. 21/12/05 r.innocente 78 Intel Itanium/4bis If ( i == 3) x = 5 Else x = 6 • pT,pF = compare(i,3) • (pT) x = 5 • (pF) x = 6
  • 79. 21/12/05 r.innocente 79 Intel Itanium/5 • Memory in Itanium : – single, uniform, linear address space of 2^64 bytes • single : data and code together, not like Harvard arch. • uniform: no predefined zones, like interrupt vectors .. • linear: not segmented like x86 • Code stored in little endian order, usually also data, but there is support for big endian OS too • Load and store arch.
  • 80. 21/12/05 r.innocente 80 Intel Itanium/6 • Improves ILP(instruction level parallelism) : – Enabling the code writer to explicitly indicate it – Providing a 3 instruction-wide word called a bundle – Providing a large number of registers to avoid contention
  • 81. 21/12/05 r.innocente 81 Intel Itanium/7 • Instructions are bound in instruction groups. These are sets of instructions that don’t have WAW or RAW dependencies and so can execute in parallel • An instruction group must contain at least one instruction; there is no limit on the number of instr in an instruction group. They are indicated in the code by cycle breaks (;;) • Instr groups are composed of instruction bundles. Each bundle contains 3 instr and a template field
  • 82. 21/12/05 r.innocente 82 Intel Itanium/8 • Code generation assures there is no RAW or WAW race between instr in the bundle, and sets the template field that maps each instruction to an execution unit. In this way the instr in the bundle can be dispatched in parallel • Bundles are 16 bytes aligned • The template field can end the instruction group at the end of the bundle or in the middle
  • 83. 21/12/05 r.innocente 83 Intel Itanium/9 • Registers: • 128 GPR • 128 FP reg. • 64 pred.reg • 8 branch reg. • 128 Applic. Reg • IP reg
  • 84. 21/12/05 r.innocente 84 Intel Itanium/10 • Reg.Validity : – Each GR has an associated NaT(Not a Thing) bit – FP reg use a special pseudozero value called NaTVal
  • 85. 21/12/05 r.innocente 85 Intel Itanium/11 • Branches : – Relative direct : the instruction contains a 21 bit displacement that is added to the current IP of the bundle containing the branch. Since it addresses 16 byte instruction bundles this gives a 25 bit range – Indirect branches : using one of the 8 branch regs • It allows multiple branches to be evaluated in parallel, taking the first predicated true
  • 86. 21/12/05 r.innocente 86 Intel Itanium/12 • Memory Latency hiding because of large register file used for : – Data speculation : execution of an op before its data dependency is resolved – Control speculation: execution of an op before its control dependency is resolved – Elimination of mem access through use of regs
  • 87. 21/12/05 r.innocente 87 Intel Itanium/13 • Proc calls and register stack : – Usually proc calls were implemented using a procedure stack in main memory – Itanium uses the regs stack for that • Rotating regs : – Static : the 1st 32 regs are permanent regs visible to all procs, in which global vars are placed – Stacked : the other 96 regs behave like a stack, a proc can allocate up to 96 regs for a proc frame • Reg Stack Engine (RSE): – A hardware mechanism operating transparently in the background ensures that no overflow can happen, saving regs to memory
  • 88. 21/12/05 r.innocente 88 Intel Itanium/13 Pic from Intel
  • 89. 21/12/05 r.innocente 89 Intel Itanium/15 • FP registers are 82 bits : – 80 bits like the extended format used since the x87 coprocessor – 2 guard bits to allow computing of valid 80 bits results for divide and square roots implemented in s/w • FP status register (FPSR) : – Provides 4 separate status fields sf0-3, enabling 4 different computational environments
  • 90. 21/12/05 r.innocente 90 Intel Itanium/16 • Itanium 2 (madison): – 1.6 Ghz, 1.5 Ghz, 1.4 Ghz – L3 cache : 9MB, 6MB, 4MB – L2 cache: 256 KB – L1 cache: 32 KB instr , 32 KB data – 200/266 Mhz bus, 400 MT/533 MT @ 128 bit system bus (6.4 GB/s, 8.5 GB/s)
  • 91. 21/12/05 r.innocente 91 Intel Itanium/17 Pic from Intel
  • 92. 21/12/05 r.innocente 92 Intel Itanium/18 Pic from Intel
  • 94. 21/12/05 r.innocente 94 Opteron uarch Pic from AMD
  • 95. 21/12/05 r.innocente 95 64 bit architectures • AMD – Opteron • Intel – Itanium • Intel - EMT64T
  • 96. 21/12/05 r.innocente 96 AMD-64/1 – AMD extensions to IA-32 to allow 64 bits addressing, aka Direct Connect Architecture (DCA) – Implementations : • Opteron for servers • Athlon64, Athlon64 FX for desktops
  • 97. 21/12/05 r.innocente 97 AMD Hammer/Opteron • x86-64 architecture specified in : – http://www.x86-64.org
  • 98. 21/12/05 r.innocente 98 Intel EM64T/1 • Extended Memory 64 Technology : an extension to IA-32 compatible with the proposed AMD-64 • Introduced with Nocona Xeons. 3 operating modes : – 64 bit mode : OS, drivers, applications recompiled – Compatibility mode: OS & drivers 64 bits,appl. unchanged – Legacy mode: runs 32 bits OS, drivers and application
  • 99. 21/12/05 r.innocente 99 x86-64 or AMD64 Pic from AMD
  • 100. 21/12/05 r.innocente 100 Processor bus/1 • Intel (AGTL+): – bus based (max 5 loads) – explicit in band arbitration – short bursts (4 data txfers) – 8 bytes wide(64 bits), up to 133 Mhz • Compaq Alpha (EV6)/Athlon : – point to point – DDR double data rate (2 transfers x clock) – licensed by AMD for the Athlon (133 Mhz x 2) – source synchronous(up to 400 Mhz) – 8 bytes wide(64 bits)
  • 101. 21/12/05 r.innocente 101 Processor bus/2 • Intel Netburst (Pentium 4): – source synchronous (like EV6) – 8 bytes wide – 100 Mhz clock – quad data rate for data transfers (4 x 100 Mhz x 8 bytes = 3.2 GB/s, up to 4 x 200 Mhz x 8 = 6.4 GB/s) – double data rate for address transfer – max 128 bytes (a P4 cache line) in a transaction (4 data transfer cycles)
  • 102. 21/12/05 r.innocente 102 Processor bus/3 • Itanium2 : – 128 bits wide – 200/266 Mhz – Dual pumped – 400/533 MT/s x 16 = 6.4 / 8.5 GB/s • Opteron (not a bus) : – 2,4,8,16,32 bits wide – 800 Mhz – Double pumped – 800 Mhz x 2 DDR x 2 bytes = 3.2 GB/s per direction
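All the peak figures on the last few slides come from the same product, clock x transfers per clock x width. A helper reproducing them (decimal GB/s, as used above):

```python
def bus_bandwidth_gbs(clock_mhz, transfers_per_clock, width_bytes):
    """Peak bus bandwidth in decimal GB/s:
    clock x transfers/clock x width."""
    return clock_mhz * 1e6 * transfers_per_clock * width_bytes / 1e9

# Netburst: 100 Mhz quad pumped, 8 bytes  -> 3.2 GB/s
# Itanium2: 200 Mhz dual pumped, 16 bytes -> 6.4 GB/s
# Opteron HT link: 800 Mhz DDR, 2 bytes   -> 3.2 GB/s per direction
```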
  • 103. 21/12/05 r.innocente 103 Intel IA32 node Pentium 4 Pentium 4 North Bridge Memory shared FSB 64 bits @100/133/200Mhz
  • 104. 21/12/05 r.innocente 104 Intel PIII/P4 processor bus • Bus phases : – Arbitration: 2 or 3 clks – Request phase: 2 clks packet A, packet B (size) – Error phase: 2 clks, check parity on pkts,drive AERR – Snoop phase: variable length 1 ... – Response phase: 2 clk – Data phase : up to 32 bytes (4 clks, 1 cache line), 128 bytes with quad data rate on Pentium 4 • 13 clks to txfer 32 bytes, or 128 bytes on P4
  • 105. 21/12/05 r.innocente 105 Alpha node 21264 21264 Tsunami xbar switch Memory Alpha EV6 bus 64 bit @ 4*83 Mhz, memory bus 256 bit @ 83 Mhz; stream-measured memory b/w > 1 GB/s
  • 106. 21/12/05 r.innocente 106 Alpha/Athlon EV6 interconnect • 3 high speed channels : – Unidirectional processor request channel – Unidirectional snoop channel – 72-bit data channel (ECC) • source synchronous • up to 400 Mhz (4 x 100 Mhz: quad pumped, Athlon :133Mhz x 2 ddr )
  • 107. 21/12/05 r.innocente 107 Opteron integrated memory controller • Opteron block diagram • The integrated memory controller frees the interconnect from local memory access • Remote memory access and I/O happens through the Hypertransport
  • 110. 21/12/05 r.innocente 110 4P Opteron detailed view
  • 114. 21/12/05 r.innocente 114 HT local/xfire bandwidth Local mem b/w Xfire mem b/w
  • 115. 21/12/05 r.innocente 115 Double/Quad core Pics From Broadcom : dualcore and quadcore MIPS64 cpus
  • 116. 21/12/05 r.innocente 116 SMPs Symmetrical Multi Processing denotes a multiprocessing architecture in which there is no master CPU, but all CPUs co-operate. • processor bus arbitration • cache coherency/consistency • atomic RMW operations • mp interrupts
  • 117. 21/12/05 r.innocente 117 Intel MP Processor bus arbitration • As we have seen there is a special phase for each bus transaction devoted to arbitration (it takes 2 or 3 cycles) • At startup each processor is assigned a cluster number from 0 to 3 and a processor number from 0 to 3 (the Intel MP specification covers up to 16 processors) • In the arbitration phase each processor knows who had the bus during last transaction (the rotating ID) and who is requesting it. Then each processor knows who will own the current transaction because they will gain the bus according to the fixed sequence 0-1-2-3.
  • 118. 21/12/05 r.innocente 118 Cache coherency Coherence: all processor should see the same data. This is a problem because each processor can have its own copy of the data in its cache. There are essentially 3 ways to assure cache coherency: • directory based : there is a central directory that keeps track of the shared regions (ccNUMA: Sun Enterprise, SGI) • snooping : all the caches monitor the bus to determine if they have a copy of the data requested (UMA: Intel MP) • Broadcast based
  • 119. 21/12/05 r.innocente 119 Cache consistency • Consistency: maintain the order in which writes are executed. Writes are serialized by the processor bus.
  • 120. 21/12/05 r.innocente 120 Snooping Two protocols can be used by caches to snoop the bus : • write-invalidate: when a cache hears on the bus a write request for one of his lines from another bus agent then it invalidates its copy (Intel MP) • write-update: when a cache hears on the bus a write request for one of his lines from another agent it reloads the line
  • 121. 21/12/05 r.innocente 121 Intel MP snooping • If a memory transaction (memory read and invalidate/read code/read data/write cache line/write) was not cancelled by an error, then the error phase of the bus is followed by the snoop phase(2 or more cycles) • In this phase the caches will check if they have a copy of the line • if it is a read and a processor has a modified copy then it will supply its copy that is also written to memory • if it is a write and a processor has a modified copy then the memory will store first the modified line and then will merge the write data
  • 122. 21/12/05 r.innocente 122 MESI protocol Each cache line can be in 1 of the 4 states: • Invalid (I): the line is not valid and should not be used • Exclusive(E): the line is valid, is the same as main memory and no other processor has a copy of it, it can be read and written in cache w/o problems • Shared(S): the line is valid, the same as memory, one or more other processors have a copy of it, it can be read from memory, it should be written-through (even if declared write back!) • Modified(M): the line is valid, has been updated by the local processor, no other cache has a copy, it can be read and written in cache
  • 123. 21/12/05 r.innocente 123 MESI states [State diagram: transitions among the I, E, S and M states, driven by Read accesses (R), Write accesses (W) and Snoops (S)]
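The diagram can be written as a transition table. This is a textbook-style sketch of MESI, not any specific processor's implementation; in particular a read fill into an Invalid line goes to E when no other cache answers the snoop, while here we conservatively assume another copy may exist and go to S.

```python
# MESI next-state function for one cache line.
# Events: local 'read'/'write', snooped 'snoop_read'/'snoop_write'.
_MESI = {
    ('I', 'read'):        'S',   # simplification: assume a sharer may exist
    ('I', 'write'):       'M',
    ('E', 'read'):        'E',
    ('E', 'write'):       'M',
    ('E', 'snoop_read'):  'S',
    ('E', 'snoop_write'): 'I',
    ('S', 'read'):        'S',
    ('S', 'write'):       'M',   # after invalidating the other copies
    ('S', 'snoop_read'):  'S',
    ('S', 'snoop_write'): 'I',
    ('M', 'read'):        'M',
    ('M', 'write'):       'M',
    ('M', 'snoop_read'):  'S',   # modified line supplied and written back
    ('M', 'snoop_write'): 'I',
}

def mesi_next(state, event):
    """Next MESI state of a line after a local or snooped access."""
    return _MESI[(state, event)]
```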
  • 124. 21/12/05 r.innocente 124 L2/L1 coherence • An easy way to keep coherent the 2 levels of caches is to require inclusion ( L1 subset of L2) • Otherwise each cache can perform its snooping
  • 125. 21/12/05 r.innocente 125 Atomic Read Modify Write The x86 instruction set has the possibility to prefix some instructions with a LOCK prefix: – bit test and modify – exchange – increment, decrement, not,add,sub,and,or These will cause the processor to assert the LOCK# bus signal for the duration of the read and write. The processor automatically asserts the LOCK# signal during execution of an XCHG , during a task switch, while reading a segment descriptor
  • 126. 21/12/05 r.innocente 126 Intel MP interrupts Intel has introduced an I/O APIC (Advanced Programmable Interrupt Controller) which replaces the old 8259A. • each processor of an SMP has its own integrated local APIC • an Interrupt Controller Communication (ICC) bus connects an external I/O APIC (front-end) to these local APICs • external IRQ lines are connected to the I/O APIC that acts as a router • the I/O APIC can dispatch interrupts to a fixed processor or to the one executing lowest priority activities (the priority table has to be updated by the kernel at each context switch)
  • 127. 21/12/05 r.innocente 127 Broadcast cache coherence • In a 4 proc cluster p3 wants to access m0: – The read cache line request is sent on the link p3-p1 – Request forwarded from p1 to p0 – p0 reads the line from memory, snooping itself, and sends out snoop requests to p1 and p2 – p0 sends out its snoop response to p3 and p2 forwards the snoop request from p0 – p0 gets the local memory copy and p2 sends its snoop response to p3 – The memory line is forwarded to p3
  • 128. 21/12/05 r.innocente 128 PCI Bus • PCI32/33 4 bytes@33Mhz=132MBytes/s (on i440BX,...) • PCI64/33 8 bytes@33Mhz=264Mbytes/s • PCI64/66 8 bytes@66Mhz=528Mbytes/s (on i840) • PCI-X 8 bytes@133Mhz=1064Mbytes/s
  • 129. 21/12/05 r.innocente 129 PCI-X 1.0 • 100/133 Mhz, 32/64 bit – Max 1064 MB/s • Split transactions • Backward compatible with PCI 2.1/2.2/2.3 : – PCI-X adapters work in existing PCI slots – Older PCI cards slow down PCI-X bus
  • 130. 21/12/05 r.innocente 130 PCI-X 2.0 • 100% backward compatible • PCI-X 266 and PCI-X 533 – 2.13 GB/s and 4.26 GB/s – DDR and QDR using 133 Mhz base freq.
  • 131. 21/12/05 r.innocente 131 PCI efficiency • Multimaster bus but arbitration is performed out of band • Multiplexed, but in burst mode (implicit addressing) only the start address is txmitted • Fairness guaranteed by MLT (Maximum Latency Timer) • 3 / 4 cycles overhead on 64 data txfers < 5 % • on Linux use : lspci -vvv to look at the setup of the boards
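The "< 5 %" claim is simple arithmetic: 3 overhead cycles against 64 data transfer cycles in a burst. A sketch:

```python
def pci_burst_efficiency(data_cycles, overhead_cycles):
    """Fraction of bus cycles that carry data in one PCI burst."""
    return data_cycles / (data_cycles + overhead_cycles)

# 3 overhead cycles on a 64-transfer burst: ~4.5% overhead
# 4 overhead cycles on the same burst:      ~5.9% overhead
```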
  • 132. 21/12/05 r.innocente 132 PCI 2.2/X timing diagram [Diagram: CLK, FRAME#, AD (address phase, attribute phase, data phases), C/BE# (bus cmd, attr, byte enables), IRDY#, TRDY#, target response]
  • 133. 21/12/05 r.innocente 133 common chipsets PCI performance
Chipset | Read MB/s | Write MB/s
Serverworks HE | 303 | 372
Intel 860 | 227 | 315
AMD760MPX | 315 | 328
Tsunami (alpha) | 205 | 248
Serverworks champion II HE | 309 | 372
Serverset III HE | 315 | 372
Intel 460GX (Itanium) | 372 | 399
Titan (Alpha) | 407 | 488
Serverworks Serverset III LE | 455 | 512
  • 134. 21/12/05 r.innocente 134 chipsets PCI-X performance
Chipset | Read MB/s | Write MB/s
HP rx2000 dual 900Mhz Itanium2, HPzx1 chipset | 784 | 1044
Supermicro XDL8-GG dual 2.4 Ghz Xeon | 932 | 1044
2xOpteron 244, AMD 8131 | 946 | 1036
  • 135. 21/12/05 r.innocente 135 Memory buses • SDRAM 8 bytes wide (64 bits): these memories are pipelined DRAM. Industry found it much more appealing to indicate their clock frequency than their access times. The number of clock cycles to wait for access is written in a small rom on the module (SPD) – PC-100 PC-133 – DDR PC-200, DDR 266, QDR on the horizon • RDRAM 2 bytes wide (16 bits): these memories have a completely new signaling technology. Their busses should be terminated at both ends (continuity module required if slot not used!) – RDRAM 600/800/1066 double data rate at 300,400,533 Mhz
  • 136. 21/12/05 r.innocente 136 Interconnects /1 • The technology used to interconnect processors overlaps today on one end with IC and PCB (printed circuit board) technologies and on the other end with WAN (Wide Area Network) technologies. • SAN (System Area Network) and LAN technologies are at a midpoint.
  • 137. 21/12/05 r.innocente 137 LogP metrics (Culler) • L = Latency: time data is in flight between the 2 nodes • o = overhead: time during which the processor is engaged in sending or receiving • g = gap : minimum time interval between consecutive message txmissions (or receptions) • P = # of Processors This metric was introduced to characterize a distributed system with its most important parameters; a bit outdated, but still useful (e.g. it doesn't take pipelining into account)
  • 138. 21/12/05 r.innocente 138 LogP diagram [Diagram: sender Processor (os) -> NI -> Network (L) -> NI -> receiver Processor (or), with gap g between consecutive messages; time = os + L + or]
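The quantities in the diagram combine as sketched below; the function names are mine, the formulas are the standard LogP ones (one-message time os + L + or, with the gap g limiting the injection rate of a message train).

```python
def logp_one_message_time(L, o_send, o_recv):
    """End-to-end time of a single small message: os + L + or."""
    return o_send + L + o_recv

def logp_n_messages_time(n, L, o_send, o_recv, g):
    """n back-to-back messages: one message can be injected every g time
    units, and the last one still pays the full os + L + or."""
    return (n - 1) * g + o_send + L + o_recv
```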
  • 139. 21/12/05 r.innocente 139 Interconnects /2 • single mode fiber ( sonet 10 gb/s for some km) • multimode fiber (1 gb 300 m: GigE) • coaxial cable (800 mb/s 1 km: CATV) • twisted pair (1 gb/s 100 m: GigE)
  • 140. 21/12/05 r.innocente 140 Interconnects /3 • shared / switched media: – bus (shared communication): • coaxial • pcb trace • backplane – switch (1-1 connection): • point to point The required communication speed is forcing a switch towards the pt-to-pt paradigm
  • 141. 21/12/05 r.innocente 141 Interconnects • Full interconnection network : – Ideal but unattainable and overkilling – D = 1 – Links : like n^2  
  • 142. 21/12/05 r.innocente 142 Interconnects /4 Hypercube: • A k-cube has 2**k nodes, each node is labelled by a k-dim binary coordinate • 2 nodes differing in only 1 dim are connected • there are at most k hops between 2 nodes, and 2**k * k wires 0 1 1-cube 00 10 01 11 2-cube 3-cube
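The "at most k hops" property follows because a route can flip one differing coordinate bit per hop, so the distance between two nodes is the Hamming distance of their binary labels:

```python
def hypercube_hops(a, b):
    """Hops between two nodes of a k-cube = Hamming distance of their
    k-bit labels (number of 1 bits in the XOR)."""
    return bin(a ^ b).count('1')

# opposite corners of a 3-cube are 3 hops apart
```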
  • 143. 21/12/05 r.innocente 143 Interconnects /4 Mesh: • an extension of the hypercube • nodes are given a k-dim coordinate (in the 0:N-1 range) • nodes that differ by 1 in only 1 coordinate are connected • a k-dim mesh has N**k nodes • at most there are kN hops between nodes and wires are ~ kN**k [Diagram: 2-dim 3x3 mesh, nodes 00 01 02 / 10 11 12 / 20 21 22]
  • 144. 21/12/05 r.innocente 144 Interconnects /4 Crossbar (xbar): • typical telephone network technology • organized by rows and columns (NxN) • requires N**2 switches • any permutation w/o blocking • in the pictures the i-row is the sender and the i-col is the receiver of node i 0 1 2 43 0 1 2 3 4 5x5 xbar switch
  • 145. 21/12/05 r.innocente 145 Interconnect/4bis • 1D Torus(Ring) • 2D Torus
  • 146. 21/12/05 r.innocente 146 Interconnects/5 • Diameter : maximum distance in hops between two processors • Bisection width : number of links to cut to sever the network in two halves • Bisection bandwidth : minimum sum of b/w of links to be cut in order to divide the net in 2 halves • Node degree : number of links from a node
  • 147. 21/12/05 r.innocente 147 Bisection /1 Bisection width : given an N nodes net, divided the nodes in two sets of N/2 nodes, the number of links that go from one set to the other. An important parameter is the minimum bisection width: the minimum number of links whatever cut you use to bisect the nodes. An upper bound on the minimum bisection is N/2 because whatever the topology would be you can always find a cut across half of the node links. If a network has minimum bisection of N/2 we say it has full bisection.
  • 148. 21/12/05 r.innocente 148 Bisection /2 [Diagrams: two different 8-node nets, one with a bisection cut of 1 link, the other with a bisection cut of 4 links]
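For nets as small as these, the minimum bisection can simply be found by brute force over all equal splits; a sketch, where the edge lists encode an 8-node ring and a 3-cube of my own choosing for illustration:

```python
from itertools import combinations

def min_bisection(n_nodes, edges):
    """Minimum number of links crossing any split of the nodes into two
    equal halves (brute force: fine for small example nets)."""
    best = None
    for half in combinations(range(n_nodes), n_nodes // 2):
        half = set(half)
        cut = sum(1 for a, b in edges if (a in half) != (b in half))
        best = cut if best is None else min(best, cut)
    return best

ring8 = [(i, (i + 1) % 8) for i in range(8)]          # 8-node ring
cube3 = [(a, b) for a in range(8) for b in range(a + 1, 8)
         if bin(a ^ b).count('1') == 1]               # 3-cube
```

The ring's minimum bisection is 2, the 3-cube's is 4 (2^(k-1) in general), matching the N/2 upper bound only for the cube with N = 8.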
  • 149. 21/12/05 r.innocente 149 Bisection /3 It can be difficult to find the min bisection. For full bisection networks, it can be shown that if a network is re-arrangeable (it can route any permutation w/o blocking) then it has full bisection. Example permutation: From 0 1 2 3 4 5 6 | To 3 0 4 5 2 6 1
  • 150. 21/12/05 r.innocente 150 Interconnects /4 • Topology: – crossbar (xbar) : any to any – omega – fat tree • Bisection and ports per switch:
Topology | bisection | ports/switch
bus | 1 |
ring | 2 | 3
grid/mesh | 8 | 5
2-d torus | 16 | 5
hypercube | 32 | 7
star | 32 | 64
fully connected | 1024 | 64
  • 151. 21/12/05 r.innocente 151 Interconnects /5 • Connection/connectionless: – circuit switched – packet switched • routing – source based routing (SBR) – virtual circuit – destination based
  • 152. 21/12/05 r.innocente 152 Interconnects /6 • switch mechanism (buffer management): – store and forward • each msg is received completely by a switch before being forwarded to the next • pros: easy to design because there is no handling of partial msgs • cons: long msg latency, requires large buffer – cut-through (virtual cut-through) • msg forwarded to the next node as soon as its header arrives • if the next node cannot receive the msg then the full msg needs to be stored – wormhole routing • the same as cut-through except that the msg can be buffered together by multiple successive switches • the msg is like a worm crawling through a worm-hole
  • 153. 21/12/05 r.innocente 153 Interconnects /7 • congestion control – packet discarding – flow control • window /credit based • start/stop
  • 154. 21/12/05 r.innocente 154 Gilder’s law • proposed by G.Gilder (1997) : The total bandwidth of communication systems triples every 12 months • compare with Moore’s law (1974): The processing power of a microchip doubles every 18 months • compare with: memory performance increases 10% per year (source Cisco)
  • 155. 21/12/05 r.innocente 155 Optical networks • Large fiber bundles multiply the fibers per route : 144, 432 and now 864 fibers per bundle • DWDM (Dense Wavelength Division Multiplexing) multiplies the b/w per fiber : – Nortel announced a 80 x 80G per fiber (6.4Tb/s) – Alcatel announced a 128 x 40G per fiber (5.1Tb/s)
  • 156. 21/12/05 r.innocente 156 NIC Interconnection point (from D.Culler), by attachment point (controller type ranging from general uproc to special uproc to plain controller):
Register: TMC CM-5
Memory: Intel Paragon, Meiko CS-2, T3E annex
Graphics Bus: HP Medusa
I/O Bus: SP2, Fore ATM cards, Myrinet, 3Com Gbe, many ether cards
  • 157. 21/12/05 r.innocente 157 LVDS/1 • Low Voltage Differential Signaling (ANSI/TIA/EIA 644-1995) • Defines only the electrical characteristics of drivers and receivers • Transmission media can be copper cable or PCB traces
  • 158. 21/12/05 r.innocente 158 LVDS/2 • Differential : – Instead of measuring the voltage Vref+U between a signal line and a common GROUND, a pair is used and Vref+U and Vref–U are transmitted on the 2 wires – In this way the transmission is immune to Common Mode Noise (the electrical noise induced in the same way on the 2 wires: EMI,..)
  • 159. 21/12/05 r.innocente 159 LVDS/3 • Low Voltage: – voltage swing is just 300 mV, with a driver offset of +1.2V – receivers are able to detect signals as low as 20 mV, in the 0 to 2.4 V range (supporting +/- 1 V of noise) [Diagram: each wire swings between 1.05 V and 1.35 V around the 1.2 V offset, giving a differential signal of +300mV / -300mV]
  • 160. 21/12/05 r.innocente 160 LVDS/4 • It consumes only 1 mW (330mV swing, constant current): GTL would consume 40 mW (1V swing) and TTL much more • consumer chips for connecting displays (OpenLDI DS90C031) are already txmitting 6 Gb/s • The low slew rate (300mV in 333 ps is only 0.9V/ns) minimizes xtalk and distortion
  • 161. 21/12/05 r.innocente 161 VCSEL/1 VCSELs: Vertical Cavity Surface Emitting Lasers: • the laser cavity is vertical to the semiconductor wafer • the light comes out from the surface of the wafer Distributed Bragg Reflectors (DBR): • 20/30 pairs of semiconductor layers Active region p-DBR n-DBR
  • 162. 21/12/05 r.innocente 162 VCSEL/2-EEL (Edge Emitting) picture from Honeywell
  • 163. 21/12/05 r.innocente 163 VCSEL/3- Surface.Emission picture from Honeywell
  • 165. 21/12/05 r.innocente 165 VCSEL/5 At costs similar to LEDs, they have the characteristics of lasers. EELs: • it’s difficult to produce a geometry w/o problems when cutting the chips • it’s not possible to know in advance if the chip is good or not VCSELs: • they can be produced and tested with standard I.C. procedures • arrays of 10 or 1000 can be produced easily
  • 166. 21/12/05 r.innocente 166 Ethernet history • 1976 Metcalfe invented a 1 Mb/s ether • 1980 Ethernet DIX (Digital, Intel, Xerox) standard • 1989 Synoptics invented twisted pair Ethernet • 1995 Fast Ethernet • 1998 Gigabit Ethernet • 2002 10 Gigabit Ethernet
  • 167. 21/12/05 r.innocente 167 Ethernet • 10 Mb/s • Fast Ethernet • Gigabit Ethernet • 10Gb Ethernet
  • 168. 21/12/05 r.innocente 168 Ethernet Frames • 6 bytes dst address • 6 bytes src address: 3 bytes vendor code, 3 bytes serial # • 2 bytes ethernet type: ip, ... • data from 46 up to 1500 bytes • 4 bytes FCS (checksum) [frame layout: dst address | src address | type | data | FCS]
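The layout above can be sketched in a few lines (the addresses below are made-up examples, and the FCS is omitted since hardware computes the CRC-32):

```python
import struct

def build_frame(dst, src, ethertype, payload):
    # Pad the data to the 46-byte minimum; FCS omitted (computed by the MAC).
    payload = payload.ljust(46, b'\x00')
    return dst + src + struct.pack('!H', ethertype) + payload

# Hypothetical addresses: broadcast destination, arbitrary source MAC.
frame = build_frame(b'\xff' * 6, b'\x00\x01\x02\xaa\xbb\xcc', 0x0800, b'hello')
print(len(frame))   # 60: the minimum frame size, FCS excluded
```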
  • 169. 21/12/05 r.innocente 169 Hubs • Repeaters: layer 1 – like a bus they just repeat all the data across all the connections (not very useful for clusters) • Switches (bridges) layer 2 – they filter traffic and repeat only necessary traffic on connections (very cheap switches can easily switch at full speed many Fast Ethernet connections) • Routers : layer 3
  • 170. 21/12/05 r.innocente 170 Ethernet flow control • With large switches it is necessary to limit the flow of packets, to avoid packets being discarded after a buffer overflow • Vendors tried non-standard mechanisms to overcome the problem • One of these mechanisms, usable on half-duplex links, is sending carrier bursts to simulate a busy channel • The pseudo-busy-channel mechanism doesn't apply to full-duplex links • MAC control protocol: to support flow control on f/d links – a new ethernet type = 0x8808, for frames of 46 bytes (min length) – the first 2 bytes are the opcode, the others the parameters – opcode = 0x0001 PAUSE: stop transmitting – dst address = 01-80-c2-00-00-01 (an address not forwarded by bridges) – the following 2 bytes give the number of 512-bit times to stop transmission
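The PAUSE frame described above can be sketched as follows (the source MAC is a made-up example; a real frame also carries a hardware-computed FCS):

```python
import struct

def build_pause_frame(src_mac, pause_quanta):
    dst = bytes.fromhex('0180c2000001')        # reserved multicast, not forwarded by bridges
    ethertype = struct.pack('!H', 0x8808)      # MAC control
    opcode = struct.pack('!H', 0x0001)         # PAUSE
    quanta = struct.pack('!H', pause_quanta)   # units of 512 bit times
    payload = (opcode + quanta).ljust(46, b'\x00')   # pad to the 46-byte minimum
    return dst + src_mac + ethertype + payload

# 0xffff quanta on Fast Ethernet: 65535 * 512 / 100e6 s, about a third of a second.
f = build_pause_frame(b'\x02\x00\x00\x00\x00\x01', 0xffff)
print(len(f))   # 60
```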
  • 171. 21/12/05 r.innocente 171 Auto Negotiation • On twisted pairs each station transmits a series of pulses (quite different from data) to signal the link is active. These pulses are called Normal Link Pulses (NLP) • A set of Fast Link Pulses (FLP) is used instead to code the station's capabilities • There are some special repeaters that connect 10 Mb/s links together and 100 Mb/s links together, and then the two sets to ports of a mixed-speed bridge • Parallel detection: when autonegotiation is not implemented, this mechanism tries to discover the speed in use from the link pulses (it cannot detect full duplex)
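Once both stations have exchanged their FLP-coded capability sets, each side picks the highest common mode. A simplified model (the real 802.3 priority table is longer; this subset and ordering are only illustrative):

```python
# Illustrative priority order, best first (not the full 802.3 table).
PRIORITY = ['1000base-T FD', '1000base-T', '100base-TX FD',
            '100base-T4', '100base-TX', '10base-T FD', '10base-T']

def resolve(local_caps, partner_caps):
    common = set(local_caps) & set(partner_caps)
    for mode in PRIORITY:           # highest common capability wins
        if mode in common:
            return mode
    return None                     # no common mode: link stays down

print(resolve(['100base-TX FD', '100base-TX', '10base-T'],
              ['100base-TX', '10base-T FD', '10base-T']))   # 100base-TX
```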
  • 172. 21/12/05 r.innocente 172 GigE Peculiarities: • Flow control necessary on full duplex • 1000base-T uses all 4 pairs of cat 5 cable in both directions, 250 Mb/s on each pair with echo cancellation (10 and 100base-T used only 2 pairs) • 1000base-SX on multimode fiber uses a laser (usually a VCSEL). Some multimode fibers have a discontinuity in the refraction index for some microns just in the middle of the fiber. This is not important with a LED launch, where the light fills the core, but is essential with a laser launch. An offset patch cord (a small length of single mode fiber that moves the spot away from the center) has been proposed as a solution.
  • 173. 21/12/05 r.innocente 173 10 GbE /1 • IEEE 802.3ae 10GbE ratified on June 17, 2002 • Optical media ONLY: MMF 50u/400MHz 66m, new 50u/2000MHz 300m, 62.5u/160MHz 26m, WWDM 4x 62.5u/160MHz 300m, SMF 10km/40km • Full duplex only (for 1 GbE a CSMA/CD mode of operation was still specified, though no one implemented it) • LAN/WAN different: 10.3 Gbaud vs 9.95 Gbaud (SONET) • 802.3 frame format (including min/max size)
  • 174. 21/12/05 r.innocente 174 10GbE /2 from Bottor (Nortel Networks) : marriage of Ethernet and DWDM
  • 175. 21/12/05 r.innocente 175 VLAN/1 Virtual LAN: – IEEE 802.1Q extension to the ethernet std – frames can be tagged with 4 more bytes (so that an ethernet frame can now be up to 1522 bytes): • TPID Tagged Protocol Identifier, 2 bytes with value 0x8100 • TCI Tag Control Information: specifies the priority and the Virtual LAN (12 bits) the frame belongs to • remaining bytes with standard content
  • 176. 21/12/05 r.innocente 176 VLAN/2 • VLAN tagged ethernet frame :
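Tagging amounts to splicing the 4-byte TPID+TCI pair in after the two MAC addresses. A minimal sketch (the frame contents are dummy bytes):

```python
import struct

def tag_frame(frame, vlan_id, priority=0):
    # Build the TCI: 3 priority bits, 1 CFI bit (0 here), 12 VLAN-id bits.
    tci = (priority << 13) | (vlan_id & 0xfff)
    tag = struct.pack('!HH', 0x8100, tci)     # TPID + TCI
    return frame[:12] + tag + frame[12:]      # insert after the 12 address bytes

untagged = bytes(60)                          # dummy minimum-size frame
tagged = tag_frame(untagged, vlan_id=42, priority=5)
print(len(tagged))   # 64; a max-size tagged frame reaches 1522 bytes
```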
  • 177. 21/12/05 r.innocente 177 Infiniband/1 • It represents the convergence of 2 separate proposals: – NGIO (Next Generation IO: Intel, Microsoft, Sun) – FutureIO (Compaq, IBM, HP) • Infiniband: Compaq, Dell, HP, IBM, Intel, Microsoft, Sun
  • 178. 21/12/05 r.innocente 178 Infiniband/2 • Std channel at 2.5 Gb/s (copper LVDS or fiber): – 1x width 2.5 Gb/s – 4x width 10 Gb/s – 12x width 30 Gb/s
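One detail worth noting (not stated on the slide): these are signaling rates, and Infiniband lanes use 8b/10b encoding, so the usable data rate is 20% lower. A quick check:

```python
def data_rate_gbps(lanes, signal_gbps=2.5, encoding=8 / 10):
    # 8b/10b spends 10 line bits per 8 data bits, hence the 0.8 factor.
    return lanes * signal_gbps * encoding

for width in (1, 4, 12):
    print(f'{width}x: {data_rate_gbps(width):g} Gb/s of data')
```

So a 4x link carries 8 Gb/s of payload despite its 10 Gb/s signaling rate.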
  • 179. 21/12/05 r.innocente 179 Infiniband/3 • Highlights: – point to point switched interconnect – channel based message passing – computer room interconnect, diameter < 100-300 m – one connection for all I/O: ipc, storage I/O, network I/O – up to thousands of nodes
  • 182. 21/12/05 r.innocente 182 Myrinet/1 • comes from a USC/ISI research project: ATOMIC • flow control and error control on all links • cut-through xbar switches, intelligent boards • Myrinet packet format: source route bytes | type | data (no length limit) | CRC — the type field allows multiple protocols
  • 184. 21/12/05 r.innocente 184 Myrinet/3 Every node can discover the topology with probe packets (mapper). In this way it gets a routing table that can be used to compute headers for any destination. There is 1 route byte for each hop traversed; each switch on the path removes the first routing byte and uses it to pick the output port.
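The source-routing scheme above can be modeled in a few lines (a toy simulation, with invented switch names):

```python
def route(packet, switches):
    """Toy model of Myrinet source routing: each switch strips the first
    route byte and uses it as the output port; the rest is forwarded."""
    path = []
    for sw in switches:                    # one crossbar switch per hop
        port, packet = packet[0], packet[1:]
        path.append((sw, port))
    return path, packet                    # what finally reaches the host

path, payload = route([3, 7, 1] + list(b'payload'), ['sw-a', 'sw-b', 'sw-c'])
print(path)             # [('sw-a', 3), ('sw-b', 7), ('sw-c', 1)]
print(bytes(payload))   # b'payload'
```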
  • 185. 21/12/05 r.innocente 185 Myrinet/4 • Currently Myrinet uses a 2 Gb/s link • There are double-link cards for PCI-X, so a theoretical b/w of 4 Gb/s per node can be achieved: in this way you use 2 ports on the switch too • Their current TCP/IP implementation is very well tuned: it reaches about 80% of the b/w of the proprietary protocol, though with a much larger latency
  • 186. 21/12/05 r.innocente 186 Clos networks Named after Charles Clos, who introduced them in 1953. It's a kind of fat tree network. These networks have full bisection bandwidth. Large Myrinet clusters are frequently arranged as a Clos network. Myricom's base block is their 16-port xbar switch. From this brick it's possible to build 128-node/3-level (16+8 switches) and 1024-node/5-level networks.
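The node counts quoted above follow from the switch radix. A sketch of the counting, under a simplified folded-Clos model chosen to reproduce the slide's figures (not Myricom's exact wiring rules):

```python
def clos_nodes(radix, levels):
    """Max hosts of a folded Clos built from fixed-radix crossbars:
    each extra pair of levels multiplies capacity by radix/2."""
    stages = (levels + 1) // 2          # 3 levels -> 2 stages, 5 -> 3
    return (radix // 2) ** (stages - 1) * radix

print(clos_nodes(16, 3))   # 128
print(clos_nodes(16, 5))   # 1024
```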
  • 187. 21/12/05 r.innocente 187 Clos networks/2 from myri.com
  • 188. 21/12/05 r.innocente 188 SCI/1 • Acronym for Scalable Coherent Interface, ANSI/ISO/IEEE std 1596-1992 • Packet-based protocol for SAN (System Area Network) • It uses pt-to-pt links of 5.3 Gb/s • Topology: ring, torus 2D or 3D • It uses a shared memory approach. SCI addresses are 64 bits: – the most significant 16 bits specify the node – the remaining 48 bits address memory within the node • Up to 64K nodes / 256 TB addressing space
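The 16/48 address split above is straightforward to express (a minimal sketch, with an arbitrary example address):

```python
def split_sci_address(addr64):
    # Top 16 bits select the node, low 48 bits the offset in its memory.
    node = addr64 >> 48
    offset = addr64 & ((1 << 48) - 1)
    return node, offset

node, offset = split_sci_address(0x0002_0000_0000_1000)
print(node, hex(offset))   # 2 0x1000
```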
  • 189. 21/12/05 r.innocente 189 SCI/2 • Dolphin SISCI API is the std API used to control the interconnect using a shared memory approach. To enable communications: – The receiving node sets aside a part of its memory to be used by the SCI interconnect – The sender imports the memory area in its address space – The sender can now read or write directly in that memory area • Because it is difficult to streamline reads through the memory/PCI bus, writes are about 10 times faster than reads • In many implementations of upper-level protocols it is therefore quite convenient to use only writes
  • 190. 21/12/05 r.innocente 190 SCI/3 • The link b/w between nodes is effectively shared • This puts an upper limit on the number of nodes you can put in a loop, currently 8-10: so a 2D torus up to about 64 nodes, a 3D torus beyond that • Because there are no switches the cost scales linearly • Cable length is a big limitation, with a preferred length of 1 m or less
  • 193. 21/12/05 r.innocente 193 Interconnect/1GbE • Very cheap almost commodity connection now • Broadcom chipset based < USD 100 per card • Switch for 64-128 configurations around USD 200-300 per port
  • 195. 21/12/05 r.innocente 195 Interconnect/Myrinet • Currently 2Gb/s links, with the possibility of a dual link card • SL card 800 USD, 2 links card 1600 USD • 1 port on a switch
  • 197. 21/12/05 r.innocente 197 Interconnect/10GbE • cost of the optics alone is around USD 3000 • 1 port on a switch around 10,000 USD • switches only up to 48 ports
  • 198. 21/12/05 r.innocente 198 Software Despite great advances in network technology (2-3 orders of magnitude), much communication s/w has remained almost unchanged for many years (e.g. BSD networking). There is a lot of ongoing research on this theme and very different solutions are proposed (zero-copy, page remapping, VIA, ...)
  • 199. 21/12/05 r.innocente 199 Software overhead [chart: total time = software overhead + transmission time, for 10 Mb/s Ether, 100 Mb/s Ether and 1 Gb/s — transmission time shrinks while software overhead stays the same] Being a constant, software overhead is becoming more and more important!
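The chart's point can be reproduced with a toy model (the 50 us per-message overhead is a hypothetical figure, chosen only to illustrate the trend):

```python
def total_time_us(nbytes, link_mbps, overhead_us=50.0):
    # Hypothetical fixed per-message software overhead plus wire time.
    wire_us = nbytes * 8 / link_mbps      # transmission time in microseconds
    return overhead_us + wire_us

for mbps in (10, 100, 1000):              # 10 Mb/s, Fast Ethernet, GigE
    t = total_time_us(1500, mbps)
    print(f'{mbps:>4} Mb/s: total {t:7.1f} us, overhead share {50.0 / t:.0%}')
```

At 10 Mb/s the overhead is a few percent of the total; at 1 Gb/s it dominates.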
  • 200. 21/12/05 r.innocente 200 Standard data flow /1 [diagram: Processor — processor bus — North Bridge — memory bus — Memory; North Bridge — PCI (I/O) bus — NIC memory — Network; copies 1, 2, 3 marked along the path and enumerated on the next slide]
  • 201. 21/12/05 r.innocente 201 Standard data flow /2 • Copies : 1. Copy from user space to kernel buffer 2. DMA copy from kernel buffer to NIC memory 3. DMA copy from NIC memory to network • Loads: – 2x on the processor bus – 3x on the memory bus – 2x on the NIC memory
  • 202. 21/12/05 r.innocente 202 Zero-copy data flow [diagram: same Processor / North Bridge / Memory / PCI bus / NIC layout, but with only 2 copies — (1) DMA from user memory over the PCI bus to NIC memory, (2) DMA from NIC memory to the network]
  • 203. 21/12/05 r.innocente 203 Zero Copy Research • Shared memory between user/kernel: – Fbufs (Druschel, 1993) – I/O-Lite (Druschel, 1999) • Page remapping with copy on write (Chu, 1996) • Blast: hardware splits headers from data (Carter, O'Malley, 1990) • Ulni (User-level Network Interface): implementation of communication s/w inside libraries in user space. High speed networks, I/O systems and memory have comparable bandwidths -> it is essential to avoid any unnecessary copy of data!
  • 204. 21/12/05 r.innocente 204 OS bypass – User level networking • Active Messages (AM) – von Eicken, Culler (1992) • U-Net – von Eicken, Basu, Vogels (1995) • PM – Tezuka, Hori, Ishikawa, Sato (1997) • Illinois FastMessages (FM) – Pakin, Karamcheti, Chien (1997) • Virtual Interface Architecture (VIA) – Compaq, Intel, Microsoft (1998)
  • 205. 21/12/05 r.innocente 205 Active Messages (AM) • 1-sided communication paradigm (no receive op) • each message, as soon as it is received, triggers a receive handler that acts as a separate thread (in current implementations it is sender based)
  • 206. 21/12/05 r.innocente 206 FastMessages (FM) • FM_send(dest,handler,buf,size) sends a long message • FM_send_4(dest,handler,i0,i1,i2,i3) sends a 4-word msg (reg to reg) • FM_extract() processes received msgs
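The handler-driven model behind AM and FM can be sketched with a toy endpoint (the class and method names are illustrative, not the real FM library; the network is modeled as a local queue):

```python
from collections import deque

class FMEndpoint:
    """Toy sketch of the FastMessages model: a send enqueues a
    (handler, args) pair; FM_extract() runs pending handlers."""
    def __init__(self):
        self.queue = deque()
        self.log = []

    def fm_send(self, handler, *args):
        # Network delivery, modeled here as appending to a local queue.
        self.queue.append((handler, args))

    def fm_extract(self):
        # The receiver polls: each extracted message triggers its handler.
        while self.queue:
            handler, args = self.queue.popleft()
            handler(self, *args)

def on_msg(ep, payload):
    ep.log.append(payload)          # a trivial receive handler

ep = FMEndpoint()
ep.fm_send(on_msg, 'hello')
ep.fm_extract()
print(ep.log)   # ['hello']
```

Note there is no explicit receive call: the handler named in the message does the receive-side work, which is the key difference from sockets.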
  • 207. 21/12/05 r.innocente 207 VIA/1 To bring to the industrial level the advantages obtained by the various User Level Networking (ULN) initiatives, Compaq, Intel and Microsoft proposed an industry standard called VIA (Virtual Interface Architecture). The proposal specifies an API (Application Programming Interface). Network reads and writes are done bypassing the OS, while open/close/map are done with kernel intervention. It requires memory registration.
  • 209. 21/12/05 r.innocente 209 VIA/3 • VIA is better suited to network cards implementing advanced mechanisms like doorbells (a queue of transactions, mapped into the process address space, that the card remembers) • Anyway it can also be implemented completely in s/w, though less efficiently (look at the M-VIA UCB project, and MVICH)
  • 210. 21/12/05 r.innocente 210 Fbufs (Druschel 1993) • Shared buffers between user processes and the kernel • A general scheme that can be used for either network or I/O operations • Specific API without copy semantics • A buffer pool specific to a process is created as part of process initialization • This buffer pool can be pre-mapped in both the user and kernel address spaces • API: uf_read(int fd, void **bufp, size_t nbytes) • uf_get(int fd, void **bufpp) • uf_write(int fd, void *bufp, size_t nbytes) • uf_allocate(size_t size) • uf_deallocate(void *ptr, size_t size)
  • 211. 21/12/05 r.innocente 211 Network technologies • Extended frames : – Alteon Jumbograms (MTU up to 9000) • Interrupt suppression (coalescing) • Checksum offloading (from Gallatin 1999)
  • 212. 21/12/05 r.innocente 212 Trapeze driver for Myrinet2000 on FreeBSD • zero-copy TCP by page remapping and copy-on-write • split header/payload • gather/scatter DMA • checksum offload to NIC • adaptive message pipelining • 220 MB/s (1760 Mb/s) on Myrinet2000 and a Dell Pentium III dual processor: the application doesn't touch the data (from Chase, Gallatin, Yocum 2001)
  • 213. 21/12/05 r.innocente 213 Bibliography • Patterson/Hennessy: Computer Architecture – A Quantitative Approach • Hwang – Advanced Computer Architecture • Schimmel – Unix Systems for Modern Architectures