2. Some basic questions
• What is high performance?
  – Rate of computation
  – Time to compute
• Who needs high performance systems?
  – Weather prediction, complex design, scientific computation etc.
  – Everyone needs it.
• How do you achieve high performance?
  – Technology
  – Circuit / logic design
  – Architecture
• How to analyse or evaluate performance?
  – Theoretical models
  – Simulation
  – Experimentation
slide 2
Anshul Kumar, CSE IITD
3. Execution Time and Clock Period
Instruction execution time = Tinst = CPI × Δt
Pipeline stages (one per cycle of length Δt): IF D RF EX/AG M WB
Program exec time = Tprog = N × Tinst = N × CPI × Δt
N: number of instructions
CPI: average cycles per instruction
Δt: clock cycle time
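The relation can be checked with a quick calculation; the instruction count, CPI and clock rate below are illustrative assumptions, not measurements:

```python
# Tprog = N * CPI * dt, with dt = 1 / clock rate
N = 2_000_000_000       # instructions executed (assumed)
CPI = 1.5               # average cycles per instruction (assumed)
clock_rate = 2.0e9      # 2 GHz clock (assumed)

dt = 1.0 / clock_rate   # clock cycle time: 0.5 ns
T_inst = CPI * dt       # average time per instruction: 0.75 ns
T_prog = N * T_inst     # total program execution time

print(f"T_prog = {T_prog:.2f} s")  # → T_prog = 1.50 s
```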
4. What influences clock period?
Tprog = N × CPI × Δt
• Technology → Δt
• Software → N
• Architecture → N × CPI × Δt
  – Instruction set architecture (ISA): N vs CPI × Δt trade-off
  – Microarchitecture (μA): CPI vs Δt trade-off
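The ISA trade-off can be illustrated numerically: a RISC-like ISA tends to raise N while lowering CPI, a CISC-like ISA the opposite. All figures below are hypothetical:

```python
def t_prog(N, CPI, dt):
    # Tprog = N * CPI * dt
    return N * CPI * dt

dt = 0.5e-9  # same 2 GHz clock for both machines (assumed)
t_risc = t_prog(N=1.2e9, CPI=1.1, dt=dt)  # more, simpler instructions
t_cisc = t_prog(N=0.8e9, CPI=2.0, dt=dt)  # fewer, complex instructions
print(f"RISC-like: {t_risc:.2f} s, CISC-like: {t_cisc:.2f} s")
# → RISC-like: 0.66 s, CISC-like: 0.80 s
```

With these (assumed) numbers the 50% increase in N is more than paid for by the lower CPI; the point of the trade-off is that neither N nor CPI alone decides performance.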
5. Relative performance per unit cost
Year Technology Perf/cost
1951 Vacuum tube 1
1965 Transistor 35
1975 Integrated circuit 900
1995 VLSI 2,400,000
6. Increase in workstation performance
[Figure: workstation performance vs year, 1987 to 1997: roughly 100 for the SUN-4/260 and MIPS M/120, rising through the MIPS M2000, IBM RS6000, IBM POWER 100, HP 9000/750, DEC AXP/500 and DEC Alpha 4/266, 5/300 and 5/500, to about 1200 for the DEC Alpha 21264/600.]
7. Growth in DRAM Capacity
[Figure: DRAM Kbit capacity vs year of introduction, 1976 to 1996: growing from 16K through 64K, 256K, 1M, 4M and 16M to 64M.]
8. CPU-Memory Performance Gap
• Semiconductor (random access)
  – Registers: CPU speed
  – SRAM
  – DRAM
  – FLASH
• Magnetic (slow)
  – FDD
  – HDD
• Optical (random + sequential access, very slow)
  – CD
  – DVD
9. Memory Hierarchy Principle
The CPU accesses the fastest memory first: a hit is served there, a miss goes to the next level down.
Fastest memory: smallest size, highest cost/bit.
Slowest memory: biggest size, lowest cost/bit.
• Temporal locality
  – References repeated in time
• Spatial locality
  – References repeated in space
  – Special case: sequential locality
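The payoff of the hierarchy can be sketched with the standard average-memory-access-time formula; the hit time, miss rate and miss penalty below are assumed values:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = hit time + miss rate * miss penalty
    return hit_time + miss_rate * miss_penalty

# Assumed: 1 ns cache hit, 5% miss rate, 60 ns penalty to DRAM.
# Locality is what keeps the miss rate this low.
print(amat(hit_time=1.0, miss_rate=0.05, miss_penalty=60.0))  # → 4.0
```

With good locality the hierarchy delivers close to cache speed at close to DRAM cost per bit.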
16. Händler's Classification
< K × K', D × D', W × W' >
(control, data, word; a dash (') denotes degree of pipelining)
TI-ASC <1, 4, 64 × 8>
CDC 6600 <1, 1 × 10, 60> × <10, 1, 12> (I/O)
C.mmp <16,1,16> + <1×16,1,16> + <1,16,16>
PEPE <1 × 3, 288, 32>
Cray-1 <1, 12 × 8, 64 × (1 ~ 14)>
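As a reading aid (not part of Händler's notation itself), a triple can be held in a small structure and its components multiplied out; the `parallelism` figure below is only an illustrative upper bound:

```python
from dataclasses import dataclass

@dataclass
class HandlerTriple:
    """Händler triple <K x K', D x D', W x W'>."""
    K: int   # control units
    K_: int  # pipelining degree of control units
    D: int   # ALUs per control unit
    D_: int  # pipelining degree of ALUs
    W: int   # word length
    W_: int  # pipeline stages per ALU

    def parallelism(self) -> int:
        # Crude upper bound on concurrent bit-level operations
        # (illustrative interpretation, not Händler's own metric).
        return self.K * self.K_ * self.D * self.D_ * self.W * self.W_

# TI-ASC <1, 4, 64 x 8>: one control unit, 4 ALUs, 64-bit words, 8 stages
ti_asc = HandlerTriple(K=1, K_=1, D=4, D_=1, W=64, W_=8)
print(ti_asc.parallelism())  # → 2048
```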
17. Modern Classification
Parallel architectures divide into:
• Data-parallel architectures
• Function-parallel architectures
18. Data Parallel Architectures
• SIMD processors
  – Multiple processing elements driven by a single instruction stream
• Vector processors
  – Uni-processors with vector instructions
• Associative processors
  – SIMD-like processors with associative memory
• Systolic arrays
  – Application-specific VLSI structures
20. Pipelining
Simple multicycle design:
• resource sharing across cycles
• all instructions may not take the same number of cycles
Pipeline stages: IF D RF EX/AG M WB
• pipelining gives higher throughput
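The throughput gain can be estimated with the usual idealized count: n instructions on a k-stage pipeline take k + n - 1 cycles, against n × k cycles without pipelining (hazards ignored):

```python
def pipelined_cycles(n, k):
    # k cycles to fill the pipe, then one completion per cycle
    return k + n - 1

def multicycle_cycles(n, k):
    # non-pipelined: every instruction occupies all k cycles
    return n * k

n, k = 1000, 6  # 6 stages: IF D RF EX/AG M WB
speedup = multicycle_cycles(n, k) / pipelined_cycles(n, k)
print(f"speedup = {speedup:.2f}")  # approaches k = 6 for large n
```

The hazards listed on the next slide are what keep real pipelines below this ideal speedup of k.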
21. Limits of Pipelining
• Structural hazards
  – Resource conflicts: two instructions require the same resource in the same cycle
• Data hazards
  – Data dependencies: one instruction needs data that is yet to be produced by another instruction
• Control hazards
  – The decision about the next instruction needs more cycles
22. ILP in VLIW processors
[Diagram: cache/memory feeds a fetch unit, which delivers a single multi-operation instruction to several functional units (FUs) sharing a register file.]
23. ILP in Superscalar processors
[Diagram: a sequential stream of instructions flows from cache/memory through a fetch unit to a decode-and-issue unit, which issues multiple instructions to functional units (FUs) sharing a register file. FU: functional unit.]
24. Superscalar and VLIW processors
25. Issues in ILP Architectures
[Diagram: multiple FUs connected to a shared register file.]
• Scalability with increase in number of register ports
• ILP detection – special compilers / special hardware
• Code compatibility
• Code density, instruction encoding
• Maintaining consistency
26. ILP and Multithreading
[Diagram: issue-slot utilization compared across ILP, coarse-grained MT, fine-grained MT and SMT; after Hennessy and Patterson.]
27. Why Process level Parallel Architectures?
Function-parallel architectures (alongside data-parallel architectures) divide into:
• Instruction-level PAs
• Thread-level PAs
• Process-level PAs (MIMDs)
MIMDs are built using general-purpose processors:
• Shared-memory MIMD
• Distributed-memory MIMD
28. Issues from userâs perspective
• Specification / program design
  – explicit parallelism, or
  – implicit parallelism + parallelizing compiler
• Partitioning / mapping to processors
• Scheduling / mapping to time instants
  – static or dynamic
• Communication and synchronization
29. Parallel programming models
Models: concurrent control flow, functional or logic programs, vector/array operations.
Concurrent tasks/processes/threads/objects communicate via shared variables or message passing.
What is the relationship between programming model and architecture?
30. Issues from architectâs perspective
• Coherence problem in shared memory with caches
• Efficient interconnection networks
31. Shared Memory Multiprocessor
[Diagram: shared-memory multiprocessor organizations: processors (P) and memory modules (M) connected through an interconnection network; memory may sit on the far side of the network or be local to each processor, with a global interconnection network joining the clusters.]
32. Cache Coherence Problem
Multiple copies of data may exist – the problem of cache coherence.
Options for coherence protocols:
• What action is taken?
  – Invalidate or update
• Which processors/caches communicate?
  – Snoopy (broadcast) or directory-based
• Status of each block?
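A toy sketch of the invalidate option with broadcast (snoopy) communication; the dictionaries stand in for caches, and this is a simplification rather than any specific real protocol:

```python
caches = {0: {}, 1: {}, 2: {}}  # per-processor cache: addr -> value
memory = {0x100: 7}

def read(p, addr):
    if addr not in caches[p]:
        caches[p][addr] = memory[addr]  # miss: fetch from memory
    return caches[p][addr]

def write(p, addr, value):
    for q, cache in caches.items():     # snoopy broadcast: invalidate
        if q != p:                      # every other cached copy
            cache.pop(addr, None)
    caches[p][addr] = value
    memory[addr] = value                # write-through, for simplicity

read(0, 0x100)
read(1, 0x100)          # both caches now hold the block
write(0, 0x100, 42)     # invalidates the copy in cache 1
print(read(1, 0x100))   # → 42 (re-fetched, not the stale 7)
```

An update protocol would instead push the new value into the other caches; directory-based protocols replace the broadcast with messages to the caches a directory records as sharers.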
33. Interconnection Networks
• Architectural variations:
  – Topology
  – Direct or indirect (through switches)
  – Static (fixed connections) or dynamic (connections established as required)
  – Routing type (store-and-forward / wormhole)
• Efficiency:
  – Delay
  – Bandwidth
  – Cost
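These metrics can be compared for two standard static topologies, using the textbook formulas for a k × k 2D mesh and a d-dimensional hypercube (a sketch; switch and wiring costs are ignored):

```python
import math

def mesh_2d(n):
    """k x k 2D mesh, n = k*k nodes -> (diameter, number of links)."""
    k = math.isqrt(n)
    return 2 * (k - 1), 2 * k * (k - 1)

def hypercube(n):
    """n = 2^d node hypercube -> (diameter, number of links)."""
    d = n.bit_length() - 1
    return d, n * d // 2

print(mesh_2d(64))    # → (14, 112): longer paths, fewer links
print(hypercube(64))  # → (6, 192): shorter paths, more links (costlier)
```

This illustrates the delay/cost trade-off: lower diameter (delay) generally comes at the price of more links (cost).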
34. Quest for Performance
1946 ENIAC ($0.5M, 18K vacuum tubes, 150 kW)
  add/sub: 5000 per sec
  mult: 385 per sec
  div: 40 per sec
  sqrt: 3 per sec
1962 Atlas (pipelined, integer + FPU): 200K FLOPS
1962 Burroughs D825 (4 CPUs, 16 memory modules)
1964 CDC 6600 (first supercomputer): multiple FUs, dynamic scheduling
1972 ILLIAC-IV (64 PEs, 4 MFLOPS each)
35. Fastest Supercomputer
(ref www.top500.org)
• IBM's Blue Gene/L at Lawrence Livermore Lab topped the list in June 2006 with 280.6 teraflops.
• Japan's Earth Simulator, introduced in 2002, was the fastest with 35.8 teraflops until Blue Gene took over in 2004.
• Japan proposed (2005) building a supercomputer 73 times faster than the then-current best. Target: 10 petaflops; budget: $800 - $900 million; date: 2011.
• Tata Sons' EKA entered at the 4th spot in 2007 with 132.8 teraflops.
• Energy efficiency (max 488 MFLOPS/watt) was also listed in June 2008.
36. June 2008 list
Rank | Site | Computer
1 | DOE/NNSA/LANL, United States | Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops)
2 | DOE/NNSA/LLNL, United States | BlueGene/L - eServer Blue Gene Solution, IBM
3 | Argonne National Laboratory, United States | Blue Gene/P Solution, IBM
4 | Texas Advanced Computing Center / Univ. of Texas, United States | Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems
5 | DOE/Oak Ridge National Laboratory, United States | Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc.
6 | Forschungszentrum Juelich (FZJ) | JUGENE - Blue Gene/P Solution, IBM
7 | New Mexico Computing Applications Center (NMCAC), United States | Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI
8 | Computational Research Laboratories, TATA SONS, India | EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops)
37. Blue Gene Supercomputer
• 32 × 32 × 64 3D torus (65,536 nodes)
• Global reduction tree – max/sum in a few μs
• Fast synchronization across the entire machine within a few μs
• 1,024 Gbps links to a global parallel file system
38. Blue Gene Supercomputer contd.
39. Embedded vs GP Computing
• Fixed functionality
• Part of a larger system
• Interacts with the environment
• Real-time requirements
• Power constraints
• Environmental constraints
• Performance cannot be increased simply by increasing clock frequency