2. Some basic questions
• What is high performance?
  – Rate of computation
  – Time to compute
• Who needs high performance systems?
  – Weather prediction, complex design, scientific computation etc.
  – Everyone needs it.
• How do you achieve high performance?
  – Technology
  – Circuit / logic design
  – Architecture
• How to analyse or evaluate performance?
  – Theoretical models
  – Simulation
  – Experimentation
slide 2
Anshul Kumar, CSE IITD
3. Execution Time and Clock Period
Instruction execution time = Tinst = CPI × Δt
Pipeline stages (one per cycle of length Δt): IF D RF EX/AG M WB
Program exec time = Tprog = N × Tinst = N × CPI × Δt
N: number of instructions
CPI: average cycles per instruction
Δt: clock cycle time
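The relation can be checked with a quick calculation; the instruction count, CPI and clock rate below are illustrative assumptions, not measurements:

```python
# Tprog = N * CPI * dt, with dt = 1 / clock rate
N = 2_000_000_000       # instructions executed (assumed)
CPI = 1.5               # average cycles per instruction (assumed)
clock_rate = 2.0e9      # 2 GHz clock (assumed)

dt = 1.0 / clock_rate   # clock cycle time: 0.5 ns
T_inst = CPI * dt       # average time per instruction: 0.75 ns
T_prog = N * T_inst     # total program execution time

print(f"T_prog = {T_prog:.2f} s")  # → T_prog = 1.50 s
```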
4. What influences clock period?
Tprog = N × CPI × Δt
• Technology → Δt
• Software → N
• Architecture → N × CPI × Δt
  – Instruction set architecture (ISA): N vs CPI × Δt trade-off
  – Microarchitecture (μA): CPI vs Δt trade-off
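The ISA trade-off can be illustrated numerically: a RISC-like ISA tends to raise N while lowering CPI, a CISC-like ISA the opposite. All figures below are hypothetical:

```python
def t_prog(N, CPI, dt):
    # Tprog = N * CPI * dt
    return N * CPI * dt

dt = 0.5e-9  # same 2 GHz clock for both machines (assumed)
t_risc = t_prog(N=1.2e9, CPI=1.1, dt=dt)  # more, simpler instructions
t_cisc = t_prog(N=0.8e9, CPI=2.0, dt=dt)  # fewer, complex instructions
print(f"RISC-like: {t_risc:.2f} s, CISC-like: {t_cisc:.2f} s")
# → RISC-like: 0.66 s, CISC-like: 0.80 s
```

With these (assumed) numbers the 50% increase in N is more than paid for by the lower CPI; the point of the trade-off is that neither N nor CPI alone decides performance.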
5. Relative performance per unit cost
Year Technology Perf/cost
1951 Vacuum tube 1
1965 Transistor 35
1975 Integrated circuit 900
1995 VLSI 2,400,000
6. Increase in workstation performance
[Figure: workstation performance vs year, 1987 to 1997: roughly 100 for the SUN-4/260 and MIPS M/120, rising through the MIPS M2000, IBM RS6000, IBM POWER 100, HP 9000/750, DEC AXP/500 and DEC Alpha 4/266, 5/300 and 5/500, to about 1200 for the DEC Alpha 21264/600.]
7. Growth in DRAM Capacity
[Figure: DRAM Kbit capacity vs year of introduction, 1976 to 1996: growing from 16K through 64K, 256K, 1M, 4M and 16M to 64M.]
8. CPU-Memory Performance Gap
• Semiconductor (random access)
  – Registers: CPU speed
  – SRAM
  – DRAM
  – FLASH
• Magnetic (slow)
  – FDD
  – HDD
• Optical (random + sequential access, very slow)
  – CD
  – DVD
9. Memory Hierarchy Principle
The CPU accesses the fastest memory first: a hit is served there, a miss goes to the next level down.
Fastest memory: smallest size, highest cost/bit.
Slowest memory: biggest size, lowest cost/bit.
• Temporal locality
  – References repeated in time
• Spatial locality
  – References repeated in space
  – Special case: sequential locality
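The payoff of the hierarchy can be sketched with the standard average-memory-access-time formula; the hit time, miss rate and miss penalty below are assumed values:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = hit time + miss rate * miss penalty
    return hit_time + miss_rate * miss_penalty

# Assumed: 1 ns cache hit, 5% miss rate, 60 ns penalty to DRAM.
# Locality is what keeps the miss rate this low.
print(amat(hit_time=1.0, miss_rate=0.05, miss_penalty=60.0))  # → 4.0
```

With good locality the hierarchy delivers close to cache speed at close to DRAM cost per bit.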
16. Händler's Classification
< K × K', D × D', W × W' >
(control, data, word; a dash (') denotes degree of pipelining)
TI-ASC <1, 4, 64 × 8>
CDC 6600 <1, 1 × 10, 60> × <10, 1, 12> (I/O)
C.mmp <16,1,16> + <1×16,1,16> + <1,16,16>
PEPE <1 × 3, 288, 32>
Cray-1 <1, 12 × 8, 64 × (1 ~ 14)>
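As a reading aid (not part of Händler's notation itself), a triple can be held in a small structure and its components multiplied out; the `parallelism` figure below is only an illustrative upper bound:

```python
from dataclasses import dataclass

@dataclass
class HandlerTriple:
    """Händler triple <K x K', D x D', W x W'>."""
    K: int   # control units
    K_: int  # pipelining degree of control units
    D: int   # ALUs per control unit
    D_: int  # pipelining degree of ALUs
    W: int   # word length
    W_: int  # pipeline stages per ALU

    def parallelism(self) -> int:
        # Crude upper bound on concurrent bit-level operations
        # (illustrative interpretation, not Händler's own metric).
        return self.K * self.K_ * self.D * self.D_ * self.W * self.W_

# TI-ASC <1, 4, 64 x 8>: one control unit, 4 ALUs, 64-bit words, 8 stages
ti_asc = HandlerTriple(K=1, K_=1, D=4, D_=1, W=64, W_=8)
print(ti_asc.parallelism())  # → 2048
```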
17. Modern Classification
Parallel architectures divide into:
• Data-parallel architectures
• Function-parallel architectures
18. Data Parallel Architectures
• SIMD processors
  – Multiple processing elements driven by a single instruction stream
• Vector processors
  – Uni-processors with vector instructions
• Associative processors
  – SIMD-like processors with associative memory
• Systolic arrays
  – Application-specific VLSI structures
20. Pipelining
Simple multicycle design:
• resource sharing across cycles
• all instructions may not take the same number of cycles
Pipeline stages: IF D RF EX/AG M WB
• pipelining gives higher throughput
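The throughput gain can be estimated with the usual idealized count: n instructions on a k-stage pipeline take k + n - 1 cycles, against n × k cycles without pipelining (hazards ignored):

```python
def pipelined_cycles(n, k):
    # k cycles to fill the pipe, then one completion per cycle
    return k + n - 1

def multicycle_cycles(n, k):
    # non-pipelined: every instruction occupies all k cycles
    return n * k

n, k = 1000, 6  # 6 stages: IF D RF EX/AG M WB
speedup = multicycle_cycles(n, k) / pipelined_cycles(n, k)
print(f"speedup = {speedup:.2f}")  # approaches k = 6 for large n
```

The hazards listed on the next slide are what keep real pipelines below this ideal speedup of k.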
21. Limits of Pipelining
• Structural hazards
  – Resource conflicts: two instructions require the same resource in the same cycle
• Data hazards
  – Data dependencies: one instruction needs data that is yet to be produced by another instruction
• Control hazards
  – The decision about the next instruction needs more cycles
22. ILP in VLIW processors
[Diagram: cache/memory feeds a fetch unit, which delivers a single multi-operation instruction to several functional units (FUs) sharing a register file.]
23. ILP in Superscalar processors
[Diagram: a sequential stream of instructions flows from cache/memory through a fetch unit to a decode-and-issue unit, which issues multiple instructions to functional units (FUs) sharing a register file. FU: functional unit.]
24. Superscalar and VLIW processors
25. Issues in ILP Architectures
[Diagram: multiple FUs connected to a shared register file.]
• Scalability with increase in number of register ports
• ILP detection – special compilers / special hardware
• Code compatibility
• Code density, instruction encoding
• Maintaining consistency
26. ILP and Multithreading
[Diagram: issue-slot utilization compared across ILP, coarse-grained MT, fine-grained MT and SMT; after Hennessy and Patterson.]
27. Why Process level Parallel Architectures?
Function-parallel architectures (alongside data-parallel architectures) divide into:
• Instruction-level PAs
• Thread-level PAs
• Process-level PAs (MIMDs)
MIMDs are built using general-purpose processors:
• Shared-memory MIMD
• Distributed-memory MIMD
28. Issues from userâs perspective
• Specification / program design
  – explicit parallelism, or
  – implicit parallelism + parallelizing compiler
• Partitioning / mapping to processors
• Scheduling / mapping to time instants
  – static or dynamic
• Communication and synchronization
29. Parallel programming models
Models: concurrent control flow, functional or logic programs, vector/array operations.
Concurrent tasks/processes/threads/objects communicate via shared variables or message passing.
What is the relationship between programming model and architecture?
30. Issues from architectâs perspective
• Coherence problem in shared memory with caches
• Efficient interconnection networks
31. Shared Memory Multiprocessor
[Diagram: shared-memory multiprocessor organizations: processors (P) and memory modules (M) connected through an interconnection network; memory may sit on the far side of the network or be local to each processor, with a global interconnection network joining the clusters.]
32. Cache Coherence Problem
Multiple copies of data may exist – the problem of cache coherence.
Options for coherence protocols:
• What action is taken?
  – Invalidate or update
• Which processors/caches communicate?
  – Snoopy (broadcast) or directory-based
• Status of each block?
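A toy sketch of the invalidate option with broadcast (snoopy) communication; the dictionaries stand in for caches, and this is a simplification rather than any specific real protocol:

```python
caches = {0: {}, 1: {}, 2: {}}  # per-processor cache: addr -> value
memory = {0x100: 7}

def read(p, addr):
    if addr not in caches[p]:
        caches[p][addr] = memory[addr]  # miss: fetch from memory
    return caches[p][addr]

def write(p, addr, value):
    for q, cache in caches.items():     # snoopy broadcast: invalidate
        if q != p:                      # every other cached copy
            cache.pop(addr, None)
    caches[p][addr] = value
    memory[addr] = value                # write-through, for simplicity

read(0, 0x100)
read(1, 0x100)          # both caches now hold the block
write(0, 0x100, 42)     # invalidates the copy in cache 1
print(read(1, 0x100))   # → 42 (re-fetched, not the stale 7)
```

An update protocol would instead push the new value into the other caches; directory-based protocols replace the broadcast with messages to the caches a directory records as sharers.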
33. Interconnection Networks
• Architectural variations:
  – Topology
  – Direct or indirect (through switches)
  – Static (fixed connections) or dynamic (connections established as required)
  – Routing type (store-and-forward / wormhole)
• Efficiency:
  – Delay
  – Bandwidth
  – Cost
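These metrics can be compared for two standard static topologies, using the textbook formulas for a k × k 2D mesh and a d-dimensional hypercube (a sketch; switch and wiring costs are ignored):

```python
import math

def mesh_2d(n):
    """k x k 2D mesh, n = k*k nodes -> (diameter, number of links)."""
    k = math.isqrt(n)
    return 2 * (k - 1), 2 * k * (k - 1)

def hypercube(n):
    """n = 2^d node hypercube -> (diameter, number of links)."""
    d = n.bit_length() - 1
    return d, n * d // 2

print(mesh_2d(64))    # → (14, 112): longer paths, fewer links
print(hypercube(64))  # → (6, 192): shorter paths, more links (costlier)
```

This illustrates the delay/cost trade-off: lower diameter (delay) generally comes at the price of more links (cost).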
34. Quest for Performance
1946 ENIAC ($0.5M, 18K vacuum tubes, 150 kW)
  add/sub: 5000 per sec
  mult: 385 per sec
  div: 40 per sec
  sqrt: 3 per sec
1962 Atlas (pipelined, integer + FPU): 200K FLOPS
1962 Burroughs D825 (4 CPUs, 16 memory modules)
1964 CDC 6600 (first supercomputer): multiple FUs, dynamic scheduling
1972 ILLIAC-IV (64 PEs, 4 MFLOPS each)
35. Fastest Supercomputer
(ref www.top500.org)
• IBM's Blue Gene/L at Lawrence Livermore Lab topped the list in June 2006 with 280.6 teraflops.
• Japan's Earth Simulator, introduced in 2002, was the fastest with 35.8 teraflops until Blue Gene took over in 2004.
• Japan proposed (2005) building a supercomputer 73 times faster than the then-current best. Target: 10 petaflops; budget: $800 - $900 million; date: 2011.
• Tata Sons' EKA entered at the 4th spot in 2007 with 132.8 teraflops.
• Energy efficiency (max 488 MFLOPS/watt) was also listed in June 2008.
36. June 2008 list
Rank | Site | Computer
1 | DOE/NNSA/LANL, United States | Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops)
2 | DOE/NNSA/LLNL, United States | BlueGene/L - eServer Blue Gene Solution, IBM
3 | Argonne National Laboratory, United States | Blue Gene/P Solution, IBM
4 | Texas Advanced Computing Center / Univ. of Texas, United States | Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems
5 | DOE/Oak Ridge National Laboratory, United States | Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc.
6 | Forschungszentrum Juelich (FZJ) | JUGENE - Blue Gene/P Solution, IBM
7 | New Mexico Computing Applications Center (NMCAC), United States | Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI
8 | Computational Research Laboratories, TATA SONS, India | EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops)
37. Blue Gene Supercomputer
• 32 × 32 × 64 3D torus (65,536 nodes)
• Global reduction tree – max/sum in a few μs
• Fast synchronization across the entire machine within a few μs
• 1,024 Gbps links to a global parallel file system
38. Blue Gene Supercomputer contd.
39. Embedded vs GP Computing
• Fixed functionality
• Part of a larger system
• Interacts with the environment
• Real-time requirements
• Power constraints
• Environmental constraints
• Performance cannot be increased simply by increasing clock frequency