Lec Jan12 2009
Lec Jan12 2009 Presentation Transcript

  • 1. CSL718 : Architecture of High Performance Systems Introduction 12th January, 2009
  • 2. Some basic questions
    – What is high performance? Rate of computation; time to compute.
    – Who needs high performance systems? Weather prediction, complex design, scientific computation etc. Everyone needs it.
    – How do you achieve high performance? Technology; circuit / logic design; architecture.
    – How to analyse or evaluate performance? Theoretical models; simulation; experimentation.
    slide 2 Anshul Kumar, CSE IITD
  • 3. Execution Time and Clock Period
    Pipeline stages: IF D RF EX/AG M WB
    Instruction execution time = Tinst = CPI * Δt
    Program execution time = Tprog = N * Tinst = N * CPI * Δt
    N : number of instructions; CPI : cycles per instruction (average); Δt : clock cycle time
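The Tprog = N * CPI * Δt model on this slide can be sketched as a quick calculation; the instruction count, CPI, and clock figures below are illustrative, not taken from the lecture:

```python
# Sketch of the slide's execution-time model: Tprog = N * CPI * dt.
# The machine parameters used here are illustrative examples.
def program_time(n_instr, cpi, clock_period_s):
    """Program execution time = N * CPI * clock cycle time."""
    return n_instr * cpi * clock_period_s

# e.g. 1e9 instructions, average CPI of 1.5, 1 GHz clock (1 ns period)
t = program_time(1e9, 1.5, 1e-9)
print(t)  # 1.5 s for these illustrative numbers
```

The same function makes the later ISA/microarchitecture trade-offs concrete: halving CPI while stretching Δt by less than 2x still reduces Tprog.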
  • 4. What influences clock period? Tprog = N * CPI * Δt
    – Technology ⇒ Δt
    – Software ⇒ N
    – Architecture ⇒ N * CPI * Δt
      Instruction set architecture (ISA): N vs CPI * Δt trade-off
      Micro architecture (μA): CPI vs Δt trade-off
  • 5. Relative performance per unit cost
    Year   Technology           Perf/cost
    1951   Vacuum tube          1
    1965   Transistor           35
    1975   Integrated circuit   900
    1995   VLSI                 2,400,000
  • 6. Increase in workstation performance
    [Chart: relative performance vs year, 1987-1997, from SUN-4/260 and MIPS M/120 (1987) up to DEC Alpha 21264/600 (about 1200 in 1997); intermediate machines include MIPS M2000, IBM RS6000, HP 9000/750, IBM POWER 100, DEC AXP/500, DEC Alpha 4/266, 5/300 and 5/500]
  • 7. Growth in DRAM Capacity
    [Chart: Kbit capacity vs year of introduction, from 16K (1976) through 64K, 256K, 1M, 4M and 16M up to 64M (1996)]
  • 8. CPU-Memory Performance Gap
    – Semiconductor (random access, CPU speed): registers, SRAM, DRAM, FLASH
    – Magnetic (slow): FDD, HDD
    – Optical (random + sequential, very slow): CD, DVD
  • 9. Memory Hierarchy Principle
    The CPU accesses the fastest memory (smallest, highest cost/bit); a hit is served there, a miss goes down the hierarchy towards the slowest memory (biggest, lowest cost/bit).
    – Temporal locality: references repeated in time
    – Spatial locality: references repeated in space (special case: sequential locality)
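The hit/miss behaviour above is usually quantified as average memory access time (AMAT); a minimal sketch for a two-level hierarchy, with illustrative latencies and hit rate:

```python
# AMAT for a two-level hierarchy: time of a hit plus the expected
# miss penalty. Latencies and hit rate are illustrative numbers.
def amat(hit_time, miss_penalty, hit_rate):
    """Average memory access time in cycles."""
    return hit_time + (1 - hit_rate) * miss_penalty

# 1-cycle cache hit, 100-cycle miss penalty, 95% hit rate
print(amat(1, 100, 0.95))  # about 6 cycles on average
```

The formula shows why locality matters: a few percent change in hit rate moves the average access time by several cycles.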
  • 10. Parallelism : Flynn's Classification
    Architecture categories: SISD, SIMD, MISD, MIMD
  • 11. SISD
    [Diagram: one control unit C drives one processor P via an instruction stream IS; P exchanges a data stream DS with memory M]
  • 12. SIMD
    [Diagram: one control unit C broadcasts one instruction stream IS to multiple processors P, each with its own data stream DS to memory M]
  • 13. MISD
    [Diagram: multiple control units C, each issuing its own instruction stream IS to a processor P, operating on a common data stream DS from memory M]
  • 14. MIMD
    [Diagram: multiple control units C, each driving a processor P with its own instruction stream IS and data stream DS to memory M]
  • 15. Feng's Classification
    [Chart: bit-slice length (1 to 16K) vs word length (1 to 64); machines plotted include MPP and PEPE (16K), STARAN (256), IlliacIV (64), C.mmP (16), and PDP11, IBM370, CRAY-1 (bit-slice length 1)]
  • 16. Händler's Classification
    < K x K', D x D', W x W' > : control, data, word; dash (') → degree of pipelining
    TI-ASC: <1, 4, 64 x 8>
    CDC 6600: <1, 1 x 10, 60> x <10, 1, 12> (I/O)
    C.mmP: <16,1,16> + <1x16,1,16> + <1,16,16>
    PEPE: <1 x 3, 288, 32>
    Cray-1: <1, 12 x 8, 64 x (1 ~ 14)>
  • 17. Modern Classification
    Parallel architectures: data-parallel architectures and function-parallel architectures
  • 18. Data Parallel Architectures
    – SIMD processors: multiple processing elements driven by a single instruction stream
    – Vector processors: uni-processors with vector instructions
    – Associative processors: SIMD-like processors with associative memory
    – Systolic arrays: application specific VLSI structures
  • 19. Function Parallel Architectures
    Function-parallel architectures:
    – Instruction level parallel architectures (ILPs): pipelined processors, VLIWs, superscalar processors
    – Thread level parallel architectures
    – Process level parallel architectures (MIMDs): shared memory MIMD, distributed memory MIMD
  • 20. Pipelining
    Simple multicycle design: resource sharing across cycles; all instructions may not take the same number of cycles.
    Pipelined design (IF D RF EX/AG M WB): faster throughput.
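The throughput gain from overlapping stages can be sketched as follows, ignoring hazards; the instruction count is an illustrative number:

```python
# Ideal pipelining: with k stages overlapped, n instructions finish in
# (k + n - 1) cycles instead of k * n. A sketch that ignores hazards.
def pipelined_cycles(n_instr, n_stages):
    """Cycles to complete n instructions on an ideal k-stage pipeline."""
    return n_stages + n_instr - 1

def speedup(n_instr, n_stages):
    """Speedup over the unpipelined design taking k cycles per instruction."""
    return (n_stages * n_instr) / pipelined_cycles(n_instr, n_stages)

# 6 stages, as in the slide's IF D RF EX/AG M WB pipeline
print(speedup(1000, 6))  # approaches 6 for large n
```

The hazards listed on the next slide are exactly what keeps real pipelines below this ideal speedup.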
  • 21. Limits of Pipelining
    – Structural hazards: resource conflicts, i.e. two instructions require the same resource in the same cycle
    – Data hazards: data dependencies, i.e. one instruction needs data which is yet to be produced by another instruction
    – Control hazards: decision about the next instruction needs more cycles
  • 22. ILP in VLIW processors
    [Diagram: fetch unit reads a single multi-operation instruction from cache/memory; its operations are issued in parallel to the FUs, which share a register file]
  • 23. ILP in Superscalar processors
    [Diagram: fetch unit reads a sequential stream of instructions from cache/memory; a decode-and-issue unit issues multiple instructions per cycle to the FUs (functional units), which share a register file]
  • 24. Superscalar and VLIW processors
    [Diagram comparing superscalar and VLIW processors]
  • 25. Issues in ILP Architectures
    – Scalability with increase in number of register ports
    – ILP detection: special compilers / special hardware
    – Code compatibility
    – Code density, instruction encoding
    – Maintaining consistency
  • 26. ILP and Multithreading
    [Diagram after Hennessy and Patterson: issue-slot usage under ILP, coarse MT, fine MT, and SMT]
  • 27. Why Process level Parallel Architectures?
    Function-parallel architectures divide into instruction level, thread level and process level PAs (MIMDs); the process level PAs (shared memory MIMD and distributed memory MIMD) are built using general purpose processors.
  • 28. Issues from user's perspective
    – Specification / program design: explicit parallelism, or implicit parallelism + parallelizing compiler
    – Partitioning / mapping to processors
    – Scheduling / mapping to time instants: static or dynamic
    – Communication and synchronization
  • 29. Parallel programming models
    – Concurrent control flow; functional or logic program; vector/array operations
    – Concurrent tasks/processes/threads/objects, with shared variables or message passing
    – Relationship between programming model and architecture?
  • 30. Issues from architect's perspective
    – Coherence problem in shared memory with caches
    – Efficient interconnection networks
  • 31. Shared Memory Multiprocessor
    [Diagram: alternative organizations of processors (P) and memories (M) joined by interconnection networks, including a global interconnection network over local clusters]
  • 32. Cache Coherence Problem
    Multiple copies of data may exist ⇒ problem of cache coherence.
    Options for coherence protocols:
    – What action is taken? Invalidate or update
    – Which processors/caches communicate? Snoopy (broadcast) or directory based
    – Status of each block?
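A toy sketch of the snoopy invalidate option above, assuming two private caches and write-through to memory for simplicity; all names and the protocol simplifications are illustrative, not a real protocol implementation:

```python
# Toy snoopy invalidate protocol: a write by one processor broadcasts
# an invalidate, so the other cache's stale copy is dropped and its
# next read misses and fetches the up-to-date value.
class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                      # addr -> value (valid copies only)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, others):
        for c in others:                     # snoop: invalidate other copies
            c.lines.pop(addr, None)
        self.lines[addr] = value
        self.memory[addr] = value            # write-through for simplicity

mem = {0x10: 1}
c0, c1 = Cache(mem), Cache(mem)
c1.read(0x10)                  # c1 caches the old value 1
c0.write(0x10, 2, [c1])        # invalidates c1's copy
print(c1.read(0x10))           # prints 2: the caches stay coherent
```

An update protocol would instead push the new value into c1's line on the write; the trade-off is bus traffic versus extra misses.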
  • 33. Interconnection Networks
    – Architectural variations: topology; direct or indirect (through switches); static (fixed connections) or dynamic (connections established as required); routing type (store-and-forward / wormhole)
    – Efficiency: delay, bandwidth, cost
  • 34. Quest for Performance
    1946 ENIAC ($0.5 M, 18K VTs, 150 kW): add/sub 5000 per sec, mult 385 per sec, div 40 per sec, sqrt 3 per sec
    1962 Atlas (pipelined, Int + FPU): 200K FLOPs
    1962 Burroughs D825 (4 CPUs, 16 memories)
    1964 CDC 6600 (first supercomputer): multiple FUs, dynamic scheduling
    1972 ILLIAC-IV (64 PEs, 4 MFLOPs each)
  • 35. Fastest Supercomputer (ref www.top500.org)
    – IBM's Blue Gene/L at Lawrence Livermore Lab topped in June 2006 with 280.6 teraflops
    – Japan's Earth Simulator, introduced in 2002, was fastest with 35.8 teraflops till Blue Gene took over in 2004
    – Japan's proposal (2005) to build a supercomputer 73 times faster than the current best; target: 10 petaflops, budget $800 - $900 million, date 2011
    – Tata Sons' EKA entered 4th spot in 2007 with 132.8 teraflops
    – Energy efficiency (max 488 MFLOPs/watt) also listed in June 2008
  • 36. June 2008 list
    1. DOE/NNSA/LANL, United States: Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops)
    2. DOE/NNSA/LLNL, United States: BlueGene/L - eServer Blue Gene Solution, IBM
    3. Argonne National Laboratory, United States: Blue Gene/P Solution, IBM
    4. Texas Advanced Computing Center / Univ. of Texas, United States: Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems
    5. DOE/Oak Ridge National Laboratory, United States: Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc.
    6. Forschungszentrum Juelich (FZJ): JUGENE - Blue Gene/P Solution, IBM
    7. New Mexico Computing Applications Center (NMCAC), United States: Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI
    8. Computational Research Laboratories, TATA SONS, India: EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops)
  • 37. Blue Gene Supercomputer
    – 32 x 32 x 64 3D torus (65,536 nodes)
    – Global reduction tree: max/sum in a few μs
    – Fast synch across entire machine within a few μs
    – 1,024 gbps links to a global parallel file system
  • 38. Blue Gene Supercomputer contd.
    [Figure]
  • 39. Embedded vs GP Computing
    – Fixed functionality
    – Part of a larger system
    – Interact with environment
    – Real-time requirements
    – Power constraints
    – Environmental constraints
    – Performance cannot be increased simply by increasing clock frequency
  • 40. Cradle CT 3616 Architecture
    [Diagram]
  • 41. IBM Cell Architecture
    – Clock speed: > 4 GHz
    – Peak performance (single precision): > 256 GFlops
    – Peak performance (double precision): > 26 GFlops
    – SPU registers: 128 x 128b
    – Local storage size per SPU: 256KB
    – Area: 221 mm²
    – Technology: 90nm SOI
    – Total number of transistors: 234M
  • 42. Books
    1. D.A. Patterson, J.L. Hennessy, "Computer Architecture : A Quantitative Approach", Morgan Kaufmann Publishers, 2006.
    2. D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures : A Design Space Approach", Addison Wesley, 1997.
    3. M.J. Flynn, "Computer Architecture : Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
    4. K. Hwang, "Advanced Computer Architecture : Parallelism, Scalability, Programmability", McGraw Hill, 1993.
    5. H.G. Cragon, "Memory Systems and Pipelined Processors", Narosa Publishing House / Jones and Bartlett, 1998.
    6. D.E. Culler, J.P. Singh and Anoop Gupta, "Parallel Computer Architecture : A Hardware/Software Approach", Harcourt Asia / Morgan Kaufmann Publishers, 2000.