02 evolution zhu


  1. 1. Computer Organization and Architecture: Computer Evolution and Performance. Yanmin Zhu (yzhu@cs.sjtu.edu.cn). Yanmin Zhu @ SJTU 1
  2. 2. The First Computer? • Abacus 算盘 • Invented by the ancient Chinese • Dates to the 2nd century BC. Yanmin Zhu @ SJTU 2
  3. 3. ENIAC - background • Electronic Numerical Integrator And Computer • Eckert and Mauchly • University of Pennsylvania • Trajectory tables for weapons • Started 1943 • Finished 1946 —Too late for the war effort • Used until 1955. [Photo: John Mauchly (1907-1980) & J. Presper Eckert (1919-1995)] Yanmin Zhu @ SJTU 3
  4. 4. ENIAC - details • Decimal (not binary) • 20 accumulators of 10 digits • Programmed manually by switches • 18,000 vacuum tubes • 30 tons • 15,000 square feet • 140 kW power consumption • 5,000 additions per secondYanmin Zhu @ SJTU 4
  5. 5. von Neumann/Turing • Stored Program concept • Main memory storing programs and data • ALU operating on binary data • Control unit interpreting instructions from memory and executing them • Input and output equipment operated by control unit • Princeton Institute for Advanced Studies —IAS computer • Completed 1952. [Photos: John von Neumann (1903-1957), "father of computers"; Alan Turing (1912-1954); the IAS computer] Yanmin Zhu @ SJTU 5
  6. 6. Structure of von Neumann machineYanmin Zhu @ SJTU 6
  7. 7. IAS - details • 1000 x 40 bit words —Each word holds a binary number —or 2 x 20 bit instructions. Yanmin Zhu @ SJTU 7
  8. 8. Structure of IAS – detail • Memory Buffer Register • Memory Address Register • Instruction Register • Instruction Buffer Register • Program Counter • Accumulator • Multiplier Quotient
  9. 9. Commercial Computers• 1947 - Eckert-Mauchly Computer Corporation• UNIVAC I (Universal Automatic Computer)• US Bureau of Census —1950 calculations• Became part of Sperry-Rand Corporation• Late 1950s - UNIVAC II —Faster —More memoryYanmin Zhu @ SJTU 9
  10. 10. IBM • Punched-card processing equipment • 1953 - the 701 —IBM’s first stored program computer —Scientific calculations • 1955 - the 702 —Business applications • Led to 700/7000 series. [Photo: IBM 701 at Douglas Aircraft] Yanmin Zhu @ SJTU 10
  11. 11. Transistors – the 2nd generation• Replaced vacuum tubes• Smaller• Cheaper• Less heat dissipation• Solid State device• Made from Silicon (Sand)• Invented 1947 at Bell Labs• William Shockley et al.Yanmin Zhu @ SJTU 11
  12. 12. Transistor Based Computers • Second generation machines • NCR & RCA produced small transistor machines • IBM —Produced IBM 7000 • DEC - 1957 —Produced PDP-1. [Photos: IBM 7000; PDP-1] Yanmin Zhu @ SJTU 12
  13. 13. Microelectronics – 3rd Generation • Literally - "small electronics" • A computer is made up of gates, memory cells and interconnections • These can be manufactured on a semiconductor —e.g. silicon wafer. Yanmin Zhu @ SJTU 13
  14. 14. Manufacturing ICs • Yield: proportion of working dies per waferYanmin Zhu @ SJTU 14
  15. 15. AMD Opteron X2 Wafer • X2: 300mm wafer, 117 chips, 90nm technologyYanmin Zhu @ SJTU 15
  16. 16. Generations of Computer• Vacuum tube - 1946-1957• Transistor - 1958-1964• Small scale integration - 1965 on —Up to 100 devices on a chip• Medium scale integration - to 1971 —100-3,000 devices on a chip• Large scale integration - 1971-1977 —3,000 - 100,000 devices on a chip• Very large scale integration - 1978 -1991 —100,000 - 100,000,000 devices on a chip• Ultra large scale integration – 1991 - —Over 100,000,000 devices on a chipYanmin Zhu @ SJTU 16
  17. 17. Generations of ComputerYanmin Zhu @ SJTU 17
  18. 18. Moore’s Law • Gordon Moore – co-founder of Intel • Observed in 1965 that the number of transistors on a chip would double every year • Increased density of components on chip • Since the 1970s development has slowed a little — Number of transistors now doubles every 18 months • Cost of a chip has remained almost unchanged — That means? Yanmin Zhu @ SJTU 18
  19. 19. Growth in CPU Transistor Count
  20. 20. Consequences of Moore’s Law• Higher packing density means shorter electrical paths, giving higher performance• Smaller size gives increased flexibility• Reduced power and cooling requirements• Fewer interconnections increases reliabilityYanmin Zhu @ SJTU 20
  21. 21. IBM 360 series • 1964 • Replaced (& not compatible with) 7000 series • First planned "family" of computers — Similar or identical instruction sets — Similar or identical O/S — Increasing speed — Increasing number of I/O ports (i.e. more terminals) — Increased memory size — Increased cost • Multiplexed switch structure. This series offered a standard interface and a wide choice of standard peripheral devices; the standard interface made it easy to change the peripherals when the customer's needs changed. Yanmin Zhu @ SJTU 21
  22. 22. Example members of System/360Yanmin Zhu @ SJTU 22
  23. 23. DEC PDP-8• 1964• First minicomputer (after miniskirt!) —Did not need air conditioned room —Small enough to sit on a lab bench• $16,000 —$100k+ for IBM 360• Embedded applications & OEM• BUS STRUCTUREYanmin Zhu @ SJTU 23
  24. 24. Spectrum of Computers • Supercomputer • Mainframe • Minicomputer • Server • Desktop • Laptop. Yanmin Zhu @ SJTU 24
  25. 25. Central Control vs Bus Multiplexed switch structure BUS STRUCTUREYanmin Zhu @ SJTU 25
  26. 26. Intel Later generations• 1971 - 4004 —First microprocessor (All CPU components on a single chip) —4 bit• Followed in 1972 by 8008 —8 bit —Both designed for specific applications• 1974 - 8080 —Intel’s first general purpose microprocessorYanmin Zhu @ SJTU 26
  27. 27. Intel processors in 1970s and 1980sYanmin Zhu @ SJTU 27
  28. 28. Intel processors in 1990s and beyondYanmin Zhu @ SJTU 28
  29. 29. Performance Gap between Processor & Memory • Processor speed increased • Memory capacity increased • Memory speed lags behind processor speed. What is the frequency of today’s memory? Yanmin Zhu @ SJTU 29
  30. 30. Solutions • Change DRAM interface —Cache • Reduce frequency of memory access —More complex cache and cache on chip • Increase number of bits retrieved at one time —Make DRAM "wider" rather than "deeper" • Increase interconnection bandwidth —High speed buses —Hierarchy of buses. Yanmin Zhu @ SJTU 30
  31. 31. I/O Devices• Peripherals with intensive I/O demands —Large data throughput demandsYanmin Zhu @ SJTU 31
  32. 32. Problem and Solutions• Problem moving data between processor and peripherals• Solutions: —Caching —Buffering —Higher-speed interconnection buses —More elaborate bus structures —Multiple-processor configurationsYanmin Zhu @ SJTU 32
  33. 33. Key is Balance• Processor components• Main memory• I/O devices• Interconnection structures Balance the throughput and processing demands of processor, memory, IO, etc.Yanmin Zhu @ SJTU 33
  34. 34. Speeding Microprocessor Up • Clock rate • Logic density • Pipelining • On board L1 & L2 cache • Branch prediction • Data flow analysis • Speculative execution. How to make faster processors? Yanmin Zhu @ SJTU 34
  35. 35. Increase hardware speed of processor • Fundamentally due to shrinking logic gate size —More gates, packed more tightly, and increasing clock rate —Propagation time for signals reduced. Yanmin Zhu @ SJTU 35
  36. 36. Problem with Clock Speed• Power —Power density increases with density of logic and clock speed —Dissipating heat —Some fundamental physical limits are being reachedYanmin Zhu @ SJTU 36
  37. 37. Power Trends • In CMOS IC technology: Power ∝ Capacitive load × Voltage² × Frequency • Supply voltage has dropped 5V → 1V (roughly ×30 reduction via the V² term) while frequency has grown ×1000. Yanmin Zhu @ SJTU 37
  38. 38. Reducing Power • Suppose a new CPU has —85% of capacitive load of old CPU —15% voltage and 15% frequency reduction • P_new / P_old = (0.85·C_old × (0.85·V_old)² × 0.85·F_old) / (C_old × V_old² × F_old) = 0.85⁴ ≈ 0.52 • The power wall —We can’t reduce voltage further —We can’t remove more heat —How else can we improve performance? Yanmin Zhu @ SJTU 38
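The slide's power-wall arithmetic follows from the CMOS dynamic-power relation P ∝ C·V²·F. A minimal sketch (the helper name is ours; the scaling factors are the slide's):

```python
def dynamic_power_ratio(cap_scale, volt_scale, freq_scale):
    """Ratio P_new / P_old when capacitance, voltage and frequency
    are each scaled, using P proportional to C * V^2 * F."""
    return cap_scale * volt_scale ** 2 * freq_scale

# New CPU: 85% of the capacitive load, 15% lower voltage and frequency
ratio = dynamic_power_ratio(0.85, 0.85, 0.85)  # equals 0.85**4
print(round(ratio, 2))  # 0.52
```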
  39. 39. Problems with Logic Density• RC delay —Speed at which electrons flow limited by resistance and capacitance of metal wires connecting them —Delay increases as RC product increases —Wire interconnects thinner, increasing resistance —Wires closer together, increasing capacitanceYanmin Zhu @ SJTU 39
  40. 40. Solution? • Increase clock rate → Power Wall • Increase density → RC Delay • Hence more emphasis on Organizational and Architectural Approaches. Yanmin Zhu @ SJTU 40
  41. 41. Techniques developed. [Figure: Intel Microprocessor History] Yanmin Zhu @ SJTU 41
  42. 42. Increased Cache Capacity• Typically two or three levels of cache between processor and main memory• Chip density increased —More cache memory on chip – Faster cache access• Pentium chip devoted about 10% of chip area to cache• Pentium 4 devotes about 50%Yanmin Zhu @ SJTU 42
  43. 43. More Complex Execution Logic• Enable parallel execution of instructions —Pipelining —Superscalar• Pipeline works like assembly line Example —Different stages of execution of in your different instructions at same time daily life? along pipeline• Superscalar allows multiple pipelines within single processor —Instructions that do not depend on one another can be executed in parallelYanmin Zhu @ SJTU 43
  44. 44. Diminishing Returns• Internal organization of processors complex —Can get a great deal of parallelism —Further significant increases likely to be relatively modest• Benefits from cache are reaching limitYanmin Zhu @ SJTU 44
  45. 45. Uniprocessor Performance Constrained by power, instruction-level parallelism, memory latency
  46. 46. New Approach – Multiple Cores• Multiple processors on single chip — Large shared cache• Within a processor, increase in performance proportional to square root of increase in complexity• If software can use multiple processors, doubling number of processors almost doubles performance• So, use two simpler processors on the chip rather than one more complex processorYanmin Zhu @ SJTU 46
  47. 47. Multicore Challenges• Requires explicitly parallel programming —Compare with instruction level parallelism – Hardware executes multiple instructions at once –Hidden from the programmer —Hard to do – Programming for performance – Load balancing – Optimizing communication and synchronizationYanmin Zhu @ SJTU 47
  48. 48. CISC vs. RISC • CISC: Complex Instruction Set Computer (e.g. Intel x86) —Kill two each time • RISC: Reduced Instruction Set Computer (e.g. ARM) —Kill one each time. Yanmin Zhu @ SJTU 48
  49. 49. x86 Evolution (1)• 8080 — first general purpose microprocessor — 8 bit data path — Used in first personal computer – Altair• 8086 – 5MHz – 29,000 transistors — much more powerful — 16 bit — instruction cache, prefetch few instructions — 8088 (8 bit external bus) used in first IBM PC• 80286 — 16 Mbyte memory addressable — up from 1Mb• 80386 — 32 bit — Support for multitasking• 80486 — sophisticated powerful cache and instruction pipelining — built in maths co-processorYanmin Zhu @ SJTU 49
  50. 50. x86 Evolution (2)• Pentium — Superscalar — Multiple instructions executed in parallel• Pentium Pro — Increased superscalar organization — Aggressive register renaming — branch prediction — data flow analysis — speculative execution• Pentium II — MMX technology — graphics, video & audio processing• Pentium III — Additional floating point instructions for 3D graphicsYanmin Zhu @ SJTU 50
  51. 51. x86 Evolution (3)• Pentium 4 — Note Arabic rather than Roman numerals — Further floating point and multimedia enhancements• Core — First x86 with dual core• Core 2 — 64 bit architecture• Core 2 Quad – 3GHz – 820 million transistors — Four processors on chipYanmin Zhu @ SJTU 51
  52. 52. x86 Evolution (4)• x86 architecture dominant outside embedded systems• Organization and technology changed dramatically• Instruction set architecture evolved with backwards compatibility• ~1 instruction per month added• 500 instructions availableYanmin Zhu @ SJTU 52
  53. 53. Embedded Systems & ARM• ARM evolved from RISC design• Used mainly in embedded systems —Used within product —Not general purpose computer —Dedicated function —E.g. Anti-lock brakes in car (ABS)Yanmin Zhu @ SJTU 53
  54. 54. Embedded System - Definition: A combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In many cases, embedded systems are part of a larger system or product, as in the case of an antilock braking system in a car. Yanmin Zhu @ SJTU 54
  55. 55. Embedded Systems Requirements • Different sizes —Different constraints, optimization, reuse • Different requirements —Safety, reliability, real-time, flexibility, legislation —Lifespan —Environmental conditions —Static v dynamic loads —Slow to fast speeds —Computation v I/O intensive —Discrete event v continuous dynamics. Yanmin Zhu @ SJTU 55
  56. 56. Possible Organization of an Embedded System. Yanmin Zhu @ SJTU 56
  57. 57. ARM Evolution• Designed by ARM Inc., Cambridge, England• Licensed to manufacturers• High speed, small die, low power consumption• PDAs, hand held games, phones —E.g. iPod, iPhone, HTC• Acorn produced ARM1 & ARM2 in 1985 and ARM3 in 1989• Acorn, VLSI and Apple Computer founded ARM Ltd.Yanmin Zhu @ SJTU 57
  58. 58. ARM Systems Categories• Embedded real time• Application platform —Linux, Palm OS, Symbian OS, Windows mobile• Secure applicationsYanmin Zhu @ SJTU 58
  59. 59. Computer Assessment• Performance• Cost• Size• Security• Reliability• Power consumption• Cool lookYanmin Zhu @ SJTU 59
  60. 60. Defining Performance • Which airplane has the best performance? [Four charts compare the Boeing 777, Boeing 747, BAC/Sud Concorde and Douglas DC-8-50 on Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers x mph] Yanmin Zhu @ SJTU 60
  61. 61. Response Time and Throughput• Response time —How long it takes to do a task• Throughput —Total work done per unit time – e.g., tasks/transactions/… per hour• How are response time and throughput affected by —Replacing the processor with a faster version? —Adding more processors?Yanmin Zhu @ SJTU 61
  62. 62. Relative Performance – response time • Define Performance = 1/Execution Time • "X is n times faster than Y": Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n • Example: time taken to run a program —10s on A, 15s on B —Execution Time_B / Execution Time_A = 15s / 10s = 1.5 —So A is 1.5 times faster than B. Yanmin Zhu @ SJTU 62
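The relative-performance definition can be checked against the slide's 10 s / 15 s example with a short sketch (the function name is ours, not from the slide):

```python
def times_faster(time_slow, time_fast):
    """'X is n times faster than Y' where n = Execution Time_Y / Execution Time_X."""
    return time_slow / time_fast

# Program takes 10 s on A and 15 s on B
n = times_faster(15.0, 10.0)
print(n)  # 1.5: A is 1.5 times faster than B
```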
  63. 63. Measuring Execution Time• Elapsed time —Total response time, including all aspects – Processing, I/O, OS overhead, idle time —Determines system performance• CPU time —Time spent processing a given job – Discounts I/O time, other jobs’ shares —Comprises user CPU time and system CPU time —Different programs are affected differently by CPU and system performanceYanmin Zhu @ SJTU 63
  64. 64. CPU Clocking • Operation of digital hardware governed by a constant-rate clock (clock period; data transfer and computation; update state) • Clock period: duration of a clock cycle —e.g., 250ps = 0.25ns = 250×10⁻¹²s • Clock frequency (rate): cycles per second —e.g., 4.0GHz = 4000MHz = 4.0×10⁹Hz. Yanmin Zhu @ SJTU 64
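Clock rate and clock period are reciprocals, as the slide's 4.0 GHz / 250 ps pairing shows; a one-line sketch (helper name is ours):

```python
def period_s(rate_hz):
    """Clock period in seconds for a given clock rate in Hz."""
    return 1.0 / rate_hz

print(period_s(4.0e9))  # 2.5e-10 s, i.e. 250 ps for a 4.0 GHz clock
```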
  65. 65. System Clock System clock cannot be arbitrarily fast • Signals in CPU take time to settle down to 1 or 0 • Signals may change at different speeds • Operations need to be synchronisedYanmin Zhu @ SJTU 65
  66. 66. CPU Time • CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate • Performance improved by —Reducing number of clock cycles —Increasing clock rate —Hardware designer must often trade off clock rate against cycle count. Yanmin Zhu @ SJTU 66
  67. 67. CPU Time Example • Computer A: 2GHz clock, 10s CPU time • Designing Computer B — Aim for 6s CPU time — Can do faster clock, but causes 1.2 × clock cycles • How fast must Computer B clock be? — Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10⁹ — Clock Rate_B = Clock Cycles_B / CPU Time_B = 1.2 × 20×10⁹ / 6s = 24×10⁹ / 6s = 4GHz. Yanmin Zhu @ SJTU 67
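The worked example above can be replayed step by step (all values come from the slide; the variable names are ours):

```python
rate_a = 2.0e9              # Computer A: 2 GHz clock
time_a = 10.0               # A's CPU time: 10 s
cycles_a = time_a * rate_a  # 20e9 clock cycles
cycles_b = 1.2 * cycles_a   # B's faster clock costs 1.2x the cycles
time_b = 6.0                # design target for B: 6 s
rate_b = cycles_b / time_b  # required clock rate for B
print(rate_b / 1e9)         # 4.0, i.e. B must run at 4 GHz
```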
  68. 68. Instruction Count and CPI • CPI = consumer price index?? Here: Cycles Per Instruction • Clock Cycles = Instruction Count × Cycles per Instruction • CPU Time = Instruction Count × CPI × Clock Cycle Time = Instruction Count × CPI / Clock Rate • Instruction Count for a program —Determined by program, ISA and compiler • Average cycles per instruction —Determined by CPU hardware —Different instructions have different CPI —Average CPI affected by instruction mix. Yanmin Zhu @ SJTU 68
  69. 69. CPI Example • Computer A: Cycle Time = 250ps, CPI = 2.0 • Computer B: Cycle Time = 500ps, CPI = 1.2 • Same ISA • Which is faster, and by how much? — CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = I × 500ps (A is faster…) — CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = I × 600ps — CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2 (…by this much). Yanmin Zhu @ SJTU 69
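A short sketch of the same comparison, using the CPU Time = IC × CPI × Cycle Time relation (the helper name and the shared instruction count are ours; since the ISA is the same, any common I gives the same ratio):

```python
def cpu_time(instr_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instr_count * cpi * cycle_time_s

I = 1.0e9                          # any shared instruction count (same ISA)
t_a = cpu_time(I, 2.0, 250e-12)    # A: I x 500 ps
t_b = cpu_time(I, 1.2, 500e-12)    # B: I x 600 ps
print(t_b / t_a)                   # ~1.2: A is 1.2 times faster
```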
  70. 70. CPI in More Detail • If different instruction classes take different numbers of cycles: Clock Cycles = Σᵢ (CPIᵢ × Instruction Countᵢ) • Weighted average CPI = Clock Cycles / Instruction Count = Σᵢ CPIᵢ × (Instruction Countᵢ / Instruction Count), where Instruction Countᵢ / Instruction Count is the relative frequency of class i. Yanmin Zhu @ SJTU 70
  71. 71. CPI Example • Alternative compiled code sequences using instructions in classes A, B, C — CPI for class: A=1, B=2, C=3 — IC in sequence 1: A=2, B=1, C=2 — IC in sequence 2: A=4, B=1, C=1 • Sequence 1: IC = 5, Clock Cycles = 2×1 + 1×2 + 2×3 = 10, Avg. CPI = 10/5 = 2.0 • Sequence 2: IC = 6, Clock Cycles = 4×1 + 1×2 + 1×3 = 9, Avg. CPI = 9/6 = 1.5. Yanmin Zhu @ SJTU 71
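The weighted-average CPI of the two sequences can be computed directly from the class CPIs and per-class instruction counts (helper name is ours):

```python
def cycles_and_avg_cpi(class_cpi, instr_counts):
    """Total clock cycles and weighted-average CPI for one code sequence."""
    cycles = sum(c * n for c, n in zip(class_cpi, instr_counts))
    return cycles, cycles / sum(instr_counts)

cpi = [1, 2, 3]                            # classes A, B, C
print(cycles_and_avg_cpi(cpi, [2, 1, 2]))  # sequence 1: (10, 2.0)
print(cycles_and_avg_cpi(cpi, [4, 1, 1]))  # sequence 2: (9, 1.5)
```

Note that sequence 2 executes more instructions (6 vs 5) yet takes fewer cycles, which is why average CPI alone is not a performance measure.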
  72. 72. CPU Time Summary • The BIG Picture: CPU Time = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle • Performance depends on —Algorithm: affects IC, possibly CPI —Programming language: affects IC, CPI —Compiler: affects IC, CPI —Instruction set architecture: affects IC, CPI, Tc. Yanmin Zhu @ SJTU 72
  73. 73. Instruction Execution Rate • Instructions executed per second! • MIPS: Millions of instructions per second • MFLOPS: Millions of floating point operations per second • Heavily dependent on —instruction set —compiler design —processor implementation —cache & memory hierarchy. Yanmin Zhu @ SJTU 73
  74. 74. Benchmarks• Benchmark: Programs designed to test performance — Written in high level language: Portable — Represents style of task: Systems, numerical, commercial — Easily measured — Widely distributed• E.g. System Performance Evaluation Corporation (SPEC) — CPU2006 for computation bound – 17 floating point programs in C, C++, Fortran – 12 integer programs in C, C++ – 3 million lines of code — Speed and rate metrics – Single task and throughputYanmin Zhu @ SJTU 74
  75. 75. SPEC Speed Metric - Single task • Base runtime defined for each benchmark using reference machine • Results are reported as the ratio of reference time to system run time: rᵢ = Tref_i / Tsut_i — Tref_i: execution time for benchmark i on reference machine — Tsut_i: execution time of benchmark i on test system • Overall performance calculated by averaging ratios for all 12 integer benchmarks — Use geometric mean: appropriate for normalized numbers such as ratios. Yanmin Zhu @ SJTU 75
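The per-benchmark ratios and their geometric mean can be sketched as below. This is an illustration of the metric's shape only: the function name is ours and the benchmark times are made-up placeholders, not real SPEC reference numbers.

```python
import math

def spec_metric(ref_times, sut_times):
    """Geometric mean of per-benchmark ratios r_i = Tref_i / Tsut_i."""
    ratios = [r / s for r, s in zip(ref_times, sut_times)]
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical reference and system-under-test times for three benchmarks
print(spec_metric([9650.0, 9120.0, 6250.0], [500.0, 1200.0, 1000.0]))
```

The geometric mean is used because it is insensitive to which machine is chosen as the reference: normalizing all times to a different machine rescales every ratio by the same factor.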
  76. 76. Amdahl’s Law• Gene Amdahl [AMDA67]• Potential speed up of program using multiple processors• Concluded that: —Code needs to be parallelizable —Speed up is bound, giving diminishing returns for more processorsYanmin Zhu @ SJTU 76
  77. 77. Amdahl’s Law Formula • For program running on single processor — Fraction f of code infinitely parallelizable with no scheduling overhead — Fraction (1-f) of code inherently serial — T is total execution time for program on single processor — N is number of processors that fully exploit parallel portions of code • Speedup = T / ((1-f)·T + f·T/N) = 1 / ((1-f) + f/N). Yanmin Zhu @ SJTU 77
  78. 78. Amdahl’s Law Formula. [Plot: speedup vs. f, curve labeled f/(1-f)] • f small: parallel processors have little effect • N → ∞: speedup bound by 1/(1-f) —Diminishing returns for using more processors. Yanmin Zhu @ SJTU 78
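The diminishing-returns behavior follows directly from the speedup formula 1/((1-f) + f/N); a minimal sketch (function name and sample f, N values are ours):

```python
def amdahl_speedup(f, n):
    """Speedup when fraction f of the work runs in parallel on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

print(amdahl_speedup(0.9, 10))     # ~5.26, not 10, even with 90% parallel code
print(amdahl_speedup(0.9, 10**9))  # approaches the 1/(1-f) = 10 bound
```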
  79. 79. Homework #1• Page 45-47• 2.4, 2.9, 2.10, 2.13, 2.14• Due: class time next Friday (18 March)• Writing on paper/exercise book (student no and name).• Policy: No late submission is allowedYanmin Zhu @ SJTU 79
