“Evolution of Computer Architecture” (“Evolución de la Arquitectura de Computadores”), Valladolid, September 2010. Prof. Mateo Valero, Director
  • Access latency for main memory, even using a modern SDRAM with a CAS latency of 2, will typically be around 9 cycles of the **memory system clock**. That figure is the sum of: the latency between the FSB and the chipset (Northbridge) (+/- 1 clock cycle); the latency between the chipset and the DRAM (+/- 1 clock cycle); the RAS-to-CAS latency (2-3 clocks, charging the right row); the CAS latency (2-3 clocks, getting the right column); 1 cycle to transfer the data; and the latency to get this data back from the DRAM output buffer to the CPU via the chipset (+/- 2 clock cycles). Assuming a typical 133 MHz SDRAM memory system (e.g. either PC133 or DDR266/PC2100) and a 1.3 GHz processor, the CPU clock runs roughly 10 times faster than the memory clock, so this makes 9*10 = 90 cycles of the CPU clock to access main memory! Yikes, you say! And it gets worse: a 1.6 GHz processor would take it to 108 cycles, a 2.0 GHz processor to 135 cycles, and even if the memory system were increased to 166 MHz (and still stayed CL2), a 3.0 GHz processor would wait a staggering 162 cycles! Caches make the memory system seem almost as fast as the L1 cache, yet as large as main memory. A modern primary (L1) cache has a latency of just two or three **processor cycles**, which is dozens of times faster than accessing main memory, and modern primary caches achieve hit rates of around 90% for most applications. So 90% of the time, accessing memory only takes a couple of cycles. Good overview: http://www.pattosoft.com.au/Articles/ModernMicroprocessors/
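The arithmetic in the note above can be reproduced with a short sketch. The function names are hypothetical, and the rounding of the clock ratio to a whole number is an assumption made to match the note's own figures (1300/133 is about 9.8, which the note treats as 10). The second function applies the standard average-access-time formula to the note's 90% hit rate.

```python
def cpu_cycles_per_memory_access(cpu_mhz, mem_mhz, mem_cycles=9):
    """CPU cycles spent waiting for one main-memory access.

    mem_cycles is the latency in memory-clock cycles (about 9 in the note).
    The CPU/memory clock ratio is rounded to a whole number, as the note does.
    """
    clock_ratio = round(cpu_mhz / mem_mhz)
    return mem_cycles * clock_ratio


def average_access_cycles(l1_cycles, hit_rate, miss_cycles):
    """Average access time: every access pays the L1 latency, and the
    fraction that misses also pays the main-memory latency."""
    return l1_cycles + (1.0 - hit_rate) * miss_cycles


# The four CPU/memory clock combinations mentioned in the note.
for cpu_mhz, mem_mhz in [(1300, 133), (1600, 133), (2000, 133), (3000, 166)]:
    cycles = cpu_cycles_per_memory_access(cpu_mhz, mem_mhz)
    print(f"{cpu_mhz} MHz CPU, {mem_mhz} MHz memory: {cycles} CPU cycles")

# A 3-cycle L1 with a 90% hit rate in front of a 90-cycle main memory.
print(f"average: {average_access_cycles(3, 0.90, 90):.1f} cycles")
```

Under these assumptions the average cost stays close to a dozen cycles rather than 90, which is the sense in which the cache hides most of the main-memory latency.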
  • It is the conclusion of this TTA that, in the very near future (in fact some early examples are clearly in evidence right now), virtual worlds will extend their reach well beyond their current subject matter of on-line fantasy gaming to incorporate all manner of business and commerce. This evolution will quickly encompass many industries and business processes where IBM has traditionally had significant business interests. In the education industry, it is not at all a stretch to imagine a university physics professor convening a kinematics lecture in a virtual world in which the professor could alter the force of gravity and move large, virtual objects to demonstrate environments on other planets. Closer to our industry, an IBM Industry Solution sales specialist could arrange to meet a client in a virtual world populated by highly realistic (virtual) world venues containing software solutions created by IBM and select business partners. In these virtual sales worlds, clients would interact with the solutions in the same manner as real world users, exploiting all the solution's functional capacities. For example, a virtual mobile work force solution could be demonstrated from multiple perspectives in the context of real business scenarios - the control center, the mobile vehicle etc. The solution demonstration would totally immerse the client in the solution experience, thereby creating an unparalleled selling tool. The possibilities are limitless.
    From top left, clockwise:
    (1) World of Warcraft: A Tavern. This is just a symbolic representation of commerce & advertising within games. Many people run their own businesses within virtual worlds, trading both virtual and real items for virtual and real currencies. Microsoft's acquisition of Massive Inc. has also now secured them a huge advertising ecosystem of game development companies, advertising agencies and leading brands, using online video games as another advertising channel for directed and personalized ads and product placement deals. The tavern represents the real-world metaphors that build community within virtual worlds, much like the 18th century coffee houses led to the formation of stock exchanges. Incidentally, there is a game advertising summit in San Francisco, June 9th 2006.
    (2) Hazmat Hot Zone: a project based at the Entertainment Technology Center at Carnegie Mellon University, one of the earliest serious game projects, which now has several scenarios up and running using Unreal Tournament based graphics and game play. Intended users: fire-department personnel who handle HazMat response. HazMat uses multiplayer gaming technology and augmented communication practices to assist with team-based training vital to HazMat and other disaster response practices.
    (3) Virtual Iraq: Not only is the army using virtual world simulations for the training of troops and engagement planning, but also for the treatment of Post Traumatic Stress Disorder (PTSD) through the ability to “relive” traumatic events through simulation. (http://www.washingtonpost.com/ac2/wp-dyn/A58360-2005Mar22?language=printer)
    (4) Simulation of forest fire disasters and how to combat them.
    (5) Virtual Acropolis: This is an example of using virtual environments as an educational and research tool for the humanities, in this case ancient history. Highly detailed models, created collaboratively by historians and researchers, are used to model world heritage sites for a variety of uses, including tourism, education, simulation of “what-if” scenarios, etc. Imagine teaching the history of a famous era or battle by immersing the student in a highly realistic simulation complete with architecture, artifacts and even populace of the period. These may also help the study of social history and sociological development and evolution via large scale community participation.
    (6) Food Force: From the United Nations World Food Program (WFP), Food Force is an educational video game telling the story of a hunger crisis on the fictitious island of Sheylan. Comprising 6 mini-games or “missions”, the game takes young players from an initial crisis assessment through to delivery and distribution of food aid, with each sequential mission addressing a particular aspect of this challenging process. (http://www.food-force.com/)
    (7) Yourself!Fitness: a complete fitness program on a disc - exercise, diet, motivation, and fitness tracking are all included. Your host is Maya, a dynamically generated digital personality who guides you through all aspects of the application. You need nothing more than an Xbox and a television set to partake. (http://www.yourselffitness.com/)
    (8) Pulse!! The virtual clinical learning lab and simulation, for training first responders in treatments and for medical and nursing students. (http://www.businessweek.com/innovate/content/apr2006/id20060410_051875.htm?chan=innovation_game+room_features)
    (10) Another picture of World of Warcraft: This is just to illustrate the breadth, diversity and scale of virtual environments. It is easy to take for granted that this huge architectural vista and the tavern above are both parts of a single virtual world, WoW, which challenges the rendering engine to deal with a broad spectrum of conditions. Why is this important? It means that the same middleware engine can be used for a broad variety of simulation environments and applications these days, rather than purpose-built or specialized simulations for specific scenarios, and that these environments are configurable through XML & scripting mechanisms.
    (centre) Google Earth: Now being offered as Enterprise Services for a variety of applications including real-estate, architecture & engineering, insurance, media. Google's provision of 3D modelling tools and an open repository for free is a significant step in making Google Earth a platform for application development, using it as a visualization engine and the MySpace of the future.
    NEED FOR STANDARDS: Multiple virtual worlds, interconnected & interdependent, independently operated. Open standard interfaces, to allow: avatar portability; property portability; security; metering, billing, separations, settlements; distributed problem determination; distributed systems management.
  • (Please note - this slide includes 2 animation steps) An exciting question to ask is: where is this research heading? In this slide you can see what is probably a familiar chart depicting the progress that has been made in supercomputing since the early 90s. (At each time point, the green line shows the 500th-fastest supercomputer, the dark blue line the fastest supercomputer, and the light blue line the summed power of the top 500 machines.) These lines show a nice trend, which we've extrapolated out 10 years. [ANIMATE SLIDE] The IBM team's latest simulation results fall here on the graph. These latest results represent a model about 4 and a half percent of the scale of the cerebral cortex, which was run at 1/83 of real time. The machine used provided 144 TB of memory and 0.5 PFlop/s. [ANIMATE SLIDE] Turning to the future, you can see that running human-scale cortical simulations will require 4 PB of memory, and running these simulations in real time will require over 1 EFlop/s. If the current trends in supercomputing continue, however, the IBM team believes they will have the ability to perform such simulations in the not too distant future.
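As a rough plausibility check on the note above, the sketch below scales the reported run (4.5% of cortex, 1/83 of real time, 144 TB, 0.5 PFlop/s) up to full scale and real time. It assumes that memory and compute grow linearly with model size and with speed-up, which is an illustrative simplification and not the IBM team's method; the function name is hypothetical.

```python
def scale_to_full_cortex(memory_tb, pflops, fraction_of_cortex, slowdown):
    """Naive linear extrapolation to a full-scale, real-time simulation.

    Assumption (not from the source): memory scales with model size only,
    compute scales with model size and with the required speed-up.
    Returns (memory in PB, compute in EFlop/s).
    """
    memory_pb = memory_tb / fraction_of_cortex / 1000.0
    eflops = pflops * slowdown / fraction_of_cortex / 1000.0
    return memory_pb, eflops


memory_pb, eflops = scale_to_full_cortex(
    memory_tb=144, pflops=0.5, fraction_of_cortex=0.045, slowdown=83
)
print(f"memory: {memory_pb:.1f} PB, compute: {eflops:.2f} EFlop/s")
```

The linear estimate lands in the same order of magnitude as the 4 PB and 1 EFlop/s quoted in the note, though somewhat below both, so the quoted figures presumably include factors this sketch ignores.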
  • Transcript of "Valladolid final-septiembre-2010"

    1. 1. “Evolution of Computer Architecture” (“Evolución de la Arquitectura de Computadores”), Valladolid, September 2010. Prof. Mateo Valero, Director
    2. 2. Technological Achievements
          • Transistor (Bell Labs, 1947)
              • DEC PDP-1 (1957)
              • IBM 7090 (1960)
          • Integrated circuit (1958)
              • IBM System 360 (1965)
              • DEC PDP-8 (1965)
          • Microprocessor (1971)
              • Intel 4004
    3. 3. Pipeline (H. Ford)
    4. 4. Technology Trends
    5. 7. Power Density (Watts/cm², logarithmic axis from 1 to 1000). [Figure: i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III and Pentium® 4 plotted against reference levels: hot plate, nuclear reactor, rocket nozzle, Sun's surface.] Source: “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
    6. 9. Technology Outlook (Shekhar Borkar, Micro37, P)
          High Volume Manufacturing:  2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
          Technology Node (nm):       90, 65, 45, 32, 22, 16, 11, 8
          Integration Capacity (BT):  2, 4, 8, 16, 32, 64, 128, 256
          Delay = CV/I scaling:       0.7, ~0.7, >0.7; delay scaling will slow down
          Energy/Logic Op scaling:    >0.35, >0.5, >0.5; energy scaling will slow down
          Bulk Planar CMOS:           High Probability, then Low Probability
          Alternate, 3G etc:          Low Probability, then High Probability
          Variability:                Medium, High, Very High
          ILD (K):                    ~3, <3; reduce slowly towards 2-2.5
          RC Delay:                   1, 1, 1, 1, 1, 1, 1, 1
          Metal Layers:               6-7, 7-8, 8-9; 0.5 to 1 layer per generation
    7. 10. We have seen an increasing number of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem: Intel processors > 100 Watts. We will not see the dramatic increases in clock speeds in the future. However, the number of gates on a chip will continue to increase.
           Past approach: increasing the number of gates into a tight knot and decreasing the cycle time of the processor (lower voltage; increase clock rate & transistor density).
           [Figure labels: Core, Cache; cores C1 to C4 with Cache, repeated across progressively larger multi-core dies.]
    8. 11. Increasing chip performance: Intel's Teraflop chip
           • 80 processors in a die of 300 square mm
           • Terabytes per second of memory bandwidth
           • Note: The Teraflops barrier was first broken by Intel in 1996 using about 10,000 Pentium Pro processors contained in more than 85 cabinets occupying 200 square meters
           • This will be possible in 3 years from now
           ICPP-2009, September 23rd 2009. Thanks to Intel
    9. 12. NVIDIA Fermi Architecture
           • Unified 768KB L2 cache serves all threads
           • GigaThread hardware scheduler assigns Thread Blocks to SMs
           • Wide DRAM interface provides 12 GB/s bandwidth
           • 16 Streaming Multiprocessors (512 cores) execute Thread Blocks
           • 620 Gigaflops
    10. 13. Cell Broadband Engine™: A Heterogeneous Multi-core Architecture. * Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.
    11. 14. Intel/UPC
            • Since 2002 (Roger Espasa, Toni Juan)
            • 40 People
            • Microprocessor Development (Larrabee x86 many core)
    12. 15. Top10
    13. 16. Looking at the Gordon Bell Prize
            • 1 GFlop/s; 1988; Cray Y-MP; 8 Processors
                • Static finite element analysis
            • 1 TFlop/s; 1998; Cray T3E; 1024 Processors
                • Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method.
            • 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 Processors
                • Superconductive materials
            • 1 EFlop/s; ~2018; ?; 1x10^7 Processors (10^9 threads)
            Jack Dongarra
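The milestones on this slide imply a steady growth rate, which the following sketch makes explicit. It assumes constant exponential growth between the 1988 and 2008 data points; the variable and function names are illustrative only.

```python
import math

# (year, sustained Flop/s) taken from the Gordon Bell Prize slide.
milestones = [(1988, 1e9), (1998, 1e12), (2008, 1e15)]


def annual_growth(start, end):
    """Constant yearly growth factor that links two (year, flops) points."""
    (year0, flops0), (year1, flops1) = start, end
    return (flops1 / flops0) ** (1.0 / (year1 - year0))


growth = annual_growth(milestones[0], milestones[-1])
doubling_years = math.log(2) / math.log(growth)

# Extrapolate the same trend ten more years, as the slide's last bullet does.
projected_2018 = milestones[-1][1] * growth ** (2018 - 2008)

print(f"growth per year: x{growth:.3f}")
print(f"doubling time: {doubling_years:.2f} years")
print(f"projected 2018: {projected_2018:.2e} Flop/s")
```

A factor of 1000 per decade works out to performance roughly doubling every year, and continuing that trend from 2008 is what puts 1 EFlop/s at about 2018.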
    14. 17. BSC-CNS and international initiatives: IESP. Build an international plan for developing the next generation open source software for scientific high-performance computing. Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment.
    15. 18. 1 EFlop/s “Clean Sheet of Paper” Strawman
            • 4 FPUs+RegFiles/Core (=6 GF @1.5GHz)
            • 1 Chip = 742 Cores (=4.5TF/s)
                • 213MB of L1I&D; 93MB of L2
            • 1 Node = 1 Proc Chip + 16 DRAMs (16GB)
            • 1 Group = 12 Nodes + 12 Routers (=54TF/s)
            • 1 Rack = 32 Groups (=1.7 PF/s)
                • 384 nodes / rack
            • 3.6EB of Disk Storage included
            • 1 System = 583 Racks (=1 EF/s)
                • 166 MILLION cores
                • 680 MILLION FPUs
                • 3.6PB = 0.0036 bytes/flops
                • 68 MW w/ aggressive assumptions
            Sizing done by “balancing” power budgets with achievable capabilities. Largely due to Bill Dally. Courtesy of Peter Kogge, UND
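The strawman's totals follow from multiplying up the hierarchy, which the sketch below reproduces from the per-level figures on the slide. It assumes one floating-point operation per FPU per cycle (consistent with the slide's 6 GF per core at 1.5 GHz) and decimal units; the constant names are illustrative.

```python
# Per-level figures taken from the strawman slide.
FPUS_PER_CORE = 4
CLOCK_GHZ = 1.5
CORES_PER_CHIP = 742
NODES_PER_GROUP = 12      # one processor chip per node
GROUPS_PER_RACK = 32
RACKS_PER_SYSTEM = 583
DRAM_GB_PER_NODE = 16

# Peak performance, assuming 1 flop per FPU per cycle.
core_gflops = FPUS_PER_CORE * CLOCK_GHZ
chip_tflops = CORES_PER_CHIP * core_gflops / 1e3
group_tflops = NODES_PER_GROUP * chip_tflops
rack_pflops = GROUPS_PER_RACK * group_tflops / 1e3
system_eflops = RACKS_PER_SYSTEM * rack_pflops / 1e3

# Component counts and memory.
nodes_per_rack = NODES_PER_GROUP * GROUPS_PER_RACK
total_nodes = nodes_per_rack * RACKS_PER_SYSTEM
total_cores = total_nodes * CORES_PER_CHIP
total_fpus = total_cores * FPUS_PER_CORE
memory_pb = total_nodes * DRAM_GB_PER_NODE / 1e6
bytes_per_flop = (memory_pb * 1e15) / (system_eflops * 1e18)

print(f"chip: {chip_tflops:.2f} TF/s, group: {group_tflops:.1f} TF/s")
print(f"rack: {rack_pflops:.2f} PF/s, system: {system_eflops:.3f} EF/s")
print(f"nodes/rack: {nodes_per_rack}, cores: {total_cores:,}, FPUs: {total_fpus:,}")
print(f"memory: {memory_pb:.2f} PB, bytes/flop: {bytes_per_flop:.4f}")
```

The products agree with the slide's rounded figures for performance, core count, and memory. The plain product for FPUs comes out near 664 million, slightly under the 680 million quoted, so that number may have been rounded differently at the source.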
    16. 19. Education for Parallel Programming Multicore-based pacifier I multi-core programming I many-core programming We all massive parallel prog. I games
    17. 20. Navigating the Mare Nostrum
    18. 21. Initial developments
            • Mechanical machines
            • 1854: Boolean algebra by G. Boole
            • 1904: Diode vacuum tube by J.A. Fleming
            • 1938: Boolean Algebra & Electronics Switches, C. Shannon
            • 1946: ENIAC by J.P. Eckert and J. Mauchly
            • 1945: Stored program by J. von Neumann
            • 1947: First transistor (Bell Labs)
            • 1949: EDSAC by M. Wilkes
            • 1952: UNIVAC I and IBM 701
    19. 22. In 50 Years ... ENIAC, Eckert & Mauchly, 1946 ... 18,000 vacuum tubes. Pentium III playing DVD, 1998 ... 24 M transistors
    20. 23. Technology Trends: Microprocessor Capacity. 2X transistors/chip every 1.5 years, called “Moore's Law”. Microprocessors have become smaller, denser, and more powerful. Not just processors: bandwidth, storage, etc. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
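The doubling rule on this slide can be expressed as a one-line growth formula. The sketch below is only an illustration of that rule with the slide's 1.5-year period as the default; the function name is hypothetical and no real chip data is used.

```python
def moore_factor(years, doubling_period_years=1.5):
    """Growth factor in transistors per chip after `years`, assuming one
    doubling every `doubling_period_years` (1.5 years on the slide)."""
    return 2.0 ** (years / doubling_period_years)


# Growth over a few time spans under the slide's doubling period.
for years in (1.5, 3, 15, 30):
    print(f"{years:>4} years: x{moore_factor(years):,.0f}")
```

Fifteen years corresponds to ten doublings, a factor of about a thousand, which is why small changes in the assumed doubling period lead to very different long-range projections.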
    21. 25. Computer Architecture Achievements
            • 1951: Microprogramming (M. Wilkes)
            • 1962: Virtual Memory (Atlas, Manchester)
            • 1964: Pipeline (CDC 6600, S. Cray, 10 Mflop/s)
            • 1965: Cache memory (M. Wilkes)
            • 1975: Vector processors (S. Cray)
            • 1980: RISC architecture (IBM, Berkeley, Stanford)
            • 1982: Multiprocessors with distributed memory
            • 1990: Superscalar processors: PA-RISC (HP) and RS-6000 (IBM)
            • 1991: Multiprocessors with distributed shared memory
            • 1994: SMT (M. Nemirovsky, D. Tullsen, S. Eggers)
            • 1994: Speculative Multiprocessors (G. Sohi, Wisconsin)
            • 1996: Value Prediction (J.P. Shen and M. Lipasti, CMU)
            • 2000: Multicore/Manycore Architectures
    22. 27. Virtual Worlds have huge potential beyond Games
            • Commerce & Advertising
            • Corporate
            • Education
            • First Responders
            • Government
            • Health
            • Military
            • Science
            • Community Facilitation
            • Social Change
    23. 28. Jaguar @ ORNL: 1.75 PF/s
            • Cray XT5-HE system
            • Over 37,500 quad-core AMD Opteron processors running at 2.6 GHz, 224,162 cores
            • Power: 6.95 MW
            • 300 terabytes of memory
            • 10 petabytes of disk space
            • 240 gigabytes per second disk bandwidth
            • Cray's SeaStar2+ interconnect network
            Jack Dongarra
    24. 29. MareIncognito: Project structure. 4 relevant apps: Materials: SIESTA; Geophysics imaging: RTM; Comp. Mechanics: ALYA; Plasma: EUTERPE. General kernels Automatic analysis Coarse/fine grain prediction Sampling Clustering Integration with Peekperf Contention, Collectives Overlap computation/communication Slimmed Networks Direct versus indirect networks Contribution to new Cell design Support for programming model Support for load balancing Support for performance tools Issues for future processors Coordinated scheduling: Run time, Process, Job Power efficiency StarSs: CellSs, SMPSs [email_address] OpenMP++ MPI + OpenMP/StarSs Performance analysis tools Processor and node Load balancing Interconnect Applications Programming models Models and prototype
    25. 30. BSC-CNS: backbone of supercomputing research in Spain
            • Supercomputing and eScience
                • 22 elite groups
                • More than 120 senior researchers
                • More than 300 PhD students
            Application scopes: “Earth Sciences”, “Astrophysics”, “Engineering”, “Physics”, “Life Sciences”
            Compilers and tuning of application kernels. Programming models and performance tuning tools. Architectures and hardware technologies
    26. 31. High Performance Computing as key-enabler
            [Figure: available computational capacity in Flop/s, marked at 1 Giga (10^9), 1 Tera (10^12), 1 Peta (10^15), 1 Exa (10^18) and 1 Zetta (10^21), over the years 1980 to 2030; second axis: capacity as number of overnight loads cases run, 10^2 to 10^6 (x10^6). Labels for the capability achieved during one night batch: CFD-based LOADS & HQ, Aero Optimisation & CFD-CSM, Full MDO, Real time CFD based in flight simulation, LES, CFD-based noise simulation, RANS Low Speed, RANS High Speed, HS Design Data Set, Unsteady RANS.]
            • “Smart” use of HPC power:
            • Algorithms
            • Data mining
            • knowledge
            Courtesy AIRBUS France
    27. 32. Design of the ITER tokamak (JET, Oxford)
    28. 33. Supercomputing, theory and experimentation. Courtesy of IBM
    29. 34. Weather, Climate and Earth Sciences: Roadmap
            • 2009
                • Resolution: 80 km
                • Memory: ≈ 110 GB
                • Storage: ≈ 8 TB
                • NEC-SX9, 48 vector procs: ≈ 40 days run
            • 2015
                • Resolution: 20 km
                • Memory: ≈ 3.5 TB
                • Storage: ≈ 180 TB
                • High resolution model with complete carbon cycle model
                • Challenges: data viz and post-processing, data discovery, archiving
            • 2020
                • Resolution: 1 km
                • Memory: ≈ 4 PB
                • Storage: ≈ 150 PB
                • Higher resolution with global cloud resolving model
                • Challenges: data sharing, transfer memory management, I/O management
    30. 35. Education for Parallel Programming Multicore-based pacifier I multi-core programming I many-core programming We all massive parallel prog. I games
    31. 36. Navigating the Mare Nostrum