Exaflop In 2018 Hardware

1. Exascale Computer in 2018? A Hardware View (jkwu@cs.hku.hk)
2. • Hardware Evolution
   • Exascale Challenges
   • My View
   • Industrial & international movements
3. Hardware Evolution
   • Processor/Node Architecture: multi-core -> many-core
   • Acceleration/Coprocessors
     - SIMD units (GPGPUs)
     - FPGAs (field-programmable gate arrays)
   • Memory/I/O considerations
   • Interconnection
4. Processor/Node Architecture
   • Intel Xeon E5-2600 processor: Sandy Bridge microarchitecture
   • Released: March 2012
     - Up to 8 cores (16 threads), up to 3.8 GHz (turbo boost)
     - DDR3-1600 memory at 51 GB/s
     - 64 KB L1 (3 cycles), 256 KB L2 (8 cycles), 20 MB L3
     - Core-memory: ring-topology interconnect
     - CPU-CPU: QPI interconnect
5. Processor/Node Architecture
   • Intel Knights Corner: Many Integrated Cores (MIC)
   • Early version: Nov 2011
     - Over 50 cores, each operating at 1.2 GHz, with 512-bit vector processing units, 8 MB of cache, and 4 threads per core
     - 1 TFLOPS
     - It can be coupled with up to 2 GB of GDDR5 memory. The chip uses the Sandy Bridge architecture and is manufactured using a 22 nm process with 3D tri-gate transistors.
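A rough sanity check of the 1 TFLOPS figure, as a minimal sketch only: it assumes 50 cores at 1.2 GHz, with each 512-bit vector unit providing 8 double-precision lanes and a fused multiply-add (2 flops) per lane per cycle; none of these per-cycle details are stated on the slide.

    # Hedged peak check for Knights Corner (assumed: 50 cores, 1.2 GHz,
    # 8 DP lanes per 512-bit vector unit, 2 flops per lane via FMA)
    cores, clock_hz, flops_per_cycle = 50, 1.2e9, 8 * 2
    print(cores * clock_hz * flops_per_cycle / 1e12, "TFLOPS")  # ~0.96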
6. Processor/Node Architecture
   • AMD Opteron 6200 processor: Bulldozer core
   • Released: Nov 2011
     - 4 cores, up to 3.3 GHz
     - 128 KB L1 (4), 1024 KB L2 (4), 16 MB L3
     - 115 W
7. Processor/Node Architecture
   • AMD Llano APU A8-3870K: Fusion
   • Released: Dec 2011
     - 4 x86 cores (Stars architecture), 1 MB L2 per core
     - GPU on chip with 480 stream processors
8. Processor/Node Architecture
   • IBM Power 7: Power Architecture, multi-core
   • Released: Feb 2010
     - 8 cores up to 4.25 GHz, 32 threads
     - 32 KB L1 (2 cycles), 256 KB L2 (8 cycles), 32 MB L3 (embedded DRAM)
     - 100 GB/s of memory bandwidth
9. Coprocessor/GPU Architecture
   • NVIDIA Fermi (GeForce 590)/Kepler/Maxwell
   • Released: March 2011
     - 16 streaming multiprocessors (SMs), each with 32 stream processors (512 CUDA cores)
     - 48 KB/SM memory (true cache hierarchy + on-chip shared RAM), 768 KB L2
     - 772 MHz core, 3 GB GDDR5 at 288 GB/s
     - 1.6 TFLOP peak
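A hedged check of the 1.6 TFLOP single-precision peak, assuming (as on Fermi-class parts, though not stated on the slide) that the CUDA cores run at twice the 772 MHz core clock and issue one fused multiply-add (2 flops) per cycle:

    # Assumed: 512 CUDA cores, shader clock = 2 x 772 MHz, 2 flops/cycle (FMA)
    cuda_cores, shader_clock_hz, flops_per_cycle = 512, 2 * 772e6, 2
    print(cuda_cores * shader_clock_hz * flops_per_cycle / 1e12, "TFLOPS")  # ~1.58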
10. Coprocessor/FPGA Architecture
   • Xilinx/Altera/Lattice Semiconductor FPGAs typically interface to PCI/PCIe buses and can accelerate compute-intensive applications by orders of magnitude.
11. Petascale Parallel Architectures: Blue Waters
12. Petascale Parallel Architectures: XT6
13. Current Petascale Parallel Platforms
14. Heterogeneous Platforms: Tianhe-1A
   • 14,336 Intel Xeon X5670 processors and 7,168 Nvidia Tesla M2050 general-purpose GPUs
   • Theoretical peak performance of 4.701 PFLOPS
   • 2 PB disk and 262 TB RAM
   • The Arch interconnect links the server nodes together using optical-electric cables in a hybrid fat-tree configuration
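A rough decomposition of the 4.701 PFLOPS peak, assuming commonly quoted per-part double-precision peaks (~70.4 GFLOPS per Xeon X5670, ~515 GFLOPS per Tesla M2050); both per-part figures are assumptions, not values from the slide:

    cpus, gpus = 14336, 7168
    cpu_pf = cpus * 70.4e9 / 1e15   # assumed 70.4 GFLOPS per X5670
    gpu_pf = gpus * 515e9 / 1e15    # assumed 515 GFLOPS per M2050
    print(round(cpu_pf + gpu_pf, 2), "PFLOPS")  # ~4.7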
15. Heterogeneous Platforms: RoadRunner
16. From 10 to 1000 PFLOPS
   Several critical issues must be addressed:
   • Power (GFLOPS/W)
   • Fault tolerance (MTBF and high component count)
   • Node performance (esp. in view of limited memory)
   • I/O (esp. in view of limited I/O bandwidth)
   • Heterogeneity (regarding application composition)
   • (and many incoming ones)
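To see why GFLOPS/W leads this list, a minimal sketch assuming the often-quoted 20 MW power envelope for an exascale system (the envelope itself is an assumption, not from the slide):

    target_flops, power_budget_w = 1e18, 20e6   # assumed 20 MW envelope
    print(target_flops / power_budget_w / 1e9, "GFLOPS/W required")  # 50.0

For comparison, the most power-efficient systems around 2011 delivered on the order of 1-2 GFLOPS/W, so roughly a 25x-50x improvement in energy efficiency is required.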
17. Exascale Hardware Challenges
   • Power consumption
   • Concurrency
   • Scalability
   • Resiliency
18. Architectures Considered
   • Evolutionary Strawmen
     - "Heavyweight" strawman based on commodity-derived microprocessors
     - "Lightweight" strawman based on custom microprocessors
   • Aggressive Strawmen
     - "Clean sheet of paper" CMOS silicon
19. Evolutionary Scaling Assumptions
   • Applications will demand the same DRAM/flops ratio as today
   • Ignore any changes needed in disk capacity
   • Processor die size will remain constant
   • Continued reduction in device area => multi-core chips
   • Vdd and max power dissipation will flatten as forecast
     - Thus clock rates limited as before
   • On a per-core basis, micro-architecture will improve from 2 flops/cycle to 4 in 2008, and 8 in 2015
   • Max # of sockets per board will double roughly every 5 years
   • Max # of boards per rack will increase once by 33%
   • Max power per rack will double every 3 years
   • Allow growth in system configuration by 50 racks each year
20. The Power Models
   • Simplistic: a highly optimistic model
     - Max power per die grows as per ITRS
     - Power for memory grows only linearly with # of chips (power per memory chip remains constant)
     - Power for routers and common logic remains constant, regardless of the obvious need to increase bandwidth
     - True if energy per bit moved/accessed decreases as fast as flops per second increase
   • Fully Scaled: a pessimistic model
     - Same as Simplistic, except memory & router power grow with peak flops per chip
     - True if energy per bit moved/accessed remains constant
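A minimal sketch of how the two models differ for the memory (and, analogously, router) power term; the chip count, wattage, and flops values below are illustrative assumptions only, not numbers from the DARPA report.

    def memory_power_w(n_chips, flops_per_chip, fully_scaled,
                       watts_per_chip=5.0, flops_ref=1e11):
        # Simplistic model: power per memory chip stays constant.
        # Fully Scaled model: energy per bit stays constant, so memory
        # power grows with peak flops per chip.
        scale = flops_per_chip / flops_ref if fully_scaled else 1.0
        return n_chips * watts_per_chip * scale

    print(memory_power_w(1_000_000, 1e12, fully_scaled=False))  # 5e6 W  (5 MW)
    print(memory_power_w(1_000_000, 1e12, fully_scaled=True))   # 5e7 W  (50 MW)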
21. The Prediction: Heavyweight
22. The Prediction: Lightweight
23. Architectures Considered
   • Evolutionary Strawmen: NOT FEASIBLE
     - "Heavyweight" strawman based on commodity-derived microprocessors
     - "Lightweight" strawman based on custom microprocessors
   • Aggressive Strawmen
     - "Clean sheet of paper" CMOS silicon
24. The Prediction: Aggressive
25. A Whole Picture
26. Why?
27. Supply voltages are unlikely to reduce significantly. Processor clocks are unlikely to increase significantly.
28. Die power consumption flattens. Clock rates decrease as power is constrained.
29. Power consumption per flop flattens.
30. Fault Tolerance
31. My View (based on the DARPA report)
   • Power is a major consideration
   • Faults and fault tolerance are major issues
   • Constraints on power density constrain processor speed, thus emphasizing concurrency
   • Levels of concurrency needed to reach exascale are projected to be over 10^9 cores
   • For these reasons, the evolutionary path to an exaflop is unlikely to succeed by 2018; at best around 2020
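The 10^9 concurrency figure follows from simple arithmetic, assuming a power-constrained core delivers on the order of 1 GFLOPS (an assumed figure, not stated on the slide):

    target_flops, flops_per_core = 1e18, 1e9   # assumed ~1 GFLOPS per core
    print(f"{target_flops / flops_per_core:.0e} cores")  # 1e+09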
32. Intel
33. NVIDIA Echelon Project: Extreme-scale Computer Hierarchies with Efficient Locality-Optimized Nodes
   • 64 NoC (network on chip), each with 4 SMs, each SM with 8 SM lanes
   • 8 LOCs (latency-optimized cores)
   • 2.5 GHz, 10 nm process
   (Figures: chip floorplan, node, and system)
   Objectives:
   • 16 TFLOPS (double precision) per chip in 2018 at best
   • 100X better application energy efficiency over today's CPU systems
   • Improved programmer productivity
   • Strong scaling for many applications
   • High AMTT
   • Machines resilient to attack
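Taking the projected 16 DP TFLOPS per chip at face value, the chip count needed for an exaflop (ignoring efficiency, packaging, and interconnect overheads) would be roughly:

    # 1 EFLOPS divided by the projected per-chip peak
    print(1e18 / 16e12, "chips")  # 62500.0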
34. DOE's View
35. DOE's Points on Exascale Systems
   • Voltage scaling to reduce power and energy
     - Explodes parallelism
     - Cost of communication vs. computation becomes a critical balance
   • It's not about the FLOPS; it's about data movement.
     - Algorithms should be designed to perform more work per unit of data movement.
     - Programming systems should further optimize this data movement.
     - Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
   • System software to orchestrate all of the above: a self-aware operating system
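To illustrate why data movement rather than flops dominates, a sketch using per-operation energy figures of the kind cited in exascale talks of this period; the pJ values below are illustrative assumptions only, not DOE numbers:

    flop_pj = 25           # assumed energy of one double-precision operation
    offchip_move_pj = 320  # assumed energy to move one 64-bit operand off-chip
    print(offchip_move_pj / flop_pj, "x costlier to move the data than to compute on it")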
36. DOE's Timeline
37. European: Dynamical Exascale Entry Platform (DEEP)
   • Start: 1 Dec 2011
   • Duration: 3 years
   • Budget: 18.5 M€
38. DEEP System: a fusion of general-purpose and high-scalability supercomputers
39. China Exascale Plans
   • 12th Five-Year Plan (2011-2015)
     - Seven petascale HPC systems
     - At least one of 50-100 PFLOPS
     - Budget: CNY 4 billion
   • 13th Five-Year Plan (2016-2020)
     - A 1-10 EFLOPS HPC system
40. Thank you!
