Processor/Node Architecture
• Intel Xeon E5-2600 processor: Sandy Bridge microarchitecture
• Released: March 2012
• Up to 8 cores (16 threads), up to 3.8 GHz (Turbo Boost)
• DDR3-1600 memory at 51 GB/s (see the check below)
• 64 KB L1 (3 cycles), 256 KB L2 (8 cycles), 20 MB L3
• Core-memory: ring-topology interconnect
• CPU-CPU: QPI interconnect
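A quick check of the quoted memory bandwidth, assuming the E5-2600's four DDR3 channels (the channel count is not stated on the slide):

    4 channels × 1600 MT/s × 8 bytes/transfer = 51.2 GB/s ≈ 51 GB/s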
Processor/Node Architecture
• Intel Knights Corner: Many Integrated Cores (MIC)
• Early version: Nov 2011
• Over 50 cores, each operating at 1.2 GHz with 512-bit vector processing units and 4 threads per core; 8 MB of cache
• 1 TFLOPS (see the sanity check below)
• Can be coupled with up to 2 GB of GDDR5 memory
• Cores use a simple in-order x86 design (not Sandy Bridge); manufactured on a 22 nm process with 3D tri-gate transistors
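The ~1 TFLOPS figure is consistent with the listed core count and clock if each 512-bit vector unit retires 8 double-precision fused multiply-adds per cycle (an assumption, not stated on the slide):

    50 cores × 1.2 GHz × 8 DP lanes × 2 flops/FMA ≈ 0.96 TFLOPS ≈ 1 TFLOPS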
Processor/Node Architecture
• AMD Llano APU A8-3870K: Fusion
• Released: Dec 2011
• 4 x86 cores (Stars architecture), 1 MB L2 per core
• On-chip GPU with 400 stream processors
Processor/Node Architecture
• IBM POWER7: Power Architecture, multi-core
• Released: Feb 2010
• 8 cores at up to 4.25 GHz, 32 threads (4 per core)
• 32 KB L1 (2 cycles), 256 KB L2 (8 cycles), 32 MB L3 (embedded DRAM)
• 100 GB/s of memory bandwidth
Coprocessor/GPU Architecture
• NVIDIA Fermi (GeForce GTX 590)/Kepler/Maxwell
• Released: March 2011
• 16 streaming multiprocessors (SMs), each with 32 stream processors (512 CUDA cores)
• 48 KB/SM memory (true cache hierarchy + on-chip shared RAM); 768 KB L2
• 772 MHz core clock
• 3 GB GDDR5 at 288 GB/s
• 1.6 TFLOPS peak, single precision (see the check below)
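The peak figure is single precision; on Fermi the CUDA cores run at twice the core clock (the shader clock), and each can issue one fused multiply-add per cycle:

    512 cores × (2 × 772 MHz) × 2 flops/FMA ≈ 1.58 TFLOPS ≈ 1.6 TFLOPS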
Coprocessor/FPGA Architecture
• Xilinx/Altera/Lattice Semiconductor FPGAs typically interface to PCI/PCIe buses and can accelerate compute-intensive applications by orders of magnitude.
Heterogeneous Platforms: Tianhe-1A
• 14,336 Intel Xeon X5670 processors and 7,168 NVIDIA Tesla M2050 general-purpose GPUs
• Theoretical peak performance of 4.701 PFLOPS (reproduced below)
• 2 PB disk and 262 TB RAM
• The Arch interconnect links the server nodes together using optical-electric cables in a hybrid fat-tree configuration
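The 4.701 PFLOPS peak can be reproduced from per-device peaks; the per-device numbers below come from vendor specifications, not from the slide:

    14,336 × 70.3 GFLOPS (X5670: 6 cores × 2.93 GHz × 4 flops/cycle) ≈ 1.01 PFLOPS
    7,168 × 515 GFLOPS (M2050, double precision) ≈ 3.69 PFLOPS
    Total ≈ 4.70 PFLOPS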
From 10 to 1000 PFLOPS
Several critical issues must be addressed:
• Power (GFLOPS/W; see the budget sketch below)
• Fault tolerance (MTBF and high component count)
• Node performance (esp. in view of limited memory)
• I/O (esp. in view of limited I/O bandwidth)
• Heterogeneity (regarding application composition)
• (and many more emerging ones)
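To see why GFLOPS/W tops the list, consider an illustrative power budget; the 2 GFLOPS/W baseline and the ~20 MW facility limit are assumptions for the sake of the arithmetic:

    1 EFLOPS at 2 GFLOPS/W → 500 MW
    1 EFLOPS within ~20 MW → ~50 GFLOPS/W required, a 25x efficiency gain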
Exascale Hardware Challenges
• Power consumption
• Concurrency
• Scalability
• Resiliency
Architectures Considered
• Evolutionary Strawmen
 – “Heavyweight” Strawman based on commodity-derived microprocessors
 – “Lightweight” Strawman based on custom microprocessors
• Aggressive Strawmen
 – “Clean Sheet of Paper” CMOS Silicon
Evolutionary Scaling Assumptions
• Applications will demand the same DRAM/flops ratio as today
• Ignore any changes needed in disk capacity
• Processor die size will remain constant
• Continued reduction in device area => multi-core chips
• Vdd and max power dissipation will flatten as forecast
 – Thus clock rates are limited as before
• On a per-core basis, micro-architecture will improve from 2 flops/cycle to 4 in 2008, and 8 in 2015
• Max # of sockets per board will double roughly every 5 years
• Max # of boards per rack will increase once, by 33%
• Max power per rack will double every 3 years (compounding shown below)
• Allow growth in system configuration by 50 racks each year
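The rack-power assumption compounds quickly; as an illustration (not a figure from the report):

    P_rack(t) = P_0 × 2^(t/3), so after 9 years P_rack = 8 × P_0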
The Power Models
• Simplistic: a highly optimistic model
 – Max power per die grows as per ITRS
 – Power for memory grows only linearly with # of chips
  • Power per memory chip remains constant
 – Power for routers and common logic remains constant
  • Regardless of the obvious need to increase bandwidth
 – True if energy per bit moved/accessed decreases as fast as flops per second increase
• Fully Scaled: a pessimistic model
 – Same as Simplistic, except memory & router power grow with peak flops per chip
 – True if energy per bit moved/accessed remains constant
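Schematically (this formalization is mine, not the report's), with N_chips the number of memory chips and R_peak the peak flops per chip:

    Simplistic:   P_memory ∝ N_chips   (energy per bit falls as fast as flop rates rise)
    Fully Scaled: P_memory ∝ R_peak    (energy per bit stays constant)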
Architectures Considered
• Evolutionary Strawmen: NOT FEASIBLE
 – “Heavyweight” Strawman based on commodity-derived microprocessors
 – “Lightweight” Strawman based on custom microprocessors
• Aggressive Strawmen
 – “Clean Sheet of Paper” CMOS Silicon
My View (based on DARPA report)
• Power is a major consideration
• Faults and fault tolerance are major issues
• Constraints on power density constrain processor speed, thus emphasizing concurrency
• Levels of concurrency needed to reach exascale are projected to be over 10^9 cores (see the arithmetic below)
• For these reasons, the evolutionary path to an exaflop is unlikely to succeed by 2018, and at best will arrive around 2020
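The 10^9 figure follows from simple division, assuming an illustrative per-core rate of a few GFLOPS (my assumption, not the report's):

    10^18 flops/s ÷ (2 × 10^9 flops/s per core) = 5 × 10^8 cores, i.e. order 10^9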
NVIDIA Echelon Project: Extreme-Scale Computer Hierarchies with Efficient Locality-Optimized Nodes
Node and system:
• 64 NoC (network-on-chip) tiles, each with 4 SMs; each SM with 8 SM lanes
• 8 LOCs (latency-optimized cores)
• 2.5 GHz
• 10 nm chip floorplan
Objectives:
• 16 TFLOPS (double precision) per chip, targeted for 2018 at the earliest (see the arithmetic below)
• 100X better application energy efficiency over today’s CPU systems
• Improved programmer productivity
• Strong scaling for many applications
• High AMTT
• Machines resilient to attack
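At the stated per-chip target, and ignoring all packaging and efficiency losses, an exaflop system would need on the order of:

    10^18 flops/s ÷ (16 × 10^12 flops/s per chip) = 62,500 chips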
DOE’s Points on Exascale Systems
• Voltage scaling to reduce power and energy
 – Explodes parallelism
 – Cost of communication vs. computation: a critical balance
• It’s not about the FLOPS; it’s about data movement.
 – Algorithms should be designed to perform more work per unit of data movement (see the C sketch below)
 – Programming systems should further optimize this data movement
 – Architectures should facilitate this by providing an exposed hierarchy and efficient communication
• System software to orchestrate all of the above
 – Self-aware operating system
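A minimal C sketch of “more work per unit of data movement”; the function names and the fusion example are illustrative, not from the DOE material. Fusing two passes over an array into one keeps the arithmetic identical but halves the DRAM traffic, doubling the work done per byte moved:

    #include <stddef.h>

    /* Unfused: two passes over x, so x is streamed from memory twice.
       2 flops per element, ~32 bytes moved per element (2 reads + 2 writes). */
    void scale_then_add(double *x, double a, double b, size_t n) {
        for (size_t i = 0; i < n; i++) x[i] *= a;  /* pass 1 */
        for (size_t i = 0; i < n; i++) x[i] += b;  /* pass 2 */
    }

    /* Fused: one pass, same 2 flops per element, ~16 bytes moved per element,
       i.e. twice the arithmetic intensity of the unfused version. */
    void scale_add_fused(double *x, double a, double b, size_t n) {
        for (size_t i = 0; i < n; i++) x[i] = a * x[i] + b;
    }

When n exceeds cache capacity these loops are memory-bound, so the fused version runs roughly twice as fast: exactly the kind of win the algorithms bullet above asks designers to pursue.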