Simulation Directed Co-Design from Smartphones to Supercomputers

SystemExplorer is a system simulation framework based upon the open-source gem5 simulation infrastructure. It includes a rich collection of hardware components such as ARM cores, interconnect, memories and memory controllers, and IO devices (Ethernet, PCIe, and other peripherals). In addition, it supports running fully featured operating systems such as Linux and Android, combined with pre-packaged filesystem images that contain real workloads and benchmarks for smartphone, server, and high performance computing. In this talk I'll give an overview of ARM R&D's use of the SystemExplorer tool for workload-directed architectural co-design. I will focus on how we are using it in combination with the Department of Energy's co-design center proxy applications to help evaluate and enable the ARM architecture to address the power-efficiency, performance, and resilience requirements of Exascale computing.

(Presented during the FastPath 2013 Workshop in Austin, TX)

Transcript of "Simulation Directed Co-Design from Smartphones to Supercomputers"

  1. Simulation Directed Co-Design from Smartphones to Supercomputers. Eric Van Hensbergen, ARM Research & Development, Austin, TX. FastPath 2013, April 21, 2013
  2. STATE
     • SHIFT
       • No longer solely rely on process reduction to improve performance
       • Performance/power/cost will increasingly become reliant on integration
     • ARM
       • Focuses on design & licensing of IP building blocks for SoCs (= LEGOs)
       • Building blocks effectively act as COTS-on-silicon
       • COTS-on-silicon encourages multi-suppliers through the ecosystem
       • It enables circuit boards to be integrated onto a single chip
       • Technology DNA is power efficiency
  3. FLEXIBILITY
     • Build what you want?
     • Target your SoC to solve your problem
     • One size does not fit all
     • Optimize power/performance for the domain
     • Utilize common infrastructure and components
     • Leverage SW ecosystem and portability
     • Leverage validated IP
     • Proven design flows
     • Focus on adding value to solve your problems
     • Adding your application-specific IP
     • Everything else off the shelf
     • Rich IP libraries
     • Diverse and competitive IP vendors
     • Leverage the ARM ecosystem
  4. MARKETS (figures as shown on the slide): Mobile: 3%, 4.6bn in 2011; Home: 40%, 0.4bn in 2011; Embedded: 25%, 2.3bn in 2011; Enterprise: 10%, 1.4bn in 2011
  5. PROCESSORS: Architecture (“ARMv8”); Processor Micro-Architecture (“Cortex-A57”); Processor Hard-Macro (Implementation)
  6. “On-Chip” INTERCONNECT: Architecture (“AMBA”); RTL Implementation (CCN-504)
  7. GPUs: Architecture (“Midgard”); GPU Micro-Architecture (Mali T-678)
  8. 1000+
  9. gem5
     • Architectural simulator
     • ARM has invested significantly in ARM support for gem5 under the internal name “SystemExplorer”
       • Plan to continue to invest over time
       • ARMv7 support is extremely good today
       • Plans to contribute ARMv8 support when complete
     • BSD licensed
     • Good platform for collaboration
       • Base infrastructure is available and we can share bits beyond that
  10. SystemExplorer
  11. OS Support in SystemExplorer: Ubuntu 12.04 (Linux kernel v3.3), Android Jellybean (Kernel v2.6.38)
     • Latest Ubuntu and Android distributions
  12. SystemExplorer Application Support
     Understanding how real application workloads and operating systems stress our IP
     [Figure: application-support map. Platforms: single system simulation; multi-system simulation with simulated Ethernet. Categories: HPC Applications, Server Applications, Mobile Applications, Legacy. Legend: Done, In Process, Planned, Includes kernel support. Workload labels: Taji, Egypt, Angry Birds, BBench, SSJ, Graphics, ToF, DaCapo, AndEBench, Replica, IOzone, JS V8 Engine, AppLaunch, Caffeine Mark, WPS, Netperf, Vellamo HTML5, Vellamo Metal, Webserver, RLBench, UI Twiddle, Mantevo, CESAR, ExaCT, AR, Vid Playback, DB, Wireless Disp, Video Conf, EEMBC, SPEC2000]
  13. ARM gem5 Usage Continues to Grow
     [Figure: downloads per month (0 to 300) for alpha, arm, and x86; annotations: “Overtake x86”, “Overtake alpha”, “ARM #1”]
     • ARM gem5 exceeding both x86 and Alpha
  14. gem5 Visualization with Streamline
  15. SystemExplorer Dhrystone Correlation
  16. SystemExplorer SPECint2000 Correlation
  17. SystemExplorer EEMBC Correlation
  18. High Performance Computing
     • High performance computing (HPC) is becoming much more pervasive.
     • Power efficiency and integration are becoming key factors in both large-scale and commercial HPC
     • 2018-2022 DARPA/DOD/DOE Visions for HPC: 50 GFLOPS/W (20 pJ/FLOP)
       • 20 W chip (teraflop), 5 kW chassis (terascale), 20 kW rack (petascale), 20 MW data center (exaflop)
     [Figure label: Medical/Pharma]
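
The efficiency target on this slide can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes the quoted power budgets are spent entirely at the 50 GFLOPS/W target.

```python
# Illustrative check of the slide's numbers; names here are hypothetical.
GFLOPS_PER_WATT = 50.0  # efficiency target quoted on the slide


def picojoules_per_flop(gflops_per_watt):
    """Energy per operation: 1 W = 1 J/s, so J/FLOP = 1 / (FLOP/s per W)."""
    return 1e12 / (gflops_per_watt * 1e9)


def sustained_flops(power_watts, gflops_per_watt=GFLOPS_PER_WATT):
    """FLOP/s delivered if the whole power budget runs at the target efficiency."""
    return power_watts * gflops_per_watt * 1e9


# Power budgets listed on the slide, in watts.
POWER_BUDGETS_W = {
    "chip": 20.0,
    "chassis": 5e3,
    "rack": 20e3,
    "data center": 20e6,
}

if __name__ == "__main__":
    print(f"{picojoules_per_flop(GFLOPS_PER_WATT):.0f} pJ/FLOP")
    for name, watts in POWER_BUDGETS_W.items():
        print(f"{name}: {sustained_flops(watts):.3g} FLOP/s")
```

Under that assumption a 20 W chip lands at one teraflop and a 20 MW data center at one exaflop, which matches the scale labels on the slide.
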
  19. Why does ARM care about HPC?
     • We expect the challenges HPC experiences today to be similar to the enterprise challenges of tomorrow
       • Data center networking is getting more advanced
       • Energy will forever be a concern
     • ARM’s long-term vision is for ARM technology to be in all levels of compute
       • Five years ago we announced the Cortex-M (Microcontroller) series
       • ARM powers many hard-real-time systems (radio, automotive, etc.)
       • Mobile devices
       • Servers
       • HPC is the only place you don’t find ARM technology today and we aim to change that
  20. First steps in ARM HPC:
     • Supercomputer investigation based on embedded (ARM) technology
     • Funded under FP7
       • 3-year IP Project (Start October 2011)
       • Budget: 14.5 M€ (8.1 M€ from EC)
     • Project goals: physical prototype based on available embedded (ARM) technology and a design of a full next-gen system
     • Consortium includes experienced HPC developers and users
  21. Mont-Blanc Roadmap
     A big challenge, and a huge opportunity for Europe
     [Figure: GFLOPS/W roadmap, 2011 to 2017; labels: “Built with the best of the market”, “Built with the best that is coming”, “What is the best that we could do?”, 256 nodes, 250 GFLOPS, 1.7 kW]
     • Prototypes are critical to accelerate software development
     • System software stack + applications
     (Slide source: HPC Advisory Council, Malaga, September 13, 2012)
  22. US DoE Exascale Timeline
  23. Goals
     • Port co-design center proxy applications to ARM platform and take baseline measurements
     • Also execute HPC Challenge and FFTW benchmarks to complement proxy applications
     • Execute the same set of workloads on gem5 with a configuration similar to an ARM hardware platform to get an idea of how well the simulator correlates
     • Use results as a baseline for understanding the current state of ARM for HPC, future optimizations, and sensitivity studies
     • Since national labs aren’t as interested in 32-bit, use the process to refine methodology until 64-bit hardware and/or simulator becomes available
  24. Baseline Workload Characterization
     [Figure: flow diagram; labels: Co-Design Centers, National Labs, Workloads, HPC Disk Image, RTL Simulation, Characterization, Sensitivity Studies, Performance Projection, Design]
  25. High Performance Computing Challenge
     • DARPA benchmark established to help evaluate systems in the HPCS program (which ultimately produced the Cray Cascade and IBM PERCS machines)
       • LINPACK – stresses peak floating point
       • PTRANS – rate of transfer of large arrays
       • GUPS – random updates of memory
       • FFT – Fast Fourier Transform
       • STREAM – measures sustainable memory bandwidth
       • DGEMM – double precision general matrix multiply
     • Generally run across a cluster with MPI, but can run single node and single core
     • Configuration can scale to different working set sizes
     • http://icl.cs.utk.edu/hpcc
  26. Mantevo Proxy Applications Suite
     • Developed at Sandia National Labs as an outgrowth of the Trilinos project, a collection of open-source scientific libraries, applications, and benchmarks
     • Goals:
       • Predict performance of real applications in new situations.
       • Aid computer systems design decisions.
       • Foster communication between applications, libraries, and computer systems developers.
       • Guide application and library developers in algorithm and software design choices for new systems.
       • Provide open source software to promote informed algorithm, application, and architecture decisions in the HPC community.
     • Released as open source:
       • http://mantevo.org
  27. Co-Design Center Apps
     • CESAR: Center for Exascale Simulation of Advanced Reactors
       • Thermal Hydraulics: for the fluid codes (NEK 5000)*
       • Neutronics: for the Neutronics codes (MOCFE and OpenMC)
       • Coupling and Data Analytics for data intensive tasks: cian
     • ExMatEx: Materials in Extreme Environments
       • CoMD – Molecular Dynamics
       • LULESH – Lagrangian Explicit Shock Hydrodynamics
       • VPFFT – Crystal viscoplasticity
     • ExaCT: Center for Exascale Simulation of Combustion in Turbulence
       • Exp_CNS_NoSpec: A simple stencil-based test code
       • MultiGrid_C: A multigrid-based solver for a model linear elliptic system based on a centered second-order discretization.
       • vodeDriver: chemical combustion kinetics
  28. Workloads, benchmarks, & miniapps: Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, GUPS
  29. Workloads Instruction Mix
     [Figure: per-workload instruction-mix charts for Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, GUPS; legend: Memory, Integer, SIMD Integer, Float, SIMD Float]
  30. gem5 Methodology
     • Boot scripts are in m5-obj/config/boot/hpc
     • Base.rcS creates a checkpoint after boot and a 60-second “rest” period. It is set up to re-read the workload script after the checkpoint, so that the workload can be configured during restore, skipping the boot period.
     • Configs are self-contained in workloads; output is sent to the simulation host via m5 writefile.
     • I’ve got some bundled run scripts which handle establishing the base checkpoint, restoring the checkpoint, and executing workloads in atomic, A15, and A15 with periodic stats enabled
     • Runs are parameterized so that a complete run can finish in a reasonable amount of time with timing-approximate simulation
     • Disk image available, optimized for A15
  31. gem5 Correlation – Simple Memory
     [Figure: bar chart of gem5 versus hardware difference per metric, axis from -100.00% to 100.00% in 20% steps; bar labels as extracted: CoMD 8k, phdMesh, MD TIME, HPCCG P1 FLOP/s, CG_MFLOP/s, FFTW (SP), GUPS, EP-STREAM GB/s, G-PTRANS GB/s, EP-DGEMM MFlop/s, G-FFTR MFlop/s, G-HPL MFlop/s]
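
The correlation chart's axis runs from -100% to 100%, which suggests a signed percentage difference between simulated and measured results. The slide does not state the formula, so the sketch below assumes the common definition (simulated minus measured, relative to measured). The function names are hypothetical, and the caller must supply real measurements.

```python
# Hypothetical helper; the formula is an assumption, not taken from the slides.
def percent_difference(simulated, measured):
    """Signed difference of a simulated metric relative to the hardware value."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return 100.0 * (simulated - measured) / measured


def worst_correlated(results):
    """Return the metric name with the largest absolute percentage difference.

    `results` maps a metric name to a (simulated, measured) pair.
    """
    return max(results, key=lambda name: abs(percent_difference(*results[name])))
```

A negative value means the simulator under-reports the metric relative to hardware. For time-based metrics (lower is better) the sign reads the opposite way from rate-based metrics such as MFlop/s.
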
  32. miniFE – finite element simulation
     • It assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements. It then solves the linear system using a simple un-preconditioned conjugate-gradient algorithm. Thus the kernels that it contains are:
     • computation of element-operators
       • diffusion matrix, source vector
     • assembly
       • scattering element-operators into sparse matrix and vector
     • sparse matrix-vector product
       • during CG solve
     • vector operations (level-1 BLAS: axpy, dot, norm)
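
The solve phase described on this slide can be sketched in a few lines. This is a minimal un-preconditioned conjugate-gradient loop in pure Python over a CSR matrix, built from the same kernel types the slide lists (sparse matrix-vector product, axpy, dot, norm). It is not miniFE's code: the function names are chosen for this example, and a small 1-D tridiagonal system stands in for the assembled 3-D hex-element matrix.

```python
import math


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x for a CSR-stored matrix."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(a, b):
    """Level-1 BLAS dot product."""
    return sum(ai * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    """Level-1 BLAS axpy: returns alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iters=200):
    """Un-preconditioned CG for a symmetric positive-definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A @ x, with x = 0 initially
    p = list(r)          # search direction
    rr = dot(r, r)
    for _ in range(max_iters):
        if math.sqrt(rr) <= tol:   # residual norm small enough
            break
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rr / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)   # p = r + beta * p
        rr = rr_new
    return x


# Stand-in system: 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]] in CSR form.
VALUES = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
COL_IDX = [0, 1, 0, 1, 2, 1, 2]
ROW_PTR = [0, 2, 5, 7]
RHS = [1.0, 0.0, 1.0]

if __name__ == "__main__":
    print(conjugate_gradient(VALUES, COL_IDX, ROW_PTR, RHS))
```

Most of the work per iteration is in the sparse matrix-vector product, whose indirect accesses through `col_idx` are the kind of memory behaviour a cache and memory-system model has to capture.
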
  33. Profile: miniFE
     • Language & Runtime: reference code in C++, alternate versions for OpenMP, Cilk, Chapel, qthreads, etc.
     • Library Dependencies: None
     • SLOCCOUNT: 2872 lines of code
     • A15 Perf Characteristics:
       • Run Time: 217,413,167 cycles
       • Max Heap Size: 14.54 MB
       • CPI: 1.6958
       • L1D Miss Rate: 2.5%
       • L2 Miss Rate: 6.68%
       • Branch Mispredicts: 6.36%
     [Figure: instruction-mix chart; legend: Int, Float, SIMD Int, SIMD Float, Memory, Other]
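
The cycle count and CPI on this slide imply an instruction count, which gives a feel for how small the measured run is. The sketch below only rearranges CPI = cycles / instructions. The names are illustrative, and no wall-clock time is derived because the slide does not give a clock frequency.

```python
# Values copied from the slide; helper names are hypothetical.
CYCLES = 217_413_167
CPI = 1.6958


def instructions_retired(cycles, cpi):
    """Rearranged definition of CPI: instructions = cycles / CPI."""
    return cycles / cpi


def ipc(cpi):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi


if __name__ == "__main__":
    print(f"instructions ~ {instructions_retired(CYCLES, CPI):,.0f}")
    print(f"IPC ~ {ipc(CPI):.3f}")
```

This works out to roughly 128 million instructions at a little under 0.6 instructions per cycle.
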
  34. miniFE gem5 Cache Occupancy
  35. miniFE Streamline Visualization
  36. Workloads: Next Steps
     • More benchmarks
       • Get big data analytic mini-apps and benchmarks working (Graph 500, Mantevo analytics mini-app, others?)
       • Get an ExaCT benchmark working, incorporate forthcoming ASC benchmarks
     • More variations
       • Multinode MPI, PGAS, and other runtimes
       • OpenCL variants
       • Hand-code NEON-optimized versions of key benchmarks
     • More accuracy
       • Continue calibrating the gem5 memory system against hardware to increase accuracy of memory-bound benchmarks
     • Systems Software Sensitivity Study
       • OpenMPI versus MPICH versus LAMPI on ARM
       • Operating System Version (3.7 has THP)
       • armcc vs gcc vs gcc-dragon-egg vs clang (etc.)
     • Transition to 64-bit gem5 (and hardware) when available.
     • Integrate Mont-Blanc benchmarks and runtimes
     • Roll bare-metal versions of co-design center workloads to make them more accessible to design teams.
  37. Simulation Driven Challenges
     • Performance
       • When running in functional mode (atomic), simulation performance is in MIPS; when running in cycle-approximate mode (with memory models, cache models, etc.) simulation runs in KIPS, but longer runtimes with timing models give more representative results.
       • Current methodology works off atomic checkpoints followed by short timing measurements, but can be refined to get better representation of multi-phase workloads
     • Scale
       • gem5 is currently inherently serial; adding cores or nodes to the simulation has a multiplicative effect
       • Multi-threading the simulation model at core, node, and cluster levels could help address this problem, but may impact granularity of timing accuracy.
     • Correlation
       • Correlating a single-core simulation is hard; correlating multi-core is extremely difficult, as is multi-node.
     • Sensitivity Study State Space Explosion
       • Many knobs to turn; determining which ones to turn in combination for the best effect is an ongoing research problem.
     • Visualization
       • Need better ways of visualizing performance characteristics, particularly at scale.
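
A back-of-the-envelope model makes the performance and scale points concrete. The sketch below is a rough estimate only. The two simulation rates are placeholder assumptions chosen to match the slide's orders of magnitude (MIPS for atomic, KIPS for timing), and the "multiplicative effect" of extra cores is modelled as a simple linear multiplier. Neither figure is a measurement.

```python
# Assumed, illustrative rates; real gem5 throughput depends on host and config.
ATOMIC_RATE_IPS = 1e6   # about 1 MIPS in functional (atomic) mode
TIMING_RATE_IPS = 1e5   # about 100 KIPS with timing models enabled


def wall_clock_seconds(instructions, sim_rate_ips, simulated_cores=1):
    """Host seconds for a serial simulator to run `instructions` on each core."""
    if sim_rate_ips <= 0:
        raise ValueError("simulation rate must be positive")
    return instructions * simulated_cores / sim_rate_ips


def checkpoint_then_measure(fast_forward_insts, measured_insts, simulated_cores=1):
    """Atomic fast-forward to a checkpoint, then a short timing-mode window."""
    return (
        wall_clock_seconds(fast_forward_insts, ATOMIC_RATE_IPS, simulated_cores)
        + wall_clock_seconds(measured_insts, TIMING_RATE_IPS, simulated_cores)
    )
```

With these placeholder rates, fast-forwarding a billion instructions costs about as much host time as measuring one hundred million in timing mode, which is why the methodology keeps timing windows short.
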
  38. Future Work: Integration with SST
     • SST: The Structural Simulation Toolkit
       • Maintained by Sandia National Labs
       • Component-based Discrete Event Model
       • Already uses gem5 as a component (but not well integrated with ARM variant)
       • Potential to help us scale out simulation as well as integrate with other simulations (fabric, etc.) to allow for end-to-end simulation of a large-scale supercomputer.
  39. Links
     • More info on ARM including Research Papers
       • http://infocenter.arm.com
     • gem5 (http://www.m5sim.org)
     • SST (http://sst.sandia.gov)
     • Montblanc (http://montblanc-project.eu)
     • Exascale Initiative
       • http://sites.google.com/a/lbl.gov/exascale-initiative/
     • Co-Design Center Proxy Apps
       • Mantevo (http://mantevo.org)
       • ExMatEx (http://exmatex.lanl.gov)
       • ExaCT (http://exactcodesign.org)
       • CESAR (http://cesar.mcs.anl.gov)
  40. Thanks! QUESTIONS?