SystemExplorer is a system simulation framework based on the open-source gem5 simulation infrastructure. It includes a rich collection of hardware components such as ARM cores, interconnects, memories and memory controllers, and I/O devices (Ethernet, PCIe, and other peripherals). In addition, it provides support for running fully featured operating systems such as Linux and Android, combined with pre-packaged filesystem images that contain real workloads and benchmarks for the smartphone, server, and high performance computing domains. In this talk I'll give an overview of ARM R&D's use of the SystemExplorer tool for workload-directed architectural co-design. I will focus on how we are using it in combination with the Department of Energy's co-design center proxy applications to help evaluate and enable the ARM architecture to address the power-efficiency, performance, and resilience requirements of Exascale computing.
(Presented at the FastPath 2013 Workshop in Austin, TX)
1. Simulation Directed Co-Design from Smartphones to Supercomputers
Eric Van Hensbergen
ARM Research & Development
Austin, TX
FastPath 2013, April 21, 2013
2. STATE
§ SHIFT
§ No longer solely rely on process reduction to improve performance
§ Performance/power/cost will increasingly become reliant on integration
§ ARM
§ Focuses on design & licensing of IP building blocks for SoCs (like LEGOs)
§ Building blocks effectively act as COTS-on-Silicon
§ COTS-on-Silicon encourages multiple suppliers throughout the ecosystem
§ It enables circuit boards to be integrated onto a single chip
§ Technology DNA is power efficiency
3. FLEXIBILITY
§ Build what you want?
§ Target your SoC to solve your problem
§ One size does not fit all
§ Optimize power/performance for the domain
§ Utilize common infrastructure and components
§ Leverage SW ecosystem and portability
§ Leverage validated IP
§ Proven design flows
§ Focus on adding value to solve your problems
§ Add your application-specific IP
§ Everything else off the shelf
§ Rich IP libraries
§ Diverse and competitive IP vendors
§ Leverage the ARM ecosystem
4. MARKETS
[Chart: ARM shipments by market segment in 2011 – Mobile: 4.6bn units; Embedded: 2.3bn; Enterprise: 1.4bn; Home: 0.4bn]
9. gem5
§ Architectural simulator
§ ARM has invested significantly in ARM support for gem5 under the internal name “SystemExplorer”
§ Plan to continue to invest over time
§ ARMv7 support is extremely good today
§ Plans to contribute ARMv8 support when complete
§ BSD licensed
§ Good platform for collaboration
§ Base infrastructure is available and we can share bits beyond that
11. OS Support in SystemExplorer
[Screenshots: Ubuntu 12.04 (Linux kernel v3.3) and Android Jelly Bean (kernel v2.6.38)]
§ Latest Ubuntu and Android distributions
12. SystemExplorer Application Support
Understanding how real application workloads and operating systems stress our IP
[Figure: application support map, each workload marked Done, In Process, Planned, or Legacy. Mobile applications (single-system simulation): BBench, Angry Birds, AndEBench, CaffeineMark, Vellamo (HTML5, Metal), V8 JS engine, AppLaunch, UI Twiddle, graphics (Taiji, Egypt), AR, video playback, video conferencing, wireless display. Server applications (multi-system simulation with simulated Ethernet): SSJ, DaCapo, IOzone, Netperf, webserver, DB. HPC applications: Mantevo, CESAR, ExaCT. Legacy (includes kernel support): EEMBC, SPEC2000.]
13. ARM gem5 Usage Continues to Grow
[Chart: gem5 downloads per month (0–300) for the alpha, arm, and x86 builds; ARM overtakes x86, then Alpha, to become #1]
§ ARM gem5 exceeding both x86 and Alpha
18. High Performance Computing
§ High performance computing (HPC) is becoming much more pervasive (e.g., medical/pharma)
§ Power efficiency and integration are becoming key factors in both large-scale and commercial HPC
§ 2018-2022 DARPA/DOD/DOE visions for HPC: 50 GFLOPS/W (20 pJ/FLOP), i.e.
20 W chip → Teraflop | 5 kW chassis → Terascale | 20 kW rack → Petascale | 20 MW data center → Exaflop
19. Why does ARM care about HPC?
§ We expect the challenges HPC experiences today to be similar to the enterprise challenges of tomorrow
§ Data center networking is getting more advanced
§ Energy will forever be a concern
§ ARM’s long-term vision is for ARM technology to be in all levels of compute
§ Five years ago we announced the Cortex-M (microcontroller) series
§ ARM powers many hard real-time systems (radio, automotive, etc.)
§ Mobile devices
§ Servers
§ HPC is the only place you don’t find ARM technology today, and we aim to change that
20. First steps in ARM HPC: Mont-Blanc
§ Supercomputer investigation based on embedded (ARM) technology
§ Funded under FP7
§ 3-year IP project (started October 2011)
§ Budget: 14.5 M€ (8.1 M€ from EC)
§ Project goals: physical prototype based on available embedded (ARM) technology and a design of a full next-gen system
§ Consortium includes experienced HPC developers and users
21. Mont-Blanc Roadmap
A big challenge, and a huge opportunity for Europe
[Chart: GFLOPS/W roadmap, 2011–2017 – first a prototype built with the best of the market (256 nodes, 250 GFLOPS, 1.7 kW), then a system built with the best that is coming, then “What is the best that we could do?”]
• Prototypes are critical to accelerate software development
• System software stack + applications
(Slide credit: HPC Advisory Council, Malaga, September 13, 2012)
23. Goals
§ Port co-design center proxy applications to the ARM platform and take baseline measurements
§ Also execute the HPC Challenge and FFTW benchmarks to complement the proxy applications
§ Execute the same set of workloads on gem5 with a configuration similar to an ARM hardware platform to get an idea of how well the simulator correlates
§ Use results as a baseline for understanding the current state of ARM for HPC, future optimizations, and sensitivity studies
§ Since national labs aren’t as interested in 32-bit, use the process to refine methodology until 64-bit hardware and/or a simulator becomes available
25. High Performance Computing Challenge
§ DARPA benchmark suite established to help evaluate systems in the HPCS program (which ultimately produced the Cray Cascade and IBM PERCS machines)
§ LINPACK – stresses peak floating point
§ PTRANS – rate of transfer of large arrays
§ GUPS – random updates of memory
§ FFT – Fast Fourier Transform
§ STREAM – measures sustainable memory bandwidth (see the sketch after this list)
§ DGEMM – double precision general matrix multiply
§ Generally run across a cluster with MPI, but can run single node and single core
§ Configuration can scale to different working set sizes
§ http://icl.cs.utk.edu/hpcc
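For a flavor of what the bandwidth-bound kernels stress, here is a minimal C sketch of a STREAM-style triad loop. This is not the official STREAM benchmark; the array size N and the timing harness are placeholder choices (real STREAM sizes arrays well beyond the last-level cache and repeats each kernel).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)  /* placeholder working set, ~384 MB total */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c)
        return 1;

    for (size_t i = 0; i < N; i++) {
        b[i] = 1.0;
        c[i] = 2.0;
    }

    const double scalar = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Triad: two streaming reads and one streaming write per element,
       so throughput is bounded by sustainable memory bandwidth,
       not by floating-point issue rate. */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* three arrays, 8 bytes moved per element each */
    printf("Triad: %.2f MB/s\n", 3.0 * N * sizeof(double) / secs / 1e6);

    free(a); free(b); free(c);
    return 0;
}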
26. Mantevo Proxy Applications Suite
§ Developed at Sandia National Labs as an outgrowth of the Trilinos project, a collection of open-source scientific libraries, applications, and benchmarks
§ Goals:
§ Predict performance of real applications in new situations
§ Aid computer systems design decisions
§ Foster communication between applications, libraries, and computer systems developers
§ Guide application and library developers in algorithm and software design choices for new systems
§ Provide open source software to promote informed algorithm, application, and architecture decisions in the HPC community
§ Released as open source:
§ http://mantevo.org
27. Co-Design Center Apps
CESAR – Center for Exascale Simulation of Advanced Reactors
• Thermal hydraulics: for the fluid codes (NEK5000)
• Neutronics: for the neutronics codes (MOCFE and OpenMC)
• Coupling and data analytics for data-intensive tasks: cian
ExaCT – Center for Exascale Simulation of Combustion in Turbulence
• Exp_CNS_NoSpec: a simple stencil-based test code
• MultiGrid_C: a multigrid-based solver for a model linear elliptic system based on a centered second-order discretization
• vodeDriver: chemical combustion kinetics
ExMatEx – Materials in Extreme Environments
• CoMD – molecular dynamics
• LULESH – Lagrangian explicit shock hydrodynamics
• VPFFT – crystal viscoplasticity
30. gem5 Methodology
§ Boot scripts are in m5-obj/config/boot/hpc
§ Base.rcS creates a checkpoint after boot and a 60-second “rest” period. It is set up to re-read the workload script after the checkpoint so that the workload can be configured during restore, skipping the boot period (see the guest-code sketch below).
§ Configs are self-contained in workloads; output is sent to the simulation host via m5 writefile
§ Bundled run scripts handle establishing the base checkpoint, restoring the checkpoint, and executing workloads in atomic mode, A15, and A15 with periodic stats enabled
§ Runs are parameterized so that a complete run finishes in a reasonable amount of time with timing-approximate simulation
§ Disk image available, optimized for A15
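Besides rcS scripting, gem5 also exposes the same checkpoint/stats hooks to guest code through its m5ops library (util/m5 in the gem5 tree). A minimal sketch of bracketing a region of interest, assuming gem5's m5op.h header and the matching libm5, with run_workload() as a hypothetical stand-in for a proxy app kernel:

#include <stdint.h>
#include "m5op.h"   /* from gem5's util/m5; link against the matching libm5 */

/* hypothetical stand-in for a proxy application kernel */
extern void run_workload(void);

int main(void)
{
    /* Take a checkpoint here so later runs can restore past boot and setup. */
    m5_checkpoint(0, 0);

    /* Discard statistics accumulated during boot and warm-up. */
    m5_reset_stats(0, 0);

    run_workload();

    /* Dump stats for just the measured region, then end the simulation. */
    m5_dump_stats(0, 0);
    m5_exit(0);
    return 0;
}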
32. miniFE – finite element simulation
§ Assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements, then solves the linear system using a simple un-preconditioned conjugate-gradient algorithm.
The kernels it contains (illustrated by the sketch after this list) are thus:
§ computation of element operators (diffusion matrix, source vector)
§ assembly (scattering element operators into the sparse matrix and vector)
§ sparse matrix-vector product (during the CG solve)
§ vector operations (level-1 BLAS: axpy, dot, norm)
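As an illustration of how those kernels fit together (this is not miniFE's actual code; the function and array names here, spmv, dot, axpy, cg, row_ptr, col_idx, vals, are hypothetical), a minimal un-preconditioned CG solve over a CSR sparse matrix in C might look like:

#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* y = A*x for a CSR matrix: row_ptr[n+1]; col_idx/vals hold the nonzeros */
static void spmv(size_t n, const size_t *row_ptr, const size_t *col_idx,
                 const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* level-1 BLAS style vector kernels */
static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

static void axpy(size_t n, double alpha, const double *p, double *x)
{
    for (size_t i = 0; i < n; i++) x[i] += alpha * p[i];
}

/* Un-preconditioned CG: solve A*x = b; r, p, Ap are caller-provided scratch */
static void cg(size_t n, const size_t *row_ptr, const size_t *col_idx,
               const double *vals, const double *b, double *x,
               double *r, double *p, double *Ap, int max_iter, double tol)
{
    spmv(n, row_ptr, col_idx, vals, x, Ap);
    for (size_t i = 0; i < n; i++) r[i] = b[i] - Ap[i];   /* r = b - A*x */
    for (size_t i = 0; i < n; i++) p[i] = r[i];
    double rr = dot(n, r, r);

    for (int it = 0; it < max_iter && sqrt(rr) > tol; it++) {
        spmv(n, row_ptr, col_idx, vals, p, Ap);           /* dominant kernel */
        double alpha = rr / dot(n, p, Ap);
        axpy(n, alpha, p, x);                             /* x += alpha*p   */
        axpy(n, -alpha, Ap, r);                           /* r -= alpha*Ap  */
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}

int main(void)
{
    /* Tiny SPD test system: A = [[4,1],[1,3]], b = [1,2]; x ≈ (0.0909, 0.6364) */
    size_t row_ptr[3] = {0, 2, 4};
    size_t col_idx[4] = {0, 1, 0, 1};
    double vals[4] = {4.0, 1.0, 1.0, 3.0};
    double b[2] = {1.0, 2.0};
    double x[2] = {0.0, 0.0};
    double r[2], p[2], Ap[2];

    cg(2, row_ptr, col_idx, vals, b, x, r, p, Ap, 100, 1e-10);
    printf("x = (%f, %f)\n", x[0], x[1]);
    return 0;
}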
33. Profile: miniFE
§ Language & runtime: reference code in C++, alternate versions for OpenMP, Cilk, Chapel, qthreads, etc.
§ Library dependencies: none
§ SLOCCount: 2,872 lines of code
§ A15 perf characteristics:
§ Run time: 217,413,167 cycles
§ Max heap size: 14.54 MB
§ CPI: 1.6958
§ L1D miss rate: 2.5%
§ L2 miss rate: 6.68%
§ Branch mispredicts: 6.36%
[Chart: instruction mix – Int, Float, SIMD Int, SIMD Float, Memory, Other]
36. Workloads: Next Steps
§ More benchmarks
§ Get big data analytic mini-apps and benchmarks working (Graph 500, Mantevo analytics mini-app, others?)
§ Get an ExaCT benchmark working; incorporate forthcoming ASC benchmarks
§ More variations
§ Multinode MPI, PGAS, and other runtimes
§ OpenCL variants
§ Hand-code NEON-optimized versions of key benchmarks
§ More accuracy
§ Continue calibrating the gem5 memory system against hardware to increase accuracy of memory-bound benchmarks
§ Systems software sensitivity study
§ OpenMPI versus MPICH versus LA-MPI on ARM
§ Operating system version (3.7 has THP)
§ armcc vs gcc vs gcc-dragonegg vs clang (etc.)
§ Transition to 64-bit gem5 (and hardware) when available
§ Integrate Mont-Blanc benchmarks and runtimes
§ Roll a bare-metal version of the co-design center workloads to make them more accessible to design teams
37. Simulation Driven Challenges
§ Performance
§ When running in functional mode (atomic), performance is in MIPS; when running in cycle-approximate mode (with memory models, cache models, etc.) simulation runs in KIPS – but longer runtimes with timing models give more representative results
§ Current methodology works off atomic checkpoints followed by short timing measurements, but could be refined to better represent multi-phase workloads
§ Scale
§ gem5 is currently inherently serial; adding cores or nodes to a simulation has a multiplicative effect on runtime
§ Multi-threading the simulation model at core, node, and cluster levels could help address this problem, but may impact granularity of timing accuracy
§ Correlation
§ Correlating a single-core simulation is hard; correlating multi-core is extremely difficult, as is multi-node
§ Sensitivity study state space explosion
§ Many knobs to turn; determining which ones to turn in combination for the best effect is an ongoing research problem
§ Visualization
§ Need better ways of visualizing performance characteristics, particularly at scale
38. Future Work: Integration with SST
§ SST: the Structural Simulation Toolkit
§ Maintained by Sandia National Labs
§ Component-based discrete event model
§ Already uses gem5 as a component (but not well integrated with the ARM variant)
§ Potential to help us scale out simulation as well as integrate with other simulations (fabric, etc.) to allow for end-to-end simulation of a large-scale supercomputer
39. Links
§ More info on ARM including research papers
§ http://infocenter.arm.com
§ gem5 (http://www.m5sim.org)
§ SST (http://sst.sandia.gov)
§ Mont-Blanc (http://montblanc-project.eu)
§ Exascale Initiative
§ http://sites.google.com/a/lbl.gov/exascale-initiative/
§ Co-Design Center Proxy Apps
§ Mantevo (http://mantevo.org)
§ ExMatEx (http://exmatex.lanl.gov)
§ ExaCT (http://exactcodesign.org)
§ CESAR (http://cesar.mcs.anl.gov)