Simulation Directed Co-Design
from Smartphones to Supercomputers
           Eric Van Hensbergen
        ARM Research & Development
                 Austin, TX


        FastPath 2013 April 21, 2013

1
STATE
§  SHIFT
      §  No longer solely rely on Process Reduction to improve performance
      §  Performance/Power/Cost will increasingly become reliant on Integration

§  ARM
      §  Focuses on Design & Licensing of IP Building Blocks for SoC’s (=LEGO’s)
      §  Building Blocks effectively act as COTS-on-Silicon
      §  COTS-on-Silicon encourages multi-suppliers through the eco-system
      §  It enables circuit-boards to be integrated onto a single chip
      §  Technology DNA is Power-Efficiency

  2
FLEXIBILITY
§  Build what you want?
      §  Target your SoC to solve your problem
            §  One size does not fit all
            §  Optimize power/performance for the domain
      §  Utilize common infrastructure and components
            §  Leverage SW ecosystem and portability
            §  Leverage validated IP
            §  Proven design flows
      §  Focus on adding value to solve your problems
            §  Adding your application-specific IP
            §  Everything else off the shelf
                  §  Rich IP libraries
      §  Diverse and competitive IP vendors
            §  Leverage the ARM ecosystem

  3
MARKETS
    Mobile        3%     4.6bn in 2011
    Home          40%    0.4bn in 2011
    Embedded      25%    2.3bn in 2011
    Enterprise    10%    1.4bn in 2011




4
PROCESSORS

    Architecture: “ARMv8”
    Processor Micro-Architecture: “Cortex-A57”
    Processor Hard-Macro Implementation




5
“On-Chip” INTERCONNECT
    Architecture: “AMBA”
    RTL Implementation: CCN-504




6
GPUs

    Architecture: “Midgard”
    GPU Micro-Architecture: Mali T-678




7
1000+




8
gem5
§  Architectural simulator
§  ARM has invested significantly in ARM support for gem5
    under the internal name “SystemExplorer”
     §  Plan to continue to invest over time
     §  ARMv7 support is extremely good today
     §  Plans to contribute ARMv8 support when complete
§  BSD licensed
§  Good platform for collaboration
     §  Base infrastructure is available and we can share bits beyond that




  9
SystemExplorer




10
OS Support in SystemExplorer
Ubuntu 12.04 (Linux kernel v3.3)   Android Jellybean (Kernel v2.6.38)




§ Latest Ubuntu and Android distributions



11
SystemExplorer Application Support
         Understanding how real application workloads and operating systems stress our IP

         [Figure: workloads surrounding the SystemExplorer Platforms, each marked Done, In Process, or Planned; some are marked "Includes kernel support"]
         §  Server Applications (single system simulation, and multi-system simulation with simulated Ethernet): SSJ, DaCapo, IOzone, Netperf, Webserver, DB
         §  HPC Applications: Mantevo, CESAR, ExaCT
         §  Graphics: Taji, Egypt, …
         §  Mobile Applications: Angry Birds, BBench, ToF, AndeBench, Replica, JS V8 Engine, AppLaunch, Caffeine Mark, WPS, Vellamo HTML5, Vellamo Metal, RLBench, UI Twiddle, AR, Vid Playback, Wireless Disp, Video Conf
         §  Legacy: EEMBC, SPEC2000
12
ARM gem5 Usage Continues to Grow
      [Chart: downloads per month (0 to 300) for alpha, arm, and x86, annotated "Overtake x86", "Overtake alpha", and "ARM #1"]





§    ARM gem5 exceeding both X86 and Alpha

13
gem5 Visualization with Streamline




14
SystemExplorer Dhrystone Correlation




15
SystemExplorer SPECint2000 Correlation




16
SystemExplorer EEMBC Correlation




17
High Performance Computing
§  High performance computing (HPC) is becoming much more
    pervasive.
§  Power efficiency and integration are becoming key factors in both
    large-scale and commercial HPC
§  2018-2022 DARPA/DOD/DOE Visions for HPC: 50 GFLOPS/W (20 pJ/FLOP)
      §  20 W Chip: Teraflop
      §  5 kW Chassis: Terascale
      §  20 kW Rack: Petascale
      §  20 MW Data Center: Exaflop
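
The efficiency target and the power budgets on this slide are tied together by simple arithmetic: 50 GFLOPS/W is 5e10 FLOP per joule, which is 20 pJ per FLOP, and multiplying the efficiency by each power budget gives the peak rate at that level. A minimal sketch of that arithmetic follows; the function names and the BUDGETS_WATTS table are illustrative and not from the slides.

```python
# Back-of-envelope check of the 50 GFLOPS/W (20 pJ/FLOP) target.
# Names below are illustrative; only the numbers come from the slide.

def picojoules_per_flop(gflops_per_watt):
    """Energy per floating-point operation, in picojoules."""
    flops_per_joule = gflops_per_watt * 1e9  # 1 W = 1 J/s
    return 1e12 / flops_per_joule

def peak_flops(power_watts, gflops_per_watt=50.0):
    """Peak FLOP/s reachable inside a power budget at a given efficiency."""
    return power_watts * gflops_per_watt * 1e9

# Power budgets quoted on the slide, in watts.
BUDGETS_WATTS = {
    "chip": 20.0,
    "chassis": 5e3,
    "rack": 20e3,
    "data center": 20e6,
}

if __name__ == "__main__":
    print(f"{picojoules_per_flop(50.0):.0f} pJ/FLOP")
    for level, watts in BUDGETS_WATTS.items():
        print(f"{level:12s} {peak_flops(watts):.2e} FLOP/s")
```

At 50 GFLOPS/W the 20 W chip works out to 1e12 FLOP/s, the 20 kW rack to 1e15, and the 20 MW data center to 1e18, which matches the Teraflop, Petascale, and Exaflop labels.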




                                             Medical/Pharma


 18
Why does ARM care about HPC?
§  We expect the challenges HPC experiences today to be
  similar to the enterprise challenges of tomorrow
    §  Data center networking is getting more advanced
    §  Energy will forever be a concern

§  ARM’s long-term vision is for ARM technology to be in all
  levels of compute
    §  Five years ago we announced the Cortex-M (Microcontroller) series
    §  ARM powers many hard-real-time systems (radio, automotive, etc.)
    §  Mobile devices
    §  Servers
    §  HPC is the only place you don’t find ARM technology today and we
      aim to change that


 19
First steps in ARM HPC:
§    Supercomputer investigation based
      on embedded (ARM) technology

§    Funded under FP7
       §  3-year IP Project (Start October 2011)
       §  Budget: 14.5 M€ (8.1 M€ from EC)
§    Project goals: physical prototype based on
      available embedded (ARM) technology and
      a design of a full next-gen system

§    Consortium includes experienced HPC
      developers and users:




  20
Mont-Blanc Roadmap
       A big challenge, and a huge opportunity for Europe

       [Chart: GFLOPS/W by year, 2011 to 2017, with three steps:
        "Built with the best of the market" (256 nodes, 250 GFLOPS, 1.7 kW),
        "Built with the best that is coming", and
        "What is the best that we could do?"]

       •     Prototypes are critical to accelerate software development
              •   System software stack + applications
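
For scale, the first prototype figures on this roadmap can be set against the 50 GFLOPS/W target quoted on the High Performance Computing slide. The sketch below assumes that 250 GFLOPS and 1.7 kW describe the same 256-node system, which the slide suggests but does not state outright; the function names are illustrative.

```python
# Compare the first Mont-Blanc prototype figures with the exascale target.
# Assumption: 250 GFLOPS and 1.7 kW refer to the same 256-node prototype.

def gflops_per_watt(gflops, watts):
    """Energy efficiency in GFLOPS per watt."""
    return gflops / watts

def gap_to_target(efficiency, target=50.0):
    """How many times more efficient the target is than the given system."""
    return target / efficiency

if __name__ == "__main__":
    eff = gflops_per_watt(250.0, 1700.0)
    print(f"prototype: {eff:.3f} GFLOPS/W")
    print(f"gap to 50 GFLOPS/W: about {gap_to_target(eff):.0f}x")
```

Under that assumption the prototype sits near 0.15 GFLOPS/W, a gap of roughly 340x to the target.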

  21
    HPC Advisory Council, Malaga, September 13, 2012
US DoE Exascale Timeline




22
Goals
§  Port co-design center proxy applications to ARM platform and
  take baseline measurements
§  Also execute HPC Challenge and FFTW benchmarks to
  complement proxy applications
§  Execute same set of workloads on gem5 with a configuration
  similar to an ARM hardware platform to get an idea of how
  well the simulator correlates
§  Use results as a baseline for understanding the current state
    of ARM for HPC, future optimizations and sensitivity studies
§  Since national labs aren’t as interested in 32-bit, use the
    process to refine methodology until 64-bit hardware and/or
    simulator becomes available


 23
Baseline Workload Characterization


      [Diagram with boxes: Co-Design Centers, National Labs, Workloads,
       HPC Disk Image, RTL Simulation, Characterization,
       Design Sensitivity Studies, Performance Projection]




 24
High Performance Computing Challenge
§  DARPA benchmark established to help evaluate systems in
    the HPCS program (which ultimately produced the Cray Cascade
    and IBM PERCS machines)
      §  LINPACK – stress peak floating point
      §  PTRANS – rate of transfer of large arrays
      §  GUPS – random updates of memory
      §  FFT – Fast Fourier Transform
      §  STREAM – measures sustainable memory bandwidth
      §  DGEMM – Double precision general matrix multiply
§  Generally run across a cluster with MPI, but can run single
    node and single core
§  Configurable to scale to different working set sizes
§  http://icl.cs.utk.edu/hpcc
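
To make "random updates of memory" concrete, here is a simplified, single-threaded GUPS-style kernel. It is a sketch of the idea, not the HPC Challenge reference code: the shift-and-XOR generator with polynomial 0x7 and the XOR update are modelled on the RandomAccess benchmark as I understand it, and the function names are my own. Because XOR is its own inverse, replaying the same update stream restores the table, which gives a cheap self-check.

```python
# Simplified GUPS-style kernel: XOR pseudo-random values into a table at
# pseudo-random locations. A sketch only, not the HPCC reference code.

MASK64 = (1 << 64) - 1
POLY = 0x7  # feedback polynomial for the 64-bit shift-register generator

def next_random(ran):
    """Advance the 64-bit pseudo-random sequence by one step."""
    high_bit = ran >> 63
    ran = (ran << 1) & MASK64
    return ran ^ POLY if high_bit else ran

def random_updates(table, n_updates, seed=1):
    """Apply n_updates random XOR updates in place and return the table."""
    size = len(table)
    if size & (size - 1):
        raise ValueError("table size must be a power of two")
    ran = seed
    for _ in range(n_updates):
        ran = next_random(ran)
        table[ran & (size - 1)] ^= ran  # index by the low bits
    return table

def gups(n_updates, seconds):
    """Giga-updates per second for a timed run."""
    return n_updates / seconds / 1e9

if __name__ == "__main__":
    table = list(range(1024))
    random_updates(table, 4096)
    random_updates(table, 4096)  # same stream again undoes every update
    print(table == list(range(1024)))
```

The access pattern has almost no locality, which is why the benchmark stresses the memory system rather than the floating-point units.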

 25
Mantevo Proxy Applications Suite
§  Developed at Sandia National Labs as an outgrowth of the
  Trilinos project, a collection of open-source scientific
  libraries, applications and benchmarks
§  Goals:
      §  Predict performance of real applications in new situations.
      §  Aid computer systems design decisions.
      §  Foster communication between applications, libraries and computer
            systems developers.
      §    Guide application and library developers in algorithm and software
            design choices for new systems.
      §    Provide open source software to promote informed algorithm,
            application and architecture decisions in the HPC community.
§  Released as open source:
      §  http://mantevo.org

 26
Co-Design Center Apps



CESAR: Center for Exascale Simulation of Advanced Reactors
•    Thermal Hydraulics: for the fluid codes (NEK 5000)*
•    Neutronics: for the Neutronics codes (MOCFE and OpenMC)
•    Coupling and Data Analytics for data intensive tasks: cian

ExaCT: Center for Exascale Simulation of Combustion in Turbulence
•    Exp_CNS_NoSpec: A simple stencil-based test code
•    MultiGrid_C: A multigrid-based solver for a model linear elliptic
     system based on a centered second-order discretization.
•    vodeDriver: chemical combustion kinetics

ExMatEx: Materials in Extreme Environments
•    CoMD – Molecular Dynamics
•    LULESH – Lagrangian Explicit Shock Hydrodynamics
•    VPFFT – Crystal viscoplasticity

     27
Workloads, benchmarks, & miniapps
Linpack        DGEMM       FFT




miniMD         OpenMD      Nekbone




miniFE         HPCCG       PHDmesh




PTRANS         STREAM      GUPS




 28
Workloads Instruction Mix
Linpack                       DGEMM                              FFT




miniMD                        OpenMD                             Nekbone




miniFE                        HPCCG                              PHDmesh




PTRANS                        STREAM                             GUPS



          Memory – Integer – SIMD Integer – Float – SIMD Float

 29
gem5 Methodology
§  Boot scripts are in m5-obj/config/boot/hpc
§  Base.rcS creates a checkpoint after boot and a 60-second
    “rest” period. It is set up to re-read the workload script after
    the checkpoint, so that the workload can be configured during
    restore, skipping the boot period.
§  Configs are self-contained in workloads; output is sent to the
    simulation host via m5 writefile.
§  I’ve got some bundled run scripts that handle establishing
    the base checkpoint, restoring the checkpoint, and
    executing workloads in atomic, A15, and A15 with periodic
    stats enabled
§  Runs are parameterized so that a complete run can finish in a
    reasonable amount of time with timing-approximate simulation
§  Disk image available, optimized for A15
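
The slides do not show Base.rcS itself, so the following is only a guess at its shape, written as a small Python helper that renders the guest-side boot script as text. The m5 subcommands used (checkpoint, readfile, writefile, exit) are part of the gem5 m5 utility; the /sbin/m5 location, the /tmp paths, and the helper name render_base_rcs are assumptions made for illustration, and exact m5 arguments should be checked against the installed utility.

```python
# Render a checkpoint-after-boot script of the kind described above.
# Hypothetical sketch: paths and the helper name are assumptions.

def render_base_rcs(rest_seconds=60, m5="/sbin/m5",
                    script_path="/tmp/workload.sh",
                    output_path="/tmp/workload.out"):
    """Return the text of a guest boot script for a checkpointed run."""
    lines = [
        "#!/bin/sh",
        "# Let the freshly booted system settle before the snapshot.",
        f"sleep {rest_seconds}",
        "# Save simulator state; every restore resumes right after this line.",
        f"{m5} checkpoint",
        "# After a restore, fetch whatever script the host supplied for this run.",
        f"{m5} readfile > {script_path}",
        f"chmod +x {script_path}",
        f"{script_path} > {output_path} 2>&1",
        "# Send the results to the simulation host and stop the simulation.",
        f"{m5} writefile {output_path}",
        f"{m5} exit",
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(render_base_rcs())
```

The point of the ordering is that everything before the checkpoint command runs once, while everything after it runs on each restore, so a different workload script can be supplied per run without paying for the boot again.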

 30
gem5 Correlation – Simple Memory
      [Bar chart: one bar per benchmark on a horizontal axis from -100% to 100%.
       Benchmarks: CoMD 8k, phdMesh, MD TIME, HPCCG P1 FLOP/s, CG_MFLOP/s,
       FFTW (SP), GUPS, EP-STREAM GB/s, G-PTRANS GB/s, EP-DGEMM MFlop/s,
       G-FFTR MFlop/s, G-HPL MFlop/s]
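
The slide does not say how the percentages were computed. One common convention, assumed here, is the signed relative error of the simulated result against the hardware measurement. The sketch below uses made-up numbers purely to show the calculation; none of them are results from the talk.

```python
# Signed relative error of simulation versus hardware, per benchmark.
# Assumption: this is the metric on the chart; the slide does not say.

def relative_error(simulated, measured):
    """(simulated - measured) / measured, as a fraction."""
    if measured == 0:
        raise ValueError("hardware measurement must be non-zero")
    return (simulated - measured) / measured

def correlation_report(results):
    """Map benchmark name -> signed relative error."""
    return {name: relative_error(sim, hw)
            for name, (sim, hw) in results.items()}

if __name__ == "__main__":
    # Hypothetical (simulated, measured) pairs, for illustration only.
    example = {
        "bandwidth_gbs": (0.9, 1.0),
        "mflops": (120.0, 100.0),
    }
    for name, err in correlation_report(example).items():
        print(f"{name:14s} {err:+.1%}")
```

For a higher-is-better metric such as GB/s, a negative value means the simulator is pessimistic; for a time-based metric the sign reads the other way round.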



 31
miniFE – finite element simulation
§  It assembles a sparse linear-system from the steady-state
  conduction equation on a brick-shaped problem domain of
  linear 8-node hex elements. It then solves the linear-system
  using a simple un-preconditioned conjugate-gradient
  algorithm.
Thus the kernels that it contains are:
§  computation of element-operators
     §  diffusion matrix, source vector
§  assembly
     §  scattering element-operators into sparse matrix and vector
§  sparse matrix-vector product
     §  during CG solve
§  vector operations (level-1 blas: axpy, dot, norm)
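
The solve phase can be sketched in a few lines. The code below is not miniFE: it swaps the 3D hex-element system for a small 1D Laplacian so that it stays self-contained, but it exercises the same kernels the slide lists, namely a sparse matrix-vector product (CSR storage), level-1 BLAS style vector operations, and an un-preconditioned conjugate-gradient loop. All function names are illustrative.

```python
# Un-preconditioned conjugate gradient on a CSR matrix, pure Python.
# Stand-in problem: 1D Laplacian instead of miniFE's 3D hex-element system.
import math

def spmv(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

def cg(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x with x = 0
    p = list(r)            # search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if math.sqrt(rr) <= tol:
            return x, it
        ap = spmv(row_ptr, col_idx, vals, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)   # p = r + beta * p
        rr = rr_new
    return x, max_iter

def laplacian_1d(n):
    """CSR arrays for the tridiagonal matrix with stencil (-1, 2, -1)."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        if i > 0:
            col_idx.append(i - 1); vals.append(-1.0)
        col_idx.append(i); vals.append(2.0)
        if i < n - 1:
            col_idx.append(i + 1); vals.append(-1.0)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals

if __name__ == "__main__":
    A = laplacian_1d(20)
    x_true = [float(i + 1) for i in range(20)]
    x, iters = cg(*A, spmv(*A, x_true))
    print(iters, max(abs(a - b) for a, b in zip(x, x_true)))
```

Each iteration costs one sparse matrix-vector product plus a handful of dot and axpy calls, which is why the slide singles those kernels out.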
 32
Profile: miniFE
§  Language & Runtime: reference code in C++, alternate
  versions for OpenMP, Cilk, Chapel, Qthreads, etc.
§  Library Dependencies: None
§  SLOCCOUNT: 2872 lines of code
§  A15 Perf Characteristics:
      §  Run Time: 217,413,167 cycles
      §  Max Heap Size: 14.54MB
      §  CPI: 1.6958
      §  L1D Miss Rate: 2.5%
      §  L2 Miss Rate: 6.68%
      §  Branch Mispredicts: 6.36%
[Chart: instruction mix by category: Int, Float, SIMD Int, SIMD Float, Memory, Other]
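
The cycle count and CPI in this profile imply an instruction count that the slide does not list. A minimal sketch of that derivation, with illustrative function names:

```python
# Derive instruction count and IPC from the cycle count and CPI on the slide.

def instructions_retired(cycles, cpi):
    """Instructions = cycles / (cycles per instruction)."""
    return cycles / cpi

def ipc(cpi):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi

if __name__ == "__main__":
    cycles, cpi = 217_413_167, 1.6958
    print(f"instructions: {instructions_retired(cycles, cpi):,.0f}")
    print(f"IPC: {ipc(cpi):.3f}")
```

This works out to roughly 128 million instructions at an IPC of about 0.59.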



 33
miniFE gem5 Cache Occupancy




34
miniFE Streamline Visualization




35
Workloads: Next Steps
§  More benchmarks
      §    Get big data analytic mini-apps and benchmarks working (Graph 500, Mantevo
            analytics mini-app, others?)
      §    Get an ExaCT benchmark working, incorporate forthcoming ASC benchmarks
§  More variations
      §    Multinode MPI, PGAS, and other runtimes
      §    OpenCL variants
      §    Hand-code NEON-optimized versions of key benchmarks
§  More Accuracy
      §    Continue calibrating the gem5 memory system against hardware to increase
            accuracy of memory-bound benchmarks
§  Systems Software Sensitivity Study
      §    OpenMPI versus MPICH versus LAMPI on ARM
      §    Operating System Version (3.7 has THP)
      §    armcc vs gcc vs gcc-dragon-egg versus clang (etc.)
§  Transition to 64-bit gem5 (and hardware) when available.
§  Integrate Mont-Blanc benchmarks and runtimes
§  Roll bare-metal version of co-design center workloads to make them more
  accessible to design teams.

 36
Simulation Driven Challenges
§  Performance
      §    Functional (atomic) mode simulates at MIPS rates; cycle-approximate mode
            (with memory models, cache models, etc.) simulates at KIPS rates, but longer
            runtimes with timing models give more representative results.
      §    Current methodology works off atomic checkpoints followed by short timing
            measurements, but can be refined to get better representation of multi-phase
            workloads
§  Scale
      §    gem5 is currently inherently serial, so adding cores or nodes to a simulation
            has a multiplicative effect on simulation time
      §    Multi-threading the simulation model at core, node, and cluster levels could help
            address this problem, but may impact granularity of timing accuracy.
§  Correlation
      §    Correlating a single-core simulation is hard; correlating multi-core is extremely
            difficult, as is multi-node.
§  Sensitivity Study State Space Explosion
      §    Many knobs to turn; determining which ones to turn in combination for the best
            effect is an on-going research problem.
§  Visualization
      §    Need better ways of visualizing performance characteristics, particularly at scale.
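
Two of these challenges are easy to quantify with a back-of-envelope model: how long a run takes at MIPS versus KIPS simulation rates, and how quickly a sensitivity sweep grows with the number of knobs. The sketch below is illustrative only. The simulation rates, the linear cost per simulated core, and the knob values are all assumptions chosen for the example, not figures from the talk.

```python
# Rough cost model for simulation time and sensitivity-sweep size.
# All rates and knob values below are hypothetical.
from math import prod

def sim_seconds(instructions, sim_rate_ips, n_cores=1):
    """Wall-clock seconds for a serial simulator.

    Assumes cost scales linearly with the number of simulated cores,
    each executing `instructions` instructions.
    """
    return instructions * n_cores / sim_rate_ips

def sweep_size(knobs):
    """Number of configurations in a full-factorial sweep."""
    return prod(len(values) for values in knobs.values())

if __name__ == "__main__":
    n = 1e9  # instructions per core (hypothetical workload)
    print("atomic, 1 MIPS:", sim_seconds(n, 1e6), "s")
    print("timing, 1 KIPS:", sim_seconds(n, 1e3), "s")
    print("timing, 4 cores:", sim_seconds(n, 1e3, n_cores=4), "s")

    knobs = {  # hypothetical design knobs
        "l2_size_kb": [256, 512, 1024, 2048],
        "issue_width": [2, 3, 4],
        "mem_latency_ns": [40, 60, 80],
        "prefetcher": [False, True],
    }
    print("configurations:", sweep_size(knobs))
```

Even this small example needs 72 runs for a full sweep, each taking days at KIPS rates, which is why short timing windows restored from atomic checkpoints are attractive.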

 37
Future Work: Integration with SST
§  SST: The Structural Simulation Toolkit
      §  Maintained by Sandia National Labs
      §  Component-based Discrete Event Model
      §  Already uses gem5 as a component (but not well integrated with ARM
            variant)
      §    Potential to help us scale out simulation as well as integrate with
            other simulations (fabric, etc.) to allow for end-to-end simulation of
            a large-scale supercomputer.
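
For readers unfamiliar with the term, a component-based discrete event model boils down to a time-ordered event queue plus components that react to events by scheduling new ones. The sketch below shows that pattern in miniature. It is not SST's API; the Simulator and Node classes and the link latency are invented for illustration.

```python
# Minimal component-based discrete event simulation.
# Not SST code: class names and parameters are invented for illustration.
import heapq

class Simulator:
    """Time-ordered event queue that dispatches events to handlers."""
    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker so handlers are never compared

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self._queue, (self.now + delay, self._seq, handler, payload))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, handler, payload = heapq.heappop(self._queue)
            handler(payload)

class Node:
    """Component that forwards a message to its peer over a link with latency."""
    def __init__(self, sim, name, link_latency, log):
        self.sim, self.name = sim, name
        self.link_latency, self.log = link_latency, log
        self.peer = None

    def receive(self, hops_left):
        self.log.append((self.sim.now, self.name, hops_left))
        if hops_left > 0:
            self.sim.schedule(self.link_latency, self.peer.receive, hops_left - 1)

if __name__ == "__main__":
    sim, log = Simulator(), []
    a = Node(sim, "a", 5, log)
    b = Node(sim, "b", 5, log)
    a.peer, b.peer = b, a
    sim.schedule(0, a.receive, 2)
    sim.run()
    print(log)
```

In a framework of this kind a CPU model such as gem5 becomes one component among many, exchanging timestamped events with network or memory components.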




 38
Links
§  More info on ARM including Research Papers
     §  http://infocenter.arm.com
§  gem5 (http://www.m5sim.org)
§  SST (http://sst.sandia.gov)
§  Mont-Blanc (http://montblanc-project.eu)
§  Exascale Initiative
     §  http://sites.google.com/a/lbl.gov/exascale-initiative/
§  Co-Design Center Proxy Apps
     §  Mantevo (http://mantevo.org)
     §  ExMatEx (http://exmatex.lanl.gov)
     §  ExaCT (http://exactcodesign.org)
     §  CESAR (http://cesar.mcs.anl.gov)

 39
Thanks!

     QUESTIONS?


40

More Related Content

What's hot

云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本
云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本
云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本Riquelme624
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
Multiple Shared Processor Pools In Power Systems
Multiple Shared Processor Pools In Power SystemsMultiple Shared Processor Pools In Power Systems
Multiple Shared Processor Pools In Power SystemsAndrey Klyachkin
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputinginside-BigData.com
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesAMD
 
Virtualization aware Java VM
Virtualization aware Java VMVirtualization aware Java VM
Virtualization aware Java VMTim Ellison
 
Presentation power vm virtualization without limits
Presentation   power vm virtualization without limitsPresentation   power vm virtualization without limits
Presentation power vm virtualization without limitssolarisyougood
 
Presentation power vm common 2012
Presentation   power vm common 2012Presentation   power vm common 2012
Presentation power vm common 2012solarisyougood
 
Matching Cisco and System p
Matching Cisco and System pMatching Cisco and System p
Matching Cisco and System pAndrey Klyachkin
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...Edge AI and Vision Alliance
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCinside-BigData.com
 
Aix The Future of UNIX
Aix The Future of UNIX Aix The Future of UNIX
Aix The Future of UNIX xKinAnx
 
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor CoreZen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor CoreAMD
 
Presentation power vm editions and power systems virtualization - basic
Presentation   power vm editions and power systems virtualization - basicPresentation   power vm editions and power systems virtualization - basic
Presentation power vm editions and power systems virtualization - basicsolarisyougood
 
Intel® Ethernet Update
Intel® Ethernet Update Intel® Ethernet Update
Intel® Ethernet Update Michelle Holley
 
Scaling the Container Dataplane
Scaling the Container Dataplane Scaling the Container Dataplane
Scaling the Container Dataplane Michelle Holley
 

What's hot (20)

云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本
云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本
云计算核心技术架构分论坛 一石三鸟 性能 功耗及成本
 
HPC Network Stack on ARM
HPC Network Stack on ARMHPC Network Stack on ARM
HPC Network Stack on ARM
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Xrm xensummit
Xrm xensummitXrm xensummit
Xrm xensummit
 
Multiple Shared Processor Pools In Power Systems
Multiple Shared Processor Pools In Power SystemsMultiple Shared Processor Pools In Power Systems
Multiple Shared Processor Pools In Power Systems
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
 
Virtualization aware Java VM
Virtualization aware Java VMVirtualization aware Java VM
Virtualization aware Java VM
 
Presentation power vm virtualization without limits
Presentation   power vm virtualization without limitsPresentation   power vm virtualization without limits
Presentation power vm virtualization without limits
 
Presentation power vm common 2012
Presentation   power vm common 2012Presentation   power vm common 2012
Presentation power vm common 2012
 
Matching Cisco and System p
Matching Cisco and System pMatching Cisco and System p
Matching Cisco and System p
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
 
9P Overview
9P Overview9P Overview
9P Overview
 
Aix The Future of UNIX
Aix The Future of UNIX Aix The Future of UNIX
Aix The Future of UNIX
 
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor CoreZen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
 
Presentation power vm editions and power systems virtualization - basic
Presentation   power vm editions and power systems virtualization - basicPresentation   power vm editions and power systems virtualization - basic
Presentation power vm editions and power systems virtualization - basic
 
Intel® Ethernet Update
Intel® Ethernet Update Intel® Ethernet Update
Intel® Ethernet Update
 
Scaling the Container Dataplane
Scaling the Container Dataplane Scaling the Container Dataplane
Scaling the Container Dataplane
 
zClouds - A better business Cloud
zClouds - A better business CloudzClouds - A better business Cloud
zClouds - A better business Cloud
 

Similar to Simulation Directed Co-Design from Smartphones to Supercomputers

Dc tco in_a_nutshell
Dc tco in_a_nutshellDc tco in_a_nutshell
Dc tco in_a_nutshellerjosito
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Community
 
Engineered Systems: Oracle’s Vision for the Future
Engineered Systems: Oracle’s Vision for the FutureEngineered Systems: Oracle’s Vision for the Future
Engineered Systems: Oracle’s Vision for the FutureBob Rhubart
 
Презентация команды "Обыватели"
Презентация команды "Обыватели"Презентация команды "Обыватели"
Презентация команды "Обыватели"Tatyana Savchyk
 
HP Server og Lagring SPOR 1
HP Server og Lagring SPOR 1HP Server og Lagring SPOR 1
HP Server og Lagring SPOR 1HP Norge
 
PCCC21:日本電気株式会社「一台何役?SX-Aurora TSUBASA最新情報」
PCCC21:日本電気株式会社「一台何役?SX-Aurora TSUBASA最新情報」PCCC21:日本電気株式会社「一台何役?SX-Aurora TSUBASA最新情報」
PCCC21:日本電気株式会社「一台何役?SX-Aurora TSUBASA最新情報」PC Cluster Consortium
 
Deview 2013 rise of the wimpy machines - john mao
Deview 2013   rise of the wimpy machines - john maoDeview 2013   rise of the wimpy machines - john mao
Deview 2013 rise of the wimpy machines - john maoNAVER D2
 
Track F- Designing the kiler soc - sonics
Track F- Designing the kiler soc - sonicsTrack F- Designing the kiler soc - sonics
Track F- Designing the kiler soc - sonicschiportal
 
XPDDS18: CPUFreq in Xen on ARM - Oleksandr Tyshchenko, EPAM Systems
XPDDS18: CPUFreq in Xen on ARM - Oleksandr Tyshchenko, EPAM SystemsXPDDS18: CPUFreq in Xen on ARM - Oleksandr Tyshchenko, EPAM Systems
XPDDS18: CPUFreq in Xen on ARM - Oleksandr Tyshchenko, EPAM SystemsThe Linux Foundation
 
OpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosOpenStack and OpenFlow Demos
OpenStack and OpenFlow DemosBrent Salisbury
 
StreamBase - Embedded Erjang - Erlang User Group London - 20th April 2011
StreamBase - Embedded Erjang - Erlang User Group London - 20th April 2011StreamBase - Embedded Erjang - Erlang User Group London - 20th April 2011
StreamBase - Embedded Erjang - Erlang User Group London - 20th April 2011darach
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...PT Datacomm Diangraha
 
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be SlowELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be SlowBenjamin Zores
 
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) ArchitectureNFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) ArchitectureMichelle Holley
 
A comprehensive formal verification solution for ARM based SOC design
A comprehensive formal verification solution for ARM based SOC design A comprehensive formal verification solution for ARM based SOC design
A comprehensive formal verification solution for ARM based SOC design chiportal
 
CMP315_Optimizing Network Performance for Amazon EC2 Instances
CMP315_Optimizing Network Performance for Amazon EC2 InstancesCMP315_Optimizing Network Performance for Amazon EC2 Instances
CMP315_Optimizing Network Performance for Amazon EC2 InstancesAmazon Web Services
 
Enterprise power systems transition to power7 technology
Enterprise power systems transition to power7 technologyEnterprise power systems transition to power7 technology
Enterprise power systems transition to power7 technologysolarisyougood
 

Similar to Simulation Directed Co-Design from Smartphones to Supercomputers (20)

Dc tco in_a_nutshell
Dc tco in_a_nutshellDc tco in_a_nutshell
Dc tco in_a_nutshell
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Simulation Directed Co-Design from Smartphones to Supercomputers

  • 1. Simulation Directed Co-Design from Smartphones to Supercomputers. Eric Van Hensbergen, ARM Research & Development, Austin, TX. FastPath 2013, April 21, 2013
  • 2. STATE § SHIFT: no longer solely rely on process reduction to improve performance; performance/power/cost will increasingly become reliant on integration § ARM: focuses on design and licensing of IP building blocks for SoCs (= LEGOs); building blocks effectively act as COTS-on-silicon; COTS-on-silicon encourages multi-suppliers through the ecosystem; it enables circuit boards to be integrated onto a single chip; technology DNA is power efficiency
  • 3. FLEXIBILITY § Build what you want? § Target your SoC to solve your problem: one size does not fit all; optimize power/performance for the domain § Utilize common infrastructure and components: leverage SW ecosystem and portability; leverage validated IP; proven design flows § Focus on adding value to solve your problems: add your application-specific IP; take everything else off the shelf (rich IP libraries) § Diverse and competitive IP vendors: leverage the ARM ecosystem
  • 4. MARKETS [Figure labels, in extracted order] Mobile: 3%, 4.6bn in 2011; Home: 40%, 0.4bn in 2011; Embedded: 25%, 2.3bn in 2011; Enterprise: 10%, 1.4bn in 2011
  • 5. PROCESSORS [Diagram] Architecture (“ARMv8”); processor micro-architecture (“Cortex-A57”); processor hard-macro (implementation)
  • 6. “On-Chip” INTERCONNECT [Diagram] Architecture (“AMBA”); RTL implementation (CCN-504)
  • 7. GPUs [Diagram] Architecture (“Midgard”); GPU micro-architecture (Mali T-678)
  • 9. gem5 § Architectural simulator § ARM has invested significantly in ARM support for gem5 under the internal name “SystemExplorer” § Plan to continue to invest over time § ARMv7 support is extremely good today § Plans to contribute ARMv8 support when complete § BSD licensed § Good platform for collaboration § Base infrastructure is available and we can share bits beyond that
  • 11. OS Support in SystemExplorer § Latest Ubuntu and Android distributions: Ubuntu 12.04 (Linux kernel v3.3); Android Jellybean (kernel v2.6.38)
  • 12. SystemExplorer Application Support § Understanding how real application workloads and operating systems stress our IP § SystemExplorer platforms: single-system simulation; multi-system simulation with simulated Ethernet § [Figure: workloads grouped into HPC, server, and mobile applications, each marked Done, In Process, Planned, or Legacy; includes kernel support] § Workload labels in the figure: Taji, Egypt, Angry Birds, BBench, SSJ, Graphics, ToF, DaCapo, AndEBench, Replica, IOzone, JS V8 Engine, AppLaunch, Caffeine Mark, WPS, Netperf, Vellamo (HTML5, Metal), Webserver, RLBench, UI Twiddle, HPC, Mantevo, CESAR, AR, Vid Playback, DB, ExaCT, Wireless Disp, Video Conf, EEMBC, SPEC2000
  • 13. ARM gem5 Usage Continues to Grow [Chart: downloads per month (0 to 300) for alpha, arm, and x86; annotations: “Overtake x86”, “Overtake alpha”, “ARM #1”] § ARM gem5 downloads now exceed both x86 and Alpha
  • 14. gem5 Visualization with Streamline
  • 18. High Performance Computing § High performance computing (HPC) is becoming much more pervasive. § Power efficiency and integration are becoming key factors in both large-scale and commercial HPC § 2018-2022 DARPA/DOD/DOE visions for HPC: 50 GFLOPS/W (20 pJ/FLOP) § [Figure: 20 W chip (teraflop), 5 kW chassis (terascale), 20 kW rack (petascale), 20 MW data center (exaflop); application label: Medical/Pharma]
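The power targets on this slide follow from simple arithmetic. As a rough sketch (the function and variable names are illustrative and not from the talk; only the numbers come from the slide), 50 GFLOPS/W corresponds to 20 pJ per floating-point operation, and multiplying that efficiency by each power budget gives the rate the budget could sustain:

```python
# Illustrative arithmetic only: names are hypothetical, numbers are from the slide.
TARGET_GFLOPS_PER_WATT = 50.0


def picojoules_per_flop(gflops_per_watt):
    """Energy per operation implied by an efficiency given in GFLOPS/W."""
    flops_per_joule = gflops_per_watt * 1e9  # 1 W = 1 J/s
    return 1e12 / flops_per_joule            # joules -> picojoules


def flops_at_power(watts, gflops_per_watt=TARGET_GFLOPS_PER_WATT):
    """FLOP/s achievable within a power budget at the given efficiency."""
    return watts * gflops_per_watt * 1e9


# Power budgets listed on the slide, in watts.
budgets_watts = {"chip": 20, "chassis": 5e3, "rack": 20e3, "data center": 20e6}

if __name__ == "__main__":
    print(f"{picojoules_per_flop(TARGET_GFLOPS_PER_WATT):.0f} pJ/FLOP")
    for name, watts in budgets_watts.items():
        print(f"{name:12s} {watts:>12,.0f} W -> {flops_at_power(watts):.2e} FLOP/s")
```

Under these assumptions a 20 W chip lands at 1 TFLOP/s, a 5 kW chassis at 250 TFLOP/s, a 20 kW rack at 1 PFLOP/s, and a 20 MW data center at 1 EFLOP/s, which is consistent with the teraflop-to-exaflop labels on the slide.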
  • 19. Why does ARM care about HPC? § We expect the challenges HPC experiences today to be similar to the enterprise challenges of tomorrow § Data center networking is getting more advanced § Energy will forever be a concern § ARM’s long-term vision is for ARM technology to be in all levels of compute § Five years ago we announced the Cortex-M (Microcontroller) series § ARM powers many hard-real-time systems (radio, automotive, etc.) § Mobile devices § Servers § HPC is the only place you don’t find ARM technology today, and we aim to change that
  • 20. First steps in ARM HPC: § Supercomputer investigation based on embedded (ARM) technology § Funded under FP7 § 3-year IP Project (start October 2011) § Budget: 14.5 M€ (8.1 M€ from EC) § Project goals: a physical prototype based on available embedded (ARM) technology and a design of a full next-gen system § Consortium includes experienced HPC developers and users
  • 21. Mont-Blanc Roadmap § A big challenge, and a huge opportunity for Europe § [Chart: GFLOPS/W over 2011-2017; stages labelled “256 nodes, 250 GFLOPS, 1.7 kW”, “Built with the best of the market”, “Built with the best that is coming”, and “What is the best that we could do?”] § Prototypes are critical to accelerate software development § System software stack + applications § (Source: HPC Advisory Council, Malaga, September 13, 2012)
  • 22. US DoE Exascale Timeline
  • 23. Goals § Port co-design center proxy applications to the ARM platform and take baseline measurements § Also execute the HPC Challenge and FFTW benchmarks to complement the proxy applications § Execute the same set of workloads on gem5 with a configuration similar to an ARM hardware platform to get an idea of how well the simulator correlates § Use the results as a baseline for understanding the current state of ARM for HPC, future optimizations, and sensitivity studies § Since national labs aren’t as interested in 32-bit, use the process to refine the methodology until 64-bit hardware and/or a 64-bit simulator becomes available
  • 24. Baseline Workload Characterization [Flow diagram; labels: Co-Design Centers, National Labs, Workloads, HPC Disk Image, Characterization, Simulation, Sensitivity Studies, Performance Projection, RTL Design]
  • 25. High Performance Computing Challenge § DARPA benchmark established to help evaluate systems in the HPCS program (which ultimately produced the Cray Cascade and IBM PERCS machines) § LINPACK – stresses peak floating-point performance § PTRANS – rate of transfer of large arrays § GUPS – random updates of memory § FFT – Fast Fourier Transform § STREAM – measures sustainable memory bandwidth § DGEMM – double-precision general matrix multiply § Generally run across a cluster with MPI, but can run on a single node and a single core § Configuration can scale to different working-set sizes § http://icl.cs.utk.edu/hpcc
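To make two of these kernels concrete, here is a minimal pure-Python sketch of a STREAM-style triad and a GUPS-style random-update loop. It is a simplification for illustration: the function names are invented here, the random stream comes from Python's `random` module rather than the generator the HPCC suite uses, and nothing is timed.

```python
import random


def stream_triad(b, c, scalar):
    """STREAM 'triad' kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def random_updates(table, n_updates, seed=0):
    """GUPS-style kernel: XOR pseudo-random values into random table slots.

    Simplified sketch: uses Python's random module, not the HPCC random stream.
    The table length must be a power of two so a bit mask selects the slot.
    """
    rng = random.Random(seed)
    mask = len(table) - 1
    for _ in range(n_updates):
        value = rng.getrandbits(64)
        table[value & mask] ^= value
    return table


if __name__ == "__main__":
    print(stream_triad([1.0, 2.0], [3.0, 4.0], 2.0))
    table = list(range(16))
    random_updates(table, 100, seed=1)
    random_updates(table, 100, seed=1)  # same stream again undoes the XORs
    print(table == list(range(16)))
```

The triad touches three arrays with almost no arithmetic, so it is limited by memory bandwidth, while the update loop hits unpredictable addresses and is limited by memory latency. Because XOR is its own inverse, replaying the same update stream restores the table, which gives a cheap correctness check.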
  • 26. Mantevo Proxy Applications Suite § Developed at Sandia National Labs as an outgrowth of the Trilinos project, which is a collection of open-source scientific libraries, applications and benchmarks § Goals: § Predict performance of real applications in new situations. § Aid computer systems design decisions. § Foster communication between applications, libraries and computer systems developers. § Guide application and library developers in algorithm and software design choices for new systems. § Provide open source software to promote informed algorithm, application and architecture decisions in the HPC community. § Released as open source: § http://mantevo.org
  • 27. Co-Design Center Apps § CESAR (Center for Exascale Simulation of Advanced Reactors): Thermal Hydraulics for the fluid codes (NEK 5000)*; Neutronics for the neutronics codes (MOCFE and OpenMC); Coupling and Data Analytics for data-intensive tasks (cian) § ExMatEx (Materials in Extreme Environments): CoMD – Molecular Dynamics; LULESH – Lagrangian Explicit Shock Hydrodynamics; VPFFT – Crystal viscoplasticity § ExaCT (Center for Exascale Simulation of Combustion in Turbulence): Exp_CNS_NoSpec, a simple stencil-based test code; MultiGrid_C, a multigrid-based solver for a model linear elliptic system based on a centered second-order discretization; vodeDriver, chemical combustion kinetics
  • 28. Workloads, benchmarks, & miniapps [Chart over: Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, GUPS]
  • 29. Workloads Instruction Mix [Chart: instruction mix per workload for Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, GUPS; categories: Memory, Integer, SIMD Integer, Float, SIMD Float]
  • 30. gem5 Methodology § Boot scripts are in m5-obj/config/boot/hpc § Base.rcS creates a checkpoint after boot and a 60-second “rest” period. It is set up to re-read the workload script after the checkpoint, so the workload can be configured during restore and the boot period is skipped. § Configs are self-contained in workloads; output is sent to the simulation host via m5 writefile. § Bundled run scripts handle establishing the base checkpoint, restoring it, and executing workloads in atomic, A15, and A15 with periodic stats enabled § Runs are parameterized so that a complete run can finish in a reasonable amount of time with timing-approximate simulation § Disk image available, optimized for A15
  • 31. gem5 Correlation – Simple Memory [Bar chart, axis from -100% to 100%, one bar per metric: CoMD 8k, phdMesh, MD TIME, HPCCG P1 FLOP/s, CG_MFLOP/s, FFTW (SP), GUPS, EP-STREAM GB/s, G-PTRANS GB/s, EP-DGEMM MFlop/s, G-FFTR MFlop/s, G-HPL MFlop/s]
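A correlation chart like this is typically built from a signed percentage difference between the simulated and measured value of each metric. The sketch below assumes that definition, which the -100% to 100% axis suggests but the transcript does not confirm. The function name and the sample numbers are hypothetical and are not results from the talk.

```python
def percent_error(simulated, measured):
    """Signed error of a simulated metric relative to the hardware measurement."""
    if measured == 0:
        raise ValueError("hardware measurement must be non-zero")
    return 100.0 * (simulated - measured) / measured


# Hypothetical (gem5, hardware) pairs for illustration only; not data from the talk.
example = {
    "EP-STREAM GB/s": (4.2, 3.5),
    "G-HPL MFlop/s": (900.0, 1000.0),
}

errors = {name: percent_error(sim, hw) for name, (sim, hw) in example.items()}

if __name__ == "__main__":
    for name, err in errors.items():
        print(f"{name:16s} {err:+.1f}%")
```

For rate metrics (GB/s, MFlop/s) a positive value means the simulator is optimistic; for time metrics the sign reads the other way, so the two kinds of metric should not be averaged together.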
  • 32. miniFE – finite element simulation § It assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements. It then solves the linear system using a simple un-preconditioned conjugate-gradient algorithm. Thus the kernels that it contains are: § computation of element operators (diffusion matrix, source vector) § assembly (scattering element operators into sparse matrix and vector) § sparse matrix-vector product (during CG solve) § vector operations (level-1 BLAS: axpy, dot, norm)
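The kernels listed above can be seen together in a small pure-Python sketch of un-preconditioned conjugate gradient on a CSR matrix. This is not miniFE's code: as a simplifying assumption it assembles a 1-D conduction matrix (tridiagonal 2, -1) instead of 3-D hex elements, and all function names are chosen here for illustration.

```python
import math


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x for A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(u, v):
    """Level-1 BLAS dot product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def axpy(alpha, x, y):
    """Level-1 BLAS axpy: return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Un-preconditioned CG for a symmetric positive-definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rr) < tol:      # norm of the residual
            break
        ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)  # p = r + beta * p
        rr = rr_new
    return x


def laplacian_1d_csr(n):
    """Assemble the 1-D conduction matrix (tridiagonal 2, -1) in CSR form."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


if __name__ == "__main__":
    A = laplacian_1d_csr(8)
    solution = conjugate_gradient(*A, b=[1.0] * 8)
    print([round(v, 6) for v in solution])
```

Nearly all of the time in the solve goes to `csr_matvec`, whose indirect `x[col_idx[k]]` loads are the kind of irregular memory access that makes a proxy like this sensitive to cache and memory-system modelling.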
  • 33. Profile: miniFE § Language & Runtime: reference code in C++, alternate versions for OpenMP, Cilk, Chapel, qthreads, etc. § Library Dependencies: None § SLOCCount: 2872 lines of code § A15 Perf Characteristics: § Run Time: 217,413,167 cycles § Max Heap Size: 14.54MB § CPI: 1.6958 § L1D Miss Rate: 2.5% § L2 Miss Rate: 6.68% § Branch Mispredicts: 6.36% § [Chart: instruction mix by category: Int, Float, SIMD Int, SIMD Float, Memory, Other]
  • 34. miniFE gem5 Cache Occupancy
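The run time and CPI on this slide are enough to back out the approximate dynamic instruction count of the run, since CPI is cycles divided by instructions. A small worked example (variable names are illustrative; the two inputs are the slide's figures):

```python
# Figures from the slide; the derived values are approximate.
cycles = 217_413_167
cpi = 1.6958

instructions = cycles / cpi  # CPI = cycles / instructions
ipc = 1.0 / cpi              # instructions per cycle

if __name__ == "__main__":
    print(f"instructions ~ {instructions / 1e6:.1f} million, IPC ~ {ipc:.2f}")
```

That works out to roughly 128 million instructions at about 0.59 instructions per cycle, which gives a sense of how short a parameterized run has to be to stay practical under timing simulation.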
  • 36. Workloads: Next Steps § More benchmarks § Get big data analytic mini-apps and benchmarks working (Graph 500, Mantevo analytics mini-app, others?) § Get an ExaCT benchmark working, incorporate forthcoming ASC benchmarks § More variations § Multinode MPI, PGAS, and other runtimes § OpenCL variants § Hand-code NEON-optimized versions of key benchmarks § More accuracy § Continue calibrating the gem5 memory system against hardware to increase accuracy of memory-bound benchmarks § Systems Software Sensitivity Study § OpenMPI versus MPICH versus LAMPI on ARM § Operating system version (3.7 has THP) § armcc vs gcc vs gcc-dragon-egg versus clang (etc.) § Transition to 64-bit gem5 (and hardware) when available. § Integrate Mont-Blanc benchmarks and runtimes § Roll a bare-metal version of the co-design center workloads to make them more accessible to design teams.
  • 37. Simulation Driven Challenges § Performance § In functional (atomic) mode, simulation speed is measured in MIPS; in cycle-approximate mode (with memory models, cache models, etc.) it drops to KIPS. The longer runtimes with timing models give more representative results. § The current methodology works off atomic checkpoints followed by short timing measurements, but it can be refined to better represent multi-phase workloads § Scale § gem5 is currently inherently serial, so adding cores or nodes to a simulation has a multiplicative effect on runtime § Multi-threading the simulation model at core, node, and cluster levels could help address this problem, but may impact the granularity of timing accuracy. § Correlation § Correlating a single-core simulation is hard; correlating multi-core is extremely difficult, as is multi-node. § Sensitivity Study State Space Explosion § There are many knobs to turn; determining which ones to turn in combination for the best effect is an ongoing research problem. § Visualization § Need better ways of visualizing performance characteristics, particularly at scale.
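Two of these challenges are easy to quantify with a back-of-the-envelope model. The sketch below is an assumption-laden illustration, not gem5 code: the simulation rates, the knob names, and their values are hypothetical, chosen only to show how the serial cost scales with core count and how quickly a full sensitivity sweep grows.

```python
import itertools
import math


def wall_clock_seconds(instructions_per_core, sim_rate_ips, cores=1):
    """Rough cost of a serial simulator: each simulated core adds its own work."""
    return instructions_per_core * cores / sim_rate_ips


# Hypothetical knob settings for a sensitivity study (not from the talk).
knobs = {
    "l1d_kib": [16, 32, 64],
    "l2_kib": [512, 1024, 2048],
    "issue_width": [2, 3, 4],
    "mem_latency_ns": [30, 60, 90],
}


def all_configs(knobs):
    """Yield every combination of knob settings as a dict."""
    names = list(knobs)
    for combo in itertools.product(*(knobs[n] for n in names)):
        yield dict(zip(names, combo))


# The number of full-factorial configurations is the product of the knob sizes.
n_configs = math.prod(len(v) for v in knobs.values())

if __name__ == "__main__":
    per_run = wall_clock_seconds(128e6, 100e3, cores=4)  # assumed 100 KIPS
    print(f"{n_configs} configs x {per_run:.0f} s = {n_configs * per_run / 3600:.1f} h")
```

Even four knobs with three settings each give 81 configurations, and every added knob multiplies that count again, which is why choosing which combinations to simulate matters more than raw simulator speed.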
  • 38. Future Work: Integration with SST § SST: The Structural Simulation Toolkit § Maintained by Sandia National Labs § Component-based discrete event model § Already uses gem5 as a component (but not well integrated with the ARM variant) § Potential to help us scale out simulation as well as integrate with other simulations (fabric, etc.) to allow for end-to-end simulation of a large-scale supercomputer.
  • 39. Links § More info on ARM including Research Papers § http://infocenter.arm.com § gem5 (http://www.m5sim.org) § SST (http://sst.sandia.gov) § Mont-Blanc (http://montblanc-project.eu) § Exascale Initiative § http://sites.google.com/a/lbl.gov/exascale-initiative/ § Co-Design Center Proxy Apps § Mantevo (http://mantevo.org) § ExMatEx (http://exmatex.lanl.gov) § ExaCT (http://exactcodesign.org) § CESAR (http://cesar.mcs.anl.gov)
  • 40. Thanks! QUESTIONS?