IBM Data Centric Systems & OpenPOWER
Yoonho Park
Research Staff Member/Senior Manager
Data Centric Systems Software, Cloud and Cognitive
IBM Research
Data Growth Outpaces Computing Technology Elements
Data volume grows exponentially, microprocessor clock rates have stalled, and network bandwidth cannot keep up.
[Chart: link bandwidth per direction (Gb/s), 2008-2018 - InfiniBand generations QDR, FDR (56G), EDR (100G), and HDR (200G)]
[Chart: projected data volume growth, 2010-2020, from sensors & devices, VoIP, enterprise data, and social media]
[Chart: sustained HDD I/O rates per GByte, 1990-2010 - desktop and server drives (5400/7200 RPM, 10K RPM, 15K RPM), trending down toward a 4TB SATA drive]
I/O performance per unit of capacity is losing ground…
Big Data and the New Era of Computing
Dimensions of data growth:
• Volume: terabytes to exabytes of existing data to process
• Variety: structured, unstructured, text, multimedia
• Velocity: streaming data, milliseconds to seconds to respond
• Veracity: uncertainty from inconsistency, ambiguities, etc.
[Chart: data volume is on the rise - volume in exabytes, roughly 3,000 in 2010 rising toward 9,000 by 2020, driven by sensors & devices, VoIP, enterprise data, and social media]
• Big Data analytics and exascale high performance computing face similar challenges: scale, performance, bandwidth, computational complexity
• IBM approach: move compute to data – Data Centric Systems (DCS)
Different Solutions for Different Types of Workflows
[Diagram: application space mapped by operation mix (floating point OPS vs. integer OPS) and spatial locality (low vs. high). Compute centric applications occupy the high-locality, floating-point region defined by LINPACK; data centric applications fall outside it. Conceptual views shown for three workflows: data intensive with floating point, data intensive integer, and data fusion/uncertainty quantification.]
Data Centric Workflows: Mixed Compute Capabilities Required
[Diagram: workflow domains and system scales - all source analytics, oil and gas, science, and financial analytics, with workloads such as graph analytics, reservoir, seismic, value at risk, and image analysis, spanning throughput to capability computing, at scales from 1 to 100's of racks (4-40, 2-20, 1-5+, 1-100's)]
Massively parallel compute capability:
• Simple kernels
• Ops dominated (e.g. DGEMM, LINPACK)
• Simple data access patterns
• Can be preplanned for high performance
Analytics capability:
• Complex code
• Data dependent code paths / computation
• Lots of indirection / pointer chasing
• Often memory system latency dependent
• C++ templated codes
• Limited opportunity for vectorization
• Limited scalability
• Limited threading opportunity
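To make the contrast concrete, here is a minimal CUDA C++ sketch (kernel names and sizes are illustrative, not from the slides): the first kernel is ops dominated with simple, preplannable access; the second chases pointers through an index array, so each load depends on the last and memory latency, not arithmetic, sets the pace.

```
#include <cuda_runtime.h>
#include <cstdio>

// Ops-dominated pattern: contiguous access, fixed trip count,
// easily preplanned and vectorized (DGEMM/LINPACK-like behavior).
__global__ void streaming(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = 2.0f * a[i] + b[i];   // arithmetic on predictable addresses
}

// Analytics pattern: each load depends on the previous one
// (pointer chasing), so the memory system dominates performance.
__global__ void chase(const int* next, int* result, int steps) {
    int cur = 0;
    for (int s = 0; s < steps; ++s)
        cur = next[cur];                    // serialized, data-dependent loads
    *result = cur;
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    int *next, *result;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&result, sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = 1.f; b[i] = 2.f; next[i] = (i * 31 + 1) % n; }
    streaming<<<(n + 255) / 256, 256>>>(a, b, c, n);  // wide, parallel, vector-friendly
    chase<<<1, 1>>>(next, result, 1 << 16);           // narrow, latency-bound walk
    cudaDeviceSynchronize();
    printf("c[0]=%f, walk end=%d\n", c[0], *result);
    return 0;
}
```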
Comparing Compute Centric to Data Centric
• Systems and solutions must become more data centric and data aware
  – Data movement minimized
    • Within the system
    • Within/across the end-to-end solution
  – Compute enabled at all levels
  – Workload/workflow driven system and solution design choices
  – Modular, composable solution architectures
  – Enhanced resource agility and sharing
• Focus must shift from algorithms to workflows
  – New end-to-end efficiencies and optimizations
  – Based on a data aware understanding of the full scope of resource requirements
    • Storage, networking, compute, applications, resource management, data centers
• Uncertainty of the future puts a premium on flexibility and innovation
[Diagram: traditional computing system vs. data centric computing system]
IBM Data Centric Design Principles
Principle 1: Minimize data motion
– Data motion is expensive
– Allow workloads to run where they run best
Principle 2: Enable compute in all levels of the system hierarchy
– HW & SW innovations to support / enable compute in data
Principle 3: Modularity
– Balanced, composable architecture for Big Data analytics, modeling and simulation
Principle 4: Application-driven design
– Use real workloads/workflows to drive design points
Principle 5: Leverage OpenPOWER to accelerate innovation and broaden diversity for clients
Massive data requirements drive a composable architecture for big data, complex analytics, modeling and simulation. The DCS architecture will appeal to segments experiencing an explosion of data and the associated computational demands.
IBM OpenPOWER-based HPC Roadmap (2015 / 2016 / 2017)
• IBM CPUs: POWER8 / POWER8+ / POWER9
• IBM Nodes: OpenPOWER CAPI interface / NVLink / Enhanced NVLink
• NVIDIA GPUs: Kepler (PCIe Gen3) / Pascal (NVLink) / Volta (Enhanced NVLink)
• Mellanox Interconnect Technology: Connect-IB, FDR InfiniBand, PCIe Gen3 / ConnectX-4, EDR InfiniBand, CAPI over PCIe Gen3 / ConnectX-5, next-generation InfiniBand, enhanced CAPI over PCIe Gen4
Heterogeneity is Key
OpenPOWER, a catalyst for Open Innovation
OpenPOWER is an open development community using the POWER architecture to serve the evolving needs of customers.
Drivers:
• Moore's law no longer delivers the expected performance gains
• Growing workload demands
• Numerous IT consumption models
• Mature open software ecosystem
What OpenPOWER offers:
• Rich software ecosystem
• Spectrum of POWER servers
• Multiple hardware options
• Derivative POWER chips
Pillars:
• Performance of the POWER architecture: amplified capability
• Open development: open software, open hardware
• Collaboration of thought leaders: simultaneous innovation across multiple disciplines
These feed back into one another, resulting in client choice.
140+ OpenPOWER Foundation Members
Member categories: Implementation, HPC & Research; Software; System Integration; I/O, Storage & Acceleration; Boards & Systems; Chips & SoCs
US & UK Research Centers Select OpenPOWER-based Supercomputers
• IBM, Mellanox, and NVIDIA awarded $325M to build the U.S. Department of Energy's CORAL supercomputers
• IBM & UK's STFC in a £313M partnership for Big Data & cognitive computing research
CORAL: leadership-class supercomputers, with 5X to 10X higher application performance than current systems
Hybrid CPU/GPU architecture
• At least 5X Titan / Sequoia application performance
• Approximately 3,400 nodes, each with:
  – Multiple IBM POWER9™ CPUs and multiple NVIDIA Tesla® GPUs using the NVIDIA Volta architecture
  – CPUs and GPUs completely connected with high speed NVLink
  – Large coherent memory: over 512 GB (HBM + DDR4)
  – All memory directly addressable from the CPUs and GPUs
• Over 40 TF peak performance per node
• Dual-rail Mellanox® EDR InfiniBand, full non-blocking fat-tree interconnect
• IBM Elastic Storage (GPFS™): 1 TB/s I/O and 120 PB disk capacity
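As a rough sanity check on these figures: approximately 3,400 nodes at over 40 TF each implies an aggregate peak on the order of 3,400 × 40 TF ≈ 136 PF (a lower bound implied by the numbers above, not a quoted system specification).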
Programming Approaches
• Accelerator approach:
  – Required when memories are not coherent
  – Each processor computes in its own private address space
  – Data objects are homed in CPU memory and copied to GPU memory for GPU execution; GPU engines act only on data in GPU memory
• Compute in a shared address space:
  – New option, now that CPU / GPU memories are coherent
  – Data objects can be in any physical memory domain
  – Processors (either CPU or GPU) can use data in place
  – No copies required
• Note: NUMA placement still has to be managed
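A minimal sketch of both approaches in CUDA C++ (array size and kernel are illustrative): the accelerator path stages data through explicit copies between private address spaces, while the shared-address-space path allocates once and lets either processor touch the data in place.

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;   // GPU works on whatever memory it is handed
}

int main() {
    const int n = 1 << 20;

    // Accelerator approach: private address spaces, explicit staging.
    float* host = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;
    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Shared address space: one allocation visible to CPU and GPU;
    // no explicit copies (migration/coherence handled by the system).
    float* shared;
    cudaMallocManaged(&shared, n * sizeof(float));
    for (int i = 0; i < n; ++i) shared[i] = 1.0f;   // CPU writes in place
    scale<<<(n + 255) / 256, 256>>>(shared, n);     // GPU reads/writes in place
    cudaDeviceSynchronize();
    printf("%f %f\n", host[0], shared[0]);          // both print 2.0

    cudaFree(dev); cudaFree(shared); free(host);
    return 0;
}
```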
Compute and Memory View – Emerging Approach
[Diagram: CPU 1 and GPUs 1 and 2, each with its own compute and memory, fully interconnected]
• Multiple compute engines: consider all engines as equal peers
• Multiple memories: consider all memories as equal peers
Compute and Memory View – Traditional Acceleration
[Diagram: an MPI task homes data in CPU memory; both data and compute flow to the accelerator and back]
Peer Processing
[Diagram: an MPI task with data resident in multiple memories; compute flows to the data, with no bulk data flow]
• Data can be placed at allocation, or can migrate under runtime control (e.g. UVM)
• Thread-like programming models (OpenMP 4.x, OpenACC, CUDA, …)
• Still under development
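A hedged sketch of the placement-versus-migration choice using CUDA unified memory (device IDs and sizes are illustrative): data is allocated once, optionally given a preferred home at allocation time, and then prefetched (migrated under runtime control) to whichever engine computes next.

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    int gpu = 0;
    float* data;
    cudaMallocManaged(&data, bytes);   // one allocation, visible to all engines

    // Placement at allocation time: advise the runtime where the data lives.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    for (int i = 0; i < n; ++i) data[i] = 0.0f;   // CPU initializes in place

    // Migration under runtime control: prefetch to the GPU for its phase...
    cudaMemPrefetchAsync(data, bytes, gpu);
    touch<<<(n + 255) / 256, 256>>>(data, n);

    // ...and back to the CPU for the next phase of the workflow.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);            // prints 1.0
    cudaFree(data);
    return 0;
}
```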
Future DCS Programming Model
Big Data workflows (Hadoop/Spark ecosystem):
• Unstructured or semi-structured data
• Collaborative filtering, clustering, web search, recommendation, …
• Real-time analysis, fraud and anomaly detection
• Sensor data filtering, classification, statistical averages/histograms, …
• Deep learning, support vector machines, …
• Graph analytics, from community finding to shortest path, …
• Software stack: Spark API, Hadoop API, Cassandra, …; Java, Scala, Python, TCP sockets, …
• Runs on commodity clusters (1GigE, TCP/IP, HDDs, …): high productivity, fault tolerance
HPC workflows (HPC ecosystem):
• Structured data
• Exascale data sets
• Primarily scientific calculations
• Ensemble analysis, sensitivity analysis, uncertainty quantification
• Software stack: Fortran/C, MPI+OpenMP, CUDA, Legion, …; UPC, OpenSHMEM, Charm++, GASNet, …; UCX, PAMI, Verbs, …
• Runs on OpenPOWER/DCS (CAPI, InfiniBand, flash, …): high performance, specialized hardware
Common challenges: network bandwidth/efficiency, stalling CPU clock rates, I/O performance, layered storage and new storage technologies.
Goal: an exascale runtime and middleware layer delivering both high productivity and high performance.
DCS OpenPOWER Software Contribution Examples
• Linux – NUMA support for hardware coherent GPU memory (see the sketch below)
• Provisioning – xCAT
• Burst Buffer – support for fast shared-file checkpointing
• CSM – Cluster System Management
• Compilers – LLVM
• Tools – ensure tools have appropriate APIs
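A hedged illustration of the first item: on Linux systems where hardware coherent GPU memory is onlined as memory-only NUMA nodes, standard NUMA APIs can inspect and place data. The sketch below uses libnuma (node numbers are system specific; compile with -lnuma):

```
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    // On a coherent CPU/GPU system, GPU memory may appear as extra nodes.
    printf("highest NUMA node: %d\n", numa_max_node());

    // Allocate 1 MB on a specific node (node 0 here; a GPU-backed node
    // would be selected the same way on a coherent system).
    size_t sz = 1 << 20;
    char* buf = (char*)numa_alloc_onnode(sz, 0);
    if (buf) {
        buf[0] = 42;            // touch the page to back the allocation
        numa_free(buf, sz);
    }
    return 0;
}
```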
Summary – IBM Data Centric Systems & OpenPOWER
• Experience/observation: extracting meaningful insights from Big Data and enabling real-time, predictive decision making requires computation techniques similar to those long characteristic of technical computing
  – Convergence in many future workflow requirements, including Big Data-driven analytics, modeling, visualization, and simulation
  – Will require optimized full-system design and integration
• IBM approach: data centric innovation in multiple areas, in open ecosystems, with workload-driven co-design
  – System architecture and design with modular building blocks
  – Hardware technologies
  – Integration of heterogeneous compute elements
  – Software enablement