Florin Manaila
HPC/Deep Learning Architect and Inventor
IBM Cognitive Systems Europe
florin.manaila@de.ibm.com
August 31, 2018
IBM PowerAI Deep Learning Platform
(architecture, hardware roadmap, future innovation)
Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
AI Infrastructure Stack
Vision
Enterprise
L1-L3 Support
Base
Transform & Prep
Data (ETL)
Micro-Services / Applications
AI APIs
(e.g., Watson)
In-House APIs
Machine & Deep Learning
Libraries & Frameworks
Distributed Computing
Data Lake & Data Stores
Segment Specific:
Finance, Retail, Healthcare,
Automotive
Speech, Vision,
NLP, Sentiment
TensorFlow, Caffe,
PyTorch
SparkML, Snap.ML
Spark, MPI
Hadoop HDFS,
NoSQL DBs,
Parallel File
System
Accelerated
Infrastructure
AI Infrastructure Stack Challenges
Transform & Prep
Data (ETL)
Micro-Services / Applications
AI APIs
(e.g., Watson)
In-House APIs
Machine & Deep Learning
Libraries & Frameworks
Distributed Computing
Data Lake & Data Stores
Data Prep, ETL, Curation,
Data Labeling
Performance to Reduce Training Time
Multi-tenant, GPU Virtualization,
DL Framework Scaling
Feature extraction, Selecting Right
Model, Hyper-parameter tuning
Finding Right “Tagged”
Data, Model Integrity
Use Case Identification,
Access to Enough Data
What’s in the training of deep neural networks?
Neural network model
Billions of parameters
Gigabytes
Computation
Iterative gradient based search
Millions of iterations
Mainly matrix operations
Data
Millions of images, sentences
Terabytes
Workload characteristics: Both compute and data intensive!
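The "iterative gradient based search" above can be illustrated at toy scale. This is a minimal sketch, not anything from PowerAI: it fits a one-feature logistic regression with plain gradient descent, and the data, learning rate, and iteration count are made-up values chosen only so the loop converges quickly. Real deep networks run the same kind of loop over billions of parameters with matrix operations on GPUs.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(xs, ys, w, b):
    """Average cross-entropy of the model (w, b) on the data."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def train_logistic(xs, ys, lr=0.5, iterations=2000):
    """Gradient descent: repeatedly step against the average gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data (illustrative only): negative inputs are class 0, positive are class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
print("loss before:", log_loss(xs, ys, 0.0, 0.0))
print("loss after: ", log_loss(xs, ys, w, b))
```

Each iteration touches every training example, which is why both the compute and the data volume grow together as models and datasets scale up.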
Deep Learning at work
Available options
[Figure: available options, arranged from longer training time to shorter training time]
Data processing stages for distributed deep learning
Training data
on storage
CPU:
Coordination
and data prep
GPU
computation
Parameter data
exchange
across systems
Network,
NVLink,
GPU Memory
POWER9
CPU
Storage
NVMe, SSD,
ESS
GPU
PCIe Gen. 4 2nd Gen
NVLink
Source: Hillery Hunter, IBM, GTC 2018
NVIDIA GPU implementation in AC922 Deep Learning System
NVLINK 2.0
Innovative Systems with NVLink 2.0:
• Faster GPU-GPU communication
• Breaks down barriers between CPU and GPU
• New system architectures
• Acceleration limited by PCIe Gen3
IBM AC922 Deep Learning System Architecture
AC922-GTG
IBM AC922 Deep Learning System Architecture
AC922-GTW
x86 GPU System vs IBM AC922 Deep Learning System
3D Image Segmentation Use Case
When factoring out this
inter-batch overhead the
NVLink 2.0 + Volta V100
combination is still 2.4x
faster than the PCIe Gen3
+ Volta V100 combination
Unified Memory with ATS on IBM POWER9
IBM POWER9 CPUs With NVLink Interconnect
ALLOCATION
• Automatic access to all system memory: malloc, stack, file system
ACCESS
• All data accessible concurrently from any processor, anytime
• Atomic operations resolved directly over NVLink
ATS & POWER9 FEATURES
• ATS allows GPUDirect RDMA to unified memory
• Managed memory is cache-coherent between CPU and GPU
• CPU has direct access to GPU memory without need for migration
IBM AC922 Deep Learning System
AC922-GTG
IBM AC922 Deep Learning System
AC922-GTW
IBM AC922 System
Options and Features
Processor Features
• 16-core processor module, 190W – 250W (2.25 GHz – 3.12 GHz)
• 20-core processor module, 190W – 250W (2.25 GHz – 2.80 GHz)
• 18-core processor module, 190W – 250W (2.98 GHz – 3.26 GHz)
• 22-core processor module, 190W – 250W (2.78 GHz – 3.07 GHz)
Memory Features
• 8GB IS RDIMM DDR4
• 16GB IS RDIMM DDR4
• 32GB IS RDIMM DDR4
• 64GB IS RDIMM DDR4
• 128GB IS RDIMM DDR4
Storage Features
• HDD 1TB 2.5” 7k RPM SATA
• HDD 2TB 2.5” 7k RPM SATA
• SSD 960GB 2.5” SATA
• SSD 1.92TB 2.5” SATA
• SSD 3.84TB 2.5” SATA
• 1.6TB NVMe Flash Adapter
• 3.2TB NVMe Flash Adapter
• 6.4TB NVMe Flash Adapter
OpenPower Recent Tests on PCIe Gen4
PCIe Adapter Features
• 4-Port Ethernet (4x 1Gb)
• 2-Port 40/100 GbE RoCE SFP+
• 2-Port Ethernet (10Gb)
• 4-Port Ethernet (2x 10Gb Optical + 2x 1Gb)
• 4-Port Ethernet Cu (2x 10Gb Cu + 2x 1Gb)
• 2-Port 10Gb/s NIC & RoCE SR/Cu
• 2-Port 25/10Gb/s NIC & RoCE SR/Cu
• 1-Port EDR 100Gb IB CX-5 CAPI
• 2-Port EDR 100Gb IB CX-5 CAPI
• 2-Port Fibre Channel (16Gb/s)
• 2-Port Fibre Channel (32Gb/s)
Accelerator Features
• NVIDIA V100 SXM2 16GB HBM2
• NVIDIA V100 SXM2 32GB HBM2
• Xilinx ADM-PCIE-8V3 FPGA
OpenCAPI 3.0
Data-Centric approach to server design
IBM AC922 Deep Learning System
Front and Rear View
Rear View / Front View
Volta SXM2 GPU Accelerator
NVIDIA GPU Details
[Figure: top side and bottom side of the module, with labels Multi Chip Module, Power Regulation, 2x 400 Pin Connectors, 2x Grounding Pads]
NVIDIA Volta Specifications
NVIDIA Tensor Cores 640
NVIDIA CUDA Cores 5120
Peak Double-Precision Performance 7.8 TFLOPS
Single-Precision Performance 15.7 TFLOPS
Tensor Performance 125 TFLOPS
Memory Bandwidth 900 GB/sec
GPU Memory Size 16 GB or 32 GB HBM2
NVLink “Bricks” (8 lane interface) 6
NVLink Interconnect Bi-Directional 300 GB/sec
Maximum Power 300W
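A quick arithmetic check of the figures above, as a sketch that uses only the numbers printed in the specification list. The variable names are illustrative.

```python
# Values copied from the specification list above.
NVLINK_BRICKS = 6
NVLINK_TOTAL_BIDIRECTIONAL_GB_S = 300
FP64_TFLOPS = 7.8
FP32_TFLOPS = 15.7
TENSOR_TFLOPS = 125

# Bandwidth contributed by one NVLink brick.
per_brick_bidirectional = NVLINK_TOTAL_BIDIRECTIONAL_GB_S / NVLINK_BRICKS
per_brick_per_direction = per_brick_bidirectional / 2

# How the peak rates relate to each other.
tensor_vs_fp32 = TENSOR_TFLOPS / FP32_TFLOPS
fp64_vs_fp32 = FP64_TFLOPS / FP32_TFLOPS

print(per_brick_bidirectional, "GB/s per brick, both directions")
print(per_brick_per_direction, "GB/s per brick, one direction")
print(round(tensor_vs_fp32, 1), "x tensor peak over single precision")
print(round(fp64_vs_fp32, 2), "x double precision relative to single precision")
```

So each of the six bricks carries 50 GB/sec bi-directionally, and the Tensor Core peak is roughly eight times the single-precision peak.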
Server-based FPGA: e.g. ADM-PCIE-8V3
Features
• Board Format : Half-Length, low profile x16 PCIe form factor
• Host I/F : PCI Express Gen3 x8
• Target Device : Xilinx Virtex Ultrascale : XCVU095-2 - FFVC1517
• SDRAM : 2x banks of 1G x 72, DDR4-2400 (16GiB total), upgradable to 16GiB, DDR4-1866 (dual bank devices), per bank (32 GiB total)
• FLASH : On-board re-programmable flash memory for embedded configuration
• Optional integrated Board Support Package (BSP) including extensive FPGA example designs, plug and play drivers, and a mature Application Programming Interface (API)
CAPI Advantages on AC922 Deep Learning System
OpenBMC is a free, open source management software Linux distribution
Feature List:
• REST Management
• IPMI
• SSH based SOL
• Power and Cooling Management
• Event Logs
• Zeroconf discoverable
• Sensors
• Inventory
• LED Management
• Host Watchdog
• Simulation
• Code Update Support for multiple BMC/BIOS images
• POWER On Chip Controller (OCC) Support
Features In Progress:
• Full IPMI 2.0 Compliance with DCMI
• Verified Boot
• HTML5 JavaScript Web User Interface
• BMC RAS
IBM is the OpenBMC Community Leader
• Facebook
• Google
• IBM
• Intel
• Microsoft
• OCP
IBM Deep Learning Software Stack
Reference Architecture for AI Infrastructure: Software
IBM PowerAI at a glance
June 2018 update
IBM PowerAI Base @hub.docker.com
IBM PowerAI Base usage at a glance
PowerAI framework activation (Python2 or Python3)
• Activation scripts are used to manage system and Python paths
• To activate PowerAI deep learning frameworks:
$ source /opt/DL/<framework-name>/bin/<framework-name>-activate
This script sets PATH and PYTHONPATH to the appropriate values for the desired deep learning framework as it resides in the /opt/DL directory.
• <framework-name>-activate will also call check_dependencies
• Activation will only happen if all dependencies are met
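The path pattern above can be expanded mechanically. The helper below is a hypothetical convenience, not part of PowerAI; it only builds the command string from the pattern shown on the slide, and "tensorflow" is used as an example framework name on the assumption that it is installed under /opt/DL.

```python
import posixpath


def activation_script(framework, root="/opt/DL"):
    """Return the activation script path for one framework (pattern from the slide)."""
    return posixpath.join(root, framework, "bin", framework + "-activate")


def activation_command(framework, root="/opt/DL"):
    """Return the shell command that would be sourced to activate the framework."""
    return "source " + activation_script(framework, root)


# "tensorflow" is an example name; substitute whichever framework is installed.
print(activation_command("tensorflow"))
# prints: source /opt/DL/tensorflow/bin/tensorflow-activate
```

The command has to be sourced in the current shell rather than run as a subprocess, because it works by changing PATH and PYTHONPATH for that shell.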
What data science methods are used at work?
libGLM (C++ / CUDA Optimized Primitive Lib)
Distributed Training
Logistic Regression
Linear Regression
Support Vector Machines (SVM)
Distributed Hyper-Parameter Optimization
More Coming Soon
APIs for Popular ML Frameworks
IBM Snap ML, part of PowerAI Base
Distributed GPU-Accelerated Machine Learning Library
(coming soon)
Snap Machine Learning (ML) Library
Snap ML: Training Time Goes From An Hour to Minutes
46x faster than previous record set by Google
Workload: Click-through rate prediction for advertising
Logistic Regression Classifier in Snap ML using GPUs vs TensorFlow using CPU-only
[Chart: Runtime (Minutes). Google, CPU-only: 1.1 Hours. Snap ML, Power + GPU: 1.53 Minutes. 46x Faster.]
Dataset: Criteo Terabyte Click Logs
(http://labs.criteo.com/2013/12/download-terabyte-click-logs/)
4 billion training examples, 1 million features
Model: Logistic Regression: TensorFlow vs Snap ML
Test LogLoss: 0.1293 (Google using TensorFlow), 0.1292 (Snap ML)
Platform: 89 CPU-only machines at Google using TensorFlow versus
4 AC922 servers (each 2 Power9 CPUs + 4 V100 GPUs) for Snap ML
Google data from this Google blog
90 x86 Servers (CPU-only) vs 4 Power9 Servers With GPUs
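The headline speedup can be sanity-checked from the two runtimes on the chart. This is only an arithmetic sketch using the slide's own numbers. The rounded "1.1 Hours" figure gives a ratio of about 43x, so the 46x headline presumably comes from an unrounded baseline of roughly 70 minutes.

```python
# Runtimes as printed on the chart.
baseline_hours = 1.1      # TensorFlow, CPU-only
snap_ml_minutes = 1.53    # Snap ML, Power + GPU
headline_speedup = 46

baseline_minutes = baseline_hours * 60
speedup_from_chart = baseline_minutes / snap_ml_minutes

# Baseline runtime that would produce exactly the headline figure.
implied_baseline_minutes = headline_speedup * snap_ml_minutes

print(round(speedup_from_chart, 1), "x from the rounded chart values")
print(round(implied_baseline_minutes, 1), "minutes baseline implied by 46x")
```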
IBM PowerAI Enterprise V1.1
Announced in June 2018
Deep Learning Impact (DLI) Module
Data & Model Management, ETL, Visualize, Advise
IBM Conductor with Spark
Cluster Virtualization, Auto Hyper-Parameter Optimization
PowerAI: Open Source ML Frameworks
Large Model Support (LMS)
Distributed Deep Learning (DDL)
Auto ML
Enterprise
Accelerated Infrastructure
Enterprise
IBM PowerAI Enterprise V1.1
Announced in June 2018
Deep Learning Impact
Data Management and ETL
Training visualization and monitoring
Hyper-parameter optimization
Spectrum Conductor
Multi-tenancy support & security
User reporting & charge back
Dynamic resource allocation
External data connectors
Distributed Deep Learning (DDL)
Support Line L1-L3
Accelerated Infrastructure
Real time monitoring of hyper parameters in PowerAI Enterprise
Hyper-parameter Tuning/Search in PowerAI Enterprise
Hyper-parameters
– Learning rate
– Decay rate
– Batch size
– Optimizer:
  • GradientDescent,
  • Adadelta,
  • Momentum,
  • RMSProp
  • …
– Momentum (for some optimizers)
– LSTM hidden unit size (for models which use LSTM)
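One common way to search over hyper-parameters like these is random search. The sketch below is a generic illustration, not the PowerAI Enterprise implementation: the value ranges are assumptions, and `mock_validation_loss` is a hypothetical stand-in for an actual training run.

```python
import math
import random

# Assumed search ranges; only the hyper-parameter names come from the list above.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),   # sampled log-uniformly
    "decay_rate": (0.90, 0.99),
    "batch_size": [32, 64, 128, 256],
    "optimizer": ["GradientDescent", "Adadelta", "Momentum", "RMSProp"],
}


def sample_config(rng):
    """Draw one random configuration from the search space."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "decay_rate": rng.uniform(*SEARCH_SPACE["decay_rate"]),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "optimizer": rng.choice(SEARCH_SPACE["optimizer"]),
    }


def mock_validation_loss(config):
    """Hypothetical objective standing in for 'train the model, measure the loss'."""
    return abs(math.log10(config["learning_rate"]) + 2.0) + (1.0 - config["decay_rate"])


def random_search(trials, seed=0):
    """Evaluate `trials` random configurations and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        config = sample_config(rng)
        loss = mock_validation_loss(config)
        if best is None or loss < best[0]:
            best = (loss, config)
    return best


best_loss, best_config = random_search(trials=50)
print(best_loss, best_config)
```

In a real setup each trial is a full training job, which is why tuning benefits from running many trials in parallel across a GPU cluster.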
Who are the typical personas for computer vision solutions?
Steps for Deep Learning Development
Define training task
Prepare training data
Data pre-processing
DNN model selection
Configure the training hyper-parameters
DNN model training start
Package the new DNN model together with preprocessing into inference proc.
Application development with inference API
DL training framework preparation
Danielle
How to Simplify Deep Learning Adoption?
• Format transformation
• Support both training and evaluation sets
• Support different pre-processing plugins
• Provide base models for different scenarios
• Predict training time
• Training process visualization
• Training with GPU
• Scalability and HA deployment
IBM PowerAI Vision
Simplify Deep Learning Adoption
Users can use the deployed API for visual recognition
PowerAI Vision
Iris
Danny
Define training task
Prepare training data
Data pre-processing
DNN model selection
Configure the training hyper-parameters
DNN model training
Package the new DNN model together with preprocessing into inference proc.
DL training framework preparation
What are we solving?
Data
Up & Running
Data Pre-Processing
Build, Train, Optimize
Deploy & Infer
Maintain Model Accuracy
• Training visualization & accuracy monitoring
• Customize parameters for training
• Datasets for classification
• Datasets for object detection
• Semi-auto labeling on videos
• Pre-bundled models dataset creation
• Data augmentation
• REST APIs for creating datasets
• Export/Import datasets
• Custom DNN models
• Hyper-parameter search and tuning
• REST APIs to infer with images/videos
• Inference Engine for compiling accelerated models on edge
Image Analyst
Data Scientist
• Simplified installation and deployment
Developer
• Deploy where trained
• Optimized models for few categories
• Visualize progress and early warning
• Customize models for pre-processing
• User interface is deployed
• Validate trained models
• Build audit systems based on low inference scores
Most vendors address this space
IBM PowerAI Vision
Lowers the barriers for creating Computer Vision related AI solutions.
Semi-Automatic Labeling from video content
Define Labels
Manually Label: Manually Label Some Images / Video Frames
Train DL Model
Use Trained DL Model: Run Trained DL Model on Entire Input Data to Generate Labels
Correct Labels on Some Data: Manually Correct Labels on Some Data
Repeat Till Labels Achieve Desired Accuracy
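The loop above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the PowerAI Vision implementation: `manual_label`, `train`, and `predict` are hypothetical callables supplied by the caller, and the toy "model" at the bottom simply memorizes the labels it has seen.

```python
def semi_auto_label(frames, manual_label, train, predict,
                    seed_count=2, corrections_per_round=2,
                    target_accuracy=1.0, max_rounds=20):
    """Label frames by alternating model predictions with manual corrections."""
    # Step 1: a person labels a small seed set.
    labeled = {f: manual_label(f) for f in frames[:seed_count]}
    predicted = {}
    for round_no in range(1, max_rounds + 1):
        # Step 2: train on everything labeled so far, then label all frames.
        model = train(labeled)
        predicted = {f: predict(model, f) for f in frames}
        # Step 3: review. For simplicity this checks every frame against the
        # human label; in practice only a sample would be reviewed.
        wrong = [f for f in frames if predicted[f] != manual_label(f)]
        accuracy = 1.0 - len(wrong) / len(frames)
        if accuracy >= target_accuracy:
            return predicted, round_no
        # Step 4: a person corrects a few wrong labels, then the loop repeats.
        for f in wrong[:corrections_per_round]:
            labeled[f] = manual_label(f)
    return predicted, max_rounds


# Toy setup (all hypothetical): frames are integers, the true label is parity.
frames = list(range(8))
true_label = lambda f: "even" if f % 2 == 0 else "odd"
memorize = lambda labeled: dict(labeled)          # "training" just stores labels
guess = lambda model, f: model.get(f, "even")     # unseen frames default to "even"

labels, rounds = semi_auto_label(frames, true_label, memorize, guess)
print(rounds, labels)
```

With a real model the manual effort shrinks each round, because the retrained model generalizes from the corrections instead of memorizing them.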
Delivered Pre-Trained Models
Time and Data Matters
[Figure: a pre-trained Convolutional Neural Network (CNN) is fine-tuned (weights W) for a new task instead of being recreated; example output classes: Mergus, Larus, …, Corvus. Persona: Sourav]
IBM PowerAI Vision: Deep Learning Development Platform for Computer Vision
PowerAI Vision APIs
Inference APIs for Object Detection (example)
Developers can use these APIs from any IP device for object detection with the model deployed in PowerAI Vision
http://IP:PORT/ (of the deployed inference instance)
/test
GET: Only tests whether the monitor service is running.
/detect_url
GET: Supply an image by URL and detect objects in it
/detect_upload
POST: Post an image file and run object detection on it
Inference return:
{'confidence': 0.9038739204406738, 'ymax': 145, 'label': 'badge', 'xmax': 172, 'xmin': 157, 'ymin': 123}
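A minimal sketch of working with this return value on the client side. The host and port are placeholders, the helper names are hypothetical, and the parsing assumes the service returns text shaped exactly like the sample above (single-quoted keys, hence `ast.literal_eval` rather than a JSON parser). Request parameters are not shown on the slide, so none are assumed here.

```python
import ast

# Sample return value copied from the slide.
SAMPLE_RETURN = ("{'confidence': 0.9038739204406738, 'ymax': 145, 'label': 'badge', "
                 "'xmax': 172, 'xmin': 157, 'ymin': 123}")


def endpoint(host, port, path):
    """Build the URL of one inference endpoint, e.g. /test or /detect_upload."""
    return "http://{}:{}/{}".format(host, port, path.lstrip("/"))


def parse_detection(text):
    """Turn one returned detection into a dict and add the box size in pixels."""
    detection = ast.literal_eval(text)
    detection["width"] = detection["xmax"] - detection["xmin"]
    detection["height"] = detection["ymax"] - detection["ymin"]
    return detection


def is_confident(detection, threshold=0.5):
    """Keep only detections at or above a chosen confidence threshold."""
    return detection["confidence"] >= threshold


detection = parse_detection(SAMPLE_RETURN)
print(endpoint("192.0.2.10", 8080, "/test"))   # placeholder address
print(detection["label"], detection["width"], detection["height"])
```

The bounding box in the sample is 15 pixels wide and 22 pixels high, with a confidence of about 0.90 for the label "badge".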
IBM PowerAI Vision
Inference Server
PowerAI Inference Engine (PIE)
Automatically Map Trained AI Models to Cloud or Edge
Data Center: Train model & Compile to Edge
Trained DNN model
PowerAI Inference Engine
• DNN model parser
• DNN model analyzer
• NN structure
• Backend specific optimization
• Estimate resources & performance
• Mapping to back ends
Map to Different Platforms (Cloud or Edge)
• CPU + GPU
• Neural network processor
• Embedded GPU
• Embedded FPGA
• CPUs, GPUs
Inference at the edge
"How can I accelerate models for the edge?" -- Developer
Compile accelerated models for FPGAs, NVIDIA TX1/TX2* & Raspberry Pi*
*coming soon
Edge FPGA: e.g. TySOM-3 Embedded Prototyping Board
Features
• TySOM-3-ZU7 is a compact prototyping board containing a Zynq UltraScale+ MPSoC device, which provides 64-bit processor scalability while combining real-time control with soft and hard engines for graphics, video, waveform, and packet processing.
• The Xilinx Zynq UltraScale+ ZU7EV-FFVC1156 MPSoC contains a Video Codec Unit that supports H.264/H.265, and it has the biggest FPGA in the UltraScale+™ MPSoC family.
• The chip includes a quad-core ARM Cortex-A53 as the Application Processing Unit, a dual-core ARM Cortex-R5 as the Real-Time Processing Unit, and an ARM Mali-400 MP2 as the Graphics Processing Unit.
Enterprise AI your way
Deep Learning Containers on AC922 with Kubernetes
PowerAI on IBM Cloud Private
Deployed on AC922
H2O Driverless AI on IBM Cloud Private
Deployed on AC922
IBM AC922 Deep Learning Cluster Architecture Overview
Containerized environment - 40x NVIDIA Volta V100 GPUs
IBM AC922 Deep Learning System Cluster POD
40x NVIDIA Volta V100 GPUs
Thank you
Florin Manaila
Cognitive Systems Europe
HPC/Deep Learning Senior IT Architect
—
florin.manaila@de.ibm.com
+49-7034-274-5294
ibm.com

PowerAI Deep dive

  • 1.
    Florin Manaila HPC/Deep LearningArchitect and Inventor IBM Cognitive Systems Europe florin.manaila@de.ibm.com August 31, 2018 IBM PowerAI Deep Learning Platform (architecture, hardware roadmap, future innovation)
  • 2.
    2Cognitive Systems Europe/ August 31 / © 2018 IBM Corporation AI Infrastructure Stack Vision Enterprise L1-L3 Support Base Transform & Prep Data (ETL) Micro-Services / Applications AI APIs (Eg: Watson) In-House APIs Machine & Deep Learning Libraries & Frameworks Distributed Computing Data Lake & Data Stores Segment Specific: Finance, Retail, Healthcare, Automotive Speech, Vision, NLP, Sentiment TensorFlow, Caffe, Pytoch SparkML, Snap.ML Spark, MPI Hadoop HDFS, NoSQL DBs, Parallel File System Accelerated Infrastructure
  • 3.
    3 AI Infrastructure StackChallenges Transform & Prep Data (ETL) Micro-Services / Applications AI APIs (Eg: Watson) In-House APIs Machine & Deep Learning Libraries & Frameworks Distributed Computing Data Lake & Data Stores Data Prep, ETL, Curation, Data Labeling Performance to Reduce Training Time Multi-tenant, GPU Virtualization, DL Framework Scaling Feature extraction, Selecting Right Model, Hyper-parameter tuning Finding Right “Tagged” Data, Model Integrity Use Case Identification, Access to Enough Data Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 4.
    What’s in thetraining of deep neural networks? Neural network model Billions of parameters Gigabytes Computation Iterative gradient based search Millions of iterations Mainly matrix operations Data Millions of images, sentences Terabytes Workload characteristics: Both compute and data intensive! 4Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 5.
    Deep Learning atwork Available options 5 Longer Training Time Shorter Training Time Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 6.
    Data processing stagesfor distributed deep learning Training data on storage CPU: Coordination and data prep GPU computation Parameter data exchange across systems Network, NVLink, GPU Memory POWER9 CPU Storage NVMe, SSD, ESS GPU PCIe Gen. 4 2nd Gen NVLink Source: Hillery Hunter, IBM, GTC 2018 6Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 7.
    NVIDIA GPU implementationin AC922 Deep Learning System NVLINK 2.0 Innovative Systems with NVLink 2.0: • Faster GPU-GPU communication • Breaks down barriers between CPU and GPU • New system architectures • Acceleration limited by PCIe Gen3 7Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 8.
    IBM AC922 DeepLearning System Architecture AC922-GTG 8Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 9.
    IBM AC922 DeepLearning System Architecture AC922-GTW 9Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 10.
    x86 GPU Systemvs IBM AC922 Deep Learning System 3D Image Segmentation Use Case 10 When factoring out this inter-batch overhead the NVLink 2.0 + Volta V100 combination is still 2.4x faster than the PCIe Gen3 + Volta V100 combination Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 11.
    Unified Memory withATS on IBM POWER9 IBM POWER9 CPUs With NVLink Interconnect 11 ALLOCATION  Automatic access to all system memory: malloc, stack, file system ACCESS  All data accessible concurrently from any processor, anytime  Atomic operations resolved directly over NVLink ATS & POWER9 FEATURES  ATS allows GPUDirect RDMA to unified memory  Managed memory is cache-coherent between CPU and GPU  CPU has direct access to GPU memory without need for migration Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 12.
    IBM AC922 DeepLearning System AC922-GTG 12Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 13.
    IBM AC922 DeepLearning System AC922-GTW 13Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 14.
    IBM AC922 System Optionsand Features 14 Processor Features  16 Core Processor Module 190W – 250W (2.25GHZ - 3.12GHZ)  20 Core Processor Module 190W – 250W (2.25GHZ - 2.80GHZ)  18 Core Processor Module 190W – 250W (2.98GHZ - 3.26GHZ)  22 Core Processor Module 190W – 250W (2.78GHZ - 3.07GHZ) Memory Features  8GB IS RDIMM DDR4  16GB IS RDIMM DDR4  32GB IS RDIMM DDR4  64GB IS RDIMM DDR4  128GB IS RDIMM DDR4 Storage Features  HDD 1TB 2.5” 7k RPM SATA  HDD 2TB 2.5” 7k RPM SATA  SSD 960GB 2.5” SATA  SSD 1.92TB 2.5” SATA  SSD 3.84TB 2.5” SATA  1.6TB NVMe Flash Adapter  3.2TB NVMe Flash Adapter  6.4TB NVMe Flash Adapter Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 15.
    OpenPower Recent Testson PCIe Gen4 15Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 16.
    16 PCIe Adapter Features 4-Port Ethernet (4x1 1Gb)  2-Port 40/100 GbE RoCE SFP+  2-Port Ethernet (10Gb)  4-Port Ethernet (2x10 10Gb Optical + 2x 1Gb)  4-Port Ethernet Cu (2x10 10Gb CU + 2x 1Gb)  2 Port 10Gb/s NIC & ROCE SR/CU  2 Port 25/10Gb/s NIC & ROCE SR/CU  1 Port EDR 100Gb IB CX-5 CAPI  2 Port EDR 100Gb IB CX-5 CAPI  2-Port Fiber Channel (16Gb/s)  2-Port Fiber Channel (32Gb/s) Accelerators Features  NVIDIA V100 SMX2 16GB HBM2  NVIDIA V100 SMX2 32GB HBM2  Xilinix ADM-PCIE-8V3 FPGA IBM AC922 System Options and Features Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 17.
    OpenCAPI 3.0 Data-Centric approachto server design 17Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 18.
    18 IBM AC922 DeepLearning System Front and Rear View RearViewFrontView Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 19.
    Volta SMX2 GPUAccelerator Power Regulation 2x 400 Pin Connectors2x Grounding Pads BottomSide Multi Chip Module NVIDIA GPU Details 19 TopSide NVIDIA Volta Specifications NVIDIA Tensor Cores 640 NVIDIA CUDA Cores 5120 Peak Double-Precision Performance 7.8 TFLOPS Single-Precision Performance 15.7 TFLOPS Tensor Performance 125 TFLOPS Memory Bandwidth 900 GB/sec GPU Memory Size 16 GB or 32GB HBM2 NVLink “Bricks” (8 lane interface) 6 NVLink Interconnect Bi-Directional 300 GB/sec Maximum Power 300W Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 20.
    20 Server based FPGA:ie. ADM-PCIE-8V3 Features • Board Format : Half-Length, low profile x16 PCIe form factor • Host I/F : PCI Express Gen3 x8 • Target Device : Xilinx Virtex Ultrascale : XCVU095-2 - FFVC1517 • SDRAM : 2x banks of 1G x 72, DDR4-2400 (16GiB total), upgradable to 16GiB, DDR4-1866 (dual bank devices), per bank (32 GiB total) • FLASH : On-board re-programmable flash memory for embedded configuration • Optional integrated Board Support Package (BSP) including extensive FPGA example designs, plug and play drivers, and a mature Application Programming Interface (API) Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 21.
    21 CAPI Advantages onAC922 Deep Learning System Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 22.
    Feature List:  RESTManagement  IPMI  SSH based SOL  Power and Cooling Management  Event Logs  Zeroconf discoverable  Sensors Features In Progress:  Full IPMI 2.0 Compliance with DCMI  Verified Boot  HTML5 Java Script Web User Interface  BMC RAS IBM is the OpenBMC Community Leader  Facebook  Google  IBM  Intel  Microsoft  OCP 22 OpenBMC is a free open source management software Linux distribution  Inventory  LED Management  Host Watchdog  Simulation  Code Update Support for multiple BMC/BIOS images  POWER On Chip Controller (OCC) SupportCognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 23.
    IBM Deep LearningSoftware Stack 23
  • 24.
    Reference Architecture forAI Infrastructure: Software 24Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 25.
    IBM PowerAI atthe glance June, 2018 update 25Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 26.
    IBM PowerAI Base@hub.docker.com 26Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 27.
    IBM PowerAI Baseusage at the glance 27 PowerAI framework activation (Python2 or Python3)  Activation scripts are used to manage system and python paths  To activate PowerAI deep learning frameworks: $ source /opt/DL/<framework-name>/bin/<framework-name>-activate This script sets PATH and PYTHONPATH to the appropriate values for the desired deep learning framework as it resides in /opt/DL directory.  <framework>-activate will also call check_dependencies  Activation will only happen if all dependencies are met Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 28.
    What data sciencemethods are used at work? 28Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 34.
    34 libGLM (C++ /CUDA Optimized Primitive Lib) Distributed Training Logistic Regression Linear Regression Support Vector Machines (SVM) Distributed Hyper- Parameter Optimization More Coming Soon APIs for Popular ML Frameworks IBM Snap ML part of PowerAI Base Distributed GPU-Accelerated Machine Learning Library (coming soon) Snap Machine Learning (ML) Library 34Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 35.
    46x faster thanprevious record set by Google Workload: Click-through rate prediction for advertising Logistic Regression Classifier in Snap ML using GPUs vs TensorFlow using CPU-only 35 Snap ML: Training Time Goes From An Hour to Minutes Logistic Regression in Snap ML (with GPUs) vs TensorFlow (CPU- only) 1.1 Hours 1.53 Minutes 0 20 40 60 80 Google CPU-only Snap ML Power + GPU Runtime(Minutes) 46x Faster Dataset: Criteo Terabyte Click Logs (http://labs.criteo.com/2013/12/download-terabyte-click-logs/) 4 billion training examples, 1 million features Model: Logistic Regression: TensorFlow vs Snap ML Test LogLoss: 0.1293 (Google using Tensorflow), 0.1292 (Snap ML) Platform: 89 CPU-only machines in Google using Tensorflow versus 4 AC922 servers (each 2 Power9 CPUs + 4 V100 GPUs) for Snap ML Google data from this Google blog 90 x86 Servers (CPU-only) 4 Power9 Servers With GPUs
  • 38.
    38 Deep Learning Impact (DLI)Module Data & Model Management, ETL, Visualize, Advise IBM Conductor with Spark Cluster Virtualization, Auto Hyper-Parameter Optimization PowerAI: Open Source ML Frameworks Large Model Support (LMS) Distributed Deep Learning (DDL) Auto ML Enterprise Accelerated Infrastructure IBM PowerAI Enterprise V1.1 Announced on June, 2018 Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 39.
    39 Enterprise IBM PowerAI EnterpriseV1.1 Announced on June, 2018 Deep Learning Impact Data Management and ETL Training visualization and monitoring Hyper-parameter optimization Spectrum Conductor Multi-tenancy support & security User reporting & charge back Dynamic resource allocation External data connectors Distributed Deep Learning (DDL) Support Line L1-L3 Accelerated Infrastructure Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 40.
    Real time monitoringof hyper parameters in PowerAI Enterprise 40Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 41.
    Hyper-parameter Tuning/Search inPowerAI Enterprise 41 Hyper-parameters – Learning rate – Decay rate – Batch size – Optimizer:  GradientDecedent,  Adadelta,  Momentum,  RMSProp  ….. – Momentum (for some optimizers) – LSTM hidden unit size (for models which use LSTM) Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 42.
    42Group Name /DOC ID / Month XX, 2017 / © 2017 IBM Corporation
  • 43.
    Who are thetypical Personas for computer vision solutions ? 43Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 44.
    Steps for DeepLearning Development 44 Define training task Prepare training Data Data Pre- processing DNN Model selection Configure the training hyper- parameter DNN Model Training Start Package the new DNN model together with preprocessing into inference proc. Application development with inference API DL training framework preparation Danielle Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 45.
    How to SimplifyingDeep Learning Adoption? 45  Format transformation  Support both training and evaluation sets  Support different pre-processing plugins  Provide base models for different scenarios  Predict training time  Training process visualization  Training with GPU  Scalability and HA deployment Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 46.
    IBM PowerAI Vision SimplifyDeep Learning Adoption 46 User could use the deployed API for visual recognition PowerAI Vision Iris Danny Define training task Prepare training Data Data Pre- processing DNN Model selection Configure the training hyper- parameter DNN Model Training Package the new DNN model together with preprocessing into inference proc. DL training framework preparation Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 47.
    What are wesolving ? 47 Data Up & Running Data Pre- Processin g Build, Train, Optimize Deploy & Infer Maintain Model Accuracy  Training visualization & accuracy monitoring  Customize parameters for training  Datasets for classification  Datasets for object detections  Semi-auto labeling on videos  Pre-bundled models dataset creation  Data augmentation  REST APIs for creating datasets.  Export/Import datasets  Custom DNN models  Hyper-parameter search and tuning  REST APIs to infer with images/videos  Inference Engine for compiling accelerated models on edge Image Analyst Data Scientist  Simplified installation and deployment Developer  Deploy where trained  Optimized models for few categories  Visualize progress and early warning  Customize models for pre-processing  Use Interface is deployed  Validate trained models  Built audit systems based low inference scores Most vendors address this space Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 48.
    IBM PowerAI Vision Lowersthe barriers for creating Computer Vision related AI solutions. 48Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 49.
    IBM PowerAI Vision Lowersthe barriers for creating Computer Vision related AI solutions. 49Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
  • 50.
    IBM PowerAI Vision: Lowers the barriers for creating Computer Vision related AI solutions.
  • 51.
    Semi-Automatic Labeling from video content
    - Define labels
    - Manually label some images / video frames
    - Train DL model
    - Use trained DL model: run the trained DL model on the entire input data to generate labels
    - Manually correct labels on some data
    - Repeat till labels achieve desired accuracy
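The labeling loop on this slide can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not PowerAI Vision code: `human_label` stands in for the human annotator, `predict` is a nearest-neighbour stand-in for a trained DL model, and all function names, parameters, and the "frames >= 50" rule are hypothetical.

```python
import random


def human_label(frame):
    """Stand-in for a human annotator (hypothetical rule: frames >= 50 show the object)."""
    return 1 if frame >= 50 else 0


def predict(labeled, frame):
    """Toy 'trained model': copy the label of the nearest manually labeled frame."""
    nearest = min(labeled, key=lambda f: abs(f - frame))
    return labeled[nearest]


def semi_auto_label(frames, seed_size=4, review_size=10, target=0.95,
                    max_rounds=20, seed=0):
    rng = random.Random(seed)
    # Manually label some frames.
    labeled = {f: human_label(f) for f in rng.sample(frames, seed_size)}
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # "Train" on the manual labels, then label the entire input.
        auto = {f: predict(labeled, f) for f in frames}
        # A human reviews a sample and corrects the wrong labels.
        review = rng.sample(frames, review_size)
        correct = sum(auto[f] == human_label(f) for f in review)
        labeled.update({f: human_label(f) for f in review})
        # Repeat until the reviewed sample reaches the desired accuracy.
        if correct / review_size >= target:
            break
    # Final pass so the returned labels reflect every manual correction.
    auto = {f: predict(labeled, f) for f in frames}
    return auto, labeled, rounds


labels, manual, rounds = semi_auto_label(list(range(100)))
print(f"{len(manual)} frames labeled by hand, {len(labels)} labeled in total, {rounds} round(s)")
```

The point of the sketch is the control flow: manual effort is spent only on a seed set and on review samples, while the model labels everything else.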
  • 52.
    Delivered Pre-Trained Models: Time and Data Matters
    Figure: a pre-trained Convolutional Neural Network (CNN) is fine-tuned (weights W) for a new task. Class labels shown: Mergus, Larus, …., Corvus. Other figure labels: Sourav, Recreate.
  • 53.
    IBM PowerAI Vision: Deep Learning Development Platform for Computer Vision
  • 54.
    PowerAI Vision APIs: Inference APIs for Object Detection (example)
    Developers can use these APIs for object detection with the deployed model in PowerAI Vision from any IP device.
    Base URL: http://IP:PORT/ (of the deployed inference instance)
    - /test (GET): only tests whether the monitor service is running
    - /detect_url (GET): upload an image by image URL and detect objects
    - /detect_upload (POST): post an image file and run object detection
    Inference return: {'confidence': 0.9038739204406738, 'ymax': 145, 'label': 'badge', 'xmax': 172, 'xmin': 157, 'ymin': 123}
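A client would typically turn the inference return into a typed object before using it. The sketch below is an assumption-laden illustration rather than an official client: the `Detection` class and `parse_detection` helper are hypothetical names, and the single-quoted format is taken literally from the slide (the real service may return strict JSON, in which case `json.loads` would be the natural choice).

```python
import ast
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object, using the field names shown on the slide."""
    label: str
    confidence: float
    xmin: int
    ymin: int
    xmax: int
    ymax: int

    @property
    def width(self) -> int:
        return self.xmax - self.xmin

    @property
    def height(self) -> int:
        return self.ymax - self.ymin


def parse_detection(raw: str) -> Detection:
    # The slide shows single-quoted keys, which is a Python literal rather
    # than strict JSON, so ast.literal_eval is used here (assumption).
    fields = ast.literal_eval(raw)
    return Detection(
        label=fields["label"],
        confidence=float(fields["confidence"]),
        xmin=int(fields["xmin"]),
        ymin=int(fields["ymin"]),
        xmax=int(fields["xmax"]),
        ymax=int(fields["ymax"]),
    )


# Sample payload copied from the slide.
SAMPLE = ("{'confidence': 0.9038739204406738, 'ymax': 145, 'label': 'badge', "
          "'xmax': 172, 'xmin': 157, 'ymin': 123}")

det = parse_detection(SAMPLE)
print(det.label, det.width, det.height)  # prints: badge 15 22
```

Keeping the bounding box as explicit fields makes later steps, such as filtering on `confidence` or cropping the image, straightforward.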
  • 55.
    IBM PowerAI Vision Inference Server
  • 56.
    PowerAI Inference Engine (PIE): Automatically Map Trained AI Models to Cloud or Edge
    Data center: train model and compile to edge. Cloud or edge: map to different platforms.
    Pipeline (figure labels): Trained DNN model; DNN model parser; DNN model analyzer; NN structure; Backend-specific optimization; Estimate resources & performance; Mapping to back ends
    Target back ends: CPU + GPU; Neural network processor; Embedded GPU; Embedded FPGA; CPUs, GPUs
  • 57.
    Inference at the edge
    "How can I accelerate models for the edge?" -- Developer
    Compile accelerated models for FPGAs, NVIDIA TX1/TX2* & Raspberry Pi*
    *coming soon
  • 58.
    Edge FPGA: i.e. TySOM-3 Embedded Prototyping Board
    Features
    - TySOM-3-ZU7 is a compact prototyping board containing a Zynq UltraScale+ MPSoC device, which provides 64-bit processor scalability while combining real-time control with soft and hard engines for graphics, video, waveform, and packet processing.
    - The Xilinx Zynq UltraScale+ ZU7EV-FFVC1156 MPSoC contains a Video Codec Unit that supports H.264/H.265, and it has the biggest FPGA in the UltraScale+™ MPSoC family.
    - This chip includes a quad-core ARM Cortex-A53 as the Application Processing Unit, a dual-core ARM Cortex-R5 as the Real-Time Processing Unit, and an ARM Mali-400 MP2 as the Graphics Processing Unit.
  • 59.
    Enterprise AI your way
    Deep Learning Containers on AC922 with Kubernetes
  • 60.
    PowerAI on IBM Cloud Private: Deployed on AC922
  • 61.
    PowerAI on IBM Cloud Private: Deployed on AC922
  • 62.
    H2O Driverless AI on IBM Cloud Private: Deployed on AC922
  • 63.
    IBM AC922 Deep Learning Cluster Architecture Overview
    Containerized environment - 40x NVIDIA Volta V100 GPUs
  • 64.
    IBM AC922 Deep Learning System Cluster POD
    40x NVIDIA Volta V100 GPUs
  • 65.
  • 66.
    Thank you
    Florin Manaila
    Cognitive Systems Europe
    HPC/Deep Learning Senior IT Architect
    florin.manaila@de.ibm.com
    +49-7034-274-5294
    ibm.com
  • 67.

Editor's Notes

  • #20 This slide provides a physical view of the GPU. The top view shows the chip and regulators, and the bottom view shows the 800 pins of interconnect to the backplane. The upper right picture is a completed assembly with the heat sink added. The heat sink is required to dissipate the 300 Watts of power in an air-cooled machine.
  • #23 The IBM AC922 has a new Baseboard Management Controller (BMC) interface called OpenBMC. OpenBMC is a free, open source Linux distribution for management software, of which IBM is a community leader, and it is gaining attention from users all over the marketplace. Quite simply, OpenBMC is the code stack used with the AC922's industry-standard BMC service processor. Think of OpenBMC as analogous to the way your car is likely inspected in the shop. You used to bring your car into the shop when you heard a sound, or during some maintenance window. Perhaps a mechanic would shine a light, diagnose, and investigate what was wrong with the car. Today, they simply plug a computer into the car’s port and it tells the mechanic what’s wrong (which raises the question of why they are paid so much, but that’s a different conversation). IPMI SoCs are known as baseboard management controllers (BMCs). The BMC is connected to most of the standard buses on the motherboard, so it can monitor temperature and fan sensors, storage devices and expansion cards, and even access the network (through its own virtual network interface that includes a separate MAC address). But BMCs almost invariably ship with a proprietary IPMI implementation whose functionality is limited to what the vendor chooses. Furthermore, IPMI is riddled with poor security and thus leaves servers vulnerable to all sorts of attacks. Once the BMC has been compromised, the attacker has direct access to essentially every part of the server. One of the major reasons why the marketplace is enthused about OpenBMC is the set of issues associated with the Intelligent Platform Management Interface (IPMI), a set of system-management-and-monitoring APIs typically implemented on server motherboards via an embedded system-on-chip (SoC) that functions completely outside of the host system's BIOS and operating system.
While IPMI is intended as a convenience for those who must manage dozens or hundreds of servers in a remote facility, it has been called out as a potentially serious hole in server security. IBM brought the OpenBMC project into a Design Thinking workshop and facilitated a group of external clients and contributors who helped shape the interface’s look and feel. When this was sent out for a broader set of reviews and followed up with the Net Promoter Score (NPS) questionnaire, it received a preliminary score of 100! Learn more about OpenBMC at: https://lwn.net/Articles/683320/.
  • #25 Roadmap to Containers: NVIDIA frameworks are being delivered via container strategy. Data Science Apps and Value add tools = AI Vision, PIE, DSX, Anaconda = 28HC ML/DL UI and Flow.... this row seems to be a double count.  Parallel training is DDL.  DLI is part of Spectrum CwS integration DL Frameworks: 30HC DDL: 11HC Runtime Resources/WL = ~ Spark, CwC.Cfc = 6HC
  • #29 Source: https://www.kaggle.com/surveys/2017 ("What data science methods are used at work?"). Deep Learning is growing exponentially, but Machine Learning still has a strong foothold.
  • #52 You can use PowerAI Vision for semi-automatic labeling
  • #53 You can use PowerAI Vision for semi-automatic labeling
  • #57 The PowerAI Inference Engine can map trained AI models to all kinds of embedded devices & accelerators
  • #58 In this demo, the UI on the left is the PowerAI Inference Engine (PIE). It is a user interface designed for developers to compile compressed versions of trained neural networks. A large neural network needs to be compressed so that it can run with the same accuracy on less compute-intensive hardware, in this case an FPGA. PIE is available as a prototype for our customers to use. The video on the right shows inference once the compressed model has been imported. The card in red uses a Xilinx Zynq series chip, which is an FPGA.