Scalable Machine/Deep
Learning with Apache SystemML
on OpenPOWER
Berthold Reinwald
reinwald@us.ibm.com
IBM Research – Almaden, San Jose, CA
AI and OpenPOWER Meetup at h2o.AI
March 25th, 2018
Let’s start off
with a Tweet …
IBM Think 2018, Las Vegas
IBM Research achieves record deep learning
performance with new software technology
• Training time from weeks to hours.
• SW/HW co-optimization achieves near-linear scaling up
to hundreds of GPUs.
• Multi-ring communication pattern provides a good
tradeoff between latency and bandwidth.
• Resnet-101 on Imagenet 22K with 64 IBM Power8
S822LC servers (256 GPUs) in about 7 hours to a
validation accuracy of 33.8 %. Microsoft's
ADAM and Google's DistBelief results did not reach
30 % validation accuracy.
• Compared to Facebook AI Research on 256 GPU
training, the new communication algorithm and
better combined SW/HW yield lower
communication overhead for Resnet-50. A PowerAI
DDL enabled version of Torch completed 90 epochs
of training on Resnet-50 for 1K classes in 50 minutes
using 64 IBM Power8 S822LC servers (256 GPUs).
Tumor Proliferation Score
Medical Image Segmentation
U-Net: Deep Convolutional Neural Network
for Segmentation of biomedical Images
Problem:
• Learn segmentation
• Very few annotated images
(approx 30 per application)
• Touching objects of same class;
need to be separated by
segmentation algorithm
Challenges:
• 3D
• Not generalizable shapes
• Gradual edges
Raw Image Segmentation Map
Challenges in Machine/Deep Learning
• Simplify the Life of Data Scientists
• Custom algorithms & DNNs
• Fast turn around time
• Data Characteristics
(input/intermediates/output)
• Dense / sparse
• Small / large number of data points
• Small / large number of features
• Mixed Workloads
• Compute bound
• I/O or memory bandwidth bound
• Core Operations
• Data manipulation
• Linear algebra
• Convolution
• Iterative
• Multiple Stages
• Training
• Testing
• Inference
• Deployment Environments
• Range from embeddable scoring library (low
latency), to scale up on large nodes, to
distributed
• Libraries: MKL/MKL-DNN, OpenBLAS,
CUDA/cuDNN and low precision
• Hardware
• x86/Power
• Many cores, GPU, TPU, FPGA
• High-speed interconnects (Topologies)
• … all combinations
Why Apache SystemML
• Today’s Roles of Data Scientists
• Algorithm researcher: Invent new optimization schemes
• Systems programmer: provide distributed
implementations
• Deployment engineer: Run for varying datasets
• Systems researcher: Optimize clusters
• SystemML simplifies the Life of Data Scientists
• in implementing custom machine learning
• running algorithms distributed if needed
• running algorithms varying from small data to large data
• Fast turn around
NIPS ICML
KDD
JMLR
Apache SystemML – Declarative Machine Learning
• Productivity of data scientists
• Machine learning language for data scientists
(“The SQL for analytics”)
• Strong foundation in linear algebra and statistical functions
• Comes with approx. 20+ algorithms pre-implemented
• Enable Solutions development and Tools
• Scalability & Performance
• Built on data parallel platforms, e.g. Spark
• Cost-based optimizer to compile execution plans
• Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics
• Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans
• APIs & Tools
• Command line: standalone Java app, spark-submit, hadoop jar
• Use in Spark through Scala, Python, R, and Java APIs
• Embeddable scoring library
• Tools: REPL (Scala Spark and pyspark), SparkR, SparkML,
Jupyter, Zeppelin Notebooks
Figure: SystemML stack of Language, Compiler, and Runtime, targeting either an In-Memory Single Node (scale-up) or a Hadoop or Spark Cluster (scale-out).
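The optimizer bullets above say that plan choice depends on data and cluster characteristics. A minimal sketch of that idea follows, assuming a crude size estimate and a hypothetical memory budget. The function names, the 12-bytes-per-non-zero sparse estimate, and the 10 GB budget are illustrative assumptions, not SystemML's actual cost model.

```python
def estimate_size_bytes(rows, cols, sparsity):
    """Rough in-memory size of an FP64 matrix (illustrative assumption).

    Dense: 8 bytes per cell. Sparse: about 12 bytes per non-zero
    (8-byte value plus 4-byte column index), ignoring row pointers.
    """
    dense = rows * cols * 8
    sparse = int(rows * cols * sparsity) * 12
    return min(dense, sparse)


def choose_exec_type(operand_sizes, memory_budget_bytes):
    """Pick single-node execution if all operands fit the budget."""
    if sum(operand_sizes) <= memory_budget_bytes:
        return "SINGLE_NODE"
    return "DISTRIBUTED"


budget = 10 * 1024**3  # hypothetical 10 GB memory budget
small = estimate_size_bytes(100_000, 1_000, 1.0)      # tall/skinny, fits in memory
large = estimate_size_bytes(100_000_000, 1_000, 1.0)  # same shape, 1000x more rows

print(choose_exec_type([small], budget))
print(choose_exec_type([large], budget))
```

The same script would get a different plan only because the input size changed, which is the point of compiling plans from data characteristics rather than hard-coding them.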
SystemML integrated in Spark Ecosystem
Figure: Spark ecosystem. Components on the Spark Core Engine: Spark SQL, Spark Streaming, MLlib, GraphX, and SystemML. Labels: Analytics Library, Custom Analytics, Machine Learning, DataFrame, Spark API to SystemML.
SystemML runs against Spark core for distributed computations.
Apache SystemML Open Source
• Apache Open source Project (http://systemml.apache.org/)
• Nov. 2015, Start SystemML Apache Incubator Project
• …
• Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API.
• May 2017, Release 0.14.0 on Spark 2.0.2+.
• May 2017, Apache Top Level Project
• …
• Dec 2017, Release 1.0.0
• March 2018, Release 1.1.0
• Release downloads (http://systemml.apache.org/download)
• Binaries
• Coordinates to Maven repository
• Github source code (https://github.com/apache/systemml)
• Documentation (https://apache.github.io/systemml/)
• 3 Hours KDD Hands-On Tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
Automatic Algebraic Simplification Rewrites lead
to Significant Performance Improvements
• Simplify operations over mmult -> Eliminate unnecessary compute
• trace (X %*% Y) -> sum(X * t(Y))
• Remove unnecessary operations -> Merging operations
• rand (…, min=-1, max=1) * 7 -> rand (…, min=-7, max=7)
• Binary to unary operations -> Reduce amount of data touched
• X*X -> X^2
• Remove unnecessary Indexing -> Eliminate operations (conditional)
• X[a:b,c:d] = Y -> X = Y iff dims(X)=dims(Y)
• … tens more rewrite rules
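To see why the first rewrite is safe and why it saves work, the sketch below checks numerically that trace(X %*% Y) equals sum(X * t(Y)). It is plain Python with hypothetical helper names, not SystemML code. The left side builds the full product, while the right side touches each cell of X exactly once.

```python
import random


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def trace(A):
    """Sum of the diagonal of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


def sum_times_transpose(X, Y):
    """sum(X * t(Y)): element-wise product with the transpose, then a full sum."""
    return sum(X[i][j] * Y[j][i] for i in range(len(X)) for j in range(len(X[0])))


random.seed(0)
n, k = 4, 3
X = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]  # n x k
Y = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]  # k x n

lhs = trace(matmul(X, Y))         # n*n*k multiply-adds, n x n intermediate
rhs = sum_times_transpose(X, Y)   # n*k multiply-adds, no intermediate matrix

print(abs(lhs - rhs) < 1e-12)
```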
Compilation Chain
Training a Deep Neural Network
Training features:
Training label:
Goal: learn the weights
Define a loss function:
For numerical stability and mathematical simplicity, we use negative log-likelihood
(often referred to as cross-entropy):
“Forward propagation”
Compute a function via composition of linear transformations followed by
element-wise non-linearities
“Backward propagation”
Propagates errors backwards and updates weights according
to how much they contributed to the output
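The loss and the backward step described above can be sketched for a single example as follows. This is a minimal illustration, assuming a softmax output layer. The scores, label, and learning rate are hypothetical values, and the update is applied to the scores directly rather than to network weights to keep the example short.

```python
import math


def softmax(scores):
    """Turn raw scores into class probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def grad_scores(probs, label):
    """Gradient of the loss w.r.t. the scores: probs - one_hot(label)."""
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


scores = [2.0, 1.0, 0.1]  # hypothetical output of the forward pass
label = 0                 # hypothetical true class

probs = softmax(scores)
loss = cross_entropy(probs, label)
grad = grad_scores(probs, label)

# One gradient step against the error signal lowers the loss.
lr = 0.5
new_scores = [s - lr * g for s, g in zip(scores, grad)]
new_loss = cross_entropy(softmax(new_scores), label)

print(loss, new_loss)
```

In a real network the same gradient is passed backwards through each layer, so every weight is adjusted in proportion to its contribution to the output.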
Deep Learning Layers
• Fully connected layer
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
Convolution Layer
• Fewer parameters than a
fully connected layer
• Useful to capture local features (spatially)
• Output #channels = #filters
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
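The parameter savings and the "output #channels = #filters" rule can be checked with a little arithmetic. The sketch below uses the standard output-size formula for convolution. The layer dimensions are hypothetical and chosen only for illustration.

```python
def conv_output_size(size, filter_size, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - filter_size) // stride + 1


def conv_params(in_channels, num_filters, filter_h, filter_w):
    """Weights plus one bias per filter."""
    return num_filters * (in_channels * filter_h * filter_w + 1)


def fc_params(num_inputs, num_outputs):
    """Weights plus one bias per output of a fully connected layer."""
    return num_outputs * (num_inputs + 1)


# Hypothetical layer: 3x32x32 input, 16 filters of size 5x5, stride 1, padding 2.
C, H, W = 3, 32, 32
F, Hf, Wf = 16, 5, 5

Hout = conv_output_size(H, Hf, stride=1, pad=2)
Wout = conv_output_size(W, Wf, stride=1, pad=2)

print("output shape:", (F, Hout, Wout))  # output channels = number of filters
print("conv parameters:", conv_params(C, F, Hf, Wf))
print("fully connected parameters for the same shapes:",
      fc_params(C * H * W, F * Hout * Wout))
```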
Deep Learning Support
• Reuse existing infrastructure to implement
custom DNNs like other training algorithms
• Small number of DL-specific built-in functions
• e.g. convolution
• NN library of layers and training optimizers to stack layers, e.g.
• Affine (fully-connected) layer is matrix multiplication
• Convolution layer invokes new convolution function
• Caffe/Keras2DML to import existing DNNs
• Transfer learning to continue training on different data
• GPU and native BLAS libraries
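As the list above notes, an affine layer reduces to matrix multiplication. The sketch below shows the forward and backward passes in plain Python. The function names and the tiny inputs are hypothetical, and this is an illustration of the math rather than the NN library's own code.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def affine_forward(X, W, b):
    """out = X %*% W + b, with the bias b broadcast over rows."""
    return [[v + bj for v, bj in zip(row, b)] for row in matmul(X, W)]


def affine_backward(dout, X, W):
    """Gradients w.r.t. inputs, weights, and bias given the upstream gradient."""
    dX = matmul(dout, transpose(W))        # same shape as X
    dW = matmul(transpose(X), dout)        # same shape as W
    db = [sum(col) for col in zip(*dout)]  # column sums of dout
    return dX, dW, db


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 examples, 2 features
W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # 2 inputs, 3 outputs
b = [0.0, 1.0, -1.0]

out = affine_forward(X, W, b)
dout = [[1.0] * 3 for _ in range(3)]      # upstream gradient of sum(out)
dX, dW, db = affine_backward(dout, X, W)

print(out[0], dW, db)
```

Both passes are just matrix products and a column sum, which is why the existing linear algebra infrastructure can be reused for this layer.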
NN library:
Compressed Linear Algebra (CLA)
• Motivation: Iterative ML algorithms with I/O-bound MV
multiplications
• Key Ideas: Use lightweight DB compression techniques and perform
LA operations on compressed matrices (w/o decompression)
• Experiments
• LinregCG, 10 iterations, SystemML 0.14
• 1+6 node cluster, Spark 2.1
Compression Ratios
Dataset     Gzip    Snappy  CLA
Higgs       1.93    1.38    2.17
Census      17.11   6.04    35.69
Covtype     10.40   6.13    18.19
ImageNet    5.54    3.35    7.34
Mnist8m     4.12    2.60    7.32
Airline78   7.07    4.28    7.44
Figure: End-to-End Performance [sec] for Uncompressed, Snappy (RDD Compression), and CLA on Mnist40m (90GB), Mnist240m (540GB), and Mnist480m (1.1TB). Bar labels in source order: 89, 3409, 5663; 135, 765, 2730; 93, 463, 998.
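The key idea, running linear algebra directly on compressed data, can be sketched with a simple per-column encoding. The code below is a simplified illustration: each column maps its distinct non-zero values to lists of row offsets, and the matrix-vector product is computed without decompressing. The function names are hypothetical, and the actual CLA implementation is more elaborate than this.

```python
from collections import defaultdict


def compress_column(col):
    """Map each distinct non-zero value to the list of row offsets holding it."""
    groups = defaultdict(list)
    for r, value in enumerate(col):
        if value != 0:
            groups[value].append(r)
    return dict(groups)


def compressed_matvec(columns, v, num_rows):
    """q = X %*% v computed directly on the compressed columns."""
    q = [0.0] * num_rows
    for j, groups in enumerate(columns):
        for value, offsets in groups.items():
            contrib = value * v[j]  # one multiply per distinct value, not per cell
            for r in offsets:
                q[r] += contrib
    return q


X = [[7.0, 0.0],
     [7.0, 2.0],
     [3.0, 2.0],
     [7.0, 0.0]]
v = [1.0, 10.0]

columns = [compress_column(list(col)) for col in zip(*X)]
q = compressed_matvec(columns, v, len(X))
expected = [sum(x * w for x, w in zip(row, v)) for row in X]

print(columns)
print(q == expected)
```

Columns with few distinct values compress well under this scheme, and the multiplication does less arithmetic as well as less I/O.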
Code Generation for Operator Fusion
• Motivation
• Ubiquitous Fusion Opportunities
• High Performance Impact
• Key Ideas
• Template skeletons (Row, Cell, Outer, MultiAgg)
• Candidate exploration to identify fusion opportunities
• Candidate selection via cost-based optimizer or heuristics
• Codegen with janino / javac during compile and dynamic recompile
Figure: fusion examples.
• Multi-Aggregate: a=sum(X^2), b=sum(X*Y), c=sum(Y^2)
• sum(X*Y*Z)
• t(X) %*% (X %*% v), with a 1st pass and a 2nd pass over X
• sum(X * log(U %*% t(V))), with sparsity exploitation
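The multi-aggregate case shows what fusion buys. The sketch below computes a=sum(X^2), b=sum(X*Y), c=sum(Y^2) first as three separate passes and then as one fused pass. It is plain Python over flattened matrices with hypothetical function names, meant only to illustrate the access pattern a generated operator would have.

```python
def multi_agg_unfused(X, Y):
    """Three independent aggregates: X and Y are each scanned twice."""
    a = sum(x * x for x in X)
    b = sum(x * y for x, y in zip(X, Y))
    c = sum(y * y for y in Y)
    return a, b, c


def multi_agg_fused(X, Y):
    """One pass over X and Y that updates all three aggregates together."""
    a = b = c = 0.0
    for x, y in zip(X, Y):
        a += x * x
        b += x * y
        c += y * y
    return a, b, c


# Matrices flattened to cell lists; values are illustrative.
X = [1.0, 2.0, 3.0]
Y = [4.0, 5.0, 6.0]

print(multi_agg_unfused(X, Y))
print(multi_agg_fused(X, Y))
```

For memory-bound workloads, halving the number of scans over the inputs matters more than the arithmetic itself.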
Codegen Micro Benchmarks (FP64)
Figure panels:
• sum(X ⊙ Y ⊙ Z), dense
• sum(X ⊙ Y ⊙ Z), sparse (sparsity 0.1)
• X^T (X v), dense
• sum(X ⊙ log(U V^T + 1e-15))
Data size: 20K x 20K
Observations:
• #1 Gen close to hand-coded fused ops
• #2 TF/Julia Gen only single-threaded
• #3 TF w/ very limited sparse support
• #4 Sparse Gen challenging, Gen better than hand-coded ops
• #5 TF w/ poor performance for data-intensive ops
• #6 Gen at peak mem bandwidth
• #7 Autom. sparsity exploitation across chains of ops
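Sparsity exploitation across a chain of operations can be illustrated with the last benchmark expression, sum(X * log(U V^T + 1e-15)). Because the result is multiplied cell-wise by X, only the cells where X is non-zero need the inner product and the logarithm. The sketch below compares both evaluation strategies. The dictionary layout for X and all values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss_dense(X, U, V):
    """Evaluate every cell of U V^T, including cells where X is zero."""
    total = 0.0
    for i, u in enumerate(U):
        for j, v in enumerate(V):
            total += X.get((i, j), 0.0) * math.log(dot(u, v) + 1e-15)
    return total


def loss_sparse(X, U, V):
    """Evaluate only the cells where X is non-zero."""
    return sum(x * math.log(dot(U[i], V[j]) + 1e-15) for (i, j), x in X.items())


# X holds its non-zeros as {(row, col): value}; U is 3x2 and V is 4x2.
X = {(0, 1): 2.0, (2, 3): 5.0}
U = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
V = [[1.0, 1.0], [2.0, 0.5], [0.1, 0.2], [1.0, 4.0]]

dense_value = loss_dense(X, U, V)    # 12 inner products and logarithms
sparse_value = loss_sparse(X, U, V)  # 2 inner products and logarithms

print(abs(dense_value - sparse_value) < 1e-9)
```

The dense variant also materializes the full U V^T conceptually, which at 20K x 20K is far larger than the sparse input that drives the computation.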
SystemML on Power Environment
• Contributed native ppc64le libraries for JCuda to mavenized jcuda
project
• GPU backend on Power for SystemML
• Contributed native ppc64le libraries to protoc project
• Useful for compiling Caffe proto files
• Supported native BLAS operations in SystemML
• Matrix Multiplication, Convolution (forward/backward)
• OpenBLAS with OpenMP support
Linear Regression Conjugate Gradient
(preliminary 1/2)
Figure: time in seconds vs. number of rows of the input matrix (in thousands: 64, 128, 256, 512, 1024, 2048). Series: PPC CPU Time, PPC GPU Time, x86 CPU Time, x86 GPU Time.
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
M-V multiplication chain is memory bound, but more cores help with parallelization.
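For readers unfamiliar with the benchmarked algorithm, here is a minimal sketch of linear regression via conjugate gradient, assuming the textbook formulation that solves (X^T X + reg I) w = X^T y. The parameter names reg, maxi, and tol mirror the settings listed above. The helper names and the toy data are hypothetical, and this is not the SystemML script itself.

```python
def matvec(X, v):
    """X %*% v"""
    return [sum(x * w for x, w in zip(row, v)) for row in X]


def tmatvec(X, u):
    """t(X) %*% u without materializing the transpose."""
    out = [0.0] * len(X[0])
    for row, ui in zip(X, u):
        for j, x in enumerate(row):
            out[j] += x * ui
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linreg_cg(X, y, reg=0.01, maxi=20, tol=0.001):
    """Solve (t(X) X + reg I) w = t(X) y by conjugate gradient."""
    w = [0.0] * len(X[0])
    r = [-g for g in tmatvec(X, y)]  # gradient at w = 0
    p = [-ri for ri in r]            # first search direction
    norm_r2 = dot(r, r)
    target = norm_r2 * tol * tol
    i = 0
    while i < maxi and norm_r2 > target:
        # The M-V multiplication chain: two passes over X per iteration.
        q = [a + reg * b for a, b in zip(tmatvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-ri + (norm_r2 / old_norm_r2) * pi for ri, pi in zip(r, p)]
        i += 1
    return w


# Toy data generated from w = [2, -1]; values are illustrative.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]

w = linreg_cg(X, y, reg=0.0, maxi=20, tol=1e-9)
print(w)
```

Each iteration performs only a few floating-point operations per matrix cell read, which is consistent with the observation that the multiplication chain is memory bound.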
Linear Regression Conjugate Gradient
(preliminary 2/2)
Figure: time in seconds vs. number of rows of the input matrix (in thousands: 64, 256, 1024). Series: PPC GPU Time, x86 GPU Time.
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
Figure: CPU-GPU Transfer Time, in seconds, vs. number of rows of the input matrix (in thousands: 64, 256, 1024). Series: PPC toDev Time, x86 toDev Time.
Most of the time is spent in transferring data from host to device
-> 2x performance benefit due to CPU-GPU NVLink
Capabilities of DL frameworks
SP = single precision, DP = double precision
Framework  SP CPU                   SP GPU  DP CPU  DP GPU  Codegen CPU  Codegen GPU  BLAS                 Spark DataFrame support  Sparse operation
SystemML   Limited (only for BLAS)  Yes     Yes     Yes     Yes          No           OpenBLAS, MKL, Java  Yes                      Yes
TF 1.5     Yes                      Yes     No      No      Yes          Yes          Eigen, MKL           ? (via elephas)          Limited
BigDL      Yes                      No      Yes     No      No           No           MKL                  Yes                      No
Execution time for 10 epochs with Lenet 5
and 60K MNIST dataset
Figure: execution time (log scale, 1 to 10000) for CPU single precision, GPU single precision, CPU double precision, and GPU double precision. Series: TF, TF with XLA, SystemML, SystemML with codegen, Intel BigDL.
Chart annotations:
• Due to limited single precision support in SystemML
• SystemML/TF outperforms BigDL for minibatch training
• Both TF and SystemML perform equally well on GPU
• BigDL: No GPU support
• TF: No support for double precision
• SystemML: No GPU codegen support yet
• Code-generation improves performance both on CPU and on GPU
Summary
• SystemML simplifies the Life of Data Scientists
• Custom Machine/Deep Learning Algorithms
• Scale up & out
• Mixed Workloads
• Memory access bound
• Compute bound
• Strike Balance between
• Data transfer
• Parallelism

2018 03 25 system ml ai and openpower meetup

  • 1. Scalable Machine/Deep Learning with Apache SystemML on OpenPOWER Berthold Reinwald reinwald@us.ibm.com IBM Research – Almaden, San Jose, CA AI and OpenPOWER Meetup at h2o.AI March 25th, 2018 1
  • 2. Let’s start off with a Tweet … 2 IBM Think 2018, Las Vegas
  • 3. IBM Research achieves record deep learning performance with new software technology • Training time from weeks to hours. • SW/HW co-optimized achieves near-linear scaling up to hundreds of GPUs. • Multi-ring communication pattern provides a good tradeoff between latency and bandwidth. • Resnet-101 on Imagenet 22K with 64 IBM Power8 S822LC servers (256 GPUs) in about 7 hours to an accuracy of 33.8 % validation accuracy. Microsoft's ADAM and Google's DistBelief results did not reach 30 % validation accuracy. • Compared to Facebook AI Research on 256 GPU training, the new communication algorithm, and better combined SW/HW offers better communication overhead for Resnet-50. A PowerAI DDL enabled version of Torch completed 90 epochs of training on Resnet 50 for 1K classes in 50 minutes using 64 IBM Power8 S822LC servers (256 GPUs). 3
  • 6. U-Net: Deep Convolutional Neural Network for Segmentation of biomedical Images 6 Problem: • Learn segmentation • Very few annotated images (approx 30 per application) • Touching objects of same class; need to be separated by segmentation algorithm Challenges: • 3D • Not generalizable shapes • Gradual edges Raw Image Segmentation Map
  • 7. Challenges in Machine/Deep Learning • Simplify the Life of Data Scientists • Custom algorithms & DNNs • Fast turn around time • Data Characteristics (input/intermediates/output) • Dense / sparse • Small / large number of data points • Small / large number of features • Mixed Workloads • Compute bound • I/O or memory bandwidth bound • Core Operations • Data manipulation • Linear algebra • Convolution • Iterative • Multiple Stages • Training • Testing • Inference • Deployment Environments • Range from embeddable scoring library (low latency), to scale up on large nodes, to distributed • Libraries: MKL/MKL-DNN, OpenBlas, CuDA/CuDNN and low precision • Hardware • x86/Power • Many cores, GPU, TPU, FPGA • High-speed interconnects (Topologies) • … all combinations 7
  • 8. Why Apache SystemML • Today’s Roles of Data Scientists • Algorithm researcher: Invent new optimization schemes • Systems programmer: provide distributed implementations • Deployment engineer: Run for varying datasets • Systems researcher: Optimize clusters • SystemML simplifies the Life of Data Scientists • in implementing custom machine learning • running algorithms distributed if needed • running algorithms varying from small data to large data • Fast turn around 8 NIPS ICML KDD JMLR
  • 9. Apache SystemML – Declarative Machine Learning • Productivity of data scientists • Machine learning language for data scientists (“The SQL for analytics”) • Strong foundation in linear algebra and statistical functions • Comes with 20+ algorithms pre-implemented • Enables solutions development and tools • Scalability & Performance • Built on data parallel platforms, e.g. Spark • Cost-based optimizer to compile execution plans • Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics • Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans • APIs & Tools • Command line: standalone Java app, spark-submit, hadoop jar • Use in Spark through Scala, Python, R, and Java APIs • Embeddable scoring library • Tools: REPL (Scala Spark and pyspark), SparkR, SparkML, Jupyter, Zeppelin Notebooks • (Figure: Language, Compiler, and Runtime layers on top of an In-Memory Single Node (scale-up) or a Hadoop or Spark Cluster (scale-out))
  • 10. SystemML integrated in Spark Ecosystem • (Figure: Spark Core Engine with Spark SQL, Spark Streaming, MLlib (machine learning analytics library), GraphX, and SystemML (custom analytics)) • DataFrame: Spark API to SystemML • SystemML runs against Spark core for distributed computations
  • 11. Apache SystemML Open Source • Apache open source project (http://systemml.apache.org/) • Nov. 2015, start of the SystemML Apache Incubator project • … • Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API. May 2017, Release 0.14.0 on Spark 2.0.2+. • May 2017, Apache Top Level Project • … • Dec. 2017, Release 1.0.0 • March 2018, Release 1.1.0 • Release downloads (http://systemml.apache.org/download) • Binaries • Coordinates to Maven repository • GitHub source code (https://github.com/apache/systemml) • Documentation (https://apache.github.io/systemml/) • 3-hour KDD hands-on tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
  • 12. Automatic Algebraic Simplification Rewrites lead to Significant Performance Improvements • Simplify operations over mmult → Eliminate unnecessary compute • trace(X %*% Y) → sum(X * t(Y)) • Remove unnecessary operations → Merging operations • rand(…, min=-1, max=1) * 7 → rand(…, min=-7, max=7) • Binary to unary operations → Reduce amount of data touched • X*X → X^2 • Remove unnecessary indexing → Eliminate operations (conditional) • X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y) • … tens more rewrite rules
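The first rewrite on slide 12 can be checked numerically. The following is a minimal sketch in plain Python rather than DML; the helper names `matmul`, `trace_naive`, and `trace_rewritten` are hypothetical and only illustrate why the rewritten form avoids the matrix product.

```python
import random

def matmul(A, B):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace_naive(X, Y):
    """trace(X %*% Y): builds the full product, then reads only its diagonal."""
    P = matmul(X, Y)
    return sum(P[i][i] for i in range(len(P)))

def trace_rewritten(X, Y):
    """sum(X * t(Y)): element-wise product with the transpose, no matrix product."""
    return sum(X[i][k] * Y[k][i] for i in range(len(X)) for k in range(len(Y)))

# Illustrative random inputs: X is n x m, Y is m x n.
random.seed(0)
n, m = 4, 3
X = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
Y = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
assert abs(trace_naive(X, Y) - trace_rewritten(X, Y)) < 1e-12
```

For an n x m matrix X and an m x n matrix Y, the naive form performs on the order of n²·m multiplications, while the rewritten form needs only n·m.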
  • 14. Training a Deep Neural Network • Training features and training labels • Goal: learn the weights • Define a loss function: for numerical stability and mathematical simplicity, we use negative log-likelihood (often referred to as cross-entropy) • “Forward propagation”: compute a function via composition of linear transformations followed by element-wise non-linearities • “Backward propagation”: propagate errors backwards and update weights according to how much they contributed to the output
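The forward and backward steps on slide 14 can be sketched for the smallest possible case. This is an illustrative example, not code from the talk: it assumes a single affine layer with a softmax output and plain gradient descent, and all function names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, b, x):
    """Forward propagation: affine transformation W x + b, then softmax."""
    z = [sum(wk * xk for wk, xk in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(z)

def loss(W, b, x, label):
    """Negative log-likelihood (cross-entropy) of the true class."""
    return -math.log(forward(W, b, x)[label])

def sgd_step(W, b, x, label, lr):
    """Backward propagation: d(loss)/dz = p - onehot(label), then one update."""
    p = forward(W, b, x)
    dz = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    W_new = [[w - lr * dzk * xj for w, xj in zip(row, x)] for row, dzk in zip(W, dz)]
    b_new = [bk - lr * dzk for bk, dzk in zip(b, dz)]
    return W_new, b_new

# Assumed toy setup: 3 classes, 2 features, one training example.
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
x, label = [1.0, -2.0], 2
before = loss(W, b, x, label)
for _ in range(50):
    W, b = sgd_step(W, b, x, label, lr=0.1)
after = loss(W, b, x, label)
assert after < before
```

With zero-initialized weights every class starts at probability 1/3, so the initial loss is log 3; repeated updates lower it for the given example.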
  • 15. Deep Learning Layers • Fully connected layer • Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
  • 16. Convolution Layer • Fewer parameters than a fully connected layer • Useful to capture local (spatial) features • Output #channels = #filters • Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
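The parameter savings claimed on slide 16 can be made concrete with a short calculation. The sketch below uses the standard parameter-count formulas; the 32x32 RGB input, 3x3 filters, and 16 filters are assumed values chosen only for illustration.

```python
def fully_connected_params(num_inputs, num_outputs):
    """Every output unit connects to every input value, plus one bias each."""
    return num_inputs * num_outputs + num_outputs

def conv_params(filter_h, filter_w, in_channels, num_filters):
    """Each filter spans all input channels and is shared across positions."""
    return filter_h * filter_w * in_channels * num_filters + num_filters

def conv_output_shape(in_h, in_w, filter_h, filter_w, num_filters, stride=1, pad=0):
    """Spatial size follows the usual formula; output channels = number of filters."""
    out_h = (in_h + 2 * pad - filter_h) // stride + 1
    out_w = (in_w + 2 * pad - filter_w) // stride + 1
    return out_h, out_w, num_filters

# Assumed example: 32x32 RGB input, 16 filters of size 3x3, padding 1.
out_h, out_w, out_c = conv_output_shape(32, 32, 3, 3, 16, pad=1)
conv_count = conv_params(3, 3, 3, 16)
# A fully connected layer producing an output of the same size.
fc_count = fully_connected_params(32 * 32 * 3, out_h * out_w * out_c)
assert conv_count < fc_count
```

In this setting the convolution layer has 448 parameters, while a fully connected layer producing the same 32x32x16 output would need about 50 million.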
  • 17. Deep Learning Support • Reuse existing infrastructure to implement custom DNNs like other training algorithms • Small number of DL-specific built-in functions • e.g. convolution • NN library of layers and training optimizers to stack layers, e.g. • Affine (fully-connected) layer is matrix multiplication • Convolution layer invokes new convolution function • Caffe/Keras2DML to import existing DNNs • Transfer learning to continue training on different data • GPU and native BLAS libraries
  • 18. Compressed Linear Algebra (CLA) • Motivation: Iterative ML algorithms with I/O-bound matrix-vector (MV) multiplications • Key Ideas: Use lightweight DB compression techniques and perform LA operations on compressed matrices (w/o decompression) • Experiments • LinregCG, 10 iterations, SystemML 0.14 • 1+6 node cluster, Spark 2.1
    Compression Ratios
    Dataset      Gzip    Snappy   CLA
    Higgs        1.93    1.38     2.17
    Census       17.11   6.04     35.69
    Covtype      10.40   6.13     18.19
    ImageNet     5.54    3.35     7.34
    Mnist8m      4.12    2.60     7.32
    Airline78    7.07    4.28     7.44
    End-to-End Performance [sec]
    Dataset (size)       Uncompressed   Snappy (RDD Compression)   CLA
    Mnist40m (90GB)      89             135                        93
    Mnist240m (540GB)    3409           765                        463
    Mnist480m (1.1TB)    5663           2730                       998
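The key idea on slide 18, running linear algebra directly on compressed data, can be illustrated with a deliberately simplified sketch. It assumes a per-column dictionary encoding (distinct value mapped to row offsets); this is not SystemML's actual CLA format, and the function names are hypothetical.

```python
def compress_column(col):
    """Dictionary-encode one column: distinct nonzero value -> list of row offsets."""
    groups = {}
    for row, value in enumerate(col):
        if value != 0:
            groups.setdefault(value, []).append(row)
    return groups

def compress(matrix):
    """Compress a row-major list-of-lists matrix column by column."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    cols = [compress_column([matrix[r][c] for r in range(n_rows)])
            for c in range(n_cols)]
    return n_rows, cols

def matvec_compressed(compressed, v):
    """Compute X v on the compressed form, without decompressing X."""
    n_rows, cols = compressed
    y = [0.0] * n_rows
    for c, groups in enumerate(cols):
        for value, rows in groups.items():
            contribution = value * v[c]  # one multiply per distinct value
            for r in rows:
                y[r] += contribution
    return y

# Small matrix with few distinct values per column.
X = [[1, 0, 2],
     [1, 3, 2],
     [0, 3, 2],
     [1, 0, 0]]
y = matvec_compressed(compress(X), [1, 2, 3])
```

Columns with few distinct values need only one multiplication per distinct value, and zeros are never stored or touched, which is what makes this layout attractive for I/O-bound matrix-vector products.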
  • 19. Code Generation for Operator Fusion • Motivation • Ubiquitous fusion opportunities • High performance impact • Key Ideas • Template skeletons (Row, Cell, Outer, MultiAgg) • Candidate exploration to identify fusion opportunities • Candidate selection via cost-based optimizer or heuristics • Codegen with Janino / javac during compile and dynamic recompile • (Figure: fusion examples: multi-aggregate a=sum(X^2), b=sum(X*Y), c=sum(Y^2) over shared inputs X and Y; sum(X*Y*Z) without intermediates; t(X) %*% (X %*% v) in one pass instead of two; sum(X * log(U %*% t(V))) with sparsity exploitation)
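The multi-aggregate example from slide 19 shows what fusion buys. The sketch below is plain Python written by hand, not code generated by SystemML: it contrasts three separate aggregations, each with a materialized intermediate, against a single fused pass over the shared inputs.

```python
def unfused(X, Y):
    """Three separate operators, each materializing an intermediate list."""
    x2 = [x * x for x in X]
    xy = [x * y for x, y in zip(X, Y)]
    y2 = [y * y for y in Y]
    return sum(x2), sum(xy), sum(y2)

def fused(X, Y):
    """One pass over X and Y computing a, b, and c together, no intermediates."""
    a = b = c = 0.0
    for x, y in zip(X, Y):
        a += x * x   # a = sum(X^2)
        b += x * y   # b = sum(X*Y)
        c += y * y   # c = sum(Y^2)
    return a, b, c

X = [1.0, 2.0, 3.0]
Y = [4.0, 5.0, 6.0]
assert fused(X, Y) == unfused(X, Y)
```

The fused version reads X and Y once instead of twice each and allocates no temporary vectors, which matters most when the operation is memory-bandwidth bound.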
  • 20. Codegen Micro Benchmarks (FP64) • Benchmarks: sum(X ⊙ Y ⊙ Z), dense • sum(X ⊙ Y ⊙ Z), sparse • X⊤(X v), dense • sum(X ⊙ log(UV⊤ + 1e-15)) • Chart labels: sparsity 0.1; data size 20K x 20K • #1 Gen close to hand-coded fused ops • #2 TF/Julia Gen only single-threaded • #3 TF w/ very limited sparse support • #4 Sparse Gen challenging, Gen better than hand-coded ops • #5 TF w/ poor performance for data-intensive ops • #6 Gen at peak mem bandwidth • #7 Autom. sparsity exploitation across chains of ops
  • 21. SystemML on Power Environment • Contributed native ppc64le libraries for JCuda to the mavenized jcuda project • GPU backend on Power for SystemML • Contributed native ppc64le libraries to the protoc project • Useful for compiling Caffe proto files • Supported native BLAS operations in SystemML • Matrix multiplication, convolution (forward/backward) • OpenBLAS with OpenMP support
  • 22. Linear Regression Conjugate Gradient (preliminary 1/2) • (Chart: time in seconds vs. number of rows of the input matrix, in thousands, from 64 to 2048; series: PPC CPU time, PPC GPU time, x86 CPU time, x86 GPU time) • Data: random with sparsity 0.95, 1000 features • Parameters: icpt: 0, maxi: 20, tol: 0.001, reg: 0.01 • Driver memory: 100G, local[*] master • The M-V multiplication chain is memory bound, but more cores help with parallelization.
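The "M-V multiplication chain" on slide 22 refers to the product t(X) %*% (X %*% p) evaluated in every conjugate gradient iteration. The following is a minimal sketch of linear regression via conjugate gradient on the regularized normal equations; it is not SystemML's script. The parameter names `reg`, `maxi`, and `tol` follow the slide, while the stopping rule and helper functions are assumptions.

```python
def matvec(X, v):
    """X v for a row-major list-of-lists matrix."""
    return [sum(xij * vj for xij, vj in zip(row, v)) for row in X]

def t_matvec(X, u):
    """t(X) u without forming the transpose."""
    out = [0.0] * len(X[0])
    for row, ui in zip(X, u):
        for j, xij in enumerate(row):
            out[j] += xij * ui
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linreg_cg(X, y, reg=0.01, maxi=20, tol=0.001):
    """Solve (t(X) X + reg I) beta = t(X) y by conjugate gradient."""
    beta = [0.0] * len(X[0])
    r = [-g for g in t_matvec(X, y)]   # residual of the normal equations at beta = 0
    p = [-ri for ri in r]              # initial search direction
    norm_r2 = dot(r, r)
    target = norm_r2 * tol * tol       # assumed relative stopping criterion
    i = 0
    while i < maxi and norm_r2 > target:
        # The M-V chain: two matrix-vector products per iteration.
        q = [qj + reg * pj for qj, pj in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)
        beta = [bj + alpha * pj for bj, pj in zip(beta, p)]
        r = [rj + alpha * qj for rj, qj in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-rj + (norm_r2 / old_norm_r2) * pj for rj, pj in zip(r, p)]
        i += 1
    return beta

# Tiny illustrative problem whose least-squares solution is beta = [1, 2].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = linreg_cg(X, y, reg=0.0, maxi=20, tol=1e-9)
```

Each iteration streams over X twice and does little arithmetic per value read, which is consistent with the slide's remark that the chain is memory bound.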
  • 23. Linear Regression Conjugate Gradient (preliminary 2/2) • (Chart 1: time in seconds vs. number of rows of the input matrix, in thousands (64, 256, 1024); series: PPC GPU time, x86 GPU time) • (Chart 2, CPU-GPU Transfer Time: time in seconds vs. number of rows of the input matrix, in thousands (64, 256, 1024); series: PPC toDev time, x86 toDev time) • Data: random with sparsity 0.95, 1000 features • Parameters: icpt: 0, maxi: 20, tol: 0.001, reg: 0.01 • Driver memory: 100G, local[*] master • Most of the time is spent transferring data from host to device; CPU-GPU NVLink yields a 2x performance benefit
  • 24. Capabilities of DL frameworks
    Framework   Single Precision (CPU / GPU)         Double Precision (CPU / GPU)   Code generation (CPU / GPU)   BLAS                  Spark DataFrame support   Sparse operation
    SystemML    Limited (only for BLAS) / Yes        Yes / Yes                      Yes / No                      OpenBLAS, MKL, Java   Yes                       Yes
    TF 1.5      Yes / Yes                            No / No                        Yes / Yes                     Eigen, MKL            ? (via elephas)           Limited
    BigDL       Yes / No                             Yes / No                       No / No                       MKL                   Yes                       No
  • 25. Execution time for 10 epochs with LeNet-5 and 60K MNIST dataset • (Chart: execution time on a log scale from 1 to 10000; groups: CPU single precision, GPU single precision, CPU double precision, GPU double precision; series: TF, TF with XLA, SystemML, SystemML with codegen, Intel BigDL) • Chart callout: due to limited single precision support in SystemML • SystemML/TF outperforms BigDL for minibatch training • Both TF and SystemML perform equally well on GPU • BigDL: no GPU support • TF: no support for double precision • SystemML: no GPU codegen support yet • Code generation improves performance both on CPU and on GPU
  • 26. Summary • SystemML simplifies the life of data scientists • Custom Machine/Deep Learning Algorithms • Scale up & out • Mixed Workloads • Memory access bound • Compute bound • Strike a balance between • Data transfer • Parallelism

Editor's Notes

  1. 2x faster, 4x mem, 1st PCIe gen4 on chip, support for NVLink
  2. Automated Grading of Gliomas using Deep Learning in Digital Pathology Images. Steps: (1) Cut a “whole-slide” image into square “tiles” at 20x magnification. (2) Filter the “tiles” to remove any without tissue. (3) Cut the remaining “tiles” into smaller “samples”. (4) Assign a tumor score label to each sample based on the tumor score of the “whole-slide” image. (5) Repeat 1-4 for all “whole-slide” images. (6) Train a convolutional neural network with the resulting dataset of labeled “samples”. Good results!
  3. Contraction path (increase the "what", reduce the "where"): convolutions followed by a non-linear activation function. Expansion path (create a high-resolution segmentation map): sequence of up-convolutions and concatenation with high-resolution features from the contracting path. Output has 2 channels: one for the foreground and one for the background. Training time: 10h with GPU. Application: 1s per image; better accuracy than a sliding-window CNN.