Scalable Machine/Deep
Learning with Apache SystemML
on OpenPOWER
Berthold Reinwald
reinwald@us.ibm.com
IBM Research – Almaden, San Jose, CA
AI and OpenPOWER Meetup at h2o.AI
March 25th, 2018
IBM Research achieves record deep learning
performance with new software technology
• Training time cut from weeks to hours.
• SW/HW co-optimization achieves near-linear scaling up to hundreds of GPUs.
• A multi-ring communication pattern provides a good tradeoff between latency and bandwidth (see the cost-model sketch below).
• ResNet-101 trained on ImageNet-22K with 64 IBM Power8 S822LC servers (256 GPUs) in about 7 hours, reaching 33.8% validation accuracy; Microsoft's ADAM and Google's DistBelief results did not reach 30% validation accuracy.
• Compared to Facebook AI Research's 256-GPU training, the new communication algorithm combined with better SW/HW yields lower communication overhead for ResNet-50. A PowerAI DDL-enabled version of Torch completed 90 epochs of ResNet-50 training for 1K classes in 50 minutes using 64 IBM Power8 S822LC servers (256 GPUs).
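The latency/bandwidth tradeoff can be illustrated with the textbook single-ring all-reduce cost model. The sketch below is illustrative only: it is not PowerAI DDL's multi-ring algorithm, and the latency, bandwidth, and gradient-size numbers are assumptions.

```python
# Textbook single-ring all-reduce cost model (NOT PowerAI DDL's multi-ring
# algorithm); alpha, beta, and the gradient size are illustrative assumptions.
N = 256          # number of GPUs
M = 100e6        # bytes of gradients per step (e.g. ~25M FP32 parameters)
alpha = 5e-6     # per-message latency in seconds (assumed)
beta = 10e9      # per-link bandwidth in bytes/s (assumed)

latency_term = 2 * (N - 1) * alpha              # grows linearly with N
bandwidth_term = 2 * (N - 1) / N * M / beta     # ~2*M/beta, independent of N

# A single ring is bandwidth-efficient but its step count (latency term)
# grows with N; combining several smaller rings trades a little bandwidth
# for far fewer sequential steps.
print(f"latency term: {latency_term:.4f}s, bandwidth term: {bandwidth_term:.4f}s")
```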
U-Net: Deep Convolutional Neural Network for Segmentation of Biomedical Images
Problem:
• Learn segmentation
• Very few annotated images (approx. 30 per application)
• Touching objects of the same class need to be separated by the segmentation algorithm
Challenges:
• 3D
• Shapes that do not generalize
• Gradual edges
[Figure: raw image and the corresponding segmentation map]
Challenges in Machine/Deep Learning
• Simplify the Life of Data Scientists
• Custom algorithms & DNNs
• Fast turnaround time
• Data Characteristics
(input/intermediates/output)
• Dense / sparse
• Small / large number of data points
• Small / large number of features
• Mixed Workloads
• Compute bound
• I/O or memory bandwidth bound
• Core Operations
• Data manipulation
• Linear algebra
• Convolution
• Iterative
• Multiple Stages
• Training
• Testing
• Inference
• Deployment Environments
• Range from embeddable scoring library (low
latency), to scale up on large nodes, to
distributed
• Libraries: MKL/MKL-DNN, OpenBLAS, CUDA/cuDNN, and low precision
• Hardware
• x86/Power
• Many cores, GPU, TPU, FPGA
• High-speed interconnects (Topologies)
• … all combinations
Why Apache SystemML
• Today’s Roles of Data Scientists
• Algorithm researcher: Invent new optimization schemes
• Systems programmer: provide distributed
implementations
• Deployment engineer: Run for varying datasets
• Systems researcher: Optimize clusters
• SystemML simplifies the Life of Data Scientists
• in implementing custom machine learning
• in running algorithms distributed if needed
• in running algorithms on data ranging from small to large
• with fast turnaround
[Venue logos: NIPS, ICML, KDD, JMLR]
Apache SystemML – Declarative Machine Learning
• Productivity of data scientists
• Machine learning language for data scientists
(“The SQL for analytics”)
• Strong foundation in linear algebra and statistical functions
• Comes with 20+ pre-implemented algorithms
• Enables solution development and tooling
• Scalability & Performance
• Built on data parallel platforms, e.g. Spark
• Cost-based optimizer to compile execution plans
• Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics
• Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans
• APIs & Tools
• Command line: standalone Java app, spark-submit, hadoop jar
• Use in Spark through Scala, Python, R, and Java APIs (see the sketch after this list)
• Embeddable scoring library
• Tools: REPL (Scala Spark and pyspark), SparkR, SparkML, Jupyter and Zeppelin notebooks
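As a concrete example of the Python API, here is a minimal sketch of running a DML script from PySpark via MLContext. The module and method names follow the Apache SystemML Python documentation, but details may differ between releases.

```python
from pyspark.sql import SparkSession
from systemml import MLContext, dml

spark = SparkSession.builder.appName("systemml-demo").getOrCreate()
ml = MLContext(spark)

# A small DML script: mean of a random 10,000 x 100 matrix
script = dml("""
    X = rand(rows=10000, cols=100)
    m = mean(X)
""").output("m")

print(ml.execute(script).get("m"))   # scalar result, roughly 0.5
```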
[Architecture diagram: Language, Compiler, and Runtime layers; the runtime executes either in-memory on a single node (scale-up) or on a Hadoop or Spark cluster (scale-out)]
SystemML integrated in Spark Ecosystem
[Diagram: SystemML in the Spark ecosystem. Spark SQL, Spark Streaming, MLlib, GraphX, and SystemML sit on the Spark Core Engine; analytics libraries and custom analytics/machine learning use the DataFrame-based Spark API to SystemML, and SystemML runs against Spark core for distributed computations.]
Apache SystemML Open Source
• Apache Open source Project (http://systemml.apache.org/)
• Nov. 2015, SystemML enters the Apache Incubator
• …
• Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API
• May 2017, Release 0.14.0 on Spark 2.0.2+
• May 2017, Apache Top Level Project
• …
• Dec 2017, Release 1.0.0
• March 2018, Release 1.1.0
• Release downloads (http://systemml.apache.org/download)
• Binaries
• Coordinates to Maven repository
• Github source code (https://github.com/apache/systemml)
• Documentation (https://apache.github.io/systemml/)
• 3 Hours KDD Hands-On Tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
Automatic Algebraic Simplification Rewrites lead
to Significant Performance Improvements
• Simplify operations over matrix multiplication (eliminate unnecessary compute)
• trace(X %*% Y) → sum(X * t(Y))
• Remove unnecessary operations (merge operations)
• rand(…, min=-1, max=1) * 7 → rand(…, min=-7, max=7)
• Binary to unary operations (reduce the amount of data touched)
• X * X → X^2
• Remove unnecessary indexing (eliminate operations, conditional)
• X[a:b,c:d] = Y → X = Y iff dims(X) = dims(Y)
• … tens more rewrite rules (two of the identities above are checked numerically below)
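For illustration, the following numpy snippet numerically verifies two of the identities above. It demonstrates the algebra only; it is not SystemML's rewrite engine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))
Y = rng.standard_normal((300, 500))

# trace(X %*% Y) == sum(X * t(Y)): avoids materializing the 500x500 product
assert np.isclose(np.trace(X @ Y), np.sum(X * Y.T))

# X * X == X ^ 2: binary op replaced by a cheaper unary op
assert np.allclose(X * X, X ** 2)
```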
Training a Deep Neural Network
Training features: X
Training labels: y
Goal: learn the weights of the network
Define a loss function; for numerical stability and mathematical simplicity, we use the negative log-likelihood (often referred to as cross-entropy): L(y, ŷ) = -Σ_i y_i log(ŷ_i)
“Forward propagation”
Compute a function via composition of linear transformations followed by element-wise non-linearities
“Backward propagation”
Propagate errors backwards and update the weights according to how much they contributed to the output error
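A minimal numpy sketch of forward and backward propagation for a one-hidden-layer softmax classifier with the cross-entropy loss above; the network sizes, random data, and learning rate are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, H, K = 64, 20, 32, 3             # batch size, input dim, hidden dim, classes
X = rng.standard_normal((N, D))
y = rng.integers(0, K, size=N)          # integer class labels
W1, b1 = 0.01 * rng.standard_normal((D, H)), np.zeros(H)
W2, b2 = 0.01 * rng.standard_normal((H, K)), np.zeros(K)
lr = 0.1

for step in range(200):
    # Forward propagation: linear transformations + element-wise non-linearities
    h = np.maximum(0, X @ W1 + b1)                  # ReLU hidden layer
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()   # negative log-likelihood / cross-entropy

    # Backward propagation: propagate errors back and update each weight
    # according to how much it contributed to the output error
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    dW2, db2 = h.T @ dscores, dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0                                  # gradient through ReLU
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```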
Deep Learning Layers
• Fully connected layer
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
Convolution Layer
• Fewer parameters than a fully connected layer (see the parameter-count sketch below)
• Useful for capturing local features (spatially)
• Output #channels = #filters
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
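A back-of-the-envelope comparison (all sizes are illustrative assumptions) shows why a convolution layer needs far fewer weights than a fully connected layer producing the same output volume.

```python
# Illustrative parameter-count comparison (sizes are assumptions, not from the
# slides): a 3x3 convolution vs. a fully connected layer producing the same
# output volume for a 64x64 RGB input.
C_in, H, W = 3, 64, 64
C_out, k, stride, pad = 32, 3, 1, 1

H_out = (H + 2 * pad - k) // stride + 1
W_out = (W + 2 * pad - k) // stride + 1

conv_params = C_out * (C_in * k * k + 1)                  # weights + bias per filter
fc_params = (C_in * H * W + 1) * (C_out * H_out * W_out)  # dense layer with same output size

print(H_out, W_out)        # 64 64 -> output #channels = #filters = 32
print(conv_params)         # 896 parameters
print(fc_params)           # ~1.6e9 parameters
```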
Deep Learning Support
• Reuse existing infrastructure to implement
custom DNNs like other training algorithms
• Small number of DL-specific built-in functions
• e.g. convolution
• NN library of layers and training optimizers to stack layers, e.g.
• Affine (fully-connected) layer is matrix multiplication
• Convolution layer invokes new convolution function
• Caffe2DML/Keras2DML to import existing DNNs (see the sketch below)
• Transfer learning to continue training on different data
• GPU and native BLAS libraries
NN library: [code listing of the DML layer and optimizer implementations]
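A hedged sketch of the Keras2DML import path mentioned above, based on the systemml.mllearn API; the constructor arguments, the toy Keras model, and the training data placeholders are assumptions and may differ from your SystemML release.

```python
from pyspark.sql import SparkSession
from keras.models import Sequential
from keras.layers import Dense
from systemml.mllearn import Keras2DML

spark = SparkSession.builder.getOrCreate()

# A toy Keras model (placeholder); Keras2DML converts it to a DML script.
keras_model = Sequential()
keras_model.add(Dense(256, activation='relu', input_shape=(784,)))
keras_model.add(Dense(10, activation='softmax'))

# input_shape and other arguments are illustrative and may vary by release.
sysml_model = Keras2DML(spark, keras_model, input_shape=(1, 28, 28))
sysml_model.fit(X_train, y_train)      # X_train / y_train: your (assumed) MNIST-style data
preds = sysml_model.predict(X_test)    # X_test: assumed test data
```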
Code Generation for Operator Fusion
• Motivation
• Ubiquitous Fusion Opportunities
• High Performance Impact
• Key Ideas
• Template skeletons (Row, Cell, Outer, MultiAgg); a single-pass MultiAgg sketch follows the figure below
• Candidate exploration to identify fusion opportunities
• Candidate selection via a cost-based optimizer or heuristics
• Code generation with janino/javac during initial compilation and dynamic recompilation
[Figure: example fused operator plans, including a MultiAgg template computing a = sum(X^2), b = sum(X*Y), c = sum(Y^2) in one pass; an element-wise chain sum(X * Y * Z); a two-pass matrix-vector chain t(X) %*% (X %*% v); and a sparsity-exploiting template for sum(X * log(U %*% t(V)))]
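As an illustration of what the MultiAgg template buys, the numpy sketch below computes the three aggregates from the figure in a single pass over X and Y instead of three separate scans; it mimics the effect of fusion rather than showing SystemML's generated code.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 500))
Y = rng.standard_normal((2000, 500))

# Unfused: three separate passes over the data, three intermediates
a = np.sum(X ** 2)
b = np.sum(X * Y)
c = np.sum(Y ** 2)

# "Fused": one row-wise scan of X and Y, accumulating all three aggregates
# at once (conceptually what the generated MultiAgg operator does)
acc = np.zeros(3)
for xi, yi in zip(X, Y):
    acc += (xi @ xi, xi @ yi, yi @ yi)

assert np.allclose(acc, (a, b, c))
```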
Codegen Micro Benchmarks (FP64)
[Charts: execution times for sum(X ʘ Y ʘ Z) dense; sum(X ʘ Y ʘ Z) sparse (sparsity 0.1); t(X) %*% (X %*% v) dense (data size 20K x 20K); sum(X ʘ log(U %*% t(V) + 1e-15))]
#1 Generated operators come close to hand-coded fused operators
#2 TF/Julia code generation is only single-threaded
#3 TF has very limited sparse support
#4 Sparse code generation is challenging; generated operators beat hand-coded operators
#5 TF shows poor performance for data-intensive operations
#6 Generated operators run at peak memory bandwidth
#7 Automatic sparsity exploitation across chains of operations
SystemML on Power Environment
• Contributed native ppc64le libraries for JCuda to the mavenized jcuda project
• Enables the GPU backend for SystemML on Power
• Contributed native ppc64le libraries to the protoc project
• Useful for compiling Caffe proto files
• Support for native BLAS operations in SystemML (see the sketch after this list)
• Matrix Multiplication, Convolution (forward/backward)
• OpenBLAS with OpenMP support
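A sketch of enabling the native BLAS backend from Python. Per the SystemML native-backend guide the relevant property is sysml.native.blas; whether your MLContext build exposes setConfigProperty, and the accepted values, should be checked against your release.

```python
from pyspark.sql import SparkSession
from systemml import MLContext, dml

spark = SparkSession.builder.getOrCreate()
ml = MLContext(spark)
# "openblas" / "mkl" / "auto" are the values described in the native-backend
# guide; setConfigProperty availability may depend on the SystemML release.
ml.setConfigProperty("sysml.native.blas", "openblas")

# Matrix multiplication (and convolution) can then dispatch to the native BLAS
script = dml("""
    X = rand(rows=2000, cols=2000)
    Y = rand(rows=2000, cols=2000)
    s = sum(X %*% Y)
""").output("s")
print(ml.execute(script).get("s"))
```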
Linear Regression Conjugate Gradient
(preliminary 1/2)
[Chart: time in seconds vs. number of rows of the input matrix (64K to 2048K), comparing PPC CPU, PPC GPU, x86 CPU, and x86 GPU times]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver memory: 100G, local[*] master
The matrix-vector multiplication chain is memory-bandwidth bound, but more cores help with parallelization (see the sketch below).
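The core of each conjugate-gradient iteration for linear regression is essentially the matrix-vector chain q = t(X) %*% (X %*% p) (plus a regularization term). The numpy sketch below (sizes are illustrative) shows why it is memory-bandwidth bound: two streaming passes over X with only one multiply-add per element read.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100_000, 1000))   # ~0.8 GB at FP64; the benchmark used up to 2048K rows
p = rng.standard_normal(1000)

# Two streaming passes over X; the 1000 x 1000 Gram matrix t(X) %*% X is never
# materialized, and each element of X read contributes only one multiply-add,
# so memory bandwidth (and the number of cores scanning X) dominates runtime.
q = X.T @ (X @ p)
```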
Linear Regression Conjugate Gradient
(preliminary 2/2)
[Charts: total time in seconds and CPU-GPU transfer ("toDev") time vs. number of rows of the input matrix (64K to 1024K), comparing PPC GPU and x86 GPU]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver memory: 100G, local[*] master
Most of the time is spent transferring data from host to device, so CPU-GPU NVLink gives about a 2x performance benefit.
Capabilities of DL frameworks

Framework | Single precision (CPU / GPU) | Double precision (CPU / GPU) | Code generation (CPU / GPU) | BLAS | Spark DataFrame support | Sparse operations
SystemML  | Limited (only for BLAS) / Yes | Yes / Yes | Yes / No | OpenBLAS, MKL, Java | Yes | Yes
TF 1.5    | Yes / Yes | No / No | Yes / Yes | Eigen, MKL | ? (via elephas) | Limited
BigDL     | Yes / No | Yes / No | No / No | MKL | Yes | No
Execution time for 10 epochs with LeNet-5 and the 60K MNIST dataset
[Chart (log-scale time): CPU single precision, GPU single precision, CPU double precision, GPU double precision; systems compared: TF, TF with XLA, SystemML, SystemML with codegen, Intel BigDL]
• SystemML has only limited single precision support
• SystemML and TF outperform BigDL for minibatch training
• TF and SystemML perform equally well on GPU
• BigDL: no GPU support
• TF: no support for double precision
• SystemML: no GPU codegen support yet
• Code generation improves performance both on CPU and on GPU
Summary
• SystemML simplifies the life of data scientists
• Custom machine/deep learning algorithms
• Scale up & out
• Mixed workloads
• Memory-access bound
• Compute bound
• Strikes a balance between
• Data transfer
• Parallelism
Editor's Notes
2x faster, 4x mem, 1st PCIe gen4 on chip, support for NVLink
Automated Grading of Gliomas using Deep Learning in Digital Pathology Images
1. Cut a “whole-slide” image into square “tiles” at 20x magnification.
2. Filter the “tiles” to remove any without tissue.
3. Cut the remaining “tiles” into smaller “samples”.
4. Assign a tumor score label to each sample based on the tumor score of the “whole-slide” image.
5. Repeat 1-4 for all “whole-slide” images.
6. Train a convolutional neural network with the resulting dataset of labeled “samples”.
Good results!
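A hedged Python sketch of steps 1-5 above, using PIL and numpy; the file handling, tile/sample sizes, and the tissue filter threshold are placeholders rather than values from the talk.

```python
# Hedged sketch of steps 1-5 using PIL + numpy; the paths, tile size, sample
# size, and tissue threshold below are placeholders, not values from the talk.
import numpy as np
from PIL import Image

TILE, SAMPLE, TISSUE_THRESHOLD = 1024, 256, 230   # placeholder sizes/threshold

def cut(img, size):
    """Cut an image array into non-overlapping square patches of `size`."""
    h, w = img.shape[:2]
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def has_tissue(tile):
    """Crude filter: mostly-white tiles contain no tissue."""
    return tile.mean() < TISSUE_THRESHOLD

def samples_for_slide(path, tumor_score):
    """Steps 1-4 for one whole-slide image (assumed already exported at 20x)."""
    slide = np.asarray(Image.open(path).convert("RGB"))
    tiles = [t for t in cut(slide, TILE) if has_tissue(t)]
    samples = [s for t in tiles for s in cut(t, SAMPLE)]
    return [(s, tumor_score) for s in samples]

# Step 5: repeat for all slides (slide_index is a hypothetical list of
# (path, tumor_score) pairs), then train a CNN on the labeled samples.
# dataset = [ex for path, score in slide_index for ex in samples_for_slide(path, score)]
```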
Contraction path (increase the "what", reduce the "where"): convolutions followed by a non-linear activation function.
Expansion path (create a high-resolution segmentation map): a sequence of up-convolutions and concatenations with high-resolution features from the contracting path.
Output has 2 channels: one for the foreground and one for the background.
Training time: 10 h with GPU. Application: 1 s per image.
Better accuracy than a sliding-window CNN.