Scalable Machine/Deep
Learning with Apache SystemML
on OpenPOWER
Berthold Reinwald
reinwald@us.ibm.com
IBM Research – Almaden, San Jose, CA
AI and OpenPOWER Meetup at h2o.AI
March 25th, 2018
IBM Research achieves record deep learning
performance with new software technology
• Training time cut from weeks to hours.
• SW/HW co-optimization achieves near-linear scaling up to hundreds of GPUs.
• A multi-ring communication pattern provides a good tradeoff between latency and bandwidth (see the cost-model sketch below).
• ResNet-101 trained on ImageNet-22K with 64 IBM Power8 S822LC servers (256 GPUs) in about 7 hours, reaching 33.8% validation accuracy; Microsoft's ADAM and Google's DistBelief results did not reach 30% validation accuracy.
• Compared to Facebook AI Research's 256-GPU training, the new communication algorithm combined with better SW/HW yields lower communication overhead for ResNet-50. A PowerAI DDL-enabled version of Torch completed 90 epochs of ResNet-50 training for 1K classes in 50 minutes using 64 IBM Power8 S822LC servers (256 GPUs).
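The latency/bandwidth tradeoff can be illustrated with the textbook single-ring all-reduce cost model. The sketch below is illustrative only: it is not PowerAI DDL's multi-ring algorithm, and the latency, bandwidth, and gradient-size numbers are assumptions.

```python
# Textbook single-ring all-reduce cost model (NOT PowerAI DDL's multi-ring
# algorithm); alpha, beta, and the gradient size are illustrative assumptions.
N = 256          # number of GPUs
M = 100e6        # bytes of gradients per step (e.g. ~25M FP32 parameters)
alpha = 5e-6     # per-message latency in seconds (assumed)
beta = 10e9      # per-link bandwidth in bytes/s (assumed)

latency_term = 2 * (N - 1) * alpha              # grows linearly with N
bandwidth_term = 2 * (N - 1) / N * M / beta     # ~2*M/beta, independent of N

# A single ring is bandwidth-efficient but its step count (latency term)
# grows with N; combining several smaller rings trades a little bandwidth
# for far fewer sequential steps.
print(f"latency term: {latency_term:.4f}s, bandwidth term: {bandwidth_term:.4f}s")
```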
U-Net: Deep Convolutional Neural Network for Segmentation of Biomedical Images
Problem:
• Learn segmentation
• Very few annotated images (approx. 30 per application)
• Touching objects of the same class need to be separated by the segmentation algorithm
Challenges:
• 3D
• Shapes that do not generalize
• Gradual edges
[Figure: raw image and the corresponding segmentation map]
Challenges in Machine/Deep Learning
• Simplify the Life of Data Scientists
• Custom algorithms & DNNs
• Fast turnaround time
• Data Characteristics
(input/intermediates/output)
• Dense / sparse
• Small / large number of data points
• Small / large number of features
• Mixed Workloads
• Compute bound
• I/O or memory bandwidth bound
• Core Operations
• Data manipulation
• Linear algebra
• Convolution
• Iterative
• Multiple Stages
• Training
• Testing
• Inference
• Deployment Environments
• Range from embeddable scoring library (low
latency), to scale up on large nodes, to
distributed
• Libraries: MKL/MKL-DNN, OpenBLAS, CUDA/cuDNN, and low precision
• Hardware
• x86/Power
• Many cores, GPU, TPU, FPGA
• High-speed interconnects (Topologies)
• … all combinations
Why Apache SystemML
• Today’s Roles of Data Scientists
• Algorithm researcher: Invent new optimization schemes
• Systems programmer: provide distributed
implementations
• Deployment engineer: Run for varying datasets
• Systems researcher: Optimize clusters
• SystemML simplifies the Life of Data Scientists
• in implementing custom machine learning
• in running algorithms distributed if needed
• in running algorithms on data ranging from small to large
• with fast turnaround
[Venue logos: NIPS, ICML, KDD, JMLR]
Apache SystemML – Declarative Machine Learning
• Productivity of data scientists
• Machine learning language for data scientists
(“The SQL for analytics”)
• Strong foundation in linear algebra and statistical functions
• Comes with 20+ pre-implemented algorithms
• Enables solution development and tooling
• Scalability & Performance
• Built on data parallel platforms, e.g. Spark
• Cost-based optimizer to compile execution plans
• Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics
• Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans
• APIs & Tools
• Command line: standalone Java app, spark-submit, hadoop jar
• Use in Spark through Scala, Python, R, and Java APIs (see the sketch after this list)
• Embeddable scoring library
• Tools: REPL (Scala Spark and pyspark), SparkR, SparkML, Jupyter and Zeppelin notebooks
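As a concrete example of the Python API, here is a minimal sketch of running a DML script from PySpark via MLContext. The module and method names follow the Apache SystemML Python documentation, but details may differ between releases.

```python
from pyspark.sql import SparkSession
from systemml import MLContext, dml

spark = SparkSession.builder.appName("systemml-demo").getOrCreate()
ml = MLContext(spark)

# A small DML script: mean of a random 10,000 x 100 matrix
script = dml("""
    X = rand(rows=10000, cols=100)
    m = mean(X)
""").output("m")

print(ml.execute(script).get("m"))   # scalar result, roughly 0.5
```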
[Architecture diagram: Language, Compiler, and Runtime layers; the runtime executes either in-memory on a single node (scale-up) or on a Hadoop or Spark cluster (scale-out)]
SystemML integrated in Spark Ecosystem
[Diagram: SystemML in the Spark ecosystem. Spark SQL, Spark Streaming, MLlib, GraphX, and SystemML sit on the Spark Core Engine; analytics libraries and custom analytics/machine learning use the DataFrame-based Spark API to SystemML, and SystemML runs against Spark core for distributed computations.]
Apache SystemML Open Source
• Apache Open source Project (http://systemml.apache.org/)
• Nov. 2015, SystemML enters the Apache Incubator
• …
• Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API
• May 2017, Release 0.14.0 on Spark 2.0.2+
• May 2017, Apache Top Level Project
• …
• Dec 2017, Release 1.0.0
• March 2018, Release 1.1.0
• Release downloads (http://systemml.apache.org/download)
• Binaries
• Coordinates to Maven repository
• Github source code (https://github.com/apache/systemml)
• Documentation (https://apache.github.io/systemml/)
• 3 Hours KDD Hands-On Tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
Automatic Algebraic Simplification Rewrites lead
to Significant Performance Improvements
• Simplify operations over matrix multiplication (eliminate unnecessary compute)
• trace(X %*% Y) → sum(X * t(Y))
• Remove unnecessary operations (merge operations)
• rand(…, min=-1, max=1) * 7 → rand(…, min=-7, max=7)
• Binary to unary operations (reduce the amount of data touched)
• X * X → X^2
• Remove unnecessary indexing (eliminate operations, conditional)
• X[a:b,c:d] = Y → X = Y iff dims(X) = dims(Y)
• … tens more rewrite rules (two of the identities above are checked numerically below)
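For illustration, the following numpy snippet numerically verifies two of the identities above. It demonstrates the algebra only; it is not SystemML's rewrite engine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))
Y = rng.standard_normal((300, 500))

# trace(X %*% Y) == sum(X * t(Y)): avoids materializing the 500x500 product
assert np.isclose(np.trace(X @ Y), np.sum(X * Y.T))

# X * X == X ^ 2: binary op replaced by a cheaper unary op
assert np.allclose(X * X, X ** 2)
```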
Training a Deep Neural Network
Training features: X
Training labels: y
Goal: learn the weights of the network
Define a loss function; for numerical stability and mathematical simplicity, we use the negative log-likelihood (often referred to as cross-entropy): L(y, ŷ) = -Σ_i y_i log(ŷ_i)
“Forward propagation”
Compute a function via composition of linear transformations followed by element-wise non-linearities
“Backward propagation”
Propagate errors backwards and update the weights according to how much they contributed to the output error
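A minimal numpy sketch of forward and backward propagation for a one-hidden-layer softmax classifier with the cross-entropy loss above; the network sizes, random data, and learning rate are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, H, K = 64, 20, 32, 3             # batch size, input dim, hidden dim, classes
X = rng.standard_normal((N, D))
y = rng.integers(0, K, size=N)          # integer class labels
W1, b1 = 0.01 * rng.standard_normal((D, H)), np.zeros(H)
W2, b2 = 0.01 * rng.standard_normal((H, K)), np.zeros(K)
lr = 0.1

for step in range(200):
    # Forward propagation: linear transformations + element-wise non-linearities
    h = np.maximum(0, X @ W1 + b1)                  # ReLU hidden layer
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()   # negative log-likelihood / cross-entropy

    # Backward propagation: propagate errors back and update each weight
    # according to how much it contributed to the output error
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    dW2, db2 = h.T @ dscores, dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0                                  # gradient through ReLU
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```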
Deep Learning Layers
• Fully connected layer
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
Convolution Layer
• Fewer parameters than a fully connected layer (see the parameter-count sketch below)
• Useful for capturing local features (spatially)
• Output #channels = #filters
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
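A back-of-the-envelope comparison (all sizes are illustrative assumptions) shows why a convolution layer needs far fewer weights than a fully connected layer producing the same output volume.

```python
# Illustrative parameter-count comparison (sizes are assumptions, not from the
# slides): a 3x3 convolution vs. a fully connected layer producing the same
# output volume for a 64x64 RGB input.
C_in, H, W = 3, 64, 64
C_out, k, stride, pad = 32, 3, 1, 1

H_out = (H + 2 * pad - k) // stride + 1
W_out = (W + 2 * pad - k) // stride + 1

conv_params = C_out * (C_in * k * k + 1)                  # weights + bias per filter
fc_params = (C_in * H * W + 1) * (C_out * H_out * W_out)  # dense layer with same output size

print(H_out, W_out)        # 64 64 -> output #channels = #filters = 32
print(conv_params)         # 896 parameters
print(fc_params)           # ~1.6e9 parameters
```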
Deep Learning Support
• Reuse existing infrastructure to implement
custom DNNs like other training algorithms
• Small number of DL-specific built-in functions
• e.g. convolution
• NN library of layers and training optimizers to stack layers, e.g.
• Affine (fully-connected) layer is matrix multiplication
• Convolution layer invokes new convolution function
• Caffe2DML/Keras2DML to import existing DNNs (see the sketch below)
• Transfer learning to continue training on different data
• GPU and native BLAS libraries
NN library: [code listing of the DML layer and optimizer implementations]
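A hedged sketch of the Keras2DML import path mentioned above, based on the systemml.mllearn API; the constructor arguments, the toy Keras model, and the training data placeholders are assumptions and may differ from your SystemML release.

```python
from pyspark.sql import SparkSession
from keras.models import Sequential
from keras.layers import Dense
from systemml.mllearn import Keras2DML

spark = SparkSession.builder.getOrCreate()

# A toy Keras model (placeholder); Keras2DML converts it to a DML script.
keras_model = Sequential()
keras_model.add(Dense(256, activation='relu', input_shape=(784,)))
keras_model.add(Dense(10, activation='softmax'))

# input_shape and other arguments are illustrative and may vary by release.
sysml_model = Keras2DML(spark, keras_model, input_shape=(1, 28, 28))
sysml_model.fit(X_train, y_train)      # X_train / y_train: your (assumed) MNIST-style data
preds = sysml_model.predict(X_test)    # X_test: assumed test data
```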
Code Generation for Operator Fusion
• Motivation
• Ubiquitous Fusion Opportunities
• High Performance Impact
• Key Ideas
• Template skeletons (Row, Cell, Outer, MultiAgg); a single-pass MultiAgg sketch follows the figure below
• Candidate exploration to identify fusion opportunities
• Candidate selection via a cost-based optimizer or heuristics
• Code generation with janino/javac during initial compilation and dynamic recompilation
[Figure: example fused operator plans, including a MultiAgg template computing a = sum(X^2), b = sum(X*Y), c = sum(Y^2) in one pass; an element-wise chain sum(X * Y * Z); a two-pass matrix-vector chain t(X) %*% (X %*% v); and a sparsity-exploiting template for sum(X * log(U %*% t(V)))]
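As an illustration of what the MultiAgg template buys, the numpy sketch below computes the three aggregates from the figure in a single pass over X and Y instead of three separate scans; it mimics the effect of fusion rather than showing SystemML's generated code.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 500))
Y = rng.standard_normal((2000, 500))

# Unfused: three separate passes over the data, three intermediates
a = np.sum(X ** 2)
b = np.sum(X * Y)
c = np.sum(Y ** 2)

# "Fused": one row-wise scan of X and Y, accumulating all three aggregates
# at once (conceptually what the generated MultiAgg operator does)
acc = np.zeros(3)
for xi, yi in zip(X, Y):
    acc += (xi @ xi, xi @ yi, yi @ yi)

assert np.allclose(acc, (a, b, c))
```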
Codegen Micro Benchmarks (FP64)
[Charts: execution times for sum(X ʘ Y ʘ Z) dense; sum(X ʘ Y ʘ Z) sparse (sparsity 0.1); t(X) %*% (X %*% v) dense (data size 20K x 20K); sum(X ʘ log(U %*% t(V) + 1e-15))]
#1 Generated operators come close to hand-coded fused operators
#2 TF/Julia code generation is only single-threaded
#3 TF has very limited sparse support
#4 Sparse code generation is challenging; generated operators beat hand-coded operators
#5 TF shows poor performance for data-intensive operations
#6 Generated operators run at peak memory bandwidth
#7 Automatic sparsity exploitation across chains of operations
SystemML on Power Environment
• Contributed native ppc64le libraries for JCuda to the mavenized jcuda project
• Enables the GPU backend for SystemML on Power
• Contributed native ppc64le libraries to the protoc project
• Useful for compiling Caffe proto files
• Support for native BLAS operations in SystemML (see the sketch after this list)
• Matrix Multiplication, Convolution (forward/backward)
• OpenBLAS with OpenMP support
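A sketch of enabling the native BLAS backend from Python. Per the SystemML native-backend guide the relevant property is sysml.native.blas; whether your MLContext build exposes setConfigProperty, and the accepted values, should be checked against your release.

```python
from pyspark.sql import SparkSession
from systemml import MLContext, dml

spark = SparkSession.builder.getOrCreate()
ml = MLContext(spark)
# "openblas" / "mkl" / "auto" are the values described in the native-backend
# guide; setConfigProperty availability may depend on the SystemML release.
ml.setConfigProperty("sysml.native.blas", "openblas")

# Matrix multiplication (and convolution) can then dispatch to the native BLAS
script = dml("""
    X = rand(rows=2000, cols=2000)
    Y = rand(rows=2000, cols=2000)
    s = sum(X %*% Y)
""").output("s")
print(ml.execute(script).get("s"))
```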
Linear Regression Conjugate Gradient
(preliminary 1/2)
[Chart: time in seconds vs. number of rows of the input matrix (64K to 2048K), comparing PPC CPU, PPC GPU, x86 CPU, and x86 GPU times]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver memory: 100G, local[*] master
The matrix-vector multiplication chain is memory-bandwidth bound, but more cores help with parallelization (see the sketch below).
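The core of each conjugate-gradient iteration for linear regression is essentially the matrix-vector chain q = t(X) %*% (X %*% p) (plus a regularization term). The numpy sketch below (sizes are illustrative) shows why it is memory-bandwidth bound: two streaming passes over X with only one multiply-add per element read.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100_000, 1000))   # ~0.8 GB at FP64; the benchmark used up to 2048K rows
p = rng.standard_normal(1000)

# Two streaming passes over X; the 1000 x 1000 Gram matrix t(X) %*% X is never
# materialized, and each element of X read contributes only one multiply-add,
# so memory bandwidth (and the number of cores scanning X) dominates runtime.
q = X.T @ (X @ p)
```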
Linear Regression Conjugate Gradient
(preliminary 2/2)
[Charts: total time in seconds and CPU-GPU transfer ("toDev") time vs. number of rows of the input matrix (64K to 1024K), comparing PPC GPU and x86 GPU]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver memory: 100G, local[*] master
Most of the time is spent transferring data from host to device, so CPU-GPU NVLink gives about a 2x performance benefit.
Capabilities of DL frameworks

Framework | Single precision (CPU / GPU) | Double precision (CPU / GPU) | Code generation (CPU / GPU) | BLAS | Spark DataFrame support | Sparse operations
SystemML  | Limited (only for BLAS) / Yes | Yes / Yes | Yes / No | OpenBLAS, MKL, Java | Yes | Yes
TF 1.5    | Yes / Yes | No / No | Yes / Yes | Eigen, MKL | ? (via elephas) | Limited
BigDL     | Yes / No | Yes / No | No / No | MKL | Yes | No
Execution time for 10 epochs with LeNet-5 and the 60K MNIST dataset
[Chart (log-scale time): CPU single precision, GPU single precision, CPU double precision, GPU double precision; systems compared: TF, TF with XLA, SystemML, SystemML with codegen, Intel BigDL]
• SystemML has only limited single precision support
• SystemML and TF outperform BigDL for minibatch training
• TF and SystemML perform equally well on GPU
• BigDL: no GPU support
• TF: no support for double precision
• SystemML: no GPU codegen support yet
• Code generation improves performance both on CPU and on GPU
Summary
• SystemML simplifies the life of data scientists
• Custom machine/deep learning algorithms
• Scale up & out
• Mixed workloads
• Memory-access bound
• Compute bound
• Strikes a balance between
• Data transfer
• Parallelism
Editor's Notes
2x faster, 4x mem, 1st PCIe gen4 on chip, support for NVLink
Automated Grading of Gliomas using Deep Learning in Digital Pathology Images
1. Cut a “whole-slide” image into square “tiles” at 20x magnification.
2. Filter the “tiles” to remove any without tissue.
3. Cut the remaining “tiles” into smaller “samples”.
4. Assign a tumor score label to each sample based on the tumor score of the “whole-slide” image.
5. Repeat 1-4 for all “whole-slide” images.
6. Train a convolutional neural network with the resulting dataset of labeled “samples”.
Good results!
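A hedged Python sketch of steps 1-5 above, using PIL and numpy; the file handling, tile/sample sizes, and the tissue filter threshold are placeholders rather than values from the talk.

```python
# Hedged sketch of steps 1-5 using PIL + numpy; the paths, tile size, sample
# size, and tissue threshold below are placeholders, not values from the talk.
import numpy as np
from PIL import Image

TILE, SAMPLE, TISSUE_THRESHOLD = 1024, 256, 230   # placeholder sizes/threshold

def cut(img, size):
    """Cut an image array into non-overlapping square patches of `size`."""
    h, w = img.shape[:2]
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def has_tissue(tile):
    """Crude filter: mostly-white tiles contain no tissue."""
    return tile.mean() < TISSUE_THRESHOLD

def samples_for_slide(path, tumor_score):
    """Steps 1-4 for one whole-slide image (assumed already exported at 20x)."""
    slide = np.asarray(Image.open(path).convert("RGB"))
    tiles = [t for t in cut(slide, TILE) if has_tissue(t)]
    samples = [s for t in tiles for s in cut(t, SAMPLE)]
    return [(s, tumor_score) for s in samples]

# Step 5: repeat for all slides (slide_index is a hypothetical list of
# (path, tumor_score) pairs), then train a CNN on the labeled samples.
# dataset = [ex for path, score in slide_index for ex in samples_for_slide(path, score)]
```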
Contraction path (increase the "what", reduce the "where"): convolutions followed by a non-linear activation function.
Expansion path (create a high-resolution segmentation map): a sequence of up-convolutions and concatenations with high-resolution features from the contracting path.
Output has 2 channels: one for the foreground and one for the background.
Training time: 10 h with GPU. Application: 1 s per image.
Better accuracy than a sliding-window CNN.