System mldl meetup

Scalable Machine/Deep Learning with
Apache SystemML on Power
Berthold Reinwald
reinwald@us.ibm.com
IBM Research – Almaden
San Jose, CA
Nov. 17th, 2017

Agenda
 Use cases
 What is Apache SystemML
 Demos on Power
– Handwritten Digits Image Classification
– Medical Image Segmentation
 Inside SystemML
– Compiler, optimizer, and runtime
– Advanced Features

Enterprise Use cases for Scalable Machine Learning
 Insurance
 Problem Description
– Find the optimal subset of features that leads to the best regression model
 Problem Size
– 1.1M observations, 95 features, Subsets of 15 variables
 Algorithm
– Parallelization of independent model building
 Automotive
– Customer Satisfaction
 Problem Size
– 2 million cars with 8,000 reacquired cars, 10 million repair cases, 25 million parts exchanges
 Algorithms
– Logistic regression using ~22k feature variables
– Increasing the number of features from ~250 to ~21,800 improved precision/recall by an order of magnitude
– Sequence mining using very low support value
– Very large number of intermediate result sequences.
 Air Transportation
– Predict passenger volumes at locations in an airport
 Problem Size
– WiFi data with ~66 M rows for ~1.3 M MAC addresses
 Algorithms
– Multiple models per location, per passenger type
– Time-series analysis using seasonal and non-seasonal auto-regressive and moving-average components along with differencing operations (ARIMA and Holt-Winters triple exponential smoothing)
 Financial Services
 Problem Description
– Compute correlations between financial analysts’ performance metrics and sentiments extracted from surveys submitted by them
 Algorithms
– Descriptive (bivariate) statistics: Chi-squared test, Spearman’s Rho, Gamma, Kendall’s Tau-B, Odds-Ratio test, F-test (stratified and unstratified)
 Retail Banking
 Problem Description
– Use statistical analysis on social media data linked to the bank’s data to identify customer segments of interest, find predictors of purchase intent, and gauge sentiment towards the bank’s products
 Algorithms
– Bivariate odds ratios and binomial proportions with confidence intervals
 Services Company
 Problem
– Compute a benchmark index by mapping producers’ financial reports into a normalized schema, using analytics to extrapolate missing reports and/or impute missing values
 Algorithms
– Regularized least-squares loss minimization and Gibbs sampling (MCMC) jointly over the parameter space and over the missing (estimated) values

Why Apache SystemML
 Today’s Roles of Data Scientists
– Algorithm researcher: Invent new optimization schemes
– Systems programmer: Provide distributed implementations
– Deployment engineer: Run for varying datasets
– Systems researcher: Optimize clusters
 SystemML simplifies the Life of Data Scientists
– in implementing custom machine learning
– running algorithms distributed if needed
– running algorithms varying from small data to large data
[Figure labels: NIPS, ICML, KDD, JMLR]

Apache SystemML – Declarative Machine Learning
 Productivity of data scientists
– Machine learning language for data scientists
(“The SQL for analytics”)
– Strong foundation in linear algebra and statistical functions
– Comes with 20+ pre-implemented algorithms
– Enables development of solutions and tools
 Scalability & Performance
– Built on data parallel platforms, e.g. Spark
 Cost-based optimizer to compile execution plans
– Depending on data characteristics (tall/skinny, short/wide) and cluster
characteristics
– Ranging from in-memory single node to clusters (MapReduce, Spark),
and hybrid plans
 APIs & Tools
– Command line: standalone Java app, spark-submit, hadoop jar
– Use in Spark through Scala, Python, R, and Java APIs
– Embeddable scoring library
– Tools: REPL (Scala Spark and pyspark), SparkR, SparkML,
Jupyter, Zeppelin Notebooks
[Figure: SystemML stack with Language, Compiler, and Runtime layers; runtime backends: In-Memory Single Node (scale-up), Hadoop or Spark Cluster (scale-out), GPU backend (in progress)]

SystemML integrated in Spark Ecosystem
[Figure: Spark ecosystem. On top of the Spark Core Engine: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX, and SystemML (analytics library, custom analytics)]
 DataFrame: Spark API to SystemML
 SystemML runs against Spark core for distributed computations

Apache SystemML Open Source
 Apache Open source Project (http://systemml.apache.org/)
– Nov. 2015, Start SystemML Apache Incubator Project
– …
– Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API.
– May 2017, Release 0.14.0 on Spark 2.0.2+.
– May 2017, Apache Top Level Project
– Sep 2017, Release 0.15
 Release downloads (http://systemml.apache.org/download)
– Binaries
– Coordinates to Maven repository
 Github source code (https://github.com/apache/systemml)
 Documentation (https://apache.github.io/systemml/)
 3-hour KDD hands-on tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017

SystemML’s Scalable Algorithms
Category | Description
Descriptive Statistics | Univariate; Bivariate; Stratified Bivariate
Classification | Logistic Regression (multinomial); Multi-Class SVM, non-linear SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest; kNN
Clustering | k-Means
Regression | Linear Regression: system of equations; CG (conjugate gradient)
Regression | Generalized Linear Models (GLM): distributions Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli; links for all distributions: identity, log, sq. root, inverse, 1/μ^2; links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
Regression | Stepwise: Linear; GLM
Regression | Lasso
Dimension Reduction | PCA, Probabilistic PCA
Matrix Factorization | ALS: direct solve; CG (conjugate gradient)
Survival Models | Kaplan Meier Estimate; Cox Proportional Hazard Regression
Deep Learning | Autoencoder, word2vec, CNN, LSTM, RBM … and Deep Learning Library (DML-bodied) functions
Predict | Algorithm-specific scoring
Transformation (native) | Recoding, dummy coding, binning, scaling, missing value imputation
PMML models | lm, kmeans, svm, glm, mlogit

Effect of Deep Learning: ImageNet Large-Scale Visual Recognition Challenge
[Figure labels: AlexNet, GoogLeNet, ResNet (34 layer)]

Layers
 Fully connected layer
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/

Layers
 Fully connected layer
 Convolution layer
– Fewer parameters than a fully connected (FC) layer
– Useful for capturing local (spatial) features
– Output #channels = #filters
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
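The parameter-count comparison can be made concrete with a little arithmetic. The sketch below is only illustrative: the layer sizes (a 28x28 single-channel input, 32 outputs or filters, 5x5 kernels) and the helper names are assumptions, not figures from the talk.

```python
def fc_params(n_inputs, n_outputs):
    """Fully connected layer: one weight per input/output pair, one bias per output."""
    return n_inputs * n_outputs + n_outputs


def conv_params(in_channels, n_filters, kernel_h, kernel_w):
    """Convolution layer: each filter has in_channels*kernel_h*kernel_w weights and one bias.
    The count does not depend on the spatial size of the input."""
    return n_filters * (in_channels * kernel_h * kernel_w + 1)


# Hypothetical sizes chosen for illustration only.
fc = fc_params(28 * 28 * 1, 32)    # 784 inputs fully connected to 32 outputs
conv = conv_params(1, 32, 5, 5)    # 32 filters of size 5x5 over 1 input channel

print("FC parameters:  ", fc)      # 25120
print("Conv parameters:", conv)    # 832
```

With these assumed sizes the convolution layer needs about 30x fewer parameters, and its output has one channel per filter.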

Deep Learning Support
NN library: Reuse existing infrastructure to implement
custom DNNs like other training algorithms
 Small number of DL-specific built-in functions
– e.g. convolution
 NN library of layers and training optimizers to stack layers, e.g.
– Affine (fully-connected) layer is matrix multiplication
– Convolution layer invokes new convolution function
 Caffe/Keras2DML to import existing DNNs
 Transfer learning to continue training on different data
 GPU and native BLAS libraries
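As a rough illustration of the point that an affine (fully connected) layer is a matrix multiplication, the forward pass can be sketched in plain Python. This is not the NN library's DML code; the function names and the toy inputs are assumptions made for the example.

```python
def matmul(A, B):
    """Dense matrix product of A (n x d) and B (d x m) as nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def affine_forward(X, W, b):
    """Forward pass of an affine layer: out = X %*% W + b.
    X is N x D (one example per row), W is D x M, b has length M."""
    return [[v + bj for v, bj in zip(row, b)] for row in matmul(X, W)]


# Toy inputs (illustrative values only).
X = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0, -1.0], [0.5, 1.0, 2.0]]
b = [0.1, 0.2, 0.3]
out = affine_forward(X, W, b)
print(out)  # approximately [[2.1, 2.2, 3.3], [5.1, 4.2, 5.3]]
```

Because the layer reduces to a matrix product plus a bias, it can reuse the same compiler and runtime paths as other linear-algebra programs.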

Handwritten Digits Image Classification
Using LeNet CNN
https://github.com/apache/systemml/blob/master/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb

Medical Image Segmentation
Using U-Net CNN

Automatic Algebraic Simplification Rewrites lead to Significant Performance Improvements
 Simplify operations over mmult -> Eliminate unnecessary compute
– trace (X %*% Y) -> sum(X * t(Y))
 Remove unnecessary operations -> Merging operations
– rand (…, min=-1, max=1) * 7 -> rand (…, min=-7, max=7)
 Binary to unary operations -> Reduce amount of data touched
– X*X -> X^2
 Remove unnecessary indexing -> Eliminate operations (conditional)
– X[a:b,c:d] = Y -> X = Y iff dims(X)=dims(Y)
 … tens more rewrite rules
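The first rewrite can be checked numerically. The sketch below, in plain Python with made-up helper names and toy matrices, compares trace(X %*% Y) computed through the full product against the rewritten form sum(X * t(Y)), which never materializes the product.

```python
def trace_of_product(X, Y):
    """trace(X %*% Y): builds the full n x n product, then sums its diagonal."""
    n, m = len(X), len(Y)
    P = [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    return sum(P[i][i] for i in range(n))


def sum_of_elementwise(X, Y):
    """sum(X * t(Y)): multiplies X cell-wise with the transpose of Y, no matrix product."""
    return sum(X[i][k] * Y[k][i] for i in range(len(X)) for k in range(len(Y)))


# X is 2 x 3 and Y is 3 x 2 (illustrative values).
X = [[1, 2, 3], [4, 5, 6]]
Y = [[7, 8], [9, 10], [11, 12]]

print(trace_of_product(X, Y))    # 212
print(sum_of_elementwise(X, Y))  # 212
```

For an n x m matrix X and an m x n matrix Y, the original expression costs on the order of n*n*m multiplications, while the rewritten one needs only n*m.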

Compressed Linear Algebra (CLA)
 Motivation: Iterative ML algorithms with I/O-bound MV multiplications
 Key Ideas: Use lightweight DB compression techniques and perform LA
operations on compressed matrices (w/o decompression)
 Experiments
– LinregCG, 10 iterations, SystemML 0.14
– 1+6 node cluster, Spark 2.1
Compression Ratios
Dataset | Gzip | Snappy | CLA
Higgs | 1.93 | 1.38 | 2.17
Census | 17.11 | 6.04 | 35.69
Covtype | 10.40 | 6.13 | 18.19
ImageNet | 5.54 | 3.35 | 7.34
Mnist8m | 4.12 | 2.60 | 7.32
Airline78 | 7.07 | 4.28 | 7.44

End-to-End Performance [sec]
Dataset (size) | Uncompressed | Snappy (RDD Compression) | CLA
Mnist40m (90GB) | 89 | 135 | 93
Mnist240m (540GB) | 3409 | 765 | 463
Mnist480m (1.1TB) | 5663 | 2730 | 998
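The key idea of operating directly on compressed data can be sketched as follows. This is a deliberately simplified illustration, not SystemML's implementation: it encodes each column separately as a dictionary from distinct value to row offsets, and all names and data are assumptions.

```python
def compress_column(col):
    """Map each distinct nonzero value to the list of row offsets where it occurs."""
    groups = {}
    for r, val in enumerate(col):
        if val != 0:
            groups.setdefault(val, []).append(r)
    return groups


def compressed_matvec(columns, v, n_rows):
    """Compute q = X %*% v from the compressed columns, without decompressing X."""
    q = [0.0] * n_rows
    for groups, vj in zip(columns, v):
        for val, offsets in groups.items():
            contrib = val * vj          # one multiplication per distinct value
            for r in offsets:
                q[r] += contrib         # one addition per occurrence
    return q


# Toy matrix with few distinct values per column (illustrative only).
X = [[7, 0], [7, 3], [0, 3], [7, 3]]
cols = [compress_column([row[j] for row in X]) for j in range(2)]
v = [2.0, 10.0]

print(cols)                            # [{7: [0, 1, 3]}, {3: [1, 2, 3]}]
print(compressed_matvec(cols, v, 4))   # [14.0, 44.0, 30.0, 44.0]
```

Columns with few distinct values compress well under such a scheme, and the matrix-vector product touches less data than the uncompressed form.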

Code Generation for Operator Fusion
 Motivation
– Ubiquitous Fusion Opportunities
– High Performance Impact
 Key Ideas
– Template skeletons (Row, Cell, Outer, MultiAgg)
– Candidate exploration to identify fusion opportunities
– Candidate selection via cost-based optimizer or heuristics
– Codegen with janino / javac during compile and dynamic recompile
[Figure: fusion examples. Multi-Aggregate: a=sum(X^2), b=sum(X*Y), c=sum(Y^2) computed together over X and Y; sum(X*Y*Z); X^T (X v) with a 1st pass computing q = X v and a 2nd pass over X; sum(X * log(U V^T)) with sparsity exploitation]
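The multi-aggregate example from this slide can be sketched in plain Python to show what fusion changes. The function names are made up for the illustration, and X and Y are treated as flat lists of cells; the generated operators in SystemML are Java code, not this.

```python
def unfused(X, Y):
    """Three separate aggregates, each materializing an intermediate and rescanning the inputs."""
    x2 = [x * x for x in X]
    xy = [x * y for x, y in zip(X, Y)]
    y2 = [y * y for y in Y]
    return sum(x2), sum(xy), sum(y2)


def fused_multi_agg(X, Y):
    """One pass over X and Y that updates all three aggregates, with no intermediates."""
    a = b = c = 0.0
    for x, y in zip(X, Y):
        a += x * x   # a = sum(X^2)
        b += x * y   # b = sum(X*Y)
        c += y * y   # c = sum(Y^2)
    return a, b, c


X = [1.0, 2.0, 3.0]
Y = [4.0, 5.0, 6.0]
print(unfused(X, Y))          # (14.0, 32.0, 77.0)
print(fused_multi_agg(X, Y))  # (14.0, 32.0, 77.0)
```

The fused version reads each input once and allocates nothing, which matters when X and Y are large and the computation is memory-bandwidth bound.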

Codegen Micro Benchmarks (FP64)
[Figure panels: sum(X ʘ Y ʘ Z), dense; sum(X ʘ Y ʘ Z), sparse; X^T (X v), dense; sum(X ʘ log(U V^T + 1e-15)). Annotations: sparsity 0.1; data size 20K x 20K]
#1 Gen close to hand-coded fused ops
#2 TF/Julia Gen only single-threaded
#3 TF w/ very limited sparse support
#4 Sparse Gen challenging, Gen better than hand-coded ops
#5 TF w/ poor performance for data-intensive ops
#6 Gen at peak mem bandwidth
#7 Autom. sparsity exploitation across chains of ops

SystemML on Power Environment
 Contributed native ppc64le libraries for JCuda to the mavenized JCuda project
– GPU backend on Power for SystemML
 Contributed native ppc64le libraries to protoc project
– Useful for compiling Caffe proto files
 Supported native BLAS operations in SystemML
– Matrix Multiplication, Convolution (forward/backward)
– OpenBLAS with OpenMP support

Linear Regression Conjugate Gradient
(preliminary 1/2)
[Figure: time in seconds (y-axis, 0 to 14) vs. no. of rows of input matrix in thousands (x-axis: 64, 128, 256, 512, 1024, 2048); series: PPC CPU Time, PPC GPU Time, x86 CPU Time, x86 GPU Time]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
M-V multiplication chain is memory bound, but more cores help with parallelization.
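To show where the matrix-vector multiplication chain comes from, here is a minimal conjugate gradient sketch for regularized linear regression in plain Python. It is an illustration under assumptions, not the benchmarked script: the defaults mirror the slide's settings (no intercept, maxi=20, tol=0.001, reg=0.01), while the stopping rule (relative squared residual norm), the function names, and the toy data are choices made for this example.

```python
def matvec(X, v):
    """X %*% v for a row-major nested-list matrix."""
    return [sum(x * vj for x, vj in zip(row, v)) for row in X]


def t_matvec(X, u):
    """t(X) %*% u without materializing the transpose."""
    out = [0.0] * len(X[0])
    for row, ui in zip(X, u):
        for j, x in enumerate(row):
            out[j] += x * ui
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linreg_cg(X, y, reg=0.01, maxi=20, tol=0.001):
    """Solve (t(X) %*% X + reg*I) w = t(X) %*% y with conjugate gradient."""
    w = [0.0] * len(X[0])
    r = [-g for g in t_matvec(X, y)]      # residual (gradient) at w = 0
    p = [-ri for ri in r]                 # initial search direction
    norm_r2 = dot(r, r)
    target = norm_r2 * tol * tol          # assumed stopping rule
    i = 0
    while i < maxi and norm_r2 > target:
        # The M-V chain: q = t(X) %*% (X %*% p) + reg * p
        q = [qj + reg * pj for qj, pj in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)
        w = [wj + alpha * pj for wj, pj in zip(w, p)]
        r = [rj + alpha * qj for rj, qj in zip(r, q)]
        old_norm_r2, norm_r2 = norm_r2, dot(r, r)
        p = [-rj + (norm_r2 / old_norm_r2) * pj for rj, pj in zip(r, p)]
        i += 1
    return w


# Toy data generated from w = [1, 2] (illustrative only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
print(linreg_cg(X, y))  # roughly [1.00, 1.99]; the small reg shrinks the weights slightly
```

Each iteration makes two passes over X and does only a few vector operations besides, so the loop is dominated by reading X from memory.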

Linear Regression Conjugate Gradient
(preliminary 2/2)
[Figure: GPU time in seconds (y-axis, 0 to 14; x-axis: 64, 256, 1024); series: PPC GPU Time, x86 GPU Time]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
[Figure: CPU-GPU Transfer Time in seconds (y-axis, 0 to 7; x-axis: 64, 256, 1024); series: PPC toDev Time, x86 toDev Time]
Most of the time is spent transferring data from host to device
-> 2x performance benefit due to CPU-GPU NVLink
More Details
 Matthias Boehm, Alexandre Evfimievski, Niketan Pansare, Berthold Reinwald, Prithviraj Sen: Declarative, Large-Scale Machine Learning with Apache SystemML, 3 hours hands-on tutorial, KDD 2017 [Hands-on Tutorial]
 Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen: SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017 [Automatic Rewrites & Fusion]
 Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald: Compressed Linear Algebra for Large Scale Machine Learning. VLDB 2016 (Best Paper Award) [Compression]
– Extended Version to appear in VLDB Journal, 2017
– Summary Version to appear in ACM SIGMOD Record Research Highlights, 2017
 Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, Shirish Tatikonda: SystemML: Declarative Machine Learning on Spark. VLDB 2016 [1st paper on Spark]
 Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss: Resource Elasticity for Large-Scale Machine Learning. SIGMOD 2015: 137-152 [Resource Elasticity]
 Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, P. Sadayappan: On optimizing machine learning workloads via kernel fusion. PPOPP 2015: 173-182 [GPU]
 Sebastian Schelter, Juan Soto, Volker Markl, Douglas Burdick, Berthold Reinwald, Alexandre V. Evfimievski: Efficient sample generation for scalable meta learning. ICDE 2015: 1191-1202 [Sampling]
 Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish Tatikonda, Yuanyuan Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 37(3): 52-62 (2014) [Optimizer]
 Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas Burdick, Shivakumar Vaithyanathan: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7): 553-564 (2014) [Task Parallelism]
 Peter D. Kirchner, Matthias Boehm, Berthold Reinwald, Daby M. Sow, Michael Schmidt, Deepak S. Turaga, Alain Biem: Large Scale Discriminative Metric Learning. IPDPS Workshop 2014: 1656-1663 [Custom Algorithm]
 Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald: Scalable and Numerically Stable Descriptive Statistics in SystemML. ICDE 2012: 1351-1359 [Numeric Stability]
 Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, Shivakumar Vaithyanathan: SystemML: Declarative machine learning on MapReduce. ICDE 2011: 231-242

Summary
 SystemML simplifies the life of data scientists
 Custom Machine/Deep Learning Algorithms
 Scale up & out
 Mixed Workloads
– Memory access bound
– Compute bound
 Strike Balance between
– Data transfer
– Parallelism
