Scalable Machine/Deep Learning with
Apache SystemML on Power
Berthold Reinwald
reinwald@us.ibm.com
IBM Research – Almaden
San Jose, CA
Nov. 17th, 2017
Agenda
• Use cases
• What is Apache SystemML
• Demos on Power
– Handwritten Digits Image Classification
– Medical Image Segmentation
• Inside SystemML
– Compiler, optimizer, and runtime
– Advanced features
Tumor Proliferation Score
Medical Image Segmentation
Enterprise Use Cases for Scalable Machine Learning
• Insurance
• Problem Description
– Find the optimal subset of features that yields the best regression model
• Problem Size
– 1.1M observations, 95 features, subsets of 15 variables
• Algorithm
– Parallelization of independent model building
• Automotive
• Problem Description
– Customer satisfaction
• Problem Size
– 2 million cars with 8,000 reacquired cars, 10 million repair cases, 25 million parts exchanges
• Algorithms
– Logistic regression using ~22k feature variables
– Increasing the number of features from ~250 to ~21,800 improved precision/recall by an order of magnitude
– Sequence mining using a very low support value
– Very large number of intermediate result sequences
• Air Transportation
• Problem Description
– Predict passenger volumes at locations in an airport
• Problem Size
– WiFi data with ~66M rows for ~1.3M MAC addresses
• Algorithms
– Multiple models per location, per passenger type
– Time-series analysis using seasonal and non-seasonal autoregressive and moving-average components along with differencing operations (ARIMA and Holt-Winters triple exponential smoothing)
• Financial Services
• Problem Description
– Compute correlations between financial analysts’ performance metrics and sentiments extracted from surveys they submitted
• Algorithms
– Descriptive (bivariate) statistics: Chi-squared test, Spearman’s Rho, Gamma, Kendall’s Tau-B, Odds-Ratio test, F-test (stratified and unstratified)
• Retail Banking
• Problem Description
– Use statistical analysis on social media data linked to the bank’s data to identify customer segments of interest, find predictors of purchase intent, and gauge sentiment towards the bank’s products
• Algorithms
– Bivariate odds ratios and binomial proportions with confidence intervals
• Services Company
• Problem Description
– Compute a benchmark index by mapping producers’ financial reports into a normalized schema, using analytics to extrapolate missing reports and/or impute missing values
• Algorithms
– Regularized least-squares loss minimization and Gibbs sampling (MCMC) jointly over the parameter space and over the missing (estimated) values
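The retail banking use case mentions bivariate odds ratios and binomial proportions with confidence intervals. As a rough illustration of what such statistics compute, here is a minimal Python sketch. It is not SystemML code: the function names, the 2x2 table layout, and the choice of Wald (normal-approximation) intervals with z = 1.96 are assumptions made for this example.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale.

    Assumed 2x2 table layout (hypothetical):
        [[a, b],    group 1: outcome yes / outcome no
         [c, d]]    group 2: outcome yes / outcome no
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


def proportion_ci(successes, n, z=1.96):
    """Binomial proportion with a normal-approximation (Wald) interval."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    print(odds_ratio_ci(20, 80, 10, 90))
    print(proportion_ci(50, 100))
```

In a setting like the one on the slide, one such statistic would be computed per pair of variables, which is why these bivariate statistics parallelize well.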
Why Apache SystemML
• Today’s Roles of Data Scientists
– Algorithm researcher: invent new optimization schemes
– Systems programmer: provide distributed implementations
– Deployment engineer: run for varying datasets
– Systems researcher: optimize clusters
• SystemML simplifies the life of data scientists
– Implementing custom machine learning algorithms
– Running algorithms distributed if needed
– Running algorithms on data ranging from small to large
(Figure labels: NIPS, ICML, KDD, JMLR)
Apache SystemML – Declarative Machine Learning
• Productivity of data scientists
– Machine learning language for data scientists (“The SQL for analytics”)
– Strong foundation in linear algebra and statistical functions
– Comes with 20+ pre-implemented algorithms
– Enables solution development and tools
• Scalability & Performance
– Built on data-parallel platforms, e.g. Spark
• Cost-based optimizer to compile execution plans
– Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics
– Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans
• APIs & Tools
– Command line: standalone Java app, spark-submit, hadoop jar
– Use in Spark through Scala, Python, R, and Java APIs
– Embeddable scoring library
– Tools: REPL (Scala Spark and pyspark), SparkR, SparkML, Jupyter, Zeppelin Notebooks
(Architecture diagram: Language, Compiler, Runtime; backends: In-Memory Single Node (scale-up), Hadoop or Spark Cluster (scale-out), GPU backend (in progress))
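To make the idea of choosing between in-memory single-node and distributed plans from data characteristics more concrete, here is a toy Python sketch. It is not SystemML's cost model: the function names, the byte-size estimates, and the single memory-budget threshold are all assumptions for illustration, and "sparsity" here means the fraction of non-zero cells.

```python
def estimate_dense_bytes(rows, cols, bytes_per_cell=8):
    """Size of a dense FP64 matrix (assumed 8 bytes per cell)."""
    return rows * cols * bytes_per_cell


def estimate_sparse_bytes(rows, cols, sparsity, bytes_per_nonzero=12):
    """Toy CSR-like estimate: value (8) + column index (4) per non-zero,
    plus one 4-byte row pointer per row (hypothetical layout)."""
    nnz = int(rows * cols * sparsity)
    return nnz * bytes_per_nonzero + 4 * (rows + 1)


def choose_backend(rows, cols, sparsity, memory_budget_bytes):
    """Pick a backend from the smaller of the two size estimates."""
    size = min(estimate_dense_bytes(rows, cols),
               estimate_sparse_bytes(rows, cols, sparsity))
    return "single-node" if size <= memory_budget_bytes else "distributed"


if __name__ == "__main__":
    one_gb = 10**9
    print(choose_backend(1_000, 1_000, 1.0, one_gb))         # small dense input
    print(choose_backend(10_000_000, 1_000, 1.0, one_gb))    # tall, dense input
    print(choose_backend(10_000_000, 1_000, 0.001, one_gb))  # tall, very sparse input
```

A real optimizer would make this kind of decision per operator and would also account for cluster characteristics, which this sketch ignores.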
SystemML integrated in Spark Ecosystem
(Diagram labels: Spark Core Engine; Spark SQL; Spark Streaming; MLlib; GraphX; SystemML; Analytics Library; Custom Analytics; Machine Learning; DataFrame)
• Spark API to SystemML
• SystemML runs against Spark core for distributed computations
Apache SystemML Open Source
• Apache open source project (http://systemml.apache.org/)
– Nov. 2015, Start SystemML Apache Incubator Project
– …
– Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API
– May 2017, Release 0.14.0 on Spark 2.0.2+
– May 2017, Apache Top Level Project
– Sep. 2017, Release 0.15
• Release downloads (http://systemml.apache.org/download)
– Binaries
– Coordinates to Maven repository
• GitHub source code (https://github.com/apache/systemml)
• Documentation (https://apache.github.io/systemml/)
• 3-hour KDD hands-on tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
SystemML’s Scalable Algorithms
Category: Description
Descriptive Statistics: Univariate; Bivariate; Stratified Bivariate
Classification: Logistic Regression (multinomial); Multi-Class SVM, non-linear SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest; kNN
Clustering: k-Means
Regression:
– Linear Regression: system of equations; CG (conjugate gradient descent)
– Generalized Linear Models (GLM): distributions Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli; links for all distributions: identity, log, sq. root, inverse, 1/μ^2; links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
– Stepwise: Linear; GLM
– Lasso
Dimension Reduction: PCA, Probabilistic PCA
Matrix Factorization: ALS (direct solve; CG (conjugate gradient descent))
Survival Models: Kaplan Meier Estimate; Cox Proportional Hazard Regression
Deep Learning: Autoencoder, word2vec, CNN, LSTM, RBM … and Deep Learning Library (DML-bodied) functions
Predict: Algorithm-specific scoring
Transformation (native): Recoding, dummy coding, binning, scaling, missing value imputation
PMML models: lm, kmeans, svm, glm, mlogit
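The GLM entry above lists four link functions for Binomial / Bernoulli responses. For reference, the sketch below writes out their standard textbook definitions in Python, each mapping a probability p in (0, 1) to the real line. The function names are chosen for this example and are not SystemML identifiers.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(p)


def cloglog(p):
    """Complementary log-log: log(-log(1 - p))."""
    return math.log(-math.log(1 - p))


def cauchit(p):
    """Inverse of the standard Cauchy CDF: tan(pi * (p - 0.5))."""
    return math.tan(math.pi * (p - 0.5))


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, logit(p), probit(p), cloglog(p), cauchit(p))
```

Logit, probit, and cauchit are symmetric around p = 0.5 (all three map 0.5 to 0), while cloglog is asymmetric, which is the usual reason for choosing it.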
Effect of Deep Learning: ImageNet Large-Scale Visual Recognition Challenge
(Chart labels: AlexNet, GoogLeNet, ResNet (34 layer))
Layers
• Fully connected layer
• Convolution layer
– Fewer parameters than a fully connected (FC) layer
– Useful for capturing spatially local features
– Number of output channels = number of filters
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
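The claim that a convolution layer needs fewer parameters than a fully connected layer can be checked with simple arithmetic. The Python sketch below counts weights and biases for both layer types. The concrete sizes (a 28 x 28 single-channel input and 32 filters of size 5 x 5) are assumptions chosen for illustration, not figures from the slides.

```python
def fc_params(num_inputs, num_outputs):
    """Fully connected layer: one weight per input/output pair plus one bias per output."""
    return num_inputs * num_outputs + num_outputs


def conv_params(filter_h, filter_w, in_channels, num_filters):
    """Convolution layer: weights are shared across spatial positions,
    so the count depends only on the filter size, not on the image size."""
    return filter_h * filter_w * in_channels * num_filters + num_filters


def conv_output_shape(in_h, in_w, filter_h, filter_w, num_filters, stride=1, pad=0):
    """Output shape as (channels, height, width); channels equals the number of filters."""
    out_h = (in_h + 2 * pad - filter_h) // stride + 1
    out_w = (in_w + 2 * pad - filter_w) // stride + 1
    return num_filters, out_h, out_w


if __name__ == "__main__":
    channels, out_h, out_w = conv_output_shape(28, 28, 5, 5, num_filters=32)
    print("conv output shape:", (channels, out_h, out_w))
    print("conv parameters:", conv_params(5, 5, 1, 32))
    # An FC layer producing the same number of output values from the same input:
    print("FC parameters:", fc_params(28 * 28 * 1, channels * out_h * out_w))
```

With these assumed sizes the convolution layer has 832 parameters, while a fully connected layer with the same input and output sizes would need about 14.5 million.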
Deep Learning Support
NN library: Reuse existing infrastructure to implement custom DNNs like other training algorithms
• Small number of DL-specific built-in functions
– e.g. convolution
• NN library of layers and training optimizers to stack layers, e.g.
– Affine (fully-connected) layer is matrix multiplication
– Convolution layer invokes new convolution function
• Caffe/Keras2DML to import existing DNNs
• Transfer learning to continue training on different data
• GPU and native BLAS libraries
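The statement that an affine (fully-connected) layer is a matrix multiplication can be sketched in a few lines. The plain-Python example below computes out = X W + b for a batch of inputs. It illustrates the idea only and is not the SystemML NN library: the function names and the list-of-lists matrix representation are assumptions for this example.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def affine_forward(X, W, b):
    """Forward pass of an affine layer.

    X: N x D inputs, W: D x M weights, b: length-M bias.
    Returns the N x M outputs X W + b (bias added to every row).
    """
    out = matmul(X, W)
    return [[value + bias for value, bias in zip(row, b)] for row in out]


def relu_forward(X):
    """Element-wise max(0, x), a typical layer to stack after an affine layer."""
    return [[max(0.0, value) for value in row] for row in X]


if __name__ == "__main__":
    X = [[1, 2], [3, 4]]              # two examples, two features
    W = [[1, 0, 1], [0, 1, 1]]        # maps two features to three outputs
    b = [0.5, -0.5, 0.0]
    print(relu_forward(affine_forward(X, W, b)))
```

Because the layer reduces to ordinary linear algebra, it can reuse the same compiler and runtime as any other matrix program, which is the point the slide makes about reusing existing infrastructure.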
Handwritten Digits Image Classification Using LeNet CNN
https://github.com/apache/systemml/blob/master/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb
Medical Image Segmentation Using U-Net CNN
Automatic Algebraic Simplification Rewrites lead to Significant Performance Improvements
• Simplify operations over mmult → Eliminate unnecessary compute
– trace(X %*% Y) → sum(X * t(Y))
• Remove unnecessary operations → Merging operations
– rand(…, min=-1, max=1) * 7 → rand(…, min=-7, max=7)
• Binary to unary operations → Reduce amount of data touched
– X*X → X^2
• Remove unnecessary indexing → Eliminate operations (conditional)
– X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y)
• … tens more rewrite rules
Compilation Chain
Compressed Linear Algebra (CLA)
• Motivation: Iterative ML algorithms with I/O-bound matrix-vector (MV) multiplications
• Key Ideas: Use lightweight DB compression techniques and perform linear algebra (LA) operations on compressed matrices (without decompression)
• Experiments
– LinregCG, 10 iterations, SystemML 0.14
– 1+6 node cluster, Spark 2.1
Compression Ratios
Dataset     Gzip    Snappy  CLA
Higgs       1.93    1.38    2.17
Census      17.11   6.04    35.69
Covtype     10.40   6.13    18.19
ImageNet    5.54    3.35    7.34
Mnist8m     4.12    2.60    7.32
Airline78   7.07    4.28    7.44
End-to-End Performance [sec]
Dataset (size)       Uncompressed   Snappy (RDD Compression)   CLA
Mnist40m (90GB)      89             135                        93
Mnist240m (540GB)    3409           765                        463
Mnist480m (1.1TB)    5663           2730                       998
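To illustrate what performing linear algebra on compressed matrices without decompression can look like, here is a much-simplified Python sketch. Each column is stored as a dictionary from distinct non-zero value to the list of rows where it occurs, and the matrix-vector product is computed directly on that representation. This is an assumption-laden toy, not the actual CLA implementation, which uses more elaborate encodings; all names here are invented for the example.

```python
def compress_column(column):
    """Map each distinct non-zero value to the list of row indices holding it."""
    groups = {}
    for row_index, value in enumerate(column):
        if value != 0:
            groups.setdefault(value, []).append(row_index)
    return groups


def compress(matrix):
    """Compress a list-of-lists matrix column by column."""
    return len(matrix), [compress_column(col) for col in zip(*matrix)]


def matvec_compressed(compressed, v):
    """Compute M v on the compressed form, without rebuilding M."""
    num_rows, columns = compressed
    result = [0.0] * num_rows
    for j, groups in enumerate(columns):
        for value, row_indices in groups.items():
            contribution = value * v[j]   # one multiply per distinct value
            for r in row_indices:
                result[r] += contribution
    return result


def matvec_plain(matrix, v):
    """Reference implementation on the uncompressed matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


if __name__ == "__main__":
    M = [[1, 0, 2], [1, 3, 2], [0, 3, 2], [1, 0, 0]]
    v = [1, 2, 3]
    print(matvec_compressed(compress(M), v), matvec_plain(M, v))
```

The benefit grows with the number of repeated values per column: the multiplication is done once per distinct value, and the compressed form is smaller, so more of it fits in memory across iterations.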
Code Generation for Operator Fusion
• Motivation
– Ubiquitous fusion opportunities
– High performance impact
• Key Ideas
– Template skeletons (Row, Cell, Outer, MultiAgg)
– Candidate exploration to identify fusion opportunities
– Candidate selection via cost-based optimizer or heuristics
– Codegen with janino / javac during compile and dynamic recompile
(Diagrams of fusion opportunities:)
– Multi-aggregate: a=sum(X^2), b=sum(X*Y), c=sum(Y^2)
– sum(X*Y*Z)
– q = t(X) %*% (X %*% v), with a 1st and 2nd pass over X
– sum(X * log(U %*% t(V))), with sparsity exploitation
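The multi-aggregate example (a=sum(X^2), b=sum(X*Y), c=sum(Y^2)) shows why fusion helps: computed separately, the three aggregates scan the inputs several times and materialize intermediates, whereas a fused operator computes all three in a single pass. The Python sketch below contrasts the two. It is a hand-written illustration, not generated SystemML code, and the function names are assumptions.

```python
def multi_aggregate_unfused(X, Y):
    """Three separate aggregates, each with its own materialized intermediate."""
    x_squared = [[x * x for x in row] for row in X]
    x_times_y = [[x * y for x, y in zip(xr, yr)] for xr, yr in zip(X, Y)]
    y_squared = [[y * y for y in row] for row in Y]
    a = sum(sum(row) for row in x_squared)
    b = sum(sum(row) for row in x_times_y)
    c = sum(sum(row) for row in y_squared)
    return a, b, c


def multi_aggregate_fused(X, Y):
    """All three aggregates in one pass over X and Y, with no intermediates."""
    a = b = c = 0.0
    for x_row, y_row in zip(X, Y):
        for x, y in zip(x_row, y_row):
            a += x * x
            b += x * y
            c += y * y
    return a, b, c


if __name__ == "__main__":
    X = [[1, 2], [3, 4]]
    Y = [[5, 6], [7, 8]]
    print(multi_aggregate_unfused(X, Y))
    print(multi_aggregate_fused(X, Y))
```

For memory-bound operations like these, reading each input once instead of twice is where most of the speedup comes from.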
Codegen Micro Benchmarks (FP64)
(Benchmark charts, data size 20K x 20K: sum(X ʘ Y ʘ Z), dense; sum(X ʘ Y ʘ Z), sparse, sparsity 0.1; X^T (X v), dense; sum(X ʘ log(UV^T + 1e-15)))
#1 Gen close to hand-coded fused ops
#2 TF/Julia Gen only single-threaded
#3 TF with very limited sparse support
#4 Sparse Gen challenging, Gen better than hand-coded ops
#5 TF with poor performance for data-intensive ops
#6 Gen at peak memory bandwidth
#7 Automatic sparsity exploitation across chains of ops
SystemML on Power Environment
• Contributed native ppc64le libraries for JCuda to the mavenized jcuda project
– GPU backend on Power for SystemML
• Contributed native ppc64le libraries to the protoc project
– Useful for compiling Caffe proto files
• Supported native BLAS operations in SystemML
– Matrix multiplication, convolution (forward/backward)
– OpenBLAS with OpenMP support
Linear Regression Conjugate Gradient (preliminary 1/2)
(Chart: time in seconds vs. number of rows of the input matrix in thousands, from 64 to 2048; series: PPC CPU Time, PPC GPU Time, x86 CPU Time, x86 GPU Time)
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
M-V multiplication chain is memory bound, but more cores help with parallelization.
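For readers unfamiliar with the benchmarked algorithm, the following plain-Python sketch shows a conjugate gradient solver for regularized linear regression, i.e. for the system (t(X) X + reg I) beta = t(X) y. It is written for this text and is not the SystemML script used in the experiment. The parameter names reg, tol, and maxi follow the slide, the defaults mirror its settings, and no intercept is fitted (Icpt: 0). Everything else is an assumption.

```python
def matvec(M, v):
    """M v for a list-of-lists matrix M."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def t_matvec(M, v):
    """t(M) v without materializing the transpose."""
    out = [0.0] * len(M[0])
    for row, vi in zip(M, v):
        for j, m in enumerate(row):
            out[j] += m * vi
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linreg_cg(X, y, reg=0.01, tol=0.001, maxi=20):
    """Solve (t(X) X + reg I) beta = t(X) y by conjugate gradient."""
    beta = [0.0] * len(X[0])
    r = [-g for g in t_matvec(X, y)]      # residual (gradient) at beta = 0
    p = [-ri for ri in r]                 # initial search direction
    norm_r2 = dot(r, r)
    target = tol * tol * norm_r2          # relative stopping criterion
    i = 0
    while i < maxi and norm_r2 > target:
        # The matrix-vector multiplication chain: q = t(X) (X p) + reg p
        q = [qi + reg * pi for qi, pi in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)
        beta = [bi + alpha * pi for bi, pi in zip(beta, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-ri + (norm_r2 / old_norm_r2) * pi for ri, pi in zip(r, p)]
        i += 1
    return beta


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]                   # exactly y = 1*x1 + 2*x2
    print(linreg_cg(X, y, reg=0.0, tol=1e-9))
    print(linreg_cg(X, y))                # slide settings: reg=0.01, tol=0.001, maxi=20
```

Each iteration is dominated by the two matrix-vector products over X, which do little arithmetic per byte read. That is why the slide describes the workload as memory bound.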
Linear Regression Conjugate Gradient (preliminary 2/2)
(Chart: time in seconds vs. number of rows of the input matrix in thousands, at 64, 256, and 1024; series: PPC GPU Time, x86 GPU Time)
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
(Chart: CPU-GPU Transfer Time, time in seconds vs. number of rows of the input matrix in thousands, at 64, 256, and 1024; series: PPC toDev Time, x86 toDev Time)
Most of the time is spent in transferring data from host to device
-> 2x performance benefit due to CPU-GPU NVLink
More Details
• Matthias Boehm, Alexandre Evfimievski, Niketan Pansare, Berthold Reinwald, Prithvi Sen: Declarative, Large-Scale Machine Learning with Apache SystemML, 3 hours hands-on tutorial, KDD 2017
• Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen: SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017
• Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald: Compressed Linear Algebra for Large Scale Machine Learning. VLDB 2016 (Best Paper Award)
– Extended Version to appear in VLDB Journal, 2017
– Summary Version to appear in ACM SIGMOD Record Research Highlights, 2017
• Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, Shirish Tatikonda: SystemML: Declarative Machine Learning on Spark. VLDB 2016
• Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss: Resource Elasticity for Large-Scale Machine Learning. SIGMOD 2015: 137-152
• Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, P. Sadayappan: On optimizing machine learning workloads via kernel fusion. PPOPP 2015: 173-182
• Sebastian Schelter, Juan Soto, Volker Markl, Douglas Burdick, Berthold Reinwald, Alexandre V. Evfimievski: Efficient sample generation for scalable meta learning. ICDE 2015: 1191-1202
• Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish Tatikonda, Yuanyuan Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 37(3): 52-62 (2014)
• Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas Burdick, Shivakumar Vaithyanathan: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7): 553-564 (2014)
• Peter D. Kirchner, Matthias Boehm, Berthold Reinwald, Daby M. Sow, Michael Schmidt, Deepak S. Turaga, Alain Biem: Large Scale Discriminative Metric Learning. IPDPS Workshop 2014: 1656-1663
• Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald: Scalable and Numerically Stable Descriptive Statistics in SystemML. ICDE 2012: 1351-1359
• Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, Shivakumar Vaithyanathan: SystemML: Declarative machine learning on MapReduce. ICDE 2011: 231-242
(Topic labels on the slide: Hands-on Tutorial, Automatic Rewrites & Fusion, Compression, 1st paper on Spark, Resource Elasticity, GPU, Sampling, Optimizer, Task Parallelism, Custom Algorithm, Numeric Stability)
Summary
• SystemML simplifies the life of data scientists
• Custom Machine/Deep Learning Algorithms
• Scale up & out
• Mixed Workloads
– Memory access bound
– Compute bound
• Strike a balance between
– Data transfer
– Parallelism
