Scalable Machine/Deep Learning with Apache SystemML on Power
Why Apache SystemML
• Today’s roles of data scientists
– Algorithm researcher: invent new optimization schemes
– Systems programmer: provide distributed implementations
– Deployment engineer: run for varying datasets
– Systems researcher: optimize clusters
• SystemML simplifies the life of data scientists
– implementing custom machine learning algorithms
– running algorithms distributed if needed
– running algorithms on data ranging from small to large
[Figure: venue labels NIPS, ICML, KDD, JMLR]
Apache SystemML – Declarative Machine Learning
• Productivity of data scientists
– Machine learning language for data scientists (“The SQL for analytics”)
– Strong foundation in linear algebra and statistical functions
– Comes with 20+ algorithms pre-implemented
– Enables development of solutions and tools
• Scalability & performance
– Built on data-parallel platforms, e.g. Spark
-> Cost-based optimizer to compile execution plans
– Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics
– Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans
• APIs & tools
– Command line: standalone Java app, spark-submit, hadoop jar
– Use in Spark through Scala, Python, R, and Java APIs
– Embeddable scoring library
– Tools: REPL (Scala Spark and pyspark), SparkR, SparkML, Jupyter, Zeppelin Notebooks
[Figure: SystemML stack with Language, Compiler, and Runtime layers; runtime backends: In-Memory Single Node (scale-up), Hadoop or Spark Cluster (scale-out), GPU backend (in progress)]
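The idea of compiling different execution plans depending on data and cluster characteristics can be illustrated with a toy sketch. This is not SystemML's optimizer: the names `estimate_bytes` and `choose_backend`, the 8 bytes per dense cell, the 12 bytes per non-zero for sparse storage, the 0.4 sparsity threshold, and the 70% memory budget are all assumptions made for illustration. Sparsity is taken here to mean the fraction of non-zero cells.

```python
def estimate_bytes(rows, cols, sparsity, sparse_threshold=0.4):
    """Rough in-memory size of a matrix.

    Illustrative assumption only: dense cells cost 8 bytes, sparse
    non-zeros cost 12 bytes (value plus index).
    """
    cells = rows * cols
    if sparsity < sparse_threshold:
        return int(cells * sparsity * 12)
    return cells * 8


def choose_backend(rows, cols, sparsity, heap_bytes, budget_fraction=0.7):
    """Pick a hypothetical backend by comparing the estimate to a memory budget."""
    budget = heap_bytes * budget_fraction
    if estimate_bytes(rows, cols, sparsity) <= budget:
        return "single_node"
    return "distributed"


heap = 100 * 1024**3  # 100G of driver memory, as in the benchmark slides
small = choose_backend(64_000, 1000, 0.95, heap)
huge = choose_backend(2_048_000_000, 1000, 0.95, heap)
print(small, huge)
```

A real cost-based optimizer weighs far more than one matrix size (intermediate results, operator costs, cluster configuration); the sketch only shows why the same script can end up as an in-memory plan for small inputs and a distributed plan for large ones.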
Apache SystemML Open Source
• Apache open source project (http://systemml.apache.org/)
– Nov. 2015, start of the SystemML Apache Incubator project
– …
– Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API
– May 2017, Release 0.14.0 on Spark 2.0.2+
– May 2017, Apache Top Level Project
– Sep. 2017, Release 0.15
• Release downloads (http://systemml.apache.org/download)
– Binaries
– Coordinates to Maven repository
• GitHub source code (https://github.com/apache/systemml)
• Documentation (https://apache.github.io/systemml/)
• 3-hour KDD hands-on tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
Handwritten Digits Image Classification Using LeNet CNN
https://github.com/apache/systemml/blob/master/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb
SystemML on Power Environment
• Contributed native ppc64le libraries for JCuda to the mavenized JCuda project
– GPU backend on Power for SystemML
• Contributed native ppc64le libraries to the protoc project
– Useful for compiling Caffe proto files
• Supported native BLAS operations in SystemML
– Matrix multiplication, convolution (forward/backward)
– OpenBLAS with OpenMP support
Linear Regression Conjugate Gradient (preliminary 1/2)
[Chart: time in seconds vs. number of rows of the input matrix in thousands (64, 128, 256, 512, 1024, 2048); series: PPC CPU time, PPC GPU time, x86 CPU time, x86 GPU time]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
M-V multiplication chain is memory bound, but more cores help with parallelization.
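For readers unfamiliar with the benchmarked algorithm, here is a minimal, dependency-free sketch of conjugate gradient for regularized linear regression. It is a textbook formulation, not SystemML's script: the function names are hypothetical, the parameter names `reg`, `maxi`, and `tol` mirror the slide's settings, no intercept term is fitted (assuming that is what `Icpt: 0` means), and the stopping rule (squared residual norm relative to its initial value) is an assumption. Each iteration performs the matrix-vector multiplication chain X^T (X p) mentioned on the slide.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(X, v):
    """X @ v for a row-major list-of-lists matrix."""
    return [dot(row, v) for row in X]


def t_matvec(X, u):
    """X^T @ u without materializing the transpose."""
    return [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(len(X[0]))]


def linreg_cg(X, y, reg=0.01, maxi=20, tol=0.001):
    """Solve (X^T X + reg * I) beta = X^T y by conjugate gradient."""
    beta = [0.0] * len(X[0])
    r = [-v for v in t_matvec(X, y)]   # residual of the normal equations at beta = 0
    p = [-v for v in r]                # first search direction
    norm_r2 = dot(r, r)
    target = norm_r2 * tol ** 2        # assumed relative stopping threshold
    i = 0
    while i < maxi and norm_r2 > target:
        # Matrix-vector chain: q = X^T (X p) + reg * p
        q = [qi + reg * pi for qi, pi in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)
        beta = [b + alpha * pi for b, pi in zip(beta, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-ri + (norm_r2 / old_norm_r2) * pi for ri, pi in zip(r, p)]
        i += 1
    return beta


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]                   # generated as X @ [2, -1]
beta = linreg_cg(X, y, reg=0.0, tol=1e-9)
print(beta)
```

The loop never forms X^T X; it only streams over X twice per iteration, which is why this workload tends to be limited by memory bandwidth rather than arithmetic.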
Linear Regression Conjugate Gradient (preliminary 2/2)
[Chart: time in seconds vs. number of rows of the input matrix in thousands (64, 256, 1024); series: PPC GPU time, x86 GPU time]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
[Chart: CPU-GPU transfer time in seconds vs. number of rows of the input matrix in thousands (64, 256, 1024); series: PPC toDev time, x86 toDev time]
Most of the time is spent transferring data from host to device
-> 2x performance benefit due to CPU-GPU NVLink
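The reasoning behind the transfer-bound observation can be written as simple arithmetic: if host-to-device copying dominates, runtime scales roughly with bytes moved divided by link bandwidth. The sketch below is a back-of-the-envelope model only. The function name, the dense 8-bytes-per-value layout, and both bandwidth figures are hypothetical placeholders chosen so that their ratio is 2x; they are not measurements from the slides.

```python
def transfer_seconds(rows, cols, bandwidth_gb_s, bytes_per_value=8):
    """Idealized host-to-device copy time for a dense rows x cols matrix."""
    total_bytes = rows * cols * bytes_per_value
    return total_bytes / (bandwidth_gb_s * 1e9)


# Hypothetical link bandwidths in GB/s (placeholders, not measured values).
slower_link = 16.0
faster_link = 32.0

rows, cols = 1_024_000, 1000           # largest input size on the second chart
t_slow = transfer_seconds(rows, cols, slower_link)
t_fast = transfer_seconds(rows, cols, faster_link)
print(t_slow, t_fast, t_slow / t_fast)
```

Under this model, doubling the link bandwidth halves the copy time, so a workload dominated by transfers sees close to the same factor end to end.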
More Details
• Matthias Boehm, Alexandre Evfimievski, Niketan Pansare, Berthold Reinwald, Prithviraj Sen: Declarative, Large-Scale Machine Learning with Apache SystemML, 3 hours hands-on tutorial, KDD 2017
• Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen: SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017
• Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald: Compressed Linear Algebra for Large Scale Machine Learning. VLDB 2016 (Best Paper Award)
– Extended version to appear in VLDB Journal, 2017
– Summary version to appear in ACM SIGMOD Record Research Highlights, 2017
• Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, Shirish Tatikonda: SystemML: Declarative Machine Learning on Spark. VLDB 2016
• Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss: Resource Elasticity for Large-Scale Machine Learning. SIGMOD 2015: 137-152
• Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, P. Sadayappan: On optimizing machine learning workloads via kernel fusion. PPoPP 2015: 173-182
• Sebastian Schelter, Juan Soto, Volker Markl, Douglas Burdick, Berthold Reinwald, Alexandre V. Evfimievski: Efficient sample generation for scalable meta learning. ICDE 2015: 1191-1202
• Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish Tatikonda, Yuanyuan Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 37(3): 52-62 (2014)
• Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas Burdick, Shivakumar Vaithyanathan: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7): 553-564 (2014)
• Peter D. Kirchner, Matthias Boehm, Berthold Reinwald, Daby M. Sow, Michael Schmidt, Deepak S. Turaga, Alain Biem: Large Scale Discriminative Metric Learning. IPDPS Workshop 2014: 1656-1663
• Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald: Scalable and Numerically Stable Descriptive Statistics in SystemML. ICDE 2012: 1351-1359
• Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, Shivakumar Vaithyanathan: SystemML: Declarative machine learning on MapReduce. ICDE 2011: 231-242
[Slide annotations tagging the papers above by topic: Hands-on Tutorial, Automatic Rewrites & Fusion, Compression, 1st paper on Spark, Resource Elasticity, GPU, Sampling, Optimizer, Task Parallelism, Custom Algorithm, Numeric Stability]
Summary
• SystemML simplifies the life of data scientists
• Custom machine/deep learning algorithms
• Scale up & out
• Mixed workloads
– Memory access bound
– Compute bound
• Strike a balance between
– Data transfer
– Parallelism
