Apache SystemML
Mike Dusenberry
Engineer, Machine Learning & SystemML
Spark Technology Center
@dusenberrymw
Datapalooza, Denver - 05.19.16
Agenda
1. Background
a. Machine Learning
b. Declarative ML
2. SystemML
a. Overview
b. Language
c. Compiler/Optimizer
d. Runtime
3. Demo
4. Current Work
a. Deep Learning: SystemML-NN
5. Questions
Links
● Main Website: systemml.apache.org
● Code: github.com/apache/incubator-systemml
● Documentation: apache.github.io/incubator-systemml
● JIRA: issues.apache.org/jira/browse/SYSTEMML
Machine Learning
Machine Learning
● Data
○ Multiple “examples”
○ Multiple “features” per “example”
○ “Label(s)” for each “example” (supervised)
● Model
○ Construct/select a model that fits the problem.
○ Examples:
■ Linear/Logistic Regression
■ SVM
■ Neural Networks
● Loss
○ An “evaluation” of how well the model fits the data.
● Optimizer
○ Minimize “loss” by adjusting model to better fit the data.
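The four ingredients on this slide can be put together in a few lines. The following is a minimal sketch in plain Python rather than DML; the toy data, the function names, and the learning rate are all assumptions made for illustration.

```python
# Toy supervised-learning loop: data, model, loss, optimizer.
# All names and numbers here are illustrative assumptions.

# Data: four examples, one feature each, with one label per example (y = 2x + 1).
X = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]


def predict(w, b, x):
    """Model: a line with slope w and intercept b."""
    return w * x + b


def loss(w, b):
    """Loss: mean squared error between predictions and labels."""
    return sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(X)


def gradient(w, b):
    """Partial derivatives of the loss with respect to w and b."""
    n = len(X)
    dw = sum(2 * (predict(w, b, xi) - yi) * xi for xi, yi in zip(X, y)) / n
    db = sum(2 * (predict(w, b, xi) - yi) for xi, yi in zip(X, y)) / n
    return dw, db


def fit(steps=2000, lr=0.05):
    """Optimizer: plain gradient descent starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= lr * dw
        b -= lr * db
    return w, b


w, b = fit()
print("w =", round(w, 3), "b =", round(b, 3), "loss =", loss(w, b))
```

Swapping the model, the loss, or the optimizer changes only one function each, which is the kind of customization the later slides argue for.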
Declarative Machine Learning
Exploratory Data Analysis Today
[Figure labels: Data Scientist; R, Python, Others; Data; Laptop.]
Current Best Practice for Big Data Analysis
[Figure labels: three Data Scientists (R, Python, Others); Hadoop Engineer, Spark Engineer, MPI Engineer.]
Vision: Declarative Machine Learning
[Figure labels: Data Scientist; R, Python, Others; Query Optimization; Laptop, Scale-up, Cluster.]
Declarative Machine Learning
Common patterns:
• Changes in feature set
• Changes in data size
• Algorithm customization
• Quick iteration
Landscape of Existing Work
Classification by level of abstraction (different target user)
• Distributed Systems w/ DSLs: Spark, Flink, REEF, GraphLab, (R, Matlab, SAS)
• Large-Scale ML Libraries (fixed plan): MLlib, Mahout MR, MADlib, ORE, Rev R, HP Dist R, Custom alg.
• Declarative ML (fixed algorithm): SystemML, (Mahout Samsara, Tupleware, Cumulon, Dmac, SimSQL)
• Declarative ML++ (fixed task): MLbase*, Specific sys.
Requirements to Support Declarative ML
• Goal: Write ML algorithms independent of input data and cluster characteristics.
• R1: Full flexibility
▪ Specify new / customize existing ML algorithms.
▪ ➔ ML DSL
• R2: Data independence
▪ Hide physical data representation (sparse/dense, row/column-major, blocking configs, partitioning, caching, compression).
▪ ➔ Abstract data types and coarse-grained logical operations.
• R3: Efficiency and scalability
▪ Very small to very large use-cases.
▪ ➔ Automatic optimization and hybrid runtime plans.
• R4: Specified algorithm semantics
▪ Understand, debug, and control algorithm behavior.
▪ ➔ Optimization for performance only, not accuracy.
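To make R2 concrete, here is a small sketch of one logical operation running over two physical layouts. It is plain Python, and the `Matrix` class and the 0.4 sparsity threshold are assumptions for illustration; they are not SystemML's data structures or heuristics.

```python
# Sketch of "data independence": one logical operation, two physical layouts.
# The Matrix class and the 0.4 threshold are illustrative assumptions.


class Matrix:
    def __init__(self, rows, sparsity_threshold=0.4):
        self.nrow, self.ncol = len(rows), len(rows[0])
        nnz = sum(1 for r in rows for v in r if v != 0)
        self.is_sparse = nnz / (self.nrow * self.ncol) < sparsity_threshold
        if self.is_sparse:
            # Sparse layout: per row, a dict from column index to nonzero value.
            self._data = [{j: v for j, v in enumerate(r) if v != 0} for r in rows]
        else:
            # Dense layout: plain row-major lists.
            self._data = [list(r) for r in rows]

    def matvec(self, x):
        """Logical matrix-vector product; the caller never sees the layout."""
        if self.is_sparse:
            return [sum(v * x[j] for j, v in row.items()) for row in self._data]
        return [sum(v * xj for v, xj in zip(row, x)) for row in self._data]


dense = Matrix([[1, 2], [3, 4]])
sparse = Matrix([[0, 0, 5], [0, 0, 0], [7, 0, 0]])
print(dense.is_sparse, dense.matvec([1, 1]))
print(sparse.is_sparse, sparse.matvec([1, 2, 3]))
```

The algorithm author only calls `matvec`; which layout is used is decided from the data, not from the script.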
Apache SystemML
Sidenote: Fun Stuff - Neural Art
- A Neural Algorithm of Artistic Style, L.A. Gatys, A.S. Ecker, M. Bethge
- https://github.com/jcjohnson/neural-style
Apache SystemML
● High-level language
○ DML -> R-like
○ PyDML -> Python-like
○ Focus is on matrices and linear algebra.
● Engine
○ Compiler/Optimizer
○ Lots of optimizations, such as rewrites.
● Runtime
○ Laptop
○ Spark
○ (also Hadoop)
[Figure labels: (DML), (PyDML), Engine.]
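As an illustration of what an algebraic rewrite can look like, the sketch below checks the identity trace(X %*% Y) = sum(X * t(Y)) in plain Python. The specific rewrite is chosen here for illustration and is not taken from the slides; the helper names are hypothetical.

```python
# Sketch of an algebraic rewrite: trace(X %*% Y) == sum(X * t(Y)).
# The rewritten form never materializes the full matrix product.


def matmul(A, B):
    """Naive matrix product of A (n x m) and B (m x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def trace(A):
    """Sum of the diagonal of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


def trace_of_product_naive(X, Y):
    """Direct evaluation: build the full product, then take its diagonal."""
    return trace(matmul(X, Y))


def trace_of_product_rewritten(X, Y):
    """Rewritten evaluation: elementwise X * t(Y), summed in one pass."""
    return sum(X[i][k] * Y[k][i] for i in range(len(X)) for k in range(len(Y)))


X = [[1, 2, 3], [4, 5, 6]]       # 2 x 3
Y = [[7, 8], [9, 10], [11, 12]]  # 3 x 2
print(trace_of_product_naive(X, Y), trace_of_product_rewritten(X, Y))
```

For an n x m matrix X, the naive form does on the order of n * n * m multiplications while the rewritten form does n * m, which is why an optimizer would prefer it without changing the result.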
SystemML - Example: Logistic Regression (DML)
SystemML - Example: Sigmoid Function (DML)
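The code on these two slides did not survive as text. As a stand-in, here is a minimal Python sketch of a sigmoid function and a logistic regression trained with batch gradient descent. It is not the DML shown in the talk; the toy data, function names, step count, and learning rate are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(X, y, steps=500, lr=0.5):
    """Batch gradient descent on the mean log-loss; returns weights and bias."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical toy data: the label is 1 when the single feature is positive.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
print([predict(w, b, x) for x in X])
```

In a matrix language such as DML, the per-example loops would typically collapse into a few matrix expressions, which is what gives the compiler room to optimize.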
SystemML - Compilation Chain
More Fun...
https://github.com/google/deepdream
SystemML - Compilation Chain
Runtime instructions (Spark):
CP + b sb _mVar1
SPARK mapmm X.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE RIGHT false NONE
CP * y _mVar2 _mVar3
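Read as a plan, the three instructions above appear to compute y * (X %*% (b + sb)): two cheap operations in the control program (CP) and one distributed multiply. The sketch below replays that plan in plain Python. Row blocks stand in for Spark partitions, nothing here uses Spark itself, and the reading of `mapmm` as a multiply against a broadcast right-hand side is an assumption; all function names are hypothetical.

```python
# Sketch of the plan y * (X %*% (b + sb)), read off the three instructions.
# Row blocks stand in for Spark partitions; this is plain Python, not Spark.


def cp_add(b, sb):
    """CP '+': elementwise add of two small vectors in local memory."""
    return [bi + si for bi, si in zip(b, sb)]


def spark_mapmm(x_blocks, v):
    """Stand-in for SPARK 'mapmm': every row block multiplies against the
    same broadcast vector v, and the per-block results are concatenated."""
    out = []
    for block in x_blocks:  # one iteration per simulated partition
        out.extend(sum(xij * vj for xij, vj in zip(row, v)) for row in block)
    return out


def cp_mul(y, u):
    """CP '*': elementwise multiply in local memory."""
    return [yi * ui for yi, ui in zip(y, u)]


def to_blocks(X, rows_per_block):
    """Split a matrix into consecutive row blocks."""
    return [X[i:i + rows_per_block] for i in range(0, len(X), rows_per_block)]


X = [[1, 0], [0, 1], [2, 2], [3, 1]]  # the large operand in a real setting
b, sb = [1, 2], [3, 4]                # small vectors
y = [1, 2, 3, 4]

m_var1 = cp_add(b, sb)                         # CP + b sb _mVar1
m_var2 = spark_mapmm(to_blocks(X, 2), m_var1)  # SPARK mapmm X _mVar1 _mVar2
m_var3 = cp_mul(y, m_var2)                     # CP * y _mVar2 _mVar3
print(m_var3)
```

Because only the small vector is shipped to each block, the large matrix never has to move, which is the usual motivation for a broadcast-style multiply.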
SystemML Architecture (APIs and runtime)
● APIs
○ Command Line
○ JMLC
○ Spark MLContext
○ Spark ML
● Parser/Language
● Compiler
○ High-Level Operators (HOPs)
○ Low-Level Operators (LOPs)
○ Cost-based optimizations
● Runtime
○ Control Program: Runtime Prog, Buffer Pool, ParFor Optimizer/Runtime, Recompiler
○ Instructions: CP Inst, Spark Inst, MR Inst
○ IO: Mem/FS IO, DFS IO
○ Generic MR
○ MatrixBlock Library (single/multi-threaded)
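One way to picture how cost-based optimization leads to a mix of CP and Spark instructions is a per-operation memory check. The sketch below is a deliberately simplified assumption: the 1 GiB budget, the worst-case dense estimate, and the function names are hypothetical and do not describe SystemML's actual cost model.

```python
# Sketch of hybrid plan selection: pick an execution type per operation from
# a worst-case dense memory estimate. Budget and rule are assumptions.


def dense_bytes(rows, cols):
    """Worst-case size of a dense matrix of doubles (8 bytes per cell)."""
    return rows * cols * 8


def choose_exec_type(operand_shapes, output_shape, budget_bytes):
    """Return 'CP' if all operands and the output fit in the budget,
    otherwise 'SPARK'."""
    needed = sum(dense_bytes(r, c) for r, c in operand_shapes)
    needed += dense_bytes(*output_shape)
    return "CP" if needed <= budget_bytes else "SPARK"


budget = 1 * 1024 ** 3  # hypothetical 1 GiB budget for in-memory operations

# Adding two 1000 x 1 vectors: tiny, so it stays in the control program.
small = choose_exec_type([(1000, 1), (1000, 1)], (1000, 1), budget)

# Multiplying a 10,000,000 x 1000 matrix by a 1000 x 1 vector: the matrix
# alone needs about 80 GB dense, so the operation is sent to the cluster.
large = choose_exec_type([(10**7, 1000), (1000, 1)], (10**7, 1), budget)

print(small, large)
```

The same script therefore yields an all-local plan on small inputs and a hybrid plan on large ones, without the author changing a line.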
Demo
Current Work
Current Work
● Usability / Applications:
○ Deep Learning (SYSTEMML-540)
○ Embedded Scala/Python/R DSL with sufficient optimization scope (SYSTEMML-451)
● Optimizer:
○ Cost-model enhancement (SYSTEMML-416)
○ Global program optimization (SYSTEMML-421)
○ Source code generation for automatic operator fusion (SYSTEMML-448)
● Runtime:
○ Add GPU backend (SYSTEMML-445) => CUDA / OpenCL
○ Frame support / Sparse block representation
○ Integrate Apache Flink as additional backend for SystemML (SYSTEMML-636 / PR-119)
○ NUMA-aware single node backend (SYSTEMML-406)
Deep Learning - Plans
● Deep Learning library for SystemML written in DML (SYSTEMML-618).
○ SystemML-NN [https://github.com/dusenberrymw/systemml-nn]
● Built-in DML functions for computationally-intensive layers.
○ Convolution (2D), Max Pooling
● GPU acceleration for these built-in functions (SYSTEMML-445).
● Integration with existing deep learning libraries (Keras, TensorFlow, Torch, etc.)?
Deep Learning - SystemML-NN Library
● Deep learning library written in DML (and PyDML soon…).
● Multiple layers:
○ Core:
■ Affine, 2D Convolution, Max Pooling
○ Nonlinearity/Transfer:
■ Sigmoid, Tanh, Softmax, ReLU
○ Regularization:
■ Dropout, L1, L2
○ Loss:
■ Log-loss, Cross-entropy, L1, L2
● Multiple optimizers:
○ SGD, SGD w/ momentum, SGD w/ Nesterov momentum, Adagrad, RMSprop, Adam
https://github.com/dusenberrymw/systemml-nn
Deep Learning - SystemML-NN Library (cont.)
https://github.com/dusenberrymw/systemml-nn
● Each layer type has a simple `forward(...)` and `backward(...)` API.
○ `forward(...)` computes the output of the function based on the inputs.
○ `backward(...)` computes the partial derivatives (gradients) of some function deeper in the network (usually the loss function at the end) w.r.t. the inputs of this function.
● Each optimizer has a simple `update(...)` API.
○ `update(...)` adjusts the given parameters based on their partial derivatives.
● Includes test code in DML.
○ Gradient checks, unit tests
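The API shape described on this slide can be sketched for a single affine layer with an L2 loss, a plain SGD update, and a numerical gradient check. This is plain Python, not DML, and the function names and signatures are illustrative assumptions rather than those of SystemML-NN.

```python
# Sketch of a forward/backward/update API for one affine layer.
# Names and signatures are illustrative, not those of SystemML-NN.


def affine_forward(x, W, b):
    """out[j] = sum_i x[i] * W[i][j] + b[j] for a single example x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def affine_backward(dout, x, W, b):
    """Given dLoss/dout, return dLoss/dx, dLoss/dW, dLoss/db."""
    dx = [sum(W[i][j] * dout[j] for j in range(len(dout)))
          for i in range(len(x))]
    dW = [[x[i] * dout[j] for j in range(len(dout))] for i in range(len(x))]
    db = list(dout)
    return dx, dW, db


def l2_loss_forward(pred, target):
    """Half the sum of squared differences."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))


def l2_loss_backward(pred, target):
    """Gradient of the L2 loss with respect to the predictions."""
    return [p - t for p, t in zip(pred, target)]


def sgd_update(param, grad, lr):
    """Plain SGD step for a list-of-lists parameter."""
    return [[p - lr * g for p, g in zip(pr, gr)] for pr, gr in zip(param, grad)]


def numeric_grad_W(x, W, b, target, i, j, eps=1e-5):
    """Central-difference estimate of dLoss/dW[i][j] for a gradient check."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    loss_plus = l2_loss_forward(affine_forward(x, W_plus, b), target)
    loss_minus = l2_loss_forward(affine_forward(x, W_minus, b), target)
    return (loss_plus - loss_minus) / (2 * eps)


x, target = [1.0, -2.0], [0.5, 1.5, -1.0]
W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b = [0.0, 0.1, -0.1]

out = affine_forward(x, W, b)
dout = l2_loss_backward(out, target)
dx, dW, db = affine_backward(dout, x, W, b)

# Gradient check: analytic vs. numerical gradient for every weight.
max_err = max(abs(dW[i][j] - numeric_grad_W(x, W, b, target, i, j))
              for i in range(2) for j in range(3))

loss_before = l2_loss_forward(out, target)
W_new = sgd_update(W, dW, lr=0.1)
loss_after = l2_loss_forward(affine_forward(x, W_new, b), target)
print(max_err, loss_before, loss_after)
```

Keeping every layer behind the same pair of functions lets a network be assembled by chaining `forward` calls and then calling `backward` in reverse order, with the gradient check guarding each new layer.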
Deep Learning - SystemML-NN Library (cont.)
[Figure labels: SystemML-NN, SystemML Engine.]
Agenda Revisited
1. Background
a. Machine Learning
b. Declarative ML
2. SystemML
a. Overview
b. Language
c. Compiler/Optimizer
d. Runtime
3. Demo
4. Current Work
a. Deep Learning: SystemML-NN
5. Questions
Questions?
Thanks!

SystemML - Datapalooza Denver - 05.17.16 MWD
