2018 03 25 system ml ai and openpower meetup

  1. Scalable Machine/Deep Learning with Apache SystemML on OpenPOWER • Berthold Reinwald, IBM Research – Almaden, San Jose, CA • AI and OpenPOWER Meetup at h2o.AI, March 25th, 2018
  2. Let’s start off with a Tweet … (IBM Think 2018, Las Vegas)
  3. IBM Research achieves record deep learning performance with new software technology • Training time from weeks to hours. • Co-optimized SW/HW achieves near-linear scaling up to hundreds of GPUs. • Multi-ring communication pattern provides a good tradeoff between latency and bandwidth. • Resnet-101 on Imagenet 22K with 64 IBM Power8 S822LC servers (256 GPUs) in about 7 hours to a validation accuracy of 33.8 %. Microsoft's ADAM and Google's DistBelief results did not reach 30 % validation accuracy. • Compared to Facebook AI Research's 256-GPU training, the new communication algorithm and better combined SW/HW yield lower communication overhead for Resnet-50. A PowerAI DDL enabled version of Torch completed 90 epochs of training on Resnet-50 for 1K classes in 50 minutes using 64 IBM Power8 S822LC servers (256 GPUs).
  4. Tumor Proliferation Score
  5. Medical Image Segmentation
  6. U-Net: Deep Convolutional Neural Network for Segmentation of Biomedical Images • Problem: • Learn segmentation • Very few annotated images (approx. 30 per application) • Touching objects of same class; need to be separated by segmentation algorithm • Challenges: • 3D • Not generalizable shapes • Gradual edges • [Figure: raw image and segmentation map]
  7. Challenges in Machine/Deep Learning • Simplify the Life of Data Scientists • Custom algorithms & DNNs • Fast turnaround time • Data Characteristics (input/intermediates/output) • Dense / sparse • Small / large number of data points • Small / large number of features • Mixed Workloads • Compute bound • I/O or memory bandwidth bound • Core Operations • Data manipulation • Linear algebra • Convolution • Iterative • Multiple Stages • Training • Testing • Inference • Deployment Environments • Range from embeddable scoring library (low latency), to scale up on large nodes, to distributed • Libraries: MKL/MKL-DNN, OpenBLAS, CUDA/cuDNN and low precision • Hardware • x86/Power • Many cores, GPU, TPU, FPGA • High-speed interconnects (Topologies) • … all combinations
  8. Why Apache SystemML • Today’s Roles of Data Scientists • Algorithm researcher: Invent new optimization schemes • Systems programmer: provide distributed implementations • Deployment engineer: Run for varying datasets • Systems researcher: Optimize clusters • SystemML simplifies the Life of Data Scientists • in implementing custom machine learning • running algorithms distributed if needed • running algorithms varying from small data to large data • Fast turnaround • [Figure labels: NIPS, ICML, KDD, JMLR]
  9. Apache SystemML – Declarative Machine Learning • Productivity of data scientists • Machine learning language for data scientists (“The SQL for analytics”) • Strong foundation in linear algebra and statistical functions • Comes with approx. 20+ algorithms pre-implemented • Enable Solutions development and Tools • Scalability & Performance • Built on data parallel platforms, e.g. Spark • Cost-based optimizer to compile execution plans • Depending on data characteristics (tall/skinny, short/wide) and cluster characteristics • Ranging from in-memory single node to clusters (MapReduce, Spark), and hybrid plans • APIs & Tools • Command line: standalone Java app, spark-submit, hadoop jar • Use in Spark through Scala, Python, R, and Java APIs • Embeddable scoring library • Tools: REPL (Scala Spark and pyspark), SparkR, SparkML, Jupyter, Zeppelin Notebooks • [Figure: Language, Compiler, Runtime; In-Memory Single Node (scale-up); Hadoop or Spark Cluster (scale-out)]
  10. SystemML integrated in Spark Ecosystem • [Figure labels: Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX, SystemML, Analytics Library, Custom Analytics, Machine Learning] • DataFrame Spark API to SystemML • SystemML runs against Spark core for distributed computations
  11. Apache SystemML Open Source • Apache open source project • Nov. 2015, Start SystemML Apache Incubator Project • … • Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API. May 2017, Release 0.14.0 on Spark 2.0.2+. • May 2017, Apache Top Level Project • … • Dec. 2017, Release 1.0.0 • March 2018, Release 1.1.0 • Release downloads • Binaries • Coordinates to Maven repository • GitHub source code • Documentation • 3 Hours KDD Hands-On Tutorial, Aug. 2017
  12. Automatic Algebraic Simplification Rewrites lead to Significant Performance Improvements • Simplify operations over mmult → Eliminate unnecessary compute • trace(X %*% Y) → sum(X * t(Y)) • Remove unnecessary operations → Merging operations • rand(…, min=-1, max=1) * 7 → rand(…, min=-7, max=7) • Binary to unary operations → Reduce amount of data touched • X*X → X^2 • Remove unnecessary indexing → Eliminate operations (conditional) • X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y) • … tens more rewrite rules
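
The first rewrite on this slide can be checked numerically. The following is a minimal sketch in plain Python (not DML, and not SystemML's optimizer); the helper names and the random test matrices are made up for illustration. It shows that `trace(X %*% Y)` and `sum(X * t(Y))` agree, while the rewritten form never builds the matrix product.

```python
import random


def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace_via_matmul(x, y):
    """trace(X %*% Y): builds the full n x n product, then sums its diagonal."""
    p = matmul(x, y)
    return sum(p[i][i] for i in range(len(p)))


def trace_rewritten(x, y):
    """sum(X * t(Y)): element-wise product with the transpose, no matrix product."""
    return sum(x[i][k] * y[k][i] for i in range(len(x)) for k in range(len(y)))


# Hypothetical test data: X is n x m, Y is m x n.
random.seed(0)
n, m = 4, 6
X = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
Y = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

print(trace_via_matmul(X, Y), trace_rewritten(X, Y))
```

The product-based version performs n·n·m multiplications, the rewritten version n·m, which is the "eliminate unnecessary compute" effect named on the slide.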
  13. Compilation Chain
  14. Training a Deep Neural Network • Training features: • Training label: • Goal: learn the weights • Define a loss function: • For numerical stability and mathematical simplicity, we use negative log-likelihood (often referred to as cross-entropy): • “Forward propagation”: Compute a function via composition of linear transformations followed by element-wise non-linearities • “Backward propagation”: Propagate errors backwards and update weights according to how much they contributed to the output
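
The formulas on this slide did not survive extraction, so the following is a minimal sketch under assumptions, not the slide's own notation: a single affine layer followed by softmax, trained with the negative log-likelihood (cross-entropy) loss and one plain gradient-descent step. All function names, the learning rate, and the toy input are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(W, b, x):
    """Forward propagation: linear transformation followed by softmax."""
    z = [sum(W[j][i] * x[i] for i in range(len(x))) + b[j] for j in range(len(b))]
    return softmax(z)


def loss(W, b, x, label):
    """Negative log-likelihood of the correct class (cross-entropy)."""
    return -math.log(forward(W, b, x)[label])


def backward(W, b, x, label):
    """Backward propagation: gradients of the loss w.r.t. W and b."""
    p = forward(W, b, x)
    dz = [p[j] - (1.0 if j == label else 0.0) for j in range(len(p))]
    dW = [[dz[j] * x[i] for i in range(len(x))] for j in range(len(dz))]
    return dW, dz


def sgd_step(W, b, x, label, lr=0.1):
    """Update weights in proportion to their contribution to the error."""
    dW, db = backward(W, b, x, label)
    W_new = [[W[j][i] - lr * dW[j][i] for i in range(len(x))] for j in range(len(W))]
    b_new = [b[j] - lr * db[j] for j in range(len(b))]
    return W_new, b_new


# Toy example: 2 features, 3 classes, weights initialized to zero.
W0 = [[0.0, 0.0] for _ in range(3)]
b0 = [0.0, 0.0, 0.0]
x0, label0 = [1.0, 2.0], 1
W1, b1 = sgd_step(W0, b0, x0, label0)
print(loss(W0, b0, x0, label0), loss(W1, b1, x0, label0))
```

With zero weights every class is equally likely, so the initial loss is log(3); one update step lowers it.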
  15. Deep Learning Layers • Fully connected layer • Reference: Convolutional Neural Networks for Visual Recognition.
  16. Convolution Layer • Fewer parameters than a fully connected layer • Useful to capture local features (spatially) • Output #channels = #filters • Reference: Convolutional Neural Networks for Visual Recognition.
  17. Deep Learning Support • Reuse existing infrastructure to implement custom DNNs like other training algorithms • Small number of DL-specific built-in functions • e.g. convolution • NN library of layers and training optimizers to stack layers, e.g. • Affine (fully-connected) layer is matrix multiplication • Convolution layer invokes new convolution function • Caffe/Keras2DML to import existing DNNs • Transfer learning to continue training on different data • GPU and native BLAS libraries • NN library:
  18. Compressed Linear Algebra (CLA) • Motivation: Iterative ML algorithms with I/O-bound MV multiplications • Key Ideas: Use lightweight DB compression techniques and perform LA operations on compressed matrices (w/o decompression) • Experiments • LinregCG, 10 iterations, SystemML 0.14 • 1+6 node cluster, Spark 2.1 • Compression ratios (Gzip / Snappy / CLA): Higgs 1.93 / 1.38 / 2.17; Census 17.11 / 6.04 / 35.69; Covtype 10.40 / 6.13 / 18.19; ImageNet 5.54 / 3.35 / 7.34; Mnist8m 4.12 / 2.60 / 7.32; Airline78 7.07 / 4.28 / 7.44 • End-to-end performance [sec] (Uncompressed / Snappy RDD compression / CLA): Mnist40m (90GB) 89 / 135 / 93; Mnist240m (540GB) 3409 / 765 / 463; Mnist480m (1.1TB) 5663 / 2730 / 998
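
The key idea on this slide, running linear algebra directly on compressed data, can be illustrated with a toy column encoding. This is only a sketch under assumptions: each column is stored as a map from distinct non-zero value to the row indices where it occurs, which is loosely in the spirit of lightweight database compression but is not claimed to be CLA's actual format. All names are hypothetical.

```python
def compress_column(col):
    """Map each distinct non-zero value to the list of rows where it occurs."""
    groups = {}
    for r, val in enumerate(col):
        if val != 0:
            groups.setdefault(val, []).append(r)
    return groups


def compress(matrix):
    """Compress a row-major list-of-lists matrix column by column."""
    n_cols = len(matrix[0])
    return [compress_column([row[j] for row in matrix]) for j in range(n_cols)]


def matvec_compressed(cols, n_rows, v):
    """Compute X %*% v on the compressed columns, without decompressing X."""
    y = [0.0] * n_rows
    for j, groups in enumerate(cols):
        for val, rows in groups.items():
            scaled = val * v[j]  # one multiplication per distinct value
            for r in rows:
                y[r] += scaled
    return y


X = [[1, 0, 2],
     [1, 3, 2],
     [0, 3, 2],
     [1, 0, 0]]
v = [1, 2, 3]
print(matvec_compressed(compress(X), len(X), v))
```

Columns with few distinct values need few multiplications and little memory traffic, which matters for the I/O-bound matrix-vector products named in the motivation.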
  19. Code Generation for Operator Fusion • Motivation • Ubiquitous Fusion Opportunities • High Performance Impact • Key Ideas • Template skeletons (Row, Cell, Outer, MultiAgg) • Candidate exploration to identify fusion opportunities • Candidate selection via cost-based optimizer or heuristics • Codegen with janino / javac during compile and dynamic recompile • [Figure: fusion examples: Multi-Aggregate a=sum(X^2), b=sum(X*Y), c=sum(Y^2); sum(X*Y*Z); Xᵀ(Xv) with q = Xv in the 1st pass and Xᵀq in the 2nd pass; sum(X * log(UVᵀ)) with sparsity exploitation]
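
The multi-aggregate example from this slide (a=sum(X^2), b=sum(X*Y), c=sum(Y^2)) shows what fusion buys. The sketch below is plain Python for illustration only; SystemML generates Java operators, and the function names here are hypothetical. The unfused version materializes three intermediates and scans the inputs repeatedly, while the fused version computes all three aggregates in a single pass.

```python
def multi_agg_unfused(X, Y):
    """Three separate operators, each materializing an intermediate vector."""
    x2 = [x * x for x in X]
    xy = [x * y for x, y in zip(X, Y)]
    y2 = [y * y for y in Y]
    return sum(x2), sum(xy), sum(y2)


def multi_agg_fused(X, Y):
    """One pass over X and Y, no intermediates: the fused multi-aggregate."""
    a = b = c = 0.0
    for x, y in zip(X, Y):
        a += x * x
        b += x * y
        c += y * y
    return a, b, c


X = [1.0, 2.0, 3.0]
Y = [4.0, 5.0, 6.0]
print(multi_agg_unfused(X, Y), multi_agg_fused(X, Y))
```

Both variants return the same values; the fused one reads each input once, which is why fusion helps most for memory-bandwidth-bound operations.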
  20. Codegen Micro Benchmarks (FP64) • Panels: sum(X ʘ Y ʘ Z), dense; sum(X ʘ Y ʘ Z), sparse; Xᵀ(Xv), dense; sum(X ʘ log(UVᵀ + 1e-15)) • Sparsity 0.1 • Data size 20K x 20K • #1 Gen close to hand-coded fused ops • #2 TF/Julia Gen only single-threaded • #3 TF w/ very limited sparse support • #4 Sparse Gen challenging, Gen better than hand-coded ops • #5 TF w/ poor performance for data-intensive ops • #6 Gen at peak mem bandwidth • #7 Autom. sparsity exploitation across chains of ops
  21. SystemML on Power Environment • Contributed native ppc64le libraries for JCuda to mavenized jcuda project • GPU backend on Power for SystemML • Contributed native ppc64le libraries to protoc project • Useful for compiling Caffe proto files • Supported native BLAS operations in SystemML • Matrix Multiplication, Convolution (forward/backward) • OpenBLAS with OpenMP support
  22. Linear Regression Conjugate Gradient (preliminary 1/2) • [Plot: time in seconds vs. number of rows of the input matrix (in thousands, 64 to 2048); series: PPC CPU, PPC GPU, x86 CPU, x86 GPU] • Data: random with sparsity 0.95, 1000 features • icpt: 0, maxi: 20, tol: 0.001, reg: 0.01 • Driver memory: 100G, local[*] master • The M-V multiplication chain is memory bound, but more cores help with parallelization.
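
The "M-V multiplication chain" mentioned on this slide is the core of conjugate-gradient linear regression. The following is a minimal textbook sketch in plain Python, not SystemML's script: it solves (XᵀX + reg·I)·beta = Xᵀy without an intercept. The defaults reuse the slide's `reg`, `maxi`, and `tol` values, but the exact stopping rule (squared residual norm below `tol`² times its initial value) is an assumption.

```python
def matvec(X, v):
    """X %*% v for a row-major list-of-lists matrix."""
    return [sum(xi * vi for xi, vi in zip(row, v)) for row in X]


def t_matvec(X, u):
    """t(X) %*% u without materializing the transpose."""
    out = [0.0] * len(X[0])
    for row, ui in zip(X, u):
        for j, xij in enumerate(row):
            out[j] += xij * ui
    return out


def linreg_cg(X, y, reg=0.01, maxi=20, tol=0.001):
    """Conjugate gradient on the regularized normal equations (no intercept)."""
    beta = [0.0] * len(X[0])
    r = [-g for g in t_matvec(X, y)]          # residual A*beta - b at beta = 0
    p = [-ri for ri in r]                     # first search direction
    norm_r2 = sum(ri * ri for ri in r)
    target = norm_r2 * tol * tol              # assumed stopping threshold
    i = 0
    while i < maxi and norm_r2 > target:
        # The M-V chain: q = t(X) %*% (X %*% p) + reg * p
        q = [qj + reg * pj for qj, pj in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / sum(pj * qj for pj, qj in zip(p, q))
        beta = [bj + alpha * pj for bj, pj in zip(beta, p)]
        r = [rj + alpha * qj for rj, qj in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = sum(rj * rj for rj in r)
        p = [-rj + (norm_r2 / old_norm_r2) * pj for rj, pj in zip(r, p)]
        i += 1
    return beta


# Tiny hypothetical dataset: 3 rows, 2 features.
X_small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_small = [2.0, -1.0, 1.0]
print(linreg_cg(X_small, y_small))
```

Each iteration touches X twice and does little arithmetic per element read, which is consistent with the slide's remark that the chain is memory bound.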
  23. Linear Regression Conjugate Gradient (preliminary 2/2) • [Plot: time in seconds vs. number of rows of the input matrix (in thousands; 64, 256, 1024); series: PPC GPU, x86 GPU] • Data: random with sparsity 0.95, 1000 features • icpt: 0, maxi: 20, tol: 0.001, reg: 0.01 • Driver memory: 100G, local[*] master • [Plot: CPU-GPU transfer time in seconds vs. number of rows of the input matrix (in thousands; 64, 256, 1024); series: PPC toDev, x86 toDev] • Most of the time is spent transferring data from host to device → 2x performance benefit due to CPU-GPU NVLink
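  24. Capabilities of DL frameworks • Columns: single precision (CPU / GPU), double precision (CPU / GPU), code generation (CPU / GPU), BLAS, Spark DataFrame support, sparse operation • SystemML: single precision Limited (only for BLAS) / Yes; double precision Yes / Yes; code generation Yes / No; BLAS: OpenBLAS, MKL, Java; DataFrame: Yes; sparse: Yes • TF 1.5: single precision Yes / Yes; double precision No / No; code generation Yes / Yes; BLAS: Eigen, MKL; DataFrame: ? (via elephas); sparse: Limited • BigDL: single precision Yes / No; double precision Yes / No; code generation No / No; BLAS: MKL; DataFrame: Yes; sparse: No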
  25. Execution time for 10 epochs with Lenet 5 and 60K MNIST dataset • [Plot: execution time on a log scale (1 to 10000) for CPU single precision, GPU single precision, CPU double precision, GPU double precision; series: TF, TF with XLA, SystemML, SystemML with codegen, Intel BigDL] • Annotation: due to limited single precision support in SystemML • SystemML/TF outperforms BigDL for minibatch training • Both TF and SystemML perform equally well on GPU • BigDL: No GPU support • TF: No support for double precision • SystemML: No GPU codegen support yet • Code generation improves performance both on CPU and on GPU
  26. Summary • SystemML simplifies the Life of Data Scientists • Custom Machine/Deep Learning Algorithms • Scale up & out • Mixed Workloads • Memory access bound • Compute bound • Strike Balance between • Data transfer • Parallelism

Editor's Notes

  1. 2x faster, 4x mem, 1st PCIe gen4 on chip, support for NVLink
  2. Automated Grading of Gliomas using Deep Learning in Digital Pathology Images: (1) Cut a “whole-slide” image into square “tiles” at 20x magnification. (2) Filter the “tiles” to remove any without tissue. (3) Cut the remaining “tiles” into smaller “samples”. (4) Assign a tumor score label to each sample based on the tumor score of the “whole-slide” image. (5) Repeat 1-4 for all “whole-slide” images. (6) Train a convolutional neural network with the resulting dataset of labeled “samples”. Good results!
  3. Contraction path (increase the "what", reduce the "where"): convolutions followed by a non-linear activation function. Expansion path (create a high-resolution segmentation map): sequence of up-convolutions and concatenation with high-resolution features from the contracting path. Output has 2 channels: one for the foreground, and one for the background. Training time: 10h with GPU. Application: 1s per image. Better accuracy than sliding-window CNN.
  4. 9
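
The preprocessing steps listed in editor's note 2 (cut into tiles, filter tiles without tissue, cut into samples, label each sample with the slide's tumor score) can be sketched as follows. This is a toy illustration on 2D lists, not the authors' pipeline: the function names, the background value, and the tissue criterion (at least half of the pixels differ from the background) are all assumptions.

```python
def cut(image, size):
    """Split a 2D list into non-overlapping size x size blocks (ragged edges dropped)."""
    rows, cols = len(image), len(image[0])
    return [[row[c:c + size] for row in image[r:r + size]]
            for r in range(0, rows - size + 1, size)
            for c in range(0, cols - size + 1, size)]


def has_tissue(tile, background=0, min_fraction=0.5):
    """Assumed filter: keep a tile if enough pixels differ from the background."""
    pixels = [p for row in tile for p in row]
    return sum(p != background for p in pixels) / len(pixels) >= min_fraction


def build_samples(slides, tile_size, sample_size):
    """Steps 1-5: tile, filter, cut into samples, and label with the slide's score."""
    dataset = []
    for image, score in slides:
        for tile in cut(image, tile_size):
            if has_tissue(tile):
                for sample in cut(tile, sample_size):
                    dataset.append((sample, score))
    return dataset


# Hypothetical 4 x 4 "whole-slide" image: tissue (1) on the left, background (0) on the right.
slide = [[1, 1, 0, 0] for _ in range(4)]
dataset = build_samples([(slide, 3)], tile_size=2, sample_size=1)
print(len(dataset))
```

The resulting list of (sample, label) pairs is what step 6 would feed to a convolutional neural network.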