
DLoBD: An Emerging Paradigm of Deep Learning Over Big Data Stacks with Dhabaleswar K. Panda and Xiaoyi Lu


Deep Learning over Big Data (DLoBD) is becoming one of the most important research paradigms for mining value from the massive amounts of gathered data. Many emerging deep learning frameworks are starting to run over Big Data stacks such as Hadoop and Spark. With the convergence of HPC, Big Data, and Deep Learning, these DLoBD stacks are taking advantage of RDMA and multi-/many-core based CPUs/GPUs.

Even though a lot of activity is happening in the field, there is a lack of systematic studies analyzing the impact of RDMA-capable networks and CPUs/GPUs on DLoBD stacks. To fill this gap, this talk presents a systematic characterization methodology and extensive performance evaluations of four representative DLoBD stacks (i.e., CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL) to expose interesting trends regarding performance, scalability, accuracy, and resource utilization.

Our observations show that the RDMA-based design for DLoBD stacks can achieve up to 2.7x speedup compared to the IPoIB-based scheme. The RDMA scheme can also scale better and utilize resources more efficiently than the IPoIB scheme over InfiniBand clusters. In most cases, GPU-based deep learning can outperform CPU-based designs, but not always. We see that for LeNet on MNIST, CPU + MKL can achieve better performance than GPU and GPU + cuDNN on 16 nodes. Our studies show that there is considerable room to improve the designs of current-generation DLoBD stacks. Finally, we present some in-depth case studies to further accelerate deep learning workloads on the Spark framework.



  1. DLoBD: An Emerging Paradigm of Deep Learning Over Big Data Stacks. Talk at Spark+AI Summit 2018 by Dhabaleswar K. (DK) Panda, The Ohio State University (E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda) and Xiaoyi Lu, The Ohio State University (E-mail: luxi@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~luxi)
  2. Why Deep Learning is so hot? •  Deep Learning is a subset of Machine Learning – but it is perhaps the most radical and revolutionary subset •  Deep Learning is going through a resurgence – Model: excellent accuracy for deep/convolutional neural networks – Data: public availability of versatile datasets like MNIST, CIFAR, and ImageNet – Capability: unprecedented computing and communication capabilities (multi-/many-core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.) •  Big Data has become one of the most important elements in business analytics – increasing demand for getting Big Value out of Big Data to keep revenue growing. (Figures: MNIST handwritten digits; a deep neural network. Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/)
  3. Application Example of DL: Flickr’s Magic View Photo Filtering •  Image recognition to divide pictures into surprisingly accurate categories •  Magic of AI/DL: generate accurate tags for billions of pictures (Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g)
  4. Deep Learning over Big Data (DLoBD) •  Deep Learning over Big Data (DLoBD) is one of the most efficient analysis paradigms •  More and more deep learning tools or libraries (e.g., Caffe, TensorFlow) start running over big data stacks, such as Apache Hadoop and Spark •  Benefits of the DLoBD approach – Easily build a powerful data analytics pipeline, e.g., the Flickr DL/ML Pipeline ("How Deep Learning Powers Flickr", http://bit.ly/1KIDfof) – Better data locality – Efficient resource sharing and cost effective •  Typical pipeline stages: (1) prepare datasets @Scale, (2) deep learning @Scale, (3) non-deep learning analytics @Scale, (4) apply ML model @Scale (see the sketch below)
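To make the four pipeline stages above concrete, here is a minimal PySpark sketch of a DLoBD flow, assuming a placeholder HDFS path and a placeholder DL library; stages (2) and (4) are only indicated as comments because in practice they are carried out by a library such as CaffeOnSpark, TensorFlowOnSpark, or BigDL running on the executors.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlobd-pipeline-sketch").getOrCreate()
sc = spark.sparkContext

# (1) Prepare datasets @Scale: parse and clean raw records stored on HDFS.
raw = sc.textFile("hdfs:///data/images_index.csv")        # placeholder path
samples = raw.map(lambda line: line.strip().split(",")).cache()

# (2) Deep Learning @Scale: hand the prepared RDD to a DL library.
# model = some_dl_library.train(samples)                  # library-specific call

# (3) Non-deep learning analytics @Scale: ordinary Spark transformations.
label_counts = samples.map(lambda s: (s[-1], 1)).reduceByKey(lambda a, b: a + b)
print(label_counts.take(5))

# (4) Apply ML model @Scale, e.g., tagging new records with the trained model.
# predictions = samples.map(lambda s: model.predict(s))   # library-specific call

spark.stop()
```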
  5. Examples of DLoBD Stacks •  CaffeOnSpark •  SparkNet •  TensorFlowOnSpark •  TensorFrame •  DeepLearning4J •  BigDL •  mmlspark – CNTKOnSpark •  Many others…
  6. Overview of DLoBD Stacks •  Layers of DLoBD stacks – Deep learning application layer – Deep learning library layer – Big data analytics framework layer – Resource scheduler layer – Distributed file system layer – Hardware resource layer •  How much performance benefit can we achieve for end deep learning applications? Crossing these layers may lead to sub-optimal performance.
  7. Convergence of HPC, Big Data, and Deep Learning!!! •  Increasing usage of HPC, Big Data, and Deep Learning – Big Data (Hadoop, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, BigDL, etc.), and HPC (MPI, RDMA, Lustre, etc.)
  8. Highly-Optimized Underlying Libraries with HPC Technologies •  BLAS libraries – the heart of math operations – ATLAS/OpenBLAS – NVIDIA cuBLAS – Intel Math Kernel Library (MKL) •  DNN libraries – the heart of convolutions! – NVIDIA cuDNN (already reached its 7th iteration – cudnn-v7) – Intel MKL-DNN (MKL 2017) – recent but a very promising development •  Communication libraries – the heart of model parameter updating – RDMA – GPUDirect RDMA
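As a quick, hedged sanity check of which of these underlying libraries a Python-based deep learning environment is actually linked against, the snippet below prints NumPy's BLAS configuration (look for MKL vs. OpenBLAS) and, if TensorFlow 1.x is installed, whether the build includes CUDA/cuDNN support; nothing here is specific to the DLoBD stacks discussed in the talk.

```python
import numpy as np

# BLAS backend: the output lists the blas/lapack libraries NumPy was built
# against (e.g., 'mkl_rt' for Intel MKL or 'openblas' for OpenBLAS).
np.show_config()

try:
    import tensorflow as tf  # TensorFlow 1.x API assumed
    print("Built with CUDA/cuDNN:", tf.test.is_built_with_cuda())
    print("GPU available:", tf.test.is_gpu_available())
except ImportError:
    print("TensorFlow not installed; skipping the CUDA/cuDNN check")
```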
  9. Outline •  Accelerating Big Data Stacks •  Benchmarking and Characterizing DLoBD Stacks – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL •  Accelerating DLoBD Stacks – BigDL on RDMA-Spark – TensorFlow
  10. The High-Performance Big Data (HiBD) Project •  RDMA for Apache Spark •  RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions •  RDMA for Apache HBase •  RDMA for Memcached (RDMA-Memcached) •  RDMA for Apache Hadoop 1.x (RDMA-Hadoop) •  OSU HiBD-Benchmarks (OHB) – HDFS, Memcached, HBase, and Spark micro-benchmarks •  http://hibd.cse.ohio-state.edu •  User base: 285 organizations from 34 countries •  More than 26,550 downloads from the project site •  Available for InfiniBand and RoCE; also runs on Ethernet •  Available for x86 and OpenPOWER; supports Singularity and Docker
  11. HiBD Release Timeline and Downloads (Chart: cumulative number of downloads over time, annotated with releases including RDMA-Hadoop 1.x 0.9.0/0.9.8/0.9.9, RDMA-Hadoop 2.x 0.9.1/0.9.6/0.9.7/1.0.0/1.3.0, RDMA-Spark 0.9.1/0.9.4/0.9.5, RDMA-HBase 0.9.1, RDMA-Memcached 0.9.1/0.9.4/0.9.6, and OHB 0.7.1/0.9.3.)
  12. Example: RDMA-based Apache Spark •  Evolution of default design choices across Apache Spark versions – 0.9.1: hash-based shuffle, NIO-based data transfer – 1.2.0: sort-based shuffle, NIO-based data transfer – 1.3.1: sort-based shuffle, Netty-based data transfer – 1.4.0~2.0.0+: sort-based shuffle, Netty-based data transfer, new choice: Tungsten-Sort! – 1.6.0~2.2.0+: new workloads – DLoBD: CaffeOnSpark, TensorFlowOnSpark, BigDL, etc. •  The RDMA-based Apache Spark designs (HotI’14, BigData’16, HotI’17) evolved alongside these releases, with the HotI’17 design targeting the new DLoBD workloads.
  13. Design Overview of Spark with RDMA •  Design features – RDMA-based shuffle plugin – SEDA-based architecture – Dynamic connection management and sharing – Non-blocking data transfer – Off-JVM-heap buffer management – InfiniBand/RoCE support •  Enables high-performance RDMA communication while supporting the traditional socket interface •  A JNI layer bridges the Scala-based Spark with the communication library written in native code. (Architecture figure: benchmarks/applications/libraries/frameworks run on Apache Spark; the shuffle manager (Sort, Hash, Tungsten-Sort) and block transfer service (Netty, NIO, RDMA-Plugin) sit on Spark Core; the Netty/NIO servers and clients use the Java socket interface over 1/10/40/100 GigE and IPoIB networks, while the RDMA server and client go through JNI to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE).) •  X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014. •  X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.
  14. RDMA for Apache Spark Distribution •  High-performance design of Spark over RDMA-enabled interconnects – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark – RDMA-based data shuffle and SEDA-based shuffle architecture – Non-blocking and chunk-based data transfer – Off-JVM-heap buffer management – Support for OpenPOWER – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB); a hedged configuration sketch follows below •  Current release: 0.9.5 – Based on Apache Spark 2.1.0 – Tested with Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR), RoCE support with Mellanox adapters, various multi-core platforms (x86, POWER), and RAM disks, SSDs, and HDD – http://hibd.cse.ohio-state.edu
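As indicated above, a sketch of how an application might opt into the RDMA shuffle from PySpark is shown below. The spark.rdma.* property names are hypothetical placeholders used only for illustration; the actual configuration keys and launch steps are documented in the RDMA for Apache Spark user guide at http://hibd.cse.ohio-state.edu.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("rdma-spark-config-sketch")
        .setMaster("yarn")
        # Hypothetical keys: toggle the RDMA block-transfer/shuffle plugin and
        # select the protocol (native InfiniBand vs. RoCE vs. IPoIB). Replace
        # with the property names from the HiBD user guide.
        .set("spark.rdma.enabled", "true")
        .set("spark.rdma.protocol", "ib"))

sc = SparkContext(conf=conf)
print(sc.getConf().getAll())   # verify the properties were picked up
sc.stop()
```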
  15. HiBD Packages on SDSC Comet and Chameleon Cloud •  RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet – Examples for various modes of usage are available in /share/apps/examples/HADOOP (RDMA for Apache Hadoop 2.x) and /share/apps/examples/SPARK/ (RDMA for Apache Spark) – Please email help@xsede.org (reference Comet as the machine and SDSC as the site) if you have any further questions about usage and configuration •  RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance – https://www.chameleoncloud.org/appliances/17/ •  M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016.
  16. Performance Evaluation on SDSC Comet – HiBench PageRank •  InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R) •  RDMA-based design for Spark 1.5.1 •  RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps) – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps) (Charts: PageRank total time (sec) vs. data size (Huge, BigData, Gigantic) for IPoIB and RDMA on 32 worker nodes/768 cores and 64 worker nodes/1536 cores.)
  17. Outline •  Accelerating Big Data Stacks •  Benchmarking and Characterizing DLoBD Stacks – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL •  Accelerating DLoBD Stacks – BigDL on RDMA-Spark – TensorFlow
  18. Benchmarking and Characterization Methodology •  Choose proper DL workloads, models, and datasets – Varied sizes to cover big and small models, and small and large datasets – Cover different kinds of combinations •  Choose representative DLoBD stacks – CaffeOnSpark, TensorFlowOnSpark, and BigDL – Running over Spark, YARN, and HDFS •  Define characterization dimensions – Processor type – Parameter updating approach (i.e., communication) – Network protocol (IPoIB, RDMA) •  Generate evaluation reports – Performance (end-to-end training time; time to a certain accuracy; epoch execution time) – Accuracy, scalability, resource utilization – Breakdown (a minimal harness sketch follows below)
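A minimal harness sketch for sweeping the characterization dimensions listed above (device type and network protocol) and collecting end-to-end timings. The run_workload() function is a placeholder that only simulates a run; a real harness would submit the corresponding CaffeOnSpark/TensorFlowOnSpark/BigDL job and parse its logs.

```python
import itertools
import json
import time

def run_workload(stack, model, device, protocol):
    """Placeholder launcher: a real harness would submit the job with the
    given device/protocol settings and return its final accuracy."""
    time.sleep(0.1)   # stand-in for the actual training run
    return None

results = []
for device, protocol in itertools.product(
        ["cpu-openblas", "cpu-mkl", "gpu", "gpu-cudnn"], ["ipoib", "rdma"]):
    start = time.time()
    final_accuracy = run_workload("CaffeOnSpark", "CIFAR-10 Quick", device, protocol)
    results.append({
        "stack": "CaffeOnSpark", "model": "CIFAR-10 Quick",
        "device": device, "protocol": protocol,
        "end_to_end_secs": round(time.time() - start, 2),
        "final_accuracy": final_accuracy,
    })

print(json.dumps(results, indent=2))
```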
  19. Overview of Representative DLoBD Stacks – CaffeOnSpark •  Spark Driver: job launching and job control •  Spark Executor: data feeding and task control •  Model Synchronizer: communicates across nodes with RDMA/TCP, and outputs the model file on HDFS •  Scalable and communication intensive – Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck – Out-of-band communication
  20. Overview of Representative DLoBD Stacks – TensorFlowOnSpark •  Spark Executors act as containers used to run TensorFlow code •  Two different modes of ingesting data – Read data directly from HDFS using built-in TensorFlow modules – Feed data from Spark RDDs to the Spark executors (TensorFlow core) •  Scalable and communication intensive – Parameter-Server-based approach – The parameter server is embedded inside one Spark executor and talks to other workers over gRPC or gRPC with RDMA – Out-of-band communication (see the sketch of the two ingestion modes below)
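The sketch below illustrates the two ingestion modes using the TensorFlowOnSpark TFCluster API. It is loosely based on the TensorFlowOnSpark 1.x examples; exact argument names and context attributes may differ between releases, so treat it as a sketch rather than a reference. The HDFS path is a placeholder.

```python
from pyspark import SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # ctx describes this executor's role ("ps" or "worker") and task index;
    # the TensorFlow graph and training loop would be built here.
    print("job_name=%s task_index=%d" % (ctx.job_name, ctx.task_index))

sc = SparkContext(appName="tfos-ingestion-sketch")
num_executors, num_ps = 4, 1

# Mode 1: executors read directly from HDFS with TensorFlow's own readers.
cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                        tensorboard=False,
                        input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()

# Mode 2: feed Spark RDD partitions into the TensorFlow workers.
# images = sc.textFile("hdfs:///data/mnist_csv")        # placeholder path
# cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
#                         tensorboard=False,
#                         input_mode=TFCluster.InputMode.SPARK)
# cluster.train(images, num_epochs=1)
# cluster.shutdown()

sc.stop()
```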
  21. Overview of Representative DLoBD Stacks – CNTKOnSpark/MMLSpark •  Integrates the Microsoft Cognitive Toolkit (CNTK) and OpenCV into Spark Machine Learning pipelines without data transfer overhead •  Data for the CNTK core (e.g., images or texts) can be read directly from HDFS by Spark Executors •  Scalable and communication intensive – Embedded inside one Spark executor and talks to other workers over MPI (RDMA, TCP) – Out-of-band communication
  22. Overview of Representative DLoBD Stacks – BigDL •  Users can write deep learning applications as Spark programs •  Users can load pre-trained Caffe or Torch models into Spark programs using BigDL •  Data is fed to the BigDL core by Spark Executors, which can directly load data from HDFS •  High performance – Supports Intel MKL – Supports both Xeon and Xeon Phi (e.g., KNL) •  Scalable and communication intensive – Spark block manager as parameter server – Organically designed and integrated with the Spark architecture – In-band communication •  RDMA communication can be achieved through our RDMA-Spark package! (see the sketch below)
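A hedged sketch of the "DL as a Spark program" model that BigDL provides, loosely following the BigDL 0.x Python API (module paths, constructor arguments, and engine-initialization details may differ across BigDL releases). The toy model and random training Samples are placeholders for a real dataset loaded from HDFS.

```python
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()   # initialize BigDL's MKL-backed engine on driver and executors

# Toy multi-layer perceptron; a real job would build VGG/LeNet and load
# MNIST or CIFAR-10 from HDFS instead of random arrays.
model = (Sequential()
         .add(Linear(784, 128)).add(ReLU())
         .add(Linear(128, 10)).add(LogSoftMax()))

train_rdd = sc.parallelize(range(1024)).map(
    lambda _: Sample.from_ndarray(np.random.rand(784), np.array([1.0])))

optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=128)
trained_model = optimizer.optimize()   # parameters sync via Spark block manager
sc.stop()
```

Because BigDL's parameter exchange goes through the Spark block manager (in-band communication), running the same script on the RDMA-Spark package is what enables the RDMA path described above, without changing the application code.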
  23. Selected Various Datasets and Models •  Datasets – MNIST: digit classification, 28 × 28 B&W, 10 classes, 60 K training images, 10 K testing images – CIFAR-10: object classification, 32 × 32 color, 10 classes, 50 K training images, 10 K testing images – ImageNet: object classification, 256 × 256 color, 1000 classes, 1.2 M training images, 100 K testing images •  Models (conv. / fully-connected layers, dataset, framework) – LeNet: 2 / 2, MNIST, CaffeOnSpark and TensorFlowOnSpark – SoftMax Regression: NA / NA, MNIST, TensorFlowOnSpark – CIFAR-10 Quick: 3 / 1, CIFAR-10, CaffeOnSpark, TensorFlowOnSpark, and MMLSpark – VGG-16: 13 / 3, CIFAR-10, BigDL – AlexNet: 5 / 3, ImageNet, CaffeOnSpark – GoogLeNet: 22 / 0, ImageNet, CaffeOnSpark – ResNet-50: 53 / 1, synthetic data, TensorFlow
  24. Performance Characterization for CPU-/GPU-based Deep Learning with CaffeOnSpark •  DL workloads can benefit from the high performance of the DLoBD stacks •  The network will become a bottleneck at some point if the sub-optimal IPoIB network protocol is used •  GPU/GPU+cuDNN can get the best performance; GPU + cuDNN degrades at large scale (e.g., 16 nodes) •  For some models, solutions with CPU + MKL may outperform GPU-based solutions (Charts: end-to-end time (secs) vs. number of nodes (one device/node) for CIFAR-10 Quick and LeNet on MNIST, comparing CPU+OpenBLAS, CPU+MKL, GPU, and GPU+cuDNN on 1–16 nodes.)
  25. Performance Characterization for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR) •  CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant •  Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet; for MNIST tests, RDMA does not show obvious benefits (Charts: end-to-end time (secs) with IPoIB vs. RDMA for CaffeOnSpark (CIFAR-10 Quick and LeNet on MNIST, GPU and GPU+cuDNN, 2–16 nodes) and TensorFlowOnSpark (CIFAR-10 with 2 GPUs/node and MNIST with 1 GPU/node, 2–8 GPUs).)
  26. Performance Characterization with MMLSpark •  The GPU + cuDNN solution performs best: up to 55x faster than CPU + OpenBLAS, and up to 15x faster than CPU + MKL •  OpenMPI-based communication over IPoIB and RDMA shows similar performance; the latency and bandwidth of IPoIB in this cluster are sufficient for small models •  Could not find other benchmarks with bigger models for MMLSpark (Charts: end-to-end time (secs) for CIFAR-10 vs. number of nodes (one device/node) for CPU+OpenBLAS, CPU+MKL, and GPU+cuDNN, and IPoIB vs. RDMA on 2–16 nodes.)
  27. Characterization on Performance and Accuracy •  Performance evaluation of CaffeOnSpark (training time to achieve 70% accuracy) – RDMA reduces the overall time cost by 22% in training AlexNet on ImageNet – RDMA reduces the overall time cost by 15% in training GoogleNet on ImageNet •  Performance evaluation of BigDL (training time to achieve 70% accuracy) – RDMA reduces the overall time cost by 48% in training VGG on CIFAR-10 (Charts: accuracy (%) vs. time (secs) for IPoIB and RDMA – AlexNet on ImageNet with CaffeOnSpark, GoogleNet on ImageNet with CaffeOnSpark, and VGG on CIFAR-10 with BigDL – with the 70% accuracy point marked.)
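A small helper for reproducing the "time to reach 70% accuracy" comparison above, assuming each run writes an "elapsed_seconds,accuracy" line per evaluation to a CSV file; the file names are hypothetical.

```python
import csv

def secs_to_target(path, target=0.70):
    """Return the first elapsed time at which accuracy reaches the target."""
    with open(path) as f:
        for elapsed, accuracy in csv.reader(f):
            if float(accuracy) >= target:
                return float(elapsed)
    return None

ipoib = secs_to_target("alexnet_imagenet_ipoib.csv")   # hypothetical log file
rdma = secs_to_target("alexnet_imagenet_rdma.csv")     # hypothetical log file
if ipoib is not None and rdma is not None:
    print("RDMA reduces time-to-70%% accuracy by %.0f%%"
          % (100.0 * (ipoib - rdma) / ipoib))
```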
  28. Memory and Network Utilization of CaffeOnSpark •  CIFAR-10 Quick model and CIFAR-10 dataset •  GPU-based solutions use less memory than CPU-based ones as they mostly use GPU memory •  The CPU + MKL solution uses host memory more efficiently and has better performance than CPU + OpenBLAS •  RDMA utilizes the network resources more efficiently than IPoIB in CaffeOnSpark •  CaffeOnSpark still does not fully utilize the high-throughput characteristic of RDMA and the memory resource (Charts: host memory usage (GB) over time for CPU+OpenBLAS, CPU+MKL, GPU, and GPU+cuDNN; network throughput (MB/s) over time for IPoIB and RDMA.)
  29. Performance Overhead across Layers in DLoBD Stacks •  SoftMax Regression model over the MNIST dataset •  Up to 15.5% of the time is spent in the Apache Hadoop YARN scheduler layer •  Up to 18.1% of the execution time is spent in the Spark job execution layer •  The data size is small, so we do not count the time spent on accessing the HDFS layer •  More effort is needed to reduce the overhead across different layers of DLoBD stacks •  The overhead may be amortized in long-running deep learning jobs (Chart: time (seconds) broken into Actual Training, YARN-Overhead, Spark-Overhead, and Other for TensorFlowOnSpark and native TensorFlow over IPoIB and RDMA.)
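The layer-wise breakdown above can be derived from a few timestamps taken at job submission, Spark job start, training start, and training end. The values below are made-up placeholders, not measurements from the talk; the point is only how the YARN/Spark/training shares are computed.

```python
# Placeholder timestamps (seconds since job submission), not measured data.
timestamps = {
    "submit": 0.0,
    "spark_job_start": 8.0,    # YARN scheduling and container launch finish
    "training_start": 15.0,    # Spark task setup and data feeding finish
    "training_end": 60.0,
}

total = timestamps["training_end"] - timestamps["submit"]
breakdown = [
    ("YARN-Overhead", timestamps["spark_job_start"] - timestamps["submit"]),
    ("Spark-Overhead", timestamps["training_start"] - timestamps["spark_job_start"]),
    ("Actual Training", timestamps["training_end"] - timestamps["training_start"]),
]
for name, secs in breakdown:
    print("%-16s %6.1f s  (%4.1f%%)" % (name, secs, 100.0 * secs / total))
```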
  30. Insights and Guidance •  RDMA can benefit DL workloads – Up to 2.7x speedup with RDMA compared to the IPoIB scheme for deep learning workloads – RDMA can scale better and utilize resources more efficiently than IPoIB over InfiniBand clusters •  GPU-based DL designs can outperform CPU-based designs, but not always – For LeNet on MNIST, CPU + MKL achieved better performance than GPU and GPU + cuDNN on 8/16 nodes •  Large room for further improvement in DLoBD stacks!!! •  We need more benchmarks, public datasets, and analysis tools!!! •  X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017. •  X. Lu, H. Shi, R. Biswas, M. H. Javed, and D. K. Panda, DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters, IEEE Transactions on Multi-Scale Computing Systems (TMSCS), 2018.
  31. Outline •  Accelerating Big Data Stacks •  Benchmarking and Characterizing DLoBD Stacks – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL •  Accelerating DLoBD Stacks – BigDL on RDMA-Spark – TensorFlow
  32. Epoch-Level Evaluation with BigDL on SDSC Comet •  Epoch-level evaluation of training the VGG model using BigDL on default Spark with IPoIB and on our RDMA-based Spark •  The RDMA version consistently takes less time than the IPoIB version to finish every epoch – RDMA finishes epoch 18 about 2.6x faster than IPoIB (Chart: accumulative epoch time (secs) and accuracy (%) vs. epoch number (1–18) for IPoIB and RDMA, VGG on CIFAR-10 with BigDL.)
  33. Scalability Evaluation with BigDL on SDSC Comet •  Using BigDL with IPoIB and RDMA Spark •  For the VGG model trained with BigDL, RDMA-based Spark scales better than default IPoIB Spark •  For 384 CPU cores, 18 epochs, and the same batch size, RDMA takes about 870 seconds while IPoIB takes 2,372 seconds •  A speedup of 2.7x using RDMA for the epoch-level training time (see the arithmetic check below) (Chart: accumulative epochs time (secs) vs. number of CPU cores (96, 192, 384) for IPoIB and RDMA.)
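The 2.7x figure is simply the ratio of the accumulated epoch times reported on this slide; a one-line check:

```python
ipoib_secs, rdma_secs = 2372.0, 870.0     # 384 cores, 18 epochs, same batch size
print("epoch-level speedup = %.1fx" % (ipoib_secs / rdma_secs))   # ~2.7x
```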
  34. Outline •  Accelerating Big Data Stacks •  Benchmarking and Characterizing DLoBD Stacks – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL •  Accelerating DLoBD Stacks – BigDL on RDMA-Spark – TensorFlow
  35. Overview of gRPC with TensorFlow •  Worker services communicate among each other using gRPC, or gRPC+X! (Figure: a client talks to the master; worker processes for /job:PS/task:0 and /job:Worker/task:0 each run a gRPC server/client and own CPU and GPU devices.)
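A minimal sketch of how TensorFlow 1.x processes expose the gRPC worker services shown above, and where a "gRPC+X" transport would be selected. The host names are placeholders, and "grpc+verbs" is only available in TensorFlow builds that include the verbs/RDMA transport; the RDMA-gRPC design evaluated on the following slides (labeled "OSU RDMA gRPC" / AR-gRPC) is a separate design from OSU, not this stock option.

```python
import tensorflow as tf   # TensorFlow 1.x distributed API

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],                     # placeholder hosts
    "worker": ["node1:2222", "node2:2222"],
})

# Each process starts one gRPC server for its (job, task) slot; on RDMA-enabled
# builds the transport can be switched, e.g., protocol="grpc+verbs".
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc")

# Variables placed on the parameter server are read/updated by workers over
# the gRPC (or gRPC+X) channel.
with tf.device("/job:ps/task:0"):
    w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(w))
```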
  36. Performance Benefits for RDMA-gRPC with Micro-Benchmark •  gRPC-RDMA latency on SDSC-Comet-FDR – Up to 2.7x performance speedup over IPoIB in latency for small messages – Up to 2.8x performance speedup over IPoIB in latency for medium messages – Up to 2.5x performance speedup over IPoIB in latency for large messages (Charts: RPC latency (us) vs. payload size for Default gRPC and OSU RDMA gRPC across small (2 B–8 KB), medium (16 KB–512 KB), and large (1 MB–8 MB) payloads.) •  R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review.
  37. Performance Benefit for TensorFlow – Resnet50 •  TensorFlow Resnet50 performance evaluation on an IB EDR cluster – Up to 35% performance speedup over IPoIB for 4 nodes – Up to 41% performance speedup over IPoIB for 8 nodes (Charts: images/second vs. batch size per GPU (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA on 4 and 8 nodes.)
  38. Performance Benefit for TensorFlow – Inception3 •  TensorFlow Inception3 performance evaluation on an IB EDR cluster – Up to 27% performance speedup over IPoIB for 4 nodes – Up to 36% performance speedup over IPoIB for 8 nodes (Charts: images/second vs. batch size per GPU (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA on 4 and 8 nodes.)
  39. Concluding Remarks •  Discussed challenges in benchmarking, characterizing, and accelerating Deep Learning over Big Data (DLoBD) stacks •  RDMA can benefit DL workloads, as shown by our RDMA-Spark, AR-gRPC, and other RDMA designs •  Many other open issues need to be solved •  This work will enable the Big Data and Deep Learning communities to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
  40. The 4th International Workshop on High-Performance Big Data Computing (HPBDC) •  HPBDC 2018 was held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada; workshop date: May 21st, 2018 – Keynote talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment – Six regular research papers and two short research papers – Panel topic: Which Framework is the Best for High-Performance Deep Learning: Big Data Framework or HPC Framework? – http://web.cse.ohio-state.edu/~luxi/hpbdc2018 •  HPBDC 2017 was held in conjunction with IPDPS’17: http://web.cse.ohio-state.edu/~luxi/hpbdc2017 •  HPBDC 2016 was held in conjunction with IPDPS’16: http://web.cse.ohio-state.edu/~luxi/hpbdc2016
  41. Thank You! •  Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/ •  The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/ •  {panda, luxi}@cse.ohio-state.edu •  http://www.cse.ohio-state.edu/~panda •  http://www.cse.ohio-state.edu/~luxi
