Expanding HPCC Systems Deep Neural Network Capabilities


The training process for modern deep neural networks requires big data and large computational power. HPCC Systems excels at both, but it is currently limited to the CPU. GPU acceleration has been shown to greatly reduce Deep Learning training time. In this talk, Robert will explain how HPCC Systems gained its first GPU-accelerated library while also greatly expanding its deep neural network capabilities.

Published in: Data & Analytics

  1. 1. Expanding HPCC Systems Deep Neural Network Capabilities • 2019 HPCC Systems® Community Day: Challenge Yourself – Challenge the Status Quo • Robert Kennedy, PhD Candidate at Florida Atlantic University • Taghi M. Khoshgoftaar, PhD | Advisor • Timothy Humphrey | LexisNexis Mentor
  2. 2. Overview • Both topics covered here are a result of my summer internship • Work is available on GitHub • Tool for creating “standard” HPCC Systems Platform virtual machines • Hyper-V, AWS, Azure, VirtualBox, etc. • In addition, used for creating NVIDIA GPU-enabled VMs (AWS AMI) • Started a GPU-enabled Deep Learning bundle • Demonstrating GPU-accelerated Deep Learning on HPCC Systems • GPU Accelerated HPCC Systems | Robert Kennedy 2
  3. 3. HPCC Systems on Hyper-V • Used to generate machine images • To create a Hyper-V image: • • Hyper-V VMs can be used similarly to the VirtualBox VMs you might already be using • Hyper-V images build locally, on a Hyper-V enabled machine • The list of installed programs can be easily modified in a .json file • HPCC Systems Platform running on Hyper-V allows for Docker Desktop (Windows) use • Docker Desktop uses Hyper-V, and Hyper-V and VirtualBox can’t run concurrently
  4. 4. Config File • Uses a .json file as its config • Defines the network (e.g. for VirtualBox) • Defines the size of the machine (for cloud providers) • Defines which software is installed, via standard Linux commands
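As a rough illustration of the kind of .json config described on this slide, the sketch below parses a small config and checks the fields it relies on. Every key name (`platform`, `network`, `machine_size`, `install_commands`) and the helper `load_config` are hypothetical; the slides do not show the tool's actual schema.

```python
import json

# Hypothetical config: every key name below is invented for illustration
# and is not the schema of the tool described in the talk.
CONFIG_TEXT = """
{
  "platform": "virtualbox",
  "network": {"mode": "host-only"},
  "machine_size": null,
  "install_commands": [
    "sudo apt-get update",
    "sudo apt-get install -y python3-pip"
  ]
}
"""


def load_config(text):
    """Parse the JSON text and check the fields this sketch relies on."""
    config = json.loads(text)
    for key in ("platform", "install_commands"):
        if key not in config:
            raise KeyError(f"missing required key: {key}")
    if not all(isinstance(cmd, str) for cmd in config["install_commands"]):
        raise TypeError("install_commands must be a list of strings")
    return config


config = load_config(CONFIG_TEXT)
```

Keeping the install steps as plain shell command strings mirrors the slide's point that software is installed via standard Linux commands, so changing the image contents would only mean editing the list.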
  5. 5. GPU Enabled Virtual Machines • Using the same tool, GPU-enabled VMs can be created • Cloud images build in the cloud, local images build locally • This work supports the use of Python 3.6, CUDA 10.0, TensorFlow 1.14, and PyTorch 1.1 • AWS GPU instances: • K80s, V100s • Azure GPU instances: • K80s [12 GB VRAM] • V100s [16 GB VRAM] (with and without NVLink) • P100s [16 GB VRAM]
  6. 6. Bundle Implementation
  7. 7. HPCC Systems and GPU Accelerated Deep Learning • HPCC Systems is currently CPU only, and so are its DL runtimes • My previous work was on distributed DL on HPCC Systems using only CPUs • Traditional HPCC Systems clusters use commodity computers connected via standard network protocols • For Deep Learning, this presents a large communication bottleneck, partly due to the iterative nature of training • Graphics Processing Units (GPUs) are used to decrease the computation time for neural networks • Single or multiple GPUs are connected to the CPU (central node) via much faster hardware connections • A new bundle was started to enable GPU-accelerated Deep Learning on the HPCC Systems Platform
  8. 8. GPU Accelerated Deep Learning • With this bundle, you can train NN models on the GPU • Sprayed data is used as training data • The bundle is in its infancy, but you can build, train, and use neural networks • Using only ECL • Using ECL and Python, which allows for more customized NN architectures and training routines • A trained model (either in ECL or ECL+Python) can be used to predict on sprayed data • It returns its predictions as records in a one-hot-encoded format
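To make the one-hot-encoded prediction format concrete, here is a minimal Python sketch. It is not the bundle's code: the helper names `one_hot` and `decode` are hypothetical, and the score vector is a made-up example rather than real model output.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def decode(scores):
    """Return the index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


encoded = one_hot(3)

# Made-up per-class scores for one image; index 3 has the largest value.
scores = [0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
predicted = decode(scores)
```

Decoding with an arg-max works both for exact one-hot vectors and for soft per-class scores, which is why the same helper can be applied to labels and to predictions.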
  9. 9. Bundle Implementation Overview • Current work uses only one Thor node • A single Thor node can still use multiple GPUs • ECL/HPCC Systems handles the data storage and execution of the NN runtimes • The implementation uses data parallelism across one or more GPUs • Currently limited to a single physical computer • The pyembed plugin allows Python to run on the HPCC Systems Platform • We use Python 3, as Python 2 is nearing EOL • Python code handles the NN training and interfaces with the GPUs directly using NVIDIA’s CUDA
  10. 10. TensorFlow | Keras • The Python code is written with TensorFlow • TensorFlow • Google’s popular Deep Learning library • Keras • Deep Learning library API that uses TensorFlow or another “backend” • Much less code to produce the same model
  11. 11. Artificial Neural Networks
  12. 12. Biological Neuron • Basis for artificial neural networks • Such as the ones in deep learning • Dendrites • Input vector, from previous neurons • Weights • Soma • Summation function • Axon • Activation function • A neuron “fires” when there is enough of an input stimulus • [Figure: biological neuron with labels Dendrite, Soma, Axon]
  13. 13. Artificial Neuron • First conceptualized in 1943 • The inputs of the neuron are the outputs of the previous layer’s neurons • The weighted inputs are summed with a bias • The sum is then passed into an activation function • Activation functions are like the biological neuron “deciding” to fire • ReLU activation outputs x if x > 0 and 0 otherwise, where x is the input
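The neuron described on this slide can be sketched in a few lines of plain Python. The function names and the example inputs, weights, and biases are illustrative choices, not values from the talk.

```python
def relu(x):
    """ReLU activation: pass positive values through, clamp the rest to 0."""
    return x if x > 0 else 0.0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)


# z = 0.5*1.0 + 0.25*2.0 - 0.5 = 0.5, so the neuron "fires".
fired = neuron([1.0, 2.0], [0.5, 0.25], bias=-0.5)

# z = 0.5*1.0 + 0.25*2.0 - 2.0 = -1.0, so ReLU outputs 0.
silent = neuron([1.0, 2.0], [0.5, 0.25], bias=-2.0)
```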
  14. 14. A Fully Connected Network • Fully Connected Network • Each neuron is connected to every neuron in the subsequent layer • Neural Network Visualization • 2 hidden layers, fully connected, 3 class classification output • Multi-Layer Perceptron is an example
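A forward pass through a small fully connected network with two hidden layers and a 3-class output might look like the sketch below. The layer sizes (4, 5, 5, 3), the random weight initialization, and the softmax on the output are assumptions made for illustration; the slide only specifies two hidden layers and three output classes.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights[j] holds neuron j's incoming weights."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [4, 5, 5, 3]  # assumed: 4 inputs, two hidden layers of 5, 3 classes
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(x):
    """Run the input through every layer; ReLU on hidden layers only."""
    for i, (weights, biases) in enumerate(layers):
        is_output = i == len(layers) - 1
        x = dense(x, weights, biases, identity if is_output else relu)
    return softmax(x)


probs = forward([0.2, -0.1, 0.7, 0.4])
```

Because every neuron reads every output of the previous layer, each layer is just a list of weight rows, one row per neuron.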
  15. 15. Neural Network Training • Forward propagation • Backpropagation • Optimize the model with respect to a loss function • Quantification of how “right or wrong” the model is for any given datum • Gradient Descent • Stochastic Gradient Descent (SGD) • Mini-batch SGD • Right: visualization of gradient descent over an example loss function • [Figure: Gradient Descent in Action]
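Mini-batch SGD can be illustrated without a neural network at all. The sketch below fits a line y = w*x + b to synthetic, noise-free data using a mean-squared-error loss; the model, data, learning rate, and batch size are all assumptions chosen to keep the example small, and a real network would replace the two hand-written gradient lines with backpropagation.

```python
import random


def sgd_fit(xs, ys, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass: prediction error for each example in the batch.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Backward pass: gradient of the batch's mean squared error.
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # Gradient descent step.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]  # synthetic targets from the line y = 2x + 1
w, b = sgd_fit(xs, ys)
```

With `batch_size=1` this becomes plain stochastic gradient descent, and with `batch_size=len(xs)` it becomes full-batch gradient descent.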
  16. 16. Where Exactly Do the GPUs Come Into Play? • Training an NN model is the most time-consuming part; this is where the GPU is used to dramatically reduce computation time • Two main training steps • Forward pass – weights and errors • Backward pass – gradients and weight updates • Computationally expensive convolutions are offloaded onto GPUs • These steps are done for each data point, multiple times
  17. 17. Parallel Paradigms • Data Parallelism • Model Parallelism • Synchronous and Asynchronous • Parallel SGD • [Figures: Data Parallelism, Model Parallelism]
  18. 18. Model Parallelism • Neural Network Model is split across nodes • For models larger than a GPU’s memory • Requires significantly higher communication bandwidths between nodes • Not well suited for a cluster system • However, this paradigm is feasible for a multi-GPU system due to faster hardware speeds
  19. 19. Data Parallelism • Data is partitioned and distributed to nodes • A single NN model is replicated onto each node • Only weight updates are communicated and aggregated • As defined by the specific parallel training method • Suitable for parallelizing across multiple nodes in an HPCC Systems cluster or across GPUs in a single system • This is the paradigm that is used
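The data-parallel idea can be simulated in plain Python, with no GPUs involved. In this sketch each "worker" stands in for a GPU holding a replica of a tiny linear model, and a parameter-server step averages the workers' gradients synchronously. The linear model, the round-robin partitioning, and the function names are assumptions for illustration, not the bundle's implementation.

```python
def gradient(w, b, shard):
    """Mean-squared-error gradient of y = w*x + b over one shard of (x, y) pairs."""
    n = len(shard)
    grad_w = sum(2 * ((w * x + b) - y) * x for x, y in shard) / n
    grad_b = sum(2 * ((w * x + b) - y) for x, y in shard) / n
    return grad_w, grad_b


def data_parallel_step(w, b, data, num_workers, lr):
    """One synchronous update: shard the data, average gradients, update once."""
    # Partition the batch: worker k gets every num_workers-th example.
    shards = [data[k::num_workers] for k in range(num_workers)]
    # Each worker uses the same replica of (w, b) on its own shard.
    local = [gradient(w, b, shard) for shard in shards]
    # Parameter server: average the gradients, weighted by shard size.
    total = len(data)
    grad_w = sum(g[0] * len(s) for g, s in zip(local, shards)) / total
    grad_b = sum(g[1] * len(s) for g, s in zip(local, shards)) / total
    return w - lr * grad_w, b - lr * grad_b


data = [(x / 8, 3 * (x / 8) - 1) for x in range(16)]
single = data_parallel_step(0.0, 0.0, data, num_workers=1, lr=0.1)
multi = data_parallel_step(0.0, 0.0, data, num_workers=4, lr=0.1)
```

The one-worker and four-worker updates agree up to floating-point rounding, which is the property that lets synchronous data parallelism speed up training without changing the result of each step.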
  20. 20. Not Your Average HPCC Systems • Slightly different from traditional HPCC Systems topologies • The whole figure represents a single physical computer and Thor node • Parameter Server • This is the CPU on the system • Nodes (blue) • Each node represents a single physical GPU • Connections are high-speed hardware • PCI Express Gen 3 provides up to 985 MB/s per lane, about 15.75 GB/s for a 16-lane link • NVLink is roughly 10x faster than PCIe Gen 3
  21. 21. Workflow Example
  22. 22. Bundle Usage Example • We will create a Convolutional Neural Network (CNN) and train it on the MNIST dataset • MNIST is a 10-class image classification dataset of handwritten digits 0-9 • The CNN takes 784 pixels as input (each with range 0-255) • Two convolutional layers • One fully connected layer with 128 neurons • 10 output neurons (one for each class) • Total of 1,199,882 trainable parameters • Processing through 720,000 MNIST images • [Figure: Architecture]
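The figure of 1,199,882 trainable parameters can be reproduced with a short calculation. The slide gives only the layer types and the sizes of the last two layers, so the rest is assumed here: 3x3 kernels with 32 and 64 filters, no padding, and a 2x2 max-pooling step before the fully connected layer, as in the widely used Keras MNIST CNN example. Under those assumptions the total matches the slide exactly.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h*kernel_w*in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units


# Assumed layer settings (not stated on the slide).
conv1 = conv2d_params(3, 3, 1, 32)    # 28x28x1  -> 26x26x32
conv2 = conv2d_params(3, 3, 32, 64)   # 26x26x32 -> 24x24x64
flattened = 12 * 12 * 64              # after assumed 2x2 max pooling: 12x12x64
fc = dense_params(flattened, 128)     # fully connected layer, 128 neurons
out = dense_params(128, 10)           # 10 output neurons, one per class

total = conv1 + conv2 + fc + out
```

Almost all of the parameters sit in the fully connected layer, because it connects 9,216 flattened features to 128 neurons.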
  23. 23. Spray MNIST Dataset • MNIST is included in the bundle • Train and test sets, sprayed with a fixed record length of 785 • Train: 60,000 28x28 grayscale images • Test: 10,000 28x28 grayscale images • Both are labeled as one of 10 classes, 0-9
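A fixed record length of 785 is consistent with one label byte plus 28 x 28 = 784 pixel bytes per image. The sketch below splits such a record into a label and a 28x28 grid. The byte layout is an assumption: the slides do not say whether the label comes first or last, so the hypothetical helper `split_record` takes that as a parameter, and the record used here is synthetic.

```python
RECORD_LEN = 785  # fixed record length from the slide
SIDE = 28         # MNIST images are 28x28 pixels


def split_record(record, label_first=True):
    """Split one 785-byte record into (label, 28x28 list of pixel rows)."""
    if len(record) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} bytes, got {len(record)}")
    if label_first:  # assumed layout: label byte, then pixels
        label, pixels = record[0], record[1:]
    else:            # alternative layout: pixels, then label byte
        label, pixels = record[-1], record[:-1]
    image = [list(pixels[r * SIDE:(r + 1) * SIDE]) for r in range(SIDE)]
    return label, image


# Synthetic record: label 7 followed by 784 made-up pixel values.
record = bytes([7]) + bytes(i % 256 for i in range(784))
label, image = split_record(record)
```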
  24. 24. Image Visualization • Imported raw MNIST data • Visualization of a single MNIST image in the “data” format • Each pixel has a value between 0 and 255, represented as a 2-digit hex number • Each pixel is a feature
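The hex view of an image can be approximated in Python as follows. This is only a sketch of the idea that each 0-255 pixel prints as two hex digits; the helper name `to_hex_rows` and the synthetic pixel values are assumptions, and the output is not meant to match the exact layout of the slide's screenshot.

```python
def to_hex_rows(pixels, width=28):
    """Format pixel bytes as rows of space-separated 2-digit hex values."""
    return [
        " ".join(f"{p:02X}" for p in pixels[i:i + width])
        for i in range(0, len(pixels), width)
    ]


# Synthetic 784-pixel image built from a repeating 4-value pattern.
pixels = bytes([0, 15, 128, 255] * 196)
rows = to_hex_rows(pixels)
first_four = rows[0].split()[:4]
```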
  25. 25. Preparing the Data • Currently, the bundle demonstrates how to train on image data • Includes an example NN and example datasets (MNIST and Fashion MNIST) • Training data and labels are shaped into NumPy arrays with a specified shape before training • Here, shape is the dimensions of the image • i.e. the dimensions of the input features • These get flattened to an array of 784 inputs for 784 input neurons
  26. 26. Creating a CNN – model.add() method • First, we define the optimizer and its parameters • Next, we define the training scheme • Batch size = 128 • We’ll train for 20 epochs
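The training scheme translates into a concrete number of weight updates. The short calculation below combines the batch size and epoch count from this slide with the 60,000 MNIST training images mentioned earlier; it assumes the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

train_images = 60_000  # MNIST training set size
batch_size = 128
epochs = 20

# Batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(train_images / batch_size)

# Size of that final partial batch.
last_batch = train_images - (steps_per_epoch - 1) * batch_size

# Total number of mini-batch weight updates over the whole run.
total_updates = steps_per_epoch * epochs
```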
  27. 27. Creating a CNN – model.add() method • Next, we define the NN architecture • Input shape: 28x28x1 grayscale images • Initialize the model • The “nnOutputLayer” is the final layer and, at this point, represents the entire NN model
  28. 28. Train the CNN – model.add() method • “nnOutputLayer” is passed into model.train() along with hyperparameters and training data • [Figure panels: GPU, CPU]
  29. 29. Create CNN – ECL and Python
  30. 30. Example Input and Output • [Figure panels: Image Input, One-Hot-Encoded Output]
  31. 31. Performance
  32. 32. Performance Evaluation • A case study was performed to measure the performance improvements • 5 identical Convolutional Neural Networks are trained on the MNIST dataset • Each is trained 10 times to provide statistical significance • Measuring the required training time for the same model on the same data using fixed training parameters • Faster training time is desired • Hardware configurations: CPU alone, and 1, 2, 3, and 4 GPUs • Older K80s are used • Newer GPUs are expected to further increase performance and efficiency • Compared against each other and against the “optimal” speedup • i.e. linear speedup
  33. 33. Performance Boost: GPU vs. CPU • Time, in seconds, to train a CNN on the MNIST dataset • Training is 5.4x faster on a K80 GPU than on a Xeon CPU • The speedup is large, even for a simple model on small and simple data • The measured time is the NN training time, not necessarily any HPCC-specific computations, which would be the same on CPU or GPU
  34. 34. Performance Boost: CPU vs. GPU vs. Optimal Speedup • Optimal speedup is linear • i.e. twice the nodes is twice as fast • Speedup is not expected to be linear due to communication overheads • Results show that adding GPUs incurs minimal overhead
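Speedup and its distance from the linear ideal can be computed as below. The timings are hypothetical placeholders chosen only to exercise the formulas; they are NOT the measurements from the talk, whose chart values are not preserved in this transcript.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline."""
    return baseline_seconds / seconds


def efficiency(speedup_value, num_gpus):
    """Fraction of the ideal linear speedup achieved (1.0 means perfectly linear)."""
    return speedup_value / num_gpus


# Hypothetical training times in seconds, keyed by GPU count.
# Placeholder values for illustration, not results from the talk.
timings = {1: 100.0, 2: 55.0, 4: 30.0}

report = {}
for gpus, seconds in timings.items():
    s = speedup(timings[1], seconds)
    report[gpus] = (s, efficiency(s, gpus))
```

An efficiency close to 1.0 means communication overhead is small; values drop below 1.0 as the cost of aggregating weight updates across GPUs grows.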
  35. 35. Conclusion • Tool used to create HPCC Systems virtual images on various new platforms • A good use case is creating GPU-enabled images • Brief overview of neural networks and their optimization • Demonstrated that GPU-accelerated deep learning is possible on the HPCC Systems Platform • Demonstrated that GPUs provide a significant performance increase, even on a non-traditional cluster
  36. 36. Future Work • Implementing generalizable data loaders • To allow training on data with less knowledge of NumPy (Python) • Continue adding to the supported methods and ECL modeling functions • Research and development on integrating model parallelism • Research on NN training on multi-node clusters where each node can have one or more GPUs
  37. 37. Links • GitHub • NVIDIA CUDA • TensorFlow • Keras • NumPy
  38. 38. Questions? • Robert Kennedy, PhD Candidate, Florida Atlantic University
  39. 39. View this presentation on YouTube: 8MJMUpp8IKH5-d56az56t52YccleX5h&index=8&t=0s (4:02)