- 1. Tutorial: Deep Learning Implementations and Frameworks Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+ *Preferred Networks, Inc. (PFN) {tokui,oono}@preferred.jp +National Institute of Advanced Industrial Science and Technology (AIST) atsu-kan@aist.go.jp, mail@kamishima.net
- 2. Overview of this tutorial •1st session (KO, 8:30 ‒ 10:00) •Introduction •Basics of neural networks •Common design of neural network implementations •2nd session (ST, 10:30 ‒ 12:30) •Differences of deep learning frameworks •Coding examples of frameworks •Conclusion
- 3. Common Design of Deep Learning Frameworks Kenta Oono <oono@preferred.jp> Preferred Networks Inc.
- 4. Objective of this part • How deep learning frameworks represent various neural networks • How deep learning frameworks realize the training procedure of neural networks • The technology stack that is common to most deep learning frameworks
- 5. Steps for training neural networks: Prepare the training dataset → Initialize the Neural Network (NN) parameters → Repeat until meeting some criterion: prepare the next (mini) batch → define how to compute the loss of this batch → compute the loss (forward prop) → compute the gradient (backprop) → update the NN parameters → Save the NN parameters
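As a concrete illustration of these steps, here is a minimal sketch in plain NumPy for a toy linear model; the synthetic dataset, model, and hyperparameters are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the training loop above for a toy linear model
# (illustrative only: real frameworks automate most of these steps).
import numpy as np

rng = np.random.RandomState(0)

# Prepare the training dataset (synthetic: y = 3x + 1 + noise)
X = rng.randn(1000, 1)
Y = 3.0 * X + 1.0 + 0.01 * rng.randn(1000, 1)

# Initialize the NN parameters
W = 0.01 * rng.randn(1, 1)
b = np.zeros(1)
lr, batch_size = 0.1, 32

# Repeat until meeting some criterion (here: a fixed number of epochs)
for epoch in range(10):
    perm = rng.permutation(len(X))
    for i in range(0, len(X), batch_size):
        # Prepare the next mini-batch
        idx = perm[i:i + batch_size]
        x, t = X[idx], Y[idx]
        # Compute the loss (forward prop)
        y = x.dot(W) + b
        loss = ((y - t) ** 2).mean()
        # Compute the gradient (backprop, written by hand for this tiny model)
        gy = 2.0 * (y - t) / len(x)
        gW = x.T.dot(gy)
        gb = gy.sum(axis=0)
        # Update the NN parameters
        W -= lr * gW
        b -= lr * gb

# Save the NN parameters
np.savez('model.npz', W=W, b=b)
```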
- 6. Technology stack of a DL framework (name | functions | examples):
  Graphical interface | | DIGITS, TensorBoard
  Machine learning workflow management | dataset management, training loop | Keras, Lasagne, Blocks, TF Learn
  Computational graph management | build computational graph, forward prop / backprop | Theano, TensorFlow, Torch.nn
  Multi-dimensional array library | linear algebra | NumPy, CuPy, Eigen, torch (core)
  Numerical computation package | matrix operation, convolution | BLAS, cuBLAS, cuDNN
  Hardware | | CPU, GPU
- 7. Technology stack of DL framework (repeats the technology-stack table from slide 6)
- 8. Neural Network as a Computational Graph • In its simplest form, an NN is represented as a computational graph (CG): a stack of bipartite DAGs (Directed Acyclic Graphs) consisting of data nodes and operator nodes. Example: y = x1 * x2, z = y - x3 (figure: x1, x2 → mul → y; y, x3 → sub → z; circles are data nodes, boxes are operator nodes).
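A toy sketch of such a bipartite graph, assuming nothing beyond plain Python; the class and variable names are made up for illustration. Data nodes hold values and remember which operator produced them, operator nodes compute outputs from their input data nodes.

```python
# Illustrative-only data/operator nodes for the graph y = x1 * x2, z = y - x3.
class DataNode:
    def __init__(self, value, creator=None):
        self.value = value
        self.creator = creator      # the operator node that produced this value

class Mul:
    def __call__(self, a, b):
        self.inputs = (a, b)
        return DataNode(a.value * b.value, creator=self)

class Sub:
    def __call__(self, a, b):
        self.inputs = (a, b)
        return DataNode(a.value - b.value, creator=self)

x1, x2, x3 = DataNode(2.0), DataNode(3.0), DataNode(4.0)
y = Mul()(x1, x2)       # y = x1 * x2
z = Sub()(y, x3)        # z = y - x3
print(z.value)          # 2.0; z.creator / y.creator encode the graph edges
```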
- 9. Example: Multi-layer Perceptron (MLP): x → Affine (W1, b1) → h1 → ReLU → a1 → Affine (W2, b2) → h2 → ReLU → a2 → Softmax → y → Cross-Entropy Loss (with label t). Whether the CG includes the weights and biases as nodes is a choice of the implementation.
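A plain-NumPy sketch of the forward pass of this MLP; the layer sizes, random weights, and mini-batch below are illustrative assumptions.

```python
# Forward pass of the 2-layer MLP above (no framework assumed).
import numpy as np

def affine(x, W, b):
    return x.dot(W) + b

def relu(x):
    return np.maximum(x, 0)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)            # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
W1, b1 = 0.01 * rng.randn(784, 100), np.zeros(100)
W2, b2 = 0.01 * rng.randn(100, 10), np.zeros(10)

x = rng.randn(8, 784)                               # a mini-batch of 8 examples
t = rng.randint(0, 10, size=8)                      # integer class labels

h1 = affine(x, W1, b1)
a1 = relu(h1)
h2 = affine(a1, W2, b2)
a2 = relu(h2)
y = softmax(a2)
loss = -np.log(y[np.arange(len(t)), t]).mean()      # cross-entropy loss with t
print(loss)
```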
- 10. Example: Recurrent Neural Network (RNN): h0 and x1 → RNN Unit → h1; h1 and x2 → RNN Unit → h2; ... → hT. The RNN unit maps (xt, ht-1) to ht with shared parameters W, b, and can be: • Affine + activation function • LSTM (Long Short-Term Memory) • GRU (Gated Recurrent Unit)
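A minimal sketch of the simplest RNN unit named above (Affine + activation), unrolled over time in plain NumPy; all shapes and names are illustrative.

```python
import numpy as np

def rnn_unit(x_t, h_prev, W, b):
    # one step: Affine over [x_t, h_prev] followed by tanh
    return np.tanh(np.concatenate([x_t, h_prev], axis=1).dot(W) + b)

rng = np.random.RandomState(0)
T, in_dim, hid_dim = 5, 3, 4
W = 0.1 * rng.randn(in_dim + hid_dim, hid_dim)    # shared across all time steps
b = np.zeros(hid_dim)

h = np.zeros((1, hid_dim))                        # h0
xs = [rng.randn(1, in_dim) for _ in range(T)]     # x1 ... xT
for x_t in xs:
    h = rnn_unit(x_t, h, W, b)                    # the unrolled chain of the figure
print(h)                                          # hT
```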
- 11. Example: Stacked RNN: the hidden states h1 ... hT of the first RNN layer are fed to a second RNN layer producing z1 ... zT (from z0), and the output passes through an Affine layer and Softmax to produce y.
- 12. Example: RNN with control flow nodes • TensorFlow has control flow nodes (e.g. cond, switch, while). • When the CG contains a loop, some mechanism is necessary that resolves the dependencies between nodes and schedules the order of computation. (Figure: a loop built from loop-enter, predicate, switch, RNN-unit update, and loop-end nodes, branching on pred=True / pred=False, with shared W, b.)
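A hedged sketch of such control-flow nodes, assuming graph-mode TensorFlow (0.x/1.x-style API): tf.while_loop builds the loop into the graph, and the runtime schedules the nodes so that loop-carried dependencies are respected. The toy counter body below stands in for the RNN-unit update.

```python
# Illustrative loop expressed with TensorFlow's control-flow ops
# (graph-mode TF; tf.Session is not used in TF 2.x eager mode).
import tensorflow as tf

i0 = tf.constant(0)
s0 = tf.constant(0.0)

def pred(i, s):
    return i < 10                                 # the loop predicate node

def body(i, s):
    return i + 1, s + tf.cast(i, tf.float32)      # stands in for the RNN-unit update

i_final, s_final = tf.while_loop(pred, body, [i0, s0])
with tf.Session() as sess:
    print(sess.run(s_final))                      # 45.0 (= 0 + 1 + ... + 9)
```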
- 13. Automatic Differentiation • Computes the gradient of some specified data node (e.g. the loss) with respect to each data node. • Each operator node must have a backward operation that calculates the gradients w.r.t. its inputs from the gradients w.r.t. its outputs (a realization of the chain rule). • e.g. the Function class of Chainer has a backward method. • e.g. each layer class of Caffe has Backward_cpu and Backward_gpu methods. • e.g. Autograd has a thin wrapper that adds gradient methods as closures to most NumPy functions.
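A minimal sketch of this forward/backward pairing for a single multiplication operator, loosely modeled on the interface described above; the class and method names are illustrative rather than any framework's exact API.

```python
# One operator node with its backward rule (chain rule for multiplication).
class Mul:
    def forward(self, x1, x2):
        self.x1, self.x2 = x1, x2      # keep the inputs; backward needs them
        return x1 * x2

    def backward(self, gy):
        # gradient w.r.t. the output (gy) -> gradients w.r.t. each input
        return gy * self.x2, gy * self.x1

f = Mul()
y = f.forward(2.0, 3.0)                # forward: y = 6.0
gx1, gx2 = f.backward(1.0)             # backward: gx1 = 3.0, gx2 = 2.0
```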
- 14. Backprop through CG: starting from ∇z z = 1, the gradients ∇y z, ∇x1 z, ... are propagated backward through the same graph for y = x1 * x2, z = y - x3 (figure: the mul/sub graph from slide 8 annotated with gradients).
- 15. Backprop as extended graphs: the backward pass can be represented by extending the forward graph for y = x1 * x2, z = y - x3 with gradient data nodes dz, dy, dx1, dx2, dx3 and the operator nodes (id, neg, mul, mul) that compute them (figure: forward propagation subgraph plus the added backward propagation subgraph).
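Walking the extended graph by hand for the same example makes the picture concrete; this self-contained sketch applies each node's backward rule in reverse topological order.

```python
# y = x1 * x2, z = y - x3: forward pass followed by the backward extension.
x1, x2, x3 = 2.0, 3.0, 4.0

# forward propagation
y = x1 * x2
z = y - x3

# backward propagation (the added subgraph)
gz = 1.0                 # dz/dz = 1 (the "id" node)
gy = gz                  # d(y - x3)/dy  = 1
gx3 = -gz                # d(y - x3)/dx3 = -1 (the "neg" node)
gx1 = gy * x2            # d(x1 * x2)/dx1 = x2 (a "mul" node)
gx2 = gy * x1            # d(x1 * x2)/dx2 = x1 (a "mul" node)
print(gx1, gx2, gx3)     # 3.0 2.0 -1.0
```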
- 16. Example: Theano
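As a representative sketch of what such a Theano example looks like (variable names and values are illustrative): symbolic variables are declared, the graph for y = x1 * x2, z = y - x3 is built, gradients are derived automatically, and everything is compiled into a callable function.

```python
# Illustrative Theano version of the graph from slide 8.
import theano
import theano.tensor as T

x1, x2, x3 = T.dscalars('x1', 'x2', 'x3')    # symbolic data nodes
y = x1 * x2                                  # building the computational graph
z = y - x3

# Theano derives the backward graph automatically.
gx1, gx2, gx3 = T.grad(z, [x1, x2, x3])

f = theano.function([x1, x2, x3], [z, gx1, gx2, gx3])
print(f(2.0, 3.0, 4.0))   # z = 2.0, dz/dx1 = 3.0, dz/dx2 = 2.0, dz/dx3 = -1.0
```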
- 17. Technology stack of DL framework (repeats the technology-stack table from slide 6)
- 18. Numerical optimizer • Many gradient-based optimization algorithms are implemented. • Stochastic Gradient Descent (SGD) is implemented in most DL frameworks. • Which optimizer works best depends on the concrete task. Generic loop (w: parameters of the neural network, θ: state of the optimizer, L: loss function, Γ: optimizer-specific update function): initialize w, θ; until the criterion is met: get data (x, y), calculate ∇w L(x, y; w), update (w, θ) ← Γ(w, θ, ∇w L).
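A sketch instantiating the generic update rule Γ for plain SGD and for SGD with momentum, on a toy quadratic loss; the learning rate, momentum coefficient, and target values are illustrative.

```python
# Two concrete choices of the optimizer-specific function Γ(w, θ, ∇w L).
import numpy as np

def sgd(w, state, grad, lr=0.01):
    return w - lr * grad, state                  # no optimizer state needed

def momentum_sgd(w, state, grad, lr=0.01, mu=0.9):
    v = mu * state - lr * grad                   # θ holds the velocity
    return w + v, v

target = np.array([1.0, 2.0, 3.0])
w, theta = np.zeros(3), np.zeros(3)
for step in range(200):
    grad = 2.0 * (w - target)                    # ∇w L of the toy loss ||w - target||^2
    w, theta = momentum_sgd(w, theta, grad)
print(w)                                         # approaches [1. 2. 3.]
```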
- 19. Serialization • Save/load snapshots of the training process in a specified format (e.g. HDF5, npz, protobuf): • the model to be trained (= the architecture and parameters of the NN) • the state of the training procedure (e.g. epoch, learning rate, momentum). • Serialization enhances the portability of models: • publish pre-trained models (e.g. Model Zoo for Caffe, MXNet, TensorFlow) • import pre-trained models from other DL frameworks (e.g. Chainer supports the BVLC-official reference models of Caffe).
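A minimal sketch of such a snapshot using NumPy's .npz format; the file name, key names, and stored optimizer state are illustrative assumptions, not any framework's actual serialization scheme.

```python
# Save/load the model parameters together with the training state.
import numpy as np

def save_snapshot(path, params, epoch, lr):
    np.savez(path, epoch=epoch, lr=lr,
             **{'param_' + name: value for name, value in params.items()})

def load_snapshot(path):
    data = np.load(path)
    params = {k[len('param_'):]: data[k]
              for k in data.files if k.startswith('param_')}
    return params, int(data['epoch']), float(data['lr'])

params = {'W1': np.random.randn(784, 100), 'b1': np.zeros(100)}
save_snapshot('snapshot.npz', params, epoch=3, lr=0.01)
params, epoch, lr = load_snapshot('snapshot.npz')
```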
- 20. Computational optimizer • Converts CGs into simplified, more efficient forms (e.g. Theano). Example graph: y = x1 * x2, z = y - x3.
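One way to observe this in Theano is to compare the graph as written with the graph inside the compiled function; which rewrites actually fire depends on the Theano version and its optimization flags, so the redundant expression below is only an illustration.

```python
# Inspect the graph before and after Theano's graph optimizations.
import theano
import theano.tensor as T

x = T.vector('x')
y = ((x * 2) / 2) + 0            # a deliberately redundant expression
theano.printing.debugprint(y)    # the graph as written

f = theano.function([x], y)
theano.printing.debugprint(f)    # the (typically simplified) compiled graph
```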
- 21. Abstraction of ML workflow • Offers typical training/validation/evaluation procedures as APIs. • Users call a single API and do not have to write the procedure manually. • e.g. the fit and evaluate methods of the Model class in Keras. (Figure: the training-steps flowchart from slide 5.)
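A sketch of this single-call style in Keras; the random data, layer sizes, and optimizer choice are illustrative, and argument names differ slightly between Keras versions (e.g. nb_epoch vs. epochs).

```python
# fit() and evaluate() wrap the whole training/evaluation loops.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.randn(1000, 784)
Y = np.eye(10)[np.random.randint(0, 10, size=1000)]    # one-hot labels

model = Sequential()
model.add(Dense(100, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])

model.fit(X, Y, nb_epoch=5, batch_size=32)              # "epochs=5" in Keras >= 2
loss, acc = model.evaluate(X, Y, batch_size=32)
```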
- 22. Graphical interface • Computational graph management: editor, visualizer • Visualization of the training procedure: feature maps, outputs of NNs, etc.; curves of error and accuracy • Performance monitor: e.g. throughput, latency, memory usage
- 23. Technology stack of DL framework (repeats the technology-stack table from slide 6)
- 24. GPU support • CUDA: the computing platform for GPGPU on NVIDIA GPUs (language extension, compiler, libraries, etc.) • DL frameworks provide wrappers for CUDA: • GPU-array libraries that utilize cuBLAS, cuRAND, etc. • layer implementations with cuDNN (e.g. convolution, sigmoid, LSTM) • Designed so that switching between CPU and GPU is easy: • e.g. users can write CPU/GPU-agnostic code • e.g. switch CPU/GPU with environment variables • Some frameworks support OpenCL as a GPU environment, but CUDA is more popular for now.
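A sketch of CPU/GPU-agnostic code in the NumPy/CuPy style: because the two libraries share much of their API, the same function can run on either backend. The helper and its name are illustrative.

```python
# The xp argument is either the numpy or the cupy module.
import numpy as np

def logsumexp(xp, x):
    m = x.max()
    return m + xp.log(xp.exp(x - m).sum())

x_cpu = np.random.randn(1000).astype(np.float32)
print(logsumexp(np, x_cpu))                 # runs on the CPU

try:
    import cupy as cp                       # available only with a CUDA GPU
    x_gpu = cp.asarray(x_cpu)               # transfer the data to the GPU
    print(logsumexp(cp, x_gpu))             # the same code runs on the GPU
except ImportError:
    pass
```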
- 25. Multi-dimensional array library (CPU / GPU) • In charge of the concrete computation held by data nodes. • Heavily depends on BLAS (CPU) or CUDA / CUDA Toolkit libraries (GPU). • CPU: third-party libraries (Eigen::Tensor, NumPy) or written from scratch (ND4J for DL4J, mshadow for MXNet). • GPU: third-party libraries (Eigen::Tensor, PyCUDA, gpuarray) or written from scratch (ND4J for DL4J, mshadow for MXNet, CuPy for Chainer).
- 26. Which device to use? • The GPU is (by far) faster than the CPU in most cases: most tensor computation consists of element-wise operations, matrix multiplications, and convolutions. • Exceptional cases: • the mini-batch technique is difficult to apply (e.g. variable-length training data, or the NN architecture depends on the training example) • GPU computation cannot hide the cost of transferring data to the GPU (e.g. the mini-batch size is too small).
- 27. Technology stack of Chainer:
  Machine learning workflow management / computational graph management: Chainer
  Multi-dimensional array library: NumPy (CPU), CuPy (GPU)
  Numerical computation package: BLAS (CPU); cuBLAS, cuRAND, cuDNN (GPU)
  Hardware: CPU, GPU
- 28. Technology stack of TensorFlow:
  Graphical interface: TensorBoard
  Machine learning workflow management: TF Learn
  Computational graph management: TensorFlow
  Multi-dimensional array library: Eigen::Tensor
  Numerical computation package: BLAS (CPU); cuBLAS, cuRAND, cuDNN (GPU)
  Hardware: CPU, GPU
- 29. Technology stack of Theano:
  Computational graph management: Theano
  Multi-dimensional array library: NumPy (CPU), libgpuarray (GPU)
  Numerical computation package: BLAS (CPU); CUDA Toolkit (GPU)
  Hardware: CPU; GPU (CUDA, OpenCL)
- 30. Technology stack of Keras:
  Machine learning workflow management: Keras
  Computational graph management and below: Theano or TensorFlow (Keras sits on top of either backend, reusing the technology stack of Theano or of TensorFlow)
- 31. Summary • Most DL frameworks have many components in common and can be organized into a similar technology stack. • At the upper layers of the stack, frameworks are designed to help users follow typical ML workflows. • At the middle layers, manipulations of computational graphs are automated. • At the lower layers, optimized tensor calculations are implemented. • How these components are realized differs between frameworks, as we will see in the following part.
- 32. memorandum
- 33. Training of Neural Networks: argmin_w Σ_{(x, y)} L(x, y; w), where w: parameters, x: feature vector, y: training label, L: loss function (e.g. a classification problem). • L is designed so that its value gets smaller as the prediction becomes more accurate. • In the deep learning context, L is represented by a neural network and w are the parameters of the neural network.
- 34. Layer = function + data nodes • Layers (e.g. a fully connected layer, a convolutional layer) can be considered as functions with parameters to be optimized. • In most modern frameworks, the parameters of layers can be considered as data nodes in the computational graph. • The framework needs to differentiate which data nodes are parameters to be optimized and which are data points.
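A toy sketch of this distinction, assuming nothing beyond plain Python and NumPy; the is_param flag and the Linear class are illustrative stand-ins for how frameworks mark which data nodes the optimizer should update.

```python
import numpy as np

class DataNode:
    def __init__(self, value, is_param=False):
        self.value = value
        self.is_param = is_param     # parameter node vs. input data node

class Linear:
    """A fully connected layer = a function plus its parameter nodes."""
    def __init__(self, n_in, n_out):
        self.W = DataNode(0.01 * np.random.randn(n_in, n_out), is_param=True)
        self.b = DataNode(np.zeros(n_out), is_param=True)

    def __call__(self, x):
        return DataNode(x.value.dot(self.W.value) + self.b.value)

layer = Linear(784, 100)
x = DataNode(np.random.randn(8, 784))       # an input mini-batch, not a parameter
h = layer(x)
params = [n for n in (layer.W, layer.b) if n.is_param]   # what the optimizer updates
```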
- 35. Execution Engine • It resolves the dependencies between data nodes and schedules the execution of the parts of the computational graph (especially in multi-node or multi-GPU settings).