
Introduction to Chainer


Introduction to Chainer, a framework for neural networks (v1.11). Slides used for the student seminar on July 20, 2016, at the Sugiyama-Sato lab, the University of Tokyo.



  1. Introduction to Chainer (Seiya Tokui, July 20, 2016)
  2. Outline
     • General concepts of deep learning frameworks
     • Basics of Chainer (based on v1.11)
     • Walk through the MNIST example
  3. General concepts: considerations and internals of deep learning frameworks
  4. Why are frameworks needed for neural networks (NN)?
     NN research requires:
     • Flexibility in defining NN architectures
     • Fast trial-and-error
     In other words, we want to automate as much of the programming as possible:
     • Automatic differentiation
     • CUDA programming
     • Efficient execution (computational optimization)
     • Boilerplate in training loops
  5. Core concept: computational graph (CG)
     • A directed acyclic graph (DAG) that represents how to compute outputs from inputs
     • Two types of representation:
       • Data-operation graph (Theano, Chainer)
       • Operation graph, a.k.a. data-flow graph (TensorFlow, Torch.nn)
     [Figure: CG (data-operation graph) for least-squares linear regression: matmul → + → MSE (Mean Squared Error)]
  6. Automatic differentiation over a computational graph
     • The gradient can be factorized by the chain rule
     • Reducing the product of Jacobians one factor at a time: automatic differentiation
     [Figure: the matmul → + → MSE graph, with every factor being a Jacobian]
  7. Backpropagation over a computational graph
     • The inputs (W, b) are usually larger than the output (a scalar loss), so reducing the product in a right-fold manner is computationally efficient → backpropagation (reverse-mode AD)
     • The left-fold computation is called forward-mode AD, or Real-Time Recurrent Learning in the RNN literature
     [Figure: the same matmul → + → MSE graph, all factors being Jacobians]
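     To make slide 7 concrete, here is the chain-rule factorization for the least-squares graph above, written in LaTeX (the notation is mine, not from the slides):

         \frac{\partial \ell}{\partial W}
           = \frac{\partial \ell}{\partial \hat{y}}
             \cdot \frac{\partial \hat{y}}{\partial a}
             \cdot \frac{\partial a}{\partial W},
         \qquad a = W x, \quad \hat{y} = a + b, \quad \ell = \mathrm{MSE}(y, \hat{y})

     Backpropagation accumulates this product starting from the loss side (the factor \partial\ell / \partial\hat{y}), so every intermediate result stays as small as the scalar output; forward-mode AD starts from \partial a / \partial W and carries objects as large as the parameters.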
  8. Deep learning framework stack (bottom to top)
     • “Backend”: device-specific general routines (e.g. BLAS, CUDA, OpenCL, cuBLAS), device-specific DL routines (e.g. cuDNN)
     • “Low-level API”: multi-dimensional array (e.g. Eigen::Tensor, NumPy, CuPy), computational graph implementation
     • “High-level API”: neural network abstraction, training/evaluation loop implementation
  9. Deep learning framework stack: where existing tools sit
     [Figure: the same stack with frameworks placed on its layers, including Theano, TensorFlow, Keras, TFLearn, Chainer, CuPy, Torch, cuTorch, and Torch.nn]
  10. Basics of Chainer (based on v1.11)
  11. Chainer
      • Open-source framework for neural networks
      • First release: v1.0.0 (June 2015); latest release: v1.11.0 (July 2016)
      URLs:
      • GitHub: https://github.com/pfnet/chainer
      • Official site: http://chainer.org/
      • Documentation: http://docs.chainer.org/
      • Forum (English): https://groups.google.com/forum/#!forum/chainer
      • Forum (Japanese): https://groups.google.com/forum/#!forum/chainer-jp
  12. Basic information
      Chainer is a Python framework for neural networks. Components:
      • Backend / low-level API
        • CuPy: GPU array library with a NumPy-subset interface
        • Dynamic computational graph
      • High-level API
        • Model composition (links and chains)
        • Dataset abstraction
        • Training loop abstraction
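      As a quick illustration of the NumPy-subset interface, here is a minimal sketch, assuming a CUDA-enabled installation; in Chainer v1.x, CuPy is bundled and reachable through chainer.cuda:

          import numpy as np
          from chainer import cuda

          x_cpu = np.arange(6, dtype=np.float32).reshape(2, 3)
          x_gpu = cuda.to_gpu(x_cpu)          # copy the array to the current GPU
          y_gpu = cuda.cupy.exp(x_gpu) * 2    # same call pattern as numpy.exp
          y_cpu = cuda.to_cpu(y_gpu)          # copy the result back to host memory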
  13. Installation
      On a Python environment (2.7 or 3.5), type: pip install chainer
      • Enabling the GPU: if CUDA is installed and the paths (CPATH, LD_LIBRARY_PATH) are properly set, the command above also installs GPU support (including CuPy)
      • NOTE: install Chainer AFTER configuring CUDA!
      • When your CUDA is updated, don’t forget to reinstall Chainer
      • Sometimes you may have to delete the CUDA kernel cache at $(HOME)/.cupy (just rm -r it)
      • Chainer also supports cuDNN v2 – v5.1
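      A quick way to check the result is the sketch below; the flags chainer.cuda.available and chainer.cuda.cudnn_enabled are assumed to be present in your Chainer version and reflect whether CUDA and cuDNN support were picked up:

          import chainer
          from chainer import cuda

          print(chainer.__version__)   # expect 1.11.0
          print(cuda.available)        # True if the GPU (CuPy/CUDA) support is installed
          print(cuda.cudnn_enabled)    # True if cuDNN was detected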
  14. How to learn Chainer
      • Official tutorial: part of the official documentation (http://docs.chainer.org)
      • Official examples: the examples directory in the official repository (https://github.com/pfnet/chainer)
  15. Computational graph (CG) in Chainer
      Chainer is based on dynamic CG construction:
      • The CG is not built beforehand; the forward computation is written like a regular program on Variable and Function
      • Variable remembers the history of computation, i.e. the CG
      • Backpropagation is done by walking back through this CG
  16. CG in Chainer: example (matmul → + → MSE)
      Input definition (use Variable if you want to extract grad; see the next slide), then the forward computation, done on the fly:

          import chainer, chainer.functions as F
          import numpy as np

          W = chainer.Variable(np.array(...))
          b = chainer.Variable(np.array(...))
          x = np.array(...)
          y = np.array(...)

          a = F.matmul(W, x)
          y_hat = a + b
          ell = F.mean_squared_error(y, y_hat)
          print(ell.data)  # => print the computed error
  17. Backpropagation in Chainer
      • backward() executes backpropagation along the history of the forward computation
      • Gradients w.r.t. terminal nodes can then be extracted (for non-terminal nodes, pass retain_grad=True to the backward method)

          a = F.matmul(W, x)
          y_hat = a + b
          ell = F.mean_squared_error(y, y_hat)
          ell.backward()   # compute gradients of the error
          print(W.grad)    # => print the gradient w.r.t. W
          print(b.grad)    # => print the gradient w.r.t. b
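      Filling in the elided arrays with concrete values (the shapes and random data are my own choices, purely for illustration), slides 16 and 17 combine into a runnable script:

          import chainer
          import chainer.functions as F
          import numpy as np

          # One data point: x has 3 features, y has 2 outputs (arbitrary illustrative shapes).
          W = chainer.Variable(np.random.randn(2, 3).astype(np.float32))
          b = chainer.Variable(np.zeros((2, 1), dtype=np.float32))
          x = np.random.randn(3, 1).astype(np.float32)
          y = np.random.randn(2, 1).astype(np.float32)

          a = F.matmul(W, x)                    # shape (2, 1)
          y_hat = a + b
          ell = F.mean_squared_error(y, y_hat)  # scalar Variable

          ell.backward()
          print(ell.data)   # the computed error
          print(W.grad)     # gradient w.r.t. W, shape (2, 3)
          print(b.grad)     # gradient w.r.t. b, shape (2, 1)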
  18. Pre-built CG vs. on-the-fly CG
      • Most other frameworks use a pre-built CG
        • CGs are built by scripting or from static data (e.g. prototxt), then executed many times after construction
        • We call this paradigm “Define-and-Run”
        • Pro: easy to cache graph optimizations. Con: hard to define a different graph at every iteration; unintuitive; hard to debug
      • Chainer adopts an on-the-fly CG
        • CGs are built simultaneously with the forward computation; the graph is only used to derive automatic differentiation
        • We call this paradigm “Define-by-Run”
        • Pro: flexibility, intuitiveness, easy debugging. Con: hard to cache graph optimizations
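      Because the forward pass is ordinary Python under Define-by-Run, the graph can differ at every call. A minimal sketch (this function and its n_steps argument are mine, not from the slides):

          import chainer.functions as F

          def forward(x, n_steps):
              # The graph is built as this code runs, so its depth can change per call.
              h = x
              for _ in range(n_steps):
                  h = F.relu(h)
              return F.sum(h)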
  19. Defining neural network models
      • In object-oriented programming, we often bind code to data; in NN programming, we want to bind the forward computation to the parameters
      • Link and Chain serve that purpose:
        • Link: a Function with parameters
        • Chain: a forward computation routine that combines one or more child links
      • A Chain is itself a Link, so chains can be nested, resulting in a hierarchy of links/chains
  20. Example: multi-layer perceptron (Linear → tanh → Linear, each Linear computing Wx + b)

          import chainer, chainer.functions as F, chainer.links as L
          import numpy as np

          class MLP(chainer.Chain):
              def __init__(self):
                  super().__init__(
                      l1=L.Linear(100, 10),
                      l2=L.Linear(10, 1),
                  )

              def __call__(self, x):
                  h = F.tanh(self.l1(x))
                  return self.l2(h)
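      Continuing the snippet above, a hypothetical call just to show how the chain is used (the batch size of 5 is my choice):

          mlp = MLP()
          x = np.random.randn(5, 100).astype(np.float32)  # mini batch of 5 examples, 100 features each
          y = mlp(x)                                       # Variable of shape (5, 1)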
  21. Example: Classifier
      We often implement a chain that defines the loss function like this. The predictor is a child link; chainer.report sends the performance values to a Reporter.

          class Classifier(chainer.Chain):
              def __init__(self, predictor):
                  super().__init__(predictor=predictor)

              def __call__(self, x, y):
                  y_hat = self.predictor(x)
                  loss = F.softmax_cross_entropy(y_hat, y)
                  accuracy = F.accuracy(y_hat, y)
                  chainer.report({'loss': loss, 'accuracy': accuracy}, self)
                  return loss
  22. Built-in functions and links
      Chainer provides many Functions and Links. Popular examples (see the reference manual for the full list):
      • Layers with parameters: Linear, Convolution2D, Deconvolution2D, EmbedID
      • Activation functions and recurrent layers: sigmoid, tanh, relu, maxout, LSTM, GRU
      • Loss functions: softmax_cross_entropy, mean_squared_error, connectionist_temporal_classification
      • Other NN components: dropout, BatchNormalization
      Many array/math functions are also supported (mostly following NumPy).
  23. Numerical optimization
      • NNs are usually trained with online gradient methods: Stochastic Gradient Descent (SGD), momentum SGD, AdaGrad, RMSprop, Adam, etc.
      • Most of these optimizers need state vectors besides the parameters (e.g. momentum, a moving average of the squared gradient)
      • Chainer provides Optimizer to abstract the optimization routine; implementations are provided in chainer.optimizers
  24. Numerical optimization
      • Optimizer accepts a target link to optimize: it enumerates the parameters in the link hierarchy and prepares the corresponding state vectors
      • Pass a loss function and its arguments to update() to update the parameters (the optimizer runs the forward and backward passes, then updates the parameters)

          model = ...  # chain
          optimizer = chainer.optimizers.MomentumSGD()
          optimizer.setup(model)

          def loss_fun(x, y):
              ...

          optimizer.update(loss_fun, x, y)
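      Equivalently, you can drive the update yourself. A minimal sketch of the manual pattern, continuing the snippet above and assuming zerograds() and the no-argument update() are available in your Chainer version:

          model.zerograds()       # clear the accumulated gradients
          loss = loss_fun(x, y)   # forward pass
          loss.backward()         # backward pass fills the .grad arrays
          optimizer.update()      # apply the optimizer rule to the parameters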
  25. Training loop
      A training loop implements the iterative updates of the parameters. Each iteration consists of the following steps (a hand-written sketch follows this slide):
      • Load a mini batch
      • Run the forward/backward propagation
      • Update the parameters with Optimizer
      • Track the current performance
      • Save a checkpoint at regular intervals (we sometimes want to resume training from a checkpoint)
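      A minimal hand-written version of this loop on MNIST, as a hedged sketch: it reuses the MLP and L.Classifier from the MNIST example later in the deck, omits performance tracking and checkpointing, and assumes chainer.dataset.concat_examples is available in v1.11:

          import chainer
          import chainer.links as L
          from chainer import dataset, iterators, optimizers

          train, _ = chainer.datasets.get_mnist()
          train_iter = iterators.SerialIterator(train, 100)   # mini batches of 100

          model = L.Classifier(MLP(784, 1000, 10))            # MLP as defined on slide 34
          optimizer = optimizers.Adam()
          optimizer.setup(model)

          while train_iter.epoch < 10:                        # 10 sweeps over the training set
              batch = train_iter.next()                       # load a mini batch
              x, t = dataset.concat_examples(batch)           # stack examples into arrays
              model.zerograds()
              loss = model(x, t)                              # forward pass
              loss.backward()                                 # backward pass
              optimizer.update()                              # parameter update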
  26. Training loop abstraction (new feature of v1.11)
      Trainer implements the training loop:
      • Loading mini batches, forward/backward propagation, and parameter updates with Optimizer are handled by the Updater
      • Tracking the current performance and saving checkpoints at regular intervals are handled by Extensions
  27. Updater: the parameter update routine
      • The Updater updates the parameters using a mini batch
      • Dataset defines a set of training points; Iterator defines how to iterate over the dataset
      [Figure: Updater holds an Optimizer (with its target Link) and an Iterator (over a Dataset)]
  28. Built-in datasets, iterators, and updaters
      • Datasets: MNIST, CIFAR-10/100, Penn Treebank (word sequence); random splits and cross-validation splits are also supported
      • Iterators: random shuffling at each epoch (an epoch is one sweep over the whole dataset); MultiprocessIterator prefetches the next mini batch in parallel
      • Updaters: StandardUpdater (standard implementation) and ParallelUpdater (multi-GPU, data-parallel)
  29. Extensions extend the training loop
      Trainer’s training loop:
      • Call updater.update()
      • For each extension and its trigger: if trigger(trainer) returns True, call extension(trainer)
      Users can register extensions to a trainer; an extension is called at every iteration unless its trigger returns False, so the trigger manages when the extension is invoked.
  30. Built-in extensions
      • Evaluator: evaluate the current model on a validation set
      • LogReport: collect the values reported by the models and write them to a JSON log file
      • PrintReport: print selected entries of the log to the console
      • ProgressBar: print a progress bar of the training process
      • snapshot: save a snapshot of the trainer object (users can resume training by deserializing the snapshot)
  31. Custom extensions
      Users can write their own extensions in two ways: inherit the Extension class, or use the @make_extension decorator (the decorator is easier). Example: decaying the learning rate by the inverse of the iteration count.

          @chainer.training.make_extension()
          def adjust_learning_rate(trainer):
              optimizer = trainer.updater.get_optimizer('main')
              optimizer.lr = 0.01 / optimizer.t
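      Registering it is then a one-liner; the once-per-epoch trigger shown here is my choice, and trainer is assumed to be set up as in the MNIST example below:

          trainer.extend(adjust_learning_rate, trigger=(1, 'epoch'))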
  32. MNIST example
  33. MNIST example
      • MNIST: a hand-written digit image dataset with 60,000 training images and 10,000 test images
      • Each image has 28x28 = 784 pixels and is labeled with a digit (0-9)
      • Often used as the “Hello, world” of deep learning
      • Task: label prediction (multiclass classification); we use a multi-layer perceptron with softmax cross entropy as the loss function
  34. Definition of the multi-layer perceptron

          class MLP(chainer.Chain):
              def __init__(self, n_in, n_units, n_out):
                  super().__init__(
                      l1=L.Linear(n_in, n_units),     # first layer
                      l2=L.Linear(n_units, n_units),  # second layer
                      l3=L.Linear(n_units, n_out),    # output layer
                  )

              def __call__(self, x):
                  h1 = F.relu(self.l1(x))
                  h2 = F.relu(self.l2(h1))
                  return self.l3(h2)
  35. Instantiate a classifier with the MLP
      We wrap the MLP in the Classifier chain and then set up an optimizer for it. to_gpu is optional: it is needed only if you want to use a GPU.

          model = L.Classifier(MLP(784, 1000, 10))
          model.to_gpu(device=0)  # copy the model to GPU 0 (optional)

          optimizer = chainer.optimizers.Adam()
          optimizer.setup(model)
  36. Prepare the dataset and iterators
      Chainer provides a utility function to download MNIST. train and test are instances of TupleDataset; each example is a tuple of an image (a 784-dimensional float32 vector) and a label. repeat=False means only one sweep over the test dataset is needed for validation.

          train, test = chainer.datasets.get_mnist()
          train_iter = chainer.iterators.SerialIterator(train, batch_size=100)
          test_iter = chainer.iterators.SerialIterator(
              test, batch_size=100, repeat=False, shuffle=False)
  37. Set up an updater and a trainer
      We have to create an updater to use Trainer. Option: data-parallel computation with multiple GPUs is easy to enable by using ParallelUpdater instead.

          updater = training.StandardUpdater(train_iter, optimizer, device=0)
          trainer = training.Trainer(updater, (10, 'epoch'), out='result')

          # Multi-GPU alternative:
          updater = training.ParallelUpdater(
              train_iter, optimizer, devices={'main': 0, 'second': 1})
  38. Add extensions

          # Evaluate the model with the test set at the end of each epoch
          trainer.extend(extensions.Evaluator(test_iter, model, device=0))

          # Save a snapshot of the trainer at the end of each epoch
          trainer.extend(extensions.snapshot())

          # Collect performance values, save them to a log file,
          # and print selected entries to the console
          trainer.extend(extensions.LogReport())
          trainer.extend(extensions.PrintReport(
              ['epoch', 'main/loss', 'validation/main/loss',
               'main/accuracy', 'validation/main/accuracy']))

          # Print a progress bar
          trainer.extend(extensions.ProgressBar())
  39. Execute the training loop
      Just call trainer.run(). (Demo: executing this trainer.)

          trainer.run()
  40. Summary
      • The computational graph is a crucial part of any deep learning framework
      • DL frameworks also provide high-level APIs; Chainer offers both low-level and high-level APIs
      • Today, I introduced the computational graph in Chainer and its high-level APIs
      • Almost every part of Chainer can be customized, so you can try the unusual workflows sometimes seen in cutting-edge DL research
