Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Chainer GTC 2016


Published on

Chainer introduction presentation
NVIDIA GTC @ San Jose, April 6, 2016

Published in: Technology
  • Be the first to comment

Chainer GTC 2016

  1. 1. A Powerful, Flexible, and Intui5ve Deep Learning Framework @ NVIDIA GTC, April 6th, 2016 Shohei Hido Chief Research Officer Preferred Networks, Inc.
  2. 2. Overview l  Chainer is a Python-based deep learning framework l  Chainer v1.0 was released as an open source on June 2015 l  It DOESN’T rely on Theano, unlike other Python frameworks l  Chainer uses a unique scheme named Define-by-Run l  Why do users sOll need another framework? l  How different and effecOve Chainer is? 2
  3. 3. Preferred Networks (PFN) A startup that applies deep learning to industrial IoT l  Founded: March 2014 l  Headquarter: Tokyo, Japan l  U.S. Subsidiary: San Mateo, California l  Company size: 35 engineers & researchers l  Investors: Toyota, FANUC, NTT Deep learning Industrial IoT 3 Manufacturing Automotive Healthcare
  4. 4. Partnering with world-leading companies using Chainer l  R&D collaboraOon on industrial problems with real-world data ̶  Specific requirements, modified algorithms, many trials and errors, etc ̶  Different from making general-purpose recogniOon system 4 Toyota FANUC Panasonic NTT Cisco NVIDIA
  5. 5. Two types of background behind DL frameworks 1. Scalability-oriented l  Use-cases in mind ̶  Image/speech recogniOon system ̶  Fast DL as a service in cloud l  Problem type ̶  A few general applicaOons ̶  10+ million training samples ̶  10+ nodes cluster w/ fast network l  Possible boZleneck ̶  Tuning of well-known algorithms ̶  Distributed computaOon for model/data-parallel training 2. Flexibility-oriented l  Use-cases in mind ̶  Algorithm research ̶  R&D projects for new products l  Problem type ̶  Various specific applicaOons ̶  10+ k training samples ̶  1 node with mulOple GPUs l  Possible boZleneck ̶  Trial-and-error in prototyping ̶  Debugging, profiling & refactoring ̶  (wait Ome during compilaOon)
  6. 6. Designed for efficient research & development l  Flexible: new kinds of complex models for various applicaOons l  IntuiOve: rapid prototyping and efficient trial-and-error l  Powerful: comparable performance for 1 node & mulO-GPUs 6 Scalability-oriented Flexibility-oriented
  7. 7. Agenda l  Deep learning framework basics l  IntroducOon to Chainer l  CuPy: NumPy-compaOble GPU library l  Performance and applicaOons 7
  8. 8. Neural network and computation x1 xN ・・ h1 hH ・・・・ kM k1 yM y1 Forward computation Backward computation (backpropagation) ・・ ・・ Input Hidden units Output Text Image Sensor Object:
 Tulip Anomaly score:
 0.35 Category:
 Sports ・・ ・・・・ 8
  9. 9. Chainer focuses on network representation/training l  Design choices for deep learning frameworks ̶  How to build neural networks? ̶  How to train neural networks? ̶  Which text format/language for modeling? ̶  Which language for compuOng? ̶  Run with GPU? ̶  Run on mulOple GPUs? ̶  Run on mulOple compute nodes? 9
  10. 10. Building and training neural networks: Computational graph construction is the key 1.  Construct a computaOonal graph ̶  Based on network definiOon given by users ̶  Chains of funcOons and operaOons on input variables 2.  Compute loss and gradients ̶  Forward computaOon to calculate loss for a minibatch ̶  BackpropagaOon gives gradients to all of parameters 3.  OpOmize model ̶  Update each parameter with the gradient ̶  Repeat unOl convergence Step 1. is the most important and there are many approaches 10
  11. 11. Building blocks l  These funcOonaliOes are very similar between frameworks l  But the structure, abstracOon level, and interface are different l  It comes to the design of domain-specific language for NN Array data structure (vector/matrix/tensor) Operations & functions Network (computational graph) Optimizer (SGD/AdaGrad/Adam) 11
  12. 12. Types of domain-specific language for neural networks l  Text DSL ̶  Ex. Caffe (prototxt) ̶  Ex. CNTK (NDL) l  Symbolic program ̶  OperaOons on symbols ̶  Ex. Theano ̶  Ex. TensorFlow l  ImperaOve program ̶  Direct computaOons on raw data arrays ̶  Ex. Torch.nn ̶  Ex. Chainer # Symbolic definiOon A = Variable(‘A’) B = Variable(‘B’) C = B * A D = C + Constant(1) # Compile f = compile(D) d = f(A=np.ones(10), B=np.ones(10) * 2) # ImperaOve declaraOon a = np.ones(10) b = np.ones(10) * 2 c = b * a d = c + 1 %% DefiniOon in text f: { “A”: “Variable”, “B”: “Variable”, “C”: [“B”, “*”, “A”], “ret”: [“C”, “+”, 1] } # Compile f = compile(“f.txt”) d = f(A=np.ones(10), B=np.ones(10) * 2) 12 Ex. MXNet
  13. 13. Comparison of DSL type DSL type Pros. Cons. Text DSL •  Human-readable definiOon •  Non-programmer can easily edit the network •  Users must study the format •  Format might have to be extended for new algorithms Internal DSL Symbolic •  StaOc analysis at compile •  OpOmizaOon before training •  Easy to parallelize •  Users must study special syntax •  May need more efforts to implement new algorithms ImperaOve •  Less efforts to learn syntax •  Easy debugging and profiling •  Suitable for new algorithms with complex logic •  Hard to opOmize in advance •  Less efficient in memory allocaOon and parallelizaOon Chainer is at the extreme end of imperaOve program for high flexibility 13
  14. 14. Agenda l  Deep learning framework basics l  IntroducOon to Chainer l  CuPy: NumPy-compaOble GPU library l  Performance and applicaOons 14
  15. 15. Chainer as an open-source project l  hZps:// l  50 contributors l  1,277 stars & 255 fork l  3,708 commits l  AcOve development & release for last 10 months ̶  v1.0.0 (June 2015) to v1.7.2 (March 2016) 15 Original developer Seiya Tokui
  16. 16. CuPy Chainer software stack CPU NVIDIA GPU CUDA cuDNN BLAS NumPy Chainer l  Chainer is built on top of NumPy and CUDA l  CuPy is also introduced as an equivalent of NumPy on GPU 16
  17. 17. Run Define Graph build scheme (1/2) - Define-and-Run: Most of frameworks use this scheme (Chainer does not) l  Define: build a computaOonal graph based on definiOon l  Run: update the model (parameters) using training dataset Network definiOon ComputaOonal graph Gradient funcOon Parameters ComputaOonal graph Gradient funcOon Parameters Training data Update Loss & gradient Auto differenOaOon 17
  18. 18. Define-by-Run Graph build scheme (2/2) - Define-by-Run: Computational graph construction on the fly l  No graph is constructed before training l  Instead, the graph is built at each forward computaOon l  ComputaOonal graph can be modified dynamically for each iteraOon/sample or depending on some condiOons Model definiOon ComputaOonal graph Gradient funcOon Parameters Training data Update Dynamic change CondiOons 18
  19. 19. Define-by-Run example: MLP for MNIST l  Only transformaOons between units are set before training l  ConnecOon is given as forward computaOon l1 = Linear(784, n_units) l2 = Linear(n_units, 10)) Linear l2Linear l1 x yh1 W bias 0 5 9 W bias ReLU def forward(x): h1 = ReLU(l1(x)) return l2(h1) 19
  20. 20. Define-by-Run: An interpreted language for neural network l  Idea ̶  Forward computaOon actually goes through computaOonal graph ̶  By remembering the history, the actual graph can be obtained l  Advantage ̶  Flexibility for new algorithms with complex components u  Ex. recurrent, recursive, aZenOon, memory, adversarial, etc ̶  IntuiOve coding with highly imperaOve network definiOon u  Ex. stochasOc network of which graph changes for each iteraOon l  Current drawbacks ̶  Graph is generated every Ome also for fixed networks ̶  No opOmizaOon even for staOc part of graphs u  JIT-like analysis and subgraph cache might be useful 20
  21. 21. Basic components (1/2): Variable and Function l  Variable ̶  Variable wraps arrays (.data) ̶  It remembers parent funcOon (.creator) ̶  It will be assigned gradient (.grad) ̶  It keeps track of not only data but also computaOons l  FuncOon ̶  TransformaOon between Variable ̶  Stateless ̶  e.g. sigmoid, tanh, ReLU, maxpooling, dropout Function x y Variable x yh1 0 5 9 21
  22. 22. Chain (MLP2) Basic components (2/2): Link and Chain l  Link = funcOon with state ̶  Parameters are also Variable and gradients will be assigned ̶  e.g. Linear (fully-connected), LSTM ConvoluOon2d, word-embedding l  Chain = network ̶  Chain has a set of child Link ̶  Forward computaOon is defined in . __call__() ̶  e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq, Link (Linear) y=f(W*x+b) x y W b Linear l2Linear l1 yh1 W bias W bias ReLU 22
  23. 23. Backpropagation through computational graph l  Consider an objecOve (Link.Linear): L = f(x * w + b) l  This computes the value of L in forward computaOon, and simultaneously builds the following computaOonal graph l  The gradient of L can be computed with respect to any variables by backpropagaOon l  Then the opOmizer updates the value of parameters *x W + b f L is Variable is FuncOon 23
  24. 24. Code sample (1/4): Multi-layer perceptron class MLP2(Chain): def __init__(self): super(MLP2, self).__init__( l1=L.Linear(784, 100), l2=L.Linear(100, 10), ) def __call__(self, x): h1 = F.relu(self.l1(x)) y = self.l2(h1) return y class Classifier(Chain): def __init__(self, predictor): super(Classifier, self). __init__(predictor=predictor) def __call__(self, x, t): y = self.predictor(x) self.accuracy = F.accuracy(y, t) self.loss = F.softmax_cross_entropy(y, t) return self.loss, self.accuracy # Model and optimizer setup model = Classifier(MLP2()) optimizer = optimizers.SGD() optimizer.setup(model) # training loop with minibatch for i in range(0, datasize, batchsize): x = Variable(x_tr[i:i+batchsize]) t = Variable(y_tr[i:i+batchsize]) model.zerograds() loss, acc = model(x, t) loss.backward() optimizer.update() Chain (MLP2) Linear l2Linear l1 yh1 W bias W bias ReLU 24
  25. 25. Code sample (2/4): Convolutional neural network class AlexNet(Chain): def __init__(self): super(AlexNet, self).__init__( conv1=L.Convolution2D(3, 96, 11, stride=4), conv2=L.Convolution2D(96, 256, 5, pad=2), conv3=L.Convolution2D(256, 384, 3, pad=1), conv4=L.Convolution2D(384, 384, 3, pad=1), conv5=L.Convolution2D(384, 256, 3, pad=1), fc6=L.Linear(9216, 4096), fc7=L.Linear(4096, 4096), fc8=L.Linear(4096, 1000), ) def __call__(self, x, t): h = F.max_pooling_2d(F.relu( F.local_response_normalization(self.conv1(x))), 3, stride=2) h = F.max_pooling_2d(F.relu( F.local_response_normalization(self.conv2(h))), 3, stride=2) h = F.relu(self.conv3(h)) h = F.relu(self.conv4(h)) h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2) h = F.dropout(F.relu(self.fc6(h)), train=self.train) h = F.dropout(F.relu(self.fc7(h)), train=self.train) y = self.fc8(h) return y * ImageNet Classification with Deep Convolutional Neural Networks conv2d conv2d conv2d conv2d conv2d linear linear 25 linear
  26. 26. Code sample (3/4): Recurrent neural network class SimpleRNN(Chain): def __init__(self, n_vocab, n_units): super(SimpleRNN, self).__init__( embed=L.EmbedID(n_vocab, n_units) x2h=L.Linear(n_units, n_units), h2h=L.Linear(n_units, n_units), h2y=L.Linear(n_units, n_vocab),) self.h = None def __call__(self, x): y, h_new = self.fwd_one_step(x, self.h) self.h = h_new return y def fwd_one_step(self, x, h): x = F.tanh(self.embed(x)) if h is None: h = F.tanh(self.x2h(x)) else: h = F.tanh(self.x2h(x) + self.h2h(h)) y = F.softmax(self.h2y(h)) return y, h x_1 h y_1 x_2 h y_2 x_3 h y_3 x_4 h y_4 BPTT length = 3 Input word OutputRecurrent state # Truncated BPTT (length=3) for i in range(0, datasize, batchsize): ... accum_loss += model(x, t) if i % bptt_length == 0: model.zerograds() accum_loss.backward() accum_loss.unchain_backward() optimizer.update() 26
  27. 27. Code sample (4/4): Deep Networks with Stochastic Depth A paper published on arXiv, March 30, 2016 l  A variant of Residual Net that skips connecOons stochasOcally ̶  Outperformed the original Residual Net (ImageNet 2015 winner, MSR) ̶  StochasOc skip: Taken from G. Huang et al. # Mock code in Chainer class StochasticResNet(Chain): def __init__(self, prob, size, …): super(StochasticResNet, size, …).__init__( ## Define f[i] as same for Residual Net ) self.p = prob # Survival probabilities def __call__(self, h): for i in range(self.size): b = numpy.random.binomial(1, self.p[i]) c = self.f[i](h) + h if b == 1 else h h = F.relu(c) return h w/ survival probability: 27
  28. 28. Miscellaneous l  Other features ̶  Install with pip in one line: ̶  MulO-GPU support by explicitly selecOng the ID to use ̶  Pre-trained Caffe model import from Model Zoo ̶  Model serializaOon & save & load : HDF5 or NumPy npz l  Future direcOon (not only for Chainer) ̶  JIT-like opOmizaOon during Define-by-Run ̶  Memory consumpOon reducOon (GPU memory is sOll small) ̶  Handling variable-length inputs without minibatch ̶  Maximizing performance on mulO-node & mulO-GPU environment $ pip install chainer 28
  29. 29. Agenda l  Deep learning framework basics l  IntroducOon to Chainer l  CuPy: NumPy-compaOble GPU library l  Performance and applicaOons 29
  30. 30. CuPy: (partially-)NumPy-compatible GPU library l  MoOvaOon: NumPy + CUDA = CuPy ̶  NumPy is the standard library in Python for numerical computaOon ̶  CUDA is the standard APIs for using GPU for high-performance ̶  Unfortunately, NumPy does NOT work with CUDA l  CuPy supports: ̶  Fast computaOon using NVIDIA’s cuBLAS and cuDNN ̶  Array indexing, slicing, transpose, and reshape ̶  Most of operaOons/funcOons in NumPy u  Chainer v1.7.2 already supports more than 170 funcOons ̶  User-defined funcOons and kernels ̶  all dtypes, broadcasOng, memory pool, etc 30
  31. 31. How to use CuPy l  Usage of CuPy: just replace NumPy with CuPy l  Conversion between numpy.ndarray and cupy.ndarray l  Ex. CPU/GPU-agnosOc logsumexp funcOon def logsumexp(x, axis=None): xp = cuda.get_array_module(x) #Get CuPy or NumPy x_max = x.max(axis) exp_sum = xp.exp(x - x_max).sum(axis) return x_max + xp.log(exp_sum) import numpy, cupy enable_cupy = True xp = cupy if enable_cupy else numpy w_c = cupy.asarray(numpy.ones(10)) # cupy.ndarray w_n = cupy.asnumpy(cupy.ones(10)) # numpy.ndarray 31
  32. 32. CuPy implementation: Optimized for performance & NumPy-compatibility l  Use Cython for cupy.core & cupy.cuda l  Dynamic code generaOon & compile ̶  CUDA code is generated for specific tensor dimension & data type ̶  On-the-fly compile by nvcc and binary cache (faster awer 1st use) CUDA libraries (cuBLAS, cuRAND, cuDNN) ndarray ufunc, elementwise, reduc5on CUDA Python wrapper cupy.cuda cupy.core Tensor opera5ons & func5ons cupy 32
  33. 33. CuPy performance on linear algebra: 5 to 25 times faster than NumPy def test(xp): a = xp.arange(1000000).reshape(1000, -1) return a.T * 2 test(numpy) t1 = for i in range(1000): test(numpy) t2 = print(t2 -t1) test(cupy) t1 = for i in range(1000): test(cupy) t2 = print(t2 -t1) msec speed up NumPy 2,929 1.0 CuPy 585 5.0 CuPy + Memory Pool 123 23.8 Intel Core i7-4790 @3.60GHz,32GB, GeForce GTX 970 33
  34. 34. Use CuPy for GPU-based computation l  Support three paZerns as wrappers ̶  ElementwiseKernel: for element-wise computaOon ̶  ReducOonKernel: for reduce operaOon along axis ̶  ufunc: universal funcOon as in Numpy l  Ex. definiOon of an element-wise funcOon l  Usage (automaOc broadcast and type check are supported) squared_diff = cupy.ElementwiseKernel( ‘float32 x, float32 y’, # Input ‘float32 z’, # Output ‘z = (x - y) * (x - y)’, # Operation ‘squared_diff’) # Name squared_diff(cupy.arange(10), 10) 34
  35. 35. Agenda l  Deep learning framework basics l  IntroducOon to Chainer l  CuPy: NumPy-compaOble GPU library l  Performance and applicaOons 35
  36. 36. Public benchmark results (CNN): Chainer shows comparable performance l  Forward computaOon is almost the same with TensorFlow l  Training with backward computaOon is slower, but it can be offset by no compilaOon Ome while debugging/tuning 0 200 400 600 800 1000 1200 AlexNet GoogLeNet VGG-A OverFeat Torch TensorFlow Chainer Caffe (naCve) 0 200 400 600 800 1000 1200 AlexNet GoogLeNet VGG-A OverFeat Torch TensorFlow Chainer Caffe (naCve) Forward computation (msec) Backward computation (msec) Taken from, using cuDNN except Caffe 36
  37. 37. Chainer can benefit from latest CUDA libraries: Ex. Winograd algorithm in cuDNN v5 l  Conv3x3 is common in CNNs & now computed with Winograd l  State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated up to 2.0x at test Ome (forward only) 0 100 200 300 400 500 600 AlexNet GoogLeNet VGG-A OverFeat cuDNN v4 cuDNN v5 0 100 200 300 400 500 600 AlexNet GoogLeNet VGG-A OverFeat cuDNN v4 cuDNN v5 Forward computation (msec) Backward computation (msec) Independently measured by a modified version of soumith/convnet-benchmarks cuDNN v5 can be used in Chainer v1.8.0 37
  38. 38. Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015) l  hZps:// Content image (cat) Style image New artistic image + = Main code (45 lines) 38
  39. 39. l  Many collaboraOons are on-going w/ Chainer-based computer vision, deep reinforcement learning, etc… l  Ex. 1 Chainer-controlled toy cars in Toyota booth at CES 2016 l  Ex. 2 Highly accurate FANUC’s bin-picking robot at IREX 2015 ̶  8 hours training to reach expert-level, commercializaOon by 2016 end Chainer in industry: Used in demonstrations & being commercialized 39
  40. 40. Summary l  Chainer is a Python-based deep learning framework with dynamic network construcOon scheme and CuPy l  It is designed for efficient research and prototyping while keeping comparable performance thanks to NVIDIA GPU l  Official web: hZp:// l  Github: hZps:// Your contribuOons will be appreciated & we are hiring! 40