On the benchmark of Chainer

49,950 views

Published on

2nd July 2016 Chainer Meetup #3

Published in: Technology

  1. On the benchmark of Chainer
     July 2, 2016, Chainer Meetup #3 @ Dowango Seminar Room
     Preferred Networks Inc., Kenta Oono, oono@preferred.jp
  2. Self Introduction
     • Kenta Oono (twitter: @delta2323_)
       – Bio: MSc@MathSci Univ. Tokyo → 2012.4 PFI → 2014.10 PFN
       – Role: BioHealthcare project, Chainer dev. team, etc.
       – Blog: http://delta2323.github.io
     • Recent activity
       – Study meetups (NIPS2014, ICML2015, NIPS2015)
       – Several articles and talks on Deep Learning
     July 21: ICML2016 reading group @ Dowango Seminar Room
  3. What is a Benchmark?
     • Metrics that evaluate the performance of frameworks
       – elapsed time, memory consumption, ease of use, etc.
     • Related to, but different from, profiling
       – Profiling needs finer-grained information about a framework, possibly at the cost of performance
       – Benchmarking measures the overall behavior of a framework
     • For framework developers:
       – suggests directions for further enhancement of the framework
       – provides an objective comparison with other frameworks
     • For framework users:
       – helps them choose the framework that best satisfies their needs
  4. Example: convnet-benchmarks
     • Author: Soumith Chintala (Facebook AI Research)
     • Measures latencies of convolutional neural networks
     • Provides an objective comparison across various frameworks
     • Metric
       – Elapsed time of forward and backward propagation
     • Architecture
       – AlexNet-OWT / Overfeat / VGG-A / GoogleNet
       – Single 2D convolution layer of various sizes
     • Frameworks
       – Torch, neon, TensorFlow, fbfft (Torch), Chainer, cudaconvnet2, Caffe, CL-nn, Caffe-CL GreenTea, etc.
  5. convnet-benchmarks
  6. Basics of measurement of kernel execution
     • We cannot measure GPU execution time the way we measure CPU time, because kernel launches are asynchronous!

         clock_t start, end;
         start = clock();
         // launch kernel (returns immediately; the kernel may still be running)
         end = clock();
         clock_t elapsed_time = end - start;

     [Figure: CPU timeline (clock, kick, clock) and GPU timeline (kernel exec.)]
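The pitfall on slide 6 can be reproduced without a GPU. The following is a minimal CPU-only analogy, not CUDA code: a single worker thread stands in for the GPU command queue, and `fake_kernel` and `measure` are hypothetical names introduced only for this sketch. Timing around the asynchronous submit captures only the launch, while timing after an explicit wait captures the execution as well.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_s):
    """Stand-in for a GPU kernel: it just occupies the worker for a while."""
    time.sleep(duration_s)


def measure(duration_s=0.2):
    # A single worker thread plays the role of the GPU command queue (assumption).
    with ThreadPoolExecutor(max_workers=1) as gpu_queue:
        start = time.perf_counter()
        future = gpu_queue.submit(fake_kernel, duration_s)  # asynchronous "launch"
        launch_only = time.perf_counter() - start  # what the clock() pattern sees

        future.result()  # explicit synchronization: wait for the "kernel"
        launch_and_exec = time.perf_counter() - start
    return launch_only, launch_and_exec


launch_only, launch_and_exec = measure()
print(f"timed without sync: {launch_only * 1e3:.2f} ms")
print(f"timed after sync:   {launch_and_exec * 1e3:.2f} ms")
```

The first number should be far smaller than the requested kernel duration, which is the same effect that makes the `clock()` pattern misleading for real kernel launches.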
  7. Basics of measurement of kernel execution
     • We can measure the kernel execution time by recording two events, one before and one after the launch.

         float elapsed = 0;
         cudaEvent_t start, stop;
         cudaEventCreate(&start);
         cudaEventCreate(&stop);
         cudaEventRecord(start, 0);
         // launch the kernel
         cudaEventRecord(stop, 0);
         cudaEventSynchronize(stop);  // wait until the stop event has been reached
         cudaEventElapsedTime(&elapsed, start, stop);  // milliseconds
         cudaEventDestroy(start);
         cudaEventDestroy(stop);

     [Figure: CPU timeline (record, kick, record, sync) and GPU timeline (Event, kernel exec., Event)]
  8. Measurement of single Chainer Function Execution
     • Suppose the GPU implementation of F.f consists of Python code and a single GPU kernel.
     • In this case, the elapsed time is the kernel execution time.

         import cupy
         import chainer.functions as F

         start = cupy.cuda.Event()
         end = cupy.cuda.Event()
         start.record()
         y = F.f(x)  # forward prop
         end.record()
         end.synchronize()
         cupy.cuda.get_elapsed_time(start, end)  # milliseconds

     [Figure: CPU timeline (Python, kick, Python, record, sync) and GPU timeline (kernel exec., Event, kernel exec., Event)]
  9. Measurement of single Chainer Function Execution
     • Suppose
       – no other kernels are waiting in the queue
       – Python overhead is large
       – the kernel is light
     • Then get_elapsed_time returns the whole execution time, including the Python code.

         start = cupy.cuda.Event()
         end = cupy.cuda.Event()
         start.record()
         y = F.f(x)  # forward prop
         end.record()
         end.synchronize()
         cupy.cuda.get_elapsed_time(start, end)

     [Figure: CPU timeline (Python, kick, Python, record, sync) and GPU timeline (Event, kernel exec., Event)]
  10. Measurement of single Chainer Function Execution
     • In general, the elapsed time between the two events differs from what we measured in the two previous situations.
     • What we really measure depends on
       – the state of the waiting queue
       – the relative cost of the Python code and the kernel

         start = cupy.cuda.Event()
         end = cupy.cuda.Event()
         start.record()
         y = F.f(x)  # forward prop
         end.record()
         end.synchronize()
         cupy.cuda.get_elapsed_time(start, end)

     [Figure: CPU timeline (Python, kick, Python, record, sync) and GPU timeline (kernel exec., Event, kernel exec., Event)]
  11. Synchronization before start Event
     • Synchronizing on the start Event ensures that it is reached right before the Python code starts executing.
     • But the timing of the end Event is still undetermined.

         start = cupy.cuda.Event()
         end = cupy.cuda.Event()
         start.record()
         start.synchronize()  # wait until the GPU has reached the start Event
         y = F.f(x)  # forward prop
         end.record()
         end.synchronize()
         cupy.cuda.get_elapsed_time(start, end)

     [Figure: CPU timeline (record, sync, Python, kick, ...) and GPU timeline (kernel exec., Event, kernel exec., ...)]
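The situations on slides 8 to 11 can be compared with a small timeline model. This is only a toy sketch under simplifying assumptions that are not in the slides: one GPU queue, one kernel per function, and a Python part split into a portion before and a portion after the kick. The function `event_elapsed` and its parameters are hypothetical names, and the reading of slide 8 as a "busy queue" case is inferred from the contrasting assumptions on slide 9.

```python
def event_elapsed(backlog, python_before, kernel, python_after, sync_start=False):
    """Toy model of the time between the start and end events.

    All arguments share one arbitrary time unit.
    backlog       -- GPU work already queued when start.record() is called
    python_before -- Python time inside F.f before the kernel is kicked
    kernel        -- execution time of the single kernel
    python_after  -- Python time inside F.f after the kick, before end.record()
    sync_start    -- model a start.synchronize() right after start.record()
    """
    # The GPU reaches the start event once the earlier work has drained.
    start_ts = float(backlog)
    # The CPU starts the Python code at once, or after waiting for the start event.
    cpu = start_ts if sync_start else 0.0
    cpu += python_before                      # the CPU kicks the kernel here
    kernel_end = max(cpu, start_ts) + kernel  # kernel runs once queued and GPU is free
    cpu += python_after                       # the CPU calls end.record() here
    end_ts = max(cpu, kernel_end)             # the end event follows the kernel
    return end_ts - start_ts


# Busy queue: only the kernel is seen.
print(event_elapsed(backlog=10, python_before=1, kernel=5, python_after=1))  # 5.0
# Empty queue, heavy Python, light kernel: the Python time dominates.
print(event_elapsed(backlog=0, python_before=3, kernel=1, python_after=3))   # 6.0
# In between: neither the kernel time nor the Python time.
print(event_elapsed(backlog=2, python_before=3, kernel=5, python_after=1))   # 6.0
# Synchronizing on the start event pins the start, but not the end.
print(event_elapsed(backlog=10, python_before=1, kernel=5, python_after=1,
                    sync_start=True))                                        # 6.0
```

In this model the measured value is always `max(CPU arrival, kernel end) - start`, so which term wins depends on the queue state and on the balance between Python and kernel time, as the slides describe.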
  12. Measurement of multi-layered NNs
     • Should we insert synchronization points before every function execution?
     • But that exposes Python code that would have been hidden behind kernel execution if it were not for the synchronization.
     • I guess this is the reason why convnet-benchmarks offers architectures that consist of a single convolution layer.
     [Figure: CPU and GPU timelines with a sync before each function. Annotation on the exposed Python code: "It would be hidden by the kernel execution if we did not measure elapsed times."]
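The cost of synchronizing before every function can be illustrated with a similar toy model. The sketch below assumes each layer is one Python part followed by one kernel on a single queue; `total_time` and the layer costs are hypothetical values chosen for illustration, not measurements.

```python
def total_time(layers, sync_every_layer):
    """Toy model of the wall-clock time of a multi-layer forward pass.

    layers           -- list of (python_cost, kernel_cost) pairs, one per layer
    sync_every_layer -- if True, the CPU waits for the GPU before each layer
    """
    cpu = 0.0       # time at which the CPU is ready for the next layer
    gpu_free = 0.0  # time at which the GPU finishes its queued kernels
    for python_cost, kernel_cost in layers:
        if sync_every_layer:
            cpu = max(cpu, gpu_free)      # synchronization point before the layer
        cpu += python_cost                # Python part; ends by kicking the kernel
        kernel_start = max(cpu, gpu_free) # kernel waits for earlier kernels
        gpu_free = kernel_start + kernel_cost
    return max(cpu, gpu_free)             # final synchronization


layers = [(1.0, 3.0)] * 3  # three layers: 1 unit of Python, 3 units of kernel
print(total_time(layers, sync_every_layer=False))  # 10.0
print(total_time(layers, sync_every_layer=True))   # 12.0
```

Without synchronization the Python parts of the second and third layers overlap with the previous kernels; with a synchronization point before each layer they are serialized, so the measured run is slower than the run being measured.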
  13. Tentative solution (Timer class: PR #1249)
     • Offers start and stop methods for measuring lap times.
     • Three patterns for synchronization before measurement, selected by the blocking_method argument
       – block_every_time: synchronizes on every start event
       – block_first_time: synchronizes only on the first start event
       – non_block: does not synchronize at the start of measurement
     • When we get the total time, the Timer class implicitly calls the synchronize method.
     • The synchronize method synchronizes all Events inserted by start and stop and calculates lap times lazily.
     • Once synchronize is invoked, the timer CANNOT accumulate lap times until it is reset.
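The behavior listed on slide 13 can be sketched as follows. This is not the code from PR #1249: it is a minimal illustration written from the bullet points alone, and the method names `reset`, `synchronize`, and `total_time`, the `event_factory` and `elapsed_fn` parameters, and the `FakeEvent` class are all assumptions. `FakeEvent` is a CPU-only stand-in so the sketch runs without a GPU; with CuPy one would presumably pass `cupy.cuda.Event` and `cupy.cuda.get_elapsed_time` instead, which is untested here.

```python
import time


class FakeEvent:
    """CPU-only stand-in for a GPU event (hypothetical, for illustration)."""

    def __init__(self):
        self.timestamp = None

    def record(self):
        self.timestamp = time.perf_counter()

    def synchronize(self):
        pass  # a CPU timestamp is complete as soon as it is recorded


def fake_elapsed_ms(start, stop):
    return (stop.timestamp - start.timestamp) * 1000.0


class Timer:
    METHODS = ("block_every_time", "block_first_time", "non_block")

    def __init__(self, blocking_method="non_block",
                 event_factory=FakeEvent, elapsed_fn=fake_elapsed_ms):
        if blocking_method not in self.METHODS:
            raise ValueError("unknown blocking_method: %r" % (blocking_method,))
        self._blocking_method = blocking_method
        self._event_factory = event_factory
        self._elapsed_fn = elapsed_fn
        self.reset()

    def reset(self):
        self._pairs = []   # [start_event, stop_event] per lap
        self._laps = None  # lap times, computed lazily by synchronize()

    def _running(self):
        return bool(self._pairs) and self._pairs[-1][1] is None

    def start(self):
        if self._laps is not None:
            raise RuntimeError("timer was synchronized; call reset() first")
        if self._running():
            raise RuntimeError("start() called twice without stop()")
        first = not self._pairs
        event = self._event_factory()
        event.record()
        if (self._blocking_method == "block_every_time"
                or (self._blocking_method == "block_first_time" and first)):
            event.synchronize()  # block until the start event is reached
        self._pairs.append([event, None])

    def stop(self):
        if self._laps is not None or not self._running():
            raise RuntimeError("stop() called without a matching start()")
        event = self._event_factory()
        event.record()
        self._pairs[-1][1] = event

    def synchronize(self):
        if self._laps is None:
            if self._running():
                raise RuntimeError("stop() the current lap before synchronizing")
            for start, stop in self._pairs:
                start.synchronize()
                stop.synchronize()
            self._laps = [self._elapsed_fn(s, e) for s, e in self._pairs]
        return self._laps

    def total_time(self):
        return sum(self.synchronize())  # implicit synchronization


timer = Timer(blocking_method="block_first_time")
for _ in range(3):
    timer.start()
    time.sleep(0.01)  # stand-in for y = F.f(x)
    timer.stop()
print("laps (ms):", timer.synchronize())
print("total (ms):", timer.total_time())
```

Deferring all synchronization to `synchronize` keeps the `non_block` mode from stalling the CPU between laps, at the price that the timer has to be reset before it can record again.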
  14. DeepMark, authored by Soumith Chintala
     • Comparison with convnet-benchmarks
       – Covers not only image recognition but also various other use cases
       – Employs relatively new architectures
       – Multi-GPU evaluation will be supported (planned)
     • Many details of the specification are under discussion.
     • Architectures (planned)
       – Images: InceptionV3-batchnorm / Alexnet-OWT / VGG-D / ResNet-50
       – Video: C3D
       – Audio: DeepSpeech2 / MSR's 5 layer FC
       – Text: Small RNN LSTM / Large RNN LSTM
     • Chainer support (delta2323/chainer-deepmark)
       – Not all features are supported (see issue for details)
  15. Conclusion
     • Measuring the elapsed time of multi-layered NNs involves many considerations.
     • We will participate in DeepMark, a general-purpose deep learning benchmark.
     • Many criteria are to be measured:
       – Elapsed time (today's topic)
       – Memory consumption
       – etc.
     We are hiring!