On Benchmarking Chainer
July 2, 2016
Chainer Meetup #3 @ Dwango Seminar Room
Preferred Networks Inc.
Kenta Oono
oono@preferred.jp
Self Introduction
• Kenta Oono (twitter: @delta2323_)
– Bio: MSc @ Math. Sci., Univ. of Tokyo → 2012.4 PFI → 2014.10 PFN
– Role: BioHealthcare project, Chainer dev. team, etc.
– blog: http://delta2323.github.io
• Recent activity
– Study meetup (NIPS2014, ICML2015, NIPS2015)
– Several articles and talks on Deep Learning
July 21: ICML2016 paper reading meetup
@ Dwango Seminar Room
What is a Benchmark?
• Metrics that evaluate the performance of frameworks
– elapsed time, memory consumption, ease of use, etc.
• Related to, but different from, profiling
– Profiling needs finer-grained information about a framework, possibly at the cost of performance
– Benchmarking measures the overall behavior of a framework
• For framework developers:
– provides suggestions for further enhancement of the framework
– provides an objective comparison with other frameworks
• For framework users:
– helps them choose the framework that best satisfies their needs
Example: convnet-benchmarks
• Author: Soumith Chintala (Facebook AI Research)
• Measures latencies of convolutional neural networks
• Provides objective comparison across various frameworks
• Metric
– Elapsed time of forward and backward propagation
• Architecture
– AlexNet-OWT / Overfeat / VGG-A / GoogLeNet
– Single 2D convolution layer of various sizes
• Frameworks
– Torch, neon, TensorFlow, fbfft (Torch), Chainer, cuda-convnet2, Caffe,
CL-nn, Caffe-CL GreenTea, etc.
Basics of measurement of kernel execution
• We cannot measure GPU execution time the same way as CPU time, because kernel launches are asynchronous!
clock_t start, end;
start = clock();
// launch kernel (returns immediately; execution is asynchronous)
end = clock();
// end - start covers only the launch overhead, not the kernel
double elapsed_time = (double)(end - start) / CLOCKS_PER_SEC;
[Diagram: the CPU takes both clock readings around the kick, while the kernel executes asynchronously on the GPU afterwards.]
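• The same pitfall appears in CuPy: wall-clock timing around an asynchronous call measures only the launch overhead. A minimal sketch (the array size and the GEMM call are illustrative assumptions):
import time
import cupy

x = cupy.random.rand(4096, 4096).astype(cupy.float32)

t0 = time.time()
y = cupy.dot(x, x)  # enqueues a GEMM kernel and returns immediately
t1 = time.time()    # t1 - t0 covers mostly the launch overhead

cupy.cuda.Device().synchronize()  # block until the kernel finishes
t2 = time.time()    # t2 - t0 includes the kernel execution time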
Basics of measurement of kernel execution
• We can measure the kernel execution time
by inserting two events at the start and end
of the launch.
float elapsed = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// launch the kernel
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
[Diagram: the start and stop events are recorded on the GPU stream around the kernel; cudaEventSynchronize waits until the stop event completes.]
Measurement of a Single Chainer Function Execution
• Suppose the GPU impl. of F.f consists of a Python part and a single GPU kernel.
• In this case, the elapsed time measures the kernel execution.
[Diagram: F.f runs Python code and kicks one kernel; the start and end events bracket the kernel execution on the GPU.]
start = cupy.cuda.Event()
end = cupy.cuda.Event()
start.record()
y = F.f(x)  # forward prop
end.record()
end.synchronize()
cupy.cuda.get_elapsed_time(start, end)
Measurement of a Single Chainer Function Execution
• Suppose
– no other kernels are waiting in the queue
– the Python overhead is large
– the kernel is light
• get_elapsed_time then equals the whole execution time, including the Python code.
[Diagram: the GPU is idle while the Python code runs, so the events bracket the Python execution plus the light kernel.]
start = cupy.cuda.Event()
end = cupy.cuda.Event()
start.record()
y = F.f(x)  # forward prop
end.record()
end.synchronize()
cupy.cuda.get_elapsed_time(start, end)
Measurement of a Single Chainer Function Execution
• In general, the elapsed time between the two events differs from what we measured in the two previous situations.
• What we really measure depends on
– the status of the waiting queue
– the amounts of Python code and kernel work
[Diagram: kernels already in the queue delay the start event, so the measured interval mixes queueing, Python, and kernel time.]
start = cupy.cuda.Event()
end = cupy.cuda.Event()
start.record()
y = F.f(x)  # forward prop
end.record()
end.synchronize()
cupy.cuda.get_elapsed_time(start, end)
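• For example, the following sketch (sizes are illustrative assumptions) enqueues a heavy kernel first; the start event then completes only after that kernel finishes, so the measured interval no longer includes the Python overhead:
import cupy
import chainer.functions as F

a = cupy.random.rand(8192, 8192).astype(cupy.float32)
cupy.dot(a, a)  # a heavy kernel is now running/queued on the GPU

start = cupy.cuda.Event()
end = cupy.cuda.Event()
start.record()   # this event completes only after the heavy kernel finishes
y = F.relu(a)    # the light call we want to time
end.record()
end.synchronize()
print(cupy.cuda.get_elapsed_time(start, end))  # Python overhead is hidden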
Synchronization before start Event
• It ensures that the start Event is recorded right before the Python code executes.
• But the timing of the end Event is still undetermined.
start = cupy.cuda.Event()
end = cupy.cuda.Event()
start.record()
start.synchronize()
y = F.f(x)  # forward prop
end.record()
end.synchronize()
cupy.cuda.get_elapsed_time(start, end)
[Diagram: synchronizing on the start event pins it right before the Python code, but the end event may still be delayed by kernels left in the queue.]
Measurement of multi-layered NNs
• Should we insert synchronization points before every function execution?
• But that exposes Python overhead that would have been hidden by the kernel execution if it were not for the synchronization.
• I guess this is why convnet-benchmarks offers architectures that consist of a single convolution layer.
• An alternative that avoids per-layer synchronization is sketched below the diagram.
[Diagram: with a synchronization point before every function, each sync exposes the Python time between kernels; that time would be hidden by the kernel execution if we did not measure elapsed times.]
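• One way to avoid per-layer synchronization is to bracket the whole forward pass with a single pair of events. A minimal sketch, assuming a toy 3-layer MLP (the model and sizes are illustrative, not from the talk):
import cupy
import chainer
import chainer.functions as F
import chainer.links as L

# Toy 3-layer MLP, purely for illustration.
model = chainer.ChainList(
    L.Linear(1000, 1000),
    L.Linear(1000, 1000),
    L.Linear(1000, 10),
)
model.to_gpu()
x = cupy.random.rand(64, 1000).astype(cupy.float32)

start = cupy.cuda.Event()
end = cupy.cuda.Event()
start.record()
h = x
for link in model:
    h = F.relu(link(h))  # Python overhead can overlap with queued kernels
end.record()
end.synchronize()  # a single synchronization at the very end
print(cupy.cuda.get_elapsed_time(start, end))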
Tentative solution (Timer class: PR #1249)
• Offers start and stop methods for measuring lap times.
• Three patterns of synchronization before measurement, selected by the blocking_method argument:
– block_every_time: synchronizes at every start event
– block_first_time: synchronizes only at the first start event
– non_block: does not synchronize at the start of measurement
• When we get the total time, the Timer class implicitly calls the synchronize method.
• The synchronize method synchronizes all Events inserted by start and stop and calculates the lap times lazily.
• Once synchronize is invoked, the timer CANNOT accumulate lap times until it is reset.
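• A hypothetical usage sketch based only on the API described above; the import path and the total_time() accessor are assumptions, since the PR was still under review:
# Hypothetical sketch of the Timer proposed in PR #1249; the import path
# and total_time() are assumptions, not a confirmed API.
from chainer.utils.timer import Timer  # assumed location

timer = Timer(blocking_method='block_first_time')
for _ in range(10):
    timer.start()    # records a start event (syncs on the first lap only)
    y = F.f(x)       # the code under measurement
    timer.stop()     # records a stop event
timer.synchronize()   # resolves all events and computes lap times lazily
print(timer.total_time())  # assumed accessor for the accumulated time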
DeepMark, authored by Soumith Chintala
• Comparison with convnet-benchmarks
– Not only image recognition but also various use cases
– Relatively new architectures are employed
– Multi-GPU evaluation will be supported (planned)
• Many details of the specification are under discussion.
• Architectures (planned)
– Images : InceptionV3-batchnorm / Alexnet-OWT / VGG-D / ResNet-50
– Video: C3D
– Audio: DeepSpeech2 / MSR's 5 layer FC
– Text: Small RNN LSTM / Large RNN LSTM
• Chainer support (delta2323/chainer-deepmark)
– Not all features are supported (see issue for details)
Conclusion
• Measuring the elapsed time of multi-layered NNs involves many subtleties.
• We will participate in DeepMark, a general-purpose deep learning benchmark.
• Many criteria are to be measured:
– Elapsed time <- Today’s topic
– Memory consumption
– etc…
We are hiring!