Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius
AI on EDGE
GPU VS. VPU
byteLAKE’s basic benchmark results between two different setups
of example edge devices: with NVIDIA GPU and with Intel’s
pl. Solny 14/3
50-062 Wroclaw, Poland
+48 508 091 885
+48 505 322 282
+1 650 735 2063
AI on EDGE: GPU vs. VPU Jul-18 2
Tests were run on two Lenovo Tiny PCs.
Tiny#1: Lenovo ThinkCentre M910x Tiny
• CPU: Intel Core i7-7700T vPro
• AI accelerator: 2 x Intel Movidius Myriad 2 VPU
• Memory: 4 GB LPDDR3
• System: Ubuntu 16.04 LTS
Tiny#2: Lenovo ThinkCentre M920x Tiny
• CPU: Intel Core™ i5-8500T
• AI accelerator: NVIDIA Quadro P1000
• Memory: 4 GB GDDR5
• System: Ubuntu 18.04 LTS
• Frameworks: Caffe, TensorFlow, OpenCV 3.4
o Tiny #1: Intel Movidius Neural Compute SDK v1
o Tiny #2: NVIDIA GPU drivers ver. 390.48; CUDA Toolkit 8
Test procedure description:
During the study, we analyzed the performance of the two Tiny PCs using the state-of-the-art YOLO
(You Only Look Once) real-time detection model. In both cases we focused on a smaller variant of
the YOLO model, called Tiny YOLO.
The model consists of a single input layer, 8 convolution layers, 8 batch norm layers, 8 ReLU layers and
a single fully connected layer. Tiny YOLO is able to recognize objects from 20 classes: aeroplane,
bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person,
potted plant, sheep, sofa, train and tv-monitor. The size of the pre-trained Tiny YOLO detection model
is 50 MB.
The deep neural net (DNN) used for this study was implemented with the Python interface of the Caffe
AI framework. Benchmarks were based on real-time analysis of a sequence of frames captured from the
camera. They were performed using three configurations of the AI devices:
• single Movidius Myriad accelerator enabled (Tiny#1);
• two Movidius Myriad cards enabled (Tiny#1);
• single NVIDIA GPU (Tiny#2).
The procedure to assess the overall performance of the above Tiny PC configurations took into account
all steps required to generate the resulting video, including:
• grabbing of the frames from the camera;
• frame preparation;
• forwarding of the images through the deep neural net;
• filtering the results of the analysis;
• drawing the results on the frame;
• presenting the results of the analysis.
To keep the measurements objective across all configurations, the analysis was performed for a fixed
number of frames. We used two performance criteria: (i) the average Frames Per Second (FPS) value,
and (ii) the execution time of AI computations using all above
Figure 1 below presents the measurement method in detail (sample code from the single-Movidius
configuration; for the others, the method was implemented in a similar fashion).
YOLO: Real-Time Object Detection, URL: https://pjreddie.com/darknet/yolo/
Figure 1. Adopted method of performance measurements for a single Movidius Myriad accelerator
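As a rough illustration of the measurement idea (this is not the code shown in Figure 1), the sketch below times a fixed number of frames through the pipeline steps listed earlier. All stage functions (`grab_frame`, `prepare`, `infer`, `filter_results`, `draw`, `show`) are hypothetical placeholders, and timing the inference step separately is an assumption about what "execution time of AI computations" covers.

```python
import time

NUM_FRAMES = 500  # number of frames analyzed in the benchmark


# Hypothetical placeholder stages. A real setup would call the camera,
# the accelerator SDK or framework, and a display routine here.
def grab_frame(index):
    return {"id": index}

def prepare(frame):
    return frame

def infer(frame):
    return [("person", 0.9)]

def filter_results(detections, threshold=0.5):
    return [d for d in detections if d[1] >= threshold]

def draw(frame, detections):
    frame["detections"] = detections
    return frame

def show(frame):
    pass


def run_benchmark(num_frames=NUM_FRAMES):
    """Time the whole pipeline and, separately, the inference step."""
    ai_time = 0.0
    start = time.perf_counter()
    for i in range(num_frames):
        frame = prepare(grab_frame(i))
        t0 = time.perf_counter()
        detections = infer(frame)
        ai_time += time.perf_counter() - t0
        show(draw(frame, filter_results(detections)))
    total_time = time.perf_counter() - start
    return total_time, ai_time, num_frames / total_time


total_time, ai_time, fps_avg = run_benchmark()
print(f"total: {total_time:.3f} s, AI only: {ai_time:.3f} s, FPS: {fps_avg:.1f}")
```

Measuring the whole loop rather than inference alone keeps camera access, pre-processing and drawing inside the reported frame rate, which matches the end-to-end intent of the procedure.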
The tests described above were based on RGB frames grabbed by a Creative Live! Cam Sync USB camera.
The original size of a single frame was 1080 x 720 (HD) pixels, but because of the input layer structure
required by the YOLO detector, we resized the frames to 448 x 448 RGB pixels.
The benchmarks were carried out for a sequence of 500 frames.
The performance results for the different configurations of AI accelerators are presented in Table 1 below.
The average FPS factor was calculated using the following formula:
FPSavg = 500 / Ta
where Ta refers to the time of the overall analysis of 500 frames (as described above).
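The formula can be written as a small helper. This is only a sketch: the function name `average_fps` and the 25-second example value are illustrative assumptions, not measurements from the study.

```python
def average_fps(num_frames, total_time_s):
    """FPSavg = num_frames / Ta, with Ta given in seconds."""
    if total_time_s <= 0:
        raise ValueError("total time must be positive")
    return num_frames / total_time_s


# Hypothetical example: 500 frames processed in 25 seconds.
fps = average_fps(500, 25.0)
print(fps)  # 20.0
```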
Table 1. Performance results
                     1 x Movidius Myriad 2   2 x Movidius Myriad 2   1 x NVIDIA P1000 GPU
Time [s]             123.1                   69.8                    23.3
Average FPS factor   4.05                    7.16                    21.3
As expected, the best performance was achieved with the GPU accelerator.
Processing 500 frames with this version took ca. 23 seconds, which corresponds to an average rate of
ca. 21 frames per second. Consequently, a single GPU turned out to be 5.28 times
faster than a single Myriad chip and 2.99 times faster than the configuration with two Movidius
accelerators (at least for the given benchmark procedure).
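The speedup factors are ratios of the measured times. A minimal sketch using the times from Table 1 is given below; the helper name `speedup` is an assumption. Because the times in the table are rounded, the recomputed ratios may differ from the quoted figures in the last digit.

```python
# Times for 500 frames as reported in Table 1, in seconds.
times_s = {"1 x Myriad 2": 123.1, "2 x Myriad 2": 69.8, "GPU": 23.3}


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


gpu_vs_single = speedup(times_s["1 x Myriad 2"], times_s["GPU"])
gpu_vs_double = speedup(times_s["2 x Myriad 2"], times_s["GPU"])
print(f"GPU vs. 1 x Myriad 2: {gpu_vs_single:.2f}x")
print(f"GPU vs. 2 x Myriad 2: {gpu_vs_double:.2f}x")
```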
In the scenario where both Movidius cards were enabled, we developed an approach that allowed the
frames grabbed from the camera to be analyzed in parallel. As a result, this version was 1.76
times faster than the version with a single Myriad chip. In the given scenario, a single Intel Movidius
performed at only ca. 4 FPS, whereas the double-Movidius configuration reached
ca. 7 FPS.
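The report does not detail how the frames were distributed between the two accelerators. One common pattern, assumed here purely for illustration, is a shared frame queue with one worker thread per device. In the sketch below, `fake_infer` is a hypothetical stand-in for the real inference call, and the sleep only simulates a busy device.

```python
import queue
import threading
import time


def fake_infer(device_id, frame):
    """Hypothetical stand-in for inference on one accelerator."""
    time.sleep(0.001)  # pretend the device is busy
    return {"frame": frame, "device": device_id}


def worker(device_id, tasks, results):
    # Each worker owns one device and pulls frames until it sees the sentinel.
    while True:
        item = tasks.get()
        if item is None:
            break
        index, frame = item
        results[index] = fake_infer(device_id, frame)


def analyze_parallel(frames, num_devices=2):
    tasks = queue.Queue()
    results = [None] * len(frames)  # slot per frame keeps the original order
    threads = [threading.Thread(target=worker, args=(d, tasks, results))
               for d in range(num_devices)]
    for t in threads:
        t.start()
    for item in enumerate(frames):
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


results = analyze_parallel([f"frame-{i}" for i in range(20)])
print(len(results), results[0]["frame"])
```

Storing each result at its frame index keeps the output in capture order even when the devices finish out of order, which matters when the annotated frames are displayed as a video.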
The results of this study show that using a GPU for object detection based on the YOLO model makes it
possible to analyze data in real time. At the same time, neither a single Intel Movidius nor two Intel
Movidius chips provide the desired efficiency in the given scenario. However, they can still be used
successfully in applications where real-time processing is not necessary and near-real-time is enough.
The comparison of both devices is presented in Table 2 below. Based on the knowledge gained
during this study, we conclude that the advantage of the NVIDIA GPU over the Intel Movidius VPU is not
only in computational performance. The GPU allows for both training of DNNs and inference,
whereas Movidius is designed only to work with pre-trained models.
Another difference between the two accelerators is their support for various AI
libraries/frameworks. While Movidius supports two popular frameworks (Caffe and
TensorFlow), the GPU supports more AI libraries, e.g. cuDNN or Theano.
The difference between these two accelerators can also be noticed in the programming
process. In many cases, implementing an application that uses a GPU does not require any
special knowledge about the accelerator itself. Most AI frameworks provide built-in support
for GPU computing (both training and inference) out of the box. In the Movidius case, however, it is
also necessary to learn its SDK. It is not a painful process but still yet another tool in
The two accelerators also differ in their area of use. While the GPU is a
powerful accelerator for AI computations, the power consumption and size of this kind of accelerator
can be an obstacle in many areas. A GPU offers notably high computational performance (on the order of
a few TFlops or more); however, it is usually dedicated to HPC solutions. At the same time, Intel
Movidius is a low-power AI solution dedicated to on-device computer vision. Its size and power
consumption make it attractive for many uses, e.g. IoT solutions, drones or smart security.
Given the context above, here are some additional remarks one might consider when deciding which
accelerator is a better fit for a given design. It is important to emphasize that comparing
Movidius and NVIDIA as two competing accelerators for AI workloads leads to the conclusion that
the two are meant for different tasks. Therefore, looking at them only through the lens of
performance benchmark results might be misleading. To choose properly between Movidius and an
NVIDIA GPU, one should first take into account the intended application rather than the
benchmark results alone. Movidius is primarily designed to execute AI workloads based
on trained models (inference). NVIDIA’s GPU, on the other hand, can do that plus training. So it
really depends on whether the planned device is to work in execute-only mode or also be capable of
updating/re-training its models (brains). And of course, this makes sense only as long as such
tasks can be executed within a reasonable time frame.
Table 2. Comparison of NVIDIA GPU and Intel Movidius VPU
                      INTEL MOVIDIUS                      NVIDIA GPU
FOR INFERENCING       YES                                 YES
FOR TRAINING          NO                                  YES
AI FRAMEWORKS         CAFFE / TENSORFLOW                  CAFFE / TENSORFLOW / CUDNN
MAX MODEL SIZE        320 MB                              No limit
EASY TO CODE?         Apart from knowledge about AI,      Programming AI applications
                      programmers need to learn the       requires knowledge about the
                      Movidius programming SDK.           utilized library/framework, e.g.
                                                          Caffe or TensorFlow.
FORM FACTOR           Small (e.g. mobile, IoT)            medium+
POWER CONSUMPTION     Low, ~1 W                           medium+
HEATING               +                                   -
CAN WORK OFFLINE      Yes                                 Yes
MAIN PURPOSE          Classification and recognition of
OS                    Ubuntu 16.04, Raspberry Pi 3        As long as the drivers are
                                                          available (Windows, Linux)
COMPUTATIONAL POWER   150 GFlops                          Very high, TFlops and higher
OTHER                 Imaging/vision accelerators
                      included (12 specialized vector
                      VLIW processors (SHAVEs) +
ARITHMETIC            8/16/32 integer, 16/32 floating
PRICE TAG             <$80                                $100+
Contact us at: welcome@byteLAKE.com
Learn how we work:
We start with a consultancy
session to better understand our
client’s requirements &
We thoroughly analyze the
gathered information and
prepare a draft offer.
We fine tune the offer further
and wrap up everything into a
Finally, the execution starts. We
deliver projects in a fully
transparent, Agile (SCRUM-
We build Artificial Intelligence
software and integrate that into
We port and optimize algorithms
for parallel, CPU+GPU HPC
We deploy AI on data centers, the
cloud and constrained, embedded
devices (AI on Edge).
We are specialists in:
Helping companies transform
for the era of Artificial Intelligence.
We are a team of scientists, programmers, designers
and technology enthusiasts helping industries incorporate
AI techniques into products.
High Performance Computing