This document summarizes byteLAKE’s benchmark results for two alternative setups of AI edge devices. The first was equipped with an NVIDIA Quadro P1000; the other with an AMD industry customized working sample GfX.
Key takeaway: NVIDIA not only offers better performance but, thanks to its much wider software ecosystem, is also supported by more frameworks. It will be very interesting to see how the AMD sample develops further and enters the AI and, in particular, the Computer Vision ecosystem.
Benchmark of AI edge devices: powered by NVIDIA Quadro P1000 vs. AMD industry customized working sample GfX
NVIDIA VS. AMD
byteLAKE’s Computer Vision benchmark results between two edge devices: powered by NVIDIA vs. AMD industry customized working sample GfX
Europe & USA
+48 508 091 885
+48 505 322 282
+1 650 735 2063
Benchmark: NVIDIA vs. AMD variants of edge devices Jul-19 2
Specification of Target Platforms
Tests were run on two edge devices. The parameters of these platforms are presented below.
Edge device with NVIDIA GPU
• CPU: Intel Core™ i5-8500T
• GPU accelerator: NVIDIA Quadro P1000
• GPU memory: 4 GB GDDR5
• System: Ubuntu 18.04 LTS
• Drivers: CUDA 10.1, cuDNN 7.5, NVIDIA Graphic Drivers 410.104
Edge device with AMD GPU
• CPU: Intel Core™ i5-8600T
• GPU accelerator: AMD industry customized working sample GfX
• GPU memory: 8 GB GDDR5
• System: Ubuntu 18.04 LTS
• Drivers: AMDGPU-PRO Driver 19.10
The OpenCV library and the Caffe AI framework were installed on both platforms. We used OpenCV 3.4.4, compiled and installed from the source code (https://github.com/opencv/opencv/archive/3.4.4.zip). In the case of the Caffe framework, we used the caffe-cuda implementation for the NVIDIA-based platform, while for the AMD-based platform we utilized the caffe-opencl version. In both cases, the Caffe framework was downloaded from the official BVLC GitHub repository.
Remark: software installation
During this work, we encountered several problems with software installation and configuration, especially on the AMD GPU. While installation and configuration of the necessary software components (CUDA, cuDNN, etc.) were straightforward on the NVIDIA-based platform, setting up the required software on the device equipped with the AMD GPU was quite a challenge.
The first problem on the AMD-based platform was observed during installation of the graphics drivers. It is hard to find a driver dedicated to the AMD industry customized working sample GfX on the AMD website. In our case, it was necessary to install several versions of the drivers in order to find the correct one. We considered both the official and the open-source drivers. In fact, most of the official drivers (including the latest ones) caused numerous problems during the system boot process. Advice from users found on the internet (e.g. Stack Overflow) did not resolve these issues. At the same time, the open-source drivers did not work well enough (i.e. there were problems with utilizing the GPU device for AI).
Another problem relates to the AI frameworks. Fewer frameworks work with AMD GPUs than with NVIDIA GPUs, and the installation process is more complicated. We observed this while trying to install the Caffe framework from the AMD GitHub repository: some of the additional software necessary for the installation was unavailable. As a result, we decided to install the Caffe framework from the BVLC GitHub repository.
Description of Test Procedures
During the course of the studies, we analyzed the performance using the state-of-the-art YOLO (You Only Look Once) real-time object detection model. In both cases we focused on a special smaller version of the YOLO network, called Tiny YOLO.
The model consists of a single input layer, 8 convolutional layers, 8 batch-normalization layers, 8 ReLU layers and a single fully connected layer. Tiny YOLO is able to recognize objects from 20 classes, including: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and tv-monitor. The size of the pre-trained Tiny YOLO model is 50 MB. The deep neural network (DNN) used for this study was implemented using the Python interface of the Caffe AI framework.
For the purpose of this work, we developed two test scenarios. The first assesses the overall performance of the whole platform (procedure #1), while the second evaluates only the performance of the GPU accelerators (procedure #2). The procedures are described in detail in the next subsections.
This procedure assesses the overall performance of the configurations described above and takes into account all steps required to generate the resulting movie, including:
• reading frames of the movie;
• frame preparation;
• forwarding the images through the Tiny YOLO net;
• filtering the results of the analysis;
• drawing the results on the frame;
• presenting the results of the analysis.
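Since the original listing from Figure 1 is not reproduced here, the measurement method can be sketched roughly as follows; `read_frame` and `detect` are hypothetical stand-ins for the movie reader and the Tiny YOLO forward pass, respectively:

```python
import time

def run_procedure_1(read_frame, detect, n_frames=500):
    """Procedure #1: time every step, from reading a frame to
    presenting the results, for a fixed number of frames."""
    start = time.time()
    for _ in range(n_frames):
        frame = read_frame()                 # read a frame of the movie
        detections = detect(frame)           # prepare frame, forward through Tiny YOLO
        kept = [d for d in detections if d]  # filter the results of the analysis
        # drawing and presenting the results would also happen here
    elapsed = time.time() - start
    return elapsed, n_frames / elapsed       # total time and average FPS
```

In the real benchmark, `detect` would wrap the Caffe net’s forward pass, and the kept detections would be drawn onto the frame before it is displayed.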
Figure 1 presents the Python implementation of the above measurement method. To ensure objective measurements across all configurations, the analysis was performed for a defined number of frames. We adopted two performance criteria: (i) the execution time of the AI computations for each of the above-mentioned configurations and (ii) the average frames-per-second (FPS) value.
Figure 1. Adopted method of measurements used to evaluate the performance of the platform as a whole
The second procedure, on the other hand, evaluates the performance of each GPU accelerator itself. It is based on measuring the total time required to forward the images through the Tiny YOLO net. As in the previous scenario, the test is executed for a defined number of frames. The source code of this method is presented in Figure 2.
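Procedure #2 differs from procedure #1 only in what is timed: the clock runs solely around the forward pass. A minimal sketch, again with hypothetical `read_frame`/`forward` stand-ins for the listing in Figure 2:

```python
import time

def run_procedure_2(read_frame, forward, n_frames=500):
    """Procedure #2: accumulate only the time spent forwarding images
    through the Tiny YOLO net; reading and preparation are excluded."""
    total = 0.0
    for _ in range(n_frames):
        frame = read_frame()            # not included in the measurement
        start = time.time()
        forward(frame)                  # forwarding through the net
        total += time.time() - start    # GPU inference time only
    return total
```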
Figure 2. The method used to assess the performance of the GPU accelerator only
The tests presented above were based on RGB frames read from the movie. The original size of a single frame was 1920 × 1080 pixels (Full HD), but due to the required shape of the input layer of the Tiny YOLO detector, the frames had to be resized to 448 × 448 RGB pixels. The benchmarks were executed for a sequence of 500 frames. Figure 3 presents a resulting frame of the analyzed movie.
Figure 3. Single frame of movie analyzed using Tiny-YOLO DNN
Table 1 presents the performance results obtained for the first testing scenario. The average FPS factor
was calculated using the following formula:
FPSavg = 500 / Ta
where Ta refers to the time of the overall analysis of the 500 frames (as described above). To ensure the reliability of the performance results, the execution-time measurements were repeated r = 10 times, and the median values are reported.
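Applied to the measured analysis times, the formula reproduces the reported averages; a small check in Python (the constant 500 is the benchmarked frame count):

```python
def fps_avg(t_a, n_frames=500):
    """Average FPS: FPSavg = n_frames / Ta,
    where Ta is the overall analysis time in seconds."""
    return n_frames / t_a

# Measured analysis times from Table 1:
nvidia_fps = fps_avg(14.54)   # about 34.4 FPS
amd_fps = fps_avg(21.16)      # about 23.6 FPS
```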
Table 1. Performance results obtained for the whole platform
                     NVIDIA GPU   AMD GPU
Analysis time [s]         14.54     21.16
Average FPS factor        34.39     23.62
Analyzing the results shown in Table 1, we can observe that the NVIDIA-based platform provides higher computational performance. The total time needed to analyze the sequence of 500 frames is 14.54 seconds for the device with the NVIDIA GPU, whereas the same analysis on the device equipped with the AMD GPU takes 1.46 times longer, i.e. 21.16 seconds. The average FPS factor achieved by the platforms is 34.39 and 23.62 for NVIDIA and AMD, respectively. Figure 4 presents screenshots of the system terminals showing the achieved performance and information about the devices.
showing the achieved performance and information about devices.
Figure 4. The screenshots of the system terminals of PCs equipped with: a) NVIDIA GPU
and b) AMD GPU, respectively
Table 2 compares the performance of the GPU devices. The results show that the NVIDIA GPU provides higher inferencing performance: the time needed to analyze the sequence of 500 frames is 7.36 seconds, while the same inferencing with the AMD GPU takes 11.86 seconds.
Table 2. The comparison of the performance of GPU devices
                       NVIDIA GPU   AMD GPU
Inferencing time [s]         7.36     11.86
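The totals in Table 2 translate directly into a per-frame inference latency and a relative speedup; a quick derivation (values from Table 2, 500 frames):

```python
N_FRAMES = 500  # benchmarked frame count

def per_frame_ms(total_seconds, n_frames=N_FRAMES):
    """Average inference latency per frame, in milliseconds."""
    return total_seconds / n_frames * 1000.0

nvidia_ms = per_frame_ms(7.36)    # about 14.7 ms per frame
amd_ms = per_frame_ms(11.86)      # about 23.7 ms per frame
speedup = 11.86 / 7.36            # NVIDIA is about 1.6x faster at pure inference
```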
As for the noise level, we observed that it is similar for both devices and is not disruptive.
The results of this study show that using the NVIDIA GPU for object detection based on the Tiny YOLO model makes real-time analysis of the data possible. At the same time, the performance of the AMD industry customized working sample GfX is about 40% lower than that of the NVIDIA Quadro P1000. However, considering the average FPS factor achieved for AMD, we can conclude that it offers near-real-time performance.
Based on the knowledge gained during this study, we can also conclude that the difference between the devices is not only computational performance. It also concerns software installation and configuration. For NVIDIA GPUs there is a wide range of software that is quite easy to use, while for AMD GPUs the software setup process can cause hassle.
Another difference between the devices is their support for various AI libraries and frameworks. While most AI frameworks are implemented in the CUDA programming model to ensure support for NVIDIA GPUs, it is usually hard to find an official, stable OpenCL implementation of these frameworks that works with AMD devices. An example is the Darknet framework, which implements the YOLO object detectors.
YOLO: Real-Time Object Detection, URL: https://pjreddie.com/darknet/yolo/
Darknet: Open Source Neural Networks in C, URL: https://pjreddie.com/darknet/
Contact us: welcome@byteLAKE.com
Learn how we work:
1. We start with a consultancy session to better understand our client’s requirements.
2. We thoroughly analyze the gathered information and prepare a draft offer.
3. We fine-tune the offer further and wrap everything up into a final offer.
4. Finally, the execution starts. We deliver projects in a fully transparent, Agile (SCRUM-based) manner.
We build AI and HPC solutions, focusing on software. We use machine/deep learning to bring automation and optimize operations in businesses across industries. We create highly optimized software, and our researchers hold PhD and DSc degrees.
• AI (highly optimized AI engines to analyze text, image, video and time-series data)
• HPC (highly optimized apps and kernels for HPC architectures)