
Benchmark of common AI accelerators: NVIDIA GPU vs. Intel Movidius



This document summarizes byteLAKE's basic benchmark results for two example edge-device setups: one with an NVIDIA GPU and one with Intel Movidius cards.

Key takeaway: comparing Movidius and NVIDIA as two competing accelerators for AI workloads leads to the conclusion that the two are meant for different tasks.

Published in: Devices & Hardware

  1. AI on EDGE: GPU vs. VPU. byteLAKE's basic benchmark results between two different setups of example edge devices: with NVIDIA GPU and with Intel's Movidius cards.
     Artificial Intelligence | HPC | Machine Learning | Deep Learning | Computer Vision | Edge Intelligence
     byteLAKE, pl. Solny 14/3, 50-062 Wroclaw, Poland | +48 508 091 885 | +48 505 322 282 | +1 650 735 2063
  2. AI on EDGE: GPU vs. VPU (Jul-18). Devices Configuration. Tests were run on two Lenovo Tiny PCs.
     Tiny#1: Lenovo ThinkCentre M910x Tiny
     • CPU: Intel Core i7-7700T vPro
     • AI accelerator: 2 x Intel Movidius Myriad 2 VPU
     • Memory: 4 GB LPDDR3
     • System: Ubuntu 16.04 LTS
     Tiny#2: Lenovo ThinkCentre M920x Tiny
     • CPU: Intel Core i5-8500T
     • AI accelerator: NVIDIA Quadro P1000
     • Memory: 4 GB GDDR5
     • System: Ubuntu 18.04 LTS
     Software Configuration:
     • Frameworks: Caffe, TensorFlow, OpenCV 3.4
     • Drivers:
       o Tiny#1: Intel Movidius Neural Compute SDK v1
       o Tiny#2: NVIDIA GPU drivers ver. 390.48; CUDA Toolkit 8
  3. Test procedure description: During the course of the studies, we analyzed the performance of the two Tiny PCs using the state-of-the-art YOLO (You Only Look Once) real-time detection model [1]. In both cases we focused on a special version of the YOLO model, called Tiny YOLO. The model consists of a single input layer, 8 convolution layers, 8 batch normalization layers, 8 ReLU layers and a single fully-connected layer. Tiny YOLO is able to recognize objects from 20 classes, including: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and tv-monitor. The size of the pre-trained Tiny YOLO detection model is 50 MB. The deep neural net (DNN) used for this study was implemented using the Python Caffe AI framework. Benchmarks were based on real-time analysis of a sequence of frames captured from a camera. They were performed using three configurations of the AI devices:
     • single Movidius Myriad accelerator enabled (Tiny#1);
     • two Movidius Myriad cards enabled (Tiny#1);
     • single NVIDIA GPU (Tiny#2).
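To make the description above concrete, here is a minimal sketch (hypothetical naming, not byteLAKE's code) that simply enumerates the Tiny YOLO layer stack and the 20 recognizable classes exactly as listed in the text:

```python
# Tiny YOLO layer stack as described: one input layer, eight blocks of
# (convolution, batch norm, ReLU), then a single fully-connected layer.
layers = ["input"]
for i in range(1, 9):
    layers += [f"conv{i}", f"batchnorm{i}", f"relu{i}"]
layers.append("fc")

# The 20 object classes Tiny YOLO recognizes (the Pascal VOC classes).
CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "dining table", "dog", "horse", "motorbike",
    "person", "potted plant", "sheep", "sofa", "train", "tv-monitor",
]

print(len(layers), len(CLASSES))  # 26 20
```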
  4. The procedure to assess the overall performance of the above Tiny PC configurations took into account all steps required to generate the resulting video, including:
     • grabbing frames from the camera;
     • frame preparation;
     • forwarding the images through the deep neural net;
     • filtering the results of the analysis;
     • drawing the results on the frame;
     • presenting the results of the analysis.
     To ensure objective measurements across all configurations, the analysis was performed for a defined number of frames. We assumed two criteria of performance: (i) the average Frames Per Second (FPS) value, and (ii) the execution time of the AI computations for each of the above-mentioned configurations. Figure 1 below presents the measurement method in detail (sample code from the single-Movidius configuration; the other configurations were implemented in a similar fashion).
     [1] YOLO: Real-Time Object Detection, URL:
  5. Figure 1. Adopted method of performance measurements for a single Movidius Myriad accelerator
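Figure 1 itself is not reproduced in this text version. As a runnable stand-in, here is a minimal pure-Python sketch of the timing method it describes: all pipeline stages listed above are stubbed out (in the actual benchmark they would call the camera, OpenCV and the Movidius or CUDA inference APIs), and the average FPS is computed as N / Ta, matching the formula used later in the results:

```python
import time

# Stub stages standing in for the real pipeline steps: camera grab,
# frame preparation, DNN forward pass, filtering and drawing.
def grab_frame():          return "frame"
def prepare(frame):        return frame
def forward(tensor):       return ["detections"]
def draw(frame, dets):     return frame

def run_benchmark(num_frames=500):
    start = time.perf_counter()
    for _ in range(num_frames):
        frame = grab_frame()
        tensor = prepare(frame)
        detections = forward(tensor)
        draw(frame, detections)
    elapsed = time.perf_counter() - start   # Ta: overall analysis time
    return elapsed, num_frames / elapsed    # FPSavg = N / Ta

elapsed, fps = run_benchmark(500)
print(f"Ta = {elapsed:.3f} s, average FPS = {fps:.2f}")
```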
  6. Results. The tests described above were based on RGB frames grabbed by a Creative Live! Cam Sync USB camera. The original size of a single frame was 1080 x 720 pixels but, due to the required structure of the input layer of the YOLO detector, we resized the frames to 448 x 448 RGB pixels. The benchmarks were carried out for a sequence of 500 frames. The performance results for the different configurations of AI accelerators are presented in Table 1 below. The average FPS value was calculated using the following formula: FPSavg = 500 / Ta, where Ta refers to the time of the overall analysis of 500 frames (as described above).
     Table 1. Performance results
                      1 x Movidius Myriad 2   2 x Movidius Myriad 2   1 x NVIDIA P1000 GPU
     Time [s]         123.1                   69.8                    23.3
     Average FPS      4.05                    7.16                    21.3
     As expected, the best performance was achieved with the GPU accelerator. This version processed the 500 frames in ca. 23 seconds, i.e. at an average rate of ca. 21 frames per second. Consequently, a single GPU turned out to be 5.28 times faster than a single Myriad chip and 2.99 times faster than the configuration with two Movidius accelerators (at least for the given benchmark procedure). In the scenario with both Movidius cards enabled, we developed an approach which allowed frames grabbed from the camera to be analyzed in parallel. In consequence, this version was 1.76 times faster than the version with a single Myriad chip. In the given scenario, a single Intel Movidius performed at only ca. 4 FPS, whereas the double-Movidius configuration reached ca. 7 FPS.
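The two-Movidius approach mentioned above is not detailed in the deck; a plausible minimal sketch (hypothetical, with a stub in place of the real per-device Movidius inference call) is a shared frame queue drained by one worker thread per accelerator:

```python
import queue
import threading

# Stand-in for the real per-device Movidius inference call.
def infer(device_id, frame):
    return (device_id, f"detections for {frame}")

def process_parallel(frames, num_devices=2):
    """Dispatch frames to num_devices workers draining a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker(device_id):
        while True:
            try:
                frame = tasks.get_nowait()
            except queue.Empty:
                return                      # queue drained: worker exits
            result = infer(device_id, frame)
            with lock:
                results.append(result)

    for frame in frames:
        tasks.put(frame)
    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = process_parallel([f"frame{i}" for i in range(500)])
print(len(results))  # 500
```

With real devices, each worker would hold its own opened Movidius graph, so the two chips genuinely process different frames concurrently, which is consistent with the observed 1.76x speedup over a single chip.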
  7. Conclusions. The results of this study show that using a GPU for object detection based on the YOLO model allows data to be analyzed in real time. At the same time, neither a single Intel Movidius chip nor two of them provide the desired efficiency in the given scenario. However, they can still be used successfully in applications where real-time processing is not necessary and near-real-time is enough. A comparison of both devices is presented in Table 2 below. Based on the knowledge gained during this study, we conclude that the advantage of the NVIDIA GPU over the Intel Movidius VPU is not only in computational performance. The GPU allows for both training of DNNs and inference, whereas Movidius is designed only to work with pre-trained models. Another difference between the accelerators is their support for various AI libraries/frameworks. While Movidius supports two popular frameworks (Caffe and TensorFlow), the GPU supports more AI libraries, e.g. cuDNN or Theano. A difference can also be noticed on the programming side. In many cases, implementing an application which uses the GPU does not require any special knowledge about the accelerator itself: most AI frameworks provide built-in support for GPU computing (both training and inference) out of the box. In the Movidius case, however, it is also necessary to learn its SDK. That is not a painful process, but it is still yet another tool in the chain. Another difference between the accelerators is the area of usage. While the GPU is a powerful accelerator for AI computations, the electricity consumption and size of this kind of accelerator can be an obstacle in many areas. The GPU offers notably high computational performance (on the order of a few TFlops or more), but it is usually dedicated to HPC solutions.
At the same time, Intel Movidius is a low-power AI solution dedicated to on-device computer vision. The size of the device and its power consumption make it attractive for many uses, e.g. IoT solutions, drones or smart security. Given the context above, here are some additional remarks one might consider when deciding which accelerator is a better fit for a given design. It is important to emphasize that the comparison of Movidius and NVIDIA as two competing accelerators for AI workloads leads to the conclusion that the two are meant for different tasks. Therefore, looking at them only through the lens of performance benchmarking results might be misleading. To properly choose between Movidius and an NVIDIA GPU, one should foremost take into account the intended application rather than the performance benchmark results alone. Movidius is primarily designed to execute AI workloads based on trained models (inference). NVIDIA's GPU, on the other hand, can do that plus training. Therefore, it really depends on whether the planned device is to work in execute-only mode or also be capable of updating/re-training its models (brains). And of course, this holds only as long as such tasks can be executed within a reasonable time frame.
  8. Table 2. Comparison of the NVIDIA GPU and Intel Movidius VPU
     • For inferencing: Movidius: Yes | GPU: Yes
     • For training: Movidius: No | GPU: Yes
     • AI frameworks: Movidius: Caffe / TensorFlow | GPU: Caffe / TensorFlow / cuDNN and more
     • Max model size: Movidius: 320 MB | GPU: no limit
     • Easy to code? Movidius: besides the AI framework/library, programmers need to learn the Movidius programming SDK | GPU: requires knowledge of the utilized library/framework only, e.g. Caffe or TensorFlow
     • Form factor: Movidius: small (i.e. mobile, IoT) | GPU: medium+
     • Power consumption: Movidius: low, ~1 W | GPU: medium+
     • Heating: Movidius: + | GPU: -
     • Can work offline: Movidius: Yes | GPU: Yes
     • Main purpose: Movidius: classification and recognition of objects | GPU: general AI
     • OS: Movidius: Ubuntu 16.04, Raspberry Pi 3 Raspbian Stretch | GPU: any OS for which the drivers are available (Windows, Linux)
     • Computational power: Movidius: 150 GFlops | GPU: very high, TFlops and higher
     • Other: Movidius: imaging/vision accelerators included (12 specialized VLIW vector processors (SHAVEs) + 2 RISC processors)
     • Arithmetic: Movidius: 8/16/32-bit integer, 16/32-bit floating point | GPU: all
     • Price tag: Movidius: <$80 | GPU: $100+
  9. Thank you! Contact us at:
  10. Learn how we work:
     1. Listen Actively: We start with a consultancy session to better understand our client's requirements & assumptions.
     2. Suggest: We thoroughly analyze the gathered information and prepare a draft offer.
     3. Agree: We fine-tune the offer further and wrap everything up into a binding contract.
     4. Deliver: Finally, the execution starts. We deliver projects in a fully transparent, Agile (SCRUM-based) fashion.
  11. byteLAKE. We help companies transform for the era of Artificial Intelligence. We build Artificial Intelligence software and integrate it into products. We port and optimize algorithms for parallel, CPU+GPU HPC architectures. We deploy AI in data centers, the cloud and on constrained, embedded devices (AI on Edge). We are a team of scientists, programmers, designers and technology enthusiasts helping industries incorporate AI techniques into products. We specialize in: Machine Learning, Deep Learning, Computer Vision, High Performance Computing, Heterogeneous Computing, Edge Intelligence.