"Making OpenCV Code Run Fast," a Presentation from Intel

Making OpenCV Code Run Fast
Vadim Pisarevsky, Software Engineering Manager, Intel Corp.
May 2017
Copyright © 2017 Intel Corporation

OpenCV at a glance
What: The most popular computer vision library (http://opencv.org)
License: BSD
Supported languages: C/C++, Java, Python
Size: >950 K lines of code
SourceForge statistics: 13.6 M downloads (does not include GitHub traffic)
GitHub statistics: >7500 forks, >4000 patches merged during 6 years (~2.5 patches per working day before Intel, ~5 patches per working day at Intel)
Accelerated with: SSE, AVX, NEON, IPP, MKL, OpenCL, CUDA, parallel_for_, OpenVX, Halide (planned)
Current versions: 2.4.13.2 (2016 Dec), 3.2 (2016 Dec)
Upcoming releases: 2.4.14 (2017), 3.3 (2017 Jun)

OpenCV, CV & Hardware Evolution 2000 => 2017
 | 2000 | 2017
OpenCV | OpenCV 1.0 alpha; C API, 1 module, Windows | OpenCV 3.2; C++ API; 30+30 modules, Windows/Linux/Android/iOS/QNX, etc.
CPU | 32-bit single-core, ~1 GFlop | 32/64-bit many-core, 300+ GFlops, ~100 GFlops in a cellphone!
GPU as accelerator | - | OpenCL, CUDA; 0.5-1+ TFlops
Other accelerators | FPGA (manually coded) | OpenCL-capable FPGA, various DSPs, etc.
Vision algorithms | Traditional vision, simple image processing, detection & tracking, contours; “empirical, low-profile computer vision” | Sophisticated traditional vision, 3D vision, computational photography, deep learning, hybrid algorithms; “learning-based, extensive computer vision”
Cameras, sensors | Analog surveillance cameras (recording only), webcams | Computer vision in every cellphone, every street crossing, every mall, coming to every car; 3D sensors, lidars, etc.
Computing model | Desktop | Edge, Cloud, Fog; Desktop for R&D only

OpenCV Acceleration Options
[Diagram: OpenCV and the OpenCV HAL, connected to the acceleration options below]
• CUDA modules
• OpenVX (immediate mode): OpenCV optimized for custom hardware
• OpenVX (graphs): OpenCV optimized for custom hardware
• Universal intrinsics: NEON/SSE/AVX2…
• Carotene HAL: OpenCV optimized for ARM CPU
• IPP, MKL: OpenCV optimized for x86/x64 CPU
• T-API: OpenCL GPU-optimized OpenCV
• Halide scripts: any Halide-supported hardware
Legend: user-programmable tools; collections of fixed functions; active development area

T-API: heterogeneous compute with OpenCV is easy!
• OpenCV 3.x includes T-API by default:
  • Asynchronous: can run GPU & CPU code in parallel
  • 100s of open-source OpenCL kernels

CPU version (cv::Mat):

    #include "opencv2/opencv.hpp"
    using namespace cv;

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;               // expects an image path
        Mat img, gray;
        img = imread(argv[1], IMREAD_COLOR);
        if (img.empty()) return 1;            // file missing or unreadable
        imshow("original", img);
        cvtColor(img, gray, COLOR_BGR2GRAY);
        GaussianBlur(gray, gray, Size(7, 7), 1.5);
        Canny(gray, gray, 0, 50);
        imshow("edges", gray);
        waitKey();
        return 0;
    }

T-API version (only the type of gray changes, from cv::Mat to cv::UMat):

    #include "opencv2/opencv.hpp"
    using namespace cv;

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;               // expects an image path
        Mat img; UMat gray;                   // UMat lets OpenCV use OpenCL
        img = imread(argv[1]);
        if (img.empty()) return 1;            // file missing or unreadable
        imshow("original", img);
        cvtColor(img, gray, COLOR_BGR2GRAY);
        GaussianBlur(gray, gray, Size(7, 7), 1.5);
        Canny(gray, gray, 0, 50);
        imshow("edges", gray);                // automatic sync point
        waitKey();
        return 0;
    }

T-API: under the hood
Very little “boilerplate code” (just ~30 lines of code)!
void mykernel(cv::InputArray input, cv::OutputArray output, params …) {
}
Control flow inside such a function:
• Use OpenCL?
  • yes: get clmem (use zero-copy if possible), then retrieve/compile the OpenCL kernel & “enqueue” it
  • Enqueued successfully?
    • yes: finish
• Otherwise: retrieve cv::Mat and run the C++ code

T-API execution model
• Supports multiple devices
• Asynchronous execution with no explicit synchronization required

T-API showcase: Pedestrian Detector
Pipeline:
• Capture Video Frame
• Per-frame detector
  • Feature Pyramid Builder: Build pyramid, RGB2Luv, HOG feature maps, Integrals of HOG maps
  • Sliding window + Cascade classifier
  • Non-maxima suppression (filtering out duplicates)
• Optical flow-based Tracker: do temporal filtering, follow pedestrians, detect new ones
Performance profile of per-frame detector (CPU):
• Feature Pyramid Builder (65%)
• Classifier + Non-max (35%)
Feature Pyramid Builder is the ideal “kernel” to optimize:
• Expensive
• Regular, easy to parallelize & vectorize
• Reusable (e.g., for cars)

Feature Pyramid Builder optimization with T-API
• Duplicate CPU branch
• Make OpenCL-compatible copy (cv::UMat) for each internal buffer (cv::Mat)
• Use available OpenCL-optimized funcs (e.g. cv::resize, cv::integral)
• Create OpenCL kernels for other parts (RGB2Luv, HOG): ~700 LoC
• Debug-Profile-Optimize: repeat until happy

Part | CPU time, ms (1080p) | OCL time, ms (1080p) | CPU time, ms (720p) | OCL time, ms (720p) | Acceleration (1080p) | Acceleration (720p)
All | 200 | 140 | 107 | 87 | 42% | 23%
Feature Pyramid Builder | 130 | 70 | 60 | 40 | 85% | 50%

Test machine: Core i5 (Skylake), 2-core 2.5 GHz, Intel HD530 GPU

Halide: write once, schedule everywhere!
• Many acceleration options are available (CPU, GPU, DSPs, FPGA, etc.)
• Coding kernels using native tools is a huge investment and maintenance cost
  • Long time to market
  • Big commitment because of low portability
• OpenCV cannot be optimized for every single accelerator
• OpenCL is neither performance-portable nor easy to use
• Let’s generate OpenCL or LLVM code automatically from a high-level algorithm description!
• Let’s separate the platform-agnostic algorithm description from the platform-specific “pragmas” (vectorization, tiling …)!
Halide! (http://halide-lang.org)
• Algorithm Description: Function 1, Function 2, …
• CPU Scheduler (tiling, vectorization, pipelining) => CPU code (SSE, AVX…, NEON)
• GPU Scheduler (tiling, vectorization, pipelining) => GPU code (OpenCL, CUDA)

Halide: first impressions & results
• Same code for CPU & GPU
• Halide includes a very efficient loop handling engine
• Almost any known DNN can be implemented entirely in Halide
• The language is quite limited (insufficient to cover OpenVX 1.0)
• In some cases the produced code is inefficient
• The whole infrastructure is immature
Plans
• Halide backend in OpenCV DNN module (in progress)
• Extend the language (if operator, etc.)
• Improve performance of the generated code
• Fix/improve the infrastructure (nicer frontend, better support for offline compilation)

kernel | OpenCV, ms (CPU) | Halide, ms (CPU) | Halide, ms (GPU)
RGB=>Gray | 0.44 | 0.54 (-20%) | 0.58 (-25%)
Canny | 3.3 | 1.4+2 (-3%) | 2.4+2 (-25%)
DNN: AlexNet | 29 (w. MKL) | 24 (+20%) | 47 (-40%)
DNN: ENet (512x256) | ~250 (w. MKL) | 60 (+320%) | 44 (+470%)
HOG-based pedestrian detector (1080p) | 200 | 75+70 (+38%) | 140 – 700 ms

OpenCV + OpenVX
• OpenVX-based HAL in OpenCV
  ✓ [Done] Immediate-mode OpenVX calls to accelerate simple functions:
    • cv::boxFilter(const cv::Mat&, …) => vxuBox3x3(vx_image, …) etc.
    • tested with Khronos’ sample implementation and Intel IAP
  • [TBD] Graphs for DNN acceleration
✓ [Done] Mixing OpenVX + OpenCV at user app level
  • vx_image <=> cv::Mat, OpenVX C++ wrappers, sample code: https://github.com/opencv/opencv/tree/master/samples/openvx

OpenCV Acceleration Options Comparison
Option | + | -
HAL functions | Get used automatically (zero effort); vendor-specific implementation is possible | Little coverage (mostly image processing); usually CPU-only
HAL intrinsics | Super-flexible, widely applicable and widely available | Low-level, CPU only
T-API | Can potentially deliver top speed | OpenCL is not performance-portable; lots of expertise needed
OpenVX | Can be tailored for any hardware (CPU, GPU, DSP, FPGA) | Inflexible, not easy to use, difficult to extend
Halide | Decent performance; relatively easy to use | Not as flexible as OpenCL or C++

[Chart: Performance vs. Ease-of-use, plotting HAL functions, HAL intrinsics, Halide, T-API (custom), T-API (built-in), OpenVX (graphs), OpenVX (graphs for DNN)]
[Chart: Flexibility vs. Coverage, plotting HAL functions, HAL intrinsics, Halide, T-API (custom), T-API (built-in), OpenVX (graphs)]

Summary
• Modern OpenCV provides several acceleration paths
• Custom kernels are essential for user apps; existing OpenCV (and OpenVX) functionality is not enough
• Universal intrinsics (http://docs.opencv.org/master/df/d91/group__core__hal__intrin.html) are the best solution for CPU
• T-API (OpenCL; http://opencv.org/platforms/opencl.html) is the way to go for GPU acceleration
• Halide looks very promising and can become a viable alternative to plain C++ and OpenCL for “regular” algorithms; OpenCV 3.3 will include a Halide-accelerated deep learning module

Resources
• OpenCV: http://opencv.org
• Intel CV SDK: https://software.intel.com/en-us/computer-vision-sdk - the home of Intel-optimized OpenCV & OpenVX
• Halide: http://halide-lang.org
• Insights on the OpenCV 3.x feature roadmap, EVS2016 talk by Gary Bradski: https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-opencv