MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 


Presentation MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko at the AMD Developer Summit (APU13) November 11-13, 2013.

    Presentation Transcript

    • OPENCV OPENCL™ ACCELERATED COMPUTER VISION
    • OpenCV Introduction (Andrey Pavlenko, Itseez); Heterogeneous Compute and OpenCV (Dr. Harris Gasparakis, AMD); OpenCV 3.0 (Vadim Pisarevsky, Itseez)
    • OpenCV introduction Andrey Pavlenko 1. Features 2. History 3. Development Process 4. Performance
    • Open-source Computer Vision Library 1. 2,500+ algorithms and functions 2. Cross-platform 3. Liberal BSD license 4. High performance 5. Professionally developed 6. 7M+ downloads
    • Functionality overview Image Processing Filters Transformations Edges, contours Robust features Segmentation Video, Stereo, 3D Calibration Pose estimation Optical Flow Detection and recognition Depth
    • Industrial applications • Street View Panorama, etc. (Google) • Vision system of the PR2 robot (Willow Garage) • Robots for Mars exploration (NASA) • Quality control of the production of coins (China)
    • OpenCV History [timeline figure: popularity, contributors, core team]. Timeline labels, in order: 2000, first public release, 2008, 2009, v2.0 C++ API, 2012, @github, 2013, v2.4.3 OpenCL, present.
    • OpenCV infrastructure
      • Contribution/patch workflow: see OpenCV wiki
      • build.opencv.org: buildbot with 50+ builders, 50+ builds nightly!
      • github.com/itseez/opencv
      • pullrequest.opencv.org
      • Every patch to OpenCV must pass 7 builders!
    • OpenCV resources 1. Home: opencv.org 2. Docs and tutorials: docs.opencv.org 3. Q&A forum: answers.opencv.org 4. Wiki and issues: code.opencv.org 5. Develop: https://github.com/Itseez/opencv 6. Packages: sourceforge.net/projects/opencvlibrary/
    • OpenCL™ in OpenCV 2.4
      • ‘ocl’ is a separate module (cv::ocl::resize())
      • runs on various OpenCL-compliant devices and OSes
      • 2.4.7 release on November 6
        – official Windows bin pack with OpenCL enabled
        – OpenCV pre-commit check includes OpenCL tests
        – 200+ pull requests since 2.4.6 (most actively developed OpenCV part)
        – dynamic OpenCL runtime loading
        – set default OpenCL device via environment variable
        – ~800 optimized kernels, ~30% of most commonly used functionality
        – 8000+ accuracy and ~500 performance tests
        – can be built without OpenCL SDK installed
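
The slide mentions selecting the default OpenCL device through an environment variable without naming it. As a minimal sketch, assuming the variable is `OPENCV_OPENCL_DEVICE` and that its value has the form `<platform>:<device type>:<device name or index>` (both are assumptions drawn from later OpenCV documentation, not from this talk), the value could be split as follows. `DeviceSpec`, `parseDeviceSpec`, and `deviceSpecFromEnvironment` are hypothetical names, not OpenCV API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical holder for the three fields of a device selector string.
struct DeviceSpec {
    std::string platform;  // e.g. "AMD"; empty means "any platform"
    std::string type;      // e.g. "GPU" or "CPU"; empty means "any type"
    std::string device;    // device name or index; empty means "first match"
};

// Split "<platform>:<type>:<device>" on the first two ':' characters.
// Missing trailing fields are left empty.
DeviceSpec parseDeviceSpec(const std::string& value) {
    DeviceSpec spec;
    const std::size_t first = value.find(':');
    if (first == std::string::npos) {
        spec.platform = value;
        return spec;
    }
    spec.platform = value.substr(0, first);
    const std::size_t second = value.find(':', first + 1);
    if (second == std::string::npos) {
        spec.type = value.substr(first + 1);
        return spec;
    }
    spec.type = value.substr(first + 1, second - first - 1);
    spec.device = value.substr(second + 1);
    return spec;
}

// Read the selector from the environment (variable name is an assumption);
// an unset variable yields an empty spec.
DeviceSpec deviceSpecFromEnvironment(const char* variableName = "OPENCV_OPENCL_DEVICE") {
    const char* raw = std::getenv(variableName);
    return parseDeviceSpec(raw ? raw : "");
}
```

Under this assumed format, a value such as `:GPU:1` would leave the platform unconstrained and ask for the GPU with index 1.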
    • OpenCL™ performance in OpenCV 2.4 AMD A10-6800k (with HD8670D) + Radeon HD7790
    • HETEROGENEOUS COMPUTE AND OPENCV
      • The OpenCL™ Module in OpenCV
      • Heterogeneous compute and Computer Vision
      • Compute paths and data representations
      • Future roadmap: transparent API
      12 OPENCV-CL | NOVEMBER 12,2013 | DR. HARRIS GASPARAKIS | CONFIDENTIAL
    • OPENCV’S OPENCL™ MODULE
      • Enables taking advantage of OpenCL™ acceleration, but currently it is an explicit path a developer can choose to call. All OpenCL memory buffer types are supported, but not automatically optimized.
        ‒ But stay tuned for OpenCV 3.0’s transparent API.
      • Initial release: OpenCV 2.4.3 [11/2012]
      • Currently ~800 kernels
        ‒ Image processing
        ‒ Pixel-wise operations
        ‒ Geometric transforms
        ‒ Pixel transforms: filtering, edges, corners, etc.
        ‒ Feature detection and matching
        ‒ SURF, HOG, Haar, brute-force matching, kNN, templateMatching
        ‒ Object recognition
        ‒ SVM: Support Vector Machine
      • Applications, including:
        ‒ Face Detect
        ‒ Optical flow
        ‒ Stereo Matching
    • COMPILING FROM SOURCE
      • OpenCL™ is enabled by default in CMake
    • COMPILING FROM SOURCE: BROWSE/BUILD CODE IN AN IDE
      • OpenCL™ module (2.4.x). Rebuild it even if you just change a kernel.
      • OpenCL kernels. Those are converted to kernels.cpp by a script (hence you need to rebuild if you change a kernel).
      • OpenCL samples. After you build them, go to [ROOT]\bin\[CONFIG] and observe: ocl-example-*.exe
    • INCORPORATING OPENCV INTO YOUR OWN CODE
      • APP SDK provides 3 examples. Very easy integration!
      • With fewer than 15 lines of code you can have a minimal program that reads video frames, passes them to the OpenCL™ device, and runs your own simple kernel!
      • OpenCV-CL:
        ‒ takes care of all OpenCL plumbing
        ‒ compiles the kernels, caches them at runtime, and saves the OpenCL binaries on disk [user can also modify default behavior]
        ‒ allows specifying an OpenCL device/platform via environment variable
        ‒ allows plugging your own kernels into OpenCV-CL, using the OpenCV-CL data structures
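
The compile-once-and-cache behaviour described above can be sketched without any OpenCL dependency. This is not OpenCV-CL's implementation: `KernelCache`, `CompiledKernel`, and the fake `compile` step are hypothetical stand-ins that only illustrate keying compiled programs by source and build options, so that a second request for the same kernel skips compilation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a compiled OpenCL program binary.
struct CompiledKernel {
    std::string binary;
};

class KernelCache {
private:
    using Key = std::pair<std::string, std::string>;  // (source, build options)

public:
    // Return the cached kernel for (source, buildOptions), compiling on first use.
    const CompiledKernel& get(const std::string& source, const std::string& buildOptions) {
        const Key key(source, buildOptions);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(key, compile(source, buildOptions)).first;
        }
        return it->second;
    }

    // Number of times the (simulated) compiler actually ran.
    int compileCount() const { return compileCount_; }

private:
    // Placeholder for the expensive step; a real runtime would invoke
    // the OpenCL compiler (or load a binary from disk) here.
    CompiledKernel compile(const std::string& source, const std::string& buildOptions) {
        ++compileCount_;
        return CompiledKernel{"bin(" + source + "|" + buildOptions + ")"};
    }

    std::map<Key, CompiledKernel> cache_;
    int compileCount_ = 0;
};
```

Keying on the build options as well as the source matters because the same kernel text compiled with different flags yields a different binary.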
    • INCORPORATING OPENCV INTO YOUR OWN CODE: SOME CODE, FROM APP SDK 2.9, GESTURE SAMPLE, SHOWCASING OPENNI® INTEGRATION
      • In one line, populate an image in GPU!
          cv::Mat depthImgClamp = cv::Mat(SIZEY, SIZEX, CV_8UC1, openniBuffer);
          cv::ocl::oclMat oclDepthImgClamp(depthImgClamp);
      • In one command, add your own kernel launch, acting on OpenCV-CL data structures:
          vector<pair<size_t, const void *> > args;
          args.push_back(make_pair(sizeof(cl_mem), (void *)&src.data));
          args.push_back(make_pair(sizeof(cl_mem), (void *)&oclDst.data));
          openCLExecuteKernelInterop(oclDst.clCxt, &depthConvertSrcStr, "convertDepthToWorldCoordinates",
                                     globalThreads, localThreads, args, -1, -1, "", false, false, true);
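
Each entry in `args` pairs a byte size with a pointer to the argument value, which matches the `(arg_size, arg_value)` parameters of OpenCL's `clSetKernelArg`. The sketch below makes no claim about how `openCLExecuteKernelInterop` works internally; it only walks such a list and records what would be bound at each argument index. `BoundArg`, `KernelArgs`, and `bindArgs` are hypothetical names, and the byte copy stands in for the real `clSetKernelArg` call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical record of one bound kernel argument.
struct BoundArg {
    unsigned index;                    // position in the kernel signature
    std::vector<unsigned char> bytes;  // copy of the argument value
};

using KernelArgs = std::vector<std::pair<std::size_t, const void*>>;

// Walk the (size, pointer) list in order. A real runtime would call
// clSetKernelArg(kernel, index, size, pointer) where this copies the bytes.
std::vector<BoundArg> bindArgs(const KernelArgs& args) {
    std::vector<BoundArg> bound;
    for (std::size_t i = 0; i < args.size(); ++i) {
        BoundArg arg;
        arg.index = static_cast<unsigned>(i);
        arg.bytes.resize(args[i].first);
        if (args[i].first > 0) {
            std::memcpy(arg.bytes.data(), args[i].second, args[i].first);
        }
        bound.push_back(arg);
    }
    return bound;
}
```

The order of `push_back` calls therefore determines the argument index, which is why the source buffer is pushed before the destination buffer in the slide's snippet.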
    • HETEROGENEOUS COMPUTE AND COMPUTER VISION
      • Webcams everywhere
      • Heterogeneous compute everywhere
      • Real time computer vision everywhere
      • Heterogeneous compute mission: To take optimal advantage of the full capabilities of the underlying platform.
        ‒ APU / HSA APU
        ‒ Discrete GPU
        ‒ CPU
        ‒ FPGA, DSP, etc.
      • Many code paths? Possibly interleaving execution between different devices
      • Many data representations?
    • DATA REPRESENTATIONS: DISCRETE APUS, OPENCL™ 1.2
      • Copy data to/from GPU
      • Use “device Memory” for data that is used between GPU kernels
      • Map/unmap using pinned memory
        ‒ True for all generations. Special memory that can be read and written fast by GPU.
        ‒ On APUs: physically part of main memory, possibly with special paths.
        ‒ But: device memory cannot be read/written very fast from CPU.
      • Zero copy (map/unmap): best path for data written by the CPU and read by the GPU, or vice versa.
      • Cannot mix and match (bounce back and forth between) CPU and GPU well.
      • Small kernels are typically a bad idea
    • H1’14: APUS, OPENCL™ 1.2 + HSA extensions OR OPENCL 2.0
      • Can still use “device Memory” for data that is used between GPU kernels, and zero copy is still available.
      • However: SVM (shared virtual memory) can be written to/read from both CPU and GPU fast “enough”
        ‒ Enables ping/pong (producer/consumer) between CPU and GPU
        ‒ Enables concurrent producer/consumer between CPU/GPU (platform atomics)
        ‒ Much easier to port a vision pipeline using HSA. You can incrementally pick and choose what part of the pipeline to accelerate, and what part to allow the CPU to execute.
        ‒ On HSA APUs, using SVM is a reasonable choice (and better than the current defaults), significantly simplifying code.
      • User mode enqueueing: much faster kernel dispatching leads to less performance degradation of small kernels. Can feed the GPU smaller computational tasks fast, and (busy) wait for results on the CPU.
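
Why small kernels suffer, and why faster dispatch helps, can be made concrete with a simple cost model. This is an illustrative assumption rather than a model from the talk: offloading pays off only when dispatch overhead plus data movement plus GPU compute time is less than the CPU time. All names here are hypothetical.

```cpp
#include <cassert>

// Hypothetical per-kernel costs, all in microseconds.
struct OffloadCost {
    double dispatchOverheadUs;  // fixed cost of launching one kernel
    double transferUs;          // copying or mapping data; near zero with zero-copy or SVM
    double gpuComputeUs;        // time the kernel itself runs on the GPU
};

// Total time of the GPU path for one kernel invocation.
double totalGpuTimeUs(const OffloadCost& cost) {
    return cost.dispatchOverheadUs + cost.transferUs + cost.gpuComputeUs;
}

// Offloading is worthwhile only if the whole GPU path beats the CPU path.
bool offloadPaysOff(double cpuTimeUs, const OffloadCost& cost) {
    return totalGpuTimeUs(cost) < cpuTimeUs;
}
```

With made-up numbers, a task that takes 20 µs on the CPU and 5 µs on the GPU does not pay off at 50 µs of dispatch overhead but does at 5 µs, which is the effect the user-mode enqueueing bullet describes.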
    • COMPUTE PATHS: OpenCV 2.4.x: Face detect on CPU
      [Removed image demonstrating face detect]
        // initialization
        VideoCapture vcap(...);
        CascadeClassifier fd("haar_ff.xml");
        Mat frame, frameGray;
        vector<Rect> faces;
        for(;;){
            // processing loop
            vcap >> frame;
            cvtColor(frame, frameGray, BGR2GRAY);
            equalizeHist(frameGray, frameGray);
            fd.detectMultiScale(frameGray, faces, ...);
            // draw rectangles …
            // show image …
        }
    • COMPUTE PATHS: OpenCV 2.4.x: Face detect with OpenCL™
      [Removed image demonstrating face detect]
        // initialization
        VideoCapture vcap(...);
        ocl::OclCascadeClassifier fd("haar_ff.xml");
        ocl::oclMat frame, frameGray;
        Mat frameCpu;
        vector<Rect> faces;
        for(;;){
            // processing loop
            vcap >> frameCpu;
            frame = frameCpu;  // upload the CPU frame to the OpenCL device
            ocl::cvtColor(frame, frameGray, BGR2GRAY);
            ocl::equalizeHist(frameGray, frameGray);
            fd.detectMultiScale(frameGray, faces, ...);
            // draw rectangles …
            // show image …
        }
    • FUTURE ROADMAP
      • Incorporate OpenCL™ 1.2 with HSA extensions, and OpenCL 2.0
        ‒ Shared Virtual Memory (SVM) significantly simplifies programming model in general. Allows reusing existing memory as SVM.
        ‒ In SVM, a “pointer is a pointer”
        ‒ Pass your tree/linked list/graph data structure in the GPU, have threads explore sub-branches, or explore paths on a graph
      • Transparent API:
        ‒ One code path, OpenCV will choose the best execution path at runtime, given the platform.
        ‒ Changes of data locality should be implemented by the framework.
        ‒ Includes applying heuristics appropriate for underlying hardware (dGPU, APU, HSA APU).
        ‒ Eventually it should be self-optimizing
        ‒ reasonably define optimal memory type “under the hood.”
        ‒ Detect data flow dependencies, in the pipeline, and automatically represent them as OpenCL events.
      • Starting with OpenCV 3.0.
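
One way to picture the roadmap item about detecting data-flow dependencies and representing them as events is to track the last writer of each buffer. The sketch below is an assumption about how such a framework might work, not a description of OpenCV 3.0; `Stage`, `waitLists`, and the buffer names are hypothetical. For each stage it lists the earlier stages whose completion events it would have to wait on. Write-after-read hazards are deliberately ignored to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical pipeline stage described by the buffers it reads and writes.
struct Stage {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// For each stage, return the indices of earlier stages it must wait for:
// the most recent writer of every buffer it reads (read-after-write)
// or overwrites (write-after-write).
std::vector<std::set<std::size_t>> waitLists(const std::vector<Stage>& pipeline) {
    std::map<std::string, std::size_t> lastWriter;
    std::vector<std::set<std::size_t>> waits(pipeline.size());
    for (std::size_t i = 0; i < pipeline.size(); ++i) {
        for (const std::string& buffer : pipeline[i].reads) {
            auto it = lastWriter.find(buffer);
            if (it != lastWriter.end()) waits[i].insert(it->second);
        }
        for (const std::string& buffer : pipeline[i].writes) {
            auto it = lastWriter.find(buffer);
            if (it != lastWriter.end()) waits[i].insert(it->second);
            lastWriter[buffer] = i;  // this stage is now the latest writer
        }
    }
    return waits;
}
```

Applied to the face-detect loop from the earlier slides, `equalizeHist` would wait on `cvtColor`, and `detectMultiScale` would wait on `equalizeHist`, since each reads the `frameGray` buffer the previous stage wrote.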
    • DISCLAIMER & ATTRIBUTION
      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
      AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
      ATTRIBUTION: © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
    • OpenCV 3.0 Vadim Pisarevsky 1. Transparent API 2. UMat 3. Under the hood
    • OpenCV 3.0 • OpenCV 3.0 is scheduled for 2014’Q1 • Based on 2.x, but: – transparent API and more efficient and platform-specific OpenCL™ codepaths (including better zero-copy and SVM support) – API cleanup – a lot of new algorithms
    • Transparent API
      • same code can run on CPU or GPU
        – no specialized cv::ocl::Canny vs cv::Canny
        – no recompilation is needed
      • includes the following key components:
        – new data structure UMat (Universal Mat)
        – simple and robust mechanism for async processing
        – convenient API for custom algorithm implementation
      • minimal or no changes in the existing code
        – CPU-only processing: no changes required
    • UMat
      • Mat=>UMat is the only change needed
      • Sometimes, somewhere (HSA) it’s not needed either!
        // initialization
        VideoCapture vcap(...);
        CascadeClassifier fd("haar_ff.xml");
        UMat frame, frameGray;
        vector<Rect> faces;
        for(;;){
            // processing loop
            vcap >> frame;
            cvtColor(frame, frameGray, BGR2GRAY);
            equalizeHist(frameGray, frameGray);
            fd.detectMultiScale(frameGray, faces, ...);
            // draw rectangles …
            // show image …
        }
    • Transparent API: under the hood
        bool _ocl_cvtColor(InputArray src, OutputArray dst, int code) {
            static ocl::ProgramSource oclsrc("//cvtcolor.cl source code\n …");
            UMat src_ocl = src.getUMat(), dst_ocl = dst.getUMat();
            if (code == COLOR_BGR2GRAY) {
                // get the kernel; kernel is compiled only once and cached
                ocl::Kernel kernel("bgr2gray", oclsrc, <compile_flags>);
                // pass 2 arrays to the kernel and run it
                return kernel.args(src, dst).run(0, 0, false);
            } else if (code == COLOR_BGR2YUV) {
                …
            }
            return false;
        }

        void _cpu_cvtColor(const Mat& src, Mat& dst, int code) { … }

        // transparent API dispatcher function
        void cvtColor(InputArray src, OutputArray dst, int code) {
            dst.create(src.size(), …);
            // try the OpenCL path first; fall through to the CPU path if it declines
            if (useOpenCL(src, dst) && _ocl_cvtColor(src, dst, code))
                return;
            // getMat() uses zero-copy if available; and with SVM it’s no op
            Mat src_cpu = src.getMat();
            Mat dst_cpu = dst.getMat();
            _cpu_cvtColor(src_cpu, dst_cpu, code);
        }
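
The dispatcher pattern on this slide, trying the accelerated path and falling back to the CPU path when it declines, can be exercised in isolation. The following is a self-contained sketch with no OpenCV dependency: `acceleratedConvert`, `cpuConvert`, `convert`, and the `acceleratorAvailable` flag are hypothetical stand-ins for `_ocl_cvtColor`, `_cpu_cvtColor`, `cvtColor`, and `useOpenCL`, and the "accelerated" path is simulated on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum ConversionCode { BGR_TO_GRAY, BGR_TO_YUV };

using Pixels = std::vector<std::uint8_t>;

// Integer approximation of the usual luma weights (0.114 B + 0.587 G + 0.299 R).
static std::uint8_t gray(std::uint8_t b, std::uint8_t g, std::uint8_t r) {
    return static_cast<std::uint8_t>((29 * b + 150 * g + 77 * r + 128) >> 8);
}

// Convert interleaved BGR triples to one gray value per pixel.
static void bgrToGray(const Pixels& src, Pixels& dst) {
    dst.resize(src.size() / 3);
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] = gray(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}

// Simulated accelerated path: handles only BGR_TO_GRAY and returns false
// otherwise, mirroring how the slide's _ocl_cvtColor reports "not handled".
bool acceleratedConvert(const Pixels& src, Pixels& dst, ConversionCode code) {
    if (code != BGR_TO_GRAY) return false;
    bgrToGray(src, dst);
    return true;
}

// Reference CPU path. To keep the sketch short, BGR_TO_YUV is reduced
// to its luma channel, so both codes produce the same output here.
void cpuConvert(const Pixels& src, Pixels& dst, ConversionCode code) {
    (void)code;
    bgrToGray(src, dst);
}

// Dispatcher: returns true if the accelerated path ran, false on CPU fallback.
bool convert(const Pixels& src, Pixels& dst, ConversionCode code, bool acceleratorAvailable) {
    if (acceleratorAvailable && acceleratedConvert(src, dst, code)) return true;
    cpuConvert(src, dst, code);
    return false;
}
```

The caller sees one function and a valid result either way; which path ran is an implementation detail, which is the point of the transparent API.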
    • OpenCV+OpenCL™ execution model
      [Diagram: CPU threads, cv::ocl::Queue, cv::ocl::Device]
      • One OpenCL queue and one OpenCL device per CPU thread
      • OpenCL kernels are executed asynchronously
      • cv::ocl::finish() puts the barrier in the current CPU thread; .getMat() automatically calls it.
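
This execution model can be imitated with a thread-local task queue: enqueueing returns immediately, and results only become visible after a `finish()` barrier. This is a sketch under simplifying assumptions, not OpenCV's implementation; `CommandQueue`, `currentQueue`, `enqueue`, and `finish` are hypothetical, and asynchronous execution is modelled as deferred execution on the calling thread rather than on a device.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical command queue holding work that has been submitted but not yet run.
class CommandQueue {
public:
    // Submit work and return immediately, as an asynchronous kernel launch would.
    void enqueue(std::function<void()> task) { pending_.push_back(std::move(task)); }

    // Barrier: run everything submitted so far, in order, before returning.
    void finish() {
        std::vector<std::function<void()>> tasks;
        tasks.swap(pending_);
        for (auto& task : tasks) task();
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};

// One queue per CPU thread: each thread that calls this sees its own instance.
CommandQueue& currentQueue() {
    thread_local CommandQueue queue;
    return queue;
}
```

In this picture, a call like `.getMat()` would invoke `finish()` on the current thread's queue before handing data back to CPU code, so the caller never observes a half-computed result.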
    • Summary & Future directions • OpenCL™ is a great tool to boost performance of vision algorithms; OpenCV unleashes its potential to CV community • OpenCV 3.0 transparent API makes it even easier and … more transparent • possible directions: pipelines, memory allocation optimization, more algorithms ported to OpenCL
    • The first results