Advertisement
Advertisement

More Related Content

Slideshows for you(20)

Similar to "The Vision Acceleration API Landscape: Options and Trade-offs," a Presentation from the Khronos Group(20)

Advertisement

More from Edge AI and Vision Alliance(20)

Recently uploaded(20)

Advertisement

"The Vision Acceleration API Landscape: Options and Trade-offs," a Presentation from the Khronos Group

  1. Copyright © 2017 Your Company Name 1 Your Company Logo goes here. See Slide 8 for instructions Neil Trevett, Khronos President VP Developer Ecosystem, NVIDIA May2017 Vision Acceleration API Landscape: Options and Trade-offs
  2. © Copyright Khronos Group 2017 - Page 2© Copyright Khronos Group 2017 - Page 2© Copyright Khronos Group 2017 - Page 2© Copyright Khronos Group 2017 - Page 2 The Search for Vision Performance and Portability Leads to a layered Ecosystem – and Graph Programming Explicit Kernels Diverse Hardware (GPUs, DSPs, FPGAs) GPU FPGA DSP Dedicated Hardware Single Source C++ Languages Graph Programming TensorRT Code Intermediate Representations PTX HSAIL
  3. © Copyright Khronos Group 2017 - Page 3 © Copyright Khronos Group 2017 - Page 3 OpenVX – Performance Portable Vision • OpenVX developers express a graph of image operations (‘Nodes’) - Using a C API • Nodes can be executed on any hardware or processor coded in any language - Implementers can optimize under the high-level graph abstraction • Graphs are the key to run-time optimization opportunities… Array of Keypoints YUV Frame Gray Frame Camera Input Rendering Output Pyrt Color Conversion Channel Extract Optical Flow Harris Track Image Pyramid RGB Frame Array of Features Ftrt-1OpenVX Graph OpenVX Nodes Feature Extraction Example Graph
  4. © Copyright Khronos Group 2017 - Page 4© Copyright Khronos Group 2017 - Page 4© Copyright Khronos Group 2017 - Page 4© Copyright Khronos Group 2017 - Page 4 OpenVX Efficiency through Graphs.. Reuse pre-allocated memory for multiple intermediate data Memory Management Less allocation overhead, more memory for other applications Replace a sub- graph with a single faster node Kernel Fusion Better memory locality, less kernel launch overhead Split the graph execution across the whole system: CPU / GPU / dedicated HW Graph Scheduling Faster execution or lower power consumption Execute a sub- graph at tile granularity instead of image granularity Data Tiling Better use of data cache and local memory
  5. © Copyright Khronos Group 2017 - Page 5© Copyright Khronos Group 2017 - Page 5© Copyright Khronos Group 2017 - Page 5© Copyright Khronos Group 2017 - Page 5 OpenVX Evolution OpenVX 1.0 Spec released October 2014 Conformant Implementations OpenVX 1.1 Spec released May 2016 Conformant Implementations AMD OpenVX Tools - Open source, highly optimized for x86 CPU and OpenCL for GPU - “Graph Optimizer” looks at entire processing pipeline and removes, replaces, merges functions to improve performance and bandwidth - Scripting for rapid prototyping, without re- compiling, at production performance levels http://gpuopen.com/compute-product/amd-openvx/ New Functionality Expanded Nodes Functionality Enhanced Graph Framework OpenVX 1.2 Spec released 1st May 2017 New Functionality Conditional node execution Feature detection Classification operators Expanded imaging operations Extensions Neural Network Acceleration Graph Save and Restore 16-bit image operation OpenVX Roadmap Discussions New Functionality Under Discussion NNEF Import Accelerated Programmable user nodes Come to the next presentation for more details!
  6. © Copyright Khronos Group 2017 - Page 6© Copyright Khronos Group 2017 - Page 6© Copyright Khronos Group 2017 - Page 6© Copyright Khronos Group 2017 - Page 6 Khronos NNEF (Neural Net Exchange Format • Range of Neural Network tools and inferencing architectures is rapidly increasing • NNEF encapsulates neural network formal semantics - Structure, Data formats - Commonly used operations (such as convolution, pooling, normalization, etc.) • Cross-vendor Neural Net file format removes industry friction - Simple exchange between tools and inferencing engines - Unified format for network optimizations NN Authoring Framework 1 NN Authoring Framework 2 NN Authoring Framework 3 Inference Engine 1 Inference Engine 2 Inference Engine 3 NN Authoring Framework 1 NN Authoring Framework 2 NN Authoring Framework 3 Inference Engine 1 Inference Engine 2 Inference Engine 3 Every Tool Needs an Exporter to Every Accelerator
  7. © Copyright Khronos Group 2017 - Page 7 © Copyright Khronos Group 2017 - Page 7 OpenVX 1.2 and Neural Net Extension • Convolution Neural Network topologies can be represented as OpenVX graphs - Layers are represented as OpenVX nodes - Layers connected by multi-dimensional tensors objects - Layer types include convolution, activation, pooling, fully-connected, soft-max - CNN nodes can be mixed with traditional vision nodes • Import/Export Extension - Efficient handling of network Weights/Biases or complete networks • Will be able to import NNEF files into OpenVX Neural Nets Vision Node Vision Node Vision Node Downstream Application Processing Native Camera Control CNN Nodes An OpenVX graph mixing CNN nodes with traditional vision nodes
  8. © Copyright Khronos Group 2017 - Page 8© Copyright Khronos Group 2017 - Page 8© Copyright Khronos Group 2017 - Page 8© Copyright Khronos Group 2017 - Page 8 Building Graphs in C++ Rather than an API Programmatic approaches to Graph Programming Explicit Kernels Diverse Hardware (GPUs, DSPs, FPGAs) GPU FPGA DSP Dedicated Hardware Single Source C++ Languages Graph Programming TensorRT Code Intermediate Representations PTX HSAIL
  9. © Copyright Khronos Group 2017 - Page 9 © Copyright Khronos Group 2017 - Page 9 Halide – C++ Based Image Processing • Halide is a programming language embedded in C++ - C++ code builds an in-memory representation of a Halide graph - Compiled to an object file, to be run in the same process • Separate descriptions of ‘algorithm’ per output pixel and execution ‘schedule’ - Can hand or auto-optimize schedule without affecting the algorithm correctness • Compiler based on LLVM/Clang - Compiler targets: X86, ARM, CUDA, OpenCL, and OpenGL Halide Program (Image processing Algorithm) Human Expert Autoscheduler Halide Schedule (mapping of algorithm to hardware) Halide Code Generator (LLVM/Clang) Platform- optimized binary
  10. © Copyright Khronos Group 2017 - Page 10© Copyright Khronos Group 2017 - Page 10© Copyright Khronos Group 2017 - Page 10© Copyright Khronos Group 2017 - Page 10 Khronos SYCL - Single Source C++ • Single-source heterogeneous programming using STANDARD C++ - Use C++ templates and lambda functions for host & device code • Kernel Fusion in C++ is a widely used compiler technique - proven to work - Halide, Eigen, Boost.Compute, … - Optimization at the C++, not assembly, level - Achieves better performance on complex software than hand-coding • Rapid optimization of multiple libraries - more information at http://sycl.tech - SYCLBLAS - SYCL Eigen - SYCL TensorFlow - SYCL GTX - triSYCL - ComputeCpp - VisionCpp - ComputeCpp SDK
  11. © Copyright Khronos Group 2017 - Page 11© Copyright Khronos Group 2017 - Page 11© Copyright Khronos Group 2017 - Page 11© Copyright Khronos Group 2017 - Page 11 Graph Programming - Fusion Results • C++ Kernel fusion provides optimization benefits - Tiled operations in local memory - Reduced bandwidth to off-chip memory Courtesy Codeplay: https://www.slideshare.net/AndrewRichards28/open-standards-for-adas-andrew-richards-codeplay-at-autosens-2016-66476890
  12. © Copyright Khronos Group 2017 - Page 12© Copyright Khronos Group 2017 - Page 12© Copyright Khronos Group 2017 - Page 12© Copyright Khronos Group 2017 - Page 12 Convergence with Standard ISO C++ • SYCL aligns the hardware acceleration of OpenCL with direction of the C++ standard - Use of C++14 with open source C++17 Parallel STL hosted by Khronos • Khronos working with others on bringing proposals to ISO C++ for: - Executors – for scheduling work - “Managed pointers” or “channels” – for sharing data • Hoping to target C++ 20 - But timescales are tight • The eventual aim is for standard C++ to support advanced acceleration offload
  13. © Copyright Khronos Group 2017 - Page 13© Copyright Khronos Group 2017 - Page 13© Copyright Khronos Group 2017 - Page 13© Copyright Khronos Group 2017 - Page 13 Explicit Programming and Run Times Run-time support for accelerator offload Explicit Kernels Diverse Hardware (GPUs, DSPs, FPGAs) GPU FPGA DSP Dedicated Hardware Single Source C++ Languages Graph Programming TensorRT Code Intermediate Representations PTX HSAIL
  14. © Copyright Khronos Group 2017 - Page 14© Copyright Khronos Group 2017 - Page 14© Copyright Khronos Group 2017 - Page 14© Copyright Khronos Group 2017 - Page 14 OpenCL as Parallel Language/Library Backend C++ based Neural network framework MulticoreWare open source project on Bitbucket Compiler directives for Fortran, C and C++ Java language extensions for parallelism Language for image processing and computational photography Single Source C++ Programming for OpenCL Approaching 200 languages, frameworks and projects using OpenCL as a compiler target to access vendor-optimized, heterogeneous compute runtimes Low Level Explicit APIs Vision processing open source project Open source software library for machine learning
  15. © Copyright Khronos Group 2017 - Page 15© Copyright Khronos Group 2017 - Page 15© Copyright Khronos Group 2017 - Page 15© Copyright Khronos Group 2017 - Page 15 OpenCL 2.2 - Top to Bottom C++ OpenCL 1.0 Specification Dec08 Jun10 OpenCL 1.1 Specification Nov11 OpenCL 1.2 Specification OpenCL 2.0 Specification Nov13 Device partitioning Separate compilation and linking Enhanced image support Built-in kernels / custom devices Enhanced DX and OpenGL Interop Shared Virtual Memory On-device dispatch Generic Address Space Enhanced Image Support C11 Atomics Pipes Android ICD 3-component vectors Additional image formats Multiple hosts and devices Buffer region operations Enhanced event-driven execution Additional OpenCL C built-ins Improved OpenGL data/event interop 18 months 18 months 24 months OpenCL 2.1 Specification Nov1524 months SPIR-V in Core Subgroups into core Subgroup query operations clCloneKernel Low-latency device timer queries OpenCL 2.2 PROVISIONAL May167months Single Source C++ Programming Full support for features in C++14-based Kernel Language API and Language Specs Brings C++14-based Kernel Language into core specification Portable Kernel Intermediate Language Support for C++14-based kernel language e.g. constructors/destructors OpenCL C++ Kernel Language SPIR-V 1.1 with C++ support SYCL 2.2 for single source C++
  16. © Copyright Khronos Group 2017 - Page 16© Copyright Khronos Group 2017 - Page 16© Copyright Khronos Group 2017 - Page 16© Copyright Khronos Group 2017 - Page 16 OpenCL Conformant Implementations OpenCL 1.0 Specification Dec08 Jun10 OpenCL 1.1 Specification Nov11 OpenCL 1.2 Specification OpenCL 2.0 Specification Nov13 1.0|Jul13 1.0|Aug09 1.0|May09 1.0|May10 1.0 | Feb11 1.0|May09 1.0|Jan10 1.1|Aug10 1.1|Jul11 1.2|May12 1.2|Jun12 1.1|Feb11 1.1|Mar11 1.1|Jun10 1.1|Aug12 1.1|Nov12 1.1|May13 1.1|Apr12 1.2|Apr14 1.2|Sep13 1.2|Dec12 Desktop Mobile FPGA 2.0|Jul14 OpenCL 2.1 Specification Nov15 1.2|May15 2.0|Dec14 1.0|Dec14 1.2|Dec14 1.2|Sep14 Vendor timelines are first implementation of each spec generation 1.2|May15 Embedded 1.2|Aug15 1.2|Mar16 2.0|Nov15 2.0|Apr17
  17. © Copyright Khronos Group 2017 - Page 17© Copyright Khronos Group 2017 - Page 17© Copyright Khronos Group 2017 - Page 17© Copyright Khronos Group 2017 - Page 17 Strong Vulkan Momentum Games Studios publicly confirming that work is ongoing on Vulkan Titles All Major GPU Companies shipping Vulkan Drivers – for Desktop and Mobile Platforms Mobile, Embedded and Console Platforms Supporting Vulkan Embedded Linux In first 12months : #Vulkan Games on PC = 11 In first 18 months #DX12 Games on PC = 19 https://en.wikipedia.org/wiki/List_of_games_with_Vulkan_support Android TVAndroid 7.0 Nintendo Switch
  18. © Copyright Khronos Group 2017 - Page 18© Copyright Khronos Group 2017 - Page 18© Copyright Khronos Group 2017 - Page 18© Copyright Khronos Group 2017 - Page 18 Vulkan Explicit GPU Control GPU High-level Driver Abstraction Context management Memory allocation Full GLSL compiler Error detection Layered GPU Control Application Single thread per context GPU Thin Driver Explicit GPU Control Application Memory allocation Thread management Synchronization Multi-threaded generation of command buffers Language Front-end Compilers Initially GLSL Loadable debug and validation layers Vulkan 1.0 provides access to OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality but with increased performance and flexibility Loadable Layers No error handling overhead in production code SPIR-V Pre-compiled Shaders: No front-end compiler in driver Future shading language flexibility Simpler drivers: Improved efficiency/performance Reduced CPU bottlenecks Lower latency Increased portability Graphics, compute and DMA queues: Work dispatch flexibility Command Buffers: Command creation can be multi-threaded Multiple CPU cores increase performance Resource management in app code: Less hitches and surprises Vulkan Benefits SPIR-V pre-compiled shaders
  19. © Copyright Khronos Group 2017 - Page 19© Copyright Khronos Group 2017 - Page 19© Copyright Khronos Group 2017 - Page 19© Copyright Khronos Group 2017 - Page 19 Vulkan Influence on OpenCL Roadmap • OpenCL converging with Vulkan – expanding beyond graphics + more processor types - Thin, powerful, explicit run-time for control and predictability - Feature sets and dial-able precision for target market agility - Installable tools and three layer ecosystem for flexibility - Vulkan renderpasses are already a way to enabled tiled processing • OpenCL also considering optional reduced precision for vision and inferencing - Will enable significant number of DSP implementations to become conformant Thin, explicit run-time with rigorous memory/execution model. Low-latency, fine-grain pre-emption and synchronization Dial-able types and precision Features that can be enabled for particular target markets Real-time Pre- emption and QoS scheduling Explicit Asynch DMA Self-synchronized, self-scheduled graphs Stream Processing … Math Libraries Vendor-supplied and open source middleware Language Front-ends Tool Layers Installable tool & validation layers Applications API Definitions
  20. © Copyright Khronos Group 2017 - Page 20© Copyright Khronos Group 2017 - Page 20© Copyright Khronos Group 2017 - Page 20© Copyright Khronos Group 2017 - Page 20 SPIR-V Ecosystem LLVM Third party kernel and shader Languages SPIR-V • Khronos defined and controlled cross-API intermediate language • Native support for graphics and parallel constructs • 32-bit Word Stream • Extensible and easily parsed • Retains data object and control flow information for effective code generation and translation OpenCL C++OpenCL C GLSL Khronos has open sourced these tools and translators IHV Driver Runtimes Other Intermediate Forms SPIR-V Validator SPIR-V (Dis)Assembler LLVM to SPIR-V Bi-directional Translator Khronos plans to open source these tools soon HLSL https://github.com/KhronosGroup/SPIRV-Tools ‘glslang’ GLSL to SPIR-V compiler Front-end compilers leverage clang
  21. © Copyright Khronos Group 2017 - Page 21© Copyright Khronos Group 2017 - Page 21© Copyright Khronos Group 2017 - Page 21© Copyright Khronos Group 2017 - Page 21 Possible Convergence of Graph Technologies • API-created Graphs such as OpenVX benefit from flexibility of user-programmed nodes - OpenVX Tiling extension lets them participate in tiled/fused optimizations - But currently user nodes can run only on the CPU • Perhaps use a C++ language, such such as a Halide algorithm, to program user nodes - That can be offloaded and scheduled within the OpenVX graph - Perhaps use SPIR-V to define node capabilities and store portable Node programs Vision Node Vision Node Vision Node Downstream Application Processing Native Camera Control CNN Nodes Discussion – should an OpenVX user programmed node in a C++ domain specific language may have its executable stored as SPIR-V? Programmed User Node
  22. © Copyright Khronos Group 2017 - Page 22© Copyright Khronos Group 2017 - Page 22© Copyright Khronos Group 2017 - Page 22© Copyright Khronos Group 2017 - Page 22 Safety Critical APIs New Generation APIs for safety certifiable vision, graphics and compute e.g. ISO 26262 and DO-178B/C OpenGL ES 1.0 - 2003 Fixed function graphics OpenGL ES 2.0 - 2007 Shader programmable pipeline OpenGL SC 1.0 - 2005 Fixed function graphics subset OpenGL SC 2.0 - April 2016 Shader programmable pipeline subset Experience and Guidelines Vulkan SC being discussed Small driver size Advanced functionality Graphics and compute OpenVX SC 1.1 Released Today Restricted “deployment” implementation executes on the target hardware by reading the binary format and executing the pre- compiled graphs Khronos SCAP ‘Safety Critical Advisory Panel’ Guidelines for designing APIs that ease system certification. Open to Khronos member AND industry experts. If interested to join contact ntrevett@nvidia.com
  23. © Copyright Khronos Group 2017 - Page 23 © Copyright Khronos Group 2017 - Page 23 Key Takeaways and What’s Next? • Vision Tools and APIs are becoming increasingly sophisticated - Ecosystem is layering libraries, language and run-times • Compiler technologies becoming increasingly central – especially C++ - To enable graph-based language-based solutions with kernel fusion - LLVM and Clang open source projects being used by many • Safety-critical APIs becoming increasingly essential - Many vision applications need system certification • Still no cross-vendor camera APIs? - Is the time yet right for this to be a target for standardization? • Please join if your company interested in Khronos open standards - Neil Trevett | ntrevett@nvidia.com | @neilt3d
Advertisement