Imaging on embedded GPUs

Discusses the challenges of implementing imaging pipelines on mobile chipsets with ARM Mali T604 and Qualcomm Adreno 3xx GPUs.

Presented at the Bay Area Multimedia Meetup (http://www.meetup.com/Bay-Area-Multimedia-Meetup-Group) on Dec. 19, 2013.

Published in: Technology

  1. Imaging on Embedded GPUs
     Investigating flexible imaging pipelines using embedded GPUs
     Mikaël Bourges-Sévenier (msevenier at aptina dot com), Director, High-Performance Imaging
     December 19, 2013, Bay Area Multimedia
     Aptina Confidential
     © 2013 Aptina Imaging Corporation. All rights reserved. Products are warranted only to meet Aptina’s production data sheet specifications. Information, products, and/or specifications are subject to change without notice. All information is provided on an “AS IS” basis without warranties of any kind. Dates are estimates only. Drawings not to scale. Aptina and the Aptina logo are trademarks of Aptina Imaging Corporation. All other trademarks are the property of their respective owners.
  2. Agenda
     •  Overview: the need for computational imaging
     •  What is imaging?
     •  Architecture of some embedded GPUs
     •  8MP MobileHDR pipeline on ARM Mali T604
     •  Khronos Camera: a standard API for computational imaging
     •  Q&A
  3. Computational Imaging evolution
     [Diagram labels: Spatial (Volumetric); Gesture; AR; Face Detect; Face Track; Presence; Colorimetry; Brightness; Web Cam; Smart Camera; True Color, Brightness Compensation, Exposure control; User Identity; Access Control; Augmented Information; 3D Imaging; Interactive Services]
  4. Mobile Compute driving Imaging use cases
     •  Requires significant computing over large data sets
     [Diagram, use cases over time: Augmented Reality; Face, Body and Gesture Tracking; Computational Photography; 3D Scene/Object Reconstruction]
  5. Increasing Use of Imaging Sensors
     [Chart axes: Differentiation Opportunity vs. Time]
     •  Photography
        ‣  Input = 2D Camera
        ‣  Processors = ISP + CPU
        ‣  Product = Static Images
     •  Computational Photography (we are here)
        ‣  Input = MEMS + 2D Camera
        ‣  Processors = ISP + CPU + GPU
        ‣  Product = Real Time Images
     •  Perceptual Imaging
        ‣  Input = MEMS + Depth Camera
        ‣  Processors = ISP + CPU + GPU + DSP
        ‣  Product = Real Time Extracted Information
     Perceptual Imaging:
        1. Uses the full array of mobile sensors
        2. to extract information in real-time
        3. about the user and environment
        4. to generate enhanced user interactions
  6. Hardware Saves Power (e.g. Camera Sensor ISP)
     •  CPU
        ‣  Single processor or Neon SIMD, running fast
        ‣  Makes heavy use of general memory
        ‣  Non-optimal performance and power
     •  GPU
        ‣  Programmable and flexible
        ‣  Many-way parallelism, so it can run at a lower frequency
        ‣  Efficient image caching close to processors
        ‣  BUT cycles frames in and out of memory
     •  Camera ISP (Image Signal Processor)
        ‣  Little or no programmability
        ‣  Data flows through a compact hardware pipe
        ‣  Scan-line-based, no global memory
        ‣  Best perf/watt
  7. Evolution of Embedded GPUs
     [Chart: GFLOPS Trend. Y-axis: GFLOPS, 0 to 450. X-axis: Sep-2011 to Jun-2014. Plotted GPUs: Adreno 320, Adreno 330, Mali T628, PowerVR 6, Tegra 5, PowerVR 5XT, Mali T604]
     •  Trend: 40% more GFLOPS/quarter
     •  Estimated at sustained peak performance. Likely to be much less in practice.
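As a rough worked example of what the trend line on this slide implies, compounding 40% per quarter gives close to a 4x increase per year. The function name and the 100 GFLOPS starting value below are illustrative assumptions, not figures read from the chart.

```python
def compound_growth(start, rate_per_quarter, quarters):
    """Value after compounding `rate_per_quarter` for `quarters` quarters."""
    return start * (1.0 + rate_per_quarter) ** quarters

# Growth factor over one year (4 quarters) at 40% per quarter.
yearly_factor = compound_growth(1.0, 0.40, 4)
print(f"yearly factor: {yearly_factor:.2f}x")

# Hypothetical starting point of 100 GFLOPS, projected 8 quarters ahead.
two_years = compound_growth(100.0, 0.40, 8)
print(f"after 8 quarters: {two_years:.0f} GFLOPS")
```

The slide's own caveat still applies: these are peak estimates, and sustained throughput is likely to be much lower.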
  8. Computational Imaging pipeline
     •  Pre-processing: for non-standard Bayer pixels (e.g. iHDR)
     •  ISP: for fast demosaic, lens shading, denoising, 3A, statistics …
     •  Post-processing: for special reconstruction of colors (e.g. Clarity+)
     •  Processing requires control of metadata aligned with data
     [Diagram blocks: Lens; Color Filter Array; CMOS sensor; Pre-processing; Image Signal Processor (ISP); Post-processing; App. Data formats: Bayer, RGB, YUV. Other paths: lens, sensor, aperture control; metadata; 3A stats]
  9. Different Computing Devices
     •  Latency-optimized CPU
        ‣  Fast serial processing
        ‣  Lots of big on-chip caches
        ‣  Sophisticated control
     •  Throughput-optimized GPU
        ‣  Scalable parallel processing
        ‣  Multithreading can hide latency
        ‣  Simpler control, cost amortized over ALUs via SIMD
     •  DSPs are similar to CPUs
        ‣  Typically integer optimized (some have rudimentary floating point support)
        ‣  With signal processing intrinsics
     •  FPGA
        ‣  Can be tailored to a cross between CPU/DSP and GPU
     [Diagram: SISD (scalar ALU) computes c = a + b; SIMD (vector ALU) computes (c1, c2, c3, c4) = (a1, a2, a3, a4) + (b1, b2, b3, b4)]
     OpenCL works on all devices, but performance isn’t guaranteed
  10. Stream-based vs. Frame-based
     •  Stream-based (ISP)
        ‣  For low-memory devices
        ‣  Set of lines processed by kernels
        ‣  Delay: #lines a kernel needs
     •  Frame-based (GPU)
        ‣  For fast data-parallel devices
        ‣  Full image frame processed
        ‣  Delay: whole frame(s)
     [Diagram, stream-based: continuous stream of pixels -> Kernel (accumulates lines) -> Kernel -> final image]
     [Diagram, frame-based: Frame -> Kernel -> Frame -> Kernel -> Frame -> Kernel -> Frame]
     Completely different kernels
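The difference in buffering and delay described on this slide can be sketched with a toy 3-line vertical mean filter. This is only an illustration in plain Python: the function names and the filter are assumptions, and real ISP and GPU kernels differ far more than these two functions do.

```python
from collections import deque

def stream_vertical_mean(lines, taps=3):
    """Stream-based kernel: emits an output line as soon as `taps` input
    lines have accumulated, so its delay is `taps` lines."""
    window = deque(maxlen=taps)  # line buffer; never holds more than `taps` lines
    for line in lines:
        window.append(line)
        if len(window) == taps:
            # Average the buffered lines column by column.
            yield [sum(col) / taps for col in zip(*window)]

def frame_vertical_mean(frame, taps=3):
    """Frame-based kernel: the whole frame must be in memory before it runs."""
    return [
        [sum(col) / taps for col in zip(*frame[r:r + taps])]
        for r in range(len(frame) - taps + 1)
    ]

# A tiny 6-line, 4-pixel-wide test frame.
frame = [[float(r * 10 + c) for c in range(4)] for r in range(6)]
streamed = list(stream_vertical_mean(iter(frame)))
framed = frame_vertical_mean(frame)
```

Both versions produce the same result here; what changes is how much data must be resident at once (three lines versus the full frame) and how long the first output is delayed.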
  11. What is Imaging?
     Capture an image from a camera sensor and process it to get a renderable image.
  12. How Imaging Sensors work
     [Figure source: http://www.photoaxe.com]
     Bayer GRBG pattern
     •  50% green
     •  25% red, 25% blue
     Bayer CFA is one type of pattern
  13. Bayer Demosaicing
     •  Twice as many G samples as R or B, since the eye is more sensitive to luminance than chrominance
     •  Convert pixel colors from Bayer space to full RGB color
     •  Complex interpolation to avoid artifacts (e.g. on edges)
     [Figure: grid of interpolated full-RGB pixels]
     Bayer pattern index:
        0  GRBG
        1  RGGB
        2  GBRG
        3  BGGR
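The four pattern names on this slide can be read as a 2x2 tile that repeats across the sensor. The sketch below assumes each name lists the tile in row-major order starting at the top-left pixel (a common convention, but not stated on the slide); the function names are illustrative.

```python
# Pattern index -> 2x2 tile name, as listed on the slide.
PATTERNS = {0: "GRBG", 1: "RGGB", 2: "GBRG", 3: "BGGR"}

def bayer_color(pattern, x, y):
    """Color filter over pixel (x, y), assuming `pattern` names the 2x2 tile
    in row-major order: [top-left, top-right, bottom-left, bottom-right]."""
    return pattern[(y % 2) * 2 + (x % 2)]

def mosaic_counts(pattern, width, height):
    """Count how many R, G and B sites a width x height sensor has."""
    counts = {"R": 0, "G": 0, "B": 0}
    for y in range(height):
        for x in range(width):
            counts[bayer_color(pattern, x, y)] += 1
    return counts

counts = mosaic_counts(PATTERNS[0], 4, 4)
print(counts)
```

Whatever the pattern, every 2x2 tile holds two green sites and one each of red and blue, which is where the 50% / 25% / 25% split comes from. Demosaicing then has to interpolate the two missing colors at every site.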
  14. OpenCL (memory system)
     •  Desktop: non-uniform memory
        ‣  Data is physically copied between GPU and CPU memory
     •  Embedded: uniform memory
        ‣  __local memory may be in __global
        ‣  Cheap data exchange between CPU and GPU
  15. A tour of some embedded GPUs
     ARM Mali T604, Qualcomm Adreno 330
  16. ARM Mali T604, T628
     •  Found in Samsung Exynos 5 Dual (T604)/Octa (T628) Application Processors
        ‣  Chromebook, Nexus 10, Samsung S4…
     •  32nm process for T604, 28nm for T628
     •  T604 has 4 shader cores, T628 has 8 cores
     •  Tri-pipe architecture: each GPU core has 3 types of instruction pipelines
        ‣  1x load/store
        ‣  1x texture
        ‣  2x ALU (T604) / 4x ALU (T628)
     •  64-bit integers and IEEE 754 floating-point ALUs
  17. OpenCL and OpenGL ES
     The Vithar Architecture
     [Diagram: Thread Issue -> Load/Store Pipeline, Arithmetic Pipeline, Arithmetic Pipeline, Texturing Pipeline -> Thread Completion]
     •  3 kinds of pipelines
        ‣  Arithmetic
        ‣  Load/Store
        ‣  Texture
     •  Barrel-threaded (like AMD/NVIDIA)
     •  No SIMT execution (unlike AMD/NVIDIA)
     •  SIMD (like AMD)
        ‣  Use vectors for best performance!
     •  256 threads max (64 in practice)
  18. Midgard: Job Execution and Load-balancing
     •  Automatic hardware load balancing
     •  Seamless concurrent execution
     •  Integrated seamless power manager
  19. Qualcomm MSM8974
     •  Process: 28nm
     •  CPU: 4x Krait 2.3 GHz
        ‣  ARMv7A Neon instruction set
        ‣  Power and performance efficiencies over ARM
        ‣  4KB+4KB L0, 16KB+16KB L1, 2MB L2 cache
        ‣  No 64b support
     •  GPU: Adreno 330 450 MHz
        ‣  32x 32b scalar ALUs/pipeline, 8 pipelines, 129.6 GFLOPS
           •  16b kernels provide 2x performance
        ‣  128b registers
        ‣  8 KB local memory per shader core
        ‣  8 KB constant memory
        ‣  12 reads, 4 writes simultaneous per clock
        ‣  512 work-items max
        ‣  1.5 MB on-chip SRAM
        ‣  Tiled renderer max 3.6 GPix/s
     •  Hexagon DSP
        ‣  3x core, 600 MHz, 16 KB L1, 256 KB L2, integrated MMU
        ‣  Limited floating-point support (no division, no log/exp…)
     •  RAM: 2GB 2x LP-DDR3 800 MHz (12.8 GB/s)
     [Inset slide: MSM8974 Adreno 330 vs Adreno 320]
     •  Adreno 330 has better performance
        ‣  450 MHz GPU clock (up from 400 MHz in Adreno 320)
        ‣  2x better shader performance than A320 – 2x more ALU blocks
     •  Dedicated GPU power rail
        ‣  Will allow GPU to be at a lower frequency and voltage than the FABR
     •  Adreno 330 Shader Processor “SP” Block
        ‣  Total of 32 (32-bit) scalar ALUs
        ‣  16-bit ALUs are used if the entire kernel is 16-bit; otherwise the 32b ALUs are used
  20. MobileHDR pipeline
  21. Arndale Samsung Exynos 5 Dual board
     •  Arndale Samsung Exynos 5 board
        ‣  CPU: ARM Cortex-A15 (2-core) 1.7 GHz 32nm
           •  32KB L1 cache, 1MB L2 cache
        ‣  GPU: ARM Mali T604
           •  64 concurrent threads
           •  Vector ALUs
           •  128b registers
           •  OpenCL 1.1 Full Profile
        ‣  RAM: 2GB LP-DDR3 800 MHz (12.8 GB/s)
        ‣  Truly unified cached memory
           •  CPU and GPU memory is shared – NO COPY!
           •  128b wide L1 and L2 access
  22. ARM Mali T604 GPUs in Samsung Exynos 5 Dual
     Type:                            Vector GPU
     Process:                         32nm
     OpenCL:                          1.1 Full Profile
     Unified memory:                  Yes
     Rendering:                       Tile
     Work-items:                      256
     Clock:                           533 MHz
     L2 cache:                        1MB
     Register width:                  128b
     Global memory:                   2GB LP-DDR3 800 MHz (12.8 GB/s)
     ALUs:                            8 (2 ALUs/core)
     Throughput:                      100 GFLOPS
     Local memory:                    32KB/core (global)
     Constant memory:                 64KB
     Texture cache:                   Yes
     Compute devices (shader cores):  4
     Cacheline:                       64 bytes
     16/32/64b floats:                No/yes/yes
  23. Avoid buffer copy
     •  Mali/Adreno have unified memory
        ‣  Use CL_MEM_ALLOC_HOST_PTR to avoid a copy between CPU and GPU
     •  Mali has no dedicated local memory
     •  Adreno has local memory (1.5MB SRAM 115GB/s)
     [Diagram, host data pointers in global memory, three cases:]
     •  malloc()
        ‣  Buffers created by the user (malloc) are not mapped into the GPU memory space
     •  clCreateBuffer(CL_MEM_USE_HOST_PTR)
        ‣  Creates a new buffer and copies the data over from the malloc() buffer (but the copy operations are expensive)
     •  clCreateBuffer(CL_MEM_ALLOC_HOST_PTR)
        ‣  The buffer created by clCreateBuffer() is visible to both the CPU (host) and the GPU (compute device)
     [Partially cropped text:]
        Where possible don’t use CL_…
        – Create buffers at the start of your app
        – Use CL_MEM_ALLOC_HOST_PTR instead of m…
        – Then you can use the buffer on both …
  24. Aptina Sensor with MobileHDR™ Turned off
  25. Aptina Sensor with MobileHDR™ Turned on
  26. AR0833 8MP Camera sensor
     •  Frame is inscribed in a 1/3.2” circle
        ‣  4:3 for images e.g. 8MP 3264 x 2448
        ‣  16:9 for video e.g. 6MP 3264 x 1836
     •  10 bits per pixel (framed in 16 bits)
     •  At 30fps, we need 343 MB/s for 180 MPix/s
     •  Interlaced HDR feature
     •  Interface with ISP
        ‣  Data over MIPI CSI-2 (serial)
        ‣  Control over I2C
     [Figure: 4:3 (3264 x 2448) and 16:9 (3264 x 1836) frames inside the 1/3.2" image circle]
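The bandwidth figures on this slide can be reproduced with a short calculation. The sketch below assumes the 16:9 video mode (3264 x 1836), 2 bytes per pixel because each 10-bit sample is framed in 16 bits, and "MB" meaning 2^20 bytes; the slide does not state these choices explicitly, and the function name is illustrative.

```python
def sensor_bandwidth(width, height, fps, bytes_per_pixel):
    """Return (pixels per second, bytes per second) for a raw sensor stream."""
    pixels_per_second = width * height * fps
    bytes_per_second = pixels_per_second * bytes_per_pixel
    return pixels_per_second, bytes_per_second

# Assumed: 16:9 video mode, 30 fps, 10-bit samples stored in 16-bit words.
pixel_rate, byte_rate = sensor_bandwidth(3264, 1836, 30, 2)

mpix_per_s = pixel_rate / 1e6    # megapixels per second
mib_per_s = byte_rate / 2**20    # mebibytes per second

print(f"{mpix_per_s:.1f} MPix/s, {mib_per_s:.1f} MiB/s")
```

Under these assumptions the result rounds to the 180 MPix/s and 343 MB/s quoted on the slide. In decimal megabytes the same stream is about 360 MB/s.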
  27. Feature: Interlaced HDR
     •  1 frame contains 2 exposures interlaced
     •  Ratio between odd and even pairs
        ‣  User controlled: 1x, 2x, 4x, 8x
     [Datasheet excerpt, AR0833_DS Rev. F, Pub. 4/13, ©2011 Aptina Imaging Corporation:]
     Interlaced HDR Readout
     The sensor enables HDR by outputting frames where even and odd row pairs within a single frame are captured at different integration times. This output is then matched with an algorithm designed to reconstruct this output into an HDR still image or video.
     The sensor HDR is controlled by two shutter pointers (Shutter pointer 1, Shutter pointer 2) that control the integration of the odd (Shutter pointer 1) and even (Shutter pointer 2) row pairs.
     [Figure 16: HDR Integration Time. Labels: Tint 1, Tint 2, Sample pointer, Shutter pointer 1, Shutter pointer 2, I-FRAME 1, I-FRAME 2, Exposure I-FRAME 1, Exposure I-FRAME 2, Output I-FRAME 1 and 2, Output Frame from Sensor]
     [Image labels: Exposure 1, Exposure 2]
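A first step in handling such a frame is to separate the two exposures by row pair. The sketch below is a minimal illustration, not Aptina's reconstruction algorithm: it assumes rows 0-1 belong to the first exposure and rows 2-3 to the second (which pair is the long exposure is not stated here), and the function names are hypothetical.

```python
def split_row_pairs(rows):
    """Separate an interlaced-HDR frame into its two exposures by row pair.

    Assumption: rows 0-1 form the first pair (exposure 1), rows 2-3 the
    second pair (exposure 2), and the pairs keep alternating after that.
    """
    exposure_1, exposure_2 = [], []
    for index, row in enumerate(rows):
        pair = index // 2
        (exposure_1 if pair % 2 == 0 else exposure_2).append(row)
    return exposure_1, exposure_2

def normalize_short(rows, ratio):
    """Scale short-exposure rows by the exposure ratio (1, 2, 4 or 8) so that
    both exposures share one brightness scale before they are merged."""
    return [[value * ratio for value in row] for row in rows]

# Toy 8-row frame where every pixel holds its own row index.
frame = [[r, r] for r in range(8)]
exp1, exp2 = split_row_pairs(frame)
scaled = normalize_short(exp2, 4)
```

Working in row pairs rather than single rows keeps both rows of each 2x2 Bayer tile at the same exposure. A real reconstruction would also have to interpolate the missing rows of each exposure and handle saturated pixels, which this sketch leaves out.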
  28. mobileHDR demo
     •  Zero-copy between sensor/OpenCL and OpenCL/OpenGL
     •  On Arndale board (Samsung Exynos 5 Dual with Mali T604 GPU)
     [Diagram, OpenCL stages: Noise Reduction; iHDR Reconstruction; Bayer scaler; Tone Mapping; Color Correction. Data labels: 10b iHDR 3264x1836; 14b; RGB888; 1080p. Interop: EGLImage, CL Image, GL Texture, OpenGL ES]
  29. Summary
     •  Embedded GPUs are ideal candidates for computational imaging
        ‣  Performance at reasonable image sizes is now available
        ‣  Power efficiency is being addressed
     •  OpenCL 1.1 is available on all recent application processors
        ‣  But may be reserved for OEMs
        ‣  Performance portability isn’t guaranteed (but this is true of any high-performance application)
     •  Opening the camera image-processing “black box” is now feasible, enabling incredible new applications
  30. Khronos Camera
     A standard to control image acquisition and processing.
  31. Typical Imaging Pipeline
     •  Pre- and Post-processing can be done on CPU, GPU, DSP…
     •  ISP controls the camera via 3A algorithms: Auto Exposure (AE), Auto White Balance (AWB), Auto Focus (AF)
     •  ISP may be a separate chip or within the Application Processor
     [Diagram blocks: Lens; Color Filter Array; CMOS sensor; Pre-processing; Image Signal Processor (ISP); Post-processing; App. Data formats: Bayer, RGB/YUV. Other paths: lens, sensor, aperture control; 3A]
     Need for an advanced camera control API:
        - to drive more flexible app camera control
        - over more types of camera sensors
        - with tighter integration with the rest of the system
  32. Advanced Camera Control Use Cases
     •  High-dynamic range (HDR) and computational flash photography
        ‣  High-speed burst with individual frame control over exposure and flash
     •  Rolling shutter elimination
        ‣  High-precision intra-frame synchronization between camera and motion sensor
     •  HDR Panorama, photo-spheres
        ‣  Continuous frame capture with constant exposure and white balance
     •  Subject isolation and depth detection
        ‣  High-speed burst with individual frame control over focus
     •  Time-of-flight or structured light depth camera processing
        ‣  Aligned stacking of data from multiple sensors
     •  Augmented Reality
        ‣  60Hz, low-latency capture with motion sensor synchronization
        ‣  Multiple Region of Interest (ROI) capture
        ‣  Multiple sensors for scene scaling
        ‣  Detailed feedback on camera operation per frame
  33. Camera API Architecture (FCAM based)
     •  No global state
        ‣  State travels with image requests
        ‣  Every stage in the pipeline may have different state
           •  -> allows fast, deterministic state changes
     •  Synchronize devices
        ‣  Lens, flash, sound capture, gyro…
        ‣  Devices can schedule Actions
           •  E.g. to be triggered on exposure change
           •  Enables device synchronization
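The "no global state" idea can be illustrated with a toy model in which every capture request carries its own settings and every returned frame is tagged with the request that produced it. This is a sketch under stated assumptions: `Shot`, `Frame`, `ToySensor` and their fields are hypothetical names chosen for illustration, not the FCAM or Khronos Camera API, and the pixel level is faked.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Shot:
    """Per-request capture state (hypothetical fields)."""
    exposure_us: int
    gain: float

@dataclass(frozen=True)
class Frame:
    """Result tagged with the exact request that produced it."""
    shot: Shot
    mean_level: int

class ToySensor:
    """Stand-in for a camera pipeline that keeps no global settings."""

    def __init__(self):
        self._pending = deque()

    def capture(self, shot):
        # The request carries its own state into the pipeline.
        self._pending.append(shot)

    def get_frame(self):
        shot = self._pending.popleft()
        # Fake 10-bit pixel level so that each frame reflects its own request.
        level = min(1023, int(shot.exposure_us * shot.gain / 100))
        return Frame(shot=shot, mean_level=level)

# Queue a three-shot exposure burst, then read the frames back.
sensor = ToySensor()
burst = [Shot(exposure_us=e, gain=1.0) for e in (1000, 4000, 16000)]
for shot in burst:
    sensor.capture(shot)
frames = [sensor.get_frame() for _ in burst]
```

Because each frame reports the settings it was captured with, an application can queue a burst with different exposures back to back and never has to guess which settings applied to which frame.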
  34. Visual Sensor Revolution
     •  Single sensor RGB cameras are just the start of the mobile visual revolution
        ‣  IR sensors – LEAP Motion, eye-trackers
     •  Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras
        ‣  Stereo pair can enable object scaling and enhanced depth extraction
        ‣  Plenoptic Field processing needs FFTs and ray-casting
     •  Hybrid visual sensing solutions
        ‣  Different sensors mixed for different distances and lighting conditions
     •  GPUs today – more dedicated ISPs tomorrow?
     [Image captions: Dual Camera (LG Electronics); Plenoptic Array (Pelican Imaging); Capri Structured Light 3D Camera (PrimeSense)]
  35. Khronos APIs for Augmented Reality
     [Diagram blocks: Advanced Camera Control and stream generation; 3D Rendering and Video Composition on GPU; Audio Rendering; Application on CPUs, GPUs and DSPs; Sensor Fusion; Vision Processing; MEMS Sensors; Camera Control API]
     •  EGLStream - stream data between APIs
     •  Precision timestamps on all sensor samples
     AR needs not just advanced sensor processing, vision acceleration, computation and rendering - but also for all these subsystems to work efficiently together
  36. Khronos Camera API
     •  Catalyze camera functionality not available on any current platform
        ‣  Open API that aligns with future platform directions for easy adoption
        ‣  E.g. could be used to implement future versions of Android Camera HAL
     •  Control multiple sensors with sync and alignment
        ‣  E.g. Stereo pairs, Plenoptic arrays, TOF or structured light depth cameras
     •  More detailed control per frame
        ‣  Format flexibility, Region of Interest (ROI) selection
     •  Global Timing & Synchronization
        ‣  E.g. Between cameras and MEMS sensors
     •  Application control over ISP processing (including 3A)
        ‣  Including multiple, re-entrant ISPs
     •  Flexible processing/streaming
        ‣  Multiple output streams and streaming rows (not just frames)
        ‣  RAW, Bayer and YUV Processing
  37. Camera API Design Milestones and Philosophy
     •  C-language API starting from proven designs
        ‣  e.g. FCAM, Android camera HAL V3
     •  Design alignment with widely used hardware standards
        ‣  e.g. MIPI CSI
     •  Focus on mobile, power-limited devices
        ‣  But do not preclude other use cases such as automotive, surveillance, DSLR…
     •  Minimize overlap and maximize interoperability with other Khronos APIs
        ‣  But other Khronos APIs are not required
     •  Provide support for vendor-specific extensions
     [Timeline labels, in extracted order: Apr13; Jul13; Group charter approved; 4Q13; Provisional specification; 1Q14; First draft specification; 2Q14; Sample implementation and tests; 3Q14; Specification ratification]
  38. Questions & Answers
     Thank you!
