High Performance Pedestrian Detection On TEGRA X1

My talk at GTC'16, presenting an innovative approach to efficiently mapping a popular pedestrian detection algorithm (HOG) onto an NVIDIA Tegra GPU.

Speaker notes
  • When decreasing DLP, ILP may not grow as expected because of per-thread resource limits or dependencies between operations.

    When decreasing ILP, DLP may be limited by redundant operations, additional resource occupancy, or inter-thread communication.
  • 1 pixel per thread, 4 pixels per thread, 1 cell per thread, 1 block per thread, or even 1 window per thread

    Hundreds of warps, which is not enough to saturate a large GPU like Titan X
  • Magnitude and orientation are stored as 16-bit integers in memory
  • Over the last two years, convolutional neural networks
    (CNNs) have brought new levels of accuracy to object detection.
    For common ADAS problems like vehicle detection
    and pedestrian detection, the CNN accuracy gains have been
    moderate. However, CNNs offer huge accuracy improvements
    in recognizing textured objects like plants and specific types
    of dogs and cats.

    Speed is the major downside of CNN-based object detection.
    For example, the R-CNN [25] object detector operates
    at roughly 1/10 fps (2000 J/frame) on a GPU, with most of
    the time spent extracting CNN features. With a few tricks to
    amortize CNN feature computation, it is possible to accelerate
    CNN-based object detection to 1 fps (200 J/frame) on GPUs,
    as discovered independently by [27], [28], [29], and [30]. Even
    with these improvements, CNN-based object
  • High Performance Pedestrian Detection On TEGRA X1

    1. April 4-7, 2016 | Silicon Valley. HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1. Max Lv, NVIDIA (mlv@nvidia.com, https://github.com/madeye); Brant Zhao, NVIDIA. April 7.
    2. AGENDA: Histogram of Oriented Gradients on GPU; Optimization Opportunities on a Tegra GPU; Optimization #1: Improve ILP (Instruction Level Parallelism); Optimization #2: Approximation; Optimization #3: Specialization; Final Results.
    3. PEDESTRIAN DETECTION: HOG DESCRIPTOR (Histogram of Oriented Gradients). A gradient-based feature descriptor developed for pedestrian detection, introduced by Navneet Dalal and Bill Triggs (CVPR'05). A global descriptor for the complete body; very high-dimensional, typically ~4000 dimensions. Source: Dalal, N.; Triggs, B., "Histograms of oriented gradients for human detection," CVPR 2005.
    4. HOG PIPELINE ON GPU: four GPU kernels.
       Oriented Gradients: 3x3 Sobel filter with gamma correction.
       Block Histograms: pixels vote in proportion to gradient magnitude, with tri-linear interpolation, in each block (16x16 pixels).
       Histogram Normalization: normalize each block histogram (36-bin).
       Linear SVM: a linear SVM classifier; the dot product of each window (7x15 36-bin normalized histograms) with the trained coefficients.
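The per-pixel work of the first two stages can be sketched in plain C++. This is a sketch with names of my own (not the VisionWorks kernels), assuming the usual unsigned-orientation setup of 9 bins over 0-180 degrees from the Dalal-Triggs paper:

```cpp
#include <cmath>
#include <cassert>

// Sketch of the per-pixel gradient step of HOG (names are mine, not the
// VisionWorks API). Orientation is folded into [0, 180) degrees and mapped
// onto 9 histogram bins of 20 degrees each, as in Dalal-Triggs.
struct Gradient {
    float magnitude;
    int   bin;       // 0..8
};

inline Gradient hog_gradient(float gx, float gy) {
    Gradient g;
    g.magnitude = std::sqrt(gx * gx + gy * gy);
    float deg = std::atan2(gy, gx) * 180.0f / 3.14159265f; // (-180, 180]
    if (deg < 0.0f)    deg += 180.0f;   // fold: unsigned gradient
    if (deg >= 180.0f) deg -= 180.0f;
    g.bin = static_cast<int>(deg / 20.0f);  // 9 bins of 20 degrees
    if (g.bin > 8) g.bin = 8;               // guard against rounding
    return g;
}
```

In the real pipeline each pixel's magnitude is then split across the two nearest bins and four nearest cells (the tri-linear interpolation of slide 4); the sketch keeps only the hard-binning core.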
    5. OPTIMIZATION OPPORTUNITIES. Our goal is to improve performance further, starting from a well-optimized implementation in VisionWorks, on the 2-SM Maxwell GPU in Tegra X1. Trade-offs between ILP (instruction-level parallelism) and DLP (data-level parallelism), between precision and computation, and between generalization and specialization.

       NVIDIA Tegra X1 Maxwell GPU specification:
       CUDA Cores         256
       Texture Units      16
       ROPs               16
       GPU Clock          ~1000 MHz
       Memory Clock       1600 MHz (LPDDR4)
       Memory Bus Width   64-bit
       FP16 Peak          1024 GFLOPS
       FP32 Peak          512 GFLOPS
       Architecture       Maxwell
    6. OPTIMIZATION #1: Improve ILP (Instruction Level Parallelism). Existing GPU kernels are optimized for large GPUs, improving DLP to saturate the SMs. For the small GPU on Tegra, it is possible to gain performance with larger ILP but smaller DLP: increase the workload in each thread while the total number of threads decreases, and try different configurations until the best performance is achieved. [Figure: trade-off curve of ILP (in-flight ops per thread) vs. DLP (thread count), with operating points A and B]
    7. OPTIMIZATION #1 (Example: Best ILP & DLP trade-off for Block Histograms). Various patterns can compute a block of histograms; the best trade-off is for each thread to calculate 3x12 pixels. This does not work well on large GPUs like Titan X, but is suitable for Tegra X1. [Figure: four threads (T1-T4) tiling a 16x16-pixel block]
    8. OPTIMIZATION #2: Approximation. 32-bit floating point on the GPU is unnecessary for most computer vision applications; `--use_fast_math` is enabled by default for our CV projects. Compute in floating point, but load and store pixels as integers using texture instructions. Sometimes it is safe to relax the precision even further. (Pixels are stored as 8-bit/16-bit integers in memory: 0, 128, 255, ...; converted/(de)normalized/sampled in the texture unit; computed as FP16/FP32 in the SM: 0, 0.5, 1.0, ...)
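The storage scheme on slide 8, integers in memory and normalized floats in the SM, is what the texture unit provides in hardware for free. Spelled out as a plain C++ sketch (function names are mine, not the hardware path):

```cpp
#include <cstdint>
#include <cmath>
#include <cassert>

// Sketch of slide 8's storage scheme: pixels live in memory as 8-bit
// integers and are computed on as normalized floats. On the GPU the
// texture unit performs this conversion during the load.
inline float u8_to_norm(uint8_t v) { return v / 255.0f; }   // 0..255 -> 0..1

inline uint8_t norm_to_u8(float f) {                        // 0..1 -> 0..255
    float c = f < 0.0f ? 0.0f : (f > 1.0f ? 1.0f : f);      // clamp to [0, 1]
    return static_cast<uint8_t>(c * 255.0f + 0.5f);         // round to nearest
}
```

The round trip is lossless for every 8-bit value, which is why the integer-in-memory representation costs no accuracy for 8-bit pixel data.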
    9. OPTIMIZATION #2 (Example: Fast atan2f() for Oriented Gradients). A fast version of atan2f() with 3rd-order Lagrange polynomial interpolation, and without handling corner cases:

       float atan2f_lagrange_3rd(const float dy, const float dx) {
           float A = 0.0f, B = 0.0f;
           float Offset = copysignf(float(M_PI), dy);
           if (fabsf(dy) < fabsf(dx)) {
               A = dx; B = dy;
               if (dx >= 0.0f) Offset = 0.0f;
           } else {
               A = -dy; B = dx;
               Offset *= 0.5f;
           }
           const float r = B / A;
           const float p = 1.0f - fabsf(r);
           return ((-0.0663f*p + 0.311f) * p + float(M_PI/4.0)) * r + Offset;
       }

       Comparison between different atan2f implementations:
       Metric                    Native   This work
       FMA/FADD (op)             12       4
       MUFU.RCP (op)             2        1
       Handle corner case (op)   ~30      ~5
       Avg. error (degree)       0.01     0.05
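The approximation on slide 9 can be checked on the host against the libm atan2. The function body below is transcribed from the slide (CUDA intrinsics replaced by their <cmath> equivalents); the error-measurement harness is mine:

```cpp
#include <cmath>
#include <cassert>

// Host-side transcription of slide 9's fast atan2f: a 3rd-order Lagrange
// polynomial fit of atan on [-1, 1] plus a quadrant offset. Corner cases
// such as dx == dy == 0 are deliberately not handled, as the slide notes.
inline float atan2f_lagrange_3rd(float dy, float dx) {
    const float kPi = 3.14159265f;
    float A, B;
    float offset = std::copysign(kPi, dy);
    if (std::fabs(dy) < std::fabs(dx)) {
        A = dx; B = dy;
        if (dx >= 0.0f) offset = 0.0f;   // right half-plane: no shift
    } else {
        A = -dy; B = dx;
        offset *= 0.5f;                  // +/- pi/2 around the y axis
    }
    const float r = B / A;               // always in [-1, 1]
    const float p = 1.0f - std::fabs(r);
    return ((-0.0663f * p + 0.311f) * p + kPi / 4.0f) * r + offset;
}

// Worst-case angular error (degrees) vs. the reference atan2 on a
// unit-circle grid, avoiding the exact-axis corner cases.
inline float max_error_deg() {
    const float kPi = 3.14159265f;
    float worst = 0.0f;
    for (int i = 0; i < 720; ++i) {
        float a = (i + 0.5f) * kPi / 360.0f;     // half-degree offsets
        float dy = std::sin(a), dx = std::cos(a);
        float d = std::fabs(atan2f_lagrange_3rd(dy, dx) - std::atan2(dy, dx));
        if (d > kPi) d = 2.0f * kPi - d;         // unwrap the +/- pi seam
        if (d > worst) worst = d;
    }
    return worst * 180.0f / kPi;
}
```

On this grid the worst-case error stays well below a degree, consistent with the slide's reported 0.05-degree average error.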
    10. OPTIMIZATION #3: Specialization. Specialize the parameters of CV applications to enable further optimization: fully unroll loops to eliminate index computation and conditional branches, allow automatic register blocking and better instruction scheduling by the compiler, and allow more tricks to reuse on-chip data.

        __global__ void kernel(int N) {
            ...
            #pragma unroll
            for (int i = 0; i < N; i++) {
                if (i % 3) { ... }
                ...
                tmp[i] += ...
            }
            ...
        }
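The same effect can be seen in plain C++ (a sketch; the template parameter stands in for the specialized CUDA kernel parameter, and the function is mine): with the trip count fixed at compile time, the compiler can fully unroll the loop, fold away the index arithmetic, and resolve the `i % 3` branch per iteration.

```cpp
// Sketch of slide 10's specialization idea in plain C++. When N is a
// compile-time constant, the loop is fully unrollable: each iteration's
// `i % 3` test and index computation becomes a constant, leaving straight-
// line adds that the compiler can schedule and register-allocate freely.
template <int N>
int specialized_sum(const int* data) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {   // trip count known at compile time
        if (i % 3)                  // branch resolved per unrolled iteration
            acc += 2 * data[i];
        else
            acc += data[i];
    }
    return acc;
}
```

A CUDA kernel gets the same benefit from a template (or compile-time constant) parameter plus `#pragma unroll`, at the cost of compiling one kernel per specialized configuration.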
    11. OPTIMIZATION #3 (Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions). Dot products of (7x15x36)-dimensional vectors = the sum of 36-layer 7x15 2D convolutions. Load the whole patch into shared memory. Uniform loads of coefficients from constant memory, without any bank conflicts. Reuse our well-optimized 2D convolution kernel (aggressive register blocking; GTC'15, Zhao et al.).
    12. OPTIMIZATION #3 (Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions, cont.). Run a 7x15 2D convolution on each of the 36 layers, then add up the results of all layers with an atomic add; each element of the winPerImgX x winPerImgY result is the dot product of one window. [Figure: a 7x15 filter convolved over each layer, results accumulated across layers]
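The transformation in slides 11-12 rests on a simple identity: the SVM score of one window, a dot product over 7x15 cells times 36 bins, equals the sum over the 36 bin-layers of a 7x15 2D correlation evaluated at that window position. A host-side C++ sketch for a single window, with the sizes from the slides and synthetic data (function names are mine):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

constexpr int W = 7, H = 15, L = 36;   // window cells and histogram layers

// Direct SVM score: one dot product over the flattened (layer, y, x) window.
float svm_direct(const std::vector<float>& win, const std::vector<float>& coef) {
    float s = 0.0f;
    for (int i = 0; i < W * H * L; ++i) s += win[i] * coef[i];
    return s;
}

// Same score, accumulated layer by layer as 36 independent 7x15 2D
// correlations: the form that maps onto the reused convolution kernel,
// with the final `s +=` playing the role of slide 12's atomic add.
float svm_layered(const std::vector<float>& win, const std::vector<float>& coef) {
    float s = 0.0f;
    for (int l = 0; l < L; ++l) {       // one "convolution layer" per bin
        float layer = 0.0f;
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                int i = (l * H + y) * W + x;
                layer += win[i] * coef[i];
            }
        s += layer;                     // accumulate across the 36 layers
    }
    return s;
}
```

On the GPU the layered form wins because all window positions of one layer become a single 2D convolution pass, amortizing coefficient loads and enabling the register blocking of the existing convolution kernel.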
    13. FINAL RESULTS. Runtime (ms) for VGA input on Tegra X1, compared to the previous implementation in VisionWorks (https://developer.nvidia.com/embedded/visionworks): 214 FPS on Tegra X1, a 1.87x overall speedup.

        Kernel                    Base    Optimized
        Oriented Gradients        1.22    0.86
        Block Histograms          3.90    2.23
        Histogram Normalization   0.85    0.29
        Linear SVM                2.48    1.01
        Overall                   8.73    4.67
    14. April 4-7, 2016 | Silicon Valley. THANK YOU. mlv@nvidia.com https://github.com/madeye
    15. April 4-7, 2016 | Silicon Valley. BACKUPS
    16. OPTIMIZATION #2 (Example: Fast atan2f() for Oriented Gradients). Employ LOP3 (3-operand logic operations, a new instruction of the Maxwell architecture) to eliminate conditional branches:

        float atan2f_lagrange_3rd(const float dy, const float dx) {
            float flag, z = 0.0f;
            __SET_LT(flag, fabsf(dy), fabsf(dx));
            uint32_t m, t1 = 0x80000000;
            float t2 = float(M_PI) / 2.0f;
            __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
            float w = flag * __int_as_float(m) + float(M_PI)/2.0f;
            float Offset = copysignf(w, dy);
            float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
            uint32_t r, b = __float_as_int(flag) << 2;
            uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
            __LOP3_0xe2(r, mask, t1, __float_as_int(t));
            const float p = fabsf(__int_as_float(r)) - 1.0f;
            return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
        }
