April 4-7, 2016 | Silicon Valley
Max Lv, NVIDIA
Brant Zhao, NVIDIA
April 7
mlv@nvidia.com
https://github.com/madeye
HIGH PERFORMANCE PEDESTRIAN
DETECTION ON TEGRA X1
AGENDA
Histogram of Oriented Gradients on GPU
Optimization Opportunities on a Tegra GPU
Optimization #1: Improve ILP (Instruction Level
Parallelism)
Optimization #2: Approximation
Optimization #3: Specialization
Final Results
PEDESTRIAN DETECTION: HOG DESCRIPTOR
Gradient-based feature descriptor
developed for pedestrian detection
Introduced by Navneet Dalal and Bill Triggs
(CVPR’05)
Global descriptor for the complete body
Very high-dimensional: typically ~4000
dimensions
Histogram of Oriented Gradients
Source: Dalal, N.; Triggs, B., "Histograms of oriented gradients for human detection," CVPR 2005.
HOG PIPELINE ON GPU
Oriented Gradients: 3x3 Sobel filter with gamma correction
Block Histogram: Pixels vote in proportion to gradient magnitude, with tri-linear interpolation, in each block (16x16 pixels)
Histogram Normalization: Normalize the histogram of each block (36 bins)
Linear SVM: A linear SVM classifier, computed as the dot product of each window (7x15 normalized 36-bin histograms) and the trained coefficients
Four GPU Kernels: Oriented Gradients, Block Histograms, Histograms Normalization, Linear SVM
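The window size quoted above (7x15 histograms of 36 bins, "~4000 dimensions") can be checked with a little arithmetic. A minimal sketch: the 64x128-pixel window and the 8-pixel block stride are assumptions taken from the Dalal-Triggs defaults, not from the slides, and `blocks_along` and `descriptor_length` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Number of block positions along one axis of a detection window.
inline std::size_t blocks_along(std::size_t window, std::size_t block, std::size_t stride) {
    // The window must hold a whole number of strides beyond the first block.
    assert(stride > 0 && window >= block && (window - block) % stride == 0);
    return (window - block) / stride + 1;
}

// Length of one window's HOG descriptor: one histogram per block position.
inline std::size_t descriptor_length(std::size_t win_w, std::size_t win_h,
                                     std::size_t block, std::size_t stride,
                                     std::size_t bins_per_block) {
    return blocks_along(win_w, block, stride) *
           blocks_along(win_h, block, stride) * bins_per_block;
}
```

With a 64x128 window, 16-pixel blocks, and an 8-pixel stride this gives 7 x 15 block positions and 7 * 15 * 36 = 3780 values per window.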
OPTIMIZATION OPPORTUNITIES
Our goal is to improve the performance further based on a
well-optimized implementation in VisionWorks
Trade-offs between ILP (instruction-level parallelism) and
DLP (data-level parallelism)
Trade-offs between precision and computation
Trade-offs between generalization and specialization
On a 2-SM Maxwell GPU in Tegra X1
NVIDIA Tegra X1 Maxwell GPU Specification
CUDA Cores          256
Texture Units       16
ROPs                16
GPU Clock           ~1000 MHz
Memory Clock        1600 MHz (LPDDR4)
Memory Bus Width    64-bit
FP16 Peak           1024 GFLOPS
FP32 Peak           512 GFLOPS
Architecture        Maxwell
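The peak figures in the specification follow from the core count and clocks. A rough sketch, assuming one FMA (2 FLOPs) per CUDA core per cycle for FP32, twice that for packed FP16, and double-data-rate memory transfers; `peak_gflops` and `peak_bandwidth_gbs` are hypothetical helper names.

```cpp
#include <cassert>

// Peak arithmetic throughput in GFLOPS for a given per-core issue rate.
inline double peak_gflops(int cuda_cores, double clock_ghz, int flops_per_core_per_cycle) {
    assert(cuda_cores > 0 && clock_ghz > 0.0 && flops_per_core_per_cycle > 0);
    return cuda_cores * clock_ghz * flops_per_core_per_cycle;
}

// Peak DRAM bandwidth in GB/s.
// Assumption: two transfers per memory clock (double data rate).
inline double peak_bandwidth_gbs(double mem_clock_mhz, int bus_width_bits) {
    assert(mem_clock_mhz > 0.0 && bus_width_bits > 0);
    const double transfers_per_s = mem_clock_mhz * 1e6 * 2.0;
    return transfers_per_s * (bus_width_bits / 8.0) / 1e9;
}
```

`peak_gflops(256, 1.0, 2)` reproduces the 512 GFLOPS FP32 entry and `peak_gflops(256, 1.0, 4)` the 1024 GFLOPS FP16 entry. Under the double-data-rate assumption, the 1600 MHz, 64-bit memory interface works out to 25.6 GB/s, a small budget next to the arithmetic peak and one reason to keep data on chip.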
OPTIMIZATION #1
Existing GPU kernels are optimized for large GPUs, increasing DLP to saturate the SMs
For small GPUs such as those on Tegra, it’s possible to gain performance with higher ILP and lower DLP
Increase the workload of each thread while the total number of threads decreases
Try different configurations until the best performance is achieved
Improve ILP (Instruction Level Parallelism)
[Figure: DLP (thread #) plotted against ILP (in-flight ops per thread), with two operating points A and B]
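The ILP/DLP trade-off can be illustrated on the host. This is a sketch of the general idea rather than the VisionWorks kernels: `sum_with_ilp` and `threads_needed` are hypothetical names, and a plain sum stands in for the real per-thread work.

```cpp
#include <cassert>
#include <cstddef>

// Sums n values using ILP independent accumulators. The additions inside one
// pass of the outer loop do not depend on each other, so a pipelined core can
// keep several of them in flight at once.
template <int ILP>
float sum_with_ilp(const float* data, std::size_t n) {
    static_assert(ILP > 0, "need at least one accumulator");
    float acc[ILP] = {};                      // one partial sum per lane
    std::size_t i = 0;
    for (; i + ILP <= n; i += ILP) {
        for (int k = 0; k < ILP; ++k) {       // fixed trip count, easy to unroll
            acc[k] += data[i + k];
        }
    }
    for (; i < n; ++i) acc[0] += data[i];     // leftover elements
    float total = 0.0f;
    for (int k = 0; k < ILP; ++k) total += acc[k];
    return total;
}

// Threads needed when each thread covers per_thread items (ceiling division).
inline std::size_t threads_needed(std::size_t items, std::size_t per_thread) {
    assert(per_thread > 0);
    return (items + per_thread - 1) / per_thread;
}
```

Raising the work per thread lowers `threads_needed` (less DLP) while giving each thread more independent operations (more ILP); the best point has to be found by measurement.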
OPTIMIZATION #1
[Figure: work split across threads T1, T2, T3, T4]
Various patterns can be used to compute a block of histograms.
Best trade-off: Each thread calculates 3x12 pixels
Does not work well on large GPUs like Titan X, but is suitable for Tegra X1
Example: Best ILP & DLP trade-off for Block Histograms
[Figure: a block of histograms with edge labels 16, 16, 12, 12]
OPTIMIZATION #2
Full 32-bit floating-point precision on the GPU is unnecessary for most computer vision applications
`--use_fast_math` is enabled by default for our CV projects
Compute in floating point, but load and store pixels as integers using texture instructions
Sometimes it’s safe to relax the precision even further
Approximation
[Figure: pixels are stored as 8-bit/16-bit integers in memory (0, 128, 255, …); the texture unit performs conversion, (de)normalization, and sampling; the SM computes on them as FP16/FP32 (0, 0.5, 1.0, …)]
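The integer-to-float mapping done by the texture path can be modeled on the host. A minimal sketch for the 8-bit case, assuming unsigned normalized values (0..255 mapped to 0.0..1.0) and round-to-nearest on the way back; `unorm8_to_float` and `float_to_unorm8` are hypothetical names, not texture API calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Integer to float, as a normalized read would return it: 0..255 maps to 0.0..1.0.
inline float unorm8_to_float(std::uint8_t v) {
    return static_cast<float>(v) / 255.0f;
}

// Float to integer for the store: clamp to [0, 1], scale, round to nearest.
inline std::uint8_t float_to_unorm8(float x) {
    assert(!std::isnan(x));                   // NaN has no meaningful pixel value
    const float clamped = std::min(std::max(x, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}
```

The arithmetic in between runs in floating point, while memory traffic stays at one byte per pixel.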
OPTIMIZATION #2
Example: Fast atan2f() for Oriented Gradients
A fast version of atan2f() with 3rd order Lagrange polynomial interpolation, and without handling corner cases
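#include <math.h>

// Approximates atan2f(dy, dx); the result is undefined for dx == dy == 0.
float atan2f_lagrange_3rd(const float dy,
                          const float dx) {
  float A = 0.0f, B = 0.0f;
  // Quadrant offset: +/-pi, carrying the sign of dy
  float Offset = copysignf(float(M_PI), dy);
  if (fabsf(dy) < fabsf(dx)) {
    // Closer to the x axis: evaluate atan(dy / dx)
    A = dx; B = dy;
    if (dx >= 0.0f) Offset = 0.0f;
  } else {
    // Closer to the y axis: evaluate +/-pi/2 - atan(dx / dy)
    A = -dy; B = dx;
    Offset *= 0.5f;
  }
  const float r = B / A;  // |r| <= 1
  const float p = 1.0f - fabsf(r);
  // Polynomial approximation of atan(r) on [-1, 1]
  return ((-0.0663f*p + 0.311f) * p
          + float(M_PI/4.0)) * r + Offset;
}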
Comparison between different atan2f implementations
                            Native    This work
FMA/FADD (op)               12        4
MUFU.RCP (op)               2         1
Handle Corner Case (op)     ~30       ~5
Avg. Error (degree)         0.01      0.05
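The error of the approximation can be measured on the host against the standard library. A sketch under stated assumptions: `atan2_approx` is a plain C++ port of the slide's function, `max_error_degrees` is a hypothetical helper, and sampling directions evenly on the unit circle is an arbitrary choice that need not match how the slide's average error was obtained.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Host-side port of atan2f_lagrange_3rd (same arithmetic, std:: math calls).
inline float atan2_approx(float dy, float dx) {
    assert(dx != 0.0f || dy != 0.0f);         // the origin is an unhandled corner case
    const float kPi = 3.14159265358979f;
    float a, b;
    float offset = std::copysign(kPi, dy);
    if (std::fabs(dy) < std::fabs(dx)) {
        a = dx; b = dy;
        if (dx >= 0.0f) offset = 0.0f;
    } else {
        a = -dy; b = dx;
        offset *= 0.5f;
    }
    const float r = b / a;
    const float p = 1.0f - std::fabs(r);
    return ((-0.0663f * p + 0.311f) * p + 0.25f * kPi) * r + offset;
}

// Largest absolute error, in degrees, against std::atan2 over `steps`
// evenly spaced directions on the unit circle.
inline double max_error_degrees(int steps) {
    assert(steps > 0);
    const double kPi = 3.14159265358979323846;
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        const double angle = -kPi + 2.0 * kPi * (i + 0.5) / steps;
        const float dy = static_cast<float>(std::sin(angle));
        const float dx = static_cast<float>(std::cos(angle));
        const double ref = std::atan2(static_cast<double>(dy), static_cast<double>(dx));
        double err = std::fabs(static_cast<double>(atan2_approx(dy, dx)) - ref);
        err = std::min(err, 2.0 * kPi - err); // -pi and +pi are the same direction
        worst = std::max(worst, err * 180.0 / kPi);
    }
    return worst;
}
```

Calling `max_error_degrees(3600)` gives a worst-case figure to set beside the average error in the table; for a 9-bin orientation histogram an error well under one degree rarely changes which bin a pixel votes for.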
OPTIMIZATION #3
Specialize parameters of CV applications to
enable further optimization
Unroll the loop fully to eliminate index
computation and conditional branches
Allow automatic register blocking by
compiler, better instruction scheduling
Allow more tricks to reuse on-chip data
Specialization
__global__ void kernel (int N) {
...
#pragma unroll
for (int i = 0; i < N; i++) {
if (i % 3) {
...
}
...
tmp[i] += ...
}
...
}
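One way to specialize a loop like the one above is to turn the trip count into a compile-time constant. A host-side sketch, assuming the loop body is a simple conditional accumulation; `sum_skipping_thirds` and `sum_skipping_thirds_fixed` are hypothetical names, and in CUDA code the fixed version would also carry `#pragma unroll`.

```cpp
#include <cassert>

// Generic version: the trip count is known only at run time, so the loop
// counter, the bounds check, and the test on i % 3 all stay in the code.
inline float sum_skipping_thirds(const float* x, int n) {
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i % 3) acc += x[i];
    }
    return acc;
}

// Specialized version: N is a compile-time constant, so the compiler can
// unroll the loop completely and resolve every i % 3 test ahead of time.
template <int N>
float sum_skipping_thirds_fixed(const float* x) {
    static_assert(N >= 0, "N must be non-negative");
    float acc = 0.0f;
    for (int i = 0; i < N; ++i) {
        if (i % 3) acc += x[i];
    }
    return acc;
}
```

Both versions return the same result; the specialized one trades generality (one instantiation per value of N) for straight-line code with no index computation or branches.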
OPTIMIZATION #3
Dot products of (7x15x36)-dimension vectors = Sum of 36-layer 7x15 2D convolutions
Load the whole patch to shared memory
Uniform loads of coefficients in constant memory, without any bank conflict
Reuse our well-optimized 2D convolution kernel (aggressive register blocking, GTC’15, Zhao et al.)
Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions
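The equivalence between per-window dot products and a sum of per-layer 2D convolutions can be written as a CPU reference. A sketch under stated assumptions: the memory layouts are made up for illustration, the SVM bias term is left out, the kernel is applied without flipping (a correlation, which is what the dot product requires), and `svm_as_convolutions` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kLayers = 36;  // bins per block histogram
constexpr int kWinX = 7;     // blocks per window, horizontally
constexpr int kWinY = 15;    // blocks per window, vertically

// hist is laid out as [layer][blockY][blockX] over the whole image;
// coeff is laid out as [layer][y][x] over one window.
// Returns one score per window position, laid out as [winY][winX].
inline std::vector<float> svm_as_convolutions(const std::vector<float>& hist,
                                              int blocksX, int blocksY,
                                              const std::vector<float>& coeff) {
    assert(blocksX >= kWinX && blocksY >= kWinY);
    assert(hist.size() == static_cast<std::size_t>(kLayers) * blocksX * blocksY);
    assert(coeff.size() == static_cast<std::size_t>(kLayers) * kWinX * kWinY);
    const int winPerImgX = blocksX - kWinX + 1;
    const int winPerImgY = blocksY - kWinY + 1;
    std::vector<float> score(static_cast<std::size_t>(winPerImgX) * winPerImgY, 0.0f);
    for (int l = 0; l < kLayers; ++l) {               // one 2D convolution per layer
        for (int wy = 0; wy < winPerImgY; ++wy) {
            for (int wx = 0; wx < winPerImgX; ++wx) {
                float acc = 0.0f;
                for (int y = 0; y < kWinY; ++y) {
                    for (int x = 0; x < kWinX; ++x) {
                        acc += coeff[(l * kWinY + y) * kWinX + x] *
                               hist[(l * blocksY + wy + y) * blocksX + wx + x];
                    }
                }
                score[wy * winPerImgX + wx] += acc;   // add up results of all layers
            }
        }
    }
    return score;
}
```

Each output element equals the 7x15x36 dot product for one window; on the GPU the accumulation across layers is where the atomic add comes in.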
OPTIMIZATION #3
Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions
[Figure: each of the 36 histogram layers is convolved with its 7x15 coefficient kernel, producing winPerImgX x winPerImgY outputs per layer; the results of all layers are added up with atomic adds, so each output element is the dot product of one window]
FINAL RESULTS
Runtime (ms) of VGA input on Tegra X1, compared to the previous implementation of
VisionWorks (https://developer.nvidia.com/embedded/visionworks)
214 FPS on Tegra X1
Stage                      Base    Optimized
Oriented Gradients         1.22    0.86
Block Histograms           3.90    2.23
Histogram Normalization    0.85    0.29
Linear SVM                 2.48    1.01
Overall                    8.73    4.67
1.87x Speedup
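The headline numbers follow from the overall runtimes of 8.73 ms (base) and 4.67 ms (optimized). A small sketch of that arithmetic; `speedup` and `frames_per_second` are hypothetical helper names.

```cpp
#include <cassert>

// Speedup of the optimized pipeline over the baseline, from per-frame runtimes.
inline double speedup(double base_ms, double optimized_ms) {
    assert(base_ms > 0.0 && optimized_ms > 0.0);
    return base_ms / optimized_ms;
}

// Frames per second for a given per-frame runtime in milliseconds.
inline double frames_per_second(double frame_ms) {
    assert(frame_ms > 0.0);
    return 1000.0 / frame_ms;
}
```

`speedup(8.73, 4.67)` rounds to the reported 1.87x and `frames_per_second(4.67)` to the reported 214 FPS.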
THANK YOU
mlv@nvidia.com
https://github.com/madeye
BACKUPS
OPTIMIZATION #2
Example: Fast atan2f() for Oriented Gradients
Employ LOP3 (3-operand logic operations, a new instruction in the Maxwell architecture)
float atan2f_lagrange_3rd(const float dy, const float dx) {
float flag, z = 0.0f;
__SET_LT(flag, fabsf(dy), fabsf(dx));
uint32_t m, t1 = 0x80000000; float t2 = float(M_PI) / 2.0f;
__LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
float w = flag * __int_as_float(m) + float(M_PI)/2.0f; float Offset = copysignf(w, dy);
float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
uint32_t r, b = __float_as_int(flag) << 2;
uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
__LOP3_0xe2(r, mask, t1, __float_as_int(t));
const float p = fabsf(__int_as_float(r)) - 1.0f;
return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}
LOP3 eliminates
conditional branches
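The `__LOP3_0x2e` and `__LOP3_0xe2` macros are not defined in the slides. Assuming they wrap the PTX `lop3.b32` instruction with lookup tables 0x2e and 0xe2 and operand order (dst, a, b, c), their effect can be modeled on the host; `lop3` and `check_lut_convention` below are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Software model of a 3-input logic op driven by an 8-bit truth table. For
// every bit position, the three input bits form an index
// (a << 2 | b << 1 | c) that selects one bit of lut.
inline std::uint32_t lop3(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                          std::uint8_t lut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const unsigned idx = (((a >> bit) & 1u) << 2) |
                             (((b >> bit) & 1u) << 1) |
                             ((c >> bit) & 1u);
        result |= static_cast<std::uint32_t>((lut >> idx) & 1u) << bit;
    }
    return result;
}

// Feeding the canonical patterns 0xF0, 0xCC, 0xAA reproduces the table itself
// in the low byte, which is how a table is derived from a Boolean expression.
inline void check_lut_convention(std::uint8_t lut) {
    assert((lop3(0xF0u, 0xCCu, 0xAAu, lut) & 0xFFu) == lut);
}
```

Under this model 0xe2 computes `(a & b) | (~b & c)`: with `b = 0x80000000` it takes the sign bit from `a` and the remaining bits from `c`, a single-instruction sign transfer. Likewise 0x2e computes `(~a & b) | (~b & c)`, which with the same `b` attaches the inverted sign of `a` to `c`.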

Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 

High Performance Pedestrian Detection On TEGRA X1

  • 1. April 4-7, 2016 | Silicon Valley Max Lv, NVIDIA Brant Zhao, NVIDIA April 7 mlv@nvidia.com https://github.com/madeye HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1
  • 2. AGENDA: Histogram of Oriented Gradients on GPU; Optimization Opportunities on a Tegra GPU; Optimization #1: Improve ILP (Instruction Level Parallelism); Optimization #2: Approximation; Optimization #3: Specialization; Final Results
  • 3. PEDESTRIAN DETECTION: HOG DESCRIPTOR (Histogram of Oriented Gradients). Gradient-based feature descriptor developed for pedestrian detection. Introduced by Navneet Dalal and Bill Triggs (CVPR’05). Global descriptor for the complete body. Very high-dimensional: typically ~4000 dimensions. Source: Dalal, N.; Triggs, B., "Histograms of oriented gradients for human detection," CVPR 2005.
  • 4. HOG PIPELINE ON GPU: Four GPU Kernels. Oriented Gradients: 3x3 Sobel filter with gamma correction. Block Histograms: pixels vote in proportion to gradient magnitude, with tri-linear interpolation, in each block (16x16 pixels). Histogram Normalization: normalize each 36-bin block histogram. Linear SVM: a linear SVM classifier, computed as the dot product of each window (7x15 36-bin normalized histograms) with the trained coefficients. Pipeline: Oriented Gradients -> Block Histograms -> Histogram Normalization -> Linear SVM.
  • 5. OPTIMIZATION OPPORTUNITIES on a 2-SM Maxwell GPU in Tegra X1. Our goal is to further improve performance, starting from a well-optimized implementation in VisionWorks. Trade-offs between ILP (instruction-level parallelism) and DLP (data-level parallelism). Trade-offs between precision and computation. Trade-offs between generalization and specialization.

      NVIDIA Tegra X1 Maxwell GPU Specification:

        CUDA Cores          256
        Texture Units       16
        ROPs                16
        GPU Clock           ~1000 MHz
        Memory Clock        1600 MHz (LPDDR4)
        Memory Bus Width    64-bit
        FP16 Peak           1024 GFLOPS
        FP32 Peak           512 GFLOPS
        Architecture        Maxwell
  • 6. OPTIMIZATION #1: Improve ILP (Instruction Level Parallelism). Existing GPU kernels are optimized for large GPUs, increasing DLP to saturate the SMs. For small GPUs on Tegra, it is possible to gain performance with larger ILP but smaller DLP: increase the workload in each thread while the total number of threads decreases, and try different configurations until the best performance is achieved. [Figure: ILP (in-flight ops per thread) versus DLP (number of threads), with points labeled A and B.]
  • 7. OPTIMIZATION #1, Example: Best ILP & DLP trade-off for Block Histograms. Various patterns can be used to compute a block of histograms. Best trade-off: each thread calculates 3x12 pixels. This does not work well on large GPUs like Titan X, but it is suitable for Tegra X1. [Figure: threads T1 to T4, with dimensions labeled 16x16 and 12x12.]
  • 8. OPTIMIZATION #2: Approximation. Full 32-bit floating-point precision on the GPU is unnecessary for most computer vision applications. `--use_fast_math` is enabled by default for our CV projects. Compute in floating point, but load and store pixels as integers using texture instructions. Sometimes it is safe to relax the precision even further. [Figure: values are stored as 8-bit/16-bit integers in memory (0, 128, 255, …); conversion, (de)normalization, and sampling happen in the texture unit (0, 0.5, 1.0, …); computation runs as FP16/FP32 in the SM.]
  • 9. OPTIMIZATION #2, Example: Fast atan2f() for Oriented Gradients. A fast version of atan2f() with 3rd order Lagrange polynomial interpolation, and without handling corner cases:

        float atan2f_lagrange_3rd(const float dy, const float dx)
        {
            float A = 0.0f, B = 0.0f;
            float Offset = copysignf(float(M_PI), dy);
            if (fabsf(dy) < fabsf(dx)) {
                A = dx;
                B = dy;
                if (dx >= 0.0f) Offset = 0.0f;
            } else {
                A = -dy;
                B = dx;
                Offset *= 0.5f;
            }
            const float r = B / A;
            const float p = 1.0f - fabsf(r);
            return ((-0.0663f*p + 0.311f) * p + float(M_PI/4.0)) * r + Offset;
        }

      Comparison between different atan2f implementations:

                                   Native    This work
        FMA/FADD (op)              12        4
        MUFU.RCP (op)              2         1
        Handle Corner Case (op)    ~30       ~5
        Avg. Error (degree)        0.01      0.05
  • 10. OPTIMIZATION #3: Specialization. Specialize parameters of CV applications to enable further optimization. Unroll the loop fully to eliminate index computation and conditional branches. Allow automatic register blocking by the compiler and better instruction scheduling. Allow more tricks to reuse on-chip data.

        __global__ void kernel (int N)
        {
            ...
            #pragma unroll
            for (int i = 0; i < N; i++) {
                if (i % 3) {
                    ...
                }
                ...
                tmp[i] += ...
            }
            ...
        }
  • 11. OPTIMIZATION #3, Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions. Dot products of (7x15x36)-dimension vectors = sum of 36-layer 7x15 2D convolutions. Load the whole patch to shared memory. Uniform loads of coefficients in constant memory, without any bank conflict. Reuse our well-optimized 2D convolution kernel (aggressive register blocking, GTC’15, Zhao et al.).
  • 12. OPTIMIZATION #3, Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions. [Figure: a 2D convolution with a 7x15 kernel is applied on each of the 36 layers; the results of all layers are added up with atomic adds into a winPerImgX x winPerImgY output, where each element is the dot product of one window.]
  • 13. FINAL RESULTS: 214 FPS on Tegra X1, 1.87x speedup. Runtime (ms) of VGA input on Tegra X1, compared to the previous implementation of VisionWorks (https://developer.nvidia.com/embedded/visionworks):

        Kernel                     Base (ms)    Optimized (ms)
        Oriented Gradients         1.22         0.86
        Block Histograms           3.90         2.23
        Histogram Normalization    0.85         0.29
        Linear SVM                 2.48         1.01
        Overall                    8.73         4.67
  • 14. April 4-7, 2016 | Silicon Valley THANK YOU mlv@nvidia.com https://github.com/madeye
  • 15. April 4-7, 2016 | Silicon Valley BACKUPS
  • 16. OPTIMIZATION #2, Example: Fast atan2f() for Oriented Gradients. Employ LOP3 (3-operand logic operations, a new instruction of the Maxwell architecture). LOP3 eliminates conditional branches:

        // __SET_LT, __LOP3_0x2e and __LOP3_0xe2 are helper macros that are not defined on the slide.
        float atan2f_lagrange_3rd(const float dy, const float dx)
        {
            float flag, z = 0.0f;
            __SET_LT(flag, fabsf(dy), fabsf(dx));
            uint32_t m, t1 = 0x80000000;
            float t2 = float(M_PI) / 2.0f;
            __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
            float w = flag * __int_as_float(m) + float(M_PI)/2.0f;
            float Offset = copysignf(w, dy);
            float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
            uint32_t r, b = __float_as_int(flag) << 2;
            uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
            __LOP3_0xe2(r, mask, t1, __float_as_int(t));
            const float p = fabsf(__int_as_float(r)) - 1.0f;
            return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
        }
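
The accuracy figure quoted for the fast atan2f() on slide 9 can be checked on the host. The following is a minimal sketch, not the authors' benchmark: it ports the branchy slide 9 version to plain C++ and compares it with `std::atan2` over directions sampled on the unit circle. The names `ErrorStats` and `measure_error`, and the choice of sampling pattern, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Host-side port of the branchy atan2f_lagrange_3rd() from slide 9.
// Like the original, it does not handle corner cases such as dx == dy == 0.
static float atan2f_lagrange_3rd(const float dy, const float dx) {
    const float kPi = 3.14159265358979323846f;
    float A = 0.0f, B = 0.0f;
    float Offset = std::copysign(kPi, dy);
    if (std::fabs(dy) < std::fabs(dx)) {
        A = dx;
        B = dy;
        if (dx >= 0.0f) Offset = 0.0f;
    } else {
        A = -dy;
        B = dx;
        Offset *= 0.5f;
    }
    const float r = B / A;
    const float p = 1.0f - std::fabs(r);
    return ((-0.0663f * p + 0.311f) * p + kPi / 4.0f) * r + Offset;
}

// Hypothetical result type: mean and maximum absolute error in degrees.
struct ErrorStats {
    double mean_deg;
    double max_deg;
};

// Hypothetical helper: sweeps `samples` directions on the unit circle
// and compares the fast version with std::atan2.
static ErrorStats measure_error(int samples) {
    assert(samples > 0);
    const double kPiD = 3.14159265358979323846;
    double sum = 0.0, worst = 0.0;
    for (int i = 0; i < samples; ++i) {
        // The half-step offset keeps samples away from the exact axes.
        const double theta = -kPiD + 2.0 * kPiD * (i + 0.5) / samples;
        const float dx = static_cast<float>(std::cos(theta));
        const float dy = static_cast<float>(std::sin(theta));
        const double ref = std::atan2(static_cast<double>(dy), static_cast<double>(dx));
        double diff = std::fabs(static_cast<double>(atan2f_lagrange_3rd(dy, dx)) - ref);
        if (diff > kPiD) diff = 2.0 * kPiD - diff;  // wrap across the +/-pi branch cut
        const double deg = diff * 180.0 / kPiD;
        sum += deg;
        if (deg > worst) worst = deg;
    }
    return {sum / samples, worst};
}
```

Calling `measure_error(36000)` and printing both fields gives a quick comparison against the average error listed on slide 9. The sweep never produces dx = dy = 0, which the fast version does not handle.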

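The identity behind slides 11 and 12, that the SVM dot product over a 7x15x36 window equals the sum of 36 per-layer 7x15 2D convolutions, can be illustrated on the CPU. This is a sketch under stated assumptions rather than the VisionWorks kernel: the planar data layout, the `FeatureMap` struct, and the helpers `coeff_at`, `make_pattern`, `svm_direct`, and `svm_layered` are all hypothetical, and the "convolution" is written as a correlation (no kernel flip).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kBins = 36;  // histogram bins per block, i.e. number of layers
constexpr int kWinW = 7;   // window width in blocks
constexpr int kWinH = 15;  // window height in blocks

// Assumed planar layout: data[(c * height + y) * width + x].
struct FeatureMap {
    int width;
    int height;
    std::vector<float> data;
    float at(int c, int y, int x) const {
        return data[(static_cast<std::size_t>(c) * height + y) * width + x];
    }
};

// Coefficients use the same planar layout with a 7x15 footprint per layer.
inline float coeff_at(const std::vector<float>& w, int c, int y, int x) {
    return w[(static_cast<std::size_t>(c) * kWinH + y) * kWinW + x];
}

// Hypothetical helper: deterministic small-integer data, so sums are exact in float.
std::vector<float> make_pattern(std::size_t n, int modulus, int offset) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = static_cast<float>(static_cast<int>((i * 7 + 3) % modulus) + offset);
    return v;
}

// Reference: one long (7x15x36)-dimension dot product per window.
std::vector<float> svm_direct(const FeatureMap& f, const std::vector<float>& w) {
    const int outW = f.width - kWinW + 1;
    const int outH = f.height - kWinH + 1;
    assert(outW > 0 && outH > 0);
    std::vector<float> scores(static_cast<std::size_t>(outW) * outH, 0.0f);
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox) {
            float acc = 0.0f;
            for (int c = 0; c < kBins; ++c)
                for (int y = 0; y < kWinH; ++y)
                    for (int x = 0; x < kWinW; ++x)
                        acc += f.at(c, oy + y, ox + x) * coeff_at(w, c, y, x);
            scores[static_cast<std::size_t>(oy) * outW + ox] = acc;
        }
    return scores;
}

// Layered form: one 7x15 2D pass per layer, accumulated into a shared output.
std::vector<float> svm_layered(const FeatureMap& f, const std::vector<float>& w) {
    const int outW = f.width - kWinW + 1;
    const int outH = f.height - kWinH + 1;
    assert(outW > 0 && outH > 0);
    std::vector<float> scores(static_cast<std::size_t>(outW) * outH, 0.0f);
    for (int c = 0; c < kBins; ++c)
        for (int oy = 0; oy < outH; ++oy)
            for (int ox = 0; ox < outW; ++ox) {
                float acc = 0.0f;
                for (int y = 0; y < kWinH; ++y)
                    for (int x = 0; x < kWinW; ++x)
                        acc += f.at(c, oy + y, ox + x) * coeff_at(w, c, y, x);
                // A GPU version would use an atomic add here, as on slide 12.
                scores[static_cast<std::size_t>(oy) * outW + ox] += acc;
            }
    return scores;
}
```

With integer-valued inputs from `make_pattern`, both functions return identical score vectors. With real-valued features the two results can differ by floating-point rounding, because the layered form changes the summation order.
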
Editor's Notes

  1. When decreasing DLP, ILP may not grow as expected because of per-thread resource limits or dependencies between operations. When decreasing ILP, DLP may be limited by redundant operations, additional resource occupation, or inter-thread communication.
  2. Candidate patterns: 1 pixel per thread, 4 pixels per thread, 1 cell per thread, 1 block per thread, or even 1 window per thread. The result is hundreds of warps, which cannot saturate a large GPU like Titan X.
  3. Magnitude and orientation are stored as 16-bit integers in memory.
  4. Over the last two years, convolutional neural networks (CNNs) have brought new levels of accuracy to object detection. For common ADAS problems like vehicle detection and pedestrian detection, the CNN accuracy gains have been moderate. However, CNNs offer huge accuracy improvements in recognizing textured objects like plants and specific types of dogs and cats. Speed is the major downside of CNN-based object detection. For example, the R-CNN [25] object detector operates at roughly 1/10 fps (2000 J/frame) on a GPU, with most of the time spent extracting CNN features. With a few tricks to amortize CNN feature computation, it is possible to accelerate CNN-based object detection to 1 fps (200 J/frame) on GPUs, as discovered independently by [27], [28], [29], and [30]. Even with these improvements, CNN-based object