Advanced Spark and TensorFlow Meetup, 2017-05-06: Reduced Precision (FP16, INT8) Inference on Convolutional Neural Networks with TensorRT and NVIDIA Pascal (Chris Gottbrath, NVIDIA)


Published on https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/223666658/

NVIDIA’s Pascal GPUs provide developers a platform for both training and deploying neural networks. In deployment, GPUs allow lower latencies, or let large inference workloads be serviced by a smaller set of accelerated nodes. One advanced technique for optimizing throughput is to leverage the Pascal GPU family’s reduced-precision instructions. I’ll show how you can start with a network trained in FP32 and deploy that same network with 16-bit or even 8-bit weights and activations using TensorRT. I’ll talk in some detail about the mechanics of converting a neural network, and about what kinds of performance and accuracy we are seeing on ImageNet-style networks.

I’ll end with a quick overview of how developers can deploy these DL networks as microservices using the GPU REST Engine.

References

• https://devblogs.nvidia.com/parallelforall/deploying-deep-learning-nvidia-tensorrt/

Thanks to Chris Gottbrath from the Nvidia TensorRT Team!!

https://twitter.com/chris_hpc
https://www.linkedin.com/in/chrisgottbrath/

Transcript
1. REDUCED PRECISION (FP16, INT8) INFERENCE ON CONVOLUTIONAL NEURAL NETWORKS WITH TENSORRT AND NVIDIA PASCAL
   Apr 2017 – Chris Gottbrath
2. AGENDA: Deep Learning, TensorRT, Reduced Precision, GPU REST Engine, Conclusion
3. NEW AI SERVICES POSSIBLE WITH GPU CLOUD: Spotify (song recommendations), Netflix (video recommendations), Yelp (selecting cover photos)
4. TESLA REVOLUTIONIZES DEEP LEARNING
   Neural network application, before Tesla vs. after Tesla:
     Cost:        $5,000K vs. $200K
     Servers:     1,000 servers vs. 16 Tesla servers
     Energy:      600 KW vs. 4 KW
     Performance: 1x vs. 6x
5. NVIDIA DEEP LEARNING SDK
   High-performance GPU acceleration for deep learning: powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications.
     High-performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
     Industry-vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
     Multi-GPU scaling that accelerates training on up to eight GPUs
   “We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver.” — Frédéric Bastien, Team Lead (Theano), MILA
   developer.nvidia.com/deep-learning-software
6. POWERING THE DEEP LEARNING ECOSYSTEM
   NVIDIA SDK accelerates every major framework:
     Computer vision: object detection, image classification
     Speech & audio: voice recognition, language translation
     Natural language processing: recommendation engines, sentiment analysis
   Deep learning frameworks: Mocha.jl, among others.
   developer.nvidia.com/deep-learning-software
7. TensorRT
8. NVIDIA DEEP LEARNING SOFTWARE PLATFORM
   A training framework (training data, training data management, model assessment) produces a trained neural network; the NVIDIA Deep Learning SDK with TensorRT deploys it to embedded, automotive and data center targets.
   developer.nvidia.com/deep-learning-software
9. NVIDIA TensorRT
   High-performance deep learning inference for production deployment.
     High-performance neural network inference engine for production deployment
     Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms
     Deliver the high-performance, low-latency inference demanded by real-time services
     Deploy faster, more responsive and memory-efficient deep learning applications with INT8 and FP16 optimized precision support
   [Chart: up to 36x more images/sec. GoogLeNet, CPU-only vs. Tesla P40 + TensorRT (FP32) vs. Tesla P40 + TensorRT (INT8), batch sizes 2, 8 and 128. CPU: 1-socket E5-2690 v4 @ 2.6 GHz, HT on. GPU host: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, one P40 card in the box.]
   developer.nvidia.com/tensorrt
10. WORKFLOW – GETTING A TRAINED MODEL INTO TensorRT
11. TensorRT Development Workflow
    Training framework → NEURAL NETWORK → OPTIMIZATION USING TensorRT (inputs: batch size, precision) → PLAN → Validation USING TensorRT → serialize to disk.
    developer.nvidia.com/tensorrt
12. TensorRT Production Workflow
    Serialized PLAN → RUNTIME USING TensorRT.
    developer.nvidia.com/tensorrt
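    The hand-off between the development and production workflows is the serialized PLAN. As a minimal sketch, assuming the TensorRT 2.x C++ API (the helper names save_plan/load_plan are mine, and exact serialization signatures vary across TensorRT versions):

        #include <fstream>
        #include <iterator>
        #include <string>
        #include "NvInfer.h"
        using namespace nvinfer1;

        // Development side: serialize the optimized engine (the PLAN) to disk.
        void save_plan(ICudaEngine& engine, const char* path)
        {
            IHostMemory* plan = engine.serialize();
            std::ofstream out(path, std::ios::binary);
            out.write(static_cast<const char*>(plan->data()), plan->size());
            plan->destroy();
        }

        // Production side: reload the PLAN without the training framework or
        // the optimization step, and rebuild an engine ready for inference.
        ICudaEngine* load_plan(const char* path, ILogger& logger)
        {
            std::ifstream in(path, std::ios::binary);
            std::string blob((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
            IRuntime* runtime = createInferRuntime(logger);
            return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
        }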
13. TO IMPORT A TRAINED MODEL TO TensorRT
    Key function calls (this assumes you have a Caffe model file):

        IBuilder* builder = createInferBuilder(gLogger);
        INetworkDefinition* network = builder->createNetwork();
        CaffeParser parser;
        auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);
        network->markOutput(*blob_name_to_tensor->find(<output layer name>));
        builder->setMaxBatchSize(<size>);
        builder->setMaxWorkspaceSize(<size>);
        ICudaEngine* engine = builder->buildCudaEngine(*network);

    developer.nvidia.com/tensorrt
14. IMPORTING USING THE GRAPH DEFINITION API
    From any framework: if you are using another framework, such as TensorFlow, you can call the network builder API directly:

        ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});
        IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);
        etc.

    We are looking at a streamlined graph input for TensorFlow like our Caffe parser.
    developer.nvidia.com/tensorrt
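    To make the builder API concrete, here is a hedged sketch of a tiny input → convolution → ReLU → max-pool graph. The tensor names, dimensions and weight pointers (conv_kernel_host_ptr, conv_bias_host_ptr) are invented for illustration, and the Dims types differ slightly between TensorRT releases:

        #include "NvInfer.h"
        using namespace nvinfer1;

        // Hypothetical sketch: define input -> conv -> ReLU -> max pool by hand.
        // The weight pointers reference buffers already loaded from the training
        // framework into host memory.
        INetworkDefinition* build_tiny_network(IBuilder* builder,
                                               const float* conv_kernel_host_ptr,
                                               const float* conv_bias_host_ptr)
        {
            INetworkDefinition* network = builder->createNetwork();

            // CHW input: 3 channels, 224x224 (dimensions are illustrative).
            ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{3, 224, 224});

            // 64 output maps, 3x3 kernel: 64*3*3*3 kernel weights, 64 biases.
            Weights convW{DataType::kFLOAT, conv_kernel_host_ptr, 64 * 3 * 3 * 3};
            Weights convB{DataType::kFLOAT, conv_bias_host_ptr, 64};
            IConvolutionLayer* conv = network->addConvolution(*data, 64, DimsHW{3, 3}, convW, convB);

            IActivationLayer* relu = network->addActivation(*conv->getOutput(0), ActivationType::kRELU);
            IPoolingLayer* pool = network->addPooling(*relu->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});

            // Mark the tensor the engine should expose as its output binding.
            pool->getOutput(0)->setName("prob");
            network->markOutput(*pool->getOutput(0));
            return network;
        }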
15. EXECUTE THE NEURAL NETWORK
    Running inference using the API:

        IExecutionContext* context = engine->createExecutionContext();
        <handle> = engine->getBindingIndex(<binding layer name>);
        <malloc and cudaMalloc calls>      // allocate buffers for data moving in and out
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(<args>);           // copy input data to the GPU
        context->enqueue(<args>);
        cudaMemcpyAsync(<args>);           // copy output data to the host
        cudaStreamSynchronize(stream);
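    Filling in those placeholders end to end, one inference call might look like the following sketch. The binding names ("data", "prob"), buffer sizes and batch size are assumptions for illustration, not values from the talk:

        #include <cuda_runtime.h>
        #include "NvInfer.h"
        using namespace nvinfer1;

        // Hypothetical sketch of one inference call against a built engine.
        void infer(ICudaEngine& engine, const float* h_input, float* h_output,
                   size_t inBytes, size_t outBytes, int batchSize)
        {
            IExecutionContext* context = engine.createExecutionContext();

            // Ask the engine where each tensor lives in the bindings array.
            int inIdx  = engine.getBindingIndex("data");   // assumed input name
            int outIdx = engine.getBindingIndex("prob");   // assumed output name

            void* bindings[2];
            cudaMalloc(&bindings[inIdx],  inBytes);
            cudaMalloc(&bindings[outIdx], outBytes);

            cudaStream_t stream;
            cudaStreamCreate(&stream);

            cudaMemcpyAsync(bindings[inIdx], h_input, inBytes, cudaMemcpyHostToDevice, stream);
            context->enqueue(batchSize, bindings, stream, nullptr);
            cudaMemcpyAsync(h_output, bindings[outIdx], outBytes, cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);

            cudaFree(bindings[inIdx]);
            cudaFree(bindings[outIdx]);
            context->destroy();
        }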
16. THROUGHPUT
    [Chart: images/s vs. batch size (1–128) for Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100 and TensorRT FP16 on P100.]
    ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU uses MKL and runs on an E5-2690 v4 with 14 cores.
17. LATENCY
    [Chart: latency (ms to execute batch, log scale) vs. batch size (1–128) for the same five configurations.]
    ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU uses MKL and runs on an E5-2690 v4 with 14 cores.
18. REDUCED PRECISION
19. SMALLER AND FASTER
    [Charts: performance (images/s, scaled to FP32) and memory usage (scaled to FP32) for FP32, FP16 on P100 and INT8 on P40.]
    ResNet-50 model, batch size = 128, TensorRT 2.1 RC pre-release.
    developer.nvidia.com/tensorrt
20. INT8 INFERENCE
    Main challenge: INT8 has significantly lower precision and dynamic range compared to FP32, and so requires “smart” quantization and calibration from FP32 to INT8.

        Type   Dynamic Range                 Min Positive Value
        FP32   -3.4x10^38 ~ +3.4x10^38       1.4x10^-45
        FP16   -65504 ~ +65504               5.96x10^-8
        INT8   -128 ~ +127                   1

    developer.nvidia.com/tensorrt
21. QUANTIZATION OF WEIGHTS
    Symmetric, linear quantization into [-127, 127]:

        I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )
        scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
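    As a worked example of those two formulas, here is a minimal C++ sketch of per-filter symmetric quantization (the function and container choice are my own illustration; TensorRT performs this step internally):

        #include <algorithm>
        #include <cmath>
        #include <cstdint>
        #include <vector>

        // Symmetric linear quantization of one filter's FP32 weights to INT8.
        std::vector<int8_t> quantize_weights(const std::vector<float>& w)
        {
            // scaling_factor = 127 / max(|w|) over all weights in the filter.
            float max_abs = 0.0f;
            for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
            float scale = (max_abs > 0.0f) ? 127.0f / max_abs : 0.0f;  // guard all-zero filters

            std::vector<int8_t> q(w.size());
            for (size_t i = 0; i < w.size(); ++i)
                q[i] = static_cast<int8_t>(std::lround(scale * w[i]));  // round to nearest int
            return q;
        }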
22. QUANTIZATION OF ACTIVATIONS

        I8_value = (value > threshold) ? threshold : scale * F32_value

    How do you decide the optimal threshold?
      • The activation range is unknown offline; it is input dependent.
      • Calibrate using a “representative” dataset.
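    The saturating form of that pseudocode, assuming calibration has already picked a threshold, could look like this sketch (the helper name is hypothetical):

        #include <algorithm>
        #include <cmath>
        #include <cstdint>

        // Quantize one activation with saturation: values beyond +/-threshold
        // clamp to +/-127; values inside map linearly with scale = 127/threshold.
        int8_t quantize_activation(float x, float threshold)
        {
            float scale = 127.0f / threshold;
            long q = std::lround(scale * x);
            q = std::min(127L, std::max(-127L, q));  // saturate out-of-range values
            return static_cast<int8_t>(q);
        }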
23. TensorRT INT8 Workflow
    FP32 training framework → FP32 NEURAL NETWORK → INT8 OPTIMIZATION USING TensorRT (inputs: calibration dataset, batch size, precision) → INT8 PLAN → INT8 RUNTIME USING TensorRT.
    developer.nvidia.com/tensorrt
24. TURNING ON INT8 AND CALLING THE CALIBRATOR
    API calls:

        builder->setInt8Mode(true);
        IInt8Calibrator* calibrator;
        builder->setInt8Calibrator(calibrator);
        bool getBatch(<args>) override

    developer.nvidia.com/tensorrt
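    For context, a skeletal calibrator might look like the sketch below. This follows the shape of the entropy-calibrator interface in later TensorRT releases; the TensorRT 2.1 interface has additional methods, so treat the signatures and the batch-loading helper as assumptions:

        #include <cstddef>
        #include "NvInfer.h"
        using namespace nvinfer1;

        // Minimal sketch of an INT8 calibrator: it feeds batches of a
        // representative dataset so TensorRT can choose activation thresholds.
        class MyCalibrator : public IInt8EntropyCalibrator
        {
        public:
            int getBatchSize() const override { return kBatch; }

            // Called repeatedly during calibration: fill `bindings` with device
            // pointers for the next batch, return false when data is exhausted.
            bool getBatch(void* bindings[], const char* names[], int nbBindings) override
            {
                if (!loadNextBatchToDevice(d_batch_))   // hypothetical helper
                    return false;
                bindings[0] = d_batch_;
                return true;
            }

            // Optional cache so calibration only has to run once.
            const void* readCalibrationCache(size_t& length) override { length = 0; return nullptr; }
            void writeCalibrationCache(const void* cache, size_t length) override {}

        private:
            static const int kBatch = 32;   // assumed calibration batch size
            void* d_batch_ = nullptr;       // device buffer holding the current batch
        };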
25. 8-BIT INFERENCE: Top-1 Accuracy
    [Table columns: Network, FP32 Top-1, INT8 Top-1, Difference, Perf Gain.]
    developer.nvidia.com/tensorrt
26. DEPLOYING ACCELERATED FUNCTIONS SUCH AS TensorRT AS A MICROSERVICE WITH GPU REST ENGINE (GRE)
27. GPU REST ENGINE (GRE) SDK
    Accelerated microservices for web and mobile:
      • Supercomputer performance for hyperscale datacenters: up to 50 teraflops per node, minimum ~250 μs response time
      • Easy to develop new microservices: open source, integrates with existing infrastructure
      • Easy to deploy and scale: ready-to-run Dockerfile
    HTTP (~250 μs) → GPU REST Engine → image classification, speech recognition, image scaling, …
    developer.nvidia.com/gre
28. WEB ARCHITECTURE WITH GRE
    Create accelerated microservices: REST interfaces; provide your own GPU kernel; GRE plugs in easily.
    [Diagram: a web presentation layer (content, ident svc, ads, ICE, img) backed by GRE-based data analytics and GRE-based image classification services.]
    developer.nvidia.com/gre
29. Hello World Microservice
    [Diagram: client → REST API → HTTP layer (Go: func EmptyKernel_Handler) → app layer / CPU-side layer (C++: benchmark_execute(), kernel_wrapper(), guarded by ScopedContext<>) → device layer (CUDA: empty_kernel<<<>>>) running on the GPU.]
    developer.nvidia.com/gre
30. Resource Pool
    [Diagram: a context pool holds several contexts per GPU (GPU1, GPU2); each incoming request takes a ScopedContext from the pool for the duration of its work, then returns it.]
    developer.nvidia.com/gre
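    The pattern on this slide is a blocking pool of pre-created contexts plus an RAII guard. A minimal sketch of how such a pair might be implemented (GRE's actual code differs in detail):

        #include <condition_variable>
        #include <memory>
        #include <mutex>
        #include <queue>

        // Pool of pre-created per-GPU contexts. Handlers borrow one per request,
        // blocking when all contexts are busy, which naturally throttles GPU work.
        template <typename Context>
        class ContextPool {
        public:
            void Push(std::unique_ptr<Context> ctx) {
                std::lock_guard<std::mutex> lock(mu_);
                pool_.push(std::move(ctx));
                cv_.notify_one();
            }
            std::unique_ptr<Context> Pop() {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return !pool_.empty(); });
                auto ctx = std::move(pool_.front());
                pool_.pop();
                return ctx;
            }
        private:
            std::mutex mu_;
            std::condition_variable cv_;
            std::queue<std::unique_ptr<Context>> pool_;
        };

        // RAII guard: takes a context on construction, returns it on destruction,
        // so every exit path (including exceptions) releases the context.
        template <typename Context>
        class ScopedContext {
        public:
            explicit ScopedContext(ContextPool<Context>& pool)
                : pool_(pool), ctx_(pool.Pop()) {}
            ~ScopedContext() { pool_.Push(std::move(ctx_)); }
            Context* operator->() const { return ctx_.get(); }
        private:
            ContextPool<Context>& pool_;
            std::unique_ptr<Context> ctx_;
        };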
31. Classification Microservice
    [Diagram: client → REST API → HTTP layer (Go: func classify) → app layer (C++: classifier_classify(), guarded by ScopedContext<>) → device layer (CUDA: classify()) running on the GPU.]
    developer.nvidia.com/gre
32. CLASSIFICATION.CPP (1/2)

        constexpr static int kContextsPerDevice = 2;   // two CaffeContexts per GPU, to allow latency hiding

        classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                              char* mean_file, char* label_file)
        {
            try {
                int device_count;
                cudaError_t st = cudaGetDeviceCount(&device_count);
                ContextPool<CaffeContext> pool;
                for (int dev = 0; dev < device_count; ++dev) {
                    for (int i = 0; i < kContextsPerDevice; ++i) {
                        std::unique_ptr<CaffeContext> context(new CaffeContext(model_file, trained_file,
                                                                               mean_file, label_file, dev));
                        pool.Push(std::move(context));
                    }
                }
            } catch (...) { /* ... */ }
        }

    developer.nvidia.com/gre
33. CLASSIFICATION.CPP (2/2)

        const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
        {
            try {
                ScopedContext<CaffeContext> context(ctx->pool);   // uses a scoped context
                auto classifier = context->CaffeClassifier();
                predictions = classifier->Classify(img);          // lower-level classify routine
                /* Write the top N predictions in JSON format. */
            }
        }

    developer.nvidia.com/gre
34. CONCLUSION
    Inference is going to power an increasing number of features and capabilities.
      • Latency is important for responsive services.
      • Throughput is important for controlling costs and scaling out.
      • GPUs can deliver high throughput and low latency.
      • Reduced precision can be used for an extra boost.
      • There is a template to follow for creating accelerated microservices.
    developer.nvidia.com/gre
35. WANT TO LEARN MORE?
    GPU Technology Conference, May 8-11 in San Jose:
      • S7310 - 8-Bit Inference with TensorRT (Szymon Migacz)
      • S7458 - Deploying Unique DL Networks as Micro-Services with TensorRT, User-Extensible Layers, and GPU REST Engine (Chris Gottbrath)
      • 9 Spark and 17 TensorFlow sessions
      • 20% off discount code: NVCGOTT
    Resources to check out:
      • developer.nvidia.com/tensorrt
      • developer.nvidia.com/gre
      • devblogs.nvidia.com/parallelforall/ (NVIDIA Jetson TX2 Delivers Twice …, Production Deep Learning …)
      • www.nvidia.com/en-us/deep-learning-ai/education/
      • github.com/dusty-nv/jetson-inference
36. THANKS
    cgottbrath@nvidia.com
37. RESOURCE SLIDES
38. main.go

        func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
            // Calls the C function through cgo.
            C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
            io.WriteString(w, string(message[:]))
        }

        func main() {
            http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)  // set the API URL
            http.ListenAndServe(":8000", nil)                      // run the server
        }
39. benchmark.cpp (1/2)

        constexpr static int kContextsPerDevice = 4;    // 4 contexts per GPU

        benchmark_ctx* benchmark_initialize()
        {
            int device_count;
            cudaGetDeviceCount(&device_count);          // get the number of GPUs
            ContextPool<BenchmarkContext> pool;         // create the pool
            for (int dev = 0; dev < device_count; ++dev) {
                for (int i = 0; i < kContextsPerDevice; ++i) {
                    std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
                    pool.Push(std::move(context));
                }
            }
        }
40. benchmark.cpp (2/2)

        void benchmark_execute(benchmark_ctx* ctx, char* message)
        {
            ScopedContext<BenchmarkContext> context(ctx->pool);  // scoped context
            cudaStream_t stream = context->CUDAStream();
            kernel_wrapper(stream, message);                     // run the wrapper
        }
41. kernel.cu

        // GPU code.
        __global__ void empty_kernel(char* device_message)
        {
            const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
            for (int i = 0; i < 50; i++) {
                device_message[i] = message[i];
                if (message[i] == '\0') break;   // stop at the terminator
            }
        }

        // Host-side wrapper.
        void kernel_wrapper(cudaStream_t stream, char* message)
        {
            cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
            host_message = (char*)malloc(message_size);
            empty_kernel<<<1, 1, 0, stream>>>(device_message);   // device call
            cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
            strncpy(message, host_message, message_size);
        }
42. TensorRT: Layer Types Supported
    • Convolution: currently only 2D convolutions
    • Activation: ReLU, tanh and sigmoid
    • Pooling: max and average
    • Scale: similar to the Caffe Power layer, (shift + scale*x)^p
    • ElementWise: sum, product or max of two tensors
    • LRN: cross-channel only
    • Fully-connected: with or without bias
    • SoftMax: cross-channel only
    • Deconvolution
43. TensorRT Optimizations
    • Fuse network layers
    • Eliminate concatenation layers
    • Kernel specialization
    • Auto-tuning for the target platform
    • Tuned for a given batch size
    TRAINED NEURAL NETWORK → OPTIMIZED INFERENCE RUNTIME
    developer.nvidia.com/tensorrt
44. GRAPH OPTIMIZATION: Unoptimized network
    [Diagram: an Inception-style module; the input feeds 1x1, 3x3 and 5x5 convolutions (each with bias and ReLU), a max pool followed by a 1x1 convolution, and 1x1 convolutions in front of the 3x3 and 5x5 branches; all branch outputs are concatenated into the next input.]
45. GRAPH OPTIMIZATION: Vertical fusion
    [Diagram: each convolution + bias + ReLU sequence is fused into a single CBR node (1x1 CBR, 3x3 CBR, 5x5 CBR), shrinking the module to CBR nodes, the max pool and the concatenations.]
46. GRAPH OPTIMIZATION: Horizontal fusion
    [Diagram: the 1x1 CBR nodes that read the same input are fused into a single 1x1 CBR, leaving the 3x3 CBR, 5x5 CBR and max-pool branches alongside it.]
47. GRAPH OPTIMIZATION: Concat elision
    [Diagram: the concatenation layer is removed; the 1x1, 3x3 and 5x5 CBR nodes and the max-pool branch write their outputs directly into the next input.]
48. INT8 Precision: New in TensorRT
    PERFORMANCE: up to 3x more images/sec with INT8 precision (GoogLeNet, FP32 vs. INT8 with TensorRT on a Tesla P40 GPU, 2-socket Haswell E5-2698 v3 @ 2.3 GHz with HT off; batch sizes 2, 4 and 128).
    EFFICIENCY: deploy 2x larger models with INT8 precision (memory in MB at batch sizes 2, 4 and 128).
    ACCURACY: deliver full accuracy with INT8 precision (Top-1 and Top-5 accuracy, FP32 vs. INT8).
49. IDP.4A – 8-BIT INSTRUCTION
    [Diagram: four pairs of 8-bit integers are multiplied elementwise (i8 × i8, four times) and the products are summed into a 32-bit accumulator (i32 + i32).]
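    In scalar terms, the instruction computes a 4-way int8 dot product accumulated into int32 in a single operation. A behavioral C++ model of what it computes (not the hardware definition; CUDA exposes the operation on Pascal sm_61+ through the __dp4a intrinsic):

        #include <cstdint>

        // Behavioral model of IDP4A: 4-way int8 dot product with int32 accumulate.
        int32_t idp4a(const int8_t a[4], const int8_t b[4], int32_t acc)
        {
            for (int i = 0; i < 4; ++i)
                acc += int32_t(a[i]) * int32_t(b[i]);  // i8 x i8 products, summed into i32
            return acc;
        }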
