3. Five critical factors used to measure inference platforms:
1. Throughput: The volume of output within a given period, often measured in inferences/second or samples/second.
2. Efficiency: The amount of throughput delivered per unit of power, often expressed as performance/watt.
3. Latency: The time to execute a single inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.
4. Accuracy: A trained neural network's ability to deliver the correct answer.
5. Memory usage: The host and device memory that must be reserved to run inference on a network depends on the algorithms used. This constrains which networks, and which combinations of networks, can run on a given inference platform. It is particularly important for systems where multiple networks are needed and memory resources are limited.
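Throughput and latency can be measured directly once a model is serving. A minimal measurement sketch in Python (the run_inference callable and batch are hypothetical placeholders, not part of the original material):

import time

def measure(run_inference, batch, n_iters=100):
    # Warm up once so one-time initialization does not skew the timing.
    run_inference(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000        # average time per inference
    throughput = n_iters * len(batch) / elapsed  # samples/second
    return latency_ms, throughput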
12. Graph optimizations:
1. Vertical fusion.
2. Horizontal fusion.
3. Layer elimination.
Network        Layers Before   Layers After
VGG-19              43              27
Inception V3       309             113
ResNet-152         670             159
Note: All performance results are taken from the official TensorRT blog.
14. Precision Calibration
• Automatic weight calibration is performed when INT8 is used.
• Performance boost with reduced precision.
https://developer.nvidia.com/tensorrt
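A minimal sketch of requesting INT8 calibration through the TensorRT Python API (network and my_calibrator are illustrative names: it assumes a network has already been parsed and a calibrator implementing trt.IInt8EntropyCalibrator2 exists, and the exact builder calls vary across TensorRT versions):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # request reduced-precision kernels
config.int8_calibrator = my_calibrator  # supplies representative input batches
# engine_bytes = builder.build_serialized_network(network, config)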
15. Kernel Auto-Tuning and Dynamic Tensor Memory
Dynamic tensor memory:
• Reduces memory footprint and improves memory re-use.
• Manages memory allocation for each tensor only for the duration of its usage.
Kernel auto-tuning:
• TensorRT picks the implementation from a library of kernels that delivers the best performance for the target GPU (e.g., Tesla V100, Jetson TX2, Drive PX2), input data size, filter size, tensor layout, batch size, and other parameters.
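At build time the auto-tuner times candidate kernels within a scratch-memory budget. Continuing the builder-config sketch from the precision-calibration slide (the pool-limit call is from TensorRT 8.4+; older releases used config.max_workspace_size instead):

# Allow up to 1 GiB of scratch space for kernel auto-tuning.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)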
17. EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Pipeline: Trained Neural Network -> TensorRT Optimizer -> Optimized Runtime Engine -> Inference Results
Import, optimize, and deploy TensorFlow models using the TensorRT Python API:
1. Start with a frozen TensorFlow model.
2. Create a model parser.
3. Optimize the model and create a runtime engine.
4. Perform inference using the optimized runtime engine.
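The slide describes the original frozen-graph workflow; with TensorFlow 2.x the same four steps are typically driven through the bundled TF-TRT converter. A minimal sketch (directory paths are placeholders):

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Steps 1-3: parse the trained SavedModel, optimize it, and build a runtime engine.
converter = trt.TrtGraphConverterV2(input_saved_model_dir='/tmp/saved_model')
converter.convert()
converter.save('/tmp/trt_saved_model')

# Step 4: load the optimized model and run inference.
trt_model = tf.saved_model.load('/tmp/trt_saved_model')
infer = trt_model.signatures['serving_default']
# outputs = infer(tf.constant(input_batch))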
21. TensorFlow Lite: Features
1. Lightweight.
2. Low latency.
3. Privacy.
4. Reduced power consumption.
5. Efficient model format.
6. Pre-trained models.
22. Components of TFLite
Converter:
• Transforms TF models into a form efficient for reading by the Interpreter.
• Introduces optimizations to improve binary performance and/or reduce model size.
Interpreter:
• Diverse platform support (Android, iOS, embedded Linux, and microcontrollers).
• Platform APIs for accelerated inference.
24. Acceleration Available
Software: NNAPI (also as a delegate), CPU optimizations.
Hardware: Edge TPU, GPU.
Performance:
-> GPU delegate and NNAPI.
-> Not all operations are supported by the GPU backend.
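A hedged sketch of attaching an accelerator delegate to the Python Interpreter (the delegate library filename is platform- and build-specific, shown here only as an illustration):

import tensorflow as tf

# Load a platform-specific delegate library (name varies by platform and build).
gpu_delegate = tf.lite.experimental.load_delegate('libtensorflowlite_gpu_delegate.so')
interpreter = tf.lite.Interpreter(
    model_path='model.tflite',
    experimental_delegates=[gpu_delegate])
interpreter.allocate_tensors()
# Operations the GPU backend does not support fall back to the CPU kernels.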
25. Optimization Techniques
1. Quantization.
2. Weight pruning.
3. Model topology transforms.
Why quantization:
1. All available CPU platforms are supported.
2. Reduced latency and inference cost.
3. Low memory footprint.
4. Allows execution with fixed-point operations.
5. Optimized models for special-purpose HW accelerators (TPU).
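Post-training quantization is a one-flag change to the converter introduced on the next slides; a minimal sketch, assuming model is an already-trained Keras model:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable quantization
tflite_quant_model = converter.convert()
# For full integer (fixed-point) execution and TPU-class accelerators,
# additionally supply a representative dataset via converter.representative_dataset.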
28. Parameters for Conversion
SavedModel -> tf.lite.TFLiteConverter.from_saved_model(…)
Keras model -> tf.lite.TFLiteConverter.from_keras_model(…)
Concrete functions -> tf.lite.TFLiteConverter.from_concrete_functions(…)
The output of the conversion is a TFLite model:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
29. import tensorflow as tf
# Create a simple Keras model.
x = [-1, 0, 1, 2, 3, 4]
y = [-3, -1, 1, 3, 5, 7]
model = tf.keras.models.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(x, y, epochs=50)
# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
30. import numpy as np
import tensorflow as tf
# Load the MobileNet tf.keras model.
model = tf.keras.applications.MobileNetV2(
    weights="imagenet", input_shape=(224, 224, 3))
# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
End-to-end MobileNet conversion
31. # Test the TensorFlow Lite model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
tflite_results = interpreter.get_tensor(output_details[0]['index'])
# Test the TensorFlow model on random input data.
tf_results = model(tf.constant(input_data))
# Compare the result.
for tf_result, tflite_result in zip(tf_results, tflite_results):
    np.testing.assert_almost_equal(tf_result, tflite_result, decimal=5)
32. Command Line
# Converting from a SavedModel
tflite_convert --output_file=model.tflite --saved_model_dir=/tmp/saved_model
# Converting from a Keras model
tflite_convert --output_file=model.tflite --keras_model_file=saved_model.h5
34. Inference at the Edge
Factors driving this trend:
1. Smaller, faster models.
2. On-device accelerators.
3. Demand to move ML capability from the cloud to on-device.
4. Smart-device growth demands bringing ML to the edge.
Benefits of on-device ML:
1. High performance.
2. Local data accessibility.
3. Better privacy.
4. Works offline.