Inference Accelerators
CELLSTRAT AI LAB
DARSHAN C G
DNN Pipeline Overview
Five critical factors used to evaluate an inference platform:
1. Throughput: The volume of output within a given period, often measured in inferences/second or samples/second.
2. Efficiency: The amount of throughput delivered per unit of power, often expressed as performance/watt.
3. Latency: The time to execute a single inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.
4. Accuracy: A trained neural network's ability to deliver the correct answer.
5. Memory usage: The host and device memory that must be reserved to run inference on a network depends on the algorithms used. This constrains which networks, and which combinations of networks, can run on a given inference platform. It is particularly important for systems where multiple networks are needed and memory resources are limited.
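As a concrete illustration of the first and third factors, a minimal sketch of measuring latency and throughput for a Keras model; the model choice, batch size and run count here are illustrative assumptions:
import time
import numpy as np
import tensorflow as tf

# Illustrative model and batch size; substitute your own network and data.
model = tf.keras.applications.MobileNetV2(weights=None, input_shape=(224, 224, 3))
batch = np.random.random_sample((8, 224, 224, 3)).astype(np.float32)

# Warm-up run so one-time setup cost does not skew the timing.
model(batch, training=False)

runs = 50
start = time.perf_counter()
for _ in range(runs):
    model(batch, training=False)
elapsed = time.perf_counter() - start

latency_ms = elapsed / runs * 1000            # average time per batch
throughput = runs * batch.shape[0] / elapsed  # samples per second
print(f"Latency: {latency_ms:.2f} ms/batch, Throughput: {throughput:.1f} samples/s")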
NVIDIA TENSORRT
Programmable Inference Accelerator
FRAMEWORKS -> TENSORRT -> GPU PLATFORMS
https://developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE
https://developer.nvidia.com/tensorrt
TENSORRT DEPLOYMENT WORKFLOW
Faster TensorFlow Inference and Volta Support
TENSORRT OPTIMIZATION
➢ Optimizations are completely automatic.
➢ Performed with a single function call.
LAYER & TENSOR FUSION
Unoptimized network vs. optimized network
https://developer.nvidia.com/tensorrt
1. Vertical Fusion.
2. Horizontal Fusion.
3. Layer Elimination.
Network        Layers Before   Layers After
VGG-19         43              27
Inception V3   309             113
ResNet-152     670             159
Note: All performance results are taken from the official TensorRT blog.
Precision Calibration
Automatic weight calibration is performed when INT8 precision is used.
Performance boost with reduced precision.
https://developer.nvidia.com/tensorrt
KERNEL AUTO-TUNING AND DYNAMIC TENSOR MEMORY
Dynamic Tensor Memory
• Reduces memory footprint and improves memory re-use.
• Manages memory allocation for each tensor only for the duration of its usage.
Kernel Auto-Tuning
• TensorRT picks the implementation from a library of kernels that delivers the best performance for the target GPU (e.g. Tesla V100, Jetson TX2, Drive PX2), input data size, filter size, tensor layout, batch size and other parameters.
EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Trained Neural Network -> TensorRT Optimizer -> Optimized Runtime Engine -> Inference Results
Import, optimize and deploy TensorFlow models using the TensorRT Python API:
1. Start with a frozen TensorFlow model.
2. Create a model parser.
3. Optimize the model and create a runtime engine.
4. Perform inference using the optimized runtime engine.
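The steps above describe the parser-based workflow; as a hedged sketch, the same idea can also be expressed with the TF-TRT integration shipped with TensorFlow 2.x (TrtGraphConverterV2). The SavedModel paths below are placeholders, TensorRT must be installed, and the exact parameter plumbing varies slightly between TensorFlow releases:
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Placeholder paths; replace with your own SavedModel locations.
SAVED_MODEL_DIR = "/tmp/saved_model"
TRT_MODEL_DIR = "/tmp/saved_model_trt"

# Request FP16 precision; INT8 would additionally require a calibration input function.
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode="FP16")
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    conversion_params=conversion_params)
converter.convert()            # optimize the graph with TensorRT
converter.save(TRT_MODEL_DIR)  # write the optimized engine as a SavedModel

# Perform inference using the optimized runtime engine.
trt_model = tf.saved_model.load(TRT_MODEL_DIR)
infer = trt_model.signatures["serving_default"]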
TENSORFLOW LITE
Img source: https://www.tensorflow.org/lite/convert/index?hl=RU
Features
1. Lightweight.
2. Low latency.
3. Privacy.
4. Reduced power consumption.
5. Efficient model format.
6. Pre-trained models.
Components of TFLite
Converter:
• Transforms TF models into a form efficient for reading by the Interpreter.
• Introduces optimizations to improve binary model performance and/or reduce model size.
Interpreter:
• Diverse platform support (Android, iOS, Embedded Linux and microcontrollers).
• Platform APIs for accelerated inference.
Acceleration Available
• Software: NNAPI (also as a delegate).
• Hardware: Edge TPU, GPU, CPU optimizations.
Performance
• GPU delegate and NNAPI.
• Not all operations are supported by the GPU backend.
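As a hedged illustration of handing work to an accelerator from Python, a minimal sketch using tf.lite.experimental.load_delegate with the Edge TPU runtime; the library name and model path are assumptions, and on Android the GPU delegate and NNAPI are usually enabled through the Java/Kotlin or C++ APIs instead:
import tensorflow as tf

# Assumptions: a model already compiled for the Edge TPU and the Edge TPU
# runtime library (libedgetpu) installed on the device.
MODEL_PATH = "model_edgetpu.tflite"

# Load the delegate and hand it to the interpreter; operations the delegate
# does not support fall back to the CPU.
delegate = tf.lite.experimental.load_delegate("libedgetpu.so.1")
interpreter = tf.lite.Interpreter(
    model_path=MODEL_PATH,
    experimental_delegates=[delegate])
interpreter.allocate_tensors()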
Optimization Techniques
1. Quantization.
2. Weight Pruning.
3. Model Topology Transforms.
Why Quantization:
1. All available CPU platforms are supported.
2. Reduces latency and inference cost.
3. Low memory footprint.
4. Allows execution using fixed-point operations.
5. Optimized models for special-purpose HW accelerators (TPU).
Workflow
TensorFlow -> SavedModel -> TFLite Converter -> TFLite Model
Supported inputs: SavedModels, Frozen Graphs, HDF5 (Keras) models, tf.Session (Python API only).
https://www.tensorflow.org/lite/convert/index?hl=RU
Parameters for Conversion
SavedModel -> tf.lite.TFLiteConverter.from_saved_model(…)
Keras Model -> tf.lite.TFLiteConverter.from_keras_model(…)
Concrete Functions -> tf.lite.TFLiteConverter.from_concrete_functions(…)
The output of the conversion is a TFLite model:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
import tensorflow as tf
# Create a simple Keras model.
x = [-1, 0, 1, 2, 3, 4]
y = [-3, -1, 1, 3, 5, 7]
model = tf.keras.models.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(x, y, epochs=50)
# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
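The converted flatbuffer is typically written to disk before being deployed; a short sketch (the file name is illustrative):
# Save the converted model to a .tflite file for deployment.
with open("linear_model.tflite", "wb") as f:
    f.write(tflite_model)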
End-to-end MobileNet conversion
import numpy as np
import tensorflow as tf
# Load the MobileNet tf.keras model.
model = tf.keras.applications.MobileNetV2(
    weights="imagenet", input_shape=(224, 224, 3))
# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Test the TensorFlow Lite model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
tflite_results = interpreter.get_tensor(output_details[0]['index'])
# Test the TensorFlow model on random input data.
tf_results = model(tf.constant(input_data))
# Compare the result.
for tf_result, tflite_result in zip(tf_results, tflite_results):
    np.testing.assert_almost_equal(tf_result, tflite_result, decimal=5)
Command Line
# Converting from a SavedModel
tflite_convert --output_file=model.tflite --saved_model_dir=/tmp/saved_model
# Converting from a Keras HDF5 model
tflite_convert --output_file=model.tflite --keras_model_file=saved_model.h5
Quantization
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
tf.lite.Optimize.DEFAULT balances model size and latency; OPTIMIZE_FOR_SIZE and OPTIMIZE_FOR_LATENCY are also available.
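For accelerators such as the Edge TPU that require integer-only models, a hedged sketch of full-integer post-training quantization; the representative dataset generator, input shape and SavedModel path are illustrative assumptions:
import numpy as np
import tensorflow as tf

saved_model_dir = "/tmp/saved_model"  # placeholder path

def representative_dataset():
    # Yield a few batches of typical inputs so the converter can calibrate
    # activation ranges for INT8.
    for _ in range(100):
        yield [np.random.random_sample((1, 224, 224, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to integer-only kernels and integer input/output types
# (tf.int8 is also accepted depending on the TensorFlow version).
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_int8_model = converter.convert()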
Inference at the Edge
Factors driving this trend:
1. Smaller, faster models.
2. On-device accelerators.
3. Demand to move ML capability from the cloud to the device.
4. Smart-device growth demands bringing ML to the edge.
Benefits of on-device ML:
1. High performance.
2. Local data accessibility.
3. Better privacy.
4. Works offline.
Coral
Mendel OS.
Edge TPU Compiler.
Mendel Development Tool
https://www.seeedstudio.com/Coral-Dev-Board-p-2900.html
Raspberry Pi
Small size, low cost, works like a general-purpose computer, runs Raspbian OS.
https://robu.in/product/raspberry-pi-4-model-b-with-4-gb-ram/
Options
1. Compile TensorFlow from source.
2. Install TensorFlow from pip.
3. Use the TFLite Interpreter directly.
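A minimal sketch of option 3 on a Raspberry Pi, assuming the interpreter-only tflite_runtime package (pip install tflite-runtime) and a model.tflite file are present; with full TensorFlow installed, tf.lite.Interpreter can be used the same way:
import numpy as np
from tflite_runtime.interpreter import Interpreter  # interpreter-only package

MODEL_PATH = "model.tflite"  # assumed model file on the device

interpreter = Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a random input of the expected shape and run inference.
input_shape = input_details[0]['shape']
input_data = np.random.random_sample(input_shape).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]['index'])
print(result.shape)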
CODE DEMO
