Introduction to some new features of TFLite, including
1. delegates: including new NNAPI delegate, GPU delegate, and flex delegate,
2. optimized kernels for ARM CPUs,
3. various APIs: including Python, C, Objective-C, and Swift ones, and
4. misc, e.g., graph writer and Edge TPU.
Status quo of tensor flow lite on edge devices coscup 2019
1. Status Quo of
TensorFlow Lite on Edge
Devices
Koan-Sin Tan
freedom@computer.org
Aug 17th, 2019
COSCUP, Taipei, Taiwan
1
2. • disclaimer: Opinions Are My Own
• feel free to interrupt me if you have any questions
2
3. who i am
• Used open source before the term “open
source” was coined
• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components
• Recently, on NN performance on edge devices
related stuff
• Contributed from time to time to TensorFlow
Lite
• started a command line label_image for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3
4. Outline
• overview: or, say, why TFLite
• new features
• delegates: including new NNAPI delegate, GPU delegate,
and flex delegate,
• optimized kernels for ARM CPUs,
• various APIs: including Python, C, Objective-C, and Swift
ones, and
• misc, e.g., graph writer and Edge TPU.
4
5. Why TFLite?
• TensorFlow Lite
• TensorFlow is the most popular machine learning framework
• TFLite: a lightweight runtime for edge devices
• could be accelerated by GPU, DSP, or ASIC accelerators
• PyTorch is catching up, but acceleration part is still lagging far
behind TFLite
• Yes, there are other open source NN frameworks. None is as
comprehensive as TF Lite, as far as I can tell
5
8. Offline usage
• we have heard words such as “always-on” and “always-
connected” since the 3G days 🤔, but wireless
communication is still unreliable
8
9. latency
• “There is an old network saying: Bandwidth problems
can be cured with money. Latency problems are harder
because the speed of light is fixed — you can't bribe
God.” -- David D. Clark, MIT
9
https://en.wikipedia.org/wiki/David_D._Clark
10. Bandwidth
• Well, the bandwidth of wireless networks is not an easy problem
either
• consider an NN-based “portrait mode” (or, say,
Bokeh effect) on an iPhone Xs Max (12 + 12 MP)
• if we send raw images: (12+12)*10^6*(3*8) = 576 M bits per frame
• at 30 frames per second: 576 * 30 ~= 17.3 G bits per second
• you know this is not feasible for now
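The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the 30 frames per second rate and the helper name `raw_bits_per_frame` are assumptions made for illustration.

```python
def raw_bits_per_frame(megapixels_per_camera=(12, 12), channels=3, bits_per_channel=8):
    """Bits needed to send one uncompressed frame from every camera."""
    total_pixels = sum(mp * 10**6 for mp in megapixels_per_camera)
    return total_pixels * channels * bits_per_channel


FPS = 30  # assumed frame rate

bits_frame = raw_bits_per_frame()
bits_second = bits_frame * FPS

print(f"per frame:  {bits_frame / 10**6:.0f} Mbit")
print(f"per second: {bits_second / 10**9:.2f} Gbit/s at {FPS} fps")
```

Even a single raw frame is far larger than what a typical mobile uplink can move in real time, which is the point of the slide.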
10
11. Privacy
• you know you need privacy for
both your physical body and
your mobile device(s)
11
12. NN-based ML is already in
cell phones
• Google I/O 2017: Mobile First —> AI First
• TensorFlow Lite, Android Neural Network API
• Lots of stuff from Google blogs and papers, e.g., Google Lens, federated learning in Gboard
• Pixel Visual Core in Pixel 2/3, 2/3 XL: although it seems there is no way for developers to
use it as a general NN accelerator
• Apple announced CoreML, a machine learning framework, at WWDC 2017 (June 2017)
• Apple’s machine learning journal (https://machinelearning.apple.com/): how Apple uses CNN
and other machine learning techniques in iPhone
• Neural Engine in A11/A11X/A12/A12X, available to developers via Core ML on A12
devices
• Computer Architecture: A Quantitative Approach, 6th Ed. (Nov, 2017) has a whole new chapter
on Domain Specific Architecture, actually NN accelerators.
12
13. actually there are many NNAPI-
enabled phones already
http://ai-benchmark.com/ranking_processors.html
mid June, 2019
13
15. https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-
silicon-secrets/5
• AnandTech is one of my favorite tech sites. Usually, it provides
good analysis
• E.g., Apple’s CPUs
• cache sizes
• execution units
• various instruction latency
• Not good enough for NN accelerators on mobile phones
• floating-point VGG16, Inception V3, and ResNet34?
• come on, are you still in Neolithic era?
Evolving fast: the slide I prepared Nov, 2018
15
16. TF Lite in Android Pie
• There are ‘libtflite.so’s in /system/lib and /system/lib64
• https://source.android.com/devices/tech/display/textclassifier
16
19. ML Kit
• https://developers.google.com/ml-kit/, part of Firebase
• Originally, only custom models
are TFLite
• Now, as far as I can tell, vision
parts are using TFLite also
https://developers.google.com/ml-kit/
19
20. • see appendix for Google Translate, Google Lens, Gboard,
and others
20
21. Some Progresses Make NN
on Edge Devices Really Viable
• “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” [1]. A keynote at
ESWEEK 2017, “Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-
Neural-Network Architectures” [2]
• MobileNet V1 [3] and V2 [4]: Depthwise separable convolution [5] and inverted residuals and linear
bottlenecks [4]
• AutoML, e.g.,
• NASNet Mobile [6] and Mnasnet [7]
• MobileNet V3 [10] and EfficientNet [11]
• Quantization [8][9]
• How about pruning / compression stuff? As far as I know, not widely used yet
[1] https://arxiv.org/abs/1602.07360
[2] https://arxiv.org/abs/1710.02759
[3] https://arxiv.org/abs/1704.04861
[4] https://arxiv.org/abs/1801.04381
[5] https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf
[6] https://arxiv.org/abs/1707.07012
[7] https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html, https://arxiv.org/abs/1807.11626
[8] https://arxiv.org/abs/1712.05877
[9] https://arxiv.org/abs/1806.08342
[10] https://arxiv.org/abs/1905.02244
[11] https://arxiv.org/abs/1905.11946
21
22. • Michael Jordan published an
article on Medium named
“Artificial Intelligence — The
Revolution Hasn’t
Happened Yet” [1]
• Yes, but current deep learning
driven stuff should be enough
for next few years
[1] https://medium.com/
@mijordan3/artificial-intelligence-
the-revolution-hasnt-happened-
yet-5e1d5812e1e7
22
23. Why I Started Learning TF
Lite
• We heard about Android NN and TensorFlow Lite back at Google I/O 2017
• My COSCUP 2017 slide deck “TensorFlow on Android”
• https://www.slideshare.net/kstan2/tensorflow-on-
android
• People knew a bit about Android NN API before it was
announced and released
• No information about TensorFlow Lite, at least to me,
before it was released in Nov, 2017
23
24. Quantization and
Accelerators
• Quantization
• Quantization is not new. People have known since pre-DNN days that
NN models contain a lot of redundancy, and many quantization and
compression/pruning techniques have been presented over the years.
TFLite and its underlying gemmlowp (and NNAPI) made up the first
production-quality system that supports quantized unsigned int8.
• accelerators (thru NNAPI in the beginning, and directly later)
• CPU is not always the best place to run NN models
• GPU, DSP, and other accelerators
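To make the quantized uint8 scheme concrete, here is a minimal sketch of the affine mapping real ≈ scale × (q − zero_point) described in the quantization paper (arXiv:1712.05877). The helper names `choose_params`, `quantize`, and `dequantize` are made up for illustration; this is not TFLite or gemmlowp code.

```python
def choose_params(rmin, rmax, qmin=0, qmax=255):
    """Pick scale and zero point so that [rmin, rmax] maps onto [qmin, qmax]."""
    # The real range must contain 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to a clamped integer in [qmin, qmax]."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate real value."""
    return scale * (q - zero_point)


scale, zp = choose_params(-1.0, 3.0)
for x in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(x, scale, zp)
    print(x, q, round(dequantize(q, scale, zp), 4))
```

The round trip loses at most about one quantization step, and 0.0 survives exactly, which matters for zero padding.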
24
25. TFLite and Android NN in
Google I/O 2017
• New TensorFlow runtime
• Optimized for mobile and
embedded apps
• Runs TensorFlow models on
device
• Leverage Android NN API
• Soon to be open sourced
from Google I/O 2017 video
25
26. Actual Android NN API
• Announced/published with Android 8.1
Preview 1
• Available to developer in NDK
• yes, NDK
• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices
• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks
• The API is available on all devices running
Android 8.1 (API level 27) or higher.
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
26
27. Android NN on Pixel 2
• Only the CPU fallback was available on Oreo MR1
• Actually, you can already see Android NN API related code in AOSP after the Oreo MR1 (8.1) release
• user level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo-mr1-release
• HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1-release/
neuralnetworks/
• There is NN API 1.1 on Android Pie
• https://developer.android.com/about/versions/pie/android-9.0#nnapi
• adding support for nine new ops — Pad, BatchToSpaceND, SpaceToBatchND, Transpose, Strided
Slice, Mean, Div, Sub, and Squeeze
• In the Android P DP1/2 (https://developer.android.com/preview/download.html), there was a HVX
NN API 1.0 (yes, 1.0) driver. Gone after DP2. Not in recent Pie release. (See https://
android.googlesource.com/platform/hardware/qcom/neuralnetworks/hvxservice/ for source code)
• NN API 1.2, which supports 90+ ops, is in AOSP and will be in forthcoming Android Q (version 10)
27
28. So NNAPI accelerators
don’t work?
• Yes, I don’t know what happened to earlier Pixel phones
• I don’t have Pixel 3 to try
• Q beta 4 for Pixel 3a comes with a working HVX
accelerator driver. It’s an NNAPI 1.1 one
though.
• And remember what I showed in pp. 13 and 14, there are
many NNAPI-enabled phones already
28
29. Original TFLite APIs
• Java API: A convenience
wrapper around the C++ API
on Android
• C++ API: loads the
TensorFlow Lite model file and
invokes the Interpreter. The
same library is available on
both Android and iOS
https://www.tensorflow.org/mobile/tflite/
29
30. Other bindings
• Python and C APIs
• Python: introduced in TF 1.8.0, built into pip package in 1.9.0
• my label_image.py for tflite merged on Aug 9, 2018
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/
lite/examples/python/label_image.py
• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/
lite/examples/python
• C API: introduced for Unity
• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/
contrib/lite/experimental/c
30
31. How to Use it
31
• TFLite guys work hard
• documentation getting better and better
over < 2 yrs
• yes, sometimes you still have to “use the
source”
https://www.tensorflow.org/lite
33. Basic Usage
• model: .tflite model
• resolver: if no custom ops, builtin op
resolver is enough
• interpreter: we need it to compute
the graph
• interpreter->AllocateTensors():
allocate stuff for you, e.g., input
tensor(s)
• fill the input
• interpreter->Invoke(): run the graph
• process the output
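#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Load the .tflite model from a file.
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
// Resize input tensors, if desired.
interpreter->AllocateTensors();
float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.
interpreter->Invoke();
float* output = interpreter->typed_output_tensor<float>(0);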
33
34. more source code
• Check my COSCUP 2018 slide deck, which was for a talk
in a source code reading track, for more details
• https://www.slideshare.net/kstan2/open-source-nn-
frameworks-on-cellphones
• And I’ll have a more code-oriented talk on TFLite
delegates tomorrow
34
37. More Delegates
• Flex Delegate
• The set of ops supported by TFLite is relatively limited. TensorFlow Lite models can now use a
subset of TensorFlow ops when TFLite builtin ops are not sufficient
• GPU backend: no, not NNAPI
• OpenGL ES 3.1 Compute Shaders on Android devices
• Metal Compute Shaders on iOS devices
• “in general the new GPU backend performs 2–7x faster than the floating point CPU
implementation for a wide range of diverse deep neural network models.”
https://www.tensorflow.org/lite/using_select_tf_ops
https://medium.com/tensorflow/tensorflow-lite-now-faster-with-mobile-gpus-developer-preview-e15797e6dee7
https://www.tensorflow.org/lite/performance/gpu
https://www.tensorflow.org/lite/performance/gpu_advanced
37
38. Why a non-NNAPI delegate?
https://developer.android.com/about/dashboards
NNAPI-enabled devices ~7.5% around the end of Oct, 2018
38
39. NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards39
41. GPU Delegate Performance
• my quick and dirty benchmarks
• Android: https://github.com/freedomtan/
glDelegateBench
• iOS: https://github.com/freedomtan/
glDelegateBenchmark/
• at first, the GPU delegate was a binary-only release (aar for Android; pod for iOS)
• after the release of GPU delegate source code, benchmark_model and
label_image are able to use GPU delegate
41
42. GPU delegate kernels
• Recently, TFLite GPU delegate guys
published a paper talking about how they
design it. Covered some interesting details
• GPU backends require initialization
involving shader compilation and
optimization by the driver before inference
• PHWC4: P stands for plane
• Reshape is expensive on GPU
• RGBA is better than RGB on GPU
• a tensor of shape [B,H,W,5], for instance,
is twice as expensive as [B,H,W,4], but
about the same as [B,H,W,8], so the
architect can tune around those 4-channel
boundaries rather than trying to optimize
on other boundaries.
https://arxiv.org/pdf/1907.01989.pdf
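The 4-channel boundary effect can be illustrated with a small calculation. In a PHWC4-style layout the channel dimension is split into slices of 4, with padding for the remainder. The sketch below only counts padded elements. It is not GPU delegate code, the function names are hypothetical, and element count is only a rough proxy for cost.

```python
import math


def channel_slices(channels, slice_width=4):
    """Number of 4-channel slices needed to hold `channels` channels."""
    return math.ceil(channels / slice_width)


def padded_elements(b, h, w, c, slice_width=4):
    """Elements stored once the channel dimension is padded to a multiple of 4."""
    return b * h * w * channel_slices(c, slice_width) * slice_width


for c in (3, 4, 5, 8):
    print(c, channel_slices(c), padded_elements(1, 224, 224, c))
```

With this model, 3 and 4 channels cost the same (RGB vs. RGBA), and 5 channels cost the same as 8.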
43. Faster ARM CPU kernels
• It’s available now. Enabled by default for Android ARM64
since early June
• https://github.com/tensorflow/tensorflow/commit/
8924e67e034909bea0343631b9f9024c5a6da5c4
• ruy:
• four tuned fixed-point kernels: big/LITTLE (out-of-order/
in-order), w/ or w/o dot-product instructions
• two tuned floating point kernels
43
44. More on ruy
• matrix multiplication in AArch64 NEON
• sdot based kernels for either out-of-order CPUs, e.g., CA76, or in-order CPUs, e.g., CA55r1
• non sdot based kernels for either out-of-order CPUs, e.g., CA73, or in-order CPUs, e.g., CA53
• how the kernel is chosen: detection at run time instead of hard-coded list (e.g., PyTorch cpuinfo)
• sdot or not: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/
experimental/ruy/detect_dotprod.cc#L129-L157
• in-order or out-of-order: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/
experimental/ruy/tune.cc, esp., https://github.com/tensorflow/tensorflow/blob/master/tensorflow/
lite/experimental/ruy/tune.cc#L102-L124
• doesn't need to list all possibilities, probably can handle future cores. Still cannot deal with
big.LITTLE cores
• thread pool: it seems to scale better than the one currently in use, so that multi-threaded floating-
point numbers are much better
• before ruy, floating point: eigen thread pool; fixed-point: TFLite’s thread pool
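The run-time choice among the four fixed-point kernels can be pictured as a lookup on two detected properties. This is a toy sketch, not ruy's code: the kernel names and `select_fixed_point_kernel` are hypothetical, and ruy detects the two properties at run time (see the detect_dotprod.cc and tune.cc links above).

```python
from typing import NamedTuple


class CpuTraits(NamedTuple):
    has_dotprod: bool  # sdot/udot instructions available
    in_order: bool     # e.g. Cortex-A53/A55 rather than A73/A76


# Hypothetical names for the four tuned fixed-point kernels.
KERNELS = {
    (True, False): "dotprod_out_of_order",   # e.g. CA76
    (True, True): "dotprod_in_order",        # e.g. CA55r1
    (False, False): "neon_out_of_order",     # e.g. CA73
    (False, True): "neon_in_order",          # e.g. CA53
}


def select_fixed_point_kernel(traits: CpuTraits) -> str:
    """Pick a kernel variant from the detected CPU properties."""
    return KERNELS[(traits.has_dotprod, traits.in_order)]


print(select_fixed_point_kernel(CpuTraits(has_dotprod=True, in_order=False)))
print(select_fixed_point_kernel(CpuTraits(has_dotprod=False, in_order=True)))
```

Keying on detected properties rather than on a list of core names is what lets this approach handle cores that did not exist when the code was written.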
44
45. Python API
• TensorFlow Lite Optimizing Converter (TOCO) —> tflite_convert, mainly python
wrapped C++ code
• Python Interpreter: https://www.tensorflow.org/lite/convert/
python_api#tensorflow_lite_python_interpreter_
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/
convert/python_api.md
• https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/lite
• I sent label_image.py (merged, https://github.com/tensorflow/tensorflow/tree/master/
tensorflow/lite/examples/python) and mobilenet_ssd. Tried others such as DeepLab V3
on RPI 3 B+.
• Good for quick tests, and you can use OpenCV for pre-processing and post-processing
45
46. C API
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/
experimental/c/c_api.h
• Started as a base for Unity, https://github.com/tensorflow/tensorflow/tree/
master/tensorflow/lite/experimental/examples/unity/TensorFlowLitePlugin
• FFI via C is much easier than C++
• Who uses it? Objective-C and Swift APIs
• my quick-and-dirty hacks for Pharo Smalltalk, https://github.com/
freedomtan/libtensorflow-pharo-bindings/blob/libtensorflowlite_c_hacks/
LibTensorFlow-Core/TensorFlowLiteCAPI.class.st
46
47. Yes, Smalltalk Is Alive
• Smalltalk is an object-
oriented, dynamically typed
reflective programming
language started in the 1970s
• Alan Kay, the creator of
Smalltalk, coined the term
Object Oriented Programming
(OOP).
• MVC, IDE, live programming http://pharo.org/web/files/teaser50.png
47
49. There are more new things
• For example, uP
• See https://github.com/tensorflow/tensorflow/tree/
master/tensorflow/lite/experimental
• TFLite Micro and uTensor
• https://os.mbed.com/blog/entry/uTensor-and-Tensor-
Flow-Announcement/
• Yes, RNN-based models, including LSTM, are not doing
well (yet)
49
50. Google I/O 2019 updates
• new MLIR-based TF —> TFLite converter
• improved CPU backend: ruy
• on-device training: not ready yet?
• control flow support
• see more at https://www.youtube.com/watch?
v=Jjm7MT6W0Dc
50
54. TF graphdef .pb -> TFLite flatbuffer .tflite
• Build TensorFlow MLIR related binaries
bazel build --config opt tensorflow/compiler/mlir/...
• Get your model, e.g.,
wget http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224.tgz
• Convert it
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_frozen.pb --tf-input-arrays=input -o /tmp/foo.tflite
• Yes, it works like a charm. But not for quantized models; neither
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_QUINT8 -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite
nor
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite --tf-inference-type=TF_QUINT8
works
54
55. Google Edge TPU
• Announced back in Google
Next 2018 (July, 2018)
• Available to general developers
right before TensorFlow Dev
Summit 2019 (Mar, 2019)
• USB: Coral Accelerator
• Dev Board: Coral Dev Board
• More are coming, e.g., PCI-E
Accelerator and SOM
• Supported framework: TFLite
https://coral.withgoogle.com/products/
55
56. Edge TPU Software
• Updates released on April 11th, 2019
• Compiler: removed the restriction for specific architectures
• New TensorFlow Lite C++ API
• Updated Python API, mainly for multiple Edge TPUs
• Updated Mendel OS and Mendel Management Tool (MDT)
• Environmental Sensor Board, https://coral.withgoogle.com/products/environmental/
• May updates, May 29th, 2019
• Offline compiler
• MDT update
https://developers.googleblog.com/2019/04/updates-from-coral-new-compiler-and.html
https://coral.withgoogle.com/news/updates-04-2019/
https://coral.withgoogle.com/news/updates-05-2019/
56
57. Edge TPU Software
• July updates, July 24th, 2019
• Updated Edge TPU Compiler and runtime: support for
models built using post-training quantization
• Updated Edge TPU Python library
• New on-device backpropagation API
• Updated weight imprinting API
• New TensorFlow Lite delegate for Edge TPU
https://coral.withgoogle.com/news/updates-07-2019/
57
58. Edge TPU’s canned model
• all ops that could be offloaded
are packed into one op
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
58
[Figure: Edge TPU-compiled model graphs. MobileNet V1: input (1×224×224×3), edgetpu-custom-op, Softmax (1×1001). SSD MobileNet V1: normalized_input_image_tensor (1×300×300×3), edgetpu-custom-op, TFLite_Detection_PostProcess with four outputs.]
59. EdgeTPU Delegate
• A dynamic delegate plugin interface was added recently.
Currently it’s only used by the Edge TPU delegate
https://coral.withgoogle.com/news/updates-07-2019/
60. There still are many trivial bugs in
TensorFlow
• There are many typos in comments of TensorFlow code
• Many things are not well-documented
• There are many many warnings when building TensorFlow from source
code
• a trivial fix in May, 2019 by me
60
https://github.com/tensorflow/tensorflow/pull/28618
61. Concluding Remarks
• Deep learning on devices is here to stay. You can see
some applications nowadays. More to come.
• TensorFlow, including Lite, is under active development.
Documentation is improving. Opportunities to contribute
are still there
61
70. Your phone personalizes the model locally, based on your usage (A).
Many users' updates are aggregated (B) to form a consensus change
(C) to the shared model, after which the procedure is repeated.
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
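A heavily simplified sketch of the aggregation step, (B) to (C): each device reports a weight delta, and the server applies the average of those deltas to the shared model. Real federated learning weights clients by example count and adds secure aggregation and other safeguards. The function name and the toy numbers are assumptions for illustration only.

```python
def apply_consensus_update(shared_model, client_deltas):
    """Average per-client weight deltas and apply them to the shared model."""
    n = len(client_deltas)
    consensus = [
        sum(delta[i] for delta in client_deltas) / n
        for i in range(len(shared_model))
    ]
    return [w + c for w, c in zip(shared_model, consensus)]


shared = [0.0, 1.0, -1.0]
deltas = [
    [0.3, -0.1, 0.0],  # device 1
    [0.1, 0.1, 0.2],   # device 2
    [0.2, 0.0, 0.1],   # device 3
]
shared = apply_consensus_update(shared, deltas)
print(shared)
```

Only the deltas leave the devices in this picture; the raw usage data that produced them stays local.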
70
72. Gboard: Chinese input methods
seem to be HMM-based
• As the name suggests, it could be HMM (Hidden
Markov Model) and n-gram based
• Do HMM and n-gram work with federated learning?
72
73. • All-neural on-device
Recognizer [1]
• Live caption [2], announced in
Google I/O 2019
• [1] https://ai.googleblog.com/
2019/03/an-all-neural-on-
device-speech.html
• [2] https://www.youtube.com/
watch?v=hPv1PkjJ-J0
73
74. label_image for TFLite
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.md
• Run a TF Lite single input, single output classifier model, e.g., MobileNet V1, so that we can verify whether the classifier
works
• What does it do
• read an image: unlike TF, there is no image decoder in TF Lite, so I wrote a simple .bmp decoder
• resize the input image to a specific size, e.g., 224x224 or 299x299
• convert the image tensor to floating point if necessary
• load the classifier
• prepare tensors
• run the model
• process the output
• top-k labels
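The last step, picking the top-k labels from the output tensor, can be sketched in plain Python. The scores and label strings below are made-up stand-ins for a real model output and labels file, and `top_k` is a hypothetical helper, not the function used in label_image.

```python
import heapq


def top_k(scores, labels, k=5, threshold=0.0):
    """Return up to k (score, label) pairs with score >= threshold, best first."""
    candidates = [(s, i) for i, s in enumerate(scores) if s >= threshold]
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [(s, labels[i]) for s, i in best]


# For a quantized (uint8) model the raw outputs would first be rescaled
# to floats, e.g. value / 255.0.
scores = [0.02, 0.71, 0.05, 0.20, 0.02]
labels = ["background", "tabby", "tiger cat", "Egyptian cat", "lynx"]
for score, label in top_k(scores, labels, k=3, threshold=0.03):
    print(f"{score:.2f}: {label}")
```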
74
75. Speed of Quantized Models
• It seems it's much better than naive quantization as we saw before (in TensorFlow before TFLite)
• On Nexus 9 (MobileNet 1.0/224)
• Quantized
• ./label_image -t 2: ~ 160 ms
• ./label_image -t 2 -c 100: ~ 60 ms
• Floating point
• ./label_image -t 2 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms
• ./label_image -t 2 -c 100 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms
• Pixel 2 Quantized
• CPU
• single thread: as is: ~ 90 ms, controlled env: ~ 70 ms
• 4 threads: ~ 30 ms
• HVX: ~ 12 ms
75
76. Fake Quantization in Early
Dec, 2017
• How hard can it be? How much time is needed?
• Several pre-tested models are available
• https://github.com/tensorflow/tensorflow/blob/master/
tensorflow/contrib/lite/g3doc/models.md
• but only one of them (https://storage.googleapis.com/
download.tensorflow.org/models/tflite/
mobilenet_v1_224_android_quant_2017_11_08.zip) is quantized
one
• as we can guess from related docs, retrain is kinda required to
get accuracy back
76
77. Fake Quantization in early
Nov, 2018
• Documents
• a paper on arXiv: https://arxiv.org/abs/1712.05877
• white paper: https://arxiv.org/abs/1806.08342
• Code, e.g.,
• TF fake quant
• SLIM (https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py#L519-
L521), object-detection (e.g., https://github.com/tensorflow/models/blob/master/research/
object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config#L196-L201), etc.
• Models: many quantized models
• classifiers: all MobileNet V1, some MobileNet V2 and others (https://www.tensorflow.org/lite/models)
• others, e.g.,
• Object-detection: e.g., MobileNet-SSD
• Semantic segmentation: DeepLab V3
77
Note that the biases are not quantized because they are
represented as 32-bit integers in the inference process, with
a much higher range and precision compared to the 8 bit
weights and activations. Furthermore, quantization parameters
used for biases are inferred from the quantization parameters
of the weights and activations. See section 2.4.
Typical TensorFlow code illustrating use of [19] follows:
from tf.contrib.quantize \
    import quantize_graph as qg

g = tf.Graph()
with g.as_default():
    output = ...
    total_loss = ...
    optimizer = ...
    train_tensor = ...
if is_training:
    quantized_graph = \
        qg.create_training_graph(g)
else:
    quantized_graph = \
        qg.create_eval_graph(g)
# Train or evaluate quantized_graph.
3.2. Batch normalization folding
For models that use batch normalization (see [17]), there
is additional complexity: the training graph contains batch
normalization as a separate block of operations, whereas
the inference graph has batch normalization parameters
“folded” into the convolutional or fully connected layer’s
how to use fake quant
[Figure 1.1 panels: (a) Integer-arithmetic-only inference; (b) Training with simulated quantization; (c) ImageNet latency-vs-accuracy tradeoff, Latency (ms) vs. Top-1 accuracy, Float vs. 8-bit]
Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output
are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator.
The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training
with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic.
Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the
effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel
a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from
the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares
integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/qu
README.md
[2] https://arxiv.org/abs/1712.05877
[3] https://arxiv.org/abs/1806.08342
79
81. Real computation
• BLAS part: Eigen (http://eigen.tuxfamily.org/) and gemmlowp
(https://github.com/google/gemmlowp)
• Some Caveats
• convolutions are multithreaded
• uint8/gemm: 1
• float32/Eigen: 4
• depthwise convolutions are single threaded
• problems: big.LITTLE, number of cores, scheduling
81
82. knowing more to squeeze
performance
• Memory management: to get reasonably good performance when running highly parallel
workloads on mobile devices, you need a good enough mechanism
• Profiling: there is a simple profiling mechanism in TF Lite since Apr, 2018
• time profiling only now. how about memory stuff?
• static buffer size: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/
contrib/lite/profiling/profiler.h#L80
• https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/profiling
• Computation of quantized uint8
• when you operate on tensors, the scale and zero point may
change. How can that be done efficiently?
• Post-training quantization: https://www.tensorflow.org/lite/performance/
post_training_quantization
82
83. Quick Intro to Caffe 2
• Caffe 2
• 2nd generation of Caffe, which was the most popular deep learning framework
(before TensorFlow) from Berkeley
• merged into PyTorch
• What's the difference? Caffe2 improves Caffe 1.0 in a series of directions:
• first-class support for large-scale distributed training
• mobile deployment
• new hardware support (in addition to CPU and CUDA)
• flexibility for future directions such as quantized computation
• stress tested by the vast scale of Facebook applications
83
https://caffe2.ai/docs/caffe-migration.html
84. Caffe2 backends for
Android I know
• ARM CPU:
• NNPACK, Eigen: quite mature
• QNNPACK: looks good, (https://code.fb.com/ml-applications/qnnpack/)
• OpenGL ES:
• OpenGL: not actively maintained (?)
• ARM Compute Library (GL ES part): stalled? 18.01
• NEON, and OpenCL
• NNAPI: stalled? NNAPI 1.0 (Oreo 8.1 API 27), not fully integrated yet
• ios: iOS MPS backend
84
85. More open source frameworks
• Yes, there are other frameworks, e.g.,
• MACE from XiaoMi: https://github.com/XiaoMi/mace,
• ncnn from Tencent: https://github.com/Tencent/ncnn,
• ONNX runtime from Microsoft, https://github.com/
microsoft/onnxruntime,
• TVM stack, https://tvm.ai
• So far, the TF/TFLite ecosystem is the largest one
85
86. Beyond Open Source
• Apple CoreML
• https://developer.apple.com/
documentation/coreml
• Google ML Kit
• https://developers.google.com/ml-kit/
• image labeling, OCR, face detection, bar
code scanning, landmark detection, etc.
• Custom models in TF Lite
• Qualcomm Snapdragon Neural Processing
Engine (SNPE)
• https://developer.qualcomm.com/software/
snapdragon-neural-processing-engine-ai
• Huawei HiAi DDK
86
88. Figure 7.13 Example of systolic array in action, from top to bottom on the page. In this example, the six weights
are already inside the multiply-accumulate units, as is the norm for the TPU. The three inputs are staggered in time to
get the desired effect, and in this example are shown coming in from the top. (In the TPU, the data actually comes in
from the left.) The array passes the data down to the next element and the result of the computation to the right to the
next element. At the end of the process, the sum of products is found to the right. Drawings courtesy of Yaz Sato.
It seems Edge TPU is not TPU-like?
Figure 7.14 Systolic data flow of the Matrix Multiply Unit.
https://www.elsevier.com/books-and-journals/book-companion/9780128119051
88
91. Imprinting Engine
[Figure: model graph. input (1×224×224×3) → edgetpu-custom-op → L2Normalization → Conv2D (weights 5×1×1×1024, bias 5) → Reshape → Softmax → Output (1×5)]
• Yes, let’s check what it is
• The Imprinting Engine implements a low-shot learning technique
called ‘Imprinted Weights’ [1][2]
• Can be used to retrain classifiers on-device (either on USB
Accelerator or Dev Board), no back-propagation gradient involved.
• Why?
• Transfer-learning happens on-device, at near-realtime speed.
• You don't need to recompile the model.
• Limitations
• Training data size is limited to a max of 200 images per class.
• It is most suitable for datasets that have small intra-class
variation.
• The last fully-connected layer runs on the CPU, not the Edge
TPU. So it will be slightly less efficient than running a
pre-compiled model on the Edge TPU.
• if you are interested in it, check the paper and
aiy::learn::imprinting::ImprintingEngine::Train(unsigned char const*, int, int)
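The core idea of the imprinted-weights paper [2] can be sketched without any framework: L2-normalize the embedding of each training image, average them per class, normalize again, and use the result as that class's weight vector in the final layer. This is an illustrative toy with 3-dimensional embeddings and hypothetical function names, not the Edge TPU ImprintingEngine implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def imprint_class_weight(embeddings):
    """Average the normalized embeddings of one class and renormalize."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def classify(embedding, class_weights):
    """Pick the class whose weight vector has the highest cosine similarity."""
    e = l2_normalize(embedding)
    scores = {
        name: sum(a * b for a, b in zip(e, w))
        for name, w in class_weights.items()
    }
    return max(scores, key=scores.get)


weights = {
    "cat": imprint_class_weight([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]),
    "dog": imprint_class_weight([[0.0, 1.0, 0.2], [0.1, 0.8, 0.1]]),
}
print(classify([0.8, 0.3, 0.0], weights))
```

No gradients are computed anywhere, which is why adding a class this way is fast enough to run on-device.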
91
[1] https://coral.withgoogle.com/docs/edgetpu/retrain-classification-ondevice/
[2] https://arxiv.org/abs/1712.07136
[Figure: model graph. input (1×224×224×3) → edgetpu-custom-op → AvgPool (1×1×1×1024)]
92. EfficientNet
• EfficientNet-B0:
• vs. MobileNet V1: far fewer FLOPS;
much higher accuracy
• vs. MobileNet V2: slightly more FLOPS;
much higher accuracy
http://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html
92
95. Depthwise Separable Convolution
• CNNs with depthwise separable convolution such as Mobilenet [1]
changed almost everything
• Depthwise separable convolution “factorizes” a standard convolution
into a depthwise convolution and a 1 × 1 convolution called a
pointwise convolution. Thus it greatly reduces computational
complexity.
• Depthwise separable convolution is not that new [2], but pure
depthwise separable convolution-based networks such as Xception
and MobileNet demonstrated its power
[1] https://arxiv.org/abs/1704.04861
[2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
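The reduction in computation can be made concrete with the cost model from the MobileNet paper [1]: a standard convolution costs D_K·D_K·M·N·D_F·D_F multiply-adds, while the depthwise plus pointwise pair costs D_K·D_K·M·D_F·D_F + M·N·D_F·D_F. The layer sizes below are arbitrary example values, and the function names are only for illustration.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, feature):
    """Multiply-adds of a standard kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch * feature * feature


def depthwise_separable_mult_adds(kernel, in_ch, out_ch, feature):
    """Multiply-adds of a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * in_ch * feature * feature
    pointwise = in_ch * out_ch * feature * feature
    return depthwise + pointwise


# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
k, m, n, f = 3, 512, 512, 14
std = standard_conv_mult_adds(k, m, n, f)
sep = depthwise_separable_mult_adds(k, m, n, f)
print(std, sep, f"ratio = {sep / std:.4f}")  # the ratio equals 1/n + 1/k**2
```

For 3×3 kernels the ratio is dominated by the 1/9 term, so the separable form needs roughly 8 to 9 times fewer multiply-adds.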
95