Status Quo of TensorFlow Lite on Edge Devices, COSCUP 2019

Status Quo of
TensorFlow Lite on Edge
Devices
Koan-Sin Tan

freedom@computer.org

Aug 17th, 2019

COSCUP, Taipei, Taiwan
1

• disclaimer: Opinions Are My Own

• feel free to interrupt me if you have any questions
2

Who I Am
• Used open source before the term “open
source” was coined

• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD

• Used to be a programming language junkie

• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components

• Recently, working on NN performance on edge
devices

• Contributed from time to time to TensorFlow
Lite

• started a command line label_image for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3

Outline
• overview: or, say, why TFLite

• new features

• delegates: including the new NNAPI delegate, GPU delegate,
and flex delegate,

• optimized kernels for ARM CPUs,

• various APIs: including Python, C, Objective-C, and Swift
ones, and

• misc, e.g., graph writer and Edge TPU.
4

Why TFLite?
• TensorFlow Lite

• TensorFlow is the most popular machine learning framework

• TFLite: a lightweight runtime for edge devices

• could be accelerated by GPU, DSP, or ASIC accelerators

• PyTorch is catching up, but its acceleration support still lags far
behind TFLite

• Yes, there are other open source NN frameworks. None is as
comprehensive as TF Lite, as far as I can tell
5

https://www.youtube.com/watch?v=Jjm7MT6W0Dc
Comprehensive?
6

Why NN on edge device,
esp. cell phones?
• Offline usage

• Latency

• Bandwidth

• Privacy

• Sensors
7

Offline usage
• we have heard words such as “always-on” and “always-connected”
since the 3G days 🤔, but wireless communication is still unreliable
8

Latency
• “There is an old network saying: Bandwidth problems
can be cured with money. Latency problems are harder
because the speed of light is fixed -- you can't bribe
God.” -- David D. Clark, MIT
9
https://en.wikipedia.org/wiki/David_D._Clark

Bandwidth
• Well, the bandwidth of a wireless network is not an easy problem
either

• consider an NN-based “portrait mode” (or, say, the
bokeh effect) on an iPhone Xs Max (12 + 12 MP)

• if we send a raw image pair: (12+12)*10^6*(3*8) = 576 Mbit per frame

• at 30 frames per second: 576 M * 30 ~= 17.3 Gbit/s

• you know this is not feasible for now
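
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 30 frames-per-second rate is an assumption about the video stream, and the variable names are made up for illustration.

```python
# Back-of-the-envelope bandwidth estimate for streaming raw dual-camera frames.
MEGAPIXELS_PER_CAMERA = 12   # from the slide: 12 + 12 MP
NUM_CAMERAS = 2
BITS_PER_PIXEL = 3 * 8       # 3 colour channels, 8 bits each
FRAMES_PER_SECOND = 30       # assumption: a 30 fps stream

bits_per_frame = MEGAPIXELS_PER_CAMERA * NUM_CAMERAS * 10**6 * BITS_PER_PIXEL
bits_per_second = bits_per_frame * FRAMES_PER_SECOND

print(f"{bits_per_frame / 1e6:.0f} Mbit per frame")
print(f"{bits_per_second / 1e9:.2f} Gbit/s")
```

Even with heavy compression, a sustained uplink in this range is far beyond what a cellular connection offers, which is the point of the slide.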
10

Privacy
• you know you need privacy for
both your physical body and
your mobile device(s)
11

NN-based ML is already in
cell phones
• Google I/O 2017: Mobile First -> AI First

• TensorFlow Lite, Android Neural Network API

• Lots of stuff from Google blogs and papers, e.g., Google Lens, federated learning in Gboard

• Pixel Visual Core in Pixel 2/3, 2/3 XL: although it seems there is no way for developers to
use it as a general NN accelerator

• Apple announced Core ML, a machine learning framework, at WWDC 2017 (June 2017)

• Apple’s machine learning journal (https://machinelearning.apple.com/): how Apple uses CNN
and other machine learning techniques in iPhone

• Neural Engine in A11/A12/A12X, available to developers via Core ML on A12
devices

• Computer Architecture: A Quantitative Approach, 6th Ed. (Nov, 2017) has a whole new chapter
on Domain Specific Architecture, actually NN accelerators.
12

actually there are many NNAPI-
enabled phones already
http://ai-benchmark.com/ranking_processors.html
mid June, 2019
13

Fiercely competitive market
14
http://ai-benchmark.com/ranking_processors.html
Aug 16th, 2019

https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/5
• AnandTech is one of my favorite tech sites. Usually, it provides
good analysis

• E.g., Apple’s CPUs

• cache sizes

• execution units

• various instruction latency

• Not good enough for NN accelerators on mobile phones

• floating-point VGG16, Inception V3, and ResNet34?

• come on, are you still in the Neolithic era?
Evolving fast: the slide I prepared in Nov, 2018
15

TF Lite in Android Pie
• There are ‘libtflite.so’s in /system/lib and /system/lib64

• https://source.android.com/devices/tech/display/textclassifier
16

Some TFLite clients
presented by TFLite guys
18

ML Kit
• https://developers.google.com/ml-kit/, part of Firebase

• Originally, only custom models
used TFLite

• Now, as far as I can tell, the vision
parts use TFLite too
https://developers.google.com/ml-kit/
19

• see appendix for Google Translate, Google Lens, Gboard,
and others
20

Some Progress That Makes NN
on Edge Devices Really Viable
• “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” [1]. A keynote at
ESWEEK 2017, “Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small
Deep-Neural-Network Architectures” [2]

• MobileNet V1 [3] and V2 [4]: depthwise separable convolution [5], and inverted residuals and linear
bottlenecks [4]

• AutoML, e.g.,

• NASNet Mobile [6] and MnasNet [7]

• MobileNet V3 [10] and EfficientNet [11]

• Quantization [8][9]

• How about pruning / compression stuff? As far as I know, not widely used yet

[1] https://arxiv.org/abs/1602.07360




[5] https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf


[7] https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html, https://arxiv.org/abs/1807.11626




21

• Michael Jordan published an
article on Medium named
“Artificial Intelligence — The
Revolution Hasn’t
Happened Yet” [1]

• Yes, but current deep learning
driven stuff should be enough
for the next few years

[1] https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
22

Why I Started Learning TF
Lite
• We heard about Android NN and TensorFlow Lite back at Google I/O
2017

• My COSCUP 2017 slide deck “TensorFlow on Android”

• https://www.slideshare.net/kstan2/tensorflow-on-android

• People knew a bit about Android NN API before it was
announced and released

• No information about TensorFlow Lite, at least to me,
before it was released in Nov, 2017
23

Quantization and
Accelerators
• Quantization

• Quantization is not new. People have known since pre-DNN days that
there is a lot of redundancy in NN models, and many
quantization and compression/pruning techniques have been
presented over the years. TFLite and its underlying gemmlowp
(and NNAPI) made the first production-quality system that
supports quantized unsigned int8.

• accelerators (thru NNAPI in the beginning, and directly later)

• The CPU is not always the best place to run NN models

• GPU, DSP, and other accelerators
24

TFLite and Android NN in
Google I/O 2017
• New TensorFlow runtime
• Optimized for mobile and
embedded apps

• Runs TensorFlow models on
device

• Leverage Android NN API

• Soon to be open sourced
from Google I/O 2017 video
25

Actual Android NN API
• Announced/published with Android 8.1
Preview 1

• Available to developers in the NDK

• yes, NDK

• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices

• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks

• The API is available on all devices running
Android 8.1 (API level 27) or higher.
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
26

Android NN on Pixel 2
• Only the CPU fallback was available on Oreo MR1

• Actually, you can already see Android NN API related code in AOSP after the Oreo MR1 (8.1) release

• user level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo-mr1-release

• HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1-release/neuralnetworks/

• There is NN API 1.1 on Android Pie

• https://developer.android.com/about/versions/pie/android-9.0#nnapi

• adding support for nine new ops — Pad, BatchToSpaceND, SpaceToBatchND, Transpose, Strided
Slice, Mean, Div, Sub, and Squeeze

• In the Android P DP1/2 (https://developer.android.com/preview/download.html), there was an HVX
NN API 1.0 (yes, 1.0) driver. Gone after DP2. Not in the recent Pie release. (See
https://android.googlesource.com/platform/hardware/qcom/neuralnetworks/hvxservice/ for source code)

• NN API 1.2, which supports 90+ ops, is in AOSP and will be in the forthcoming Android Q (version 10)
27

So NNAPI accelerators
don’t work?
• Yes, I don’t know what happened to earlier Pixel phones

• I don’t have Pixel 3 to try

• Q beta 4 for Pixel 3a comes with a working HVX
accelerator driver. It’s an NNAPI 1.1 one
though.

• And remember what I showed in pp. 13 and 14, there are
many NNAPI-enabled phones already
28

Original TFLite APIs
• Java API: A convenience
wrapper around the C++ API
on Android

• C++ API: loads the
TensorFlow Lite model file and
invokes the Interpreter. The
same library is available on
both Android and iOS
https://www.tensorflow.org/mobile/tflite/
29

Other bindings
• Python and C APIs

• Python: introduced in TF 1.8.0, built into pip package in 1.9.0

• my label_image.py for tflite merged on Aug 9, 2018

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/examples/python/label_image.py

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/python

• C API: introduced for Unity

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/experimental/c
30

How to Use it
31
• TFLite guys work hard
• documentation has been getting better and better
over < 2 yrs
• yes, sometimes you still have to “use the
source”
https://www.tensorflow.org/lite

TFLite Converter
https://www.tensorflow.org/lite/images/convert/workflow.svg
32

Basic Usage
• model: .tflite model

• resolver: if no custom ops, builtin op
resolver is enough

• interpreter: we need it to compute
the graph

• interpreter->AllocateTensors():
allocate memory for tensors, e.g., input
tensor(s)

• fill the input

• interpreter->Invoke(): run the graph

• process the output
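#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Load the .tflite flatbuffer from disk.
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);
// The builtin resolver is enough when the model has no custom ops.
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
// Resize input tensors, if desired.
interpreter->AllocateTensors();
float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.
interpreter->Invoke();
float* output = interpreter->typed_output_tensor<float>(0);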
33

more source code
• Check my COSCUP 2018 slide deck, which was for a talk
in a source code reading track, for more details

• https://www.slideshare.net/kstan2/open-source-nn-frameworks-on-cellphones

• And I’ll have a more code-oriented talk on TFLite
delegates tomorrow
34

Interpreter
35
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/core/subgraph.cc#L734-L797
• A TFLite compute graph is
a directed acyclic graph
(DAG), so the interpreter traverses
the topologically sorted graph node by
node
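
A minimal sketch of what "traverse the sorted graph node by node" means, in plain Python rather than the real C++ code in subgraph.cc. The toy graph, the kernel lambdas, and the function names (`execution_order`, `invoke`) are all hypothetical; the real interpreter does considerably more (delegates, tensor allocation, error handling).

```python
from collections import deque


def execution_order(deps):
    """Return nodes so that every node comes after its inputs (Kahn's algorithm).

    `deps` maps each node name to the list of nodes whose outputs it consumes.
    """
    indegree = {node: len(inputs) for node, inputs in deps.items()}
    consumers = {node: [] for node in deps}
    for node, inputs in deps.items():
        for src in inputs:
            consumers[src].append(node)

    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in consumers[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


def invoke(deps, kernels, graph_input):
    """Run each node once, in sorted order, feeding it its inputs' results."""
    values = {}
    for node in execution_order(deps):
        # Nodes without producers read the graph input directly.
        inputs = [values[src] for src in deps[node]] or [graph_input]
        values[node] = kernels[node](*inputs)
    return values


# Hypothetical diamond-shaped graph: a feeds b and c, which both feed d.
toy_graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
toy_kernels = {
    "a": lambda x: x + 1,
    "b": lambda a: a * 2,
    "c": lambda a: a * 3,
    "d": lambda b, c: b + c,
}

order = execution_order(toy_graph)
values = invoke(toy_graph, toy_kernels, 1)
print(order, values["d"])
```

Because the order is fixed before execution, running the graph is a simple loop with no scheduling decisions at inference time.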

Figure: MobileNet V1 after NNAPI delegation. The input (1×224×224×3) feeds a single TfLiteNnapiDelegate node that holds all weights and biases; the output Reshape_1 is 1×1001.
NNAPI Delegate
• Previously, when a graph was
delegated to NNAPI, it was kinda
invisible to TFLite

• With the recent NNAPI delegate
rewrite, it’s an op in TFLite
now

• subgraph

• all-or-nothing -> per op
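
The difference between all-or-nothing and per-op delegation can be sketched as follows. This is a simplification under stated assumptions: the graph is treated as a linear list of ops, the function names are hypothetical, and the "driver without Squeeze" is just an illustrative case (the real delegate partitions the graph by data dependencies, not by position in a list).

```python
def all_or_nothing(ops, supported):
    """Old behaviour: delegate the whole graph only if every op is supported."""
    return [(all(op in supported for op in ops), list(ops))]


def per_op(ops, supported):
    """New behaviour: group consecutive ops into delegated / CPU runs."""
    runs = []
    for op in ops:
        delegated = op in supported
        if runs and runs[-1][0] == delegated:
            runs[-1][1].append(op)      # extend the current run
        else:
            runs.append((delegated, [op]))  # start a new run
    return runs


# Tail of a MobileNet-like classifier.
ops = ["Conv2D", "DepthwiseConv2D", "Conv2D", "AveragePool2D",
       "Conv2D", "Squeeze", "Softmax"]
# Assumption for illustration: a driver that handles everything but Squeeze.
supported = set(ops) - {"Squeeze"}

print(all_or_nothing(ops, supported))
print(per_op(ops, supported))
```

With all-or-nothing, one unsupported op sends the whole model back to the CPU; with per-op partitioning, each supported run becomes one delegate node in the TFLite graph.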
Figure: the original MobileNet V1 graph. The input (1×224×224×3) goes through a Conv2D (weights 32×3×3×3), thirteen DepthwiseConv2D + 1×1 Conv2D pairs, AveragePool2D, a final Conv2D (weights 1001×1×1×1024), Squeeze, and Softmax; the output Reshape_1 is 1×1001.
36
http://localhost:8080/, http://localhost:8090/

More Delegates
• Flex Delegate

• Ops supported by TFLite are relatively limited; TensorFlow Lite models can now use a
subset of TensorFlow ops when TFLite builtin ops are not sufficient

• GPU backend: no, not NNAPI

• OpenGL ES 3.1 Compute Shaders on Android devices

• Metal Compute Shaders on iOS devices

• “in general the new GPU backend performs 2–7x faster than the floating point CPU
implementation for a wide range of diverse deep neural network models.”

https://www.tensorflow.org/lite/using_select_tf_ops

https://medium.com/tensorflow/tensorflow-lite-now-faster-with-mobile-gpus-developer-preview-e15797e6dee7

https://www.tensorflow.org/lite/performance/gpu

https://www.tensorflow.org/lite/performance/gpu_advanced
37

Why a non-NNAPI delegate?
https://developer.android.com/about/dashboards
NNAPI-enabled devices ~7.5% around the end of Oct, 2018
38

NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards
39

40
GL ES compute shader capable devices ~ 50%
https://developer.android.com/about/dashboards

GPU Delegate Performance
• my quick and dirty benchmarks

• Android: https://github.com/freedomtan/glDelegateBench

• iOS: https://github.com/freedomtan/glDelegateBenchmark/
• at first, the GPU delegate was a binary-only release (aar for Android; pod for iOS)
• after the release of the GPU delegate source code, benchmark_model and
label_image are able to use the GPU delegate
41

GPU delegate kernels
• Recently, TFLite GPU delegate guys
published a paper on how they
designed it. It covers some interesting details

• GPU backends require initialization
involving shader compilation and
optimization by the driver before inference

• PHWC4: P stands for plane

• Reshape is expensive on GPU

• RGBA is better than RGB on GPU

• a tensor of shape [B,H,W,5], for instance,
is twice as expensive as [B, H, W, 4], but
about the same as [B, H, W, 8], so the
architect can tune around those 4-channel
boundaries rather than trying to optimize
on other boundaries.

• https://arxiv.org/pdf/1907.01989.pdf
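
The 4-channel boundary effect follows from packing channels into groups of four (the "C4" in PHWC4). A rough sketch, assuming cost is simply proportional to the number of 4-channel slices; the function names are hypothetical and real GPU cost depends on much more than this.

```python
import math

SLICE_WIDTH = 4  # channels per slice, as in RGBA texels


def channel_slices(channels):
    """Number of 4-channel slices needed to hold `channels` channels."""
    return math.ceil(channels / SLICE_WIDTH)


def padded_channels(channels):
    """Channel count after padding up to the next multiple of four."""
    return channel_slices(channels) * SLICE_WIDTH


for c in (3, 4, 5, 8):
    print(c, "channels ->", channel_slices(c), "slice(s), padded to", padded_channels(c))
```

Under this model, 5 channels cost two slices, the same as 8 and twice as much as 4, and RGB (3 channels) costs the same as RGBA.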

Faster ARM CPU kernels
• It’s available now. Enabled by default for Android ARM64 since
early June

• https://github.com/tensorflow/tensorflow/commit/8924e67e034909bea0343631b9f9024c5a6da5c4

• ruy:

• four tuned fixed-point kernels: big/LITTLE (out-of-order/
in-order), w/ or w/o dot-product instructions

• two tuned floating-point kernels
43

More on ruy
• matrix multiplication in AArch64 NEON
• sdot based kernels for either out-of-order CPUs, e.g., CA76, or in-order CPUs, e.g., CA55r1
• non sdot based kernels for either out-of-order CPUs, e.g., CA73, or in-order CPUs, e.g., CA53
• how the kernel is chosen: detection at run time instead of a hard-coded list (e.g., PyTorch cpuinfo)
• sdot or not: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/ruy/detect_dotprod.cc#L129-L157
• in-order or out-of-order: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/ruy/tune.cc, esp., https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/ruy/tune.cc#L102-L124
• doesn't need to list all possibilities, so it probably can handle future cores. Still cannot deal with
big.LITTLE cores
• thread pool: it seems to scale better than the one currently in use, so multi-threaded floating-point
numbers are much better

• before ruy, floating point: Eigen thread pool; fixed-point: TFLite’s thread pool
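
The four fixed-point kernel variants come from two independent run-time answers: does the CPU have dot-product instructions, and is the core in-order or out-of-order. A toy sketch of that selection; `Tuning` and `pick_fixed_point_kernel` are hypothetical names, not ruy's API, and the two inputs are assumed to come from run-time probes like the ones linked above.

```python
from enum import Enum


class Tuning(Enum):
    IN_ORDER = "in-order"
    OUT_OF_ORDER = "out-of-order"


def pick_fixed_point_kernel(has_dotprod, tuning):
    """Pick one of the four kernel variants from two run-time properties."""
    isa = "sdot" if has_dotprod else "no-sdot"
    return f"{isa}/{tuning.value}"


# The example cores from the slide, with the properties the slide attributes to them.
examples = {
    "CA76": (True, Tuning.OUT_OF_ORDER),
    "CA55r1": (True, Tuning.IN_ORDER),
    "CA73": (False, Tuning.OUT_OF_ORDER),
    "CA53": (False, Tuning.IN_ORDER),
}

for core, (dotprod, tuning) in examples.items():
    print(core, "->", pick_fixed_point_kernel(dotprod, tuning))
```

The table of core names is only there to exercise the function; the point of run-time detection is that no such table is needed.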
44

Python API
• TensorFlow Lite Optimizing Converter (TOCO) -> tflite_convert, mainly Python-wrapped
C++ code

• Python Interpreter: https://www.tensorflow.org/lite/convert/python_api#tensorflow_lite_python_interpreter_

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/convert/python_api.md

• https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/lite

• I sent label_image.py (merged, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/python)
and mobilenet_ssd. Tried others such as DeepLab V3
on RPI 3 B+.

• Good for quick tests, and you can use OpenCV to do preprocessing and post-processing
45

C API
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/c/c_api.h

• Started as a base for Unity, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/examples/unity/TensorFlowLitePlugin

• FFI via C is much easier than C++

• Who uses it? Objective-C and Swift APIs

• my quick-and-dirty hacks for Pharo Smalltalk, https://github.com/freedomtan/libtensorflow-pharo-bindings/blob/libtensorflowlite_c_hacks/LibTensorFlow-Core/TensorFlowLiteCAPI.class.st
46

Yes, Smalltalk Is Alive
• Smalltalk is an object-oriented,
dynamically typed,
reflective programming
language that started in the 1970s

• Alan Kay, one of the creators of
Smalltalk, coined the term
Object Oriented Programming
(OOP).

• MVC, IDE, live programming
http://pharo.org/web/files/teaser50.png
47

Smalltalk using TFLite C
API
48

There are more new things
• For example, uP

• See https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental

• TFLite Micro and uTensor

• https://os.mbed.com/blog/entry/uTensor-and-Tensor-Flow-Announcement/

• Yes, RNN-based models, including LSTM, are not doing
well (yet)
49

Google I/O 2019 updates
• new MLIR-based TF -> TFLite converter

• improved CPU backend: ruy

• on-device training: not ready yet?

• control flow support

• see more at https://www.youtube.com/watch?v=Jjm7MT6W0Dc
50

why MLIR
51
https://medium.com/tensorflow/mlir-a-new-intermediate-representation-and-compiler-framework-beba999ed18d

MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure
52
MLIR for TFLite Converter

MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure
53

TF graphdef .pb -> TFLite flatbuffer .tflite
• Build TensorFlow MLIR related binaries
bazel build --config opt tensorflow/compiler/mlir/...
• Get your model, e.g.,
wget http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224.tgz
• Convert it
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_frozen.pb --tf-input-arrays=input -o /tmp/foo.tflite
• Yes, it works like a charm, but not for quantized models. Neither
types=DT_QUINT8 -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite
nor
types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite --tf-inference-type=TF_QUINT8
works
54

Google Edge TPU
• Announced back in Google
Next 2018 (July, 2018)

• Available to general developers
right before TensorFlow Dev
Summit 2019 (Mar, 2019)

• USB: Coral Accelerator

• Dev Board: Coral Dev Board

• More are coming, e.g., PCI-E
Accelerator and SOM

• Supported framework: TFLite
https://coral.withgoogle.com/products/
55

Edge TPU Software
• Updates released on April 11th, 2019

• Compiler: removed the restriction for specific architectures

• New TensorFlow Lite C++ API

• Updated Python API, mainly for multiple Edge TPUs

• Updated Mendel OS and Mendel Development Tool (MDT)

• Environmental Sensor Board, https://coral.withgoogle.com/products/environmental/

• May updates, May 29th, 2019

• Offline compiler

• MDT update

https://developers.googleblog.com/2019/04/updates-from-coral-new-compiler-and.html

https://coral.withgoogle.com/news/updates-04-2019/

56

Edge TPU Software
• July updates, July 24th, 2019

• Updated Edge TPU Compiler and runtime: support for
models built using post-training quantization

• Updated Edge TPU Python library

• New on-device backpropagation API

• Updated weight imprinting API

• New TensorFlow Lite delegate for Edge TPU

57

Edge TPU’s canned model
• all ops that could be offloaded
are packed into one op
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
58
Figure: MobileNet V1 compiled for the Edge TPU: input (1×224×224×3) -> edgetpu-custom-op -> Softmax -> 1×1001.
Figure: SSD MobileNet V1 compiled for the Edge TPU: normalized_input_image_tensor (1×300×300×3) -> edgetpu-custom-op -> TFLite_Detection_PostProcess, with outputs TFLite_Detection_PostProcess, TFLite_Detection_PostProcess:1, TFLite_Detection_PostProcess:2, and TFLite_Detection_PostProcess:3.

EdgeTPU Delegate
• A dynamic delegate plugin interface was added recently.
Currently it’s only used by the Edge TPU delegate

There still are many trivial bugs in
TensorFlow
• There are many typos in comments of TensorFlow code
• Many things are not well-documented
• There are many many warnings when building TensorFlow from source
code
• a trivial fix in May, 2019 by me
60
https://github.com/tensorflow/tensorflow/pull/28618

Concluding Remarks
• Deep learning on devices is here to stay. You can see
some applications nowadays. More to come.

• TensorFlow, including Lite, is under active development.
Documentation is improving. Opportunities to contribute
are still there
61

Transistor–Transistor Logic (TTL)
https://en.wikipedia.org/wiki/Transistor–transistor_logic
https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/68k_ttl.jpg/600px-68k_ttl.jpg
64

[1] https://www.slideshare.net/kstan2/tensorflow-on-android

[2] https://www.slideshare.net/kstan2/introduction-to-tensorflow-lite

[3] https://www.slideshare.net/kstan2/caffe2-on-android

[4] https://www.slideshare.net/kstan2/open-source-nn-frameworks-on-cellphones

[5] https://www.slideshare.net/kstan2/why-you-cannot-use-neural-engine-to-run-your-nn-models-on-a11-devices

[6] https://www.slideshare.net/kstan2/a-peek-into-googles-edge-tpu
65

https://www.amazon.com/Computational-Aspects-Principles-Computer-Science/dp/0914894951
Google Translate
66

Google Lens
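The Master said: “My young friends, why not study the Book of Poetry? Poetry can stimulate the mind, sharpen observation, teach sociability, and give voice to grievance. Close at hand, it teaches how to serve one's father; further away, how to serve one's ruler; and it makes us familiar with the names of many birds, beasts, plants, and trees.” (Analects 17.9)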
69

Your phone personalizes the model locally, based on your usage (A).
Many users' updates are aggregated (B) to form a consensus change
(C) to the shared model, after which the procedure is repeated.
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
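
The A/B/C loop in the caption can be sketched in a few lines. This is a toy illustration, not Gboard's implementation: weighting each client's update by its number of local examples is an assumption borrowed from the commonly described federated averaging scheme, and the function names are hypothetical.

```python
def federated_average(client_updates):
    """(B) Aggregate client updates into (C) one consensus change.

    `client_updates` is a list of (num_examples, weight_delta) pairs, where
    weight_delta is a list of floats computed locally on each phone (A).
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * delta[i] for n, delta in client_updates) / total_examples
        for i in range(dim)
    ]


def apply_update(shared_weights, consensus):
    """Apply the consensus change to the shared model."""
    return [w + d for w, d in zip(shared_weights, consensus)]


# Two hypothetical phones: one saw 1 example, the other saw 3.
updates = [(1, [1.0, 0.0]), (3, [0.0, 4.0])]
consensus = federated_average(updates)
new_shared = apply_update([1.0, 1.0], consensus)
print(consensus, new_shared)
```

Only the weight deltas leave the device in this scheme; the raw typing data stays on the phone.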
70

tflite in gboard
Data:
•nwp
next-word-predictor/
next-word-predictor/tflite-nwp-20180920
next-word-predictor/tflite-nwp-20180920/nwp.uint8.tflite
next-word-predictor/tflite-nwp-20180920/nwp.syms
next-word-predictor/pie-nwp-20180807
next-word-predictor/pie-nwp-20180807/nwp.syms
next-word-predictor/pie-nwp-20180807/nwp.uint8.data
•Emoji
./emoji-predictor
./emoji-predictor/tflite-emoji-pred-a69f4f3dd1a865206f8a5f8cdcd9f6d6
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.scale.csv
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.emoji.syms
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.uint8.tflite
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.token.syms
• The next-word predictor
and emoji predictor
seem to be TFLite
based and use uint8
models
• However, the .tflite here is
not a real flatbuffer .tflite
• Seems to be from this
paper [1]
71

Gboard: Chinese input methods
seem to be HMM-based
• As the name suggests, it could be HMM (Hidden
Markov Model) and n-gram based
• Do HMM and n-gram models work with federated learning?
72

• All-neural on-device
Recognizer [1]

• Live caption [2], announced in
Google I/O 2019

• [1] https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html

• [2] https://www.youtube.com/watch?v=hPv1PkjJ-J0
73

label_image for TFLite
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/

• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.md

• Run a TF Lite single input, single output classifier model, e.g., MobileNet V1, so that we can verify whether the classifier
works

• What does it do

• read an image: unlike TF, there is no image decoder in TF Lite, so I wrote a simple .bmp decoder

• resize the input image to a specific size, e.g., 224x224 or 299x299

• convert the image tensor to floating point if necessary

• load the classifier

• prepare tensors

• run the model

• process the output

• top-k labels
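
The last step, turning the output tensor into top-k labels, is easy to sketch. This is a plain-Python illustration with made-up scores and labels, not the C++ code in label_image; `top_k` is a hypothetical helper name.

```python
def top_k(scores, labels, k=5):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in ranked[:k]]


# Hypothetical classifier output over four classes.
labels = ["background", "tabby", "tiger cat", "lynx"]
scores = [0.05, 0.7, 0.2, 0.05]

best = top_k(scores, labels, k=2)
for label, score in best:
    print(f"{score:.2f}: {label}")
```

For a real 1001-class MobileNet output, the same function would be applied to the output tensor after `Invoke()`, with labels read from the model's label file.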
74

Speed of Quantized Models
• It seems it's much better than naive quantization as we saw before (in TensorFlow before TFLite)

• On Nexus 9 (MobileNet 1.0/224)

• Quantized

• ./label_image -t 2: ~ 160 ms

• ./label_image -t 2 -c 100: ~ 60 ms

• Floating point

• ./label_image -t 2 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms

• ./label_image -t 2 -c 100 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms

• Pixel 2 Quantized

• CPU

• single thread: as is: ~ 90 ms, controlled env: ~ 70 ms

• 4 threads: ~ 30 ms

• HVX: ~ 12 ms
75

Fake Quantization in Early
Dec, 2017
• How hard can it be? How much time is needed?

• Several pre-tested models are available

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md

• but only one of them (https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip)
is a quantized one

• as we can guess from related docs, retraining is kinda required to
get accuracy back
76

Fake Quantization in early
Nov, 2018
• Documents

• a paper on arXiv: https://arxiv.org/abs/1712.05877

• white paper: https://arxiv.org/abs/1806.08342

• Code, e.g.,

• TF fake quant

• SLIM (https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py#L519-L521), object-detection (e.g., https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config#L196-L201), etc.

• Models: many quantized models

• classifiers: all MobileNet V1, some MobileNet V2 and others (https://www.tensorflow.org/lite/models)

• others, e.g.,

• Object-detection: e.g., MobileNet-SSD

• Semantic segmentation: DeepLab V3
77

TfLiteQuantizationParams
typedef struct {
  float scale;
  int32_t zero_point;
} TfLiteQuantizationParams;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L165-L171
r = S(q − Z)
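
In r = S(q − Z), r is the real value, q the stored uint8 value, S the `scale`, and Z the `zero_point`. A small sketch of both directions; the function names and the example parameters (S = 0.5, Z = 128) are made up for illustration, and Python's built-in `round` is used, which may differ from TFLite's rounding at exact .5 ties.

```python
def dequantize(q, scale, zero_point):
    """r = S * (q - Z): map a stored integer back to a real value."""
    return scale * (q - zero_point)


def quantize(r, scale, zero_point, qmin=0, qmax=255):
    """Inverse mapping: q = round(r / S) + Z, clamped to the uint8 range."""
    q = round(r / scale) + zero_point
    return max(qmin, min(qmax, q))


SCALE, ZERO_POINT = 0.5, 128  # hypothetical quantization parameters

print(quantize(1.0, SCALE, ZERO_POINT))      # stored integer for r = 1.0
print(dequantize(130, SCALE, ZERO_POINT))    # real value for q = 130
print(quantize(1000.0, SCALE, ZERO_POINT))   # out-of-range values saturate
```

Note that the real value 0.0 maps exactly to q = Z, which is what makes zero padding exact in quantized convolutions.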
78

Note that the biases are not quantized because they are
represented as 32-bit integers in the inference process, with
a much higher range and precision compared to the 8 bit
weights and activations. Furthermore, quantization parameters
used for biases are inferred from the quantization parameters
of the weights and activations. See section 2.4.
Typical TensorFlow code illustrating use of [19] follows:
from tf.contrib.quantize import quantize_graph as qg

g = tf.Graph()
with g.as_default():
    output = ...
    total_loss = ...
    optimizer = ...
    train_tensor = ...
if is_training:
    quantized_graph = qg.create_training_graph(g)
else:
    quantized_graph = qg.create_eval_graph(g)
# Train or evaluate quantized_graph.
3.2. Batch normalization folding
For models that use batch normalization (see [17]), there
is additional complexity: the training graph contains batch
normalization as a separate block of operations, whereas
the inference graph has batch normalization parameters
“folded” into the convolutional or fully connected layer’s
how to use fake quant
Figure 1.1 panels: (a) Integer-arithmetic-only inference: uint8 input and uint8 weights feed conv, uint32 biases are added, and ReLU6 produces uint8 output. (b) Training with simulated quantization: “wt quant” and “act quant” nodes are placed around conv, bias addition, and ReLU6. (c) ImageNet latency-vs-accuracy tradeoff: Top-1 accuracy vs. latency (ms), Float vs. 8-bit.
Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output
are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator.
The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training
with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic.
Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the
effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel
a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from
the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares
integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.
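
The "wt quant" and "act quant" nodes in panel (b) can be sketched as a quantize-then-dequantize step that stays in floating point. This is a simplified illustration: the function name `fake_quant` and the example range are assumptions, and real implementations also nudge the range so that 0.0 is exactly representable and handle gradients during training.

```python
def fake_quant(x, rmin, rmax, num_bits=8):
    """Simulate quantization in float: clamp, snap to a grid, return a float."""
    levels = 2 ** num_bits - 1              # 255 steps for 8 bits
    scale = (rmax - rmin) / levels          # width of one quantization step
    clamped = min(max(x, rmin), rmax)       # values outside the range saturate
    q = round((clamped - rmin) / scale)     # index of the nearest grid point
    return rmin + q * scale                 # back to a float on the grid


# Hypothetical range [0, 255] so that the grid step is exactly 1.0.
for x in (3.3, 300.0, -5.0):
    print(x, "->", fake_quant(x, 0.0, 255.0))
```

Because the forward pass already sees the rounding and clamping, the trained weights end up tolerant of the real integer-only inference path.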
tions [14, 27, 34]. With these approaches, both multiplications and additions can be implemented by efficient bit-shift
and bit-count operations, which are showcased in custom
GPU kernels (BNN [14]). However, 1 bit quantization of-
Our work draws inspiration from [7], which leverages
low-precision fixed-point arithmetic to accelerate the training
speed of CNNs, and from [31], which uses 8-bit fixed-point
arithmetic to speed up inference on x86 CPUs. Our
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/qu
README.md
79

example of depthwise
convolution with fake quant
80

Real computation
• BLAS part: Eigen (http://eigen.tuxfamily.org/) and gemmlowp
(https://github.com/google/gemmlowp)

• Some Caveats

• convolutions are multithreaded

• uint8/gemm: 1

• float32/Eigen: 4

• depthwise convolutions are single threaded

• problems: big.LITTLE, number of cores, scheduling
81

knowing more to squeeze
performance
• Memory management: to get reasonably good performance when running highly parallel
workloads on mobile devices, you need a good enough mechanism

• Profiling: there has been a simple profiling mechanism in TF Lite since Apr, 2018

• time profiling only for now. How about memory stuff?

• static buffer size: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/profiling/profiler.h#L80

• https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/profiling

• Computation of quantized uint8

• when you want to do some operations on tensors, scale and zero point could be
changed. How to do it efficiently?

• Post-training quantization: https://www.tensorflow.org/lite/performance/post_training_quantization
82

Quick Intro to Caffe 2
• Caffe 2

• 2nd generation of Caffe, which was the most popular deep learning framework
(before TensorFlow) from Berkeley

• merged into PyTorch
• What's the difference? Caffe2 improves Caffe 1.0 in a series of directions:

• first-class support for large-scale distributed training

• mobile deployment
• new hardware support (in addition to CPU and CUDA)

• flexibility for future directions such as quantized computation

• stress tested by the vast scale of Facebook applications
83
https://caffe2.ai/docs/caffe-migration.html

Caffe2 backends for
Android I know
• ARM CPU:

• NNPACK, Eigen: quite mature

• QNNPACK: looks good (https://code.fb.com/ml-applications/qnnpack/)

• OpenGL ES:

• OpenGL: not actively maintained (?)

• ARM Compute Library (GL ES part): stalled? 18.01

• NEON, and OpenCL

• NNAPI: stalled? NNAPI 1.0 (Oreo 8.1 API 27), not fully integrated yet

• iOS: MPS backend
84

More open source frameworks
• Yes, there are other frameworks, e.g.,

• MACE from XiaoMi: https://github.com/XiaoMi/mace,

• ncnn from Tencent: https://github.com/Tencent/ncnn,

• ONNX Runtime from Microsoft: https://github.com/microsoft/onnxruntime,

• TVM stack, https://tvm.ai

• So far, the TF/TFLite ecosystem is the largest one
85

Beyond Open Source
• Apple CoreML

• https://developer.apple.com/documentation/coreml

• Google ML Kit

• https://developers.google.com/ml-kit/

• image labeling, OCR, face detection, bar
code scanning, landmark detection, etc.

• Custom models in TF Lite

• Qualcomm Snapdragon Neural Processing
Engine (SNPE)

• https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai

• Huawei HiAI DDK
86

https://aiyprojects.withgoogle.com/edge-tpu
https://www.anandtech.com/show/13393/techinsights-publishes-apple-a12-die-shot-our-take
Figure 7.38 Floor plan of the 8-core Pixel Visual Core chip. A53 is an ARMv7 core. LPDDR4 is a DRAM controller.
PCIE and MIPI are I/O buses.
87

Figure 7.13 Example of systolic array in action, from top to bottom on the page. In this example, the six weights
are already inside the multiply-accumulate units, as is the norm for the TPU. The three inputs are staggered in time to
get the desired effect, and in this example are shown coming in from the top. (In the TPU, the data actually comes in
from the left.) The array passes the data down to the next element and the result of the computation to the right to the
next element. At the end of the process, the sum of products is found to the right. Drawings courtesy of Yaz Sato.
It seems Edge TPU is not TPU-like?
Figure 7.14 Systolic data flow of the Matrix Multiply Unit.
https://www.elsevier.com/books-and-journals/book-companion/9780128119051
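
The data flow described in the Figure 7.13 caption can be sketched as a small simulation. This is a simplified model under my own assumptions: a weight-stationary grid where input x[c] enters the top of column c at cycle c and moves down one row per cycle, while partial sums move one cell to the right per cycle. The function name systolic_matvec is hypothetical, and this is not how any real TPU is programmed.

```python
def systolic_matvec(weights, x):
    """Simulate a weight-stationary systolic array computing weights @ x.

    weights[r][c] stays inside cell (r, c). Because inputs are staggered,
    cell (r, c) is active exactly at cycle t = r + c: it receives x[c] from
    above and the partial sum from its left neighbour, which was produced
    one cycle earlier.
    """
    rows, cols = len(weights), len(weights[0])
    psum = [[0] * cols for _ in range(rows)]
    for t in range(rows + cols - 1):          # total cycles for one vector
        for r in range(rows):
            c = t - r                          # the active cell in row r
            if 0 <= c < cols:
                left = psum[r][c - 1] if c > 0 else 0
                psum[r][c] = left + weights[r][c] * x[c]
    # The sums of products come out at the right edge of each row.
    return [psum[r][cols - 1] for r in range(rows)]


if __name__ == "__main__":
    # Six weights and three inputs, as in the caption.
    print(systolic_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, -1]))
```

Each cell only talks to its neighbours, which is why the array needs no shared memory access during the multiply-accumulate steps.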
88

Edge TPU and NCS 2
89
model                    Coral: Edge TPU   NCS 2 (fp16)   iPhone Xs Max (Neural Engine accelerated, fp16)
MobileNet V1 1.0/224     2.74              12.11          1.74
MobileNet V2 1.0/224     2.87              14.87          2.15
Inception V3             43.27             52.25          8.65
ResNet 50                42.41             33.1           6.91
SqueezeNet 1.1           1.90              3.99           1.75
MobileNet V1 0.25/128    1.11              4.08           1.16
SSD MobileNet V1 COCO    10.05             23.53          -
SSD MobileNet V2 COCO    12.48             39.11          -
Mobilenet V1/V2 and SSD Mobilenet V1/V2 are quite good
• Edge TPU: my scripts, https://github.com/freedomtan/edge_tpu_python_scripts
• NCS 2: ./benchmark_app -d MYRIAD -niter 50 -nireq 10 ..
• iPhone Xs Max: my CoreML benchmark, https://github.com/freedomtan/coremlbenchmark

[Chart: Mobilenet V1: Edge TPU and NCS2. Y-axis: time (ms). Series: ncs2 and coral, each for mobilenet_v1_0.25, mobilenet_v1_0.5, mobilenet_v1_0.75, and mobilenet_v1_1.0]
Mobilenet V1 on EdgeTPU
and NCS2
90
inference time (ms)         size=128x128  size=160x160  size=192x192  size=224x224
ncs2 mobilenet_v1_0.25      3.83          3.95          4.06          4.4
ncs2 mobilenet_v1_0.5       4.98          4.86          5.51          6.51
ncs2 mobilenet_v1_0.75      6.04          6.67          7.93          9.4
ncs2 mobilenet_v1_1.0       7.43          8.68          10.13         12.2
coral mobilenet_v1_0.25     1.07          1.24          1.30          1.47
coral mobilenet_v1_0.5      1.16          1.40          1.53          1.95
coral mobilenet_v1_0.75     1.29          1.70          1.80          2.16
coral mobilenet_v1_1.0      1.50          1.95          2.15          2.85
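
One way to read the numbers above is as a ratio of NCS 2 time to Edge TPU time for the same model and input size. The short sketch below only re-enters the measurements from the slide; the names SIZES, NCS2, CORAL, and speedup are mine.

```python
# Inference times copied from the table above, keyed by depth multiplier.
SIZES = (128, 160, 192, 224)
NCS2 = {
    0.25: (3.83, 3.95, 4.06, 4.4),
    0.5: (4.98, 4.86, 5.51, 6.51),
    0.75: (6.04, 6.67, 7.93, 9.4),
    1.0: (7.43, 8.68, 10.13, 12.2),
}
CORAL = {
    0.25: (1.07, 1.24, 1.30, 1.47),
    0.5: (1.16, 1.40, 1.53, 1.95),
    0.75: (1.29, 1.70, 1.80, 2.16),
    1.0: (1.50, 1.95, 2.15, 2.85),
}


def speedup(multiplier, size):
    """Return NCS 2 time divided by Edge TPU time for one configuration."""
    i = SIZES.index(size)
    return NCS2[multiplier][i] / CORAL[multiplier][i]


if __name__ == "__main__":
    for multiplier in NCS2:
        row = ", ".join(f"{speedup(multiplier, s):.2f}x" for s in SIZES)
        print(f"mobilenet_v1_{multiplier}: {row}")
```

For these sixteen configurations the ratio stays between roughly 3 and 5.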

[Model graph: input (1×224×224×3) → edgetpu-custom-op (1×1×1×1024) → L2Normalization (1×1×1×1024) → Conv2D, weights 5×1×1×1024, bias 5 (1×1×1×5) → Reshape (1×5) → Softmax (1×5) → Output]
Imprinting Engine
• Yes, let’s check what it is

• The Imprinting Engine implements a low-shot learning technique
called ‘Imprinted Weights’ [1][2]

• Can be used to retrain classifiers on-device (either on USB
Accelerator or Dev Board), no back-propagation gradient involved.

• Why?

• Transfer-learning happens on-device, at near-realtime speed.

• You don't need to recompile the model.

• Limitations

• Training data size is limited to a max of 200 images per class.

• It is suitable only for datasets that have a small intra-class
variation.

• The last fully-connected layer runs on the CPU, not the Edge
TPU. So it will be slightly less efficient than running a pre-
compiled model on the Edge TPU.

• if you are interested in it, check the paper and
aiy::learn::imprinting::ImprintingEngine::Train(unsigned char const*, int, int)
91
[1] https://coral.withgoogle.com/docs/edgetpu/retrain-classification-ondevice/
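
The idea behind imprinted weights can be sketched in a few lines, matching the L2Normalization and final-layer structure in the graph above. This is my reading of the technique, not the Edge TPU API or the ImprintingEngine implementation: each class weight is the normalized mean of the normalized embeddings of its training images, and classification is a dot product against those weights. All function names are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def imprint(embeddings_by_class):
    """Build one weight vector per class, with no back-propagation.

    embeddings_by_class maps a label to a list of embedding vectors
    (the outputs of the frozen feature extractor).
    """
    weights = {}
    for label, embeddings in embeddings_by_class.items():
        normalized = [l2_normalize(e) for e in embeddings]
        mean = [sum(col) / len(normalized) for col in zip(*normalized)]
        weights[label] = l2_normalize(mean)
    return weights


def classify(weights, embedding):
    """Return the label whose weight vector has the highest cosine similarity."""
    e = l2_normalize(embedding)
    scores = {
        label: sum(w * x for w, x in zip(wv, e))
        for label, wv in weights.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    w = imprint({
        "cat": [[1.0, 0.1], [0.9, 0.0]],
        "dog": [[0.0, 1.0], [0.2, 0.8]],
    })
    print(classify(w, [2.0, 0.3]))
```

Because adding a class only means averaging a handful of embeddings, retraining can happen on-device without recompiling the feature extractor.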

[Model graph: input (1×224×224×3) → edgetpu-custom-op (1×1×1×1024) → AvgPool]

EfficientNet
• EfficientNet-B0:

• vs. MobileNet V1: much fewer FLOPs;
much higher accuracy

• vs. MobileNet V2: slightly more FLOPs;
much higher accuracy
http://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html
92

EfficientNet-B0 floating point
93

EfficientNet-B0 fixed point
94

Depthwise Separable Convolution
• CNNs with depthwise separable convolution such as Mobilenet [1]
changed almost everything

• Depthwise separable convolution “factorizes” a standard convolution
into a depthwise convolution and a 1 × 1 convolution called a
pointwise convolution. Thus it greatly reduces computational
complexity.

• Depthwise separable convolution is not that new [2], but pure
depthwise separable convolution-based networks such as Xception
and MobileNet demonstrated its power


[2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
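
To put a number on "greatly reduces", here is a small sketch of the multiply-add counts given in the MobileNet paper (https://arxiv.org/abs/1704.04861). D_K is the kernel size, M the number of input channels, and N the number of output channels; D_F, the output feature map size, is the paper's symbol and does not appear on these slides. The function names are mine.

```python
def standard_conv_mult_adds(dk, m, n, df):
    """Mult-adds of a standard convolution: D_K * D_K * M * N * D_F * D_F."""
    return dk * dk * m * n * df * df


def depthwise_separable_mult_adds(dk, m, n, df):
    """Mult-adds of a depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = dk * dk * m * df * df   # one D_K x D_K filter per input channel
    pointwise = m * n * df * df         # 1x1 convolution mixing M channels into N
    return depthwise + pointwise


def reduction_ratio(dk, m, n, df):
    """Separable cost divided by standard cost; equals 1/N + 1/D_K**2."""
    return (depthwise_separable_mult_adds(dk, m, n, df)
            / standard_conv_mult_adds(dk, m, n, df))


if __name__ == "__main__":
    # Illustrative layer: 3x3 kernel, 32 -> 64 channels, 112x112 output.
    print(f"{reduction_ratio(3, 32, 64, 112):.4f}")
```

With 3 × 3 kernels the 1/D_K² term dominates, so the separable form costs roughly one eighth to one ninth of the standard convolution.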
95

[Figure: standard convolution filters, depthwise convolution filters, and 1×1 convolutional filters (pointwise convolution), with dimensions labeled D_K, M, and N]
https://arxiv.org/abs/1704.04861
Depthwise Separable Convolution
96
