2. who i am
• Learned to use open source software before the term "open source" was coined
• A software guy; learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
• Recently, working on NN performance on edge devices and related stuff
• Contributed from time to time to TensorFlow, esp. TFLite
• Contributed some code to the MLPerf Mobile App
• Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my employer.
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
https://github.com/tensorflow/tensorflow/releases/tag/v2.13.0
3. Arthur C. Clarke: "Any sufficiently advanced technology is indistinguishable from magic."
4. Outline
Overview
Run Stable Diffusion on your Android device
Converting fp32 models to fp32 tflite models
Converting fp32 models to PTQ tflite models
Make converted models work on NN accelerator(s)
Recap
5. • I put something about converting Stable Diffusion to tflite on GitHub in late 2022: https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite
• I told @thiteanish on Twitter how to get reasonable performance on his Pixel 6 this March
• However, I found that there is a session this afternoon called "How to run llama.cpp on local graphics cards": https://coscup.org/2023/en/session/LXQGDU
• Since NNAPI doesn't support lower-bit quantization, I'll focus on Stable Diffusion; specifically, Stable Diffusion 1.x
https://twitter.com/thiteanish/status/1635678053853536256
7. Keras CV implementation
• We'll use Keras CV's Stable Diffusion implementation because
• it's easier to convert to TFLite, and
• its code seems easier to understand
Note that the Keras CV implementation uses weights converted from the original PyTorch implementation.
• Other code you may want to check
• original one: https://huggingface.co/CompVis/stable-diffusion
• Apple's Core ML related code (Python and Swift code included): https://github.com/apple/ml-stable-diffusion
8. Models in Stable Diffusion
The 3 models in Keras CV Stable Diffusion. After constructing the pipeline with
model = keras_cv.models.StableDiffusion()
1. text encoder: model.text_encoder
2. diffusion/denoise model: model.diffusion_model
3. decoder: model.decoder
With something like model.text_encoder.summary() we can dump a Keras model's layers, including input and output layers. We'll do it later.
9. getting NNAPI-friendly tflites
• fp32: Converting fp32 Keras models to fp32 tflite models is relatively easy. We don't really need to understand the 3 models.
• quantized model: Converting fp32 Keras models to quantized int8 tflite models is more difficult. Either
• quantization-aware training (QAT, https://www.tensorflow.org/model_optimization/guide/quantization/training), or
• post-training quantization (PTQ, https://www.tensorflow.org/lite/performance/post_training_quantization) has to be done. For PTQ, "representative" data have to be prepared for all the model inputs. I don't know whether it's feasible to perform QAT retraining. So I started from PTQ.
10. converting fp32 models to fp32 tflite
• how to convert a Keras/Saved Model or some concrete functions to a tflite model (see the sketch below)
• tf.lite.TFLiteConverter.from_keras_model()
• tf.lite.TFLiteConverter.from_saved_model()
• tf.lite.TFLiteConverter.from_concrete_functions()
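A minimal sketch of the from_keras_model() path, using the pipeline constructed earlier (the other two class methods work analogously):

import tensorflow as tf
import keras_cv

model = keras_cv.models.StableDiffusion()

# Convert the text encoder (a Keras model) straight to a tflite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model.text_encoder)
tflite_model = converter.convert()
with open("text_encoder.tflite", "wb") as f:
    f.write(tflite_model)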
11. benchmark_model with op validation
• There are some #ifdef NNAPI_VERBOSE_VALIDATION in the NNAPI delegate source
• Add --copt=-DNNAPI_VERBOSE_VALIDATION when building benchmark_model with bazel, as shown below
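For example, something like the following should work (assuming an arm64 Android build; the exact configs depend on your setup):

bazel build -c opt --config=android_arm64 --copt=-DNNAPI_VERBOSE_VALIDATION //tensorflow/lite/tools/benchmark:benchmark_model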
cheetah:/data/local/tmp $ ./benchmark_model_validation --graph=foo/text_encoder_fixed_batch_size.tflite --use_nnapi=1 --nnapi_allow_fp16=1 --enable_op_profiling=1
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Graph: [foo/text_encoder_fixed_batch_size.tflite]
INFO: Enable op profiling: [1]
INFO: Use NNAPI: [1]
INFO: NNAPI accelerators available: [google-edgetpu,google-armnn,nnapi-reference]
INFO: Allow fp16 in NNAPI: [1]
INFO: Loaded model foo/text_encoder_fixed_batch_size.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: NNAPI delegate created.
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
VERBOSE: Replacing 1511 out of 1513 node(s) with delegate (TfLiteNnapiDelegate) node, yielding 2 partitions for the whole graph.
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
…
Number of nodes executed: 3
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB][times called]
TfLiteNnapiDelegate 1 67.030 99.966% 99.966% 0.000 1
GATHER 2 0.023 0.034% 100.000% 0.000 2
Timings (microseconds): count=50 first=64299 curr=63834 min=63370 max=194959 avg=67054 std=18319
Memory (bytes): count=0
3 nodes observed
13. Fix batch size
• from_keras_model()
• No place to set the batch size without changing keras_cv code
• from_saved_model()
• I don't know how to do it either
• from_concrete_functions()
• Yes, I know how to do it (see the sketch below)
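A minimal sketch of fixing the batch size via a concrete function, roughly following my conversion notebooks (the text encoder's input shapes and dtypes here are assumptions based on the I/O tensors shown later):

import tensorflow as tf
import keras_cv

model = keras_cv.models.StableDiffusion()

# Wrap the text encoder in a tf.function and trace it with batch size 1.
run = tf.function(lambda tokens, positions: model.text_encoder([tokens, positions]))
concrete_func = run.get_concrete_function(
    tf.TensorSpec([1, 77], tf.int32),  # tokens
    tf.TensorSpec([1, 77], tf.int32),  # positions
)
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [concrete_func], model.text_encoder)
tflite_model = converter.convert()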
14. What's the problem with tflite and dynamic batch size
• To support dynamic batch size, there are some ops not supported by NNAPI
• SHAPE, REDUCE_PROD, RESHAPE, etc.
15. Decoder with fixed batch size failed to be delegated
130|cheetah:/data/local/tmp $ ./benchmark_model_validation --graph=foo/decoder_fixed_batch_size.tflite --use_nnapi=1 --nnapi_allow_fp16=1 --enable_op_profiling=1
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Graph: [foo/decoder_fixed_batch_size.tflite]
INFO: Enable op profiling: [1]
INFO: Use NNAPI: [1]
INFO: NNAPI accelerators available: [google-edgetpu,google-armnn,nnapi-reference]
INFO: Allow fp16 in NNAPI: [1]
INFO: Loaded model foo/decoder_fixed_batch_size.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: NNAPI delegate created.
WARNING: Operator RESHAPE (v1) refused by NNAPI delegate: Output rank should be <= 4
WARNING: Operator MEAN (v1) refused by NNAPI delegate: NNAPI does not support mean of a tensor with rank > 4
WARNING: Operator SUB (v3) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator SQUARE (v1) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MEAN (v1) refused by NNAPI delegate: NNAPI does not support mean of a tensor with rank > 4
WARNING: Operator ADD (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator BROADCAST_TO (v2) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MUL (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator BROADCAST_TO (v2) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MUL (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator BROADCAST_TO (v2) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MUL (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator SUB (v3) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator ADD (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator RESHAPE (v1) refused by NNAPI delegate: Input rank should be <= 4
WARNING: Operator RESHAPE (v1) refused by NNAPI delegate: Output rank should be <= 4
WARNING: Operator MEAN (v1) refused by NNAPI delegate: NNAPI does not support mean of a tensor with rank > 4
• BTW, I fixed an NNAPI delegate issue when delegating some invalid ops last year: https://github.com/tensorflow/tensorflow/pull/58978
16. Group Normalization
• Group Normalization divides the channels into groups and computes within each group the mean and variance for normalization. Empirically, its accuracy is more stable than batch norm in a wide range of small batch sizes, if the learning rate is adjusted linearly with batch size.
• Relation to Layer Normalization: If the number of groups is set to 1, then this operation becomes nearly identical to Layer Normalization (see the Layer Normalization docs for details).
• Relation to Instance Normalization: If the number of groups is set to the input dimension (number of groups is equal to number of channels), then this operation becomes identical to Instance Normalization.
https://arxiv.org/pdf/1803.08494.pdf
https://keras.io/api/layers/normalization_layers/group_normalization/
17. Group normalization
• Here the last axis (channel) is split into 32 groups; each group has 512 / 32 = 16 elements
• (1, 64, 64, 512) -> (1, 64, 64, 32, 16)
• NNAPI doesn't allow rank > 4. A naive method is to split, run layer norm, and concat
18. Group norm implementation
• Group norm is quite easy to implement
• I tested whether split + layer norm + concatenate works as expected by modifying the code from the Group Norm paper (see the sketch below)
• It just works
https://arxiv.org/pdf/1803.08494.pdf
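A minimal sketch of the split + layer norm + concat idea (without the learnable scale/offset; the group count and epsilon are the usual defaults, not taken from the slides):

import tensorflow as tf

def group_norm_via_split(x, groups=32, epsilon=1e-5):
    # x: (N, H, W, C). Instead of reshaping to the rank-5 (N, H, W, groups,
    # C // groups), which NNAPI rejects, split along the channel axis,
    # normalize each rank-4 group over (H, W, C_group), and concatenate.
    outputs = []
    for g in tf.split(x, num_or_size_splits=groups, axis=-1):
        mean, variance = tf.nn.moments(g, axes=[1, 2, 3], keepdims=True)
        outputs.append((g - mean) / tf.sqrt(variance + epsilon))
    return tf.concat(outputs, axis=-1)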
19. Group norm in Keras
• A production Group Norm has to consider more
• I hacked the Keras code at [1] to split, layer norm, and concatenate
• Voila, we can fully delegate the fixed batch size decoder model
• However, it failed because the transaction size is larger than what NNAPI HIDL can handle
• Fortunately, we have other delegates, such as MediaTek's Neuron Delegate [2]
[1] https://github.com/keras-team/keras-core/blob/v0.1.3/keras_core/layers/normalization/group_normalization.py#L148-L195
[2] https://github.com/MediaTek-NeuroPilot/tflite-neuron-delegate
20. The diffusion model
• When converting the saved_model, we ran into this:
• E tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2181] Model size is bigger than 2gb
• How about keras_model and concrete_function?
• Well, they are converted to saved_model under the hood, and the single-protobuf (pb) 2 GB limit applies there too
• The only way I can think of is to split the diffusion model (into 2 or more models)
• BTW, there are group norm issues in the diffusion model too
21. Splitting the diffusion model into two models
• The diffusion/reverse diffusion model is a U-Net with skip/residual connections
• Because of the residual connections, it's not a sequential model. When we split the model, we must check the connections/edges between nodes located in different subgroups [1]
[1] https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_keras_diffusion_model_into_two_tflite_models.ipynb
https://arxiv.org/pdf/2112.10752.pdf
https://github.com/keras-team/keras-cv/blob/v0.6.1/keras_cv/models/stable_diffusion/diffusion_model.py#L23-L114
https://github.com/CompVis/latent-diffusion/blob/main/assets/modelfigure.png
22. Convert fp32 models to qint8 tflite
• There are some issues:
• the group normalization
• the 2 GiB limitation in flatbuffers
• unlike fp16 conversion, for PTQ the fp32 model is serialized and written to a file before quantization runs as an optimization, so we cannot get around this without modifying the TFLite converter
• representative data: we can borrow the tokenizer, random number generator, and timestep scheduler from the main loop
• to generate data for PTQ, understanding the input tensors of the models is needed (see the sketch below)
• scripts for dummy PTQ:
• text encoder and decoder: https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_text_encoder_and_decoder_to_tflite_models_qint8.ipynb
• diffusion model: https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_keras_diffusion_model_into_two_tflite_models_qint8.ipynb
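A minimal PTQ sketch for the text encoder (the converter calls are the standard TFLite PTQ API; the random-token generator is a stand-in for real tokenizer output, so treat it as dummy calibration data):

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Dummy calibration data: random token ids plus the fixed 0..76
    # positions. Truly "representative" tokens should come from the
    # tokenizer applied to real prompts.
    for _ in range(100):
        tokens = np.random.randint(0, 49408, size=(1, 77)).astype(np.int32)
        positions = np.arange(77, dtype=np.int32)[None, :]
        yield [tokens, positions]

# concrete_func / model.text_encoder as in the earlier fixed-batch-size sketch
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [concrete_func], model.text_encoder)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
quantized_tflite_model = converter.convert()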
23. input and output tensors of the text encoder
The inputs of the text encoder are:
• tokens (InputLayer) [(None, 77)] 0 []
• positions (InputLayer) [(None, 77)] 0 []
Tokens are the output of a tokenizer, padded to length 77. The positions are simply 0, 1, ..., 76.
The output is
• layer_normalization_24 (LayerNormalization (None, 77, 768) 1536 ['clip_encoder_layer_11[0][0]'])
To understand the tokenizer, we have to know roughly what the text encoder is.
24. Text Encoder
• The text encoder in Stable Diffusion 1.x is from OpenAI's CLIP. The text encoder in Stable Diffusion 2.x is from OpenCLIP, which is an open source implementation of CLIP.
• CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. For the image encoder part, the Vision Transformer (ViT) is used. In the original ViT paper, there are 3 variants: Base (B), Large (L), and Huge (H). Then in a follow-up paper, there were Tiny (Ti), Small (S), and G (Gigantic?).
from https://arxiv.org/abs/2010.11929
25. Tokenizer
• Before the text encoder, we need a tokenizer to parse the prompt into tokens. Then we can feed those tokens to the text encoder.
• A variant of Byte-Pair Encoding (BPE, https://en.wikipedia.org/wiki/Byte_pair_encoding) is used to parse sentences into tokens.
• In Hugging Face's Stable Diffusion implementation, the tokenizer is a part of the CLIP model, see https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/tokenizer/tokenizer_config.json and https://huggingface.co/transformers/v4.9.2/_modules/transformers/models/clip/tokenization_clip.html.
• In the Keras CV implementation, it's in https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion/clip_tokenizer.py
• In Apple's Core ML implementation,
• Python: the CLIP tokenizer is used, https://github.com/apple/ml-stable-diffusion/tree/main/python_coreml_stable_diffusion
• Swift: there is a Swift implementation, https://github.com/apple/ml-stable-diffusion/tree/main/swift/StableDiffusion/tokenizer
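A minimal sketch of tokenizing a prompt with the Keras CV tokenizer (the padding token 49407 and the fixed length of 77 follow my reading of the keras_cv code; double-check against the version you use):

import numpy as np
from keras_cv.models.stable_diffusion.clip_tokenizer import SimpleTokenizer

MAX_PROMPT_LENGTH = 77
tokenizer = SimpleTokenizer()

# encode() wraps the prompt with start/end tokens; pad with the
# end-of-text token (49407) up to the fixed length of 77.
tokens = tokenizer.encode("a photograph of an astronaut riding a horse")
tokens = tokens + [49407] * (MAX_PROMPT_LENGTH - len(tokens))
tokens = np.array(tokens, dtype=np.int32)[None, :]
positions = np.arange(MAX_PROMPT_LENGTH, dtype=np.int32)[None, :]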
26. The diffusion model
We won't cover what diffusion is and how it works. The most "non-math", programmer-friendly tutorials I know are the Keras CV articles. For example,
• "High-performance image generation using Stable Diffusion in KerasCV", https://keras.io/guides/keras_cv/generate_images_with_stable_diffusion/
• "A walk through latent space with Stable Diffusion" (https://keras.io/examples/generative/random_walks_with_stable_diffusion/)
• "Denoising Diffusion Implicit Models", https://keras.io/examples/generative/ddim/
Huggingface has many good tutorials, too. For example,
• Annotated Diffusion (https://huggingface.co/blog/annotated-diffusion)
If you prefer more math/theoretical stuff, you can start with Lilian Weng's "What are Diffusion Models?", https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
27. diffusion as a Markov chain
from https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
28. If you have a hard time understanding diffusion models
• Last week, I saw an interesting article named "Perspectives on diffusion"
• Some of the perspectives are kinda tongue-in-cheek; some of them are really useful
https://sander.ai/2023/07/20/perspectives.html
29. input and output tensors of the diffusion model
inputs
• input_1 (InputLayer) [(None, 77, 768)] 0 []: this is directly from the text encoder's output
• input_2 (InputLayer) [(None, 320)] 0 []: timestep embedding
• input_3 (InputLayer) [(None, 64, 64, 4)] 0 []: noise to be denoised
output
• padded_conv2d_83 (PaddedConv2D (None, 64, 64, 4) 11524 ['activation_67[0][0]']): reverse diffusion data, to be fed into the diffusion model again or into the decoder.
30. Timestep embedding
• In different phases of denoising, different levels of noise are needed.
• In Stable Diffusion, the timestep embedding is from the Transformer's sinusoidal positional embedding.
• There are many articles discussing why sinusoidal positional embedding is used and how related alternatives could be more sensitive to noise.
https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L151-L171
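A NumPy sketch of the sinusoidal timestep embedding, modeled on the CompVis util.py linked above (dim=320 matches input_2 of the diffusion model; max_period=10000 is the usual constant):

import numpy as np

def timestep_embedding(timestep, dim=320, max_period=10000):
    # Half the dimensions get cosines, half get sines, with
    # log-spaced frequencies, as in the linked util.py.
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half, dtype=np.float32) / half)
    args = timestep * freqs
    return np.concatenate([np.cos(args), np.sin(args)]).astype(np.float32)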
31. Scheduling
• When looping through the diffusion model, the timestep alone is not enough. As shown in the code snippet below, the "latent" is updated based on
• how much we want to follow the prompt, and
• some kind of moving average
• There are many schedulers; see https://huggingface.co/docs/diffusers/api/schedulers/overview for general information on schedulers/samplers.
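The snippet from the original slide is an image and is not in this text version; below is a simplified sketch of the Keras CV denoising loop (names like unconditional_guidance_scale, alphas, and alphas_prev follow keras_cv v0.6.x, but treat the details as approximate, and timestep_embedding is the sketch from the previous slide):

import math

def denoising_loop(diffusion_model, initial_noise, timesteps, context,
                   unconditional_context, alphas, alphas_prev,
                   unconditional_guidance_scale=7.5):
    latent = initial_noise  # (1, 64, 64, 4), from the RNG
    for index, timestep in list(enumerate(timesteps))[::-1]:
        latent_prev = latent
        t_emb = timestep_embedding(timestep)[None, :]
        # Classifier-free guidance: run the model with and without the
        # prompt and mix the two predictions; this is "how much we want
        # to follow the prompt".
        unconditional_latent = diffusion_model(
            [latent, t_emb, unconditional_context])
        conditional_latent = diffusion_model([latent, t_emb, context])
        latent = unconditional_latent + unconditional_guidance_scale * (
            conditional_latent - unconditional_latent)
        # The "moving average"-like scheduler update, combining the
        # model's prediction with the previous latent.
        a_t, a_prev = alphas[index], alphas_prev[index]
        pred_x0 = (latent_prev - math.sqrt(1 - a_t) * latent) / math.sqrt(a_t)
        latent = latent * math.sqrt(1.0 - a_prev) + math.sqrt(a_prev) * pred_x0
    return latent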
32. There are more
• For example, there is the guidance scale:
• unconditional_guidance_scale in the Keras CV implementation code controls how closely the image should adhere to the prompt. Larger values result in more closely adhering to the prompt, but will make the image noisier.
• It's not a part of any of the models, but it's a part of the pipeline, used when looping the diffusion model.
33. The decoder
• On the left-hand side of the figure, there are ℰ (a pixel space to latent space encoder) and 𝒟 (a latent space to pixel space decoder). The (ℰ, 𝒟) pair is actually part of a variational autoencoder (VAE) trained for Stable Diffusion.
• The decoder is the 𝒟.
https://github.com/CompVis/latent-diffusion/blob/main/assets/modelfigure.png
34. Decoder I/O tensors
Input
• rescaling (Rescaling) (None, 64, 64, 4) 0
output
• padded_conv2d_37 (PaddedConv2D) (None, 512, 512, 3) 3459
Note that the output tensor is a float32 tensor with values within [-1.0, 1.0].
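Since the output lives in [-1.0, 1.0], converting it to an 8-bit image takes one extra step; a minimal NumPy sketch (decoder_output is a hypothetical (1, 512, 512, 3) result tensor):

import numpy as np

# Map [-1.0, 1.0] -> [0, 255] and clamp before casting to uint8.
image = (decoder_output[0] + 1.0) / 2.0 * 255.0
image = np.clip(image, 0, 255).astype(np.uint8)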
35. Representative data for PTQ
• representative data: we can borrow the tokenizer, random number generator, and timestep scheduler from the main loop
• Text encoder: tokenizer
• Diffusion model: tokenizer + text encoder + RNG + timestep embedding + scheduling
• Decoder: all the parts before the decoder :-)
36. Latency issue
• If you did whatever I discussed, mostly, you can delegate all the ops to NNAPI. However, the end-to-end latency might be larger than expected.
• One thing we can do is "setprop debug.nn.vlog 1" to check whether all the ops are on the accelerator(s) as expected
• An example is the tf batch norm implementation. In Keras, both Layer Norm and Group Norm use tf.nn.batch_normalization (https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization, https://github.com/tensorflow/tensorflow/blob/r2.13/tensorflow/python/ops/nn_impl.py#L1531-L1599)
• Going back to the original definition, γ(x − μ)/σ + β, will help (see the sketch below)
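A minimal sketch of what "going back to the definition" means, with hypothetical mean/variance/gamma/beta tensors as parameters:

import tensorflow as tf

def normalize(x, mean, variance, gamma, beta, epsilon=1e-5):
    # Direct form of gamma * (x - mean) / sigma + beta, written with
    # elementary ops instead of calling tf.nn.batch_normalization().
    return gamma * (x - mean) / tf.sqrt(variance + epsilon) + beta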
37. Running models on Android
• As we know from the previous discussion of PTQ, we need
• a tokenizer,
• timestep embedding and scheduling code, and
• a noise generator.
And using Python on Android is tricky.
• So I did some quick and dirty implementation in C++
• Why not in Java/Kotlin? Well, I usually work in command-line environments
• Why not in C? C++ has more convenient vector mechanisms.
• Bonus: I also implemented an imprinting demo :-)
• https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/tree/main/cpp_glue_code
38. Is that all?
• Surely not. Optimization of attention and other layers for generative models is quite a hot area.
• In case you don't know where to start, I put links to Google's and Apple's work at the end of the slide deck.
• Most likely you'll see more and more generative models on your mobile devices at the end of 2023 or in early 2024.
42. Stable Diffusion related optimizations from Google MediaPipe
• Two blog articles
• https://ai.googleblog.com/2023/06/speed-is-all-you-need-on-device.html
• https://ai.googleblog.com/2023/06/on-device-diffusion-plugins-for.html
• It seems to be GPU delegate only
• Optimizations they discussed
• Fused softmax
• Winograd convolution
• GELU and group norm
• For a ControlNet-like feature, they proposed a lightweight MobileNetV2-based network