“freedom” Koan-Sin Tan, COSCUP, Taiwan, July 30th, 2023
Stable Diffusion on Android
1
• Learnt to use open source software before the
term “open source” was coined
• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD
• Recently, working on NN performance related stuff on edge devices
• Contributed from time to time to TensorFlow,
esp. TFLite
• Contributed some code to MLPerf Mobile
App
• Disclaimer: Opinions expressed are solely my
own and do not express the views or opinions
of my employer.
who i am
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
2
https://github.com/tensorflow/tensorflow/releases/tag/v2.13.0
Arthur C. Clarke.
Any sufficiently advanced technology is
indistinguishable from magic.
3
Outline
Overview
Run stable diffusion on your Android device
Converting fp32 models to fp32 tflite models
Converting fp32 models to PTQ tflite models
Make converted models work on NN accelerator(s)
Recap
4
• I put something about converting Stable Diffusion to tflite on GitHub in late 2022, https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite
• I told @thiteanish on Twitter how to get reasonable performance on his Pixel 6 this March
• However, I found that there is a session called “How to run llama.cpp on local graphics cards” this afternoon, https://coscup.org/2023/en/session/LXQGDU
• Since NNAPI doesn’t support lower-bit quantization, I’ll focus on Stable Diffusion.
I’ll focus on Stable Diffusion 1.x
5 https://twitter.com/thiteanish/status/1635678053853536256
• Pre-normalization
• SwiGLU
• Rotary Embedding
• Grouped-Query Attention
LLaMA and Llama 2
6
https://arxiv.org/abs/2307.09288
https://arxiv.org/abs/2302.13971
Keras CV implementation
• We'll use Keras CV's Stable Diffusion implementation because
• it's easier to convert to TFLite, and
• its code seems easier to understand
Note that the Keras CV implementation uses weights converted from the original PyTorch implementation.
• Other code you may want to check
• original one: https://huggingface.co/CompVis/stable-diffusion
• Apple's Core ML related code (Python and Swift code included): https://github.com/apple/ml-stable-diffusion
7
The 3 models in Keras CV Stable Diffusion. After constructing the pipeline with
model = keras_cv.models.StableDiffusion()
1. text encoder: model.text_encoder
2. diffusion/denoise model: model.diffusion_model
3. decoder: model.decoder
With something like model.text_encoder.summary() we can dump a Keras model's layers, including input and output layers. We'll do it later.
Models in Stable Diffusion
8
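As a quick sketch of the summary dump mentioned above (nothing assumed beyond the keras_cv API on this slide):

import keras_cv

model = keras_cv.models.StableDiffusion()
model.text_encoder.summary()      # CLIP text encoder
model.diffusion_model.summary()   # U-Net diffusion/denoise model
model.decoder.summary()           # VAE decoder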
getting NNAPI-friendly tflites
• fp32: Converting fp32 Keras models to fp32 tflite models is relatively easy. We don't really need to understand the 3 models.
• quantized model: Converting fp32 Keras models to quantized int8 tflite ones is more difficult. Either
• quantization-aware training (QAT, https://www.tensorflow.org/model_optimization/guide/quantization/training), or
• post-training quantization (PTQ, https://www.tensorflow.org/lite/performance/post_training_quantization) has to be done. For PTQ, "representative" data have to be prepared for all the model inputs. I don't know whether it's feasible to do QAT retraining, so I started from PTQ.
9
converting fp32 models to fp32 tflite
• how to convert a Keras/Saved Model or some concrete functions to a tflite model
• tf.lite.TFLiteConverter.from_keras_model()
• tf.lite.TFLiteConverter.from_saved_model()
• tf.lite.TFLiteConverter.from_concrete_functions()
10
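As a minimal sketch of the from_keras_model() path (assuming the keras_cv pipeline from the previous slide; note the resulting tflite keeps the dynamic batch size, which causes the NNAPI problems discussed next):

import tensorflow as tf
import keras_cv

model = keras_cv.models.StableDiffusion()

# convert the decoder Keras model directly; the batch dimension stays dynamic
converter = tf.lite.TFLiteConverter.from_keras_model(model.decoder)
tflite_model = converter.convert()
with open("decoder.tflite", "wb") as f:
    f.write(tflite_model)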
benchmark_model with op validation
• There are some #ifdef NNAPI_VERBOSE_VALIDATION in the NNAPI delegate source
• Add --copt=-DNNAPI_VERBOSE_VALIDATION when building benchmark_model with bazel
11
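The build command is something like the following (the target path is from the TensorFlow repo; treat the exact configuration flags as an assumption that may vary by version):

bazel build -c opt --config=android_arm64 \
    --copt=-DNNAPI_VERBOSE_VALIDATION \
    //tensorflow/lite/tools/benchmark:benchmark_model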
cheetah:/data/local/tmp $ ./benchmark_model_validation --graph=foo/text_encoder_fixed_batch_size.tflite --use_nnapi=1 --nnapi_allow_fp16=1 --enable_op_profiling=1
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Graph: [foo/text_encoder_fixed_batch_size.tflite]
INFO: Enable op profiling: [1]
INFO: Use NNAPI: [1]
INFO: NNAPI accelerators available: [google-edgetpu,google-armnn,nnapi-reference]
INFO: Allow fp16 in NNAPI: [1]
INFO: Loaded model foo/text_encoder_fixed_batch_size.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: NNAPI delegate created.
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
VERBOSE: Replacing 1511 out of 1513 node(s) with delegate (TfLiteNnapiDelegate) node, yielding 2 partitions for the whole graph.
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
…
Number of nodes executed: 3
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB][times called]
TfLiteNnapiDelegate 1 67.030 99.966% 99.966% 0.000 1
GATHER 2 0.023 0.034% 100.000% 0.000 2
Timings (microseconds): count=50 first=64299 curr=63834 min=63370 max=194959 avg=67054 std=18319
Memory (bytes): count=0
3 nodes observed
12
• from_keras_model()
• No place to set batch size without changing keras_cv code
• from_saved_model()
• I don't know how to do it either
• from_concrete_functions()
• Yes, I know how to do it
Fix batch size
13
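A sketch of the from_concrete_functions() path, fixing the batch size to 1 for the text encoder (the (1, 77) int32 tokens/positions shapes match the text encoder I/O slide later; treat the exact TensorSpecs as assumptions):

import tensorflow as tf
import keras_cv

model = keras_cv.models.StableDiffusion()

# wrap the Keras model in a tf.function and pin the batch dimension to 1
@tf.function(input_signature=[
    tf.TensorSpec([1, 77], tf.int32),   # tokens
    tf.TensorSpec([1, 77], tf.int32),   # positions
])
def text_encoder(tokens, positions):
    return model.text_encoder([tokens, positions])

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [text_encoder.get_concrete_function()], model.text_encoder)
tflite_model = converter.convert()
with open("text_encoder_fixed_batch_size.tflite", "wb") as f:
    f.write(tflite_model)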
• To support dynamic batch size, tflite models contain some ops not supported by NNAPI
• SHAPE, REDUCE_PROD, RESHAPE, etc.
What's the problem with dynamic batch size in tflite
14
Decoder with fixed batch size failed to be delegated
130|cheetah:/data/local/tmp $ ./benchmark_model_validation --graph=foo/decoder_fixed_batch_size.tflite --use_nnapi=1 --nnapi_allow_fp16=1 --enable_op_profiling=1
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Graph: [foo/decoder_fixed_batch_size.tflite]
INFO: Enable op profiling: [1]
INFO: Use NNAPI: [1]
INFO: NNAPI accelerators available: [google-edgetpu,google-armnn,nnapi-reference]
INFO: Allow fp16 in NNAPI: [1]
INFO: Loaded model foo/decoder_fixed_batch_size.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: NNAPI delegate created.
WARNING: Operator RESHAPE (v1) refused by NNAPI delegate: Output rank should be <= 4
WARNING: Operator MEAN (v1) refused by NNAPI delegate: NNAPI does not support mean of a tensor with rank > 4
WARNING: Operator SUB (v3) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator SQUARE (v1) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MEAN (v1) refused by NNAPI delegate: NNAPI does not support mean of a tensor with rank > 4
WARNING: Operator ADD (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator BROADCAST_TO (v2) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MUL (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator BROADCAST_TO (v2) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MUL (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator BROADCAST_TO (v2) refused by NNAPI delegate: Unsupported operation type.
WARNING: Operator MUL (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator SUB (v3) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator ADD (v1) refused by NNAPI delegate: Input rank must be <= 4
WARNING: Operator RESHAPE (v1) refused by NNAPI delegate: Input rank should be <= 4
WARNING: Operator RESHAPE (v1) refused by NNAPI delegate: Output rank should be <= 4
WARNING: Operator MEAN (v1) refused by NNAPI delegate: NNAPI does not support mean of a tensor with rank > 4
• BTW, I fixed an NNAPI delegate issue when delegating some invalid ops last year, https://github.com/tensorflow/tensorflow/pull/58978
15
• Group Normalization divides the channels into groups and computes within each group the mean and variance for normalization. Empirically, its accuracy is more stable than batch norm in a wide range of small batch sizes, if the learning rate is adjusted linearly with batch sizes.
• Relation to Layer Normalization: If the number of
groups is set to 1, then this operation becomes
nearly identical to Layer Normalization (see Layer
Normalization docs for details).
• Relation to Instance Normalization: If the number
of groups is set to the input dimension (number of
groups is equal to number of channels), then this
operation becomes identical to Instance
Normalization.
Group Normalization
https://arxiv.org/pdf/1803.08494.pdf
https://keras.io/api/layers/normalization_layers/group_normalization/
16
• Here the last axis (channel) is split into 32 groups; each group has 512 / 32 = 16 elements
• (1, 64, 64, 512) -> (1, 64, 64, 32, 16)
• NNAPI doesn't allow rank > 4. A naive method is to split, run layer norm, and concat
Group normalization
17
• Group norm is quite easy to implement
• I tested if split + layer norm + concatenate works as expected by modifying the code from the Group Norm paper
• It just works
Group norm implementation
https://arxiv.org/pdf/1803.08494.pdf
18
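A minimal numpy sketch of that check (my own toy version, not the paper's exact code; scale/offset omitted): reference group norm over 32 groups vs. split + per-group norm + concatenate:

import numpy as np

def group_norm(x, groups, eps=1e-5):
    # reference: reshape (N, H, W, C) -> (N, H, W, G, C // G) and
    # normalize over (H, W, C // G) within each group
    n, h, w, c = x.shape
    g = x.reshape(n, h, w, groups, c // groups)
    mean = g.mean(axis=(1, 2, 4), keepdims=True)
    var = g.var(axis=(1, 2, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, h, w, c)

def split_norm_concat(x, groups, eps=1e-5):
    # NNAPI-friendly: split channels into rank-4 chunks, normalize each
    # chunk over (H, W, C_chunk), then concatenate
    outs = []
    for chunk in np.split(x, groups, axis=-1):
        mean = chunk.mean(axis=(1, 2, 3), keepdims=True)
        var = chunk.var(axis=(1, 2, 3), keepdims=True)
        outs.append((chunk - mean) / np.sqrt(var + eps))
    return np.concatenate(outs, axis=-1)

x = np.random.randn(1, 64, 64, 512).astype(np.float32)
print(np.allclose(group_norm(x, 32), split_norm_concat(x, 32), atol=1e-5))
# prints True: the two are numerically equivalent up to fp error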
• Production Group Norm has to consider more
• I hacked the Keras code at [1] to split, layer norm, and concatenate
• Voila, we can fully delegate the fixed batch size decoder model
• However, it failed because the transaction size is larger than what NNAPI HIDL can handle
• Fortunately, we have other delegates, such as MediaTek's Neuron Delegate [2]
Group norm in Keras
[1] https://github.com/keras-team/keras-core/blob/v0.1.3/keras_core/layers/normalization/group_normalization.py#L148-L195
[2] https://github.com/MediaTek-NeuroPilot/tflite-neuron-delegate
19
The diffusion model
• When converting the saved_model, we ran into this:
• E tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2181] Model size is bigger than 2gb
• How about keras_model and concrete_function?
• Well, they are converted to saved_model, and there is a single protobuf (pb) 2gb limit there too
• The only way I can think of is to split the diffusion model (into 2 or more models)
• BTW, there are group norm issues in the diffusion model
20
• The diffusion/reverse diffusion model is a U-Net with skip/residual connections
• Because of residual connections, it's not a sequential model. When we split the model, we must check the connections/edges between nodes located in different subgroups [1]
• [1] https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_keras_diffusion_model_into_two_tflite_models.ipynb
Splitting the diffusion model into two models
https://arxiv.org/pdf/2112.10752.pdf
https://github.com/keras-team/keras-cv/blob/v0.6.1/keras_cv/models/stable_diffusion/diffusion_model.py#L23-L114
21
https://github.com/CompVis/latent-diffusion/blob/main/assets/modelfigure.png
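A toy sketch of the bookkeeping involved (my own illustration, far smaller than the real U-Net): any tensor crossing the split point, such as a residual edge, must become an extra output of part 1 and an extra input of part 2:

import tensorflow as tf

inp = tf.keras.Input((8,))
a = tf.keras.layers.Dense(8, name="a")(inp)
b = tf.keras.layers.Dense(8, name="b")(a)
c = tf.keras.layers.Add(name="c")([a, b])      # residual edge: a -> c
out = tf.keras.layers.Dense(8, name="d")(c)
full = tf.keras.Model(inp, out)

# split between "b" and "c": "a" crosses the boundary, so part 1
# exports it and part 2 takes it as an extra input
part1 = tf.keras.Model(inp, [b, a])

b_in, a_in = tf.keras.Input((8,)), tf.keras.Input((8,))
c2 = tf.keras.layers.Add()([a_in, b_in])
out2 = full.get_layer("d")(c2)                 # reuse the trained layer
part2 = tf.keras.Model([b_in, a_in], out2)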
Convert fp32 models to qint8 tflite
• There are some issues:
• the group normalization
• 2 GiB limitation in flatbuffer
• unlike fp16 conversion, for PTQ, the fp32 model is serialized and written to a file before quantization is run as an optimization, so we cannot get around this without modifying the TFLite converter.
• representative data: we can borrow the tokenizer, random number generator, and timestep scheduler from the main loop
• to generate data for PTQ, understanding the input tensors of the models is needed
• scripts for dummy PTQ:
• text encoder and decoder, https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_text_encoder_and_decoder_to_tflite_models_qint8.ipynb
• diffusion model, https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_keras_diffusion_model_into_two_tflite_models_qint8.ipynb
22
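A skeleton of the dummy PTQ flow (assuming decoder_func is a fixed-batch-size concrete function of the decoder, as in the earlier conversion sketch; real calibration data should come from running the pipeline, see the representative-data slide later):

import numpy as np
import tensorflow as tf

def representative_dataset():
    # dummy calibration data for the decoder's (1, 64, 64, 4) latent input
    for _ in range(100):
        yield [np.random.randn(1, 64, 64, 4).astype(np.float32)]

# decoder_func: fixed-batch concrete function of the decoder (assumed)
converter = tf.lite.TFLiteConverter.from_concrete_functions([decoder_func])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
quantized_model = converter.convert()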
input and output tensors of the text encoder
The inputs of the text encoder are:
• tokens (InputLayer) [(None, 77)] 0 []
• positions (InputLayer) [(None, 77)] 0 []
Tokens are the output of a tokenizer, padded to length 77. The positions are simply 0, 1, ..., 76
The output is
• layer_normalization_24 (LayerNormalization (None, 77, 768) 1536 ['clip_encoder_layer_11[0][0]'] )
To understand the tokenizer, we have to know roughly what the text encoder is.
23
Text Encoder
• The text encoder in Stable Diffusion 1.x is from OpenAI's CLIP. The text encoder in Stable Diffusion 2.x is from OpenCLIP, which is an open source implementation of CLIP.
• CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. For the image encoder part, Vision Transformer (ViT) is used. In the original ViT paper, there are 3 variants: Base (B), Large (L), and Huge (H). Then in a subsequent paper, Tiny (Ti), Small (S), and G (Gigantic?) were added.
from https://arxiv.org/abs/2010.11929
24
Tokenizer
• Before the text encoder, we need a tokenizer to parse the prompt into tokens. Then we can feed those tokens to the text encoder.
• A variant of Byte-Pair Encoding (BPE, https://en.wikipedia.org/wiki/Byte_pair_encoding) is used to parse sentences into tokens.
• In HuggingFace's Stable Diffusion implementation, the tokenizer is a part of the CLIP model, see https://huggingface.co/CompVis/stable-diffusion-v1-4/blob/main/tokenizer/tokenizer_config.json and https://huggingface.co/transformers/v4.9.2/_modules/transformers/models/clip/tokenization_clip.html.
• In the Keras CV implementation, it's in https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion/clip_tokenizer.py
• In Apple's Core ML implementation,
• Python: the CLIP tokenizer is used, https://github.com/apple/ml-stable-diffusion/tree/main/python_coreml_stable_diffusion
• Swift: there is a Swift implementation, https://github.com/apple/ml-stable-diffusion/tree/main/swift/StableDiffusion/tokenizer
25
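A sketch of preparing the two text-encoder inputs with the Keras CV tokenizer (import path from the clip_tokenizer.py above; the 49407 end-of-text padding token and the 77 max length are what Keras CV uses, as far as I can tell):

import numpy as np
from keras_cv.models.stable_diffusion.clip_tokenizer import SimpleTokenizer

MAX_PROMPT_LENGTH = 77
tokenizer = SimpleTokenizer()

# BPE-encode the prompt, then pad with the end-of-text token (49407)
tokens = tokenizer.encode("a photograph of an astronaut riding a horse")
tokens = tokens + [49407] * (MAX_PROMPT_LENGTH - len(tokens))
tokens = np.array(tokens, dtype=np.int32)[None, :]                 # (1, 77)
positions = np.arange(MAX_PROMPT_LENGTH, dtype=np.int32)[None, :]  # (1, 77)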
The diffusion model
We won't cover what diffusion is and how it works. The most "non-math", programmer-friendly tutorials I know are the Keras CV articles. For example,
• "High-performance image generation using Stable Diffusion in KerasCV", https://keras.io/guides/keras_cv/generate_images_with_stable_diffusion/
• "A walk through latent space with Stable Diffusion" (https://keras.io/examples/generative/random_walks_with_stable_diffusion/)
• "Denoising Diffusion Implicit Models", https://keras.io/examples/generative/ddim/
Huggingface has many good tutorials, too. For example,
• Annotated Diffusion (https://huggingface.co/blog/annotated-diffusion)
If you prefer more math/theoretical stuff, you can start with Lilian Weng's "What are Diffusion Models?", https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
26
diffusion as a Markov chain
27
from https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
• Last week, I saw an interesting article named "Perspectives on diffusion"
• Some of the perspectives are kinda tongue-in-cheek; some of them are really useful
If you have a hard time understanding the diffusion model
28 https://sander.ai/2023/07/20/perspectives.html
input and output tensors of the diffusion model
inputs
• input_1 (InputLayer) [(None, 77, 768)] 0 []: this is directly from text_encoder's output
• input_2 (InputLayer) [(None, 320)] 0 []: timestep embedding
• input_3 (InputLayer) [(None, 64, 64, 4)] 0 []: noise to be denoised
output
• padded_conv2d_83 (PaddedConv2D (None, 64, 64, 4) 11524 ['activation_67[0][0]']): reverse diffusion data, to be fed into the diffusion model or the decoder.
29
• In different phases of denoising, different levels of noise are needed.
• In stable diffusion, the timestep embedding comes from the Transformer's sinusoidal positional embedding.
• There are many articles discussing why sinusoidal positional embedding is used and why other related embeddings could be more sensitive to noise.
Timestep embedding
30
https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L151-L171
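A numpy version of the sinusoidal timestep embedding, following the CompVis util.py linked above (dim=320 matches input_2 of the diffusion model):

import numpy as np

def timestep_embedding(timestep, dim=320, max_period=10000):
    # Transformer-style sinusoidal embedding of a scalar timestep
    half = dim // 2
    freqs = np.exp(-np.log(max_period) *
                   np.arange(half, dtype=np.float32) / half)
    args = timestep * freqs
    return np.concatenate([np.cos(args), np.sin(args)])[None, :]  # (1, 320)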
• When looping through the diffusion model, the timestep is not enough. As shown in the code snippet below, the "latent" is changed by
• how much we want to follow the prompt, and
• some kind of moving average
• There are many schedulers; see https://huggingface.co/docs/diffusers/api/schedulers/overview for general information on schedulers/samplers.
Scheduling
31
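Since the original snippet was an image, here is a rough reconstruction of the loop's shape, modeled on Keras CV's text_to_image (names approximate; diffusion_model, timesteps, alphas, alphas_prev, and the contexts are assumed to come from the pipeline):

# latent: (1, 64, 64, 4) noise; context/unconditional_context: (1, 77, 768)
for index, timestep in list(enumerate(timesteps))[::-1]:
    latent_prev = latent
    t_emb = timestep_embedding(timestep)
    e_uncond = diffusion_model([latent, t_emb, unconditional_context])
    e_cond = diffusion_model([latent, t_emb, context])
    # classifier-free guidance: how much to follow the prompt
    e = e_uncond + unconditional_guidance_scale * (e_cond - e_uncond)
    # DDIM-style update mixing the prediction back into the latent
    a_t, a_prev = alphas[index], alphas_prev[index]
    pred_x0 = (latent_prev - (1 - a_t) ** 0.5 * e) / a_t ** 0.5
    latent = (1 - a_prev) ** 0.5 * e + a_prev ** 0.5 * pred_x0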
There are more
• For example, there is guidance scale:
• unconditional_guidance_scale in the Keras CV implementation code controls how closely the image should adhere to the prompt. Larger values result in images adhering more closely to the prompt, but make the image noisier.
• It's not a part of any of the models, but it is a part of the pipeline, used when looping the diffusion model.
32
• On the left-hand side of the figure, there are ℰ (pixel space to latent space encoder) and 𝒟 (latent space to pixel space decoder). The (ℰ, 𝒟) pair is actually part of a Variational Autoencoder (VAE) trained for Stable Diffusion.
• The Decoder is the 𝒟
The decoder
33
https://github.com/CompVis/latent-diffusion/blob/main/assets/modelfigure.png
Decoder I/O tensors
Input
• rescaling (Rescaling) (None, 64, 64, 4) 0
output
• padded_conv2d_37 (PaddedConv2D) (None, 512, 512, 3) 3459
Note that the output tensor is a float32 tensor with values within [−1.0, 1.0]
34
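A one-line sketch of turning the decoder output into an 8-bit image (assuming decoded holds the decoder output; this mirrors what Keras CV does at the end of text_to_image, as far as I can tell):

import numpy as np

# decoded: float32 (1, 512, 512, 3) in [-1.0, 1.0] -> uint8 RGB
image = np.clip((decoded + 1.0) / 2.0 * 255.0, 0, 255).astype(np.uint8)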
Representative data for PTQ
• representative data: we can borrow the tokenizer, random number generator, and timestep scheduler from the main loop
• Text encoder: tokenizer
• Diffusion model: tokenizer + text encoder + RNG + timestep embedding + scheduling
• Decoder: all the parts before the decoder :-)
35
Latency issue
• If you did whatever I discussed, you can mostly delegate all the ops to NNAPI. However, the end-to-end latency might be larger than expected.
• One thing we can do is "setprop debug.nn.vlog 1" to check if all the ops run on the accelerator(s) as expected
• An example is the tf batch norm implementation. In Keras, both Layer Norm and Group Norm use tf.nn.batch_normalization (https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization, https://github.com/tensorflow/tensorflow/blob/r2.13/tensorflow/python/ops/nn_impl.py#L1531-L1599)
• Going back to the original definition will help: γ(x − μ)/σ + β
36
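A sketch of what going back to the original definition means: compute γ(x − μ)/σ + β directly with a few rank-4-friendly ops instead of tf.nn.batch_normalization's generic graph (my own illustration, not the exact Keras patch):

import tensorflow as tf

def simple_norm(x, gamma, beta, axis=-1, eps=1e-5):
    # y = gamma * (x - mean) / sqrt(var + eps) + beta
    mean = tf.reduce_mean(x, axis=axis, keepdims=True)
    var = tf.reduce_mean(tf.square(x - mean), axis=axis, keepdims=True)
    return gamma * (x - mean) * tf.math.rsqrt(var + eps) + beta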
Running models on Android
• As we know from the previous discussion of PTQ, we need
• a tokenizer,
• timestep embedding and scheduling code, and
• a noise generator.
And using Python on Android is tricky.
• So I did some quick and dirty implementation in C++
• Why not in Java/Kotlin? Well, I usually work in a command-line environment
• Why not in C? C++ has more convenient vector mechanisms.
• Bonus: I also implemented an inpainting demo :-)
• https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/tree/main/cpp_glue_code
37
Is that all?
• Surely not. Optimization of attention and other layers for generative models is a hot topic.
• In case you don't know where to start, I put links to Google's and Apple's work at the end of the slide deck.
• Most likely, you'll see more and more generative models on your mobile devices at the end of 2023 or in early 2024.
38
Hopefully, running stable diffusion on Android is no longer magic to you :-)
39
Thank you. Q&A
40
Appendix
41
Stable Diffusion related optimizations from Google MediaPipe
• Two blog articles
• https://ai.googleblog.com/2023/06/speed-is-all-you-need-on-device.html
• https://ai.googleblog.com/2023/06/on-device-diffusion-plugins-for.html
• So it seems to be GPU delegate only
• Optimizations they discussed
• Fused softmax
• Winograd convolution
• GELU and group norm
• For a ControlNet-like feature, they proposed a lightweight MobileNetV2-based network
42
• https://github.com/apple/ml-stable-diffusion
• Some Einsum-op-based optimizations. Does that mean Apple's Neural Engine would perform well if Einsum ops are used?
Apple’s work for M1/M2 and iPhones
43