Standardising the compressed representation of neural networks
Werner Bailer
TEWI-Kolloquium, 2021-06-25
Outline
Context
State of the art in NN compression
Standard: need and design considerations
Coding methods
High-level syntax
Performance
Ongoing work and conclusion
Context
Artificial neural networks are widely used, e.g. in multimedia
visual and audio content recognition and classification
speech and natural language processing
Deep learning makes use of very large networks
many layers with nodes and connections
parameters/weights attached to each of them (e.g. convolution operations)
Context
Training
learn parameters from data
typically once, on powerful infrastructure
updates or adaptations in target environment may be necessary
Inference
use the trained network for prediction
needs the network with all its parameters, i.e., a large amount of data to be transmitted and processed
→ focus: small size to be transmitted
often used on resource-constrained devices (mobile phones, smart cameras, edge nodes, …)
→ focus: low memory and computational complexity during inference
SotA in NN Compression
typically three steps
reduction of parameters, e.g.
eliminating neurons (pruning)
reducing the entropy of a tensor
decomposing/transforming a tensor
reducing the precision of parameters (i.e. quantization)
performing entropy coding
[Han et al., ICLR 2016]
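As a rough, self-contained illustration of these three steps on a dummy weight tensor (the pruning threshold, quantization step size and the use of zlib as entropy coder are illustrative stand-ins, not the actual tools of the standard):

import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)  # dummy weight tensor

# 1) parameter reduction: magnitude pruning (small weights set to zero)
w_pruned = np.where(np.abs(w) > 0.02, w, 0.0).astype(np.float32)

# 2) quantization: uniform nearest-neighbour rounding to 8-bit integer levels
step = 0.005
levels = np.clip(np.round(w_pruned / step), -127, 127).astype(np.int8)

# 3) entropy coding: generic byte-level compressor as a stand-in for a real entropy coder
bitstream = zlib.compress(levels.tobytes(), level=9)

print(f"compressed to {len(bitstream) / w.nbytes:.1%} of the original float32 size")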
SotA in NN Compression
Pruning and sparsification
large number of methods
simple to implement
strongly depends on initial model
at low compression rates, performance improvements are possible
performance loss can often be regained by retraining
computational effort is a concern
[Blalock et al., MLSys 2020]
SotA in NN Compression
Lottery ticket hypothesis [Frankle and Carbin, ICLR 2018]
for a dense, randomly initialized feed-forward NN there exist subnetworks that achieve the same performance at similar training cost
such a subnetwork can be obtained by pruning, without further training
retrain from an early training iteration
Inference complexity gain
straightforward for pruning layers, channels, etc.
sparse tensors need specific support (e.g. recently added to CUDA)
usually a certain sparsity ratio needs to be met to get a gain
SotA in NN Compression - Quantisation
possible to go below 10-bit parameters for many common NNs without performance loss
many works show that losses can be regained with fine-tuning (or by training with the quantized network in the first place)
mixed precision (e.g., per layer) is beneficial for many networks
going to very low precision (1 bit) is feasible (needs approaches to ensure backprop still works, e.g., the asymptotic-quantized estimator [Chen et al., IEEE JSTSP 2020])
Inference complexity gain
needs support on target platform, and in DL framework (+ any custom operators)
increasingly supported on GPUs and specific NN processors (at least 4-, 8- and 16-bit)
Relation to Network Architecture Search (NAS)
finding an alternative architecture and training it
architecture search is computationally expensive
training is then done using e.g. knowledge distillation (teacher-student learning)
Faster NAS methods have been proposed, e.g. Single-Path Mobile AutoML [Stamoulis et al., IEEE JSTSP 2020]
NAS also needs access to the full training data, while fine-tuning could be done on partial or application-specific data
[Strubell et al., ACL 2019]
Relation to target hardware
Optimising for target hardware
supported operations (e.g., sparse matrix multiplication, weight precisions)
relative costs of memory access and computing operations
train a network once, then derive networks for particular platforms
first approaches exist [Cai, ICLR 2020]
no reliable prediction of inference costs on a particular target architecture, in particular of speed and energy consumption
cf. autotuning in the inference stage of DL frameworks
Need for a standardised interface
[Diagram: accelerator libraries supporting multiple backends]
Standardization in MPEG (ISO/IEC JTC1 SC29)
Develop an interoperable compressed representation of neural networks
Leverage the know-how in MPEG on compression of various types of (multimedia) data
Enable multimedia applications to benefit from the progress in machine learning using deep neural networks
Cover a broad set of relevant use cases
image classification, visual content matching, content coding and audio classification selected as applications in which the technology is validated
Design Considerations
Interoperability with exchange formats (NNEF, ONNX) and formats of common DL frameworks
Reuse existing approaches for representing topology
Agnostic to inference platform and its specificities
Different types of networks, applications, … may be best served by different compression tools
Evaluating compression technologies
Compression ratio
Reconstruction of original parameters (cf. PSNR for multimedia data)
not a useful metric
performance in the target application (e.g., image classification) needs to be measured (cf. perceptual quality metrics for multimedia)
requires models and data sets for each target application
runtime/memory consumption
of the encoding/decoding process
of inference using the resulting model
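A minimal sketch of these measurements; the file paths and the task-specific evaluation function are placeholders that depend on the target application:

import os

def compression_ratio(original_path: str, compressed_path: str) -> float:
    """Compression ratio of the stored/transmitted model (smaller is better)."""
    return os.path.getsize(compressed_path) / os.path.getsize(original_path)

def task_performance(model, validation_set, evaluate) -> float:
    """Task performance of the reconstructed model, e.g. top-1 accuracy for
    image classification; 'evaluate' is an application-specific function."""
    return evaluate(model, validation_set)

# hypothetical usage:
# ratio = compression_ratio("resnet50_original.onnx", "resnet50_compressed.bin")
# top1  = task_performance(reconstructed_model, validation_set, evaluate_top1)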
Evaluating compression technologies
compression ratio of the stored/transmitted model (format must be considered)
compression ratio of the model in memory (depends on platform)
performance for the task (with/without fine-tuning)
Standard as a Toolbox
[Diagram: NNR encoder and decoder pipeline – the original neural net O is encoded into bitstream S and reconstructed as neural net R]
NNR encoder
optional preprocessing / parameter reduction: sparsification, pruning, low-rank decomposition, unification, batchnorm folding, local scaling (with optional postprocessing / fine-tuning)
quantization: uniform nearest-neighbour quantization, codebook quantization, dependent quantization
entropy coding: DeepCABAC
NNR decoder
entropy decoding: DeepCABAC
value reconstruction / rescaling
network reconstruction / re-composition
the original net can be represented in any NN format (maybe not efficiently); the bitstream is the representation for storage/transmission; the reconstructed net is the representation for inference
Parameter Reduction - Sparsification
General sparsification
perform fine-tuning with a sparsity loss; the overall loss is a weighted sum of the task-specific and sparsity losses
set weights below a threshold to zero
Micro-structured sparsification
based on the General Matrix Multiplication (GEMM) representation
reshape the tensor into a 1-D, 2-D or 3-D block structure
set blocks of weights to zero
keep a mask of the blocks that have been set to zero
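A sketch of both variants for a 2-D (GEMM-shaped) weight matrix; the magnitude threshold, the 4x4 block size and the keep ratio are illustrative, and the fine-tuning with an added sparsity loss is not shown:

import numpy as np

def general_sparsify(w: np.ndarray, threshold: float) -> np.ndarray:
    """Set weights below a magnitude threshold to zero (the threshold would
    normally be chosen after fine-tuning with an added sparsity loss)."""
    return np.where(np.abs(w) < threshold, 0.0, w)

def micro_structured_sparsify(w: np.ndarray, block: int, keep_ratio: float):
    """Zero whole blocks of a GEMM-shaped weight matrix and return the
    sparsified matrix plus the mask of blocks that were kept."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0  # illustrative; pad in practice
    blocks = w.reshape(rows // block, block, cols // block, block)
    norms = np.abs(blocks).sum(axis=(1, 3))          # one score per block
    cut = np.quantile(norms, 1.0 - keep_ratio)
    mask = norms >= cut                              # True = block kept
    blocks = blocks * mask[:, None, :, None]
    return blocks.reshape(rows, cols), mask

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
w_sparse, mask = micro_structured_sparsify(w, block=4, keep_ratio=0.5)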
Parameter Reduction - Pruning
Pruning is used in combination with one of the sparsification methods
Perform weight importance estimation
graph diffusion on output channels → estimate of weight redundancy
Iterative algorithm
determine weight importance
prune neurons according to target pruning ratio
apply sparsification
repeat until target ratios are reached
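A sketch of this iterative loop; weight importance is approximated here by a plain per-channel L1 norm rather than the graph-diffusion estimate, and the fine-tuning between iterations is only indicated by a comment:

import numpy as np

def prune_step(w: np.ndarray, prune_ratio: float) -> np.ndarray:
    """Zero the least important output channels (rows) of a weight matrix;
    importance is approximated by the L1 norm of each row."""
    importance = np.abs(w).sum(axis=1)
    n_prune = int(prune_ratio * w.shape[0])
    pruned_rows = np.argsort(importance)[:n_prune]
    w = w.copy()
    w[pruned_rows, :] = 0.0
    return w

def iterative_prune(w, target_pruning_ratio, target_sparsity, steps=4, threshold=1e-2):
    for i in range(1, steps + 1):
        w = prune_step(w, target_pruning_ratio * i / steps)  # grow pruning ratio
        w = np.where(np.abs(w) < threshold, 0.0, w)          # sparsification pass
        # in practice: fine-tune here to recover accuracy
        if (w == 0).mean() >= target_sparsity:               # target ratio reached
            break
    return w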
Parameter Reduction – LR Decomposition, Unification
Low-rank decomposition
decompose weight matrix (using SVD)
choose rank r to trade off reconstruction error vs. number of parameters
Unification
generalisation of micro-structured sparsification
instead of setting a block to zero, set it to an approximated value (keeping the sign)
keep a mask of the unified blocks
an optimised multiplication can be applied if the mask is used
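A sketch of the low-rank step for a dense weight matrix; the rank r is illustrative and would be chosen per layer to balance reconstruction error against parameter count:

import numpy as np

def low_rank_decompose(w: np.ndarray, r: int):
    """Approximate w (m x n) by the product of an m x r and an r x n matrix
    obtained from a truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :r] * s[:r]       # m x r factor (singular values folded in)
    b = vt[:r, :]              # r x n factor
    return a, b

w = np.random.default_rng(0).normal(size=(256, 512)).astype(np.float32)
a, b = low_rank_decompose(w, r=32)
err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
print(f"relative error {err:.3f}, parameters {w.size} -> {a.size + b.size}")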
Parameter Reduction – Batchnorm folding, Local scaling
Batch normalisation is commonly used in NNs
store the batchnorm parameters and fold them into the weights
better compressibility of the transformed weights
Local scaling
scaling factor per row
factors can be adjusted using fine-tuning
if used together with batchnorm folding, no additional parameters are added
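A sketch of the folding step, assuming the usual batch-norm formulation y = gamma * (x - mean) / sqrt(var + eps) + beta applied after a conv/dense layer whose weights have shape (out_channels, ...):

import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding layer's weights and bias,
    one scale factor per output channel."""
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded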
Quantisation
Uniform Nearest Neighbor Quantization
Codebook Quantization
Dependent Scalar Quantization
trellis-coded quantisation: two scalar quantisers and a procedure for switching between them (state machine with 8 states)
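A sketch of the first two options; the step size and codebook size are illustrative, the codebook is fitted with a tiny k-means, and dependent (trellis-coded) quantization is not shown:

import numpy as np

def uniform_quantize(w: np.ndarray, step: float):
    """Uniform nearest-neighbour quantization: integer levels plus step size."""
    levels = np.round(w / step).astype(np.int32)
    return levels, levels * step               # levels to be coded, reconstruction

def codebook_quantize(w: np.ndarray, k: int = 16, iters: int = 20):
    """Codebook quantization: map each weight to the nearest codebook entry."""
    flat = w.reshape(-1)
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k))   # spread-out init
    for _ in range(iters):                                    # tiny k-means
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                codebook[j] = flat[idx == j].mean()
    return idx.reshape(w.shape), codebook      # indices to be coded plus codebook

w = np.random.default_rng(0).normal(scale=0.05, size=(64, 64)).astype(np.float32)
levels, w_uniform = uniform_quantize(w, step=0.01)
indices, codebook = codebook_quantize(w, k=16)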
Entropy Coding
DeepCABAC
Adaptation of Context-Adaptive Binary Arithmetic Coding (CABAC)
Binarization
Context modelling: separate models for each of the flags, selected from a fixed set of models
Arithmetic coding applied to regular and bypass bins
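DeepCABAC itself is considerably more involved and uses actual binary arithmetic coding; the toy sketch below only illustrates the underlying idea of context-adaptive binary coding: quantized levels are binarized into flags, each flag type has its own adaptive probability model, and the ideal code length follows from the current probability estimate:

import numpy as np

class AdaptiveBinaryModel:
    """Toy context model: running estimate of P(bit = 1)."""
    def __init__(self):
        self.ones, self.total = 1, 2                  # small prior counts
    def cost(self, bit: int) -> float:
        p1 = self.ones / self.total
        return -np.log2(p1 if bit else 1.0 - p1)      # ideal code length in bits
    def update(self, bit: int):
        self.ones += bit
        self.total += 1

def estimate_bits(levels: np.ndarray) -> float:
    """Binarize each quantized level into significance, sign and unary magnitude
    flags, each with its own adaptive model, and sum the ideal code lengths."""
    sig, sign, mag = AdaptiveBinaryModel(), AdaptiveBinaryModel(), AdaptiveBinaryModel()
    bits = 0.0
    for v in levels.reshape(-1):
        b = int(v != 0)
        bits += sig.cost(b)
        sig.update(b)
        if b:
            s = int(v < 0)
            bits += sign.cost(s)
            sign.update(s)
            for _ in range(abs(int(v)) - 1):          # toy unary magnitude coding
                bits += mag.cost(1)
                mag.update(1)
            bits += mag.cost(0)
            mag.update(0)
    return bits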
Decoding
Decoding an encoded tensor outputs integer or floating-point values
Block: set of combined tensors (e.g., components of decomposed tensor)
Parallel decoding of parts of a tensor is supported
option to specify entry points for the parts during encoding
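A sketch of the entry-point idea using a generic byte compressor: the tensor is encoded as independent parts whose byte offsets (entry points) are recorded, so the parts can be decoded in parallel; the actual NNR signalling of entry points differs:

import zlib
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def encode_with_entry_points(tensor: np.ndarray, n_parts: int):
    """Compress row-wise parts independently and record their byte offsets."""
    parts = np.array_split(tensor, n_parts, axis=0)
    chunks = [zlib.compress(p.tobytes()) for p in parts]
    entry_points, offset = [], 0
    for c in chunks:
        entry_points.append(offset)
        offset += len(c)
    return b"".join(chunks), entry_points, [p.shape for p in parts]

def decode_parallel(bitstream, entry_points, shapes, dtype=np.float32):
    """Decode all parts in parallel using the recorded entry points."""
    ends = entry_points[1:] + [len(bitstream)]
    def decode(i):
        raw = zlib.decompress(bitstream[entry_points[i]:ends[i]])
        return np.frombuffer(raw, dtype=dtype).reshape(shapes[i])
    with ThreadPoolExecutor() as pool:
        return np.concatenate(list(pool.map(decode, range(len(shapes)))), axis=0)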
High-level Syntax
Sequence of units
size, header, payload
Unit types
Start unit
Model parameter set
Layer parameter set
Topology
Quantisation data
Compressed data
Aggregated unit: grouping units for joint decoding
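A hypothetical sketch of iterating over such a unit stream; the field widths, the interpretation of the size field and the type codes below are illustrative only and do not reproduce the normative NNR syntax:

import io
import struct
from dataclasses import dataclass

# illustrative unit type codes, not the normative values
UNIT_TYPES = {0: "start", 1: "model_parameter_set", 2: "layer_parameter_set",
              3: "topology", 4: "quantisation_data", 5: "compressed_data"}

@dataclass
class Unit:
    unit_type: str
    payload: bytes

def read_units(stream: io.BufferedIOBase):
    """Yield (size, header, payload) units from a byte stream, assuming the size
    field counts the whole unit including the 4-byte size and 1-byte header."""
    while True:
        size_bytes = stream.read(4)
        if len(size_bytes) < 4:
            return                                            # end of stream
        size = struct.unpack(">I", size_bytes)[0]             # unit size
        unit_type = UNIT_TYPES.get(stream.read(1)[0], "reserved")  # 1-byte header
        payload = stream.read(size - 5)                       # remaining payload
        yield Unit(unit_type, payload)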
Interoperability with exchange formats
Include network topology in encoded bitstream
ONNX, NNEF, TensorFlow, PyTorch
Supports encoding just some of the tensors
Compatibility with quantisation formats supported in those formats
Include encoded tensors in exchange format
Recommended approach for NNEF and ONNX
Performance
[Figure: Top-1 accuracy (%) / PSNR (dB) vs. compression ratio (0–0.25) for pretrained models – VGG16, ResNet50, MobileNetV2 (image classification), DCase (audio classification), UC12B (image encoding); NCTM vs. BZIP2]
Performance
[Figure: Top-1 accuracy (%) / PSNR (dB) vs. compression ratio (0–0.15) for models after low-rank decomposition – VGG16, ResNet50, MobileNetV2 (image classification), DCase (audio classification), UC12B (image encoding); NCTM vs. BZIP2]
Ongoing work: incremental compression
Use cases that need to send updated models
e.g. deploy to mobile devices, federated learning
Encode model w.r.t. base model
support tensor updates and structural changes, e.g. transfer learning with a different number of output classes
Initial results
updates in distributed training can be represented at <1% of the base model size
[Diagram: federated learning scenario – institutions train a VGG-16 on local data, updates are aggregated at a central server; NCTM applicable to the exchanged models]
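A sketch of the basic idea for tensor updates: quantize and entropy-code only the difference with respect to the base model and apply it at the receiver; zlib and the step size stand in for the actual coding tools, and structural changes would need additional signalling not shown here:

import zlib
import numpy as np

def encode_update(base: dict, updated: dict, step: float = 1e-3):
    """Quantize and compress per-tensor differences w.r.t. the base model."""
    update = {}
    for name, w in updated.items():
        diff = np.round((w - base[name]) / step).astype(np.int32)
        update[name] = zlib.compress(diff.tobytes())
    return update

def apply_update(base: dict, update: dict, step: float = 1e-3):
    """Reconstruct the updated model from the base model and the coded update."""
    new_model = {}
    for name, blob in update.items():
        diff = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
        new_model[name] = base[name] + step * diff.reshape(base[name].shape)
    return new_model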
Conclusion
standard for compressing NN parameters
compresses models to less than 10% of their original size without performance loss
interoperability with exchange formats
status
compression standard is going to final ballot
reference software under development (first public release in autumn)
work on incremental compression ongoing
www.joanneum.at/digital
