
Standardising the compressed representation of neural networks

Artificial neural networks have been adopted for a broad range of tasks in multimedia analysis and processing, such as visual and acoustic classification, extraction of multimedia descriptors, or image and video coding. The trained neural networks for these applications contain a large number of parameters (weights), resulting in a considerable size. Thus, transferring them to the many clients that use them in applications (e.g., mobile phones, smart cameras) benefits from a compressed representation of neural networks.

MPEG Neural Network Coding and Representation (NNR) is the first international standard for efficient compression of neural networks (NNs). The standard is designed as a toolbox of compression methods, which can be used to create coding pipelines. It can either be used as an independent coding framework (with its own bitstream format) or together with external neural network formats and frameworks. To provide the highest degree of flexibility, the network compression methods operate per parameter tensor, so that proper decoding is always ensured, even if no structure information is provided. The standard contains compression-efficient quantization and an arithmetic coding scheme (DeepCABAC) as core encoding and decoding technologies, as well as neural network parameter pre-processing methods such as sparsification, pruning, low-rank decomposition, unification, local scaling, and batch norm folding. NNR achieves a compression efficiency of more than 97% for transparent coding cases, i.e., without degrading classification quality, such as top-1 or top-5 accuracies.

This talk presents an overview of the context, technical features, and characteristics of the NN coding standard, and discusses ongoing topics such as incremental neural network representation.


Slides

  1. Standardising the compressed representation of neural networks
     Werner Bailer, TEWI-Kolloquium, 2021-06-25
  2. Outline
     • Context
     • State of the art in NN compression
     • Standard: need and design considerations
     • Coding methods
     • High-level syntax
     • Performance
     • Ongoing work and conclusion
  3. Context
     • Artificial neural networks are widely used, e.g. in multimedia: visual and audio content recognition and classification, speech and natural language processing
     • Deep learning makes use of very large networks: many layers with nodes and connections, with parameters/weights attached to each of them (e.g. convolution operations)
  4. Context
     • Training: learn parameters from data; typically done once, on powerful infrastructure; updates or adaptations in the target environment may be necessary
     • Inference: use the trained network for prediction; needs the network with all its parameters, i.e. a large amount of data to be transmitted and processed → focus: small size to be transmitted
     • Networks are often used on resource-constrained devices (mobile phones, smart cameras, edge nodes, …) → focus: low memory and computational complexity during inference
  5. SotA in NN compression
     Typically three steps [Han et al., ICLR 2016]:
     • reduction of parameters, e.g. eliminating neurons (pruning), reducing the entropy of a tensor, or decomposing/transforming a tensor
     • reducing the precision of parameters (i.e. quantization)
     • performing entropy coding
  6. SotA in NN compression – pruning and sparsification
     • a large number of methods, simple to implement
     • results strongly depend on the initial model
     • at low compression rates, performance improvements are possible
     • performance loss can often be regained by training again, but the computational effort is a concern
     [Blalock et al., MLSys 2020]
  7. SotA in NN compression – lottery ticket hypothesis
     • for a dense, randomly initialized feed-forward NN, there are subnetworks that achieve the same performance at similar training cost [Frankle and Carbin, ICLR 2018]
     • such a subnetwork can be obtained by pruning, without further training: retrain from an early training iteration
     • inference complexity gains are straightforward when pruning layers, channels, etc.
     • sparse tensors need specific support (e.g. recently added to CUDA); usually a certain sparsity ratio needs to be met to obtain a gain
  8. SotA in NN compression – quantisation
     • going below 10-bit parameters is possible for many common NNs without performance loss
     • many works show that losses can be regained with fine-tuning (or by training with the quantized network in the first place)
     • mixed precision (e.g., per layer) is beneficial for many networks
     • very low precision (1 bit) is feasible, but needs approaches to ensure backpropagation still works, e.g. the asymptotic-quantized estimator [Chen et al., IEEE JSTSP 2020]
     • inference complexity gains need support on the target platform and in the DL framework (plus any custom operators); increasingly supported on GPUs and specific NN processors (at least for 4, 8 and 16 bit)
  9. Relation to Network Architecture Search (NAS)
     • finds an alternative architecture, which is then trained; architecture search is computationally expensive [Strubell et al., ACL 2019]
     • training is then done using e.g. knowledge distillation (teacher-student learning)
     • faster NAS methods have been proposed, e.g. Single-Path Mobile AutoML [Stamoulis et al., IEEE JSTSP 2020]
     • NAS also needs access to the full training data, while fine-tuning can be done on partial or application-specific data
  10. Relation to target hardware
     • optimising for target hardware means considering the supported operations (e.g., sparse matrix multiplication, weight precisions) and the relative costs of memory access and computing operations
     • training one network and deriving networks for particular platforms: first approaches exist [Cai, ICLR 2020]
     • there is no reliable prediction of inference costs on a particular target architecture, in particular of speed and energy consumption (cf. autotuning in the inference engines of DL frameworks)
  11. Need for a standardised interface
     [Diagram: accelerator libraries supporting multiple backends]
  12. Standardization in MPEG (ISO/IEC JTC 1/SC 29)
     • develop an interoperable compressed representation of neural networks
     • leverage MPEG's know-how in compressing various types of (multimedia) data
     • enable multimedia applications to benefit from the progress in machine learning using deep neural networks
     • cover a broad set of relevant use cases: image classification, visual content matching, content coding and audio classification were selected as the applications in which the technology is validated
  13. Design considerations
     • interoperability with exchange formats (NNEF, ONNX) and the formats of common DL frameworks
     • reuse of existing approaches for representing the topology
     • agnostic to the inference platform and its specificities
     • different types of networks, applications, … may be best served by different compression tools
  14. Evaluating compression technologies
     • compression ratio
     • reconstruction error of the original parameters (cf. PSNR for multimedia data) is not a useful metric; instead, performance in the target application (e.g., image classification) needs to be measured (cf. perceptual quality metrics for multimedia), which requires models and data sets for each target application
     • runtime/memory consumption of the encoding/decoding process and of inference using the resulting model
  15. Evaluating compression technologies
     • compression ratio of the stored/transmitted model (the format must be considered)
     • compression ratio of the model in memory (depends on the platform)
     • performance for the task (with/without fine-tuning); see the sketch below
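A minimal sketch of the first and last of these criteria (the helper names are hypothetical, not from the standard): the stored-model compression ratio, and a "transparency" check on task accuracy.

```python
import os

def compression_ratio(original_path: str, compressed_path: str) -> float:
    """Ratio of compressed to original file size (smaller is better)."""
    return os.path.getsize(compressed_path) / os.path.getsize(original_path)

def is_transparent(acc_original: float, acc_reconstructed: float,
                   tolerance: float = 0.0) -> bool:
    """Coding is 'transparent' if task accuracy (e.g., top-1 on a
    validation set) does not degrade beyond the tolerance."""
    return acc_original - acc_reconstructed <= tolerance
```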
  16. Standard as a toolbox
     [Diagram: NNR encoder/decoder pipeline from original neural net O, via bitstream, to reconstructed neural net R]
     • optional preprocessing / parameter reduction: sparsification, pruning, low-rank decomposition, unification, batchnorm folding, local scaling
     • quantization: uniform nearest-neighbour quantization, codebook quantization, dependent quantization
     • entropy coding: DeepCABAC
     • decoding: entropy decoding (DeepCABAC), value reconstruction/rescaling, network reconstruction/re-composition; optional postprocessing/fine-tuning
     • the reconstructed net can be represented in any NN format (maybe not efficiently); the representation for inference differs from the representation for storage/transmission
  17. Parameter reduction – sparsification
     • general sparsification: perform fine-tuning with an overall loss that is a weighted sum of the task-specific loss and a sparsity loss, then set weights below a threshold to zero
     • micro-structured sparsification: motivated by General Matrix Multiplication (GEMM); reshape the tensor into a 1-D, 2-D or 3-D block structure, set blocks of weights to zero, and keep a mask of the blocks that were zeroed (see the sketch below)
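A toy sketch of both flavours, assuming a 2-D tensor and a simple magnitude criterion (the actual encoder's thresholds and fine-tuning loop are more involved):

```python
import numpy as np

def threshold_sparsify(w: np.ndarray, thr: float) -> np.ndarray:
    """General sparsification: zero out weights with small magnitude."""
    return np.where(np.abs(w) < thr, 0.0, w)

def block_sparsify(w: np.ndarray, block: int, thr: float):
    """Micro-structured sparsification on a 2-D tensor: zero out whole
    blocks whose mean magnitude is below the threshold, keep a mask."""
    assert w.shape[0] % block == 0 and w.shape[1] % block == 0
    w = w.copy()
    mask = np.ones((w.shape[0] // block, w.shape[1] // block), dtype=bool)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            if np.abs(w[i:i+block, j:j+block]).mean() < thr:
                w[i:i+block, j:j+block] = 0.0
                mask[i // block, j // block] = False  # block was zeroed
    return w, mask
```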
  18. Parameter reduction – pruning
     • pruning is used in combination with one of the sparsification methods
     • weight importance estimation: graph diffusion on the output channels yields an estimate of weight redundancy
     • iterative algorithm (sketched below): determine weight importance, prune neurons according to the target pruning ratio, apply sparsification, and repeat until the target ratios are reached
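A schematic of the iterative loop. Note the importance estimate below is a plain L2-norm stand-in, not the graph-diffusion estimate the method actually uses:

```python
import numpy as np

def prune_channels(w: np.ndarray, ratio: float) -> np.ndarray:
    """Zero the output channels (rows) with the lowest importance until
    `ratio` of the channels are pruned (toy importance: row L2 norm)."""
    importance = np.linalg.norm(w, axis=1)
    k = int(ratio * w.shape[0])
    w = w.copy()
    w[np.argsort(importance)[:k], :] = 0.0   # prune least important rows
    return w

def iterative_prune(w: np.ndarray, target: float, steps: int = 4):
    """Grow the pruning ratio step by step, sparsifying in between."""
    for s in range(1, steps + 1):
        w = prune_channels(w, target * s / steps)
        w = np.where(np.abs(w) < 1e-3, 0.0, w)  # sparsification step
    return w
```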
  19. Parameter reduction – low-rank decomposition, unification
     • low-rank decomposition: decompose the weight matrix using SVD; choose the rank r to trade off reconstruction error against the number of parameters (see the sketch below)
     • unification: a generalisation of micro-structured sparsification; instead of setting a block to zero, set it to an approximated value (keeping the signs), and keep a mask of the unified blocks; an optimised multiplication can be applied if the mask is used
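A minimal sketch of the low-rank step via truncated SVD, showing the parameter-count/error trade-off (toy code, not the standard's syntax):

```python
import numpy as np

def low_rank(w: np.ndarray, r: int):
    """Factor an (m, n) weight matrix into (m, r) and (r, n) factors.
    Parameters drop from m*n to r*(m+n); the reconstruction error
    grows as r shrinks."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :r] * s[:r]   # (m, r), singular values folded in
    b = vt[:r, :]          # (r, n)
    return a, b

w = np.random.randn(256, 512)
a, b = low_rank(w, r=32)
rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```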
  20. Parameter reduction – batchnorm folding, local scaling
     • batch normalisation is commonly used in NNs: store the batchnorm parameters and fold them into the weights, which improves the compressibility of the transformed weights (see the sketch below)
     • local scaling: a scaling factor per row; the factors can be adjusted by fine-tuning; if used together with batchnorm folding, no additional parameters are added
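The folding itself is the standard textbook transformation; a sketch for a fully connected layer with per-output-channel batch norm:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """y = gamma * (w @ x + b - mean) / sqrt(var + eps) + beta
    is rewritten as y = w' @ x + b' with folded parameters."""
    scale = gamma / np.sqrt(var + eps)   # one factor per output row
    w_folded = w * scale[:, None]
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```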
  21. Quantisation
     • uniform nearest-neighbour quantization (see the sketch below)
     • codebook quantization
     • dependent scalar quantization: trellis-coded quantisation using two scalar quantisers and a procedure for switching between them (a state machine with 8 states)
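A sketch of the simplest of the three, uniform nearest-neighbour quantization with a given step size (the standard additionally signals the quantization parameters in the bitstream):

```python
import numpy as np

def quantize_uniform(w: np.ndarray, step: float) -> np.ndarray:
    """Map each weight to the nearest integer multiple of `step`."""
    return np.round(w / step).astype(np.int32)

def dequantize_uniform(q: np.ndarray, step: float) -> np.ndarray:
    """Reconstruct the weights from the integer levels."""
    return q.astype(np.float32) * step
```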
  22. Entropy coding – DeepCABAC
     • an adaptation of Context-Adaptive Binary Arithmetic Coding (CABAC) to neural network parameters
     • binarization (sketched below)
     • context modelling: separate models for each of the flags, selected from a fixed set of models
     • arithmetic coding applied to regular and bypass bins
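A simplified sketch of the binarization step only. The flag structure below is schematic; the standard defines the exact binarization, flags and context models:

```python
def exp_golomb(n: int):
    """Exp-Golomb bits of a non-negative integer: (len-1) zeros, then n+1."""
    b = format(n + 1, 'b')
    return [0] * (len(b) - 1) + [int(c) for c in b]

def binarize(level: int, num_flags: int = 4):
    """Return (bin, is_bypass) pairs for one quantized weight:
    significance and sign flags plus a truncated-unary prefix are
    context-coded ('regular'); the remainder is coded in bypass mode."""
    bins = [(int(level != 0), False)]        # significance flag
    if level == 0:
        return bins
    bins.append((int(level < 0), False))     # sign flag
    mag = abs(level) - 1
    for i in range(num_flags):               # truncated-unary prefix
        bins.append((int(mag > i), False))
        if mag <= i:
            return bins
    bins += [(bit, True) for bit in exp_golomb(mag - num_flags)]
    return bins
```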
  23. Decoding
     • the output of an encoded tensor is integer or floating point
     • block: a set of tensors decoded together (e.g., the components of a decomposed tensor)
     • parallel decoding of parts of a tensor is supported, with the option to specify entry points for the parts during encoding
  24. High-level syntax
     • the bitstream is a sequence of units, each consisting of size, header and payload (see the sketch below)
     • unit types: start unit, model parameter set, layer parameter set, topology, quantisation data, compressed data
     • aggregated unit: groups units for joint decoding
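A toy illustration of walking a "size, header, payload" unit sequence. The field widths here are invented for the example; the standard defines the actual syntax and unit-type codes:

```python
import struct

def read_units(buf: bytes):
    """Walk a byte buffer of units: assumed 4-byte big-endian size
    (counting everything after the size field), a 1-byte unit-type
    header, then the payload."""
    units, pos = [], 0
    while pos < len(buf):
        size, = struct.unpack_from('>I', buf, pos)   # unit size
        unit_type = buf[pos + 4]                     # header: type byte
        payload = buf[pos + 5:pos + 4 + size]        # rest of the unit
        units.append((unit_type, payload))
        pos += 4 + size                              # jump to next unit
    return units
```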
  25. Interoperability with exchange formats
     • include the network topology in the encoded bitstream: ONNX, NNEF, TensorFlow, PyTorch; encoding only some of the tensors is supported; compatible with the quantisation formats supported by those formats
     • include encoded tensors in the exchange format: the recommended approach for NNEF and ONNX
  26. Performance
     [Results chart]
  27. Performance
     [Results chart]
  28. Performance
     [Chart: top-1 accuracy in % (image classification: pretrained VGG16, ResNet50, MobileNetV2; audio classification: DCase) and PSNR in dB (image encoding: UC12B) vs. compression ratio, comparing NCTM and BZIP2]
  29. Performance
     [Chart: the same models and metrics vs. compression ratio with low-rank decomposition, comparing NCTM and BZIP2]
  30. Ongoing work: incremental compression
     • use cases that need to send updated models, e.g. deployment to mobile devices, federated learning
     • encode a model w.r.t. a base model; support tensor updates as well as structural changes, e.g. transfer learning with a different number of output classes
     • initial results: updates in distributed training can be represented at <1% of the base model size (see the sketch below)
     [Diagram: federated learning with institutions A and B training VGG-16 on local data and a central server aggregating the updates, with NCTM available at each node]
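The core idea can be sketched as quantizing the difference to the base model, which is mostly near zero and therefore entropy-codes compactly (toy code; the actual incremental syntax also covers structural changes):

```python
import numpy as np

def encode_update(base: np.ndarray, updated: np.ndarray,
                  step: float) -> np.ndarray:
    """Quantize the difference to the base tensor; most entries round
    to zero, which an entropy coder represents very compactly."""
    return np.round((updated - base) / step).astype(np.int32)

def apply_update(base: np.ndarray, q_delta: np.ndarray,
                 step: float) -> np.ndarray:
    """Reconstruct the updated tensor on the receiving side."""
    return base + q_delta.astype(np.float32) * step
```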
  31. Conclusion
     • a standard for compressing NN parameters: compresses models to less than 10% of their size without performance loss
     • interoperability with exchange formats
     • status: the compression standard is going to its final ballot; the reference software is under development (first public release in autumn); work on incremental compression is ongoing
  32. www.joanneum.at/digital


