This document summarizes a presentation on pipelined compression in remote GPU virtualization systems using rCUDA. It introduces remote GPU virtualization and the challenges posed by slow networks, then describes a pipelined compression architecture that compresses data on the fly during transfers. Experimental results show that compression libraries reduce execution time by 1 to 6 minutes across several machine learning models. An analysis of the traffic finds that over 90% of transfers are small, between 1 byte and 1,023 bytes, and could benefit from further compression. The initial implementation shows potential for reducing execution time but leaves room for improvement.
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early Experiences
1. Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early Experiences
Cristian Peñaranda, Carlos Reaño and Federico Silla
ICPP 2022 DUAC Workshop
August 29, 2022
6. Compression architecture without pipeline
[Figure: the client side compresses the whole data buffer from host memory and then sends it; the server side receives the complete buffer, decompresses it, and finally transfers it to GPU memory. Compression, transfer, and decompression happen one after another.]
7. Pipeline compression architecture
[Figure: the data buffer is split into chunks. On the client side a pipeline compresses chunks and sends each one as soon as it is ready; on the server side a pipeline receives, decompresses, and transfers the chunks to GPU memory. Compression, transfer, and decompression therefore overlap.]
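The chunked pipeline in the figure can be sketched in a few lines. The following is a minimal illustration, not rCUDA's actual implementation: it uses Python threads, zlib as a stand-in compressor, and hypothetical helper names (`pipelined_send`, `pipelined_recv`, and the `send` callback are assumptions), but it shows the key idea of overlapping compression of one chunk with the transfer of the previous one.

```python
import queue
import threading
import zlib

CHUNK_SIZE = 1024  # rCUDA transfers data in 1,024-byte chunks


def pipelined_send(data: bytes, send):
    """Compress fixed-size chunks while previously compressed
    chunks are being sent, so both stages overlap."""
    q: "queue.Queue" = queue.Queue(maxsize=4)  # bounded pipeline depth

    def sender():
        # Transfer compressed chunks as soon as they are produced.
        while (chunk := q.get()) is not None:
            send(chunk)

    t = threading.Thread(target=sender)
    t.start()
    for off in range(0, len(data), CHUNK_SIZE):
        # Compressing chunk i while chunk i-1 is still in flight.
        q.put(zlib.compress(data[off:off + CHUNK_SIZE]))
    q.put(None)  # end-of-stream marker
    t.join()


def pipelined_recv(chunks):
    """Server side: decompress chunks as they arrive."""
    return b"".join(zlib.decompress(c) for c in chunks)
```

In a real deployment the `send` callback would be a network write and the receiver would forward each decompressed chunk to GPU memory instead of concatenating it.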
8. Machine learning applications
Alexnet: evaluates the inference time using the AlexNet CNN model.
Cifar10: uses the Cifar10 dataset to evaluate the image classification of a simple CNN.
Mnist: evaluates the image classification using a LeNet-5-like CNN and the Mnist dataset.
Inception: uses the flowers dataset to evaluate the image classification using the Inception-V3 CNN model.
9. Compression libraries
Smash: Benchmark of compression libraries
● 41 different lossless compression libraries.
● Different options to configure compression libraries.
● Available at https://github.com/cpenaranda/smash
10. Compression libraries
Lz4: based on LZ77, focused on fast compression and decompression.
Zlib: uses a combination of LZ77 and Huffman coding.
Snappy: based on LZ77 and created by Google. It is focused on getting a shorter computation time.
Zstandard (Zstd): created by Meta and based on LZ77 with a combination of a fast Finite State Entropy and Huffman coding.
Gipfeli: based on LZ77 and developed by Google. It is focused on getting higher compression ratios.
FastLZ: an implementation of the LZ77 algorithm for lossless data compression.
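All of these libraries trade compression ratio against compression speed. A small way to see that trade-off without third-party bindings is Python's standard-library `zlib` module, which implements the same LZ77-plus-Huffman scheme as the Zlib library above (the payload here is synthetic illustration data, not the paper's traffic):

```python
import zlib

# 16 KiB of repetitive data: LZ77 back-references compress this well.
payload = bytes(range(256)) * 64

for level in (1, 6, 9):  # fastest, default, best ratio
    compressed = zlib.compress(payload, level)
    ratio = len(payload) / len(compressed)
    print(f"level {level}: {len(compressed)} bytes, ratio {ratio:.2f}")
```

Lz4, Snappy, Zstd, Gipfeli, and FastLZ expose similar level/speed knobs through their own APIs, which is what Smash benchmarks across all 41 libraries.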
11. Experimental setup
Edge device: Raspberry Pi 4 Model B, quad-core ARM Cortex-A72 64-bit 1.5GHz.
Server node: Intel(R) Xeon(R) CPU E5-2637 v2 3.50GHz with an NVIDIA V100 GPU.
Network: 10Mbps.
12. Results
- CPU results are better than the others except for Mnist.
- Compression libraries reduce the execution time between 1 and 6 minutes.
13. Results
- The [8B-16B[ data size range represents more than 35% of all data transfers.
- rCUDA is implemented with chunks of 1,024 bytes.
- More than 90% of data transfers have a size between 1 byte and 1,023 bytes.
(Compression is done without pipeline.)
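Bucketing transfer sizes into power-of-two ranges such as [8B-16B[, as the slide does, can be sketched as follows. The trace below is hypothetical illustration data, not the measured rCUDA traffic:

```python
from collections import Counter


def size_bucket(n: int) -> str:
    """Power-of-two bucket label, e.g. 10 -> '[8B-16B['."""
    lo = 1
    while lo * 2 <= n:
        lo *= 2
    return f"[{lo}B-{lo * 2}B["


# Hypothetical per-transfer sizes in bytes, skewed toward small transfers.
trace = [8, 12, 8, 9, 100, 1024, 15, 8, 300, 10]
hist = Counter(size_bucket(s) for s in trace)
for bucket, count in hist.most_common():
    print(f"{bucket}: {count / len(trace):.0%}")
```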
14. Analysis of data transfers in the range of [8B-16B[

TensorFlow application | Number of data transfers | Number of data transfers with different data values
Alexnet                | 15,218                   |  2,820
Cifar10                | 33,067                   | 10,479
Mnist                  | 83,665                   | 15,855
Inception              | 97,346                   | 25,530

- All data transfers have a size of 8 bytes (2^64 possible values).
- TF applications use less than 65,535 different data values (less than 2^16). Data could therefore be represented by 2 bytes instead of 8 bytes.
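The 8-bytes-to-2-bytes observation amounts to dictionary encoding: since fewer than 2^16 distinct 8-byte values occur, each value can be replaced by a 2-byte code plus a one-off dictionary. A minimal sketch under that assumption (the helper name `dict_encode` and the input values are illustrative, not the paper's encoder):

```python
import struct


def dict_encode(values):
    """Replace each distinct 8-byte value with a 2-byte code.
    Valid whenever there are at most 2**16 distinct values."""
    table = {}
    codes = []
    for v in values:
        codes.append(table.setdefault(v, len(table)))
    assert len(table) <= 1 << 16, "too many distinct values for 2-byte codes"
    # Dictionary: the distinct 8-byte values, in code order.
    dictionary = b"".join(struct.pack("<Q", v) for v in table)
    # Payload: one 2-byte code per transfer.
    payload = b"".join(struct.pack("<H", c) for c in codes)
    return dictionary, payload


values = [3, 7, 3, 3, 7, 42] * 1000        # many repeats, few distinct values
dictionary, payload = dict_encode(values)
raw = 8 * len(values)                       # original: 8 bytes per transfer
enc = len(dictionary) + len(payload)        # dictionary + 2-byte codes
print(f"{raw} B -> {enc} B")
```

The fewer distinct values relative to the number of transfers, the more the dictionary cost is amortized, which matches the per-application counts in the table above.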
15. Analysis of data transfers in the range of [8B-16B[
[Figure: the most repeated 8-byte values for Alexnet, Cifar10, Mnist and Inception. The data shown have a frequency greater than 0.2%.]
- Values could be represented using 1 byte.
- These data represent between 42.69% and 67.98% of all 8-byte data transfers.
16. Analysis of data transfers in the range of [8B-16B[

TensorFlow application | Number of data transfers | Number of data transfers with different data values | Size without compression | Size with proposed compression
Alexnet                | 15,218                   |  2,820                                              | 118.89KB                 | 19.62-23.38KB
Cifar10                | 33,067                   | 10,479                                              | 258.34KB                 | 42.63-50.80KB
Mnist                  | 83,665                   | 15,855                                              | 653.63KB                 | 107.87-128.53KB
Inception              | 97,346                   | 25,530                                              | 760.52KB                 | 125.50-149.55KB
17. Conclusions
- Initial pipelined implementation of on-the-fly data compression using rCUDA.
- We have evaluated four popular machine learning applications.
- This initial implementation is able to reduce the execution time.
- We have pointed out several ways to improve the performance of our pipelined on-the-fly data compression mechanism.
18. Contact: cripeace@gap.upv.es
Get a free copy of rCUDA at:
http://www.rcuda.net
Get a free copy of smash at:
https://github.com/cpenaranda/smash
THANK YOU!