Data Compression 2020

  1. Smaller is better Data Compression Dietmar Hauser | roborodent e.U. | 2020
  2. Why data size matters Save money Initial download / updates Continuous connections Expand reach Decreased loading times Smaller app size
  3. Isn’t this handled by the platform? Little incentive „Good enough“ attitude CPU / Memory gap Bandwidth / Fidelity gap Stand out from competition!
  4. Compression in theory Wikipedia: „[...] encoding information using fewer bits than the original representation“
  5. Two flavours of compression Lossless All information is retained Lossy An approximation is retained
  6. History & Concepts Information Theory, ~1948 Claude Shannon Entropy Shannon limit
  7. History & Concepts Prefix code, ~1952 Variable length code Translated with a dictionary Constructed with Huffman tree Fast and efficient Still used today
  8. History & Concepts Lempel-Ziv, 1977 Base for the LZ-family Refers back to already processed data „Sliding Window“ Implicit dictionary creation
  9. History & Concepts Deflate, 1991 LZ77 + Huffman Used everywhere! 29 years old!
  10. Compression In Practice
  11. Smaller Apps
  12. Smaller Apps Platform owners enforce package format .apk, .ipa, .appx, … Actually just .zip files Built in compression far from optimal Compress before packaging Bonus: Less storage space used!
  13. Smaller Apps Textures Best compression: JPEG (or H.26X) Most pitfalls: PNG Don’t use Photoshop output for final images! Use compressed texture formats if possible Don’t forget to apply regular compression Consider custom image format
  14. Reducing Network Traffic
  15. Smaller Apps Textures – Teh Future RDO – Rate-distortion optimization Crunch: Transcoding between compressed formats Basis: New compressed GPU formats glTF: ASTC - Adaptive Scalable Texture Compression
  16. Smaller Apps Geometry & Animation Highly format dependent Strip unneeded data Tangents, Binormals, Extra UVs, … Lossy animation compression Compress using a generic algorithm
  17. Smaller Apps Sound and Music Use lossy compression MP3, Ogg/Vorbis, BINKA, … Depends on audio platform Check back with provider Consider mono for music
  18. Smaller Apps Config, Settings, Loca, … HTML, JSON, XML, … Human readable → low entropy Strip whitespace and comments Brotli is optimized for these Consider binary formats e.g. MessagePack, Protocol Buffers, Binary XML, BSON, … Consider creating your own format
  19. Smaller Apps Further complications Certain files have fixed formats App icons, splash screens, … Exe is encrypted / signed Consider interpreted code Only workarounds are possible… Lobby platform owners?
  20. Compression In Practice
  21. Smaller Downloads HTTP is usually a must (CDN) HTTP 1.1 has compression built in! Likely already available to you Only GZIP widely supported Google is pushing Brotli! Make sure it’s turned on! Content-Encoding: br Accept-Encoding: br, gzip
  22. Smaller Downloads HTTP Compression is not optimal! Data is rarely changed Compression time is not relevant Use strongest compression available Don’t forget to turn off HTTP compression
  23. Smaller Downloads Compression Options Free: LZMA, XZ, LZHAM Commercial: Oodle (Kraken, Leviathan, …) Slow to very slow compression Very high compression ratios Slow to fast decompression
  24. Reducing Network Traffic
  25. Smaller Downloads General Hints Consider keeping files compressed locally HTTP request delays and limits Few big files > many small files Use parallel downloads, if possible Don’t forget about decompression time
  26. Compression In Practice
  27. Less Network Traffic Data treatment options Separate static from dynamic data Transfer static data once (or never) e.g. replace Strings with Ids Use binary data formats Ditch HTTP, Base64 adds ~33% Use TCP/UDP, WebSocket instead Per packet vs. stream compression
  28. Less Network Traffic Fast compression options Free: LZ4, Density Commercial: LZO, Oodle (Selkie, LZB16) Much (!) faster than GZIP Lower to equal compression ratio
  29. Less Network Traffic Strong compression options Free: ZStd, BROTLI Commercial: Oodle (Mermaid) Faster decompression speed Slower to equal compression speed Equal to higher compression ratio
  30. Less Network Traffic Teh Future HTTP/2 & 3 will be binary protocols Shared dictionaries SDCH or home made (e.g. using ZStd) Brotli has a generic dictionary built in
  31. Source:
  32. Conclusions Take care of your data from day 1 There is more than Deflate / Zlib Smaller data makes people happy!
  33. Resources Yann Collet Blog: LZ4: ZStd: Oodle Official: Charles Bloom: Fabian Giesen:
  34. Resources BROTLI Standard: Source: Misc Rich Geldreich (LZHAM): Crunch: Basis: LZO: 7z / LZMA / XZ: Density:
  35. roborodent Dietmar Hauser Programmer Dietmar Hauser | roborodent e.U. | 2020 Software Solutions | Creative Consulting @rattenhirn

Editor's Notes

  • Why should you care?

    Reduced bandwidth benefits you and the customers

  • Platform providers pay highly discounted bulk rates

    You have likely already heard of…
    CPUs got over 10,000 times faster, memory only 10 times
    In addition, more CPU cores are being added, which compete for memory
    The chance of idling CPUs is high

    What might we use those idle CPUs for?
    I have „invented“ a second gap

    VR, 4K, high framerates, MMO, ….

    720p -> 0.9 MP
    1080p -> 2 MP
    2160p -> 8.3 MP
  • Now that I hopefully have made my case,
    let‘s review the basics

    So it is literally „shrinking data“
  • Two things you probably already know, but I’ll recap them anyway

    First reduce entropy, then apply lossless compression

    I‘ll mostly talk about lossless compression
    Rule of thumb: Wherever human senses are involved, lossy compression can be used
  • Now that we know what it is, let‘s look at how it roughly works
    by reviewing its history briefly

    Claude Shannon would have been 100 years old in 2016

    Entropy: Don’t confuse it with physical or chemical entropy

    H(X) = Entropy in „Shannons“
    Pr(X=1) = Probability that the coin will land heads

    In this case, entropy == number of bits needed to store one toss
    To store the outcome of 100 coin tosses I need at least 100 bits at H(X) == 1,
    but only at least 50 bits at H(X) == 0.5

    So, the more predictable data is, the better it can be compressed
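
The coin toss arithmetic above can be checked with a small sketch. It assumes the standard binary entropy formula H(X) = -p·log2(p) - (1-p)·log2(1-p); the function names are made up for illustration and the 0.11 bias is just a value whose entropy happens to be close to 0.5.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a fully predictable coin carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def min_bits(tosses: int, p: float) -> float:
    """Lower bound on the bits needed to store `tosses` independent tosses."""
    return tosses * binary_entropy(p)


if __name__ == "__main__":
    print(min_bits(100, 0.5))   # fair coin: H(X) == 1
    print(min_bits(100, 0.11))  # biased coin: H(X) is roughly 0.5
```

A fair coin needs the full 100 bits, while the heavily biased coin gets away with roughly half of that.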
  • David A. Huffman
    Not the first prefix code, but the best at the time

    „Universal codes“, prefix code to use when data is not known
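
To make the Huffman tree idea concrete, here is a minimal sketch that builds a prefix code from byte frequencies. It is not taken from the talk; `huffman_code` is a hypothetical helper, and real codecs store the code far more compactly than a Python dict of bit strings.

```python
import heapq
from collections import Counter


def huffman_code(data: bytes) -> dict[int, str]:
    """Map each byte value in `data` to a prefix-free bit string."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tiebreak, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols get one bit longer.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


if __name__ == "__main__":
    for sym, code in sorted(huffman_code(b"aaaabbc").items()):
        print(chr(sym), code)
```

Frequent symbols end up with short codes and rare ones with long codes, which is why more predictable data compresses better.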
  • We jump forward 25 years, skip over arithmetic and range coding

    Abraham Lempel, Jacob Ziv, IEEE Milestone 2004

    Have contributed more to the efficient storage of cat images than anyone else

    Notable LZ-family members:
    Originals: LZ77, LZ78
    Well known: LZW (1984) -> GIF
    Modern: LZ4, LZO, LZMA, Oodle,…
  • This is where for many people the history of compression ends

    As we‘ll see it‘s the default in a lot of places
    Zlib is considered „good enough“

    Research hasn‘t stopped since then
    Some theoretical progress, a LOT of implementation improvements
  • Our first use case (of three, for the impatient)
  • A jarring example
    FlappyBird.apk -> 895 KiB
    TappyChicken.apk -> 26,408 KiB
  • Deliciously named IPA format

    .apk: Data is kept in archive, code is extracted
    .ipa: Everything is extracted
    .appx: Nothing is extracted

    DEFLATE again, remember, 29 years old

    LZMA / 7z would be 20-30% smaller at

    FlappyBird.apk -> 895 KiB -> 604 KiB ~= 300 KiB saved
    TappyChicken.apk -> 26,408 KiB -> 17,793 KiB ~= 8.4 MiB saved!!

    It‘s public domain and free!
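
A quick way to try the Deflate versus LZMA comparison on your own data is sketched below, using only Python's standard `zlib` and `lzma` modules. The `compare_sizes` helper and the synthetic payload are assumptions for illustration; the savings quoted above came from real app packages, and results on other data will differ.

```python
import lzma
import zlib


def compare_sizes(payload: bytes, preset: int = lzma.PRESET_DEFAULT) -> dict[str, int]:
    """Return the raw, Deflate (zlib level 9) and LZMA (.xz) sizes of `payload`."""
    return {
        "raw": len(payload),
        "deflate": len(zlib.compress(payload, 9)),
        # Higher presets (up to 9) trade memory and time for ratio.
        "lzma": len(lzma.compress(payload, preset=preset)),
    }


if __name__ == "__main__":
    # Synthetic, repetitive sample data standing in for real package contents.
    sample = b"".join(b"level_%03d.dat size=%d\n" % (i, i * 17) for i in range(2000))
    for name, size in compare_sizes(sample).items():
        print(f"{name:8s} {size:8d} bytes")
```

Run it over the actual files that go into the package before drawing conclusions, since the ratio depends heavily on the content.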
  • Different kinds of data should be treated differently

    JPEG is lossy and has superior compression rates
    Alpha channel needs to be stored separately if needed

    PNGs output from graphics packages are not optimal
    Photoshop adds random data to PNGs
    PNG uses Deflate to compress its scanlines
    Run them through an optimizer like pngcrush, TinyPNG or pngquant
    Consider using the palette feature

    Compressed textures:
    Saves disk space, memory and GPU bandwidth
    ETC2 is available on all OpenGL ES 3.0 devices
    Texture compression is lossy and has a fixed compression ratio
    It's worth it to compress them again using a generic algorithm

    Custom image format:
    Save raw pixel data in the desired pixel format (including compressed ones)
    Add required meta data (format, height, width)
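
A custom image format along these lines can be very small. The sketch below is one possible layout, not the format used in the talk: the `RIMG` magic, the header fields and the format id are all hypothetical, and `zlib` stands in for whichever generic compressor you pick.

```python
import struct
import zlib

MAGIC = b"RIMG"                   # hypothetical file signature
HEADER = struct.Struct("<4sHHH")  # magic, pixel format id, width, height


def pack_image(fmt_id: int, width: int, height: int, pixels: bytes) -> bytes:
    """Prefix the pixel data with a tiny header, then compress the pixels."""
    return HEADER.pack(MAGIC, fmt_id, width, height) + zlib.compress(pixels, 9)


def unpack_image(blob: bytes) -> tuple[int, int, int, bytes]:
    """Return (format id, width, height, raw pixel data) from a packed image."""
    magic, fmt_id, width, height = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a packed image")
    return fmt_id, width, height, zlib.decompress(blob[HEADER.size:])


if __name__ == "__main__":
    rgba = bytes([255, 0, 0, 255]) * (64 * 64)  # a solid red 64x64 RGBA image
    blob = pack_image(1, 64, 64, rgba)
    print(len(rgba), "->", len(blob), "bytes")
```

The pixel payload can just as well be ETC2 or another GPU-compressed format, so the data can be handed to the GPU straight after decompression.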

  • uncompressed 280 MiB
    original: 85 MiB
    png: 41 MiB (less than half!)
    jpeg 80: 8 MiB

    etc2: 106 MiB
    etc2 comp: 24 – 19 MiB

    More about the compressors used later
  • Crunch is DXT5 only

    Basis is only available under license

    glTF, helped by the people behind Binomial
  • BINKA comes with Miles Sound System and Bink

  • Omitting whitespace and comments: Lossy compression

    JavaScript libraries use this technique extensively
    Conversion wastes CPU / memory
    Binary format moves compression to creation time

    Use generic compressor
    BROTLI is aimed at text
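
Stripping whitespace from a JSON config before compressing it can be sketched as follows. `minify_json` is a hypothetical helper; standard JSON has no comments, so only whitespace is removed here, and `zlib` is used because Brotli is not part of Python's standard library.

```python
import json
import zlib


def minify_json(text: str) -> str:
    """Re-serialize JSON without any insignificant whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"))


if __name__ == "__main__":
    pretty = json.dumps(
        {"window": {"width": 1280, "height": 720}, "volume": 0.8, "lang": "en"},
        indent=4,
    )
    small = minify_json(pretty)
    print(len(pretty), "->", len(small), "characters")
    print(len(zlib.compress(small.encode(), 9)), "bytes after Deflate")
```

The minified file carries the same data, so the loss only affects people reading the file, not the program parsing it.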
  • Certain files need to be in certain formats
    e.g. app startup images and app icons need to be 32-bit PNGs even though they have no transparency
    Make them as simple as possible (large monochrome patches)
    The executable is encrypted before compression
    Not much one can do, except keeping the executable small
    Consider using interpreted code
    Can be compressed
    Can also be updated without a certification pass
    Lobby platform owners to change this and/or provide options
  • Our second use case (of three, for the impatient)
  • The good news:

    What is GZIP? Deflate!

    HTTP is used a lot, for some good and many bad reasons
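
The Accept-Encoding / Content-Encoding handshake can be illustrated with a simplified sketch. `pick_encoding` is a hypothetical helper that only honours `q=0` exclusions and otherwise ignores quality values; a real server or CDN does this for you, so treat it as an illustration of the negotiation rather than something to deploy.

```python
import gzip


def pick_encoding(accept_encoding: str, server_supports: tuple[str, ...]) -> str:
    """Return the first server-preferred coding the client accepts, else 'identity'."""
    accepted = set()
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        q = params.strip().lower().replace(" ", "")
        # q=0 means "not acceptable"; other q-values are ignored in this sketch.
        if name and q not in ("q=0", "q=0.0", "q=0.00", "q=0.000"):
            accepted.add(name)
    for coding in server_supports:
        if coding in accepted:
            return coding
    return "identity"


if __name__ == "__main__":
    body = b'{"status":"ok"}' * 100
    coding = pick_encoding("br, gzip", server_supports=("gzip",))
    payload = gzip.compress(body) if coding == "gzip" else body
    print("Content-Encoding:", coding, "|", len(body), "->", len(payload), "bytes")
```

If the response comes back without a Content-Encoding header, compression is not actually turned on, whatever the client asked for.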
  • Initial download, patches, DLCs, updates,…
  • Silesia corpus, all compressors on max compression setting

    XZ == LZMA
    LZHAM is missing

    Kraken blows everything out of the water
    Leviathan (also Oodle) is even better

    Free: Zstd or LZMA, depending on whether speed or ratio is more important

    Compression times not included because we don‘t care.
    Up to 10 minutes for 200 MiB
  • If possible, store files compressed locally as well
    - May help with loading times if local transfer rate is low
    - Makes users happy, especially on mobile devices

    Data flow considerations
    - HTTP requests will have some delay before starting, especially on CDNs (due to redirections and back end stuff)
    - How to cope:
    - Run as many parallel requests as possible, but
    - Per RFC, servers are not obliged to service more than 2 at a time
    - Proxies or even the platform may also be a limiting factor
    Decompression takes some time as well
    - Especially on weak CPU (mobile) platforms
    - Try to parallelize with download
    - Keep an eye on memory consumption
  • Data transfers in online games
  • Data separation also improves cache friendliness and memory pressure
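
As a rough illustration of replacing strings with IDs and using a binary format, the sketch below encodes the same message as JSON text and as a packed struct. The item table, field layout and function names are hypothetical. It also shows the Base64 cost: every 3 bytes become 4 characters, so the encoded form is 4/3 the size of the binary one.

```python
import base64
import json
import struct

# Hypothetical static table, shared with the client once instead of per message.
ITEM_IDS = {"health_potion": 1, "iron_sword": 2}
MESSAGE = struct.Struct("<IHH")  # player id, item id, count


def encode_text(player: int, item: str, count: int) -> bytes:
    """Human readable encoding: field names and item name travel every time."""
    return json.dumps({"player": player, "item": item, "count": count}).encode()


def encode_binary(player: int, item: str, count: int) -> bytes:
    """Binary encoding: only the dynamic values travel, strings become ids."""
    return MESSAGE.pack(player, ITEM_IDS[item], count)


if __name__ == "__main__":
    text = encode_text(4711, "health_potion", 3)
    binary = encode_binary(4711, "health_potion", 3)
    print(len(text), "bytes as JSON")
    print(len(binary), "bytes as binary")
    print(len(base64.b64encode(binary)), "bytes as Base64 text")
```

Sending the binary form over a raw TCP, UDP or WebSocket binary frame avoids the Base64 step entirely.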
  • Two ways to improve compression, number 1:

    Links at the end
  • Number 2

    BROTLI is designed to handle text well, the others are general purpose
  • Here is how this plays out

    Shorter bar is better

    „Sai-Lee-Sha“ =~ 200 MiB of representative data

    Brotli, LZO and Density are missing
    Brotli comparable to Zstd, better on text (20% improvement over deflate)

    Silesia corpus, all compressors on „default“ setting
    Should be the best trade off between speed and ratio
    Max compression time is 2 minutes for 200 MiB

    We‘ll look at even stronger compressors later
  • SDCH -> Shared Dictionary Compression for HTTP
    Pre shared dictionaries are a generic approach to separate static from dynamic data

    BROTLI dictionary in Appendix A.
  • Only useful for small packages and non-streaming connections
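
A home made shared dictionary can be tried out with the preset dictionary support in Python's `zlib` module (the `zdict` argument), as a stand-in for the Zstd dictionaries mentioned on the slide. The dictionary contents and packet layout below are invented for illustration; both sides must hold exactly the same dictionary bytes.

```python
import zlib

# Hypothetical dictionary: the static parts that every packet repeats.
SHARED_DICT = b'{"type":"position","player_id":,"x":,"y":,"z":}'


def compress_packet(packet: bytes, zdict: bytes = b"") -> bytes:
    """Deflate one packet on its own, optionally primed with a shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(packet) + comp.flush()


def decompress_packet(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress_packet; needs the same dictionary bytes."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    packet = b'{"type":"position","player_id":42,"x":1.5,"y":2.0,"z":-3.25}'
    print(len(packet), "bytes raw")
    print(len(compress_packet(packet)), "bytes without dictionary")
    print(len(compress_packet(packet, SHARED_DICT)), "bytes with dictionary")
```

The gain comes from the compressor being able to refer back to the dictionary from the first byte, which matters most for small, independently compressed packets.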