| Factor | Snappy | LZ4 | LZO |
|---|---|---|---|
| Algorithm | LZ77 | Simplified variant of LZ77 | LZ77 variant |
| File extension | .snappy | .lz4 | .lzo |
| Codec used | org.apache.hadoop.io.compress.SnappyCodec | org.apache.hadoop.io.compress.Lz4Codec | com.hadoop.compression.lzo.LzoCodec |
| Strategy | Block-oriented, API | Fast scan, API | Dictionary-based, block-oriented, API |
| Java / Native | N/Y | N/Y | N/Y |
| Emphasis | Very high compression speeds | Very high compression speeds | High compression speeds |
| Splittable | N | N | N |
| Comments | Came out of Google, previously known as Zippy | Available in newer Hadoop distributions | Common for intermediate compression, HBase tables |
| Compressed file size (GB or MB) | 1.5 GB (least compressed) | 1.4 GB | 1.3 GB |
| Time to compress (sec) | 6.4 | 6.4 | 7.5 |
| Time to uncompress (sec) | 20 | 2.5 | 11.5 |
| Compression ratio | 50% | 51% | 51% |
| Compression + decompression time (sec) | 26.4 | 8.9 | 19 |
| Strengths | Quickest compression method | Very quick compression method; best results in decompression speed | Rapid compression; balanced comp/decomp times |
| Weakness | Relatively slow in decompression; non-splittable | Non-splittable | Non-splittable |

Some uses (listed for two of the three codecs): Used to reduce both disk write and network transfer load.
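
As a quick sanity check on the timing rows, the reported "compression + decompression" totals can be recomputed from the separate compress and uncompress figures. The sketch below only re-uses the numbers from the Snappy / LZ4 / LZO table. The dictionary layout and the `round_trip` helper are illustrative names, not part of the original slides.

```python
# Timing figures (seconds) copied from the Snappy / LZ4 / LZO table.
timings = {
    "Snappy": {"compress": 6.4, "decompress": 20.0, "reported_total": 26.4},
    "LZ4": {"compress": 6.4, "decompress": 2.5, "reported_total": 8.9},
    "LZO": {"compress": 7.5, "decompress": 11.5, "reported_total": 19.0},
}


def round_trip(name):
    """Return compress + decompress time for one codec, rounded to 0.1 s."""
    t = timings[name]
    return round(t["compress"] + t["decompress"], 1)


# Each reported total should equal the sum of its two components.
for name, t in timings.items():
    assert round_trip(name) == t["reported_total"], name

# Codec with the shortest combined time in this particular measurement.
fastest = min(timings, key=round_trip)
print(fastest, round_trip(fastest))  # prints: LZ4 8.9
```

On these figures all three codecs compress in similar time, so the ranking by total time is driven almost entirely by decompression speed.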
Hadoop Compression and suitability for use
Format

| Factor | bzip2 | gzip | zlib (not much in use currently) |
|---|---|---|---|
| Algorithm | Burrows-Wheeler transform | Wrapper around zlib | Uses Deflate (LZ77 and Huffman coding) |
| File extension | .bz2 | .gz | .deflate |
| Codec used | org.apache.hadoop.io.compress.BZip2Codec | org.apache.hadoop.io.compress.GzipCodec | org.apache.hadoop.io.compress.DefaultCodec |
| Strategy | Transform-based, block-oriented | Dictionary-based, standard compression utility | Dictionary-based, API |
| Java / Native | Y/Y | Y/Y | Y/Y |
| Emphasis | Higher compression ratios than zlib | Focus on compression ratio. Codec operates on and produces standard gzip files | Compression ratio |
| Splittable | Y | N | N |
| Comments | Common for Pig | For data interchange on and off Hadoop | Default codec |
| Compressed file size (GB or MB) | 780 MB | 890 MB | 1.1 GB |
| Time to compress (sec) | 142 | 85 | 118 |
| Time to uncompress (sec) | 62.5 | 21.9 | 45.6 |
| Compression ratio | 72% | 68% | 70% |
| Compression + decompression time (sec) | 204.5 | 106.9 | 163.6 |
| Strengths | Best compression ratio; splittable | Relatively high compression ratio; reasonable speed | High compression ratio |
| Weakness | Maximum time taken | Relatively slower than LZO, Snappy and LZ4; non-splittable | Slower total time |

Some uses (one entry on this slide): Used when input data is large, which reduces read cost.
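
The size, time and ratio rows above could be reproduced in spirit with a small benchmark. The sketch below is not the method used for the slides: it uses Python's standard-library `bz2`, `gzip` and `zlib` modules instead of the Hadoop codec classes, and it runs on a tiny hypothetical sample rather than the original test file. It also assumes that "compression ratio" in the table means the fraction of space saved, which the slides do not state.

```python
import bz2
import gzip
import time
import zlib


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip and report size figures."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name}: round trip did not restore the input")

    return {
        "codec": name,
        "size": len(packed),
        # Assumed meaning of "compression ratio": fraction of space saved.
        "space_saving": 1 - len(packed) / len(data),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
        "total_s": compress_s + decompress_s,
    }


# Hypothetical sample input: repetitive, log-like text.
sample = b"2014-01-01 INFO job_0001 map task finished\n" * 20000

codecs = [
    ("bzip2", bz2.compress, bz2.decompress),
    ("gzip", gzip.compress, gzip.decompress),
    ("zlib", zlib.compress, zlib.decompress),
]

results = [measure(n, c, d, sample) for n, c, d in codecs]
for r in results:
    print(
        f"{r['codec']:6s} size={r['size']:8d} B "
        f"saved={r['space_saving']:.1%} total={r['total_s']:.4f} s"
    )
```

Absolute numbers from a run like this depend heavily on the input data and the machine, so they are not comparable to the figures in the table. Only the shape of the measurement (size, compress time, uncompress time, total) carries over.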

Hadoop compression analysis strata conference
