Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options In Hadoop –
A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013

Introduction
Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
§  Leads Hadoop products team at Yahoo!
§  Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
§  Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
§  Member of Technical Staff in the Hadoop Services team at Yahoo!
§  Focuses on HBase and Hadoop performance
§  Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
§  Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design

Agenda
1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work

Compression Needs and Tradeoffs in Hadoop
§  Storage
§  Disk I/O
§  Network bandwidth
§  CPU Time
§  Hadoop jobs are data-intensive, so compressing data can speed up I/O operations
§  MapReduce jobs are almost always I/O bound
§  Compressed data can save storage space and speed up data transfers across the network
§  Capital allocation for hardware can go further
§  Reduced I/O and network load can bring significant performance improvements
§  MapReduce jobs can finish faster overall
§  On the other hand, CPU utilization and processing time increase during compression and decompression
§  Understanding the tradeoffs is important for the MapReduce pipeline’s overall performance
The Compression Tradeoff

Data Compression in Hadoop’s MR Pipeline
Figure: MapReduce data flow (input splits, map, buffer in memory, partition and sort, merge on disk, fetch from other maps, merge and sort, reduce, output, hand-off to other reducers), annotated with three compression points. Source: Hadoop: The Definitive Guide, Tom White
1. Map: the input is compressed; the mapper decompresses it
2. Sort & Shuffle: the mapper output is compressed; the reducer decompresses its input
3. Reduce: the reducer output is compressed

Compression Options in Hadoop (1/2)
Format | Algorithm | Strategy | Emphasis | Comments
zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib, codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy

Compression Options in Hadoop (2/2)
Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y
NOTES:
§  Splittability – Bzip2 is “splittable”: a bzip2 file can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together, so a single MapReduce task must decompress the whole file.
§  LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. LZO format is still
supported and the codec can be downloaded separately and enabled manually.
§  Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23
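
The extension and splittability columns above can be read as a lookup table. The following is a minimal Python sketch of that lookup, not Hadoop's own implementation: the function names `codec_for_path` and `max_parallel_readers` are hypothetical, and treating the number of parallel readers as equal to the number of blocks is a simplifying assumption.

```python
import os

# Rows copied from the codec table above: extension -> (codec class, splittable).
CODECS_BY_EXTENSION = {
    ".deflate": ("org.apache.hadoop.io.compress.DefaultCodec", False),
    ".gz": ("org.apache.hadoop.io.compress.GzipCodec", False),
    ".bz2": ("org.apache.hadoop.io.compress.BZip2Codec", True),
    ".lzo": ("com.hadoop.compression.lzo.LzoCodec", False),
    ".lz4": ("org.apache.hadoop.io.compress.Lz4Codec", False),
    ".snappy": ("org.apache.hadoop.io.compress.SnappyCodec", False),
}


def codec_for_path(path):
    """Return (codec class, splittable) for a file name, or None if no codec matches."""
    _, ext = os.path.splitext(path)
    return CODECS_BY_EXTENSION.get(ext.lower())


def max_parallel_readers(path, num_blocks):
    """Hypothetical helper: upper bound on tasks that can read the file in parallel.

    Assumption: an uncompressed or splittable file yields one split per block,
    while a non-splittable compressed file must be read by a single task.
    """
    codec = codec_for_path(path)
    if codec is None:
        return num_blocks
    _, splittable = codec
    return num_blocks if splittable else 1


if __name__ == "__main__":
    for name in ("clicks.bz2", "clicks.gz", "clicks.txt"):
        print(name, codec_for_path(name), max_parallel_readers(name, 8))
```

Under these assumptions an 8-block .gz file is read by one task, while the same data stored as .bz2 can be spread over up to 8 tasks.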

Space-Time Tradeoff of Compression Options
Codec Performance on the Wikipedia Text Corpus
Figure: scatter plot of CPU time in seconds (compress + decompress) against space savings, with a "High Compression Ratio" region (Bzip2, Zlib) and a "High Compression Speed" region (LZO, Snappy, LZ4).
Codec | Space Savings | CPU Time in Sec. (Compress + Decompress)
Bzip2 | 71% | 60.0
Zlib (Deflate, Gzip) | 64% | 32.3
LZO | 47% | 4.8
LZ4 | 44% | 2.4
Snappy | 42% | 4.0
Note:
A 265 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as [1 – (Compressed/ Uncompressed)]
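
The space-savings definition in the note above is easy to reproduce. The sketch below applies it to two codecs from Python's standard library, zlib and bz2. The sample text is synthetic and highly repetitive (an assumption made to keep the example self-contained), so its savings will be far higher than the Wikipedia figures on this slide.

```python
import bz2
import zlib


def space_savings(uncompressed_size, compressed_size):
    """Space savings = 1 - (compressed / uncompressed), as defined on the slide."""
    return 1.0 - compressed_size / uncompressed_size


# Synthetic, repetitive sample text (an assumption; not the Wikipedia corpus).
sample = b"Hadoop jobs are data-intensive and often I/O bound. " * 2000

if __name__ == "__main__":
    for name, compress in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
        packed = compress(sample)
        savings = space_savings(len(sample), len(packed))
        print(f"{name}: {len(sample)} -> {len(packed)} bytes, {savings:.1%} space savings")
```

Read the other way, a space saving of 64% means the compressed file is 36% of the original size.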

Using Data Compression in Hadoop
Phase in MR Pipeline | Config | Values
1. Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats
1. Input data to Map | Note: For SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec] | One of the supported codecs (one defined in io.compression.codecs)
2. Intermediate (Map) Output | mapreduce.map.output.compress | false (default), true
2. Intermediate (Map) Output | mapreduce.map.output.compress.codec | one defined in io.compression.codecs
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.codec | one defined in io.compression.codecs
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
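
The properties for phases 2 and 3 are usually set together. The following Python sketch assembles them into a dictionary and renders them as `-D property=value` generic options. The helper names `compression_properties` and `as_generic_options` are hypothetical, and the sketch assumes the job is launched through Hadoop's generic option parsing (for example via ToolRunner), which is not stated on the slide.

```python
VALID_OUTPUT_TYPES = ("NONE", "RECORD", "BLOCK")


def compression_properties(map_codec=None, output_codec=None, output_type=None):
    """Build the compression settings from the table above as a dict.

    map_codec / output_codec are codec class names registered in
    io.compression.codecs; output_type applies to SequenceFile outputs only.
    """
    props = {}
    if map_codec:
        props["mapreduce.map.output.compress"] = "true"
        props["mapreduce.map.output.compress.codec"] = map_codec
    if output_codec:
        props["mapreduce.output.fileoutputformat.compress"] = "true"
        props["mapreduce.output.fileoutputformat.compress.codec"] = output_codec
        if output_type is not None:
            if output_type not in VALID_OUTPUT_TYPES:
                raise ValueError(f"compress.type must be one of {VALID_OUTPUT_TYPES}")
            props["mapreduce.output.fileoutputformat.compress.type"] = output_type
    return props


def as_generic_options(props):
    """Render properties as a flat list of '-D', 'key=value' arguments."""
    args = []
    for key, value in props.items():
        args.extend(["-D", f"{key}={value}"])
    return args


if __name__ == "__main__":
    props = compression_properties(
        map_codec="org.apache.hadoop.io.compress.SnappyCodec",
        output_codec="org.apache.hadoop.io.compress.GzipCodec",
        output_type="BLOCK",
    )
    print(" ".join(as_generic_options(props)))
```

The example pairs a fast codec for intermediate output with gzip for the final output; any other codec listed in io.compression.codecs could be substituted.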

When to Use Compression and Which Codec
Phase | When to Use Compression | Which Codec
1. Input data to Map | Compress the input data, if large | Use splittable algo such as bzip2, or use zlib with SequenceFile format
2. Intermediate (Map) Output | Always use compression, particularly if spillage or slow network transfers | Use faster codecs such as LZO, LZ4, or Snappy
3. Final Reduce Output | Compress for storage/archival, better write speeds, or between MR jobs | Use standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs

Compression in the Hadoop Ecosystem
Pig
§  When to use: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size
§  What to use: enable compression and select the codec (gzip or lzo):
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = gzip
Hive
§  When to use: intermediate files produced by Hive between multiple map-reduce jobs; Hive writes output to a table
§  What to use: enable intermediate or output compression:
hive.exec.compress.intermediate = true
hive.exec.compress.output = true
HBase
§  When to use: compress data at the CF level (support for LZO, gzip, Snappy, and LZ4)
§  What to use: list required JNI libraries:
hbase.regionserver.codecs
§  Enabling compression:
create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }

Compression in Hadoop at Yahoo!
Figure: three pie charts of compression usage, one per pipeline phase (1 Input data to Map, 2 Intermediate (Map) Output, 3 Final Reduce Output), each with a codec breakdown. Chart annotations: "Includes intermediate Pig/ Hive compression" and "Pig Intermediate Compressed".
§  4.2M Jobs, Jun 10-16, 2013: 99.8% / 0.2%; codecs: LZO 98.3%, gzip 1.1%, zlib / default 0.5%, bzip2 0.1%
§  4.2M Jobs, Jun 10-16, 2013: 39.0% / 61.0%; codecs: LZO 55%, gzip 35%, bzip2 5%, zlib / default 5%
§  380M Files on Jun 16, 2013 (/data, /projects): 98% / 2%; codecs: zlib / default 73%, gzip 22%, bzip2 4%, LZO 1%

Compression for Data Storage Efficiency
§  DSE considerations at Yahoo!
§  RCFile instead of SequenceFile
§  Faster implementation of bzip2
§  Native-code bzip2 codec
§  HADOOP-8462 [1], available in 0.23.7
§  Substituting the IPP library
[1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member

IPP Libraries
§  Integrated Performance Primitives from Intel
§  Algorithmic and architectural optimizations
§  Processor-specific variants of each function
§  Applications remain processor-neutral
§  Compression: LZ, RLE, BWT, LZO
§  High level formats include: zlib, gzip, bzip2 and LZO

Measuring Standalone Performance
§  Standard programs (gzip, bzip2) used
§  Driver program written for other cases
§  32-bit mode
§  Single-threaded
§  JVM load overhead discounted
§  Default compression level
§  Quad-core Xeon machine
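
A measurement of this kind can be approximated in a few lines. The sketch below is not the driver program used for the talk: it is a single-threaded Python stand-in that times zlib and bz2 from the standard library at their default compression levels, using process CPU time rather than wall-clock time. The input data is synthetic (an assumption), and the helper names `cpu_seconds` and `benchmark` are hypothetical.

```python
import bz2
import time
import zlib


def cpu_seconds(func, data):
    """Run func(data) once and return (result, CPU seconds consumed by this process)."""
    start = time.process_time()
    result = func(data)
    return result, time.process_time() - start


def benchmark(name, compress, decompress, data):
    """Time one compress/decompress round trip at the codec's default level."""
    packed, compress_s = cpu_seconds(compress, data)
    unpacked, decompress_s = cpu_seconds(decompress, packed)
    if unpacked != data:
        raise RuntimeError(f"{name}: round trip did not restore the input")
    return {
        "codec": name,
        "compress_s": compress_s,
        "decompress_s": decompress_s,
        "compressed_bytes": len(packed),
    }


# Synthetic input (an assumption; the talk used binary, generated-text,
# Wikipedia, and Silesia corpora).
data = b"".join(b"record %d: the quick brown fox\n" % i for i in range(50_000))

results = [
    benchmark("zlib", zlib.compress, zlib.decompress, data),
    benchmark("bzip2", bz2.compress, bz2.decompress, data),
]

if __name__ == "__main__":
    for row in results:
        print(row)
```

Absolute numbers from such a script depend on the machine, the library build, and the data, so they are not comparable with the figures reported in this talk; only the relative ordering of codecs on the same input is meaningful.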

Data Corpuses Used
§  Binary files
§  Generated text from randomtextwriter
§  Wikipedia corpus
§  Silesia corpus

Compression Ratio
Figure: bar chart of file size (MB) for each corpus (exe, rtext, wiki, silesia), uncompressed (uncomp) and compressed with zlib, bzip2, LZO, Snappy, and LZ4.

Compression Performance
Figure: bar chart of compression CPU time.
Codec | CPU Time (sec)
zlib | 29
IPP-zlib | 23
Java-bzip2 | 63
bzip2 | 44
IPP-bzip2 | 26

Compression Performance (Fast Algorithms)
Figure: bar chart of compression CPU time for the fast algorithms.
Codec | CPU Time (sec)
LZO | 3.2
Snappy | 2.9
LZ4 | 1.7

Decompression Performance
Figure: bar chart of decompression CPU time.
Codec | CPU Time (sec)
zlib | 3
IPP-zlib | 2
Java-bzip2 | 21
bzip2 | 17
IPP-bzip2 | 12

Decompression Performance (Fast Algorithms)
Figure: bar chart of decompression CPU time for the fast algorithms.
Codec | CPU Time (sec)
LZO | 1.6
Snappy | 1.1
LZ4 | 0.7

Compression Performance within Hadoop
§  Daytona performance framework
§  GridMix v1
§  Loadgen and sort jobs
§  Input data compressed with zlib / bzip2
§  LZO used for intermediate compression
§  35 datanodes, dual-quad-core machines

Map Performance
Figure: bar chart of map time.
Codec | Map Time (sec)
Java-bzip2 | 47
bzip2 | 46
IPP-bzip2 | 46
zlib | 33

Reduce Performance
Figure: bar chart of reduce time (min). Bar values in slide order: 31, 28, 18, 14.

Job Performance
Figure: bar chart of job time (min) for the sort and loadgen jobs. Bar values in slide order: 38, 34, 23, 19; 38, 34, 25, 18.

Future Work
§  Splittability support for native-code bzip2 codec
§  Enhancing Pig to use common bzip2 codec
§  Optimizing the JNI interface and buffer copies
§  Varying the compression effort parameter
§  Performance evaluation for 64-bit mode
§  Updating the zlib codec to specify alternative libraries
§  Other codec combinations, such as zlib for transient data
§  Other compression algorithms

Considerations in Selecting Compression Type
§  Nature of the data set
§  Chained jobs
§  Data-storage efficiency requirements
§  Frequency of compression vs. decompression
§  Requirement for compatibility with a standard data format
§  Splittability requirements
§  Size of the intermediate and final data
§  Alternative implementations of compression libraries
