Compression Options In Hadoop –
A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013
Introduction
Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
§  Leads Hadoop products team at Yahoo!
§  Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
§  Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
§  Member of Technical Staff in the Hadoop Services team at Yahoo!
§  Focuses on HBase and Hadoop performance
§  Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
§  Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design
Agenda
1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
Compression Needs and Tradeoffs in Hadoop
The Compression Tradeoff
§  Storage
§  Disk I/O
§  Network bandwidth
§  CPU time

§  Hadoop jobs are data-intensive, so compressing data can speed up I/O operations
§  MapReduce jobs are almost always I/O bound
§  Compressed data can save storage space and speed up data transfers across the network
§  Capital allocation for hardware can go further
§  Reduced I/O and network load can bring significant performance improvements
§  MapReduce jobs can finish faster overall
§  On the other hand, CPU utilization and processing time increase during compression and decompression
§  Understanding the tradeoffs is important for the overall performance of the MapReduce pipeline
Data Compression in Hadoop’s MR Pipeline
[Figure: MapReduce data flow. Map side: input splits, map, buffer in memory, partition and sort, merge on disk. Reduce side: fetch (together with output from other maps, while other partitions go to other reducers), merge and sort, reduce, output. Source: Hadoop: The Definitive Guide, Tom White.]
Compression points in the pipeline:
1. Map: the input is compressed, and the mapper decompresses it.
2. Sort & Shuffle: the mapper output is compressed, and the reducer decompresses its input.
3. Reduce: the reducer output is compressed.
Compression Options in Hadoop (1/2)
Format | Algorithm | Strategy | Emphasis | Comments
zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib, codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
Compression Options in Hadoop (2/2)
Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y
NOTES:
§  Splittability – Bzip2 is “splittable”: a bzip2 file can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together, so a single MapReduce task has to decompress the whole file.
§  LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
§  Native bzip2 codec – Added by Yahoo! in Hadoop 0.23 as part of this work.
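The codec table above can be read as a lookup from file extension to codec class. The following is a minimal Python sketch of that lookup, not Hadoop's own implementation; the helper names `codec_for_path` and `is_splittable_compressed` are hypothetical, and the extension and class values are copied from the table.

```python
import os

# Extension-to-codec mapping copied from the table above.
CODECS_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzoCodec",
    ".lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}

# Per the table, bzip2 is the only compressed format marked splittable.
SPLITTABLE_EXTENSIONS = {".bz2"}


def codec_for_path(path):
    """Return the codec class name for a path, or None if the extension is not in the table."""
    _, ext = os.path.splitext(path)
    return CODECS_BY_EXTENSION.get(ext.lower())


def is_splittable_compressed(path):
    """Return True only for compressed formats the table marks as splittable."""
    _, ext = os.path.splitext(path)
    return ext.lower() in SPLITTABLE_EXTENSIONS


for example in ("/data/logs/part-00000.bz2", "/data/logs/part-00000.gz"):
    print(example, codec_for_path(example), is_splittable_compressed(example))
```

A path with no recognized extension returns None here, which a caller could treat as uncompressed input.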
Space-Time Tradeoff of Compression Options
[Figure: scatter plot, "Codec Performance on the Wikipedia Text Corpus". X-axis: space savings (40% to 75%). Y-axis: CPU time in sec. (compress + decompress, 0 to 70). Bzip2 and zlib sit toward the high-compression-ratio end; LZO, Snappy, and LZ4 sit toward the high-compression-speed end.]
Codec | Space savings | CPU time (sec)
Bzip2 | 71% | 60.0
Zlib (Deflate, Gzip) | 64% | 32.3
LZO | 47% | 4.8
LZ4 | 44% | 2.4
Snappy | 42% | 4.0
Note:
A 265 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as [1 – (Compressed / Uncompressed)]
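The space-savings definition in the note can be checked with a few lines of Python. This is a small sketch using the standard-library `zlib` and `bz2` modules on a made-up sample string (an assumption; the talk measured a 265 MB Wikipedia corpus), so the printed percentages will not match the slide.

```python
import bz2
import zlib


def space_savings(uncompressed_size, compressed_size):
    """Space savings as defined on the slide: 1 - (compressed / uncompressed)."""
    return 1.0 - (compressed_size / uncompressed_size)


# Hypothetical sample input; highly repetitive, so it compresses far better
# than real text would.
sample = b"Hadoop jobs are data-intensive and often I/O bound. " * 2000

for name, compress in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
    compressed = compress(sample)
    savings = space_savings(len(sample), len(compressed))
    print(f"{name}: {savings:.1%} space savings")
```

For instance, a file that shrinks from 100 MB to 36 MB has space savings of 1 - 36/100 = 64%.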
Using Data Compression in Hadoop
Phase in MR Pipeline | Config | Values
1. Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats
1. Input data to Map | Note: For SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec] | One of the supported codecs defined in io.compression.codecs
2. Intermediate (Map) Output | mapreduce.map.output.compress | false (default), true
2. Intermediate (Map) Output | mapreduce.map.output.compress.codec | One defined in io.compression.codecs
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.codec | One defined in io.compression.codecs
3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
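As an illustration of the properties listed above, the sketch below builds a Hadoop-style `<configuration>` XML fragment with Python's standard library. The property names come from the slide; the particular codec choices (Snappy for map output, bzip2 with BLOCK type for final output) and the helper name `build_configuration` are assumptions made for the example, not recommendations from the talk.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Build a <configuration> element with one <property> per name/value pair."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Illustrative values only: the codec choices here are assumptions.
compression_properties = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.codec": "org.apache.hadoop.io.compress.BZip2Codec",
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
}

config = build_configuration(compression_properties)
print(ET.tostring(config, encoding="unicode"))
```

The same name/value pairs could instead be set per job; the XML form is shown only because it makes the property names and values easy to inspect.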
When to Use Compression and Which Codec
1. Input data to Map
§  Compress the input data, if large
§  Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format
2. Intermediate (Map) Output
§  Always use compression, particularly if there is spillage or network transfers are slow
§  Use faster codecs such as LZO, LZ4, or Snappy
3. Final Reduce Output
§  Compress for storage/archival, better write speeds, or between MR jobs
§  Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs
[Figure: pipeline diagram (Map, Shuffle & Sort, Reduce) marking the three points: (1) compressed input decompressed by the mapper, (2) mapper output compressed and decompressed as reducer input, (3) reducer output compressed.]
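The per-phase guidance above can be written down as a small decision helper. This is only a sketch of the slide's rules of thumb; the function name `suggest_codecs`, the phase labels, and the `for_interchange` flag are hypothetical names introduced for illustration.

```python
def suggest_codecs(phase, for_interchange=False):
    """Return candidate codecs for a pipeline phase, following the slide's guidance."""
    if phase == "input":
        # Large inputs: a splittable codec, or zlib inside a SequenceFile.
        return ["bzip2", "zlib with SequenceFile"]
    if phase == "intermediate":
        # Map output: favor speed over compression ratio.
        return ["LZO", "LZ4", "Snappy"]
    if phase == "final":
        if for_interchange:
            # Data leaving Hadoop: standard utilities.
            return ["gzip", "bzip2"]
        # Output feeding a chained job: faster codecs.
        return ["LZO", "LZ4", "Snappy"]
    raise ValueError(f"unknown phase: {phase!r}")


print(suggest_codecs("input"))
print(suggest_codecs("final", for_interchange=True))
```

Real choices also depend on the considerations listed near the end of the talk, such as the nature of the data set and storage-efficiency requirements.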
Compression in the Hadoop Ecosystem
Pig
§  When to use: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size
§  What to use: enable compression and select the codec:
    pig.tmpfilecompression = true
    pig.tmpfilecompression.codec = gzip or lzo
Hive
§  When to use: intermediate files produced by Hive between multiple map-reduce jobs; Hive writes output to a table
§  What to use: enable intermediate or output compression:
    hive.exec.compress.intermediate = true
    hive.exec.compress.output = true
HBase
§  When to use: compress data at the CF level (support for LZO, gzip, Snappy, and LZ4)
§  What to use: list required JNI libraries in hbase.regionserver.codecs, then enable compression:
    create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
    alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
Compression in Hadoop at Yahoo!
[Figure: pie charts of compression usage and codec share for the three pipeline phases (1: input data to Map, 2: intermediate (Map) output, 3: final Reduce output). Data sources: 4.2M jobs, Jun 10-16, 2013, and 380M files on Jun 16, 2013 (/data, /projects). Chart annotations: "Includes intermediate Pig/Hive compression" and "Pig intermediate compressed".]
Chart values, in slide order:
§  99.8% / 0.2% split; codecs: LZO 98.3%, gzip 1.1%, zlib/default 0.5%, bzip2 0.1%
§  39.0% / 61.0% split; codecs: LZO 55%, gzip 35%, bzip2 5%, zlib/default 5%
§  98% / 2% split; codecs: zlib/default 73%, gzip 22%, bzip2 4%, LZO 1%
Compression for Data Storage Efficiency
§  DSE considerations at Yahoo!
§  RCFile instead of SequenceFile
§  Faster implementation of bzip2
§  Native-code bzip2 codec
§  HADOOP-8462 [1], available in 0.23.7
§  Substituting the IPP library
[1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
IPP Libraries
§  Integrated Performance Primitives from Intel
§  Algorithmic and architectural optimizations
§  Processor-specific variants of each function
§  Applications remain processor-neutral
§  Compression: LZ, RLE, BWT, LZO
§  High level formats include: zlib, gzip, bzip2 and LZO
Measuring Standalone Performance
§  Standard programs (gzip, bzip2) used
§  Driver program written for other cases
§  32-bit mode
§  Single-threaded
§  JVM load overhead discounted
§  Default compression level
§  Quad-core Xeon machine
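A driver of the kind described above can be approximated in a few lines. The sketch below is not the authors' driver program; it is a single-threaded Python stand-in that measures process CPU time for compression and decompression with the standard-library `zlib` and `bz2` modules at their default levels. The helper names and the sample data are assumptions.

```python
import bz2
import time
import zlib


def cpu_seconds(func, data):
    """Run func(data) and return (result, process CPU time consumed in seconds)."""
    start = time.process_time()
    result = func(data)
    return result, time.process_time() - start


def measure(name, compress, decompress, data):
    """Measure compression and decompression CPU time for one codec."""
    compressed, compress_sec = cpu_seconds(compress, data)
    restored, decompress_sec = cpu_seconds(decompress, compressed)
    if restored != data:
        raise RuntimeError(f"{name}: round trip did not reproduce the input")
    return {
        "codec": name,
        "compress_sec": compress_sec,
        "decompress_sec": decompress_sec,
        "compressed_bytes": len(compressed),
    }


# Hypothetical stand-in for a real corpus such as the Wikipedia or Silesia files.
sample = b"generated text stands in for a real corpus " * 20000

for name, comp, decomp in (
    ("zlib", zlib.compress, zlib.decompress),
    ("bzip2", bz2.compress, bz2.decompress),
):
    row = measure(name, comp, decomp, sample)
    print(f"{row['codec']}: compress {row['compress_sec']:.3f}s, "
          f"decompress {row['decompress_sec']:.3f}s, {row['compressed_bytes']} bytes")
```

Using `time.process_time` rather than wall-clock time keeps the measurement closer to the CPU-time figures reported in the charts, though absolute numbers will differ by machine and library build.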
Data Corpuses Used
§  Binary files
§  Generated text from randomtextwriter
§  Wikipedia corpus
§  Silesia corpus
Compression Ratio
[Figure: bar chart of file size (MB, axis 0 to 300) for uncompressed data and for zlib, bzip2, LZO, Snappy, and LZ4 output, with one series each for the exe, rtext, wiki, and silesia corpora.]
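Compression Performance
[Figure: bar chart of compression CPU time (sec, axis 0 to 90) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2, with one series each for the exe, rtext, wiki, and silesia corpora. Labeled values: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26.]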
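To put the labeled compression times in relative terms, the short Python sketch below computes the percentage reduction in CPU time against a baseline. The values are the labeled ones from the chart above; the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Labeled compression CPU times (sec) from the chart above.
compress_sec = {
    "zlib": 29,
    "IPP-zlib": 23,
    "Java-bzip2": 63,
    "bzip2": 44,
    "IPP-bzip2": 26,
}

native_vs_java = percent_reduction(compress_sec["Java-bzip2"], compress_sec["bzip2"])
ipp_vs_java = percent_reduction(compress_sec["Java-bzip2"], compress_sec["IPP-bzip2"])
ipp_vs_zlib = percent_reduction(compress_sec["zlib"], compress_sec["IPP-zlib"])

print(f"native bzip2 vs Java bzip2: {native_vs_java:.1f}% less CPU time")
print(f"IPP bzip2 vs Java bzip2:    {ipp_vs_java:.1f}% less CPU time")
print(f"IPP zlib vs zlib:           {ipp_vs_zlib:.1f}% less CPU time")
```

On these labeled values, native-code bzip2 uses roughly 30% less compression CPU time than the Java implementation, and the IPP build roughly 59% less.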
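Compression Performance (Fast Algorithms)
[Figure: bar chart of compression CPU time (sec, axis 0 to 3.5) for LZO, Snappy, and LZ4, with one series each for the exe, rtext, wiki, and silesia corpora. Labeled values: LZO 3.2, Snappy 2.9, LZ4 1.7.]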
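Decompression Performance
[Figure: bar chart of decompression CPU time (sec, axis 0 to 25) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2, with one series each for the exe, rtext, wiki, and silesia corpora. Labeled values: zlib 3, IPP-zlib 2, Java-bzip2 21, bzip2 17, IPP-bzip2 12.]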
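Decompression Performance (Fast Algorithms)
[Figure: bar chart of decompression CPU time (sec, axis 0 to 2) for LZO, Snappy, and LZ4, with one series each for the exe, rtext, wiki, and silesia corpora. Labeled values: LZO 1.6, Snappy 1.1, LZ4 0.7.]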
Compression Performance within Hadoop
§  Daytona performance framework
§  GridMix v1
§  Loadgen and sort jobs
§  Input data compressed with zlib / bzip2
§  LZO used for intermediate compression
§  35 datanodes, dual-quad-core machines
Map Performance
[Figure: bar chart of map time (sec, axis 0 to 50). Labeled values: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33.]
Reduce Performance
[Figure: bar chart of reduce time (min, axis 0 to 35). Labeled values: Java-bzip2 31, bzip2 28, IPP-bzip2 18, zlib 14.]
Job Performance
[Figure: bar chart of job time (min, axis 0 to 40) for Java-bzip2, bzip2, IPP-bzip2, and zlib, with two series in legend order (sort, loadgen). Labeled values by series: 38, 34, 23, 19 and 38, 34, 25, 18.]
Future Work
§  Splittability support for native-code bzip2 codec
§  Enhancing Pig to use common bzip2 codec
§  Optimizing the JNI interface and buffer copies
§  Varying the compression effort parameter
§  Performance evaluation for 64-bit mode
§  Updating the zlib codec to specify alternative libraries
§  Other codec combinations, such as zlib for transient data
§  Other compression algorithms
Considerations in Selecting Compression Type
§  Nature of the data set
§  Chained jobs
§  Data-storage efficiency requirements
§  Frequency of compression vs. decompression
§  Requirement for compatibility with a standard data format
§  Splittability requirements
§  Size of the intermediate and final data
§  Alternative implementations of compression libraries