DATA COMPRESSION IN HADOOP
Selahattin Güngörmüş
Data Compression in Hadoop - TROUG 2014

1. DATA COMPRESSION IN HADOOP
Selahattin Güngörmüş
2. Introduction
Selahattin Güngörmüş
Sr. Data Warehouse Consultant, i2i-Systems
 Computer Engineer (Istanbul Technical University / 2010)
 Consultant at Turkcell for 2 years
 Primary focus on Data Integration
 Hadoop, Big Data Technologies
 Oracle PL/SQL, ODI, OWB
3. Agenda
 Data Compression Overview
 Tradeoffs and Common Compression Algorithms
 Test Results
 Data Compression in Hadoop
 What is Splittable Compression?
 Compression in MapReduce Pipeline
 When to Compress?
 Compression in MapReduce, Pig & Hive
 Performance Tests
4. Data Compression
 Storing data in a format that requires less space than the original
 Useful for storing and transmitting data
 Two general types:
 Lossless compression: Run-Length Encoding, Lempel-Ziv, Huffman Encoding
 Lossy compression: JPEG, MP3, MPEG
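A lossless round trip can be sketched with Python's standard library (gzip stands in here for any of the lossless codecs above; the sample data is made up):

```python
import gzip

# Made-up repetitive sample standing in for real input data.
original = b"Hadoop moves a lot of repetitive text data around. " * 100

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Lossless: the round trip reproduces the input bit-for-bit.
assert restored == original
# Repetitive input shrinks to a small fraction of its size.
assert len(compressed) < len(original)
```

A lossy codec such as JPEG or MP3 would instead discard detail, so the first assertion would not hold.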
5. Data Compression Tradeoffs
 Reduces storage need
 Less disk I/O
 Speeds up network transfer
 Consumes CPU
The central tradeoff is compression speed versus compression ratio.
6. Compression Algorithms

Format   Algorithm      File Extension   Splittable    Java / Native
GZIP     Deflate        .gz              N             Both
BZIP2    Bzip2          .bz2             Y             Both
LZO      LZO            .lzo             Y (Indexed)   Native
Snappy   Snappy         .snappy          N             Native
LZ4      Kind of LZ77   .lz4             N             Native

 Splittability: every compressed split of the file can be uncompressed and processed independently, so parallel processing is possible.
 Native implementations are preferable due to higher performance.
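Of the formats in the table, only gzip (Deflate) and bzip2 ship with Python's standard library; a rough sketch of comparing them on repetitive text (Snappy/LZO/LZ4 need third-party packages, and exact sizes and timings will vary by machine and input):

```python
import bz2
import gzip
import time

# Roughly 1.3 MB of repetitive text standing in for the n-gram corpus.
data = b"the quick brown fox jumps over the lazy dog\n" * 30000

results = {}
for name, codec in (("gzip", gzip), ("bzip2", bz2)):
    start = time.perf_counter()
    out = codec.compress(data)
    results[name] = (len(out), time.perf_counter() - start)
    print(f"{name}: {len(data)} -> {results[name][0]} bytes "
          f"in {results[name][1]:.3f}s")
```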
7. Test Environment
 8-core i7 CPU
 8 GB memory
 64-bit CentOS operating system
 1.4 GB Wikipedia Corpus 2-gram text input
8. Compression Ratio

Codec         Size (MB)
No Compress   1,403
Snappy        701
LZ4           693
LZO           684
GZIP          447
BZIP2         390
9. Compression Speed

Codec    Compress Time (s)   Decompress Time (s)
BZIP2    142.32              62.51
GZIP     85.67               21.82
LZO      7.61                11.17
LZ4      6.45                2.36
Snappy   6.41                19.84
10. Comp. Ratio / Speed Tradeoff
• Compression Ratio (%) = (1 - Compressed Size / Uncompressed Size) * 100
• The 1.4 GB Wikipedia Corpus data is used for performance comparisons

Codec    Compr. + Decompr. Time (s)   Compression Ratio (%)
LZO      19                           51
BZIP2    205                          72
GZIP     107                          68
Snappy   26                           50
LZ4      9                            51
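Plugging the sizes from the Compression Ratio chart into this formula reproduces the percentages above:

```python
# Output sizes in MB from the Compression Ratio test (1.4 GB input).
SIZES_MB = {"Snappy": 701, "LZ4": 693, "LZO": 684, "GZIP": 447, "BZIP2": 390}
UNCOMPRESSED_MB = 1403

def compression_ratio(compressed, uncompressed):
    """Compression ratio as a percentage: (1 - compressed/uncompressed) * 100."""
    return (1 - compressed / uncompressed) * 100

for codec, size in SIZES_MB.items():
    print(f"{codec}: {compression_ratio(size, UNCOMPRESSED_MB):.0f}%")
# Prints GZIP: 68% and BZIP2: 72%, matching the chart.
```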
11. Test Results

GZIP
 Strengths: relatively high compression ratio; reasonable speed
 Weaknesses: relatively slower than LZO, Snappy and LZ4; non-splittable
BZIP2
 Strengths: best compression ratio; splittable
 Weaknesses: about 2x slower than GZIP
LZO
 Strengths: rapid compression; balanced compression/decompression times
 Weaknesses: non-splittable unless indexed
Snappy
 Strengths: quickest compression method
 Weaknesses: relatively slow in decompression; non-splittable
LZ4
 Strengths: very quick compression; best decompression speed
 Weaknesses: non-splittable
12. Data Compression in Hadoop
 Hadoop jobs are usually I/O bound
 Compression reduces the size of data transferred across the network
 Overall job performance may be increased by simply enabling compression
 Splittability must be taken into account!
13. What is Splittable Compression?
 If a compression method is splittable, every compressed input split can be extracted and processed independently.
 Otherwise, every compressed split must be transferred to a single mapper node so the input file can be decompressed as a whole.
(Diagram: Split1.rar  Split4.rar, each flowing through a Decompress step into a Mapper.)
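The idea can be sketched in Python: data compressed as independent bzip2 streams can be decompressed split-by-split, which is what makes parallel mappers possible (a toy model, not Hadoop's actual split handling):

```python
import bz2

# Made-up input carved into fixed-size "splits", as HDFS would do.
text = b"the quick brown fox jumps over the lazy dog\n" * 5000
split_size = 50000
splits = [text[i:i + split_size] for i in range(0, len(text), split_size)]

# Splittable style: each split is its own compressed stream, so any
# "mapper" can decompress its block without reading the others.
blocks = [bz2.compress(s) for s in splits]
recovered = b"".join(bz2.decompress(b) for b in blocks)

assert recovered == text
assert len(blocks) > 1  # more than one independently processable unit
```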
14. Compression in MapReduce Pipeline
(Diagram: compressed input splits are decompressed by the maps; map output spilled to disk is compressed, transferred during shuffle & sort, and decompressed by the reducers; reducer output is compressed again before it is written.)
The three compression points:
1. Use compressed map input
2. Compress intermediate data
3. Compress reducer output
15. When to Compress?
1. Use Compressed Map Input
• MapReduce jobs read input from HDFS
• Compress if the input data is large; this reduces disk read cost
• Compress with splittable algorithms like Bzip2, or use compression with splittable file structures such as Sequence Files, RC Files etc.
2. Compress Intermediate Data
• Map output is written to disk (spill) and transferred across the network
• Always use compression to reduce both disk write and network transfer load
• Beneficial from a performance point of view even if input and output are uncompressed
• Use faster codecs such as Snappy, LZO
3. Compress Reducer Output
• MapReduce output is used both for archiving and for chaining MapReduce jobs
• Use compression to reduce disk space for archiving
• Compression is also beneficial for chaining jobs, especially with limited disk throughput
• Use compression methods with a higher compression ratio to save more disk space
16. Supported Codecs in Hadoop
 Zlib  org.apache.hadoop.io.compress.DefaultCodec
 Gzip  org.apache.hadoop.io.compress.GzipCodec
 Bzip2  org.apache.hadoop.io.compress.BZip2Codec
 Lzo  com.hadoop.compression.lzo.LzoCodec
 Lz4  org.apache.hadoop.io.compress.Lz4Codec
 Snappy  org.apache.hadoop.io.compress.SnappyCodec
17. Compression in MapReduce
Compressed Input Usage
 File format is auto-recognized by its extension; the codec must be defined in core-site.xml.
Compress Intermediate Data (Map Output)
 mapreduce.map.output.compress = true
 mapreduce.map.output.compress.codec = CodecName
Compress Job Output (Reducer Output)
 mapreduce.output.fileoutputformat.compress = true
 mapreduce.output.fileoutputformat.compress.codec = CodecName
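These properties can also be fixed cluster-wide instead of per job; a hypothetical mapred-site.xml fragment, with Snappy picked as an example codec (the class name appears on the Supported Codecs slide):

```xml
<!-- mapred-site.xml: compress map output and final job output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```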
18. Compression in Pig
Compressed Input Usage
 File format is auto-recognized by its extension; the codec must be defined in core-site.xml.
Compress Intermediate Data (Map Output)
 pig.tmpfilecompression = true
 pig.tmpfilecompression.codec = CodecName
 Use faster codecs such as Snappy, LZO, LZ4; useful for chained MapReduce jobs with lots of intermediate data, such as joins.
Compress Job Output (Reducer Output), same as MapReduce
 mapreduce.output.fileoutputformat.compress = true
 mapreduce.output.fileoutputformat.compress.codec = CodecName
19. Compression in Hive
Compressed Input Usage
 Can be defined in the table definition:
 STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
Compress Intermediate Data (Map Output)
 SET hive.exec.compress.intermediate = true;
 SET mapred.map.output.compression.codec = CodecName;
 SET mapred.map.output.compression.type = BLOCK / RECORD;
 Use faster codecs such as Snappy, LZO, LZ4; useful for chained MapReduce jobs with lots of intermediate data, such as joins.
Compress Job Output (Reducer Output)
 SET hive.exec.compress.output = true;
 SET mapred.output.compression.codec = CodecName;
 SET mapred.output.compression.type = BLOCK / RECORD;
20. Performance Test For Hive
 We are going to test the performance effect of compression in Hive.
 Input file: Wikipedia Corpus 2-gram text data
21. Performance Test For Hive
Case 1:
• Input data is an uncompressed text file
• No intermediate compression
• No output compression
Case 2:
• Input data is a sequence file compressed with Snappy
• Intermediate data is compressed with Snappy
• Output data is compressed with Snappy
22. Performance Test For Hive

                 HDFS READ   HDFS WRITE   TIME SPENT
NOT COMPRESSED   1,619       299          193
COMPRESSED       945         128          139
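As a quick check, the relative savings implied by the chart values above can be computed directly (percentages rounded):

```python
not_compressed = {"HDFS read": 1619, "HDFS write": 299, "time spent": 193}
compressed = {"HDFS read": 945, "HDFS write": 128, "time spent": 139}

for metric, before in not_compressed.items():
    after = compressed[metric]
    saving = (1 - after / before) * 100
    print(f"{metric}: {saving:.0f}% lower with compression")
# HDFS read drops by roughly 42%, time spent by roughly 28%.
```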
23. Questions