Data Compression in Hadoop - TROUG 2014

Transcript

  • 1. DATA COMPRESSION IN HADOOP Selahattin Güngörmüş
  • 2. Introduction
    Selahattin Güngörmüş, Sr. Data Warehouse Consultant, i2i-Systems
    - Computer Engineer (Istanbul Technical University, 2010)
    - Consultant at Turkcell for 2 years
    - Primary focus on Data Integration
    - Hadoop and Big Data technologies
    - Oracle PL/SQL, ODI, OWB
  • 3. Agenda
    - Data Compression Overview
    - Tradeoffs and Common Compression Algorithms
    - Test Results
    - Data Compression in Hadoop
    - What is Splittable Compression?
    - Compression in the MapReduce Pipeline
    - When to Compress?
    - Compression in MapReduce, Pig and Hive
    - Performance Tests
  • 4. Data Compression
    - Storing data in a format that requires less space than the original
    - Useful for both storing and transmitting data
    - Two general types: lossless and lossy
    - Lossless methods: Run-Length Encoding, Lempel-Ziv, Huffman Encoding
    - Lossy methods: JPEG, MP3, MPEG
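Run-length encoding, the first lossless method listed above, is simple enough to sketch in a few lines of Python. This is a minimal illustration of the idea, not related to any Hadoop codec:

```python
def rle_encode(data: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (char, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)

print(rle_encode("aaabccccd"))  # [('a', 3), ('b', 1), ('c', 4), ('d', 1)]
```

Because the round trip is exact, no information is lost; that is what distinguishes this family from lossy schemes such as JPEG or MP3.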
  • 5. Data Compression Tradeoffs
    - Reduces storage needs
    - Less disk I/O
    - Speeds up network transfer
    - Consumes CPU
    The central tradeoff is compression speed versus compression ratio.
  • 6. Compression Algorithms
    Format | Algorithm    | File Extension | Splittable    | Java / Native
    GZIP   | Deflate      | .gz            | No            | Both
    BZIP2  | bzip2        | .bz2           | Yes           | Both
    LZO    | LZO          | .lzo           | Yes (indexed) | Native
    Snappy | Snappy       | .snappy        | No            | Native
    LZ4    | LZ77 variant | .lz4           | No            | Native
    - Splittability: every compressed split of the file can be decompressed and processed independently, so parallel processing is possible.
    - Native implementations are preferable due to their higher performance.
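Two of the formats in the table, GZIP (Deflate) and BZIP2, are available in Python's standard library, which makes it easy to reproduce the ratio-versus-speed comparison on a toy input. A minimal sketch; real ratios and timings depend heavily on the data (the talk used a 1.4 GB Wikipedia corpus):

```python
import bz2
import gzip
import time

# Highly repetitive toy input, so both codecs compress it well.
raw = b"the quick brown fox jumps over the lazy dog " * 20000

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    t0 = time.perf_counter()
    packed = compress(raw)
    elapsed = time.perf_counter() - t0
    ratio = (1 - len(packed) / len(raw)) * 100
    print(f"{name}: {len(packed)} bytes, ratio {ratio:.0f}%, {elapsed * 1000:.1f} ms")
```

On inputs like this, bzip2 typically achieves the better ratio while gzip finishes faster, mirroring the pattern in the test results that follow.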
  • 7. Test Environment
    - 8-core i7 CPU
    - 8 GB memory
    - 64-bit CentOS operating system
    - Input: 1.4 GB Wikipedia Corpus 2-gram text
  • 8. Compression Ratio
    Output size in MB for the 1.4 GB input:
    - No compression: 1,403
    - Snappy: 701
    - LZ4: 693
    - LZO: 684
    - GZIP: 447
    - BZIP2: 390
  • 9. Compression Speed
    Compress / decompress times in seconds:
    - BZIP2: 142.32 / 62.51
    - GZIP: 85.67 / 21.82
    - LZO: 7.61 / 11.17
    - LZ4: 6.45 / 2.36
    - Snappy: 6.41 / 19.84
  • 10. Compression Ratio / Speed Tradeoff
    - Compression ratio = (1 - compressed size / uncompressed size) * 100
    - The 1.4 GB Wikipedia Corpus data is used for the performance comparisons
    Combined compress + decompress time versus compression ratio:
    - LZO: 19 s, 51%
    - BZIP2: 205 s, 72%
    - GZIP: 107 s, 68%
    - Snappy: 26 s, 50%
    - LZ4: 9 s, 51%
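Applying the ratio formula above to the measured sizes from the compression-ratio test reproduces the percentages in this chart:

```python
# Sizes in MB from the compression-ratio test; the uncompressed input
# was 1,403 MB.
uncompressed = 1403
sizes = {"Snappy": 701, "LZ4": 693, "LZO": 684, "GZIP": 447, "BZIP2": 390}

for codec, size in sizes.items():
    ratio = (1 - size / uncompressed) * 100
    print(f"{codec}: {ratio:.0f}%")
# Snappy 50%, LZ4 51%, LZO 51%, GZIP 68%, BZIP2 72%
```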
  • 11. Test Results
    - GZIP: relatively high compression ratio at reasonable speed, but slower than LZO, Snappy and LZ4, and not splittable.
    - BZIP2: best compression ratio and splittable, but about 2x slower than GZIP.
    - LZO: rapid compression with balanced compress/decompress times, but not splittable unless indexed.
    - Snappy: quickest compression, but relatively slow decompression and not splittable.
    - LZ4: very quick compression with the best decompression speed, but not splittable.
  • 12. Data Compression in Hadoop
    - Hadoop jobs are usually I/O bound
    - Compression reduces the size of data transferred across the network
    - Overall job performance may be improved simply by enabling compression
    - Splittability must be taken into account!
  • 13. What is Splittable Compression?
    - If a compression format is splittable, every compressed input split can be decompressed and processed independently, each by its own mapper.
    - Otherwise, all compressed splits of the input file must be sent to a single mapper node in order to decompress the file.
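The idea can be illustrated with Python's bz2 module: blocks compressed independently can also be decompressed independently, which is what lets a splittable format hand each split to its own mapper. In this sketch the block boundaries are known because we created them ourselves; a real splittable codec such as bzip2 embeds discoverable block markers instead:

```python
import bz2

# Two "input splits", compressed independently, the way a splittable
# format effectively stores its blocks.
split_a = bz2.compress(b"records 1..1000\n")
split_b = bz2.compress(b"records 1001..2000\n")
archive = split_a + split_b  # one concatenated file on disk

# Each block decompresses on its own, so each could go to its own mapper.
print(bz2.decompress(archive[:len(split_a)]))
print(bz2.decompress(archive[len(split_a):]))
```

With a non-splittable format such as plain gzip, the whole file is one compression stream, so a single worker must read it from the beginning.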
  • 14. Compression in the MapReduce Pipeline
    Compression can be applied at three points in the map / shuffle & sort / reduce pipeline:
    1. Use compressed map input (decompressed as the input splits are read)
    2. Compress intermediate data (compressed when maps spill to disk, decompressed by the reducers)
    3. Compress reducer output
  • 15. When to Compress?
    1. Use compressed map input
    - MapReduce jobs read their input from HDFS
    - Compress if the input data is large; this reduces the disk read cost
    - Compress with splittable algorithms such as bzip2, or use compression with splittable file structures such as Sequence Files or RC Files
    2. Compress intermediate data
    - Map output is written to disk (spill) and transferred across the network
    - Always use compression here to reduce both disk write and network transfer load
    - Beneficial from a performance point of view even if input and output are uncompressed
    - Use faster codecs such as Snappy or LZO
    3. Compress reducer output
    - MapReduce output is used both for archiving and for chaining MapReduce jobs
    - Use compression to reduce the disk space needed for archiving
    - Compression also benefits chained jobs, especially when disk throughput is limited
    - Use codecs with a higher compression ratio to save more disk space
  • 16. Supported Codecs in Hadoop
    - Zlib: org.apache.hadoop.io.compress.DefaultCodec
    - Gzip: org.apache.hadoop.io.compress.GzipCodec
    - Bzip2: org.apache.hadoop.io.compress.BZip2Codec
    - Lzo: com.hadoop.compression.lzo.LzoCodec
    - Lz4: org.apache.hadoop.io.compress.Lz4Codec
    - Snappy: org.apache.hadoop.io.compress.SnappyCodec
  • 17. Compression in MapReduce
    Compressed input: the file format is recognized automatically by its extension; the codec must be defined in core-site.xml.
    Compress intermediate data (map output):
    mapreduce.map.output.compress = true
    mapreduce.map.output.compress.codec = CodecName
    Compress job output (reducer output):
    mapreduce.output.fileoutputformat.compress = true
    mapreduce.output.fileoutputformat.compress.codec = CodecName
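A sketch of how these keys could be set as cluster-wide defaults in mapred-site.xml; Snappy is used purely as an example codec, and the same keys can be passed per job instead (e.g. with -D on the command line):

```xml
<!-- mapred-site.xml: compress map output and job output by default -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```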
  • 18. Compression in Pig
    Compressed input: the file format is recognized automatically by its extension; the codec must be defined in core-site.xml.
    Compress intermediate data (map output):
    pig.tmpfilecompression = true
    pig.tmpfilecompression.codec = CodecName
    - Use faster codecs such as Snappy, LZO or LZ4
    - Useful for chained MapReduce jobs with lots of intermediate data, such as joins
    Compress job output (reducer output), same as MapReduce:
    mapreduce.output.fileoutputformat.compress = true
    mapreduce.output.fileoutputformat.compress.codec = CodecName
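A hedged sketch of how these properties might appear inside a Pig script. The relation and file names are illustrative, and the accepted values for pig.tmpfilecompression.codec (e.g. gz, lzo) depend on the Pig version, so check your distribution's documentation:

```pig
-- Compress intermediate (tmp) data between chained MapReduce jobs.
set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;

-- Illustrative job: the GROUP forces a shuffle, producing tmp data.
lines   = LOAD 'wikipedia_2gram.txt' AS (gram:chararray, cnt:long);
by_gram = GROUP lines BY gram;
counts  = FOREACH by_gram GENERATE group, SUM(lines.cnt);
STORE counts INTO 'output';
```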
  • 19. Compression in Hive
    Compressed input: can be declared in the table definition, e.g.
    STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    Compress intermediate data (map output):
    SET hive.exec.compress.intermediate = true;
    SET mapred.map.output.compression.codec = CodecName;
    SET mapred.map.output.compression.type = BLOCK / RECORD;
    - Use faster codecs such as Snappy, LZO or LZ4
    - Useful for chained MapReduce jobs with lots of intermediate data, such as joins
    Compress job output (reducer output):
    SET hive.exec.compress.output = true;
    SET mapred.output.compression.codec = CodecName;
    SET mapred.output.compression.type = BLOCK / RECORD;
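A hedged sketch of a Hive session using these settings. The table names and query are illustrative; the mapred.* property names are the legacy spellings shown on the slide, and newer Hive versions also accept the mapreduce.* equivalents:

```sql
-- Compress intermediate data between MapReduce stages.
SET hive.exec.compress.intermediate = true;
SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output as well.
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type = BLOCK;

-- Illustrative query whose output is written compressed.
INSERT OVERWRITE TABLE ngram_counts
SELECT gram, SUM(cnt) FROM ngrams GROUP BY gram;
```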
  • 20. Performance Test for Hive
    We are going to test the performance effect of compression in Hive.
    Input file: Wikipedia Corpus 2-gram text data
  • 21. Performance Test for Hive
    Case 1:
    - Input data is an uncompressed text file
    - No intermediate compression
    - No output compression
    Case 2:
    - Input data is a sequence file compressed with Snappy
    - Intermediate data is compressed with Snappy
    - Output data is compressed with Snappy
  • 22. Performance Test for Hive - Results
    Uncompressed vs compressed, as shown on the slide chart:
    - HDFS read: 1,619 vs 945
    - HDFS write: 299 vs 128
    - Time spent: 193 vs 139
  • 23. Questions