Data Compression in Hadoop - TROUG 2014
 

    Presentation Transcript

    • DATA COMPRESSION IN HADOOP Selahattin Güngörmüş
    • Introduction
      Selahattin Güngörmüş, Sr. Data Warehouse Consultant, i2i-Systems
      - Computer Engineer (Istanbul Technical University / 2010)
      - Consultant at Turkcell for 2 years
      - Primary focus on Data Integration
      - Hadoop, Big Data Technologies
      - Oracle PL/SQL, ODI, OWB
    • Agenda
      - Data Compression Overview
      - Tradeoffs and Common Compression Algorithms
      - Test Results
      - Data Compression in Hadoop
      - What is Splittable Compression?
      - Compression in the MapReduce Pipeline
      - When to Compress?
      - Compression in MapReduce, Pig & Hive
      - Performance Tests
    • Data Compression
      - Storing data in a format that requires less space than the original
      - Useful for storing and transmitting data
      - Two general types:
        - Lossless compression: Run-Length Encoding, Lempel-Ziv, Huffman Encoding
        - Lossy compression: JPEG, MP3, MPEG
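      A minimal Java sketch of the run-length encoding idea named above (the class and method names are illustrative only; real codecs are far more sophisticated):

      // Minimal run-length encoding sketch: lossless, since the original
      // string can be reconstructed exactly from the (count, character) pairs.
      public class RunLengthExample {
          static String encode(String input) {
              StringBuilder out = new StringBuilder();
              int i = 0;
              while (i < input.length()) {
                  char c = input.charAt(i);
                  int run = 1;
                  while (i + run < input.length() && input.charAt(i + run) == c) {
                      run++;
                  }
                  out.append(run).append(c);  // e.g. "AAAA" becomes "4A"
                  i += run;
              }
              return out.toString();
          }

          public static void main(String[] args) {
              System.out.println(encode("AAAABBBCC"));  // prints 4A3B2C
          }
      }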
    • Data Compression Tradeoffs
      - Reduces storage need
      - Less disk I/O
      - Speeds up network transfer
      - Consumes CPU
      (Tradeoff axis: compression speed vs. compression ratio)
    • Compression Algorithms
      Format  | Algorithm    | File Extension | Splittable    | Java / Native
      GZIP    | DEFLATE      | .gz            | No            | Both
      BZIP2   | bzip2        | .bz2           | Yes           | Both
      LZO     | LZO          | .lzo           | Yes (indexed) | Native
      Snappy  | Snappy       | .snappy        | No            | Native
      LZ4     | LZ77 variant | .lz4           | No            | Native
      - Splittability: every compressed split of the file can be uncompressed and processed independently, so parallel processing is possible.
      - Native implementations are preferable due to their higher performance.
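      Splittability can also be checked programmatically: a codec that implements Hadoop's SplittableCompressionCodec interface (as BZip2Codec does) can be split for parallel processing. A minimal sketch, assuming the Hadoop client libraries are on the classpath; the class name is made up:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.BZip2Codec;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.io.compress.SplittableCompressionCodec;
      import org.apache.hadoop.util.ReflectionUtils;

      public class SplittabilityCheck {
          public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Instantiate two of the codecs from the table above.
              CompressionCodec gzip  = ReflectionUtils.newInstance(GzipCodec.class, conf);
              CompressionCodec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);
              // A codec is splittable for MapReduce input if it implements
              // SplittableCompressionCodec.
              System.out.println("gzip splittable:  " + (gzip  instanceof SplittableCompressionCodec)); // false
              System.out.println("bzip2 splittable: " + (bzip2 instanceof SplittableCompressionCodec)); // true
          }
      }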
    • Test Environment
      - 8-core i7 CPU
      - 8 GB memory
      - 64-bit CentOS operating system
      - Input: 1.4 GB Wikipedia Corpus 2-gram text
    • Compression Ratio (output size in MB for the 1.4 GB input)
      - No compression: 1,403
      - Snappy: 701
      - LZ4: 693
      - LZO: 684
      - GZIP: 447
      - BZIP2: 390
    • Compression Speed (time in seconds)
      - BZIP2:  compress 142.32, decompress 62.51
      - GZIP:   compress 85.67,  decompress 21.82
      - LZO:    compress 7.61,   decompress 11.17
      - LZ4:    compress 6.45,   decompress 2.36
      - Snappy: compress 6.41,   decompress 19.84
    • Comp. Ratio / Speed Tradeoff
      - Compression ratio (%) = (1 - compressed size / uncompressed size) * 100; e.g. GZIP: (1 - 447 / 1,403) * 100 ≈ 68%
      - The 1.4 GB Wikipedia Corpus data is used for the performance comparisons
      - Compress + decompress time (s) vs. compression ratio (%):
        - LZO:    19 s, 51%
        - BZIP2:  205 s, 72%
        - GZIP:   107 s, 68%
        - Snappy: 26 s, 50%
        - LZ4:    9 s, 51%
    • Test Results
      - GZIP: relatively high compression ratio and reasonable speed, but slower than LZO, Snappy and LZ4; not splittable
      - BZIP2: best compression ratio and splittable, but roughly 2x slower than GZIP
      - LZO: rapid compression with balanced compress/decompress times; not splittable without an index
      - Snappy: quickest compression method; relatively slow decompression; not splittable
      - LZ4: very quick compression with the best decompression speed; not splittable
    • Data Compression in Hadoop
      - Hadoop jobs are usually I/O bound
      - Compression reduces the size of data transferred across the network
      - Overall job performance may be improved simply by enabling compression
      - Splittability must be taken into account!
    • What is Splittable Compression?
      - If a compression method is splittable, every compressed input split can be extracted and processed independently.
      - Otherwise, in order to decompress the input file, every compressed split must be sent to a single mapper node.
      (Diagram: Split1 ... Split4, each decompressed and processed by its own mapper)
    • Compression in the MapReduce Pipeline
      (Diagram: compressed input splits are decompressed by the maps, the map output spilled to disk is compressed, shuffle & sort moves it to the reducers, which decompress their input and compress the final output)
      Three places to apply compression:
      1. Use compressed map input
      2. Compress intermediate data
      3. Compress reducer output
    • When to Compress?
      1. Use compressed map input
         - MapReduce jobs read their input from HDFS
         - Compress if the input data is large; this reduces disk read cost
         - Compress with splittable algorithms such as BZIP2
         - Or use compression with splittable file structures such as Sequence Files, RC Files, etc. (see the sketch after this list)
      2. Compress intermediate data
         - Map output is written to disk (spill) and transferred across the network
         - Always use compression to reduce both disk write and network transfer load
         - Beneficial from a performance point of view even if input and output are uncompressed
         - Use faster codecs such as Snappy or LZO
      3. Compress reducer output
         - MapReduce output is used either for archiving or for chaining MapReduce jobs
         - Use compression to reduce the disk space needed for archiving
         - Compression is also beneficial for chained jobs, especially when disk throughput is limited
         - Use compression methods with a higher compression ratio to save more disk space
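      A minimal driver sketch of the Sequence File option from point 1: writing block-compressed Sequence File output lets even a non-splittable codec such as Snappy be used while keeping the output splittable for the next job in a chain. The job name, output path argument and codec choice here are assumptions, not prescribed by the deck:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.compress.SnappyCodec;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

      public class SequenceFileOutputExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "snappy-sequencefile-output");
              job.setJarByClass(SequenceFileOutputExample.class);

              // Write the reducer output as a block-compressed Sequence File.
              // Block compression keeps the file splittable even with a
              // non-splittable codec such as Snappy.
              job.setOutputFormatClass(SequenceFileOutputFormat.class);
              SequenceFileOutputFormat.setCompressOutput(job, true);
              SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
              SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(Text.class);
              FileOutputFormat.setOutputPath(job, new Path(args[0]));
              // mapper/reducer classes, input path, job submission etc. omitted
          }
      }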
    • Supported Codecs in Hadoop
      - Zlib:   org.apache.hadoop.io.compress.DefaultCodec
      - Gzip:   org.apache.hadoop.io.compress.GzipCodec
      - Bzip2:  org.apache.hadoop.io.compress.BZip2Codec
      - Lzo:    com.hadoop.compression.lzo.LzoCodec
      - Lz4:    org.apache.hadoop.io.compress.Lz4Codec
      - Snappy: org.apache.hadoop.io.compress.SnappyCodec
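      A minimal sketch of how Hadoop maps a file name to one of these codec classes by extension (the basis of the "auto recognized with extension" behaviour on the next slide); the input path is a placeholder:

      import java.io.InputStream;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.CompressionCodecFactory;

      public class CodecByExtension {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Path input = new Path(args[0]);  // e.g. some .gz, .bz2 or .snappy file

              // CompressionCodecFactory maps a file extension to a codec class.
              CompressionCodecFactory factory = new CompressionCodecFactory(conf);
              CompressionCodec codec = factory.getCodec(input);

              FileSystem fs = FileSystem.get(conf);
              InputStream in = (codec == null)
                      ? fs.open(input)                         // plain, uncompressed file
                      : codec.createInputStream(fs.open(input)); // transparently decompressed
              System.out.println("Codec: " + (codec == null ? "none" : codec.getClass().getName()));
              in.close();
          }
      }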
    • Compression in MapReduce
      - Compressed input usage: the file format is auto-recognized by its extension; the codec must be defined in core-site.xml
      - Compress intermediate data (map output):
        mapreduce.map.output.compress = true
        mapreduce.map.output.compress.codec = CodecName
      - Compress job output (reducer output):
        mapreduce.output.fileoutputformat.compress = true
        mapreduce.output.fileoutputformat.compress.codec = CodecName
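      The same properties can also be set programmatically in a job driver; a minimal sketch, with Snappy and Gzip picked only as example codecs:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class CompressionJobConfig {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();

              // 2) Compress intermediate data (map output) with a fast codec.
              conf.setBoolean("mapreduce.map.output.compress", true);
              conf.set("mapreduce.map.output.compress.codec",
                       "org.apache.hadoop.io.compress.SnappyCodec");

              // 3) Compress the job (reducer) output.
              conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
              conf.set("mapreduce.output.fileoutputformat.compress.codec",
                       "org.apache.hadoop.io.compress.GzipCodec");

              Job job = Job.getInstance(conf, "compressed-job");
              // 1) Compressed map input needs no extra setting here: the input
              //    format picks the codec from the file extension.
              // input/output paths, mapper, reducer, etc. omitted
          }
      }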
    • Compression in Pig
      - Compressed input usage: the file format is auto-recognized by its extension; the codec must be defined in core-site.xml
      - Compress intermediate data (map output):
        pig.tmpfilecompression = true
        pig.tmpfilecompression.codec = CodecName
        Use faster codecs such as Snappy, LZO or LZ4; useful for chained MapReduce jobs with lots of intermediate data, such as joins
      - Compress job output (reducer output), same as MapReduce:
        mapreduce.output.fileoutputformat.compress = true
        mapreduce.output.fileoutputformat.compress.codec = CodecName
    • Compression in Hive
      - Compressed input usage: can be defined in the table definition, e.g.
        STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
      - Compress intermediate data (map output):
        SET hive.exec.compress.intermediate = true;
        SET mapred.map.output.compression.codec = CodecName;
        SET mapred.map.output.compression.type = BLOCK / RECORD;
        Use faster codecs such as Snappy, LZO or LZ4; useful for chained MapReduce jobs with lots of intermediate data, such as joins
      - Compress job output (reducer output):
        SET hive.exec.compress.output = true;
        SET mapred.output.compression.codec = CodecName;
        SET mapred.output.compression.type = BLOCK / RECORD;
    • Performance Test for Hive
      - We are going to test the performance effect of compression in Hive
      - Input file: Wikipedia Corpus 2-gram text data
    • Performance Test for Hive
      - Case 1:
        - Input data is an uncompressed text file
        - No intermediate compression
        - No output compression
      - Case 2:
        - Input data is a Sequence File compressed with Snappy
        - Intermediate data is compressed with Snappy
        - Output data is compressed with Snappy
    • Performance Test for Hive: results (not compressed vs. compressed)
      - HDFS read:  1,619 vs. 945
      - HDFS write: 299 vs. 128
      - Time spent: 193 vs. 139
    • Questions