Compression Options in Hadoop - A Tale of Tradeoffs



Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
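The tension between compression ratio and compression performance can be observed on a small scale with the zlib and bzip2 bindings in the Python standard library. The sketch below is illustrative only: the sample data generator and the helper names (`space_savings`, `measure`, `make_sample`) are assumptions made for this example, and the timings it prints say nothing about the Hadoop codecs or the results reported in the talk. Space savings uses the definition given on the slides, 1 – (compressed / uncompressed).

```python
import bz2
import time
import zlib


def space_savings(uncompressed_size, compressed_size):
    """Space savings as defined on the slides: 1 - (compressed / uncompressed)."""
    return 1.0 - compressed_size / uncompressed_size


def measure(name, compress, decompress, data):
    """Return (name, space savings, CPU seconds for compress + decompress)."""
    start = time.process_time()
    packed = compress(data)
    restored = decompress(packed)
    cpu_seconds = time.process_time() - start
    if restored != data:
        # Lossless compression must reproduce the input exactly.
        raise ValueError("round trip failed for " + name)
    return name, space_savings(len(data), len(packed)), cpu_seconds


def make_sample(lines=20000):
    """Hypothetical stand-in for a text corpus: log-like lines with some variety."""
    return "\n".join(
        "user=%d action=view page=/item/%d status=200" % (i % 97, i * 7 % 1013)
        for i in range(lines)
    ).encode("utf-8")


if __name__ == "__main__":
    sample = make_sample()
    results = [
        measure("zlib", zlib.compress, zlib.decompress, sample),
        measure("bzip2", bz2.compress, bz2.decompress, sample),
    ]
    for name, savings, cpu_seconds in results:
        print("%-6s savings=%.1f%% cpu=%.3fs" % (name, savings * 100, cpu_seconds))
```

Running the same measurement on data that resembles a real workload is more informative than the synthetic sample, since both the savings and the CPU cost depend heavily on the nature of the data.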

Published in: Technology
1 Comment
  • LZO is not natively splittable, although this brings splittability
  • Time: 30 sec (Total: 30 sec) Good afternoon and welcome to Compression Options in Hadoop – A Tale of Tradeoffs. Before we get started, can I have a show of hands for how many of you use compression in Hadoop or are familiar with the compression options in Hadoop? Well, quite a few of you. I hope that you will be able to augment your current understanding and gather further insights into compression in Hadoop from our talk.
  • Time: 30 sec (Total: 1 min) Before we start, let us introduce ourselves. Govind, who is going to talk about his compression work later in this talk, is a software developer in the Hadoop team at Yahoo!, where he focuses on HBase and Hadoop performance. My name is Sumeet Singh and I lead the Hadoop products team at Yahoo!.
  • Time: 1 min (Total: 2 min) We have organized the talk in two broad groups. I will cover the first three topics: how data compression works in Hadoop, the available compression options in Hadoop, and understanding these options in terms of when and how you use them. Govind will then cover the compression problem he had at hand, results from his performance work on bzip2 and Intel’s IPP libraries, and some of the future considerations in this area.
  • Time: 2 min (Total: 4 min) Given that Hadoop was designed to operate on large volumes of data, lossless data compression (the type of compression we deal with in Hadoop) is quite integral to MapReduce jobs. Compression provides several benefits. MR jobs are almost always I/O bound, so compressing data can speed up the I/O operations that are more often than not the performance bottleneck. You can do more with less with compression turned on, i.e. improve the cluster utilization through space savings and faster data transfers across the network, as you will send less data. This is particularly true as Hadoop uses 3x data replication by default for fault tolerance. As a user, you can improve your overall job performance and your jobs may take less time to complete. Compression does not come for free though. These benefits come at the cost of increased CPU utilization in compressing and decompressing data. So, using compression itself presents a tradeoff: storage savings, faster I/O, and better use of network bandwidth against an increase in CPU load. But given the nature of Hadoop, using compression generally turns out to be a good tradeoff to make.
  • Time: 2 min (Total: 6 min) Let’s look at the compression and decompression process in a MapReduce pipeline. More often than not, you are dealing with large datasets that come in compressed for processing. The map phase detects the supported compression format (which we will talk about in detail here) and decompresses the data for processing. Map outputs to the reducers are sorted on their keys. The process of sorting and transferring data to the reduce phase is called shuffle. Map writes are buffered in memory and spilled to the disk when the buffers get full. (Data is partitioned before the spills according to the reducer it needs to go to.) Several spill files may get generated as part of the above process, and they are then merged on the disk to create a bigger partition. The data written to disk can again be compressed. The reducer may need data from several map tasks on other nodes. Transferring this data compressed helps. The reduce phase merges output from map tasks either in memory or in a combination of in-memory and disk; the merged data is fed to the reducer for further processing, after which the data is written to HDFS (and that can be written compressed). As you can clearly see, compression is integral to a MapReduce pipeline and can have a significant impact on the job’s overall performance.
  • Time: 2 min (Total: 8 min) Given the importance of compression in Hadoop, it provides support for multiple codecs. This class of algorithms is called lossless, as the data needs to be precisely regenerated from the compressed form. There are 6 options available to you in the Hadoop 0.23 version that we run at Yahoo! right now. Compression algorithms operate by finding and eliminating redundancy and duplication in data. Thus, truly random data can never be compressed. Compression strategies generally have three phases: a preprocessing or transform phase, followed by duplicate elimination, and finally a phase that focuses on bit reduction. The algorithms used in each compression format vary and have a strong impact on the efficacy and speed of compression.
  • Time: 2 min (Total: 10 min) Splittability was not there initially. The SequenceFile format was designed to tackle the splittability issue (it is aware of keys and values). Compression cares about byte streams only. Pig had the same problem and added split capability to its compression: it made a copy of the bzip2 code and added split capability. Later on, Hadoop added split support and made it work for bzip2. Split capability could be added to block-oriented compression algorithms such as LZO, Snappy and LZ4.
  • Time: 1 min (Total: 11 min)
  • Time: 1 min (Total: 12 min)Try to add Yahoo! numbers here.
  • Time: 2 min (Total: 14 min) Input: data is large (Govind to provide a rule of thumb). Intermediate: spillage is one, network transfers are slow. Final: space, speed of writes, chained MR. I/P and O/P: ones that give you better space savings / compression ratios, such as zlib or bzip2. Intermediate: LZO-type compression, faster codecs.
  • Time: 2 min (Total: 16 min)
  • Time: 1 min (Total: 17 min)
  • Circle around zlib and bzip2; circle around LZO, Snappy and LZ4
  • Circle around zlib and IPP-zlib; circle around bzip2 and IPP-bzip2
  • Compression Options in Hadoop - A Tale of Tradeoffs
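
The splittability point in the notes above comes down to whether a reader can start decompressing somewhere other than the beginning of the data. The following is a minimal sketch of that idea using the Python standard library, not a reproduction of Hadoop's bzip2 support: it assumes each block is stored as a separate bzip2 stream, whereas Hadoop works with block boundaries inside a single .bz2 file. The function names and the record layout are hypothetical.

```python
import bz2
import zlib


def compress_blocks(data, block_size):
    """Compress each fixed-size block as its own independent bzip2 stream."""
    return [
        bz2.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_block(blocks, index):
    """Decompress one block without touching any of the others."""
    return bz2.decompress(blocks[index])


def can_decode_from_middle(stream):
    """Try to decode a single zlib stream starting at its midpoint."""
    try:
        zlib.decompress(stream[len(stream) // 2:])
        return True
    except zlib.error:
        return False


if __name__ == "__main__":
    # Hypothetical input: 50,000 fixed-width records of 14 bytes each.
    data = b"".join(b"record %06d\n" % i for i in range(50000))
    blocks = compress_blocks(data, block_size=100000)

    # Any block can be handed to a separate worker and decoded on its own.
    fourth = read_block(blocks, 3)
    print("blocks:", len(blocks), "fourth block bytes:", len(fourth))

    # A single non-block stream has no usable entry point in the middle.
    print("zlib decodable from midpoint:", can_decode_from_middle(zlib.compress(data)))
```

Because each block is self-contained, independent workers could each take a subset of the blocks, which is the property that lets multiple map tasks share one large compressed input.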

    1. Compression Options In Hadoop – A Tale of Tradeoffs. Govind Kamat, Sumeet Singh. Hadoop Summit (San Jose), June 27, 2013
    2. Introduction
        Sumeet Singh, Director of Products, Hadoop Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA
        • Leads Hadoop products team at Yahoo!
        • Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
        • Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!
        Govind Kamat, Technical Yahoo!, Hadoop Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA
        • Member of Technical Staff in the Hadoop Services team at Yahoo!
        • Focuses on HBase and Hadoop performance
        • Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
        • Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design
    3. Agenda
        1. Data Compression in Hadoop
        2. Available Compression Options
        3. Understanding and Working with Compression Options
        4. Problems Faced at Yahoo! with Large Data Sets
        5. Performance Evaluations, Native Bzip2, and IPP Libraries
        6. Wrap-up and Future Work
    4. Compression Needs and Tradeoffs in Hadoop
        The Compression Tradeoff: Storage, Disk I/O, Network bandwidth, CPU Time
        • Hadoop jobs are data-intensive; compressing data can speed up the I/O operations
        • MapReduce jobs are almost always I/O bound
        • Compressed data can save storage space and speed up data transfers across the network
        • Capital allocation for hardware can go further
        • Reduced I/O and network load can bring significant performance improvements
        • MapReduce jobs can finish faster overall
        • On the other hand, CPU utilization and processing time increase during compression and decompression
        • Understanding the tradeoffs is important for the MapReduce pipeline’s overall performance
    5. Data Compression in Hadoop’s MR Pipeline
        [Figure: MapReduce pipeline (input splits, map, buffer in memory, partition and sort, merge on disk, fetch, merge and sort, reduce, output) with three compression points: (1) input compressed, mapper decompresses; (2) mapper output compressed, reducer input decompresses; (3) reducer output compressed. Source: Hadoop: The Definitive Guide, Tom White]
    6. Compression Options in Hadoop (1/2)
        Format | Algorithm | Strategy | Emphasis | Comments
        zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
        gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib, codec operates on and produces standard gzip files | For data interchange on and off Hadoop
        bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
        LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
        LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
        Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
    7. Compression Options in Hadoop (2/2)
        Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
        zlib/DEFLATE (default) | | .deflate | N | Y/Y
        gzip | | .gz | N | Y/Y
        bzip2 | | .bz2 | Y | Y/Y
        LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
        LZ4 | | .lz4 | N | N/Y
        Snappy | | .snappy | N | N/Y
        NOTES:
        • Splittability: bzip2 is “splittable” and can be decompressed in parallel by multiple MapReduce tasks. Other algorithms require all blocks together for decompression with a single MapReduce task.
        • LZO: removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
        • Native bzip2 codec: added by Yahoo! as part of this work in Hadoop 0.23
    8. Space-Time Tradeoff of Compression Options
        [Figure: Codec Performance on the Wikipedia Text Corpus. Scatter plot of CPU time in sec. (compress + decompress) against space savings, ranging from high compression ratio to high compression speed]
        Codec | Space savings | CPU time (sec)
        Bzip2 | 71% | 60.0
        Zlib (Deflate, Gzip) | 64% | 32.3
        LZO | 47% | 4.8
        Snappy | 42% | 4.0
        LZ4 | 44% | 2.4
        Note: A 265 MB corpus from Wikipedia was used for the performance comparisons. Space savings is defined as [1 – (Compressed/Uncompressed)]
    9. Using Data Compression in Hadoop
        Phase in MR Pipeline | Config | Values
        1. Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats
        | Note: For SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec] | One of the supported codecs
        2. Intermediate (Map) Output | | false (default), true
        | | one defined in io.compression.codecs
        3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true
        | mapreduce.output.fileoutputformat.compress.codec | one defined in io.compression.codecs
        | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
    10. When to Use Compression and Which Codec
        Phase in MR Pipeline | When to compress | Which codec
        1. Input data to Map (I/P compressed, mapper decompresses) | Compress the input data, if large | Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format
        2. Intermediate (Map) Output (mapper O/P compressed, reducer I/P decompresses) | Always use compression, particularly if there is spillage or network transfers are slow | Use faster codecs such as LZO, LZ4, or Snappy
        3. Final Reduce Output (reducer O/P compressed) | Compress for storage/archival, better write speeds, or between MR jobs | Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs
    11. Compression in the Hadoop Ecosystem
        Component | When to Use | What to Use
        Pig | Compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size | Enable compression and select the codec: pig.tmpfilecompression = true; pig.tmpfilecompression.codec = gzip, lzo
        Hive | Intermediate files produced by Hive between multiple map-reduce jobs; Hive writes output to a table | Enable intermediate or output compression: hive.exec.compress.intermediate = true; hive.exec.compress.output = true
        HBase | Compress data at the CF level (support for LZO, gzip, Snappy, and LZ4) | List required JNI libraries: hbase.regionserver.codecs. Enabling compression: create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' } or alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
    12. Compression in Hadoop at Yahoo!
        [Figure: pie charts of compression usage along the MR pipeline: (1) Input data to Map, (2) Intermediate (Map) Output, (3) Final Reduce Output. Chart values in extracted order:]
        • 4.2M jobs, Jun 10-16, 2013: 99.8% vs. 0.2%; LZO 98.3%, gzip 1.1%, zlib/default 0.5%, bzip2 0.1%
        • 4.2M jobs, Jun 10-16, 2013: 39.0% vs. 61.0%; LZO 55%, gzip 35%, bzip2 5%, zlib/default 5%
        • 380M files on Jun 16, 2013 (/data, /projects): 98% vs. 2%; zlib/default 73%, gzip 22%, bzip2 4%, LZO 1%
        • Chart annotations: "Includes intermediate Pig/Hive compression"; "Pig Intermediate Compressed"
    13. Compression for Data Storage Efficiency
        • DSE considerations at Yahoo!
        • RCFile instead of SequenceFile
        • Faster implementation of bzip2
        • Native-code bzip2 codec
        • HADOOP-8462 [1], available in 0.23.7
        • Substituting the IPP library
        [1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
    14. IPP Libraries
        • Integrated Performance Primitives from Intel
        • Algorithmic and architectural optimizations
        • Processor-specific variants of each function
        • Applications remain processor-neutral
        • Compression: LZ, RLE, BWT, LZO
        • High level formats include: zlib, gzip, bzip2 and LZO
    15. Measuring Standalone Performance
        • Standard programs (gzip, bzip2) used
        • Driver program written for other cases
        • 32-bit mode
        • Single-threaded
        • JVM load overhead discounted
        • Default compression level
        • Quad-core Xeon machine
    16. Data Corpuses Used
        • Binary files
        • Generated text from randomtextwriter
        • Wikipedia corpus
        • Silesia corpus
    17. Compression Ratio
        [Figure: bar chart of file size (MB) for uncomp, zlib, bzip2, LZO, Snappy and LZ4 on the exe, rtext, wiki and silesia corpuses]
    18. Compression Performance
        [Figure: bar chart of CPU time (sec) on the exe, rtext, wiki and silesia corpuses. Data labels shown: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26]
    19. Compression Performance (Fast Algorithms)
        [Figure: bar chart of CPU time (sec) on the exe, rtext, wiki and silesia corpuses. Data labels shown: LZO 3.2, Snappy 2.9, LZ4 1.7]
    20. Decompression Performance
        [Figure: bar chart of CPU time (sec) on the exe, rtext, wiki and silesia corpuses. Data labels shown: zlib 3, IPP-zlib 2, Java-bzip2 21, bzip2 17, IPP-bzip2 12]
    21. Decompression Performance (Fast Algorithms)
        [Figure: bar chart of CPU time (sec) on the exe, rtext, wiki and silesia corpuses. Data labels shown: LZO 1.6, Snappy 1.1, LZ4 0.7]
    22. Compression Performance within Hadoop
        • Daytona performance framework
        • GridMix v1
        • Loadgen and sort jobs
        • Input data compressed with zlib / bzip2
        • LZO used for intermediate compression
        • 35 datanodes, dual-quad-core machines
    23. Map Performance
        [Figure: bar chart of map time (sec). Data labels shown: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33]
    24. Reduce Performance
        [Figure: bar chart of reduce time (min). Data labels shown: Java-bzip2 31, bzip2 28, IPP-bzip2 18, zlib 14]
    25. Job Performance
        [Figure: bar chart of job time (min) for the sort and loadgen jobs. Data labels shown, sort: Java-bzip2 38, bzip2 34, IPP-bzip2 23, zlib 19; loadgen: Java-bzip2 38, bzip2 34, IPP-bzip2 25, zlib 18]
    26. Future Work
        • Splittability support for native-code bzip2 codec
        • Enhancing Pig to use common bzip2 codec
        • Optimizing the JNI interface and buffer copies
        • Varying the compression effort parameter
        • Performance evaluation for 64-bit mode
        • Updating the zlib codec to specify alternative libraries
        • Other codec combinations, such as zlib for transient data
        • Other compression algorithms
    27. Considerations in Selecting Compression Type
        • Nature of the data set
        • Chained jobs
        • Data-storage efficiency requirements
        • Frequency of compression vs. decompression
        • Requirement for compatibility with a standard data format
        • Splittability requirements
        • Size of the intermediate and final data
        • Alternative implementations of compression libraries
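
The per-phase settings from the slide "Using Data Compression in Hadoop" can be collected into job properties. The sketch below is one hedged way to do that. The three `mapreduce.output.fileoutputformat.*` property names and `com.hadoop.compression.lzo.LzoCodec` come from the slides; the two `mapreduce.map.output.*` property names and the `org.apache.hadoop.io.compress.*` class names do not appear in the slide text and are assumptions based on stock Hadoop naming, so they should be checked against the Hadoop version in use. The helper names are hypothetical.

```python
# Codec class names. Only the LZO entry appears in the slides; the others
# are assumed from the stock org.apache.hadoop.io.compress package.
CODEC_CLASSES = {
    "zlib": "org.apache.hadoop.io.compress.DefaultCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "lzo": "com.hadoop.compression.lzo.LzoCodec",
    "lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def job_properties(intermediate_codec=None, output_codec=None, output_type="RECORD"):
    """Build a dict of compression properties for one MapReduce job."""
    if output_type not in ("NONE", "RECORD", "BLOCK"):
        raise ValueError("output_type must be NONE, RECORD or BLOCK")
    props = {}
    if intermediate_codec is not None:
        # Assumed property names for intermediate (map) output compression.
        props["mapreduce.map.output.compress"] = "true"
        props["mapreduce.map.output.compress.codec"] = CODEC_CLASSES[intermediate_codec]
    if output_codec is not None:
        # Property names for final (reduce) output, as listed on the slide.
        props["mapreduce.output.fileoutputformat.compress"] = "true"
        props["mapreduce.output.fileoutputformat.compress.codec"] = CODEC_CLASSES[output_codec]
        props["mapreduce.output.fileoutputformat.compress.type"] = output_type
    return props


def as_generic_options(props):
    """Render the properties as '-D key=value' command-line argument pairs."""
    args = []
    for key in sorted(props):
        args.extend(["-D", "%s=%s" % (key, props[key])])
    return args


if __name__ == "__main__":
    # Following the slide guidance: a fast codec for intermediate data,
    # a high-ratio codec for the stored output.
    props = job_properties(intermediate_codec="lzo", output_codec="bzip2", output_type="BLOCK")
    print(" ".join(as_generic_options(props)))
```

The example pairs LZO for intermediate output with bzip2 for final output because that mirrors the guidance on the slide "When to Use Compression and Which Codec"; other pairings fit chained jobs or data interchange better.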