Compression Options In Hadoop –
A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013
Introduction
Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
§  Leads Hadoop products team at Yahoo!
§  Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
§  Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!
Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
§  Member of Technical Staff in the Hadoop Services team at Yahoo!
§  Focuses on HBase and Hadoop performance
§  Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
§  Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design
Agenda
1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
Compression Needs and Tradeoffs in Hadoop
The Compression Tradeoff:
§  Storage
§  Disk I/O
§  Network bandwidth
§  CPU Time
§  Hadoop jobs are data-intensive; compressing data can speed up I/O operations
§  MapReduce jobs are almost always I/O bound
§  Compressed data can save storage space and speed up data transfers across the network
§  Capital allocation for hardware can go further
§  Reduced I/O and network load can bring significant performance improvements
§  MapReduce jobs can finish faster overall
§  On the other hand, CPU utilization and processing time increase during compression and decompression
§  Understanding the tradeoffs is important for the overall performance of a MapReduce pipeline
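The tradeoff can be made concrete with a back-of-the-envelope model: compression shrinks the bytes that cross the disk or network but adds CPU time. The sketch below is illustrative only. The function name, the 100 MB/s bandwidth, and the sample savings and CPU figures are all assumptions, not measurements from the talk.

```python
def transfer_seconds(size_mb, bandwidth_mb_s, space_savings=0.0, cpu_seconds=0.0):
    """Estimate the time to move a data set once, optionally compressed.

    space_savings is 1 - (compressed / uncompressed); cpu_seconds is the
    combined compress + decompress CPU cost. All inputs are hypothetical.
    """
    if not 0.0 <= space_savings < 1.0:
        raise ValueError("space_savings must be in [0, 1)")
    compressed_mb = size_mb * (1.0 - space_savings)
    return compressed_mb / bandwidth_mb_s + cpu_seconds


# Hypothetical example: 1000 MB moved over a 100 MB/s link.
plain = transfer_seconds(1000, 100)
fast = transfer_seconds(1000, 100, space_savings=0.5, cpu_seconds=2.0)
dense = transfer_seconds(1000, 100, space_savings=0.75, cpu_seconds=12.0)
print(plain, fast, dense)
```

With these made-up inputs the fast codec wins (7 s against 10 s uncompressed), while the denser codec loses (14.5 s) because its CPU cost outweighs the bytes it saves. Changing the bandwidth or the CPU cost can reverse that ordering, which is the tradeoff in question.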
Data Compression in Hadoop’s MR Pipeline
[Figure: MapReduce data flow from input splits through map, buffer in memory, partition and sort, merge on disk, fetch, and merge and sort, to reduce and output, with other maps and other reducers taking part in the shuffle. Source: Hadoop: The Definitive Guide, Tom White]
Compression points in the pipeline:
1  Map: compressed input (I/P) is decompressed by the mapper
2  Sort & Shuffle: mapper output (O/P) is compressed, and the reducer decompresses its input
3  Reduce: reducer output is compressed
Compression Options in Hadoop (1/2)
Format | Algorithm | Strategy | Emphasis | Comments
zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib, codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
Compression Options in Hadoop (2/2)
Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y
NOTES:
§  Splittability – Bzip2 is “splittable”: a bzip2 file can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together, so a single MapReduce task must decompress the whole file.
§  LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. LZO format is still
supported and the codec can be downloaded separately and enabled manually.
§  Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23
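The splittability note can be illustrated with a small experiment. The sketch below uses only Python's standard zlib module, not Hadoop's codecs, and the payload and helper name are made up for the example. It shows that a reader dropped into the middle of a zlib stream cannot start decoding there, which is the situation a map task would face if handed the second half of a .deflate or .gz file. Bzip2 differs because each of its blocks begins with a marker that a reader can scan for.

```python
import zlib

# Build a compressible payload of many distinct lines (made-up sample data).
lines = [f"record {i}: the quick brown fox\n".encode() for i in range(20000)]
payload = b"".join(lines)
stream = zlib.compress(payload)


def can_decode_from(offset):
    """Return True if decompression succeeds when started at a byte offset."""
    try:
        zlib.decompress(stream[offset:])
        return True
    except zlib.error:
        return False


whole_ok = can_decode_from(0)                 # decoding from the start works
mid_ok = can_decode_from(len(stream) // 2)    # decoding from the middle fails
print(whole_ok, mid_ok)
```

This is a byte-level illustration rather than a description of how Hadoop computes input splits. The point is only that the compressed format decides whether a reader can begin anywhere other than the first byte.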
Space-Time Tradeoff of Compression Options
Codec Performance on the Wikipedia Text Corpus
[Scatter plot: CPU time in seconds (compress + decompress) against space savings, running from the high-compression-speed codecs to the high-compression-ratio codecs]
Codec | Space savings | CPU time (sec)
Bzip2 | 71% | 60.0
Zlib (Deflate, Gzip) | 64% | 32.3
LZO | 47% | 4.8
Snappy | 42% | 4.0
LZ4 | 44% | 2.4
Note:
A 265 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as [1 – (Compressed / Uncompressed)]
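The two axes of this comparison, space savings and combined compress plus decompress CPU time, can be measured for any data set. The following is a minimal sketch, assuming Python's standard zlib and bz2 modules as stand-ins for the Hadoop codecs. The standard library has no LZO, Snappy, or LZ4 bindings, and the sample text is synthetic rather than the Wikipedia corpus, so the numbers it prints will not match the slide.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Return (name, space_savings, cpu_seconds) for one codec."""
    start = time.process_time()
    packed = compress(data)
    restored = decompress(packed)
    cpu_seconds = time.process_time() - start
    assert restored == data  # the round trip must be lossless
    # Space savings as defined in the note: 1 - (compressed / uncompressed).
    savings = 1.0 - len(packed) / len(data)
    return name, savings, cpu_seconds


# Synthetic stand-in text; the talk used a 265 MB Wikipedia corpus instead.
sample = b"".join(b"line %d of some repetitive sample text\n" % i for i in range(50000))

results = [
    measure("zlib", zlib.compress, zlib.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for name, savings, cpu in results:
    print(f"{name}: space savings {savings:.0%}, CPU {cpu:.3f} s")
```

Both codecs run at their library default compression level here, which mirrors the "default compression level" choice listed under the standalone measurements but is otherwise an arbitrary setting.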
Using Data Compression in Hadoop
Phase in MR Pipeline | Config | Values
1  Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats
1  Input data to Map | One of the supported codecs | One defined in io.compression.codecs
2  Intermediate (Map) Output | mapreduce.map.output.compress | false (default), true
2  Intermediate (Map) Output | mapreduce.map.output.compress.codec | One defined in io.compression.codecs
3  Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true
3  Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.codec | One defined in io.compression.codecs
3  Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
Note: For SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec]
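One way to apply these properties is to pass them as -D generic options when submitting a job. The helper below is a hypothetical sketch: the function name and its parameters are invented for illustration, while the property names and codec class names are the ones listed in the deck. It only builds the argument list, and -D options take effect only if the job's driver uses Hadoop's generic option parsing (for example through ToolRunner).

```python
def compression_options(intermediate_codec=None, output_codec=None, output_type=None):
    """Build -D generic options for the compression properties in the table."""
    props = {}
    if intermediate_codec:
        # Phase 2: intermediate (map) output.
        props["mapreduce.map.output.compress"] = "true"
        props["mapreduce.map.output.compress.codec"] = intermediate_codec
    if output_codec:
        # Phase 3: final (reduce) output.
        props["mapreduce.output.fileoutputformat.compress"] = "true"
        props["mapreduce.output.fileoutputformat.compress.codec"] = output_codec
        if output_type:
            if output_type not in ("NONE", "RECORD", "BLOCK"):
                raise ValueError("unknown SequenceFile compression type")
            props["mapreduce.output.fileoutputformat.compress.type"] = output_type
    return [arg for key, value in props.items() for arg in ("-D", f"{key}={value}")]


args = compression_options(
    intermediate_codec="org.apache.hadoop.io.compress.SnappyCodec",
    output_codec="org.apache.hadoop.io.compress.BZip2Codec",
)
print(" ".join(args))
```

The codec pairing in the usage lines (a fast codec for intermediate data, bzip2 for final output) is just one example. Input decompression needs no option at all, since the codec is picked from the file extension.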
When to Use Compression and Which Codec
1  Input data to Map
§  Compress the input data, if large
§  Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format
2  Intermediate (Map) Output
§  Always use compression, particularly when there is spillage or network transfers are slow
§  Use faster codecs such as LZO, LZ4, or Snappy
3  Final Reduce Output
§  Compress for storage/archival, better write speeds, or between MR jobs
§  Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs
Compression in the Hadoop Ecosystem
Pig
§  When to use: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size
§  What to use: enable compression and select the codec:
    pig.tmpfilecompression = true
    pig.tmpfilecompression.codec = gzip, lzo
Hive
§  When to use: intermediate files produced by Hive between multiple map-reduce jobs; Hive writes output to a table
§  What to use: enable intermediate or output compression:
    hive.exec.compress.intermediate = true
    hive.exec.compress.output = true
HBase
§  When to use: compress data at the column family (CF) level (support for LZO, gzip, Snappy, and LZ4)
§  What to use: list required JNI libraries:
    hbase.regionserver.codecs
§  Enabling compression:
    create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
    alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
Compression in Hadoop at Yahoo!
[Figure: pie charts of compression usage for the three pipeline phases: 1 Input data to Map, 2 Intermediate (Map) Output, 3 Final Reduce Output]
Chart data, in slide order:
§  4.2M Jobs, Jun 10-16, 2013
§  99.8% / 0.2%; by codec: LZO 98.3%, gzip 1.1%, zlib / default 0.5%, bzip2 0.1%
§  39.0% / 61.0%; by codec: LZO 55%, gzip 35%, bzip2 5%, zlib / default 5%
§  4.2M Jobs, Jun 10-16, 2013
§  98% / 2%; by codec: zlib / default 73%, gzip 22%, bzip2 4%, LZO 1%
§  380M Files on Jun 16, 2013 (/data, /projects)
§  Chart annotations: "Includes intermediate Pig/Hive compression"; "Pig Intermediate Compressed"
Compression for Data Storage Efficiency
§  DSE considerations at Yahoo!
§  RCFile instead of SequenceFile
§  Faster implementation of bzip2
§  Native-code bzip2 codec
§  HADOOP-8462 (1), available in 0.23.7
§  Substituting the IPP library
(1) Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
IPP Libraries
§  Integrated Performance Primitives from Intel
§  Algorithmic and architectural optimizations
§  Processor-specific variants of each function
§  Applications remain processor-neutral
§  Compression: LZ, RLE, BWT, LZO
§  High level formats include: zlib, gzip, bzip2 and LZO
Measuring Standalone Performance
§  Standard programs (gzip, bzip2) used
§  Driver program written for other cases
§  32-bit mode
§  Single-threaded
§  JVM load overhead discounted
§  Default compression level
§  Quad-core Xeon machine
Data Corpuses Used
§  Binary files
§  Generated text from randomtextwriter
§  Wikipedia corpus
§  Silesia corpus
Compression Ratio
[Bar chart: file size (MB) of the exe, rtext, wiki, and silesia corpora, uncompressed and after zlib, bzip2, LZO, Snappy, and LZ4; axis range 0 to 300 MB]
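Compression Performance
[Bar chart: compression CPU time (sec) for the exe, rtext, wiki, and silesia corpora with zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2]
Values labelled on the chart: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26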
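The values labelled on the compression chart above can be turned into relative speedups with simple arithmetic. The sketch below copies the five labelled CPU times from the slide. The dictionary and the helper function are illustrative names, and the chart text does not say which corpus the labels belong to.

```python
# CPU times in seconds, as labelled on the compression performance chart.
compress_cpu = {
    "zlib": 29,
    "IPP-zlib": 23,
    "Java-bzip2": 63,
    "bzip2": 44,
    "IPP-bzip2": 26,
}


def speedup(baseline, candidate, times=compress_cpu):
    """How many times faster `candidate` is than `baseline`."""
    return times[baseline] / times[candidate]


native_vs_java = speedup("Java-bzip2", "bzip2")      # 63 / 44
ipp_vs_java = speedup("Java-bzip2", "IPP-bzip2")     # 63 / 26
ipp_zlib_vs_zlib = speedup("zlib", "IPP-zlib")       # 29 / 23
print(f"{native_vs_java:.2f}x {ipp_vs_java:.2f}x {ipp_zlib_vs_zlib:.2f}x")
```

On these labelled values, native-code bzip2 compresses about 1.4 times faster than the Java implementation, IPP bzip2 about 2.4 times faster, and IPP zlib about 1.3 times faster than stock zlib.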
Compression Performance (Fast Algorithms)
[Bar chart: compression CPU time (sec) for the exe, rtext, wiki, and silesia corpora with LZO, Snappy, and LZ4]
Values labelled on the chart: LZO 3.2, Snappy 2.9, LZ4 1.7
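Decompression Performance
[Bar chart: decompression CPU time (sec) for the exe, rtext, wiki, and silesia corpora with zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2]
Values labelled on the chart: zlib 3, IPP-zlib 2, Java-bzip2 21, bzip2 17, IPP-bzip2 12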
Decompression Performance (Fast Algorithms)
[Bar chart: decompression CPU time (sec) for the exe, rtext, wiki, and silesia corpora with LZO, Snappy, and LZ4]
Values labelled on the chart: LZO 1.6, Snappy 1.1, LZ4 0.7
Compression Performance within Hadoop
§  Daytona performance framework
§  GridMix v1
§  Loadgen and sort jobs
§  Input data compressed with zlib / bzip2
§  LZO used for intermediate compression
§  35 datanodes, dual-quad-core machines
Map Performance
[Bar chart: map time (sec) by input codec]
Values labelled on the chart: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33
Reduce Performance
[Bar chart: reduce time (min) by codec]
Values labelled on the chart: Java-bzip2 31, bzip2 28, IPP-bzip2 18, zlib 14
Job Performance
[Bar chart: job time (min) by codec for two series, sort and loadgen]
Values labelled on the chart, in the order Java-bzip2, bzip2, IPP-bzip2, zlib: 38, 34, 23, 19 and 38, 34, 25, 18
Future Work
§  Splittability support for native-code bzip2 codec
§  Enhancing Pig to use common bzip2 codec
§  Optimizing the JNI interface and buffer copies
§  Varying the compression effort parameter
§  Performance evaluation for 64-bit mode
§  Updating the zlib codec to specify alternative libraries
§  Other codec combinations, such as zlib for transient data
§  Other compression algorithms
Considerations in Selecting Compression Type
§  Nature of the data set
§  Chained jobs
§  Data-storage efficiency requirements
§  Frequency of compression vs. decompression
§  Requirement for compatibility with a standard data format
§  Splittability requirements
§  Size of the intermediate and final data
§  Alternative implementations of compression libraries
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeoffs

More Related Content

What's hot

MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...MongoDB
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012StampedeCon
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsTrendProgContest13
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clustersenissoz
 
Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureCloudera, Inc.
 
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentYahoo Developer Network
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera, Inc.
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSAFrederik Engelen
 
DB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple StandbyDB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple StandbyDale McInnis
 
Data Hacking with RHadoop
Data Hacking with RHadoopData Hacking with RHadoop
Data Hacking with RHadoopEd Kohlwey
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jkEdureka!
 

What's hot (18)

MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012
 
D02 Evolution of the HADR tool
D02 Evolution of the HADR toolD02 Evolution of the HADR tool
D02 Evolution of the HADR tool
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clusters
 
Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and Future
 
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSA
 
DB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple StandbyDB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple Standby
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Data Hacking with RHadoop
Data Hacking with RHadoopData Hacking with RHadoop
Data Hacking with RHadoop
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 

Similar to Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsDataWorks Summit
 
Hadoop compression strata conference
Hadoop compression strata conferenceHadoop compression strata conference
Hadoop compression strata conferencenkabra
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterShivraj Raj
 
Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Robert Treat
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudLeons Petražickis
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
HDFS presented by VIJAY
HDFS presented by VIJAYHDFS presented by VIJAY
HDFS presented by VIJAYthevijayps
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceHadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceChandra Sekhar
 
Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache PigSachin Vakkund
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingSam Ng
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopLeons Petražickis
 

Similar to Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeoffs (20)

Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 
Hadoop compression strata conference
Hadoop compression strata conferenceHadoop compression strata conference
Hadoop compression strata conference
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop cluster
 
Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
HDFS presented by VIJAY
HDFS presented by VIJAYHDFS presented by VIJAY
HDFS presented by VIJAY
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceHadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective Audience
 
Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
Pig
PigPig
Pig
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
Training
TrainingTraining
Training
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 

More from Sumeet Singh

Hadoop Summit Kiosk Deck
Hadoop Summit Kiosk DeckHadoop Summit Kiosk Deck
Hadoop Summit Kiosk DeckSumeet Singh
 
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Sumeet Singh
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
 
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Sumeet Singh
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
 
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Sumeet Singh
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
 
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
 
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
 
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo!
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! HBaseCon 2013: Multi-tenant Apache HBase at Yahoo!
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! Sumeet Singh
 

More from Sumeet Singh (16)

Hadoop Summit Kiosk Deck
Hadoop Summit Kiosk DeckHadoop Summit Kiosk Deck
Hadoop Summit Kiosk Deck
 
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
 
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...

Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeoffs

  • 1. Compression Options In Hadoop – A Tale of Tradeoffs Govind Kamat, Sumeet Singh Hadoop Summit (San Jose), June 27, 2013
  • 2. Introduction
      Govind Kamat, Technical Yahoo!, Hadoop, Cloud Engineering Group (701 First Avenue, Sunnyvale, CA 94089 USA)
        §  Member of Technical Staff in the Hadoop Services team at Yahoo!
        §  Focuses on HBase and Hadoop performance
        §  Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
        §  Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design
      Sumeet Singh, Director of Products, Hadoop, Cloud Engineering Group (701 First Avenue, Sunnyvale, CA 94089 USA)
        §  Leads Hadoop products team at Yahoo!
        §  Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
        §  Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!
  • 3. Agenda
      1. Data Compression in Hadoop
      2. Available Compression Options
      3. Understanding and Working with Compression Options
      4. Problems Faced at Yahoo! with Large Data Sets
      5. Performance Evaluations, Native Bzip2, and IPP Libraries
      6. Wrap-up and Future Work
  • 4. Compression Needs and Tradeoffs in Hadoop
      The Compression Tradeoff involves:
        §  Storage
        §  Disk I/O
        §  Network bandwidth
        §  CPU time
      §  Hadoop jobs are data-intensive; compressing data can speed up I/O operations
      §  MapReduce jobs are almost always I/O bound
      §  Compressed data can save storage space and speed up data transfers across the network
      §  Capital allocation for hardware can go further
      §  Reduced I/O and network load can bring significant performance improvements
      §  MapReduce jobs can finish faster overall
      §  On the other hand, CPU utilization and processing time increase during compression and decompression
      §  Understanding the tradeoffs is important for the MapReduce pipeline’s overall performance
  • 5. Data Compression in Hadoop’s MR Pipeline
      [Figure: MapReduce data flow (input splits, map, buffer in memory, partition and sort, merge on disk, fetch, merge and sort, reduce, output, with other maps and other reducers feeding the shuffle), annotated with three compress/decompress points. Source: Hadoop: The Definitive Guide, Tom White]
      1. Map input: input compressed; mapper decompresses
      2. Sort & shuffle (map output to reduce input): mapper output compressed; reducer input decompresses
      3. Reduce output: reducer output compressed
  • 6. Compression Options in Hadoop (1/2)

      | Format | Algorithm | Strategy | Emphasis | Comments |
      |--------|-----------|----------|----------|----------|
      | zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec |
      | gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib; codec operates on and produces standard gzip files | For data interchange on and off Hadoop |
      | bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig |
      | LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables |
      | LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions |
      | Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy |
  • 7. Compression Options in Hadoop (2/2)

      | Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java / Native |
      |--------|------------------------------------------|------------|------------|---------------|
      | zlib / DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y / Y |
      | gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y / Y |
      | bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y / Y |
      | LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N / Y |
      | LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N / Y |
      | Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N / Y |

      NOTES:
      §  Splittability: bzip2 is “splittable” and can be decompressed in parallel by multiple MapReduce tasks. Other algorithms require all blocks together for decompression with a single MapReduce task.
      §  LZO: removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
      §  Native bzip2 codec: added by Yahoo! as part of this work in Hadoop 0.23
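The extension-to-codec mapping in the slide 7 table can be sketched as a small lookup. This is plain Python for illustration only, not Hadoop's API; the names `CODEC_BY_EXTENSION`, `codec_for`, and `is_splittable` are hypothetical, and treating a file with an unrecognized extension as uncompressed (and therefore splittable) is an assumption of the sketch.

```python
import os

# Extension -> codec class name, copied from the slide 7 table.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzoCodec",
    ".lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}

# Per the slide, only bzip2 is marked splittable.
SPLITTABLE_EXTENSIONS = {".bz2"}


def codec_for(path):
    """Return the codec class name for a file, or None if no codec matches."""
    extension = os.path.splitext(path)[1].lower()
    return CODEC_BY_EXTENSION.get(extension)


def is_splittable(path):
    """Return True if several map tasks could read parts of this file.

    Assumption: a file whose extension is not in the table is uncompressed
    and can be split freely.
    """
    extension = os.path.splitext(path)[1].lower()
    if extension not in CODEC_BY_EXTENSION:
        return True
    return extension in SPLITTABLE_EXTENSIONS
```

For example, `codec_for("part-00000.bz2")` returns the `BZip2Codec` class name, while `is_splittable("logs.gz")` returns `False`, which is why a single large gzip input ends up processed by one map task.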
  • 8. Space-Time Tradeoff of Compression Options: Codec Performance on the Wikipedia Text Corpus
      [Figure: scatter plot of CPU time in seconds (compress + decompress; axis 0.0 to 70.0) against space savings (axis 40% to 75%), with regions annotated “High Compression Ratio” and “High Compression Speed”.]

      | Codec | Space savings | CPU time (sec) |
      |-------|---------------|----------------|
      | Bzip2 | 71% | 60.0 |
      | Zlib (Deflate, Gzip) | 64% | 32.3 |
      | LZO | 47% | 4.8 |
      | LZ4 | 44% | 2.4 |
      | Snappy | 42% | 4.0 |

      Note: A 265 MB corpus from Wikipedia was used for the performance comparisons. Space savings is defined as [1 – (Compressed / Uncompressed)].
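The space-savings definition in the note on slide 8 is simple arithmetic; a minimal sketch follows. The function names are hypothetical, and sizes can be in any unit as long as both arguments use the same one.

```python
def space_savings(compressed_size, uncompressed_size):
    """Space savings as defined on the slide: 1 - (compressed / uncompressed)."""
    if uncompressed_size <= 0:
        raise ValueError("uncompressed_size must be positive")
    return 1.0 - compressed_size / uncompressed_size


def compressed_size_for(savings, uncompressed_size):
    """Invert the definition: the compressed size implied by a savings figure."""
    return uncompressed_size * (1.0 - savings)


if __name__ == "__main__":
    # The 265 MB corpus at 71% space savings (one of the plotted points)
    # implies a compressed size of roughly 77 MB.
    print(round(compressed_size_for(0.71, 265.0), 2))
```

A savings figure of 71% therefore corresponds to a compression ratio of about 3.4 to 1, while 42% corresponds to about 1.7 to 1.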
  • 9. Using Data Compression in Hadoop

      | Phase in MR Pipeline | Config | Values |
      |----------------------|--------|--------|
      | 1. Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats |
      | | Note: for SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec] | One of the supported codecs (one defined in io.compression.codecs) |
      | 2. Intermediate (Map) Output | mapreduce.map.output.compress | false (default), true |
      | | mapreduce.map.output.compress.codec | one defined in io.compression.codecs |
      | 3. Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true |
      | | mapreduce.output.fileoutputformat.compress.codec | one defined in io.compression.codecs |
      | | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK |
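As an illustration of how the properties on slide 9 fit together, the sketch below assembles them into a dictionary and renders them as `-D key=value` generic options. The helper names are hypothetical, and the `-D` form only takes effect if the job's driver parses generic options (for example through `ToolRunner`); that is an assumption about the job, not something the slide states.

```python
VALID_OUTPUT_TYPES = ("NONE", "RECORD", "BLOCK")


def compression_properties(map_codec=None, output_codec=None, output_type="RECORD"):
    """Build the slide 9 properties for intermediate and final output.

    Codec arguments are fully qualified class names listed in
    io.compression.codecs; passing None leaves that phase uncompressed.
    """
    if output_type not in VALID_OUTPUT_TYPES:
        raise ValueError("output_type must be one of %s" % (VALID_OUTPUT_TYPES,))
    props = {}
    if map_codec is not None:
        props["mapreduce.map.output.compress"] = "true"
        props["mapreduce.map.output.compress.codec"] = map_codec
    if output_codec is not None:
        props["mapreduce.output.fileoutputformat.compress"] = "true"
        props["mapreduce.output.fileoutputformat.compress.codec"] = output_codec
        # Only meaningful for SequenceFile outputs, per the slide.
        props["mapreduce.output.fileoutputformat.compress.type"] = output_type
    return props


def as_generic_options(props):
    """Render properties as a flat list of '-D', 'key=value' arguments."""
    args = []
    for key, value in props.items():
        args.extend(["-D", "%s=%s" % (key, value)])
    return args


if __name__ == "__main__":
    props = compression_properties(
        map_codec="org.apache.hadoop.io.compress.SnappyCodec",
        output_codec="org.apache.hadoop.io.compress.BZip2Codec",
        output_type="BLOCK",
    )
    print(" ".join(as_generic_options(props)))
```

Keeping the two phases as separate arguments mirrors the slide's point that intermediate and final output are configured independently and often call for different codecs.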
  • 10. When to Use Compression and Which Codec

      | Phase | When to compress | Which codec |
      |-------|------------------|-------------|
      | 1. Input data to Map (input compressed; mapper decompresses) | Compress the input data, if large | Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format |
      | 2. Intermediate (Map) Output (mapper output compressed; reducer input decompresses) | Always use compression, particularly if there is spillage or network transfers are slow | Use faster codecs such as LZO, LZ4, or Snappy |
      | 3. Final Reduce Output (reducer output compressed) | Compress for storage/archival, better write speeds, or between MR jobs | Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs |
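The recommendations on slide 10 amount to a small decision rule, sketched below. The function `recommend_codecs`, its phase names, and the `for_interchange` flag are hypothetical; the returned strings simply restate the slide's suggestions and are not a complete selection procedure.

```python
FAST_CODECS = ["LZO", "LZ4", "Snappy"]


def recommend_codecs(phase, for_interchange=False):
    """Return the codecs slide 10 suggests for a phase of the MR pipeline.

    phase: "input", "intermediate", or "final".
    for_interchange: for final output, True if the data leaves Hadoop
    (standard utilities), False if it feeds another job (faster codecs).
    """
    if phase == "input":
        # Splittable codec, or zlib inside a SequenceFile container.
        return ["bzip2", "zlib with SequenceFile"]
    if phase == "intermediate":
        return list(FAST_CODECS)
    if phase == "final":
        return ["gzip", "bzip2"] if for_interchange else list(FAST_CODECS)
    raise ValueError("unknown phase: %r" % (phase,))
```

For instance, `recommend_codecs("final", for_interchange=True)` returns `["gzip", "bzip2"]`, while the same call for a chained job returns the faster codecs.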
  • 11. Compression in the Hadoop Ecosystem
      Pig
        When to use:
          §  Compressing data between MR jobs
          §  Typical in Pig scripts that include joins or other operators that expand your data size
        What to use: enable compression and select the codec:
          pig.tmpfilecompression = true
          pig.tmpfilecompression.codec = gzip, lzo
      Hive
        When to use:
          §  Intermediate files produced by Hive between multiple map-reduce jobs
          §  Hive writes output to a table
        What to use: enable intermediate or output compression:
          hive.exec.compress.intermediate = true
          hive.exec.compress.output = true
      HBase
        When to use:
          §  Compress data at the CF (column family) level (support for LZO, gzip, Snappy, and LZ4)
        What to use: list required JNI libraries:
          hbase.regionserver.codecs
        Enabling compression:
          create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
          alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
  • 12. Compression in Hadoop at Yahoo!
      [Figure: three sets of pie charts placed along the MR pipeline (1: Input data to Map, 2: Intermediate (Map) Output, 3: Final Reduce Output). Legend and annotation text: “Compressed”, “Pig Intermediate”, “Includes intermediate Pig/Hive compression”.]
      §  4.2M jobs, Jun 10-16, 2013: 99.8% / 0.2% split; codec shares: LZO 98.3%, gzip 1.1%, zlib / default 0.5%, bzip2 0.1%
      §  4.2M jobs, Jun 10-16, 2013: 39.0% / 61.0% split; codec shares: LZO 55%, gzip 35%, bzip2 5%, zlib / default 5%
      §  380M files on Jun 16, 2013 (/data, /projects): 98% / 2% split; codec shares: zlib / default 73%, gzip 22%, bzip2 4%, LZO 1%
  • 13. Compression for Data Storage Efficiency
      §  DSE considerations at Yahoo!
      §  RCFile instead of SequenceFile
      §  Faster implementation of bzip2
      §  Native-code bzip2 codec
      §  HADOOP-8462 [1], available in 0.23.7
      §  Substituting the IPP library
      [1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
  • 14. IPP Libraries
      §  Integrated Performance Primitives from Intel
      §  Algorithmic and architectural optimizations
      §  Processor-specific variants of each function
      §  Applications remain processor-neutral
      §  Compression: LZ, RLE, BWT, LZO
      §  High-level formats include: zlib, gzip, bzip2 and LZO
  • 15. Measuring Standalone Performance
      §  Standard programs (gzip, bzip2) used
      §  Driver program written for other cases
      §  32-bit mode
      §  Single-threaded
      §  JVM load overhead discounted
      §  Default compression level
      §  Quad-core Xeon machine
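A single-threaded CPU-time measurement in the spirit of slide 15 can be sketched with Python's standard `zlib` and `bz2` modules. This is not the presenters' driver program: the function names are hypothetical, the input is a synthetic stand-in for a real corpus, and timings from Python bindings on a given machine are not comparable to the figures in the deck. Both codecs run at their library default compression level.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip using process CPU time."""
    start = time.process_time()
    packed = compress(data)
    middle = time.process_time()
    restored = decompress(packed)
    end = time.process_time()
    if restored != data:
        raise ValueError("%s round trip did not reproduce the input" % name)
    return {
        "codec": name,
        "compress_sec": middle - start,
        "decompress_sec": end - middle,
        # Same definition as the deck: 1 - (compressed / uncompressed).
        "space_savings": 1.0 - len(packed) / len(data),
    }


def run_benchmark(data):
    """Measure zlib and bzip2 on the same in-memory input."""
    return [
        measure("zlib", zlib.compress, zlib.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ]


if __name__ == "__main__":
    # Synthetic, highly repetitive input; substitute a real corpus file
    # to get meaningful numbers.
    sample = b"Hadoop jobs are data-intensive. " * 50000
    for row in run_benchmark(sample):
        print(row)
```

Using `time.process_time()` rather than wall-clock time keeps the measurement focused on CPU cost, which is the quantity the deck plots.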
  • 16. Data Corpuses Used
      §  Binary files
      §  Generated text from randomtextwriter
      §  Wikipedia corpus
      §  Silesia corpus
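  • 17. Compression Ratio
      [Figure: bar chart of file size (MB; axis 0 to 300) for uncompressed data and for zlib, bzip2, LZO, Snappy, and LZ4 output, with one series per corpus: exe, rtext, wiki, silesia.]
  • 18. Compression Performance
      [Figure: bar chart of CPU time (sec; axis 0 to 90) per codec, with one series per corpus: exe, rtext, wiki, silesia. Data labels shown: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26.]
  • 19. Compression Performance (Fast Algorithms)
      [Figure: bar chart of CPU time (sec; axis 0 to 3.5) per codec, with one series per corpus: exe, rtext, wiki, silesia. Data labels shown: LZO 3.2, Snappy 2.9, LZ4 1.7.]
  • 20. Decompression Performance
      [Figure: bar chart of CPU time (sec; axis 0 to 25) per codec, with one series per corpus: exe, rtext, wiki, silesia. Data labels shown: zlib 3, IPP-zlib 2, Java-bzip2 21, bzip2 17, IPP-bzip2 12.]
  • 21. Decompression Performance (Fast Algorithms)
      [Figure: bar chart of CPU time (sec; axis 0 to 2) per codec, with one series per corpus: exe, rtext, wiki, silesia. Data labels shown: LZO 1.6, Snappy 1.1, LZ4 0.7.]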
  • 22. Compression Performance within Hadoop
      §  Daytona performance framework
      §  GridMix v1
      §  Loadgen and sort jobs
      §  Input data compressed with zlib / bzip2
      §  LZO used for intermediate compression
      §  35 datanodes, dual-quad-core machines
  • 23. Map Performance
      [Figure: bar chart of map time (sec; axis 0 to 50). Data labels shown: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33.]
  • 26. Future Work
      §  Splittability support for native-code bzip2 codec
      §  Enhancing Pig to use common bzip2 codec
      §  Optimizing the JNI interface and buffer copies
      §  Varying the compression effort parameter
      §  Performance evaluation for 64-bit mode
      §  Updating the zlib codec to specify alternative libraries
      §  Other codec combinations, such as zlib for transient data
      §  Other compression algorithms
  • 27. Considerations in Selecting Compression Type
      §  Nature of the data set
      §  Chained jobs
      §  Data-storage efficiency requirements
      §  Frequency of compression vs. decompression
      §  Requirement for compatibility with a standard data format
      §  Splittability requirements
      §  Size of the intermediate and final data
      §  Alternative implementations of compression libraries