Because of the volume of data stored in Hadoop clusters, compression may be needed, depending on
company requirements. The following comparison summarizes some options and their trade-offs.

Format                  Extension  Speed    Output File Size  Splittable  Codec
bzip2                   .bz2       slowest  smallest          Yes         org.apache.hadoop.io.compress.BZip2Codec
gzip                    .gz        slower   smaller           No          org.apache.hadoop.io.compress.GzipCodec
zlib/DEFLATE (default)  .deflate   slower   smaller           No          org.apache.hadoop.io.compress.DefaultCodec
LZO                     .lzo       fast     medium            No          org.apache.hadoop.io.compress.LzoCodec
LZ4                     .lz4       faster   larger            No          org.apache.hadoop.io.compress.Lz4Codec
Snappy                  .snappy    fastest  largest           No          org.apache.hadoop.io.compress.SnappyCodec

In the table above, the first column lists the format and the second column the file extension of the
compressed output file. The Speed column refers to compression and decompression time, and
Output File Size refers to the size of the compressed file, based on test cases on a Wikipedia text
corpus as reported by Govind Kamat and Sumeet Singh at Hadoop Summit, June 2013 (1). Note that the
comparison considers only the default setting of zlib. If zlib were used in its SequenceFile setting,
its performance placement would differ. Among the six options considered, only bzip2 can be
decompressed in parallel, as shown in the fifth column of the table. For a splittable format such as
bzip2, multiple MapReduce tasks can work on the same file during the decompression phase. For all
other formats the complete file is required during decompression, so the decompression of a single
file cannot be parallelized.
Also note that the traditional LZO library is licensed under the GNU GPL, and the LZO codec has
therefore been removed from Hadoop. A newly downloaded version of Hadoop does not include the codec.
It has to be downloaded separately and enabled manually, after which it can be used as before. LZO
is still supported within the code base if you want to continue using it.
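
The speed and size columns of the table can be checked against local data before settling on a codec. The sketch below is only an approximation: it uses Python's standard-library `bz2`, `gzip` and `zlib` modules as stand-ins for three of the six Hadoop codecs, and the sample text and the `measure` helper are hypothetical. It does not reproduce the Wikipedia benchmark cited above.

```python
import bz2
import gzip
import time
import zlib

# Hypothetical sample: repetitive text standing in for a real corpus.
SAMPLE = b"Hadoop stores large volumes of log and text data. " * 20000

# Standard-library stand-ins for three of the codecs in the table.
CODECS = {
    "bzip2": (bz2.compress, bz2.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "zlib/DEFLATE": (zlib.compress, zlib.decompress),
}


def measure(data):
    """Return {codec name: (compressed size in bytes, round-trip seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        unpacked = decompress(packed)
        elapsed = time.perf_counter() - start
        # Verify that compression followed by decompression is lossless.
        if unpacked != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    for name, (size, seconds) in measure(SAMPLE).items():
        ratio = size / len(SAMPLE)
        print(f"{name:14s} {size:8d} bytes ({ratio:.2%}) in {seconds:.3f} s")
```

Timings and ratios depend heavily on the input, so the ordering seen on a small synthetic sample may differ from the ordering in the table.
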

It is possible to use different compression techniques at different phases of MapReduce, since the
requirements of each phase vary. It is also possible to use compression and decompression at only one
particular phase. For instance, during the Shuffle and Sort phase compression can be quite useful for
reducing network transfer latency. Faster codecs such as LZO, LZ4 or Snappy are preferable in this
phase to avoid additional CPU overhead during compression and decompression. In contrast, during the
initial Map phase it is possible to invest in slower codecs such as bzip2, or zlib with the
SequenceFile setting, to make use of parallelism at this stage. During the Reduce phase gzip can be
used for data interchange, or a faster codec for chained jobs. bzip2 can also be used during the
Reduce phase when the smallest output matters more than speed. Compression during the Reduce phase
helps reduce storage requirements for archival data and can improve write speeds.
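
As a rough illustration of choosing a codec per phase, the sketch below builds the Hadoop 2.x job properties that turn on compression for the intermediate map output (Shuffle and Sort) and for the final job output (Reduce). The property names are the standard `mapreduce.*` keys and the codec class names come from the table. The helper functions and the chosen defaults are hypothetical and only one possible arrangement.

```python
# Codec class names taken from the comparison table.
SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"
GZIP = "org.apache.hadoop.io.compress.GzipCodec"


def compression_properties(shuffle_codec, output_codec, block_compress=True):
    """Build job properties for shuffle and output compression (hypothetical helper)."""
    props = {
        # Intermediate map output, sent over the network during Shuffle and Sort.
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": shuffle_codec,
        # Final job output written by the Reduce phase.
        "mapreduce.output.fileoutputformat.compress": "true",
        "mapreduce.output.fileoutputformat.compress.codec": output_codec,
    }
    if block_compress:
        # Only relevant when the job writes SequenceFiles.
        props["mapreduce.output.fileoutputformat.compress.type"] = "BLOCK"
    return props


def as_generic_options(props):
    """Render the properties as '-D key=value' command-line arguments."""
    args = []
    for key, value in sorted(props.items()):
        args.extend(["-D", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # A fast codec for the shuffle, a denser one for the stored output.
    print(" ".join(as_generic_options(compression_properties(SNAPPY, GZIP))))
```

Input compression for the Map phase needs no such property, because Hadoop selects the input codec from the file extension. The generated arguments only take effect for jobs that parse Hadoop's generic options.
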

Whether to apply compression in any phase of a MapReduce job can be decided from the trade-off
between reduced storage and network transfer load on one side and the additional CPU overhead of
running the compression codec on the other. For basic tasks where data transfer overheads are
significantly larger, clients would benefit greatly from compression, as long as the CPU overhead on
the commodity machines does not outweigh the savings.
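
The trade-off between transfer savings and CPU overhead can be put into a back-of-the-envelope model. The sketch below is a simplification under stated assumptions: compression, transfer and decompression run one after another with no overlap, throughput figures are expressed in MB/s of uncompressed data, and every number in the usage example is hypothetical rather than measured.

```python
def transfer_times(size_mb, ratio, bandwidth_mb_s, compress_mb_s, decompress_mb_s):
    """Return (seconds without compression, seconds with compression).

    ratio is compressed size divided by original size, e.g. 0.4.
    Assumes the three steps run sequentially, which real jobs may not do.
    """
    plain = size_mb / bandwidth_mb_s
    compressed = (
        size_mb / compress_mb_s               # CPU time to compress
        + (size_mb * ratio) / bandwidth_mb_s  # smaller payload on the wire
        + size_mb / decompress_mb_s           # CPU time to decompress
    )
    return plain, compressed


def compression_pays_off(size_mb, ratio, bandwidth_mb_s, compress_mb_s, decompress_mb_s):
    """True when the compressed path is faster end to end under this model."""
    plain, compressed = transfer_times(
        size_mb, ratio, bandwidth_mb_s, compress_mb_s, decompress_mb_s
    )
    return compressed < plain


if __name__ == "__main__":
    # Hypothetical figures: 1000 MB of data, compressed to 40% of its size.
    print(compression_pays_off(1000, 0.4, 100, 400, 800))   # slow network
    print(compression_pays_off(1000, 0.4, 1000, 400, 800))  # fast network
```

With these illustrative figures the slow network favours compression (7.75 s against 10 s) while the fast network does not (4.15 s against 1 s), which matches the observation that the benefit depends on how large the transfer overhead is relative to the CPU cost.
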

Please note that Hadoop jobs are data intensive. Depending on the client's hardware, e.g., racks
versus commodity machine clusters, or optical network connections versus old-fashioned RS232
connections, decisions on compression can be made either as a general solution or on a per-job basis.

(1) Govind Kamat and Sumeet Singh, Compression Options in Hadoop – A Tale of Tradeoffs,
Hadoop Summit, San Jose, June 2013,
http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
