Programming Hive Reading #4 @just_do_neet
  1. Programming Hive Reading #4 @just_do_neet
  2. Chapter 11. and 15.
     • Chapter 11. ‘Other File Formats and Compression’
       • Choosing / Enabling / Action / HAR / etc...
     • Chapter 15. ‘Customizing Hive File and Record Formats’
       • Demystifying DML / File Formats / etc...
     • SerDe-related topics are excluded from this presentation.
  3. #11 Determining Installed Codecs
       $ hive -e "set io.compression.codecs"
       io.compression.codecs=
         org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         com.hadoop.compression.lzo.LzoCodec,
         org.apache.hadoop.io.compress.SnappyCodec
  4. #11 Choosing a Compression Codec
     • Advantage: less network I/O and disk space.
     • Disadvantage: CPU overhead.
     • In short: it is a trade-off.
  5. #11 Choosing a Compression Codec
     • “why do we need different compression schemes?”
       • speed
       • minimizing size
       • ‘splittable’ or not.
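The speed versus size trade-off can be felt with a rough sketch that runs Python's standard-library codecs over the same input. These codecs (zlib for DEFLATE, bz2, lzma) only stand in for the Hadoop codecs as an illustration; the sample data and the helper name `compare_codecs` are assumptions, and timings differ from machine to machine.

```python
import bz2
import lzma
import time
import zlib


def compare_codecs(data: bytes) -> dict:
    """Return {codec name: (compressed size in bytes, seconds spent compressing)}."""
    codecs = {
        "zlib (DEFLATE)": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Every codec here is lossless, so the round trip must be exact.
        assert decompress(packed) == data
        results[name] = (len(packed), elapsed)
    return results


# Hypothetical log-like sample data, made up for this illustration.
sample = b"".join(
    b"2013-01-%02d,user_%04d,click\n" % (i % 28 + 1, i % 500)
    for i in range(20000)
)
results = compare_codecs(sample)
print(f"{'raw':15s} {len(sample):8d} bytes")
for name, (size, seconds) in results.items():
    print(f"{name:15s} {size:8d} bytes  {seconds * 1000:.1f} ms")
```

The sketch measures only size and compression time. Splittability is a property of how a codec's output can be divided among map tasks (in Hadoop, bzip2 output is splittable while gzip output is not), so it has to be checked against the codec's documentation rather than benchmarked this way.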
  6. #11 Choosing a Compression Codec
     • “why do we need different compression schemes?”
     • ref. http://comphadoop.weebly.com/
  7. take a break : algorithm
     • lossless compression
       • LZ77 (LZSS), LZ78, etc...
       • DEFLATE (LZ77 with Huffman coding)
       • LZH (LZ77 with Static Huffman coding)
       • BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman coding)
     • lossy
       • for JPEG, MPEG, etc... (omitted)
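Of the stages listed for BZIP2, Move-to-Front is the easiest to show in a few lines. The following is a minimal sketch, not bzip2's actual implementation; the function names and the sample string are assumptions chosen for illustration. Each symbol is replaced by its current position in a symbol list and then moved to the front, so runs of a repeated symbol become runs of zeros, which a later entropy coder such as Huffman coding can store cheaply.

```python
def mtf_encode(text: str, alphabet: list) -> list:
    """Replace each symbol by its index in a list that is reordered as we go."""
    symbols = list(alphabet)
    indices = []
    for ch in text:
        idx = symbols.index(ch)
        indices.append(idx)
        # Move the symbol just seen to the front of the list.
        symbols.insert(0, symbols.pop(idx))
    return indices


def mtf_decode(indices: list, alphabet: list) -> str:
    """Invert mtf_encode by replaying the same list reordering."""
    symbols = list(alphabet)
    chars = []
    for idx in indices:
        ch = symbols.pop(idx)
        chars.append(ch)
        symbols.insert(0, ch)
    return "".join(chars)


# Sample input with a run of repeated symbols (an assumption for the demo).
sample_text = "ard$rcaaaabb"
sample_alphabet = sorted(set(sample_text))
encoded = mtf_encode(sample_text, sample_alphabet)
print(encoded)
print(mtf_decode(encoded, sample_alphabet))
```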
  8. take a break : algorithm
     • ref. http://www.slideshare.net/moaikids/ss-2638826
  9. take a break : algorithm
     • ref. http://www.slideshare.net/moaikids/ss-2638826
  10. take a break : algorithm
      • Burrows–Wheeler Transform (BWT)
        • block sorting
      • “abracadabra” = bwt “ard$rcaaaabb” (“$” marks the end of the text)

          rotations      sorted         last (L)   first (F)   restored
          abracadabra$   $abracadabra   a          $           a
          bracadabra$a   a$abracadabr   r          a           b
          racadabra$ab   abra$abracad   d          a           r
          acadabra$abr   abracadabra$   $          a           a
          cadabra$abra   acadabra$abr   r          a           c
          adabra$abrac   adabra$abrac   c          a           a
          dabra$abraca   bra$abracada   a          b           d
          abra$abracad   bracadabra$a   a          b           a
          bra$abracada   cadabra$abra   a          c           b
          ra$abracadab   dabra$abraca   a          d           r
          a$abracadabr   ra$abracadab   b          r           a
          $abracadabra   racadabra$ab   b          r           $
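The block-sorting step can be sketched directly from its definition: build every rotation of the text, sort them, and read off the last column. This is a naive version for illustration only (it uses quadratic memory, which is why real implementations use suffix arrays); the function names are assumptions, and “$” is assumed to be a unique end marker that sorts before every other character.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive version)."""
    if text.count("$") != 1 or not text.endswith("$"):
        raise ValueError("text must end with a single '$' end marker")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str) -> str:
    """Rebuild the text by repeatedly prepending the last column and sorting."""
    size = len(last_column)
    table = [""] * size
    for _ in range(size):
        table = sorted(last_column[i] + table[i] for i in range(size))
    # The row that ends with the end marker is the original text.
    return next(row for row in table if row.endswith("$"))


transformed = bwt("abracadabra$")
print(transformed)
print(inverse_bwt(transformed))
```

The transform itself does not shrink anything. Its value is that equal characters tend to end up next to each other, which makes the output friendlier to the later stages of a compressor.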
  11. take a break : algorithm
      • BWT with Suffix Array
        • ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
        • ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
  12. take a break : algorithm
      • LZO
        • “Compression is comparable in speed to DEFLATE compression.”
        • “Very fast decompression”
        • http://www.oberhumer.com/opensource/lzo/
  13. take a break : algorithm
      • Google Snappy
        • “very high speeds and reasonable compression”
        • https://code.google.com/p/snappy/
        • ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
  14. take a break : algorithm
      • LZ4
        • “very fast lossless compression algorithm”
        • https://code.google.com/p/lz4/
        • ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
  15. take a break : algorithm
      • “Add support for LZ4 compression”
        • fix version: 0.23.1, 0.24.0 (CDH4)
        • ref. https://issues.apache.org/jira/browse/HADOOP-7657
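  16. take a break : Implementation Codec
      • ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html

          import java.io.IOException;
          import java.io.InputStream;
          import java.io.OutputStream;
          import org.apache.hadoop.io.compress.BlockCompressorStream;
          import org.apache.hadoop.io.compress.CompressionCodec;
          import org.apache.hadoop.io.compress.CompressionInputStream;
          import org.apache.hadoop.io.compress.CompressionOutputStream;
          import org.apache.hadoop.io.compress.Compressor;

          // Skeleton of a custom codec. HogeCompressor, bufferSize and
          // compressionOverhead are placeholders defined elsewhere.
          public class HogeCodec implements CompressionCodec {
            @Override
            public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
                throws IOException {
              return new BlockCompressorStream(out, compressor, bufferSize, compressionOverhead);
            }

            @Override
            public Class<? extends Compressor> getCompressorType() {
              return HogeCompressor.class;
            }

            @Override
            public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
              return createOutputStream(out, createCompressor());
            }

            @Override
            public Compressor createCompressor() {
              return new HogeCompressor();
            }

            @Override
            public CompressionInputStream createInputStream(InputStream in) throws IOException {
              return createInputStream(in, createDecompressor());
            }
            ............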
  17. #11 Enabling Compression
      • Intermediate Compression (hive, mapred)
      • Final Output Compression (hive, mapred)
  18. #11 Enabling Compression
      • Intermediate Compression (hive, mapred)
        • setting the enable flag
  19. #11 Enabling Compression
      • Intermediate Compression (hive, mapred)
        • setting the codec
  20. #11 Enabling Compression
      • Final Output Compression (hive, mapred)
        • setting the enable flag
  21. #11 Enabling Compression
      • Final Output Compression (hive, mapred)
        • setting the codec
  22. #11 Sequence File
      • Sequence File Format
        • Header
        • Record
          • Record length
          • Key length
          • Key
          • Value
        • A sync marker every few hundred bytes or so.
      • ref. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
  23. #11 Sequence File
      • Compression Type
        • NONE: records are not compressed
        • RECORD: each record is compressed individually
        • BLOCK: records are collected into blocks and each block is compressed as a whole
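The difference between RECORD and BLOCK compression can be sketched without Hadoop. The code below is a simplified model, not the SequenceFile format itself: it uses zlib, ignores keys, headers and sync markers, and the function names, the 4-byte length prefix and the block size of 100 records are all assumptions made for the demo.

```python
import zlib


def compress_per_record(records: list) -> list:
    """RECORD style: every value is compressed on its own."""
    return [zlib.compress(record) for record in records]


def compress_per_block(records: list, block_size: int = 100) -> list:
    """BLOCK style: batch records and compress each batch as one unit."""
    blocks = []
    for start in range(0, len(records), block_size):
        batch = records[start:start + block_size]
        # Length-prefix each record so the batch can be split apart again.
        payload = b"".join(len(r).to_bytes(4, "big") + r for r in batch)
        blocks.append(zlib.compress(payload))
    return blocks


def decompress_block(block: bytes) -> list:
    """Undo compress_per_block for a single block."""
    payload = zlib.decompress(block)
    records, pos = [], 0
    while pos < len(payload):
        length = int.from_bytes(payload[pos:pos + 4], "big")
        records.append(payload[pos + 4:pos + 4 + length])
        pos += 4 + length
    return records


# Hypothetical small, similar-looking records.
records = [b"user_%04d,click,2013-01-01" % i for i in range(1000)]
raw_total = sum(len(r) for r in records)
record_total = sum(len(c) for c in compress_per_record(records))
block_total = sum(len(b) for b in compress_per_block(records))
print(raw_total, record_total, block_total)
```

Short records give a compressor almost nothing to work with, and each compressed record pays its own framing overhead. Batching lets the compressor exploit the redundancy between neighbouring records, which is the usual argument for choosing BLOCK.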
  24. #11 Compression in Action
      • (DEMO)
  25. #11 Archive Partition
      • Using ‘HAR’
        • ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html
      • Archiving
          hive> SET hive.archive.enabled=true;
          hive> ALTER TABLE hoge ARCHIVE PARTITION(folder='fuga');
      • Unarchiving
          hive> ALTER TABLE hoge UNARCHIVE PARTITION(folder='fuga');
  26. Break :)
  27. #15 Record Format
      • TEXTFILE
      • SEQUENCEFILE
      • RCFILE
          CREATE TABLE hoge ( ........ ) STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
  28. #15 Record Format
      • RCFile (Record Columnar File)
        • fast data loading
        • fast query processing
        • highly efficient storage space utilization
        • a strong adaptivity to dynamic data access patterns.
      • ref. "A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems (ICDE’11)" http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
  29. #15 Record Format
      • RCFile Format
        • 1 Row Group = multiple records
        • 1 HDFS Block = one or more Row Groups
      • Row Group
        • a sync marker
        • metadata header
        • table data (stored column by column)
      • the ‘metadata header’ section is compressed with the RLE algorithm.
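The two ideas behind a row group, storing a batch of rows column by column and run-length encoding repetitive values, can be sketched as follows. This is a toy model under stated assumptions, not the RCFile on-disk layout: the sample rows, the function names, and the use of field lengths as the RLE input are made up for illustration.

```python
def to_columns(rows: list) -> list:
    """Pivot one row group (a list of equal-length tuples) into columns."""
    return [list(column) for column in zip(*rows)]


def rle_encode(values: list) -> list:
    """Run-length encode a list into (value, run length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def rle_decode(runs: list) -> list:
    """Expand (value, run length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical row group with three fields per record.
row_group = [
    (101, "jp", "click"),
    (102, "jp", "click"),
    (103, "us", "view"),
    (104, "us", "click"),
]
columns = to_columns(row_group)
# Per-field lengths are the kind of repetitive metadata that RLE handles well.
country_lengths = [len(value) for value in columns[1]]
print(columns)
print(rle_encode(country_lengths))
```

Keeping the values of one column together also means a query that touches only some columns can skip the others inside each row group.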
  30. #15 Record Format
      • Implementation of RCFile
        • Input Format
          • org.apache.hadoop.hive.ql.io.RCFileInputFormat
        • Output Format
          • org.apache.hadoop.hive.ql.io.RCFileOutputFormat
        • SerDe
          • org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
  31. #15 Record Format
      • Tuning of RCFile
        • “hive.io.rcfile.record.buffer.size”
          • defines the Row Group size (default: 4 MB)
  32. #15 Record Format
      • ref. “HDFS and Hive storage - comparing file formats and compression methods”
        • http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
      • "In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results."
      • "In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate."
  33. #Appendix - trevni
      • ref. https://github.com/cutting/trevni/
      • ref. http://avro.apache.org/docs/current/trevni/spec.html
  34. #Appendix - trevni
      [Figure: trevni file layout]
      • file = file header, column, column, ......
      • file header = magic, number of rows, number of columns, file metadata, column metadata, column start position
      • column metadata = name, type, codec, etc...
      • column = number of blocks, block descriptor, ......, block, block, ......
      • block descriptor = number of rows, uncompressed bytes, compressed bytes
      • block = row, row, row, ......
  35. #Appendix - ORCFile
      • ref. http://hortonworks.com/blog/100x-faster-hive/
      • ref. https://issues.apache.org/jira/browse/HIVE-3874
      • ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx
  36. #Appendix - ORCFile
      • ref. data size
  37. #Appendix - ORCFile
      • ref. comparison
  38. #Appendix - Column-Oriented Storage
      • ref. http://arxiv.org/pdf/1105.4252.pdf
  39. #Appendix - more information
      • http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=
  40. Thanks for listening :)
