Programming Hive Reading #4
Transcript

  • 1. Programming Hive Reading #4 @just_do_neet
  • 2. Chapter 11. and 15.
      • Chapter 11. ‘Other File Formats and Compression’
          • Choosing / Enabling / Action / HAR / etc...
      • Chapter 15. ‘Customizing Hive File and Record Formats’
          • Demystifying DML / File Formats / etc...
      • "SerDe"-related topics are excluded from this presentation.
  • 3. #11 Determining Installed Codecs
      $ hive -e "set io.compression.codecs"
      io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        com.hadoop.compression.lzo.LzoCodec,
        org.apache.hadoop.io.compress.SnappyCodec
  • 4. #11 Choosing a Compression Codec
      • Advantage: less network I/O and disk space.
      • Disadvantage: CPU overhead.
      • In short: it is a trade-off.
  • 5. #11 Choosing a Compression Codec
      • “why do we need different compression schemes?”
          • speed
          • minimizing size
          • ‘splittable’ or not.
  • 6. #11 Choosing a Compression Codec
      • “why do we need different compression schemes?”
      • http://comphadoop.weebly.com/
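The size-versus-CPU trade-off can be tried out with a small sketch. It uses only Python standard-library codecs as stand-ins: zlib is DEFLATE (the algorithm behind gzip) and bz2 is bzip2, while lzma is added only as a third comparison point and does not appear in the slides. LZO and Snappy are not in the standard library, so they are left out. The function name `compare_codecs` and the sample data are assumptions made for illustration.

```python
import bz2
import lzma
import time
import zlib

# Standard-library stand-ins; these are not the Hadoop codec classes.
CODECS = {
    "deflate": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compare_codecs(data: bytes) -> dict:
    """Return {codec name: (compressed size in bytes, compression seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # A lossless codec must round-trip exactly.
        assert decompress(packed) == data
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive log-like text, which compresses well.
    sample = b"2013-01-01\tuser42\tGET /index.html\t200\n" * 20000
    for name, (size, seconds) in compare_codecs(sample).items():
        print(f"{name:8s} {size:8d} bytes  {seconds:.4f} s")
```

Timings depend on the machine and the data, so the numbers are only meaningful relative to each other on the same input.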
  • 7. take a break : algorithm
      • lossless compression
          • LZ77 (LZSS), LZ78, etc...
          • DEFLATE (LZ77 with Huffman coding)
          • LZH (LZ77 with Static Huffman coding)
          • BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman Coding)
      • lossy
          • for JPEG, MPEG, etc... (snip.)
  • 8. take a break : algorithm
      • http://www.slideshare.net/moaikids/ss-2638826
  • 9. take a break : algorithm
      • http://www.slideshare.net/moaikids/ss-2638826
  • 10. take a break : algorithm
      • Burrows–Wheeler Transform (BWT)
      • block sorting
      • bwt(“abracadabra”) = “ard$rcaaaabb”

      rotations     sorted        last  first  text
      abracadabra$  $abracadabra  a     $      a
      bracadabra$a  a$abracadabr  r     a      b
      racadabra$ab  abra$abracad  d     a      r
      acadabra$abr  abracadabra$  $     a      a
      cadabra$abra  acadabra$abr  r     a      c
      adabra$abrac  adabra$abrac  c     a      a
      dabra$abraca  bra$abracada  a     b      d
      abra$abracad  bracadabra$a  a     b      a
      bra$abracada  cadabra$abra  a     c      b
      ra$abracadab  dabra$abraca  a     d      r
      a$abracadabr  ra$abracadab  b     r      a
      $abracadabra  racadabra$ab  b     r      $
  • 11. take a break : algorithm
      • BWT with Suffix Array
          • ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
          • ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
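The block-sorting step can be sketched in a few lines. This is a minimal illustration, assuming "$" as an end marker that does not occur in the input. The suffix array is built naively with `sorted()`, which is far slower than the dedicated construction algorithms the references above describe, and the inverse uses the simple table-rebuilding method. The function names are chosen for this sketch only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via a naively built suffix array."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Suffix array: start positions of all suffixes, in sorted order.
    # With a unique sentinel this matches the order of the sorted rotations.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The last column holds the character just before each suffix
    # (index -1 wraps around to the sentinel for the suffix starting at 0).
    return "".join(s[i - 1] for i in suffix_array)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The rotation that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("abracadabra")
    print(encoded)
    print(inverse_bwt(encoded))
```

The transform itself does not shrink anything. It groups equal characters together so that a later stage such as Move-to-Front plus Huffman coding, as in BZIP2, compresses better.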
  • 12. take a break : algorithm
      • LZO
          • “Compression is comparable in speed to DEFLATE compression.”
          • “Very fast decompression”
          • http://www.oberhumer.com/opensource/lzo/
  • 13. take a break : algorithm
      • Google Snappy
          • “very high speeds and reasonable compression”
          • https://code.google.com/p/snappy/
          • ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
  • 14. take a break : algorithm
      • LZ4
          • “very fast lossless compression algorithm”
          • https://code.google.com/p/lz4/
          • ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
  • 15. take a break : algorithm
      • “Add support for LZ4 compression”
          • fix version: 0.23.1, 0.24.0 (CDH4)
          • ref. https://issues.apache.org/jira/browse/HADOOP-7657
  • 16. take a break : Implementation Codec
      ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html

      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;
      import org.apache.hadoop.io.compress.*;

      // HogeCodec and HogeCompressor are placeholder names. bufferSize and
      // compressionOverhead are fields that the slide leaves undefined.
      public class HogeCodec implements CompressionCodec {
        // Wrap the raw stream so that data is compressed block by block.
        @Override
        public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
            throws IOException {
          return new BlockCompressorStream(out, compressor, bufferSize, compressionOverhead);
        }

        @Override
        public Class<? extends Compressor> getCompressorType() {
          return HogeCompressor.class;
        }

        // Convenience overload: use a freshly created compressor.
        @Override
        public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
          return createOutputStream(out, createCompressor());
        }

        @Override
        public Compressor createCompressor() {
          return new HogeCompressor();
        }

        // Convenience overload: use a freshly created decompressor.
        @Override
        public CompressionInputStream createInputStream(InputStream in) throws IOException {
          return createInputStream(in, createDecompressor());
        }
        ............
  • 17. #11 Enabling Compression
      • Intermediate Compression (hive, mapred)
      • Final Output Compression (hive, mapred)
  • 18. #11 Enabling Compression
      • Intermediate Compression (hive, mapred)
          • setting enable flag
  • 19. #11 Enabling Compression
      • Intermediate Compression (hive, mapred)
          • setting codec
  • 20. #11 Enabling Compression
      • Final Output Compression (hive, mapred)
          • setting enable flag
  • 21. #11 Enabling Compression
      • Final Output Compression (hive, mapred)
          • setting codec
  • 22. #11 Sequence File
      • Sequence File Format
          • Header
          • Record
              • Record length
              • Key length
              • Key
              • Value
          • A sync-marker every few 100 bytes or so.
      • http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
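The record layout listed above (record length, key length, key, value) can be sketched as a byte layout. This is a simplified illustration rather than a SequenceFile reader or writer: it omits the header, the sync markers and Hadoop's Writable serialization, and it assumes that the record length counts the key bytes plus the value bytes and that both lengths are big-endian 4-byte integers. The function names are hypothetical.

```python
import struct


def pack_record(key: bytes, value: bytes) -> bytes:
    """Lay out one record as: record length, key length, key, value."""
    record_length = len(key) + len(value)  # assumption: key bytes + value bytes
    return struct.pack(">ii", record_length, len(key)) + key + value


def unpack_records(buf: bytes):
    """Yield (key, value) pairs from a buffer of concatenated records."""
    offset = 0
    while offset < len(buf):
        record_length, key_length = struct.unpack_from(">ii", buf, offset)
        offset += 8  # skip the two 4-byte length fields
        key = buf[offset:offset + key_length]
        value = buf[offset + key_length:offset + record_length]
        offset += record_length
        yield key, value


if __name__ == "__main__":
    data = pack_record(b"k1", b"hello") + pack_record(b"k2", b"hive")
    print(list(unpack_records(data)))
```

Because the value length is implied by the other two lengths, a reader can skip a whole record without parsing its contents.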
  • 23. #11 Sequence File
      • Compression Type
          • NONE : records are not compressed
          • RECORD : each record is compressed individually
          • BLOCK : records are collected into blocks and each block is compressed
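Why BLOCK usually compresses better than RECORD for small records can be seen with a rough sketch. It uses zlib as a stand-in codec and hypothetical sample values. The function names and the `block_records` parameter are assumptions for illustration, and the real SequenceFile handles keys, values and block boundaries differently from this simplification.

```python
import zlib


def record_compressed_size(values) -> int:
    """RECORD-style: compress every value on its own."""
    return sum(len(zlib.compress(v)) for v in values)


def block_compressed_size(values, block_records: int = 100) -> int:
    """BLOCK-style: concatenate a batch of values, then compress the batch."""
    total = 0
    for start in range(0, len(values), block_records):
        block = b"".join(values[start:start + block_records])
        total += len(zlib.compress(block))
    return total


if __name__ == "__main__":
    # Hypothetical short, similar records.
    values = [f"user{i % 10}\tGET /index.html\t200".encode() for i in range(1000)]
    print("raw    ", sum(len(v) for v in values))
    print("RECORD ", record_compressed_size(values))
    print("BLOCK  ", block_compressed_size(values))
```

Compressing each short record separately pays the codec's fixed overhead every time and cannot exploit repetition across records, which is what batching into blocks recovers.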
  • 24. #11 Compression in Action
      • (DEMO)
  • 25. #11 Archive Partition
      • Using ‘HAR’
          • ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html
      • Archiving
          hive> SET hive.archive.enabled=true;
          hive> ALTER TABLE hoge ARCHIVE PARTITION(folder='fuga');
      • Unarchiving
          hive> ALTER TABLE hoge UNARCHIVE PARTITION(folder='fuga');
  • 26. Break :)
  • 27. #15 Record Format
      • TEXTFILE
      • SEQUENCEFILE
      • RCFILE
      CREATE TABLE hoge (. ........ ) STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
  • 28. #15 Record Format
      • RCFile (Record Columnar File)
          • fast data loading
          • fast query processing
          • highly efficient storage space utilization
          • a strong adaptivity to dynamic data access patterns.
      • ref. "A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems (ICDE’11)" http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
  • 29. #15 Record Format
      • RCFile Format
          • 1 record = some Row Group
          • 1 HDFS Block = some Row Group
      • Row Group
          • a sync marker
          • metadata header
          • table data
      • uses the RLE algorithm to compress ‘metadata header’ section.
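The row-group idea and run-length encoding can be sketched together. This is a toy illustration, not the RCFile on-disk format: row groups are cut by row count here, whereas RCFile sizes them in bytes, and the RLE below is the generic textbook form rather than RCFile's actual encoding of its metadata header. All function names are hypothetical.

```python
from itertools import groupby


def to_row_groups(rows, rows_per_group: int):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # zip(*chunk) transposes a list of rows into a list of columns.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def rle_encode(values):
    """Generic run-length encoding: [(value, run length), ...]."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


if __name__ == "__main__":
    rows = [(1, "jp", "a"), (2, "jp", "b"), (3, "us", "c")]
    print(to_row_groups(rows, 2))
    print(rle_encode([4, 4, 4, 7]))
```

Within a row group the values of one column sit next to each other, so a query that needs only some columns can skip the rest, and runs of repeated values become short (value, count) pairs.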
  • 30. #15 Record Format
      • Implementation of RCFile
          • Input Format
              • o.a.h.h.ql.io.RCFileInputFormat
          • Output Format
              • o.a.h.h.ql.io.RCFileOutputFormat
          • SerDe
              • o.a.h.h.serde2.columnar.ColumnarSerDe
  • 31. #15 Record Format
      • Tuning of RCFile
          • “hive.io.rcfile.record.buffer.size”
          • defines the “RowGroup” size (default: 4MB)
  • 32. #15 Record Format
      • ref. “HDFS and Hive storage - comparing file formats and compression methods”
          • http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
          • "In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results."
          • "In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate."
  • 33. #Appendix - trevni
      • ref. https://github.com/cutting/trevni/
      • ref. http://avro.apache.org/docs/current/trevni/spec.html
  • 34. #Appendix - trevni
      [Figure: Trevni file layout]
      • file = file header, column, column, column, ......
      • file header = magic, number of rows, number of columns, file metadata, column metadata ......, column start position ......
      • column metadata = name, type, codec, etc...
      • column = number of blocks, block descriptor ......, block, block, ......
      • block descriptor = number of rows, uncompressed bytes, compressed bytes
      • block = row, row, row, row, ......
  • 35. #Appendix - ORCFile
      • ref. http://hortonworks.com/blog/100x-faster-hive/
      • ref. https://issues.apache.org/jira/browse/HIVE-3874
      • ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx
  • 36. #Appendix - ORCFile
      • ref. data size
  • 37. #Appendix - ORCFile
      • ref. comparison
  • 38. #Appendix - Column-Oriented Storage
      • ref. http://arxiv.org/pdf/1105.4252.pdf
  • 39. #Appendix - more information
      • http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=
  • 40. Thank you for listening :)