This document summarizes key points from chapters 11 and 15 of Programming Hive. It discusses choosing compression codecs for intermediate and final outputs in Hive, how compression schemes such as LZO, Snappy, and the Burrows–Wheeler transform (BWT) work, and how to enable compression in Hive. It also covers Hive file formats such as SequenceFile, RCFile, and ORCFile. RCFile stores data column-wise and compresses its metadata headers with RLE; ORCFile provides faster reads than RCFile. The document recommends LZO and Snappy as fast compression codecs that still achieve good compression ratios.
5. #11 Choosing a Compression Codec
•Advantage: less network I/O, less disk space.
•Disadvantage: CPU overhead.
•In short: a trade-off.
Programming Hive Reading #4 5
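In Hive this trade-off is toggled per session. A minimal sketch of enabling compression for intermediate and final output (the standard Hadoop properties and codec class names; the specific codec choices here are only illustrative):

```sql
-- Compress intermediate data passed between MapReduce stages
-- (a cheap codec like Snappy usually pays off here)
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job/table output
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```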
6. #11 Choosing a Compression Codec
•“why do we need different compression schemes?”
•speed
•minimizing size
•‘splittable’ or not
7. #11 Choosing a Compression Codec
•“why do we need different compression schemes?”
http://comphadoop.weebly.com/
8. take a break : algorithm
•lossless compression
•LZ77 (LZSS), LZ78, etc...
•DEFLATE (LZ77 with Huffman coding)
•LZH (LZ77 with static Huffman coding)
•BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman coding)
•lossy compression
•for JPEG, MPEG, etc... (snip.)
9. take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
10. take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
11. take a break : algorithm
•Burrows–Wheeler Transform (BWT)
•block sorting
•bwt(“abracadabra$”) = “ard$rcaaaabb”
Rotations      Sorted         Last
abracadabra$   $abracadabra   a
bracadabra$a   a$abracadabr   r
racadabra$ab   abra$abracad   d
acadabra$abr   abracadabra$   $
cadabra$abra   acadabra$abr   r
adabra$abrac   adabra$abrac   c
dabra$abraca   bra$abracada   a
abra$abracad   bracadabra$a   a
bra$abracada   cadabra$abra   a
ra$abracadab   dabra$abraca   a
a$abracadabr   ra$abracadab   b
$abracadabra   racadabra$ab   b
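The block-sorting transform above can be sketched naively in a few lines (class and method names are ours; real implementations such as bzip2 use suffix arrays instead of materializing every rotation):

```java
import java.util.Arrays;

// Naive Burrows–Wheeler transform: build every rotation of the input,
// sort them, and read off the last column of the sorted matrix.
public class Bwt {
    static String bwt(String s) {
        int n = s.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            // rotation starting at position i
            rotations[i] = s.substring(i) + s.substring(0, i);
        }
        Arrays.sort(rotations); // '$' (ASCII 36) sorts before the letters
        StringBuilder last = new StringBuilder(n);
        for (String r : rotations) {
            last.append(r.charAt(n - 1)); // last column
        }
        return last.toString();
    }

    public static void main(String[] args) {
        System.out.println(bwt("abracadabra$")); // ard$rcaaaabb
    }
}
```

The output groups identical characters together (runs of a's and b's), which is what makes the subsequent Move-to-Front and Huffman stages of BZIP2 effective.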
12. take a break : algorithm
•BWT with Suffix Array
•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
13. take a break : algorithm
•LZO
•“Compression is comparable in speed to DEFLATE compression.”
•“Very fast decompression”
• http://www.oberhumer.com/opensource/lzo/
14. take a break : algorithm
•Google Snappy
•“very high speeds and reasonable compression”
• https://code.google.com/p/snappy/
•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
15. take a break : algorithm
•LZ4
•“very fast lossless compression algorithm”
• https://code.google.com/p/lz4/
•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
16. take a break : algorithm
•“Add support for LZ4 compression”
•fixed in versions 0.23.1, 0.24.0 (CDH4)
•ref. https://issues.apache.org/jira/browse/HADOOP-7657
17. take a break : Implementation Codec
public class HogeCodec implements CompressionCodec {
  @Override
  public CompressionOutputStream createOutputStream(OutputStream out,
      Compressor compressor) throws IOException {
    // bufferSize and compressionOverhead are fields of the codec (not shown)
    return new BlockCompressorStream(out, compressor, bufferSize,
        compressionOverhead);
  }
  @Override
  public Class<? extends Compressor> getCompressorType() {
    return HogeCompressor.class;
  }
  @Override
  public CompressionOutputStream createOutputStream(OutputStream out)
      throws IOException {
    return createOutputStream(out, createCompressor());
  }
  @Override
  public Compressor createCompressor() {
    return new HogeCompressor();
  }
  @Override
  public CompressionInputStream createInputStream(InputStream in)
      throws IOException {
    return createInputStream(in, createDecompressor());
  }
  ............
ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
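For Hadoop/Hive to pick up a custom codec like the one above, it has to be registered in the configuration; a sketch (the `com.example.HogeCodec` class name is hypothetical, matching the snippet):

```xml
<!-- core-site.xml: append the custom codec to the codec list -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.example.HogeCodec</value>
</property>
```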
23. #11 Sequence File
•Sequence File Format
• Header
• Record
• Record length
• Key length
• Key
• Value
• A sync marker every few hundred bytes or so.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
24. #11 Sequence File
•Compression Type
•NONE : no compression
•RECORD : compresses each record individually
•BLOCK : compresses multiple records per block
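When Hive writes SequenceFile output, the compression type above is selected per session; a minimal sketch (BLOCK generally compresses best because it compresses many records together):

```sql
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE | RECORD | BLOCK
```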
28. #15 Record Format
•TEXTFILE
•SEQUENCEFILE
•RCFILE
CREATE TABLE hoge (
........
)
STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
29. #15 Record Format
•RCFile (Record Columnar File)
•fast data loading
•fast query processing
•highly efficient storage space utilization
•strong adaptivity to dynamic data access patterns
•ref. “A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems” (ICDE ’11)
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
30. #15 Record Format
•RCFile Format
•1 record corresponds to a row group
•1 HDFS block holds one or more row groups
•Row Group
•a sync marker
•metadata header
•table data
•uses the RLE algorithm to compress the ‘metadata header’ section
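A toy illustration of run-length encoding, the idea RCFile applies to the highly repetitive metadata header (this string version is only a sketch of the technique, not the RCFile implementation):

```java
// Toy run-length encoder: each run of a repeated character becomes
// <count><char>, e.g. "aaaa" -> "4a". Length fields in an RCFile
// metadata header repeat in the same way, which is why RLE shrinks
// that section well.
public class Rle {
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) {
                j++; // extend the current run
            }
            out.append(j - i).append(s.charAt(i));
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaabbbcc")); // 4a3b2c
    }
}
```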
31. #15 Record Format
•Implementation of RCFile
•Input Format
•o.a.h.h.ql.io.RCFileInputFormat
•Output Format
•o.a.h.h.ql.io.RCFileOutputFormat
•SerDe
•o.a.h.h.serde2.columnar.ColumnarSerDe
32. #15 Record Format
•Tuning of RCFile
•“hive.io.rcfile.record.buffer.size”
•defines the row-group size (default: 4 MB)
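Larger row groups can improve compression at the cost of write-side memory; a sketch of raising the default before loading into an RCFile table (the 8 MB value is only an example):

```sql
SET hive.io.rcfile.record.buffer.size=8388608; -- 8 MB row groups (default 4 MB)
```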
33. #15 Record Format
•ref. “HDFS and Hive storage - comparing file formats and compression methods”
• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
•“In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results.”
•“In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate.”