Storage in Hadoop
PUNEET TRIPATHI
What are we covering?
• File formats –
> Delimited – CSV and others
> Sequence Files
> Avro
> Column formats
>> RC/ORC/Parquet
• Compression/decompression –
> Gzip, Bzip2, Snappy, LZO
• Focus – Apache Parquet
File Formats
AND WHY WE NEED THEM
Common considerations for choosing a file format
• Which processing and query tools will you use?
• Does the data structure change over time?
• Compression and splittability
• Processing or query performance
• And disk space still can’t be ignored!
Available storage formats
• Text/CSV
Ubiquitously parsable | Splittable | No metadata | No block compression
• JSON records – always try to avoid JSON
Each line is a JSON datum | Metadata | Schema evolution | Embarrassingly poor native SerDes | No (read: optional) block compression
• Sequence Files
Binary format | Similar structure to CSV | Block compression | No metadata
• Avro
Splittable | Metadata | Superb schema evolution | Block compression | Supported by almost all Hadoop tools | Looks like JSON
• RC
Record Columnar Files | Columnar format | Compression | Query performance | No schema evolution (previous files must be rewritten) | Writes are not optimized
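A minimal PySpark sketch of writing one small DataFrame in a few of the formats above. The paths, column names and the note about the spark-avro package are illustrative assumptions, not something from the slides.

# Write the same data as CSV, ORC and Avro for comparison.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.5), (2, "bob", 12.0)],
    ["id", "name", "score"],
)

# Text/CSV: universally parsable, but no metadata and no block compression.
df.write.mode("overwrite").option("header", "true").csv("/tmp/formats/csv")

# ORC: columnar, compresses well, good query performance.
df.write.mode("overwrite").orc("/tmp/formats/orc")

# Avro: row-oriented with an embedded schema; needs the spark-avro module on
# the classpath (e.g. --packages org.apache.spark:spark-avro_2.12:<version>).
df.write.mode("overwrite").format("avro").save("/tmp/formats/avro")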
Available storage formats
• ORC
Optimized RC | Same benefits as RC but faster | No schema evolution | Support could be a problem | ORC files often compress to the smallest size of any format (per several benchmarks, including my own) | As performant as Parquet
• Parquet
Columnar | Superb compression | Query performance | Writes are not optimized | Supports schema evolution | Widely supported across the Hadoop ecosystem, or support is being added | Spark supports it out of the box, and we use Spark (see the sketch below)
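Since Spark handles Parquet natively, the round trip needs no extra packages. A hedged sketch follows; the data and paths are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# Write Parquet with Snappy compression (Spark's default codec for Parquet).
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/mart/users")

# Column pruning: only the requested columns are read from disk,
# which is the main query-time benefit of a columnar format.
spark.read.parquet("/tmp/mart/users").select("user_id").count()

# Schema evolution: merge schemas of files written with different columns.
spark.read.option("mergeSchema", "true").parquet("/tmp/mart/users")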
Moment of truth –
• There is no file format that will do all the things for you
• Consider the following when picking a format –
> Hadoop distribution
> Read/query requirements
> Interchange & extraction of data
• Different phases may need different storage formats (sketch below) –
> Parquet is best suited if your mart is query heavy
> CSV for porting data to other data stores
• Always avoid XML and JSON; they are not splittable, and Hadoop cares about splittability intensely.
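A sketch of the "different phases, different formats" idea: keep the query-heavy mart in Parquet and produce a CSV extract for another data store. The table, column and path names below are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phase-formats").getOrCreate()

# Query-heavy mart: stored as Parquet.
orders = spark.read.parquet("/warehouse/mart/orders")

daily = orders.groupBy("order_date").sum("amount")

# Interchange/extract: a single gzipped CSV that any downstream store can load.
(daily.coalesce(1)
      .write.mode("overwrite")
      .option("header", "true")
      .option("compression", "gzip")
      .csv("/exports/daily_order_totals"))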
Codecs
AND WHY WE NEED THEM
Common considerations for choosing a codec
• Balance the processing capacity needed to compress and decompress
• Compression degree vs. speed tradeoff (illustrated below)
• How soon will you query the data?
• Splittability – matters a lot in the context of Hadoop
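A tiny local illustration of the degree-vs-speed tradeoff. This uses Python's stdlib gzip and bzip2 on synthetic data, not the Hadoop codecs themselves, so treat the numbers only as a feel for the shape of the tradeoff.

import bz2
import gzip
import time

data = b"2024-01-01 INFO storage-demo some repetitive hadoop log line\n" * 200_000

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    # bzip2 typically squeezes harder but takes noticeably longer.
    print(f"{name}: ratio={len(data) / len(out):.1f}x, time={elapsed:.2f}s")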
Available codecs
• Gzip
Wrapper around zlib | Not splittable | But excellent compression | Supported out of the box | CPU intensive
• Bzip2
Similar to Gzip except splittable | Even better compression ratio | Slowest of the lot
• LZO – Lempel-Ziv-Oberhumer
Modest compression ratio | Can build an index during compression | Splittable (with index) | Fast compression speed | Not CPU intensive | Works well with text files too
• Snappy
Same LZ family as LZO | Shipped with Hadoop | Fastest decompression; compression speed comparable with LZO | Compression ratio is poorer than the other codecs | Not CPU intensive
• Snappy often performs better than LZO, but it is worth running tests to see if you detect a significant difference.
• Hadoop doesn’t support ZIP out of the box.
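A hedged sketch of picking a codec in practice. The property names are the standard Hadoop/Spark ones; the paths and data are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("codec-demo")
         # Compress job output with Snappy (splittable-friendly inside
         # container formats such as SequenceFile, ORC or Parquet).
         .config("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
         .config("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec")
         # Default codec for Parquet files written by Spark SQL.
         .config("spark.sql.parquet.compression.codec", "snappy")
         .getOrCreate())

# Text output with an explicit codec class (gzip here; not splittable,
# so prefer it only for final, small-ish extracts).
rdd = spark.sparkContext.parallelize(["a,1", "b,2"])
rdd.saveAsTextFile("/tmp/demo/text_gz",
                   compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")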
Codecs – performance comparison
• Space savings and CPU time comparison [Yahoo benchmark chart shown on slide]
Focus – Parquet
IF TIME PERMITS
Columnar Storage – Overview
• Let’s say we have a table with a set of observations (sample table shown on slide):
• Reduced space consumption & better column-level compression
• Efficient encoding and decoding, since values of the same primitive type are stored together
• Reading the same number of column field values for the same number of records needs a fraction of the I/O of row-wise storage (e.g., one column out of three needs roughly a third of the reads)
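A toy illustration of the row-wise vs. columnar layouts described above, using plain Python lists (no Parquet involved; the sample records are invented).

rows = [
    {"id": 1, "name": "alice", "score": 34.5},
    {"id": 2, "name": "bob",   "score": 12.0},
    {"id": 3, "name": "carol", "score": 99.9},
]

# Row-wise: values of different types are interleaved record by record.
row_layout = [v for r in rows for v in r.values()]

# Columnar: all values of one column sit together, so a query touching only
# "score" reads one contiguous run, and same-type runs encode/compress better.
column_layout = {col: [r[col] for r in rows] for col in rows[0]}

print(row_layout)
print(column_layout["score"])   # only one column in three needs reading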
Parquet – Columnar file format
• Inspired by Google’s Dremel; developed by Twitter and Cloudera
• Storage on disk – (diagram on slide)
• Supports nested data structures – (diagram on slide)
Image Source – Twitter’s Blog - https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
Parquet – Specifications
• Supports primitive data types – BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY
• Schema is defined in a Protocol Buffers-style message
> has a root called message
> fields are required, optional or repeated
• Field types are either group or primitive types
• Each cell is encoded as a triplet – repetition level, definition level & value
• The structure of a record is captured by two integers – repetition level & definition level
• Definition level captures how far down a path of optional fields a value is actually defined (i.e., column nullity)
• Repetition level records where a new list starts [repeated fields are stored as lists]
Definition Level
message ExampleDefinitionLevel {
  optional group a {
    optional group b {
      optional string c;
    }
  }
}
one column: a.b.c
Image Source – Twitter’s Blog - https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
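A worked example of definition levels for the single column a.b.c above, following the Twitter/Dremel write-up linked in the image credit: since a, b and c are all optional, the level counts how many fields along the path are actually defined. The record literals are just illustrative Python dicts.

examples = [
    ({"a": None},                 0),  # a is null             -> level 0
    ({"a": {"b": None}},          1),  # a defined, b null     -> level 1
    ({"a": {"b": {"c": None}}},   2),  # a.b defined, c null   -> level 2
    ({"a": {"b": {"c": "foo"}}},  3),  # value present; level = path depth
]

for record, level in examples:
    print(record, "-> definition level", level)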
Thank You!