Storage in Hadoop
PUNEET TRIPATHI
What are we covering?
• File formats –
> Delimited – CSV and others
> Sequence Files
> Avro
> Column formats
>> RC/ORC/Parquet
• Compression/decompression –
> Gzip, Bzip2, Snappy, LZO
• Focus – Apache Parquet
File Formats
AND WHY WE NEED THEM
Common considerations for choosing a file format
• Which processing and query tools will you use?
• Does the data structure change over time?
• Compression and splittability
• Processing or query performance
• And disk space still can’t be ignored!
Available storage formats
• Text/CSV
Ubiquitously parsable | Splittable | No metadata | No block compression
• JSON records – always try to avoid JSON
Each line is a JSON datum | Metadata | Schema evolution | Embarrassingly poor native SerDes | No (read: optional) block compression
• Sequence Files
Binary format | Similar structure to CSV | Block compression | No metadata
• Avro
Splittable | Metadata | Superb schema evolution | Block compression | Supported by almost all Hadoop tools | Looks like JSON
• RC
Record Columnar Files | Columnar format | Compression | Query performance | No schema evolution (previous files must be rewritten) | Writes are not optimized
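A minimal PySpark sketch of writing one small DataFrame in a few of the formats above. The paths, column names and the note about the spark-avro package are illustrative assumptions, not something from the slides.

# Write the same data as CSV, ORC and Avro for comparison.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.5), (2, "bob", 12.0)],
    ["id", "name", "score"],
)

# Text/CSV: universally parsable, but no metadata and no block compression.
df.write.mode("overwrite").option("header", "true").csv("/tmp/formats/csv")

# ORC: columnar, compresses well, good query performance.
df.write.mode("overwrite").orc("/tmp/formats/orc")

# Avro: row-oriented with an embedded schema; needs the spark-avro module on
# the classpath (e.g. --packages org.apache.spark:spark-avro_2.12:<version>).
df.write.mode("overwrite").format("avro").save("/tmp/formats/avro")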
Available storage formats
• ORC
Optimized RC | Same benefits as RC but faster | No schema evolution | Support could be a problem | ORC files often compress to the smallest size of any format (per several benchmarks, including my own) | As performant as Parquet
• Parquet
Columnar | Superb compression | Query performance | Writes are not optimized | Supports schema evolution | Widely supported across the Hadoop ecosystem, or support is being added | Spark supports it out of the box, and we use Spark (see the sketch below)
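Since Spark handles Parquet natively, the round trip needs no extra packages. A hedged sketch follows; the data and paths are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# Write Parquet with Snappy compression (Spark's default codec for Parquet).
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/mart/users")

# Column pruning: only the requested columns are read from disk,
# which is the main query-time benefit of a columnar format.
spark.read.parquet("/tmp/mart/users").select("user_id").count()

# Schema evolution: merge schemas of files written with different columns.
spark.read.option("mergeSchema", "true").parquet("/tmp/mart/users")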
Moment of truth –
• There is no file format that will do all the things for you
• Consider the following when picking a format –
> Hadoop distribution
> Read/query requirements
> Interchange & extraction of data
• Different phases may need different storage formats (sketch below) –
> Parquet is best suited if your mart is query heavy
> CSV for porting data to other data stores
• Always avoid XML and JSON; they are not splittable, and Hadoop cares about splittability intensely.
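A sketch of the "different phases, different formats" idea: keep the query-heavy mart in Parquet and produce a CSV extract for another data store. The table, column and path names below are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phase-formats").getOrCreate()

# Query-heavy mart: stored as Parquet.
orders = spark.read.parquet("/warehouse/mart/orders")

daily = orders.groupBy("order_date").sum("amount")

# Interchange/extract: a single gzipped CSV that any downstream store can load.
(daily.coalesce(1)
      .write.mode("overwrite")
      .option("header", "true")
      .option("compression", "gzip")
      .csv("/exports/daily_order_totals"))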
Codecs
AND WHY WE NEED THEM
Common considerations for choosing a codec
• Balance the processing capacity needed to compress and decompress
• Compression degree vs. speed tradeoff (illustrated below)
• How soon will you query the data?
• Splittability – matters a lot in the context of Hadoop
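A tiny local illustration of the degree-vs-speed tradeoff. This uses Python's stdlib gzip and bzip2 on synthetic data, not the Hadoop codecs themselves, so treat the numbers only as a feel for the shape of the tradeoff.

import bz2
import gzip
import time

data = b"2024-01-01 INFO storage-demo some repetitive hadoop log line\n" * 200_000

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    # bzip2 typically squeezes harder but takes noticeably longer.
    print(f"{name}: ratio={len(data) / len(out):.1f}x, time={elapsed:.2f}s")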
Available codecs
• Gzip
Wrapper around zlib | Not splittable | But excellent compression | Supported out of the box | CPU intensive
• Bzip2
Similar to Gzip except splittable | Even better compression ratio | Slowest of the lot
• LZO – Lempel-Ziv-Oberhumer
Modest compression ratio | Can build an index during compression | Splittable (with index) | Fast compression speed | Not CPU intensive | Works well with text files too
• Snappy
Same LZ family as LZO | Shipped with Hadoop | Fastest decompression; compression speed comparable with LZO | Compression ratio is poorer than the other codecs | Not CPU intensive
• Snappy often performs better than LZO, but it is worth running tests to see if you detect a significant difference.
• Hadoop doesn’t support ZIP out of the box.
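A hedged sketch of picking a codec in practice. The property names are the standard Hadoop/Spark ones; the paths and data are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("codec-demo")
         # Compress job output with Snappy (splittable-friendly inside
         # container formats such as SequenceFile, ORC or Parquet).
         .config("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
         .config("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec")
         # Default codec for Parquet files written by Spark SQL.
         .config("spark.sql.parquet.compression.codec", "snappy")
         .getOrCreate())

# Text output with an explicit codec class (gzip here; not splittable,
# so prefer it only for final, small-ish extracts).
rdd = spark.sparkContext.parallelize(["a,1", "b,2"])
rdd.saveAsTextFile("/tmp/demo/text_gz",
                   compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")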
Codecs – performance comparison
• Space savings and CPU time comparison [Yahoo benchmark chart shown on slide]
Focus – Parquet
IF TIME PERMITS
Columnar Storage – Overview
• Let’s say we have a table with a set of observations (sample table shown on slide):
• Reduced space consumption & better column-level compression
• Efficient encoding and decoding, since values of the same primitive type are stored together
• Reading the same number of column field values for the same number of records needs a fraction of the I/O of row-wise storage (e.g., one column out of three needs roughly a third of the reads)
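A toy illustration of the row-wise vs. columnar layouts described above, using plain Python lists (no Parquet involved; the sample records are invented).

rows = [
    {"id": 1, "name": "alice", "score": 34.5},
    {"id": 2, "name": "bob",   "score": 12.0},
    {"id": 3, "name": "carol", "score": 99.9},
]

# Row-wise: values of different types are interleaved record by record.
row_layout = [v for r in rows for v in r.values()]

# Columnar: all values of one column sit together, so a query touching only
# "score" reads one contiguous run, and same-type runs encode/compress better.
column_layout = {col: [r[col] for r in rows] for col in rows[0]}

print(row_layout)
print(column_layout["score"])   # only one column in three needs reading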
Parquet – Columnar file format
• Inspired by Google’s Dremel; developed by Twitter and Cloudera
• Storage on disk – (diagram on slide)
• Supports nested data structures – (diagram on slide)
Image Source – Twitter’s Blog - https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
Parquet – Specifications
• Supports primitive data types – BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY
• Schema is defined in a Protocol Buffers-style message
> has a root called message
> fields are required, optional or repeated
• Field types are either group or primitive types
• Each cell is encoded as a triplet – repetition level, definition level & value
• The structure of a record is captured by two integers – repetition level & definition level
• Definition level captures how far down a path of optional fields a value is actually defined (i.e., column nullity)
• Repetition level records where a new list starts [repeated fields are stored as lists]
Definition Level
message ExampleDefinitionLevel {
  optional group a {
    optional group b {
      optional string c;
    }
  }
}
one column: a.b.c
Image Source – Twitter’s Blog - https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
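A worked example of definition levels for the single column a.b.c above, following the Twitter/Dremel write-up linked in the image credit: since a, b and c are all optional, the level counts how many fields along the path are actually defined. The record literals are just illustrative Python dicts.

examples = [
    ({"a": None},                 0),  # a is null             -> level 0
    ({"a": {"b": None}},          1),  # a defined, b null     -> level 1
    ({"a": {"b": {"c": None}}},   2),  # a.b defined, c null   -> level 2
    ({"a": {"b": {"c": "foo"}}},  3),  # value present; level = path depth
]

for record, level in examples:
    print(record, "-> definition level", level)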
Thank You!