Data Storage Formats in HDFS

Evaluation Criteria
- The processing tools
  - e.g., Cloudera does not support ORC
- Whether the data has a changing nature (an evolving schema) or not
- Splittability
  - e.g., XML is not splittable
- Compression
  - Speeds up I/O operations
  - Saves storage
  - Increases processing time, because the data must be decompressed before use
- The data size
- Processing and query performance
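
The splittability criterion can be illustrated with a minimal sketch. It assumes newline-delimited records (as in CSV) and uses hypothetical helper names (`read_split`, `split_ranges`). Each reader takes an arbitrary byte range and aligns itself to the next record boundary, so the ranges can be processed independently. A single XML document does not allow this, because a reader that starts at an arbitrary offset cannot tell which enclosing elements it is inside.

```python
from __future__ import annotations


def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the newline-terminated records owned by the byte range [start, end).

    A record belongs to the split in which its first byte falls. A reader that
    starts mid-record skips ahead to the next boundary, and a reader may run
    past `end` to finish the last record it started.
    """
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # We landed inside a record that the previous split owns: skip it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records


def split_ranges(size: int, split_size: int) -> list[tuple[int, int]]:
    """Cut [0, size) into fixed-size byte ranges, ignoring record boundaries."""
    return [(s, min(s + split_size, size)) for s in range(0, size, split_size)]


# Illustrative data: 100 small CSV records.
csv_data = b"".join(b"%d,user%d\n" % (i, i) for i in range(100))

# Each range could be handed to a different worker.
parts = [read_split(csv_data, s, e) for s, e in split_ranges(len(csv_data), 64)]
recovered = [rec for part in parts for rec in part]
```

Every record is read exactly once even though the 64-byte ranges cut through the middle of records.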

Common File Formats
- All File Formats
  - Standard
    - Sequence Data
    - Structured Data
  - Columnar
    - Parquet
    - ORC
  - Serialization
    - Avro

Summary of some file formats’ features
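| Data Format    | Type of Format | Splittable | Changing | Compression | Metadata |
|----------------|----------------|------------|----------|-------------|----------|
| JSON, XML      | Standard       | -          | +        | -           | +        |
| CSV Files      | Standard       | +          | -        | -           | -        |
| JSON Records   | Standard       | +          | +        | -           | +        |
| Sequence Files | Standard       | +          | -        | +           | -        |
| Avro Files     | Serialization  | +          | +        | +           | +        |
| ORC Files      | Columnar       | +          | +        | +           | +        |
| Parquet Files  | Columnar       | +          | +        | +           | +        |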
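
As a small illustration of how the summary could drive a format choice, the sketch below encodes the same rows as data and filters them by required features. The names `FormatFeatures`, `FORMATS`, and `candidates` are hypothetical and exist only for this example.

```python
from typing import NamedTuple


class FormatFeatures(NamedTuple):
    kind: str
    splittable: bool
    changing: bool      # tolerates data with a changing nature
    compression: bool
    metadata: bool


# Values transcribed from the summary of file format features.
FORMATS = {
    "JSON, XML": FormatFeatures("Standard", False, True, False, True),
    "CSV Files": FormatFeatures("Standard", True, False, False, False),
    "JSON Records": FormatFeatures("Standard", True, True, False, True),
    "Sequence Files": FormatFeatures("Standard", True, False, True, False),
    "Avro Files": FormatFeatures("Serialization", True, True, True, True),
    "ORC Files": FormatFeatures("Columnar", True, True, True, True),
    "Parquet Files": FormatFeatures("Columnar", True, True, True, True),
}


def candidates(**required: bool) -> list:
    """Return the formats whose features match every required flag."""
    return [
        name
        for name, features in FORMATS.items()
        if all(getattr(features, key) == value for key, value in required.items())
    ]


# Formats that are splittable, compressible, and tolerate changing data.
shortlist = candidates(splittable=True, compression=True, changing=True)
```

The remaining criteria (tool support, data size, query performance) would then decide between the shortlisted formats.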

Sequence File
- A good solution for packing many small files
- Stores data as <key, value> pairs
- Supports compression
  - Record level
  - Block level
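
The difference between the two compression modes can be sketched without Hadoop. This is not the SequenceFile on-disk format; it only imitates the idea with `zlib`, assuming small files stored as (file name, contents) pairs. In record mode each value is compressed on its own, while in block mode the keys and the values of many records are gathered and compressed together.

```python
import zlib

# Illustrative data: 200 small "files" stored as (key, value) pairs.
records = [
    (f"file-{i:04d}.log".encode(), b"status=OK user=alice action=login\n")
    for i in range(200)
]


def record_compressed_size(recs):
    """Record mode: every value is compressed individually, keys stay raw."""
    return sum(len(key) + len(zlib.compress(value)) for key, value in recs)


def block_compressed_size(recs, records_per_block=50):
    """Block mode: keys and values of a whole block are compressed together."""
    total = 0
    for i in range(0, len(recs), records_per_block):
        block = recs[i:i + records_per_block]
        keys = b"".join(key for key, _ in block)
        values = b"".join(value for _, value in block)
        total += len(zlib.compress(keys)) + len(zlib.compress(values))
    return total


raw_size = sum(len(key) + len(value) for key, value in records)
record_size = record_compressed_size(records)
block_size = block_compressed_size(records)
```

For tiny values like these, block mode benefits from redundancy across records, which per-record compression cannot see.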

Parquet
- Optimized for Impala
- Used by Twitter
- Data Structure
  - Data is partitioned horizontally into row groups
  - Within a row group, each column is stored as a column chunk made of pages
  - Pages can be compressed
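
The layout can be mimicked with a toy sketch. It is not the Parquet format: it assumes one page per column in each group, JSON-encoded values, and `zlib` compression, and the function names are hypothetical. It only shows how rows are cut into groups, stored column by column, and read back one column at a time.

```python
import json
import zlib


def write_row_groups(rows, rows_per_group=2):
    """Toy layout: rows -> row groups -> one compressed page per column."""
    groups = []
    for i in range(0, len(rows), rows_per_group):
        chunk = rows[i:i + rows_per_group]
        pages = {
            name: zlib.compress(json.dumps([row[name] for row in chunk]).encode())
            for name in chunk[0]
        }
        groups.append(pages)
    return groups


def read_column(groups, name):
    """Decompress only the pages that belong to the requested column."""
    values = []
    for pages in groups:
        values.extend(json.loads(zlib.decompress(pages[name])))
    return values


rows = [
    {"id": 1, "country": "DE", "amount": 10.5},
    {"id": 2, "country": "FR", "amount": 3.0},
    {"id": 3, "country": "DE", "amount": 7.25},
    {"id": 4, "country": "US", "amount": 1.0},
    {"id": 5, "country": "FR", "amount": 4.5},
]

groups = write_row_groups(rows)
amounts = read_column(groups, "amount")
```

A query that needs only `amount` never decompresses the `id` or `country` pages.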

ORC
- Optimized for Hive and Presto
- Data Structure
  - The index contains basic statistics
  - The file footer contains a list of stripe information
  - The postscript holds the compression parameters
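
The role of these three parts can be shown with a simplified in-memory model. It is not the ORC file format: the dictionaries, field names, and the `"zlib"` setting are assumptions made for illustration. The point is that per-stripe statistics let a reader skip stripes that cannot match a filter.

```python
def build_file(values, stripe_size=4):
    """Toy model of a file with stripes, a footer, and a postscript."""
    stripes = []
    for i in range(0, len(values), stripe_size):
        data = values[i:i + stripe_size]
        stripes.append({
            # Index: basic statistics about the data in this stripe.
            "index": {"min": min(data), "max": max(data), "count": len(data)},
            "data": data,
        })
    # Footer: a list describing where each stripe starts and how big it is.
    footer, first_row = [], 0
    for stripe in stripes:
        footer.append({"first_row": first_row, "rows": stripe["index"]["count"]})
        first_row += stripe["index"]["count"]
    # Postscript: how to interpret the rest of the file (hypothetical value).
    postscript = {"compression": "zlib"}
    return {"stripes": stripes, "footer": footer, "postscript": postscript}


def scan_greater_than(orc_like, threshold):
    """Return matching values and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in orc_like["stripes"]:
        if stripe["index"]["max"] <= threshold:
            continue  # statistics prove that no value here can match
        stripes_read += 1
        matches.extend(v for v in stripe["data"] if v > threshold)
    return matches, stripes_read


orc_like = build_file(list(range(1, 13)))
matches, stripes_read = scan_greater_than(orc_like, 9)
```

With the values 1 to 12 in three stripes, only the last stripe has to be read for the filter `> 9`.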

Avro
- Row-based storage
- Commonly used with Apache Kafka
- Robust support for changing schemas (schema evolution)
- Data Structure
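
Support for changing schemas can be sketched in plain Python. The code below does not use an Avro library; it imitates, in simplified form, how a reader schema is reconciled with a record written under an older schema: a field the writer did not know takes the reader's default, and a field the reader does not know is dropped. The schemas and the `resolve` helper are hypothetical.

```python
# Schema used when the data was written.
writer_schema_v1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# Newer schema used by the reader: it adds an optional field with a default.
reader_schema_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field unknown to the writer: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that exist only in the written record are silently ignored.
    return resolved


old_record = {"id": 1, "name": "Ada"}
upgraded = resolve(old_record, reader_schema_v2)
```

Old data stays readable after the schema grows, which is why adding a field with a default is considered a safe change.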

Avro vs Parquet
- Avro is ideal for ETL workloads
- Parquet is ideal for analytical queries
- Read performance is better in Parquet
- Write performance is better in Avro
- Avro supports full schema evolution
- Parquet only supports appending new columns
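
The read and write trade-off follows from the row versus column layout. The sketch below uses neither Avro nor Parquet; it assumes JSON-encoded data in two in-memory layouts and hypothetical helper names, and compares how many bytes an aggregate over one field has to scan.

```python
import json

records = [{"id": i, "name": f"user{i}", "amount": i * 1.5} for i in range(1000)]

# Row layout: one self-contained serialized record per entry.
row_layout = [json.dumps(r).encode() for r in records]

# Columnar layout: all values of one field are stored together.
column_layout = {
    field: json.dumps([r[field] for r in records]).encode()
    for field in records[0]
}


def sum_amount_rows():
    """Every record must be scanned in full to reach its `amount` field."""
    scanned = sum(len(line) for line in row_layout)
    total = sum(json.loads(line)["amount"] for line in row_layout)
    return total, scanned


def sum_amount_columns():
    """Only the bytes of the `amount` column are scanned."""
    blob = column_layout["amount"]
    return sum(json.loads(blob)), len(blob)


def append_row(record):
    """Writing one record is a single append in the row layout."""
    row_layout.append(json.dumps(record).encode())


row_total, row_bytes = sum_amount_rows()
col_total, col_bytes = sum_amount_columns()
```

Both layouts give the same total, but the columnar scan reads only a fraction of the bytes, while appending a record to the row layout touches a single place instead of every column.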

Parquet vs ORC
- Parquet is better for nested data
- ORC compresses more efficiently
