2. Evaluation Criteria
- The processing tools in the stack
  - e.g., Cloudera does not support ORC
- Whether the schema changes over time (schema evolution)
- Splittability
  - e.g., XML is not splittable
- Compression
  - Speeds up I/O operations
  - Saves storage
  - Increases processing time (decompression overhead!)
- The data size
- Processing and query performance
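The compression trade-off above can be measured directly. A minimal sketch using Python's standard `gzip` module (the sample payload is a made-up repetitive log line, chosen because such data compresses well):

```python
import gzip
import time

# Hypothetical sample payload: repetitive log lines compress well.
raw = b"2024-01-01 INFO request served in 12ms\n" * 10_000

t0 = time.perf_counter()
compressed = gzip.compress(raw)          # saves storage, speeds up I/O
t_compress = time.perf_counter() - t0

t0 = time.perf_counter()
restored = gzip.decompress(compressed)   # but costs extra CPU time to read
t_decompress = time.perf_counter() - t0

assert restored == raw
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"compress: {t_compress:.4f}s, decompress: {t_decompress:.4f}s")
```

Less data crosses the disk and network, but every read now pays a decompression cost, which is exactly the tension listed above.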
3. Common File Formats
All file formats fall into four categories:
- Standard: CSV, JSON, XML
- Data Structure: Sequence
- Serialization: Avro
- Columnar: Parquet, ORC
4. Summary of some file formats’ features
Data Format    | Type of Format | Splittable | Schema Changes | Compression | Metadata
JSON, XML      | Standard       | -          | +              | -           | +
CSV File       | Standard       | +          | -              | -           | -
JSON Records   | Standard       | +          | +              | -           | +
Sequence Files | Data Structure | +          | -              | +           | -
Avro Files     | Serialization  | +          | +              | +           | +
ORC Files      | Columnar       | +          | +              | +           | +
Parquet Files  | Columnar       | +          | +              | +           | +
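The splittability column is worth a concrete illustration. A minimal sketch (the data is invented): JSON Records store one object per line, so any worker can start at a newline boundary, whereas a single JSON document cannot be cut into independently parseable pieces.

```python
import json

# JSON Records (one object per line) — splittable: workers can take
# any subset of lines and parse complete records independently.
ndjson = "\n".join(json.dumps({"id": i, "v": i * i}) for i in range(6))

lines = ndjson.splitlines()
split_a, split_b = lines[:3], lines[3:]   # simulate two workers' splits
records = [json.loads(l) for l in split_a] + [json.loads(l) for l in split_b]
assert [r["id"] for r in records] == [0, 1, 2, 3, 4, 5]

# A single JSON document — NOT splittable: half of the bytes is a
# fragment that no worker can parse on its own.
doc = json.dumps([{"id": i} for i in range(6)])
half = doc[: len(doc) // 2]
try:
    json.loads(half)
    parsed = True
except json.JSONDecodeError:
    parsed = False
assert parsed is False
```

The same reasoning explains the `-` for XML: a tag opened in one split may close in another, so no split is self-contained.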
5. Sequence File
- A good solution to the small-files problem: many small files are packed into one file
- Stores data as <key, value> pairs
- Supports compression at two levels:
  - Record
  - Block
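The small-files idea can be sketched in a few lines. This is a conceptual illustration only, not the real Hadoop SequenceFile binary format: each small file becomes one length-prefixed <key, value> record (key = file name, value = file bytes) inside a single container.

```python
import io
import struct

# Conceptual sketch — NOT the actual Hadoop SequenceFile on-disk format.
def pack(files: dict) -> bytes:
    """Pack many small files into one blob of <key, value> records."""
    buf = io.BytesIO()
    for name, data in files.items():
        key = name.encode("utf-8")
        buf.write(struct.pack(">II", len(key), len(data)))  # length prefixes
        buf.write(key)
        buf.write(data)
    return buf.getvalue()

def unpack(blob: bytes) -> dict:
    """Recover the original <key, value> records from the container."""
    files, offset = {}, 0
    while offset < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, offset)
        offset += 8
        key = blob[offset : offset + klen].decode("utf-8")
        offset += klen
        files[key] = blob[offset : offset + vlen]
        offset += vlen
    return files

small_files = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
container = pack(small_files)
assert unpack(container) == small_files
```

One container file instead of thousands of tiny ones keeps NameNode metadata small, which is why this layout helps with small files.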
6. Parquet
- Optimized for Impala
- Used by Twitter
- Data Structure
  - Data is partitioned into row groups
  - Within a row group, each column is stored contiguously in pages
  - Pages can be compressed
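The row-group/page layout can be sketched with plain Python structures. This is an assumed, simplified illustration of the physical layout, not the real Parquet format: rows are grouped, each column of a group becomes a compressed page, and a single-column query only decompresses that column's pages.

```python
import ast
import zlib

# Invented sample data for illustration.
rows = [{"user": f"u{i}", "clicks": i % 7} for i in range(10)]
ROW_GROUP_SIZE = 5

# Build row groups; within each group, store every column as its own
# compressed "page" (one page per column here, for simplicity).
row_groups = []
for start in range(0, len(rows), ROW_GROUP_SIZE):
    group = rows[start : start + ROW_GROUP_SIZE]
    columns = {}
    for col in ("user", "clicks"):
        values = [r[col] for r in group]
        page = repr(values).encode("utf-8")
        columns[col] = zlib.compress(page)       # pages can be compressed
    row_groups.append(columns)

# A query touching only "clicks" decompresses just that column's pages.
clicks = []
for group in row_groups:
    page = zlib.decompress(group["clicks"]).decode("utf-8")
    clicks.extend(ast.literal_eval(page))        # toy decoding, not Parquet's
assert clicks == [i % 7 for i in range(10)]
```

Because similar values sit next to each other, column pages also compress better than interleaved rows would.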
8. ORC
- Optimized for Hive, Presto
- Data Structure
  - Data is divided into stripes
  - Indexes contain basic statistics (e.g., min/max) per stripe
  - The file footer contains a list of the stripes and their information
  - The postscript holds the compression parameters
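Those per-stripe statistics are what make ORC reads fast: a reader can skip a whole stripe whose min/max prove it cannot match a filter. A minimal sketch under assumed field names (this is not the ORC spec's actual structure):

```python
# Toy stripes with per-stripe min/max statistics (invented data).
stripes = [
    {"min": 1,  "max": 10, "rows": [1, 4, 7, 10]},
    {"min": 20, "max": 30, "rows": [20, 25, 30]},
    {"min": 35, "max": 50, "rows": [35, 42, 50]},
]

def read_where_greater_than(threshold):
    """Return matching values and how many stripes were actually scanned."""
    matched, stripes_scanned = [], 0
    for stripe in stripes:
        if stripe["max"] <= threshold:   # stats prove nothing can match
            continue                     # -> skip the whole stripe
        stripes_scanned += 1
        matched.extend(v for v in stripe["rows"] if v > threshold)
    return matched, stripes_scanned

result, scanned = read_where_greater_than(30)
assert result == [35, 42, 50]
assert scanned == 1                      # two of three stripes were skipped
```

This stripe-skipping (predicate pushdown) is a large part of why Hive and Presto query ORC efficiently.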
9. Avro
- Row-based storage
- Found in Apache Kafka
- Robust support for schema evolution
- Data Structure
  - A header carrying the schema (stored as JSON), followed by data blocks
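Avro's schema-evolution strength comes from schema resolution: a reader with a newer schema can still decode old records, filling missing fields from defaults. A minimal sketch of the idea in plain Python (illustrative only; the exact resolution rules live in the Avro specification, and the field names here are invented):

```python
# A record written with an old schema (v1) that had no "email" field.
writer_record = {"id": 1, "name": "ada"}

# The reader's newer schema (v2) as (field name, default) pairs;
# the added field carries a default, so old records stay readable.
reader_schema_v2 = [
    ("id", None),
    ("name", None),
    ("email", "unknown@example.org"),    # added later, with a default
]

def resolve(record, reader_fields):
    """Avro-style resolution sketch: take the writer's value if present,
    otherwise fall back to the reader schema's default."""
    out = {}
    for name, default in reader_fields:
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

evolved = resolve(writer_record, reader_schema_v2)
assert evolved == {"id": 1, "name": "ada", "email": "unknown@example.org"}
```

Because the writer's schema travels in the file header, the reader always knows exactly which fields are present and which need defaults.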
10. Avro vs Parquet
- Avro is ideal for ETL (write-heavy, row-at-a-time workloads)
- Parquet is ideal for analytical queries
- Read operations are faster in Parquet
- Write operations are faster in Avro
- Avro supports full schema evolution (adding, renaming, and deleting fields)
- Parquet only supports appending new columns
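The read-side difference above boils down to bytes scanned. A rough, invented-data illustration: to answer a single-column query, a row layout forces the reader through every full record, while a columnar layout touches only the one column.

```python
import json

# Invented records with a wide "user" field and a narrow "clicks" field.
rows = [{"user": "u" * 20, "clicks": i} for i in range(100)]

# Row layout (Avro-like): every full record is scanned to reach "clicks".
row_bytes = sum(len(json.dumps(r)) for r in rows)

# Columnar layout (Parquet-like): only the "clicks" column is read.
clicks_column = json.dumps([r["clicks"] for r in rows])
col_bytes = len(clicks_column)

assert col_bytes < row_bytes             # far fewer bytes for the same answer
assert sum(json.loads(clicks_column)) == sum(r["clicks"] for r in rows)
```

The write side mirrors this: appending one record is a single contiguous write in a row layout, but touches every column's storage in a columnar one.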
11. Parquet vs ORC
- Parquet handles nested data better
- ORC is more compression-efficient