Apache Parquet
https://parquet.apache.org/
Why Parquet
Columnar storage format
Can store nested Data
Efficiency in file size and query performance
Nested field can be read independently of other fields
Many data processing understand avro format (Hive, Spark, Pig and
MapReduce,etc)
Data Model
boolean int32 int64 int96 float double binary fixed_len
_byte_ar
ray
UTF8 ENUM DECIMAL(pr
ecision,scale
DATE LIST MAP
Record parquet
message Nom {
(required,optional,repeated) type nom ;
required int32 age ;
required binary nom (UTF8)
}
{"name": "right", "type":
"string", "order":
"descending"}
Avro
Parquet
Parquet File Format
Header Block .. Block Footer
Magic number
Using parquet
Magic number
,Schema,Encoding
method,Block position,..
Columns
Pages
128MB
Encoding & Compression
Run-length : True , True ,True ,False (3 true , 1 false)
Dictionary encoding ( indexation )
Plain encoding
Compression algorithms Snappy,gzip,...
Configuration
parquet.block.size int
parquet.page.size int
parquet.dictionary.page.size int
parquet.enable.dictionary boolean true
parquet.compression String (SNAPPY,UNCOMPRESSED,LZO...)
Writing and reading Parquet files
Parquet MapReduce
Parquet MapReduce
Parquet MapReduce

Apache Parquet