Apache Parquet

https://parquet.apache.org/

Why Parquet
Columnar storage format
Can store nested data
Efficient in both file size and query performance
Nested fields can be read independently of other fields
Many data processing frameworks understand the Parquet format (Hive, Spark, Pig, MapReduce, etc.)
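
To make the columnar idea concrete, here is a minimal sketch in plain Python. It only illustrates the layout and is not how Parquet is implemented; the records and the helper name `to_columns` are hypothetical, with field names borrowed from the schema example in this deck.

```python
# Three records, as a row-oriented format would store them (illustrative data).
rows = [
    {"nom": "Ada", "age": 36},
    {"nom": "Linus", "age": 54},
    {"nom": "Grace", "age": 85},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# A query that only needs ages reads one list and never touches "nom".
ages = columns["age"]
print(ages)
```

Values of the same type sit next to each other in each column, which is what lets the encodings and compression covered later in the deck work well.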

Data Model
boolean int32 int64 int96 float double binary fixed_len_byte_array
UTF8 ENUM DECIMAL(precision,scale) DATE LIST MAP

Parquet record
Field syntax: (required | optional | repeated) type name ;
message Nom {
  required int32 age ;
  required binary nom (UTF8) ;
}
Avro field declaration, for comparison:
{"name": "right", "type": "string", "order": "descending"}

Parquet File Format
Header | Block | .. | Block | Footer
Header: magic number (PAR1), identifying the file as Parquet
Footer: magic number, schema, encoding method, block positions, ..
Block (128MB): columns, each divided into pages
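
As a rough illustration of the layout above, the sketch below checks the two magic numbers and reads the footer length. It relies on the Parquet convention that a file starts and ends with the 4 bytes `PAR1`, and that the 4 bytes before the trailing magic number hold the footer metadata length as a little-endian integer. The function name `footer_length` and the stand-in bytes are hypothetical; no real Parquet file is parsed here.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of a Parquet file


def footer_length(data: bytes) -> int:
    """Validate both magic numbers and return the footer metadata length."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    # The 4 bytes before the trailing magic hold the metadata length (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    return length


# Hypothetical stand-in bytes, not a real file: header, fake blocks, fake footer.
fake_metadata = b"schema+block-positions"
fake_file = (
    MAGIC
    + b"block-1block-2"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)
print(footer_length(fake_file))
```

A reader would typically seek to the end of the file first, use this length to locate the footer, and only then jump to the block positions it needs.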

Encoding & Compression
Run-length encoding: True, True, True, False is stored as (3 true, 1 false)
Dictionary encoding (values replaced by indexes into a dictionary)
Plain encoding
Compression algorithms: Snappy, gzip, ...
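
The first two encodings can be sketched in a few lines of Python. This is a simplified illustration of the ideas only: Parquet's actual on-disk encodings are more elaborate (for example, run-length encoding is combined with bit-packing). The function names and the sample city values are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


# The slide's example: three True values followed by one False.
runs = run_length_encode([True, True, True, False])

# Repeated strings become small integers pointing into the dictionary.
dictionary, indices = dictionary_encode(["Paris", "Lyon", "Paris", "Paris"])
print(runs, dictionary, indices)
```

Both encodings pay off on columns with few distinct values or long runs, which is one reason they pair naturally with columnar storage.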

Configuration
parquet.block.size             int
parquet.page.size              int
parquet.dictionary.page.size   int
parquet.enable.dictionary      boolean (default: true)
parquet.compression            String (SNAPPY, UNCOMPRESSED, LZO, ...)
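
The sketch below shows one possible set of values for these properties, with sizes expressed in bytes. The 128 MiB block size matches the figure on the file format slide; the two page sizes and the SNAPPY choice are illustrative assumptions, not recommendations. The helper `as_property_lines` is hypothetical, and how the properties reach a job (for example through a Hadoop configuration object) is left out.

```python
MIB = 1024 * 1024

# Sizes are in bytes. Page sizes and codec are illustrative assumptions.
parquet_properties = {
    "parquet.block.size": 128 * MIB,
    "parquet.page.size": 1 * MIB,
    "parquet.dictionary.page.size": 1 * MIB,
    "parquet.enable.dictionary": True,
    "parquet.compression": "SNAPPY",
}


def as_property_lines(properties):
    """Render properties as key=value lines, with lowercase booleans."""
    lines = []
    for key, value in properties.items():
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f"{key}={value}")
    return lines


for line in as_property_lines(parquet_properties):
    print(line)
```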

Writing and reading Parquet files
