ByteDance's Native Parquet Reader
Shengxuan Liu @ByteDance
About Parquet
Native Parquet Reader Architecture
Key Features
What is Parquet?
● Definition
○ Columnar storage file format for big data.
○ Developed for efficient storage and processing of large datasets.
● Purpose
○ Organizes data by columns instead of rows.
Why Parquet?
● Columnar Storage
○ Facilitates efficient compression.
○ Improves performance for analytical queries on specific columns.
○ Improves CPU cache hit rates while processing the data.
● Compression
○ Supports various algorithms (Snappy, Gzip, LZO).
○ Achieves high compression ratios.
○ Reduces remote I/O and lowers storage costs.
Why Parquet?
● Primitive Types
○ BOOLEAN: Represents a boolean value (true/false).
○ INT32/INT64: Represents signed 32-bit or 64-bit integers.
○ FLOAT: Represents a 32-bit floating-point number.
○ DOUBLE: Represents a 64-bit floating-point number.
○ BINARY: Represents variable-length binary data (e.g., strings, byte arrays).
○ FIXED_LEN_BYTE_ARRAY: Represents fixed-length binary data with a specified length.
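As an illustration, these physical types could be mirrored in Rust roughly as follows; this is only a sketch, and the reader's actual internal representation is not public:

#[derive(Debug)]
#[allow(dead_code)] // sketch: not every variant is exercised
enum PhysicalValue {
    Boolean(bool),              // BOOLEAN
    Int32(i32),                 // INT32
    Int64(i64),                 // INT64
    Float(f32),                 // FLOAT
    Double(f64),                // DOUBLE
    Binary(Vec<u8>),            // BINARY: e.g., strings, byte arrays
    FixedLenByteArray(Vec<u8>), // FIXED_LEN_BYTE_ARRAY: length set by the schema
}

fn main() {
    let v = PhysicalValue::Binary(b"hello".to_vec());
    println!("{v:?}");
}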
Why Parquet?
● Nested Types
○ LIST: Represents an ordered collection of elements.
○ MAP: Represents a collection of key-value pairs.
○ STRUCT: Represents a complex structure with named fields.
Parquet with Cache
● Alluxio Local Cache for PrestoDB
○ Breaks Parquet files into smaller pieces and stores them locally
○ Improves reading efficiency and reduces remote I/O cost
Rust Native Parquet Reader
● What is a Native Parquet Reader?
○ A tool to read and process data stored in the Parquet file format.
○ The reader is written in a native language and compiles directly to machine code.
● Why Rust?
○ Good performance and memory safety
○ Able to integrate with other languages via FFI, as sketched below
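Because engines such as PrestoDB run on the JVM, a C ABI is the usual integration point. A minimal sketch of what such an FFI surface could look like; reader_open/reader_close are hypothetical names, not the actual ByteDance API:

use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical opaque handle the host language holds as a raw pointer.
pub struct NativeReader {
    #[allow(dead_code)]
    path: String,
}

/// Open a reader for the given file path (e.g., called via JNI).
#[no_mangle]
pub extern "C" fn reader_open(path: *const c_char) -> *mut NativeReader {
    let path = unsafe { CStr::from_ptr(path) }.to_string_lossy().into_owned();
    Box::into_raw(Box::new(NativeReader { path }))
}

/// Release a reader previously returned by reader_open.
#[no_mangle]
pub extern "C" fn reader_close(reader: *mut NativeReader) {
    if !reader.is_null() {
        unsafe { drop(Box::from_raw(reader)) };
    }
}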
Native Parquet Reader
● High-Level Components
○ Metadata Parser
○ Row Group Level Reader
○ Column Level Reader
○ Page Level Reader
○ Decompression
○ Data Decoder
○ Data Materialization System
○ Filter System
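A hedged sketch of how these components might layer; all names here are illustrative, since the talk does not show the internal APIs:

#![allow(dead_code)] // sketch only: nothing is exercised

/// Parsed from the Thrift footer: schema, types, column statistics.
struct FileMetadata;
/// Decompresses and decodes one page at a time.
struct PageReader;
/// All pages of one column within a row group.
struct ColumnReader { pages: Vec<PageReader> }
/// A horizontal slice of the file; the unit of parallelism.
struct RowGroupReader { columns: Vec<ColumnReader> }
/// Top level: file metadata plus one reader per row group.
struct ParquetReader {
    metadata: FileMetadata,
    row_groups: Vec<RowGroupReader>,
}

fn main() {}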
Native Parquet Reader
● Metadata Parser -- Thrift
○ File Metadata
■ Extracts schema information, data types, and column statistics.
■ Determines the structure of the Parquet file.
○ Page Metadata
■ Parses page-level metadata.
■ Extracts page statistics, page type, compression codec, and encoding types...
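ByteDance's reader itself is not open source; as a stand-in, the Apache Arrow parquet crate decodes the same Thrift footer, so the parsed metadata can be inspected like this (data.parquet is a placeholder path):

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // File-level metadata: schema, row count, row group layout.
    let meta = reader.metadata();
    let fmeta = meta.file_metadata();
    println!("rows: {}", fmeta.num_rows());
    println!("row groups: {}", meta.num_row_groups());
    println!("columns: {}", fmeta.schema_descr().num_columns());
    Ok(())
}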
Native Parquet Reader
● Row Group Level Reader
○ A Row Group contains a chunk of every column
○ The smallest parallelism unit for most Parquet Readers
○ Arranges and schedules column-level reading
Native Parquet Reader
● Column Level Reader
○ Each column chunk holds all of that column's pages within the row group
○ Arranges and schedules page-level reading
Native Parquet Reader
● Page Level Reader
○ The Page Level Reader handles the different page types
■ Dictionary Page: stores the unique values of the column
■ Data Pages: contain the actual values of the column
● Store indexes into the Dictionary Page; or
● Store the plain values directly
■ ......
○ Page Level Reader reads actual values out:
■ Decompression -> Decoding -> Materialization
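The same row group -> column -> page hierarchy can be walked with the open-source parquet crate, shown here as an analogue of the reader levels above (not the internal API):

use std::fs::File;
use parquet::column::page::{Page, PageReader};
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;
    for rg in 0..reader.num_row_groups() {
        let row_group = reader.get_row_group(rg)?;  // row group level
        for col in 0..row_group.num_columns() {     // column level
            let mut pages = row_group.get_column_page_reader(col)?;
            while let Some(page) = pages.get_next_page()? {  // page level
                match page {
                    Page::DictionaryPage { num_values, .. } =>
                        println!("dictionary page: {num_values} unique values"),
                    Page::DataPage { num_values, .. } =>
                        println!("data page v1: {num_values} values"),
                    Page::DataPageV2 { num_values, .. } =>
                        println!("data page v2: {num_values} values"),
                }
            }
        }
    }
    Ok(())
}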
Native Parquet Reader
● Decompression
○ GZIP
○ ZSTD
○ SNAPPY
○ ......
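A hedged sketch of per-page codec dispatch, using common crates (flate2, zstd, snap) purely as illustrative choices; the talk does not say which libraries the reader uses:

use std::io::Read;

#[allow(dead_code)] // only Snappy is exercised in main
enum Codec { Gzip, Zstd, Snappy }

/// Decompress one page buffer according to the codec recorded
/// in its page header.
fn decompress(codec: Codec, compressed: &[u8]) -> std::io::Result<Vec<u8>> {
    match codec {
        Codec::Gzip => {
            let mut out = Vec::new();
            flate2::read::GzDecoder::new(compressed).read_to_end(&mut out)?;
            Ok(out)
        }
        Codec::Zstd => zstd::decode_all(compressed),
        Codec::Snappy => snap::raw::Decoder::new()
            .decompress_vec(compressed)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e)),
    }
}

fn main() -> std::io::Result<()> {
    // Round-trip a buffer through Snappy to exercise the dispatch.
    let raw = b"page payload page payload".to_vec();
    let compressed = snap::raw::Encoder::new()
        .compress_vec(&raw)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
    assert_eq!(decompress(Codec::Snappy, &compressed)?, raw);
    Ok(())
}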
Native Parquet Reader
● Data Decoder
○ Dictionary Decoder
■ Maintains the dictionary loaded from the dictionary page
■ Data page reading resolves indexes back into the actual values
○ RLE/BP Decoder
■ A hybrid of run-length encoding and bit-packing
○ Plain Decoder
■ Actual values are stored directly
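A toy illustration of the dictionary path, assuming the RLE/BP index stream has already been decoded into plain indexes:

/// Resolve data page indexes against the dictionary page's values.
fn dictionary_decode<T: Clone>(dictionary: &[T], indexes: &[u32]) -> Vec<T> {
    indexes.iter().map(|&i| dictionary[i as usize].clone()).collect()
}

fn main() {
    // Dictionary page: the unique values; data page: indexes into it.
    let dict = vec!["us", "cn", "sg"];
    let idx = [0u32, 0, 2, 1, 1, 0];
    assert_eq!(dictionary_decode(&dict, &idx),
               ["us", "us", "sg", "cn", "cn", "us"]);
}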
Native Parquet Reader
● Data Materialization System
○ Each computing engine has its own memory layout
○ Materialize the data from data pages into the designated memory format
Native Parquet Reader
● Filter System
○ Most queries come with filters
○ Some filters can be processed at the Parquet reading stage
■ select a, b from table where a > 1 -> :-)
■ select a, b from table where a+b > 1 -> :-(
○ The filter system enables the Parquet Reader to read less data where possible, as sketched below
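A simplified sketch of why a > 1 is pushable while a + b > 1 is not: a single-column predicate can pick the surviving row indexes before any other column is materialized. All names here are illustrative:

/// Evaluate a single-column predicate and return the surviving row indexes.
fn selected_rows(a: &[i64], pred: impl Fn(i64) -> bool) -> Vec<usize> {
    a.iter().enumerate().filter(|&(_, &v)| pred(v)).map(|(i, _)| i).collect()
}

fn main() {
    let a = [0i64, 5, 1, 9];
    let b = ["w", "x", "y", "z"];
    // select a, b from table where a > 1
    let rows = selected_rows(&a, |v| v > 1);
    let out: Vec<_> = rows.iter().map(|&i| (a[i], b[i])).collect();
    assert_eq!(out, [(5, "x"), (9, "z")]); // only rows 1 and 3 are materialized
}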
Key Features
● Batch Reading with Limit
● Filter Push-down
● Filter Ordering and Re-ordering
● Flexible Data Materialization
Batch Reading with Limit
● What is Batch Reading with Limit?
○ Materialize a specific number of rows of data
○ For example, read(1000) returns up to 1000 rows of data
● Why Batch Reading with Limit?
○ The materialized data might be 10x larger than Parquet formatted data
○ Reduces memory usage by producing the data in smaller batches for consumption.
○ Better fit for different systems: most engines consume much smaller pieces of data than a full Row Group.
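The open-source parquet crate exposes the same idea through the Arrow reader's batch size, shown here as an analogue of read(1000) rather than the ByteDance API:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    // Each call to next() materializes at most 1000 rows.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(1000)
        .build()?;
    for batch in reader {
        println!("materialized {} rows", batch?.num_rows());
    }
    Ok(())
}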
Filter Push-down
● What is Filter Push-down?
○ Try to filter out unnecessary data before final materialization
● Why Filter Push-down?
○ Saves unnecessary CPU cycles, such as materialization work.
○ Filters are very common in user queries, so push-down pays off broadly in the real world.
○ The deeper we push the filters, the more efficient the reader becomes.
Filter Ordering & Re-ordering
● What is Filter Ordering and Re-ordering?
○ Filter ordering: the execution sequence of the columns with filters
○ Filter re-ordering: rearranging the sequence of filters on different columns according to real-time filter performance.
● Why Filter Ordering and Re-ordering?
○ In real world scenarios, multiple filters will be applied at the same time.
○ We want to start with the filter that eliminates the most data, to reduce the downstream computation burden.
○ An adaptive approach finds the best filter execution sequence, as sketched below.
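A minimal sketch of the adaptive idea, with hypothetical bookkeeping: record each filter's observed pass rate and keep the most selective filter first:

/// Running statistics for one filter: rows evaluated vs. rows kept.
struct FilterStats { seen: u64, kept: u64 }

impl FilterStats {
    fn pass_rate(&self) -> f64 {
        if self.seen == 0 { 1.0 } else { self.kept as f64 / self.seen as f64 }
    }
}

/// Re-order so the filter that eliminates the most rows runs first,
/// shrinking the input for every filter after it.
fn reorder(filters: &mut [(&str, FilterStats)]) {
    filters.sort_by(|a, b| a.1.pass_rate().total_cmp(&b.1.pass_rate()));
}

fn main() {
    let mut filters = [
        ("a > 1",    FilterStats { seen: 1000, kept: 900 }),
        ("b = 'cn'", FilterStats { seen: 1000, kept: 50 }),
    ];
    reorder(&mut filters);
    assert_eq!(filters[0].0, "b = 'cn'"); // most selective filter now runs first
}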
Flexible Data Materialization
● What is Data Materialization?
○ The Parquet Reader parses the Parquet file and converts it into datasets in a given memory layout.
● Why does Data Materialization need to be Flexible?
○ Parquet files are widely used in different scenarios: OLAP, Machine Learning ...
● How can Data Materialization be Flexible?
○ The native Parquet Reader provides bridges to customize the output data format, as sketched below.
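One plausible shape for such a bridge, with hypothetical trait and method names: the engine implements a sink that receives decoded values and writes them into its own memory layout:

/// Hypothetical bridge trait: the engine decides the memory layout.
trait Materializer {
    fn consume_i64(&mut self, column: usize, values: &[i64]);
    fn finish_batch(&mut self);
}

/// Example sink: a trivial row counter standing in for a real engine.
struct Counter { rows: usize }

impl Materializer for Counter {
    fn consume_i64(&mut self, _column: usize, values: &[i64]) {
        self.rows += values.len();
    }
    fn finish_batch(&mut self) { /* a real engine would flush here */ }
}

fn main() {
    let mut sink = Counter { rows: 0 };
    sink.consume_i64(0, &[1, 2, 3]);
    sink.finish_batch();
    assert_eq!(sink.rows, 3);
}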
Thanks!!
