Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shengxuan Liu (Software Engineer, @ByteDance)
Shengxuan Liu from ByteDance presents ByteDance's new native Parquet Reader. The talk covers the architecture and key features of the Reader, and how it improves data processing efficiency.
3. What is Parquet?
● Definition
○ Columnar storage file format for big data.
○ Developed for efficient storage and processing of large datasets.
● Purpose
○ Organizes data by columns instead of rows.
4. Why Parquet?
● Columnar Storage
○ Facilitates efficient compression.
○ Improves performance for analytical queries on specific columns.
○ Improves CPU cache hit rates while processing the data.
● Compression
○ Supports various algorithms (Snappy, Gzip, LZO).
○ Achieves high compression ratios.
○ Reduces remote I/O and lowers storage costs (codec selection sketched below).
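A minimal sketch of choosing a codec at write time, using the open-source Apache `parquet` crate for Rust (an assumption for illustration; the reader discussed in the talk is ByteDance-internal):

```rust
// Sketch: selecting a compression codec with the Apache `parquet` crate.
// Snappy favors speed; Gzip trades speed for a higher compression ratio.
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY) // applies to all columns
        .build();
    // `props` would be handed to a file writer. Readers detect the codec
    // per page from page metadata, so no reader-side flag is needed.
    let _ = props;
}
```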
5. Why Parquet?
● Primitive Types
○ BOOLEAN: Represents a boolean value (true/false).
○ INT32/INT64: Represents signed 32-bit or 64-bit integers.
○ FLOAT: Represents a 32-bit floating-point number.
○ DOUBLE: Represents a 64-bit floating-point number.
○ BINARY: Represents variable-length binary data (e.g., strings, byte arrays).
○ FIXED_LEN_BYTE_ARRAY: Represents fixed-length binary data with a specified length (see the schema sketch below).
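For illustration, here is a schema exercising these primitive types, written in Parquet's message-type text format and parsed with the open-source Apache `parquet` crate (an assumption; the field names are made up):

```rust
use parquet::schema::parser::parse_message_type;

fn main() {
    // Each field below maps to one of the primitive types listed above.
    let message = "
        message example {
            required boolean is_active;
            required int32 id;
            required int64 ts;
            optional float score;
            optional double amount;
            optional binary name (UTF8);
            optional fixed_len_byte_array(16) uuid;
        }
    ";
    let schema = parse_message_type(message).expect("valid Parquet schema");
    println!("{:#?}", schema);
}
```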
6. Why Parquet?
● Nested Types
○ LIST: Represents an ordered collection of elements.
○ MAP: Represents a collection of key-value pairs.
○ STRUCT: Represents a complex structure with named fields (schema sketch below).
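The same text format expresses the nested types: LIST and MAP are spec-defined group annotations, and a STRUCT is a plain group. A sketch, again via the Apache `parquet` crate's parser (illustrative field names):

```rust
use parquet::schema::parser::parse_message_type;

fn main() {
    // `tags` is a LIST, `attrs` is a MAP, and `address` is a STRUCT
    // (a plain group with named fields).
    let message = "
        message example {
            optional group tags (LIST) {
                repeated group list {
                    optional binary element (UTF8);
                }
            }
            optional group attrs (MAP) {
                repeated group key_value {
                    required binary key (UTF8);
                    optional int64 value;
                }
            }
            optional group address {
                optional binary city (UTF8);
                optional int32 zip;
            }
        }
    ";
    let schema = parse_message_type(message).expect("valid nested schema");
    println!("{:#?}", schema);
}
```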
7. Parquet with Cache
● Alluxio Local Cache for PrestoDB
○ Breaks Parquet files into smaller pieces and stores them locally
○ Improves reading efficiency and reduces remote I/O cost
8. Rust Native Parquet Reader
● What is a Native Parquet Reader?
○ A tool to read and process data stored in the Parquet file format.
○ The Reader is written in a native language and compiled directly to machine code.
● Why Rust?
○ Good performance and memory safety
○ Able to integrate with other languages via FFI (see the sketch below)
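As a sketch of the FFI point: a Rust function exported with a C ABI can be bound from C++, Java (JNI), Python (ctypes), and others. The function below is hypothetical, not the actual ByteDance API:

```rust
// Hypothetical FFI surface for a native reader. `#[no_mangle]` plus
// `extern "C"` gives the symbol a stable C ABI for other languages.
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn parquet_reader_open(path: *const c_char) -> i32 {
    // A real implementation would validate `path`, open the file, and
    // return an opaque handle; this sketch only guards against NULL.
    if path.is_null() {
        -1
    } else {
        0
    }
}
```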
9. Native Parquet Reader
● High-Level Components
○ Metadata Parser
○ Row Group Level Reader
○ Column Level Reader
○ Page Level Reader
○ Decompression
○ Data Decoder
○ Data Materialization System
○ Filter System
10. Native Parquet Reader
● Metadata Parser -- Thrift
○ File Metadata
■ Extracts schema information, data types, and column statistics.
■ Determines the structure of the Parquet file.
○ Page Metadata
■ Parses page-level metadata
■ Extracts the page statistics, page type, compression and encoding types, etc. (footer parsing sketched below)
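A sketch of footer parsing with the open-source Apache `parquet` crate (an assumption; the internal reader's API is not public). The Thrift-encoded footer yields schema and statistics without touching any data pages:

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder path.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let file_meta = reader.metadata().file_metadata();
    println!("total rows: {}", file_meta.num_rows());
    println!("schema: {:#?}", file_meta.schema_descr().root_schema());
    Ok(())
}
```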
11. Native Parquet Reader
● Row Group Level Reader
○ A row group contains a chunk of every column
○ The smallest parallelism unit for most Parquet Readers
○ Arranges and schedules column-level reading (see the sketch below)
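A sketch of iterating row groups with the Apache `parquet` crate (illustrative; a parallel reader would hand each index to a separate worker):

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let reader = SerializedFileReader::new(file)?;
    for i in 0..reader.metadata().num_row_groups() {
        // Each row group holds a chunk of every column and can be
        // scheduled independently.
        let rg = reader.get_row_group(i)?;
        println!("row group {}: {} rows", i, rg.metadata().num_rows());
    }
    Ok(())
}
```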
12. Native Parquet Reader
● Column Level Reader
○ Each column chunk contains all of its pages within the row group
○ Arranges and schedules page-level reading
13. Native Parquet Reader
● Page Level Reader
○ Page Level Reader reads different types of Data Page
■ Dictionary Page: stores the unique values of the column
■ Data Pages: contain the actual values of the column
● Store indexes according to the Dictionary Page; or
● Store the Plain values
■ ......
○ Page Level Reader reads the actual values out:
■ Decompression -> Decoding -> Materialization (page iteration sketched below)
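A sketch of walking the pages of one column chunk with the Apache `parquet` crate (an assumption; the internal reader differs), distinguishing dictionary pages from data pages:

```rust
use parquet::column::page::{Page, PageReader};
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let reader = SerializedFileReader::new(file)?;
    let rg = reader.get_row_group(0)?;
    // Page reader for the first column of the first row group.
    let mut pages = rg.get_column_page_reader(0)?;
    while let Some(page) = pages.get_next_page()? {
        match page {
            Page::DictionaryPage { num_values, .. } => {
                println!("dictionary page: {} unique values", num_values)
            }
            Page::DataPage { num_values, encoding, .. } => {
                println!("data page: {} values, {:?}", num_values, encoding)
            }
            Page::DataPageV2 { num_values, encoding, .. } => {
                println!("data page v2: {} values, {:?}", num_values, encoding)
            }
        }
    }
    Ok(())
}
```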
15. Native Parquet Reader
● Data Decoder
○ Dictionary Decoder
■ Maintains the dictionary from the dictionary page
■ Data page reading looks up the actual values from indexes (sketched below)
○ RLE/BP Decoder
■ A combination of bit-packing and run-length encoding
○ Plain Decoder
■ Actual values are stored directly
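A minimal sketch of the dictionary-decoding step (hypothetical, not the ByteDance implementation): data pages store small indexes, and the decoder maps them back through the dictionary captured from the dictionary page:

```rust
// Indexes (typically RLE/bit-packed on disk) are resolved against the
// dictionary to materialize the actual values.
fn decode_dictionary<T: Clone>(dictionary: &[T], indexes: &[u32]) -> Vec<T> {
    indexes.iter().map(|&i| dictionary[i as usize].clone()).collect()
}

fn main() {
    let dictionary = vec!["us", "eu", "apac"]; // from the dictionary page
    let indexes = [0u32, 0, 2, 1, 0];          // from a data page
    assert_eq!(
        decode_dictionary(&dictionary, &indexes),
        vec!["us", "us", "apac", "eu", "us"]
    );
}
```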
16. Native Parquet Reader
● Data Materialization System
○ Each computing engine has its own memory layout
○ Materialize the data from data pages into the designated memory format
17. Native Parquet Reader
● Filter System
○ Most queries come with filters
○ Some filters can be processed at the Parquet reading stage
■ select a, b from table where a > 1 -> :-)
■ select a, b from table where a+b > 1 -> :-(
○ The filter system enables the Parquet Reader to read less data when possible (statistics-based skipping sketched below)
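A hypothetical sketch of why `a > 1` is reader-friendly: a single-column predicate can be tested against min/max statistics before decoding anything, whereas `a+b > 1` spans two columns and must wait for materialization:

```rust
// Illustrative min/max statistics for one column chunk or page.
struct ColumnStats {
    min: i64,
    max: i64,
}

// `value > threshold` can only match if the chunk's max exceeds it.
fn may_contain_gt(stats: &ColumnStats, threshold: i64) -> bool {
    stats.max > threshold
}

fn main() {
    let a_stats = ColumnStats { min: -5, max: 0 };
    // max(a) = 0 <= 1, so `a > 1` matches nothing: skip the chunk
    // without decompressing or decoding it.
    assert!(!may_contain_gt(&a_stats, 1));
}
```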
18. Key Features
● Batch Reading with Limit
● Filter Push-down
● Filter Ordering and Re-ordering
● Flexible Data Materialization
19. Batch Reading with Limit
● What is Batch Reading with Limit?
○ Materialize a specific number of rows of data
○ For example, read(1000) returns up to 1000 rows of data
● Why Batch Reading with Limit?
○ The materialized data might be 10x larger than the Parquet-formatted data
○ Reduces memory usage by producing the data in smaller batches for consumption
○ Better fit for different systems: most engines consume much smaller pieces of data than a row group (see the sketch below)
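A hypothetical sketch of the batch-reading interface (`read` and `Batch` are illustrative names, not the actual API): callers pull rows in fixed-size batches instead of materializing a whole row group at once:

```rust
struct Batch {
    rows: usize, // a real batch would carry column data
}

struct BatchReader {
    remaining_rows: usize,
}

impl BatchReader {
    /// Materialize up to `limit` rows; None once the source is exhausted.
    fn read(&mut self, limit: usize) -> Option<Batch> {
        if self.remaining_rows == 0 {
            return None;
        }
        let n = limit.min(self.remaining_rows);
        self.remaining_rows -= n;
        Some(Batch { rows: n })
    }
}

fn main() {
    let mut reader = BatchReader { remaining_rows: 2500 };
    while let Some(batch) = reader.read(1000) {
        println!("got {} rows", batch.rows); // prints 1000, 1000, 500
    }
}
```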
20. Filter Push-down
● What is Filter Push-down?
○ Try to filter out unnecessary data before final materialization
● Why Filter Push-down?
○ Saves unnecessary CPU cycles, such as materialization
○ Filters are very common in user queries, so push-down is widely applicable in the real world
○ The deeper we push the filters, the more efficient the reader becomes
21. Filter Ordering & Re-ordering
● What is Filter Ordering and Re-ordering?
○ Filter ordering: the execution sequence of the columns with filters
○ Filter re-ordering: rearranging the sequence of filters on different columns according to real-time filter performance
● Why Filter Ordering and Re-ordering?
○ In real-world scenarios, multiple filters are applied at the same time
○ We want to start with the filter that eliminates the most data, to reduce the downstream computation burden
○ An adaptive approach finds the best filter execution sequence (sketched below)
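A hypothetical sketch of adaptive re-ordering: track each filter's observed selectivity (the fraction of rows it keeps) and run the most selective filter first, so later filters see fewer rows:

```rust
struct FilterStats {
    name: &'static str,
    rows_in: u64,
    rows_out: u64,
}

impl FilterStats {
    // Fraction of rows that survive this filter.
    fn selectivity(&self) -> f64 {
        self.rows_out as f64 / self.rows_in.max(1) as f64
    }
}

// Lowest selectivity (eliminates the most rows) goes first.
fn reorder(filters: &mut [FilterStats]) {
    filters.sort_by(|a, b| a.selectivity().total_cmp(&b.selectivity()));
}

fn main() {
    let mut filters = [
        FilterStats { name: "b < 100", rows_in: 10_000, rows_out: 9_000 },
        FilterStats { name: "a > 1", rows_in: 10_000, rows_out: 500 },
    ];
    reorder(&mut filters);
    assert_eq!(filters[0].name, "a > 1"); // run the stronger filter first
}
```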
22. Flexible Data Materialization
● What is Data Materialization?
○ The Parquet Reader parses the Parquet file and converts it into datasets in different memory layouts.
● Why does Data Materialization need to be Flexible?
○ Parquet files are widely used in different scenarios: OLAP, machine learning...
● How can Data Materialization be Flexible?
○ The native Parquet Reader provides bridges to customize the output data format (see the sketch below).
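A hypothetical sketch of such a bridge (trait and type names are illustrative): the reader decodes values once and hands them to an engine-specific sink that owns the memory layout:

```rust
// The decoding side sees only this trait, so each engine (OLAP, ML, ...)
// plugs in its own layout without changing the reader.
trait ColumnSink {
    fn append_i64(&mut self, values: &[i64]);
}

// A sink that builds a plain contiguous vector.
struct VecSink {
    out: Vec<i64>,
}

impl ColumnSink for VecSink {
    fn append_i64(&mut self, values: &[i64]) {
        self.out.extend_from_slice(values);
    }
}

fn materialize(decoded: &[i64], sink: &mut dyn ColumnSink) {
    sink.append_i64(decoded);
}

fn main() {
    let mut sink = VecSink { out: Vec::new() };
    materialize(&[1, 2, 3], &mut sink);
    assert_eq!(sink.out, vec![1, 2, 3]);
}
```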