Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shengxuan Liu (Software Engineer, @ByteDance)
Shengxuan Liu from ByteDance presents ByteDance's new native Parquet Reader. The talk covers the architecture and key features of the Reader, and how it improves data processing efficiency.
3. What is Parquet?
● Definition
○ Columnar storage file format for big data.
○ Developed for efficient storage and processing of large datasets.
● Purpose
○ Organizes data by columns instead of rows.
4. Why Parquet?
● Columnar Storage
○ Facilitates efficient compression.
○ Improves performance for analytical queries on specific columns.
○ Improves CPU cache hit rates while processing the data.
● Compression
○ Supports various algorithms (Snappy, Gzip, LZO).
○ Achieves high compression ratios.
○ Reduces remote I/O and lowers storage costs (codec selection sketched below).
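A minimal sketch of choosing a codec at write time, using the open-source Apache `parquet` crate for Rust (an assumption for illustration; the reader discussed in the talk is ByteDance-internal):

```rust
// Sketch: selecting a compression codec with the Apache `parquet` crate.
// Snappy favors speed; Gzip trades speed for a higher compression ratio.
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY) // applies to all columns
        .build();
    // `props` would be handed to a file writer. Readers detect the codec
    // per page from page metadata, so no reader-side flag is needed.
    let _ = props;
}
```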
5. Why Parquet?
● Primitive Types
○ BOOLEAN: Represents a boolean value (true/false).
○ INT32/INT64: Represents signed 32-bit or 64-bit integers.
○ FLOAT: Represents a 32-bit floating-point number.
○ DOUBLE: Represents a 64-bit floating-point number.
○ BINARY: Represents variable-length binary data (e.g., strings, byte arrays).
○ FIXED_LEN_BYTE_ARRAY: Represents fixed-length binary data with a specified length (see the schema sketch below).
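For illustration, here is a schema exercising these primitive types, written in Parquet's message-type text format and parsed with the open-source Apache `parquet` crate (an assumption; the field names are made up):

```rust
use parquet::schema::parser::parse_message_type;

fn main() {
    // Each field below maps to one of the primitive types listed above.
    let message = "
        message example {
            required boolean is_active;
            required int32 id;
            required int64 ts;
            optional float score;
            optional double amount;
            optional binary name (UTF8);
            optional fixed_len_byte_array(16) uuid;
        }
    ";
    let schema = parse_message_type(message).expect("valid Parquet schema");
    println!("{:#?}", schema);
}
```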
6. Why Parquet?
● Nested Types
○ LIST: Represents an ordered collection of elements.
○ MAP: Represents a collection of key-value pairs.
○ STRUCT: Represents a complex structure with named fields (schema sketch below).
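The same text format expresses the nested types: LIST and MAP are spec-defined group annotations, and a STRUCT is a plain group. A sketch, again via the Apache `parquet` crate's parser (illustrative field names):

```rust
use parquet::schema::parser::parse_message_type;

fn main() {
    // `tags` is a LIST, `attrs` is a MAP, and `address` is a STRUCT
    // (a plain group with named fields).
    let message = "
        message example {
            optional group tags (LIST) {
                repeated group list {
                    optional binary element (UTF8);
                }
            }
            optional group attrs (MAP) {
                repeated group key_value {
                    required binary key (UTF8);
                    optional int64 value;
                }
            }
            optional group address {
                optional binary city (UTF8);
                optional int32 zip;
            }
        }
    ";
    let schema = parse_message_type(message).expect("valid nested schema");
    println!("{:#?}", schema);
}
```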
7. Parquet with Cache
● Alluxio Local Cache for PrestoDB
○ Breaks Parquet files into smaller pieces and stores them locally
○ Improves reading efficiency and reduces remote I/O cost
8. Rust Native Parquet Reader
● What is a Native Parquet Reader?
○ A tool to read and process data stored in the Parquet file format.
○ The Reader is written in a native language and compiled directly to machine code.
● Why Rust?
○ Good performance and memory safety
○ Able to integrate with other languages via FFI (see the sketch below)
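As a sketch of the FFI point: a Rust function exported with a C ABI can be bound from C++, Java (JNI), Python (ctypes), and others. The function below is hypothetical, not the actual ByteDance API:

```rust
// Hypothetical FFI surface for a native reader. `#[no_mangle]` plus
// `extern "C"` gives the symbol a stable C ABI for other languages.
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn parquet_reader_open(path: *const c_char) -> i32 {
    // A real implementation would validate `path`, open the file, and
    // return an opaque handle; this sketch only guards against NULL.
    if path.is_null() {
        -1
    } else {
        0
    }
}
```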
9. Native Parquet Reader
● High-Level Components
○ Metadata Parser
○ Row Group Level Reader
○ Column Level Reader
○ Page Level Reader
○ Decompression
○ Data Decoder
○ Data Materialization System
○ Filter System
10. Native Parquet Reader
● Metadata Parser -- Thrift
○ File Metadata
■ Extracts schema information, data types, and column statistics.
■ Determines the structure of the Parquet file.
○ Page Metadata
■ Parses page-level metadata
■ Extracts the page statistics, page type, compression and encoding types, etc. (footer parsing sketched below)
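A sketch of footer parsing with the open-source Apache `parquet` crate (an assumption; the internal reader's API is not public). The Thrift-encoded footer yields schema and statistics without touching any data pages:

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder path.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let file_meta = reader.metadata().file_metadata();
    println!("total rows: {}", file_meta.num_rows());
    println!("schema: {:#?}", file_meta.schema_descr().root_schema());
    Ok(())
}
```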
11. Native Parquet Reader
● Row Group Level Reader
○ A row group contains a chunk of every column
○ The smallest parallelism unit for most Parquet Readers
○ Arranges and schedules column-level reading (see the sketch below)
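A sketch of iterating row groups with the Apache `parquet` crate (illustrative; a parallel reader would hand each index to a separate worker):

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let reader = SerializedFileReader::new(file)?;
    for i in 0..reader.metadata().num_row_groups() {
        // Each row group holds a chunk of every column and can be
        // scheduled independently.
        let rg = reader.get_row_group(i)?;
        println!("row group {}: {} rows", i, rg.metadata().num_rows());
    }
    Ok(())
}
```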
12. Native Parquet Reader
● Column Level Reader
○ Each column chunk contains all of its pages within the row group
○ Arranges and schedules page-level reading
13. Native Parquet Reader
● Page Level Reader
○ Page Level Reader reads different types of Data Page
■ Dictionary Page: stores the unique values of the column
■ Data Pages: contain the actual values of the column
● Store indexes according to the Dictionary Page; or
● Store the Plain values
■ ......
○ Page Level Reader reads the actual values out:
■ Decompression -> Decoding -> Materialization (page iteration sketched below)
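A sketch of walking the pages of one column chunk with the Apache `parquet` crate (an assumption; the internal reader differs), distinguishing dictionary pages from data pages:

```rust
use parquet::column::page::{Page, PageReader};
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let reader = SerializedFileReader::new(file)?;
    let rg = reader.get_row_group(0)?;
    // Page reader for the first column of the first row group.
    let mut pages = rg.get_column_page_reader(0)?;
    while let Some(page) = pages.get_next_page()? {
        match page {
            Page::DictionaryPage { num_values, .. } => {
                println!("dictionary page: {} unique values", num_values)
            }
            Page::DataPage { num_values, encoding, .. } => {
                println!("data page: {} values, {:?}", num_values, encoding)
            }
            Page::DataPageV2 { num_values, encoding, .. } => {
                println!("data page v2: {} values, {:?}", num_values, encoding)
            }
        }
    }
    Ok(())
}
```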
15. Native Parquet Reader
● Data Decoder
○ Dictionary Decoder
■ Maintains the dictionary from the dictionary page
■ Data page reading looks up the actual values from indexes (sketched below)
○ RLE/BP Decoder
■ A combination of bit-packing and run-length encoding
○ Plain Decoder
■ Actual values are stored directly
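A minimal sketch of the dictionary-decoding step (hypothetical, not the ByteDance implementation): data pages store small indexes, and the decoder maps them back through the dictionary captured from the dictionary page:

```rust
// Indexes (typically RLE/bit-packed on disk) are resolved against the
// dictionary to materialize the actual values.
fn decode_dictionary<T: Clone>(dictionary: &[T], indexes: &[u32]) -> Vec<T> {
    indexes.iter().map(|&i| dictionary[i as usize].clone()).collect()
}

fn main() {
    let dictionary = vec!["us", "eu", "apac"]; // from the dictionary page
    let indexes = [0u32, 0, 2, 1, 0];          // from a data page
    assert_eq!(
        decode_dictionary(&dictionary, &indexes),
        vec!["us", "us", "apac", "eu", "us"]
    );
}
```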
16. Native Parquet Reader
● Data Materialization System
○ Each computing engine has its own memory layout
○ Materialize the data from data pages into the designated memory format
17. Native Parquet Reader
● Filter System
○ Most queries come with filters
○ Some filters can be processed at the Parquet reading stage
■ select a, b from table where a > 1 -> :-)
■ select a, b from table where a+b > 1 -> :-(
○ The filter system enables the Parquet Reader to read less data when possible (statistics-based skipping sketched below)
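A hypothetical sketch of why `a > 1` is reader-friendly: a single-column predicate can be tested against min/max statistics before decoding anything, whereas `a+b > 1` spans two columns and must wait for materialization:

```rust
// Illustrative min/max statistics for one column chunk or page.
struct ColumnStats {
    min: i64,
    max: i64,
}

// `value > threshold` can only match if the chunk's max exceeds it.
fn may_contain_gt(stats: &ColumnStats, threshold: i64) -> bool {
    stats.max > threshold
}

fn main() {
    let a_stats = ColumnStats { min: -5, max: 0 };
    // max(a) = 0 <= 1, so `a > 1` matches nothing: skip the chunk
    // without decompressing or decoding it.
    assert!(!may_contain_gt(&a_stats, 1));
}
```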
18. Key Features
● Batch Reading with Limit
● Filter Push-down
● Filter Ordering and Re-ordering
● Flexible Data Materialization
19. Batch Reading with Limit
● What is Batch Reading with Limit?
○ Materialize a specific number of rows of data
○ For example, read(1000) returns up to 1000 rows of data
● Why Batch Reading with Limit?
○ The materialized data might be 10x larger than the Parquet-formatted data
○ Reduces memory usage by producing the data in smaller batches for consumption
○ Better fit for different systems: most engines consume much smaller pieces of data than a row group (see the sketch below)
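A hypothetical sketch of the batch-reading interface (`read` and `Batch` are illustrative names, not the actual API): callers pull rows in fixed-size batches instead of materializing a whole row group at once:

```rust
struct Batch {
    rows: usize, // a real batch would carry column data
}

struct BatchReader {
    remaining_rows: usize,
}

impl BatchReader {
    /// Materialize up to `limit` rows; None once the source is exhausted.
    fn read(&mut self, limit: usize) -> Option<Batch> {
        if self.remaining_rows == 0 {
            return None;
        }
        let n = limit.min(self.remaining_rows);
        self.remaining_rows -= n;
        Some(Batch { rows: n })
    }
}

fn main() {
    let mut reader = BatchReader { remaining_rows: 2500 };
    while let Some(batch) = reader.read(1000) {
        println!("got {} rows", batch.rows); // prints 1000, 1000, 500
    }
}
```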
20. Filter Push-down
● What is Filter Push-down?
○ Try to filter out unnecessary data before final materialization
● Why Filter Push-down?
○ Saves unnecessary CPU cycles, such as materialization
○ Filters are very common in user queries, so push-down is widely applicable in the real world
○ The deeper we push the filters, the more efficient the reader becomes
21. Filter Ordering & Re-ordering
● What is Filter Ordering and Re-ordering?
○ Filter ordering: the execution sequence of the columns with filters
○ Filter re-ordering: rearranging the sequence of filters on different columns according to real-time filter performance
● Why Filter Ordering and Re-ordering?
○ In real-world scenarios, multiple filters are applied at the same time
○ We want to start with the filter that eliminates the most data, to reduce the downstream computation burden
○ An adaptive approach finds the best filter execution sequence (sketched below)
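A hypothetical sketch of adaptive re-ordering: track each filter's observed selectivity (the fraction of rows it keeps) and run the most selective filter first, so later filters see fewer rows:

```rust
struct FilterStats {
    name: &'static str,
    rows_in: u64,
    rows_out: u64,
}

impl FilterStats {
    // Fraction of rows that survive this filter.
    fn selectivity(&self) -> f64 {
        self.rows_out as f64 / self.rows_in.max(1) as f64
    }
}

// Lowest selectivity (eliminates the most rows) goes first.
fn reorder(filters: &mut [FilterStats]) {
    filters.sort_by(|a, b| a.selectivity().total_cmp(&b.selectivity()));
}

fn main() {
    let mut filters = [
        FilterStats { name: "b < 100", rows_in: 10_000, rows_out: 9_000 },
        FilterStats { name: "a > 1", rows_in: 10_000, rows_out: 500 },
    ];
    reorder(&mut filters);
    assert_eq!(filters[0].name, "a > 1"); // run the stronger filter first
}
```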
22. Flexible Data Materialization
● What is Data Materialization?
○ The Parquet Reader parses the Parquet file and converts it into datasets in different memory layouts.
● Why does Data Materialization need to be Flexible?
○ Parquet files are widely used in different scenarios: OLAP, machine learning...
● How can Data Materialization be Flexible?
○ The native Parquet Reader provides bridges to customize the output data format (see the sketch below).
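A hypothetical sketch of such a bridge (trait and type names are illustrative): the reader decodes values once and hands them to an engine-specific sink that owns the memory layout:

```rust
// The decoding side sees only this trait, so each engine (OLAP, ML, ...)
// plugs in its own layout without changing the reader.
trait ColumnSink {
    fn append_i64(&mut self, values: &[i64]);
}

// A sink that builds a plain contiguous vector.
struct VecSink {
    out: Vec<i64>,
}

impl ColumnSink for VecSink {
    fn append_i64(&mut self, values: &[i64]) {
        self.out.extend_from_slice(values);
    }
}

fn materialize(decoded: &[i64], sink: &mut dyn ColumnSink) {
    sink.append_i64(decoded);
}

fn main() {
    let mut sink = VecSink { out: Vec::new() };
    materialize(&[1, 2, 3], &mut sink);
    assert_eq!(sink.out, vec![1, 2, 3]);
}
```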