Parquet performance tuning:
The missing guide
Ryan Blue
Strata + Hadoop World NY 2016
Contents.
● Big data at Netflix
● Parquet format background
● Optimization basics
● Stats and dictionary filtering
● Format 2 and compression
● Future work
Big data at Netflix.
40+ PB data warehouse ● Read 3 PB ● Write 300 TB ● 600B events
Strata San Jose results.
Metrics dataset.
Based on Atlas, Netflix’s telemetry platform.
● Performance monitoring backend and UI
● http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
Example metrics data.
● Partitioned by day and cluster
● Columns include metric time, name, value, and host
● Measurements for each minute are stored in a Parquet table (a hypothetical schema sketch follows)
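A minimal sketch of what such a table might look like, written as Spark SQL DDL against a HiveContext; the column and partition names follow the bullets above, and the types are assumptions:

// Hypothetical DDL for the metrics table; names and types are illustrative only.
sqlContext.sql("""
  CREATE TABLE metrics (
    metric_time BIGINT,   -- metric timestamp in milliseconds (assumed)
    name        STRING,   -- metric name, e.g. system.cpu.utilization
    value       DOUBLE,   -- measured value
    host        STRING    -- reporting host
  )
  PARTITIONED BY (day INT, cluster STRING)
  STORED AS PARQUET""")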
Parquet format background.
Parquet data layout.
ROW GROUPS.
● Data needed to reassemble a group of rows
● Smallest task or input split size
● Made of COLUMN CHUNKS
COLUMN CHUNKS.
● Contiguous data for a single column
● Made of DATA PAGES and an optional DICTIONARY PAGE
DATA PAGES.
● Encoded and compressed runs of values
Row groups.
(Diagram: a table with columns A, B, C, D and rows a1 … aN; consecutive rows are grouped into row groups sized to fit within an HDFS block.)
Column chunks and pages.
(Diagram: within a row group, each column chunk stores an optional dictionary page followed by data pages.)
Read less data.
Columnar organization.
● Encoding: make the data smaller
● Column projection: read only the columns you need (see the Spark sketch after this list)
Row group filtering.
● Use footer stats to eliminate row groups
● Use dictionary pages to eliminate row groups
Page filtering.
● Use page stats to eliminate pages
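A minimal Spark sketch of projection plus predicate push-down, using the sqlContext API and the metrics table from this deck:

// Read only two columns and push the predicates down to Parquet.
// Whether row groups are actually skipped depends on stats and dictionaries, shown later.
val lowCpu = sqlContext
  .table("metrics")
  .select("name", "value")                    // column projection
  .filter("name = 'system.cpu.utilization'")  // candidate for stats/dictionary filtering
  .filter("value < 0.8")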
Basics.
Setup.
Parquet writes:
● Version 1.8.1 or later – includes fix for incorrect statistics, PARQUET-251
● 1.9.0 due in October
Reads:
● Presto: Used 0.139
● Spark: Used version 1.6.1 reading from Hive
● Pig: Used parquet-pig 1.9.0 for predicate push-down
Pig configuration.
-- enable pushdown/filtering
set parquet.pig.predicate.pushdown.enable true;
-- enables stats and dictionary filtering
set parquet.filter.statistics.enabled true;
set parquet.filter.dictionary.enabled true;
Spark configuration.
// turn on Parquet push-down, stats filtering, and dictionary filtering
sqlContext.setConf("parquet.filter.statistics.enabled", "true")
sqlContext.setConf("parquet.filter.dictionary.enabled", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
// use the non-Hive read path
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// turn off schema merging, which turns off push-down
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema",
"false")
Writing the data.
Spark:
sqlContext
.table("raw_metrics")
.write.insertInto("metrics")
Pig:
metricsData = LOAD 'raw_metrics'
USING SomeLoader;
STORE metricsData INTO 'metrics'
USING ParquetStorer;
Writing the data. (continued)
Running the same writes fails with:
● OutOfMemoryError, or
● ParquetRuntimeException
Writing too many files.
Data doesn’t match partitioning.
● Tasks write a file per partition
Symptoms:
● OutOfMemoryError
● ParquetRuntimeException: New Memory allocation 1047284 bytes is smaller than the minimum allocation size of 1048576 bytes.
● Or the job succeeds but writes lots of small files, making split planning slow
(Diagram: each task writes a file into every partition directory: part=1/, part=2/, part=3/, and so on.)
Account for partitioning.
Sort by the partition columns so that each task's rows fall into only a few partitions, keeping the number of open Parquet writers small.
Spark.
sqlContext
.table("raw_metrics")
.sort("day", "cluster")
.write.insertInto("metrics")
Pig.
metrics = LOAD 'raw_metrics'
USING SomeLoader;
metricsSorted = ORDER metrics
BY day, cluster;
STORE metricsSorted INTO 'metrics'
USING ParquetStorer;
Filter to select partitions.
Spark.
val partition = sqlContext
.table("metrics")
.filter("day = 20160929")
.filter("cluster = 'emr_adhoc'")
Pig.
metricsData = LOAD 'metrics'
USING ParquetLoader;
partition = FILTER metricsData BY
day == 20160929 AND
cluster == 'emr_adhoc';
Stats filters.
Sample query.
Spark.
val low_cpu_count = partition
.filter("name = 'system.cpu.utilization'")
.filter("value < 0.8")
.count()
Pig.
low_cpu = FILTER partition BY
name == 'system.cpu.utilization' AND
value < 0.8;
low_cpu_count = FOREACH
(GROUP low_cpu ALL) GENERATE
COUNT(low_cpu);
My job was 5 minutes faster!
Did it work?
● Success metrics: S3 bytes read, CPU time spent
S3N: Number of bytes read: 1,366,228,942,336
CPU time spent (ms): 280,218,780
● Filter didn’t work. Bytes read shows the entire partition was read.
● What happened?
Inspect the file.
● Stats show what happened:
Row group 0: count: 84756 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 84756 61.52 B 0 "A..." / "z..."
...
Row group 1: count: 85579 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 85579 61.52 B 0 "A..." / "z..."
● Every row group matched the query: the data wasn't sorted by name, so every row group's min/max spanned nearly the full range of names
Add query columns to the sort.
Spark.
sqlContext
.table("raw_metrics")
.sort("day", "cluster", "name")
.write.insertInto("metrics")
Pig.
metrics = LOAD 'raw_metrics'
USING SomeLoader;
metricsSorted = ORDER metrics
BY day, cluster, name;
STORE metricsSorted INTO 'metrics'
USING ParquetStorer;
Inspect the file, again.
● Stats are fixed:
Row group 0: count: 84756 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 84756 61.52 B 0 "A..." / "F..."
...
Row group 1: count: 85579 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 85579 61.52 B 0 "F..." / "N..."
...
Row group 2: count: 86712 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 86712 61.52 B 0 "N..." / "b..."
Dictionary filters.
Dictionary filtering.
The dictionary page is a compact list of all the values in a column chunk.
● Search term missing from the dictionary? Skip the row group
● Like a bloom filter without false positives
When dictionary filtering helps (a conceptual sketch follows this list):
● When a column is sorted within each file but not globally sorted, so stats still leave about one matching row group per file
● When filtering an unsorted column
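A conceptual sketch of the dictionary-filtering idea in Scala (this is not parquet-mr's actual API; it only illustrates the skip decision):

// A row group can be skipped when no value in its dictionary satisfies the predicate.
def canSkipRowGroup[T](dictionaryValues: Set[T], predicate: T => Boolean): Boolean =
  !dictionaryValues.exists(predicate)

// Hypothetical dictionary for one column chunk of the metrics table:
val names = Set("system.cpu.utilization", "system.mem.used")
canSkipRowGroup(names, (n: String) => n == "disk.await")              // true: skip
canSkipRowGroup(names, (n: String) => n == "system.cpu.utilization")  // false: read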
Dictionary filtering overhead.
Read overhead.
● Extra seeks
● Extra page reads
Not a problem in practice.
● Reading both dictionary and row group resulted in < 1% penalty
● Stats filtering prevents unnecessary dictionary reads
Works out of the box, right?
Nope.
● Only works when columns are completely dictionary-encoded
● Plain-encoded pages can contain any value, dictionary is no help
● All pages in a chunk must use the dictionary
Dictionary fallback rules (sketched below):
● If dictionary + references > plain encoding, fall back
● If dictionary size is too large, fall back (default threshold: 1 MB)
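A conceptual sketch of those two fallback rules (not parquet-mr's actual code; sizes are per column chunk, in bytes):

// Fall back to plain encoding when the dictionary stops paying for itself,
// or when it grows past the configured maximum (default 1 MB).
def fallBackToPlain(dictionarySize: Long,
                    dictionaryEncodedSize: Long,
                    plainEncodedSize: Long,
                    maxDictionarySize: Long = 1L << 20): Boolean =
  dictionarySize > maxDictionarySize ||
  dictionarySize + dictionaryEncodedSize > plainEncodedSize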
Fallback to plain encoding.
parquet-tools dump -d
utc_timestamp_ms TV=142990 RL=0 DL=1 DS: 833491 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED V:RLE SZ:72912
page 1: DLE:RLE RLE:BIT_PACKED V:RLE SZ:135022
page 2: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:1048607
page 3: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:1048607
page 4: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:714941
What’s happening:
● Values repeat, but change over time
● Dictionary gets too large, falls back to plain encoding
● Dictionary encoding is a size win!
Avoid encoding fallback.
Increase max dictionary size.
● 2-3 MB usually worked
● parquet.dictionary.page.size
Decrease row group size.
● 24, 32, or 64 MB
● parquet.block.size
● New dictionary for each row group
● Also lowers memory consumption!
Run several tests to find the right configuration (per table).
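One way to apply these properties from Spark, as a sketch: it assumes the Parquet writer picks them up from the Hadoop job configuration (the property names are the ones listed above; sc is the SparkContext).

// Roughly 3 MB dictionaries and 32 MB row groups, per the guidance above.
sc.hadoopConfiguration.setInt("parquet.dictionary.page.size", 3 * 1024 * 1024)
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

In Pig, the same properties can be set with set statements, mirroring the configuration slide earlier.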
Row group size.
Other reasons to decrease row group size:
● Reduce memory consumption – but not to avoid write-side OOM
● Increase number of tasks / parallelism
Results!
Results (from Pig).
CPU and wall time dropped.
● Initial: CPU Time: 280,218,780 ms Wall Time: 15m 27s
● Filtered: CPU Time: 120,275,590 ms Wall Time: 9m 51s
● Final: CPU Time: 9,593,700 ms Wall Time: 6m 47s
Bytes read is much better.
● Initial: S3 bytes read: 1,366,228,942,336 (1.24 TB)
● Filtered: S3 bytes read: 49,195,996,736 (45.82 GB)
Filtered vs. final time.
Row group filtering is parallel.
● Split planning is independent of stats (otherwise reading stats during planning would become the bottleneck)
● Lots of very small tasks: read footer, read dictionary, stop processing
Combine splits in Pig/MR for better wall time (a config sketch follows this list).
● 1 GB splits tend to work well
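A sketch of split combining in Pig, using standard Pig properties (pig.splitCombination is on by default; pig.maxCombinedSplitSize is in bytes, set to 1 GB here per the guideline above):

-- combine small Parquet splits into ~1 GB map tasks
set pig.splitCombination true;
set pig.maxCombinedSplitSize 1073741824;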
Other work.
Format version 2.
What’s included:
● New encodings: delta-integer, prefix-binary
● New page format to enable page-level filtering
New encodings didn’t help with Netflix data.
● Delta-integer didn’t help significantly, even with timestamps (high overhead?)
● Prefixes in URL and JSON data weren’t long enough to benefit
Page filtering isn’t implemented (yet).
Brotli compression.
● New compression library, from Google
● Based on LZ77, with compatible license
Faster compression, smaller files, or both.
● brotli-5: 19.7% smaller, 2.7% slower – 1 day of data from Kafka
● brotli-4: 14.8% smaller, 12.5% faster – 1 hour, 4 largest Parquet tables
● brotli-1: 8.1% smaller, 28.3% faster – JSON-heavy dataset
Brotli compression. (continued)
Future work.
Short term:
● Release Parquet 1.9.0
● Test Zstd compression
● Convert embedded JSON to Avro – good preliminary results
Long-term:
● New encodings: Zig-zag RLE, patching, and floating point decomposition
● Page-level filtering
Thank you!
Questions?
https://jobs.netflix.com/
rblue@netflix.com