Columnar Table Performance Enhancements Of Greenplum Database with Block Metadata and Sort Keys / Xiaojian Fan (Alibaba Cloud)

About me
• From Hangzhou, China. Developer at Alibaba Cloud working on
Greenplum and PostgreSQL feature development, bug fixing, and
customer support.
• Email: funnyxj.fxj@alibaba-inc.com

About Alibaba Cloud
• As the cloud computing arm and a business unit of Alibaba Group
(NYSE: BABA), Alibaba Cloud provides a comprehensive suite of global
cloud computing services to help power and grow your business.
Alibaba Cloud ranks as the third-largest public cloud services
provider globally and is the leading cloud provider in the China
market.
• https://www.alibabacloud.com/

About Greenplum
• Greenplum Database is an advanced, fully featured, open source data
platform. It provides powerful and rapid analytics on petabyte scale
data volumes. Uniquely geared toward big data analytics, Greenplum
Database is powered by the world’s most advanced cost-based query
optimizer delivering high analytical query performance on large data
volumes.
• https://github.com/greenplum-db/gpdb

What Alibaba Cloud does on Greenplum
• Alibaba Cloud built up a data warehouse service named HybridDB in
its public cloud service, based on the open sourced Greenplum
Database
• Extension/OSS/HyperLogLog/Json/PlJava/Madlib/Postgis/Vectorization, etc.
• Block Metadata and Sort Keys
• https://www.alibabacloud.com/product/hybriddb-postgresql

Why Block Metadata and Sort Keys?
• Lots of columnar tables are used
• Poor query performance
• Poor loading data performance

Structure of Greenplum Columnar Table
• Look at heap table first

Structure of Greenplum Columnar Table
• Columnar table

Structure of columnar table file

Transaction of columnar table
• Auxiliary heap table
• Records the EOF of each file

How DML works
• Insert
• Multiple files avoid file locks
• Delete
• A visimap heap table records the
visibility of all tuples
• Update -> Delete + Insert
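
The DML behaviour above can be illustrated with a toy model. This is a minimal sketch, not Greenplum's implementation: the class `AppendOnlyTable` and its methods are hypothetical names, and a Python set stands in for the visimap heap table.

```python
class AppendOnlyTable:
    """Toy append-only table: stored rows are never overwritten in place."""

    def __init__(self):
        self.rows = []        # append-only row storage
        self.deleted = set()  # stand-in for the visimap: ids of invisible rows

    def insert(self, row):
        self.rows.append(row)
        return len(self.rows) - 1  # row id of the new row

    def delete(self, row_id):
        # The data stays where it is; only its visibility changes.
        self.deleted.add(row_id)

    def update(self, row_id, new_row):
        # Update is modelled as delete + insert.
        self.delete(row_id)
        return self.insert(new_row)

    def scan(self):
        # A scan returns only rows that are still visible.
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


t = AppendOnlyTable()
a = t.insert(("alice", 1))
b = t.insert(("bob", 2))
t.update(a, ("alice", 10))
t.delete(b)
visible = t.scan()
```

After these calls three rows are physically stored, but only the updated row is visible to a scan.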

Disadvantages of columnar table
• Without index
• Must scan and decompress all blocks, which costs a lot of CPU and I/O
• With index
• Loading data is very slow
• Data expansion
• Lots of random IO

Block Metadata
• Add a metadata heap table
• Like auxiliary heap table
• Collecting metadata
• Data loading performance decreases by 10%
• Char/varchar/text optimization
• String comparison is slow
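
A minimal sketch of the block metadata idea, assuming the metadata is a (min, max) pair per block; the slides do not list the exact fields collected. The function names `build_block_metadata` and `scan_equal` are hypothetical, and a Python list stands in for a column file.

```python
def build_block_metadata(values, block_size):
    """Split a column into fixed-size blocks and record (start, min, max) per block."""
    meta = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        meta.append((start, min(block), max(block)))
    return meta


def scan_equal(values, meta, block_size, target):
    """Return the matching values and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for start, lo, hi in meta:
        if target < lo or target > hi:
            continue  # metadata proves this block cannot match, so skip it
        blocks_read += 1
        hits.extend(v for v in values[start:start + block_size] if v == target)
    return hits, blocks_read


column = [1, 2, 3, 4, 50, 51, 52, 53, 97, 98, 99, 100]
meta = build_block_metadata(column, block_size=4)
hits, blocks_read = scan_equal(column, meta, 4, 99)
```

Here only one of the three blocks is read for the predicate `= 99`; the other two are skipped using metadata alone, without touching or decompressing their data.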

• What if data are evenly or randomly distributed?

• Filtering works better after sorting

Sort Keys
• Cluster table data on the disk according to sort keys
• Two main advantages:
• Distinct metadata, fewer [min,max] interval intersections
• Avoid reordering data again when SQL involves order by/group by
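
The effect of sorting on [min,max] intersections can be sketched as follows. This is an illustration under assumed parameters (100 values, blocks of 10, a deterministic pseudo-shuffle); `candidate_blocks` is a hypothetical helper that counts the blocks whose interval contains the filter value.

```python
def candidate_blocks(values, block_size, target):
    """Count blocks whose [min, max] interval contains the target value."""
    count = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        if min(block) <= target <= max(block):
            count += 1
    return count


# 37 is coprime with 100, so this is a scattered permutation of 0..99.
unsorted_col = [(i * 37) % 100 for i in range(100)]
sorted_col = sorted(unsorted_col)

unsorted_count = candidate_blocks(unsorted_col, 10, 50)
sorted_count = candidate_blocks(sorted_col, 10, 50)
```

With sorted data the block intervals do not overlap, so exactly one block is a candidate. With scattered data most block intervals cover the target, and metadata filtering removes little.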

Kinds of Sort Keys
• Global Sort Keys
• Like order by a,b,c
• Avoid reordering
• Group Sort Keys
• Divide data into multiple groups; every group is ordered
• Fewer [min,max] interval intersections for any column

Global Sort Keys
• Without sort key

Global Sort Keys
• With sort key
• Faster

Group Sort Keys
• What if we sort by column1 but filter by column2?
select * from test where column2=99

Group Sort Keys
• Two columns
• Horizontal axis is col1, and vertical axis is col2
• Each cell holds many (col1, col2) tuples; we call it a group

Group Sort Keys
• Three or more columns
• Multidimensional spatial coordinate system
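
A rough sketch of the two-column case, assuming each group is a fixed-width grid cell and that one block holds exactly one group. The slides do not give the actual grouping algorithm, so the cell width `CELL`, the block size `BLOCK`, and the helper `blocks_to_read` are all hypothetical.

```python
BLOCK = 16  # rows per block (assumed)
CELL = 4    # width of one grid cell along each axis (assumed)


def blocks_to_read(rows, col, target):
    """Count blocks whose [min, max] for the given column contains target."""
    n = 0
    for start in range(0, len(rows), BLOCK):
        vals = [r[col] for r in rows[start:start + BLOCK]]
        if min(vals) <= target <= max(vals):
            n += 1
    return n


rows = [(a, b) for a in range(16) for b in range(16)]

# Global sort key: like "order by col1, col2".
global_sorted = sorted(rows)

# Group sort key: order by grid cell first, then within the cell.
group_sorted = sorted(rows, key=lambda r: (r[0] // CELL, r[1] // CELL, r))

global_col1 = blocks_to_read(global_sorted, 0, 15)
global_col2 = blocks_to_read(global_sorted, 1, 15)
group_col1 = blocks_to_read(group_sorted, 0, 15)
group_col2 = blocks_to_read(group_sorted, 1, 15)
```

In this toy layout the global sort reads 1 of 16 blocks when filtering on col1 but all 16 when filtering on col2, while the group sort reads 4 blocks for either column. That is the trade-off: slightly worse on the leading column, much better on the others.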

TPCH Performance Testing
• The amount of data
• 1 TB
• Queries
• Q6, Q14, Q15
• Column l_shipdate in table lineitem is ordered

Metascan Vs. Index
• Loading data performance
• Index results in a 3x decrease in performance

Metascan Vs. Index
• TPCH performance

Metascan Vs. Index
• Different filtering rates
• We change the interval of Q6 l_shipdate (1 day, 3 days, 7 days, 10 days, ...)

Metascan Vs. Index
• > 99.7% filtering rate
• Index is better than metascan
• < 99.7% filtering rate
• Metascan is better than index
• > 40% filtering rate
• Metascan delivers more than double the performance of seqscan

Conclusion
• Block Metadata is a very effective way to improve the performance of
columnar tables
• Choosing and designing sort keys well is also important
