Columnar Table Performance Enhancements of Greenplum Database with Block Metadata and Sort Keys
Xiaojian Fan
About me
• From Hangzhou, China. I work at Alibaba Cloud as a developer on
Greenplum and PostgreSQL: feature development, bug fixing, and
customer support.
• Email: funnyxj.fxj@alibaba-inc.com
About Alibaba Cloud
• As the cloud computing arm and a business unit of Alibaba Group
(NYSE: BABA), Alibaba Cloud provides a comprehensive suite of global
cloud computing services to help power and grow your business.
Alibaba Cloud ranks as the third-largest public cloud services
provider globally and is the leading cloud provider in the China
market.
• https://www.alibabacloud.com/
About Greenplum
• Greenplum Database is an advanced, fully featured, open source data
platform. It provides powerful and rapid analytics on petabyte scale
data volumes. Uniquely geared toward big data analytics, Greenplum
Database is powered by the world’s most advanced cost-based query
optimizer delivering high analytical query performance on large data
volumes.
• https://github.com/greenplum-db/gpdb
Architecture of Greenplum
What Alibaba Cloud does on Greenplum
• Alibaba Cloud built a data warehouse service named HybridDB in
its public cloud, based on the open-source Greenplum Database
• Extensions: OSS, HyperLogLog, JSON, PL/Java, MADlib, PostGIS,
vectorization, etc.
• Block Metadata and Sort Keys
• https://www.alibabacloud.com/product/hybriddb-postgresql
Why Block Metadata and Sort Keys?
• Lots of columnar tables are used
• Poor query performance
• Poor data loading performance
Structure of Greenplum Columnar Table
• Look at heap table first
Structure of Greenplum Columnar Table
• Columnar table
Structure of columnar table file
Transaction of columnar table
• Auxiliary heap table
• Records the EOF of each file
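
The role of the recorded EOF can be sketched with a toy model. This is only an illustration under assumptions, not Greenplum's implementation: the class `AppendOnlyFile` and its methods are hypothetical names, and the sketch assumes that readers ignore any bytes past the EOF stored in the auxiliary table.

```python
class AppendOnlyFile:
    """Toy model of one append-only column file plus its auxiliary-table row."""

    def __init__(self):
        self.data = bytearray()   # physical file contents
        self.committed_eof = 0    # EOF as recorded in the auxiliary heap table

    def append(self, payload: bytes) -> None:
        # Writers only ever append; the recorded EOF is not touched yet.
        self.data.extend(payload)

    def commit(self) -> None:
        # On commit, the new physical length becomes the visible EOF.
        self.committed_eof = len(self.data)

    def abort(self) -> None:
        # On abort, bytes past the recorded EOF are dropped in this toy model.
        del self.data[self.committed_eof:]

    def read_visible(self) -> bytes:
        # Readers never look past the recorded EOF.
        return bytes(self.data[:self.committed_eof])


f = AppendOnlyFile()
f.append(b"block-1;")
f.commit()
f.append(b"block-2;")                      # in-flight insert, not committed
visible_before_commit = f.read_visible()   # still only block-1
f.abort()
visible_after_abort = f.read_visible()
```

Because visibility is decided by a single number per file, an uncommitted or aborted append never has to be undone tuple by tuple.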
How to DML
• Insert
  • Writing to multiple files avoids file lock contention
• Delete
  • A visimap heap table records the visibility of every tuple
• Update = delete + insert
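
The three DML paths above can be sketched as follows. This is a simplified model under assumptions: `ColumnarTable` is a hypothetical class, and a Python set of deleted row numbers stands in for the visimap heap table.

```python
class ColumnarTable:
    """Toy append-only table with a visibility map (hypothetical model)."""

    def __init__(self):
        self.rows = []        # append-only storage; rows are never rewritten
        self.deleted = set()  # stand-in for the visimap: invisible row numbers

    def insert(self, row):
        # Insert always appends and returns the new row number.
        self.rows.append(row)
        return len(self.rows) - 1

    def delete(self, row_no):
        # Delete only marks the row invisible; stored data is untouched.
        self.deleted.add(row_no)

    def update(self, row_no, new_row):
        # Update is modeled as delete followed by insert.
        self.delete(row_no)
        return self.insert(new_row)

    def scan(self):
        # A scan returns only the rows that the visibility map still allows.
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


t = ColumnarTable()
a = t.insert(("alice", 1))
b = t.insert(("bob", 2))
t.update(a, ("alice", 10))
visible = t.scan()
```

After the update, three rows are physically stored but only two are visible, which is why updates on such a table grow the files.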
Disadvantages of columnar table
• Without an index
  • Every block must be scanned and decompressed, which costs a lot of CPU and I/O
• With an index
  • Data loading is very slow
  • Data expansion
  • Lots of random I/O
Block Metadata
• Add a metadata heap table
  • Similar to the auxiliary heap table
• Collecting metadata
  • Data loading performance decreases by 10%
• Char/varchar/text optimization
  • Comparing strings is slow
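
The idea of block metadata can be sketched as below. This is a minimal sketch under assumptions, not the HybridDB implementation: it assumes the metadata kept per block is a [min, max] pair for one integer column, and `BlockMeta`, `build_blocks`, `metascan_eq`, and `BLOCK_SIZE` are hypothetical names.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # hypothetical number of values per block


@dataclass
class BlockMeta:
    block_no: int
    min_val: int
    max_val: int


def build_blocks(values, block_size=BLOCK_SIZE):
    """Split values into blocks and collect [min, max] for each block at load time."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    meta = [BlockMeta(n, min(b), max(b)) for n, b in enumerate(blocks)]
    return blocks, meta


def metascan_eq(blocks, meta, target):
    """Return the matching values and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for m in meta:
        if target < m.min_val or target > m.max_val:
            continue  # the block cannot contain target, so it is skipped
        blocks_read += 1
        hits.extend(v for v in blocks[m.block_no] if v == target)
    return hits, blocks_read


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
blocks, meta = build_blocks(values)
hits, blocks_read = metascan_eq(blocks, meta, 6)
```

Here only the second block (values 5 to 8) is read; the other two are rejected from their metadata alone, without being decompressed.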
Metadata Scan Flow Process
• What if the data is evenly
or randomly distributed?
• Filtering works better
after sorting
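
The effect of data distribution on filtering can be sketched with a small simulation. The block size, value range, and the function name `blocks_read_for_range` are illustrative assumptions; the sketch only counts how many blocks a [min, max] check fails to skip.

```python
import random


def blocks_read_for_range(values, lo, hi, block_size=50):
    """Count blocks whose [min, max] interval intersects the filter range [lo, hi]."""
    count = 0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        if min(block) <= hi and max(block) >= lo:
            count += 1  # metadata cannot rule this block out
    return count


rng = random.Random(0)            # fixed seed so the run is repeatable
sorted_vals = list(range(1000))   # data clustered by value
shuffled_vals = sorted_vals[:]
rng.shuffle(shuffled_vals)        # same data, randomly distributed

sorted_read = blocks_read_for_range(sorted_vals, 100, 149)
shuffled_read = blocks_read_for_range(shuffled_vals, 100, 149)
```

On the sorted data exactly one of the 20 blocks intersects the range. On the shuffled data nearly every block spans a wide [min, max] interval, so the metadata rules out few blocks or none.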
Sort Keys
• Cluster table data on disk according to the sort keys
• Two main advantages:
  • More distinct block metadata: fewer overlapping [min, max] intervals
  • No need to sort the data again when a query uses ORDER BY or GROUP BY
Kinds of Sort Keys
• Global Sort Keys
  • Like ORDER BY a, b, c
  • Avoids reordering
• Group Sort Keys
  • Divide the data into multiple groups; every group is ordered
  • Fewer overlapping [min, max] intervals for any column
Global Sort Keys
• Without sort key
Global Sort Keys
• With sort key
• Much faster
Group Sort Keys
• What if we sort by column1
but filter by column2?
SELECT * FROM test WHERE column2 = 99;
Group Sort Keys
• Two columns
• The horizontal axis is col1 and the vertical axis is col2
• Every cube holds many (col1, col2) tuples; we call each cube a Group
Group Sort Keys
• More than three columns
• Multidimensional spatial coordinate system
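
The difference between a global sort and a group sort can be sketched for two columns. The slides do not give the exact grouping algorithm, so this sketch assumes a simple fixed grid: each column is cut into buckets of width `CELL`, and tuples are stored group by group. `CELL`, `blocks_read`, and the data set are hypothetical.

```python
def blocks_read(rows, col, target, block_size=16):
    """Count blocks whose [min, max] on column `col` may contain `target`."""
    count = 0
    for i in range(0, len(rows), block_size):
        vals = [r[col] for r in rows[i:i + block_size]]
        if min(vals) <= target <= max(vals):
            count += 1
    return count


# 256 tuples (col1, col2), each column ranging over 0..15.
rows = [(c1, c2) for c1 in range(16) for c2 in range(16)]
CELL = 4  # assumed bucket width on each axis, giving a 4 x 4 grid of groups

# Global sort keys: like ORDER BY col1, col2.
global_sorted = sorted(rows)
# Group sort keys (assumed scheme): order by grid cell first, then within the cell.
group_sorted = sorted(rows, key=lambda r: (r[0] // CELL, r[1] // CELL, r[0], r[1]))

global_col1 = blocks_read(global_sorted, 0, 9)   # filter on col1
global_col2 = blocks_read(global_sorted, 1, 9)   # filter on col2
group_col1 = blocks_read(group_sorted, 0, 9)
group_col2 = blocks_read(group_sorted, 1, 9)
```

With the global sort, a filter on col1 reads 1 of 16 blocks but a filter on col2 reads all 16. With the grouped layout, either filter reads 4 of 16 blocks, trading the best case on the leading column for useful filtering on every column.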
TPC-H Performance Testing
• Data volume
  • 1 TB
• Queries
  • Q6, Q14, Q15
• The l_shipdate column of the lineitem table is sorted
Results of TPC-H
Metascan Vs. Index
• Data loading performance
  • With an index, data loading is 3x slower
Metascan Vs. Index
• TPC-H performance
Metascan Vs. Index
• Different filtering rates
  • We vary the l_shipdate interval of Q6 (1 day, 3 days, 7 days, 10 days, ...)
Metascan Vs. Index
• Filtering rate > 99.7%
  • Index is better than metascan
• Filtering rate < 99.7%
  • Metascan is better than index
• Filtering rate > 40%
  • Metascan more than doubles the performance of seqscan
Conclusion
• Block metadata is a very effective way to improve the performance of
columnar tables
• Choosing and designing the sort keys well is also important
