Vertica Performance Optimization

Vertica
Zvika Gutkin
DB Expert
Zvika.gutkin@gmail.com

Agenda

• What is Vertica.

• How does it work.

• How To Use Vertica … (The Right Way).

• Where It Falls Short.

• Examples …

MPP-Columnar DBMS

10x–100x performance of classic RDBMS.
Linear Scale
SQL
Commodity Hardware
Built-in fault tolerance

10x–100x performance of classic RDBMS
• Column store architecture
• High Compression rates
• Sorted columns
• Objects Segmentation/Replication.
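
Why sorted columns and high compression go together can be shown with a small sketch. The Python below is illustrative only and is not Vertica's implementation: `rle_encode` is a hypothetical helper and the column values are made up. It run-length encodes the same column before and after sorting.

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# Made-up sample column with few distinct values (low cardinality).
unsorted_column = ["A", "B", "A", "C", "B", "A", "C", "A"]
sorted_column = sorted(unsorted_column)

unsorted_runs = rle_encode(unsorted_column)  # one run per row: nothing gained
sorted_runs = rle_encode(sorted_column)      # one run per distinct value

print(len(unsorted_runs), "runs when unsorted")
print(len(sorted_runs), "runs when sorted:", sorted_runs)
```

In this toy case the sorted column needs one run per distinct value, while the unsorted column needs one run per row.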

Delete
• Deleted rows are only marked as deleted.
• The marks are stored in a delete vector on disk.
• Queries merge the ROS with the delete vector to filter out deleted records.
• Data is physically removed asynchronously, during mergeout.
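
The delete flow above can be sketched as follows. This is a simplified model under stated assumptions, not Vertica internals: the ROS container is a plain list, the delete vector is a set of row positions, and `scan` and `mergeout` are hypothetical names.

```python
def scan(ros_rows, delete_vector):
    """Query-time view: skip row positions that are marked as deleted."""
    return [row for pos, row in enumerate(ros_rows) if pos not in delete_vector]

def mergeout(ros_rows, delete_vector):
    """Rewrite storage without the deleted rows; the delete vector is emptied."""
    return scan(ros_rows, delete_vector), set()

ros_rows = ["r0", "r1", "r2", "r3"]
delete_vector = set()

# DELETE only records the position; the row itself stays in storage.
delete_vector.add(1)
visible = scan(ros_rows, delete_vector)
stored_before = len(ros_rows)

# Later, mergeout physically removes the row.
ros_rows, delete_vector = mergeout(ros_rows, delete_vector)
stored_after = len(ros_rows)

print(visible, stored_before, stored_after)
```

Every query pays the cost of the filter in `scan` until mergeout runs, which is why a large backlog of deleted rows can slow queries down.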

Projections
• Physical storage structure of the (logical) table
• Stored sorted and compressed
• Internal maintenance
• At least one (super) projection.
• Projection Types:
– Super projection
– Query specific projection
– Pre join projection
– Buddy projection

What's Important …
• Choose the right columns (General vs. Specific).
• Choose the right sort order.
• Choose the right encoding.
• Choose the right column to partition by.
• Choose the right column to segment by.

Where It Falls Short …
• Lack of Features.
• Good for specific types of queries.
– Keep queries simple.
– Use the right columns.
– Use ORDER BY to help the optimizer pick the right projection.
– Check the join column: best if both tables are ordered by it.
– Check the join column: best if both tables are segmented by it.
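
The advice about segmenting by the join column can be illustrated with a small sketch. Everything here is an assumption made for illustration: a three-node cluster, a CRC32-based `node_for` function standing in for the real segmentation hash, and made-up rows. When both tables are segmented by the join key with the same function, matching rows land on the same node, so each node can join locally without moving data.

```python
import zlib

NODES = 3  # assumed cluster size

def node_for(key):
    """Hypothetical segmentation: stable hash of the key modulo the node count."""
    return zlib.crc32(str(key).encode()) % NODES

def segment(rows, key_index):
    """Distribute rows across nodes by hashing the chosen column."""
    nodes = [[] for _ in range(NODES)]
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes

fact = [(1, "s1"), (2, "s2"), (1, "s3"), (3, "s4")]  # (join_key, session)
dim = [(1, "w35"), (2, "w36"), (3, "w37")]           # (join_key, week)

fact_nodes = segment(fact, 0)
dim_nodes = segment(dim, 0)

# Each node joins only the rows it already holds.
local_result = []
for fact_rows, dim_rows in zip(fact_nodes, dim_nodes):
    lookup = dict(dim_rows)
    local_result += [(k, s, lookup[k]) for k, s in fact_rows if k in lookup]

# Reference result: a join over all rows in one place.
global_lookup = dict(dim)
global_result = [(k, s, global_lookup[k]) for k, s in fact]

print(sorted(local_result) == sorted(global_result))
```

If the two tables were segmented by different columns, the per-node lookups would miss matches and rows would have to be redistributed before joining.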

Choose the Right sort order
Example
select
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from lp_15744040.FACT_VISIT_ROOM a11
group by
a11.LP_ACCOUNT_ID;

First projection ….
table_name       projection_name      projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VS_LP_SESSION_ID        0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  LP_ACCOUNT_ID           1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  HOT_LEAD_IND            8                8

Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_bad
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Second projection …
table_name       projection_name       projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  LP_ACCOUNT_ID           0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_SESSION_ID        1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  HOT_LEAD_IND            8                8

Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_fix1
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Results …
Elapsed Time First projection
GROUPBY HASH (SORT OUTPUT)

Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms

Elapsed Time Second projection
GROUPBY PIPELINED
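
The difference between the two operators can be sketched in a few lines. This is a conceptual illustration, not Vertica's executor: both function names and the sample rows are hypothetical. The hash variant accepts rows in any order but must keep every distinct (account, session) pair in memory; the pipelined variant relies on the input being sorted by (account, session), as in the second projection, and keeps state for one group at a time.

```python
from itertools import groupby

def count_distinct_hash(rows):
    """Hash aggregation: any input order, all distinct pairs held in memory."""
    seen = {}
    for account, session in rows:
        seen.setdefault(account, set()).add(session)
    return {account: len(sessions) for account, sessions in seen.items()}

def count_distinct_pipelined(sorted_rows):
    """Pipelined aggregation: input must be sorted by (account, session)."""
    result = {}
    for account, group in groupby(sorted_rows, key=lambda row: row[0]):
        count = 0
        previous = None
        for _, session in group:
            # Duplicates are adjacent in sorted input, so comparing with
            # the previous value is enough to count distinct sessions.
            if count == 0 or session != previous:
                count += 1
                previous = session
        result[account] = count
    return result

# Made-up (LP_ACCOUNT_ID, VS_LP_SESSION_ID) pairs.
rows = [("a1", "s1"), ("a2", "s9"), ("a1", "s2"), ("a1", "s1"), ("a2", "s9")]

hash_result = count_distinct_hash(rows)
pipelined_result = count_distinct_pipelined(sorted(rows))

print(hash_result, pipelined_result)
```

Both return the same counts; the pipelined version only works because the projection's sort order delivers the rows already grouped.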


Join Example
select a12.DT_WEEK AS DT_WEEK,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
join zzz.DIM_DATE_TIME a12
on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by a12.DT_WEEK,
a11.LP_ACCOUNT_ID

• Filter: LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC
• Group By: DT_WEEK, LP_ACCOUNT_ID
• Join: VISIT_FROM_DT_TRUNC, DATE_TIME_ID
• Select: DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Full Explain Plan…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)
| | | | Projection: zzz.FACT_VISIT_b0
| | | | Materialize: a11.VISIT_FROM_DT_TRUNC
| | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | | Projection: zzz.DIM_DATE_TIME_node0004
| | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK
| | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes

Explain Plan (excerpt)…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID


Solution one - Functions
select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
group by week(a11.VISIT_FROM_DT_TRUNC),
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| Group By: <SVAR>, a11.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_b0
Saved the Join Time
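
The idea behind this solution is that the week can be derived from the timestamp itself, so the lookup in the date dimension adds nothing. A minimal sketch, with loud assumptions: the visits are made up, `week_of` is a hypothetical stand-in that uses the ISO week number (which may not match the numbering of Vertica's week function), and the dimension table is modeled as a dictionary.

```python
from datetime import datetime

# Made-up (VISIT_FROM_DT_TRUNC, VS_LP_SESSION_ID) rows.
visits = [
    (datetime(2011, 9, 5, 10, 0), "s1"),
    (datetime(2011, 9, 6, 11, 0), "s2"),
    (datetime(2011, 9, 12, 9, 0), "s3"),
]

def week_of(ts):
    """Stand-in week function; ISO numbering is an assumption."""
    return ts.isocalendar()[1]

# Join approach: look every timestamp up in a date dimension.
dim_date_time = {ts: week_of(ts) for ts, _ in visits}
joined = [(dim_date_time[ts], session) for ts, session in visits]

# Function approach: compute the week directly from the fact column.
computed = [(week_of(ts), session) for ts, session in visits]

print(joined == computed)
```

This only applies when the dimension attribute is a pure function of the join key, which is the case for calendar attributes such as the week.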

Solution Two- PreJoin Projection
Pros
• Eliminates join overhead
• Maintained by Vertica
Cons
• Not flexible
• Causes overhead on load
• Needs primary/foreign key
• Maintenance restrictions

order by
LP_ACCOUNT_ID,VISIT_FROM_DT_TRUNC,DT_WEEK,HOT_LEAD_IND,DATE_TIME_ID,VS_LP_SESSION_ID

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin8_b0

Saved the Join Time

Sorted By DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Access Path:
| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin_z6

Saved the Join Time and Group by hash Time

Solution Three - Denormalize
select DT_WEEK,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT_Z1 a11
group by DT_WEEK,
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_Z1_super
Time: First fetch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms
Saved the Join Time

Solution Three - Denormalize
• Changing the projection sort order
Access Path:
| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)
| | | Projection: zzz.fact_visit_z1_pipe
| | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
Saved the Join Time and Group by hash Time

Let’s sum it up…

• Keep it simple.
• Keep it sorted.
• Keep it joinless.
