The document discusses Vertica, a column-oriented database management system. It explains that Vertica provides 10x to 100x better performance than traditional RDBMS through its columnar storage format, linear scalability, and built-in fault tolerance. The document then provides details on how Vertica works, how to properly use it through configuration of projections and sort orders, and examples of queries and optimizations on a sample dataset.
7. Delete
• Deleted rows are only marked as deleted.
• Stored in delete vector on disk.
• Query merge the ROS and Deleted vector to
remove deleted records.
• Data is removed asynchronously during
mergeout.
8. Projections
• Physical structure of the table (logical)
• Stored sorted and compressed
• Internal maintenance
• At least one (super) projection.
• Projection Types:
– Super projection
– Query specific projection
– Pre join projection
– Buddy projection
10. What‘s Important ….
• Choose the right columns (General Vs Specific).
• Choose the right sort order .
• Choose the right encoding .
• Choose the right column to partition by .
• Choose the right column to segment by .
11. Where It Falls Short …
• Lack of Features .
• Good for specific types of queries .
– Keep Queries Simple .
– Use the right columns
– Use Order By to help optimizer pick the right
projection
– Check the join column – Best if both tables order
by it .
– Check the join column – best if segmented by it.
12. Choose the Right sort order
Example
select
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS
WJXBFS1
from lp_15744040.FACT_VISIT_ROOM a11
group by
a11.LP_ACCOUNT_ID;
15. Results …
Elapsed Time First projection
GROUPBY HASH (SORT OUTPUT)
Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms
Elapsed Time Second projection
GROUPBY PIPELINED
Time: First fetch (7 rows): 38913.909 ms. All rows formatted: 38913.965 ms
16. Join Example
select a12.DT_WEEK AS DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
join zzz.DIM_DATE_TIME a12
on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by a12.DT_WEEK,
a11.LP_ACCOUNT_ID
Filter : LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC
Group By : DT_WEEK , LP_ACCOUNT_ID
Join: VISIT_FROM_DT_TRUNC , DATE_TIME_ID
Select : DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID
17. Full Explain Plan…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)
| | | | Projection: zzz.FACT_VISIT_b0
| | | | Materialize: a11.VISIT_FROM_DT_TRUNC
| | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <=
'2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | | Projection: zzz.DIM_DATE_TIME_node0004
| | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK
| | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31
12:52:50'::timestamp))
| | | | Execute on: All Nodes
18. Explain Plan (substract)…
Access Path:l
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID:
1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISlTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | | Execute on: All Nodes
Time: First fetch (6 rows): 56654.894 ms. All rows formatted: 56654.988 ms
19. Solution one - Functions
select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by week(a11.VISIT_FROM_DT_TRUNC),
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: <SVAR>, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID,
a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_b0
Time: First fetch (6 rows): 33453.997 ms. All rows formatted: 33454.154 ms
Saved the Join Time
20. Solution Two- PreJoin Projection
Pros Cons
• Eliminate Join overhead • Not Flexible
• Maintain By Vertica • Cause Overhead on Load
• Need Primary/Foreign Key
• Maintenance Restrictions
21. Solution Two- PreJoin Projection
order by
LP_ACCOUNT_ID,VISIT_FROM_DT_TRUNC,DT_WEEK,HOT_LEAD_IND,DATE_TIME_ID,VS_LP_SESSION_ID
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin8_b0.DT_WEEK,
visit_date_time_prejoin8_b0.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin8_b0.DT_WEEK,
visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin8_b0
Time: First fetch (6 rows): 35312.331 ms. All rows formatted: 35312.421 ms
Saved the Join Time
22. Solution Two- PreJoin Projection
Sorted By DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 542K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin_z6.DT_WEEK,
visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin_z6
||
Time: First fetch (6 rows): 3680.853 ms. All rows formatted: 3680.969 ms
Saved the Join Time and Group by hash Time
23. Solution Three - Denormalize
select DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT_Z1 a11
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by DT_WEEK,
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_Z1_super
Time: First etch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms
Saved the Join Time
24. Solution Three - Denormalize
• Changing the projection sort order
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 588K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)
| | | Projection: zzz.fact_visit_z1_pipe
| | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp)
AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | Execute on: All Nodes
Time: First fetch (6 rows): 4313.497 ms. All rows formatted: 4313.600 ms
Saved the Join Time and Group by hash Time
25. Let’s sum it up…
• Keep it simple
• Keep it sorted.
• Keep it joinless