Vertica MPP Columnar DBMS

Vertica
Zvika Gutkin
DB Expert
Zvika.gutkin@gmail.com

Agenda

• What is Vertica.

• How does it work.

• How To Use Vertica … (The Right Way).

• Where It Falls Short.

• Examples …

MPP-Columnar DBMS

10x–100x the performance of a classic RDBMS.
Linear Scale
SQL
Commodity Hardware
Built-in fault tolerance

10x–100x performance of classic RDBMS
• Column store architecture
• High Compression rates
• Sorted columns
• Objects Segmentation/Replication.
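
The link between sorted columns and high compression can be sketched with run-length encoding (RLE). This is a minimal illustration, not Vertica's storage code; the column values and the `rle_encode` helper are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column: one (value, run_length) pair per run."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Hypothetical low-cardinality column in arrival (unsorted) order.
unsorted_column = ["acct_b", "acct_a", "acct_b", "acct_a", "acct_c", "acct_a"]
sorted_column = sorted(unsorted_column)

# Unsorted: every value starts a new run, so nothing is saved.
print(len(rle_encode(unsorted_column)))  # 6
# Sorted: equal values are adjacent and collapse into three runs.
print(rle_encode(sorted_column))
```

The sorted copy stores three pairs instead of six values, and the gap widens as runs get longer.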

Delete
• Deleted rows are only marked as deleted.
• The marks are stored in a delete vector on disk.
• Queries merge the ROS (Read Optimized Store) with the delete vector to filter out deleted records.
• The data is physically removed asynchronously during mergeout.
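
The delete flow above can be sketched as a toy model. The `Table` class and its method names are hypothetical; the sketch only mirrors the idea of marking positions, filtering at scan time, and purging later.

```python
class Table:
    """Toy model of a stored container plus its delete vector (illustrative only)."""

    def __init__(self, rows):
        self.ros = list(rows)        # stored rows, never updated in place
        self.delete_vector = set()   # positions of rows marked as deleted

    def delete(self, predicate):
        # A delete only records positions; the stored rows are untouched.
        for position, row in enumerate(self.ros):
            if predicate(row):
                self.delete_vector.add(position)

    def scan(self):
        # A query merges the stored rows with the delete vector.
        return [row for position, row in enumerate(self.ros)
                if position not in self.delete_vector]

    def mergeout(self):
        # Background step: rewrite storage without the marked rows.
        self.ros = self.scan()
        self.delete_vector.clear()


table = Table([1, 2, 3, 4])
table.delete(lambda row: row % 2 == 0)
print(table.scan(), len(table.ros))  # deleted rows are hidden but still stored
table.mergeout()
print(table.ros)                     # now physically removed
```

Every scan pays for the merge until mergeout runs, which is why heavy deletes can slow queries.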

Projections
• Physical structure of the (logical) table
• Stored sorted and compressed
• Internal maintenance
• At least one (super) projection.
• Projection Types:
– Super projection
– Query specific projection
– Pre join projection
– Buddy projection

What's Important …
• Choose the right columns (general vs. specific).
• Choose the right sort order.
• Choose the right encoding.
• Choose the right column to partition by.
• Choose the right column to segment by.

Where It Falls Short …
• Lack of features.
• Good for specific types of queries.
– Keep queries simple.
– Use the right columns.
– Use ORDER BY to help the optimizer pick the right projection.
– Check the join column: best if both tables are ordered by it.
– Check the join column: best if both tables are segmented by it.
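
Why segmenting both tables by the join column helps can be sketched as follows. This is an assumption-laden toy: `NODE_COUNT`, the sample rows, and the use of CRC32 as the hash are all hypothetical stand-ins, not Vertica's actual hash function or layout.

```python
import zlib

NODE_COUNT = 3  # hypothetical cluster size


def node_for(key):
    """Map a segmentation key to a node (CRC32 stands in for the real hash)."""
    return zlib.crc32(str(key).encode()) % NODE_COUNT


def segment(rows, key_index):
    """Distribute rows across nodes by hashing the chosen column."""
    nodes = {node: [] for node in range(NODE_COUNT)}
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes


visits = [(101, "s1"), (102, "s2"), (101, "s3"), (103, "s4")]  # (account_id, session_id)
accounts = [(101, "Acme"), (102, "Globex"), (103, "Initech")]  # (account_id, name)

# Both tables are segmented by the join column, so matching rows share a node.
visit_nodes = segment(visits, 0)
account_nodes = segment(accounts, 0)


def local_join(node):
    """Join using only the rows stored on one node; no data moves between nodes."""
    names = dict(account_nodes[node])
    return [(account, session, names[account]) for account, session in visit_nodes[node]]


joined = sorted(row for node in range(NODE_COUNT) for row in local_join(node))
print(joined)
```

If the tables were segmented by different columns, `local_join` would miss matches and rows would first have to be redistributed across nodes.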

Choose the Right sort order
Example
select
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from lp_15744040.FACT_VISIT_ROOM a11
group by
a11.LP_ACCOUNT_ID;

First projection ….
table_name       projection_name       projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   VS_LP_SESSION_ID        0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   LP_ACCOUNT_ID           1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad   HOT_LEAD_IND            8                8

Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_bad
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Second projection …
table_name       projection_name       projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  LP_ACCOUNT_ID           0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_SESSION_ID        1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  HOT_LEAD_IND            8                8

Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_fix1
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Results …
Elapsed Time First projection
GROUPBY HASH (SORT OUTPUT)

Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms

Elapsed Time Second projection
GROUPBY PIPELINED
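
The difference between the two operators can be sketched in a few lines. This is a simplified model of the idea, not Vertica's executor: the function names and sample rows are hypothetical, and the pipelined version assumes its input is already sorted by (account, session), as the second projection provides.

```python
from itertools import groupby


def count_distinct_hash(rows):
    """Hash style: accepts any input order but keeps every
    (account, session) pair in memory until the scan ends."""
    seen = {}
    for account, session in rows:
        seen.setdefault(account, set()).add(session)
    return {account: len(sessions) for account, sessions in seen.items()}


def count_distinct_pipelined(sorted_rows):
    """Pipelined style: input must be sorted by (account, session);
    only the previous session value is remembered per group."""
    result = {}
    for account, group in groupby(sorted_rows, key=lambda row: row[0]):
        distinct = 0
        previous = object()  # sentinel that equals no session value
        for _, session in group:
            if session != previous:
                distinct += 1
                previous = session
        result[account] = distinct
    return result


rows = [("a1", "s1"), ("a2", "s9"), ("a1", "s2"), ("a1", "s1"), ("a2", "s9")]
print(count_distinct_hash(rows))
print(count_distinct_pipelined(sorted(rows)))
```

Both return the same counts; the pipelined version needs constant memory per group, which is the saving a matching projection sort order buys.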


Join Example
select a12.DT_WEEK AS DT_WEEK,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
join zzz.DIM_DATE_TIME a12
on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by a12.DT_WEEK,
a11.LP_ACCOUNT_ID

• Filter: LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC
• Group By: DT_WEEK, LP_ACCOUNT_ID
• Join: VISIT_FROM_DT_TRUNC, DATE_TIME_ID
• Select: DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Full Explain Plan…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)
| | | | Projection: zzz.FACT_VISIT_b0
| | | | Materialize: a11.VISIT_FROM_DT_TRUNC
| | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | | Projection: zzz.DIM_DATE_TIME_node0004
| | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK
| | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes

Explain Plan (excerpt)…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID


Solution one - Functions
select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
group by week(a11.VISIT_FROM_DT_TRUNC),
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| Group By: <SVAR>, a11.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_b0
Saved the Join Time
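
The idea behind this solution, deriving the week from the timestamp instead of looking it up in the date dimension, can be sketched as below. The sample data is hypothetical, and the sketch assumes ISO week numbering; Vertica's `week()` may number weeks differently, so the function must be checked against the dimension before swapping it in.

```python
from datetime import datetime


def week_of(timestamp):
    """Derive the week directly from the timestamp (ISO numbering assumed)."""
    return timestamp.isocalendar()[1]


# Hypothetical fact rows: (VISIT_FROM_DT_TRUNC, VS_LP_SESSION_ID).
visits = [
    (datetime(2011, 9, 5, 10, 0), "s1"),
    (datetime(2011, 9, 6, 11, 0), "s2"),
    (datetime(2011, 9, 12, 9, 0), "s3"),
]

# Hypothetical date dimension: DATE_TIME_ID -> DT_WEEK.
dim_date_time = {
    datetime(2011, 9, 5, 10, 0): 36,
    datetime(2011, 9, 6, 11, 0): 36,
    datetime(2011, 9, 12, 9, 0): 37,
}

# Join: one dimension lookup per fact row.
via_join = [(dim_date_time[ts], session) for ts, session in visits]
# Function: no second table is touched.
via_function = [(week_of(ts), session) for ts, session in visits]

print(via_join == via_function)
```

This only works when the dimension attribute is a pure function of the join key, as a week number is of a timestamp.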

Solution Two - PreJoin Projection
Pros
• Eliminates join overhead
• Maintained by Vertica
Cons
• Not flexible
• Adds overhead on load
• Needs primary/foreign keys
• Maintenance restrictions

order by LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC, DT_WEEK, HOT_LEAD_IND, DATE_TIME_ID, VS_LP_SESSION_ID

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin8_b0

Saved the Join Time

Sorted By DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Access Path:
| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin_z6

Saved the Join Time and Group by hash Time

Solution Three - Denormalize
select DT_WEEK,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT_Z1 a11
group by DT_WEEK,
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_Z1_super
Time: First fetch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms
Saved the Join Time

Solution Three - Denormalize
• Changing the projection sort order
Access Path:
| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)
| | | Projection: zzz.fact_visit_z1_pipe
| | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
Saved the Join Time and Group by hash Time

Let’s sum it up…

• Keep it simple.
• Keep it sorted.
• Keep it joinless.
