Vertica Performance Optimization
Zvika Gutkin
DB Expert
Zvika.gutkin@gmail.com
Agenda
• What is Vertica?
• How does it work?
• How to use Vertica … (the right way).
• Where it falls short.
• Examples …
MPP-Columnar DBMS
• 10x–100x the performance of a classic RDBMS
• Linear scale
• SQL
• Commodity hardware
• Built-in fault tolerance
10x–100x Performance of a Classic RDBMS
•   Column store architecture
•   High compression rates
•   Sorted columns
•   Object segmentation/replication
How Does It Work?
Tuple Mover
Delete
• Deleted rows are only marked as deleted.
• Delete marks are stored in delete vectors on disk.
• Queries merge the ROS with the delete vectors to filter out deleted records.
• Data is physically removed asynchronously during mergeout.
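The delete-vector behavior above can be observed and, if needed, short-circuited by hand. A minimal sketch against a hypothetical table public.fact_demo; DELETE_VECTORS (in the v_monitor schema) and PURGE_TABLE are standard Vertica system objects:

```sql
-- Rows deleted here are only marked as deleted; space is reclaimed later.
DELETE FROM public.fact_demo WHERE visit_dt < '2011-01-01';
COMMIT;

-- Inspect the outstanding delete vectors for this table's projections.
SELECT schema_name, projection_name, deleted_row_count, used_bytes
FROM v_monitor.delete_vectors
WHERE projection_name ILIKE 'fact_demo%';

-- Reclaim the space now instead of waiting for an automatic mergeout.
SELECT PURGE_TABLE('public.fact_demo');
```

Note that PURGE_TABLE rewrites storage containers, so it is an expensive operation best run off-peak.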
Projections
•   The physical storage of a (logical) table
•   Stored sorted and compressed
•   Maintained internally by Vertica
•   Every table has at least one (super) projection.
•   Projection types:
    –   Super projection
    –   Query-specific projection
    –   Pre-join projection
    –   Buddy projection
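A query-specific projection is just a CREATE PROJECTION with its own column list and sort order; buddy projections are created by Vertica itself according to the K-safety you request. A hedged sketch with hypothetical table and column names:

```sql
-- Query-specific projection: only the columns one report touches,
-- sorted for that report. Vertica keeps it up to date on every load.
CREATE PROJECTION public.fact_demo_report
(
    account_id,
    session_id
)
AS SELECT account_id, session_id
   FROM public.fact_demo
   ORDER BY account_id, session_id
   SEGMENTED BY HASH(account_id) ALL NODES
   KSAFE 1;   -- K-safety 1 tells Vertica to create a buddy projection

-- Populate the new projection so the optimizer can use it.
SELECT START_REFRESH();
```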
Projections
What's Important …
•   Choose the right columns (general vs. specific).
•   Choose the right sort order.
•   Choose the right encoding.
•   Choose the right column to partition by.
•   Choose the right column to segment by.
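All five choices surface directly in the DDL. A sketch using hypothetical names; RLE encoding pays off on low-cardinality columns, and the partition expression shown (year*100+month) is a common Vertica idiom for monthly partitions:

```sql
-- Hypothetical fact table, partitioned by month for cheap partition drops.
CREATE TABLE public.fact_demo
(
    account_id   VARCHAR(20),
    session_id   VARCHAR(40),
    visit_dt     TIMESTAMP,
    hot_lead_ind INT
)
PARTITION BY EXTRACT(year FROM visit_dt) * 100 + EXTRACT(month FROM visit_dt);

-- Sort order matches the common filter then group-by; RLE suits the
-- low-cardinality flag; segmentation spreads accounts across nodes.
CREATE PROJECTION public.fact_demo_sorted
(
    account_id,
    visit_dt,
    session_id,
    hot_lead_ind ENCODING RLE
)
AS SELECT account_id, visit_dt, session_id, hot_lead_ind
   FROM public.fact_demo
   ORDER BY account_id, visit_dt, session_id
   SEGMENTED BY HASH(account_id) ALL NODES;
```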
Where It Falls Short …
• Lack of features.
• Good only for specific types of queries.
  – Keep queries simple.
  – Use the right columns.
  – Use ORDER BY to help the optimizer pick the right projection.
  – Check the join column – best if both tables are sorted by it.
  – Check the join column – best if both tables are segmented by it.
Choose the Right Sort Order – Example
select
    a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
    count(distinct a11.VS_LP_SESSION_ID) AS Visits,
    (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
 from lp_15744040.FACT_VISIT_ROOM a11
 group by
    a11.LP_ACCOUNT_ID;
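The access paths shown on the following slides were captured by prefixing the statement with EXPLAIN, which is standard Vertica usage:

```sql
EXPLAIN
select a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
       count(distinct a11.VS_LP_SESSION_ID) AS Visits
  from lp_15744040.FACT_VISIT_ROOM a11
 group by a11.LP_ACCOUNT_ID;
```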
First projection ….
table_name         projection_name      projection_column_name   column_position   sort_position
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad VS_LP_SESSION_ID          0                 0
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad LP_ACCOUNT_ID             1                 1
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad VS_LP_VISITOR_ID          2                 2
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad VISIT_FROM_DT_TRUNC       3                 3
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad ACCOUNT_ID                4                 4
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad ROOM_ID                   5                 5
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad VISIT_FROM_DT_ACTUAL      6                 6
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad VISIT_TO_DT_ACTUAL        7                 7
FACT_VISIT_ROOM    FACT_VISIT_ROOM_bad HOT_LEAD_IND              8                 8


Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_bad
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
Second projection …
table_name          projection_name        projection_column_name   column_position   sort_position
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   LP_ACCOUNT_ID            0                 0
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   VS_LP_SESSION_ID         1                 1
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   VS_LP_VISITOR_ID         2                 2
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   VISIT_FROM_DT_TRUNC      3                 3
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   ACCOUNT_ID               4                 4
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   ROOM_ID                  5                 5
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   VISIT_FROM_DT_ACTUAL     6                 6
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   VISIT_TO_DT_ACTUAL       7                 7
FACT_VISIT_ROOM     FACT_VISIT_ROOM_fix1   HOT_LEAD_IND             8                 8



  Access Path:
  +-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
  | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
  | Group By: a11.LP_ACCOUNT_ID
  | +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2)
  | | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
  | | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
  | | | Projection: lp_15744040.FACT_VISIT_ROOM_fix1
  | | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
Results …
Elapsed Time – First projection (GROUPBY HASH (SORT OUTPUT))
Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms

Elapsed Time – Second projection (GROUPBY PIPELINED)
Time: First fetch (7 rows): 38913.909 ms. All rows formatted: 38913.965 ms

The pipelined plan is roughly 6.8x faster on the same data.
Join Example
select a12.DT_WEEK AS DT_WEEK,
    a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
    count(distinct a11.VS_LP_SESSION_ID) AS Visits,
    (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
 from zzz.FACT_VISIT a11
    join zzz.DIM_DATE_TIME a12
     on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
 where (a11.LP_ACCOUNT_ID in ('57386690')
  and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
 group by a12.DT_WEEK,
    a11.LP_ACCOUNT_ID

   Filter:   LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC
   Group By: DT_WEEK, LP_ACCOUNT_ID
   Join:     VISIT_FROM_DT_TRUNC = DATE_TIME_ID
   Select:   DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID
Full Explain Plan…
 Access Path:
  +-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
  | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
  | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
  | Execute on: All Nodes
  | +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
  | | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
  | | Execute on: All Nodes
  | | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
  | | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
  | | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
  | | | Execute on: All Nodes
  | | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)
  | | | | Projection: zzz.FACT_VISIT_b0
  | | | | Materialize: a11.VISIT_FROM_DT_TRUNC
  | | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
  | | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <=
 '2011-12-31 12:52:50'::timestamp))
  | | | | Execute on: All Nodes
  | | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
  | | | | Projection: zzz.DIM_DATE_TIME_node0004
  | | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK
  | | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31
 12:52:50'::timestamp))
  | | | | Execute on: All Nodes
Explain Plan (excerpt)…
 Access Path:
 +-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
 | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
 | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
 | Execute on: All Nodes
 | +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
 | | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
 | | Execute on: All Nodes
 | | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
 | | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
 | | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
 | | | Execute on: All Nodes

 Time: First fetch (6 rows): 56654.894 ms. All rows formatted: 56654.988 ms
Solution One - Functions
   select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,
       a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
       count(distinct a11.VS_LP_SESSION_ID) AS Visits,
       (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
    from zzz.FACT_VISIT a11
    where (a11.LP_ACCOUNT_ID in ('57386690')
     and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
    group by week(a11.VISIT_FROM_DT_TRUNC),
       a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: <SVAR>, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_b0

Time: First fetch (6 rows): 33453.997 ms. All rows formatted: 33454.154 ms
Saved the Join Time
Solution Two - Pre-Join Projection
Pros:
• Eliminates join overhead
• Maintained by Vertica
Cons:
• Not flexible
• Causes overhead on load
• Needs primary/foreign keys
• Maintenance restrictions
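A pre-join projection is declared like any other projection, but over a join; Vertica requires a primary key on the dimension and a matching foreign key on the fact table first. A hedged sketch using the deck's table names (the constraint names are hypothetical):

```sql
-- Pre-join projections require declared keys on the joined tables.
ALTER TABLE zzz.DIM_DATE_TIME
      ADD CONSTRAINT pk_dim_date_time PRIMARY KEY (DATE_TIME_ID);
ALTER TABLE zzz.FACT_VISIT
      ADD CONSTRAINT fk_visit_date FOREIGN KEY (VISIT_FROM_DT_TRUNC)
      REFERENCES zzz.DIM_DATE_TIME (DATE_TIME_ID);

-- Store the join result sorted for the reporting query.
CREATE PROJECTION zzz.visit_date_time_prejoin
AS SELECT f.LP_ACCOUNT_ID, f.VISIT_FROM_DT_TRUNC, d.DT_WEEK,
          f.HOT_LEAD_IND, d.DATE_TIME_ID, f.VS_LP_SESSION_ID
   FROM zzz.FACT_VISIT f
   JOIN zzz.DIM_DATE_TIME d ON f.VISIT_FROM_DT_TRUNC = d.DATE_TIME_ID
   ORDER BY f.LP_ACCOUNT_ID, f.VISIT_FROM_DT_TRUNC, d.DT_WEEK,
            f.HOT_LEAD_IND, d.DATE_TIME_ID, f.VS_LP_SESSION_ID;

-- Populate the projection.
SELECT START_REFRESH();
```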
Solution Two - Pre-Join Projection
Sorted by: LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC, DT_WEEK, HOT_LEAD_IND, DATE_TIME_ID, VS_LP_SESSION_ID


Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin8_b0

Time: First fetch (6 rows): 35312.331 ms. All rows formatted: 35312.421 ms
                               Saved the Join Time
Solution Two - Pre-Join Projection
    Sorted By DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 542K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin_z6



Time: First fetch (6 rows): 3680.853 ms. All rows formatted: 3680.969 ms
                   Saved the Join Time and Group by hash Time
Solution Three - Denormalize
select DT_WEEK,
     a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
     count(distinct a11.VS_LP_SESSION_ID) AS Visits,
     (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
  from zzz.FACT_VISIT_Z1 a11
  where (a11.LP_ACCOUNT_ID in ('57386690')
   and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
  group by DT_WEEK,
     a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_Z1_super
Time: First fetch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms
                                 Saved the Join Time
Solution Three - Denormalize
• Changing the projection sort order
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 588K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)
| | | Projection: zzz.fact_visit_z1_pipe
| | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | Execute on: All Nodes
 Time: First fetch (6 rows): 4313.497 ms. All rows formatted: 4313.600 ms
                Saved the Join Time and Group by hash Time
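Denormalizing simply copies the dimension attribute onto the fact table so the join disappears entirely. A sketch of one way FACT_VISIT_Z1 could be built; CREATE TABLE … AS SELECT is standard Vertica, and the exact sort order of fact_visit_z1_pipe is an assumption inferred from the plan above:

```sql
-- Fold DT_WEEK into the fact table once, when the data is built.
CREATE TABLE zzz.FACT_VISIT_Z1 AS
SELECT f.*, d.DT_WEEK
FROM zzz.FACT_VISIT f
JOIN zzz.DIM_DATE_TIME d ON f.VISIT_FROM_DT_TRUNC = d.DATE_TIME_ID;

-- A projection sorted so the GROUP BY runs PIPELINED instead of HASH.
CREATE PROJECTION zzz.fact_visit_z1_pipe
AS SELECT * FROM zzz.FACT_VISIT_Z1
   ORDER BY DT_WEEK, VS_LP_SESSION_ID, LP_ACCOUNT_ID;

SELECT START_REFRESH();
```

The trade-off mirrors the pre-join projection: the join cost moves to load/build time, and DT_WEEK is stored redundantly on every fact row.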
Let’s sum it up…
• Keep it simple.
• Keep it sorted.
• Keep it joinless.
Questions?
Thank You

More Related Content

What's hot

MongoDB Chicago - MapReduce, Geospatial, & Other Cool Features
MongoDB Chicago - MapReduce, Geospatial, & Other Cool FeaturesMongoDB Chicago - MapReduce, Geospatial, & Other Cool Features
MongoDB Chicago - MapReduce, Geospatial, & Other Cool Featuresajhannan
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Brian O'Neill
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 
Let if flow: Java 8 Streams puzzles and more
Let if flow: Java 8 Streams puzzles and moreLet if flow: Java 8 Streams puzzles and more
Let if flow: Java 8 Streams puzzles and moreBhakti Mehta
 
Advanced Query Parsing Techniques
Advanced Query Parsing TechniquesAdvanced Query Parsing Techniques
Advanced Query Parsing TechniquesSearch Technologies
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To CascadingNate Murray
 
Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011mubarakss
 
managing big data
managing big datamanaging big data
managing big dataSuveeksha
 
N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0Keshav Murthy
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Keshav Murthy
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Cassandra 3 new features 2016
Cassandra 3 new features 2016Cassandra 3 new features 2016
Cassandra 3 new features 2016Duyhai Doan
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Spark Summit
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetAnkit Beohar
 

What's hot (20)

MongoDB Chicago - MapReduce, Geospatial, & Other Cool Features
MongoDB Chicago - MapReduce, Geospatial, & Other Cool FeaturesMongoDB Chicago - MapReduce, Geospatial, & Other Cool Features
MongoDB Chicago - MapReduce, Geospatial, & Other Cool Features
 
Python redis talk
Python redis talkPython redis talk
Python redis talk
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Let if flow: Java 8 Streams puzzles and more
Let if flow: Java 8 Streams puzzles and moreLet if flow: Java 8 Streams puzzles and more
Let if flow: Java 8 Streams puzzles and more
 
Advanced Query Parsing Techniques
Advanced Query Parsing TechniquesAdvanced Query Parsing Techniques
Advanced Query Parsing Techniques
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 
Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011
 
managing big data
managing big datamanaging big data
managing big data
 
N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0N1QL New Features in couchbase 7.0
N1QL New Features in couchbase 7.0
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Cassandra 3 new features 2016
Cassandra 3 new features 2016Cassandra 3 new features 2016
Cassandra 3 new features 2016
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 

Similar to Vertica Performance Optimization

Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)LivePerson
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreMariaDB plc
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...InfluxData
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
MapfilterreducepresentationManjuKumara GH
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreMariaDB plc
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Databricks
 
Enabling Applications with Informix' new OLAP functionality
 Enabling Applications with Informix' new OLAP functionality Enabling Applications with Informix' new OLAP functionality
Enabling Applications with Informix' new OLAP functionalityAjay Gupte
 
QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...
QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...
QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...QAFest
 
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...EDB
 
Olap Functions Suport in Informix
Olap Functions Suport in InformixOlap Functions Suport in Informix
Olap Functions Suport in InformixBingjie Miao
 
Advanced SQL For Data Scientists
Advanced SQL For Data ScientistsAdvanced SQL For Data Scientists
Advanced SQL For Data ScientistsDatabricks
 
Very basic functional design patterns
Very basic functional design patternsVery basic functional design patterns
Very basic functional design patternsTomasz Kowal
 
Profiling Mondrian MDX Requests in a Production Environment
Profiling Mondrian MDX Requests in a Production EnvironmentProfiling Mondrian MDX Requests in a Production Environment
Profiling Mondrian MDX Requests in a Production EnvironmentRaimonds Simanovskis
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail MarketingJonathan Sedar
 
Ugif 10 2012 ppt0000002
Ugif 10 2012 ppt0000002Ugif 10 2012 ppt0000002
Ugif 10 2012 ppt0000002UGIF
 
Performance improvements in PostgreSQL 9.5 and beyond
Performance improvements in PostgreSQL 9.5 and beyondPerformance improvements in PostgreSQL 9.5 and beyond
Performance improvements in PostgreSQL 9.5 and beyondTomas Vondra
 
Oracle GoldenGate 12c CDR Presentation for ECO
Oracle GoldenGate 12c CDR Presentation for ECOOracle GoldenGate 12c CDR Presentation for ECO
Oracle GoldenGate 12c CDR Presentation for ECOBobby Curtis
 
Optimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex PlansOptimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex PlansDatabricks
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentationMichael Keane
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1MariaDB plc
 

Similar to Vertica Performance Optimization (20)

Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Mapfilterreducepresentation
MapfilterreducepresentationMapfilterreducepresentation
Mapfilterreducepresentation
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
 
Enabling Applications with Informix' new OLAP functionality
 Enabling Applications with Informix' new OLAP functionality Enabling Applications with Informix' new OLAP functionality
Enabling Applications with Informix' new OLAP functionality
 
QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...
QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...
QA Fest 2018. Никита Кричко. Методология использования машинного обучения в н...
 
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
 
Olap Functions Suport in Informix
Olap Functions Suport in InformixOlap Functions Suport in Informix
Olap Functions Suport in Informix
 
Advanced SQL For Data Scientists
Advanced SQL For Data ScientistsAdvanced SQL For Data Scientists
Advanced SQL For Data Scientists
 
Very basic functional design patterns
Very basic functional design patternsVery basic functional design patterns
Very basic functional design patterns
 
Profiling Mondrian MDX Requests in a Production Environment
Profiling Mondrian MDX Requests in a Production EnvironmentProfiling Mondrian MDX Requests in a Production Environment
Profiling Mondrian MDX Requests in a Production Environment
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail Marketing
 
Ugif 10 2012 ppt0000002
Ugif 10 2012 ppt0000002Ugif 10 2012 ppt0000002
Ugif 10 2012 ppt0000002
 
Performance improvements in PostgreSQL 9.5 and beyond
Performance improvements in PostgreSQL 9.5 and beyondPerformance improvements in PostgreSQL 9.5 and beyond
Performance improvements in PostgreSQL 9.5 and beyond
 
Oracle GoldenGate 12c CDR Presentation for ECO
Oracle GoldenGate 12c CDR Presentation for ECOOracle GoldenGate 12c CDR Presentation for ECO
Oracle GoldenGate 12c CDR Presentation for ECO
 
Optimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex PlansOptimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex Plans
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
 

More from Zvika Gutkin

Vertica the convertro way
Vertica   the convertro wayVertica   the convertro way
Vertica the convertro wayZvika Gutkin
 
15 shades of fvertica
15 shades of fvertica15 shades of fvertica
15 shades of fverticaZvika Gutkin
 
Vertica loading best practices
Vertica loading best practicesVertica loading best practices
Vertica loading best practicesZvika Gutkin
 

More from Zvika Gutkin (6)

Vertica on aws
Vertica on awsVertica on aws
Vertica on aws
 
Vertica the convertro way
Vertica   the convertro wayVertica   the convertro way
Vertica the convertro way
 
15 shades of fvertica
15 shades of fvertica15 shades of fvertica
15 shades of fvertica
 
Recognition
RecognitionRecognition
Recognition
 
Vertica loading best practices
Vertica loading best practicesVertica loading best practices
Vertica loading best practices
 
Vertica trace
Vertica traceVertica trace
Vertica trace
 

Vertica Performance Optimization

  • 2. Agenda • What is Vertica. • How does it work. • How To Use Vertica … (The Right Way ). • Where It Falls Short. • Examples …
  • 3. MPP-Columnar DBMS 10x –100x performance of classic RDBMS. Linear Scale SQL Commodity Hardware Built-in fault tolerance
  • 4. 10x –100x performance of classic RDBMS • Column store architecture • High Compression rates • Sorted columns • Objects Segmentation/Replication.
  • 5. How Does It Work ?
  • 7. Delete • Deleted rows are only marked as deleted. • Stored in delete vector on disk. • Query merge the ROS and Deleted vector to remove deleted records. • Data is removed asynchronously during mergeout.
  • 8. Projections • Physical structure of the table (logical) • Stored sorted and compressed • Internal maintenance • At least one (super) projection. • Projection Types: – Super projection – Query specific projection – Pre join projection – Buddy projection
  • 10. What‘s Important …. • Choose the right columns (General Vs Specific). • Choose the right sort order . • Choose the right encoding . • Choose the right column to partition by . • Choose the right column to segment by .
  • 11. Where It Falls Short … • Lack of Features . • Good for specific types of queries . – Keep Queries Simple . – Use the right columns – Use Order By to help optimizer pick the right projection – Check the join column – Best if both tables order by it . – Check the join column – best if segmented by it.
  • 12. Choose the Right sort order Example select a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID, count(distinct a11.VS_LP_SESSION_ID) AS Visits, (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1 from lp_15744040.FACT_VISIT_ROOM a11 group by a11.LP_ACCOUNT_ID;
  • 13. First projection …. table_name projection_name projection_column_name column_position sort_position FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VS_LP_SESSION_ID 0 0 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad LP_ACCOUNT_ID 1 1 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VS_LP_VISITOR_ID 2 2 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VISIT_FROM_DT_TRUNC 3 3 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad ACCOUNT_ID 4 4 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad ROOM_ID 5 5 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VISIT_FROM_DT_ACTUAL 6 6 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VISIT_TO_DT_ACTUAL 7 7 FACT_VISIT_ROOM FACT_VISIT_ROOM_bad HOT_LEAD_IND 8 8 Access Path: +-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1) | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID) | Group By: a11.LP_ACCOUNT_ID | +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2) | | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID | | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3) | | | Projection: lp_15744040.FACT_VISIT_ROOM_bad | | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
• 14. Second Projection…

table_name       projection_name       projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  LP_ACCOUNT_ID           0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_SESSION_ID        1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_TRUNC    3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_ACTUAL   6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_TO_DT_ACTUAL     7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  HOT_LEAD_IND            8                8

Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
|  Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
|  Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2)
| |      Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | |      Projection: lp_15744040.FACT_VISIT_ROOM_fix1
| | |      Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
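The deck does not show the DDL behind the fixed projection, but a plausible reconstruction is a single ORDER BY change: leading the sort order with the GROUP BY column lets the engine stream sorted data through GROUPBY PIPELINED instead of building a hash table.

```sql
-- Reconstructed sketch (not shown in the deck): same columns as the
-- "bad" projection, but sorted with the GROUP BY key first.
CREATE PROJECTION lp_15744040.FACT_VISIT_ROOM_fix1 AS
SELECT *
FROM lp_15744040.FACT_VISIT_ROOM
ORDER BY LP_ACCOUNT_ID, VS_LP_SESSION_ID;
```

Because rows arrive already grouped by LP_ACCOUNT_ID (and, within it, by VS_LP_SESSION_ID), the aggregate can emit each group as soon as the key changes, with no hash table and far less memory.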
• 15. Results…

First projection (GROUPBY HASH, SORT OUTPUT):
  Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms

Second projection (GROUPBY PIPELINED):
  Time: First fetch (7 rows): 38913.909 ms. All rows formatted: 38913.965 ms
• 16. Join Example

select
    a12.DT_WEEK AS DT_WEEK,
    a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
    count(distinct a11.VS_LP_SESSION_ID) AS Visits,
    (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
join zzz.DIM_DATE_TIME a12
  on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
where (a11.LP_ACCOUNT_ID in ('57386690')
  and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by a12.DT_WEEK, a11.LP_ACCOUNT_ID;

• Filter: LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC
• Group By: DT_WEEK, LP_ACCOUNT_ID
• Join: VISIT_FROM_DT_TRUNC = DATE_TIME_ID
• Select: DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID
• 17. Full Explain Plan…

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
|  Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
|  Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| |      Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| |      Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | |      Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | |      Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | |      Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)
| | | |      Projection: zzz.FACT_VISIT_b0
| | | |      Materialize: a11.VISIT_FROM_DT_TRUNC
| | | |      Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | |      Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | |      Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | |      Projection: zzz.DIM_DATE_TIME_node0004
| | | |      Materialize: a12.DATE_TIME_ID, a12.DT_WEEK
| | | |      Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31 12:52:50'::timestamp))
| | | |      Execute on: All Nodes
• 18. Explain Plan (excerpt)…

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
|  Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
|  Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| |      Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| |      Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | |      Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | |      Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | |      Execute on: All Nodes

Time: First fetch (6 rows): 56654.894 ms. All rows formatted: 56654.988 ms
• 19. Solution One – Functions

select
    week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,
    a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
    count(distinct a11.VS_LP_SESSION_ID) AS Visits,
    (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
where (a11.LP_ACCOUNT_ID in ('57386690')
  and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by week(a11.VISIT_FROM_DT_TRUNC), a11.LP_ACCOUNT_ID;

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
|  Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
|  Group By: <SVAR>, a11.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| |      Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| |      Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | |      Projection: zzz.FACT_VISIT_b0

Time: First fetch (6 rows): 33453.997 ms. All rows formatted: 33454.154 ms
Saved the join time.
• 20. Solution Two – Pre-Join Projection

Pros:
• Eliminates the join overhead
• Maintained by Vertica

Cons:
• Not flexible
• Causes overhead on load
• Requires primary/foreign keys
• Maintenance restrictions
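A sketch of what such a pre-join projection could look like for this example. The constraint names are hypothetical, and this is a reconstruction, not the deck's actual DDL; as the cons note, the dimension needs a primary key and the fact a matching foreign key before Vertica will accept it:

```sql
-- Pre-join projections require declared PK/FK constraints
-- (hypothetical constraint names).
ALTER TABLE zzz.DIM_DATE_TIME
    ADD CONSTRAINT pk_date_time PRIMARY KEY (DATE_TIME_ID);
ALTER TABLE zzz.FACT_VISIT
    ADD CONSTRAINT fk_visit_date FOREIGN KEY (VISIT_FROM_DT_TRUNC)
    REFERENCES zzz.DIM_DATE_TIME (DATE_TIME_ID);

-- The join is performed once, at load time, and stored.
CREATE PROJECTION zzz.visit_date_time_prejoin AS
SELECT f.LP_ACCOUNT_ID,
       f.VS_LP_SESSION_ID,
       f.VISIT_FROM_DT_TRUNC,
       d.DT_WEEK
FROM zzz.FACT_VISIT f
JOIN zzz.DIM_DATE_TIME d
  ON f.VISIT_FROM_DT_TRUNC = d.DATE_TIME_ID
ORDER BY f.LP_ACCOUNT_ID, f.VISIT_FROM_DT_TRUNC;
```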
• 21. Solution Two – Pre-Join Projection

Projection sorted by:
LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC, DT_WEEK, HOT_LEAD_IND, DATE_TIME_ID, VS_LP_SESSION_ID

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)
|  Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)
|  Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)
| |      Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID
| |      Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)
| | |      Projection: lp_15744040.visit_date_time_prejoin8_b0

Time: First fetch (6 rows): 35312.331 ms. All rows formatted: 35312.421 ms
Saved the join time.
• 22. Solution Two – Pre-Join Projection

Projection sorted by: DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 542K, Rows: 10K] (PATH ID: 1)
|  Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)
|  Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)
| |      Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| |      Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)
| | |      Projection: lp_15744040.visit_date_time_prejoin_z6

Time: First fetch (6 rows): 3680.853 ms. All rows formatted: 3680.969 ms
Saved the join time and the GROUP BY hash time.
• 23. Solution Three – Denormalize

select
    DT_WEEK,
    a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
    count(distinct a11.VS_LP_SESSION_ID) AS Visits,
    (count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT_Z1 a11
where (a11.LP_ACCOUNT_ID in ('57386690')
  and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by DT_WEEK, a11.LP_ACCOUNT_ID;

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
|  Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| |      Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | |      Projection: zzz.FACT_VISIT_Z1_super

Time: First fetch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms
Saved the join time.
• 24. Solution Three – Denormalize
• Changing the projection sort order:

Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 588K, Rows: 10K] (PATH ID: 1)
|  Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
|  Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
|  Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)
| |      Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| |      Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)
| | |      Projection: zzz.fact_visit_z1_pipe
| | |      Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | |      Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | |      Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | |      Execute on: All Nodes

Time: First fetch (6 rows): 4313.497 ms. All rows formatted: 4313.600 ms
Saved the join time and the GROUP BY hash time.
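The denormalized approach can be sketched as DDL. This is a reconstruction, not the deck's actual statements: DT_WEEK is copied from the dimension into the fact table once, at load time, and a projection sorted to match the query's filter and GROUP BY yields the pipelined plan.

```sql
-- Reconstructed sketch: fold DT_WEEK into the fact table.
CREATE TABLE zzz.FACT_VISIT_Z1 AS
SELECT f.*, d.DT_WEEK
FROM zzz.FACT_VISIT f
JOIN zzz.DIM_DATE_TIME d
  ON f.VISIT_FROM_DT_TRUNC = d.DATE_TIME_ID;

-- Lead with the equality-filtered column; once LP_ACCOUNT_ID is fixed
-- by the filter, rows stream in (DT_WEEK, VS_LP_SESSION_ID) order,
-- enabling GROUPBY PIPELINED.
CREATE PROJECTION zzz.fact_visit_z1_pipe AS
SELECT *
FROM zzz.FACT_VISIT_Z1
ORDER BY LP_ACCOUNT_ID, DT_WEEK, VS_LP_SESSION_ID;
```

Compared with the pre-join projection, this trades storage and load-time work for full flexibility: no PK/FK constraints or projection maintenance restrictions apply.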
• 25. Let's Sum It Up…
• Keep it simple
• Keep it sorted
• Keep it joinless