
Present & Future of Greenplum Database A massively parallel Postgres Database - Greenplum Summit 2019


Greenplum Summit 2019
Ivan Novick
@NovickGreenplum


  1. © Copyright 2019 Pivotal Software, Inc. All Rights Reserved. Ivan Novick, @NovickGreenplum, March 2019. Present & Future of Greenplum Database: A massively parallel Postgres Database
  2. © Copyright 2019 Pivotal Software, Inc. All Rights Reserved. Greenplum Database v5: Mission-Critical Analytical Database Platform
  3. GPDB v5: Mission-Critical Analytical Database Platform. A well-rounded and proven feature set: ● Proven in Mission-Critical Use Cases ● ORCA Optimizer ● Resource Groups & PgBouncer for Concurrency (see the sketch below) ● In-Database Analytics ● External Data Federation Ecosystem ● Pivotal Greenplum Command Center 4.x ● Updated Backup and Migration Tooling. “Pivotal Greenplum is often used in mission-critical use cases, where downtime is not well-tolerated.” -- Gartner MQ 2019
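     For illustration, a minimal sketch of the Resource Groups bullet above, assuming the standard Greenplum 5 resource group DDL; the group name, limits, and role are invented for the example:

        -- Illustrative only: cap concurrency, CPU, and memory for a class of users
        CREATE RESOURCE GROUP analytics_rg WITH (CONCURRENCY=10, CPU_RATE_LIMIT=20, MEMORY_LIMIT=20);
        -- Queries from this role are now governed by the group's limits
        ALTER ROLE report_user RESOURCE GROUP analytics_rg;

     PgBouncer then sits in front of the database to pool client connections, so the concurrency limit applies to a manageable number of server sessions.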
  4. © Copyright 2019 Pivotal Software, Inc. All Rights Reserved. Greenplum Database v6: Massive Postgres Power
  5. GPDB v6: Massive Postgres Power. What if Greenplum were a superset, not a subset, of Postgres? ● Postgres 9.4 merged ● WAL Replication ● Row-Level Locking for Updates/Deletes ● Foreign Data Wrapper API ● PG Extensions, e.g. pgaudit ● Recursive CTE ● JSON, JSONB, FTS, GIN Index. “Customers frequently called out the open-source alignment with PostgreSQL as a strong and cost-effective positive” -- Gartner MQ 2019
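     As a small illustration of the Postgres 9.4 features listed above (JSONB with a GIN index, plus a recursive CTE), here is a hedged sketch; the tables events and departments and their columns are invented for the example:

        -- Illustrative only: JSONB column with a GIN index for containment queries
        CREATE TABLE events (id int, payload jsonb) DISTRIBUTED BY (id);
        CREATE INDEX events_payload_idx ON events USING gin (payload);
        SELECT count(*) FROM events WHERE payload @> '{"status": "dropped"}';

        -- Recursive CTE walking a parent/child hierarchy
        WITH RECURSIVE org(id, parent_id) AS (
            SELECT id, parent_id FROM departments WHERE parent_id IS NULL
            UNION ALL
            SELECT d.id, d.parent_id FROM departments d JOIN org o ON d.parent_id = o.id
        )
        SELECT count(*) FROM org;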
  6. GPDB v6: OLTP Performance with Greenplum. Up to 50x performance gain on pgbench in early testing: ● Greenplum has always been ACID with transaction semantics ● Many Analytical Systems Require a Mix of Analytical and OLTP Queries ● Table Lock on Updates & Deletes removed ● Distributed Deadlock Detector introduced ● Concurrent OLTP Operations allowed. “Customers frequently called out the open-source alignment with PostgreSQL as a strong and cost-effective positive” -- Gartner MQ 2019
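     A hedged illustration of the row-level locking change above: in v6 two sessions can update different rows of the same heap table concurrently instead of serializing behind a table lock. The accounts table and values are invented for the example:

        -- Illustrative only: a simple heap (non-append-optimized) table
        CREATE TABLE accounts (id int, balance numeric) DISTRIBUTED BY (id);
        INSERT INTO accounts SELECT i, 100 FROM generate_series(1, 1000) i;

        -- Session 1
        BEGIN;
        UPDATE accounts SET balance = balance - 10 WHERE id = 1;

        -- Session 2 proceeds concurrently; only the individual rows are locked
        BEGIN;
        UPDATE accounts SET balance = balance + 10 WHERE id = 2;

        -- Each session commits independently
        COMMIT;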
  7. GPDB v6: Big Data Features #ScaleMatters ● Online Expansion w/ Jump Consistent Hash ● Star-Schema DW with Replicated Tables ● Join/Aggregate Query Performance with Eager Aggregation Optimizations ● zStandard Compression. “Reference customers for Pivotal praised the overall performance and scalability of Pivotal Greenplum” -- Gartner MQ 2019
  8. GP v5 Expand Example. [Diagram: a Detailed Call Records table distributed by Call ID; GPEXPAND in v5 performs a RESHUFFLE ALL, redistributing every call id across the expanded set of segments.]
  9. GP v6 Online Expand w/ Jump Consistent Hash. [Diagram: the same Detailed Call Records table distributed by Call ID; with jump consistent hash, GPEXPAND moves only the rows mapped to the new segments, so existing rows stay in place and data movement is minimal.]
  10. GP v6 Replicated Tables. [Diagram: a distributed fact table of call records is joined on each segment against a dimension table whose full copy is replicated to every segment, so the join needs no motion of dimension rows.] CREATE TABLE CallerUser (CallerId int, Attribute text) DISTRIBUTED REPLICATED;
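     To make the pattern above concrete, a hedged sketch with an invented fact table CallRecords joined against the replicated dimension table; each segment already holds a full copy of CallerUser, so the dimension side needs no broadcast at query time:

        -- Illustrative only: distributed fact table plus replicated dimension table
        CREATE TABLE CallRecords (CallId int, CallerId int, Duration int) DISTRIBUTED BY (CallId);
        CREATE TABLE CallerUser (CallerId int, Attribute text) DISTRIBUTED REPLICATED;

        -- The join executes locally on every segment
        SELECT u.Attribute, count(*)
        FROM CallRecords c JOIN CallerUser u ON c.CallerId = u.CallerId
        GROUP BY u.Attribute;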
  11. Eager-Agg Optimization in GPDB v6. create table foo (j1 int, g1 int, s1 int); insert into foo select i%10000, i%1000, i from generate_series(1,100000000) i; ● 10,000 unique join values ● 1,000 unique grouping values. create table bar (j2 int, g2 int, s2 int); insert into bar select i%100, i%1000, i from generate_series(1,100000) i; ● 100 unique join values ● 1,000 unique grouping values. Query: select sum(s1) from foo, bar where j1 = j2 and s1%2 = 0 group by g1, g2; Greenplum v5: 63.8 seconds. Greenplum v6: 7.4 seconds. ~9X Improvement.
  12. Aggregate Queries over Join, GPDB v5. Find the loss per line item for all returned items, join the line items to the orders, then group them by store and compute the aggregate loss. This is a straightforward translation of the query into the query plan; if each order has a large number of line items, the join results can be quite large and expensive. [Plan diagram: σ (L_RETURNFLAG = 'R') on LINEITEM → L_LOSS: L_EXTENDEDPRICE * (1 - L_DISCOUNT) → ⨝ (L_ORDERKEY = O_ORDERKEY) with ORDERS → GROUP BY O_STORE (SUM(L_LOSS)).]
  13. Eager Agg Optimization, GPDB v6. ● Find the loss of revenue for each order ● Join the aggregated view with table ORDERS ● Compute the total loss for each store ● Benefit: the inner group-by reduces the number of rows flowing into the join. [Yan95] W. P. Yan and P. Larson, "Eager Aggregation and Lazy Aggregation", VLDB 1995. [Plan diagram: σ (L_RETURNFLAG = 'R') on LINEITEM → L_LOSS: L_EXTENDEDPRICE * (1 - L_DISCOUNT) → GROUP BY L_ORDERKEY (L_ORDERLOSS: SUM(L_LOSS)) → ⨝ (L_ORDERKEY = O_ORDERKEY) with ORDERS → GROUP BY O_STORE (SUM(L_ORDERLOSS)).]
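     To make the transformation above concrete, a hedged SQL sketch of the two equivalent query shapes. The schema follows the slides (O_STORE is not a standard TPC-H column); in v6 the optimizer applies the eager-aggregation rewrite automatically, so the second form is shown only for illustration:

        -- GPDB v5 plan shape: join first, then aggregate
        SELECT o.O_STORE, SUM(l.L_EXTENDEDPRICE * (1 - l.L_DISCOUNT)) AS total_loss
        FROM LINEITEM l JOIN ORDERS o ON l.L_ORDERKEY = o.O_ORDERKEY
        WHERE l.L_RETURNFLAG = 'R'
        GROUP BY o.O_STORE;

        -- GPDB v6 plan shape: pre-aggregate per order, then join (eager aggregation)
        SELECT o.O_STORE, SUM(agg.L_ORDERLOSS) AS total_loss
        FROM (
            SELECT L_ORDERKEY, SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS L_ORDERLOSS
            FROM LINEITEM
            WHERE L_RETURNFLAG = 'R'
            GROUP BY L_ORDERKEY
        ) agg
        JOIN ORDERS o ON agg.L_ORDERKEY = o.O_ORDERKEY
        GROUP BY o.O_STORE;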
  14. GPDB v6 zStd Compression: same or more for less. ● Open Source ● Lower CPU cycles with same or better compression ● Originated at Facebook. CREATE TABLE call_data_records(callid int4, calldetails json) WITH (appendonly=true, compresstype=zstd, orientation=column) DISTRIBUTED BY (callid);
  15. Pivotal Greenplum 6 Roadmap
  16. Containerized Greenplum w/ GPDB v6. ● GP embedded in containers for portability and dependency management ● Containers managed by Kubernetes for higher availability and elasticity ● Kubernetes operator used for automation. [Diagram: an operator automating Greenplum pods in a Kubernetes cluster.]
  17. © Copyright 2019 Pivotal Software, Inc. All Rights Reserved. Greenplum Database v7: Beyond the Cluster
  18. GPDB v7: Beyond the Cluster. We have all this Postgres infrastructure in GPDB v6; now let's use it. ● Postgres 9.6 target ● DB Snapshots / Backup ● Streaming Replication ● Log Shipping and Reconciliation ● Greenplum as a source for Kafka ● Greenplum as a source for CDC Tools ● Greenplum-to-Greenplum Inter-Cluster Queries. “You do this and you can beat Oracle” -- US Federal Customer, 2018
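     A hedged sketch of the Greenplum-to-Greenplum inter-cluster query idea above, using the stock postgres_fdw foreign data wrapper as the mechanism; the server name, credentials, and table are invented, and the eventual v7 feature may use a different, Greenplum-specific wrapper:

        -- Illustrative only: query a table that lives in another Greenplum cluster
        CREATE EXTENSION postgres_fdw;
        CREATE SERVER remote_gpdb FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'gp-cluster-2.example.com', port '5432', dbname 'analytics');
        CREATE USER MAPPING FOR CURRENT_USER SERVER remote_gpdb
            OPTIONS (user 'reporter', password 'secret');
        CREATE FOREIGN TABLE remote_call_records (callid int4, calldetails json)
            SERVER remote_gpdb OPTIONS (table_name 'call_data_records');

        -- The remote cluster's data is now queryable alongside local tables
        SELECT count(*) FROM remote_call_records;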
  19. GPDB v7: Thought Leadership in Database AI. Define Artificial Intelligence: does it make sense to integrate intelligence into an analytical platform? ● 2019: Apache MADlib is focused on Deep Learning and GPU processing ● 2019: Pivotal's GPText solution will add more cognitive understanding of human language ● Combine with existing functions: PostGIS geospatial; Apache MADlib machine learning & graph; Python and R libraries; SQL at scale ● This is a platform for modern AI! “With the Apache MADlib analytics libraries, Pivotal Greenplum has capable in-database analytics that allow for predictive modeling and ML to be applied to relational data.” -- Gartner MQ 2019
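     As a minimal sketch of the in-database machine learning mentioned above, a hedged example using Apache MADlib's logistic regression training function; the call_churn table, its columns, and the model table name are invented for illustration:

        -- Illustrative only: train a logistic regression model inside the database
        -- (assumes the madlib extension is installed and call_churn holds a boolean
        --  label 'churned' plus numeric features duration and num_calls)
        SELECT madlib.logregr_train(
            'call_churn',                    -- source table
            'call_churn_model',              -- output model table
            'churned',                       -- dependent variable
            'ARRAY[1, duration, num_calls]'  -- independent variables (1 = intercept)
        );

        -- Inspect the fitted coefficients
        SELECT coef FROM call_churn_model;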
  20. “Greenplum Database, soar with us to new heights”
  21. #ScaleMatters © Copyright 2019 Pivotal Software, Inc. All Rights Reserved.
