Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Proxy - Greenplum Summit 2019


Published on

Greenplum Summit 2019
Zack Odom - Field Engineer, Pivotal

Erik Brandsberg - CTO, Heimdall Data Inc

Published in: Software

  1. Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Proxy
     Zack Odom - Field Engineer, Pivotal
     Erik Brandsberg - CTO, Heimdall Data Inc
     © Copyright 2019 Pivotal Software, Inc. All rights Reserved.
  2. The Fundamental Theorem of Software Engineering
     "We can solve any problem by introducing an extra level of indirection."
     A term originated by Andrew Koenig to describe a remark by Butler Lampson, attributed to the late David J. Wheeler.
     Question: What provides indirection for databases?
  3. Backgrounds
     Greenplum History
     ● Exists since PostgreSQL 7.4 - went live in 2005
     ● Merged with PostgreSQL until 8.2, then forked
     ● Product evolved, company acquired
     ● Open source in 2015
     ● Greenplum gets closer to the latest PostgreSQL with every newly merged version
       ○ PG 8.3 is merged (GPDB Version 5)
       ○ PG 9.4 is WIP (GPDB Version 6)
     Heimdall History
     ● Founded in 2014
     ● Advanced Partner with Pivotal
     ● AWS Competency Partner
     ● Database vendor neutral:
       ○ Postgres, SQL Server, MySQL, JDBC data sources
  4. Pivotal Greenplum
     Powerful, Postgres-based MPP and multi-cloud analytics on petabyte-scale data
     Challenges
     • Legacy scale-up DBs are expensive to operate
     • Hadoop doesn’t fit low-latency, iterative analytics with high user concurrency
     • Multiple environments with messy, disjointed structured and unstructured data
     Greenplum Delivers
     • Multi-cloud, open-source analytics data platform
     • Massively parallel processing with machine learning and ANSI SQL compliance
     • Unify and query structured and unstructured data from native, HDFS, and cloud storage, including text, spatial, and graph data
     Benefits
     • Scales linearly with hardware for optimal cost and performance
     • Faster workflow; train models in parallel, publish to DB for rapid parallel scoring
     • Analyze more types of data more quickly for faster, deeper insights
  5. Massively Parallel Postgres Architecture <Postgres in Parallel>
     [Architecture diagram centered on Pivotal Greenplum. Labels: Hadoop Data Lakes; Massively Parallel Data Warehouse; Public Cloud Data Lakes; Predefined Libraries; Programmatic; GPText; Massively Parallel Analytical Processing; High Speed of Ingestion; Massively Parallel Data Load from External Sources; In-DB Predictive Analytics; High Speed of Processing]
  6. Heimdall Architecture
     [Architecture diagram: application servers, each running an application with the Heimdall Data Driver/Proxy, alongside a standalone Heimdall Database Proxy]
     Features shown:
     • SQL Auto-Caching
     • Auto-Invalidation
     • Auto-Cache Refresh
     • Automated Failover
     • Load Balancing
     • Read/Write Split
     • Batch Processing
     • OLTP/OLAP Routing
     • Query Triggers
     • Query Analytics, Transformation, & Firewall
     • Connection Pooling
  7. OLAP vs. OLTP
     Analytics-based (OLAP)
     ● High latency, many reads, fewer writes
     ● Bulk ETL operations, complex queries
     ● Calculates results across very large datasets
     ● Most are purpose-built to scale (and expand) to many nodes, with replication and HA built in
     ● Optimizer should evaluate the best plan using statistics for more complex analytical queries
     ● SLAs are not sub-second/minute
     ● Caching and materialized views are typically not leveraged due to the inherent nature of deep/wide analytical queries
     Transactional-based (OLTP)
     ● Low latency, memory-intensive operations
     ● Singleton ETL operations including DML
     ● Typically targeted data retrieval
     ● Scaling is limited and expensive; single node for OLTP purposes (Postgres)
     ● Optimizer does not need to be intelligent, as most queries are single-threaded
     ● SLAs are typically sub-second
     ● Caching is utilized heavily to meet SLAs
  8. OLTP happens - on analytical shared-nothing systems
     Many applications are ported from Oracle, etc.
     ● Greenplum will open and spawn many threads based on query type
     ● Singleton ops take up unnecessary threads, exhausting finite RAM/CPU resources
     ● Pooling agents can alleviate pressure on the Master, but throughput will be affected by the number of resources used and the operation type
     ● Small, quick queries are not cached, resulting in re-reads (lookup/dim tables, etc.)
     ● Historically, applications needed to be rewritten to use batch loading operations, which is expensive!
     ● When combined, OLTP and OLAP workloads are referred to as HTAP: "Hybrid Transactional/Analytical Processing"
  9. HTAP Use Case: Dell, Inc.
     Problem: Legacy Apps with Singleton DML (Insert/Update/Delete)
     ● Existing infrastructure supported applications performing single inserts/updates/deletes in volume
     ● Greenplum’s MPP design has slow commit times for singleton inserts
     ● Customer wanted to support DML without a redesign
     Solution: Heimdall Auto-Batching into Greenplum
     ● DML operations are isolated and batched by Heimdall
     ● Commits are performed over many operations, reducing overhead
     ● Exceptions are tracked by Heimdall for later analysis
     Result: DML Performance Increased by 20x, Meeting Requirements
  10. Asynchronous Batch Processing
      [Diagram: numbered application DML requests enter a queue and are grouped into batches (#1 to #4); the batch size shown is 4]
      START TRANSACTION; DML 1; DML 2; DML 3; DML 4; COMMIT;
      Exceptions are logged, removed from the batch, and the transaction is restarted.
      Benefits:
      • Lower CPU overhead due to fewer commits
      • Improved application response time
      • Improved DML scale
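
The batching flow on this slide can be sketched in a few lines. This is only an illustration, not Heimdall's implementation: it uses Python's standard `sqlite3` module as a stand-in for Greenplum, runs synchronously rather than asynchronously, and the names `commit_batch`, `DmlBatcher`, and the `events` table are hypothetical.

```python
import sqlite3
from typing import List, Tuple

Statement = Tuple[str, tuple]  # (SQL text, bound parameters)


def commit_batch(conn: sqlite3.Connection, batch: List[Statement]) -> List[Tuple[Statement, str]]:
    """Run a batch inside one transaction.

    If a statement fails, log it, remove it from the batch, roll back,
    and restart the transaction with the remaining statements.
    """
    pending = list(batch)
    exceptions: List[Tuple[Statement, str]] = []
    while pending:
        failed_at = None
        conn.execute("BEGIN")
        for position, (sql, params) in enumerate(pending):
            try:
                conn.execute(sql, params)
            except sqlite3.Error as err:
                failed_at = position
                exceptions.append((pending[position], str(err)))
                break
        if failed_at is None:
            conn.execute("COMMIT")  # one commit covers the whole batch
            break
        conn.execute("ROLLBACK")
        del pending[failed_at]
    return exceptions


class DmlBatcher:
    """Queues single DML statements and flushes them in fixed-size batches."""

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 4):
        self.conn = conn
        self.batch_size = batch_size
        self.queue: List[Statement] = []
        self.exceptions: List[Tuple[Statement, str]] = []

    def submit(self, sql: str, params: tuple = ()) -> None:
        self.queue.append((sql, params))
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        batch, self.queue = self.queue, []
        self.exceptions.extend(commit_batch(self.conn, batch))


# isolation_level=None lets the sketch issue BEGIN/COMMIT/ROLLBACK explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, note TEXT)")

batcher = DmlBatcher(conn, batch_size=4)
for event_id in (1, 2, 2, 3):  # the repeated id 2 violates the primary key
    batcher.submit("INSERT INTO events (id, note) VALUES (?, ?)", (event_id, "x"))

rows = conn.execute("SELECT id FROM events ORDER BY id").fetchall()
print(rows, len(batcher.exceptions))
```

In this run the duplicate insert is logged as an exception and dropped, and the remaining three statements are committed together in a single transaction.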
  11. DEMO
  12. Customer Example: CATL
      Problem: Slow Report Generation in Tableau
      ● Each report contained up to 30 queries, taking 30 seconds each
      ● Data was updated every two hours
      ● Reports were viewed at random intervals by management
      Solution: Heimdall Auto-Refresh Caching into GemFire
      ● Redundant queries were learned by Heimdall
      ● A stored procedure run after each data load tells Heimdall to invalidate the modified tables
      ● The query cache was refreshed from Greenplum into GemFire by Heimdall
      Result: Average Report Generation Went From 17s to 3s
  13. Auto-Refresh Caching
      [Sequence diagram. Participants: Application; Caches (L1+L2); Query Tracker; Data Source; Bulk Data Upload. Steps: Initial Request & Response; Cache Populated; Later Request & Cached Result; Invalidation SP (or Trigger); Invalidation Event; Cache Invalidated; Queries Reissued; Refresh Request & Response; Cache Re-Populated]
      Auto-Refresh targets finite query pattern environments, i.e., reporting and dashboard interfaces
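
A minimal sketch of the auto-refresh cycle on this slide follows. It is an illustration under assumptions, not Heimdall's design: the class `AutoRefreshCache`, its methods, and the requirement that the caller names the tables each query reads are all hypothetical, and a plain Python function stands in for the data source.

```python
from typing import Callable, Dict, Iterable, List, Set


class AutoRefreshCache:
    """Caches query results and re-runs tracked queries on table invalidation."""

    def __init__(self, run_query: Callable[[str], list]):
        self.run_query = run_query                     # talks to the data source
        self.results: Dict[str, list] = {}             # cached result per query
        self.queries_by_table: Dict[str, Set[str]] = {}  # the "query tracker"

    def get(self, sql: str, tables: Iterable[str]) -> list:
        if sql not in self.results:
            # Initial request: fetch from the data source and populate the cache.
            self.results[sql] = self.run_query(sql)
            for table in tables:
                self.queries_by_table.setdefault(table, set()).add(sql)
        return self.results[sql]

    def invalidate(self, table: str) -> int:
        """Invalidation event: reissue every tracked query that reads `table`."""
        stale = sorted(self.queries_by_table.get(table, ()))
        for sql in stale:
            self.results[sql] = self.run_query(sql)    # cache re-populated
        return len(stale)


# Stand-in data source: a list of sale amounts and a log of issued queries.
sales: List[int] = [100, 250]
calls: List[str] = []


def run_query(sql: str) -> list:
    calls.append(sql)
    return [sum(sales)]


cache = AutoRefreshCache(run_query)
query = "SELECT SUM(amount) FROM sales"

first = cache.get(query, tables=["sales"])    # goes to the data source
second = cache.get(query, tables=["sales"])   # served from the cache
sales.append(50)                              # bulk data upload
refreshed = cache.invalidate("sales")         # stored procedure or trigger fires
third = cache.get(query, tables=["sales"])    # served from the refreshed cache
print(first, second, third, len(calls))
```

Because the refresh happens at invalidation time, the request after the data load is still answered from the cache, which is why this pattern suits a finite set of recurring report queries.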
  14. Customer Example: Questis w/ Aurora for Postgres
      Problem: Productizing MVP (Minimum Viable Product)
      ● Development had focused on features, not performance
      ● No cache layer had been implemented during MVP development
      ● In use, many redundant queries were being performed
      Solution: Heimdall Caching Logic for Amazon ElastiCache
      ● Reduced database load by 90%
      ● Improved page generation time
      ● Auto-invalidation gave peak cache efficiency without stale data
      Result: MVP code was put into production without rewrites for caching and met customer SLAs
  15. Pivotal Greenplum: Learn More
      ● Find out more about Pivotal Greenplum and Heimdall at
        ○
        ○
      ● OR learn more about the open source Greenplum at
        ○
      ● OR give it a try:
        ○ Amazon AWS, Azure, Google GCP or Heimdall website
      ● Check for the Heimdall Q/A Deep Dive (Date TBD)
  16. Q & A