Maximize Greenplum For Any Use Cases Decoupling Compute and Storage - Greenplum Summit 2019

Greenplum Summit 2019
Shivram Mani @shivram
Francisco Guerrero @frankgh

Published in: Software

  1. Maximize Greenplum For Any Use Cases: Decoupling Compute and Storage. Shivram Mani (@shivram), Francisco Guerrero (@frankgh). © Copyright 2019 Pivotal Software, Inc. All rights Reserved.
  2. Agenda
     ■ Enterprise Data Landscape
     ■ Accessing External Data from Greenplum
     ■ Platform Extension Framework (PXF)
     ■ Use Cases
     ■ Q+A
  3. Enterprise Data Landscape
  4. The Wild Wild West of Data?
  5. Greenplum uses PXF as a federated query engine to access external heterogeneous data.
  6. Platform Extension Framework (PXF)
     ● Tabular view for heterogeneous data
     ● Built-in connectors for various data sources/formats
     ● Pluggable framework
     ● Parallel, high-throughput data access
     ● Open source
     ● Read and write external data
  7. Architecture of PXF
     [Diagram: Master Host; Segment Host 1 (seg1, seg2, seg3) with PXF; Segment Host 2 (seg4, seg5, seg6) with PXF; External Data]
  8. Greenplum External Table
     Q: How can I access sales data residing in an S3 bucket stored in Parquet format?
     CREATE EXTERNAL TABLE sales (cust int, sku text, amount decimal, date date)
     LOCATION ('pxf://s3-bucket/2018/sales/?PROFILE=s3:parquet&SERVER=s3_sales')
     FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')
     [Callouts on the LOCATION string: path to data, profile, server]
  9. How can we scale performance when querying remote data?
  10. Performance - Predicate Pushdown
      [Diagram: MASTER receives SELECT item, amount FROM orders WHERE state = 'CA'; SEGMENT passes predicates: state=CA to PXF with JDBC; the row-oriented storage format holds rows state=NY, state=NJ, state=CA, state=CA, filtered by {state='CA'}]
      ● Predicate information is pushed to the external system
      ● External engines can apply the predicates in their own queries (e.g. JDBC)
      ● No filtering within PXF itself
      ● Partition pruning (e.g. Hive)
  11. Performance - Column Projection
      [Diagram: MASTER receives SELECT item, amount FROM orders WHERE state = 'CA'; SEGMENT passes columns: item, amount; predicates: state=CA; aggregates: count to PXF with Hive/ORC; the columnar storage format holds separate date, item, amount and state columns, read as {item:, amount:, state='CA'}]
      ● Propagate column projection metadata to external systems
      ● JDBC, Parquet & ORC
      ● Reduces network I/O
      ● Reduces remote disk I/O
      ● Improved performance for aggregate queries
  12. Use Cases
  13. Use Case: Multi-temperature data querying
      ● Storage based on operational requirements
      ● Can I work with data created a few seconds ago?
      ● Can I run a report on data from a few days ago?
      ● Can I inspect the data archived months or years ago?
      [Diagram: HOT DATA (In-Memory Database), WARM DATA (RDBMS), COLD DATA (Data Lake)]
  14. Use Case: Elastic scaling with Greenplum
      ● Greenplum on K8s for elastic compute
      ● Elastic storage with S3/Azure/Google
      ● Ability to separate compute from storage
      ● On-demand data warehouses
  15. Use Case: Access heterogeneous data on multiple clouds
      ● Different cloud providers based on business requirements
      ● Low-cost storage
      ● No storage admin
      ● Data doesn't need to be copied
  16. Use Case: Access heterogeneous data on multiple clouds
      [Diagram: tables Historical_Orders, Historical_Invoices and Product_Catalog stored in s3-bucket-orders and s3-bucket-price; an admin migrates data from s3-bucket-orders to Azure Blob Storage; the query below appears both before and after the migration]
      SELECT * FROM historical_orders o, product_catalog p WHERE o.product_id = p.product_id
  17. Use Case: Access heterogeneous data on multiple clouds
      Historical_orders table data on S3:
      CREATE EXTERNAL TABLE historical_orders (item int, amount money)
      LOCATION ('pxf://s3-bucket-orders/path?PROFILE=s3:parquet&SERVER=s3_orders')
      FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
      Historical_orders table data now on Azure Data Lake:
      CREATE EXTERNAL TABLE historical_orders (item int, amount money)
      LOCATION ('pxf://my.azuredatalakestore.net/path?PROFILE=adl:parquet&SERVER=azure')
      FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  18. Summary
      Greenplum embraces the modern data landscape
      ● Scale and manage compute independently from storage
      ● Federate queries across heterogeneous data sources
      ● Cloud agnostic
      Data is available for analytics with Greenplum no matter its form and where it resides!
  19. #ScaleMatters
  20. Greenplum External Table
      Define an external table with the following:
      ● the schema of the external data
      ● the protocol pxf
      ● the location of the data in an external system
      ● the profile to identify the specific connector
      ● the compression codec of the data
      ● the format of the external data
      CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
      ( col_name data_type [,...] | LIKE other_table )
      LOCATION ('pxf://<path to data>?PROFILE=[<profile_name>|<data_store:data_type>]&COMPRESSION_CODEC=[snappy|gzip|lzo|bzip2][&<CUSTOM_OPTIONS>=<value>[...]]')
      FORMAT '[TEXT|CSV|CUSTOM]'
      Sample data:
      cust, sku, amount, date
      1234, ABC, $9.90, 4/01
      1235, CDE, $8.80, 3/30
      Example:
      CREATE EXTERNAL TABLE sales (cust int, sku text, amount decimal, date date)
      LOCATION ('pxf:///2018/sales.csv?PROFILE=hdfs:text')
      FORMAT 'TEXT'
  21. PXF Multi Server
      PXF supports accessing multiple external datastores simultaneously
      ● A server identifies an external datastore
      ● Staging directory servers/ under ${PXF_CONF}
      ● Contains relevant configuration files under servers/{server_name}/
        ○ HDFS: core-site.xml, hdfs-site.xml, ...
        ○ S3: s3-site.xml containing access properties
      CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
      ( col_name data_type [,...] | LIKE other_table )
      LOCATION ('pxf://<path to data>?PROFILE=<data_store:data_type>&SERVER=<server_name>')
      Sample data:
      cust, sku, amount, date
      1234, ABC, $9.90, 4/01
      1235, CDE, $8.80, 3/30
      Example:
      CREATE EXTERNAL TABLE sales (cust int, sku text, amount decimal, date date)
      LOCATION ('pxf://s3-bucket-sales/2018/sales.csv?PROFILE=s3:text&SERVER=s3_sales')
      FORMAT 'TEXT'
  22. Performance in PXF
      ● Parallel access to data
      ● Predicate pushdown
      ● Column projection
      [Annotated query: "SELECT item, amount" is labeled column projection; "WHERE state = 'CA'" is labeled predicate pushdown]
  23. Performance in PXF
      ● Parallel access to data
      ● Column projection
      ● Predicate pushdown
      [Annotated query: "SELECT item, amount" is labeled column projection; "WHERE state = 'CA'" is labeled predicate pushdown]
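
The performance slides describe predicate pushdown and column projection as ways to reduce the data shipped from an external source to Greenplum. The Python sketch below is a toy model of that idea only; it is not PXF code. `ORDERS`, `scan`, `values_transferred`, and all row values are hypothetical names and data invented for illustration, and the count of transferred values is a crude stand-in for network I/O.

```python
# Toy model of predicate pushdown and column projection (not PXF code).
# ORDERS, scan(), values_transferred() and all row values are hypothetical.

ORDERS = [
    {"item": "A", "amount": 10, "state": "NY"},
    {"item": "B", "amount": 20, "state": "NJ"},
    {"item": "C", "amount": 30, "state": "CA"},
    {"item": "D", "amount": 40, "state": "CA"},
]


def scan(rows, columns=None, predicate=None):
    """Return the rows the external source would send over the network.

    predicate: optional function applied at the source (pushdown).
    columns:   optional list of column names to keep (projection).
    """
    out = []
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # filtered at the source, never transferred
        if columns is not None:
            row = {name: row[name] for name in columns}
        out.append(row)
    return out


def values_transferred(rows):
    """Count column values in a result, as a crude proxy for network I/O."""
    return sum(len(row) for row in rows)


# Without pushdown or projection: every row and column is shipped,
# then filtered and trimmed on the receiving side.
full = scan(ORDERS)
local_result = [
    {"item": r["item"], "amount": r["amount"]} for r in full if r["state"] == "CA"
]

# With both: only matching rows and requested columns are shipped.
pushed = scan(
    ORDERS,
    columns=["item", "amount"],
    predicate=lambda r: r["state"] == "CA",
)

print(values_transferred(full))    # 12
print(values_transferred(pushed))  # 4
print(local_result == pushed)      # True
```

Both paths produce the same answer for the query; the difference in this model is only how many values cross the boundary between the source and the consumer.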

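The external table definitions in the slides all use a LOCATION string of the form `pxf://<path to data>?PROFILE=...&SERVER=...`. As a small sketch of that pattern, the Python helper below assembles such a string. `pxf_location` is a hypothetical helper written for this illustration; it is not part of PXF or Greenplum, and it performs no URL encoding or validation.

```python
def pxf_location(path, profile, server=None, **options):
    """Assemble a pxf:// LOCATION string in the form shown on the slides.

    Hypothetical helper for illustration only: values are inserted as-is,
    with no URL encoding and no check that the profile or server exists.
    """
    params = [("PROFILE", profile)]
    if server is not None:
        params.append(("SERVER", server))
    # Extra custom options are upper-cased to match the slide templates.
    params.extend((name.upper(), value) for name, value in options.items())
    query = "&".join(f"{name}={value}" for name, value in params)
    return f"pxf://{path}?{query}"


print(pxf_location("s3-bucket/2018/sales/", "s3:parquet", server="s3_sales"))
# pxf://s3-bucket/2018/sales/?PROFILE=s3:parquet&SERVER=s3_sales
```

Keeping the server name as a separate argument mirrors the point of the multi-server slide: the same table definition can be pointed at a different datastore by changing only the path, profile, and server values.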