Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
ESA UNCLASSIFIED - For Official Use
Distributing big astronomical catalogues
with Greenplum
Pilar de Teodoro
European Spac...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 2
Outlines
• What is ESA an...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 3
European Space Agency
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 5
Data Flow
Mission Operati...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 6
http://archives.esac.esa....
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 7
Euclid Scientific Archive...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 8
The ESA Fleet for Astroph...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 9
Euclid will add
45PB by 2...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 10
Databases Size at ESDC (...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 11
Database Systems at ESDC...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 12
Databases HW
Chosen to s...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 13
Plan for Scale-out Tests...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 14
PostgreSQL Scale-out Sol...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 15
Postgres-XL Architecture
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 16
Postgres-XL Architecture...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 17
Citus Architecture
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 18
Citus Architecture for T...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 19
Greenplum Architecture
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 20
Greenplum architecture f...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 21
Simple Query Tests: Post...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 22
Postgres Single Instance...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 24
Hardware Testing for
Env...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 25
Greenplum Testing
• Inge...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 26
Greenplum Testing
Commun...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 27
GP Testing: Ingestion
gp...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 28
GP Testing: Expansion
GP...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 29
GP Testing: Performance
...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 30
GP testing: Xmatch Perfo...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 32
GP testing: Parameter Qu...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 33
GP Testing: Performance
...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 34
GP Testing: Performance
...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 35
GP Testing: High Availab...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 36
GP Testing: Managing Res...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 37
GP Testing: Resource man...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 38
Lessons Learnt and what ...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 39
v Greenplum is the most ...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 40
Near Future Tests
Euclid...
ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 42
Thank you
Upcoming SlideShare
Loading in …5
×

Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019

327 views

Published on

Greenplum Summit 2019
Pilar de Teodoro


Published in: Software
  • Be the first to comment

Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019

  1. 1. ESA UNCLASSIFIED - For Official Use Distributing big astronomical catalogues with Greenplum Pilar de Teodoro European Space Astronomy Centre, Madrid, Spain Greenplum Summit, NYC, 19/03/2019 @pteodoro
  2. 2. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 2 Outlines • What is ESA and ESDC • Why scaling out • Tests performed • Lessons learnt and what is missing • Next steps
  3. 3. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 3 European Space Agency
  4. 4. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 5 Data Flow Mission Operations Center ESOC: Germany Science Operations Center ESAC: Spain Esac Science Data Center ESAC: Spain
  5. 5. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 6 http://archives.esac.esa.int
  6. 6. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 7 Euclid Scientific Archive System Access to scientific data online
  7. 7. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 8 The ESA Fleet for Astrophysics Data at ESDC
  8. 8. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 9 Euclid will add 45PB by 2025
  9. 9. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 10 Databases Size at ESDC (January 2019) 0 5000 10000 15000 20000 25000 XSA CSA HST "HSA" PSA ESASKY Euclid LPFSA GACS Size (GB) Size (GB)
  10. 10. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 11 Database Systems at ESDC Most archives: PostgreSQL: 9.4 to 11.1 (15 operational databases) • Open Source • Big community • Spherical queries (pg_sphere, q3c, healpix, postgis) • Professional companies support (that we do not use at our own risk) ISO: Sybase Euclid DPS: Oracle ESASky, PSA (some functionalities): ElasticSearch Euclid Prototype: Spark Import from Files, MySQL…
  11. 11. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 12 Databases HW Chosen to scale up: Largest (GACS): Dell PowerEdge R920 Memory: 1.5TB 48 cores 7 TB SSD 13 TB SAS 21TB NetApp
  12. 12. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 13 Plan for Scale-out Tests Objective: Big Datasets queried in efficient times (Requirements). Ingestion performance. Pre-Requisites: Resources, data to ingest (Catalogues, Relational tables) Comparison: ingestion time, retrieving time, administration, problems arisen.
  13. 13. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 14 PostgreSQL Scale-out Solutions Tested Citus by CitusData • Citus 7, extension installed with Postgres-9 Postgres-XL by 2nd Quadrant • Postgres-10 alpha2 (postgres 10.3)- Euclid • Postgres-XL 9.5 R1.6 - Gaia • Pivotal GreenPlum 5.9
  14. 14. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 15 Postgres-XL Architecture
  15. 15. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 16 Postgres-XL Architecture for Testing Scicloud-27 VMs
  16. 16. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 17 Citus Architecture
  17. 17. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 18 Citus Architecture for Testing • PostgreSQL-9/PostgreSQL-10 • Citus extension:7.1 Master Node Worker Node1 Worker Node2 Worker Node3 32GB RAM 32GB RAM 32GB RAM 32GB RAM Scicloud-4VMs
  18. 18. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 19 Greenplum Architecture
  19. 19. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 20 Greenplum architecture for Testing Scicloud VMs: Scicloud Bare Metal Dell Cluster Master dn0 6 dn0 1 32GB/ 8 cores 4 segments Master 256GB/ 48 cores Master Master Master Master dn01 4/8 segments Netapp SAS Drive/ Netapp Master 256GB/ 48 cores Master Master Master dn01 32/64 segments SSD
  20. 20. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 21 Simple Query Tests: Postgres-XL vs Citus Select * from kids_mb_catalogue Where mag_auto_i_trunc=XX On Kids catalogue: v 16M rows v 12GB total v Structured data v Comparison for Single instance(SI) on postgres v10 vs postgres-xl, citus running on 3 nodes and postgres-xl running on 25 nodes. Time grows with the number of rows. Filter Rows 10 2389562 11 13 12 46 13 165 14 1615 15 21932 16 66365 17 116562 18 199186 19 341422 20 600520 21 977292 22 1496713 23 2144309 24 2722471 25 2654069 26 1761394 Filter Rows 27 730583 28 250883 29 90818 30 34346 31 13285 32 5320 33 2149 34 820 35 361 36 139 37 62 38 27 39 4 40 5 41 3 42 1 43 0
  21. 21. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 22 Postgres Single Instance vs XL vs Citus (hash distribution) 0 50000 100000 150000 200000 250000 300000 2389562 13 46 165 1615 21932 66365 116562 199186 341422 600520 977292 1496713 2144309 2722471 2654069 1761394 730583 250883 90818 34346 13285 5320 2149 820 361 139 62 27 4 5 3 1 0 Timeinms Rows number postgres-xl-3nodes postgres-xl-25_nodes SI-10 citus7.1-3nodes
  22. 22. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 24 Hardware Testing for Environment (Postgres-XL or SI) Time Netapp (10 partitions) 3h 46m SSD (10 partitions) 1h 17m Postgres-XL 10 nodes (VM on physical nodes-Netapp) 2h 35min Postgres-XL 25 nodes (Netapp scicloud) 35m Netapp (FDW postgres-xl 25 nodes) 1h 40m Flagship_mock table: 1,1TB Rows#: 2,5B Select count(*) from flagship_mock;
  23. 23. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 25 Greenplum Testing • Ingestion • Expansion • Performance • High Avalaibility • Resource Management • Community edition vs Commercial edition
  24. 24. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 26 Greenplum Testing Community edition does not include: • Support from Pivotal Greenplum team • Product packaging and installation script. • Support for QuickLZ compression. QuickLZ compression is not provided in the open source version of Greenplum Database due to licensing restrictions. • Support for managing Greenplum Database using Pivotal Greenplum Command Center. • Support for full text search and text analysis using Pivotal GPText. • Support for data connectors: •Greenplum-Spark Connector •Greenplum-Informatica Connector •Greenplum-Kafka Integration •Gemfire-Greenplum Connector • Data Direct ODBC/JDBC Drivers • gpcopy utility for copying or migrating objects between Greenplum systems.
  25. 25. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 27 GP Testing: Ingestion gpfdist: nohup gpfdist -d /spv02/ -p 4000 > gpfdist_4000.log 2>&1 & Create external table flag_ext() location ('gpfdist://172.25.4.244:4000/*’) FORMAT 'CSV’; 273*10GB csv files in 3 hours (NetApp) – 6 gpfdist servers The benefit of using gpfdist is that you are guaranteed maximum parallelism while reading from or writing to external tables, thereby offering the best performance as well as easier administration of external tables. From: https://gpdb.docs.pivotal.io/520/utility_guide/admin_utilities/gpfdist.html
  26. 26. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 28 GP Testing: Expansion GPExpand: Expanding the cluster to new datanodes or same ones increasing number of segments. 10 Gb network
  27. 27. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 29 GP Testing: Performance Performance: • Gaia query profiling tests: Catalogues XMatch • Euclid query profiling tests: 1,3TB catalogue What to test: • Table formats • Scalability
  28. 28. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 30 GP testing: Xmatch Performance Most relevant queries: Gaia query SELECT gdr1.source_id AS iddr1, gdr2.source_id AS iddr2 FROM gaiadr1.gaia_source AS gdr1, gaiadr2.gaia_source AS gdr2 WHERE q3c_join(gdr2.ra, gdr2.dec, gdr1.ra, gdr1.dec, 0.5/3600) 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 GACSDB02 GP-LIM-4-64-HEAP_C GP-BM-5-40-AO
  29. 29. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 32 GP testing: Parameter Query performance SELECT q3c_dist(public.flagship_mock."ra_gal",public.flagship_mock."dec_gal",220.,0.0) AS dist , * FROM public.flagship_mock euclid WHERE public.flagship_mock."galaxy_id" = 2 AND public.flagship_mock."gr_cosmos" > 1;
  30. 30. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 33 GP Testing: Performance select count(*) form euclid.flagship_mock; Table format Time(s) GP-BM-5-40-QLZ 14 GP-6-24-QLZ 22 GP-6-24-COL 33 GP-LIM-4-32-HEAP 457 GP-LIM-4-64-HEAP 533 GP-BM-5-40-AO 581 GP-6-24-AO 1693 GP-BM-5-40-HEAP-C 1695 GP-6-24 1808 GP-2-8 8055 Postgres-XL-10nodes 9324 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 G P-B M -5-40-Q LZ G P-6-24-Q LZ G P-6-24-CO L G P-LIM -4-32-H EAP G P-LIM -4-64-H EAP G P-B M -5-40-A O G P-6-24-AO G P-B M -5-40-…G P-6-24 G P-2-8 postgres-X L-… count(*) (s)
  31. 31. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 34 GP Testing: Performance Select count(*) from gaiadr2.gaia_source: 1692919135 rows 0 200 400 600 800 1000 1200 G P-B M -5-40-Q LZ G P-LIM -4-32-H EAP G P-B M -5-40-A OG P-6-24-CO L G P-B M -5-20-A O gacsdb02 G P-6-24 count(*) (s) Table format Time(s) GP-BM-5-40-QLZ 6 GP-LIM-4-32-HEAP 7 GP-BM-5-40-AO 10 GP-6-24-COL 14 GP-BM-5-20-AO 59 gacsdb02 323 GP-6-24 980
  32. 32. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 35 GP Testing: High Availability
  33. 33. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 36 GP Testing: Managing Resources • Resource groups: set and enforce CPU, memory, and concurrent transaction limits in Greenplum Database. Use Linux Control Groups (cgroups) to manage CPU resources. • Resource queues: manage the degree of concurrency in a Greenplum Database system. •manage the number of active queries that may execute concurrently, •the amount of memory each type of query is allocated, •priority of queries.
  34. 34. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 37 GP Testing: Resource management
  35. 35. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 38 Lessons Learnt and what is missing(I) v Yes, Postgres-XL is scaling but: vstill some issues about stability: GTM memory leak, workaround GTM proxy vdoes not work with foreign data wrappers vGTM does not rotate logs vnot direct monitoring tool vReshuffling not working vJDBC bug identified by ESDC not resolved yet v Citus: vas it is an extension can make some workers to behave as single instance. vWill need to work on performance to retrieve large amount of rows. vAdministration is easier than in Postgres-XL.
  36. 36. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 39 v Greenplum is the most mature postgres-flavour solution for us. • Active developers community • GP 5 based on postgres 8.3 use “reserved words”, TAP problem, resolved in future GP6 • Going to GP 6 based on postgres 9.4 • Community edition is enough for ESDC currently • Use of existing HW • Resource management if needed • Best performance: load big catalogues in different format depending on the query and when parsing it decide which one to use using TAP. • Relational datamodels must be adapted. No FKs are enforced. Lessons Learnt and what is missing(II)
  37. 37. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 40 Near Future Tests Euclid: Postgres-XL will be replaced by GP. We will continue ingesting catalogues in this system and testing.
  38. 38. ESA UNCLASSIFIED - For Official Use Pilar de Teodoro| Greenplum Summit’19 | 03/19/2019 | Slide 42 Thank you

×