Up-to-date measurements of surface meteorological variables are essential for monitoring weather conditions, their spatio-temporal variability and their potential effects on a wide range of sectors and applications. Moreover, when included in continuous records of long historical observations spanning several decades, they become essential for assessing long-term climate variability and change at local and regional levels.
Automated pipelines capable of retrieving and processing near-real-time meteorological data satisfy the primary prerequisites for the development and advancement of effective, operational climate services.
With a public and operational near-real-time monitoring web platform in mind, we present automated pipelines to collect and process up-to-date daily temperature and precipitation records for Trentino South Tyrol (Italy) and surrounding areas, and to derive their spatially interpolated fields at sub-km scale. Our pipelines are composed of multiple steps including data download, sanity checks, reconstruction of missing daily records, integration into the historical archive, spatial interpolation and publication onto online FAIR catalogues as (openEO) “datacubes”. The different APIs, data formats and structures across the various data sources, and the need to merge the data onto harmonized meteorological layers, make this a typical case of a so-called Extract, Transform, Load (ETL) pipeline, and, in order to follow the principles of data reproducibility and Open Science, we embraced open-source automated workflow management through GitLab’s Continuous Integration / Continuous Delivery (CI/CD) capabilities.
CI/CD workflows greatly help the management of the relatively complex graphs of tasks required for our climate application, ensuring seamless orchestration with thorough flow monitoring, application logs, transaction rollbacks, and exception handling in general. Native pipeline-oriented software development also fosters a clean separation of roles among the tasks and a more modular architecture. This effectively reduces barriers to collaborative development and paves the way for robust operational climate services for researchers and decision makers in the face of the changing climate.
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelines
1. Embracing CI/CD workflows for building ETL pipelines
How we will gather and monitor multi-source spatially-interpolated meteorological parameters in near-real time
Elena Maines, 11.11.23
3. Pipeline: motivation
Having reliable and up-to-date long-term meteorological data is crucial for:
• Monitoring current meteorological conditions at regional level
• Monitoring climate variability at local level
• Providing spatially distributed meteorological inputs for impact models (e.g., for hydrological simulations)
4. Meteorological data and data preparation
Time series of key weather variables, e.g., precipitation and min/max temperatures. These data are recorded at daily and sub-daily resolution by weather stations.
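The daily records used downstream can be derived from the sub-daily station measurements. A minimal sketch with pandas (the actual pipeline is implemented in R, and the station, column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical sub-daily records from one weather station
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-11-11 00:00", "2023-11-11 06:00",
        "2023-11-11 12:00", "2023-11-11 18:00",
    ]),
    "temperature": [1.2, 0.5, 7.8, 3.1],    # °C
    "precipitation": [0.0, 1.4, 0.2, 0.0],  # mm
})

# Aggregate to the daily resolution used downstream:
# min/max temperature and total precipitation per day
daily = raw.groupby(raw["timestamp"].dt.floor("D")).agg(
    tmin=("temperature", "min"),
    tmax=("temperature", "max"),
    prec=("precipitation", "sum"),
)
print(daily)
```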
6. Meteorological data and data preparation
Irregular spatial coverage of weather stations → comprehensive and updated time-series dataset (a robust compilation of at-site meteorological conditions, stored in PostgreSQL) → interpolation for obtaining a gridded dataset.
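One simple way to turn irregular station values into a gridded field is inverse-distance weighting. The talk does not specify the interpolation scheme actually used, so the sketch below (Python, with hypothetical station coordinates and values) is purely illustrative:

```python
import numpy as np

def idw(station_xy, values, grid_xy, power=2.0):
    """Inverse-distance-weighted interpolation of station values onto
    grid points. Illustrative only: the pipeline's real gridding method
    may differ."""
    # Pairwise distances between every grid point and every station
    d = np.linalg.norm(grid_xy[:, None, :] - station_xy[None, :, :], axis=2)
    d = np.maximum(d, 1e-6)        # avoid division by zero at station locations
    w = 1.0 / d ** power
    return (w * values).sum(axis=1) / w.sum(axis=1)

# Three hypothetical stations (x, y in km) with daily Tmax values (°C)
stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
tmax = np.array([5.0, 7.0, 3.0])

# Target grid: just a few sample points for brevity
grid = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])
print(idw(stations, tmax, grid))
```

Because the weights are positive and normalized, each interpolated value stays within the range of the observed station values.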
13. Pipeline: implementation
➢ GitLab CI/CD pipeline
➢ Dockerized job runners
➢ R modules + geo-spatial libraries
➢ testthat + shell-based unit test suites
➢ renv for reproducible environments
➢ bash scripts and wrappers
about.gitlab.com
14. ETL pipeline for updating meteorological datasets in near real-time
1. Data retrieval
2. Sanity checks
3. Harmonization
4. Gap filling
5. Integration with historical data
6. Gridding
7. Publishing
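Under GitLab CI/CD, steps like these map naturally onto pipeline stages. A hypothetical sketch of a `.gitlab-ci.yml` (the image, job names and script paths are illustrative, not the project's actual files):

```yaml
stages:
  - retrieval
  - sanity-checks
  - harmonization
  - gap-filling
  - integration
  - gridding
  - publishing

default:
  image: rocker/geospatial:latest   # Dockerized runner with R + geo-spatial libraries

retrieve-data:
  stage: retrieval
  script:
    - Rscript modules/retrieve.R    # hypothetical module name
  artifacts:
    paths:
      - data/raw/

sanity-checks:
  stage: sanity-checks
  script:
    - Rscript modules/checks.R data/raw/
```

A pipeline schedule (configurable in the GitLab UI) can then trigger the whole task graph once per day for near-real-time updates.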
26. 04: Gap filling
➢ Only isolated gaps are filled
➢ Interpolation based on surrounding correlated stations, considering common historical data
➢ Missing datum estimated as the weighted average of rescaled values
➢ Simulated values are checked for consistency, and missing mean-temperature values are determined from the reconstructed min and max temperatures
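The estimator described above can be sketched as follows. The exact rescaling and weighting used in the pipeline are not specified in the talk, so this Python fragment shows one common formulation for temperature (shift by the difference of historical means over the common period, weight by squared correlation); all data are synthetic:

```python
import numpy as np

def fill_gap(target_hist, neighbor_hist, neighbor_today, min_corr=0.7):
    """Estimate a missing daily value at a target station as the weighted
    average of rescaled values from surrounding correlated stations.
    Sketch only: the pipeline's actual rescaling/weighting may differ.

    target_hist    : (n_days,) historical series at the target station
    neighbor_hist  : (n_stations, n_days) common-period series at neighbours
    neighbor_today : (n_stations,) today's observations at the neighbours
    """
    estimates, weights = [], []
    for hist, today in zip(neighbor_hist, neighbor_today):
        r = np.corrcoef(target_hist, hist)[0, 1]
        if r < min_corr:           # use only well-correlated neighbours
            continue
        # Rescale by the difference of historical means (temperature-style)
        estimates.append(today + (target_hist.mean() - hist.mean()))
        weights.append(r ** 2)     # weight by squared correlation
    if not weights:
        return np.nan              # no usable neighbour: leave the gap
    return np.average(estimates, weights=weights)

# Synthetic example: two perfectly correlated neighbours
target_hist = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
neighbor_hist = np.array([[3.0, 4.0, 5.0, 6.0, 7.0],
                          [0.0, 1.0, 2.0, 3.0, 4.0]])
neighbor_today = np.array([10.0, 7.0])
print(fill_gap(target_hist, neighbor_hist, neighbor_today))  # both rescale to 8.0
```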
50. Monitoring & Troubleshooting
GitLab provides out of the box:
✓ UI-based pipeline status monitoring
✓ Job log visualization
✓ Timing information
✓ Manual job re-runs
✓ An API for programmatic access
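The API in question is GitLab's REST API, whose `GET /api/v4/projects/:id/pipelines` endpoint returns recent pipeline runs and their statuses. A minimal standard-library Python sketch (the instance URL, project id and token are placeholders):

```python
import json
import urllib.request

GITLAB_URL = "https://gitlab.com"   # or a self-hosted instance
PROJECT_ID = "12345"                # hypothetical project id
TOKEN = "glpat-..."                 # personal access token (placeholder)

def pipelines_request(per_page=5):
    """Build the request for GitLab's pipelines endpoint:
    GET /api/v4/projects/:id/pipelines"""
    return urllib.request.Request(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/pipelines?per_page={per_page}",
        headers={"PRIVATE-TOKEN": TOKEN},
    )

def summarize(payload):
    """Reduce the JSON payload to (id, status) pairs, e.g. to spot failed runs."""
    return [(p["id"], p["status"]) for p in json.loads(payload)]

# Example payload shape (abridged; real responses carry many more fields)
sample = '[{"id": 101, "status": "success"}, {"id": 102, "status": "failed"}]'
print(summarize(sample))  # → [(101, 'success'), (102, 'failed')]
```

A small script like this could feed the MS Teams reporting webhooks mentioned in the outlook below.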
57. What’s next
• Finalized implementation of the GitLab pipeline
• Automating the connectors with online catalogues
• Set-up of webhooks for reporting to MS Teams channel(s)
• Implementation of a monitoring dashboard with GIS-based visualization of the data store and of near-real-time updated fields