A Python Framework for Staging of
Geo-referenced Data on the Collaborative
Climate Community Grid (C3-Grid)
Henning Bergme...
DLR
German Aerospace Center




   Research Institution
   Space Agency
   Project Management Agency




                 ...
Locations and employees


6000 employees across                                                            Hamburg
29 res...
Start
    Talk Outline
      Users View

        Addressing the Heterogeneity Problem of
   initialize               read ...
The Heterogeneity
Problem with Data in
Climate Research
  Distributed Data Archives
  throughout Germany
        Large Dat...
Scientific Workflow and Use of Data
                                                     Workflows consist of dependent ta...
The Collaborative Climate Community Grid
C3-Grid (2005-2009)

          Transparent Grid Infrastructure for the German
   ...
Working with Data on the Portal
 Advanced Search               Browse by Data Set




                                    ...
C3-Grid Layered Architecture (simplified)
     Users View




                                           Portal
          ...
Stage Request Constraints: Data Selection

  Data service receives selection constraints according to metadata
       Data...
Stage Request Constraints: Data Selection

  Data service receives selection constraints according to metadata
       Data...
Stage Request Constraints: Data Selection

  Data service receives selection constraints according to metadata
       Data...
Problem solved… 
  Solved in Middleware: Abstraction of heterogeneous storage access
       Standardized Metadata Format ...
PyModESt helps the data provider
  Hide unnecessary complexity
       e.g. XML processing
  Give Recipes
       Guided Tem...
Staging Process Skeleton
      Necessary Implementation effort for DP when using PyModESt
  initialize             read   ...
Implement a Data Processor

  Initialize data processor __init__(c3env, stage_request)
      perform common steps for stag...
Hook 1: retrieveAndFilterDataFiles()
–   determine required base data files
               Data provider knowledge about d...
Hook 2: updateMetadata(md)

Metadata object is automatically initialized from metadata server
Just pass updated attributes...
C3Thesaurus - CF Names Translation
                                       (based on Guido’s recipe for dict inversion)

  ...
Implement the Estimation Request
(Scheduling Questions)
   •   Verify request constraints
             Is the request fulf...
Hook 3: estimateFileSize()
  Return a long value (Size in Bytes)

 For regular Raster / Table Data simply
 multiply and su...
Hook 4: estimateStageTime(stage_instant)


  Given an instant for a Stage Request,
  when will the requested data be avail...
Associate Data Processors with Data Set

   Modularity of Stagers
                                                        ...
ERS2.GOME.L3.VCD.MONTHLYMEAN.O3 (95–05)
 (Example Data Set 1)

WDC-RSAT: World Data Center for
Remote Sensing of the Atmos...
PyHDF: Copying Parts of a Table to Another Table
  HDF Scientific Data Set
      Each variable is in a 2D table (longitude...
Chemical Weather Forecast Demo Data Set (2005)

Institute for Atmosphere Physics



File Structure
      1 file / day with...
Reading Structured Output of a Command Line Tool
  Climate Data Operators (CDO)
       supports a huge set of specific ope...
Benefits of Using Python for Impementation of the Stager Lib
 Easiness
      Comparably easy to understand and modify by n...
Conclusion

  C3-Grid is a grid infrastructure for collaboration of climate researchers
      Uniform ISO-annotated Data A...
Summary

 The C3-Grid is a collaboration infrastructure for the climate science
 community. Main aim is transparency of th...
References
  C3-Grid: http://www.c3grid.de
  DLR Simulation and Software Technology: http://dlr.de/sc
  DLR Intitute of At...
Upcoming SlideShare
Loading in …5
×

20090701 Climate Data Staging

1,490 views

Published on

This is the presentation version containing animations. Therefore it is not printable. I am going to post a printable version during the next days.

Published in: Technology, News & Politics
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,490
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
20
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

20090701 Climate Data Staging

  1. 1. A Python Framework for Staging of Geo-referenced Data on the Collaborative Climate Community Grid (C3-Grid) Henning Bergmeyer (http://tr.im/bergmeyer) German Aerospace Center Simulation- and Software Technology (V.09-06-30) Most up-to-date version of these slides available at: http://tr.im/ep09_pymodest Folie 1 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  2. 2. DLR German Aerospace Center Research Institution Space Agency Project Management Agency Folie 2 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  3. 3. Locations and employees 6000 employees across  Hamburg 29 research institutes and Bremen-   Neustrelitz Trauen  facilities at Berlin-  Braunschweig   13 sites.  Goettingen Offices in Brussels, Paris and Washington.  Koeln  Bonn  Lampoldshausen  Stuttgart  Oberpfaffenhofen Weilheim  Folie 3 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  4. 4. Start Talk Outline Users View Addressing the Heterogeneity Problem of initialize read Portal select choose Environment Data Stage Request Climate Pre-Processing DataProcessor Request Mode Middleware PyModESt: Modular Extendable Stager in Python handle handle No handle Workflow Stage Request Data Estimation Request Data Grid Cancel Request Stop Information Management Simulation MonitoringManagement Examples snippets from Data provider scripts for System Yes Simulation DLR data sets • find abandoned • (authorize) • authorize temporary files • prepare work space Problems ? Meta Data • retrieve • retrieve meta data • adjust Constraints • adjust constraints • estimate Post-Processing data • retrieve • Stage Time Resources • update meta data • File Size • transfer files to target Visualization Archive Optimum handle No tidy reached ? write Exceptions Work Space Yes Responses End Folie 4 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  5. 5. The Heterogeneity Problem with Data in Climate Research Distributed Data Archives throughout Germany Large Data Quantities (Peta Bytes) Storage at Data Sources (Sensors, Institutes) Many Data Formats Due to nature and purpose of data Historic reasons Folie 5 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  6. 6. Scientific Workflow and Use of Data Workflows consist of dependent tasks Start Tasks are carried out manually as local Applications Pre-Processing as Jobs on compute resources No Stop Simulation Monitoring Yes Simulation Problems ? Scheduler / Batch Systems Post-Processing plan execution time of jobs on required resources (time, space) Visualization Tasks consume and produce data No Optimum Staging: job data retrieval and reached ? storage of result Yes access should be standardized End Folie 6 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  7. 7. The Collaborative Climate Community Grid C3-Grid (2005-2009) Transparent Grid Infrastructure for the German Climate Research Community Uniform access to heterogeneous climate data stores Publishing Discovery Citation Download Data Processing Tools Data Visualization Standard Tools / Workflows User Interface: Web Portal Folie 7 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  8. 8. Working with Data on the Portal Advanced Search Browse by Data Set Folie 8 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  9. 9. C3-Grid Layered Architecture (simplified) Users View Portal Scheduled Tasks: • Computation • Data Download Middleware Workflow Data Data Grid Information Management Management System Abstraction: Data Request Service + Metadata WDC Climate WDC Mare WDC => custom staging scripts RSAT DWD Resources DKRZ, PIK, Storage GKSS, Solutions AWI, MPI-M Archive IFM-Geomar dCache OGSA-DAI FU Berlin Uni Cologne Flat Files … heterogeneous! Folie 9 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  10. 10. Stage Request Constraints: Data Selection Data service receives selection constraints according to metadata Data Set (Object ID) Folie 10 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  11. 11. Stage Request Constraints: Data Selection Data service receives selection constraints according to metadata Data Set (Object ID) log_surface_pressure Variables as CF Names mole_fraction_of_ozone_in_air … (Climate and Forecast MD Convention) Folie 11 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  12. 12. Stage Request Constraints: Data Selection Data service receives selection constraints according to metadata Data Set (Object ID) log_surface_pressure Variables as CF Names mole_fraction_of_ozone_in_air … (Climate and Forecast MD Convention) Regional Bounds (Longitude, Latitude) Vertical Bounds and Vertical Coordinate Reference System Time Period Data Set Specific Constraints Folie 12 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  13. 13. Problem solved…  Solved in Middleware: Abstraction of heterogeneous storage access Standardized Metadata Format (ISO 19115/19139) Common Set of Request Constraints over Grid Service …another one brought to the surface  Computer Science Exercises for Meteorologists Adaption of storage access to the standard service interface extend Java service or write external stager script Implementing Web/Grid Service Features Concurrency Issues Error handling CF Name Translation Metadata update requires XML processing & Schema knowledge … Folie 13 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  14. 14. PyModESt helps the data provider Hide unnecessary complexity e.g. XML processing Give Recipes Guided Template for Data Processors in Python Make implementation straight forward (e.g. reduce interfaces) 1 Input Channel (Constraints as Python primitives) 1 Output Channel Settings and Environment Access Avoid common mistakes Concurrency of Requests Let Users stay in their domain Let them use their own tools Provide additional tools for common issues e.g. data grid calculations Folie 14 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  15. 15. Staging Process Skeleton Necessary Implementation effort for DP when using PyModESt initialize read select choose Environment Stage Request DataProcessor Request Mode handle handle handle Cancel Request Stage Request Estimation Request • find abandoned •• (authorize) (authorize) • authorize temporary files •• prepare work space prepare work space • retrieve Meta Data •• retrieve xml metadata retrieve xml metadata • adjust Constraints •• adjust constraints adjust constraints • estimate •• retrieve data retrieve data • Stage Time •• update metadata update metadata • File Size •• transfer files to target transfer files to target handle tidy write Exceptions Work Space Responses Folie 15 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  16. 16. Implement a Data Processor Initialize data processor __init__(c3env, stage_request) perform common steps for stage request and estimation e.g. determine data mesh properties c3env and stage_request reference in data processor context do not touch base data Hook 1: retrieveAndFilterDataFiles() Hook 2: updateMetaData(c3_metadata) Folie 16 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  17. 17. Hook 1: retrieveAndFilterDataFiles() – determine required base data files Data provider knowledge about data store download a base data file to a temporary file dld_datafile_name = self.c3env.reserveTempFilePath() fln, headers = urllib.urlretrieve( url=src_url, filename= dld_datafile_name) extract constraint fulfilling data and append result file Use specific tools or libraries – update result properties remember meta data relevant attributes of the result remove downloaded file to save temporary disk space os.remove(fln) self.c3env.releaseTempFilePath(dld_datafile_name ) continue with step 2 until finished Folie 17 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  18. 18. Hook 2: updateMetadata(md) Metadata object is automatically initialized from metadata server Just pass updated attributes to methods of the metadata object md.removeQuicklook() md.filterContentInfo() md.setHorizontalBounds() md.setVerticalExtent() md.updateTimePeriod() md.setObjectId() md.addLineageProcessStep() Folie 18 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  19. 19. C3Thesaurus - CF Names Translation (based on Guido’s recipe for dict inversion) Climate and Forecast Convention used in C3-Grid to name variables Data providers internally often follow other conventions historic reasons data file formats (e.g. GRB only supports numeric indexes) Translates variable names between representations in different scopes c3t.translateFromC3(self, attributes, tgt_scope) c3t.translateToC3(self, attributes, src_scope) { SCOPE1 : { CF1 : DP1, CF2 : DP2 }, SCOPE2 : … } SCOPE_MAP = { "g2.de.dlr.wdc.ERS2.GOME.L3.VCD.MONTHLYMEAN.O3" : { "mean_ozone_VCD" : "mean", "standard_deviation_ozone_VCD" : "strd_dev” } } Folie 19 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  20. 20. Implement the Estimation Request (Scheduling Questions) • Verify request constraints Is the request fulfillable? • Estimate result file size How much storage space does the extracted data need? • Estimate staging time When can I start a process that depends on the data? Offer a contract to the Scheduling System Do NOT process any base data files for this Hook 3: estimateFileSize() returns long Hook 4: estimateStageTime(stage_instant) returns timedelta Folie 20 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  21. 21. Hook 3: estimateFileSize() Return a long value (Size in Bytes) For regular Raster / Table Data simply multiply and sum-up Use “GaussianGridHelper” to calculate table index ranges on Gaussian Grids Handle regions overlapping the longitude periodic edge gauss_grid_hlp = RegularGaussianGridHelper( src_lat_min, src_lat_max, lat_delta, lat_len, src_lon_min, src_lon_max, lon_delta, lon_len ) lat_idx_min, lat_idx_max, lat_idx_len, lon_idx_min, lon_idx_max, lon_idx_len = gauss_grid_hlp.calculateRegionIndices( lat_min, lat_max, lon_min, lon_max ) Folie 21 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  22. 22. Hook 4: estimateStageTime(stage_instant) Given an instant for a Stage Request, when will the requested data be available? Return a datetime.timedelta value It is practically impossible to be precise in a complex dynamic system theoretically statistics over past requests could be used heuristically compare with sensor data of current server and network loads Currently DLR implementations generously over-estimate with a constant: timedelta(seconds=60) Folie 22 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  23. 23. Associate Data Processors with Data Set Modularity of Stagers DA Data Processor: Hooks + CF Translation for new data sets Starter DA The starter script Config DA Contains all configuration settings Associates data sets with data processors DA { ObjectID : DataProcessor } Specific DataProcessor sub-class is plugged into the skeleton implementation PROCESSOR_TYPES = { "g2.de.dlr.wdc.CWF" : netcdf_extraction.IPA_NCDF_Processor, “g2.de.dlr.wdc.ERS… “: wdc_hdf_processing.WDCHDFProcessor } PROCESSOR_TYPES[stageRequest.object_ids[0]].retrieveAndFilter… Folie 23 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  24. 24. ERS2.GOME.L3.VCD.MONTHLYMEAN.O3 (95–05) (Example Data Set 1) WDC-RSAT: World Data Center for Remote Sensing of the Atmosphere File Structure Base Data: 1 file / month File format HDF4 HTTP download Retrieval and Processing using PyHDF library create new HDF file iterate over months covering requested time period adjust data describing attributes Folie 24 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  25. 25. PyHDF: Copying Parts of a Table to Another Table HDF Scientific Data Set Each variable is in a 2D table (longitude, latitude) Result file contains 3D tables (date, longitude, latitude) NumPy (Numerical Python) integration enables easy table operations Select a 2D part of a table Insert it into a specific range in a 3D table tgt_ds[time_idx, :lat_idx_len, :(lon_len-lon_idx_min)] = src_ds[lat_idx_min:(lat_idx_max+1), lon_idx_min:] tgt_ds[time_idx, :lat_idx_len, (lon_idx_len-lon_idx_max-1):] = src_ds[lat_idx_min:(lat_idx_max+1), :(lon_idx_max+1)] Folie 25 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  26. 26. Chemical Weather Forecast Demo Data Set (2005) Institute for Atmosphere Physics File Structure 1 file / day with 8 time steps / day, file format NetCDF local file system Retrieval and Processing using external command line tool CDO iterate over files corresponding to time period adjust data describing attributes by analyzing the result file Folie 26 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  27. 27. Reading Structured Output of a Command Line Tool Climate Data Operators (CDO) supports a huge set of specific operations for climate data command line tool getRegionalBoundsFromNetCDF(filename) returns (xsize, ysize, …) sproc = subprocess.Popen(“cdo griddes %s” % filename, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True) output, err = sproc.communicate() retcode = sproc.wait() ncdf_props = str(output).split() xsize = int(ncdf_props[ncdf_props.index(“xsize”) + 2]) This part provided as re-usable function callTool(cmd_ln) returns str Seems trivial (GOOD!) but is incredibly useful Saved a lot of implementation time Folie 27 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  28. 28. Benefits of Using Python for Impementation of the Stager Lib Easiness Comparably easy to understand and modify by non computer scientists Intuitive configuration using dictionaries Add new data sets by Copy and Customize DataProcessors (documented DataProcessor template is provided) Python: Powerful Language Access to all constraints as Python types (float, str, dict, ...) Python standard libraries (datetime, timedelta, math, …) Extensibility No compilation and re-deployment necessary Easy integration of scriptable tools and libraries Rich set of useful libraries available (HDF, Numeric, iso8601, PyParsing…) Integration of Java and C/C++ Libraries with JPype, ctypes, BOOST Folie 28 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  29. 29. Conclusion C3-Grid is a grid infrastructure for collaboration of climate researchers Uniform ISO-annotated Data Access Tools and complex workflows in Portal Python can ease the life of both middleware developers and service implementers PyModESt makes becoming a data provider easy Modular development DP can exploit the strengths of Python Just implement 4 documented hooks for each data set kind Provide CF Name translations (dict in dict) Folie 29 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  30. 30. Summary The C3-Grid is a collaboration infrastructure for the climate science community. Main aim is transparency of the infrastructure to users by abstraction of heterogeneous data resources for easy data discovery and access and to allow execution of basic manipulation and analysis tasks as well as complex distributed workflows. PyModESt is a Python framework for the comfortable modularized implementation of staging scripts for the C3-Grid that frees meteorological data providers from doing the work of system admins and vice versa. Folie 30 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01
  31. 31. References C3-Grid: http://www.c3grid.de DLR Simulation and Software Technology: http://dlr.de/sc DLR Intitute of Atmospheric Physics: http://www.dlr.de/pa/en/ World Data Center for Remote Sensing of the Atmosphere: http://wdc.dlr.de/ Check-out for slideshow of beautiful current satellite images and animations: http://wdc.dlr.de/show/show.php [Sorry, it’s PHP ;-)] PyHDF: http://pysclint.sourceforge.net/pyhdf/ Numpy: http://numpy.scipy.org/ Climate Data Operators: http://www.mpimet.mpg.de/fileadmin/software/cdo/ JPype (Dynamically use Java in Python): http://jpype.sourceforge.net/ ctypes (Integrate C/C++ libraries in Python): http://docs.python.org/library/ctypes.html These slides on slideshare (always up-to-date): http://tr.im/ep09_pymodest Folie 31 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

×