An On-line Collaborative Data Management System


A presentation I prepared that was presented by Rob Simmonds at the Gateway Computing Environments 2010 Workshop in New Orleans on November 14, 2010. It provides an overview of a data management system that was developed for GeoChronos - an on-line collaborative platform for Earth observation scientists.


  1. An On-line Collaborative Data Management System
     Roger Curry (1), Cameron Kiddle (1), Rob Simmonds (1) and Gilberto Z. Pastorello Jr. (2)
     (1) Grid Research Centre, University of Calgary
     (2) Centre for Earth Observation Science, University of Alberta
  2. Outline
     • Data Challenges
     • Related Work
     • Data Management System
     • Use Case: GeoChronos
     • Summary and Future Work
     GCE 2010, Nov. 14, 2010
  3. Data Challenges - I
     • Data Acquisition
       • Much scientific data is stored on off-line media
       • Cumbersome and time-consuming to access
       • Making data available on-line is difficult
       • Insufficient storage and bandwidth
     • Sharing of Data
       • Lack of willingness to share data
       • Proprietary data: need for controlled access
  4. Data Challenges - II
     • Usability of Data
       • Insufficient metadata to describe data
       • Some domains have several metadata standards, but many have none; many scientists use their own metadata format
     • Finding Data
       • Difficult to find the data that you need
       • Different data is organized and stored differently
       • Tools to browse, search and visualize data are often lacking
  5. Related Work - I
     • Content Management Systems
       • e.g., Drupal, Joomla!, Microsoft SharePoint, Plone, ...
       • Offer a rich set of features but lack:
         • Meaningful support for specific data formats
         • Efficient association of metadata and ancillary files with data sets
         • Access to a variety of data processing tools
         • Uniform handling of outputs from processing tools
     • Spectral Libraries
       • e.g., USGS, ASTER, Vegetation Spectral Library (VSL)
       • Are available on-line but lack:
         • Ability to dynamically restructure metadata for browsing
         • Collaboration features enabled by social networking
  6. Related Work - II
     • Spectral Library Tools
       • e.g., DLR-DFD Spectral Archive, SPECCHIO
       • Flexible in creating and handling metadata but:
         • Have a fixed metadata schema and do not support new metadata needs
     • Data repositories for other domains
       • e.g., Astrophysics Data System, FLUXNET, European Bioinformatics Institute (EBI) databases
       • Offer a wide range of functionality but:
         • Primarily focus on data that is already validated and structured
         • Do not handle preliminary, intermediate or untested data (i.e., research in progress)
     • Digital Libraries
       • e.g., Planetary Data Systems, NCore, SciPort
       • Have flexible functionality but:
         • Most focus on well-defined digital artefacts
         • Limited in handling collaboration on evolving data, metadata and schemas
  7. Data Management System - Overview
     • Supports the following functionality:
       • On-line access to data
       • Sharing of data while scientists keep control of who sees it
       • Adding and editing metadata while working with multiple schemas
       • Collaborative creation of new schemas to facilitate consistent and accurate recording of metadata
       • Dynamic restructuring of the way data is browsed
  8. Data Management System - Framework
     • User & Data:
       • User acquires data from a sensor and uploads it to the portal
       • Direct acquisition of data is also possible
     • Elgg Portal:
       • Built on top of Elgg, an open-source social networking platform
       • Fine-grained access control
       • Flexible data model
     • Data Storage:
       • Currently local NFS storage
       • Working on a distributed iRODS-based system
     • Data Ingestion Service:
       • Creates records, parses metadata, establishes ancillary relationships
       • Deployed on a cloud-based Condor pool
  9. Data Management System - Data Model
     • Arbitrary metadata can be assigned to any entity
     • Annotations allow users to comment on entities not owned by them
     • Data management system adds three new types of ElggObject:
       • Schema
       • Collection
       • Record
  10. Data Management System - Schemas
      • Create schemas
        • Custom or standards-based (e.g., Dublin Core)
        • Individually or as a collaborative team
      • Schemas consist of
        • Namespace
        • Description
        • Read/write access permissions
        • Series of metadata keys
      • Metadata keys consist of
        • Name
        • Description
        • Type (text, latlong, ancillary)
        • Optionality: required, recommended, optional
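
As an illustration only, the schema structure listed on this slide could be modelled roughly as follows. The class names, the `missing_keys` helper and the `namespace:name` convention for stored metadata keys are assumptions made for this sketch; they are not taken from the GeoChronos code, which is built on Elgg. Read/write access permissions are left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Optionality(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


@dataclass
class MetadataKey:
    name: str
    description: str
    type: str  # "text", "latlong" or "ancillary", as listed on the slide
    optionality: Optionality = Optionality.OPTIONAL


@dataclass
class Schema:
    namespace: str
    description: str
    keys: List[MetadataKey] = field(default_factory=list)

    def missing_keys(self, metadata: Dict[str, str], level: Optionality) -> List[str]:
        """Names of keys at the given optionality level that metadata lacks.

        Assumes (hypothetically) that metadata is stored as "namespace:name".
        """
        return [
            key.name
            for key in self.keys
            if key.optionality is level
            and f"{self.namespace}:{key.name}" not in metadata
        ]


# Hypothetical schema loosely based on two Dublin Core elements.
dublin_core = Schema(
    namespace="dc",
    description="Subset of Dublin Core",
    keys=[
        MetadataKey("title", "Name given to the resource", "text", Optionality.REQUIRED),
        MetadataKey("creator", "Entity that made the resource", "text", Optionality.RECOMMENDED),
    ],
)

record_metadata = {"dc:creator": "Field team"}
print(dublin_core.missing_keys(record_metadata, Optionality.REQUIRED))  # ['title']
```

Checking required and recommended keys separately would let a portal reject a record in one case and only warn in the other; whether GeoChronos does this is not stated in the slides.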
  11. Data Management System - Collections
      • Group of related data
        • e.g., a spectral library or a set of satellite data
      • Collection consists of
        • Name, description, read/write access permissions, metadata, records
  12. Data Management System - Records
      • Atomic unit of the data management system
        • Usually represents a single file, but does not need to be associated with a file
      • Tabbed interface for viewing:
        • Spectral plot, metadata, ancillary data, map, comments
        • Custom tabs based on data type
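
The relationship between collections and records described on the two slides above might be sketched as below. This is a simplified model under stated assumptions: the `owner`, `readers` and `writers` fields and the rule that write access implies read access are hypothetical, and Elgg's actual access-control mechanism is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Record:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    file_path: Optional[str] = None  # a record need not be backed by a file


@dataclass
class Collection:
    name: str
    description: str
    owner: str
    readers: Set[str] = field(default_factory=set)  # hypothetical read permission list
    writers: Set[str] = field(default_factory=set)  # hypothetical write permission list
    metadata: Dict[str, str] = field(default_factory=dict)
    records: List[Record] = field(default_factory=list)

    def can_write(self, user: str) -> bool:
        return user == self.owner or user in self.writers

    def can_read(self, user: str) -> bool:
        # Assumption: anyone who may write may also read.
        return self.can_write(user) or user in self.readers

    def add_record(self, user: str, record: Record) -> None:
        if not self.can_write(user):
            raise PermissionError(f"{user} may not write to {self.name}")
        self.records.append(record)


# Hypothetical usage: the owner adds a record, a reader cannot.
library = Collection(
    name="Lichen samples",
    description="Example spectral library",
    owner="alice",
    readers={"bob"},
)
library.add_record("alice", Record(name="sample-001", file_path="sample-001.asd"))
```

Keeping the permission check inside `add_record` means every write path goes through the same rule, which is one simple way to let owners share a collection while keeping control over who can change it.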
  13. Data Management System - Virtual Directory Structure
      • Dynamic restructuring of data for browsing purposes
      • Folders based on metadata keys/values
      • User can customize the metadata keys used to establish the directory hierarchy
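
One way to picture the virtual directory idea is a function that nests records into folders, one level per metadata key chosen by the user. The sketch below is an assumption about how such a view could be computed, not the GeoChronos implementation; the function name, the `"(unset)"` folder for missing values and the sample records are all hypothetical.

```python
from collections import defaultdict


def build_virtual_tree(records, key_order):
    """Nest records into folders, one folder level per metadata key in key_order.

    Each record is a dict of metadata. Records lacking a key are placed in a
    folder named "(unset)" (a choice made for this sketch).
    """
    if not key_order:
        return list(records)
    first, rest = key_order[0], key_order[1:]
    groups = defaultdict(list)
    for record in records:
        groups[record.get(first, "(unset)")].append(record)
    return {value: build_virtual_tree(groups[value], rest) for value in sorted(groups)}


# Hypothetical records with made-up metadata values.
records = [
    {"name": "a.asd", "site": "Calgary", "species": "alfalfa"},
    {"name": "b.asd", "site": "Calgary", "species": "barley"},
    {"name": "c.asd", "site": "Edmonton", "species": "alfalfa"},
]

by_site = build_virtual_tree(records, ["site", "species"])
by_species = build_virtual_tree(records, ["species", "site"])
```

Here `by_site` has top-level folders `Calgary` and `Edmonton`, while `by_species` regroups the same three records under `alfalfa` and `barley`. Only the key order changes; the stored records are untouched, which is what makes the restructuring dynamic.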
  14. Use Case - GeoChronos
  15. GeoChronos - Overview
      • An on-line platform
        • For:
          • Earth Observation Scientists
        • Facilitating:
          • Collaboration between scientists
          • Data access, management and sharing
          • Application access, management and sharing
        • Leveraging:
          • Web 2.0 and social networking technologies
          • Cloud computing technologies
        • Funded by:
          • CANARIE - Network Enabled Platform (NEP-1) program
          • Cybera
  16. GeoChronos - Project Team
      • Principal Investigators
        • Dr. Arturo Sanchez-Azofeifa, University of Alberta
        • Dr. John Gamon, University of Alberta
        • Dr. Benoit Rivard, University of Alberta
        • Dr. Rob Simmonds, University of Calgary
      • Project Coordination
      • Platform Development
      • Domain Scientists
  17. GeoChronos - Virtual Organization
  18. GeoChronos - Spectral Libraries
      • Libraries created
        • Ingested some existing on-line libraries
          • USGS, ASTER, Vegetation Spectral Library (VSL)
          • Many enhanced features as part of the GeoChronos Spectral Library module: improved browsing, dynamic plotting, mapping, annotations, ...
        • Domain scientists have contributed libraries
          • Rock samples, tar sand samples, lichen samples, vegetation samples, alfalfa/barley field samples
      • Data formats / parsers supported
        • ENVI, UNISPEC, ASD, several ASCII formats
      • Schemas incorporated
        • Library specific: USGS, ASTER, VSL, ...
        • Sensor/format specific: UNISPEC, ENVI, ...
        • Other standards: Dublin Core
      • Currently hosting (including MODIS data)
        • 10+ schemas
        • 20+ collections (libraries)
        • 20,000+ records
  19. GeoChronos - MODIS Satellite Data
      • Developed an automated workflow service for mosaicking, subsetting, reprojecting and masking MODIS satellite data
      • Significantly reduces the time scientists previously spent running such workflows manually
      • Data management system is used to store raw MODIS satellite data and data products derived from the workflow
      • Parsers/schemas specific to MODIS data have been added to the system
      • User is provided with the same powerful interface as Spectral Libraries for browsing, accessing and viewing data
  20. GeoChronos - User Feedback
      • Have developed the data management system in an interactive, iterative fashion
      • Domain scientists on the project have provided much guidance, testing and feedback
      • Have customized and enhanced the data management system based on feedback received
  21. Summary
      • Identified data-related challenges facing scientists
      • Discussed some related efforts and the shortcomings of these approaches
      • Presented an on-line collaborative data management system addressing many data challenges
      • Showed example usage of the data management system by GeoChronos
  22. Next Steps
      • Currently have a single local data repository
        • Working on extending the data management system to work with distributed data repositories using iRODS
      • Currently have powerful browsing functionality
        • Need to add search functionality across collections and based on metadata values
      • Currently support custom metadata schemas
        • Plan to make use of Semantic Web technologies to better relate data and provide ontological mapping between different metadata schemas / standards
      • Currently work with spectral and MODIS satellite data
        • Plan to incorporate other data such as carbon flux data, other satellite data, meteorological data, phenology tower data
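
The slides do not say how the planned search would work. Purely as a naive illustration of searching across collections by metadata values, a linear scan could look like the sketch below; the `search` function, the dictionary layout and the sample data are all hypothetical, and a real system would more likely query an index or database.

```python
def search(collections, criteria):
    """Return (collection name, record name) pairs matching every criterion.

    collections: dict mapping a collection name to a list of records, where
    each record is a dict with "name" and "metadata" entries (an assumed layout).
    criteria: dict of metadata key -> required value (exact match only).
    """
    hits = []
    for collection_name, records in collections.items():
        for record in records:
            metadata = record["metadata"]
            if all(metadata.get(key) == value for key, value in criteria.items()):
                hits.append((collection_name, record["name"]))
    return hits


# Hypothetical collections with made-up metadata.
collections = {
    "Vegetation": [
        {"name": "v1", "metadata": {"species": "alfalfa", "sensor": "UNISPEC"}},
        {"name": "v2", "metadata": {"species": "barley", "sensor": "ASD"}},
    ],
    "Field 2009": [
        {"name": "f1", "metadata": {"species": "alfalfa", "sensor": "ASD"}},
    ],
}

alfalfa_hits = search(collections, {"species": "alfalfa"})
```

With this data, `alfalfa_hits` contains one record from each collection, showing how a metadata query can cut across the collection boundaries used for browsing.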
  23. Contact Information
      [email_address]