Cynthia Parr @cydparr
US Department of Agriculture
National Agricultural Library
30 September 2015
Ag Data Commons
Adding value to
open agricultural research data
Federal directives: Public access to
open, machine-readable data
The problems in agricultural data
• Broad subject areas
• Journals not integrated with repositories like Dryad
• Too many existing databases & web distribution points
• Lack of infrastructure for long-tail data
• Lack of a neutral, sustainable solution for long-term multi-institutional projects
• Supports Public Access mandates
• Holds agricultural research data
• Primary audience: researchers
• Holds metadata for data held elsewhere
• Starting with USDA data but will broaden
• Both human and machine access
• Can include unpublished data that is ready for release
Ag Data Commons Prototyping FY 2015
A proposed solution
[Diagram: Ag Data Commons system components, with INGESTION and DISSEMINATION labels]
Components: Search & Knowledge Discovery; Thesaurus & Indexing; Ag Data Commons Repository; Organization & Curation; Grant management systems; PubAg; Dataset Submission; Analytics & Tools; Data.gov; Ag Data Commons Catalog; Distributed repositories; Forest Service; Geospatial
Legend: Building, Adapting, Existing
Adding value
Metadata + data package
DOI
Links
Thesaurus tags
Idiosyncratic data dictionary
Search, services, compliance checking
DKAN http://nucivic.com/dkan/
PRO
• Open source community
• Drupal modules for basic CMS functions
• Integrated CKAN catalog
• Feeds Data.gov
• Basic metadata already supported
CON
• Not designed for scientific data or scientists
• No links to literature
• No Digital Object Identifiers
• Doesn’t handle dataset relationships
• Metadata inadequate for compliance checking & re-use
Metadata Standards
Core Metadata Schema
POD 1.1 (Project Open Data)
https://project-open-data.cio.gov/
Related Scientific Metadata & Data Standards (e.g.)
ISO 19115 (GIS Data, FGDC)
https://www.iso.org
Darwin Core (Biodiversity standards)
http://rs.tdwg.org/dwc/
EML (Ecological Metadata Language)
https://knb.ecoinformatics.org/#tools/eml
MiXS GSC (Genomic Standards Consortium)
http://gensc.org/projects/mixs-gsc-project/
Controlled Vocabularies
• NALT – National Agricultural Library Thesaurus
http://agclass.nal.usda.gov
  • GACS – Global Agricultural Concept Scheme
• Taxonomy
• Gene Ontology (GO)
http://geneontology.org/
• ENVO, ecological, economic, etc.
Relevant for Agriculture
• Help create a semantic web
• SKOS (Simple Knowledge Organization System): W3C recommendation, or RDF
Credit: AIMS--FAO
https://data.nal.usda.gov/
Launching next week
Adding even more value
Structured methods metadata
Shared data dictionary
Semantic data dictionary
Adding even more value
Assist application launch
Find related data
Integrate/link related data
= help build the knowledge graph
Acknowledgements
Cynthia.Parr@ars.usda.gov
Susan McCarthy, NAL – KSD
Ursula Pieper, NAL – ISD
Qing Qu, NAL – KSD contractor
Jeff Campbell, NAL – KSD
Jaylen Nathwani, NAL – student intern
NüCivic, Angry Cactus Team
Jocelyn McNamara, NAL – KSD contractor
Kerry Huller – UMD graduate fellow
Erin Antognoli – UMD graduate fellow

Editor's Notes

  • #2 Title: Ag Data Commons: adding value to open agricultural research data. Public access to the results of federally funded research is a new mandate for large departments of the United States government. Public access to scholarly literature from U.S. investments is straightforward, with policies and systems like PubMed Central and PubAg (http://pubag.nal.usda.gov) already implemented. However, research data release is a more complex undertaking. Agricultural researchers make their data available in a patchwork of locations, if they share it at all, and metadata and data formats are far from standardized. Many data types overlap with basic science domains that have their own standards (e.g. biodiversity, genomics, hydrology), but those standards have little in common with each other and are not tailored for agriculture. The U.S. Department of Agriculture's prototype system, the Ag Data Commons (http://data.nal.usda.gov), will meet the requirements of public access but should also go further to facilitate novel, data-intensive science. Aimed at researchers, Ag Data Commons uses DKAN, a Drupal-based catalog and repository (http://nucivic.com/dkan/), to enhance discoverability of and access to well-curated resources (data files, databases, software) deposited in the system or held elsewhere. Core metadata fields are from Project Open Data v.1.1 (a requirement of the U.S. open data catalog at http://data.gov), but we added fields and features to support scholarly research. We issue DataCite Digital Object Identifiers (DOIs), accept author ORCIDs (http://orcid.org/), apply National Agricultural Library Thesaurus terms, and encourage citation of literature and linkage with related datasets and other online resources. While extremely detailed metadata are impractical given the breadth of agricultural domains, we can extract fields from sophisticated ISO 19115 geographic information metadata, and extended metadata files can be posted and will be indexed.
We are piloting the harvest of distributed metadata records. Towards data integration and standardization, we are developing guidelines for machine-readable data dictionaries: manifests of the data elements in datasets, not unlike Darwin Core Archives. We are exploring ways to enable basic interactive visualizations. Metadata are available in JSON (http://json.org/) and RDF (http://www.w3.org/RDF/), with dedicated feeds for publication links and (eventually) compliance checking. Many challenges remain before we can move from prototype to production. Among them are how to provide easy API (application programming interface) access to elements in data files, how to interface with related systems (e.g. Dryad, DataONE, EcoInforma, iPlant), how to leverage methods metadata and semantics, how to better support provenance and impact tracking, and how to ease the pain of both working with and preserving big data for high performance computing.
  • #3 This plan is in a learning and pilot phase now. Policies are being developed and are expected to be available in the next fiscal year. New projects in 2016-2017 will be expected to comply fully with those policies, which means data management plans up front that result in publicly released scientific data. So we have a little time to work out the details and influence the policies. We can have conversations now on best practices that may guide the policy makers.
  • #6 Dark blue: develop as part of Ag Data Commons. Light blue: enhance existing systems. Gray: already exist.
  • #8 Drupal Knowledge Archive Network
  • #11 Phase II prototype, launching next week! Data submission for outside personnel; automated DOI submission; support for compliance checking; embargo support; support for methods & software metadata.
  • #12 Scheduled harvesting from external repositories (including geospatial ISO metadata)
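
The note for slide 2 says that core metadata fields come from Project Open Data v1.1 and that metadata are exposed as JSON, with compliance checking planned. The following is a minimal sketch of what such a record and a basic required-field check might look like. It is not the Ag Data Commons implementation. The field list reflects our reading of the dataset-level fields that POD 1.1 marks as always required (federal agencies additionally need bureauCode and programCode) and should be confirmed against the published schema. The function name check_record and every value in example_record are hypothetical.

```python
import json

# Dataset-level fields that POD 1.1 lists as always required (assumption:
# confirm against the published schema before relying on this list).
REQUIRED_FIELDS = (
    "title", "description", "keyword", "modified",
    "publisher", "contactPoint", "identifier", "accessLevel",
)
ACCESS_LEVELS = {"public", "restricted public", "non-public"}


def check_record(record):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    level = record.get("accessLevel")
    if level and level not in ACCESS_LEVELS:
        problems.append(f"unexpected accessLevel: {level}")
    return problems


# Hypothetical record: every value below is a placeholder, not a real dataset.
example_record = {
    "title": "Example soil moisture observations",
    "description": "Placeholder description for illustration only.",
    "keyword": ["soil moisture", "agriculture"],
    "modified": "2015-09-30",
    "publisher": {"@type": "org:Organization", "name": "Example Research Unit"},
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Example Contact",
        "hasEmail": "mailto:contact@example.org",
    },
    "identifier": "example-dataset-0001",
    "accessLevel": "public",
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print(check_record(example_record) or "no problems found")
```

A check like this only covers presence of fields. The scholarly additions described in the notes (DOIs, ORCIDs, thesaurus terms, links to literature) would need their own rules, which the source does not specify.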