RDA Wheat Data Interoperability
Cookbook and last developments
9th March 2015, San Diego
The WDI working group in brief
● Endorsement: March 2014
● Members: about 30, of whom 15 are active (Wheat scientists, data and metadata technologists)
● The goal: contribute to improving the interoperability of Wheat related data by
    ● Building a common interoperability framework (metadata, data formats and vocabularies)
    ● Providing guidelines for describing, representing and linking Wheat related data
Initial plans
● Deliverables
    ● A report on the survey of existing standards
    ● A cookbook for the Wheat data managers community, with guidelines on which data formats, metadata, vocabularies and ontologies to use to describe, represent and link different types of Wheat data
    ● A library of linked vocabularies and ontologies in machine-readable formats that comply with the Linked Data standards
    ● A prototype that showcases the gain in interoperability
Where we are
| Data type | Currently used: standardized | Currently used: tool specific | Currently used: non standardized | Recommendations |
|---|---|---|---|---|
| SNPs | VCF | BAM/SAM, BED, VarScan, VEP | | VCF files generated from the IWGSC survey sequences, plus metadata about the VCF files to enrich the information about the SNPs |
| Genome annotations | GenBank flat file, General Feature Format (GFF), EMBL | | | GFF3, plus specifications for the description of specific columns |
| Germplasm | MCPD, ABCD, Darwin Core, Darwin Core Germplasm | GRIN-Global | Tabulated | MCPD |
| Gene expression | Many format standards laid out by repositories such as NCBI (GEO) and EBI ArrayExpress | | | Existing format standards laid out by repositories such as NCBI (GEO), EBI ArrayExpress and ENA |
| Physical maps | GFF | CMap, FPC | | GFF3 |
| Genetic maps | | CMap, GnpMap | | GFF3 (to be confirmed) |
| Phenotypes | | DROPS, ped, ISA-Tab, Ephesis | Tabulated | ISA-Tab |
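Since GFF3 is the recommended format for several data types above, the following is a minimal sketch of how one GFF3 feature line breaks down into its nine tab-separated columns and its attribute pairs. The example line (sequence name, source, gene ID and name) is hypothetical and is not taken from any Wheat dataset; the function name `parse_gff3_line` is likewise an assumption made for illustration.

```python
from urllib.parse import unquote

# The nine columns defined by the GFF3 format, in order.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds tag=value pairs separated by ';'; multiple values use ','.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Hypothetical feature line, for illustration only.
example = ("chr1A\texample_source\tgene\t1000\t2500\t.\t+\t.\t"
           "ID=gene0001;Name=hypothetical_gene;Ontology_term=GO:0008150")
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"],
      feature["attributes"]["Ontology_term"])
```

Agreeing on which attribute tags are mandatory in column 9 is the kind of "specification of specific columns" the recommendation refers to; the sketch only shows where such tags live.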
Examples of use cases

Title: Searching for germplasm with specific traits
Description: Search for germplasm with specific traits (tagged with ontology terms?)
Data types: Germplasm, phenotype
Challenges:
● Metadata are very important and need a standardized format
● Association of genes with traits, linked to germplasm and marker information
● Need for quality controls: how confident are you in the data source?
● Provenance of the germplasm: pedigree, ownership
● Standard system for tracking germplasm and names

Title: Identification of wheat genes that control root growth
Description: Requires annotated genes (Gene Ontology, Pfam and other functional annotation)
Data types: Genomic annotations? Gene location? (IWGS-SS ID or MIPS HCS link)
Challenges:
● Mapping between wheat genes and orthologs from other species (to deduce function by sequence similarity)
● Access to RNA-Seq data (genes that are not expressed in roots may be irrelevant)
● Mapping of wheat genes to information on their function based on the literature

Title: Query on trial data associated with varieties
Description: Search wheat varieties together with distribution maps, production figures, performance in wheat mega-environments, associated projects worldwide, layers of climatic data on specific wheat production areas, and disease prevention information
Data types: Phenotypic data, GIS data (wheat economy/production data)
Challenges: Phenotypic data should be linked to GIS data. Using keywords or ontology terms, a system or tool should be able to pull such information from the different websites and systems developed by the wheat community.
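The first use case (germplasm tagged with ontology terms) can be sketched as a simple filter over accession records. Everything here is an assumption made for illustration: the `Accession` fields, the accession IDs, the institute names and the `TRAIT:` identifiers are hypothetical placeholders, not terms from a real ontology or records from a real genebank.

```python
from dataclasses import dataclass, field


@dataclass
class Accession:
    accession_id: str
    name: str
    holding_institute: str  # minimal provenance information
    trait_terms: set = field(default_factory=set)  # ontology term IDs


def find_accessions(accessions, required_terms):
    """Return the accessions tagged with every term in required_terms."""
    required = set(required_terms)
    return [acc for acc in accessions if required.issubset(acc.trait_terms)]


# Hypothetical records with placeholder term identifiers.
ACCESSIONS = [
    Accession("ACC-001", "Example line A", "Institute X",
              {"TRAIT:drought_tolerance", "TRAIT:short_stature"}),
    Accession("ACC-002", "Example line B", "Institute Y",
              {"TRAIT:drought_tolerance"}),
    Accession("ACC-003", "Example line C", "Institute X", set()),
]

hits = find_accessions(ACCESSIONS, ["TRAIT:drought_tolerance"])
print([acc.accession_id for acc in hits])
```

The filter only works if every source tags traits with identifiers from shared vocabularies, which is why standardized metadata is listed first among the challenges.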
Wheat related ontologies & vocabularies survey
● Assess the level of visibility and interoperability of Wheat related vocabularies and ontologies
    ● Is the vocabulary/ontology updated regularly?
    ● What license and/or copyright is used?
    ● Is the vocabulary/ontology part of any ontology communities or listing services?
    ● Is the vocabulary/ontology used or implemented in any database/repository?
    ● Does the vocabulary/ontology interlink and/or map to other vocabularies and ontologies?
    ● Does the vocabulary/ontology
● Identify the domain covered by the ontologies and vocabularies
● Refine the cookbook
● Collect more interoperability use cases
● Collect some technical details
The Wheat related BioPortal lets users search for terms across multiple ontologies, browse mappings between terms in different ontologies, receive recommendations on which ontologies are most relevant for a corpus, and annotate text with terms from ontologies.
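To make "annotate text with terms from ontologies" concrete, here is a minimal offline sketch of label-based annotation. It does not call BioPortal or reproduce its annotator; it is a plain dictionary lookup, and the vocabulary labels and `EX:` identifiers are hypothetical placeholders.

```python
import re

# Hypothetical vocabulary: label -> placeholder term identifier.
VOCABULARY = {
    "plant height": "EX:0000001",
    "grain yield": "EX:0000002",
    "root length": "EX:0000003",
}


def annotate(text, vocabulary):
    """Return sorted (start, end, label, term_id) tuples for labels found in text."""
    annotations = []
    lowered = text.lower()  # case-insensitive matching on labels
    for label, term_id in vocabulary.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, lowered):
            annotations.append((match.start(), match.end(), label, term_id))
    return sorted(annotations)


sentence = "Grain yield and root length were measured."
for start, end, label, term_id in annotate(sentence, VOCABULARY):
    print(start, end, label, term_id)
```

A real annotation service would also handle synonyms, plurals and overlapping matches; exact label matching is used here only to keep the sketch short.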
Next steps
● Metadata (harmonization, minimal metadata sets)
● Mappings
● Next workshop (summer 2015)
● Review and complete the recommendations
● Refine and complete the guidelines and the best practices
● Finalize the repository of Wheat related vocabularies
● Prototyping: a semantic knowledge base
    ● Integrate data from different data sources
    ● Provide smart search capabilities that leverage the vocabularies used against the metadata
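One way to picture how mappings could support search across integrated sources is the sketch below: a query term is expanded with the terms it is mapped to, and the expanded set is matched against dataset metadata. This is an illustration under stated assumptions and not the planned prototype; the vocabulary prefixes, term names, dataset IDs and the one-hop expansion rule are all hypothetical.

```python
# Hypothetical mappings between terms of two vocabularies (treated as symmetric).
MAPPINGS = [("VOCAB_A:plant_height", "VOCAB_B:height")]

# Hypothetical dataset metadata from two sources using different vocabularies.
DATASETS = [
    {"id": "source1/trial-42", "terms": ["VOCAB_A:plant_height"]},
    {"id": "source2/panel-7", "terms": ["VOCAB_B:height"]},
    {"id": "source2/panel-9", "terms": ["VOCAB_B:grain_weight"]},
]


def expand_term(term, mappings):
    """Return the term plus every term directly mapped to it (one hop only)."""
    expanded = {term}
    for left, right in mappings:
        if left == term:
            expanded.add(right)
        elif right == term:
            expanded.add(left)
    return expanded


def search(datasets, term, mappings):
    """Return IDs of datasets whose metadata uses the term or a mapped term."""
    wanted = expand_term(term, mappings)
    return [d["id"] for d in datasets if wanted.intersection(d["terms"])]


results = search(DATASETS, "VOCAB_A:plant_height", MAPPINGS)
print(results)
```

Without the mapping, the same query would only find the dataset from the first source, which is the interoperability gain the prototype is meant to showcase.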
Thank you!
