GA4GH Metadata Task Team Advances Interoperability of Genomic Data Standards
1. GA4GH – Metadata task team
Mélanie Courtot
On behalf of the Metadata task team
mcourtot@ebi.ac.uk
@mcourtot
2. The Metadata Task Team (MTT)
• Main challenges:
• MTT is cross-cutting, which was hard in the first iteration of GA4GH
• Difficulty finding good datasets: “free-floating” development of metadata standards
• Lack of use cases across task teams
• Mid-2016: move towards application-driven changes to bring focus to model development
3. Initial metadata projects
• ArrayMap: cancer genome array data, for visualization and analysis of somatic copy number aberrations
• Beacon+: built on top of ArrayMap, incorporates structural genomic variants
• BioSamples: data on 5 million samples, linking to EMBL-EBI archives (ArrayExpress, ENA, EGA…)
(Figure labels: the projects range from focused to diverse.)
http://arraymap.org
http://beacon.arraymap.org/beacon/beaconplus-ui/
https://www.ebi.ac.uk/biosamples/
6. DIPG
• Diffuse Intrinsic Pontine Glioma
• Rare, incurable brain tumor in children 6-8 years old
• Median survival < 1 year
• Lack of good models hampers progress in treatment: no significant advances in 30 years
Misuraca et al., Front Oncol. 2015; 5: 172.
7. The DIPG dataset
• 910 cases taken from 20 published series + 157 unpublished cases
• Added into ArrayMap and curated, accessible through Beacon+
(MacKay et al., Cancer Cell 2017)
Michael Baudis, Bo Gao
8. Beacon+
Concept
• Implementation of cancer beacon prototype, backed by arrayMap and DIPG data set
• Structural variations (DUP, DEL) in addition to SNV
• Diagnosis queries using ontology codes (NCIT, ICD-O)
• Quantitative responses
• GA4GH schema compatible variant & metadata API
9. Querying over integrated datasets through the GA4GH API
• 1 variant is found in 21 biosamples, of which 12 are from the brain stem (i.e. DIPG)
http://dipg.progenetix.org/beacon/beaconplus-server/beaconresponse.cgi?dataset_id=dipg&variants.reference_name=chr17&assembly_id=GRCh36&variants.variant_type=SNV&variants.start=7577121&variants.reference_bases=G&variants.alternate_bases=A&biosamples.bio_characteristics.ontology_terms.term_id=pgx:icdot:c71.7
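The query URL above can be assembled programmatically. The following is a minimal sketch in Python: the endpoint and every parameter name and value are copied from the slide, while the helper name `build_beacon_query` is hypothetical. The sketch only builds the URL and does not contact the server, since the prototype endpoint may no longer be available.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint copied from the slide; it may no longer be live.
BEACON_ENDPOINT = (
    "http://dipg.progenetix.org/beacon/beaconplus-server/beaconresponse.cgi"
)


def build_beacon_query(params):
    """Return the endpoint URL with params encoded as a query string.

    Hypothetical helper, not part of any Beacon client library.
    """
    # safe=":" leaves the colons of the ontology term (pgx:icdot:c71.7)
    # unescaped, matching the URL shown on the slide.
    return BEACON_ENDPOINT + "?" + urlencode(params, safe=":")


# Parameter names and values as they appear on the slide.
query = {
    "dataset_id": "dipg",
    "variants.reference_name": "chr17",
    "assembly_id": "GRCh36",
    "variants.variant_type": "SNV",
    "variants.start": 7577121,
    "variants.reference_bases": "G",
    "variants.alternate_bases": "A",
    "biosamples.bio_characteristics.ontology_terms.term_id": "pgx:icdot:c71.7",
}

url = build_beacon_query(query)
print(url)
```

Keeping the variant part (`variants.*`) and the sample part (`biosamples.*`) in one dictionary mirrors how the slide combines a variant query with a diagnosis filter in a single request.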
10. A few issues along the way…
but we know how to make problems
far more tractable.
13. We generate GA4GH datasets for integration over BioSD
Trish Whetzel, Matt Green
We use data and annotations to provide semi-automated curation
diseaseState
sampleCharacteristics
hostDisease
diabetes
Diagnosis
…
disease
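One way to picture the semi-automated curation step is as a normalization of free-text attribute names to a single canonical key. The sketch below is an assumption about how such a step could look, not the BioSD implementation: the synonym set uses three of the attribute names listed above, and the function name, the review list and the sample values are hypothetical.

```python
# Canonical attribute name, taken from the slide.
CANONICAL_KEY = "disease"

# Assumption: these submitter-provided keys are treated as synonyms of
# "disease". The names come from the slide; the grouping is illustrative.
SYNONYMS = {"diseasestate", "hostdisease", "diagnosis"}


def normalize_keys(attributes):
    """Rename synonym keys to the canonical key.

    Returns (normalized, needs_review). Conflicting values are not merged
    automatically; they are set aside for a human curator, which is the
    "semi-automated" part of this sketch.
    """
    normalized = {}
    needs_review = []
    for key, value in attributes.items():
        target = CANONICAL_KEY if key.lower() in SYNONYMS else key
        if target in normalized and normalized[target] != value:
            # Two different values would map to the same key: flag it.
            needs_review.append((key, value))
            continue
        normalized[target] = value
    return normalized, needs_review


# Hypothetical sample record.
sample = {"diseaseState": "diabetes", "organism": "Homo sapiens"}
clean, review = normalize_keys(sample)
print(clean, review)
```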
14. Semantics as a service
Identity Resolution
Id Version & Provenance
Tools Registry Ontology Services
Standards and APIs
Linked Data
Pipelines Applications Publishers
Template Services
Identity
Mapping
Guidelines and
Standards Registry
Citation
Implementation
Search (BioSolr) Prefix commons
Dataset Description
Metadata
Validation Services
16. Different datasets use different standards: we provide mappings
• The International Classification of Diseases for Oncology (ICD-O) codes the site (topography) and the histology (morphology) of neoplasms
• A combination of ICD-O morphology and topography codes can be mapped to the NCI Thesaurus
Paula Carrio Cordo
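The mapping idea can be sketched as a lookup keyed on the (morphology, topography) pair. Everything in the table below is a placeholder: only the topography code C71.7 (brain stem) appears in this deck, while the morphology code and the NCI Thesaurus identifiers are deliberately fake stand-ins. Falling back to a morphology-only entry is an assumption made for illustration, not a documented rule of the mapping.

```python
from typing import Optional

# Placeholder mapping table. A real table would hold actual ICD-O
# morphology codes and NCI Thesaurus identifiers.
ICDO_TO_NCIT = {
    # (morphology, topography) -> NCIt term for the specific combination
    ("M-PLACEHOLDER-1", "C71.7"): "NCIT:PLACEHOLDER_A",
    # (morphology, None) -> less specific, morphology-only term
    ("M-PLACEHOLDER-1", None): "NCIT:PLACEHOLDER_B",
}


def map_to_ncit(morphology: str, topography: Optional[str] = None) -> Optional[str]:
    """Return the NCIt term for an ICD-O morphology/topography pair.

    Tries the exact pair first, then the morphology-only entry
    (an assumed fallback), and returns None if neither exists.
    """
    exact = ICDO_TO_NCIT.get((morphology, topography))
    if exact is not None:
        return exact
    return ICDO_TO_NCIT.get((morphology, None))


specific = map_to_ncit("M-PLACEHOLDER-1", "C71.7")
fallback = map_to_ncit("M-PLACEHOLDER-1", "C99.9")
unknown = map_to_ncit("M-UNKNOWN", "C71.7")
print(specific, fallback, unknown)
```

Keying on the pair rather than on morphology alone reflects the point of the slide: the same histology at a different site may correspond to a different NCI Thesaurus term.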
17. We usually work with open data
• Our data is open and publicly released
• Not the case for all our users, e.g. EGA requires controlled access
18. Development of DUO (Data Use Ontology) for standard consent codes and data restrictions
• Collaboration EGA/Broad
• Integration with ADA-M and Beacon
https://github.com/EBISPOT/DUO
Moran Cabili, Dylan Spalding, Giselle Kerry
20. MTT – short term: a new home
• Move to a distinct metadata repository
• Updated documentation
• Split into modules
• Link to examples
⇒ increase visibility/uptake
• Community adoption/alignment
21. MTT – medium term: coordination with work streams
• Streaming: sample identification and representation supporting streaming use cases
• Implementation: BioSamples and EGA
• Discovery: representation for discovery use cases
• Implementation: Beacon+ and ArrayMap
• Genomic Knowledge Standards: dataset level description, study representation. Analysis result?
• Implementation: BioSamples and ENA
23. DIPG data in BioSamples
• GA4GH API has been implemented over BioSamples
• Allows querying via GA4GH metadata model, linking to other EBI archives and integrating available data