The document provides an overview of variation and assembly resources available at EMBL-EBI, including the European Variation Archive (EVA), which assigns identifiers to non-human genetic variants and releases biannual merged variant datasets for different species. It also discusses the Genome-Wide Association Study (GWAS) Catalog, which curates over 3,000 publications and 44,000 variant-trait associations. Additionally, it mentions PDX Finder, a catalogue of patient-derived xenograft (PDX) models, and tools for validating Variant Call Format (VCF) files.
4. Looking back
• 1982 EMBL and Genbank established
• 1982 Data sharing and standardization
collaboration put in place
• 1983 first full phage genome published
• First public in November 1982
• Enterobacteria phage T7
https://www.ebi.ac.uk/ena/data/view/V01146
Credit: Ana Toribio
5. Assembly archiving today
• UI and API submission interfaces
• Reads and Assemblies accepted
• Chlamydia trachomatis A2497 serovar A
Comprehensive global genome dynamics
of Chlamydia trachomatis show ancient
diversification followed by contemporary
mixing and recent lineage expansion.
563 full genomes (455 novel)
Genome Res. 2017 Jul;27(7):1220-1229.
doi: 10.1101/gr.212647.116.
J Hadfield et al
https://www.ebi.ac.uk/ena/data/view/FM872306
Credit: Ana Toribio
6. Accessing managed human data
EGA By the numbers
● 1,698 studies
● 3,591 datasets
● 777 data providers
● >10,000 requestors
● EMBL-EBI and CRG
By volume
● 4.7 Petabytes
https://ega-archive.org/
Credit: Thomas Keane
9. HTS Get
What is it?
• An efficient non-file based API interface for accessing read data
• Separate backend storage implementation from interface
• A bridge from existing file formats to API client/server model
Progress
• Launch of v1.0 at GA4GH plenary October 2017!
• Demonstrations of integration with AAI+secure transfer
http://samtools.github.io/hts-specs/
Credit: Thomas Keane
10. Beacon project
• Allele based genotype queries
• Each beacon determines its own access policy
• Data returned can be determined depending on tier
• Allele frequency
• Data set
• Population
• Sample
• Phenotype
• Anything else?
• EGA Beacon has 3 tiers of access
• Public
• Registered
• Controlled
https://beacon-network.org/#/
Credit: Thomas Keane
11. European Variation Archive
• European Variation Archive
• Established in 2014
• Accepts VCF submissions (no archive specific format)
• Can link to ENA read submissions
• Taking over non-human RS assignment from dbSNP
Credit: Cristina Yenyxe Gonzalez
https://www.ebi.ac.uk/eva/
12. Non Human RS number assignment and releases
• EVA to assign rs (locus) and ss (submission) numbers for non human variants
• Existing accessions will remain in use
• Continues rolling release of variants as submitted
• Bi-annual merging of submitted variants into loci
• Always connected to existing rs numbers on search
• Per species VCFs released
• API and streaming access available
• EVA continues to broker Human variants to dbSNP
Credit: Cristina Yenyxe Gonzalez
https://www.ebi.ac.uk/eva/
13. VCF specification and validation
• Maintained by GA4GH file formats group
• EVA validates against official specification
• www.github.com/ebivariation/vcf-validator
• Proposal in place to improve SV structure in VCFs
• Maintainers of variation archives
• Structural variation caller methods developers
• Pull request on https://github.com/samtools/hts-specs/pull/231
• Please give feedback
Credit: Cristina Yenyxe Gonzalez
https://www.github.com/samtools/hts-specs
16. Phenotype and disease data can be
searched by ontology term to retrieve
aggregated results.
Improved allele frequency
views with more data
available
Credit: Sarah Hunt
17. The GWAS Catalog
• Public catalog of Genome Wide Association Studies
• Curated from the literature
• Now with summary statistics
• > 3000 publications
• > 44,000 variant-trait associations
https://www.ebi.ac.uk/gwas/downloads/summary-statistics
Credit: Fiona Cunningham
18. Turning pathogen data collection into actionable information
• Risk-assessment models and risk-based sampling
• From samples and metadata to comparable data
• From comparable data to actionable information
• Pathogen identification and characterization
• Outbreak detection
• Outbreak investigation
• Outbreak prediction
• Building a common data platform and analysis framework
• Risk communication
20. PDX Finder
Build a comprehensive global catalogue of PDX models and their data available
for researchers
www.pdxfinder.org
JAX and EMBL-EBI co-developed resource
Carol Bult – Helen Parkinson/Terry Meehan
NCI funding
EC EuroPDX
Credit: Terry Meehan
Hello, I am Laura Clarke and I would like to thank the organisers for inviting me and Jared to lead this session and for giving me the opportunity to speak.
Before we get to the other talks about methods and approaches for assembly, variant discovery and annotation, I want to give you a whistlestop tour of some of the EBI’s Variation and Assembly resources. We have been running genomic archives for about 35 years, and over that time we have added resources that bring this collected data together and support large-scale data generation projects, ensuring their efforts are useful to the whole community.
To start with, EBI has 3 genomic archives: the European Nucleotide Archive, the European Genome-phenome Archive and the European Variation Archive.
EMBL-Bank was founded in 1982, a month before GenBank. Data sharing and standardization were rapidly established to ensure data submitted to either archive was available to users of both. In those days this was all distributed in printed books, before moving to CD and finally to the internet. One of the first genomes sequenced and submitted to EMBL-Bank was the T7 phage, with a pre-publication submission in late 1982 and publication in the Journal of Molecular Biology in 1983. You can still access this genome today from the ENA; it was last updated in 2004.
These days many more genomes from across all clades of life are being sequenced. Here is a Chlamydia genome, part of a publication assessing the global diversity of the most prevalent sexually transmitted bacterium, which remains poorly understood because it is difficult to culture. This study was able to demonstrate that the diversity in the genome arose relatively recently, over a few thousand years rather than the millions previously thought.
In a world where we sequence more and more humans to support biomedical research, we can no longer openly distribute all this data. The European Genome-phenome Archive was established in 2008 alongside dbGaP to provide a managed access solution, enabling scientists to get this data when previously it was impossible. EBI was joined by CRG in 2012 to maintain and extend the resource, and today there are more than 3000 data sets and more than 10,000 users.
While the EGA has made it possible to access this data, in the age of tens of thousands of high-depth genomes the process is increasingly cumbersome, both to get permission to access the data and to get the data once you have permission.
Over the course of the last year and moving forward, the EGA is deploying new solutions to make it much easier both to put data in and to get data out of the EGA (once you have permission). It is improving the tools available to give users permission and making it possible for local deployments of the EGA to be set up, so that nations which don’t allow genetic data to leave their borders can still take advantage of this data sharing technology.
Yes, the driving reason for this is that human genetic data is moving towards federation, especially as research interfaces with national healthcare (where data is often subject to jurisdictional restrictions or has higher local data security requirements).
One new tool released at ASHG a couple of weeks ago is HTS Get. This allows streaming of data from secure storage using authentication, enabling users both to access this data in a suitable cloud environment and to get specific genomic regions of these files, rather than needing to download and decrypt the whole thing when you are only interested in a piece of chr22. Currently HTS Get supports BAM and CRAM, and VCF will be added in the near future.
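To make the region-based access concrete, here is a minimal sketch of how a client might form a reads request and interpret the JSON "ticket" a server sends back. The host names and the read identifier are made up for illustration, and the helper names `build_htsget_url` and `plan_downloads` are hypothetical; the sketch assumes the `/reads/<id>` path, the `format`, `referenceName`, `start` and `end` query parameters and the `htsget.urls` ticket layout described in the htsget specification, and it does no networking or authentication.

```python
import base64
from urllib.parse import urlencode


def build_htsget_url(base_url, read_id, reference_name, start, end, fmt="BAM"):
    """Build a reads request URL for one genomic region.

    start and end are 0-based, half-open coordinates.
    base_url and read_id are placeholders supplied by the caller.
    """
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{read_id}?{query}"


def plan_downloads(ticket):
    """Turn a ticket into an ordered list of steps.

    Inline blocks arrive as 'data:' URIs (assumed base64-encoded here) and are
    decoded immediately; every other block is a URL the client still has to
    fetch, sending the headers listed for it. Concatenating the blocks in
    order gives the requested slice of the file.
    """
    steps = []
    for block in ticket["htsget"]["urls"]:
        url = block["url"]
        if url.startswith("data:"):
            payload = url.split(",", 1)[1]
            steps.append(("inline", base64.b64decode(payload)))
        else:
            steps.append(("fetch", url, block.get("headers", {})))
    return steps
```

A request for a piece of chr22 would then be `build_htsget_url("https://htsget.example.org", "sample1", "chr22", 16000000, 16100000)`, with the access token passed in the HTTP headers rather than in the URL.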
For allele-level variant querying, the EGA already hosts beacons where you can query by allele to find out if the beacon contains it. Depending on the beacon, different types of data can be returned and different authentication is required before access. EGA has three tiers of access: public (anything which is openly consented), registered, where the user needs to provide an email address, and controlled, which needs DAC permission. And this leads us nicely onto the European Variation Archive.
The most recently established of the EBI genomic archives is the European Variation Archive, set up in 2014. Originally it archived dataset and publication VCF files (getting the associated reads into ENA where possible) and then brokered variants to dbSNP for RS numbers, but as variation discovery continues to increase they are taking over locus-level accessioning from dbSNP for all non-human variants.
The existing accessions will remain in use, imported into the EVA systems over the course of the next year. No new namespace will be invented. Continuous release of study-level VCFs will continue, with submitted variants merged into locus-level rs accessions on a biannual basis.
VCF isn’t many people’s favourite format. It is being maintained by the GA4GH file formats group and the EVA will validate against the official specification. There are already proposals in place for SV representation, and they are very keen to get feedback.
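As an illustration of tiered responses, the toy model below answers an allele query with a bare yes/no at the public tier and adds more detail at higher tiers. Everything in it is an assumption made for the example: the class and field names are hypothetical, and which fields are released at which tier is an invented policy, not the EGA Beacon's actual behaviour.

```python
from dataclasses import dataclass

# Ordered from least to most privileged, as on the slide.
TIERS = ("public", "registered", "controlled")


@dataclass(frozen=True)
class Allele:
    reference_name: str
    start: int              # 0-based position
    reference_bases: str
    alternate_bases: str


class ToyBeacon:
    """In-memory stand-in for a beacon; not a real Beacon implementation."""

    def __init__(self):
        self._records = {}

    def add(self, allele, tier, frequency, dataset):
        """Store an allele together with the lowest tier allowed to see it."""
        self._records[allele] = {
            "tier": tier, "frequency": frequency, "dataset": dataset,
        }

    def query(self, allele, user_tier="public"):
        level = TIERS.index(user_tier)
        record = self._records.get(allele)
        # Unknown alleles and alleles above the caller's tier look the same.
        if record is None or TIERS.index(record["tier"]) > level:
            return {"exists": False}
        response = {"exists": True}
        # Invented policy: frequency from "registered", dataset from "controlled".
        if level >= TIERS.index("registered"):
            response["frequency"] = record["frequency"]
        if level >= TIERS.index("controlled"):
            response["dataset"] = record["dataset"]
        return response
```

Returning the same answer for "not present" and "not visible at your tier" is a deliberate choice in this sketch, so that a lower-tier user cannot infer the existence of restricted alleles.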
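The relationship between submission-level (ss) and locus-level (rs) accessions can be pictured with a toy merge step. This is only an illustration under stated assumptions: it treats a locus as the tuple (contig, position, variant type), the function name `merge_into_loci` and the record layout are hypothetical, and the EVA's real clustering rules are not described in this talk.

```python
from collections import defaultdict


def merge_into_loci(submitted, existing_rs=None, next_rs=1):
    """Group submitted variants by locus and give each locus one rs id.

    submitted:   iterable of (ss_id, contig, position, variant_type)
    existing_rs: dict mapping (contig, position, variant_type) -> rs id;
                 these accessions are reused, never reassigned
    next_rs:     first number to use for newly created rs ids (the caller
                 must make sure it does not collide with existing ones)
    """
    rs_by_locus = dict(existing_rs or {})
    loci = defaultdict(list)
    for ss_id, contig, position, variant_type in submitted:
        key = (contig, position, variant_type)
        if key not in rs_by_locus:
            rs_by_locus[key] = f"rs{next_rs}"
            next_rs += 1
        loci[rs_by_locus[key]].append(ss_id)
    return dict(loci)
```

Two submissions at the same locus end up under one rs id, while a locus that already has an rs number keeps it, which mirrors the point that existing accessions remain in use.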
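To give a flavour of what validating against the specification involves, here is a minimal sketch of a few structural checks on a VCF file. It is not the EVA vcf-validator and covers only a tiny part of the specification: the `##fileformat` first line, the eight fixed header columns, and an integer POS field. The function name `check_vcf` is hypothetical.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf(lines):
    """Return a list of (line_number, message); an empty list means no problem found."""
    problems = []
    header_seen = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if number == 1 and not line.startswith("##fileformat=VCFv"):
            problems.append((number, "first line must be ##fileformat=VCFvX.Y"))
        if line.startswith("##"):
            continue  # meta-information line, not inspected further here
        if line.startswith("#"):
            if line.split("\t")[:8] != FIXED_COLUMNS:
                problems.append((number, "header must start with the 8 fixed columns"))
            header_seen = True
            continue
        if not header_seen:
            problems.append((number, "data line before #CHROM header"))
        fields = line.split("\t")
        if len(fields) < 8:
            problems.append((number, "fewer than 8 tab-separated columns"))
            continue
        if not fields[1].isdigit():
            problems.append((number, "POS is not a non-negative integer"))
    return problems
```

Calling `check_vcf(open("study.vcf"))` on an uncompressed file (the file name is a placeholder) would list the offending line numbers; a real validator additionally checks the meta-information, INFO and FORMAT fields, and genotype columns.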
Who are the maintainers?
Cristina Yenyxe Gonzalez
David Roazen
Petr Danecek
Now we move onto our other resources supporting genome and variation data at EBI. These other resources pull together data from the archives, other projects and the publication record.
I can’t talk about EBI’s resources for variation without mentioning Ensembl Variation, which holds data for many different species and puts this data into the context of all the other annotation Ensembl holds, allowing you to see genotypes, LD and allele frequencies and discover variants by phenotype associations.
Recently the phenotype and disease searches have been improved and there are better views with more allele frequencies on the variant pages. Do go see Will McLaren’s poster about their improved REST query service.
We also have the GWAS Catalog, which curates genome-wide association studies from the literature. Over the last year they have started adding summary statistics for these studies; about 1.5% of studies currently have them.
We run data coordination efforts for many different projects. We support the COMPARE project, an EC-funded effort to improve biosurveillance. The ENA archives the assemblies and provides an analysis platform to support rapid analysis of the collected data.
Finally I want to introduce a very new project which only started in the last year. Patient-derived xenograft mice have been increasing in usage over the past five years, but it is currently challenging to find all the different models and the data associated with them. These PDX mice can help assess cancer treatments in a trial setting and in future may be useful in a patient context, for planning treatment.
Both the NCI and the European Commission are funding the EBI together with the JAX labs to build a global catalog of these models and their associated data to ensure the whole community can benefit from the models which are being created.
Thank you very much for listening to me. I have spoken about the work of many people at the institute today. Do remember the EBI is a great place to work and we are always hiring.