ASHG 2015 - Redundant Annotations in Tertiary Analysis
1. www.bina.com
Fig. 3) Percentage of SNPs predicted as damaging by 7 different algorithms.
Fig. 1) 1000 Genomes overlap with transcription, coding and exonic regions.
Transcription and coding regions
Ensembl and RefSeq are standard references for the locations of
transcribed regions, coding regions and exons. Figure 1 shows how many
of the 85M unique variants from the 1000 Genomes Project overlap with
the genomic regions defined by each of these two sources.
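The overlap count described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single chromosome, half-open `(start, end)` regions, and hypothetical toy data; the names `merge_intervals`, `count_overlapping`, `source_a` and `source_b` are invented for the example.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def count_overlapping(positions, intervals):
    """Count positions that fall inside at least one interval."""
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]
    count = 0
    for pos in positions:
        # Index of the last merged interval that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            count += 1
    return count

# Hypothetical toy data (not from the poster): variant positions on one
# chromosome and region definitions from two annotation sources.
variant_positions = [100, 250, 400, 900, 1500]
regions_by_source = {
    "source_a": [(90, 300), (850, 950)],
    "source_b": [(90, 450), (1400, 1600)],
}
overlap_counts = {
    name: count_overlapping(variant_positions, regions)
    for name, regions in regions_by_source.items()
}
```

Merging the regions first means each variant is counted once per source even when several transcripts cover the same position.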
An empirical evaluation of redundant annotations in
common reference sources for tertiary analysis
James Warren1, Jian Li1, Aparna Chhibber1, Emre Colak1, Narges Bani Asadi1, Sharon
Barr1, Hugo Y. K. Lam1
1 Bina Technologies, Roche Sequencing, Redwood City, CA 94065
MOTIVATION
After obtaining genetic variants from next generation sequencing data, a precursory step in tertiary analysis is to annotate each variant with
available relevant information [1]. There is no standardized compendium for this purpose; researchers instead are required to compile data from a
motley of annotation tools and public datasets [2, 3]. These sources for annotation are independently maintained, and accordingly there is limited
concordance between their reported contents. The choice of annotation datasets thus has a direct and significant impact on the results of the
analysis [4].
References:
1. Warren, et al. A highly efficient and scalable compute platform for massive variant annotation and rapid genome interpretation. ASHG (2014)
2. Johnston and Biesecker. Databases of genomic variation and phenotypes: existing resources and future needs. Human Molecular
Genetics (2013)
3. Peterson, et al. Towards precision medicine: advances in computational approaches for the analysis of human variants. Journal of
Molecular Biology (2013)
4. Taylor, et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nature Genetics (2015)
METHODOLOGY
To empirically evaluate the differences between annotation data sources, we examined the overlap of the variants from the 1000 Genomes
project with commonly used and publicly available sources containing information regarding gene transcription and coding regions, predicted
functional impacts and population allele frequencies. For each of these dimensions, we counted the variants that met the specified
criteria according to at least one annotation from each data source, and compared those counts across sources.
CONCLUSIONS
The choice of data sources for variant annotation has a substantive
impact on tertiary analysis results. When identifying variants that satisfy
given criteria, the differences between sources can result in significantly
different findings. As a best practice, multiple sources should be
included in the analysis. This provides the ability to tune the specificity
and sensitivity of the results by choosing to use intersections or unions
at each filtering step.
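As a rough sketch of the union versus intersection choice at a single filtering step: the helper `combine_calls` and the variant IDs below are hypothetical and only illustrate the set logic, under the assumption that each source yields a set of variant identifiers passing the filter.

```python
def combine_calls(call_sets, mode):
    """Combine per-source sets of passing variant IDs.

    mode="union" keeps variants flagged by any source (more sensitive);
    mode="intersection" keeps variants flagged by every source (more specific).
    """
    call_sets = [set(s) for s in call_sets]
    if not call_sets:
        return set()
    if mode == "union":
        return set().union(*call_sets)
    if mode == "intersection":
        return set.intersection(*call_sets)
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical variant IDs passing one filter according to two sources.
passing_by_source = {
    "source_a": {"v1", "v2", "v3"},
    "source_b": {"v2", "v3", "v4"},
}
sensitive = combine_calls(passing_by_source.values(), "union")
specific = combine_calls(passing_by_source.values(), "intersection")
```

Here `sensitive` contains all four variants and `specific` only the two on which both sources agree.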
Please direct questions to:
bina.rd@bina.roche.com
DATA SOURCES
All data sources are publicly available and based on the GRCh37
reference. They were downloaded from the following organizations:
• NCBI: 1000 Genomes, Phase 3, V.5b
• UCSC: dbSNP142, RefSeq 72, Ensembl 75
• U. Washington: ESP 6500 SI, V2
• U. Texas Health Science Center: dbNSFP 2.9
Damaging SNV predictions
dbNSFP compiles predicted effects for non-synonymous SNVs in the
human genome. Of the 667K SNVs from 1000 Genomes that coincide
with dbNSFP, 76.3% are predicted damaging by at least one prediction
algorithm, whereas only 7.8% are predicted damaging by all algorithms.
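The two percentages above correspond to "any" and "all" aggregation over the per-algorithm calls. A minimal sketch, assuming each variant carries one boolean damaging call per algorithm; the function name, algorithm names and data are hypothetical and do not reflect the dbNSFP file layout:

```python
def damaging_fractions(predictions):
    """Return (fraction damaging by >= 1 algorithm, fraction damaging by all).

    predictions maps variant ID -> {algorithm name: bool}. Every variant is
    assumed to have at least one call; an empty call dict would count as
    "all damaging" because all() of an empty sequence is True.
    """
    total = len(predictions)
    if total == 0:
        return 0.0, 0.0
    any_count = sum(1 for calls in predictions.values() if any(calls.values()))
    all_count = sum(1 for calls in predictions.values() if all(calls.values()))
    return any_count / total, all_count / total

# Hypothetical calls from three algorithms for four variants.
toy_predictions = {
    "v1": {"algo_a": True, "algo_b": True, "algo_c": True},
    "v2": {"algo_a": True, "algo_b": False, "algo_c": False},
    "v3": {"algo_a": False, "algo_b": False, "algo_c": False},
    "v4": {"algo_a": False, "algo_b": True, "algo_c": True},
}
frac_any, frac_all = damaging_fractions(toy_predictions)
```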
Funnel analysis
We defined two separate annotation pipelines to identify NA12878
variants (NIST v.2.17) that are rare, predicted damaging, and within the
exons of protein-coding genes. Each pipeline identified approximately
60 to 80 variants, but the two resulting variant sets differed significantly.
Fig. 4) Comparative analysis of two annotation pipelines.
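A funnel of this kind can be sketched as an ordered list of filters applied to the same starting variants, once per pipeline, followed by a comparison of the final sets. The record fields (`exonic_a`, `damaging_b`, `af_a`, and so on), the `run_funnel` helper and the toy records are assumptions made for illustration; only the 0.1% rarity threshold comes from the poster.

```python
RARE_MAX = 0.001  # allele frequency < 0.1%

def run_funnel(variants, steps):
    """Apply (label, predicate) steps in order.

    Returns the surviving variants and the count remaining after each step.
    """
    remaining = list(variants)
    counts = []
    for label, keep in steps:
        remaining = [v for v in remaining if keep(v)]
        counts.append((label, len(remaining)))
    return remaining, counts

# Hypothetical records annotated by two sets of sources ("a" and "b").
variants = [
    {"id": "v1", "exonic_a": True, "exonic_b": True,
     "damaging_a": True, "damaging_b": True, "af_a": 0.0005, "af_b": 0.0004},
    {"id": "v2", "exonic_a": True, "exonic_b": True,
     "damaging_a": True, "damaging_b": False, "af_a": 0.0002, "af_b": 0.0002},
    {"id": "v3", "exonic_a": False, "exonic_b": True,
     "damaging_a": True, "damaging_b": True, "af_a": 0.0001, "af_b": 0.0003},
    {"id": "v4", "exonic_a": True, "exonic_b": True,
     "damaging_a": True, "damaging_b": True, "af_a": 0.02, "af_b": 0.0008},
]

def make_steps(suffix):
    """Build the three filter steps for the sources tagged with suffix."""
    return [
        ("exonic", lambda v: v[f"exonic_{suffix}"]),
        ("damaging", lambda v: v[f"damaging_{suffix}"]),
        ("rare", lambda v: v[f"af_{suffix}"] < RARE_MAX),
    ]

result_a, counts_a = run_funnel(variants, make_steps("a"))
result_b, counts_b = run_funnel(variants, make_steps("b"))

ids_a = {v["id"] for v in result_a}
ids_b = {v["id"] for v in result_b}
shared, only_a, only_b = ids_a & ids_b, ids_a - ids_b, ids_b - ids_a
```

Recording the count after every step makes it visible where the two pipelines diverge, not just that their final sets differ.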
Population allele frequencies
Allele frequency can be used to select or exclude both rare and
common variants. ESP, dbSNP and 1000 Genomes are all common
sources for this purpose. Of the 506K SNVs shared by the three
sources, we identified the overlap for rare, uncommon and common
variants. ESP differs considerably from the other two sources, but the
minor disagreement between dbSNP and 1000 Genomes is also noteworthy,
since the 1000 Genomes records are contributed to dbSNP.
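The frequency classes used in this comparison (rare below 0.1%, uncommon from 0.1% up to 5%, common at 5% and above, as labeled in Fig. 2) can be expressed as a simple binning function. The sketch below also tallies how often sources agree on the class; the function names, source keys and frequencies are hypothetical toy values.

```python
from collections import Counter

RARE_MAX = 0.001    # rare: allele frequency < 0.1%
COMMON_MIN = 0.05   # common: allele frequency >= 5%

def frequency_class(af):
    """Bin an allele frequency (as a fraction) into rare/uncommon/common."""
    if af < RARE_MAX:
        return "rare"
    if af < COMMON_MIN:
        return "uncommon"
    return "common"

def class_agreement(af_by_variant):
    """Count variants on which all sources agree or disagree about the class.

    af_by_variant maps variant ID -> {source name: allele frequency}.
    """
    tally = Counter()
    for freqs in af_by_variant.values():
        classes = {frequency_class(af) for af in freqs.values()}
        tally["agree" if len(classes) == 1 else "disagree"] += 1
    return tally

# Hypothetical allele frequencies reported by three sources.
toy_frequencies = {
    "v1": {"src_1": 0.0004, "src_2": 0.0006, "src_3": 0.0002},
    "v2": {"src_1": 0.0008, "src_2": 0.0012, "src_3": 0.0009},
    "v3": {"src_1": 0.20, "src_2": 0.18, "src_3": 0.06},
}
agreement = class_agreement(toy_frequencies)
```

In the toy data, `v2` sits near the 0.1% boundary, so a small difference between sources is enough to move it from rare to uncommon.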
Fig. 2) Overlap of population allele frequency sources.
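[Figure 2: Venn diagrams comparing dbSNP, 1000 Genomes and ESP for three frequency classes: Rare (allele frequency < 0.1%), Uncommon (0.1% ≤ allele frequency < 5%), Common (5% ≤ allele frequency).]
Region counts, in extraction order:
225,843; 30,751; 673; 356; 363; 32,717; 321
150,197; 36,979; 1,206; 574; 759; 33,629; 1,757
55,189; 3,059; 1,449; 499; 92; 4,228; 1,035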
[Figure 4: funnel of NA12878 variants (3,178,239 total) through two annotation pipelines.]
Filter step              First pipeline              Second pipeline
Within Gene              RefSeq: 1,370,552           Ensembl: 1,659,069
Within Exonic Region     RefSeq: 50,050              Ensembl: 91,707
Within Coding Region     RefSeq: 19,945              Ensembl: 44,748
Predicted Damaging       PolyPhen HDIV: 765          SIFT: 1,561
Frequency < 0.1%         dbSNP: 61                   1000 Genomes: 79
Final overlap: 53 variants in both results, 8 only in the dbSNP result, 26 only in the 1000 Genomes result.