www.bina.com
Fig. 3) Percentage of SNPs predicted as damaging by 7 different algorithms.
Fig. 1) 1000 Genomes overlap with transcription, coding and exonic regions.
Transcription and coding regions
Ensembl and RefSeq are standard references for the locations of
transcribed regions, coding regions and exons. Figure 1 shows how many
of the 85M unique variants from the 1000 Genomes Project overlap with
the genomic regions as defined by each of these two sources.
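The overlap count described above can be sketched as follows. This is a minimal illustration and not the pipeline used for the poster: the function names, the toy coordinates and the half-open interval convention are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(regions):
    """Group (chrom, start, end) regions by chromosome and merge overlaps.

    Assumes half-open intervals [start, end); this convention is an
    assumption of the sketch, not something stated on the poster.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, spans in by_chrom.items():
        merged = []
        for start, end in sorted(spans):
            if merged and start <= merged[-1][1]:
                # Extend the previous region instead of adding a new one.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index

def overlaps(index, chrom, pos):
    """Return True if pos lies inside any merged region on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last region starting at or before pos
    return i >= 0 and pos < ends[i]

def count_overlaps(variants, regions):
    """Count variants that fall inside at least one region of a source."""
    index = build_index(regions)
    return sum(overlaps(index, chrom, pos) for chrom, pos in variants)

# Hypothetical toy data standing in for variants and two gene models.
variants = [("1", 100), ("1", 250), ("2", 50), ("2", 500)]
source_a = [("1", 90, 120), ("2", 40, 60)]
source_b = [("1", 90, 300), ("1", 200, 260), ("2", 400, 450), ("2", 480, 520)]

print(count_overlaps(variants, source_a))
print(count_overlaps(variants, source_b))
```

Running the same variant list against each source separately gives the per-source counts that a figure such as Figure 1 compares.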
An empirical evaluation of redundant annotations in
common reference sources for tertiary analysis
James Warren¹, Jian Li¹, Aparna Chhibber¹, Emre Colak¹, Narges Bani Asadi¹, Sharon
Barr¹, Hugo Y. K. Lam¹
¹ Bina Technologies, Roche Sequencing, Redwood City, CA 94065
MOTIVATION
After genetic variants have been obtained from next-generation sequencing data, a preliminary step in tertiary analysis is to annotate each variant
with the relevant information available [1]. There is no standardized compendium for this purpose; researchers must instead compile data from a
motley collection of annotation tools and public datasets [2, 3]. These annotation sources are maintained independently, and accordingly there is
limited concordance between their reported contents. The choice of annotation datasets thus has a direct and significant impact on the results of
the analysis [4].
References:
1. Warren, et al. A highly efficient and scalable compute platform for massive variant annotation and rapid genome interpretation. ASHG (2014)
2. Johnston and Biesecker. Databases of genomic variation and phenotypes: existing resources and future needs. Human Molecular
Genetics (2013)
3. Peterson, et al. Towards precision medicine: advances in computational approaches for the analysis of human variants. Journal of
Molecular Biology (2013)
4. Taylor, et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nature Genetics (2015)
METHODOLOGY
To empirically evaluate the differences between annotation data sources, we examined the overlap of the variants from the 1000 Genomes
Project with commonly used, publicly available sources containing information on gene transcription and coding regions, predicted
functional impacts and population allele frequencies. For each of these dimensions, we compared across sources the number of variants for
which at least one annotation from the data source met the specified criteria.
CONCLUSIONS
The choice of data sources for variant annotation has a substantive
impact on tertiary analysis results. When identifying variants that satisfy
given criteria, the differences between sources can result in significantly
different findings. As a best practice, multiple sources should be
included in the analysis. This provides the ability to tune the specificity
and sensitivity of the results by choosing to use intersections or unions
at each filtering step.
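The trade-off between intersections and unions can be sketched as below. This is a minimal illustration under assumed inputs: the function name `combine`, the source labels and the variant identifiers are hypothetical and do not come from the poster.

```python
def combine(calls_by_source, mode):
    """Combine per-source sets of variants that pass one filter.

    mode="union" keeps a variant if any source passes it (more sensitive);
    mode="intersection" keeps it only if every source does (more specific).
    """
    sets = list(calls_by_source.values())
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical variants flagged as rare by two annotation sources.
rare_by_source = {
    "source_a": {"v1", "v2", "v3"},
    "source_b": {"v2", "v3", "v4"},
}

print(sorted(combine(rare_by_source, "union")))
print(sorted(combine(rare_by_source, "intersection")))
```

Choosing the mode independently at each filtering step is one way to tune how permissive the overall analysis is.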
Please direct questions to:
bina.rd@bina.roche.com
DATA SOURCES
All data sources are publicly available and based on the GRCh37
reference. They were downloaded from the following organizations:
• NCBI: 1000 Genomes, Phase 3, V.5b
• UCSC: dbSNP142, RefSeq 72, Ensembl 75
• U. Washington: ESP 6500 SI, V2
• U. Texas Health Science Center: dbNSFP 2.9
Damaging SNV predictions
dbNSFP compiles predicted effects for non-synonymous SNVs in the
human genome. Of the 667K SNVs from 1000 Genomes that coincide
with dbNSFP, 76.3% are predicted damaging by at least one prediction
algorithm, whereas only 7.8% are predicted damaging by all algorithms.
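The two percentages above correspond to an "any" rule and an "all" rule over the prediction algorithms. A minimal sketch of that calculation follows; the data layout, the algorithm labels and the assumption that every variant has a call from every algorithm are simplifications for illustration and do not reflect the actual dbNSFP file format.

```python
def damaging_fractions(predictions):
    """Return (fraction damaging by any algorithm, fraction damaging by all).

    predictions maps a variant ID to a dict of {algorithm: bool}, where True
    means the algorithm predicts the variant to be damaging.
    Assumes every variant has a call from every algorithm.
    """
    total = len(predictions)
    if total == 0:
        return 0.0, 0.0
    by_any = sum(any(calls.values()) for calls in predictions.values())
    by_all = sum(all(calls.values()) for calls in predictions.values())
    return by_any / total, by_all / total

# Hypothetical calls from three unnamed algorithms for four variants.
toy_predictions = {
    "v1": {"algo_1": True, "algo_2": True, "algo_3": True},
    "v2": {"algo_1": True, "algo_2": False, "algo_3": False},
    "v3": {"algo_1": False, "algo_2": False, "algo_3": False},
    "v4": {"algo_1": False, "algo_2": True, "algo_3": True},
}

frac_any, frac_all = damaging_fractions(toy_predictions)
print(f"any: {frac_any:.1%}, all: {frac_all:.1%}")
```

The gap between the two fractions is a direct measure of how much the choice of algorithm, or of a rule for combining algorithms, changes the set of variants called damaging.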
Funnel analysis
We defined two separate annotation pipelines to identify NA12878
variants (NIST v.2.17) that are rare, predicted damaging, and within the
exons of protein-coding genes. Each pipeline identified roughly 60 to 80
variants, but the two resulting variant sets differed substantially.
Fig. 4) Comparative analysis of two annotation pipelines.
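A funnel of this kind can be sketched as a sequence of set intersections, followed by a comparison of the two final sets. The stage names, variant identifiers and the `run_funnel` helper below are hypothetical; the sketch shows the general technique and is not the pipelines used for Figure 4.

```python
def run_funnel(variants, stages):
    """Apply filters in order and record how many variants survive each one.

    stages is a list of (name, passing_set) pairs, where passing_set holds
    the variants that satisfy that filter according to one annotation source.
    """
    remaining = set(variants)
    counts = []
    for name, passing in stages:
        remaining &= passing
        counts.append((name, len(remaining)))
    return remaining, counts

# Hypothetical toy input: six variants, two pipelines with different sources.
all_variants = {"v1", "v2", "v3", "v4", "v5", "v6"}
pipeline_a = [
    ("exonic", {"v1", "v2", "v3", "v4"}),
    ("damaging", {"v1", "v2", "v3"}),
    ("rare", {"v1", "v2"}),
]
pipeline_b = [
    ("exonic", {"v1", "v2", "v3", "v5"}),
    ("damaging", {"v2", "v3", "v5"}),
    ("rare", {"v2", "v5"}),
]

final_a, counts_a = run_funnel(all_variants, pipeline_a)
final_b, counts_b = run_funnel(all_variants, pipeline_b)

print(counts_a)
print(counts_b)
print("shared:", sorted(final_a & final_b))
print("only A:", sorted(final_a - final_b))
print("only B:", sorted(final_b - final_a))
```

In this toy case both pipelines end with the same number of variants, yet only one variant is common to both, which mirrors the kind of discrepancy the funnel analysis describes.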
Population allele frequencies
Allele frequency can be used to select or exclude both rare and
common variants. ESP, dbSNP and 1000 Genomes are all widely used
sources for this purpose. For the 506K SNVs shared by the three
sources, we identified the overlap for rare, uncommon and common
variants. ESP differs considerably from the other two, but the minor
disagreement between dbSNP and 1000 Genomes is also noteworthy, since
the 1000 Genomes records are contributed to dbSNP.
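The comparison can be sketched by binning each source's allele frequency into the three classes used in Figure 2 (rare: below 0.1%; uncommon: 0.1% up to but not including 5%; common: 5% and above) and then checking whether the sources agree. The function names, source keys and frequencies below are hypothetical values chosen for illustration.

```python
def frequency_class(af):
    """Bin an allele frequency (as a fraction) into the Figure 2 classes."""
    if af < 0.001:
        return "rare"
    if af < 0.05:
        return "uncommon"
    return "common"

def concordance(af_by_variant):
    """Return (number of variants where all sources agree, discordant IDs).

    af_by_variant maps a variant ID to {source: allele frequency}.
    """
    discordant = []
    for variant, afs in af_by_variant.items():
        classes = {frequency_class(af) for af in afs.values()}
        if len(classes) > 1:
            discordant.append(variant)
    return len(af_by_variant) - len(discordant), sorted(discordant)

# Hypothetical frequencies for three variants reported by three sources.
toy_afs = {
    "v1": {"esp": 0.0005, "dbsnp": 0.0008, "kg": 0.0009},
    "v2": {"esp": 0.0020, "dbsnp": 0.0009, "kg": 0.0009},
    "v3": {"esp": 0.1000, "dbsnp": 0.0600, "kg": 0.0700},
}

agree, discordant = concordance(toy_afs)
print(agree, discordant)
```

A variant such as `v2` would be kept by a rare-variant filter based on two of the sources and dropped by one based on the third, which is why the choice of frequency source matters.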
Fig. 2) Overlap of population allele frequency sources.
[Figure 2: three Venn diagrams comparing dbSNP, 1000 Genomes and ESP, one per frequency class.
Rare: allele frequency < 0.1%. Uncommon: 0.1% ≤ allele frequency < 5%. Common: 5% ≤ allele frequency.]

[Figure 4: funnel of NA12878 variants, 3,178,239 total variants]
Within gene:           RefSeq 1,370,552 variants      Ensembl 1,659,069 variants
Within exonic region:  RefSeq 50,050 variants         Ensembl 91,707 variants
Within coding region:  RefSeq 19,945 variants         Ensembl 44,748 variants
Predicted damaging:    PolyPhen HDIV 765 variants     SIFT 1,561 variants
Frequency < 0.1%:      dbSNP 61 variants              1000 Genomes 79 variants
Overlap of final sets: 8 dbSNP only, 53 shared, 26 1000 Genomes only

ASHG 2015 - Redundant Annotations in Tertiary Analysis
