www.bina.com 
A highly efficient and scalable compute platform for massive variant annotation 
and rapid genome interpretation 
James Warren1, Emre Colak1, Amirhossein Kiani1, Jian Li2, Aparna Chhibber2, Sanchita Bhattacharya3, Narges Bani Asadi1,2, Sharon Barr1, Atul Butte3, Garry Nolan4, Rong Chen5, Wing H. Wong6,7, and Hugo Y.K. Lam2,†
MOTIVATION 
After obtaining variants from next generation sequencing data, researchers and clinicians still face the undertaking of interpreting the results. Despite the availability of a 
multitude of public databases, using this collective information is an arduous task due to inconsistent and heterogeneous data, multiple versions, and nonstandard formats. 
Moreover, after aggregating the data and annotating the variants, it remains a laborious exercise to identify the causative variants associated with the disease in question. 
APPROACH 
We have developed a highly efficient data processing pipeline that leverages big data technologies to integrate annotations from a wide range of biological databases. The 
pipeline takes variant call sets, annotates all samples, and indexes the variants for analysis. Users can perform real-time searches and analytical queries against the 
annotated results to rapidly identify variants for further study. 
CHALLENGES 
• Heterogeneous data. An annotation platform must integrate diverse datasets of variants, genes, diseases, transcripts and functional predictions. 
• No standardization. The platform must account for differences between datasets, such as different reference genomes or schemas that change between versions.
• Real-time Interaction. A user must be able to interact with the annotation results in real time. Such interaction allows rapid identification of relevant variants while also 
supporting undirected investigation. 
• Contextual interface. The system cannot assume the user is familiar with the underlying data sources. It instead must support contextual queries, such as "find all predicted damaging variants of high quality associated with a given disease."
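As one illustration of the "No standardization" challenge, sources may name chromosomes differently (for example `chr1` versus `1`) or use different column names for the same field. The sketch below is not the platform's implementation: the source names (`source_a`, `source_b`), their column layouts, and the `VariantKey` type are hypothetical, and both sources are assumed to use 1-based positions. It maps records onto a common key so that annotations from both sources could be joined.

```python
from typing import NamedTuple

class VariantKey(NamedTuple):
    build: str   # reference genome build, e.g. "GRCh37"
    chrom: str   # chromosome name without a "chr" prefix
    pos: int     # position (assumed 1-based in every source)
    ref: str
    alt: str

# Hypothetical per-source column mappings (assumptions for illustration only).
FIELD_MAPS = {
    "source_a": {"chrom": "CHROM", "pos": "POS", "ref": "REF", "alt": "ALT"},
    "source_b": {"chrom": "chromosome", "pos": "start",
                 "ref": "ref_allele", "alt": "alt_allele"},
}

def normalize_chrom(name: str) -> str:
    """Strip a leading 'chr' and map the mitochondrial name 'M' to 'MT'."""
    name = name.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    return "MT" if name == "M" else name

def to_key(source: str, build: str, record: dict) -> VariantKey:
    """Translate a source-specific record into the shared key."""
    m = FIELD_MAPS[source]
    return VariantKey(
        build=build,
        chrom=normalize_chrom(str(record[m["chrom"]])),
        pos=int(record[m["pos"]]),
        ref=str(record[m["ref"]]).upper(),
        alt=str(record[m["alt"]]).upper(),
    )

a = to_key("source_a", "GRCh37",
           {"CHROM": "chrX", "POS": "100", "REF": "a", "ALT": "G"})
b = to_key("source_b", "GRCh37",
           {"chromosome": "X", "start": 100, "ref_allele": "A", "alt_allele": "g"})
print(a == b)
```

Keeping the build in the key means records on different reference genomes never match by accident; converting coordinates between builds is a separate step that this sketch does not attempt.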
Affiliations
1. Department of Engineering, Bina Technologies, Redwood City, CA 94065.
2. Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065.
3. Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94305.
4. Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305.
5. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029.
6. Department of Statistics, Stanford University, Stanford, CA 94305.
7. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA 94305.
† To whom correspondence should be addressed.
[Figure label: Hadoop / Cascalog]
Contact Us: rd@bina.com
METHOD 
During the annotation process, the pipeline:
• constructs indices that can be efficiently composed to support an effectively unlimited number of queries
• uses Hadoop MapReduce to associate variants with relevant annotations
• stores the annotated output and indices in HBase, a NoSQL database
A 5-node Hadoop cluster can annotate and index a whole exome sequencing sample in 30 minutes and a whole genome sequencing sample in under an hour.
As a variant set passes through the data pipeline, its variants are linked with over 140 annotation classes from more than 20 databases/datasets. Annotating a sample and indexing its variants are computationally demanding steps, but these are one-time costs.
After the process is complete, users can interact with the results via an intuitive web interface.
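The annotate-then-index flow described in this section can be sketched in plain Python. This is an in-memory illustration only: the platform uses Hadoop MapReduce and HBase, whereas the sketch uses dictionaries and sets, and the field names (`impact`, `qual`, `disease`), the quality threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict

# Toy variant calls keyed by (chrom, pos, ref, alt); values are call-level fields.
variants = {
    ("1", 1000, "A", "G"): {"qual": 60.0},
    ("2", 2000, "C", "T"): {"qual": 12.0},
    ("X", 3000, "G", "A"): {"qual": 75.0},
}
# Toy annotation records from external sources, keyed the same way.
annotations = [
    (("1", 1000, "A", "G"), {"impact": "damaging", "disease": "disease_a"}),
    (("2", 2000, "C", "T"), {"impact": "damaging", "disease": "disease_a"}),
    (("X", 3000, "G", "A"), {"impact": "benign", "disease": "disease_b"}),
]

def annotate(variants, annotations):
    """Join step: attach each annotation record to the variant with the same key."""
    annotated = {key: dict(fields) for key, fields in variants.items()}
    for key, fields in annotations:
        if key in annotated:
            annotated[key].update(fields)
    return annotated

def build_indices(annotated, min_qual=30.0):
    """One-time cost: map each (field, value) term to the set of variant keys."""
    index = defaultdict(set)
    for key, fields in annotated.items():
        for name in ("impact", "disease"):
            if name in fields:
                index[(name, fields[name])].add(key)
        if fields.get("qual", 0.0) >= min_qual:
            index[("quality", "high")].add(key)
    return index

def query(index, *terms):
    """Compose indices at query time by intersecting their key sets."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

annotated = annotate(variants, annotations)
index = build_indices(annotated)
hits = query(index,
             ("impact", "damaging"),
             ("quality", "high"),
             ("disease", "disease_a"))
print(sorted(hits))
```

The final query corresponds to a contextual request such as "find all predicted damaging variants of high quality associated with a given disease". Because the per-term sets are built once, each new combination of terms costs only a set intersection, which is one plausible reading of how composable indices keep interaction fast.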
[Figure: pipeline diagram. Labels: External Data Sources*, Genomic Variants, SnpEff, Variants with Predicted Effects, Fully Annotated Variants, Indices / Functional Filters, NoSQL Datastore (HBase), REST / API. Phases: Pre-Computation, Real-Time Interaction.]
* Data sources include: 
1000 Genomes 
Cancer Gene Census 
ClinVar 
dbNSFP 
dbSNP 
DGV 
ENSEMBL 
ESP 
GWAS 
HGMD 
PGMD 
PROTEOME 
RefSeq 
RepeatMasker 
Segmental Duplications 
TRANSFAC 
EXAMPLE APPLICATIONS
Ohtahara syndrome (early infantile epileptic encephalopathy with suppression bursts) is a rare form of epilepsy that presents in early infancy and occurs slightly more often in males. We analyzed a whole genome sequenced family trio with two unaffected parents and an affected son. Using the Bina Annotation Platform, we were able to filter from over 6.5 million variants in this family down to one X-linked non-synonymous variant in the gene AGTR2 potentially associated with the syndrome in the proband.
For another application of the Bina Annotation Platform, we analyzed the WGS data from the DNA of Ata [1], the skeletal remains of a 6-inch human found in the Atacama Desert, Chile. The annotation platform discovered 4,000+ exonic non-synonymous SNVs, 400+ frame-shift or codon-change indels in genes previously associated with disease, and 1,000+ structural variations. Fourteen of these variants were located in genes known to be associated with dwarfism and skeletal dysplasia, of which one was not in dbSNP. The results were scientifically interesting and taken for further investigation [2].
CONCLUSION AND FUTURE WORK
The Bina Annotation Platform has proven to be a powerful tool for variant interpretation for both single and multi-sample analyses. In future releases the platform will support additional workflows such as case-control and cohort studies, and will allow users to upload custom databases.
CITATIONS
[1] http://news.sciencemag.org/health/2013/05/bizarre-6-inch-skeleton-shown-be-human
[2] S. Bhattacharya, J. Li, H. Lam, R. Lachman, N. Asadi, A. Butte, G. Nolan, Whole genome sequencing of mummy DNA shows significant association with human disease phenotype. Poster 2914S at ASHG 2014.
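The trio filtering described in the Ohtahara syndrome example can be illustrated with a small sketch. The criteria used here (variant on chromosome X, non-synonymous, present in the affected son, absent in the unaffected father, and at most one copy in the unaffected mother) are assumptions about a typical X-linked filter, not the platform's documented rules. The record layout and gene labels are hypothetical, and pseudoautosomal regions are ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrioVariant:
    chrom: str
    pos: int
    gene: str        # hypothetical gene labels in the demo data
    effect: str      # e.g. "non_synonymous" or "synonymous"
    son_alt: int     # alternate-allele count in the affected son
    mother_alt: int  # alternate-allele count in the unaffected mother
    father_alt: int  # alternate-allele count in the unaffected father

def is_x_linked_candidate(v: TrioVariant) -> bool:
    """Assumed filter: X-linked, protein-changing, consistent with the pedigree."""
    return (
        v.chrom == "X"
        and v.effect == "non_synonymous"
        and v.son_alt >= 1      # the affected son carries the variant
        and v.father_alt == 0   # an unaffected father cannot carry it on his X
        and v.mother_alt <= 1   # mother is a carrier at most (0 means de novo)
    )

def filter_trio(variants):
    """Keep only variants that pass every criterion above."""
    return [v for v in variants if is_x_linked_candidate(v)]

demo = [
    TrioVariant("X", 1000, "GENE_A", "non_synonymous", 1, 1, 0),  # kept
    TrioVariant("X", 2000, "GENE_B", "synonymous", 1, 1, 0),      # wrong effect
    TrioVariant("X", 3000, "GENE_C", "non_synonymous", 1, 0, 1),  # father carries it
    TrioVariant("7", 4000, "GENE_D", "non_synonymous", 1, 1, 0),  # not on X
    TrioVariant("X", 5000, "GENE_E", "non_synonymous", 0, 1, 0),  # son lacks it
]
candidates = filter_trio(demo)
print([v.gene for v in candidates])
```

In practice such a filter would be combined with quality and population-frequency criteria; those are omitted here because the poster does not state which ones were applied.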

ASHG_2014_AP
