ASHG_2014_AP

www.bina.com
A highly efficient and scalable compute platform for massive variant annotation
and rapid genome interpretation
James Warren1, Emre Colak1, Amirhossein Kiani1, Jian Li2, Aparna Chhibber2, Sanchita Bhattacharya3, Narges Bani Asadi1,2,
Sharon Barr1, Atul Butte3, Garry Nolan4, Rong Chen5, Wing H. Wong6,7, and Hugo Y.K. Lam2,†
MOTIVATION
After obtaining variants from next-generation sequencing data, researchers and clinicians still face the task of interpreting the results. Despite the availability of a
multitude of public databases, using this collective information is an arduous task due to inconsistent and heterogeneous data, multiple versions, and nonstandard formats.
Moreover, after aggregating the data and annotating the variants, it remains a laborious exercise to identify the causative variants associated with the disease in question.
APPROACH
We have developed a highly efficient data processing pipeline that leverages big data technologies to integrate annotations from a wide range of biological databases. The
pipeline takes variant call sets, annotates all samples, and indexes the variants for analysis. Users can perform real-time searches and analytical queries against the
annotated results to rapidly identify variants for further study.
CHALLENGES
• Heterogeneous data. An annotation platform must integrate diverse datasets of variants, genes, diseases, transcripts and functional predictions.
• No standardization. The platform must account for differences between datasets, such as different reference genomes or schemas that change between versions.
• Real-time interaction. A user must be able to interact with the annotation results in real time. Such interaction allows rapid identification of relevant variants while also
supporting undirected investigation.
• Contextual interface. The system cannot assume the user is familiar with the underlying data sources. It instead must support contextual queries, such as "find all
predicted damaging variants of high quality associated with a given disease."
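As an illustration of the contextual query quoted above, the following minimal Python sketch expresses "predicted damaging variants of high quality associated with a given disease" as a filter over annotated records. The record layout, the field names, the `find_damaging_variants` helper, and the quality threshold of 30 are all assumptions made for this example; the poster does not describe the platform's actual schema or query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedVariant:
    """Hypothetical annotated variant record (not the platform's schema)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: float                      # call quality score
    predicted_effect: str               # e.g. "damaging" or "benign"
    diseases: frozenset = frozenset()   # disease names linked to the variant


def find_damaging_variants(variants, disease, min_quality=30.0,
                           damaging_effects=frozenset({"damaging"})):
    """Return variants that are high quality, predicted damaging,
    and associated with the given disease."""
    return [
        v for v in variants
        if v.quality >= min_quality
        and v.predicted_effect in damaging_effects
        and disease in v.diseases
    ]


# Made-up records, for illustration only.
variants = [
    AnnotatedVariant("X", 1000, "A", "G", 55.0, "damaging", frozenset({"example disease"})),
    AnnotatedVariant("1", 2000, "C", "T", 12.0, "damaging", frozenset({"example disease"})),
    AnnotatedVariant("2", 3000, "G", "A", 60.0, "benign", frozenset({"example disease"})),
    AnnotatedVariant("3", 4000, "T", "C", 70.0, "damaging", frozenset({"other disease"})),
]

hits = find_damaging_variants(variants, "example disease")
print([(v.chrom, v.pos) for v in hits])
```

The point of the sketch is that the user states the question in terms of quality, predicted effect, and disease, without needing to know which underlying database supplied each annotation.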
Affiliations
1. Department of Engineering, Bina Technologies, Redwood City, CA 94065.
2. Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065.
3. Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94305.
4. Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305.
5. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029.
6. Department of Statistics, Stanford University, Stanford, CA 94305.
7. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA 94305.
† To whom correspondence should be addressed.
[Pipeline diagram label: Hadoop / Cascalog]
Contact Us: rd@bina.com
METHOD
During the annotation process, the pipeline:
• constructs indices that can be efficiently composed to support an effectively unlimited
number of queries
• uses Hadoop MapReduce to associate variants with relevant annotations
• stores the annotated output and indices in HBase, a NoSQL database
A 5-node Hadoop cluster can annotate and index a whole exome sequencing sample in
30 minutes and a whole genome sequencing sample in under an hour.
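To make the idea of composable indices concrete, here is a small Python sketch in which each index maps a (field, value) pair to the set of matching variant IDs, and queries are built by intersecting or uniting those sets. This is only an in-memory illustration under assumed names (`VariantIndex`, `all_of`, `any_of`, and the example fields are hypothetical); the poster states that the real indices live in HBase but does not describe their layout.

```python
from collections import defaultdict


class VariantIndex:
    """Hypothetical inverted index: (field, value) -> set of variant IDs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, variant_id, **attributes):
        """Register a variant under every attribute it carries."""
        for field, value in attributes.items():
            self._postings[(field, value)].add(variant_id)

    def lookup(self, field, value):
        """Return a copy of the IDs indexed under (field, value)."""
        return set(self._postings.get((field, value), set()))


def all_of(*id_sets):
    """AND: IDs present in every set."""
    return set.intersection(*id_sets) if id_sets else set()


def any_of(*id_sets):
    """OR: IDs present in at least one set."""
    return set.union(*id_sets) if id_sets else set()


# Made-up variants, for illustration only.
index = VariantIndex()
index.add("v1", chrom="X", effect="missense", in_dbsnp=False)
index.add("v2", chrom="X", effect="synonymous", in_dbsnp=True)
index.add("v3", chrom="7", effect="missense", in_dbsnp=True)
index.add("v4", chrom="X", effect="frameshift", in_dbsnp=False)

# Compose simple indices into a richer query without building a new index.
protein_changing = any_of(index.lookup("effect", "missense"),
                          index.lookup("effect", "frameshift"))
x_linked_candidates = all_of(index.lookup("chrom", "X"), protein_changing)
novel = all_of(x_linked_candidates, index.lookup("in_dbsnp", False))
print(sorted(novel))
```

Because every query is a combination of a few precomputed sets, the number of answerable questions grows with the combinations of indices rather than with the number of indices that must be built.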
As a variant set passes through the data pipeline, it is linked with over 140 annotation
classes from more than 20 databases/datasets. Annotating a sample and indexing its variants
are computationally demanding steps, but these are one-time costs.
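The linking step can be pictured as a keyed join in the map/shuffle/reduce style. The sketch below is a plain-Python stand-in, not Hadoop or Cascalog code, and it rests on stated assumptions: variants and annotation records are joined on an exact (chrom, pos, ref, alt) key, and the function names, record layout, and source names are hypothetical. A production pipeline would also need to handle allele normalization and interval-based annotations, which are omitted here.

```python
from collections import defaultdict


def variant_key(record):
    """Join key assumed for this sketch: exact site and alleles."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def map_phase(variants, annotation_sources):
    """Emit (key, tagged payload) pairs for variants and annotations."""
    for variant in variants:
        yield variant_key(variant), ("variant", variant)
    for source_name, records in annotation_sources.items():
        for record in records:
            yield variant_key(record), ("annotation", (source_name, record["fields"]))


def shuffle(pairs):
    """Group emitted payloads by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, payload in pairs:
        groups[key].append(payload)
    return groups


def reduce_phase(groups):
    """Attach every annotation sharing a key to the variants at that key."""
    for values in groups.values():
        annotations = {}
        for tag, payload in values:
            if tag == "annotation":
                source_name, fields = payload
                annotations[source_name] = fields
        for tag, payload in values:
            if tag == "variant":
                yield {**payload, "annotations": annotations}


# Made-up inputs, for illustration only.
variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "2", "pos": 200, "ref": "C", "alt": "T"},
]
annotation_sources = {
    "source_a": [{"chrom": "1", "pos": 100, "ref": "A", "alt": "G",
                  "fields": {"id": "example_id"}}],
    "source_b": [{"chrom": "1", "pos": 100, "ref": "A", "alt": "G",
                  "fields": {"significance": "example"}},
                 {"chrom": "3", "pos": 300, "ref": "G", "alt": "A",
                  "fields": {"significance": "unused"}}],
}

annotated = list(reduce_phase(shuffle(map_phase(variants, annotation_sources))))
for row in annotated:
    print(row["chrom"], row["pos"], sorted(row["annotations"]))
```

Annotation records with no matching variant produce no output, and variants with no matching annotation are emitted with an empty annotation map, so the join behaves like a left join on the variant set.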
After the process is complete, users can interact with the results via an intuitive web interface.
[Figure: pipeline diagram with the labels External Data Sources*, Genomic Variants, SnpEff, Variants with Predicted Effects, Fully Annotated Variants, Indices / Functional Filters, NoSQL Datastore (HBase), and REST / API; stages: Pre-Computation and Real-Time Interaction.]
* Data sources include:
1000 Genomes
Cancer Gene Census
ClinVar
dbNSFP
dbSNP
DGV
ENSEMBL
ESP
GWAS
HGMD
PGMD
PROTEOME
RefSeq
RepeatMasker
Segmental Duplications
TRANSFAC
EXAMPLE APPLICATIONS
Ohtahara syndrome (early infantile epileptic encephalopathy with suppression bursts) is a rare form of epilepsy that presents in
early infancy and occurs slightly more often in males. We analyzed a whole-genome-sequenced family trio with two unaffected
parents and an affected son. Using the Bina Annotation Platform, we were able to filter from over 6.5 million variants in this
family down to one X-linked non-synonymous variant in the gene AGTR2 potentially associated with the syndrome in the proband.
For another application of the Bina Annotation Platform, we analyzed the WGS data from the DNA of Ata [1], the skeletal remains
of a 6-inch human found in the Atacama Desert, Chile. The annotation platform discovered 4,000+ exonic non-synonymous SNVs,
400+ frame-shift or codon-change indels in genes previously associated with disease, and 1,000+ structural variations. Fourteen
of these variants were located in genes known to be associated with dwarfism and skeletal dysplasia, of which one was not in
dbSNP. The results were scientifically interesting and were taken forward for further investigation [2].
CONCLUSION AND FUTURE WORK
The Bina Annotation Platform has proven to be a powerful tool for variant interpretation for both single and multi-sample
analyses. In future releases the platform will support additional workflows such as case-control and cohort studies, and will
allow users to upload custom databases.
CITATIONS
[1] http://news.sciencemag.org/health/2013/05/bizarre-6-inch-skeleton-shown-be-human
[2] S. Bhattacharya, J. Li, H. Lam, R. Lachman, N. Asadi, A. Butte, G. Nolan, Whole genome sequencing of mummy DNA shows
significant association with human disease phenotype. Poster 2914S at ASHG 2014.
