High-performance web services for gene and variant annotations
1. Chunlei Wu, Ph.D.
cwu@scripps.edu
@chunleiwu
Associate Professor of Molecular Medicine
Dept. of Molecular Experimental Medicine
The Scripps Research Institute
La Jolla, CA, USA
07/2016
MyVariant.info
MyGene.info
2. Biological knowledge is a complex network
No one-size-fits-all database can capture
the entire knowledge space
4. BioThings APIs are built on document databases
Why we picked document databases:
• Object representation
• Rich data structures; handles heterogeneous data very well
• Atomic operations, built for big-data scale
6. Keep data always up-to-date
Each data source is updated individually. Colors
indicate their different updating schedules.
Schematic view of MyVariant.info architecture
10. Make it really easy to use
Just two endpoints
No registration/sign-in
No API key
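As a rough illustration of what "just two endpoints" can look like from the caller's side, the sketch below builds URLs for a search endpoint and a per-entity annotation endpoint. The slide does not name the endpoints, so the path layout (`/query` and `/<entity>/<id>`), the MyVariant.info base URL, and the example query values are assumptions to check against the live API documentation. The code only builds URLs and makes no network calls.

```python
from urllib.parse import quote, urlencode

# Assumed base URLs and API versions; only the MyGene.info v3 query URL
# appears elsewhere in this deck.
BASE_URLS = {
    "mygene": "http://mygene.info/v3",
    "myvariant": "http://myvariant.info/v1",
}

def query_url(service, q, **params):
    """URL for the (assumed) search endpoint: /query?q=..."""
    return f"{BASE_URLS[service]}/query?{urlencode({'q': q, **params})}"

def annotation_url(service, entity, entity_id):
    """URL for the (assumed) annotation endpoint: /<entity>/<id>."""
    return f"{BASE_URLS[service]}/{entity}/{quote(entity_id, safe='')}"

print(query_url("mygene", "cdk2", species="human"))
print(annotation_url("mygene", "gene", "1017"))
```

Since no registration or API key is needed, either URL could be fetched with any plain HTTP client.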
11. Developer-friendly
Python/R clients
(also js client for myvariant)
search "mygene" and "myvariant"
in PyPI and Bioconductor
JSONP
CORS
https
msgpack
http compression
http caching
JSON-LD
Supported!
12. Aggregate everything about genes and variants
MyGene.info
• Supports >15M genes for ~17K species
• ~200 annotation fields
MyVariant.info
• Supports >334M variants
• ~500 annotation fields from 14 sources: ClinVar, dbNSFP, dbSNP, …
27. Species API used in MyGene.info
You can now query for genes beyond the species level:
Q: Give me all lytic enzymes for any Firmicutes
http://mygene.info/v3/query?q=lytic enzyme&species=1239
0 hits
http://mygene.info/v3/query?q=lytic enzyme&species=1239&include_tax_tree=true
5 hits
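The two queries above differ only in the `include_tax_tree` flag. A small Python sketch of building them with proper URL encoding follows; the helper name `species_query_url` is made up for illustration, and the code only builds the URLs without contacting the service.

```python
from urllib.parse import quote, urlencode

def species_query_url(q, taxid, include_tax_tree=False):
    """Build a MyGene.info v3 query URL, optionally expanding the taxonomy tree."""
    params = {"q": q, "species": taxid}
    if include_tax_tree:
        # Parameter name and value taken from the URL shown on the slide.
        params["include_tax_tree"] = "true"
    # quote_via=quote encodes the space in the query as %20 rather than "+".
    return "http://mygene.info/v3/query?" + urlencode(params, quote_via=quote)

# 1239 is the taxonomy id used on the slide for Firmicutes.
print(species_query_url("lytic enzyme", 1239))
print(species_query_url("lytic enzyme", 1239, include_tax_tree=True))
```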
31. BioThings APIs
A collection of data APIs: data as a service
A framework for building new APIs: software as a service
Got a new type of "BioThings"?
We can help you build or even host your BioThings API
32. BioThings TEAM
Funding and Support
U01HG008473
U54GM114833
TSRI:
Chunlei Wu
Andrew Su
Jiwen Xin
Cyrus Afrasiabi
Sebastien Lelong
Ginger Tsueng
Julee Adesara
Mike Mayers
U. Washington:
Sean Mooney
Moritz Juchler
Nikhil Gopal
35. Filtering steps to prioritize candidate genes
Filtering step (number of genes remaining)
Initial number of genes mutated in all four patients (2441):
nVars <- countGenes(vars)
Filtering for sequencing coverage and strand bias (2308):
filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))
Filtering for nonsynonymous and splice site variants (1917):
filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in%
    c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", "CANONICAL_SPLICE", "SPLICE_SITE")))
Filtering for rare variants based on allele frequencies from ExAC (18):
filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))
Filtering for rare variants based on allele frequencies from 1000 Genomes Project (9):
filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af,
    function(j) j < 0.01)))
Filtering by GO biological process annotation using MyGene.info (5):
goBP <- data.frame(queryMany(top.genes$Var1, scopes="symbol", species="human",
    fields=c("go.BP", "name", "MIM", "uniprot")))
# The Bioconductor package GO.db is used to find all genes with a GO biological process
# annotation that is a descendant of GO:0008152, the GO id for metabolic process.
miller.bp <- lapply(goBP$go.BP, function(i) unlist(i$id))
bp.ancestor <- lapply(miller.bp, function(i) sapply(i, function(j) "GO:0008152" %in%
    unlist(GOBPANCESTOR[[j]])))
candidate.genes <- top.genes$Var1[sapply(bp.ancestor, function(i) TRUE %in% i)]
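For readers who work in Python, the first four R filters above can be sketched as an ordered list of predicates applied in sequence, counting distinct genes after each step. This is an illustrative re-expression only: it handles a single sample instead of the four patients, assumes every record carries every field (the R `subset` calls treat missing values differently), uses the hypothetical field names `consequence`, `exac_af`, and `kg_af`, and runs on made-up toy records that are not data from the study.

```python
# Thresholds and consequence classes copied from the R filters above.
KEEP_CONSEQUENCES = {
    "NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", "CANONICAL_SPLICE", "SPLICE_SITE",
}

FILTERS = [
    ("coverage and strand bias", lambda v: v["DP"] > 8 and v["FS"] < 30 and v["QD"] > 2),
    ("consequence", lambda v: v["consequence"] in KEEP_CONSEQUENCES),
    ("rare in ExAC", lambda v: v["exac_af"] < 0.01),
    ("rare in 1000 Genomes", lambda v: v["kg_af"] < 0.01),
]

def run_filters(variants):
    """Apply each filter in order; return (step name, distinct genes remaining) pairs."""
    counts = [("initial", len({v["gene"] for v in variants}))]
    for name, keep in FILTERS:
        variants = [v for v in variants if keep(v)]
        counts.append((name, len({v["gene"] for v in variants})))
    return counts

# Made-up toy records, one per hypothetical gene.
toy = [
    {"gene": "A", "DP": 20, "FS": 1.0, "QD": 10.0, "consequence": "STOP_GAINED",
     "exac_af": 0.001, "kg_af": 0.002},
    {"gene": "B", "DP": 5, "FS": 1.0, "QD": 10.0, "consequence": "STOP_GAINED",
     "exac_af": 0.001, "kg_af": 0.002},  # fails the coverage filter
    {"gene": "C", "DP": 30, "FS": 2.0, "QD": 12.0, "consequence": "SYNONYMOUS",
     "exac_af": 0.001, "kg_af": 0.002},  # fails the consequence filter
    {"gene": "D", "DP": 30, "FS": 2.0, "QD": 12.0, "consequence": "NON_SYNONYMOUS",
     "exac_af": 0.2, "kg_af": 0.002},  # fails the ExAC rarity filter
]

for step, n_genes in run_filters(toy):
    print(f"{step}: {n_genes}")
```

Keeping the filters as data makes it easy to reorder them or report the gene count after each step, mirroring the running counts shown on the slide.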
36. Demos in Jupyter notebooks
• Using myvariant and mygene in R for variant prioritization
http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb
• Access ClinVar data from myvariant in Python
http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_clinvar_demo.ipynb
• ID mapping using mygene module in Python
http://nbviewer.jupyter.org/gist/newgene/6771106
Editor's Notes
A high-performance query engine for aggregated variant annotations.