High-performance web services for gene and variant annotations

Chunlei Wu, Ph.D.
cwu@scripps.edu
@chunleiwu
Associate Professor of Molecular Medicine
Dept. of Molecular Experimental Medicine
The Scripps Research Institute
La Jolla, CA, USA
07/2016
MyGene.info and MyVariant.info

Biological knowledge is a complex network
No one-size-fits-all database can capture the entire knowledge space

Typical database representations
Relational database: tables
Document database: JSON objects
RDF triplestore: triples
Key-value store: key-value pairs
Example JSON object in a document database:
{
  "_id": 1017,
  "name": "CDK2",
  "taxid": 9606
}
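To make the comparison concrete, here is a minimal sketch of the CDK2 record from this slide expressed in each of the four storage models. The column order, the `gene:` subject prefix, and the use of serialized JSON as the key-value payload are illustrative assumptions, not something the slides specify.

```python
import json

# Document model: the record is one self-contained JSON object.
gene = {"_id": 1017, "name": "CDK2", "taxid": 9606}

# Relational model: a fixed column list plus one row per record.
columns = ("_id", "name", "taxid")
row = tuple(gene[c] for c in columns)

# Triple model: (subject, predicate, object) statements.
# The "gene:" prefix is a made-up identifier scheme for illustration.
subject = f"gene:{gene['_id']}"
triples = [(subject, key, value) for key, value in gene.items() if key != "_id"]

# Key-value model: one opaque value per key (serialized JSON here).
kv_store = {str(gene["_id"]): json.dumps(gene, sort_keys=True)}

print(row)
print(triples)
print(kv_store)
```

Only the document form keeps the record both structured and self-contained, which is the property the next slide relies on.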

BioThings APIs are built on document databases
Why we picked document databases:
• Object representation
• Rich data structures that handle heterogeneous data very well
• Atomic operations, built for big-data scale

Gene and Variant annotations represented in JSON documents
MyVariant.info variant document:
{
  "_id": "chr1:g.196659237C>T",
  "cosmic": {
    "chrom": "1",
    "hg19": {
      "start": 196659237,
      "end": 196659237
    },
    "ref": "C",
    "alt": "T",
    "tumor_site": "breast",
    "mut_freq": 0.49,
    "mut_nt": "C>T",
    "cosmic_id": "COSM424915"
  }
}
MyGene.info gene document:
{
  "_id": "1017",
  "Symbol": "CDK2",
  "Ensembl": "ENSG00000123374",
  "RefSeq": [
    "NM_001798",
    "NM_052827"
  ],
  "Reporter": {
    "U95A": [
      "1792_g_at",
      "1833_at"
    ],
    "U133A": [
      "211804_s_at",
      "2045252_at",
      "211803_at"
    ]
  }
}
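Nested documents like these are usually addressed with dotted field names (the demo later in this deck uses names such as `cadd.consequence` and `exac.af`). The following is a small sketch of how such a path can be resolved against a document on the client side. The helper `get_path` is hypothetical and is not part of any BioThings client.

```python
# The COSMIC annotation from the variant document on this slide.
variant = {
    "_id": "chr1:g.196659237C>T",
    "cosmic": {
        "chrom": "1",
        "hg19": {"start": 196659237, "end": 196659237},
        "ref": "C",
        "alt": "T",
        "tumor_site": "breast",
        "mut_freq": 0.49,
        "mut_nt": "C>T",
        "cosmic_id": "COSM424915",
    },
}


def get_path(doc, dotted_path, default=None):
    """Walk a nested dict along a dotted path such as 'cosmic.hg19.start'.

    Hypothetical helper: returns `default` as soon as a key is missing
    or an intermediate value is not a dict.
    """
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


print(get_path(variant, "cosmic.hg19.start"))
print(get_path(variant, "cosmic.tumor_site"))
print(get_path(variant, "clinvar.rcv", default="not annotated"))
```

Because each source lives under its own top-level key, a missing source simply resolves to the default instead of breaking the lookup.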

Keep data always up-to-date
Each data source is updated individually. Colors indicate the different update schedules.
Schematic view of MyVariant.info architecture

High-performance web service APIs
Schematic view of MyVariant.info architecture

MyGene.info + MyVariant.info
Gene (MyGene.info):
/v2/gene/<geneid>
/v2/query?q=<query>
/v3/gene/<geneid>
/v3/query?q=<query>
Variant (MyVariant.info):
/v1/variant/<hgvsid>
/v1/query?q=<query>
Single query on GET, batch query on POST
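The two-endpoint pattern can be sketched with the Python standard library alone. The snippet below only constructs URLs and request objects and sends nothing. The `http://myvariant.info/v1` base URL and the `q`/`scopes` form fields used for the batch POST are assumptions made for illustration, so check the service documentation before relying on them.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

MYGENE = "http://mygene.info/v3"        # host as used elsewhere in this deck
MYVARIANT = "http://myvariant.info/v1"  # assumed base URL


def annotation_url(base, entity, entity_id):
    """URL for the annotation endpoint, e.g. /v3/gene/<geneid>."""
    # Keep ':' readable in HGVS ids; characters such as '>' are percent-encoded.
    return f"{base}/{entity}/{quote(str(entity_id), safe=':')}"


def query_url(base, q, **params):
    """URL for a single query on GET, e.g. /v3/query?q=<query>."""
    return f"{base}/query?{urlencode({'q': q, **params})}"


def batch_query_request(base, terms, scopes):
    """Unsent POST request carrying several query terms at once.

    The form field names 'q' and 'scopes' are assumptions.
    """
    body = urlencode({"q": ",".join(terms), "scopes": scopes}).encode()
    return Request(f"{base}/query", data=body, method="POST")


print(annotation_url(MYGENE, "gene", 1017))
print(annotation_url(MYVARIANT, "variant", "chr1:g.196659237C>T"))
print(query_url(MYGENE, "cdk2"))
request = batch_query_request(MYGENE, ["CDK2", "CDK3"], "symbol")
print(request.get_method(), request.full_url, request.data)
```

Passing the request to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs offline.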

We focus on building APIs. Try to …

Make it really easy to use
Just two endpoints
No registration/sign-in
No API key

Developer-friendly
Python/R clients (also a JavaScript client for MyVariant.info)
Search for “mygene” and “myvariant” on PyPI and Bioconductor
Supported: JSONP, CORS, HTTPS, msgpack, HTTP compression, HTTP caching, JSON-LD

Aggregate everything about genes and variants
MyGene.info: supports >15M genes for ~17K species, ~200 annotation fields
MyVariant.info: supports >334M variants from 14 sources: ClinVar, dbNSFP, dbSNP, …

Keep up-to-date
Weekly: >15M genes for ~17K species
~Monthly: >334M variants from 14 sources: ClinVar, dbNSFP, dbSNP, …

High-performance and scalable
>95% of queries respond in <30 ms
A “stress test” suggests support for >5,000 concurrent users at ~10,000 requests per minute

High availability
99.999% over the last year
99.87% over the last 6 months
Availability tracked by

Who is using
Live applications: MinePath.org, Gene Wiki, JBrowse
Many users use them in their daily analysis pipelines, or simply to cache annotations locally

MyGene.info recent usage stats
Month    Requests      Unique IPs
Jan-16    3,885,192    2,498
Feb-16    5,313,950    2,786
Mar-16    3,362,354    3,121
Apr-16   10,918,104    3,065
May-16   10,776,858    3,803
Jun-16    6,396,148    3,940
Over 40M requests in six months
Request sources: direct calls 39%, mygene.py 38%, mygene.R 14%, BioGPS 9%

MyVariant.info recent usage stats
Month    Requests     Unique IPs
Jan-16      83,519    1,330
Feb-16   3,054,191    1,192
Mar-16     272,424    1,771
Apr-16     701,526    1,500
May-16      89,642    1,891
Jun-16     213,767    1,924
~4.5M requests in six months
Request sources: direct calls 21%, myvariant.py 23%, myvariant.R 50%, myvariant.js 6%
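The six-month totals quoted on the two usage slides can be reproduced from the monthly figures. This is a small arithmetic check using the request counts copied from the slides; the variable and function names are illustrative.

```python
# Monthly request counts copied from the two usage-stats slides.
mygene_requests = {
    "Jan-16": 3_885_192, "Feb-16": 5_313_950, "Mar-16": 3_362_354,
    "Apr-16": 10_918_104, "May-16": 10_776_858, "Jun-16": 6_396_148,
}
myvariant_requests = {
    "Jan-16": 83_519, "Feb-16": 3_054_191, "Mar-16": 272_424,
    "Apr-16": 701_526, "May-16": 89_642, "Jun-16": 213_767,
}


def six_month_total(monthly):
    """Sum the monthly request counts."""
    return sum(monthly.values())


def busiest_month(monthly):
    """Month label with the highest request count."""
    return max(monthly, key=monthly.get)


for name, monthly in [("MyGene.info", mygene_requests),
                      ("MyVariant.info", myvariant_requests)]:
    print(f"{name}: {six_month_total(monthly):,} requests, "
          f"peak in {busiest_month(monthly)}")
```

The MyGene.info months sum to 40,652,606, matching "over 40M". The MyVariant.info months sum to 4,415,069, slightly under the rounded figure on the slide.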

Generalized BioThings SDK
MyGene.info and MyVariant.info are generalized into the BioThings SDK, which provides:
• JSON data aggregation mechanism
• High-performance query engine
• Well-designed REST API pattern
• JSON-LD enabled Linked Data
• Data-updating scheduler
• Python/R clients
• …

BioThings SDK
A tutorial here (more docs are coming):
http://biothingsapi.readthedocs.io/en/latest/

BioThings SDK
g.biothings.io: gene (alias to MyGene.info)
v.biothings.io: variant (alias to MyVariant.info)
s.biothings.io: species/taxonomy

BioThings API for species/taxonomy
{
  "_id": "9606",
  "_version": 1,
  "authority": [
    "homo sapiens linnaeus, 1758"
  ],
  "children": [63221, 741158],
  "common_name": "man",
  "genbank_common_name": "human",
  "has_gene": true,
  "lineage": [9606, 9605, 207598, …, 131567, 1],
  "parent_taxid": 9605,
  "rank": "species",
  "scientific_name": "homo sapiens",
  "taxid": 9606,
  "uniprot_name": "homo sapiens"
}
http://s.biothings.io/v1/species/9606?include_children=true

BioThings API for species/taxonomy
{
  "hits": [
    {
      "_id": "1239",
      "_score": 10.971453,
      "common_name": […],
      "genbank_common_name": "gram-positive bacteria",
      "has_gene": false,
      "lineage": [1239, 1783272, 2, 131567, 1],
      "parent_taxid": 1783272,
      "rank": "phylum",
      "scientific_name": "firmicutes",
      "taxid": 1239,
      "uniprot_name": "firmicutes"
    }
  ],
  "max_score": 10.971453,
  "took": 12,
  "total": 1
}
http://s.biothings.io/v1/query?q=rank:phylum AND common_name:gram-positive

Species API used in MyGene.info
You can now query for genes above the species level:
Q: Give me all lytic enzymes for any firmicutes
http://mygene.info/v3/query?q=lytic enzyme&species=1239
0 hits (no genes are assigned directly to taxid 1239)
http://mygene.info/v3/query?q=lytic enzyme&species=1239&include_tax_tree=true
5 hits (genes from all taxa under firmicutes are included)
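Conceptually, expanding a query over the taxonomy tree amounts to asking whether the requested taxid appears in a taxon's `lineage` list, as returned by the species API. The sketch below illustrates that idea and builds the two query URLs from this slide. It is not the actual MyGene.info implementation, and taxid 9999999 is a made-up descendant used only for the example.

```python
from urllib.parse import urlencode

# Lineage of firmicutes (taxid 1239), copied from the species query result above.
FIRMICUTES_LINEAGE = [1239, 1783272, 2, 131567, 1]


def in_tax_tree(lineage, ancestor_taxid):
    """True if ancestor_taxid is the taxon itself or one of its ancestors."""
    return ancestor_taxid in lineage


def gene_query_url(q, species, include_tax_tree=False):
    """Build the MyGene.info query URL shown on this slide."""
    params = {"q": q, "species": species}
    if include_tax_tree:
        params["include_tax_tree"] = "true"
    return "http://mygene.info/v3/query?" + urlencode(params)


# Hypothetical species below firmicutes: its lineage starts with its own
# (made-up) taxid and continues with the firmicutes lineage.
child_lineage = [9999999] + FIRMICUTES_LINEAGE

print(in_tax_tree(child_lineage, 1239))       # descendant of firmicutes
print(in_tax_tree(FIRMICUTES_LINEAGE, 9606))  # firmicutes is not under human
print(gene_query_url("lytic enzyme", 1239))
print(gene_query_url("lytic enzyme", 1239, include_tax_tree=True))
```

Since the firmicutes document reports `has_gene` as false, a query restricted to taxid 1239 alone has nothing to match, which is why the tree expansion is needed.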

Very minimal code for building a species API

Have the flexibility to customize your query

BioThings SDK
g.biothings.io: gene (alias to MyGene.info)
v.biothings.io: variant (alias to MyVariant.info)
s.biothings.io: species/taxonomy
c.biothings.io: drugs/compounds
d.biothings.io: disease
∙ ∙ ∙

BioThings APIs
A collection of data APIs: data as a service
A framework for building new APIs: software as a service
Got a new type of “BioThings”?
We can help you build or even host your BioThings API

BioThings TEAM
TSRI: Chunlei Wu, Andrew Su, Jiwen Xin, Cyrus Afrasiabi, Sebastien Lelong, Ginger Tsueng, Julee Adesara, Mike Mayers
U. Washington: Sean Mooney, Moritz Juchler, Nikhil Gopal
Funding and Support: U01HG008473, U54GM114833

Source code
• MyGene.info
https://github.com/sulab/mygene.info
• MyVariant.info
https://github.com/sulab/myvariant.info
• BioThings API for species/taxonomy
https://github.com/sulab/biothings.species
• BioThings SDK
https://github.com/sulab/biothings.api

DEMO time!
by Jiwen (Kevin) Xin

Filtering steps to prioritize candidate genes (number of genes remaining after each step):
Initial number of genes mutated in all four patients: 2441
nVars <- countGenes(vars)
Filtering for sequencing coverage and strand bias: 2308
filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))
Filtering for nonsynonymous and splice site variants: 1917
filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in%
    c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", "CANONICAL_SPLICE", "SPLICE_SITE")))
Filtering for rare variants based on allele frequencies from ExAC: 18
filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))
Filtering for rare variants based on allele frequencies from 1000 Genomes Project: 9
filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af,
    function(j) j < 0.01)))
Filtering by GO biological process annotation using MyGene.info: 5
goBP <- data.frame(queryMany(top.genes$Var1, scopes="symbol", species="human",
    fields=c("go.BP", "name", "MIM", "uniprot")))
# The Bioconductor package GO.db is used to find all genes with a GO biological
# process annotation that is a descendant of GO:0008152, the GO id for metabolic process.
miller.bp <- lapply(goBP$go.BP, function(i) unlist(i$id))
bp.ancestor <- lapply(miller.bp, function(i) sapply(i, function(j) "GO:0008152" %in%
    unlist(GOBPANCESTOR[[j]])))
candidate.genes <- top.genes$Var1[sapply(bp.ancestor, function(i) TRUE %in% i)]

Demos in Jupyter notebooks
• Using myvariant and mygene in R for variant prioritization
http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb
• Access ClinVar data from myvariant in Python
http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_clinvar_demo.ipynb
• ID mapping using mygene module in Python
http://nbviewer.jupyter.org/gist/newgene/6771106
