Chunlei Wu BD2K 201601 MyGene.info and MyVariant.info

My slides at HeartBD2K weekly technical conference call.

Published in: Science

  • A high-performance query engine for aggregated variant annotations.
  • Annotation data are fundamental
    Gene annotation: needs no slide to explain; everyone needs it
    Variant annotation: relatively new, and increasingly in demand due to the boom in NGS
  • Chunlei Wu BD2K 201601 MyGene.info and MyVariant.info

    1. From MyGene.info and MyVariant.info towards BioThings API. Chunlei Wu, Ph.D. (cwu@scripps.edu, @chunleiwu), Associate Professor of Molecular Medicine, Dept. of Molecular Experimental Medicine, The Scripps Research Institute, La Jolla, CA, USA. 01/22/2016
    2. A MyGene.info and MyVariant.info recap: gene and variant annotations as an aggregated, high-performance, real-time web service
    3. So many variant annotation resources, e.g. dbNSFP and The Exome Aggregation Consortium (ExAC)
    4. Annotations centered around bio-entities: Gene (G), Variant (V), Pathway (P), Metabolite (M), Disease (D)
    5. Simple JSON-based aggregation mechanism. Each data source contributes its own document for a variant, keyed by the shared "_id":
         { "_id": "chr1:g.196659237C>T", "dbsnp": { "snpclass": "single", "rsid": "rs1061170", "func": "missense" } }
         { "_id": "chr1:g.196659237C>T", "cosmic": { "tumor_site": "breast", "mut_freq": 0.49 } }
         { "_id": "chr1:g.196659237C>T", "dbnsfp": { "sift": { "breast": "tolerated", "val": 1 } } }
         "cadd", "clinvar", "evs", "mutdb", …
       The per-source documents are merged into a single variant object:
         { "_id": "chr1:g.196659237C>T", "cadd": { … }, "clinvar": { … }, "cosmic": { … }, "dbsnp": { … }, "dbnsfp": { … }, "evs": { … }, "emv": { … }, "mutdb": { … }, "gwassnp": { … }, "snpedia": { … }, "wellderly": { … } }
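The merge on slide 5 can be sketched in a few lines of Python. This is only an illustration of the idea, not the MyVariant.info implementation: the function name `aggregate_by_id` is hypothetical, and the sketch assumes every source document carries an "_id" and uses a top-level key of its own (such as "dbsnp" or "cosmic").

```python
from collections import defaultdict


def aggregate_by_id(source_docs):
    """Merge per-source documents that share an "_id" into one object per variant."""
    merged = defaultdict(dict)
    for doc in source_docs:
        variant_id = doc["_id"]
        merged[variant_id]["_id"] = variant_id
        for key, value in doc.items():
            if key != "_id":
                # Each source owns its top-level key, so sources never collide.
                merged[variant_id][key] = value
    return dict(merged)


# Two of the per-source documents shown on the slide.
docs = [
    {"_id": "chr1:g.196659237C>T",
     "dbsnp": {"snpclass": "single", "rsid": "rs1061170", "func": "missense"}},
    {"_id": "chr1:g.196659237C>T",
     "cosmic": {"tumor_site": "breast", "mut_freq": 0.49}},
]
variants = aggregate_by_id(docs)
print(sorted(variants["chr1:g.196659237C>T"]))
```

Because each source writes only under its own key, one source can be re-imported without touching the others, which fits the per-source update schedule described on the next slide.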
    6. Keep data always up-to-date. Each data source is updated individually; colors indicate their different update schedules. (Figure: schematic view of MyVariant.info architecture)
    7. High-performance web service APIs. (Figure: schematic view of MyVariant.info architecture)
    8. MyVariant.info for the end users: http://MyVariant.info (currently v1 API, two endpoints)
         http://MyVariant.info/v1/query?q=<query> : any query term(s) → matching variant hits
         http://MyVariant.info/v1/variant/<variantid> : HGVS id(s) → matching variant object(s)
       Both support batch mode via POST. Simple API. No sign-up. No API key. Try our live API and documentation.
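A minimal sketch of how a client might build URLs for the two v1 endpoints listed on slide 8. The helper names `query_url` and `variant_url` are hypothetical, and the choice to percent-encode the ">" in an HGVS id is an assumption about safe URL construction rather than a documented requirement of the service.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://MyVariant.info/v1"  # base URL as given on the slide


def query_url(query):
    """URL for the query endpoint: any query term(s) -> matching variant hits."""
    return f"{BASE_URL}/query?{urlencode({'q': query})}"


def variant_url(hgvs_id):
    """URL for the variant endpoint: an HGVS id -> the matching variant object."""
    # Keep ':' readable; characters such as '>' are percent-encoded.
    return f"{BASE_URL}/variant/{quote(hgvs_id, safe=':')}"


print(query_url("rs1061170"))
print(variant_url("chr1:g.196659237C>T"))
```

Either URL can then be fetched with any HTTP client (for example `urllib.request.urlopen`) and the response body parsed as JSON; no sign-up or API key is involved, as the slide notes.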
    9. MyGene.info for the end users: http://MyGene.info (currently v2 API, two endpoints)
         http://MyGene.info/v2/query?q=<query> : any query term(s) → matching gene hits
         http://MyGene.info/v2/gene/<geneid> : gene id(s) → matching gene object(s)
       Both support batch mode via POST. Simple API. No sign-up. No API key. Try our live API and documentation.
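Batch mode via POST could look roughly like the sketch below, which only constructs the request and does not send it. The slide does not name the POST parameter, so the `ids` form field (a comma-separated list of gene ids) is an assumption to be checked against the MyGene.info documentation; `batch_gene_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def batch_gene_request(gene_ids, base_url="http://MyGene.info/v2"):
    """Build (but do not send) a POST request for several gene ids at once."""
    # Assumption: ids are passed as one comma-separated "ids" form field.
    body = urlencode({"ids": ",".join(str(g) for g in gene_ids)}).encode("ascii")
    return Request(
        f"{base_url}/gene",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = batch_gene_request([1017, 1018])
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would perform the call; the same pattern would apply to the MyVariant.info variant endpoint with HGVS ids in place of gene ids.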
    10. MyGene.info usage updates. Monthly hits (in millions): last year 2M, this year 3M
    11. Usage spikes (5M hits/day) during Christmas 2014
    12. Increased client adoption. Requests by MyGene.info clients, 10/07/2015-01/05/2016 (chart shares: 30%, 9%, 35%, 26%). Highlights:
          • mygene Python client usage now surpasses BioGPS usage
          • mygene R client usage has increased to 9% from <1%
    13. Increased client adoption (chart shares: 30%, 9%, 35%, 26%). mygene Python client hosted on PyPI; mygene R client hosted on Bioconductor
    14. MyVariant.info updates. Over 334 million annotated variants in total. New additions: The Exome Aggregation Consortium (ExAC). Updated: dbNSFP
    15. MyVariant.info updates. 1 million requests in 3 months, 10/07/2015-01/05/2016 (chart shares: 30%, 68%, 2%)
    16. MyVariant.info official Python/R clients: myvariant Python client hosted on PyPI (initial release in Aug 2015); myvariant R client hosted on Bioconductor (initial release in Oct 2015)
    17. A Node.js client made by a passionate user
    18. Next? (MyVariant.info, MyGene.info)
    19. Make our APIs serve Linked Data via
    20. Why Linked Data? Gene (G), Variant (V), Pathway (P), Metabolite (M), Disease (D)
    21. Linked Data for data aggregation. (Figure: a variant V in MyVariant.info and the same variant V in another variant API)
    22. Linked Data for data aggregation
          MyVariant.info:
            { "_id": "chr1:g.196659237C>T", "cosmic": { "tumor_site": "breast", "mut_freq": 0.49 }, "clinvar": {…}, "dbsnp": {…}, … }
          Another Variant API:
            { "pop": "GWD", "nobs": 226, "freq": 0.371681415929, … }
          Aggregated:
            { "_id": "chr1:g.196659237C>T", "cosmic": { "tumor_site": "breast", "mut_freq": 0.49 }, "clinvar": {…}, "dbsnp": {…}, "new_src": { "pop": "GWD", "nobs": 226, "freq": 0.371681415929 }, … }
    23. JSON + context = JSON-LD
          {
            "@context": {
              "clinvar": "http://schema.myvariant.info/datasource/clinvar",
              "rcv": "http://schema.myvariant.info/datanode/rcv",
              "gene": "http://schema.myvariant.info/datanode/gene",
              "_id": "@id"
            },
            "_id": "chr6:g.26093141G>A",
            "clinvar": {
              "@context": {
                "uniprot": "http://identifiers.org/uniprot/",
                "omim": "http://identifiers.org/omim/"
              },
              "chrom": "6",
              "alt": "A",
              "ref": "G",
              "allele_id": 15048,
              "rsid": "rs1800562",
              "rcv": {
                "@context": { "accession": "http://identifer.org/clinvar" },
                "accession": "RCV000000020",
                "origin": "germline",
                "clinical_significance": "risk factor"
              },
              "gene": {
                "@context": { "symbol": "http://identifiers.org/hgnc.symbol/" },
                "id": "3077",
                "symbol": "HFE"
              },
              "omim": "613609.0001",
              "variant_id": 9
            }
          }
    24. Processed JSON-LD
          JSON-LD N-Quads output:
            <chr6:g.26093141G>A> <http://schema.myvariant.info/datasource/clinvar> _:b0 .
            _:b0 <http://identifiers.org/omim/> "613609.0001" .
            _:b0 <http://schema.myvariant.info/datanode/gene> _:b1 .
            _:b0 <http://schema.myvariant.info/datanode/rcv> _:b2 .
            _:b1 <http://identifiers.org/hgnc.symbol/> "HFE" .
            _:b2 <http://identifer.org/clinvar> "RCV000000020" .
          JSON-LD compacted output:
            {
              "@id": "chr6:g.26093141G>A",
              "http://schema.myvariant.info/datasource/clinvar": {
                "http://identifiers.org/omim/": "613609.0001",
                "http://schema.myvariant.info/datanode/gene": {
                  "http://identifiers.org/hgnc.symbol/": "HFE"
                },
                "http://schema.myvariant.info/datanode/rcv": {
                  "http://identifer.org/clinvar": "RCV000000020"
                }
              }
            }
    25. In a nutshell, what does a JSON-LD context do? It maps keys in a JSON object to defined URIs, e.g. "http://identifer.org/clinvar" → clinvar.rcv.accession
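The key-to-URI mapping summarized on slide 25 can be illustrated with a small recursive walk over a trimmed copy of the slide 23 document. This is a simplified sketch and not a JSON-LD processor: the function name `map_paths_to_uris` is hypothetical, it handles only nested objects with inline "@context" entries, and it ignores lists, remote contexts, and the rest of the JSON-LD algorithms. A real application would use a JSON-LD library instead.

```python
def map_paths_to_uris(node, inherited=None, path=()):
    """Yield (dotted_path, uri) for every key the active context maps to a URI."""
    # A nested "@context" extends the context inherited from the parent object.
    context = dict(inherited or {})
    context.update(node.get("@context", {}))
    for key, value in node.items():
        if key == "@context":
            continue
        here = path + (key,)
        uri = context.get(key)
        # Skip unmapped keys and keyword aliases such as "_id": "@id".
        if isinstance(uri, str) and not uri.startswith("@"):
            yield ".".join(here), uri
        if isinstance(value, dict):
            yield from map_paths_to_uris(value, context, here)


# Trimmed copy of the JSON-LD document from slide 23 (URIs kept as on the slide).
doc = {
    "@context": {
        "clinvar": "http://schema.myvariant.info/datasource/clinvar",
        "rcv": "http://schema.myvariant.info/datanode/rcv",
        "gene": "http://schema.myvariant.info/datanode/gene",
        "_id": "@id",
    },
    "_id": "chr6:g.26093141G>A",
    "clinvar": {
        "@context": {"omim": "http://identifiers.org/omim/"},
        "rsid": "rs1800562",
        "rcv": {
            "@context": {"accession": "http://identifer.org/clinvar"},
            "accession": "RCV000000020",
        },
        "gene": {
            "@context": {"symbol": "http://identifiers.org/hgnc.symbol/"},
            "symbol": "HFE",
        },
        "omim": "613609.0001",
    },
}

mapping = dict(map_paths_to_uris(doc))
for dotted_path, uri in sorted(mapping.items()):
    print(dotted_path, "->", uri)
```

Keys without a context entry, such as "rsid" here, are left out of the mapping, which mirrors how they are absent from the processed output on slide 24.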
    26. JSON-LD context makes your data "linkable"; downstream processing libraries make it "linked"
    27. A Python library for processing JSON-LD data (by Kevin Xin)
          In [1]: fetch_value_source_for_variant("chr6:g.26093141G>A", "http://identifiers.org/dbsnp/")
          Out[1]: ['rs1800562 http://schema.myvarint.info/datasource/dbnsfp',
                   'rs1800562 http://schema.myvarint.info/datasource/clinvar',
                   'rs1800562 http://schema.myvarint.info/datasource/dbsnp',
                   'rs1800562 http://schema.myvarint.info/datasource/evs',
                   'rs1800562 http://schema.myvarint.info/datasource/gwassnps',
                   'rs1800562 http://schema.myvarint.info/datasource/mutdb']
    28. Need to define an API spec to enable data exchange via JSON-LD
          • Output as a JSON object with a defined _id.
          • "jsonld=true/false" toggle for the inclusion of JSON-LD context.
          • Support the retrieval of a single entity via GET (use case: individual data aggregation on the fly).
          • Support the retrieval of a list of entities via POST (use case: routine data aggregation in batches).
          • Output should indicate the entity existence:
              GET /variant/<unknown_id> → 404
              POST /variant/ id1, <unknown_id>, id3 → [id1: {…}, <unknown_id>: "notfound", id3: {…}]
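The behaviour listed on slide 28 can be sketched with two plain functions over an in-memory store. Everything here is illustrative: `STORE`, `CONTEXT`, `render`, `get_entity`, and `post_entities` are hypothetical names, the 404 response body is an assumption, and the batch response is modelled as a mapping from each requested id to its object or to "notfound".

```python
# Hypothetical in-memory data and a minimal JSON-LD context.
CONTEXT = {"_id": "@id"}
STORE = {
    "id1": {"_id": "id1"},
    "id3": {"_id": "id3"},
}


def render(doc, jsonld=False):
    """Return the document, adding the JSON-LD context only when jsonld is true."""
    return {"@context": CONTEXT, **doc} if jsonld else dict(doc)


def get_entity(entity_id, jsonld=False):
    """GET /variant/<id>: a single entity, or 404 when the id is unknown."""
    doc = STORE.get(entity_id)
    if doc is None:
        return 404, {"error": "notfound"}
    return 200, render(doc, jsonld)


def post_entities(entity_ids, jsonld=False):
    """POST /variant/: one entry per requested id, "notfound" for unknown ids."""
    body = {}
    for entity_id in entity_ids:
        doc = STORE.get(entity_id)
        body[entity_id] = render(doc, jsonld) if doc is not None else "notfound"
    return 200, body


print(get_entity("unknown_id"))
print(post_entities(["id1", "unknown_id", "id3"]))
print(get_entity("id1", jsonld=True))
```

Reporting unknown ids explicitly in the batch response lets a caller line up results with the ids it sent, which matters when aggregating data from several APIs in bulk.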
    29. BioThings API (MyVariant.info, MyGene.info). By Cyrus Afrasiabi
    30. BioThings API (MyVariant.info, MyGene.info):
          • JSON data aggregation mechanism
          • High-performance query engine
          • Well-designed REST API pattern
          • JSON-LD enabled Linked Data
          • Data-updating scheduler
          • Python/R clients
          • …
    31. Data-sharing via Web API is trending. Making a single web service is trivial, but making a sustainable, scalable web API is non-trivial. We would like to help other groups create their own hosted web APIs for sharing their data.
    32. Action item 1: BioThings API whitepaper. Also an action item from the last BD2K CA consortium meeting and from the API working group at last year's NIH BD2K AHM
    33. Action item 2: BioThings API framework. NIH Commons: Infrastructure as a Service; Software as a Service: BioThings API
    34. Action item 3: expansion to other "BioThings": Disease (D) and Drugs (D), i.e. MyDrug.info, MyDisease.info (need an alt. name here)
    35. Acknowledgement
          Funding and support: U54GM114833, U01HG008473
          Washington U: Ben Ainscough, Obi Griffith
          TSRI: Andrew Su, Jiwen Xin, Cyrus Afrasiabi, Ginger Tsueng, Adam Mark, Greg Stupp, Tim Putman
          STSI: Eric Topol, Ali Torkamani, Galina Erikson
          U. Washington: Sean Mooney, Moritz Juchler, Nikhil Gopal
          OICR: Robin Haw
          UC Berkeley: Chris Mungall
          UCSD: Trish Whetzel
