
2016 bd2k bgood_wikidata



  1. 1. Wikidata for biomedical knowledge integration and curation Benjamin Good The Scripps Research Institute @bgood bgood@scripps.edu
  2. 2. “knowledge” • A lot • Important • Text
  3. 3. What are the functions of Fibronectin? 37186 articles What are the functions of the 238 ‘significant’ genes that came up in my high throughput screen??
  4. 4. What are the functions of Fibronectin? 37186 articles … “knowledge integration” “curation” “knowledge base” Answers
        Gene | Property | Value
        Fibronectin | Biological Process | Angiogenesis
        Fibronectin | Cellular Localization | Extracellular matrix
        Fibronectin | Related Disease | Glomerulopathy
  5. 5. Knowledge Bases: 1,500+ listed at http://www.oxfordjournals.org/nar/database/a/
  6. 6. Applications of knowledge bases • Find information • Plan research • “Known unknowns?” • Interpret data • Gene Ontology Enrichment Analysis
  7. 7. Interesting Gene List Gene Ontology, Pathway, Network interpretation
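Gene Ontology enrichment analysis, mentioned on slide 6, is commonly computed as a one-sided hypergeometric test. The following is a minimal sketch under that assumption, not the method used in the talk. The function name and all numbers are hypothetical, except the list size of 238, which comes from slide 3.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of drawing at least `hits` annotated genes in a random sample.

    population: number of genes in the background set
    annotated:  number of background genes carrying the GO term
    sample:     size of the interesting gene list
    hits:       number of genes in the list carrying the GO term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical background and annotation counts; 238 is the list size from slide 3.
p = enrichment_p_value(population=20000, annotated=200, sample=238, hits=12)
print(f"p-value: {p:.3g}")
```

The test can only score terms that the knowledge base already contains, so an annotation that has not been curated can never show up as enriched.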
  8. 8. Knowledge bases are important tools and will only grow more important over time
  9. 9. Great!
  10. 10. BUT
  11. 11. 1. Knowledge bases are not complete 2. (Will get to this later)
  12. 12. Circa 2010: annotation missing from the human GO annotations. Should be here! (‘5-HT receptor’ means ‘serotonin receptor’)
  13. 13. First characterized 1996 (Kohen et al., J Neurochem). Added to GO Jan. 2016.
  14. 14. Interesting Gene List Gene Ontology, Pathway, Network interpretation
  15. 15. We don’t know what we are missing. Interesting Gene List: inflammatory response; defense response; response to wounding; immune response; Serotonin receptor activity??
  16. 16. “Gene Ontology, it’s great, right?” • “It sucks” • “I only use it out of desperation”
  17. 17. WHY?!
  18. 18. Process of building knowledge bases: 1. Do science 2. Publish it 3. Manually extract the knowledge
        Gene | Property | Value
        Fibronectin | Biological Process | Angiogenesis
        Fibronectin | Cellular Localization | Extracellular matrix
        Fibronectin | Related Disease | Glomerulopathy
  19. 19. why does he look so down?
  20. 20. Many scientists, powerful tools, comparatively little reward for curating knowledge 100’s of thousands 100’s
  21. 21. More than 2 articles published/minute
  22. 22. Professional biocuration does not scale up to the rate of production: 1. Do science 2. Publish it 3. Manually extract the knowledge
        Gene | Property | Value
        Fibronectin | Biological Process | Angiogenesis
        Fibronectin | Cellular Localization | Extracellular matrix
        Fibronectin | Related Disease | Glomerulopathy
  23. 23. 1. Knowledge bases are not complete 2. Knowledge needs integration
  24. 24. Knowledge is scattered, integration brings it together
  25. 25. Merging knowledge bases: the language barrier. “Methadone”: Interacts with: “Moxifloxacin”; May treat: Opioid-Related Disorders; ID: N0000000174; ID: 4095; Molecular Weight: 309.44518 g/mol; …; ID: DB00333; Manufactured by: Roxane Laboratories Inc. Do these records describe the same thing (= ?)
  26. 26. Good for business, bad for science: a Google Scholar search shows 469 papers about “identifier mapping” in bioinformatics
  27. 27. What can we do?
  28. 28. Global Knowledge Platform What would happen if everyone was literally working on the same database? 1. Split up work more effectively 2. Make integration the default behavior
  29. 29. Wikidata is to data as Wikipedia is to text. “Giving more people more access to more knowledge.” A free and open repository of knowledge, managed by the Wikimedia Foundation, which operates Wikipedia
  30. 30. It’s a knowledge base! • Anyone can edit • Anyone can use
  31. 31. Item: Q84
  32. 32. Item: Q414043 (RELN), https://www.wikidata.org/wiki/Q414043. Statement: Claim (Property: Genomic start; Value (numeric): 103471784); Qualifiers (GenLoc assembly: GRCh38); References (Stated in: Ensembl Release 83; Retrieved: 19 January 2016)
  33. 33. Item: Q414043 (RELN), https://www.wikidata.org/wiki/Q414043. Statement: Claim (Property: Encodes; Value (item): Reelin (protein)); Qualifiers (none listed); References (Stated in: NCBI homo sapiens annotation release 107; Retrieved: 19 January 2016)
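The statement anatomy on slides 32 and 33 (a property and value, plus qualifiers and references) can be mirrored in a small data structure. This is a simplified sketch for illustration, not Wikidata's actual JSON serialization. The class and field names are hypothetical. The values are the ones shown on the two slides.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Simplified stand-in for a Wikidata statement (hypothetical field names)."""
    item: str                                        # item the statement is about
    prop: str                                        # property of the claim
    value: object                                    # numeric value or another item
    qualifiers: dict = field(default_factory=dict)   # context for the claim
    references: dict = field(default_factory=dict)   # provenance of the claim


# Slide 32: a numeric value with one qualifier.
reln_start = Statement(
    item="Q414043",
    prop="Genomic start",
    value=103471784,
    qualifiers={"GenLoc assembly": "GRCh38"},
    references={"Stated in": "Ensembl Release 83", "Retrieved": "19 January 2016"},
)

# Slide 33: a value that is itself an item.
reln_encodes = Statement(
    item="Q414043",
    prop="Encodes",
    value="Reelin (protein)",
    references={
        "Stated in": "NCBI homo sapiens annotation release 107",
        "Retrieved": "19 January 2016",
    },
)


def is_referenced(statement: Statement) -> bool:
    """A statement with no references carries no provenance."""
    return bool(statement.references)


for s in (reln_start, reln_encodes):
    print(s.item, s.prop, s.value, "referenced" if is_referenced(s) else "unreferenced")
```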
  34. 34. A Giant Global Graph These statements link together into a queryable graph https://query.wikidata.org
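As a rough illustration of querying this graph, the sketch below builds a request URL for the public query service. It assumes the SPARQL endpoint lives at the `/sparql` path of the host named on the slide, and that `P688` is the property ID for “encodes”. Both should be checked against the live service. The function names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint path on the host named on slide 34.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(item_id: str, property_id: str) -> str:
    """SPARQL asking for every value of one property on one item, with English labels."""
    return (
        "SELECT ?value ?valueLabel WHERE { "
        f"wd:{item_id} wdt:{property_id} ?value . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } '
        "}"
    )


def build_url(query: str) -> str:
    """URL-encode the query and request JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Q414043 is the RELN item from slides 32 and 33; P688 is assumed to mean "encodes".
url = build_url(build_query("Q414043", "P688"))
print(url)
```

The URL can be fetched with any HTTP client. The sketch stops short of the network call so that it runs offline.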
  35. 35. We are seeding it with biomedical data • All human and mouse genes and proteins • All Gene Ontology terms • All FDA-approved drugs • 9,000+ human diseases. Burgstaller et al. (2016) Database (preprint in bioRxiv); Mitraka et al. (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in bioRxiv)
  36. 36. Our seeds are largely concepts linked to many identifier systems. N identifiers per item • Genes: 8 • Drugs: 18 • Diseases: 11. Facilitate integration with key external knowledge bases. Burgstaller et al. (2016) Database (preprint in bioRxiv); Mitraka et al. (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in bioRxiv)
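An item that carries several external identifiers can act as a hub for translating between databases, which is the mapping problem raised on slides 25 and 26. The sketch below shows the idea with the three methadone identifiers from slide 25. The database names attached to each identifier are an assumption based on the identifier formats and are not stated on the slide. The function name is hypothetical.

```python
# One concept, many identifiers. Database names are assumed, not from the slides.
items = [
    {
        "label": "Methadone",
        "identifiers": {
            "DrugBank": "DB00333",
            "PubChem CID": "4095",
            "NDF-RT": "N0000000174",
        },
    },
]


def build_crosswalk(items: list, source: str, target: str) -> dict:
    """Map identifiers in `source` to identifiers in `target` via shared items."""
    return {
        item["identifiers"][source]: item["identifiers"][target]
        for item in items
        if source in item["identifiers"] and target in item["identifiers"]
    }


drugbank_to_pubchem = build_crosswalk(items, "DrugBank", "PubChem CID")
print(drugbank_to_pubchem)
```

With the identifiers stored on a shared item, every pairwise mapping falls out of the same lookup instead of needing its own mapping resource.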
  37. 37. Nurturing a multi-community garden of biomedical knowledge: Gene, Drug, Disease
  38. 38. A Platform for knowledge integration and curation: Open data • Wikipedia(s) • Your Apps Here! (×4)
  39. 39. Application #1 (of many) Burgstaller et al (2016) Database (preprint in BioRxiv)
  40. 40. Impact of Wikidata on Wikipedia. Gene Wiki Version 1 (1 of these for every gene): {{GNF_Protein_box | Name = Reelin | image = | image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 | MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 | IUPHAR = | ChEMBL = | OMIM = None | ECnumber = | Homologene = 9349 | GeneAtlas_image1 = | GeneAtlas_image2 = | GeneAtlas_image3 = | Protein_domain_image = | Function = {{GNF_GO|id=GO:0005515 |text = protein binding}} {{GNF_GO|id=GO:0016787 |text = hydrolase activity}} {{GNF_GO|id=GO:0046872 |text = metal ion binding}} | Component = {{GNF_GO|id=GO:0005739 |text = mitochondrion}} | Process = {{GNF_GO|id=GO:0008152 |text = metabolic process}} | Hs_EntrezGene = 51110 | Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA = NM_016027 | Hs_RefseqProtein = NP_057111 | Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 | Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174 | Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 | Mm_Ensembl = ENSMUSG00000025937 | Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein = NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr = 1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end = 13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}} Gene Wiki Version 2: {{Infobox gene}} • All data in Wikidata • 1 Lua script works for all genes
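The move from Version 1 to Version 2 replaces a hand-filled template per gene with one generic script that reads structured data. The sketch below illustrates that idea in Python. It is not the Lua module used on Wikipedia, and the function name, field list and output layout are hypothetical. The sample values are copied from the Version 1 template on slide 40.

```python
# Display label paired with the key under which the value is stored.
INFOBOX_FIELDS = [
    ("Symbol", "Symbol"),
    ("Entrez Gene", "Hs_EntrezGene"),
    ("Ensembl", "Hs_Ensembl"),
    ("UniProt", "Hs_Uniprot"),
    ("IUPHAR", "IUPHAR"),
]


def render_infobox(gene: dict) -> str:
    """Render one line per populated field; empty or missing fields are skipped."""
    rows = [
        f"{label}: {gene[key]}"
        for label, key in INFOBOX_FIELDS
        if gene.get(key)
    ]
    return "\n".join(rows)


# Values copied from the Version 1 template on slide 40.
lactb2 = {
    "Symbol": "LACTB2",
    "Hs_EntrezGene": "51110",
    "Hs_Ensembl": "ENSG00000147592",
    "Hs_Uniprot": "Q53H82",
    "IUPHAR": "",
}

print(render_infobox(lactb2))
```

The same function works for any gene record, so a correction to the stored data shows up everywhere without touching per-gene markup.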
  41. 41. Application #2: Web Apollo Genome Browser • Genome annotation data retrieved from Wikidata via SPARQL queries to https://query.wikidata.org • Prototype achieved at recent San Diego hackathon. Putman et al. (2016) (under review) (preprint in bioRxiv)
  42. 42. Microbial Genetic Data • Widely distributed • Difficult to query • Not structured in a meaningful way • A lot of interest from this community!
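A genome browser that pulls annotations over SPARQL has to turn the query results into features with coordinates. The sketch below assumes the standard SPARQL JSON results layout and hypothetical variable names (`symbol`, `start`, `end`). It is not the Web Apollo integration itself. The canned response is a hand-written stand-in for a live reply, using the LACTB2 coordinates from slide 40.

```python
# Hand-written stand-in for a query service reply (SPARQL JSON results layout).
canned_response = {
    "head": {"vars": ["symbol", "start", "end"]},
    "results": {
        "bindings": [
            {
                "symbol": {"type": "literal", "value": "LACTB2"},
                "start": {"type": "literal", "value": "70635318"},
                "end": {"type": "literal", "value": "70669174"},
            }
        ]
    },
}


def bindings_to_features(response: dict) -> list:
    """Convert result rows into simple feature dicts sorted by start coordinate."""
    features = []
    for row in response["results"]["bindings"]:
        features.append(
            {
                "name": row["symbol"]["value"],
                # Values arrive as strings and need converting to integers.
                "start": int(row["start"]["value"]),
                "end": int(row["end"]["value"]),
            }
        )
    return sorted(features, key=lambda f: f["start"])


for feature in bindings_to_features(canned_response):
    print(feature["name"], feature["start"], feature["end"])
```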
  43. 43. Microbial Genetic Data
  44. 44. Microbial genomes in Wikidata • Loading genes, proteins, annotations for 120 reference genomes. • Completed 21 genomes so far Putman et al (2016) (under review) (preprint in BioRxiv)
  45. 45. Microbiome modeling in Wikidata Putman et al (2016) (under review) (preprint in BioRxiv)
  46. 46. 1. Knowledge bases are not complete 2. Knowledge needs integration. Wikidata can help
  47. 47. Centralizing content while distributing labor: Open data • Wikipedia(s) • Your Apps Here! (×4)
  48. 48. Thanks! Gene Wikidata Team: Andra Waagmeester (Micelio), *Sebastian Burgstaller (Scripps), *Tim Putman (Scripps), *Elvira Mitraka (U Maryland), Julia Turner (Scripps), Justin Leong (UBC), Lynn Schriml (U Maryland), Paul Pavlidis (UBC), Andrew Su (Scripps), Ginger Tsueng (Scripps). Contact: bgood@scripps.edu. * First author on manuscript cited in this presentation. Photo (Ben, Tim, Andra, Elvira, Sebastian): some Gene Wiki team members enjoying their best paper award at SWAT4LS, Dec. 2015
