Wikidata for biomedical knowledge integration and curation
Benjamin Good
The Scripps Research Institute
@bgood
bgood@scripps.edu
“knowledge”
• A lot
• Important
• Text
What are the functions of Fibronectin?
37186 articles
What are the functions of the 238 ‘significant’ genes that came up in my high-throughput screen??
What are the functions of Fibronectin?
37186 articles
…
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
“knowledge integration”
“curation”
“knowledge base”
Answers
Knowledge Bases
1,500+ listed at http://www.oxfordjournals.org/nar/database/a/
Applications of knowledge bases
• Find information
• Plan research
• “Known unknowns?”
• Interpret data
• Gene Ontology Enrichment Analysis
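Enrichment analysis asks whether a Gene Ontology term annotates more genes in an interesting gene list than chance would predict. The sketch below is one common way to score that, a one-sided hypergeometric test. It is not taken from the talk: the function name and all counts in the usage lines are hypothetical, apart from the list size of 238 borrowed from the earlier slide.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Upper-tail hypergeometric p-value for one GO term.

    population: number of genes in the background
    annotated:  background genes annotated with the term
    sample:     size of the interesting gene list
    hits:       genes in the list annotated with the term
    Returns P(X >= hits) when the list is drawn at random.
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 carry the term,
# and 12 of the 238 'significant' genes carry it.
p_value = hypergeom_enrichment_p(20000, 200, 238, 12)
print(f"p = {p_value:.3g}")
```

The test can only score terms that are actually annotated. A gene whose function is missing from the knowledge base contributes nothing, which is the gap the next slides describe.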
Interesting Gene List
Gene Ontology, Pathway, Network interpretation
Knowledge bases are important tools and will only grow more important over time
Great!
BUT
1. Knowledge bases are not complete
2. (We will get to this later...)
Annotation missing from human GO annotation. Should be here!
(‘5 HT Receptor’ means ‘Serotonin Receptor’)
Circa 2010
Added to GO Jan. 2016
First characterized 1996 (Kohen et al., J Neurochem)
Interesting Gene List
Gene Ontology, Pathway, Network interpretation
We don’t know what we are missing
inflammatory response
defense response
Serotonin receptor activity? ?
response to wounding
immune response
Interesting Gene List
“Gene Ontology, it’s great, right?”
• “It sucks”
• “I only use it out of desperation”
WHY?!
Process of building knowledge bases
1. Do science
2. Publish it
3. Manually extract the knowledge
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
Why does he look so down?
Many scientists, powerful tools, comparatively little reward for curating knowledge
100’s of thousands 100’s
More than 2 articles published/minute
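The per-minute rate follows from a yearly article count by simple arithmetic. The sketch below is a rough check, assuming a 365-day year; the one-million figure is the lower bound given in the notes for this talk ("more than 1 million articles added to PubMed each year"), and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def articles_per_minute(articles_per_year):
    """Average publication rate implied by a yearly total."""
    return articles_per_year / MINUTES_PER_YEAR


def articles_per_year_for_rate(rate_per_minute):
    """Yearly total needed to sustain a given per-minute rate."""
    return rate_per_minute * MINUTES_PER_YEAR


print(articles_per_minute(1_000_000))
print(articles_per_year_for_rate(2))
```

Exactly one million articles a year works out to just under two per minute, so "more than 2" corresponds to a yearly total somewhat above one million.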
Professional biocuration does not scale up to the rate of production
1. Do science
2. Publish it
3. Manually extract the knowledge
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
1. Knowledge bases are not complete
2. Knowledge needs integration
Knowledge is scattered, integration brings it together
Merging knowledge bases: the language barrier
“Methadone”
Interacts with: “Moxifloxacin”
May treat: Opioid-Related Disorders
ID: N0000000174
ID: 4095
Molecular Weight: 309.44518 g/mol
…
= ?
ID: DB00333
Manufactured by: Roxane laboratories inc
Good for business, bad for science
Google Scholar search shows 469 papers about “identifier mapping” in bioinformatics
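Identifier mapping amounts to deciding that records held under different IDs describe the same entity, then merging their properties. The sketch below illustrates that with the three IDs shown on the Methadone slide. The source names, the `concept:methadone` key, and the pairing of each ID with particular fields are assumptions made for illustration, not facts taken from the databases themselves.

```python
from collections import defaultdict

# Three records as separate (hypothetical) sources might hold them.
records = [
    {"source": "source_a", "id": "N0000000174",
     "may_treat": "Opioid-Related Disorders"},
    {"source": "source_b", "id": "4095",
     "molecular_weight_g_per_mol": 309.44518},
    {"source": "source_c", "id": "DB00333",
     "manufactured_by": "Roxane laboratories inc"},
]

# The crosswalk is the hard-won part: it asserts which IDs are equivalent.
crosswalk = {
    ("source_a", "N0000000174"): "concept:methadone",
    ("source_b", "4095"): "concept:methadone",
    ("source_c", "DB00333"): "concept:methadone",
}


def merge_records(records, crosswalk):
    """Group records by shared concept; return (merged, unmapped)."""
    merged = defaultdict(dict)
    unmapped = []
    for record in records:
        concept = crosswalk.get((record["source"], record["id"]))
        if concept is None:
            unmapped.append(record)  # no known equivalent yet
            continue
        entry = merged[concept]
        entry.setdefault("identifiers", {})[record["source"]] = record["id"]
        for name, value in record.items():
            if name not in ("source", "id"):
                entry.setdefault(name, value)
    return dict(merged), unmapped


merged, unmapped = merge_records(records, crosswalk)
print(merged["concept:methadone"])
```

Every pair of databases needs its own crosswalk under this approach, which is why a shared item that already carries all the identifiers is attractive.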
What can we do?
Global Knowledge Platform
What would happen if everyone was literally working on the same database?
1. Split up work more effectively
2. Make integration the default behavior
Wikidata is to data as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
Managed by the Wikimedia Foundation, which operates Wikipedia
It’s a knowledge base!
• Anyone can edit
• Anyone can use
Item: Q84
Item: Q414043 (RELN)
Statement
  Claim
    Property: Genomic start
    Value (numeric): 103471784
    Qualifiers: GenLoc assembly: GRCh38
  References: Stated in: Ensembl Release 83; Retrieved: 19 January 2016
https://www.wikidata.org/wiki/Q414043
Item: Q414043 (RELN)
Statement
  Claim
    Property: Encodes
    Value (item): Reelin (protein)
  References: Stated in: NCBI homo sapiens annotation release 107; Retrieved: 19 January 2016
https://www.wikidata.org/wiki/Q414043
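The anatomy of a statement on these two slides (item, property, value, qualifiers, references) can be sketched as a small data structure. This is a simplified illustration, not the Wikibase implementation: the class and field names are hypothetical, and human-readable labels stand in for the property and item identifiers that Wikidata actually stores.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

Value = Union[int, str]


@dataclass
class Claim:
    prop: str                      # property label, e.g. "Genomic start"
    value: Value                   # numeric value or another item
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Statement:
    item: str                      # the item the statement is about
    claim: Claim
    references: Dict[str, str] = field(default_factory=dict)

    def as_triple(self) -> Tuple[str, str, Value]:
        """Reduce the statement to a bare (item, property, value) edge."""
        return (self.item, self.claim.prop, self.claim.value)


# The two RELN statements shown on the slides.
genomic_start = Statement(
    item="Q414043",
    claim=Claim("Genomic start", 103471784, {"GenLoc assembly": "GRCh38"}),
    references={"Stated in": "Ensembl Release 83",
                "Retrieved": "19 January 2016"},
)
encodes = Statement(
    item="Q414043",
    claim=Claim("Encodes", "Reelin (protein)"),
    references={"Stated in": "NCBI homo sapiens annotation release 107",
                "Retrieved": "19 January 2016"},
)

print(genomic_start.as_triple())
print(encodes.as_triple())
```

Qualifiers and references travel with each claim, so a consumer can always see which assembly a coordinate refers to and where the value came from.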
A Giant Global Graph
These statements link together into a queryable graph
https://query.wikidata.org
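A query against the graph might look like the following sketch, which asks what the RELN item (Q414043) encodes. It assumes that `P688` is the Wikidata property for "encodes" and that the service accepts `query` and `format` parameters at the `/sparql` path; the function names are hypothetical. The code only builds the request URL, so it runs without network access.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_encodes_query(gene_qid):
    """SPARQL asking which items the given gene item encodes.

    Assumes wdt:P688 is the 'encodes' property.
    """
    return (
        "SELECT ?product ?productLabel WHERE {\n"
        f"  wd:{gene_qid} wdt:P688 ?product .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def build_request_url(sparql):
    """URL for a GET request that returns JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


sparql = build_encodes_query("Q414043")
url = build_request_url(sparql)
print(sparql)
print(url)
```

Fetching the URL with any HTTP client should return the matching items as JSON bindings, provided the assumptions above hold.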
We are seeding it with biomedical data
• All human, mouse genes and proteins
• All Gene Ontology terms
• All FDA approved drugs
• 9,000+ human diseases
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
Our seeds are largely concepts linked to many identifier systems
N identifiers per item
• Genes: 8
• Drugs: 18
• Diseases: 11
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
Facilitate integration with key external knowledge bases
Nurturing a multi-community garden of biomedical knowledge
Gene, Drug, Disease
A Platform for knowledge integration and curation
Open data
Wikipedia(s)
Your Apps Here!
Application #1 (of many)
Burgstaller et al (2016) Database (preprint in BioRxiv)
Impact of Wikidata on Wikipedia
Gene Wiki Version 1.
{{GNF_Protein_box | Name = Reelin| image = |
image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 |
MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 |
IUPHAR = | ChEMBL = | OMIM = None | ECnumber = |
Homologene = 9349 | GeneAtlas_image1 = |
GeneAtlas_image2 = | GeneAtlas_image3 = |
Protein_domain_image = | Function =
{{GNF_GO|id=GO:0005515 |text = protein binding}}
{{GNF_GO|id=GO:0016787 |text = hydrolase activity}}
{{GNF_GO|id=GO:0046872 |text = metal ion binding}} |
Component = {{GNF_GO|id=GO:0005739 |text =
mitochondrion}} | Process = {{GNF_GO|id=GO:0008152
|text = metabolic process}} | Hs_EntrezGene = 51110 |
Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA =
NM_016027 | Hs_RefseqProtein = NP_057111 |
Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 |
Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174
| Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 |
Mm_Ensembl = ENSMUSG00000025937 |
Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein =
NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr =
1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end =
13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}
=
Gene Wiki Version 2.
{{Infobox gene}}
• All data in Wikidata
• 1 Lua script works for all genes
=
(1 of these for every gene)
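The shift from Version 1 to Version 2 replaces a hand-filled template per gene with one generic renderer that reads structured data. The sketch below is a Python analogy, not the Lua module that Wikipedia uses: the function and field names are hypothetical, and the values are copied from the Version 1 template above.

```python
def render_infobox(gene):
    """Render a plain-text infobox from any gene record.

    Only fields present in the record are shown, so the same
    function serves every gene.
    """
    rows = [
        ("Symbol", gene.get("symbol")),
        ("Entrez Gene", gene.get("entrez_gene")),
        ("Ensembl", gene.get("ensembl")),
        ("UniProt", gene.get("uniprot")),
    ]
    return "\n".join(f"{label}: {value}" for label, value in rows if value)


# Structured data as it might be retrieved from the shared database.
lactb2 = {
    "symbol": "LACTB2",
    "entrez_gene": "51110",
    "ensembl": "ENSG00000147592",
    "uniprot": "Q53H82",
}
reln = {"symbol": "RELN"}  # a sparser record renders with the same code

print(render_infobox(lactb2))
print(render_infobox(reln))
```

Because the data lives in one place, a correction made there reaches every page and application that renders it.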
Application #2 Web Apollo Genome Browser
• Genome annotation data retrieved from Wikidata via SPARQL queries to https://query.wikidata.org
• Prototype achieved at recent San Diego hackathon
Putman et al (2016) (under review) (preprint in BioRxiv)
Microbial Genetic Data
• Widely distributed
• Difficult to query
• Not structured in a meaningful way
• A lot of interest from this community!
Microbial Genetic Data
Microbial genomes in Wikidata
• Loading genes, proteins, annotations for 120 reference genomes.
• Completed 21 genomes so far
Putman et al (2016) (under review) (preprint in BioRxiv)
Microbiome modeling in Wikidata
Putman et al (2016) (under review) (preprint in BioRxiv)
1. Knowledge bases are not complete
2. Knowledge needs integration
Can help
Centralizing content while distributing labor
Open data
Wikipedia(s)
Your Apps Here!
Thanks!
Gene Wikidata Team
Andra Waagmeester (Micelio)
* Sebastian Burgstaller (Scripps)
* Tim Putman (Scripps)
* Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
* First author on manuscript cited in this presentation
Ben, Tim, Andra, Elvira, Sebastian: some Gene Wiki team members enjoying their best paper award at SWAT4LS, Dec. 2015
Adapted logo

2016 bd2k bgood_wikidata

Editor's Notes

  • #6 Databases. Obviously much more flexible. You can ask them questions.. (and make pretty pictures that are dynamic)
  • #7 “known unknowns” ?? If I want X, what Y should I test?
  • #13 Though it is a child of the more generic GO annotation to ‘G protein coupled receptor activity’ Kohen 1996, J Neurochem.
  • #19 Given a list of active genes produced from an experiment what key biological processes are happening in the cells? what diseases are these genes associated with? Given a list of genetic variations what diseases is a patient more susceptible to? what drugs should they take/avoid? etc.
  • #21 Given a list of active genes produced from an experiment what key biological processes are happening in the cells? what diseases are these genes associated with? Given a list of genetic variations what diseases is a patient more susceptible to? what drugs should they take/avoid? etc.
  • #22 Knowledge is either not shared (stuck in your head or your notebook) or it is shared as text and images in journal articles. There are more than 1 million articles added to PubMed each year
  • #23 Given a list of active genes produced from an experiment what key biological processes are happening in the cells? what diseases are these genes associated with? Given a list of genetic variations what diseases is a patient more susceptible to? what drugs should they take/avoid? etc.
  • #25 Divide and conquer algorithm for creating the knowledge base of everything. Splitting is hard because it’s very hard to know what other groups are doing, there is no centralized coordination, and decisions about what should be curated are made based on what gets funded rather than what is most useful for the collective.
  • #26 The principal problem of knowledge integration is establishing which entities are shared between different systems. Methadone N0000002109 (Opioid-Related Disorders) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3422823/
  • #29 It would be much easier to see what other people were doing. By operating in the same database, it is much more likely that you will end up re-using entities that already exist rather than creating new ones and merging them later. Just like in your own local database.
  • #40 This is the first application of the work that we have done
  • #45 https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Update