Abstract—Wikidata is a world-readable and world-writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open-access knowledge graph spanning biology, medicine, and all other domains of knowledge. To realize this potential, social and technical challenges must be overcome, many of which are familiar to the biocuration community. These include community ontology building, high-precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.
Opportunities and challenges presented by Wikidata in the context of biocuration
1. Opportunities and challenges presented by Wikidata in the context of biocuration
Benjamin Good
BioCreative, Corvallis, Oregon, 2016
@bgood
bgood@scripps.edu
http://www.slideshare.net/goodb
3. Wikidata is to data as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
• Initiated by Wikimedia Germany
• In transition to the Wikimedia Foundation
• Not a grant-funded ‘project’: as stable as Wikipedia
5. Elements of the knowledge base are called ‘items’
https://www.wikidata.org/wiki/Q146
6. Items are unique concepts, used to link different language Wikipedias together
Q146:
Af: Kat
En: cat
Als: Hauskatze
Ang: Catte
Av: Keto
7. Items are described by “statements” that link together to form the language-independent Wikidata knowledge graph
[Diagram: “Cat” → subclass of → “Domesticated Animal” → subclass of → “Animal”; “Animal” → taxon name → “Animalia”, taxon rank → “Kingdom”]
9. Anatomy of a statement: genomic position for the Reelin gene
Item: Q414043 (RELN)
Property: genomic start → Value (numeric): 103471784
Claim qualifier: GenLoc assembly = GRCh38
References: stated in Ensembl Release 83; retrieved 19 January 2016
https://www.wikidata.org/wiki/Q414043
10. Linking the Reelin gene to a protein it encodes
Item: Q414043 (RELN)
Property: encodes → Value (item): Reelin (protein)
References: stated in NCBI Homo sapiens annotation release 107; retrieved 19 January 2016
https://www.wikidata.org/wiki/Q414043
11. Gene Ontology annotation for the Reelin protein, with evidence codes modeled as qualifiers
Item: Q13561329 (Reelin)
Property: cell component → Value (item): dendrite
Claim qualifiers: determination method = ISS (sequence or structural similarity); IEA (electronic annotation)
References: stated in UniProt; retrieved 21 March 2016
https://www.wikidata.org/wiki/Q13561329
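A minimal sketch of how such a statement, with its qualifiers and references, can be read programmatically. The ‘wbgetentities’ action and the claims JSON layout are standard Wikibase API features; treating P681 as the “cell component” property is an assumption, not taken from the slides:

import requests

# Fetch the full JSON record for the Reelin protein item (Q13561329).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q13561329", "format": "json"},
)
entity = resp.json()["entities"]["Q13561329"]

# Statements are grouped by property; each statement carries a main value
# ("mainsnak") plus optional qualifiers and references, as on slide 11.
for stmt in entity["claims"].get("P681", []):  # P681: assumed "cell component"
    target = stmt["mainsnak"]["datavalue"]["value"]["id"]  # e.g. the dendrite item
    n_quals = len(stmt.get("qualifiers", {}))
    n_refs = len(stmt.get("references", []))
    print(target, ":", n_quals, "qualifier properties,", n_refs, "reference blocks")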
13. Inter-item links form a giant knowledge graph
Everything is connected: Reelin, heart disease, Barack Obama, everything.
https://query.wikidata.org (SPARQL endpoint for Wikidata)
14. Sample of current biomedical content
• All human and mouse genes and proteins (Swiss-Prot)
• All Gene Ontology terms
• All Human Disease Ontology terms
• All FDA-approved drugs
• 109 reference microbial genomes
Burgstaller-Mühlbacher et al. (2016) Database
Mitraka et al. (2015) Semantic Web Applications and Tools for Life Sciences (SWAT4LS)
Putman et al. (2016) Database
15. http://tinyurl.com/biowiki-sparql
Sample queries that are currently possible:
• “GO cellular localization annotations for Reelin with evidence code ISS”
• “Diseases treated by Metformin”
• “Diseases that might be treated by Metformin”
http://query.wikidata.org
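As a rough sketch, the second query above (“Diseases treated by Metformin”) can be run against the public SPARQL endpoint like this; Q19484 (metformin) and P2175 (“medical condition treated”) are assumed identifiers, not taken from the slides:

import requests

# "Diseases treated by Metformin", as a live query against the public
# Wikidata SPARQL endpoint. Q19484 and P2175 are assumed identifiers.
query = """
SELECT ?disease ?diseaseLabel WHERE {
  wd:Q19484 wdt:P2175 ?disease .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["diseaseLabel"]["value"])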
16. Example question: repurposing Metformin
http://tinyurl.com/zem3oxz
[Diagram: Metformin → interacts with → protein SLC22A3 (solute carrier family 22 member 3) → encoded by → gene SLC22A3 → genetic association → prostate cancer; inferred edge: Metformin → might treat? → ?disease]
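The repurposing walk drawn above reduces to a single SPARQL graph pattern, runnable with the same request code as the previous example. A sketch; the property IDs P129 (“physically interacts with”), P702 (“encoded by”) and P2293 (“genetic association”) are assumptions:

# "Diseases that might be treated by Metformin": follow drug -> protein ->
# gene -> disease, as in the slide 16 diagram. P129, P702 and P2293 are
# assumed property IDs.
repurposing_query = """
SELECT DISTINCT ?disease ?diseaseLabel WHERE {
  wd:Q19484 wdt:P129  ?protein .   # metformin physically interacts with a protein
  ?protein  wdt:P702  ?gene .      # the protein is encoded by a gene
  ?gene     wdt:P2293 ?disease .   # the gene is genetically associated with a disease
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""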
18. The dominant paradigm for open biocuration
[Diagram: many independent resources (“Your Database”, each exposing its own API and flatfiles, linked by xrefs) feed “My Database”, which is maintained by “My Database Curators”, funded by “My Research Grants ($)”, serves “My Web Application”, and captures biomedical knowledge]
Pain points
• API or flatfile parsing
• Ambiguous or non-existent xrefs
• Persistence of funding
• Too much information to curate
19. A new paradigm for open biocuration?
[Diagram: a single shared resource (“Our Database?”) maintained by “Our Database Curators and our community”, feeding multiple applications (“My Application” ×3), each backed by “My Research Grants” ($), all drawing on biomedical knowledge.]
Reducing the pain:
• Reduces API/parser proliferation
• Forces up-front integration
• Facilitates coordination
• Ensures that if funding is lost, data is not
• Invites community input
20. A new platform for open biocuration?
[Diagram: as slide 19, with “Our Database Curators and our community” maintaining the shared resource that multiple applications (“My Application” ×3, each backed by “My Research Grants”, $) draw biomedical knowledge from.]
• SPARQL = a common API for accessing content
• 1 endpoint to maintain…
• It's working
21. The first application built on Wikidata is Wikipedia
[Diagram: “Our Applications” built on the shared database, maintained by “Our Database Curators and our community”, supported by the Su, Schriml, Pavlidis R01 grant ($), all drawing on biomedical knowledge.]
24. Impact of Wikidata on Wikipedia
Gene Wiki version 1 (one of these wikitext boxes for every gene):
{{GNF_Protein_box | Name = Reelin | image = | image_source = | PDB = {{PDB2|4AD9}}
| HGNCid = 18512 | MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 | IUPHAR = | ChEMBL = | OMIM = None | ECnumber = | Homologene = 9349
| GeneAtlas_image1 = | GeneAtlas_image2 = | GeneAtlas_image3 = | Protein_domain_image =
| Function = {{GNF_GO|id=GO:0005515 |text = protein binding}} {{GNF_GO|id=GO:0016787 |text = hydrolase activity}} {{GNF_GO|id=GO:0046872 |text = metal ion binding}}
| Component = {{GNF_GO|id=GO:0005739 |text = mitochondrion}}
| Process = {{GNF_GO|id=GO:0008152 |text = metabolic process}}
| Hs_EntrezGene = 51110 | Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA = NM_016027 | Hs_RefseqProtein = NP_057111
| Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 | Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174 | Hs_Uniprot = Q53H82
| Mm_EntrezGene = 212442 | Mm_Ensembl = ENSMUSG00000025937 | Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein = NP_663356
| Mm_GenLoc_db = mm10 | Mm_GenLoc_chr = 1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end = 13660546 | Mm_Uniprot = Q99KR3
| path = PBB/51110}}
Gene Wiki version 2 (renders the same infobox):
{{Infobox gene}}
• All data in Wikidata
• 1 Lua script works for all genes
25. Wikidata use increasing on Wikipedia
• https://en.wikipedia.org/wiki/Category:Templates_using_data_from_Wikidata
• 81 templates indicate that they use it
29. Challenges
• Community ontology building
• Establishing computable trust
• Expanding the knowledge base
“Dogs and cats living together! Mass hysteria!” (leave that for ICBO)
BioCreative Challenges?
30. ‘Statements’ on Wikidata
[Chart: growth of Wikidata statements from 2013 to 2016, with the y-axis running from 20M through 60M to 100M, and statements classed as good, bad, and ugly.]
https://tools.wmflabs.org/wikidata-todo/stats.php
31. Computable trust
Example claim: RELN – genomic start: 103471784 (GenLoc assembly: GRCh38) → add references.
1. Add references
2. Check whether the references concur with the claim
3. Estimate the ‘truthiness’ of the claim (turning ‘bad’ and ‘ugly’ statements into ‘good’ ones)
4. Provide humans with sources to follow up
• References can come from databases, articles in PubMed, etc.
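Steps 1 and 2 can start from whatever references a claim already carries. Here is a minimal sketch using the standard wbgetclaims API action; the property ID P644 (“genomic start”) is an assumption:

# Sketch of steps 1-2: fetch one property's claims on one item and count
# the references attached to each claim. wbgetclaims is a standard API
# action; P644 ("genomic start") is an assumed property ID.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetclaims", "entity": "Q414043",
            "property": "P644", "format": "json"},
)
for claim in resp.json().get("claims", {}).get("P644", []):
    value = claim["mainsnak"].get("datavalue", {}).get("value")
    print(value, "-", len(claim.get("references", [])), "reference(s)")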
32. Expanding the knowledge base
[Diagram: RELN with candidate new claims (?) drawn from sources such as PMID: 77901 and PMID: 523070.]
New claims (one programmatic route is sketched below):
• Given an external knowledge source (text or database)
• Create claims and references automatically, with very high precision
• Allow for human verification
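A hedged sketch of that route using the standard wbcreateclaim API action. A real bot must first log in and obtain a CSRF token (both steps are elided here), and the IDs and value are illustrative only:

# Hedged sketch of automated claim creation via the wbcreateclaim action.
# Login and CSRF-token acquisition are elided; IDs are illustrative.
import json
import requests

session = requests.Session()
csrf_token = "..."  # obtain via action=query&meta=tokens after logging in

resp = session.post(
    "https://www.wikidata.org/w/api.php",
    data={
        "action": "wbcreateclaim",
        "entity": "Q414043",               # RELN
        "property": "P644",                # genomic start (assumed)
        "snaktype": "value",
        "value": json.dumps("103471784"),  # string values are JSON-encoded
        "token": csrf_token,
        "format": "json",
    },
)
print(resp.json())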
33. Unique characteristics of Wikidata with regard to IE tasks
• 16,000+ ‘active’ editors and growing
• Could be a powerful crowdsourcing resource
• Must be kept involved or will block progress
• Constrained data model and some limits on content type
• CC0 requirement
34. One known attempt: “StrepHit”
• Individual Engagement Grant (IEG) from the Wikimedia Foundation (30k, started Jan. 2016)
• Goals:
• “Generate trust and reliability over Wikidata content”
• “Alleviate the burden of manual curation” (sounds familiar, right?)
• Ended up working on biographical and soccer data…
36. ‘Primary Sources’: an optional userscript that Wikidata users can install
[Screenshot: approving a suggested reference for a claim; the suggestion came from German Wikipedia.]
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
37. The Wikidata Game(s) (= microtasks…)
https://tools.wmflabs.org/wikidata-game/
https://tools.wmflabs.org/wikidata-game/distributed/
Code for making your own!
38. StrepHit, Primary Sources, Wikidata games…
All works in progress
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
40. Acknowledgements
Gene Wikidata Team
Andra Waagmeester (Micelio)
Sebastian Burgstaller (Scripps)
Tim Putman (Scripps)
Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
@bgood on twitter
Adapted logo: Su Laboratory at TSRI
The 16,950 other active editors of Wikidata, and especially the 693 that joined last month, the 809 that joined the month before that, and the 721 that joined the month before that…
This work was supported by the US National Institutes of Health (grants GM089820 and U54GM114833) and by the Scripps Translational Science Institute with an NIH-NCATS Clinical and Translational Science Award (CTSA; 5 UL1 TR001114).
42. Social controls
Anyone can:
• Add or edit labels, descriptions, statements, references, etc. on existing items
• Create new items
• Link items to Wikipedia articles
• Query using https://query.wikidata.org
• Read, and write small numbers of edits, with https://www.wikidata.org/w/api.php (see the sketch below)
• Propose a new property
• Request a bot account for high-volume automated editing
Here be dragons…
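For the read path, a minimal example against the public API, using the standard wbgetentities action and the cat item (Q146) from the earlier slides:

# Minimal "read" example: fetch the English label of item Q146 with the
# wbgetentities action of the public MediaWiki API.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q146", "props": "labels",
            "languages": "en", "format": "json"},
)
print(resp.json()["entities"]["Q146"]["labels"]["en"]["value"])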
43. Properties (as of April 10, 2016)
• 2196 active properties
• 114 new properties that have been proposed but not yet approved
Proposals: https://www.wikidata.org/wiki/Wikidata:Property_proposal
44. After proposal, community discussion
• Each property proposal is left open for discussion by anyone until an administrator, or another person blessed with the power, either creates it or decides not to create it, based on the discussion
• People who enjoy ontology arguments are needed here!
[Screenshot: lengthy (cut-off) discussion of the proposal for an ‘extinct’ property.]
47. Proposal discussions
• Cannot be avoided
• The discussions are long and tiring, but important
• Many of the people involved are quite experienced
• All are trying to make something great
• Persistence and patience are required
I will spend a good portion of the talk explaining what Wikidata is. With that, all of you smart people can start thinking on your own about how it might influence your work. Of course, I will provide some possible ideas.
Labels and descriptions in many languages.
About 25 million items and 100 million statements, resulting in about 1 billion triples in the SPARQL endpoint.
From what it is to what we can do with it.
This is the central point I want to make: Wikidata can be used to build knowledge-based applications, lowering the barrier to entry for building apps and reducing the challenges of downstream data integration.
Before coming back to this, I will explain why.
By mixing the data into Wikidata, we reduce API proliferation, easing application formation.
Over 1 billion triples. Fast: 20-30 queries per second, averaging about 6 seconds to answer a query. Stable since around September 2015.
This is the first application of the work that we have done.
Now that we have some ideas about how it can be used, consider the problems with it and the ways that the NLP community might help solve them.
Ontology – define properties and patterns for their use.
Trust – given a claim recorded on Wikidata, verify that it matches (or conflicts with) statements made in other sources, and provide references to those sources.
Data – add more.
Given a claim, validate or invalidate it and provide a reference.
We could easily come up with many thousands of claims in the biomedical domain of the ugly or bad sort.
Lexicographical analysis
Relation extraction
Frame semantics
Machine learning
A very ambitious goal of producing both the A-box and the T-box (i.e., both identifying new properties and extracting relations using them).
A tool provided by a Google project for loading data from Freebase; it requires a human ‘thumbs up’.
The StrepHit team is currently proposing to shift their work to improving this tool.
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
https://meta.wikimedia.org/wiki/Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine