Abstract—Wikidata is a world-readable and world-writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open-access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome, many of which are familiar to the biocuration community. These include community ontology building, high-precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.
1. Opportunities and challenges
presented by Wikidata in the
context of biocuration
Benjamin Good
BioCreative, Corvallis, Oregon, 2016
@bgood
bgood@scripps.edu
http://www.slideshare.net/goodb
3. Wikidata is to data as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
• Initiated by Wikimedia Germany
• In transition to the Wikimedia Foundation
• Not a grant-funded ‘project’… as stable as Wikipedia
5. Elements of the knowledge base are called ‘items’
https://www.wikidata.org/wiki/Q146
6. Items are unique concepts,
used to link different language
Wikipedias together
Q146
Af:Kat
En:cat
Als:Hauskatze
Ang:Catte
Av:Keto
7. Items are described by “statements” that link
together to form the language-independent
wikidata knowledge graph
[Diagram: cat → subclass of → domesticated animal → subclass of → animal; plus taxon statements with the properties ‘taxon name’ (Animalia) and ‘taxon rank’ (kingdom)]
9. Item: Q414043 (RELN)
Statement: genomic start = 103471784 (value, numeric)
Claim qualifier: GenLoc assembly = GRCh38
References: stated in Ensembl Release 83; retrieved 19 January 2016
https://www.wikidata.org/wiki/Q414043
Genomic position for the Reelin gene, annotated to show the parts of a statement: property, value (numeric), claim qualifiers, and references
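The anatomy of a statement like this can be sketched as a plain data structure. This is a simplified illustration of the claim/qualifier/reference model, not Wikidata's actual JSON serialization:

```python
# Simplified sketch of one Wikidata statement: a claim (property -> value)
# plus qualifiers and references. The shape is illustrative only; the real
# Wikibase JSON uses property IDs and richer value objects.
statement = {
    "item": "Q414043",            # RELN gene
    "property": "genomic start",  # the claim's property
    "value": 103471784,           # value (numeric)
    "qualifiers": {
        "GenLoc assembly": "GRCh38",
    },
    "references": [
        {
            "stated in": "Ensembl Release 83",
            "retrieved": "2016-01-19",
        }
    ],
}

def describe(stmt):
    """Render a statement as a one-line summary, qualifiers included."""
    quals = "; ".join(f"{k}: {v}" for k, v in stmt["qualifiers"].items())
    return f'{stmt["item"]} {stmt["property"]} = {stmt["value"]} ({quals})'

print(describe(statement))
# → Q414043 genomic start = 103471784 (GenLoc assembly: GRCh38)
```

The key point is that qualifiers and references attach to the individual claim, not to the item, so each fact carries its own provenance.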
10. Item: Q414043 (RELN)
Statement: encodes = Reelin (protein) (value, item)
References: stated in NCBI Homo sapiens annotation release 107; retrieved 19 January 2016
https://www.wikidata.org/wiki/Q414043
Linking the Reelin gene to a protein it encodes; the same statement anatomy: property, value (item), claim qualifiers, and references
11. Item: Q13561329 (Reelin)
Statement: cell component = dendrite (value, item)
Claim qualifiers: determination method = ISS (sequence or structural similarity); IEA (electronic annotation)
References: stated in UniProt; retrieved 21 March 2016
https://www.wikidata.org/wiki/Q13561329
Gene Ontology annotation for the Reelin protein, with evidence codes modeled as qualifiers
13. Inter-item links form a giant knowledge graph
Everything is connected
Reelin, Heart disease,
Barack Obama,
everything..
https://query.wikidata.org
SPARQL endpoint for Wikidata
14. Sample of current biomedical content
• All human and mouse genes and proteins (Swiss-Prot)
• All Gene Ontology terms
• All Human Disease Ontology terms
• All FDA approved drugs
• 109 reference microbial genomes
Burgstaller-Muelbacher et al (2016) Database
Mitraka et al (2015) Semantic Web Applications for the Life Sciences
Putman et al (2016) Database
15. http://tinyurl.com/biowiki-sparql
Sample queries that are currently possible:
• “GO cellular localization annotations for Reelin with
evidence code ISS”
• “Diseases treated by Metformin”
• “Diseases that might be treated by Metformin”
http://query.wikidata.org
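The “diseases treated by Metformin” query might look like the following sketch; the item and property IDs (Q19484 for metformin, P2175 for ‘medical condition treated’) are assumptions to verify on wikidata.org:

```python
# Sketch of a SPARQL query for "diseases treated by Metformin" against
# https://query.wikidata.org. The IDs below are assumptions, not verified.
metformin = "wd:Q19484"   # assumed item ID for metformin
treats = "wdt:P2175"      # assumed property: 'medical condition treated'

query = f"""
SELECT ?disease ?diseaseLabel WHERE {{
  {metformin} {treats} ?disease .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""

# To run it for real, send the query to the SPARQL endpoint, e.g.:
#   https://query.wikidata.org/sparql?query=...&format=json
print(query)
```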
16. Example question: repurposing Metformin
http://tinyurl.com/zem3oxz
[Diagram: Metformin → interacts with → SLC22A3 protein (solute carrier family 22 member 3) → encoded by → SLC22A3 gene → genetic association → prostate cancer; dashed edge: Metformin might treat ?disease]
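The repurposing path above could be expressed as a SPARQL sketch; all property IDs here (P129 ‘physically interacts with’, P702 ‘encoded by’, P2293 ‘genetic association’) are assumptions to check against wikidata.org:

```python
# Sketch: drugs that interact with a protein whose gene has a genetic
# association with a disease -- the "might treat?" inference path.
# All property IDs are assumptions to verify on wikidata.org.
query = """
SELECT ?drug ?drugLabel ?disease ?diseaseLabel WHERE {
  ?drug    wdt:P129  ?protein .   # drug physically interacts with protein
  ?protein wdt:P702  ?gene .      # protein encoded by gene
  ?gene    wdt:P2293 ?disease .   # gene has genetic association with disease
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
print(query)
```

Because all of these edges live in one graph, the three-hop join is a single query rather than three API calls against three different resources.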
18. The dominant paradigm for open biocuration
[Diagram: many independent resources (‘Your Database’), each exposing its own API and flatfiles, connected by xrefs; ‘My Web Application’, ‘My Database’, ‘My Database Curators’, and ‘My Research Grants’ ($) all draw on this patchwork of biomedical knowledge]
Pain points
• API or flatfile parsing
• Ambiguous or non-existent xrefs
• Persistence of funding
• Too much information to curate
19. A new paradigm for open biocuration?
[Diagram: several applications (‘My Application’) built on a shared ‘Our Database?’, maintained by ‘Our Database Curators and our community’, funded by individual research grants ($), all capturing biomedical knowledge]
Reducing the pain
• Reduces API/parser proliferation
• Forces up-front integration
• Facilitates coordination
• Ensures that if funding is lost, data is not
• Invites community input
20. A new platform for open biocuration?
[Diagram: the same shared-database architecture: several applications over one community-curated store of biomedical knowledge, funded by individual research grants ($)]
• SPARQL = a common API for accessing content
• 1 endpoint to maintain…
• It’s working
21. The first application built on Wikidata is Wikipedia
[Diagram: ‘Our Applications’ drawing biomedical knowledge from Wikidata, maintained by ‘Our Database Curators and our community’; funded by the Su, Schriml, Pavlidis R01 grant ($)]
24. Impact of Wikidata on Wikipedia
Gene Wiki Version 1:
{{GNF_Protein_box | Name = Reelin| image = |
image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 |
MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 |
IUPHAR = | ChEMBL = | OMIM = None | ECnumber = |
Homologene = 9349 | GeneAtlas_image1 = |
GeneAtlas_image2 = | GeneAtlas_image3 = |
Protein_domain_image = | Function =
{{GNF_GO|id=GO:0005515 |text = protein binding}}
{{GNF_GO|id=GO:0016787 |text = hydrolase activity}}
{{GNF_GO|id=GO:0046872 |text = metal ion binding}} |
Component = {{GNF_GO|id=GO:0005739 |text =
mitochondrion}} | Process = {{GNF_GO|id=GO:0008152
|text = metabolic process}} | Hs_EntrezGene = 51110 |
Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA =
NM_016027 | Hs_RefseqProtein = NP_057111 |
Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 |
Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174
| Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 |
Mm_Ensembl = ENSMUSG00000025937 |
Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein =
NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr =
1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end =
13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}
(one of these infoboxes for every gene)
Gene Wiki Version 2:
{{Infobox gene}}
• All data in Wikidata
• 1 Lua script works for all genes
25. Wikidata use increasing on Wikipedia
• https://en.wikipedia.org/wiki/Category:Templates_using_data_from_Wikidata
• 81 templates indicate that they use it
29. Challenges
• Community ontology building
• Establishing computable trust
• Expanding the knowledge base
“Dogs and cats living together!
Mass hysteria!”
(leave that for ICBO)
BioCreative
Challenges?
30. ‘Statements’ on Wikidata
[Chart: growth in Wikidata statements from 2013 to 2016, with gridlines at 20M, 60M, and 100M; quality spans the good, the bad, and the ugly]
https://tools.wmflabs.org/wikidata-todo/stats.php
31. Computable trust
Claim: RELN genomic start = 103471784 (GenLoc assembly: GRCh38) → Add references
1. Add references
2. Check whether the references concur with the claim or not
3. Estimate the ‘truthiness’ of the claim
4. Provide humans with sources to follow up
• References can come from databases, articles in PubMed, etc.
32. Expanding the knowledge base
[Diagram: RELN with open slots (‘?’) for new claims, each backed by references such as PMID: 77901 and PMID: 523070]
• Given an external knowledge source (text or database)
• Create claims and references automatically with very high precision
• Allow for human verification
33. Unique characteristics of Wikidata with regard to IE tasks
• 16,000+ ‘active’ editors and growing
• Could be a powerful crowdsourcing resource
• Must be kept involved or will block progress
• Constrained data model and some limits on content type
• CC0 requirement
34. One known attempt: “StrepHit”
• Individual Engagement Grant (IEG) from the Wikimedia Foundation (30k, started Jan. 2016)
• Goals:
• “Generate trust and reliability over Wikidata content”
• “Alleviate the burden of manual curation” (sounds familiar, right?)
• Ended up working on biographical and soccer data…
36. ‘Primary Sources’: an optional userscript that Wikidata users can install
Approving a suggested reference for a claim that came from German Wikipedia
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
37. The Wikidata Game(s) (= microtasks…)
https://tools.wmflabs.org/wikidata-game/
https://tools.wmflabs.org/wikidata-game/distributed/
Code for making your own!
38. StrepHit, Primary Sources, Wikidata games…
All works in progress
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
40. Acknowledgements
Gene Wikidata Team
Andra Waagmeester (Micelio)
Sebastian Burgstaller (Scripps)
Tim Putman (Scripps)
Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
@bgood on twitter
Adapted logo
Su Laboratory at TSRI
The 16,950 other active editors of Wikidata, and especially the 693 that joined last month, the 809 that joined the month before that, and the 721 that joined the month before that…
This work was supported by the US National Institute of Health
(grants GM089820 and U54GM114833) and by the Scripps
Translational Science Institute with an NIH-NCATS Clinical and
Translational Science Award (CTSA; 5 UL1 TR001114).
42. Social controls
• Anyone can
• Add or edit labels, descriptions, statements, references etc. on existing items
• Create new items
• Link items to Wikipedia articles
• Query using https://query.wikidata.org
• Read and write small numbers of edits with
https://www.wikidata.org/w/api.php
• Propose a new property
• Request a bot account for high-volume automated editing
Here be dragons..
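The small-scale read path through api.php mentioned above can be sketched as follows; `wbgetentities` is a Wikibase API action, but treat the exact parameter set as an assumption to check against the API documentation:

```python
# Sketch: building a read request for the Wikidata API (api.php).
# 'wbgetentities' is a Wikibase API action; the parameter set shown
# here is an assumption to verify against the live API docs.
import urllib.parse

def wikidata_read_url(item_id):
    """URL that would fetch one item's data as JSON from api.php."""
    params = urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": item_id,
        "format": "json",
    })
    return "https://www.wikidata.org/w/api.php?" + params

url = wikidata_read_url("Q414043")  # the RELN gene item
print(url)
# To actually fetch: urllib.request.urlopen(url).read()
```

Writes go through the same endpoint but require authentication, and high-volume automated editing requires an approved bot account, as noted above.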
43. Properties (as of April 10, 2016)
• 2196 active properties
• 114 new properties that have been proposed but not yet approved
Proposal
https://www.wikidata.org/wiki/Wikidata:Property_proposal
44. After proposal, community discussion
• Each property proposal is left open for discussion by anyone until an administrator (or another person blessed with the power) either creates it or decides not to create it, based on the discussion
• People who enjoy ontology arguments are needed here!
Lengthy (cut-off) discussion of the proposal for an ‘extinct’ property
47. Proposal discussions
• Cannot be avoided
• The discussions are long and tiring but important
• Many of the people involved are quite experienced
• All are trying to make something great
• Persistence and patience required
I will spend a good portion of the talk explaining what Wikidata is. With that, all of you smart people can start thinking on your own about how it might influence your work. Of course, I will provide some possible ideas.
Labels and descriptions in many languages
About 25 million items and 100 million statements, resulting in about 1 billion triples in the SPARQL endpoint
What it is, and what can we do with it?
This is the central point I want to make. Wikidata can be used to build knowledge-based applications, lowering the barrier to entry for building apps and reducing the challenges of downstream data integration.
Before coming back to this, I will explain why.
By mixing the data into Wikidata, we reduce API proliferation, easing application formation.
Over 1 billion triples
Fast: 20-30 queries per second, avg. about 6 seconds to answer queries
Stable since around September 2015
This is the first application of the work that we have done
Now that we have some ideas about how it can be used, consider the problems with it and the ways that the NLP community might help solve them.
ontology – define properties and patterns for their use
Trust – given a claim recorded on wikidata, verify that it matches (or conflicts with) statements made in other sources and provide references to those sources.
Data – add more
Given a claim, validate or invalidate and provide a reference
Could easily come up with many thousands of claims in the biomedical domain of the ugly or bad nature.
Lexicographical analysis
Relation extraction
Frame semantics
Machine learning
Very ambitious goal of producing both the A-box and the T-box (i.e., both identifying new properties and extracting relations using them)
Tool provided by a Google project for loading data from Freebase; it requires a human ‘thumbs up’.
Currently the StrepHit team is proposing to shift their work to improving this tool.
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
https://meta.wikimedia.org/wiki/Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine