Drug and chemical compound items in Wikidata
as a data source for Wikipedia infoboxes
https://commons.wikimedia.org/wiki/File:Wikidata-logo-en.svg
Sebastian Burgstaller-Muehlbacher, PhD
User:Sebotic
Twitter: @sebotic
Contents
● The Problem
● Introduction to Wikidata
– Data model
– References
– Values/data types
● Gene Wiki Info Boxes - An Example solution
● Chemistry Data in Wikidata
– Issues with the data
– Community cleanup
– Migration of Info Boxes to Wikidata
The Problem (with chemistry data)
● Wikipedia has ~300 different language projects
● Currently, chemistry data resides in info box parameters
– Data are not reusable between language projects
– Data are not machine readable
– Data are hard to update automatically
– Data cannot be reused for other purposes, e.g. science.
The solution
Wikidata items
● Two types of entities
– Properties (Pxxxx):
● Describe the nature of a data value
● Different data types
● 2,900 different properties in Wikidata
– Data items (Qxxxx):
● A set of claims or statements
● Consist of property value pairs
● 20 million items in Wikidata
A Wikidata Statement
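As a rough picture of the data model (not Wikidata's actual JSON serialization): a claim is a property with a value plus optional qualifiers, and a statement is a claim together with its references. The class and field names below are hypothetical, the item ID and the reference text are placeholders, and P235 is the Wikidata property for the InChI key.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Claim:
    """A property with a value, plus optional (property, value) qualifiers."""
    property_id: str
    value: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Statement:
    """A claim together with the references that support it."""
    claim: Claim
    references: List[str] = field(default_factory=list)


@dataclass
class Item:
    """A data item (Qxxxx): essentially a set of statements."""
    item_id: str
    statements: List[Statement] = field(default_factory=list)

    def values_of(self, property_id: str) -> List[str]:
        """Return all values stated for one property on this item."""
        return [s.claim.value for s in self.statements
                if s.claim.property_id == property_id]


# Placeholder item ID and reference text; only the InChI key comes from the slides.
example_item = Item(
    item_id="Qxxxx",
    statements=[
        Statement(
            claim=Claim("P235", "UEJJHQNACJXSKW-UHFFFAOYSA-N"),
            references=["placeholder: stated in a source database release"],
        )
    ],
)
```

Keeping references on the statement rather than on the item is what makes single values individually traceable.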
Wikidata Data types
● The current Wikidata data types:
– String
– WDItemID
– External ID
– MonolingualText
– Property
– Quantity
– Time
– Url
– GlobeCoordinate
– CommonsMedia
– Mathematical formula
Unique Features of Wikidata
● Completely free, even for commercial usage (CC0).
● Granular: Single values with references.
● Anybody can contribute.
● Extensive item history.
● A repository for data on all domains of knowledge.
● Full integration with the semantic web.
● Essentially: A giant graph of knowledge.
Burgstaller-Muehlbacher et al., Database, 2016
Data use case: Gene Wiki infoboxes
Issues with chemical data in the Wiki space
● Incorrect identifiers in info boxes or on Wikidata items
● Incorrect chemical properties
● Incorrect labels, aliases
● Incorrect isomeric forms of the compound
● Mixture of different isomeric forms
https://commons.wikimedia.org/wiki/File:Isomerism.svg
How to solve Isomerism issues?
● Make sure that the structures in Wikidata and Wikipedia are correct
and consistent:
– Use the InChI (International Chemical Identifier) or InChI key to determine
which isomer a certain article or WD item is actually about.
What are InChIs
● IUPAC InChI (International Chemical Identifier).
● Describes the structure of a chemical compound or substance.
● Freely usable.
● Can be computed from e.g. SMILES or MOL format.
● Do not need to be assigned by an organization.
What are InChI keys
● The SHA-256 hashed version of an InChI
● Makes chemicals searchable on the Web
● Makes chemicals easily comparable
● Short, unique
UEJJHQNACJXSKW-UHFFFAOYSA-N
First block (14 letters) encodes the skeleton (connectivity)
Second block encodes stereochemistry and isotopes in its first 8 letters,
followed by a standard flag (S) and a version letter (A)
Last letter encodes the protonation state (charge); N means neutral
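The block layout described above can be checked mechanically. The following is a minimal sketch that only validates the format of an InChI key and splits it into its parts; it does not verify that the hash matches any structure. The function and field names are hypothetical.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + standard flag + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    connectivity: str    # first block: hash of the skeleton
    other_layers: str    # hash of stereochemistry, isotopes, etc.
    standard_flag: str   # "S" for a standard InChI key
    version: str         # InChI version letter
    protonation: str     # "N" for neutral


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a well-formed InChI key into its blocks, or raise ValueError."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChI key: {key!r}")
    return InChIKeyParts(*match.groups())


parts = split_inchikey("UEJJHQNACJXSKW-UHFFFAOYSA-N")
print(parts.connectivity, parts.other_layers, parts.protonation)
```

Two keys that share the connectivity block but differ in the second block describe the same skeleton with different stereochemistry or isotopes.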
How to solve Isomerism issues?
● Make sure that the structures in Wikidata and Wikipedia are correct
and consistent:
– Use the InChI (International Chemical Identifier) or InChI key to determine
which isomer a certain article or WD item is actually about.
– Minimum requirement: a correct, unique InChI key on the item.
– Best case: all structural identifiers are correct (isomeric
SMILES, canonical SMILES, InChI and InChI key).
– A correct InChI key is the minimum that allows the rest of the chemical
compound item to be populated by (our) bots.
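One way to find candidates for this kind of cleanup is to group records by the first block of their InChI key and flag groups whose full keys disagree. This is only a sketch of the idea, not the procedure the bots actually use; the function name is hypothetical and the keys in the example are placeholder strings, not real compounds.

```python
from collections import defaultdict
from typing import Dict


def find_skeleton_conflicts(item_keys: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group names by the connectivity block of their InChI key.

    item_keys maps an item or article name to its InChI key. The result
    contains only skeletons for which more than one distinct full key
    occurs, i.e. records that may describe different isomers.
    """
    by_skeleton = defaultdict(dict)
    for name, key in item_keys.items():
        skeleton = key.split("-")[0]
        by_skeleton[skeleton][name] = key
    return {
        skeleton: members
        for skeleton, members in by_skeleton.items()
        if len(set(members.values())) > 1
    }


# Placeholder keys for illustration only.
records = {
    "Wikidata item": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "enwiki infobox": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "dewiki infobox": "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
}
conflicts = find_skeleton_conflicts(records)
```

A flagged group still needs a human decision on which isomer the article or item is meant to describe.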
What has been accomplished so far?
● Discussion on Wikiproject chemistry:
https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wiki
– General consensus that info boxes should use Wikidata
– Wikidata needs to improve on data quality
● Of the 17,000 original chemical compound Wikidata items, 16,000
have been validated around an InChI key.
● More chemical data have been imported, so they are readily
available for new Wikipedia articles or for correcting existing ones.
Things that need your attention
● I generated a list of items at Wikidata WikiProject Chemistry which
need human intervention.
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Annota
Please have a look at those and unify the stereochemistry and
identifiers around one unique InChI key!
Data maintenance in Wikidata
● Our bots are written in Python (2.7 and 3.x compatible).
● Python bots keep Wikidata in sync with authoritative data
sources (PubChem, ChemSpider, ChEBI, ChEMBL).
● Bots are run according to the data release cycles of the authoritative
data sources.
● Mechanisms are in place for detecting inconsistencies.
● Contributions of other Wikidata users are being accounted for,
based on references.
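The slides do not describe how inconsistencies are detected. As a purely illustrative sketch, a bot could compare the values it finds on a Wikidata item with the values in a source release and report what is missing or conflicting. The function name, the record layout and the example values are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Difference = Tuple[str, Optional[str], str]


def find_inconsistencies(wikidata_record: Dict[str, str],
                         source_record: Dict[str, str]) -> Dict[str, Difference]:
    """Compare one item's values against an authoritative source record.

    Both records map a property ID to a single value. The result maps each
    differing property to (kind, wikidata_value, source_value), where kind
    is "missing" or "conflict".
    """
    differences = {}
    for prop, source_value in source_record.items():
        wikidata_value = wikidata_record.get(prop)
        if wikidata_value is None:
            differences[prop] = ("missing", None, source_value)
        elif wikidata_value != source_value:
            differences[prop] = ("conflict", wikidata_value, source_value)
    return differences


# Placeholder property IDs and values for illustration only.
on_wikidata = {"Pa": "value-1", "Pb": "value-2"}
in_source = {"Pa": "value-1", "Pb": "value-2-changed", "Pc": "value-3"}
report = find_inconsistencies(on_wikidata, in_source)
```

In such a scheme, "missing" values could be added automatically, while "conflict" entries would be the ones to check against references before anything is overwritten.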
Wikidata API and query endpoints
● Three ways to access data:
– Wikidata API allows read, write and full text search.
(www.wikidata.org/w/api.php)
– REST endpoint for fast, direct data access.
(queryr.wmflabs.org/)
– Wikidata query service (WDQS) as a SPARQL endpoint for complex
queries.
(query.wikidata.org/)
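For the API and the SPARQL endpoint, a request is just a URL with a few parameters. The sketch below only builds the URLs and sends nothing; the helper names are hypothetical. It assumes the `wbgetentities` API action for reading an item, the `/sparql` path on query.wikidata.org, and property P235 (InChI key) with the `wdt:` prefix that the query service predefines.

```python
import re
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
SPARQL_URL = "https://query.wikidata.org/sparql"


def entity_url(item_id: str) -> str:
    """URL that reads one item as JSON through the Wikidata API."""
    params = {"action": "wbgetentities", "ids": item_id, "format": "json"}
    return API_URL + "?" + urlencode(params)


def inchikey_query(inchikey: str) -> str:
    """SPARQL query for items whose InChI key (P235) equals the given key."""
    # Reject anything but uppercase letters and hyphens before
    # interpolating the value into the query string.
    if not re.fullmatch(r"[A-Z-]+", inchikey):
        raise ValueError(f"unexpected characters in InChI key: {inchikey!r}")
    return 'SELECT ?item WHERE { ?item wdt:P235 "%s" . }' % inchikey


def sparql_url(query: str) -> str:
    """URL that runs a query on the Wikidata query service, returning JSON."""
    return SPARQL_URL + "?" + urlencode({"query": query, "format": "json"})


query = inchikey_query("UEJJHQNACJXSKW-UHFFFAOYSA-N")
print(entity_url("Q42"))
print(sparql_url(query))
```

The resulting URLs can be fetched with any HTTP client; automated clients should identify themselves with a descriptive User-Agent header.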
Acknowledgments
Andrew Su
Benjamin Good
Tim Putman
Julia Turner
Gregg Stupp
(TSRI)
Gang Fu
Evan Bolton
(NIH, PubChem)
Andra Waagmeester
(Micelio.be)
Elvira Mitraka
Lynn Schriml
(Disease Ontology, U Baltimore)

Wikiconference 2016 talk Burgstaller


Editor's Notes

  • #6 -Labels, descriptions, aliases in different languages -Diverse Properties -Sitelinks
  • #7 -Properties must be proposed and approved by the community -Data items can be edited by any Wikidata user and are the true data stores.
  • #8 Claim: Property with value + optional qualifiers Statement: A claim with its references
  • #11 -Many queries to the Wikidata API make the bot slow and might make Wikimedia people/administrators unhappy. -Calling wbeditentity ensures that all data is either written or not, so if the connection or bot breaks, no harm is done. -No new items will be created and then left unpopulated.
  • #12 Single value refs/nano publications Revisions/data releases
  • #25 The SPARQL endpoint allows complex and also federated queries on the full WD content. REST and SPARQL are still in beta mode.