Drug and chemical compound items in Wikidata
as a data source for Wikipedia infoboxes
https://commons.wikimedia.org/wiki/File:Wikidata-logo-en.svg
Sebastian Burgstaller-Muehlbacher, PhD
User:Sebotic
Twitter: @sebotic
Contents
● The Problem
● Introduction to Wikidata
– Data model
– References
– Values/data types
● Gene Wiki Info Boxes - An Example solution
● Chemistry Data in Wikidata
– Issues with the data
– Community cleanup
– Migration of Info Boxes to Wikidata
The Problem (with chemistry data)
● Wikipedia has ~300 different language projects
● Currently, chemistry data reside in info box parameters
– Data are not reusable between language projects
– Data are not machine readable
– Data are hard to update automatically
– Data cannot be reused for other purposes, e.g. science.
The solution
Wikidata items
● Two types of entities
– Properties (Pxxxx):
● Describe the nature of a data value
● Different data types
● 2,900 different properties in Wikidata
– Data items (Qxxxx):
● A set of claims or statements
● Consist of property value pairs
● 20 million items in Wikidata
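A minimal sketch of the data model described above, written as plain Python dataclasses. This is an illustration only, not the Wikibase implementation: the class names, the `values_of` helper and the item ID `Q0` are hypothetical, and P235 is assumed here to be the InChI key property.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reference:
    """Where a value came from, e.g. the name of a source database."""
    stated_in: str
    url: str = ""


@dataclass
class Statement:
    """One claim: a property (Pxxxx) paired with a value, plus its references."""
    property_id: str
    value: str
    references: List[Reference] = field(default_factory=list)


@dataclass
class Item:
    """A data item (Qxxxx): a label and a set of statements."""
    item_id: str
    label: str
    statements: List[Statement] = field(default_factory=list)

    def values_of(self, property_id: str) -> List[str]:
        """Return all values stored on this item for one property."""
        return [s.value for s in self.statements if s.property_id == property_id]


# Hypothetical item: "Q0" is a placeholder ID, not a real Wikidata item.
example = Item(
    item_id="Q0",
    label="example compound",
    statements=[
        Statement(
            property_id="P235",  # assumed: InChI key property
            value="UEJJHQNACJXSKW-UHFFFAOYSA-N",
            references=[Reference(stated_in="PubChem")],
        )
    ],
)
```

Keeping references on each statement, rather than on the item as a whole, is what makes single values traceable to their source.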
A Wikidata Statement
Wikidata Data types
● The current Wikidata data types:
– String
– WDItemID
– External ID
– MonolingualText
– Property
– Quantity
– Time
– Url
– GlobeCoordinate
– CommonsMedia
– Mathematical formula
Unique Features of Wikidata
● Completely free, even for commercial usage (CC0).
● Granular: Single values with references.
● Anybody can contribute.
● Extensive item history.
● A repository for data on all domains of knowledge.
● Full integration with the semantic web.
● Essentially: A giant graph of knowledge.
Burgstaller-Muehlbacher et al., Database, 2016
Data use case: Gene Wiki infoboxes
Issues with chemical data in the Wiki space
● Incorrect identifiers in info boxes or on Wikidata items
● Incorrect chemical properties
● Incorrect labels, aliases
● Incorrect isomeric forms of the compound
● Mixture of different isomeric forms
https://commons.wikimedia.org/wiki/File:Isomerism.svg
How to solve Isomerism issues?
● Make sure that the structures in Wikidata and Wikipedia are correct
and consistent:
– Use the InChI (International Chemical Identifier) or InChI key to determine
what isomer a certain article or WD item is actually talking about.
What are InChIs
● IUPAC InChI (International Chemical Identifier).
● Describes the structure of a chemical compound or substance.
● Freely usable.
● Can be computed from, e.g., SMILES or MOL format.
● Do not need to be assigned by an organization.
What are InChI keys
● A fixed-length, 27-character hash of an InChI, derived from SHA-256
● Makes chemicals searchable on the Web
● Makes chemicals easily comparable
● Short and, in practice, unique (hash collisions are possible but rare)
UEJJHQNACJXSKW-UHFFFAOYSA-N
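First block (14 letters) encodes the
skeleton (connectivity)
Second block (8 letters plus 2 flag characters) encodes
stereochemistry and isotopes; the flags mark
standard/non-standard InChI and the InChI version
Last letter encodes the protonation state
(charge; N = neutral)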
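The block layout of an InChI key can be checked with a short Python function. This is a sketch under stated assumptions: the names `parse_inchikey` and `InChIKeyParts` are made up for illustration, and the function only validates the format (14 + 8 + 2 + 1 uppercase letters); it does not verify that the key is the correct hash of any structure.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + standard flag + version letter, hyphen, 1 letter
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    connectivity: str    # first block: hash of the skeleton
    other_layers: str    # second block: hash of stereochemistry and isotopes
    standard_flag: str   # "S" for a standard InChI, "N" for non-standard
    version: str         # "A" for InChI version 1
    protonation: str     # last letter; "N" means neutral


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChI key into its blocks, raising ValueError on a bad format."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChI key: {key!r}")
    return InChIKeyParts(*match.groups())


parts = parse_inchikey("UEJJHQNACJXSKW-UHFFFAOYSA-N")
```

For the key shown on the slide, `parts.connectivity` is `UEJJHQNACJXSKW` and `parts.protonation` is `N`.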
How to solve Isomerism issues?
● Make sure that the structures in Wikidata and Wikipedia are correct
and consistent:
– Use the InChI (International Chemical Identifier) or InChI key to determine
what isomer a certain article or WD item is actually talking about.
– Minimum requirement: Correct, unique InChI key on item.
– Best case: Make sure all structural identifiers are correct (isomeric
SMILES, canonical SMILES, InChI or InChI key).
– A minimum of a correct InChI key allows for the rest of the chemical
compound item to be populated by (our) bots.
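One way to spot isomer mix-ups is to group items by the first block of their InChI key: items that share a skeleton but carry different full keys are candidates for a stereochemistry check. The sketch below is an assumption about how such a check could look, not the bots' actual code. The function names and item IDs are hypothetical, and the second and third keys are invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChI key."""
    return inchikey.split("-")[0]


def find_isomer_conflicts(item_keys: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each skeleton block to the items that share it with differing keys."""
    by_skeleton = defaultdict(set)
    for item_id, key in item_keys.items():
        by_skeleton[skeleton_block(key)].add((item_id, key))

    conflicts = {}
    for skeleton, entries in by_skeleton.items():
        distinct_keys = {key for _, key in entries}
        # More than one full key on the same skeleton: possible isomer mix-up.
        if len(distinct_keys) > 1:
            conflicts[skeleton] = sorted(item_id for item_id, _ in entries)
    return conflicts


# Hypothetical item IDs; only the first key is taken from the slide.
items = {
    "Q_A": "UEJJHQNACJXSKW-UHFFFAOYSA-N",
    "Q_B": "UEJJHQNACJXSKW-ABCDEFGHSA-N",  # invented key, same skeleton
    "Q_C": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # invented key, different skeleton
}
conflicts = find_isomer_conflicts(items)
```

A flagged group is not necessarily an error, since distinct stereoisomers can legitimately have separate items; it only marks where a human should confirm which isomer each item describes.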
What has been accomplished so far?
● Discussion on WikiProject Chemistry:
https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wiki
– General consensus that info boxes should use Wikidata
– Wikidata needs to improve on data quality
● Of the 17,000 original chemical compound Wikidata items, 16,000
have been validated around an InChI key.
● More chemical data have been imported, so they are readily
available for new Wikipedia articles or for correcting existing ones.
Things that need your attention
● I generated a list of items at Wikidata WikiProject Chemistry that
need human intervention.
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Annota
Please have a look at those and unify the stereochemistry and
identifiers around one unique InChI key!
Data maintenance in Wikidata
● Our bots are written in Python (2.7 and 3.x compatible).
● Python bots keep Wikidata in sync with authoritative data
sources (PubChem, ChemSpider, ChEBI, ChEMBL).
● Bots are run according to data release cycles of authoritative
data sources.
● Mechanisms in place for detection of inconsistencies.
● Contributions of other Wikidata users are being accounted for,
based on references.
Wikidata API and query endpoints
● Three ways to access data:
– Wikidata API allows read, write and full text search.
(www.wikidata.org/w/api.php)
– REST endpoint for fast, direct data access.
(queryr.wmflabs.org/)
– Wikidata query service (WDQS) as a SPARQL endpoint for complex
queries.
(query.wikidata.org/)
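As a rough illustration of two of these access paths, the Python sketch below builds request URLs for the Wikidata API and for the SPARQL endpoint. The function names are hypothetical, P235 is assumed to be the InChI key property, and the code only constructs URLs; it does not send any request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_entity_url(item_id: str) -> str:
    """URL that asks the Wikidata API for one item as JSON."""
    params = {"action": "wbgetentities", "ids": item_id, "format": "json"}
    return API_ENDPOINT + "?" + urlencode(params)


def build_inchikey_query(inchikey: str) -> str:
    """SPARQL query for items carrying a given InChI key (P235 assumed)."""
    return f'SELECT ?item WHERE {{ ?item wdt:P235 "{inchikey}" . }}'


def build_wdqs_url(inchikey: str) -> str:
    """URL that runs the InChI key lookup on the query service, JSON results."""
    params = {"query": build_inchikey_query(inchikey), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


entity_url = build_entity_url("Q42")
lookup_url = build_wdqs_url("UEJJHQNACJXSKW-UHFFFAOYSA-N")
```

Either URL can then be fetched with any HTTP client, for example `urllib.request.urlopen`; a script that does so should send a descriptive User-Agent header.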
Acknowledgments
Andrew Su
Benjamin Good
Tim Putman
Julia Turner
Gregg Stupp
(TSRI)
Gang Fu
Evan Bolton
(NIH, PubChem)
Andra Waagmeester
(Micelio.be)
Elvira Mitraka
Lynn Schriml
(Disease Ontology, U Baltimore)

Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 

Wikiconference 2016 talk Burgstaller

  • 1. Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes https://commons.wikimedia.org/wiki/File:Wikidata-logo-en.svg Sebastian Burgstaller-Muehlbacher, PhD User:Sebotic Twitter: @sebotic
  • 2. Contents ● The Problem ● Introduction to Wikidata – Data model – References – Values/data types ● Gene Wiki Info Boxes - An Example solution ● Chemistry Data in Wikidata – Issues with the data – Community cleanup – Migration of Info Boxes to Wikidata
  • 3. The Problem (with chemistry data) ● Wikipedia has ~300 different language projects ● Currently, chemistry data reside in info box parameters – Data are not reusable between language projects – Data are not machine readable – Data are hard to update automatically – Data cannot be reused for other purposes, e.g. science.
  • 6. Wikidata items ● Two types of entities – Properties (Pxxxx): ● Describe the nature of a data value ● Different data types ● 2,900 different properties in Wikidata – Data items (Qxxxx): ● A set of claims or statements ● Consist of property value pairs ● 20 million items in Wikidata
  • 10. Wikidata Data types ● The current Wikidata data types: – String – WDItemID – External ID – MonolingualText – Property – Quantity – Time – Url – GlobeCoordinate – CommonsMedia – Mathematical formula
  • 11. Unique Features of Wikidata ● Completely free, even for commercial usage (CC0). ● Granular: Single values with references. ● Anybody can contribute. ● Extensive item history. ● A repository for data on all domains of knowledge. ● Full integration with the semantic web. ● Essentially: A giant graph of knowledge.
  • 14. Data use case: Gene Wiki infoboxes
  • 15. Issues with chemical data in the Wiki space ● Incorrect identifiers in info boxes or on Wikidata items ● Incorrect chemical properties ● Incorrect labels, aliases ● Incorrect isomeric forms of the compound ● Mixture of different isomeric forms
  • 17. How to solve isomerism issues? ● Make sure that the structures in Wikidata and Wikipedia are correct and consistent: – Use the InChI (International Chemical Identifier) or InChI key to determine which isomer a given article or WD item actually describes.
  • 18. What are InChIs? ● IUPAC InChI (International Chemical Identifier). ● Describes the structure of a chemical compound or substance. ● Freely usable. ● Can be computed from, e.g., SMILES or MOL format. ● Does not need to be assigned by an organization.
  • 19. What are InChI keys? ● A fixed-length hash of an InChI, derived with SHA-256 ● Makes chemicals searchable on the Web ● Makes chemicals easily comparable ● Short and practically unique, e.g. UEJJHQNACJXSKW-UHFFFAOYSA-N ● First block (14 letters) encodes the skeleton (connectivity) ● Second block: 8 letters encode the remaining layers such as stereochemistry and isotopes, followed by two flag letters (standard/non-standard InChI and InChI version) ● Last letter encodes the number of protons (charge)
  • 20. How to solve isomerism issues? ● Make sure that the structures in Wikidata and Wikipedia are correct and consistent: – Use the InChI (International Chemical Identifier) or InChI key to determine which isomer a given article or WD item actually describes. – Minimum requirement: a correct, unique InChI key on the item. – Best case: all structural identifiers are correct (isomeric SMILES, canonical SMILES, InChI or InChI key). – A correct InChI key alone is enough for (our) bots to populate the rest of the chemical compound item.
  • 21. What has been accomplished so far? ● Discussion on WikiProject Chemistry: https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wiki – General consensus that info boxes should use Wikidata – Wikidata needs to improve on data quality ● Of the 17,000 original chemical compound Wikidata items, 16,000 have been validated around an InChI key. ● More chemical data have been imported, so they are readily available for new Wikipedia articles or for correcting existing ones.
  • 22. Things that need your attention ● I generated a list of items at Wikidata WikiProject Chemistry which need human intervention. https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Annota Please have a look at those and unify the stereochemistry and identifiers around one unique InChI key!
  • 23. Data maintenance in Wikidata ● Our bots are written in Python (2.7 and 3.x compatible). ● Python bots keep Wikidata in sync with authoritative data sources (PubChem, ChemSpider, ChEBI, ChEMBL). ● Bots are run according to the data release cycles of the authoritative data sources. ● Mechanisms are in place for detecting inconsistencies. ● Contributions of other Wikidata users are accounted for, based on references.
  • 24. Wikidata API and query endpoints ● Three ways to access data: – Wikidata API allows read, write and full text search. (www.wikidata.org/w/api.php) – REST endpoint for fast, direct data access. (queryr.wmflabs.org/) – Wikidata query service (WDQS) as a SPARQL endpoint for complex queries. (query.wikidata.org/)
  • 25. Acknowledgments ● Andrew Su, Benjamin Good, Tim Putman, Julia Turner, Gregg Stupp (TSRI) ● Gang Fu, Evan Bolton (NIH, PubChem) ● Andra Waagmeester (Micelio.be) ● Elvira Mitraka, Lynn Schriml (Disease Ontology, U Baltimore)
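
The InChI key layout described on slide 19 can be illustrated with a small Python sketch. This is not the code of the bots mentioned in the talk; the names `split_inchikey`, `InChIKeyParts` and `same_skeleton` are hypothetical, and the pattern assumes the usual 27-character layout (14 letters, then 8 letters plus two flag letters, then one letter).

```python
import re
from typing import NamedTuple

# Assumed layout: 14 letters, hyphen, 8 letters + 2 flag letters, hyphen, 1 letter.
_INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    """Hypothetical container for the blocks of an InChI key."""
    skeleton: str          # first block: connectivity
    remaining_layers: str  # 8 letters: stereochemistry, isotopes, ...
    standard_flag: str     # flag letter for standard / non-standard InChI
    version: str           # flag letter for the InChI version
    protonation: str       # last letter: number of protons (charge)


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChI key into its blocks, or raise ValueError if malformed."""
    match = _INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChI key: {key!r}")
    return InChIKeyParts(*match.groups())


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True if two keys share the first block, i.e. the same connectivity."""
    return split_inchikey(key_a).skeleton == split_inchikey(key_b).skeleton


if __name__ == "__main__":
    # The example key from slide 19.
    print(split_inchikey("UEJJHQNACJXSKW-UHFFFAOYSA-N"))
```

Comparing only the first block is one way to spot items that may mix different isomeric forms of the same compound: two keys with the same skeleton but different second blocks describe the same connectivity with different stereochemistry or isotopes.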

Editor's Notes

  1. -Labels, descriptions, aliases in different languages -Diverse Properties -Sitelinks
  2. -Properties must be proposed and approved by the community -Data items can be edited by any Wikidata user and are the true data stores.
  3. -Claim: Property with value + optional qualifiers -Statement: A claim with its references
  4. -Many queries to the Wikidata API make the bot slow and might make Wikimedia people/administrators unhappy. -Calling wbeditentity ensures that either all data are written or none are, so if the connection or the bot breaks, no harm is done. -No new items will be created and then left unpopulated.
  5. Single value refs/nano publications Revisions/data releases
  6. The SPARQL endpoint allows complex and also federated queries on the full WD content. REST and SPARQL are still in beta mode.
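
As an illustration of the SPARQL access route from slide 24 and note 6, the following Python sketch builds a query that looks up items by InChI key. It is a sketch under assumptions rather than the bots' actual code: the helper names are hypothetical, the `User-Agent` string is a placeholder, and the query assumes that the InChI key is stored under Wikidata property P235 and that the service is reachable at `https://query.wikidata.org/sparql`.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumed SPARQL endpoint of the Wikidata query service (WDQS).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_inchikey_query(inchikey: str) -> str:
    """Return a SPARQL query for items whose InChI key (assumed P235) matches."""
    # Validate first so that nothing unexpected is spliced into the query text.
    if not re.fullmatch(r"[A-Z]{14}-[A-Z]{10}-[A-Z]", inchikey):
        raise ValueError(f"not a well-formed InChI key: {inchikey!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:P235 "{inchikey}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET URL that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


def find_items(inchikey: str) -> list:
    """Run the query over the network and return the matching item URIs."""
    request = urllib.request.Request(
        build_request_url(build_inchikey_query(inchikey)),
        headers={"User-Agent": "inchikey-lookup-sketch/0.1 (placeholder)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return [row["item"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    # Only prints the query; call find_items() to actually contact the service.
    print(build_inchikey_query("UEJJHQNACJXSKW-UHFFFAOYSA-N"))
```

If such a lookup returns more than one item for the same key, those items are candidates for the manual cleanup requested on slide 22.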