
Improving the chemistry content of Wikipedia using workflow tools

Talk given at Vermont Code Camp 2016 on improving the chemical content of Wikipedia using

  1. Using Automated Workflow Tools to Improve Wikipedia
     Mitch Miller, Scientific Thinking
     Vermont Code Camp 2016, September 17, 2016
  2. Disclaimer
     • This talk represents my opinion and personal experience using software systems developed by third parties.
     • The software systems shown are very complex and have hundreds of components. I have only worked with a small number.
     • Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.
  3. Overview
     • Introduction: how are we improving Wikipedia? Why are we doing this?
     • The list of information we need to compile
     • The first method of generating the list
     • The second method of generating the list
     • The third method of generating the list
  4. What chemistry does Wikipedia contain?
     • 9,736 articles with the Chembox; 5,656 with the Drug box (15,392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]
     • Chembox? Drug box?
       • Templates of selected content within Wikipedia articles
     • Contents of Chembox:
       • Molecular structure image
       • Name (systematically assigned name + synonyms)
       • Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…
       • Key properties
  5. Chemical identifiers
     • Different identifiers come from different specific databases
     • Individual IDs have strengths and weaknesses
     • The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information.
       • http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/
     • UNIIs contain 9 randomly generated alphanumeric characters with a tenth check alphanumeric character
     • When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”
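The slide describes a UNII as nine alphanumeric characters plus a tenth check character. A minimal sketch of a shape check based only on that description follows. It assumes the characters are uppercase letters and digits, and it does not verify the check character, because the check algorithm is not given in the talk. The function name `looks_like_unii` is hypothetical.

```python
import re

# Assumption: a UNII is exactly 10 characters drawn from A-Z and 0-9.
# The tenth (check) character is NOT validated here; only the shape is.
UNII_SHAPE = re.compile(r"[A-Z0-9]{10}")


def looks_like_unii(value: str) -> bool:
    """Return True if value has the 10-character alphanumeric shape of a UNII."""
    return UNII_SHAPE.fullmatch(value.strip().upper()) is not None
```

A check like this could flag obviously malformed entries (wrong length, stray punctuation) before a subject matter expert reviews the list.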
  6. SRS group goal
     • The group manages the Substance Registration System (SRS)
     • Goal: assure uniformity of UNII assignments across internet resources that reference UNIIs
  7. The assignment
     • Generate a report of all chemicals and drugs in Wikipedia
       • Name, UNII (when present), CAS (when present), Wikipedia URL
     • Idea: subject matter experts will review the list, correct assignments, and add new UNIIs to Wikipedia as needed
     • Result: a more accurate Wikipedia that links to the FDA’s Substance Registration System unambiguously
       • https://fdasis.nlm.nih.gov/srs/srs.jsp
  8. Development tool: KNIME
     • Graphical, component-based programming environment
       • Drag functional components from the palette onto the canvas to create a program
       • Configure most components by setting parameters
       • Connect components to route data from one to another
       • Run and observe data traveling down the lines
     • KNIME stands for KoNstanz Information MinEr
       • Pronounced “Nighm”
     • Originally developed at the University of Konstanz, Germany (2004)
     • Currently produced by KNIME.com AG, a company in Zurich, Switzerland
     • Free version available for download
       • Windows, Linux, Mac
  9. First method of report generation
     • Read the list of pages that use each infobox
       • E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
     • Retrieve each individual page mentioned in the list
     • Parse the HTML
       • Use XPath to get Name, CAS, UNII
       • The infobox templates lead to pages with a defined structure, which is straightforward to parse
     • Format data for output
     • Write to a file
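The parsing step above can be sketched in Python. The talk did this with KNIME nodes, so the code below is an illustration of the idea, not the speaker's workflow. It uses the limited XPath support in the standard library's `xml.etree.ElementTree` on a hypothetical, simplified and well-formed infobox fragment; real Wikipedia markup is larger, differs in detail, and may need a more forgiving HTML parser. The UNII value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a rendered infobox.
# The UNII cell holds a placeholder, not a real identifier.
SAMPLE_HTML = """
<html><body>
  <table class="infobox">
    <caption>Ethanol</caption>
    <tr><th>CAS Number</th><td>64-17-5</td></tr>
    <tr><th>UNII</th><td>PLACEHOLDER</td></tr>
  </table>
</body></html>
"""


def extract_infobox(html: str) -> dict:
    """Return {'Name': ..., label: value, ...} for the first infobox table."""
    root = ET.fromstring(html.strip())
    # ElementTree supports a subset of XPath, including attribute predicates.
    table = root.find(".//table[@class='infobox']")
    if table is None:
        return {}
    record = {"Name": (table.findtext("caption") or "").strip()}
    for row in table.findall("tr"):
        label = row.findtext("th")
        value = row.findtext("td")
        if label and value:
            record[label.strip()] = value.strip()
    return record


record = extract_infobox(SAMPLE_HTML)
```

Each page needs its own fetch and parse, which is consistent with the batching and the cumbersome XPath handling noted as drawbacks of this method.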
  10. First method: pluses/minuses
     • Plus: it works
     • Minus: had to run in batches to get all records
     • Minus: XPath parsing was more cumbersome than expected
     • Minus: misses some data
  11. The Semantic Web
     • A connected set of data resources that can be understood by machines
     • Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries
     • [The standard WWW is a web of documents meant to be understood by humans]
     • Tim Berners-Lee has a great TED talk on the semantic web
       • https://www.youtube.com/watch?v=OM6XIICm_qo
  12. Understand Semantic Web in comparison to WWW
     • Compare pages on the same subject:
       • Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol
       • Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
  13. Technological foundations of Semantic Web
     • RDF (Resource Description Framework): organizing facts as
       • Subject, Predicate, Object
     • Conceptual example:
       • [Ethanol] [has a boiling point] [173 degrees Fahrenheit]
     • Coded example:
       • wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
       • Represented in Turtle (Terse RDF Triple Language)
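To make the subject, predicate, object structure concrete, here is a minimal Python sketch that holds one fact as a three-field tuple and renders it as a Turtle-style statement. The `Triple` class and `to_turtle` function are hypothetical names for illustration; a real application would more likely use an RDF library, and the literal is copied from the slide rather than from Wikidata's actual serialization.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style fact: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def to_turtle(triple: Triple) -> str:
    """Render a triple as a single Turtle-style statement ending in ' .'."""
    return f"{triple.subject} {triple.predicate} {triple.obj} ."


# The slide's example: ethanol (Q153) has boiling point (P2102) 173 °F.
boiling_point = Triple("wd:Q153", "wdt:P2102", '"173±1 degree Fahrenheit"')
statement = to_turtle(boiling_point)
```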
  14. SPARQL
     • Query language for RDF data
     • SPARQL Protocol and RDF Query Language
     • Similar to SQL
     • Syntax based on the RDF triple
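The sense in which the syntax is "based on the RDF triple" can be shown with a toy matcher: a query is a triple in which some positions are variables (written with a leading `?`), and the answer is every binding of those variables that makes the pattern equal to a stored triple. The sketch below is an assumption-laden illustration in plain Python, not how a SPARQL engine is implemented, and the triples are toy data.

```python
def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    A variable that appears twice must bind to the same value both times.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Toy data for illustration only.
TRIPLES = [
    ("wd:Q153", "wdt:P31", "wd:Q11173"),
    ("wd:Q153", "wdt:P231", "64-17-5"),
    ("ex:other", "wdt:P31", "ex:something-else"),
]

# Analogous to the SPARQL pattern: ?compound wdt:P31 wd:Q11173 .
compounds = list(match(("?compound", "wdt:P31", "wd:Q11173"), TRIPLES))
```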
  15. Wikidata
     • Conceptually: semantic web version of Wikipedia
       • Add grain of salt
     • “Free and open knowledge base that can be read and edited by both humans and machines.”
     • Designed as ‘central storage’ for Wikipedia and other Wikimedia projects
     • Approximately: programmatic interface to Wikipedia
     • See https://query.wikidata.org/
       • Run the example queries
  16. Second method
     • Search Wikidata programmatically for chemical information
       • Wikidata SPARQL interface
     • Format list
     • Write file
  17. SPARQL for chemical and pharmaceutical compounds

      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      PREFIX wd: <http://www.wikidata.org/entity/>
      PREFIX wikibase: <http://wikiba.se/ontology#>
      PREFIX bd: <http://www.bigdata.com/rdf#>

      # All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
      SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas
      WHERE {
        ?compound wdt:P31 wd:Q11173 .
        OPTIONAL { ?compound wdt:P231 ?cas . }
        OPTIONAL { ?compound wdt:P274 ?formula . }
        OPTIONAL { ?compound wdt:P652 ?unii . }
        OPTIONAL { ?compound wdt:P662 ?pubchem . }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
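Outside KNIME, a query like the one above could be sent to the Wikidata query service over HTTP and the JSON response flattened into rows. The sketch below uses only the Python standard library. The endpoint URL is the public service behind https://query.wikidata.org/; the function names and the User-Agent string are hypothetical placeholders, and the parsing assumes the standard SPARQL JSON results layout (`head.vars` and `results.bindings`). OPTIONAL values that are absent come back as `None`.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request for the query; the User-Agent is a placeholder."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "User-Agent": "chem-report-sketch/0.1 (example)",
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(f"{ENDPOINT}?{params}", headers=headers)


def flatten_bindings(payload: dict) -> list:
    """Turn SPARQL JSON results into a list of {variable: value-or-None}."""
    variables = payload["head"]["vars"]
    return [
        {name: binding.get(name, {}).get("value") for name in variables}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Send the query over the network and return flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return flatten_bindings(json.load(response))
```

Because the rows arrive already keyed by variable name, no HTML parsing is needed, which is the main advantage claimed for the second method.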
  18. Second method: pluses/minuses
     • Plus: fast and easy!
     • Plus: data arrives in a format we can use, with no parsing!
     • Minus: *some* Wikidata data does not match up with Wikipedia!
  19. Third method
     • Hybrid approach
     • Use the Wikidata SPARQL query to get the list of chemicals
     • Query Wikipedia for individual items to compare values
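The comparison step of the hybrid approach can be sketched as a join on the compound name followed by a field-by-field check. This is a minimal Python illustration, assuming both sources have already been reduced to dictionaries with `name`, `unii`, and `cas` keys; those key names, the function name, and the sample values in the usage are hypothetical, and the talk's actual comparison was built in KNIME.

```python
def find_mismatches(wikidata_rows, wikipedia_rows, fields=("unii", "cas")):
    """Return (name, field, wikidata_value, wikipedia_value) for each difference.

    A compound present in Wikidata but absent from the Wikipedia rows is
    reported once with field 'missing'.
    """
    by_name = {row["name"]: row for row in wikipedia_rows}
    mismatches = []
    for wd_row in wikidata_rows:
        wp_row = by_name.get(wd_row["name"])
        if wp_row is None:
            mismatches.append((wd_row["name"], "missing", None, None))
            continue
        for field in fields:
            if wd_row.get(field) != wp_row.get(field):
                mismatches.append(
                    (wd_row["name"], field, wd_row.get(field), wp_row.get(field))
                )
    return mismatches
```

For example, with placeholder identifiers:

```python
wikidata = [{"name": "Ethanol", "unii": "AAAAAAAAA1", "cas": "64-17-5"}]
wikipedia = [{"name": "Ethanol", "unii": "AAAAAAAAA2", "cas": "64-17-5"}]
report = find_mismatches(wikidata, wikipedia)
```

The resulting list is the kind of discrepancy report that subject matter experts could review before correcting Wikipedia.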
  20. Conclusion
     • Using Wikidata, Wikipedia, and KNIME, we compiled a list of chemicals with the required data
     • Subject matter experts are in the process of updating Wikipedia
     • Semantic web technology made the job easier!
     • Thank you!
  21. References
     • Scholarly article on KNIME and Pipeline Pilot
       • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/
     • KNIME
       • www.knime.org
     • Wikipedia
       • https://en.wikipedia.org/wiki/Template:Chembox
       • https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
     • Wikidata
       • https://query.wikidata.org
  22. Who is your speaker?
     • Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
     • Independent consultant: Scientific Thinking, LLC
       • mitch.miller@thinkscience.us
     • Some recent projects
       • Ongoing custodian of one chemical database implementation for the ChemIDplus project within the National Library of Medicine
       • Reporting systems
       • Web service to link a collaborative object management system to a reporting system
       • Import wizard for a chemical array designer
       • Merged a set of chemical databases and harmonized data
