Data-rich chemistry inside Wikipedia and other wikis: Presentation Transcript
Martin A Walker, SUNY Potsdam
Chemical data in Wikipedia; Validation of Wikipedia chemical data; RSC Learn Chemistry; Conclusion
Wikipedia is designed as an encyclopedia, NOT a database, BUT many cheminformatics groups want to use data from Wikipedia. Since most data are entered by a human being, rather than by machine, Wikipedia can often provide a data source that is independent of the main online databases. Could the Wikipedia chemists make the data more accessible without compromising the project’s mission? What about DBpedia?
The Chembox on a substance page contains standard representations such as: skeletal formula; IUPAC name; InChI and InChIKey; CAS no. (represents substance, not structure); SMILES (proprietary but de facto standard before InChI). These were traditionally supplied for readers to copy/paste, but we were asked to make a machine-friendly version.
Chemboxes were originally set up as tables – OK for people, but not for data mining. A typical chembox from 2007.
Chemboxes are now designed as a set of data fields with values entered by the editor – better for data extraction and for validation. Drugboxes have also been redesigned. Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) are included in nearly all chemboxes. Hide/show is used to avoid table “explosions”. Collections of Wikipedia data are now available for cheminformatics groups to use.
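To illustrate why field/value chemboxes are easier to mine than free-form tables, here is a minimal Python sketch that pulls `| name = value` pairs out of a chembox. The sample wikitext, the field names shown and the `parse_chembox_fields` helper are assumptions made for illustration; real chemboxes nest sub-templates and would need a proper wikitext parser.

```python
import re

# Hypothetical, heavily simplified chembox wikitext (an assumption for
# illustration; real chemboxes nest sub-templates and carry more fields).
SAMPLE_WIKITEXT = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChI = 1S/H2O/h1H2
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

# Matches one "| field = value" line and captures the field and the value.
FIELD_LINE = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_chembox_fields(wikitext):
    """Return {field: value} for every '| field = value' line in the text."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    for name, value in parse_chembox_fields(SAMPLE_WIKITEXT).items():
        print(f"{name}: {value}")
```

Because each value sits in a named field, a data miner can ask for the CAS number or InChIKey directly instead of guessing which table cell holds it.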
SIMPLE FULL FORM
Some data (e.g., InChIs for complex molecules) can be very long – and this was a hindrance to their use in Wikipedia.
InChI can be used to define what structure is being represented when compiling a virtual database. InChI can provide an unambiguous reference when validating structures on Wikipedia. InChIKey is useful to help those using search engines.
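One reason the InChIKey suits search engines is that it is a fixed-length, 27-character hash of the InChI, whereas the InChI itself can be arbitrarily long. The sketch below only checks that a string has the InChIKey shape; the `looks_like_inchikey` helper is a hypothetical name, and the check says nothing about whether the key belongs to a real compound.

```python
import re

# InChIKey layout: 14 uppercase letters, hyphen, 10 uppercase letters,
# hyphen, 1 uppercase letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text):
    """Return True if the text has the shape of an InChIKey (format check only)."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    print(looks_like_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))  # InChIKey of water
    print(looks_like_inchikey("InChI=1S/H2O/h1H2"))            # an InChI, not a key
```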
PROBLEM: Table creep – users ask for the table to include the Standard Free Energy of Hydroformylation in a Black Box. ANSWER: Put it on a sub-page – the supplementary data page (something unique to chemistry!). Click on a link from the bottom of the Chembox:
How I use the key terms: Validation => “How can I be sure the data are correct?” Curation => an ongoing process of fixing errors.
In 2008 a data validation drive was initiated for basic chemical identifiers. This led to a collaboration with CAS to ensure that Wikipedia CAS registry numbers are correct. More than 3500 substances have now been validated against CAS Common Chemistry as having the correct name, structure and CAS RN. Other fields are now being validated. Validated content is indicated with a check mark.
Every old version (called a RevID) of an article is preserved (for all) for posterity, and can potentially serve as a permanent record of a validated version.
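A RevID can be cited as a permanent link through MediaWiki's `index.php?title=...&oldid=...` URL form. The sketch below builds such a link; the `permalink` helper is a hypothetical name and the revision number in the usage line is a placeholder, not a real validated revision.

```python
from urllib.parse import urlencode


def permalink(title, rev_id, host="en.wikipedia.org"):
    """Build a permanent link to one specific revision of a wiki page."""
    # MediaWiki page titles use underscores in place of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"


if __name__ == "__main__":
    # 123456789 is a placeholder RevID used only for illustration.
    print(permalink("Acetic acid", 123456789))
```

Storing the RevID alongside each validated value would let anyone reopen exactly the text that was checked, whatever edits follow.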
PROBLEM: This is “the encyclopedia anyone can edit” – so anyone can change the BP of water to 200 °C. SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged. System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.
If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards. This example received a red X 11 minutes after it was vandalized.
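The core check such a bot performs can be sketched as a comparison between the validated values and the values in the current revision. This is not the actual bot's code; the field names, the `WATCHED_FIELDS` tuple and the `flag_changed_fields` helper are all assumptions for illustration.

```python
# Hypothetical names for the fields a patrol might watch (assumption).
WATCHED_FIELDS = ("CASNo", "StdInChI", "StdInChIKey", "BoilingPtC")


def flag_changed_fields(validated, current, watched=WATCHED_FIELDS):
    """Return the watched fields whose current value differs from the validated one."""
    flagged = []
    for field in watched:
        # Only fields that were validated in the first place can be flagged.
        if field in validated and current.get(field) != validated[field]:
            flagged.append(field)
    return flagged


if __name__ == "__main__":
    validated = {"CASNo": "7732-18-5", "BoilingPtC": "100"}
    vandalized = dict(validated, BoilingPtC="200")
    print(flag_changed_fields(validated, vandalized))  # → ['BoilingPtC']
```

A flagged field would then get the red X and a log entry; deciding whether the edit was vandalism or a genuine correction is still left to a human editor.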
In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry. PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed.
The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image. A few hundred images have been validated so far.
Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work has been done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).
RSC Learn Chemistry aims to enrich RSC educational content with data from ChemSpider, then make it open for educators to contribute their own content (licensed under Creative Commons).
Wikipedia can provide a useful “virtual database” of highly curated information on common chemicals and drugs. Don’t forget the data page information! The validation effort needs to go further – YOUR help is very welcome! RSC Learn Chemistry shows that chemical data can also be used to enrich an educational site.
Congratulations to Henry and Peter, and thanks for the invitation to speak in their symposium. Thanks to Antony Williams for his many contributions to both Wikipedia and Learn Chemistry. Thanks to Aileen Day, Lorna Thomson, Duncan McMillan and RSC Education staff, and to RSC for the funding of Learn Chemistry. Thanks to undergraduate student Tyson Terpstra for uploading many quiz InChIs. Thank you for your attention!
All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license. Copyright information for images is usually attributed on the slide itself. Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab. Other pictures not attributed should only be my own personal pictures, also CC-BY-SA-3.0.