Data rich chemistry inside wikipedia and other wikis

  1. Martin A Walker, SUNY Potsdam
  2. Chemical data in Wikipedia; Validation of Wikipedia chemical data; RSC Learn Chemistry; Conclusion
  3. Wikipedia is designed as an encyclopedia, NOT a database, BUT many cheminformatics groups want to use data from Wikipedia. Since most data are entered by a human being, rather than by machine, Wikipedia can often provide a data source that is independent of the main online databases. Could the Wikipedia chemists make the data more accessible without compromising the project’s mission? What about DBpedia?
  4. The Chembox on a substance page contains standard representations such as: skeletal formula; IUPAC name; InChI and InChIKey; CAS no. (represents substance, not structure); SMILES (proprietary but de facto standard before InChI). These were traditionally supplied for use by readers to copy/paste, but we were asked to make a machine-friendly version.
  5. Chemboxes were originally set up as tables: OK for people, but not for data mining. A typical chembox from 2007.
  6. Now designed as a set of data fields with values entered by the editor: better for data extraction and for validation. Drugboxes also redesigned. Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes. Hide/show used to avoid table “explosions”. Collections of Wikipedia data are now available for cheminformatics groups to use.
  7. SIMPLE FULL FORM
  8. Some data (e.g., InChIs for complex molecules) can be very long, and this was a hindrance to their use in Wikipedia.
  9. InChI can be used to define what structure is being represented when compiling a virtual database. InChI can provide an unambiguous reference when validating structures on Wikipedia. InChIKey is useful to help those using search engines.
  10. PROBLEM: Table creep. Users ask for the table to include the Standard Free Energy of Hydroformylation in a Black Box. ANSWER: Put it on a sub-page, the supplementary data page (something unique to chemistry!). Click on a link from the bottom of the Chembox.
  11. How I use the key terms: Validation => “How can I be sure the data are correct?” Curation => an ongoing process of fixing errors.
  12. In 2008 a data validation drive was initiated for basic chemical identifiers. This led to a collaboration with CAS to ensure that Wikipedia CAS registry nos. are correct. Now 3500+ substances have been validated against CAS Common Chemistry as having correct name, structure and CAS RN. Other fields are now being validated. Validated content is indicated with a check mark.
  13. Every old version (called a RevID) of an article is preserved (for all) for posterity, and can potentially serve as a permanent record of a validated version.
  14. PROBLEM: This is “the encyclopedia anyone can edit”, so anyone can change the BP of water to 200 °C. SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged. System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.
  15. If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards. This example received a red X 11 minutes after it was vandalized.
  16. In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry. PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed.
  17. The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image. A few hundred images validated so far.
  18. Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).
  19. RSC Learn Chemistry aims to enrich RSC educational content with data from ChemSpider, then make it open for educators to contribute their own content (licensed under Creative Commons).
  20. Wikipedia can provide a useful “virtual database” of highly curated information on common chemicals and drugs. Don’t forget the data page information! The validation effort needs to go further, and YOUR help is very welcome! RSC Learn Chemistry shows that chemical data can also be used to enrich an educational site.
  21.  Congratulations to Henry and Peter, and thanks for the invitation to speak in their symposium. Thanks to Antony Williams for his many contributions to both Wikipedia and Learn Chemistry. Thanks to Aileen Day, Lorna Thomson, Duncan McMillan and RSC Education staff, and to RSC for the funding of Learn Chemistry. Thanks to undergraduate student Tyson Terpstra for uploading many quiz InChIs. Thank you for your attention!
  22. Thank you for your attention
  23. All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license. Copyright information for images is usually attributed on the slide itself. Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab. Other pictures not attributed should only be my own personal pictures, also CC-BY-SA-3.0.
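
The combination described in slides 6, 12 and 14 (chembox data held as named fields, a validated version kept on record, and a bot watching key fields for changes) can be illustrated with a minimal Python sketch. This is not the bot developed by Dirk Beetstra, and it does not call any Wikipedia API. The sample wikitext is a simplified, hypothetical chembox (real chemboxes nest their fields in sub-templates), and the helper names `extract_fields` and `flag_changes` are invented here for illustration.

```python
import re

# Hypothetical, simplified chembox wikitext (an assumption for illustration;
# real chemboxes group their fields inside nested sub-templates).
VALIDATED_WIKITEXT = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
| BoilingPtC = 100
}}"""

# Matches one "| field = value" line; [ \t] keeps a match from spanning lines.
FIELD_PATTERN = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def extract_fields(wikitext):
    """Return a dict mapping each chembox field name to its value."""
    return {name: value for name, value in FIELD_PATTERN.findall(wikitext)}


def flag_changes(validated, current, watched):
    """Return the watched field names whose value differs from the validated one."""
    return [name for name in watched if current.get(name) != validated.get(name)]


# Simulate the edit from slide 14: the boiling point of water changed to 200 °C.
edited_wikitext = VALIDATED_WIKITEXT.replace("BoilingPtC = 100", "BoilingPtC = 200")

validated = extract_fields(VALIDATED_WIKITEXT)
current = extract_fields(edited_wikitext)
flagged = flag_changes(validated, current, ["CASNo", "StdInChIKey", "BoilingPtC"])
print(flagged)
```

In this sketch only the edited field is reported, which mirrors the idea of marking a single value with a red X rather than rejecting the whole page. Storing values as named fields is what makes this kind of field-by-field comparison possible; the original table layout from slide 5 would have required parsing presentation markup instead.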
