Martin A Walker, SUNY Potsdam
 Chemical data in Wikipedia
 Validation of Wikipedia chemical data
 RSC Learn Chemistry
 Conclusion
   Wikipedia is designed as an encyclopedia, NOT a
    database, BUT many cheminformatics groups want to
    use data from Wikipedia
   Since most data are entered by a human being, rather
    than by machine, Wikipedia can often provide a data
    source that is independent of the main online databases
   Could the Wikipedia chemists make the data more
    accessible without compromising the project’s mission?
    What about DBpedia?
 The Chembox on a substance page contains
 standard representations such as
    Skeletal formula
    IUPAC name
    InChI and InChIKey
    CAS no. (represents substance, not structure)
    SMILES (proprietary but de facto standard before InChI)
 These were traditionally supplied for use by
 readers to copy/paste, but we were asked to
 make a machine-friendly version
Chemboxes were originally set up as tables – OK for people,
but not for data mining.

Figure: a typical chembox from 2007
 Now designed as a set of data
  fields with values entered by
  the editor – better for data
  extraction and for validation
 Drugboxes also redesigned
 Machine-friendly formats
  (SMILES, InChI, InChIKey, CAS
  Reg. No.) included in nearly all
  chemboxes
 Hide/show used to avoid table
  “explosions”
 Collections of Wikipedia data
  are now available for
  cheminformatics groups to use
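A minimal sketch of why field/value chemboxes are easier to mine than tables. The sample wikitext and its parameter names are illustrative assumptions, not a copy of any real article, and the parser ignores nested sub-templates that real chemboxes use.

```python
import re

# Hypothetical chembox excerpt; parameter names are assumed for illustration.
SAMPLE_WIKITEXT = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChI = 1S/H2O/h1H2
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

# Matches lines of the form "| field = value".
FIELD_LINE = re.compile(r"^\|\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_fields(wikitext):
    """Return a dict of field name -> value for each '| field = value' line."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


chembox = parse_fields(SAMPLE_WIKITEXT)
print(chembox["CASNo"])
```

With a table layout the same extraction would depend on the visual markup; with named fields a one-line pattern is enough for the simple case.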
Figure: the chembox in simple form and in full form
Some data (e.g., InChIs for complex molecules)
can be very long – and this was a hindrance to
their use in Wikipedia
 InChI can be used to define what structure is
  being represented when compiling a virtual
  database.
 InChI can provide an unambiguous reference
  when validating structures on Wikipedia
 InChIKey is useful to help those using search
  engines
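One reason the InChIKey suits search engines is its fixed shape: 27 characters in three hyphen-separated blocks of 14, 10 and 1 uppercase letters. The sketch below only checks that shape; it does not verify that a key matches any structure, and the function name is hypothetical.

```python
import re

# An InChIKey is 14 letters, a hyphen, 10 letters, a hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def looks_like_inchikey(text):
    """Return True if text has the fixed InChIKey layout (format check only)."""
    return bool(INCHIKEY_PATTERN.match(text))


print(looks_like_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))  # InChIKey of water
print(looks_like_inchikey("InChI=1S/H2O/h1H2"))            # a full InChI, not a key
```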
PROBLEM: Table creep – users ask for the table to
include the Standard Free Energy of Hydroformylation
in a Black Box

ANSWER: Put it on a sub-page – the supplementary
data page (something unique to chemistry!).
Click on a link from the bottom of the Chembox:
How I use the key terms:

Validation =>
“How can I be sure the data are correct?”

Curation => an ongoing process of fixing
errors
 In 2008 a data validation drive was
  initiated for basic chemical
  identifiers
 Led to a collaboration with CAS, to
  ensure Wikipedia CAS registry nos.
  are correct
 More than 3500 substances have now
  been validated against CAS Common
  Chemistry, as having correct name,
  structure & CAS RN
 Other fields now being validated
 Validated content indicated with a
  check mark
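A cheap sanity check that can run before any comparison with CAS Common Chemistry is the CAS RN check digit: the last digit equals the weighted sum of the other digits (weights 1, 2, 3, ... counted from the right) modulo 10. This sketch only catches typing errors; it is an illustration, not the validation procedure described above, and the function name is an assumption.

```python
import re

# CAS RN layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas_rn):
    """Return True if cas_rn is well formed and its check digit is consistent."""
    match = CAS_PATTERN.match(cas_rn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


print(cas_checksum_ok("7732-18-5"))  # water
print(cas_checksum_ok("7732-18-4"))  # same number with a wrong check digit
```

A number that passes this test can still belong to the wrong substance, which is why comparison against an authoritative source remains necessary.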
Every old version (called a RevID) of an article is
preserved (for all) for posterity, and can
potentially serve as a permanent record of a
validated version.
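A short sketch of how a RevID can be turned into a permanent link, using the standard MediaWiki "oldid" query parameter. The revision number in the example is a made-up placeholder, not a real validated revision.

```python
from urllib.parse import urlencode


def permanent_link(title, rev_id, host="en.wikipedia.org"):
    """Build a URL that always shows one specific revision of an article."""
    query = urlencode({"title": title, "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"


# 123456 is a placeholder RevID used only for illustration.
print(permanent_link("Water", 123456))
```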
PROBLEM: This is “the encyclopedia anyone
can edit” – so anyone can change the BP of
water to 200 °C.

SOLUTION: A bot patrols the pages, and
watches for edits to key fields. Any dubious
edits are flagged with a red X (next to the
data), and logged.
System developed by Dirk Beetstra
(Eindhoven University of Technology). It is
the only such tool on Wikipedia.
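The patrol idea can be sketched as a comparison between a stored set of validated values and the values currently on the page. This is not the actual bot; the field names, the stored values and the "ok"/"flagged" labels are all assumptions made for illustration.

```python
# Hypothetical validated values for one article (field name -> value).
VALIDATED = {"CASNo": "7732-18-5", "BoilingPtC": "100"}


def flag_changes(validated, current):
    """Mark each watched field 'ok' if unchanged, otherwise 'flagged'."""
    flags = {}
    for field, good_value in validated.items():
        flags[field] = "ok" if current.get(field) == good_value else "flagged"
    return flags


# Simulated page state after someone changes the boiling point.
page_now = {"CASNo": "7732-18-5", "BoilingPtC": "200"}
report = flag_changes(VALIDATED, page_now)
print(report)
```

Fields outside the validated set are ignored, so ordinary edits to the rest of the article do not trigger a flag in this sketch.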
If anyone tries to
vandalize a validated
field, this will be
flagged by a bot soon
afterwards.
    This example
     received a red X 11
     minutes after it was
     vandalized.
 In 2008-2010, around 3000 chemical
  structures were informally checked against
  CAS Common Chemistry
 PROBLEM: Structures are loaded from an
  external file on Wikimedia Commons, which
  can be “invisibly” changed
The bot has been modified to watch changes
to the RevID of the Wikimedia Commons
structure image
A few hundred images validated so far
Drugboxes are patrolled by
the bot, but at present
WP:PHARM is not active in
formal validation. Most work is
done by Dirk Beetstra, using
official lists from data
sources (e.g., ChEBI).
RSC Learn Chemistry aims to enrich RSC educational content with data
from ChemSpider, then make it open for educators
to contribute their own content (licensed under
Creative Commons)
 Wikipedia can provide a useful “virtual
  database” of highly curated information on
  common chemicals and drugs.
 Don’t forget the data page information!
 The validation effort needs to go further –
  YOUR help is very welcome!
 RSC Learn Chemistry shows that chemical data
  can also be used to enrich an educational site.
 Congratulations to Henry and Peter, and thanks
  for the invitation to speak in their symposium.
 Thanks to Antony Williams for his many
  contributions to both Wikipedia and Learn
  Chemistry.
 Thanks to Aileen Day, Lorna Thomson, Duncan
  McMillan and RSC Education staff, and to RSC for
  the funding of Learn Chemistry.
 Thanks to undergraduate student Tyson Terpstra
  for uploading many quiz InChIs.
 Thank you for your attention!
 All of my own content in this presentation is
  released under a Creative Commons BY-SA-3.0
  license
 Copyright information for images is usually
  attributed on the slide itself
 Content from Wikipedia and Learn Chemistry
  is reused via a Creative Commons BY-SA-3.0
  license. For authors, please visit the original
  Wikipedia page and select the “history” tab.
 Other pictures not attributed should only be
  my own personal pictures, also CC-BY-SA-3.0.

Data rich chemistry inside wikipedia and other wikis
