Webinar for the Chemical Information Division of the American Chemical Society. Describes the types of chemical data in Wikipedia, and how the Wikipedia community uploads and maintains them.

Using Wikipedia as a source of chemical information

  1. 1. Prof. Martin A. Walker, SUNY Potsdam June 27, 2013 Webinar for ACS Chemical Information Division
  2. 2. Overview: • Introduction • Chemical substance data in Wikipedia • Other chemistry-related content • Behind the scenes: how articles are written; WikiProjects • Conclusion
  3. 3. What is a wiki? “A collaborative website which can be directly edited by anyone with access to it.” (Wiktionary, March 20, 2007) From the Hawaiian word “wiki wiki” meaning “quick.” Picture by Jshapiro WM Commons CC license
  4. 4. What is Wikipedia? Wikipedia defines itself as: "a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation." Wikipedia logo is © Wikimedia Foundation, San Francisco, CA
  5. 5. Wikipedia is… • An encyclopedia in over 200 languages • An incredibly useful resource for academia • Written by volunteers • Editable by anyone • Free to be copied and re-used • Free to use (no cost) • Operated on a non-profit basis. Wikipedia is not… • A “soapbox” or a place to publish your own work • An authoritative resource for academia • Written mainly by kids, or by paid professionals • Free to re-use without attribution • Run by a corporation
  6. 6. Traditional encyclopedia: “Experts know best” 1 • Editors choose an expert 2 • Expert writes, based on authoritative resources 3 • Editors review and check facts
  7. 7. Wikipedia – a new paradigm? “Many eyes are better” 1 • Volunteer writes, supposedly using authoritative resources 2 • Other volunteers review and check facts 3 • Ongoing process of adding content then review
  8. 8. Much chemical information on the Web is generated by machine. Wikipedia is large, even though most information is entered word-by-word by a human. This means that: • It exhibits nuances of human analysis • Much of it first enters the Web through Wikipedia • It is curated by humans • It has silly human errors! The value to cheminformatics – original human input Editing Wikipedia articles Pic by Girona7, Wikimedia Commons CC license
  9. 9. Article pages describe a specific topic • To comment on something in the article, click on the “Discussion” tab • To look at earlier versions, click the “History” tab • To change the article, click the “Edit” tab – but be careful! The Wikipedia article page
  10. 10. Substance articles
  11. 11. After a general lead section (“lede”), most decent substance articles cover these main areas: • Physical & chemical properties • Preparation • Uses • Identifiers, physical & chemical data (in a Chembox) Detailed information on safety or chemical suppliers is considered inappropriate. Substance articles
  12. 12. Wikipedia - an encyclopedia, NOT a database • But can it be used like a database anyway? • What about DBpedia? Substance data in Wikipedia
  13. 13. The Chembox on a substance page contains standard representations such as • Skeletal formula • IUPAC name • InChI and InChIKey • CAS no. (represents substance, not structure) • SMILES (de facto standard before InChI) These were traditionally supplied for readers to copy/paste, but we were asked to make a machine-friendly version. Chemboxes & Drugboxes
  14. 14. Chemboxes were originally set up as tables – OK for people, but not for data mining. EARLY CHEMBOXES A typical chembox From 2007
  15. 15. Some data (e.g., InChIs for complex molecules) can be very long – and this was a hindrance to their use in Wikipedia. TABLE EXPLOSIONS!
  16. 16. • Now designed as a set of data fields with values entered by the editor – better for data extraction and for validation • Drugboxes also redesigned • Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes • Hide/show used to avoid table “explosions” • Collections of Wikipedia data are now available for cheminformatics groups to use NEW CHEMBOXES
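
To illustrate why a field-based chembox is easier to mine than a hand-built table, here is a minimal Python sketch that pulls `name = value` pairs out of a template fragment. The wikitext sample, the field names shown, and the helper `parse_fields` are illustrative assumptions, not taken from the slides. Real chemboxes nest sub-templates, which this flat parser does not handle.

```python
import re

# Hypothetical wikitext fragment; field names and values are illustrative only.
SAMPLE = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChI = 1S/H2O/h1H2
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

# One field per line: "| name = value". Only the first "=" separates name and value,
# so values that themselves contain "=" (e.g. a SMILES double bond) stay intact.
FIELD = re.compile(r"^\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

def parse_fields(wikitext):
    """Return a dict of the 'name = value' pairs found on lines starting with '|'."""
    return {name: value for name, value in FIELD.findall(wikitext)}

fields = parse_fields(SAMPLE)
print(fields["CASNo"], fields["StdInChIKey"])
```

Because every identifier sits in its own named field, a script can extract or check one value without having to interpret table layout.
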
  17. 17. FULL FORM / SIMPLE Current form of CHEMBOX
  18. 18. • InChI can be used to define what structure is being represented when compiling a virtual database. • InChI can provide an unambiguous reference when validating structures on Wikipedia • InChIKey is useful to help those using search engines Value of the InChI and InChIKey
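
One reason the InChIKey suits search engines is its fixed shape: three blocks of 14, 10, and 1 uppercase letters joined by hyphens, which is indexed as a single token. The sketch below only checks that shape and splits out the first block (the connectivity hash). It does not verify that a key corresponds to a real structure. The function names are assumptions made for this example.

```python
import re

# Shape of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(text):
    """True if the string has the 14-10-1 block layout of an InChIKey."""
    return INCHIKEY.fullmatch(text) is not None

def connectivity_block(inchikey):
    """Return the first 14-letter block, which hashes the molecular skeleton."""
    if not looks_like_inchikey(inchikey):
        raise ValueError("not shaped like an InChIKey")
    return inchikey.split("-")[0]

print(looks_like_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
print(connectivity_block("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
```
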
  19. 19. PROBLEM: Table creep – a user asks for the table to include the Standard Free Energy of Hydroformylation in a Black Box. ANSWER: Put it on a sub-page – the supplementary data page (chemistry is unique in Wikipedia in having these!). Click on a link from the bottom of the Chembox: Data pages These do have value, with some data pages having over 50,000 hits/year
  20. 20. Data pages
  21. 21. Wikipedia Drug pages
  22. 22. Maintained by the Pharmacology WikiProject, which has a medicinal focus. This means that: • Some items of interest to chemists may be missing (though main ones are in the drugbox) • There are no supplementary data pages with spectral data, etc. • At the “border” between drugs and chemicals, there may be two similar substances that have different boxes. For example: • caffeine has a drugbox, but paraxanthine has a chembox Drugboxes
  23. 23. Chemical reactions
  24. 24. Some have ReactionBoxes
  25. 25. Good coverage of named organic reactions, but otherwise coverage is patchy – Wikipedia is very weak on reactions compared to March, probably because of the classic cheminformatics problem: substances are easy to define, reactions are hard. Only a handful have ReactionBoxes. No database based on Wikipedia reaction articles is available. Typical content: • Mechanism • Reagents, catalysts, conditions • Scope & limitations • Stereochemistry • Variations Reaction articles
  26. 26. Biographical articles
  27. 27. • Large proportion of Wikipedia overall, but low in chemistry – chemists tend to be more interested in chemistry than in people! Many more could be written. • Mainly covers Nobel Laureates and important historical figures, plus a few chemists where someone has taken the time to write an article. • “Vanity articles” are strongly discouraged! Biographical articles
  28. 28. Variable coverage. None of these usually have data boxes, but many include templates to show related topics. • Methods and equipment • Constants, equations • Theories and hypotheses • Chemical families (e.g., “Aldehyde”) • Terms used (e.g., “Coordination complex”) • Many others – history, chemical companies, etc. Concepts & other chemistry content
  29. 29. The Wikipedia community User:Polimerek – a Polish Wikipedian and polymer chemist Picture from Wikimedia Polska, CC license
  30. 30. The lonely editor… Most articles are started by a topic enthusiast, and then expanded & maintained by the community if they are considered useful. Picture: WM Commons, Public domain These “Wikipedians” are motivated by altruism and a love of learning, and they want to share their knowledge with the world, for free. They can also enjoy seeing their work read by thousands, or even millions. Picture by Ziko van Dijk, CC license
  31. 31. WikiProjects provide a place for like-minded editors to discuss articles and organize collaborations. They also agree on standards & templates, and assess quality. WikiProjects
  32. 32. If you plan major changes to an article or articles, post a comment on the article talk page and also on the relevant WikiProject talk page. WikiProject talk pages – for informing
  33. 33. These discussions matter; the article discussed here had half a million hits in the last year. Wikipedia’s influence may be unofficial, but it is powerful and in many cases its definitions become the de facto standard. …and for discussions
  34. 34. Types of chemistry article WIKIPROJECT CHEMISTRY Chemical concepts Chemical reactions & processes Chemists WIKIPROJECT ELEMENTS Chemical elements WIKIPROJECT CHEMICALS Chemical substances WIKIPROJECT PHARMACOLOGY Pharmaceuticals WIKIPROJECT CELL & MOLECULAR BIOLOGY Molecular biology
  35. 35. WikiProject Chemicals ~60 members (10-20 active) Collaborates on writing quality articles and standards for: •developing data boxes for articles •chemical naming, structure drawing •article assessment Data validation Beta-Cyclodextrin Public domain picture by Edgar181
  36. 36. ChemBoxes, article validation, chemical names, structure drawing, style guide: all are organized by the WikiProjects. Type WP:MOSCHEM into Wikipedia to find the Manual of Style for Chemistry. WikiProjects collaborate to set standards
  37. 37. Articles are assessed, then tagged on the talk page. A bot compiles these assessments into lists & tables, allowing the project to review and track their articles. WikiProjects assess articles for quality & importance
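
The compilation step can be pictured as a simple tally. The sketch below is only an illustration of counting quality and importance tags. The article list and its grades are made up for the example, and it is not the code the assessment bot actually runs.

```python
from collections import Counter

# Hypothetical (article, quality, importance) tags, as recorded on talk pages.
TAGS = [
    ("Water", "B", "Top"),
    ("Caffeine", "GA", "High"),
    ("Paraxanthine", "Start", "Low"),
    ("Aldehyde", "B", "High"),
]

def tally(tags):
    """Count articles for each (quality, importance) pair."""
    return Counter((quality, importance) for _, quality, importance in tags)

def quality_totals(tags):
    """Count articles per quality grade, ignoring importance."""
    return Counter(quality for _, quality, _ in tags)

for (quality, importance), count in sorted(tally(TAGS).items()):
    print(f"{quality:6} {importance:5} {count}")
```
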
  38. 38. Type WP:ASSESS into Wikipedia to see this Article assessment – by editors
  39. 39. Assessment guides article improvement priorities
  40. 40. WikiTrust – to check trustworthiness of contributions Downloadable as an extension to Firefox, this adds a tab above the article – click to see:
  41. 41. General ways to remove vandalism Watchlists: Users watch all changes to specific pages they care about Huggle: Software to help Wikipedians track and remove vandalism quickly Bots: “Obvious” vandalism (such as deleting all content from a page) is spotted and reverted almost immediately by “bots” that patrol the recent edits. (Bots are scripts that automate the editing process.) Part of my Watchlist from early this morning
  42. 42. Collaborations for validating data 2007-present: ChemSpider and Antony Williams have a longstanding collaboration with the Chemicals WikiProject, aimed at curating data in both ChemSpider and Wikipedia. 2008-2010: CAS provided a database of around 8000 substances to the Chemicals WikiProject free of charge; this collection was also used as the basis for a new CAS open access site for the general public, CAS Common Chemistry.
  43. 43. CAS Common Chemistry • Launched in April 2009 • Offered as a free service to provide CAS RNs to the public.
  44. 44. Since 2007 Wikipedia has collaborated with IUPAC to help propagate IUPAC definitions. This ensures that Wikipedia has accurate, current definitions, and IUPAC can reach a much wider audience. Currently, a collaboration is actively inserting IUPAC definitions for polymer terms into articles, and editing/expanding content as needed. IUPAC collaboration
  45. 45. How I use the key terms: Validation => “How can I be sure the data are correct?” Curation => an ongoing process of fixing errors Data validation
  46. 46. In 2008 a data validation drive was initiated for basic chemical identifiers, in collaboration with Antony Williams (ChemSpider). This led to a collaboration with CAS, to ensure Wikipedia CAS registry nos. are correct. Around 3500 substances or more have now been validated against CAS Common Chemistry as having the correct name, structure & CAS RN. Other identifier fields (e.g., KEGG) have since been validated. Validated content is indicated with a check mark. Content validation
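
Before comparing against an external source, a CAS RN can be given a quick sanity check, because its last digit is a checksum. The remaining digits are weighted 1, 2, 3, … from right to left, summed, and reduced modulo 10. The sketch below shows only that format-level check. It is not the validation against CAS Common Chemistry described on the slide, and a number that passes may still belong to a different substance. The function name is an assumption.

```python
import re

def cas_checksum_ok(cas_rn):
    """Check the layout (2-7 digits, 2 digits, 1 check digit) and the checksum."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(digit) * weight
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))

print(cas_checksum_ok("7732-18-5"))  # water
print(cas_checksum_ok("58-08-2"))    # caffeine
print(cas_checksum_ok("7732-18-4"))  # last digit altered
```

For water, 7732-18-5, the weighted sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5, matching the final digit.
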
  47. 47. Every old version (called a RevID) of an article is preserved (for all) for posterity, and can potentially serve as a permanent record of a validated version. The approach to validation
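
A RevID can be turned into a permanent link through the `oldid` parameter of `index.php`. The sketch below builds such a URL. The revision number used is a placeholder, not a real validated revision, and the helper name `permalink` is an assumption for this example.

```python
from urllib.parse import urlencode

def permalink(title, rev_id, host="en.wikipedia.org"):
    """Build a link to one fixed revision of an article."""
    query = urlencode({"title": title, "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"

# 123456 is a placeholder RevID chosen for illustration.
print(permalink("Water", 123456))
```

Recording the RevID alongside a validation result means the exact text that was checked can always be retrieved, whatever happens to the live article later.
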
  48. 48. PROBLEM: This is “the encyclopedia anyone can edit” – so anyone can change the BP of water to 200 °C. SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged. System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia. Protecting validated fields
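
The idea of watching key fields can be sketched as a comparison between the values stored at validation time and the values in the current revision. This is a simplified illustration under assumed field names and data. It is not the actual bot, whose implementation is not described in the slides.

```python
def flag_changes(validated, current,
                 watched=("CASNo", "StdInChI", "StdInChIKey")):
    """Mark each watched field 'ok' if unchanged since validation, else 'X'."""
    flags = {}
    for field in watched:
        if field not in validated:
            continue  # never validated, so there is nothing to protect
        flags[field] = "ok" if current.get(field) == validated[field] else "X"
    return flags

# Hypothetical validated values and a later revision with one altered field.
validated = {"CASNo": "7732-18-5", "StdInChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}
current = {"CASNo": "7732-18-9", "StdInChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}
print(flag_changes(validated, current))
```
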
  49. 49. If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards. • This example received a red X 11 minutes after it was vandalized. Validation protected by bot
  50. 50. Validated revisionIDs
  51. 51. In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry. PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed. Checking structures
  52. 52. Since fall 2010: The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image. A few hundred images have been validated so far.
  53. 53. Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI). Drugboxes
  54. 54. Type the shortcuts shown in yellow into the Wikipedia search window • P:CHEM takes you to the Chemistry Portal • WP:CHEM and WP:CHEMISTRY – WikiProject pages are often a useful place to look for guidelines and to ask for help • WP:MOSCHEM takes you to the Chemistry Manual of Style – be sure to check this before making major edits • WP:CHEAT gives a “cheat sheet” for common edits • For general chemical information resources, Gary Wiggins has a WikiBook available at Useful sources
  55. 55. • Wikipedia can be a useful source of highly curated information on chemistry, common chemicals and drugs. • WikiProjects and the Wikipedia community play an important role in setting standards and maintaining articles. Validation will improve quality further. • Don’t forget the data page information! • The writing and the validation need to go further – YOUR help is very welcome! Conclusion
  56. 56. Thanks to Antony Williams for the invitation to present this Webinar, and also for his many contributions to Wikipedia. Thanks to Dave Martinsen for moderating this session, even while traveling! Thanks to the Wikipedia chemists who built this resource. Thank you for your attention. Acknowledgements Picture by Vistamommy Flickr, CC license
  57. 57. Thank you for your attention
  58. 58. All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license. Copyright information for images is usually attributed on the slide itself. Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab. Other pictures not attributed should only be my own personal pictures, also CC BY-SA 3.0. Copyright information