Chemicals, Chemical Identifiers and
Navigating Through Databases
Antony Williams
UNC Chapel Hill, October 2010

This is a presentation given to a group of students at the UNC Eshelman School of Pharmacy.

As chemists, many of us want to source information that is high quality, accurate, and relevant to our query. With the proliferation of online chemistry resources, it is very common for us to turn to them for data. However, are resources such as Wikipedia, PubChem, and the plethora of databases delivering information for metabolism, medicinal chemistry, and synthetic chemistry trustworthy? Which of these resources, if any, should be treated as authorities? What is the most integrated approach to sourcing chemistry-related data online? What approaches can be taken to validate the available data, and how can individual scientists help to improve the content and quality of chemistry-related data on the web?

Antony Williams is ChemSpiderman. He started the ChemSpider database (www.chemspider.com) as a hobby to deliver a free platform for the community to source chemistry-related data. Within three years the system was acquired by the Royal Society of Chemistry. It now serves up close to 25 million chemical structures linked to over 400 data sources across the internet, and it offers individual scientists the opportunity to host and share their data with the community and to participate in data curation and annotation. Tony will share his experiences of building this chemistry database, with a focus on data validation, curation, and sourcing high-quality data. During the presentation he will discuss ways to check chemical structure representations before submitting them to public systems for searching, and will provide an overview of chemical identifiers such as SMILES strings and the International Chemical Identifier (InChI), which allow resources to be interlinked. Attendees can expect to leave the session with a deeper understanding of how to use the internet to source chemistry-related data.

  1. Chemicals, Chemical Identifiers and Navigating Through Databases
     • Antony Williams
     • UNC Chapel Hill, October 2010
  2. Chemistry on the Internet
     • Where do you source chemistry information?
     • What can you trust online?
     • How can you recognize potential issues?
     • Cross-referencing and curating data
  3. What is the Structure of Vitamin K?
  4. MeSH
     • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
  5. What is the Structure of Vitamin K1?
  6. Wikipedia
  7. What is the Structure of Vitamin K1?
  8. CAS’s Common Chemistry
  9. PubChem
  10. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
      • Variants of systematic names on PubChem
      • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
      • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
      • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
      • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
      • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
      • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
      • 2-methyl-3-(3,7,11,15-tetramethyl
      • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  11. Bioassay Data are Associated…
  12. Lack of Stereochemistry
  13. ChEBI – Manual Curation
  14. Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
  15. Molfiles
        10 9 0 0 1 0 0 0 0 0 1 V2000
        31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
        30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
        28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
        32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
        30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
        3 1 2 0 0 0 0
        4 1 1 0 0 0 0
        9 1 1 0 0 0 0
        7 2 1 0 0 0 0
        5 2 2 0 0 0 0
        8 2 1 0 0 0 0
        6 4 1 0 0 0 0
        4 10 1 6 0 0 0
        7 6 1 0 0 0 0
        M END
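
To make the layout of the molfile on slide 15 concrete, here is a minimal Python sketch that reads the counts line, atom block and bond block exactly as they appear on the slide. It is an illustration only, not how any particular drawing package or database reads the format. It assumes whitespace-separated fields, whereas real V2000 molfiles are fixed-width and start with three header lines that the slide omits. The function name `parse_v2000` is made up for this example.

```python
from collections import Counter

# Counts line, atom block and bond block copied from the slide.
# Assumption: fields are split on whitespace; a production reader should
# honour the fixed-width columns of the V2000 format instead.
MOLBLOCK = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_v2000(block):
    """Parse a header-less V2000 block into atoms, bonds and the chiral flag."""
    lines = block.strip().splitlines()
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    chiral = counts[4] == "1"  # fifth field of the counts line

    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order, stereo = (int(v) for v in line.split()[:4])
        bonds.append((first, second, order, stereo))

    return {"atoms": atoms, "bonds": bonds, "chiral": chiral}


mol = parse_v2000(MOLBLOCK)
heavy_atoms = Counter(symbol for symbol, *_ in mol["atoms"])
# A non-zero fourth bond field marks a wedge or hash bond (1 = up, 6 = down).
stereo_bonds = [bond for bond in mol["bonds"] if bond[3] != 0]

print(dict(heavy_atoms))
print(stereo_bonds)
```

The only stereo information in this block is the single bond with a non-zero stereo field plus the chiral flag. If a drawing package drops either of them on export, the stereochemistry is lost without any visible error.
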
  16. Molfiles
      • Molfiles are the primary exchange format between structure drawing packages
      • Can be different between different drawing packages
      • Most commonly carry X,Y coordinates for layout
      • Can support polymers, organometallics, etc.
      • Can carry 3D coordinates
  17. SMILES (http://en.wikipedia.org/wiki/SMILES)
      • SMILES is a common format
      • Can support polymers, organometallics, etc.
      • Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!
      • Generally different between drawing packages
  18. Stereo
  19. Tautomers
  20. SMILES
      • ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
      • OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
      • ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C
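
The three SMILES strings on slide 20 are textually different, which is the point of the slide. The Python sketch below shows a quick sanity check: it counts heavy atoms with a deliberately simple regular expression and reports which strings carry stereocentre (`@`) and double-bond (`/`, `\`) markers. It is not a SMILES parser, and equal atom counts do not prove that two strings encode the same structure. A real comparison needs a cheminformatics toolkit to canonicalize the strings. The helper names are invented for this example, and the line-wrap spaces in the slide text have been removed.

```python
import re
from collections import Counter

# SMILES strings from the slide, with line-wrap spaces removed.
SMILES = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# Simplified atom matcher: a bracket atom such as [C@@H], or an atom from
# the organic subset (aliphatic or aromatic). Isotopes, charges and
# two-letter aromatic symbols are out of scope for this sketch.
ATOM = re.compile(
    r"\[(?P<bracket>[A-Z][a-z]?|[a-z])[^\]]*\]"
    r"|(?P<plain>Cl|Br|[BCNOPSFI]|[bcnops])"
)


def heavy_atom_counts(smiles):
    """Count atoms written in the string (hydrogens inside brackets are skipped)."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group("bracket") or match.group("plain")
        counts[symbol.capitalize()] += 1
    return counts


def stereo_markers(smiles):
    """Return (number of tetrahedral marks, whether E/Z bond marks are present)."""
    centres = len(re.findall(r"@+", smiles))
    has_double_bond_marks = "/" in smiles or "\\" in smiles
    return centres, has_double_bond_marks


for vendor, smiles in SMILES.items():
    print(vendor, dict(heavy_atom_counts(smiles)), stereo_markers(smiles))
```

All three strings describe the same 31 carbons and 2 oxygens and mark two stereocentres, but only the OpenEye string carries double-bond geometry. A plain text comparison between databases would treat them as three different compounds.
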
  21. The InChI Identifier
  22. InChI
      • SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES
      • InChI Strings can be reversed to structures – same problem as with SMILES – no layout
      • Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet
  23. Multiple Layers
  24. Tautomers – “Mobile H Perception”
  25. Double Bond Orientation
  26. Stereo
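
The layered structure described on slides 23 to 26 can be inspected with plain string handling. The sketch below splits a Standard InChI into its layers and reports whether a tetrahedral stereo layer (`/t`) is present. The two InChI strings are for alanine, with and without stereo. They are an illustrative choice and do not come from the talk. The function names are invented, and the code assumes a Standard InChI in which each layer prefix occurs at most once.

```python
# Illustrative inputs (not from the slides): L-alanine and alanine drawn
# without stereochemistry.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE_FLAT = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"


def split_inchi(inchi):
    """Split a Standard InChI into a dict keyed by layer prefix.

    Assumes the second segment is the formula and every later segment
    starts with a one-letter prefix (c, h, q, p, b, t, m, s, i).
    """
    prefix, _, rest = inchi.partition("/")
    if not prefix.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = rest.split("/")
    layers = {"version": prefix[len("InChI="):], "formula": parts[0]}
    for part in parts[1:]:
        layers[part[0]] = part[1:]
    return layers


def stereo_status(inchi):
    """Describe the tetrahedral stereo layer of an InChI string."""
    centres = split_inchi(inchi).get("t")
    if centres is None:
        return "no /t layer"
    if "?" in centres:
        return "/t layer with undefined centres"
    return "/t layer, all listed centres defined"


for name, inchi in [("L-alanine", L_ALANINE), ("alanine, no stereo", ALANINE_FLAT)]:
    print(name, split_inchi(inchi)["formula"], stereo_status(inchi))
```

A missing `/t` layer can mean either that the molecule has no stereocentres or that none were drawn, so this check flags records for review rather than proving them wrong.
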
  27. Checking for Stereochemistry
  28. Checking for Stereochemistry
      • Use your drawing package!
  29. Checking for Stereochemistry
  30. Checking for Stereochemistry
  31. Checking for Stereochemistry
  32. InChI Strings Hash to InChIKeys
  33. PubChem InChIKeys
      • MBWXNTAXLNYFJB-NKFFZRIASA-N
      • MBWXNTAXLNYFJB-LKUDQCMESA-N
      • MBWXNTAXLNYFJB-UHFFFAOYSA-N
      • MBWXNTAXLNYFJB-FAKCLFGASA-N
      • MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
      • MBWXNTAXLNYFJB-ODDKJFTJSA-N
      • MBWXNTAXLNYFJB-KSVLJPARSA-N
      • MBWXNTAXLNYFJB-UDCSOKOMSA-N
      • MBWXNTAXLNYFJB-JHBCSKSVSA-N
      • MBWXNTAXLNYFJB-JXAKDHTRSA-N
  34. PubChem InChIKeys
      • MBWXNTAXLNYFJB-NKFFZRIASA-N
      • MBWXNTAXLNYFJB-LKUDQCMESA-N
      • MBWXNTAXLNYFJB-UHFFFAOYSA-N
      • MBWXNTAXLNYFJB-FAKCLFGASA-N
      • MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
      • MBWXNTAXLNYFJB-ODDKJFTJSA-N
      • MBWXNTAXLNYFJB-KSVLJPARSA-N
      • MBWXNTAXLNYFJB-UDCSOKOMSA-N
      • MBWXNTAXLNYFJB-JHBCSKSVSA-N
      • MBWXNTAXLNYFJB-JXAKDHTRSA-N
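
The ten PubChem InChIKeys on slides 33 and 34 share their first block and differ only in the second. A short Python sketch makes that visible by splitting each key into its parts and grouping the keys by first block. The first block hashes the molecular skeleton (formula and connectivity). The second block hashes the remaining layers such as stereochemistry and isotopes, followed by a standard/non-standard flag and a version letter. The final letter encodes protonation. The function and variable names are invented for this example.

```python
import re
from collections import defaultdict

# Keys copied from the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

# 14 letters, then 8 letters + standard flag + version letter, then 1 letter.
INCHIKEY = re.compile(
    r"^(?P<skeleton>[A-Z]{14})-(?P<detail>[A-Z]{8})"
    r"(?P<flag>[SN])(?P<version>[A-Z])-(?P<protonation>[A-Z])$"
)

# Second-block hash of a standard key with no stereo or isotope layers.
NO_STEREO_DETAIL = "UHFFFAOY"


def parse_inchikey(key):
    """Split an InChIKey into its named parts, or raise ValueError."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    return match.groupdict()


by_skeleton = defaultdict(list)
for key in KEYS:
    by_skeleton[parse_inchikey(key)["skeleton"]].append(key)

flat_records = [k for k in KEYS if parse_inchikey(k)["detail"] == NO_STEREO_DETAIL]

print({skeleton: len(keys) for skeleton, keys in by_skeleton.items()})
print(flat_records)
```

Matching on the first block alone finds every stereo and isotope variant of a skeleton, while matching the whole key finds one exact form. That is the distinction between the skeleton and full molecule searches shown for vancomycin later in the talk.
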
  35. Databases and Standardization
  36. Databases and Standardization
  37. InChI
      • No support for polymers, organometallics
      • Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic
      • “Slight” chance of collisions of InChIKeys
      • VERY USEFUL FOR INTEGRATING THE WEB
  38. Vancomycin
  39. Vancomycin
      • Search Molecular SKELETON
      • Search Full Molecule
  40. Full Skeleton Search: 104 Hits
  41. Full Molecule Search: 4 Hits
  42. Where is chemistry online?
      • Encyclopedic articles (Wikipedia)
      • Chemical vendor databases
      • Metabolic pathway databases
      • Property databases
      • Patents with chemical structures
      • Drug Discovery data
      • Scientific publications
      • Compound aggregators
      • Blogs/Wikis and Open Notebook Science
  43. Linked Data on the Web
      • Taken from: Rafael Sidis’ Blog
  44. www.chemspider.com
  45. Search for a Chemical…by name
  46. Available Information…
      • Linked to vendors, safety data, toxicity, metabolism
  47. How do we build it?
      • 25 million chemicals from 400 data sources
      • We deal in Molfiles or SDF files – including coordinates
      • We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
      • We have our own “business logic” to standardize
      • We use InChI to “aggregate tautomers” to one record
      • We link out to external sites where possible using their IDs
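
As a rough illustration of the last two bullets on slide 47, the Python sketch below merges deposits from several sources into one record per Standard InChI and keeps each source's own ID for linking out. It is not ChemSpider's actual pipeline. The source names and IDs are hypothetical, the InChI strings (alanine with and without stereo) are illustrative, and the InChI is assumed to have been generated upstream from each deposited molfile. Because InChI's mobile-H handling generally gives tautomers drawn differently the same string, keying on it collapses them into one record.

```python
from collections import defaultdict

L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE_FLAT = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

# Hypothetical deposits: (source name, the source's own ID, Standard InChI).
DEPOSITS = [
    ("SourceA", "A-001", L_ALANINE),
    ("SourceB", "B-17", L_ALANINE),
    ("SourceC", "C-9", ALANINE_FLAT),
]


def aggregate(deposits):
    """Group deposits into one record per InChI, keeping source link-outs."""
    records = defaultdict(list)
    for source, external_id, inchi in deposits:
        records[inchi].append((source, external_id))
    return dict(records)


records = aggregate(DEPOSITS)
for inchi, links in records.items():
    print(len(links), "source(s):", links)
```

The two deposits with full stereo merge into one record, while the deposit drawn without stereo stays separate. This is one way a single compound name can end up attached to several structures in an aggregated database.
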
  48. Inherited Errors
      • We have inherited errors from every database… all public compound databases, including ours, have errors
      • “Incorrect” structures – assertions, timelines etc
      • “Incorrect” names associated with structures
      • Properties
      • Links
      • Publications
      • ENORMOUS CHALLENGE
  49. Compounds and Identifiers
  50. Be careful searching by Name!
      • Determining the correct structure by name searching is difficult online! Good, not perfect
      • Wikipedia
      • ChEBI/ChEMBL
      • ChemIDPlus
      • ChemSpider
      • Be VERY careful with MOST databases
  51. Validating structures
      • Check for “full stereo” and use stereo descriptors especially for checking!
      • Check for quality of associated data sources
      • Check against reference literature when available – but it can be wrong
      • Question EVERYTHING!
  52. Online Curation
      • Online databases generally do NOT allow curation or annotation
      • If you find errors they stay there!
      • ChemSpider is unique…immediate curation
      • ChemSpider live demo following this lecture
      • Searching
      • Deposition and Curation
      • ChemSpider SyntheticPages
  53. Thank you
      • Email: williamsa@rsc.org
      • Twitter: ChemConnector
      • Blog: www.chemspider.com/blog
      • Personal Blog: www.chemconnector.com
      • SLIDES: www.slideshare.net/AntonyWilliams
