Chemicals, Chemical Identifiers and Navigating Through Databases

Antony Williams
UNC Chapel Hill, October 2010

Chemistry on the Internet
 Where do you source chemistry information?
 What can you trust online?
 How can you recognize potential issues?
 Cross-referencing and curating data

What is the Structure of Vitamin K?

MeSH
 A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

What is the Structure of Vitamin K1?

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
 Variants of systematic names on PubChem
 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E)-3,7,11,15-tetramethyl
 2-methyl-3-(3,7,11,15-tetramethyl
 2-methyl-3-[(E)-3,7,11,15-tetramethyl

Bioassay Data are Associated…

Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)

Molfiles
 10 9 0 0 1 0 0 0 0 0 1 V2000
 31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
 30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
 28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
 32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
 30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
 3 1 2 0 0 0 0
 4 1 1 0 0 0 0
 9 1 1 0 0 0 0
 7 2 1 0 0 0 0
 5 2 2 0 0 0 0
 8 2 1 0 0 0 0
 6 4 1 0 0 0 0
 4 10 1 6 0 0 0
 7 6 1 0 0 0 0
 M END

Molfiles
 Molfiles are the primary exchange format between structure drawing packages
 The file written for the same structure can differ between drawing packages
 Most commonly carry X,Y coordinates for layout
 Can support polymers, organometallics, etc.
 Can carry 3D coordinates
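
To make the layout of the connection table shown earlier concrete, here is a minimal Python sketch that reads it into atoms and bonds. It is an illustration, not a full molfile reader: the three header lines are not on the slide, so the text starts at the counts line, and the function name `parse_ctab` is hypothetical. It also splits on whitespace because the slide text has lost its column alignment; real V2000 molfiles use fixed-width columns, so whitespace splitting is an assumption that can fail on large structures.

```python
from collections import Counter

# Connection table copied from the slide (header lines are not shown there).
CTAB = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_ctab(text):
    """Split a V2000 connection table into atom and bond records."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Counts line: number of atoms, then number of bonds.
    n_atoms, n_bonds = int(rows[0][0]), int(rows[0][1])
    atoms = [
        {"x": float(f[0]), "y": float(f[1]), "z": float(f[2]), "element": f[3]}
        for f in rows[1:1 + n_atoms]
    ]
    # Bond line: first atom, second atom, bond order, stereo flag.
    bonds = [
        {"a": int(f[0]), "b": int(f[1]), "order": int(f[2]), "stereo": int(f[3])}
        for f in rows[1 + n_atoms:1 + n_atoms + n_bonds]
    ]
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)
print(Counter(atom["element"] for atom in atoms))
print([b for b in bonds if b["stereo"]])  # the single wedge/hash bond
```

Every Z value in this table is 0.0000, which matches the point above that molfiles most commonly carry X,Y layout coordinates.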

SMILES (http://en.wikipedia.org/wiki/SMILES)
 SMILES is a common line-notation format
 Can support polymers, organometallics, etc.
 Does NOT carry X, Y or Z coordinates, so depiction requires layout algorithms, which can be problematic!
 The string for the same structure generally differs between drawing packages

SMILES
 ACD/Labs
 CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
 OpenEye
 CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
 ChEMBL
 CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C
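
One way to see that the three strings differ only in how they are written is to count what they contain. The sketch below is a rough standard-library tokenizer, not a SMILES parser: the regular expressions and function names are illustrative assumptions and only cover the atom types in these examples. It counts heavy atoms and the stereo markers each package wrote.

```python
import re
from collections import Counter

SMILES = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# Bracket atoms first, then two-letter halogens, then one-letter atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
ELEMENT_IN_BRACKET = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atoms(smiles):
    """Count non-hydrogen atoms per element (aromatic 'c' counts as C)."""
    counts = Counter()
    for token in ATOM.findall(smiles):
        if token.startswith("["):
            token = ELEMENT_IN_BRACKET.match(token).group(1)
        counts[token.capitalize()] += 1
    return counts


def stereo_marks(smiles):
    """Count tetrahedral (@ or @@) and double-bond (/ or \\) markers."""
    return {
        "centres": len(re.findall(r"@+", smiles)),
        "double_bond_marks": smiles.count("/") + smiles.count("\\"),
    }


for source, smiles in SMILES.items():
    print(source, dict(heavy_atoms(smiles)), stereo_marks(smiles))
```

Identical counts do not prove identical structures; a real comparison needs a cheminformatics toolkit to canonicalize the strings. The marker counts do show that only the OpenEye string records the double-bond geometry.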

InChI
 SINGLE code base managed by IUPAC and integrated into drawing packages, so there is no variability between packages as there is with SMILES
 InChI Strings can be converted back to structures, but with the same problem as SMILES: no layout
 Well adopted by the community (databases, publishers, blogs, Wikipedia), so good for searching the internet

Tautomers – “Mobile H Perception”

Checking for Stereochemistry
Use your drawing package!

InChI Strings Hash to InChIKeys

PubChem InChIKeys
 MBWXNTAXLNYFJB-NKFFZRIASA-N
 MBWXNTAXLNYFJB-LKUDQCMESA-N
 MBWXNTAXLNYFJB-UHFFFAOYSA-N
 MBWXNTAXLNYFJB-FAKCLFGASA-N
 MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
 MBWXNTAXLNYFJB-ODDKJFTJSA-N
 MBWXNTAXLNYFJB-KSVLJPARSA-N
 MBWXNTAXLNYFJB-UDCSOKOMSA-N
 MBWXNTAXLNYFJB-JHBCSKSVSA-N
 MBWXNTAXLNYFJB-JXAKDHTRSA-N
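
All ten keys above share the same first block. The sketch below splits each key into its parts and groups by that block; in the InChIKey layout the first 14 letters hash the connectivity (the "skeleton"), and the next 8 hash the remaining layers such as stereochemistry and isotopes. The function name and dictionary fields are illustrative assumptions.

```python
from collections import defaultdict

# InChIKeys copied from the PubChem slide above.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]


def split_inchikey(key):
    """Break a 27-character InChIKey into its documented parts."""
    skeleton, rest, protonation = key.split("-")
    if (len(skeleton), len(rest), len(protonation)) != (14, 10, 1):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return {
        "skeleton": skeleton,           # hash of the connectivity layers
        "detail": rest[:8],             # hash of stereo, isotope and other layers
        "standard": rest[8] == "S",     # 'S' marks a standard InChI
        "version": rest[9],
        "protonation": protonation,
    }


by_skeleton = defaultdict(list)
for key in KEYS:
    by_skeleton[split_inchikey(key)["skeleton"]].append(key)

for skeleton, members in by_skeleton.items():
    print(skeleton, len(members))

# 'UHFFFAOY' is the second block seen when no stereo or isotope layers exist.
no_stereo = [k for k in KEYS if split_inchikey(k)["detail"] == "UHFFFAOY"]
print(no_stereo)
```

Grouping on the first block is the idea behind a skeleton search: it collects every stereo and isotope variant of one connectivity.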

InChI
 No support for polymers, organometallics
 Many option settings can lead to variability and make integration across databases difficult; the FixedH option is especially problematic
 “Slight” chance of collisions of InChIKeys
 VERY USEFUL FOR INTEGRATING THE WEB
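
How slight is "slight"? A back-of-the-envelope birthday estimate gives a feel for it. The sketch assumes the first InChIKey block carries about 65 bits of hash and that hash values are uniformly distributed; both are assumptions of this estimate, not measured collision rates, and the collection sizes are illustrative.

```python
import math


def collision_probability(n_structures, hash_bits):
    """Birthday-bound chance that at least two of n items share a hash."""
    space = 2 ** hash_bits
    pairs = n_structures * (n_structures - 1) / 2
    # 1 - exp(-pairs/space), computed stably for very small values.
    return -math.expm1(-pairs / space)


# Illustrative collection sizes, assuming a 65-bit first block.
for n in (25_000_000, 1_000_000_000):
    print(f"{n:>13,d} structures: {collision_probability(n, 65):.2e}")
```

Under these assumptions the chance of any skeleton-level collision among 25 million structures is below one in a hundred thousand, but it grows roughly with the square of the collection size.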

Vancomycin
Search Molecular SKELETON
Search Full Molecule

Full Skeleton Search: 104 Hits

Where is chemistry online?
 Encyclopedic articles (Wikipedia)
 Chemical vendor databases
 Metabolic pathway databases
 Property databases
 Patents with chemical structures
 Drug Discovery data
 Scientific publications
 Compound aggregators
 Blogs/Wikis and Open Notebook Science

Linked Data on the Web
Taken from: Rafael Sidi’s Blog

Search for a Chemical…by name

Available Information…
 Linked to vendors, safety data, toxicity, metabolism

How do we build it?
 25 million chemicals from 400 data sources
 We deal in Molfiles or SDF files, including coordinates
 We do rudimentary filtering (valence checking, charge imbalance) prior to deposition
 We have our own “business logic” to standardize structures
 We use InChI to “aggregate tautomers” to one record
 We link out to external sites where possible using their IDs
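
As an illustration of what rudimentary pre-deposition filtering might look like, here is a minimal sketch of a valence check and a charge-balance check. This is not the ChemSpider implementation: the record layout, the function names and the small valence table are all hypothetical, and a production standardizer would apply far richer rules (charged atoms, hypervalent elements, salts).

```python
from collections import defaultdict

# Assumed maximum valences for neutral atoms; deliberately tiny.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def valence_problems(atoms, bonds):
    """atoms: list of (element, charge); bonds: list of (i, j, order), 1-based."""
    used = defaultdict(int)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    problems = []
    for index, (element, charge) in enumerate(atoms, start=1):
        limit = MAX_VALENCE.get(element)
        # Only neutral atoms of known elements are checked in this sketch.
        if limit is not None and charge == 0 and used[index] > limit:
            problems.append((index, element, used[index]))
    return problems


def net_charge(atoms):
    return sum(charge for _, charge in atoms)


def passes_filters(record):
    """Accept a record only if valences look sane and charges balance."""
    return (not valence_problems(record["atoms"], record["bonds"])
            and net_charge(record["atoms"]) == 0)


# Hypothetical test records (heavy atoms only unless hydrogens are listed).
ethanol = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
           "bonds": [(1, 2, 1), (2, 3, 1)]}
five_bond_carbon = {"atoms": [("C", 0)] + [("H", 0)] * 5,
                    "bonds": [(1, k, 1) for k in range(2, 7)]}
lone_acetate = {"atoms": [("C", 0), ("C", 0), ("O", 0), ("O", -1)],
                "bonds": [(1, 2, 1), (2, 3, 2), (2, 4, 1)]}

for name, record in [("ethanol", ethanol),
                     ("five-bond carbon", five_bond_carbon),
                     ("acetate without counter-ion", lone_acetate)]:
    print(name, passes_filters(record))
```

A filter like this only flags records for attention; whether an unbalanced charge is an error or a legitimate ion still needs a person or further rules to decide.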

Inherited Errors
 We have inherited errors from every database… all public compound databases, including ours, have errors
 “Incorrect” structures – assertions, timelines etc.
 “Incorrect” names associated with structures
 Properties
 Links
 Publications
 ENORMOUS CHALLENGE

Be careful searching by Name!
 Determining the correct structure by name searching is difficult online!
 Good, not perfect:
 Wikipedia
 ChEBI/ChEMBL
 ChemIDplus
 ChemSpider
 Be VERY careful with MOST databases

Validating structures
 Check for “full stereo”; stereo descriptors are especially useful for checking!
 Check the quality of the associated data sources
 Check against reference literature when available, but remember it can be wrong too
 Question EVERYTHING!
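
The "full stereo" check can be sketched with the PubChem name variants for vitamin K1 shown earlier. The code below pulls the stereo descriptors out of each (truncated) name and flags those that do not specify everything. The expected set, one E/Z descriptor plus centres 7 and 11, is an assumption read off the most complete names on that slide, and the function names are hypothetical.

```python
import re

# Name fragments exactly as they appear (truncated) on the PubChem slide.
NAMES = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Assumption: a fully specified name has descriptors for these centres.
EXPECTED_CENTRES = {"7", "11"}


def stereo_descriptors(name):
    """Return descriptors such as ['E', '7R', '11R'] from a '[(...)-' prefix."""
    match = re.search(r"\[\(([^)]*)\)-", name)
    return match.group(1).split(",") if match else []


def has_full_stereo(name):
    descriptors = stereo_descriptors(name)
    double_bond = any(d in ("E", "Z") for d in descriptors)
    centres = {d[:-1] for d in descriptors
               if d.endswith(("R", "S")) and d[:-1].isdigit()}
    return double_bond and EXPECTED_CENTRES <= centres


for name in NAMES:
    flag = "full" if has_full_stereo(name) else "INCOMPLETE"
    print(f"{flag:<10} {stereo_descriptors(name)}")
```

Only four of the eight variants are fully specified, and those four disagree with each other about R and S, so the descriptors narrow the question without answering it.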

Online Curation
 Online databases generally do NOT allow curation or annotation
 If you find errors they stay there!
 ChemSpider is unique…immediate curation
 ChemSpider live demo following this lecture
 Searching
 Deposition and Curation
 ChemSpider SyntheticPages

Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
