1. EBI is an Outstation of the European Molecular Biology Laboratory.
Small Molecules in Bioinformatics
EBI Bioinformatics Roadshow
Dr. Louisa Bellis, ChEMBL
Copenhagen, June 2011
2. Small molecule resources at the EBI
18.03.2024
2
Agenda
• Introduction
• Small molecule resources
• ChEMBL
• ChEBI
• Searching and browsing
• Hands-on Exercises
4. What are Small Molecules?
• A small molecule is defined as a low molecular weight
organic compound.
• Most drugs are small molecules because their size allows passage across
cell membranes and supports oral bioavailability.
• They are also able to bind to proteins and enzymes,
thereby altering function, which can lead to a therapeutic
effect.
6. Metabolism
Adenosine 5’-triphosphate (ATP): the
"molecular unit of currency" of intracellular
energy transfer.
• generated in the cell by energy-consuming processes, broken down by
energy-releasing processes
• many proteins that bind ATP do so in a characteristic protein fold known as the
Rossmann fold, which is a general nucleotide-binding structural domain that
can also bind the cofactor NAD
Adenosine 5'-triphosphate
7. Enzymes
• Enzyme inhibitors are molecules that bind to enzymes and
decrease their activity.
• Many drugs are enzyme inhibitors. They are also used as herbicides
and pesticides.
• Enzyme activators bind to enzymes and increase their enzymatic
activity.
• Enzyme activators are often involved in the allosteric regulation of
enzymes in the control of metabolism.
Clavulanic acid acts as a suicide inhibitor of bacterial β-lactamase enzymes.
9. Drug types 2003 - 2009
'Small molecules' in various shades of blue (http://chembl.blogspot.com/)
10. Small Molecule Databases
• Small Molecule Databases can be used to:
• Investigate historical compounds and associated bioactivity data.
• Gain fresh insight into previously rejected drugs.
• Create Structure-Activity Relationships (SARs)
• Look at how changing a functional group can change the
biological activity of a compound – before you start your own
synthesis.
11. • Direct synthesis
• Could reduce number of compounds made – if any similar
compounds have significant toxicity or unfavourable binding data,
you can save time by not making analogues.
• Direct end product testing
• Suggest what testing could be carried out – the database can
give you an idea of what testing has given ‘good’ (i.e. clear)
results.
• Reduce number of compounds put through High Throughput
Screening (HTS).
13. What is ChEBI?
• Chemical Entities of Biological Interest
• Freely available
• Focused on ‘small’ chemical entities (no proteins or
nucleic acids)
• Illustrated dictionary of chemical nomenclature
• High quality, manually annotated
• Provides chemical ontology
Access ChEBI at http://www.ebi.ac.uk/chebi/
16. ChEBI – Chemical Entities of Biological Interest
ChEBI entry view
17. Chemical Structures
• Chemical structure may be interactively explored using the MarvinView applet
• Available formats:
• Image
• Molfile
• InChI and InChIKey
• SMILES
18. ChEBI – Chemical Entities of Biological Interest
Automatic Cross-references
19. What is ChEMBL?
• Database of bioactive, drug-like small molecules.
• Contains 2D structures and calculated properties (logP, molecular
weight, Lipinski rule-of-five parameters, etc.)
• Contains abstracted bioactivity data, e.g. binding data
and IC50, from multiple primary scientific journals
• Covers about 30 years of compound synthesis and
testing
• Annotated FDA-approved drugs
Access ChEMBL at https://www.ebi.ac.uk/chembldb/
27. ChemSpider Links:
The links work both ways: TO ChemSpider and FROM ChemSpider.
They are matched on Standard_Inchi.
28. Wikipedia Links:
We also have links with Wikipedia. These also use the Standard_Inchi as the
common identifier and point to the Compound Report Card in ChEMBL.
The links are added by a ChemoBot and can be updated with each release, if required.
30. Stereoisomers
• Compounds that have the same molecular formula and connectivity
(the same atoms bonded in the same sequence), but differ in the
3-dimensional orientation of their atoms.
• A tetrahedral carbon with 4 different molecular groups/atoms
attached is known as a chiral centre.
31. Stereoisomerism Example - Thalidomide
• Caused thousands of deformities in babies across 46
countries between 1957 and 1961.
• The R isomer was effective in controlling morning sickness, but the S
isomer was teratogenic.
• Sparked more tightly controlled laboratory practices
across the world.
32. Stereoisomers
• Where known, the stereochemistry of the compound is
noted in the structure and in the name.
• If a stereoisomer of an existing compound is submitted, it
is given a separate id number.
• If data are submitted for a mixture of two stereoisomers, the mixture
is also given a separate id number if the activity of the individual
compounds cannot be isolated.
• If you draw a flat structure (no stereochemistry specified) in the
structure search, you will receive data on all stereoisomers.
33. Ofloxacin, Levofloxacin and Dextrofloxacin
• Fluoroquinolone antibiotics
• Ofloxacin is a racemic (equal) mixture of Levo and Dextro
isomers.
• Levofloxacin is the more active stereoisomer
• Dextrofloxacin is the less active stereoisomer
• ChEMBL has data on each with separate bioactivities.
34. Tautomers (keto-enol form)
• Two forms readily interconvert via the migration of a
hydrogen to the adjacent oxygen and the swapping of a
single to a double bond, and vice versa.
• ChEMBL does not differentiate between different
tautomers.
• The preferred tautomeric structure is retained.
• ChEBI does differentiate and will store the separate
tautomers.
35. Salts
• About 50% of marketed drugs are formulated as salts to
aid their activity.
• Some salts prevent the drug from being absorbed in the mouth.
• Some salts help the drug be activated in the intestines, rather
than the stomach.
• There are approx 53,450 ChEMBL compounds with salts.
• Bioactivity data is recorded against the parent drug and
against the salt.
• Therefore, it’s important to give these compounds
different ChEMBL ids.
36. Salt Example: Morphine
• Morphine can be administered with many different salts:
• Hydrochloride (HCl)
• Sulphate (SO4)
• Tartrate
• Acetate
• Citrate
• Methobromide (MeBr)
• Hydrobromide (HBr)
• Hydroiodide (HI)
• Lactate
• Chloride (Cl)
• Bitartrate
37. Dealing with Salts in ChEMBL
• Each compound in a salt form is analysed and matched to a
‘parent’, i.e. the base form of the compound. (This does not
apply to inorganic compounds.)
• For example, morphine hydrochloride (CHEMBL556578),
morphine sulfate (CHEMBL422878) and morphine sulfate
hydrate (CHEMBL1200603) are matched to their parent
morphine (CHEMBL70)
• This relationship is shown on the interface of the
compound page.
• Additionally, when you run a search for a compound, only the
parent form is returned in the results grid.
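The salt-to-parent matching described above can be sketched as a simple lookup. This is a minimal illustration only, not how ChEMBL implements it: the ChEMBL ids are the morphine examples from this slide, while the helper names (`parent_id`, `group_by_parent`) and the activity values are hypothetical.

```python
from collections import defaultdict

# Salt-to-parent links taken from the morphine example on this slide.
PARENT_OF = {
    "CHEMBL556578": "CHEMBL70",   # morphine hydrochloride
    "CHEMBL422878": "CHEMBL70",   # morphine sulfate
    "CHEMBL1200603": "CHEMBL70",  # morphine sulfate hydrate
}


def parent_id(chembl_id):
    """Return the parent id for a salt, or the id itself if it has no parent entry."""
    return PARENT_OF.get(chembl_id, chembl_id)


def group_by_parent(records):
    """Group (chembl_id, value) records under their parent compound."""
    grouped = defaultdict(list)
    for chembl_id, value in records:
        grouped[parent_id(chembl_id)].append((chembl_id, value))
    return dict(grouped)


# Hypothetical activity values, for illustration only.
records = [("CHEMBL70", 1.0), ("CHEMBL556578", 2.0), ("CHEMBL422878", 3.0)]
grouped = group_by_parent(records)
print(sorted(grouped))
```

Grouping under the parent mirrors the "all forms" view, while the original salt id kept in each tuple still allows salt-specific data to be pulled out.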
38. Parents and Salts on the Compound Page
(Screenshot labels: PARENT (compound report page); SALTS (with hyperlinks beneath))
39. • Clicking on the Bioactivity Summary pie chart will give
you the bioactivity data for ALL forms of the compound
• To get salt specific bioactivity data, click on the hyperlink
beneath the salt form of interest to be taken to its
compound page.
(Screenshot panels: Morphine, all data; Morphine HCl, specific data)
41. Chemical names
Common or trivial names are those in widespread use.
Advantages of common names include simplicity, pronounceability and
universal recognition.
The main disadvantage is ambiguity – the same common
name may refer to more than one type of chemical.
42. Systematic names
A systematic name is one which corresponds to the chemical
structure such that the structure can be determined from the
name, e.g. 1,2-dimethyl-naphthalene
Software packages exist which can generate structures from
the systematic names (e.g. ACD/Name, ChemOffice,
MarvinSketch).
More than one correct systematic name can be assigned to the
same molecular structure, depending on the manner in which
naming rules are applied.
43. Examples of common and systematic names
Common names: caffeine, guaranine, theine
Systematic names:
• 1,3,7-trimethyl-3,7-dihydro-1H-purine-2,6-dione
• 7-methyltheophylline
• 1,3,7-trimethyl-2,6-dioxopurine
45. Why?
• Ontological data
• Structure classification
• Chemical entity, e.g. hydrocarbon
• Role, e.g. ligand
• Subatomic particle, e.g. electron
• Links to other databases
• KEGG
• DrugBank
• PDBeChem
• Citations
47. The ChEBI ontology
Organised into three sub-ontologies, namely
• Molecular structure ontology
• Subatomic particle ontology
• Role ontology
(R)-adrenaline
50. ChEBI – Chemical Entities of Biological Interest
ChEBI ontology relationships
• Generic ontology relationships
• Chemistry-specific relationships
51. ChEBI – Chemical Entities of Biological Interest
Viewing ChEBI ontology
52. Simple and advanced text search
(Screenshot labels: narrow by category; combine terms with AND, OR and BUT NOT)
53. Structure search
(Screenshot labels: search options; structure drawing tools)
54. Search Results
(Screenshot labels: click to go to compound page; hover over for search menu)
55. Types of structure search
• Identity – based on InChI
• Substructure – uses fingerprints to narrow search range, then
performs full substructure search algorithm
• Similarity – based on Tanimoto coefficient calculated between the
fingerprints
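Identity example (InChI for water): InChI=1/H2O/h1H2
Similarity example, with two fingerprints:
a = 1010110111 (7 bits set)
b = 0010110010 (4 bits set)
c = 4 (bits set in both a and b)
Tanimoto(a,b) = c / (a+b-c) = 4 / (7+4-4) = 0.57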
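The Tanimoto calculation on this slide can be reproduced in a few lines. This is a minimal sketch that treats fingerprints as strings of '0' and '1' characters; real search engines use packed bit vectors. The function name `tanimoto` is an assumption made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")  # bits set in the first fingerprint
    b = fp_b.count("1")  # bits set in the second fingerprint
    # c counts positions where both fingerprints have a bit set
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    if a + b - c == 0:
        return 0.0  # both fingerprints empty; returning 0 is a convention chosen here
    return c / (a + b - c)


# The two fingerprints shown on the slide.
score = tanimoto("1010110111", "0010110010")
print(round(score, 2))  # 0.57
```

A similarity search then keeps only the compounds whose score against the query reaches the chosen cut-off.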
59. ChEBI example
• Search for ‘Glycine’
• What is the ChEBI ID for this?
• Is it available as a KEGG compound?
• What are the IUPAC names?
• What is ‘glycine zwitterion’?
Answers:
• CHEBI:15428
• Yes
• Glycine, aminoacetic acid
• It is a tautomer of glycine
61. How to search in ChEMBL:
• Keywords
• Compound name – dopamine, haloperidol
• Assay name – cytotoxicity, liver hepatotoxicity
• Target – RAF-1, IRAK-4
• Structure
• BLAST search – FASTA sequence from UniProt
• Protein or taxonomy hierarchy
63. Using the search field (found on main page):
• Best for single words
• E.g. ‘dopamine’, ‘Muscarinic’
• Looks for matching text in compound name, key or
synonym
• 3-o-methyl-alpha-methyldopamine
• Muscarinic receptor 4
• Needs an exact match
• Can’t use wildcards, e.g. ‘%’, ‘?’…
64. Using the Protein Sequence Search
• Useful for searching for a specific protein or a protein
from the same family
• The results show the percentage similarity of each hit to the input
sequence.
• An exact match will give 100%.
• The same target from a different organism will typically give about 90%.
65. Compound Drawing
• Can draw the full structure of
interest or a partial structure
• Using the Substructure
Search you can find
compounds containing your
partial structure
• Using the Similarity Search,
you can find similar
compounds – based on a
percentage score (70-100%)
73. Drug design
• Ligand-based: relies on knowledge of other molecules that bind to the
biological target of interest.
• Structure-based: relies on knowledge of the 3D structure of the biological
target.
• A suitable target has
• evidence that modulation of the target will have therapeutic value: e.g. disease
linkage studies showing associations between mutations in the biological target
and certain disease states.
• evidence that the target is druggable, i.e. capable of binding to a small molecule
and that its activity can be modulated by the small molecule.
• Target is cloned and expressed, then libraries of potential drug compounds
are screened using screening assays
74. Drug Discovery Process
(Figure: the drug discovery pipeline, annotated with ChEMBL database content.)
• Counts along the pipeline: > 2,900,000 bioactivities; > 600,000 compounds; ~30,000 distinct lead series; ~12,000 candidates; ~2000 drugs
• Target Discovery: target identification, microarray profiling, target validation, assay development, biochemistry, clinical/animal disease models
• Lead Discovery: High-throughput Screening (HTS), fragment-based screening, focused libraries, screening collection
• Lead Optimisation: medicinal chemistry, structure-based drug design, selectivity screens, ADMET screens, cellular/animal disease models, pharmacokinetics
• Preclinical Development: toxicology, in vivo safety pharmacology, formulation, dose prediction
• Phase 1: PK, tolerability
• Phase 2: efficacy
• Phase 3: safety and efficacy
• Launch: indication discovery and expansion
• Bands beneath the pipeline: Med. Chem. SAR, Clinical Candidates, Drugs; Discovery, Development, Use; Clinical Trials; ChEMBL database
76. Current Data Content (ChEMBL_10)
• Abstracted from 40,623 papers from 27 journals
• Ongoing curation and clean-up of all data
• 785,746 compound records
• 639,570 distinct compound structures
• 8,371 targets
• 5,190 protein molecular targets
• Over 3,200,000 experimental bioactivities
• binding measurements, functional assays and ADMET
77. ChEMBL Assay Data
• ChEMBL contains >3 million data points relating compounds to
targets or effects.
• These activities come from ~500K assays reported in
medicinal chemistry literature.
• Assays can be classified as:
• functional assay endpoints
e.g., Vasodilation
• binding measurements
e.g., IC50
• ADME/toxicity data
e.g., LD50
(Chart: Functional 55, Binding 29, ADMET 16)
78. Compound Properties and Selectivity
• Stores a wide range of calculated compound properties
(e.g., mol wt, logP, RO5 violations)
• Can be used to identify compounds most likely to have good in
vivo properties (Absorption, Distribution, Metabolism, Excretion)
• Contains activity information against liability targets (e.g.,
cytochrome P450s, hERG K+ channel)
• If compounds have been tested in these assays, can avoid those
with potential toxicity issues
• Contains data on a wide range of targets
• If compounds have been tested against multiple targets, can get
an idea of their selectivity (important for validation studies)
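The "RO5 violations" property mentioned above refers to Lipinski's rule of five. The sketch below counts violations using the commonly quoted cut-offs (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors). It is an illustration and not ChEMBL's own calculation; the function name and the example property values are hypothetical.

```python
def ro5_violations(mol_wt, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four rule-of-five criteria a compound breaks."""
    checks = [
        mol_wt > 500,      # molecular weight above 500 Da
        logp > 5,          # calculated logP above 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical property values, for illustration only.
print(ro5_violations(mol_wt=300.0, logp=2.0, h_donors=2, h_acceptors=4))  # 0
print(ro5_violations(mol_wt=650.0, logp=6.1, h_donors=2, h_acceptors=4))  # 2
```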
79. Why Use SARs?
• A chemical structure determines its physical and
biological characteristics.
• Small changes to the structure can have a large impact
on activities.
• Understanding what changes have the greatest/least
effect can aid in drug design.
• Using the many available databases that contain this
information reduces the time and money spent on synthesis of
potential drug compounds.
80. Example:
1. You are interested in creating a compound to target
IRAK4 and the compound must have an aniline core
structure.
2. Run a search for IRAK4 and download all of the
compounds as an SDFile and all of the IC50 data as a
text file.
3. Combine the compounds and data into one SDFile.
4. Analyse the SAR data with an external program.
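Step 3 above (combining the compounds and the IC50 data into one SDFile) could be sketched as follows, using only the Python standard library. Several things here are assumptions made for illustration: that the first line of each SDF record holds the compound id, that the text file is tab-delimited with columns named `compound_id` and `ic50_nM`, and the function name itself. The actual ChEMBL download layout may differ.

```python
import csv
import io


def add_ic50_to_sdf(sdf_text, activity_text):
    """Append an IC50 data item to each SDF record that has a matching id."""
    # Assumed layout: tab-delimited text with 'compound_id' and 'ic50_nM' columns.
    reader = csv.DictReader(io.StringIO(activity_text), delimiter="\t")
    ic50_by_id = {row["compound_id"]: row["ic50_nM"] for row in reader}

    out, record = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":  # end-of-record marker in an SDFile
            # Assumption: the record's first (title) line is the compound id.
            compound_id = record[0].strip() if record else ""
            if compound_id in ic50_by_id:
                # An SDF data item: header line, value line, blank line.
                record += ["> <IC50_nM>", ic50_by_id[compound_id], ""]
            out += record + ["$$$$"]
            record = []
        else:
            record.append(line)
    return "\n".join(out) + "\n"


# Two placeholder records with empty structures and hypothetical ids.
counts_line = "  0  0  0  0  0  0  0  0  0  0999 V2000"
sdf = "\n".join([
    "CPD_1", "", "", counts_line, "M  END", "$$$$",
    "CPD_2", "", "", counts_line, "M  END", "$$$$",
]) + "\n"
activities = "compound_id\tic50_nM\nCPD_1\t45\n"
merged = add_ic50_to_sdf(sdf, activities)
print(merged)
```

The merged file can then be opened in whichever external program is used for the SAR analysis in step 4.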
81. • There are over 3,000,000 data points in ChEMBL
• Difficult to manually look through them all
• Pipeline Pilot ™ is used in the ChEMBL team to visualise
mass amounts of data.
• SAR grids can be created using downloaded structures
and associated bioactivity data.
86. Simple SAR
• A compound with an IC50 < 100nM for a target is
considered to be ‘good’.
• Search for IRAK4 and filter for IC50 < 100nM.
• Download the filtered bioactivities as an XLS spreadsheet
(26 bioactivities).
• Extract the list of ChEMBL_IDs from the spreadsheet and
paste them into the search box (24 ids).
• Run same search and filter on the bioactivities of IC50 >
100nM (96 bioactivities).
• Download the bioactivity data and extract the list of
ChEMBL_IDs (7 ids).
• These 7 compounds are ‘potentially’ selective for IRAK4: potent
against IRAK4 but weaker (IC50 > 100nM) in other assays.
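The filtering steps above can be sketched with plain Python sets. This is an illustration under stated assumptions, not the ChEMBL interface: the record layout (`compound`, `target`, `ic50_nm`), the function names and the example data are all hypothetical, and the second filter is restricted to targets other than the one of interest, which the slide does not state explicitly.

```python
def potent_ids(activities, target, cutoff_nm=100.0):
    """Ids of compounds with an IC50 below the cut-off against the target."""
    return {
        a["compound"]
        for a in activities
        if a["target"] == target and a["ic50_nm"] < cutoff_nm
    }


def potentially_selective(activities, target, cutoff_nm=100.0):
    """Potent compounds that also show an IC50 above the cut-off on another target."""
    potent = potent_ids(activities, target, cutoff_nm)
    return {
        a["compound"]
        for a in activities
        if a["compound"] in potent
        and a["target"] != target
        and a["ic50_nm"] > cutoff_nm
    }


# Hypothetical bioactivity records, for illustration only.
activities = [
    {"compound": "CPD_A", "target": "IRAK4", "ic50_nm": 20.0},
    {"compound": "CPD_A", "target": "OTHER_KINASE", "ic50_nm": 5000.0},
    {"compound": "CPD_B", "target": "IRAK4", "ic50_nm": 50.0},
    {"compound": "CPD_B", "target": "OTHER_KINASE", "ic50_nm": 30.0},
    {"compound": "CPD_C", "target": "IRAK4", "ic50_nm": 900.0},
]
print(sorted(potent_ids(activities, "IRAK4")))
print(sorted(potentially_selective(activities, "IRAK4")))
```

A compound that has only ever been tested against the target of interest is not returned by the second step, which is one reason the result is only ‘potentially’ selective.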
90. Downloading ChEBI flavours
• All downloads come in two flavours
• 3 star only entries (manually annotated ChEBI
entries)
• 2 and 3 star entries (manually annotated ChEBI,
ChEMBL and user submissions)
91. Downloading ChEBI
• OBO file
• Use with OBO-Edit
• SDF File
• Compatible with chemistry software such as Bioclipse
• Flat file, tab delimited
• Import all the data into Excel
• Parse it into your own database structure
• Oracle binary dumps
• Import into an Oracle database
• Generic SQL insert statements
• Import into a MySQL or PostgreSQL database
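Parsing a tab-delimited flat file into your own database structure might look like the sketch below, which uses Python's standard `csv` and `sqlite3` modules. The column names (`ID`, `NAME`), the table layout and the function name are assumptions made for illustration; check the header of the actual ChEBI download before relying on them. SQLite stands in here for whichever database you use.

```python
import csv
import io
import sqlite3


def load_flat_file(handle, conn):
    """Load tab-delimited rows (assumed columns: ID, NAME) into an SQLite table."""
    conn.execute("CREATE TABLE compounds (id TEXT PRIMARY KEY, name TEXT)")
    rows = csv.DictReader(handle, delimiter="\t")
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?)",
        [(row["ID"], row["NAME"]) for row in rows],
    )
    conn.commit()


# A two-line stand-in for the downloaded file; the column names are hypothetical.
sample = "ID\tNAME\n15428\tglycine\n"
conn = sqlite3.connect(":memory:")
load_flat_file(io.StringIO(sample), conn)
name = conn.execute(
    "SELECT name FROM compounds WHERE id = ?", ("15428",)
).fetchone()[0]
print(name)  # glycine
```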
92. The ChEBI web service
• Programmatic access to a ChEBI entry
• SOAP based Java implementation
• Clients currently available in Java and Perl
• Methods
• getLiteEntity
• getCompleteEntity and getCompleteEntityByList
• getOntologyParents
• getOntologyChildren and getAllOntologyChildrenInPath
• getStructureSearch
• Documented at
http://www.ebi.ac.uk/chebi/webServices.do.
93. Downloading ChEMBL
• Frequent releases (approx monthly)
• SDFile
• Text
• MySQL
• Oracle
95. Help and Feedback
• Email addresses for support queries and feedback
• General questions and feedback on ChEMBL interface:
chembl-help@ebi.ac.uk
• Reporting of data errors:
chembl-data@ebi.ac.uk
• General questions, support and feedback on ChEBI:
chebi-help@ebi.ac.uk