Open-source tools for querying and organizing large reaction databases

Gregory Landrum
NIBR Informatics
Novartis Institutes for BioMedical Research
UK QSAR 2014

Outline
§ Public data sources and reactions
§ Handling reactions with the RDKit
§ Fingerprints for reactions
§ Validation:
•  Machine learning
•  Clustering
§ Application: Identifying interesting clusters of reactions

Public data sources in cheminformatics
an aside at the beginning

Protein data bank
the exception
•  Crystal structures of proteins
•  Deposition is mandatory for publishing protein crystal structures

PubChem
Collection of molecules from vendors and patents together with
some assay data, primarily from NIH-funded screening centers.
[Figure: evolution of the number of compounds and (non-ChEMBL) assays]

ChEMBL
Collection of molecules and assay data curated (primarily) from the
literature
[Figure: evolution of the number of compounds and activities; axis label: 2009]

What about how we made those molecules?
Public reaction data?
§ The literature:
§ Plenty of data locked up in large commercial databases,
very very little in the open
Yan, L. et al. SAR studies of 3-arylpropionic acids as potent and selective agonists of sphingosine-1-phosphate receptor-1 (S1P1) with enhanced pharmacokinetic properties. Bioorganic & Medicinal Chemistry Letters 17, 828–831 (2007).

An emerging area: chemical reactions
Not just what we made, but how we made it
§  Text-mining applied to open patent data to extract chemical reactions:
1.12 million reactions [1]
§  Reactions classified using namerxn, when possible, into 318 standard
types: >599000 classified reactions [2]
[1] Lowe, D. M. “Extraction of chemical structures and reactions from the literature.” PhD thesis, University of Cambridge, Cambridge, UK (2012).
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software):
http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
Lots of reactions,
lots of repeats

More about the classes
Frequency of classes, revisited: the 20 most common classes
Count   Class   Name
44675   2.1.2   Carboxylic acid + amine reaction
39297   1.7.9   Williamson ether synthesis
28194   2.1.1   Amide Schotten-Baumann
26739   1.3.7   Chloro N-arylation
22400   1.6.2   Bromo N-alkylation
20465   7.1.1   Nitro to amino
20405   1.6.4   Chloro N-alkylation
17226   6.2.2   CO2H-Me deprotection
16602   6.1.1   N-Boc deprotection
16021   6.2.1   CO2H-Et deprotection
12952   1.2.1   Aldehyde reductive amination
12250   2.2.3   Sulfonamide Schotten-Baumann
10659   11.9    Separation
 8538   3.1.5   Bromo Suzuki-type coupling
 7261   1.7.7   Mitsunobu aryl ether synthesis
 7102   6.3.7   Methoxy to hydroxy
 7071   3.3.1   Sonogashira coupling
 6472   3.1.1   Bromo Suzuki coupling
 6383   1.8.5   Thioether synthesis
 5791   9.1.6   Hydroxy to chloro

RDKit: What is it?
§  Open-source C++ toolkit for cheminformatics
§  Wrappers for Python (2.x), Java, C#
§  Functionality:
•  2D and 3D molecular operations
•  Descriptor generation for machine learning
•  PostgreSQL database cartridge for substructure and similarity searching
•  Knime nodes
•  IPython integration
•  Lucene integration (experimental)
•  Supports Mac/Windows/Linux
§  Releases every 6 months
§  business-friendly BSD license
§  Code: https://github.com/rdkit
§  http://www.rdkit.org

RDKit: Some features
§  Input/Output: SMILES/SMARTS, SDF, TDT, PDB,
SLN [1], Corina mol2 [1]
§  “Cheminformatics”:
•  Substructure searching
•  Canonical SMILES
•  Chirality support (i.e. R/S or E/Z labeling)
•  Chemical transformations (e.g. remove matching
substructures)
•  Chemical reactions
§  2D depiction, including constrained depiction
§  2D->3D conversion/conformational analysis via
distance geometry
§  UFF and MMFF94 implementation for cleaning up
structures
§  Fingerprinting: Daylight-like, atom pairs, topological
torsions, Morgan algorithm, “MACCS keys”, etc.
§  Similarity/diversity picking
§  2D pharmacophores [1]
§  Gasteiger-Marsili charges
§  Hierarchical subgraph/fragment analysis
§  Bemis and Murcko scaffold determination
§  RECAP and BRICS implementations
§  Multi-molecule maximum common substructure
§  Feature maps
§  Shape-based similarity
§  Fraggle similarity (from GSK)
§  Molecule-molecule alignment
§  Open3DAlign implementation
§  Integration with PyMOL for 3D visualization
§  Functional group filtering
§  Salt stripping
§  Molecular descriptor library:
Topological (κ3, Balaban J, etc.), Compositional (Number
of Rings, Number of Aromatic Heterocycles, etc.),
EState, SlogP/SMR (Wildman and Crippen approach),
“MOE like” VSA descriptors, Feature-map vectors
§  Machine Learning:
•  Clustering (hierarchical)
•  Information theory (Shannon entropy, information
gain, etc.)
§  Tight integration with the IPython notebook and
pandas
§  Integration with the InChI library
[1] These implementations are functional but are not necessarily
the best, fastest, or most complete.

RDKit reaction handling
Basics
From an rxn file:
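
The original slide showed this step as an image. As a rough sketch of the basics, the snippet below builds a reaction from a SMARTS template and applies it to two reactants. The amide-coupling template and the example molecules are illustrative assumptions, not taken from the slides; a reaction stored in an rxn file can be loaded with AllChem.ReactionFromRxnFile instead.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative amide-coupling template (an assumption, not from the slides).
# A reaction stored on disk could be loaded with AllChem.ReactionFromRxnFile(path).
rxn = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2:3]>>[C:1](=[O:2])[N:3]"
)

acid = Chem.MolFromSmiles("CC(=O)O")   # acetic acid
amine = Chem.MolFromSmiles("NCC")      # ethylamine

# RunReactants returns one tuple of product molecules per way the
# templates match the reactants; products must be sanitized before use.
product_smiles = set()
for products in rxn.RunReactants((acid, amine)):
    for product in products:
        Chem.SanitizeMol(product)
        product_smiles.add(Chem.MolToSmiles(product))

print(product_smiles)
```

Collecting canonical SMILES in a set removes duplicate products that arise when a template matches the same reactants in more than one equivalent way.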

RDKit reaction handling
Virtual Protecting groups
The problem:
Introducing the protecting group on amide Ns:
The result:

Another approach for tuning specificity
start with the problem again

Another approach for tuning specificity
and now the solution
Thanks to Holger Claussen (BioSolveIT) for the idea to use atom values for this
Query definitions added as atom values

Got the reactions, what about reaction fingerprints?
Criteria for them to be useful
§ Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§ Question 2: are similar reactions similar according to the
fingerprints?
Test: do related reactions cluster together?

Similarity applied to reactions
What are we talking about?
§  These two reactions are both type: “1.2.5 Ketone reductive amination”
It’s obvious that these are the same, right?

Got the reactions, what about reaction fingerprints?
Start simple: use difference fingerprints:

$$FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$$
$$FP_{\mathrm{Prods}} = \sum_{i \in \mathrm{Products}} FP_i$$
$$FP_{\mathrm{Rxn}} = FP_{\mathrm{Prods}} - FP_{\mathrm{Reacts}}$$

Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
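
A minimal sketch of the difference-fingerprint arithmetic, assuming each molecule's counted fingerprint is available as a dict mapping a bit identifier to its count (for example, the non-zero elements of a counted atom-pair fingerprint). The function names and the toy bit labels are hypothetical and only illustrate the sums and the subtraction.

```python
from collections import Counter


def sum_fingerprints(fps):
    """Add up counted fingerprints (dicts of bit -> count)."""
    total = Counter()
    for fp in fps:
        total.update(fp)  # Counter.update adds counts
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """Return FP_Rxn = FP_Prods - FP_Reacts, keeping negative entries."""
    reacts = sum_fingerprints(reactant_fps)
    prods = sum_fingerprints(product_fps)
    diff = {}
    for bit in set(reacts) | set(prods):
        delta = prods[bit] - reacts[bit]  # missing bits count as 0
        if delta != 0:
            diff[bit] = delta
    return diff


# Toy example with made-up bit labels
reactants = [{"C-O": 1, "C-Br": 1}, {"O-H": 1, "C-O": 1}]
products = [{"C-O": 2}]
rxn_fp = difference_fingerprint(reactants, products)
print(rxn_fp)
```

The explicit loop is deliberate: subtracting two Counter objects with the `-` operator discards negative counts, and the negative entries (features lost in the reaction) are half of the signal.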

Refine the fingerprints a bit
Text-mined reactions often include reagents or
solvents in the reactants
Explore two options for handling this:
1.  Decrease the weight of reactant molecules where too many
of the bits are not present in the product fingerprint
2.  Decrease the weight of reactant molecules where too many
atoms are unmapped
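
Option 1 could be sketched as follows, again treating counted fingerprints as dicts of bit to count. The overlap threshold (`min_overlap`), the reduced weight (`low_weight`), and both function names are hypothetical choices for illustration; the slides do not state the values actually used.

```python
def reactant_weight(reactant_fp, product_fp, min_overlap=0.5, low_weight=0.1):
    """Weight a reactant by how many of its bits also occur in the product.

    min_overlap and low_weight are illustrative values, not from the talk.
    """
    if not reactant_fp:
        return low_weight
    shared = sum(1 for bit in reactant_fp if bit in product_fp)
    overlap = shared / len(reactant_fp)
    return 1.0 if overlap >= min_overlap else low_weight


def weighted_reactant_sum(reactant_fps, product_fp, **kwargs):
    """Sum reactant fingerprints, down-weighting likely reagents/solvents."""
    total = {}
    for fp in reactant_fps:
        weight = reactant_weight(fp, product_fp, **kwargs)
        for bit, count in fp.items():
            total[bit] = total.get(bit, 0.0) + weight * count
    return total


# Toy example: the second "reactant" shares no bits with the product
substrate = {"a": 1, "b": 2}
solvent = {"x": 3, "y": 1}
product = {"a": 1, "b": 1, "c": 1}
weighted = weighted_reactant_sum([substrate, solvent], product)
print(weighted)
```

Option 2 would follow the same pattern, with the weight computed from the fraction of mapped atoms in each reactant rather than from shared fingerprint bits.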

Another reaction analysis scheme
Looking at functional group changes
§ Similar idea to the fingerprint analysis: count the numbers
of common functional groups in the reactants and
products and subtract the one from the other:

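import numpy as np
from rdkit.Chem import FunctionalGroups

# rxn: an RDKit ChemicalReaction
# fgh: the functional-group hierarchy passed to CreateMolFingerprint

# Sum the functional-group fingerprints over all reactant templates
rfp = None
for ri in range(rxn.GetNumReactantTemplates()):
    m = rxn.GetReactantTemplate(ri)
    fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
    if rfp is None:
        rfp = fp
    else:
        rfp += fp

# Sum the functional-group fingerprints over all product templates
pfp = None
for ri in range(rxn.GetNumProductTemplates()):
    m = rxn.GetProductTemplate(ri)
    fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
    if pfp is None:
        pfp = fp
    else:
        pfp += fp

# Functional-group change: products minus reactants
fp = pfp - rfp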

Functional groups considered
acidchloride
acidchloride_aromatic
acidchloride_aliphatic
carboxylicacid
carboxylicacid_aromatic
carboxylicacid_aliphatic
carboxylicacid_alphaamino
sulfonylchloride
sulfonylchloride_aromatic
sulfonylchloride_aliphatic
amine
amine_primary
amine_primary_aromatic
amine_primary_aliphatic
amine_secondary
amine_secondary_aromatic
amine_secondary_aliphatic
amine_tertiary
amine_tertiary_aromatic
amine_tertiary_aliphatic
amine_aromatic
amine_aliphatic
amine_cyclic
boronicacid
boronicacid_aromatic
boronicacid_aliphatic
isocyanate
isocyanate_aromatic
isocyanate_aliphatic
alcohol
alcohol_aromatic
alcohol_aliphatic
aldehyde
aldehyde_aromatic
aldehyde_aliphatic
halogen
halogen_aromatic
halogen_aliphatic
halogen_notfluorine
halogen_notfluorine_aliphatic
halogen_notfluorine_aromatic
halogen_bromine
halogen_bromine_aliphatic
halogen_bromine_aromatic
halogen_bromine_bromoketone
azide
azide_aromatic
azide_aliphatic
nitro
nitro_aromatic
nitro_aliphatic
terminalalkyne

Functional group changes analyzed
Do the results make sense at all?
Functional Group                 Avg in Reaction   Overall Average
halogen                          -0.98             -0.3
alcohol                          -0.95             -0.12
halogen_notfluorine              -0.89             -0.27
alcohol_aromatic                 -0.67             -0.04
halogen_aliphatic                -0.62             -0.15
halogen_notfluorine_aliphatic    -0.62             -0.14
carboxylicacid                   -0.5              -0.23
halogen_bromine                  -0.42             -0.11
halogen_bromine_aliphatic        -0.39             -0.06
halogen_aromatic                 -0.36             -0.16
alcohol_aliphatic                -0.28             -0.08
halogen_notfluorine_aromatic     -0.27             -0.13
amine                            -0.04             -0.3
amine_aliphatic                  -0.04             -0.27
carboxylicacid_aliphatic         -0.04             -0.08
halogen_bromine_aromatic         -0.03             -0.05
amine_tertiary                   -0.02             -0.06
amine_tertiary_aliphatic         -0.02             -0.08
carboxylicacid_aromatic          -0.02             -0.03
amine_cyclic                     -0.01             -0.02
halogen_bromine_bromoketone      -0.01             0
acidchloride                     0                 -0.07
acidchloride_aliphatic           0                 -0.05
acidchloride_aromatic            0                 -0.02
aldehyde                         0                 -0.04
aldehyde_aliphatic               0                 -0.01
aldehyde_aromatic                0                 -0.03
amine_aromatic                   0                 -0.03
amine_primary                    0                 -0.15
amine_primary_aliphatic          0                 -0.07
amine_primary_aromatic           0                 -0.07
amine_secondary                  0                 -0.04
amine_secondary_aliphatic        0                 -0.07
amine_secondary_aromatic         0                 0.03
amine_tertiary_aromatic          0                 0
azide                            0                 0
azide_aliphatic                  0                 0
azide_aromatic                   0                 0
boronicacid                      0                 -0.03
boronicacid_aliphatic            0                 0
boronicacid_aromatic             0                 -0.03
carboxylicacid_alphaamino        0                 0
isocyanate                       0                 -0.01
isocyanate_aliphatic             0                 0
isocyanate_aromatic              0                 0
nitro                            0                 -0.03
nitro_aliphatic                  0                 0
nitro_aromatic                   0                 -0.03
sulfonylchloride                 0                 -0.02
sulfonylchloride_aliphatic       0                 -0.01
sulfonylchloride_aromatic        0                 -0.01
terminalalkyne                   0                 -0.01

Compare the average deltas for the >39K instances of
Williamson ether synthesis
These look sensible

Are the fingerprints useful?

Machine learning and chemical reactions
§ Validation set:
•  The 68 reaction types with at least 2000 instances from the patent
data set
-  “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral
separation)
-  Final: 66 reaction types
§ Process:
•  Training set is 200 random instances of each reaction type
•  Test set is 800 random instances of each reaction type
•  Learning: random forest (scikit-learn)
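
The sampling step of this process can be sketched with the standard library alone. The helper name `split_per_class` and the fixed seed are assumptions for illustration; the resulting index lists would then select the fingerprints used to fit and evaluate a random forest (for instance scikit-learn's RandomForestClassifier).

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Pick disjoint random train/test indices for every class label.

    Hypothetical helper: draws n_train + n_test instances per class
    without replacement, then splits them.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train + n_test:
            raise ValueError(f"class {label!r} has too few instances")
        picked = rng.sample(members, n_train + n_test)
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Toy example with small sizes so it runs instantly
labels = ["A"] * 10 + ["B"] * 12
train_idx, test_idx = split_per_class(labels, n_train=2, n_test=8)
print(len(train_idx), len(test_idx))
```

Sampling the same number of instances from every class keeps the training and test sets balanced, which matters when the raw class frequencies differ by an order of magnitude.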

Learning reaction classes
Results for test data
Overall:
•  Recall: 0.94
•  Precision: 0.94
•  Accuracy: 0.94
For a 66-class classifier, this looks pretty good!
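
For reference, here is one way such overall numbers can be computed from true and predicted class labels. The slides do not say how recall and precision were averaged across classes; the sketch below assumes a macro average (unweighted mean over classes), and the function name is hypothetical.

```python
def classification_summary(y_true, y_pred):
    """Macro-averaged recall and precision plus overall accuracy.

    Macro averaging is an assumption; the talk does not specify it.
    """
    classes = sorted(set(y_true) | set(y_pred))
    recalls, precisions = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_true = sum(1 for t in y_true if t == c)
        n_pred = sum(1 for p in y_pred if p == c)
        recalls.append(tp / n_true if n_true else 0.0)
        precisions.append(tp / n_pred if n_pred else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "recall": sum(recalls) / len(classes),
        "precision": sum(precisions) / len(classes),
        "accuracy": accuracy,
    }


# Toy example with two classes of equal size
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
summary = classification_summary(y_true, y_pred)
print(summary)
```

When every class has the same number of test instances, as in the 800-per-class test set here, macro-averaged recall and accuracy are the same quantity, which is consistent with the two values agreeing on the slide.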

Learning reaction classes
[Figure: confusion matrix for test data]
~94% accuracy; much of the confusion is between related types.
Classes labeled in the figure: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation

Are the fingerprints useful?

Clustering reactions
§ Reaction similarity validation set:
•  The 66 most common reaction types from the patent data set
•  Look at the homogeneity of clusters with at least 10 members
[Figure: cluster homogeneity plots; labeled example class: 1.2.5 Ketone reductive amination]
Interpretation: <30% of clusters are <90% homogeneous
Interpretation: <40% of clusters are <80% homogeneous
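
A small sketch of how such homogeneity figures could be computed, assuming that a cluster's homogeneity is the fraction of its members belonging to its most common reaction class. That definition and both function names are assumptions for illustration, not confirmed by the slides.

```python
from collections import Counter


def cluster_homogeneity(cluster_labels, min_size=10):
    """Homogeneity of each sufficiently large cluster.

    cluster_labels: one list of reaction-class labels per cluster.
    Homogeneity is taken to be the share of the majority class.
    """
    result = []
    for labels in cluster_labels:
        if len(labels) < min_size:
            continue  # only clusters with at least min_size members count
        top_count = Counter(labels).most_common(1)[0][1]
        result.append(top_count / len(labels))
    return result


def fraction_below(homogeneities, threshold):
    """Fraction of clusters whose homogeneity is below the threshold."""
    if not homogeneities:
        return 0.0
    return sum(1 for h in homogeneities if h < threshold) / len(homogeneities)


# Toy example: the third cluster is too small to be considered
clusters = [["x"] * 10, ["x"] * 8 + ["y"] * 2, ["x"] * 3]
homogeneities = cluster_homogeneity(clusters)
print(homogeneities, fraction_below(homogeneities, 0.9))
```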

Similarity applied to reactions
Can we help classify the remaining 600K reactions?
§  Starting point: we have a similarity measure that clusters related
reactions together
§  We can apply the machine-learning model to the unclassified
reactions and see if the original assignment missed any instances
§  We can then look for big clusters of unclassified reactions and
(manually) assign classes to them.

Finding related unclassified reactions
§  Process:
1.  Pick 10K random unclassified reactions
2.  Cluster using the same fingerprint described above
3.  Characterize clusters by average functional-group profile
4.  Pick clusters where there is a clear signal
§  An example:
Cluster 12
Functional group             Average delta
amine                        -0.68
amine_secondary              -0.35
amine_secondary_aliphatic    -0.35
amine_aliphatic              -0.61
aldehyde                     -0.58
aldehyde_aromatic            -0.58
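
Steps 3 and 4 of the process can be sketched as below, assuming each reaction in a cluster already has a functional-group delta vector (products minus reactants) in a fixed order of group names. The `cutoff` used to call a signal "clear" and the function names are hypothetical.

```python
def cluster_profile(delta_fps, group_names):
    """Average functional-group delta over all reactions in a cluster."""
    n = len(delta_fps)
    return {
        name: sum(fp[i] for fp in delta_fps) / n
        for i, name in enumerate(group_names)
    }


def clear_signals(profile, cutoff=0.5):
    """Keep groups whose average change is large in magnitude.

    cutoff is an illustrative value, not taken from the talk.
    """
    return {name: avg for name, avg in profile.items() if abs(avg) >= cutoff}


# Toy cluster of four reactions over three functional groups
names = ["amine", "aldehyde", "nitro"]
deltas = [
    [-1, -1, 0],
    [-1, 0, 0],
    [0, -1, 0],
    [-1, -1, 0],
]
profile = cluster_profile(deltas, names)
signals = clear_signals(profile)
print(signals)
```

A profile in which an amine and an aldehyde are both consumed, as in cluster 12, points a chemist toward the kind of reaction to look for when inspecting the cluster members by hand.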

Example reactions from cluster 12
•  Clearly related reactions
•  Using this approach we’ve identified a number of reaction classes

Wrapping up
§ Dataset: 1+ million reactions text-mined from patents
(publicly available) with reaction classes assigned
§ Fingerprint: weighted atom-pair delta fingerprints
implemented using the RDKit
§ Fingerprint Validation:
•  Multiclass random-forest classifier ~94% accurate
•  Similarity measure works: similar reactions cluster together
§ Combination of clustering + functional-group analysis
allows identification of new reaction classes

Acknowledgements
§ NIBR:
• Anna Pelliccioli
• Sereina Riniker
• Mike Tarselli
§ NextMove Software:
• Roger Sayle
• Daniel Lowe

Advertising
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on
the 24th.
Registration: http://goo.gl/z6QzwD
Full announcement: http://goo.gl/ZUm2wm
We’re looking for speakers. Please contact greg.landrum@gmail.com
