Fingerprinting Chemical Structures

Rajarshi Guha
https://github.com/rajarshi/ctpa-fingerprints
September 9, 2014

High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
– Biochemical, genetic, pharmacological assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
– Infectious agents
– Cellular basis of diseases

Goal of HTS
• Rapidly screen large compound collections
• Efficiently identify real actives
– Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: screening funnel showing the number of molecules at each stage (300K, 1000, 300); stage labels: HTS, Cherry Picks]

HTS Data Types
• Categorical: active/inactive or toxic/nontoxic
• Continuous
– Single point
– Dose response
• Multiple readouts
– Might read at different wavelengths or timepoints
– More complex when dealing with imaging
• These (usually) represent the dependent variable
[Figure: two example plots of Response versus Concentration, one with a log10 Concentration axis]

Independent Variable(s)
• HTS tests the activity of a molecule: the molecule is our “independent variable”
Activity = f(Structure)
• Need to describe the molecular structure
– Various discrete or real-valued descriptors
– Surfaces (3D)
– Binary fingerprints

Fingerprint Representation
1 0 1 1 0 0 0 1 0
• Lots of types of fingerprints
• “Keyed” fingerprints indicate the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints are usually compared using the Tanimoto metric
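As a minimal sketch of the Tanimoto comparison, assume each fingerprint is stored as a vector of its "on" bit positions. The coefficient is then the number of bits set in both fingerprints divided by the number set in at least one. The helper name `tanimoto` and the second fingerprint are illustrative assumptions, not part of any package.

```r
# Tanimoto similarity for two fingerprints given as vectors of "on" bit positions.
# `tanimoto` is a hypothetical helper written for illustration only.
tanimoto <- function(bits1, bits2) {
  common <- length(intersect(bits1, bits2))   # bits set in both
  either <- length(union(bits1, bits2))       # bits set in at least one
  if (either == 0) return(NA_real_)           # undefined when both are empty
  common / either
}

fp.a <- c(1, 3, 4, 8)      # "on" positions of the bit string 1 0 1 1 0 0 0 1 0
fp.b <- c(1, 4, 5, 8)      # an arbitrary second fingerprint
sim <- tanimoto(fp.a, fp.b)
print(sim)
```

Here three bits are shared and five are set in at least one fingerprint, so the similarity is 3/5.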

What Can I Use Them For?
• Search
– Given a potent active molecule, find similar ones (or dissimilar, but also potent)
• Prediction
– Given a set of active & inactive molecules, build a model to predict which members from a large collection will be active
• Clustering
– Given a set of molecules, do they cluster into structurally different groups?
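The search use case can be sketched in a few lines of base R: score every library member against a query and sort by similarity. The library, the molecule names, and the helper `tanimoto` are made up for illustration. Fingerprints are assumed to be vectors of "on" bit positions.

```r
# Rank a small made-up library by Tanimoto similarity to a query fingerprint.
# All names and bit positions below are illustrative assumptions.
tanimoto <- function(x, y) length(intersect(x, y)) / length(union(x, y))

query <- c(1, 4, 5, 100, 200)
lib <- list(molA = c(1, 4, 5, 100, 200),   # identical to the query
            molB = c(1, 4, 5, 7),          # partial overlap
            molC = c(300, 301))            # no overlap

sims   <- sapply(lib, tanimoto, y = query) # one similarity per library member
ranked <- sort(sims, decreasing = TRUE)    # most similar first
print(ranked)
```

Reversing the sort order gives the most dissimilar members first, which is the starting point for the "dissimilar, but also potent" search.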

Fingerprints in R
• The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods
• A fingerprint is an S4 object
– Create them manually
new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))
– Read them in from files
fp.read('data/cdk.fp', size=1024, lf=cdk.lf)

Getting Fingerprints
• You can also generate fingerprints from chemical structures using the rcdk package
• If you’re not doing cheminformatics, you can read in your own FP data by implementing a line reader
– See cdk.lf, moe.lf, bci.lf
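A line reader is a function that takes one line of text and returns the fingerprint name, the "on" bit positions, and a list of extras, following the convention of the reader shown in the solubility example in these slides. The sketch below assumes a hypothetical file format of the form `name:pos,pos,pos`. Both the format and the name `my.lf` are illustrative assumptions.

```r
# A hypothetical line reader for lines such as "mol42:1,4,5,100".
# It takes one line of text and returns
# list(name, vector of "on" bit positions, list of extras).
# The file format and the name `my.lf` are assumptions for illustration.
my.lf <- function(line) {
  parts <- strsplit(line, ":", fixed = TRUE)[[1]]
  name  <- parts[1]
  bits  <- as.numeric(strsplit(parts[2], ",", fixed = TRUE)[[1]])
  list(name, bits, list())
}

parsed <- my.lf("mol42:1,4,5,100")
str(parsed)
```

Such a function would be passed as the `lf` argument of `fp.read`, in place of `cdk.lf`.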

Random Fingerprints
• Useful for benchmarking, generating null distributions, exploring effects of bit density
library(fingerprint)
## How long does a similarity matrix calculation take as a function of fp length?
nfp <- 300
sizes <- c(64, 128, 512, 1024, 4096, 8192)
times.size <- sapply(sizes, function(size) {
  fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
  system.time(junk <- fp.sim.matrix(fps))[3]
})
## For a given length, how does bit density affect calculation time?
densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
times.density <- sapply(densities, function(density) {
  fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
  system.time(junk <- fp.sim.matrix(fps))[3]
})

Random Fingerprints
[Figure: two plots of similarity matrix calculation time: Time (s) versus Fingerprint Length, and Time (s) versus Bit Density]

Compare Similarity Metrics
• More than 20 similarity metrics
– Some are written in C, so very fast, applicable to larger fingerprint collections
– Others are in pure R, slow
fps <- fp.read('data/cdk.fp', size=881,
               lf=cdk.lf, header=TRUE)[1:500]
s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
s.dice <- fp.sim.matrix(fps, method='dice')
d <- rbind(data.frame(method='Tanimoto',
                      s=as.numeric(s.tanimoto)),
           data.frame(method='Dice',
                      s=as.numeric(s.dice)))
[Figure: density plot of Similarity (0 to 1) for the Dice and Tanimoto metrics]
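The two metrics compared here are closely related. Assuming the standard definitions (Tanimoto = c / (a + b - c), Dice = 2c / (a + b), where a and b are the bit counts of the two fingerprints and c is the number of shared bits), Dice is a monotone transform of Tanimoto. The counts in this sketch are arbitrary.

```r
# Dice and Tanimoto computed from the same bit counts (standard definitions assumed):
#   a, b = number of bits set in each fingerprint, c.common = number set in both.
a <- 40; b <- 50; c.common <- 30        # arbitrary illustrative counts
tanimoto <- c.common / (a + b - c.common)
dice     <- 2 * c.common / (a + b)
# Dice as a function of Tanimoto: D = 2T / (1 + T), so D >= T on [0, 1].
dice.from.tanimoto <- 2 * tanimoto / (1 + tanimoto)
print(c(tanimoto = tanimoto, dice = dice, dice.from.tanimoto = dice.from.tanimoto))
```

Because the transform is monotone, the two metrics rank pairs identically, but their distributions differ, with Dice values shifted toward 1.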

Predicting with Fingerprints
• Read in fingerprints & convert to matrix form
• See
– data/solubility.csv
– data/solubility.maccs
• 33,182 observations of solubility
• 57,857 fingerprints
• Requires some data wrangling before modeling
[Figure: bar chart of Frequency by Solubility Class (high, low, medium)]
OOB estimate of error rate: 22.37%
Confusion matrix:
        high  low medium class.error
high     181   52    621  0.78805621
low       35 5611   4598  0.45226474
medium    89 2029  19965  0.09591088
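The reported error rates follow directly from the confusion matrix. The sketch below re-derives them in base R, assuming rows are the observed classes and columns the predicted ones, which is what the row-wise class.error column implies.

```r
# Confusion matrix as reported above (rows = observed class, columns = predicted).
cm <- matrix(c(181,   52,   621,
                35, 5611,  4598,
                89, 2029, 19965),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("high", "low", "medium"),
                             predicted = c("high", "low", "medium")))
# Per-class error: fraction of each observed class that was misclassified.
class.error <- 1 - diag(cm) / rowSums(cm)
# Overall (OOB) error: all off-diagonal counts over the total.
oob.error <- 1 - sum(diag(cm)) / sum(cm)
print(round(class.error, 4))
print(round(100 * oob.error, 2))
```

The low overall error is driven by the large medium class. The high class, with only 854 members, is misclassified about 79% of the time.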

Predicting with Fingerprints
• The model will use MACCS keys
– 166 bits
– Each bit is associated with a structural feature
• Low resolution, somewhat simplistic
• Data comes in a non-standard format, so we must implement our own line reader
• Classification problem: predict low/medium/high solubility

Predicting with Fingerprints
library(fingerprint)
sol <- read.csv('data/solubility.csv', header=TRUE)
## Custom line reader: first token is the ID, the rest are the "on" bit positions
fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
               lf=function(line) {
                 toks <- strsplit(line, " ")[[1]]
                 title <- toks[1]
                 bits <- as.numeric(toks[2:length(toks)])
                 list(title, bits, list())
               })
## Extract fingerprints for which we have a label
common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
fps <- fps[common]
## Order the fingerprints & data
sol <- sol[order(sol$sid),]
fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]
## Make X matrix
fpm <- fp.to.matrix(fps)
## Model!
library(randomForest)
m1 <- randomForest(x=fpm, y=as.factor(sol$label))

Predicting with Fingerprints
• We can then use the RF variable importance measure
• Features important for predictive performance
– Presence of aromatic rings
– Presence of charged atoms
– Presence of 6-membered rings
– N & O atoms connected in a chain
• Chemically sensible
[Figure: random forest variable importance (MeanDecreaseGini, axis 0 to 250) for 30 MACCS bits, in plot order: 125, 49, 145, 105, 62, 149, 97, 144, 135, 150, 79, 98, 95, 80, 132, 160, 93, 131, 133, 111, 152, 96, 99, 65, 77, 138, 100, 90, 85, 120]
https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt

Clustering with Fingerprints
• Generate a distance matrix directly from a list of fingerprints
fps <- fp.read('data/cdk.fp',
               size=881,
               lf=cdk.lf)[1:500]
sims <- fp.sim.matrix(fps)
dmat <- as.dist(1-sims)
clus <- hclust(dmat)
par(mar=c(1,4,1,1))
plot(clus, labels=FALSE, xlab='',
     main='')
[Figure: dendrogram of the hierarchical clustering; Height axis from 0.0 to 0.8]
• Exercise: How do clusters vary with similarity metric and/or fingerprint type?
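To turn the dendrogram into explicit groups, one option is to cut the tree at a chosen number of clusters. The sketch below uses base R only, on a made-up 0/1 matrix, and computes Tanimoto similarities by hand. The choice of k = 2 and the use of `cutree` are illustrative assumptions, not something the slides prescribe.

```r
# Toy set of six fingerprints as rows of a 0/1 matrix (values are made up).
fpm <- rbind(c(1,1,1,0,0,0,0,0),
             c(1,1,1,1,0,0,0,0),
             c(1,1,0,1,0,0,0,0),
             c(0,0,0,0,1,1,1,0),
             c(0,0,0,0,1,1,1,1),
             c(0,0,0,0,0,1,1,1))
# Pairwise Tanimoto: shared "on" bits / bits "on" in either fingerprint.
shared <- fpm %*% t(fpm)
n.on   <- rowSums(fpm)
sims   <- shared / (outer(n.on, n.on, "+") - shared)
# Same steps as the slide: similarity -> distance -> hierarchical clustering.
clus   <- hclust(as.dist(1 - sims))
groups <- cutree(clus, k = 2)    # cut the tree into two clusters
print(groups)
```

Repeating this with a different similarity matrix and comparing the resulting `groups` vectors is one way to approach the exercise.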

Comparing Data Sets
• How do we compare two sets of chemical structures?
– Sizes may be different, and very large
• Pairwise?
– O(N^2) running time
– Need to aggregate the resultant pairwise values

Comparing Data Sets
• How do we compare two sets of chemical structures?
– Sizes may be different, and very large
• Distributions?
– Of what?
– Can lead to multiple ways to generate a comparison
– Data fusion?

Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
0   0   1    ...
0   1   0    ...
1   1   1    ...
1   0   1    ...
0.5 0.5 0.75 ...
[Figure: bit spectrum for ~10K molecules; Normalized Frequency (0 to 1) versus Bit Position (0 to 750)]
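The definition amounts to a column mean over a 0/1 fingerprint matrix. As a base-R sketch, here are the four example fingerprints from the slide, restricted to the three bit positions shown:

```r
# The four example fingerprints from the slide, one per row (first three bit positions).
fpm <- rbind(c(0, 0, 1),
             c(0, 1, 0),
             c(1, 1, 1),
             c(1, 0, 1))
# Bit spectrum: fraction of fingerprints in which each bit position is set.
spectrum <- colMeans(fpm)
print(spectrum)
```

This reproduces the 0.5, 0.5, 0.75 row shown in the slide.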

Bit Spectrum
• Now comparison of two datasets is an O(1) operation with respect to dataset size, once each bit spectrum has been computed
– Simply take the difference of the two bit spectra
• e.g.: Compare ~800 solubles with >30k insolubles
library(ggplot2)
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+
  ylab('Normalized Frequency')+
  ylim(c(-1,1))
[Figure: difference plot; Δ Normalized Frequency (-1.0 to 1.0) versus Bit Position (0 to 150)]

Explaining Poor Model Performance
• Training set for model
• Poor predictions on test set
• Both test set classes look like the toxic class in the training set
Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367

Summary
• Fingerprints are a useful representation for molecules: fast, objective, compact
• But they are also applicable to other domains and objects
– Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)
• Useful for various tasks: search & comparison, prediction, clustering
• The fingerprint package provides a domain-agnostic way to handle binary fingerprints

Comparing Clusterings
• Generate multiple representations of a set of molecules
• How differently do these representations cluster?
– Measure correlation of clusters using the cophenetic correlation coefficient
• A variety of R packages support this
– dendextend, clValid
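The cophenetic correlation can also be computed with base R alone, which may help show what the packages above automate. The sketch below clusters two made-up 0/1 representations of the same six objects and correlates the tree-implied distances. The matrices and variable names are illustrative assumptions. `dist(method = "binary")` gives 1 minus the Tanimoto similarity.

```r
# Two made-up 0/1 representations of the same six objects (one object per row).
rep.a <- rbind(c(1,1,1,0,0,0,0,0),
               c(1,1,1,1,0,0,0,0),
               c(1,1,0,1,0,0,0,0),
               c(0,0,0,0,1,1,1,0),
               c(0,0,0,0,1,1,1,1),
               c(0,0,0,0,0,1,1,1))
rep.b <- rep.a[, c(1, 2, 5, 6)]   # a coarser representation: keep only four of the bits
# Cluster each representation; dist(method = "binary") is 1 - Tanimoto.
clus.a <- hclust(dist(rep.a, method = "binary"))
clus.b <- hclust(dist(rep.b, method = "binary"))
# Cophenetic correlation: correlate the tree-implied distances of the two dendrograms.
coph.cor <- cor(as.numeric(cophenetic(clus.a)), as.numeric(cophenetic(clus.b)))
print(coph.cor)
```

A value near 1 indicates that the two representations produce similar tree structure. Lower values indicate that they group the objects differently.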

Comparing Clusterings
[Figure: side-by-side dendrograms of the same molecules clustered with Pubchem 881 and CDK Ext 1024 fingerprints; height axes from 0.0 to 0.8]

Comparing Clusterings
Pairwise cophenetic correlations for clusterings generated using different fingerprints
               Pubchem  CDK Extended  CDK Graph      MACCS
Pubchem      1.0000000     0.7075479  0.6879805  0.5752923
CDK Extended 0.7075479     1.0000000  0.8050349  0.7386863
CDK Graph    0.6879805     0.8050349  1.0000000  0.7288428
MACCS        0.5752923     0.7386863  0.7288428  1.0000000
