Fingerprinting Chemical Structures
Rajarshi Guha
https://github.com/rajarshi/ctpa-fingerprints
September 9 2014

High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
  – Biochemical, genetic, pharmacological assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
  – Infectious agents
  – Cellular basis of diseases

Goal of HTS
• Rapidly screen large compound collections
• Efficiently identify real actives
  – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: Number of Molecules at successive screening stages (HTS, Cherry Picks); values shown: 300K, 1000, 300]

HTS Data Types
• Categorical – active/inactive or toxic/nontoxic
• Continuous
  – Single point
  – Dose response
• Multiple readouts
  – Might read at different wavelengths or timepoints
  – More complex when dealing with imaging
• These (usually) represent the dependent variable
[Figure: two plots of Response against Concentration, one on a log10 Concentration axis]

Independent Variable(s)
• HTS tests the activity of a molecule – the molecule is our “independent variable”
• Need to describe the molecular structure
  – Various discrete or real-valued descriptors
  – Surfaces (3D)
  – Binary fingerprints
Activity = f(Structure)

Fingerprint Representation
1 0 1 1 0 0 0 1 0
• Lots of types of fingerprints
• “Keyed” fingerprints indicate the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints usually compared using the Tanimoto metric

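The Tanimoto comparison mentioned above can be sketched in a few lines of base R. This is an illustration only, not the fingerprint package's implementation: the helper name `tanimoto` and the example bit positions are assumptions made for this sketch, and each fingerprint is represented simply as a vector of on-bit positions.

```r
# Hypothetical helper (not part of the fingerprint package):
# Tanimoto similarity = shared on bits / bits on in either fingerprint.
tanimoto <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n.union <- length(union(a, b))
  if (n.union == 0) return(NA_real_)  # undefined when no bits are set at all
  length(intersect(a, b)) / n.union
}

fp1 <- c(1, 3, 4, 8)      # on bits of the string 1 0 1 1 0 0 0 1 0
fp2 <- c(1, 4, 5, 8, 9)   # made-up second fingerprint
sim <- tanimoto(fp1, fp2)
print(sim)
```

The two fingerprints share three on bits out of six distinct on bits, so the similarity is 3/6. Identical fingerprints score 1 and fingerprints with no bits in common score 0.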
What Can I Use Them For?
• Search
  – Given a potent active molecule, find similar ones (or dissimilar, but also potent)
• Prediction
  – Given a set of active & inactive molecules build a model to predict which members from a large collection will be active
• Clustering
  – Given a set of molecules, do they cluster into structurally different groups?

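The search use case above amounts to ranking a collection by similarity to a query. A minimal base R sketch follows; the molecule names, bit positions and the 0.7 cutoff are all made up for illustration, and fingerprints are again plain vectors of on-bit positions rather than fingerprint package objects.

```r
# Illustrative collection: each entry is a vector of on-bit positions (made up).
library.fps <- list(
  mol.a = c(1, 3, 4, 8),
  mol.b = c(1, 4, 5, 8, 9),
  mol.c = c(2, 6, 7),
  mol.d = c(1, 3, 4, 8, 9)
)
query <- c(1, 3, 4, 8)

# Tanimoto similarity on sets of on-bit positions.
tanimoto <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Score every library member against the query, then rank.
sims <- sapply(library.fps, tanimoto, b = query)
ranked <- sort(sims, decreasing = TRUE)

# Keep hits above an arbitrary, illustrative cutoff.
hits <- names(ranked)[ranked >= 0.7]
print(ranked)
print(hits)
```

Sorting in increasing order instead would surface the most dissimilar members, which is the starting point for the "dissimilar, but also potent" search mentioned above.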
Fingerprints in R
• The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods
• A fingerprint is an S4 object
  – Create them manually
    new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))
  – Read them in from files
    fp.read('data/cdk.fp', size=1024, lf=cdk.lf)

Getting Fingerprints
• You can also generate fingerprints from chemical structures using the rcdk package
• If you’re not doing cheminformatics you can read in your own FP data by implementing a line reader
  – See cdk.lf, moe.lf, bci.lf

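As a rough idea of what implementing a line reader involves, here is a base R sketch for a made-up file format. The format ("<name> <bitstring>"), the function name `bitstring.lf` and the sample line are assumptions for illustration. The three-element return value (name, on-bit positions, empty list) mirrors the custom reader used in the solubility example in this deck.

```r
# Hypothetical line format (an assumption for this sketch): "<name> <bitstring>",
# for example "mol1 101100010".
bitstring.lf <- function(line) {
  toks <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  name <- toks[1]
  chars <- strsplit(toks[2], "")[[1]]
  bits <- which(chars == "1")   # positions of the on bits
  list(name, bits, list())      # name, on-bit positions, extra data (none here)
}

parsed <- bitstring.lf("mol1 101100010")
print(parsed[[1]])
print(parsed[[2]])
```

A function of this shape is what gets passed as the `lf` argument of `fp.read`, in the same way the slides pass `cdk.lf`.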
Random Fingerprints
• Useful for benchmarking, generating null distributions, exploring effects of bit density
library(fingerprint)
## How long does a similarity matrix calculation take as a function of fp length?
nfp <- 300
sizes <- c(64, 128, 512, 1024, 4096, 8192)
times.by.size <- sapply(sizes, function(size) {
  fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
  system.time(junk <- fp.sim.matrix(fps))[3]
})
## For a given length, how does bit density affect calculation time?
## (stored separately so the first set of timings is not overwritten)
densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
times.by.density <- sapply(densities, function(density) {
  fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
  system.time(junk <- fp.sim.matrix(fps))[3]
})

Random Fingerprints
[Figure: two plots of Time (s), one against Fingerprint Length and one against Bit Density]

Compare Similarity Metrics
• More than 20 similarity metrics
  – Some are written in C, so very fast, applicable to larger fingerprint collections
  – Others are in pure R, slow
fps <- fp.read('data/cdk.fp', size=881,
               lf=cdk.lf, header=TRUE)[1:500]
s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
s.dice <- fp.sim.matrix(fps, method='dice')
d <- rbind(data.frame(method='Tanimoto',
                      s=as.numeric(s.tanimoto)),
           data.frame(method='Dice',
                      s=as.numeric(s.dice)))
[Figure: density of Similarity values by Metric (Dice, Tanimoto)]

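To see how the two metrics compared above relate, the following base R sketch computes both for one pair of made-up fingerprints (vectors of on-bit positions). It does not use the fingerprint package; the variable names are illustrative.

```r
# Made-up fingerprints as vectors of on-bit positions.
fp1 <- c(1, 3, 4, 8)
fp2 <- c(1, 4, 5, 8, 9)

n.common <- length(intersect(fp1, fp2))

# Tanimoto: shared on bits / bits on in either fingerprint.
tanimoto <- n.common / length(union(fp1, fp2))

# Dice: twice the shared on bits / total on bits in both fingerprints.
dice <- 2 * n.common / (length(fp1) + length(fp2))

# For binary fingerprints, Dice is a monotonic transform of Tanimoto: D = 2T / (1 + T).
dice.from.tanimoto <- 2 * tanimoto / (1 + tanimoto)

print(c(tanimoto = tanimoto, dice = dice, dice.from.tanimoto = dice.from.tanimoto))
```

Because the transform is monotonic, the two metrics rank pairs identically, but Dice values are never lower than the corresponding Tanimoto values, so the two similarity distributions have different shapes.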
Predicting with Fingerprints
• Read in fingerprints & convert to matrix form
• See
  – data/solubility.csv
  – data/solubility.maccs
• 33,182 observations of solubility
• 57,857 fingerprints
• Requires some data wrangling before modeling
[Figure: bar chart of Frequency by Solubility Class (high, low, medium)]
OOB estimate of error rate: 22.37%
Confusion matrix:
        high   low  medium  class.error
high     181    52     621   0.78805621
low       35  5611    4598   0.45226474
medium    89  2029   19965   0.09591088

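The error figures in the output above follow directly from the confusion matrix. The base R sketch below re-derives them from the counts shown, assuming the usual randomForest layout (rows are observed classes, columns are predicted classes); the variable names are illustrative.

```r
# Counts copied from the confusion matrix above
# (assumed layout: rows = observed class, columns = predicted class).
cm <- matrix(c(181,   52,   621,
                35, 5611,  4598,
                89, 2029, 19965),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("high", "low", "medium"),
                             predicted = c("high", "low", "medium")))

# Per-class error: fraction of each observed class that was misclassified.
class.error <- 1 - diag(cm) / rowSums(cm)

# Overall error: fraction of all observations off the diagonal.
overall.error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class.error, 4))
print(round(100 * overall.error, 2))
```

Most of the misclassified "high" and "low" molecules land in the "medium" column, which is also by far the largest class. The overall error rate therefore hides very poor performance on the "high" class.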
Predicting with Fingerprints
• The model will use MACCS keys
  – 166 bits
  – Each bit is associated with a structural feature
• Low resolution, somewhat simplistic
• Data comes in a non-standard format, so we must implement our own line reader
• Classification problem – predict low/medium/high solubility

Predicting with Fingerprints
library(fingerprint)
sol <- read.csv('data/solubility.csv', header=TRUE)
## Custom line reader: first token is the title, remaining tokens are parsed as the bits
fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
               lf=function(line) {
                 toks <- strsplit(line, " ")[[1]]
                 title <- toks[1]
                 bits <- as.numeric(toks[2:length(toks)])
                 list(title, bits, list())
               })
## Extract fingerprints for which we have a label
common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
fps <- fps[common]
## Order the fingerprints & data
sol <- sol[order(sol$sid),]
fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]
## Make X matrix
fpm <- fp.to.matrix(fps)
## Model!
library(randomForest)
m1 <- randomForest(x=fpm, y=as.factor(sol$label))

Predicting with Fingerprints
• We can then use the RF variable importance measure
• Features important for predictive performance
  – Presence of aromatic rings
  – Presence of charged atoms
  – Presence of 6-membered rings
  – N & O atoms connected in a chain
• Chemically sensible
[Figure: variable importance plot (MeanDecreaseGini, axis 0 to 250) for MACCS bits, listed in plot order: 125, 49, 145, 105, 62, 149, 97, 144, 135, 150, 79, 98, 95, 80, 132, 160, 93, 131, 133, 111, 152, 96, 99, 65, 77, 138, 100, 90, 85, 120]
https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt

Clustering with Fingerprints
• Generate a distance matrix directly from a list of fingerprints
fps <- fp.read('data/cdk.fp',
               size=881,
               lf=cdk.lf)[1:500]
sims <- fp.sim.matrix(fps)
dmat <- as.dist(1-sims)
clus <- hclust(dmat)
par(mar=c(1,4,1,1))
plot(clus, labels=FALSE, xlab='',
     main='')
[Figure: cluster dendrogram; axis: Height]
• Exercise: How do clusters vary with similarity metric and/or fingerprint type?

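The clustering steps above (similarity, then distance, then `hclust`) can be tried without any data files using a toy binary matrix. The sketch below uses base R only; the matrix values are made up, and the Tanimoto similarity is computed by hand rather than with `fp.sim.matrix`.

```r
# Toy binary fingerprint matrix (rows = molecules, columns = bit positions); values are made up.
fpm <- rbind(
  m1 = c(1, 1, 1, 0, 0, 0, 0, 0),
  m2 = c(1, 1, 0, 0, 0, 0, 0, 0),
  m3 = c(1, 1, 1, 1, 0, 0, 0, 0),
  m4 = c(0, 0, 0, 0, 1, 1, 1, 0),
  m5 = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Pairwise Tanimoto similarity: shared on bits / bits on in either fingerprint.
common <- fpm %*% t(fpm)
n.on <- rowSums(fpm)
sims <- common / (outer(n.on, n.on, "+") - common)

# Same steps as the slide: similarity -> distance -> hierarchical clustering.
clus <- hclust(as.dist(1 - sims))

# Cut the tree into two groups and inspect the membership.
groups <- cutree(clus, k = 2)
print(groups)
```

One way to approach the exercise is to swap the similarity calculation for a different metric (for example Dice) and compare the resulting `cutree` memberships.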
Comparing Data Sets
• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Pairwise?
  – O(N²) running time
  – Need to aggregate the resultant pairwise values

Comparing Data Sets
• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Distributions?
  – Of what?
  – Can lead to multiple ways to generate a comparison
  – Data fusion?

Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
Fingerprints (one per row, ~10K molecules):
0 0 1 ...
0 1 0 ...
1 1 1 ...
1 0 1 ...
Bit spectrum:
0.5 0.5 0.75 ...
[Figure: Normalized Frequency against Bit Position]

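With fingerprints held as rows of a binary matrix, the definition above is simply a column mean. The base R sketch below reproduces the small example shown on the slide; it is an illustration of the definition, not the fingerprint package's `bit.spectrum` implementation.

```r
# The four example fingerprints from the slide, one per row (first three bit positions only).
fpm <- rbind(c(0, 0, 1),
             c(0, 1, 0),
             c(1, 1, 1),
             c(1, 0, 1))

# Bit spectrum: fraction of fingerprints in which each bit position is set to 1.
bs <- colMeans(fpm)
print(bs)
```

The result has one value per bit position regardless of how many rows the matrix has, which is what makes it a fixed-length summary of a dataset.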
Bit Spectrum
• Now comparison of two datasets is an O(1) operation – independent of dataset size
  – Simply take the difference of the two bit spectra
• e.g.: Compare ~800 solubles with >30k insolubles
library(ggplot2)
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+
  ylab('Normalized Frequency')+
  ylim(c(-1,1))
[Figure: Δ Normalized Frequency (-1.0 to 1.0) against Bit Position (0 to 150)]

Explaining Poor Model Performance
• Training set for model
• Poor predictions on test set
• Both test set classes look like the toxic class in the training set
Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367

Summary
• Fingerprints are a useful representation for molecules – fast, objective, compact
• But are applicable to other domains and objects
  – Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)
• Useful for various tasks – search & comparison, prediction, clustering
• The fingerprint package provides a domain agnostic way to handle binary fingerprints

Comparing Clusterings
• Generate multiple representations of a set of molecules
• How differently do these representations cluster?
  – Measure correlation of clusters using cophenetic coefficient
• A variety of R packages to support this
  – dendextend, clValid

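The cophenetic comparison described above can also be done with base R alone. The sketch below clusters the same five toy molecules under two made-up binary representations and correlates the cophenetic distances of the two trees. The matrices and names are assumptions for illustration; `dist(..., method = "binary")` is used because it equals 1 minus the Tanimoto similarity for binary rows.

```r
# Two made-up binary representations of the same five molecules (rows in the same order).
rep.a <- rbind(m1 = c(1, 1, 1, 0, 0, 0),
               m2 = c(1, 1, 0, 0, 0, 0),
               m3 = c(0, 1, 1, 1, 0, 0),
               m4 = c(0, 0, 0, 1, 1, 1),
               m5 = c(0, 0, 0, 0, 1, 1))
rep.b <- rbind(m1 = c(1, 1, 0, 0),
               m2 = c(1, 1, 1, 0),
               m3 = c(1, 0, 1, 0),
               m4 = c(0, 0, 1, 1),
               m5 = c(0, 0, 0, 1))

# Cluster each representation on its binary (1 - Tanimoto) distance.
clus.a <- hclust(dist(rep.a, method = "binary"))
clus.b <- hclust(dist(rep.b, method = "binary"))

# Cophenetic distance: the height at which two molecules first join in the tree.
# Correlating these distances across trees measures how similarly the trees group the molecules.
coph.cor <- cor(cophenetic(clus.a), cophenetic(clus.b))
print(coph.cor)
```

A value near 1 means the two representations produce very similar hierarchies. Repeating this for every pair of representations gives a correlation matrix over clusterings.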
Comparing Clusterings
[Figure: dendrograms of the same molecules clustered with Pubchem 881 and CDK Ext 1024 fingerprints]

Comparing Clusterings
Pairwise cophenetic correlations for clusterings generated using different fingerprints
              Pubchem    CDK Extended  CDK Graph  MACCS
Pubchem       1.0000000  0.7075479     0.6879805  0.5752923
CDK Extended  0.7075479  1.0000000     0.8050349  0.7386863
CDK Graph     0.6879805  0.8050349     1.0000000  0.7288428
MACCS         0.5752923  0.7386863     0.7288428  1.0000000

More Related Content

What's hot

CoMFA CoMFA Comparative Molecular Field Analysis)
CoMFA CoMFA Comparative Molecular Field Analysis)CoMFA CoMFA Comparative Molecular Field Analysis)
CoMFA CoMFA Comparative Molecular Field Analysis)
Pinky Vincent
 

What's hot (20)

Virtual Screening in Drug Discovery
Virtual Screening in Drug DiscoveryVirtual Screening in Drug Discovery
Virtual Screening in Drug Discovery
 
Virtual screening techniques
Virtual screening techniquesVirtual screening techniques
Virtual screening techniques
 
Pharmacophore mapping joon
Pharmacophore mapping joonPharmacophore mapping joon
Pharmacophore mapping joon
 
De novo drug design
De novo drug designDe novo drug design
De novo drug design
 
Molecular modelling
Molecular modelling Molecular modelling
Molecular modelling
 
Cheminformatics
CheminformaticsCheminformatics
Cheminformatics
 
Molecular Dynamics
Molecular DynamicsMolecular Dynamics
Molecular Dynamics
 
22.pharmacophore
22.pharmacophore22.pharmacophore
22.pharmacophore
 
STATISTICAL METHOD OF QSAR
STATISTICAL METHOD OF QSARSTATISTICAL METHOD OF QSAR
STATISTICAL METHOD OF QSAR
 
2D QSAR DESCRIPTORS
2D QSAR DESCRIPTORS2D QSAR DESCRIPTORS
2D QSAR DESCRIPTORS
 
Computational Drug Design
Computational Drug DesignComputational Drug Design
Computational Drug Design
 
Presentation on concept of pharmacophore mapping and pharmacophore based scre...
Presentation on concept of pharmacophore mapping and pharmacophore based scre...Presentation on concept of pharmacophore mapping and pharmacophore based scre...
Presentation on concept of pharmacophore mapping and pharmacophore based scre...
 
MD Simulation
MD SimulationMD Simulation
MD Simulation
 
3D QSAR
3D QSAR3D QSAR
3D QSAR
 
3 d qsar approaches structure
3 d qsar approaches structure3 d qsar approaches structure
3 d qsar approaches structure
 
Pharmacophore mapping and virtual screening
Pharmacophore mapping and virtual screeningPharmacophore mapping and virtual screening
Pharmacophore mapping and virtual screening
 
Molecular Mechanics in Molecular Modeling
Molecular Mechanics in Molecular ModelingMolecular Mechanics in Molecular Modeling
Molecular Mechanics in Molecular Modeling
 
Chemoinformatics
ChemoinformaticsChemoinformatics
Chemoinformatics
 
CoMFA CoMFA Comparative Molecular Field Analysis)
CoMFA CoMFA Comparative Molecular Field Analysis)CoMFA CoMFA Comparative Molecular Field Analysis)
CoMFA CoMFA Comparative Molecular Field Analysis)
 
Molecular similarity searching methods, seminar
Molecular similarity searching methods, seminarMolecular similarity searching methods, seminar
Molecular similarity searching methods, seminar
 

Similar to Fingerprinting Chemical Structures

SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
Reza Rahimi
 
Real Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth SensorsReal Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth Sensors
Wassim Filali
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
Rajarshi Guha
 
Lecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdfLecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdf
ssuserff72e4
 
TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with R
Fayan TAO
 
Globecom - MENS 2011 - Characterizing Signature Sets for Testing DPI Systems
Globecom - MENS 2011 - Characterizing Signature Sets for Testing DPI SystemsGlobecom - MENS 2011 - Characterizing Signature Sets for Testing DPI Systems
Globecom - MENS 2011 - Characterizing Signature Sets for Testing DPI Systems
Stenio Fernandes
 

Similar to Fingerprinting Chemical Structures (20)

SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
R Basics
R BasicsR Basics
R Basics
 
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
User biglm
User biglmUser biglm
User biglm
 
Real Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth SensorsReal Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth Sensors
 
Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with R
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Lecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdfLecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdf
 
TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with R
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Time series representations for better data mining
Time series representations for better data miningTime series representations for better data mining
Time series representations for better data mining
 
Complex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsComplex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutions
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
 
Globecom - MENS 2011 - Characterizing Signature Sets for Testing DPI Systems
Globecom - MENS 2011 - Characterizing Signature Sets for Testing DPI SystemsGlobecom - MENS 2011 - Characterizing Signature Sets for Testing DPI Systems
Globecom - MENS 2011 - Characterizing Signature Sets for Testing DPI Systems
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning models
 

More from Rajarshi Guha

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark Genome
Rajarshi Guha
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in context
Rajarshi Guha
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark Genome
Rajarshi Guha
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMC
Rajarshi Guha
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Rajarshi Guha
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?
Rajarshi Guha
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?
Rajarshi Guha
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network Models
Rajarshi Guha
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
Rajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Rajarshi Guha
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the parts
Rajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Rajarshi Guha
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the Pipes
Rajarshi Guha
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
Rajarshi Guha
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
Rajarshi Guha
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
Rajarshi Guha
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
Rajarshi Guha
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
Rajarshi Guha
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in R
Rajarshi Guha
 
PMML for QSAR Model Exchange
PMML for QSAR Model Exchange PMML for QSAR Model Exchange
PMML for QSAR Model Exchange
Rajarshi Guha
 

More from Rajarshi Guha (20)

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark Genome
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in context
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark Genome
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMC
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network Models
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the parts
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the Pipes
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in R
 
PMML for QSAR Model Exchange
PMML for QSAR Model Exchange PMML for QSAR Model Exchange
PMML for QSAR Model Exchange
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

Fingerprinting Chemical Structures

  • 1. Fingerprinting Chemical Structures
      Rajarshi Guha
      https://github.com/rajarshi/ctpa-fingerprints
      September 9, 2014
  • 2. High Throughput Screening
      • Test thousands to hundreds of thousands of compounds in one or more assays
          – Biochemical, genetic, pharmacological assays
      • Employs a robotic platform
      • Rapidly identify novel modulators of biological systems
          – Infectious agents
          – Cellular basis of diseases
  • 3. Goal of HTS
      • Rapidly screen large compound collections
      • Efficiently identify real actives
          – Test them in slower, accurate, expensive screens
      • Use the data to learn what types of compounds tend to be active
      • Use the model to suggest more compounds to screen
      [Figure: Number of Molecules across screening stages (labels: 300K, HTS, 1000, Cherry Picks, 300)]
  • 4. HTS Data Types
      • Categorical
          – active/inactive or toxic/nontoxic
      • Continuous
          – Single point
          – Dose response
      • Multiple readouts
          – Might read at different wavelengths or timepoints
          – More complex when dealing with imaging
      • These (usually) represent the dependent variable
      [Figures: Response vs. log10 Concentration; Response vs. Concentration]
  • 5. Independent Variable(s)
      • HTS tests the activity of a molecule – the molecule is our “independent variable”
      • Need to describe the molecular structure
          – Various discrete or real-valued descriptors
          – Surfaces (3D)
          – Binary fingerprints
      Activity = f(Structure)
  • 6. Fingerprint Representation
      • Lots of types of fingerprints
      • “Keyed” fingerprints indicate the presence or absence of a structural feature
      • Length can vary from 166 to 4096 bits or more
      • Fingerprints usually compared using the Tanimoto metric
      Example bit string: 1 0 1 1 0 0 0 1 0
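The Tanimoto comparison named on this slide can be sketched in base R. This is a minimal illustration rather than the fingerprint package's own implementation: the helper name `tanimoto` and the second fingerprint are made up for the example, and each fingerprint is represented simply as the vector of its "on" bit positions. The first fingerprint encodes the bit string 1 0 1 1 0 0 0 1 0.

```r
# Tanimoto similarity between two binary fingerprints, each given as the
# integer positions of the bits that are set to 1.
# `tanimoto` is a hypothetical helper written for this illustration.
tanimoto <- function(a, b) {
  shared <- length(intersect(a, b))   # bits set in both fingerprints
  total  <- length(union(a, b))       # bits set in at least one of them
  if (total == 0) return(NA_real_)    # undefined when no bit is set at all
  shared / total
}

fp1 <- c(1, 3, 4, 8)    # bit string 1 0 1 1 0 0 0 1 0
fp2 <- c(1, 4, 5, 8)    # bit string 1 0 0 1 1 0 0 1 0 (made-up example)

sim <- tanimoto(fp1, fp2)
print(sim)              # 3 shared bits out of 5 distinct set bits: 0.6
```

Because the score depends only on which bits are set, the same function works for any fingerprint length; 1 - sim is commonly used as the corresponding distance.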
  • 7. What Can I Use Them For?
      • Search
          – Given a potent active molecule, find similar ones (or dissimilar, but also potent)
      • Prediction
          – Given a set of active & inactive molecules build a model to predict which members from a large collection will be active
      • Clustering
          – Given a set of molecules, do they cluster into structurally different groups?
  • 8. Fingerprints in R
      • The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods
      • A fingerprint is an S4 object
          – Create them manually
              new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))
          – Read them in from files
              fp.read('data/cdk.fp', size=1024, lf=cdk.lf)
  • 9. Getting Fingerprints
      • You can also generate fingerprints from chemical structures using the rcdk package
      • If you’re not doing cheminformatics you can read in your own FP data by implementing a line reader
          – See cdk.lf, moe.lf, bci.lf
  • 10. Random Fingerprints
      • Useful for benchmarking, generating null distributions, exploring effects of bit density
        ## How long does a similarity matrix calculation take as a function of fp length?
        nfp <- 300
        sizes <- c(64, 128, 512, 1024, 4096, 8192)
        times <- sapply(sizes, function(size) {
          fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
          system.time(junk <- fp.sim.matrix(fps))[3]
        })
        ## For a given length, how does bit density affect calculation time?
        densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
        times <- sapply(densities, function(density) {
          fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
          system.time(junk <- fp.sim.matrix(fps))[3]
        })
  • 11. Random Fingerprints
      [Figures: Time (s) vs. Fingerprint Length; Time (s) vs. Bit Density]
  • 12. Compare Similarity Metrics
      • More than 20 similarity metrics
          – Some are written in C, so very fast, applicable to larger fingerprint collections
          – Others are in pure R, slow
        fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf, header=TRUE)[1:500]
        s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
        s.dice <- fp.sim.matrix(fps, method='dice')
        d <- rbind(data.frame(method='Tanimoto', s=as.numeric(s.tanimoto)),
                   data.frame(method='Dice', s=as.numeric(s.dice)))
      [Figure: density of Similarity values, by Metric (Dice, Tanimoto)]
  • 13. Predicting with Fingerprints
      • Read in fingerprints & convert to matrix form
      • See
          – data/solubility.csv
          – data/solubility.maccs
      • 33,182 observations of solubility
      • 57,857 fingerprints
      • Requires some data wrangling before modeling
      [Figure: Frequency by Solubility Class (high, low, medium)]
        OOB estimate of error rate: 22.37%
        Confusion matrix:
               high  low medium class.error
        high    181   52    621  0.78805621
        low      35 5611   4598  0.45226474
        medium   89 2029  19965  0.09591088
  • 14. Predicting with Fingerprints
      • The model will use MACCS keys
          – 166 bits
          – Each bit is associated with a structural feature
      • Low resolution, somewhat simplistic
      • Data comes in a non-standard format, so we must implement our own line reader
      • Classification problem – predict low/medium/high solubility
  • 15. Predicting with Fingerprints
        sol <- read.csv('data/solubility.csv', header=TRUE)
        fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
                       lf=function(line) {
                         toks <- strsplit(line, " ")[[1]]
                         title <- toks[1]
                         bits <- as.numeric(toks[2:length(toks)])
                         list(title, bits, list())
                       })
        ## Extract fingerprint for which we have a label
        common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
        fps <- fps[common]
        ## Order the fingerprints & data
        sol <- sol[order(sol$sid),]
        fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]
        ## Make X matrix
        fpm <- fp.to.matrix(fps)
        ## Model!
        library(randomForest)
        m1 <- randomForest(x=fpm, y=as.factor(sol$label))
  • 16. Predicting with Fingerprints
      • We can then use the RF variable importance measure
      • Features important for predictive performance
          – Presence of aromatic rings
          – Presence of charged atoms
          – Presence of 6-membered rings
          – N & O atoms connected in a chain
      • Chemically sensible
      [Figure: variable importance (MeanDecreaseGini) by MACCS bit]
      https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt
  • 17. Clustering with Fingerprints
      • Generate a distance matrix directly from a list of fingerprints
        fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf)[1:500]
        sims <- fp.sim.matrix(fps)
        dmat <- as.dist(1-sims)
        clus <- hclust(dmat)
        par(mar=c(1,4,1,1))
        plot(clus, labels=FALSE, xlab='', main='')
      [Figure: dendrogram; axis: Height]
      • Exercise: How do clusters vary with similarity metric and/or fingerprint type?
  • 18. Comparing Data Sets
      • How do we compare two sets of chemical structures?
          – Sizes may be different, and very large
      • Pairwise?
          – O(N^2) running time
          – Need to aggregate the resultant pairwise values
  • 19. Comparing Data Sets
      • How do we compare two sets of chemical structures?
          – Sizes may be different, and very large
      • Distributions?
          – Of what?
          – Can lead to multiple ways to generate a comparison
          – Data fusion?
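The pairwise option from the two slides above can be sketched in base R, assuming each dataset is stored as a 0/1 matrix with one row per molecule. Everything here is illustrative: the data are random, `tanimoto.cross` is a hypothetical helper (not part of the fingerprint package), and the mean and the per-row maximum are just two of many possible ways to aggregate the pairwise values.

```r
# Pairwise comparison of two fingerprint sets stored as 0/1 matrices
# (one row per molecule). `tanimoto.cross` is a hypothetical helper.
tanimoto.cross <- function(A, B) {
  shared <- A %*% t(B)                                    # bits set in both
  total  <- outer(rowSums(A), rowSums(B), "+") - shared   # bits set in either
  shared / total
}

set.seed(1)
nbit <- 64
A <- matrix(rbinom(50 * nbit, 1, 0.35), nrow = 50)   # dataset 1: 50 molecules
B <- matrix(rbinom(80 * nbit, 1, 0.35), nrow = 80)   # dataset 2: 80 molecules

S <- tanimoto.cross(A, B)      # 50 x 80 matrix: the work grows as N * M
mean.sim <- mean(S)            # one aggregate: the average similarity
nearest  <- apply(S, 1, max)   # another: best match in B for each row of A
```

The choice of aggregate changes what the comparison means, which is the difficulty the slides point to.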
  • 20. Bit Spectrum
      • Vector summary of the fingerprints for a dataset
      • Defined as the fraction of times a bit position is set to 1, for each bit position
      Illustration (~ 10K molecules):
          0    0    1     ...
          0    1    0     ...
          1    1    1     ...
          1    0    1     ...
          0.5  0.5  0.75  ...
      [Figure: Normalized Frequency vs. Bit Position]
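The definition above can be worked through in base R on a tiny example. This is a sketch under the assumption that the fingerprints are available as a 0/1 matrix with one row per molecule; the four 3-bit fingerprints are chosen so that the result is 0.5, 0.5, 0.75, the values shown on the slide.

```r
# Bit spectrum: for each bit position, the fraction of fingerprints in the
# dataset that have that bit set. Fingerprints are rows of a 0/1 matrix.
fpm <- rbind(c(0, 0, 1),
             c(0, 1, 0),
             c(1, 1, 1),
             c(1, 0, 1))

spectrum <- colMeans(fpm)   # column mean = fraction of rows with the bit on
print(spectrum)             # 0.50 0.50 0.75
```

The result has one value per bit position regardless of how many molecules went into it, which is what makes spectra from datasets of different sizes directly comparable.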
  • 21. Bit Spectrum
      • Now comparison of two datasets is an O(1) operation – independent of dataset size
          – Simply take the difference of the two bit spectra
      • e.g.: Compare ~ 800 solubles with > 30k insolubles
        ## make two subsets and generate bit spectra
        sol.idx <- which(sol$label == 'high')
        insol.idx <- which(sol$label != 'high')
        sol.bs <- bit.spectrum(fps[sol.idx])
        insol.bs <- bit.spectrum(fps[insol.idx])
        ## display a difference plot
        library(ggplot2)
        bsdiff <- sol.bs - insol.bs
        d <- data.frame(x=1:length(sol.bs), y=bsdiff)
        ggplot(d, aes(x=x,y=y))+geom_line()+
          xlab('Bit Position')+
          ylab('Normalized Frequency')+
          ylim(c(-1,1))
      [Figure: Δ Normalized Frequency vs. Bit Position]
  • 22. Explaining Poor Model Performance
      • Training set for model
      • Poor predictions on test set
      • Both test set classes look like the toxic class in the training set
      Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367
  • 23. Summary
      • Fingerprints are a useful representation for molecules – fast, objective, compact
      • But are applicable to other domains and objects
          – Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)
      • Useful for various tasks – search & comparison, prediction, clustering
      • The fingerprint package provides a domain agnostic way to handle binary fingerprints
  • 24.
  • 25. Comparing Clusterings
      • Generate multiple representations of a set of molecules
      • How differently do these representations cluster?
          – Measure correlation of clusters using cophenetic coefficient
      • A variety of R packages to support this
          – dendextend, clValid
  • 26. Comparing Clusterings
      [Figure: two dendrograms, panel titles “Pubchem 881” and “CDK Ext 1024”]
  • 27. Comparing Clusterings
      Pairwise cophenetic correlations for clusterings generated using different fingerprints
                      Pubchem    CDK Extended  CDK Graph  MACCS
        Pubchem       1.0000000  0.7075479     0.6879805  0.5752923
        CDK Extended  0.7075479  1.0000000     0.8050349  0.7386863
        CDK Graph     0.6879805  0.8050349     1.0000000  0.7288428
        MACCS         0.5752923  0.7386863     0.7288428  1.0000000
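A single entry of such a correlation table could be computed along the following lines. This is a hedged sketch using only base R's stats functions rather than the dendextend or clValid packages named earlier, and the two random binary matrices merely stand in for two fingerprint types calculated for the same molecules; it is not the code that produced the values above.

```r
# Cophenetic correlation between two hierarchical clusterings of the same
# objects. The random 0/1 matrices are made-up stand-ins for two different
# fingerprint types computed for the same 40 molecules.
set.seed(42)
n <- 40
rep1 <- matrix(rbinom(n * 128, 1, 0.3), nrow = n)
# Second representation shares half of its bits with the first
rep2 <- cbind(rep1[, 1:64], matrix(rbinom(n * 64, 1, 0.3), nrow = n))

# method = "binary" is the Jaccard distance, i.e. 1 - Tanimoto similarity
clus1 <- hclust(dist(rep1, method = "binary"))
clus2 <- hclust(dist(rep2, method = "binary"))

# cophenetic() gives, for every pair of objects, the dendrogram height at
# which the two are first merged into the same cluster
r <- cor(cophenetic(clus1), cophenetic(clus2))
print(r)
```

Repeating this for every pair of fingerprint types gives a symmetric matrix with ones on the diagonal, like the table on the last slide.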