Robots, Small Molecules & R
Ingredients for Exploring and Predicting Biological Effects
Rajarshi Guha
September 13, 2014
http://blog.rguha.net/

Hunting for Leads
Target Identification → Lead Discovery → Lead Optimization → Clinical Development
HTS:
• Assay Optimization: sensitivity, scaling
• Primary Screening: fluorescence, high content
• Cherry Picking: select subset to follow up, diversity
• Confirmation: counter screen, explore SAR

High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
– Infectious agents
– Cellular basis of diseases

Robots for Screening

HTS Workflow
• Rapidly screen large compound collections
• Efficiently identify real actives
– Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: number of molecules from HTS to cherry picks: 300K, 1000, 300]

Data Science Problems
• Predictive models for highly imbalanced datasets
• Global versus local models?
• Feature selection: data driven? Domain driven?
• Clustering & enrichment
• Similarity: definition, computation, performance
• Integration: chemical structures, numerical data, text (papers, patents), images

The Roles of R
• Data Access: ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
• Chemistry: rcdk, ChemmineR, fingerprint
• HTS QC: displayHTS, spdep
• Imaging: EBImage, rflowcyt, ripa, raster
• Visualization: grid, ggplot2, Shiny, ggvis, igraph
• Data Analysis: drc, igraph, randomForest, svm, ...
Also see the ChemPhys CRAN Task View

HTS Data Types – Single Point
[Figure: response (0 to 100) versus concentration]

HTS Data Types – Dose Response
[Figure: response versus log10 concentration]
$$y = S_0 + \frac{S_\infty - S_0}{1 + 10^{(\log AC_{50} - x)H}}$$
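
As a minimal sketch of the dose-response model above, the base R function below evaluates the four-parameter Hill curve. The function name `hill` and all parameter values are illustrative assumptions, not part of the original slides.

```r
# Four-parameter Hill curve: x is log10 concentration,
# s0 / sinf are the responses at zero / infinite concentration,
# log_ac50 is the log10 concentration giving the half-maximal response,
# h is the Hill slope.
hill <- function(x, s0, sinf, log_ac50, h) {
  s0 + (sinf - s0) / (1 + 10^((log_ac50 - x) * h))
}

# Hypothetical parameters, for illustration only
x <- seq(-2, 0, by = 0.5)
y <- hill(x, s0 = 0, sinf = 100, log_ac50 = -1, h = 1)

# At x = log AC50 the response is halfway between s0 and sinf
mid <- hill(-1, s0 = 0, sinf = 100, log_ac50 = -1, h = 1)
```

In practice the four parameters would be estimated from the measured responses, for example with a curve-fitting package such as drc, which the slides list under data analysis.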

HTS Data Types – Multiple Readouts
(and have this at multiple doses!)

HTS Data Types - Combinations

Activity = f(independent variable(s))

Features, Features, Features
• How do we “quantify” a chemical structure?

Features, Features, Features
• Charges
• Dipole moments
• Topological invariants
• Surface properties
• Binary fingerprints, e.g. 1 0 1 1 0 0 0 1 0

Working with Molecules in R
• A number of OSS libraries are available
• ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R
• They use rJava to interface with JOELib and CDK, respectively

rcdk
• Idiomatic R interface to the CDK library
– I/O support for chemical file formats
– Manipulation of atoms, bonds, molecules
– Generate molecular descriptors, fingerprints
library(rcdk)
mol <- parse.smiles('CCCC')[[1]]
mols <- load.molecules('http://www.rguha.net/mipe100.smi')

rcdk
• rcdk works with references to Java objects
– Can’t save them in a workspace (trivially)
> mol 
[1] "Java-Object{AtomContainer(2040919865, #A:4, Atom(2131361171, S:C, H:3, 
AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), 
Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, 
Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, 
Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, 
AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), #B:3, 
Bond(549041464, #O:SINGLE, #S:NONE, #A:2, Atom(2131361171, S:C, H:3, 
AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), 
Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, 
Element(1759969037, S:C, AN:6)))), ElectronContainer(549041464EC:2)), Bond(2654289, 
#O:SINGLE, #S:NONE, #A:2, Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, 
Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, 
AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), 
ElectronContainer(2654289EC:2)), Bond(1660962283, #O:SINGLE, #S:NONE, #A:2, 
Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, 
Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, 
Isotope(703168415, Element(703168415, S:C, AN:6)))), ElectronContainer(1660962283EC: 
2)))}" 
>

Calculating Molecular Features
• Evaluate a matrix of numerical features
mols <- load.molecules("mipe100.smi")
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
• End up with a rectangular data.frame
> str(descs)
'data.frame': 99 obs. of 195 variables:
$ nRings7 : num 1 0 1 0 0 0 0 0 0 0 ...
$ nRings8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ nRings9 : num 0 0 0 0 0 0 0 0 0 0 ...
$ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...

Calculating Fingerprints
• Binary string representation of molecular structure
– Objectively defined, fast to calculate
– Good for searching, clustering, prediction
library(fingerprint)
fps <- lapply(mols, get.fingerprint)
• The fingerprint package is used to represent them as S4 objects
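
Binary fingerprints like these are commonly compared with the Tanimoto coefficient (shared bits divided by bits set in either fingerprint). The base R sketch below illustrates the idea on plain vectors of "on" bit positions; the helper name `tanimoto` and the toy vectors are assumptions for illustration, and the fingerprint package provides its own similarity methods for its S4 objects.

```r
# Tanimoto similarity between two fingerprints, each given as a
# vector of the positions of the bits that are set
tanimoto <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical toy fingerprints
fp1 <- c(15, 18, 45, 73)
fp2 <- c(18, 45, 73, 96, 107)

# 3 shared bits out of 6 distinct bits
sim <- tanimoto(fp1, fp2)
```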

Calculating Fingerprints
• Methods to compute similarities, generate summaries & manipulate fingerprints
> fps[[1]] 
Fingerprint object 
name = 
length = 1024 
folded = FALSE 
source = CDK 
bits on = 15 18 45 73 77 78 79 85 87 96 107 109 129 139 149 159 
162 166 172 179 194 209 214 223 225 227 239 254 266 272 301 312 327 
335 350 354 359 392 393 395 397 415 435 455 486 491 492 499 534 535 
541 543 544 545 546 559 575 600 605 618 621 622 626 635 638 644 645 
647 690 723 728 742 743 753 754 800 819 831 832 889 893 913 922 930 
936 954 985 988 1005 1008 1016 
>

Use Case - SAR
• Cluster molecules by structure and examine whether clusters are enriched in activity
library(chemblr); library(rcdk)
## fetch the activities for one ChEMBL assay, then the compound records
d <- get.activity(chembl.id='CHEMBL857155', type='assay')
cmpds <- lapply(d$ingredient_cmpd_chemblid, get.compound, type='chemblid')
cmpds <- do.call(rbind,
                 lapply(cmpds, function(x)
                   data.frame(x$chemblId, x$smiles, stringsAsFactors=FALSE)))
## fingerprints, pairwise similarity matrix, hierarchical clustering
mols <- parse.smiles(cmpds$x.smiles)
fps <- lapply(mols, get.fingerprint)
sm <- fp.sim.matrix(fps)
rownames(sm) <- cmpds$x.chemblId
dm <- as.dist(1-sm)
clus <- hclust(dm)
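
The slide stops at the clustering step. As a hedged sketch of the follow-up question (are clusters enriched in activity?), the base R example below clusters toy fingerprints and applies a Fisher exact test. The fingerprint matrix, the activity labels, and the choice of two clusters are all hypothetical, and the Fisher test is one reasonable option rather than the author's stated method.

```r
# Toy binary fingerprints: 6 molecules x 6 bits (hypothetical data)
fp <- rbind(
  c(1, 1, 0, 0, 0, 0),
  c(1, 1, 1, 0, 0, 0),
  c(1, 0, 1, 0, 0, 0),
  c(0, 0, 0, 1, 1, 0),
  c(0, 0, 0, 1, 1, 1),
  c(0, 0, 0, 1, 0, 1)
)
active <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # hypothetical labels

# Pairwise Tanimoto similarity computed from bit counts
common <- fp %*% t(fp)          # bits shared by each pair of molecules
nbits  <- rowSums(fp)           # bits set in each molecule
sm     <- common / (outer(nbits, nbits, "+") - common)

# Hierarchical clustering on 1 - similarity, cut into two clusters
clus   <- hclust(as.dist(1 - sm))
groups <- cutree(clus, k = 2)

# Cluster membership versus activity, tested for association
tab <- table(cluster = groups, active = active)
ft  <- fisher.test(tab)
```

With real screening data the number of clusters and the activity threshold would both need to be chosen with care.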

Use Case - SAR
[Figure: dendrogram of the clustered compounds, leaves labeled by ChEMBL ID]

Use Case - Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
Example (one fingerprint per row, continuing for ~10K molecules):
0 0 1
0 1 0
1 1 1
1 0 1
...
Bit spectrum of the four rows shown: 0.5 0.5 0.75
[Figure: normalized frequency (0 to 1) versus bit position]
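
The definition above can be sketched in base R using the four example fingerprints from the slide. Storing the fingerprints as rows of a 0/1 matrix is an assumption made for illustration; the fingerprint package works on its own S4 objects instead.

```r
# Rows are molecules, columns are bit positions
# (the four example fingerprints shown on the slide)
fp <- rbind(
  c(0, 0, 1),
  c(0, 1, 0),
  c(1, 1, 1),
  c(1, 0, 1)
)

# Bit spectrum: fraction of molecules that have each bit set
bs <- colMeans(fp)
```

The result has one entry per bit position, so comparing two datasets only requires comparing two vectors of equal length.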

Use Case - Bit Spectrum
• Comparison of two datasets is now O(n)
• Simply take the difference of the two bit spectra
• e.g.: Compare ~800 solubles with >30k insolubles
[Figure: Δ normalized frequency (−1.0 to 1.0) versus bit position]
library(fingerprint); library(ggplot2)
## assumes 'sol' holds a solubility 'label' column and 'fps' the matching fingerprints
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+
  ylab('Normalized Frequency')+
  ylim(c(-1,1))

PREDICTIVE MODELS - CAVEATS

Building Models is the Easy Part
• Given a descriptor data.frame or fingerprint list we’re ready to build models
– caret, caretEnsemble
• Question is whether the model(s) can generalize
• Applicability is a key consideration when predicting bioactivity
– Has economic & safety ramifications in regulatory environments

Domain Applicability
• How dissimilar to the training set do you have to be before the prediction is meaningless?
– Distance to training set? Inside/outside convex hull
– Comparison of bit spectra
[Figure: training set and test set]
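
One simple way to make "distance to training set" concrete is the similarity of a query molecule to its nearest training-set neighbour. The base R sketch below assumes binary fingerprints stored as 0/1 vectors; the data, the helper names `tanimoto` and `nn_sim`, and the 0.5 cutoff are all hypothetical.

```r
# Tanimoto similarity on 0/1 vectors of equal length
tanimoto <- function(a, b) sum(a & b) / sum(a | b)

# Hypothetical binary fingerprints: rows are training molecules
train <- rbind(
  c(1, 1, 0, 0, 1, 0),
  c(1, 1, 1, 0, 0, 0),
  c(0, 1, 1, 0, 1, 0)
)
query_near <- c(1, 1, 0, 0, 1, 1)  # resembles the first training row
query_far  <- c(0, 0, 0, 1, 0, 1)  # shares no bits with the training set

# Similarity of a query to its nearest training-set neighbour
nn_sim <- function(q, train) max(apply(train, 1, tanimoto, b = q))

# Assumed cutoff; in practice this would be tuned on held-out data
cutoff <- 0.5
in_domain <- c(near = nn_sim(query_near, train) >= cutoff,
               far  = nn_sim(query_far,  train) >= cutoff)
```

Predictions for queries flagged as outside the domain would then be reported with a warning or withheld.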

Global vs Local Models
• Bioassay data is not really big data
• Can big data be too big?
• AID 1996 – 57K measurements of aqueous solubility
• Do we build one model?
• Or multiple local models?
[Figure: PCA of 166 binary features]

RESPONSE SURFACES

Screening Drug Combinations
• Increased efficacy
• Delay resistance
• Attenuate toxicity
• Inform signaling pathway connectivity
• Identify synthetic lethality
• Polypharmacology
[Side labels: Translational Interest, Basic Interest]

How to Test Combinations
• Many procedures described in the literature
– Fixed dose ratio (aka ray)
– Ray contour
– Checkerboard
– Genetic algorithm
C5,D5 C5 
C4,D4 C4 
C3,D3 C3 
C2,D2 C2 
C1,D5 C1,D4 C1,D3 C1,D2 C1,D1 C1 
D5 D4 D3 D2 D1 0

How to Test Combinations
• Many procedures described in the literature
– Fixed dose ratio (aka ray)
– Ray contour
– Checkerboard
– Genetic algorithm
[Figure: response matrices for Vargatef, DCC-2036, PD-166285, GDC-0941, PI-103, GDC-0980, Bardoxolone methyl, AT-7519, SNS-032, NCGC00188382-01, Lestaurtinib, CNF-2024, ISOX, Belinostat, PF-477736, AZD-7762]

Why Similarity?
• Vargatef exhibited anomalous matrix response compared to other VEGFR inhibitors
[Figure: response matrices for Vargatef, Linifanib, Axitinib, Sorafenib, Vatalanib, Motesanib, Tivozanib, Brivanib, Telatinib, Cabozantinib, Cediranib, BMS-794833, Lenvatinib, OSI-632, Foretinib, Regorafenib]

When are Combinations Similar?
• Differences and their aggregates such as RMSD can lead to degeneracy
• Instead we’re interested in the shape of the surface
• How to characterize shape?
– Parametrized fits
– Distribution of responses
[Figure: three response distributions (x-axis 0 to 100), annotated with D and p value]
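
The degeneracy of aggregate differences can be illustrated with a small base R sketch: two response matrices that differ from a reference in different places, yet have the same RMSD to it. The matrices and the helper `rmsd` are hypothetical and chosen only to make the point.

```r
# Root mean squared deviation between two matrices of equal size
rmsd <- function(a, b) sqrt(mean((a - b)^2))

# Hypothetical 3 x 3 response matrices (percent response)
ref <- matrix(50, nrow = 3, ncol = 3)

# Surface A differs from the reference along the top row only
a <- ref
a[1, ] <- a[1, ] + 30

# Surface B differs from the reference along the diagonal only
b <- ref
b[cbind(1:3, 1:3)] <- b[cbind(1:3, 1:3)] - 30

rmsd_a <- rmsd(ref, a)
rmsd_b <- rmsd(ref, b)
```

Both values are identical although the two surfaces have clearly different shapes, which is why a shape-aware comparison is needed.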

Similarity via the Syrjala Test
• Syrjala test used to compare population distributions over a spatial grid
– Invariant to grid orientation
– Provides an empirical p-value
• Less degenerate than just considering 1D distributions
[Figure: density of D (D from 0.00 to 0.75)]
Syrjala, S.E., “A Statistical Test for a Difference between the Spatial Distributions of Two Populations”, Ecology, 1996, 77(1), 75-80
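
The following base R code is a simplified sketch in the spirit of the Syrjala test, not the implementation used for the slides. It assumes two non-negative response matrices on the same grid (at least 2 x 2), compares their normalised cumulative distributions from all four corners, and estimates a p-value by randomly swapping the two values at each grid cell. The function names and the permutation details are assumptions.

```r
# Cumulative sums over rows and columns, starting at the top-left corner
# (assumes at least two rows and two columns)
cum2d <- function(m) t(apply(apply(m, 2, cumsum), 1, cumsum))

# Sum of squared differences between the cumulative distributions,
# averaged over the four corners of the grid; inputs already normalised
psi_stat <- function(a, b) {
  nr <- nrow(a); nc <- ncol(a)
  flips <- list(
    function(m) m,
    function(m) m[nr:1, , drop = FALSE],
    function(m) m[, nc:1, drop = FALSE],
    function(m) m[nr:1, nc:1, drop = FALSE]
  )
  mean(sapply(flips, function(f) sum((cum2d(f(a)) - cum2d(f(b)))^2)))
}

syrjala_sketch <- function(a, b, nperm = 999) {
  a <- a / sum(a); b <- b / sum(b)      # normalise each surface to sum to 1
  obs <- psi_stat(a, b)
  perm <- replicate(nperm, {
    # randomly swap the pair of values at each grid cell
    swap <- matrix(runif(length(a)) < 0.5, nrow(a), ncol(a))
    psi_stat(ifelse(swap, b, a), ifelse(swap, a, b))
  })
  list(psi = obs, p.value = (1 + sum(perm >= obs)) / (nperm + 1))
}

set.seed(1)
m1 <- matrix(c(9, 5, 1, 5, 3, 1, 1, 1, 1), nrow = 3)  # hypothetical surface
m2 <- m1[3:1, 3:1]                                    # same values, flipped
res <- syrjala_sketch(m1, m2, nperm = 199)
```

Averaging over the four corners is what makes the statistic insensitive to the orientation of the grid.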

Clustering Response Surfaces
[Figure: dendrogram (height 0.0 to 0.8) with four clusters: C1 (24), C2 (47), C3 (35), C4 (24)]

Working in “Combination Space”
• Each cell line is represented as a vector of response matrices
• “Distance” between two cell lines is a function of the distance between component response matrices
• F can be min, max, mean, …
$$D(L_1, L_2) = F(\{d_1, d_2, \ldots, d_n\})$$
[Figure: cell lines L1 and L2 as rows of response matrices, with d1 to d5 the distances between corresponding matrices]
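
The distance definition above can be sketched in base R as follows. The Euclidean distance between matrices, the helper names `mat_dist` and `line_dist`, and the toy data are assumptions; the argument `agg` plays the role of F.

```r
# Distance between two response matrices (here: Euclidean)
mat_dist <- function(a, b) sqrt(sum((a - b)^2))

# Distance between two cell lines, each given as a list of response
# matrices for the same drug combinations in the same order
line_dist <- function(l1, l2, agg = mean) {
  d <- mapply(mat_dist, l1, l2)   # d1, d2, ..., dn
  agg(d)
}

# Hypothetical data: two cell lines, three 2 x 2 response matrices each
L1 <- list(matrix(0, 2, 2), matrix(1, 2, 2), matrix(2, 2, 2))
L2 <- list(matrix(0, 2, 2), matrix(2, 2, 2), matrix(5, 2, 2))

d_min  <- line_dist(L1, L2, agg = min)
d_max  <- line_dist(L1, L2, agg = max)
d_mean <- line_dist(L1, L2, agg = mean)
```

Here the component distances are 0, 2 and 6, so the three aggregations give quite different answers for the same pair of cell lines.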

Many Choices to Make
[Figure: four dendrograms of the same 20 cell lines, one per aggregation function: sum, max, min, euc. Cell lines: KMS-34, INA-6, L363, OPM-1, XG-2, FR4, AMO-1, XG-6, MOLP-8, ANBL-6, KMS-20, XG-7, OCI-MY1, XG-1, 8226, EJM, U266, KMS-11LB, SKMM-1, MM-MM1]
NETWORKS

Networks & Integration
• Network models of molecules and targets are common
– Allows for the incorporation of lots of associated information
– Diseases, pathways, OTEs
• When linked with clinical data & outcomes, we can generate massive networks
– Adverse events (FDA AERS)
– Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
Yildirim, M.A. et al

Networks & Integration
• SAR data can be viewed in a network form
– SALI, SARI based networks
– Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
Peltason, L et al
http://sali.rguha.net/

Networks & Integration
• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
– Network based similarity
– Community detection (aka clustering)
– PageRank style ranking (of targets, compounds, …)
– Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
Bauer-Mehren et al
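
As a small illustration of PageRank style ranking, the base R sketch below runs power iteration on a toy directed network. The function `pagerank`, the damping factor of 0.85, and the four-node graph are assumptions for illustration; at realistic scale a graph library (the slides list igraph) or the cluster tooling mentioned above would be used instead.

```r
# Power-iteration PageRank on a small directed graph.
# adj[i, j] = 1 means an edge from node i to node j.
pagerank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  out <- rowSums(adj)
  # Row-normalise; nodes without outgoing edges link to every node
  P <- adj / ifelse(out == 0, 1, out)
  P[out == 0, ] <- 1 / n
  r <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    r_new <- as.vector((1 - damping) / n + damping * (t(P) %*% r))
    if (sum(abs(r_new - r)) < tol) break
    r <- r_new
  }
  r_new
}

# Hypothetical network: nodes 1 to 3 point to node 4, node 4 points to node 1
adj <- matrix(0, 4, 4)
adj[1, 4] <- adj[2, 4] <- adj[3, 4] <- adj[4, 1] <- 1
pr <- pagerank(adj)
```

The resulting scores sum to 1, and node 4, which receives the most links, ranks highest.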

Combinations as Networks
Combination screens lend themselves naturally to network representations
[Figure: two network plots with Δ Bliss+ legends (0.0 to −4.3 and 0.0 to −3.4); labels: immune system process, apoptotic process, transcription from RNA polymerase II promoter, protein phosphorylation, cell communication, immune response]

Combinations as Networks
• Things get more interesting when we have n × m screens
• Can be simplified using a variety of methods
– Neighborhoods
– Minimum Spanning Tree
[Figure: network plot]

Comparing Neighborhoods
Combinations that have DBSumNeg < 1st quartile value for that strain
[Figure: three panels, one per strain: 3D7, DD2, HB3]

Identifying the Most Synergistic Pairs
[Figure: network plot]

Summary
• The HTS workflow presents multiple data science problems involving (unique) data types
• R can play a role at several stages, but model building is straightforward
• Representation is key and guides the types and nature of analyses
