Robots, Small Molecules & R

Ingredients for Exploring and Predicting Biological Effects
Rajarshi Guha
September 13, 2014
http://blog.rguha.net/

HTS: Hunting for Leads
Pipeline: Target Identification → Lead Discovery → Lead Optimization → Clinical Development
• Assay Optimization: sensitivity, scaling
• Primary Screening: fluorescence, high content
• Cherry Picking: select subset to follow up, diversity
• Confirmation: counter screen, explore SAR

High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
  – Infectious agents
  – Cellular basis of diseases

HTS Workflow
• Rapidly screen large compound collections
• Efficiently identify real actives
  – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: screening funnel, Number of Molecules per stage: 300K, 1000, 300; stages labeled HTS and Cherry Picks]

Data Science Problems
• Predictive models for highly imbalanced datasets
• Global versus local models?
• Feature selection – data driven? Domain driven?
• Clustering & enrichment
• Similarity – definition, computation, performance
• Integration – chemical structures, numerical data, text (papers, patents), images

The Roles of R
Data Access: ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
Chemistry: rcdk, ChemmineR, fingerprint
HTS QC: displayHTS, spdep
Imaging: EBImage, rflowcyt, ripa, raster
Visualization: grid, ggplot, Shiny, ggvis, igraph
Data Analysis: drc, igraph, randomForest, svm, ...
Also see the ChemPhys CRAN Task View

HTS Data Types – Single Point
[Figure: single-point measurements; x axis: Concentration, y axis: Response (0 to 100)]

HTS Data Types – Dose Response
[Figure: dose-response curve; x axis: log10 Concentration, y axis: Response (0 to 120)]
y = S_0 + \frac{S_{\mathrm{inf}} - S_0}{1 + 10^{(\log AC_{50} - x) H}}
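
A minimal sketch of the curve above in base R may help make the parameters concrete. The function name `hill` and all parameter values are illustrative assumptions, not taken from the slides; `x` is the log10 concentration. In practice, fitting would typically be left to a dedicated package such as drc, which is listed under Data Analysis in this deck.

```r
# Four-parameter Hill curve as written on the slide:
#   y = S0 + (Sinf - S0) / (1 + 10^((logAC50 - x) * H))
# x and logAC50 are both on the log10 concentration scale.
hill <- function(x, S0, Sinf, logAC50, H) {
  S0 + (Sinf - S0) / (1 + 10^((logAC50 - x) * H))
}

# Illustrative (made-up) parameters: response rises from 0 towards 100
x <- seq(-9, -4, by = 0.5)
y <- hill(x, S0 = 0, Sinf = 100, logAC50 = -6.5, H = 1)

# At x == logAC50 the response is halfway between S0 and Sinf
half <- hill(-6.5, S0 = 0, Sinf = 100, logAC50 = -6.5, H = 1)
```

Here `H` controls the steepness of the transition and `logAC50` its midpoint, which is why `half` lands exactly between `S0` and `Sinf`.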

HTS Data Types – Multiple Readouts
(and have this at multiple doses!)

HTS Data Types - Combinations

Activity = f(Independent Variable(s))

Features, Features, Features
• How do we “quantify” a chemical structure?

Features, Features, Features
• Charges
• Dipole moments
• Topological invariants
• Surface properties
• Binary fingerprints: 1 0 1 1 0 0 0 1 0

Working with Molecules in R
• A number of OSS libraries are available
• ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R
• They use rJava to interface with JOELib and CDK, respectively

rcdk
• Idiomatic R interface to the CDK library
  – I/O support for chemical file formats
  – Manipulation of atoms, bonds, molecules
  – Generate molecular descriptors, fingerprints
library(rcdk)
mol <- parse.smiles('CCCC')[[1]]
mols <- load.molecules('http://www.rguha.net/mipe100.smi')

rcdk
• rcdk works with references to Java objects
  – Can’t save them in a workspace (trivially)
> mol
[1] "Java-Object{AtomContainer(2040919865, #A:4, Atom(2131361171, S:C, H:3,
AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))),
Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037,
Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0,
Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3,
AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), #B:3,
Bond(549041464, #O:SINGLE, #S:NONE, #A:2, Atom(2131361171, S:C, H:3,
Element(1759969037, S:C, AN:6)))), ElectronContainer(549041464EC:2)), Bond(2654289,
#O:SINGLE, #S:NONE, #A:2, Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0,
Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2,
ElectronContainer(2654289EC:2)), Bond(1660962283, #O:SINGLE, #S:NONE, #A:2,
Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0,
Isotope(703168415, Element(703168415, S:C, AN:6)))), ElectronContainer(1660962283EC:
2)))}"
>

Calculating Molecular Features
• Evaluate a matrix of numerical features
mols <- load.molecules("mipe100.smi")
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
• End up with a rectangular data.frame
> str(descs)
'data.frame': 99 obs. of 195 variables:
$ nRings7 : num 1 0 1 0 0 0 0 0 0 0 ...
$ nRings8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ nRings9 : num 0 0 0 0 0 0 0 0 0 0 ...
$ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...

Calculating Fingerprints
• Binary string representation of molecular structure
  – Objectively defined, fast to calculate
  – Good for searching, clustering, prediction
library(fingerprint)
fps <- lapply(mols, get.fingerprint)
• The fingerprint package is used to represent them as S4 objects

Calculating Fingerprints
• Methods to compute similarities, generate summaries & manipulate fingerprints
> fps[[1]]
Fingerprint object
name =
length = 1024
folded = FALSE
source = CDK
bits on = 15 18 45 73 77 78 79 85 87 96 107 109 129 139 149 159
162 166 172 179 194 209 214 223 225 227 239 254 266 272 301 312 327
335 350 354 359 392 393 395 397 415 435 455 486 491 492 499 534 535
541 543 544 545 546 559 575 600 605 618 621 622 626 635 638 644 645
647 690 723 728 742 743 753 754 800 819 831 832 889 893 913 922 930
936 954 985 988 1005 1008 1016
>
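
To illustrate what "computing similarities" means for such fingerprints, here is a minimal base R sketch of the Tanimoto coefficient, a commonly used fingerprint similarity. It works directly on vectors of "bits on" positions like those printed above. The function name `tanimoto` and the toy bit positions are assumptions for illustration; this is not the fingerprint package's own API.

```r
# Tanimoto similarity between two fingerprints, each given as the
# integer positions of its "on" bits.
tanimoto <- function(bits_a, bits_b) {
  common <- length(intersect(bits_a, bits_b))
  total  <- length(union(bits_a, bits_b))
  if (total == 0) return(NA_real_)  # both fingerprints are empty
  common / total
}

# Toy fingerprints (made-up bit positions)
fp1 <- c(15, 18, 45, 73, 77)
fp2 <- c(15, 18, 45, 96, 107, 109)

sim   <- tanimoto(fp1, fp2)  # 3 shared bits out of 8 distinct bits
dsim  <- 1 - sim             # dissimilarity, usable as a clustering distance
```

Converting the similarity to `1 - sim` is the same step the SAR use case applies to a full similarity matrix before clustering.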

Use Case - SAR
• Cluster molecules by structure and examine whether clusters are enriched in activity
library(chemblr); library(rcdk); library(fingerprint)
## retrieve activities for a ChEMBL assay
d <- get.activity(chembl.id='CHEMBL857155', type='assay')
## look up each compound to get its SMILES
cmpds <- lapply(d$ingredient_cmpd_chemblid, get.compound,
                type='chemblid')
cmpds <- do.call(rbind,
                 lapply(cmpds, function(x)
                   data.frame(x$chemblId, x$smiles,
                              stringsAsFactors=FALSE)))
## parse structures and compute fingerprints
mols <- parse.smiles(cmpds$x.smiles)
fps <- lapply(mols, get.fingerprint)
## pairwise similarity -> distance -> hierarchical clustering
sm <- fp.sim.matrix(fps)
rownames(sm) <- cmpds$x.chemblId
dm <- as.dist(1-sm)
clus <- hclust(dm)
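
The slide code stops at the clustering step. One possible way to examine enrichment, sketched here in base R, is to cut the tree into clusters and run a one-sided Fisher exact test per cluster. Everything below is an assumption for illustration: the data are synthetic stand-ins for `dm` and the activity labels, `k = 2` is arbitrary, and Fisher's test is just one reasonable choice of enrichment statistic.

```r
set.seed(1)
# Synthetic stand-in for the distance object: 30 "molecules" forming
# two well-separated groups in a 2-D feature space.
feats <- rbind(matrix(rnorm(30, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(30, mean = 3, sd = 0.3), ncol = 2))
dm_toy <- dist(feats)
# Made-up activity labels: 12 of the first 15 molecules are active
active <- c(rep(TRUE, 12), rep(FALSE, 3), rep(FALSE, 15))

clus_toy <- hclust(dm_toy)
members  <- cutree(clus_toy, k = 2)  # k chosen arbitrarily here

# One-sided Fisher exact test per cluster:
# is the cluster enriched in actives relative to the rest?
cluster_ids <- sort(unique(members))
enrich_p <- sapply(cluster_ids, function(k) {
  in_cluster <- factor(members == k, levels = c(TRUE, FALSE))
  is_active  <- factor(active, levels = c(TRUE, FALSE))
  fisher.test(table(in_cluster, is_active),
              alternative = "greater")$p.value
})
names(enrich_p) <- cluster_ids
```

A small p-value for a cluster suggests its members are active more often than expected by chance; with many clusters, a multiple-testing correction (for example `p.adjust`) would be advisable.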

Use Case - SAR
[Figure: dendrogram from the hierarchical clustering, with leaves labeled by ChEMBL compound ID (e.g. CHEMBL331502, CHEMBL328164, CHEMBL52551, ...)]

Use Case - Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
Example (four fingerprints, three bit positions shown):
0 0 1 ...
0 1 0 ...
1 1 1 ...
1 0 1 ...
Bit spectrum: 0.5 0.5 0.75 ...
[Figure: bit spectrum for ~10K molecules; x axis: Bit Position, y axis: Normalized Frequency (0 to 1)]
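
The definition can be sketched in a couple of lines of base R, using the four example fingerprints from this slide as rows of a 0/1 matrix. The variable names are illustrative; with real fingerprint objects, the fingerprint package's `bit.spectrum` function (used later in this deck) does the equivalent job.

```r
# Rows are fingerprints, columns are bit positions (example from the slide)
fp_matrix <- matrix(c(0, 0, 1,
                      0, 1, 0,
                      1, 1, 1,
                      1, 0, 1),
                    ncol = 3, byrow = TRUE)

# Bit spectrum: fraction of fingerprints with each bit set to 1
bit_spectrum <- colMeans(fp_matrix)
bit_spectrum  # 0.50 0.50 0.75
```

Because the summary is a single pass over the fingerprints, its cost grows linearly with the number of molecules.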

Use Case - Bit Spectrum
• Comparison of two datasets is now O(n)
• Simply take the difference of the two bit spectra
e.g.: Compare ~800 solubles with >30k insolubles
[Figure: difference plot; x axis: Bit Position, y axis: Δ Normalized Frequency (-1.0 to 1.0)]
library(fingerprint); library(ggplot2)
## assumes `fps` (list of fingerprints) and `sol` (data.frame with a
## `label` column, in the same order as `fps`) are already loaded
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=seq_along(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+
  ylab('Normalized Frequency')+
  ylim(c(-1,1))

PREDICTIVE MODELS - CAVEATS

Building Models is the Easy Part
• Given a descriptor data.frame or fingerprint list, we’re ready to build models
  – caret, caretEnsemble
• The question is whether the model(s) can generalize
• Applicability is a key consideration when predicting bioactivity
  – Has economic & safety ramifications in regulatory environments

Domain Applicability
• How dissimilar to the training set do you have to be before the prediction is meaningless?
  – Distance to training set? Inside/outside convex hull
  – Comparison of bit spectra
[Figure: Training Set versus Test Set]
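
As a rough illustration of the "distance to training set" idea, the base R sketch below flags test compounds whose nearest-neighbour distance to the training set exceeds a cutoff derived from the training set itself. The data are synthetic, the helper name `nn_dist` is hypothetical, and the 95th percentile cutoff is an assumed convention rather than anything prescribed by the slides.

```r
set.seed(42)
# Synthetic descriptor matrices (rows = compounds, columns = descriptors)
train <- matrix(rnorm(200), ncol = 4)                  # 50 training compounds
test  <- rbind(matrix(rnorm(20), ncol = 4),            # 5 compounds resembling the training set
               matrix(rnorm(20, mean = 8), ncol = 4))  # 5 far-away compounds

# Euclidean distance from one compound to its nearest reference compound
nn_dist <- function(x, ref) {
  min(sqrt(rowSums(sweep(ref, 2, x)^2)))
}

# Reference distribution: nearest-neighbour distance within the training
# set (each compound is compared against all *other* training compounds)
train_nn <- sapply(seq_len(nrow(train)),
                   function(i) nn_dist(train[i, ], train[-i, , drop = FALSE]))
cutoff <- quantile(train_nn, 0.95)  # assumed threshold, not from the slides

test_nn   <- apply(test, 1, nn_dist, ref = train)
in_domain <- test_nn <= cutoff
```

Predictions for compounds with `in_domain == FALSE` would then be treated with suspicion or withheld. For fingerprints, the same scheme could use a Tanimoto-based distance instead of Euclidean distance.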

Global vs Local Models
• Bioassay data is not really big data
• Can big data be too big?
• AID 1996
  – 57K measurements of aqueous solubility
• Do we build one model?
• Or multiple local models?
[Figure: PCA of 166 Binary Features]

Screening Drug Combinations
• Increased efficacy
• Delay resistance
• Attenuate toxicity
• Inform signaling pathway connectivity
• Identify synthetic lethality
• Polypharmacology
[Figure annotation: motivations range from Translational Interest to Basic Interest]

How to Test Combinations
• Many procedures described in the literature
  – Fixed dose ratio (aka ray)
  – Ray contour
  – Checkerboard
  – Genetic algorithm
[Figure: dose layout for compounds C and D at levels C1 to C5 and D1 to D5, showing the diagonal pairs (C1,D1) to (C5,D5), C1 paired with each of D1 to D5, and the single agents]

How to Test Combinations
[Figure: 16 panels labeled Vargatef, DCC-2036, PD-166285, GDC-0941, PI-103, GDC-0980, Bardoxolone methyl, AT-7519, SNS-032, NCGC00188382-01, Lestaurtinib, CNF-2024, ISOX, Belinostat, PF-477736, AZD-7762]
Why Similarity?
• Vargatef exhibited anomalous matrix response compared to other VEGFR inhibitors
[Figure: panels for Vargatef and other VEGFR inhibitors: Linifanib, Axitinib, Sorafenib, Vatalanib, Motesanib, Tivozanib, Brivanib, Telatinib, Cabozantinib, Cediranib, BMS-794833, Lenvatinib, OSI-632, Foretinib, Regorafenib]

When are Combinations Similar?
• Differences and their aggregates, such as RMSD, can lead to degeneracy
• Instead we’re interested in the shape of the surface
• How to characterize shape?
  – Parametrized fits
  – Distribution of responses
[Figure: three distribution plots of responses (x axis: 0 to 100), annotated with D and p value]

Similarity via the Syrjala Test
• Syrjala test used to compare population distributions over a spatial grid
  – Invariant to grid orientation
  – Provides an empirical p-value
• Less degenerate than just considering 1D distributions
[Figure: density plot of D; x axis: D (0.00 to 0.75), y axis: density]
Syrjala, S.E., “A Statistical Test for a Difference between the Spatial Distributions of Two Populations”, Ecology, 1996, 77(1), 75-80

Clustering Response Surfaces
[Figure: dendrogram (height 0.0 to 0.8) with four clusters: C1 (24), C2 (47), C3 (35), C4 (24)]

Working in “Combination Space”
• Each cell line is represented as a vector of response matrices
• “Distance” between two cell lines is a function of the distance between component response matrices
• F can be min, max, mean, …
D(L_1, L_2) = F(\{d_1, d_2, \ldots, d_n\})
[Figure: cell lines L1 and L2 shown as rows of response matrices; each pair of corresponding matrices yields one distance d1, d2, ..., d5]
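
A minimal base R sketch of this aggregation follows. Each cell line is a list of response matrices, one per drug combination, in matching order. The per-matrix distance used here (a root-mean-square difference) is only a placeholder, since the deck itself notes that RMSD-style measures can be degenerate. The function names, the argument name `agg` (standing in for F) and the toy values are all assumptions for illustration.

```r
# Placeholder distance between two response matrices (d_i in the formula)
matrix_dist <- function(a, b) sqrt(mean((a - b)^2))

# D(L1, L2) = agg({d1, d2, ..., dn}), where agg plays the role of F
cell_line_dist <- function(line1, line2, agg = mean) {
  stopifnot(length(line1) == length(line2))
  d <- mapply(matrix_dist, line1, line2)  # d1, d2, ..., dn
  agg(d)
}

# Toy data: two cell lines, three 2x2 response matrices each (made-up values)
L1 <- list(matrix(c(100, 80, 60, 40), 2), matrix(50, 2, 2), matrix(10, 2, 2))
L2 <- list(matrix(c(100, 80, 60, 40), 2), matrix(40, 2, 2), matrix(30, 2, 2))

d_min  <- cell_line_dist(L1, L2, agg = min)   # 0
d_max  <- cell_line_dist(L1, L2, agg = max)   # 20
d_mean <- cell_line_dist(L1, L2, agg = mean)  # 10
```

The three results differ even on this tiny example, which is the point of the "many choices" caveat: the choice of aggregation function can change which cell lines look similar.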

Many Choices to Make
[Figure: four dendrograms of the same 20 cell lines (KMS-34, INA-6, L363, OPM-1, XG-2, FR4, AMO-1, XG-6, MOLP-8, ANBL-6, KMS-20, XG-7, OCI-MY1, XG-1, 8226, EJM, U266, KMS-11LB, SKMM-1, MM-MM1), one per aggregation function: sum, max, min, euc; the groupings differ between panels]

Networks & Integration
• Network models of molecules and targets are common
  – Allows for the incorporation of lots of associated information
  – Diseases, pathways, OTEs
• When linked with clinical data & outcomes, we can generate massive networks
  – Adverse events (FDA AERS)
  – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
Yildirim, M.A. et al

Networks & Integration
• SAR data can be viewed in a network form
  – SALI, SARI based networks
  – Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
Peltason, L et al
http://sali.rguha.net/

Networks & Integration
• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
  – Network based similarity
  – Community detection (aka clustering)
  – PageRank style ranking (of targets, compounds, …)
  – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
Bauer-Mehren et al

Combinations as Networks
Combination screens lend themselves naturally to network representations
[Figure: two network diagrams with Δ Bliss+ legends (0.0 to −4.3 and 0.0 to −3.4); annotation labels: immune system process, apoptotic process, transcription from RNA polymerase II promoter, protein phosphorylation, cell communication, immune response]

Combinations as Networks
• Things get more interesting when we have n × m screens
• Can be simplified using a variety of methods
  – Neighborhoods
  – Minimum Spanning Tree
[Figure: network diagram]

Comparing Neighborhoods
Combinations that have DBSumNeg < 1st quartile value for that strain
[Figure: one panel per strain: 3D7, DD2, HB3]

Identifying the Most Synergistic Pairs
[Figure: network diagram]

Summary
• The HTS workflow presents multiple data science problems involving (unique) data types
• R can play a role at several stages, but model building is straightforward
• Representation is key and guides the types and nature of analyses
