Modeling Chemical Datasets

Modeling chemical datasets with a focus on regression-based methods

dsdht.wikispaces.com

Aims
•  How does the dynamic range of the data being modeled impact the apparent performance of the model?
•  How does experimental error impact the apparent predictivity of a model?
•  How can we determine whether a model is applicable to a new dataset?
•  How should we compare the performance of regression models?

http://media.johnwiley.com.au/product_data/excerpt/00/11181391/1118139100-4.pdf

Example

Examine a number of datasets containing measured values for aqueous solubility and use these datasets to build and evaluate predictive models.

Challenges in modeling solubility

Aqueous solubility of a compound can vary
depending on a number of factors:
•  Temperature
•  Purity
•  Polymorph

Datasets under study

•  The Huuskonen Dataset: 1274 experimental solubility values; one of the first large solubility datasets.
•  The JCIM Dataset: 94 experimental solubility values (2008).
•  The PubChem Dataset (AID1996): a randomly selected subset of 1000 measured solubility values, taken from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND).

Formula

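LogS = log10((solubility in µg/ml) / (1000.0 × MW))

where MW is the molecular weight in g/mol, so that LogS is the log10 of the solubility in mol/L.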
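The conversion above can be sketched in a few lines of Python. The function name `log_s` and the example compound (100 µg/ml, molecular weight 250 g/mol) are illustrative assumptions, not values from the datasets.

```python
import math


def log_s(solubility_ug_per_ml: float, mol_weight: float) -> float:
    """Convert a solubility in µg/ml to LogS (log10 of mol/L).

    µg/ml equals mg/L; dividing by 1000 gives g/L, and dividing by the
    molecular weight (g/mol) gives mol/L.
    """
    if solubility_ug_per_ml <= 0 or mol_weight <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    return math.log10(solubility_ug_per_ml / (1000.0 * mol_weight))


# Hypothetical compound: 100 µg/ml, MW 250 g/mol
example_log_s = log_s(100.0, 250.0)
print(round(example_log_s, 2))
```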

Solubility Comparison

A boxplot comparison of Log S for the three datasets

Requirements for a predictive model

•  Reliable experimental data
•  Sets of molecular descriptors
•  Statistical or machine-learning methods

Types of Models

Classification model:

•  Taking cutoff points in modeling introduces "edge effects". Consider a case where we have a two-class system with a cutoff of 100 μM. A value of 99 μM will be considered insoluble while a value of 101 μM will be considered soluble.
•  Another difficulty with classification models is that they provide limited direction for improving the properties of a compound.
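The edge effect in the 100 μM example can be made concrete with a small sketch. The function names are hypothetical, and treating a value of exactly 100 μM as soluble is an assumption the slides do not specify.

```python
import math

CUTOFF_UM = 100.0  # two-class cutoff from the example, in µM


def classify(solubility_um: float, cutoff_um: float = CUTOFF_UM) -> str:
    """Assign a class label; values at the cutoff count as soluble (assumption)."""
    return "soluble" if solubility_um >= cutoff_um else "insoluble"


def log_gap(a_um: float, b_um: float) -> float:
    """Absolute difference between two solubilities in log10 units."""
    return abs(math.log10(a_um) - math.log10(b_um))


labels = [classify(99.0), classify(101.0)]
gap = log_gap(99.0, 101.0)
print(labels, gap)
```

The two compounds receive opposite labels even though they differ by less than 0.01 log units, far less than the 0.3 log unit experimental error discussed later in these slides.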

Types of Models

Regression model:

•  It is difficult to create a regression model given data with a limited dynamic range.
•  A limited dynamic range leads to an unreliable model.
Evaluating a predictive model

•  Pearson's r: the correlation coefficient, commonly referred to as Pearson's r, or its square r^2. Values of r can vary between −1 and 1.
•  Kendall's tau: a drawback of Pearson's r is that it is sensitive to outliers and to the distribution of the underlying data. Kendall's tau instead employs the rank order of values.
•  RMSD: if we consider N paired values X and Y, RMSD can be calculated using the following equation.

RMSD = sqrt((1/N) Σ (X_i − Y_i)^2)

Steps involved in building a predictive model

•  Integrate the experimental data and molecular descriptors
•  Divide the data into training and test sets
•  Build a model from the training set
•  Use this model to predict the test set
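The four steps above can be sketched as follows. This is only a workflow illustration under loud assumptions: the data are synthetic, there is a single made-up descriptor, and a one-variable least-squares line stands in for the random forest used in the slides so that the example needs nothing beyond the standard library. All function names are hypothetical.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Least-squares fit of log_s = slope * descriptor + intercept."""
    n = len(train)
    mx = sum(d for d, _ in train) / n
    my = sum(s for _, s in train) / n
    sxx = sum((d - mx) ** 2 for d, _ in train)
    slope = sum((d - mx) * (s - my) for d, s in train) / sxx
    return slope, my - slope * mx


def predict(model, descriptor):
    slope, intercept = model
    return slope * descriptor + intercept


# Step 1: integrate data and descriptors as (descriptor, Log S) rows.
# Synthetic stand-in data, not taken from any of the three datasets.
data_rng = random.Random(42)
rows = []
for _ in range(200):
    descriptor = data_rng.uniform(0.0, 6.0)
    rows.append((descriptor, -1.0 * descriptor - 0.5 + data_rng.gauss(0.0, 0.3)))

# Step 2: divide into training and test sets
train, test = split_train_test(rows)
# Step 3: build a model from the training set only
model = fit_line(train)
# Step 4: predict the held-out test set
predictions = [predict(model, d) for d, _ in test]
```

Keeping the test set out of the fitting step is the point of the split: the test-set predictions are the ones that should be scored with r^2, Kendall's tau, and RMSD.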

Random forest model

The dynamic range in a dataset can have a large impact on the apparent correlation between experimental and predicted activity.
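A small simulation can illustrate this effect. It is a sketch on synthetic data, assuming uniformly distributed "experimental" values and a fixed Gaussian prediction error of 0.5 log units; the ranges and the function name `simulate_r2` are illustrative choices, not values from the slides.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_r2(low, high, error_sd=0.5, n=500, seed=1):
    """r^2 between synthetic values in [low, high] and noisy 'predictions'."""
    rng = random.Random(seed)
    experimental = [rng.uniform(low, high) for _ in range(n)]
    predicted = [v + rng.gauss(0.0, error_sd) for v in experimental]
    return pearson_r(experimental, predicted) ** 2


# Same prediction error in both cases; only the dynamic range differs
wide_r2 = simulate_r2(-8.0, 0.0)     # 8 log units of range
narrow_r2 = simulate_r2(-5.0, -4.0)  # 1 log unit of range
print(wide_r2, narrow_r2)
```

The prediction error is identical in both runs, yet the wide-range set shows a much higher r^2 than the narrow-range set, which is why r^2 values from datasets with different dynamic ranges are not directly comparable.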

Experimental Error and Model Performance

•  Every experimental data point has an error associated with it. If we measure the Log S of a compound as −6 and that data point has an error of 0.3 log units, the actual value could be anywhere between −6.3 and −5.7.
•  Brown examined the relationship between experimental error and model performance.
•  Gaussian distributed random values were added to the data to simulate experimental errors.
•  The correlation between the measured values and the same values with simulated error is then measured.
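The procedure in the bullets above can be sketched as below. The Log S values are synthetic, and the number of trials, the seeds, and the function name `correlation_with_noise` are assumptions made for the illustration; this is not Brown's original code.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_with_noise(values, error_sd, trials=50, seed=0):
    """Mean Pearson r between values and the same values plus Gaussian error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, error_sd) for v in values]
        total += pearson_r(values, noisy)
    return total / trials


# Synthetic Log S values standing in for a measured dataset
data_rng = random.Random(7)
log_s_values = [data_rng.uniform(-7.0, -1.0) for _ in range(300)]

# Simulated experimental errors, in log units
ceilings = {sd: correlation_with_noise(log_s_values, sd) for sd in (0.3, 0.5, 1.0)}
for sd, r in ceilings.items():
    print(sd, r)
```

The resulting correlation can be read as a ceiling: a model cannot be expected to correlate with the measurements better than the measurements correlate with a noisy copy of themselves.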

Experimental Error and Model Performance

•  The table shows the maximum possible correlation for each of the three solubility datasets we have been examining when experimental errors of 0.3, 0.5, and 1.0 log units are considered.
•  The impact of experimental error is greatest for a dataset like PubChem.

Model Applicability

•  Models often perform poorly on molecules that bear little resemblance to those in the training set.

Similarity of Each Test Set

Dataset          Mean   Median
Huuskonen_Test   0.76   0.78
JCIM             0.74   0.62
Pubchem          0.56   0.56
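One way such a similarity summary could be computed is sketched below. The slides do not say how similarity was measured, so this assumes a Tanimoto coefficient over fingerprint on-bits and, for each test molecule, the similarity to its nearest training-set neighbour. The fingerprints are tiny made-up bit sets and the function names are hypothetical.

```python
from statistics import mean, median


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_train_similarity(test_fps, train_fps):
    """For each test fingerprint, similarity to its closest training fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Made-up fingerprints, each represented as a set of on-bit indices
train_fps = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8}), frozenset({10, 11, 12})]
test_fps = [frozenset({1, 2, 3, 9}), frozenset({10, 11, 13}), frozenset({20, 21})]

sims = nearest_train_similarity(test_fps, train_fps)
summary = (mean(sims), median(sims))
print(sims, summary)
```

A test set whose mean and median nearest-neighbour similarity are low, as for the Pubchem row in the table, contains many molecules that are unlike anything the model saw during training.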

Dataset          R2     Kendall   RMS Error
Huuskonen_Test   0.92   0.82      0.58
JCIM             0.58   0.59      0.83
Pubchem          0.11   0.22      1.12

Comparing Predictive Models

•  When comparing correlation coefficients, we must consider not only the value of the correlation coefficient but also the confidence interval around it.
•  If the confidence intervals of two correlations overlap, we cannot claim that one predictive model is superior to another.
•  For a subset of 25 compounds, the confidence intervals overlap, so we cannot say that one correlation is superior to the other.
•  For a subset of 50 compounds, there is only a very small gap between the upper bound of one 95% confidence interval and the lower bound of the other.
•  For a subset of 100 compounds, there is clear separation between the confidence intervals, which implies a clear difference between the correlation coefficients.
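A common way to put a confidence interval around a Pearson correlation is the Fisher z-transformation. The slides do not state which method was used, so the sketch below is an assumption, and the correlation values 0.85 and 0.60 are hypothetical rather than results from the datasets.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.959964):
    """Approximate 95% confidence interval for Pearson's r via Fisher's z."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)              # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)    # standard error in z space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical correlations for two models evaluated on the same number of compounds
r_model_a, r_model_b = 0.85, 0.60

for n in (25, 50, 100):
    ci_a = pearson_ci(r_model_a, n)
    ci_b = pearson_ci(r_model_b, n)
    print(n, ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Because the standard error shrinks as 1/sqrt(n − 3), the same pair of correlation values can be indistinguishable on a small subset and clearly separated on a larger one.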

References

•  http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118139100.html
•  https://github.com/PatWalters/cheminformaticsbook
