2. Assumptions
• Setting up and running the same experiment in the laboratory should get
the same results, time after time (within an error).
• The results of experiments, and how experiments are set up and run can
be described by a quantitative relationship.
• This relationship is a function $y = f(x_1, x_2, \ldots, x_m)$, where $y$ is the result of
the experiment and $x_1, \ldots, x_m$ are descriptors of the experiment. Every
time the values of the descriptors are the same, the result is the same.
• What that function looks like and which descriptors should be used are what
we are trying to find out.
3. Descriptors and Responses
• The descriptors of an experiment may be divided into:
o Properties of “pristine” material (e.g. surface charge, zeta potential);
o Properties of “weathered” or “aged” material (e.g. hydration);
o Parameters of experiment and assay increments (e.g. temperature,
nanomaterial concentration)
• The experimental responses may be results such as:
o The percentage of human lung cells that die after 1 day
o The percentage of human lung cells that die after 2 days
o Similar results for different cell types
5. Descriptor and Response Relationship
• A row is generated for each experiment conducted, recording the values
the descriptors take on and the results of the experiment.
• If we assume a linear relationship between descriptors and the results,
the function becomes $y = f(x_1, x_2, \ldots, x_m) = b_0 + b_1 x_1 + \ldots + b_m x_m$
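• The results of multiple experiments can be represented using the matrix
notation
$y = Xb + e$
where $X$ has m columns of descriptors (plus a leading column of ones when the
intercept $b_0$ is included) and n rows of experiments, $b$ holds the
coefficients, and $e$ holds the errors.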
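As a minimal sketch of the matrix form above, the coefficients can be estimated by ordinary least squares. The data here are synthetic and the names (`descriptors`, `true_b`, `b_hat`) are illustrative assumptions, not part of the original material; a leading column of ones is added so that the intercept is estimated along with the other coefficients.

```python
import numpy as np

# Synthetic data: n experiments, m descriptors (values are made up).
rng = np.random.default_rng(0)
n, m = 20, 3
descriptors = rng.normal(size=(n, m))

# Hypothetical "true" coefficients b0, b1, ..., bm used to generate y.
true_b = np.array([2.0, 1.0, -0.5, 0.25])

# Design matrix: a column of ones for the intercept, then the descriptors.
X = np.column_stack([np.ones(n), descriptors])
y = X @ true_b + 0.01 * rng.normal(size=n)

# Least-squares estimate of b in y = Xb + e.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals e = y - X b_hat.
e = y - X @ b_hat
```

With an intercept column in the design matrix, the residuals sum to zero up to rounding error, which is a quick sanity check on the fit.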
7. NanoQSAR
• Select 80% of experimental results randomly to build a QSAR model
$$R^2 = 1 - \frac{\sum (y_{actual} - y_{model})^2}{\sum (y_{actual} - y_{mean})^2}$$
• How close $R^2$ is to 1.0 reflects the quality of the model fit and the size of the error terms
• With the remaining 20%, predict results
$$Q^2 = 1 - \frac{\sum (y_{actual} - y_{predict})^2}{\sum (y_{actual} - y_{mean})^2}$$
• In general, $R^2 \geq Q^2$
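The 80/20 procedure above can be sketched as follows. This is an illustration on synthetic data, not the authors' implementation. The function name `coefficient_of_determination` is hypothetical, and using the training-set mean as the mean in both formulas is an assumption, since conventions for the external-validation mean vary.

```python
import numpy as np

def coefficient_of_determination(y_actual, y_estimated, y_mean):
    """Return 1 - sum((actual - estimated)^2) / sum((actual - mean)^2)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_estimated = np.asarray(y_estimated, dtype=float)
    ss_res = np.sum((y_actual - y_estimated) ** 2)
    ss_tot = np.sum((y_actual - y_mean) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic experiments: intercept column plus two descriptors.
rng = np.random.default_rng(1)
n, m = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ np.array([1.0, 0.5, -2.0]) + 0.3 * rng.normal(size=n)

# Random 80/20 split of the experiments.
order = rng.permutation(n)
n_train = (n * 4) // 5
train, holdout = order[:n_train], order[n_train:]

# Build the model on the 80% subset only.
b_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
y_mean = y[train].mean()  # assumption: training mean used for both metrics

# R^2 on the data used to build the model, Q^2 on the held-out 20%.
r2 = coefficient_of_determination(y[train], X[train] @ b_hat, y_mean)
q2 = coefficient_of_determination(y[holdout], X[holdout] @ b_hat, y_mean)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```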
8. Latent Structure of X (and Y)
• When there are correlations (collinearity) between the columns of 𝑋, the
calculated regression coefficients 𝑏 become unstable.
• Because of this, multivariate projection methods such as PLS (Projections
to Latent Structures) are increasingly being used in QSAR analysis.
• This method projects the descriptors onto a lower-dimensional hyperplane
spanned by a small number of latent variables.
• More stable calculated regression coefficients 𝑏 can be found using this
inherent latent structure of matrix 𝑋.
• Similar reduction of dimensions can be done for experimental results.
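To make the projection idea concrete, here is a minimal sketch of single-response PLS (often called PLS1) using the NIPALS iteration. It is written under assumptions: the function name `pls1_fit` and the synthetic collinear data are illustrative, and this is one common formulation of PLS rather than necessarily the variant used in this work.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 via NIPALS; return (b0, b) so that y ~ b0 + X @ b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yc = X - x_mean, y - y_mean  # centered copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yc                # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # latent scores (projection of X onto w)
        tt = t @ t
        p = Xk.T @ t / tt            # X loadings
        q.append(yc @ t / tt)        # regression of y on the scores
        W.append(w)
        P.append(p)
        Xk = Xk - np.outer(t, p)     # deflate X before the next component
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # coefficients in descriptor space
    b0 = y_mean - x_mean @ b
    return b0, b

# Two nearly collinear descriptors (synthetic): x2 is almost a copy of x1.
rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
x2 = x1 + 1e-3 * rng.normal(size=40)
X_collinear = np.column_stack([x1, x2])
y_collinear = x1 + x2 + 0.1 * rng.normal(size=40)

# One latent variable is enough here; the two coefficients come out similar.
b0_pls, b_pls = pls1_fit(X_collinear, y_collinear, n_components=1)
print(b0_pls, b_pls)
```

With a single latent variable, the weight on the two near-identical descriptors is shared almost equally, whereas an unconstrained least-squares fit on the same data can split it arbitrarily. When the number of components equals the number of linearly independent descriptors, this procedure reproduces the ordinary least-squares solution.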
10. Many Separate Clusters
• Nature is found to organize experimental results in a clustered and
discontinuous way.
• How many clusters exist may be found using a k-means algorithm that starts
from n clusters, where n is the number of experimental results.
• The number of clusters is reduced at each iteration by merging the closest clusters.
• At each iteration, QSAR modeling is also performed for all clusters that are
large enough, and $Q^2$, which measures how close the predicted values are to
the actual values, is calculated.
• At the final step, the number of clusters with the best $Q^2$ is selected.
• If there are any clusters that are still not large enough for QSAR modeling,
new experimental data need to be generated.
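The merging loop described above can be sketched as follows. Starting from n singleton clusters and repeatedly combining the two closest ones is, in standard terminology, a bottom-up (agglomerative) procedure. The version below merges by centroid distance, which is an assumption about what "closest" means here. The function name `merge_closest_clusters` and the sample points are hypothetical, and the per-iteration QSAR fit and $Q^2$ evaluation are left out.

```python
import numpy as np

def merge_closest_clusters(points, target_count):
    """Start with one cluster per point and merge the two clusters with the
    closest centroids until target_count clusters remain.

    Returns a dict mapping each cluster count to the partition (lists of
    point indices) at that stage, so each stage can be evaluated separately.
    """
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    history = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > target_count:
        centroids = np.array([points[c].mean(axis=0) for c in clusters])
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
        history[len(clusters)] = [sorted(c) for c in clusters]
    return history

# Four made-up experiments forming two obvious groups.
sample_points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0]]
history = merge_closest_clusters(sample_points, target_count=1)
print(history[2])
```

In a fuller version, each entry of `history` would be passed to the model-building step, and the cluster count with the best $Q^2$ would be kept.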
12. Emerging NanoMaterials
• The cluster an emerging nanomaterial is most similar to can be
identified from theoretical descriptors such as SMILES strings and the
x, y, z coordinates of the molecules in the nanostructure.
• The emerging nanomaterials can then be associated with the closest
cluster.
• Experimental results are predicted using the regression equation found for
that particular cluster:
𝑦 = 𝑏0 + 𝑏1𝑥1 + … + 𝑏𝑚𝑥𝑚
• As before, if an emerging nanomaterial lies very far from any
existing cluster, new experimental data need to be generated to fill that
gap in the database.
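The assignment and prediction steps above can be sketched as below. It assumes the theoretical descriptors have already been encoded as a numeric vector (a SMILES string would need such an encoding first), that each cluster is summarized by a centroid, and that "very far" is decided by a cutoff. The function name `predict_for_new_material`, the `max_distance` parameter, and all numbers are hypothetical.

```python
import numpy as np

def predict_for_new_material(x_new, centroids, coefficients, max_distance):
    """Assign x_new to the nearest cluster centroid and apply that cluster's
    regression equation y = b0 + b1*x1 + ... + bm*xm.

    Returns (cluster_index, prediction), or (None, None) when the material is
    farther than max_distance from every centroid, signalling that new
    experimental data would be needed.
    """
    x_new = np.asarray(x_new, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    distances = np.linalg.norm(centroids - x_new, axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] > max_distance:
        return None, None
    b = np.asarray(coefficients[nearest], dtype=float)  # [b0, b1, ..., bm]
    return nearest, float(b[0] + b[1:] @ x_new)

# Two made-up clusters with their own regression coefficients.
centroids = [[0.0, 0.0], [10.0, 10.0]]
coefficients = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]

print(predict_for_new_material([1.0, 1.0], centroids, coefficients, 3.0))
print(predict_for_new_material([50.0, 50.0], centroids, coefficients, 3.0))
```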