Emerging NanoMaterials –
nanoQSAR FY17
Paul Harten
July 18, 2016
Assumptions
• Setting up and running the same experiment in the laboratory should get
the same results, time after time (within an error).
• The results of experiments, and how experiments are set up and run can
be described by a quantitative relationship.
• This relationship is a function $y = f(x_1, x_2, \ldots, x_m)$, where $y$ is the result of
the experiment and $x_1, \ldots, x_m$ are descriptors of the experiment. Every
time the values of the descriptors are the same, the result is the same.
• What that function looks like and which descriptors should be used are what
we are trying to find out.
Descriptors and Responses
• The descriptors of an experiment may be divided into:
o Properties of “pristine” material (e.g. surface charge, zeta potential);
o Properties of “weathered” or “aged” material (e.g. hydration);
o Parameters of experiment and assay increments (e.g. temperature,
nanomaterial concentration)
• The experimental responses may be results such as:
o The percentage of human lung cells that die after 1 day
o The percentage of human lung cells that die after 2 days
o Similar results for different cell types
Descriptors and Responses (cont.)
Column groups, left to right: Pristine, Weathered, Experimental (descriptors X1 to X23) and Responses (Y1, Y2).
One row per experiment, each with the columns:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 Y1 Y2
...
Descriptor and Response Relationship
• A row is generated for each experiment conducted, recording the values
the descriptors take on and the results of the experiment.
• If we assume a linear relationship between descriptors and the results,
the function becomes $y = f(x_1, x_2, \ldots, x_m) = b_0 + b_1 x_1 + \ldots + b_m x_m$
• The results of multiple experiments can be represented using the matrix
notation
$y = Xb + e$
where $X$ has m columns of descriptors and n rows of experiments.
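The linear model above can be sketched in a few lines, assuming ordinary least squares is used to estimate the coefficients (the slides do not say which estimator is intended at this point). A column of ones is added to $X$ so that the intercept $b_0$ is estimated together with $b_1, \ldots, b_m$. The function name `fit_linear_model` and the data are hypothetical.

```python
import numpy as np

def fit_linear_model(X, y):
    """Least-squares estimate of (b0, b) in y = b0 + X @ b + e.

    X has n rows (experiments) and m columns (descriptors).
    """
    n = X.shape[0]
    # Prepend a column of ones so the intercept b0 is fitted as well.
    X1 = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[0], coef[1:]

# Hypothetical data: 4 experiments, 2 descriptors, no noise.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

b0, b = fit_linear_model(X, y)
print(b0, b)
```

Because the data here are noise-free, the fit recovers the generating coefficients up to rounding; with real measurements the residual vector plays the role of $e$.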
Partial Least Squares (PLS), y = b0 + b1 * x1 + e
NanoQSAR
• Select 80% of experimental results randomly to build a QSAR model
$R^2 = 1 - \frac{\sum (y_{\mathrm{actual}} - y_{\mathrm{model}})^2}{\sum (y_{\mathrm{actual}} - y_{\mathrm{mean}})^2}$
• How close $R^2$ is to 1.0 reflects the quality of the model and the size of the error terms
• With the remaining 20%, predict results
$Q^2 = 1 - \frac{\sum (y_{\mathrm{actual}} - y_{\mathrm{predict}})^2}{\sum (y_{\mathrm{actual}} - y_{\mathrm{mean}})^2}$
• In general, $R^2 \geq Q^2$
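The 80/20 procedure can be sketched as follows. This is an illustration under stated assumptions, not the NanoQSAR implementation: the model is plain least squares, the sums run over the experiments in each subset, and $y_{\mathrm{mean}}$ is taken as the mean of the training responses in both formulas (conventions differ on this point). The names `one_minus_ratio` and `r2_q2` are hypothetical.

```python
import numpy as np

def one_minus_ratio(y_actual, y_hat, y_mean):
    """1 - sum((y_actual - y_hat)^2) / sum((y_actual - y_mean)^2)."""
    ss_res = np.sum((y_actual - y_hat) ** 2)
    ss_tot = np.sum((y_actual - y_mean) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_q2(X, y, train_fraction=0.8, seed=0):
    """Fit on a random 80% of the rows, then score R^2 (fit) and Q^2 (held out)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train, test = idx[:n_train], idx[n_train:]

    # Least-squares fit with an intercept column (assumed model).
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)

    # Assumption: the training mean is the reference for both scores.
    y_mean = y[train].mean()
    r2 = one_minus_ratio(y[train], X1[train] @ coef, y_mean)
    q2 = one_minus_ratio(y[test], X1[test] @ coef, y_mean)
    return r2, q2
```

Calling `r2_q2(X, y)` on a descriptor matrix and response vector returns both scores; a large gap between them suggests the model fits the training rows better than it predicts new ones.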
Latent Structure of X (and Y)
• When there are correlations (collinearity) between the columns of 𝑋, the
calculated regression coefficients 𝑏 become unstable.
• Because of this, multivariate projection methods such as PLS (Projections
to Latent Structures) are increasingly being used in QSAR analysis.
• This method projects the descriptors onto a lower-dimensional
hyperplane.
• More stable calculated regression coefficients 𝑏 can be found using this
inherent latent structure of matrix 𝑋.
• Similar reduction of dimensions can be done for experimental results.
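To make the stability point concrete, here is a minimal sketch of single-response PLS (PLS1) using the NIPALS iteration, written with NumPy only. It is one common formulation of PLS, not necessarily the one used in this project, and the function name `pls1_fit` and the data are hypothetical. The example uses two nearly identical descriptor columns and a single latent component.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 via NIPALS; returns (b0, b) such that y is approximated by b0 + X @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean  # centered working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                   # latent score for each experiment
        tt = t @ t
        p = Xk.T @ t / tt            # X loading
        c = yk @ t / tt              # y loading
        Xk = Xk - np.outer(t, p)     # deflate X
        yk = yk - c * t              # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # coefficients in the original descriptors
    b0 = y_mean - x_mean @ b
    return b0, b

# Hypothetical, nearly collinear descriptors: x2 is x1 plus tiny noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=50)

b0, b = pls1_fit(X, y, n_components=1)
print(b)
```

With one latent component the weight is shared almost equally between the two collinear columns, whereas an unconstrained least-squares fit on the same data can assign them large coefficients of opposite sign.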
Latent Structure of X (and Y)
Many Separate Clusters
• Nature is found to organize experimental results in a clustered and
discontinuous way.
• How many clusters exist may be found using a k-means algorithm that starts
from n clusters, where n is the number of experimental results.
• The number of clusters is reduced at each iteration by combining the closest clusters.
• Also at each iteration, QSAR modeling is performed for all clusters that are
large enough, and $Q^2$, which measures how close the predicted values are to
the actual values, is calculated.
• At the final step, the number of clusters that gives the best $Q^2$ is selected.
• If any clusters are still not large enough for QSAR modeling,
new experimental data need to be generated.
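The merging loop described above can be sketched as follows, assuming clusters are compared by the Euclidean distance between their centroids (the slides do not state the distance used). Starting from n single-point clusters and repeatedly merging the closest pair is usually called agglomerative clustering rather than k-means, so this reading of the procedure is an assumption. The function name `merge_closest_clusters` is hypothetical, and the per-cluster QSAR fit and $Q^2$ scoring are left out.

```python
import numpy as np

def merge_closest_clusters(X, min_clusters=1):
    """Start with one cluster per row of X and repeatedly merge the two
    clusters whose centroids are closest.

    Returns a dict mapping each cluster count k to an array of labels.
    """
    clusters = [[i] for i in range(len(X))]
    history = {}
    while True:
        # Record the current partition as one label per experiment.
        labels = np.empty(len(X), dtype=int)
        for c, members in enumerate(clusters):
            labels[members] = c
        history[len(clusters)] = labels
        if len(clusters) <= min_clusters:
            break
        # Pairwise centroid distances; the diagonal is masked out.
        centroids = np.array([X[m].mean(axis=0) for m in clusters])
        d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        i, j = min(i, j), max(i, j)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history
```

Each entry of the returned history is a candidate partition; the model-selection step would fit a QSAR model per sufficiently large cluster in each partition and keep the cluster count with the best $Q^2$.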
Many Separate Clusters (cont.)
Emerging NanoMaterials
• The cluster that an emerging nanomaterial is most similar to can be
identified by including theoretical descriptors such as SMILES strings and the
x, y, z coordinates of different molecules in the nanostructure.
• The emerging nanomaterials can then be associated with the closest
cluster.
• Experimental results are predicted using the regression equation found for
that particular cluster:
$y = b_0 + b_1 x_1 + \ldots + b_m x_m$
• As before, if an emerging nanomaterial lies very far from any
existing cluster, new experimental data need to be generated to fill that
gap in the database.
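The assignment and prediction steps can be sketched like this, assuming the descriptors have already been encoded as a numeric vector (how SMILES strings or coordinates become numbers is not covered in the slides), that each cluster is summarized by its centroid, and that "very far" means exceeding a chosen distance threshold. The names `predict_for_new_material` and `max_distance` are hypothetical.

```python
import numpy as np

def predict_for_new_material(x_new, centroids, models, max_distance):
    """Assign x_new to the nearest cluster centroid and apply that cluster's
    linear model y = b0 + b . x.

    Returns (cluster_index, prediction). The prediction is None when the
    nearest centroid is farther away than max_distance, which signals that
    new experimental data are needed.
    """
    distances = np.linalg.norm(centroids - x_new, axis=1)
    k = int(np.argmin(distances))
    if distances[k] > max_distance:
        return k, None
    b0, b = models[k]
    return k, float(b0 + x_new @ b)

# Hypothetical centroids and per-cluster coefficients (b0, b).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
models = [(1.0, np.array([2.0, 0.0])), (0.0, np.array([0.0, 1.0]))]

near = predict_for_new_material(np.array([1.0, 1.0]), centroids, models, max_distance=5.0)
far = predict_for_new_material(np.array([100.0, 100.0]), centroids, models, max_distance=5.0)
```

Returning `None` instead of extrapolating keeps the gap in the database visible to whoever plans the next experiments.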
