1. Materials Informatics Workshop
Peter Frazier, Operations Research & Information Engineering, Cornell University
4/3/2013, Research Supported By AFOSR Natural Materials, Systems & Extremophiles FA9550-12-1-0200
2. Optimal Learning
✤ My research interest is in statistics & sequential decision-making under uncertainty.
✤ My specific research interest is in what I call Optimal Learning:
✤ Statistics & machine learning is about using data to infer the unknown --- to
“learn”.
✤ In many problems, we must make decisions that influence what data is
available.
✤ In making such decisions we trade the benefit of information (the ability to
make better decisions in the future) against its cost (money, time, or
opportunity cost).
✤ If we balance these costs and benefits, we are learning optimally.
✤ Other names for optimal learning: sequential experimental design, active learning,
value of information analysis, adaptive design optimization.
3. Choosing experiments to perform in the search
for a new material is an optimal learning problem.
✤ If we are developing a new material, we have a choice of experiments
(physical and computational) that we can run.
✤ Each experiment would give us different information about how
material quality depends on design parameters.
✤ We have a limited budget on how many experiments we can perform.
✤ We would like to have an adaptive rule for choosing the experiment
to perform next that maximizes our chances of success.
4. Peptide design
✤ Given two target materials (e.g., gold and silver), find a peptide that is
a strong binder for material 1, and a weak binder for material 2.
✤ Paras Prasad (Buffalo); Marc Knecht (Miami); Tiff Walsh (Deakin
University, in Australia)
✤ This will be used to create a PARE (described on the next slide).
✤ Our collaborators hypothesize that PAREs can then be used to create
reconfigurable 3D bio-mediated nanoparticle assemblies, with useful
photonic, electronic, plasmonic, and magnetic properties.
5. Overall Strategy for
Bio-nanocombinatorics
Will use a library of material-binding peptides connected by
switchable linkers to assemble nanoparticles into
reconfigurable assemblies
[Figure: material-binding peptide sequences joined by a switchable linker assemble nanoparticles into reconfigurable structures; not to scale.]
6. What experiments can we run?
✤ We have the ability to run two kinds of experiments: computational, and
physical, on any chosen pair of peptide x and target material y.
✤ If we choose to run a physical experiment, we observe the binding strength
(Gibbs free energy of binding).
✤ If we choose to run a computational experiment, we observe an estimate of the
binding strength, but also information about which amino acids are responsible
for the binding:
✤ e.g., PPPWLPYMPPWS
✤ (red amino acids are in contact with the target > 60% of the time)
✤ Computational & physical experiments are both quite expensive (about 1 week of work).
7. We start by building two statistical
models:
✤ Statistical model 1 predicts, based on the peptide sequence, the
percentage of time an amino acid is in contact with a given target
material. [model uses hydrophobicity, charge, size, and binding strength of the amino acid, and of its two neighbors in the sequence.]
✤ Statistical model 2 predicts peptide binding strength, based on the
percentage of time each amino acid in it is in contact. (Work in
progress).
✤ Both models are Bayesian models, which means that we have more
than an estimate. We have a predictive distribution for what would
happen if we were to run an experiment.
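What "more than an estimate" means can be illustrated with a minimal Bayesian linear regression sketch. This is not the group's actual model; the features and numbers are hypothetical, and the point is only that a Bayesian fit returns a predictive mean and variance, not just a point prediction.

```python
import numpy as np

def bayes_linear_posterior(X, y, prior_var=1.0, noise_var=0.1):
    """Posterior over weights w for y = X @ w + noise, with prior w ~ N(0, prior_var * I)."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def predictive(x_new, mean, cov, noise_var=0.1):
    """Predictive mean and variance at a new feature vector: more than a point estimate."""
    return x_new @ mean, x_new @ cov @ x_new + noise_var

# Hypothetical numeric features for a few peptides (e.g., hydrophobicity, charge, size).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=5)
m, C = bayes_linear_posterior(X, y)
mu_new, var_new = predictive(rng.normal(size=3), m, C)
```

The predictive variance is what the experiment-selection strategies on the following slides consume.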
8. Based on the statistical model, we
suggest what experiment to do next
✤ For simplicity in this presentation, suppose (1) we just wish to find a
peptide with the highest binding strength against a single target material;
and (2) experiments are conducted without noise. Both assumptions can
be relaxed.
✤ Here are some strategies we will consider:
✤ Strategy 1 (exploitation)
✤ Strategy 2 (expected improvement)
✤ Strategy 3 (knowledge gradient)
✤ Strategy 4 (Bayes optimal)
9. Strategies
✤ exploitation:
✤ For each peptide, based on the existing data, make a prediction for binding strength. Run a physical
experiment on the peptide predicted to be the best.
✤ expected improvement:
✤ For each peptide x, based on the existing data, calculate the predictive distribution for the result of the
physical experiment f(x).
✤ Let f* be the previously observed smallest free energy of binding.
✤ If we measure at x, the best value observed will be min(f*,f(x)), and the improvement on the previous best
is f* - min(f*,f(x)) = (f*-f(x))+.
✤ Do the experiment on the peptide x with the largest expected improvement, E[(f*-f(x))+].
✤ Pros: Expected improvement accounts for the uncertainty in our predictions, preferring to measure
peptides with high upside potential.
✤ Cons: It is myopic.
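For a Gaussian predictive distribution, E[(f*-f(x))+] has a well-known closed form. A minimal sketch, with hypothetical numbers, in the minimization convention used here (smaller free energy of binding is better):

```python
import math

def expected_improvement(mu, sigma, f_star):
    """E[(f* - f(x))^+] when f(x) ~ Normal(mu, sigma^2).

    Minimization convention: f is free energy of binding, so smaller is
    better and f_star is the smallest value observed so far."""
    if sigma <= 0:
        return max(f_star - mu, 0.0)
    z = (f_star - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (f_star - mu) * Phi + sigma * phi

# Hypothetical predictive means and standard deviations for three peptides.
mus = [-4.0, -3.0, -2.0]
sds = [0.1, 1.0, 3.0]
best = max(range(3), key=lambda i: expected_improvement(mus[i], sds[i], f_star=-4.5))
```

Note the winner need not be the peptide with the best predicted mean: a highly uncertain peptide can have the largest upside potential, which is exactly the "pro" listed above.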
10. The value of optimal learning in peptide design
✤ Example showing why optimal learning is beneficial in peptide design.
✤ Suppose we want to find a peptide with strong binding to a given target material.
✤ We have identified a few peptides as binders through evolutionary search, and want to use this data to find ones that bind even better.
✤ Let's compare two approaches:
✤ 1. Use a statistical method to infer binding from the available data, select the top 10, and test these (the "test the best" or "exploitation" strategy).
✤ 2. Use optimal learning together with the same statistical method.
11. [Figure: predicted binding strength for two groups of peptides. Horizontal bars are point estimates; vertical bars are intervals in which the binding strength is predicted to lie. The best previously tested peptide has small uncertainty because it has been tested. Peptides in group 1 are almost identical to the best previously tested peptide (e.g., one amino acid difference), so our estimates have less uncertainty; peptides in group 2 have more differences from previously tested peptides, so our estimates have more uncertainty.]
13. [Figure: exploitation tests the 5 peptides with the highest predicted binding strength, shown in red.]
14-15. [Figures: we reduce our uncertainty about the ones we test.]
16. [Figure: our improvement is the difference between the best peptide tested and the best previously tested peptide.]
17. The value of optimal learning in peptide design
✤ Instead, suppose we use an optimal learning rule, which understands that we want to test peptides that have a good balance between:
✤ having large estimates,
✤ having high uncertainty, i.e., being different from what we've previously tested.
✤ This rule also understands correlations: closely related peptides have similar values, so it is a good idea to spread measurements across different groups, in case one group is substantially worse than we thought.
20. [Figure: the improvement from optimal learning exceeds the improvement from exploitation.]
21. Strategies
✤ knowledge gradient
✤ For each possible experiment we could do (computational or physical), calculate the
predictive distribution for the observation.
✤ For each possible observation resulting from the experiment, determine how our statistical
fit would change, and what the estimated value of the predicted best peptide would be, f**.
✤ The improvement due to the experiment is f*-f**.
✤ This strategy tells us to do the experiment on the peptide x with the largest value of
E[f*-f**].
✤ Pros: The knowledge gradient values information in a less restrictive way than expected
improvement, and allows us to recommend computational experiments rather than
only physical ones. It is less myopic than expected improvement.
✤ Cons: It requires more computation than expected improvement, and it is still myopic.
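A minimal Monte Carlo sketch of the calculation above for a discrete set of peptides under a correlated Gaussian belief (hypothetical numbers; here f* is taken as the current minimum posterior mean, and the experiment is assumed noiseless):

```python
import numpy as np

def knowledge_gradient(mu, Sigma, i, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[f* - f**] for a noiseless measurement of
    alternative i under a correlated Gaussian belief (minimization).

    f* is the current minimum posterior mean; f** is the minimum posterior
    mean after the simulated observation."""
    rng = np.random.default_rng(seed)
    f_star = mu.min()
    ys = rng.normal(mu[i], np.sqrt(Sigma[i, i]), size=n_samples)
    # Conditional-mean update for a noiseless observation y at alternative i:
    #   mu' = mu + Sigma[:, i] * (y - mu[i]) / Sigma[i, i]
    updates = np.outer(ys - mu[i], Sigma[:, i] / Sigma[i, i])
    f_new = (mu + updates).min(axis=1)
    return f_star - f_new.mean()

# Three hypothetical peptides; the first two are closely related (correlated),
# the third is well understood (small variance).
mu = np.array([0.0, -0.5, -0.4])
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 0.04]])
kg = [knowledge_gradient(mu, Sigma, i) for i in range(3)]
```

Because of the correlation, measuring one peptide also shifts our estimates for its close relatives, which is how a single experiment can be informative about many candidates at once.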
22. Strategies
✤ Bayes optimal
✤ The optimal adaptive rule for choosing experiments to perform, as
measured by the expected free energy of binding of the best peptide
found after some fixed number of experiments, is characterized by
the solution to a partially observable Markov decision process.
✤ Pros: It is optimal.
✤ Cons: It is very hard to compute.
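A stylized sketch of what solving such a problem by backward induction looks like, on a toy fully discretized instance (hypothetical numbers; each peptide's value takes one of two levels and experiments are noiseless, whereas real instances have continuous belief states, which is why they are so hard to compute):

```python
def value(probs, unmeasured, f_star, budget):
    """Backward induction for a toy sequential experimentation problem.

    Each peptide's free energy is low (-1.0) with the given probability,
    else high (0.0); an experiment reveals one peptide's value exactly.
    Returns the optimal expected free energy of the best peptide found."""
    if budget == 0 or not unmeasured:
        return f_star
    best = f_star  # stopping is a valid (weakly dominated) baseline
    for i in unmeasured:
        rest = tuple(j for j in unmeasured if j != i)
        v = (probs[i] * value(probs, rest, min(f_star, -1.0), budget - 1)
             + (1.0 - probs[i]) * value(probs, rest, f_star, budget - 1))
        best = min(best, v)  # smaller expected free energy is better
    return best

# With one experiment left, the optimal rule tests the likelier binder (p = 0.9);
# with two, it can test both peptides.
v1 = value((0.5, 0.9), (0, 1), 0.0, 1)
v2 = value((0.5, 0.9), (0, 1), 0.0, 2)
```

The recursion enumerates every experiment and every outcome, which is exactly what becomes intractable when beliefs are continuous and the budget is more than a few experiments.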
23. Ongoing work
✤ In ongoing work, we are:
✤ improving our statistical models
✤ developing computational methods for implementing these
strategies.
✤ doing numerical studies to see how well these strategies work.
✤ actually using these strategies to guide experimentation.
24. What have I learned from all this?
✤ Datasets are really small.
✤ Because datasets are small, we have to use domain knowledge.
✤ Because we have to use domain knowledge, we machine learners / statisticians
need to do some work to learn about chemistry / physics / materials science to
be successful.
✤ Computational experiments are not just lower fidelity models of physical
experiments --- they tell us interesting things that physical experiments cannot.
✤ Because data is expensive to get, experimental design is important.
✤ In some cases, statistical models from chemoinformatics developed for biology
applications may be applicable here too (e.g., for small molecules & peptides).
27. Other projects in
materials informatics
✤ Another peptide design problem: Given two enzymes and some
proteins from nature that act as a substrate for both, find a peptide
that is (1) as short as possible; and (2) acts as a substrate for both
enzymes.
✤ Nathan Gianneschi, Mike Gilson, Mike Burkart (UCSD)
28. Other projects in
materials informatics
✤ Solar energy (with Paulette Clancy and others at Cornell)
✤ The goal is to design assemblies of chlorinated
hexabenzocoronenes and carbon nanotubes that will act as high-
efficiency solar cells.
✤ The project just started, but we are planning to start by using
optimal learning to predict crystal structures.
29. Other projects in
materials informatics
✤ Materials informatics at Princeton:
✤ This work is led by Warren Powell, and I have only a small amount of
involvement.
✤ Experimental collaborators: Mike McAlpine, Sigurd Wagner, Jim Sturm,
Craig Arnold, Jamie Link.
✤ Warren is planning to use the “optimization of expensive functions”
methodology, where function evaluations are physical experiments, and
the goal is to find the setting of some input knobs that maximize the
quality of the output.
✤ There is a similar project choosing experimental conditions for growing
carbon nanotubes with Benji Maruyama at Air Force Research Lab (AFRL).
30. Optimal Learning has many applications:
medical decision-making
✤ Adaptive scheduling of diagnostic tests for
patients after vascular surgery.
✤ Shanshan Zhang, ORIE; Dr. Andrew
Meltzer, Weill Cornell Medical College
✤ Design of cardiovascular bypass grafts
✤ Jing Xie, ORIE; Alison Marsden,
UCSD
31. Optimal Learning has many applications:
optimization of expensive noisy functions
✤ Stochastic root-finding
✤ Shane Henderson, ORIE;
Rolf Waeber, ORIE
✤ Derivative-free global optimization, in
stochastic and parallel settings
✤ Jing Xie ORIE; Jialei Wang ORIE; Scott
Clark Cornell Center for Applied Math;
Steve Chick INSEAD
32. Optimal Learning has many applications:
recommendation systems
✤ A recommender system is a computer
system that recommends interesting
items to website users, e.g., Netflix,
Amazon.
✤ We are building a recommendation
system for the arXiv.org collection of
scientific papers.
✤ We are using optimal learning to make
recommendations that (1) provide a
good user experience now; and (2)
provide data for improving the user
experience in the future.
33. Optimal Learning has many applications:
optimal design of laboratory experiments
✤ Early stage drug development for
Ewings’ Sarcoma, a pediatric cancer.
✤ Jialei Wang Cornell Applied &
Engineering Physics; Dr. Jeff
Toretsky, Georgetown University
✤ Development of new nano-materials
✤ Jialei Wang Cornell Applied &
Engineering Physics; Paulette
Clancy, Cornell Chemical
Engineering; Nathan Gianneschi
UCSD; Marc Knecht Miami