This is the final presentation that my team gave at the culmination of our undergraduate research project. It involved building a protein folding simulation program using VBA in Microsoft Excel. It uses a genetic algorithm, essentially "evolving" the best fit, as determined by the fitness function, from a randomly generated seed population.
11. ►Protein misfoldings are responsible for over 20
diseases.
٭Mad Cow disease is caused by an “evil” protein. The “evil”
protein and the normal protein have identical primary
structures, but their tertiary structures are different.
[Figure: Normal PrP vs. Diseased PrP]
12. ►Some proteins fold as fast as a millionth of a second
►Theoretically, a protein of only 100 amino acids
following the trial and error method would take 100
billion years to try out all possible conformations!
►Protein structures are highly dependent upon various
environmental parameters.
٭Such as temperature, pH, solvent, etc.
13. ► Comparative - Uses evolutionarily related proteins
٭Advantages: fast and simple
٭Disadvantages: conformation depends upon environmental parameters
► Fold Recognition - Utilizes a database of known 3-D protein
structures
٭Advantages: more accurate than comparative
٭Disadvantages: not enough NMR-confirmed protein structures
► Ab Initio - Uses both scientific and engineering approaches
٭Advantages: has potential to predict exact shape and immediate
structures
٭Disadvantages: computing limitations, difficulty in selecting the correct
potential energy function
14. ►Not enough NMR-confirmed protein structures in the Protein
Data Bank (PDB)
►Evolutionary relatedness does not necessarily translate to
similar structure
►Ab initio difficulties
٭Hydrophilic and hydrophobic modeling gives only the general
arrangement of the protein
٭2-D modeling does not predict the 3-D shape of the protein
٭The Monte Carlo computing method is time consuming and does not
necessarily reach the global minimum
15. ►Develop a genetic algorithm based program to predict
protein conformation
►Reduce the generations needed for prediction, thereby
enhancing the efficiency of the search
►Explore different additional operators to modify genetic
algorithm
►Predict the protein conformation of a short 5-AA
peptide, Enkephalin
25. ►The rotational angle between the bond joining
one pair of adjacent atoms and the bond joining
the next pair is called a dihedral angle
►Phi is the rotation about the N–Cα bond, psi
about the Cα–C’ bond, and omega
about the C’–N bond
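As an illustration of the definition above, a dihedral angle can be computed from the Cartesian positions of four consecutive atoms. This is a minimal sketch and not part of GAPSS; the function name `dihedral` and the test points are made up for the example, and the sign follows the usual convention where 0° is cis and ±180° is trans.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane p0-p1-p2
    n2 = _cross(b2, b3)          # normal of the plane p1-p2-p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical points: the first three atoms lie in the xy-plane.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
```

For a peptide backbone, passing (C’, N, Cα, C’) of consecutive residues would give phi, and (N, Cα, C’, N) would give psi.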
26. ► First 3 atoms on the peptide chain are fixed
► The coordinate system is arbitrarily determined around
the first H atom of the N-terminus
► Assumptions:
٭Minimal bond length stretch
٭Bond angle stays constant
٭Torsion angle (dihedral angle) applies to the 4th atom
[Figure: axes x, Y, Z with angles θ and ω; H- (0,0,0), N (-1.04,0,0), Cα (-1.52,1.37,0)]
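27.
$$
\begin{pmatrix} x_{n1} \\ x_{n2} \\ x_{n3} \\ 1 \end{pmatrix}
= B_1 B_2 \cdots B_n
\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}
$$
$$
B_n =
\begin{pmatrix}
-\cos\theta_{ij} & -\sin\theta_{ij} & 0 & -r_{i+1,j}\cos\theta_{ij} \\
\sin\theta_{ij}\cos\omega_{ij} & -\cos\theta_{ij}\cos\omega_{ij} & -\sin\omega_{ij} & r_{i+1,j}\sin\theta_{ij}\cos\omega_{ij} \\
\sin\theta_{ij}\sin\omega_{ij} & -\cos\theta_{ij}\sin\omega_{ij} & \cos\omega_{ij} & r_{i+1,j}\sin\theta_{ij}\sin\omega_{ij} \\
0 & 0 & 0 & 1
\end{pmatrix}
$$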
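The first 3 Bn matrices are fixed by the previous assumption; B1, B2, and B3 correspond
to the H-, -N-, and Cα atoms.
$$
B_1 =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix},
\quad
B_2 =
\begin{pmatrix}
-1 & 0 & 0 & -r_{12} \\
0 & 1 & 0 & 0 \\
0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix},
\quad
B_3 =
\begin{pmatrix}
-\cos\theta_{13} & -\sin\theta_{13} & 0 & -r_{23}\cos\theta_{13} \\
\sin\theta_{13} & -\cos\theta_{13} & 0 & r_{23}\sin\theta_{13} \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
$$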
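The chained transformation B1 B2 ... Bn can be sketched in a few lines of Python. This is an illustrative sketch rather than the GAPSS code (which was written in VBA). It assumes a sign convention chosen so that every matrix has determinant 1 and the three fixed atoms land on the coordinates quoted on the earlier slide. The N–Cα bond length (1.45) and angle term (109.3°) are back-calculated from those coordinates, and the fourth atom is purely hypothetical.

```python
import math

def b_matrix(r, theta_deg, omega_deg):
    """4x4 homogeneous transform for one atom (assumed sign convention)."""
    t = math.radians(theta_deg)
    w = math.radians(omega_deg)
    ct, st = math.cos(t), math.sin(t)
    cw, sw = math.cos(w), math.sin(w)
    return [
        [-ct, -st, 0.0, -r * ct],
        [st * cw, -ct * cw, -sw, r * st * cw],
        [st * sw, -ct * sw, cw, r * st * sw],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def place_atoms(internal_coords):
    """Return Cartesian coordinates; the first atom (B1 = identity) sits at the origin."""
    acc = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    coords = [(0.0, 0.0, 0.0)]
    for r, theta, omega in internal_coords:
        acc = matmul(acc, b_matrix(r, theta, omega))
        # Applying acc to (0, 0, 0, 1) simply picks out its last column.
        coords.append((acc[0][3], acc[1][3], acc[2][3]))
    return coords

chain = place_atoms([
    (1.04, 0.0, 180.0),   # N: reproduces B2
    (1.45, 109.3, 0.0),   # C-alpha: assumed values, reproduces B3
    (1.53, 110.0, 60.0),  # hypothetical fourth atom with a 60 degree torsion
])
```

Because each matrix is a rigid rotation plus a translation of length r, every new atom ends up exactly one bond length from its predecessor, whatever torsion angle is chosen.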
30. ► Search and optimization method
that mimics natural selection
► Terms to define
٭Chromosome – a set of torsion angles
٭Gene – an individual torsion angle
٭Generation – a single iteration of the
GA search loop
► Loops through the reproduction,
mutation, and adaptation process
to obtain the best-fit model
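The terms above map onto a very small program. The sketch below is a generic GA loop under stated assumptions, not the GAPSS implementation: `toy_energy` is a placeholder fitness function, the reproduction step is reduced to copying and perturbing the better half of the population, and the adaptation step is omitted. All names are hypothetical.

```python
import random

def toy_energy(chromosome):
    """Placeholder fitness: lowest when every torsion angle is 0 degrees."""
    return sum((angle / 180.0) ** 2 for angle in chromosome)

def run_ga(n_genes, pop_size, generations, energy, rng):
    # Chromosome = list of torsion angles (genes), seeded at random.
    population = [[rng.uniform(-180.0, 180.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):              # one pass = one generation
        population.sort(key=energy)           # lowest energy first
        best_per_generation.append(energy(population[0]))
        parents = population[: pop_size // 2]  # survivors are kept unchanged
        children = []
        for parent in parents:
            child = list(parent)
            i = rng.randrange(n_genes)         # perturb a single gene
            child[i] = max(-180.0, min(180.0, child[i] + rng.uniform(-30.0, 30.0)))
            children.append(child)
        population = parents + children
    population.sort(key=energy)
    return population[0], best_per_generation

best, history = run_ga(n_genes=4, pop_size=20, generations=15,
                       energy=toy_energy, rng=random.Random(0))
```

Because the surviving parents are carried over unchanged, the best energy recorded in `history` can never get worse from one generation to the next.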
31. ►Use a computer
simulation to perform
an intelligent
search/optimization to
find the native protein
conformation that
requires the least
amount of energy
Native Conformation
32. ►GAPSS is developed under Visual Basic Add-in
environment
►Modified genetic operators
٭ Fitness function based selection
٭ Multiple entries crossover
٭ Non-uniform mutation
٭ Adaptation
►Advantages
٭Faster convergence
٭User-friendly
33. ► Three primary energy terms:
Electrostatic, Nonbonded (6-12),
and Hydrogen Bonded
► Exclude Torsion Energy
٭Not real interaction energy
٭Introduce penalty for positive
torsion
► Cystine Loop-Closing
introduced only when more
than one cysteine is present
in the protein
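Two of the energy terms listed above have simple closed forms, sketched here for a pair of atoms. This is an assumption-laden illustration, not the GAPSS fitness function. The Coulomb constant 332.06 gives kcal/mol for charges in electron units and distances in Å. The well depth, minimum-energy distance, and dielectric constant are hypothetical parameters. The hydrogen-bond term, the torsion penalty, and the cystine loop-closing term are left out.

```python
import math
from itertools import combinations

COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2)

def electrostatic(q_i, q_j, r, dielectric=1.0):
    """Coulomb energy between two partial charges at distance r."""
    return COULOMB_CONSTANT * q_i * q_j / (dielectric * r)

def nonbonded_6_12(r, well_depth, r_min):
    """6-12 (Lennard-Jones) energy; equals -well_depth at r = r_min."""
    ratio6 = (r_min / r) ** 6
    return well_depth * (ratio6 * ratio6 - 2.0 * ratio6)

def total_energy(atoms, well_depth=0.1, r_min=3.5, dielectric=1.0):
    """Sum both terms over all atom pairs (a real force field would skip bonded pairs)."""
    energy = 0.0
    for a, b in combinations(atoms, 2):
        r = math.dist(a["xyz"], b["xyz"])
        energy += electrostatic(a["charge"], b["charge"], r, dielectric)
        energy += nonbonded_6_12(r, well_depth, r_min)
    return energy

# Two hypothetical neutral atoms sitting exactly at the minimum-energy distance.
pair = [{"xyz": (0.0, 0.0, 0.0), "charge": 0.0},
        {"xyz": (3.5, 0.0, 0.0), "charge": 0.0}]
pair_energy = total_energy(pair)
```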
34. ►Selection Operator
٭Ranked Selection – higher the rank, higher the
probability of being chosen
٭Fitness Selection – better the fitness, higher the
probability of being chosen
►Benefits of Selection
٭Aid the Elitism Search
[Figure labels: higher rank or better fitness; lower rank or worse fitness]
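The two selection schemes differ only in how the selection weights are derived. The sketch below assumes that fitness is an energy where lower is better, uses linear rank weights, and shifts energies so that the worst individual gets a weight near zero. These weighting choices and the function names are assumptions, not details taken from GAPSS.

```python
import random

def ranked_selection(population, energies, rng):
    """Pick one individual with probability proportional to its rank."""
    order = sorted(range(len(population)), key=lambda i: energies[i])  # best first
    n = len(order)
    weights = [n - rank for rank in range(n)]  # best gets n, worst gets 1
    chosen = rng.choices(order, weights=weights, k=1)[0]
    return population[chosen]

def fitness_selection(population, energies, rng):
    """Pick one individual with probability proportional to how far it beats the worst energy."""
    worst = max(energies)
    weights = [worst - e + 1e-9 for e in energies]  # small offset avoids all-zero weights
    return rng.choices(population, weights=weights, k=1)[0]

rng = random.Random(1)
population = ["best", "mid", "worst"]
energies = [-10.0, -5.0, 0.0]
rank_picks = [ranked_selection(population, energies, rng) for _ in range(3000)]
fit_picks = [fitness_selection(population, energies, rng) for _ in range(3000)]
```

Ranked selection still gives the worst individual a real chance of being chosen, while fitness selection here all but excludes it. That is one reason to prefer ranks when the energy differences in a population are large.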
35. ► Mutation Operator
٭Uniform Mutation – randomly
replace with a value from
-180 to 180
٭Non-uniform mutation – add
or subtract a random value
between 0 and 180
► Effects of Mutation
٭Introduce variance to search
٭Aid the search for global
minimum by directing
gradient search out of the
local minima
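Both mutation variants act gene by gene on a chromosome of torsion angles. The sketch below assumes a per-gene mutation probability `rate` and wraps the result of the non-uniform variant back into the range -180 to 180. Both of those details are assumptions, since the slide does not say how out-of-range angles are handled.

```python
import random

def wrap(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def uniform_mutation(chromosome, rate, rng):
    """Replace each selected gene with a fresh value from -180 to 180."""
    return [rng.uniform(-180.0, 180.0) if rng.random() < rate else gene
            for gene in chromosome]

def nonuniform_mutation(chromosome, rate, rng):
    """Add or subtract a random value between 0 and 180 to each selected gene."""
    mutated = []
    for gene in chromosome:
        if rng.random() < rate:
            gene = wrap(gene + rng.choice((-1.0, 1.0)) * rng.uniform(0.0, 180.0))
        mutated.append(gene)
    return mutated

rng = random.Random(7)
original = [-57.0, -47.0, 120.0, 179.0]
mutated = nonuniform_mutation(original, rate=0.9, rng=rng)
```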
36. ►Crossover Operator
٭Random 2-point Crossover
– randomly exchange
between parents 2 angles at
a time
٭Multiple Entries Crossover
– multiple random
exchange
►Benefits of Crossover
٭Aid the search for elites
٭Optimize the search by
keeping the optimal folding
segments
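One reading of the two crossover variants is sketched below. It takes "2 angles at a time" to mean swapping the genes at two randomly chosen positions, and "multiple entries" to mean swapping each position independently with some probability. Both interpretations, and the `swap_probability` parameter, are assumptions about the slide wording rather than the confirmed GAPSS behavior.

```python
import random

def two_point_crossover(parent_a, parent_b, rng):
    """Exchange the torsion angles at two randomly chosen positions."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in rng.sample(range(len(parent_a)), 2):
        child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def multiple_entries_crossover(parent_a, parent_b, swap_probability, rng):
    """Exchange each position independently with the given probability."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(parent_a)):
        if rng.random() < swap_probability:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

rng = random.Random(3)
a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
b = [-10.0, -20.0, -30.0, -40.0, -50.0, -60.0]
c1, c2 = two_point_crossover(a, b, rng)
```

Either way, every gene in a child comes from the same position in one of the parents, so a well-folded segment can be passed on intact.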
37. ►Adaptation Operator
٭Gradient search applied to
each chromosome
٭Predict energy profile
►Benefits of Adaptation
٭Provide the local minima
search
٭Determine the energy
profile of the native folding
process
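The adaptation step can be pictured as a short gradient descent applied to one chromosome, with the energies visited along the way recorded as the energy profile. The sketch below uses a central-difference gradient and a step that is halved whenever a trial move fails to lower the energy. The toy energy function, step sizes, and tolerances are all hypothetical choices, not GAPSS settings.

```python
def numerical_gradient(energy, angles, step=1e-3):
    """Central-difference estimate of dE/d(angle) for every gene."""
    grad = []
    for i in range(len(angles)):
        up = list(angles)
        down = list(angles)
        up[i] += step
        down[i] -= step
        grad.append((energy(up) - energy(down)) / (2.0 * step))
    return grad

def adapt(energy, angles, learning_rate=10.0, max_iters=200, tol=1e-6):
    """Descend to a nearby local minimum; return final angles and the energy profile."""
    current = list(angles)
    profile = [energy(current)]
    for _ in range(max_iters):
        grad = numerical_gradient(energy, current)
        if max(abs(g) for g in grad) < tol:   # (near) zero gradient: local minimum
            break
        trial = [a - learning_rate * g for a, g in zip(current, grad)]
        trial_energy = energy(trial)
        if trial_energy < profile[-1]:
            current = trial
            profile.append(trial_energy)
        else:
            learning_rate *= 0.5              # overshoot: shrink the step
    return current, profile

def toy_energy(angles):
    """Placeholder energy with a single minimum at 60 degrees for every gene."""
    return 0.01 * sum((a - 60.0) ** 2 for a in angles)

minimum, profile = adapt(toy_energy, [0.0, 120.0, -30.0])
```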
38. ► Free GA search – no restriction on dihedral angles with
exception of omega and ring structure
٭Advantages: use in any protein search, empirical way of obtaining
protein conformation, and useful for energy profile search
► α-helix and β-sheet specific GA search – randomly select
segments of the protein as α-helices and β-sheets
٭Advantages: enhance the speed of free GA and accurate search for
α-helices and β-sheets
► Binary GA search – use binary to represent dihedral angles
instead of decimal
٭Advantages: No barrier when doing crossover
39. ►Creates α-helices and β-sheets
of random lengths at random
start positions
►Each α-helix or β-sheet created
in this way is described by two
parameters
►Crossover will involve trading
the two parameters between
two individuals
40. ►When α-helices are crossed
over, each individual’s new
energy is compared to its old
energy. If there is a net
improvement, the crossover
is kept.
►The “former helix” regions
will be filled with random
torsion angles like normal
[Figure labels: Green region; Blue region]
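The helix-specific crossover can be sketched as follows, under several assumptions. The two parameters describing a helix are taken to be its start position and length. The helix is imposed with the canonical α-helix backbone angles (phi, psi) = (-57°, -47°). "Net improvement" is read as each individual separately keeping the trade only if its own energy drops. The data layout and `toy_energy` are hypothetical.

```python
import random

ALPHA_HELIX = (-57.0, -47.0)  # canonical (phi, psi) in degrees

def apply_helix(backbone, start, length):
    """Return a copy of the backbone with a helix imposed on [start, start + length)."""
    out = list(backbone)
    for i in range(start, min(start + length, len(out))):
        out[i] = ALPHA_HELIX
    return out

def swap_helices(ind_a, ind_b, energy, rng):
    """Trade helix parameters; each individual keeps the trade only if its energy improves."""
    results = []
    for own, other in ((ind_a, ind_b), (ind_b, ind_a)):
        start, length = own["helix"]
        trial = list(own["backbone"])
        # The former helix region is refilled with random torsion angles.
        for i in range(start, min(start + length, len(trial))):
            trial[i] = (rng.uniform(-180.0, 180.0), rng.uniform(-180.0, 180.0))
        trial = apply_helix(trial, *other["helix"])
        if energy(trial) < energy(own["backbone"]):
            results.append({"backbone": trial, "helix": other["helix"]})
        else:
            results.append(own)  # reject the crossover
    return results

def toy_energy(backbone):
    """Placeholder: counts non-helical residues among the first four."""
    return sum(1 for residue in backbone[:4] if residue != ALPHA_HELIX)

flat = (0.0, 0.0)
a = {"backbone": [flat] * 4 + [ALPHA_HELIX] * 3 + [flat], "helix": (4, 3)}
b = {"backbone": [ALPHA_HELIX] * 4 + [flat] * 4, "helix": (0, 4)}
new_a, new_b = swap_helices(a, b, toy_energy, random.Random(5))
```

In this toy case individual `a` gains the favorable helix from `b` and keeps it, while `b` would lose energy by giving it up and therefore stays unchanged.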
41. ►Transfer torsion angles to binary code
٭Integer and decimal coded separately to shorten the total
number of digits - 17 digits altogether
►Idea is to make the torsion angles on a single
chromosome represented by one long continuous
chain
٭Cross over and Mutation operators all similar to GA
10100101010010000101001110101100001
01011010100100001010010101001000010
10010101001010010100101010011100
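A 17-digit code per torsion angle can be built in more than one way. The sketch below shows one possible split, which is an assumption: 1 sign bit, 9 bits for the integer part, and 7 bits for the two decimal places stored as a whole number from 0 to 99. The slide only states that the integer and decimal parts are coded separately and total 17 digits. Concatenating the codes gives the single continuous chain.

```python
INT_BITS = 9     # assumed width of the integer part
FRAC_BITS = 7    # assumed width of the two-decimal part (0..99)
GENE_BITS = 1 + INT_BITS + FRAC_BITS  # 17 bits per torsion angle

def encode_angle(angle):
    """Encode one angle (degrees, two decimals) as a 17-character bit string."""
    sign = "1" if angle < 0 else "0"
    whole, frac = divmod(round(abs(angle) * 100), 100)
    return sign + format(whole, "09b") + format(frac, "07b")

def decode_angle(bits):
    """Inverse of encode_angle."""
    sign = -1.0 if bits[0] == "1" else 1.0
    whole = int(bits[1:1 + INT_BITS], 2)
    frac = int(bits[1 + INT_BITS:GENE_BITS], 2)
    return sign * (whole + frac / 100.0)

def encode_chromosome(angles):
    """Concatenate all genes into one continuous bit chain."""
    return "".join(encode_angle(a) for a in angles)

def decode_chromosome(bits):
    return [decode_angle(bits[i:i + GENE_BITS])
            for i in range(0, len(bits), GENE_BITS)]

chain = encode_chromosome([-57.25, 180.0, 0.5])
```

With this layout, flipping a bit during mutation can produce an integer part above 180 or a decimal part above 99, so a real implementation would need to repair or reject such genes.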
43. ►All single amino acids were predicted with GAPSS
►GA parameters
٭Initial population: 20
٭Generation limit: 15
٭Percentage of mutations: 90%
►Compared to native single amino acid folding
44. [Figure panels: Alanine (A, Ala); Asparagine (N, Asn); Aspartic Acid (D, Asp); Cysteine (C, Cys); Glutamine (Q, Gln); Glutamic Acid (E, Glu); Glycine (G, Gly); Isoleucine (I, Ile)]
45. [Figure panels: Leucine (L, Leu); Methionine (M, Met); Serine (S, Ser); Threonine (T, Thr); Valine (V, Val)]
46. ►Enkephalin is a pentapeptide that is involved in
regulating pain
►Two forms of enkephalin
٭Methionine-enkephalin – Tyr-Gly-Gly-Phe-Met
٭Leucine-enkephalin – Tyr-Gly-Gly-Phe-Leu
►Short enough to confirm the accuracy of
GAPSS, yet still contains complex ring side
groups
47. ►Zero-gradient conformations suggest that GAPSS
is capable of locating local minima
►Backbone conformations showed strong
similarities
►Side group conformations still show discrepancies
between predicted and theoretical
48. ►GAPSS was able to locate a few local-minimum
protein conformations
49. ►Backbone structure was predicted by the GAPSS
[Figure: GA-predicted backbone structure vs. NMR-confirmed backbone structure]
50. ► Discrepancies between side groups are due to the omission of
entropy and solvation energy and to the centered partial charge
assumption
[Figure: GA-predicted backbone structure vs. NMR-confirmed backbone structure]
51. ► (a) The minimum energy of each
generation for different initial
populations at a 3-generation limit
and 20% mutation
► (b) The minimum energy of each
generation for different
mutation percentages at a 10-generation
limit and an initial
population of 20.
► The optimal condition was found
to be an initial population of 30, a
15-generation limit, and a 90%
mutation percentage
52. ► Progression of protein folding for the best prediction. The potential energy
continues to decrease, suggesting that more stringent GA parameters could lead to
the global minimum
53. ►Due to limited computing capability, less stringent GA
parameters were used
►The energy of the predicted enkephalin structure is lower than
the theoretical value; however, the code still shows the energy
decreasing
►More sophisticated partial charge and non-bonded
energy calculations could improve the prediction
►There are zero-gradient structures predicted by the GAPSS
54. ► GA-based search and optimization is a simple and efficient method
for predicting the native structure of an isolated protein
► Continuous decimal representation of dihedral angles is more
efficient than binary representation of dihedral angles, despite the
crossover barriers
► α-helix and β-sheet search converges faster than free torsion
angle search
► Backbone dihedrals predicted by the VBA GA are similar to those in the
Protein Data Bank
55. Chemical, Biological, and Materials Engineering
Department, University of Oklahoma
Advanced Design II
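56. ►Distance calculation from the origin
$$x^2 = (R\cos\theta_1)^2 = R^2\cos^2(\theta_1)$$
$$y^2 = (R\sin\theta_1\cos\beta_1)^2 = R^2\sin^2(\theta_1)\cos^2(\beta_1)$$
$$z^2 = (R\sin\theta_1\sin\beta_1)^2 = R^2\sin^2(\theta_1)\sin^2(\beta_1)$$
$$
\begin{aligned}
\sqrt{(x)^2 + (y)^2 + (z)^2}
&= \sqrt{R^2\cos^2(\theta_1) + R^2\sin^2(\theta_1)\cos^2(\beta_1) + R^2\sin^2(\theta_1)\sin^2(\beta_1)} \\
&= \sqrt{R^2\cos^2(\theta_1) + R^2\sin^2(\theta_1)\left[\cos^2(\beta_1) + \sin^2(\beta_1)\right]} \\
&= \sqrt{R^2\left[\cos^2(\theta_1) + \sin^2(\theta_1)\right]} \\
&= \sqrt{R^2(1)} = R
\end{aligned}
$$
[Figure: axes x, Y, Z with angles θ and ω; H- (0,0,0), N (-1.04,0,0), Cα (-1.52,1.37,0)]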
57. ►Rotate one axis at a time to compensate for bond
and dihedral angle; there is no rotation around y
[Figure: rotation by θz about the z axis (z = z’) and rotation by θx about the x axis (x = x’)]
59. θy is 0, so most of the trigonometric functions cancel
60.
$$
\det B_n =
\begin{vmatrix}
-\cos\theta_{ij} & -\sin\theta_{ij} & 0 & -r_{i+1,j}\cos\theta_{ij} \\
\sin\theta_{ij}\cos\omega_{ij} & -\cos\theta_{ij}\cos\omega_{ij} & -\sin\omega_{ij} & r_{i+1,j}\sin\theta_{ij}\cos\omega_{ij} \\
\sin\theta_{ij}\sin\omega_{ij} & -\cos\theta_{ij}\sin\omega_{ij} & \cos\omega_{ij} & r_{i+1,j}\sin\theta_{ij}\sin\omega_{ij} \\
0 & 0 & 0 & 1
\end{vmatrix}
= 1
$$
With bond length $R = r_{i+1,j}$, the matrix places the next atom at
$$
\begin{pmatrix} x_{i+1} \\ y_{i+1} \\ z_{i+1} \end{pmatrix}
=
\begin{pmatrix}
-R\cos(\theta_{ij}) \\
R\sin(\theta_{ij})\cos(\omega_{ij}) \\
R\sin(\theta_{ij})\sin(\omega_{ij})
\end{pmatrix}
$$