3. AlphaFold
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…
What is it?
- AlphaFold (AF) is an artificial intelligence program
- Developed by Google’s DeepMind
The Goal:
- Predicting the three-dimensional
structure that a protein will adopt
based solely on its amino acid sequence
It “solves” two main problems:
1. Sequence-Structure gap
2. Protein folding
Why solve these problems?
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
4. Sequence-Structure gap
- 1958: determination of
the first protein
structure.
- John Kendrew & Max
Perutz
- Structure determination
(experimental):
- NMR
- X-ray crystallography
- Cryo-Electron
microscopy
- Protein Data Bank:
- Total: ~170,000
- Unique: ~100,000
5. AlphaFold 1
The protein folding problem
- 1972: Christian Anfinsen, Nobel Prize in
Chemistry.
- “It should be possible to determine a
protein’s three-dimensional shape based
solely on its sequence”
- A typical protein could adopt
10^300 different configurations
- Trying them all would take longer than the age of the universe
- However, in nature, proteins spontaneously fold
into their functional shape.
- Cyrus Levinthal’s paradox (1969)
- An open research problem for 50 years
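The scale of this estimate can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical sampling rate of 10^13 configurations per second and an age of the universe of roughly 13.8 billion years. Both figures are illustrative assumptions, not values from the slide.

```python
# Back-of-the-envelope check of Levinthal's paradox.
# ASSUMPTIONS (not from the slides): sampling rate and age of the universe.
CONFIGURATIONS = 10**300              # estimate quoted on the slide
SAMPLES_PER_SECOND = 10**13           # assumed: one configuration every 0.1 ps
SECONDS_PER_YEAR = 365 * 24 * 3600    # 31,536,000 seconds
AGE_OF_UNIVERSE_YEARS = 138 * 10**8   # assumed: about 13.8 billion years

# Years needed to visit every configuration once (integer arithmetic).
search_years = CONFIGURATIONS // (SAMPLES_PER_SECOND * SECONDS_PER_YEAR)
# How many "ages of the universe" that search would take.
ratio = search_years // AGE_OF_UNIVERSE_YEARS

print(f"Exhaustive search: about 10^{len(str(search_years)) - 1} years")
print(f"That is about 10^{len(str(ratio)) - 1} times the age of the universe")
```

Even with a much faster assumed sampling rate, the exponent barely moves, which is why proteins cannot be folding by random search.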
6. The protein folding problem
CASP
Critical Assessment of
Techniques for Protein
Structure prediction
• The protein folding Olympics
• The state of the art in
protein structure prediction
- The competition:
- Since 1994
- Takes place every two years
- Last competition: CASP14 – 2020
- Organizers:
- Know both the sequence and the
structure
Participants:
- Receive only the protein’s
sequence
- Must blindly predict the
structure of the proteins
- Predictions: compared with
the experimental data
7. Homology modeling
INPUT: query sequence Q
INPUT: Database of protein structures
1. Find a protein P with high sequence similarity to Q
2. Return P’s structure as an approximation to Q’s structure
Threading & Fragment assembly
INPUT: query sequence Q
INPUT: Database of known folds or structure fragments
1. Find a set of fragments F that Q can be aligned with
2. Return F as an approximation to Q’s structure
Molecular dynamics
INPUT: query sequence Q
1. Use the laws of physics to simulate the folding of Q
• Force field
• Molecular mechanics
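The two homology modeling steps above can be sketched in a few lines. This is a toy illustration only: the database entries, the structure labels, the `min_similarity` threshold, and the use of `difflib.SequenceMatcher` as a stand-in for a real sequence alignment score are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Toy "database" of known structures: sequence -> placeholder structure label.
# All entries are invented for illustration.
STRUCTURE_DB = {
    "MGAFGHGFGTYHK": "structure_A",
    "MKTLLVAGGSDDE": "structure_B",
    "PHLSLLCMFAAKL": "structure_C",
}

def similarity(a, b):
    """Crude similarity score in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()

def homology_model(query, db, min_similarity=0.5):
    """Step 1: find the template P most similar to Q. Step 2: return P's structure."""
    best_seq = max(db, key=lambda seq: similarity(query, seq))
    score = similarity(query, best_seq)
    if score < min_similarity:
        return None, score  # no sufficiently similar template in the database
    return db[best_seq], score

# Query differs from the first database entry by a single residue.
structure, score = homology_model("MGAFGHGYGTYHK", STRUCTURE_DB)
print(structure, round(score, 2))
```

The `None` branch shows the main limitation of the approach: when no similar protein with a known structure exists, there is nothing to return.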
8. CASP before AlphaFold
The metric:
- How well does the prediction agree
with the experimental data?
GDT: Global Distance Test
- Compares two structures
- From 0 to 100 (%)
- Greater is better
- Uses distance cutoffs
- Uses alpha carbons (Cα)
- Less sensitive to outliers than RMSD
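A simplified version of the score can be written down directly from the points above. The sketch assumes the two structures are already superimposed and list their alpha carbons in the same residue order. The full GDT procedure also searches over superpositions, which is omitted here. The cutoffs of 1, 2, 4 and 8 Å are those of the GDT_TS variant, and the coordinates are invented.

```python
import math

def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for two already-superimposed C-alpha traces.

    predicted, experimental: lists of (x, y, z) coordinates in angstroms,
    one per residue, in the same residue order.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty traces of equal length")
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # For each cutoff, the fraction of residues within that distance.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    # Average the fractions over all cutoffs and express as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)

# Invented coordinates: per-residue deviations of 0.5, 1.5, 3.0 and 9.0 angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]
print(gdt_ts(predicted, experimental))  # 56.25
```

The residue that is 9 Å off lowers the score but cannot dominate it, whereas a single large deviation inflates an RMSD.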
[Figure labels: Homology modeling; Threading & Fragment assembly; Molecular dynamics]
9. CASP and AlphaFold
CASP14: 152 targets
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
10. How does it work? AlphaFold uses Deep Learning
Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning
Machine learning:
Learn from data
“The field of study that gives computers
the ability to learn without
being explicitly programmed”
Traditional Approach: Data + Algorithm → Computer → Results
Machine Learning Approach: Data + Results → Computer → Algorithm
Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019
11. How does it work? AlphaFold uses Deep Learning
X → f → y
ML: approximates f using data (X, y)
$f \approx \hat{f} + \mathcal{E}$
- $f$: a true relationship between two variables
- $\hat{f}$: the ML model
Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019
12. Machine Learning
Data = (X, y)
$\hat{f}(X) = \hat{y}$
ML model:
1. The ML model (blueprint):
- A linear regression model: $\hat{y} = w \cdot x + b$
2. A training algorithm
1. Data (training set)
2. Loss function (error)
3. Optimization algorithm
- The goal: Minimize the error, so that $\hat{y} \approx y$
3. A validation and a test set
1. Training set
2. Test set (data never seen by the model)
- Generalization
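The ingredients listed on this slide (model, training data, loss function, optimization algorithm, held-out test set) fit together as in the minimal sketch below. The synthetic data, the assumed true relationship y = 2x + 1, the learning rate, and the number of iterations are all choices made for illustration.

```python
import random

random.seed(0)

# Synthetic data from an assumed true relationship y = 2x + 1 plus small noise.
xs = [i / 10 for i in range(50)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in xs]
random.shuffle(data)
train, test = data[:40], data[40:]  # the test set is never used for fitting

def mse(w, b, samples):
    """Loss function: mean squared error between prediction and target."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

# Optimization algorithm: plain batch gradient descent on the training loss.
w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.2f}, b={b:.2f}")
print(f"train MSE={mse(w, b, train):.4f}, test MSE={mse(w, b, test):.4f}")
```

A test error close to the training error is what generalization means in practice: the model performs similarly on data it has never seen.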
13. Deep Learning
A linear regression model
$\hat{y} = w \cdot x + b$ (prediction: $\hat{y}$)
- A linear relationship between x and y
- Machine Learning
A Neural Network (Feed Forward)
[Figure: feed-forward network with inputs $x_1$, $x_2$, $x_3$, hidden nodes $a$ arranged in layers, and output $\hat{y}$ (prediction). Detail: a node $a_k$ with its input edges (weights $w_1$, $w_2$, $w_3$) and its activation function.]
- More complex models
- Learns non-linear relationships
- Learns interactions between features (X)
- Feature extraction
- More than three hidden layers
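The forward pass of a feed-forward network can be sketched as follows: each node takes a weighted sum of its input edges and applies an activation function. The weights here are hand-picked and hypothetical, ReLU is just one common choice of activation, and the network has a single hidden layer to keep the example short.

```python
def relu(z):
    """Activation function: passes positive values, zeroes out negatives."""
    return max(0.0, z)

def identity(z):
    """No-op activation, used here for the output node."""
    return z

def node(inputs, weights, bias, activation):
    """One node: weighted sum of its input edges, then the activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases, activation):
    """A layer is a list of nodes that all read the same inputs."""
    return [node(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Hand-picked, hypothetical weights: 3 inputs -> 2 hidden nodes -> 1 output.
hidden_w = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, 1.0]]
output_b = [0.0]

def predict(x):
    """Forward pass: inputs -> hidden layer (ReLU) -> output prediction."""
    hidden = layer(x, hidden_w, hidden_b, relu)
    return layer(hidden, output_w, output_b, identity)[0]

print(predict([1.0, 0.0, 0.0]))   # 1.0
print(predict([-1.0, 0.0, 0.0]))  # 0.0
```

Flipping the sign of the input does not flip the sign of the output, which a model of the form w * x + b with b = 0 could not do. The activation function is what makes the relationship non-linear.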
19. AlphaFold 1
The distogram
Residue 29
The predicted probability distributions for
distances of residue 29 to all other residues (41)
Senior, et al. (2020). Nature, 577(7792), 706–710.
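One entry of a distogram is a probability distribution over distance bins for a single residue pair. The sketch below shows how such a distribution could be read. The number of bins, the 2 to 22 Å range, and the logits are illustrative assumptions, not the values used in the paper.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# ASSUMED binning for illustration: 8 equal-width bins covering 2-22 angstroms.
N_BINS, D_MIN, D_MAX = 8, 2.0, 22.0
width = (D_MAX - D_MIN) / N_BINS
bin_centers = [D_MIN + (i + 0.5) * width for i in range(N_BINS)]

# Hypothetical network outputs (logits) for one residue pair, e.g. (29, j).
logits = [0.1, 2.5, 4.0, 2.0, 0.3, -1.0, -2.0, -3.0]
probs = softmax(logits)

# Two ways to summarize the distribution: its mode and its mean.
most_likely = bin_centers[probs.index(max(probs))]
expected = sum(p * c for p, c in zip(probs, bin_centers))
print(f"most likely distance bin center: {most_likely} A")
print(f"expected distance: {expected:.2f} A")
```

A sharply peaked distribution signals a confident distance prediction, while a flat one signals uncertainty about that residue pair.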
23. AlphaFold
References:
1. Senior, et al. (2020). Improved protein structure
prediction using potentials from deep learning.
Nature, 577(7792), 706–710.
2. Jumper, J., Evans, R., Pritzel, A. et al. Highly
accurate protein structure prediction with
AlphaFold. Nature 596, 583–589 (2021).
25. Back to AlphaFold 2
X (Sequence) → y (Structure)
• Attention-based Neural Network
• Transformer-based
• Method inspired by biology, physics and
machine learning
• Trained with:
• ~170,000 PDB structures
• UniProt sequences
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…
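The attention mechanism named on this slide can be illustrated with generic scaled dot-product attention, the basic building block of Transformers. This is a sketch of the general idea only, not AlphaFold 2's actual architecture. The 2-dimensional embeddings for a 3-residue toy sequence are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a 3-residue toy sequence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(embeddings, embeddings, embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights shows how strongly one position attends to every other position, which lets information flow between residues that are far apart in the sequence.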