AlphaFold
An overview
AlphaFold
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…
What is it?
- AlphaFold (AF) is an artificial intelligence program
- Developed by Google’s DeepMind
The Goal:
- Predicting the three-dimensional
structure that a protein will adopt
based solely on its amino acid sequence
It “solves” two main problems:
1. Sequence-structure gap
2. Protein folding
Why solve these problems?
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
Sequence-Structure gap
- 1958: determination of the first protein structure
- John Kendrew & Max Perutz
- Structure determination (experimental):
  - NMR
  - X-ray crystallography
  - Cryo-electron microscopy
- Protein Data Bank:
  - Total: ~170,000 structures
  - Unique: ~100,000
AlphaFold 1
The protein folding problem
- 1972: Christian Anfinsen, Nobel Prize in Chemistry.
- “It should be possible to determine a protein’s three-dimensional shape based solely on its sequence”
- A typical protein could adopt 10^300 different configurations
- Sampling them one by one would take longer than the age of the universe
- However, in nature, proteins spontaneously fold into their functional shape.
- Cyrus Levinthal’s paradox (1969)
- An open research problem for 50 years
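A rough back-of-the-envelope check of the "longer than the age of the universe" claim. The 10^300 figure is the one quoted above; the sampling rate of 10^13 configurations per second is an assumption chosen only for illustration, and the age of the universe is taken as roughly 13.8 billion years.

```python
# Back-of-the-envelope check of Levinthal's paradox.
N_CONFIGURATIONS = 10 ** 300        # figure quoted on the slide
SAMPLES_PER_SECOND = 10 ** 13       # assumed sampling rate (illustrative only)
AGE_OF_UNIVERSE_YEARS = 13.8e9      # approximate age of the universe
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_configurations, samples_per_second):
    """Years needed to visit every configuration once at a fixed rate."""
    seconds = n_configurations // samples_per_second  # exact integer arithmetic
    return seconds / SECONDS_PER_YEAR


years = exhaustive_search_years(N_CONFIGURATIONS, SAMPLES_PER_SECOND)
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"exhaustive search: {years:.1e} years")
print(f"that is {ratio:.1e} times the age of the universe")
```

Changing the assumed rate by many orders of magnitude does not change the conclusion, which is the point of the paradox: proteins cannot be folding by random search.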
The protein folding problem
CASP
Critical Assessment of
Techniques for Protein
Structure prediction
• The protein folding Olympics
• The state of the art in
protein structure prediction
- The competition:
  - Since 1994
  - Takes place every two years
  - Last competition: CASP14 – 2020
- Organizers:
  - Know both the sequence and the structure
- Participants:
  - Receive only the protein’s sequence
  - Must blindly predict the structure of the proteins
- Predictions: compared with the experimental data
Homology modeling
- INPUT: query sequence Q
- INPUT: database of protein structures
1. Find a protein P with high sequence similarity to Q
2. Return P’s structure as an approximation to Q’s structure

Threading & fragment assembly
- INPUT: query sequence Q
- INPUT: database of known folds or structure fragments
1. Find a set of fragments F that Q can be aligned with
2. Return F as an approximation to Q’s structure

Molecular dynamics
- INPUT: query sequence Q
1. Use the laws of physics to simulate the folding of Q
• Force field
• Molecular mechanics
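The homology-modeling recipe (find a protein P similar to Q, return P's structure) can be sketched as a nearest-neighbor lookup. This is a toy illustration only: the database contents and structure labels are made up, and `difflib.SequenceMatcher` is a crude stand-in for a real sequence-alignment score.

```python
from difflib import SequenceMatcher

# Hypothetical toy database: sequence -> label standing in for a known structure.
KNOWN_STRUCTURES = {
    "MGAFGHGFGTYHKLAALED": "structure_A",
    "MKTAYIAKQRQISFVKSHF": "structure_B",
    "GTLKHHAKLQPHLSLLCMF": "structure_C",
}


def similarity(a, b):
    """Crude similarity score in [0, 1]; stands in for a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def homology_model(query, database):
    """Return (template sequence, its structure, score) for the best match to query."""
    best = max(database, key=lambda seq: similarity(query, seq))
    return best, database[best], similarity(query, best)


# Query Q differs from the first database entry by a single residue (K -> R).
template, structure, score = homology_model("MGAFGHGFGTYHRLAALED", KNOWN_STRUCTURES)
print(template, structure, round(score, 2))
```

The approach only works when a sufficiently similar protein with a known structure exists, which is exactly the limitation the sequence-structure gap creates.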
CASP before AlphaFold
The metric:
- How well does the prediction agree with the experimental data?
GDT: Global Distance Test
- Compares two structures
- From 0 to 100 (%)
- Greater is better
- Uses distance cutoffs
- Uses alpha carbons (Cα)
- More robust to local outliers than RMSD
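A minimal sketch of how a GDT-style score can be computed, assuming the two structures are already superposed and using the cutoffs 1, 2, 4 and 8 Å of the GDT_TS variant. The full procedure also searches for the best superposition at each cutoff, which is omitted here; the coordinates are invented for the example.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (in percent) for two already-superposed C-alpha traces.

    predicted, experimental: lists of (x, y, z) tuples, one per residue,
    in the same residue order.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty traces of equal length")
    deviations = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each distance cutoff, then averaged.
    fractions = [
        sum(d <= cutoff for d in deviations) / len(deviations)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(fractions)


# Invented coordinates: per-residue deviations are 0.5, 1.5, 3.0 and 9.0 angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]
print(gdt_ts(predicted, experimental))  # 56.25
```

Note how the single badly placed residue (9 Å off) costs a bounded amount of score, whereas it would dominate an RMSD.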
[Figure legend: homology modeling; threading & fragment assembly; molecular dynamics]
CASP and AlphaFold
CASP14: 152 targets
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
How does it work? AlphaFold uses Deep Learning
Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning
Machine learning:
Learn from data
“The field of study that gives computers
the ability to learn without
being explicitly programmed”
Traditional Approach: Data + Algorithm → Computer → Results
Machine Learning Approach: Data + Results → Computer → Algorithm
Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019
How does it work? AlphaFold uses Deep Learning
f: X → y, a true relationship between two variables
ML: approximates f using data (X, y)
$f \approx \hat{f} + \varepsilon$, where $\hat{f}$ is the ML model and $\varepsilon$ the error term
Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019
Machine Learning
Data = (X, y)
$\hat{f}(X) = \hat{y}$
ML model:
1. The ML model (blueprint): a linear regression model, $\hat{y} = w \cdot x + b$
2. A training algorithm
   1. Data (training set)
   2. Loss function (error)
   3. Optimization algorithm
3. A validation and a test set
The goal: Minimize the error, $\hat{y} \approx y$
1. Training set
2. Test set (data never seen by the model): generalization
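The ingredients listed above (a linear model, a loss function, an optimization algorithm, and a held-out test set) fit in a few lines. This is a minimal sketch on synthetic data: the true relationship y = 3x + 2, the noise level, the learning rate and the 160/40 split are all arbitrary choices made for the demonstration.

```python
import random


def mse(w, b, data):
    """Loss function: mean squared error of y_hat = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, lr=0.05, epochs=2000):
    """Optimization algorithm: full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic data: y = f(x) + noise, with f(x) = 3x + 2 (made up for the demo).
rng = random.Random(0)
xs = [rng.uniform(0, 5) for _ in range(200)]
data = [(x, 3.0 * x + 2.0 + rng.gauss(0, 0.1)) for x in xs]
train_set, test_set = data[:160], data[160:]  # the test set is never used for training

w, b = train(train_set)
print(f"w={w:.2f}, b={b:.2f}")
print(f"train MSE={mse(w, b, train_set):.4f}, test MSE={mse(w, b, test_set):.4f}")
```

A low error on the test set, not just on the training set, is what indicates that the model generalizes.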
Deep Learning
A linear regression model (Machine Learning)
$\hat{y} = w \cdot x + b$
- A linear relationship between x and y
A Neural Network (Feed Forward)
[Figure: feed-forward network with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, hidden-node activations $a$ arranged in layers, and an output prediction $\hat{y}$. Callouts: a node with its input edges; activation function.]
- More complex models
- Learns nonlinear relationships
- Learns interactions between features (X)
- Feature extraction
- More than three hidden layers
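A minimal sketch of a feed-forward pass, to make "a node with its input edges" and "activation function" concrete. The layer sizes, the hand-picked weights and the choice of ReLU as activation are assumptions for illustration, not taken from the slides.

```python
def relu(z):
    """A common activation function: pass positive values, zero out negatives."""
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One layer: every node sums its weighted input edges, adds a bias,
    then applies the activation function."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed the input through each layer in turn and return the prediction."""
    a = x
    for weights, biases, activation in layers:
        a = dense(a, weights, biases, activation)
    return a


# Hypothetical hand-picked weights: 3 inputs -> 2 hidden nodes -> 1 output.
layers = [
    ([[1.0, -1.0, 0.0], [0.5, 0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.1], identity),
]
y_hat = forward([1.0, 2.0, 3.0], layers)
print(y_hat)
```

Without the nonlinear activation, stacking layers would collapse back into a single linear model; the activation is what lets the network learn nonlinear relationships and feature interactions.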
Key: Feature Extraction
AlphaFold 1
The model:
• CASP13 (2018)
• Convolutional neural network
Training:
• Structures: 31,247 domains
• Sequences: Uniclust30
Senior, et al. (2020). Nature, 577(7792), 706–710.
X: sequence → y: structure
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…
AlphaFold 1
• Input:
• Protein amino acid sequence
• Multiple Sequence Alignments (MSA):
• Profile features
MSA to Profiles - PSSM (position-specific scoring matrix):
Rows: amino acids; columns: sequence positions

      1    2    3    4    5    6    7    ..   n
A     -    -    0.2  -    -    -    -    -    -
R     -    -    0.3  -    -    -    -    -    -
F     -    -    0.5  -    -    -    -    -    -
G     -    -    0    -    -    -    -    -    -
..    -    -    0.8  -    -    -    -    -    -
Y     -    -    1.2  -    -    -    -    -    -

⚙ → Families and Domains
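A minimal sketch of how an MSA can be turned into a profile. It assumes a uniform background frequency and a simple pseudocount; real PSSM construction typically uses observed background frequencies and sequence weighting, so this is a simplified stand-in. The four-sequence alignment is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def msa_to_pssm(msa, pseudocount=1.0):
    """Return {amino acid: [score per column]} as log-odds scores
    against a uniform background (a simplifying assumption)."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = {aa: [] for aa in AMINO_ACIDS}
    for col in range(length):
        # Count residues in this column; gap characters are skipped.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            pssm[aa].append(math.log2(freq / background))
    return pssm


# Invented toy alignment: 4 sequences, 4 columns, one gap.
msa = ["MGAF", "MGSF", "MAAF", "M-AY"]
pssm = msa_to_pssm(msa)
for aa in "MGAFY":
    print(aa, [round(score, 2) for score in pssm[aa]])
```

Positive scores mark residues that are over-represented at a position (conserved), negative scores mark residues that are rarely seen there.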
Senior, et al. (2020). Nature, 577(7792), 706–710.
MSA
Profile
Structure
optimization
AlphaFold 1
Senior, et al. (2020). Nature, 577(7792), 706–710.
Input sequence (MGAFGHGFG TYHKLAALED GTLKHHAKLQ PHLSLLCMF…) → MSA → Profile (X) → The ML model (Convolutional Neural Network) → The distogram (y)
The central component:
• A convolutional neural network
• Trained on PDB structures
• It predicts the distances d_ij between the Cβ atoms of every pair of residues (i, j) in a protein.
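A minimal sketch of what a distogram encodes: each residue pair gets a distribution over distance bins rather than a single number. The binning used here (64 equal bins between 2 and 22 Å) is an assumption for illustration, and the coordinates are invented; the functions show how training labels could be derived from known Cβ coordinates and how a predicted distribution maps back to a distance.

```python
import math

N_BINS = 64                 # assumed number of distance bins
D_MIN, D_MAX = 2.0, 22.0    # assumed distance range in angstroms
BIN_WIDTH = (D_MAX - D_MIN) / N_BINS


def distance_to_bin(d):
    """Map a distance to a bin index; out-of-range values go to the first or last bin."""
    index = int((d - D_MIN) // BIN_WIDTH)
    return min(max(index, 0), N_BINS - 1)


def distogram_labels(cb_coords):
    """L x L matrix of bin indices for the C-beta distance of every residue pair (i, j)."""
    return [[distance_to_bin(math.dist(a, b)) for b in cb_coords] for a in cb_coords]


def expected_distance(probabilities):
    """Mean distance implied by one pair's predicted distribution over the bins."""
    centers = [D_MIN + (k + 0.5) * BIN_WIDTH for k in range(N_BINS)]
    return sum(p * c for p, c in zip(probabilities, centers))


# Invented C-beta coordinates for three residues.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
labels = distogram_labels(coords)
print(labels)

# A prediction that puts all probability mass in bin 9 for some pair.
one_hot = [1.0 if k == 9 else 0.0 for k in range(N_BINS)]
print(expected_distance(one_hot))
```

Predicting a full distribution lets the model express uncertainty: a sharp peak means a confident distance, a flat distribution means the pair is poorly constrained.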
AlphaFold 1
Senior, et al. (2020). Nature, 577(7792), 706–710.
AlphaFold 1
The distogram
Residue 29
The predicted probability distributions for the distances from residue 29 to all other residues (41)
Senior, et al. (2020). Nature, 577(7792), 706–710.
Gradient descent:
• Rotate the phi and psi backbone torsion angles
• Until the Cβ-Cβ distances of the structure match the predicted distances
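A toy sketch of the idea of folding by gradient descent on angles. It is heavily simplified and is not AlphaFold's actual potential: the chain lives in 2D, one turn angle per residue stands in for the phi and psi torsions, the loss is a plain squared error against target distances, and the gradient is estimated by finite differences. All numeric values are arbitrary.

```python
import math


def chain_coords(turns, bond=1.0):
    """Build a 2D chain: the first bond lies on the x axis and each turn
    angle rotates the direction of the next bond."""
    coords = [(0.0, 0.0), (bond, 0.0)]
    x, y, heading = bond, 0.0, 0.0
    for turn in turns:
        heading += turn
        x += bond * math.cos(heading)
        y += bond * math.sin(heading)
        coords.append((x, y))
    return coords


def distance_matrix(coords):
    return [[math.dist(a, b) for b in coords] for a in coords]


def loss(turns, target):
    """Squared mismatch between the chain's pairwise distances and the targets."""
    d = distance_matrix(chain_coords(turns))
    n = len(d)
    return sum((d[i][j] - target[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


def gradient(turns, target, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(turn angle)."""
    grads = []
    for k in range(len(turns)):
        up = turns[:k] + [turns[k] + eps] + turns[k + 1:]
        down = turns[:k] + [turns[k] - eps] + turns[k + 1:]
        grads.append((loss(up, target) - loss(down, target)) / (2 * eps))
    return grads


def fold(target, n_turns, lr=0.002, steps=3000):
    """Gradient descent on the turn angles, starting from a nearly straight chain."""
    turns = [0.1] * n_turns
    for _ in range(steps):
        g = gradient(turns, target)
        turns = [t - lr * gk for t, gk in zip(turns, g)]
    return turns


# "Predicted" distances: here generated from a known set of angles (made up).
target = distance_matrix(chain_coords([0.7, 0.5, 0.9, 0.6]))
initial_loss = loss([0.1] * 4, target)
folded = fold(target, n_turns=4)
final_loss = loss(folded, target)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Working in angle space keeps bond lengths fixed by construction, so the optimizer only explores chain geometries that are physically connected.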
AlphaFold 1 Protein folding
Senior, et al. (2020). Nature, 577(7792), 706–710.
Senior, et al. (2020). Nature, 577(7792), 706–710.
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
AlphaFold
References:
1. Senior, et al. (2020). Improved protein structure
prediction using potentials from deep learning.
Nature, 577(7792), 706–710.
2. Jumper, J., Evans, R., Pritzel, A. et al. Highly
accurate protein structure prediction with
AlphaFold. Nature 596, 583–589 (2021).
Back to AlphaFold 2
X: sequence → y: structure
• Attention-based neural network
• Transformer-based
• Method inspired by biology, physics and machine learning
• Trained with:
  • ~170,000 PDB structures
  • UniProt sequences
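For readers unfamiliar with the term, a minimal sketch of scaled dot-product attention, the generic building block behind Transformer-style networks. This is not AlphaFold 2's architecture, which is considerably more elaborate; the three two-dimensional "token" vectors are invented, and queries, keys and values are all set to the same vectors (self-attention without learned projections) to keep the example short.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: every query position receives a
    weighted average of the values, weighted by query-key similarity."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Invented embeddings for three positions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Unlike a convolution, which only mixes information from neighboring positions, attention lets every position draw on every other position in a single step, which suits residues that are far apart in sequence but close in space.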
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…

Protein folding prediction using Alphafold 1
