PROTEIN STRUCTURE
PREDICTION USING
MACHINE LEARNING
Borys Biletskyy
Data Science Amsterdam
July, 2016
About Myself
• Born in Kyiv, Ukraine
• I have lived and worked in Amsterdam since 2014
• Architect @ Levi9
• Senior Research Officer @ Glushkov
Institute of Cybernetics
• I like:
• Cycling
• Swimming
• Boating
Motivation and Domain Background
• The process of protein folding is unclear
• Experimental protein structure determination is expensive
• Soluble bacterial protein: $140,000
• Human membrane protein: $2.5 million
• Single successful drug programme: $15-20 million
• Life on Earth is protein-based
• Proteins make up 80% of a cell’s dry mass
• A protein’s structure determines its function
• Very important for the medical industry
• Proteins are biological nano-machines
• Proteins are folded chains of amino acids of 20 types
• A protein is a sequence of elements from a finite alphabet
• We focus on secondary structure prediction
• Mapping a sequence to another sequence
Training Data
• Protein Structure Data Banks
• wwPDB, NCBI, RCSB PDB
• Open access
• Exponential growth
• Unstructured data
• Noisy data, containing duplicates
• 23,000 usable structures out of 100,000
• Example: CRO protein
• PDB record: http://www.rcsb.org/pdb/files/3CRO.pdb
• Extracted Secondary Structure:
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
Protein Secondary Structure Prediction:
Problem Statement
• Protein amino acid chain (aa):
• $x = (x_1, \dots, x_n)$, $x_i \in X = \{\text{20 types of amino acids}\}$
• Protein secondary structure (ss):
• $y = (y_1, \dots, y_n)$, $y_i \in Y = \{\text{3 types of secondary structure: } s, h, -\}$
• Training data:
• set $T$ of $m$ pairs $(x, y)$, each an aa-sequence together with its ss-sequence
• Given training data $T$ and an aa-sequence $x$, find its corresponding ss-sequence $y$
• Assumption: local dependency
• $y_i$ depends on a “frame” of size $r$ in $x$: $(x_{i-r}, \dots, x_i, \dots, x_{i+r})$
x: MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
y: -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
YQSAINKAIHA
.....h.....
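The frame in the example above can be cut out with a small sliding-window helper. This is a minimal sketch: the function name `frame`, the 0-based indexing, and the `*` padding used where the window runs past either end of the chain are assumptions for illustration, since the slides do not say how chain ends are handled.

```python
def frame(x: str, i: int, r: int, pad: str = "*") -> str:
    """Return the window x[i-r .. i+r], padded where it runs off either end."""
    left = max(0, i - r)
    right = min(len(x), i + r + 1)
    # Number of pad symbols needed on each side (padding is an assumption).
    pad_left = left - (i - r)
    pad_right = (i + r + 1) - right
    return pad * pad_left + x[left:right] + pad * pad_right


x = "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
y = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"

i = 30  # 0-based position of the central 'N' in the example frame
print(frame(x, i, 5), y[i])  # prints "YQSAINKAIHA h"
print(frame(x, i, 2), y[i])  # prints "AINKA h"
```

Each position of the chain then yields one training example: the frame as input and the central secondary-structure symbol as the label.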
Probabilistic Tooling
• Bayes’ theorem (how to calculate conditional probabilities)
• $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
• Naïve example (frame size $r = 0$): $P(y_i \mid x_i)$
• Calculate and pick the maximum of $P(y_i = \text{s} \mid x_i)$, $P(y_i = \text{h} \mid x_i)$, $P(y_i = \text{-} \mid x_i)$
• Another example (frame size $r > 0$): $P(y_i \mid x_{i-r}, \dots, x_{i+r})$
• How to calculate probabilities of sequences $(x_{i-r}, \dots, x_{i+r})$?
• Markov chains (how to calculate the probability of a sequence)
• Chain order $k = 1$
• $P(x_1, \dots, x_n) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})$
• Chain order $k = 2$
• $P(x_1, \dots, x_n) = P(x_1, x_2)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_{n-2}, x_{n-1})$
• Anderson results (how to select the best order of the Markov chain)
• $\chi^2 \sim -2 \ln \frac{L_k}{L_{k+1}}$, where $L_k$ is the likelihood of a chain of order $k$
• Allows finding the order of the Markov chain that fits the data best
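The likelihood-ratio statistic on this slide can be computed from simple counts. The sketch below is one way to do it, not the talk's implementation: it fits chains of order k and k+1 by maximum likelihood on the same positions and returns -2 ln(L_k / L_{k+1}). The function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(seqs, k, start):
    """Maximised log-likelihood of an order-k chain over positions start..end.

    `start` must be at least k so that every position has a full context.
    """
    pair, ctx = Counter(), Counter()
    for s in seqs:
        for j in range(start, len(s)):
            context = s[j - k:j]          # the k preceding symbols
            pair[(context, s[j])] += 1
            ctx[context] += 1
    # The maximum-likelihood estimate of P(a | context) is n(context, a) / n(context).
    return sum(n * math.log(n / ctx[c]) for (c, _), n in pair.items())


def lr_statistic(seqs, k):
    """-2 ln(L_k / L_{k+1}), with both chains fitted on the same positions."""
    start = k + 1
    return -2.0 * (log_likelihood(seqs, k, start) - log_likelihood(seqs, k + 1, start))


seqs = ["ABABABABAB"]            # toy data, not protein data
print(lr_statistic(seqs, 0))     # large: order 1 explains the alternation far better
print(lr_statistic(seqs, 1))     # zero: order 2 adds nothing over order 1
```

In the standard form of this test, the statistic is compared against a chi-squared distribution with s^k (s - 1)^2 degrees of freedom for an alphabet of s symbols; that detail comes from the general result and is not stated on the slides.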
Problem-Specific Model
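• Bayes’ theorem and a Markov chain of order $k = 1$
• Then each element $y_i$ is calculated as
$\operatorname{argmax}_{y_i \in Y} P(y_i \mid x_1, \dots, x_{2r+1}) = \operatorname{argmax}_{y_i \in Y} P(y_i)\,P(x_1 \mid y_i) \prod_{j=2}^{2r+1} P(x_j \mid x_{j-1}, y_i)$
where $x_1, \dots, x_{2r+1}$ are the elements of the frame around position $i$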
• We need to calculate 3 conditional probabilities (one for each of the 3 possible values of $y_i$) and pick the maximum
• All factors in the expansion are probabilities computable from the training data
• Elements $y_i" are calculated independently
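The model on this slide can be sketched as a small counting classifier. This is an illustrative sketch under stated assumptions, not the C++ implementation from the talk: the class name `FrameClassifier`, the `*` padding at chain ends, add-one smoothing for unseen events, and the use of log-probabilities to avoid underflow are all choices made here for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids plus a padding symbol (assumption)
CLASSES = "sh-"


def frame(x, i, r, pad="*"):
    """Window of size 2r+1 around position i, padded past the chain ends."""
    return "".join(x[j] if 0 <= j < len(x) else pad for j in range(i - r, i + r + 1))


class FrameClassifier:
    def __init__(self, r):
        self.r = r
        self.prior = Counter()  # class -> count
        self.first = Counter()  # (class, first frame symbol) -> count
        self.trans = Counter()  # (class, previous symbol, symbol) -> count
        self.ctx = Counter()    # (class, previous symbol) -> count

    def fit(self, pairs):
        for x, y in pairs:
            for i, label in enumerate(y):
                f = frame(x, i, self.r)
                self.prior[label] += 1
                self.first[(label, f[0])] += 1
                for a, b in zip(f, f[1:]):
                    self.trans[(label, a, b)] += 1
                    self.ctx[(label, a)] += 1
        return self

    def log_score(self, f, label):
        """log of P(y) P(x_1 | y) prod_j P(x_j | x_{j-1}, y), with add-one smoothing."""
        n = len(ALPHABET)
        total = sum(self.prior.values())
        s = math.log((self.prior[label] + 1) / (total + len(CLASSES)))
        s += math.log((self.first[(label, f[0])] + 1) / (self.prior[label] + n))
        for a, b in zip(f, f[1:]):
            s += math.log((self.trans[(label, a, b)] + 1) / (self.ctx[(label, a)] + n))
        return s

    def predict(self, x):
        # Each position is scored independently, as on the slide.
        return "".join(
            max(CLASSES, key=lambda c: self.log_score(frame(x, i, self.r), c))
            for i in range(len(x))
        )


# Toy data for illustration only; r = 0 reduces to the naive single-symbol case.
clf = FrameClassifier(r=0).fit([("AAAAKKKK", "hhhhssss")])
print(clf.predict("AK"))  # prints "hs"
```

Because every position is scored on its own, the loop in `predict` can be split across workers without coordination.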
Building And Executing Prediction
Algorithms
• Implemented in C++
• Computational complexity:
• Training: $O(m \ln m)$
• Prediction: $O(\ln m)$
• Parallelizable
• Executed on an NVIDIA GeForce 8800 GTX-based GPU cluster with a total performance of 4 TFLOPS
• Can be implemented using MapReduce:
• Hadoop
• Spark
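Training reduces to counting, and counts are additive, which is what makes a MapReduce formulation natural. The sketch below shows only the shape of the map and reduce steps in plain Python; it does not use Hadoop or Spark APIs, and the function names, key layout, and toy training pairs are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_counts(pair, r=1, pad="*"):
    """Map step: emit the local counts contributed by one (x, y) training pair."""
    x, y = pair
    counts = Counter()
    for i, label in enumerate(y):
        f = "".join(x[j] if 0 <= j < len(x) else pad for j in range(i - r, i + r + 1))
        counts[("class", label)] += 1
        for a, b in zip(f, f[1:]):
            counts[("trans", label, a, b)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: partial counts merge by addition, in any order."""
    return left + right


training = [("MEQ", "-ss"), ("KDY", "hhh")]  # toy pairs, not real annotations
total = reduce(reduce_counts, map(map_counts, training), Counter())
print(total[("class", "h")])  # prints 3
```

Since addition is associative and commutative, the per-protein counters can be produced on separate workers and merged in whatever order they arrive.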
Adjusting Model Parameters
• Markov chain order $k$
• Selected using Anderson results in a series of statistical hypothesis tests
• Depends on the training data (higher-order chains require more data)
• Chain order $k = 3$ was used
• Frame size $r$
• Selected empirically
• Frame size 14 was used
Order k = 1, or order k = 2, or…?
Frame size r = 5:
YQSAINKAIHA
.....h.....
Frame size r = 2:
AINKA
..h..
or…?
Accuracy Evaluation
• Single protein structure prediction accuracy
• C3 – ratio of correctly predicted items to protein length
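• C(s), C(h), C(-) – secondary structure type-specific accuracy coefficients
$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}$, $\alpha \in \{s, h, -\}$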
• Example: CRO protein ss prediction
• Sequence:  MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• Actual:    -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
• Predicted: --ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------
• Accuracy:
• C3: 0.878788
• C(s): 0.815068
• C(h): 0.92674
• C(-): 0.74525
• Prediction accuracy of the model on the training data set (23,000 structures)
• Cross-validation
• Single protein exclusion
• Average C3 – 0.83
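Both accuracy measures on this slide can be computed directly from the actual and predicted strings. The sketch below assumes the usual reading of the coefficient's symbols, which the slide does not spell out: p counts positions correctly assigned to the type, n positions correctly left out of it, u positions of the type that were missed (underpredicted), and o positions wrongly assigned to it (overpredicted). The function names are hypothetical.

```python
import math


def c3(actual, predicted):
    """Fraction of positions whose secondary-structure symbol is predicted correctly."""
    return sum(a == q for a, q in zip(actual, predicted)) / len(actual)


def type_coefficient(actual, predicted, cls):
    """Correlation coefficient C_alpha for one secondary-structure type."""
    p = n = u = o = 0
    for a, q in zip(actual, predicted):
        if a == cls and q == cls:
            p += 1          # correctly predicted as cls
        elif a != cls and q != cls:
            n += 1          # correctly predicted as not cls
        elif a == cls:
            u += 1          # underpredicted: cls missed
        else:
            o += 1          # overpredicted: cls assigned wrongly
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0


actual    = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"
predicted = "--ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------"

print(f"{c3(actual, predicted):.6f}")                     # prints 0.878788
print(f"{type_coefficient(actual, predicted, '-'):.5f}")  # prints 0.74525
```

With this reading, 58 of the 66 positions in the CRO example match, which gives the C3 value shown on the slide.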
Thank you for your attention!
• Questions?
