Protein structure prediction

PROTEIN STRUCTURE PREDICTION USING MACHINE LEARNING
Borys Biletskyy
Data Science Amsterdam
July, 2016

About Myself
• Born in Kyiv, Ukraine
• I have lived and worked in Amsterdam since 2014
• Architect @ Levi9
• Senior Research Officer @ Glushkov Institute of Cybernetics
• I like:
• Cycling
• Swimming
• Boating

Motivation and Domain Background
• The process of protein folding is unclear
• Experimental protein structure determination is expensive
• Soluble bacterial protein: $140,000
• Human membrane protein: $2.5 million
• Single successful drug programme: $15-20 million
• Life on Earth is protein-based
• Proteins make up 80% of a cell’s dry mass
• Protein structure determines its function
• Very important for the medical industry
• Proteins are biological nano-machines
• Proteins are folded chains of amino acids of 20 types
• A protein is a sequence of elements from a finite alphabet
• We focus on secondary structure prediction
• Mapping a sequence to another sequence

Training Data
• Protein Structure Data Banks
• wwPDB, NCBI, RCSB PDB
• Open access
• Exponential growth
• Unstructured data
• Noisy data, contains duplicates
• 23,000 usable structures out of 100,000
• Example: CRO protein
• PDB record: http://www.rcsb.org/pdb/files/3CRO.pdb
• Extracted Secondary Structure:
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
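A PDB record lists helices and sheet strands as residue ranges, so the extracted string can be pictured as those ranges painted onto a chain of "-". The sketch below assumes the ranges are already parsed; the function name `paint_structure` and the range format are illustrative, and the CRO ranges are read off the string above rather than parsed from 3CRO.pdb.

```python
def paint_structure(length, helices, strands):
    """Build an ss-string from 1-based, inclusive residue ranges."""
    labels = ["-"] * length
    for symbol, ranges in (("h", helices), ("s", strands)):
        for start, end in ranges:
            for pos in range(start, end + 1):
                labels[pos - 1] = symbol
    return "".join(labels)

# Ranges read off the CRO string shown above (an assumption, not PDB output).
cro_helices = [(7, 14), (16, 23), (27, 36)]
cro_strands = [(2, 6), (39, 46), (48, 55)]
cro_ss = paint_structure(66, cro_helices, cro_strands)
```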

Protein Secondary Structure Prediction: Problem Statement
• Protein amino acid chain (aa):
• $x = (x_1, \dots, x_n)$, $x_i \in X$ = {20 types of amino acids}
• Protein secondary structure (ss):
• $y = (y_1, \dots, y_n)$, $y_i \in Y$ = {3 types of secondary structure: s, h, -}
• Training data:
• Set $T$ of $m$ pairs $(x, y)$, each an aa-sequence with its ss-sequence of the same length
• Given training data $T$ and an aa-sequence $x$, find its corresponding ss-sequence $y$
• Assumption: local dependency
• $y_i$ depends on a “frame” of size $r$ in $x$: $(x_{i-r}, \dots, x_i, \dots, x_{i+r})$
x: MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
y: -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
Frame (r = 5):   YQSAINKAIHA
Central element: .....h.....
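The local-dependency assumption turns each protein into one (frame, label) example per residue. A minimal sketch of that step follows; the function name `frame` and the padding symbol `*` for positions beyond the chain ends are assumptions, since the slides do not say how chain ends are handled.

```python
def frame(x, i, r, pad="*"):
    """Return the window (x[i-r], ..., x[i+r]) around 0-based position i.

    Positions outside the chain are filled with `pad` (an assumption).
    """
    return "".join(x[j] if 0 <= j < len(x) else pad
                   for j in range(i - r, i + r + 1))

x = "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
y = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"

# One training example per residue: the frame around it and its label.
examples = [(frame(x, i, 5), y[i]) for i in range(len(x))]
```

With r = 5, the frame around the 31st residue (index 30) is the YQSAINKAIHA window shown above, labelled h.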

Probabilistic Tooling
• Bayes' theorem (how to calculate conditional probabilities)
• $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
• Naïve example (frame size $r = 0$): $P(y_i \mid x_i)$
• Calculate $P(y_i = \text{"s"} \mid x_i)$, $P(y_i = \text{"h"} \mid x_i)$, $P(y_i = \text{"-"} \mid x_i)$ and pick the maximum
• Another example (frame size $r > 0$): $P(y_i \mid x_{i-r}, \dots, x_{i+r})$
• How to calculate probabilities of sequences $(x_{i-r}, \dots, x_{i+r})$?
• Markov chains (how to calculate the probability of a sequence)
• Chain order $k = 1$
• $P(x_1, \dots, x_n) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})$
• Chain order $k = 2$
• $P(x_1, \dots, x_n) = P(x_1, x_2)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_{n-2}, x_{n-1})$
• Anderson results (how to select the best order of the Markov chain)
• $\chi^2 \sim -2 \ln \frac{L_k}{L_{k+1}}$, where $L_k$ is the likelihood of a chain of order $k$
• Makes it possible to find the Markov chain order that fits the data best
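To make the likelihood-ratio statistic concrete, here is a small sketch that estimates transition probabilities from counts and evaluates $-2 \ln(L_k / L_{k+1})$. The function names are illustrative. Evaluating both likelihoods over the same transitions (positions from $k+1$ onward) is an assumption made here so that the two models are comparable. The degrees of freedom for the chi-squared comparison are not shown on the slide and are left out.

```python
import math
from collections import Counter

def max_log_likelihood(sequences, k, first):
    """Maximised log-likelihood of an order-k chain.

    Only transitions at 0-based positions >= first are used (first >= k).
    """
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(first, len(seq)):
            context = seq[i - k:i]          # the k preceding symbols
            pair_counts[(context, seq[i])] += 1
            context_counts[context] += 1
    # Maximum-likelihood estimate: P(symbol | context) = n(context, symbol) / n(context)
    return sum(n * math.log(n / context_counts[c])
               for (c, _), n in pair_counts.items())

def lr_statistic(sequences, k):
    """-2 ln(L_k / L_{k+1}) for testing order k against order k + 1."""
    first = k + 1
    return -2.0 * (max_log_likelihood(sequences, k, first)
                   - max_log_likelihood(sequences, k + 1, first))

toy = ["ababababab"]
stat_0_vs_1 = lr_statistic(toy, 0)   # large: order 1 explains the alternation
stat_1_vs_2 = lr_statistic(toy, 1)   # zero: order 2 adds nothing here
```

A large statistic is evidence against the lower order; once it stops being significant, raising the order no longer improves the fit.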

Problem-Specific Model
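• Bayes' theorem and a Markov chain of order $k = 1$
• Then each element $y_i$ is calculated as
$$\arg\max_{y_i \in Y} P(y_i \mid x_1, \dots, x_{2r+1}) = \arg\max_{y_i \in Y} P(y_i)\,P(x_1 \mid y_i) \prod_{j=2}^{2r+1} P(x_j \mid x_{j-1}, y_i)$$
where $x_1, \dots, x_{2r+1}$ are the elements of the frame around position $i$; the denominator $P(x_1, \dots, x_{2r+1})$ from Bayes' theorem does not depend on $y_i$ and is omitted
• We need to calculate 3 conditional probabilities (one for each possible value of $y_i$) and pick the maximum
• All factors in the expansion are probabilities computable from the training data
• Elements $y_i$ are calculated independently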
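The decision rule above can be sketched as a small count-based classifier. This is an illustration under stated assumptions, not the C++ implementation from the talk: the class name `FrameClassifier`, the add-alpha smoothing, and the alphabet size of 21 (20 amino acids plus a padding symbol) are all introduced here.

```python
import math
from collections import Counter

class FrameClassifier:
    """Bayes rule with a first-order Markov chain over the frame, per label."""

    def __init__(self, alphabet_size=21, alpha=1.0):
        # alpha: add-alpha smoothing (an assumption, avoids zero probabilities)
        self.alphabet_size = alphabet_size
        self.alpha = alpha
        self.label_counts = Counter()   # n(y)
        self.first_counts = Counter()   # n(x_1, y)
        self.pair_counts = Counter()    # n(x_{j-1}, x_j, y)
        self.prev_counts = Counter()    # n(x_{j-1}, y)

    def fit(self, examples):
        for frame, label in examples:
            self.label_counts[label] += 1
            self.first_counts[(frame[0], label)] += 1
            for prev, cur in zip(frame, frame[1:]):
                self.pair_counts[(prev, cur, label)] += 1
                self.prev_counts[(prev, label)] += 1
        return self

    def _log_ratio(self, num, den):
        return math.log((num + self.alpha)
                        / (den + self.alpha * self.alphabet_size))

    def log_score(self, frame, label):
        # log P(y) + log P(x_1 | y) + sum_j log P(x_j | x_{j-1}, y)
        total = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total)
        score += self._log_ratio(self.first_counts[(frame[0], label)],
                                 self.label_counts[label])
        for prev, cur in zip(frame, frame[1:]):
            score += self._log_ratio(self.pair_counts[(prev, cur, label)],
                                     self.prev_counts[(prev, label)])
        return score

    def predict(self, frame):
        # Only labels seen during training are candidates.
        return max(self.label_counts,
                   key=lambda label: self.log_score(frame, label))

# Toy frames (hypothetical data, r = 1) just to exercise the rule.
toy = [("AAA", "h")] * 3 + [("GGG", "s")] * 3 + [("AGA", "-")] * 2
model = FrameClassifier().fit(toy)
```

Working in log space keeps the product of many small probabilities from underflowing; the argmax is unchanged.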

Building And Executing Prediction Algorithms
• Implemented in C++
• Computational complexity:
• Training: $O(m \ln m)$
• Prediction: $O(\ln m)$
• Parallelizable
• Executed on an NVIDIA GeForce 8800 GTX-based GPU cluster with a total performance of 4 TFLOPS
• Can be implemented using MapReduce:
• Hadoop
• Spark
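Training reduces to counting, which is why it fits the MapReduce pattern: each protein is mapped to a partial count table, and the tables are merged pairwise. The sketch below shows only that shape in plain Python; it does not use the Hadoop or Spark APIs, and the function names and the two toy fragments (taken from the start of the CRO chain) are illustrative.

```python
from collections import Counter
from functools import reduce

def map_protein(pair, r=2):
    """Map step: emit transition counts for one (aa, ss) pair."""
    x, y = pair
    counts = Counter()
    for i in range(r, len(x) - r):          # interior positions only, for brevity
        window = x[i - r:i + r + 1]
        for prev, cur in zip(window, window[1:]):
            counts[(prev, cur, y[i])] += 1  # n(x_{j-1}, x_j, y_i)
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial count tables."""
    return left + right

# Two fragments of the CRO chain used as toy input.
proteins = [("MEQRITLK", "-ssssshh"), ("DYAMRFGQ", "hhhhhh-h")]
total = reduce(reduce_counts, map(map_protein, proteins), Counter())
```

Because merging count tables is associative and commutative, the proteins can be processed in any order and on any number of workers.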

Adjusting Model Parameters
• Markov chain order $k$
• Selected using the Anderson results in a series of statistical hypothesis tests
• Depends on the training data (higher-order chains require more data)
• Chain order k = 3 was used
• Frame size $r$
• Selected empirically
• Frame size 14 was used
Order k = 1 or order k = 2 or …?
Frame size r = 5 or frame size r = 2 or …?
r = 5: YQSAINKAIHA
       .....h.....
r = 2: AINKA
       ..h..

Accuracy Evaluation
• Single protein structure prediction accuracy
• C3 – ratio of correctly predicted items to protein length
• C(s),C(h),C(-) – secondary structure type-specific accuracy coefficients
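$$C(\alpha) = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}, \quad \alpha \in \{s, h, -\}$$
where $p_\alpha$ and $n_\alpha$ count the positions correctly predicted as $\alpha$ and as not $\alpha$, and $u_\alpha$ and $o_\alpha$ count the under-predicted and over-predicted positions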
• Example: CRO protein ss prediction
• aa:        MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• actual:    -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
• predicted: --ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------
• Accuracy:
• C3: 0.878788
• C(s): 0.815068
• C(h): 0.92674
• C(-): 0.74525
• Prediction accuracy of the model on the training data (23,000 structures)
• Cross-validation
• Single-protein exclusion (leave-one-out)
• Average C3: 0.83
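Both accuracy measures can be computed directly from a pair of actual and predicted ss-strings. The sketch below assumes the type-specific coefficient has the correlation form with a square root in the denominator; the function names are illustrative.

```python
import math

def c3(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    return sum(a == q for a, q in zip(actual, predicted)) / len(actual)

def c_alpha(actual, predicted, alpha):
    """Correlation coefficient for one secondary structure type."""
    p = n = u = o = 0
    for a, q in zip(actual, predicted):
        if a == alpha and q == alpha:
            p += 1      # correctly predicted as alpha
        elif a != alpha and q != alpha:
            n += 1      # correctly predicted as not alpha
        elif a == alpha:
            u += 1      # under-predicted: alpha was missed
        else:
            o += 1      # over-predicted: alpha was predicted wrongly
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0

actual    = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"
predicted = "--ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------"
cro_c3 = c3(actual, predicted)
cro_c_other = c_alpha(actual, predicted, "-")
```

On the CRO strings from the slide, 58 of 66 positions agree, so C3 = 58/66 ≈ 0.8788, and C(-) comes out at about 0.7453, in line with the values listed above.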

Thank you for your attention!
• Questions?
