2. About Myself
• Born in Kyiv, Ukraine
• I have lived and worked in Amsterdam since 2014
• Architect @ Levi9
• Senior Research Officer @ Glushkov Institute of Cybernetics
• I like:
• Cycling
• Swimming
• Boating
3. Motivation and Domain Background
• The process of protein folding is unclear
• Experimental protein structure determination is expensive
• Soluble bacterial protein: $140,000
• Human membrane protein: $2.5 million
• Single successful drug programme: $15-20 million
• Life on Earth is protein-based
• Proteins make up 80% of a cell's dry mass
• Protein structure determines its function
• Very important for the medical industry
• Proteins are biological nano-machines
• Proteins are folded chains of amino acids of 20 types
• A protein is a sequence of elements from a finite alphabet
• We focus on secondary structure prediction
• Mapping a sequence to another sequence
4. Training Data
• Protein Structure Data Banks
• wwPDB, NCBI, RCSB PDB
• Open access
• Exponential growth
• Unstructured data
• Noisy data, contains duplicates
• 23000 usable out of 100000
• Example CRO protein
• PDB record: http://www.rcsb.org/pdb/files/3CRO.pdb
• Extracted Secondary Structure:
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
5. Protein Secondary Structure Prediction: Problem Statement
• Protein amino acid chain (aa):
• $x = (x_1, \dots, x_n)$, $x_i \in X = \{\text{20 types of amino acids}\}$
• Protein secondary structure (ss):
• $y = (y_1, \dots, y_n)$, $y_i \in Y = \{s, h, -\}$ (3 types of secondary structure)
• Training data:
• set $T$ of $m$ pairs $(x, y)$, where $x$ is an aa-sequence over $X$ and $y$ is its ss-sequence over $Y$
• Given training data $T$ and an aa-sequence $x$, find its corresponding ss-sequence $y$
• Assumption: local dependency
• $y_i$ depends on a "frame" of size $r$ in $x$: $(x_{i-r}, \dots, x_i, \dots, x_{i+r})$
x: MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
y: -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
YQSAINKAIHA
.....h.....
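The frame idea can be illustrated with a short sketch. The function name `frame`, the padding symbol `*` for positions beyond the chain ends, and the 0-based indexing are assumptions made for this illustration; the slides do not say how chain ends are handled.

```python
PAD = "*"  # hypothetical padding symbol for positions beyond the chain ends

def frame(x: str, i: int, r: int) -> str:
    """Return the window (x[i-r], ..., x[i], ..., x[i+r]) around 0-based position i."""
    padded = PAD * r + x + PAD * r
    # Position i of x sits at index i + r of the padded string,
    # so the window starts at padded index i.
    return padded[i:i + 2 * r + 1]

x = "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
y = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"

i = 30  # 0-based index of the central 'N' of the frame shown above
print(frame(x, i, 5), y[i])  # frame with r = 5 and the label it is used to predict
print(frame(x, i, 2), y[i])  # narrower frame with r = 2
```

Padding keeps every frame the same length, so positions near the ends of the chain can be treated like any other position.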
6. Probabilistic Tooling
• Bayes' theorem (how to calculate conditional probabilities)
• $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
• Naïve example (frame size $r = 0$): $P(y_i \mid x_i)$
• Calculate and pick the maximum of $P(y_i = s \mid x_i)$, $P(y_i = h \mid x_i)$, $P(y_i = - \mid x_i)$
• Another example (frame size $r > 0$): $P(y_i \mid x_{i-r}, \dots, x_{i+r})$
• How to calculate probabilities of sequences $(x_{i-r}, \dots, x_{i+r})$?
• Markov Chains (how to calculate probability of sequences)
• Chain order $k = 1$
• $P(x_1, \dots, x_n) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})$
• Chain order $k = 2$
• $P(x_1, \dots, x_n) = P(x_1, x_2)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_{n-2}, x_{n-1})$
• Anderson results (how to select the best order of the Markov chain)
• $\chi^2 \sim -2 \ln \frac{L_k}{L_{k+1}}$, where $L_k$ is the likelihood of a chain of order $k$
• Allows finding the order of the Markov chain that fits the data best
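As a rough illustration of the order-selection statistic, the sketch below estimates transition probabilities by maximum likelihood from counts and evaluates $-2 \ln(L_k / L_{k+1})$. The function names and the choice to score both chains on the same positions (from index $k+1$ onward) are assumptions of this sketch. Comparing the statistic against a chi-square quantile is left out.

```python
import math
from collections import Counter

def log_likelihood(seqs, k, start):
    """Maximised log-likelihood of an order-k chain, scoring positions i >= start."""
    ctx_counts, trans_counts = Counter(), Counter()
    for s in seqs:
        for i in range(start, len(s)):
            ctx = s[i - k:i]          # the k preceding symbols ("" when k == 0)
            ctx_counts[ctx] += 1
            trans_counts[(ctx, s[i])] += 1
    # Sum of count * ln(estimated transition probability)
    return sum(c * math.log(c / ctx_counts[ctx])
               for (ctx, _), c in trans_counts.items())

def lr_statistic(seqs, k):
    """-2 ln(L_k / L_{k+1}), with both likelihoods taken over the same positions."""
    start = k + 1
    return -2.0 * (log_likelihood(seqs, k, start)
                   - log_likelihood(seqs, k + 1, start))

toy = ["abababab"]
print(lr_statistic(toy, 0))  # large: order 1 explains the data far better than order 0
print(lr_statistic(toy, 1))  # zero: order 2 adds nothing over order 1
```

A large statistic suggests that raising the order is justified; a small one suggests the lower order already fits.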
7. Problem-Specific Model
• Bayes' theorem and a Markov chain of order $k = 1$
• Then each element $y_i$ is calculated as
$\operatorname{argmax}_{y_i \in Y} P(y_i \mid x_1, \dots, x_{2r+1}) = \operatorname{argmax}_{y_i \in Y} P(y_i)\,P(x_1 \mid y_i) \prod_{j=2}^{2r+1} P(x_j \mid x_{j-1}, y_i)$
where $x_1, \dots, x_{2r+1}$ are the elements of the frame around position $i$
• We need to calculate 3 conditional probabilities (for the 3 possible values of $y_i$) and pick the maximum
• All factors in the expansion are probabilities computable from the training data
• Elements $y_i$ are calculated independently
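A minimal sketch of this decision rule for chain order $k = 1$ follows. It is not the C++ implementation from the talk: the class name `FrameClassifier`, the padding symbol `*`, add-one smoothing (`alpha`) and the alphabet size of 21 (20 amino acids plus padding) are all assumptions introduced here so that unseen transitions do not produce zero probabilities.

```python
import math
from collections import Counter

PAD = "*"  # hypothetical padding symbol for frame positions beyond the chain ends

def frames(x, r):
    """One frame of length 2r+1 per position of x."""
    padded = PAD * r + x + PAD * r
    return [padded[i:i + 2 * r + 1] for i in range(len(x))]

class FrameClassifier:
    def __init__(self, r, alpha=1.0, n_symbols=21):
        self.r, self.alpha, self.n_symbols = r, alpha, n_symbols
        self.label = Counter()  # count(y)
        self.first = Counter()  # count(y, x_1)
        self.trans = Counter()  # count(y, x_{j-1}, x_j)
        self.ctx = Counter()    # count(y, x_{j-1})

    def fit(self, pairs):
        for x, y in pairs:
            for f, lab in zip(frames(x, self.r), y):
                self.label[lab] += 1
                self.first[(lab, f[0])] += 1
                for a, b in zip(f, f[1:]):
                    self.trans[(lab, a, b)] += 1
                    self.ctx[(lab, a)] += 1
        return self

    def _smoothed(self, num, den):
        # Log of an additively smoothed relative frequency (assumption of this sketch)
        return math.log((num + self.alpha) / (den + self.alpha * self.n_symbols))

    def _score(self, f, lab):
        # ln P(y) + ln P(x_1 | y) + sum_j ln P(x_j | x_{j-1}, y)
        s = math.log(self.label[lab] / sum(self.label.values()))
        s += self._smoothed(self.first[(lab, f[0])], self.label[lab])
        for a, b in zip(f, f[1:]):
            s += self._smoothed(self.trans[(lab, a, b)], self.ctx[(lab, a)])
        return s

    def predict(self, x):
        # Each y_i is chosen independently as the argmax over the labels seen in training
        return "".join(max(self.label, key=lambda lab: self._score(f, lab))
                       for f in frames(x, self.r))

model = FrameClassifier(r=1).fit([("AAAAGGGG", "hhhh----")])
print(model.predict("AAAAGGGG"))
```

Working in log space avoids numeric underflow when the frame is long and many small probabilities are multiplied.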
8. Building And Executing Prediction Algorithms
• Implemented in C++
• Computational complexity:
• Training: $O(m \ln m)$
• Prediction: $O(\ln m)$
• Parallelizable
• Executed on an NVIDIA GeForce 8800 GTX based GPU cluster with a total performance of 4 TFLOPS
• Can be implemented using MapReduce:
• Hadoop
• Spark
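The training step is mostly counting, which is why it maps naturally onto MapReduce. The sketch below imitates the two phases in plain Python for the simplest statistic, the (amino acid, structure) pair counts. The function names are hypothetical, and a real Hadoop or Spark job would distribute the same two steps across workers rather than run them in one process.

```python
from collections import Counter
from functools import reduce

def map_counts(pair):
    """Map phase: one training pair (x, y) -> local counts of (x_i, y_i)."""
    x, y = pair
    return Counter(zip(x, y))

def reduce_counts(a, b):
    """Reduce phase: merge two partial count tables."""
    return a + b

def count_pairs(pairs):
    return reduce(reduce_counts, map(map_counts, pairs), Counter())

counts = count_pairs([("AG", "h-"), ("AA", "hh")])
print(counts[("A", "h")], counts[("G", "-")])
```

Because merging count tables is associative and commutative, the partial results can be combined in any order.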
9. Adjusting Model Parameters
• Markov chain order $k$
• Selected using Anderson results during a series of statistical hypothesis tests
• Depends on the training data (higher-order chains require more data)
• Chain order $k = 3$ was used
• Frame size $r$
• Selected empirically
• Frame size 14 was used
Order k = 1, or order k = 2, or …?
Frame size r = 5:
YQSAINKAIHA
.....h.....
Frame size r = 2:
AINKA
..h..
or …?
10. Accuracy Evaluation
• Single protein structure prediction accuracy
• C3 – ratio of correctly predicted items to protein length
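• C(s), C(h), C(-) – secondary structure type-specific accuracy coefficients
$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}$, $\alpha \in \{s, h, -\}$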
• Example: CRO protein ss prediction
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
• --ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------
• Accuracy:
• C3: 0.878788
• C(s): 0.815068
• C(h): 0.92674
• C(-): 0.74525
• Prediction accuracy of a model with training data (23000 structures)
• Cross Validation
• Single Protein Exclusion
• Average C3 – 0.83
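The two accuracy measures can be sketched as follows. The slides do not define the symbols in the coefficient, so this sketch assumes the usual correlation-coefficient reading: $p$ is the number of positions correctly predicted as type $\alpha$, $n$ the number correctly predicted as not $\alpha$, $u$ the under-predicted and $o$ the over-predicted positions. The function names are hypothetical.

```python
import math

def c3(actual: str, predicted: str) -> float:
    """Ratio of correctly predicted positions to protein length."""
    return sum(a == q for a, q in zip(actual, predicted)) / len(actual)

def c_alpha(actual: str, predicted: str, alpha: str) -> float:
    """Type-specific correlation coefficient for structure type alpha."""
    p = n = u = o = 0
    for a, q in zip(actual, predicted):
        if a == alpha and q == alpha:
            p += 1   # correctly predicted as alpha
        elif a != alpha and q != alpha:
            n += 1   # correctly predicted as not alpha
        elif a == alpha:
            u += 1   # under-predicted: alpha missed
        else:
            o += 1   # over-predicted: alpha predicted wrongly
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0

actual    = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"
predicted = "--ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------"

print(c3(actual, predicted))
for t in "sh-":
    print(t, c_alpha(actual, predicted, t))
```

Applied to the CRO strings shown above, this gives C3 ≈ 0.8788 and C(-) ≈ 0.7453, in line with the listed values. The C(s) and C(h) values it produces differ from the listed ones, which may mean those two figures come from a slightly different prediction or definition.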