PROTEIN STRUCTURE
PREDICTION USING
MACHINE LEARNING
Borys Biletskyy
Data Science Amsterdam
July, 2016
About Myself
• Born in Kyiv, Ukraine
• I have lived and worked in Amsterdam since 2014
• Architect @ Levi9
• Senior Research Officer @ Glushkov Institute of Cybernetics
• I like:
• Cycling
• Swimming
• Boating
Motivation and Domain Background
• The process of protein folding is not fully understood
• Experimental protein structure determination is expensive
• Soluble bacterial protein: $140,000
• Human membrane protein: $2.5 million
• Single successful drug programme: $15-20 million
• Life on Earth is protein-based
• Proteins make 80% of cell’s dry mass
• Protein structure determines its function
• Very important for medical industry
• Proteins are biological nano-machines
• Proteins are folded chains of amino acids of 20 types
• A protein is a sequence of elements from a finite alphabet
• We focus on secondary structure prediction
• Mapping a sequence to another sequence
Training Data
• Protein Structure Data Banks
• wwPDB, NCBI, RCSB PDB
• Open access
• Exponential growth
• Unstructured data
• Noisy data, contains duplicates
• 23,000 usable out of 100,000
• Example CRO protein
• PDB record: http://www.rcsb.org/pdb/files/3CRO.pdb
• Extracted Secondary Structure:
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
Protein Secondary Structure Prediction:
Problem Statement
• Protein amino acid chain (aa):
• $x = (x_1, \dots, x_n)$, $x_i \in X$ = {20 types of amino acids}
• Protein secondary structure (ss):
• $y = (y_1, \dots, y_n)$, $y_i \in Y$ = {3 types of secondary structure: s, h, -}
• Training data:
• set $T$ of $m$ pairs $(x, y)$, where $x \in X^n$ and $y \in Y^n$ have the same length $n$ (which varies between proteins)
• Given training data $T$ and an aa-sequence $x$, find its corresponding ss-sequence $y$
• Assumption: local dependency
• $y_i$ depends on a "frame" of size $r$ in $x$: $(x_{i-r}, \dots, x_i, \dots, x_{i+r})$
x: MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
y: -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
YQSAINKAIHA
.....h.....
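The frame assumption can be illustrated with a minimal sketch. The function name `frame` and the choice to truncate frames at the chain ends are assumptions made for this example; the slides do not say how boundaries are handled.

```python
def frame(x: str, i: int, r: int) -> str:
    """Return the residues x[i-r .. i+r] around 0-based position i.

    Assumption: near the chain ends the frame is simply truncated.
    """
    if not 0 <= i < len(x):
        raise IndexError("position outside the chain")
    return x[max(0, i - r): i + r + 1]


# CRO protein sequence and its secondary structure, as given on the slides.
x = "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
y = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"

center = 30  # 0-based index of the residue N in ...YQSAINKAIHA...
print(frame(x, center, 5), y[center])
print(frame(x, center, 2), y[center])
```

With r = 5 this yields the 11-residue frame YQSAINKAIHA shown above, whose central residue is labeled h.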
Probabilistic Tooling
• Bayes' theorem (how to calculate conditional probabilities)
• $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
• Naïve example (frame size $r = 0$): $P(y_i \mid x_i)$
• Calculate and pick the maximum of $P(y_i = s \mid x_i)$, $P(y_i = h \mid x_i)$, $P(y_i = - \mid x_i)$
• Another example (frame size $r > 0$): $P(y_i \mid x_{i-r}, \dots, x_{i+r})$
• How to calculate probabilities of sequences $(x_{i-r}, \dots, x_{i+r})$?
• Markov chains (how to calculate the probability of a sequence)
• Chain order $k = 1$
• $P(x_1, \dots, x_n) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})$
• Chain order $k = 2$
• $P(x_1, \dots, x_n) = P(x_1, x_2)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_{n-2}, x_{n-1})$
• Anderson results (how to select the best order of the Markov chain)
• $\chi^2 \sim -2 \ln (L_k / L_{k+1})$, where $L_k$ is the likelihood of a chain of order $k$
• Allows finding the Markov chain order that best fits the data
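The likelihood-ratio statistic for comparing chain orders can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, likelihoods are maximised by plain frequency counts, and both orders are evaluated over the same positions (from index k + 1 onward) so that the two models are nested.

```python
import math
from collections import Counter


def log_likelihood(seqs, k, start):
    """Maximised log-likelihood of an order-k Markov chain.

    Only positions i >= start contribute; start must be >= k so that
    every position has a full k-symbol context.
    """
    if start < k:
        raise ValueError("start must be at least k")
    ctx_counts, trans_counts = Counter(), Counter()
    for s in seqs:
        for i in range(start, len(s)):
            ctx = s[i - k:i]              # the k preceding symbols
            ctx_counts[ctx] += 1
            trans_counts[ctx, s[i]] += 1
    # Sum of count * log(estimated transition probability).
    return sum(c * math.log(c / ctx_counts[ctx])
               for (ctx, _), c in trans_counts.items())


def lr_statistic(seqs, k):
    """-2 ln(L_k / L_{k+1}), computed over the same set of positions."""
    start = k + 1
    return -2.0 * (log_likelihood(seqs, k, start)
                   - log_likelihood(seqs, k + 1, start))


toy = ["ABABABAB"]  # strictly alternating, so order 1 explains it perfectly
print(lr_statistic(toy, 0))
```

A large statistic relative to the chi-squared reference distribution suggests that moving from order k to order k + 1 is worthwhile. Higher orders need more training data, because the number of contexts to estimate grows exponentially with k.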
Problem-Specific Model
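• Bayes' theorem and a Markov chain of order $k = 1$
• Then each element $y_i$ is calculated as
$\arg\max_{y_i \in Y} P(y_i \mid x_1, \dots, x_{2r+1}) = \arg\max_{y_i \in Y} P(y_i)\, P(x_1 \mid y_i) \prod_{j=2}^{2r+1} P(x_j \mid x_{j-1}, y_i)$
• Here $x_1, \dots, x_{2r+1}$ are the residues of the frame around position $i$; the denominator $P(x_1, \dots, x_{2r+1})$ from Bayes' theorem is dropped because it does not depend on $y_i$
• We need to calculate 3 conditional probabilities (for the 3 possible values of $y_i$) and pick the maximum
• All factors in the expansion are probabilities that can be estimated from the training data
• Elements $y_i$ are calculated independently of each other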
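A minimal Python sketch of this decision rule is given below. It is not the C++ implementation from the talk. The class name `FrameClassifier`, the add-one smoothing of counts, and the truncation of frames at chain ends are all assumptions introduced for the example, and the chain order is fixed at k = 1.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
CLASSES = "sh-"                     # sheet, helix, other


def frames(x, r):
    """Yield the frame around every position (truncated at the ends)."""
    for i in range(len(x)):
        yield x[max(0, i - r): i + r + 1]


class FrameClassifier:
    def __init__(self, r):
        self.r = r
        self.cls = Counter()     # count of each class y
        self.first = Counter()   # (y, first residue of the frame)
        self.pair = Counter()    # (y, previous residue, residue)
        self.prev = Counter()    # (y, previous residue)

    def fit(self, pairs):
        for x, y in pairs:
            for f, c in zip(frames(x, self.r), y):
                self.cls[c] += 1
                self.first[c, f[0]] += 1
                for a, b in zip(f, f[1:]):
                    self.pair[c, a, b] += 1
                    self.prev[c, a] += 1
        return self

    def _score(self, f, c):
        """log of P(y) P(x_1|y) prod_j P(x_j|x_{j-1}, y), add-one smoothed."""
        size = len(ALPHABET)
        total = sum(self.cls.values())
        s = math.log((self.cls[c] + 1) / (total + len(CLASSES)))
        s += math.log((self.first[c, f[0]] + 1) / (self.cls[c] + size))
        for a, b in zip(f, f[1:]):
            s += math.log((self.pair[c, a, b] + 1) / (self.prev[c, a] + size))
        return s

    def predict(self, x):
        return "".join(max(CLASSES, key=lambda c: self._score(f, c))
                       for f in frames(x, self.r))


x = "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
y = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"
model = FrameClassifier(r=2).fit([(x, y)])
predicted = model.predict(x)
print(predicted)
```

Training and predicting on the same single protein is only a smoke test. Working in log space avoids numerical underflow when many small probabilities are multiplied.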
Building And Executing Prediction Algorithms
• Implemented in C++
• Computational complexity:
• Training: $O(m \ln m)$
• Prediction: $O(\ln m)$
• Parallelizable
• Executed on an NVIDIA GeForce 8800 GTX based GPU cluster with a total performance of 4 TFLOPS
• Can be implemented using MapReduce:
• Hadoop
• Spark
Adjusting Model Parameters
• Markov chain order $k$
• Selected using Anderson results during a series of statistical hypothesis tests
• Depends on the training data (higher-order chains require more data)
• Chain order k = 3 was used
• Frame size $r$
• Selected empirically
• Frame size 14 was used
[Figure: choosing the chain order (k = 1, k = 2, ...) and the frame size, e.g. r = 5 (frame YQSAINKAIHA) or r = 2 (frame AINKA), around a residue labeled h]
Accuracy Evaluation
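• Single protein structure prediction accuracy
• C3: ratio of correctly predicted positions to protein length
• C(s), C(h), C(-): secondary structure type-specific accuracy coefficients
$C_\alpha = \dfrac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}$, $\alpha \in \{s, h, -\}$
• This has the form of the Matthews correlation coefficient; assuming the usual notation, $p_\alpha$ and $n_\alpha$ count positions correctly predicted as $\alpha$ and correctly predicted as not $\alpha$, while $u_\alpha$ and $o_\alpha$ count under- and over-predicted positions
• Example: CRO protein ss prediction
• aa: MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• ss (actual): -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
• ss (predicted): --ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------
• Accuracy:
• C3: 0.878788
• C(s): 0.815068
• C(h): 0.92674
• C(-): 0.74525
• Prediction accuracy of a model with training data (23,000 structures)
• Cross Validation
• Single Protein Exclusion
• Average C3: 0.83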
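The two accuracy measures can be sketched in a few lines. The function names are hypothetical, and the per-type coefficient is assumed to be the Matthews correlation coefficient with a square root in the denominator.

```python
import math


def c3(actual, predicted):
    """Fraction of positions whose secondary structure is predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def c_type(actual, predicted, alpha):
    """Matthews-style coefficient for one structure type alpha."""
    p = n = u = o = 0
    for a, q in zip(actual, predicted):
        if a == alpha and q == alpha:
            p += 1          # correctly predicted as alpha
        elif a != alpha and q != alpha:
            n += 1          # correctly predicted as not alpha
        elif a == alpha:
            u += 1          # under-predicted (missed)
        else:
            o += 1          # over-predicted (false alarm)
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0


actual    = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"
predicted = "--ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------"
print(round(c3(actual, predicted), 6))
print(round(c_type(actual, predicted, "-"), 5))
```

On the CRO strings as transcribed here, 58 of 66 positions match, which reproduces C3 = 0.878788, and the coefficient for the "-" class comes out at about 0.74525.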
Thank you for your attention!
• Questions?
