PROTEIN STRUCTURE
PREDICTION USING
MACHINE LEARNING
Borys Biletskyy
Data Science Amsterdam
July, 2016
About Myself
• Born in Kyiv, Ukraine
• I have lived and worked in Amsterdam since 2014
• Architect @ Levi9
• Senior Research Officer @ Glushkov Institute of Cybernetics
• I like:
• Cycling
• Swimming
• Boating
Motivation and Domain Background
• The process of protein folding is not fully understood
• Experimental protein structure determination is expensive
• Soluble bacterial protein: $140,000
• Human membrane protein: $2.5 million
• Single successful drug programme: $15-20 million
• Life on Earth is protein-based
• Proteins make up 80% of a cell’s dry mass
• Protein structure determines its function
• Very important for the medical industry
• Proteins are biological nano-machines
• Proteins are folded chains of amino acids of 20 types
• A protein is a sequence of elements from a finite alphabet
• We focus on secondary structure prediction
• Mapping a sequence to another sequence
Training Data
• Protein Structure Data Banks
• wwPDB, NCBI, RCSB PDB
• Open access
• Exponential growth
• Unstructured data
• Noisy data, contains duplicates
• 23,000 usable out of 100,000
• Example: CRO protein
• PDB record: http://www.rcsb.org/pdb/files/3CRO.pdb
• Extracted Secondary Structure:
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
Protein Secondary Structure Prediction: Problem Statement
• Protein amino acid chain (aa):
• $x = (x_1, \dots, x_n)$, $x_i \in X$ = {20 types of amino acids}
• Protein secondary structure (ss):
• $y = (y_1, \dots, y_n)$, $y_i \in Y$ = {3 types of secondary structure: s, h, -}
• Training data:
• set $T$ of $m$ pairs $(x, y)$, where $x$ is a sequence over $X$ and $y$ is a sequence over $Y$ of the same length
• Given training data $T$ and an aa-sequence $x$, find its corresponding ss-sequence $y$
• Assumption: local dependency
• $y_i$ depends on a “frame” of size $r$ in $x$: $(x_{i-r}, \dots, x_i, \dots, x_{i+r})$
x: MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
y: -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
YQSAINKAIHA
.....h.....
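The local-dependency assumption can be sketched as a sliding window over $x$. This is only an illustrative sketch, not the implementation from the talk: the function name `frame`, the 0-based indexing, and the padding symbol `_` for positions beyond the chain ends are all assumptions.

```python
PAD = "_"  # hypothetical padding symbol for positions outside the chain


def frame(x: str, i: int, r: int) -> str:
    """Return (x[i-r], ..., x[i], ..., x[i+r]) as a string of length 2r + 1."""
    left = max(0, i - r)
    right = min(len(x), i + r + 1)
    # Pad on either side when the window sticks out past the chain ends.
    return PAD * (left - (i - r)) + x[left:right] + PAD * ((i + r + 1) - right)


cro = "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
print(frame(cro, 30, 5))  # YQSAINKAIHA
print(frame(cro, 0, 2))   # __MEQ
```

With 0-based position 30 and $r = 5$ this reproduces the frame YQSAINKAIHA shown on the slide, whose central residue carries the label h.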
Probabilistic Tooling
• Bayes' theorem (how to calculate conditional probabilities)
• $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
• Naïve example (frame size $r = 0$): $P(y_i \mid x_i)$
• Calculate and pick the maximum of $P(y_i = \text{s} \mid x_i)$, $P(y_i = \text{h} \mid x_i)$, $P(y_i = \text{-} \mid x_i)$
• Another example (frame size $r > 0$): $P(y_i \mid x_{i-r}, \dots, x_{i+r})$
• How to calculate probabilities of sequences $(x_{i-r}, \dots, x_{i+r})$?
• Markov chains (how to calculate the probability of sequences)
• Chain order $k = 1$
• $P(x_1, \dots, x_n) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1})$
• Chain order $k = 2$
• $P(x_1, \dots, x_n) = P(x_1, x_2)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_{n-2}, x_{n-1})$
• Anderson's results (how to select the best order of the Markov chain)
• $\chi^2 \sim -2 \ln \frac{L_k}{L_{k+1}}$, where $L_k$ is the likelihood of a chain of order $k$
• These make it possible to find the order of the Markov chain that fits the data best
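The likelihood-ratio statistic for comparing chain orders $k$ and $k+1$ can be sketched as follows. This is a sketch under stated assumptions rather than the procedure used in the talk: transition probabilities are taken as maximum-likelihood estimates from counts, both models are scored on the same positions (from index $k+1$ onward) so that the likelihoods are comparable, and the function names are hypothetical. Comparing the statistic against a chi-squared quantile is not shown.

```python
import math
from collections import Counter


def conditional_log_likelihood(seq: str, k: int, start: int) -> float:
    """Maximised log-likelihood of seq[start:] under an order-k Markov chain."""
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        context = seq[i - k:i]  # the k preceding symbols (empty for k = 0)
        context_counts[context] += 1
        transition_counts[context, seq[i]] += 1
    # With ML estimates, each transition contributes count * log(count / context count).
    return sum(
        count * math.log(count / context_counts[context])
        for (context, _), count in transition_counts.items()
    )


def likelihood_ratio_statistic(seq: str, k: int) -> float:
    """Return -2 ln(L_k / L_{k+1}) for orders k and k + 1."""
    start = k + 1  # both orders are scored on exactly the same positions
    log_l_k = conditional_log_likelihood(seq, k, start)
    log_l_k1 = conditional_log_likelihood(seq, k + 1, start)
    return -2.0 * (log_l_k - log_l_k1)


seq = "ab" * 20
print(likelihood_ratio_statistic(seq, 0) > 0)  # True
```

For the strictly alternating toy sequence, order 1 already predicts every symbol, so the statistic is positive when moving from order 0 to order 1 and zero when moving from order 1 to order 2.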
Problem-Specific Model
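• Bayes' theorem and a Markov chain of order $k = 1$
• Then each element $y_i$ is calculated as
$$\arg\max_{y_i \in Y} P(y_i \mid x_1, \dots, x_{2r+1}) = \arg\max_{y_i \in Y} P(y_i)\,P(x_1 \mid y_i) \prod_{j=2}^{2r+1} P(x_j \mid x_{j-1}, y_i)$$
• Here $x_1, \dots, x_{2r+1}$ are the elements of the frame around position $i$; the denominator $P(x_1, \dots, x_{2r+1})$ from Bayes' theorem is dropped because it does not depend on $y_i$
• We need to calculate 3 conditional probabilities (for the 3 possible values of $y_i$) and pick the maximum
• All factors in the expansion are probabilities computable from the training data
• Elements $y_i$ are calculated independently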
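The factorisation above can be sketched as a small count-based classifier. This is an illustrative sketch only (the talk's implementation was in C++ and is not shown in the slides): the class name `FrameChainClassifier`, the padding symbol `_`, and the additive smoothing constant `alpha` are assumptions introduced here so that unseen transitions do not yield zero probability.

```python
import math
from collections import Counter

LABELS = "sh-"      # the three secondary-structure labels from the slides
PAD = "_"           # hypothetical padding symbol at chain ends
ALPHABET_SIZE = 21  # 20 amino acids plus the padding symbol


def frame(x, i, r):
    """Window (x[i-r], ..., x[i+r]), padded at the chain ends."""
    return "".join(x[j] if 0 <= j < len(x) else PAD for j in range(i - r, i + r + 1))


class FrameChainClassifier:
    def __init__(self, r, alpha=1.0):
        self.r, self.alpha = r, alpha
        self.label_counts = Counter()  # occurrences of each label
        self.first_counts = Counter()  # (label, first frame symbol)
        self.pair_counts = Counter()   # (label, previous symbol, symbol)
        self.prev_counts = Counter()   # (label, previous symbol)

    def fit(self, pairs):
        for x, y in pairs:
            for i, label in enumerate(y):
                f = frame(x, i, self.r)
                self.label_counts[label] += 1
                self.first_counts[label, f[0]] += 1
                for prev, cur in zip(f, f[1:]):
                    self.pair_counts[label, prev, cur] += 1
                    self.prev_counts[label, prev] += 1
        return self

    def log_score(self, f, label):
        """log of P(y) P(x_1 | y) prod_j P(x_j | x_{j-1}, y), with smoothing."""
        a = self.alpha
        total = sum(self.label_counts.values())
        n_label = self.label_counts[label]
        score = math.log((n_label + a) / (total + a * len(LABELS)))
        score += math.log((self.first_counts[label, f[0]] + a)
                          / (n_label + a * ALPHABET_SIZE))
        for prev, cur in zip(f, f[1:]):
            score += math.log((self.pair_counts[label, prev, cur] + a)
                              / (self.prev_counts[label, prev] + a * ALPHABET_SIZE))
        return score

    def predict(self, x):
        # Each position is labelled independently by the arg max over the 3 labels.
        return "".join(
            max(LABELS, key=lambda label: self.log_score(frame(x, i, self.r), label))
            for i in range(len(x))
        )


model = FrameChainClassifier(r=1).fit([("AAAAAA", "hhhhhh"), ("GGGGGG", "------")])
print(model.predict("AAAA"))  # hhhh
```

Scores are accumulated as log-probabilities to avoid numerical underflow for large frames; the arg max is unchanged by the logarithm.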
Building And Executing Prediction Algorithms
• Implemented in C++
• Computational complexity:
• Training: $O(m \ln m)$
• Prediction: $O(\ln m)$
• Parallelizable
• Executed on an NVIDIA GeForce 8800 GTX-based GPU cluster with a total performance of 4 TFLOPS
• Can be implemented using MapReduce:
• Hadoop
• Spark
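Why training fits the MapReduce pattern can be sketched in plain Python. This is not Hadoop or Spark code and not the talk's implementation: it only illustrates, under the assumption that training reduces to counting, that counts computed per shard of proteins can be merged by addition. The function names and the simplified count key are hypothetical.

```python
from collections import Counter
from functools import reduce


def count_pairs(shard):
    """Map step: count (label, previous residue, residue) triples in one shard."""
    counts = Counter()
    for x, y in shard:
        for i in range(1, len(x)):
            counts[y[i], x[i - 1], x[i]] += 1
    return counts


def merge(a, b):
    """Reduce step: counters add element-wise, so shards merge in any order."""
    return a + b


# Two toy shards of (aa-sequence, ss-sequence) pairs; the labels are made up.
shards = [
    [("MEQR", "-sss")],
    [("ITLK", "ssh-"), ("DYAM", "hhhh")],
]
total = reduce(merge, map(count_pairs, shards), Counter())
print(sum(total.values()))  # 9
```

Because each protein is counted independently, sharding by protein gives the same totals as a single pass over all training pairs.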
Adjusting Model Parameters
• Markov chain order $k$
• Selected using Anderson's results in a series of statistical hypothesis tests
• Depends on the training data (higher-order chains require more data)
• Chain order $k = 3$ was used
• Frame size $r$
• Selected empirically
• Frame size 14 was used
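[Figure: choosing the chain order: $k = 1$, $k = 2$, or higher?]
[Figure: choosing the frame size: $r = 5$ (frame YQSAINKAIHA, labels .....h.....), $r = 2$ (frame AINKA, labels ..h..), or another value?]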
Accuracy Evaluation
• Single protein structure prediction accuracy
• C3 – ratio of correctly predicted items to protein length
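• C(s), C(h), C(-) – secondary structure type-specific accuracy coefficients
$$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}, \quad \alpha \in \{s, h, -\}$$
• $p_\alpha$: positions correctly predicted as $\alpha$; $n_\alpha$: positions correctly predicted as not $\alpha$; $u_\alpha$: underpredicted positions (true $\alpha$, predicted otherwise); $o_\alpha$: overpredicted positions (predicted $\alpha$, true otherwise)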
• Example: CRO protein ss prediction
• MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
• -ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------
• --ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------
• Accuracy:
• C3: 0.878788
• C(s): 0.815068
• C(h): 0.92674
• C(-): 0.74525
• Prediction accuracy of a model with training data (23,000 structures)
• Cross Validation
• Single Protein Exclusion
• Average C3: 0.83
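The two accuracy measures can be sketched as follows. The function names are hypothetical, and the per-type coefficient is computed here as a Matthews-style correlation with a square root in the denominator, which is an assumption about the intended formula; with the square root, the code reproduces the C(-) value reported for the CRO example. Returning 0.0 when the denominator is zero is also an assumption.

```python
import math


def c3(true_ss: str, pred_ss: str) -> float:
    """Fraction of positions whose predicted state equals the true state."""
    return sum(t == q for t, q in zip(true_ss, pred_ss)) / len(true_ss)


def c_alpha(true_ss: str, pred_ss: str, alpha: str) -> float:
    """Correlation coefficient for a single secondary-structure type alpha."""
    p = n = u = o = 0
    for t, q in zip(true_ss, pred_ss):
        if t == alpha and q == alpha:
            p += 1  # correctly predicted as alpha
        elif t != alpha and q != alpha:
            n += 1  # correctly predicted as not alpha
        elif t == alpha:
            u += 1  # underpredicted: true alpha, predicted otherwise
        else:
            o += 1  # overpredicted: predicted alpha, true otherwise
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0


true_ss = "-ssssshhhhhhhh-hhhhhhhh---hhhhhhhhhh--ssssssss-ssssssss-----------"
pred_ss = "--ssssh--h---h-hhhhhhhh----hhhhhhhhh--ssssssss-ssssssss-s---------"
print(round(c3(true_ss, pred_ss), 6))            # 0.878788
print(round(c_alpha(true_ss, pred_ss, "-"), 5))  # 0.74525
```

On the CRO strings, 58 of the 66 positions agree, which gives the C3 value listed above.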
Thank you for your attention!
• Questions?
