20131019 Young Biophysicists (生物物理若手) Journal Club
Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.
DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches.
Liu R, Hu J.

Transcript

  • 1. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Jun 5. 20131019 Young Biophysicists Kansai Branch (生物物理若手関西支部) Journal Club
  • 2. Topics: prediction of protein-DNA binding residues; statistics of networks; machine learning.
  • 3. Result: DNABind, a hybrid of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues. Figure panels: Template, Machine learning, DNABind. Structures: EcoRV (1RVE:A), CprK (3E6C:C). True positive residues are shown. DNABind improves classification. Legend: query protein, template protein, TP, FN.
  • 4. Aim: Protein-DNA interactions are important in cell biology. Determining them experimentally is time-consuming and costly, so computational approaches are desirable.
  • 5. Computational approaches. Data bank: PDB (http://www.rcsb.org/pdb/home/home.do). Characteristics of binding residues: solvent exposed; higher electrostatic potential; more conserved; hotspots appear as clusters of conserved residues. Structural properties (DNA-binding residues vs. surface): packing density; surface curvature; B-factor; residue fluctuation; hydrogen bond donors.
  • 6. Computational algorithms. Feature-based: extract effective features. Template-based: align templates and retrieve the best match. Template!!
  • 9. Features used in machine learning. Structure-based: PSSM (position-specific scoring matrix); evolutionary conservation; solvent accessibility; local geometry (depth and protrusion index); topological features (degree, closeness, betweenness, clustering coefficient); relative position (distance to centroid); statistical potential (Boltzmann distribution). Sequence-based (more difficult than structure-based): amino acid identity; residue physicochemical properties (polarity, secondary structure, molecular volume, codon diversity, electrostatic charge); predicted structure (no 3D structure needed!!).
  • 10. Features used in machine learning. Structure-based: PSSM; relative solvent accessibility; depth and protrusion index; topological features; distance to centroid; statistical potentials. Sequence-based: PSSM; predicted structures; amino acid indices; statistical potentials. Construct the machine learning model (SVM).
  • 11. Template-based approach. Used in image recognition, etc., for example recognizing faces in a camera image. Template!!
  • 12. Template-based approach. Used in image recognition, etc., for example recognizing faces in a camera image. Match!! Template!!
  • 13. Template-based prediction. Template-based: structural alignment and statistical potential. Binding residue prediction is conducted only if the target protein is considered a DNA-binding protein. 312 templates were selected.
  • 14. Network. Degree is a commonly used measure of the local connectivity of a node. Closeness is a global centrality metric used to determine how critical a residue is in a residue interaction network. Betweenness of residue i is defined as the sum, over all pairs of residues, of the fraction of shortest paths between the pair that pass through residue i. Clustering coefficient (transitivity) quantifies how close a node's neighbors are to being a clique, i.e. the probability that the adjacent vertices of a vertex are connected. Motif, hub, and community are also important…
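The four network measures defined on this slide can be sketched in plain Python. This is a minimal illustration on a made-up five-node graph, not the residue interaction network from the paper; the function names and the toy adjacency map `adj` are assumptions introduced here. Betweenness uses Brandes' algorithm for unweighted, undirected graphs and is left unnormalized.

```python
from collections import deque

def degree(adj, node):
    """Number of direct neighbors (local connectivity)."""
    return len(adj[node])

def closeness(adj, node):
    """(reachable nodes) / (sum of shortest-path lengths from node)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def clustering(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def betweenness(adj):
    """Unnormalized betweenness for every node (Brandes' algorithm)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
```

In this toy graph node C has the highest degree and betweenness, while A and B have clustering coefficient 1 because their two neighbors are connected to each other.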
  • 15. Network sample: human protein interactome. Scale-free; small-world; cluster; power law (Pareto distribution). Bioinformatics. 2012 Jan 1;28(1):84-90.
  • 16. Machine learning. Example: spam. 4601 samples, 57 parameters. Classification: spam or nonspam.
  • 17. Machine learning. Support vector machine (SVM); decision tree; RandomForest; logistic regression; LASSO (elastic net and ridge); neural networks (deep learning); evolutionary algorithm; Gaussian process; k-nearest neighbor; clustering; Bayesian networks; association rule learning; inductive logic programming (ILP).
  • 18. Support vector machine (SVM). Makes a hyperplane to divide groups. Kernel method: maps a non-linear problem to a linear one. Easy to do. Takes much computational time. Tuning is very difficult.
  • 19. Decision tree. Make many trees. Easy to understand graphically. Performance is not so good.
  • 20. RandomForest. Makes many decision trees. More precise. Somewhat time-consuming.
  • 21. Logistic regression Many medical researchers use… Easy to use but tuning is very difficult. (to tell the truth…)
  • 22. LASSO, Elastic net, and Ridge regression. LASSO: Least Absolute Shrinkage and Selection Operator. Figure panels: LASSO, Elastic Net, Ridge.
  • 23. Neural networks. Artificial mammal brain (perceptron). Hidden multi-layer. Deep learning is a hot topic!! (hard to understand…) http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html
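For reference, the three methods named on this slide differ only in the penalty added to the fitting loss. The slide gives no formulas, so the following is the standard textbook form written for squared-error loss; the symbols (response y, design matrix X, coefficients β, penalty weight λ ≥ 0, mixing weight α in [0, 1]) are notation introduced here, and software packages scale the terms slightly differently.

```latex
\begin{align}
\hat{\beta}_{\mathrm{ridge}} &= \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \\
\hat{\beta}_{\mathrm{LASSO}} &= \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \\
\hat{\beta}_{\mathrm{EN}}    &= \arg\min_{\beta}\; \|y - X\beta\|_2^2
  + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right)
\end{align}
```

The L1 term can shrink coefficients exactly to zero (feature selection), the L2 term only shrinks them, and the elastic net interpolates between the two.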
  • 24. n-fold cross validation: used to evaluate how the results of a statistical analysis will generalize to an independent data set.
  • 30. n-fold cross validation: used to evaluate how the results of a statistical analysis will generalize to an independent data set. Figure labels: Train data, Test, 1. Leave-one-out CV.
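The splitting step of n-fold cross validation can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the function name `k_fold_splits` is hypothetical, and the folds are taken in index order without shuffling, which real analyses usually add. Leave-one-out CV is the special case where the number of folds equals the number of samples.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds folds.

    Folds are contiguous blocks in index order (no shuffling); fold sizes
    differ by at most one when n_samples is not divisible by n_folds.
    """
    if not 1 < n_folds <= n_samples:
        raise ValueError("need 1 < n_folds <= n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 5-fold CV on 10 samples: every sample is used for testing exactly once.
five_fold = list(k_fold_splits(10, 5))

# Leave-one-out CV: as many folds as samples, one test sample per fold.
leave_one_out = list(k_fold_splits(4, 4))
```

A model would be trained on each `train` list and scored on the matching `test` list, and the scores averaged over the folds.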
  • 31. Performance
               SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
    Recall     0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
    Precision  0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
    F          0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
    MCC        0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888
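The four rows of this table are presumably recall, precision, F-measure, and the Matthews correlation coefficient. A minimal sketch of how they are computed from confusion-matrix counts is given below; the function names and the example counts are made up for illustration and are not taken from the slides.

```python
import math

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def binary_metrics(tp, fp, tn, fn):
    """Recall, precision, F-measure and MCC from confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f": f_measure(precision, recall),
        "mcc": mcc,
    }

# Hypothetical counts, for illustration only.
example = binary_metrics(tp=90, fp=10, tn=80, fn=20)

# Consistency check against the SVM column of the table:
# precision 0.948 and recall 0.917 give F = 0.932 after rounding.
svm_f = round(f_measure(0.948, 0.917), 3)
```

Unlike F, the MCC also uses the true negatives, so it is less affected by class imbalance between binding and non-binding examples.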
  • 32. Combine two approaches
  • 33. Statistical features of structure. A: Binding residues are highly solvent accessible. B, C: Binding residues have low depth and high protrusion. D-G: Not much difference in network features. H: Binding residues are closer to the centroid.
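The distance-to-centroid feature in panel H can be sketched as below. The slide does not say which atom represents a residue, so this assumes one representative 3D coordinate per residue (for example the Cα atom); the function name and the toy coordinates are hypothetical.

```python
import math

def distances_to_centroid(coords):
    """Euclidean distance of each residue coordinate to the centroid.

    coords: list of (x, y, z) tuples, one per residue (assumption:
    one representative atom per residue).
    """
    if not coords:
        raise ValueError("coords must not be empty")
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    return [math.dist(c, centroid) for c in coords]

# Toy example: four points on a square; the centroid is (1, 1, 0),
# so every point lies sqrt(2) away from it.
toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
toy_distances = distances_to_centroid(toy_coords)
```

`math.dist` requires Python 3.8 or later.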
  • 34. Performance
  • 35. Performance. A higher TM-score is required for good prediction. TM-score is a measure of similarity between two protein structures with different tertiary structures. A score < 0.2 indicates a random relation and a score > 0.5 indicates highly related structures. Proteins. 2004 Dec 1;57(4):702-10. Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.
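As a rough illustration of what the TM-score measures, the sketch below evaluates the standard TM-score sum for one fixed superposition. It assumes the distances between aligned residue pairs are already known; the full TM-score takes the maximum of this quantity over superpositions, which is not done here. The function names are hypothetical.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8."""
    if target_length <= 15:
        raise ValueError("this simple form of d0 needs target_length > 15")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("target too short for a positive d0")
    return d0

def tm_score_fixed(distances, target_length):
    """TM-score sum for a given superposition (no optimization).

    distances: distance in angstroms for each aligned residue pair.
    target_length: number of residues in the target protein.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A perfect match of all 100 residues scores 1.0; aligning only half of
# the target perfectly scores 0.5.
perfect = tm_score_fixed([0.0] * 100, 100)
half = tm_score_fixed([0.0] * 50, 100)
```

Because the sum is divided by the target length and d0 grows with length, the score stays between 0 and 1 and is comparable across proteins of different sizes.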
  • 36. Performance Comparison among ML, TL, and DNABind. Comparison between DNABind and other software.
  • 37. Result: DNABind, a hybrid of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues. Figure panels: Template, Machine learning, DNABind. Structures: EcoRV (1RVE:A), CprK (3E6C:C). True positive residues are shown. DNABind improves classification. Legend: query protein, template protein, TP, FN.