20131019 Biophysics Young Researchers Journal Club
Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.
DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches.
Liu R, Hu J.


  1. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Jun 5. 20131019 Biophysics Young Researchers, Kansai Branch, Journal Club
  2. Topics: prediction of protein-DNA binding residues; statistics of networks; machine learning.
  3. Result: DNABind, a hybrid method combining machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues. Figure panels: Template, Machine learning, DNABind; proteins shown: EcoRV (1RVE:A) and CprK (3E6C:C). True positive residues are highlighted; DNABind improves classification. Legend: query protein, template protein, TP, FN.
  4. Aim: Protein-DNA interactions are important for cell biology. Determining them experimentally is time-consuming and costly, so computational approaches are desirable.
  5. Computational approaches. Data bank: PDB (http://www.rcsb.org/pdb/home/home.do). Characteristics of binding residues: solvent exposed; higher electrostatic potential; more conserved; hotspots as clusters of conserved residues. Structural properties (DNA-binding residues vs. surface): packing density, surface curvature, B-factor, residue fluctuation, hydrogen bond donors.
  6. Computational algorithms. Feature-based: extract effective features. Template-based: align against templates and retrieve the best match.
  9. Features used in machine learning. Structure-based: PSSM (position-specific scoring matrix); evolutionary conservation; solvent accessibility; local geometry (depth and protrusion index); topological features (degree, closeness, betweenness, clustering coefficient); relative position (distance to centroid); statistical potential (Boltzmann distribution). Sequence-based (more difficult than structure-based): amino acid identity; residue physicochemical properties (polarity, secondary structure, molecular volume, codon diversity, electrostatic charge); predicted structure (no 3D structure needed).
  10. Features used in machine learning. Structure-based: PSSM, relative solvent accessibility, depth and protrusion index, topological features, distance to centroid, statistical potentials. Sequence-based: PSSM, predicted structures, amino acid indices, statistical potentials. These features are used to construct the machine learning model (SVM).
  11. Template-based approach: also used in image recognition and similar tasks, for example recognizing faces in a camera image by matching against a template.
  13. Template-based prediction: structural alignment and statistical potential. Binding residue prediction is conducted only if the target protein is considered to be a DNA-binding protein. 312 templates were selected.
  14. Network. Degree is a commonly used measure of the local connectivity of a node. Closeness is a global centrality metric used to determine how critical a residue is in a residue interaction network. Betweenness of residue i is defined as the sum, over all pairs of residues, of the fraction of shortest paths between the pair that pass through residue i. The clustering coefficient (transitivity) quantifies how close a node's neighbors are to being a clique, i.e. the probability that two vertices adjacent to a given vertex are themselves connected. Motifs, hubs, and communities are also important.
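The measures on slide 14 can be illustrated with a minimal sketch on a toy graph. The graph, the node labels, and the function names below are hypothetical and are not taken from the paper; closeness is computed here as (n - 1) divided by the sum of shortest-path distances, which assumes a connected, unweighted graph. Betweenness is left out to keep the sketch short.

```python
from collections import deque

# Hypothetical toy contact graph (not from the paper): each key is a
# residue label, each value the set of residues it is in contact with.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def degree(graph, node):
    """Number of direct neighbors (local connectivity)."""
    return len(graph[node])


def closeness(graph, node):
    """(reachable nodes) / (sum of shortest-path distances), via BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def clustering_coefficient(graph, node):
    """Fraction of neighbor pairs that are themselves connected."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))


if __name__ == "__main__":
    for residue in GRAPH:
        print(residue, degree(GRAPH, residue),
              round(closeness(GRAPH, residue), 3),
              round(clustering_coefficient(GRAPH, residue), 3))
```

In this toy graph, node C touches every other node, so it has the highest degree and closeness, while only one of its three neighbor pairs is connected.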
  15. Network sample: human protein interactome. Scale-free, small-world, clustered; power law (Pareto distribution). Bioinformatics. 2012 Jan 1;28(1):84-90.
  16. Machine learning. Example: a spam data set with 4601 samples and 57 parameters. Classification: spam or nonspam.
  17. Machine learning methods: support vector machine (SVM); decision tree; RandomForest; logistic regression; LASSO (elastic net and ridge); neural networks (deep learning); evolutionary algorithms; Gaussian processes; k-nearest neighbors; clustering; Bayesian networks; association rule learning; inductive logic programming (ILP).
  18. Support vector machine (SVM): constructs a hyperplane that separates the groups. Kernel method: maps a non-linear problem to a linear one. Easy to apply, but computationally expensive, and tuning is very difficult.
  19. Decision tree: builds a tree of branching decisions. Easy to understand graphically, but performance is not so good.
  20. RandomForest: builds many decision trees. More precise, but somewhat time-consuming.
  21. Logistic regression: used by many medical researchers. Easy to use, but tuning is very difficult (to tell the truth...).
  22. LASSO, elastic net, and ridge regression. LASSO stands for Least Absolute Shrinkage and Selection Operator. Figure panels: LASSO, Elastic Net, Ridge.
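The three methods on slide 22 differ only in the penalty added to the fitting loss. As a sketch in one common parameterization (the squared-error case; the symbols $X$, $y$, $\beta$, $\lambda$, and $\alpha$ are assumptions introduced here, not notation from the slides):

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right]
```

With $\alpha = 1$ this is LASSO (pure $L_1$ penalty, which can shrink coefficients exactly to zero), with $\alpha = 0$ it is ridge regression (pure $L_2$ penalty), and intermediate values of $\alpha$ give the elastic net.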
  23. Neural networks: an artificial model of the mammalian brain (perceptron) with multiple hidden layers. Deep learning is a hot topic (hard to understand...). http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html
  24. n-fold cross validation: evaluates how the results of a statistical analysis will generalize to an independent data set.
  30. n-fold cross validation: evaluates how the results of a statistical analysis will generalize to an independent data set. Figure labels: Train data, Test 1. Leave-one-out CV.
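The train/test splitting behind n-fold cross validation can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the function name `k_fold_indices` is hypothetical, and the samples are split in order without shuffling, which real analyses usually add.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) once per fold.

    Each sample lands in exactly one test fold; fold sizes differ by at
    most one. No shuffling is done in this sketch.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


if __name__ == "__main__":
    # 5-fold split of 10 samples.
    for train, test in k_fold_indices(10, 5):
        print("train:", train, "test:", test)
```

Setting `n_folds` equal to `n_samples` gives leave-one-out cross validation, where every test fold holds a single sample.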
  31. Performance

      Metric     SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
      Recall     0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
      Precision  0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
      F          0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
      MCC        0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888
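For reference, the four metrics reported on slide 31 can be computed from a binary confusion matrix using their standard definitions, as in this minimal sketch. The function name and the counts in the usage example are hypothetical; they are not the counts behind the reported values.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, F-measure and Matthews correlation coefficient
    (MCC) from confusion-matrix counts. Undefined ratios return 0.0."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "F": f_measure, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=90, fp=10, fn=10, tn=90))
```

Unlike recall, precision, and F, the MCC also uses the true negatives, so it is less flattering when the classes are imbalanced.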
  32. Combine the two approaches
  33. Statistical features of structure. A: binding residues are highly solvent accessible. B, C: binding residues have low depth and high protrusion. D-G: little difference in the network features. H: binding residues are closer to the centroid.
  34. Performance
  35. Performance: a higher TM-score is required for good prediction. TM-score is a measure of similarity between the tertiary structures of two proteins; a score < 0.2 indicates a random relation and a score > 0.5 indicates highly related structures. Proteins. 2004 Dec 1;57(4):702-10. Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.
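As a rough illustration of the TM-score mentioned on slide 35, the sketch below evaluates the standard scoring formula for one fixed superposition. It assumes the aligned residue pairs and their distances (in Å) are already known; the actual TM-score is the maximum of this quantity over superpositions, which this sketch does not search for. The function name and the inputs in the example are hypothetical.

```python
def tm_score(distances, target_length):
    """TM-score term for one fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the target protein.
    """
    if target_length < 19:
        # d0 below would be non-positive or undefined for very short chains.
        raise ValueError("target_length must be at least 19 in this sketch")
    if len(distances) > target_length:
        raise ValueError("more aligned pairs than target residues")
    # Length-dependent distance scale from the standard definition.
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


if __name__ == "__main__":
    # Hypothetical case: 80 of 100 target residues aligned, all 2 angstroms apart.
    print(round(tm_score([2.0] * 80, 100), 3))
```

Because the sum is divided by the target length rather than the number of aligned pairs, unaligned residues lower the score, and the length-dependent scale keeps scores comparable across proteins of different sizes.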
  36. Performance: comparison among ML, TL, and DNABind; comparison between DNABind and other software.
  37. Result: DNABind, a hybrid method combining machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues. Figure panels: Template, Machine learning, DNABind; proteins shown: EcoRV (1RVE:A) and CprK (3E6C:C). True positive residues are highlighted; DNABind improves classification. Legend: query protein, template protein, TP, FN.
