BIM_2010_20_Bioinformatics_Project
Project presentation for partial fulfillment of M.Sc. (Bioinformatics) at the Bioinformatics Centre, University of Pune, Pune

  1. Validation of Time Series Technique for Prediction of Conformational States of Amino Acids. Dr. Sangeeta Sawant, Bioinformatics Centre, UoP, Pune (Guide); Dr. Mohan Kale, Dept. of Statistics, UoP, Pune (Co-guide)
  2. Concepts Used: Ramachandran plot; time series; AR, ARMA, ARIMA models; AIC criterion; Euclidean distance; potential values for AA residues; Feynman Problem Solving Algorithm
  3. Ramachandran Plot
  4. Time Series: a sequence of data points or set of observations, typically measured at successive time instants spaced at uniform time intervals. Uses: patterns, variations, forecasting.
  5. Time Series Models (probability models): autoregressive (AR) models, autoregressive moving average (ARMA) models, and autoregressive integrated moving average (ARIMA) models; each depends linearly on previous data points.
  6. Materials & Methods: R; RStudio, Tinn-R; R packages bio3d, itsmr, forecast, tseries, timsac, wordcloud; ITSM_2000 (standalone); R Nabble, BioStars, stats.stackexchange
  7. Methods: A) Calculation of potential values for AA residues; B) Forecasting of AA states; C) Clustering
  8. Calculation of Potential Values for AA Residues
      • Dataset-I: 3829 proteins selected from the PDB (Protein Data Bank) using the PDBSelect dataset list (25% sequence similarity).
      • Experimental method: X-ray; R-factor 0-0.25 (for the best resolved structures).
      • Files were screened for chain breaks and for entries containing only CA atoms.
      • Phi-psi values were computed with torsion.pdb() of “bio3d” and verified via PDBGoodies (IISc, Bangalore) and the Protein Angle Descriptor utility (IIT, Delhi).
      • Each amino acid residue was assigned conformational state 1, 2, or 3, corresponding to region I, II, or III of the Ramachandran plot, from its phi-psi values.
  9. Figure No. 2: Ramachandran plot (axes φ and ψ) showing three conformational regions I, II and III
      • I: closely/tightly packed conformations, Phi -140 to 0, Psi -100 to 0
      • II: extended conformations, Phi -180 to 0, Psi 80 to 180
      • III: all remaining conformations
  10. Frequencies of single residues in the three states were calculated and normalized using (Kolaskar, A.S. & Sawant, S.V., 1996):

      $$P_{ik} = \frac{n_{ik}\, N}{\sum_{i} n_{ik}\, \sum_{k} n_{ik}}$$

      where $n_{ik}$ is the number of times the AA of type $i$ occurs in state $k$ ($k = 1, 2, 3$), $N$ is the total number of residues, and $P_{ik}$ is the potential value of the AA of type $i$ in state $k$. The potential values are provided in a PDF.
  11. Potential values
  12. Time Series
  13. ACF Plot
  14. ACF: stationary vs. non-stationary
  15. Time series and ACF plots: stationary vs. non-stationary
  16. Stationary TS
  17. TS model building: AR(p), ARMA(p, q), ARIMA(p, q)
  18. Best model selection among AR(p), ARMA(p, q) and ARIMA(p, q) by AIC
  19. Forecasting of AA states for best models
  20. Forecasting of AA states for best models (continued). For example, for an AR(1) process, $X_t = \phi X_{t-1} + Z_t$, $t = 0, \pm 1, \ldots$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$. First, the observed potential of each AA and its sequence index are given as the data point and $t$, respectively; prediction then starts from the 2nd position and runs up to the last index, using forecast() from “itsmr”.
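  21. Similarly, for ARMA(1,1)/ARIMA(1,1): $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$, with $\theta + \phi \neq 0$. Forecasting quality is measured by the coefficient of determination ($R^2$):

      $$R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

      where $Y_i$ is the true (observed) value, $F_i$ is the forecasted (predicted) value, and $\bar{Y}$ is the mean of the observed values.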
  22. Clustering. Dataset-II: SCOP domain-specific PDB-style files (ATOM & HETATM records) downloaded from the ASTRAL Compendium for Sequence and Structure Analysis, release 1.75 (June 2009). Files were scanned for chain breaks and for the presence of CA atoms only; files with breaks were kept aside.
  23. Domains with a length of 100-110 AA residues were used, e.g. file 10gsa1_a_133_pot.txt
  24. Clustering workflow
      • Potential values (time series): each domain was classified as a stationary (506) or non-stationary (1692) process.
      • Non-stationary data were kept aside for further transformations.
      • AR, ARMA and ARIMA models were fitted and the best model chosen by the minimum AIC criterion.
      • Best models: AR (22), ARMA (484), ARIMA (no model).
      • AR(p), ARMA(p, q): distance matrix (Euclidean distance).
      • Dendrogram: neighbour-joining (Phylip packages).
  25. Dendrogram_TS: AR models (22)
  26. Dendrogram_TS: ARMA models (484) • Phylowidget link
  27. Results & Discussion. For each AA of all the proteins, the 3D Cartesian coordinates were transformed into 2D information, i.e. the conformational states of the AA. Potential values were then computed and used to build a time-distance (index of AA) dependent statistical model, as a time series, for forecasting purposes.
  28. AR values. The autoregressive order (p) lies in the range 1-18, indicating short- and long-range dependence and hence variations in protein structural arrangements. These variations point to diversity exhibited through structural components.
  29. Table No. II: Forecasting results for AR models (44) out of the best 90 models (note: for 46 models, class information was not found in the SCOP database). All values are % accuracy.

      SCOP class (no. of models)      AA seq. (%) Max / Min    States (%) Max / Min
      All α (a), 12                   26.82 / 2.41             55.68 / 21.77
      All β (b), 5                    16.30 / 8.88             51.11 / 44.76
      α/β (c), 9                      27.77 / 1.47             54.76 / 30.64
      α+β (d), 13                     28.57 / 7.04             51.70 / 19.04
      Small proteins (g), 1           19.51                    48.78
      Coiled-coil proteins (h), 3     22.5 / 5.88              26 / 15
      Designed proteins (k), 1        29.03                    26.88

      Conformational-state accuracy exceeds AA-residue accuracy because of the low resolution of the potential values (forecasted values).
  30. Table No. III: Forecasting results for ARMA models (557) out of the best 1239 models (note: for 682 models, class information was not found in the SCOP database). All values are % accuracy.

      SCOP class (no. of models)                  AA seq. (%) Max / Min    States (%) Max / Min
      All α (a), 123                              32.55 / 2.63             65.77 / 8.06
      All β (b), 146                              32.81 / 3.96             65.01 / 17.94
      α/β (c), 120                                43.47 / 5                62.89 / 8.97
      α+β (d), 127                                37.96 / 2.70             68.15 / 11.11
      Multi-domain proteins (e), 13               24.39 / 6.034            50 / 17.80
      Membrane & cell surface proteins (f), 3     12.65 / 7.01             34.33 / 11.42
      Small proteins (g), 17                      30.64 / 6.60             64.51 / 14.28

      Because the dataset is non-representative and class information is inadequate, it cannot be said that, for any particular class, i) prediction accuracy ↑ or ↓, or ii) the series mostly follow an ARMA process.
  31. Discussion
      • TS graphs open a new door in the scientific visualization of proteins (without 3D structure information): a specific AA can be visualized on a line plot with its value proportional to its frequency of occurrence in the allowed regions of the Ramachandran plot.
      • The potential value for each AA adds a new feature for feature selection in machine learning techniques.
      • The order of an AR model tells how the current value is linearly related to the past p values.
      • Intra-dependency of AA is shown using TS models, e.g. AR(4), ARMA(1,3).
  32. Conclusions
      • A new way of looking at protein structure prediction was found.
      • Application of the TS technique for predicting conformational states, based on conformational state potentials instead of secondary structure, has been attempted.
      • The accuracy of predicting conformational states of AA using time series is higher than that of predicting AA residues.
      • To increase prediction accuracy, a multivariate time series concept may be useful instead of a univariate time series.
      • Intra-fluctuations inside proteins, due to AA arrangement, can be traced through the stationary and non-stationary groups.
  33. Future Work
      • AR and MA orders of TS models could serve as points of genetic information (distances) to predict evolutionary relationships between different proteins.
      • The TS concept can be used to predict conformational states of missing residues in PDB data files.
      • Hierarchical clustering/classification of protein TS gives birth to a new concept of time-dependent clustering (pseudo-clustering) and pseudo-phylogeny.
      • Development of synthetic proteins to combat seasonal diseases and to tackle chemical warfare attacks.
      • TS fluctuations for a specific class of proteins can be used as a “pattern” for data analysis and pattern-dependent classification of proteins.
  34. References
      • Blundell TL, Sibanda BL, Sternberg MJ, Thornton JM. Knowledge-based prediction of protein structures and the design of novel molecules. Nature. 1987 Mar 26-Apr 1;326(6111):347-52. Review.
      • Kolaskar AS, Sawant SV. (1996). Prediction of conformational states of amino acids using a Ramachandran plot. Int. J. Peptide Protein Res. 110-116.
      • Alessandro G., Romualdo B. (2000). Nonlinear Methods in the Analysis of Protein Sequences: A Case Study in Rubredoxins. Biophysical Journal. 136-148.
  35. Questions
  36. Thank You!
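
Slide 9 assigns each residue to one of three conformational states from its (phi, psi) pair. The rule can be sketched in a few lines of Python. The original work used R (torsion.pdb() from bio3d), so this is only an illustration: the function name `conformational_state` is hypothetical, and treating the region boundaries as inclusive is an assumption the slides do not settle.

```python
def conformational_state(phi, psi):
    """Map a (phi, psi) pair in degrees to state 1, 2 or 3 (regions I-III on slide 9).

    Assumption: region boundaries are inclusive; the slides do not say.
    """
    # Region I: closely/tightly packed conformations
    if -140.0 <= phi <= 0.0 and -100.0 <= psi <= 0.0:
        return 1
    # Region II: extended conformations
    if -180.0 <= phi <= 0.0 and 80.0 <= psi <= 180.0:
        return 2
    # Region III: everything else
    return 3


# Hypothetical example angles, not taken from the datasets in the slides
example_states = [conformational_state(p, s) for p, s in [(-60, -45), (-120, 130), (60, 45)]]
```

Region I is tested first; with the bounds given on slide 9 the two regions do not overlap, so the order of the checks does not change the result.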
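
The potential on slide 10 can be computed directly from residue/state counts. The sketch below assumes the formula reads as P_ik = n_ik · N / (n_i · n_k), with n_i the count of residue type i over all states and n_k the count of state k over all residue types; that reading of the two sums is an assumption. The names `potentials` and `potential_series` are hypothetical, and the toy sequence is made up for illustration.

```python
from collections import Counter


def potentials(residues, states):
    """Return {(aa, state): P_ik} for every observed (residue type, state) pair.

    Assumed form: P_ik = n_ik * N / (n_i * n_k).
    """
    n_pair = Counter(zip(residues, states))   # n_ik
    n_res = Counter(residues)                 # n_i, summed over states
    n_state = Counter(states)                 # n_k, summed over residue types
    total = len(residues)                     # N
    return {
        (aa, k): n * total / (n_res[aa] * n_state[k])
        for (aa, k), n in n_pair.items()
    }


def potential_series(residues, states, table):
    """Turn one protein into a series: the potential of each residue at its index."""
    return [table[(aa, k)] for aa, k in zip(residues, states)]


# Toy data (hypothetical): four residues with their assigned states
toy_residues = "AAGG"
toy_states = [1, 1, 2, 1]
toy_table = potentials(toy_residues, toy_states)
toy_series = potential_series(toy_residues, toy_states, toy_table)
```

In the workflow described on the slides the table would be built once from Dataset-I and then applied to each protein; here the same toy sequence is used for both steps only to keep the example short.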
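
Slides 13 to 16 separate stationary from non-stationary series by inspecting ACF plots. As a rough illustration, the sample autocorrelation function can be computed as below; a slowly decaying ACF is the usual visual sign of non-stationarity. The helper names are hypothetical, and the ±1.96/√n band is the standard approximate 95% bound for white noise, not a threshold taken from the slides.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator, mean-centred)."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    denom = sum(v * v for v in c)
    return [
        sum(c[t] * c[t + h] for t in range(n - h)) / denom
        for h in range(max_lag + 1)
    ]


def white_noise_bound(n):
    """Approximate 95% bound for the sample ACF of white noise of length n."""
    return 1.96 / math.sqrt(n)


# Toy trending series (hypothetical data)
acf = sample_acf([1, 2, 3, 4, 5], 2)
```

Counting how many lags fall outside the bound is one simple way to turn the visual check into a number, though the slides do not state which criterion was actually used.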
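
Slide 18 selects the best model by minimum AIC. A minimal sketch of that comparison, using the standard definition AIC = 2k − 2 ln L, is shown below. The candidate log-likelihoods and parameter counts are invented purely to show the mechanics, and the names `aic` and `best_by_aic` are hypothetical; in the original work the fitted models came from R packages.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """candidates maps a model label to (log_likelihood, n_params)."""
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fitted models: (log-likelihood, number of estimated parameters)
candidates = {
    "AR(1)": (-120.0, 2),
    "AR(2)": (-118.5, 3),
    "ARMA(1,1)": (-119.0, 3),
}
best = best_by_aic(candidates)
```

With these made-up numbers the extra parameter of AR(2) is just worth its likelihood gain, which is exactly the trade-off the criterion is meant to capture.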
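
Slides 20 and 21 describe one-step prediction from the 2nd position onward and scoring by R². The sketch below is a simplified stand-in, not the itsmr forecast() routine used in the project: it estimates φ for a mean-centred AR(1) by least squares, predicts each value from its predecessor, and scores the predictions with the R² formula from slide 21. All function names and the toy series are hypothetical.

```python
def fit_ar1(x):
    """Least-squares estimate of (mean, phi) for a mean-centred AR(1) model."""
    mean = sum(x) / len(x)
    c = [v - mean for v in x]
    num = sum(c[t] * c[t - 1] for t in range(1, len(c)))
    den = sum(v * v for v in c[:-1])
    return mean, num / den


def one_step_predictions(x, mean, phi):
    """Predict x[t] from x[t-1] for t = 1..n-1 (i.e. from the 2nd position on)."""
    return [mean + phi * (x[t - 1] - mean) for t in range(1, len(x))]


def r_squared(observed, forecast):
    """R^2 = 1 - sum((Y_i - F_i)^2) / sum((Y_i - mean(Y))^2)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Toy series (hypothetical) with zero lag-1 correlation
series = [0, 1, 0, -1, 0, 1, 0, -1]
mean, phi = fit_ar1(series)
preds = one_step_predictions(series, mean, phi)
score = r_squared(series[1:], preds)
```

An R² near 0 means the forecasts do no better than always predicting the mean, and values near 1 indicate that the forecasts track the observed potentials closely.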
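
Slide 24 builds a Euclidean distance matrix from the fitted AR/ARMA models before neighbour-joining in Phylip. The slides do not say which model features were compared, so the sketch below assumes each domain is represented by its vector of fitted coefficients, zero-padded to a common length (an AR(p) model can be viewed as a higher-order model with zero trailing coefficients). The domain labels, coefficient values, and function names are all hypothetical.

```python
import math


def pad(coeffs, length):
    """Zero-pad a coefficient vector to the given length."""
    return list(coeffs) + [0.0] * (length - len(coeffs))


def distance_matrix(models):
    """Pairwise Euclidean distances between coefficient vectors.

    models maps a domain label to its list of fitted coefficients.
    Returns (labels, matrix) with rows and columns in sorted label order.
    """
    names = sorted(models)
    width = max(len(models[n]) for n in names)
    vecs = {n: pad(models[n], width) for n in names}
    matrix = [[math.dist(vecs[a], vecs[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical coefficient vectors for three domains
models = {"dom_a": [0.3, 0.4], "dom_b": [0.0], "dom_c": [0.3, 0.4]}
names, matrix = distance_matrix(models)
```

The resulting square matrix is the kind of input a neighbour-joining program expects; writing it out in the tool's file format is left out here.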