Protein Structure Alignment and Comparison
This talk presents an online decision support system for structural biologists who are interested in performing multiple protein structure comparisons, via multiple methods, in one go.

Transcript

  • 1. The ProCKSI-Server: An on-line Decision Support System for Protein Structure Comparison. Natalio Krasnogor, www.cs.nott.ac.uk/~nxk, Natalio.Krasnogor@Nottingham.ac.uk. Interdisciplinary Optimisation Laboratory, Automated Scheduling, Optimisation & Planning Research Group, School of Computer Science and Information Technology; Centre for Integrative Systems Biology, School of Biology; Centre for Healthcare Associated Infections, Institute of Infection, Immunity & Inflammation; University of Nottingham. 27th November 2008, University of Warwick
  • 2. Outline  Introduction − Brief introduction to proteins − Protein structure comparison − Methods  ProCKSI − Motivation − External methods − USM & MAX-CMO − Consensus building  Results − From a structural bioinformatics perspective − From a computational perspective  Conclusions  Acknowledgements
  • 3. Introduction www.procksi.org
  • 4. What are Proteins?  Proteins are biological molecules of primary importance to the functioning of living organisms  They perform many and varied functions
  • 5. Structural proteins: the organism's basic building blocks, e.g. collagen, nails, hair. Enzymes: biological engines which mediate a multitude of biochemical reactions. Usually enzymes are very specific and catalyse only a single type of reaction, but they can play a role in more than one pathway. Transmembrane proteins: the cell's housekeepers, e.g. regulating cell volume, extracting and concentrating small molecules from the extracellular environment, and generating the ionic gradients essential for muscle and nerve cell function (the sodium/potassium pump is an example).
  • 6. Protein Structures  Proteins vary in size, shape and structure  Structure determines their biological activity  "Nature's Robots"  Understanding protein structure is key to understanding function and dysfunction
  • 7. Components of Proteins  Building blocks: − Amino acids − Common basic unit (Livingstone and Barton, 1993) • Distinct "side chains" • 20 amino acid types
  • 8. Components of Proteins
  • 9. Components of Proteins • Thousands of different physicochemical and biochemical properties (AAIndex) • Thus proteins are beautiful combinatorial beasts!
  • 10. Protein Synthesis  Amino acid sequences − AAs polymerised into chains (residues) − Gene sequence determines protein sequence  Protein structure − Chains fold into specific compact structures  Structure formation (folding) is spontaneous  Sequence determines structure  Structure determines function
  • 11. Determining Protein Structures  Protein structure determination is slow and difficult  Determining protein sequence is relatively easy (genomics)  PDB vs. GenBank (illustration: Thomas Splettstoesser)
  • 12. Comparing Protein Structures • Proteins build the majority of cellular structures and perform most life functions • Extend knowledge about the protein universe: – Understand interrelations between structures and functions of proteins through measured similarities – Group (cluster) proteins by structural similarities so as to infer commonalities • The goal is to predict the functions of proteins from their structure, or to design new proteins for specific functions • Considering any two objects: What does "similar" mean? Similar or not? How / where similar?
  • 13. Protein Structure Comparison  Similarity comparison of protein structures is not trivial, even though it is obvious that proteins may share certain common patterns (motifs)  Many different similarity comparison methods are available, each with its own strengths and weaknesses  Different concepts of similarity: sequence vs. structural, local vs. global, chemical role vs. biological function vs. evolution vs. …  Different algorithms and implementations: exact vs. approximation vs. heuristic, local vs. global search  Maximum Contact Map Overlap using e.g. memetic algorithms, Variable Neighbourhood Search, Tabu Search (picture source: http://www.cathdb.info)
  • 14. Existing Approaches A variety of structure comparison methodologies exist, e.g.: • SSAP (Orengo & Taylor, 96) • ProSup (Feng & Sippl, 96) • DALI (Holm & Sander, 93) • CE (Shindyalov & Bourne, 98) • Max-CMO (Goldman, Papadimitriou, Istrail, Lancia, 99 & 2001) • LGA (Zemla, 2003) • USM (Krasnogor & Pelta, 2004) • SCOP (Murzin, Brenner, Hubbard & Chothia, 95) • CATH (Orengo, Michie, Jones, Jones, Swindells & Thornton, 97)
  • 15. Computational Underpinning • Dynamic programming (Taylor, 99) • Comparison of distance matrices (Holm & Sander, 93, 96) • Maximal common sub-graph detection (Artymiuk, Poirrette, Rice & Willett, 95) • Geometrical matching (Wu, Schmidler, Hastie & Brutlag, 98) • Root-mean-square distances (Maiorov & Crippen, 94; Cohen & Sternberg, 80) • Other methods (e.g. Lackner, Koppensteiner, Domingues & Sippl, 99; Zemla, Venclovas, Moult & Fidelis, 2001) A survey of various similarity measures can be found in: Koehl P, Protein structure similarities, Curr Opin Struct Biol 2001, 11:348-353.
  • 16. Some Observations • There is no agreement on which of these is the best method • Various difficulties are associated with each • They assume that a suitable scoring function can be defined for which optimum values correspond to the best possible structural match between two structures (clearly not always true, e.g. RMSD) • Some methods cannot produce a proper ranking due to: • ambiguous definitions of the similarity measures, or • neglect of alternative solutions with equivalent similarity values. Structure comparison is at its core a multi-competence (multi-objective) problem, but it is seldom treated as such, e.g.:  ProSup (Feng & Sippl, 96) optimises the number of equivalent residues, with the RMSD being an additional constraint (and not another search dimension).  DALI (Holm & Sander, 93) combines various derived measures into one value, effectively transforming a multi-objective problem into a (weighted) single-objective one.
  • 17. What/How are we comparing? Models, Measures, Metrics & Methods, or other tasks...
  • 18. Until very recently researchers would:  Focus on steps 1-4, often collapsed into one single step  Compare one algorithm against others on a given data set  Conclude that their algorithm "is best" for that data set and write a paper Meanwhile, in the real world…  No method is best on all data sets.  The biologist will only use the method (s)he is most familiar with, regardless of its suitability to his/her problem.
  • 19. (As above, plus:) Q: How do we change this reality?
  • 20. (As above, plus:) A: We make it easy for the biologist to use the correct method (and more).
  • 21. ProCKSI www.procksi.org
  • 22. The ProCKSI-Server ProCKSI: Protein Comparison, Knowledge, Similarity, and Information  Web server for protein structure comparison  Workbench / portal for established methods and repositories for protein structure information – Integrates results from many comparison methods in one place – Home-grown comparison methods, Max-CMO and USM (using contact maps as their input)  Decision support system / analysis tool – Visualises, compares and clusters all similarity measure results – Incorporates all results and suggests a similarity consensus
  • 23. The ProCKSI-Server: Minimise the Management Overhead for Experiments • Upload your own dataset or download structures from the PDB repository • Validate your PDB file, and extract desired models and chains • Choose from multiple similarity comparison methods in one place (including your own similarities), or don't choose and use them all! • Submit and monitor the progress of your experiment • Integrate results from all pair-wise comparisons • Analyse and visualise results from different similarity comparison methods • Combine results and produce a similarity consensus profile • Download desired results [Architecture diagram: Dataset Manager, Structure Manager, Calculation Manager (USM and MaxCMO locally, plus external methods), Similarity Comparison Manager, Task/Job Scheduling Manager (task requests, overview), Results Analysis Manager, and a managers-and-results database/filesystem]
  • 24. Protein Comparison Methods United  Home-grown methods: − USM − Max-CMO  External methods: − DaliLite − FAST − CE − TMalign − Vorolign − URMS  Additional information sources: − CATH, iHOP, RCSB, SCOP
  • 25. Home-Grown Methods • Representation of 3D protein structures as 2D contact maps - Atoms that are far away in the linear chain come close together in the folded state - If the distance between two atoms i, j is below a threshold t, they are said to form a contact • Mathematical description of contact maps - Calculation of all pairwise Euclidean distances between atoms i, j - Translation into a binary, symmetrical matrix, called the contact map C (both axes index the sequence of atoms) • Contact maps in ProCKSI are the input for the two main similarity measures: - Universal Similarity Metric (USM) - Maximum Contact Map Overlap (MaxCMO)
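A contact map is straightforward to compute from the definition above. The sketch below is a minimal illustration (not ProCKSI's code), assuming atom coordinates are already available as an N×3 array; the 8 Å threshold is a common choice in the literature, used here purely for illustration:

```python
import numpy as np

def contact_map(coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """Binary, symmetrical contact map C from an (N, 3) array of atom coordinates.

    C[i, j] = 1 iff the Euclidean distance between atoms i and j is below `threshold`.
    The diagonal is trivially 1; applications often mask out |i - j| small.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # all pairwise displacement vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # pairwise Euclidean distances
    return (dist < threshold).astype(np.uint8)
```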
  • 26. An example of a contact map: 1C7W.PDB
  • 27. Protein Structure Comparison • Secondary structure elements can be identified in the contact map: − α-helix: wide bands on the main diagonal − β-sheet: bands parallel or perpendicular to the main diagonal • Comparison of contact maps - using different similarity measures, e.g. number of alignments, overlap values, information content, … • Protein relationships - Pair-wise comparison of multiple proteins results in a (standardised) similarity matrix - Comparison of all possible proteins describes the protein universe [Figure: protein 1NAT with α-helices and β-sheets]
  • 28. Protein Structure Comparison • The Maximum Contact Map Overlap (MaxCMO) method is a specific measure of equivalence - Number of aligned residues (dashed lines) and equivalent contacts (aligned arcs, called the overlap) - The overlap gives a strong indication of topological similarity, taking the local environment into account
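For a fixed residue alignment, the overlap is simply a count of matched contacts. The sketch below is an illustration of that count, not ProCKSI's implementation: contact maps are given as sets of index pairs, and `alignment` is a hypothetical dict mapping residues of protein 1 to residues of protein 2. MaxCMO itself searches for the alignment maximising this count, e.g. with the memetic and local-search heuristics mentioned earlier.

```python
def overlap(contacts1, contacts2, alignment):
    """Count contacts (i, j) of protein 1 whose aligned partners also form a contact in protein 2.

    Contacts are sets of (i, j) tuples with i < j; `alignment` maps residue
    indices of protein 1 to residue indices of protein 2.
    """
    matched = 0
    for i, j in contacts1:
        if i in alignment and j in alignment:
            a, b = sorted((alignment[i], alignment[j]))
            if (a, b) in contacts2:
                matched += 1
    return matched
```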
  • 29-31. (Progressive builds of one slide:) 1ash and 1hlm, two related proteins taken from the PDB which share a six-helix structural motif.
  • 32. The same two proteins: two locally and globally similar contact maps.
  • 33. A candidate alignment between the contact maps of these protein structures.
  • 34. Protein Structure Comparison • The Universal Similarity Metric (USM) is the most concept/domain-independent measure in ProCKSI - detects similarities between (quite) divergent structures - based on the concept of Kolmogorov complexity - compares the information content of two contact maps by compression (NCD)
  • 35. Protein Structure Comparison • Contact maps are the input to the Universal Similarity Metric (USM) • The basic concept is Kolmogorov complexity: - Plain Kolmogorov complexity K(o): measures the amount of information contained in a given object o - Conditional Kolmogorov complexity K(o1|o2): how much (more) information is needed to produce object o1 if one knows object o2 (as input) • Calculation of the Normalized Information Distance (NID), which is a proper, universal and normalized similarity metric
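The slide does not spell the formula out; in the standard formulation (Li & Vitányi), which this material follows, the NID combines the two conditional complexities and normalises by the larger unconditional one:

```latex
% Normalised Information Distance; o_1, o_2 are the two contact maps
\mathrm{NID}(o_1, o_2) \;=\; \frac{\max\{\, K(o_1 \mid o_2),\; K(o_2 \mid o_1) \,\}}{\max\{\, K(o_1),\; K(o_2) \,\}}
```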
  • 36. Protein Structure Comparison • Kolmogorov complexity is not computable directly, but can be heuristically approximated • Approximation of the Normalised Information Distance (NID) by the Normalised Compression Distance (NCD): – Objects are represented as bit strings s (or files) that can be concatenated (.) – Objects are compressed by any lossless real-world compressor (e.g. zip, bzip2, …) – The length of the compressed string/file approximates the Kolmogorov complexity – Compression of the second string/file using the dictionary of the first one gives the conditional Kolmogorov complexity – NCD values fall in [0 + ε; 1 + ε] [Figure: two binary contact-map strings and their compressed forms]
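Because the NCD (Cilibrasi & Vitányi) needs only compressed lengths, it is easy to sketch with any stock compressor. A minimal illustration, assuming contact maps have already been serialised to byte strings; ProCKSI's actual choice of compressor and serialisation may differ:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).

    C(s) is the compressed length of s; zlib stands in for any lossless compressor.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar strings compress well together, so their NCD is small.
a = b"0000000000110000000000000011100000" * 20
b = b"0000000000110000000000000011100001" * 20
print(ncd(a, a), ncd(a, b))   # near 0 for identical inputs; larger for dissimilar ones
```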
  • 37. Protein Structure Comparison • Analysis of similarity matrices by hierarchical clustering: – Similarity matrices are not easy to analyse, especially for very large datasets – Similar proteins (with small distance values) are grouped together (clustered) – Many clustering algorithms are available, e.g. Ward's Minimum Variance • Results of the hierarchical clustering can be visualised as a linear or hyperbolic tree – The hyperbolic tree is favourable for large sets of proteins – Fish-eye perspective – Navigation through the tree is possible – Tree comparison across methods/data sets
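As an illustration of this step (not ProCKSI's own code), SciPy's hierarchical clustering can build a Ward tree straight from a standardised distance matrix; the matrix and protein labels below are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical standardised distance matrix for four proteins (0 = identical).
D = np.array([[0.0, 0.2, 0.8, 0.9],
              [0.2, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.3],
              [0.9, 0.8, 0.3, 0.0]])

Z = linkage(squareform(D), method="ward")        # condensed distances -> Ward tree
print(fcluster(Z, t=2, criterion="maxclust"))    # e.g. cut the tree into two clusters
```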
  • 38. Total Evidence Consensus • Comparison of a pair of proteins P1 and P2 with a given similarity method 1M results in a similarity score 1S12 • Comparison of a dataset of multiple proteins P1 … Pn with the same similarity method 1M results in the similarity matrix 1S • Comparison of the same dataset with multiple similarity methods 1M … mM results in multiple similarity matrices 1S … mS, providing multiple similarity measures [Figure: the n×n matrix 1S with entries 1S11, 1S12, …, 1Snn]
  • 39. Consensus Analysis Consensus/Greedy – Standardisation of similarity distances to [0;1] – Assumption: for a given pair of structures, the best method produces the best similarity values – Compilation of a similarity matrix including the best values from the best similarity method for each pair Consensus/Average – The expert user selects similarity measures; included measures contribute equally to the consensus – The intelligent combination of similarity comparison measures leads to better results than any single one can provide! Consensus/Weighted – Assign weights to similarity measures according to preference by ranking, e.g. Z-score > N-Align > RMSD – Optimise weights: determine minimum, average and maximum weights by solving a linear programming problem
  • 40. Total Evidence Consensus • Each similarity matrix must be standardised to [0;1], as different methods produce different qualities and ranges of measures • Integration of the multiple similarity matrices 1S … mS in order to build a consensus similarity matrix C [Figure: matrices 1S … mS with entries 1S11 … mSnn combined into C with entries C11 … Cnn] • The consensus operator determines how the different similarity matrices are weighted and averaged, e.g.:
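The operator itself is not preserved in the transcript; a weighted arithmetic average, matching the Consensus/Average and Consensus/Weighted variants above, is a natural reading. A minimal sketch under that assumption (the min-max standardisation is also an illustrative choice):

```python
import numpy as np

def standardise(S: np.ndarray) -> np.ndarray:
    """Min-max standardisation of a similarity matrix to [0, 1]."""
    return (S - S.min()) / (S.max() - S.min())

def consensus(matrices, weights=None) -> np.ndarray:
    """Weighted average of standardised similarity matrices.

    Equal weights give Consensus/Average; ranked weights give Consensus/Weighted.
    """
    mats = [standardise(S) for S in matrices]
    w = np.ones(len(mats)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # weights sum to 1
    return sum(wi * Si for wi, Si in zip(w, mats))
```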
  • 41. Results www.procksi.org
  • 42. Evaluation of CASP6 Results • Evaluation of CASP6 competition results • Prediction of protein structure against a given target – Evaluation of predictions with similarity comparison methods [Figure: rankings for target T0196 under the CASP evaluation (GDT-TS), ProCKSI's CONSENSUS, and MaxCMO Overlap] • Similarity ranking with different methods – CONSENSUS = unweighted arithmetic average of USM + MaxCMO/Overlap + DaliLite/Z – Comparable results between ProCKSI's CONSENSUS method and the community's gold standard GDT-TS supplemented with expert curation – CONSENSUS detects a better model for target T0196
  • 43. Clustering of Protein Kinases Comparison of a sequence-based classification with structure-based clusterings from single similarity comparison methods and ProCKSI's consensus method • Biological background: – Kinases are enzymes that catalyse the transfer of a phosphate to a protein substrate – They play an essential role in most cellular processes, e.g. cellular differentiation and repair, cell proliferation • Kinases dataset: http://www.nih.go.jp/mirror/Kinases − 45 structures published at the Protein Kinase Resource (PKR) web site • Hanks' and Hunter's (HH) classification as gold standard: – Based on sequence information – HH clusters: mainly 9 different groups (super-families) – Sub-clusters: common features according to the SCOP database • Experiments with 3 different comparison methods (USM, MaxCMO, DaliLite), 3 different contact map thresholds, and 7 different clustering methods (e.g. Ward's, UPGMA)
  • 44. Clustering of Protein Kinases: Single Similarity Measures DaliLite/Z, USM/USM, MaxCMO/Overlap • Best results when clustering with Ward's Minimum Variance method • Each method/measure has its own strengths and flaws Strengths: • Green: classification at the Class level, e.g. α+β/PK-like • Blue: detects similarities down to the Species level, e.g. mice, pigs, cows • Red: produces a mixed bag of proteins that are least similar to those in Blue Flaws: • MaxCMO/Overlap only distinguishes proteins at the Class level • DaliLite/Z wrongly adds protein 1IAN to Green • USM/USM reverses the order of the last two clustering steps (Blue and Green)
  • 45. Clustering of Protein Kinases: Similarity Consensus USM/USM + DaliLite/Z; USM/USM + DaliLite/Z + MaxCMO/Overlap • Exhaustive combination of all available similarity measures Best results: ● Correct clustering with USM/USM + DaliLite/Z, the two compensating for each other's flaws General trends: ● Including similarity measures derived from the number of alignments (e.g. MaxCMO/Align, DaliLite/Align) partially destroys the good clustering outside Green ● Adding noisier measures (e.g. MaxCMO/Overlap) still produces comparably good and robust results
  • 46. Consensus Analysis Comparison of the influence of the combination of different similarity measures on the quality of the consensus method • Rost/Sander dataset: – Designed for secondary structure prediction – Pairwise sequence similarity of less than 25% – 126 globular proteins incl. 18 multi-domain proteins • SCOP classification as gold standard: – Manually curated database containing expert knowledge – Hierarchical classification levels: Class, Fold, Superfamily, Family, Protein, Species • Analyse the performance of each established comparison method against the consensus method using ROC analysis – Compare true positives against false positives – The performance measure is the Area Under the Curve (AUC)
  • 47. Consensus Analysis - Technique ROC = Receiver Operating Characteristic – Technique for comparing the overall performance of different methods / algorithms / tests on the same dataset – Widely employed e.g. in signal detection theory, machine learning, and diagnostic testing in medicine • ROC curves depict the relative trade-off between benefits (True Positives) and costs (False Positives) • Confusion matrix of a binary test (test outcome Y/N vs. true class p/n; column totals P and N):
        true p   true n
    Y     TP       FP
    N     FN       TN
           P        N
    – Hit rate: True Positive rate TPr = TP / P
    – False alarm: False Positive rate FPr = FP / N
  • 48. Consensus Analysis - Technique Important points in ROC space: (0,1): high TPr and low FPr; perfect classification (0,0): never issue positive classifications; useless (1,1): always issue positive classifications; useless {y=x}: randomly guessing a classification; useless ROC curves for methods with continuous output – Not a simple binary (discrete) decision problem (yes/no) – Ranking or scoring output estimates the class membership probability of an instance in [0;1] – Application of a variable threshold in order to produce and validate discrete classifiers – The best method has an uppermost (north-western) curve – The Area Under the Curve (AUC) quantifies the performance
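The threshold sweep described here fits in a few lines. The sketch below is an illustration, not the evaluation code used in this work: it ranks pairs by a continuous similarity score, accumulates TP/FP counts at every cut, and integrates the curve with the trapezoidal rule; labels mark whether a pair truly belongs together (e.g. same SCOP class):

```python
import numpy as np

def roc_points(scores, labels):
    """TPr/FPr points from a descending threshold sweep over continuous scores."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest score first
    y = np.asarray(labels)[order]                          # 1 = truly similar pair, 0 = not
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))                 # hit rate
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (len(y) - y.sum())))  # false alarms
    return fpr, tpr

fpr, tpr = roc_points([0.9, 0.8, 0.6, 0.4, 0.3], [1, 1, 0, 1, 0])
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoidal rule
print(auc)  # 1.0 = perfect ranking, 0.5 = random guessing
```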
  • 49. Consensus Analysis Analysis of SCOP's Class level (as an example for all levels) - RMSD values are not good similarity measures (except for DaliLite) - Best performance with FAST/SN and FAST/Align (Class level), and with CE/Z, DaliLite/Z, and DaliLite/Align (all other levels) - Consensus/All gives a worse AUC value than the best method, but very close to it
  • 50. Consensus Analysis  Results from Comparisons/Singles [Table of ratings/rankings: *** first, ** second, * third]
  • 51. Consensus Analysis  Results from Consensus/Average [Table of ratings/rankings: *** first, ** second, * third]
  • 52. Consensus Analysis Analysis of SCOP's Superfamily level (exemplary for all levels): Consensus/Average-Best3 - Consensus/Average-Best3 gives better AUC values than any of the contributing similarity measures (except at the Protein level) - Further reduction to Consensus/Average-Best2 improved performance only for the Protein and Superfamily levels
  • 53. Distributed Computing Similarity comparison of proteins with multiple methods and large datasets is very time consuming and needs to be parallelised / distributed / gridified – A simple automated scheduling system for job distribution works well on the dedicated ProCKSI cluster (5 dual nodes) – Research on how to bundle jobs including fast/slow methods and small/large datasets ► Optimise the ratio between calculation time and overhead (data transfer time, waiting time, ...) – Generalised scheduler for usage of clusters on the GRID and/or the University of Nottingham's cluster (> 1000 nodes)
  • 54. Problem / Solution Space All-against-all comparison of a dataset of S protein structures using M different similarity comparison methods can be represented as a 3D cube (Structures × Structures × Methods). Heterogeneity: 1. Each structure has a different length, i.e. number of residues 2. Each method has a different execution time, even for the same pair of structures 3. Back-end computational nodes may have different speeds, etc.
  • 55. Possible Strategies (sketched in code below) 1. Comparison of one pair of proteins using one method in the task list => S×S×M jobs, each performing 1 comparison >> far too fine-grained 2. All-against-all comparison of the entire dataset with one method => M jobs, each performing S×S comparisons >> currently running; valid only for |S| < 500 proteins 3. Comparison of one pair of proteins using all methods in the task list => S×S jobs, each performing M comparisons >> slightly different from 1, does not allow intelligent load balancing 4. Intelligent partitioning of the 3D problem space, comparing a subset of proteins with a set/subset of methods >> under investigation
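To make the granularity trade-off concrete, here is a small sketch (purely illustrative; the dataset and method names are hypothetical) enumerating the job lists that strategies 1-3 would generate; strategy 4 would instead tile this pair × method space into balanced blocks:

```python
from itertools import combinations, product

structures = [f"protein_{i}" for i in range(4)]          # hypothetical dataset (S = 4)
methods = ["USM", "MaxCMO", "DaliLite"]                  # task list (M = 3)
pairs = list(combinations(structures, 2))                # all-against-all pairs

jobs_1 = [((p,), (m,)) for p, m in product(pairs, methods)]  # one comparison per job
jobs_2 = [(tuple(pairs), (m,)) for m in methods]             # one method over all pairs
jobs_3 = [((p,), tuple(methods)) for p in pairs]             # all methods for one pair

print(len(jobs_1), len(jobs_2), len(jobs_3))             # 18, 3 and 6 jobs respectively
```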
  • 56. Distributed (grid-enabled) Architecture • p = number of nodes • N1, N2, ..., Np = cluster or grid nodes • The system is able to run both in a parallel environment using the MPI libraries and in a grid computing environment using the MPICH-G2 libraries. • The complexity of the proteins is estimated, and bags of proteins are distributed to the different nodes
  • 57-58. Experimental results: CK34 (results figures)
  • 59-60. Experimental results: RS119 (results figures)
  • 61. Experimental results: overall speed-up. Speed-up = Ts / Tp, where Ts is the sequential execution time and Tp the parallel execution time on p processors. Ideal speed-up = p, where p is the number of processors.
  • 62. Conclusions www.procksi.org
  • 63. Conclusions • ProCKSI is a workbench for protein structure comparison – Implements multiple different similarity comparison methods with different similarity concepts and algorithms – Facilitates the comparison and analysis of large datasets of protein structures through a single, user-friendly interface • ProCKSI is a decision-support system – Integrates many different similarity measures and suggests a consensus similarity profile, taking their strengths and weaknesses into account – The combination of multi-competence similarity comparison measures leads to better results than any single one can provide! • Additional tools: • One of the most tested PDB parsers out there • A very flexible tool for generating contact maps under a variety of definitions and parameters • Flexible contact map visualisation • Tree comparison and visualisation • You can add your own distance matrix
  • 64. Conclusions • ProCKSI keeps expanding: • More methods are being added • If you have a method and want it included, contact us! • More sophisticated data fusion and visualisation are on their way! • Hardware is evolving • ProCKSI is publicly available at: http://www.procksi.net
  • 65. Literature Journal Papers – The ProCKSI Server: a decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information. Daniel Barthel, Jonathan D. Hirst, Jacek Błażewicz, Edmund K. Burke, Natalio Krasnogor. BMC Bioinformatics 2007, 8, 416. – Web and Grid Technologies in Bioinformatics, Computational and Systems Biology: A Review. Azhar A. Shah, Daniel Barthel, Piotr Lukasiak, Jacek Błażewicz, Natalio Krasnogor. Current Bioinformatics 2008, 3, 10-31. Conference Papers – Grid and Distributed Public Computing Schemes for Structural Proteomics: A Short Overview. Azhar A. Shah, Daniel Barthel, Natalio Krasnogor. In Frontiers of High Performance Computing and Networking (ISPA 2007), Lecture Notes in Computer Science 4743, 424-434. Springer-Verlag, Niagara Falls, Canada, August 2007. – Protein Structure Comparison, Clustering and Analysis: An Overview of the ProCKSI Decision Support System. Azhar Ali Shah, Daniel Barthel, Natalio Krasnogor. In Proceedings of the 4th International Symposium on Biotechnology (IBS) and 1st Pakistan-China-Iran International Conference on Biotechnology, Bioengineering and Biophysical Chemistry (ICBBB'07), Jamshoro, Pakistan, November 2007.
  • 66. Acknowledgements