Bio process


Published in: Technology

  2. Protein Classification: A comparison of function inference techniques
  3. Why do we need automated classification? Sequencing a genome is only the first step. Between 35% and 50% of the proteins in sequenced genomes have no assigned function. Direct observation of function is costly, time-consuming, and difficult.
  4. Protein Domains: The tertiary structure of many proteins is built from several domains. Often each domain has a separate function to perform for the protein, such as:
     • binding a small ligand (e.g., a peptide in the molecule shown here)
     • spanning the plasma membrane (transmembrane proteins)
     • containing the catalytic site (enzymes)
     • DNA-binding (in transcription factors)
     • providing a surface to bind specifically to another protein
     In some (but not all) cases, each domain in a protein is encoded by a separate exon in the gene encoding that protein.
  5. Inference through sequence similarity: "ProtoMap: Automatic Classification of Protein Sequences, a Hierarchy of Protein Families, and Local Maps of the Protein Space" (1999)
  6. Final Goal
  7. Observations: Sometimes you don’t know where the domains are. It is generally accepted that two sequences with over 30% identity are likely to have the same fold. Homologous proteins have similar functions. Homology is a transitive relationship.
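The 30% identity rule of thumb on the Observations slide can be made concrete with a minimal sketch. It assumes the two sequences are already aligned (equal length, `-` for gaps) and that identity is counted over columns where neither sequence has a gap; other definitions of identity exist. The names `percent_identity` and `likely_same_fold` are hypothetical and do not come from the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap (an assumption)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def likely_same_fold(aligned_a: str, aligned_b: str, cutoff: float = 30.0) -> bool:
    """Apply the 'over 30% identity' heuristic; the cutoff is only a rule of thumb."""
    return percent_identity(aligned_a, aligned_b) > cutoff
```

The cutoff is left as a parameter because the slide presents 30% as a generally accepted guideline rather than a hard boundary.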
  8. Departures: The authors do not attempt to define protein domains or motifs. The method does not depend on predefined groups or classifications. It charts the space of all proteins in SWISSPROT, as opposed to individual families, and produces a global organization of sequences.
  9. Algorithm Overview: Construct a weighted graph in which the nodes are protein sequences and the edges are weighted by similarity scores. Cluster the network considering only those edges above some similarity threshold. Decrease the similarity threshold and repeat.
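One clustering pass from the overview above can be sketched as follows. This is a simplification under stated assumptions: similarity is expressed as an E-value (so a stronger similarity means a lower value, and "above the similarity threshold" becomes "E below the cutoff"), and clusters are taken to be plain connected components, which stands in for whatever merging rule the ProtoMap authors actually use. The function name and the example proteins `P1` to `P5` are hypothetical.

```python
from collections import defaultdict


def cluster_at_threshold(e_values, threshold):
    """Group sequences into connected components, keeping only edges
    whose E-value is below `threshold`.

    e_values maps (seq_a, seq_b) pairs to an E-value (lower = more similar).
    """
    neighbors = defaultdict(set)
    nodes = set()
    for (a, b), e in e_values.items():
        nodes.update((a, b))
        if e < threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters = []
    seen = set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk over the retained edges.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters


# Made-up E-values for illustration only.
example = {
    ("P1", "P2"): 1e-120,
    ("P2", "P3"): 1e-40,
    ("P4", "P5"): 1e-110,
    ("P3", "P4"): 1e-2,
}
strict = cluster_at_threshold(example, 1e-100)
relaxed = cluster_at_threshold(example, 1e-30)
```

With the strict cutoff, `P3` stays on its own; relaxing the cutoff lets it join `P1` and `P2` through its weaker link to `P2`.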
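  10. Measuring Sequence Similarity: The expectation value (E) is used. This is the normalized measure of how likely the similarity is to occur at random: the expected number of chance matches scoring at least this well in a search space of size $N$. A lower value implies a logarithmically stronger similarity. With raw score $S$ and normalized (bit) score $S'$: $S' = \frac{\lambda S - \ln K}{\ln 2}$, $E = \frac{N}{2^{S'}}$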
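The two formulas on this slide translate directly into code. The sketch below assumes `lam` and `k` are the statistical parameters λ and K of the scoring system and `search_space` is N; the numeric values in the usage lines are illustrative assumptions, not figures from the slides.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expectation_value(raw_score: float, search_space: float,
                      lam: float, k: float) -> float:
    """E = N / 2^S', the expected number of chance hits at this score."""
    return search_space / 2 ** bit_score(raw_score, lam, k)


# Illustrative parameters only (assumed, not taken from the slides).
lam, k, n = 0.267, 0.041, 1e9
e_low = expectation_value(100, n, lam, k)
e_high = expectation_value(200, n, lam, k)
```

Substituting the first formula into the second gives the equivalent form $E = K N e^{-\lambda S}$, which makes the exponential drop in E with rising raw score explicit.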
  11. BLOSUM62 Scoring Matrix
  12. Finding Homologies: It is very difficult to identify a clear threshold between homology and chance similarity. The authors chose E = 0.1, 0.1, and 0.001 for SW (Smith-Waterman), FASTA, and BLAST, respectively. They spent a lot of time empirically determining these thresholds.
  13. Clustering: Clustering is done iteratively. Start with a threshold of $E < 10^{-100}$. Cluster and increase the threshold by a factor of $10^{5}$. The sublinear threshold prevents the collapse of sequence space.
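The iterative schedule can be sketched with a union-find structure that only ever merges clusters as the E-value threshold is relaxed. Two assumptions are worth flagging: the stopping point of E = 1 is a guess (the slide gives only the start and the step), and merging by any single qualifying edge is a stand-in for the authors' merging criterion, which the slides do not describe. All function names and example proteins are hypothetical.

```python
def threshold_schedule(start_exp=-100, step_exp=5, stop_exp=0):
    """E-value thresholds 1e-100, 1e-95, ... up to 10**stop_exp (assumed stop)."""
    return [10.0 ** e for e in range(start_exp, stop_exp + 1, step_exp)]


def iterative_clusters(e_values, thresholds):
    """Return (threshold, clusters) for each level, relaxing the threshold
    step by step and merging with union-find.

    e_values maps (seq_a, seq_b) pairs to an E-value (lower = more similar).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in e_values:
        find(a)
        find(b)

    # Strongest similarities (lowest E-values) are consumed first.
    edges = sorted(e_values.items(), key=lambda item: item[1])
    levels = []
    i = 0
    for t in thresholds:
        while i < len(edges) and edges[i][1] < t:
            (a, b), _ = edges[i]
            parent[find(a)] = find(b)
            i += 1
        groups = {}
        for node in parent:
            groups.setdefault(find(node), set()).add(node)
        levels.append((t, sorted(sorted(g) for g in groups.values())))
    return levels


# Made-up E-values for illustration only.
example = {
    ("P1", "P2"): 1e-120,
    ("P2", "P3"): 1e-40,
    ("P4", "P5"): 1e-110,
    ("P3", "P4"): 1e-2,
}
levels = iterative_clusters(example, [1e-100, 1e-30, 1.0])
```

Because clusters formed at a strict threshold are never split later, the sequence of levels forms a hierarchy, which matches the "hierarchy of protein families" in the ProtoMap title.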
  14. ProtoMap Results: Produces well-defined groups that correlate strongly with protein families in PROSITE and Pfam.
  15. Results: Immunoglobulin Superfamily
  16. ProtoMap Limitations: The analysis performs poorly on families dominated by short or local domains (PH, EGF, ER_TARGET, C2, SH2, SH3, etc.). High-scoring, low-complexity segments can lead to nonhomogeneous clusters. “Hard” clustering vs. “soft” clustering. Has difficulty classifying multidomain proteins.
  17. ProtoMap Future Directions: 3D structure/fold, biological function, domain content, cellular location, tissue specificity, source organism, metabolic pathways.
  18. Inference through protein interaction networks: "Functional Classification of Proteins for the Prediction of Cellular Function from a Protein-Protein Interaction Network" (2003)
  19. PRODISTIN
     • Very similar to ProtoMap, only the data used to produce the graph is a list of binary protein-protein interactions instead of sequence similarity scores
     • Sequence similarity is not a dominating factor in PRODISTIN clusters
  20. PRODISTIN Results
  21. Problems with PRODISTIN
     • Paucity of protein-protein interaction data (average number of connections = 2.6)
     • Either very robust or very indiscriminate
  22. Problems: Multidomain and Nonlocal Proteins
     • protein kinases
     • hydrolases
     • ubiquitin…
     PRODISTIN: These present problems in clustering by biochemical function.
     ProtoMap: These can create undesired connections among unrelated groups.
  23. Scale-Free Networks
     • Node connection probability follows a power-law distribution
     • Maximum degree of separation grows as O(lg n)
     • Highly robust under noise, except at hubs and superhubs
     $P(\text{linking to node } i) \sim \frac{k_i}{\sum_j k_j}$
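The linking rule on this slide, in which a node attracts new links in proportion to its current degree $k_i$, can be simulated with a small growth model. This is a minimal sketch under assumptions not stated on the slide: the network starts from two connected nodes, each new node adds exactly one link, and the function name and seed are arbitrary.

```python
import random


def preferential_attachment(n_nodes: int, seed: int = 0):
    """Grow a network in which each new node links to an existing node i
    with probability k_i / sum_j k_j. Requires n_nodes >= 2."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes to seed the network")
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}
    edges = [(0, 1)]
    # Each node appears once per edge endpoint, so a uniform draw from
    # this list selects node i with probability k_i / sum_j k_j.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        degree[new] = 1
        degree[target] += 1
        endpoints.extend((new, target))
    return degree, edges


degree, edges = preferential_attachment(200)
hubs = sorted(degree, key=degree.get, reverse=True)[:5]  # most connected nodes
```

Keeping the `endpoints` list avoids recomputing the sum of degrees at every step; it trades memory for a constant-time weighted draw.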
  24. The Internet
  25. Social Networks
  26. Metabolic Networks
     • The E. coli metabolic network is scale-free.
     • In fact, the metabolic networks of all 43 organisms examined, spanning all three domains of life, appear to be scale-free.
     • The network diameter of all 43 metabolic networks is the same, irrespective of the number of proteins involved.
     • Is this counter-intuitive? Yes.
  27. Protein Domain Networks
     • Protein domains: nature’s take on writing modular code
     • Reconciles the apparent paradox of a fixed network diameter across species despite vast differences in complexity (some human proteins have 130 domains)
     • The occurrence of specific protein domains in multidomain proteins is scale-free.
  28. Protein Domain Graphs
     • Prosite domains have a distribution following the power-law function $f(x) = a(b + x)^{-c}$, with $c = 0.89$. There are few highly connected domains and many rarely connected ones.
     • ProDom and Pfam domains follow the power function $P(k) \approx k^{-\gamma}$, with $\gamma = 2.5$ for ProDom and $\gamma = 1.7$ for Pfam.
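An exponent such as γ can be estimated from a degree distribution by fitting a straight line on log-log axes, since $P(k) \approx k^{-\gamma}$ implies $\ln P(k) \approx -\gamma \ln k + \text{const}$. The sketch below uses ordinary least squares, which is simple but can be biased on noisy data; it is not necessarily the fitting procedure used in the studies the slide summarizes. The function name is hypothetical and the input data are synthetic.

```python
import math


def fit_power_law_exponent(degrees, counts):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares on
    (ln k, ln count). All degrees and counts must be positive."""
    xs = [math.log(k) for k in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return -slope  # the slope of the log-log line is -gamma


# Synthetic, noise-free data generated with gamma = 2.5 (not real domain counts).
ks = list(range(1, 51))
synthetic_counts = [1000.0 * k ** -2.5 for k in ks]
gamma_hat = fit_power_law_exponent(ks, synthetic_counts)
```

On noise-free input the fit recovers the generating exponent; with real domain counts the sparse high-degree tail would dominate the error.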
  29. Hub Domains in Signaling Pathways
  30. Conclusions
     • The accuracy of both ProtoMap and PRODISTIN is limited because they make the tacit assumption of a random network topology.
     • Protein-protein interaction networks have scale-free topology, foiling PRODISTIN.
     • Protein domain networks have scale-free topology, foiling ProtoMap.
     • Any protein classification algorithm that performs better than ProtoMap is probably going to have to address this issue.