TIS prediction in human cDNAs with high accuracy

Correct identification of the Translation Initiation Start (TIS) in cDNA is an important issue for genome annotation. The aim of this work is to improve upon current methods and provide a prediction with guaranteed performance.

Published in: Health & Medicine, Technology


  1. Translation initiation start prediction in human cDNAs with high accuracy, A. G. Hatzigeorgiou. Paper presentation, Introduction to Bioinformatics. Anaxagoras Fotopoulos | Marina Adamou-Tzani, 21/01/2014
  2. Introduction: The primary objective of this research is to contribute to defining the coding part of a gene. The search is performed in cDNA sequences. Coding regions are surrounded by untranslated regions (UTRs). The focus is on finding the Translation Initiation Start (TIS), which defines the start of the coding region. Complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.
  3. Previous Research: Salzberg, 1997: positional conditional probability matrix. Agarwal and Bafna, 1998a: generalized second-order profiles; implementation of the ribosome scanning model (Kozak, 1996), in which the ribosome first attaches to a specific region at the 5’ end of the mRNA and then scans the sequence for the first ATG. No significant differences were observed between these methods and a weight matrix. The above methods are considered together because of their high rate of false positives.
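The scanning rule on this slide (take the first ATG from the 5’ end) can be sketched in a few lines of Python. This is only an illustration of the rule as stated; the function name `first_atg` and the example sequence are made up here and do not come from the presentation.

```python
def first_atg(mrna):
    """Return the 0-based index of the first ATG from the 5' end, or -1 if none.

    RNA input is accepted as well: U is treated as T before searching.
    """
    return mrna.upper().replace("U", "T").find("ATG")


# Hypothetical sequence: the first ATG starts at index 6.
print(first_atg("GGCACCATGGCTATGTAA"))
```

A predictor based only on this rule ignores the sequence context around the ATG, which is one reason the later slides combine a consensus score with a coding score.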
  4. Previous Research: Pedersen and Nielsen, 1997: ANNs used to recognize the local context and statistical properties around the TIS; large region of analysis, 100 bases before and 100 after the start codon. Salamov et al., 1998: six characteristics are applied to analyze the region around the TIS, including a weight matrix and hexanucleotide difference. Zien et al., 2000: Support Vector Machines (SVMs) for TIS prediction. All of the above methods give up to 85% correct predictions.
  5. Methods – Suggested Model (flow diagram): Swissprot, 475 cDNAs (verified + checked), split into a training gene pool (training set + evaluation set, used for parameter estimation) and a test gene pool (test set, used for TIS prediction). Two branches: conserved motif, handled by the consensus NN; coding/non-coding potential, handled by the coding NN. The two NN scores are combined by score multiplication.
  6. Consensus Neural Network: 325 positive + 325 negative examples; 12-nucleotide-long window; binarization of the input; selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm.
  7. Coding Neural Network: 54-nucleotide window; 12-nucleotide-long window. The Smith-Waterman algorithm is used to eliminate homologies between training and test data; 282 genes with less than 70% homology were used for training. Codon usage is applied: for every window, all non-overlapping codons are counted. The sequence window is rescaled to 64 units; every unit gives the normalized frequency of the codon in the window. Sequence regions extracted for training: 700 positive + 700 negative. Sequence regions extracted for testing: 250 positive + 250 negative. The resilient backpropagation algorithm is applied to a feed-forward NN.
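The rescaling of a window to 64 units can be sketched as follows. The sketch assumes that "normalized frequency" means the count of each codon divided by the number of codons counted, and that codons are read from the first position of the window; the slide confirms neither detail. `CODONS` and `codon_frequencies` are illustrative names.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order, one per input unit.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(window):
    """Count non-overlapping codons in the window and normalize to frequencies."""
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3].upper()
        if codon in counts:  # skip codons containing unknown symbols
            counts[codon] += 1
            total += 1
    return [counts[c] / total if total else 0.0 for c in CODONS]


# Hypothetical 54-nucleotide window made of 9 ATG and 9 GCC codons.
units = codon_frequencies("ATGGCC" * 9)
print(len(units), units[CODONS.index("ATG")])
```

A 54-nucleotide window holds 18 non-overlapping codons, so most of the 64 units are zero for any single window.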
  8. Integrated method, analysis of full-length mRNA sequences: 1st stage: a coding score is calculated for every nucleotide of the mRNA sequence. 2nd stage: the coding evidence of the coding region included in the longest ORF of the sequence is calculated. 3rd stage: for every in-frame ATG, a consensus score is calculated. 4th stage: for the same in-frame ATG, a coding difference score is calculated. The final score is obtained by combining the output of the consensus ANN and the coding difference.
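The candidate positions scored in the 3rd and 4th stages are the in-frame ATGs of the longest ORF. A minimal sketch of collecting them is given below. It assumes an ORF runs from an ATG to the next in-frame stop codon and ignores ORFs that reach the end of the sequence without a stop; the presentation does not define the ORF this precisely. The function name and the test sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def in_frame_atgs_of_longest_orf(seq):
    """Return 0-based positions of all in-frame ATGs inside the longest ORF."""
    seq = seq.upper()
    best, best_len = [], 0
    for frame in range(3):
        atgs = []  # in-frame ATGs seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG":
                atgs.append(i)
            elif codon in STOPS:
                # ORF length is measured from its first ATG through the stop codon.
                if atgs and i + 3 - atgs[0] > best_len:
                    best, best_len = atgs, i + 3 - atgs[0]
                atgs = []
    return best


# Hypothetical sequence: CC ATG AAA ATG CCC TAG GG
print(in_frame_atgs_of_longest_orf("CCATGAAAATGCCCTAGGG"))
```

Each returned position would then receive a consensus score and a coding difference score.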
  9. Integrated method, analysis of full-length mRNA sequences: This method provides only one prediction for every ORF. On the test group, 94% of the TIS were correctly predicted and 6% of the predictions were false positives. A Las Vegas algorithm provides a correct prediction in some cases and has a “no answer” option in the remaining cases; that is, it either produces the correct result or reports the failure. Incorporating this algorithm gives a confident decision and leads to highly accurate recognition of the TIS in human cDNAs for 60% of the cases.
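The "no answer" option can be illustrated with a simple score threshold. The slides do not state the actual decision rule, so the rule, the threshold value of 0.5, the function name `confident_tis`, and the example scores are all assumptions made for this sketch.

```python
def confident_tis(scored_atgs, threshold=0.5):
    """Return the best-scoring ATG position, or None when no answer is given.

    scored_atgs: list of (position, final_score) pairs for one ORF.
    threshold:   hypothetical cut-off below which the predictor abstains.
    """
    if not scored_atgs:
        return None
    position, score = max(scored_atgs, key=lambda item: item[1])
    return position if score >= threshold else None


# Hypothetical scores: one clear winner, then a case with no confident answer.
print(confident_tis([(120, 0.81), (300, 0.12)]))
print(confident_tis([(120, 0.22), (300, 0.18)]))
```

Raising the threshold trades coverage for reliability: fewer sequences get an answer, but the answers that are given are more likely to be correct.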
  10. Results – Score Combination 1/3: Nucleotide 255: cod 0.98, local 0.2. The combination of the coding ANN and consensus ANN scores gives a low final score. Cod line: score of the coding ANN. Local line: score and position of the consensus ANN for all ATGs in the coding frame.
  11. Results – Score Combination 2/3: Nucleotide 270: cod 0.44, local 0.4. The combination of the coding ANN and consensus ANN scores gives a low final score.
  12. Results – Score Combination 3/3: Correct TIS, nucleotide 148: cod 0.95, local 0.8. The combination of the coding ANN and consensus ANN scores gives a high final score.
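The three examples can be worked through numerically. The sketch below assumes the final score is the plain product of the coding and consensus scores, in line with the "score multiplication" step of the suggested model; the exact combination used by the authors may differ. The cod and local values are the ones quoted on the slides.

```python
# (coding ANN score, consensus ANN score) per candidate nucleotide position.
candidates = {
    255: (0.98, 0.2),
    270: (0.44, 0.4),
    148: (0.95, 0.8),
}


def final_score(coding, consensus):
    """Assumed combination rule: multiply the two scores."""
    return coding * consensus


scores = {pos: final_score(*pair) for pos, pair in candidates.items()}
best = max(scores, key=scores.get)

for pos, score in scores.items():
    print(pos, round(score, 3))
print("highest:", best)
```

Under this assumption a high coding score alone (0.98 at nucleotide 255) is not enough: the weak consensus score pulls the product down, while position 148 scores well on both.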
  13. Results – Methods Comparison: Correct TIS positions
  14. Results – Methods Comparison: Prediction for the 3 TIS positions with the highest scores
  15. Results – Methods Comparison: Consensus motif scores (only for DIANA-TIS)
  16. Results – Methods Comparison: Final scores
  17. Results – Methods Comparison: Correct predictions
  18. Results – Methods Comparison, prediction analysis: high prediction score difference; TIS correct position: 471; did not find TIS; found TIS but another, higher score exists.
  19. Results – Methods Comparison: Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences. Correct TIS positions
  20. Results – Methods Comparison: Length of signal peptide
  21. Results – Methods Comparison: Prediction for the 2 TIS positions with the highest scores
  22. Results – Methods Comparison: Consensus motif scores (only for DIANA-TIS)
  23. Results – Methods Comparison: Final scores
  24. Results – Methods Comparison, prediction example #1: DIANA-TIS distinguishes between the TIS and other ATGs better than other ANN-based programs such as NetStart. Two suitable ATGs are 12 nucleotides apart; the coding/non-coding information is similar; the consensus motif is completely different.
  25. Results – Methods Comparison, prediction example #2: A favorable prediction is not obtained for all examples. The consensus motif is completely different; the combined score is much lower. In some signal peptide sequences the coding potential score is relatively low and can thus affect the combined score.
  26. Results – Methods Comparison (TIS prediction program: TIS prediction rate): DIANA-TIS (2001): 94%; Agarwal & Bafna (1998): 85%; ATGPred (Salamov et al., 1998): 79%; NetStart (Pedersen & Nielsen, 1997): 78%. These methods allow more than one prediction per gene. Notice: the results come from different datasets, so these numbers should not be directly compared.
  27. Thank you! Introduction to Bioinformatics, Information Technologies in Medicine and Biology. National & Kapodistrian University of Athens, Department of Informatics; Biomedical Research Foundation, Academy of Athens; Technological Education Institute of Athens, Department of Biomedical Engineering; Demokritos National Center for Scientific Research.
