Milko stat seq_toulouse


Published in: Technology
  • Swap the last two?
  • Anything additional?
  • Definitive title!
  • One more additional slide?

    1. Milko Krachunov (2), Ivan Popov (1), Valeria Simeonova (2), Irena Avdjieva (1), Paweł Szczęsny (3), Urszula Zelenkiewicz (3), Piotr Zelenkiewicz (3), Dimitar Vassilev (1)
       (1) Bioinformatics group, AgroBioInstitute, Bulgaria
       (2) Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Bulgaria
       (3) Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
       Detection and correction of errors in metagenomic 16S RNA parallel sequencing
    2. NGS errors: common problems
         • Errors are introduced into the assembled reads by imperfections of both biological and mathematical origin.
         • In metagenomic studies the same sample cannot be sequenced again.
         • The error rate tends to increase at every step of the process.
         • There is no easy way to differentiate between a “sequencing error” and a “rare variant”.
         • Many existing methods and algorithms address different aspects of the problem, but no unified solution is available.
         • Large amounts of data are difficult to process with common software.
    3. Significance of 16S RNA sequencing
         • Highly conserved between different species of bacteria and archaea
         • Sequence analysis is done with universal PCR primers
         • Contains hypervariable regions that can provide species-specific signature sequences
         • Suitable for phylogenetic studies
         • Suitable for metagenomic studies
    4. General approach in metagenomic biodiversity studies
       454 sequencing → Filtering / Denoising → Multiple alignment → Distance matrix → OTU clusters with abundance count
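The last two stages of the pipeline on slide 4 (distance matrix, then OTU clusters with abundance counts) can be illustrated with a minimal sketch. It is not the clustering used in the project: it assumes reads that are already aligned to equal length, uses the plain proportion of differing positions as the distance, and groups reads greedily around seed sequences. The function names and the `cutoff` value are hypothetical.

```python
def p_distance(a, b):
    """Fraction of positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        raise ValueError("sequences must not be empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(aligned, cutoff=0.03):
    """Greedy seed-based clustering (an illustrative stand-in, not the slides' method).

    Each sequence joins the first cluster whose seed lies within `cutoff`;
    otherwise it becomes the seed of a new cluster.
    Returns a list of (seed, abundance) pairs.
    """
    seeds, counts = [], []
    for seq in aligned:
        for i, seed in enumerate(seeds):
            if p_distance(seq, seed) <= cutoff:
                counts[i] += 1
                break
        else:
            # No existing seed was close enough: start a new OTU.
            seeds.append(seq)
            counts.append(1)
    return list(zip(seeds, counts))


# Made-up aligned reads, for illustration only.
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]
otus = greedy_otus(reads, cutoff=0.1)
print(otus)
```

Greedy seeding depends on input order; hierarchical clustering on the full distance matrix avoids that at a higher cost.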
    5. Our approach:
    6. A. Raw data characteristics and processing
       Two separate runs of metagenomic 16S RNA fragments, sequenced on the 454 platform and converted to FASTA format:
         • run 02: 46429 short reads
         • run 04: 41386 short reads
       Our task: extract, denoise and correct only the quality reads.
    7. Raw data length histogram (panels: Run 02, Run 04)
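A read-length histogram like the one on slide 7 could be computed from a FASTA file roughly as follows. This is a sketch under stated assumptions: the parser is a small hand-written one rather than the tool the authors used, `bin_width` is an arbitrary choice, and the input is a made-up three-read example instead of the actual run files.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines per record.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(records, bin_width=50):
    """Count reads per length bin; each key is the lower edge of a bin."""
    counts = Counter()
    for _, seq in records:
        counts[(len(seq) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Tiny made-up input standing in for a real run file.
example = StringIO(">r1\nACGTACGT\n>r2\nACG\nTTT\n>r3\nACGTACGTACGT\n")
hist = length_histogram(read_fasta(example), bin_width=5)
print(hist)
```

For a real run, the `StringIO` object would be replaced by an opened file handle.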
    8. B. Correction with SHREC
    9. C. Correction with our method:
    10. Classification and performance evaluation
        ClaMS parameters:
          • Distance cut-off: 0.05
          • Signature type: DBC
          • k-mer length: 3
          • Existing taxonomy: 4th level
    11. Aim of the method: idea outline
        To deal with the heterogeneous nature of the data, similar or related sequences are given more importance in the error evaluation.
          • The naïve approach: if a base occurs less often than the sequencer error rate, assume it is likely an error and replace it with the most common base.
          • Our modification: calculate the occurrence of the base in reads that are similar in the given region, either assigning those reads bigger weights or using them exclusively.
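The naïve approach from slide 11 can be sketched in a few lines. This is an illustration, not the authors' implementation: it assumes the reads are already aligned to equal length (gap characters are treated like any other symbol), and the function name and the `error_rate` values are hypothetical.

```python
from collections import Counter


def naive_correct(aligned_reads, error_rate=0.01):
    """Replace any base rarer than `error_rate` in its alignment column
    with that column's most common base."""
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("reads must be aligned to equal length")
    n = len(aligned_reads)
    corrected = [list(r) for r in aligned_reads]
    for col in range(length):
        # Base frequencies are taken over all reads in this column.
        counts = Counter(r[col] for r in aligned_reads)
        consensus = counts.most_common(1)[0][0]
        for row in corrected:
            if counts[row[col]] / n < error_rate:
                row[col] = consensus
    return ["".join(r) for r in corrected]


# Made-up data: nine identical reads and one read with a single deviating base.
reads = ["ACGT"] * 9 + ["ACTT"]
fixed = naive_correct(reads, error_rate=0.2)
print(fixed[-1])
```

Because the frequencies are taken over all reads, a genuine variant carried by a rare organism is overwritten just like a sequencing error, which is the limitation the modification on the slide is meant to address.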
    12. Progress so far
        Calculate the occurrence rates of every base in reads that are identical to the evaluated read in a window with a radius of n bases.
        Preliminary results:
          • The first basic implementation leads to an increase in the number of OTUs found with ClaMS.
        Under development:
          • Good choice(s) of approach for alignment of the reads
          • Empirical evaluation of the parameters
          • Comparative evaluation of the variants of the approach
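The window-based calculation described on slide 12 might look like the following. Several details are assumptions rather than facts from the slides: the reads are taken to be aligned to equal length, the evaluated position itself is excluded when comparing windows (otherwise every base would trivially agree with itself), and the names `windowed_correct`, `radius` and `error_rate` are hypothetical.

```python
from collections import Counter


def windowed_correct(aligned_reads, radius=2, error_rate=0.1):
    """Judge each base only against reads that are identical to the evaluated
    read in the surrounding window, ignoring the evaluated position itself."""
    corrected = []
    for read in aligned_reads:
        out = list(read)
        for pos in range(len(read)):
            lo = max(0, pos - radius)
            hi = min(len(read), pos + radius + 1)
            left, right = read[lo:pos], read[pos + 1:hi]
            # Reads sharing both flanks; always contains the read itself.
            similar = [r for r in aligned_reads
                       if r[lo:pos] == left and r[pos + 1:hi] == right]
            counts = Counter(r[pos] for r in similar)
            if counts[read[pos]] / len(similar) < error_rate:
                out[pos] = counts.most_common(1)[0][0]
        corrected.append("".join(out))
    return corrected


# Made-up data: two unrelated groups of reads; one read in the first group
# carries a base (A at index 3) that is common overall but rare in its group.
reads = ["ACGTACG"] * 9 + ["ACGAACG"] + ["TTCAGGA"] * 10
fixed = windowed_correct(reads, radius=2, error_rate=0.2)
print(fixed[9])
```

In this example the deviating base is the majority base of its column across all reads, so a column-wide frequency test would leave it untouched, while the window restriction flags it. The sketch compares every read with every other read at every position, so it is quadratic in the number of reads and only suitable for small inputs.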
    13. Software used in this project:
          • Python
          • Cython
          • MEGA (Molecular Evolutionary Genetics Analysis)
          • Muscle
          • SHREC (SHort Read Error Correction method)
          • ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-
          • NINJA (modified)
          • R-package
    14. Thank you