Milko stat seq_toulouse

  • Swap the last two?
  • Anything additional?
  • Definitive title!
  • One more additional slide?

    1. Detection and correction of errors in metagenomic 16S RNA parallel sequencing
       Milko Krachunov (2), Ivan Popov (1), Valeria Simeonova (2), Irena Avdjieva (1), Paweł Szczęsny (3), Urszula Zelenkiewicz (3), Piotr Zelenkiewicz (3), Dimitar Vassilev (1)
       (1) Bioinformatics group, AgroBioInstitute, Bulgaria
       (2) Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Bulgaria
       (3) Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
    2. NGS errors – common problems
       • Errors are introduced into the assembled reads by imperfections of both biological and mathematical origin.
       • In metagenomic studies the same sample cannot be sequenced again.
       • The error rate tends to increase at every step of the process.
       • There is no easy way to tell a “sequencing error” from a “rare variant”.
       • Many existing methods and algorithms address different aspects of the problem, but no unified solution is available.
       • Large amounts of data are difficult to process with common software.
    3. Significance of 16S RNA sequencing
       • Highly conserved between different species of bacteria and archaea.
       • Sequence analysis is done with universal PCR primers.
       • Contains hypervariable regions that can provide species-specific signature sequences.
       • Suitable for phylogenetic studies.
       • Suitable for metagenomic studies.
    4. General approach in metagenomic biodiversity studies
       454 sequencing → Filtering / denoising → Multiple alignment → Distance matrix → OTU clusters with abundance count
    5. Our approach:
    6. A. Raw data characteristics and processing
       Two separate runs of metagenomic 16S RNA fragments, sequenced on the 454 platform and converted to FASTA format:
       • run 02: 46429 short reads
       • run 04: 41386 short reads
       Our task: extract, denoise and correct only the quality reads.
    7. Raw data length histograms (panels: run 02, run 04)
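The length histograms on slide 7 are not preserved in this transcript. As a rough illustration only, the sketch below shows one way read lengths could be tabulated from a FASTA file in Python. The function names, the bin width and the toy input are assumptions made for this example and are not taken from the authors' pipeline.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(handle, bin_width=50):
    """Count reads per length bin; each key is the lower edge of a bin."""
    counts = Counter()
    for _, seq in read_fasta(handle):
        counts[(len(seq) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Toy input standing in for a real run file (hypothetical data).
example = StringIO(">r1\nACGTACGT\n>r2\nACG\nTT\n>r3\nACGTACGTACGT\n")
hist = length_histogram(example, bin_width=5)
print(hist)
```

For a real run, the `StringIO` object would be replaced by an opened FASTA file, and the resulting dictionary could be passed to any plotting tool.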
    8. B. Correction with SHREC
    9. C. Correction with our method:
    10. Classification and performance evaluation
       ClaMS parameters:
       • Distance cut-off: 0.05
       • Signature type: DBC
       • k-mer length: 3
       • Existing taxonomy: 4th level
    11. Aim of the method – idea outline
       • To deal with the heterogeneous nature of the data, similar or related sequences are given more importance in the error evaluation.
       • The naïve approach: if a base is less common than the sequencer error rate, assume it is likely an error and replace it with the most common base.
       • Our modification: calculate the occurrence of the base in reads that are similar in the given region, either by assigning those reads bigger weights or by using them exclusively.
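The naïve approach from slide 11 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the reads are assumed to be already aligned to equal length, frequencies are taken per alignment column, and the function name and the `error_rate` value are hypothetical.

```python
from collections import Counter


def naive_correct(aligned_reads, error_rate=0.01):
    """Replace bases rarer than error_rate in their column with the column's most common base."""
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("reads must be aligned to equal length")
    n = len(aligned_reads)
    corrected = [list(r) for r in aligned_reads]
    for col in range(length):
        # Column frequencies are computed from the original, uncorrected reads.
        counts = Counter(r[col] for r in aligned_reads)
        consensus = counts.most_common(1)[0][0]
        for row in corrected:
            if counts[row[col]] / n < error_rate:
                row[col] = consensus
    return ["".join(row) for row in corrected]


# Toy alignment: one read carries a T where nine others carry a G.
reads = ["ACGT"] * 9 + ["ACTT"]
fixed = naive_correct(reads, error_rate=0.2)
print(fixed[-1])
```

Because every read in the sample counts equally here, a genuine rare variant from a low-abundance organism would be overwritten just like a sequencing error, which is the weakness the modification on the slide is meant to address.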
    12. Progress so far
       • Calculate the occurrence rates of every base in reads that are identical to the evaluated read in a window with a radius of n bases.
       • Preliminary results: the first basic implementation leads to an increase in the number of OTUs found with ClaMS.
       • Under development:
         • Good choice(s) of approach for alignment of the reads
         • Empirical evaluation of the parameters
         • Comparative evaluation of the variants of the approach
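The windowed occurrence rates from slide 12 might look roughly like the Python sketch below. It is a guess at the idea rather than the authors' code: the reads are assumed to be aligned to equal length, the evaluated position itself is assumed to be excluded from the window comparison (otherwise the evaluated base would trivially have a rate of 1), and the function names and `radius` value are hypothetical.

```python
from collections import Counter


def window_matches(a, b, pos, radius):
    """True if reads a and b agree within radius of pos, ignoring pos itself."""
    lo, hi = max(0, pos - radius), min(len(a), pos + radius + 1)
    return all(a[i] == b[i] for i in range(lo, hi) if i != pos)


def local_base_rates(aligned_reads, read_index, pos, radius=3):
    """Occurrence rate of each base at pos among reads matching the evaluated read around pos."""
    target = aligned_reads[read_index]
    counts = Counter(
        r[pos] for r in aligned_reads if window_matches(target, r, pos, radius)
    )
    # The evaluated read always matches itself, so total is at least 1.
    total = sum(counts.values())
    return {base: c / total for base, c in counts.items()}


# Toy alignment: read 4 shares its neighbourhood with the first four reads
# but carries a C at position 3, as do five unrelated reads.
reads = ["AACGTTA"] * 4 + ["AACCTTA"] + ["GGCCTGG"] * 5
rates = local_base_rates(reads, read_index=4, pos=3, radius=2)
print(rates)
```

In this toy input the C at position 3 is the majority base over all ten reads, yet among the reads that match the evaluated read around that position it is the minority. That is the kind of case where restricting the count to similar reads changes the verdict.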
    13. Software used in this project:
       • Python: http://www.python.org/
       • Cython: http://cython.org/
       • MEGA (Molecular Evolutionary Genetics Analysis): http://www.megasoftware.net/
       • Muscle: http://www.drive5.com/muscle/
       • SHREC (SHort Read Error Correction method): http://ww2.cs.mu.oz.au/~schroder/shrec_www/
       • ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/
       • NINJA (modified): http://nimbletwist.com/software/ninja/index.html
       • R-package: http://www.r-project.org/
    14. Thank you
       milko@3mhz.net
