Blue - Paul Greenfield


Bioinformatics tools can reasonably be expected to produce more accurate results when given better quality sequence data as their starting point. This expectation has led to the development of tools whose sole purpose is to detect and correct errors in sequencing data. A good error-correcting tool will simply take sequence data in any of the standard formats and produce a higher quality version of the same data containing far fewer errors. It should be able to correct all of the types of errors found in real sequence data (substitutions, insertions, deletions and uncalled bases), and be fast enough and scalable enough to be practical to use on the very large datasets being produced by current sequencing technologies. Blue was designed with these goals in mind.


  1. Blue: Fast, accurate read correction using consensus and context

Paul Greenfield (CCI), Konsta Duesing (CAFHS), Alexie Papanicolaou (CES) and Denis C. Bauer (CCI)
CSIRO COMPUTATIONAL INFORMATICS

Error correction algorithms and tools

All error correction algorithms follow the same basic pattern: first find a likely error within a read (or a part of a read), then find the best ‘fix’ for that error. They can work with variable-length sub-reads (using suffix trees), fixed-length k-mers, or multiple alignments between reads. All are constrained by knowing nothing about what is ‘correct’ apart from what can be found in the reads themselves. Most published error correction tools reduce the complexity of the task by detecting and fixing only substitution errors, and some even discard reads containing uncalled bases.

Performance and accuracy

The performance of Blue was measured by comparing the time and memory needed to correct two E. coli K12 MG1655 sequencing datasets used in previous papers. ERA000206 consists of 28M Illumina paired 100-mers and ERR022075 of 45M such reads.
Blue

Blue uses a fast ‘k-mer consensus’ algorithm and chooses among multiple plausible fixes by considering the impact each one would have on the rest of the read (the ‘context’). The metrics used in making this choice are gathered by a depth-first, error-limited exploration of the tree of possible corrected reads.

Blue can correct all types of sequencing errors, making it suitable for use on 454 and IonTorrent data as well as Illumina data. It also accepts (and writes) both FASTA and FASTQ data formats, maintains read pairings and repairs qual values.

Blue is being used in a number of CSIRO projects, both prokaryote and eukaryote, and both genomic and metagenomic, and has been shown to greatly improve assembly results.

Dataset    Tool     Elapsed (mins)  Processor (mins)  Memory used (GB)  Threads used  RPM (elapsed)
ERA000206  Blue                 52               203               0.6             4        547,758
ERA000206  Coral             1,596            12,054              39.0             8         17,817
ERA000206  Echo     Ran but did not complete
ERA000206  HiTEC               699               699              13.4             1         40,670
ERA000206  HSHREC              808             5,790              30.5             8         35,184
ERA000206  Reptile             320               320               4.3             1         88,766
ERA000206  SHREC               465             1,994              33.0             4         61,080
ERR022075  Blue                 36               139               0.6             4      1,280,006
ERR022075  Coral             2,752            21,004              47.0             8         16,511
ERR022075  Echo     Ran but did not complete
ERR022075  HiTEC             1,365             1,365                               1         33,290
ERR022075  HSHREC            1,363             9,586                               8         33,351
ERR022075  Reptile             509               509                               1         89,297
ERR022075  SHREC               625             2,405                               4         72,681

For ERR022075, three further memory values (11.0, 2.8 and 27.5 GB) are given for HiTEC, HSHREC, Reptile and SHREC, but the tool each one belongs to is not recoverable from this transcript.

The accuracy of Blue’s correction was measured using the Bowtie2 aligner, the Velvet assembler and the Mauve Assembly Metrics tool. Bowtie2 was used to measure how many of the corrected reads aligned perfectly to the MG1655 reference genome. Higher rates of perfect alignments indicate more accurate correction. Figure 1 shows the effect of correction on the ERA000206 reads.

The ‘consensus’ table comes from tiling a set of reads, producing a set of distinct k-mers and their repetition counts. This consensus table does not have to be generated from the reads being corrected, allowing small sets of long reads to be corrected using large data sets of shorter reads.
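The tiling step and the idea of judging a fix by its effect on the rest of the read can be illustrated with a small Python sketch. This is not Blue's implementation: the function names (`build_consensus`, `weak_positions`, `fix_substitution`), the `min_count` threshold and the brute-force, substitution-only search are all assumptions made for illustration, whereas Blue itself also handles insertions, deletions and uncalled bases and explores candidate fixes with a depth-first, error-limited search.

```python
from collections import Counter

def build_consensus(reads, k):
    """Tile every read into overlapping k-mers and count each distinct k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # ignore k-mers containing uncalled bases
                counts[kmer] += 1
    return counts

def weak_positions(read, counts, k, min_count):
    """Return start offsets of k-mers in `read` seen fewer than `min_count` times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

def fix_substitution(read, counts, k, min_count):
    """Try every single-base substitution and keep the candidate that leaves
    the fewest weakly supported k-mers across the whole read (the 'context').
    Hypothetical helper: a brute-force stand-in, not Blue's tree search."""
    best = read
    best_weak = len(weak_positions(read, counts, k, min_count))
    if best_weak == 0:
        return read  # nothing looks wrong, leave the read alone
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            n_weak = len(weak_positions(candidate, counts, k, min_count))
            if n_weak < best_weak:
                best, best_weak = candidate, n_weak
    return best

# The consensus table may come from a different read set than the one being
# corrected, e.g. many short reads used to correct one longer read.
reference = "ACGTTGCAAGGCTTAC"
short_reads = [reference[i:i + 8] for i in range(len(reference) - 7)] * 3
table = build_consensus(short_reads, k=5)
long_read_with_error = "ACGTTGCTAGGCTTAC"  # substitution at offset 7
corrected = fix_substitution(long_read_with_error, table, k=5, min_count=2)
```

A single wrong base makes every k-mer that overlaps it rare or absent in the table, which is why a run of consecutive weak k-mers points at one error position rather than several.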
The ability to build the consensus table from a different read set has been used successfully to correct 454 reads using an Illumina-derived consensus table. The assembly tests showed that Blue-corrected reads resulted in the longest contigs (max contig and N50) and the lowest number of errors as measured by the Mauve tool. Cross-correcting 454 data and assembling it together with Illumina data gave the best results.

Blue is available for download from www.bioinformatics.csiro.au/Blue.

[Figure 1 chart: ERA000206 Bowtie2 alignment mismatches. Stacked bars from 0% to 100% for Original, Blue, Reptile, SHREC, Coral, HiTEC and HSHREC, with segments for 0 mismatches through 10+ mismatches. Percentage labels shown on the chart: 98.8%, 79.5%, 88.8%, 84.4%, 62.9%, 57.0%, 5.0%.]

Figure 1: Percentage of corrected MG1655 reads aligning to the MG1655 reference genome. More accurate correction will result in more reads aligning perfectly (no mismatches). The percentages here are based on the pre-correction number of reads. HiTEC and SHREC discard reads containing uncalled bases (‘N’), thereby removing some of the poorest quality reads prior to correction. Measuring the mapping rates of just the corrected reads improves HiTEC’s score here to 94.8%, but Blue then gets 99.76% of its ‘good’ reads mapped perfectly (after dropping the 1.8% of the read pairs it judged to be poor).

[Figure 2 chart: Miscalls - ERA000206 contigs, vk=41. Horizontal axis from 0 to 5,000,000. Series: Synth (36), Original (385), Blue (38), BlueGood (32), HiTEC (50), Coral (355), Reptile (171), SHREC (107), 454 + Illumina (493), 454Blue + IlluminaBlue (16), 454 + IlluminaHiTEC (55), 454Coral + IlluminaCoral (428).]

Figure 2: Error density map for the assemblies of the corrected reads. The Synth reads were 30M perfect reads derived from the MG1655 reference genome and indicate the level of ‘errors’ that result from artefacts in both the assembler and the contig alignment tool.
The Blue-corrected results come very close to this ‘perfect’ data. Combining Blue-corrected 454 (using the Illumina consensus) and Illumina data gave the best results.

FOR FURTHER INFORMATION
Paul Greenfield
e paul.greenfield@csiro.au
w www.csiro.au/cci

REFERENCES
A survey of error-correction tools can be found in Yang X, Chockalingam SP, Aluru S. (2013) A survey of error-correction methods for next-generation sequencing. Brief Bioinform 14 (1): 56-66
Darling A, Tritt A, Eisen J, Facciotti M. (2011) Mauve Assembly Metrics. Bioinformatics 27 (19): 2756-2757

ACKNOWLEDGEMENTS
This work was supported by the CSIRO Transformational Biology Capability Platform
