Blue
Fast, accurate error correction using k-mer consensus and context
Paul Greenfield, Denis Bauer
15 October 2013
CSIRO ...
Transcript of "Blue - fast, accurate error correction using k-mer consensus and context - Paul Greenfield, Denis Bauer"

  1. Blue: fast, accurate error correction using k-mer consensus and context
     Paul Greenfield, Denis Bauer
     15 October 2013
     CSIRO COMPUTATIONAL INFORMATICS
     This work was funded by CSIRO Transformational Biology
  2. Error correction algorithms
     • All fundamentally the same
     • Find an error within a read (or a part of a read)
       – Using fixed-length k-mers, variable-length sub-reads, whole reads
     • Find the ‘best’ replacement for the broken part of the read
       – k-mer consensus, suffix arrays/trees of sub-reads, alignment to consensus
       – Trivial much of the time, but correcting the ‘hard’ cases properly is essential
       – Repetitive regions → multiple possible corrections – which one is right?
       – Much easier if only fixing substitution errors (and ignoring ins/del errors)
     [Figure residue: charts with series ‘Original’ and ‘Healed’; axis labels ‘% of k-mers’ and ‘Repetition Depth’]
     Correcting DNA sequence data | Paul Greenfield
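The "find an error within a read" step can be illustrated with fixed-length k-mers. This is a minimal sketch, not Blue's implementation: the function names, the forward-strand-only counting and the `min_count` threshold are all assumptions made for the example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, for brevity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, min_count):
    """Start offsets of k-mers in `read` that occur fewer than `min_count` times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Five identical reads plus one read with a substitution (G -> C) at offset 6.
reads = ["ACGTACGTAC"] * 5 + ["ACGTACCTAC"]
counts = count_kmers(reads, k=4)
flagged = weak_positions(reads[-1], counts, k=4, min_count=2)
print(flagged)  # [3, 4, 5, 6]: every 4-mer that covers offset 6
```

The run of low-count k-mers brackets the bad base, which is what makes k-mer counts usable for locating errors.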
  3. Blue overview
     • Blue does k-mer consensus correction
     • Chooses between multiple possible fixes by trialling them in the context of the read
     • Recursive exploration of the tree of potential ‘fixed’ reads to find ‘best’ fix
       – Tree exploration error-limited to improve efficiency
     • Handles both Illumina and 454-like data
     • Blue separates consensus from the reads being corrected
     • Possible to correct long (454) reads with a much larger set of Illumina k-mers
       – Combine 454 read length with depth of Illumina
       – Addresses 454 homopolymer problem (different error models)
     • Blue has an option to discard poor-looking reads (‘-good’)
     • Throw away sequencing artefacts and very broken reads
     • Being used internally within CSIRO now
     • Moth genome, bacterial & metagenomic projects
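The idea of trialling alternative fixes in the context of the read can be shown with a toy recursive search. This is only a sketch under strong assumptions, not Blue's algorithm: it handles substitutions only, assumes the error sits at the last base of the first weak k-mer, and uses a hypothetical `budget` parameter as the error limit on tree exploration.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Build the consensus k-mer table from a set of trusted sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def repair(read, counts, k, min_count, budget, start=0):
    """Return (fixed_read, fixes_used), or None if no repair fits the budget."""
    for i in range(start, len(read) - k + 1):
        if counts[read[i:i + k]] >= min_count:
            continue                      # this k-mer is supported by the consensus
        if budget == 0:
            return None                   # error limit reached: abandon this branch
        pos = i + k - 1                   # assumed error position (toy simplification)
        best = None
        for base in "ACGT":
            if base == read[pos]:
                continue
            trial = read[:pos] + base + read[pos + 1:]
            if counts[trial[i:i + k]] < min_count:
                continue                  # substitution does not heal this k-mer
            # Trial the fix in context: does the rest of the read now follow?
            result = repair(trial, counts, k, min_count, budget - 1, i + 1)
            if result is not None and (best is None or result[1] < best[1]):
                best = result
        if best is None:
            return None
        return best[0], best[1] + 1
    return read, 0

# Two consensus regions share the 4-mer ACGT but continue differently.
consensus = ["AACGTAGGC"] * 3 + ["TACGTTCCA"] * 3
table = count_kmers(consensus, k=4)
fixed = repair("AACGTCGGC", table, k=4, min_count=2, budget=2)
print(fixed)  # ('AACGTAGGC', 1)
```

Both `A` and `T` heal the weak k-mer `CGTC` here, but only `A` lets the remainder of the read match the consensus within the budget, so the context decides between the two candidate fixes. Because the k-mer table is built separately from the read being repaired, the same function could in principle take a table built from a different data set.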
  4. Algorithms under test

     Tool     FASTA in     FASTQ in  Fmt in=out  Ins, Dels?  Ns fixed?  Pairs kept?  MultiThreads
     Blue     Yes          Yes       Yes         Yes         Yes        Yes          Yes
     Coral    Yes          Yes       Yes         Yes         Yes        Yes          Yes
     Echo     No           Yes       Yes         No          No         No           Yes
     HiTEC    Yes          Yes       No          No          No         No           No
     HSHREC   Single-line  No        No          Yes         No         No           Yes
     Quake    No           Yes       Yes         No          Yes        Yes          Yes
     Reptile  Yes          No        No          No          Yes        Yes          No
     SHREC    Single-line  Yes       Yes         No          No         No           Yes
  5. Performance tests

     Dataset    Tool     Elapsed (mins)  Processor (mins)  Memory used (GB)  Threads  RPM (elapsed)
     ERA000206  Blue                 52               203               0.6        4        547,758
                Coral             1,596            12,054              39.0        8         17,817
                Echo     Ran but did not complete
                HiTEC               699               699              13.4        1         40,670
                HSHREC              808             5,790              30.5        8         35,184
                Quake    Failed
                Reptile             320               320               4.3        1         88,766
                SHREC               465             1,994              33.0        4         61,080
     ERR022075  Blue                 36               139               0.6        4      1,280,006
                Coral             2,752            21,004              47.0        8         16,511
                Echo     Ran but did not complete
                HiTEC             1,365             1,365          11.0 [a]        1         33,290
                HSHREC            1,363             9,586               [a]        8         33,351
                Quake    Failed
                Reptile             509               509               2.8        1         89,297
                SHREC               625             2,405              27.5        4         72,681

     [a] Only one memory value (11.0) appears in the extracted text for these two rows.
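The RPM column can be turned into relative throughput with a small calculation. The sketch below assumes that "RPM" means reads processed per elapsed minute, and uses the ERA000206 figures from the slide; the function name is made up for the example.

```python
# RPM (elapsed) values for ERA000206, taken from the performance slide.
rpm_era000206 = {
    "Blue": 547_758,
    "Coral": 17_817,
    "HiTEC": 40_670,
    "HSHREC": 35_184,
    "Reptile": 88_766,
    "SHREC": 61_080,
}

def relative_throughput(reference, rpm):
    """How many times faster `reference` is than each other tool, by RPM."""
    return {tool: rpm[reference] / value
            for tool, value in rpm.items() if tool != reference}

ratios = relative_throughput("Blue", rpm_era000206)
for tool, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"Blue vs {tool}: {ratio:.1f}x")
```

On these figures the ratio ranges from roughly 6x (Reptile) to roughly 31x (Coral), with the caveat that the tools ran with different thread counts.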
  6. Testing accuracy and effectiveness
     • Downloaded E. coli K12 MG1655 datasets from SRA
     • Two Illumina datasets (28M & 45M paired 100-bp reads)
     • Two 454 datasets (350K & 144K reads)
     • Accuracy (using Bowtie & Bowtie2)
     • Aligned corrected reads to E. coli K12 MG1655 reference sequence
       – More reads aligned with 0 mismatches → more accurate correction
       – Expect some genetic drift and some sequencing artefacts in practice
     • Effectiveness (using Velvet)
     • Assembled...
       – Raw and corrected Illumina data
       – Combined 454 + Illumina data
       – Perfect synthetic ‘reads’ used for comparison
     • Do you get longer contigs that contain fewer errors?
       – Compare contig lengths and error density in contigs
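The accuracy measure (share of reads aligning with 0, 1, 2, ... mismatches) can be tallied from aligner output. The sketch below is an assumption about how such a tally might be done, not the authors' script: it reads SAM text lines, treats FLAG bit 0x4 as unaligned, and uses the `XM:i` optional field that Bowtie2 writes for the number of mismatches. The `cap` parameter and the example records are invented.

```python
from collections import Counter

def mismatch_histogram(sam_lines, cap=10):
    """Tally reads by mismatch count; counts of `cap` or more share one bin."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue                              # header line
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:                  # FLAG bit 0x4: read is unmapped
            hist["unaligned"] += 1
            continue
        # Optional fields start after the 11 mandatory SAM columns.
        xm = next((int(f[5:]) for f in fields[11:] if f.startswith("XM:i:")), None)
        if xm is not None:
            hist[min(xm, cap)] += 1
    return hist

# Invented records, for illustration only.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tXM:i:0",
    "r2\t0\tchr\t5\t42\t4M\t*\t0\t0\tACGA\tIIII\tAS:i:-6\tXM:i:1",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\tYT:Z:UU",
]
hist = mismatch_histogram(example)
perfect_fraction = hist[0] / sum(hist.values())
```

Comparing `perfect_fraction` before and after correction gives the kind of figure the accuracy slides report.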
  7. Accuracy test (Illumina 28M)
     [Figure: ERA000206 Bowtie2 alignment mismatches. Vertical axis 0% to 100%; legend from 0 mismatches to 10+ mismatches; bars for Original, Blue, Reptile, Shrec, Coral, HiTEC, HSHREC. Data labels in the extracted text: 98.8%, 79.5%, 88.8%, 84.4%, 62.9%, 57.0%, 5.0% (assignment to bars not recoverable).]
  8. Accuracy test (454)
     [Figure: SRR029323 Bowtie2 alignment mismatches. Vertical axis 0% to 100%; legend from 0 mismatches to 10+ mismatches; bars for Original, Blue 454, Blue 454x2, Blue 220275, Blue 22075x2, Coral, HSHREC. Data labels in the extracted text: 95%, 76%, 65%, 51%, 41%, 10%, 1% (assignment to bars not recoverable).]
     (using an indel-capable aligner)
  9. Assembly: contig lengths
     [Figure: Velvet contigs - ERA000206 - k=41. Axes: Contig lengths (0 to 250000) and CDS breakages (0 to 250). Series: Contigs max, Contigs N50, Contigs N90. Datasets: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec.]
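For readers unfamiliar with the N50 and N90 series in the assembly charts, the sketch below uses the common definition: the length L such that contigs of length L or longer contain at least 50% (or 90%) of all assembled bases. The function name and the example lengths are invented, and Velvet's own reporting may differ in detail.

```python
def nxx(lengths, percent):
    """Contig length at which the running total reaches `percent` of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs supplied")

contigs = [80, 70, 50, 40, 30, 20, 10]   # invented lengths, 300 bases in total
n50 = nxx(contigs, 50)                   # 80 + 70 = 150 reaches 50%, so N50 is 70
n90 = nxx(contigs, 90)                   # 80 + 70 + 50 + 40 + 30 = 270 reaches 90%, so N90 is 30
```

Longer N50 and N90 values mean the assembly is less fragmented, which is why the slides compare them across correction tools.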
  10. Assembly: contig lengths with miscalls
     [Figure: Velvet contigs - ERA000206 - k=41. Axes: Contig lengths (0 to 250000) and CDS breakages (0 to 250). Series: Contigs max, Contigs N50, Contigs N90, Broken CDS. Datasets: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec.]
  11. Assembly: 454+Illumina
     [Figure: Velvet contigs - ERA000206 - k=41. Axis ranges 0 to 250000 and 0 to 600. Series: MaxContigLength, ContigN50, ContigN90, BrokenCDS. Dataset labels as extracted: Raw, Blue (all), Blue (good), Coral, HiTEC, Reptile, Shrec, Blue 454+206, Coral 454+206, 454+206, 454+206Hi.]
  12. Assembly accuracy - miscalls
     Miscalls - ERA000206 Contigs vk=41
     • Original (385)
     • Blue (38)
     • BlueGood (32)
     • Synth (36)
     • HiTEC (50)
     • Coral (355)
     • Reptile (171)
     • SHREC (107)
     • 454 + Illumina (493)
     • 454Blue + IlluminaBlue (16)
     • 454 + IlluminaHiTEC (55)
     • 454Coral + IlluminaCoral (428)
     [Figure: one track per dataset above; horizontal axis 0 to 5000000.]
     Density of miscalls (including real differences and alignment artefacts) along MG1655 genome. Data generated by Mauve assembler-testing tools (Aaron Darling)
  13. rhsD alignments (vk=41)
     950bp region repeated in pseudogene (with somewhat divergent margins)
     [Figure: alignment tracks. Labels as extracted: 2x Synth, 2x Blue 454+Illumina, 2x Blue, 2x Blue Good, Raw, 2x HiTEC, Coral, Reptile, SHREC.]
  14. Summing up
     • Correction can significantly improve alignment & assembly results
     • Most published algorithms are not very effective
       – Transparent component in a processing pipeline
       – Fast enough and scalable enough to handle real datasets
       – And... improve results enough to be worthwhile
     • Blue...
     • Uses the context of an error to decide between alternative fixes
       – Recursive search of the tree of potential ‘repaired’ reads to find the ‘best’
     • Separates reads from consensus
       – Allowing cross-correction between different types of data
     • Testing showed Blue to be the most accurate and fastest
     • Available from www.bioinformatics.csiro.au/Blue
  15. Thank you
     Bioinformatics and Biostatistics
     Paul Greenfield, Research Group Leader
     t +61 2 9325 3250
     e paul.greenfield@csiro.au
     w www.csiro.au/CCI
     CSIRO MATHEMATICS, INFORMATICS AND STATISTICS
     Paul Greenfield (CCI), Denis Bauer (CCI), Konsta Duesing (CAFHS), Alexie Papanicolaou (CES)
     Supported by David Lovell & CSIRO Transformational Biology