Assembling genomes using ABySS
1. Assembling genomes using ABySS
dnGASP 2011
Shaun Jackman
BC Genome Sciences Centre
sjackman@bcgsc.ca
abyss-users@bcgsc.ca
2. An assembly in two stages
● Stage 1: Sequence assembly algorithm
● Stage 2: Paired-end assembly algorithm
3. Stage 1
Sequence assembly algorithm
● Load k-mers: load the reads, breaking each read into k-mers
● Find overlaps: find adjacent k-mers, which overlap by k-1 bases
● Prune tips: remove k-mers resulting from read errors
● Pop bubbles: remove variant sequences
● Generate contigs
4. Load the reads
● For each input read of length l, (l - k + 1) k-mers
are generated by sliding a window of length k
over the read
● Each k-mer is a vertex of the de Bruijn graph
● Two adjacent k-mers are an edge of the de Bruijn graph
Read (l = 12):
ATCATACATGAT
k-mers (k = 9):
ATCATACAT
TCATACATG
CATACATGA
ATACATGAT
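The sliding window above can be sketched in a few lines of Python. The function name `kmers` is chosen for illustration only and is not part of ABySS.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide: l = 12 and k = 9 give 12 - 9 + 1 = 4 k-mers.
for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```

A read shorter than k yields no k-mers at all, so it contributes nothing to the graph.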
5. De Bruijn Graph
● A simple graph for k = 5
● Two reads
– GGACATC
– GGACAGA
Graph edges:
GGACA → GACAT → ACATC
GGACA → GACAG → ACAGA
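As a rough illustration, the graph for these two reads can be built with a dictionary that maps each k-mer to the set of k-mers following it. This is a minimal sketch: the name `de_bruijn` is hypothetical, and the sketch ignores reverse complements, which a real assembler has to handle.

```python
def de_bruijn(reads, k):
    """Map every k-mer (vertex) to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        for km in kms:
            graph.setdefault(km, set())  # register the vertex
        for a, b in zip(kms, kms[1:]):
            graph[a].add(b)              # consecutive k-mers overlap by k-1 bases
    return graph


graph = de_bruijn(["GGACATC", "GGACAGA"], 5)
for vertex, successors in graph.items():
    print(vertex, sorted(successors))
```

The vertex GGACA ends up with two successors, which is the fork shown in the slide.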
7. Pruning tips
● Read errors cause tips
● Pruning tips removes the erroneous reads from the assembly
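One way to picture tip pruning: a tip is a short dead-end branch hanging off a fork. The sketch below is an assumption-laden illustration, not ABySS's implementation. It assumes the graph is a dictionary mapping each k-mer to its set of successors, with every successor also present as a key, and it only handles tips that dead-end in the forward direction. The name `prune_tips` and the threshold `max_tip_len` are hypothetical.

```python
def prune_tips(graph, max_tip_len):
    """Return a copy of graph without dead-end branches of up to max_tip_len vertices."""
    graph = {v: set(succs) for v, succs in graph.items()}
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].add(v)

    dead_ends = [v for v, succs in graph.items() if not succs]
    for end in dead_ends:
        tip, v = [], end
        # Walk backwards while the path is unbranched and the tip is still short.
        while len(tip) < max_tip_len and len(graph[v]) <= 1 and len(preds[v]) == 1:
            tip.append(v)
            v = next(iter(preds[v]))
        # Prune only if the walk stopped at a fork, so another path survives.
        if tip and len(graph[v]) > 1:
            graph[v].discard(tip[-1])
            for t in tip:
                del graph[t]
    return graph


# GACAG is a one-vertex dead end next to a longer path through GACAT.
example = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"},
    "ACATC": {"CATCA"},
    "CATCA": {"ATCAG"},
    "ATCAG": set(),
    "GACAG": set(),
}
print(sorted(prune_tips(example, max_tip_len=2)))
```

The longer branch is kept because its backward walk reaches the length limit before it reaches the fork.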
8. Popping bubbles
● Variant sequences cause bubbles
● Popping bubbles removes the variant sequence from the assembly
● Repeat sequences with small differences also cause bubbles
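A simple bubble can be sketched as two unbranched paths that leave the same vertex and rejoin at the same vertex. The illustration below keeps the branch with the higher mean k-mer coverage; that selection rule, the `coverage` dictionary, the name `pop_bubbles` and the limit `max_len` are all assumptions made for the example, not a description of ABySS internals.

```python
def pop_bubbles(graph, coverage, max_len):
    """Return a copy of graph with the weaker branch of each simple bubble removed."""
    graph = {v: set(succs) for v, succs in graph.items()}
    indeg = {v: 0 for v in graph}
    for succs in graph.values():
        for w in succs:
            indeg[w] += 1

    def branch(first):
        """Follow the unbranched path starting at first; return (path, stop vertex)."""
        path, v = [], first
        while len(path) < max_len and len(graph[v]) == 1 and indeg[v] == 1:
            path.append(v)
            v = next(iter(graph[v]))
        return path, v

    def mean_coverage(path):
        return sum(coverage[v] for v in path) / len(path)

    for start in list(graph):
        if start not in graph or len(graph[start]) != 2:
            continue
        (p1, end1), (p2, end2) = (branch(first) for first in graph[start])
        # Both branches are non-empty and rejoin at the same vertex: a bubble.
        if p1 and p2 and end1 == end2:
            weaker = min(p1, p2, key=mean_coverage)
            graph[start].discard(weaker[0])
            indeg[end1] -= 1
            for v in weaker:
                del graph[v]
    return graph


# Reads AACGTTC and AACTTTC differ by one base; with k = 3 they form a bubble.
bubble = {
    "AAC": {"ACG", "ACT"},
    "ACG": {"CGT"}, "CGT": {"GTT"}, "GTT": {"TTC"},
    "ACT": {"CTT"}, "CTT": {"TTT"}, "TTT": {"TTC"},
    "TTC": set(),
}
cov = {"AAC": 33, "ACG": 30, "CGT": 30, "GTT": 30,
       "ACT": 3, "CTT": 3, "TTT": 3, "TTC": 33}
print(sorted(pop_bubbles(bubble, cov, max_len=3)))
```

A single-base difference produces branches of k vertices each, which is why `max_len` is set to k in this example.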
9. Assemble contigs
● Remove ambiguous edges
● Output contigs in FASTA format
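The two steps above can be sketched as follows: an edge is treated as unambiguous when its source has exactly one successor and its target has exactly one predecessor, and each maximal chain of unambiguous edges becomes one contig. The function names and the numeric FASTA identifiers are hypothetical; the sketch also skips cycles that have no starting vertex.

```python
def assemble_contigs(graph):
    """Return one sequence per maximal unambiguous path of k-mers."""
    preds = {v: set() for v in graph}
    for u, succs in graph.items():
        for w in succs:
            preds[w].add(u)

    def unambiguous(u, w):
        return len(graph[u]) == 1 and len(preds[w]) == 1

    contigs = []
    for v in graph:
        # A contig starts at any vertex that no unambiguous edge enters.
        if len(preds[v]) == 1 and unambiguous(next(iter(preds[v])), v):
            continue
        seq = v
        while len(graph[v]) == 1:
            w = next(iter(graph[v]))
            if not unambiguous(v, w):
                break
            seq += w[-1]  # each step along the path adds one new base
            v = w
        contigs.append(seq)
    return contigs


def to_fasta(contigs):
    """Format sequences as FASTA records with numeric identifiers."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(contigs))


forked = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"}, "ACATC": set(),
    "GACAG": {"ACAGA"}, "ACAGA": set(),
}
print(to_fasta(assemble_contigs(forked)), end="")
```

For the forked graph this yields three contigs, because the fork at GGACA is ambiguous and splits the sequence there.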
10. Paired-end assembly algorithm
Stage 2
● Align the reads to the contigs of the first stage
● Generate an empirical fragment-size
distribution using the paired reads that align to
the same contig
● Estimate the distance between contigs using
the paired reads that align to different contigs
11. Align the reads to the contigs
KAligner
● Every k-mer in the single-end assembly is unique
● KAligner can map reads with k consecutive correct bases
● ABySS may use other aligners, including BWA and bowtie
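Because every k-mer occurs only once in the single-end assembly, a plain dictionary from k-mer to contig position is enough to illustrate the idea. The sketch below is not KAligner itself: the names `build_index` and `align` are hypothetical, only the forward strand is considered, and the first matching k-mer decides the placement.

```python
def build_index(contigs, k):
    """Map each k-mer to its (contig name, offset); assumes k-mers are unique."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (name, i)
    return index


def align(read, index, k):
    """Return (contig name, read start on the contig) or None if no k-mer matches."""
    for j in range(len(read) - k + 1):
        hit = index.get(read[j:j + k])
        if hit is not None:
            name, i = hit
            return name, i - j  # shift back by the k-mer's offset within the read
    return None


index = build_index({"c1": "GATTACAGATTTC"}, k=5)
# The first base of this read is an error, but TACAG still matches exactly.
print(align("CTACAGAT", index, k=5))
```

A read with no run of k correct bases finds no k-mer in the index and stays unaligned.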
12. Empirical fragment-size distribution
ParseAligns
● Generate an empirical fragment-size
distribution using the paired reads that align to
the same contig
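As an illustration of this step, the fragment size of a pair whose reads both align to one contig is the span from the leftmost aligned base to the rightmost. The input layout below, a pair of `(contig, start, end)` tuples with half-open coordinates, is an assumption made for the example and is not the ParseAligns file format.

```python
from collections import Counter


def fragment_sizes(pairs):
    """Histogram of fragment sizes from pairs aligned to the same contig."""
    hist = Counter()
    for (contig1, start1, end1), (contig2, start2, end2) in pairs:
        if contig1 == contig2:  # pairs spanning two contigs are handled later
            hist[max(end1, end2) - min(start1, start2)] += 1
    return hist


pairs = [
    (("c1", 0, 50), ("c1", 450, 500)),
    (("c1", 100, 150), ("c1", 560, 610)),
    (("c1", 0, 50), ("c2", 10, 60)),     # different contigs: skipped here
    (("c2", 20, 70), ("c2", 470, 520)),
]
print(sorted(fragment_sizes(pairs).items()))
```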
13. Estimate distances between contigs
DistanceEst
● Estimate the distance between contigs using
the paired reads that align to different contigs
Example estimates: d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3
14. Maximum likelihood estimator
DistanceEst
● Use the empirical paired-end size distribution
● Maximize the likelihood function
● Find the most likely distance between the two contigs
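A simplified version of this estimator can be sketched as a grid search. For each pair that links the two contigs, let s be the part of the fragment that lies on the contigs; a gap of d then implies a fragment size of s + d, whose probability is read from the empirical distribution. The sketch below is only an assumption-based illustration: the name `estimate_distance`, the candidate grid and the `floor` probability for unseen sizes are hypothetical, and DistanceEst may correct for effects that are ignored here.

```python
import math


def estimate_distance(spans, size_hist, candidates, floor=1e-9):
    """Return the candidate gap d that maximizes the log-likelihood of the pairs."""
    total = sum(size_hist.values())

    def log_likelihood(d):
        return sum(
            math.log(max(size_hist.get(s + d, 0) / total, floor)) for s in spans
        )

    return max(candidates, key=log_likelihood)


# Empirical fragment-size counts, peaked at 500 bp.
size_hist = {498: 1, 499: 2, 500: 4, 501: 2, 502: 1}
# Bases of each fragment that lie on the two contigs, excluding the gap.
spans = [497, 496, 498, 497]
print(estimate_distance(spans, size_hist, range(-10, 50)))
```

Working in log space avoids underflow when many pairs link the same two contigs.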
15. Paired-end algorithm
continued...
● Generate paths: find paths through the contig adjacency graph that agree with
the distance estimates
● Merge paths: merge overlapping paths
● Generate contigs: merge the contigs in these paths and output the FASTA file
16. Find consistent paths
SimpleGraph
● Find paths through the contig adjacency graph
that agree with the distance estimates
Example: estimated distance d = 4 ± 3, actual distance = 3
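The search for consistent paths can be pictured as a depth-first walk that adds up the lengths of the contigs lying between the two linked contigs and keeps the paths whose total falls within the estimate and its error. This is a minimal sketch under stated assumptions: the name `find_paths` and its parameters are hypothetical, and `overlap` stands for the number of bases shared by adjacent contigs (0 here for simplicity).

```python
def find_paths(adj, lengths, src, dst, est, err, overlap=0):
    """Return the paths from src to dst whose gap length is within est ± err."""
    found = []

    def walk(v, path, dist):
        # dist is the distance from the end of src to the start of the next contig.
        for w in adj.get(v, ()):
            if w == dst:
                if abs(dist - est) <= err:
                    found.append(path + [w])
            elif w not in path:
                step = dist + lengths[w] - overlap
                if step <= est + err:  # abandon paths that are already too long
                    walk(w, path + [w], step)

    walk(src, [src], -overlap)
    return found


adj = {"A": ["x", "y"], "x": ["B"], "y": ["B"]}
lengths = {"x": 3, "y": 20}
# Estimate d = 4 ± 3: the path through x (distance 3) agrees, the one through y does not.
print(find_paths(adj, lengths, "A", "B", est=4, err=3))
```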
18. Generate the FASTA output
● Merge the contigs in these paths.
● Output the FASTA file
GATTTTTG GAC GTCTTGATCTT CAC GTATTG CTATT
19. Assembly process
● Stage 1 completed in 3.5 hours
– Used 72 processors on six machines
– Peak memory usage of 180 GB of RAM
● Stage 2 completed in 9 hours
– Used 12 processors on one machine
– Peak memory usage of 48 GB of RAM
● Assembly parameters: k=64 s=200 n=10
20. Assembly results
Level 1: 500-bp paired-end reads
● Assembled half the genome in 7,676 contigs
larger than the N50 of 50,612 bp
● Assembled 1.81 Gbp in 170,407 contigs larger
than 200 bp
● The largest contig is 1,158,576 bp
● Removed 1,296,819 variant sequences
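For readers unfamiliar with N50: it is the contig length at which the contigs of that length or longer hold half of the total. A minimal sketch, with a hypothetical function name and with the optional `total` argument standing for a known genome size when the assembly size should not be used:

```python
def n50(lengths, total=None):
    """Return (N50, number of contigs at least that long), or None if unreachable."""
    if total is None:
        total = sum(lengths)  # default: measure against the assembly itself
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    return None


# Total is 54; the three longest contigs (10 + 9 + 8 = 27) reach half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Applied to the results above, the second value corresponds to the 7,676 contigs and the first to the N50 of 50,612 bp.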
21. Alignments to the reference
● Aligned the 170,407 contigs longer than 200 bp
● 96.2% align over at least 99% of their length
● 1.2% align over between 90% and 99% of their length
● 2.5% align over less than 90% of their length
22. Works in progress
● Replace complex variant sequences with Ns
● Scaffold over gaps and simple repeat sequences
using large-fragment mate-pair reads
● Fill in gaps with sequence using localized
microassembly