Assembling genomes using ABySS

dnGASP 2011

Shaun Jackman
BC Genome Sciences Centre
sjackman@bcgsc.ca
abyss-users@bcgsc.ca

An assembly in two stages
● Stage 1: Sequence assembly algorithm
● Stage 2: Paired-end assembly algorithm

Stage 1
Sequence assembly algorithm
● Load k-mers: load the reads, breaking each read into k-mers
● Find overlaps: find adjacent k-mers, which overlap by k-1 bases
● Prune tips: remove k-mers resulting from read errors
● Pop bubbles: remove variant sequences
● Generate contigs

Load the reads
● For each input read of length l, (l - k + 1) k-mers are generated by sliding a window of length k over the read
● Each k-mer is a vertex of the de Bruijn graph
● Two adjacent k-mers are an edge of the de Bruijn graph
Read (l = 12):
ATCATACATGAT
k-mers (k = 9):
ATCATACAT
TCATACATG
CATACATGA
ATACATGAT
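
The sliding window above can be sketched in a few lines of Python. This is an illustration only, not ABySS code; the function name kmers is an assumption made for this example.

```python
def kmers(read, k):
    """Return the (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read and k from the slide: l = 12, k = 9, so 12 - 9 + 1 = 4 k-mers.
example = kmers("ATCATACATGAT", 9)
for kmer in example:
    print(kmer)
```

A read shorter than k yields no k-mers at all.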

De Bruijn Graph
● A simple graph for k = 5
● Two reads
– GGACATC
– GGACAGA
GGACA → GACAT → ACATC
GGACA → GACAG → ACAGA
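
A minimal sketch of how this small graph could be built, assuming a plain dictionary of successor lists. It ignores reverse complements and is not how ABySS stores its graph; the name de_bruijn is hypothetical.

```python
def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer maps to the k-mers that overlap it by k-1 bases."""
    vertices = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = {}
    for v in sorted(vertices):
        # A successor shares the last k-1 bases of v and adds one new base.
        graph[v] = [v[1:] + base for base in "ACGT" if v[1:] + base in vertices]
    return graph


graph = de_bruijn(["GGACATC", "GGACAGA"], 5)
for vertex, successors in graph.items():
    print(vertex, "->", successors)
```

The shared k-mer GGACA ends up with two successors, which is the branch shown in the figure.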

Pruning tips
● Read errors cause
tips
● Pruning tips
removes the
erroneous reads
from the assembly
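
One way tip pruning could be sketched, assuming the graph is a dictionary of successor sets and that a tip is a dead-end chain of fewer than max_tip vertices hanging off a branch point. The threshold, the function name and the one-letter vertex labels (standing in for k-mers) are assumptions for illustration. Only forward dead ends are handled here.

```python
def prune_tips(succ, max_tip):
    """Remove dead-end branches of fewer than max_tip vertices (simplified sketch)."""
    succ = {v: set(ws) for v, ws in succ.items()}
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    dead_ends = [v for v in succ if not succ[v]]
    for end in dead_ends:
        tip = [end]
        # Walk backwards along the unbranched chain that leads to the dead end.
        while len(pred[tip[-1]]) == 1:
            parent = next(iter(pred[tip[-1]]))
            if len(succ[parent]) > 1:
                break  # the parent is the branch point; the tip stops here
            tip.append(parent)
        parents = pred[tip[-1]]
        at_branch = len(parents) == 1 and len(succ[next(iter(parents))]) > 1
        if at_branch and len(tip) < max_tip:
            for v in tip:
                for p in pred.pop(v):
                    succ[p].discard(v)
                for w in succ.pop(v):
                    pred[w].discard(v)
    return succ


# Main path A-B-C-D-E with a one-vertex tip X branching off B.
graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}, "E": set(), "X": set()}
pruned = prune_tips(graph, max_tip=2)
print(pruned)
```

The longer branch C-D-E also ends in a dead end, but it is kept because it is not shorter than the threshold.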

Popping bubbles
● Variant sequences cause
bubbles
● Popping bubbles removes
the variant sequence from
the assembly
● Repeat sequences with
small differences also
cause bubbles
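
A simplified sketch of bubble popping, under the assumptions that a bubble is two unbranched chains that leave one vertex and rejoin at another, and that the chain with the lower mean coverage is the one removed. The coverage rule, the max_len limit and all names are assumptions for this example, not a description of the ABySS implementation.

```python
def pop_bubbles(succ, coverage, max_len=10):
    """Remove the lower-coverage branch of each simple bubble (simplified sketch)."""
    succ = {v: set(ws) for v, ws in succ.items()}
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    def branch(start):
        """Follow an unbranched chain; return (chain, merge vertex) or None."""
        chain = [start]
        while len(chain) <= max_len:
            nxt = succ[chain[-1]]
            if len(nxt) != 1:
                return None
            v = next(iter(nxt))
            if len(pred[v]) > 1:
                return chain, v
            chain.append(v)
        return None

    for fork in list(succ):
        if fork not in succ or len(succ[fork]) != 2:
            continue
        a, b = (branch(s) for s in sorted(succ[fork]))
        if a is None or b is None or a[1] != b[1]:
            continue  # the two branches do not rejoin at the same vertex
        loser = min(a[0], b[0], key=lambda c: sum(coverage[v] for v in c) / len(c))
        for v in loser:
            for p in pred.pop(v):
                succ[p].discard(v)
            for w in succ.pop(v):
                pred[w].discard(v)
    return succ


# S forks into branches A1-A2 and B1-B2, which rejoin at M.
graph = {"S": {"A1", "B1"}, "A1": {"A2"}, "A2": {"M"},
         "B1": {"B2"}, "B2": {"M"}, "M": {"T"}, "T": set()}
coverage = {"S": 20, "A1": 18, "A2": 19, "B1": 3, "B2": 2, "M": 21, "T": 20}
popped = pop_bubbles(graph, coverage)
print(popped)
```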

Assemble contigs
● Remove ambiguous
edges
● Output contigs in
FASTA format
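
The two steps above might look like the following sketch, which assumes that an edge is ambiguous when its source has more than one successor or its target has more than one predecessor. The function names and the numeric FASTA identifiers are assumptions; circular paths are not handled.

```python
def unambiguous_contigs(succ):
    """Drop ambiguous edges, then spell one sequence per remaining chain of k-mers."""
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    # Keep an edge only if it is the sole way out of v and the sole way into w.
    nxt = {}
    for v, ws in succ.items():
        if len(ws) == 1:
            w = next(iter(ws))
            if len(pred[w]) == 1:
                nxt[v] = w

    has_prev = set(nxt.values())
    seqs = []
    for start in sorted(succ):
        if start in has_prev:
            continue
        seq, v = start, start
        while v in nxt:
            v = nxt[v]
            seq += v[-1]  # each step along the chain adds one base
        seqs.append(seq)
    return seqs


def to_fasta(seqs):
    """Format sequences as FASTA records with numeric identifiers."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(seqs))


# The k = 5 graph built from the reads GGACATC and GGACAGA.
graph = {"GGACA": {"GACAT", "GACAG"}, "GACAT": {"ACATC"}, "ACATC": set(),
         "GACAG": {"ACAGA"}, "ACAGA": set()}
contig_seqs = unambiguous_contigs(graph)
print(to_fasta(contig_seqs), end="")
```

Because GGACA has two successors, both of its outgoing edges are dropped and the graph falls apart into three contigs.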

Stage 2
Paired-end assembly algorithm
● Align the reads to the contigs of the first stage
● Generate an empirical fragment-size
distribution using the paired reads that align to
the same contig
● Estimate the distance between contigs using
the paired reads that align to different contigs

Align the reads to the contigs
KAligner
● Every k-mer in the single-end
assembly is unique
● KAligner can map reads with k
consecutive correct bases
● ABySS may use other aligners,
including BWA and bowtie
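
Because every k-mer in the single-end assembly is unique, an exact k-mer index is enough to place a read that contains k consecutive correct bases. The sketch below illustrates that idea only; it is not KAligner itself, it ignores reverse complements, and the contig name, sequences and function names are made up for the example.

```python
def build_index(contigs, k):
    """Map every k-mer of the contigs to its (contig name, position)."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (name, i)
    return index


def align(read, index, k):
    """Return (contig, implied start of the read on the contig), or None if no k-mer matches."""
    for i in range(len(read) - k + 1):
        hit = index.get(read[i:i + k])
        if hit is not None:
            name, pos = hit
            return name, pos - i
    return None


contigs = {"c1": "GGACATCGTTAGC"}
index = build_index(contigs, 5)

# The read matches c1 from position 2, except for an error in its first base.
placement = align("TCATCGTTA", index, 5)
print(placement)
```

The first k-mer of the read contains the error and is not found, but the second k-mer is error-free and anchors the read.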

Empirical fragment-size distribution
ParseAligns
● Generate an empirical fragment-size
distribution using the paired reads that align to
the same contig
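
A sketch of how such a distribution could be tallied, assuming each read alignment is a (contig, start, end) tuple and that the fragment size is the span from the leftmost start to the rightmost end. The tuple layout and the coordinates are invented for illustration and are not the ParseAligns input format.

```python
from collections import Counter


def fragment_size_histogram(pairs):
    """Count fragment sizes over read pairs whose two reads align to the same contig."""
    hist = Counter()
    for (contig1, start1, end1), (contig2, start2, end2) in pairs:
        if contig1 == contig2:
            hist[max(end1, end2) - min(start1, start2)] += 1
    return hist


pairs = [
    (("c1", 0, 50), ("c1", 450, 500)),
    (("c1", 100, 150), ("c1", 560, 610)),
    (("c1", 200, 250), ("c1", 650, 700)),
    (("c1", 900, 950), ("c2", 10, 60)),  # spans two contigs, so it is not counted
]
hist = fragment_size_histogram(pairs)
print(sorted(hist.items()))
```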

Estimate distances between contigs
DistanceEst
● Estimate the distance between contigs using
the paired reads that align to different contigs

Example distance estimates (from the figure): d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3

Maximum likelihood estimator
DistanceEst
● Use the empirical paired-
end size distribution
● Maximize the likelihood
function
● Find the most likely
distance between the two
contigs
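
The estimator can be illustrated with a toy version. Assume each pair that spans two contigs contributes a partial size x (the part of its fragment that lies on the two contigs), so the full fragment size is x + d for a gap of d. The sketch picks the d that maximizes the summed log-probability under the empirical distribution. The probabilities, the floor value and the function name are assumptions, and the sketch omits any correction for contig lengths that a real estimator would need.

```python
import math


def estimate_distance(partial_sizes, pmf, candidates, floor=1e-9):
    """Return the candidate distance d that maximizes sum(log P(x + d))."""
    def log_likelihood(d):
        # Sizes never seen in the empirical distribution get a small floor probability.
        return sum(math.log(max(pmf.get(x + d, 0.0), floor)) for x in partial_sizes)
    return max(candidates, key=log_likelihood)


# A made-up empirical fragment-size distribution centred on 500 bp.
pmf = {490: 0.1, 495: 0.2, 500: 0.4, 505: 0.2, 510: 0.1}
partial_sizes = [470, 475, 480]
d = estimate_distance(partial_sizes, pmf, range(-10, 100))
print(d)
```

With d = 25 the three implied fragment sizes are 495, 500 and 505, which is the most probable combination under this distribution.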

Paired-end algorithm, continued
● Generate paths: find paths through the contig adjacency graph that agree with the distance estimates
● Merge paths: merge overlapping paths
● Generate contigs: merge the contigs in these paths and output the FASTA file

Find consistent paths
SimpleGraph
● Find paths through the contig adjacency graph
that agree with the distance estimates

Example: estimated distance d = 4 ± 3; actual distance = 3
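
A sketch of the path search, assuming that adjacent contigs overlap by a fixed number of bases (k-1 in a de Bruijn assembly) and that the distance implied by a path is the total length of the intermediate contigs minus those overlaps. The contig names, lengths and function name are hypothetical, and this is not the SimpleGraph implementation.

```python
def consistent_paths(adj, length, overlap, src, dst, est, err):
    """Return the paths from src to dst whose implied distance is within est ± err."""
    results = []

    def extend(path, gap):
        # gap is the distance from the end of src to the start of the next contig.
        if gap > est + err:
            return  # already too far; longer paths only get worse
        for nxt in adj.get(path[-1], ()):
            if nxt == dst:
                if abs(gap - est) <= err:
                    results.append(path + [dst])
            elif nxt not in path:
                extend(path + [nxt], gap + length[nxt] - overlap)

    extend([src], -overlap)
    return results


# Two routes from A to B: through X (11 bp) or through Y (30 bp), with k - 1 = 4.
adj = {"A": ["X", "Y"], "X": ["B"], "Y": ["B"]}
length = {"X": 11, "Y": 30}
paths = consistent_paths(adj, length, overlap=4, src="A", dst="B", est=4, err=3)
print(paths)
```

The route through X implies a distance of -4 + 11 - 4 = 3, which agrees with the estimate of 4 ± 3; the route through Y implies 22 and is rejected.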

Merge overlapping paths
MergePaths
● Merge paths that overlap
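
As an illustration, two paths (lists of contig identifiers) can be merged when a suffix of the first equals a prefix of the second. This is a minimal sketch with made-up identifiers, not the MergePaths implementation.

```python
def merge_paths(a, b):
    """Merge path b onto path a if a suffix of a equals a prefix of b; otherwise return None."""
    for n in range(min(len(a), len(b)), 0, -1):  # try the longest overlap first
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None


merged = merge_paths([1, 2, 3, 4], [3, 4, 5, 6])
print(merged)
```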

Generate the FASTA output
● Merge the contigs in these paths
● Output the FASTA file

GATTTTTG GAC GTCTTGATCTT CAC GTATTG CTATT

Assembly process
● Stage 1 completed in 3.5 hours
– Used 72 processors on six machines
– Peak memory usage of 180 GB of RAM
● Stage 2 completed in 9 hours
– Used 12 processors on one machine
– Peak memory usage of 48 GB of RAM
● Assembly parameters: k=64 s=200 n=10

Assembly results
Level 1: 500-bp paired-end reads
● Assembled half the genome in 7,676 contigs
larger than the N50 of 50,612 bp
● Assembled 1.81 Gbp in 170,407 contigs larger
than 200 bp
● The largest contig is 1,158,576 bp
● Removed 1,296,819 variant sequences
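
For reference, the N50 statistic used above can be computed as in this sketch: sort the contigs from longest to shortest and walk down the list until the running total reaches half of the assembled bases. The contig lengths below are toy values, not the assembly results.

```python
def n50(lengths):
    """Return (N50, number of contigs of at least that length needed to cover half the total)."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), 1):
        running += length
        if 2 * running >= total:
            return length, count
    return None  # no contigs given


stats = n50([80, 70, 50, 40, 30, 20, 10])
print(stats)
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 300 bases, so the N50 is 70.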

Alignments to the reference
● Aligned the 170,407 contigs longer than 200 bp
● 96.2% align over at least 99% of their length
● 1.2% align over between 90% and 99% of their length
● 2.5% align over less than 90% of their length

Works in progress
● Replace complex variant sequences with Ns
● Scaffold over gaps and simple repeat sequences using large-fragment mate-pair reads
● Fill in gaps with sequence using localized microassembly

ABySS Publications
IEEE InfoVis 2009

Acknowledgments
Supervisors
● İnanç Birol
● Steven Jones
Team
● Readman Chiu
● Rod Docking
● Karen Mungall
● Jenny Qian
