Assembling genomes using ABySS

Published in: Technology

Assembling genomes using ABySS

  1. Assembling genomes using ABySS
     dnGASP 2011
     Shaun Jackman, BC Genome Sciences Centre
     sjackman@bcgsc.ca, abyss-users@bcgsc.ca
  2. An assembly in two stages
       ● Stage I: Sequence assembly algorithm
       ● Stage II: Paired-end assembly algorithm
  3. Stage 1: Sequence assembly algorithm
       ● Load k-mers: load the reads, breaking each read into k-mers
       ● Find overlaps: find adjacent k-mers, which overlap by k-1 bases
       ● Prune tips: remove k-mers resulting from read errors
       ● Pop bubbles: remove variant sequences
       ● Generate contigs
  4. Load the reads
       ● For each input read of length l, (l - k + 1) k-mers are generated by sliding a window of length k over the read
       ● Each k-mer is a vertex of the de Bruijn graph
       ● Two adjacent k-mers are an edge of the de Bruijn graph
     Read (l = 12): ATCATACATGAT
     k-mers (k = 9): ATCATACAT, TCATACATG, CATACATGA, ATACATGAT
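
The sliding-window step can be sketched in a few lines of Python. This is an illustration, not ABySS code: the function name `kmers` is hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide: l = 12, k = 9, so 12 - 9 + 1 = 4 k-mers.
for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```

A read shorter than k yields an empty list, since the window does not fit.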
  5. De Bruijn Graph
       ● A simple graph for k = 5
       ● Two reads
           – GGACATC
           – GGACAGA
     Graph: GGACA → GACAT → ACATC
            GGACA → GACAG → ACAGA
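
A minimal sketch of building this graph from the two reads, assuming edges are taken from consecutive k-mers within each read. The function name `de_bruijn_graph` is hypothetical, and reverse complements are ignored.

```python
def de_bruijn_graph(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it (edges)."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # every k-mer is a vertex
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # consecutive k-mers overlap by k-1 bases
    return graph


graph = de_bruijn_graph(["GGACATC", "GGACAGA"], 5)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

The shared k-mer GGACA ends up with two successors, which is the fork drawn on the slide.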
  6. Pruning tips
       ● Read errors cause tips
  7. Pruning tips
       ● Read errors cause tips
       ● Pruning tips removes the erroneous reads from the assembly
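
One way to picture tip pruning is the sketch below, which removes short dead-end branches that hang off a branching vertex. It is a simplified illustration under stated assumptions, not the ABySS implementation: the function name `prune_tips`, the `max_tip_len` threshold, and the single-letter vertex labels are all hypothetical.

```python
def prune_tips(succ, max_tip_len):
    """Remove dead-end branches of at most max_tip_len vertices.

    succ maps every vertex to the set of its successors.
    """
    succ = {v: set(ws) for v, ws in succ.items()}
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    def remove(v):
        for w in succ.pop(v):
            pred[w].discard(v)
        for u in pred.pop(v):
            succ[u].discard(v)

    # Handle tips that dead-end forwards, then tips that dead-end backwards.
    for fwd, back in ((succ, pred), (pred, succ)):
        for end in [v for v in fwd if not fwd[v]]:
            if end not in fwd:
                continue  # already removed as part of another tip
            path = [end]
            while len(back[path[-1]]) == 1:
                parent = next(iter(back[path[-1]]))
                if len(fwd[parent]) != 1:
                    # parent branches, so path is a branch hanging off it
                    if len(path) <= max_tip_len:
                        for v in path:
                            remove(v)
                    break
                path.append(parent)
    return succ


# Main path A-B-C-D-E with a one-vertex tip X hanging off B.
graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}, "E": set(), "X": set()}
print(sorted(prune_tips(graph, max_tip_len=1)))
```

The threshold has to be shorter than any genuine branch, because this sketch judges each dead end only by its length.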
  8. Popping bubbles
       ● Variant sequences cause bubbles
       ● Popping bubbles removes the variant sequence from the assembly
       ● Repeat sequences with small differences also cause bubbles
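
The sketch below pops only the simplest kind of bubble: two linear branches that leave the same vertex and rejoin at the same vertex. Keeping the branch with the higher mean k-mer coverage is an assumption made for illustration; the names `pop_bubbles` and `coverage` and the vertex labels are hypothetical.

```python
def pop_bubbles(succ, coverage):
    """Remove the lower-coverage branch of each simple two-branch bubble."""
    succ = {v: set(ws) for v, ws in succ.items()}
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    def remove(v):
        for w in succ.pop(v):
            pred[w].discard(v)
        for u in pred.pop(v):
            succ[u].discard(v)

    def chain(v):
        """Follow a linear path from v; return (path, vertex where it rejoins)."""
        path = [v]
        while len(pred[path[-1]]) == 1 and len(succ[path[-1]]) == 1:
            nxt = next(iter(succ[path[-1]]))
            if len(pred[nxt]) > 1:
                return path, nxt  # nxt has several predecessors: a join
            path.append(nxt)
        return path, None

    def mean_cov(path):
        return sum(coverage[v] for v in path) / len(path)

    for fork in list(succ):
        if fork not in succ or len(succ[fork]) != 2:
            continue
        a, b = succ[fork]
        path_a, join_a = chain(a)
        path_b, join_b = chain(b)
        if join_a is not None and join_a == join_b:
            for v in min(path_a, path_b, key=mean_cov):
                remove(v)
    return succ


graph = {"S": {"A1", "B1"}, "A1": {"A2"}, "A2": {"J"},
         "B1": {"B2"}, "B2": {"J"}, "J": {"E"}, "E": set()}
cov = {"S": 10, "A1": 10, "A2": 10, "B1": 2, "B2": 2, "J": 10, "E": 10}
print(sorted(pop_bubbles(graph, cov)))
```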
  9. Assemble contigs
       ● Remove ambiguous edges
       ● Output contigs in FASTA format
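
As a rough illustration, the sketch below treats an edge as unambiguous when its source has exactly one successor and its target has exactly one predecessor, then spells out each maximal unambiguous path as a contig. The function names and the FASTA record IDs are hypothetical, and a cycle made entirely of unambiguous edges would be skipped by this sketch.

```python
def contigs(succ):
    """Spell out maximal paths of unambiguous edges in a de Bruijn graph."""
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    def unambiguous(u, v):
        return len(succ[u]) == 1 and len(pred[v]) == 1

    seqs = []
    for v in succ:
        # v starts a contig unless it is reached by an unambiguous edge
        if len(pred[v]) == 1 and unambiguous(next(iter(pred[v])), v):
            continue
        seq, cur = v, v
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if not unambiguous(cur, nxt):
                break
            seq += nxt[-1]  # adjacent k-mers overlap by k-1 bases
            cur = nxt
        seqs.append(seq)
    return seqs


def to_fasta(seqs):
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(seqs))


# The k = 5 graph built from reads GGACATC and GGACAGA.
graph = {"GGACA": {"GACAT", "GACAG"}, "GACAT": {"ACATC"}, "ACATC": set(),
         "GACAG": {"ACAGA"}, "ACAGA": set()}
print(to_fasta(sorted(contigs(graph))), end="")
```

With the fork at GGACA left unresolved, the graph yields three contigs: GGACA, GACATC and GACAGA.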
  10. Stage 2: Paired-end assembly algorithm
        ● Align the reads to the contigs of the first stage
        ● Generate an empirical fragment-size distribution using the paired reads that align to the same contig
        ● Estimate the distance between contigs using the paired reads that align to different contigs
  11. Align the reads to the contigs (KAligner)
        ● Every k-mer in the single-end assembly is unique
        ● KAligner can map reads with k consecutive correct bases
        ● ABySS may use other aligners, including BWA and bowtie
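
Because every k-mer is unique in the assembly, a hash table from k-mer to contig position is enough to place any read that contains at least one error-free k-mer. The sketch below shows that idea only; it is not KAligner's actual implementation, the function names and example sequences are hypothetical, and the reverse strand is ignored.

```python
def build_index(contigs, k):
    """Map each k-mer to its (contig name, position); k-mers are assumed unique."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (name, i)
    return index


def align(read, index, k):
    """Return (contig name, contig position of the read's first base), or None."""
    for j in range(len(read) - k + 1):
        hit = index.get(read[j:j + k])
        if hit is not None:
            name, i = hit
            return name, i - j
    return None


index = build_index({"c1": "GACATCGGTA"}, k=5)
# The first base of this read is an error, but the k-mer CATCG still matches.
print(align("TCATCGG", index, k=5))
```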
  12. Empirical fragment-size distribution (ParseAligns)
        ● Generate an empirical fragment-size distribution using the paired reads that align to the same contig
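
A minimal sketch of collecting that distribution, assuming each alignment is given as (contig name, leftmost position) and that all reads have the same length. The function name `fragment_sizes` and the input layout are hypothetical.

```python
from collections import Counter


def fragment_sizes(pairs, read_len):
    """Histogram of fragment sizes from pairs whose reads hit the same contig."""
    sizes = Counter()
    for (contig_a, start_a), (contig_b, start_b) in pairs:
        if contig_a == contig_b:
            # leftmost base of one read to rightmost base of its mate
            sizes[abs(start_b - start_a) + read_len] += 1
    return sizes


pairs = [
    (("c1", 0), ("c1", 400)),    # same contig: fragment of 500 bp
    (("c1", 10), ("c2", 5)),     # different contigs: not used here
    (("c2", 100), ("c2", 510)),  # same contig: fragment of 510 bp
]
print(fragment_sizes(pairs, read_len=100))
```

Pairs that span two contigs are skipped here because their fragment size is unknown; they are the input to the distance estimate instead.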
  13. Estimate distances between contigs (DistanceEst)
        ● Estimate the distance between contigs using the paired reads that align to different contigs
      Example estimates (figure): d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3
  14. Maximum likelihood estimator (DistanceEst)
        ● Use the empirical paired-end size distribution
        ● Maximize the likelihood function
        ● Find the most likely distance between the two contigs
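
The idea can be sketched as follows. For each pair spanning two contigs, the part of the fragment that lies on the contigs is known (call it x), so the fragment size would be x + d for a candidate gap d. The sketch picks the d that maximizes the product of empirical probabilities of x + d. It is a simplified illustration, not the DistanceEst implementation: the names, the toy distribution, and the small `floor` used to avoid log(0) are all assumptions.

```python
import math


def estimate_distance(pmf, partial_sizes, candidates, floor=1e-9):
    """Return the candidate gap d that maximizes the log-likelihood."""
    best_d, best_ll = None, -math.inf
    for d in candidates:
        ll = sum(math.log(pmf.get(x + d, 0.0) + floor) for x in partial_sizes)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d


# Toy empirical fragment-size distribution centred on 500 bp.
pmf = {498: 0.1, 499: 0.2, 500: 0.4, 501: 0.2, 502: 0.1}
# Portion of each spanning fragment that lies on the two contigs.
partial_sizes = [475, 476, 474, 475]
print(estimate_distance(pmf, partial_sizes, range(0, 51)))
```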
  15. Paired-end algorithm continued...
        ● Generate paths: find paths through the contig adjacency graph that agree with the distance estimates
        ● Merge paths: merge overlapping paths
        ● Generate contigs: merge the contigs in these paths and output the FASTA file
  16. Find consistent paths (SimpleGraph)
        ● Find paths through the contig adjacency graph that agree with the distance estimates
      d = 4 ± 3, actual distance = 3
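
A small depth-first search illustrates the idea. The sketch assumes adjacent contigs overlap by k-1 bases, so two directly adjacent contigs are at distance -(k-1) and each intermediate contig adds its length minus (k-1). The function name, the contig names, and the lengths are hypothetical; this is not the SimpleGraph implementation.

```python
def find_paths(adj, lengths, k, start, goal, d, err):
    """Return paths from start to goal whose gap length is within d ± err."""
    results = []

    def walk(path, dist):
        # dist: gap between the end of start and the beginning of the next contig
        if dist > d + err:
            return  # the gap only grows, so this path is already too long
        for nxt in adj.get(path[-1], ()):
            if nxt == goal:
                if abs(dist - d) <= err:
                    results.append(path + [nxt])
            elif nxt not in path:
                walk(path + [nxt], dist + lengths[nxt] - (k - 1))

    walk([start], -(k - 1))
    return results


adj = {"A": ["X", "Y"], "X": ["B"], "Y": ["B"]}
lengths = {"A": 50, "B": 60, "X": 9, "Y": 30}
# Through X the gap is 9 - 2*3 = 3; through Y it is 30 - 2*3 = 24.
print(find_paths(adj, lengths, k=4, start="A", goal="B", d=4, err=3))
```

Only the path through X agrees with the estimate d = 4 ± 3, matching the actual distance of 3 in the slide's example.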
  17. Merge overlapping paths (MergePaths)
        ● Merge paths that overlap
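
For two paths written as lists of contig IDs, merging can be sketched as finding the longest suffix of one that equals a prefix of the other. The function name `merge_paths` and the integer contig IDs are hypothetical, and this pairwise sketch leaves out whatever consistency checks a real implementation would need.

```python
def merge_paths(a, b):
    """Merge b onto a if a suffix of a equals a prefix of b; else return None."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None


print(merge_paths([1, 2, 3], [2, 3, 4]))  # paths share contigs 2 and 3
print(merge_paths([1, 2], [3, 4]))        # no overlap
```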
  18. Generate the FASTA output
        ● Merge the contigs in these paths
        ● Output the FASTA file
      Figure sequences: GATTTTTG GAC GTCTTGATCTT CAC GTATTG CTATT
  19. Assembly process
        ● Stage 1 completed in 3.5 hours
        ● Used 72 processors on six machines
        ● Peak memory usage of 180 GB of RAM
        ● Stage 2 completed in 9 hours
        ● Used 12 processors on one machine
        ● Peak memory usage of 48 GB of RAM
        ● Assembly parameters k=64 s=200 n=10
  20. Assembly results
      Level 1: 500-bp paired-end reads
        ● Assembled half the genome in 7,676 contigs larger than the N50 of 50,612 bp
        ● Assembled 1.81 Gbp in 170,407 contigs larger than 200 bp
        ● The largest contig is 1,158,576 bp
        ● Removed 1,296,819 variant sequences
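
For readers unfamiliar with the N50 statistic, the sketch below computes it from a list of contig lengths, together with the number of contigs at least that long. It measures against the total assembled length, which is an assumption; the slide's "half the genome" may instead be measured against the genome size. The function name `n50` and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return (N50, number of contigs of at least that length), or None if empty."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    return None


# Total is 30; the two largest contigs (10 + 8) cover at least half of it.
print(n50([10, 8, 6, 4, 2]))
```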
  21. Alignments to the reference
        ● Aligned the 170,407 contigs longer than 200 bp
        ● 96.2% align over at least 99% of their length
        ● 1.2% align over between 90% and 99% of their length
        ● 2.5% align over less than 90% of their length
      Chart categories: >99%, 90-99%, <90%
  22. Works in progress
        ● Replace complex variant sequences with Ns
        ● Scaffold over gaps and simple repeat sequences using large-fragment mate-pair reads
        ● Fill in gaps with sequence using localized microassembly
  23. ABySS Publications
      IEEE InfoVis 2009
  24. Acknowledgments
      Supervisors
        ● İnanç Birol
        ● Steven Jones
      Team
        ● Readman Chiu
        ● Rod Docking
        ● Karen Mungall
        ● Jenny Qian
