Assembling genomes using ABySS
           dnGASP 2011



            Shaun Jackman
      BC Genome Sciences Centre
         sjackman@bcgsc.ca
        abyss-users@bcgsc.ca
An assembly in two stages
●   Stage I: Sequence assembly algorithm
●   Stage II: Paired-end assembly algorithm




Stage 1
      Sequence assembly algorithm
●   Load k-mers: load the reads, breaking each read into k-mers
●   Find overlaps: find adjacent k-mers, which overlap by k-1 bases
●   Prune tips: remove k-mers resulting from read errors
●   Pop bubbles: remove variant sequences
●   Generate contigs



Load the reads
●   For each input read of length l, (l - k + 1) k-mers
    are generated by sliding a window of length k
    over the read
●   Each k-mer is a vertex of the de Bruijn graph
●   Two adjacent k-mers are an edge of the de Bruijn graph
      Read (l = 12):
         ATCATACATGAT
      k-mers (k = 9):
         ATCATACAT
          TCATACATG
           CATACATGA
            ATACATGAT
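
The sliding window can be sketched in a few lines of Python. This is only an illustration of the (l - k + 1) rule; the function name `kmers` is an assumption, and reverse complements are ignored here.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the (l - k + 1) k-mers of a read, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide: l = 12 and k = 9 give 12 - 9 + 1 = 4 k-mers.
for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```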

De Bruijn Graph
●   A simple graph for k = 5
●   Two reads
        –   GGACATC
        –   GGACAGA

            GGACA → GACAT → ACATC
            GGACA → GACAG → ACAGA
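
A minimal sketch of how this graph could be built from the two reads, assuming a plain dictionary that maps each k-mer to the set of k-mers following it. The function name `de_bruijn` and this representation are illustrative choices, not the data structure ABySS uses.

```python
def de_bruijn(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it in a read."""
    graph = {}
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in ks:
            graph.setdefault(kmer, set())   # register every vertex
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)                 # adjacent k-mers overlap by k-1 bases
    return graph


graph = de_bruijn(["GGACATC", "GGACAGA"], 5)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

The shared k-mer GGACA ends up with two successors, which is the branch shown on the slide.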


Pruning tips
●   Read errors cause tips
●   Pruning tips removes the erroneous k-mers
    from the assembly
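
One naive way to prune tips, sketched under stated assumptions: the graph is a dictionary from vertex to successor set, a tip is a dead-end chain of at most `max_tip_len` vertices hanging off a branch point, and only dead ends on the outgoing side are handled. The name `prune_tips` and the threshold parameter are hypothetical, not ABySS options.

```python
def prune_tips(graph, max_tip_len):
    """Remove dead-end branches of at most max_tip_len vertices."""
    graph = {v: set(succ) for v, succ in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succ in graph.items():
        for w in succ:
            preds[w].add(v)

    dead_ends = [v for v in graph if not graph[v]]
    for end in dead_ends:
        tip = [end]
        # Walk backwards from the dead end towards a branch point.
        while len(tip) <= max_tip_len and len(preds[tip[-1]]) == 1:
            (parent,) = preds[tip[-1]]
            if len(graph[parent]) > 1:
                # parent is a branch point: cut the tip off here
                graph[parent].discard(tip[-1])
                for v in tip:
                    del graph[v]
                break
            tip.append(parent)
    return graph


# Main path a..f with a one-vertex tip x hanging off c.
graph = {"a": {"b"}, "b": {"c"}, "c": {"d", "x"}, "d": {"e"},
         "e": {"f"}, "f": set(), "x": set()}
print(prune_tips(graph, max_tip_len=2))
```

The end of the main path survives because the walk back from it exceeds the length threshold before reaching a branch point.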




Popping bubbles
●   Variant sequences cause
    bubbles
●   Popping bubbles removes
    the variant sequence from
    the assembly
●   Repeat sequences with
    small differences also
    cause bubbles
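
A simplified sketch of bubble popping, assuming a bubble is a fork with exactly two unbranched chains that reconverge on the same vertex, and that the chain with the lower mean k-mer coverage is the one to remove. The `coverage` input and the selection rule are assumptions made for illustration.

```python
def pop_bubbles(graph, coverage):
    """Remove the lower-coverage branch of each simple two-branch bubble."""
    graph = {v: set(succ) for v, succ in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succ in graph.items():
        for w in succ:
            preds[w].add(v)

    def follow(v):
        # Walk an unbranched chain from v; return the chain and where it ends.
        chain = []
        while len(preds[v]) == 1 and len(graph[v]) == 1:
            chain.append(v)
            (v,) = graph[v]
        return chain, v

    def mean_coverage(chain):
        return sum(coverage[v] for v in chain) / len(chain)

    for fork in list(graph):
        if fork not in graph or len(graph[fork]) != 2:
            continue
        first, second = sorted(graph[fork])
        chain1, end1 = follow(first)
        chain2, end2 = follow(second)
        if end1 != end2 or not chain1 or not chain2:
            continue  # the branches do not reconverge: not a simple bubble
        weaker = chain1 if mean_coverage(chain1) < mean_coverage(chain2) else chain2
        graph[fork].discard(weaker[0])
        for v in weaker:
            del graph[v]
    return graph


# s forks into a and b, which both rejoin at t; b has the lower coverage.
graph = {"s": {"a", "b"}, "a": {"t"}, "b": {"t"}, "t": {"u"}, "u": set()}
print(pop_bubbles(graph, {"s": 9, "a": 10, "b": 2, "t": 11, "u": 10}))
```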




Assemble contigs
●   Remove ambiguous
    edges
●   Output contigs in
    FASTA format
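
Contig generation can be sketched as walking every unambiguous edge, where an edge is treated as unambiguous when it is the only way out of one k-mer and the only way into the next. The function names and the numeric FASTA identifiers below are illustrative assumptions.

```python
def assemble_contigs(graph):
    """Join k-mers along unambiguous paths into contig sequences."""
    preds = {v: set() for v in graph}
    for v, succ in graph.items():
        for w in succ:
            preds[w].add(v)

    def extends(v, w):
        # v -> w is unambiguous if it is v's only way out and w's only way in.
        return graph[v] == {w} and preds[w] == {v}

    result = []
    for v in sorted(graph):
        # A contig starts at any vertex that is not reached by an unambiguous edge.
        if any(extends(p, v) for p in preds[v]):
            continue
        seq = v
        while len(graph[v]) == 1:
            (w,) = graph[v]
            if not extends(v, w):
                break
            seq += w[-1]  # each step adds one new base
            v = w
        result.append(seq)
    return result


def to_fasta(seqs):
    """Format sequences as FASTA records with numeric identifiers."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(seqs))


graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"},
    "GACAG": {"ACAGA"},
    "ACATC": set(),
    "ACAGA": set(),
}
print(to_fasta(assemble_contigs(graph)), end="")
```

On the two-read example the branch at GGACA is left unresolved, so the sketch yields three contigs: GACAGA, GACATC and GGACA.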




Paired-end assembly algorithm
                       Stage 2
●   Align the reads to the contigs of the first stage
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
●   Estimate the distance between contigs using
    the paired reads that align to different contigs




Align the reads to the contigs
                      KAligner
●   Every k-mer in the single-end
    assembly is unique
●   KAligner can map reads with k
    consecutive correct bases
●   ABySS may use other aligners,
    including BWA and bowtie
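
Because every k-mer in the single-end assembly is unique, a lookup table from k-mer to contig position is enough to place a read that contains k consecutive correct bases. The sketch below illustrates that idea only; it is not KAligner's implementation, the function names are assumptions, and the reverse strand is ignored.

```python
def build_index(contigs, k):
    """Map every k-mer to its (contig name, offset); k-mers are assumed unique."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (name, i)
    return index


def align(read, index, k):
    """Return (contig name, contig position of the read start), or None."""
    for j in range(len(read) - k + 1):
        hit = index.get(read[j:j + k])
        if hit is not None:
            name, i = hit
            return name, i - j
    return None


index = build_index({"c1": "GGACATCAGT"}, k=5)
# The first base of this read is wrong, but bases 1..5 match the contig exactly.
print(align("TCATCAGT", index, k=5))
```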




Empirical fragment-size distribution
                     ParseAligns
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
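
A sketch of how such a distribution could be tallied, assuming each read pair has already been reduced to a tuple of (contig of read 1, start of read 1, contig of read 2, end of read 2). That tuple layout is a hypothetical simplification, not the ParseAligns input format.

```python
from collections import Counter


def fragment_sizes(pairs):
    """Histogram of fragment sizes from pairs whose reads align to the same contig."""
    sizes = Counter()
    for contig_a, start_a, contig_b, end_b in pairs:
        if contig_a == contig_b:           # pairs spanning two contigs are skipped
            sizes[end_b - start_a] += 1
    return sizes


pairs = [
    ("c1", 100, "c1", 600),
    ("c1", 250, "c1", 760),
    ("c1", 900, "c2", 80),    # spans two contigs: used later for distance estimates
    ("c2", 10, "c2", 510),
]
print(fragment_sizes(pairs))
```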




Estimate distances between contigs
                     DistanceEst
●   Estimate the distance between contigs using
    the paired reads that align to different contigs

    [Figure: contigs linked by distance estimates d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3]

Maximum likelihood estimator
                    DistanceEst
●   Use the empirical paired-
    end size distribution
●   Maximize the likelihood
    function
●   Find the most likely
    distance between the two
    contigs
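
A simplified sketch of the estimator, under these assumptions: for each pair spanning the two contigs, x is the part of the fragment that lies on the contigs, so the fragment size is x + d; the likelihood of d is the product of the empirical probabilities of x + d; unseen sizes get a small floor probability. The floor value and the brute-force search over a range of d are illustrative choices, and DistanceEst itself may differ in detail.

```python
import math


def estimate_distance(spans, size_counts, d_min, d_max):
    """Return the gap d in [d_min, d_max] that maximizes the log-likelihood."""
    total = sum(size_counts.values())
    floor = 1e-6  # assumed pseudo-probability for fragment sizes never observed

    def log_likelihood(d):
        return sum(
            math.log(max(size_counts.get(x + d, 0) / total, floor))
            for x in spans
        )

    return max(range(d_min, d_max + 1), key=log_likelihood)


# Empirical distribution centred on 500 bp, and three pairs spanning the gap.
size_counts = {498: 1, 499: 2, 500: 4, 501: 2, 502: 1}
spans = [475, 476, 474]
print(estimate_distance(spans, size_counts, d_min=0, d_max=50))
```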



Paired-end algorithm
                   continued...
●   Generate paths: find paths through the contig
    adjacency graph that agree with the distance estimates
●   Merge paths: merge overlapping paths
●   Generate contigs: merge the contigs in these paths
    and output the FASTA file




Find consistent paths
                    SimpleGraph
●   Find paths through the contig adjacency graph
    that agree with the distance estimates
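
The search can be sketched as a depth-first walk that sums the lengths of the contigs between the two ends and keeps the paths whose total falls within d ± err. The function name, the adjacency-list input and the decision to ignore the k-1 overlap between neighbouring contigs are simplifying assumptions.

```python
def consistent_paths(adj, lengths, start, goal, d, err):
    """Paths from start to goal whose intermediate contigs span d ± err bases."""
    found = []

    def walk(path, gap):
        if gap > d + err:
            return  # already longer than the estimate allows
        for nxt in adj.get(path[-1], ()):
            if nxt == goal:
                if abs(gap - d) <= err:
                    found.append(path + [goal])
            elif nxt not in path:
                walk(path + [nxt], gap + lengths[nxt])

    walk([start], 0)
    return found


# Two routes from A to B: through x (3 bp) or through y (12 bp).
adj = {"A": ["x", "y"], "x": ["B"], "y": ["B"]}
lengths = {"x": 3, "y": 12}
print(consistent_paths(adj, lengths, "A", "B", d=4, err=3))
```

With the estimate d = 4 ± 3, only the route through the 3 bp contig is kept.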




    [Figure: a path consistent with the estimate d = 4 ± 3; actual distance = 3]

Merge overlapping paths
                    MergePaths
●   Merge paths that overlap
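
A minimal sketch of merging two paths, assuming each path is a list of contig identifiers and that two paths overlap when a suffix of one equals a prefix of the other. The function name is illustrative and this is not the MergePaths implementation.

```python
def merge_paths(a, b):
    """Merge path b onto path a if a suffix of a equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):  # try the longest overlap first
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None  # the paths do not overlap


print(merge_paths([1, 2, 3], [2, 3, 4]))
print(merge_paths([1, 2], [3, 4]))
```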




Generate the FASTA output
●   Merge the contigs in these paths
●   Output the FASTA file




    GATTTTTG   GAC GTCTTGATCTT   CAC    GTATTG CTATT

Assembly process
●   Stage 1 completed in 3.5 hours
        –   Used 72 processors on six machines
        –   Peak memory usage of 180 GB of RAM
●   Stage 2 completed in 9 hours
        –   Used 12 processors on one machine
        –   Peak memory usage of 48 GB of RAM
●   Assembly parameters: k=64 s=200 n=10

Assembly results
          Level 1: 500-bp paired-end reads
●   Assembled half the genome in 7,676 contigs
    larger than the N50 of 50,612 bp
●   Assembled 1.81 Gbp in 170,407 contigs larger
    than 200 bp
●   The largest contig is 1,158,576 bp
●   Removed 1,296,819 variant sequences
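
For reference, an N50-style statistic can be computed as below. This sketch measures half of the total assembled length; the slide speaks of half the genome, so the figure it reports may be defined against the genome size instead. The function name and the toy lengths are illustrative.

```python
def n50(lengths):
    """Return (N50, number of contigs needed to reach half the total length)."""
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half:
            return length, count
    return None  # no contigs


# Total length 300: the two largest contigs (80 + 70) reach half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```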




Alignments to the reference
●   Aligned the 170,407 contigs longer than 200 bp
●   96.2% align over at least 99% of their length
●   1.2% align over between 90% and 99% of their length
●   2.5% align over less than 90% of their length

    [Pie chart of the three categories: >99%, 90-99%, <90%]




Works in progress
●   Replace complex variant sequences with Ns
●   Scaffold over gaps and simple repeat sequences
    using large-fragment mate-pair reads
●   Fill in gaps with sequence using localized
    microassembly




ABySS Publications
         IEEE InfoVis 2009
Acknowledgments
    Supervisors
●   İnanç Birol
●   Steven Jones
    Team
●   Readman Chiu
●   Rod Docking
●   Karen Mungall
●   Jenny Qian