Assembling genomes using ABySS
           dnGASP 2011



            Shaun Jackman
      BC Genome Sciences Centre
         sjackman@bcgsc.ca
        abyss-users@bcgsc.ca
An assembly in two stages
●   Stage 1: Sequence assembly algorithm
●   Stage 2: Paired-end assembly algorithm

Stage 1
      Sequence assembly algorithm
●   Load k-mers: load the reads,
    breaking each read into k-mers
●   Find overlaps: find adjacent k-mers,
    which overlap by k-1 bases
●   Prune tips: remove k-mers resulting
    from read errors
●   Pop bubbles: remove variant sequences
●   Generate contigs

Load the reads
●   For each input read of length l, (l - k + 1) k-mers
    are generated by sliding a window of length k
    over the read
●   Each k-mer is a vertex of the de Bruijn graph
●   Two adjacent k-mers are an edge of the de Bruijn graph
      Read (l = 12):
         ATCATACATGAT
      k-mers (k = 9):
         ATCATACAT
          TCATACATG
           CATACATGA
            ATACATGAT

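The sliding window can be sketched in a few lines of Python. This only illustrates the example on the slide and is not ABySS code; the function name `kmers` is made up for this sketch.

```python
def kmers(read, k):
    """Return the (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read and k from the slide: l = 12 and k = 9 give 12 - 9 + 1 = 4 k-mers.
for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```

A read shorter than k yields no k-mers, because the range is empty.
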
De Bruijn Graph
●   A simple graph for k = 5
●   Two reads
        –   GGACATC
        –   GGACAGA
    Graph:
        GGACA → GACAT → ACATC
        GGACA → GACAG → ACAGA

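A minimal sketch of how this small graph could be built, assuming forward-strand k-mers only (a real assembler must also handle reverse complements). The function name `de_bruijn` and the dictionary-of-sets representation are choices made for this sketch.

```python
def de_bruijn(reads, k):
    """Map every k-mer to the set of k-mers that overlap it by k-1 bases."""
    vertices = {read[i:i + k] for read in reads
                for i in range(len(read) - k + 1)}
    graph = {}
    for v in vertices:
        # A successor shares v's last k-1 bases and adds one new base.
        candidates = (v[1:] + base for base in "ACGT")
        graph[v] = {c for c in candidates if c in vertices}
    return graph


# The two reads from the slide, with k = 5.
graph = de_bruijn(["GGACATC", "GGACAGA"], 5)
for v in sorted(graph):
    print(v, "->", sorted(graph[v]))
```

The shared k-mer GGACA ends up with two successors, which is the branch shown on the slide.
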
Pruning tips
●   Read errors cause
    tips
●   Pruning tips
    removes the k-mers
    caused by read errors
    from the assembly
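One way to picture tip pruning is the sketch below. It assumes the graph is a dictionary mapping each k-mer to its set of successors, and it treats any dead-end branch of at most `max_tip_len` vertices that hangs off a fork as a tip. The name `prune_tips` and the threshold are hypothetical; a production assembler would typically also weigh coverage.

```python
def prune_tips(graph, max_tip_len):
    """Remove short dead-end branches that hang off a fork."""
    graph = {v: set(succs) for v, succs in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for s in succs:
            preds[s].add(v)

    dead_ends = [v for v in graph if not graph[v]]
    for end in dead_ends:
        # Walk backwards from the dead end while the path stays unbranched.
        branch, v = [end], end
        while len(preds[v]) == 1:
            (p,) = preds[v]
            if len(graph[p]) > 1:
                # p is a fork: the branch is a tip if it is short enough.
                if len(branch) <= max_tip_len:
                    for t in branch:
                        for q in preds.pop(t):
                            graph[q].discard(t)
                        del graph[t]
                break
            branch.append(p)
            v = p
    return graph


# A long true path plus a two-vertex branch caused by a read error.
graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"}, "ACATC": {"CATCG"}, "CATCG": {"ATCGT"}, "ATCGT": set(),
    "GACAG": {"ACAGA"}, "ACAGA": set(),
}
pruned = prune_tips(graph, max_tip_len=2)
print(sorted(pruned))
```
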




Popping bubbles
●   Variant sequences cause
    bubbles
●   Popping bubbles removes
    the variant sequence from
    the assembly
●   Repeat sequences with
    small differences also
    cause bubbles
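The following sketch shows one possible way to pop simple bubbles. It assumes a successor dictionary like the one used for the de Bruijn graph, a hypothetical `coverage` dictionary giving a read count per k-mer, and a hypothetical length limit `max_len`. Keeping the branch with the higher mean coverage is an assumption of this sketch, not something stated on the slide.

```python
def pop_bubbles(graph, coverage, max_len):
    """Where two unbranched paths leave a fork and rejoin, keep only one."""
    graph = {v: set(succs) for v, succs in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for s in succs:
            preds[s].add(v)

    def branch_from(start):
        # Follow an unbranched path until it reaches a merge point.
        path, v = [start], start
        while len(graph[v]) == 1 and len(path) <= max_len:
            (nxt,) = graph[v]
            if len(preds[nxt]) > 1:
                return path, nxt
            path.append(nxt)
            v = nxt
        return path, None

    def mean_coverage(path):
        return sum(coverage.get(v, 0) for v in path) / len(path)

    for fork in list(graph):
        if fork not in graph or len(graph[fork]) < 2:
            continue
        by_merge = {}
        for start in sorted(graph[fork]):
            if len(preds[start]) != 1:
                continue
            path, merge = branch_from(start)
            if merge is not None and len(path) <= max_len:
                by_merge.setdefault(merge, []).append(path)
        for paths in by_merge.values():
            paths.sort(key=mean_coverage, reverse=True)
            for loser in paths[1:]:  # drop every branch except the best one
                for v in loser:
                    for p in preds.pop(v):
                        graph[p].discard(v)
                    for s in graph.pop(v):
                        preds[s].discard(v)
    return graph


graph = {"S": {"A1", "B1"}, "A1": {"M"}, "B1": {"M"}, "M": set()}
print(pop_bubbles(graph, {"A1": 10, "B1": 2}, max_len=3))
```
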




Assemble contigs
●   Remove ambiguous
    edges
●   Output contigs in
    FASTA format
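As a rough illustration of these two steps, the sketch below drops every ambiguous edge, walks the remaining unbranched chains of k-mers, and prints the result as FASTA. The function names and the numeric contig identifiers are invented for this sketch, and pure cycles are ignored for brevity.

```python
def assemble_contigs(graph):
    """Join k-mers along unambiguous edges and return the contig sequences."""
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for s in succs:
            preds[s].add(v)

    # Keep an edge u -> v only if it is unambiguous:
    # u has a single successor and v has a single predecessor.
    nxt = {}
    for u, succs in graph.items():
        if len(succs) == 1:
            (v,) = succs
            if len(preds[v]) == 1:
                nxt[u] = v
    has_prev = set(nxt.values())

    contigs = []
    for start in sorted(graph):
        if start in has_prev:
            continue  # not the first k-mer of a contig
        seq, v = start, start
        while v in nxt:
            v = nxt[v]
            seq += v[-1]  # each step adds one new base
        contigs.append(seq)
    return contigs


def to_fasta(contigs):
    """Format sequences as FASTA records with numeric identifiers."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(contigs))


graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"}, "ACATC": set(),
    "GACAG": {"ACAGA"}, "ACAGA": set(),
}
print(to_fasta(assemble_contigs(graph)), end="")
```

Because GGACA has two successors, both of its outgoing edges are ambiguous and it becomes a contig of its own.
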




Stage 2
      Paired-end assembly algorithm
●   Align the reads to the contigs of the first stage
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
●   Estimate the distance between contigs using
    the paired reads that align to different contigs




Align the reads to the contigs
                      KAligner
●   Every k-mer in the single-end
    assembly is unique
●   KAligner can map reads with k
    consecutive correct bases
●   ABySS may use other aligners,
    including BWA and bowtie
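The idea behind the first two points can be sketched as an exact k-mer index: because each k-mer occurs once in the contigs, a single k-mer hit is enough to place a read. This is an illustration, not the KAligner implementation; it handles the forward strand only, and the names `build_index` and `align` are made up.

```python
def build_index(contigs, k):
    """Map each k-mer of the contigs to (contig name, position)."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (name, i)
    return index


def align(read, index, k):
    """Place a read using its first k-mer that occurs in the index.

    Returns (contig name, contig position of the read's first base),
    or None if the read has no run of k correct bases.
    """
    for j in range(len(read) - k + 1):
        hit = index.get(read[j:j + k])
        if hit is not None:
            name, i = hit
            return name, i - j
    return None


index = build_index({"c1": "GGACATCGTA"}, k=5)
print(align("CATCGTT", index, k=5))  # read with an error in its last base
print(align("TATCGTA", index, k=5))  # read with an error in its first base
```

Both example reads contain an error, yet each still has five consecutive correct bases and maps to position 3 of the contig.
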




Empirical fragment-size distribution
                     ParseAligns
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
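A minimal sketch of building such a histogram, assuming each alignment is a hypothetical `(contig, start, end)` tuple in forward contig coordinates. The real ParseAligns input format is not described on the slide.

```python
from collections import Counter


def fragment_sizes(pairs):
    """Histogram of fragment sizes from read pairs aligned to one contig."""
    hist = Counter()
    for (c1, s1, e1), (c2, s2, e2) in pairs:
        if c1 == c2:
            # The fragment spans from the leftmost start to the rightmost end.
            hist[max(e1, e2) - min(s1, s2)] += 1
    return hist


pairs = [
    (("c1", 0, 50), ("c1", 450, 500)),
    (("c1", 10, 60), ("c1", 460, 510)),
    (("c1", 0, 50), ("c2", 5, 55)),  # different contigs: not counted here
]
print(fragment_sizes(pairs))
```

Pairs that align to different contigs are skipped here; they are the ones used later to estimate distances between contigs.
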




Estimate distances between contigs
                     DistanceEst
●   Estimate the distance between contigs using
    the paired reads that align to different contigs

    [Figure: contig graph with edges labelled by distance estimates:
     d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3]

Maximum likelihood estimator
                    DistanceEst
●   Use the empirical paired-
    end size distribution
●   Maximize the likelihood
    function
●   Find the most likely
    distance between the two
    contigs
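The three steps can be sketched as a grid search over candidate distances. The assumptions are labelled here and in the code: `hist` is an empirical fragment-size histogram, each entry of `partial_sizes` is the part of a fragment already covered by the two contigs (so fragment size = partial size + d), and unseen sizes get a small floor probability. The actual DistanceEst model may differ.

```python
import math


def estimate_distance(hist, partial_sizes, d_range):
    """Return the distance d in d_range that maximizes the likelihood."""
    total = sum(hist.values())
    floor = 1e-9  # assumed pseudo-probability for sizes never observed

    def log_likelihood(d):
        # Each pair implies a fragment of size x + d; score it with the
        # empirical distribution and sum the logs over all pairs.
        return sum(math.log(max(hist.get(x + d, 0) / total, floor))
                   for x in partial_sizes)

    return max(d_range, key=log_likelihood)


hist = {498: 1, 499: 2, 500: 4, 501: 2, 502: 1}  # toy distribution
partial_sizes = [475, 476, 474]                  # toy spanning pairs
print(estimate_distance(hist, partial_sizes, range(0, 51)))
```

With these toy numbers the pairs imply fragments of 500, 501 and 499 bases when d = 25, which is the most probable combination under the histogram.
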



Paired-end algorithm
                   continued...
●   Generate paths: find paths through the contig
    adjacency graph that agree with the distance estimates
●   Merge paths: merge overlapping paths
●   Generate contigs: merge the contigs in these paths
    and output the FASTA file

Find consistent paths
                    SimpleGraph
●   Find paths through the contig adjacency graph
    that agree with the distance estimates




    [Figure: a path whose actual distance of 3 agrees with the estimate d = 4 ± 3]

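A sketch of the search, under simplifying assumptions: the distance along a path is taken as the summed length of the intermediate contigs, overlaps between adjacent contigs are ignored, and contig lengths are positive. The names `consistent_paths`, `adj` and `length` are hypothetical.

```python
def consistent_paths(adj, length, src, dst, d, err):
    """Find paths from src to dst whose intermediate contigs span d ± err."""
    found = []

    def walk(path, gap):
        if gap > d + err:
            return  # already too long to agree with the estimate
        for nxt in sorted(adj.get(path[-1], ())):
            if nxt == dst:
                if abs(gap - d) <= err:
                    found.append(path + [dst])
            elif nxt not in path:
                walk(path + [nxt], gap + length[nxt])

    walk([src], 0)
    return found


# Two routes from A to B: through x (3 bases) or through y (20 bases).
adj = {"A": {"x", "y"}, "x": {"B"}, "y": {"B"}}
length = {"x": 3, "y": 20}
print(consistent_paths(adj, length, "A", "B", d=4, err=3))
```

Only the route through x is kept: its distance of 3 lies within 4 ± 3, as in the slide's example, while 20 does not.
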
Merge overlapping paths
                    MergePaths
●   Merge paths that overlap
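As an illustration, two paths can be merged when the end of one matches the start of the other. This sketch treats a path as a list of contig identifiers; it shows the idea only and is not the MergePaths implementation.

```python
def merge_paths(a, b):
    """Merge b onto a if a suffix of a equals a prefix of b, else None."""
    # Try the longest possible overlap first.
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None


print(merge_paths(["1", "2", "3"], ["2", "3", "4"]))
print(merge_paths(["1", "2"], ["3", "4"]))
```
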




Generate the FASTA output
●   Merge the contigs in these paths.
●   Output the FASTA file




    GATTTTTG   GAC GTCTTGATCTT   CAC    GTATTG CTATT

Assembly process
●   Stage 1 completed in 3.5 hours
●   Used 72 processors on six machines
●   Peak memory usage of 180 GB of RAM
●   Stage 2 completed in 9 hours
●   Used 12 processors on one machine
●   Peak memory usage of 48 GB of RAM
●   Assembly parameters k=64 s=200 n=10

Assembly results
          Level 1: 500-bp paired-end reads
●   Assembled half the genome in 7,676 contigs
    larger than the N50 of 50,612 bp
●   Assembled 1.81 Gbp in 170,407 contigs larger
    than 200 bp
●   The largest contig is 1,158,576 bp
●   Removed 1,296,819 variant sequences
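For readers unfamiliar with N50, the sketch below computes it from a list of contig lengths: the length of the contig at which the running total, taken from the largest contig downwards, first reaches half of the total. It measures against the summed contig lengths; whether the figure on the slide is relative to the assembly size or to the genome size is not stated.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (None if empty)."""
    total = sum(lengths)
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n
    return None


# Toy lengths: 10 + 9 + 8 = 27 is half of the total of 54.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```
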




Alignments to the reference
●   Aligned the 170,407 contigs longer than 200 bp
●   96.2% align over at least 99% of their length
●   1.2% align over between 90% and 99% of their length
●   2.5% align over less than 90% of their length
    [Pie chart of the three classes: >99%, 90-99%, <90%]

Works in progress
●   Replace complex variant sequences with Ns
●   Scaffold over gaps and simple repeat sequence
    using large fragment mate-pair reads
●   Fill in gaps with sequence using localized
    microassembly

ABySS Publications
         IEEE InfoVis 2009
Acknowledgments
    Supervisors
●   İnanç Birol
●   Steven Jones
    Team
●   Readman Chiu
●   Rod Docking
●   Karen Mungall
●   Jenny Qian
