Fast algorithms for large scale genome alignment and comparison

Davide Eynard
eynard@elet.polimi.it

Dipartimento di Elettronica e Informazione
Politecnico di Milano

2007/05/28

Algorithms for Computational Molecular Biology
The article(s)

        A.L. Delcher, S. Kasif, R.D. Fleischmann, J.
         Peterson, O. White, S.L. Salzberg: “Alignment of
         whole genomes”, 1999
        A.L. Delcher, A. Philippy, J. Carlton, S.L.
         Salzberg: “Fast algorithms for large-scale
         genome alignment and comparison”, 2002
        S. Kurtz, A. Philippy, A.L. Delcher, M. Smoot, M.
         Shumway, C. Antonescu, S.L. Salzberg:
         “Versatile and open software for comparing large
         genomes”, 2004



p. 2    2007/05/28          ACMB
The problem

        When the genome sequence of two closely
         related organisms becomes available, one of the
         first questions researchers want to ask is how the
         two genomes align
        Aligning (very) long sequences
          • Single gene sequences may be as long as tens of
            thousands of nucleotides
          • Whole genomes are usually millions of nucleotides
            or larger!




The challenge

        Naïve
          • O(n^2) space and time
        Hashing
          • faster, but still partly O(n^2)
        Dynamic Programming
          • O(n) space, takes more time
        MUMmer
          • Suffix trees: O(n) space and time
          • LIS: O(k log k) where k is the number of MUMs




The algorithm

       1) Perform a Maximal Unique Match (MUM)
         decomposition of the two genomes
       2) Sort the matches found in the MUM alignment,
         and extract the LIS (Longest Increasing
         Subsequence) of matches that occur in the same
         order in both genomes
       3) Close the gaps in the alignment, performing
         local identification of large inserts, repeats, small
         mutated regions, tandem repeats and SNPs
       4) Output the alignment



MUM: the suffix tree
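A minimal sketch of what a MUM is, assuming the usual definition: a match that cannot be extended in either direction and that occurs exactly once in each sequence. This brute-force version is quadratic and does not use a suffix tree, which is what gives MUMmer its linear bounds. The names `find_mums` and `min_len` are illustrative, not MUMmer's.

```python
def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match.

    Brute force for illustration only: O(len(a) * len(b)) starting points.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a sequence.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            sub = a[i:i + k]
            # Unique means exactly one occurrence in each sequence.
            if a.find(sub) == a.rfind(sub) and b.find(sub) == b.rfind(sub):
                mums.append((i, j, k))
    return mums


print(find_mums("GATTACA", "CATTAGA", min_len=3))
```

The uniqueness test compares the first and last occurrence of the substring, so overlapping repeats are also detected.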




Longest Increasing Subsequence
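A sketch of the O(k log k) step, assuming matches are given as hypothetical `(pos_a, pos_b, length)` tuples: sort by position in the first genome, then take a longest increasing subsequence of the positions in the second genome. This is the textbook binary-search LIS; real matches have lengths and can overlap, which this sketch ignores.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []       # tails[r]: smallest tail value of a run of length r + 1
    tails_idx = []   # index in `values` of that tail
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        r = bisect_left(tails, v)
        if r == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[r] = v
            tails_idx[r] = i
        prev[i] = tails_idx[r - 1] if r > 0 else -1
    # Walk the back-pointers from the tail of the longest run.
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def chain_matches(matches):
    """Keep the matches that appear in the same order in both genomes."""
    ordered = sorted(matches)                      # sort by pos_a
    keep = longest_increasing_subsequence([m[1] for m in ordered])
    return [ordered[i] for i in keep]


matches = [(0, 0, 5), (10, 40, 4), (20, 12, 6), (30, 25, 5), (45, 50, 3)]
print(chain_matches(matches))
```

In this toy input the match at `(10, 40, 4)` is out of order with respect to its neighbours, so it is dropped from the chain.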




Closing the gaps
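A rough sketch of the gap-closing idea, assuming an ordered chain of hypothetical `(pos_a, pos_b, length)` matches: cut out the unmatched stretch between each pair of consecutive matches and label it. The three labels below are a simplification chosen for illustration; repeats and tandem repeats need more analysis than this.

```python
def gaps_between(a, b, chain):
    """Yield (gap_a, gap_b): the unmatched text between consecutive matches."""
    for (pa, pb, length), (qa, qb, _) in zip(chain, chain[1:]):
        yield a[pa + length:qa], b[pb + length:qb]


def classify_gap(gap_a, gap_b):
    """Very coarse labelling of a gap (illustrative categories only)."""
    if not gap_a and not gap_b:
        return "none"
    if len(gap_a) == 1 and len(gap_b) == 1:
        return "SNP"
    if not gap_a or not gap_b:
        return "insert"
    return "polymorphic region"   # candidate for a small DP alignment


a = "ACGTACGT" + "A" + "TTGGCCAA" + "GGG" + "CATCATCA"
b = "ACGTACGT" + "C" + "TTGGCCAA" + "CATCATCA"
chain = [(0, 0, 8), (9, 9, 8), (20, 17, 8)]
for gap_a, gap_b in gaps_between(a, b, chain):
    print(repr(gap_a), repr(gap_b), classify_gap(gap_a, gap_b))
```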




MUMmer v2.0

        Relaxes the uniqueness constraint
        Faster, takes less space
        Algorithmic improvements
          • memory
          • streaming query
          • new module to cluster matches
        Able to align not only simple DNA sequences, but
         also human chromosomes
        Able to align incomplete genomes and protein
         sequences



Time-space improvements

         The amount of memory used in the suffix tree
          has been reduced
           • from at most 37 bytes/bp to at most 20 bytes/bp
         Speed has increased
           • E. coli vs. V. cholerae: from 74 s and 293 MB to
             27 s and 100 MB
         Suffix tree is used to store only one sequence,
          while the second one (query) is streamed against
          the suffix tree
           • once the suffix tree has been built, multiple queries
             can be streamed
           • quick way to find the next match
           • matches are maximal on the right hand side
Streaming queries
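A sketch of the streaming idea: index the reference once, then run any number of queries against it, reporting at each query position the longest match, which is maximal on the right by construction. To stay short it swaps in a naively built sorted suffix list with binary search instead of a suffix tree, and it simply jumps past each reported match, whereas a suffix tree can move to the next position without restarting. `ReferenceIndex` and `stream_matches` are illustrative names.

```python
from bisect import bisect_left


def common_prefix_len(s, t):
    n = 0
    while n < len(s) and n < len(t) and s[n] == t[n]:
        n += 1
    return n


class ReferenceIndex:
    """Sorted suffixes of the reference (quadratic memory, demo only)."""

    def __init__(self, reference):
        pairs = sorted((reference[i:], i) for i in range(len(reference)))
        self.suffixes = [s for s, _ in pairs]
        self.positions = [i for _, i in pairs]

    def longest_match(self, text):
        """Return (reference position, length) of the longest prefix of text."""
        k = bisect_left(self.suffixes, text)
        best = (0, 0)
        # The best candidate is always a neighbour of the insertion point.
        for cand in (k - 1, k):
            if 0 <= cand < len(self.suffixes):
                n = common_prefix_len(self.suffixes[cand], text)
                if n > best[1]:
                    best = (self.positions[cand], n)
        return best


def stream_matches(index, query, min_len):
    """Scan the query left to right, collecting (ref_pos, query_pos, length)."""
    matches = []
    q = 0
    while q < len(query):
        ref_pos, length = index.longest_match(query[q:])
        if length >= min_len:
            matches.append((ref_pos, q, length))
            q += length
        else:
            q += 1
    return matches


index = ReferenceIndex("ACGTACGGTCA")        # built once
for query in ("TTACGGAA", "GGTCAT"):         # several queries reuse it
    print(query, stream_matches(index, query, min_len=4))
```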




Clustering of matches

         Old version computed a single longest alignment
          between the sequences
         New version works as follows:
           • first, the system outputs a series of separate,
             independent alignment regions
           • clustering is performed by finding pairs of matches
             that are sufficiently close
           • finally, a LIS computation is done within each
             component to yield the most consistent sequence
             of matches in the cluster
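The clustering step can be sketched as connected components over a "sufficiently close" relation. The closeness rule used here (start positions within `max_gap` in both sequences) is an assumption made for illustration; the slides do not give the actual criterion. Within each cluster, an LIS pass would then pick the consistent matches.

```python
def cluster_matches(matches, max_gap):
    """Group (pos_a, pos_b, length) matches into connected components."""
    parent = list(range(len(matches)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, (ai, bi, _) in enumerate(matches):
        for j in range(i + 1, len(matches)):
            aj, bj, _ = matches[j]
            # Hypothetical closeness rule: near each other in both sequences.
            if abs(aj - ai) <= max_gap and abs(bj - bi) <= max_gap:
                parent[find(i)] = find(j)

    clusters = {}
    for i, m in enumerate(matches):
        clusters.setdefault(find(i), []).append(m)
    return sorted(sorted(c) for c in clusters.values())


matches = [(0, 100, 20), (30, 130, 25), (70, 165, 20),
           (500, 20, 30), (540, 60, 22)]
for cluster in cluster_matches(matches, max_gap=50):
    print(cluster)
```

The first and third matches are more than `max_gap` apart but still share a cluster, because both are close to the second one.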




Alignment of incomplete genomes

         In a typical whole-genome shotgun sequencing project,
          the genome is broken up into millions of pieces
           • If the reads are generated at random, then >99%
             of a genome will be covered by sequencing
             enough reads to cover the genome eight times
           • The result of assembly is usually a collection of
             large, unordered DNA sequences called contigs
         NUCmer (nucleotide MUMmer) is a multiple-contig
          alignment program that uses MUMmer 2
          as its core alignment engine
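The >99% figure is consistent with the standard Poisson approximation for random reads (an assumption here; the slide does not say which model it relies on): at coverage depth c, a given base is missed with probability about e^(-c), so the expected covered fraction is 1 - e^(-c).

```python
import math


def expected_covered_fraction(depth):
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-depth)


for depth in (1, 2, 4, 8):
    print(f"{depth}x coverage: {expected_covered_fraction(depth):.4%} of bases")
```

The model is idealized: it ignores read length effects at the ends and any cloning or sequencing bias.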




Alignment of incomplete genomes

        1) NUCmer input: two multi-FASTA files representing
          partial or complete assemblies
        2) Create a map of all contig positions within each
          file
        3) Concatenate the contigs of each file separately
          and run MUMmer to find exact matches
        4) Map matches to separate contigs
        5) MUMs are clustered together if they are
          separated by no more than a user-specified
          distance
        6) Dynamic programming is used to align
          sequences between the MUMs
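Steps 2 to 4 can be sketched as an offset table plus a binary search. The function names and the `#` separator are hypothetical choices for this example, not NUCmer's internals.

```python
from bisect import bisect_right


def build_contig_map(contigs, sep="#"):
    """Concatenate (name, sequence) pairs and record each contig's start."""
    starts, names, parts = [], [], []
    offset = 0
    for name, seq in contigs:
        starts.append(offset)
        names.append(name)
        parts.append(seq)
        offset += len(seq) + len(sep)
    return sep.join(parts), starts, names


def locate(pos, starts, names):
    """Map a position in the concatenation to (contig name, local offset)."""
    k = bisect_right(starts, pos) - 1
    return names[k], pos - starts[k]


contigs = [("c1", "ACGT"), ("c2", "GGCCA"), ("c3", "TT")]
concatenated, starts, names = build_contig_map(contigs)
print(concatenated, starts)
print(locate(7, starts, names))
```

A match found at some position of the concatenated string is reported against its own contig by calling `locate` on its start position. A match that runs across a separator would still have to be rejected or split, which this sketch leaves out.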

NUCmer




PROmer

        1) Given two multi-FASTA files, PROmer translates the
          DNA to amino acids
        2) An index is created that maps all protein
          sequences and lengths to the source DNA
        3) Pseudo-proteomes (amino acid sequences) are
          passed to MUMmer
        4) The index is used to translate the matches back
          to the original DNA input
        5) Clustering step
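A sketch of steps 1, 2 and 4, assuming translation in all six reading frames with the standard genetic code. The frame numbering (1 to 3 forward, -1 to -3 on the reverse complement) and the function names are conventions picked for this example.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    x + y + z: AMINO[16 * i + 4 * j + k]
    for i, x in enumerate(BASES)
    for j, y in enumerate(BASES)
    for k, z in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate one frame; unknown codons become 'X', stops become '*'."""
    seq = dna if frame > 0 else reverse_complement(dna)
    start = abs(frame) - 1
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(start, len(seq) - 2, 3)
    )


def aa_to_dna_start(aa_pos, frame, dna_len):
    """0-based forward-strand coordinate where the codon of aa_pos begins."""
    offset = abs(frame) - 1 + 3 * aa_pos
    if frame > 0:
        return offset
    # On the reverse strand the codon occupies the mirrored positions.
    return dna_len - offset - 3


dna = "ATGGCCTAA"
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, translate(dna, frame))
print(aa_to_dna_start(1, 1, len(dna)), aa_to_dna_start(2, -1, len(dna)))
```

Keeping the frame and the DNA length alongside each translated sequence is what makes it possible to turn a protein-level match back into DNA coordinates.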




MUMmer v3.0

         New improvements in code
           • slightly faster than 2.0, 25% less memory
         More modular and configurable
           • possibility to build hybrid systems
         Ability to run a multi-contig query against a
          multi-contig reference
         Non-unique maximal matches
         Speed-up of NUCmer and PROmer modules
          (approx. 10-fold)
         Graphical viewers



Graphical interfaces




Graphical interfaces




Graphical interfaces




That's All, Folks



                          Thank you!
                     Questions are welcome




