Overview of Genome Assembly Algorithms

Overview of genome assembly algorithms, with some graph theory background, given as an invited lecture for a George Washington University course.

Transcript

  • 1. Genome Assembly Algorithms and Software (or... what to do with all that sequence data?). Konstantinos Krampis, Asst. Professor, Informatics, J. Craig Venter Institute. George Washington University, Nov. 2nd 2011. Sections: Introduction; OLC; Graph theory and assembly; deBruijn - Euler.
  • 2. Introduction: Why do we need genome assembly; Definitions of genome assembly. OLC: Overlap; Layout; Consensus; OLC assembly software and publications. Graph theory and assembly: Definition of a graph; Graphs and genome assembly. deBruijn - Euler: An alternative assembly graph; Constructing a de Bruijn graph from reads; Genome assembly from de Bruijn graphs; deBruijn assembly software and publications.
  • 3. Cannot read the complete genome with the sequencer from one end to the other! DNA isolated from a cell is amplified. It is broken into fragments (shearing). The fragments are "read" with the sequencer. The fragments (reads) are then used to reconstruct the genome. Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press.
  • 4. Assembly: a hierarchical process to reconstruct the genome from reads. Assemble the puzzle of the genome from the reads: overlaps connect the pieces. Oversample the genome so that reads overlap. Key approach: a data structure representing overlaps, and algorithms operating on that data structure. Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press.
  • 5. Two major algorithmic paradigms for genome assembly. Overlap - Layout - Consensus (OLC): well established, more powerful method, but more difficult to implement. OLC: first to be used successfully for complex eukaryotic genomes (Drosophila, H. sapiens). deBruijn - Euler: newer, easier to implement, problematic in complex genomes (for current implementations).
  • 6. Find Overlaps by aligning the sequence of the reads. Layout the reads based on which aligns to which. Get Consensus by joining all read sequences, merging overlaps. The sequencer reads in a random direction, left-to-right or right-to-left. Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press.
  • 7. Overlap: sequence alignment, all-against-all reads (Smith-Waterman, BLAST, other?). Computationally intensive but easily parallelizable. Represent each read overlap by connecting the two reads with a directed link. This is the first step in creating the genome assembly graph (more later). Credit: Kececioglu and Myers 1995, Algorithmica 13:7-51.
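
The overlap step can be sketched in a few lines of Python. This is only an illustration under strong assumptions: reads are error-free, all come from the same strand, and an exact suffix-prefix match stands in for a real alignment such as Smith-Waterman. The function names and the `min_len` threshold are hypothetical and not taken from any assembler.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest exact match is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_len=3):
    """All-against-all comparison of reads.

    Each overlap becomes a directed link (a, b), stored with its length.
    """
    edges = {}
    for a, b in permutations(reads, 2):
        length = suffix_prefix_overlap(a, b, min_len)
        if length > 0:
            edges[(a, b)] = length
    return edges


# Toy reads (made up for illustration).
reads = ["ATGCGT", "GCGTAC", "TACGGA"]
graph = overlap_graph(reads)
for (a, b), length in graph.items():
    print(f"{a} -> {b} (overlap {length})")
```

Every ordered pair of reads is compared independently, which is why this step is expensive but easy to split across many processors.
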
  • 8. Layout: create a consistent, (ideally) linear ordering of the reads. Remove redundancy, so no two dovetails leave the same edge. No containment edge is followed by a dovetail edge. Remove cycles: one link in, one out.
  • 9. Consensus: Multiple Sequence Alignment (ClustalW) algorithms? No phylogeny here... Vote for the most abundant nucleotide at each position. Incorporate read quality data. Create a pre-consensus from high-quality reads, and align the remaining reads to it.
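
The per-position vote can be illustrated with a small Python sketch. It assumes the layout step has already assigned each read a start offset, that there are no insertions or deletions, and it ignores read quality values. The `(offset, read)` layout format and the toy reads are assumptions made for this example.

```python
from collections import Counter


def consensus(layout):
    """Majority vote per column.

    `layout` is a list of (offset, read) pairs: each read is placed at its
    offset and the most abundant nucleotide in every column is kept.
    Columns with no coverage are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# The third read carries a sequencing error (C instead of G at column 4),
# which is outvoted by the two reads that agree.
layout = [(0, "ATGCGT"), (2, "GCGTAC"), (2, "GCCTAC"), (5, "TACGGA")]
print(consensus(layout))  # ATGCGTACGGA
```

A quality-aware version would add each base's quality score to the column tally instead of a count of one.
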
  • 10. Celera Assembler: developed at Celera Genomics for the first Drosophila and human genome assemblies. Continued development at the J. Craig Venter Institute as an open source project: http://wgs-assembler.SourceForge.net (Licence: GPL). Plenty of wiki (developer + user) documentation, examples, user forums. Other OLC implementations: Arachne, PCAP, Newbler, Phrap, TIGR Assembler.
  • 11. Celera Assembler publications: Myers et al (2000) A whole-genome assembly of Drosophila; Levy et al (2007) The diploid genome sequence of an individual human; Zimin et al (2009) The domestic cow, Bos taurus; Dalloul et al (2010) The domestic turkey, Meleagris gallopavo; Lorenzi et al (2010) New assembly of Entamoeba histolytica; Lawniczak et al (2010) Divergence in Anopheles gambiae; Jones et al (2011) The marine filamentous cyanobacterium Lyngbya majuscula; Miller et al, The Tasmanian devil, Sarcophilus harrisii; Prüfer et al, The great ape bonobo, Pan paniscus; Gordon et al, The cotton bollworm moth, Helicoverpa.
  • 12. And now a bit of Graph Theory...
  • 13. Graph G with set of vertices (nodes) V: {P,T,Q,S,R} and set of edges (links between nodes) E: {(P,T),(P,Q),(P,S),(Q,T),(S,T),(Q,S),(S,Q),(Q,R),(R,S)}. Walk from P to R: (P,Q),(Q,R). Walk from R to T: (R,S),(S,Q),(Q,T) or (R,S),(S,T). Walk from R to P: not possible. Credit: Introduction to Graph Theory, Robert J. Wilson.
  • 14. Trail: a walk of the graph where each edge is visited only once. Example trail: (P,Q), (Q,R), (R,S), (S,Q), (Q,S), (S,T). Path: a walk where each vertex is visited once. Example path: (P,Q), (Q,R), (R,S), (S,T).
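
The definitions of walk, trail and path can be checked directly on the example graph G. The sketch below treats every edge as directed, which is an assumption, although it matches the statement that no walk leads from R to P. The helper names are hypothetical.

```python
# Edge set E of the example graph G, read as directed edges (an assumption).
EDGES = [("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
         ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")]
EDGE_SET = set(EDGES)


def is_walk(steps):
    """Every step is an edge of G and each step starts where the last ended."""
    if not all(step in EDGE_SET for step in steps):
        return False
    return all(a[1] == b[0] for a, b in zip(steps, steps[1:]))


def is_trail(steps):
    """A walk in which no edge is used more than once."""
    return is_walk(steps) and len(set(steps)) == len(steps)


def is_path(steps):
    """A walk in which no vertex is visited more than once."""
    if not steps or not is_walk(steps):
        return False
    vertices = [steps[0][0]] + [end for _, end in steps]
    return len(set(vertices)) == len(vertices)


def reachable(start, goal):
    """Depth-first search: is there any walk from `start` to `goal`?"""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        if u == goal:
            return True
        for a, b in EDGES:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return False


trail = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "Q"), ("Q", "S"), ("S", "T")]
path = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
print(is_trail(trail), is_path(trail))  # True False
print(is_trail(path), is_path(path))    # True True
print(reachable("R", "T"), reachable("R", "P"))  # True False
```

The example trail is not a path because it passes through Q and S twice, while every path is automatically a trail.
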
  • 15. (Figure only.) Credit: Saad Mneimneh, CUNY.
  • 16. Represent sequence overlaps as a graph with weighted edges. SCS (shortest common superstring) solution: find a path that visits every vertex exactly once and maximizes the sum of edge weights. This is the Hamiltonian Cycle or Traveling Salesman Problem.
  • 17. Which edge to start from? NO: misses a vertex. NO: misses edge with large weight.
  • 18. YES!: all vertices and edge with large weight.
  • 19. A more realistic version of a read / string overlap graph (C. jejuni). Credit: Eugene W. Myers, Bioinformatics 21:79-85.
  • 20. Computational complexity: finding the SCS solution by searching for a Hamiltonian Cycle on a graph is a difficult algorithmic problem (NP-hard). Approximation or greedy algorithms can yield 2- to 4-approximation solutions (at most two to four times the length of the optimal, shortest string). Transformation of the Overlap Graph to a String Graph leads to a polynomial-time solution. No polynomial-time (P: O(n), O(n^2), O(n^3), etc.) assembler implementation yet.
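
A greedy heuristic for the shortest common superstring can be sketched as follows: repeatedly merge the two strings with the largest overlap until one string remains. This is an illustration of the greedy idea only, assuming error-free, same-strand reads. It is not guaranteed to find the shortest superstring, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(reads):
    """Greedy merging: always join the pair with the largest overlap."""
    # Drop duplicates and reads fully contained in another read.
    strings = sorted(
        r for r in set(reads)
        if not any(r != s and r in s for s in reads)
    )
    while len(strings) > 1:
        best_len, best_i, best_j = -1, None, None
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    length = overlap(a, b)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        merged = strings[best_i] + strings[best_j][best_len:]
        strings = [s for k, s in enumerate(strings)
                   if k not in (best_i, best_j)] + [merged]
    return strings[0] if strings else ""


reads = ["ATGCGT", "GCGTAC", "TACGGA"]
superstring = greedy_superstring(reads)
print(superstring)  # ATGCGTACGGA
```

Each round scans all ordered pairs, so the sketch runs in polynomial time, but a locally best merge can block a better global arrangement, which is why the result is only an approximation.
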
  • 21. Pevzner, Tang and Waterman, An Eulerian path approach to DNA fragment assembly, PNAS 98 (2001): 9748-9753.
  • 22. deBruijn graph: a directed graph representing overlaps between sequences of symbols. Credit: Wikipedia.
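
One common way to build such a graph from reads is to cut every read into k-mers, use the (k-1)-mers as nodes, and draw one directed edge per distinct k-mer from its prefix to its suffix. The sketch below follows that convention. It assumes error-free, same-strand reads, and the value of `k` and the toy reads are made up for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph as {node: [successor, ...]}.

    Nodes are (k-1)-mers. Each distinct k-mer found in the reads adds one
    directed edge from its first k-1 letters to its last k-1 letters.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy reads sampled from the made-up sequence ATGGCGTA.
reads = ["ATGGC", "TGGCG", "GCGTA"]
graph = de_bruijn(reads, k=3)
for node, successors in graph.items():
    print(node, "->", ", ".join(successors))
```

Unlike the overlap graph, no read-against-read alignment is needed: overlaps are implied by shared k-mers, so construction takes a single pass over the reads.
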
  • 23. (Figure only; no text on slide.)
  • 24. (Figure only; no text on slide.)
  • 25. (Figure only; no text on slide.)
  • 26. In a real genome scenario... Credit: Flicek and Birney 2009, Nature Methods 6, S6 - S12.
  • 27. Euler's algorithm: using Euler's algorithm we can find a path that visits each edge of the de Bruijn genome assembly graph once, in order to concatenate the edge labels and "spell out" the assembly. Polynomial time! Credit: Wikipedia.
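
A standard way to find such a path is Hierholzer's algorithm, sketched below. The slide does not say which procedure is meant by "Euler's algorithm", so this is one possible choice. The sketch assumes the graph is non-empty, is given as `{node: [successor, ...]}` with (k-1)-mer nodes, and actually has an Eulerian path. The small graph literal is a made-up example.

```python
def eulerian_path(graph):
    """Hierholzer's algorithm on a directed graph {node: [successor, ...]}.

    Assumes a non-empty graph in which an Eulerian path exists.
    Runs in time linear in the number of edges.
    """
    out_edges = {u: list(vs) for u, vs in graph.items()}
    # balance = out-degree minus in-degree for every node
    balance = {}
    for u, vs in graph.items():
        balance[u] = balance.get(u, 0) + len(vs)
        for v in vs:
            balance[v] = balance.get(v, 0) - 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((u for u, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the node
    return path[::-1]


def spell(path):
    """Concatenate the nodes along the path, adding one new letter per step."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy de Bruijn graph with 2-mer nodes (hypothetical example).
graph = {"AT": ["TG"], "TG": ["GG"], "GG": ["GC"],
         "GC": ["CG"], "CG": ["GT"], "GT": ["TA"]}
path = eulerian_path(graph)
print(spell(path))  # ATGGCGTA
```

Each edge is pushed and popped exactly once, which is where the polynomial (in fact linear) running time comes from.
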
  • 28. Euler assembler (the very first), Pevzner et al 2001, PNAS 98:9748-9753. Velvet assembler (more user friendly). Both of those assemblers store the complete graph in computer memory: 512GB-1024GB for human genomes. At JCVI we have two 1024GB (1TB) RAM servers for assembly. Others: ABySS, YAGA, Contrail-Bio, PASHA: parallel (distributed memory) assemblers on computer clusters.
  • 29. Thank you! Contact: kkrampis@jcvi.org. We hire interns at the J. Craig Venter Institute: http://www.jcvi.org/cms/education/internship-program/ Some of my other projects - Cloud Computing: http://tinyurl.com/cloudbiolinux-jcvi http://www.cloudbiolinux.org