
- 1. Genome Assembly Algorithms and Software (or... what to do with all that sequence data?)
  - Konstantinos Krampis, Asst. Professor, Informatics, J. Craig Venter Institute
  - George Washington University, Nov. 2nd 2011
  - Sections: Introduction; OLC; Graph theory and assembly; deBruijn - Euler
- 2. Introduction
  - Why do we need genome assembly
  - Definitions of genome assembly
  - OLC: Overlap; Layout; Consensus; OLC assembly software and publications
  - Graph theory and assembly: Definition of a graph; Graphs and genome assembly
  - deBruijn - Euler: An alternative assembly graph; Constructing a de Bruijn graph from reads; Genome assembly from de Bruijn graphs; deBruijn assembly software and publications
- 3. Cannot read the complete genome with the sequencer from one end to the other!
  - DNA isolated from a cell is amplified
  - Broken into fragments (shearing)
  - Fragments are "read" with the sequencer
  - Use the fragments (reads) to reconstruct the genome
  - Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press
- 4. Assembly: hierarchical process to reconstruct the genome from reads
  - Assemble the puzzle of the genome from the reads: overlaps connect the pieces
  - Oversample the genome so that reads overlap
  - Key approach: a data structure representing overlaps, and algorithms operating on that data structure
  - Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press
- 5. Two major algorithmic paradigms for genome assembly
  - Overlap - Layout - Consensus (OLC): well established, more powerful method, but more difficult to implement
  - OLC: first to be used successfully for complex eukaryotic genomes (Drosophila, H. sapiens)
  - deBruijn - Euler: newer, easier to implement, problematic in complex genomes (for current implementations)
- 6. OLC: Overlap, Layout, Consensus
  - Find Overlaps by aligning the sequences of the reads
  - Layout the reads based on which aligns to which
  - Get Consensus by joining all read sequences, merging overlaps
  - The sequencer reads in a random direction, left-to-right or right-to-left
  - Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press
- 7. Overlap
  - Sequence alignment, all-against-all reads (Smith-Waterman, BLAST, other?)
  - Computationally intensive but easily parallelizable
  - Represent each read overlap by connecting the two reads with a directed link
  - First step in creating the genome assembly graph (more later)
  - Credit: Kececioglu and Myers 1995, Algorithmica 13:7-51
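The overlap step can be sketched in a few lines. This is a minimal illustration, not how any of the assemblers named in these slides compute overlaps: it assumes error-free reads on a single strand and uses exact suffix-prefix matching instead of Smith-Waterman or BLAST alignment. The function names `overlap` and `overlap_graph`, the `min_len` parameter, and the example reads are all hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # A hit only counts if the whole suffix of a matches a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_len=3):
    """All-against-all comparison: one directed, weighted link per overlap."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            edges[(a, b)] = n
    return edges


# Hypothetical toy reads sampled from the sequence ACGTACGGATTT.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
graph = overlap_graph(reads)
```

Each pair of reads is compared independently of every other pair, which is why this step parallelizes easily even though the number of comparisons grows quadratically with the number of reads.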
- 8. Layout
  - Create a consistent linear (ideally) ordering of the reads
  - Remove redundancy, so no two dovetails leave the same edge
  - No containment edge is followed by a dovetail edge
  - Remove cycles: one link in, one out
- 9. Consensus
  - Multiple Sequence Alignment (ClustalW) algorithms? No phylogeny here...
  - Vote for the most abundant nucleotide at each position
  - Incorporate read quality data
  - Create a pre-consensus from high-quality reads, and align the remaining reads to it
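The per-position vote can be illustrated with a short sketch. It assumes the layout step has already assigned each read an offset on the consensus coordinate, that there are no insertions or deletions, and it ignores read quality values. The `consensus` function, the `(offset, read)` layout format, and the example reads are hypothetical.

```python
from collections import Counter


def consensus(layout, length):
    """Majority vote per column.

    `layout` is a list of (offset, read) pairs placing each read on the
    consensus coordinate; `length` is the total consensus length.
    """
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Uncovered columns are reported as N.
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# Hypothetical layout; the third read carries one sequencing error (G for C).
layout = [
    (0, "ACGTAC"),
    (3, "TACGGA"),
    (3, "TAGGGA"),
    (6, "GGATTT"),
]
result = consensus(layout, 12)
```

The erroneous base is outvoted by the two reads that agree at that column. A quality-aware version would add each base's quality score to the column tally instead of a plain count of 1.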
- 10. Celera Assembler
  - Developed at Celera Genomics for the first Drosophila and human genome assemblies
  - Continued development at the J. Craig Venter Inst. as an open source project
  - http://wgs-assembler.SourceForge.net (Licence: GPL)
  - Plenty of wiki (developer + user) documentation, examples, user forums
  - Other OLC implementations: Arachne, PCAP, Newbler, Phrap, TIGR Assembler
- 11. Celera Assembler publications
  - Myers et al (2000) A whole-genome assembly of Drosophila
  - Levy et al (2007) The diploid genome sequence of an individual human
  - Zimin et al (2009) The domestic cow, Bos taurus
  - Dalloul et al (2010) The domestic turkey, Meleagris gallopavo
  - Lorenzi et al (2010) New assembly of Entamoeba histolytica
  - Lawniczak et al (2010) Divergence in Anopheles gambiae
  - Jones et al (2011) The marine filamentous cyanobacterium Lyngbya majuscula
  - Miller et al The Tasmanian devil, Sarcophilus harrisii
  - Prüfer et al The great ape bonobo, Pan paniscus
  - Gordon et al The cotton bollworm moth, Helicoverpa
- 12. ...and now a bit of Graph Theory...
- 13. Definition of a graph
  - Graph G with a set of vertices (nodes) V: {P,T,Q,S,R}
  - Set of edges (links between nodes) E: {(P,T),(P,Q),(P,S),(Q,T),(S,T),(Q,S),(S,Q),(Q,R),(R,S)}
  - Walk from P to R: (P,Q),(Q,R)
  - Walk from R to T: (R,S),(S,Q),(Q,T) or (R,S),(S,T)
  - Walk from R to P: not possible
  - Credit: Introduction to Graph Theory, Robert J. Wilson
- 14. Definition of a graph
  - Trail: a walk of the graph where each edge is visited only once
  - Example Trail: (P,Q), (Q,R), (R,S), (S,Q), (Q,S), (S,T)
  - Path: a walk where each vertex is visited only once
  - Example Path: (P,Q), (Q,R), (R,S), (S,T)
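The walk, trail, and path definitions can be checked mechanically on the example graph. The sketch below assumes the edges are directed, which is consistent with the statement that no walk from R to P exists; the function names are hypothetical.

```python
from collections import deque

# Edge set E of the example graph G, treated as directed (an assumption).
EDGES = {
    ("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
    ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S"),
}


def is_walk(steps):
    """Every step is an edge of G and consecutive steps are connected."""
    return all(e in EDGES for e in steps) and all(
        steps[i][1] == steps[i + 1][0] for i in range(len(steps) - 1)
    )


def is_trail(steps):
    """A walk in which no edge is used more than once."""
    return is_walk(steps) and len(set(steps)) == len(steps)


def is_path(steps):
    """A walk in which no vertex is visited more than once."""
    if not steps or not is_walk(steps):
        return False
    vertices = [steps[0][0]] + [v for _, v in steps]
    return len(set(vertices)) == len(vertices)


def reachable(src, dst):
    """Breadth-first search: is there any walk from src to dst?"""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for a, b in EDGES:
            if a == u and b not in seen:
                seen.add(b)
                queue.append(b)
    return False


trail = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "Q"), ("Q", "S"), ("S", "T")]
path = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
```

The example trail revisits the vertices Q and S, so it is a trail but not a path; the example path satisfies both definitions.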
- 15. Credit: Saad Mneimneh, CUNY
- 16. Graphs and genome assembly
  - Represent sequence overlaps as a graph with weighted edges
  - SCS (shortest common superstring) solution: find a Path (visiting every vertex exactly once) that maximizes the weight sum
  - Hamiltonian Cycle or Traveling Salesman Problem
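For a handful of reads, the maximum-weight path through every vertex can be found by brute force. The sketch below is only an illustration under simplifying assumptions (exact overlaps, single strand, no contained reads); it tries every ordering, so it is unusable beyond toy inputs. All function names and the example reads are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Longest exact suffix of `a` matching a prefix of `b` (edge weight)."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def best_hamiltonian_path(reads):
    """Try every ordering of the reads; keep the one with the largest
    total overlap between consecutive reads."""
    best_order, best_weight = None, -1
    for order in permutations(reads):
        weight = sum(overlap(x, y) for x, y in zip(order, order[1:]))
        if weight > best_weight:
            best_order, best_weight = order, weight
    return best_order, best_weight


def spell(order):
    """Merge the reads along the path, collapsing each overlap."""
    superstring = order[0]
    for prev, nxt in zip(order, order[1:]):
        superstring += nxt[overlap(prev, nxt):]
    return superstring


reads = ["GGATTT", "ACGTAC", "TACGGA"]
order, weight = best_hamiltonian_path(reads)
assembly = spell(order)
```

Maximizing the total overlap is the same as minimizing the length of the merged string, since every overlapping base is written only once.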
- 17. Which edge to start from?
  - NO: misses a vertex
  - NO: misses edge with large weight
- 18. YES!: all vertices and edge with large weight
- 19. A more realistic version of a read / string overlap graph (C. jejuni)
  - Credit: Eugene W. Myers, Bioinformatics 21:79-85
- 20. Computational Complexity
  - SCS solution by searching for a Hamiltonian Cycle on a graph is a difficult algorithmic problem (NP-hard)
  - Using approximation or greedy algorithms can yield 2- to 4-approximation solutions (at most twice or four times the length of the optimal, shortest string)
  - Transformation of the Overlap Graph to a String Graph leads to a Polynomial time solution
  - No Polynomial (P: O(n), O(n^2), O(n^3), etc.) assembler implementation yet
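A greedy approximation for the shortest common superstring repeatedly merges the pair of strings with the largest overlap. The sketch below is a textbook-style illustration under the same simplifying assumptions as before (exact overlaps, single strand); it is not the algorithm of any specific assembler, and the names `overlap` and `greedy_scs` are hypothetical.

```python
def overlap(a, b):
    """Longest exact suffix of `a` matching a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_scs(reads):
    """Greedy superstring: always merge the best-overlapping pair first."""
    # Drop duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


reads = ["GGATTT", "ACGTAC", "TACGGA"]
superstring = greedy_scs(reads)
```

Every input read is guaranteed to appear in the result, but the result is not guaranteed to be the shortest possible superstring: an early merge can rule out a better combination later.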
- 21. Pevzner, Tang and Waterman, An Eulerian path approach to DNA fragment assembly, PNAS 98 (2001) 9748-9753
- 22. deBruijn graph: a directed graph representing overlaps between sequences of symbols
  - Credit: Wikipedia
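One common way to build a de Bruijn graph from reads is to cut every read into k-mers and let each distinct k-mer become a directed edge from its (k-1)-base prefix to its (k-1)-base suffix. The sketch below follows that convention as an assumption; real assemblers also handle reverse complements, k-mer counts, and sequencing errors, none of which appear here. The function name `de_bruijn`, the choice k=4, and the reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of nodes it points to.

    Each distinct k-mer found in the reads contributes exactly one edge.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # count every distinct k-mer only once
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


reads = ["ACGTAC", "TACGGA", "GGATTT"]
graph = de_bruijn(reads, k=4)
n_edges = sum(len(targets) for targets in graph.values())
```

No read-against-read comparison is needed: overlaps between reads are implied whenever two reads share a (k-1)-mer node.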
- 26. In a real genome scenario...
  - Credit: Flicek and Birney 2009, Nature Methods 6, S6 - S12
- 27. Euler's algorithm
  - Using Euler's algorithm we can find a path that visits each edge of the de Bruijn genome assembly graph once, in order to concatenate the edge labels and "spell out" the assembly
  - Polynomial time!
  - Credit: Wikipedia
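An Eulerian path can be found in time linear in the number of edges with Hierholzer's algorithm, used here as a stand-in for the "Euler's algorithm" named on the slide. The sketch assumes the graph actually contains an Eulerian path and that nodes are (k-1)-mers, so the assembly is spelled by appending the last base of each node visited. The small graph and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical de Bruijn graph (k = 4) for reads from ACGTACGGATTT.
GRAPH = {
    "ACG": ["CGT", "CGG"],
    "CGT": ["GTA"],
    "GTA": ["TAC"],
    "TAC": ["ACG"],
    "CGG": ["GGA"],
    "GGA": ["GAT"],
    "GAT": ["ATT"],
    "ATT": ["TTT"],
}


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_edges = {u: list(vs) for u, vs in graph.items()}
    # balance = out-degree minus in-degree for every node.
    balance = defaultdict(int)
    for u, vs in graph.items():
        balance[u] += len(vs)
        for v in vs:
            balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    starts = [u for u, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finished
    return path[::-1]


def spell(path):
    """First node in full, then the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


path = eulerian_path(GRAPH)
assembly = spell(path)
```

Each edge is pushed and popped once, which is the source of the polynomial running time; the hard part in practice is that real graphs, with repeats and errors, often admit many Eulerian paths or none.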
- 28. deBruijn assembly software and publications
  - Euler assembler (the very first), Pevzner et al 2001 PNAS 98:9748-9753
  - Velvet assembler (more user friendly)
  - Both of these assemblers store the complete graph in computer memory: 512GB-1024GB for human genomes
  - At JCVI we have two 1024GB (1TB) RAM servers for assembly
  - Others: ABYSS, YAGA, Contrail-Bio, PASHA: parallel (distributed memory) assemblers on computer clusters
- 29. Thank you!
  - Contact: kkrampis@jcvi.org
  - We hire interns at the J. Craig Venter Institute: http://www.jcvi.org/cms/education/internship-program/
  - Some of my other projects - Cloud Computing: http://tinyurl.com/cloudbiolinux-jcvi http://www.cloudbiolinux.org
