2011-04-26_01-velvet-curtain-presentation

Published in: Education

  1. 1. Velvet / Curtain • Matthias Haimel • EBI is an Outstation of the European Molecular Biology Laboratory.
  2. 2. 2 25.04.11 Velvet / Curtain
  3. 3. Overview • De Bruijn Graph • Velvet • Theory • Practice • Data formats and quality • Velvet • Simulation data • Multiple insert lengths • Curtain • Theory • Practice
  4. 4. De Bruijn graph • A concept in combinatorial mathematics • In combinatorics, a de Bruijn graph is usually fully connected • http://en.wikipedia.org/wiki/De_Bruijn_graph • De Bruijn sequence • Related concept • Path through graph • Velvet • de Bruijn-inspired graph structure
  5. 5. De Bruijn graph (Velvet) • Representation of • a sequence based on short words (k-mers) • overlaps between words • K-mer: word of length k • k = 5: GCCTTCCA is split into GCCTT, CCTTC, CTTCC, TTCCA ... • Consecutive k-mers overlap by k-1 bases
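The k-mer decomposition on this slide can be sketched in a few lines of Python. The function name `kmers` is chosen here for illustration and is not part of Velvet.

```python
def kmers(sequence, k):
    """Return every word of length k in the sequence, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GCCTTCCA", 5)
print(words)  # ['GCCTT', 'CCTTC', 'CTTCC', 'TTCCA']

# Consecutive k-mers share k-1 = 4 bases:
# the suffix of one word is the prefix of the next.
for left, right in zip(words, words[1:]):
    assert left[1:] == right[:-1]
```

A sequence shorter than k yields no k-mers, which is why reads shorter than the chosen k contribute nothing to the graph.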
  6. 6. De Bruijn graph (Velvet) • [Figure: de Bruijn graph for the two sequences GCCTTCCAATTT and GCCTTCAAATTT; the paths share a start, split into TTCC ... CAATT and TTCA ... AAATT, and share the end AATTT]
  7. 7. De Bruijn graph representations (Velvet) • [Figure: graph shapes, with example nodes TTCA, ATTC, TCAG] • Error free, no repeat, no polymorphism • Repeat > k-mer length • SNP, variant < k-mer length • Structural variant, inversion • Structural variant, deletion • …
  8. 8. Example • Sequence: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG • Reads: AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
  9. 9. Example • Read: GTCGAGG • GTCG (1x)
  10. 10. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x)
  11. 11. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x), CGAG (1x)
  12. 12. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x), CGAG (1x), GAGG (1x)
  13. 13. Example • New read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (1x)
  14. 14. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x)
  15. 15. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x)
  16. 16. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x), GGCT (1x)
  17. 17. Example • New read: TCGACGC • GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x)
  18. 18. Example • Read: TCGACGC • GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x) • New branch: CGAC (1x), GACG (1x), ACGC (1x)
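The coverage counts built up read by read in slides 9 to 18 can be reproduced with a short sketch. It assumes k = 4, as the slides use, and counts each k-mer once per occurrence. The helper names are illustrative, and strand handling is left out.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every word of length k in the sequence, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads, k):
    """Count how many times each k-mer is seen across all reads (its coverage)."""
    coverage = Counter()
    for read in reads:
        coverage.update(kmers(read, k))
    return coverage


# The three reads walked through in the slides.
reads = ["GTCGAGG", "CGAGGCT", "TCGACGC"]
coverage = count_kmers(reads, 4)
for kmer, count in coverage.items():
    print(f"{kmer} ({count}x)")
```

K-mers shared between reads (TCGA, CGAG, GAGG) end up with coverage 2, while the k-mers unique to TCGACGC stay at 1, matching the side branch on slide 18.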
  19. 19. Example • etc… • GATT (1x) • TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x) • AGAA (1x) • GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x) • TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (11x), TAGA (16x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x) • GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x) • CGAC (1x), GACG (1x), ACGC (1x)
  20. 20. Example • After simplification… • Nodes: GATT, AGAT, GATCCGATGAG, AGAA, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG, CGACGC
  21. 21. Example • Tips removed… • Nodes: AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG
  22. 22. De Bruijn graph biology extensions (Velvet) • Handling of reverse strand • DNA is read in two directions • Paired-end data • Handling small differences, which are “uninteresting” • Errors in sequencing technology • Memory • Regularly uses 80 to 100 GB of real memory • Can easily reach 1 TB of real memory
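Because DNA is read in two directions, a k-mer and its reverse complement describe the same position in the genome. One common way to handle this, shown here as a sketch only, is to store each k-mer under a canonical form (the lexicographically smaller of the two strands). This is an illustrative convention and is not claimed to be how Velvet represents the reverse strand internally.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("TTCC"))  # GGAA
print(canonical("TTCC"))           # GGAA
print(canonical("GGAA"))           # GGAA
```

With this convention a read from either strand increments the same coverage counter.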
  23. 23. Read variety • Short reads ~75bp • Illumina / Solexa • SOLiD (colour space) • Long reads 500-1000 bp • 454 reads • Sanger capillary reads • Paired-end reads • Short reads • short insert length • Mate pair reads • Short reads • long insert length
  24. 24. [Figure: Paired-End and Mate Pair]
  25. 25. Short paired-end / mate pair reads • Velvet expects Illumina paired-end orientation: (L-> <-R)
  26. 26. Short paired-end / mate pair reads • Illumina mate-pair orientation: (<-L R->) • Reverse complement mate-pair reads to obtain paired-end orientation (L-> <-R)
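The conversion from mate-pair orientation (<-L R->) to paired-end orientation (L-> <-R) can be sketched as follows. The sketch assumes each read of the pair is reverse-complemented individually, the order of the pair is kept, and the quality string is reversed so it stays aligned with the bases. The function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def flip_read(sequence, quality):
    """Reverse-complement a read; the quality string is only reversed."""
    return reverse_complement(sequence), quality[::-1]


def mate_pair_to_paired_end(left, right):
    """Turn an outward-facing pair (<-L R->) into an inward-facing pair (L-> <-R).

    Each argument is a (sequence, quality) tuple.
    """
    return flip_read(*left), flip_read(*right)


left, right = mate_pair_to_paired_end(("AACG", "IIH#"), ("TTGC", "I5II"))
print(left)   # ('CGTT', '#HII')
print(right)  # ('GCAA', 'II5I')
```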
  27. 27. Velvet algorithms • Remove Bubbles • Tour Bus • Velvet parameters • -max_branch_length • -max_divergence • -max_gap_count
  28. 28. Example • Nodes: AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG • Bubble between GGCT (11x) and TAGA (16x) • Low-coverage branch: GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x) • High-coverage branch: GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x)
  29. 29. Example • Bubbles removed… by Tour Bus • Nodes: AGAT, GATCCGATGAG, TAGTCGA, CGAG, GAGGCT, GGCT, GCTTTAG, TAGA, AGAGA, AGACAG
  30. 30. Example • Final simplification… • Nodes: AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG
  31. 31. Example • Sequence: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG • Final simplification… • Nodes: AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG • One possible walk through the graph ... • TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, GAGGCTTTAGA, AGAGACAG
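To check that the walk on slide 31 spells the original sequence, the nodes can be merged on their k-1 overlaps. This is a minimal sketch assuming k = 4 (so neighbouring nodes overlap by 3 bases); `spell_walk` is an illustrative name.

```python
def spell_walk(nodes, k):
    """Concatenate the nodes of a walk, merging the k-1 bases shared by neighbours."""
    overlap = k - 1
    sequence = nodes[0]
    for node in nodes[1:]:
        if sequence[-overlap:] != node[:overlap]:
            raise ValueError(f"{node} does not overlap the walk so far")
        sequence += node[overlap:]
    return sequence


walk = ["TAGTCGAG", "GAGGCTTTAGA", "AGATCCGATGAG", "GAGGCTTTAGA", "AGAGACAG"]
print(spell_walk(walk, 4))
# TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
```

The node GAGGCTTTAGA is visited twice, which is how the repeat in the sequence is represented by a single node in the simplified graph.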
  32. 32. N50 • Total • N90 • N50 • N10
  33. 33. N50 • Total • 4,295,113bp • N90 • 439bp • N50 • 3,119bp • N10 • 13,519bp
  34. 34. N50 • N50 is the length of the smallest contig in the set that • contains the fewest (largest) contigs • whose combined length represents at least 50% of the assembly • N10 • Same definition with at least 10% of the assembly, so only the largest contigs count • http://www.broadinstitute.org/crd/wiki/index.php/N50
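The N50 definition can be turned into a short calculation: sort the contigs from longest to shortest, add up lengths until the target fraction of the assembly is reached, and report the length of the last contig added. The function name `nx` and the toy contig lengths below are made up for illustration.

```python
def nx(contig_lengths, percent):
    """Length of the smallest contig in the smallest set of largest contigs
    whose combined length is at least `percent` % of the total assembly."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # toy assembly, 32 bp in total
print(nx(lengths, 10))  # 8
print(nx(lengths, 50))  # 8
print(nx(lengths, 90))  # 2
```

As in the figures on slide 33, N10 is never smaller than N50, and N50 is never smaller than N90.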
  35. 35. Velvet practical: Part 1 • Compile • Single end (ERX001300) • K-mer length • Coverage cut-offs • Whole genome sequence as input??? • Staphylococcus aureus MRSA252
  36. 36. Velvet algorithms • Long read information • Rock Band • Velvet parameters • -long_mult_cutoff
  37. 37. Velvet algorithms • Paired-end information • Pebble • Velvet parameters • -min_pair_count • Once all distances and variances are computed: simple greedy extension outwards from the main contigs
  38. 38. Paired-end in Velvet • Hugely improves quality of assembly • Insert length greater than repeat • greater than the length of the most common genomic repeat • Mixed insert length improves results • Short: helps for local assembly • Long: get over repeats • Large genomes • Very memory intensive • Calculation intensive
  39. 39. Data formats and quality • Fasta • Extensions: .fasta, .fa, ? • Header: >read_1 • Sequence: TATAATATTTAT... • Fastq • Extensions: .fastq, .fq, ? • Header: @SEQ_ID • Sequence: GATTTGGGGTTCAAAGC • Separator: + • Quality: !*((((***+))%%%
  40. 40. FASTQ paired • Read names of a pair end in .F / .R or /1 / /2 • One pair as consecutive records: @SRR022863.1.F ATATAGATGTACATAAATTAGTTGAAGTATATGAACG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII • @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTTCTTTCTCTTC + IIIIIHIIIIIIII3III.,IIII&II6II-))&I0 • Forward (.F) and reverse (.R) records side by side • Forward: @SRR022863.1.F ATATAGATGTACATAAATTAGT... + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII • @SRR022863.2.F TTATGAATTATTAATAAGTGCT... + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII • Reverse: @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTT... + IIIIIHIIIIIIII3III.,IIII&II6II-))&I0 • @SRR022863.2.R CATAAAAAAAGAAAATGTACTCTTTAC... + IIII)0&A,%.&9$8I4+A;I)4II)&%-I$I%#)II
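The two layouts on this slide, separate forward and reverse records versus one pair after the other, can be converted with a small interleaving sketch. It assumes plain four-line FASTQ records and that both inputs list the pairs in the same order; the record names and sequences in the example are toy values, not the SRR022863 data.

```python
from itertools import islice, zip_longest


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            return
        if (len(record) != 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record: {record}")
        yield record[0], record[1], record[3]


def interleave(forward_records, reverse_records):
    """Yield forward and reverse records alternately, pair by pair."""
    for forward, reverse in zip_longest(forward_records, reverse_records):
        if forward is None or reverse is None:
            raise ValueError("forward and reverse inputs differ in record count")
        yield forward
        yield reverse


forward_lines = ["@pair1.F", "ACGT", "+", "IIII", "@pair2.F", "GGCA", "+", "IIII"]
reverse_lines = ["@pair1.R", "TTGA", "+", "IIII", "@pair2.R", "CCAT", "+", "IIII"]
merged = list(interleave(read_fastq(forward_lines), read_fastq(reverse_lines)))
for header, sequence, quality in merged:
    print(header, sequence, "+", quality, sep="\n")
```

The same functions work on open file handles, since a file object iterates over its lines.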
  41. 41. Quality score • Velvet does NOT use quality scores! • Relies on error correction of the de Bruijn graph • p • the probability that the corresponding base call is incorrect • Phred quality score • 10 -> 1 in 10 • 40 -> 1 in 10,000 • Odds ratio • earlier versions of the Solexa pipeline • differs mainly at lower levels
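The relation between p and the two score types can be written out as a sketch. The Phred score is -10·log10(p); the odds-ratio score used by earlier Solexa pipelines is -10·log10(p / (1 - p)). The function names are illustrative.

```python
import math


def phred_to_error_probability(quality):
    """Probability that the base call is wrong, for a Phred score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(p):
    """Phred score: -10 * log10(p)."""
    return -10 * math.log10(p)


def error_probability_to_solexa(p):
    """Solexa odds-ratio score: -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


for p in (0.5, 0.1, 0.001):
    print(f"p={p}: Phred {error_probability_to_phred(p):.2f}, "
          f"Solexa {error_probability_to_solexa(p):.2f}")
```

At p = 0.5 the Phred score is about 3 while the odds-ratio score is 0, whereas at p = 0.001 both are close to 30, which illustrates why the two scales differ mainly at low quality.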
  42. 42. Quality encoding • !*((((***+))%%% • One value per base • Integer mapping based on ASCII encoding • probability of incorrect base call • Sanger format • Phred score • ASCII 33 – 126 -> 0 – 93 • Rarely exceeds 60 • ! = 33 -> 0 • B = 66 -> 33 • Illumina 1.5+ • Phred score • ASCII 59 – 126 -> -5 – 62 • Only 2 – 40 expected • ! = 33 -> (does not exist) • B = 66 -> 2
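Decoding a quality string amounts to subtracting an offset from each character's ASCII code. The sketch below takes the offsets from the mappings on this slide: 33 for Sanger (ASCII 33 -> 0) and 64 for the Illumina column (ASCII 66 -> 2). The constant and function names are illustrative.

```python
SANGER_OFFSET = 33    # ASCII 33 maps to score 0
ILLUMINA_OFFSET = 64  # ASCII 66 maps to score 2, so the offset is 64


def decode_qualities(quality_string, offset):
    """Convert a FASTQ quality string to a list of integer scores."""
    return [ord(character) - offset for character in quality_string]


print(decode_qualities("!*((((***+))%%%", SANGER_OFFSET))
# [0, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4]

# The same character means very different things in the two encodings.
print(decode_qualities(chr(66), SANGER_OFFSET))    # [33]
print(decode_qualities(chr(66), ILLUMINA_OFFSET))  # [2]
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth checking the range of the decoded values before trimming on them.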
  43. 43. Quality encoding • [Figure: quality encoding overview, from Wikipedia]
  44. 44. Quality trimming • Good / Bad? • [Figure: quality score plotted against bp position in read]
  45. 45. Quality trimming • Fixed length trimming • Cut-off at position x • Adaptive trimming • Quality score cut-off • Minimum sequence length • Sliding window • Window size • Quality score cut-off • Use average quality value of window
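The sliding-window strategy can be sketched as follows. The sketch assumes the read is cut at the start of the first window whose average quality drops below the cut-off, and that reads shorter than the minimum length after trimming are discarded. These rules and all parameter names are illustrative choices, not those of a specific trimming tool.

```python
def sliding_window_trim(sequence, qualities, window_size, min_mean_quality, min_length):
    """Trim a read at the first window whose mean quality is below the cut-off.

    Returns the trimmed sequence, or None if it is shorter than min_length.
    Reads shorter than one window are returned untrimmed.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")
    cut = len(sequence)
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < min_mean_quality:
            cut = start
            break
    trimmed = sequence[:cut]
    return trimmed if len(trimmed) >= min_length else None


qualities = [40, 40, 40, 40, 38, 30, 20, 10, 5, 2]
print(sliding_window_trim("ACGTACGTAC", qualities, 4, 20, 4))  # ACGTA
print(sliding_window_trim("ACGTACGTAC", qualities, 4, 20, 6))  # None
```

Averaging over a window keeps a read from being cut at a single poor base surrounded by good ones, which a per-base cut-off would do.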
  46. 46. Velvet practical: Part 2 • Paired-end (SRX008042) • Explore parameters • Set cut-offs • Analyse quality score (SRX008042) • Trimming reads
  47. 47. Velvet modules • Columbus (since Velvet 1.0) • use reference sequence • assist with alignment information • local re-sequencing • structural variants
  48. 48. Velvet modules • Oases • De novo transcriptome assembler • uses preliminary Velvet assembly • clusters contigs into loci • construct transcript isoforms using paired-end / long read information • confidence score: describes uniqueness of a transcript in a locus
  49. 49. Read Simulation - Why? • Controlling the data • Contamination • Coverage distribution • Sequencing errors • Genome size • Insert length • Insert length distribution
  50. 50. Read Simulation - Why? • Make results comparable • Assemblers • Parameters • Algorithms • Assembly strategies • Genome specific “features” • Robust • Introduce errors • Simulate SNPs
  51. 51. Real data vs. simulation • Mario Caccamo
  52. 52. Real data vs. simulation • Mario Caccamo
  53. 53. Velvet practical: Part 3 • Velvet • Long Reads • Hybrid Assembly • Mixed insert length libraries
  54. 54. Curtain • Assembly pipeline • Paired-end assembly for large genomes • Group related Contigs • Uses Velvet to assemble groups of related reads • Iterative approach
  55. 55. Curtain • Genome assembly pipeline • [Figure: pipeline stages: Contigs, Map Reads, Group Contigs, Fill Bins, Assemble, Collect]
  56. 56. Curtain • Set of input Contigs • Use established assemblers • Velvet unpaired • Cortex • SGA • ...
  57. 57. Curtain • Map reads to input contigs • SAM file support • bwa • maq
  58. 58. Curtain • Group Contigs using Paired-end information • [Figure: contigs 1 to 5 placed in a bin by mapping reads and read pairs]
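One way to picture the grouping step is to treat every read pair whose two reads map to different contigs as a link, and to put all contigs connected by links into the same bin. The union-find sketch below illustrates that idea only; it is an assumption made for illustration and not Curtain's actual grouping algorithm, and the contig numbers and links are invented.

```python
def group_contigs(contigs, pair_links):
    """Bin contigs that are connected, directly or indirectly, by read pairs.

    pair_links is a list of (contig_a, contig_b) tuples, one per read pair
    whose two reads mapped to different contigs. (Hypothetical input format.)
    """
    parent = {contig: contig for contig in contigs}

    def find(contig):
        # Follow parent pointers to the bin representative, halving the path.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for contig_a, contig_b in pair_links:
        root_a, root_b = find(contig_a), find(contig_b)
        if root_a != root_b:
            parent[root_b] = root_a

    bins = {}
    for contig in contigs:
        bins.setdefault(find(contig), []).append(contig)
    return sorted(bins.values())


links = [(1, 2), (2, 3), (4, 5)]
print(group_contigs([1, 2, 3, 4, 5], links))  # [[1, 2, 3], [4, 5]]
```

Each bin could then be assembled on its own, which is what makes the later stages easy to run in parallel.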
  59. 59. Curtain • Assemble each bin • Run Velvet using paired-end information • bin specific parameters • Run each bin individually • Highly parallelizable • Collect results • Start next iteration
  60. 60. Curtain • Low memory footprint • Scalable for large genomes • Makes use of a compute cluster • Available • www.ebi.ac.uk/egt • http://code.google.com/p/curtain/ • Future announcements • http://groups.google.com/group/curtain-assembler • Future work • Long read support
  61. 61. Curtain practical • Run Curtain for Staphylococcus • Simulation data
  62. 62. Thanks ...
