# 2011-04-26_01-velvet-curtain-presentation


### 2011-04-26_01-velvet-curtain-presentation

1. Velvet / Curtain • Matthias Haimel • EBI is an Outstation of the European Molecular Biology Laboratory.
2. 25.04.11 Velvet / Curtain
3. Overview • De Bruijn Graph • Velvet • Theory • Practice • Data formats and quality • Velvet • Simulation data • Multiple insert lengths • Curtain • Theory • Practice
4. De Bruijn graph • A concept in combinatorial mathematics • In combinatorics, a de Bruijn graph is usually fully connected • http://en.wikipedia.org/wiki/De_Bruijn_graph • De Bruijn sequence • Related concept • Path through graph • Velvet • De Bruijn inspired graph structure
5. De Bruijn graph (Velvet) • Representation of • a sequence based on short words (k-mers) • overlaps between words • K-mer: word of length k • k = 5: GCCTTCCA -> GCCTT, CCTTC, CTTCC, TTCCA • consecutive k-mers overlap by k-1 bases
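
The k-mer decomposition on this slide can be sketched in a few lines of Python. This is an illustration only, not Velvet's implementation; the function name `kmers` is made up for the example.

```python
def kmers(sequence, k):
    """Return every word of length k in sequence, ordered by start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GCCTTCCA", 5)
print(words)  # ['GCCTT', 'CCTTC', 'CTTCC', 'TTCCA']

# Consecutive k-mers overlap by k-1 bases:
# the last k-1 bases of one word are the first k-1 bases of the next.
for left, right in zip(words, words[1:]):
    assert left[1:] == right[:-1]
```

A sequence of length n yields n - k + 1 k-mers, so the 8-base example gives four words for k = 5.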
6. De Bruijn graph (Velvet) • Sequences: GCCTTCCAATTT and GCCTTCAAATTT • [Figure: de Bruijn graph of both sequences with two parallel paths, CTTC TTCC ... CAATT and CTTC TTCA ... AAATT, ending in AATTT]
7. De Bruijn graph representations (Velvet) • [Figure: example nodes TTCA, ATTC, TCAG] • Error free, no repeat, no polymorphism • Repeat > k-mer length • SNP, variant < k-mer length • Structural variant, inversion • Structural variant, deletion • …
8. Example • Sequence: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG • Reads: AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
9. Example • Read: GTCGAGG • GTCG (1x)
10. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x)
11. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x), CGAG (1x)
12. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x), CGAG (1x), GAGG (1x)
13. Example • New read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (1x)
14. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x)
15. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x)
16. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x), GGCT (1x)
17. Example • New read: TCGACGC • GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x)
18. Example • Read: TCGACGC • GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x) • CGAC (1x), GACG (1x), ACGC (1x)
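
The read-by-read build-up in slides 9 to 18 can be reproduced with a small sketch. It counts how often each 4-mer is seen and records which 4-mers follow each other in a read. It is a simplification and not Velvet's data structure: it ignores the reverse strand and sequencing errors, and the name `build_graph` is invented for this example.

```python
from collections import Counter


def build_graph(reads, k):
    """Count k-mer occurrences and collect the k-mer adjacencies seen in reads."""
    coverage = Counter()
    edges = set()
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        coverage.update(words)               # one count per occurrence
        edges.update(zip(words, words[1:]))  # neighbours overlap by k-1 bases
    return coverage, edges


reads = ["GTCGAGG", "CGAGGCT", "TCGACGC"]  # the three reads from the slides
coverage, edges = build_graph(reads, 4)

for word, count in coverage.items():
    print(f"{word} ({count}x)")

# TCGA has two successors, which is where the graph branches.
print(sorted(b for a, b in edges if a == "TCGA"))  # ['CGAC', 'CGAG']
```

After the third read the counts match slide 18: TCGA, CGAG and GAGG are seen twice, every other 4-mer once.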
19. Example • etc… • GATT (1x) • TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x) • AGAA (1x) • GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x) • TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (11x), TAGA (16x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x) • GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x) • CGAC (1x), GACG (1x), ACGC (1x)
20. Example • After simplification… • GATT, AGAT, GATCCGATGAG, AGAA, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG, CGACGC
21. Example • Tips removed… • AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG
22. De Bruijn graph biology extensions (Velvet) • Handling of reverse strand • DNA is read in two directions • Paired-end data • Handling small differences, which are “uninteresting” • Errors in sequencing technology • Memory • regularly uses 80 or 100 GB of real memory • can easily reach 1 TB of real memory requirements
23. Read variety • Short reads ~75 bp • Illumina / Solexa • SOLiD (colour space) • Long reads 500-1000 bp • 454 reads • Sanger capillary reads • Paired-end reads • Short reads • short insert length • Mate pair reads • Short reads • long insert length
24. Paired-End • Mate Pair
25. Short paired-end / mate pair reads • Velvet expects Illumina paired-end orientation: (L-> <-R)
26. Short paired-end / mate pair reads • Illumina mate-pair orientation: (<-L R->) • Reverse complement the mate-pair reads to obtain the paired-end orientation (L-> <-R)
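
As a rough sketch of the conversion on this slide, both reads of a mate pair can be reverse complemented so that they point towards each other, as in the paired-end layout. The helper names below are hypothetical and not part of Velvet. For FASTQ input the quality strings would also have to be reversed, which is left out here.

```python
# Translation table for the complement of each base; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def mate_pair_to_paired_end(left, right):
    """Turn an outward-facing pair (<-L R->) into an inward-facing one (L-> <-R)."""
    return reverse_complement(left), reverse_complement(right)


print(reverse_complement("GATTC"))  # GAATC
print(mate_pair_to_paired_end("AACG", "TTGA"))  # ('CGTT', 'TCAA')
```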
27. Velvet algorithms • Remove Bubbles • Tour Bus • Velvet parameters • -max_branch_length • -max_divergence • -max_gap_count
28. Example • AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG • Two paths between GGCT (11x) and TAGA (16x): GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x) and GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x)
29. Example • Bubbles removed… by TourBus • AGAT, GATCCGATGAG, TAGTCGA, CGAG, GAGGCT, GGCT, GCTTTAG, TAGA, AGAGA, AGACAG
30. Example • Final simplification… • AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG
31. Example • TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG • Final simplification… • AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG • One possible walk through the graph ... • TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, GAGGCTTTAGA, AGAGACAG
32. N50 • Total • N90 • N50 • N10
33. N50 • Total: 4,295,113 bp • N90: 439 bp • N50: 3,119 bp • N10: 13,519 bp
34. N50 • N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly • N10 • the same measure for the largest contigs covering at least 10% of the assembly • http://www.broadinstitute.org/crd/wiki/index.php/N50
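
The definition above translates into a short calculation: sort the contigs from longest to shortest, add up lengths until the requested share of the assembly is reached, and report the length of the contig that crosses the threshold. The sketch below uses made-up contig lengths and a hypothetical function name `nx`.

```python
def nx(lengths, percent):
    """Length of the smallest contig among the largest contigs that together
    cover at least `percent` of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy lengths, total 54
print(nx(contigs, 50))  # 8  (10 + 9 + 8 = 27, half of 54)
print(nx(contigs, 90))  # 4
print(nx(contigs, 10))  # 10
```

N10 is never smaller than N50, and N50 never smaller than N90, which matches the ordering of the values on slide 33.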
35. Velvet practical: Part 1 • Compile • Single end (ERX001300) • K-mer length • Coverage cut-offs • Whole genome sequence as input??? • Staphylococcus aureus MRSA252
36. Velvet algorithms • Long read information • Rock Band • Velvet parameters • -long_mult_cutoff
37. Velvet algorithms • Paired-end information • Pebble • Velvet parameters • -min_pair_count • Once all distances and variances are computed: simple greedy extension outwards from the main contigs
38. Paired-end in Velvet • Hugely improves quality of assembly • Insert length greater than repeat • greater than the length of the most common genomic repeat • Mixed insert length improves results • Short: helps for local assembly • Long: get over repeats • Large genomes • Very memory intensive • Calculation intensive
39. Data formats and quality • Fasta: extensions .fasta, .fa, ?; header >read_1; sequence TATAATATTTAT... • Fastq: extensions .fastq, .fq, ?; header @SEQ_ID; sequence GATTTGGGGTTCAAAGC; separator +; quality !*((((***+))%%%
40. FASTQ paired • Pair members are marked by a read-name suffix: .F / .R or /1 / /2 • Forward and reverse read in sequence: @SRR022863.1.F ATATAGATGTACATAAATTAGTTGAAGTATATGAACG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTTCTTTCTCTTC + IIIIIHIIIIIIII3III.,IIII&II6II-))&I0 • Forward reads side by side with reverse reads: @SRR022863.1.F ATATAGATGTACATAAATTAGT... + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII | @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTT... + IIIIIHIIIIIIII3III.,IIII&II6II-))&I0 • @SRR022863.2.F TTATGAATTATTAATAAGTGCT... + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | @SRR022863.2.R CATAAAAAAAGAAAATGTACTCTTTAC... + IIII)0&A,%.&9$8I4+A;I)4II)&%-I$I%#)II
41. Quality score • Velvet does NOT use quality scores!!! • Error correction of de Bruijn graph • p • the probability that the corresponding base call is incorrect • Phred quality score • 10 -> 1 in 10 • 40 -> 1 in 10,000 • Odds ratio • used by earlier versions of the Solexa pipeline • differs mainly at lower quality levels
42. Quality encoding • !*((((***+))%%% • One value per base • Integer mapping based on ASCII encoding • probability of incorrect base call • Sanger format: Phred score; ASCII 33 – 126 -> 0 – 93; rarely exceeds 60; ! = 33 -> 0; B = 66 -> 33 • Illumina 1.5+: Phred score; ASCII 59 – 126 -> -5 – 62; only 2 – 40 expected; ! = 33 -> (does not exist); B = 66 -> 2
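
The mapping from quality characters to scores and error probabilities can be sketched as follows. A Phred score Q corresponds to an error probability p = 10^(-Q/10), and the score is the ASCII code of the character minus an offset: 33 for the Sanger format and 64 for the Illumina 1.5+ format. The function names are chosen for this example only.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer scores (one per base)."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is incorrect."""
    return 10 ** (-score / 10)


# Quality string from the slide, read as Sanger format (offset 33).
print(phred_scores("!*((((***+))%%%"))
# [0, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4]

# The same character means different scores in the two encodings.
print(phred_scores("B", offset=33))  # [33]
print(phred_scores("B", offset=64))  # [2]

print(error_probability(10), error_probability(40))
```

Because the same character decodes to different scores, the encoding of a file has to be known before any quality-based filtering.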
43. Quality encoding • [Figure; source: Wikipedia]
44. Quality trimming • Good / Bad? • [Figure: quality score against bp position in read]
45. Quality trimming • Fixed length trimming • Cut-off at position x • Adaptive trimming • Quality score cut-off • Minimum sequence length • Sliding window • Window size • Quality score cut-off • Use average quality value of window
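
A minimal sketch of the sliding-window variant, under assumptions of my own: the read is cut at the start of the first window whose average quality falls below the cut-off, and reads that end up shorter than the minimum length are discarded. The parameter names and default values are illustrative and do not come from any particular trimming tool.

```python
def sliding_window_trim(sequence, scores, window=4, cutoff=20, min_length=20):
    """Trim a read at the first window whose mean quality is below cutoff.

    Returns the trimmed sequence, or None if it is shorter than min_length.
    Reads shorter than one window are not trimmed.
    """
    end = len(sequence)
    for start in range(len(sequence) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean < cutoff:
            end = start  # keep only the bases before the failing window
            break
    trimmed = sequence[:end]
    return trimmed if len(trimmed) >= min_length else None


read = "ACGTACGTAC"
quality = [40, 40, 40, 40, 38, 30, 12, 10, 8, 5]  # quality drops towards the end

print(sliding_window_trim(read, quality, min_length=4))  # ACGTA
print(sliding_window_trim(read, quality, min_length=8))  # None
```

Fixed length trimming is the simpler case `sequence[:x]`, which ignores the scores entirely.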
46. Velvet practical: Part 2 • Paired-end (SRX008042) • Explore parameters • Set cut-offs • Analyse quality score (SRX008042) • Trimming reads
47. Velvet modules • Columbus (since Velvet 1.0) • use reference sequence • assist with alignment information • local re-sequencing • structural variants
48. Velvet modules • Oases • De novo transcriptome assembler • uses preliminary Velvet assembly • clusters contigs into loci • constructs transcript isoforms using paired-end / long read information • confidence score: describes uniqueness of a transcript in a locus
49. Read Simulation - Why? • Controlling the data • Contamination • Coverage distribution • Sequencing errors • Genome size • Insert length • Insert length distribution
50. Read Simulation - Why? • Make results comparable • Assemblers • Parameters • Algorithms • Assembly strategies • Genome specific “features” • Robust • Introduce errors • Simulate SNPs
51. Real data vs. simulation • Mario Caccamo
52. Real data vs. simulation • Mario Caccamo
53. Velvet practical: Part 3 • Velvet • Long Reads • Hybrid Assembly • Mixed insert length libraries
54. Curtain • assembly pipeline • Paired-end assembly for large genomes • Group related Contigs • Uses velvet to assemble groups of related reads • Iterative approach
55. Curtain • Genome assembly pipeline • [Pipeline diagram: Contigs, Map (Reads), Group (Contigs), Fill (Bins), Assemble, Collect]
56. Curtain • Set of input Contigs • Use established assemblers • Velvet unpaired • Cortex • SGA • ...
57. Curtain • Map reads to input contigs • SAM file support • bwa • maq
58. Curtain • Group Contigs using Paired-end information • [Figure: contigs 1 to 5 and a bin, with mapping reads and read pairs]
59. Curtain • Assemble each bin • Run velvet using paired-end information • bin specific parameters • Run each bin individually • Highly parallelizable • Collect results • Start next iteration
60. Curtain • Low memory footprint • Scalable for large genomes • Make use of cluster • Available • www.ebi.ac.uk/egt • http://code.google.com/p/curtain/ • Future announcements • http://groups.google.com/group/curtain-assembler • Future work • Long read support
61. Curtain practical • Run Curtain for Staphylococcus • Simulation data
62. Thanks ...