### 2011-04-26_01-velvet-curtain-presentation

Velvet / Curtain
Matthias Haimel
EBI is an Outstation of the European Molecular Biology Laboratory.
Overview
• De Bruijn Graph
• Velvet
  • Theory
  • Practice
• Data formats and quality
• Velvet
  • Simulation data
  • Multiple insert lengths
• Curtain
  • Theory
  • Practice
De Bruijn graph
• A concept in combinatorial mathematics
• In combinatorics, a de Bruijn graph is usually fully connected
• http://en.wikipedia.org/wiki/De_Bruijn_graph
• De Bruijn sequence
  • Related concept
  • Path through the graph
• Velvet
  • De Bruijn-inspired graph structure
De Bruijn graph (Velvet)
• Representation of
  • a sequence based on short words (k-mers)
  • overlaps between words
• K-mer: word of length k
• Example with k=5: GCCTTCCA gives GCCTT, CCTTC, CTTCC, TTCCA, ...
  • Consecutive words overlap by k-1 characters
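The k-mer decomposition on this slide can be sketched in a few lines of Python. This is only an illustration of the idea; the function name `kmers` is made up for the example and is not part of Velvet.

```python
def kmers(sequence, k):
    """Return every word of length k in the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GCCTTCCA", 5)

# Consecutive words overlap by k-1 characters:
# the suffix of one word is the prefix of the next.
for left, right in zip(words, words[1:]):
    assert left[1:] == right[:-1]

print(words)
```

A sequence of length n yields n - k + 1 words, so the 8-base example produces four 5-mers.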
De Bruijn graph (Velvet)
GCCTTCCAATTT
GCCTTCAAATTT
[Figure: graph of both sequences; one path runs through CTTC, TTCC, ..., CAATT and the other through CTTC, TTCA, ..., AAATT]
De Bruijn graph representations (Velvet)
[Figure: graph shapes, with example nodes TTCA, ATTC, TCAG, for the following cases]
• Error free, no repeat, no polymorphism
• Repeat > k-mer length
• SNP, variant, < k-mer length
• Structural variant, inversion
• Structural variant, deletion
• …
Example
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
Reads: AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
Example. Read: GTCGAGG
GTCG (1x)
Example. Read: GTCGAGG
GTCG (1x), TCGA (1x)
Example. Read: GTCGAGG
GTCG (1x), TCGA (1x), CGAG (1x)
Example. Read: GTCGAGG
GTCG (1x), TCGA (1x), CGAG (1x), GAGG (1x)
Example. New read: CGAGGCT
GTCG (1x), TCGA (1x), CGAG (2x), GAGG (1x)
Example. Read: CGAGGCT
GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x)
Example. Read: CGAGGCT
GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x)
Example. Read: CGAGGCT
GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x), GGCT (1x)
Example. New read: TCGACGC
GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x)
Example. Read: TCGACGC
GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x)
Branch from TCGA: CGAC (1x), GACG (1x), ACGC (1x)
Example etc…
GATT (1x)
TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
AGAA (1x)
GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x)
TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (11x), TAGA (16x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x)
CGAC (1x), GACG (1x), ACGC (1x)
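The counting in the walk-through can be reproduced with a short Python sketch. It uses only the first three reads of the example (GTCGAGG, CGAGGCT, TCGACGC) with k=4; the function name is illustrative, and a plain counter is an assumption made for clarity, not the way Velvet stores its graph.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


counts = count_kmers(["GTCGAGG", "CGAGGCT", "TCGACGC"], 4)
for word, n in counts.items():
    print(f"{word} ({n}x)")
```

K-mers shared between reads (TCGA, CGAG, GAGG) end up with a count of 2, while the k-mers unique to a single read stay at 1.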
Example: after simplification…
Nodes: GATT, AGAT, GATCCGATGAG, AGAA, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG, CGACGC
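The simplification step merges unbranched chains of k-mers into single longer nodes. Below is a minimal sketch of that idea, assuming a plain set of k-mers as input. It ignores reverse complements, coverage and tip removal, drops isolated cycles, and is not Velvet's implementation; all function names are illustrative.

```python
from collections import defaultdict


def build_graph(kmer_set):
    """Link k-mers whose (k-1)-suffix equals another k-mer's (k-1)-prefix."""
    succ, pred, by_prefix = defaultdict(list), defaultdict(list), defaultdict(list)
    for kmer in kmer_set:
        by_prefix[kmer[:-1]].append(kmer)
    for kmer in kmer_set:
        for nxt in by_prefix[kmer[1:]]:
            succ[kmer].append(nxt)
            pred[nxt].append(kmer)
    return succ, pred


def simplify(kmer_set):
    """Merge every unbranched chain of k-mers into one contig string."""
    succ, pred = build_graph(kmer_set)

    def starts_chain(kmer):
        # A k-mer continues a chain only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        p = pred[kmer]
        return not (len(p) == 1 and len(succ[p[0]]) == 1)

    contigs = []
    for kmer in sorted(kmer_set):
        if not starts_chain(kmer):
            continue
        contig, node = kmer, kmer
        # Extend while the path is unambiguous in both directions.
        while len(succ[node]) == 1 and len(pred[succ[node][0]]) == 1:
            node = succ[node][0]
            contig += node[-1]
        contigs.append(contig)
    return contigs


nodes = simplify({"GTCG", "TCGA", "CGAG", "GAGG", "AGGC",
                  "GGCT", "CGAC", "GACG", "ACGC"})
print(nodes)
```

With the k-mers of the first three example reads, the branch at TCGA splits the graph into three merged nodes, one of which is the short CGACGC side branch that also appears on the slide.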
Example: tips removed…
Nodes: AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG
De Bruijn graph biology extensions (Velvet)
• Handling of reverse strand
  • DNA is read in two directions
• Paired-end data
• Handling small differences, which are "uninteresting"
  • Errors in sequencing technology
• Memory
  • Regularly uses 80 to 100 GB of real memory
  • Easily gets to 1 TB real memory requirements
Read variety
• Short reads ~75 bp
  • Illumina / Solexa
  • SOLiD (colour space)
• Long reads 500-1000 bp
  • 454 reads
  • Sanger capillary reads
• Paired-end reads
  • Short reads
  • Short insert length
• Mate pair reads
  • Short reads
  • Long insert length
Paired-End Mate Pair
Short paired-end / mate pair reads
• Velvet expects Illumina paired-end orientation: (L-> <-R)
[Figure: reads L and R in paired-end orientation]
Short paired-end / mate pair reads
• Illumina mate-pair orientation: (<-L R->)
• Reverse complement the mate-pair reads to obtain paired-end orientation
[Figure: reads L and R as mate pair, then as paired-end after reverse complementing]
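The conversion from mate-pair to paired-end orientation amounts to reverse complementing both reads of a pair. A minimal sketch follows; the function names are illustrative and not part of Velvet. For FASTQ input the quality string of each read would also need to be reversed, which is not shown here.

```python
# Translation table for the complement of each base (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mate_pair_to_paired_end(left, right):
    """Turn an outward-facing pair (<-L R->) into an inward-facing one (L-> <-R)."""
    return reverse_complement(left), reverse_complement(right)


print(mate_pair_to_paired_end("AAC", "GGT"))
```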
Velvet algorithms
• Remove bubbles
  • Tour Bus
• Velvet parameters
  • -max_branch_length
  • -max_divergence
  • -max_gap_count
Example
Nodes: AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG
Two paths between GGCT (11x) and TAGA (16x):
• GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x)
• GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x)
Example: bubbles removed by Tour Bus…
Nodes: AGAT, GATCCGATGAG, TAGTCGA, CGAG, GAGGCT, GGCT, GCTTTAG, TAGA, AGAGA, AGACAG
Example: final simplification…
Nodes: AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG
Example
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
Final simplification… Nodes: AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG
One possible walk through the graph: TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, GAGGCTTTAGA, AGAGACAG
N50 • Total • N90 • N50 • N10
N50
• Total: 4,295,113 bp
• N90: 439 bp
• N50: 3,119 bp
• N10: 13,519 bp
N50
• N50 is the length of the smallest contig in the set that
  • contains the fewest (largest) contigs
  • whose combined length represents at least 50% of the assembly
• N10
  • The same measure with a 10% threshold, so it reflects only the largest contigs
http://www.broadinstitute.org/crd/wiki/index.php/N50
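The definition above translates directly into code: sort the contigs from longest to shortest and add up lengths until the requested share of the assembly is reached. A minimal sketch, with an illustrative function name `nx` and made-up contig lengths:

```python
def nx(contig_lengths, percent):
    """Length of the smallest contig in the smallest set of largest contigs
    whose combined length covers at least `percent` of the assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs given")


lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical assembly, 300 bp in total
print(nx(lengths, 10), nx(lengths, 50), nx(lengths, 90))
```

For these lengths the two largest contigs (80 + 70 = 150) already cover half of the assembly, so N50 is 70, while N90 needs the five largest contigs and is 30.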
Velvet practical: Part 1
• Compile
• Single end (ERX001300)
  • K-mer length
  • Coverage cut-offs
• Whole genome sequence as input???
  • Staphylococcus aureus MRSA252
Velvet algorithms
• Long read information
  • Rock Band
• Velvet parameters
  • -long_mult_cutoff
Velvet algorithms
• Paired-end information
  • Pebble
• Velvet parameters
  • -min_pair_count
Once all distances and variances are computed: simple greedy extension outwards from the main contigs.
Paired-end in Velvet
• Hugely improves quality of assembly
• Insert length greater than repeat
  • Greater than the length of the most common genomic repeat
• Mixed insert length improves results
  • Short: helps for local assembly
  • Long: get over repeats
• Large genomes
  • Very memory intensive
  • Calculation intensive
Data formats and quality
• Fasta
  • Extensions: .fasta, .fa, ?
  • Header: >read_1
  • Sequence: TATAATATTTAT...
• Fastq
  • Extensions: .fastq, .fq, ?
  • Header: @SEQ_ID
  • Sequence: GATTTGGGGTTCAAAGC
  • Separator: +
  • Quality: !*((((***+))%%%
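To make the two layouts concrete, here is a minimal Python sketch that reads both formats. It assumes four-line FASTQ records (no wrapped sequence lines) and uses made-up example records; the function names are illustrative only.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; sequences may span several lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header}")
        yield header[1:], sequence, quality


fasta_records = list(read_fasta(io.StringIO(">read_1\nGATT\nACA\n")))
fastq_records = list(read_fastq(io.StringIO("@read_1\nGATTACA\n+\nIIIIII#\n")))
print(fasta_records, fastq_records)
```

The key difference is that a FASTQ record carries one quality character per base, which is why the sequence and quality lines must have the same length.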
FASTQ paired
Mate suffixes in read names: .F / .R or /1 / /2
One file, with the two reads of a pair as consecutive records:
@SRR022863.1.F
ATATAGATGTACATAAATTAGTTGAAGTATATGAACG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII
@SRR022863.1.R
TTCACCCATTTTATCCATGATTTTGTTCTTTCTCTTC
+
IIIIIHIIIIIIII3III.,IIII&II6II-))&I0
Two files, first file:
@SRR022863.1.F
ATATAGATGTACATAAATTAGT...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII
@SRR022863.2.F
TTATGAATTATTAATAAGTGCT...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Two files, second file:
@SRR022863.1.R
TTCACCCATTTTATCCATGATTTTGTT...
+
IIIIIHIIIIIIII3III.,IIII&II6II-))&I0
@SRR022863.2.R
CATAAAAAAAGAAAATGTACTCTTTAC...
+
IIII)0&A,%.&9$8I4+A;I)4II)&%-I$I%#)II
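Going from two per-mate files to one file with alternating records can be sketched as below. The records are (name, sequence, quality) tuples; the read names follow the slide, while the sequences and qualities are shortened placeholders. The function names are illustrative; this is not the script shipped with Velvet.

```python
def pair_id(name):
    """Strip a mate suffix (.F, .R, /1, /2) to get the shared pair name."""
    for suffix in (".F", ".R", "/1", "/2"):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name


def interleave(forward, reverse):
    """Alternate forward and reverse records so each pair sits together."""
    forward, reverse = list(forward), list(reverse)
    if len(forward) != len(reverse):
        raise ValueError("files contain different numbers of reads")
    merged = []
    for fwd, rev in zip(forward, reverse):
        if pair_id(fwd[0]) != pair_id(rev[0]):
            raise ValueError(f"reads out of order: {fwd[0]} vs {rev[0]}")
        merged.extend([fwd, rev])
    return merged


forward = [("SRR022863.1.F", "ATAT", "IIII"), ("SRR022863.2.F", "TTAT", "IIII")]
reverse = [("SRR022863.1.R", "TTCA", "IIII"), ("SRR022863.2.R", "CATA", "IIII")]
for name, _, _ in interleave(forward, reverse):
    print(name)
```

Checking the pair names while merging catches the common problem of two files that were filtered or sorted independently.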
Quality score
• Velvet does NOT use quality scores!
  • Error correction of de Bruijn graph
• p
  • The probability that the corresponding base call is incorrect
• Phred quality score
  • 10 -> 1 in 10
  • 40 -> 1 in 10,000
• Odds ratio
  • Earlier versions of the Solexa pipeline
  • Differs mainly at lower quality levels
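The two score definitions can be written out as follows: the Phred score is -10·log10(p), and the odds-based score used by earlier Solexa pipelines is -10·log10(p / (1 - p)). A small sketch with illustrative function names:

```python
import math


def phred_from_probability(p):
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)


def probability_from_phred(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def solexa_from_probability(p):
    """Odds-based score used by earlier Solexa pipelines."""
    return -10 * math.log10(p / (1 - p))


for p in (0.5, 0.1, 0.0001):
    print(p, round(phred_from_probability(p), 2), round(solexa_from_probability(p), 2))
```

At p = 0.0001 both scores are close to 40, whereas at p = 0.5 the Phred score is about 3 and the odds-based score is 0, which is the divergence at low quality noted on the slide.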
Quality encoding
• !*((((***+))%%%
• One value per base
• Integer mapping based on ASCII encoding
  • Probability of incorrect base call
• Sanger format
  • Phred score
  • ASCII 33 – 126 -> 0 – 93
  • Rarely exceeds 60
  • ! = 33 -> 0
  • B = 66 -> 33
• Illumina 1.5+
  • Phred score
  • ASCII 59 – 126 -> -5 – 62
  • Only 2 – 40 expected
  • ! = 33 -> (does not exist)
  • B = 66 -> 2
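Decoding a quality string is a matter of subtracting an offset from each character code. The sketch below assumes an offset of 33 for the Sanger format and 64 for the Illumina 1.5+ column, as implied by the ASCII ranges on the slide (33 -> 0 and 59 -> -5); the function name is illustrative.

```python
def decode_quality(quality_string, offset=33):
    """Convert each quality character to its integer score."""
    return [ord(ch) - offset for ch in quality_string]


print(decode_quality("!*((((***+))%%%"))   # Sanger offset
print(decode_quality("B", offset=33))      # same character, Sanger
print(decode_quality("B", offset=64))      # same character, offset 64
```

The same character decodes to very different scores depending on the offset, so the encoding of a file has to be known before any quality filtering.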
Quality encoding • wikipedia
Quality trimming
• Good / bad?
[Figure: quality score plotted against bp position in read]
Quality trimming
• Fixed length trimming
• Cut-off at
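Both trimming strategies can be sketched in a few lines. The slide does not give a cut-off value, so the thresholds used below (length 5, score 20) are arbitrary assumptions, as is the Sanger offset of 33; the function names are illustrative.

```python
def trim_fixed(sequence, quality, length):
    """Keep only the first `length` bases of a read."""
    return sequence[:length], quality[:length]


def trim_by_quality(sequence, quality, cutoff, offset=33):
    """Cut the read at the first base whose score falls below `cutoff`."""
    for i, ch in enumerate(quality):
        if ord(ch) - offset < cutoff:
            return sequence[:i], quality[:i]
    return sequence, quality


print(trim_fixed("ACGTACGT", "IIIIIIII", 5))
print(trim_by_quality("ACGTACGT", "IIII#III", 20))
```

Fixed-length trimming treats every read the same, while the quality cut-off keeps good reads at full length and shortens only those that degrade early.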
Velvet practical: Part 2
• Paired-end (SRX008042)
  • Explore parameters
  • Set cut-offs
• Analyse quality score (SRX008042)
  • Trimming reads
25.04.11 Velvet / Curtain
Velvet modules
• Columbus (since Velvet 1.0)
  • Use reference sequence
  • Assist with alignment information
  • Local re-sequencing
  • Structural variants
Velvet modules
• Oases
  • De novo transcriptome assembler
  • Uses preliminary Velvet assembly
  • Clusters contigs into loci
  • Constructs transcript isoforms using paired-end / long read information
  • Confidence score: describes uniqueness of a transcript in a locus
Read Simulation - Why?
• Controlling the data
  • Contamination
  • Coverage distribution
  • Sequencing errors
  • Genome size
  • Insert length
  • Insert length distribution
Read Simulation - Why?
• Make results comparable
  • Assemblers
  • Parameters
  • Algorithms
  • Assembly strategies
  • Genome specific “features”
• Robust
  • Introduce errors
  • Simulate SNPs
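A toy read simulator shows how the listed properties (genome size, insert length distribution, sequencing errors) become explicit parameters. This is a minimal sketch under simple assumptions: uniform sampling positions, normally distributed insert lengths, substitution errors only, and paired-end orientation. It is not the simulator used in the practical, and all names are illustrative.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def add_errors(read, error_rate, rng):
    """Substitute each base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def simulate_pairs(genome, n_pairs, read_len, insert_mean, insert_sd,
                   error_rate, seed=0):
    """Sample paired-end reads (L-> <-R) from random fragments of the genome."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        insert = round(rng.gauss(insert_mean, insert_sd))
        if not read_len <= insert <= len(genome):
            continue  # redraw inserts that cannot hold a read or fit the genome
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        left = add_errors(fragment[:read_len], error_rate, rng)
        right = add_errors(reverse_complement(fragment[-read_len:]), error_rate, rng)
        pairs.append((left, right))
    return pairs


genome = "".join(random.Random(1).choice("ACGT") for _ in range(500))
pairs = simulate_pairs(genome, n_pairs=5, read_len=20, insert_mean=100,
                       insert_sd=10, error_rate=0.01)
print(pairs[0])
```

Because the seed is fixed, the same parameters always produce the same reads, which is what makes runs with different assemblers or settings comparable.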
Real data vs. simulation (Mario Caccamo)
Real data vs. simulation (Mario Caccamo)
Velvet practical: Part 3
• Velvet
  • Long reads
  • Hybrid assembly
  • Mixed insert length libraries
Curtain
• Assembly pipeline
• Paired-end assembly for large genomes
• Group related contigs
• Uses Velvet to assemble groups of related reads
• Iterative approach
Curtain: genome assembly pipeline
[Figure: pipeline stages Contigs, Map, Group, Fill, Assemble, Collect; data labels Reads, Contigs, Bins]
Curtain
• Set of input contigs
• Use established assemblers
  • Velvet unpaired
  • Cortex
  • SGA
  • ...
Curtain
• Map reads to input contigs
  • SAM file support
  • bwa
  • maq
Curtain
• Group contigs using paired-end information
[Figure: contigs 1 to 5 grouped into a bin; mapping reads and read pairs]
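One way to picture the grouping step is as connected components: two contigs land in the same bin when enough read pairs have one read mapped to each. The sketch below uses a union-find structure for this. It is an assumption made for illustration and not taken from the Curtain source; the `min_pairs` threshold and all names are hypothetical.

```python
from collections import Counter, defaultdict


def group_contigs(contigs, pair_links, min_pairs=1):
    """Bin contigs that are connected by at least `min_pairs` read pairs.

    pair_links holds one (contig_a, contig_b) tuple per mapped read pair.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count supporting pairs per unordered contig pair, ignoring same-contig pairs.
    support = Counter(tuple(sorted(link)) for link in pair_links
                      if link[0] != link[1])
    for (a, b), n in support.items():
        if n >= min_pairs:
            parent[find(a)] = find(b)

    bins = defaultdict(list)
    for c in contigs:
        bins[find(c)].append(c)
    return sorted(sorted(members) for members in bins.values())


links = [(1, 2), (2, 1), (2, 3), (4, 5), (4, 5), (3, 4)]
print(group_contigs([1, 2, 3, 4, 5], links, min_pairs=2))
print(group_contigs([1, 2, 3, 4, 5], links, min_pairs=1))
```

Requiring more than one supporting pair keeps a single mis-mapped read from merging unrelated contigs into one large bin.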
Curtain
• Assemble each bin
  • Run Velvet using paired-end information
  • Bin specific parameters
  • Run each bin individually
  • Highly parallelizable
• Collect results
• Start next iteration
Curtain
• Low memory footprint
• Scalable for large genomes
  • Make use of cluster
• Available
  • www.ebi.ac.uk/egt
  • http://code.google.com/p/curtain/
• Future announcements
  • http://groups.google.com/group/curtain-assembler
• Future work
  • Long read support
Curtain practical
• Run Curtain for Staphylococcus
• Simulation data
Thanks ...