• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
2011-04-26_01-velvet-curtain-presentation
 

2011-04-26_01-velvet-curtain-presentation

on

  • 1,050 views

 

Statistics

Views

Total Views
1,050
Views on SlideShare
1,050
Embed Views
0

Actions

Likes
1
Downloads
43
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    2011-04-26_01-velvet-curtain-presentation 2011-04-26_01-velvet-curtain-presentation Presentation Transcript

    • Velvet / CurtainMatthias Haimel EBI is an Outstation of the European Molecular Biology Laboratory.
    • 2 25.04.11 Velvet / Curtain
    • Overview • De Bruijn Graph • Velvet • Theory • Practice • Data formats and quality • Velvet • Simulation data • Multiple insert lengths • Curtain • Theory • Practice3 25.04.11 Velvet / Curtain
    • De Bruijn graph • A concept in combinatorial mathematics • In combinatorics, de bruijn graph is usually fully connected • http://en.wikipedia.org/wiki/De_Bruijn_graph • de bruijn sequence • Related concept • Path through graph • Velvet • de Bruijn inspired graph structure4 25.04.11 Velvet / Curtain
    • De Bruijn graph (Velvet) • Representation of • a sequence based on short words (k-mers) • overlaps between words • K-mer: word of length k • K=5 GCCTTCCA • k-1 overlap GCCTT GCCTT GCCTT CCTTC CCTTC CCTTC CTTCC CTTCC TTCCA ... GCCTTCCA GCCTTCCA GCCTTCCA5 25.04.11 Velvet / Curtain
    • De Bruijn graph (Velvet) GCCTTCCAATTT GCCTTCAAATTT C A CTTC TTCC ..... CAATT T CCT TC G CT C AATTT A A CTTC TTCA ..... AAATT6 25.04.11 Velvet / Curtain
    • De Bruijn graph representations (Velvet) TTCA ATTC TCAG Error free, no repeat, no polymorphism Repeat > kmer length SNP, variant, < kmer length Structural variant, inversion Structural variant, deletion… …7 25.04.11 Velvet / Curtain
    • Example TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG8 25.04.11 Velvet / Curtain
    • Example Read: GTCGAGG GTCG (1x)9 25.04.11 Velvet / Curtain
    • Example Read: GTCGAGG GTCG TCGA (1x) (1x)10 25.04.11 Velvet / Curtain
    • Example Read: GTCGAGG GTCG TCGA CGAG (1x) (1x) (1x)11 25.04.11 Velvet / Curtain
    • Example Read: GTCGAGG GTCG TCGA CGAG GAGG (1x) (1x) (1x) (1x)12 25.04.11 Velvet / Curtain
    • Example New read: CGAGGCT GTCG TCGA CGAG GAGG (1x) (1x) (2x) (1x)13 25.04.11 Velvet / Curtain
    • Example Read: CGAGGCT GTCG TCGA CGAG GAGG (1x) (1x) (2x) (2x)14 25.04.11 Velvet / Curtain
    • Example Read: CGAGGCT GTCG TCGA CGAG GAGG AGGC (1x) (1x) (2x) (2x) (1x)15 25.04.11 Velvet / Curtain
    • Example Read: CGAGGCT GTCG TCGA CGAG GAGG AGGC GGCT (1x) (1x) (2x) (2x) (1x) (1x)16 25.04.11 Velvet / Curtain
    • Example New read: TCGACGC GTCG TCGA CGAG GAGG AGGC (1x) (2x) (2x) (2x) (1x)17 25.04.11 Velvet / Curtain
    • Example Read: TCGACGC GTCG TCGA CGAG GAGG AGGC (1x) (2x) (2x) (2x) (1x) CGAC GACG ACGC (1x) (1x) (1x)18 25.04.11 Velvet / Curtain
    • Example etc… GATT (1x) TGAG ATGA GATG CGAT CCGA TCCG ATCC GATC AGAT (9x) (8x) (5x) (6x) (7x) (7x) (7x) (8x) (8x) AGAA (1x) GCTC CTCT TCTA CTAG (2x) (1x) (2x) (2x) TAGT AGTC GTCG TCGA CGAG GAGG AGGC GGCT TAGA AGAG GAGA AGAC GACA ACAG (3x) (7x) (9x) (10x) (8x) (16x) (16x) (11x) (16x) (9x) (12x) (9x) (8x) (5x) GCTT CTTT TTTA TTAG (8x) (8x) (8x) (12x) CGAC GACG ACGC (1x) (1x) (1x)19 25.04.11 Velvet / Curtain
    • Example After simplification… GATT AGAT GATCCGATGAG AGAA GCTCTAG TAGTCGA CGAG GAGGCT GGCT TAGA AGAGA AGACAG GCTTTAG CGACGC20 25.04.11 Velvet / Curtain
    • Example Tips removed… AGAT GATCCGATGAG GCTCTAG TAGTCGA CGAG GAGGCT GGCT TAGA AGAGA AGACAG GCTTTAG21 25.04.11 Velvet / Curtain
    • De Bruijn graph biology extensions (Velvet) • Handling of reverse strand • DNA is read in two directions • Paired-end data • Handling small differences, which are “uninteresting” • Errors in sequencing technology • Memory • regularly use 80, 100GB real memory • easily get to 1TB real memory requirements22 25.04.11 Velvet / Curtain
    • Read variety • Short reads ~75bp • Illumina / Solexa • SOLiD (colour space) • Long reads 500-1000 bp • 454 read • Sanger capillary reads • Paired-end reads • Short reads • short insert length • Mate pair reads • Short reads • long insert length23 25.04.11 Velvet / Curtain
    • Paired-End Mate Pair24 25.04.11 Velvet / Curtain
    • Short paired-end / mate pair reads ?Velvet expect Illumina paired-end orientation: (L-> <-R) L R paired-end25 25.04.11 Velvet / Curtain
    • Short paired-end / mate pair readsIllumina mate-pair orientation: (<-L R->) L R mate pair reverse complement L R paired-end26 25.04.11 Velvet / Curtain
    • Velvet algorithms • Remove Bubbles • Tour Bus • Velvet parameters • -max_branch_length • -max_divergence • -max_gap_count27 25.04.11 Velvet / Curtain
    • Example AGAT GATCCGATGAG GCTCTAG TAGTCGA CGAG GAGGCT GGCT TAGA AGAGA AGACAG GCTTTAG GCTC CTCT TCTA CTAG (2x) (1x) (2x) (2x) GGCT TAGA (11x) (16x) GCTT CTTT TTTA TTAG (8x) (8x) (8x) (12x)28 25.04.11 Velvet / Curtain
    • Example Bubbles removed… by TourBus AGAT GATCCGATGAG TAGTCGA CGAG GAGGCT GGCT GCTTTAG TAGA AGAGA AGACAG29 25.04.11 Velvet / Curtain
    • Example Final simplification… AGATCCGATGAG TAGTCGAG GAGGCTTTAGA AGAGACAG30 25.04.11 Velvet / Curtain
    • Example TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG Final simplification… AGATCCGATGAG TAGTCGAG GAGGCTTTAGA AGAGACAG One possible walk through the graph ... TAGTCGAG GAGGCTTTAGA AGATCCGATGAG GAGGCTTTAGA AGAGACAG31 25.04.11 Velvet / Curtain
    • N50 • Total • N90 • N50 • N1032 25.04.11 Velvet / Curtain
    • N50 • Total • 4,295,113bp • N90 • 439bp • N50 • 3,119bp • N10 • 13,519bp33 25.04.11 Velvet / Curtain
    • N50 • N50 is the length of the smallest contig • contains the fewest (largest) contigs • combined length represents at least 50% of the assembly • N10 • > 10 % of the largest contigs http://www.broadinstitute.org/crd/wiki/index.php/N5034 25.04.11 Velvet / Curtain
    • Velvet practical: Part 1 • Compile • Single end (ERX001300) • K-mer length • Coverage cut-offs • Whole genome sequence as input??? • Staphylococcus aureus MRSA25235 25.04.11 Velvet / Curtain
    • Velvet algorithms • Long read information • Rock Band • Velvet parameters • -long_mult_cutoff36 25.04.11 Velvet / Curtain
    • Velvet algorithms • Paired-end information • Pebble • Velvet parameters • -min_pair_count Once all distances and variance computed, Simple greedy extension from main contigs out37 25.04.11 Velvet / Curtain
    • Paired-end in Velvet • Hugely improves quality of assembly • Insert length greater than repeat • greater than the length of the most common genomic repeat • Mixed insert length improves results • Short: helps for local assembly • Long: get over repeats • Large genomes • Very memory intensive • Calculation intensive38 25.04.11 Velvet / Curtain
    • Data formats and quality • Fasta • Fastq • .fasta • .fastq • .fa • .fq • ? • ? Header >read_1 @SEQ_ID TATAATATTTAT... GATTTGGGGTTCAAAGC Sequence + !*((((***+))%%% Quality39 25.04.11 Velvet / Curtain
    • FASTQ paired @SRR022863.1.F ATATAGATGTACATAAATTAGTTGAAGTATATGAACG + .F .R IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII /1 /2 @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTTCTTTCTCTTC + IIIIIHIIIIIIII3III.,IIII&II6II-))&I0 @SRR022863.1.F @SRR022863.1.R ATATAGATGTACATAAATTAGT... TTCACCCATTTTATCCATGATTTTGTT... + + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII IIIIIHIIIIIIII3III.,IIII&II6II-))&I0 @SRR022863.2.F @SRR022863.2.R TTATGAATTATTAATAAGTGCT... CATAAAAAAAGAAAATGTACTCTTTAC... + + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII IIII)0&A,%.&9$8I4+A;I)4II)&%-I$I%#)II40 25.04.11 Velvet / Curtain
    • Quality score • Velvet does NOT use quality score!!! • Error correction of de Bruijn graph • p • the probability that the corresponding base call is incorrect • Phred quality score • 10 -> 1 in 10 • 40 -> 1 in 10,000. • Odds ratio • earlier versions of solexa pipeline • differs mainly at lower levels41 25.04.11 Velvet / Curtain
    • Quality encoding • !*((((***+))%%% • One value per base • Integer mapping based on ASCII encoding • probability of incorrect base call • Sanger format • Illumina 1.5+ • Phred score • Phred score • ASCII 33 – 126 -> 0 – 93 • ASCII 59 – 126 -> -5 – 62 • Rarely exceeds 60 • Only 2 – 40 expected • ! = 33 -> 0 • ! = 33 -> (does not exist) • b = 66 -> 33 • b = 66 -> 242 25.04.11 Velvet / Curtain
    • Quality encoding • wikipedia43 25.04.11 Velvet / Curtain
    • Quality trimming Good / Bad ? Quality score Bp position in Read44 25.04.11 Velvet / Curtain
    • Quality trimming • Fixed length trimming • Cut-off at position x • Adaptive trimming • Quality score cut-off • Minimum sequence length • Sliding window • Window size • Quality score cut-off • Use average quality value of window45 25.04.11 Velvet / Curtain
    • Velvet practical: Part 2 • Paired-end (SRX008042) • Explore parameters • Set cut-offs • Analyse quality score (SRX008042) • Trimming reads46 25.04.11 Velvet / Curtain
    • Velvet modules • Columbus (since Velvet 1.0) • use reference sequence • assist with alignment information • local re-sequencing • structural variants47 25.04.11 Velvet / Curtain
    • Velvet modules • Oases • De novo transcriptome assembler • uses preliminary Velvet assembly • clusters contigs into loci • construct transcript isoforms using paired-end / long read information • confidence score: describes uniqueness of a transcript in a locus48 25.04.11 Velvet / Curtain
    • Read Simulation - Why? • Controlling the data • Contamination • Coverage distribution • Sequencing errors • Genome size • Insert length • Insert length distribution49 25.04.11 Velvet / Curtain
    • Read Simulation - Why? • Make results comparable • Assemblers • Parameters • Algorithms • Assembly strategies • Genome specific “features” • Robust • Introduce errors • Simulate SNPs50 25.04.11 Velvet / Curtain
    • Real data vs. simulation Mario Caccamo51 25.04.11 Velvet / Curtain
    • Real data vs. simulation Mario Caccamo52 25.04.11 Velvet / Curtain
    • Velvet practical: Part 3 • Velvet • Long Reads • Hybrid Assembly • Mixed insert length libraries53 25.04.11 Velvet / Curtain
    • Curtain • assembly pipeline • Paired-end assembly for large genomes • Group related Contigs • Uses velvet to assemble groups of related reads • Iterative approach54 25.04.11 Velvet / Curtain
    • Curtain Genome assembly Pipeline Curtain Contigs Map Group Fill Assemble Collect Reads Contigs Bins55 25.04.11 Velvet / Curtain
    • Curtain Curtain Contigs Map Group Fill AssembleCollect Reads Contigs Bins • Set of input Contigs • Use established assemblers • Velvet unpaired • Cortex • SGA • ...56 25.04.11 Velvet / Curtain
    • Curtain Curtain Contigs Map Group Fill AssembleCollect Reads Contigs Bins • Map reads to input contigs • SAM file support • bwa • maq57 25.04.11 Velvet / Curtain
    • Curtain Curtain Contigs Map Group Fill AssembleCollect Reads Contigs Bins • Group Contigs using Paired-end information 1 2 3 4 5 bin mapping read & read pair58 25.04.11 Velvet / Curtain
    • Curtain Curtain Contigs Map Group Fill Reads Contigs Bins AssembleCollect • Assemble each bin • Run velvet using paired-end information • bin specific parameters • Run each bin individually velvet • Highly parallelizable • Collect results • Start next iteration …………………. Results59 25.04.11 Velvet / Curtain
    • Curtain • Low memory footprint • Scalable for large genomes • Make use of cluster • Available • www.ebi.ac.uk/egt • http://code.google.com/p/curtain/ • Future announcements • http://groups.google.com/group/curtain-assembler • Future work • Long read support60 25.04.11 Velvet / Curtain
    • Curtain practical • Run Curtain for Staphylococcus • Simulation data61 25.04.11 Velvet / Curtain
    • Thanks ...62 25.04.11 Velvet / Curtain