ECCB10 talk - Next-generation sequencing and structural variation

Presentation Transcript

  • Next-generation sequencing and structural variation. Jan Aerts, Wellcome Trust Sanger Institute
    • principles & pitfalls vs list of commands
  • What is structural variation?
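    • “variation that changes the structure of a chromosome”
    • Mechanisms: NAHR (non-allelic homologous recombination), NHEJ (non-homologous end joining), FoSTeS (fork stalling and template switching)
    • This presentation: focus on discovery (not: genotyping)
    • “experiment 4” from Thomas's last slide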
  • Types of structural variation
  • Approaches for discovery
    • Combination of:
    • Read pairs
    • Read depth
    • Split reads
    • Fine-mapping breakpoints: local assembly
    • => Identify signatures
  • A. Read Pairs
  • RP - General principle
    • Paired-end library => insert size
    • Orientation/distance
  • RP - Signatures
  • RP - Real world
  • RP - Workflow overview
    • Mapping
    • Identify discordant read pairs
    • Cluster on location
    • Filter on nr RPs/cluster
    • Filter on RD
    • Filter: mappingQ x #readpairs
    • Identify signatures
    • Alternative reference
    • Validate
  • RP - Mapping
    • Provides raw data => crucial
    • MAQ/bwa
      • report only one hit for reads with several equally good mappings (mappingQ = 0)
      • MAQ might prefer mismatches to aberrant distance!
    • Insert size = distribution instead of exact
  • RP - Discordant read pairs
    • Orientation
    • Distance
      • Plot insert size distribution for chromosome
      • Very long tail! => difficult to set cutoff:
        • 4 MAD (median absolute deviations) or the most extreme 0.01%?
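The two candidate cutoffs on this slide can be compared on a list of observed insert sizes. The following is a minimal sketch, assuming the sizes have already been extracted from the mapped read pairs; the function names and the toy data are illustrative and not taken from the talk.

```python
import statistics

def insert_size_bounds(insert_sizes, n_mads=4.0):
    """Return (low, high) bounds at median +/- n_mads * MAD."""
    sizes = list(insert_sizes)
    med = statistics.median(sizes)
    # MAD = median of the absolute deviations from the median
    mad = statistics.median([abs(s - med) for s in sizes])
    return med - n_mads * mad, med + n_mads * mad

def upper_percentile_cutoff(insert_sizes, tail_fraction=0.0001):
    """Return the size above which roughly tail_fraction of pairs fall.

    tail_fraction=0.0001 corresponds to the 0.01% option on the slide.
    """
    ordered = sorted(insert_sizes)
    idx = min(len(ordered) - 1, int(len(ordered) * (1.0 - tail_fraction)))
    return ordered[idx]

# Toy data (hypothetical): one pair spans far more than the library size.
sizes = [190, 195, 200, 200, 205, 210, 5000]
low, high = insert_size_bounds(sizes)
discordant = [s for s in sizes if not low <= s <= high]
print(low, high, discordant)
```

Because the median and MAD ignore the long tail, the MAD-based bounds stay tight around the library's main peak, whereas a percentile cutoff moves with the tail itself.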
  • RP - Clustering
    • “standard clustering strategy”
      • Only consider mate pairs that do not have concordant mappings
      • Ignore read pairs that have more than one good mapping
    • Clustering: use insert size distribution (e.g. 2 x 4 MAD)
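Clustering on location can be sketched as a single greedy pass over the discordant pairs sorted by position. This is a simplified illustration, not the strategy of any particular tool: the `Pair` type and `max_gap` parameter are assumptions, and `max_gap` would in practice be derived from the insert size distribution.

```python
from typing import List, NamedTuple

class Pair(NamedTuple):
    chrom: str
    left: int   # mapped position of the left-most read
    right: int  # mapped position of its mate

def cluster_pairs(pairs: List[Pair], max_gap: int) -> List[List[Pair]]:
    """Greedily group pairs whose two ends both lie close together."""
    clusters: List[List[Pair]] = []
    for pair in sorted(pairs):  # sorts by chrom, then left, then right
        last = clusters[-1][-1] if clusters else None
        if (last is not None
                and pair.chrom == last.chrom
                and pair.left - last.left <= max_gap
                and abs(pair.right - last.right) <= max_gap):
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters

# Hypothetical discordant pairs; three of them support the same event.
pairs = [
    Pair("1", 1000, 6000), Pair("1", 1050, 6040), Pair("1", 1100, 6100),
    Pair("1", 90000, 95000), Pair("2", 1020, 6010),
]
clusters = cluster_pairs(pairs, max_gap=200)
print([len(c) for c in clusters])
```

A greedy pass only compares each pair with the most recently clustered one, so interleaved events at the same locus would be split; it is meant to show the idea rather than handle such cases.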
  • RP - Clustering: issues
    • Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)
    • Which cutoff defines an abnormal distance? (4 MAD? 0.01%? 2 SD?)
    • Low library quality or mix of libraries => multiple peaks in size distribution
  • RP - Filtering
    • On nr RPs/cluster
      • Normally: n=2
      • For high coverage ( e.g. pilot 2: 80X): n=5
    • On drop in RD & SR
    • On (mappingQ x nrRP)
      • If published data available: ROC for different cutoffs mQxnrRP
      • If not: very difficult
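The two cluster-level filters can be written as one predicate. The slide does not define how mappingQ is summarised per cluster, so the sketch below assumes the mean mapping quality multiplied by the number of read pairs; `min_score` is a hypothetical threshold that would have to be tuned, for example with the ROC approach mentioned above.

```python
def passes_filters(mapqs, min_pairs=2, min_score=0.0):
    """Keep a cluster with enough read pairs and a high enough score.

    mapqs: one mapping quality per read pair in the cluster.
    Score (assumed definition): mean mapping quality x number of pairs.
    """
    n = len(mapqs)
    if n < min_pairs:
        return False
    mean_mapq = sum(mapqs) / n
    return mean_mapq * n >= min_score

# Normal coverage: n=2; high coverage (e.g. 80X): n=5.
print(passes_filters([30, 40], min_pairs=2, min_score=60))
print(passes_filters([60], min_pairs=2))
print(passes_filters([5, 5, 5, 5, 5], min_pairs=5, min_score=60))
```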
  • RP - Issues
    • Difficult => different groups = different results => “consensus set”
      • RP & SR: many sets agree
      • RD: totally different
    • CEU (80X): sometimes drop in RD in all 3, but RP spanning only in 2 => why??
    • Mapper = critical; MAQ/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results
  • RP - Issues (2)
    • Large insert size: low resolution for detecting breakpoints
    • Small insert size: low resolution for detecting complex regions
  • B. Read Depth
  • RD - General principle
    • Similar to aCGH: using reference RD file ( e.g. based on 1kG)
    • In theory: higher resolution, but noisier than aCGH
      • Algorithms not mature yet
      • More complex steps
    • => Data binned
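Binning and comparison against a reference read-depth profile can be sketched as follows. This is an illustrative simplification that assumes fixed-width bins and read start positions on a single chromosome; the pseudocount and function names are assumptions rather than part of the talk.

```python
import math
from collections import Counter

def bin_counts(read_starts, bin_size, n_bins):
    """Count read start positions in fixed-width bins."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]

def log2_ratios(sample, reference, pseudocount=0.5):
    """Per-bin log2 ratio of sample vs reference depth.

    Both profiles are scaled by their total counts so that libraries of
    different size are comparable; the pseudocount avoids log(0).
    """
    s_total, r_total = sum(sample), sum(reference)
    return [
        math.log2(((s + pseudocount) / s_total) / ((r + pseudocount) / r_total))
        for s, r in zip(sample, reference)
    ]

counts = bin_counts([0, 5, 99, 100, 250], bin_size=100, n_bins=3)
ratios = log2_ratios([10, 5], [10, 10])
print(counts, ratios)
```

As with aCGH, a negative ratio suggests a loss relative to the reference and a positive ratio a gain, although single bins are noisy.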
  • RD - Exome
    • here: using exome data
  • RD - Example
  • RD - Workflow overview
    • Mapping
    • Read filtering
    • GC correction
    • Spike identification
    • Validation
  • RD - mapping
    • Critical… (see RP)
  • RD - Filtering
    • mapQ
      • mapQ >= 0 (noisy; few FN, many FP)
      • mapQ >= 10
      • mapQ >= 30 (many FN, few FP)
    • Mean depth exon (often: e.g. +/- 0.01)
      • Mean depth > 1
      • Mean depth > 5
  • RD - Filtering: what’s left
    • Columns: mapQ >= 0 | mapQ >= 10 | mapQ >= 30
    • all: 207,000 | 207,000 | 207,000
    • mean DP exon > 1: 169,000 | 163,000 | 162,000
    • mean DP exon > 5: 160,000 | 153,000 | 152,000
  • RD - correction
    • Mainly: GC
      • Other: repeat-rich regions, mapping Q, …
    • Fit linear model GC-content exon and RD of exon => noise decreases
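The GC correction can be sketched as an ordinary least-squares fit of exon read depth on exon GC content, keeping the residual around the mean depth. This is a minimal sketch using only the standard library; the input values are hypothetical, and real data would need the other corrections listed above as well.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc, depth):
    """Remove the linear GC trend, keeping the overall mean depth."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth
            for g, d in zip(gc, depth)]

# Hypothetical exons whose depth rises with GC content.
gc = [0.3, 0.4, 0.5, 0.6]
depth = [20.0, 30.0, 40.0, 50.0]
corrected = gc_correct(gc, depth)
print(corrected)
```

In this toy case all of the variation is explained by GC content, so the corrected depths collapse onto the mean; in real data only the GC-driven part of the noise is removed.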
  • RD - segmentation
    • Identify spikes
    • Many segmentation algorithms, e.g. GADA
    • Issues: setting parameters: when to cut off peaks?
      • Combine outputs from different runs with different parameters
      • Compare to known CNVs
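To make the parameter problem concrete, here is a deliberately simple threshold-based run caller. It is a stand-in for illustration only and is not GADA or any published segmentation algorithm; `threshold` (assumed positive) and `min_bins` are hypothetical parameters of exactly the kind whose choice the slide calls difficult.

```python
def call_segments(log2_ratios, threshold=0.5, min_bins=2):
    """Report runs of consecutive bins beyond +/- threshold.

    Returns (first_bin, last_bin, "gain" | "loss") tuples for runs of at
    least min_bins bins. Assumes threshold > 0.
    """
    segments = []
    start, state = 0, 0  # state: 1 = gain, -1 = loss, 0 = neutral
    # A neutral sentinel value closes a run that reaches the last bin.
    for i, value in enumerate(list(log2_ratios) + [0.0]):
        current = 1 if value >= threshold else -1 if value <= -threshold else 0
        if current != state:
            if state != 0 and i - start >= min_bins:
                segments.append((start, i - 1, "gain" if state > 0 else "loss"))
            start, state = i, current
    return segments

ratios = [0.0, 0.1, -0.9, -1.1, -1.0, 0.05, 0.8, 0.0, 0.7, 0.9]
segments = call_segments(ratios)
print(segments)
```

Rerunning with a lower `min_bins` or `threshold` changes the call set, which is why combining outputs from several parameter settings and comparing them with known CNVs is suggested.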
  • RD - Combine algorithms
  • RD - Issues
    • How to assess TP/FP/FN? => compare with known CNVs
    • Breakpoints: unknown
      • 1 datapoint/exon
      • Can be outside of exon
    • Different parameters for rare vs common CNVs => which?
  • C. Split Reads
  • SR - Principle
  • SR - Mapping
    • Short subsequences => many possible mappings
    • Solution: “anchored split mapping” ( e.g. Pindel)
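The idea behind anchoring is that the mapped mate restricts where the two parts of the unmapped read are searched for, so short subsequences no longer match all over the genome. The toy sketch below illustrates this for a deletion using exact string matching; it is not Pindel's actual algorithm, and the function name, `window` and `min_part` are assumptions.

```python
def anchored_split(reference, read, anchor_pos, window, min_part=3):
    """Find a deletion that explains an unmapped read near its anchor.

    Searches only reference[anchor_pos:anchor_pos + window]. Returns
    (deletion_start, deletion_length) in reference coordinates, or None.
    """
    region = reference[anchor_pos:anchor_pos + window]
    for split in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:split], read[split:]
        p = region.find(prefix)
        if p == -1:
            break  # a longer prefix cannot match either
        prefix_end = p + split
        s = region.find(suffix, prefix_end)
        if s > prefix_end:  # a gap between both parts implies a deletion
            return anchor_pos + prefix_end, s - prefix_end
    return None

# Toy reference (hypothetical); the read skips the bases "GGGCCC".
reference = "TTACGTTGCAGGGCCCTATCAGTCAA"
read = "TTGCATATCA"
result = anchored_split(reference, read, anchor_pos=0, window=26)
print(result)
```

Trying the shortest prefix first means that, where microhomology allows several split points, the left-most breakpoint is reported.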
  • D. Local reassembly
    • Aim: to determine breakpoints
    • Which reads?
      • for deletions: local reads
      • for insertions: hanging reads for read pairs with only one read mapped
      • (rather not: unmapped reads)
    • For large region: split up
  • Assemblers
    • Velvet
    • ABySS
    • TIGRA
  • Conclusions
    • Available algorithms: more to demonstrate technique rather than complete solution
    • Different algorithms => different results
  • Chris Yoon
  • Genotyping
    • Create alternative reference => remap reads
      • All reads vs reads covering variant loci
      • Whole-genome vs concatenation of variant loci
    • Homozygous insertions/deletions: should disappear
    • Heterozygous insertions/deletions: should have different signatures
    • Bayesian approach: which genotype is most likely, i.e. do the reads support wild-type, heterozygous or homozygous non-reference?
    • Not exact mapping => local reassembly
      • Microhomologies & non-template sequence => “breakpoint” = region of 2-10 bp
        • Convention: left-most position reported (but not always)
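The Bayesian comparison of genotypes can be sketched with a binomial model over reads supporting the reference versus the alternative allele after remapping. This is a minimal sketch under stated assumptions: a flat prior, a fixed `error_rate` for mis-assigned reads, and an expected alternative-read fraction of 0.5 for heterozygotes. None of these values come from the talk.

```python
import math

def genotype_posteriors(n_ref, n_alt, error_rate=0.01, priors=None):
    """Posterior probability of each genotype given supporting read counts."""
    # Expected fraction of reads supporting the alternative allele.
    alt_fraction = {
        "hom_ref": error_rate,
        "het": 0.5,
        "hom_nonref": 1.0 - error_rate,
    }
    if priors is None:
        priors = {g: 1.0 / 3.0 for g in alt_fraction}
    # Log binomial likelihood; the binomial coefficient is the same for
    # every genotype and cancels in the normalisation.
    log_post = {
        g: math.log(priors[g]) + n_alt * math.log(p) + n_ref * math.log(1.0 - p)
        for g, p in alt_fraction.items()
    }
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

def most_likely(n_ref, n_alt):
    post = genotype_posteriors(n_ref, n_alt)
    return max(post, key=post.get)

print(most_likely(20, 0), most_likely(10, 10), most_likely(0, 20))
```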
  • References and software
    • Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
    • Lee S et al. Bioinformatics 24:i59-i67 (2008)
    • Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
    • Campbell P et al. Nat Genet 40:722-729 (2008)
    • Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
    • Chen K et al. Genome Res 19:1527-1541 (2009)
    • Yoon S et al. Genome Res 19:1586-1592 (2009)
    • Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
    • Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)
  • Questions?