ECCB10 talk - Next-generation sequencing and structural variation

  1. 1. Next-generation sequencing and structural variation. Jan Aerts, Wellcome Trust Sanger Institute
  2. 2. <ul><li>Principles & pitfalls vs. list of commands </li></ul>
  3. 3. What is structural variation? <ul><li>“variation that changes the structure of a chromosome” </li></ul><ul><li>Mechanisms: NAHR, NHEJ, FoSTeS </li></ul><ul><li>This presentation: focus on discovery (not genotyping) </li></ul><ul><li>“experiment 4” from Thomas's last slide </li></ul>
  4. 4. Types of structural variation
  5. 5. Approaches for discovery <ul><li>Combination of: </li></ul><ul><li>Read pairs </li></ul><ul><li>Read depth </li></ul><ul><li>Split reads </li></ul><ul><li>Fine-mapping breakpoints: local assembly </li></ul><ul><li>=> Identify signatures </li></ul>
  6. 6. A. Read Pairs
  7. 7. RP - General principle <ul><li>Paired-end library => insert size </li></ul><ul><li>Orientation/distance </li></ul>
  8. 8. RP - Signatures
  9. 9. RP - Real world
  10. 10. RP - Workflow overview <ul><li>Mapping </li></ul><ul><li>Identify discordant read pairs </li></ul><ul><li>Cluster on location </li></ul><ul><li>Filter on nr RPs/cluster </li></ul><ul><li>Filter on RD </li></ul><ul><li>Filter: mappingQ x nr RPs </li></ul><ul><li>Identify signatures </li></ul><ul><li>Alternative reference </li></ul><ul><li>Validate </li></ul>
  11. 11. RP - Mapping <ul><li>Provides raw data => crucial </li></ul><ul><li>MAQ/bwa </li></ul><ul><ul><li>only report one hit (mappingQ = 0) </li></ul></ul><ul><ul><li>MAQ might prefer mismatches to aberrant distance! </li></ul></ul><ul><li>Insert size = distribution instead of exact </li></ul>
  12. 12. RP - Discordant read pairs <ul><li>Orientation </li></ul><ul><li>Distance </li></ul><ul><ul><li>Plot insert size distribution for chromosome </li></ul></ul><ul><ul><li>Very long tail! => difficult to set cutoff: </li></ul></ul><ul><ul><ul><li>4 MAD or 0.01%? </li></ul></ul></ul>
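
The distance cutoff on this slide can be sketched in a few lines. This is a minimal illustration, assuming the "4 MAD" option (median plus or minus four median absolute deviations); the function names, the input format, and the default of 4 are assumptions made for the example, not part of the talk.

```python
from statistics import median


def mad_cutoffs(insert_sizes, n_mads=4):
    """Return (lower, upper) insert-size bounds as median +/- n_mads * MAD."""
    med = median(insert_sizes)
    # MAD = median of the absolute deviations from the median;
    # robust against the long tail of the insert-size distribution.
    mad = median(abs(size - med) for size in insert_sizes)
    return med - n_mads * mad, med + n_mads * mad


def classify_pair(insert_size, proper_orientation, bounds):
    """Label one read pair; orientation is checked before distance."""
    lower, upper = bounds
    if not proper_orientation:
        return "discordant_orientation"
    if insert_size > upper:
        return "discordant_too_far"
    if insert_size < lower:
        return "discordant_too_close"
    return "concordant"


sizes = [190, 195, 200, 205, 210, 1500]   # hypothetical insert sizes (bp)
bounds = mad_cutoffs(sizes)               # (172.5, 232.5)
print(classify_pair(1500, True, bounds))  # discordant_too_far
```

The 0.01% option would instead take empirical quantiles of the same distribution; which one behaves better depends on how heavy the tail of the library is.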
  13. 18. RP - Clustering <ul><li>“standard clustering strategy” </li></ul><ul><ul><li>Only consider mate pairs that do not have concordant mappings </li></ul></ul><ul><ul><li>Ignore read pairs that have more than one good mapping </li></ul></ul><ul><li>Clustering: use insert size distribution (e.g. 2 x 4 MAD) </li></ul>
  14. 19. RP - Clustering: issues <ul><li>Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications) </li></ul><ul><li>What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 SD?) </li></ul><ul><li>Low library quality or mix of libraries => multiple peaks in size distribution </li></ul>
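
One way to picture clustering on location is a greedy pass over the discordant pairs. The sketch below is an illustration only, not the strategy used by any particular tool; it assumes all pairs lie on one chromosome, are uniquely mapped, and share the same signature, and `max_dist` is a hypothetical parameter that could be set from the insert size distribution (e.g. 2 x 4 MAD).

```python
def cluster_pairs(pairs, max_dist):
    """Greedily group (left_pos, right_pos) pairs whose ends lie close together.

    A pair joins the first cluster whose most recently added member has both
    ends within max_dist; otherwise it starts a new cluster.
    """
    clusters = []
    for left, right in sorted(pairs):
        for cluster in clusters:
            last_left, last_right = cluster[-1]
            if left - last_left <= max_dist and abs(right - last_right) <= max_dist:
                cluster.append((left, right))
                break
        else:
            clusters.append([(left, right)])
    return clusters


# Hypothetical discordant pairs: three support one event, one is isolated.
pairs = [(1000, 5000), (1020, 5030), (1050, 4990), (9000, 15000)]
for cluster in cluster_pairs(pairs, max_dist=60):
    print(len(cluster), cluster)
```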
  15. 20. RP - Filtering <ul><li>On nr RPs/cluster </li></ul><ul><ul><li>Normally: n=2 </li></ul></ul><ul><ul><li>For high coverage (e.g. pilot 2: 80X): n=5 </li></ul></ul><ul><li>On drop in RD & SR </li></ul><ul><li>On (mappingQ x nrRP) </li></ul><ul><ul><li>If published data available: ROC for different cutoffs of mappingQ x nrRP </li></ul></ul><ul><ul><li>If not: very difficult </li></ul></ul>
  16. 21. RP - Issues <ul><li>Difficult => different groups = different results (“consensus set”) </li></ul><ul><ul><li>RP & SP: many sets agree </li></ul></ul><ul><ul><li>RD: totally different </li></ul></ul><ul><li>CEU (80X): sometimes drop in RD in all 3, but RP spanning only in 2 => why?? </li></ul><ul><li>Mapper = critical; MAQ/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results </li></ul>
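
The ROC idea on this slide can be sketched as follows. Two things here are assumptions made for illustration: the score is taken to be mean mapping quality times the number of supporting read pairs, and any call absent from the published set is counted as a false positive (it could equally be a novel variant).

```python
def cluster_score(mapqs):
    """Assumed score: mean mapping quality x number of supporting read pairs."""
    return (sum(mapqs) / len(mapqs)) * len(mapqs)


def roc_points(scored_calls, cutoffs):
    """scored_calls: list of (score, is_known). Returns (cutoff, TPR, FPR) tuples."""
    positives = sum(1 for _, known in scored_calls if known)
    negatives = len(scored_calls) - positives
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for score, known in scored_calls if score >= cutoff and known)
        fp = sum(1 for score, known in scored_calls if score >= cutoff and not known)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((cutoff, tpr, fpr))
    return points


# Hypothetical calls: (score, overlaps a published SV?)
calls = [(150, True), (90, True), (40, False), (120, False)]
for cutoff, tpr, fpr in roc_points(calls, cutoffs=[50, 100]):
    print(cutoff, tpr, fpr)
```

Plotting TPR against FPR over a range of cutoffs then shows where raising the cutoff stops removing false positives and starts removing known events.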
  17. 22. RP - Issues (2) <ul><li>Large insert size: low resolution for detecting breakpoints </li></ul><ul><li>Small insert size: low resolution for detecting complex regions </li></ul>
  18. 23. B. Read Depth
  19. 24. RD - General principle <ul><li>Similar to aCGH: using reference RD file (e.g. based on 1kG) </li></ul><ul><li>In theory: higher resolution, but noisier than aCGH </li></ul><ul><ul><li>Algorithms not mature yet </li></ul></ul><ul><ul><li>More complex steps </li></ul></ul><ul><li>=> Data binned </li></ul>
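
As a rough illustration of the aCGH-like comparison, the sketch below bins read start positions and computes a per-bin log2 ratio against a reference depth profile. The bin size, the pseudocount, and the normalisation by total read count are assumptions chosen for the example.

```python
import math


def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-size bin; positions outside the bins are ignored."""
    counts = [0] * n_bins
    for pos in read_starts:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def log2_ratios(sample, reference, pseudocount=0.5):
    """Per-bin log2(sample/reference) after scaling each profile by its total."""
    s_total, r_total = sum(sample), sum(reference)
    if s_total == 0 or r_total == 0:
        raise ValueError("both profiles need at least one read")
    return [
        math.log2(((s + pseudocount) / s_total) / ((r + pseudocount) / r_total))
        for s, r in zip(sample, reference)
    ]


sample = [100, 100, 50, 100]       # hypothetical binned depths, third bin halved
reference = [100, 100, 100, 100]
print([round(x, 2) for x in log2_ratios(sample, reference)])
```

A clearly negative value in one bin relative to its neighbours is the kind of signal the later segmentation step looks for.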
  20. 25. RD - Exome <ul><li>here: using exome data </li></ul>
  21. 26. RD - Example
  22. 27. RD - Workflow overview <ul><li>Mapping </li></ul><ul><li>Read filtering </li></ul><ul><li>GC correction </li></ul><ul><li>Spike identification </li></ul><ul><li>Validation </li></ul>
  23. 29. RD - mapping <ul><li>Critical… (see RP) </li></ul>
  24. 30. RD - Filtering <ul><li>mapQ </li></ul><ul><ul><li>mapQ >= 0 (noisy; few FN, many FP) </li></ul></ul><ul><ul><li>mapQ >= 10 </li></ul></ul><ul><ul><li>mapQ >= 30 (many FN, few FP) </li></ul></ul><ul><li>Mean depth exon (often: e.g. +/- 0.01) </li></ul><ul><ul><li>Mean depth > 1 </li></ul></ul><ul><ul><li>Mean depth > 5 </li></ul></ul>
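  25. 31. RD - Filtering: what’s left
      <table>
      <tr><th></th><th>mapQ >= 0</th><th>mapQ >= 10</th><th>mapQ >= 30</th></tr>
      <tr><td>all</td><td>207,000</td><td>207,000</td><td>207,000</td></tr>
      <tr><td>mean DP exon > 1</td><td>169,000</td><td>163,000</td><td>162,000</td></tr>
      <tr><td>mean DP exon > 5</td><td>160,000</td><td>153,000</td><td>152,000</td></tr>
      </table>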
  26. 32. RD - correction <ul><li>Mainly: GC </li></ul><ul><ul><li>Other: repeat-rich regions, mapping Q, … </li></ul></ul><ul><li>Fit linear model GC-content exon and RD of exon => noise decreases </li></ul>
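
The linear GC correction can be sketched with ordinary least squares: fit depth as a function of GC content per exon, then keep the residual re-centred on the mean depth. The function names and the re-centring step are assumptions for the example; the talk only states that a linear model is fitted.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gc_correct(gc_fractions, depths):
    """Remove the fitted GC trend from per-exon depth, keeping the overall mean."""
    slope, intercept = fit_line(gc_fractions, depths)
    mean_depth = sum(depths) / len(depths)
    return [
        depth - (slope * gc + intercept) + mean_depth
        for gc, depth in zip(gc_fractions, depths)
    ]


gc = [0.30, 0.40, 0.50, 0.60]       # hypothetical GC fraction per exon
depth = [30.0, 40.0, 50.0, 60.0]    # depth driven entirely by GC here
print([round(d, 6) for d in gc_correct(gc, depth)])
```

In this toy input all variation is explained by GC, so the corrected depths collapse to the mean; on real data what remains is the noise plus any copy-number signal.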
  27. 35. RD - segmentation <ul><li>Identify spikes </li></ul><ul><li>Many segmentation algorithms, e.g. GADA </li></ul><ul><li>Issues: setting parameters: when to cut off peaks? </li></ul><ul><ul><li>Combine outputs from different runs with different parameters </li></ul></ul><ul><ul><li>Compare to known CNVs </li></ul></ul>
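
To make the parameter problem concrete, here is a deliberately simple stand-in for a segmentation algorithm: plain thresholding of per-bin log2 ratios followed by merging of consecutive bins. It is not GADA, and the `threshold` and `min_bins` values are hypothetical.

```python
def call_segments(log2_ratios, threshold=0.4, min_bins=2):
    """Return (start_bin, end_bin_exclusive, 'gain' | 'loss') for runs of bins."""

    def classify(value):
        if value >= threshold:
            return "gain"
        if value <= -threshold:
            return "loss"
        return None

    # A trailing None acts as a sentinel that closes a run ending at the last bin.
    states = [classify(v) for v in log2_ratios] + [None]
    segments = []
    start, state = None, None
    for i, current in enumerate(states):
        if current != state:
            if state is not None and i - start >= min_bins:
                segments.append((start, i, state))
            start, state = i, current
    return segments


ratios = [0.0, 0.1, -0.9, -1.1, -0.8, 0.05, 0.6, 0.0]   # hypothetical bins
print(call_segments(ratios))               # [(2, 5, 'loss')]
print(call_segments(ratios, min_bins=1))   # [(2, 5, 'loss'), (6, 7, 'gain')]
```

The two calls differ only in `min_bins`, yet the second reports an extra single-bin gain; this is the sensitivity to parameters that motivates combining several runs and comparing against known CNVs.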
  28. 36. RD - Combine algorithms
  29. 39. RD - Issues <ul><li>How to assess TP/FP/FN? => compare with known CNVs </li></ul><ul><li>Breakpoints: unknown </li></ul><ul><ul><li>1 datapoint/exon </li></ul></ul><ul><ul><li>Can be outside of exon </li></ul></ul><ul><li>Different parameters for rare vs common CNVs => which? </li></ul>
  30. 40. C. Split Reads
  31. 41. SR - Principle
  32. 42. SR - Mapping <ul><li>Short subsequences => many possible mappings </li></ul><ul><li>Solution: “anchored split mapping” (e.g. Pindel) </li></ul>
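
The idea behind anchoring is that the mapped mate restricts where the two pieces of the unmapped read may be searched. The toy below illustrates this for a deletion with exact string matching inside a window around the anchor. It is a brute-force sketch, not Pindel's pattern-growth algorithm, and the window coordinates and `min_piece` are hypothetical parameters.

```python
def find_deletion(reference, read, window_start, window_end, min_piece=5):
    """Split the read in two so that both pieces match exactly inside the window,
    with a gap on the reference between them. Returns (del_start, del_end) or None.
    """
    window = reference[window_start:window_end]
    for k in range(min_piece, len(read) - min_piece + 1):
        prefix, suffix = read[:k], read[k:]
        p = window.find(prefix)
        while p != -1:
            # The suffix must start at least 1 bp after the end of the prefix.
            q = window.find(suffix, p + k + 1)
            if q != -1:
                return window_start + p + k, window_start + q
            p = window.find(prefix, p + 1)
    return None


left, deleted, right = "ACGTTGCAAGGCTA", "TTTTTTTTTT", "CCGATCGGATAC"
reference = left + deleted + right
read = left[-8:] + right[:8]          # read spanning the deletion breakpoint
print(find_deletion(reference, read, 0, len(reference)))   # (14, 24)
```

Restricting the search to the window is what keeps the number of candidate placements for the short pieces manageable.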
  33. 44. D. Local reassembly <ul><li>Aim: to determine breakpoints </li></ul><ul><li>Which reads? </li></ul><ul><ul><li>for deletions: local reads </li></ul></ul><ul><ul><li>for insertions: hanging reads for read pairs with only one read mapped </li></ul></ul><ul><ul><li>(rather not: unmapped reads) </li></ul></ul><ul><li>For large region: split up </li></ul>
  34. 45. Assemblers <ul><li>Velvet </li></ul><ul><li>ABySS </li></ul><ul><li>TIGRA </li></ul><ul><li>… </li></ul>
  35. 47. Conclusions <ul><li>Available algorithms: more to demonstrate technique rather than complete solution </li></ul><ul><li>Different algorithms => different results </li></ul>
  36. 48. Chris Yoon
  37. 50. Genotyping <ul><li>Create alternative reference => remap reads </li></ul><ul><ul><li>All reads vs reads covering variant loci </li></ul></ul><ul><ul><li>Whole-genome vs concatenation of variant loci </li></ul></ul><ul><li>Homozygous insertions/deletions: should disappear </li></ul><ul><li>Heterozygous insertions/deletions: should have different signatures </li></ul><ul><li>Bayesian approach: which is most likely: do the reads support wild-type, het or hom non-ref? </li></ul><ul><li>Not exact mapping => local reassembly </li></ul><ul><ul><li>Microhomologies & non-template sequence => “breakpoint” = region of 2-10 bp </li></ul></ul><ul><ul><ul><li>Convention: left-most position reported (but not always) </li></ul></ul></ul>
  38. 51. References and software <ul><li>Medvedev P et al. Nat Methods 6(11):S13-S20 (2009) </li></ul><ul><li>Lee S et al. Bioinformatics 24:i59-i67 (2008) </li></ul><ul><li>Hormozdiari F et al. Genome Res 19:1270-1278 (2009) </li></ul><ul><li>Campbell P et al. Nat Genet 40:722-729 (2008) </li></ul><ul><li>Ye K et al. Bioinformatics 25(21):2865-2871 (2009) </li></ul><ul><li>Chen K et al. Genome Res 19:1527-1541 (2009) </li></ul><ul><li>Yoon S et al. Genome Res 19:1586-1592 (2009) </li></ul><ul><li>Du J et al. PLoS Comp Biol 5(7):e1000432 (2009) </li></ul><ul><li>Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009) </li></ul>
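
The Bayesian question on this slide can be illustrated with a simple binomial model: count the reads that support the reference and the alternative allele after remapping, and compare three genotypes. The model, the error rate, and the uniform prior are assumptions for the example, not a description of any specific genotyper.

```python
from math import comb


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of each genotype given ref/alt supporting read counts."""
    # Expected fraction of alt-supporting reads under each genotype.
    alt_fraction = {"hom_ref": error, "het": 0.5, "hom_nonref": 1.0 - error}
    if priors is None:
        priors = {genotype: 1.0 / 3.0 for genotype in alt_fraction}
    n = n_ref + n_alt
    unnormalised = {
        genotype: priors[genotype] * comb(n, n_alt) * f ** n_alt * (1.0 - f) ** n_ref
        for genotype, f in alt_fraction.items()
    }
    total = sum(unnormalised.values())
    return {genotype: value / total for genotype, value in unnormalised.items()}


posteriors = genotype_posteriors(n_ref=10, n_alt=10)   # hypothetical counts
print(max(posteriors, key=posteriors.get))             # het
```

A population-informed prior (for example from allele frequencies) could replace the uniform one without changing the rest of the calculation.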
  39. 52. Questions?
