
Graph and assembly strategies for the MHC and ribosomal DNA regions


  1. Graph and assembly strategies for the MHC and ribosomal DNA regions. Alexander Dilthey
  2. The MHC is the zebrafish of the genome! (model region)
  3. PRGs – Population Reference Graphs
     • Simple: acyclic, directed (a sub-class of general variation graphs)
     • Usually built from an MSA; preserve gap positions (i.e. global homology between input sequences)
     • Generative model: recombination
     • Ploidy well-defined (0, 1, 2)
     [Figure: example PRG with nucleotide and gap ("_") edge labels]
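The bullets above can be illustrated with a minimal sketch of an MSA-derived, level-structured graph. It assumes the simplest possible construction, one vertex per MSA column with one edge per distinct symbol, so every path corresponds to a recombinant of the input sequences. The function names and the two toy MSA rows are hypothetical, and the construction in Dilthey et al. may differ (for example by keeping several vertices per level).

```python
from math import prod

GAP = "_"  # gap symbol, as used in the slides

def build_prg(msa_rows):
    """Return, for each MSA column, the sorted edge labels between level i and i+1.

    Assumption: a single vertex per level, so recombination is allowed
    between any two adjacent columns.
    """
    lengths = {len(row) for row in msa_rows}
    if len(lengths) != 1:
        raise ValueError("all MSA rows must have the same length")
    # zip(*rows) iterates over MSA columns; gaps are kept as edge labels,
    # which preserves the global homology between input sequences.
    return [sorted(set(column)) for column in zip(*msa_rows)]

def count_paths(prg):
    """Number of distinct source-to-sink paths (haplotypes the graph can generate)."""
    return prod(len(labels) for labels in prg)

def spell_path(prg, choices):
    """Spell the ungapped sequence of the path that takes edge choices[i] at level i."""
    symbols = [labels[i] for labels, i in zip(prg, choices)]
    return "".join(s for s in symbols if s != GAP)

msa = ["AACAC_TTT", "TTCACGTTT"]  # toy input sequences
prg = build_prg(msa)
print(prg[5])            # ['G', '_']
print(count_paths(prg))  # 8
print(spell_path(prg, [0, 1, 0, 0, 0, 0, 0, 0, 0]))  # ATCACGTTT
```

The last call spells a sequence that is in neither input row, which is what the recombination model means in practice: the graph generates mosaics of its input haplotypes.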
  4. Outline
     • Quick recap: what we know about the utility of graph genome approaches
     • New results: haplotyping in hypervariable regions (HLA); pseudo graph alignment
     • De novo assembly of ribosomal DNA
  5. In most of the MHC, single-reference approaches work just fine… [Figure: number of kmers (millions), recovered vs. not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped] + long-read validation with consistent results (not shown). Dilthey et al., Nature Genetics 2015
  6. … graph genomes outperform in the most complex sub-region of the MHC … Dilthey et al., Nature Genetics 2015
  7. … remaining problems driven by incomplete input haplotypes + algorithmics. [Figure: aligned kmers; chromotype position (kb) vs. read position (kb)]
     • Incomplete input haplotypes: large uncharacterized inversion
     • Algorithmics: incorrect HLA haplotyping
     Dilthey et al., Nature Genetics 2015
  8. HLA haplotyping
     • Hypothesis: whole-genome sequencing data contain the information necessary for accurate HLA typing
     • "HLA typing" → HLA gene exon sequences
       • HLA class I: exons 2 and 3
       • HLA class II: exon 2
     • Challenge: align reads to the right gene – homology hell
     • Proper read-to-graph alignment instead of kmers
  9. Class I exon homology [Figure: exon 2 and exon 3]. HLA-A: 3284 alleles; HLA-B: 4077 alleles; HLA-C: 2799 alleles
  10. Approach: deep PRG + mapping
      Exonic MSA
        T*01:01  _ _ A C G T A C T _ _
        T*01:02  C A A C A T A C T _ _
        T*01:03  _ _ A C G C G C T _ _
        T*01:04  _ _ A T C C G C T A C
        T*01:05  _ _ A T C C C C T _ _
        T*01:06  _ _ _ C C T A C T _ _
      Genomic MSA
        T*01:01  A G C A _ _ A C G T A C T _ _ C C T A
        T*01:02  A C C A C A A C A T A C T _ _ C C T A
        T*01:04  _ T T A _ _ A T C C G C T A C C C T A
      8 xMHC reference haplotypes
        PGF (with T*01:01)   A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
        MANN (with T*01:04)  T T T _ T T A _ _ A T C C G C T A C C C T A T G A
      1) Gene-only PRG – 46 (pseudo) genes, mostly HLA [Figure: Gene 1, Gene 2, Gene 3 separated by |--NNN--|; each gene laid out as Padding, UTR, Exon 1, Intron 1, Exon 2, UTR, Padding]
      2) Varying numbers of input sequences across PRG [Figure: number of reference sequences; region covered by 'genomic' sequences]
      3) Use hierarchical MSA approach to combine in
  11. Approach: deep PRG + mapping
      [Figure: aligned read placed against numbered PRG levels, with gap ("_") characters in read and graph]
      4) Seed-and-extend paired-end mapping to PRG
      5) Likelihood-based inference: maximize L( aligned reads | HLA types ) (independently per locus)
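Step 5 can be sketched as a search over unordered allele pairs at one locus. This is a minimal illustration, not the likelihood model used in the talk: it assumes reads are already aligned without gaps to exon coordinates, that each read comes from either haplotype with probability 0.5, and that bases are miscalled independently at a uniform rate. All function names, the `error_rate` parameter and the toy reads are hypothetical; the allele strings are the ungapped exonic MSA rows from the previous slide.

```python
import math
from itertools import combinations_with_replacement

def read_likelihood(read, start, allele, error_rate=0.01):
    """P(read | allele) for a gapless alignment starting at `start` (0-based)."""
    p = 1.0
    for offset, base in enumerate(read):
        if allele[start + offset] == base:
            p *= 1.0 - error_rate
        else:
            p *= error_rate / 3.0  # assumed: errors equally likely to any other base
    return p

def best_allele_pair(alleles, aligned_reads, error_rate=0.01):
    """Return the allele pair maximizing log L(aligned reads | pair), and that log-likelihood."""
    best_pair, best_ll = None, -math.inf
    for a1, a2 in combinations_with_replacement(sorted(alleles), 2):
        ll = 0.0
        for read, start in aligned_reads:
            # Each read is an equal-weight mixture over the two haplotypes.
            ll += math.log(
                0.5 * read_likelihood(read, start, alleles[a1], error_rate)
                + 0.5 * read_likelihood(read, start, alleles[a2], error_rate)
            )
        if ll > best_ll:
            best_pair, best_ll = (a1, a2), ll
    return best_pair, best_ll

alleles = {
    "T*01:01": "ACGTACT",
    "T*01:03": "ACGCGCT",
    "T*01:05": "ATCCCCT",
}
# Toy reads as (sequence, 0-based start): two from T*01:01, two from T*01:05.
reads = [("ACGT", 0), ("TACT", 3), ("ATCC", 0), ("CCCT", 3)]
pair, log_likelihood = best_allele_pair(alleles, reads)
print(pair)  # ('T*01:01', 'T*01:05')
```

Scoring every pair is quadratic in the number of alleles, which is tolerable per locus but is one reason the loci are treated independently in this sketch.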
  12. High-quality WGS data enables gold-standard accuracy (of note: 2/3 original discrepancies with validation data were errors in the validation data!)
  13. … but not from exome, MiSeq data
  14. Sequencing error?
  15. Effective fragment length? [2 x read length + IS]
  16. Conclusion (intermediate)
      • If the input sequencing data is "good enough", we manage near-perfect haplotyping in the genome's most polymorphic region
      • Effective fragment length is likely the most important factor
      • Not-so-good sequencing data: joint haplotyping + alignment (i.e. alignment location is not independent of inferred haplotype)
      • Read mapping implementation SLOW
  17. Pseudo graph mapping: input sequences
  18. Pseudo graph mapping: input sequences; graph
  19. Pseudo graph mapping: input sequences; graph. Align short reads to input sequences...
  20. Pseudo graph mapping: input sequences; graph. Align short reads to input sequences... ... transpose onto graph
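The "transpose onto graph" step can be sketched as a coordinate lookup, assuming the graph levels correspond one-to-one to MSA columns and the read aligns to its input sequence without gaps. Both function names are hypothetical and coordinates are 0-based; reads with INDELs relative to the input sequence need extra handling that this sketch leaves out.

```python
GAP = "_"  # gap symbol, as used in the slides

def sequence_to_msa_columns(msa_row, gap=GAP):
    """Map each position of the ungapped input sequence to its MSA column."""
    return [col for col, ch in enumerate(msa_row) if ch != gap]

def transpose_alignment(msa_row, read_start, read_length):
    """Project a gapless read-to-sequence alignment onto MSA (graph level) columns."""
    columns = sequence_to_msa_columns(msa_row)
    if read_start < 0 or read_start + read_length > len(columns):
        raise ValueError("read extends beyond the input sequence")
    return columns[read_start:read_start + read_length]

seq1_row = "AACAC_TTT"  # toy MSA row; the ungapped sequence is AACACTTT
print(sequence_to_msa_columns(seq1_row))    # [0, 1, 2, 3, 4, 6, 7, 8]
print(transpose_alignment(seq1_row, 3, 4))  # [3, 4, 6, 7]
```

The jump from column 4 to column 6 in the second result is where the read, aligned to a sequence carrying a gap, skips a graph level that other input sequences fill.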
  21. Scrubbing, cutting, cleaning
      Input MSA          Lin. alignment     MSA coor.     Scrubbed
            123456789                       123456X789    123456789
      Seq1  AACAC_TTT    Seq1  AACAC_TTT    AACAC__TTT    AACAC_TTT
      Seq2  TTCACGTTT    Read  AACACGTTT    AACAC_GTTT    AACACGTTT
      [Figure: graph with node labels TTCAC, TTT, G]
      • Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system
      • Cutting: examine alignment gap structure; cut in "bad" areas; use longest stretch
      • Cleaning: find the best gap-less sequence-to-graph alignment + extension with gaps
      Graph alignment
             123456789
      Graph  AACACGTTT
      Seq1   AACACGTTT
  22. Accuracy slightly worse; fast! Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?
      Highest resolution
                                   MHC-PRG-2                        HLA*PRG
      Cohort         Locus  N      Inferred  Accuracy  Call Rate    Inferred  Accuracy  Call Rate
      Platinum Trio  A      6      6         1.00      1.00         6         1.00      1.00
      Platinum Trio  B      6      6         1.00      1.00         6         1.00      1.00
      Platinum Trio  C      6      6         1.00      1.00         6         1.00      1.00
      Platinum Trio  DQA1   6      6         1.00      1.00         6         1.00      1.00
      Platinum Trio  DQB1   6      6         1.00      1.00         6         1.00      1.00
      Platinum Trio  DRB1   6      6         1.00      1.00         6         1.00      1.00
      1000 Genomes   A      22     22        0.86      1.00         22        1.00      1.00
      1000 Genomes   B      22     22        1.00      1.00         22        1.00      1.00
      1000 Genomes   C      22     22        1.00      1.00         22        1.00      1.00
      1000 Genomes   DQA1   12     12        1.00      1.00         12        1.00      1.00
      1000 Genomes   DQB1   22     22        1.00      1.00         22        1.00      1.00
      1000 Genomes   DRB1   22     22        0.91      1.00         22        0.95      1.00
  23. Towards additional high-quality reference haplotypes… Remaining challenges: extreme repeats, haplotypes. Sergey Koren
  24. Ribosomal DNA
      • Encodes ribosomal RNA
      • Hundreds of copies (tandem repeat arrays)
      • Variation poorly characterized
      • Step 1: Targeted approach
      • Step 2: WGS-based
      • Step 3: Variation graph
  25. Read error vs variation … from whole-genome data? Long reads → de Bruijn graph. Technology! 6% > 50k
  26. Summary
      • Variation graphs are worth the effort – at least in highly complex regions.
      • Evidence: MHC "model system" + overall improvement of genome inference accuracy + complex-locus haplotyping
      • Incorporate LD?
      • Middle ground between full graph alignment and linear sequence alignment?
      • Ribosomal DNA – let me know if you're also interested!
  27. Acknowledgements
      • NIH: Adam Phillippy, Sergey Koren, Brian Walenz, Jung-Hyun Kim, Vladimir Larionov
      • Oxford: Gil McVean, Zam Iqbal, Alexander Mentzer
      • Histogenetics: Nezih Cereb
      • UCSF/Nantes: Pierre-Antoine Gourraud
      • GSK: Matt Nelson, Charles Cox
