Graph and assembly strategies for the MHC and ribosomal DNA regions

Alexander Dilthey

The MHC is the zebrafish of the genome!
(model region)

PRGs – Population Reference Graphs
• Simple: acyclic, directed (sub-class of general variation graphs)
• Usually built from an MSA, preserving gap positions
(i.e. global homology between input sequences).
• Generative model: Recombination
• Ploidy well-defined (0, 1, 2)
[Figure: example PRG in which alternative alleles, including gaps (_), branch and rejoin between shared segments]
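
The properties listed above (acyclic, directed, built from an MSA with gap positions preserved, recombination as the generative model) can be illustrated with a small sketch. This is a simplified toy construction, not the implementation behind the slides: the function names `build_prg` and `enumerate_haplotypes` are hypothetical, one graph level is assumed per MSA column, and an edge is added only where two symbols are adjacent in at least one input sequence.

```python
from collections import defaultdict

GAP = "_"

def build_prg(msa):
    """Build a column-wise directed acyclic graph from equal-length MSA rows.

    Nodes are (level, symbol) pairs, one level per MSA column. An edge joins
    two nodes if some input sequence carries both symbols in adjacent columns.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")
    edges = defaultdict(set)
    for row in msa:
        for level in range(length - 1):
            edges[(level, row[level])].add((level + 1, row[level + 1]))
    levels = [sorted({row[i] for row in msa}) for i in range(length)]
    return levels, edges

def enumerate_haplotypes(levels, edges):
    """Return every gap-free sequence spelled by a full-length path."""
    last = len(levels) - 1
    haplotypes = set()
    stack = [[(0, symbol)] for symbol in levels[0]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node[0] == last:
            haplotypes.add("".join(s for _, s in path if s != GAP))
            continue
        for successor in edges[node]:
            stack.append(path + [successor])
    return haplotypes

# Two input sequences; paths may switch between them wherever they share a node.
levels, edges = build_prg(["AAC_T", "TTCGT"])
print(sorted(enumerate_haplotypes(levels, edges)))
```

Because both inputs pass through the same node at the third column, the graph also generates the recombinant sequences, which is the sense in which recombination acts as the generative model. Exhaustive path enumeration is only feasible for toy inputs like this one.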

Outline
• Quick recap:
What we know about the utility of graph genome approaches
• New results:
Haplotyping in hypervariable regions (HLA)
Pseudo graph alignment
• De novo assembly of ribosomal DNA

In most of the MHC, single-reference
approaches work just fine…
[Figure: number of k-mers (millions), recovered vs. not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped]
+ long-read validation with consistent results (not shown)
Dilthey et al., Nature Genetics 2015

… graph genomes outperform in the most
complex sub-region of the MHC …

… remaining problems driven by incomplete
input haplotypes + algorithmics.
[Figure: aligned k-mers; axes: chromotype position (kb) and read position (kb)]
Incomplete input haplotypes:
Large uncharacterized inversion
Algorithmics:
Incorrect HLA haplotyping.

HLA haplotyping
• Hypothesis: Whole-genome sequencing data contains the information
necessary for accurate HLA typing
• “HLA typing” → HLA gene exon sequences
• HLA class I: exons 2 and 3
• HLA class II: exon 2
• Challenge: align reads to the right gene – homology hell.
• Proper read-to-graph alignment instead of k-mers.

Class I exon homology
Exon 2 Exon 3
HLA-A 3284 alleles
HLA-B 4077 alleles
HLA-C 2799 alleles

Approach: deep PRG + mapping
Exonic MSA
T*01:01 _ _ A C G T A C T _ _
T*01:02 C A A C A T A C T _ _
T*01:03 _ _ A C G C G C T _ _
T*01:04 _ _ A T C C G C T A C
T*01:05 _ _ A T C C C C T _ _
T*01:06 _ _ _ C C T A C T _ _
Genomic MSA
T*01:01 A G C A _ _ A C G T A C T _ _ C C T A
T*01:02 A C C A C A A C A T A C T _ _ C C T A
T*01:04 _ T T A _ _ A T C C G C T A C C C T A
8 xMHC reference haplotypes
PGF (with T*01:01) A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
MANN (with T*01:04) T T T _ T T A _ _ A T C C G C T A C C C T A T G A
1) Gene-only PRG – 46 (pseudo) genes, mostly HLA
[Figure: Gene 1, Gene 2 and Gene 3 separated by |--NNN--| spacers; each gene laid out as Padding, UTR, Exon 1, Intron 1, Exon 2, UTR, Padding; y-axis: number of reference sequences, with the region covered by 'genomic' sequences marked]
2) Varying numbers of input sequences across PRG
3) Use hierarchical MSA approach to combine in

Approach: deep PRG + mapping
[Figure: a read aligned to the PRG; graph levels are numbered, edges carry bases or gap characters (_), and one position shows the alternative set {G, C}]
4) Seed-and-extend paired-end mapping to PRG
5) Likelihood-based inference: maximize L( aligned reads | HLA types )
(independently per locus)
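
Step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the model used in the talk: reads are assumed to be already aligned without gaps to a single locus, each read is assumed to come from either chromosome with probability 1/2, and sequencing errors are assumed uniform at a hypothetical rate of 0.01. The function names are hypothetical; the toy alleles are the gap-stripped exonic sequences of T*01:01, T*01:03 and T*01:05 from the MSA example.

```python
import itertools
import math

def read_log_lik(read, start, allele, error_rate=0.01):
    """log P(read | allele) for a gap-free alignment starting at `start`."""
    log_lik = 0.0
    for offset, base in enumerate(read):
        if allele[start + offset] == base:
            log_lik += math.log(1.0 - error_rate)
        else:
            log_lik += math.log(error_rate / 3.0)
    return log_lik

def pair_log_lik(reads, allele1, allele2, error_rate=0.01):
    """log L(aligned reads | allele pair); each read is a 50/50 mixture."""
    total = 0.0
    for start, read in reads:
        l1 = read_log_lik(read, start, allele1, error_rate)
        l2 = read_log_lik(read, start, allele2, error_rate)
        m = max(l1, l2)  # log-sum-exp for numerical stability
        total += m + math.log(0.5 * math.exp(l1 - m) + 0.5 * math.exp(l2 - m))
    return total

def call_locus(reads, alleles, error_rate=0.01):
    """Return the unordered allele pair with the highest likelihood."""
    pairs = itertools.combinations_with_replacement(sorted(alleles), 2)
    return max(
        pairs,
        key=lambda p: pair_log_lik(reads, alleles[p[0]], alleles[p[1]], error_rate),
    )

alleles = {"T*01:01": "ACGTACT", "T*01:03": "ACGCGCT", "T*01:05": "ATCCCCT"}
# Toy reads as (start position, sequence); two from each of two alleles.
reads = [(0, "ACGT"), (3, "TACT"), (0, "ACGC"), (3, "CGCT")]
print(call_locus(reads, alleles))
```

Running the maximization independently per locus, as the slide states, keeps the search space at the number of allele pairs for one gene rather than the product across genes.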

High-quality WGS data enables gold-standard
accuracy
(of note: 2/3 original discrepancies with validation data were errors in the validation data!)

… but not from exome, MiSeq data

Effective fragment length? [2 x read length + IS]
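
The bracketed formula can be written out as a one-line helper. This is only a sketch: it assumes that IS denotes the unsequenced stretch between the two mates (the formula would double-count the reads otherwise), and the numbers in the example are hypothetical, not taken from the data sets in the talk.

```python
def effective_fragment_length(read_length, insert_size):
    """2 x read length + IS, with IS assumed to be the gap between the mates."""
    return 2 * read_length + insert_size

# Hypothetical library: 2 x 100 bp reads with 300 bp between the mates.
print(effective_fragment_length(100, 300))  # 500
```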

Conclusion (intermediate)
• If the input sequencing data is “good enough”, we manage near-perfect haplotyping in the genome’s most polymorphic region
• Effective fragment length likely the most important factor
• Not-so-good sequencing data: joint haplotyping + alignment
(i.e. alignment location is not independent of inferred haplotype)
• Read mapping implementation SLOW

Pseudo graph mapping
Input sequences
Graph
Align short reads to input sequences...
... transpose onto graph

Scrubbing, cutting, cleaning
Input MSA         Lin. alignment    MSA coor.          Scrubbed
     123456789                           123456X789         123456789
Seq1 AACAC_TTT    Seq1 AACAC_TTT    Seq1 AACAC__TTT    Seq1 AACAC_TTT
Seq2 TTCACGTTT    Read AACACGTTT    Read AACAC_GTTT    Read AACACGTTT
Graph: TTCAC, then either - or G, then TTT
Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system
Cutting: Examine alignment gap structure; cut in “bad” areas; use longest stretch
Cleaning: Find the best gap-less sequence-to-graph alignment + extension with gaps
Graph alignment
123456789
Graph AACACGTTT
Seq1 AACACGTTT
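
The transposition and scrubbing steps in the example above can be sketched in a few lines. This is a simplified illustration under assumptions, not the implementation from the talk: the function name `transpose_and_scrub` is hypothetical, the linear alignment is assumed to span the full input sequence, and an inserted read base is moved into an MSA gap column only if that column directly follows the base it was inserted after. Cutting and cleaning are not shown.

```python
GAP = "_"

def transpose_and_scrub(msa_row, aln_seq, aln_read):
    """Project a read aligned to one input sequence onto MSA coordinates.

    msa_row  : the input sequence as it appears in the MSA (with gaps)
    aln_seq  : the input sequence as it appears in the linear alignment
    aln_read : the read as it appears in the linear alignment
    Returns the read in MSA coordinates plus any insertions that could not be
    placed in an existing gap column, keyed by the following MSA column.
    """
    if len(aln_seq) != len(aln_read):
        raise ValueError("aligned strings must have equal length")
    if aln_seq.replace(GAP, "") != msa_row.replace(GAP, ""):
        raise ValueError("alignment must span the full ungapped input sequence")

    # For each sequence base: the aligned read character, and read bases inserted after it.
    aligned, inserted = [], []
    for seq_char, read_char in zip(aln_seq, aln_read):
        if seq_char == GAP:
            if not inserted:
                raise ValueError("insertions before the first base are not handled")
            inserted[-1].append(read_char)
        else:
            aligned.append(read_char)
            inserted.append([])

    out, leftover, pending, base_index = [], {}, [], -1
    for column, msa_char in enumerate(msa_row):
        if msa_char == GAP:
            # Scrubbing: reuse the existing gap column instead of opening a new one.
            out.append(pending.pop(0) if pending else GAP)
        else:
            if pending:
                leftover[column] = "".join(pending)
            base_index += 1
            out.append(aligned[base_index])
            pending = list(inserted[base_index])
    if pending:
        leftover[len(msa_row)] = "".join(pending)
    return "".join(out), leftover

# The example from the slide: the inserted G lands in the existing gap column 6.
print(transpose_and_scrub("AACAC_TTT", "AACAC_TTT", "AACACGTTT"))
```

When no gap column is available, the insertion is reported separately instead of shifting the coordinate system, which corresponds to the extra column "X" that scrubbing is meant to avoid.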

Accuracy slightly worse; fast!
Highest Resolution
                          MHC-PRG-2                      HLA*PRG
Cohort         Locus  N   Inferred  Accuracy  Call Rate  Inferred  Accuracy  Call Rate
Platinum Trio  A      6   6         1.00      1.00       6         1.00      1.00
Platinum Trio  B      6   6         1.00      1.00       6         1.00      1.00
Platinum Trio  C      6   6         1.00      1.00       6         1.00      1.00
Platinum Trio  DQA1   6   6         1.00      1.00       6         1.00      1.00
Platinum Trio  DQB1   6   6         1.00      1.00       6         1.00      1.00
Platinum Trio  DRB1   6   6         1.00      1.00       6         1.00      1.00
1000 Genomes   A      22  22        0.86      1.00       22        1.00      1.00
1000 Genomes   B      22  22        1.00      1.00       22        1.00      1.00
1000 Genomes   C      22  22        1.00      1.00       22        1.00      1.00
1000 Genomes   DQA1   12  12        1.00      1.00       12        1.00      1.00
1000 Genomes   DQB1   22  22        1.00      1.00       22        1.00      1.00
1000 Genomes   DRB1   22  22        0.91      1.00       22        0.95      1.00
Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?

Towards additional high-quality reference
haplotypes…
Remaining challenges: extreme repeats, haplotypes.
Sergey Koren

Ribosomal DNA
• Encodes ribosomal RNA
• Hundreds of copies
(tandem repeat arrays)
• Variation poorly characterized
• Step 1: Targeted approach
• Step 2: WGS-based
• Step 3: Variation graph

Read error vs variation
… from whole-genome data?
Long reads → de Bruijn graph. Technology!
6% > 50k

Summary
• Variation graphs are worth the effort – at least in highly complex regions.
• Evidence: MHC “model system”
+ overall improvement of genome inference accuracy
+ complex-locus haplotyping
• Incorporate LD?
• Middle ground between full graph alignment and linear sequence
alignment?
• Ribosomal DNA – let me know if you’re also interested!

Acknowledgements
NIH
Adam Phillippy
Sergey Koren
Brian Walenz
Jung-Hyun Kim
Vladimir Larionov
Oxford
Gil McVean
Zam Iqbal
Alexander Mentzer
Histogenetics
Nezih Cereb
UCSF/Nantes
Pierre-Antoine Gourraud
GSK
Matt Nelson
Charles Cox
