© 2019 DNAnexus, Inc. All Rights Reserved.
GETTING “PERFECT” HAPLOTIGS FOR
MHC
MHC Team, Pangenomics Analysis Hackathon
Jason Chin, Sr. Director, Machine Learning and Genomics, DNAnexus
GIAB/GRC WORKSHOP, ASHG, OCT 11, 2019
Major Histocompatibility Complex Region
One of the most diverse regions of the human genome
HLA matching is important for the success of organ transplants
Antigen processing and presentation
(Kulski, et al., Immunological Reviews, 2003)
GWAS Example
Unusually high number of “significant” associations in the MHC across 918 phenotype labels
(UK Biobank, ~337,000 samples, Neale Lab Rapid GWAS results on 41202: ICD10 code, 20002: self-reported diseases,
https://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank)
(Instead of a typical Manhattan plot, we have a “Taipei” plot)
[Figure panels: Number of Associated Phenotypes per Variant; Number of Associated Variants per 5 Mbp (y-axis: number of variants); MHC peak labeled]
Strategy to Get “Perfect” Haplotigs
Trio sequencing data available:
• k-mer binning + long reads -> haplotype-separated read piles, e.g., TrioCanu
Trio data not available:
• Long reads, FALCON-Unzip
• More accurate long reads
• Super long reads
• Linked reads
• Hi-C data
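The k-mer binning idea can be illustrated with a minimal sketch. This is a toy version under stated assumptions, not TrioCanu's implementation: the parental sequences, the reads, the k-mer size `K`, and the helper names `kmers` and `bin_read` are all hypothetical, and real tools derive parent-specific k-mers from parental sequencing reads rather than from assembled sequences.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to a parent by counting parent-specific k-mers it contains."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & maternal_only)
    p = len(read_kmers & paternal_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy parental sequences that differ at a single base (hypothetical data).
K = 5
mother = "ACGTACGTTAGCCGATAGGCT"
father = "ACGTACGTTAGACGATAGGCT"
mom_k, dad_k = kmers(mother, K), kmers(father, K)
maternal_only = mom_k - dad_k  # k-mers seen only in the mother
paternal_only = dad_k - mom_k  # k-mers seen only in the father

# Toy child reads: one per haplotype, plus one from a region shared by both.
reads = ["CGTTAGCCGATAG", "CGTTAGACGATAG", "ACGTACGT"]
bins = [bin_read(r, maternal_only, paternal_only, K) for r in reads]
print(bins)
```

Reads that carry no parent-specific k-mers stay unassigned, which is why longer reads are easier to bin: they are more likely to overlap at least one informative site.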
WhatsHap + Peregrine
Martin, et al., 2016, bioRxiv 085050.
Chin and Khalak, 2019, bioRxiv 705616.
Platform Talk: Friday, 9:00 AM, Grand Ballroom B
Session F: Fast Methods for Genome Analysis
“Assembling a de novo human genome in 100 minutes”
Workflow / Pipeline
Quick Prototype and Reproducible Environment
DNAnexus cloud workspace for genomics development and analysis work
Better control of data, code, and computing environment
Whole MHC Haplotig Large Scale View
Two gap-free haplotigs span the whole MHC region
Phased HLA Genes Confirmed by Trio Typing
Assembly contigID locus utilizedRefContig start stop calledGenotypes editDistance calledGenotypes_Assembly minEditDistance_assembly_truth_whichAlleles minEditDistance_calledGenotype_Truth_whichAlleles Haplotype
H1 000000F HLA-A pgf 1436987 1440489 A*01:01:01G 0 A*01:01:01G A*01:01:01G_v/s_A*01:01:01G HLA-A Maternal
H1 000000F HLA-B pgf 2848251 2851577 B*35:08:01G 0 B*35:08:01G B*35:08:01G_v/s_B*35:08:01G HLA-B Maternal
H1 000000F HLA-C pgf 2763854 2767202 C*04:01:01G 0 C*04:01:01G C*04:01:01G_v/s_C*04:01:01G HLA-C Maternal
H1 000000F HLA-DQA1 pgf 4086777 4093261 DQA1*01:01:01G 0 DQA1*01:01:01G DQA1*01:01:01G_v/s_DQA1*01:01:01G HLA-DQA1 Maternal
H1 000000F HLA-DQB1 pgf 4110116 4117205 DQB1*05:01:01G 0 DQB1*05:01:01G DQB1*05:01:01G_v/s_DQB1*05:01:01G HLA-DQB1 Maternal
H1 000000F HLA-DRB1 pgf 4029789 4043078 DRB1*10:01:01G 0 DRB1*10:01:01G DRB1*10:01:01G_v/s_DRB1*10:01:01G HLA-DRB1 Maternal
H2 000000F HLA-A pgf 1437427 1440943 A*26:01:01G 0 A*26:01:01G A*26:01:01G_v/s_A*26:01:01G HLA-A Paternal
H2 000000F HLA-B pgf 2843682 2846993 B*38:01:01G 0 B*38:01:01G B*38:01:01G_v/s_B*38:01:01G HLA-B Paternal
H2 000000F HLA-C pgf 2768829 2772177 C*12:03:01G 0 C*12:03:01G C*12:03:01G_v/s_C*12:03:01G HLA-C Paternal
H2 000000F HLA-DQA1 pgf 4182456 4188892 DQA1*03:01:01G 0 DQA1*03:01:01G DQA1*03:01:01G_v/s_DQA1*03:01:01G HLA-DQA1 Paternal
H2 000000F HLA-DQB1 pgf 4201076 4208201 DQB1*03:02:01G 0 DQB1*03:02:01G DQB1*03:02:01G_v/s_DQB1*03:02:01G HLA-DQB1 Paternal
H2 000000F HLA-DRB1 cox 4122938 4138189 DRB1*04:02:01 0 DRB1*04:02:01 DRB1*04:02:01_v/s_DRB1*04:02:01 HLA-DRB1 Paternal
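The table reports an edit distance of 0 between each assembled allele and the typed allele. How that distance was computed is not stated on the slide; as an illustration only, a standard Levenshtein distance could be sketched as follows (the function name `edit_distance` is hypothetical).

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, using a rolling DP row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# An assembled allele identical to the typed allele gives distance 0.
print(edit_distance("ACGTTGCA", "ACGTTGCA"))
```

For full-length HLA genes of several kilobases this quadratic version is still workable, although a banded or bit-parallel aligner would be the more usual choice at scale.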
What Is Still Wrong and What Can We Do With It?
Assembly graph: because of the read length limit, resolving CYP21A2 / TNXB still needs some manual work.
[Figure: assembly graph with a 35 kb repeat, and assembly-versus-reference alignments]
Unroll The Loop With An ONT Read
Getting a “perfect” assembly needs multi-scale approaches for both phasing and contig construction.
We can spike in this 150 kb ONT read to “unroll” the loop in the assembly graph.
[Figure: self-self dot plot of a >150 kb ONT read, showing Repeat 1 and Repeat 2]
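A self-self dot plot of this kind can be approximated with exact k-mer matches. The sketch below is a toy illustration, not the tool used for the slide: the sequence, the k-mer size, and the helper names `self_dotplot` and `dominant_offset` are assumptions, and a noisy ONT read would need a matching scheme that tolerates errors (for example minimizers or alignment).

```python
from collections import defaultdict


def self_dotplot(seq, k):
    """Return sorted (i, j) pairs, i < j, where the k-mers at i and j are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for pos in positions.values():
        for a in range(len(pos)):
            for b in range(a + 1, len(pos)):
                dots.append((pos[a], pos[b]))
    return sorted(dots)


def dominant_offset(dots):
    """Return the most frequent j - i offset (the repeat spacing), or None if no dots."""
    counts = defaultdict(int)
    for i, j in dots:
        counts[j - i] += 1
    return max(counts, key=counts.get) if counts else None


# Toy read with two copies of a 20 bp unit separated by a 4 bp spacer.
unit = "ACGTTGCAAGCTAGGATCCA"
read = "GGCC" + unit + "TATA" + unit + "CCGG"
dots = self_dotplot(read, 8)
print(dominant_offset(dots))
```

Off-diagonal dots that share one offset form the line seen in the dot plot; that offset is the distance between Repeat 1 and Repeat 2, and a read spanning both copies is what allows the loop in the graph to be unrolled.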
2 Haplotypes and 2 Copies of CYP21A2 Repeats
Detecting loops is easy. (Perhaps we should annotate the assembly output for that.)
However, when the reads are shorter than the repeats, we need to resolve 2x2 haplotypes: two haplotypes times two repeat copies.
[Figure: variant co-occurrence pattern]
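One way to tabulate a variant co-occurrence pattern is to count which alleles are observed together on the same read. This is a minimal sketch with made-up reads, not the analysis behind the figure; the read representation (a dict from site position to allele) and the function name `cooccurrence` are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(reads):
    """Count how often each pair of (site, allele) observations appears on one read."""
    pairs = Counter()
    for alleles in reads:
        observed = sorted(alleles.items())  # stable order so each pair has one key
        for x, y in combinations(observed, 2):
            pairs[(x, y)] += 1
    return pairs


# Hypothetical reads: each maps a variant site to the allele the read carries.
reads = [
    {100: "A", 250: "G"},
    {100: "A", 250: "G"},
    {100: "C", 250: "T"},
    {100: "A", 250: "T"},
]
pairs = cooccurrence(reads)
for (x, y), n in sorted(pairs.items()):
    print(x, y, n)
```

With two haplotypes and two repeat copies, up to four distinct allele combinations can be expected, so the counts have to be grouped into four clusters rather than the usual two.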
Long Nanopore Reads can be Phased Better
16
Thursday Afternoon
Poster 1582/T: The portrait of
fully phased assembled diploid
human genome, Arkarachai
Fungtammasan, et. al.,
© 2019 DNAnexus, Inc. All Rights Reserved.
Long Nanopore Reads can be Phased Better
17
Thursday Afternoon
Poster 1582/T: The portrait of
fully phased assembled diploid
human genome, Arkarachai
Fungtammasan, et. al.,
© 2019 DNAnexus, Inc. All Rights Reserved.
Any Other Challenges?
• Recruiting reads against a single reference misses reads
• The assembly will not be complete without an initial de novo assembly
• One can’t describe the differences with small variant calls alone
Takeaway: 179 reads are mapped only to the HG002 de novo contig
“Perfect” is Still Elusive
Residual error analysis:
Reads <-> assembly contig consistency check (Minimap2 + FreeBayes variant calling)
Not surprisingly, the major inconsistencies come from homopolymers
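To separate homopolymer-driven inconsistencies from the rest, one could first list the homopolymer runs in a contig and then check whether each called position falls inside one. The sketch below is an illustration under assumptions: the helper names are hypothetical, positions are 0-based, and the default of 11 encodes "longer than 10 bp".

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=11):
    """Return (start, end, base) for runs of one base of at least min_len; end is exclusive."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, pos + length, base))
        pos += length
    return runs


def in_homopolymer(position, runs):
    """True if a 0-based position lies inside any of the given runs."""
    return any(start <= position < end for start, end, _ in runs)


# Toy contig: a 12 bp A run (reported) and a 10 bp T run (too short to report).
contig = "ACGT" + "A" * 12 + "CGCG" + "T" * 10 + "GA"
runs = homopolymer_runs(contig)
print(runs)
print(in_homopolymer(10, runs), in_homopolymer(25, runs))
```

The same run list can be written out as BED intervals and used to exclude these regions from a benchmark.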
Integrating assembly- and mapping-based calls gives best MHC benchmark
• MHC assembly-based bed includes 23187 variants in the MHC region, excluding:
  • CYP21A2 and pseudogene
  • Homopolymers >10bp
  • SVs in assembly
  • Very dense variants
• v4.0 mapping-based bed includes 13964 variants in the MHC region, excluding:
  • Short read callsets
  • Conflicts between callers
  • SVs from all methods
  • Homopolymers >10bp
  • Many clusters of variants, including some HLA genes
• Only 11 differences between assembly- and mapping-based calls in both beds
  • 2 genotyping errors in assembly-based
  • 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based
• Merged benchmark includes 23229 variants in the MHC region
  • Covers most HLA genes and CYP21A2/TNXA/TNXB
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
None 13899 13549 10 4 0.9993 0.9997 0.9995
These variants are fully phased through the MHC region too!
9265 new variants over the MHC region.
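The summary row can be checked from its own counts. The sketch below assumes the usual definitions (precision from call-side true positives, sensitivity from baseline-side true positives, F-measure as their harmonic mean); the function name `benchmark_metrics` is hypothetical. It also recomputes the number of new variants as the difference between the merged and the v4.0 mapping-based counts.

```python
def benchmark_metrics(tp_baseline, tp_call, fp, fn):
    """Return (precision, sensitivity, f_measure) from comparison counts."""
    precision = tp_call / (tp_call + fp)
    sensitivity = tp_baseline / (tp_baseline + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts taken from the table row above.
p, s, f = benchmark_metrics(tp_baseline=13899, tp_call=13549, fp=10, fn=4)
print(round(p, 4), round(s, 4), round(f, 4))

# Variants added by the merged benchmark relative to the v4.0 mapping-based bed.
new_variants = 23229 - 13964
print(new_variants)
```

Under these definitions the rounded values agree with the reported 0.9993, 0.9997, and 0.9995, and the difference agrees with the 9265 new variants.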
More MHC in Haplotype Resolved Genome Assemblies
[Figure: MHC region in six haplotype-resolved assemblies: NA12878 H1, NA12878 H2, PGP1 H1, PGP1 H2, HG002 H1, HG002 H2]
221/4:30 A robust and production-level approach to haplotype-resolved assembly of single individuals. S. Garg, C. Fungtammasan, A. Carroll, R. Hall, E. Hatas, M. Mahmoud, F. Sedlazeck, M. Chou, J. Aach, J. Zook, J. Chin, H. Lee, G. Church.
We can already see 6 different haplotypes at this scale
Next Generation MHC Database?
[Figure panels: Number of Associated Variants per 5 Mbp (y-axis: number of variants); Class I and Class II HLA alleles, http://hla.alleles.org/inc/images/graph_hires.png]
Is it worth solving this puzzle with long-read technologies at scale?
Acknowledgement
Thank You for Your Attention!
The MHC team for Pan-genomics in the
Cloud hackathon 2019:
A. Dilthey
A. Fungtammasan
S. Garg
E. Garrison
M. Rautiainen
M. Tobias
J. Wanger
Q. Zeng
J. Zook
Peregrine Assembler Co-developer
Asif Khalak, Foundation of Bio-Data Sciences
----
B. Busby and B. Paten for hosting the hackathon
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC


Editor's Notes

  • #8 https://www.biorxiv.org/content/biorxiv/early/2016/11/14/085050.full.pdf Fast Assembly / Fast Iteration