
Jason Chin MHC diploid assembly


  1. Getting “Perfect” Haplotigs for MHC
     MHC Team, Pangenomics Analysis Hackathon
     Jason Chin, Sr. Director, Machine Learning and Genomics, DNAnexus
     GIAB/GRC Workshop, ASHG, Oct 11, 2019
     © 2019 DNAnexus, Inc. All Rights Reserved.
  2. Major Histocompatibility Complex Region
     • One of the most diverse regions of the human genome
     • HLA matching is important for the success of organ transplants
     • Antigen processing and presentation
     Kulski, et al., Immunological Reviews, 2003
  3. GWAS Example
     • Unusually high number of “significant” associations in the MHC across 918 phenotype labels
     • Data: UK Biobank, ~337,000 samples; Neale Lab Rapid GWAS results on 41202: ICD10 code, 20002: self-reported diseases, https://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank
     • Instead of a typical Manhattan plot, we have a “Taipei” plot
     [Figure panels: number of associated phenotypes per variant; number of associated variants per 5 Mbp (y-axis: number of variants); MHC marked]
  6. Strategy to Get “Perfect” Haplotigs
     • Trio sequencing data available: k-mer binning + long reads -> haplotype-separated read piles (e.g., TrioCanu)
     • Trio data not available:
       • Long reads, FALCON-Unzip
       • More accurate long reads
       • Super long reads
       • Linked reads
       • Hi-C data
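The k-mer binning idea named on the strategy slide can be illustrated with a toy sketch. This is not TrioCanu's implementation: the function names, the tiny k, and the simple majority vote are all assumptions made for illustration. The sketch collects k-mers seen in only one parent's reads and assigns each child read to the parent whose private k-mers it shares more of.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a child read to a haplotype bin by counting parent-private k-mers."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy data (hypothetical sequences, k chosen only to fit the example).
mat_only, pat_only = parent_specific_kmers(["ACGTACGTTT"], ["ACGTACGAAA"], k=4)
print(bin_read("TACGTTTC", mat_only, pat_only, k=4))
print(bin_read("GTACGAAA", mat_only, pat_only, k=4))
print(bin_read("ACGTAC", mat_only, pat_only, k=4))
```

Real trio binning uses much longer k-mers and parental short-read k-mer counts with error filtering; the sketch only shows the classification step.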
  7. WhatsHap + Peregrine
     • Martin, et al., 2016, bioRxiv 085050
     • Chin and Khalak, 2019, bioRxiv 705616
     • Platform talk: Friday, 9:00 AM, Grand Ballroom B, Session F: Fast Methods for Genome Analysis, “Assembling a de novo human genome in 100 minutes”
  8. Workflow / Pipeline
  9. Quick Prototype and Reproducible Environment
     • DNAnexus cloud workspace for genomics development and analysis work
     • Better control of data, code, and computing environment
  10. Whole MHC Haplotig Large Scale View
      Two haplotigs (no gaps) span the whole MHC region
  11. Phased HLA Genes Confirmed by Trio Typing

      Assembly  Contig ID  Locus     Ref contig  Start    Stop     Called genotype  Edit distance  Closest truth allele  Haplotype
      H1        000000F    HLA-A     pgf         1436987  1440489  A*01:01:01G      0              A*01:01:01G           Maternal
      H1        000000F    HLA-B     pgf         2848251  2851577  B*35:08:01G      0              B*35:08:01G           Maternal
      H1        000000F    HLA-C     pgf         2763854  2767202  C*04:01:01G      0              C*04:01:01G           Maternal
      H1        000000F    HLA-DQA1  pgf         4086777  4093261  DQA1*01:01:01G   0              DQA1*01:01:01G        Maternal
      H1        000000F    HLA-DQB1  pgf         4110116  4117205  DQB1*05:01:01G   0              DQB1*05:01:01G        Maternal
      H1        000000F    HLA-DRB1  pgf         4029789  4043078  DRB1*10:01:01G   0              DRB1*10:01:01G        Maternal
      H2        000000F    HLA-A     pgf         1437427  1440943  A*26:01:01G      0              A*26:01:01G           Paternal
      H2        000000F    HLA-B     pgf         2843682  2846993  B*38:01:01G      0              B*38:01:01G           Paternal
      H2        000000F    HLA-C     pgf         2768829  2772177  C*12:03:01G      0              C*12:03:01G           Paternal
      H2        000000F    HLA-DQA1  pgf         4182456  4188892  DQA1*03:01:01G   0              DQA1*03:01:01G        Paternal
      H2        000000F    HLA-DQB1  pgf         4201076  4208201  DQB1*03:02:01G   0              DQB1*03:02:01G        Paternal
      H2        000000F    HLA-DRB1  cox         4122938  4138189  DRB1*04:02:01    0              DRB1*04:02:01         Paternal

      Edit distance is between the called genotype and the assembly. In every row the called-vs-truth comparison pairs the called genotype with the same truth allele (e.g., A*01:01:01G_v/s_A*01:01:01G).
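The “edit distance” column in the HLA table can be made concrete with a small sketch. The slides do not say how the distance was computed, so this assumes a plain Levenshtein distance (substitutions, insertions, deletions) between an assembled gene sequence and candidate allele sequences. The function names, allele labels, and sequences below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def closest_allele(assembled, alleles):
    """Return (allele_name, distance) for the allele nearest to the assembled sequence."""
    return min(((name, edit_distance(assembled, seq)) for name, seq in alleles.items()),
               key=lambda pair: pair[1])


# Hypothetical toy alleles; real HLA allele sequences are several kb long.
toy_alleles = {"toy_allele_1": "ACGTAC", "toy_allele_2": "ACGAAC"}
print(closest_allele("ACGTAC", toy_alleles))
```

A distance of 0, as reported for all twelve rows on the slide, means the assembled sequence matches the typed allele exactly.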
  12. What Is Still Wrong and What Can We Do With It?
      • Because of the read length limit, resolving CYP21A2 / TNXB still needs some manual work.
      [Figure: assembly graph and assembly-vs-reference alignments showing a 35 kb repeat]
  13. Unroll the Loop With an ONT Read
      • Getting a “perfect” assembly needs multi-scale approaches for both phasing and contig construction.
      • We can spike in this 150 kb ONT read to “unroll” the loop in the assembly graph.
      [Figure: self-self dot plot of a >150 kb ONT read, with Repeat 1 and Repeat 2 marked]
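A self-self dot plot like the one described for the ONT read can be approximated by listing every pair of positions that share a k-mer. The sketch below is a minimal illustration under assumed names and a toy sequence, not the method used for the slide: off-diagonal matches at a consistent offset indicate a repeat, and the most common offset estimates the spacing between the two copies.

```python
from collections import Counter, defaultdict


def self_matches(seq, k):
    """Return sorted (i, j) pairs, i < j, where the k-mers at i and j are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    pairs = []
    for pos in positions.values():
        for a in range(len(pos)):
            for b in range(a + 1, len(pos)):
                pairs.append((pos[a], pos[b]))
    return sorted(pairs)


def dominant_offset(pairs):
    """Most common distance j - i among matching pairs, or None if there are none."""
    if not pairs:
        return None
    return Counter(j - i for i, j in pairs).most_common(1)[0][0]


# Toy read: the same 8 bp unit twice, separated by a 3 bp spacer.
read = "ACGTTGCA" + "GGG" + "ACGTTGCA"
pairs = self_matches(read, k=5)
print(pairs)
print(dominant_offset(pairs))
```

On real noisy long reads, exact k-mer matching would need a small k or a minimizer-based aligner; the quadratic pairing here is only practical for short toy inputs.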
  14. 2 Haplotypes and 2 Copies of CYP21A2 Repeats
      • Detecting loops is easy. (Perhaps we should annotate assembly output for that.)
      • However, when the read length is shorter than the repeats, we need to resolve 2x2 haplotypes.
      [Figure: variant co-occurrence pattern]
  15. Long Nanopore Reads Can Be Phased Better
      Thursday afternoon poster 1582/T: “The portrait of fully phased assembled diploid human genome,” Arkarachai Fungtammasan, et al.
  17. Any Other Challenges?
      • Read recruitment against a single reference misses reads
      • The assembly will not be complete without an initial de novo assembly
      • One can’t describe the difference with small variant calls
      Take away: 179 reads that are only mapped to the HG002 de novo contig
  18. “Perfect” Is Still Elusive
      • Residual error analysis: reads <-> assembly contig consistency check (Minimap2 + FreeBayes variant calling)
      • Not surprisingly, the major inconsistencies are from homopolymers
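Since homopolymers account for most residual inconsistencies, and the benchmark excludes homopolymers >10bp, it can help to see how such runs might be located in a contig. The following is a minimal sketch assuming an uppercase A/C/G/T sequence; the function name and the default threshold (runs of at least 11 bases, i.e. longer than 10 bp) are illustrative choices, not taken from the slides.

```python
import re


def homopolymer_runs(seq, min_len=11):
    """Return (start, end, base) for each single-base run of at least min_len bases.

    Coordinates are 0-based, end-exclusive. Only uppercase A/C/G/T is considered.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]


# Toy contig: a 12 bp A-run (reported) and a 10 bp G-run (not longer than 10 bp).
contig = "ACG" + "A" * 12 + "CT" + "G" * 10 + "T"
print(homopolymer_runs(contig))
```

The resulting intervals could be written out as BED records to mask these regions before comparing call sets.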
  19. Integrating assembly- and mapping-based calls gives best MHC benchmark
      • MHC assembly-based bed includes 23187 variants in the MHC region, excluding:
        • CYP21A2 and pseudogene
        • Homopolymers >10bp
        • SVs in assembly
        • Very dense variants
      • v4.0 mapping-based bed includes 13964 variants in the MHC region, excluding:
        • Short read callsets
        • Conflicts between callers
        • SVs from all methods
        • Homopolymers >10bp
        • Many clusters of variants, including some HLA genes
      • Only 11 differences between assembly- and mapping-based calls in both beds:
        • 2 genotyping errors in assembly-based
        • 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based
      • Merged benchmark includes 23229 variants in the MHC region (… Mbp)
        • Covers most HLA genes and CYP21A2/TNXA/TNXB

      Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
      ----------------------------------------------------------------------------------------------------
      None                   13899          13549         10          4     0.9993       0.9997     0.9995

      These variants are fully phased through the MHC region too!
      9265 new variants over the MHC region.
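The summary row in the benchmark table can be reproduced from its four counts. The slide does not name the comparison tool, so the formulas below are an assumption: precision is taken over call-side true positives, sensitivity over baseline-side true positives, and F-measure is their harmonic mean. The function name is illustrative.

```python
def benchmark_metrics(tp_baseline, tp_call, fp, fn):
    """Return (precision, sensitivity, f_measure) from comparison counts."""
    precision = tp_call / (tp_call + fp)
    sensitivity = tp_baseline / (tp_baseline + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts from the slide's summary row.
p, s, f = benchmark_metrics(tp_baseline=13899, tp_call=13549, fp=10, fn=4)
print(f"{p:.4f} {s:.4f} {f:.4f}")

# The "new variants" figure is the merged total minus the v4.0 mapping-based total.
print(23229 - 13964)
```

With these definitions the computed values round to the precision, sensitivity, and F-measure shown on the slide, and the subtraction gives the 9265 new variants quoted there.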
  20. More MHC in Haplotype Resolved Genome Assemblies
      • Assemblies shown: NA12878 H1, NA12878 H2, PGP1 H1, PGP1 H2, HG002 H1, HG002 H2
      • We can already see 6 different haplotypes at this scale
      • 221/4:30: “A robust and production-level approach to haplotype-resolved assembly of single individuals.” S. Garg, C. Fungtammasan, A. Carroll, R. Hall, E. Hatas, M. Mahmoud, F. Sedlazeck, M. Chou, J. Aach, J. Zook, J. Chin, H. Lee, G. Church.
  21. Next Generation MHC Database?
      • Is it worth solving this puzzle with long-read technologies at scale?
      [Figure: Class I and Class II HLA alleles, http://hla.alleles.org/inc/images/graph_hires.png; number of associated variants per 5 Mbp (y-axis: number of variants)]
  22. Acknowledgement
      Thank you for your attention!
      • The MHC team for the Pan-genomics in the Cloud hackathon 2019: A. Dilthey, A. Fungtammasan, S. Garg, E. Garrison, M. Rautiainen, M. Tobias, J. Wanger, Q. Zeng, J. Zook
      • Peregrine assembler co-developer: Asif Khalak, Foundation of Bio-Data Sciences
      • B. Busby and B. Paten for hosting the hackathon
      https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC

Editor's Notes

  • https://www.biorxiv.org/content/biorxiv/early/2016/11/14/085050.full.pdf
    Fast Assembly / Fast Iteration
