
Telomere-to-telomere assembly of a complete human chromosome


Presentation at 2019 ASHG GRC/GIAB workshop describing goals and progress of the telomere-to-telomere consortium to generate a genome assembly that provides representation of all sequences, including repetitive regions.

Published in: Science

  1. 1. Telomere-to-telomere assembly of a complete human chromosome. Karen Miga, UC Davis Genetics Seminar, Sept 30, 2019, @khmiga
  2. 2. New Era in Genetics and Genomics: We are finally reaching complete, high-quality telomere-to-telomere chromosome assemblies.
  6. 6. New Era in Genetics and Genomics: We are finally reaching complete, high-quality telomere-to-telomere chromosome assemblies. Human reference genome is incomplete. • 368 unresolved issues, 102 gaps • Segmental duplications, gene families, satellite arrays, centromeres, rDNAs • Uncharacterized sequence variation in the human population [Figure labels: chr21; Our current understanding of genome biology and function, 30 Mb; ~20 Mb ?]
  7. 7. Challenge: Generating assemblies across repetitive regions that span hundreds of kilobases. Can high-coverage ultra-long sequencing resolve complete assemblies of the human genome? [Figure labels: Repeats (100 kb+); Unique variant; Unique variant]
  8. 8. MinION 100kb+
  9. 9. It’s time to finish the human genome. The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome.
  10. 10. Our target: CHM13hTERT. Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers. N=46; XX
  13. 13. CHM13 Sequencing (Intramural Sequencing Center), May 1, 2018 to Jan 8, 2019: • 94 MinION/GridION flow cells • 11.1M reads • 155 Gb (1.6 Gb / flow cell) (50x) • 99 Gb in reads >50 kb (32x) • 78 Gb in reads >70 kb (25x) • Max mapped read length 1.04 Mb. Data types and their roles: • 50x Nanopore ultra-long: contig building • 60x PacBio: polishing • 50x 10x Genomics: polishing • BioNano: structural validation
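The fold-coverage figures on this slide follow from dividing sequenced bases by genome size. The sketch below reproduces that arithmetic; the 3.1 Gb genome size is an assumption (the slide does not state it), chosen because it is consistent with 155 Gb being reported as 50x. Function names are illustrative only.

```python
# Assumption: ~3.1 Gb genome size; not stated on the slide.
GENOME_SIZE_GB = 3.1


def fold_coverage(total_gb, genome_size_gb=GENOME_SIZE_GB):
    """Fold coverage = sequenced bases / genome size (both in Gb)."""
    return total_gb / genome_size_gb


def yield_per_flow_cell(total_gb, n_flow_cells):
    """Average yield per flow cell in Gb."""
    return total_gb / n_flow_cells


# Read subsets and yields (Gb) as reported on the slide.
for label, gb in [("all reads", 155), ("reads >50 kb", 99), ("reads >70 kb", 78)]:
    print(f"{label}: {gb} Gb is about {fold_coverage(gb):.0f}x")
print(f"per flow cell: {yield_per_flow_cell(155, 94):.1f} Gb")
```

Under that assumed genome size, the three subsets come out at roughly 50x, 32x and 25x, matching the slide.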
  14. 14. Roadmap for completing the genome (Canu): • 2.94 Gbp assembly, NG50: 75 Mbp • Exceeds the continuity of the reference genome GRCh38 (56 Mbp NG50 contig size). • A subset of chromosome assemblies break only at the centromere.
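NG50 is the contig length at which contigs of that length or longer cover at least half of the genome size (not half of the assembly size, which would be N50). A minimal sketch of that definition follows; the function name and the toy inputs are illustrative and not taken from the talk.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50: the largest length L such that contigs of
    length >= L together cover at least half of genome_size.
    Returns 0 if the assembly covers less than half the genome."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


# Toy example (made-up lengths): genome size 100, four contigs.
print(ng50([40, 30, 20, 10], genome_size=100))
```

Passing the sum of the contig lengths as `genome_size` gives the ordinary N50 instead.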
  15. 15. Canu
  17. 17. Orthogonal Validation Jo and Valerie
  18. 18. 2.2 to 3.7 Mb; mean of 3010 kb (S.D. = 429; n = 49)
  19. 19. STRUCTURAL VARIANT
  20. 20. STRUCTURAL VARIANT: Assemble contigs using overlapping SV patterns [Figure labels as extracted: 151516 15 3 8 2 8 4]
  21. 21. Scaffold Assembly of XCEN [Figure labels: Xq; Xp]
  22. 22. Rel3 Assembly: ~3.1 Mb. The assembly is a hypothesis(!) [Figure labels: Xq; Xp]
  23. 23. Rel3 Assembly: ~3.1 Mb. Beth Sullivan, Jennifer Gerton, Edmund Howe [Figure values as extracted: 2107 294659]
  24. 24. @NanoporeConf | #NanoporeConf Marker-assisted mapping Adam Phillippy Arang Rhie Sergey Koren
  25. 25. Marker-assisted mapping: Create a scaffold of unique, or single-copy, k-mers genome-wide. Adam Phillippy, Arang Rhie, Sergey Koren
  26. 26. Marker-assisted mapping: Anchor high-confidence long-read alignments to repeat assemblies. Adam Phillippy, Arang Rhie, Sergey Koren
  27. 27. Confident mapping of long reads using a single-copy k-mer strategy. Identify and mark all sites of unique anchors across the chromosome (chrX): • 21-mers that appear ~c times in Illumina data • Also found in PacBio/Nanopore reads • Less frequent in the centromere, but still there • (Validated with Duplex-Seq)
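The selection of single-copy markers described above can be sketched as a k-mer count filter: count every k-mer in the short-read data and keep those whose count is close to `c`. The code below is a toy illustration, not the consortium's pipeline. It assumes `c` is the expected short-read depth for a single-copy sequence, and the `tolerance` window, the use of canonical k-mers, and all function names are assumptions made for the example. Real data sets would need a dedicated k-mer counter rather than an in-memory `Counter`.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands are counted together."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k=21):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def single_copy_kmers(counts, c, tolerance=0.5):
    """Keep k-mers whose count lies within c*(1 +/- tolerance).
    The tolerance window is a hypothetical choice for this sketch."""
    lo, hi = c * (1 - tolerance), c * (1 + tolerance)
    return {kmer for kmer, n in counts.items() if lo <= n <= hi}


# Toy data: a unique sequence seen at depth 4 and a repeat seen far more often.
reads = ["AACCG"] * 4 + ["TTTTTTTT"] * 4
markers = single_copy_kmers(count_kmers(reads, k=4), c=4)
print(sorted(markers))
```

In the toy data, the repeat-derived k-mer is seen 20 times and is excluded, while the two k-mers seen 4 times are kept.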
  28. 28. Confident mapping of long reads using a single-copy k-mer strategy. Filter long-read alignments, retaining those with unique k-mer anchoring (chrX).
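One way to read the filtering step: an alignment is retained only if the read actually carries at least one single-copy marker that lies inside the reference interval it was aligned to. The sketch below implements that reading under stated assumptions. The `Alignment` record, the `markers` mapping (reference position to k-mer), and the "at least one marker" rule are all hypothetical, and a real implementation would check marker positions through the alignment rather than by substring search.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_id: str
    read_seq: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # exclusive


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def is_anchored(aln, markers):
    """True if some marker k-mer lies fully inside the aligned reference
    interval and also occurs in the read (on either strand)."""
    for pos, kmer in markers.items():
        inside = aln.ref_start <= pos and pos + len(kmer) <= aln.ref_end
        if inside and (kmer in aln.read_seq
                       or reverse_complement(kmer) in aln.read_seq):
            return True
    return False


def filter_alignments(alignments, markers):
    """Retain only alignments anchored by at least one unique marker."""
    return [a for a in alignments if is_anchored(a, markers)]


# Toy data: one marker k-mer at reference position 10.
markers = {10: "ACGTA"}
alignments = [
    Alignment("r1", "TTACGTATT", 0, 50),     # carries the marker
    Alignment("r2", "TTTTTTTTT", 0, 50),     # overlaps the site, lacks the marker
    Alignment("r3", "TTACGTATT", 100, 150),  # carries it, but aligned elsewhere
    Alignment("r4", "GGTACGTGG", 5, 40),     # carries the reverse complement
]
print([a.read_id for a in filter_alignments(alignments, markers)])
```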
  29. 29. Spacing of single-copy k-mers can be irregular in repeat-dense regions (chrX). X centromere array (CENX): 3.1 Mbp • Number of k-mers: 2,034 • Spacing N50: 6,879 bp • Longest distance between k-mers: 53,798 bp
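The spacing statistics on this slide can be computed from the sorted marker positions. The sketch below assumes "Spacing N50" is defined by analogy with assembly N50, that is, the gap length at which gaps of that size or larger account for at least half of the total spanned distance. The slide does not spell out the definition, so treat that as an assumption; the function name is illustrative.

```python
def spacing_stats(positions):
    """Return (spacing_n50, longest_gap) for a collection of marker positions.
    Gaps are distances between consecutive sorted positions."""
    pos = sorted(positions)
    gaps = sorted((b - a for a, b in zip(pos, pos[1:])), reverse=True)
    if not gaps:
        return 0, 0
    half = sum(gaps) / 2
    running = 0
    for gap in gaps:
        running += gap
        if running >= half:
            return gap, gaps[0]


# Toy positions (made up): one large gap dominates the spanned distance.
print(spacing_stats([0, 10, 30, 100]))
```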
  30. 30. Polishing (chrX): • Unique k-mer-based filtering of Nanopore reads: nanopolish (two rounds) • Unique k-mer-based filtering of PacBio (CLR) reads: arrow (two rounds) • 10XG polishing: longranger + freebayes (two rounds)
  31. 31. GAGE pre-polishing. ChrX GAGE array: 19 tandemly arrayed ~9.4 kb repeats. [Plot: coverage (0 to 250) by base position; series: most frequent base, second most frequent base (error)]
  32. 32. GAGE with marker-assisted polishing. ChrX GAGE array: 19 tandemly arrayed ~9.4 kb repeats. [Plot: coverage (0 to 250) by base position; series: most frequent base, second most frequent base (error)]
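The two series in these plots (most frequent base and second most frequent base at each position) can be derived from a per-position pileup of aligned read bases. The sketch below shows that tally for a single pileup column. Representing a column as a plain string of observed bases is an assumption made for illustration; the talk does not describe how the plots were produced.

```python
from collections import Counter


def top_two_base_counts(column):
    """Given the bases observed at one reference position (e.g. "AAAAT"),
    return (count of most frequent base, count of second most frequent base).
    Non-ACGT symbols such as gaps are ignored."""
    ranked = Counter(b for b in column if b in "ACGT").most_common(2)
    first = ranked[0][1] if ranked else 0
    second = ranked[1][1] if len(ranked) > 1 else 0
    return first, second


# Toy pileup columns: a clean site, a site with one discordant read, a split site.
for col in ["AAAAAA", "AAAAAT", "AATT-"]:
    print(col, top_two_base_counts(col))
```

A consistently high second count at many positions in a repeat suggests reads from other repeat copies are being mixed in, which is the signal the pre-polishing and post-polishing plots contrast.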
  33. 33. CCS/HiFi Evaluation: chrX HiFi alignments to evaluate polishing. CENTROMERE X: BEFORE POLISHING. DXZ1: 3.1 Mb
  34. 34. CCS/HiFi Evaluation: chrX HiFi alignments to evaluate polishing. CENTROMERE X: AFTER POLISHING. NOTE: Underlying satellite array structure remains the same. DXZ1: 3.1 Mb
  35. 35. Opens the whole genome to analysis. Ariel Gershman, Winston Timp’s Laboratory
  41. 41. Finished T2T X Chromosome: High Accuracy and High Continuity. 1. Structurally validated assembly from telomere to telomere, including the 3.1 Mb tandem repeat at the X centromere and providing a complete assessment across tandemly repeated gene families. 2. Novel polishing strategy capable of improving the quality of large repeat-rich regions, demonstrating dramatic improvements in quality over the entirety of the X chromosome. 3. Statistics of CHM13 full-length BAC alignments to polished assembly (table labels as extracted: Mean, Median, BACs Aligned; HiFi, UL-asm): 275/341 (81%), QV 37.4, QV 27.9; 153/341 (45%), QV 37.7, QV 27.4. Vollger M, Logsdon G, et al. bioRxiv doi.org/10.1101/635037
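The QV values in item 3 are Phred-scaled: QV = -10 · log10(error rate), so QV 30 corresponds to one error per 1,000 bases. The sketch below converts between the two scales. It only illustrates the standard Phred relationship and is not the method used to obtain the BAC statistics; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# QV values as reported on the slide.
for qv in (37.4, 27.9, 37.7, 27.4):
    rate = qv_to_error_rate(qv)
    print(f"QV {qv}: about 1 error per {1 / rate:,.0f} bases")
```

Because the scale is logarithmic, a gap of about 10 QV units, as between the two QV columns here, corresponds to roughly a tenfold difference in error rate.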
  42. 42. It is time to finish the human genome
  43. 43. NEW! Rel3 open data release: • github.com/nanopore-wgs-consortium/chm13 • 120x Nanopore reads (NHGRI, UW, Nottingham, UC Davis (PromethION, Megan Dennis)) • 50x 10x Genomics linked reads (NHGRI) • 70x PacBio CLR reads (WashU) • 24x PacBio HiFi reads (UW) • 40x Hi-C (Arima Genomics) • BioNano optical map (WashU) • Unpolished Canu assemblies
  44. 44. Additional ultra-long ONT data from Glennis Logsdon (UW): 13.9X coverage. Read length, coverage, percent of data: • >50 kbp: 12X, 86% • >100 kbp: 9.1X, 66% • >150 kbp: 6.8X, 49% • >200 kbp: 4.9X, 35% • >250 kbp: 3.4X, 24%. N50 = 147.1, N1 = 649.6, Max = 1538.3 (kbp). [Histogram: number of reads by read length (kbp)] • github.com/nanopore-wgs-consortium/chm13
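The table on this slide reports, for each read-length cutoff, the coverage contributed by reads above the cutoff and the share of all sequenced bases they represent (for example, 12X out of 13.9X is about 86%). A minimal sketch of that tabulation follows; the function name and the toy read lengths are illustrative, not data from the talk.

```python
def coverage_by_min_length(read_lengths, thresholds, genome_size):
    """For each threshold, return (threshold, fold coverage from reads longer
    than the threshold, percent of all sequenced bases in those reads)."""
    total = sum(read_lengths)
    rows = []
    for t in thresholds:
        bases = sum(n for n in read_lengths if n > t)
        percent = 100 * bases / total if total else 0.0
        rows.append((t, bases / genome_size, percent))
    return rows


# Toy example (made-up lengths and genome size).
for t, cov, pct in coverage_by_min_length([10, 20, 30, 40], [15, 35], genome_size=10):
    print(f">{t}: {cov:.1f}X, {pct:.0f}% of data")
```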
  45. 45. Triple the coverage, what changed? • Minimal change in continuity: 79.5 Mbp (rel2) vs. 71.8 Mbp (rel3) NG50 • Don’t judge assemblies based on continuity • Tricky regions are fixed: GAGE and more SegDups automatically resolved • Improved BAC validation: 288 (rel2) vs. 310 (rel3) of 341 BACs resolved • 1 chromosome down, 23 to go…
  46. 46. Goal of a complete human genome in the next two years. Challenges in front of us: • Acrocentric p-arms • Large segmental duplications • Classical human satellites 2 and 3. Establishing new benchmarking standards (XChr). Pioneering new pipelines: polishing, repeat assembly, and array structural validation. Setting the bar higher for quality and completeness.
