101717.kh miga ashg_grc

  1. 1. Centromere Sequence Assembly. Karen H. Miga, University of California, Santa Cruz. 10/17/17, GRC GIAB Workshop, ASHG
  2. 2. HUMAN CENTROMERES: MULTI-MEGABASE SIZED GAPS IN ALL CHROMOSOME ASSEMBLIES. Megabase-sized gaps. P-ARM, CEN, Q-ARM
  3. 3. CEN
  4. 4. PROGRESS UPDATE: CENTROMERE SEQUENCE ASSEMBLIES 1. GRCh38 Reference Models for Human Centromere Arrays 2. Efforts to Generate True, Linear Assemblies of Centromeric Regions: Chromosome Y 3. Future Perspective
  5. 5. CHALLENGE OF ASSEMBLING LONG TRACTS OF (NEAR IDENTICAL) TANDEM REPEATS. p-arm ... q-arm. Multi-megabase sized arrays of satellite DNA: ...ATCCGATTACG ATCCGATTACGATCCGATTACG...
  6. 6. HUMAN CENTROMERES: ALPHA SATELLITE. p-arm ... q-arm. Alpha satellite: ~171 bp tandem repeat (monomers 1 2 3 4). Wide range of percent ID: ~60-100%
  7. 7. HIGHER ORDER REPEATS. p-arm ... q-arm. “Higher Order Repeat”: multi-monomeric repeat unit (1 2 3 4 1 2 3 4 1 2 3 4). Narrow range of percent ID: 94%-100%
  8. 8. CHROMOSOME-SPECIFIC SATELLITE SEQUENCE ORGANIZATION. chrX and chr3, each drawn p-arm ... q-arm. Array “A”, Array “B”, Array “C”
  9. 9. GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS. p-arm ... q-arm. -A- -T-
  10. 10. GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS. p-arm ... q-arm. -A- -T-. LINE, SINE, other non-alpha satellite
  11. 11. GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS. p-arm ... q-arm. -A- -T-. LINE, SINE, other non-alpha satellite. Unmapped (yet assembled) scaffolds
  12. 12. 1. GRCh38 Alpha Satellite Reference Models. Step 1: Characterize HORs in Human Genome
  13. 13. 1. GRCh38 Alpha Satellite Reference Models. Step 1: Characterize HORs in Human Genome. A B C D E F
  14. 14. 1. GRCh38 Alpha Satellite Reference Models. Step 1: Characterize HORs in Human Genome. >200 ENCODE datasets. A B C D E F. α-Centauri (centromeric automated repeat identification), http://github.com/volkansevim/alpha-CENTAURI. [Figure: example for a single P-read, 5’ ... 3’]
  15. 15. Chromosome specific assignment? B C D E F A
  16. 16. Chromosome specific assignment. B C D E F A. Experimental Evidence: FISH Hybridization/Mapping and Screening Somatic Cell Hybrid Panel (D7Z1 6-mer; Waye et al (1987); 98%; GenBank: M16101). Flow Sorted Chromosome Alignment/Enrichment: sequence enrichment analysis of isolated human chromosomes. Long Range Paired Read Support: “Anchor” mapped to the assembled p-arm and/or q-arm
  17. 17. Chromosome-assignment of Higher Order Repeats
  18. 18. 1. GRCh38 Alpha Satellite Reference Models. Step 1: Characterize HORs in Human Genome. Step 2: Constructing WGS Read Libraries for each HOR array, e.g. CENX, DXZ1 (12-mer): 1 2 3 4 5 6 7 8 9 10 11 12; LINE; A/T. HuRef WGS Sanger read Db
  19. 19. 1. GRCh38 Alpha Satellite Reference Models. Step 1: Characterize HORs in Human Genome. Step 2: Constructing WGS Read Libraries for each HOR array. Step 3: Model Array Variants in Sequence Graph. [Graph nodes: m3v1 m1v1 m2v1 m2v2 m4v1 m12v1 m5v1 m6v1 LINE m11v1 m10v1 m9v1 m8v1 m7v1 LINE A/T; edge weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.3 0.7 0.3 0.7 1.0]
  20. 20. linearSat (https://github.com/JimKent/linearSat) • 2nd Order Markov Chain • Length determined by normalized array length estimates. Not the “true” long-range organization, yet adequately represents the alpha satellite array sequence. [Graph nodes: m3v1 m1v1 m2v1 m2v2 m4v1 m12v1 m5v1 m6v1 LINE m11v1 m10v1 m9v1 m8v1 m7v1; edge weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.3 0.7 0.3 0.7 1.0]
  21. 21. LINEAR ORDERING OF REFERENCE MODELS AND ASSEMBLED CONTIGS USING MATE PAIRS. CENX, Xp, Xq; 3.8 Mb; chrX. 2.25 Mb, ~860 HOR units; 0.73 Mb, ~43 HOR units; 0.3 Mb, Low Copy Repeat. 3p, 3q, CEN3.1, CEN3.2; Unmapped HuRef Assembled Contig(s) (e.g. ABBA01185959); chr3
  22. 22. An Initial Draft of Human Centromere Sequence Composition. Alpha Satellite Reference Models: ~60 Mb (59571670 bp). [Figure: centromere models drawn between the p- and q-arms of chromosomes 1-12, 15-20, X and Y; Acrocentric Chr (13, 14, 21, 22); scale bar 100 Kb]
  23. 23. CENTROMERE SEQUENCE ASSEMBLY 1. GRCh38 Alpha Satellite Reference Models 2. Linear Assembly of a Human Centromere. Miga, KH., et al. Genome research 24.4 (2014): 697-707.
  24. 24. LINEAR ASSEMBLY OF A HUMAN CENTROMERE ON THE Y CHROMOSOME. Small, haploid satellite array with well-characterized 5.8 kb repeat. p-arm, q-arm
  25. 25. BACS: OVERLAP-LAYOUT-ASSEMBLY. p-arm, q-arm. Collection of 9 BACs known to span the Y Centromere. Overlap determined by single copy sequence variants. Tilford et al 2001 Nature
  26. 26. HIGH QUALITY + LONG (100 kb+) READS. Challenge of Assembling Identical Tandem Repeats with Short Reads: Collapsed Representation (~100 kb)
  27. 27. HIGH QUALITY + LONG (100 kb+) READS. High Quality Consensus Sequence (~100 kb)
  28. 28. NANOPORE SEQUENCING: LONGBOARD (1D). UCSC LONGBOARD 1D PROTOCOL
  29. 29. NANOPORE SEQUENCING: LONGBOARD (1D). LONGBOARD 1D PROTOCOL
  30. 30. NANOPORE SEQUENCING: LONGBOARD (1D). UCSC LONGBOARD 1D PROTOCOL. In total, we have generated 3500+ reads greater than 150 kb
  31. 31. MULTIPLE ALIGNMENT STRATEGY TO IMPROVE QUALITY BY CONSENSUS. UCSC LONGBOARD 1D PROTOCOL. High Quality Consensus Req[...] Modest Cove[...]
  32. 32. RP11 718M18: 221.4 kb; Vector; Insert. 38 CENY RPTS (>99% Identity to published consensus). 634 Predicted Nucleotide Variants. 2 Tandem Structural Rearrangements. Homopolymers [A]n. Homopolymers [T]n
  33. 33. VALIDATE HIGH-CONFIDENCE SINGLE COPY VARIANTS WITH ILLUMINA. Identify informative, single copy sites in the array useful for overlap BAC-based assembly. Y SINGLE COPY VARIANTS USING ILLUMINA DATA. RP11 718M18, 221.4 kb
  34. 34. VALIDATE HIGH-CONFIDENCE SINGLE COPY VARIANTS
  35. 35. LINEAR ASSEMBLY OF HUMAN Y CENTROMERE
  36. 36. Future Perspective 1. Linear assemblies of human centromeric regions improve in step with sequencing technology (i.e. read length and quality) 2. One genome is not enough: Highly variable 3. Linear CEN assemblies present a mapping challenge to most genomic applications
  37. 37. True Linear Maps of Human CEN Regions Y CEN True Linear Arrangement Informatics/Analysis Data Structure
  38. 38. Key Advantages of Satellite DNA Graphs 1. Eliminates sequence redundancy
  39. 39. Key Advantages of Satellite DNA Graphs 1. Eliminates sequence redundancy. Improves Unambiguous Short Read Mapping (REPEAT REPEAT REPEAT ? 5’ 3’ REPEAT). Centromere Graphs (Benedict Paten, Adam Novak): Demonstrate unambiguous mapping of the majority (>98%) of 1000 Genomes alpha satellite reads
  40. 40. Key Advantages of Satellite DNA Graphs 1. Eliminates sequence redundancy 2. Information describing long-range haplotypes is retained as defined “paths” in the graph
  41. 41. Key Advantages of Satellite DNA Graphs 1. Eliminates sequence redundancy 2. Information describing long-range haplotypes is retained as defined “paths” in the graph 3. Graph data structure and sequence analysis tools will be consistent with the rest of the human genome. The major histocompatibility complex (Kiran Garimella & Gil McVean)
  42. 42. Acknowledgements. Creating (and mapping to) a Universal Reference Genome: Benedict Paten, Adam Novak, David Haussler, UC Santa Cruz. Mark Akeson, Miten Jain, Hugh Olsen, Benedict Paten, Dave Deamer, Robin AbuShumays, Andrew Smith, Ian Fiddes, Art Rand, Logan Mulroney, Jordan Eizenga, Rojin Safavi, Rachel Lawton, Andrew Bailey, Ariah Mackie; David Haussler, Benedict Paten, Jim Kent, Sofie Salama; UCSC Nanopore Analysis Group; Miten Jain, Hugh Olsen, Mark Akeson; Dan Turner, David Stoddart; Oxford Nanopore Technologies; Huntington F. Willard; David Page. Product / Version: Device: MinION MK1; Flow cell: FLO-MIN106; Kits: Rapid Sequencing Kit; Data analysis: Albacore 1.0.1, Metrichor 1D
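
Slides 19 and 20 describe modeling each HOR array as a graph of monomer variants and generating a linear reference model with a second-order Markov chain. The sketch below illustrates that idea on toy data only. It is not the linearSat implementation: the function names `train_second_order` and `generate`, the toy reads, and the use of a fixed monomer count in place of the normalized array length estimate are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def train_second_order(reads):
    """Count transitions (a, b) -> c over sequences of monomer labels."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for a, b, c in zip(read, read[1:], read[2:]):
            counts[(a, b)][c] += 1
    return counts

def generate(counts, start, n_monomers, rng):
    """Walk the chain from a starting pair until n_monomers labels are emitted.

    n_monomers is a hypothetical stand-in for an array length estimate.
    """
    out = list(start)
    while len(out) < n_monomers:
        options = counts.get((out[-2], out[-1]))
        if not options:
            break  # no observed successor for this pair
        labels = sorted(options)
        weights = [options[label] for label in labels]
        out.append(rng.choices(labels, weights=weights, k=1)[0])
    return out

# Toy reads (assumed data): a 4-monomer HOR with two variants of monomer 2.
reads = [
    ["m1v1", "m2v1", "m3v1", "m4v1", "m1v1", "m2v1", "m3v1", "m4v1"],
    ["m1v1", "m2v1", "m3v1", "m4v1", "m1v1", "m2v2", "m3v1", "m4v1"],
]
model = train_second_order(reads)
array = generate(model, ("m1v1", "m2v1"), 12, random.Random(0))
print(" ".join(array))
```

The generated order reproduces observed local transitions and their frequencies, which matches the slide's caveat that such a model is not the "true" long-range organization.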

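Slide 31 refers to a multiple alignment strategy that improves read quality by consensus. A minimal sketch of the consensus step is given below, assuming the reads have already been aligned to equal length with `-` marking gaps. The function name `consensus` and the toy reads are hypothetical, and a simple per-column majority vote is used here as a stand-in; the slides do not say which consensus method was applied.

```python
from collections import Counter

def consensus(aligned_reads, gap="-"):
    """Majority vote per alignment column; columns won by the gap are dropped."""
    lengths = {len(read) for read in aligned_reads}
    if len(lengths) != 1:
        raise ValueError("aligned reads must all have the same length")
    out = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:
            out.append(base)
    return "".join(out)

# Toy alignment (assumed data): five noisy copies of the same repeat segment.
aligned = [
    "ATCCGATTACG",
    "ATCC-ATTACG",  # deletion error
    "ATCCGATTTCG",  # substitution error
    "ATCCGATTACG",
    "AACCGATTACG",  # substitution error
]
result = consensus(aligned)
print(result)
```

Each read here carries at most one error, and no column has more than one erroneous read, so the majority vote recovers the underlying sequence from only five reads.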