
150224 grc kms

  1. 1. Characterizing extreme diversity in the human genome using a single haplotype genomic resource Karyn Meltz Steinberg, Ph.D. AGBT 2015 GRC Workshop @KMS_Meltzy
  2. 2. Human Genetic Variation. Types of genetic variants (figure): frequency plotted against size of variant (1 bp, 1 kb, 1 Mb, 1 chr) for SNPs, copy number variants, and trisomy/monosomy. Slide courtesy of S. Girirajan
  3. 3. Human Genetic Variation. Types of genetic variants (figure), with a second panel "How do we assay them?": throughput plotted against size of variant (1 bp, 1 kb, 1 Mb, 1 chr). Slide courtesy of S. Girirajan
  4. 4. Human Genetic Variation. Types of genetic variants (figure); "How do we assay them?" panel: SNP genotyping. Slide courtesy of S. Girirajan
  5. 5. Human Genetic Variation. Types of genetic variants (figure); "How do we assay them?" panel: SNP genotyping, Array-CGH, Karyotyping. Slide courtesy of S. Girirajan
  6. 6. Human Genetic Variation. Types of genetic variants (figure); "How do we assay them?" panel: SNP genotyping, Array-CGH, Karyotyping, Sequencing. Slide courtesy of S. Girirajan
  7. 7. Extreme diversity in the human genome • <99.5% identity to the reference • Refractory to traditional sequencing efforts • Loci often contain gene families associated with immune response and xenobiotic metabolism
  8. 8. HLA is a classic example of an extremely diverse locus • Critical to immune response • Characterized by overdominant selection • Alleles are linked and segregate as distinct haplotypes • Shaped by gene duplication and diversification
  9. 9. Segmental duplications can predispose loci to further rearrangement via NAHR
  10. 10. Segmental duplications can predispose loci to further rearrangement via NAHR
  11. 11. Diploid Genome (figure): repeat copies (noted by color difference) and allelic copies. With a diploid genome, there is significant ambiguity sorting allelic copies from repeat copies. Haploid Genome (figure): repeat copies ONLY (noted by color differences). With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies.
  12. 12. Hydatidiform mole
  13. 13. SRGAP2: homology between genes (figure: SRGAP2A, SRGAP2B, SRGAP2C) • Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs • Shows homology between SRGAP2B and SRGAP2C • Dennis et al. 2012
  14. 14. 1q21 1q21 patch alignment to chromosome 1 1q32 1q21 1p21
  15. 15. Hydatidiform mole Let’s sequence and assemble the whole genome!
  16. 16. CHM1_1.1 Assembly • Reference-guided assembly • SRPRISM v2.3, R. Agarwala • Alignment of Illumina reads to GRCh37 primary assembly • CHORI-17 BAC clone tilepaths were then incorporated • 428 total clones • 324 clones in 45 tilepaths • 104 clones as singletons • http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2 • Total Sequence Length: 3,037,866,619 bp • Total Assembly Gap Length: 210,229,812 bp • Number of Scaffolds: 163 • Scaffold N50: 50,362,920 bp • CHM1 Assembly Paper: Steinberg et al. 2014, Genome Research
  17. 17. CHM1_1.1 assembly is highly contiguous compared to other WGS based assemblies
  18. 18. Integrating BAC tiling paths improved assembly
  19. 19. Integrating BAC tiling paths improved assembly
  20. 20. Alignment of CHM1 Illumina data to assembly revealed regions of extreme heterogeneity. Values are given as heterozygous / homozygous / total. • Variants: 64,033 / 22,513 / 86,546 • In RepeatMasked (RM) sequence: 37,060 / 14,833 / 51,893 • In segmental duplication (SD): 30,670 / 4,843 / 35,513 • In RM and SD: 51,466 / 17,174 / 68,640 • Ts:Tv: 1.5 / 0.7 / 1.2 • Mean SNV density/kb: 0.02 / 0.008 / 0.03 • There are significantly more heterozygous variants in repetitive sequence than expected (p < 1×10^-16). • BAC ends mapping discordantly and in multiple loci are significantly enriched for segmental duplications (p < 1×10^-5).
  21. 21. Identified 549 novel protein coding genes not annotated in GRCh37
  22. 22. CHM1 BioNano Genome Map Aligned to GRCh38 GRCh38 CHM1 BioNano Map ~15kb additional data
  23. 23. BioNano SV Calls Identified Assembly Problems (figure: CHM1_1.1 Assembly aligned to CHM1 BioNano Map) • Collapse • Expansion in Assembly • Gap in Sequence
  24. 24. Conclusion • Extremely diverse regions of the genome are difficult to characterize due to issues distinguishing allelic from paralogous duplications • CHM1_1.1 highly contiguous single haplotype representation of the genome • Identified regions of misassembly or reference-ized regions • Utilize long read technology and nanopore technology to attempt to fix these regions
  25. 25. Need to add more diversity to reference • Finish another hydatidiform mole to platinum status • Finish 5 genomes to gold status • NA19240 (Yoruban) • NA12878 (European) • HG00513 (Han Chinese) • 2 “wildcards” • Looking for underrepresented minority population • Add high quality alternative sequences to reference to create a population reference graph or “pan genome”
  26. 26. Use colored de Bruijn graph structure to represent population reference graph
  27. 27. Bioinformatic tool development in the future • Alignment of short reads to population reference graph • Variant calling • Variant reporting/Haplotype resolution
  28. 28. Adapted from Weinstein et al, 2009
  29. 29. The GRCh37 reference sequence was assembled from three lymphoblastoid cell lines Not a true haplotype Incomplete
  30. 30. The CH17 haplotype is quite different from the reference
  31. 31. Novel insertion The CH17 haplotype is quite different from the reference
  32. 32. Complex Indel The CH17 haplotype is quite different from the reference
  33. 33. Hotspot/Recurrent Mutation The CH17 haplotype is quite different from the reference
  34. 34. 60 kbp Insertion (Hotspot) African Asian European
  35. 35. Duplication (influenza) The CH17 haplotype is quite different from the reference
  36. 36. 44 kbp Duplication (influenza) African Asian European
  37. 37. Summary of hydatidiform mole sequence • 47 functional V genes • 24 total variants (SNV and CNV) involving 29 IGHV genes • 5 structural variants • 19 single nucleotide variants • 15 non-synonymous mutations • 20 out of 24 variants represent differences in amino acid sequence or gene copy number
  38. 38. Summary of hydatidiform mole sequence • 47 functional V genes • 24 total variants (SNV and CNV) involving 29 IGHV genes • 5 structural variants • 19 single nucleotide variants • 15 non-synonymous mutations • 20 out of 24 variants represent differences in amino acid sequence or gene copy number 100 kbp of novel sequence
  39. 39. Current status of CHM1 resources • CHORI-17 BAC Library (created from CHM1 cell line) • CHORI-17 BAC end sequences (n=325,659) • CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs) • CHORI-17 BACs (>750 have been sequenced, with 592 of them in Genbank as phase 3) • Active cell line • >100X coverage Illumina 100bp reads • 300, 500bp, 3kb inserts • Reference assisted assembly CHM1_1.1 • BioNano genome map • >50X coverage of PacBio long read data
  40. 40. CHM1_1.1 Assembly • Reference-guided assembly – SRPRISM v2.3, R. Agarwala • Alignment of Illumina reads to GRCh37 primary assembly • CHORI-17 BAC clone tilepaths were then incorporated • 428 total clones • 324 clones in 45 tilepaths • 104 clones as singletons • Comparison back to GRCh37 reference to provide appropriate gaps sizes • Assembly submitted to Genbank • http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2 • Steinberg et al, 2014 • Genome Research (Dec;24(12):2066-76)
  41. 41. LILR (leukocyte immunoglobulin-like receptor)/KIR (killer immunoglobulin receptor) Immunoglobulin Kappa chain Immunoglobulin Lambda chain TCRA/B 17q21.31 inversion polymorphism Immunoglobulin heavy chain locus CYP2D6 SRGAP2 15q13.3 inversion polymorphism
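
The scaffold N50 quoted on slide 16 summarizes contiguity: it is the length of the scaffold at which the running total of scaffold lengths, taken from longest to shortest, first reaches half of the total assembled length. The following is a minimal sketch of that calculation, not the procedure NCBI uses to produce the reported statistics; the function name `n50` and the toy lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that brings the running sum to half of the
        # total length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (hypothetical, not CHM1 data).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Whether gap bases count toward each scaffold length changes the result, so the same convention has to be used when comparing assemblies.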

Editor's Notes

  • There are many types of genetic variation in the human genome, ranging from single nucleotide variants up to chromosomal abnormalities. Copy number variants fall between these two extremes.
  • The techniques we use to assay these variants depend on the size of the variant, and this in turn affects the throughput
  • SNP genotyping has very high throughput but the amount of genome that can be assayed is small
  • Array CGH is ideal for copy number variants but the throughput is much lower than SNP genotyping and is less effective for smaller variants
    Karyotyping can detect large abnormalities and the throughput is similar to Array CGH
  • Finally sequencing should be able to assay all forms of genetic variation and the goal of next generation sequencing is to keep increasing the throughput
  • So I became interested in how to effectively identify and assay extreme genetic diversity. We expect any two human haplotypes to be 99.8-100% identical to one another. We define extreme genetic diversity as less than 99.5% identity to the reference sequence, which corresponds to more than 1 variant per 200 base pairs. These loci are often refractory to traditional sequencing efforts and are enriched for genes and gene families related to immune response and environmental detoxification.
  • The human leukocyte antigen locus is a classic example of high diversity in the human genome. HLA is critical to human disease as it binds to the T cell receptor and plays a critical role in antigen processing and presentation. The locus is characterized by overdominant selection where individuals with a heterozygous genotype have higher fitness than those with the homozygous genotype. Alleles at this locus are linked and segregate as distinct haplotypes and the locus has been shaped by extensive gene duplication and diversification.
  • Now what do I mean by gene duplication and diversification? Let’s say there is a gene with 4 functions that duplicates. It has a few different fates.
  • First the two copies could each acquire different mutations that inactivate certain functions but overall the two genes retain the same function as before
  • Secondly the duplicate copy could acquire mutations that endow it with novel functions
  • Third the duplicate gene could acquire mutations to the point that it no longer functions leading to loss of the gene or pseudogenization
  • A duplicated region of the genome that is larger than 1 kilobase and has greater than 90% identity is called a segmental duplication. These duplications can predispose loci to further rearrangement via non-allelic homologous recombination. This can lead to deletions or duplications of intervening sequence
  • Segmental duplications may also lead to inversions. They are an important part of the genome’s architecture as they serve as hotspots that can create more complex architecture and diversity.
  • Here is an example of one of those segmentally duplicated regions. The SRGAP2 gene family maps to 3 specific regions on chr 1. All three of these loci were poorly assembled in GRCh37. This region was resequenced using the CH17 single haplotype BAC library.
    This diagram shows the homologous regions using Miropeats, where the green lines indicate nearly identical segments between SRGAP2A and the duplicate paralogs. The blue lines delineate the larger extent of the homology between SRGAP2B and C. Notice the scale of these regions: the red boxed regions are 244 kb of sequence that is nearly identical among all three loci. In GRCh37, these regions contained multiple haplotypes and were very fragmented. By sequencing these regions with the CH17 BACs, we were able to fully resolve all three of these regions.
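
The working definition of a segmental duplication given in these notes (larger than 1 kilobase, greater than 90% identity) can be written as a simple filter over pairwise alignments. This is a sketch under stated assumptions: the `PairwiseAlignment` record and its field names are hypothetical, and the example numbers are invented rather than taken from the SRGAP2 analysis.

```python
from dataclasses import dataclass

# Thresholds taken from the definition in the notes.
MIN_LENGTH_BP = 1_000
MIN_IDENTITY = 0.90


@dataclass
class PairwiseAlignment:
    """Hypothetical record for an alignment between two genomic segments."""
    aligned_length_bp: int
    matching_bases: int

    @property
    def identity(self) -> float:
        return self.matching_bases / self.aligned_length_bp


def is_segmental_duplication(aln: PairwiseAlignment) -> bool:
    """True if the alignment is longer than 1 kb with identity above 90%."""
    return aln.aligned_length_bp > MIN_LENGTH_BP and aln.identity > MIN_IDENTITY


# Invented examples: a long, nearly identical pair, a short pair, a diverged pair.
examples = [
    PairwiseAlignment(244_000, 243_000),
    PairwiseAlignment(800, 790),
    PairwiseAlignment(5_000, 4_000),
]
for aln in examples:
    print(aln.aligned_length_bp, round(aln.identity, 3), is_segmental_duplication(aln))
```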

  • Here is another representation of this region. The graphic at the bottom shows the alignment of the fix patch to the reference assembly. The blue boxes highlight sequences on the patches that were completely missing from GRCh37.
    By sequencing these regions from a single haplotype source, approximately 500kb of sequence, that was previously misassembled, incorrectly oriented or completely missing, has now been resolved.
  • As I mentioned, the CHM1_1.1, which is the Illumina-based assembly, was a reference guided assembly created by Richa Agarwala at NCBI, using her SRPRISM assembler. The process involves alignment of the Illumina data to the GRCh37 primary assembly. One thing that is unique about this assembly compared to other whole genome assemblies is the fact that we used many BAC tilepaths in segmentally duplicated regions. There were 45 total paths used in the assembly and then another 104 singletons that were included.

    This assembly is available for download from Genbank using the following link. We have recently published a paper on this assembly with further analysis.
  • At the time this assembly was generated, it had the longest N50 contig length of any human whole genome assembly in Genbank.
  • Another CHM1 resource I mentioned is the BioNano Genome Map. BioNano is a nanochannel technology where the DNA is nicked and labeled at specific recognition sites. The result is similar to a restriction digest, with the added benefit that the nick sites stay in context along the DNA molecules.
    This is showing the same region as the previous example. The top green line here is the GRCh38 reference in silico representation and the bottom blue lines are the map. The alignment of these two data sets shows the same size discrepancy, so the BioNano map data confirms the extra data found in CHM1 at this position. If this data were accessioned, we could add this sequence to the reference: it would become a fix patch if we think there is an error in GRCh38, or a novel patch if we believe this to be variation.
  • Here is another way to use the BioNano map to identify potential assembly issues. This time, I have the BioNano map aligned to the CHM1 Illumina assembly. The green bar represents our CHM1 Illumina based sequence assembly and the blue bars are the map contigs. The first tag, labeled here as a collapse is indicating that there is more data present here in the map contig. To look into the possibility that we have a collapse in our assembly, we examined the Illumina reads aligned back to the CHM1_1.1 assembly. We found that the reads were piling up in the region indicating a collapse, so it looks as if the BioNano map is correct through this region. The expansion label is indicating that there is too much data in the assembly. Within this region we have a gap in our assembly, so our gap is likely sized too big. I think in this example, a standard 50Kb gap size was used in the assembly, which likely indicates that we were not sure what the size was.
  • Immunoglobulin molecules are formed when somatic recombination occurs between one V, one D and one J gene.
  • The second major issue is that the current reference sequence was assembled from three lymphoblastoid cell lines. These will be subject to somatic recombination and don’t reflect a true haplotype
  • Across the V gene region we find that the CH17 haplotype is quite different from the reference haplotype. I have indicated the reference genome sequence on the bottom—each of the green boxes is a functional V gene. On the top line is the mole haplotype and as you can see there are many differences both allelic as well as structural. We identified and characterized five large structural variants including the novel insertion of the 7-4-1 gene, an insertion deletion event where two genes were deleted and replaced with two additional genes and a duplication and insertion event at V1-69 and V2-70. Interestingly, V1-69 is preferentially used against hemagglutinin epitopes of particular influenza strains and copy number correlates with expression levels.
  • This is reflected in the enormous and highly significant Fst values shown here.
  • Using Fst we see that this locus is significantly differentiated between the Asian and African populations.
  • We identify and annotate 47 functional V genes, 27 functional D genes and six functional J genes in the mole haplotype.
    There are 7 functional V genes present in the mole haplotype that are not in the reference, and 3 V genes are deleted from the mole haplotype.
  • In conclusion, using the CH17 hydatidiform mole BAC library, we identify approximately 100 kbp of novel sequence not present in the human reference
  • Knowing the utility of this single haploid source, we decided to sequence the whole genome of CHM1. At the time we started this project, Illumina data was the only cost-effective method of generating a whole genome assembly. We generated over 100X coverage of Illumina paired-end reads, and a reference-guided assembly was produced using this data. More recently, PacBio generated >50X coverage of CHM1 in long read data, and we have also had a BioNano genome map generated.
  • As I mentioned, the CHM1_1.1 assembly was a reference-guided assembly created by Richa Agarwala at NCBI, using her SRPRISM assembler. The process involves alignment of the Illumina data to the GRCh37 primary assembly. One thing that is unique about this assembly compared to other whole genome assemblies is the fact that we used many BAC tilepaths in segmentally duplicated regions. There were 45 total paths used in the assembly and then another 104 singletons that were included. Then a final step was done to compare the assembly back to GRCh37 to provide appropriate gap sizes. This assembly is available for download from Genbank using the following link. We also have a paper that should be published soon, and you can find it on bioRxiv.
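
Slide 26 proposes a colored de Bruijn graph as the structure for a population reference graph. The idea can be sketched in a few lines: every k-mer becomes an edge between two (k-1)-mer nodes, and each edge carries the set of samples ("colors") in which it was seen. Nodes with more than one outgoing edge mark places where haplotypes diverge. This is a toy illustration only, not the implementation the GRC intends to use; the function names, the sample labels, and the short sequences are all made up.

```python
from collections import defaultdict


def build_colored_dbg(samples, k):
    """Build a colored de Bruijn graph.

    samples maps a color (sample name) to its sequence. The result maps each
    (k-1)-mer node to its successor nodes, and each edge to the set of colors
    whose sequence contains the corresponding k-mer.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for color, sequence in samples.items():
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            graph[kmer[:-1]][kmer[1:]].add(color)
    return graph


def branch_nodes(graph):
    """Nodes with more than one outgoing edge, where paths diverge."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two invented haplotypes that differ at a single base.
samples = {"reference": "ACGTACCA", "alternate": "ACGTTCCA"}
graph = build_colored_dbg(samples, k=4)

for node in branch_nodes(graph):
    for successor, colors in sorted(graph[node].items()):
        print(node, "->", successor, sorted(colors))
```

Shared sequence collapses into edges carrying both colors, while the variant shows up as two single-color edges leaving the same node. In real genomes, repeats also create branches, which is the same allelic-versus-paralogous ambiguity the talk describes.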
