Explaining the assembly model

Genome Reference Consortium

GRC Workshop at Churchill College on Sep. 21, 2014. This is Valerie Schneider's talk describing the assembly model.

Explaining the assembly model
Valerie Schneider
NCBI
21 September 2014

Scientific Models
Dilthey et al.; Paten et al.

Outline
• Differences between the reference genome assembly and other assemblies
• Features of the current reference assembly model and their relationship to genomic analyses and tools
• The changing reference genome assembly

GRC Assembly Model
[Figure: sequences from haplotype 1 and sequences from haplotype 2]
• Old assembly model: compress into a consensus
• New assembly model: represent both haplotypes

GRC Assembly Model
The human reference genome assembly is not a haploid model
Alternate loci are not synonymous with haplotypes
[Figure: Assembly (e.g. GRCh38) made up of the Primary Assembly Unit, a non-nuclear assembly unit (e.g. MT), and alternate loci units ALT 1 to ALT 7. Genomic regions shown: PAR, MHC, UGT2B17, MAPT.]
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

GRC Assembly Model
[Figure: Assembly (e.g. GRCh38.p1) made up of the Primary Assembly Unit, a non-nuclear assembly unit (e.g. MT), alternate loci units ALT 1 to ALT 7, and a Patches unit. Genomic regions shown: PAR, MHC, UGT2B17, MAPT, ABO, FOXO6, FCGBP.]
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Patches: scaffold status at next major assembly release
• FIX: -- (integrated)
• NOVEL: alt loci

GRC Assembly Model
Fix patches are different than novel patches
[Figure: 1q32, 1q21, 1p21 (Dennis et al., 2012)]

The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly

Anatomy of an alt
Alignment legend: no alignment, mismatch, deletion

Anatomy of an alt
Alternate loci contain some sequence that is redundant to the primary assembly unit
[Figure: components of the alt locus (AC012314.8, CU151838.1) compared with components of chr. 19 (AC012314.8, AC245052.3)]

Alt Loci: Informatics Challenges

GRCh38: Alt Loci
Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly
[Figure: simulated reads]
• Mask1: mask chr for fix patches, scaffold for novel/alts
• Mask2: mask only on scaffolds

GRC: Assembly Model
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes

GRCh38: Alt Loci
[Figure labels: chromosome; alt/patch; reads; on-target alignment; off-target alignments (n=122,922)]

The Changing Reference

GRC Credits
Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Matthew Hurles
• Richard Gibbs

Reference Composition
Source/Recruitment of DNA Donors for Library Construction
"Another implication of the fact that 99.9% of the human DNA sequence is shared by any two individuals is that the backgrounds of the individuals who donate DNA for the first human sequence will make no scientific difference in terms of the usefulness and applicability of the information that results from sequencing the human genome. At the same time, there will undoubtedly be some sensitivity about the choice of DNA sources. There are no scientific reasons why DNA donors should not be selected from diverse pools of potential donors."
http://www.genome.gov/10000921 (August 17, 1996)

Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele

Roles for the reference
• Getting the sequence
• Cataloging genes (and other features)
• Establishing a coordinate system
• Humans vs. other organisms

Editor's Notes

I’d like to begin this talk by reminding everyone of the difference between a genome and an assembly. A human genome is a physical object. An assembly is our representation of that object. It is a model. And as shown here, genome models can take many forms. And as these atomic models illustrate, scientific models evolve over time to reflect our growing knowledge base. And so it is with the human assembly model, the reference genome.
I’m going to cover three main areas in today’s talk: (1) what makes the reference assembly different than other genome assemblies, (2) features of the current reference assembly model and their relationship to genomic analyses and tools, and (3) the changing reference genome assembly.
The human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies only model a diploid genome. But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the GRCh38 primary assembly. While 70% of the genome comes from one donor, sequence from >70 individuals is represented.
Even when assembling the genome of a single individual, there may be divergent haplotypes that confound genome assembly. In the original reference assembly model, which was essentially a stick model of linear chromosomes, there really wasn’t a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The insertion of different haplotypes in such regions, however, often led to non-existent allele combinations and artificial gaps, as illustrated here. In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: alternate loci. The current model allows the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, it retains the linear chromosome models with which most users are comfortable.
GRCh37 was the first genome assembly to use this new model, which is illustrated in this cartoon. The first thing to know about the model is that the “assembly” is composed of multiple assembly units. The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original assembly model. Non-nuclear genomes are assigned to their own assembly unit. Regions are defined for those areas of the genome for which alternate sequence representation is desired. Those alternate sequence representations go into alternate loci assembly units. The first alternate sequence representation for each region goes into one assembly unit. Each additional sequence representation for a region goes into its own assembly unit. Alt loci are stand-alone scaffold sequences that are given chromosome context via their alignment to the primary assembly. I’ll return to this point shortly. The assembly model also includes regions for the pseudo-autosomal regions of the primary assembly unit.
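
The unit structure just described can be sketched as a small data model. This is an illustration only, not NCBI's schema: the class names, the field names, and the greedy placement rule in `add_alt` (one reading of "first representation in one unit, each additional representation in a further unit") are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Region:
    """A genomic region for which alternate representations are defined."""
    name: str
    chrom: str
    start: int  # illustrative coordinates only, not real assembly positions
    stop: int


@dataclass
class AltScaffold:
    """A stand-alone scaffold; its chromosome context comes from an alignment."""
    accession: str  # hypothetical identifier
    region: Region


@dataclass
class Assembly:
    name: str
    primary: list[str]       # chromosomes plus unlocalized/unplaced scaffolds
    non_nuclear: list[str]   # e.g. ["MT"]
    alt_units: list[list[AltScaffold]] = field(default_factory=list)

    def add_alt(self, alt: AltScaffold) -> int:
        """Put the alt in the first unit that has no alt for its region yet.

        Returns the 1-based number of the unit that received the alt.
        """
        for i, unit in enumerate(self.alt_units):
            if all(a.region.name != alt.region.name for a in unit):
                unit.append(alt)
                return i + 1
        self.alt_units.append([alt])
        return len(self.alt_units)
```

Under this reading, the first alt for MHC and the first alt for MAPT share unit 1, while a second MHC alt opens unit 2, so no unit ever holds two representations of the same region.
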
Another aspect of the assembly model I’d like to discuss is the patches. Patches enable the reference assembly to be updated without changing chromosome coordinates. This feature of the model allows the GRC to make assembly updates available in a timely fashion to those users whose work is adversely affected by errors in the current genome, while not disrupting the coordinates upon which other users rely. Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences. It’s important to distinguish the two types of patches: (1) FIX patches correct problems in the assembly and are deprecated in the next assembly release, where the fix is integrated; (2) NOVEL patches add new alternate sequence representations to the assembly and become alternate loci in the next assembly release.
It’s important to recognize that the way fix and novel patches should be used for analysis is different. The novel patches can be treated just like alt loci, and should be considered allelic to the chromosomes. In contrast, because the FIX patches represent assembly corrections, read alignments to the fix patches should take precedence over alignments to the chromosomes. Fix patch updates can sometimes be quite dramatic. This slide shows a FIX patch that corrects a mis-assembly of the 1q21 region of GRCh37 involving the SRGAP gene family. In GRCh37 the 1q21 region was composed of a mix of sequences from the various SRGAP-associated duplications. The bottom panel shows the alignment of chr. 1 to the fix patch, and this dot matrix view really highlights how much things changed. From this, it’s easy to see how a fix patch could improve your analyses, and why you would want to exclude alignments from the corresponding chromosome region.
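
As a rough sketch of the precedence rule above: when a read has a hit on a fix patch, report that hit; otherwise let chromosome, novel patch, and alt hits compete on score. The `Hit` type, the `target_kind` labels, and `choose_hit` are hypothetical names, and the sketch assumes every hit passed in belongs to the same patched region (a real pipeline would check that first).

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    target: str       # name of the sequence the read aligned to
    target_kind: str  # "chromosome", "fix_patch", "novel_patch", or "alt"
    score: int        # alignment score, higher is better


def choose_hit(hits: list[Hit]) -> Optional[Hit]:
    """Pick the alignment to report for one read.

    Fix patches correct the chromosome, so any fix-patch hit outranks
    chromosome hits. Novel patches and alts are treated as allelic and
    simply compete on score.
    """
    if not hits:
        return None
    fix_hits = [h for h in hits if h.target_kind == "fix_patch"]
    pool = fix_hits if fix_hits else hits
    return max(pool, key=lambda h: h.score)
```

Note the asymmetry this encodes: a fix-patch hit wins even with a lower score than the chromosome hit, whereas a novel-patch hit wins only if it scores higher.
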
As I mentioned before, the alternate loci and patches are stand-alone scaffold sequences given chromosome context by virtue of their alignment to the chromosomes, as illustrated in this image of GRCh38 chr. 19, with its aligned alts and their corresponding regions. In the next slide, we’ll take a closer look at an alt in order to better appreciate its relationship to the chromosome. Thus, another important thing to understand about the assembly model is that the alignments of the alt loci to the chromosomes are an integral part of the assembly. The alignment, in conjunction with the sequence, is what defines the alt.
This image shows one of the alt loci from the LRC/KIR region on chr. 19. The blue bars represent the component sequences of the alt. The alignment of the chromosome to the alt is shown beneath. The legend below explains the alignment graphic. We can see that there is a region at either end of the alt that aligns perfectly to the chromosome. To understand why this is the case, we will zoom in on one of these regions and compare to the corresponding region of chr. 19.
At this level, we see that the first component in the alternate locus is also a component in the chromosome. This is what is known as an anchor component. Anchor components are present in all human alt loci and are included to ensure a robust alignment to the chromosome. However, the extent to which the anchor component contributes sequence to the alternate locus and the chromosome may differ, because of differences in the position of the switch point with the adjacent components, which are not the same. The identical region of the alignment corresponds to the sub-region of the anchor that is common to both the alt and the chromosome. As a result of the anchor sequences, all human alt loci contain some sequence that is redundant to the chromosomes. This is the reason that most aligners are not compatible with the alternate loci. Reads corresponding to anchor sequence will map identically to the alt and chromosome, resulting in depressed mapping scores and exclusion from downstream analyses. Reads mapping to other regions of the alternate loci that are similar to the chromosome, even if not derived from anchor sequences, will have the same issue. Alternate aware aligners must recognize the relationship of the alt to the chromosome and not treat reads that map in those regions as ambiguously placed.
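
One way to picture what "alt aware" means here: use the alt-to-chromosome alignment to project alt hits onto chromosome coordinates, and call a read ambiguous only if its hits still name more than one locus afterwards. The sketch below is a simplification under stated assumptions (ungapped alignment blocks, hits reduced to a single start position); the names `Block`, `lift_to_chromosome`, and `is_truly_ambiguous` are hypothetical, and this is not how srprism or any particular aligner is implemented.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    alt_start: int  # 0-based start on the alt scaffold
    chr_start: int  # 0-based start on the chromosome
    length: int     # ungapped aligned length


def lift_to_chromosome(alt_pos: int, blocks: list[Block]) -> Optional[int]:
    """Project an alt coordinate onto the chromosome, or None if unaligned."""
    for b in blocks:
        if b.alt_start <= alt_pos < b.alt_start + b.length:
            return b.chr_start + (alt_pos - b.alt_start)
    return None


def is_truly_ambiguous(chr_hits: list[int], alt_hits: list[int],
                       blocks: list[Block]) -> bool:
    """True only if the hits name more than one distinct locus."""
    loci: set = set(chr_hits)
    for pos in alt_hits:
        lifted = lift_to_chromosome(pos, blocks)
        # An alt hit outside every aligned block is alt-only sequence,
        # so it counts as its own locus.
        loci.add(lifted if lifted is not None else ("alt", pos))
    return len(loci) > 1
```

A read from an anchor region that hits both the alt and the matching chromosome position collapses to one locus and is not penalized; a read that also hits a second, unrelated chromosome position remains ambiguous.
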
In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this GRCh38 alt locus, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
Prior to the release of GRCh38, we began looking at the effect of masking on the alt-unaware BWA aligner and compared results to those obtained with use of an NCBI-developed alternate aware aligner called srprism. In this analysis, simulated reads were aligned to the GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments. As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt aware aligner, does not need a mask to prevent ambiguous mappings. We will continue this analysis on GRCh38, with both simulated and actual reads, but I hope that even this preliminary data makes the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
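
The two masking schemes can be sketched in a few lines. The choice of which sequence carries the mask follows the slide text (Mask1: chromosome for fix patches, scaffold for novel patches and alts; Mask2: scaffold only). Everything else is an assumption for illustration: the function names, the scheme labels, the use of hard masking with `N`, and the idea that the redundant intervals are already known from the alt alignments.

```python
def sequence_to_mask(scaffold_kind: str, scheme: str) -> str:
    """Return which sequence carries the mask: "chromosome" or "scaffold"."""
    if scheme == "mask1" and scaffold_kind == "fix_patch":
        # Mask1 hides the chromosome copy so reads land on the fix patch.
        return "chromosome"
    if scheme in ("mask1", "mask2"):
        return "scaffold"
    raise ValueError(f"unknown scheme: {scheme}")


def mask_intervals(seq: str, intervals: list[tuple[int, int]]) -> str:
    """Replace each 0-based half-open [start, end) interval with N."""
    out = list(seq)
    for start, end in intervals:
        start, end = max(start, 0), min(end, len(seq))
        out[start:end] = "N" * max(end - start, 0)
    return "".join(out)
```

With the redundant sequence hidden on one side, an alt-unaware aligner sees only one placement for reads from that sequence, which is the effect the masked BWA columns in the figure are meant to show.
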
GRCh38 has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci. We can now look at the value of the alternate loci and their implications for analysis.
One reason the alt loci add value to the assembly is their gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci. An example is shown in this slide, where you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome. Thus, if you’re not using the alt loci in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
We’ve also been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll describe an earlier study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to patches or alt loci. We asked what happened to them when aligned to GRCh37 without the alt loci, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the value of including alternate loci in alignment target sets and again highlights the need for the development of alt aware aligners and downstream components of variant calling tool chains.
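
The bookkeeping behind a study like this can be sketched as follows: each simulated read has a known source sequence, and its outcome is one of unaligned, on-target, or off-target. The function names and the toy input are made up for illustration; they are not the study's pipeline or data.

```python
from collections import Counter
from typing import Optional


def classify(aligned_to: Optional[str], true_source: str) -> str:
    """Label one read by comparing where it aligned with where it came from."""
    if aligned_to is None:
        return "unaligned"
    return "on_target" if aligned_to == true_source else "off_target"


def summarize(reads: list[tuple[Optional[str], str]]) -> dict[str, float]:
    """Fraction of reads in each outcome class, given (aligned_to, true_source) pairs."""
    counts = Counter(classify(aligned, source) for aligned, source in reads)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in ("on_target", "off_target", "unaligned")}
```

When the true source is missing from the alignment target set, as in the experiment described above, no read can be on-target, so every aligned read is by definition off-target.
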
At this point, I’d like to shift gears for a little bit and conclude this talk by discussing the changing reference genome. New technologies and resources are one driving force for change. For example, a single haplotype hydatidiform mole resource is helping the GRC resolve highly complex regions. A PacBio long read was used in GRCh38 to provide sequence that had been impossible to resolve by other means. Optical map data is helping us resolve misassemblies, and will also be used to find regions that are missing sequence.
The assembly model itself may also change with time. The GRC is currently curating the CHM1 hydatidiform mole assembly in addition to the reference, and using sequence from it to improve the reference. In the future, the GRC expects to curate selected genomic regions from additional individuals representing diverse populations, to provide a more comprehensive representation of complex genomic regions. These may contribute new alt loci. At the moment, such additional genomes are considered distinct from the reference, as this slide illustrates. However, as new data becomes available both within and beyond the GRC, the GRC will continue to assess the assembly model and work towards its goal of providing a reference assembly that can be used to put any common human sequence in its chromosome context.
When the HGP was envisioned, we knew much less about human variation than we do today and the implications that using multiple donors might have for the reference assembly. That’s illustrated in the final sentence of this quote from the NHGRI/DOE guidelines for selecting DNA donors for the reference. There’s still no doubt we want and need a reference assembly that represents diverse samples. But we now know, thanks in part to projects like 1000G, that DNA assemblies, our models of the genome, are affected by sample diversity.
Before I go any further, I want to point out what today’s reference is not. It does not represent (1) the most common allele, (2) the longest allele, or (3) the ancestral allele. It represents the sequence available from the HGP.
The role of the reference genome has also changed with time. At the start of the HGP, major goals for the reference included (1) getting the sequence, (2) cataloging genes and other features, and (3) establishing a coordinate system. We were also interested in the difference between humans and other organisms. Today, we’re just as interested in the differences between individual humans. This has led to calls that we need a reference that, if not a pan-genome, can provide representation for complex and/or diverse genomic regions. In order to understand how the reference genome can represent the diversity we’re interested in, we’ll take a look at the reference assembly model and how it has changed since the HGP.
