Explaining the assembly model
Valerie Schneider
NCBI
21 September 2014
Dilthey et al., Paten et al.
Scientific Models
• Differences between the reference genome
assembly and other assemblies
• Features of the current reference assembly
model and their relationship to genomic analyses
and tools
• The changing reference genome assembly
Outline
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
GRC Assembly Model
Assembly (e.g. GRCh38)
Primary Assembly Unit
Non-nuclear assembly unit (e.g. MT)
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
GRC Assembly Model
The human reference genome assembly is not a haploid model
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
Alternate loci are not synonymous with haplotypes
Assembly (e.g. GRCh38.p1)
Primary Assembly Unit
Non-nuclear assembly unit (e.g. MT)
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Patches
Genomic Region (ABO)
Genomic Region (FOXO6)
Genomic Region (FCGBP)
GRC Assembly Model
Patches
PATCH TYPE   SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE
FIX          -- (integrated)
NOVEL        ALT LOCI
1q32 1q21 1p21
Dennis et al., 2012
GRC Assembly Model
Fix patches are different than novel patches
The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly
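Because the alt-to-chromosome alignments are part of the assembly, a position on an alternate locus scaffold can be placed in chromosome coordinates. The following is a minimal sketch of that idea, assuming the alignment has been reduced to ungapped, same-strand blocks. The `Block` type, the `alt_to_chr` helper, and every coordinate are hypothetical; they do not reflect the format in which the GRC distributes its alignments.

```python
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """One ungapped aligned block (0-based, half-open; hypothetical layout)."""
    alt_start: int   # start on the alternate locus scaffold
    chr_start: int   # start on the chromosome
    length: int      # number of aligned bases


def alt_to_chr(pos: int, blocks: Sequence[Block]) -> Optional[int]:
    """Map a scaffold position to the chromosome, or None if unaligned."""
    for b in blocks:
        if b.alt_start <= pos < b.alt_start + b.length:
            return b.chr_start + (pos - b.alt_start)
    return None  # position lies in sequence with no chromosome alignment


# Made-up example: two aligned blocks separated by 50 bases of novel sequence.
blocks = [Block(0, 1000, 100), Block(150, 1100, 200)]
print(alt_to_chr(10, blocks))    # inside the first block
print(alt_to_chr(120, blocks))   # novel sequence, no chromosome position
print(alt_to_chr(160, blocks))   # inside the second block
```

Positions that return `None` correspond to the "no alignment" stretches of an alt, that is, sequence the chromosome does not carry.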
Anatomy of an alt
Alignment Legend
no alignment, mismatch, deletion
Anatomy of an alt
AC012314.8
CU151838.1
ALT LOCI
AC012314.8
AC245052.3 CHR. 19
Alternate loci contain some sequence that is redundant to the primary assembly unit
Alt Loci: Informatics Challenges
Masks and alt aware aligners reduce the incidence of
ambiguous alignments observed when aligning reads to
the full assembly
Mask 1: mask the chromosome for fix patches and the scaffold for novel patches and alt loci. Mask 2: mask only on the scaffolds.
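The masking idea can be sketched as follows. This is only an illustration, assuming that masking means replacing redundant bases with N so that a read has a single place to align. The `mask_intervals` helper and the toy sequence are hypothetical, and the actual masks used for the analysis are not reproduced here.

```python
from typing import Iterable, Tuple


def mask_intervals(seq: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Return seq with each 0-based, half-open interval replaced by N."""
    bases = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start, end = max(0, start), min(len(bases), end)
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


# Toy scaffold: mask the flanks that duplicate the chromosome and keep
# only the middle, which stands in for novel sequence.
scaffold = "ACGTACGTACGT"
masked = mask_intervals(scaffold, [(0, 4), (8, 12)])
print(masked)
```

Under this assumption, the two masks differ only in which sequences receive the intervals: the chromosome for fix patches under Mask 1, and the scaffolds alone under Mask 2.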
Simulated Reads
GRCh38: Alt Loci
GRC: Assembly Model
GRCh38
• 178 regions with alt loci: 2% of chromosome
sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to
chromosomes
GRCh38: Alt Loci
chromosome
alt/patch
reads On-target alignment
Off-target alignments
(n=122,922)
GRCh38: Alt Loci
The Changing Reference
The Changing Reference
Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Matthew Hurles
• Richard Gibbs
GRC Credits
Source/Recruitment of DNA Donors for Library Construction
Another implication of the fact that 99.9% of the human DNA sequence
is shared by any two individuals is that the backgrounds of the
individuals who donate DNA for the first human sequence will make no
scientific difference in terms of the usefulness and applicability of the
information that results from sequencing the human genome. At the
same time, there will undoubtedly be some sensitivity about the
choice of DNA sources. There are no scientific reasons why DNA donors
should not be selected from diverse pools of potential donors.
http://www.genome.gov/10000921 (August 17, 1996)
Reference Composition
Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele
Roles for the reference
• Getting the sequence
• Cataloging genes (and other features)
• Establishing a coordinate system
• Humans vs. other organisms

More Related Content

What's hot

hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)Shaojun Xie
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesGenome Reference Consortium
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCGenome Reference Consortium
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseqDenis C. Bauer
 
What is ClinVar? A database for variant interpretation! [Today's paper]
What is ClinVar? A database for variant interpretation! [Today's paper]What is ClinVar? A database for variant interpretation! [Today's paper]
What is ClinVar? A database for variant interpretation! [Today's paper]HeonjongHan
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Mrinal Vashisth
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
 
Crispr cas9 scalpels and their application
Crispr cas9 scalpels and their applicationCrispr cas9 scalpels and their application
Crispr cas9 scalpels and their applicationPyarelal Syoran
 
Combining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assemblyCombining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assemblyLex Nederbragt
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Manikhandan Mudaliar
 
2015 functional genomics variant annotation and interpretation- tools and p...
2015 functional genomics   variant annotation and interpretation- tools and p...2015 functional genomics   variant annotation and interpretation- tools and p...
2015 functional genomics variant annotation and interpretation- tools and p...Gabe Rudy
 
Introducing VSClinical: Streamlining ACMG Variant Interpretation Guidelines
Introducing VSClinical: Streamlining ACMG Variant Interpretation GuidelinesIntroducing VSClinical: Streamlining ACMG Variant Interpretation Guidelines
Introducing VSClinical: Streamlining ACMG Variant Interpretation GuidelinesGolden Helix
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_predictionBas van Breukelen
 
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...QIAGEN
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGScursoNGS
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...VHIR Vall d’Hebron Institut de Recerca
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisDespoina Kalfakakou
 

What's hot (20)

hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
 
What is ClinVar? A database for variant interpretation! [Today's paper]
What is ClinVar? A database for variant interpretation! [Today's paper]What is ClinVar? A database for variant interpretation! [Today's paper]
What is ClinVar? A database for variant interpretation! [Today's paper]
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
Cancer genome
Cancer genomeCancer genome
Cancer genome
 
Crispr cas9 scalpels and their application
Crispr cas9 scalpels and their applicationCrispr cas9 scalpels and their application
Crispr cas9 scalpels and their application
 
Combining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assemblyCombining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assembly
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 
2015 functional genomics variant annotation and interpretation- tools and p...
2015 functional genomics   variant annotation and interpretation- tools and p...2015 functional genomics   variant annotation and interpretation- tools and p...
2015 functional genomics variant annotation and interpretation- tools and p...
 
Introducing VSClinical: Streamlining ACMG Variant Interpretation Guidelines
Introducing VSClinical: Streamlining ACMG Variant Interpretation GuidelinesIntroducing VSClinical: Streamlining ACMG Variant Interpretation Guidelines
Introducing VSClinical: Streamlining ACMG Variant Interpretation Guidelines
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_prediction
 
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGS
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
 

Similar to Explaining the assembly model

Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonGenome Reference Consortium
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Golden Helix Inc
 
2014 agbt giab data integration poster 140206
2014 agbt giab data integration poster 1402062014 agbt giab data integration poster 140206
2014 agbt giab data integration poster 140206GenomeInABottle
 
scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017
scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017
scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017David Cook
 
Collaborations Between Calit2, SIO, and the Venter Institute-a Beginning
Collaborations Between Calit2, SIO, and the Venter Institute-a BeginningCollaborations Between Calit2, SIO, and the Venter Institute-a Beginning
Collaborations Between Calit2, SIO, and the Venter Institute-a BeginningLarry Smarr
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Amia tb-review-08
Amia tb-review-08Amia tb-review-08
Amia tb-review-08Russ Altman
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationNils Gehlenborg
 
scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018David Cook
 
CRISPR PROJECT.pptx
CRISPR PROJECT.pptxCRISPR PROJECT.pptx
CRISPR PROJECT.pptxAcSni
 
GIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdfGIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdfGenomeInABottle
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)Michael Atkins
 
Building an Information Infrastructure to Support Genetic Sciences
Building an Information Infrastructure to Support Genetic SciencesBuilding an Information Infrastructure to Support Genetic Sciences
Building an Information Infrastructure to Support Genetic SciencesLarry Smarr
 

Similar to Explaining the assembly model (20)

Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
 
2014 agbt giab data integration poster 140206
2014 agbt giab data integration poster 1402062014 agbt giab data integration poster 140206
2014 agbt giab data integration poster 140206
 
scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017
scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017
scRNA-Seq Lecture - Stem Cell Network RNA-Seq Workshop 2017
 
Collaborations Between Calit2, SIO, and the Venter Institute-a Beginning
Collaborations Between Calit2, SIO, and the Venter Institute-a BeginningCollaborations Between Calit2, SIO, and the Venter Institute-a Beginning
Collaborations Between Calit2, SIO, and the Venter Institute-a Beginning
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 
Amia tb-review-08
Amia tb-review-08Amia tb-review-08
Amia tb-review-08
 
10.1.1.80.2149
10.1.1.80.214910.1.1.80.2149
10.1.1.80.2149
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient Stratification
 
scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018
 
CRISPR PROJECT.pptx
CRISPR PROJECT.pptxCRISPR PROJECT.pptx
CRISPR PROJECT.pptx
 
GIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdfGIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdf
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
PMED Undergraduate Workshop - Communities & Classification in Disease Data -...
PMED Undergraduate Workshop - Communities & Classification in Disease Data  -...PMED Undergraduate Workshop - Communities & Classification in Disease Data  -...
PMED Undergraduate Workshop - Communities & Classification in Disease Data -...
 
Building an Information Infrastructure to Support Genetic Sciences
Building an Information Infrastructure to Support Genetic SciencesBuilding an Information Infrastructure to Support Genetic Sciences
Building an Information Infrastructure to Support Genetic Sciences
 

More from Genome Reference Consortium

The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectGenome Reference Consortium
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amGenome Reference Consortium
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyGenome Reference Consortium
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesGenome Reference Consortium
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsGenome Reference Consortium
 

More from Genome Reference Consortium (20)

The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Mane v2 final
Mane v2 finalMane v2 final
Mane v2 final
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copy
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
 
Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
 

Recently uploaded

basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Tamer Koksalan, PhD
 

Recently uploaded (20)

basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 

Explaining the assembly model

  • 1. Explaining the assembly model Valerie Schneider NCBI 21 September 2014
  • 2. Dilthey et al. Paten et al. Scientific Models
  • 3. • Differences between the reference genome assembly and other assemblies • Features of the current reference assembly model and their relationship to genomic analyses and tools • The changing reference genome assembly Outline
  • 4.
  • 5. Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes GRC Assembly Model
  • 6. Assembly (e.g. GRCh38) Primary Assembly Unit Non-nuclear assembly unit (e.g. MT) PAR Genomic Region (MHC) Genomic Region (UGT2B17) Genomic Region (MAPT) Church et al., PLoS Biol. 2011 Jul;9(7):e1001091 GRC Assembly Model The human reference genome assembly is not a haploid model ALT 2 ALT 3 ALT 4 ALT 5 ALT 6 ALT 7 ALT 1 Alternate loci are not synonymous with haplotypes
  • 7. Assembly (e.g. GRCh38.p1) Primary Assembly Unit Non-nuclear assembly unit (e.g. MT) ALT 1 ALT 2 ALT 3 ALT 4 ALT 5 ALT 6 ALT 7 PAR Genomic Region (MHC) Genomic Region (UGT2B17) Genomic Region (MAPT) Church et al., PLoS Biol. 2011 Jul;9(7):e1001091 Patches Genomic Region (ABO) Genomic Region (FOXO6) Genomic Region (FCGBP) GRC Assembly Model Patches FIX NOVEL SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE ALT LOCI -- (integrated)
  • 8. 1q32 1q21 1p21 Dennis et al., 2012 GRC Assembly Model Fix patches are different than novel patches
  • 9. The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly
  • 10. Anatomy of an alt Alignment Legend: no alignment, mismatch, deletion
  • 11. Anatomy of an alt AC012314.8 CU151838.1 ALT LOCI AC012314.8 AC245052.3 CHR. 19 Alternate loci contain some sequence that is redundant to the primary assembly unit
  • 12. Alt Loci: Informatics Challenges
  • 13. Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly Mask1: mask chr for fix patches, scaffold for novel/alts. Mask2: mask only on scaffolds Simulated Reads GRCh38: Alt Loci
  • 14. GRC: Assembly Model GRCh38 • 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb) • 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
  • 16. chromosome alt/patch reads On-target alignment Off-target alignments (n=122,922) GRCh38: Alt Loci
  • 19. Collaborators • NCBI RefSeq and gpipe annotation team • Havana annotators • Karen Miga • David Schwartz • Steve Goldstein • Mario Caceres • Giulio Genovese • Jeff Kidd • Peter Lansdorp • Mark Hills • David Page • Jim Knight • Stephan Schuster • 1000 Genomes GRC SAB • Rick Myers • Granger Sutton • Evan Eichler • Jim Kent • Roderic Guigo • Carol Bult • Derek Stemple • Matthew Hurles • Richard Gibbs GRC Credits
  • 20. Source/Recruitment of DNA Donors for Library Construction Another implication of the fact that 99.9% of the human DNA sequence is shared by any two individuals is that the backgrounds of the individuals who donate DNA for the first human sequence will make no scientific difference in terms of the usefulness and applicability of the information that results from sequencing the human genome. At the same time, there will undoubtedly be some sensitivity about the choice of DNA sources. There are no scientific reasons why DNA donors should not be selected from diverse pools of potential donors. http://www.genome.gov/10000921 (August 17, 1996) Reference Composition
  • 21. Today’s reference assembly does not represent: 1. The most common allele 2. The longest allele 3. The ancestral allele
  • 22. Roles for the reference • Getting the sequence • Cataloging genes (and other features) • Establishing a coordinate system • Humans vs. other organisms

Editor's Notes

  1. I’d like to begin this talk by reminding everyone of the difference between a genome and an assembly. A human genome is a physical object. An assembly is our representation of that object. It is a model. And as shown here, genome models can take many forms. And as these atomic models illustrate, scientific models evolve over time to reflect our growing knowledge base. And so it is with the human assembly model, the reference genome.
  2. I’m going to cover three main areas in today’s talk: what makes the reference assembly different from other genome assemblies; features of the current reference assembly model and their relationship to genomic analyses and tools; and the changing reference genome assembly.
  3. The human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies only model a diploid genome. But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the GRCh38 primary assembly. While 70% of the genome comes from one donor, sequence from >70 individuals is represented.
  4. Even when assembling the genome of a single individual, there may be divergent haplotypes that confound genome assembly. In the original reference assembly model, which was essentially a stick model of linear chromosomes, there really wasn’t a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The insertion of different haplotypes in such regions, however, often led to non-existent allele combinations and artificial gaps, as illustrated here. In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: alternate loci. The current model allows the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, it retains the linear chromosome models with which most users are comfortable.
  5. GRCh37 was the first genome assembly to use this new model, which is illustrated in this cartoon. The first thing to know about the model is that the “assembly” is composed of multiple assembly units. The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original assembly model. Non-nuclear genomes are assigned to their own assembly unit. Regions are defined for those areas of the genome for which alternate sequence representation is desired. Those alternate sequence representations go into alternate loci assembly units. The first alternate sequence representation for each region goes into one assembly unit. Each additional sequence representation for a region goes into its own assembly unit. Alt loci are stand-alone scaffold sequences that are given chromosome context via their alignment to the primary assembly. I’ll return to this point shortly. The assembly model also includes regions for the pseudo-autosomal regions of the primary assembly unit.
  6. Another aspect of the assembly model I’d like to discuss is the patches. Patches enable the reference assembly to be updated without changing chromosome coordinates. This feature of the model allows the GRC to make assembly updates available in a timely fashion to those users whose work is adversely affected by errors in the current genome, while not disrupting the coordinates upon which other users rely. Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences. It’s important to distinguish the two types of patches: (1) FIX patches correct problems in the assembly and are deprecated in the next major assembly release, when the fix is integrated into the chromosomes; (2) NOVEL patches add new alternate sequence representations to the assembly and become alternate loci in the next major assembly release.
  7. It’s important to recognize that the way fix and novel patches should be used for analysis is different. The novel patches can be treated just like alt loci, and should be considered allelic to the chromosomes. In contrast, because the FIX patches represent assembly corrections, read alignments to the fix patches should take precedence over alignments to the chromosomes. Fix patch updates can sometimes be quite dramatic. This slide shows a FIX patch that corrects a mis-assembly of the 1q21 region of GRCh37 involving the SRGAP gene family. In GRCh37 the 1q21 region was comprised of a mix of sequences from the various SRGAP-associated duplications. The bottom panel shows the alignment of chr. 1 to the fix patch, and this dot matrix view really highlights how much things changed. From this, it’s easy to see how a fix patch could improve your analyses, and why you would want to exclude alignments from the corresponding chromosome region.
  8. As I mentioned before, the alternate loci and patches are stand-alone scaffold sequences given chromosome context by virtue of their alignment to the chromosomes, as illustrated in this image of GRCh38 chr. 19, with its aligned alts and their corresponding regions. In the next slide, we’ll take a closer look at an alt in order to better appreciate its relationship to the chromosome. Thus, another important thing to understand about the assembly model is that the alignments of the alt loci to the chromosomes are an integral part of the assembly. The alignment, in conjunction with the sequence, is what defines the alt.
  9. This image shows one of the alt loci from the LRC/KIR region on chr. 19. The blue bars represent the component sequences of the alt. The alignment of the chromosome to the alt is shown beneath. The legend below explains the alignment graphic. We can see that there is a region at either end of the alt that aligns perfectly to the chromosome. To understand why this is the case, we will zoom in on one of these regions and compare to the corresponding region of chr. 19.
  10. At this level, we see that the first component in the alternate locus is also a component in the chromosome. This is what is known as an anchor component. Anchor components are present in all human alt loci and are included to ensure a robust alignment to the chromosome. However, the extent to which the anchor component contributes sequence to the alternate locus and the chromosome may differ, because of differences in the position of the switch point with the adjacent components, which are not the same. The identical region of the alignment corresponds to the sub-region of the anchor that is common to both the alt and the chromosome. As a result of the anchor sequences, all human alt loci contain some sequence that is redundant to the chromosomes. This is the reason that most aligners are not compatible with the alternate loci. Reads corresponding to anchor sequence will map identically to the alt and chromosome, resulting in depressed mapping scores and exclusion from downstream analyses. Reads mapping to other regions of the alternate loci that are similar to the chromosome, even if not derived from anchor sequences, will have the same issue. Alternate aware aligners must recognize the relationship of the alt to the chromosome and not treat reads that map in those regions as ambiguously placed.
  11. In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this GRCh38 alt locus, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
  12. Prior to the release of GRCh38, we began looking at the effect of masking on the alt-unaware BWA aligner and compared results to those obtained with use of an NCBI-developed alternate aware aligner called srprism. In this analysis, simulated reads were aligned to the GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments. As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt aware aligner, does not need a mask to prevent ambiguous mappings. We will continue this analysis on GRCh38, with both simulated and actual reads, but I hope that even this preliminary data makes the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
  13. GRCh38 has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci. We can now look at the value of the alternate loci and their implications for analysis.
  14. One reason the alt loci add value to the assembly is their gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci. An example is shown in this slide, where you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome. Thus, if you’re not using the alt loci in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
  15. We’ve also been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll describe an earlier study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to patches or alt loci. We asked what happened to them when aligned to GRCh37 without the alt loci, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the value of including alternate loci in alignment target sets and again highlights the need for the development of alt aware aligners and downstream components of variant calling tool chains.
  16. At this point, I’d like to shift gears for a little bit and conclude this talk by discussing the changing reference genome. New technologies and resources are one driving force for change. For example, a single haplotype hydatidiform mole resource is helping the GRC resolve highly complex regions. A PacBio long read was used in GRCh38 to provide sequence that had been impossible to resolve by other means. Optical map data is helping us resolve misassemblies, and will also be used to find regions that are missing sequence.
  17. The assembly model itself may also change with time. The GRC is currently curating the CHM1 hydatidiform mole assembly in addition to the reference, and using sequence from it to improve the reference. In the future, the GRC expects to curate selected genomic regions from additional individuals representing diverse populations, to provide a more comprehensive representation of complex genomic regions. These may contribute new alt loci. At the moment, such additional genomes are considered distinct from the reference, as this slide illustrates. However, as new data becomes available both within and beyond the GRC, the GRC will continue to assess the assembly model and work towards its goal of providing a reference assembly that can be used to put any common human sequence in its chromosome context.
  18. When the HGP was envisioned, we knew much less about human variation than we do today and the implications that using multiple donors might have for the reference assembly. That’s illustrated in the final sentence of this quote from the NHGRI/DOE guidelines for selecting DNA donors for the reference. There’s still no doubt we want and need a reference assembly that represents diverse samples. But we now know, thanks in part to projects like 1000G, that DNA assemblies, our models of the genome, are affected by sample diversity.
  19. Before I go any further, I want to point out what today’s reference is not. It does not represent the most common allele, the longest allele, or the ancestral allele. It represents the sequence available from the HGP.
  20. The role of the reference genome has also changed with time. At the start of the HGP, major goals for the reference included: (1) getting the sequence, (2) cataloging genes and other features and (3) establishing a coordinate system. (4) We were interested in the difference between humans and other organisms. Today, we’re just as interested in the differences between individual humans. This has led to calls for a reference that, if not a pan-genome, can provide representation for complex and/or diverse genomic regions. In order to understand how the reference genome can represent the diversity we’re interested in, we’ll take a look at the reference assembly model and how it has changed since the HGP.
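
The assembly model described in these notes (a primary assembly unit, a non-nuclear unit, alternate loci units, and a patches unit) can be pictured as a small data structure. The Python below is only an illustrative sketch, not GRC or NCBI code: every class, field, and function name is hypothetical, and the rule used to distribute alts across units (the first alt for each region shares one unit, each further alt for a region moves to the next unit) is one reading of the description in the notes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PatchType(Enum):
    FIX = "fix"      # corrects a problem in the assembly
    NOVEL = "novel"  # adds a new alternate sequence representation


@dataclass
class Scaffold:
    """Stand-alone scaffold that gets chromosome context from its alignment."""
    name: str    # hypothetical identifier
    region: str  # genomic region it belongs to, e.g. "MHC"


@dataclass
class Patch(Scaffold):
    patch_type: PatchType = PatchType.FIX


@dataclass
class Assembly:
    name: str
    primary: List[str]      # chromosomes plus unlocalized/unplaced scaffolds
    non_nuclear: List[str]  # e.g. ["MT"]
    alt_units: List[List[Scaffold]] = field(default_factory=list)
    patches: List[Patch] = field(default_factory=list)

    def add_alt(self, alt: Scaffold) -> int:
        """Put the alt in the first unit with no scaffold for its region.

        Returns the zero-based index of the unit used (assumed rule).
        """
        for index, unit in enumerate(self.alt_units):
            if all(s.region != alt.region for s in unit):
                unit.append(alt)
                return index
        self.alt_units.append([alt])
        return len(self.alt_units) - 1


def takes_precedence_over_chromosome(patch: Patch) -> bool:
    """FIX patches supersede the chromosome; NOVEL patches are allelic to it."""
    return patch.patch_type is PatchType.FIX


def status_at_next_major_release(patch: Patch) -> str:
    """Fate of a patch scaffold at the next major assembly release."""
    if patch.patch_type is PatchType.FIX:
        return "integrated into the chromosomes"
    return "becomes an alternate locus"


if __name__ == "__main__":
    asm = Assembly("GRCh38", primary=["chr1", "chr19"], non_nuclear=["MT"])
    for alt in (Scaffold("alt_a", "MHC"), Scaffold("alt_b", "MHC"),
                Scaffold("alt_c", "MAPT")):
        print(alt.name, "-> alt unit", asm.add_alt(alt) + 1)
    patch = Patch("patch_x", "ABO", PatchType.FIX)
    print(patch.name, takes_precedence_over_chromosome(patch),
          status_at_next_major_release(patch))
```

Keeping the patch type on the scaffold means a downstream tool can decide per alignment whether a hit on a patch should replace the chromosome hit (FIX) or be treated as allelic to it (NOVEL), which is the distinction the notes draw for analysis.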