Explaining the assembly model
Valerie Schneider
NCBI
21 September 2014
Dilthey et al.; Paten et al.
Scientific Models
Outline
• Differences between the reference genome assembly and other assemblies
• Features of the current reference assembly model and their relationship to genomic analyses and tools
• The changing reference genome assembly
Explaining the assembly model
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
GRC Assembly Model
Assembly (e.g. GRCh38)
Primary Assembly Unit
Non-nuclear assembly unit (e.g. MT)
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
GRC Assembly Model
The human reference genome assembly is not a haploid model
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
Alternate loci are not synonymous with haplotypes
Assembly (e.g. GRCh38.p1)
Primary Assembly Unit
Non-nuclear assembly unit (e.g. MT)
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Patches
Genomic Region (ABO)
Genomic Region (FOXO6)
Genomic Region (FCGBP)
GRC Assembly Model
Patches
Scaffold status at next major assembly release:
FIX: -- (integrated)
NOVEL: ALT LOCI
1q32 1q21 1p21
Dennis et al., 2012
GRC Assembly Model
Fix patches are different than novel patches
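The distinction on the Patches slide can be sketched as a small lookup. This is only an illustrative sketch: the names `PatchType` and `fate_at_next_major_release` are hypothetical and are not part of any GRC or NCBI tool.

```python
from enum import Enum


class PatchType(Enum):
    """The two kinds of patch scaffold named on the Patches slide."""
    FIX = "fix"
    NOVEL = "novel"


# Scaffold status at the next major assembly release, as stated on the slide.
# The wording of the values is ours, chosen for readability.
FATE_AT_NEXT_MAJOR_RELEASE = {
    PatchType.FIX: "integrated into the chromosome sequence",
    PatchType.NOVEL: "kept as an alternate locus scaffold",
}


def fate_at_next_major_release(patch_type):
    """Return what happens to a patch scaffold at the next major release."""
    return FATE_AT_NEXT_MAJOR_RELEASE[patch_type]


if __name__ == "__main__":
    for patch_type in PatchType:
        print(patch_type.name, "->", fate_at_next_major_release(patch_type))
```

A fix patch therefore stops existing as a separate scaffold once the next major release appears, while a novel patch persists as an alt locus.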
The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly
Anatomy of an alt
Alignment Legend
no alignment, mismatch, deletion
Anatomy of an alt
AC012314.8
CU151838.1
ALT LOCI
AC012314.8
AC245052.3 CHR. 19
Alternate loci contain some sequence that is redundant to the primary assembly unit
Alt Loci: Informatics Challenges
Masks and alt-aware aligners reduce the incidence of
ambiguous alignments observed when aligning reads to
the full assembly
Mask1: mask the chromosome for fix patches and the scaffold for novel patches and alts. Mask2: mask only on scaffolds
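One way to read the two masking schemes is sketched below. The function names `mask_target` and `hard_mask`, the scheme labels, and the use of N as the masking character are assumptions made for illustration; the slide does not say how the masks were actually produced.

```python
SCAFFOLD_KINDS = ("fix", "novel", "alt")
SCHEMES = ("mask1", "mask2")


def mask_target(scaffold_kind, scheme):
    """Return which copy of a redundant region is masked: 'chromosome' or 'scaffold'.

    mask1: chromosome for fix patches, scaffold for novel patches and alts.
    mask2: always the scaffold.
    """
    if scaffold_kind not in SCAFFOLD_KINDS:
        raise ValueError(f"unknown scaffold kind: {scaffold_kind!r}")
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    if scheme == "mask1" and scaffold_kind == "fix":
        return "chromosome"
    return "scaffold"


def hard_mask(sequence, intervals):
    """Replace each 0-based, half-open [start, end) interval with N."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval out of range: {(start, end)}")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    print(mask_target("fix", "mask1"))        # chromosome copy is masked
    print(mask_target("fix", "mask2"))        # scaffold copy is masked
    print(hard_mask("ACGTACGT", [(2, 5)]))    # toy sequence, toy interval
```

Under either scheme only one copy of the redundant sequence stays visible to the aligner, so a read from that region has a single best placement.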
Simulated Reads
GRCh38: Alt Loci
GRC: Assembly Model
GRCh38
• 178 regions with alt loci: 2% of chromosome
sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to
chromosomes
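As a rough consistency check on these figures, the short sketch below derives the chromosome total implied by "2% = 61.9 Mb" and the mean novel sequence per alt locus. The function names are hypothetical, and because the 2% is rounded on the slide the results are approximate.

```python
# Figures quoted on the slide for GRCh38.
ALT_REGION_MB = 61.9         # chromosome sequence in regions with alt loci
ALT_REGION_FRACTION = 0.02   # "2%" as stated on the slide (rounded)
ALT_LOCI_COUNT = 261
NOVEL_SEQUENCE_MB = 3.6      # novel sequence relative to the chromosomes


def implied_chromosome_total_mb(region_mb, fraction):
    """Total chromosome sequence implied by a region size and its fraction."""
    return region_mb / fraction


def mean_novel_kb_per_alt(novel_mb, alt_count):
    """Average novel sequence contributed per alt locus, in kb."""
    return novel_mb * 1000 / alt_count


if __name__ == "__main__":
    total = implied_chromosome_total_mb(ALT_REGION_MB, ALT_REGION_FRACTION)
    mean_kb = mean_novel_kb_per_alt(NOVEL_SEQUENCE_MB, ALT_LOCI_COUNT)
    print(f"implied chromosome total: about {total:.0f} Mb")
    print(f"mean novel sequence per alt locus: about {mean_kb:.1f} kb")
```

The implied total of roughly 3.1 Gb matches the expected size of the human chromosomes, and the average alt locus adds on the order of 14 kb of sequence not found on the chromosomes.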
GRCh38: Alt Loci
Figure: reads aligned to chromosome and alt/patch sequences, showing on-target alignment and off-target alignments (n=122,922)
GRCh38: Alt Loci
The Changing Reference
Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Matthew Hurles
• Richard Gibbs
GRC Credits
Source/Recruitment of DNA Donors for Library Construction
Another implication of the fact that 99.9% of the human DNA sequence
is shared by any two individuals is that the backgrounds of the
individuals who donate DNA for the first human sequence will make no
scientific difference in terms of the usefulness and applicability of the
information that results from sequencing the human genome. At the
same time, there will undoubtedly be some sensitivity about the
choice of DNA sources. There are no scientific reasons why DNA donors
should not be selected from diverse pools of potential donors.
http://www.genome.gov/10000921 (August 17, 1996)
Reference Composition
Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele
Roles for the reference
• Getting the sequence
• Cataloging genes (and other features)
• Establishing a coordinate system
• Humans vs. other organisms

Direct instructions, towards hundred fold yield,layering,budding,grafting,pla...
 
Accessing Data to Support Pesticide Residue and Emerging Contaminant Analysis...
Accessing Data to Support Pesticide Residue and Emerging Contaminant Analysis...Accessing Data to Support Pesticide Residue and Emerging Contaminant Analysis...
Accessing Data to Support Pesticide Residue and Emerging Contaminant Analysis...
 
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdfHow Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
 
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
 

Explaining the assembly model

  • 1. Explaining the assembly model Valerie Schneider NCBI 21 September 2014
  • 2. Dilthey et al. Paten et al. Scientific Models
  • 3. • Differences between the reference genome assembly and other assemblies • Features of the current reference assembly model and their relationship to genomic analyses and tools • The changing reference genome assembly Outline
  • 5. Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes GRC Assembly Model
  • 6. Assembly (e.g. GRCh38) Primary Assembly Unit Non-nuclear assembly unit (e.g. MT) PAR Genomic Region (MHC) Genomic Region (UGT2B17) Genomic Region (MAPT) Church et al., PLoS Biol. 2011 Jul;9(7):e1001091 GRC Assembly Model The human reference genome assembly is not a haploid model ALT 2 ALT 3 ALT 4 ALT 5 ALT 6 ALT 7 ALT 1 Alternate loci are not synonymous with haplotypes
  • 7. Assembly (e.g. GRCh38.p1) Primary Assembly Unit Non-nuclear assembly unit (e.g. MT) ALT 1 ALT 2 ALT 3 ALT 4 ALT 5 ALT 6 ALT 7 PAR Genomic Region (MHC) Genomic Region (UGT2B17) Genomic Region (MAPT) Church et al., PLoS Biol. 2011 Jul;9(7):e1001091 Patches Genomic Region (ABO) Genomic Region (FOXO6) Genomic Region (FCGBP) GRC Assembly Model Patches FIX NOVEL SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE ALT LOCI -- (integrated)
  • 8. 1q32 1q21 1p21 Dennis et al., 2012 GRC Assembly Model Fix patches are different than novel patches
  • 9. The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly
  • 10. Anatomy of an alt Alignment Legend: no alignment, mismatch, deletion
  • 11. Anatomy of an alt AC012314.8 CU151838.1 ALT LOCI AC012314.8 AC245052.3 CHR. 19 Alternate loci contain some sequence that is redundant to the primary assembly unit
  • 12. Alt Loci: Informatics Challenges
  • 13. Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly Mask1: mask chr for fix patches, scaffold for novel/alts. Mask2: mask only on scaffolds Simulated Reads GRCh38: Alt Loci
  • 14. GRC: Assembly Model GRCh38 • 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb) • 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
  • 16. chromosome alt/patch reads On-target alignment Off-target alignments (n=122,922) GRCh38: Alt Loci
  • 19. Collaborators • NCBI RefSeq and gpipe annotation team • Havana annotators • Karen Miga • David Schwartz • Steve Goldstein • Mario Caceres • Giulio Genovese • Jeff Kidd • Peter Lansdorp • Mark Hills • David Page • Jim Knight • Stephan Schuster • 1000 Genomes GRC SAB • Rick Myers • Granger Sutton • Evan Eichler • Jim Kent • Roderic Guigo • Carol Bult • Derek Stemple • Matthew Hurles • Richard Gibbs GRC Credits
  • 20. Source/Recruitment of DNA Donors for Library Construction Another implication of the fact that 99.9% of the human DNA sequence is shared by any two individuals is that the backgrounds of the individuals who donate DNA for the first human sequence will make no scientific difference in terms of the usefulness and applicability of the information that results from sequencing the human genome. At the same time, there will undoubtedly be some sensitivity about the choice of DNA sources. There are no scientific reasons why DNA donors should not be selected from diverse pools of potential donors. http://www.genome.gov/10000921 (August 17, 1996) Reference Composition
  • 21. Today’s reference assembly does not represent: 1. The most common allele 2. The longest allele 3. The ancestral allele
  • 22. Roles for the reference • Getting the sequence • Cataloging genes (and other features) • Establishing a coordinate system • Humans vs. other organisms
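
The two masking schemes named on slide 13 can be illustrated with a rough Python sketch. It is not the GRC's mask or tooling: the function names, the 0-based half-open interval format, and the toy sequence are all assumptions made for illustration. It shows only the idea that, wherever a scaffold duplicates chromosome sequence, one of the two copies is replaced with N so that an alt-unaware aligner sees a single target.

```python
def hard_mask(seq, intervals):
    """Return seq with each 0-based, half-open [start, end) interval set to N."""
    masked = list(seq)
    for start, end in intervals:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"interval {(start, end)} is outside the sequence")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


def choose_mask_target(scaffold_type, scheme):
    """Say which copy of a duplicated stretch is masked (hypothetical helper).

    "mask1": chromosome for fix patches, scaffold for novel patches and alts.
    "mask2": always the scaffold.
    """
    if scaffold_type not in {"fix", "novel", "alt"}:
        raise ValueError(f"unknown scaffold type: {scaffold_type}")
    if scheme == "mask1":
        return "chromosome" if scaffold_type == "fix" else "scaffold"
    if scheme == "mask2":
        return "scaffold"
    raise ValueError(f"unknown scheme: {scheme}")


# Toy alt scaffold whose first and last three bases duplicate the chromosome.
toy_alt = "ACGTACGTAC"
masked_alt = hard_mask(toy_alt, [(0, 3), (7, 10)])
```

Under Mask1 a fix patch keeps its own sequence and the chromosome copy is hidden, which matches the idea that a fix patch supersedes the chromosome region it corrects.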

Editor's Notes

  1. I’d like to begin this talk by reminding everyone of the difference between a genome and an assembly. A human genome is a physical object. An assembly is our representation of that object. It is a model. And as shown here, genome models can take many forms. And as these atomic models illustrate, scientific models evolve over time to reflect our growing knowledge base. And so it is with the human assembly model, the reference genome.
  2. I’m going to cover three main areas in today’s talk: what makes the reference assembly different from other genome assemblies; features of the current reference assembly model and their relationship to genomic analyses and tools; and the changing reference genome assembly.
  3. The human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies only model a diploid genome. But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the GRCh38 primary assembly. While 70% of the genome comes from one donor, sequence from >70 individuals is represented.
  4. Even when assembling the genome of a single individual, there may be divergent haplotypes that confound genome assembly. In the original reference assembly model, which was essentially a stick model of linear chromosomes, there really wasn’t a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The insertion of different haplotypes in such regions, however, often led to non-existent allele combinations and artificial gaps, as illustrated here. In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: alternate loci. The current model allows the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, it retains the linear chromosome models with which most users are comfortable.
  5. GRCh37 was the first genome assembly to use this new model, which is illustrated in this cartoon. The first thing to know about the model is that the “assembly” is composed of multiple assembly units. The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original assembly model. Non-nuclear genomes are assigned to their own assembly unit. Regions are defined for those areas of the genome for which alternate sequence representation is desired. Those alternate sequence representations go into alternate loci assembly units. The first alternate sequence representation for each region goes into one assembly unit. Each additional sequence representation for a region goes into its own assembly unit. Alt loci are stand-alone scaffold sequences that are given chromosome context via their alignment to the primary assembly. I’ll return to this point shortly. The assembly model also includes regions for the pseudo-autosomal regions of the primary assembly unit.
  6. Another aspect of the assembly model I’d like to discuss is the patches. Patches enable the reference assembly to be updated without changing chromosome coordinates. This feature of the model allows the GRC to make assembly updates available in a timely fashion to those users whose work is adversely affected by errors in the current genome, while not disrupting the coordinates upon which other users rely. Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences. It’s important to distinguish the two types of patches: (1) FIX patches correct problems in the assembly; they are deprecated in the next assembly release. (2) NOVEL patches add new alternate sequence representations to the assembly; they become alternate loci in the next assembly release.
  7. It’s important to recognize that the way fix and novel patches should be used for analysis is different. The novel patches can be treated just like alt loci, and should be considered allelic to the chromosomes. In contrast, because the FIX patches represent assembly corrections, read alignments to the fix patches should take precedence over alignments to the chromosomes. Fix patch updates can sometimes be quite dramatic. This slide shows a FIX patch that corrects a mis-assembly of the 1q21 region of GRCh37 involving the SRGAP gene family. In GRCh37 the 1q21 region was comprised of a mix of sequences from the various SRGAP-associated duplications. The bottom panel shows the alignment of chr. 1 to the fix patch, and this dot matrix view really highlights how much things changed. From this, it’s easy to see how a fix patch could improve your analyses, and why you would want to exclude alignments from the corresponding chromosome region.
  8. As I mentioned before, the alternate loci and patches are stand-alone scaffold sequences given chromosome context by virtue of their alignment to the chromosomes, as illustrated in this image of GRCh38 chr. 19, with its aligned alts and their corresponding regions. In the next slide, we’ll take a closer look at an alt in order to better appreciate its relationship to the chromosome. Thus, another important thing to understand about the assembly model is that the alignments of the alt loci to the chromosomes are an integral part of the assembly. The alignment, in conjunction with the sequence, is what defines the alt.
  9. This image shows one of the alt loci from the LRC/KIR region on chr. 19. The blue bars represent the component sequences of the alt. The alignment of the chromosome to the alt is shown beneath. The legend below explains the alignment graphic. We can see that there is a region at either end of the alt that aligns perfectly to the chromosome. To understand why this is the case, we will zoom in on one of these regions and compare to the corresponding region of chr. 19.
  10. At this level, we see that the first component in the alternate locus is also a component in the chromosome. This is what is known as an anchor component. Anchor components are present in all human alt loci and are included to ensure a robust alignment to the chromosome. However, the extent to which the anchor component contributes sequence to the alternate locus and the chromosome may differ, because of differences in the position of the switch point with the adjacent components, which are not the same. The identical region of the alignment corresponds to the sub-region of the anchor that is common to both the alt and the chromosome. As a result of the anchor sequences, all human alt loci contain some sequence that is redundant to the chromosomes. This is the reason that most aligners are not compatible with the alternate loci. Reads corresponding to anchor sequence will map identically to the alt and chromosome, resulting in depressed mapping scores and exclusion from downstream analyses. Reads mapping to other regions of the alternate loci that are similar to the chromosome, even if not derived from anchor sequences, will have the same issue. Alternate aware aligners must recognize the relationship of the alt to the chromosome and not treat reads that map in those regions as ambiguously placed.
  11. In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this GRCh38 alt locus, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
  12. Prior to the release of GRCh38, we began looking at the effect of masking on the alt-unaware BWA aligner and compared results to those obtained with use of an NCBI-developed alternate aware aligner called srprism. In this analysis, simulated reads were aligned to the GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments. As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt aware aligner, does not need a mask to prevent ambiguous mappings. We will continue this analysis on GRCh38, with both simulated and actual reads, but I hope that even this preliminary data makes the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
  13. GRCh38 has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci. We can now look at the value of the alternate loci and their implications for analysis.
  14. One reason the alt loci add value to the assembly is their gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci. An example is shown in this slide, where you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome. Thus, if you’re not using the alt loci in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
  15. We’ve also been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll describe an earlier study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to patches or alt loci. We asked what happened to them when aligned to GRCh37 without the alt loci, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the value of including alternate loci in alignment target sets and again highlights the need for the development of alt aware aligners and downstream components of variant calling tool chains.
  16. At this point, I’d like to shift gears for a little bit and conclude this talk by discussing the changing reference genome. New technologies and resources are one driving force for change. For example, a single haplotype hydatidiform mole resource is helping the GRC resolve highly complex regions. A PacBio long read was used in GRCh38 to provide sequence that had been impossible to resolve by other means. Optical map data is helping us resolve misassemblies, and will also be used to find regions that are missing sequence.
  17. The assembly model itself may also change with time. The GRC is currently curating the CHM1 hydatidiform mole assembly in addition to the reference, and using sequence from it to improve the reference. In the future, the GRC expects to curate selected genomic regions from additional individuals representing diverse populations, to provide a more comprehensive representation of complex genomic regions. These may contribute new alt loci. At the moment, such additional genomes are considered distinct from the reference, as this slide illustrates. However, as new data becomes available both within and beyond the GRC, the GRC will continue to assess the assembly model and work towards its goal of providing a reference assembly that can be used to put any common human sequence in its chromosome context.
  18. When the HGP was envisioned, we knew much less about human variation than we do today and the implications that using multiple donors might have for the reference assembly. That’s illustrated in the final sentence of this quote from the NHGRI/DOE guidelines for selecting DNA donors for the reference. There’s still no doubt we want and need a reference assembly that represents diverse samples. But we now know, thanks in part to projects like 1000G, that DNA assemblies, our models of the genome, are affected by sample diversity.
  19. Before I go any further, I want to point out what today’s reference is not. It does not represent the most common allele, the longest allele, or the ancestral allele. It represents the sequence available from the HGP.
  20. The role of the reference genome has also changed with time. At the start of the HGP, major goals for the reference included: (1) getting the sequence, (2) cataloging genes and other features and (3) establishing a coordinate system. (4) We were interested in the difference between humans and other organisms. Today, we’re just as interested in the differences between individual humans. This has led to calls for a reference that, if not a pan-genome, can provide representation for complex and/or diverse genomic regions. In order to understand how the reference genome can represent the diversity we’re interested in, we’ll take a look at the reference assembly model and how it has changed since the HGP.
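
Notes 7 and 10 describe how hits to patches and alt loci should be interpreted: hits to a fix patch take precedence over hits to the chromosome region it corrects, while hits to a novel patch or alt locus are allelic to the chromosome and should not be counted as ambiguous placements. The Python sketch below is one way to express that rule under simplifying assumptions. The `Scaffold` and `Hit` classes, the `resolve` function, and the single-interval description of each scaffold's chromosome region are hypothetical; real alt alignments contain gaps and are distributed with the assembly, and this is not a description of how srprism works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaffold:
    name: str
    kind: str         # "fix", "novel", or "alt"
    chrom: str        # chromosome the scaffold is aligned to
    chrom_start: int  # 0-based, half-open region covered by that alignment
    chrom_end: int


@dataclass(frozen=True)
class Hit:
    seq_name: str     # chromosome or scaffold name
    start: int
    end: int


def resolve(hits, scaffolds):
    """Return (kept_hits, is_ambiguous) for one read's equally good hits."""
    by_name = {s.name: s for s in scaffolds}
    hit_scaffolds = [by_name[h.seq_name] for h in hits if h.seq_name in by_name]
    kept, loci = [], set()
    for h in hits:
        if h.seq_name in by_name:
            s = by_name[h.seq_name]
            kept.append(h)
            loci.add((s.chrom, s.chrom_start, s.chrom_end))
            continue
        # Chromosome hit: is it inside the region of a scaffold that was also hit?
        cover = next(
            (s for s in hit_scaffolds
             if s.chrom == h.seq_name
             and s.chrom_start <= h.start and h.end <= s.chrom_end),
            None,
        )
        if cover is None:
            kept.append(h)
            loci.add((h.seq_name, h.start, h.end))
        elif cover.kind == "fix":
            continue  # the fix patch supersedes the chromosome sequence
        else:
            kept.append(h)  # allelic to the alt or novel patch: same locus
            loci.add((cover.chrom, cover.chrom_start, cover.chrom_end))
    return kept, len(loci) > 1
```

A read that hits both an alt locus and the chromosome region that alt is aligned to comes back with both hits and `is_ambiguous` set to false, whereas two hits at unrelated chromosome positions are still reported as ambiguous.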