Ashg2017 workshop tg

Reference-Grade Human Genome Assemblies
Tina Graves Lindsay
GRC - GIAB Workshop at ASHG
Oct 17, 2017

The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.

UGT2B17 – Conflicting Alleles
[Figure: chr4 tiling paths around UGT2B17, after Xue Y et al, 2008; a GAP is marked in the figure]
• NCBI36 NC_000004.10 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (gene labels: TMPRSS11E, TMPRSS11E2)
• GRCh37 NC_000004.11 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene label: TMPRSS11E)
• GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene label: TMPRSS11E2)

Genome Status
Data Source  Origin           Assembly Accession  Status
CHM1         NA               GCA_001297185.1     Assembly Improvement
CHM13        NA               GCA_000983455.2     Assembly Assessment
NA19240      Yoruban          GCA_001524155.4     Chr-level Assembly Submitted
HG00733      Puerto Rican     GCA_002208065.1     Contig Assembly Submitted
HG00514      Han Chinese      GCA_002180035.1     Contig Assembly Submitted
NA12878      European         GCA_002077035.2     Chr-level Assembly Submitted
HG01352      Colombian        GCA_002209525.1     Contig Assembly Submitted
HG02818      Gambian                              Assembly Underway
HG02059      Kinh-Vietnamese                      Assembly Assessment
NA19434      Luhya                                Assembly Assessment
HG04217      Telugu                               Data Production Underway
HG03486      Mende                                Assembly Underway**
** First Sequel-only data set

Assembly Stats
Genome   Total Size  # Contigs  Contig N50
NA19240  2.84 Gb     2965       25.7 Mb
HG00733  2.88 Gb     3580       22.2 Mb
NA12878  2.86 Gb     3663       14.5 Mb
HG01352  2.88 Gb     3120       22.8 Mb
HG00514  2.87 Gb     3160       25.3 Mb
NA19434  2.86 Gb     3083       21.6 Mb
HG02059  2.89 Gb     3148       26.0 Mb
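For readers less familiar with the Contig N50 column, here is a minimal sketch of how that statistic is commonly computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions, not values or tooling from the talk.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (not data from the slides).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_lengths))
```

Because N50 depends only on the length distribution, breaking a few large contigs lowers it even when the total assembly size is unchanged.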

Assembly QC and Submission Steps
1. Multiple Falcon assemblies
2. Using stats and alignment to Bionano, pick the best assembly
3. Quiver and Pilon on best assembly
4. Use Bionano to identify misassemblies
5. Submit contig-level AGPs to GenBank
6. Run through NCBI assembly QA pipeline
7. Evaluate and curate output of QA pipeline
8. Generate final chromosome-level AGPs and submit
9. Annotation of chromosome-level assembly
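Several of the steps above involve AGP files, which describe how contigs and gaps are laid out in a larger object. The sketch below writes AGP-style rows for one scaffold, assuming the tab-separated column layout of the NCBI AGP 2.x specification ("W" for a WGS contig component, "N" for a gap of specified size). The function name, the scaffold and contig names, and the choice of "map" as linkage evidence are illustrative assumptions, not the pipeline used by the authors.

```python
def scaffold_to_agp(scaffold_name, parts, gap_type="scaffold", evidence="map"):
    """Build AGP-style rows for one scaffold.

    parts is an ordered list of tuples, either
      ("contig", contig_id, length, orientation)  or  ("gap", length).
    Returns a list of tab-separated strings, one per component or gap.
    """
    lines = []
    pos = 1  # AGP object coordinates are 1-based and inclusive
    for part_number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, contig_id, length, orientation = part
            end = pos + length - 1
            # Sequence row: the whole contig (1..length) is used.
            fields = [scaffold_name, pos, end, part_number, "W",
                      contig_id, 1, length, orientation]
        else:
            _, length = part
            end = pos + length - 1
            # Gap row: gap length, gap type, linkage flag, linkage evidence.
            fields = [scaffold_name, pos, end, part_number, "N",
                      length, gap_type, "yes", evidence]
        lines.append("\t".join(str(f) for f in fields))
        pos = end + 1
    return lines


# Hypothetical scaffold: two contigs joined across a 100 bp gap.
example = [("contig", "ctg1", 1000, "+"), ("gap", 100), ("contig", "ctg2", 500, "-")]
for row in scaffold_to_agp("scaf1", example):
    print(row)
```

A real submission would also need the header comment lines and should be checked with NCBI's AGP validator; this sketch covers only the row layout.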

Hybrid Scaffold
[Figure: a hybrid scaffold shown with its PacBio contigs and BioNano contigs]

Hybrid Stats
         Seq Assem     Seq Assem        Seq Assem        BN Hybrid       BN Hybrid          BN Hybrid
Genome   # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)
NA19240  2889          26.3             2.87             218             39.9               2.82
NA12878  3551          15.1             2.86             270             28.7               2.83
HG00514  3190          24.2             2.88             208             37.0               2.83
HG00733  3553          22.8             2.88             167             48.8               2.87
HG01352  3077          22.8             2.88             220             40.0               2.84
NA19434  3083          21.9             2.86             253             34.7               2.83
HG02059  3148          26.1             2.90             242             37.2               2.83

NA19240 Assembly Assessment
                   Initial Calls  Breaks made
Conflicts          51             35
Translocation SV   321            16
Complex            123            9
Nucmer Alignments                 9
69 total breaks made

               Contig #  Contig N50  Total Assembly Size
Before Breaks  2889      26.4 Mb     2.87 Gb
After Breaks   2951      25.7 Mb     2.87 Gb
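To illustrate what "making a break" means for the contig count, here is a minimal sketch that splits one contig at a list of breakpoints and returns the resulting pieces. The function name, the 0-based half-open coordinates, and the piece-naming scheme are assumptions for illustration; the talk does not describe the tooling actually used.

```python
def break_contig(name, length, breakpoints):
    """Split a contig at the given breakpoints.

    Coordinates are 0-based and half-open. Breakpoints outside the
    open interval (0, length) are ignored, and duplicates are merged.
    Returns a list of (piece_name, start, end) tuples.
    """
    cuts = sorted({bp for bp in breakpoints if 0 < bp < length})
    edges = [0] + cuts + [length]
    # Pair each edge with the next one to form consecutive pieces.
    return [
        (f"{name}_{i + 1}", start, end)
        for i, (start, end) in enumerate(zip(edges, edges[1:]))
    ]


# Hypothetical chimeric contig broken at one misassembly position.
print(break_contig("ctg1", 100, [40]))
```

In this sketch, n distinct breakpoints in one contig yield n + 1 pieces while the total sequence length stays the same.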

Chimeric PacBio Contig
[Figure: NA19240 contig aligned to GRCh38 Chr 1 and to GRCh38 Chr 4; segmental duplications are marked in each panel]

NA19240 Inversion Compared to GRCh38
GRCh38
NA19240 Bionano Contigs

Bionano Identified SVs Compared to GRCh38
Genome Deletions Insertions Inversions
Yoruban (NA19240) 756 1795 8
European (NA12878) 750 1791 17
Han Chinese (HG00514) 743 1724 8
Puerto Rican (HG00733) 743 1862 27
Colombian (HG01352) 711 1661 6
Vietnamese (HG02059) 626 1536 4
Luhya (NA19434) 694 1643 10
Mende (HG03486) 871 1888 3

NA19240 MHC Region
GRCh38
Bionano Contigs

NA19240 MHC Region
NA19240
Reference
Alts
~65 kb insertion

CYP2D6 Alternate Alleles
Courtesy of Karyn Meltz Steinberg

NA12878 CYP2D6 Region in Bionano Map
[Figure: Bionano maps of GRCh38, NA12878 allele 1, and NA12878 allele 2]

Falcon Assembly of NA12878 in CYP2D6 Region
[Figure: alignment of NA12878 to GRCh38 across CYP2D8, CYP2D7, and CYP2D6]
• Region of NA12878 that doesn't exist in GRCh38
• Shows duplication of the CYP2D7 gene in the NA12878 genome

Falcon Unzip Assemblies

                 Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
Primary Contigs  1220      2.83 Gb          21.63 Mb    2.31 Mb            83.00 Mb
Haplotigs        11,686    2.45 Gb          443.3 Kb    210 Kb             3.41 Mb
Gambian (HG02818) Assembly

                 Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
Primary Contigs  1,801     2.83 Gb          21.16 Mb    1.57 Mb            81.12 Mb
Haplotigs        13,130    2.49 Gb          458.2 Kb    190 Kb             3.23 Mb
Yoruban (NA19240) Assembly – Not polished yet

10X Genomics Overview (DNA)
(Church 10X Genomics)

10X Data – Separating a Heterozygous Allele
GRCh38
NA12878
Falcon
10X Allele 1
10X Allele 2
Heterozygous SV identified by Bionano
10X Supernova assembly used - GCA_002022845.1

Short Term Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon Unzip assemblies for all samples
• Improve those assemblies
  • Identifying misassemblies
  • Making the breaks where needed
  • Scaffolding the assemblies
  • Incorporating BACs as they are finished
• Create Chromosomal AGPs
• Submit to GenBank

Longer Term Future Work
• Better Utilization of the Reference
• Mapping Strategies
• Graph-based alignments
• Other alt-aware read mapping strategies
• Alternative reference data display challenges – How should we present data?
• Do we continue the current scheme of alt alleles?
• Full reference sequences?
• 2 Haplo-resolved sequences for each allele
• Using Falcon unzip
• Using 10X
• Other technologies?

Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
University of Washington: Evan Eichler
NCBI: Valerie Schneider
University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
BioNano Genomics: Alex Hastie
Pacific Biosciences: Nick Sisneros, Sarah Kingan, Luke Hickey, Greg Concepcion
UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
10X Genomics: Deanna Church
Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
