Reference-Grade Human
Genome Assemblies
Tina Graves Lindsay
GRC - GIAB Workshop at ASHG
Oct 17, 2017
The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some
regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major
challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have
allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to
represent the full range of genetic diversity in humans.
UGT2B17 – Conflicting Alleles
[Figure: chr4 tiling paths across the UGT2B17 region (Xue Y et al, 2008); a GAP is marked in the figure]
NCBI36 NC_000004.10 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (genes shown: TMPRSS11E, TMPRSS11E2)
GRCh37 NC_000004.11 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene shown: TMPRSS11E)
GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene shown: TMPRSS11E2)
Samples to be Sequenced
Sequencing Plan
Genome Status

Data Source  Origin           Assembly Accession  Status
CHM1         NA               GCA_001297185.1     Assembly Improvement
CHM13        NA               GCA_000983455.2     Assembly Assessment
NA19240      Yoruban          GCA_001524155.4     Chr-level Assembly Submitted
HG00733      Puerto Rican     GCA_002208065.1     Contig Assembly Submitted
HG00514      Han Chinese      GCA_002180035.1     Contig Assembly Submitted
NA12878      European         GCA_002077035.2     Chr-level Assembly Submitted
HG01352      Colombian        GCA_002209525.1     Contig Assembly Submitted
HG02818      Gambian                              Assembly Underway
HG02059      Kinh-Vietnamese                      Assembly Assessment
NA19434      Luhya                                Assembly Assessment
HG04217      Telugu                               Data Production Underway
HG03486      Mende                                Assembly Underway**

** First Sequel-only data set
Assembly Stats

Genome   Total Size  # Contigs  Contig N50
NA19240  2.84 Gb     2965       25.7 Mb
HG00733  2.88 Gb     3580       22.2 Mb
NA12878  2.86 Gb     3663       14.5 Mb
HG01352  2.88 Gb     3120       22.8 Mb
HG00514  2.87 Gb     3160       25.3 Mb
NA19434  2.86 Gb     3083       21.6 Mb
HG02059  2.89 Gb     3148       26.0 Mb
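The Contig N50 column in the assembly statistics can be reproduced from a list of contig lengths. The sketch below assumes the standard definition of N50 (the contig length at which contigs of that length or longer cover at least half of the total assembly). The function name and the toy lengths are illustrative and are not taken from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (standard definition)."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In practice the lengths would come from the assembly FASTA or its index; only the summary values in the table are given in the slides.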
Assembly QC and Submission Steps
1. Multiple Falcon assemblies
2. Using stats and alignment to Bionano, pick the best assembly
3. Quiver and Pilon on best assembly
4. Use Bionano to identify misassemblies
5. Submit contig-level AGPs to GenBank
6. Run through NCBI assembly QA pipeline
7. Evaluate and curate output of QA pipeline
8. Generate final chromosome-level AGPs and submit
9. Annotation of chromosome-level assembly
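The AGP files mentioned in the submission steps are tab-delimited descriptions of how components (here, contigs) and gaps are laid out along a larger object. As a minimal sketch, assuming AGP 2.x column order, WGS contigs (component type "W"), and gaps of known length supported by map evidence, the layout of one object could be written like this. The function name, identifiers, and lengths are hypothetical and do not reflect the actual pipeline used for these assemblies.

```python
def scaffold_to_agp(object_id, parts):
    """Yield AGP 2.x lines for one object.

    parts is an ordered list of tuples:
      ("contig", component_id, length, orientation)  or  ("gap", length)
    """
    pos = 1  # AGP coordinates are 1-based and inclusive
    for part_number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, comp_id, length, orient = part
            end = pos + length - 1
            # Component line: the whole contig (1..length) is used.
            fields = [object_id, pos, end, part_number, "W",
                      comp_id, 1, length, orient]
        else:
            _, length = part
            end = pos + length - 1
            # Gap line: assumed gap type "scaffold", linked, evidence "map".
            fields = [object_id, pos, end, part_number, "N",
                      length, "scaffold", "yes", "map"]
        yield "\t".join(str(f) for f in fields)
        pos = end + 1


# Hypothetical scaffold: two contigs joined across a 100 bp gap.
example = [("contig", "ctg1", 1000, "+"), ("gap", 100), ("contig", "ctg2", 500, "-")]
for line in scaffold_to_agp("scaffold_1", example):
    print(line)
```

A real submission would also need the file header and validation against the AGP specification; this only illustrates the coordinate bookkeeping.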
Hybrid Scaffold
[Figure: Hybrid Scaffold; tracks: PacBio Contigs, BioNano Contigs]

Hybrid Stats

         Seq Assem     Seq Assem        Seq Assem        BN Hybrid       BN Hybrid          BN Hybrid
Genome   # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)
NA19240  2889          26.3             2.87             218             39.9               2.82
NA12878  3551          15.1             2.86             270             28.7               2.83
HG00514  3190          24.2             2.88             208             37.0               2.83
HG00733  3553          22.8             2.88             167             48.8               2.87
HG01352  3077          22.8             2.88             220             40.0               2.84
NA19434  3083          21.9             2.86             253             34.7               2.83
HG02059  3148          26.1             2.90             242             37.2               2.83
NA19240 Assembly Assessment

                   Initial Calls  Breaks Made
Conflicts          51             35
Translocation SV   321            16
Complex            123            9
Nucmer Alignments                 9
Total breaks made                 69

               Contig #  Contig N50  Total Assembly Size
Before Breaks  2889      26.4 Mb     2.87 Gb
After Breaks   2951      25.7 Mb     2.87 Gb
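The before/after numbers show the expected effect of breaking misassembled contigs: the contig count goes up and the N50 goes down slightly while total size is unchanged. The toy sketch below illustrates that effect only. The helper names, contig names, lengths, and breakpoint are invented for illustration, and N50 is computed with the standard definition rather than with whatever tool produced the slide values.

```python
def n50(lengths):
    """Standard N50: length at which longer-or-equal contigs cover half the total."""
    lengths = list(lengths)
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig length")


def break_contig(length, breakpoints):
    """Split a contig of the given length at internal breakpoints; return piece lengths."""
    cuts = sorted({bp for bp in breakpoints if 0 < bp < length})
    edges = [0] + cuts + [length]
    return [end - start for start, end in zip(edges, edges[1:])]


def apply_breaks(contigs, breaks):
    """contigs: name -> length; breaks: name -> list of breakpoints (hypothetical inputs)."""
    pieces = []
    for name, length in contigs.items():
        pieces.extend(break_contig(length, breaks.get(name, [])))
    return pieces


# Made-up assembly with one chimeric contig broken once.
before = {"ctgA": 100, "ctgB": 60, "ctgC": 40}
after = apply_breaks(before, {"ctgA": [55]})
print(len(before), n50(before.values()), sum(before.values()))
print(len(after), n50(after), sum(after))
```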
NA19240 contig break
[Figure: Chimeric PacBio Contig; labels: GRCh38 – Chr 1, GRCh38 – Chr 4, NA19240 Contig, Segmental Duplications]
NA19240 Inversion Compared to GRCh38
[Figure labels: GRCh38, NA19240 Bionano Contigs]
Bionano Identified SVs Compared to GRCh38

Genome                  Deletions  Insertions  Inversions
Yoruban (NA19240)       756        1795        8
European (NA12878)      750        1791        17
Han Chinese (HG00514)   743        1724        8
Puerto Rican (HG00733)  743        1862        27
Colombian (HG01352)     711        1661        6
Vietnamese (HG02059)    626        1536        4
Luhya (NA19434)         694        1643        10
Mende (HG03486)         871        1888        3
NA19240 MHC Region
[Figure labels: GRCh38, Bionano Contigs]
NA19240 MHC Region
[Figure labels: NA19240, Reference, Alts, ~65 kb insertion]
CYP2D6 Alternate Alleles
Courtesy of Karyn Meltz Steinberg
NA12878 CYP2D6 Region in Bionano Map
[Figure labels: GRCh38, NA12878 allele 1, NA12878 allele 2]
Falcon Assembly of NA12878 in CYP2D6 Region
[Figure: Alignment of NA12878 to GRCh38; genes shown: CYP2D8, CYP2D7, CYP2D6]
Region of NA12878 that doesn’t exist in GRCh38
Shows duplication of CYP2D7 gene in NA12878 genome
Falcon Unzip
Falcon Unzip Assemblies

Gambian (HG02818) Assembly
                 Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
Primary Contigs  1,220     2.83 Gb          21.63 Mb    2.31 Mb            83.00 Mb
Haplotigs        11,686    2.45 Gb          443.3 Kb    210 Kb             3.41 Mb

Yoruban (NA19240) Assembly – Not polished yet
                 Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
Primary Contigs  1,801     2.83 Gb          21.16 Mb    1.57 Mb            81.12 Mb
Haplotigs        13,130    2.49 Gb          458.2 Kb    190 Kb             3.23 Mb
10X Genomics Overview (DNA)
(Church 10X Genomics)
10X Data – Separating a Heterozygous Allele
[Figure labels: GRCh38, NA12878 Falcon, 10X Allele 1, 10X Allele 2]
Heterozygous SV identified by Bionano
10X Supernova assembly used - GCA_002022845.1
Short Term Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon Unzip assemblies for all samples
• Improve those assemblies
• Identifying misassemblies
• Making the breaks where needed
• Scaffolding the assemblies
• Incorporating BACs as they are finished
• Create Chromosomal AGPs
• Submit to GenBank
Longer Term Future Work
• Better Utilization of the Reference
• Mapping Strategies
• Graph based alignments
• Other alt-aware read mapping strategies
• Alternative reference data display challenges: how should we present data?
• Do we continue the current scheme of alt alleles?
• Full reference sequences?
• 2 Haplo-resolved sequences for each allele
• Using Falcon unzip
• Using 10X
• Other technologies?
Acknowledgements
The McDonnell Genome Institute at
Washington University in St. Louis
Susan Dutcher
Bob Fulton
Wes Warren
Karyn Meltz Steinberg
Derek Albracht
Milinn Kremitzki
Susan Rock
Chad Tomlinson
Patrick Minx
Chris Markovic
Eddie Belter
Lee Trani
Sara Kohlberg
University of Washington
Evan Eichler
NCBI
Valerie Schneider
University of Pittsburgh
School of Medicine
(CHM1 and CHM13 cell line)
Urvashi Surti
BioNano Genomics
Alex Hastie
Pacific Biosciences
Nick Sisneros
Sarah Kingan
Luke Hickey
Greg Concepcion
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
Catherine Chu
10X Genomics
Deanna Church
Nationwide Children’s Hospital
Richard Wilson
Vince Magrini
Sean McGrath

More Related Content

What's hot

Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Genome Reference Consortium
 
Generating haplotype phased reference genomes for the dikaryotic wheat strip...
Generating haplotype phased reference genomes  for the dikaryotic wheat strip...Generating haplotype phased reference genomes  for the dikaryotic wheat strip...
Generating haplotype phased reference genomes for the dikaryotic wheat strip...Benjamin Schwessinger
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonGenome Reference Consortium
 
Schneider_AGBT2014
Schneider_AGBT2014Schneider_AGBT2014
Schneider_AGBT2014vaschn
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCGenome Reference Consortium
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyGenome Reference Consortium
 

What's hot (20)

20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
 
Grc workshop agbt2015_tg
Grc workshop agbt2015_tgGrc workshop agbt2015_tg
Grc workshop agbt2015_tg
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
Ashg grc workshop2014_tg
Ashg grc workshop2014_tgAshg grc workshop2014_tg
Ashg grc workshop2014_tg
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
Generating haplotype phased reference genomes for the dikaryotic wheat strip...
Generating haplotype phased reference genomes  for the dikaryotic wheat strip...Generating haplotype phased reference genomes  for the dikaryotic wheat strip...
Generating haplotype phased reference genomes for the dikaryotic wheat strip...
 
Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
GRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slidesGRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slides
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
Schneider_AGBT2014
Schneider_AGBT2014Schneider_AGBT2014
Schneider_AGBT2014
 
Ashg2015 schneider final
Ashg2015 schneider finalAshg2015 schneider final
Ashg2015 schneider final
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 
Ashg2015 grc-pruitt
Ashg2015 grc-pruittAshg2015 grc-pruitt
Ashg2015 grc-pruitt
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copy
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 
Alignment Approaches II: Long Reads
Alignment Approaches II: Long ReadsAlignment Approaches II: Long Reads
Alignment Approaches II: Long Reads
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
 

Similar to Ashg2017 workshop tg

Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Miten Jain
 
F Giordano ScanPAV Analysis Pipeline
F Giordano ScanPAV Analysis PipelineF Giordano ScanPAV Analysis Pipeline
F Giordano ScanPAV Analysis PipelineFrancesca Giordano
 
New data from giab genomes promethion
New data from giab genomes   promethionNew data from giab genomes   promethion
New data from giab genomes promethionGenomeInABottle
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Paolo Missier
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop finalMeng-Ru (Raymond) Tsai
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Deanna Church
 
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Fabio Caligaris
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Jane Landolin
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Torsten Seemann
 
Miten Generating high-quality reference human genomes using Promethion nanopo...
Miten Generating high-quality reference human genomes using Promethion nanopo...Miten Generating high-quality reference human genomes using Promethion nanopo...
Miten Generating high-quality reference human genomes using Promethion nanopo...GenomeInABottle
 
KHMiga-AGBT.020923.upload.pdf
KHMiga-AGBT.020923.upload.pdfKHMiga-AGBT.020923.upload.pdf
KHMiga-AGBT.020923.upload.pdfKarenMiga
 
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...ATMOSPHERE .
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_coursehansjansen9999
 
V4 Sequencing Reagent Experience
V4 Sequencing Reagent ExperienceV4 Sequencing Reagent Experience
V4 Sequencing Reagent ExperienceBrian Krueger
 

Similar to Ashg2017 workshop tg (20)

Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...
 
F Giordano ScanPAV Analysis Pipeline
F Giordano ScanPAV Analysis PipelineF Giordano ScanPAV Analysis Pipeline
F Giordano ScanPAV Analysis Pipeline
 
New data from giab genomes promethion
New data from giab genomes   promethionNew data from giab genomes   promethion
New data from giab genomes promethion
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
 
26072016 uc davis_small
26072016 uc davis_small26072016 uc davis_small
26072016 uc davis_small
 
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
Miten Generating high-quality reference human genomes using Promethion nanopo...
Miten Generating high-quality reference human genomes using Promethion nanopo...Miten Generating high-quality reference human genomes using Promethion nanopo...
Miten Generating high-quality reference human genomes using Promethion nanopo...
 
KHMiga-AGBT.020923.upload.pdf
KHMiga-AGBT.020923.upload.pdfKHMiga-AGBT.020923.upload.pdf
KHMiga-AGBT.020923.upload.pdf
 
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
V4 Sequencing Reagent Experience
V4 Sequencing Reagent ExperienceV4 Sequencing Reagent Experience
V4 Sequencing Reagent Experience
 

More from Genome Reference Consortium

What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?Genome Reference Consortium
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Genome Reference Consortium
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesGenome Reference Consortium
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectGenome Reference Consortium
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amGenome Reference Consortium
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsGenome Reference Consortium
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGenome Reference Consortium
 

More from Genome Reference Consortium (17)

What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Mane v2 final
Mane v2 finalMane v2 final
Mane v2 final
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 
Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regions
 
Everyday de novo assembly
Everyday de novo assemblyEveryday de novo assembly
Everyday de novo assembly
 

Recently uploaded

linearity concept of significance, standard deviation, chi square test, stude...
linearity concept of significance, standard deviation, chi square test, stude...linearity concept of significance, standard deviation, chi square test, stude...
linearity concept of significance, standard deviation, chi square test, stude...KavyasriPuttamreddy
 
BMK Glycidic Acid (sodium salt) CAS 5449-12-7 Pharmaceutical intermediates
BMK Glycidic Acid (sodium salt)  CAS 5449-12-7 Pharmaceutical intermediatesBMK Glycidic Acid (sodium salt)  CAS 5449-12-7 Pharmaceutical intermediates
BMK Glycidic Acid (sodium salt) CAS 5449-12-7 Pharmaceutical intermediatesdorademei
 
CT scan of penetrating abdominopelvic trauma
CT scan of penetrating abdominopelvic traumaCT scan of penetrating abdominopelvic trauma
CT scan of penetrating abdominopelvic traumassuser144901
 
Cardiovascular Physiology - Regulation of Cardiac Pumping
Cardiovascular Physiology - Regulation of Cardiac PumpingCardiovascular Physiology - Regulation of Cardiac Pumping
Cardiovascular Physiology - Regulation of Cardiac PumpingMedicoseAcademics
 
Effects of vaping e-cigarettes on arterial health
Effects of vaping e-cigarettes on arterial healthEffects of vaping e-cigarettes on arterial health
Effects of vaping e-cigarettes on arterial healthCatherine Liao
 
Hemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.Gawad
Hemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.GawadHemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.Gawad
Hemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.GawadNephroTube - Dr.Gawad
 
Factors Affecting child behavior in Pediatric Dentistry
Factors Affecting child behavior in Pediatric DentistryFactors Affecting child behavior in Pediatric Dentistry
Factors Affecting child behavior in Pediatric DentistryDr Simran Deepak Vangani
 
A thorough review of supernormal conduction.pptx
A thorough review of supernormal conduction.pptxA thorough review of supernormal conduction.pptx
A thorough review of supernormal conduction.pptxSergio Pinski
 
รายการตํารับยาแผนไทยแห่งชาติ ฉบับ พ.ศ. 2564.pdf
รายการตํารับยาแผนไทยแห่งชาติ ฉบับ พ.ศ. 2564.pdfรายการตํารับยาแผนไทยแห่งชาติ ฉบับ พ.ศ. 2564.pdf
รายการตํารับยาแผนไทยแห่งชาติ ฉบับ พ.ศ. 2564.pdfVorawut Wongumpornpinit
 
Vaccines: A Powerful and Cost-Effective Tool Protecting Americans Against Dis...
Vaccines: A Powerful and Cost-Effective Tool Protecting Americans Against Dis...Vaccines: A Powerful and Cost-Effective Tool Protecting Americans Against Dis...
Vaccines: A Powerful and Cost-Effective Tool Protecting Americans Against Dis...PhRMA
 
World Hypertension Day 17th may 2024 ppt
World Hypertension Day 17th may 2024 pptWorld Hypertension Day 17th may 2024 ppt
World Hypertension Day 17th may 2024 pptdesktoppc
 
MRI Artifacts and Their Remedies/Corrections.pptx
MRI Artifacts and Their Remedies/Corrections.pptxMRI Artifacts and Their Remedies/Corrections.pptx
MRI Artifacts and Their Remedies/Corrections.pptxDr. Dheeraj Kumar
 
Denture base resins materials and its mechanism of action
Denture base resins materials and its mechanism of actionDenture base resins materials and its mechanism of action
Denture base resins materials and its mechanism of actionDr.shiva sai vemula
 
Anuman- An inference for helpful in diagnosis and treatment
Anuman- An inference for helpful in diagnosis and treatmentAnuman- An inference for helpful in diagnosis and treatment
Anuman- An inference for helpful in diagnosis and treatmentabdeli bhadarva
 
Antiplatelets in IHD, Dose Duration, DAPT vs SAPT
Antiplatelets in IHD, Dose Duration, DAPT vs SAPTAntiplatelets in IHD, Dose Duration, DAPT vs SAPT
Antiplatelets in IHD, Dose Duration, DAPT vs SAPTAkashGanganePatil1
 
180-hour Power Capsules For Men In Ghana
180-hour Power Capsules For Men In Ghana180-hour Power Capsules For Men In Ghana
180-hour Power Capsules For Men In Ghanahealthwatchghana
 
Scientificity and feasibility study of non-invasive central arterial pressure...
Scientificity and feasibility study of non-invasive central arterial pressure...Scientificity and feasibility study of non-invasive central arterial pressure...
Scientificity and feasibility study of non-invasive central arterial pressure...Catherine Liao
 
Book Trailer: PGMEE in a Nutshell (CEE MD/MS PG Entrance Examination)
Book Trailer: PGMEE in a Nutshell (CEE MD/MS PG Entrance Examination)Book Trailer: PGMEE in a Nutshell (CEE MD/MS PG Entrance Examination)
Book Trailer: PGMEE in a Nutshell (CEE MD/MS PG Entrance Examination)Dr. Aryan (Anish Dhakal)
 
TUBERCULINUM-2.BHMS.MATERIA MEDICA.HOMOEOPATHY
TUBERCULINUM-2.BHMS.MATERIA MEDICA.HOMOEOPATHYTUBERCULINUM-2.BHMS.MATERIA MEDICA.HOMOEOPATHY
TUBERCULINUM-2.BHMS.MATERIA MEDICA.HOMOEOPATHYDRPREETHIJAMESP
 
PT MANAGEMENT OF URINARY INCONTINENCE.pptx
PT MANAGEMENT OF URINARY INCONTINENCE.pptxPT MANAGEMENT OF URINARY INCONTINENCE.pptx
PT MANAGEMENT OF URINARY INCONTINENCE.pptxdrtabassum4
 

Recently uploaded (20)

linearity concept of significance, standard deviation, chi square test, stude...
linearity concept of significance, standard deviation, chi square test, stude...linearity concept of significance, standard deviation, chi square test, stude...
linearity concept of significance, standard deviation, chi square test, stude...
 
BMK Glycidic Acid (sodium salt) CAS 5449-12-7 Pharmaceutical intermediates
BMK Glycidic Acid (sodium salt)  CAS 5449-12-7 Pharmaceutical intermediatesBMK Glycidic Acid (sodium salt)  CAS 5449-12-7 Pharmaceutical intermediates
BMK Glycidic Acid (sodium salt) CAS 5449-12-7 Pharmaceutical intermediates
 
CT scan of penetrating abdominopelvic trauma
CT scan of penetrating abdominopelvic traumaCT scan of penetrating abdominopelvic trauma
CT scan of penetrating abdominopelvic trauma
 
Cardiovascular Physiology - Regulation of Cardiac Pumping
Cardiovascular Physiology - Regulation of Cardiac PumpingCardiovascular Physiology - Regulation of Cardiac Pumping
Cardiovascular Physiology - Regulation of Cardiac Pumping
 
Effects of vaping e-cigarettes on arterial health
Effects of vaping e-cigarettes on arterial healthEffects of vaping e-cigarettes on arterial health
Effects of vaping e-cigarettes on arterial health
 
Hemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.Gawad
Hemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.GawadHemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.Gawad
Hemodialysis: Chapter 2, Extracorporeal Blood Circuit - Dr.Gawad
 
Factors Affecting child behavior in Pediatric Dentistry
Factors Affecting child behavior in Pediatric DentistryFactors Affecting child behavior in Pediatric Dentistry
Factors Affecting child behavior in Pediatric Dentistry
 
A thorough review of supernormal conduction.pptx
A thorough review of supernormal conduction.pptxA thorough review of supernormal conduction.pptx

Ashg2017 workshop tg

  • 1. Reference-Grade Human Genome Assemblies Tina Graves Lindsay GRC - GIAB Workshop at ASHG Oct 17, 2017
  • 2. The Human Reference is a Work in Progress!
      • The current reference – GRCh38 – is not optimal for some regions of the genome and/or some individuals/ancestries.
      • GRCh38 is comprised of DNA from several individual humans.
      • Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
      • New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
      • Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
  • 3. UGT2B17 – Conflicting Alleles
      NCBI36 NC_000004.10 (chr4) Tiling Path (Xue Y et al, 2008): AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7. Genes: TMPRSS11E, TMPRSS11E2
      GRCh37 NC_000004.11 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7. Gene: TMPRSS11E
      GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7. Gene: TMPRSS11E2
      Figure label: GAP
  • 4. Samples to be Sequenced
  • 6. Genome Status
      Data Source  Origin           Assembly Accession  Status
      CHM1         NA               GCA_001297185.1     Assembly Improvement
      CHM13        NA               GCA_000983455.2     Assembly Assessment
      NA19240      Yoruban          GCA_001524155.4     Chr-level Assembly Submitted
      HG00733      Puerto Rican     GCA_002208065.1     Contig Assembly Submitted
      HG00514      Han Chinese      GCA_002180035.1     Contig Assembly Submitted
      NA12878      European         GCA_002077035.2     Chr-level Assembly Submitted
      HG01352      Colombian        GCA_002209525.1     Contig Assembly Submitted
      HG02818      Gambian          -                   Assembly Underway
      HG02059      Kinh-Vietnamese  -                   Assembly Assessment
      NA19434      Luhya            -                   Assembly Assessment
      HG04217      Telugu           -                   Data Production Underway
      HG03486      Mende            -                   Assembly Underway**
      ** First Sequel only data set
  • 7. Assembly Stats
      Genome   Total Size  # Contigs  Contig N50
      NA19240  2.84 Gb     2965       25.7 Mb
      HG00733  2.88 Gb     3580       22.2 Mb
      NA12878  2.86 Gb     3663       14.5 Mb
      HG01352  2.88 Gb     3120       22.8 Mb
      HG00514  2.87 Gb     3160       25.3 Mb
      NA19434  2.86 Gb     3083       21.6 Mb
      HG02059  2.89 Gb     3148       26.0 Mb
  • 8. Assembly QC and Submission Steps
      • Multiple Falcon Assemblies
      • Using stats and alignment to Bionano, pick the best assembly
      • Quiver and Pilon on best assembly
      • Use Bionano to identify mis-assemblies
      • Submit contig level AGPs to Genbank
      • Run through NCBI assembly QA pipeline
      • Evaluate and curate output of QA pipeline
      • Generate final chromosome level AGPs and Submit
      • Annotation of chromosome level assembly
  • 9.
  • 10. Hybrid Scaffold
      Figure tracks: Hybrid Scaffold, PacBio Contigs, BioNano Contigs
  • 11. Hybrid Stats
               Seq Assem                                       BN Hybrid
      Genome   # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)
      NA19240  2889          26.3             2.87             218             39.9               2.82
      NA12878  3551          15.1             2.86             270             28.7               2.83
      HG00514  3190          24.2             2.88             208             37.0               2.83
      HG00733  3553          22.8             2.88             167             48.8               2.87
      HG01352  3077          22.8             2.88             220             40.0               2.84
      NA19434  3083          21.9             2.86             253             34.7               2.83
      HG02059  3148          26.1             2.90             242             37.2               2.83
  • 12. NA19240 Assembly Assessment
                          Initial Calls  Breaks made
      Conflicts           51             35
      Translocation SV    321            16
      Complex             123            9
      Nucmer Alignments   -              9
      Total breaks made                  69

                      Contig #  Contig N50  Total Assembly Size
      Before Breaks   2889      26.4 Mb     2.87 Gb
      After Breaks    2951      25.7 Mb     2.87 Gb
  • 14. Chimeric PacBio Contig GRCh38 – Chr 1 GRCh38 – Chr 4 NA19240 Contig NA19240 Contig Segmental Duplications Segmental Duplications
  • 15. NA19240 Inversion Compared to GRCh38 GRCh38 NA19240 Bionano Contigs
  • 16. Bionano Identified SVs Compared to GRCh38
      Genome                  Deletions  Insertions  Inversions
      Yoruban (NA19240)       756        1795        8
      European (NA12878)      750        1791        17
      Han Chinese (HG00514)   743        1724        8
      Puerto Rican (HG00733)  743        1862        27
      Colombian (HG01352)     711        1661        6
      Vietnamese (HG02059)    626        1536        4
      Luhya (NA19434)         694        1643        10
      Mende (HG03486)         871        1888        3
  • 19. CYP2D6 Alternate Alleles Courtesy of Karyn Meltz Steinberg
  • 20. NA12878 CYP2D6 Region in Bionano Map GRCh38 NA12878 allele 1 NA12878 allele 2
  • 21. NA12878 CYP2D6 Region in Bionano Map GRCh38 NA12878 allele 1 NA12878 allele 2
  • 22. Falcon Assembly of NA12878 in CYP2D6 Region
      Figure labels: CYP2D8, CYP2D7, CYP2D6; Alignment of NA12878 to GRCh38; Region of NA12878 that doesn’t exist in GRCh38; Shows Duplication of CYP2D7 gene in NA12878 genome
  • 24. Falcon Unzip Assemblies
      Gambian (HG02818) Assembly
                       Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
      Primary Contigs  1220      2.83 Gb          21.63 Mb    2.31 Mb            83.00 Mb
      Haplotigs        11,686    2.45 Gb          443.3 Kb    210 Kb             3.41 Mb

      Yoruban (NA19240) Assembly – Not polished yet
                       Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
      Primary Contigs  1,801     2.83 Gb          21.16 Mb    1.57 Mb            81.12 Mb
      Haplotigs        13,130    2.49 Gb          458.2 Kb    190 Kb             3.23 Mb
  • 25. 10X Genomics Overview (DNA) (Church 10X Genomics)
  • 26. 10X Data – Separating a Heterozygous Allele
      Figure tracks: GRCh38, NA12878 Falcon, 10X Allele 1, 10X Allele 2
      Heterozygous SV identified by Bionano
      10X Supernova assembly used: GCA_002022845.1
  • 27. Short Term Future Plans
      • Lots of assemblies to analyze!
      • Generate the latest Falcon Unzip assemblies for all samples
      • Improve those assemblies
      • Identifying misassemblies
      • Making the breaks where needed
      • Scaffolding the assemblies
      • Incorporating BACs as they are finished
      • Create Chromosomal AGPs
      • Submit to Genbank
  • 28. Longer Term Future Work
      • Better Utilization of the Reference
      • Mapping Strategies
      • Graph based alignments
      • Other alt-aware read mapping strategies
      • Alternative reference data display challenges – How should we present data
      • Do we continue the current scheme of alt alleles?
      • Full reference sequences?
      • 2 Haplo-resolved sequences for each allele
      • Using Falcon unzip
      • Using 10X
      • Other technologies?
  • 29. Acknowledgements
      The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
      University of Washington: Evan Eichler
      NCBI: Valerie Schneider
      University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
      BioNano Genomics: Alex Hastie
      Pacific Biosciences: Nick Sisneros, Sarah Kingan, Luke Hickey, Greg Concepcion
      UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
      10X Genomics: Deanna Church
      Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
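
Several of the slides above report contig N50 (slides 7, 11, 12, and 24). As a reminder of what that statistic measures, here is a minimal Python sketch using the usual definition: the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly size. The function name and the toy contig lengths are hypothetical and are not taken from any assembly in this talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70

# Breaking the longest contig in two (as is done for a chimeric contig)
# keeps the total size the same but can lower the N50.
after_break = [45, 35, 70, 50, 40, 30, 20, 10]
print(n50(after_break))  # 45
```

The second call illustrates why breaking chimeric contigs, as on slide 12, leaves the total assembly size unchanged while reducing the N50.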

Editor's Notes

  1. As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that there are still a few regions of the genome not fully resolved. There are still a few genes that are not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is comprised of many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome. But we realize that there still is a need for additional high quality human genome assemblies to fully cover the range of genetic diversity in humans.
  2. This is a great example of how allelic differences can cause assembly problems. The gene UGT2B17 is known to be copy number variant. Some individuals have 1 copy of this gene and other individuals lack this gene altogether. In Build 36, clones in this region were all from the RP11 sample, which happened to be heterozygous for this indel. The blue, red, and black colored boxes represent the clone path through the region, the yellow blocks indicate annotated segmental duplication, and there were two genes annotated in this region. In Build 36 we were representing both the insertion and deletion alleles in the assembly. By removing the black clones from the path, we were able to close the gap. We then created an alternate allele from the black clones, which required sequencing one additional clone, so that we end up with both the insertion and deletion alleles in GRCh37. This changes our understanding of the biology of this region: we closed the gap that existed, we removed a falsely annotated duplication, and there is really only one copy of this gene present in the assembly, with an allelic variant. This example shows how multiple haplotypes in the assembly can cause problems.
  3. In the past few years we have been working on a project funded by NIH to sequence additional human reference genomes. These are the samples we have been working on. Originally we planned to sequence 5 diploid genomes and 2 haploid genomes. Currently we are working on our 10th diploid genome. These genomes will help to add diversity to the reference.
  4. As part of this project, we are generating ~60X coverage of PacBio long read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10X Genomics data. For the initial few genomes, we were targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high quality whole genome assembly.
  5. To date, data has been generated for 2 haploid genomes and 10 diploid genomes, all at ~60X coverage or higher. We have a lot of data and a lot of assemblies to work with. For 2 of the diploid genomes, we have chromosome level assemblies; the rest are at the contig level. **2 additional genomes – data will be generated soon
  6. Here are the assembly stats for all of the genomes we have assembled to date. All of these genomes are being assembled using Falcon. With the newer version of Falcon, we are seeing a huge increase in contiguity. In most cases, the N50 has increased by 3 times. Version used: FALCON-integrate 1.7.5. Various assemblies are generated with different minimum seed read lengths and min_cov settings.
  7. We generate multiple assemblies, varying the minimum seed read length and min_cov. From those 20 or so assemblies, we pick the best one using stats and alignment to Bionano. The raw data is generally submitted a month or so after production of the data is completed.
  8. This diagram shows the workflow for the Bionano Irys system. It is a nanochannel technology in which long DNA molecules are nicked and labeled at specific recognition sites. You end up with nick sites along the DNA molecules, similar to a restriction digest, with the added benefit that the nicks are in context with one another. Once the data is assembled, the resulting genome maps can be used for SV detection, gap sizing, assembly QC, and scaffolding.
  9. Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see how these two technologies are very complementary to one another. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
  10. BioNano has also identified a second enzyme that nicks well for human genomes. You can create a second map with the other enzyme and then, through software improvements that are coming in the next month, you will be able to align your sequence to both maps. This will increase the N50 by 2 times. Used 14k_120_120_1.
  11. Once we identified which assembly version we wanted to improve, we aligned it to BioNano, generated SV calls, and did hybrid scaffolding. During the hybrid scaffolding process, conflicts are identified. For this genome, 51 conflicts were identified. We looked at the sequence alignments for all of these conflicts and found 35 to be PacBio assembly errors. We also looked through the translocation and complex SV calls, as well as a rough alignment of the assembly to GRCh38, to identify contigs that crossed chromosomes. From looking through all of this data, 69 breaks were made. You will see that breaking the obvious chimeric contigs only brought the N50 down a little bit, to 25.7 Mb. Sequence alignments were looked at for all conflicts; to narrow down the complex and translocation calls, we first looked at the BioNano alignments in IrysView.
  12. This is the same PacBio contig as in the last slide, only this time it is compared to GRCh38. In the top panel you can see
  13. We have also been using the bionano maps to identify variation between our genomes and the reference. In this example, there are 2 haplotypes in BN compared to GRCh38 – This appears to be a heterozygous inversion in NA19240.
  14. Here is the initial set of SV calls for our genomes when compared to GRCh38. These contain both homozygous and heterozygous calls.
  15. I have a few examples of what we have been seeing in these assemblies. We decided to take a look at the MHC region of NA19240. This is a comparison of the BioNano map of NA19240 to the reference; the reference is in green and the NA19240 BN map is in blue. From the BN map, it looks like there is a ~65 kb insertion.
  16. We then aligned the contig from Jason’s most recent assembly to the current reference as well as the alts. This is the region that corresponds to the insertion in the BN map, so from this initial look, it appears there is an insertion here in this assembly. We need to look at it further to evaluate whether this would be a useful addition to the alts that are already present.
  17. CYP2D6 is a very diverse genomic region that has implications for drug metabolism. In collaboration with the Pharmacogenomics Research Network (PGRN), we have sequenced multiple alleles in this region using fosmid libraries created from ethnically diverse individuals. Within the region, there is also another CYP gene, CYP2D7, and a pseudogene called CYP2D8, with common repeats interspersed between the gene and pseudogene copies, facilitating genomic rearrangements. The gene CYP2D6 and the associated pseudogenes are shown here, along with some of the different alleles we have sequenced.
  18. This is the alignment of NA12878 to GRCh38, as well as the genes aligned to NA12878.
  19. It was important, especially in highly variable regions of the genome, to capture both alleles from the diploid samples. In collaboration with us, PacBio has generated an unzip assembly. Here is a diagram showing how with Falcon you will be missing allelic variation, but by using Falcon Unzip, you should capture the variation that is present. You end up with a set of very contiguous primary contigs and then a set of smaller haplotigs that contain the variation.
  20. The Gambian assembly was done at PacBio for us, and this version is polished.
  21. I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.
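
To put the hybrid scaffolding results in perspective, the slide 11 table can be turned into a per-genome fold change, dividing the BN hybrid scaffold N50 by the sequence assembly contig N50. The following is a small Python sketch; the N50 values are copied from slide 11, while the variable and function names are illustrative assumptions.

```python
# (contig N50 in Mb, hybrid scaffold N50 in Mb), copied from the slide 11 table.
hybrid_stats = {
    "NA19240": (26.3, 39.9),
    "NA12878": (15.1, 28.7),
    "HG00514": (24.2, 37.0),
    "HG00733": (22.8, 48.8),
    "HG01352": (22.8, 40.0),
    "NA19434": (21.9, 34.7),
    "HG02059": (26.1, 37.2),
}


def fold_gain(contig_n50, scaffold_n50):
    """Ratio of scaffold N50 to contig N50 (hypothetical helper)."""
    return scaffold_n50 / contig_n50


gains = {genome: fold_gain(c, s) for genome, (c, s) in hybrid_stats.items()}

# Print genomes from largest to smallest gain.
for genome, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genome}: {gain:.2f}x")
```

On these numbers, every genome's scaffold N50 is larger than its contig N50, with HG00733 showing the largest ratio and HG02059 the smallest.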