Ashg grc workshop2014_tg

ASHG - GRC Workshop
Tina Lindsay
ASHG Oct 18, 2014

The Human Reference is Not Complete
• The reference has been found to be suboptimal in some regions
• Structural variation makes it difficult to assemble a truly representative genome from a diploid sample
• Some regions were recalcitrant to closure with the technology and resources available at the time
• Additional sequences are needed to capture the full range of diversity in humans

UGT2B17 – Conflicting Alleles
NCBI36 NC_000004.10 (chr4) Tiling Path
AC074378.4, AC079749.5, AC147055.2, AC134921.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
Genes: TMPRSS11E, TMPRSS11E2
Xue Y et al, 2008
GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4, AC079749.5, AC147055.2, AC134921.1, AC093720.2, AC021146.7
Gene: TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
Gene: TMPRSS11E2
GAP

Allelic Diversity vs. Segmental Duplication
Diploid Genome
[Figure: allelic copies and repeat copies (repeat copies noted by color difference); bases shown: A, A, C, T, C, G, C, C]
With a diploid genome, there is significant ambiguity in sorting allelic copies from repeat copies
Haploid Genome
[Figure: repeat copies only (noted by color difference); bases shown: A, C, C, C]
With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
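The distinction above can be illustrated with a minimal sketch. It assumes a hypothetical input: the bases observed across reads at one aligned site. The function names and the 20% support threshold are illustrative assumptions, not part of any pipeline described in this talk.

```python
from collections import Counter


def supported_bases(column, min_fraction=0.2):
    """Return the bases seen in at least min_fraction of reads at one site.

    min_fraction is an arbitrary illustrative threshold (assumption).
    """
    counts = Counter(column)
    total = sum(counts.values())
    return sorted(b for b, n in counts.items() if n / total >= min_fraction)


def interpret_site(column, ploidy):
    """Interpret a site where reads disagree, given the ploidy of the source."""
    if len(supported_bases(column)) <= 1:
        return "no difference"
    if ploidy == 1:
        # A haploid source has no allelic differences, so a well-supported
        # base difference points to reads from distinct repeat copies.
        return "likely repeat copies"
    # In a diploid source the same signal could be either explanation.
    return "ambiguous: allelic or repeat copies"


print(interpret_site("AAAACCCC", ploidy=1))
print(interpret_site("AAAACCCC", ploidy=2))
```

The same read column gives a different interpretation depending only on the ploidy of the source, which is the motivation for using a haploid resource.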

Hydatidiform mole
1. Fertilization of an oocyte without a nucleus
2. Post-zygotic diploidization of triploid zygotes
[Figure labels: Oocyte; ?; 23X; 23X 23X; Androgenetic HM]

Initial Use Of CHM1 Source
• CHORI-17 BAC Library
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs
• >750 have been sequenced
• 590 of them are in GenBank as phase 3

SRGAP2 Homology between genes
Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs
Shows homology between SRGAP2B and SRGAP2C
SRGAP2A
SRGAP2B
SRGAP2C
Dennis et al., 2012

1q21
1q32 1q21 1p21
1q21 patch alignment to chromosome 1

IGH Region Highlights Allelic Differences
Watson et al., 2013

Williams-Beuren Syndrome region
Slide courtesy of Megan Dennis

Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 592 of them in GenBank as phase 3)
• Active cell line
• >100X coverage Illumina 100 bp reads
• 300 bp, 500 bp, and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >50X coverage of PacBio long read data
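As a rough sense of scale for the coverage figures above, coverage is approximately (number of reads × read length) / genome size. The sketch below assumes a genome size of about 3.1 Gb, which is an assumption for illustration and not a figure from this slide; the function name is likewise hypothetical.

```python
def reads_for_coverage(coverage, genome_size_bp, read_length_bp):
    """Number of reads needed to reach a target average coverage.

    Uses integer ceiling division so no floating-point rounding is involved.
    """
    total_bases = coverage * genome_size_bp
    return (total_bases + read_length_bp - 1) // read_length_bp


# Assumption: approximate human genome size of 3.1 Gb.
GENOME_SIZE_BP = 3_100_000_000

# 100X coverage with 100 bp reads, as listed for the Illumina data.
print(reads_for_coverage(100, GENOME_SIZE_BP, 100))
```

This is only an average; real coverage varies along the genome.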

CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
• Comparison back to GRCh37 reference to provide appropriate gap sizes
• Assembly submitted to GenBank
• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
• Paper to be published soon
• Genome Research (in press)
• bioRxiv (doi: http://dx.doi.org/10.1101/006841)

CHM1_1.1 Assembly
Total Sequence Length 3,037,866,619 bp
Total Assembly Gap Length 210,229,812 bp
Number of Scaffolds 163
Scaffold N50 50,362,920 bp
Number of Contigs 40,828
Contig N50 143,936 bp
CHM1_1.1
GRCh37
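For readers unfamiliar with the N50 statistic reported above, here is a minimal sketch of how it is conventionally computed, together with the ungapped length implied by the two totals on this slide. The toy length list is made up for illustration; only the total and gap lengths come from the slide.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


# Toy example (hypothetical lengths): total 290, half 145, 80 + 70 = 150.
print(n50([80, 70, 50, 40, 30, 20]))

# Values taken from the CHM1_1.1 statistics above.
total_length = 3_037_866_619
gap_length = 210_229_812
ungapped_length = total_length - gap_length
print(ungapped_length)
```

The same function applies to scaffold or contig lengths; only the input list differs.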

Incorporation of CHM1_1.1 Assembly Data in GRCh38

PacBio CHM1 Assembly potentially fills GRCh38 Gaps
GRCh38
PacBio CHM1

PacBio CHM1 Assembly Shows Data Not in GRCh38
GRCh38
PacBio CHM1
Second Pass Alignment

CHM1 BioNano Genome Map Aligned to GRCh38
GRCh38
CHM1 BioNano Map
~15kb additional data

BioNano SV Calls Identified Assembly Problems
Collapse
Expansion
in Assembly
CHM1_1.1 Assembly Gap in Sequence
CHM1 BioNano Map

Collapse in Sequence Data
Thought to be missing ~100kb in sequenced clones
GRCh38

Gap Sizing
Chr8 – Stalled Gap
Estimated at ~150kb
GRCh38
Sized using CHM1 genome map: >500 kb
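One way to think about sizing a gap from a genome map is sketched below. It assumes two anchor labels that flank the gap and can be placed on both the map and the reference sequence. The function name and all coordinates are hypothetical illustrations, not values from the chr8 example and not the method used by any specific tool.

```python
def gap_size_from_map(map_left, map_right, ref_left_flank, ref_right_flank):
    """Estimate a gap size from a genome map.

    map_left, map_right: positions of the two anchor labels on the map.
    ref_left_flank: sequenced bases between the left label and the gap edge.
    ref_right_flank: sequenced bases between the gap edge and the right label.
    """
    map_span = map_right - map_left
    return map_span - (ref_left_flank + ref_right_flank)


# Hypothetical coordinates for illustration only.
estimate = gap_size_from_map(
    map_left=1_000_000,
    map_right=1_900_000,
    ref_left_flank=100_000,
    ref_right_flank=150_000,
)
print(estimate)
```

Whatever span the map measures between the anchors that is not accounted for by sequenced flanks is attributed to the gap.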

Future of CHM1 Assembly
• Plan to make as contiguous and accurate as possible
• Incorporate PacBio assembly where possible
• Additional CH17 clones being sequenced through
segmentally duplicated and structurally variant regions to
provide local assembly benefits (isolates the repeats)

CYP2D6 – Providing Alternate Alleles
ABC7 (NA18517)
ABC8 (NA18507)
ABC9 (NA18956)
ABC11 (NA18555)

Future Directions
• Continued Improvement on CHM1 Genome
• Integration of Pacific Biosciences whole genome assembly
• BioNano genome map data
• Continue to add diversity to the reference by sequencing new samples that provide diversity beyond what is currently represented in GRCh38
• Continued sequencing of CH17 single haplotype BAC
tilepaths to better represent segmentally duplicated
regions
• Additional collaborations with the community to develop
tools to more fully utilize the full reference assembly
(alternate haplotypes)

Acknowledgements
The Genome Institute at Washington
University in St. Louis
Rick Wilson
Bob Fulton
Wes Warren
Karyn Meltz Steinberg
Vince Magrini
Derek Albracht
Milinn Kremitzki
Susan Rock
Debbie Scheer
Aye Wollam
The Finishing and Bioinformatics Teams
at The Genome Institute
University of Washington
Evan Eichler
Megan Dennis
Xander Nuttle
NCBI
Richa Agarwala
Valerie Schneider
University of Pittsburgh
School of Medicine (CHM1 cell line)
Urvashi Surti
Personalis
Deanna Church
BioNano Genomics
Pacific Biosciences
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
CHORI
Catherine Chu
Pieter de Jong
