ASHG - GRC Workshop
Tina Lindsay
ASHG Oct 6, 2015
ASHG 2015 GRC workshop talk by Tina Graves-Lindsay

Published in: Science

  1. ASHG - GRC Workshop. Tina Lindsay, ASHG, Oct 6, 2015
  2. The Human Reference is Not Complete
     • Reference has been found to not be optimal in some regions
     • Structural variation makes it difficult to assemble a truly representative genome when using a diploid sample
     • Some regions were recalcitrant to closure with technology and resources available at the time
     • Additional sequences are needed to capture the full range of diversity in humans
  3. UGT2B17 – Conflicting Alleles (figure; Xue Y et al, 2008)
     • NCBI36 NC_000004.10 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7. Genes shown: TMPRSS11E, TMPRSS11E2
     • GRCh37 NC_000004.11 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7. Gene shown: TMPRSS11E. The figure marks a GAP
     • GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7. Gene shown: TMPRSS11E2
  4. Allelic Diversity vs. Segmental Duplication
     • Diploid Genome (figure: allelic copies and repeat copies, repeat copies noted by color difference): With a diploid genome, there is significant ambiguity sorting allelic copies from repeat copies
     • Haploid Genome (figure: repeat copies ONLY, noted by color difference): With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
  5. Initial Use Of CHM1 Source
     • CHORI-17 BAC Library
     • CHORI-17 BAC end sequences (n=325,659)
     • CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
     • CHORI-17 BACs
       • > 750 have been sequenced
       • 664 of them in GenBank as phase 3
  6. SRGAP2 Homology between genes (Dennis et al., 2012)
     • Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs
     • Shows homology between SRGAP2B and SRGAP2C
     • Figure labels: SRGAP2A, SRGAP2B, SRGAP2C
  7. 1q21: 1q21 patch alignment to chromosome 1 (figure labels: 1q32, 1q21, 1p21)
  8. Williams-Beuren Syndrome region (slide courtesy of Megan Dennis)
  9. Current status of CHM1 resources
     • CHORI-17 BAC Library (created from CHM1 cell line)
     • CHORI-17 BAC end sequences (n=325,659)
     • CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
     • CHORI-17 BACs (>750 have been sequenced, with 664 of them in GenBank as phase 3)
     • Active cell line
     • >100X coverage Illumina 100bp reads
     • 300, 500bp, 3kb inserts
     • Reference assisted assembly CHM1_1.1
     • BioNano genome map
     • >60X coverage of PacBio long read data (both P5 and P6)
     • Multiple whole genome assemblies
  10. PacBio CHM1 Assembly Spans GRCh38 Gaps (figure tracks: GRCh38, PacBio CHM1)
  11. PacBio CHM1 Assembly Shows Data Not in GRCh38 (figure tracks: GRCh38, PacBio CHM1, Second Pass Alignment)
  12. Sequencing Plan
  13. Some of the Targeted Regions: CFHR1, SRGAP2/FAM72, BOLA2/CORO1A/SLX1, ARHGAP11, CHRNA7, GTF2IRD2/GTF2I/NCF1, FRMPD2/PTPN20, GPRIN2/PPYR1, DUSP22, HYDIN, IgH, IgK, IgL, TCRA/B, NBPF, DEFB, MUC5a/b/c, LILR, CCL, FCGR1/HIST2H2B, NOTCH2
  14. Genomes Planned

      | Data Source | Origin of Sample | Coverage Level | Status          |
      |-------------|------------------|----------------|-----------------|
      | CHM1        | NA               | Platinum       | Assembly QC     |
      | CHM13       | NA               | Platinum       | Assembly QC     |
      | NA19240     | Yoruban          | Gold           | In Assembly     |
      | HG00733     | Puerto Rican     | Gold           | Data Generation |
      | NA12878     | European         | Gold           | Not Started     |
      | HG00514     | Han Chinese      | Gold           | Not Started     |
      | NA19434     | Luhya            | Gold           | Not Started     |
  15. CHM13 – 2nd Platinum Genome
      • CHM13 – another hydatidiform mole sample
      • PacBio data generated
      • 60X data was generated using P5 and P6 chemistry
      • Avg read length ~11kb, longer than original CHM1 data
      • Assembly contig N50 ~13Mb
      • Illumina coverage has been generated to use for assembly QC, SV detection, and consensus base error correction
      • Plan to use BACs to improve the assembly where needed
      • Alignment of assembly to BioNano genome map
      • Currently ~91% of CHM13 assembly aligns to BioNano map contigs
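  16. CHM13 Mini-Assemblethon

      |                     | Falcon        | MHAP Default 5% Error | MHAP Conservative 2.5% | MHAP Sensitive 5% | MHAP Sensitive 2.5% |
      |---------------------|---------------|-----------------------|------------------------|-------------------|---------------------|
      | # of Contigs        | 2,873         | 15,538                | 10,430                 | 11,138            | 13,500              |
      | Max Contig Length   | 63,148,543    | 81,522,549            | 34,039,925             | 80,601,297        | 58,311,553          |
      | Contig N50          | 12,981,785    | 13,331,528            | 5,550,336              | 19,357,701        | 11,964,038          |
      | Total Assembly Size | 2,851,367,788 | 3,061,261,250         | 2,996,416,935          | 3,028,933,694     | 3,086,573,229       |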
  17. Assembly Assessment Methods
      • Assemblies will run through NCBI QA pipeline
      • Assessed for contiguity, annotation, and concordance with the finished BAC paths
      • Assembly-to-assembly alignments will be generated between each PB assembly as well as GRCh38
      • BioNano Genome Map
      • SV calls generated from comparing the BioNano data to each of the assemblies
      • Generating hybrid scaffolds using BioNano data and assembly data
      • Alignment of the Illumina reads back to each of the assemblies
      • Heterozygous calls are likely indicative of a collapse in the assembly (for the single haplotype genomes)
  18. BioNano SV Calls Identified Assembly Problems (figure labels: Collapse; Expansion in Assembly; Gap in Sequence; Assembly; BioNano Map)
  19. CHM13 Hybrid Scaffolds

      |                      | BioNano Map | PacBio Assembly | Hybrid Scaffold |
      |----------------------|-------------|-----------------|-----------------|
      | # of Contigs         | 3593        | 1590 *          | 254             |
      | Min Contig Length    | 0.08 Mb     | 0               | 0.27 Mb         |
      | Median Contig Length | 0.61 Mb     | 0.06 Mb         | 4.35 Mb         |
      | Mean Contig Length   | 0.78 Mb     | 1.78 Mb         | 9.68 Mb         |
      | Contig N50           | 1.02 Mb     | 13.46 Mb        | 20.79 Mb        |
      | Max Contig Length    | 5.27 Mb     | 63.15 Mb        | 82.83 Mb        |
      | Total Contig Length  | 2812 Mb     | 2824 Mb         | 2457.75 Mb      |

      * Number of contigs used in hybrid scaffolding
      57 PacBio contigs and 67 BN contigs were identified as conflicts during this process
  20. CHM13 Hybrid Scaffold (figure tracks: Hybrid Scaffold, PacBio Contigs, BioNano Contigs)
  21. NA19240 Initial Assembly Stats
      • # Seq Contigs: 3569
      • Max Contig Length: 20,393,869 bp
      • Total Assembly Size: 2,745,634,789 bp
      • N50: 6,003,115 bp
      • N90: 848,151 bp
      • N95: 345,457 bp
  22. Future Directions
      • Identification of best assembly for CHM1 and CHM13
      • Integration of targeted BACs into the whole genome assembly
      • Improvement of the assemblies through scaffolding and making breaks in the assemblies where needed
      • Continue to add diversity to the reference by sequencing new samples that provide additional diversity to GRCh38
      • Additional collaborations with the community to develop tools to more fully utilize the full reference assembly (alternate haplotypes)
  23. Acknowledgements
      • The Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Vince Magrini, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam, The Finishing and Bioinformatics Teams at The Genome Institute
      • University of Washington: Evan Eichler, Megan Dennis, Xander Nuttle
      • NCBI: Richa Agarwala, Valerie Schneider
      • University of Pittsburgh School of Medicine (CHM1 cell line): Urvashi Surti
      • Personalis: Deanna Church
      • BioNano Genomics
      • Pacific Biosciences
      • UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
      • CHORI: Pieter de Jong
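
Several slides above report contig N50, and the NA19240 slide also gives N90 and N95. As a reading aid, here is a minimal sketch of how such Nx statistics are commonly computed from a list of contig lengths. The function name `nx` and the toy lengths are hypothetical and are not taken from the talk; the tools used to produce the reported figures may handle edge cases differently.

```python
def nx(lengths, x):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length L of the contig at which, walking from the longest
    contig to the shortest, the running total first reaches x percent of
    the total assembly size. (Hypothetical helper, for illustration only.)
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")

    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Toy contig lengths (made up; total size is 300).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]

print("N50:", nx(toy_contigs, 50))  # N50: 70
print("N90:", nx(toy_contigs, 90))  # N90: 30
print("N95:", nx(toy_contigs, 95))  # N95: 20
```

Under this definition a higher N50 means more of the assembly sits in long contigs. The statistic says nothing about correctness, which is presumably why the talk pairs it with BioNano map and Illumina read-alignment checks.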
