
Using BioNano Maps to Improve an Insect Genome Assembly​

Algorithms and filters used to improve the Tribolium draft assembly with physical maps based on imaging ultra-long single DNA molecules. A video of the webinar is available on the BioNano Genomics website, http://www.bionanogenomics.com/bionano-community/webinars/, as "Using BioNano Maps to Improve an Insect Genome Assembly".

  1. Using BioNano Maps to Improve an Insect Genome Assembly. Sue Brown, Oct 23, 2014.
  2. Tribolium castaneum, the red flour beetle: a genetic model organism for developmental, physiological and toxicological studies.
     • Easily cultured
     • Short generation time
     • Small genome size
     • Molecular and visible marker genetic maps
     • Genetic tools: balancers, deficiencies
     • Genomic libraries: lambda and BAC
     • cDNA libraries
     • Mutant analysis and RNAi
     • Transformation
     • 7x Sanger draft genome (Nature, 2008)
  3. Tribolium Genome: genome size 200 Mb (Cot analysis); 9 autosomes, X and Y; low methylation; long period interspersion. Credit: Jeff Stuart, Purdue University.
  4. Molecular map markers were used to anchor scaffolds to chromosome builds. Few X markers, no Y markers, variable marker density.
  5. Tcas 3.0 Reference Genome stats from NCBI

     | Input file | N50 (Mb) | Number | Cumulative Length (Mb) |
     |---|---|---|---|
     | Genome contigs | 0.04 | 8814 | 160.74 |
     | Genome scaffolds | 0.98 | 481 | 152.53 |
     | Unmapped scaffolds | | 352 | |
     | Unmapped contigs | | 1884 | |
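The N50 column in the table above follows the usual assembly-statistics definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. The sketch below is a minimal illustration of that definition on hypothetical toy lengths; it is not the script used to produce the NCBI figures.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the overall total.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Hypothetical toy lengths (not Tribolium data).
toy_lengths = [5, 4, 3, 2, 1]
print(n50(toy_lengths))
```

The same function applies to contigs, scaffolds or consensus maps, as long as all lengths are in the same unit.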
  6. Algorithms and filters used to improve the Tribolium draft assembly with physical maps based on imaging ultra-long single DNA molecules. Jennifer Shelton, 2014.
  7. Data formats: BNX, a text file of molecules.
  8. Data formats: CMAP, a text file of consensus maps.
  9. Data formats: in silico CMAP, a CMAP generated from the genome FASTA.
  10. Data formats: BNG CMAP, a CMAP generated from assembled molecules.
  11. Data formats: XMAP, a text file of the alignment of two CMAPs (for example, in silico CMAP 1 and in silico CMAP 2 aligned to BNG CMAP 1 and BNG CMAP 2).
  12. Assembly Pipeline. Assembly workflow:
      1) The Irys produces tiff files that are converted into BNX text files.
      2) Each chip produces one BNX file for each of two flowcells.
      3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is recalculated from the alignment.
      4) Quality check graphs are created for each pre-adjusted flowcell BNX.
      5) Adjusted flowcell BNXs are merged.
      6) The first assemblies are run with a variety of p-value thresholds.
      7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters.
      Diagram: BNX files are split into scan BNX files (scanBNX), adjusted (adjBNX) and merged (mergeBNX). The first assemblies use strict, default and relaxed p-value thresholds; the second assemblies use strict and relaxed minimum molecule lengths.
      Highlighted: in step 3 the sequence reference is used to adjust molecule stretch for each scan.
  13. Assembly Pipeline: in recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan in a flow cell, followed by a plateau and then a lower plateau.
  14. Assembly Pipeline (same workflow and diagram as slide 12). Highlighted: the sequence reference is used to determine assembly noise parameters, and the estimated genome size is used to set the p-value threshold.
  15. Assembly Pipeline (same workflow and diagram as slide 12). Highlighted: in steps 6 and 7, variants of the starting p-value and the default minimum molecule length are explored in nine assemblies.
  16. Current Tribolium sequence-based assembly

      | Input file | N50 (Mb) | Number of Scaffolds | Cumulative Length (Mb) |
      |---|---|---|---|
      | Genome FASTA | 1.16 | 2240 | 160.74 |
      | in silico CMAP from FASTA | 1.20 | 223 | 152.53 |

      223 scaffolds from the sequence-based assembly were longer than 20 kb with more than 5 labels and were converted into in silico CMAPs.
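The selection rule stated on this slide (longer than 20 kb and more than 5 labels) can be sketched as a simple filter. The `Scaffold` record and the example values are hypothetical and only illustrate the two thresholds; how labels are counted on each scaffold is outside the scope of this sketch.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 20_000  # "longer than 20 kb" from the slide
MIN_LABELS = 5          # "more than 5 labels" from the slide


@dataclass
class Scaffold:
    """Hypothetical record: scaffold name, length in bp, number of labels."""
    name: str
    length_bp: int
    label_count: int


def eligible_for_in_silico_cmap(scaffold):
    """True if the scaffold exceeds both thresholds from the slide."""
    return scaffold.length_bp > MIN_LENGTH_BP and scaffold.label_count > MIN_LABELS


# Toy scaffolds, not real Tribolium data.
toy = [
    Scaffold("toy_a", 1_200_000, 140),  # long and well labelled: kept
    Scaffold("toy_b", 15_000, 9),       # too short: dropped
    Scaffold("toy_c", 45_000, 3),       # too few labels: dropped
]
kept = [s.name for s in toy if eligible_for_in_silico_cmap(s)]
print(kept)
```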
  17. Assembly Results

      | Input file | N50 (Mb) | Number | Cumulative Length (Mb) |
      |---|---|---|---|
      | Genome FASTA | 1.16 | 2240 | 160.74 |
      | in silico CMAP from FASTA | 1.20 | 223 | 152.53 |
      | CMAP from assembled BNG molecules (BNG CMAP) | 1.35 | 216 | 200.47 |

      BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly. The estimated size of the Tribolium genome is ~200 Mb.
  18. Simplest XMAP alignment description. Diagram: in silico CMAP 1 and in silico CMAP 2 (from genome FASTA) aligned to BNG CMAP 1 and BNG CMAP 2 (from assembled molecules); labeled lengths 1 Mb, 1.1 Mb, 1.1 Mb and 1.3 Mb.
      • Breadth of alignment coverage for in silico CMAP: 2.1 Mb
      • Total alignment length for in silico CMAP: 2.1 Mb
      • Breadth of alignment coverage for BNG CMAP: 2.4 Mb
      • Total alignment length for BNG CMAP: 2.4 Mb
  19. Complex XMAP alignment description. Diagram: in silico CMAP 1 (from genome FASTA, 1 Mb) aligned to both BNG CMAP 1 and BNG CMAP 2 (from assembled molecules, 1.1 Mb and 1.3 Mb).
      • Breadth of alignment coverage for in silico CMAP: 1 Mb
      • Total alignment length for in silico CMAP: 2 Mb
      • Breadth of alignment coverage for BNG CMAP: 2.4 Mb
      • Total alignment length for BNG CMAP: 2.4 Mb
  20. Alignment of CMAPs. Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies. In this example, differences between "breadth" and "total" length could be due to:
      • Genomic duplications in the sample the molecules were extracted from
      • Assembly of alternate haplotypes
      • Mis-assembly creating redundant contigs
      • A collapsed repeat in the sequence assembly
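The difference between "breadth" and "total" can be made concrete with intervals. In the sketch below, which is an illustration and not the pipeline's own code, each alignment is a hypothetical (start, end) interval in Mb on one map: total length sums every interval, while breadth counts each covered position once. Two alignments that both span the same 1 Mb map give a breadth of 1 Mb and a total of 2 Mb, matching the in silico numbers in the complex example.

```python
def total_aligned_length(intervals):
    """Sum of the lengths of all alignments, counting overlaps repeatedly."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length of the union of all alignments, counting each position once."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block (if any) and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical coordinates (Mb): two BNG maps both align across the
# whole of a 1 Mb in silico map.
alignments_on_in_silico_map = [(0.0, 1.0), (0.0, 1.0)]
print(breadth_of_coverage(alignments_on_in_silico_map))
print(total_aligned_length(alignments_on_in_silico_map))
```

When breadth and total are equal, no position is covered by more than one alignment; a total larger than breadth flags redundancy worth inspecting for the causes listed above.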
  21. Alignment of BNG assembly to reference genome

      | CMAP name | Breadth of alignment coverage for CMAP (Mb) | Length of total alignment for CMAP (Mb) | Percent of CMAP aligned |
      |---|---|---|---|
      | in silico CMAP from FASTA | 124.04 | 132.40 | 81 |
      | CMAP from assembled BNG molecules (BNG CMAP) | 131.64 | 132.34 | 67 |

      Close to 4% of the alignment of the in silico CMAP appears to be redundant. Overall, 81% of the in silico CMAP aligns to the BNG consensus map.
  22. Alignment of BNG assembly to reference genome. Typically, where redundant alignments occur, two BNG consensus maps aligned, suggesting they represent haplotypes, although this has not been verified. Figure: ChLG 9 super-scaffold, BNG consensus maps and ChLG 9 scaffolds (130 131 133 134 132 129 135 127 136 137).
  23. Potential haplotypes where overlapping BNG CMAPs align. Figure: ChLG 9 super-scaffold, BNG consensus maps and ChLG 9 scaffolds (128 130 131 133 134 132).
  24. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. Stitch.pl estimates super-scaffolds using alignments of scaffolds and assembled BNG molecules made with BNG RefAligner. The in silico CMAP is aligned as the reference. Diagram: in silico CMAPs 1, 2 and 4 in + orientation and in silico CMAP 3 in - orientation, aligned to BNG CMAP 1 and BNG CMAP 2.
  25. The alignment is inverted and used as input for stitch.
  26. Alignments are filtered based on alignment length relative to total possible alignment length, and on confidence.
  27. Stitch.pl checks alignment length against potential alignment length to find relevant global rather than local alignments. Highlighted alignment (BNG CMAP 1, + in silico CMAP 1): passes because the alignment length is greater than 30% of the potential alignment length.
  28. Highlighted alignment (BNG CMAP 1, + in silico CMAP 2): passes because the alignment length is greater than 30% of the potential alignment length.
  29. Highlighted alignment (BNG CMAP 2, - in silico CMAP 2): passes because the alignment length is greater than 30% of the potential alignment length.
  30. Highlighted alignment (BNG CMAP 2, - in silico CMAP 2): fails because the alignment length is less than 30% of the potential alignment length.
  31. Highlighted alignment (BNG CMAP 2, + in silico CMAP 2): fails because the alignment length is less than 30% of the potential alignment length.
  32. Highlighted alignment (BNG CMAP 2, - in silico CMAP 3): passes because the alignment length is greater than 30% of the potential alignment length.
  33. Highlighted alignment (BNG CMAP 2, - in silico CMAP 3): fails because the alignment length is less than 30% of the potential alignment length.
  34. Highlighted alignment (BNG CMAP 2, + in silico CMAP 4): passes because the alignment length is greater than 30% of the potential alignment length.
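The pass/fail decisions on the slides above reduce to a ratio test. The sketch below assumes the potential alignment length is already known for each alignment; the slides do not say how stitch.pl derives it, so it is treated here as an input, and the function name and values are hypothetical.

```python
MIN_FRACTION_ALIGNED = 0.30  # "greater than 30%" from the slides


def passes_length_filter(aligned_length, potential_length,
                         min_fraction=MIN_FRACTION_ALIGNED):
    """True if the aligned length exceeds min_fraction of the potential length.

    Both lengths must be in the same unit. How the potential alignment
    length is computed is an assumption left to the caller.
    """
    if potential_length <= 0:
        raise ValueError("potential_length must be positive")
    return aligned_length / potential_length > min_fraction


# Hypothetical lengths in Mb.
print(passes_length_filter(0.5, 1.0))  # half of the potential length aligns
print(passes_length_filter(0.2, 1.0))  # only a short local match
```

A filter of this kind favors alignments that span most of the region where two maps could overlap, which is what distinguishes a global placement from a short local match.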
  35. High quality scaffolding alignments...
  36. ...are filtered for the longest and highest confidence alignment for each in silico CMAP.
  37. Passing alignments are used to super-scaffold.
  38. Stitch is iterated and additional super-scaffolding alignments are found...
  39. ...until all super-scaffolds are joined.
  40. If the gap length is estimated to be negative, gaps are represented by 100 bp spacers. Diagram: BNG CMAP 1 and BNG CMAP 2 joining - in silico CMAP 3, + in silico CMAP 2, + in silico CMAP 4 and + in silico CMAP 1.
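The spacer rule above can be sketched as follows. Only the 100 bp spacer for negative estimates comes from the slide; filling positive gaps with a run of `N` of the estimated length is an assumption made for illustration, and the function names are hypothetical rather than taken from stitch.pl.

```python
SPACER_BP = 100  # spacer length for negative gap estimates, from the slide


def gap_filler(estimated_gap_bp):
    """Return the run of N characters placed between two joined scaffolds.

    Negative estimates get a fixed 100 bp spacer. Using the estimated
    length for non-negative gaps is an assumption of this sketch.
    """
    length = SPACER_BP if estimated_gap_bp < 0 else estimated_gap_bp
    return "N" * length


def join_scaffolds(left_seq, right_seq, estimated_gap_bp):
    """Concatenate two scaffold sequences with the gap filler between them."""
    return left_seq + gap_filler(estimated_gap_bp) + right_seq


# Toy sequences, not real scaffolds.
joined = join_scaffolds("ACGT", "TTGA", -5_000)
print(len(joined))
```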
  41. Gap lengths. Of the gaps in the automated stitch.pl Tribolium super-scaffolds, 66 had known lengths and 26 had negative lengths (set to 100 bp). Of the gaps in the manually edited Tribolium super-scaffolds, 66 had known lengths and 24 had negative lengths (set to 100 bp). Figure: distribution of gap lengths for automated output; x-axis gap length (bp), from −1,500,000 to 1,000,000; y-axis count, from 0 to 20; negative and positive gap lengths marked.
  43. Negative gap lengths. The longest negative gap length is from a BNG consensus map joining in silico 23 and 136. Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together, because it is supported by genetic maps), but we should check these assemblies. In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly? Figure scaffolds: 22 23 129 136 137.
  44. Negative gap lengths (same notes and figure as slide 43). Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run.
  45. Negative gap lengths. Two new super-scaffolds were created and the sequence similarity is being evaluated. Figure (min confidence 10): ChLG 2 super-scaffolds, BNG consensus maps and ChLG 2 scaffolds (U 18 14 16 19 20 21 22 23 24 25 26 27 28 30); a second scaffold series reads 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145. Stray slide note: "scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?"
  46. Gap lengths. This negative alignment also indicated a potential assembly issue. Figure: distribution of gap lengths for automated output (gap length in bp against count).
  47. Negative gap lengths. This negative gap length is from a BNG consensus map joining in silico 81 and 102 and 103. Half of scaffold_81 aligns with ChLG7.
  48. Negative gap lengths. Half of scaffold_81 aligns with ChLG7 (figure scaffolds: 79 80 81 82 83). Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run. The BNG maps suggest a mis-assembly of in silico 81 at the sequence level.
  49. Gap lengths. All extremely small negative gap lengths, < -20,000 bp (shaded in the gap length distribution), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level.
  50. Gap lengths. All gaps from the shaded regions were also manually rejected, and stitch.pl was rerun without them for the current super-scaffolded assembly. We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies. stitch.pl version 1.4.5 rejects alignments if negative gap lengths are < -20,000 bp but lists them in the data summary.
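The threshold described above can be expressed as a three-way classification of estimated gap lengths. This is a hedged sketch, not code from stitch.pl: the category names, the function and the toy gap values are hypothetical, and only the -20,000 bp cutoff and the 100 bp spacer for other negative gaps come from the slides.

```python
from collections import Counter

REJECT_BELOW_BP = -20_000  # cutoff quoted for stitch.pl version 1.4.5


def classify_gap(estimated_gap_bp):
    """Classify an estimated gap length.

    "rejected": below the cutoff, flagged as a possible mis-assembly.
    "spacer":   negative but above the cutoff, written as a 100 bp spacer.
    "known":    zero or positive, kept at its estimated length.
    """
    if estimated_gap_bp < REJECT_BELOW_BP:
        return "rejected"
    if estimated_gap_bp < 0:
        return "spacer"
    return "known"


# Hypothetical gap estimates in bp.
toy_gaps = [-1_200_000, -25_000, -20_000, -300, 0, 4_500, 80_000]
summary = Counter(classify_gap(g) for g in toy_gaps)
print(summary["rejected"], summary["spacer"], summary["known"])
```

Keeping the rejected gaps in a summary, rather than silently dropping them, preserves them as candidate locations for sequence-level checks.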
  51. Tribolium super-scaffolds

      | Input file | N50 (Mb) | Number of Scaffolds | Cumulative Length (Mb) |
      |---|---|---|---|
      | genome FASTA | 1.16 | 2240 | 160.74 |
      | super-scaffold FASTA | 4.46 | 2150 | 165.92 |

      N50 of the super-scaffolded genome was ~4 times greater than the original. Super-scaffolds tend to agree with the Tribolium genetic map.
  52. Tribolium super-scaffolds (same table as slide 51: genome FASTA 1.16, 2240, 160.74; super-scaffold FASTA 4.46, 2150, 165.92). For Tribolium:
      • first minimum percent aligned = 30%
      • first minimum confidence = 13
      • second minimum percent aligned = 90%
      • second minimum confidence = 8
      Lower quality alignments were manually selected if the genetic map also supported the order. Complex scaffolds were broken manually for sequence-level evaluation.
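One way to read the two threshold pairs above is as two alternative tiers: a lower percent aligned is acceptable when confidence is high, and a lower confidence is acceptable when nearly the whole potential length aligns. The slide does not state how the pairs are combined, so the `or` below is an assumption, as are the inclusive comparisons and the function name.

```python
# Threshold values quoted on the slide for Tribolium.
FIRST_MIN_PERCENT_ALIGNED = 30
FIRST_MIN_CONFIDENCE = 13
SECOND_MIN_PERCENT_ALIGNED = 90
SECOND_MIN_CONFIDENCE = 8


def passes_thresholds(percent_aligned, confidence):
    """True if the alignment meets either tier (assumed to be alternatives)."""
    first_tier = (percent_aligned >= FIRST_MIN_PERCENT_ALIGNED
                  and confidence >= FIRST_MIN_CONFIDENCE)
    second_tier = (percent_aligned >= SECOND_MIN_PERCENT_ALIGNED
                   and confidence >= SECOND_MIN_CONFIDENCE)
    return first_tier or second_tier


# Hypothetical (percent aligned, confidence) pairs.
for percent, conf in [(35, 14), (95, 9), (50, 9), (95, 7)]:
    print(percent, conf, passes_thresholds(percent, conf))
```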
  53. Tribolium super-scaffolds (min confidence 10). From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super-scaffold. ChLG X was reduced from 13 scaffolds to 2, with one scaffold being moved to ChLG 3. Figure: ChLG X super-scaffold, BNG consensus maps and ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13).
  54. Tribolium super-scaffolds (min confidence 10). The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3. Figure: ChLG 3 super-scaffolds, BNG consensus maps and ChLG 3 scaffolds (32 33 34 35 36 2 37 38 39 40 41 42; 51 U 43 45 44 46).
  55. Tribolium super-scaffolds (min confidence 10; same ChLG X figure as slide 53). Two unplaced scaffolds aligned to ChLG X.
  56. Tribolium super-scaffolds (min confidence 10; same ChLG X figure as slide 53). The 4% redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map).
  57. Potential haplotypes where overlapping BNG CMAPs align (min confidence 10; same ChLG X figure as slide 53).
  58. Tribolium super-scaffolds (min confidence 10). For ChLG 9, 21 scaffolds were reduced to 9. Figure: ChLG 9 super-scaffold, BNG consensus maps and ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145). Stray slide note: "ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?"
  59. Tribolium super-scaffolds (min confidence 10). For ChLG 5, 17 scaffolds were reduced to 4. Figure: ChLG 5 super-scaffold, BNG consensus maps and ChLG 5 scaffolds (69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83).
  60. Future directions: Structural Variant (SV). Use SV-detect pipelines to resize existing gaps in scaffolds and identify mis-assemblies.
  61. Acknowledgements
      K-INBRE Bioinformatics Core:
      • Susan Brown - PI
      • Nic Herndon - script development
      • Nanyan Lu - manual evaluation
      • Michelle Coleman - extractions and running the Irys
      • Zachary Sliefert - metric summaries
      Bionano Genomics:
      • Ernest Lam - assembly pipeline best practices assistance
      • Weiping Wang - assistance with data formats
      • Palak Sheth - collaboration to standardize analysis
      Script availability: https://github.com/i5K-KINBRE-script-share/Irys-scaffolding (BNG scripts available by request from BNG).
      Slide availability: http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome-assembly
      This project was supported by grants from the National Center for Research Resources (5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from the National Institutes of Health.