1. Algorithms and filters used to improve the Tribolium
draft Assembly with Physical Maps Based on
Imaging Ultra-Long Single DNA Molecules
Jennifer Shelton
2014
2. Assembly Pipeline
3) Use the sequence reference to adjust molecule stretch for each scan
3. Assembly Pipeline
In recent datasets, when SNR is low and alignment is good, we see a spike in
bases per pixel (bpp) in the first scan, then a plateau, then a lower plateau.
[Figure: bpp by scan, with the first scan in a flow cell marked.]
4. Assembly Pipeline
5) Use sequence reference to determine assembly noise parameters.
Estimated genome size is used to set the p-value threshold.
5. Assembly Pipeline
6/7) Variants of the starting p-value and default minimum molecule length are
explored in nine assemblies.
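As a rough sketch of steps 5 to 7, the parameter grid could be enumerated as below. The slides give neither the formula nor the variant values, so every number here is an assumption: the `1e-5 / genome size in Mb` rule of thumb, the tenfold stricter and relaxed p-values, and the three minimum molecule lengths are hypothetical placeholders chosen only so that 3 x 3 gives nine assemblies.

```python
from itertools import product


def starting_p_value(genome_size_mb, numerator=1e-5):
    """Assumed rule of thumb: a larger genome gets a stricter threshold."""
    return numerator / genome_size_mb


def assembly_grid(genome_size_mb,
                  p_value_factors=(0.1, 1.0, 10.0),          # hypothetical variants
                  min_molecule_lengths_kb=(100, 150, 180)):  # hypothetical variants
    """Return one parameter set per (p-value, minimum length) combination."""
    base = starting_p_value(genome_size_mb)
    return [
        {"p_value": base * factor, "min_length_kb": min_len}
        for factor, min_len in product(p_value_factors, min_molecule_lengths_kb)
    ]


# ~200 Mb is the Tribolium genome size estimate quoted in the deck.
grid = assembly_grid(genome_size_mb=200)
for params in grid:
    print(params)
```

Enumerating the grid up front makes it easy to launch the assemblies in parallel and compare their metrics afterwards.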
6. Current Tribolium sequence-based assembly
Input file                  N50 (Mb)  Number of contigs  Cumulative length (Mb)
Genome FASTA                1.16      2240               160.74
in silico CMAP from FASTA   1.20      223                152.53
The 223 scaffolds from the sequence-based assembly that were longer than 20 kb
and had more than 5 labels were converted into in silico CMAPs.
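The selection rule above can be sketched as a simple predicate. The thresholds come from the slide; the function name and the example scaffolds are hypothetical, and label counting itself (finding the nick sites in the sequence) is not shown.

```python
def keep_scaffold(length_bp, label_count, min_length_bp=20_000, min_labels=5):
    """True if the scaffold is longer than 20 kb and has more than 5 labels."""
    return length_bp > min_length_bp and label_count > min_labels


# Hypothetical scaffolds: (name, length in bp, number of labels)
scaffolds = [
    ("scaffold_a", 1_200_000, 140),
    ("scaffold_b", 18_000, 9),   # too short
    ("scaffold_c", 45_000, 3),   # too few labels
]

kept = [name for name, length, labels in scaffolds if keep_scaffold(length, labels)]
print(kept)
```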
7. Assembly Results
Input file                                    N50 (Mb)  Number of contigs  Cumulative length (Mb)
Genome FASTA                                  1.16      2240               160.74
in silico CMAP from FASTA                     1.20      223                152.53
CMAP from assembled BNG molecules (BNG CMAP)  1.35      216                200.47
The assembled BNG molecules had a higher N50 and a longer cumulative length
than the sequence assembly.
The estimated size of the Tribolium genome is ~200 Mb.
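For readers unfamiliar with the N50 column, a minimal sketch of the usual definition follows: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the cumulative length. The input lengths are made up for illustration and are not the Tribolium data.

```python
def n50(lengths):
    """Length of the contig that brings the running total to >= half the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths
print(n50([2, 3, 4, 5, 6]))
```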
8. Simplest XMAP alignment description
[Figure: two in silico CMAPs from the genome FASTA, in silico CMAP 1 (1 Mb) and in silico CMAP 2 (1.1 Mb), aligned to two CMAPs from assembled molecules, BNG CMAP 1 (1.1 Mb) and BNG CMAP 2 (1.3 Mb).]
Breadth of alignment coverage for in silico CMAP: 2.1 Mb
Total alignment length for in silico CMAP: 2.1 Mb
Breadth of alignment coverage for BNG CMAP: 2.4 Mb
Total alignment length for BNG CMAP: 2.4 Mb
9. Complex XMAP alignment description
[Figure: one in silico CMAP from the genome FASTA, in silico CMAP 1 (1 Mb), aligned to two CMAPs from assembled molecules, BNG CMAP 1 (1.1 Mb) and BNG CMAP 2 (1.3 Mb).]
Breadth of alignment coverage for in silico CMAP: 1 Mb
Total alignment length for in silico CMAP: 2 Mb
Breadth of alignment coverage for BNG CMAP: 2.4 Mb
Total alignment length for BNG CMAP: 2.4 Mb
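The difference between the two metrics can be sketched with interval arithmetic: total alignment length sums every aligned interval, while breadth counts each covered position once. The coordinates below are hypothetical and model the in silico side of the complex example, a 1 Mb map covered end to end by two BNG CMAPs; this is an illustration, not the script used to produce the slide values.

```python
def total_alignment_length(intervals):
    """Sum of all aligned interval lengths, counting overlaps every time."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Number of positions covered by at least one interval."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical alignments on in silico CMAP 1 (coordinates in bp)
on_in_silico_cmap_1 = [(0, 1_000_000), (0, 1_000_000)]
print(breadth_of_coverage(on_in_silico_cmap_1))     # breadth
print(total_alignment_length(on_in_silico_cmap_1))  # total
```

When breadth and total agree, every aligned position is covered once; a total larger than breadth points to redundant alignments.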
10. Alignment of CMAPs
[Figure: in silico CMAP 1 (1 Mb), from the genome FASTA, aligned to two CMAPs from assembled molecules, BNG CMAP 1 (1.1 Mb) and BNG CMAP 2 (1.3 Mb).]
Comparing breadth of alignment coverage with total aligned length can indicate
relevant relationships between assemblies.
In this example, differences between "breadth" and "total" length could be due to:
Duplications in the sample that the molecules were extracted from
Assembly of alternate haplotypes
Mis-assembly creating redundant contigs
Collapsed repeat in the sequence assembly
11. Alignment of BNG assembly to reference genome
CMAP name                                     Breadth of alignment coverage (Mb)  Total alignment length (Mb)  Percent of CMAP aligned
in silico CMAP from FASTA                     124.04                              132.40                       81
CMAP from assembled BNG molecules (BNG CMAP)  131.64                              132.34                       67
Close to 4% of the alignment of the in silico CMAP appears to be redundant.
Overall, 81% of the in silico CMAP aligns to the BNG consensus maps.
12. Alignment of BNG assembly to reference genome
Typically, where redundant alignments occur, two BNG consensus maps align to
the same in silico map, suggesting they represent haplotypes, although this has
not been verified.
[Figure: ChLG 9 scaffolds (130 131 133 134 132 129 135 127 136 137) and the ChLG 9 super scaffold, each shown with aligned BNG consensus maps.]
14. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl estimates super scaffolds using alignments of scaffolds and
assembled BNG molecules made with BNG RefAligner.
[Figure: in silico CMAPs aligned as the reference. + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 are shown with BNG CMAP 1 and BNG CMAP 2.]
15. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl estimates super scaffolds using alignments of scaffolds and
assembled BNG molecules made with BNG RefAligner.
[Figure: first, the in silico CMAPs are aligned as the reference. The alignment is then inverted and used as input for stitch. Both panels show + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 with BNG CMAP 1 and BNG CMAP 2.]
16. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl estimates super scaffolds using alignments of scaffolds and
assembled BNG molecules made with BNG RefAligner.
[Figure: first, the in silico CMAPs are aligned as the reference. The alignment is then inverted and used as input for stitch. Finally, alignments are filtered based on confidence and on alignment length relative to the total possible alignment length. All panels show + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 with BNG CMAP 1 and BNG CMAP 2.]
17. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with the alignment of + in silico CMAP 1 on BNG CMAP 1 highlighted.]
This alignment passes because the alignment length is greater than 30% of the
potential alignment length.
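One way to picture the 30% check is sketched below. The slides do not define "potential alignment length", so the definition used here is an assumption: the aligned span plus, on each side, the shorter of the two unaligned flanks, for maps in the same (+) orientation. The function names and coordinates are hypothetical and do not reproduce stitch.pl's code.

```python
def potential_alignment_length(q_start, q_end, q_len, r_start, r_end, r_len):
    """Aligned span plus the flanks both maps could still share (+ orientation)."""
    left_flank = min(q_start, r_start)
    right_flank = min(q_len - q_end, r_len - r_end)
    return (q_end - q_start) + left_flank + right_flank


def passes_length_filter(q_start, q_end, q_len, r_start, r_end, r_len,
                         min_fraction=0.30):
    """True if the aligned span exceeds min_fraction of the potential length."""
    aligned = q_end - q_start
    potential = potential_alignment_length(q_start, q_end, q_len,
                                           r_start, r_end, r_len)
    return potential > 0 and aligned / potential > min_fraction


# Hypothetical end-to-end overlap: the start of one map meets the end of the other.
end_overlap = passes_length_filter(0, 400_000, 1_000_000,
                                   600_000, 1_000_000, 1_000_000)
# Hypothetical local hit: 100 kb aligned in the middle of two 1 Mb maps.
local_hit = passes_length_filter(400_000, 500_000, 1_000_000,
                                 400_000, 500_000, 1_000_000)
print(end_overlap, local_hit)
```

Under this definition a short alignment at the ends of two maps can still pass, while an equally short alignment buried in the middle of both maps fails, which is the global versus local distinction the slide describes.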
18. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with the alignment of + in silico CMAP 2 on BNG CMAP 1 highlighted.]
This alignment passes because the alignment length is greater than 30% of the
potential alignment length.
19. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with an alignment of - in silico CMAP 2 on BNG CMAP 2 highlighted.]
This alignment passes because the alignment length is greater than 30% of the
potential alignment length.
20. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with an alignment of - in silico CMAP 2 on BNG CMAP 2 highlighted.]
This alignment fails because the alignment length is less than 30% of the
potential alignment length.
21. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with an alignment of + in silico CMAP 2 on BNG CMAP 2 highlighted.]
This alignment fails because the alignment length is less than 30% of the
potential alignment length.
22. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with an alignment of - in silico CMAP 3 on BNG CMAP 2 highlighted.]
This alignment passes because the alignment length is greater than 30% of the
potential alignment length.
23. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with an alignment of - in silico CMAP 3 on BNG CMAP 2 highlighted.]
This alignment fails because the alignment length is less than 30% of the
potential alignment length.
24. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find
relevant global rather than local alignments.
[Figure: all alignments of in silico CMAPs 1 to 4 on BNG CMAP 1 and BNG CMAP 2, with the alignment of + in silico CMAP 4 on BNG CMAP 2 highlighted.]
This alignment passes because the alignment length is greater than 30% of the
potential alignment length.
25. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
High quality scaffolding alignments...
[Figure: the alignments that passed the length check: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 on BNG CMAP 1 and BNG CMAP 2.]
26. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
High quality scaffolding alignments are filtered for the longest and highest
confidence alignment for each in silico CMAP.
[Figure: the passing alignments before and after this filter: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 on BNG CMAP 1 and BNG CMAP 2.]
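A minimal sketch of keeping one alignment per in silico CMAP follows. The slide does not say how length and confidence are weighed against each other, so ranking by length first and confidence second is an assumption, and the record fields and values are hypothetical.

```python
def best_alignment_per_map(alignments):
    """Keep, for each in silico CMAP, the alignment ranked highest by
    (aligned length, confidence). The ranking order is an assumption."""
    best = {}
    for aln in alignments:
        key = aln["in_silico_id"]
        rank = (aln["length"], aln["confidence"])
        if key not in best or rank > (best[key]["length"], best[key]["confidence"]):
            best[key] = aln
    return best


# Hypothetical passing alignments
alignments = [
    {"in_silico_id": 2, "bng_id": 1, "length": 300_000, "confidence": 25.0},
    {"in_silico_id": 2, "bng_id": 2, "length": 450_000, "confidence": 18.0},
    {"in_silico_id": 3, "bng_id": 2, "length": 500_000, "confidence": 40.0},
]

chosen = best_alignment_per_map(alignments)
print({key: aln["bng_id"] for key, aln in chosen.items()})
```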
27. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
High quality scaffolding alignments are filtered for the longest and highest
confidence alignment for each in silico CMAP.
Passing alignments are used to super scaffold.
[Figure: the passing alignments before the filter, after the filter, and as the resulting super scaffold of + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 on BNG CMAP 1 and BNG CMAP 2.]
28. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch is iterated and additional super scaffolding alignments are found.
Iteration takes advantage of alignments where sequence-based scaffolds
stitch BNG consensus maps.
[Figure: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 on BNG CMAP 1 and BNG CMAP 2.]
29. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch is iterated and additional super scaffolding alignments are found,
until all super scaffolds are joined.
Iteration takes advantage of alignments where sequence-based scaffolds
stitch BNG consensus maps.
[Figure: two panels, before and after the final iteration, each showing + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 on BNG CMAP 1 and BNG CMAP 2.]
30. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
If a gap length is estimated to be negative, the gap is represented by a
100 bp filler.
[Figure: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 placed along BNG CMAP 1 and BNG CMAP 2.]
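The gap rule can be sketched as follows. The 100 bp filler for negative estimates is from the slide; writing gaps as runs of `N`, the function names, and the toy sequences are assumptions made for illustration.

```python
def gap_length_to_write(estimated_gap_bp, filler_bp=100):
    """Negative estimates become a fixed-size filler; others are kept as is."""
    return filler_bp if estimated_gap_bp < 0 else estimated_gap_bp


def join_scaffolds(sequences, estimated_gaps_bp):
    """Join sequences in order, separated by runs of N sized by the gap rule."""
    if len(estimated_gaps_bp) != len(sequences) - 1:
        raise ValueError("need exactly one gap estimate between each pair")
    parts = [sequences[0]]
    for seq, gap in zip(sequences[1:], estimated_gaps_bp):
        parts.append("N" * gap_length_to_write(gap))
        parts.append(seq)
    return "".join(parts)


# Toy sequences: a 3 bp gap, then an overlap estimated at -500 bp
joined = join_scaffolds(["ACGT", "TTAA", "GG"], [3, -500])
print(len(joined))
```

A fixed filler keeps the join visible in the FASTA without trimming sequence from either scaffold, which leaves the overlap available for later sequence-level checks.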
31. Gap lengths
In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known
lengths and 26 had negative lengths (set to 100 bp).
In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths
and 24 had negative lengths (set to 100 bp).
[Figure: Distribution of gap lengths for automated output. Histogram of Count (0 to 20) against Gap length (bp) from -1,500,000 to 1,000,000, with negative and positive gap lengths distinguished.]
33. Negative gap lengths
The longest negative gap length is from a BNG consensus map joining in silico
23 and 136.
[Figure: alignments involving scaffolds 22, 23, 129, 136 and 137. Figure annotation: "Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together, because it is supported by genetic maps), but we should check these assemblies. In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?"]
34. Negative gap lengths
Because the same region of 136 aligns to another BNG consensus map that
aligns to its chromosome linkage group, this alignment was rejected and stitch
was re-run.
[Figure: alignments involving scaffolds 22, 23, 129, 136 and 137, with the annotation asking whether part of scaffold_23 is connected to 136.]
35. Negative gap lengths
Two new super scaffolds were created and the sequence similarity is being
evaluated.
[Figure: ChLG 2 scaffolds and ChLG 2 super scaffolds, each shown with aligned BNG consensus maps; min confidence 10. Scaffold labels shown: 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 and U 18 14 16 19 20 21 22 23 24 25 26 27 28 30. Figure annotations: "scaffold_133 aligns but is not visible in IrysView?" and "why is Super_scaffold_65 backwards?"]
36. Gap lengths
This negative gap length also indicated a potential assembly issue.
[Figure: Distribution of gap lengths for automated output. Histogram of Count (0 to 20) against Gap length (bp) from -1,500,000 to 1,000,000, with negative and positive gap lengths distinguished.]
37. Negative gap lengths
This negative gap length is from a BNG consensus map joining in silico 81,
102 and 103.
Half of scaffold_81 aligns with ChLG 7.
38. Negative gap lengths
Half of scaffold_81 aligns with ChLG 7.
Because the other half of 81 aligns to another BNG consensus map that aligns
to its chromosome linkage group, this alignment was rejected and stitch was re-run.
The BNG maps suggest a mis-assembly of in silico 81 at the sequence level.
[Figure: alignments involving scaffolds 79, 80, 81, 82 and 83.]
39. Gap lengths
All extremely negative gap lengths (< -40,000 bp, shaded) were
independently flagged as potential sequence mis-assemblies to be checked at
the sequence level.
[Figure: Distribution of gap lengths for automated output. Histogram of Count (0 to 20) against Gap length (bp) from -1,500,000 to 1,000,000, with negative and positive gap lengths distinguished.]
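Flagging joins for review by gap length could look like the sketch below. Only the -40,000 bp cutoff is taken from the slide; the join records, map names and gap values are hypothetical.

```python
def flag_suspect_joins(joins, threshold_bp=-40_000):
    """Return joins whose estimated gap is more negative than the threshold."""
    return [join for join in joins if join["gap_bp"] < threshold_bp]


# Hypothetical joins between neighbouring in silico maps
joins = [
    {"left": "map_a", "right": "map_b", "gap_bp": 12_500},
    {"left": "map_b", "right": "map_c", "gap_bp": -3_200},     # small overlap
    {"left": "map_c", "right": "map_d", "gap_bp": -250_000},   # large overlap
]

for join in flag_suspect_joins(joins):
    print(join["left"], join["right"], join["gap_bp"])
```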
40. Gap lengths
All gaps from the shaded regions were also manually rejected, and stitch.pl
was rerun without them for the current super-scaffolded assembly.
We suspect extremely negative gap sizes may be useful in locating
sequence mis-assemblies.
[Figure: Distribution of gap lengths for automated output. Histogram of Count (0 to 20) against Gap length (bp) from -1,500,000 to 1,000,000, with negative and positive gap lengths distinguished.]
41. Tribolium super-scaffolds
Input file            N50 (Mb)  Number of contigs  Cumulative length (Mb)
genome FASTA          1.16      2240               160.74
super-scaffold FASTA  4.46      2150               165.92
The N50 of the super-scaffolded genome was ~4 times greater than the original.
Super-scaffolds tend to agree with the Tribolium genetic map.
42. Tribolium super-scaffolds
Input file            N50 (Mb)  Number of contigs  Cumulative length (Mb)
genome FASTA          1.16      2240               160.74
super-scaffold FASTA  4.46      2150               165.92
For Tribolium:
first minimum percent aligned = 30%
first minimum confidence = 13
second minimum percent aligned = 90%
second minimum confidence = 8
Lower quality alignments were manually selected if the genetic map also
supported the order.
Complex scaffolds were broken manually for sequence-level evaluation.
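The two threshold pairs can be read as a two-tier filter, sketched below with the Tribolium values from the slide as defaults. How the tiers combine is an assumption: here an alignment is kept if it meets either pair, so a lower confidence is tolerated only when nearly all of the potential alignment length is aligned. The function and parameter names are hypothetical.

```python
def passes_filters(percent_aligned, confidence,
                   first_min_percent=30, first_min_confidence=13,
                   second_min_percent=90, second_min_confidence=8):
    """Keep an alignment that meets either (percent, confidence) pair."""
    first = (percent_aligned >= first_min_percent
             and confidence >= first_min_confidence)
    second = (percent_aligned >= second_min_percent
              and confidence >= second_min_confidence)
    return first or second


# Hypothetical alignments: (percent of potential length aligned, confidence)
examples = [(35, 14), (95, 9), (50, 10), (95, 7)]
print([passes_filters(percent, conf) for percent, conf in examples])
```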
43. Tribolium super-scaffolds
ChLG X was reduced from 13 scaffolds to 2, with one scaffold being moved to
ChLG 3.
[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each shown with aligned BNG consensus maps; min confidence 10. Caption: From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.]
44. Tribolium super-scaffolds
The second scaffold from ChLG X aligned to scaffolds from a portion of
ChLG 3.
[Figure: ChLG 3 scaffolds (32 33 34 35 36 2 37 38 39 40 41 42 and 51 U 43 45 44 46) and the ChLG 3 super scaffolds, each shown with aligned BNG consensus maps; min confidence 10.]
45. Tribolium super-scaffolds
Two unplaced scaffolds aligned to ChLG X.
[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each shown with aligned BNG consensus maps; min confidence 10.]
46. Tribolium super-scaffolds
The 4% redundancy in alignment may be from assembly of haplotypes (generally
observed as two BNG consensus maps aligning to the same in silico map).
[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each shown with aligned BNG consensus maps; min confidence 10.]
47. Tribolium super-scaffolds overlapping BNG cmap
[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each shown with aligned BNG consensus maps; min confidence 10.]
48. Tribolium super-scaffolds
For ChLG 9, 21 scaffolds were reduced to 9.
[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and the ChLG 9 super scaffold, each shown with aligned BNG consensus maps; min confidence 10. Figure annotations: "ChLG9 currently (2150", "scaffold_133 aligns but is not visible in IrysView?" and "why is Super_scaffold_65 backwards?"]
49. Tribolium super-scaffolds
For ChLG 5, 17 scaffolds were reduced to 4.
[Figure: ChLG 5 scaffolds (69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83) and the ChLG 5 super scaffold, each shown with aligned BNG consensus maps; min confidence 10.]
50. Acknowledgements
K-INBRE Bioinformatics Core
Susan Brown - PI
Nic Herndon - script development
Nanyan Lu - manual editing
Michelle Coleman - extractions and running the Irys
Zachary Sliefert - metric summaries
Bionano Genomics
Ernest Lam - assembly pipeline best practices assistance
Weiping Wang - assistance with data formats
Palak Sheth - collaboration to standardize analysis
Script availability
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG