2. Estimation of Genetic
Divergence
‣ As important as SNPs/indels and
CNVs for estimating genetic
variation
‣ Provides insights into
population or cancer evolution
‣ Helps identify foreign DNA
contamination
‣ Helps identify lateral
transfer of sequences from
viruses or bacteria
Presence-Absence Variation (PAV)
Assembly Strategy and
Technology Assessment
‣ Determines the completeness of
an assembly
‣ Helps assess an assembly
strategy and pipeline
‣ Helps identify strengths and
weaknesses in new sequencing
technologies
6. ScanPAV: pipeline
[Diagram labels: Presence Assembly, Absence Assembly, Scaffold 1, Scaffold 2, Scaffold 3, 1 Kb]
‣ Shred the Presence Assembly into 1 Kb chunks
‣ Map the chunks against the Absence Assembly [SMALT aligner]
‣ Output the areas that did not map
‣ Filter noisy mappings
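The shredding and "not mapped" steps of the pipeline can be illustrated with a minimal Python sketch. It is not the scanPAV implementation: the function names `shred` and `unmapped_areas` are hypothetical, and the set of mapped chunk start positions is assumed to come from whatever aligner is used (SMALT on the slide), whose invocation is not shown.

```python
from typing import Iterator, List, Set, Tuple

CHUNK = 1000  # 1 Kb chunk size, as on the slide


def shred(seq: str, size: int = CHUNK) -> Iterator[Tuple[int, str]]:
    """Yield (start, chunk) pairs covering seq in consecutive windows."""
    for start in range(0, len(seq), size):
        yield start, seq[start:start + size]


def unmapped_areas(seq_len: int, mapped_starts: Set[int],
                   size: int = CHUNK) -> List[Tuple[int, int]]:
    """Merge consecutive unmapped chunks into half-open (start, end) areas.

    mapped_starts is an assumed input: the start coordinates of the chunks
    that the aligner placed on the Absence Assembly.
    """
    areas: List[Tuple[int, int]] = []
    for start in range(0, seq_len, size):
        if start in mapped_starts:
            continue
        end = min(start + size, seq_len)
        if areas and areas[-1][1] == start:
            # Extend the previous area when this chunk is adjacent to it.
            areas[-1] = (areas[-1][0], end)
        else:
            areas.append((start, end))
    return areas
```

For example, for a 4,500 bp scaffold where only the chunks starting at 0 and 3,000 mapped, `unmapped_areas(4500, {0, 3000})` reports two candidate PAV areas. Filtering of noisy mappings would happen before this step and is left out of the sketch.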
7. Human Assemblies
‣ HuRef (2007): 7.5x whole-genome shotgun Sanger 800 bp PE reads, Celera [Levy
S. et al., 2007, PLOS Biology, 5(10), e254]
‣ Hs2-HiC (2017): 60x Illumina 250 bp PE reads, DISCOVAR de novo, Hi-C data for
scaffolding [Dudchenko O. et al., 2017, Science, doi: 10.1126/science.aal3327]
‣ Illumina (2016): 39x short-insert and 24x long-insert Illumina 101 bp PE reads,
SPAdes, 10x Genomics and Bionano data for scaffolding [Mostovoy Y. et al., 2016,
Nature Methods, 13, 587]
‣ AK1 (2016): 101x PacBio long reads (mean 10 Kb), Falcon, polished with Quiver and
Pilon (Illumina), Bionano maps for scaffolding [Seo J.S. et al., 2016, Nature, 538, 243]
‣ ONT_30x (2017): 30x Oxford Nanopore long reads (mean 11 Kb), Canu, polished
with Pilon (Illumina). No scaffolds [Jain M. et al., 2017, bioRxiv, doi: 10.1101/128835]
‣ ONT_35x (2017): same as ONT_30x plus 5x extra-long Oxford Nanopore reads (N50
65 Kb), Canu, polished with Pilon (Illumina). No scaffolds [Jain M. et al., 2017,
bioRxiv, doi: 10.1101/128835]
14. Mis-joints analysis
‣ Map chromosome-level scaffolds against the reference
‣ If a scaffold maps to more than one chromosome: mis-joint
‣ If there are no chromosome-level scaffolds: assign contigs to chromosomes
[Figure labels: AK1 (SY-16, SY-2) against GRCh38 (Chr16, Chr2)]
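The decision rule above can be sketched in Python as follows. This is an illustration, not the code behind the slides: the input format (scaffold, reference chromosome, aligned length) is an assumption about what an alignment summary might look like, the name `find_misjoints` is hypothetical, and the 300 Kb cutoff is taken from the ">300 Kb" label used in the mis-joints table of this deck.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_MISJOINT_BP = 300_000  # cutoff taken from the ">300 Kb" table label

Hit = Tuple[str, str, int]  # (scaffold, reference chromosome, aligned bp)


def find_misjoints(hits: Iterable[Hit], min_bp: int = MIN_MISJOINT_BP
                   ) -> Dict[str, Tuple[str, List[Tuple[str, int]]]]:
    """Assign each scaffold to a chromosome and list candidate mis-joints.

    A scaffold is assigned to the chromosome holding most of its aligned
    bases (an assumption of this sketch); any other chromosome with more
    than min_bp aligned bases is reported as a candidate mis-joint.
    """
    per_scaffold: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for scaffold, chrom, length in hits:
        per_scaffold[scaffold][chrom] += length

    report = {}
    for scaffold, by_chr in per_scaffold.items():
        assigned = max(by_chr, key=by_chr.get)
        misjoints = sorted((c, n) for c, n in by_chr.items()
                           if c != assigned and n > min_bp)
        report[scaffold] = (assigned, misjoints)
    return report
```

With made-up input such as `[("s1", "16", 20_000_000), ("s1", "2", 1_000_000), ("s1", "5", 50_000)]`, scaffold `s1` is assigned to chromosome 16, the 1 Mb block on chromosome 2 is flagged, and the 50 Kb block on chromosome 5 falls below the cutoff.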
17. Mis-joints analysis
Mis-joints (>300 Kb) per chromosome-assigned scaffold
Assembly  Scaffold                 Assigned Chr  Mis-joint length (bp)  Mis-joint Chr  Avg id (%)
HuRef     -                        -             -                      -              -
Hs2-HiC   Hs2-HiC_hic_scaffold_9   19            300 K                  22             99.0
Hs2-HiC   Hs2-HiC_hic_scaffold_10  21            1.3 M                  22             99.3
AK1       synteny_group_16         16            16 M                   2              99.7
ONT_35x   synteny_group_1          1             3.9 M                  19             99.8
ONT_35x   synteny_group_1          1             565 K                  3              99.0
ONT_35x   synteny_group_2          2             1.8 M                  1              99.1
ONT_35x   synteny_group_4          4             316 K                  20             99.1
ONT_35x   synteny_group_4          4             543 K                  5              99.2
ONT_35x   synteny_group_5          5             2.2 M                  18             99.1
ONT_35x   synteny_group_6          6             3.3 M                  7              99.1
ONT_35x   synteny_group_8          8             3.7 M                  16             98.9
ONT_35x   synteny_group_10         10            5.8 M                  1              99.0
ONT_35x   synteny_group_10         10            2.0 M                  11             98.9
ONT_35x   synteny_group_12         12            13.9 M                 10             99.1
ONT_35x   synteny_group_13         13            4.2 M                  10             99.1
ONT_35x   synteny_group_15         15            2.0 M                  3              98.8
ONT_35x   synteny_group_16         16            1.4 M                  1              98.8
ONT_35x   synteny_group_22         22            342 K                  19             98.6
18. Mis-joints analysis
Fig. S2. Misplaced block visualisation for chromosome-assigned scaffolds in Hs2-HiC (left), AK1 (center) and ONT_35x (right). [Giordano et al.]
Table S3. Misjoint location in the original scaffolds for assemblies AK1 and ONT_35x. [Giordano et al.]
19. Mis-joints analysis
Assembly  Scaffold                 1st Chr  1st Chr length (bp)  Breaking area            2nd Chr  2nd Chr length (bp)
AK1       KV784727.1               16       16.9 M               16,390,000 – 16,390,001  2        16.3 M
ONT_35x   tig00001490_pilon_pilon  1        5.5 M                3,999,036 – 4,000,001    19       3.9 M
ONT_35x   tig01414718_pilon_pilon  1        23.6 M               23,710,000 – 23,713,101  3        561 K
ONT_35x   tig00000928_pilon_pilon  1        2.0 M                23,710,000 – 23,713,101  2        3.0 M
ONT_35x   tig00000326_pilon_pilon  4        10.2 M               330,000 – 365,771        20       322 K
ONT_35x   tig01414909_pilon_pilon  4        9.5 M                557,558 – 560,001        5        535 K
ONT_35x   tig01415181_pilon_pilon  5        9.2 M                9,740,000 – 9,741,384    18       2.2 M
ONT_35x   tig00000726_pilon_pilon  6        6.9 M                3,399,916 – 3,400,054    7        3.4 M
ONT_35x   tig00001250_pilon_pilon  8        4.7 M                4,820,000 – 4,820,001    16       3.7 M
ONT_35x   tig01414799_pilon_pilon  10       14.4 M               14,700,000 – 14,770,556  1        5.8 M
ONT_35x   tig01415009_pilon_pilon  10       9.1 M                2,037,882 – 2,040,001    11       2.0 M
ONT_35x   tig01414760_pilon_pilon  12       18.0 M               18,455,260 – 18,460,001  10       13.9 M
ONT_35x   tig01414699_pilon_pilon  13       45.2 M               46,129,991 – 46,131,368  10       4.2 M
ONT_35x   tig00002429_pilon_pilon  15       7.3 M                7,426,592 – 7,430,001    3        2.0 M
ONT_35x   tig00001215_pilon_pilon  16       3.9 M                1,719,551 – 2,010,001    1        1.6 M
ONT_35x   tig00000735_pilon_pilon  22       11.0 M               358,160 – 360,002        19       352 K
20. Tasmanian Devil’s
Transmissible Cancer
[Map labels: First observed in NE Tasmania in 1996; Tumour-free area]
Devil Facial Tumour (DFT):
‣ Large tumour growths on face and neck
‣ Transmissible by bite
‣ Affects mostly adult individuals
‣ Death occurs in 4-6 months
‣ Two genetically distinct cancers with similar phenotypes:
DFT1 and DFT2
21. Tasmanian Devil’s
Transmissible Cancer
A comparative characterisation of DFT1 and DFT2 has recently been
submitted to Science* [Stammnitz M.R. et al., The origins and
vulnerabilities of two transmissible cancers in Tasmanian devils].
PAV estimation is one of the many analyses presented, used both to
assess assembly completeness and to look for exposure to exogenous
agents that might have contributed to cancer development.
The scanPAV analysis involved:
‣ 2 Existing References: Ref-v7.1 and PSU
‣ 6 New Assemblies:
2 from healthy cells: 202H, 203H
4 from tumour cells: 202T, 203T, 86T, 88T
[Stammnitz M.R. et al., The origins and vulnerabilities of two
transmissible cancers in Tasmanian devils, in review @ Science]
22. Devil’s PAVs
The extracted PAV sequences were screened against the NCBI database:
- No foreign DNA was found in the references
- In 202T: 590 Kb of sequence belonging to the Mycoplasma arginini genome
- PAVs from all 6 new assemblies contained 1.9 Mb of the Streptococcus
pneumoniae genome
Both are well-known laboratory cell culture contaminants
Remaining PAVs are believed to be due to assembly errors
and ancestral variation
No exogenous contribution to the emergence of DFTs found
23. Summary
‣ Assessing the completeness and quality of a newly generated genome
assembly is a complex task that requires evaluating multiple metrics.
In the presence of multiple assemblies or of a reference assembly:
‣ extracting PAV sequences can help assess the strengths and weaknesses
of a novel assembly strategy or technology, and identify structural
variation and foreign DNA exposure
‣ possible mis-joints can be inferred by mapping the assembly to
the reference and identifying scaffolds that map to multiple chromosomes
‣ The scanPAV pipeline for pairwise extraction of PAVs between assemblies is available
at https://sourceforge.net/projects/phusion2/files/scanPAV
‣ In progress: adding BWA as an alternative aligner
24. Acknowledgments
Paul A. Kitts, National Center for Biotechnology Information, Bethesda, MD
Elizabeth P. Murchison, University of Cambridge, Cambridge, UK
Zemin Ning, Wellcome Trust Sanger Institute, Hinxton, UK
Maximilian R. Stammnitz, University of Cambridge, Cambridge, UK
Thank you!
28. Devil’s Filtered PAVs
PAVs absent in Ref-v7.1 (Mb)
Present in  All  Filtered
PSU         125  5
202T2       110  14
202H1       107  11
203T3       107  12
202H        109  12
86T         106  12
88H         106  13
PAV filter: filter out sequences missing from the Absent Assembly but present in its reads
‣ Map the Absent Assembly reads against Absent Assembly + PAVs
‣ Filter out PAVs with 10x depth
‣ Output: filtered PAVs
Most of the missing sequences are present in the
genome (reads) but absent from the assembly
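The depth-based PAV filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-base depth values are assumed to have been computed beforehand from the read mapping, "10x depth" is interpreted here as a mean depth of at least 10, and the name `filter_pavs` is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence

DEPTH_CUTOFF = 10.0  # "10x depth" from the slide, read here as mean depth >= 10


def filter_pavs(depths: Dict[str, Sequence[int]],
                cutoff: float = DEPTH_CUTOFF) -> List[str]:
    """Return the PAVs that are NOT supported by the Absent Assembly reads.

    depths maps a PAV name to its per-base read depth (assumed input).
    PAVs whose mean depth reaches the cutoff are present in the reads,
    so they are treated as missing from the assembly only and dropped.
    """
    kept = []
    for name, per_base in depths.items():
        if not per_base or mean(per_base) < cutoff:
            kept.append(name)
    return kept
```

A PAV with depths `[12, 15, 11]` would be filtered out as present in the reads, while one with depths `[0, 1, 2]` would be kept as a filtered PAV. Using the mean is a design choice of this sketch; a median or a covered-fraction criterion would be equally plausible.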