Taras Oleksyk at #ICG12: Innovative assembly strategy contributes to the understanding of evolution and conservation genetics of the critically endangered Solenodon paradoxus from the island of Hispaniola
Taras Oleksyk at the GigaScience Prize Track at ICG: Innovative assembly strategy contributes to the understanding of evolution and conservation genetics of the critically endangered Solenodon paradoxus from the island of Hispaniola, #ICG12 in Shenzhen, 26th October 2017
1. Novel assembly approach for the homozygous genomes and the conservation of critically endangered Solenodon paradoxus
Taras K Oleksyk et al.
3. Solenodon paradoxus
• One of only two critically endangered solenodon species, found on the largest Caribbean islands: Cuba and Hispaniola
• One of the most ancient branches among the placental mammals: divergence dates to the Cretaceous, ~76 MYA, before the extinction of the dinosaurs (Roca et al., 2004)
4. Questions
• What can millions of years of isolation do to a genome?
• Our earlier analysis (Brandt et al., 2016) supported the speciation at ~76 MYA originally proposed by Roca et al. (2004)
• but this was contested by a recent analysis of five nuclear genes, which put the split at <60 MYA and also suggested over-water dispersal
• Morphometric and mtDNA studies of the Hispaniolan solenodon suggest that southern and northern populations may represent distinct subspecies that split 171 KYA
• if confirmed, there is a need to define conservation units and describe the variation in each
5. Expedition members
• Juan Carlos Martinez-Cruzado – UPRM
• Yashira Afanador - UPRM
• Liz A. Paulino – INTEQ
• Adrell Núñez – ZooDom
• Nicolas and Yimel De J. Corona
7. Sequencing results
The genome size was estimated with KmerGenie at 2.06 Gbp.
to maximize information derived from data?
Province | Site | Coordinates | Sex | Weight (g) | Loc
North
Puerto Plata | Puerto Plata | Unknown | F | 886 | Zoo
Espaillat | Cordillera Septentrional | Unknown | - | - | Zoo
El Seybo | El Seybo | Unknown | M | 932 | Zoo
Higuey | La Altagracia | Unknown | M | 758 | Zoo
South
Pedernales | La Cañada del Verraco | N 18° 09' 9.64", W 71° 43' 12.0" | M | 579 | Wild, K
Pedernales | La Cañada del Verraco | N 18° 09' 9.64", W 71° 43' 12.0" | M | 1020 | Wild, L
Pedernales | El Manguito-1 | N 18° 06' 36.6", W 71° 43' 3.58" | M | 1270 | Wild, M
Pedernales | El Manguito-1 | N 18° 06' 36.6", W 71° 43' 3.58" | F | 1420 | Wild, N
Pedernales | El Manguito-2 | N 18° 07' 6.5", W 71° 43' 14.7" | F | 1120 | Wild, O
9. Choices for the assembly approach given the data
b | de Bruijn assembly. Reads are decomposed into overlapping k-mers. Contigs are formed by merging chains of k-mers until repeat boundaries are reached. If a k-mer appears multiple times, all duplicates are discarded.
c | String graph assembly. Align all the reads. Alignments that can be transitively inferred from all pairwise alignments are removed. A graph is created with a vertex for the endpoint of every read.
As a string/unitig graph encodes every valid assembly of reads, such a graph, if correct, is in fact a lossless representation of reads.
When there is allelic variation, alternative paths in the graph are formed.
Genetic variation and the de novo assembly of human genomes. Chaisson, Wilson, & Eichler. Nature Reviews Genetics 16, 627–640 (2015)
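To make the de Bruijn description above concrete, here is a minimal Python sketch. It is a toy illustration only, not how Fermi or SOAPdenovo2 are implemented: it ignores sequencing errors, reverse complements and isolated cycles, and the function names (kmers, de_bruijn, contigs) are made up for this example. Reads are cut into overlapping k-mers, identical k-mers collapse into a single edge, and contigs are read off by following unbranched chains until a branch or a dead end is reached.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # A k-mer seen many times is stored once: the set collapses duplicates.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def contigs(graph):
    """Spell maximal non-branching paths (isolated cycles are ignored here)."""
    indegree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    nodes = set(graph) | set(indegree)

    def is_simple(node):
        # Exactly one edge in and one edge out: the chain continues through it.
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    result = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # contigs start only at branch points or path ends
        for nxt in sorted(graph.get(node, ())):
            path = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            result.append(path)
    return result


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]  # hypothetical overlapping reads
    print(contigs(de_bruijn(reads, k=4)))
```

With these three made-up reads the chain is unbranched, so a single contig is produced; a repeated (k-1)-mer would create a branch and split the contig at that point.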
10. Comparative Assembly Results
Assembly Metrics
Contig assembly tool | Fermi (assemblies A, B) | SOAPdenovo2 (assemblies C, D)
Total contigs (>1,000 bp) | 71,429 | 189,566
Contig N50 | 54,944 | 4,048
Contig CEGMA (%) * | 96.37 (77.42) | 68.15 (33.06)
Contig BUSCO (%) | 86 (65) | 42 (21)
Assembly name | A | B | C | D
Scaffolding tool | SOAPdenovo2 | SSPACE | SOAPdenovo2 | SSPACE
Gap closing tool | GapCloser | GapCloser | GapCloser | GapCloser
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Li. Bioinformatics 28(14):1838-44 (2012)
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Luo, Liu, Xie, Li, Huang, Yuan, et al. GigaScience 1:18 (2012)
11. Comparative Assembly Results
Assembly Metrics
Contig assembly tool | Fermi (assemblies A, B) | SOAPdenovo2 (assemblies C, D)
Total contigs (>1,000 bp) | 71,429 | 189,566
Contig N50 | 54,944 | 4,048
Contig CEGMA (%) * | 96.37 (77.42) | 68.15 (33.06)
Contig BUSCO (%) | 86 (65) | 42 (21)
Gap closing tool: GapCloser
Assembly name | A | B | C | D
Scaffolding tool | SOAPdenovo2 | SSPACE | SOAPdenovo2 | SSPACE
Total scaffolds (>1,000 bp) | 14,417 | 40,372 | 20,466 | -
Final N50 | 555,585 | 110,915 | 331,639 | -
Final CEGMA (%) | 95.56 (81.85) | 95.97 (88.71) | 95.97 (90.73) | -
Final BUSCO (%) | 91 (74) | 86 (64) | 94 (80) | -
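Since N50 carries most of the comparison in these tables, a short Python sketch of how it is usually defined may help: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name n50_l50 and the example lengths are made up for illustration; the assemblers and QC tools named on the slides compute this themselves.

```python
def n50_l50(lengths, fraction=0.5):
    """Return (Nx, Lx): the length and the count of sequences at which the
    running total, longest first, reaches `fraction` of the assembly size."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no sequences given")


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths
    print(n50_l50(toy_lengths))                 # N50 and L50
    print(n50_l50(toy_lengths, fraction=0.9))   # N90 and L90
```

On the contig rows above, the ratio of the two N50 values is 54,944 / 4,048, or roughly 13.6.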
12. Distribution of gene prediction support (panels A, B)
Proteins of four reference species, Sorex araneus, Erinaceus europaeus, Homo sapiens and Mus musculus, were aligned to a S. paradoxus assembly with Exonerate, with a maximum of three hits per protein.
Coding sequences (CDS) were cut, clustered and passed to AUGUSTUS. Proteins from the predicted genes were aligned by HMMER and BLAST to the Pfam and Swiss-Prot databases. Only the genes supported by hits to protein databases and hints were retained.
Significantly more transcripts have higher hint support in assembly B.
13. Comparative Assembly Results
Assembly Metrics
Contig assembly tool | Fermi (assemblies A, B) | SOAPdenovo2 (assemblies C, D)
Total contigs (>1,000 bp) | 71,429 | 189,566
Contig N50 | 54,944 | 4,048
Assembly name | A | B | C | D
Scaffolding tool | SOAPdenovo2 | SSPACE | SOAPdenovo2 | SSPACE
Total scaffolds (>1,000 bp) | 14,417 | 40,372 | 20,466 | -
Scaffold N50 | 555,585 | 110,915 | 331,639 | -
REAPR error-free bases (%) | 96.46 | 95.35 | 94.98 | -
REAPR low-scoring regions | 18 | 16 | 71 | -
REAPR incorrectly oriented reads | 11,543 | 5,329 | 28,964 | -
14. Comparing assemblies
Approach | Issues | Assembly A | Assembly B | Assembly C
REAPR | Low-scoring regions | 18 | 16 | 71
REAPR | Incorrectly oriented reads | 11,543 | 5,329 | 28,964
Progressive Cactus | Inversions | 87 | 34 | 81
Progressive Cactus | Translocations | 5 | 0 | 2
Applying “Occam’s Razor”
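One way to read this table is as a simple tally: for each issue, which assembly has the lowest count? The Python sketch below does only that, using the numbers from the table. It is an illustrative reading of the "Occam's Razor" idea and not the procedure used in the study; the dictionary keys and function names are invented for this example.

```python
# Issue counts copied from the table above (REAPR and Progressive Cactus rows).
ISSUES = {
    "A": {"low_scoring_regions": 18, "misoriented_reads": 11543,
          "inversions": 87, "translocations": 5},
    "B": {"low_scoring_regions": 16, "misoriented_reads": 5329,
          "inversions": 34, "translocations": 0},
    "C": {"low_scoring_regions": 71, "misoriented_reads": 28964,
          "inversions": 81, "translocations": 2},
}


def best_per_metric(issues):
    """For each issue type, name the assembly with the fewest reported issues."""
    metrics = next(iter(issues.values())).keys()
    return {m: min(issues, key=lambda name: issues[name][m]) for m in metrics}


def tally_wins(issues):
    """Count on how many issue types each assembly has the lowest count."""
    wins = {name: 0 for name in issues}
    for winner in best_per_metric(issues).values():
        wins[winner] += 1
    return wins


if __name__ == "__main__":
    print(best_per_metric(ISSUES))
    print(tally_wins(ISSUES))
```

A plain tally treats every issue type as equally important, which is a simplifying assumption; a weighted score would be a different design choice.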
15. Assembly B seems to be the best assembly
• 3x fewer contigs >1,000 bp and 14x larger N50
• Scaffolds are shorter, but contain fewer low-scoring regions and incorrectly oriented reads
• Has fewer inversions and translocations when compared to another genome
• Contains more transcripts with higher hint support
• More support available
16. The inferred divergence time of S. paradoxus from other mammals is 73.6 Mya (95% confidence interval of 61.4-88.2 Mya): confirmed
• Divergence time estimates are based on four-fold degenerate sites and on fossil-based priors
• The 95% confidence intervals are given in square brackets and depicted as semitransparent boxes around the nodes
• Confirmed the Roca et al., 2004, and Brandt et al., 2016, estimates
17. Homozygosity & demographic history
• Solenodon is among the most homozygous mammals known, with variation at least at the level of the Amur tiger
• the real number is probably lower, since this estimate is based on the combined genome of five individuals
• Patterns of SNP variation allowed us to infer population demography, which indicated that the northern and southern subspecies split at least 300 Kya.
• Also: annotations of the genome (genes, repeats), signatures of selection, evolution of venom genes
• Developed population markers (microsatellites) for conservation studies
18. Assembly B makes the next assembly possible
Metric | Short Read Input Assembly | Dovetail HiRise Assembly
Total Length | 2,049.42 Mb | 2,053.16 Mb
L50/N50 | 5,328 scaffolds; 0.111 Mb | 16 scaffolds; 42.790 Mb
L90/N90 | 19,167 scaffolds; 0.028 Mb | 51 scaffolds; 7.507 Mb
The genome size: 2.06 Gbp
Estimated physical coverage (1-100 kb pairs): 116.57X
Collaboration: Harris Lewin
19. Why stop here?
• Putting together a reference-quality genome
• Comparative genomics – Cuban solenodon genome
• Understanding island genome evolution
• Population and conservation genomics
20. Thank you
• Sergey Kliver
• Pavel Dobrynin
• Aleksey Komissarov
• Ksenia Krasheninnikova
• Stephen J. O’Brien
• Kirill Grigorev
• Yashira M. Afanador
• Walter Wolfsberger
• Audrey J. Majeske
• Juan Carlos Martinez-Cruzado
• Liz A. Paulino
• Rosanna Carreras
• Luis E. Rodríguez
• Adrell Núñez
• David Hernández-Martich
• Filipe Silva
• Agostinho Antunes
NSF project #1432092
• Alfred L. Roca
• Adam Brandt
Editor's Notes
I want to thank the organizers, and especially the GigaScience team - I am very happy to be here at BGI today.
I would like to start by introducing the place where I work. My university is located on the Caribbean island of Puerto Rico. You may have heard that on September 20 it was hit by the Category 5 Hurricane Maria, which destroyed most of the infrastructure on the island. I am very lucky to be here today, but the majority of the people who remained on the island are currently without power, and many are without clean water. I will go back to Puerto Rico after this meeting and try to rebuild the lab where this work was done.
The current study is about an animal that lives on another island, Hispaniola, which is also known in some languages as Haiti.
This island has a unique island fauna, that includes one of the most enigmatic mammalian species not known from anywhere else – the solenodon.
Solenodons represent one of the most ancient branches of placental mammals and today we can find only two species – one in Cuba and one on Hispaniola.
It is a curious-looking animal: it has a very long rostrum, which indicates a very good sense of smell, and specialized teeth that carry venomous saliva.
A study published in Nature in 2004, which used fragments of mitochondrial DNA, suggested that these species split over 76 MYA and that they survived the demise of the dinosaurs while on these islands.
It is therefore a very interesting species for studying what can happen to a genome after millions of years of isolation.
In an earlier paper we supported the earlier speciation date, but it has recently been contested by a paper that suggested another date, implying that the speciation happened somewhere else.
We also wanted to validate whether there is a division between solenodons on the island, as was suggested by the morphometric and mtDNA studies, including our own.
So we set up an expedition to the island of Hispaniola. We reached out to our colleagues at the National Zoo and the Institute of Technology in the Dominican Republic. We also spent several months getting all the required permits and getting the procedures approved.
We also found local guides that were very knowledgeable about the species.
The guides showed us the most effective and the most humane way to catch the animal. It is a nocturnal animal, so to catch it you have to study its trails during the day, and then go into the jungle at night and listen in silence for hours.
Then you can hear it walking in the dark, rustling the leaf litter with its snout, searching for food. When the noise gets close enough, you turn on the lamp, run after it and catch it by the tail. This is a venomous animal, so grabbing it by the tail is the safest way to handle it. Then a veterinarian would take a blood sample, and the animals were released within minutes.
I must have had a difficult time explaining this bit to the press, because this is the interpretation that came out and was picked up in a very long discussion thread on Reddit.
We sampled solenodons so that there would be samples from both supposed subspecies, in the north and in the south.
However, by the time the DNA made it to the US, most of the northern samples were too low and were rejected.
We then sent the five southern samples for sequencing.
Unfortunately, the coverage we received was very, very low.
At this point I thought I had failed. We sponsored the expedition out of pocket, and the sequencing was done on a very tight budget, so no more data were possible any time soon.
However, sometimes a study limitation actually presents an opportunity. We had a very brave idea that we were not sure was going to work, but we tried it anyway.
We thought: well, if one sample does not give you enough coverage, why not combine the data?
We reasoned that the genome may be very homozygous, so mixing individuals would not introduce much heterogeneity and may in fact increase the coverage.
To our surprise, there was even less heterogeneity than we originally thought. Here is the distribution of k-mers (fragments of the same size), indicating that the average coverage over the genome is 25x, and the bump introduced by the heterozygotes (x=5) is very, very small, meaning that these individuals have very similar genomes.
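For readers who have not seen a k-mer distribution before, the following Python sketch shows the basic idea under simplifying assumptions: count every k-mer in the reads, build a histogram of how many distinct k-mers occur at each depth, take the main peak as the coverage, and divide the total number of k-mers by that peak to get a rough genome size. This is the simple peak-based estimate, not KmerGenie's statistical model, and the function names and the min_depth cutoff are assumptions made for this example.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map depth -> number of distinct k-mers observed at that depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_depth=1):
    """Return (peak depth, rough genome size in k-mers).

    Depths below `min_depth` are skipped; in real data they are dominated
    by sequencing errors. The size is total k-mers divided by peak depth.
    """
    usable = {depth: n for depth, n in spectrum.items() if depth >= min_depth}
    peak = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return peak, total_kmers // peak


if __name__ == "__main__":
    toy_genome = "ACGTTGCAAGTC"   # hypothetical sequence with no repeated 4-mers
    reads = [toy_genome] * 25     # idealised, error-free 25x coverage
    spectrum = kmer_spectrum(reads, k=4)
    print(dict(spectrum))
    print(estimate_genome_size(spectrum))
```

In a heterozygous sample a second, smaller peak appears at a lower depth than the main one; the notes above describe that bump as very small in the combined solenodon data.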
We also thought about the choice of assembly approach, and could not come to an agreement about which algorithm to use.
On the one hand, there is the standard and proven approach of SOAPdenovo2, the de Bruijn approach, which breaks the genome into k-mers and aligns only the perfect fragments into contigs. All the fragments that are not aligned are then thrown away, so it is in essence a way of reducing data. We were not sure we could reduce our data any more, so we wanted to try an alternative.
On the other hand, there is the string graph approach, which aligns full-length fragments without shortening them and then finds the longest paths through the alignments to form contigs.
We devised a comparison table for these approaches.
We put together a comparison table with different options for the de Bruijn and string graph assemblies: A, B, C and D.
The assembly results for the string graph surprised us by showing a roughly 14x higher N50 for the contigs (54,944 versus 4,048).
The subsequent scaffolding showed higher values for the de Bruijn approach, but it was producing scaffolds with large gaps.
We wanted to see how these assemblies performed in finding genes for further analysis. We trained the algorithms with hints from 4 other mammalian species, and then found support for our hints in protein databases.
Again, the string graph assembly performed a lot better, finding genes with more support.
We further looked at the numbers of errors, low-scoring regions and incorrectly oriented reads, and again found better support for assembly B.
Finally, we used the parsimony principle by comparing our assemblies to the same genome, in this case the shrew.
Our assumption was that the best assembly would be the most similar in the comparison, because additional differences in other assemblies are likely to be artificially introduced by the assembly process.
Again, assembly B performed the best.
So in conclusion, we think that the string graph approach was very good at putting together our combined data.
We thought it was worth sharing this with others and submitted our results in a paper to GigaScience.
But what about the questions we asked in the beginning?
Using genome data, we confirmed our earlier estimate for Solenodon divergence.
It is indeed one of the oldest branches on the eutherian mammal tree and thus provides very valuable data for comparative genomics studies.
We showed that this animal is extremely homozygous, on the level of some of the most homozygous animals known.
We also confirmed that the two subspecies have existed separately for at least 300,000 years and have different demographic histories.
And finally, new opportunities arise when you keep working.
We have already added new data for the project, using Dovetail Genomics in collaboration with UC Davis. The scaffolds that have been produced span entire chromosomes.
We are so excited about this that we are already planning our next expedition, this time to Cuba, to hopefully add more data from the second species, and hopefully I will be able to tell you that story next time I am here at BGI