Application of Site-
and Time-
Heterogeneous
Evolutionary Models
to Mitochondrial
Phylogenomics
Andreia Reis
Master's thesis submitted to the
Faculty of Science, University of Porto in
Biodiversity,Genetics and Evolution
2014
ApplicationofSite-andTime-Heterogeneous
EvolutionaryModelstoMitochondrialPhylogenomics
AndreiaReis
MSc
2.º
CICLO
FCUP
2014
MS
cSc
MSc
Application of Site-
and Time-
Heterogeneous
Evolutionary Models
to Mitochondrial
Phylogenomics
Andreia Reis
Master of Biodiversity, Genetics and Evolution
Biology Department
2014
Supervisor
Miguel M Fonseca, PhD, Fac.Ciências Univ. Porto /Univ. Vigo
Co-supervisor
Rui Faria, PhD, Fac.Ciências Univ. Porto /Univ. Pompeu Fabra
MS
cin
Todas as correções determinadas
pelo júri, e só essas, foram efetuadas.
O Presidente do Júri,
Porto, ______/______/_________
MS
c
iii
To the
memory of my beloved
Grandfather,
José Reis.
iv
v
“Persistence is the way to success.”
-Charles Darwin
vi
vii
Acknowledgments
This study is not the result of an individual effort but of a set of collective efforts from people
that i consider important. Without them it would have been much harder to reach the end.
Therefore, I would like to express my full gratitude to some people that helped me travelling
through this “road full of boulders”, which was one of the most challenging journeys of my
professional and personal lives.
A special thanks to my supervisors, Miguel M. Fonseca and Rui Faria, for their leadership,
support, motivation and huge patience. All advices were essential to be able to make it
through yet another stage of my career. Since this work was initially planned within the
framework of a research project financed by the FCT and FEDER (ref.: PTDC/BIA-
EVF/113805/2009), I would also like to thank these institutions to make this work possible.
I would also like to give a HUGE thanks to Graciela Sotelo for all support that she gave
me. She was always available to clarify my doubts and teach me important things for my
future research life.
To Diana Costa, I would like to thank the good vibes, advices, the rides to CIBIO ;), the
funny talks, the not so funny talks (but also needed), the beginning of a friendship.
To Prof. Dr. Marta W. Vasconcelos, thanks for helping me opening my mind to the world of
science, for her important advices and friendship.
"All the riches of the world are not worth a good friend." – Voltaire. I have a small group of
good friends that are my anchor to who I would like to thank, Ana Marta Fonseca, Sofia
Barreira, Ricardo Pinto, Tiago Carvalho, Pedro Brinca, Inês Beato and Adriano
Mendes, who attending me in times of distress, anxiety, insecurity, exhaustion and
satisfaction.
To Dr. Amélia Martins and Dr. Ana Queiros, thanks for their support, especially during
bad times, and for their friendship.
To Prof. Cecília Ferreira da Rocha, thanks for all she taught me (and still does) during all
these years.
Thanks to my Master Colleagues, who in one way or another, helped each other during
the Master degree!!! :)
viii
ix
Contents
Abstract xi
Resumo xii
List of Figures xiv
List of Tables xv
Abbreviations xvi
Introduction 1
1. Models of Sequence Evolution in Phylogenetics 3
1.1. Phylogenetic Inference 3
1.2. Heterogeneity of Molecular Sequence Evolution 4
1.2.1. Heterogeneity across Time 4
1.2.2. Heterogeneity across Sites 6
1.2.2.1. Substitution Rates Heterogeneity 6
1.2.2.2. Amino Acid Empirical Mixture Models 6
1.2.3. Heterogeneity across Sites and Time 7
1.3. Strand Asymmetry and Replication 7
1.4. Gene Rearrangements 9
2. Mitochondrial DNA 11
2.1. MtDNA Structure and Organization 11
2.2. MtDNA Strand Asymmetry and Replication 13
2.3. Gene Rearrangements 13
3. The Controversial Phylogeny of the Echinoderms 14
3.1. Morphological Data 18
3.2. Molecular Data 18
3.3. MtDNA Gene Rearrangements 19
3.4. Final Considerations 20
4. Objectives 21
Material and Methods 23
1. Retrieval of Molecular Data 25
2. Analysis of the Mitogenome Architecture 25
2.1. Comparison of Gene Orders 26
2.2. Prediction of Origins of Replication 26
x
2.3. Assessment of Compositional Heterogeneity 27
3. Phylogenetic Analysis 27
3.1. Sequence Alignment 27
3.2. Trimming of the Alignments 28
3.3. Selection of the Best Fit Model 28
3.4. Phylogenetic Inference 29
3.5. Comparison of Phylogenetic Methods 30
Results 31
1. Mitogenome Architecture 33
1.1. Mitogenome Annotation 33
1.2. Gene Order Comparison 35
1.3. Prediction of Replication Origins 37
1.4. Assessment of the Degree of Compositional Heterogeneity 39
2. Comparison of Phylogenetic Inference Methods 42
Discussion 51
References 57
Appendix 75
xi
Abstract
Recent advances in sequencing technologies have encouraged the use of whole-
genome mitochondrial DNA sequences (mitogenomics) for phylogenetic inference. This
increasing amount of information generally results in the recovery of more accurate
phylogenetic trees. However, heterogeneous nucleotide composition across sites and
taxa (due to, for example, gene rearrangements and strand asymmetry) can lead to
erroneous phylogenetic inferences. To work around this problem, several evolutionary
models have been developed that take into account the site- and time- heterogeneity,
but they have not been widely used.
The main objective of this project was to test the performance of these
heterogeneous evolutionary models using the complete mitogenome of 29 species from
all extant classes of Echinodermata. We chose this dataset because their available
mitogenome sequences show strong heterogeneous patterns in terms of nucleotide
composition and substitution rates, thus allowing a proper evaluation of those models
and also because the relationships between Echinodermata are still a debated issue.
Before inferring the phylogeny of these taxa, we made a characterization of their
mitogenomes in order to identify the potential causes of such heterogeneity. To do so,
we analyzed the mitogenome organization, searched for putative replication origins, and
estimated the nucleotide compositional heterogeneity.
The gene orders of the echinoderm mitogenomes support the monophyly of each
class. If we consider Crinoidea to have the ancestral gene order in echinoderms, the
phylogenetic tree based on gene arrangements supports the Cryptosyringid hypothesis
((Echinoidea + Holothuroidea); Asteroidea; Ophiuroidea; Crinoidea). Regarding putative
replication origins, we were able to identify, not in many though, stable structures across
different echinoderm species/classes. Our nucleotide sequence analyses confirm that
mitochondrial genes in Echinodermata have a strong heterogeneous composition within and
between genes across these taxa. Finally, when applying homogeneous and heterogeneous
models (of amino acid sequence evolution) for phylogenetic reconstruction, only the CAT
model is able to correct the known long-branch attraction artifact caused by the accelerated
evolutionary rates of the Ophiuroidea sequences. Our results are mainly in agreement with
previous studies based on the analysis of complete mitogenomes, but they also support
unexpected relationships between classes, such as Asteroidea - Echinoidea, not recovered
before attending to morphological traits or nuclear DNA fragments as phylogenetic
characters.
Keywords: compositional heterogeneity; DNA mitochondrial replication; echinoderms;
invertebrates; mitogenomics; mixture models; phylogeny.
xii
Resumo
Os avanços recentes em tecnologias de sequenciação têm fomentado o uso de
sequências do genoma complete de DNA mitocondrial para inferência filogenética.
Apesar deste aumento da quantidade de informação geralmente resultar na
recuperação de árvores filogenéticas mais precisas, a composição heterogénea nos
nucleótidos (devido, por exemplo, rearranjos de genes e assimetria de cadeias) pode
levar a inferências filogenéticas erróneas. Para contornar este problema
desenvolveram-se vários modelos evolutivos que têm em conta a heterogeneidade
composicional entre nucleótidos (ao longo de um gene/genoma) e taxa (em diferente
linhagens). No entanto, estes não são ainda amplamente utilizados.
O principal objetivo deste projecto é então testar o desempenho destes modelos
evolutivos heterogéneos utilizando os mitogenomas completos de 29 espécies de todas
as classes existentes no filo Echinodermata. Este grupo foi escolhido como objecto de
estudo, não só porque as sequências mitogenómicas disponíveis para este grupo
mostram fortes padrões heterogéneos em termos de composição e taxas de
substituição nucleotídica, oferecendo uma oportunidade única para testar esses
modelos, mas também porque as relações filogenéticas entre as classes de
equinodermes é ainda um tema em debate. Antes de inferir a filogenia deste grupo,
caracterizou-se os seus mitogenomas com o fim de identificar potenciais causas da sua
heterogeneidade. Para isso, realizaram-se análises comparativas da posição dos
diferentes genes ao longo do mitogenoma, inferiram-se as possiveis origens de
replicação, e estimou-se a heterogeneidade composicional dos mitogenomas utilizados
neste estudo.
A organização dos genes nos mitogenomas dos equinodermes confirma a
monofila das diferentes classes. Se considerarmos a classe Crinoidea como ancestral
em relação aos restantes equinodermes, a árvore filogenética baseada em rearranjos
parece apoiar a hipótese Cryptosyringid ((Echinoidea + Holothuroidea); Asteroidea;
Ophiuroidea; Crinoidea). Em relação as origens de replicação, detectaram-se estruturas
estáveis em diferentes espécies/classes de equinodermes. A análise de sequência de
nucleótidos confirma que os genes mitocondriais nos equinodermes têm uma forte
heterogeneidade composicional, quer dentro de cada gene, quer entre genes dos taxa
analisados. Finalmente, quanto à aplicação de modelos homogéneos e heterogéneos
para a reconstrução filogenética, apenas o modelo CAT foi capaz de corrigir o
conhecido long-branch attraction causado pela aceleração das taxas evolutivas das
sequências da classe Ophiuroidea. Embora os nossos resultados estejam de acordo
com estudos anteriores (também baseados na análise de mitogenomas), estes parecem
também apoiar algumas relações inesperadas entre as classes de equinodermes, tais
xiii
como Asteroidea + Echinoidea, as quais não tinham sido anteriormente evidenciadas
através de análises baseadas em caracteres morfológicos ou DNA nuclear como
marcadores filogenéticos.
Palavras-Chave: composição heterogénea; equinodermes; filogenia; invertebrados;
mitogenómica; mixture models; replicação do DNA mitocondrial.
xiv
List of Figures
Figure 1 Schematic representation of the replication of the vertebrate
mitochondrial genome.
8
Figure 2 Representation of different types of rearrangements for a hypothetical
genome consisting of five genes.
10
Figure 3 Representation of the mitochondria 11
Figure 4 Example of an invertebrate mitochondrial genome (from Arbacia lixula
with 15719bp)
12
Figure 5 The two hypotheses for the Echinodermata phylogeny. 21
Figure 6 Evolution of gene orders in the echinoderm lineages using the typical
crinoid gene order as the reference order (inferred with CREx).
36
Figure 7 Phylogenetic tree constructed with CREx based on genome order
differences (pairwise-distance) among the Echinodermata species,
resulting from gene rearrangements.
36
Figure 8 Putative replication origins inferred for the mitogenomes of a) Asterias
amurensis, b) Acanthaster brevispinus and d) Patiria pectinifera.
39
Figure 9 Plots of the GC and AT skews at the 4-fold sites of mitochondrial
protein-coding genes for all Echinodermata species included in this
study.
40
Figure 10 Phylogenetic tree based on the concatenated amino acid sequences
of thirteen echinoderm mitochondrial genes using Hemichordata as
outgroup.
44
Figure 11 Heatmap of the inferred phylogenies from the outgroup dataset. 45
Figure 12 Heatmap of the inferred phylogenies between the ingroup and
outgroup datasets using PhyML.
46
Figure 13 Heatmap of the inferred phylogenies between the ingroup and
outgroup datasets using MrBayes.
47
Figure 14 Dendrogram of the inferred phylogenies for the concatenated outgroup
dataset using different models or approaches (based on normalized
Robinson-Foulds values).
48
Figure 15 Dendrogram of the inferred phylogenies for the concatenated ingroup
dataset using different models or approaches (based on normalized
Robinson-Foulds values).
48
Figure 16 Heatmap of dendrograms for PhyloBayes results (ingroup vs.
outgroup).
49
xv
List of Tables
Table 1 Summary of previous phylogenetic studies on Echinodermata. 15
Table 2 Summary of mitogenome annotation according to different approaches. 33
Table 3
Putative replication origins inferred for the Echinodermata mitogenomes
used in this study.
38
Table 4 Summary of general results. 43
xvi
Abbreviations
A Adenine
ATP Adenosine triphosphate
ATP6 ATPase complex subunit 6
ATP8 ATPase complex subunit 8
BI Bayesian Inference
BP Breakpoints
C Cytosine
COX1 Cytochrome c oxidase subunit I
COX2 Cytochrome c oxidase subunit II
COX3 Cytochrome c oxidase subunit III
CYTB Cytochrome b
DNA Deoxyribonucleic acid
G Guanine
H-strand Heavy strand
LBA Long branch attraction
L-strand Light strand
lrRNA Large ribosomal RNA
ML Maximum likelihood
mtDNA Mitochondrial DNA
ND1 NADH dehydrogenase subunit 1
ND2 NADH dehydrogenase subunit 2
ND3 NADH dehydrogenase subunit 3
ND4 NADH dehydrogenase subunit 4
ND4L NADH dehydrogenase subunit 4L
ND5 NADH dehydrogenase subunit 5
ND6 NADH dehydrogenase subunit 6
NJ Neighbor Joining
OH Replication origin on heavy strand
OL Replication origin on light strand
OR Replication origin
RF Robinson-Foulds
xvii
RITOLS RNA incorporation throughout the lagging strand
RNA Ribonucleic acid
rRNA Ribosomal RNA
SCM Strand-coupled model
SDM Strand displacement model
srRNA Small ribosomal RNA
T Thymine
TDRL Tandem duplication random loss
tRNA Transfer RNA
tRNA-Ala tRNA alanine
tRNA-Asn tRNA asparagine
tRNA-Asp tRNA aspartic acid
tRNA-Arg tRNA arginine
tRNA-Cys tRNA cysteine
tRNA-Gln tRNA glutamine
tRNA-Glu tRNA glutamic acid
tRNA-Gly tRNA glycine
tRNA-His tRNA histidine
tRNA-Ile tRNA isoleucine
tRNA-Leu tRNA leucine
tRNA-Lys tRNA lysine
tRNA-Met tRNA methionine
tRNA-Phe tRNA phenylalanine
tRNA-Pro tRNA proline
tRNA-Ser tRNA serine
tRNA-Thr tRNA threonine
tRNA-Trp tRNA tryptophan
tRNA-Tyr tRNA tyrosine
tRNA-Val tRNA valine
CHAPTER 1
INTRODUCTION
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
3
1. Models of Sequence Evolution in Phylogenetics
Darwin realized in The Origin of Species that all forms of life on Earth evolved from a
common ancestor and that their relationships could be graphically represented by a tree
(Darwin, 1859). More than 150 years later, despite the enormous quantity and quality of data
provided by the advances in molecular biology, the phylogenetics community is still pursuing
the general goals of reconstructing the tree of life and of contributing to understand the
processes that govern evolution and the origin of biodiversity (Gesell, 2009). Although, the
increasing amount of available data reduces the influence of random errors on phylogenetic
inference, many phylogenies are still poorly resolved and the use of different reconstruction
methodologies often produces contradictory topologies (Nesnidal et al., 2010; Weisrock,
2012). While much of this uncertainty stems from the nature of the evolutionary process (e.g.
radiations), an important source of incongruence are the systematic errors (Jeffroy et al.,
2006; Philippe et al., 2005) or artifacts inherent to phylogenetic tree-reconstruction, which
result from the inability of a method (or model) to deal with biases in a dataset, originating
misleading phylogenetic inferences.
1.1. Phylogenetic Inference
The ultimate goal of molecular phylogenetics is to understand the historical relationships
between taxa using the information contained on genes or the entire genome
(phylogenomics). Several methods of phylogenetic inference were developed, but the most
commonly used are Neighbor-Joining (NJ; Saitou and Nei, 1987), Maximum Parsimony (MP;
Fitch, 1971), Maximum Likelihood (ML; Felsenstein, 1981) and Bayesian Inference (BI;
Larget et al., 1999). These phylogenetic methods have their own particularities, but they all
rely on the usage of explicit statistical model(s) of sequence (nucleotide, codon or amino acid)
evolution. These models describe the probability that one state (e.g. nucleotide, amino acid)
changes to another; generalized in the so-called transition rate matrix (here, the term
transition refers to the change between states and not to the exchange between purines or
pyrimidines). They are expected to reflect the evolutionary pattern of the sequences being
analyzed and are crucial to estimate the evolutionary parameters used for accurate
phylogenetic tree reconstruction. The first proposed models were very simple and based on
strong assumptions. For example, the nucleotide substitution model proposed by Jukes and
Cantor (1969) assumes that the base frequencies are equal among sequences and that all
4
substitutions have the same relative rate (AC=AG=AT=CG=CT=GT). In contrast to DNA
substitution models, models of protein evolution, or amino acid replacement, were initially
built using an empirical approach, i.e. derived from large numbers of empirical alignments
(Dayhoff et al., 1978; Jones et al., 1992; Adachi and Hasegawa, 1996). Since these matrices
contain a large number of parameters (20-state replacement matrix, they are usually
selected before the phylogenetic analysis of interest is executed.
These “traditional” models of molecular evolution were mainly developed assuming that
the evolutionary process is globally stationary, homogeneous and reversible (Yang, 1996).
Stationary meaning that DNA or protein composition (nucleotide or amino acid equilibrium
frequencies) is the same at all sites of a sequence and along the tree (stationary hypothesis).
Homogeneous meaning that overall nucleotide or amino acid substitution rates are constant
across all branches of the tree (homogeneity hypothesis). Reversible meaning that the
probability of change from state i to state j is equal to the probability of going from state j to
state i. It has been shown, however, that many sequences violate one or more of these
assumptions, which ultimately may result in erroneous phylogenetic reconstructions (Galtier
and Gouy, 1995; Yang, 1996; Cox et al., 2008; Dutheil and Boussau, 2008; Philippe et al.,
2011; Groussin et al., 2013).
1.2. Heterogeneity of Molecular Sequence Evolution
Heterogeneity in molecular evolution manifests itself in various ways: along branches (or
throughout time), across sites and genes, or both (across sites and time).
1.2.1. Heterogeneity across Time
Substitution rates can vary across lineages. Felsenstein (1978) was among the first to
observe that fast evolving sequences, which have long branches, are easily grouped
together in phylogenetic reconstructions, even if they are not related. Failure to handle
evolutionary heterogeneous substitution rates or compositional heterogeneity across
lineages, the probability of homoplasy (i.e. convergence of substitution patterns in unrelated
sequences) increases, which leads to Long Branch Attraction (LBA) artifact (Felsenstein,
1978; Bergsten, 2005). Similarly, when a distant outgroup is used, a divergent species may
be attracted by the long-branch separating the ingroup and the outgroup and thus be
“artifactually” placed at a basal position of the phylogenetic tree (Philippe et al., 1998;
Bergsten, 2005). Heterogeneous sequence composition or accelerated rates of evolution
underlying LBA can be originated, for instance, by variation in gene order or changes in the
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
5
transcription polarity of a gene. In the mitochondrial genome, these events are particularly
important because each strand has a strong-specific bias (Adachi and Hasegawa, 1996),
which is also related to the genomic location of genes (Faith and Pollock, 2003). Genes
coded in one strand will contain the compositional bias of that same strand, whereas genes
that reversed their coding polarity will show inverted compositional bias (Hassanin et al.,
2005; Fonseca et al., 2008). Thus, the branching pattern of a phylogenetic tree can be
strongly affected by strand bias reversal (an extreme example of compositional
heterogeneity) or significant variation in AT and GC content (Schierwater et al., 2013).
Several complementary approaches have been applied to overcome systematic errors
caused by heterogeneous evolution over the phylogenetic tree (Rodriguez-Ezpeleta et al.,
2007) and to deal with inconsistent phylogenetic reconstruction (Sheffield et al., 2009). A
distance-based method known as LogDet/Paralinear distances (Lake, 1994; Lockhart et al.,
1994) is often used. Another approach, which aims to homogenize the composition between
sequences, is data recoding. For nucleotide sequences, RY coding (Woese et al., 1991)
consists in replacing nucleotides A and G by R (purines) and C and T by Y (pyrimidines), so
that only transversion events are considered and the compositional bias posteriorly
decreases. In an analogous way, the Dayhoff coding system has been proposed for amino
acid sequences (Hrdy et al., 2004). More generally, this problem can also be accommodated
by removing saturated sites from the analysis such as 3rd
codon positions (Swofford et al.,
1996; Delsuc et al., 2002), by including a better taxon sampling (carefully selection of the
outgroup) or removing fast-evolving species (Aguinaldo et al., 1997), genes (Philippe et al.,
2005), or sequence positions (Brinkmann and Philippe, 1999; Burleigh and Mathews, 2004).
Although these methods can decrease the compositional bias, they may also reduce the
phylogenetic resolution (Weisrock, 2012). More recently, complex statistical approaches
have been implemented in several phylogenetic programs, under either Bayesian (p4,
PHASE and PhyloBayes) (Foster, 2004; Gowri-Shankar and Rattray, 2007; Blanquart and
Lartillot, 2006) or Maximum Likelihood (nhPhyML) (Boussau and Gouy, 2006; Galtier and
Gouy, 1995) frameworks, to overcome the problem of time or compositional heterogeneity
using more realistic evolutionary models. However, the number of studies applying these
methods to real data is still scarce (Loomis and Smith, 1990; Galtier and Gouy, 1995;
Herbeck et al., 2005). Moreover, most have focused on the effect of compositional bias in
relatively few genes, while the effect of complex heterogeneous data sets (e.g. entire
genomes) on phylogenetic inference has not been thoroughly explored, as these methods
are computationally extremely demanding due to the large number of parameters they
include (Sheffield et al., 2009).
6
1.2.2. Heterogeneity across Sites
1.2.2.1 Substitution Rates Heterogeneity
Another well-known source of heterogeneity is the variation of the substitution rates
among different sites in a sequence (Uzzell and Corbin, 1971), which has long been
recognized as a characteristic of sequence evolution, especially when coding for biological
products (Uzzell and Corbin, 1971; Fitch, 1986; Kocher and Wilson, 1991). This might be
caused by variation in functional constraints across sites and genes (Huelsenbeck and
Suchard, 2007) or by stochastic fluctuations (if substitutions are equally tolerated in a given
position, e.g. redundant sites of protein coding genes). The substitution rates across sites
have been widely studied and remain an important subject in phylogenetic reconstruction
(Guindon and Gascuel, 2002). If substitution rate variation is present but ignored,
model-based tree-reconstruction methods can be quite misleading (Yang, 1996). Thus,
different substitution rates (between nucleotides or amino acids) are now usually
incorporated into evolutionary models by setting a discrete gamma-distribution of rates of
variable sites (Swofford et al., 1996), as well as a proportion of invariable sites (Whelan et al.,
2001).
1.2.2.2. Amino Acid Empirical Mixture Models
Several studies have shown that homologous sequences can vary widely in their
nucleotide or amino acid composition (Lockhart et al., 1994; Foster and Hickey, 1999; Foster,
2004; Collins et al., 2005). This compositional heterogeneity generally implies that the
sequences in the dataset are not evolving under the global conditions of stationary,
reversibility and homogeneity (SRH) (Bryant et al., 2005; Jermiin et al., 2008). To deal with
this heterogeneity across sites, the data can be spatially split into different partitions (e.g.
partitions by gene or codon position), with each partition having an independent model of
evolution (Bull et al., 1993; Nylander et al., 2004). An alternative solution, suitable for small
alignments (i.e. single-genes) of amino acid sequences, is to use empirical mixture models,
which have fixed and pre-determined parameters, estimated based on large databases of
multiple sequence alignments (Lartillot and Philippe, 2004). These matrices (e.g. C20-C60,
Quang et al., 2008; WLSR5, Wang et al., 2008), unlike the classical empirical matrices (JTT,
WAG or LG), can have more than one compositional profile (i.e. more than one equilibrium
frequency for each amino acid) and more than one replacement matrix. These profile mixture
models provide a better statistical fit than classical empirical matrices, especially on
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
7
saturated data (Quang et al., 2008) and were shown to reduce the LBA problem in simulated
and real genomic datasets (Wang et al., 2008). Additionally, in a given protein sequence,
different sites can have distinct structural states (e.g. exposed or hidden). Likewise, a given
homologous position (in a multiple sequence alignment) might show a tendency for some
amino acids. Altogether, this complex evolution of real protein sequences may not be well
captured by empirical matrices. In this context, the CAT model was developed to better
adjust to the complexity of real protein sequence data. For each position, it can define a
distinct profile or group several positions into one profile. These profiles are directly inferred
from the data (Lartillot and Philippe, 2004) and then combined with a matrix of substitution
rates that can be fixed to uniform values (CAT-Poisson), derived from empirical matrices (e.g.
CAT-JTT or CAT-LG) or even inferred from the data (CAT-GTR). CAT-like models were
shown to have a better fit than standard single-matrix models and to be less prone to
systematic errors such as LBA (Lartillot and Philippe, 2004; Lartillot et al., 2007; Philippe et
al., 2011).
1.2.3. Heterogeneity across Sites and Time
As we mentioned in the previous subsections, DNA and protein sequences may have
heterogeneous evolutionary patterns along time or across the sequence. However, both
patterns are not mutually exclusive and hence sequences are expected to display both
evolutionary heterogeneities simultaneously. The first model jointly considering time- and
sequence- heterogeneity was proposed by Blanquart and Lartillot (2008). They combined the
site-heterogeneous CAT model (Lartillot and Philippe, 2004) with the time-heterogeneous
model BP (Blanquart and Lartillot, 2006). The BP model introduces a stochastic process that
acts along lineages to incorporate the possibility of having significant objective compositional
shifts in the tree (referred as BreakPoints). The authors showed that the CAT+BP model
indeed allows for base frequencies variation across time and sites simultaneously and are
able to recover the true topology of a tree, for which the CAT or the BP failed when used
separately. Importantly, this model can be applied to DNA or amino acid sequences
(Blanquart and Lartillot, 2008).
1.3. Strand Asymmetry and Replication
Patterns of substitution are expected to be symmetric, according to Chargaff’s second
parity rule ([A]=[T] and [G]=[C]), only if complementary changes occur at equal frequencies
8
(Nikolaou and Almirantis, 2006). However, sequence comparisons reveal that mutation rates,
in addition to selective constraints, are not uniformly distributed throughout the genome
(Francino and Ochman, 1997), resulting in strand asymmetry (or strand compositional bias).
In mitochondrial genomes (mitogenomes) of vertebrates and invertebrates, the strand bias is
more pronounced at fourfold redundant sites of protein coding genes (4-fold). Given that
selective pressures are weaker in these sites than in other codon positions, strand
asymmetries should be caused, to a large extent, by mutational mechanisms (e.g. replication,
transcription or DNA repair). However, several studies (e.g. Reyes et al., 1998; Hassanin et
al., 2005) support replication as the main process responsible for shaping the compositional
bias in mitogenomes (Figure 1). During replication (as in transcription), one of the strands is
temporarily in a single-stranded state and thereby more exposed to DNA damage than
double-stranded DNA (Francino and Ochman, 1997; Hassanin et al., 2005). Specifically,
single-stranded DNA is more susceptible to hydrolytic deaminations of cytosine (C→T) and
adenine (A→G), which cause the accumulation of thymine and guanine on the exposed
strand (Figure 1). Indeed, mitogenomes have one strand that is GT-rich (named the
Heavy-strand) and one that is GT-poor (named the Light-strand) (Asakawa et al., 1991;
Reyes et al., 1998). Transcription may also originate nucleotide composition variation in both
strands, but to a lesser extent (Francino and Ochman, 1997; Hassanin et al., 2005).
Figure 1 | Schematic representation of the replication of the vertebrate mitochondrial genome.
Mutations that may occur more often in single-stranded DNA are also represented. The single-stranded
DNA is particularly susceptible to deamination, especially inside mitochondria, where there is a high
concentration of reactive oxygen species that damage DNA. This systematic exposure to DNA damage
would lead to the accumulation of T and G in the H-strand. The oxidation of G is less common but also
occurs more frequently in single-stranded DNA, producing 8-hydroxyguanine, which pairs with A rather than
C on the L-strand (G decreases and T increases on the H-strand) (Fonseca, 2011).
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
9
Strand asymmetry is usually measured by AT and GC skews, as expressed by
(A-T)/(A+T) and (G-C)/(G+C), respectively (Perna and Kocher, 1995). Positive AT skew
values indicate more A than T and positive GC skew values more G than C, and vice versa.
Strand asymmetries can be generally stable in some taxa such as vertebrates, in which the
vast majority of genes show positive AT and negative GC skews. In vertebrates, only the
ND6 gene shows the opposite compositional bias because its coding polarity is the opposite
from all the other protein-coding genes. However, in invertebrates, strand specific
asymmetries are more variable, mainly because gene rearrangements (particularly gene
inversions) are much more frequent in this group. Likewise, in invertebrates, the H-strand
origins of replication also switch their orientation in many taxa (Hassanin et al., 2005;
Scouras and Smith, 2006). Since the strand bias is related with the orientation of the
replication process, a reverse replication mechanism will also lead to an inverse
compositional bias. For example, all available mitogenomes of crinoids (Echinodermata;
Scouras and Smith, 2006) and some species in Cephalochordata (Kon et al., 2007) have
generalized reversed strand asymmetry, when compared with the remaining mitogenomes of
the same taxonomic group (phylum or subphylum, respectively). That occurs due to an
inversion of the replication origin (the control region) and consequently of the replication
direction (the H-strand became the L-strand and vice versa) (Hassanin et al., 2005; Kilpert
and Podsiadlowski, 2006; Fonseca et al., 2008). In conclusion, compositional
heterogeneities can also result from switches in coding polarities of genes and from reversal
replication process of mitochondrial genomes. If a given gene (or group of genes) is inverted
in some taxa, this will contribute to increase the heterogeneity of the patterns of sequence
evolution within a dataset. In these scenarios, classical homogeneous models are not
expected to capture the complex sequence evolution, resulting in misleading phylogenetic
relationships. On the contrary, one expects that heterogeneous models can incorporate the
time- and sequence-heterogeneity of the sequences and correctly infer the phylogenetic
relationships between taxa.
1.4. Gene Rearrangements
The analysis of gene orders is another source of phylogenetic information that is
becoming increasingly used (Boore et al., 1995; Sankoff et al., 1992). A comparison of the
gene order between closely related species can reveal different types of rearrangements
such as inversions (Asakawa et al., 1995), transpositions (Macey et al., 1997), reverse
transpositions (Boore et al., 1998), and tandem duplications followed by the random loss
10
(TDRL) of one of the copied genes (Boore, 2000) (Figure 2). As well, a variety of possible
mechanisms that led to them have been proposed (Boore, 2000), some related to the
replication process, others with enzymatic errors, or with illegitimate recombination (Mueller
and Boore, 2005).
Figure 2 | Representation of different types of rearrangements for a hypothetical genome consisting
of five genes. From left to right: inversion, transposition, reverse transposition, tandem duplication random
loss (Bernt et al., 2013b).
It is also important to note that rearrangement rates are not only unevenly distributed in
the genome, but also across taxa. In mitochondria, nucleotide (and amino acid) composition
varies with the position in the genome (because there is a spatial compositional gradient
along the genome generated by the replication mechanism). When a gene changes its
location, it will be positioned in a genomic region with a different mutational pressure. Thus, if
homologous genes in different species have different locations, they might show
heterogeneous patterns of evolution. The same can happen when a gene changes its coding
strand or if the replication mechanism inverts. Differences in gene order can be estimated
and incorporated in standard frameworks for phylogenetic reconstruction, e.g., by means of
distance methods, MP, and ML approaches, as well as with different aims (e.g., evolutionary
relationships or ancestral states reconstruction). However, it remains unclear what is the most
adequate approach (Bernt et al., 2013a).
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
11
2. Mitochondrial DNA
The eukaryotic cells are composed by many organelles, with the mitochondria being
one of the most fundamental (Boyce et al., 1989; Lang et al., 1999; McBride et al., 2006).
Among the several functions of mitochondria (Figure 3), ATP production (i.e. energy) and
regulation of cellular metabolism are of primary relevance (Ricci, 2009; Bernt et al., 2013b).
Figure 3 | Representation of the mitochondria. Image adapted from Lodish (2008).
2.1. MtDNA Structure and Organization
The mitochondrion contains its own genome (mitogenome or mtDNA) (Schatz, 1963),
and reproduces independently of the cell in which it is found which is the major evidence
of the origin of eukaryotic cells by endosymbiosis (Gray et al., 1999). In terms of structure
and gene content, animal mitogenomes are more stable than other eukaryotic groups (e.g.
the phyla Porifera, Nematoda, Cnidaria) (Wolstenholme, 1992; Boore, 1999). The typical
mtDNA is a small and circular double-stranded DNA molecule of about 15-20 kb that
contains 37 genes (Boore, 1999) (Figure 4). Although some larger mitochondrial genomes
have been found, it is believed that they are the product of duplications in some portions of
mtDNA (Boyce et al., 1989). The mtDNA carries 13 protein-coding genes: the cytochrome c
oxidase subunits I, II, and III (COX1, COX2, and COX3) and the cytochrome b (CYTB), the
ATPase complex subunits 6 and 8 (ATP6 and ATP8), and the NADH dehydrogenase
subunits 1–6 and 4L (ND1–ND6 and ND4L).
12
Figure 4 | Example of an invertebrate mitochondrial genome (from Arbacia lixula with 15719bp).
Circular genome image generated with OGDraw v1.2 http://ogdraw.mpimp-golm.mpg.de/.
The mitogenome is highly compact, with no introns between the coding regions
(Anderson et al., 1981). Additionally, it contains non-coding regions involved in the
transcription of the mitochondrial-encoded proteins: 22 transfer RNA (tRNA) genes (one for
each amino acid, except leucine and serine, which have two copies each); and two
ribosomal RNA genes: the small or 12S subunit (srRNA) and the large or 16S subunit (lrRNA)
(Attardi, 1985; Taanman, 1999; Hassanin et al., 2005). Finally, the mitogenome also
contains one or more regions (non-coding) that regulate the initiation of replication and
transcription (Boore, 1999; Taanman, 1999; Hassanin et al., 2005). The two strands of the
mtDNA molecule encode genes. Other features of the mtDNA, specifically: the small size of
the molecule, its abundance in eukaryotic animal tissues, the consistent orthology of the
genes illustrated by the presence of the same genes in different taxa, its mostly uniparental
inheritance and the absence of recombination (Elson and Lightowlers, 2006); contributed
for making this molecule the marker of choice when it comes to infer the origin and
evolution of the main groups of organisms, mainly in Metazoa (Fonseca, 2011). On top of
this, the huge amount of publically available sequence data for mitochondrial genes,
recently prompted by the development of next-generation sequencing (NGS) tools for
entire mitogenomes, offers a unique opportunity to study not only the relationships among
organisms based on sequence data but also on changes in gene order (synteny),
contributing to important advances in phylogenetics and the emergence of phylogenomics.
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
13
2.2. MtDNA Strand Asymmetry and Replication
As previously mentioned, the two strands of the mtDNA molecule can be distinguished
on the basis of G+T base composition, resulting in strand asymmetry, in which one of the
strands is GT-rich (H-strand) and the other is GT-poor (L-strand) (Taanman, 1999;
Hassanin et al., 2005). The replication mechanism of mtDNA has been suggested to be the
main responsible for the observed asymmetries in nucleotide composition between strands
(Reyes et al., 1998; Faith and Pollock, 2003). During this process, a strand in
single-stranded state is more exposed to DNA damage, facing higher mutational pressures
(Hassanin et al., 2005; Francino and Ochman, 1997). Consequently, since genes closer to
the replication origin of the H-strand remain exposed to mutation for a longer period of time
during replication, a positive correlation between strand asymmetry and the time spent in
single-stranded state is generally expected (Reyes et al., 1998; Wei et al., 2010).
2.3. Gene Rearrangements
Genome rearrangement rates and nucleotide substitution frequency are generally
correlated (Shao et al., 2003; Xu et al., 2006). This correlation might imply a cause/effect
relationship between them or, alternatively, it can result from another mechanism that
accelerates both processes (e.g. DNA replication; Xu et al., 2006). However, in the case of
mitochondrial genomes, this correlation has not been thoroughly investigated (Weng et al.,
2013). Changes in mitochondrial gene content, i.e. deletions (Lavrov and Brown, 2001) or
duplications (Zhong et al., 2008), and changes in chromosome organization have been
widely reported, especially in metazoan mitogenomes (i.e. in insects - C. vestalis, B.
macrocnemis and C. bidentatus - Wei et al., 2010), often affecting the origins of replication
(San Mauro et al., 2006) but also other regions throughout the genome such as tRNAs
(Dowton and Austin, 1999). In early studies of mitogenomic rearrangements, tRNA genes
were found to be more frequently involved in rearrangements than protein and rRNA genes
(Wolstenholme, 1992; Macey et al., 1997). Although the phylogenetic studies based on
gene rearrangements comparisons has been increasing (Scouras and Smith, 2001;
Perseke et al., 2008; Shen et al., 2009; Perseke et al., 2010; Perseke et al., 2013), this can
originate different (contradictory) phylogenetic patterns than those obtained using other
information (Boore and Brown, 1998; Castresana et al., 1998).
14
3. The Controversial Phylogeny of the Echinoderms
Echinodermata belongs to the superphylum Deuterostomia, together with two other phyla:
Chordata (Craniata, Cephalochordata and Urochordata) and Hemichordata (Swalla and Smith,
2008; Perseke et al., 2013). The phylogenetic relationships among deuterostomes differ between
studies, but the most recent molecular evidences support Echinodermata and Hemichordata as
sister taxa (Bourlat et al., 2009; Perseke et al., 2013). This pairing is also supported by
morphological data (Turbeville et al., 1994; Gee, 1996; Lambert, 2005), ribosomal gene sequences
(Turbeville et al., 1994; Wada and Satoh, 1994; Cameron, 2000), shared Hox gene motifs
(Peterson, 2004), mitochondrial genes and gene arrangements (Bromham and Degnan, 1999;
Lavrov and Lang, 2005). Similarly to the deuterostomes, the evolutionary and phylogenetic
relationships among the echinoderm classes have also been highly debated (Littlewood et al.,
1997; Perseke et al., 2010; Kondo and Akasaka, 2012). There are approximately 7000 extant
echinoderm species (Amemiya et al., 2005), falling into five well-defined taxonomic classes:
Crinoidea (sea lilies), Ophiuroidea (brittlestars), Asteroidea (starfish), Echinoidea (sea urchins),
and Holothuroidea (sea cucumbers) (Littlewood et al., 1997; Amemiya et al., 2005). Their
phylogeny was previously inferred using a cladistic approach primarily based on morphologic
characters (Smith, 1988; Bottjer, 2006) and later on molecular data (Smith, 1992; Perseke et al.,
2010) or both (Littlewood et al., 1997; Janies, 2001). Table 1 shows a chronological summary
of the most relevant phylogenetic studies on Echinodermata.
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
15
Table 1 | Summary of previous phylogenetic studies on Echinodermata. The kind of data used, methods applied and topologies recovered are indicated.
Abbreviations: A- Asteroidea; C- Crinoidea; E- Echinoidea; H- Holothuroidea; O- Ophiuroidea; (1) DNA amplification; (2) not specified.
Reference Data Phylogenetic Method Topology
Smith, 1984 Morphological (larval and adult) Maximum Parsimony
Smith et al., 1993 Molecular (mitochondrial gene arrangements) (1)
H-E pairing
O-Apairing
Pearse and Pearse, 1994 Morphological (adult) (2)
Wada and Satoh, 1994 Molecular (18S rDNA)
Maximum Likelihood
Neighbour Joining
16
Littlewood et al., 1997 Morphological (larval and adult) + Molecular (18S, 28S rDNA)
Maximum Parsimony
Maximum Likelihood
Scouras and Smith, 2001
Molecular (COX1, 2 and 3; mitochondrial gene arrangements ), except
Ophiuroidea
LogDet
Maximum Likelihood
Janies, 2001 Morphological (larval and adult) + Molecular (18S, 28S rDNA) Maximum Parsimony
Mallatt and Winchell, 2007 Molecular (18S, 28S rDNA)
Maximum Parsimony
Maximum Likelihood
Bayesian Inference
Perseke et al., 2008 Molecular (mitochondrial genome; mitochondrial gene arrangements)
Neighbor Joining
Maximum Parsimony
Maximum Likelihood
Bayesian Inference
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
17
Shen et al., 2009 Molecular (CYTB; mitochondrial gene arrangements) Bayesian Inference
Perseke et al., 2010 Molecular (mitochondrial genome; mitochondrial gene arrangements)
Neighbor Joining
Maximum Parsimony
Maximum Likelihood
Bayesian Inference
Pisani et al., 2012
Molecular (nuclear rDNAand protein-coding genes), except
Holothuroidea
Neighbor Joining
Maximum Parsimony
Maximum Likelihood
Bayesian Inference
Perseke et al., 2013 Molecular (mitochondrial genome mitochondrial gene arrangements) Maximum Likelihood
Telford et al., 2014 Morphological (larval and adult) + Molecular (nuclear genes) Bayesian Inference
18
3.1. Morphological Data
Smith (1984) was the first to study the phylogeny of Echinoderms using anatomical and
developmental characters from both larvae and adults, since previous studies showed that larval
characters alone and the fossil characters of the echinoderms could contribute for misleading
evolutionary relationships between echinoderm classes. Under a parsimony analysis, the
obtained topology (Table 1) supported the relationships between four of five classes of
echinoderms, with the position of holothuroids being less clear (low support). This group
(holothuroids) shares a number of derived characters with ophiuroids and echinoids and several
others only with echinoids, while there are a number of other derived characters that are common
to asteroids, ophiuroids and echinoids (Smith, 1984). A few years later, Strathmann (1988)
reanalyzed larvae morphological characters but found no supported topology; whereas Pearse
and Pearse (1994) presented a phylogenetic analysis based on anatomical characters of adult
individuals supporting crinoids as sister group of the clades formed by the echinoids-holothuroids
and of the ophiuroids-asteroid pairs. All these studies suggested that crinoids are the
representatives of extant echinoderms that diverged earlier from the rest, but conflicting results
still exist for the relationships between the remaining classes, which can result from a
misinterpretation of morphological characters (Smith, 1984) (Table 1). For example, different
studies presented different topologies using the same morphological characters dataset
because authors had different opinions regarding character homology and polarity
(Littlewood, 1995).
3.2. Molecular Data
The increasing number of new sequences publicly available and novel phylogenetic
inference methods and tools made researchers hopeful in resolving the inconsistencies of the
echinoderm phylogeny (Table 1). Smith and co-workers (1993) analyzed the first almost-complete
dataset of mitochondrial genes within echinoderms and, by comparing the mitochondrial gene
order from all classes except Crinoidea, observed a closer relationship between holothurians and
echinoids and between ophiurids and asteroids. The subsequent analysis of 18S and/or 28S
ribosomal molecular data using different phylogenetic methods was unsuccessful in providing a
definitive topology for the relationships between echinoderm classes (Wada and Satoh, 1994;
Littlewood, 1995; Littlewood et al., 1997; Janies, 2001). After obtaining the complete mitochondrial
gene order/sequence of the crinoid Florometra serratissima, Scouras and Smith (2001) used the
mitochondrial gene order to construct a phylogeny that recovered a closer relationship between
the echinoid and the holothuroid classes with the asteroids outside this clade. However, since
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
19
Ophiuroidea was not included, it was impossible to evaluate the relative support for all
Echinodermata classes. The analysis of Shen and co-workers (2009), based on the CYTB gene
revealed the phylogenetic relationship presented on Table 1. The authors encouraged increasing
the number of mitochondrial genomes to resolve some unanswered questions though. Pisani and
co-workers (2012) were the first to use evolutionary models that take into account compositional
heterogeneity (CAT-GTR model), but their results were not conclusive because they did not use
all extant echinoderm classes. The analyses of mitochondrial genomes (based on sequence and
gene order information) using different phylogenetic methods by Perseke and co-workers (2008,
2010, 2013) resulted in conflicting topologies (e.g. Perseke et al., 2008 did not find support for the
Holothuroidea and Echinoidea grouping). More recently, Tellford and co-workers (2014) used 219
nuclear genes to perform a phylogenetic analysis using the evolutionary model CAT-GTR and
obtained good support for the Holothuroidea-Echinoidea and for the
Ophiuroidea-Asteroidea pairings, with Crinoidea as the first diverging lineage respect to
these two groups.
3.3. MtDNA Gene Rearrangements
Since mitochondrial gene rearrangements are much more slow-evolving than nucleotide
sequences (Brown, 1985), they are thought to be more informative in establishing deep
relationships, such as those between echinoderm classes (Boore et al., 1995). Given the rarity of
these events, one expects that a rearrangement shared by two organisms generally imply
common ancestry (Boore et al., 1995). Nonetheless, caution must be taken if one suspects of
independent origin for a given rearrangement or if multiple rearrangements occur (Scouras and
Smith, 2001). The different studies using this kind of information identified gene
rearrangements within all Echinodermata classes except echinoids. Comparing the different
classes, the echinoids and some holothuroids share an identical mitochondrial gene order
(Scouras and Smith, 2001). Moreover, Smith (1993) and others described a mtDNA inversion
that differentiates Asteroidea from Echinoidea/Holothuroidea. It is approximately 4.6kb long,
contains 14 tRNA genes, the lrRNA, and ND1 and ND2 protein-coding genes, and is present
in all seven Asteroidea species. Finally, the ophiuroid gene order differs from the one of
asteroids in the junction between ND1 and COX1 genes, containing substantially fewer tRNA
genes (Smith et al., 1993).
20
3.4. Final Considerations
As previously illustrated, the phylogenetic analyses of the Echinodermata did not provide a
conclusive result regarding their relationships. Although the monophyly of each class is usually
supported, the construction of phylogenetic trees with different genes, taxa and methods of
analysis often give contradictory outcomes (Kondo and Akasaka, 2012). The highly derived
morphology (Smith, 1984; Smith et al., 1993; Pearse and Pearse, 1994), the fast evolving
mitochondrial genomes and the complex gene arrangements varying both within and between
echinoderm classes, all have a significant impact on the phylogenetic reconstruction of this group
(Brown, 1985; Boore et al., 1995; Scouras et al., 2004). However, summarizing all these studies
(Table 1), most authors accept a close relationship between echinoids and holothurians and a
basal position of crinoids. This is supported by morphological characters (Smith, 1984; Smith et al.,
1993; Pearse and Pearse, 1994) or/and nuclear rRNA sequences (Wada and Satoh, 1999;
Littlewood et al., 1997; Janies, 2001; Mallatt and Winchell, 2007; Telford et al., 2014). However,
the use of complete mtDNA data revealed a basal position of the Ophiuroidea class instead of
Crinoidea (Perseke et al., 2008, 2013), which might be related with the fact that Ophiuroidea show
high substitution rates at mtDNA genes. However, it should also be noted that all crinoid
mitogenomes show a strand-specific nucleotide compositions/bias that is the opposite of the other
echinoderms studied so far. This is most likely caused by an inversion encompassing the origin of
replication in crinoid mtDNA (Perseke et al., 2013). Another important debate concerns the
relative position of asteroids. Taking into account these different studies, most authors propose
two hypotheses (Cryptosyringid and Asterozoan hypotheses; Figure 5) for the echinoderms
phylogeny. The Cryptosyringid hypothesis has Crinoidea as basal, Asteroidea as the second
class to diverge from the remaining ones, followed by Ophiuroidea, which is the sister class of the
group formed by Echinoidea and Holothuroidea. The Asterozoan hypothesis also considers
Crinoidea as the basal class, while Echinoidea - Holothuroidea and Ophiuroidea - Asteroidea
form two different internal clades. These two hypotheses are almost equally supported when
using molecular and/or morphological data (Smith, 1984; Pearse and Pearse, 1994; Wada
and Satoh, 1994; Littlewood et al., 1997; Janies, 2001; Mallett and Winchell, 2007; Shen et
al., 2009; Telford et al., 2014). Regarding the mitochondrial gene rearrangements, were
found patterns within and between classes of Echinodermata (Perseke et al., 2008, 2010).
Although within Echinoidea the gene order is conserved (Jacobs, 1988; De Giorgi, 1996),
one cannot say the same for the remaining classes (Asakawa et al., 1995; Scouras and
Smith, 2001; Scouras et al., 2004; Perseke et al., 2008; Shen, 2009; Perseke et al., 2010).
The gene order of Crinoidea has been proposed as the ancestral pattern of echinoderms
(Perseke et al., 2008) followed by Ophiuroidea and then the Asteroidea that which is related
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Introduction
21
to the clade Echinoidea-Holothuroidea (Scouras and Smith, 2001; Perseke et al., 2008;
Perseke et al., 2010).
Figure 5 | The two hypotheses for the Echinodermata phylogeny. Adapted from Smith, 1984; Pearse
and Pearse, 1994; Wada and Satoh, 1994; Littlewood et al., 1997; Janies, 2001; Mallett and Winchell, 2007;
Shen et al., 2009; Telford et al., 2014.
4. Objectives
The main objective of this study is to test the performance of site- and
time-heterogeneous evolutionary models in a dataset that is heterogeneous along its
alignment, by violating the conditions of stationary and homogeneity and, by definition, of
reversibility. By using empirical data (mitogenomes of echinoderms), we aim to assess the
effects of gene rearrangements and subsequent nucleotide composition changes in
phylogenetic inference – the so-called Long-Branch Attraction (LBA) artifact - and thus to
explore the full potential of mitogenomics. Like this, we hope to shed light into some
important issues:
i) Evaluate if methods that incorporate time - and compositional - heterogeneous
evolutionary models (empirical mixture models) improve the performance of phylogenetic
inference, when compared with more simple models (classical empirical models).
ii) Revaluate the phylogenetic relationships between echinoderm classes, using
complex heterogeneous evolutionary models of evolution on a dataset (mitogenomes)
composed by all species of Echinodermata.
Asterozoan
Cyptosyringid
CHAPTER 2
MATERIAL AND METHODS
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Material and Methods
25
This thesis is divided in two main parts. In the first part, we aimed to characterize the
mitogenomes of Echinodermata species by analyzing the mitogenome architecture,
investigating the phylogenetic signal of gene rearrangements and assessing compositional
heterogeneity of the protein-coding genes. In the second part, we reconstructed the
phylogeny of this group using homogeneous and (site- and time) heterogeneous
evolutionary models to test if the later help to surpass the limitations of previous phylogenetic
models.
1. Retrieval of Molecular Data
The data used in this study consists of the complete mitochondrial genome from 29
species representing all extant classes of Echinodermata plus three species from the phylum
Hemichordata (http://www.ncbi.nlm.nih.gov/, last accessed in October 2012) (see
Appendix-A.1), the sister group of Echinodermata according to previous studies (Bromham
and Degnan, 1999; Cameron et al., 2000; Peterson, 2003; Perseke et al., 2013; Cannon et
al., 2014).
We started by retrieving all publicly available complete mitochondrial DNA sequences for
Metazoa (2677 mitogenomes) from GenBank (NCBI ftp: ftp.ncbi.nih.gov/
genomes/MITOCHONDRIA/Metazoa/all.gbk.tar.gz). Then, we filtered the GenBank files for
the complete mitogenomes of Echinodermata species (29 mitogenomes) using custom
scripts (see Appendix-A.2).
2. Analysis of the Mitogenome Architecture
Reliable and standard genome annotations are indispensable for comparative analysis,
in particular for the studies on genomic rearrangements (Lupi et al., 2010). The most widely
used and updated resource for annotated mitogenome sequences is the GenBank sequence
database.
New tools have been created to annotate genomic sequences. Among these, there are
two very popular web tools specific for mitogenomes: DOGMA (Dual Organellar GenoMe
Annotator) (Wyman et al., 2004) and MITOS (MITOchondrial genome annotation Server)
(Bernt et al., 2013c). Since none of these tools is able to identify with absolute certainty all
the features each genome contains, we have used both, and their results were then
compared with the original GenBank annotations.
26
2.1. Comparison of Gene Order
Four rearrangement operations, at least, are reported in the literature to be relevant for
the study of mitochondrial gene order evolution (Boore, 1999): inversions, transpositions,
reverse transpositions and tandem duplications-random losses (TDRL).
CREx (Common interval Rearrangements Explorer -
http://pacosy.informatik.uni-leipzig.de/crex/form) (Bernt et al., 2007) is a web service tool for
the identification of common rearrangement events (the four types mentioned before)
between several genomes. It computes a distance matrix between the genomes according to
different methods such as the breakpoint distance, the reversal distance or the number of
common intervals (Bernt et al., 2007; Fertin, 2009). The breakpoint distance is the simplest
approach and has been used for a long time, while the two other distances, are able to
capture more subtle similarities. Moreover, according to Pevzner (2000) and Bernt and
Middendorf (2011), the breakpoint distance method does not allow a reliable inference of the
evolutionary scenarios underlying the observed rearrangements. As well, the reversal
distance is considered somewhat limited in terms of combining multiple genome
rearrangement events. Therefore, we compared the gene order (tRNAs and protein-coding
genes) between all available Echinodermata mitogenomes using the number of common
intervals implemented in CREx (Bernt et al., 2007) as distance measure, and inferred the
most parsimonious evolutionary scenarios to explain the observed differences (Bernt et al.,
2007). Within each class, rearrangements were inferred by comparing the gene order of the
different species with that of the basal taxon of that same class (Scouras and Smith, 2001;
Fan et al., 2011; Perseke et al., 2008; Perseke et al., 2010). The rearrangements between
the different classes were inferred by comparing the gene order of the most basal taxon of
each class with the proposed ancestral gene order of the Echinodermata clade (Florometra
serratissima – Crinoidea, according to Perseke et al. (2010)).
2.2. Prediction of Origins of Replication
For the mitogenome of each Echinodermata species, we inferred the probable positions
of origins of replication. The replication origins (or replication-related elements) are expected
to be located in A+T rich regions (Hassanin et al., 2005; Wei et al., 2010), to form
stem-and-loop (hairpin) structures and to typically contain tandem-repeat sequences. We
extracted all possible mtDNA fragments from 15 up to 45bp from both strands, assuming that
putative origins of replication should form small and simple hairpin structures (i.e. with a
single stem and a single loop). For each fragment, the secondary structure and respective
entropy were calculated using RNAstructure v5.6 (Reuter and Mathews, 2010); while we
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Material and Methods
27
added some important features (i.e. unpaired code, number of ATGCs on hairpin, number of
pyrimidines and purines on 3’end and 5’end of stem) with customized Perl scripts. Finally, we
filtered the resultant hairpins using R scripts. Since there is only a few studies about the
origin of replication in invertebrates, especially in echinoderm species, we used the filter
criteria generally applied for vertebrates. However, we applied two sets of filters. The first
(more stringent) consisted of: i) entropy bellow -8.0 kcal/mol, ii) loop containing at least one T,
iii) preference for no unpaired nucleotides in the stem, iv) 5'end of the stem rich in TC
(pyrimidines) and containing at least one C, v) 3'end of the stem rich in AG (purines) and
containing at least one G, and vi) hairpin located in a non-coding region (Fonseca et al.,
2014). The second (more relaxed) consisted of: i) entropy below -8.0 kcal/mol, ii) unpaired
nucleotides allowed in the stem, ii) 5'end of the stem rich in TC (pyrimidines) but without
restrictions in the number of Cs, iii) 3'end of the stem rich in AG (purines) but without
restrictions in the number of Gs, and iv) hairpins located in a non-coding region.
2.3. Assessment of Compositional Heterogeneity
For every taxon, we calculated the base composition skew asymmetry of each of the 13
protein-coding genes as AT-skew [(A-T)/(A + T)] and GC-skew [(G-C)/(G + C)] (Perna and
Kocher, 1995) using MEGA v5.0 (Tamura et al., 2011). The AT- and GC-skews were
computed using only the 4-fold redundant sites of the protein-coding genes. Then, for each
species we mapped the skews along the mitogenome to examine the distribution pattern of
the compositional bias.
3. Phylogenetic Analysis
3.1. Sequence Alignment
After retrieving the complete mitochondrial DNA sequences of the 29 echinoderm and
the three hemichordate species, a Perl script was written (see Appendix-A.3) to filter the
GenBank files keeping only the protein coding genes (ATP6, ATP8, COX1, COX2, COX3,
CYTB, ND1, ND2, ND3, ND4, ND4L, ND5 and ND6) in individual fasta files. In addition, we
created two different datasets, one only containing the protein-coding genes of
Echinodermata species, which we named the ingroup dataset; and another, the outgroup
dataset, composed by protein-coding genes of Echinodermata and Hemichordata species.
For both datasets, we also created a concatenated matrix of all protein-coding genes (amino
acid sequences) using a python script (https://github.com/ODiogoSilva/ElConcatenero). To
align the single gene sequences for both ingroup and outgroup datasets we used three
28
different multiple sequence alignment (MSA) programs (see Appendix-A.3): Muscle (Edgar,
2004), Mafft (Kuma and Miyata, 2002) and Prank (Löytynoja and Goldman, 2005) .
3.2. Trimming of the Alignments
There are highly efficient programs for multiple sequence alignment, but even the best
scoring alignment algorithms may retrieve ambiguously aligned sequences. A review of the
alignments can eventually identify gap problems and poorly or ambiguously aligned regions.
It is, therefore, important to remove such regions from the final alignment. Originally, the
trimming of the alignments was a manual task. However, reviewing thousands of aligned
sequences by eye can be tedious, time-consuming and error-prone. To do it in an automated
way we used trimAl v1.2 (Capella-Gutierrez et al., 2009; Appendix-A.3). This software was
also used to measure the consistency of each alignment (obtained with the different MSA
programs), and the one with the highest consistency score was kept for the subsequent
phylogenetic analyses.
3.3. Selection of the Best Fit Model
In the case of model-based phylogenetic inference methods, a priori formal model
selection procedure needs to be implemented to choose an appropriate model (Burnham
and Anderson, 2002; Sullivan and Joyce, 2005). Various methods and tools exist to perform
this model selection for nucleotide or amino acid sequences (Bernt et al., 2013a). ProtTest
(Abascal et al., 2005) is one of the available bioinformatics tools that allows finding the best
fitted model of protein evolution within a candidate list of models, following different
information-theoretic approaches, as an alternative to hierarchical Likelihood Ratio Tests
(Posada and Buckley, 2004; Appendix-A.2). Among these, the most popular
information-theoretic method is the Akaike Information Criterion (AIC) (Akaike, 1974), which
includes a term that penalizes a model in proportion to the number of its parameters. When
the dataset size is small compared to the number of parameters, the AIC can be inaccurate
and the use of the corrected AIC (AICc) (Burnham and Anderson, 2002) is recommended.
Thus, this was the criterion we applied. The current version of ProtTest (v3.2) includes 15
different rate matrices in a total of 120 different models, when the rate variation among sites
(+I: invariable sites; +G or +Г: gamma-distributed rates) and the observed amino acid
frequencies (+F) are considered.
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Material and Methods
29
3.4. Phylogenetic Inference
We performed the phylogenetic analyses using different inference methods such as
Maximum Likelihood (ML) and Bayesian Inference (BI). For the first, we used PhyML v3.0
(Guindon et al., 2010) under the following parameters: the evolutionary model proposed by
ProtTest (see Chapter 3 - Table 4), the equilibrium amino acid frequencies estimated by
counting the occurrence of the different amino acids in the alignment (“-f e” option), the best
of the nearest-neighbor interchange (NNI) and of the subtree pruning and regrafting (SPR)
as tree topology rearrangement search (“-s BEST” option). Support for nodes was estimated
using the bootstrap technique (Felsenstein, 1985) with 1000 pseudoreplicates.
The BI analysis was implemented using MrBayes v3.2.1 (Ronquist et al., 2012), which
calculates Bayesian posterior probabilities using a Metropolis-Coupled, Markov Chain Monte
Carlo (MC-MCMC) sampling approach. The models of evolution used was also the ones
selected by ProtTest. Parameters were estimated as part of the analysis with four Markov
chains incrementally heated with the default heating values. Each chain started with a
randomly generated tree and ran for 1x107 generations, with a sampling frequency of one
tree for every 100 generations. One quarter (25%) of the samples was discarded as burn-in
and the remaining trees were combined in a 50% majority rule consensus tree. The
frequency of any particular clade of the consensus tree represents the posterior probability of
that clade (Huelsenbeck and Ronquist, 2001). Two independent replicates were conducted
and inspected for consistency (Huelsenbeck and Bollback, 2001).
The models of amino acid evolution that we applied using PhyML or MrBayes are not
appropriate for sequences showing time and compositional heterogeneous evolution and
might introduce systematic errors (Galtier and Lobry, 1997; Blanquart and Lartillot, 2006).
Since the datasets that we used show heterogeneous composition across sites and time, we
decided to perform phylogenetic inferences using also frameworks that are expected to relax
the stationary and homogeneity assumptions.
To do so, we analyzed the data in a non-stationary Bayesian framework, using
PhyloBayes MPI v1.4f (Lartillot et al., 2013). We applied a wide variety of homogeneous and
heterogeneous models: classical empirical single-matrix (selected from ProtTest), empirical
mixture of matrices (UL3), empirical mixture of profiles (WLSR5; C20), general time
reversible matrix (GTR), infinite mixture with global exchange rates fixed to flat values
(CAT-Poisson) or inferred from the data (CAT-GTR). For each PhyloBayes analysis, we ran
two independent chains in parallel that were terminated only when they reached
30
convergence (using bpcomp program within the PhyloBayes package), which was diagnosed
as maximum discrepancy threshold below 0.1 or below 0.3 (if the points sampled in the runs
are above 10,000). Finally, we performed a Bayesian inference analysis using
NH-PhyloBayes v0.2.3 (Blanquart and Lartillot, 2008) applying a combination of the category
(CAT) mixture model (Lartillot and Philippe, 2004) with the non-stationary breakpoints (BP)
model (Blanquart and Lartillot, 2006), named CAT-BP, which takes into account the
heterogeneity along the sequence and across lineages. Under the CAT-BP model, the
number of categories (CAT) is fixed and it is not automatically estimated as implemented in
PhyloBayes. Thus, for the analysis under the CAT-BP model we used the posterior number
of categories (CAT = 50) previously determined by the CAT-GTR analysis in PhyloBayes. We
ran two independent chains and the convergence diagnostic analysis was performed using
the compchain program in NH-PhyloBayes. The convergence of the chains depended on
whether the means of the statistics of chain 1 (log likelihood, total tree length, total number of
breakpoints) fell into the means of chain 2 ± 3 times the standard error of the mean.
The visualization of the phylogenetic trees was done with the help of an R script (see
Appendix-A.4) or with FigTree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/).
3.5. Comparison of Phylogenetic Methods
The trees obtained through the different phylogenetic methods were compared (in terms
of topology) using the Robinson-Foulds (RF) metric (normalized to the total number of
possible partitions) (Robinson and Folds, 1981). This metric was computed for all trees
(single protein-coding genes and concatenated alignments) obtained with the outgroup and
ingroup datasets, as implemented in KTreeDist (Soria-Carrasco et al., 2007). In addition, we
qualitatively analyzed (by means of dendrograms, see Appendix-A.4) all the possible
relationships (pairing) between all protein-coding genes and concatenated alignments from
both datasets. To make a more reliable comparison between the datasets (ingroup vs. outgroup),
we pruned the outgroup species from the outgroup dataset with an R script (see Appendix-A.4).
CHAPTER 3
RESULTS
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
33
1. Mitogenome Architecture
1.1. Mitogenome Annotation
Concerning the annotation of protein-coding genes, MITOS and GenBank clearly
outperformed DOGMA. While DOGMA was not able to annotate several genes in 15 of the
32 investigated species, MITOS and GenBank annotated the full mitochondrial gene set and
in a concordant manner (Table 2).
When it comes to noncoding genes, some tRNAs (tRNA-Lys, tRNA-Glu, tRNA-Thr,
tRNA-Trp, tRNA-Ser(UCN)
, tRNA-Ser(AGN)
) and both rRNAs were not annotated in some
Asteroidea GenBank files (although they were sequenced) (Table 2). However, NCBI is still
the most complete and accurate source of annotated mtDNA genes of our dataset. Thus, the
GenBank annotation was used in subsequent analyses.
Table 2 | Summary of mitogenome annotation according to different approaches. Check marks indicate
that all genes are annotated. Listed genes are those not annotated in a given genome using a specific tool.
Class
Species
(Accession no)
GenBank DOGMA MITOS
Asteroidea
NC_001627
  
NC_004610
tRNAGlu, tRNA-Thr,
tRNA-Ser
 
NC_006664 tRNA-Trp
 
NC_006665
tRNA-Lys, SrRNA,
tRNA-Ser, LrRNA,
tRNA-Trp
 
NC_006666
tRNA-Lys, SrRNA,
tRNASer, LrRNA,
tRNA-Trp
 
NC_007788

ATP8

NC_007789

ATP8

Echinoidea
NC_001453
  
NC_001572
  
34
NC_001770
  
NC_009940
  
NC_009941

ATP8

NC_013881

ATP8

Holothuroidea
NC_005929
  
NC_012616

ATP8

NC_013432

ATP8

NC_013884

ATP8, ND5

NC_014452

ATP8

NC_014454

ATP8

Ophiuroidea
NC_005334
  
NC_005930
  
NC_010691
  
NC_013874
 ATP8, ATP6, ND2,
ND4

NC_013876
 ATP8, ND2, ND4L,
ND6

NC_013878
 ATP8, ND2, ND4L,
ND6

Crinoidea
NC_010692

ATP8

NC_001878
  
NC_007689
  
NC_007690

ATP8

Hemichordata
NC_013877
  
NC_007438
 ATP8, COX1, COX2,
CYTB, ND1, ND2,

Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
35
ND5, ND6
NC_001887
  
1.2. Gene Order Comparison
Using the genome organization of the basal species of each class as the reference we
also observe that the mtDNA gene rearrangements found in Echinodermata species support
the monophyly of the different classes. The comparison of the gene order among different
echinoderm classes show a block (from COX1 to ND6) that is conserved across all of them,
with the exception of one holothuroid (Cucumaria miniata) and two crinoid (Neogymnocrinus
richeri and Antedon mediterranea) species (Appendix-A.1; Figure 6).
The Echinoidea and Holothuroidea share the same basal gene order (Figure 7) with only
a few exceptions due to derived transpositions involving the tRNAs of the holothuroid
species Cucumaria miniata, Stichopus sp. SF-2010 and Stichopus horrens (see Appendix -
A.1). The basal genome organization of these two classes differs from Crinoidea (here used
as the basal species of all Echinodermata) by two inversions and a reverse transposition
(Figure 7). Asteroidea genome order differs from Crinoidea by two inversions, one
transposition and a reverse transposition (Figure 7). A 4.6-kb inversion encompassing ND1,
ND2 and a 12 tRNA-gene block is observed between Asteroidea and
Echinoidea/Holothuroidea (Appendix - A.1; Figure 7). Finally, Ophiuroidea derives from
Crinoidea by two inversions and three TDRLs events (Figures 6 and 7).
The reconstruction of the phylogenetic relationships among the five classes using
these mitogenome rearrangements as phylogenetic characters (Figure 7), and considering
the gene order of the most basal species of each class (see Appendix - A.1) as reference
(Fan et al., 2011; Perseke et al., 2010; Perseke et al., 2008; Scouras and Smith, 2001),
resulted in a topology that supports the Cryptosyringid hypothesis (Figure 6).
36
Figure 6 | Evolution of gene orders in echinoderm lineages using the typical crinoid gene order as the
reference (inferred by CREx). Genes in upper layer are transcribed on the L-strand, whereas genes in
lower layer are transcribed on the H-strand. I = inversion; T = transposition, rT = reverse transposition, TDRL
= tandem duplication random loss. Black dotted line indicates conserved gene order block. Green, purple,
orange and blue boxes indicate reverse transpositions, inversion, transposition and TDRL events,
respectively.
Figure 7 | Phylogenetic tree obtained with CREx (pairwise-distance) based on the genome
rearrangements among the Echinodermata species. I = inversion; T = transposition, rT = reverse
transposition, TDRL = tandem duplication random loss. Grey star indicates the hypothetical basal gene order
from all echinoderms.
Crinoidea
Asteroidea
Echinoidea/Holothuroidea
rT + 2I
Ophiuroidea
T + rT + 2I
2I+3TDRL
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
37
Regarding rearrangements within each class, Asteroidea species differ by some
inversions on tRNAs (see Appendix - A.1). Within Holothuroidea and Crinoidea species, we
verified some TDRL and transpositions on tRNAs (see Appendix - A.1) and on ND4L (see
Appendix - A.1), respectively. In Ophiuroidea, some transpositions on tRNAs occurred in
most species relative to the most basal, whereas multiple rearrangements (transpositions
and TDRL in the tRNAs, inversion of the CYTB-tRNA-Gln gene block) were observed in
Ophiura lutkeni and Ophiura albida (see Appendix - A.1). Echinoidea is the only class where
we did not detect any gene rearrangement between species (see Appendix - A.1).
1.3. Prediction of Replication Origins
The purpose of this analysis was to understand if there is more than one possible
replication origin in each species; if these replication origins are located in homologous
regions (e.g. control region) between species of the same class; and if rearrangements could
have contributed to their different location in the mitogenome of different species.
Using the more stringent set of filters we were only able to predict the putative replication
origins for four Echinodermata mitogenomes (Table 3; Figure 8). However, even when we
applied the more relaxed filters (see Methods) we were unable to retrieve the putative
replication origin for most species analyzed here, mainly because the resultant hairpin
structures were located in coding-regions. Importantly, no putative replication origins were
identified in any of the echinoid and ophiuroid species, regardless of the filter used.
Most hairpins were located between tRNA-Phe and ND2 (Table 3), falling inside a region
involved in rearrangements in different classes (Figure 8). The features of the replication
origin in asteroid mitogenomes are common to all species except to Astropecten
polyacanthus (NC_006666), where replication origin is located in a different strand and in a
different position relative to the other species (Table 3). The putative replication origin found
in the holothurian (Stichopus sp. SF-2010 (NC_014452)) are located on the plus strand
(H-strand), between tRNA-Phe and tRNA-Pro (Table 3). Finally, for the crinoid Phanogenia
gracilis (NC_007690), a putative replication origin was identified on a non-coding region
between tRNA-Gly and tRNA-Tyr, on the minus strand, with 18bp of length (Table 3).
38
Table 3 | Putative replication origins inferred for the Echinodermata mitogenomes used in this study. Shown are the features of putative hairpins: genomic
position; strand in which they are transcribed; size of their sequence; minimum entropy energy to drive replication; dot-bracket notation detailing the predicted
secondary structure; sequence itself; and location in the genome relative to some protein-coding genes. (See Chapter 2, Material and Methods, for the difference
between stringent and relaxed filters). Asterisks represent the putative hairpins found in control-region of the genome.
a,b,c
Hairpin structures are represented in Figure 8.
Class
Accession
No
Position Strand Size
Entropy
(kcal/mol)
Dot-Bracket Sequence Location
Stringent filters
Asteroidea NC_006665 a
14542 + 23 -8.9
.((((((((.....))))))))
.
ATGCCCCCTTCGG
GGGGGGGCAA
tRNA-Phe
-ND2
Asteroidea NC_001627 b
7074 + 26 -8.2
.((((((((........)))))
))).
TAACCCCCCCCCT
CCCCGGGGGGTTC
tRNA-Phe
-ND2
Asteroidea NC_007789 c
15706 + 35 -11.1
.(((((((((((...........
))))))))))).
CATAACCCCCCCC
CCCCCCCTCCGGG
GGGGTTATA
tRNA-Phe
-ND2
(*)
Relaxed filters
Asteroidea NC_006666 631 - 28 -8.9
.((((((((((......)))))
))))).
ATTATCCTCGTGAA
GAAACGAGGATAAA
tRNA-Phe
-ND2
Crinoidea NC_007690 4228 - 18 -8.2 .((((((....)))))).
ATTGACTTTTAAGT
CAAA
tRNA-Gly-
tRNA-Tyr
Holothuroidea NC_014452 10928 + 37 -10
.((((((((((((..........
.)))))))))))).
ATTTTGACACTTTC
CCTTTCTAAAAAAG
TGTCAAAAC
tRNA-Phe
-tRNA-Pro
(*)
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
39
Figure 8 | Representation of the seconday structure of the putative replication origins inferred for the
mitogenomes of a) Asterias amurensis, b) Acanthaster brevispinus and d) Patiria pectinifera all from
Asteroidea class. DNA hairpin structures were drawn using PseudoViewer (http://pseudoviewer.inha.ac.kr/)
1.4. Assessment of the Degree of Compositional Heterogeneity
The plot of the skew values of the 4-fold redundant sites for each mitochondrial
protein-coding gene for each Echinodermata species shows that Crinoidea presents an
overall inversion of AT and GC skews (Figure 9; Appendix - A.1). While all other classes tend
to show positive AT and negative GC skews, the crinoids have the opposite pattern of
compositional bias in ten out of 13 genes (ATP6, ATP8, COX1, COX2, COX3, CYTB, ND3,
ND4, ND4L and ND5) (Figure 9). Curiously, the Crinoidea genes that show positive AT and
negative GC skews (ND1, ND2 and ND6) present an inverted compositional bias in some
species of the remaining classes. This is the case of: one Ophiuroidea species, Ophiopholis
aculeata, with inverted skews in four out of 13 genes (CYTB; ND1; ND2 and ND6); six
Asteroidea species with inverted skews exactly in three genes (ND1; ND2 and ND6); and two
species of each Holothuroidea, Apostichopus japonicus and Cucumaria miniata, and
Echinoidea, Paracentrotus lividus and Arbacia lixula, with inverted skews in only one gene
(ND6; Figure 9).
A) B) C)
40
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
41
Figure 9 | | Plots of GC and AT skews at 4-fold sites of mitochondrial protein-coding genes for all Echinodermata species analyzed here. Blue represents
Asteroidea species; red, Echinoidea; green, Holothuroidea; purple, Ophiuroidea; and yellow, Crinoidea.
42
2. Comparison of Phylogenetic Inferences
The results of the multiple sequence alignment (MSA) comparison (using the
consistency scores of the different MSA methods) performed with trimAl are shown in Table
4. The program Muscle was the MSA method most frequently chosen for the ingroup
datasets, while for the outgroup datasets was the Prank. ProtTest selected MtArt as the
best fit model for most of the genes, followed by CpRev, MtRev and JTT. Importantly, in half
of the genes, different best-fit models were selected depending on whether the outgroup
sequences were included or excluded (Table 4).
All the phylogenetic trees based on the concatenated datasets show that all Echinodermata
classes are monophyletic, with high bootstrap or posterior probability values (generally over 95%
the branch that is leading every classes on different trees).
Regarding the evolutionary relationships between classes, according to MrBayes, PhyML,
PhyloBayes (except the CAT model) and NH-PhyloBayes, the clustering of (Asteroidea,
Echinoidea) with Holothuroidea is strongly supported; as well as the position of Crinoidea as sister
clade of ((Asteroidea, Echinoidea), Holothuroidea), and of Ophiuroidea as the first class to diverge
within Echinodermata (Figure 10 A)). By the opposite, the PhyloBayes analysis under the CAT
model supports the Crinoidea as the first class to diverge within Echinodermata, followed by
Ophiuroidea as the second (Figure 10 B).
,
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
43
Table 4 | Summary of general results. trimAl results and protein models were chosen using ProtTest. NA: Not Applicable.
Data
Best Alignment Software Best Model
Ingroup Outgroup Ingroup Outgroup
ATP6 Muscle Mafft MtArt +I +G +F CpRev +I +G +F
ATP8 Muscle Muscle MtRev MtRev +G
COX1 Prank Muscle CpRev +I +G +F CpRev +I +G +F
COX2 Mafft Mafft MtRev +G +F MtRev +I +G +F
COX3 Mafft Prank CpRev +G +F CpRev +I +G +F
CYTB Muscle Muscle MtArt +I +G +F MtArt +I +G +F
ND1 Muscle Mafft MtArt +G +F MtArt +G +F
ND2 Mafft Prank MtArt +I +G +F MtArt +I +G +F
ND3 Mafft Prank MtArt +G MtRev +G
ND4 Prank Prank MtRev +G +F MtArt +G +F
ND4L Prank Prank MtArt +G MtArt +G
ND5 Muscle Mafft MtRev +I +G +F MtRev +I +G +F
ND6 Prank Muscle JTT +G +F JTT +I +G +F
Concatenated NA NA CpRev +I +G +F CpRev +I +G +F
44
Figure 10 | Phylogenetic tree for the concatenated amino acid sequences of thirteen echinoderm
mitochondrial protein-coding genes using Hemichordata as outgroup. A) Tree topology obtained with
all methods/models except PhyloBayes (CAT model). B) Tree topology obtained with PhyloBayes (CAT
model). In each node only support values (bootstraps or posterior probabilities) greater than 50% are
presented. In A) the support values of different methods/models are presented in the following order:
PhyML/PhyloBayes(CAT+GTR)/PhyloBayes(CAT+MtArt)/PhyloBayes(C20)/MrBayes, PhyloBayes(GTR),
PhyloBayes(MtArt), PhyloBayes(UL3), PhyloBayes(WLSR5), NH-PhyloBayes(CAT+BP). Asterisks
represent values greater than 95% (bootstrap or posterior probability). Black triangles represent the clades
formed by the different classes, which are always supported by values above 95% independently of the
method/model used.
The individual gene trees show, as expected, less resolution than the concatenated
phylogenetic trees (see Appendix-A.1). The ML analyses show in general lower branch support
than MrBayes, with only the ATP6, ND1, ND2, ND3, ND4, ND4L and ND5-based trees (outgroup
data set) recovering support over 50% for all the classes (see Appendix-A.1). Comparing the
results from PhyML and MrBayes analyses for the outgroup dataset, most gene trees are similar
between both methods, with the exception of ND4 and ND4L (and ND3 to a minor extent)
gene-based trees (Figure 11).
A B
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
45
Figure 11 | Heatmap of the inferred phylogenies from the outgroup dataset. Comparison between the phylogenies obtained wih PhyML and MrBayes approaches
using single gene and concatenated datasets. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same
topology), 1 (no bipartitions are shared between trees).
46
Figure 12 | Heatmap of the inferred phylogenies between the ingroup and outgroup datasets using PhyML. Comparison of the phylogenies obtained with single
gene and concatenated datasets. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology), 1 (no
bipartitions are shared between trees).
INGROUP
OUTGROUP
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
47
Figure 13 | Heatmap of the inferred phylogenies between the ingroup and outgroup datasets using MrBayes. Comparison of the phylogenies obtained with
single gene and concatenated datasets. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology), 1
(no bipartitions are shared between trees).
INGROUP
OUTGROUP
48
Figure 14 | Dendrogram of the inferred phylogenies for the concatenated outgroup dataset using
different models or approaches (based on normalized Robinson-Foulds values).
Figure 15 | Dendrogram of the inferred phylogenies for the concatenated ingroup dataset using
different models or approaches (based on normalized Robinson-Foulds values).
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Results
49
Figure 16 | Heatmap of dendrograms for PhyloBayes results (ingroup vs. outgroup). Results of the comparison between all inferred trees from ingroup and
outgroup datasets based on different models. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology),
1 (no bipartitions are shared between trees).
50
The topologies obtained with the ingroup and outgroup datasets are similar, except for ND3
on PhyML (Figure 12) and for ND6 on MrBayes (Figure 13).
When we compare the tree topologies obtained through the different phylogenetic analyses,
we observe some differences depending on the models that were used. We also notice that the
models are not clustered by their nature; i.e. homogeneous models: from MrBayes and PhyML
analyses, MtArt (PhyloBayes), GTR (PhyloBayes); site-heterogeneous models with empirical
matrixes: UL3, WLRS5, C20, CATMTART; site-heterogeneous models with compositional profiles
inferred from the data: CAT, CAT+GTR, CAT+MtArt; time- and site-heterogeneous models:
CAT+BP (Figure 14 -16).
53
CHAPTER 4
DISCUSSION
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Discussion
53
Recent advances in sequencing technologies have been providing increasing amounts of
molecular data, which request efficient annotation tools to make these sequences biologically
meaningful. Despite the invested effort, it is not unusual to find sequence files containing
erroneous annotations of genes and/or other genomic elements. Misannotations can
seriously compromise the quality of the data, mostly if obtained using an automated
framework. For the work developed in this thesis, the mitogenomic data was directly retrieved
from the NCBI database using the annotations described in the GenBank files, which were
compared with those obtained with MITOS and DOGMA. Whereas GenBank and MITOS
annotated the full set of protein-coding genes in our dataset, DOGMA failed to identify some
in several species.
Early studies (Perseke et al., 2008) had already pointed out failures in the annotation of
Echinodermata mtDNA GenBank sequences, which were subsequently corrected. However,
nowadays, some genes remain non-annotated (Table 2). Consequently, in the analysis of
gene rearrangements, we had to exclude both ribosomal genes, SrRNA and LrRNA and some
tRNAs (tRNA-Lys, tRNA-Glu, tRNA-Thr, tRNA-Trp, tRNA-Ser(UCN), tRNA-Ser(AGN)), because
they were not identified in some Asteroidea mitogenomes (Asterias amurensis, Astropecten
polyacanthus, Luidia quinalia, Patiria pectinifera and Pisaster ochraceus).
Although the most reliable source of information keeps being NCBI, there is still room for
improvements. For example, the MitoZoa web tool (Lupi et al., 2010; D’Onorio et al., 2012) is
a public resource containing metazoan mitogenomes whose annotation is carefully revised,
also allowing to build different datasets that can be later used for comparative genomic and
evolutionary analyses. However, the version 1.0 (Lupi et al., 2010) has not been updated
since December 2011 and the link to the version 2.0 (http://www.caspur.it/mitozoa, D’Onorio
et al., 2012) is no longer accessible. One possible way to improve mtDNA annotations (and
other type of annotation) would be to allow researches using mtDNA sequences to correct
them, by directly curating the GenBank files through the NCBI platform. Likewise, specialized
databases should be maintained for longer periods, while automated scripts should be
developed to allow daily-based mtDNA sequence corrections (at least for the NCBI Organelle
Genome Resources, where the reference mitochondrial genome sequences are stored).
In this study, we compared Echinodermata mitogenome order using the hypothetical
basal/ancestral arrangement from each class (Appendix-A.1, Figure 6). Although we could not
include all genes in the analysis (as mentioned above), our results reveal some patterns
within and among the Echinodermata classes. We confirmed some previously described
rearrangements between species of the same class (Perseke et al., 2008, 2010): within
Asteroidea, we identified inversions on tRNAs (tRNA-Phe, tRNA-Ile, tRNA-Gly, tRNA-Cys,
54
tRNA-Ala, tRNA-Leu and tRNA-Asn); within Holothuroidea, transpositions on tRNAs
(tRNA-Arg, tRNA-Pro, tRNA-Asn, tRNA-Leu, tRNA-Met and tRNA-Val); within Crinoidea,
transpositions on tRNA-Ala, tRNA-Val and tRNA-Arg, as well as on ND4L; in Ophiuroidea,
multiple rearrangements (inversions, transpositions and reverse transpositions) that confirm
the rapid evolution in this class; and finally, within Echinoidea, as we expected, no
rearrangements were detected. The topology of the unrooted tree obtained based on the
rearrangements between classes was pairing of Holothuroidea and Echinoidea, following by
Asteroidea then the Ophiuroidea and finally the Crinoidea as ancestral class (Figure 7). This
topology differs from those reported before for Echinodermata also based on this kind of data
(i.e. gene order) (Perseke et al., 2008; Perseke et al., 2010). These different results could be
explained (at least partially) by differences in the datasets used, as previous studies mainly
relied on protein-coding genes (and only a few included rRNAs and tRNAs; Smith et al., 1993;
Scouras and Smith, 2004) or inferred the rearrangements in a predefined phylogenetic tree,
whereas here we analyzed all genes (e.g. Perseke et al., 2008; 2010). Another explanation
(not mutually exclusive), stems from the fact that while previous studies used the breakpoints
distance method in CREx to identify rearrangements, here we used the common intervals
approach, which is thought to detect more subtle similarities and multiple genome
rearrangements when compared with the others methods (Pevzner 2000; Bernt and
Middendorf, 2011).
Regarding the screening for origins of replication, we identified some putative regions
fulfilling the criteria we used, i.e., residing in non-coding regions and forming stem-and-loop
(hairpin) structures in Asteroidea (Asterias amurensis, Patiria pectinifera, Asterias amurensis).
In addition, we found other putative replication origins using more relaxed filters but these
need to be more carefully and deeply inspected before drawing strong conclusions, as they
do not conform to all requirements to be considered reliable origins of replication. These
additional regions were found in several species: in Holothuroidea Stichopus sp. SF-2010; in
Asteroidea, Astropecten polyacanthus; and in Crinoidea, Phanogenia gracilis (see Table 3). In
all species of Asteroidea, the putative replication origin is 19 to 35 bp long, it is located on the
plus strand, between tRNA-Phe and ND2, except in Astropecten polyacanthus, where it is
located on the minus strand (see Table 3). In Holothuroidea (and Echinoidea according to
Jacobs et al., 1989), the putative replication origin is on the plus strand, between tRNA-Phe
and tRNA-Pro. Finally, in Crinoidea, the putative replication origin is 18bp long, located on the
minus strand, between tRNA-Gly and tRNA-Tyr. These results suggesting that the origin of
replication is more flexible in terms of its structure as we originally predicted or could be more
than one replication origin in mostly of Echinodermata classes. About the Ophiuroidea class,
Application of Site and Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics
Discussion
55
we could not find any putative hairpin, probably of the method that we used to find origins of
replication was not the most adequate for the invertebrates.
As previously stated (see Chapter 1 - Introduction), each mitochondrial DNA strand has a
compositional bias that seems to be related with the distance to the replication origin (Adachi
and Hasegawa. 1996; Faith and Pollock, 2003). Thus, an inversion of the replication origin
(control region) is expected to produce a global reversal of asymmetric mutational constraints
in the mtDNA, which can later result in a complete reversal of strand compositional bias
(Hassanin et al., 2005). Our analysis of compositional skews for protein-coding genes across
Echinodermata revealed (Figure 3.4) that Crinoidea presents an inverted skew (except on
ND1, ND2 and ND6) respect to the remaining classes, suggesting an inversion of the control
region, which could have an important impact on the estimates of the phylogenetic
relationships within Echinodermata.
The inferred phylogenetic trees revealed that the five Echinodermata classes and the
phylum itself are monophyletic, independently of the method or model used. This result is in
agreement with the vast majority of previous phylogenetic studies in Echinodermata, using
either morphological (Pearse and Pearse, 1994; Janies, 2001) or molecular markers (Smith et
al., 1993; Pearse and Pearse, 1994; Janies, 2001; Perseke et al., 2008, 2010, 2013; Telford et
al., 2014). Regarding the phylogenetic relationships between classes, all our analyses, except
the PhyloBayes using the CAT model, recovered the same tree topology: (Ophiuroidea,
(Crinoidea, (Holothuroidea, (Asteroidea, Echinoidea)))), also in agreement with previous
works based on amino acid sequences of the 13 mitochondrial protein-coding genes (Perseke
et al., 2008, 2013). However, this topology is in contrast with the Eleutherozoa (or mobile
echinoderms) grouping that places Crinoidea as sister clade to the remaining echinoderm
classes, which is supported by morphological comparative analyses (Pearse and Pearse,
1994; Janies, 2001) and by nuclear and mitochondrial gene sequences (Scouras and Smith,
2006; Perseke et al., 2010; Telford et al., 2014). In most of our phylogenetic reconstructions,
Ophiuroidea, and not Crinoidea, is the first clade to diverge within Echinodermata, although
this placement could correspond to a phylogenetic artifact due to LBA. Mitogenomes of
Ophiuroidea are known to have accelerated substitution rates that lead to very long branches
in phylogenetic trees, when compared to mitogenome sequences of others species from the
same phylum (Perseke et al., 2008, 2010, 2013). Therefore, it is possible that the methods
and models used by others and ourselves are not able to reduce the LBA effect, even though
we would expect that mixture models (e.g. C20, CAT+GTR, UL3) and especially site- and
time-heterogeneous models (CAT+BP) could correct it. Nonetheless, it is interesting to note
56
that the Bayesian analysis using CAT are similar with some previous results (Perseke et al.,
2010, 2013).
Our phylogenetic inferences are also in contrast with the two classical hypotheses of
echinoderm evolution (even assuming that Crinoidea is the first clade to diverge): the
Asterozoan hypothesis (that pairs Asteroidea with Ophiuroidea and Echinoidea with
Holothuroidea) and the Cryptosyringid hypothesis (that places Asteroidea as the second class
to diverge and Ophiuroidea as the third one). Our results, at best, place Ophiuroidea as sister
taxa to the clade (Holothuroidea, (Echinoidea, Asteroidea)). Again, this is in agreement with
previous phylogenetic studies using echinoderm mitogenomes (Perseke et al., 2008, 2013).
However, the most recent study, based on a final concatenated alignment of 219 nuclear
genes from all echinoderm classes and using a Bayesian framework (applying the
site-heterogeneous mixture model CAT+GTR), strongly supports the Asterozoan hypothesis
(Telford et al., 2014). The authors of this study also applied a phylogenetic signal dissection
methodology (as in Pisani et al., 2012), in which the dataset was split in homogeneously and
heterogeneously evolving sites. The former sites gave strong support for the Asterozoan
hypothesis, whereas the latter sites, expected to be more problematic and prone to
systematic errors, resulted in a different topology: (Crinoidea, (Ophiuroidea, (Asteroidea,
(Holothuroidea, Echinoidea)))). Extrapolating this possibility to our dataset, we suggest that
most mitochondrial genes in echinoderms can have a strong heterogeneous and misleading
signal that overcomes the phylogenetic informativeness needed to recover a reliable
phylogenetic tree. Clearly, the phylogenetic signal of echinoderm mitogenomes is very
complex and still challenges the most advanced and statistically powerful phylogenetic
methods and models.
References 57
References
Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein
evolution. Bioinformatics, 21, 2104-2105.
Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by
mitochondrial DNA. Journal of Molecular Evolution, 42, 459-468.
Aguinaldo A, Turbeville J, Linford L, et al. (1997) Evidence for a clade of nematodes,
arthropods and other moulting animals. Nature, 387, 489-493.
Akaike H (1974) A new look at the statistical model identification. Automatic Control, IEEE
Transactions on, 19, 716-723.
Amemiya CT, Miyate T, Rast JP (2005) Echinoderms. Current Biology, 15, R944-R946.
Anderson S, Bankier AT, Barrell BG, et al. (1981) Sequence and organization of the
human mitochondrial genome. Nature, 290, 457-465.
Asakawa S, Himeno H, Miura K, Watanabe K (1995) Nucleotide sequence and gene
organization of the starfish Asterina pectinifera mitochondrial genome. Genetics, 140,
1047-1060.
Asakawa S, Kumazawa Y, Araki T, et al. (1991). Strand-specific nucleotide composition
bias in echinoderm and vertebrate mitochondrial genomes. Journal of Molecular
Evolution, 32, 511-520.
Attardi G (1985) Animal mitochondrial DNA: an extreme example of genetic economy.
International Review of Cytology-a Survey of Cell Biology, 93, 93-145.
Barral PJ, Cantini L, Hasmy A, Jiménez J, Marcano A (2005) Correlation between strand
asymmetry and phylogeny in mitochondrial DNA. Journal of Theoretical Biology, 236,
422-426.
Beletskii A, Bhagwat A (1996) Transcription-induced mutations: Increase in C to T
mutations in the nontranscribed strand during transcription in Escherichia coli.
Proceedings of the National Academy of Sciences USA, 93, 13919-13924.
Bergsten J (2005) A review of long-branch attraction. Cladistics, 21, 163-193.
Bernt M, Braband A, Middendorf M, et al. (2013a) Bioinformatics methods for the
comparative analysis of metazoan mitochondrial genome sequences. Molecular
Phylogenetics and Evolution, 69, 320-327.
58
Bernt M, Braband A, Schierwater B, Stadler P (2013b) Genetic aspects of mitochondrial
genome evolution. Molecular Phylogenetics and Evolution, 69, 328-338.
Bernt M, Donath A, Juhling F, et al. (2013c) MITOS: Improved de novo metazoan
mitochondrial genome annotation. Molecular Phylogenetics and Evolution, 69, 313-
319.
Bernt M, Middendorf M (2011) A method for computing an inventory of metazoan
mitochondrial gene order rearrangements. BMC Bioinformatics, 12, S6.
Bernt M, Merkle D, Ramsch K, et al. (2007) CREx: inferring genomic rearrangements
based on common intervals. Bioinformatics, 23, 2957-2958.
Bhutkar A, Gelbart W, Smith T (2007) Inferring genome-scale rearrangement phylogeny
and ancestral gene order: a Drosophila case study. Genome Biology, 8, R236.
Blanquart S, Lartillot N (2008) A site- and time-heterogeneous model of amino acid
replacement. Molecular Biology and Evolution, 25, 842-858.
Blanquart S, Lartillot N (2006) A Bayesian compound stochastic process for modeling
nonstationary and nonhomogeneous sequence evolution. Molecular Biology and
Evolution, 23, 2058-2071.
Bogenhagen D, Clayton D (2003) The mitochondrial DNA replication bubble has not burst.
Trends in Biochemical Sciences, 28, 357-360.
Boore JL (2000) The duplication/random loss model for gene rearrangement exemplified
by mitochondrial genomes of deuterostome animals. In: Computational biology,
pp.133-147. Springer Netherlands.
Boore JL (1999) Animal mitochondrial genomes. Nucleic Acids Research, 27, 1767-1780.
Boore J, Brown W (1998) Big trees from little genomes: mitochondrial gene order as a
phylogenetic tool. Current Opinion in Genetics and Development, 8, 668-674.
Boore JL, Lavrov DV, Brown WM, (1998) Gene translocation links insects and crustaceans.
Nature, 392, 667-668.
Boore J, Collins T, Stanton D, Daehler L, Brown W (1995) Deducing the pattern of
arthropod phytogeny from mitochondrial DNA rearrangements. Nature, 376, 163-165.
Bos D, Posada D (2005) Using models of nucleotide evolution to build phylogenetic trees.
Developmental and Comparative Immunology, 29, 211-227.
References 59
Bottjer D, Davidson E, Peterson K, Cameron R (2006) Paleogenomics of Echinoderms.
Science, 314, 956-960.
Bourlat S, Rota-Stabelli O, Lanfear R, Telford M (2009) The mitochondrial genome
structure of Xenoturbella bocki (phylum Xenoturbellida) is ancestral within the
deuterostomes. BMC Evolutionary Biology, 9, 107.
Bourlat S, Juliusdottir T, Lowe C, et al. (2006) Deuterostome phylogeny reveals
monophyletic chordates and the new phylum Xenoturbellida. Nature, 444, 85-88.
Boussau B, Gouy M (2006) Efficient likelihood computations with nonreversible models of
evolution. Systematic Biology, 55, 756-768.
Boyce TM, Zwick ME, Aquadro CF (1989) Mitochondrial DNA in the bark weevils: size,
structure and heteroplasmy. Genetics, 4, 825-836.
Brinkmann H, Philippe H (1999) Archaea sister group of Bacteria? Indications from tree
reconstruction artifacts in ancient phylogenies. Molecular Biology and Evolution, 16,
817-825.
Bromham L, Degnan B (1999) Hemichordates and deuterostome evolution: robust
molecular phylogenetic support for a hemichordate + echinoderm clade. Evolution
and Development, 1, 166-171.
Brown T (2005) Replication of mitochondrial DNA occurs by strand displacement with
alternative light-strand origins, not via a strand-coupled mechanism. Genes and
Development, 19, 2466-2476.
Brown WM (1985) The mitochondrial genome of animals. In: R. J. MacIntyre, ed.
Molecular Evolutionary Genetics. Plenum Press, New York.
Bryant D, Galtier N, Poursat MA (2005) Likelihood calculation in molecular phylogenetics.
In: Mathematics of evolution and phylogeny, pp. 33-62. Oxford University Press.
Bull JJ, Huelsenbeck JP, Cunningham CW, Swofford DL, Waddell PJ (1993) Partitioning
and combining data in phylogenetic analysis. Systematic Biology, 42, 384-397.
Burleigh J, Mathews S (2004) Phylogenetic signal in nucleotide data from seed plants:
implications for resolving the seed plant tree of life. American Journal of Botany, 91,
1599-1613.
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical
information-theoretic approach. Springer, New York.
60
Cameron CB, Garey JR, Swalla BJ (2000) Evolution of the chordate body plan: New
insights from phylogenetic analyses of deuterostome phyla. Proceedings of the
National Academy of Sciences USA, 97, 4469-4474.
Cannon JT, Kocot KM, Waits DS, et al. (2014) Phylogenomic resolution of the
hemichordate and echinoderm clade. Current Biology, 23, 2827–2832.
Cantatore P, Gadaleta M, Roberti M, Saccone C, Wilson A (1987) Duplication and
remoulding of tRNA genes during the evolutionary rearrangement of mitochondrial
genomes. Nature, 329, 853-855.
Capella-Gutierrez S, Silla-Martinez J, Gabaldon T (2009) trimAl: a tool for automated
alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25, 1972-
1973.
Castresana J, Feldmaier-Fuchs G, Yokobori SI, Satoh N, Pääbo S (1998) The
mitochondrial genome of the hemichordate Balanoglossus carnosus and the
evolution of deuterostome mitochondria. Genetics, 150, 1115-1123.
Chang B, Campbell D (2000) Bias in phylogenetic reconstruction of Vertebrate Rhodopsin
sequences. Molecular Biology and Evolution, 17, 1220-1231.
Chang J (1996) Inconsistency of evolutionary tree topology reconstruction methods when
substitution rates vary across characters. Mathematical Biosciences, 134, 189-215.
Clayton D (1991) Replication and transcription of Vertebrate mitochondrial DNA. Annual
Review of Cell and Developmental Biology, 7, 453-478.
Collins TM, Fedrigo O, Naylor, GJ (2005) Choosing the best genes for the job: the case for
stationary genes in genome-scale phylogenetics. Systematic Biology, 54, 493-500.
Cox C, Foster P, Hirt R, Harris S, Embley T (2008) The archaebacterial origin of
eukaryotes. Proceedings of the National Academy of Sciences USA, 105, 20356-
20361.
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins.
In: Atlas of Protein Sequence and Structure, pp. 345-352. DC National Biomedical
Research Foundation.
De Giorgi C (1996) Complete sequence of the mitochondrial DNA in the sea urchin
Arbacia lixula: Conserved features of the echinoid mitochondrial genome. Molecular
Phylogenetics and Evolution, 5, 323-332.
References 61
Delsuc F, Scally M, Madsen O, et al. (2002) Molecular phylogeny of living xenarthrans and
the impact of character and taxon sampling on the placental tree rooting. Molecular
Biology and Evolution, 19, 1656-1671.
De Queiros A, Gatesy J (2007) The supermatrix approach to systematics. Trends in
Ecology and Evolution, 22, 34-41.
D'Onorio de Meo P, D'Antonio M, Griggio F, et al. (2011) MitoZoa 2.0: a database
resource and search tools for comparative and evolutionary analyses of
mitochondrial genomes in Metazoa. Nucleic Acids Research, 40, D1168-D1172.
Dowton M, Austin A (1999) Evolutionary dynamics of a mitochondrial rearrangement "hot
spot" in the Hymenoptera. Molecular Biology and Evolution, 16, 298-309.
Dutheil J, Boussau B (2008) Non-homogeneous models of sequence evolution in the
Bio++ suite of libraries and programs. BMC Evolutionary Biology, 8, 255.
Edgar R (2004) MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Research, 32, 1792-1797.
Edgecombe G, Giribet G, Dunn C, et al. (2011) Higher-level metazoan relationships:
Recent progress and remaining questions. Organisms Diversity and Evolution, 11,
151-172.
Elson J, Lightowlers R (2006) Mitochondrial DNA clonality in the dock: can surveillance
swing the case? Trends in Genetics, 22, 603-607.
Erixon P, Svennblad B, Britton T, Oxelman B (2003) Reliability of bayesian posterior
probabilities and bootstrap frequencies in phylogenetics. Systematic Biology, 52,
665-673.
Faith JJ, Pollock DD (2003) Likelihood analysis of asymmetrical mutation bias gradients in
vertebrate mitochondrial genomes. Genetics, 165, 735-745.
Fan S, Hu C, Wen J, Zhang L (2011) Characterization of mitochondrial genome of sea
cucumber Stichopus horrens: A novel gene arrangement in Holothuroidea. Science
China Life, 54, 434-441.
Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap.
Evolution, 39, 783-791.
Felsenstein J (1981) Evolutionary trees from DNA sequences: A maximum likelihood
approach. Journal of Molecular Evolution, 17, 368-376.
62
Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively
misleading. Systematic Biology, 27, 401-410.
Fertin G (2009) Combinatorics of genome rearrangements. MIT Press, Cambridge.
Fitch W (1986) The estimate of total nucleotide substitutions from pairwise differences is
biased. Philosophical Transactions of the Royal Society B: Biological Sciences, 312,
317-324.
Fitch W (1971) Toward defining the course of evolution: Minimum change for a specific
tree topology. Systematic Biology, 20, 406-416.
Fonseca MM, Harris DJ, Posada D (2014). The inversion of the control region in three
mitogenomes provides further evidence for an asymmetric model of vertebrate
mtDNA replication. PLoS ONE, 9, e106654.
Fonseca MM (2011) Mitochondrial DNA evolution: replication and gene overlapping.
Published doctoral dissertation. University of Porto, Porto, Portugal.
Fonseca MM, Posada D, Harris DJ (2008) Inverted replication of vertebrate mitochondria.
Molecular Biology and Evolution, 25, 805-808.
Foster P (2004) Modeling compositional heterogeneity. Systematic Biology, 53, 485-495.
Foster P, Hickey D (1999) Compositional bias may affect both DNA-based and protein-
based phylogenetic reconstructions. Journal of Molecular Evolution, 48, 284-290.
Francino M, Ochman H (1997) Strand asymmetries in DNA evolution. Trends in Genetics,
13, 240-245.
Galtier N, Lobry J (1997) Relationships between genomic G+C content, RNA secondary
structures and optimal growth temperature in Prokaryotes. Journal of Molecular
Evolution, 44, 632-636.
Galtier N, Gouy M (1995) Inferring phylogenies from DNA sequences of unequal base
compositions. Proceedings of the National Academy of Sciences USA, 92, 11317-
11321.
Gee H (1996) Before the backbone. Views on the origin of the vertebrates. Chapman and
Hall, London.
Gesell T (2009) A phylogenetic definition of structure. Published doctoral dissertation.
University of Vienna, Vienna, Austria.
References 63
Gissi C, Iannelli F, Pesole G (2008) Evolution of the mitochondrial genome of Metazoa as
exemplified by comparison of congeneric species. Heredity, 101, 301-320.
Gowri-Shankar V, Rattray M (2007) A reversible jump method for bayesian phylogenetic
inference with a nonhomogeneous substitution model. Molecular Biology and
Evolution, 24, 1286-1299.
Gray MW, Burger G, Lang BF (1999) Mitochondrial evolution. Science, 283, 1476–81.
Groussin M, Boussau B, Gouy M (2013) A branch-heterogeneous model of protein
evolution for efficient inference of ancestral sequences. Systematic Biology, 62, 523-
538.
Gu X, Fu YX, Li WH (1995) Maximum likelihood estimation of the heterogeneity of
substitution rate among nucleotide sites. Molecular Biology and Evolution, 12, 546-
557.
Guindon S, Dufayard J, Lefort V, et al. (2010) New algorithms and methods to estimate
maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0.
Systematic Biology, 59, 307-321.
Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large
phylogenies by maximum likelihood. Systematic Biology, 52,696-704.
Guindon S, Gascuel O (2002) Efficient biased estimation of evolutionary distances when
substitution rates vary across sites. Molecular Biology and Evolution, 19, 534-43.
Green P, Ewing B, Miller W, Thomas P J, Green E D (2003) Transcription-associated
mutational asymmetry in mammalian evolution. Nature Genetics, 33, 514-517.
Grigoriev A (1998) Analyzing genomes with cumulative skew diagrams. Nucleic Acids
Research, 26, 2286-2290.
Hasegawa M, Hashimoto T, Adachi J, Iwabe N, Miyata T (1993) Early branchings in the
evolution of eukaryotes: Ancient divergence of entamoeba that lacks mitochondria
revealed by protein sequence data. Journal of Molecular Evolution, 36, 380-388.
Hassanin A, Léger N, Deutsch J (2005) Evidence for multiple reversals of asymmetric
mutational constraints during the evolution of the mitochondrial genome of Metazoa,
and consequences for phylogenetic inferences. Systematic Biology, 54, 277-298.
Herbeck JT, Degnan PH, Wernegreen JJ (2005) Nonhomogeneous model of sequence
evolution indicates independent origins of primary endosymbionts within the
Enterobacteriales (ᵧ -Proteobacteria). Molecular Biology and Evolution, 22, 520-532.
64
Holt I, Jacobs H (2003) Response: The mitochondrial DNA replication bubble has not burst.
Trends in Biochemical Sciences, 28, 355-356.
Holt I, Lorimer H, Jacobs H (2000) Coupled leading- and lagging-strand synthesis of
mammalian mitochondrial DNA. Cell, 100, 515-524.
Horner D, Pesole G (2004) Phylogenetic analyses: a brief introduction to methods and
their application. Expert Review of Molecular Diagnostics, 4, 339-350.
Hrdy I, Hirt R, Dolezal P, et al. (2004) Trichomonas hydrogenosomes contain the NADH
dehydrogenase module of mitochondrial complex I. Nature, 432, 618-622.
Huelsenbeck J, Suchard M (2007) A nonparametric method for accommodating and
testing across-site rate variation. Systematic Biology, 56, 975-987.
Huelsenbeck JP, Bollback JP (2001) Empirical and hierarchical Bayesian estimation of
ancestral states. Systematic Biology, 50, 351-366.
Huelsenbeck J, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees.
Bioinformatics, 17, 754-755.
Hyman L (1955) The invertebrates. McGraw-Hill, New York.
Inagaki Y, Susko E, Fast NM, Roger AJ (2004) Covarion shifts cause a long-branch
attraction artifact that unites microsporidia and archaebacteria in EF-1alpha
phylogenies. Molecular Biology and Evolution, 21, 1340-1349.
Jacobs HT, Herbert ER, Rankine J (1989) Sea urchin egg mitochondrial DNA contains a
short displacement loop (D-loop) in the replication origin region. Nucleic Acids
Research, 17, 8949-8965.
Janies D, Voight J, Daly M (2011) Echinoderm phylogeny including Xyloplax, a progenetic
asteroid. Systematic Biology, 60, 420-438.
Janies D (2001) Phylogenetic relationships of extant echinoderm classes. Canadian.
Journal of Zoology, 79, 1232-1250.
Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of
incongruence? Trends in Genetics, 22, 225-231.
Jermiin LS, Jayaswal V, Ababneh F, Robinson J (2008) Phylogenetic model evaluation. In:
Bioinformatics: data, sequence analysis, and evolution, pp. 65-91. Totawa: Humana
Press.
References 65
Jermiin L, Ho S, Ababneh F, Robinson J, Larkum A (2004) The biasing effect of
compositional heterogeneity on phylogenetic estimates may be underestimated.
Systematic Biology, 53, 638-643.
Jones D, Taylor W, Thornton J (1992) The rapid generation of mutation data matrices from
protein sequences. Bioinformatics, 8, 275-282.
Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Mammalian Protein
Metabolism, pp. 21-132. Academic Press, New York.
Katoh K (2002) MAFFT: a novel method for rapid multiple sequence alignment based on
fast Fourier transform. Nucleic Acids Research, 30, 3059-3066.
Keane TM, Creevey CJ, Pentony MM, Naughton TJ, Mclnerney JO (2006) Assessment of
methods for amino acid matrix selection and their use on empirical data shows that
ad hoc assumptions for choice of matrix are not justified. BMC Evolutionary Biology,
6, 29.
Kilpert F, Podsiadlowski L (2006) The complete mitochondrial genome of the common sea
slater, Ligia oceanica (Crustacea, Isopoda) bears a novel gene order and unusual
control region features. BMC Genomics, 7, 241.
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions
through comparative studies of nucleotide sequences. Journal of Molecular
Evolution, 16, 111-120.
Kocher TD, Wilson AC (1991) Sequence evolution of mitochondrial DNA in humans and
chimpanzees: control region and a protein-coding region. In: Evolution of life: Fossils,
molecules and culture, pp. 391-413. Springer, Tokyo.
Kolaczkowski B, Thornton J (2008) A mixed branch length model of heterotachy improves
phylogenetic accuracy. Molecular Biology and Evolution, 25, 1054-1066.
Kon T, Nohara M, Yamanoue Y, Fujiwara Y, Nishida M, et al. (2007). Phylogenetic
position of a whale-fall lancelet (Cephalochordata) inferred from whole mitochondrial
genome sequences. BMC Evolutionary Biology, 7, 127.
Kondo M, Akasaka K (2012) Current status of echinoderm genome analysis - What do we
know? Current Genomics, 13, 134-143.
Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment
based on fast Fourier transform. Nucleic Acids Research, 30, 3059-3066.
66
Lake J (1994) Reconstructing evolutionary trees from DNA and protein sequences:
paralinear distances. Proceedings of the National Academy of Sciences USA, 91,
1455-1459.
Lambert C (2005) Historical introduction, overview, and reproductive biology of the
protochordates. Canadian Journal of Zoology, 83, 1-7.
Lambert C, Campenhout JM, DeBolle X, Depiereux E (2003) Review of common
sequence alignment methods: clues to enhance reliability. Current Genomics, 4,
131-146.
Lanave C, Preparata G, Sacone C, Serio G (1984) A new method for calculating
evolutionary substitution rates. Journal of Molecular Evolution, 20, 86-93.
Lang BF, Gray MW, Burger G (1999) Mitochondrial genome evolution and the origin of
eukaryotes. Annual Review of Genetics, 33, 351-397.
Larget B, Simon D (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis
of phylogenetic trees. Molecular Biology and Evolution, 16, 750-759.
Lartillot N, Rodrigue N, Stubbs D, Richer J (2013) PhyloBayes MPI: Phylogenetic
reconstruction with infinite mixtures of profiles in a parallel environment. Systematic
Biology, 62, 611-615.
Lartillot N, Brinkmann H, Philippe H (2007) Suppression of long-branch attraction artefacts
in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary
Biology, 7, S4.
Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in
the amino-acid replacement process. Molecular Biology and Evolution, 21, 1095-
1109.
Lavrov D, Lang B (2005) Poriferan mtDNA and animal phylogeny based on mitochondrial
gene arrangements. Systematic Biology, 54, 651-659.
Lavrov DV, Brown WM (2001) Trichinella spiralis mtDNA: A nematode mitochondrial
genome that encodes a putative ATP8 and normally structured tRNAs and has a
gene arrangement relatable to those of coelomate metazoans. Genetics, 157, 621-
637.
Lee D, Clayton D (1998) Initiation of mitochondrial DNA replication by transcription and R-
loop processing. Journal of Biological Chemistry, 273, 30614-30621.
References 67
Lemey P, Salemi M, Vandamme A (2009) The phylogenetic handbook. Cambridge
University Press, Cambridge.
Lindahl T (1993) Instability and decay of the primary structure of DNA. Nature, 362, 709-
715.
Littlewood D, Smith AB, Clough KA, Emson RH (1997) The interrelationships of the
echinoderm classes: morphological and molecular evidence. Biological Journal of
the Linnean Society, 61, 409-438.
Littlewood DTJ (1995) Echinoderm class relationships revisited. In: Echinoderm research,
pp. 19-28. CRC Press.
Lobry J (1996) Asymmetric substitution patterns in the two DNA strands of bacteria.
Molecular Biology and Evolution, 13, 660-665.
Lockhart PJ, Steel MA, Hendy MD, Penny D (1994) Recovering evolutionary trees under a
more realistic model of sequence evolution. Molecular Biology and Evolution, 11,
605-612.
Lodish H (2008) Molecular cell biology. WH Freeman, New York.
Loomis W, Smith D (1990) Molecular phylogeny of Dictyostelium discoideum by protein
sequence comparison. Proceedings of the National Academy of Sciences USA, 87,
9093-9097.
Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein
evolution. Molecular Biology and Evolution, 19, 1-7.
Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of
sequences with insertions. Proceedings of the National Academy of Science USA,
102, 10557-10562.
Lupi R, de Meo PD, Picardi E, et al. (2010) MitoZoa: A curated mitochondrial genome
database of metazoans for comparative genomics studies. Mitochondrion, 10, 192-
199.
Macey J, Larson A, Ananjeva N, Fang Z, Papenfuss T (1997) Two novel gene orders and
the role of light-strand replication in rearrangement of the vertebrate mitochondrial
genome. Molecular Biology and Evolution, 14, 91-104.
MacIntyre R (1985) Molecular evolutionary genetics. Plenum Press, New York.
68
Mallatt J, Winchell C (2007) Ribosomal RNA genes and deuterostome phylogeny revisited:
More cyclostomes, elasmobranchs, reptiles, and a brittle star. Molecular
Phylogenetics and Evolution, 43, 1005-1022.
McBride H, Neuspiel M, Wasiak S (2006) Mitochondria: more than just a powerhouse.
Current Biology, 16, R551-R560.
McLachlan G, Peel D (2000) Finite mixture models. John Wiley and Sons., New York.
Meade A, Pagel M (2004) A phylogenetic mixture model for detecting pattern-
heterogeneity in gene sequence or character-state data. Systematic Biology, 53,
571-581.
Mohr S, Freeman J, Plasterer T, Smith T (1999) Patterns of mitochondrial DNA strand
asymmetry correlate with phylogeny. Biological Bulletin, 196, 411-412.
Mount D (2001) Bioinformatics: sequence and genome analysis. Cold Spring Harbor
Laboratory Press.
Mueller RL and Boore JL (2005) Molecular mechanisms of extensive mitochondrial gene
rearrangement in Plethodontid salamanders. Molecular Biology and Evolution, 22,
2104-2112.
Nesnidal M, Helmkampf M, Bruchhaus I, Hausdorf B (2010) Compositional heterogeneity
and phylogenomic inference of metazoan relationships. Molecular Biology and
Evolution, 27, 2095-2104.
Nikolaou C, Almirantis Y (2006) Deviations from Chargaff's second parity rule in organellar
DNA. Gene, 381, 34-41.
Nylander J, Ronquist F, Huelsenbeck J, Nieves-Aldrey J (2004) Bayesian phylogenetic
analysis of combined data. Systematic Biology, 53, 47-67.
Osawa S, Honjo T (1991) Evolution of life. Fossils, molecular and culture. Springer-Verlag,
Tokyo.
Paradis E, Claude J, Strimmer K (2004) APE: Analyses of phylogenetics and evolution in
R language. Bioinformatics, 20, 289-290.
Pearse VB, Pearse JS (1994) Echinoderm phylogeny and the place of concentricycloids.
In: Echinoderms Through Time, pp. 121-126. CRC Press.
Perna N, Kocher T (1995) Patterns of nucleotide composition at fourfold degenerate sites
of animal mitochondrial genomes. Journal of Molecular Evolution, 41, 353-358.
References 69
Perseke M, Golombek A, Schlegel M, Struck T (2013) The impact of mitochondrial
genome analyses on the understanding of deuterostome phylogeny. Molecular
Phylogenetics and Evolution, 66, 898-905.
Perseke M, Bernhard D, Fritzsch G, et al. (2010) Mitochondrial genome evolution in
Ophiuroidea, Echinoidea, and Holothuroidea: Insights in phylogenetic relationships
of Echinodermata. Molecular Phylogenetics and Evolution, 56, 201-211.
Perseke M, Fritzsch G, Ramsch K, et al. (2008) Evolution of mitochondrial gene orders in
echinoderms. Molecular Phylogenetics and Evolution, 47, 855-864.
Peterson K (2004) Isolation of Hox and Parahox genes in the hemichordate Ptychodera
flava and the evolution of deuterostome Hox genes. Molecular Phylogenetics and
Evolution, 31, 1208-1215.
Pevzner P (2000) Computational molecular biology: an algorithmic approach. MIT press.
Philippe H, Brinkmann H, Copley R, et al. (2011) Acoelomorph flatworms are
deuterostomes related to Xenoturbella. Nature, 470, 255-258.
Philippe H, Delsuc F, Brinkmann H, Lartillot N (2005) Phylogenomics. Annual Review of
Ecology, Evolution and Systematics, 36, 541-562.
Philippe H, Adoutte A, Coombs GH, Vickerman K, Sleigh MA (1998) The molecular
phylogeny of eukaryota: solid facts and uncertainties. In: Evolutionary relationships
among protozoa, pp. 25-56. Springer.
Pisani D, Feuda R, Peterson KJ, Smith AB (2012) Resolving phylogenetic signal from
noise when divergence is rapid: a new look at the old problem of echinoderm class
relationships. Molecular Phylogenetics and Evolution, 62, 27-34.
Posada D, Buckley T (2004) Model selection and model averaging in phylogenetics:
Advantages of Akaike information criterion and bayesian approaches over likelihood
ratio tests. Systematic Biology, 53, 793-808.
Quang LS, Gascuel O, Lartillot N (2008) Empirical profile mixture models for phylogenetic
reconstruction. Bioinformatics, 24, 2317-2323.
Rannala B, Zhu T, Yang Z (2011) Tail paradox, partial identifiability, and influential priors
in bayesian branch length inference. Molecular Biology and Evolution, 29, 325-335.
Reuter J, Mathews D (2010) RNAstructure: software for RNA secondary structure
prediction and analysis. BMC Bioinformatics, 11, 129.
70
Reyes A, Gissi C, Pesole G, Saccone C (1998) Asymmetrical directional mutation
pressure in the mitochondrial genome of mammals. Molecular Biology and Evolution,
15, 957-966.
Ricci A (2009) Mitochondrial Genomics: structure, inheritance and phylogenetic utility of
the mitochondrial genome in Bivalvia and Insecta. Published doctoral dissertation.
University of Bologna, Bologna, Italy.
Robinson D, Foulds L (1981) Comparison of phylogenetic trees. Mathematical Biosciences,
53, 131-147.
Rocha E, Touchon M, Feil E (2006) Similar compositional biases are caused by very
different mutational effects. Genome Research, 16, 1537-1547.
Rodriguez F, Oliver JL, Marin A, Medina JR (1990) The general stochastic model of
nucleotide substitution. Journal of Theoretical Biology, 142, 485-501.
Rodriguez-Ezpeleta N, Brinkmann H, Roure B, et al. (2007) Detecting and overcoming
systematic errors in genome-scale phylogenies. Systematic Biology, 56, 389-399.
Ronquist F, Teslenko M, van der Mark P, et al. (2012) MrBayes 3.2: Efficient Bayesian
phylogenetic inference and model choice across a large model space. Systematic
Biology, 61, 539-542.
Ruano-Rubio V, Fares M (2007) Artifactual phylogenies caused by correlated distribution
of substitution rates among sites and lineages: The good, the bad, and the ugly.
Systematic Biology, 56, 68-82.
Saccone C, De Giorgi C, Gissi C, Pesole G, Reyes A (1999) Evolutionary genomics in
Metazoa: The mitochondrial DNA as a model system. Gene, 238, 195-209.
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Molecular Biology and Evolution, 4, 406-425.
Sankoff D, Leduc G, Antoine N, et al. (1992) Gene order comparisons for phylogenetic
inference: Evolution of the mitochondrial genome. Proceedings of the National
Academy of Sciences USA, 89, 6575-6579.
San Mauro D, Gower DJ, Zardoya R, Wilkinson M (2006) A hotspot of gene order
rearrangement by tandem duplication and random loss in the vertebrate
mitochondrial genome. Molecular Biology and Evolution, 23, 227-234.
References 71
Schatz G (1963) The isolation of possible mitochondrial precursor structures from
aerobically grown baker’s yeast. Biochemical Biophysical Research
Communications, 12, 448-451.
Schierwater B, Stadler P, DeSalle R, Podsiadlowski L (2013) Mitogenomics and metazoan
evolution. Molecular Phylogenetics and Evolution, 69, 311-312.
Scouras A, Smith M (2006) The complete mitochondrial genomes of the sea lily
Gymnocrinus richeri and the feather star Phanogenia gracilis: Signature nucleotide
bias and unique nad4L gene rearrangement within crinoids. Molecular Phylogenetics
and Evolution, 39, 323-334.
Scouras A, Beckenbach K, Arndt A, Smith MJ (2004) Complete mitochondrial genome
DNA sequence for two ophiuroids and a holothuroid: the utility of protein gene
sequence and gene maps in the analyses of deep deuterostome phylogeny.
Molecular Phylogenetic and Evolution, 31, 50-65.
Scouras A, Smith M (2001) A novel mitochondrial gene order in the crinoid echinoderm
Florometra serratissima. Molecular Biology and Evolution, 18, 61-73.
Shao R, Dowton M, Murrell A, Barker SC (2003). Rates of gene rearrangement and
nucleotide substitution are correlated in the mitochondrial genomes of insects.
Molecular Biology and Evolution, 20, 1612-1619.
Sheffield N, Song H, Cameron S, Whiting M (2009) Nonstationary evolution and
compositional heterogeneity in beetle mitochondrial phylogenomics. Systematic
Biology, 58, 381-394.
Shen X, Tian M, Liu Z, et al. (2009) Complete mitochondrial genome of the sea cucumber
Apostichopus japonicus (Echinodermata: Holothuroidea): The first representative
from the subclass Aspidochirotacea with the echinoderm ground pattern. Gene, 439,
79-86.
Shi J, Zhang Y, Luo H, Tang J (2010) Using jackknife to assess the quality of gene order
phylogenies. BMC Bioinformatics, 11, 168.
Shibutani S, Takeshita M, Grollman A (1991) Insertion of specific bases during DNA
synthesis past the oxidation-damaged base 8-oxodG. Nature, 349, 431-434.
Smith M, Arndt A, Gorski S, Fajber E (1993) The phylogeny of echinoderm classes based
on mitochondrial gene arrangements. Journal of Molecular Evolution, 36, 545-554.
72
Smith A (1992) Echinoderm phylogeny: Morphology and molecules approach accord.
Trends in Ecology and Evolution, 7, 224-229.
Smith AB (1984) Classification of the Echinodermata. Palaeontolgy, 27, 431-459.
Soria-Carrasco V, Talavera G, Igea J, Castresana J (2007) The K tree score:
quantification of differences in the relative branch length and topology of
phylogenetic trees. Bioinformatics, 23, 2954-2956.
Steel M, Huson D, Lockhart P (2000) Invariable sites models and their use in phylogeny
reconstruction. Systematic Biology, 49, 225-232.
Stefankovic D, Vigoda E (2007) Pitfalls of heterogeneous processes for phylogenetic
reconstruction. Systematic Biology, 56, 113-124.
Strathmann RR (1988) Larvae, phylogeny and von Baer's Law. In: Echinoderm Phylogeny
and Evolutionary Biology, pp. 53-68. Oxford Science Publications and Liverpool
Geological Society.
Sullivan J, Joyce P (2005) Model selection in phylogenetics. Annual Review of Ecology,
Evolution and Systematics, 36, 445-466.
Swalla B, Smith A (2008) Deciphering deuterostome phylogeny: molecular, morphological
and palaeontological perspectives. Philosophical Transactions of the Royal Society
B: Biological Sciences, 363, 1557-1568.
Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. In:
Molecular systematics, pp. 407-514. Sinauer Associates.
Taanman J (1999) The mitochondrial genome: structure, transcription, translation and
replication. Biochimica et Biophysica Acta (BBA) - Bioenergetics, 1410, 103-123.
Tamura K, Peterson D, Peterson N, et al. (2011) MEGA5: Molecular evolutionary genetics
analysis using maximum likelihood, evolutionary distance, and maximum parsimony
methods. Molecular Biology and Evolution, 28, 2731-2739.
Telford MJ, Lowe CJ, Cameron CB, et al. (2014) Phylogenomic analysis of echinoderm
class relationships supports Asterozoa. Proceedings of the Royal Society B:
Biological Sciences, 281.
Tisdall J (2001) Beginning Perl for bioinformatics. O'Reilly, Beijing.
Touchon M (2004) Transcription-coupled and splicing-coupled strand asymmetries in
eukaryotic genomes. Nucleic Acids Research, 32, 4969-4978.
References 73
Turbeville J, Schulz JR, Raff RA (1994) Deuterostome phylogeny and the sister group of
the chordates: Evidence from molecules and morphology. Molecular Biology and
Evolution, 11, 648-655
Uzzell T, Corbin K (1971) Fitting discrete probability distributions to evolutionary events.
Science, 172, 1089-1096.
Wada H, Satoh N (1994) Details of the evolutionary history from invertebrates to
vertebrates, as deduced from the sequences of 18S rDNA. Proceedings of the
National Academy of Sciences USA, 91, 1801-1804.
Wang H, Li K, Susko E, Roger A (2008) A class frequency mixture model that adjusts for
site-specific amino acid frequencies and improves inference of protein phylogeny.
BMC Evolutionary Biology, 8, 331.
Wei S, Shi M, Chen X, et al. (2010) New views on strand asymmetry in insect
mitochondrial genomes. PLoS ONE, 5, e12708.
Weisrock D (2012) Concordance analysis in mitogenomic phylogenetics. Molecular
Phylogenetics and Evolution, 65, 194-202.
Weng ML, Blazier JC, Govindu M, Jansen RK (2013) Reconstruction of the ancestral
plastid genome in Geraniaceae reveals a correlation between genome
rearrangements, repeats, and nucleotide substitution rates. Molecular Biology
Evolution, 31, 645-659.
Whelan S, Goldman N (2001) Molecular phylogenetics: state-of-the-art methods for
looking into the past. Trends in Genetics, 17, 262-272.
Woese C, Achenbach L, Rouviere P, Mandelco L (1991) Archaeal phylogeny:
Reexamination of the phylogenetic position of Archaeoglohus fulgidus in light of
certain composition-induced artifacts. Systematic and Applied Microbiology, 14, 364-
371.
Wolstenholme DR (1992) Animal mitochondrial DNA: structure and evolution. International
Review of Cytology, 141, 173–216
Wyman S, Jansen R, Boore J (2004) Automatic annotation of organellar genomes with
DOGMA. Bioinformatics, 20, 3252-3255.
Xu W, Jameson D, Tang B, Higgs PG (2006). The relationship between the rate of
molecular evolution and the rate of genome rearrangement in animal mitochondrial
genomes. Journal of Molecular Evolution, 63, 375-392.
74
Yang Z, Rannala B (2012) Molecular phylogenetics: principles and practice. Nature
Reviews Genetics, 13, 303-314.
Yang Z (1996) Among-site rate variation and its impact on phylogenetic analyses. Trends
in Ecology and Evolution, 11, 367-372.
Zhong M, Struck T, Halanych K (2008) Phylogenetic information from three mitochondrial
genomes of Terebelliformia (Annelida) worms and duplication of the methionine
tRNA. Gene, 416, 11-21.
APPENDIX
Appendix 77
A.1 Supplementary Data
Table A.1.1. Species analyzed in this study. Sequences and detailed information can be cross-referenced by GenBank accession
numbers (available at http://www.ncbi.nlm.nih.gov).
Phylum Class Species Accession No.
Echinodermata Asteroidea Acanthaster brevispinus NC_007789
Acanthaster planci NC_007788
Asterias amurensis NC_006665
Astropecten polyacanthus NC_006666
Luidia quinalia NC_006664
Patiria pectinifera NC_001627
Pisaster ochraceus NC_004610
Echinoidea Arbacia lixula NC_001770
Echinocardium cordatum NC_013881
Paracentrotus lividus NC_001572
Strongylocentrotus droebachiensis NC_009940
Strongylocentrotus pallidus NC_009941
Strongylocentrotus purpuratus NC_001453
Holothuroidea Apostichopus japonicus NC_012616
Cucumaria miniata NC_005929
Holothuria forskali NC_013884
Parastichopus nigripunctatus NC_013432
Stichopus horrens NC_014454
Stichopus sp. SF-2010 NC_014452
Ophiuroidea Amphipholis squamata NC_013876
Astrospartus mediterraneus NC_013878
Ophiocomina nigra NC_013874
Ophiopholis aculeata NC_005334
Ophiura albida NC_010691
Ophiura lutkeni NC_005930
Crinoidea Antedon mediterranea NC_010692
Florometra serratissima NC_001878
Neogymnocrinus richeri NC_007689
Phanogenia gracilis NC_007690
Hemichordata Enteropneusta Balanoglossus carnosus NC_001887
Balanoglossus clavigerus NC_013877
Saccoglossus kowalevskii NC_007438
78
Table A.1.2. Proportion of nucleotides for each mitochondrial protein-coding gene sequence and respective AT and GC skews
for all species used in this study. The values were calculated with Mega v5.2 and Excel.
Classe Species Gene Proportion of Nucleotides %A+T %G+C AT skew GC skew
A C T G
Asteroidea Patiria pectinifera COX1 0.451 0.234 0.234 0.080 68.5 31.4 0.317 -0.490
COX2 0.437 0.330 0.169 0.062 60.6 39.2 0.442 -0.680
ATP8 0.428 0.321 0.214 0.035 64.2 35.6 0.333 -0.800
ATP6 0.352 0.392 0.208 0.048 56.0 44.0 0.257 -0.780
COX3 0.315 0.273 0.295 0.114 61.0 38.7 0.033 -0.410
ND3 0.333 0.333 0.263 0.070 59.6 40.3 0.117 -0.650
ND4L 0.410 0.303 0.160 0.125 57.0 42.8 0.437 -0.410
ND4 0.347 0.347 0.187 0.118 53.4 46.5 0.300 -0.490
ND5 0.393 0.278 0.278 0.049 67.1 32.7 0.170 -0.690
ND6 0.204 0.060 0.445 0.289 64.9 34.9 -0.370 0.655
CYTB 0.468 0.255 0.223 0.052 69.1 30.7 0.353 -0.660
ND1 0.160 0.051 0.528 0.258 68.8 30.9 -0.530 0.666
ND2 0.209 0.098 0.470 0.220 67.9 31.8 -0.380 0.381
Pisaster ochraceus COX1 0.357 0.225 0.371 0.046 72.8 27.1 -0.019 -0.661
COX2 0.440 0.230 0.290 0.040 73.0 27.0 0.205 -0.700
ATP8 0.117 0.294 0.529 0.058 64.6 35.2 -0.630 -0.660
ATP6 0.389 0.176 0.371 0.061 76.0 23.7 0.023 -0.480
COX3 0.429 0.177 0.348 0.044 77.7 22.1 0.104 -0.600
ND3 0.448 0.244 0.265 0.040 71.3 28.4 0.257 -0.710
ND4L 0.319 0.340 0.340 0.000 65.9 34.0 -0.030 -1.000
ND4 0.368 0.195 0.382 0.053 75.0 24.8 -0.010 -0.570
ND5 0.418 0.220 0.324 0.036 74.2 25.6 0.126 -0.710
ND6 0.160 0.086 0.493 0.259 65.3 34.5 -0.500 0.500
CYTB 0.442 0.247 0.278 0.031 72.0 27.8 0.226 -0.770
ND1 0.250 0.041 0.500 0.208 75.0 24.9 -0.330 0.666
ND2 0.345 0.075 0.396 0.182 74.1 25.7 -0.060 0.414
Luidia quinalia COX1 0.349 0.283 0.309 0.058 65.8 34.1 0.061 -0.660
COX2 0.323 0.228 0.380 0.066 70.3 29.4 -0.081 -0.540
ATP8 0.375 0.375 0.208 0.041 58.3 41.6 0.285 -0.800
ATP6 0.322 0.282 0.290 0.104 61.2 38.6 0.052 -0.450
COX3 0.321 0.284 0.321 0.072 64.2 35.6 0.000 -0.590
ND3 0.316 0.233 0.400 0.050 71.6 28.3 -0.110 -0.640
ND4L 0.350 0.228 0.385 0.035 73.5 26.3 -0.040 -0.730
ND4 0.274 0.297 0.354 0.072 62.8 36.9 -0.120 -0.600
ND5 0.317 0.266 0.326 0.088 64.3 35.4 -0.010 -0.500
ND6 0.317 0.097 0.365 0.219 68.2 31.6 -0.070 0.384
Appendix 79
CYTB 0.314 0.293 0.287 0.104 60.1 39.7 0.043 -0.470
ND1 0.243 0.085 0.414 0.256 65.7 34.1 -0.250 0.500
ND2 0.222 0.087 0.432 0.257 65.4 34.4 -0.320 0.491
Asterias amurensis COX1 0.328 0.176 0.440 0.054 76.8 23.0 -0.146 -0.530
COX2 0.326 0.192 0.394 0.086 72.0 27.8 -0.094 -0.370
ATP8 0.440 0.120 0.320 0.120 76.0 24.0 0.157 0.000
ATP6 0.375 0.151 0.401 0.071 77.6 22.2 -0.030 -0.360
COX3 0.401 0.204 0.340 0.053 74.1 25.7 0.081 -0.580
ND3 0.183 0.244 0.469 0.102 65.2 34.6 -0.430 -0.410
ND4L 0.255 0.255 0.446 0.042 70.1 29.7 -0.270 -0.710
ND4 0.353 0.174 0.388 0.082 74.1 25.6 -0.040 -0.350
ND5 0.378 0.174 0.375 0.071 75.3 24.5 0.004 -0.420
ND6 0.341 0.101 0.329 0.227 67.0 32.8 0.018 0.384
CYTB 0.356 0.227 0.340 0.075 69.6 30.2 0.023 -0.500
ND1 0.314 0.111 0.388 0.185 70.2 29.6 -0.100 0.250
ND2 0.339 0.103 0.406 0.151 74.5 25.4 -0.080 0.190
Astropecten polyacanthus COX1 0.476 0.170 0.263 0.088 73.9 25.8 0.288 -0.318
COX2 0.500 0.187 0.250 0.062 75.0 24.9 0.333 -0.500
ATP8 0.423 0.230 0.269 0.076 69.2 30.6 0.222 -0.500
ATP6 0.491 0.288 0.152 0.067 64.3 35.5 0.526 -0.610
COX3 0.450 0.267 0.218 0.063 66.8 33.0 0.347 -0.610
ND3 0.353 0.246 0.246 0.153 59.9 39.9 0.179 -0.230
ND4L 0.386 0.204 0.363 0.045 74.9 24.9 0.030 -0.630
ND4 0.393 0.305 0.216 0.084 60.9 38.9 0.291 -0.560
ND5 0.406 0.210 0.323 0.060 72.9 27.0 0.114 -0.550
ND6 0.250 0.039 0.486 0.223 73.6 26.2 -0.320 0.700
CYTB 0.426 0.205 0.300 0.068 72.6 27.3 0.173 -0.500
ND1 0.251 0.058 0.541 0.148 79.2 20.6 -0.360 0.437
ND2 0.299 0.107 0.395 0.197 69.4 30.4 -0.130 0.294
Acanthaster planci COX1 0.460 0.273 0.163 0.102 62.3 37.5 0.477 -0.456
COX2 0.401 0.348 0.133 0.116 53.4 46.4 0.502 -0.500
ATP8 0.384 0.346 0.230 0.038 61.4 38.4 0.250 -0.800
ATP6 0.393 0.325 0.181 0.098 57.4 42.3 0.368 -0.530
COX3 0.384 0.363 0.160 0.090 54.4 45.3 0.410 -0.600
ND3 0.344 0.262 0.213 0.180 55.7 44.2 0.235 -0.180
ND4L 0.363 0.254 0.272 0.109 63.5 36.3 0.142 -0.400
ND4 0.324 0.354 0.222 0.098 54.6 45.2 0.186 -0.560
ND5 0.404 0.339 0.170 0.085 57.4 42.4 0.405 -0.590
ND6 0.138 0.128 0.356 0.376 49.4 50.4 -0.440 0.490
CYTB 0.333 0.378 0.181 0.106 51.4 48.4 0.294 -0.560
ND1 0.169 0.101 0.401 0.327 57.0 42.8 -0.400 0.526
80
ND2 0.129 0.153 0.408 0.307 53.7 46.0 -0.510 0.333
Acanthaster brevispinus COX1 0.460 0.276 0.179 0.083 63.9 35.9 0.440 -0.538
COX2 0.495 0.327 0.123 0.053 61.8 38.0 0.602 -0.720
ATP8 0.440 0.360 0.200 0.000 64.0 36.0 0.375 -1.000
ATP6 0.443 0.368 0.135 0.052 57.8 42.0 0.532 -0.750
COX3 0.386 0.351 0.179 0.082 56.5 43.3 0.365 -0.610
ND3 0.406 0.312 0.140 0.140 54.6 45.2 0.485 -0.370
ND4L 0.490 0.264 0.169 0.075 65.9 33.9 0.485 -0.550
ND4 0.335 0.371 0.209 0.083 54.4 45.4 0.231 -0.630
ND5 0.403 0.342 0.158 0.095 56.1 43.7 0.435 -0.560
ND6 0.150 0.130 0.370 0.350 52.0 48.0 -0.420 0.458
CYTB 0.370 0.330 0.215 0.085 58.5 41.5 0.264 -0.590
ND1 0.173 0.104 0.404 0.317 57.7 42.1 -0.400 0.506
ND2 0.133 0.137 0.428 0.300 56.1 43.7 -0.520 0.370
Echinoidea Strongylocentrotus droebachiensis COX1 0.332 0.232 0.317 0.118 64.9 35.0 0.023 -0.326
COX2 0.336 0.267 0.302 0.095 63.8 36.2 0.054 -0.476
ATP8 0.280 0.360 0.280 0.080 56.0 44.0 0.000 -0.636
ATP6 0.402 0.205 0.303 0.090 70.5 29.5 0.140 -0.389
COX3 0.419 0.265 0.221 0.096 64.0 36.1 0.310 -0.469
ND3 0.377 0.295 0.230 0.098 60.7 39.3 0.243 -0.500
ND4L 0.389 0.241 0.222 0.148 61.1 38.9 0.273 -0.238
ND4 0.401 0.227 0.264 0.108 66.5 33.5 0.207 -0.356
ND5 0.354 0.268 0.254 0.124 60.8 39.2 0.166 -0.368
ND6 0.273 0.409 0.242 0.076 51.5 48.5 0.059 -0.688
CYTB 0.412 0.263 0.232 0.093 64.4 35.6 0.280 -0.478
ND1 0.339 0.241 0.259 0.161 59.8 40.2 0.135 -0.200
ND2 0.295 0.258 0.321 0.126 61.6 38.4 -0.043 -0.342
Strongylocentrotus pallidus COX1 0.348 0.249 0.293 0.110 64.1 35.9 0.086 -0.388
COX2 0.319 0.233 0.319 0.129 63.8 36.2 0.000 -0.286
ATP8 0.292 0.333 0.292 0.083 58.4 41.6 0.000 -0.600
ATP6 0.402 0.205 0.308 0.085 71.0 29.0 0.133 -0.412
COX3 0.438 0.277 0.204 0.080 64.2 35.7 0.364 -0.551
ND3 0.333 0.270 0.286 0.111 61.9 38.1 0.077 -0.417
ND4L 0.444 0.241 0.222 0.093 66.6 33.4 0.333 -0.444
ND4 0.386 0.246 0.250 0.117 63.6 36.3 0.214 -0.354
ND5 0.348 0.262 0.256 0.134 60.4 39.6 0.151 -0.324
ND6 0.299 0.403 0.224 0.075 52.3 47.8 0.143 -0.688
CYTB 0.373 0.290 0.218 0.119 59.1 40.9 0.263 -0.418
ND1 0.310 0.263 0.263 0.164 57.3 42.7 0.082 -0.233
ND2 0.302 0.265 0.323 0.111 62.5 37.6 -0.034 -0.408
Appendix 81
Echinocardium cordatum COX1 0.387 0.204 0.247 0.161 63.4 36.5 0.220 -0.118
COX2 0.324 0.204 0.259 0.213 58.3 41.7 0.111 0.022
ATP8 0.500 0.154 0.192 0.154 69.2 30.8 0.444 0.000
ATP6 0.407 0.179 0.293 0.122 70.0 30.1 0.163 -0.189
COX3 0.457 0.250 0.186 0.107 64.3 35.7 0.422 -0.400
ND3 0.383 0.250 0.217 0.150 60.0 40.0 0.278 -0.250
ND4L 0.382 0.218 0.255 0.145 63.7 36.3 0.200 -0.200
ND4 0.318 0.271 0.271 0.141 58.9 41.2 0.080 -0.314
ND5 0.402 0.261 0.212 0.125 61.4 38.6 0.309 -0.353
ND6 0.258 0.439 0.227 0.076 48.5 51.5 0.063 -0.706
CYTB 0.332 0.284 0.200 0.184 53.2 46.8 0.248 -0.213
ND1 0.276 0.310 0.247 0.167 52.3 47.7 0.055 -0.301
ND2 0.372 0.251 0.230 0.147 60.2 39.8 0.235 -0.263
Strongylocentrotus purpuratus COX1 0.350 0.243 0.306 0.099 65.6 34.2 0.067 -0.421
COX2 0.321 0.277 0.321 0.080 64.2 35.7 0.000 -0.550
ATP8 0.285 0.357 0.285 0.071 57.0 42.8 0.000 -0.660
ATP6 0.418 0.237 0.278 0.065 69.6 30.2 0.200 -0.560
COX3 0.413 0.278 0.240 0.067 65.3 34.5 0.264 -0.600
ND3 0.281 0.328 0.218 0.171 49.9 49.9 0.125 -0.310
ND4L 0.425 0.259 0.222 0.092 64.7 35.1 0.314 -0.470
ND4 0.401 0.258 0.227 0.111 62.8 36.9 0.276 -0.390
ND5 0.395 0.261 0.228 0.114 62.3 37.5 0.267 -0.390
ND6 0.296 0.175 0.373 0.153 66.9 32.8 -0.110 -0.060
CYTB 0.354 0.322 0.201 0.121 55.5 44.3 0.276 -0.450
ND1 0.362 0.228 0.269 0.140 63.1 36.8 0.148 -0.230
ND2 0.298 0.250 0.326 0.125 62.4 37.5 -0.040 -0.330
Paracentrotus lividus COX1 0.516 0.163 0.245 0.074 76.1 23.7 0.356 -0.376
COX2 0.413 0.250 0.293 0.043 70.6 29.3 0.170 -0.700
ATP8 0.550 0.050 0.350 0.050 90.0 10.0 0.222 0.000
ATP6 0.379 0.255 0.279 0.085 65.8 34.0 0.152 -0.500
COX3 0.485 0.235 0.227 0.051 71.2 28.6 0.360 -0.640
ND3 0.508 0.196 0.180 0.114 68.8 31.0 0.476 -0.260
ND4L 0.471 0.245 0.226 0.056 69.7 30.1 0.351 -0.620
ND4 0.421 0.238 0.214 0.125 63.5 36.3 0.325 -0.310
ND5 0.427 0.235 0.229 0.107 65.6 34.2 0.301 -0.370
ND6 0.229 0.149 0.448 0.172 67.7 32.1 -0.320 0.071
CYTB 0.446 0.248 0.177 0.126 62.3 37.4 0.430 -0.320
ND1 0.423 0.237 0.242 0.096 66.5 33.3 0.271 -0.420
ND2 0.417 0.241 0.236 0.104 65.3 34.5 0.277 -0.390
Arbacia lixula COX1 0.401 0.208 0.323 0.066 72.4 27.4 0.108 -0.518
82
COX2 0.366 0.169 0.410 0.053 77.6 22.2 -0.057 -0.520
ATP8 0.375 0.125 0.416 0.083 79.1 20.8 -0.050 -0.200
ATP6 0.385 0.188 0.303 0.122 68.8 31.0 0.119 -0.210
COX3 0.367 0.191 0.360 0.080 72.7 27.1 0.010 -0.400
ND3 0.345 0.254 0.290 0.109 63.5 36.3 0.085 -0.400
ND4L 0.511 0.155 0.288 0.044 79.9 19.9 0.277 -0.550
ND4 0.347 0.177 0.421 0.053 76.8 23.0 -0.090 -0.530
ND5 0.384 0.212 0.333 0.069 71.7 28.1 0.071 -0.500
ND6 0.333 0.114 0.390 0.160 72.3 27.4 -0.070 0.166
CYTB 0.364 0.187 0.348 0.098 71.2 28.5 0.021 -0.300
ND1 0.333 0.255 0.345 0.065 67.8 32.0 -0.010 -0.590
ND2 0.372 0.194 0.338 0.094 71.0 28.8 0.046 -0.340
Holothuroidea Apostichopus japonicus COX1 0.425 0.157 0.321 0.096 74.6 25.3 0.139 -0.239
COX2 0.396 0.198 0.245 0.160 64.1 35.8 0.235 -0.105
ATP8 0.563 0.188 0.188 0.063 75.1 25.1 0.500 -0.500
ATP6 0.413 0.165 0.314 0.107 72.7 27.2 0.136 -0.212
COX3 0.515 0.221 0.132 0.132 64.7 35.3 0.591 -0.250
ND3 0.429 0.143 0.286 0.143 71.5 28.6 0.200 0.000
ND4L 0.345 0.182 0.273 0.200 61.8 38.2 0.118 0.048
ND4 0.426 0.174 0.248 0.153 67.4 32.7 0.264 -0.063
ND5 0.004 0.002 0.002 0.002 0.60 0.40 0.264 -0.063
ND6 0.357 0.411 0.161 0.071 51.8 48.2 0.379 -0.704
CYTB 0.451 0.113 0.308 0.128 75.9 24.1 0.189 0.064
ND1 0.355 0.122 0.366 0.157 72.1 27.9 -0.016 0.125
ND2 0.415 0.199 0.239 0.148 65.4 34.7 0.270 -0.148
Parastichopus nigripunctatus COX1 0.414 0.161 0.321 0.104 73.5 26.5 0.126 -0.216
COX2 0.390 0.219 0.229 0.162 61.9 38.1 0.262 -0.150
ATP8 0.563 0.188 0.188 0.063 75.1 25.1 0.500 -0.500
ATP6 0.393 0.188 0.291 0.128 68.4 31.6 0.150 -0.189
COX3 0.489 0.222 0.148 0.141 63.7 36.3 0.535 -0.224
ND3 0.411 0.125 0.304 0.161 71.5 28.6 0.150 0.125
ND4L 0.389 0.185 0.259 0.167 64.8 35.2 0.200 -0.053
ND4 0.438 0.161 0.260 0.140 69.8 30.1 0.254 -0.068
ND5 0.362 0.146 0.327 0.165 68.9 31.1 0.051 0.061
ND6 0.304 0.429 0.179 0.089 48.3 51.8 0.259 -0.655
CYTB 0.434 0.112 0.306 0.148 74.0 26.0 0.172 0.137
ND1 0.343 0.116 0.343 0.198 68.6 31.4 0.000 0.259
ND2 0.412 0.175 0.254 0.158 66.6 33.3 0.237 -0.051
Holothuria forskali COX1 0.437 0.146 0.366 0.052 80.3 19.8 0.088 -0.472
COX2 0.367 0.257 0.321 0.055 68.8 31.2 0.067 -0.647
ATP8 0.294 0.294 0.353 0.059 64.7 35.3 -0.091 -0.667
Appendix 83
ATP6 0.481 0.194 0.279 0.047 76.0 24.1 0.265 -0.613
COX3 0.475 0.206 0.262 0.057 73.7 26.3 0.288 -0.568
ND3 0.320 0.240 0.360 0.080 68.0 32.0 -0.059 -0.500
ND4L 0.306 0.265 0.347 0.082 65.3 34.7 -0.063 -0.529
ND4 0.434 0.200 0.294 0.072 72.8 27.2 0.193 -0.469
ND5 0.463 0.210 0.246 0.081 70.9 29.1 0.306 -0.444
ND6 0.345 0.400 0.182 0.073 52.7 47.3 0.310 -0.692
CYTB 0.488 0.192 0.276 0.044 76.4 23.6 0.277 -0.625
ND1 0.411 0.194 0.300 0.094 71.1 28.8 0.156 -0.346
ND2 0.447 0.207 0.261 0.085 70.8 29.2 0.263 -0.418
Stichopus sp. SF-2010 COX1 0.409 0.189 0.317 0.085 72.6 27.4 0.127 -0.377
COX2 0.413 0.294 0.257 0.037 67.0 33.1 0.233 -0.778
ATP8 0.478 0.217 0.304 0.000 78.2 21.7 0.222 -1.000
ATP6 0.472 0.232 0.248 0.048 72.0 28.0 0.311 -0.657
COX3 0.478 0.216 0.239 0.067 71.7 28.3 0.333 -0.526
ND3 0.304 0.411 0.232 0.054 53.6 46.5 0.133 -0.769
ND4L 0.327 0.273 0.382 0.018 70.9 29.1 -0.077 -0.875
ND4 0.279 0.265 0.347 0.110 62.6 37.5 -0.109 -0.415
ND5 0.461 0.272 0.168 0.099 62.9 37.1 0.467 -0.468
ND6 0.281 0.439 0.211 0.070 49.2 50.9 0.143 -0.724
CYTB 0.466 0.279 0.172 0.083 63.8 36.2 0.462 -0.541
ND1 0.389 0.232 0.243 0.135 63.2 36.7 0.231 -0.265
ND2 0.371 0.299 0.237 0.093 60.8 39.2 0.220 -0.526
Cucumaria miniata COX1 0.530 0.250 0.174 0.043 70.4 29.3 0.506 -0.706
COX2 0.359 0.300 0.252 0.087 61.1 38.7 0.175 -0.550
ATP8 0.263 0.368 0.315 0.052 57.8 42.0 -0.090 -0.750
ATP6 0.500 0.294 0.196 0.008 69.6 30.2 0.435 -0.940
COX3 0.500 0.250 0.212 0.037 71.2 28.7 0.404 -0.730
ND3 0.423 0.250 0.269 0.057 69.2 30.7 0.222 -0.620
ND4L 0.351 0.296 0.277 0.074 62.8 37.0 0.117 -0.600
ND4 0.524 0.262 0.140 0.072 66.4 33.4 0.576 -0.560
ND5 0.455 0.278 0.214 0.051 66.9 32.9 0.360 -0.690
ND6 0.160 0.074 0.629 0.135 78.9 20.9 -0.590 0.294
CYTB 0.525 0.237 0.185 0.051 71.0 28.8 0.478 -0.640
ND1 0.508 0.289 0.138 0.063 64.6 35.2 0.571 -0.630
ND2 0.565 0.232 0.148 0.053 71.3 28.5 0.583 -0.620
Stichopus horrens COX1 0.413 0.189 0.317 0.082 73.0 27.1 0.132 -0.395
COX2 0.404 0.257 0.294 0.046 69.8 30.3 0.158 -0.697
ATP8 0.455 0.273 0.273 0.000 72.8 27.3 0.250 -1.000
ATP6 0.463 0.231 0.264 0.041 72.7 27.2 0.273 -0.697
COX3 0.466 0.237 0.221 0.076 68.7 31.3 0.356 -0.512
84
ND3 0.304 0.446 0.196 0.054 50.0 50.0 0.214 -0.786
ND4L 0.333 0.315 0.333 0.019 66.6 33.4 0.000 -0.889
ND4 0.309 0.290 0.286 0.115 59.5 40.5 0.039 -0.432
ND5 0.463 0.290 0.143 0.104 60.6 39.4 0.528 -0.473
ND6 0.305 0.424 0.203 0.068 50.8 49.2 0.200 -0.724
CYTB 0.454 0.278 0.185 0.083 63.9 36.1 0.420 -0.541
ND1 0.385 0.187 0.280 0.148 66.5 33.5 0.157 -0.115
ND2 0.351 0.320 0.211 0.119 56.2 43.9 0.248 -0.459
Ophiuroidea Ophiura lutkeni COX1 0.444 0.186 0.254 0.114 69.8 30.0 0.272 -0.240
COX2 0.387 0.215 0.322 0.075 70.9 29.0 0.092 -0.480
ATP8 0.631 0.105 0.263 0.000 89.4 10.5 0.411 -1.000
ATP6 0.400 0.230 0.310 0.060 71.0 29.0 0.126 -0.580
COX3 0.425 0.261 0.246 0.067 67.1 32.8 0.266 -0.590
ND3 0.511 0.222 0.266 0.000 77.7 22.2 0.314 -1.000
ND4L 0.472 0.166 0.194 0.166 66.6 33.2 0.416 0.000
ND4 0.447 0.158 0.303 0.090 75.0 24.8 0.192 -0.270
ND5 0.436 0.231 0.258 0.073 69.4 30.4 0.256 -0.510
ND6 0.215 0.050 0.468 0.265 68.3 31.5 -0.370 0.680
CYTB 0.502 0.216 0.198 0.081 70.0 29.7 0.433 -0.450
ND1 0.420 0.158 0.359 0.060 77.9 21.8 0.078 -0.440
ND2 0.422 0.187 0.281 0.107 70.3 29.4 0.200 -0.270
Ophiopholis aculeata COX1 0.510 0.212 0.241 0.035 75.1 24.7 0.358 -0.717
COX2 0.500 0.226 0.254 0.018 75.4 24.4 0.326 -0.840
ATP8 0.333 0.416 0.166 0.083 49.9 49.9 0.333 -0.660
ATP6 0.460 0.260 0.234 0.043 69.4 30.3 0.325 -0.710
COX3 0.461 0.223 0.246 0.069 70.7 29.2 0.304 -0.520
ND3 0.511 0.222 0.266 0.000 77.7 22.2 0.314 -1.000
ND4L 0.510 0.163 0.285 0.040 79.5 20.3 0.282 -0.600
ND4 0.425 0.280 0.225 0.068 65.0 34.8 0.307 -0.600
ND5 0.392 0.234 0.280 0.092 67.2 32.6 0.166 -0.430
ND6 0.168 0.060 0.626 0.144 79.4 20.4 -0.570 0.411
CYTB 0.206 0.089 0.491 0.212 69.7 30.1 -0.400 0.407
ND1 0.189 0.059 0.573 0.177 76.2 23.6 -0.500 0.500
ND2 0.223 0.094 0.547 0.135 77.0 22.9 -0.410 0.179
Ophiura albida COX1 0.385 0.096 0.431 0.088 81.6 18.4 -0.057 -0.042
COX2 0.424 0.141 0.402 0.033 82.6 17.4 0.000 -0.625
ATP8 0.647 0.059 0.294 0.000 94.1 5.90 0.375 -1.000
ATP6 0.441 0.151 0.376 0.032 81.7 18.3 0.079 -0.647
COX3 0.474 0.146 0.321 0.058 79.5 20.4 0.193 -0.429
ND3 0.391 0.152 0.391 0.065 78.2 21.7 0.000 -0.400
ND4L 0.471 0.029 0.471 0.029 94.2 5.80 0.000 0.000
Appendix 85
ND4 0.423 0.202 0.296 0.080 71.9 28.2 0.176 -0.433
ND5 0.511 0.173 0.244 0.071 75.5 24.4 0.353 -0.415
ND6 0.263 0.421 0.228 0.088 49.1 50.9 0.071 -0.655
CYTB 0.424 0.164 0.277 0.136 70.1 30.0 0.210 -0.094
ND1 0.390 0.221 0.260 0.130 65.0 35.1 0.200 -0.259
ND2 0.492 0.114 0.280 0.114 77.2 22.8 0.275 0.000
Ophiocomina nigra COX1 0.478 0.235 0.180 0.107 65.8 34.2 0.453 -0.376
COX2 0.377 0.292 0.264 0.066 64.1 35.8 0.176 -0.632
ATP8 0.560 0.160 0.240 0.040 80.0 20.0 0.400 -0.600
ATP6 0.443 0.252 0.165 0.139 60.8 39.1 0.457 -0.289
COX3 0.397 0.331 0.184 0.088 58.1 41.9 0.367 -0.579
ND3 0.373 0.390 0.169 0.068 54.2 45.8 0.375 -0.704
ND4L 0.435 0.326 0.152 0.087 58.7 41.3 0.481 -0.579
ND4 0.382 0.289 0.197 0.133 57.9 42.2 0.319 -0.371
ND5 0.367 0.167 0.338 0.128 70.5 29.5 0.042 -0.133
ND6 0.216 0.392 0.196 0.196 41.2 58.8 0.048 -0.333
CYTB 0.261 0.307 0.216 0.216 47.7 52.3 0.096 -0.175
ND1 0.264 0.364 0.202 0.171 46.6 53.5 0.133 -0.362
ND2 0.270 0.350 0.285 0.095 55.5 44.5 -0.026 -0.574
Amphipholis squamata COX1 0.476 0.150 0.359 0.015 83.5 16.5 0.140 -0.822
COX2 0.340 0.180 0.460 0.020 80.0 20.0 -0.150 -0.800
ATP8 0.467 0.133 0.400 0.000 86.7 13.3 0.077 -1.000
ATP6 0.419 0.162 0.390 0.029 80.9 19.1 0.035 -0.700
COX3 0.434 0.217 0.341 0.008 77.5 22.5 0.120 -0.931
ND3 0.340 0.189 0.434 0.038 77.4 22.7 -0.122 -0.667
ND4L 0.383 0.191 0.426 0.000 80.9 19.1 -0.053 -1.000
ND4 0.441 0.228 0.297 0.035 73.8 26.3 0.195 -0.736
ND5 0.388 0.157 0.409 0.045 79.7 20.2 -0.026 -0.552
ND6 0.211 0.404 0.228 0.158 43.9 56.2 -0.040 -0.438
CYTB 0.257 0.319 0.201 0.222 45.8 54.1 0.121 -0.179
ND1 0.203 0.390 0.212 0.195 41.5 58.5 -0.020 -0.333
ND2 0.212 0.271 0.381 0.136 59.3 40.7 -0.286 -0.333
Astrospartus mediterraneus COX1 0.575 0.146 0.264 0.015 83.9 16.1 0.370 -0.810
COX2 0.626 0.165 0.198 0.011 82.4 17.6 0.520 -0.875
ATP8 0.438 0.250 0.313 0.000 75.1 25.0 0.167 -1.000
ATP6 0.448 0.184 0.356 0.011 80.4 19.5 0.114 -0.882
COX3 0.546 0.143 0.286 0.025 83.2 16.8 0.313 -0.700
ND3 0.571 0.143 0.265 0.020 83.6 16.3 0.366 -0.750
ND4L 0.450 0.250 0.300 0.000 75.0 25.0 0.200 -1.000
ND4 0.656 0.103 0.226 0.015 88.2 11.8 0.488 -0.739
ND5 0.468 0.128 0.372 0.032 84.0 16.0 0.114 -0.600
86
ND6 0.333 0.296 0.222 0.148 55.5 44.4 0.200 -0.333
CYTB 0.269 0.313 0.201 0.216 47.0 52.9 0.143 -0.183
ND1 0.287 0.339 0.191 0.183 47.8 52.2 0.200 -0.300
ND2 0.327 0.243 0.308 0.121 63.5 36.4 0.029 -0.333
Crinoidea Florometra serratissima COX1 0.136 0.015 0.804 0.042 94.0 5.70 -0.711 0.474
COX2 0.157 0.031 0.778 0.031 93.5 6.20 -0.664 0.000
ATP8 0.076 0.000 0.769 0.153 84.5 15.3 -0.810 1.000
ATP6 0.118 0.010 0.795 0.075 91.3 8.50 -0.740 0.750
COX3 0.151 0.026 0.794 0.026 94.5 5.20 -0.670 0.000
ND3 0.142 0.023 0.785 0.047 92.7 7.00 -0.690 0.333
ND4L 0.095 0.000 0.809 0.095 90.4 9.50 -0.780 1.000
ND4 0.165 0.020 0.751 0.062 91.6 8.20 -0.630 0.500
ND5 0.198 0.019 0.669 0.112 86.7 13.1 -0.540 0.705
ND6 0.558 0.132 0.294 0.014 85.2 14.6 0.310 -0.800
CYTB 0.187 0.018 0.727 0.066 91.4 8.40 -0.580 0.571
ND1 0.681 0.050 0.246 0.021 92.7 7.10 0.468 -0.400
ND2 0.719 0.049 0.214 0.016 93.3 6.50 0.539 -0.500
Antedon mediterranea COX1 0.143 0.020 0.789 0.048 93.2 6.80 -0.692 0.412
COX2 0.177 0.000 0.781 0.042 95.8 4.20 -0.630 1.000
ATP8 0.111 0.000 0.778 0.111 88.9 11.1 -0.750 1.000
ATP6 0.106 0.043 0.809 0.043 91.5 8.60 -0.767 0.000
COX3 0.158 0.008 0.733 0.100 89.1 10.8 -0.645 0.846
ND3 0.280 0.000 0.640 0.080 92.0 8.00 -0.391 1.000
ND4L 0.057 0.000 0.857 0.086 91.4 8.60 -0.875 1.000
ND4 0.230 0.038 0.634 0.098 86.4 13.6 -0.468 0.440
ND5 0.208 0.031 0.690 0.071 89.8 10.2 -0.537 0.385
ND6 0.333 0.244 0.311 0.111 64.4 35.5 0.034 -0.375
CYTB 0.169 0.039 0.708 0.084 87.7 12.3 -0.615 0.368
ND1 0.227 0.347 0.267 0.160 49.4 50.7 -0.081 -0.368
ND2 0.268 0.225 0.366 0.141 63.4 36.6 -0.156 -0.231
Gymnocrinus richeri COX1 0.087 0.014 0.821 0.076 90.8 9.00 -0.808 0.689
COX2 0.101 0.000 0.828 0.070 92.9 7.00 -0.783 1.000
ATP8 0.045 0.045 0.727 0.181 77.2 22.6 -0.880 0.600
ATP6 0.066 0.018 0.783 0.132 84.9 15.0 -0.840 0.750
COX3 0.109 0.054 0.726 0.109 83.5 16.3 -0.730 0.333
ND3 0.150 0.037 0.754 0.056 90.4 9.30 -0.660 0.200
ND4L 0.104 0.000 0.854 0.041 95.8 4.10 -0.780 1.000
ND4 0.147 0.023 0.714 0.115 86.1 13.8 -0.650 0.666
ND5 0.149 0.034 0.635 0.180 78.4 21.4 -0.610 0.677
ND6 0.543 0.130 0.282 0.043 82.5 17.3 0.315 -0.500
CYTB 0.121 0.040 0.710 0.127 83.1 16.7 -0.700 0.517
Appendix 87
ND1 0.755 0.166 0.059 0.017 81.4 18.3 0.854 -0.800
ND2 0.786 0.124 0.059 0.029 84.5 15.3 0.860 -0.610
Phanogenia gracilis COX1 0.148 0.011 0.783 0.057 93.1 6.80 -0.682 0.676
COX2 0.127 0.029 0.803 0.039 93.0 6.80 -0.727 0.142
ATP8 0.187 0.000 0.750 0.062 93.7 6.20 -0.600 1.000
ATP6 0.138 0.021 0.797 0.042 93.5 6.30 -0.700 0.333
COX3 0.175 0.041 0.725 0.058 90.0 9.90 -0.610 0.166
ND3 0.139 0.000 0.790 0.069 92.9 6.90 -0.700 1.000
ND4L 0.097 0.000 0.804 0.097 90.1 9.70 -0.780 1.000
ND4 0.152 0.005 0.744 0.097 89.6 10.2 -0.660 0.894
ND5 0.241 0.016 0.677 0.064 91.8 8.00 -0.470 0.600
ND6 0.552 0.059 0.343 0.044 89.5 10.3 0.233 -0.140
CYTB 0.226 0.011 0.714 0.047 94.0 5.80 -0.510 0.600
ND1 0.712 0.090 0.181 0.015 89.3 10.5 0.593 -0.710
ND2 0.631 0.075 0.255 0.037 88.6 11.2 0.423 -0.330
88
Figure A.1.3. Mitochondrial gene order for the mitogenomes of Echinoidea species. Genes and tRNAs at upper layer are transcribed
on the L-strand, whereas the genes at lower layer are transcribed on the H-strand. Blue colors indicate the conservation of genome order for
all the species. The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter
codon).
Figure A.1.4. Mitochondrial gene order for the mitogenomes of Asteroidea species. Genes and tRNAs at upper layer are
transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome
order for the species relative to Asteroidea genome pattern (Scouras and Smith 2001) and white colors variation of genome order.
(inversions are shown with green stars). The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two
types of tRNA-Leu with first-letter codon). a) Genome pattern.
a) Pisaster ochraceus and Luidia quinaria
Appendix 89
b) Acanthaster planci and Acanthaster brevispinus
c) Patiria pectinifera
d) Astropecten polyacanthus
90
e) Asterias amurensis
Figure A.1.5. Mitochondrial gene order for the mitogenomes of Holothuroidea species. Genes and tRNAs at upper layer are
transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome
order for the species relative to Holothuroidea genome pattern (Fan et al. 2011) and white colors indicate variation in transcription
direction (transpositions are shown with brown stars; TDRL are shown with grey stars). The three-letter amino acid abbreviation is used
for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). a) Genome pattern.
a) Apostichopus japonicus, Parastichopus nigripunctatus, Holothuria forskali
b) Stichopus sp. SF-2010, Stichopus horrens
c) Cucumaria miniata
Appendix 91
Figure A.1.6. Genome mitochondrial gene rearrangements for the Ophiuroidea species. Genes and tRNAs at upper layer are
transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome order for
the species relative to Ophiuroidea genome pattern (Perseke et al. 2010) and white colors indicate variation in the transcription direction
(transpositions are shown with brown stars; TDRL are shown with grey star and inversions with green stars). The three-letter amino acid
abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). a) Genome pattern.
a) Ophiopholis aculeata, Ophiocomina nigra
b) Amphipholis squamata
c) Cucumaria miniata
92
c) Astrospartus mediterraneus
d) Ophiura lutkeni, Ophiura albida
Figure A.1.7. Genome mitochondrial gene rearrangements for the Crinoidea species. Genes and tRNAs at upper layer are transcribed
on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome order for the species
relative to Crinoidea genome pattern (Perseke et al. 2008) and white colors indicate variation in gene location (transpositions are shown with
brown stars; TDRL are shown with grey stars). The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two
types of tRNA-Leu with first-letter codon). a) Genome pattern.
a) Florometra serratissima, Phanogenia gracilis
Appendix 93
Figure A.1.8. Echinodermata gene trees inferred using PhyML (Maximum Likelihood). The presented trees are unrooted and the
bootstraps values are shown only for nodes with support above 50%. No outgroup was used in this analysis. A: Asteroidea; E: Echinoidea;
H: Holothuroidea; C: Crinoidea; O: Ophiuroidea.
ATP6 COX1 COX3
b) Neogymnocrinus richeri
c) Antedon mediterranea
94
CYTB NAD1 NAD2
NAD4 NAD4L ND5
NAD6
Appendix 95
Figure A.1.9. Echinodermata gene trees obtained with PhyML using Hemichordata as outgroup (Maximum Likelihood). The
presented trees are rooted and the bootstraps values are shown only for nodes with support above 50%. A: Asteroidea; E: Echinoidea; H:
Holothuroidea; C: Crinoidea; O: Ophiuroidea; Out: Outgroup.
ATP6 NAD1 NAD2 NAD3
NAD4 NAD4L NAD5
96
Figure A.1.10. Echinodermata gene trees inferred using MrBayes (Bayesian Inference). The presented trees are unrooted and the
probabilities values are shown only for nodes with support above 50%. No outgroup was used in this analysis. A: Asteroidea; E:
Echinoidea; H: Holothuroidea; C: Crinoidea; O: Ophiuroidea.
ATP6 ATP8 COX1
COX2 COX3 CYTB
NAD1 NAD2 NAD3
Appendix 97
NAD4 NAD4L NAD5
NAD6
Figure A.1.11. Echinodermata gene trees obtained with MrBayes using Hemichordata as outgroup (Bayesian Inference). The
presented trees are rooted and the probabilities values are shown only for nodes with support above 50%. A: Asteroidea; E: Echinoidea;
H: Holothuroidea; C: Crinoidea; O: Ophiuroidea; Out: Outgroup.
ATP6 COX1 COX3 CYTB
C
98
NAD1 NAD2 NAD4 NAD4L
NAD5 NAD6
Appendix 99
A.2 Shell Codes/Scripts – Documentation
The documentation of the commands written in Shell used in the pipeline.
 Filter the Echinodermata species from Metazoa species
SYNOPSIS
for i in *.gbk; do if (grep -Riq "Metazoa; Echinodermata;" $i) then
(echo "Found Echinodermata in file $i"; cp $i new_$i) fi; done;
DESCRIPTION
This script uses the loop for to go through each file, to look for specific patterns in each
file and print into a new file, if the pattern is found, in this case print only the Echinodermata
genbank files.
100
A.3 Perl Scripts – Documentation
The documentation of the scripts written in Perl used for the pipeline.
 protein_fasta.pl
This script converts a genbank formatted file to fasta format and retrieves only the amino
acid sequences for the protein-coding genes.
SYNOPSIS
#run the script that translate GBK format into FASTA format and get only the protein sequence!
my $dir = shift @ARGV; #Argument to call the right directory
my $dir_output = shift @ARGV; #Argument to tell the name of new directory
system ("mkdir $dir_output"); #create the new directory that will be new files
my @files;
my @file;
my @all_gbk;
my @accessions;
my @gene;
my @protein;
my $gene = "";
my $i="";
sub ReadFiles (){
#open the directory that has GBK files
opendir DIR, $dir or die ("Cannot open directory ${dir} $!n");
#put all GBK files into array (@files)
while (my $filename = readdir(DIR)){
if ($filename =~ /.*.gbk/){
push (@files, $filename);
}
Appendix 101
}
#browses the array with files, open each one and ut all the information into array (@file)
foreach my $file (@files) {
open (INFILE, "${dir}/${file}") or die ("Cannot open folder. $!n");
@file = <INFILE>;
close INFILE;
@all_gbk = (@all_gbk,@file);
}
return;
}
sub GetSequences () {
my $sp = ""; my $protein = ""; my $id = "";
my $accession;
my $reading_CDS = "FALSE";
my $translation_block = "FALSE";
#open the array that has all information of all files
foreach my $string (@all_gbk) {
#Lets organize the analysis with the order of the genbank file
#Retrieve ACCESSION
if ($string =~ /^ACCESSIONs+(.+)s+$/){
$accession = $1;
}
#Retrieve protein sequence and name of each file (gene name)
#if you find on file the pattern "CDS" starts to collect protein sequence and gene name
#protein sequence on GBK format are in seccion that starts with word CDS and end with "
if ($string =~ /^s+CDSs+.+/){
$reading_CDS = "TRUE";
}
if ($reading_CDS eq "TRUE"){
#the pattern 'gene="gene name"'-> collect only gene name
if ($string =~ /^s+/gene="(.+)"s+/){
102
$gene = $1;
}
#the pattern '/translation="'-> start collect protein sequence
if ($string =~ /^s+/translation=".+s+/){
$translation_block = "TRUE";
}
my $aspas = "FALSE";
if ($translation_block eq "TRUE"){
#the pattern '/translation="protein sequence"'-> collect protein
sequence
if ($string =~ /^s+/translation="(.+)"?s+/){
$protein = $1;
#collect the protein sequence, if you find " and newline
#replace all the pipes, spaces and " for nothing
if ($string =~ /"n$/){
$protein =~ s/("|s)//g;
$aspas = "TRUE";
}
}
#if you dont find a pattern " -> collect protein sequence without
spaces
elsif ($string !~ /"n$/){
$string =~ s/s//g;
$protein .= $string;
}
#collect the last line of protein sequence the pattern is " and
newline
elsif ($string =~ /"n$/){
#replace all the pipes, spaces and " for nothing and collect the
protein sequence
$string =~ s/("|s)//g;
$protein .= $string;
Appendix 103
$aspas = "TRUE";
}
if ($aspas eq "TRUE"){
#collect all information (accession number, gene name and protein
sequence
#and put into array
push (@accessions, $accession);
push (@gene, $gene);
push (@protein, $protein);
$translation_block = "FALSE";
$reading_CDS = "FALSE";
$aspas = "FALSE";
}
}
}
}
return;
}
sub WriteFasta () {
#browses the gene name array to put the right names of genes on name of files
foreach $i (0..$#gene){
if ($gene[$i] =~ /a6|atp6|atp 6|atpase6|atpase 6/i){
$gene[$i] = "ATP6";
}
elsif ($gene[$i] =~ /a8|atp8|atp 8|atpase8|atpase 8/i){
$gene[$i] = "ATP8";
}
elsif ($gene[$i] =~ /n1|nd1|nad1|nadh1/i){
$gene[$i] = "ND1";
}
104
elsif ($gene[$i] =~ /n2|nd2|nad2|nadh2/i){
$gene[$i] = "ND2";
}
elsif ($gene[$i] =~ /n3|nd3|nad3|nadh3/i){
$gene[$i] = "ND3";
}
elsif ($gene[$i] =~ /n4|nd4|nad4|nadh4/i){
if ($gene[$i] =~ /l/i){
$gene[$i] = "ND4L";
}
else{
$gene[$i] = "ND4";
}
}
elsif ($gene[$i] =~ /n5|nd5|nad5|nadh5/i){
$gene[$i] = "ND5";
}
elsif ($gene[$i] =~ /n6|nd6|nad6|nadh6/i){
$gene[$i] = "ND6";
}
elsif ($gene[$i] =~ /c3|cox3|coxIII|coIII/i){
$gene[$i] = "COX3";
}
elsif ($gene[$i] =~ /c2|cox2|coxII|coII/i){
$gene[$i] = "COX2";
}
elsif ($gene[$i] =~ /c1|cox1|coxI|coI/i){
$gene[$i] = "COX1";
}
elsif ($gene[$i] =~ /cytb|cb|cob/i){
$gene[$i] = "CYTB";
Appendix 105
}
my $gene_name = $gene[$i];
#get protein sequence and respective accession number for all genes and separated
#for each one of them
open (OUTFILE, ">>${dir_output}/${gene_name}.aa.fasta") or die ("Cannot write fasta. $!n");
print OUTFILE ">".$accessions[$i],"n";
print OUTFILE $protein[$i],"n";
close OUTFILE;
}
return;
}
ReadFiles ();
GetSequences ();
WriteFasta ();
exit;
DESCRIPTION
This Perl script makes it easier and faster to convert the genbank format into fasta format
for all genes and separates them (with protein sequence). The script starts by opening the
directory that has all the genbank files, collects and puts the files content into an array, and then,
with a foreach loop, it browses the array. In the next step, it searches and collects specific
information (accession number, gene name, and the protein sequence) to save it in a new fasta
file.
106
 dna_fasta.pl
This script converts a genbank formatted file to fasta format and retrieves only the DNA
sequences.
SYNOPSIS
#run the script that translate GBK format into FASTA format and get only the DNA sequence!
my $infile = $ARGV[0]; #Argument to call the file
my @file;
#named the new file with the suffix "_dna.fasta"
my $outfile = $infile."_dna.fasta";
sub ReadFile ($){
my $infile = $_[0];
#open file to be translate and put all the information inside the file into array (@file)
open (INFILE , "$infile") or die ("Cannot open infile genbank: $infile. $!n");
@file = <INFILE>;
close INFILE;
return;
}
sub WriteFasta ($) {
my $outfile = shift; my $dna = ""; my $sp = ""; my $accession=""; my $dna_seq = "FALSE";
open (OUTFILE , ">$outfile") or die ("Cannot open outfile fasta: $outfile. $!n");
my @data = @file;
#browses the array with foreach loop
foreach my $string (@data) {
#collect important information such as the accession number, the organism, and the dna sequence
#if you find on file the pattern "ACCESSION NC_number" -> collect the NC_number
if ($string =~ /^ACCESSIONs+(.+)s+$/){
$accession = $1; # ok
}
#if you find on file the pattern "ORGANISM specie names" -> collect the specie names
if ($string =~ /^s+ORGANISMs+(.+)s+$/){
Appendix 107
$sp = $1; # ok
}
#if you find on file the pattern "ORIGIN" will start to collect DNA sequence
if ($string =~ /ORIGIN/) {
$dna_seq = "TRUE";
}
if ($dna_seq eq "TRUE"){
#the pattern "ORIGIN dnasequence" -> collect dnasequence
#(DNA sequence on GBK format starts with word ORIGIN and end with '')
if ($string =~ /ORIGINn+(.+)/){
$dna = $1; # ok
}
#if the pattern isnt '' -> delete all number and word ORIGIN and
#board only the 4 nucleotides to dna sequence
elsif ($string !~ //+$/){
$string =~ s/ORIGIN//;
$string =~ s/[s0-9]//g;
$dna .= $string;
}
#if the pattern is '' -> delete '' from dna sequence
elsif ($string =~ //+$/){
$string =~ s/(/|s)//g;
$dna_seq = "FALSE";
$dna .= $string;
}
}
}
#print in new FASTA file all the information collected
print OUTFILE ">" , $accession , ' ' , "[", $sp , "]", "n";
print OUTFILE $dna;
close OUTFILE;
108
return;
}
ReadFile ($infile);
WriteFasta ($outfile);
exit;
DESCRIPTION
This Perl script makes it easier and faster to convert the genbank format into fasta format
only with DNA sequence. The script starts by opening the directory that has all the genbank files,
collects and puts the files content into an array, and then, with a foreach loop, it browses the array.
In the next step, it searches and collects specific information (accession number, specie name,
and the dna sequence) to save it in a new fasta file.
 mafftalign.pl
This script runs the program MAFFT.
SYNOPSIS
#run alignment program MAFFT for aminoacid sequences!
my $dir = shift @ARGV;#Argument to call the right directory
system ("mkdir mafft");#create new directory (mafft) for the aligned sequences
my $input="";
my $output = "";
my @files;
#open directory with all files to be align
opendir DIR, $dir or die ("Cannot open directory ${dir} $!n");
#while there are FASTA files, put into a array (@files)
while (my $filename = readdir(DIR)){
if ($filename =~ /.*.fasta/){
push (@files, $filename);
}
}
Appendix 109
#browses the array with foreach loop
foreach $input (@files) {
#named the news files with the suffix ".mafft"
$output = "$input.mafft";
#run the program MAFFT with specific options
system ("mafft --maxiterate 1000 --globalpair $input > $output");
#when finish the run print this message for each file
print "--------------------FIM ALINHAMENTO $input-------------------n";
}
close DIR;
#move the new files for the new directory (mafft)
system ("mv *.mafft mafft");
exit;
DESCRIPTION
This Perl script allows running the multiple sequence alignment program MAFFT for
multiple amino acid sequence files (in fasta format). First, it opens the directory where all the files
to be aligned are located. Then, it puts all these files into an array and finally it browses the array
and runs the program for each file.
 musclealign.pl
This script starts the program MUSCLE.
SYNOPSIS
#run alignment program MUSCLE for aminoacid sequences!
my $dir = shift @ARGV; #Argument to call the right directory
system ("mkdir muscle");#Create new directory (muscle) for the aligned sequences
my $input="";
my $output = "";
my @files;
110
#open directory with all files to be align
opendir DIR, $dir or die ("Cannot open directory ${dir} $!n");
#while there are FASTA files, put into a array (@files)
while (my $filename = readdir(DIR)){
if ($filename =~ /.*.fasta/){
push (@files, $filename);
}
}
#browses the array with foreach loop
foreach $input (@files) {
#named the news files with the suffix ".muscle"
$output = "$input.muscle";
#run the program MUSCLE with specific options
system ("muscle -in $input -out $output");
#when finish the run print this message for each file
print "--------------------FIM ALINHAMENTO $input-------------------n";
}
close DIR;
#move the new files for the new directory (muscle)
system ("mv *.muscle muscle");
exit;
DESCRIPTION
This Perl script allows running the alignment program MUSCLE for multiple amino acid
sequence files (in fasta format). First, it opens the directory where all the files to be aligned are
located. Then, it put all these files into an array and finally it browses the array and runs the
program for each file.
Appendix 111
 prankalign.pl
This script starts the program PRANK.
SYNOPSIS
#run alignment program PRANK for aminoacid sequences!
my $dir = shift @ARGV; #Argument to call the right directory
system ("mkdir prank");#create new directory (prank) for the aligned sequences
my $input="";
my $output = "";
my @files;
#open directory with all files to be align
opendir DIR, $dir or die ("Cannot open directory ${dir} $!n");
#while there are FASTA files, put into a array (@files)
while (my $filename = readdir(DIR)){
if ($filename =~ /.*.fasta/){
push (@files, $filename);
}
}
#browses the array with foreach loop
foreach $input (@files) {
#named the news files with the suffix ".prank"
$output = "$input.prank";
#run the program PRANK with specific options
system ("prank -d=$input -o=$output");
#when finish the run print this message for each file
print "--------------------FIM ALINHAMENTO $input-------------------n";
}
close DIR;
#move the new files for the new directory (prank)
system ("mv *.prank.* prank");
exit;
112
DESCRIPTION
This Perl script allows running the alignment program PRANK for multiple amino acid
sequence files (in fasta format). First, it opens the directory where all the files to be aligned are
located. Then, it puts all these files into an array and finally it browses the array and runs the
program for each file.
 BestAlignAA.pl
This script analyzes the results from the alignment programs and identifies the more
consistent alignment.
SYNOPSIS
#run program TrimAl for compare alignment programs!
my $dir = shift @ARGV;#Argument to call the right directory
my @files;
system ("mkdir trimalCompare");#create new directory (trimalCompare) for the new files
sub TrimAl {
#open directory with all files to be analyze
opendir DIR, $dir or die ("Cannot open directory ${dir} $!n");
while (my $filename = readdir(DIR)){
#while there are fileset on prefix name files, put into a array (@files)
if ($filename =~ /fileset*/){
push (@files, $filename);
}
}
#browses the array with foreach loop
foreach my $input (@files) {
my $output = $input;
#run the program TrimAl with specific options
system ("trimal -compareset $input -out $output.out > $output.txt");
Appendix 113
}
close DIR;
system ("mv *.out trimalCompare"); #move the new files for the new directory (trimalCompare)
system ("mv *.txt trimalCompare"); #move the new files for the new directory (trimalCompare)
return;
}
TrimAl ();
exit;
DESCRIPTION
This Perl script compares all the resultant amino acid alignments using the different
programs and find out which one is the most consistent. First, it opens the directory where all the
files to be analyzed are located. Then, it puts all these files into an array and finally it browses the
array and runs the trimAl program for each file.
 TrimAL.pl
This script collects only the protein sequence aligned with best scores.
SYNOPSIS
#run program TrimAl for to get only protein sequence with
#best score on alignment!
my $dir = shift @ARGV;
my @files;
#create new directory (trimalOUT) for the new files
system ("mkdir trimalOUT");
sub TrimAl {
opendir DIR, $dir or die ("Cannot open directory ${dir} $!n");
while (my $filename = readdir(DIR)){
#while there are .out files, put into a array (@files)
if ($filename =~ /.*.out/){
push (@files, $filename);
114
}
}
#browses the array with foreach loop
foreach my $input (@files) {
my $output = $input;
#run the program TrimAl with specific options
system ("trimal -in $input -out $output.nexus -nexus -gappyout");
}
close DIR;
#move the new files for the new directory (trimalOUT)
system ("mv *.nexus trimalOUT");
return;
}
TrimAl ();
exit;
DESCRIPTION
This Perl script prints in new files only the alignment of protein sequences with best
scores. First, it opens the directory where all the files to be analyzed are located. Then, it puts all
these files into an array and finally it browses the array and runs the trimAl program for each file.
 prottest.pl
This script runs the program ProtTest.
SYNOPSIS
#run the program Prottest for aminoacid sequences!
my @files;
system ("mkdir prottest"); #create new directory (prottest) for the analysed sequences
sub Prottest {
#open directory with all files to be analyze
opendir DIR, "/home/user/Transferências/prottest-3.2-20130103" or die;
Appendix 115
while (my $filename = readdir(DIR)){
#while there are NEXUS files, put into a array (@files)
if ($filename =~ /.*.nexus/){
push (@files, $filename);
}
}
#browses the array with foreach loop
foreach my $input (@files) {
my $output = $input;
#run the program Prottest with specific options
system ("java -jar prottest-3.2.jar -i $input -all-matrices -all-distributions -AICC -o
$output.out -all -F -t1 -S 2 -tc 0.5 -threads 1");
#when finish the run print this message for each file
print "--------------------FIM ANALISE $input-------------------n";
}
close DIR;
#move the new files for the new directory (prottest)
system ("mv *.out prottest");
return;
}
Prottest ();
exit;
DESCRIPTION
This Perl script allows running the program ProtTest for multiple amino acid sequence
files (in nexus format). First, it opens the directory where all the files to be analyzed are located.
Then, it puts all these files into an array and finally it browses the array and runs the program for
each file.
116
A.4 R Codes/Scripts – Documentation
The documentation of the scripts written in R used for the pipeline.
 create.r
This script creates a rooted tree (with outgroup) and plots it to standard output.
SYNOPSIS
getwd ()
#read the tree file (nexus), if the tree file is newick the function read need to be change to
read.tree
list1 <- list.files (all.files=FALSE, full.names=FALSE, pattern = ".tre")
#sort the accession number into the Echinodermata classes, add Outgroup class (Hemichordata)
Asteroidea <- c("NC_004610", "NC_006665", "NC_007788", "NC_006666", "NC_001627", "NC_007789",
"NC_006664")
Echinoidea <- c("NC_001453", "NC_001770", "NC_009941", "NC_009940", "NC_013881", "NC_001572")
Holothuroidea <- c("NC_014452", "NC_014454", "NC_005929", "NC_013884", "NC_012616", "NC_013432")
Ophiuroidea <- c("NC_013874", "NC_013878", "NC_005334", "NC_005930", "NC_010691", "NC_013876")
Crinoidea <- c("NC_010692", "NC_001878", "NC_007689", "NC_007690")
OUTGROUP <- c("NC_013877", "NC_007438", "NC_001887")
total_taxa = length(Asteroidea) + length(Echinoidea) + length(Holothuroidea) + length(Ophiuroidea) +
length(Crinoidea)+ length(OUTGROUP)
my_tip_color <- rep(0,total_taxa)
for (i in 1:length(list1)){
tree <- paste("tree_",i," <- read.nexus("",cwd,"/",list1[i],"")",sep="")
eval(parse(text= tree))
tree <- get(paste("tree_",i,sep=""))
mytitle = list1[i]
taxa <- tree$tip.label
#attribute at each class a colour
my_blue = which(taxa %in% Asteroidea)
my_red = which(taxa %in% Echinoidea)
Appendix 117
my_green = which(taxa %in% Holothuroidea)
my_purple = which(taxa %in% Ophiuroidea)
my_orange = which(taxa %in% Crinoidea)
my_black = which(taxa %in% OUTGROUP)
for (i in my_blue){
my_tip_color[i] = "blue"
}
for (i in my_red){
my_tip_color[i] = "red"
}
for (i in my_green){
my_tip_color[i] = "green"
}
for (i in my_purple){
my_tip_color[i] = "purple"
}
for (i in my_orange){
my_tip_color[i] = "orange"
}
for (i in my_black){
my_tip_color[i] = "black"
}
#identify the outgroup
new_out <- c("NC_013877", "NC_007438", "NC_001887")
mytreereroot<-root(tree, new_out)
#with this function create a image of rooted tree, for unrooted tree add type=”u” and remove the
outgroup
plot.phylo (mytreereroot, show.node.label=TRUE, tip.color = my_tip_color, use.edge.length = TRUE, font
= 2, edge.width=2, cex=0.7)
mtext (mytitle, side = 3, font = 2, adj = 1, cex=1.0)
}
118
DESCRIPTION
This R script makes it easier and faster to create a rooted tree (PhyML/MrBayes trees).
The script starts by opening the input file. Then it splits all Accession Numbers into classes, and
attributes a color to each class. Then, with a specific R function (plot.phylo), it creates a rooted
tree (phylogenetic tree).
 dendrogram.r
This script creates a dendrogram.
SYNOPSIS
getwd ()
data <-read.table("file.txt", sep = "t")
# prepare hierarchical cluster
hc = hclust(dist(data [,2]))
# very simple dendrogram
plot(hc,labels=data[,1], main="", xlab="")
DESCRIPTION
This R script allows creating a dendrogram. The script starts by opening the input (file
with distance matrix). Next it applies the R functions hclust and dist performs the hierarchical
clustering on distance matrix. Finally, the plot function allow us to visualize the dendrogram.

thesis_AndreiaReis2014

  • 1.
    Application of Site- andTime- Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Andreia Reis Master's thesis submitted to the Faculty of Science, University of Porto in Biodiversity,Genetics and Evolution 2014 ApplicationofSite-andTime-Heterogeneous EvolutionaryModelstoMitochondrialPhylogenomics AndreiaReis MSc 2.º CICLO FCUP 2014 MS cSc MSc
  • 2.
    Application of Site- andTime- Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Andreia Reis Master of Biodiversity, Genetics and Evolution Biology Department 2014 Supervisor Miguel M Fonseca, PhD, Fac.Ciências Univ. Porto /Univ. Vigo Co-supervisor Rui Faria, PhD, Fac.Ciências Univ. Porto /Univ. Pompeu Fabra MS cin
  • 3.
    Todas as correçõesdeterminadas pelo júri, e só essas, foram efetuadas. O Presidente do Júri, Porto, ______/______/_________ MS c
  • 4.
    iii To the memory ofmy beloved Grandfather, José Reis.
  • 5.
  • 6.
    v “Persistence is theway to success.” -Charles Darwin
  • 7.
  • 8.
    vii Acknowledgments This study isnot the result of an individual effort but of a set of collective efforts from people that i consider important. Without them it would have been much harder to reach the end. Therefore, I would like to express my full gratitude to some people that helped me travelling through this “road full of boulders”, which was one of the most challenging journeys of my professional and personal lives. A special thanks to my supervisors, Miguel M. Fonseca and Rui Faria, for their leadership, support, motivation and huge patience. All advices were essential to be able to make it through yet another stage of my career. Since this work was initially planned within the framework of a research project financed by the FCT and FEDER (ref.: PTDC/BIA- EVF/113805/2009), I would also like to thank these institutions to make this work possible. I would also like to give a HUGE thanks to Graciela Sotelo for all support that she gave me. She was always available to clarify my doubts and teach me important things for my future research life. To Diana Costa, I would like to thank the good vibes, advices, the rides to CIBIO ;), the funny talks, the not so funny talks (but also needed), the beginning of a friendship. To Prof. Dr. Marta W. Vasconcelos, thanks for helping me opening my mind to the world of science, for her important advices and friendship. "All the riches of the world are not worth a good friend." – Voltaire. I have a small group of good friends that are my anchor to who I would like to thank, Ana Marta Fonseca, Sofia Barreira, Ricardo Pinto, Tiago Carvalho, Pedro Brinca, Inês Beato and Adriano Mendes, who attending me in times of distress, anxiety, insecurity, exhaustion and satisfaction. To Dr. Amélia Martins and Dr. Ana Queiros, thanks for their support, especially during bad times, and for their friendship. To Prof. Cecília Ferreira da Rocha, thanks for all she taught me (and still does) during all these years. Thanks to my Master Colleagues, who in one way or another, helped each other during the Master degree!!! :)
  • 9.
  • 10.
    ix Contents Abstract xi Resumo xii Listof Figures xiv List of Tables xv Abbreviations xvi Introduction 1 1. Models of Sequence Evolution in Phylogenetics 3 1.1. Phylogenetic Inference 3 1.2. Heterogeneity of Molecular Sequence Evolution 4 1.2.1. Heterogeneity across Time 4 1.2.2. Heterogeneity across Sites 6 1.2.2.1. Substitution Rates Heterogeneity 6 1.2.2.2. Amino Acid Empirical Mixture Models 6 1.2.3. Heterogeneity across Sites and Time 7 1.3. Strand Asymmetry and Replication 7 1.4. Gene Rearrangements 9 2. Mitochondrial DNA 11 2.1. MtDNA Structure and Organization 11 2.2. MtDNA Strand Asymmetry and Replication 13 2.3. Gene Rearrangements 13 3. The Controversial Phylogeny of the Echinoderms 14 3.1. Morphological Data 18 3.2. Molecular Data 18 3.3. MtDNA Gene Rearrangements 19 3.4. Final Considerations 20 4. Objectives 21 Material and Methods 23 1. Retrieval of Molecular Data 25 2. Analysis of the Mitogenome Architecture 25 2.1. Comparison of Gene Orders 26 2.2. Prediction of Origins of Replication 26
  • 11.
    x 2.3. Assessment ofCompositional Heterogeneity 27 3. Phylogenetic Analysis 27 3.1. Sequence Alignment 27 3.2. Trimming of the Alignments 28 3.3. Selection of the Best Fit Model 28 3.4. Phylogenetic Inference 29 3.5. Comparison of Phylogenetic Methods 30 Results 31 1. Mitogenome Architecture 33 1.1. Mitogenome Annotation 33 1.2. Gene Order Comparison 35 1.3. Prediction of Replication Origins 37 1.4. Assessment of the Degree of Compositional Heterogeneity 39 2. Comparison of Phylogenetic Inference Methods 42 Discussion 51 References 57 Appendix 75
  • 12.
    xi Abstract Recent advances insequencing technologies have encouraged the use of whole- genome mitochondrial DNA sequences (mitogenomics) for phylogenetic inference. This increasing amount of information generally results in the recovery of more accurate phylogenetic trees. However, heterogeneous nucleotide composition across sites and taxa (due to, for example, gene rearrangements and strand asymmetry) can lead to erroneous phylogenetic inferences. To work around this problem, several evolutionary models have been developed that take into account the site- and time- heterogeneity, but they have not been widely used. The main objective of this project was to test the performance of these heterogeneous evolutionary models using the complete mitogenome of 29 species from all extant classes of Echinodermata. We chose this dataset because their available mitogenome sequences show strong heterogeneous patterns in terms of nucleotide composition and substitution rates, thus allowing a proper evaluation of those models and also because the relationships between Echinodermata are still a debated issue. Before inferring the phylogeny of these taxa, we made a characterization of their mitogenomes in order to identify the potential causes of such heterogeneity. To do so, we analyzed the mitogenome organization, searched for putative replication origins, and estimated the nucleotide compositional heterogeneity. The gene orders of the echinoderm mitogenomes support the monophyly of each class. If we consider Crinoidea to have the ancestral gene order in echinoderms, the phylogenetic tree based on gene arrangements supports the Cryptosyringid hypothesis ((Echinoidea + Holothuroidea); Asteroidea; Ophiuroidea; Crinoidea). Regarding putative replication origins, we were able to identify, not in many though, stable structures across different echinoderm species/classes. Our nucleotide sequence analyses confirm that mitochondrial genes in Echinodermata have a strong heterogeneous composition within and between genes across these taxa. Finally, when applying homogeneous and heterogeneous models (of amino acid sequence evolution) for phylogenetic reconstruction, only the CAT model is able to correct the known long-branch attraction artifact caused by the accelerated evolutionary rates of the Ophiuroidea sequences. Our results are mainly in agreement with previous studies based on the analysis of complete mitogenomes, but they also support unexpected relationships between classes, such as Asteroidea - Echinoidea, not recovered before attending to morphological traits or nuclear DNA fragments as phylogenetic characters. Keywords: compositional heterogeneity; DNA mitochondrial replication; echinoderms; invertebrates; mitogenomics; mixture models; phylogeny.
  • 13.
    xii Resumo Os avanços recentesem tecnologias de sequenciação têm fomentado o uso de sequências do genoma complete de DNA mitocondrial para inferência filogenética. Apesar deste aumento da quantidade de informação geralmente resultar na recuperação de árvores filogenéticas mais precisas, a composição heterogénea nos nucleótidos (devido, por exemplo, rearranjos de genes e assimetria de cadeias) pode levar a inferências filogenéticas erróneas. Para contornar este problema desenvolveram-se vários modelos evolutivos que têm em conta a heterogeneidade composicional entre nucleótidos (ao longo de um gene/genoma) e taxa (em diferente linhagens). No entanto, estes não são ainda amplamente utilizados. O principal objetivo deste projecto é então testar o desempenho destes modelos evolutivos heterogéneos utilizando os mitogenomas completos de 29 espécies de todas as classes existentes no filo Echinodermata. Este grupo foi escolhido como objecto de estudo, não só porque as sequências mitogenómicas disponíveis para este grupo mostram fortes padrões heterogéneos em termos de composição e taxas de substituição nucleotídica, oferecendo uma oportunidade única para testar esses modelos, mas também porque as relações filogenéticas entre as classes de equinodermes é ainda um tema em debate. Antes de inferir a filogenia deste grupo, caracterizou-se os seus mitogenomas com o fim de identificar potenciais causas da sua heterogeneidade. Para isso, realizaram-se análises comparativas da posição dos diferentes genes ao longo do mitogenoma, inferiram-se as possiveis origens de replicação, e estimou-se a heterogeneidade composicional dos mitogenomas utilizados neste estudo. A organização dos genes nos mitogenomas dos equinodermes confirma a monofila das diferentes classes. Se considerarmos a classe Crinoidea como ancestral em relação aos restantes equinodermes, a árvore filogenética baseada em rearranjos parece apoiar a hipótese Cryptosyringid ((Echinoidea + Holothuroidea); Asteroidea; Ophiuroidea; Crinoidea). Em relação as origens de replicação, detectaram-se estruturas estáveis em diferentes espécies/classes de equinodermes. A análise de sequência de nucleótidos confirma que os genes mitocondriais nos equinodermes têm uma forte heterogeneidade composicional, quer dentro de cada gene, quer entre genes dos taxa analisados. Finalmente, quanto à aplicação de modelos homogéneos e heterogéneos para a reconstrução filogenética, apenas o modelo CAT foi capaz de corrigir o conhecido long-branch attraction causado pela aceleração das taxas evolutivas das sequências da classe Ophiuroidea. Embora os nossos resultados estejam de acordo com estudos anteriores (também baseados na análise de mitogenomas), estes parecem também apoiar algumas relações inesperadas entre as classes de equinodermes, tais
  • 14.
    xiii como Asteroidea +Echinoidea, as quais não tinham sido anteriormente evidenciadas através de análises baseadas em caracteres morfológicos ou DNA nuclear como marcadores filogenéticos. Palavras-Chave: composição heterogénea; equinodermes; filogenia; invertebrados; mitogenómica; mixture models; replicação do DNA mitocondrial.
  • 15.
    xiv List of Figures Figure1 Schematic representation of the replication of the vertebrate mitochondrial genome. 8 Figure 2 Representation of different types of rearrangements for a hypothetical genome consisting of five genes. 10 Figure 3 Representation of the mitochondria 11 Figure 4 Example of an invertebrate mitochondrial genome (from Arbacia lixula with 15719bp) 12 Figure 5 The two hypotheses for the Echinodermata phylogeny. 21 Figure 6 Evolution of gene orders in the echinoderm lineages using the typical crinoid gene order as the reference order (inferred with CREx). 36 Figure 7 Phylogenetic tree constructed with CREx based on genome order differences (pairwise-distance) among the Echinodermata species, resulting from gene rearrangements. 36 Figure 8 Putative replication origins inferred for the mitogenomes of a) Asterias amurensis, b) Acanthaster brevispinus and d) Patiria pectinifera. 39 Figure 9 Plots of the GC and AT skews at the 4-fold sites of mitochondrial protein-coding genes for all Echinodermata species included in this study. 40 Figure 10 Phylogenetic tree based on the concatenated amino acid sequences of thirteen echinoderm mitochondrial genes using Hemichordata as outgroup. 44 Figure 11 Heatmap of the inferred phylogenies from the outgroup dataset. 45 Figure 12 Heatmap of the inferred phylogenies between the ingroup and outgroup datasets using PhyML. 46 Figure 13 Heatmap of the inferred phylogenies between the ingroup and outgroup datasets using MrBayes. 47 Figure 14 Dendrogram of the inferred phylogenies for the concatenated outgroup dataset using different models or approaches (based on normalized Robinson-Foulds values). 48 Figure 15 Dendrogram of the inferred phylogenies for the concatenated ingroup dataset using different models or approaches (based on normalized Robinson-Foulds values). 48 Figure 16 Heatmap of dendrograms for PhyloBayes results (ingroup vs. outgroup). 49
  • 16.
    xv List of Tables Table1 Summary of previous phylogenetic studies on Echinodermata. 15 Table 2 Summary of mitogenome annotation according to different approaches. 33 Table 3 Putative replication origins inferred for the Echinodermata mitogenomes used in this study. 38 Table 4 Summary of general results. 43
  • 17.
    xvi Abbreviations A Adenine ATP Adenosinetriphosphate ATP6 ATPase complex subunit 6 ATP8 ATPase complex subunit 8 BI Bayesian Inference BP Breakpoints C Cytosine COX1 Cytochrome c oxidase subunit I COX2 Cytochrome c oxidase subunit II COX3 Cytochrome c oxidase subunit III CYTB Cytochrome b DNA Deoxyribonucleic acid G Guanine H-strand Heavy strand LBA Long branch attraction L-strand Light strand lrRNA Large ribosomal RNA ML Maximum likelihood mtDNA Mitochondrial DNA ND1 NADH dehydrogenase subunit 1 ND2 NADH dehydrogenase subunit 2 ND3 NADH dehydrogenase subunit 3 ND4 NADH dehydrogenase subunit 4 ND4L NADH dehydrogenase subunit 4L ND5 NADH dehydrogenase subunit 5 ND6 NADH dehydrogenase subunit 6 NJ Neighbor Joining OH Replication origin on heavy strand OL Replication origin on light strand OR Replication origin RF Robinson-Foulds
  • 18.
    xvii RITOLS RNA incorporationthroughout the lagging strand RNA Ribonucleic acid rRNA Ribosomal RNA SCM Strand-coupled model SDM Strand displacement model srRNA Small ribosomal RNA T Thymine TDRL Tandem duplication random loss tRNA Transfer RNA tRNA-Ala tRNA alanine tRNA-Asn tRNA asparagine tRNA-Asp tRNA aspartic acid tRNA-Arg tRNA arginine tRNA-Cys tRNA cysteine tRNA-Gln tRNA glutamine tRNA-Glu tRNA glutamic acid tRNA-Gly tRNA glycine tRNA-His tRNA histidine tRNA-Ile tRNA isoleucine tRNA-Leu tRNA leucine tRNA-Lys tRNA lysine tRNA-Met tRNA methionine tRNA-Phe tRNA phenylalanine tRNA-Pro tRNA proline tRNA-Ser tRNA serine tRNA-Thr tRNA threonine tRNA-Trp tRNA tryptophan tRNA-Tyr tRNA tyrosine tRNA-Val tRNA valine
  • 19.
  • 21.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 3 1. Models of Sequence Evolution in Phylogenetics Darwin realized in The Origin of Species that all forms of life on Earth evolved from a common ancestor and that their relationships could be graphically represented by a tree (Darwin, 1859). More than 150 years later, despite the enormous quantity and quality of data provided by the advances in molecular biology, the phylogenetics community is still pursuing the general goals of reconstructing the tree of life and of contributing to understand the processes that govern evolution and the origin of biodiversity (Gesell, 2009). Although, the increasing amount of available data reduces the influence of random errors on phylogenetic inference, many phylogenies are still poorly resolved and the use of different reconstruction methodologies often produces contradictory topologies (Nesnidal et al., 2010; Weisrock, 2012). While much of this uncertainty stems from the nature of the evolutionary process (e.g. radiations), an important source of incongruence are the systematic errors (Jeffroy et al., 2006; Philippe et al., 2005) or artifacts inherent to phylogenetic tree-reconstruction, which result from the inability of a method (or model) to deal with biases in a dataset, originating misleading phylogenetic inferences. 1.1. Phylogenetic Inference The ultimate goal of molecular phylogenetics is to understand the historical relationships between taxa using the information contained on genes or the entire genome (phylogenomics). Several methods of phylogenetic inference were developed, but the most commonly used are Neighbor-Joining (NJ; Saitou and Nei, 1987), Maximum Parsimony (MP; Fitch, 1971), Maximum Likelihood (ML; Felsenstein, 1981) and Bayesian Inference (BI; Larget et al., 1999). These phylogenetic methods have their own particularities, but they all rely on the usage of explicit statistical model(s) of sequence (nucleotide, codon or amino acid) evolution. These models describe the probability that one state (e.g. nucleotide, amino acid) changes to another; generalized in the so-called transition rate matrix (here, the term transition refers to the change between states and not to the exchange between purines or pyrimidines). They are expected to reflect the evolutionary pattern of the sequences being analyzed and are crucial to estimate the evolutionary parameters used for accurate phylogenetic tree reconstruction. The first proposed models were very simple and based on strong assumptions. For example, the nucleotide substitution model proposed by Jukes and Cantor (1969) assumes that the base frequencies are equal among sequences and that all
  • 22.
    4 substitutions have thesame relative rate (AC=AG=AT=CG=CT=GT). In contrast to DNA substitution models, models of protein evolution, or amino acid replacement, were initially built using an empirical approach, i.e. derived from large numbers of empirical alignments (Dayhoff et al., 1978; Jones et al., 1992; Adachi and Hasegawa, 1996). Since these matrices contain a large number of parameters (20-state replacement matrix, they are usually selected before the phylogenetic analysis of interest is executed. These “traditional” models of molecular evolution were mainly developed assuming that the evolutionary process is globally stationary, homogeneous and reversible (Yang, 1996). Stationary meaning that DNA or protein composition (nucleotide or amino acid equilibrium frequencies) is the same at all sites of a sequence and along the tree (stationary hypothesis). Homogeneous meaning that overall nucleotide or amino acid substitution rates are constant across all branches of the tree (homogeneity hypothesis). Reversible meaning that the probability of change from state i to state j is equal to the probability of going from state j to state i. It has been shown, however, that many sequences violate one or more of these assumptions, which ultimately may result in erroneous phylogenetic reconstructions (Galtier and Gouy, 1995; Yang, 1996; Cox et al., 2008; Dutheil and Boussau, 2008; Philippe et al., 2011; Groussin et al., 2013). 1.2. Heterogeneity of Molecular Sequence Evolution Heterogeneity in molecular evolution manifests itself in various ways: along branches (or throughout time), across sites and genes, or both (across sites and time). 1.2.1. Heterogeneity across Time Substitution rates can vary across lineages. Felsenstein (1978) was among the first to observe that fast evolving sequences, which have long branches, are easily grouped together in phylogenetic reconstructions, even if they are not related. Failure to handle evolutionary heterogeneous substitution rates or compositional heterogeneity across lineages, the probability of homoplasy (i.e. convergence of substitution patterns in unrelated sequences) increases, which leads to Long Branch Attraction (LBA) artifact (Felsenstein, 1978; Bergsten, 2005). Similarly, when a distant outgroup is used, a divergent species may be attracted by the long-branch separating the ingroup and the outgroup and thus be “artifactually” placed at a basal position of the phylogenetic tree (Philippe et al., 1998; Bergsten, 2005). Heterogeneous sequence composition or accelerated rates of evolution underlying LBA can be originated, for instance, by variation in gene order or changes in the
  • 23.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 5 transcription polarity of a gene. In the mitochondrial genome, these events are particularly important because each strand has a strong-specific bias (Adachi and Hasegawa, 1996), which is also related to the genomic location of genes (Faith and Pollock, 2003). Genes coded in one strand will contain the compositional bias of that same strand, whereas genes that reversed their coding polarity will show inverted compositional bias (Hassanin et al., 2005; Fonseca et al., 2008). Thus, the branching pattern of a phylogenetic tree can be strongly affected by strand bias reversal (an extreme example of compositional heterogeneity) or significant variation in AT and GC content (Schierwater et al., 2013). Several complementary approaches have been applied to overcome systematic errors caused by heterogeneous evolution over the phylogenetic tree (Rodriguez-Ezpeleta et al., 2007) and to deal with inconsistent phylogenetic reconstruction (Sheffield et al., 2009). A distance-based method known as LogDet/Paralinear distances (Lake, 1994; Lockhart et al., 1994) is often used. Another approach, which aims to homogenize the composition between sequences, is data recoding. For nucleotide sequences, RY coding (Woese et al., 1991) consists in replacing nucleotides A and G by R (purines) and C and T by Y (pyrimidines), so that only transversion events are considered and the compositional bias posteriorly decreases. In an analogous way, the Dayhoff coding system has been proposed for amino acid sequences (Hrdy et al., 2004). More generally, this problem can also be accommodated by removing saturated sites from the analysis such as 3rd codon positions (Swofford et al., 1996; Delsuc et al., 2002), by including a better taxon sampling (carefully selection of the outgroup) or removing fast-evolving species (Aguinaldo et al., 1997), genes (Philippe et al., 2005), or sequence positions (Brinkmann and Philippe, 1999; Burleigh and Mathews, 2004). Although these methods can decrease the compositional bias, they may also reduce the phylogenetic resolution (Weisrock, 2012). More recently, complex statistical approaches have been implemented in several phylogenetic programs, under either Bayesian (p4, PHASE and PhyloBayes) (Foster, 2004; Gowri-Shankar and Rattray, 2007; Blanquart and Lartillot, 2006) or Maximum Likelihood (nhPhyML) (Boussau and Gouy, 2006; Galtier and Gouy, 1995) frameworks, to overcome the problem of time or compositional heterogeneity using more realistic evolutionary models. However, the number of studies applying these methods to real data is still scarce (Loomis and Smith, 1990; Galtier and Gouy, 1995; Herbeck et al., 2005). Moreover, most have focused on the effect of compositional bias in relatively few genes, while the effect of complex heterogeneous data sets (e.g. entire genomes) on phylogenetic inference has not been thoroughly explored, as these methods are computationally extremely demanding due to the large number of parameters they include (Sheffield et al., 2009).
  • 24.
    6 1.2.2. Heterogeneity acrossSites 1.2.2.1 Substitution Rates Heterogeneity Another well-known source of heterogeneity is the variation of the substitution rates among different sites in a sequence (Uzzell and Corbin, 1971), which has long been recognized as a characteristic of sequence evolution, especially when coding for biological products (Uzzell and Corbin, 1971; Fitch, 1986; Kocher and Wilson, 1991). This might be caused by variation in functional constraints across sites and genes (Huelsenbeck and Suchard, 2007) or by stochastic fluctuations (if substitutions are equally tolerated in a given position, e.g. redundant sites of protein coding genes). The substitution rates across sites have been widely studied and remain an important subject in phylogenetic reconstruction (Guindon and Gascuel, 2002). If substitution rate variation is present but ignored, model-based tree-reconstruction methods can be quite misleading (Yang, 1996). Thus, different substitution rates (between nucleotides or amino acids) are now usually incorporated into evolutionary models by setting a discrete gamma-distribution of rates of variable sites (Swofford et al., 1996), as well as a proportion of invariable sites (Whelan et al., 2001). 1.2.2.2. Amino Acid Empirical Mixture Models Several studies have shown that homologous sequences can vary widely in their nucleotide or amino acid composition (Lockhart et al., 1994; Foster and Hickey, 1999; Foster, 2004; Collins et al., 2005). This compositional heterogeneity generally implies that the sequences in the dataset are not evolving under the global conditions of stationary, reversibility and homogeneity (SRH) (Bryant et al., 2005; Jermiin et al., 2008). To deal with this heterogeneity across sites, the data can be spatially split into different partitions (e.g. partitions by gene or codon position), with each partition having an independent model of evolution (Bull et al., 1993; Nylander et al., 2004). An alternative solution, suitable for small alignments (i.e. single-genes) of amino acid sequences, is to use empirical mixture models, which have fixed and pre-determined parameters, estimated based on large databases of multiple sequence alignments (Lartillot and Philippe, 2004). These matrices (e.g. C20-C60, Quang et al., 2008; WLSR5, Wang et al., 2008), unlike the classical empirical matrices (JTT, WAG or LG), can have more than one compositional profile (i.e. more than one equilibrium frequency for each amino acid) and more than one replacement matrix. These profile mixture models provide a better statistical fit than classical empirical matrices, especially on
  • 25.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 7 saturated data (Quang et al., 2008) and were shown to reduce the LBA problem in simulated and real genomic datasets (Wang et al., 2008). Additionally, in a given protein sequence, different sites can have distinct structural states (e.g. exposed or hidden). Likewise, a given homologous position (in a multiple sequence alignment) might show a tendency for some amino acids. Altogether, this complex evolution of real protein sequences may not be well captured by empirical matrices. In this context, the CAT model was developed to better adjust to the complexity of real protein sequence data. For each position, it can define a distinct profile or group several positions into one profile. These profiles are directly inferred from the data (Lartillot and Philippe, 2004) and then combined with a matrix of substitution rates that can be fixed to uniform values (CAT-Poisson), derived from empirical matrices (e.g. CAT-JTT or CAT-LG) or even inferred from the data (CAT-GTR). CAT-like models were shown to have a better fit than standard single-matrix models and to be less prone to systematic errors such as LBA (Lartillot and Philippe, 2004; Lartillot et al., 2007; Philippe et al., 2011). 1.2.3. Heterogeneity across Sites and Time As we mentioned in the previous subsections, DNA and protein sequences may have heterogeneous evolutionary patterns along time or across the sequence. However, both patterns are not mutually exclusive and hence sequences are expected to display both evolutionary heterogeneities simultaneously. The first model jointly considering time- and sequence- heterogeneity was proposed by Blanquart and Lartillot (2008). They combined the site-heterogeneous CAT model (Lartillot and Philippe, 2004) with the time-heterogeneous model BP (Blanquart and Lartillot, 2006). The BP model introduces a stochastic process that acts along lineages to incorporate the possibility of having significant objective compositional shifts in the tree (referred as BreakPoints). The authors showed that the CAT+BP model indeed allows for base frequencies variation across time and sites simultaneously and are able to recover the true topology of a tree, for which the CAT or the BP failed when used separately. Importantly, this model can be applied to DNA or amino acid sequences (Blanquart and Lartillot, 2008). 1.3. Strand Asymmetry and Replication Patterns of substitution are expected to be symmetric, according to Chargaff’s second parity rule ([A]=[T] and [G]=[C]), only if complementary changes occur at equal frequencies
  • 26.
    8 (Nikolaou and Almirantis,2006). However, sequence comparisons reveal that mutation rates, in addition to selective constraints, are not uniformly distributed throughout the genome (Francino and Ochman, 1997), resulting in strand asymmetry (or strand compositional bias). In mitochondrial genomes (mitogenomes) of vertebrates and invertebrates, the strand bias is more pronounced at fourfold redundant sites of protein coding genes (4-fold). Given that selective pressures are weaker in these sites than in other codon positions, strand asymmetries should be caused, to a large extent, by mutational mechanisms (e.g. replication, transcription or DNA repair). However, several studies (e.g. Reyes et al., 1998; Hassanin et al., 2005) support replication as the main process responsible for shaping the compositional bias in mitogenomes (Figure 1). During replication (as in transcription), one of the strands is temporarily in a single-stranded state and thereby more exposed to DNA damage than double-stranded DNA (Francino and Ochman, 1997; Hassanin et al., 2005). Specifically, single-stranded DNA is more susceptible to hydrolytic deaminations of cytosine (C→T) and adenine (A→G), which cause the accumulation of thymine and guanine on the exposed strand (Figure 1). Indeed, mitogenomes have one strand that is GT-rich (named the Heavy-strand) and one that is GT-poor (named the Light-strand) (Asakawa et al., 1991; Reyes et al., 1998). Transcription may also originate nucleotide composition variation in both strands, but to a lesser extent (Francino and Ochman, 1997; Hassanin et al., 2005). Figure 1 | Schematic representation of the replication of the vertebrate mitochondrial genome. Mutations that may occur more often in single-stranded DNA are also represented. The single-stranded DNA is particularly susceptible to deamination, especially inside mitochondria, where there is a high concentration of reactive oxygen species that damage DNA. This systematic exposure to DNA damage would lead to the accumulation of T and G in the H-strand. The oxidation of G is less common but also occurs more frequently in single-stranded DNA, producing 8-hydroxyguanine, which pairs with A rather than C on the L-strand (G decreases and T increases on the H-strand) (Fonseca, 2011).
  • 27.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 9 Strand asymmetry is usually measured by AT and GC skews, as expressed by (A-T)/(A+T) and (G-C)/(G+C), respectively (Perna and Kocher, 1995). Positive AT skew values indicate more A than T and positive GC skew values more G than C, and vice versa. Strand asymmetries can be generally stable in some taxa such as vertebrates, in which the vast majority of genes show positive AT and negative GC skews. In vertebrates, only the ND6 gene shows the opposite compositional bias because its coding polarity is the opposite from all the other protein-coding genes. However, in invertebrates, strand specific asymmetries are more variable, mainly because gene rearrangements (particularly gene inversions) are much more frequent in this group. Likewise, in invertebrates, the H-strand origins of replication also switch their orientation in many taxa (Hassanin et al., 2005; Scouras and Smith, 2006). Since the strand bias is related with the orientation of the replication process, a reverse replication mechanism will also lead to an inverse compositional bias. For example, all available mitogenomes of crinoids (Echinodermata; Scouras and Smith, 2006) and some species in Cephalochordata (Kon et al., 2007) have generalized reversed strand asymmetry, when compared with the remaining mitogenomes of the same taxonomic group (phylum or subphylum, respectively). That occurs due to an inversion of the replication origin (the control region) and consequently of the replication direction (the H-strand became the L-strand and vice versa) (Hassanin et al., 2005; Kilpert and Podsiadlowski, 2006; Fonseca et al., 2008). In conclusion, compositional heterogeneities can also result from switches in coding polarities of genes and from reversal replication process of mitochondrial genomes. If a given gene (or group of genes) is inverted in some taxa, this will contribute to increase the heterogeneity of the patterns of sequence evolution within a dataset. In these scenarios, classical homogeneous models are not expected to capture the complex sequence evolution, resulting in misleading phylogenetic relationships. On the contrary, one expects that heterogeneous models can incorporate the time- and sequence-heterogeneity of the sequences and correctly infer the phylogenetic relationships between taxa. 1.4. Gene Rearrangements The analysis of gene orders is another source of phylogenetic information that is becoming increasingly used (Boore et al., 1995; Sankoff et al., 1992). A comparison of the gene order between closely related species can reveal different types of rearrangements such as inversions (Asakawa et al., 1995), transpositions (Macey et al., 1997), reverse transpositions (Boore et al., 1998), and tandem duplications followed by the random loss
  • 28.
    10 (TDRL) of oneof the copied genes (Boore, 2000) (Figure 2). As well, a variety of possible mechanisms that led to them have been proposed (Boore, 2000), some related to the replication process, others with enzymatic errors, or with illegitimate recombination (Mueller and Boore, 2005). Figure 2 | Representation of different types of rearrangements for a hypothetical genome consisting of five genes. From left to right: inversion, transposition, reverse transposition, tandem duplication random loss (Bernt et al., 2013b). It is also important to note that rearrangement rates are not only unevenly distributed in the genome, but also across taxa. In mitochondria, nucleotide (and amino acid) composition varies with the position in the genome (because there is a spatial compositional gradient along the genome generated by the replication mechanism). When a gene changes its location, it will be positioned in a genomic region with a different mutational pressure. Thus, if homologous genes in different species have different locations, they might show heterogeneous patterns of evolution. The same can happen when a gene changes its coding strand or if the replication mechanism inverts. Differences in gene order can be estimated and incorporated in standard frameworks for phylogenetic reconstruction, e.g., by means of distance methods, MP, and ML approaches, as well as with different aims (e.g., evolutionary relationships or ancestral states reconstruction). However, it remains unclear what is the most adequate approach (Bernt et al., 2013a).
  • 29.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 11 2. Mitochondrial DNA The eukaryotic cells are composed by many organelles, with the mitochondria being one of the most fundamental (Boyce et al., 1989; Lang et al., 1999; McBride et al., 2006). Among the several functions of mitochondria (Figure 3), ATP production (i.e. energy) and regulation of cellular metabolism are of primary relevance (Ricci, 2009; Bernt et al., 2013b). Figure 3 | Representation of the mitochondria. Image adapted from Lodish (2008). 2.1. MtDNA Structure and Organization The mitochondrion contains its own genome (mitogenome or mtDNA) (Schatz, 1963), and reproduces independently of the cell in which it is found which is the major evidence of the origin of eukaryotic cells by endosymbiosis (Gray et al., 1999). In terms of structure and gene content, animal mitogenomes are more stable than other eukaryotic groups (e.g. the phyla Porifera, Nematoda, Cnidaria) (Wolstenholme, 1992; Boore, 1999). The typical mtDNA is a small and circular double-stranded DNA molecule of about 15-20 kb that contains 37 genes (Boore, 1999) (Figure 4). Although some larger mitochondrial genomes have been found, it is believed that they are the product of duplications in some portions of mtDNA (Boyce et al., 1989). The mtDNA carries 13 protein-coding genes: the cytochrome c oxidase subunits I, II, and III (COX1, COX2, and COX3) and the cytochrome b (CYTB), the ATPase complex subunits 6 and 8 (ATP6 and ATP8), and the NADH dehydrogenase subunits 1–6 and 4L (ND1–ND6 and ND4L).
  • 30.
    12 Figure 4 |Example of an invertebrate mitochondrial genome (from Arbacia lixula with 15719bp). Circular genome image generated with OGDraw v1.2 http://ogdraw.mpimp-golm.mpg.de/. The mitogenome is highly compact, with no introns between the coding regions (Anderson et al., 1981). Additionally, it contains non-coding regions involved in the transcription of the mitochondrial-encoded proteins: 22 transfer RNA (tRNA) genes (one for each amino acid, except leucine and serine, which have two copies each); and two ribosomal RNA genes: the small or 12S subunit (srRNA) and the large or 16S subunit (lrRNA) (Attardi, 1985; Taanman, 1999; Hassanin et al., 2005). Finally, the mitogenome also contains one or more regions (non-coding) that regulate the initiation of replication and transcription (Boore, 1999; Taanman, 1999; Hassanin et al., 2005). The two strands of the mtDNA molecule encode genes. Other features of the mtDNA, specifically: the small size of the molecule, its abundance in eukaryotic animal tissues, the consistent orthology of the genes illustrated by the presence of the same genes in different taxa, its mostly uniparental inheritance and the absence of recombination (Elson and Lightowlers, 2006); contributed for making this molecule the marker of choice when it comes to infer the origin and evolution of the main groups of organisms, mainly in Metazoa (Fonseca, 2011). On top of this, the huge amount of publically available sequence data for mitochondrial genes, recently prompted by the development of next-generation sequencing (NGS) tools for entire mitogenomes, offers a unique opportunity to study not only the relationships among organisms based on sequence data but also on changes in gene order (synteny), contributing to important advances in phylogenetics and the emergence of phylogenomics.
  • 31.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 13 2.2. MtDNA Strand Asymmetry and Replication As previously mentioned, the two strands of the mtDNA molecule can be distinguished on the basis of G+T base composition, resulting in strand asymmetry, in which one of the strands is GT-rich (H-strand) and the other is GT-poor (L-strand) (Taanman, 1999; Hassanin et al., 2005). The replication mechanism of mtDNA has been suggested to be the main responsible for the observed asymmetries in nucleotide composition between strands (Reyes et al., 1998; Faith and Pollock, 2003). During this process, a strand in single-stranded state is more exposed to DNA damage, facing higher mutational pressures (Hassanin et al., 2005; Francino and Ochman, 1997). Consequently, since genes closer to the replication origin of the H-strand remain exposed to mutation for a longer period of time during replication, a positive correlation between strand asymmetry and the time spent in single-stranded state is generally expected (Reyes et al., 1998; Wei et al., 2010). 2.3. Gene Rearrangements Genome rearrangement rates and nucleotide substitution frequency are generally correlated (Shao et al., 2003; Xu et al., 2006). This correlation might imply a cause/effect relationship between them or, alternatively, it can result from another mechanism that accelerates both processes (e.g. DNA replication; Xu et al., 2006). However, in the case of mitochondrial genomes, this correlation has not been thoroughly investigated (Weng et al., 2013). Changes in mitochondrial gene content, i.e. deletions (Lavrov and Brown, 2001) or duplications (Zhong et al., 2008), and changes in chromosome organization have been widely reported, especially in metazoan mitogenomes (i.e. in insects - C. vestalis, B. macrocnemis and C. bidentatus - Wei et al., 2010), often affecting the origins of replication (San Mauro et al., 2006) but also other regions throughout the genome such as tRNAs (Dowton and Austin, 1999). In early studies of mitogenomic rearrangements, tRNA genes were found to be more frequently involved in rearrangements than protein and rRNA genes (Wolstenholme, 1992; Macey et al., 1997). Although the phylogenetic studies based on gene rearrangements comparisons has been increasing (Scouras and Smith, 2001; Perseke et al., 2008; Shen et al., 2009; Perseke et al., 2010; Perseke et al., 2013), this can originate different (contradictory) phylogenetic patterns than those obtained using other information (Boore and Brown, 1998; Castresana et al., 1998).
  • 32.
    14 3. The ControversialPhylogeny of the Echinoderms Echinodermata belongs to the superphylum Deuterostomia, together with two other phyla: Chordata (Craniata, Cephalochordata and Urochordata) and Hemichordata (Swalla and Smith, 2008; Perseke et al., 2013). The phylogenetic relationships among deuterostomes differ between studies, but the most recent molecular evidences support Echinodermata and Hemichordata as sister taxa (Bourlat et al., 2009; Perseke et al., 2013). This pairing is also supported by morphological data (Turbeville et al., 1994; Gee, 1996; Lambert, 2005), ribosomal gene sequences (Turbeville et al., 1994; Wada and Satoh, 1994; Cameron, 2000), shared Hox gene motifs (Peterson, 2004), mitochondrial genes and gene arrangements (Bromham and Degnan, 1999; Lavrov and Lang, 2005). Similarly to the deuterostomes, the evolutionary and phylogenetic relationships among the echinoderm classes have also been highly debated (Littlewood et al., 1997; Perseke et al., 2010; Kondo and Akasaka, 2012). There are approximately 7000 extant echinoderm species (Amemiya et al., 2005), falling into five well-defined taxonomic classes: Crinoidea (sea lilies), Ophiuroidea (brittlestars), Asteroidea (starfish), Echinoidea (sea urchins), and Holothuroidea (sea cucumbers) (Littlewood et al., 1997; Amemiya et al., 2005). Their phylogeny was previously inferred using a cladistic approach primarily based on morphologic characters (Smith, 1988; Bottjer, 2006) and later on molecular data (Smith, 1992; Perseke et al., 2010) or both (Littlewood et al., 1997; Janies, 2001). Table 1 shows a chronological summary of the most relevant phylogenetic studies on Echinodermata.
  • 33.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 15 Table 1 | Summary of previous phylogenetic studies on Echinodermata. The kind of data used, methods applied and topologies recovered are indicated. Abbreviations: A- Asteroidea; C- Crinoidea; E- Echinoidea; H- Holothuroidea; O- Ophiuroidea; (1) DNA amplification; (2) not specified. Reference Data Phylogenetic Method Topology Smith, 1984 Morphological (larval and adult) Maximum Parsimony Smith et al., 1993 Molecular (mitochondrial gene arrangements) (1) H-E pairing O-Apairing Pearse and Pearse, 1994 Morphological (adult) (2) Wada and Satoh, 1994 Molecular (18S rDNA) Maximum Likelihood Neighbour Joining
  • 34.
    16 Littlewood et al.,1997 Morphological (larval and adult) + Molecular (18S, 28S rDNA) Maximum Parsimony Maximum Likelihood Scouras and Smith, 2001 Molecular (COX1, 2 and 3; mitochondrial gene arrangements ), except Ophiuroidea LogDet Maximum Likelihood Janies, 2001 Morphological (larval and adult) + Molecular (18S, 28S rDNA) Maximum Parsimony Mallatt and Winchell, 2007 Molecular (18S, 28S rDNA) Maximum Parsimony Maximum Likelihood Bayesian Inference Perseke et al., 2008 Molecular (mitochondrial genome; mitochondrial gene arrangements) Neighbor Joining Maximum Parsimony Maximum Likelihood Bayesian Inference
  • 35.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 17 Shen et al., 2009 Molecular (CYTB; mitochondrial gene arrangements) Bayesian Inference Perseke et al., 2010 Molecular (mitochondrial genome; mitochondrial gene arrangements) Neighbor Joining Maximum Parsimony Maximum Likelihood Bayesian Inference Pisani et al., 2012 Molecular (nuclear rDNAand protein-coding genes), except Holothuroidea Neighbor Joining Maximum Parsimony Maximum Likelihood Bayesian Inference Perseke et al., 2013 Molecular (mitochondrial genome mitochondrial gene arrangements) Maximum Likelihood Telford et al., 2014 Morphological (larval and adult) + Molecular (nuclear genes) Bayesian Inference
  • 36.
    18 3.1. Morphological Data Smith(1984) was the first to study the phylogeny of Echinoderms using anatomical and developmental characters from both larvae and adults, since previous studies showed that larval characters alone and the fossil characters of the echinoderms could contribute for misleading evolutionary relationships between echinoderm classes. Under a parsimony analysis, the obtained topology (Table 1) supported the relationships between four of five classes of echinoderms, with the position of holothuroids being less clear (low support). This group (holothuroids) shares a number of derived characters with ophiuroids and echinoids and several others only with echinoids, while there are a number of other derived characters that are common to asteroids, ophiuroids and echinoids (Smith, 1984). A few years later, Strathmann (1988) reanalyzed larvae morphological characters but found no supported topology; whereas Pearse and Pearse (1994) presented a phylogenetic analysis based on anatomical characters of adult individuals supporting crinoids as sister group of the clades formed by the echinoids-holothuroids and of the ophiuroids-asteroid pairs. All these studies suggested that crinoids are the representatives of extant echinoderms that diverged earlier from the rest, but conflicting results still exist for the relationships between the remaining classes, which can result from a misinterpretation of morphological characters (Smith, 1984) (Table 1). For example, different studies presented different topologies using the same morphological characters dataset because authors had different opinions regarding character homology and polarity (Littlewood, 1995). 3.2. Molecular Data The increasing number of new sequences publicly available and novel phylogenetic inference methods and tools made researchers hopeful in resolving the inconsistencies of the echinoderm phylogeny (Table 1). Smith and co-workers (1993) analyzed the first almost-complete dataset of mitochondrial genes within echinoderms and, by comparing the mitochondrial gene order from all classes except Crinoidea, observed a closer relationship between holothurians and echinoids and between ophiurids and asteroids. The subsequent analysis of 18S and/or 28S ribosomal molecular data using different phylogenetic methods was unsuccessful in providing a definitive topology for the relationships between echinoderm classes (Wada and Satoh, 1994; Littlewood, 1995; Littlewood et al., 1997; Janies, 2001). After obtaining the complete mitochondrial gene order/sequence of the crinoid Florometra serratissima, Scouras and Smith (2001) used the mitochondrial gene order to construct a phylogeny that recovered a closer relationship between the echinoid and the holothuroid classes with the asteroids outside this clade. However, since
  • 37.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 19 Ophiuroidea was not included, it was impossible to evaluate the relative support for all Echinodermata classes. The analysis of Shen and co-workers (2009), based on the CYTB gene revealed the phylogenetic relationship presented on Table 1. The authors encouraged increasing the number of mitochondrial genomes to resolve some unanswered questions though. Pisani and co-workers (2012) were the first to use evolutionary models that take into account compositional heterogeneity (CAT-GTR model), but their results were not conclusive because they did not use all extant echinoderm classes. The analyses of mitochondrial genomes (based on sequence and gene order information) using different phylogenetic methods by Perseke and co-workers (2008, 2010, 2013) resulted in conflicting topologies (e.g. Perseke et al., 2008 did not find support for the Holothuroidea and Echinoidea grouping). More recently, Tellford and co-workers (2014) used 219 nuclear genes to perform a phylogenetic analysis using the evolutionary model CAT-GTR and obtained good support for the Holothuroidea-Echinoidea and for the Ophiuroidea-Asteroidea pairings, with Crinoidea as the first diverging lineage respect to these two groups. 3.3. MtDNA Gene Rearrangements Since mitochondrial gene rearrangements are much more slow-evolving than nucleotide sequences (Brown, 1985), they are thought to be more informative in establishing deep relationships, such as those between echinoderm classes (Boore et al., 1995). Given the rarity of these events, one expects that a rearrangement shared by two organisms generally imply common ancestry (Boore et al., 1995). Nonetheless, caution must be taken if one suspects of independent origin for a given rearrangement or if multiple rearrangements occur (Scouras and Smith, 2001). The different studies using this kind of information identified gene rearrangements within all Echinodermata classes except echinoids. Comparing the different classes, the echinoids and some holothuroids share an identical mitochondrial gene order (Scouras and Smith, 2001). Moreover, Smith (1993) and others described a mtDNA inversion that differentiates Asteroidea from Echinoidea/Holothuroidea. It is approximately 4.6kb long, contains 14 tRNA genes, the lrRNA, and ND1 and ND2 protein-coding genes, and is present in all seven Asteroidea species. Finally, the ophiuroid gene order differs from the one of asteroids in the junction between ND1 and COX1 genes, containing substantially fewer tRNA genes (Smith et al., 1993).
  • 38.
    20 3.4. Final Considerations Aspreviously illustrated, the phylogenetic analyses of the Echinodermata did not provide a conclusive result regarding their relationships. Although the monophyly of each class is usually supported, the construction of phylogenetic trees with different genes, taxa and methods of analysis often give contradictory outcomes (Kondo and Akasaka, 2012). The highly derived morphology (Smith, 1984; Smith et al., 1993; Pearse and Pearse, 1994), the fast evolving mitochondrial genomes and the complex gene arrangements varying both within and between echinoderm classes, all have a significant impact on the phylogenetic reconstruction of this group (Brown, 1985; Boore et al., 1995; Scouras et al., 2004). However, summarizing all these studies (Table 1), most authors accept a close relationship between echinoids and holothurians and a basal position of crinoids. This is supported by morphological characters (Smith, 1984; Smith et al., 1993; Pearse and Pearse, 1994) or/and nuclear rRNA sequences (Wada and Satoh, 1999; Littlewood et al., 1997; Janies, 2001; Mallatt and Winchell, 2007; Telford et al., 2014). However, the use of complete mtDNA data revealed a basal position of the Ophiuroidea class instead of Crinoidea (Perseke et al., 2008, 2013), which might be related with the fact that Ophiuroidea show high substitution rates at mtDNA genes. However, it should also be noted that all crinoid mitogenomes show a strand-specific nucleotide compositions/bias that is the opposite of the other echinoderms studied so far. This is most likely caused by an inversion encompassing the origin of replication in crinoid mtDNA (Perseke et al., 2013). Another important debate concerns the relative position of asteroids. Taking into account these different studies, most authors propose two hypotheses (Cryptosyringid and Asterozoan hypotheses; Figure 5) for the echinoderms phylogeny. The Cryptosyringid hypothesis has Crinoidea as basal, Asteroidea as the second class to diverge from the remaining ones, followed by Ophiuroidea, which is the sister class of the group formed by Echinoidea and Holothuroidea. The Asterozoan hypothesis also considers Crinoidea as the basal class, while Echinoidea - Holothuroidea and Ophiuroidea - Asteroidea form two different internal clades. These two hypotheses are almost equally supported when using molecular and/or morphological data (Smith, 1984; Pearse and Pearse, 1994; Wada and Satoh, 1994; Littlewood et al., 1997; Janies, 2001; Mallett and Winchell, 2007; Shen et al., 2009; Telford et al., 2014). Regarding the mitochondrial gene rearrangements, were found patterns within and between classes of Echinodermata (Perseke et al., 2008, 2010). Although within Echinoidea the gene order is conserved (Jacobs, 1988; De Giorgi, 1996), one cannot say the same for the remaining classes (Asakawa et al., 1995; Scouras and Smith, 2001; Scouras et al., 2004; Perseke et al., 2008; Shen, 2009; Perseke et al., 2010). The gene order of Crinoidea has been proposed as the ancestral pattern of echinoderms (Perseke et al., 2008) followed by Ophiuroidea and then the Asteroidea that which is related
  • 39.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Introduction 21 to the clade Echinoidea-Holothuroidea (Scouras and Smith, 2001; Perseke et al., 2008; Perseke et al., 2010). Figure 5 | The two hypotheses for the Echinodermata phylogeny. Adapted from Smith, 1984; Pearse and Pearse, 1994; Wada and Satoh, 1994; Littlewood et al., 1997; Janies, 2001; Mallett and Winchell, 2007; Shen et al., 2009; Telford et al., 2014. 4. Objectives The main objective of this study is to test the performance of site- and time-heterogeneous evolutionary models in a dataset that is heterogeneous along its alignment, by violating the conditions of stationary and homogeneity and, by definition, of reversibility. By using empirical data (mitogenomes of echinoderms), we aim to assess the effects of gene rearrangements and subsequent nucleotide composition changes in phylogenetic inference – the so-called Long-Branch Attraction (LBA) artifact - and thus to explore the full potential of mitogenomics. Like this, we hope to shed light into some important issues: i) Evaluate if methods that incorporate time - and compositional - heterogeneous evolutionary models (empirical mixture models) improve the performance of phylogenetic inference, when compared with more simple models (classical empirical models). ii) Revaluate the phylogenetic relationships between echinoderm classes, using complex heterogeneous evolutionary models of evolution on a dataset (mitogenomes) composed by all species of Echinodermata. Asterozoan Cyptosyringid
  • 41.
  • 43.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Material and Methods 25 This thesis is divided in two main parts. In the first part, we aimed to characterize the mitogenomes of Echinodermata species by analyzing the mitogenome architecture, investigating the phylogenetic signal of gene rearrangements and assessing compositional heterogeneity of the protein-coding genes. In the second part, we reconstructed the phylogeny of this group using homogeneous and (site- and time) heterogeneous evolutionary models to test if the later help to surpass the limitations of previous phylogenetic models. 1. Retrieval of Molecular Data The data used in this study consists of the complete mitochondrial genome from 29 species representing all extant classes of Echinodermata plus three species from the phylum Hemichordata (http://www.ncbi.nlm.nih.gov/, last accessed in October 2012) (see Appendix-A.1), the sister group of Echinodermata according to previous studies (Bromham and Degnan, 1999; Cameron et al., 2000; Peterson, 2003; Perseke et al., 2013; Cannon et al., 2014). We started by retrieving all publicly available complete mitochondrial DNA sequences for Metazoa (2677 mitogenomes) from GenBank (NCBI ftp: ftp.ncbi.nih.gov/ genomes/MITOCHONDRIA/Metazoa/all.gbk.tar.gz). Then, we filtered the GenBank files for the complete mitogenomes of Echinodermata species (29 mitogenomes) using custom scripts (see Appendix-A.2). 2. Analysis of the Mitogenome Architecture Reliable and standard genome annotations are indispensable for comparative analysis, in particular for the studies on genomic rearrangements (Lupi et al., 2010). The most widely used and updated resource for annotated mitogenome sequences is the GenBank sequence database. New tools have been created to annotate genomic sequences. Among these, there are two very popular web tools specific for mitogenomes: DOGMA (Dual Organellar GenoMe Annotator) (Wyman et al., 2004) and MITOS (MITOchondrial genome annotation Server) (Bernt et al., 2013c). Since none of these tools is able to identify with absolute certainty all the features each genome contains, we have used both, and their results were then compared with the original GenBank annotations.
  • 44.
    26 2.1. Comparison ofGene Order Four rearrangement operations, at least, are reported in the literature to be relevant for the study of mitochondrial gene order evolution (Boore, 1999): inversions, transpositions, reverse transpositions and tandem duplications-random losses (TDRL). CREx (Common interval Rearrangements Explorer - http://pacosy.informatik.uni-leipzig.de/crex/form) (Bernt et al., 2007) is a web service tool for the identification of common rearrangement events (the four types mentioned before) between several genomes. It computes a distance matrix between the genomes according to different methods such as the breakpoint distance, the reversal distance or the number of common intervals (Bernt et al., 2007; Fertin, 2009). The breakpoint distance is the simplest approach and has been used for a long time, while the two other distances, are able to capture more subtle similarities. Moreover, according to Pevzner (2000) and Bernt and Middendorf (2011), the breakpoint distance method does not allow a reliable inference of the evolutionary scenarios underlying the observed rearrangements. As well, the reversal distance is considered somewhat limited in terms of combining multiple genome rearrangement events. Therefore, we compared the gene order (tRNAs and protein-coding genes) between all available Echinodermata mitogenomes using the number of common intervals implemented in CREx (Bernt et al., 2007) as distance measure, and inferred the most parsimonious evolutionary scenarios to explain the observed differences (Bernt et al., 2007). Within each class, rearrangements were inferred by comparing the gene order of the different species with that of the basal taxon of that same class (Scouras and Smith, 2001; Fan et al., 2011; Perseke et al., 2008; Perseke et al., 2010). The rearrangements between the different classes were inferred by comparing the gene order of the most basal taxon of each class with the proposed ancestral gene order of the Echinodermata clade (Florometra serratissima – Crinoidea, according to Perseke et al. (2010)). 2.2. Prediction of Origins of Replication For the mitogenome of each Echinodermata species, we inferred the probable positions of origins of replication. The replication origins (or replication-related elements) are expected to be located in A+T rich regions (Hassanin et al., 2005; Wei et al., 2010), to form stem-and-loop (hairpin) structures and to typically contain tandem-repeat sequences. We extracted all possible mtDNA fragments from 15 up to 45bp from both strands, assuming that putative origins of replication should form small and simple hairpin structures (i.e. with a single stem and a single loop). For each fragment, the secondary structure and respective entropy were calculated using RNAstructure v5.6 (Reuter and Mathews, 2010); while we
  • 45.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Material and Methods 27 added some important features (i.e. unpaired code, number of ATGCs on hairpin, number of pyrimidines and purines on 3’end and 5’end of stem) with customized Perl scripts. Finally, we filtered the resultant hairpins using R scripts. Since there is only a few studies about the origin of replication in invertebrates, especially in echinoderm species, we used the filter criteria generally applied for vertebrates. However, we applied two sets of filters. The first (more stringent) consisted of: i) entropy bellow -8.0 kcal/mol, ii) loop containing at least one T, iii) preference for no unpaired nucleotides in the stem, iv) 5'end of the stem rich in TC (pyrimidines) and containing at least one C, v) 3'end of the stem rich in AG (purines) and containing at least one G, and vi) hairpin located in a non-coding region (Fonseca et al., 2014). The second (more relaxed) consisted of: i) entropy below -8.0 kcal/mol, ii) unpaired nucleotides allowed in the stem, ii) 5'end of the stem rich in TC (pyrimidines) but without restrictions in the number of Cs, iii) 3'end of the stem rich in AG (purines) but without restrictions in the number of Gs, and iv) hairpins located in a non-coding region. 2.3. Assessment of Compositional Heterogeneity For every taxon, we calculated the base composition skew asymmetry of each of the 13 protein-coding genes as AT-skew [(A-T)/(A + T)] and GC-skew [(G-C)/(G + C)] (Perna and Kocher, 1995) using MEGA v5.0 (Tamura et al., 2011). The AT- and GC-skews were computed using only the 4-fold redundant sites of the protein-coding genes. Then, for each species we mapped the skews along the mitogenome to examine the distribution pattern of the compositional bias. 3. Phylogenetic Analysis 3.1. Sequence Alignment After retrieving the complete mitochondrial DNA sequences of the 29 echinoderm and the three hemichordate species, a Perl script was written (see Appendix-A.3) to filter the GenBank files keeping only the protein coding genes (ATP6, ATP8, COX1, COX2, COX3, CYTB, ND1, ND2, ND3, ND4, ND4L, ND5 and ND6) in individual fasta files. In addition, we created two different datasets, one only containing the protein-coding genes of Echinodermata species, which we named the ingroup dataset; and another, the outgroup dataset, composed by protein-coding genes of Echinodermata and Hemichordata species. For both datasets, we also created a concatenated matrix of all protein-coding genes (amino acid sequences) using a python script (https://github.com/ODiogoSilva/ElConcatenero). To align the single gene sequences for both ingroup and outgroup datasets we used three
  • 46.
    28 different multiple sequencealignment (MSA) programs (see Appendix-A.3): Muscle (Edgar, 2004), Mafft (Kuma and Miyata, 2002) and Prank (Löytynoja and Goldman, 2005) . 3.2. Trimming of the Alignments There are highly efficient programs for multiple sequence alignment, but even the best scoring alignment algorithms may retrieve ambiguously aligned sequences. A review of the alignments can eventually identify gap problems and poorly or ambiguously aligned regions. It is, therefore, important to remove such regions from the final alignment. Originally, the trimming of the alignments was a manual task. However, reviewing thousands of aligned sequences by eye can be tedious, time-consuming and error-prone. To do it in an automated way we used trimAl v1.2 (Capella-Gutierrez et al., 2009; Appendix-A.3). This software was also used to measure the consistency of each alignment (obtained with the different MSA programs), and the one with the highest consistency score was kept for the subsequent phylogenetic analyses. 3.3. Selection of the Best Fit Model In the case of model-based phylogenetic inference methods, a priori formal model selection procedure needs to be implemented to choose an appropriate model (Burnham and Anderson, 2002; Sullivan and Joyce, 2005). Various methods and tools exist to perform this model selection for nucleotide or amino acid sequences (Bernt et al., 2013a). ProtTest (Abascal et al., 2005) is one of the available bioinformatics tools that allows finding the best fitted model of protein evolution within a candidate list of models, following different information-theoretic approaches, as an alternative to hierarchical Likelihood Ratio Tests (Posada and Buckley, 2004; Appendix-A.2). Among these, the most popular information-theoretic method is the Akaike Information Criterion (AIC) (Akaike, 1974), which includes a term that penalizes a model in proportion to the number of its parameters. When the dataset size is small compared to the number of parameters, the AIC can be inaccurate and the use of the corrected AIC (AICc) (Burnham and Anderson, 2002) is recommended. Thus, this was the criterion we applied. The current version of ProtTest (v3.2) includes 15 different rate matrices in a total of 120 different models, when the rate variation among sites (+I: invariable sites; +G or +Г: gamma-distributed rates) and the observed amino acid frequencies (+F) are considered.
  • 47.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Material and Methods 29 3.4. Phylogenetic Inference We performed the phylogenetic analyses using different inference methods such as Maximum Likelihood (ML) and Bayesian Inference (BI). For the first, we used PhyML v3.0 (Guindon et al., 2010) under the following parameters: the evolutionary model proposed by ProtTest (see Chapter 3 - Table 4), the equilibrium amino acid frequencies estimated by counting the occurrence of the different amino acids in the alignment (“-f e” option), the best of the nearest-neighbor interchange (NNI) and of the subtree pruning and regrafting (SPR) as tree topology rearrangement search (“-s BEST” option). Support for nodes was estimated using the bootstrap technique (Felsenstein, 1985) with 1000 pseudoreplicates. The BI analysis was implemented using MrBayes v3.2.1 (Ronquist et al., 2012), which calculates Bayesian posterior probabilities using a Metropolis-Coupled, Markov Chain Monte Carlo (MC-MCMC) sampling approach. The models of evolution used was also the ones selected by ProtTest. Parameters were estimated as part of the analysis with four Markov chains incrementally heated with the default heating values. Each chain started with a randomly generated tree and ran for 1x107 generations, with a sampling frequency of one tree for every 100 generations. One quarter (25%) of the samples was discarded as burn-in and the remaining trees were combined in a 50% majority rule consensus tree. The frequency of any particular clade of the consensus tree represents the posterior probability of that clade (Huelsenbeck and Ronquist, 2001). Two independent replicates were conducted and inspected for consistency (Huelsenbeck and Bollback, 2001). The models of amino acid evolution that we applied using PhyML or MrBayes are not appropriate for sequences showing time and compositional heterogeneous evolution and might introduce systematic errors (Galtier and Lobry, 1997; Blanquart and Lartillot, 2006). Since the datasets that we used show heterogeneous composition across sites and time, we decided to perform phylogenetic inferences using also frameworks that are expected to relax the stationary and homogeneity assumptions. To do so, we analyzed the data in a non-stationary Bayesian framework, using PhyloBayes MPI v1.4f (Lartillot et al., 2013). We applied a wide variety of homogeneous and heterogeneous models: classical empirical single-matrix (selected from ProtTest), empirical mixture of matrices (UL3), empirical mixture of profiles (WLSR5; C20), general time reversible matrix (GTR), infinite mixture with global exchange rates fixed to flat values (CAT-Poisson) or inferred from the data (CAT-GTR). For each PhyloBayes analysis, we ran two independent chains in parallel that were terminated only when they reached
  • 48.
    30 convergence (using bpcompprogram within the PhyloBayes package), which was diagnosed as maximum discrepancy threshold below 0.1 or below 0.3 (if the points sampled in the runs are above 10,000). Finally, we performed a Bayesian inference analysis using NH-PhyloBayes v0.2.3 (Blanquart and Lartillot, 2008) applying a combination of the category (CAT) mixture model (Lartillot and Philippe, 2004) with the non-stationary breakpoints (BP) model (Blanquart and Lartillot, 2006), named CAT-BP, which takes into account the heterogeneity along the sequence and across lineages. Under the CAT-BP model, the number of categories (CAT) is fixed and it is not automatically estimated as implemented in PhyloBayes. Thus, for the analysis under the CAT-BP model we used the posterior number of categories (CAT = 50) previously determined by the CAT-GTR analysis in PhyloBayes. We ran two independent chains and the convergence diagnostic analysis was performed using the compchain program in NH-PhyloBayes. The convergence of the chains depended on whether the means of the statistics of chain 1 (log likelihood, total tree length, total number of breakpoints) fell into the means of chain 2 ± 3 times the standard error of the mean. The visualization of the phylogenetic trees was done with the help of an R script (see Appendix-A.4) or with FigTree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/). 3.5. Comparison of Phylogenetic Methods The trees obtained through the different phylogenetic methods were compared (in terms of topology) using the Robinson-Foulds (RF) metric (normalized to the total number of possible partitions) (Robinson and Folds, 1981). This metric was computed for all trees (single protein-coding genes and concatenated alignments) obtained with the outgroup and ingroup datasets, as implemented in KTreeDist (Soria-Carrasco et al., 2007). In addition, we qualitatively analyzed (by means of dendrograms, see Appendix-A.4) all the possible relationships (pairing) between all protein-coding genes and concatenated alignments from both datasets. To make a more reliable comparison between the datasets (ingroup vs. outgroup), we pruned the outgroup species from the outgroup dataset with an R script (see Appendix-A.4).
  • 49.
  • 51.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 33 1. Mitogenome Architecture 1.1. Mitogenome Annotation Concerning the annotation of protein-coding genes, MITOS and GenBank clearly outperformed DOGMA. While DOGMA was not able to annotate several genes in 15 of the 32 investigated species, MITOS and GenBank annotated the full mitochondrial gene set and in a concordant manner (Table 2). When it comes to noncoding genes, some tRNAs (tRNA-Lys, tRNA-Glu, tRNA-Thr, tRNA-Trp, tRNA-Ser(UCN) , tRNA-Ser(AGN) ) and both rRNAs were not annotated in some Asteroidea GenBank files (although they were sequenced) (Table 2). However, NCBI is still the most complete and accurate source of annotated mtDNA genes of our dataset. Thus, the GenBank annotation was used in subsequent analyses. Table 2 | Summary of mitogenome annotation according to different approaches. Check marks indicate that all genes are annotated. Listed genes are those not annotated in a given genome using a specific tool. Class Species (Accession no) GenBank DOGMA MITOS Asteroidea NC_001627    NC_004610 tRNAGlu, tRNA-Thr, tRNA-Ser   NC_006664 tRNA-Trp   NC_006665 tRNA-Lys, SrRNA, tRNA-Ser, LrRNA, tRNA-Trp   NC_006666 tRNA-Lys, SrRNA, tRNASer, LrRNA, tRNA-Trp   NC_007788  ATP8  NC_007789  ATP8  Echinoidea NC_001453    NC_001572   
  • 52.
    34 NC_001770    NC_009940   NC_009941  ATP8  NC_013881  ATP8  Holothuroidea NC_005929    NC_012616  ATP8  NC_013432  ATP8  NC_013884  ATP8, ND5  NC_014452  ATP8  NC_014454  ATP8  Ophiuroidea NC_005334    NC_005930    NC_010691    NC_013874  ATP8, ATP6, ND2, ND4  NC_013876  ATP8, ND2, ND4L, ND6  NC_013878  ATP8, ND2, ND4L, ND6  Crinoidea NC_010692  ATP8  NC_001878    NC_007689    NC_007690  ATP8  Hemichordata NC_013877    NC_007438  ATP8, COX1, COX2, CYTB, ND1, ND2, 
  • 53.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 35 ND5, ND6 NC_001887    1.2. Gene Order Comparison Using the genome organization of the basal species of each class as the reference we also observe that the mtDNA gene rearrangements found in Echinodermata species support the monophyly of the different classes. The comparison of the gene order among different echinoderm classes show a block (from COX1 to ND6) that is conserved across all of them, with the exception of one holothuroid (Cucumaria miniata) and two crinoid (Neogymnocrinus richeri and Antedon mediterranea) species (Appendix-A.1; Figure 6). The Echinoidea and Holothuroidea share the same basal gene order (Figure 7) with only a few exceptions due to derived transpositions involving the tRNAs of the holothuroid species Cucumaria miniata, Stichopus sp. SF-2010 and Stichopus horrens (see Appendix - A.1). The basal genome organization of these two classes differs from Crinoidea (here used as the basal species of all Echinodermata) by two inversions and a reverse transposition (Figure 7). Asteroidea genome order differs from Crinoidea by two inversions, one transposition and a reverse transposition (Figure 7). A 4.6-kb inversion encompassing ND1, ND2 and a 12 tRNA-gene block is observed between Asteroidea and Echinoidea/Holothuroidea (Appendix - A.1; Figure 7). Finally, Ophiuroidea derives from Crinoidea by two inversions and three TDRLs events (Figures 6 and 7). The reconstruction of the phylogenetic relationships among the five classes using these mitogenome rearrangements as phylogenetic characters (Figure 7), and considering the gene order of the most basal species of each class (see Appendix - A.1) as reference (Fan et al., 2011; Perseke et al., 2010; Perseke et al., 2008; Scouras and Smith, 2001), resulted in a topology that supports the Cryptosyringid hypothesis (Figure 6).
  • 54.
    36 Figure 6 |Evolution of gene orders in echinoderm lineages using the typical crinoid gene order as the reference (inferred by CREx). Genes in upper layer are transcribed on the L-strand, whereas genes in lower layer are transcribed on the H-strand. I = inversion; T = transposition, rT = reverse transposition, TDRL = tandem duplication random loss. Black dotted line indicates conserved gene order block. Green, purple, orange and blue boxes indicate reverse transpositions, inversion, transposition and TDRL events, respectively. Figure 7 | Phylogenetic tree obtained with CREx (pairwise-distance) based on the genome rearrangements among the Echinodermata species. I = inversion; T = transposition, rT = reverse transposition, TDRL = tandem duplication random loss. Grey star indicates the hypothetical basal gene order from all echinoderms. Crinoidea Asteroidea Echinoidea/Holothuroidea rT + 2I Ophiuroidea T + rT + 2I 2I+3TDRL
  • 55.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 37 Regarding rearrangements within each class, Asteroidea species differ by some inversions on tRNAs (see Appendix - A.1). Within Holothuroidea and Crinoidea species, we verified some TDRL and transpositions on tRNAs (see Appendix - A.1) and on ND4L (see Appendix - A.1), respectively. In Ophiuroidea, some transpositions on tRNAs occurred in most species relative to the most basal, whereas multiple rearrangements (transpositions and TDRL in the tRNAs, inversion of the CYTB-tRNA-Gln gene block) were observed in Ophiura lutkeni and Ophiura albida (see Appendix - A.1). Echinoidea is the only class where we did not detect any gene rearrangement between species (see Appendix - A.1). 1.3. Prediction of Replication Origins The purpose of this analysis was to understand if there is more than one possible replication origin in each species; if these replication origins are located in homologous regions (e.g. control region) between species of the same class; and if rearrangements could have contributed to their different location in the mitogenome of different species. Using the more stringent set of filters we were only able to predict the putative replication origins for four Echinodermata mitogenomes (Table 3; Figure 8). However, even when we applied the more relaxed filters (see Methods) we were unable to retrieve the putative replication origin for most species analyzed here, mainly because the resultant hairpin structures were located in coding-regions. Importantly, no putative replication origins were identified in any of the echinoid and ophiuroid species, regardless of the filter used. Most hairpins were located between tRNA-Phe and ND2 (Table 3), falling inside a region involved in rearrangements in different classes (Figure 8). The features of the replication origin in asteroid mitogenomes are common to all species except to Astropecten polyacanthus (NC_006666), where replication origin is located in a different strand and in a different position relative to the other species (Table 3). The putative replication origin found in the holothurian (Stichopus sp. SF-2010 (NC_014452)) are located on the plus strand (H-strand), between tRNA-Phe and tRNA-Pro (Table 3). Finally, for the crinoid Phanogenia gracilis (NC_007690), a putative replication origin was identified on a non-coding region between tRNA-Gly and tRNA-Tyr, on the minus strand, with 18bp of length (Table 3).
  • 56.
    38 Table 3 |Putative replication origins inferred for the Echinodermata mitogenomes used in this study. Shown are the features of putative hairpins: genomic position; strand in which they are transcribed; size of their sequence; minimum entropy energy to drive replication; dot-bracket notation detailing the predicted secondary structure; sequence itself; and location in the genome relative to some protein-coding genes. (See Chapter 2, Material and Methods, for the difference between stringent and relaxed filters). Asterisks represent the putative hairpins found in control-region of the genome. a,b,c Hairpin structures are represented in Figure 8. Class Accession No Position Strand Size Entropy (kcal/mol) Dot-Bracket Sequence Location Stringent filters Asteroidea NC_006665 a 14542 + 23 -8.9 .((((((((.....)))))))) . ATGCCCCCTTCGG GGGGGGGCAA tRNA-Phe -ND2 Asteroidea NC_001627 b 7074 + 26 -8.2 .((((((((........))))) ))). TAACCCCCCCCCT CCCCGGGGGGTTC tRNA-Phe -ND2 Asteroidea NC_007789 c 15706 + 35 -11.1 .(((((((((((........... ))))))))))). CATAACCCCCCCC CCCCCCCTCCGGG GGGGTTATA tRNA-Phe -ND2 (*) Relaxed filters Asteroidea NC_006666 631 - 28 -8.9 .((((((((((......))))) ))))). ATTATCCTCGTGAA GAAACGAGGATAAA tRNA-Phe -ND2 Crinoidea NC_007690 4228 - 18 -8.2 .((((((....)))))). ATTGACTTTTAAGT CAAA tRNA-Gly- tRNA-Tyr Holothuroidea NC_014452 10928 + 37 -10 .((((((((((((.......... .)))))))))))). ATTTTGACACTTTC CCTTTCTAAAAAAG TGTCAAAAC tRNA-Phe -tRNA-Pro (*)
  • 57.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 39 Figure 8 | Representation of the seconday structure of the putative replication origins inferred for the mitogenomes of a) Asterias amurensis, b) Acanthaster brevispinus and d) Patiria pectinifera all from Asteroidea class. DNA hairpin structures were drawn using PseudoViewer (http://pseudoviewer.inha.ac.kr/) 1.4. Assessment of the Degree of Compositional Heterogeneity The plot of the skew values of the 4-fold redundant sites for each mitochondrial protein-coding gene for each Echinodermata species shows that Crinoidea presents an overall inversion of AT and GC skews (Figure 9; Appendix - A.1). While all other classes tend to show positive AT and negative GC skews, the crinoids have the opposite pattern of compositional bias in ten out of 13 genes (ATP6, ATP8, COX1, COX2, COX3, CYTB, ND3, ND4, ND4L and ND5) (Figure 9). Curiously, the Crinoidea genes that show positive AT and negative GC skews (ND1, ND2 and ND6) present an inverted compositional bias in some species of the remaining classes. This is the case of: one Ophiuroidea species, Ophiopholis aculeata, with inverted skews in four out of 13 genes (CYTB; ND1; ND2 and ND6); six Asteroidea species with inverted skews exactly in three genes (ND1; ND2 and ND6); and two species of each Holothuroidea, Apostichopus japonicus and Cucumaria miniata, and Echinoidea, Paracentrotus lividus and Arbacia lixula, with inverted skews in only one gene (ND6; Figure 9). A) B) C)
  • 58.
  • 59.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 41 Figure 9 | | Plots of GC and AT skews at 4-fold sites of mitochondrial protein-coding genes for all Echinodermata species analyzed here. Blue represents Asteroidea species; red, Echinoidea; green, Holothuroidea; purple, Ophiuroidea; and yellow, Crinoidea.
  • 60.
    42 2. Comparison ofPhylogenetic Inferences The results of the multiple sequence alignment (MSA) comparison (using the consistency scores of the different MSA methods) performed with trimAl are shown in Table 4. The program Muscle was the MSA method most frequently chosen for the ingroup datasets, while for the outgroup datasets was the Prank. ProtTest selected MtArt as the best fit model for most of the genes, followed by CpRev, MtRev and JTT. Importantly, in half of the genes, different best-fit models were selected depending on whether the outgroup sequences were included or excluded (Table 4). All the phylogenetic trees based on the concatenated datasets show that all Echinodermata classes are monophyletic, with high bootstrap or posterior probability values (generally over 95% the branch that is leading every classes on different trees). Regarding the evolutionary relationships between classes, according to MrBayes, PhyML, PhyloBayes (except the CAT model) and NH-PhyloBayes, the clustering of (Asteroidea, Echinoidea) with Holothuroidea is strongly supported; as well as the position of Crinoidea as sister clade of ((Asteroidea, Echinoidea), Holothuroidea), and of Ophiuroidea as the first class to diverge within Echinodermata (Figure 10 A)). By the opposite, the PhyloBayes analysis under the CAT model supports the Crinoidea as the first class to diverge within Echinodermata, followed by Ophiuroidea as the second (Figure 10 B). ,
  • 61.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 43 Table 4 | Summary of general results. trimAl results and protein models were chosen using ProtTest. NA: Not Applicable. Data Best Alignment Software Best Model Ingroup Outgroup Ingroup Outgroup ATP6 Muscle Mafft MtArt +I +G +F CpRev +I +G +F ATP8 Muscle Muscle MtRev MtRev +G COX1 Prank Muscle CpRev +I +G +F CpRev +I +G +F COX2 Mafft Mafft MtRev +G +F MtRev +I +G +F COX3 Mafft Prank CpRev +G +F CpRev +I +G +F CYTB Muscle Muscle MtArt +I +G +F MtArt +I +G +F ND1 Muscle Mafft MtArt +G +F MtArt +G +F ND2 Mafft Prank MtArt +I +G +F MtArt +I +G +F ND3 Mafft Prank MtArt +G MtRev +G ND4 Prank Prank MtRev +G +F MtArt +G +F ND4L Prank Prank MtArt +G MtArt +G ND5 Muscle Mafft MtRev +I +G +F MtRev +I +G +F ND6 Prank Muscle JTT +G +F JTT +I +G +F Concatenated NA NA CpRev +I +G +F CpRev +I +G +F
  • 62.
    44 Figure 10 |Phylogenetic tree for the concatenated amino acid sequences of thirteen echinoderm mitochondrial protein-coding genes using Hemichordata as outgroup. A) Tree topology obtained with all methods/models except PhyloBayes (CAT model). B) Tree topology obtained with PhyloBayes (CAT model). In each node only support values (bootstraps or posterior probabilities) greater than 50% are presented. In A) the support values of different methods/models are presented in the following order: PhyML/PhyloBayes(CAT+GTR)/PhyloBayes(CAT+MtArt)/PhyloBayes(C20)/MrBayes, PhyloBayes(GTR), PhyloBayes(MtArt), PhyloBayes(UL3), PhyloBayes(WLSR5), NH-PhyloBayes(CAT+BP). Asterisks represent values greater than 95% (bootstrap or posterior probability). Black triangles represent the clades formed by the different classes, which are always supported by values above 95% independently of the method/model used. The individual gene trees show, as expected, less resolution than the concatenated phylogenetic trees (see Appendix-A.1). The ML analyses show in general lower branch support than MrBayes, with only the ATP6, ND1, ND2, ND3, ND4, ND4L and ND5-based trees (outgroup data set) recovering support over 50% for all the classes (see Appendix-A.1). Comparing the results from PhyML and MrBayes analyses for the outgroup dataset, most gene trees are similar between both methods, with the exception of ND4 and ND4L (and ND3 to a minor extent) gene-based trees (Figure 11). A B
  • 63.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 45 Figure 11 | Heatmap of the inferred phylogenies from the outgroup dataset. Comparison between the phylogenies obtained wih PhyML and MrBayes approaches using single gene and concatenated datasets. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology), 1 (no bipartitions are shared between trees).
  • 64.
    46 Figure 12 |Heatmap of the inferred phylogenies between the ingroup and outgroup datasets using PhyML. Comparison of the phylogenies obtained with single gene and concatenated datasets. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology), 1 (no bipartitions are shared between trees). INGROUP OUTGROUP
  • 65.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 47 Figure 13 | Heatmap of the inferred phylogenies between the ingroup and outgroup datasets using MrBayes. Comparison of the phylogenies obtained with single gene and concatenated datasets. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology), 1 (no bipartitions are shared between trees). INGROUP OUTGROUP
  • 66.
    48 Figure 14 |Dendrogram of the inferred phylogenies for the concatenated outgroup dataset using different models or approaches (based on normalized Robinson-Foulds values). Figure 15 | Dendrogram of the inferred phylogenies for the concatenated ingroup dataset using different models or approaches (based on normalized Robinson-Foulds values).
  • 67.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Results 49 Figure 16 | Heatmap of dendrograms for PhyloBayes results (ingroup vs. outgroup). Results of the comparison between all inferred trees from ingroup and outgroup datasets based on different models. The tree distance used in this matrix corresponds to the normalized Robinson-Foulds values. Legend: 0 (same topology), 1 (no bipartitions are shared between trees).
  • 68.
    50 The topologies obtainedwith the ingroup and outgroup datasets are similar, except for ND3 on PhyML (Figure 12) and for ND6 on MrBayes (Figure 13). When we compare the tree topologies obtained through the different phylogenetic analyses, we observe some differences depending on the models that were used. We also notice that the models are not clustered by their nature; i.e. homogeneous models: from MrBayes and PhyML analyses, MtArt (PhyloBayes), GTR (PhyloBayes); site-heterogeneous models with empirical matrixes: UL3, WLRS5, C20, CATMTART; site-heterogeneous models with compositional profiles inferred from the data: CAT, CAT+GTR, CAT+MtArt; time- and site-heterogeneous models: CAT+BP (Figure 14 -16).
  • 69.
  • 71.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Discussion 53 Recent advances in sequencing technologies have been providing increasing amounts of molecular data, which request efficient annotation tools to make these sequences biologically meaningful. Despite the invested effort, it is not unusual to find sequence files containing erroneous annotations of genes and/or other genomic elements. Misannotations can seriously compromise the quality of the data, mostly if obtained using an automated framework. For the work developed in this thesis, the mitogenomic data was directly retrieved from the NCBI database using the annotations described in the GenBank files, which were compared with those obtained with MITOS and DOGMA. Whereas GenBank and MITOS annotated the full set of protein-coding genes in our dataset, DOGMA failed to identify some in several species. Early studies (Perseke et al., 2008) had already pointed out failures in the annotation of Echinodermata mtDNA GenBank sequences, which were subsequently corrected. However, nowadays, some genes remain non-annotated (Table 2). Consequently, in the analysis of gene rearrangements, we had to exclude both ribosomal genes, SrRNA and LrRNA and some tRNAs (tRNA-Lys, tRNA-Glu, tRNA-Thr, tRNA-Trp, tRNA-Ser(UCN), tRNA-Ser(AGN)), because they were not identified in some Asteroidea mitogenomes (Asterias amurensis, Astropecten polyacanthus, Luidia quinalia, Patiria pectinifera and Pisaster ochraceus). Although the most reliable source of information keeps being NCBI, there is still room for improvements. For example, the MitoZoa web tool (Lupi et al., 2010; D’Onorio et al., 2012) is a public resource containing metazoan mitogenomes whose annotation is carefully revised, also allowing to build different datasets that can be later used for comparative genomic and evolutionary analyses. However, the version 1.0 (Lupi et al., 2010) has not been updated since December 2011 and the link to the version 2.0 (http://www.caspur.it/mitozoa, D’Onorio et al., 2012) is no longer accessible. One possible way to improve mtDNA annotations (and other type of annotation) would be to allow researches using mtDNA sequences to correct them, by directly curating the GenBank files through the NCBI platform. Likewise, specialized databases should be maintained for longer periods, while automated scripts should be developed to allow daily-based mtDNA sequence corrections (at least for the NCBI Organelle Genome Resources, where the reference mitochondrial genome sequences are stored). In this study, we compared Echinodermata mitogenome order using the hypothetical basal/ancestral arrangement from each class (Appendix-A.1, Figure 6). Although we could not include all genes in the analysis (as mentioned above), our results reveal some patterns within and among the Echinodermata classes. We confirmed some previously described rearrangements between species of the same class (Perseke et al., 2008, 2010): within Asteroidea, we identified inversions on tRNAs (tRNA-Phe, tRNA-Ile, tRNA-Gly, tRNA-Cys,
  • 72.
    54 tRNA-Ala, tRNA-Leu andtRNA-Asn); within Holothuroidea, transpositions on tRNAs (tRNA-Arg, tRNA-Pro, tRNA-Asn, tRNA-Leu, tRNA-Met and tRNA-Val); within Crinoidea, transpositions on tRNA-Ala, tRNA-Val and tRNA-Arg, as well as on ND4L; in Ophiuroidea, multiple rearrangements (inversions, transpositions and reverse transpositions) that confirm the rapid evolution in this class; and finally, within Echinoidea, as we expected, no rearrangements were detected. The topology of the unrooted tree obtained based on the rearrangements between classes was pairing of Holothuroidea and Echinoidea, following by Asteroidea then the Ophiuroidea and finally the Crinoidea as ancestral class (Figure 7). This topology differs from those reported before for Echinodermata also based on this kind of data (i.e. gene order) (Perseke et al., 2008; Perseke et al., 2010). These different results could be explained (at least partially) by differences in the datasets used, as previous studies mainly relied on protein-coding genes (and only a few included rRNAs and tRNAs; Smith et al., 1993; Scouras and Smith, 2004) or inferred the rearrangements in a predefined phylogenetic tree, whereas here we analyzed all genes (e.g. Perseke et al., 2008; 2010). Another explanation (not mutually exclusive), stems from the fact that while previous studies used the breakpoints distance method in CREx to identify rearrangements, here we used the common intervals approach, which is thought to detect more subtle similarities and multiple genome rearrangements when compared with the others methods (Pevzner 2000; Bernt and Middendorf, 2011). Regarding the screening for origins of replication, we identified some putative regions fulfilling the criteria we used, i.e., residing in non-coding regions and forming stem-and-loop (hairpin) structures in Asteroidea (Asterias amurensis, Patiria pectinifera, Asterias amurensis). In addition, we found other putative replication origins using more relaxed filters but these need to be more carefully and deeply inspected before drawing strong conclusions, as they do not conform to all requirements to be considered reliable origins of replication. These additional regions were found in several species: in Holothuroidea Stichopus sp. SF-2010; in Asteroidea, Astropecten polyacanthus; and in Crinoidea, Phanogenia gracilis (see Table 3). In all species of Asteroidea, the putative replication origin is 19 to 35 bp long, it is located on the plus strand, between tRNA-Phe and ND2, except in Astropecten polyacanthus, where it is located on the minus strand (see Table 3). In Holothuroidea (and Echinoidea according to Jacobs et al., 1989), the putative replication origin is on the plus strand, between tRNA-Phe and tRNA-Pro. Finally, in Crinoidea, the putative replication origin is 18bp long, located on the minus strand, between tRNA-Gly and tRNA-Tyr. These results suggesting that the origin of replication is more flexible in terms of its structure as we originally predicted or could be more than one replication origin in mostly of Echinodermata classes. About the Ophiuroidea class,
  • 73.
    Application of Siteand Time Heterogeneous Evolutionary Models to Mitochondrial Phylogenomics Discussion 55 we could not find any putative hairpin, probably of the method that we used to find origins of replication was not the most adequate for the invertebrates. As previously stated (see Chapter 1 - Introduction), each mitochondrial DNA strand has a compositional bias that seems to be related with the distance to the replication origin (Adachi and Hasegawa. 1996; Faith and Pollock, 2003). Thus, an inversion of the replication origin (control region) is expected to produce a global reversal of asymmetric mutational constraints in the mtDNA, which can later result in a complete reversal of strand compositional bias (Hassanin et al., 2005). Our analysis of compositional skews for protein-coding genes across Echinodermata revealed (Figure 3.4) that Crinoidea presents an inverted skew (except on ND1, ND2 and ND6) respect to the remaining classes, suggesting an inversion of the control region, which could have an important impact on the estimates of the phylogenetic relationships within Echinodermata. The inferred phylogenetic trees revealed that the five Echinodermata classes and the phylum itself are monophyletic, independently of the method or model used. This result is in agreement with the vast majority of previous phylogenetic studies in Echinodermata, using either morphological (Pearse and Pearse, 1994; Janies, 2001) or molecular markers (Smith et al., 1993; Pearse and Pearse, 1994; Janies, 2001; Perseke et al., 2008, 2010, 2013; Telford et al., 2014). Regarding the phylogenetic relationships between classes, all our analyses, except the PhyloBayes using the CAT model, recovered the same tree topology: (Ophiuroidea, (Crinoidea, (Holothuroidea, (Asteroidea, Echinoidea)))), also in agreement with previous works based on amino acid sequences of the 13 mitochondrial protein-coding genes (Perseke et al., 2008, 2013). However, this topology is in contrast with the Eleutherozoa (or mobile echinoderms) grouping that places Crinoidea as sister clade to the remaining echinoderm classes, which is supported by morphological comparative analyses (Pearse and Pearse, 1994; Janies, 2001) and by nuclear and mitochondrial gene sequences (Scouras and Smith, 2006; Perseke et al., 2010; Telford et al., 2014). In most of our phylogenetic reconstructions, Ophiuroidea, and not Crinoidea, is the first clade to diverge within Echinodermata, although this placement could correspond to a phylogenetic artifact due to LBA. Mitogenomes of Ophiuroidea are known to have accelerated substitution rates that lead to very long branches in phylogenetic trees, when compared to mitogenome sequences of others species from the same phylum (Perseke et al., 2008, 2010, 2013). Therefore, it is possible that the methods and models used by others and ourselves are not able to reduce the LBA effect, even though we would expect that mixture models (e.g. C20, CAT+GTR, UL3) and especially site- and time-heterogeneous models (CAT+BP) could correct it. Nonetheless, it is interesting to note
  • 74.
    56 that the Bayesiananalysis using CAT are similar with some previous results (Perseke et al., 2010, 2013). Our phylogenetic inferences are also in contrast with the two classical hypotheses of echinoderm evolution (even assuming that Crinoidea is the first clade to diverge): the Asterozoan hypothesis (that pairs Asteroidea with Ophiuroidea and Echinoidea with Holothuroidea) and the Cryptosyringid hypothesis (that places Asteroidea as the second class to diverge and Ophiuroidea as the third one). Our results, at best, place Ophiuroidea as sister taxa to the clade (Holothuroidea, (Echinoidea, Asteroidea)). Again, this is in agreement with previous phylogenetic studies using echinoderm mitogenomes (Perseke et al., 2008, 2013). However, the most recent study, based on a final concatenated alignment of 219 nuclear genes from all echinoderm classes and using a Bayesian framework (applying the site-heterogeneous mixture model CAT+GTR), strongly supports the Asterozoan hypothesis (Telford et al., 2014). The authors of this study also applied a phylogenetic signal dissection methodology (as in Pisani et al., 2012), in which the dataset was split in homogeneously and heterogeneously evolving sites. The former sites gave strong support for the Asterozoan hypothesis, whereas the latter sites, expected to be more problematic and prone to systematic errors, resulted in a different topology: (Crinoidea, (Ophiuroidea, (Asteroidea, (Holothuroidea, Echinoidea)))). Extrapolating this possibility to our dataset, we suggest that most mitochondrial genes in echinoderms can have a strong heterogeneous and misleading signal that overcomes the phylogenetic informativeness needed to recover a reliable phylogenetic tree. Clearly, the phylogenetic signal of echinoderm mitogenomes is very complex and still challenges the most advanced and statistically powerful phylogenetic methods and models.
  • 75.
    References 57 References Abascal F,Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics, 21, 2104-2105. Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. Journal of Molecular Evolution, 42, 459-468. Aguinaldo A, Turbeville J, Linford L, et al. (1997) Evidence for a clade of nematodes, arthropods and other moulting animals. Nature, 387, 489-493. Akaike H (1974) A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19, 716-723. Amemiya CT, Miyate T, Rast JP (2005) Echinoderms. Current Biology, 15, R944-R946. Anderson S, Bankier AT, Barrell BG, et al. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457-465. Asakawa S, Himeno H, Miura K, Watanabe K (1995) Nucleotide sequence and gene organization of the starfish Asterina pectinifera mitochondrial genome. Genetics, 140, 1047-1060. Asakawa S, Kumazawa Y, Araki T, et al. (1991). Strand-specific nucleotide composition bias in echinoderm and vertebrate mitochondrial genomes. Journal of Molecular Evolution, 32, 511-520. Attardi G (1985) Animal mitochondrial DNA: an extreme example of genetic economy. International Review of Cytology-a Survey of Cell Biology, 93, 93-145. Barral PJ, Cantini L, Hasmy A, Jiménez J, Marcano A (2005) Correlation between strand asymmetry and phylogeny in mitochondrial DNA. Journal of Theoretical Biology, 236, 422-426. Beletskii A, Bhagwat A (1996) Transcription-induced mutations: Increase in C to T mutations in the nontranscribed strand during transcription in Escherichia coli. Proceedings of the National Academy of Sciences USA, 93, 13919-13924. Bergsten J (2005) A review of long-branch attraction. Cladistics, 21, 163-193. Bernt M, Braband A, Middendorf M, et al. (2013a) Bioinformatics methods for the comparative analysis of metazoan mitochondrial genome sequences. Molecular Phylogenetics and Evolution, 69, 320-327.
  • 76.
    58 Bernt M, BrabandA, Schierwater B, Stadler P (2013b) Genetic aspects of mitochondrial genome evolution. Molecular Phylogenetics and Evolution, 69, 328-338. Bernt M, Donath A, Juhling F, et al. (2013c) MITOS: Improved de novo metazoan mitochondrial genome annotation. Molecular Phylogenetics and Evolution, 69, 313- 319. Bernt M, Middendorf M (2011) A method for computing an inventory of metazoan mitochondrial gene order rearrangements. BMC Bioinformatics, 12, S6. Bernt M, Merkle D, Ramsch K, et al. (2007) CREx: inferring genomic rearrangements based on common intervals. Bioinformatics, 23, 2957-2958. Bhutkar A, Gelbart W, Smith T (2007) Inferring genome-scale rearrangement phylogeny and ancestral gene order: a Drosophila case study. Genome Biology, 8, R236. Blanquart S, Lartillot N (2008) A site- and time-heterogeneous model of amino acid replacement. Molecular Biology and Evolution, 25, 842-858. Blanquart S, Lartillot N (2006) A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Molecular Biology and Evolution, 23, 2058-2071. Bogenhagen D, Clayton D (2003) The mitochondrial DNA replication bubble has not burst. Trends in Biochemical Sciences, 28, 357-360. Boore JL (2000) The duplication/random loss model for gene rearrangement exemplified by mitochondrial genomes of deuterostome animals. In: Computational biology, pp.133-147. Springer Netherlands. Boore JL (1999) Animal mitochondrial genomes. Nucleic Acids Research, 27, 1767-1780. Boore J, Brown W (1998) Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Current Opinion in Genetics and Development, 8, 668-674. Boore JL, Lavrov DV, Brown WM, (1998) Gene translocation links insects and crustaceans. Nature, 392, 667-668. Boore J, Collins T, Stanton D, Daehler L, Brown W (1995) Deducing the pattern of arthropod phytogeny from mitochondrial DNA rearrangements. Nature, 376, 163-165. Bos D, Posada D (2005) Using models of nucleotide evolution to build phylogenetic trees. Developmental and Comparative Immunology, 29, 211-227.
  • 77.
    References 59 Bottjer D,Davidson E, Peterson K, Cameron R (2006) Paleogenomics of Echinoderms. Science, 314, 956-960. Bourlat S, Rota-Stabelli O, Lanfear R, Telford M (2009) The mitochondrial genome structure of Xenoturbella bocki (phylum Xenoturbellida) is ancestral within the deuterostomes. BMC Evolutionary Biology, 9, 107. Bourlat S, Juliusdottir T, Lowe C, et al. (2006) Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature, 444, 85-88. Boussau B, Gouy M (2006) Efficient likelihood computations with nonreversible models of evolution. Systematic Biology, 55, 756-768. Boyce TM, Zwick ME, Aquadro CF (1989) Mitochondrial DNA in the bark weevils: size, structure and heteroplasmy. Genetics, 4, 825-836. Brinkmann H, Philippe H (1999) Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Molecular Biology and Evolution, 16, 817-825. Bromham L, Degnan B (1999) Hemichordates and deuterostome evolution: robust molecular phylogenetic support for a hemichordate + echinoderm clade. Evolution and Development, 1, 166-171. Brown T (2005) Replication of mitochondrial DNA occurs by strand displacement with alternative light-strand origins, not via a strand-coupled mechanism. Genes and Development, 19, 2466-2476. Brown WM (1985) The mitochondrial genome of animals. In: R. J. MacIntyre, ed. Molecular Evolutionary Genetics. Plenum Press, New York. Bryant D, Galtier N, Poursat MA (2005) Likelihood calculation in molecular phylogenetics. In: Mathematics of evolution and phylogeny, pp. 33-62. Oxford University Press. Bull JJ, Huelsenbeck JP, Cunningham CW, Swofford DL, Waddell PJ (1993) Partitioning and combining data in phylogenetic analysis. Systematic Biology, 42, 384-397. Burleigh J, Mathews S (2004) Phylogenetic signal in nucleotide data from seed plants: implications for resolving the seed plant tree of life. American Journal of Botany, 91, 1599-1613. Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach. Springer, New York.
  • 78.
    60 Cameron CB, GareyJR, Swalla BJ (2000) Evolution of the chordate body plan: New insights from phylogenetic analyses of deuterostome phyla. Proceedings of the National Academy of Sciences USA, 97, 4469-4474. Cannon JT, Kocot KM, Waits DS, et al. (2014) Phylogenomic resolution of the hemichordate and echinoderm clade. Current Biology, 23, 2827–2832. Cantatore P, Gadaleta M, Roberti M, Saccone C, Wilson A (1987) Duplication and remoulding of tRNA genes during the evolutionary rearrangement of mitochondrial genomes. Nature, 329, 853-855. Capella-Gutierrez S, Silla-Martinez J, Gabaldon T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25, 1972- 1973. Castresana J, Feldmaier-Fuchs G, Yokobori SI, Satoh N, Pääbo S (1998) The mitochondrial genome of the hemichordate Balanoglossus carnosus and the evolution of deuterostome mitochondria. Genetics, 150, 1115-1123. Chang B, Campbell D (2000) Bias in phylogenetic reconstruction of Vertebrate Rhodopsin sequences. Molecular Biology and Evolution, 17, 1220-1231. Chang J (1996) Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Mathematical Biosciences, 134, 189-215. Clayton D (1991) Replication and transcription of Vertebrate mitochondrial DNA. Annual Review of Cell and Developmental Biology, 7, 453-478. Collins TM, Fedrigo O, Naylor, GJ (2005) Choosing the best genes for the job: the case for stationary genes in genome-scale phylogenetics. Systematic Biology, 54, 493-500. Cox C, Foster P, Hirt R, Harris S, Embley T (2008) The archaebacterial origin of eukaryotes. Proceedings of the National Academy of Sciences USA, 105, 20356- 20361. Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, pp. 345-352. DC National Biomedical Research Foundation. De Giorgi C (1996) Complete sequence of the mitochondrial DNA in the sea urchin Arbacia lixula: Conserved features of the echinoid mitochondrial genome. Molecular Phylogenetics and Evolution, 5, 323-332.
  • 79.
    References 61 Delsuc F,Scally M, Madsen O, et al. (2002) Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Molecular Biology and Evolution, 19, 1656-1671. De Queiros A, Gatesy J (2007) The supermatrix approach to systematics. Trends in Ecology and Evolution, 22, 34-41. D'Onorio de Meo P, D'Antonio M, Griggio F, et al. (2011) MitoZoa 2.0: a database resource and search tools for comparative and evolutionary analyses of mitochondrial genomes in Metazoa. Nucleic Acids Research, 40, D1168-D1172. Dowton M, Austin A (1999) Evolutionary dynamics of a mitochondrial rearrangement "hot spot" in the Hymenoptera. Molecular Biology and Evolution, 16, 298-309. Dutheil J, Boussau B (2008) Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs. BMC Evolutionary Biology, 8, 255. Edgar R (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32, 1792-1797. Edgecombe G, Giribet G, Dunn C, et al. (2011) Higher-level metazoan relationships: Recent progress and remaining questions. Organisms Diversity and Evolution, 11, 151-172. Elson J, Lightowlers R (2006) Mitochondrial DNA clonality in the dock: can surveillance swing the case? Trends in Genetics, 22, 603-607. Erixon P, Svennblad B, Britton T, Oxelman B (2003) Reliability of bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Systematic Biology, 52, 665-673. Faith JJ, Pollock DD (2003) Likelihood analysis of asymmetrical mutation bias gradients in vertebrate mitochondrial genomes. Genetics, 165, 735-745. Fan S, Hu C, Wen J, Zhang L (2011) Characterization of mitochondrial genome of sea cucumber Stichopus horrens: A novel gene arrangement in Holothuroidea. Science China Life, 54, 434-441. Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783-791. Felsenstein J (1981) Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.
  • 80.
    62 Felsenstein J (1978)Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27, 401-410. Fertin G (2009) Combinatorics of genome rearrangements. MIT Press, Cambridge. Fitch W (1986) The estimate of total nucleotide substitutions from pairwise differences is biased. Philosophical Transactions of the Royal Society B: Biological Sciences, 312, 317-324. Fitch W (1971) Toward defining the course of evolution: Minimum change for a specific tree topology. Systematic Biology, 20, 406-416. Fonseca MM, Harris DJ, Posada D (2014). The inversion of the control region in three mitogenomes provides further evidence for an asymmetric model of vertebrate mtDNA replication. PLoS ONE, 9, e106654. Fonseca MM (2011) Mitochondrial DNA evolution: replication and gene overlapping. Published doctoral dissertation. University of Porto, Porto, Portugal. Fonseca MM, Posada D, Harris DJ (2008) Inverted replication of vertebrate mitochondria. Molecular Biology and Evolution, 25, 805-808. Foster P (2004) Modeling compositional heterogeneity. Systematic Biology, 53, 485-495. Foster P, Hickey D (1999) Compositional bias may affect both DNA-based and protein- based phylogenetic reconstructions. Journal of Molecular Evolution, 48, 284-290. Francino M, Ochman H (1997) Strand asymmetries in DNA evolution. Trends in Genetics, 13, 240-245. Galtier N, Lobry J (1997) Relationships between genomic G+C content, RNA secondary structures and optimal growth temperature in Prokaryotes. Journal of Molecular Evolution, 44, 632-636. Galtier N, Gouy M (1995) Inferring phylogenies from DNA sequences of unequal base compositions. Proceedings of the National Academy of Sciences USA, 92, 11317- 11321. Gee H (1996) Before the backbone. Views on the origin of the vertebrates. Chapman and Hall, London. Gesell T (2009) A phylogenetic definition of structure. Published doctoral dissertation. University of Vienna, Vienna, Austria.
  • 81.
    References 63 Gissi C,Iannelli F, Pesole G (2008) Evolution of the mitochondrial genome of Metazoa as exemplified by comparison of congeneric species. Heredity, 101, 301-320. Gowri-Shankar V, Rattray M (2007) A reversible jump method for bayesian phylogenetic inference with a nonhomogeneous substitution model. Molecular Biology and Evolution, 24, 1286-1299. Gray MW, Burger G, Lang BF (1999) Mitochondrial evolution. Science, 283, 1476–81. Groussin M, Boussau B, Gouy M (2013) A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Systematic Biology, 62, 523- 538. Gu X, Fu YX, Li WH (1995) Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Molecular Biology and Evolution, 12, 546- 557. Guindon S, Dufayard J, Lefort V, et al. (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology, 59, 307-321. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52,696-704. Guindon S, Gascuel O (2002) Efficient biased estimation of evolutionary distances when substitution rates vary across sites. Molecular Biology and Evolution, 19, 534-43. Green P, Ewing B, Miller W, Thomas P J, Green E D (2003) Transcription-associated mutational asymmetry in mammalian evolution. Nature Genetics, 33, 514-517. Grigoriev A (1998) Analyzing genomes with cumulative skew diagrams. Nucleic Acids Research, 26, 2286-2290. Hasegawa M, Hashimoto T, Adachi J, Iwabe N, Miyata T (1993) Early branchings in the evolution of eukaryotes: Ancient divergence of entamoeba that lacks mitochondria revealed by protein sequence data. Journal of Molecular Evolution, 36, 380-388. Hassanin A, Léger N, Deutsch J (2005) Evidence for multiple reversals of asymmetric mutational constraints during the evolution of the mitochondrial genome of Metazoa, and consequences for phylogenetic inferences. Systematic Biology, 54, 277-298. Herbeck JT, Degnan PH, Wernegreen JJ (2005) Nonhomogeneous model of sequence evolution indicates independent origins of primary endosymbionts within the Enterobacteriales (ᵧ -Proteobacteria). Molecular Biology and Evolution, 22, 520-532.
  • 82.
    64 Holt I, JacobsH (2003) Response: The mitochondrial DNA replication bubble has not burst. Trends in Biochemical Sciences, 28, 355-356. Holt I, Lorimer H, Jacobs H (2000) Coupled leading- and lagging-strand synthesis of mammalian mitochondrial DNA. Cell, 100, 515-524. Horner D, Pesole G (2004) Phylogenetic analyses: a brief introduction to methods and their application. Expert Review of Molecular Diagnostics, 4, 339-350. Hrdy I, Hirt R, Dolezal P, et al. (2004) Trichomonas hydrogenosomes contain the NADH dehydrogenase module of mitochondrial complex I. Nature, 432, 618-622. Huelsenbeck J, Suchard M (2007) A nonparametric method for accommodating and testing across-site rate variation. Systematic Biology, 56, 975-987. Huelsenbeck JP, Bollback JP (2001) Empirical and hierarchical Bayesian estimation of ancestral states. Systematic Biology, 50, 351-366. Huelsenbeck J, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17, 754-755. Hyman L (1955) The invertebrates. McGraw-Hill, New York. Inagaki Y, Susko E, Fast NM, Roger AJ (2004) Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1alpha phylogenies. Molecular Biology and Evolution, 21, 1340-1349. Jacobs HT, Herbert ER, Rankine J (1989) Sea urchin egg mitochondrial DNA contains a short displacement loop (D-loop) in the replication origin region. Nucleic Acids Research, 17, 8949-8965. Janies D, Voight J, Daly M (2011) Echinoderm phylogeny including Xyloplax, a progenetic asteroid. Systematic Biology, 60, 420-438. Janies D (2001) Phylogenetic relationships of extant echinoderm classes. Canadian. Journal of Zoology, 79, 1232-1250. Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of incongruence? Trends in Genetics, 22, 225-231. Jermiin LS, Jayaswal V, Ababneh F, Robinson J (2008) Phylogenetic model evaluation. In: Bioinformatics: data, sequence analysis, and evolution, pp. 65-91. Totawa: Humana Press.
  • 83.
    References 65 Jermiin L,Ho S, Ababneh F, Robinson J, Larkum A (2004) The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Systematic Biology, 53, 638-643. Jones D, Taylor W, Thornton J (1992) The rapid generation of mutation data matrices from protein sequences. Bioinformatics, 8, 275-282. Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Mammalian Protein Metabolism, pp. 21-132. Academic Press, New York. Katoh K (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30, 3059-3066. Keane TM, Creevey CJ, Pentony MM, Naughton TJ, Mclnerney JO (2006) Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evolutionary Biology, 6, 29. Kilpert F, Podsiadlowski L (2006) The complete mitochondrial genome of the common sea slater, Ligia oceanica (Crustacea, Isopoda) bears a novel gene order and unusual control region features. BMC Genomics, 7, 241. Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111-120. Kocher TD, Wilson AC (1991) Sequence evolution of mitochondrial DNA in humans and chimpanzees: control region and a protein-coding region. In: Evolution of life: Fossils, molecules and culture, pp. 391-413. Springer, Tokyo. Kolaczkowski B, Thornton J (2008) A mixed branch length model of heterotachy improves phylogenetic accuracy. Molecular Biology and Evolution, 25, 1054-1066. Kon T, Nohara M, Yamanoue Y, Fujiwara Y, Nishida M, et al. (2007). Phylogenetic position of a whale-fall lancelet (Cephalochordata) inferred from whole mitochondrial genome sequences. BMC Evolutionary Biology, 7, 127. Kondo M, Akasaka K (2012) Current status of echinoderm genome analysis - What do we know? Current Genomics, 13, 134-143. Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30, 3059-3066.
  • 84.
    66 Lake J (1994)Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proceedings of the National Academy of Sciences USA, 91, 1455-1459. Lambert C (2005) Historical introduction, overview, and reproductive biology of the protochordates. Canadian Journal of Zoology, 83, 1-7. Lambert C, Campenhout JM, DeBolle X, Depiereux E (2003) Review of common sequence alignment methods: clues to enhance reliability. Current Genomics, 4, 131-146. Lanave C, Preparata G, Sacone C, Serio G (1984) A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution, 20, 86-93. Lang BF, Gray MW, Burger G (1999) Mitochondrial genome evolution and the origin of eukaryotes. Annual Review of Genetics, 33, 351-397. Larget B, Simon D (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution, 16, 750-759. Lartillot N, Rodrigue N, Stubbs D, Richer J (2013) PhyloBayes MPI: Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Systematic Biology, 62, 611-615. Lartillot N, Brinkmann H, Philippe H (2007) Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7, S4. Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21, 1095- 1109. Lavrov D, Lang B (2005) Poriferan mtDNA and animal phylogeny based on mitochondrial gene arrangements. Systematic Biology, 54, 651-659. Lavrov DV, Brown WM (2001) Trichinella spiralis mtDNA: A nematode mitochondrial genome that encodes a putative ATP8 and normally structured tRNAs and has a gene arrangement relatable to those of coelomate metazoans. Genetics, 157, 621- 637. Lee D, Clayton D (1998) Initiation of mitochondrial DNA replication by transcription and R- loop processing. Journal of Biological Chemistry, 273, 30614-30621.
  • 85.
    References 67 Lemey P,Salemi M, Vandamme A (2009) The phylogenetic handbook. Cambridge University Press, Cambridge. Lindahl T (1993) Instability and decay of the primary structure of DNA. Nature, 362, 709- 715. Littlewood D, Smith AB, Clough KA, Emson RH (1997) The interrelationships of the echinoderm classes: morphological and molecular evidence. Biological Journal of the Linnean Society, 61, 409-438. Littlewood DTJ (1995) Echinoderm class relationships revisited. In: Echinoderm research, pp. 19-28. CRC Press. Lobry J (1996) Asymmetric substitution patterns in the two DNA strands of bacteria. Molecular Biology and Evolution, 13, 660-665. Lockhart PJ, Steel MA, Hendy MD, Penny D (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution, 11, 605-612. Lodish H (2008) Molecular cell biology. WH Freeman, New York. Loomis W, Smith D (1990) Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. Proceedings of the National Academy of Sciences USA, 87, 9093-9097. Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein evolution. Molecular Biology and Evolution, 19, 1-7. Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proceedings of the National Academy of Science USA, 102, 10557-10562. Lupi R, de Meo PD, Picardi E, et al. (2010) MitoZoa: A curated mitochondrial genome database of metazoans for comparative genomics studies. Mitochondrion, 10, 192- 199. Macey J, Larson A, Ananjeva N, Fang Z, Papenfuss T (1997) Two novel gene orders and the role of light-strand replication in rearrangement of the vertebrate mitochondrial genome. Molecular Biology and Evolution, 14, 91-104. MacIntyre R (1985) Molecular evolutionary genetics. Plenum Press, New York.
  • 86.
    68 Mallatt J, WinchellC (2007) Ribosomal RNA genes and deuterostome phylogeny revisited: More cyclostomes, elasmobranchs, reptiles, and a brittle star. Molecular Phylogenetics and Evolution, 43, 1005-1022. McBride H, Neuspiel M, Wasiak S (2006) Mitochondria: more than just a powerhouse. Current Biology, 16, R551-R560. McLachlan G, Peel D (2000) Finite mixture models. John Wiley and Sons., New York. Meade A, Pagel M (2004) A phylogenetic mixture model for detecting pattern- heterogeneity in gene sequence or character-state data. Systematic Biology, 53, 571-581. Mohr S, Freeman J, Plasterer T, Smith T (1999) Patterns of mitochondrial DNA strand asymmetry correlate with phylogeny. Biological Bulletin, 196, 411-412. Mount D (2001) Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press. Mueller RL and Boore JL (2005) Molecular mechanisms of extensive mitochondrial gene rearrangement in Plethodontid salamanders. Molecular Biology and Evolution, 22, 2104-2112. Nesnidal M, Helmkampf M, Bruchhaus I, Hausdorf B (2010) Compositional heterogeneity and phylogenomic inference of metazoan relationships. Molecular Biology and Evolution, 27, 2095-2104. Nikolaou C, Almirantis Y (2006) Deviations from Chargaff's second parity rule in organellar DNA. Gene, 381, 34-41. Nylander J, Ronquist F, Huelsenbeck J, Nieves-Aldrey J (2004) Bayesian phylogenetic analysis of combined data. Systematic Biology, 53, 47-67. Osawa S, Honjo T (1991) Evolution of life. Fossils, molecular and culture. Springer-Verlag, Tokyo. Paradis E, Claude J, Strimmer K (2004) APE: Analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289-290. Pearse VB, Pearse JS (1994) Echinoderm phylogeny and the place of concentricycloids. In: Echinoderms Through Time, pp. 121-126. CRC Press. Perna N, Kocher T (1995) Patterns of nucleotide composition at fourfold degenerate sites of animal mitochondrial genomes. Journal of Molecular Evolution, 41, 353-358.
  • 87.
    References 69 Perseke M,Golombek A, Schlegel M, Struck T (2013) The impact of mitochondrial genome analyses on the understanding of deuterostome phylogeny. Molecular Phylogenetics and Evolution, 66, 898-905. Perseke M, Bernhard D, Fritzsch G, et al. (2010) Mitochondrial genome evolution in Ophiuroidea, Echinoidea, and Holothuroidea: Insights in phylogenetic relationships of Echinodermata. Molecular Phylogenetics and Evolution, 56, 201-211. Perseke M, Fritzsch G, Ramsch K, et al. (2008) Evolution of mitochondrial gene orders in echinoderms. Molecular Phylogenetics and Evolution, 47, 855-864. Peterson K (2004) Isolation of Hox and Parahox genes in the hemichordate Ptychodera flava and the evolution of deuterostome Hox genes. Molecular Phylogenetics and Evolution, 31, 1208-1215. Pevzner P (2000) Computational molecular biology: an algorithmic approach. MIT press. Philippe H, Brinkmann H, Copley R, et al. (2011) Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature, 470, 255-258. Philippe H, Delsuc F, Brinkmann H, Lartillot N (2005) Phylogenomics. Annual Review of Ecology, Evolution and Systematics, 36, 541-562. Philippe H, Adoutte A, Coombs GH, Vickerman K, Sleigh MA (1998) The molecular phylogeny of eukaryota: solid facts and uncertainties. In: Evolutionary relationships among protozoa, pp. 25-56. Springer. Pisani D, Feuda R, Peterson KJ, Smith AB (2012) Resolving phylogenetic signal from noise when divergence is rapid: a new look at the old problem of echinoderm class relationships. Molecular Phylogenetics and Evolution, 62, 27-34. Posada D, Buckley T (2004) Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and bayesian approaches over likelihood ratio tests. Systematic Biology, 53, 793-808. Quang LS, Gascuel O, Lartillot N (2008) Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics, 24, 2317-2323. Rannala B, Zhu T, Yang Z (2011) Tail paradox, partial identifiability, and influential priors in bayesian branch length inference. Molecular Biology and Evolution, 29, 325-335. Reuter J, Mathews D (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129.
  • 88.
    70 Reyes A, GissiC, Pesole G, Saccone C (1998) Asymmetrical directional mutation pressure in the mitochondrial genome of mammals. Molecular Biology and Evolution, 15, 957-966. Ricci A (2009) Mitochondrial Genomics: structure, inheritance and phylogenetic utility of the mitochondrial genome in Bivalvia and Insecta. Published doctoral dissertation. University of Bologna, Bologna, Italy. Robinson D, Foulds L (1981) Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131-147. Rocha E, Touchon M, Feil E (2006) Similar compositional biases are caused by very different mutational effects. Genome Research, 16, 1537-1547. Rodriguez F, Oliver JL, Marin A, Medina JR (1990) The general stochastic model of nucleotide substitution. Journal of Theoretical Biology, 142, 485-501. Rodriguez-Ezpeleta N, Brinkmann H, Roure B, et al. (2007) Detecting and overcoming systematic errors in genome-scale phylogenies. Systematic Biology, 56, 389-399. Ronquist F, Teslenko M, van der Mark P, et al. (2012) MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 61, 539-542. Ruano-Rubio V, Fares M (2007) Artifactual phylogenies caused by correlated distribution of substitution rates among sites and lineages: The good, the bad, and the ugly. Systematic Biology, 56, 68-82. Saccone C, De Giorgi C, Gissi C, Pesole G, Reyes A (1999) Evolutionary genomics in Metazoa: The mitochondrial DNA as a model system. Gene, 238, 195-209. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406-425. Sankoff D, Leduc G, Antoine N, et al. (1992) Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proceedings of the National Academy of Sciences USA, 89, 6575-6579. San Mauro D, Gower DJ, Zardoya R, Wilkinson M (2006) A hotspot of gene order rearrangement by tandem duplication and random loss in the vertebrate mitochondrial genome. Molecular Biology and Evolution, 23, 227-234.
  • 89.
    References 71 Schatz G(1963) The isolation of possible mitochondrial precursor structures from aerobically grown baker’s yeast. Biochemical Biophysical Research Communications, 12, 448-451. Schierwater B, Stadler P, DeSalle R, Podsiadlowski L (2013) Mitogenomics and metazoan evolution. Molecular Phylogenetics and Evolution, 69, 311-312. Scouras A, Smith M (2006) The complete mitochondrial genomes of the sea lily Gymnocrinus richeri and the feather star Phanogenia gracilis: Signature nucleotide bias and unique nad4L gene rearrangement within crinoids. Molecular Phylogenetics and Evolution, 39, 323-334. Scouras A, Beckenbach K, Arndt A, Smith MJ (2004) Complete mitochondrial genome DNA sequence for two ophiuroids and a holothuroid: the utility of protein gene sequence and gene maps in the analyses of deep deuterostome phylogeny. Molecular Phylogenetic and Evolution, 31, 50-65. Scouras A, Smith M (2001) A novel mitochondrial gene order in the crinoid echinoderm Florometra serratissima. Molecular Biology and Evolution, 18, 61-73. Shao R, Dowton M, Murrell A, Barker SC (2003). Rates of gene rearrangement and nucleotide substitution are correlated in the mitochondrial genomes of insects. Molecular Biology and Evolution, 20, 1612-1619. Sheffield N, Song H, Cameron S, Whiting M (2009) Nonstationary evolution and compositional heterogeneity in beetle mitochondrial phylogenomics. Systematic Biology, 58, 381-394. Shen X, Tian M, Liu Z, et al. (2009) Complete mitochondrial genome of the sea cucumber Apostichopus japonicus (Echinodermata: Holothuroidea): The first representative from the subclass Aspidochirotacea with the echinoderm ground pattern. Gene, 439, 79-86. Shi J, Zhang Y, Luo H, Tang J (2010) Using jackknife to assess the quality of gene order phylogenies. BMC Bioinformatics, 11, 168. Shibutani S, Takeshita M, Grollman A (1991) Insertion of specific bases during DNA synthesis past the oxidation-damaged base 8-oxodG. Nature, 349, 431-434. Smith M, Arndt A, Gorski S, Fajber E (1993) The phylogeny of echinoderm classes based on mitochondrial gene arrangements. Journal of Molecular Evolution, 36, 545-554.
  • 90.
    72 Smith A (1992)Echinoderm phylogeny: Morphology and molecules approach accord. Trends in Ecology and Evolution, 7, 224-229. Smith AB (1984) Classification of the Echinodermata. Palaeontolgy, 27, 431-459. Soria-Carrasco V, Talavera G, Igea J, Castresana J (2007) The K tree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics, 23, 2954-2956. Steel M, Huson D, Lockhart P (2000) Invariable sites models and their use in phylogeny reconstruction. Systematic Biology, 49, 225-232. Stefankovic D, Vigoda E (2007) Pitfalls of heterogeneous processes for phylogenetic reconstruction. Systematic Biology, 56, 113-124. Strathmann RR (1988) Larvae, phylogeny and von Baer's Law. In: Echinoderm Phylogeny and Evolutionary Biology, pp. 53-68. Oxford Science Publications and Liverpool Geological Society. Sullivan J, Joyce P (2005) Model selection in phylogenetics. Annual Review of Ecology, Evolution and Systematics, 36, 445-466. Swalla B, Smith A (2008) Deciphering deuterostome phylogeny: molecular, morphological and palaeontological perspectives. Philosophical Transactions of the Royal Society B: Biological Sciences, 363, 1557-1568. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. In: Molecular systematics, pp. 407-514. Sinauer Associates. Taanman J (1999) The mitochondrial genome: structure, transcription, translation and replication. Biochimica et Biophysica Acta (BBA) - Bioenergetics, 1410, 103-123. Tamura K, Peterson D, Peterson N, et al. (2011) MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution, 28, 2731-2739. Telford MJ, Lowe CJ, Cameron CB, et al. (2014) Phylogenomic analysis of echinoderm class relationships supports Asterozoa. Proceedings of the Royal Society B: Biological Sciences, 281. Tisdall J (2001) Beginning Perl for bioinformatics. O'Reilly, Beijing. Touchon M (2004) Transcription-coupled and splicing-coupled strand asymmetries in eukaryotic genomes. Nucleic Acids Research, 32, 4969-4978.
  • 91.
    References 73 Turbeville J,Schulz JR, Raff RA (1994) Deuterostome phylogeny and the sister group of the chordates: Evidence from molecules and morphology. Molecular Biology and Evolution, 11, 648-655 Uzzell T, Corbin K (1971) Fitting discrete probability distributions to evolutionary events. Science, 172, 1089-1096. Wada H, Satoh N (1994) Details of the evolutionary history from invertebrates to vertebrates, as deduced from the sequences of 18S rDNA. Proceedings of the National Academy of Sciences USA, 91, 1801-1804. Wang H, Li K, Susko E, Roger A (2008) A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evolutionary Biology, 8, 331. Wei S, Shi M, Chen X, et al. (2010) New views on strand asymmetry in insect mitochondrial genomes. PLoS ONE, 5, e12708. Weisrock D (2012) Concordance analysis in mitogenomic phylogenetics. Molecular Phylogenetics and Evolution, 65, 194-202. Weng ML, Blazier JC, Govindu M, Jansen RK (2013) Reconstruction of the ancestral plastid genome in Geraniaceae reveals a correlation between genome rearrangements, repeats, and nucleotide substitution rates. Molecular Biology Evolution, 31, 645-659. Whelan S, Goldman N (2001) Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in Genetics, 17, 262-272. Woese C, Achenbach L, Rouviere P, Mandelco L (1991) Archaeal phylogeny: Reexamination of the phylogenetic position of Archaeoglohus fulgidus in light of certain composition-induced artifacts. Systematic and Applied Microbiology, 14, 364- 371. Wolstenholme DR (1992) Animal mitochondrial DNA: structure and evolution. International Review of Cytology, 141, 173–216 Wyman S, Jansen R, Boore J (2004) Automatic annotation of organellar genomes with DOGMA. Bioinformatics, 20, 3252-3255. Xu W, Jameson D, Tang B, Higgs PG (2006). The relationship between the rate of molecular evolution and the rate of genome rearrangement in animal mitochondrial genomes. Journal of Molecular Evolution, 63, 375-392.
  • 92.
    74 Yang Z, RannalaB (2012) Molecular phylogenetics: principles and practice. Nature Reviews Genetics, 13, 303-314. Yang Z (1996) Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology and Evolution, 11, 367-372. Zhong M, Struck T, Halanych K (2008) Phylogenetic information from three mitochondrial genomes of Terebelliformia (Annelida) worms and duplication of the methionine tRNA. Gene, 416, 11-21.
  • 93.
  • 95.
    Appendix 77 A.1 SupplementaryData Table A.1.1. Species analyzed in this study. Sequences and detailed information can be cross-referenced by GenBank accession numbers (available at http://www.ncbi.nlm.nih.gov). Phylum Class Species Accession No. Echinodermata Asteroidea Acanthaster brevispinus NC_007789 Acanthaster planci NC_007788 Asterias amurensis NC_006665 Astropecten polyacanthus NC_006666 Luidia quinalia NC_006664 Patiria pectinifera NC_001627 Pisaster ochraceus NC_004610 Echinoidea Arbacia lixula NC_001770 Echinocardium cordatum NC_013881 Paracentrotus lividus NC_001572 Strongylocentrotus droebachiensis NC_009940 Strongylocentrotus pallidus NC_009941 Strongylocentrotus purpuratus NC_001453 Holothuroidea Apostichopus japonicus NC_012616 Cucumaria miniata NC_005929 Holothuria forskali NC_013884 Parastichopus nigripunctatus NC_013432 Stichopus horrens NC_014454 Stichopus sp. SF-2010 NC_014452 Ophiuroidea Amphipholis squamata NC_013876 Astrospartus mediterraneus NC_013878 Ophiocomina nigra NC_013874 Ophiopholis aculeata NC_005334 Ophiura albida NC_010691 Ophiura lutkeni NC_005930 Crinoidea Antedon mediterranea NC_010692 Florometra serratissima NC_001878 Neogymnocrinus richeri NC_007689 Phanogenia gracilis NC_007690 Hemichordata Enteropneusta Balanoglossus carnosus NC_001887 Balanoglossus clavigerus NC_013877 Saccoglossus kowalevskii NC_007438
  • 96.
    78 Table A.1.2. Proportionof nucleotides for each mitochondrial protein-coding gene sequence and respective AT and GC skews for all species used in this study. The values were calculated with Mega v5.2 and Excel. Classe Species Gene Proportion of Nucleotides %A+T %G+C AT skew GC skew A C T G Asteroidea Patiria pectinifera COX1 0.451 0.234 0.234 0.080 68.5 31.4 0.317 -0.490 COX2 0.437 0.330 0.169 0.062 60.6 39.2 0.442 -0.680 ATP8 0.428 0.321 0.214 0.035 64.2 35.6 0.333 -0.800 ATP6 0.352 0.392 0.208 0.048 56.0 44.0 0.257 -0.780 COX3 0.315 0.273 0.295 0.114 61.0 38.7 0.033 -0.410 ND3 0.333 0.333 0.263 0.070 59.6 40.3 0.117 -0.650 ND4L 0.410 0.303 0.160 0.125 57.0 42.8 0.437 -0.410 ND4 0.347 0.347 0.187 0.118 53.4 46.5 0.300 -0.490 ND5 0.393 0.278 0.278 0.049 67.1 32.7 0.170 -0.690 ND6 0.204 0.060 0.445 0.289 64.9 34.9 -0.370 0.655 CYTB 0.468 0.255 0.223 0.052 69.1 30.7 0.353 -0.660 ND1 0.160 0.051 0.528 0.258 68.8 30.9 -0.530 0.666 ND2 0.209 0.098 0.470 0.220 67.9 31.8 -0.380 0.381 Pisaster ochraceus COX1 0.357 0.225 0.371 0.046 72.8 27.1 -0.019 -0.661 COX2 0.440 0.230 0.290 0.040 73.0 27.0 0.205 -0.700 ATP8 0.117 0.294 0.529 0.058 64.6 35.2 -0.630 -0.660 ATP6 0.389 0.176 0.371 0.061 76.0 23.7 0.023 -0.480 COX3 0.429 0.177 0.348 0.044 77.7 22.1 0.104 -0.600 ND3 0.448 0.244 0.265 0.040 71.3 28.4 0.257 -0.710 ND4L 0.319 0.340 0.340 0.000 65.9 34.0 -0.030 -1.000 ND4 0.368 0.195 0.382 0.053 75.0 24.8 -0.010 -0.570 ND5 0.418 0.220 0.324 0.036 74.2 25.6 0.126 -0.710 ND6 0.160 0.086 0.493 0.259 65.3 34.5 -0.500 0.500 CYTB 0.442 0.247 0.278 0.031 72.0 27.8 0.226 -0.770 ND1 0.250 0.041 0.500 0.208 75.0 24.9 -0.330 0.666 ND2 0.345 0.075 0.396 0.182 74.1 25.7 -0.060 0.414 Luidia quinalia COX1 0.349 0.283 0.309 0.058 65.8 34.1 0.061 -0.660 COX2 0.323 0.228 0.380 0.066 70.3 29.4 -0.081 -0.540 ATP8 0.375 0.375 0.208 0.041 58.3 41.6 0.285 -0.800 ATP6 0.322 0.282 0.290 0.104 61.2 38.6 0.052 -0.450 COX3 0.321 0.284 0.321 0.072 64.2 35.6 0.000 -0.590 ND3 0.316 0.233 0.400 0.050 71.6 28.3 -0.110 -0.640 ND4L 0.350 0.228 0.385 0.035 73.5 26.3 -0.040 -0.730 ND4 0.274 0.297 0.354 0.072 62.8 36.9 -0.120 -0.600 ND5 0.317 0.266 0.326 0.088 64.3 35.4 -0.010 -0.500 ND6 0.317 0.097 0.365 0.219 68.2 31.6 -0.070 0.384
  • 97.
    Appendix 79 CYTB 0.3140.293 0.287 0.104 60.1 39.7 0.043 -0.470 ND1 0.243 0.085 0.414 0.256 65.7 34.1 -0.250 0.500 ND2 0.222 0.087 0.432 0.257 65.4 34.4 -0.320 0.491 Asterias amurensis COX1 0.328 0.176 0.440 0.054 76.8 23.0 -0.146 -0.530 COX2 0.326 0.192 0.394 0.086 72.0 27.8 -0.094 -0.370 ATP8 0.440 0.120 0.320 0.120 76.0 24.0 0.157 0.000 ATP6 0.375 0.151 0.401 0.071 77.6 22.2 -0.030 -0.360 COX3 0.401 0.204 0.340 0.053 74.1 25.7 0.081 -0.580 ND3 0.183 0.244 0.469 0.102 65.2 34.6 -0.430 -0.410 ND4L 0.255 0.255 0.446 0.042 70.1 29.7 -0.270 -0.710 ND4 0.353 0.174 0.388 0.082 74.1 25.6 -0.040 -0.350 ND5 0.378 0.174 0.375 0.071 75.3 24.5 0.004 -0.420 ND6 0.341 0.101 0.329 0.227 67.0 32.8 0.018 0.384 CYTB 0.356 0.227 0.340 0.075 69.6 30.2 0.023 -0.500 ND1 0.314 0.111 0.388 0.185 70.2 29.6 -0.100 0.250 ND2 0.339 0.103 0.406 0.151 74.5 25.4 -0.080 0.190 Astropecten polyacanthus COX1 0.476 0.170 0.263 0.088 73.9 25.8 0.288 -0.318 COX2 0.500 0.187 0.250 0.062 75.0 24.9 0.333 -0.500 ATP8 0.423 0.230 0.269 0.076 69.2 30.6 0.222 -0.500 ATP6 0.491 0.288 0.152 0.067 64.3 35.5 0.526 -0.610 COX3 0.450 0.267 0.218 0.063 66.8 33.0 0.347 -0.610 ND3 0.353 0.246 0.246 0.153 59.9 39.9 0.179 -0.230 ND4L 0.386 0.204 0.363 0.045 74.9 24.9 0.030 -0.630 ND4 0.393 0.305 0.216 0.084 60.9 38.9 0.291 -0.560 ND5 0.406 0.210 0.323 0.060 72.9 27.0 0.114 -0.550 ND6 0.250 0.039 0.486 0.223 73.6 26.2 -0.320 0.700 CYTB 0.426 0.205 0.300 0.068 72.6 27.3 0.173 -0.500 ND1 0.251 0.058 0.541 0.148 79.2 20.6 -0.360 0.437 ND2 0.299 0.107 0.395 0.197 69.4 30.4 -0.130 0.294 Acanthaster planci COX1 0.460 0.273 0.163 0.102 62.3 37.5 0.477 -0.456 COX2 0.401 0.348 0.133 0.116 53.4 46.4 0.502 -0.500 ATP8 0.384 0.346 0.230 0.038 61.4 38.4 0.250 -0.800 ATP6 0.393 0.325 0.181 0.098 57.4 42.3 0.368 -0.530 COX3 0.384 0.363 0.160 0.090 54.4 45.3 0.410 -0.600 ND3 0.344 0.262 0.213 0.180 55.7 44.2 0.235 -0.180 ND4L 0.363 0.254 0.272 0.109 63.5 36.3 0.142 -0.400 ND4 0.324 0.354 0.222 0.098 54.6 45.2 0.186 -0.560 ND5 0.404 0.339 0.170 0.085 57.4 42.4 0.405 -0.590 ND6 0.138 0.128 0.356 0.376 49.4 50.4 -0.440 0.490 CYTB 0.333 0.378 0.181 0.106 51.4 48.4 0.294 -0.560 ND1 0.169 0.101 0.401 0.327 57.0 42.8 -0.400 0.526
  • 98.
    80 ND2 0.129 0.1530.408 0.307 53.7 46.0 -0.510 0.333 Acanthaster brevispinus COX1 0.460 0.276 0.179 0.083 63.9 35.9 0.440 -0.538 COX2 0.495 0.327 0.123 0.053 61.8 38.0 0.602 -0.720 ATP8 0.440 0.360 0.200 0.000 64.0 36.0 0.375 -1.000 ATP6 0.443 0.368 0.135 0.052 57.8 42.0 0.532 -0.750 COX3 0.386 0.351 0.179 0.082 56.5 43.3 0.365 -0.610 ND3 0.406 0.312 0.140 0.140 54.6 45.2 0.485 -0.370 ND4L 0.490 0.264 0.169 0.075 65.9 33.9 0.485 -0.550 ND4 0.335 0.371 0.209 0.083 54.4 45.4 0.231 -0.630 ND5 0.403 0.342 0.158 0.095 56.1 43.7 0.435 -0.560 ND6 0.150 0.130 0.370 0.350 52.0 48.0 -0.420 0.458 CYTB 0.370 0.330 0.215 0.085 58.5 41.5 0.264 -0.590 ND1 0.173 0.104 0.404 0.317 57.7 42.1 -0.400 0.506 ND2 0.133 0.137 0.428 0.300 56.1 43.7 -0.520 0.370 Echinoidea Strongylocentrotus droebachiensis COX1 0.332 0.232 0.317 0.118 64.9 35.0 0.023 -0.326 COX2 0.336 0.267 0.302 0.095 63.8 36.2 0.054 -0.476 ATP8 0.280 0.360 0.280 0.080 56.0 44.0 0.000 -0.636 ATP6 0.402 0.205 0.303 0.090 70.5 29.5 0.140 -0.389 COX3 0.419 0.265 0.221 0.096 64.0 36.1 0.310 -0.469 ND3 0.377 0.295 0.230 0.098 60.7 39.3 0.243 -0.500 ND4L 0.389 0.241 0.222 0.148 61.1 38.9 0.273 -0.238 ND4 0.401 0.227 0.264 0.108 66.5 33.5 0.207 -0.356 ND5 0.354 0.268 0.254 0.124 60.8 39.2 0.166 -0.368 ND6 0.273 0.409 0.242 0.076 51.5 48.5 0.059 -0.688 CYTB 0.412 0.263 0.232 0.093 64.4 35.6 0.280 -0.478 ND1 0.339 0.241 0.259 0.161 59.8 40.2 0.135 -0.200 ND2 0.295 0.258 0.321 0.126 61.6 38.4 -0.043 -0.342 Strongylocentrotus pallidus COX1 0.348 0.249 0.293 0.110 64.1 35.9 0.086 -0.388 COX2 0.319 0.233 0.319 0.129 63.8 36.2 0.000 -0.286 ATP8 0.292 0.333 0.292 0.083 58.4 41.6 0.000 -0.600 ATP6 0.402 0.205 0.308 0.085 71.0 29.0 0.133 -0.412 COX3 0.438 0.277 0.204 0.080 64.2 35.7 0.364 -0.551 ND3 0.333 0.270 0.286 0.111 61.9 38.1 0.077 -0.417 ND4L 0.444 0.241 0.222 0.093 66.6 33.4 0.333 -0.444 ND4 0.386 0.246 0.250 0.117 63.6 36.3 0.214 -0.354 ND5 0.348 0.262 0.256 0.134 60.4 39.6 0.151 -0.324 ND6 0.299 0.403 0.224 0.075 52.3 47.8 0.143 -0.688 CYTB 0.373 0.290 0.218 0.119 59.1 40.9 0.263 -0.418 ND1 0.310 0.263 0.263 0.164 57.3 42.7 0.082 -0.233 ND2 0.302 0.265 0.323 0.111 62.5 37.6 -0.034 -0.408
  • 99.
    Appendix 81 Echinocardium cordatumCOX1 0.387 0.204 0.247 0.161 63.4 36.5 0.220 -0.118 COX2 0.324 0.204 0.259 0.213 58.3 41.7 0.111 0.022 ATP8 0.500 0.154 0.192 0.154 69.2 30.8 0.444 0.000 ATP6 0.407 0.179 0.293 0.122 70.0 30.1 0.163 -0.189 COX3 0.457 0.250 0.186 0.107 64.3 35.7 0.422 -0.400 ND3 0.383 0.250 0.217 0.150 60.0 40.0 0.278 -0.250 ND4L 0.382 0.218 0.255 0.145 63.7 36.3 0.200 -0.200 ND4 0.318 0.271 0.271 0.141 58.9 41.2 0.080 -0.314 ND5 0.402 0.261 0.212 0.125 61.4 38.6 0.309 -0.353 ND6 0.258 0.439 0.227 0.076 48.5 51.5 0.063 -0.706 CYTB 0.332 0.284 0.200 0.184 53.2 46.8 0.248 -0.213 ND1 0.276 0.310 0.247 0.167 52.3 47.7 0.055 -0.301 ND2 0.372 0.251 0.230 0.147 60.2 39.8 0.235 -0.263 Strongylocentrotus purpuratus COX1 0.350 0.243 0.306 0.099 65.6 34.2 0.067 -0.421 COX2 0.321 0.277 0.321 0.080 64.2 35.7 0.000 -0.550 ATP8 0.285 0.357 0.285 0.071 57.0 42.8 0.000 -0.660 ATP6 0.418 0.237 0.278 0.065 69.6 30.2 0.200 -0.560 COX3 0.413 0.278 0.240 0.067 65.3 34.5 0.264 -0.600 ND3 0.281 0.328 0.218 0.171 49.9 49.9 0.125 -0.310 ND4L 0.425 0.259 0.222 0.092 64.7 35.1 0.314 -0.470 ND4 0.401 0.258 0.227 0.111 62.8 36.9 0.276 -0.390 ND5 0.395 0.261 0.228 0.114 62.3 37.5 0.267 -0.390 ND6 0.296 0.175 0.373 0.153 66.9 32.8 -0.110 -0.060 CYTB 0.354 0.322 0.201 0.121 55.5 44.3 0.276 -0.450 ND1 0.362 0.228 0.269 0.140 63.1 36.8 0.148 -0.230 ND2 0.298 0.250 0.326 0.125 62.4 37.5 -0.040 -0.330 Paracentrotus lividus COX1 0.516 0.163 0.245 0.074 76.1 23.7 0.356 -0.376 COX2 0.413 0.250 0.293 0.043 70.6 29.3 0.170 -0.700 ATP8 0.550 0.050 0.350 0.050 90.0 10.0 0.222 0.000 ATP6 0.379 0.255 0.279 0.085 65.8 34.0 0.152 -0.500 COX3 0.485 0.235 0.227 0.051 71.2 28.6 0.360 -0.640 ND3 0.508 0.196 0.180 0.114 68.8 31.0 0.476 -0.260 ND4L 0.471 0.245 0.226 0.056 69.7 30.1 0.351 -0.620 ND4 0.421 0.238 0.214 0.125 63.5 36.3 0.325 -0.310 ND5 0.427 0.235 0.229 0.107 65.6 34.2 0.301 -0.370 ND6 0.229 0.149 0.448 0.172 67.7 32.1 -0.320 0.071 CYTB 0.446 0.248 0.177 0.126 62.3 37.4 0.430 -0.320 ND1 0.423 0.237 0.242 0.096 66.5 33.3 0.271 -0.420 ND2 0.417 0.241 0.236 0.104 65.3 34.5 0.277 -0.390 Arbacia lixula COX1 0.401 0.208 0.323 0.066 72.4 27.4 0.108 -0.518
  • 100.
    82 COX2 0.366 0.1690.410 0.053 77.6 22.2 -0.057 -0.520 ATP8 0.375 0.125 0.416 0.083 79.1 20.8 -0.050 -0.200 ATP6 0.385 0.188 0.303 0.122 68.8 31.0 0.119 -0.210 COX3 0.367 0.191 0.360 0.080 72.7 27.1 0.010 -0.400 ND3 0.345 0.254 0.290 0.109 63.5 36.3 0.085 -0.400 ND4L 0.511 0.155 0.288 0.044 79.9 19.9 0.277 -0.550 ND4 0.347 0.177 0.421 0.053 76.8 23.0 -0.090 -0.530 ND5 0.384 0.212 0.333 0.069 71.7 28.1 0.071 -0.500 ND6 0.333 0.114 0.390 0.160 72.3 27.4 -0.070 0.166 CYTB 0.364 0.187 0.348 0.098 71.2 28.5 0.021 -0.300 ND1 0.333 0.255 0.345 0.065 67.8 32.0 -0.010 -0.590 ND2 0.372 0.194 0.338 0.094 71.0 28.8 0.046 -0.340 Holothuroidea Apostichopus japonicus COX1 0.425 0.157 0.321 0.096 74.6 25.3 0.139 -0.239 COX2 0.396 0.198 0.245 0.160 64.1 35.8 0.235 -0.105 ATP8 0.563 0.188 0.188 0.063 75.1 25.1 0.500 -0.500 ATP6 0.413 0.165 0.314 0.107 72.7 27.2 0.136 -0.212 COX3 0.515 0.221 0.132 0.132 64.7 35.3 0.591 -0.250 ND3 0.429 0.143 0.286 0.143 71.5 28.6 0.200 0.000 ND4L 0.345 0.182 0.273 0.200 61.8 38.2 0.118 0.048 ND4 0.426 0.174 0.248 0.153 67.4 32.7 0.264 -0.063 ND5 0.004 0.002 0.002 0.002 0.60 0.40 0.264 -0.063 ND6 0.357 0.411 0.161 0.071 51.8 48.2 0.379 -0.704 CYTB 0.451 0.113 0.308 0.128 75.9 24.1 0.189 0.064 ND1 0.355 0.122 0.366 0.157 72.1 27.9 -0.016 0.125 ND2 0.415 0.199 0.239 0.148 65.4 34.7 0.270 -0.148 Parastichopus nigripunctatus COX1 0.414 0.161 0.321 0.104 73.5 26.5 0.126 -0.216 COX2 0.390 0.219 0.229 0.162 61.9 38.1 0.262 -0.150 ATP8 0.563 0.188 0.188 0.063 75.1 25.1 0.500 -0.500 ATP6 0.393 0.188 0.291 0.128 68.4 31.6 0.150 -0.189 COX3 0.489 0.222 0.148 0.141 63.7 36.3 0.535 -0.224 ND3 0.411 0.125 0.304 0.161 71.5 28.6 0.150 0.125 ND4L 0.389 0.185 0.259 0.167 64.8 35.2 0.200 -0.053 ND4 0.438 0.161 0.260 0.140 69.8 30.1 0.254 -0.068 ND5 0.362 0.146 0.327 0.165 68.9 31.1 0.051 0.061 ND6 0.304 0.429 0.179 0.089 48.3 51.8 0.259 -0.655 CYTB 0.434 0.112 0.306 0.148 74.0 26.0 0.172 0.137 ND1 0.343 0.116 0.343 0.198 68.6 31.4 0.000 0.259 ND2 0.412 0.175 0.254 0.158 66.6 33.3 0.237 -0.051 Holothuria forskali COX1 0.437 0.146 0.366 0.052 80.3 19.8 0.088 -0.472 COX2 0.367 0.257 0.321 0.055 68.8 31.2 0.067 -0.647 ATP8 0.294 0.294 0.353 0.059 64.7 35.3 -0.091 -0.667
  • 101.
    Appendix 83 ATP6 0.4810.194 0.279 0.047 76.0 24.1 0.265 -0.613 COX3 0.475 0.206 0.262 0.057 73.7 26.3 0.288 -0.568 ND3 0.320 0.240 0.360 0.080 68.0 32.0 -0.059 -0.500 ND4L 0.306 0.265 0.347 0.082 65.3 34.7 -0.063 -0.529 ND4 0.434 0.200 0.294 0.072 72.8 27.2 0.193 -0.469 ND5 0.463 0.210 0.246 0.081 70.9 29.1 0.306 -0.444 ND6 0.345 0.400 0.182 0.073 52.7 47.3 0.310 -0.692 CYTB 0.488 0.192 0.276 0.044 76.4 23.6 0.277 -0.625 ND1 0.411 0.194 0.300 0.094 71.1 28.8 0.156 -0.346 ND2 0.447 0.207 0.261 0.085 70.8 29.2 0.263 -0.418 Stichopus sp. SF-2010 COX1 0.409 0.189 0.317 0.085 72.6 27.4 0.127 -0.377 COX2 0.413 0.294 0.257 0.037 67.0 33.1 0.233 -0.778 ATP8 0.478 0.217 0.304 0.000 78.2 21.7 0.222 -1.000 ATP6 0.472 0.232 0.248 0.048 72.0 28.0 0.311 -0.657 COX3 0.478 0.216 0.239 0.067 71.7 28.3 0.333 -0.526 ND3 0.304 0.411 0.232 0.054 53.6 46.5 0.133 -0.769 ND4L 0.327 0.273 0.382 0.018 70.9 29.1 -0.077 -0.875 ND4 0.279 0.265 0.347 0.110 62.6 37.5 -0.109 -0.415 ND5 0.461 0.272 0.168 0.099 62.9 37.1 0.467 -0.468 ND6 0.281 0.439 0.211 0.070 49.2 50.9 0.143 -0.724 CYTB 0.466 0.279 0.172 0.083 63.8 36.2 0.462 -0.541 ND1 0.389 0.232 0.243 0.135 63.2 36.7 0.231 -0.265 ND2 0.371 0.299 0.237 0.093 60.8 39.2 0.220 -0.526 Cucumaria miniata COX1 0.530 0.250 0.174 0.043 70.4 29.3 0.506 -0.706 COX2 0.359 0.300 0.252 0.087 61.1 38.7 0.175 -0.550 ATP8 0.263 0.368 0.315 0.052 57.8 42.0 -0.090 -0.750 ATP6 0.500 0.294 0.196 0.008 69.6 30.2 0.435 -0.940 COX3 0.500 0.250 0.212 0.037 71.2 28.7 0.404 -0.730 ND3 0.423 0.250 0.269 0.057 69.2 30.7 0.222 -0.620 ND4L 0.351 0.296 0.277 0.074 62.8 37.0 0.117 -0.600 ND4 0.524 0.262 0.140 0.072 66.4 33.4 0.576 -0.560 ND5 0.455 0.278 0.214 0.051 66.9 32.9 0.360 -0.690 ND6 0.160 0.074 0.629 0.135 78.9 20.9 -0.590 0.294 CYTB 0.525 0.237 0.185 0.051 71.0 28.8 0.478 -0.640 ND1 0.508 0.289 0.138 0.063 64.6 35.2 0.571 -0.630 ND2 0.565 0.232 0.148 0.053 71.3 28.5 0.583 -0.620 Stichopus horrens COX1 0.413 0.189 0.317 0.082 73.0 27.1 0.132 -0.395 COX2 0.404 0.257 0.294 0.046 69.8 30.3 0.158 -0.697 ATP8 0.455 0.273 0.273 0.000 72.8 27.3 0.250 -1.000 ATP6 0.463 0.231 0.264 0.041 72.7 27.2 0.273 -0.697 COX3 0.466 0.237 0.221 0.076 68.7 31.3 0.356 -0.512
  • 102.
    84 ND3 0.304 0.4460.196 0.054 50.0 50.0 0.214 -0.786 ND4L 0.333 0.315 0.333 0.019 66.6 33.4 0.000 -0.889 ND4 0.309 0.290 0.286 0.115 59.5 40.5 0.039 -0.432 ND5 0.463 0.290 0.143 0.104 60.6 39.4 0.528 -0.473 ND6 0.305 0.424 0.203 0.068 50.8 49.2 0.200 -0.724 CYTB 0.454 0.278 0.185 0.083 63.9 36.1 0.420 -0.541 ND1 0.385 0.187 0.280 0.148 66.5 33.5 0.157 -0.115 ND2 0.351 0.320 0.211 0.119 56.2 43.9 0.248 -0.459 Ophiuroidea Ophiura lutkeni COX1 0.444 0.186 0.254 0.114 69.8 30.0 0.272 -0.240 COX2 0.387 0.215 0.322 0.075 70.9 29.0 0.092 -0.480 ATP8 0.631 0.105 0.263 0.000 89.4 10.5 0.411 -1.000 ATP6 0.400 0.230 0.310 0.060 71.0 29.0 0.126 -0.580 COX3 0.425 0.261 0.246 0.067 67.1 32.8 0.266 -0.590 ND3 0.511 0.222 0.266 0.000 77.7 22.2 0.314 -1.000 ND4L 0.472 0.166 0.194 0.166 66.6 33.2 0.416 0.000 ND4 0.447 0.158 0.303 0.090 75.0 24.8 0.192 -0.270 ND5 0.436 0.231 0.258 0.073 69.4 30.4 0.256 -0.510 ND6 0.215 0.050 0.468 0.265 68.3 31.5 -0.370 0.680 CYTB 0.502 0.216 0.198 0.081 70.0 29.7 0.433 -0.450 ND1 0.420 0.158 0.359 0.060 77.9 21.8 0.078 -0.440 ND2 0.422 0.187 0.281 0.107 70.3 29.4 0.200 -0.270 Ophiopholis aculeata COX1 0.510 0.212 0.241 0.035 75.1 24.7 0.358 -0.717 COX2 0.500 0.226 0.254 0.018 75.4 24.4 0.326 -0.840 ATP8 0.333 0.416 0.166 0.083 49.9 49.9 0.333 -0.660 ATP6 0.460 0.260 0.234 0.043 69.4 30.3 0.325 -0.710 COX3 0.461 0.223 0.246 0.069 70.7 29.2 0.304 -0.520 ND3 0.511 0.222 0.266 0.000 77.7 22.2 0.314 -1.000 ND4L 0.510 0.163 0.285 0.040 79.5 20.3 0.282 -0.600 ND4 0.425 0.280 0.225 0.068 65.0 34.8 0.307 -0.600 ND5 0.392 0.234 0.280 0.092 67.2 32.6 0.166 -0.430 ND6 0.168 0.060 0.626 0.144 79.4 20.4 -0.570 0.411 CYTB 0.206 0.089 0.491 0.212 69.7 30.1 -0.400 0.407 ND1 0.189 0.059 0.573 0.177 76.2 23.6 -0.500 0.500 ND2 0.223 0.094 0.547 0.135 77.0 22.9 -0.410 0.179 Ophiura albida COX1 0.385 0.096 0.431 0.088 81.6 18.4 -0.057 -0.042 COX2 0.424 0.141 0.402 0.033 82.6 17.4 0.000 -0.625 ATP8 0.647 0.059 0.294 0.000 94.1 5.90 0.375 -1.000 ATP6 0.441 0.151 0.376 0.032 81.7 18.3 0.079 -0.647 COX3 0.474 0.146 0.321 0.058 79.5 20.4 0.193 -0.429 ND3 0.391 0.152 0.391 0.065 78.2 21.7 0.000 -0.400 ND4L 0.471 0.029 0.471 0.029 94.2 5.80 0.000 0.000
  • 103.
    Appendix 85 ND4 0.4230.202 0.296 0.080 71.9 28.2 0.176 -0.433 ND5 0.511 0.173 0.244 0.071 75.5 24.4 0.353 -0.415 ND6 0.263 0.421 0.228 0.088 49.1 50.9 0.071 -0.655 CYTB 0.424 0.164 0.277 0.136 70.1 30.0 0.210 -0.094 ND1 0.390 0.221 0.260 0.130 65.0 35.1 0.200 -0.259 ND2 0.492 0.114 0.280 0.114 77.2 22.8 0.275 0.000 Ophiocomina nigra COX1 0.478 0.235 0.180 0.107 65.8 34.2 0.453 -0.376 COX2 0.377 0.292 0.264 0.066 64.1 35.8 0.176 -0.632 ATP8 0.560 0.160 0.240 0.040 80.0 20.0 0.400 -0.600 ATP6 0.443 0.252 0.165 0.139 60.8 39.1 0.457 -0.289 COX3 0.397 0.331 0.184 0.088 58.1 41.9 0.367 -0.579 ND3 0.373 0.390 0.169 0.068 54.2 45.8 0.375 -0.704 ND4L 0.435 0.326 0.152 0.087 58.7 41.3 0.481 -0.579 ND4 0.382 0.289 0.197 0.133 57.9 42.2 0.319 -0.371 ND5 0.367 0.167 0.338 0.128 70.5 29.5 0.042 -0.133 ND6 0.216 0.392 0.196 0.196 41.2 58.8 0.048 -0.333 CYTB 0.261 0.307 0.216 0.216 47.7 52.3 0.096 -0.175 ND1 0.264 0.364 0.202 0.171 46.6 53.5 0.133 -0.362 ND2 0.270 0.350 0.285 0.095 55.5 44.5 -0.026 -0.574 Amphipholis squamata COX1 0.476 0.150 0.359 0.015 83.5 16.5 0.140 -0.822 COX2 0.340 0.180 0.460 0.020 80.0 20.0 -0.150 -0.800 ATP8 0.467 0.133 0.400 0.000 86.7 13.3 0.077 -1.000 ATP6 0.419 0.162 0.390 0.029 80.9 19.1 0.035 -0.700 COX3 0.434 0.217 0.341 0.008 77.5 22.5 0.120 -0.931 ND3 0.340 0.189 0.434 0.038 77.4 22.7 -0.122 -0.667 ND4L 0.383 0.191 0.426 0.000 80.9 19.1 -0.053 -1.000 ND4 0.441 0.228 0.297 0.035 73.8 26.3 0.195 -0.736 ND5 0.388 0.157 0.409 0.045 79.7 20.2 -0.026 -0.552 ND6 0.211 0.404 0.228 0.158 43.9 56.2 -0.040 -0.438 CYTB 0.257 0.319 0.201 0.222 45.8 54.1 0.121 -0.179 ND1 0.203 0.390 0.212 0.195 41.5 58.5 -0.020 -0.333 ND2 0.212 0.271 0.381 0.136 59.3 40.7 -0.286 -0.333 Astrospartus mediterraneus COX1 0.575 0.146 0.264 0.015 83.9 16.1 0.370 -0.810 COX2 0.626 0.165 0.198 0.011 82.4 17.6 0.520 -0.875 ATP8 0.438 0.250 0.313 0.000 75.1 25.0 0.167 -1.000 ATP6 0.448 0.184 0.356 0.011 80.4 19.5 0.114 -0.882 COX3 0.546 0.143 0.286 0.025 83.2 16.8 0.313 -0.700 ND3 0.571 0.143 0.265 0.020 83.6 16.3 0.366 -0.750 ND4L 0.450 0.250 0.300 0.000 75.0 25.0 0.200 -1.000 ND4 0.656 0.103 0.226 0.015 88.2 11.8 0.488 -0.739 ND5 0.468 0.128 0.372 0.032 84.0 16.0 0.114 -0.600
  • 104.
    86 ND6 0.333 0.2960.222 0.148 55.5 44.4 0.200 -0.333 CYTB 0.269 0.313 0.201 0.216 47.0 52.9 0.143 -0.183 ND1 0.287 0.339 0.191 0.183 47.8 52.2 0.200 -0.300 ND2 0.327 0.243 0.308 0.121 63.5 36.4 0.029 -0.333 Crinoidea Florometra serratissima COX1 0.136 0.015 0.804 0.042 94.0 5.70 -0.711 0.474 COX2 0.157 0.031 0.778 0.031 93.5 6.20 -0.664 0.000 ATP8 0.076 0.000 0.769 0.153 84.5 15.3 -0.810 1.000 ATP6 0.118 0.010 0.795 0.075 91.3 8.50 -0.740 0.750 COX3 0.151 0.026 0.794 0.026 94.5 5.20 -0.670 0.000 ND3 0.142 0.023 0.785 0.047 92.7 7.00 -0.690 0.333 ND4L 0.095 0.000 0.809 0.095 90.4 9.50 -0.780 1.000 ND4 0.165 0.020 0.751 0.062 91.6 8.20 -0.630 0.500 ND5 0.198 0.019 0.669 0.112 86.7 13.1 -0.540 0.705 ND6 0.558 0.132 0.294 0.014 85.2 14.6 0.310 -0.800 CYTB 0.187 0.018 0.727 0.066 91.4 8.40 -0.580 0.571 ND1 0.681 0.050 0.246 0.021 92.7 7.10 0.468 -0.400 ND2 0.719 0.049 0.214 0.016 93.3 6.50 0.539 -0.500 Antedon mediterranea COX1 0.143 0.020 0.789 0.048 93.2 6.80 -0.692 0.412 COX2 0.177 0.000 0.781 0.042 95.8 4.20 -0.630 1.000 ATP8 0.111 0.000 0.778 0.111 88.9 11.1 -0.750 1.000 ATP6 0.106 0.043 0.809 0.043 91.5 8.60 -0.767 0.000 COX3 0.158 0.008 0.733 0.100 89.1 10.8 -0.645 0.846 ND3 0.280 0.000 0.640 0.080 92.0 8.00 -0.391 1.000 ND4L 0.057 0.000 0.857 0.086 91.4 8.60 -0.875 1.000 ND4 0.230 0.038 0.634 0.098 86.4 13.6 -0.468 0.440 ND5 0.208 0.031 0.690 0.071 89.8 10.2 -0.537 0.385 ND6 0.333 0.244 0.311 0.111 64.4 35.5 0.034 -0.375 CYTB 0.169 0.039 0.708 0.084 87.7 12.3 -0.615 0.368 ND1 0.227 0.347 0.267 0.160 49.4 50.7 -0.081 -0.368 ND2 0.268 0.225 0.366 0.141 63.4 36.6 -0.156 -0.231 Gymnocrinus richeri COX1 0.087 0.014 0.821 0.076 90.8 9.00 -0.808 0.689 COX2 0.101 0.000 0.828 0.070 92.9 7.00 -0.783 1.000 ATP8 0.045 0.045 0.727 0.181 77.2 22.6 -0.880 0.600 ATP6 0.066 0.018 0.783 0.132 84.9 15.0 -0.840 0.750 COX3 0.109 0.054 0.726 0.109 83.5 16.3 -0.730 0.333 ND3 0.150 0.037 0.754 0.056 90.4 9.30 -0.660 0.200 ND4L 0.104 0.000 0.854 0.041 95.8 4.10 -0.780 1.000 ND4 0.147 0.023 0.714 0.115 86.1 13.8 -0.650 0.666 ND5 0.149 0.034 0.635 0.180 78.4 21.4 -0.610 0.677 ND6 0.543 0.130 0.282 0.043 82.5 17.3 0.315 -0.500 CYTB 0.121 0.040 0.710 0.127 83.1 16.7 -0.700 0.517
  • 105.
    Appendix 87 ND1 0.7550.166 0.059 0.017 81.4 18.3 0.854 -0.800 ND2 0.786 0.124 0.059 0.029 84.5 15.3 0.860 -0.610 Phanogenia gracilis COX1 0.148 0.011 0.783 0.057 93.1 6.80 -0.682 0.676 COX2 0.127 0.029 0.803 0.039 93.0 6.80 -0.727 0.142 ATP8 0.187 0.000 0.750 0.062 93.7 6.20 -0.600 1.000 ATP6 0.138 0.021 0.797 0.042 93.5 6.30 -0.700 0.333 COX3 0.175 0.041 0.725 0.058 90.0 9.90 -0.610 0.166 ND3 0.139 0.000 0.790 0.069 92.9 6.90 -0.700 1.000 ND4L 0.097 0.000 0.804 0.097 90.1 9.70 -0.780 1.000 ND4 0.152 0.005 0.744 0.097 89.6 10.2 -0.660 0.894 ND5 0.241 0.016 0.677 0.064 91.8 8.00 -0.470 0.600 ND6 0.552 0.059 0.343 0.044 89.5 10.3 0.233 -0.140 CYTB 0.226 0.011 0.714 0.047 94.0 5.80 -0.510 0.600 ND1 0.712 0.090 0.181 0.015 89.3 10.5 0.593 -0.710 ND2 0.631 0.075 0.255 0.037 88.6 11.2 0.423 -0.330
  • 106.
    88 Figure A.1.3. Mitochondrialgene order for the mitogenomes of Echinoidea species. Genes and tRNAs at upper layer are transcribed on the L-strand, whereas the genes at lower layer are transcribed on the H-strand. Blue colors indicate the conservation of genome order for all the species. The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). Figure A.1.4. Mitochondrial gene order for the mitogenomes of Asteroidea species. Genes and tRNAs at upper layer are transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome order for the species relative to Asteroidea genome pattern (Scouras and Smith 2001) and white colors variation of genome order. (inversions are shown with green stars). The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). a) Genome pattern. a) Pisaster ochraceus and Luidia quinaria
  • 107.
    Appendix 89 b) Acanthasterplanci and Acanthaster brevispinus c) Patiria pectinifera d) Astropecten polyacanthus
  • 108.
    90 e) Asterias amurensis FigureA.1.5. Mitochondrial gene order for the mitogenomes of Holothuroidea species. Genes and tRNAs at upper layer are transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome order for the species relative to Holothuroidea genome pattern (Fan et al. 2011) and white colors indicate variation in transcription direction (transpositions are shown with brown stars; TDRL are shown with grey stars). The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). a) Genome pattern. a) Apostichopus japonicus, Parastichopus nigripunctatus, Holothuria forskali b) Stichopus sp. SF-2010, Stichopus horrens c) Cucumaria miniata
  • 109.
    Appendix 91 Figure A.1.6.Genome mitochondrial gene rearrangements for the Ophiuroidea species. Genes and tRNAs at upper layer are transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome order for the species relative to Ophiuroidea genome pattern (Perseke et al. 2010) and white colors indicate variation in the transcription direction (transpositions are shown with brown stars; TDRL are shown with grey star and inversions with green stars). The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). a) Genome pattern. a) Ophiopholis aculeata, Ophiocomina nigra b) Amphipholis squamata c) Cucumaria miniata
  • 110.
    92 c) Astrospartus mediterraneus d)Ophiura lutkeni, Ophiura albida Figure A.1.7. Genome mitochondrial gene rearrangements for the Crinoidea species. Genes and tRNAs at upper layer are transcribed on L-strand, whereas genes at lower layer are transcribed on H-strand. Blue colors indicate the conservation of genome order for the species relative to Crinoidea genome pattern (Perseke et al. 2008) and white colors indicate variation in gene location (transpositions are shown with brown stars; TDRL are shown with grey stars). The three-letter amino acid abbreviation is used for tRNA designation (discrimination of two types of tRNA-Leu with first-letter codon). a) Genome pattern. a) Florometra serratissima, Phanogenia gracilis
  • 111.
    Appendix 93 Figure A.1.8.Echinodermata gene trees inferred using PhyML (Maximum Likelihood). The presented trees are unrooted and the bootstraps values are shown only for nodes with support above 50%. No outgroup was used in this analysis. A: Asteroidea; E: Echinoidea; H: Holothuroidea; C: Crinoidea; O: Ophiuroidea. ATP6 COX1 COX3 b) Neogymnocrinus richeri c) Antedon mediterranea
  • 112.
    94 CYTB NAD1 NAD2 NAD4NAD4L ND5 NAD6
  • 113.
    Appendix 95 Figure A.1.9.Echinodermata gene trees obtained with PhyML using Hemichordata as outgroup (Maximum Likelihood). The presented trees are rooted and the bootstraps values are shown only for nodes with support above 50%. A: Asteroidea; E: Echinoidea; H: Holothuroidea; C: Crinoidea; O: Ophiuroidea; Out: Outgroup. ATP6 NAD1 NAD2 NAD3 NAD4 NAD4L NAD5
  • 114.
    96 Figure A.1.10. Echinodermatagene trees inferred using MrBayes (Bayesian Inference). The presented trees are unrooted and the probabilities values are shown only for nodes with support above 50%. No outgroup was used in this analysis. A: Asteroidea; E: Echinoidea; H: Holothuroidea; C: Crinoidea; O: Ophiuroidea. ATP6 ATP8 COX1 COX2 COX3 CYTB NAD1 NAD2 NAD3
  • 115.
    Appendix 97 NAD4 NAD4LNAD5 NAD6 Figure A.1.11. Echinodermata gene trees obtained with MrBayes using Hemichordata as outgroup (Bayesian Inference). The presented trees are rooted and the probabilities values are shown only for nodes with support above 50%. A: Asteroidea; E: Echinoidea; H: Holothuroidea; C: Crinoidea; O: Ophiuroidea; Out: Outgroup. ATP6 COX1 COX3 CYTB C
  • 116.
    98 NAD1 NAD2 NAD4NAD4L NAD5 NAD6
  • 117.
    Appendix 99 A.2 ShellCodes/Scripts – Documentation The documentation of the commands written in Shell used in the pipeline.  Filter the Echinodermata species from Metazoa species SYNOPSIS for i in *.gbk; do if (grep -Riq "Metazoa; Echinodermata;" $i) then (echo "Found Echinodermata in file $i"; cp $i new_$i) fi; done; DESCRIPTION This script uses the loop for to go through each file, to look for specific patterns in each file and print into a new file, if the pattern is found, in this case print only the Echinodermata genbank files.
  • 118.
    100 A.3 Perl Scripts– Documentation The documentation of the scripts written in Perl used for the pipeline.  protein_fasta.pl This script converts a genbank formatted file to fasta format and retrieves only the amino acid sequences for the protein-coding genes. SYNOPSIS #run the script that translate GBK format into FASTA format and get only the protein sequence! my $dir = shift @ARGV; #Argument to call the right directory my $dir_output = shift @ARGV; #Argument to tell the name of new directory system ("mkdir $dir_output"); #create the new directory that will be new files my @files; my @file; my @all_gbk; my @accessions; my @gene; my @protein; my $gene = ""; my $i=""; sub ReadFiles (){ #open the directory that has GBK files opendir DIR, $dir or die ("Cannot open directory ${dir} $!n"); #put all GBK files into array (@files) while (my $filename = readdir(DIR)){ if ($filename =~ /.*.gbk/){ push (@files, $filename); }
  • 119.
    Appendix 101 } #browses thearray with files, open each one and ut all the information into array (@file) foreach my $file (@files) { open (INFILE, "${dir}/${file}") or die ("Cannot open folder. $!n"); @file = <INFILE>; close INFILE; @all_gbk = (@all_gbk,@file); } return; } sub GetSequences () { my $sp = ""; my $protein = ""; my $id = ""; my $accession; my $reading_CDS = "FALSE"; my $translation_block = "FALSE"; #open the array that has all information of all files foreach my $string (@all_gbk) { #Lets organize the analysis with the order of the genbank file #Retrieve ACCESSION if ($string =~ /^ACCESSIONs+(.+)s+$/){ $accession = $1; } #Retrieve protein sequence and name of each file (gene name) #if you find on file the pattern "CDS" starts to collect protein sequence and gene name #protein sequence on GBK format are in seccion that starts with word CDS and end with " if ($string =~ /^s+CDSs+.+/){ $reading_CDS = "TRUE"; } if ($reading_CDS eq "TRUE"){ #the pattern 'gene="gene name"'-> collect only gene name if ($string =~ /^s+/gene="(.+)"s+/){
  • 120.
    102 $gene = $1; } #thepattern '/translation="'-> start collect protein sequence if ($string =~ /^s+/translation=".+s+/){ $translation_block = "TRUE"; } my $aspas = "FALSE"; if ($translation_block eq "TRUE"){ #the pattern '/translation="protein sequence"'-> collect protein sequence if ($string =~ /^s+/translation="(.+)"?s+/){ $protein = $1; #collect the protein sequence, if you find " and newline #replace all the pipes, spaces and " for nothing if ($string =~ /"n$/){ $protein =~ s/("|s)//g; $aspas = "TRUE"; } } #if you dont find a pattern " -> collect protein sequence without spaces elsif ($string !~ /"n$/){ $string =~ s/s//g; $protein .= $string; } #collect the last line of protein sequence the pattern is " and newline elsif ($string =~ /"n$/){ #replace all the pipes, spaces and " for nothing and collect the protein sequence $string =~ s/("|s)//g; $protein .= $string;
  • 121.
    Appendix 103 $aspas ="TRUE"; } if ($aspas eq "TRUE"){ #collect all information (accession number, gene name and protein sequence #and put into array push (@accessions, $accession); push (@gene, $gene); push (@protein, $protein); $translation_block = "FALSE"; $reading_CDS = "FALSE"; $aspas = "FALSE"; } } } } return; } sub WriteFasta () { #browses the gene name array to put the right names of genes on name of files foreach $i (0..$#gene){ if ($gene[$i] =~ /a6|atp6|atp 6|atpase6|atpase 6/i){ $gene[$i] = "ATP6"; } elsif ($gene[$i] =~ /a8|atp8|atp 8|atpase8|atpase 8/i){ $gene[$i] = "ATP8"; } elsif ($gene[$i] =~ /n1|nd1|nad1|nadh1/i){ $gene[$i] = "ND1"; }
  • 122.
    104 elsif ($gene[$i] =~/n2|nd2|nad2|nadh2/i){ $gene[$i] = "ND2"; } elsif ($gene[$i] =~ /n3|nd3|nad3|nadh3/i){ $gene[$i] = "ND3"; } elsif ($gene[$i] =~ /n4|nd4|nad4|nadh4/i){ if ($gene[$i] =~ /l/i){ $gene[$i] = "ND4L"; } else{ $gene[$i] = "ND4"; } } elsif ($gene[$i] =~ /n5|nd5|nad5|nadh5/i){ $gene[$i] = "ND5"; } elsif ($gene[$i] =~ /n6|nd6|nad6|nadh6/i){ $gene[$i] = "ND6"; } elsif ($gene[$i] =~ /c3|cox3|coxIII|coIII/i){ $gene[$i] = "COX3"; } elsif ($gene[$i] =~ /c2|cox2|coxII|coII/i){ $gene[$i] = "COX2"; } elsif ($gene[$i] =~ /c1|cox1|coxI|coI/i){ $gene[$i] = "COX1"; } elsif ($gene[$i] =~ /cytb|cb|cob/i){ $gene[$i] = "CYTB";
  • 123.
    Appendix 105 } my $gene_name= $gene[$i]; #get protein sequence and respective accession number for all genes and separated #for each one of them open (OUTFILE, ">>${dir_output}/${gene_name}.aa.fasta") or die ("Cannot write fasta. $!n"); print OUTFILE ">".$accessions[$i],"n"; print OUTFILE $protein[$i],"n"; close OUTFILE; } return; } ReadFiles (); GetSequences (); WriteFasta (); exit; DESCRIPTION This Perl script makes it easier and faster to convert the genbank format into fasta format for all genes and separates them (with protein sequence). The script starts by opening the directory that has all the genbank files, collects and puts the files content into an array, and then, with a foreach loop, it browses the array. In the next step, it searches and collects specific information (accession number, gene name, and the protein sequence) to save it in a new fasta file.
  • 124.
    106  dna_fasta.pl This scriptconverts a genbank formatted file to fasta format and retrieves only the DNA sequences. SYNOPSIS #run the script that translate GBK format into FASTA format and get only the DNA sequence! my $infile = $ARGV[0]; #Argument to call the file my @file; #named the new file with the suffix "_dna.fasta" my $outfile = $infile."_dna.fasta"; sub ReadFile ($){ my $infile = $_[0]; #open file to be translate and put all the information inside the file into array (@file) open (INFILE , "$infile") or die ("Cannot open infile genbank: $infile. $!n"); @file = <INFILE>; close INFILE; return; } sub WriteFasta ($) { my $outfile = shift; my $dna = ""; my $sp = ""; my $accession=""; my $dna_seq = "FALSE"; open (OUTFILE , ">$outfile") or die ("Cannot open outfile fasta: $outfile. $!n"); my @data = @file; #browses the array with foreach loop foreach my $string (@data) { #collect important information such as the accession number, the organism, and the dna sequence #if you find on file the pattern "ACCESSION NC_number" -> collect the NC_number if ($string =~ /^ACCESSIONs+(.+)s+$/){ $accession = $1; # ok } #if you find on file the pattern "ORGANISM specie names" -> collect the specie names if ($string =~ /^s+ORGANISMs+(.+)s+$/){
  • 125.
    Appendix 107 $sp =$1; # ok } #if you find on file the pattern "ORIGIN" will start to collect DNA sequence if ($string =~ /ORIGIN/) { $dna_seq = "TRUE"; } if ($dna_seq eq "TRUE"){ #the pattern "ORIGIN dnasequence" -> collect dnasequence #(DNA sequence on GBK format starts with word ORIGIN and end with '') if ($string =~ /ORIGINn+(.+)/){ $dna = $1; # ok } #if the pattern isnt '' -> delete all number and word ORIGIN and #board only the 4 nucleotides to dna sequence elsif ($string !~ //+$/){ $string =~ s/ORIGIN//; $string =~ s/[s0-9]//g; $dna .= $string; } #if the pattern is '' -> delete '' from dna sequence elsif ($string =~ //+$/){ $string =~ s/(/|s)//g; $dna_seq = "FALSE"; $dna .= $string; } } } #print in new FASTA file all the information collected print OUTFILE ">" , $accession , ' ' , "[", $sp , "]", "n"; print OUTFILE $dna; close OUTFILE;
  • 126.
    108 return; } ReadFile ($infile); WriteFasta ($outfile); exit; DESCRIPTION ThisPerl script makes it easier and faster to convert the genbank format into fasta format only with DNA sequence. The script starts by opening the directory that has all the genbank files, collects and puts the files content into an array, and then, with a foreach loop, it browses the array. In the next step, it searches and collects specific information (accession number, specie name, and the dna sequence) to save it in a new fasta file.  mafftalign.pl This script runs the program MAFFT. SYNOPSIS #run alignment program MAFFT for aminoacid sequences! my $dir = shift @ARGV;#Argument to call the right directory system ("mkdir mafft");#create new directory (mafft) for the aligned sequences my $input=""; my $output = ""; my @files; #open directory with all files to be align opendir DIR, $dir or die ("Cannot open directory ${dir} $!n"); #while there are FASTA files, put into a array (@files) while (my $filename = readdir(DIR)){ if ($filename =~ /.*.fasta/){ push (@files, $filename); } }
  • 127.
    Appendix 109 #browses thearray with foreach loop foreach $input (@files) { #named the news files with the suffix ".mafft" $output = "$input.mafft"; #run the program MAFFT with specific options system ("mafft --maxiterate 1000 --globalpair $input > $output"); #when finish the run print this message for each file print "--------------------FIM ALINHAMENTO $input-------------------n"; } close DIR; #move the new files for the new directory (mafft) system ("mv *.mafft mafft"); exit; DESCRIPTION This Perl script allows running the multiple sequence alignment program MAFFT for multiple amino acid sequence files (in fasta format). First, it opens the directory where all the files to be aligned are located. Then, it puts all these files into an array and finally it browses the array and runs the program for each file.  musclealign.pl This script starts the program MUSCLE. SYNOPSIS #run alignment program MUSCLE for aminoacid sequences! my $dir = shift @ARGV; #Argument to call the right directory system ("mkdir muscle");#Create new directory (muscle) for the aligned sequences my $input=""; my $output = ""; my @files;
  • 128.
    110 #open directory withall files to be align opendir DIR, $dir or die ("Cannot open directory ${dir} $!n"); #while there are FASTA files, put into a array (@files) while (my $filename = readdir(DIR)){ if ($filename =~ /.*.fasta/){ push (@files, $filename); } } #browses the array with foreach loop foreach $input (@files) { #named the news files with the suffix ".muscle" $output = "$input.muscle"; #run the program MUSCLE with specific options system ("muscle -in $input -out $output"); #when finish the run print this message for each file print "--------------------FIM ALINHAMENTO $input-------------------n"; } close DIR; #move the new files for the new directory (muscle) system ("mv *.muscle muscle"); exit; DESCRIPTION This Perl script allows running the alignment program MUSCLE for multiple amino acid sequence files (in fasta format). First, it opens the directory where all the files to be aligned are located. Then, it put all these files into an array and finally it browses the array and runs the program for each file.
  • 129.
    Appendix 111  prankalign.pl Thisscript starts the program PRANK. SYNOPSIS #run alignment program PRANK for aminoacid sequences! my $dir = shift @ARGV; #Argument to call the right directory system ("mkdir prank");#create new directory (prank) for the aligned sequences my $input=""; my $output = ""; my @files; #open directory with all files to be align opendir DIR, $dir or die ("Cannot open directory ${dir} $!n"); #while there are FASTA files, put into a array (@files) while (my $filename = readdir(DIR)){ if ($filename =~ /.*.fasta/){ push (@files, $filename); } } #browses the array with foreach loop foreach $input (@files) { #named the news files with the suffix ".prank" $output = "$input.prank"; #run the program PRANK with specific options system ("prank -d=$input -o=$output"); #when finish the run print this message for each file print "--------------------FIM ALINHAMENTO $input-------------------n"; } close DIR; #move the new files for the new directory (prank) system ("mv *.prank.* prank"); exit;
  • 130.
    112 DESCRIPTION This Perl scriptallows running the alignment program PRANK for multiple amino acid sequence files (in fasta format). First, it opens the directory where all the files to be aligned are located. Then, it puts all these files into an array and finally it browses the array and runs the program for each file.  BestAlignAA.pl This script analyzes the results from the alignment programs and identifies the more consistent alignment. SYNOPSIS #run program TrimAl for compare alignment programs! my $dir = shift @ARGV;#Argument to call the right directory my @files; system ("mkdir trimalCompare");#create new directory (trimalCompare) for the new files sub TrimAl { #open directory with all files to be analyze opendir DIR, $dir or die ("Cannot open directory ${dir} $!n"); while (my $filename = readdir(DIR)){ #while there are fileset on prefix name files, put into a array (@files) if ($filename =~ /fileset*/){ push (@files, $filename); } } #browses the array with foreach loop foreach my $input (@files) { my $output = $input; #run the program TrimAl with specific options system ("trimal -compareset $input -out $output.out > $output.txt");
  • 131.
    Appendix 113 } close DIR; system("mv *.out trimalCompare"); #move the new files for the new directory (trimalCompare) system ("mv *.txt trimalCompare"); #move the new files for the new directory (trimalCompare) return; } TrimAl (); exit; DESCRIPTION This Perl script compares all the resultant amino acid alignments using the different programs and find out which one is the most consistent. First, it opens the directory where all the files to be analyzed are located. Then, it puts all these files into an array and finally it browses the array and runs the trimAl program for each file.  TrimAL.pl This script collects only the protein sequence aligned with best scores. SYNOPSIS #run program TrimAl for to get only protein sequence with #best score on alignment! my $dir = shift @ARGV; my @files; #create new directory (trimalOUT) for the new files system ("mkdir trimalOUT"); sub TrimAl { opendir DIR, $dir or die ("Cannot open directory ${dir} $!n"); while (my $filename = readdir(DIR)){ #while there are .out files, put into a array (@files) if ($filename =~ /.*.out/){ push (@files, $filename);
  • 132.
    114 } } #browses the arraywith foreach loop foreach my $input (@files) { my $output = $input; #run the program TrimAl with specific options system ("trimal -in $input -out $output.nexus -nexus -gappyout"); } close DIR; #move the new files for the new directory (trimalOUT) system ("mv *.nexus trimalOUT"); return; } TrimAl (); exit; DESCRIPTION This Perl script prints in new files only the alignment of protein sequences with best scores. First, it opens the directory where all the files to be analyzed are located. Then, it puts all these files into an array and finally it browses the array and runs the trimAl program for each file.  prottest.pl This script runs the program ProtTest. SYNOPSIS #run the program Prottest for aminoacid sequences! my @files; system ("mkdir prottest"); #create new directory (prottest) for the analysed sequences sub Prottest { #open directory with all files to be analyze opendir DIR, "/home/user/Transferências/prottest-3.2-20130103" or die;
  • 133.
    Appendix 115 while (my$filename = readdir(DIR)){ #while there are NEXUS files, put into a array (@files) if ($filename =~ /.*.nexus/){ push (@files, $filename); } } #browses the array with foreach loop foreach my $input (@files) { my $output = $input; #run the program Prottest with specific options system ("java -jar prottest-3.2.jar -i $input -all-matrices -all-distributions -AICC -o $output.out -all -F -t1 -S 2 -tc 0.5 -threads 1"); #when finish the run print this message for each file print "--------------------FIM ANALISE $input-------------------n"; } close DIR; #move the new files for the new directory (prottest) system ("mv *.out prottest"); return; } Prottest (); exit; DESCRIPTION This Perl script allows running the program ProtTest for multiple amino acid sequence files (in nexus format). First, it opens the directory where all the files to be analyzed are located. Then, it puts all these files into an array and finally it browses the array and runs the program for each file.
  • 134.
    116 A.4 R Codes/Scripts– Documentation The documentation of the scripts written in R used for the pipeline.  create.r This script creates a rooted tree (with outgroup) and plots it to standard output. SYNOPSIS getwd () #read the tree file (nexus), if the tree file is newick the function read need to be change to read.tree list1 <- list.files (all.files=FALSE, full.names=FALSE, pattern = ".tre") #sort the accession number into the Echinodermata classes, add Outgroup class (Hemichordata) Asteroidea <- c("NC_004610", "NC_006665", "NC_007788", "NC_006666", "NC_001627", "NC_007789", "NC_006664") Echinoidea <- c("NC_001453", "NC_001770", "NC_009941", "NC_009940", "NC_013881", "NC_001572") Holothuroidea <- c("NC_014452", "NC_014454", "NC_005929", "NC_013884", "NC_012616", "NC_013432") Ophiuroidea <- c("NC_013874", "NC_013878", "NC_005334", "NC_005930", "NC_010691", "NC_013876") Crinoidea <- c("NC_010692", "NC_001878", "NC_007689", "NC_007690") OUTGROUP <- c("NC_013877", "NC_007438", "NC_001887") total_taxa = length(Asteroidea) + length(Echinoidea) + length(Holothuroidea) + length(Ophiuroidea) + length(Crinoidea)+ length(OUTGROUP) my_tip_color <- rep(0,total_taxa) for (i in 1:length(list1)){ tree <- paste("tree_",i," <- read.nexus("",cwd,"/",list1[i],"")",sep="") eval(parse(text= tree)) tree <- get(paste("tree_",i,sep="")) mytitle = list1[i] taxa <- tree$tip.label #attribute at each class a colour my_blue = which(taxa %in% Asteroidea) my_red = which(taxa %in% Echinoidea)
  • 135.
    Appendix 117 my_green =which(taxa %in% Holothuroidea) my_purple = which(taxa %in% Ophiuroidea) my_orange = which(taxa %in% Crinoidea) my_black = which(taxa %in% OUTGROUP) for (i in my_blue){ my_tip_color[i] = "blue" } for (i in my_red){ my_tip_color[i] = "red" } for (i in my_green){ my_tip_color[i] = "green" } for (i in my_purple){ my_tip_color[i] = "purple" } for (i in my_orange){ my_tip_color[i] = "orange" } for (i in my_black){ my_tip_color[i] = "black" } #identify the outgroup new_out <- c("NC_013877", "NC_007438", "NC_001887") mytreereroot<-root(tree, new_out) #with this function create a image of rooted tree, for unrooted tree add type=”u” and remove the outgroup plot.phylo (mytreereroot, show.node.label=TRUE, tip.color = my_tip_color, use.edge.length = TRUE, font = 2, edge.width=2, cex=0.7) mtext (mytitle, side = 3, font = 2, adj = 1, cex=1.0) }
  • 136.
    118 DESCRIPTION This R scriptmakes it easier and faster to create a rooted tree (PhyML/MrBayes trees). The script starts by opening the input file. Then it splits all Accession Numbers into classes, and attributes a color to each class. Then, with a specific R function (plot.phylo), it creates a rooted tree (phylogenetic tree).  dendrogram.r This script creates a dendrogram. SYNOPSIS getwd () data <-read.table("file.txt", sep = "t") # prepare hierarchical cluster hc = hclust(dist(data [,2])) # very simple dendrogram plot(hc,labels=data[,1], main="", xlab="") DESCRIPTION This R script allows creating a dendrogram. The script starts by opening the input (file with distance matrix). Next it applies the R functions hclust and dist performs the hierarchical clustering on distance matrix. Finally, the plot function allow us to visualize the dendrogram.