Gutell 109.ejp.2009.44.277


Theriot E.C., Cannone J.J., Gutell R.R., and Alverson A.J. (2009).
The limits of nuclear encoded SSU rDNA for resolving the diatom phylogeny.
European Journal of Phycology, 44(3):277-290.


This article was downloaded by: [University of Texas at Austin]
On: 13 August 2009
Publisher: Taylor & Francis
Journal: European Journal of Phycology
First published: August 2009

To cite this article: Theriot, Edward C., Cannone, Jamie J., Gutell, Robin R. and Alverson, Andrew J. (2009) The limits of nuclear-encoded SSU rDNA for resolving the diatom phylogeny, European Journal of Phycology, 44:3, 277-290.
DOI: 10.1080/09670260902749159
Eur. J. Phycol. (2009), 44(3): 277–290

The limits of nuclear-encoded SSU rDNA for resolving the diatom phylogeny

EDWARD C. THERIOT1, JAMIE J. CANNONE2, ROBIN R. GUTELL2 AND ANDREW J. ALVERSON3

1 Texas Natural Science Center, The University of Texas at Austin, 2400 Trinity Street, Austin, TX 78705, USA
2 Section of Integrative Biology and Center for Computational Biology and Bioinformatics, The University of Texas at Austin, 1 University Station, Austin, Texas 78712, USA
3 Department of Biology, Indiana University Bloomington, 1001 E Third Street, 142 Jordan Hall, IN 47405, USA

(Received 26 September 2008; revised 13 November 2008; accepted 25 November 2008)

A recent reclassification of diatoms based on phylogenies recovered using the nuclear-encoded small subunit ribosomal RNA (SSU rRNA) gene contains three major classes, Coscinodiscophyceae, Mediophyceae and the Bacillariophyceae (the CMB hypothesis). We evaluated this with a sequence alignment of 1336 protist and heterokont algae SSU rRNAs, which includes 673 diatoms. Sequences were aligned to maintain structural elements conserved within this dataset. Parsimony analysis rejected the CMB hypothesis, albeit weakly. Morphological data are also incongruent with this recent CMB hypothesis of three diatom clades. We also re-analysed a recently published dataset that purports to support the CMB hypothesis. Our re-analysis found that the original analysis had not converged on the true bipartition posterior probability distribution, and rejected the CMB hypothesis.
Thus we conclude that a reclassification of the evolutionary relationships of the diatoms according to the CMB hypothesis is premature.

Key words: SSU, diatom phylogeny, diatom classification, Coscinodiscophyceae, Mediophyceae, Bacillariophyceae

Introduction

Analyses of molecular data (mainly nuclear-encoded small subunit ribosomal DNA; henceforth SSU) have generally reinforced the traditional view (Simonsen, 1979; Round et al., 1990) that centric diatoms broadly grade into pennates through many nodes (Medlin et al., 1993, 1996a, b, 2000; Ehara et al., 2000; Medlin & Kaczmarska, 2004; Sorhannus, 2004, 2007; Alverson et al., 2006; Choi et al., 2008; see Alverson & Theriot, 2005 for review). However, Medlin & Kaczmarska (2004) recently proposed that centric diatoms were composed of only two clades rather than many. They retained the name Coscinodiscophyceae for the so-called 'radial centrics' and applied the name Mediophyceae for the so-called 'bipolar' or 'multipolar centrics'. They also suggested a number of morphological characters as diagnostic for these groups. We refer to this as the CMB hypothesis (for the three major clades discovered: Coscinodiscophyceae, Mediophyceae and Bacillariophyceae).

The CMB phylogenetic hypothesis has not been universally embraced. For example, Adl et al. (2005) treated both Coscinodiscophyceae and Mediophyceae as paraphyletic taxa without discussion. Williams & Kociolek (2007) challenged the robustness of the CMB phylogeny based on the fact that many different SSU analyses return different trees. In contrast, Sims et al. (2006) recovered the CMB hypothesis with high bipartition posterior probability (BPP) support. Medlin et al. (2008) recovered the CMB hypothesis with high BPP support using a secondary structure alignment but noted that several aspects of the tree were unusual (e.g. the placement of Attheya).

In fact, the topology of the diatom SSU tree (and support values for incongruent groups) has changed from study to study.
For example, the elongate Toxarium has been placed well within the centric grade amidst multipolar diatoms (very distant from the pennate diatoms) using maximum likelihood (ML) analysis on 38 diatoms (Kooistra et al., 2003), as sister to all pennates in a Bayesian analysis of 51 diatom sequences (Chepurnov et al., 2008), poorly resolved in a maximum parsimony (MP) analysis of 181 diatom sequences (Alverson et al., 2006), and once again well within the multipolar diatoms in a Bayesian analysis of 54 diatom SSU sequences (Medlin et al., 2008). As underscored by this brief comparison, the many different inferences of diatom phylogeny have used different alignment strategies and different optimality criteria, have employed those criteria in different ways, and have sampled different taxa. Any or all of these factors may have led to the novel results of Medlin & Kaczmarska (2004) and Sims et al. (2006), but this cannot be directly studied because the Medlin & Kaczmarska (2004) and Sims et al. (2006) datasets which produced the CMB hypothesis were not publicly available. However, the Medlin et al. (2008) dataset is available and we re-analyse it below. To test the effects of ingroup and outgroup sampling, we created our own large alignment of stramenopile SSU sequences, aligned according to secondary structure (Gutell et al., 1985, 1992, 2002) and used it to test the CMB hypothesis and its robustness. Specifically we address the effect (or lack thereof) of adding distantly related outgroups on inferences of the diatom SSU tree.

Correspondence to: Edward C. Theriot. E-mail: etheriot@mail.utexas.edu
ISSN 0967-0262 print/ISSN 1469-4433 online/09/030277–290 © 2009 British Phycological Society
DOI: 10.1080/09670260902749159

Materials and methods

Multiple sequence alignment

We included all 1549 SSU stramenopile sequences available in Genbank as of September 1, 2007. The SSU rDNA sequences were aligned manually with the alignment editor 'AE2' (developed by T. Macke, Scripps Research Institute, San Diego, USA; Larsen et al. 1993), which was developed for Sun Microsystems' (Santa Clara, USA) workstations running the Solaris operating system. The manual alignment process involves first positionally aligning homologous nucleotides (i.e. those that map to the same locations and tertiary structure models) into columns in the alignment, maximizing their sequence and structure similarity.
For regions with high similarity between sequences, the nucleotide sequence is sufficient to align sequences with confidence. For more variable regions in closely related sequences or when aligning more distantly related sequences, however, a high-quality alignment can only be produced when additional information (here, secondary and/or tertiary structure data) is included.

The underlying SSU rRNA secondary structure model was initially predicted with covariation analysis (Gutell et al., 1985, 1992). Approximately 98% of the predicted model base pairs were present in the high-resolution crystal structure from the 30S ribosomal subunit (Gutell et al., 2002). This model (based on the bacterium Escherichia coli) has been extended to the eukaryotic SSU rRNA (Cannone et al., 2002), using covariation analysis to assess eukaryote-specific features. The additional constraints of the eukaryotic model were used to refine the alignment of the stramenopile sequences iteratively until positional homology was established for the entire data matrix.

The initial SSU alignment contained 1549 sequences, with a final length of 3786 columns. Medlin & Kaczmarska (2004) filtered out sequences less than 50% complete and we followed this convention, resulting in a final dataset of 1336 stramenopile sequences, of which 673 are diatoms and seven are bolidophytes. Bolidophytes are considered the immediate sister group to diatoms according to both SSU and chloroplast-encoded rbcL (Daugbjerg & Andersen, 1997; Goertzen & Theriot, 2003; Andersen, 2004). The remaining taxa are more distantly related stramenopiles. The final alignment is available at TreeBASE or from the authors. Forty secondary structure model diagrams representing the major diatom lineages are also available online. We analysed the data in two datasets: diatoms plus bolidophytes only (DiatBo) and diatoms plus all stramenopiles (DiatStram).

Other datasets

We obtained the Nexus files used for Figs 2 and 3 in Medlin & Kaczmarska (2004) from Dr Medlin.
One dataset had 126 sequences and the other had 281 sequences, and we refer to them as the MK126 and MK281 datasets. Both had the same 123 diatom sequences and differed only in that the former used bolidophytes only as the outgroup and the latter sampled broadly across eukaryotes for the outgroups. We also used the Nexus file used to produce Fig. 1a of Medlin et al. (2008). This file had 54 sequences, all diatoms but no outgroup, and we refer to it as the M54 dataset.

Phylogenetic analysis

All datasets were subjected to parsimony analysis using the TNT program (Goloboff et al., 2003). The full suite of TNT options (sectorial search, ratchet, drift and tree fusion) was used. There is no standard recommendation for use of these algorithms and there are few comparative studies of them. Within the context of the ratchet, Nixon (1999) argued that for large datasets, it may be better to limit the length of searches on individual islands of trees and search more islands. The notion is that exploring a greater range of islands containing optimal trees is more likely to cover the entire diversity of optimal trees in a shorter period of time than exhaustively searching one island. Thus, we took the approach used by Goertzen & Theriot (2003) and Alverson et al. (2006) when employing these newer algorithms. We increased the number of all cycles, rounds and repetitions for sectorial, drift, fusion and ratchet searches 10-fold beyond default values, and used between 100 and 1000 random taxon additions for each run. We saved the resultant trees from each run separately, and then repeated the procedure with a new randomly selected seed number. After each run, we checked that no shorter tree was found, combined trees from all previous runs and then calculated the number of nodes collapsed in their strict consensus. If no shorter trees were found and if no additional nodes were collapsed, we concluded that we had sampled the complete representative set of MP trees, as additional trees would be redundant and unlikely to further erode the resolution of the strict consensus (Nixon, 1999; Goertzen & Theriot, 2003).

We assessed the parsimony penalty required by constraining each of the Coscinodiscophyceae, Mediophyceae and Bacillariophyceae to monophyly under searches as above. We assessed support for the unconstrained MP trees using nonparametric bootstrap (BS) analysis in TNT with the standard sampling with replacement strategy. We used the new technology search with 10 taxon additions and sectorial, ratchet, drift and tree fusing for each of the 1000 pseudoreplicates of the BS analysis.

The DiatBo, MK281, MK126 and M54 datasets were subjected to Bayesian analyses. All Bayesian analyses were run with the GTR + G + I model (nucmodel = 4by4, nst = 6, rates = invgamma). These were the settings used by Sims et al. (2006) and also corresponded to the best model for each dataset as selected by MrModelTest (Nylander, 2004). All initial runs for all datasets were done at 1 000 000 Markov chain Monte Carlo (MCMC) generations, equal to or greater than the number of generations run by Medlin & Kaczmarska (2004), Sims et al. (2006) and Medlin et al. (2008). Where these papers did not specify other settings for the Bayesian analysis, default settings were used. To test reproducibility of the results, we ran three separate analyses, each with two runs, for a total of six independent runs of 1 000 000 MCMC generations each. We also ran one analysis of the DiatBo dataset with two runs (four chains, three heated, one cold) for 10 million generations, saving every 10 000th tree. Finally, we ran the M54 dataset for 50 million generations, saving every 10 000th tree.
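As a rough illustration of what the sampling settings above imply, the following sketch counts how many trees one run contributes before and after burn-in. The function name `mcmc_samples` is hypothetical, and the calculation assumes the starting (generation 0) sample is not counted; it is arithmetic on the stated settings, not a reproduction of the software's own bookkeeping.

```python
def mcmc_samples(generations, sample_every, burnin_percent):
    """Return (total, discarded, retained) tree samples for one MCMC run.

    Assumption: the starting (generation 0) sample is ignored, so a run of
    10 000 000 generations sampled every 10 000th generation yields 1000 trees.
    """
    if generations % sample_every != 0:
        raise ValueError("generations must be a multiple of sample_every")
    if not 0 <= burnin_percent <= 100:
        raise ValueError("burnin_percent must lie between 0 and 100")
    total = generations // sample_every
    discarded = total * burnin_percent // 100  # integer arithmetic, no rounding error
    return total, discarded, total - discarded


# DiatBo: 10 million generations, every 10 000th tree saved, 90% burn-in
diatbo = mcmc_samples(10_000_000, 10_000, 90)

# M54: 50 million generations, every 10 000th tree saved, 90% burn-in
m54 = mcmc_samples(50_000_000, 10_000, 90)

print("DiatBo (total, discarded, retained):", diatbo)
print("M54    (total, discarded, retained):", m54)
```

Under these assumptions a 90% burn-in leaves only one tenth of the sampled trees per run for estimating bipartition posterior probabilities, which is worth keeping in mind when reading the convergence plots.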
We assessed whether independent runs in all analyses had sampled the same posterior distribution by comparing independent run (split) posterior probabilities with the compare command in the AWTY program (Wilgenbusch et al., 2004). We followed the burn-in periods of Medlin & Kaczmarska (2004) and Medlin et al. (2008) for their datasets when we ran 1 000 000 generations on M54, MK126 and MK281. We used a burn-in of 90% for our DiatBo dataset 1 000 000 and 10 000 000 generation analyses to approximate Sims et al. (2006).

Morphology

We coded the characters of symmetry, presence or absence of mucilaginous matrix, auxospore shape/growth, properizonium and perizonium presence/absence for 34 taxa using Medlin & Kaczmarska's criteria (2004: 258, table 2), and treated all multistate characters as unordered. Since no outgroup or ontogenetic information was provided, the only rooting option was to consider the Coscinodiscophyceae as the outgroup to the remaining diatoms and test for monophyly of the Mediophyceae and Bacillariophyceae. However, it is possible to determine if the Coscinodiscophyceae formed a convex group (possibly monophyletic depending on the placement of the root within the unrooted network). Winclada running NONA was used for parsimony analysis, with 10 000 replications, holding 100 starting trees per repetition, and all other parameters set to defaults.

Results

Parsimony analysis

For the DiatStram dataset, 11 runs totalling 2898 random taxon addition repetitions were required to converge on the representative set of MP trees (length (L): 39822; consistency index (c.i.): 0.12; retention index (r.i.): 0.84). We found 22 unique MP trees on the first run. Their strict consensus collapsed 441 nodes. Eight more runs produced 54 more MP trees for a total of 76 trees. However, 11 of these were duplicates and there were only 65 unique MP trees. Their strict consensus collapsed 450 nodes, or only nine more than collapsed in the first single run.
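The stopping rule used for the parsimony searches (pool the trees from successive runs, discard duplicates, and watch whether the strict consensus loses further nodes) can be sketched on a toy example. This is a minimal illustration, not the TNT workflow: trees are represented here as sets of clades on five hypothetical taxa, and the helper names `clade`, `collapsed_nodes` and `pool_runs` are assumptions made for the example.

```python
def clade(*taxa):
    """A clade is the set of terminal taxa it contains."""
    return frozenset(taxa)


def collapsed_nodes(trees, n_taxa):
    """Nodes lost in the strict consensus of rooted, fully resolved trees.

    A resolved rooted tree on n taxa has n - 2 non-trivial clades; the strict
    consensus keeps only the clades present in every tree.
    """
    shared = frozenset.intersection(*trees)
    return (n_taxa - 2) - len(shared)


def pool_runs(runs, n_taxa):
    """For each run return (trees found, new unique trees, collapsed nodes so far)."""
    pooled = set()
    history = []
    for run in runs:
        before = len(pooled)
        pooled.update(run)  # duplicates of earlier trees are dropped here
        history.append((len(run), len(pooled) - before,
                        collapsed_nodes(pooled, n_taxa)))
    return history


# Toy most-parsimonious trees on taxa A-E (each listed by its non-trivial clades)
t1 = frozenset({clade("A", "B"), clade("D", "E"), clade("C", "D", "E")})
t2 = frozenset({clade("A", "B"), clade("C", "D"), clade("C", "D", "E")})
t4 = frozenset({clade("A", "B"), clade("A", "B", "C"), clade("D", "E")})

runs = [[t1, t2], [t1, t4], [t2, t4]]
history = pool_runs(runs, n_taxa=5)
for i, (found, new, collapsed) in enumerate(history, start=1):
    print(f"run {i}: {found} trees, {new} new, {collapsed} nodes collapsed")

# Stop once a run adds no new collapsed nodes relative to the previous pool
converged = history[-1][2] == history[-2][2]
print("converged:", converged)
```

In the toy data the third run returns only trees already in the pool and collapses no further nodes, which is the pattern the text reports for the real searches (76 trees found, 11 duplicates, 65 unique).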
That we found redundant trees and the reduced yield in topological diversity suggest that we have found the true diversity of all MP trees that could be obtained from the DiatStram dataset (Fig. 1).

For the DiatBo dataset, three runs of 500 random addition sequences seemed to converge on the representative set of MP trees. The strict consensus of the first 139 trees of L: 14094 (c.i.: 0.19, r.i.: 0.84) collapsed 287 nodes, that of the 338 trees of the first and second runs collapsed 288 nodes, and that of the 554 trees of all three runs combined collapsed 288 nodes. In each of the runs, an MP tree was found within the first 18 random additions, indicating that TNT was finding at least one tree of optimal topology very early in the analysis. In addition, 51 of the 554 total trees were identical to trees previously found, indicating that there was some redundancy in the coverage of tree space. Thus, we believe Fig. 2 well represents the strict consensus of all equally MP cladograms that might be found in the DiatBo dataset.

Unconstrained searches in both analyses resulted in non-monophyly for the classes Coscinodiscophyceae and Mediophyceae, and monophyly for the class Bacillariophyceae. In both, the Coscinodiscophyceae was positively paraphyletic (i.e. fully resolved as a ladder-like grade with no polytomies) with the Melosirales sister to a non-monophyletic Mediophyceae plus a monophyletic Bacillariophyceae. The Mediophyceae was positively paraphyletic in the DiatStram analysis, with Chaetoceros and a few other taxa forming a clade sister to the pennates. In the DiatBo analyses the Mediophyceae formed an unresolved polytomy.

Fig. 1. Strict consensus of 65 unique equally most parsimonious trees calculated from the stramenopile-outgroup and diatom-ingroup analysis (DiatStram dataset). Only relationships among diatoms are shown.

Fig. 2. Strict consensus of 503 unique equally most parsimonious trees calculated from the diatom plus bolidophyte (DiatBo) dataset. Only relationships among diatoms are shown.

Monophyly of the Coscinodiscophyceae and Mediophyceae (i.e. the CMB hypothesis) required little penalty for either large dataset: the CMB hypothesis was only seven steps longer than the unconstrained MP trees for the DiatStram dataset and 10 steps longer for the DiatBo dataset. Arrangements of terminal taxa were similar in the results for both datasets, and only the tree for the DiatBo dataset is shown (Fig. 3).

Given the relatively low penalty incurred for transforming any optimal tree into the CMB hypothesis, it is not surprising that the BS values along the backbone of the tree were generally quite low. The Bacillariophyceae clade and the Mediophyceae plus Bacillariophyceae clade were the only two backbone nodes to receive BS support values of 90% for either dataset.

Parsimony analysis of M54, MK126 and MK281 datasets yielded similar results (trees not shown). The MP tree or trees rejected the CMB hypothesis, and bootstrap values along the backbone were typically less than 50%. The CMB constraint trees were not much longer than the MP trees: four steps longer for the M54 dataset (4530 vs. 4526), seven steps longer for the MK126 dataset (5633 vs. 5626), and 23 steps longer for the MK281 dataset (19 325 vs. 19 302).

Bayesian analysis

Analyses of 1 000 000 generations had clearly not converged on the same posterior distributions among independent runs in analyses of either the DiatBo or M54 datasets. Plots of bipartition posterior probability values between the first pair of runs for each of the two datasets showed many points (i.e. bipartitions) falling directly along the abscissa and ordinate, indicating that some clades found in one analysis (even at BPP values 0.8) were not found in all others (Fig. 4). Furthermore, convergence was not reached with 10 000 000 MCMC generations for the DiatBo dataset or within 20 000 000 MCMC generations for the M54 dataset (Fig. 5). For the DiatBo dataset, topological differences between runs were not minor.
In the 10 000 000 generation analysis of DiatBo, Toxarium, Lampriscus, Biddulphiopsis and the Cymatosirales (Toxarium and allies) grouped with Lithodesmiales plus Thalassiosirales (BPP = 0.90) in one run, whereas they (Toxarium and allies) were sister to pennates (BPP = 0.5) in another run. Several species of Pinnularia, a raphid pennate genus in traditional classifications and in our MP analyses, were placed at the base of the diatom tree as sister to Leptocylindrus with BPP values of 0.88 and 0.98 in each of the two runs. While several of the 1 000 000 generation runs recovered a monophyletic Bacillariophyceae, the fact that we did not recover the pennates in any of the 10 000 000 generation analyses clearly indicates that even our longest Bayesian runs were far short of convergence.

We analysed aspects of performance of the M54 dataset to obtain a gross estimate of how difficult it might be to reach convergence in a Bayesian analysis of several hundred diatom sequences. The standard deviation of bipartitions between independent runs for the M54 dataset dropped to near zero at about 22 million generations, and thereafter oscillated at 0.1 until the analysis was terminated at 50 000 000 generations (Fig. 6). While this might suggest that convergence had been reached by 22 million generations, plotting the sampled trees for the last 28 million generations shows clusters of points off a straight line (Fig. 7). Discarding trees from the first 45 million generations resulted in a BPP plot approximating a straight line. The majority rule consensus tree returned a convex Coscinodiscophyceae and monophyletic Bacillariophyceae, but the Mediophyceae were positively paraphyletic (Fig. 8). Attheya septentrionalis was grouped with the pennates at a BPP of 0.95.
This is the placement of Attheya obtained from MP analysis. In fact, incongruence between the Bayesian and MP trees for dataset M54 is restricted to areas where BPP values are below 0.70 (not shown).

As judged by the still rapidly dropping split standard deviations (not shown), Bayesian analyses of the intermediate-sized MK126 and MK281 datasets had not converged on the same posterior probability distribution at 1 000 000 MCMC generations, again underscoring the difficulty of completing a meaningful Bayesian analysis on even 100 diatom SSU sequences in so few MCMC generations. Our analysis of the M54 dataset indicates that it might take as many as 50–100 million generations or more to reach convergence on datasets with 600 or more diatom SSU sequences.

Morphological tree

Seven trees of length nine were found. Only the pennates formed a convex group (neither the Coscinodiscophyceae nor the Mediophyceae were monophyletic, regardless of how the tree was rooted). In the strict consensus, the Thalassiosirales were excluded from the remaining Mediophyceae (Fig. 9) because they share all the characteristics of the Coscinodiscophyceae (radial symmetry, globular/isometric auxospore shape/growth, no perizonium or properizonium), and have none of the features peculiar to other Mediophyceae or the Bacillariophyceae.
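The comparison of bipartition frequencies between independent Bayesian runs described in the results above can be sketched as follows. This is a simplified illustration on invented toy samples, not the AWTY or MrBayes implementation: splits are plain string labels, the spread is a population standard deviation averaged over all observed splits, and the names `split_frequencies`, `compare_runs` and the 0.5 threshold are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def split_frequencies(trees):
    """Fraction of sampled trees in a run that contain each split."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: n / len(trees) for split, n in counts.items()}


def compare_runs(run_a, run_b, threshold=0.5):
    """Compare two runs' split frequencies (their estimated BPP values).

    Returns the average standard deviation of split frequencies and the
    splits that are well supported in one run but absent from the other,
    i.e. the points that would fall on an axis of a run-vs-run BPP plot.
    """
    freq_a, freq_b = split_frequencies(run_a), split_frequencies(run_b)
    splits = sorted(set(freq_a) | set(freq_b))
    pairs = {s: (freq_a.get(s, 0.0), freq_b.get(s, 0.0)) for s in splits}
    asdsf = mean(pstdev(p) for p in pairs.values())
    on_axis = [s for s, (a, b) in pairs.items()
               if min(a, b) == 0.0 and max(a, b) >= threshold]
    return asdsf, on_axis


# Toy post-burn-in samples: each tree is the set of its non-trivial splits
run_1 = [{"AB|CDE", "ABC|DE"}] * 3 + [{"AB|CDE", "ABD|CE"}]
run_2 = [{"AB|CDE", "ABD|CE"}] * 4

asdsf, on_axis = compare_runs(run_1, run_2)
print(f"average SD of split frequencies: {asdsf:.3f}")
print("supported in one run only:", on_axis)
```

Runs that have sampled the same posterior distribution should give a value near zero and an empty list; a split such as `ABC|DE` in the toy data, frequent in one run and missing from the other, is the kind of point the text describes as falling along the abscissa or ordinate.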
Fig. 3. Strict consensus of 147 unique equally most parsimonious trees calculated from the Bolidomonas plus diatom (DiatBo) dataset with Coscinodiscophyceae and Mediophyceae constrained to monophyly. Only relationships among diatoms are shown.
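The parsimony penalty for constraining trees to the CMB hypothesis is simply the difference in length between the constrained and unconstrained trees. The sketch below recomputes the penalties from the tree lengths reported in the parsimony results for the three re-analysed datasets and expresses each as a percentage of the unconstrained length; the names `TREE_LENGTHS` and `constraint_penalty` are hypothetical, and the relative penalty is an illustrative measure added here, not one used in the paper.

```python
# (unconstrained MP length, CMB-constrained length), as reported in the text
TREE_LENGTHS = {
    "M54": (4526, 4530),
    "MK126": (5626, 5633),
    "MK281": (19302, 19325),
}


def constraint_penalty(unconstrained, constrained):
    """Extra steps required by the constraint, plus the same as a percentage."""
    if constrained < unconstrained:
        raise ValueError("a constrained tree cannot be shorter than the MP tree")
    extra = constrained - unconstrained
    return extra, 100.0 * extra / unconstrained


penalties = {name: constraint_penalty(*lengths)
             for name, lengths in TREE_LENGTHS.items()}

for name, (extra, percent) in penalties.items():
    print(f"{name}: {extra} extra steps ({percent:.2f}% of MP tree length)")
```

In every case the constrained trees are well under 1% longer than the most parsimonious trees, which is the sense in which the data reject the CMB hypothesis only weakly.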
Discussion

Our results weakly reject the hypothesis that the Coscinodiscophyceae, Mediophyceae and Bacillariophyceae are each monophyletic (the CMB hypothesis). Only the Bacillariophyceae (pennate diatoms) were monophyletic, whether we included only closely related outgroups (bolidophytes only) or distantly related outgroups (bolidophytes and all other stramenopiles). However, there is little parsimony penalty to constrain trees to the CMB hypothesis for all datasets.

Given that greatly different topologies can be obtained from SSU datasets with little penalty, it is not surprising that estimates of the diatom phylogeny based on SSU sequences vary widely between studies using different taxa, alignments, and optimality criteria. For example, the few studies hinting at the possibility of a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae or vice versa used relatively few diatom SSU sequences. Very early in the use of SSU data in diatom systematics, using 11 diatoms including three Coscinodiscophyceae and one member of the Mediophyceae, Medlin et al. (1993) returned a monophyletic Coscinodiscophyceae. Using 29 diatom SSU sequences they later (Medlin et al. 1996a, b) returned a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae. Kooistra & Medlin (1996) analysed that same dataset, experimenting with various approaches to deal with the potential long-branch problem introduced by 'aberrantly evolving' diatoms; each approach returned a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae, although relationships within the mediophytes were dependent upon the method used. Kooistra et al. (2003) used 38 diatom SSU sequences, only two of which were Coscinodiscophyceae, both on long branches, returning a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae. Using 51 diatom SSU sequences, Chepurnov et al. (2008) also returned a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae.
However, they only ran 4 000 000 MCMC generations, so it is unclear if they had reached convergence of topology and posterior probabilities.

In contrast, Cavalier-Smith & Chao (2006), focusing not on diatoms but on a wide range of protists including diatoms, used a wide range of outgroups but only 32 diatom SSU sequences in a distance (neighbor-joining) analysis. While they found moderate (70%) BS support for monophyly of the Mediophyceae, they also found a paraphyletic Coscinodiscophyceae, with the internode excluding Melosirales from other Coscinodiscophyceae receiving slightly higher support than that found for a monophyletic Mediophyceae (BS = 72%). In perhaps the most extreme case of taxon sampling effects, using eleven diatom exemplars when studying relationships among alveolates and stramenopiles, Van de Peer et al. (1996) returned monophyly for the centric diatoms as a whole.

Fig. 4. Bipartition posterior probability plots of two runs (split runs) from the 1 000 000 Markov Chain Monte Carlo (MCMC) generation Bayesian analysis of our diatom plus bolidophyte dataset (DiatBo: upper plot) and of the Medlin et al. (2008) dataset (M54: lower plot). 90% burn-in used for each.

Fig. 5. Bipartition posterior probability plots of two runs (split runs) from the 10 000 000 Markov Chain Monte Carlo (MCMC) generation Bayesian analysis of our diatom plus bolidophyte dataset (DiatBo: upper plot) and the 20 000 000 generation Medlin et al. (2008) dataset (M54: lower plot). 90% burn-in used for each.

It is not simply monophyly (or not) of the centrics, the Coscinodiscophyceae or Mediophyceae that has proved unstable in different SSU analyses. Three studies, which each included more than 100 diatom sequences, offer the opportunity to compare trees calculated under a single optimality criterion (Bayesian inference). These revealed that taxon sampling differences alone may account for very different tree topologies. Based on 123 diatoms (Medlin & Kaczmarska, 2004) the Lithodesmiales grouped with the Thalassiosirales (BPP = 1.0), with 181 diatom SSU sequences (Alverson et al., 2006) they grouped with the Hemiaulales to the exclusion of the Thalassiosirales (BPP = 1.0) and with an unknown number of diatom sequences (Sims et al., 2006) the Lithodesmiales grouped with the Biddulphiales, Triceratiales and Toxarium to the exclusion of the Thalassiosirales (BPP = 1.0). The unpublished dataset for Fig. 2 of Sims et al. (2006) has been characterized as including more than 800 ingroup sequences (Medlin et al., 2008). It should be noted that alignment methods have varied greatly among the many studies using SSU sequences and these could be a possible source of variation that has yet to be fully explored.
This variation has the potential to change diatom SSU tree topology radically (Medlin et al., 2008).

Fig. 6. Standard deviation of likelihood scores among independent runs (split runs) vs number of generations for Bayesian analysis of our diatom plus bolidophyte (DiatBo) dataset (upper plot) and of the Medlin et al. (2008) dataset (M54: lower plot). The line in the upper plot represents a power function estimate of split standard deviations out to 50 000 000 Markov Chain Monte Carlo (MCMC) generations.

Fig. 7. Bipartition probability plot of two runs from the M54 dataset of Medlin et al. (2008). The upper plot discarded the first 22 million Markov Chain Monte Carlo (MCMC) generations (or 44% burn-in, based on initial minimum at ca. 22 million MCMC generations from Fig. 6). The lower plot discarded the first 45 million MCMC generations (or 90% burn-in).

Fig. 8. Majority rule consensus tree (calculated without an outgroup and arbitrarily rooted in the middle of the Coscinodiscophyceae) derived from 50 000 000 Markov Chain Monte Carlo generation Bayesian analysis of the M54 dataset from Medlin et al. (2008) with 90% burn-in. Numbers or symbols below nodes are bipartition posterior probability (BPP) values. Asterisks indicate BPP values 95%.

Fig. 9. Unrooted tree of diatom genera as determined by a parsimony analysis of the morphology matrix of Table 2 in Medlin & Kaczmarska (2004). Strict consensus of eight trees.

Among the many trees generated using SSU, the most radical and controversial trees (Williams & Kociolek, 2007) are those that support the CMB hypothesis: the MP tree of 8600+ SSU sequences (123 diatoms) by Medlin & Kaczmarska (2004), the Bayesian tree (800+ diatom sequences with bolidophyte outgroups) by Sims et al. (2006), and the Bayesian tree (54 diatom SSU sequences with no outgroup) by Medlin et al. (2008). Medlin & Kaczmarska (2004) claimed that their MP tree (8600+ sequences, but only 123 diatoms) was more accurate than their Bayesian tree (same 123 diatoms but only three bolidophyte SSU outgroup sequences) because including distantly related outgroups increased the number of parsimony informative characters. However, while increased taxon sampling within the scope of the problem (i.e. within diatoms) may increase accuracy, increased taxon sampling outside the scope of the problem (adding distant outgroups) will likely decrease accuracy of phylogenetic inference (Hillis, 1998; Pollock et al., 2002; Hillis et al., 2003; Hedtke et al., 2006; Verbruggen & Theriot, 2008). Medlin & Kaczmarska (2004) cited Bollback (2002) as support for their position, but that paper studied the effects of adding characters, not taxa (ingroup or outgroup), and only in the context of model-based methods, specifically accuracy of model selection for phylogenetic analysis, and is therefore not pertinent. Thus, contrary to the claims of Medlin & Kaczmarska (2004), one could hypothesize that recovery of the CMB tree under parsimony is an artefact of increased error caused by addition of distantly related outgroup taxa. In the light of the literature on taxon sampling, a more substantive claim was made by Sims et al. (2006), who suggested that increased ingroup sampling led to recovery of the CMB hypothesis, this time with high BPP support values.

However, we suggest that recovery of the CMB hypothesis in Medlin & Kaczmarska (2004), Sims et al. (2006) and Medlin et al. (2008) is probably a result of insufficient tree search effort.
Medlin & Kaczmarska (2004) used the MP search in ARB, whose most effective heuristic search algorithm employs a combination of Nearest Neighbor Interchange and Kernighan-Lin optimization, which together are less effective than the commonly used Tree–Bisection–Reconnection algorithm, and certainly not as effective as other methods, such as the parsimony ratchet (Nixon, 1999). Given the large number of near-optimal trees that support the CMB hypothesis in our dataset, it is likely that a suboptimal search might find any one of these suboptimal trees. Similarly, the Bayesian inference of Sims et al. (2006) was probably also confounded by insufficient search of tree space. They ran only 1 000 000 MCMC generations. Our analysis of our DiatBo dataset (673 diatoms plus seven bolidophytes) had not reached convergence at 10 000 000 generations. Our analysis of the M54 dataset, presumably the same alignment but with far fewer taxa than used by Sims et al. (2006), seems to have required at least 45 million generations for the burn-in alone. The tree in Fig. 1a (Medlin et al., 2008), supporting the CMB hypothesis, is clearly an artefact of running far too few MCMC generations, and even if the tree topology is correct, monophyly of the Coscinodiscophyceae is an artefact of arbitrary rooting in the absence of an outgroup.

Thus, our results strongly suggest that the choice of optimality criterion has less influence on trees derived from SSU data than does the proper application of that choice. All methods, alignments and taxon sampling schemes we reviewed or re-analysed returned weak rejection of the CMB hypothesis.

Both Medlin & Kaczmarska (2004) and Sims et al. (2006) argued that morphological data were congruent with their SSU trees. However, the characters discussed are either irrelevant to testing the CMB hypothesis, or ambiguous about it (e.g.
spermatozoid structure [both the Coscinodiscophyceae and Mediophyceae have merogenous and hologenous spermatozoids]; pyrenoid structure [one type is apparently symplesiomorphically shared by the Coscinodiscophyceae and Mediophyceae, while pyrenoid structure in the Thalassiosirales is autapomorphic for the order]). Using the Medlin & Kaczmarska (2004) morphological character matrix, our tree excluded the Thalassiosirales from the Mediophyceae on the basis of auxospore characteristics. Nevertheless it was claimed that the particular pattern of auxospore formation under discussion was retained in the Thalassiosirales (Medlin & Kaczmarska, 2004: 267). To make this argument under parsimony, the Thalassiosirales would have to be the sister group to all remaining Mediophyceae, a relationship not recovered in either Medlin & Kaczmarska (2004) or Sims et al. (2006).

Complicated scenarios are invoked ad hoc to explain the distribution of the four different Golgi body arrangements. Of the two widely distributed arrangements, the so-called Type 1 (sensu Medlin & Kaczmarska, 2004) arrangement was attributed to most of the Coscinodiscophyceae, and the Type 2 arrangement was attributed to the Aulacoseirales (Coscinodiscophyceae), Mediophyceae, and Bacillariophyceae. If Type 1 is apomorphic and Type 2 is not, then there is no evidence from the Golgi arrangement that the Aulacoseirales belong to the Coscinodiscophyceae. If Type 2 is apomorphic, regardless of the interpretation of Type 1, the Golgi character is congruent with our SSU trees
  14. 14. and rejects the CMB hypothesis by placing the Aulacoseirales with the Mediophyceae and Bacillariophyceae. Nevertheless, Medlin & Kaczmarska (2004) argued away this incongruence, explaining the distribution of Golgi body types in terms of ancestral polymorphisms, implicitly invoking unobserved character conditions in unobserved ancestral species for as far back as the common ancestor to red algae and diatoms (Medlin & Kaczmarska, 2004: 265): ‘However, G-ER-M units are known from the oomycetes and the red algae, whereas an association of the Golgi around the nucleus is also known in the Labyrinthuloides. Thus, it would appear that both features are present in ancestors of the diatoms and the potential host cells of their plastids. It can be argued that the two traits then segregated themselves in the two separate lineages as they evolved.’

Conclusion

Medlin & Kaczmarska (2004) and Sims et al. (2006) proposed monophyly of each of the Coscinodiscophyceae, Mediophyceae, and Bacillariophyceae. Since the unavailability of the datasets (the only ones to support the CMB hypothesis apart from that of Medlin et al. (2008)) precluded direct reproduction of their results, we assembled datasets of similar size and characteristics. Our results suggest that the CMB hypothesis is rejected by SSU data, albeit very weakly. Similarly, our re-analysis of morphological evidence proposed by Medlin & Kaczmarska (2004) also weakly rejects the CMB hypothesis. Medlin & Kaczmarska (2004) very likely recovered a suboptimal MP tree for their 8600+ sequence dataset and Sims et al. (2006) very likely failed to converge on the true posterior distribution of trees in their Bayesian analysis. Conversely, if Medlin & Kaczmarska (2004) did recover the MP tree or trees and if the Sims et al. (2006) analysis did reach convergence for their dataset, then our results demonstrate that the likelihood of their having done so is highly dependent on taxon sampling and/or sequence alignment. We have demonstrated that the Medlin et al.
(2008) tree supporting the CMB hypothesis is an artefact and that it must be concluded that the CMB hypothesis is far from robust, regardless of how one interprets the variation between studies.

In summary, pursuit of a well-supported phylogeny of diatoms seems to be limited as much by the number of characters per taxon as by the number of taxa for which data exist. There is a small but growing rbcL dataset which rejects the CMB hypothesis (Choi et al., 2008). Very limited coxI data support the CMB hypothesis, but analyses so far only include four species (Ehara et al., 2000). While nSSU data are a useful addition to the difficult problem of inferring diatom phylogeny, continued use of SSU alone, as Patterson (1994: 185) wrote in a similar context, might simply be an ineffective attempt to ‘. . . wring truth from recalcitrant data.’

Acknowledgements

ECT was supported by NSF EF 0629410 and the Jane and Roland Blumberg Centennial Professorship in Molecular Evolution. AJA was supported by an NIH Ruth L. Kirschstein NRSA Postdoctoral Fellowship (1F32GM080079-01A1). Both also acknowledge the Tony Institute. RRG and JJC were supported by NIH GM067317.

References

ADL, S.M., SIMPSON, A.G.B., FARMER, M.A., ANDERSEN, R.A., ANDERSON, O.R., BARTA, J.R., BOWSER, S.S., BRUGEROLLE, G.U.Y., FENSOME, R.A., FREDERICQ, S., et al. (2005). A new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J. Eukaryotic Microbiol., 52: 399–451.
ALVERSON, A.J., CANNONE, J.J., GUTELL, R.R. & THERIOT, E.C. (2006). The evolution of elongate shape in diatoms. J. Phycol., 42: 655–668.
ALVERSON, A.J. & THERIOT, E.C. (2005). Comments on recent progress toward reconstructing the diatom phylogeny. J. Nanosci. Nanotechnol., 5: 57–62.
ANDERSEN, R.A. (2004). Biology and systematics of heterokont and haptophyte algae. Am. J. Bot., 91: 1508–1522.
BOLLBACK, J.P. (2002). Bayesian model adequacy and choice in phylogenetics. Mol. Biol.
Evol., 19: 1171–1180.
CANNONE, J.J., SUBRAMANIAN, S., SCHNARE, M.N., COLLETT, J.R., D’SOUZA, L.M., DU, Y., FENG, B., LIN, N., MADABUSI, L.V., MULLER, K.M., et al. (2002). The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3: 2.
CAVALIER-SMITH, T. & CHAO, E. (2006). Phylogeny and megasystematics of phagotrophic heterokonts (Kingdom Chromista). J. Mol. Evol., 62: 388–420.
CHEPURNOV, V.A., MANN, D.G., VON DASSOW, P., VANORMELINGEN, P., GILLARD, J., INZÉ, D., SABBE, K. & VYVERMAN, W. (2008). In search of new tractable diatoms for experimental biology. BioEssays, 30: 692–702.
CHOI, H.-G., JOO, H.M., JUNG, W., HONG, S.S., KANG, J.-S. & KANG, S.-H. (2008). Morphology and phylogenetic relationships of some psychrophilic polar diatoms (Bacillariophyta). Nov. Hedwig. Beih., 133: 7–30.
DAUGBJERG, N. & ANDERSEN, R.A. (1997). A molecular phylogeny of the heterokont algae based on analyses of chloroplast-encoded rbcL sequence data. J. Phycol., 33: 1031–1041.
EHARA, M., INAGAKI, Y., WATANABE, K.I. & OHAMA, T. (2000). Phylogenetic analysis of diatom coxI genes and implications of a fluctuating GC content on mitochondrial genetic code evolution. Curr. Genet., 37: 29–33.
GOERTZEN, L.R. & THERIOT, E.C. (2003). Effect of taxon sampling, character weighting, and combined data on the interpretation of relationships among the heterokont algae. J. Phycol., 39: 423–439.
GOLOBOFF, P., FARRIS, J., NIXON, K. (2003). T.N.T.: Tree analysis using new technology. Available at:
  15. 15. GUTELL, R.R., LEE, J.C. & CANNONE, J.J. (2002). The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Structural Biol., 12: 301–310.
GUTELL, R.R., POWER, A., HERTZ, G.Z., PUTZ, E.J. & STORMO, G.D. (1992). Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucl. Acids Res., 20: 5785–5795.
GUTELL, R.R., WEISER, B., WOESE, C.R. & NOLLER, H.F. (1985). Comparative anatomy of 16S-like ribosomal RNA. Prog. Nucl. Acid Res. Mol. Biol., 32: 155–216.
HEDTKE, S., TOWNSEND, T. & HILLIS, D. (2006). Resolution of phylogenetic conflict in large data sets by increased taxon sampling. System. Biol., 55: 522–529.
HILLIS, D.M. (1998). Taxonomic sampling, phylogenetic accuracy and investigator bias. System. Biol., 47: 3–8.
HILLIS, D.M., POLLOCK, D.D., MCGUIRE, J.A. & ZWICKL, D.J. (2003). Is sparse taxon sampling a problem for phylogenetic inference? System. Biol., 52: 124–126.
KOOISTRA, W., DE STEFANO, M., MANN, D.G., SALMA, N. & MEDLIN, L.K. (2003). Phylogenetic position of Toxarium, a pennate-like lineage within centric diatoms (Bacillariophyceae). J. Phycol., 39: 185–197.
KOOISTRA, W.H.C.F. & MEDLIN, L.K. (1996). Evolution of the diatoms (Bacillariophyta): IV. A reconstruction of their age from small subunit rRNA coding regions and the fossil record. Mol. Phylogenet. Evol., 6: 391–407.
LARSEN, N., OLSEN, G.J., MAIDAK, B.L., MCCAUGHEY, M.J., OVERBEEK, R., MACKE, T.J., MARSH, T.L. & WOESE, C.R. (1993). The Ribosomal Database Project. Nucl. Acids Res., 21: 3021–3023.
MEDLIN, L.K. & KACZMARSKA, I. (2004). Evolution of the diatoms V: Morphological and cytological support for the major clades and a taxonomic revision. Phycologia, 43: 245–270.
MEDLIN, L.K., KOOISTRA, W.H.C.F., GERSONDE, R. & WELLBROCK, U. (1996a). Evolution of the diatoms (Bacillariophyta): II. Nuclear-encoded small-subunit rRNA sequence comparisons confirm a paraphyletic origin for the centric diatoms. Mol. Biol.
Evol., 13: 67–75.
MEDLIN, L.K., KOOISTRA, W.H.C.F., GERSONDE, R. & WELLBROCK, U. (1996b). Evolution of the diatoms (Bacillariophyta): III. Molecular evidence for the origin of the Thalassiosirales. Nov. Hedwig. Beih., 112: 221–234.
MEDLIN, L.K., KOOISTRA, W.H.C.F. & SCHMID, A.-M.M. (2000). A review of the evolution of the diatoms–a total approach using molecules, morphology and geology. In The Origin and Early Evolution of the Diatoms: Fossil, Molecular and Biogeographical Approaches (Witkowski, A. and Sieminska, J., editors), 13–36. Szafer Institute of Botany, Polish Academy of Sciences, Kraków, Poland.
MEDLIN, L.K., SATO, S., MANN, D.G. & KOOISTRA, W.H.C.F. (2008). Molecular evidence confirms sister relationship of Ardissonea, Climacosphenia and Toxarium within the bipolar centric diatoms (Bacillariophyta, Mediophyceae), and cladistic analyses confirm that extremely elongate shape has arisen twice in the diatoms. J. Phycol., 45: 1340–1348.
MEDLIN, L.K., WILLIAMS, D.M. & SIMS, P.A. (1993). The evolution of the diatoms (Bacillariophyta). I. Origin of the group and assessment of the monophyly of its major divisions. Eur. J. Phycol., 28: 261–275.
NIXON, K. (1999). The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics, 15: 407–414.
NYLANDER, J.A.A. (2004). MrModeltest v2. Program distributed by the author. Evolutionary Biology Centre, Uppsala University.
PATTERSON, C. (1994). Null or minimal models. In Models in Phylogeny Reconstruction (Scotland, R., Siebert, D.J., and Williams, D.M., editors), 173–192. Oxford University Press, Oxford, UK.
POLLOCK, D.D., ZWICKL, D.J., MCGUIRE, J.A. & HILLIS, D.M. (2002). Increased taxon sampling is advantageous for phylogenetic inference. System. Biol., 51: 664–671.
ROUND, F.E., CRAWFORD, R.M. & MANN, D.G. (1990). The Diatoms: Biology & Morphology of the Genera. Cambridge, UK: Cambridge University Press.
SIMONSEN, R. (1979). The diatom system: Ideas on phylogeny. Bacillaria, 2: 9–71.
SIMS, P.A., MANN, D.G. & MEDLIN, L.K. (2006).
Evolution of the diatoms: Insights from fossil, biological and molecular data. Phycologia, 45: 361–402.
SORHANNUS, U. (2004). Diatom phylogenetics inferred based on direct optimization of nuclear-encoded SSU rRNA sequences. Cladistics, 20: 487–497.
SORHANNUS, U. (2007). A nuclear-encoded small-subunit ribosomal RNA timescale for diatom evolution. Mar. Micropaleontol., 65: 1–12.
VAN DE PEER, Y., VAN DER AUWERA, G. & DE WACHTER, R. (1996). The evolution of stramenopiles and alveolates as derived by “substitution rate calibration” of small ribosomal subunit RNA. J. Mol. Evol., 42: 201–210.
VERBRUGGEN, H. & THERIOT, E.C. (2008). Building trees of algae: Some advances in phylogenetic and evolutionary analysis. Eur. J. Phycol., 43: 229–252.
WILGENBUSCH, J.C., WARREN, D.L., SWOFFORD, D.L. (2004). AWTY: A system for graphical exploration of MCMC convergence in Bayesian phylogenetic inference. Available at:
WILLIAMS, D.M. & KOCIOLEK, J.P. (2007). Pursuit of a natural classification of diatoms: History, monophyly and the rejection of paraphyletic taxa. Eur. J. Phycol., 42: 313–319.