Clades With Bootstrap Support (%)
BURLEIGH ET AL.—INFERRING THE PLANT TREE OF LIFE FROM GENE TREES
TABLE 1. Summary of supertree bootstrap support from the GTP
plants (99% support), gymnosperms (100% support),
angiosperms (99% support), eudicots (99% support),
core eudicots (99% support), and asterids (100% support; Fig. 3). Within gymnosperms, Gnetales were sister
to the conifers (100% support; Fig. 3). Amborella was sister to all other angiosperms, and Nuphar (Nymphaeales)
was sister to all angiosperms except Amborella (Fig. 3).
Magnoliids were sister to a monocot + eudicot clade
(Fig. 3). Within monocots, the Poaceae (grass family)
had 100% support, and within the grasses, the Panicoideae clade had 100% bootstrap support (Fig. 3). In
the core eudicot clade, the Caryophyllales (100% support) were sister to the rosids (99% support) and the
asterids (100% support) (Fig. 3).
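The bootstrap support percentages reported above are, in general terms, the share of bootstrap trees that contain a given clade. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: each tree is assumed to be supplied as a set of clades, each clade a frozenset of taxon names, and the function names `clade_support` and `majority_rule` are hypothetical.

```python
from collections import Counter

def clade_support(bootstrap_trees):
    """Return {clade: percent of bootstrap trees that contain the clade}."""
    counts = Counter()
    for tree in bootstrap_trees:
        counts.update(set(tree))  # count each clade at most once per tree
    n = len(bootstrap_trees)
    return {clade: 100.0 * c / n for clade, c in counts.items()}

def majority_rule(support):
    """Keep the clades found in more than half of the bootstrap trees."""
    return {clade for clade, pct in support.items() if pct > 50.0}

# Toy input (assumed data): three bootstrap trees on taxa A, B, C.
AB, BC, ABC = frozenset("AB"), frozenset("BC"), frozenset("ABC")
toy_trees = [{AB, ABC}, {AB, ABC}, {BC, ABC}]
toy_support = clade_support(toy_trees)
toy_consensus = majority_rule(toy_support)
```

Here the clade {A, B} appears in two of the three toy trees, so it is retained in the majority rule consensus, whereas {B, C} is not.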
There were several differences in the species tree obtained using ML gene trees versus NJ/PP gene trees. For
example, the relationships among eurosid lineages differed slightly; however, in both analyses, Malpighiales
(eurosid I) were nested in a clade with eurosid II taxa
(Figs. 1 and 3). The BEP-clade (Bambusoideae, Ehrhartoideae, and Pooideae) was not supported in the analysis using NJ/PP gene trees, but it was when using ML
gene trees (Fig. 3). Acorus americanus was not placed
with other monocots in the NJ/PP analysis, but it was
in a monocot clade when using ML gene trees (Fig. 3).
Notes: This displays the percentage of total clades at or above a given
level of bootstrap support for 1) the majority rule consensus of all bootstrap trees from the NJ/PP analysis of 136 taxa (136-Taxon Cons.), 2)
the reduced consensus of all bootstrap trees for the 82 taxa present in
at least 1300 of the gene trees (Reduced Cons.), and 3) the majority rule
consensus of all bootstrap trees from the NJ/PP analysis of the same
82 taxa as above (82-Taxon Cons.).
Downloaded from sysbio.oxfordjournals.org at University of North Carolina at Chapel Hill on February 18, 2011
A computational system to select candidate genes for complex human traits
Table 2. Tests using susceptibility genes for complex human traits
FIGURE 2. Average quartet similarity for each taxon among bootstrap trees. Each point in the graph represents a single taxon. The x-axis shows the number of gene family trees that have data from the
taxon. The y-axis shows the average percentage frequency of quartets
(four taxon statements) containing the taxon that are identical between
two bootstrap trees. The shaded area in the graph contains all taxa that
are present in less than 1300 gene trees.
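The quartet similarity measure described in the Figure 2 caption can be illustrated with a short sketch. It assumes each tree is given as a set of clades (frozensets of taxon names) over the same taxa, treats two unresolved quartets as identical, and compares a single pair of trees; averaging over all pairs of bootstrap trees, as the caption describes, would wrap this in an outer loop. The function names are hypothetical.

```python
from itertools import combinations

def quartet_topology(clades, quartet):
    """Return the split induced on four taxa, e.g. {{a, b}, {c, d}}, or None if unresolved."""
    q = set(quartet)
    for clade in clades:
        inside = q & clade
        if len(inside) == 2:  # this clade separates two quartet members from the other two
            return frozenset({frozenset(inside), frozenset(q - inside)})
    return None

def taxon_quartet_similarity(clades_a, clades_b, taxa, taxon):
    """Percent of quartets containing `taxon` that have the same topology in both trees."""
    others = [t for t in taxa if t != taxon]
    same = total = 0
    for trio in combinations(others, 3):
        quartet = (taxon,) + trio
        total += 1
        if quartet_topology(clades_a, quartet) == quartet_topology(clades_b, quartet):
            same += 1
    return 100.0 * same / total

# Toy input (assumed data): ((A,B),(C,(D,E))) versus ((A,C),(B,(D,E))).
taxa = ["A", "B", "C", "D", "E"]
tree_1 = {frozenset("AB"), frozenset("DE"), frozenset("CDE")}
tree_2 = {frozenset("AC"), frozenset("DE"), frozenset("BDE")}
similarity_a = taxon_quartet_similarity(tree_1, tree_2, taxa, "A")
similarity_d = taxon_quartet_similarity(tree_1, tree_2, taxa, "D")
```

In this toy case taxon A, whose position differs between the two trees, scores lower than taxon D, whose sister relationship to E is stable.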
Percent Enrich Rank Total
Age-related macular 603075 15094132; 15350892
10450 12608 17.1
12608
ARMD (second run)
600807 12810182; 14551038 NPSR1
Autism 209850 11733747; 12142938 EN2
Celiac disease 212750 12907013; 12699968; MYO9B
608446 15861005; 16041318 LTA4H
168600 16026116; 16278972 SEMA5A
arthritis 180300 15478157; 12915205 PTPN22 FCRL3
181500 15340352; 16033310 ENTH
10013 14603 31.4
Type 1 diabetes mellitus 222100 12270944; 11921414 SUMO4
12123 14272 15.1
11237226; 11899083 PTPN22 IL2RA CTLA4
Type 2 diabetes mellitus 125853 15662000; 15662001; TCF7L2
Totals
54f
a PubMed ID(s) of review articles used in corpus.
HUGO approved gene symbols used to identify genes. For references see Methods section.
c No suitable review corpus available (see Methods section).
The OMIM record is insufficiently detailed and was not used.
1784 Knies et al.
FIG. 2. Best-fit nucleotide substitution models for each alignment. Shown is a cartoon illustration of the rate categories of the best-fit nucleotide
substitution models for each molecule. Within a molecule, rates were scaled to the maximum rate (black). Diagonal lines depict transitions; the edges of
the square depict transversions. The HKY85 model, which was used for the rate ratios reported throughout this article, is shown for comparison on
Frequent gene and whole-genome duplications have,
in the past, limited the use of nuclear genes for deep
level phylogenetic analyses in plants and other clades
with highly duplicated genomes. GTP provides a way to
exploit the information inherent not only
in the relationships among orthologous genes but also
the rare gene duplications that produce paralogous gene
family members. Rather than treating gene tree discordance as a nuisance,
it seeks the species tree that provides
the best reconciliation among the many discordant [gene trees].
In this study, we [used] GTP to find species trees that
minimize the total number of duplications across a collection
of nearly 18,896 plant gene trees. The sequence
sampling includes extensive collections of existing EST
data that have rarely before been used for plant phylogenetics
(but see de la Torre et al. 2006; Sanderson and McMahon
2007; de la Torre-Bárcena et al. 2009). Thus,
this study provides a new nuclear genomic perspective
on the plant tree of life.
Overall, the phylogenetic relationships inferred from
gene duplications are largely consistent with previous
large-scale molecular studies of plant phylogeny (e.g.,
Soltis et al. 2000; Hilu et al. 2003; Jansen et al. 2007).
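The criterion described above, choosing the species tree that minimizes the total number of gene duplications, relies on scoring one candidate species tree against the gene trees. A minimal sketch of that scoring step follows, assuming the standard least-common-ancestor (LCA) reconciliation: a gene tree node counts as a duplication when it maps to the same species tree node as one of its children. Trees are nested tuples whose leaves are species names, and every function name is hypothetical. The search over candidate species trees that GTP performs is not shown.

```python
def clades_of(tree):
    """Return all clades (frozensets of leaf names) of a nested-tuple tree, leaves included."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result, leaves = [], frozenset()
    for child in tree:
        sub = clades_of(child)
        result.extend(sub)
        leaves |= sub[-1]  # the last entry of each sub-list is that child's full clade
    result.append(leaves)
    return result

def lca(species_clades, species):
    """Smallest species tree clade that contains every species in `species`."""
    return min((c for c in species_clades if species <= c), key=len)

def count_duplications(gene_tree, species_clades):
    """Return (duplications in gene_tree, species tree clade the root maps to)."""
    if isinstance(gene_tree, str):
        return 0, lca(species_clades, frozenset([gene_tree]))
    dups, child_maps = 0, []
    for child in gene_tree:
        d, m = count_duplications(child, species_clades)
        dups += d
        child_maps.append(m)
    here = lca(species_clades, frozenset().union(*child_maps))
    if here in child_maps:  # node maps to the same place as a child: duplication
        dups += 1
    return dups, here

def total_duplications(gene_trees, species_tree):
    """Sum of duplications over all gene trees for one candidate species tree."""
    species_clades = clades_of(species_tree)
    return sum(count_duplications(g, species_clades)[0] for g in gene_trees)

# Toy input (assumed data): three gene trees scored against two candidate species trees.
gene_trees = [(("A", "B"), "C"), (("A", "C"), "B"), (("A", "A"), "B")]
score_1 = total_duplications(gene_trees, (("A", "B"), "C"))
score_2 = total_duplications(gene_trees, (("A", "C"), "B"))
```

Under this scoring, a discordant gene tree is not discarded; it simply adds to the duplication count of any species tree that disagrees with it.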
Yet the GTP analysis also provides support for some
relationships that are unresolved or conflicting in previous
analyses. For example, the results support the
placement of magnoliids sister to monocots + eudicots,
making eudicots (possibly with Ceratophyllum, which
was not included in this study) sister to monocots (Figs 1
and 3). The relationships among these major clades are
unclear from analyses using few genes (e.g., Soltis et al.
2000, 2007; Hilu et al. 2003), but our result is consistent
with recent analyses using 81 plastid genes (Jansen et al.
2007). The placement of Malpighiales within a eurosid
I clade (Figs. 1 and 3) generally [...] with previous
large-scale angiosperm analyses (e.g., Soltis et al. 2000;
Hilu et al. 2003; Jansen et al. 2007). Given the novelty of
the result, it should be interpreted with great caution.
Our results indicate that data from many gene trees
may be required to produce a well-supported phylogeny
using GTP (Table 1; Figs. 2 and 3), suggesting
that GTP may not use data as efficiently as more traditional
phylogenetic analyses of concatenated multigene
data sets. For example, in plants, recent analyses of up
to 83 plastid genes have apparently resolved enigmatic
relationships in the backbone angiosperm phylogeny,
whereas our analyses appear to require data from 1000
genes (Jansen et al. 2007; Moore et al. 2007, 2010). Like
The CAESAR algorithms were written using Perl version 5.8.1
and Java version 1.4.2. The vector space similarity searches were
performed using a modified version of the Perl module
Search::VectorSpace by Maciej Ceglowski (http://www.perl.com/pub/a/2003/02/19/engine.html).
Databases and ontology schemas were
downloaded and parsed into XML under a custom XML schema.
Intermediate text and data-mining results were also stored as XML
under the same schema.
int3, [...] [equation lost in extraction]
of the transformed scores for gene i.
The fourth method, referred to as int4, differs from the other
three by considering both the score of a gene within a data source
as well as the number of genes returned for that data source. First,
a transformed score sij is obtained.
[equation lost in extraction; the surviving terms are rij and a sum over n]
The transformed gene scores are then summed together to provide
the final score for each gene.
[equation lost in extraction]
where gj is the number of genes returned for source j and
2.5 Selection of tests for complex traits
To assess the ability of CAESAR to choose valid candidates, 18 test
genes were selected from recently published reports providing strong
evidence of statistical association with known complex human
disorders. The test genes included CTLA4 (Ueda et al., 2003),
PTPN22 (Bottini et al., 2004), PTPN22 (Begovich et al., 2004),
SUMO4 (Guo et al., 2004), FCRL3 (Kochi et al., 2005), ENTH
(Pimm et al., 2005), EN2 (Gharani et al., 2004), TCF7L2 (Grant et al.,
2006), CFH (Klein et al., 2005), LOC387715 (Rivera et al., 2005),
LTA4H (Helgadottir et al., 2006), C2 (Gold et al., 2006),
CFB (Gold et al., 2006), NPSR1 (Laitinen et al., 2004), MYO9B
(Monsuur et al., 2005), IL2RA (Vella et al., 2005), SEMA5A
(Maraganore et al., 2005) and LOC439999 (Grupe et al., 2006).
Each disorder required a custom corpus, either an OMIM record
or one or more review articles describing the biology of the disorder
(Table 2). Review articles were selected by searching PubMed
(Wheeler et al., 2006) for articles published before the year of discovery
of each gene association. Where multiple suitable review articles
were available, the texts were concatenated to produce the corpus.
We removed any direct reference to the testing gene in the input text.
In addition, entries in the GAD containing the test genes were removed.
Thus, the input data closely mimicked the state of knowledge prior
Substitution Patterns in RRE
We examined 3 possible explanations for the surprising
result that κp < κu in RRE. First, because both the RRE
and CRE secondary structures occur within coding regions,
we examined the possibility that the difference between κp
and κu is diminished by selection on the protein sequence.
We recalculated κp and κu for both molecules using only
data from 4-fold degenerate sites in paired and unpaired regions.
In CRE, the presence of codons affects the estimates
in the predicted direction (4-fold degenerate sites:
κp = κu^2.89; all sites: κp = κu^1.45), though the 4-fold sites overshoot
the predicted pattern. We had less power to compare
4-fold degenerate sites at the paired and unpaired sites of
RRE because there were too few 4-fold degenerate unpaired
sites, and there was insufficient sequence variability at these
sites. However, the 4-fold degenerate paired sites did show
a higher κp (κp = 7.61 with 95% CI [4.79–18.48]) than the
paired sites as a whole (κp = 4.21 with 95% CI [3.51–5.28]).
This suggests that the presence of protein-coding
constraints does impede compensatory evolution at paired
sites in RNA secondary structures, although it does not explain
why κu would be "greater" than κp in RRE.
Second, we examined the possibility that we had used
a nonrepresentative sample of RRE sequences. To confirm
that the observed substitution patterns in RRE were not specific
to the particular set of HIV sequences we examined
(which were all derived from subtype B), we estimated
κp and κu from 2 additional RRE alignments of sequences
drawn from higher taxonomic levels: sequences from different
subtypes (1 sequence each from A, B, C, F, G, H, J,
and K) and sequences from different groups (1–2 sequences
each from M, N, and O) of HIV. In both these alignments,
the results were qualitatively similar to those for subtype B:
κu was significantly higher than κp (table 4).
Third, we considered whether the RRE estimates were
disproportionately influenced by a portion of the molecule
that experiences a type of selection that differs from the
molecule as a whole. We systematically removed each
stem-loop of RRE and reestimated κp and κu for the resulting
partial structures. The κp and κu estimates were qualitatively
similar for all these partial structures (table 5).
Axis label: Transition–Transversion Rate Ratios (κp)
FIG. 3. Transition–transversion rate ratios (κ) for each alignment.
The dotted line represents a 1:1 relationship between κp and κu. The solid
line represents the predicted relationship κp = κu^2. Note that the CRE data
point is from the analysis of 4-fold degenerate sites in paired and unpaired
LRT value significant at P < 0.0001
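The comparison between the paired-site and unpaired-site transition-transversion rate ratios can be made concrete with a small worked calculation. If the paired-site ratio equals the unpaired-site ratio raised to some power x, then x is the ratio of their logarithms, and the predicted relationship in the Figure 3 caption corresponds to x = 2. The sketch below is illustrative only: the function name `kappa_exponent` and the input values are assumptions, not estimates from the article.

```python
import math

def kappa_exponent(kappa_paired, kappa_unpaired):
    """Return x such that kappa_paired == kappa_unpaired ** x."""
    if kappa_paired <= 0 or kappa_unpaired <= 0 or kappa_unpaired == 1:
        raise ValueError("rate ratios must be positive and kappa_unpaired must differ from 1")
    return math.log(kappa_paired) / math.log(kappa_unpaired)

# Hypothetical values: an unpaired-site ratio of 3.0 predicts a paired-site ratio of 9.0.
x_predicted = kappa_exponent(9.0, 3.0)
# A paired-site ratio below the unpaired-site ratio gives an exponent below 1.
x_reversed = kappa_exponent(2.0, 3.0)
```

An exponent near 2 matches the predicted pattern, a value above 2 overshoots it, and a value below 1 corresponds to the reversed ordering reported for RRE.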
little data is
Published tables ﬁgures
Reuse of open data boosts
citations to the original article
Most analyzed data is in the ‘long tail’, for
which there is no specialized repository
(e.g. Genbank, GBIF)
Rank frequency of datatype
After Heidorn (2008) doi:10.1353/lib.0.0036
Peer-to-peer data sharing does not work
Wicherts and colleagues requested data from
141 articles in American Psychological Association
“6 months later, after … 400 emails, [sending]
detailed descriptions of our study aims, approvals
of our ethical committee, signed assurances not to
share data with others, and even our full
resumes…” only 27% of authors complied
Wicherts JM, Borsboom D, Kats J, Molenaar D (2006) doi:10.1037/0003-066X.61.7.726
Data is best captured at the time of publication
Bumpus HC (1898) The Elimination of the Unﬁt as Illustrated by the Introduced Sparrow,
Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.
Joint Data Archiving Policy ( JDAP )
Data are important products of the scientiﬁc
enterprise, and they should be preserved and
usable for decades in the future.
As a condition for publication, data supporting the
results in the article should be deposited in an
appropriate public archive.
Authors may elect to embargo access to the data for
a period up to a year after publication.
Exceptions may be granted at the discretion of the
editor, especially for sensitive information.
High impact factor journals have stronger data
Piwowar HA, Chapman WW (2008) hdl:10101/npre.2008.1700.1
and related data ﬁles
(with data citation)
(with article citation)
When using this data, please cite the original article:
Chave J, Coomes D, Jansen S, Lewis SL, Swenson NG, Zanne
AE (2009) Towards a worldwide wood economics spectrum.
Ecology Letters 12: 351-366. doi:10.1111/j.
Additionally, please cite the Dryad data package:
Zanne AE, Lopez-Gonzalez G, Coomes DA, Ilic J, Jansen S,
Lewis SL, Miller RB, Swenson NG, Wiemann MC, Chave J
(2009) Data from: Towards a worldwide wood economics
spectrum. Dryad Digital Repository. doi:10.5061/dryad.234
No fees for submission from low and
lower middle income countries
Dryad by the numbers
To learn more
Repository home: http://datadryad.org
Project documentation: http://wiki.datadryad.org
or contact us:
• Todd Vision, Director, email@example.com
• Laura Wendell, Dryad Executive Director, firstname.lastname@example.org