Data as research output; Data as part of the scholarly record

Todd Vision
University of North Carolina at Chapel Hill
Dryad Digital Repository

SciELO15 · 24 October 2013 · São Paulo

CC-BY-NC-SA nic221
http://www.flickr.com/photos/nic221/391536867/

Source: IFEX
http://www.ifex.org/united_states/2013/09/05/cipa_libraries/

[Slide images, 2011: overlaid pages from three journal articles, showing how data appear inside published papers as tables and figures.
- Burleigh et al., "Inferring the Plant Tree of Life from Gene Trees" (Table 1: summary of supertree bootstrap support from the GTP analysis; Figure 2: average quartet similarity for each taxon among bootstrap trees).
- "A computational system to select candidate genes for complex human traits" (Table 2: tests using susceptibility genes for complex human traits).
- Knies et al. (Table 3 and Fig. 3: transition-transversion rate ratios for each alignment).]

Relatively little data is published within articles

[Diagram levels: Published tables and figures; Analysed data; Raw data]

Reuse of open data boosts citations to the original article

Piwowar and Vision (2013)

doi:10.7717/peerj.175

Most analyzed data is in the ‘long tail’, for which there is no specialized repository

[Figure: volume of data (y-axis) against rank frequency of datatype (x-axis). Regions labelled: Structured data (e.g. GenBank, GBIF); Long-tail data.]

After Heidorn (2008) doi:10.1353/lib.0.0036

Peer-to-peer data sharing does not work
Wicherts and colleagues requested data from
141 articles in American Psychological Association
journals.
“6 months later, after … 400 emails, [sending]
detailed descriptions of our study aims, approvals
of our ethical committee, signed assurances not to
share data with others, and even our full
resumes…” only 27% of authors complied

Wicherts JM, Borsboom D, Kats J, Molenaar D (2006) doi:10.1037/0003-066X.61.7.726

Data is best captured at the time of publication
[Figure: information content of a dataset plotted against time. Labels along the curve: Time of publication; Specific details; General details; Retirement or career change; Accident; Death.]

(Michener et al. 1997)

CC-BY Adamo
http://www.piqs.de/fotos/121272.html

Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow,
Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.

Joint Data Archiving Policy (JDAP)
Data are important products of the scientific
enterprise, and they should be preserved and
usable for decades in the future.
As a condition for publication, data supporting the
results in the article should be deposited in an
appropriate public archive.
Authors may elect to embargo access to the data for
a period up to a year after publication.
Exceptions may be granted at the discretion of the
editor, especially for sensitive information.
http://datadryad.org/pages/jdap

High impact factor journals have stronger data archiving policies

[Chart labels: IF=6.0, IF=3.6, IF=4.5; n=70]

Piwowar HA, Chapman WW (2008) hdl:10101/npre.2008.1700.1

[Workflow diagram. Actors: author, journal editor, Dryad data curator.
Author: prepare manuscript and related data files; submit manuscript to the journal; upload data to Dryad.
Journal: manuscript review; editor decision (accepted? yes/no); send article description.
Dryad: Dryad data package; curation by the data curator; send data identifier (DOI).
Result: published article (with data citation) and published data (with article citation).]

When using this data, please cite the original article:
Chave J, Coomes D, Jansen S, Lewis SL, Swenson NG, Zanne
AE (2009) Towards a worldwide wood economics spectrum.
Ecology Letters 12: 351-366. doi:10.1111/j.1461-0248.2009.01285.x

Additionally, please cite the Dryad data package:
Zanne AE, Lopez-Gonzalez G, Coomes DA, Ilic J, Jansen S,
Lewis SL, Miller RB, Swenson NG, Wiemann MC, Chave J
(2009) Data from: Towards a worldwide wood economics
spectrum. Dryad Digital Repository. doi:10.5061/dryad.234
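
As a minimal sketch of how the paired citations above can be made machine-actionable, the snippet below turns each `doi:` string into a resolver link. It assumes the public DOI resolver at https://doi.org/; the names `doi_to_url` and `citation_links` are illustrative and are not part of any Dryad tooling.

```python
from urllib.parse import quote

# Assumption: DOIs are resolved through the public resolver at doi.org.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a 'doi:10.xxxx/yyy' string (prefix optional) into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Keep the '/' between DOI prefix and suffix; escape anything unsafe.
    return DOI_RESOLVER + quote(doi, safe="/")


def citation_links(article_doi: str, data_doi: str) -> dict:
    """Pair the article link with the link to its data package."""
    return {"article": doi_to_url(article_doi), "data": doi_to_url(data_doi)}


links = citation_links(
    "doi:10.1111/j.1461-0248.2009.01285.x",  # the original article
    "doi:10.5061/dryad.234",                 # the Dryad data package
)
print(links["article"])
print(links["data"])
```

Keeping the article DOI and the data DOI side by side mirrors the slide's request that both be cited when the data are reused.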

No fees for submission from low and lower middle income countries

Dryad by the numbers
Data packages: 4,172
Authors: 15,581
Data files: 11,912
Integrated journals: 37
All journals: 268
File downloads: 4,629,256

Stats as of 23 Oct 2013
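
For a sense of scale, a few ratios can be derived from these figures. The sketch below uses only the numbers on the slide; the dictionary keys and variable names are illustrative, and the journal share assumes that the 37 integrated journals are counted among the 268.

```python
# Figures copied from the slide (stats as of 23 Oct 2013).
stats = {
    "data_packages": 4_172,
    "authors": 15_581,
    "data_files": 11_912,
    "integrated_journals": 37,
    "all_journals": 268,
    "file_downloads": 4_629_256,
}

# Average number of files deposited per data package.
files_per_package = stats["data_files"] / stats["data_packages"]

# Average number of downloads per deposited file.
downloads_per_file = stats["file_downloads"] / stats["data_files"]

# Assumption: integrated journals are a subset of all journals.
integrated_share = stats["integrated_journals"] / stats["all_journals"]

print(f"{files_per_package:.1f} files per data package")
print(f"{downloads_per_file:.0f} downloads per file")
print(f"{integrated_share:.0%} of journals integrated")
```

On these numbers a package holds just under three files on average, and each file has been downloaded a few hundred times.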

To learn more
•  Repository home: http://datadryad.org
•  News: http://blog.datadryad.org
•  Project documentation: http://wiki.datadryad.org
•  Twitter: @datadryad
•  Code: http://code.google.com/p/dryad

or contact us:
•  http://datadryad.org/feedback
•  Todd Vision, Director, tjv@bio.unc.edu
•  Laura Wendell, Dryad Executive Director, lwendell@datadryad.org
