Data as research output, data as part of the scholarly record


Talk given at SciELO15, 24 October 2013, São Paulo, Brazil. Video followed by slides.

Published in: Technology


  1. Data as research output; data as part of the scholarly record. Todd Vision, University of North Carolina at Chapel Hill, Dryad Digital Repository. SciELO15, 24 October 2013, São Paulo
  2. CC-BY-NC-SA nic221 http://
  3. Source: IFEX http:// united_states/2013/09/05/cipa_libraries/
  4. [Slide image: overlapping pages from published journal articles, dense with tables and figures embedded in the text. Legible fragments include Burleigh et al., "Inferring the Plant Tree of Life from Gene Trees" (Table 1, summary of supertree bootstrap support from the GTP analysis; Figure 2, average quartet similarity for each taxon among bootstrap trees); "A computational system to select candidate genes for complex human traits" (Table 2, tests using susceptibility genes for complex human traits); and Knies et al. (Table 3, transition–transversion rate ratios; Fig. 2, best-fit nucleotide substitution models for each alignment; Fig. 3, transition–transversion rate ratios for each alignment).]
  5. Relatively little data is published within articles. Levels shown: published tables and figures, analysed data, raw data.
  6. Reuse of open data boosts citations to the original article. Piwowar and Vision (2013) doi:10.7717/peerj.175
  7. Most analyzed data is in the ‘long tail’, for which there is no specialized repository. Figure axes: volume vs. rank frequency of datatype. Regions: structured data (e.g. GenBank, GBIF); long-tail data. After Heidorn (2008) doi:10.1353/lib.0.0036
  8. Peer-to-peer data sharing does not work. Wicherts and colleagues requested data from 141 articles in American Psychological Association journals. “6 months later, after … 400 emails, [sending] detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes…” only 27% of authors complied. Wicherts JM, Borsboom D, Kats J, Molenaar D (2006) doi:10.1037/0003-066X.61.7.726
  9. Data is best captured at the time of publication. Figure axes: information content vs. time. Labels: specific details, general details, time of publication, retirement or career change, accident, death. (Michener et al. 1997)
  10. CC-BY Adamo http:// Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.
  11. Joint Data Archiving Policy (JDAP). Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. As a condition for publication, data supporting the results in the article should be deposited in an appropriate public archive. Authors may elect to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information.
  12. High impact factor journals have stronger data archiving policies. Chart labels: IF=6.0, n=70, IF=3.6, IF=4.5. Piwowar HA, Chapman WW (2008) hdl:10101/npre.2008.1700.1
  13. [Workflow diagram. Actors: author, journal (editor), Dryad (data curator). Labels: prepare manuscript and related data files; submit manuscript; manuscript review; accepted? (yes/no); send article description; upload data; Dryad data package; send data identifier (DOI); curation; published article (with data citation); published data (with article citation).]
  14. When using this data, please cite the original article: Chave J, Coomes D, Jansen S, Lewis SL, Swenson NG, Zanne AE (2009) Towards a worldwide wood economics spectrum. Ecology Letters 12: 351-366. doi:10.1111/j.1461-0248.2009.01285.x Additionally, please cite the Dryad data package: Zanne AE, Lopez-Gonzalez G, Coomes DA, Ilic J, Jansen S, Lewis SL, Miller RB, Swenson NG, Wiemann MC, Chave J (2009) Data from: Towards a worldwide wood economics spectrum. Dryad Digital Repository. doi:10.5061/dryad.234
  15. No fees for submission from low- and lower-middle-income countries
  16. Dryad by the numbers (stats as of 23 Oct 2013): data packages 4,172; authors 15,581; data files 11,912; integrated journals 37; all journals 268; file downloads 4,629,256
  17. To learn more: Repository home: News: Project documentation: Twitter: @datadryad Code: or contact us: Todd Vision, Director, Laura Wendell, Dryad Executive Director,
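Slide 14 asks anyone reusing the data to cite both the original article and the Dryad data package. As a minimal sketch of that paired citation, the Python below renders the two entries from the slide and builds a resolver link for the data DOI. The `Citation` class, its field names, and the `cite_with_data` helper are hypothetical names introduced for illustration and are not part of any Dryad tooling; the only outside fact assumed is that DOIs resolve through `https://doi.org/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Illustrative citation record; field names are assumptions, not a Dryad schema."""
    authors: str
    year: int
    title: str
    venue: str
    doi: str

    def doi_url(self) -> str:
        # DOIs resolve through the public doi.org resolver.
        return f"https://doi.org/{self.doi}"

    def render(self) -> str:
        # Plain-text form modelled on the wording of slide 14.
        return f"{self.authors} ({self.year}) {self.title}. {self.venue}. doi:{self.doi}"


def cite_with_data(article, data):
    """Return the article citation first, then the data package citation."""
    return [article.render(), data.render()]


# Both records are copied from slide 14.
article = Citation(
    authors="Chave J, Coomes D, Jansen S, Lewis SL, Swenson NG, Zanne AE",
    year=2009,
    title="Towards a worldwide wood economics spectrum",
    venue="Ecology Letters 12: 351-366",
    doi="10.1111/j.1461-0248.2009.01285.x",
)
data_package = Citation(
    authors=("Zanne AE, Lopez-Gonzalez G, Coomes DA, Ilic J, Jansen S, "
             "Lewis SL, Miller RB, Swenson NG, Wiemann MC, Chave J"),
    year=2009,
    title="Data from: Towards a worldwide wood economics spectrum",
    venue="Dryad Digital Repository",
    doi="10.5061/dryad.234",
)

for line in cite_with_data(article, data_package):
    print(line)
print(data_package.doi_url())
```

Keeping the article and the data package as separate records mirrors the slide's point that each has its own DOI and can be cited and counted independently.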