Reuse of Repository Data


Presentation of initial findings for Summer 2010 DataONE internship.

Published in: Technology, Education
  • The difference between data deposit vs. data reuse: data is created in the process of a study and then deposited in the repository, as opposed to the data being reused later for future studies and articles.
    Why is it important that we track the reuse of data?
      • Transparency: ensuring that neither the misinterpretation of data nor outcome bias affects future studies.
      • Collaboration: enabling researchers to share datasets with each other to find overlap and break new ground, as opposed to revisiting old territory. New data can either:
          • Confirm existing data
          • Refute existing data
          • Combine with existing data to form new conclusions
      • Healthy Competition: data citations could potentially bring the same level of prestige to a researcher or institute that article citations currently bring.
      • Invigoration: data that had gathered dust on the digital shelf gets new life when applied to new studies and articles.
    We track article citation, so why not data citation?
  • Email [06/28/2010 01:34:12 AM EDT]
    Valerie, Heather Piwowar has given me your email address as I wanted to contact you directly in relation to the project you are doing this summer. It sounds like a really interesting project and I, for one, will be really keen to see what you come up with. I should imagine that others will find it equally interesting.
    The reason I wanted to write to you directly was to let you know of some of the activities of the Australian National Data Service (ANDS), as these might relate to your work. It's also to let you know that there is a huge amount going on in the US and elsewhere about the whole issue of data re-use and sharing.
    ANDS was established early last year with funding from the Australian Department of Innovation, Industry, Science and Research. You can read all about us at our website, which is linked below. One of the important aspects of our work is developing the Australian Research Data Commons, which is to say that we want to identify data sets available in Australian research institutions or through government agencies, and make that information available as widely as we can. Metadata about data sets is provided to a central registry, and the public face of that is a web interface called Research Data Australia ( http:// / ). The Australian government invests heavily in research, as do most governments, so it is keen to ensure that data is valued as an output of research and that it is available for re-use wherever that is possible.
    As part of this overall project we will be providing data sets published through us with DOIs to enable tracking of citations. We have joined DataCite ( http:// / ) along with other international partners and hope to start minting DOIs within the next couple of months. We have also had discussions with both Elsevier and ThomsonReuters about using them to track data citations, and both are interested. Elsevier has already successfully conducted trials with TIB (the German National Library of Science and Technology) to test this. We discussed the lack of a data publication format in EndNote with ThomsonReuters, who own it, so they are aware of the issue.
    This is a very brief outline of some of the activities we are doing in the area, but you can learn more from our short guide at . As you can see from all of this, we are very interested in the whole topic of data sharing and re-use and all of those things, such as data citation, which are linked to the process. If you have any questions we would be happy to answer them.
    With best wishes, Margaret (Henty)

    Email [07/02/2010 09:38:23 AM EDT]
    I have just seen your blog entries and the data resources page on citation and have found them very interesting, as I have been doing some work on citation of data, most recently in the context of the Sage initiative. http:// / (At one time I used to work on the eBank UK project, and I was bemused to see the page created for that project rather a long time ago referenced in your list of resources :)
    I am forwarding some more recent resources that I am aware of. Apologies for the formatting (or lack of it), as I have just copied these from the wiki, and for just emailing them rather than going through the process of adding them to your wiki. The first in particular reviewed the citation policies of a number of data repositories. I hope that not many of them are repeated on your current list. I look forward to hearing further results of your research in the future.
    Best wishes, Monica Duke
  • Initial search process: test searches for TreeBASE yielded sample articles, study accession numbers and data author names to search for later.
    ORNL DAAC: Oak Ridge National Laboratory Distributed Active Archive Center.
  • Reasons for miss captured:
      • No mention of the repository itself (ex: "Pangaea" referring to the supercontinent)
      • Only a mention of the repository (as in articles only about repositories, citing TreeBASE/Pangaea/ORNL DAAC as examples)
      • Articles whose authors deposited data into the repositories rather than downloading data from them
      • Other
    Reasons for hit captured:
      • Citation mentioning the repository name
      • Citation of a DOI or study accession number
      • Full citation according to repository recommendations (varies)
      • Citation of the data author name
    Interpretation:
      • Browse through the observations made within the OpenWetware journal entries.
      • Look through the Search Comparisons spreadsheet for the percentage of hits versus misses as well as the types of hits and misses that occurred.
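    The hit/miss percentages kept in the spreadsheet can also be tallied in a few lines of code. This is only a minimal sketch: the coded results below are made-up placeholders rather than data from the study, and the function name `summarize` is hypothetical.

```python
from collections import Counter

# Hypothetical coded results (NOT data from the study): one
# (outcome, reason) pair per article read from a result list.
coded_results = [
    ("hit", "repository name cited"),
    ("hit", "DOI or study accession number cited"),
    ("miss", "data deposited, not downloaded"),
    ("miss", "name refers to the supercontinent"),
    ("miss", "repository mentioned only as an example"),
]


def summarize(results):
    """Return (percentage per outcome, count per (outcome, reason))."""
    outcomes = Counter(outcome for outcome, _ in results)
    total = sum(outcomes.values())
    if total == 0:
        return {}, Counter()
    percentages = {
        outcome: round(100 * count / total, 1)
        for outcome, count in outcomes.items()
    }
    return percentages, Counter(results)


percentages, reasons = summarize(coded_results)
```

    Keeping the reason next to each outcome means the same list answers both questions: how many hits versus misses, and which kinds of hit or miss occurred.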
  • Finding focus and the difficulty of going beyond the obvious
      • A mention of a repository could mean either that data was deposited there or that data was downloaded from there.
      • Sometimes narrowing search terms with boolean operators or “-” exclusion returned no results at all, while broadening back out returned too many results to read through manually. A large result list can either be too much of a good thing, or just too much. A small list, however, makes me worry that I'm excluding hits that would be valid by my fuzzy parameters, which the search function cannot process very well.
      • TreeBASE study accession numbers cited in an article may have changed over time (from StudyID to LegacyID after study publication).
      • “Pangaea” can refer to either the data repository or the Pangaea supercontinent. How do I exclude these results? What if there is an article that mentions both? Do I risk excluding that article?
      • Google Scholar does not make the distinction between published journal articles and non-journal publications like dissertations deposited into academic repositories.
    “Missing” searches (use a table like the Search Methodology Table at DataONE:Notebook/Reuse_of_repository_data/2010/06/28#Search_Methodology_Table as a visual aid)
      • For the sake of thoroughness, I intended to go through each possible search combination. Not all searches worked, and I did not originally record the ones that failed in my notebook. However, it is important to record these “failures” for future reference, so I went back through each search and recorded the results. Using this table also showed me that I had missed some possible combinations.
    Article cited vs. data cited
      • Is it possible that the majority of the citations I find only cite data from articles, where the researcher found the information in the article without ever looking at the data itself?
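    One way to catch “missing” searches is to generate every repository/database/search-term combination up front and compare it with what has been recorded. A minimal sketch, assuming the repositories, databases and term types listed in the Methods slide; the function names and the example notebook entry are hypothetical.

```python
from itertools import product

# Term types per repository, as listed in the Methods slide.
REPOSITORY_TERMS = {
    "TreeBASE": ["repository name", "study accession number", "data author name"],
    "Pangaea": ["repository name", "DOI prefix", "data author name"],
    "ORNL DAAC": ["repository name", "DOI prefix", "data author name", "project name"],
}
DATABASES = ["ISI Web of Science", "Scirus", "Google Scholar"]


def all_combinations():
    """Every (repository, database, term type) search that should be tried."""
    return {
        (repo, db, term)
        for repo, terms in REPOSITORY_TERMS.items()
        for db, term in product(DATABASES, terms)
    }


def missing_searches(recorded):
    """Combinations not yet recorded, including searches that returned nothing."""
    return sorted(all_combinations() - set(recorded))


# Hypothetical notebook state: only one search has been written down so far.
recorded = [("TreeBASE", "Scirus", "repository name")]
todo = missing_searches(recorded)
```

    A search that returns no results still counts as recorded here, which matches the point above that failures are worth writing down.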
  • Findings by Repository
    TreeBASE:
      • Mentions of the repository name were found in all three search databases, but a distinction had to be made as to whether data was deposited into or downloaded from TreeBASE. The search limits for this were constructed from the wording of individual instances within found articles, and they varied in effectiveness and structure between Scirus and Google Scholar.
      • ISI: TreeBASE was listed as part of the Cited Author field. ISI did not allow for searching by study accession number. Searching for specific author names and article publications proved more useful, since TreeBASE lists citations by article and ISI Web of Science Cited Reference Search is geared towards finding citations of specific articles as opposed to datasets.
      • Even using boolean operators to limit searches yielded limited results, still often with more misses than hits. Searching by general mention of a “study accession number” was not very useful, and the prefix “S####” was too general and vague for all search functions. Searches for individual study accession numbers (ex: S1515, S2376, S1977) pulled no results at all in Scirus or Google Scholar.
      • Scirus: general searches for TreeBASE with boolean limits pulled more misses than hits, while searching for a specific study accession number (ex: S1515) proved too specific and usually pulled the message “no results matched your query.” Searching by data author name proved limited, as the searches often pulled either articles by that particular author or articles which did not cite the dataset or the article derived from the original dataset, even when year and publisher were included in the search.
      • Google Scholar: searching for general mentions of TreeBASE yielded almost all misses. Searching for a specific data author proved more useful, with narrower results and more hits than misses, yet no mentions of study accession numbers were found in the resulting articles.
    Pangaea:
      • ISI pulled a few hits just from general mention of Pangaea in the Cited Author or Cited Work field. More hits (in the dozens) came when specific authors and articles were included in the search field. ISI did not allow for searching by DOI. Searching by author was the most effective in ISI as far as the hit-to-miss ratio in the search results.
      • Scirus: using a controlled boolean search with the repository name proved ineffective, while searching by DOI prefix only pulled a few results (12) with even fewer hits (5). Then again, the DOI prefix search did not include the same controlled vocabulary. Searching by data author name proved useless in Scirus as well, since it pulled either articles by the author or non-journal resources.
      • Google Scholar pulled a lot of articles with the Pangaea DOI prefix, but there were far too many results to read through manually (1000+). Further search refinements may still be needed.
    ORNL DAAC:
      • ISI once again worked best when searching for the specific authors and articles for which the data was originally created. DOIs could be found, but again could not be specifically entered in search fields.
      • Scirus proved the most effective when using boolean operators with a keyword search of articles mentioning the repository name. Searching by specific project databases like BOREAS or FLUXNET also proved useful. Many of the articles found were housed in Elsevier's Science Direct.
      • Google Scholar took well to the ORNL DOI prefix when compared to the Pangaea DOI prefix. Although it pulled fewer results, the majority of the results found were hits. Searching for the project databases and data author names also found more solid hits.
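    Once the full text of an article is in hand, the identifier formats listed in the Methods slide can be matched mechanically. A minimal sketch, assuming four-digit TreeBASE study numbers (as in the S1515 example) and all-numeric DOI suffixes; both are assumptions rather than repository rules, and the DOI suffix in the example sentence is made up.

```python
import re

# Identifier formats from the Methods slide. The digit counts are
# assumptions based on the examples, not repository specifications.
PATTERNS = {
    "TreeBASE": re.compile(r"\bS\d{4}\b"),
    "Pangaea": re.compile(r"\b10\.1594/PANGAEA\.\d+", re.IGNORECASE),
    "ORNL DAAC": re.compile(r"\b10\.3334/ORNLDAAC/\d+", re.IGNORECASE),
}


def find_identifiers(text):
    """Return {repository: [identifiers found in text]}."""
    found = {}
    for repo, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        # "S####" alone is too vague (e.g. supplementary table labels),
        # so only accept it when TreeBASE is also named in the text.
        if repo == "TreeBASE" and "treebase" not in text.lower():
            matches = []
        if matches:
            found[repo] = matches
    return found


# Made-up sentence for illustration; the DOI suffix is not a real dataset.
sample = ("Trees were downloaded from TreeBASE (study S1515) and "
          "core data from doi:10.1594/PANGAEA.123456.")
identifiers = find_identifiers(sample)
```

    Requiring the repository name next to a bare study number reflects the finding above that the “S####” prefix on its own was too general for every search function.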
  • Confirmed: finding data citations in journal articles is difficult. “Like trying to find someone on Facebook only knowing their hair color and favorite breakfast cereal.”
    Why is this?
      • Even if a repository has recommendations or best practices for data citation, they are not always used consistently.
      • Even if a repository has a DOI or other identifier, it is not always mentioned in the citation. In the case of a repository like ORNL DAAC, where there are more ways to find data, it is less difficult. Even so, if just one method were used consistently, searching for data citations would be much easier.
      • Google Scholar is a bit too simplified. While searching through the full text of an article is useful, there are some things that search fields are better at doing, such as the Cited Author or Cited Work field in Cited Reference Search.
      • Scirus has a nice faceted search function that allows you to see only journal articles and narrow things down further to individual journals. However, it too lacks a function that searches for citations.
      • ISI, Scirus and Google Scholar do all have “cited by # articles” functionality.
      • While more human-friendly, federated search engines like Scirus and Google Scholar aren’t very machine-friendly.
      • There is little distinction between data deposited into repositories and data downloaded from repositories for an article or study. Some ways of finding out without reading the whole article: for data being added to a repository, search for the phrases “are available,” “can be downloaded,” “can be obtained from” or “uploaded into [repository]”; for data taken from a repository, search for the phrases “downloaded from” or “obtained from.” Hence another conundrum: the words “downloaded” and “obtained” are used in both contexts. Even simplifying searches to include “from TreeBASE/Pangaea/ORNL DAAC” and exclude “into TreeBASE/Pangaea/ORNL DAAC” has limited success, since full-text search often ignores the quotes around the phrase, resulting in many irrelevant results.
      • Even if a data download is mentioned, it is not always cited in the Reference section (often it is just mentioned in the introduction, methods or results sections instead).
    Things that would have made the search process easier or may encourage data citation:
      • Consistent use of a data citation format, either as recommended by the repository or by the publication
      • Consistent use of unique persistent identifiers
      • Assigning a weight to data repositories/data sets, similar to impact factors for journals and articles, based on criteria such as how often the data is cited, the impact factor of the journals in which the data is cited, and how often the data is updated
      • Metadata tags for articles that indicate data citation
      • Functions within search engines that search for data citations within metadata (similar to ISI Web of Science Cited Reference Search)
    While TreeBASE has BibTeX and RIS integration, it was still difficult to find citations in other articles.
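    The deposit-versus-download phrases can be applied to individual sentences once the text is available locally, which sidesteps search engines ignoring quoted phrases. A minimal sketch using only the phrases listed in these notes; the labels and the function name are hypothetical, and a real article would need more phrases than these.

```python
# Phrase lists taken from the notes above.
DEPOSIT_PHRASES = [
    "are available",
    "can be downloaded",
    "can be obtained from",
    "uploaded into",
]
REUSE_PHRASES = ["downloaded from", "obtained from"]


def classify_sentence(sentence):
    """Label a sentence as 'deposit', 'reuse', 'ambiguous' or 'unknown'."""
    text = sentence.lower()
    deposit = any(phrase in text for phrase in DEPOSIT_PHRASES)
    # Blank out deposit phrases first so that "can be obtained from"
    # is not also counted as the reuse phrase "obtained from".
    remainder = text
    for phrase in DEPOSIT_PHRASES:
        remainder = remainder.replace(phrase, " ")
    reuse = any(phrase in remainder for phrase in REUSE_PHRASES)
    if deposit and reuse:
        return "ambiguous"
    if deposit:
        return "deposit"
    if reuse:
        return "reuse"
    return "unknown"


labels = [
    classify_sentence("Alignments can be downloaded from TreeBASE."),
    classify_sentence("Alignments were downloaded from TreeBASE."),
    classify_sentence("We sequenced twelve new specimens."),
]
```

    Checking the longer deposit phrases before the shorter reuse phrases is one way around the conundrum that “downloaded” and “obtained” appear in both contexts; sentences that still match both are flagged for manual reading rather than guessed at.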
  • Future work:
      • Other repositories, search terms and databases.
      • Bigger samples using different time periods (2005-2010 as opposed to 2008-2010).
      • It would be interesting to capture whether an article agrees with or refutes the data cited from a previous article.
      • Compare data with Nic and Sarah to see how much influence the data citation policies of journals and repositories have on authors/researchers, and to see how many articles out of a random sample deposit, reuse and/or cite data.
      • Also, compare data with other interns (especially the Baseline Assessment of Data Practices of Libraries and Librarians project).
    Article: some publication submission ideas:
      • Collection Management: This journal recently ran an article titled "The Use of Web of Knowledge to Study Publishing and Citation Use for Local Researchers at the Campus Level" (doi:10.1080/01462671003597959), in which the authors used ISI Web of Science to seek and identify periodical literature citing local researchers.
      • DLib: Link provided by Heather.
      • Information Services & Use (Author Guidelines): International journal focusing on information technology, particularly applications to business and scientific fields.
      • Informing Science: Quote from their about page: "The academically peer refereed journal Informing Science endeavors to provide an understanding of the complexities in informing clientele. Fields from information systems, library science, journalism in all its forms to education all contribute to this science. These fields, which developed independently and have been researched in separate disciplines, are evolving to form a new transdiscipline, Informing Science. Informing Science publishes articles that provide insights into the nature, function and design of systems that inform clients. […] The ideal paper will serve to inform fellow researchers, perhaps from other fields, of contributions to this area."
      • International Digital Curation Conference Call for Papers: Link provided by Nic.
      • Journal of the American Society for Information Science & Technology: Quoted from their page: "The Journal welcomes rigorous work of an empirical, experimental, ethnographic, conceptual, historical, socio-technical, policy-analytic, or critical-theoretical nature. JASIST also commissions in-depth review articles (Advances in Information Science) and reviews of print and other media." I find this relevant to my interests.
      • Journal of Information Science: Quoted from their about page: "The Journal of Information Science is an international journal of high repute covering topics of interest to all those researching and working in the sciences of information and knowledge management. The Editors welcome material on any aspect of information science theory, policy, application or practice that will advance thinking in the field."
      • Library Technology Reports: As a publication of the American Library Association, this could reach a wide audience of librarians interested in born-digital holdings or technological changes in scientific research.
      • Scientometrics: Even if my study turns out to be more qualitative than quantitative, this may be useful for Nic and Sarah to consider.
  • Reuse of Repository Data

    1. Reuse of Repository Data Valerie Enriquez – DataONE – Summer 2010
    2. Motivation
       • Data deposit vs. data reuse
       • Why track the reuse of data?
         • Transparency
         • Collaboration
           • Confirm existing data
           • Refute existing data
           • Combine with existing data to form new conclusions
         • Healthy Competition
         • Invigoration
    3. Initial Questions
       • How is data currently cited and how often?
       • How do we find data citations using available resources (search engines, databases, etc.)?
       • How difficult is it to find data citations using these tools and why?
       • What are the best/worst ways to find data citations?
       • How do the citations vary across discipline, repository and publication?
       • What is the most common citation? Repository name? Data author name? Unique identifier like a study number or DOI?
    4. To whose benefit?
       • Scientists
       • Academic researchers
       • Students
       • Anyone who uses or deposits data
       • Anyone interested in the citation or reuse of data
       • Similar projects
         • See also: list of projects, discussion and editorials on the OpenWetware DataONE Web Resources page:
    5. Methods
       • Initial search process: Test TreeBASE searches
       • Focused search
         • Repositories
           • TreeBASE
           • Pangaea
           • ORNL DAAC
         • Databases
           • ISI Web of Science Cited Reference Search
           • Scirus
           • Google Scholar
       • Limits
         • Date range: 2008-2010
         • Language: English
         • Journal articles only
       • Repository-specific search terms
         • TreeBASE: repository name, study accession number (S####), data author name
         • Pangaea: repository name, DOI prefix: 10.1594/PANGAEA.######, data author name
         • ORNL DAAC: repository name, DOI prefix: 10.3334/ORNLDAAC/###, data author name, project name (BOREAS, FLUXNET, etc.)
    6. Initial Analysis
       • Search comparison spreadsheet hosted here
         • Search methods, terms and datasets used to construct search terms were captured as well as the total number of results followed by respective hits and misses.
         • Percentages of hits vs. misses calculated within the spreadsheet.
         • Reasons for miss captured
         • Reasons for hit captured
       • Shared fields template from Sarah with my input data hosted here
         • Hosts data about individual articles, including DOIs as applicable, metadata and coding for hits and misses.
    7. Stumbles and other Worrisome Things
       • Finding focus and the difficulty of going beyond the obvious
       • “Missing” searches
       • How broad is too broad? How narrow is too narrow?
       • Article cited vs. data cited
    8. Initial Findings
       Key: $ = effective, # = ineffective, * = invalid field input
       • TreeBASE
         • ISI Web of Science: $ Repository name; * Study Accession Number; $ Cited Author Name/original publication title/date
         • Scirus: $ Repository name; # Study Accession Number; # Cited Author Name/original publication title/date
         • Google Scholar: # Repository name; # Study Accession Number; # Cited Author Name/original publication title/date
       • Pangaea
         • ISI Web of Science: $ Repository name; * DOI prefix; $ Cited Author Name/original publication title/date
         • Scirus: Repository name; $ DOI prefix; # Cited Author Name/original publication title/date
         • Google Scholar: # Repository name; $ DOI prefix; # Cited Author Name/original publication title/date
       • ORNL DAAC
         • ISI Web of Science: $ Repository name; * DOI prefix; $ Cited Author Name/original publication title/date
         • Scirus: $ Repository name; $ DOI prefix; $ Cited Author Name/project name/original publication title/date
         • Google Scholar: # Repository name; $ DOI prefix; $ Cited Author Name/project name/original publication title/date
    9. Lessons Learned
       [Image] “Hey, I think I found that data citation you were looking for.”
    10. Where do we go from here?
       • Solidify conclusions from initial findings.
       • Compare data with other interns.
       • Examine other repositories, search terms and databases.
       • Write article about how difficult it is to find data reuse citations. Some possible publications:
         • Collection Management
         • DLib (link provided by Heather)
         • Information Services & Use (Author Guidelines)
         • Informing Science
         • International Digital Curation Conference Call for Papers (link provided by Nic)
         • Journal of the American Society for Information Science & Technology
         • Journal of Information Science
         • Library Technology Reports
         • Scientometrics