Prepared for 2010 Graduate seminar Informetrics and e-research (Prof. Han Woo Park), at Yeungnam Univ. in S. Korea<br />The Promise of Data in e-Research<br />Many Challenges, Multiple Solutions, Diverse Outcomes<br />Ann Zimmerman, Nathan Bos, <br />Judith S. Olson and Gary M. Olson<br />Presented by Kim KyoungEun<br />email@example.com<br />15. March 2010<br />
Introduction<br />▶ The period of ‘data deluge’ <br />(information explosion, information overload)<br />: The need to document, manage, transfer, analyze, and preserve digital data is a significant driver of the development of tools and technologies for e-research. <br />: It is not yet clear how the data deluge will affect research practice and outcomes. <br />The purpose of this chapter is to analyze different approaches to data sharing in order to identify important factors that may lead to success. <br />
Introduction<br />▶ Despite the importance of shared access to data and of collaboration across disciplines and distance, findings from both past and present studies show that efforts to share data face considerable social, organizational, legal, scientific, and technical challenges. <br />▶ The most significant obstacle is the individualism of scientists. <br /> : the ‘selfish scientist’<br />◈ Goble & De Roure (2007) <br /> – “e-science is, inherently, me-science” <br />
Introduction<br />▶ A study by a special committee of the Ecological Society of America found that fields where data sharing is common are characterized by a mixture of :<br /><ul><li>Technical capabilities such as free and easy software for data transfer
Scientifically motivated needs, especially the questions that researchers want to answer and
Socially influenced demands and incentives</li></li></ul><li>Introduction<br />▶ The methods used to make research data available outside the context in which they were collected raise a number of questions for those interested in e-research : <br /><ul><li>How do different data sharing approaches affect researchers’ abilities to reuse data collected by others?
Why are data sharing methods that achieve positive results in one context not effective in another case?
Do shared data get used, and if so, how are they used? </li></li></ul><li>Introduction<br />▶ The question we take up in this chapter is one that has, to date, received little direct attention: <br /> How do the origins of digital databases and the context from which they emerge affect research practice, researchers’ attitudes toward data sharing, and the relations between researchers and other actors such as computer scientists, data managers, and information scientists?<br />-> This chapter is designed to address this question. <br />
Data Sources and Methods<br />▶ data : “scientific or technical measurements, values calculated therefrom, and observations or facts that can be represented by numbers, tables, graphs, models, text, or symbols and that are used as a basis for reasoning or further calculation” <br />▶ Below we briefly describe our data corpus, which includes a meta-analysis of multiple distributed collaborations and a focused view of data sharing in one discipline. <br />
Data Sources and Methods<br />Science of Collaboratories<br />▶ The Science of Collaboratories (SOC)<br />: the name of a five-year project funded by the U.S. National Science Foundation (NSF) to study computer-supported distributed collaborations across many research disciplines. <br />: The overall goals of the SOC project were to: (1) perform a comparative analysis of collaboratory projects, (2) develop theory about this new organizational form, and (3) offer practical advice to collaboratory participants and to funding agencies about how to design and construct successful collaboratories. <br />
Data Sources and Methods<br />The Sharing and Reuse of Ecological Data<br />▶ Interviews were conducted to investigate the experiences of ecologists; data managers were also interviewed in order to obtain another perspective on the sharing and reuse of ecological data. <br />▶ The significant obstacles to sharing & reuse :<br />⒜ The data are widely dispersed, heterogeneous, and complex, which makes them difficult to locate and hard to reuse. <br />⒝ Social factors hinder data sharing, such as issues of ownership and a lack of reward for sharing. <br />
Data Sharing as a continuum<br />▶ The chapter draws on cases from the authors’ own research and examples from studies by other scholars to show that the outcomes of data sharing approaches occur along a continuum. <br />: At one end of the continuum are approaches that allow researchers to work as they always have, and the labor necessary to prepare data, make them available, and support their use is conducted by others.<br />In this case, data sharing considerations are not injected into the research process, but are managed by others after the fact. <br /><-> In contrast, solutions at the other end of the continuum force researchers to consider barriers to sharing, integration, and federation at the outset of data collection and to develop solutions in advance to deal with these issues.<br /> In this case, tighter links are formed between the production and the sharing of data. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />▶ It is hard to share data. There are many reasons for this and numerous approaches have been devised to overcome these challenges.<br />We describe some of the issues that make data sharing hard, and we analyze methods that have been developed to address them. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />▶ Bringing the widely dispersed data together in a centralized database has several potential advantages. It can help to avoid duplication of effort. <br />: The aggregation or integration of distributed data, which can be carried out by individuals, small teams of researchers, or a group of individuals with diverse skills, is a common way to create a publicly available data resource. <br />-> The following case studies illustrate some prototypical strategies designed to bring dispersed data together. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />1) Curating published data<br />※ WormBase<br /> : maintained by the International WormBase Consortium.<br /> : Published data are extracted and integrated into the WormBase database and made available to any user via the Internet.<br /> : the data benefit from reuse. <br /> : The work of WormBase curators is made possible by funding from the National Human Genome Research Institute and the British Medical Research Council. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />1) Curating published data<br />※ FlyBase<br />FlyBase, the primary source for molecular and genetic information about the Drosophila (fruit fly) genome, operates and is maintained in a fashion similar to WormBase. <br />※ The Ecological Society of America (ESA)<br /> : developed a digital archive for appendices and supplements, including raw data associated with papers published in ESA journals. <br /> : Since it relies on voluntary deposits of data, it lacks the comprehensiveness of WormBase and FlyBase. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />2) Data deposition as a requirement of publication<br />◈ Bos (2008) <br /> : identified economic incentives. <br /> : The economic method that has been most successful is the requirement that authors provide proof of data contribution as a prerequisite to publication.<br /> : ex) GenBank<br /> – GenBank is composed primarily of data associated with a publication, and it does not appear to have motivated researchers to contribute unpublished data. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />2) Data deposition as a requirement of publication<br />▶ Why do published data make up the majority of data in many aggregated databases?<br />① the time and effort required by researchers to fully document unpublished data. <br />② their concerns about being ‘scooped’ by competitors. <br />③ fears that their data will be misused. <br />▶ But there is considerable demand for unpublished data. <br />: WormBase and FlyBase are important resources for their research communities, but their value as a research tool has not motivated scientists to contribute their unpublished data to these databases. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />3) Contribution has its privileges<br />▶ Two of the NIH (National Institutes of Health)-funded biomedical collaboratories that we studied have attempted to motivate researchers to contribute data, particularly unpublished data, by granting special privileges to those who do so. <br /> : Consortium for Functional Glycomics (CFG)<br /> - ‘give in order to get’ strategy <br /> : Biomedical Informatics Research Network (BIRN) <br /> - developed a ‘rollout’ scheme & timeline<br /> - data are released first only to the producer, then to specified others, then to other members of the BIRN consortium, and lastly to the general public. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />4) Data as a publication<br />▶ Peer-reviewed publications, particularly journal articles, are the centerpiece of the formal scholarly communication and reward system. Some projects and publications have sought to make more data available by treating the compilation and synthesis of published and unpublished data as publications. <br /> : ex) the partnership between the influential journal Nature and the Alliance for Cellular Signaling’s (AfCS) Signaling Gateway <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Aggregating and Integrating Dispersed Data<br />4) Data as a publication<br />▶ There are other examples of treating data compilations as publications.<br />Ex 1) The Ecological Society of America (ESA) developed a new form of peer-reviewed publication called Data Papers, which are compilations and syntheses of mostly unpublished datasets. <br />Ex 2) Cochrane Reviews – Authors of Cochrane Reviews are encouraged to locate and incorporate unpublished data into the reviews. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Overcoming Semantic and Methodological Differences<br />▶ Two challenges make it difficult to integrate data: <br />1) Each discipline and sub-discipline has its own terminology and jargon.<br />2) Some fields, such as ecology, do not have widely standardized methods of data collection.<br />▶ The Geosciences Network (GEON)<br /> : GEON is a collaboration between geoscientists and computer scientists. The main goal of GEON is to enable researchers to access, synthesize, and model geoscience data from a wide variety of sources. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Overcoming Semantic and Methodological Differences<br />Standardizing in advance<br />▶ Another type of solution to the difficulties of sharing data considers impediments in advance of data collection. <br />: ex) researchers in one of the multi-institutional medical collaborations we studied spent almost a year developing standardized data collection and management protocols for aggregating data produced by the distributed collaboration. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Overcoming Semantic and Methodological Differences<br />Standardizing in advance<br />◈ Karasti, Baker, and Halkola (2006)<br /> : Findings by Karasti, Baker, and Halkola in regard to cross-site collaboration between data managers and researchers in the U.S. Long Term Ecological Research (LTER) Network are worth noting. <br /> : Karasti and her co-authors identify signs, such as dialog among stakeholders, that may be visible in advance of more dramatic changes in practices and attitudes related to data. <br />
Many Challenges, Multiple solutions, Diverse outcomes<br />Overcoming Semantic and Methodological Differences<br />Standardizing in advance<br />▶ Cyberinfrastructure is an important component in efforts to share large amounts of data. <br />There is evidence in the cases presented here that there are some instances in which authority resides in a larger set of actors, such as computer scientists and data managers, and is not dictated primarily by researchers. <br />
Discussion<br />▶ Visions of e-research emphasize large-scale databases that require massive storage capabilities, robust infrastructure for data management and transfer, and sophisticated tools for visualization and analysis. <br /> In this chapter, we have presented several cases to illustrate some of the factors that play a role and to show the continuum of outcomes that can result. <br />▶ We need to understand more about the complex factors that influence the sharing and reuse of data. <br /> Further, it is important to consider the goal when designing approaches to share data. <br />
Discussion<br />▶ Since the efforts devoted to data sharing divert time and resources from other activities, it is important to consider several questions. <br /><ul><li>What is the appropriate amount of activity that scientists should invest in sharing?
What degree of control should investigators expect to have over data that they share?
Do the benefits outweigh the financial and human costs of sharing?
Should all data be subject to the same sharing policies? </li></ul>-> Answers to these and other questions are critical to achieving the promise of data in e-research. <br />
Thank you for your attention!<br />Presented by Kim KyoungEun<br />firstname.lastname@example.org<br />