Let’s Move Beyond Open Data to Open Development?
In July this year, the Sunday Business section of the New York Times featured a story about the Bank’s Open Data initiative and claimed that datasets and information will ultimately become more valuable than Bank lending.
This is not about the World Bank as the central repository of knowledge sharing its knowledge and wisdom with clients from the South.
It is about “democratizing development economics” in that it levels the playing field on knowledge creation and dissemination and opens the development paradigm to participation from researchers and practitioners, software developers and students, from north and south.
SRF
The CGIAR is unique in having the capacity to collect experimental, monitoring, and survey data on agricultural systems throughout the developing world. Most data collected by CRPs, whether broad-scale data used to describe and monitor farming system changes, or focused data collected to examine specific processes and hypotheses, should be of such potential value that the cost of archiving and sharing is justified by the value added in terms of expanded research results from the use of that data by a wider research community.
“This is one clear and consistent message from the last CGIAR science forum: data archiving of the CG Centers is overall abysmally poor”
Robert Nasi, Director, Forests, Trees and Agroforestry (CRP6)
Program Participant Agreement
What data platform are we going to use?
Research Data Repository
What should be deposited
1. All research data belonging to publications
2. High value data sets of interest to ICRAF, other CG centers & Partners
Research Data Management Policy
The Policy
All of the Centre’s data needs to be:
a) derived from research relevant to our agenda, to the development challenges in our strategy, to the Strategy and Results Framework (SRF) of the CGIAR and to the CGIAR Research Programs (CRPs);
b) of high quality (well designed, well collected, well verified, well documented);
c) protected and archived;
d) available (people know that it exists) and easily accessible (people can gain access to the data) to all;
e) adaptable, so that it can be well utilized and transformed where possible into actionable knowledge.
Whose responsibility?
The Centre
• setting up clear protocols, conducting peer reviews, using robust and well-documented methods and appropriate statistical analyses, and producing meta-analyses and syntheses of results
• providing a stable, reliable data repository system that can handle both document-centric and data-centric objects
• ensuring that all raw data needed to reproduce or replicate every scientific publication based on research data is made public
Whose responsibility?
The project/scientist
• complying with explicit quality standards
• submitting the necessary raw, verified data for every scientific publication in standard file formats
• ensuring that research data produced for the Centre is described by appropriate metadata throughout its lifecycle
How do we achieve the highest scientific standards?
RMG: quality control throughout the data lifecycle (collecting, verifying, managing, analyzing, storing)
Beyond RMG: ensure that all staff follow the institutional standards and guidelines. The ultimate benchmark for all scientists, however, is the consensus of peers.
Research Data Repository
Challenge: move data from scientists’ laptops to an institutional server and have the data described by sufficient metadata, without increasing transaction costs or creating an auditing issue.
Dataverse Network
• The Dataverse Network is an application to publish, share, reference, extract and analyse research data.
• It facilitates making data available to others, and enables replication of work.
• Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.
Dataverse Network
• A Dataverse Network hosts multiple Dataverses.
• Each Dataverse contains studies or collections of studies.
• Each study contains cataloguing information that describes the data plus the actual data files and complementary files.
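The containment hierarchy described on this slide can be sketched as a small data model. This is only an illustration of the nesting (network, dataverse, collection, study), not the actual Dataverse schema: the class names, field names and the file names in the example are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Study:
    """A study: cataloguing information plus data and complementary files."""
    title: str
    cataloguing_info: Dict[str, str]
    data_files: List[str] = field(default_factory=list)
    complementary_files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    """A named group of studies inside a dataverse."""
    name: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class Dataverse:
    """A dataverse holds studies directly or grouped into collections."""
    name: str
    studies: List[Study] = field(default_factory=list)
    collections: List[Collection] = field(default_factory=list)

    def all_studies(self) -> List[Study]:
        """Return direct studies followed by those held in collections."""
        found = list(self.studies)
        for collection in self.collections:
            found.extend(collection.studies)
        return found


@dataclass
class DataverseNetwork:
    """A network hosts multiple dataverses."""
    name: str
    dataverses: List[Dataverse] = field(default_factory=list)


# Hypothetical example: one study in a collection of the ICRAF dataverse.
study = Study(
    title="Replication data for: Development pathways in medium-high potential Kenya",
    cataloguing_info={"distributor": "World Agroforestry Centre"},
    data_files=["survey.tab"],           # hypothetical file name
    complementary_files=["readme.txt"],  # hypothetical file name
)
icraf = Dataverse(name="icraf", collections=[Collection("Kenya", [study])])
network = DataverseNetwork(name="IQSS Dataverse Network", dataverses=[icraf])
print(len(network.dataverses[0].all_studies()))
```

Keeping the cataloguing information on the study, next to its files, mirrors the slide's point that the description and the data travel together.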
Data Backup & Preservation
• The IQSS Dataverse Network maintains a full backup of all data and directories on the Network for 6 months, in the Harvard Depository. This means that there always is a full, offsite copy of the Network that is less than 7 months old.
• IQSS will maintain on-line storage, backup, and media migration sufficient for all studies it accepts (in addition to storage provided for the IQSS DVN).
• The Henry A. Murray Archive, through its endowment, supports permanent bit-level preservation of all social science research studies directly deposited in the IQSS Dataverse Network.
http://thedata.org/book/data-backup-terms
Hosting
• There are two approaches:
1. You can download and install the Dataverse Network Application and effectively become a host; or
2. You can create a Dataverse on the *IQSS Dataverse Network at Harvard University. This Network is open to all researchers, publishers and data distributors.
• Option 1 gives you more control but includes added responsibility & cost.
*Institute for Quantitative Social Science
Hosting – IQSS Option
• Advantages
– Dataverse software is installed, hosted and managed for you by IQSS
– Dataverse is hosted in Harvard’s infrastructure, which is very good
– IQSS offers great support in helping you set up your dataverse and provides great help if you run into any problems
• Disadvantages
– Network-level administrative tasks cannot be done. These include:
  • Creating user groups based on IP address or IQSS network user names
  • Creating harvesting dataverses, which allow you to share meta data with other systems, e.g. Dspace. Sharing includes exporting and importing meta data.
  • Complete deletion of studies, not just deaccession
  • Accessing web statistics
– Cannot use alias URLs to point to your dataverse, e.g. we cannot have the URL http://data.worldagroforestry.org pointing to the ICRAF IQSS dataverse http://dvn.iq.harvard.edu/dvn/dv/icraf
Data Citation for each study
• Dataverse allows digital research data to be cited from published printed work.
• The citation is generated automatically when a study is created.
• Data Citation format:
Author, Date, “Title”, Persistent Identifier Universal Numerical Fingerprint (UNF) Distributor or other optional fields [ …]
Unique Citation Components
1. Persistent Identifiers
– Offer permanent and reliable links to digital objects.
– Use the Handle System, e.g. hdl:1902.1/15673
2. Universal Numerical Fingerprint
– Applied to quantitative data
– Used to uniquely identify and verify data, e.g. 5:G22I+TtPQPAyFcRT6SrUfA==
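As a rough illustration of how these two components can be handled in software, the sketch below turns a handle into a resolver link and splits a fingerprint into its parts. The helper names are hypothetical. Reading the trailing part of the fingerprint as a base64-encoded digest is an assumption based on its appearance, and the sketch does not recompute or verify the fingerprint against any data.

```python
import base64
from typing import Tuple

# Public Handle System proxy, as used in the deck's example citation.
HANDLE_RESOLVER = "http://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Turn 'hdl:<prefix>/<suffix>' into a resolvable link."""
    scheme = "hdl:"
    if not handle.startswith(scheme):
        raise ValueError(f"not a handle: {handle!r}")
    return HANDLE_RESOLVER + handle[len(scheme):]


def parse_unf(unf: str) -> Tuple[int, bytes]:
    """Split 'UNF:<version>:<digest>' into a version number and raw bytes.

    Assumption: the digest part is base64 text. This only checks the
    shape of the string; it does not verify any dataset.
    """
    parts = unf.split(":", 2)
    if len(parts) != 3 or parts[0] != "UNF":
        raise ValueError(f"not a UNF string: {unf!r}")
    version = int(parts[1])
    digest = base64.b64decode(parts[2], validate=True)
    return version, digest


url = handle_to_url("hdl:1902.1/15673")
version, digest = parse_unf("UNF:5:G22I+TtPQPAyFcRT6SrUfA==")
print(url)
print(version, len(digest))
```

A shape check like this catches copy-and-paste damage in a citation, such as a stray space inside the handle, before the link is published.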
Example of Citation
Frank Place; Patti Kristjanson; Steve Staal; Russ Kruska; Tineke deWolff; Robert Zomer; E C Njuguna, 2005, "Replication data for: Development pathways in medium-high potential Kenya: a meso-level analysis of agricultural patterns and determinants.", http://hdl.handle.net/1902.1/15673 UNF:5:G22I+TtPQPAyFcRT6SrUfA== World Agroforestry Centre [Distributor] V1 [Version]
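The example citation can be reproduced from its components with a few lines of code. This is a minimal sketch that follows the layout of the example above. The function name and parameters are hypothetical, and this is not how Dataverse itself generates citations.

```python
from typing import List


def format_citation(
    authors: List[str],
    year: int,
    title: str,
    identifier_url: str,
    unf: str,
    distributor: str,
    version: int,
) -> str:
    """Assemble a data citation in the layout of the example slide."""
    author_list = "; ".join(authors)
    return (
        f'{author_list}, {year}, "{title}", {identifier_url} '
        f"{unf} {distributor} [Distributor] V{version} [Version]"
    )


citation = format_citation(
    authors=[
        "Frank Place", "Patti Kristjanson", "Steve Staal", "Russ Kruska",
        "Tineke deWolff", "Robert Zomer", "E C Njuguna",
    ],
    year=2005,
    title=(
        "Replication data for: Development pathways in medium-high "
        "potential Kenya: a meso-level analysis of agricultural patterns "
        "and determinants."
    ),
    identifier_url="http://hdl.handle.net/1902.1/15673",
    unf="UNF:5:G22I+TtPQPAyFcRT6SrUfA==",
    distributor="World Agroforestry Centre",
    version=1,
)
print(citation)
```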
Designed for Research Data
Data-format aware
• Input formats: CSV, TAB, SPSS, STATA, GraphML
• Export: reformat, subset, analyze
• Preservation-reformatting
• Semantic fingerprints
Research data workflows
• Researcher can enter deposit directly
• Multiple workflows: closed, review-and-release, wiki
• Versioned
Find distributed resources
• Can provide a portal to distributed resources (OAI-PMH harvesting client)
• Data can also include meta data for harvesting
Flexible licensing
• Access control for research groups
• Layered usage terms
• Data request workflow
Robust
• Supports any file type; the only restriction is a maximum size of 2GB per file
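A scientist could run a simple check on a file before depositing it, based on the two constraints this slide mentions: the 2GB per-file limit and the list of recognised input formats. The sketch below is hypothetical and is not part of Dataverse. The mapping from file extensions to formats and the reading of 2GB as 2 × 1024³ bytes are both assumptions.

```python
from pathlib import PurePath
from typing import Tuple

# Assumption: "2GB" is read as 2 * 1024**3 bytes.
MAX_FILE_BYTES = 2 * 1024**3

# Assumption: typical file extensions for the formats named on the slide.
RECOGNISED_EXTENSIONS = {
    ".csv": "CSV",
    ".tab": "TAB",
    ".sav": "SPSS",
    ".dta": "STATA",
    ".graphml": "GraphML",
}


def check_deposit(filename: str, size_bytes: int) -> Tuple[bool, str]:
    """Return (accepted, reason) for a file a scientist wants to deposit."""
    if size_bytes > MAX_FILE_BYTES:
        return False, "exceeds the 2GB per-file limit"
    extension = PurePath(filename).suffix.lower()
    data_format = RECOGNISED_EXTENSIONS.get(extension)
    if data_format is None:
        # Any file type is supported; it is just not treated as tabular data.
        return True, "accepted as a plain file"
    return True, f"accepted as {data_format} data"


print(check_deposit("household_survey.dta", 15_000_000))
print(check_deposit("field_photos.zip", 3 * 1024**3))
```

Files that pass only as plain files can still be deposited, but the reformatting, subsetting and fingerprint features listed under "Data-format aware" would presumably not apply to them.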
Publication and Data Submission: Proposed Workflow
[Flowchart. Recoverable steps: Start; GRP/Region submits publication; request data from scientists; does the publication have data? (Yes/No); data submitted to RMG data manager; publication submitted into Dspace; request changes; changes received? (Yes/No); Dspace editors’ approval; upload data to dataverse (unreleased); Dspace editors receive data link; update Dspace publication with data link; publication approval (Yes/No); publication published to the web.]