Research methods group accelerating impact by sharing data

Accelerating Impact by sharing data

Anja Gassner and Leroy Mwanzia

Let's Move Beyond Open Data
to Open Development?
In July this year, the Sunday Business section of the New
York Times featured a story about the World Bank’s Open Data
initiative and claimed that datasets and information will
ultimately become more valuable than Bank lending.
This is not about the World Bank as the central repository
of knowledge sharing its knowledge and wisdom with
clients from the South.
It is about “democratizing development economics” in
that it levels the playing field on knowledge creation and
dissemination and opens the development paradigm to
participation from researchers and practitioners, software
developers and students, from north and south.

SRF
• The CGIAR is unique in having the capacity to collect
experimental, monitoring, and survey data on
agricultural systems throughout the developing world.

• Most data collected by CRPs, whether broad-scale data
used to describe and monitor farming system changes, or
focused data collected to examine specific processes and
hypotheses, should be of such potential value that the
cost of archiving and sharing is justified by the value
added in terms of expanded research results from the
use of that data by a wider research community.

“This is one clear and consistent message
from the last CGIAR science forum: data
archiving of the CG Centers is overall
abysmally poor”

Robert Nasi, Director
Forests, Trees and Agroforestry (CRP6)

Program Participant Agreement

What data platform
are we going to use?

Research Data Repository

What should be deposited
1. All research data belonging to publications
2. High value data sets of interest to ICRAF, other CG
centers & Partners

Research Data Management Policy

The Policy
All of the Centre’s data needs to be:
a) derived from research relevant to our agenda, to the
development challenges in our strategy, to the
Strategy and Results Framework (SRF) of the CGIAR
and to the CGIAR Research Programs (CRPs);
b) of high quality (well designed, well collected, well
verified, well documented);
c) protected and archived;
d) available (people know that it exists) and easily
accessible (people can gain access to the data) to all;
e) adaptable, so that it can be well utilized and
transformed where possible into actionable
knowledge.

Whose responsibility?

The Centre

• setting up clear protocols, conducting peer reviews, using robust
and well-documented methods and appropriate statistical
analyses, and producing meta-analysis and syntheses of results
• providing a stable, reliable data repository system that can handle
both document-centric and data-centric objects.
• ensuring that all raw data needed to reproduce or replicate
every scientific publication based on research data is made
public

Whose responsibility?

The project/scientist

• complying with explicit quality standards
• submitting the necessary raw, verified data for every scientific
publication in standard file formats
• ensuring that research data produced for the Centre is described by
appropriate metadata throughout its lifecycle

How do we achieve highest
scientific standards?
RMG: quality control throughout the data
lifecycle (collection, verifying,
managing, analyzing, storing)

Beyond RMG: to ensure that all staff follow the
institutional standards and
guidelines.

The ultimate benchmark for all scientists, however, is the
consensus of peers

Research Data Repository

Challenge:
Move data from scientists’ laptops to an institutional server
and
Have the data described by sufficient metadata
without
Increasing transaction costs or
Creating an auditing issue

Dataverse Network
• The Dataverse Network is an application to
publish, share, reference, extract and analyse
research data.
• It facilitates making data available to
others, and enables replication of work.
• Researchers and data authors get
credit, publishers and distributors get
credit, affiliated institutions get credit.

Dataverse Network
• A Dataverse Network hosts multiple
Dataverses.
• Each Dataverse contains studies or collections
of studies.
• Each study contains cataloguing information
that describes the data plus the actual data
files and complementary files.
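The containment described above (network, dataverses, studies, files) can be sketched as a small data model. This is only an illustration of the hierarchy, not the Dataverse Network schema: all class and field names are hypothetical, and collections of studies are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    # A data file or a complementary file (e.g. a codebook); names are hypothetical.
    name: str
    size_bytes: int


@dataclass
class Study:
    title: str
    # Cataloguing information that describes the data (free-form here).
    cataloguing: Dict[str, str]
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Dataverse:
    alias: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class DataverseNetwork:
    name: str
    dataverses: List[Dataverse] = field(default_factory=list)

    def count_files(self) -> int:
        """Count every file held by every study in every dataverse."""
        return sum(len(s.files) for dv in self.dataverses for s in dv.studies)


# Illustrative values only.
study = Study(
    title="Example study",
    cataloguing={"author": "A. Scientist", "date": "2012"},
    files=[DataFile("survey.csv", 1024), DataFile("codebook.pdf", 2048)],
)
network = DataverseNetwork("Example network", [Dataverse("icraf", [study])])
```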

Data Backup & Preservation
• The IQSS Dataverse Network maintains a full backup of all data and
directories on the Network for 6 months, in the Harvard Depository.
This means that there always is a full, offsite copy of the Network
that is less than 7 months old.
• IQSS will maintain on-line storage, backup, and media migration
sufficient for all studies it accepts (in addition to storage provided
for the IQSS DVN).
• The Henry A. Murray Archive, through its endowment, supports
permanent bit-level preservation of all social science research
studies directly deposited in the IQSS Dataverse Network.

http://thedata.org/book/data-backup-terms

Hosting
• There are two approaches:
1. You can download and install the Dataverse Network
Application and effectively become a host; or
2. You can create a Dataverse on the *IQSS Dataverse Network at
Harvard University. This Network is open to all
researchers, publishers and data distributors.
• Option 1 gives you more control but includes added
responsibility & cost

*Institute for Quantitative Social Science

Hosting – IQSS Option
• Advantages
– Dataverse software is installed, hosted and managed for you by IQSS
– Dataverse is hosted on Harvard’s infrastructure, which is very good
– IQSS offers strong support in setting up your dataverse and
helps if you run into any problems
• Disadvantages
– Network-level administrative tasks cannot be performed; these include:
• Creating user groups based on IP address or IQSS network user names
• Creating harvesting dataverses, which allow you to share metadata with other
systems, e.g. DSpace. Sharing includes exporting and importing metadata.
• Complete deletion of studies, not just deaccessioning
• Accessing web statistics
– Cannot use alias URLs to point to your dataverse, e.g. we cannot have
the URL http://data.worldagroforestry.org pointing to the ICRAF IQSS
dataverse http://dvn.iq.harvard.edu/dvn/dv/icraf

Hosting – Self Hosting
• Advantages
– Full access to Network level Administrative tasks including:
• Ability to import and export studies to and from other systems
• Ability to create user groups based on IP address and your dataverse users
• Ability to use software supplied utilities e.g. complete deletion of studies and
locking of studies
• Greater flexibility in user management and “Terms of use” management
• Greater flexibility in dataverse branding
– Ability to use organization URLs to point to the dataverse e.g.
http://data.worldagroforestry.org
• Disadvantages
– Need an IT expert to install and manage the dataverse
software, including things like upgrading, applying security
patches, backups etc.
– Need good server infrastructure for hosting the application especially
server space.

Data Citation for each study
• Dataverse allows digital research data to be cited
from published printed work
• A citation is automatically generated when a study
is created.
• Data Citation format:
Author, Date, “Title”, Persistent Identifier,
Universal Numerical Fingerprint (UNF),
Distributor or other optional fields [ …]

Unique Citation Components
1. Persistent Identifiers – Offer permanent and
reliable links to digital objects. They use the
Handle System, e.g. hdl:1902.1/15673
2. Universal Numerical Fingerprint –
– Applied on quantitative data
– Used to uniquely identify and verify data
e.g. 5:G22I+TtPQPAyFcRT6SrUfA==
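Both identifiers above are plain strings that can be taken apart mechanically. The sketch below turns a handle into a resolver URL and splits a UNF into its version number and decoded digest. It assumes the `http://hdl.handle.net/` resolver form and a `version:base64-digest` layout for the UNF, as suggested by the examples in this deck. The function names are hypothetical, and the code does not compute a UNF from data.

```python
import base64
from typing import Tuple

# Resolver prefix of the Handle System proxy (an assumption based on this deck's examples).
HANDLE_PROXY = "http://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Convert 'hdl:<prefix>/<suffix>' into a resolvable URL."""
    scheme = "hdl:"
    if not handle.startswith(scheme):
        raise ValueError("expected a handle of the form 'hdl:prefix/suffix'")
    return HANDLE_PROXY + handle[len(scheme):]


def parse_unf(unf: str) -> Tuple[int, bytes]:
    """Split a UNF string into (version, raw digest bytes).

    Accepts the fingerprint with or without a leading 'UNF:' label.
    """
    text = unf[len("UNF:"):] if unf.startswith("UNF:") else unf
    version, _, digest = text.partition(":")
    return int(version), base64.b64decode(digest)


url = handle_to_url("hdl:1902.1/15673")
version, digest = parse_unf("5:G22I+TtPQPAyFcRT6SrUfA==")
```

Comparing the digest stored in a citation with one recomputed from a downloaded file is how a UNF is meant to verify data. The recomputation itself needs a dedicated UNF implementation and is outside this sketch.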

Example of Citation
Frank Place; Patti Kristjanson; Steve Staal; Russ
Kruska; Tineke deWolff; Robert Zomer; E C
Njuguna, 2005, "Replication data for:
Development pathways in medium-high
potential Kenya: a meso-level analysis of
agricultural patterns and
determinants.", http://hdl.handle.net/1902.1/15673
UNF:5:G22I+TtPQPAyFcRT6SrUfA==
World Agroforestry Centre [Distributor] V1
[Version]
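The example citation above can be assembled from its components with a few lines of code. This is a minimal sketch that mimics the layout of the example (authors joined by semicolons, then date, quoted title, persistent identifier, UNF, distributor and version). The function name and parameters are hypothetical, and Dataverse generates the real citation itself.

```python
from typing import List, Optional


def format_citation(
    authors: List[str],
    year: int,
    title: str,
    persistent_id: str,
    unf: Optional[str] = None,
    distributor: Optional[str] = None,
    version: Optional[str] = None,
) -> str:
    """Build a citation string in the layout of the example above."""
    # Required components are comma-separated; the title is quoted.
    citation = ", ".join(
        ["; ".join(authors), str(year), f'"{title}"', persistent_id]
    )
    # Optional components follow, separated by spaces.
    if unf:
        citation += f" {unf}"
    if distributor:
        citation += f" {distributor} [Distributor]"
    if version:
        citation += f" {version} [Version]"
    return citation


citation = format_citation(
    authors=["Frank Place", "Patti Kristjanson"],  # shortened author list
    year=2005,
    title="Replication data for: Example title.",  # placeholder title
    persistent_id="http://hdl.handle.net/1902.1/15673",
    unf="UNF:5:G22I+TtPQPAyFcRT6SrUfA==",
    distributor="World Agroforestry Centre",
    version="V1",
)
```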

Designed for Research Data

Data-format aware
• Input formats: CSV, TAB, SPSS, STATA, GraphML
• Export: reformat, subset, analyze
• Preservation-reformatting
• Semantic fingerprints

Research data workflows
• Researcher can enter deposit directly
• Multiple workflows: closed, review-and-release, wiki
• Versioned

Find distributed resources
• Can provide a portal to distributed resources (OAI-PMH harvesting
client)
• Data can also include metadata for harvesting

Flexible licensing
• Access control for research groups
• Layered usage terms
• Data request workflow
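The OAI-PMH harvesting mentioned above works through plain HTTP requests. As a rough sketch, the snippet below builds a `ListRecords` request URL. The verb and the `metadataPrefix` and `set` arguments come from the OAI-PMH protocol, while the base URL and the set name are placeholders, not real endpoints of any Dataverse or DSpace installation.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Restrict harvesting to one set (e.g. a single collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and set name, for illustration only.
request_url = list_records_url("https://repository.example.org/oai", set_spec="icraf")
```

A harvesting client would fetch this URL, parse the XML response and repeat the request for as long as the response contains a resumption token.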

Robust
Supports any file type; the only restriction is a maximum size of 2GB per file
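A depositor can check the size restriction above before uploading. The sketch below lists files that exceed the limit. It assumes that "2GB" means 2 × 1024³ bytes, which the slide does not specify, and the function name is hypothetical.

```python
import os
from typing import Iterable, List

# Assumption: the slide's "2GB" is read as 2 GiB; adjust if the limit is decimal.
MAX_FILE_BYTES = 2 * 1024**3


def oversized_files(paths: Iterable[str], limit: int = MAX_FILE_BYTES) -> List[str]:
    """Return the paths whose size on disk exceeds the per-file limit."""
    return [p for p in paths if os.path.getsize(p) > limit]
```

Files flagged by the check would need to be split or compressed before deposit.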

What can you do with dataverse?

Publication and Data Submission
Proposed Workflow
[Flowchart not reproduced; the order of the steps and some labels could
not be recovered. Recoverable step labels: Start; GRP/Region submit
publication; Publication has data? (Yes/No); Request data from
scientists; Data submitted to RMG data manager; Data received?; Upload
data to dataverse; DSpace editors receive data link; Publication
submitted into DSpace; DSpace editors approval; Request publication
changes; Publication approval (Yes/No); Update DSpace (unreleased)
publication with data link; Publication published to the web.]
