ROER4D Open Data Initiative

The ROER4D Open Data initiative
Michelle Willmers and Thomas King
January 2018
CC BY

Introduction to ROER4D
• Research on Open Educational Resources for Development project
– 18 sub-projects, across 26 countries in the Global South from Chile to
Mongolia, with 100 researchers, supported by a Network Hub team based in
the University of Cape Town and Wawasan Open University.
– Datasets in multiple languages (English, Spanish, Mongolian)
– Mostly mixed-methods data (mix of quantitative and qualitative)
• ROER4D Open Data initiative: supporting interested sub-projects in
sharing their data openly

Unpacking the “ROER4D” project title…
Research On Open Educational Resources (OER) for Development
• Imperative to establish empirical baseline research on OER in Global South
• 86 researchers in 26 countries across 3 continents
• Project ‘Open’ ethos manifests in Open Research strategy, bridging ‘Open’
silos
• Open content (typically used in a teaching and learning
context) that can be reused, revised, remixed,
redistributed and retained
• Made possible by open licensing, although increasing
focus on differentiating implicit vs. explicit open
content
• Focus on role OER can play in improving access to quality education
• Focus on role project can play in building Global South Open Education
research capacity
• Strong advocacy and activism component (NGO, CBO sectors – not only
career researchers)
Focus on empirical baseline manifests in focus on curatorial and publishing capacity within the
research project. The project acts as publisher, providing greater agency and control (but
presenting some challenges in terms of accreditation/reward).

Curation & Dissemination strategy
• Provide a content management and publishing service to SP researchers and the
Network Hub team in order to advance research capacity development efforts and
increase visibility of outputs.
• Support Principal Investigators and SP researchers in editorial development of
ROER4D outputs.
• Address infrastructure deficits and provide content management solutions
(including content hosting) in a research community with uneven institutional
support and capacity challenges.
• Ensure that the ROER4D legacy is freely accessible for reuse in line with international
curatorial and publishing standards.
• Complement Network Hub Communications efforts in an integrated
communications/dissemination approach.

Data management principles
• Data sharing as component of generalised open content focus.
• Organising and profiling open content increases the potential for reuse and citation
(impact).
• Well-organised, strategic research management and content organisation promotes
rigour in the research process.
• Copyright vests with the author > data-sharing activity determined by their willingness
and capacity to engage.
• Format and platform/tool agnostic.
• Share openly by default on condition that it is valuable, legal and ethical.

Research Data Management
• Collect data: ethics clearance, methodology, instruments
• Organise data: formats, naming conventions
• Refine data: verification, validation
• Share data: de-identification, publishing, open data
• Document data: metadata, dataset description
• Store data: backup, archive, on-site storage, cloud storage

The two pillars of Open Data sharing
• Consensual: ethical, legal
• Comprehensible: coherent, valuable
Research Data Management & Open Data sharing

ROER4D project data flow
• Internal sharing and collaboration
– Researcher
– Network Hub (Google, Vula)
– ROER4D archive (internal): Google, Vula, UCT eResearch Centre
• External sharing and collaboration
– Project archive (external): Zenodo
– Publisher: DataFirst

Open Data terminology
• Open Data = Microdata
– Unit record data (survey data, census data)
– Interview and Focus Group transcripts
– i.e. the ‘raw material’ from which outputs, reports, publications etc. are
produced.
• Supportive documentation = Metadata
– Dataset descriptions
– Study descriptions (methods/methodology, data collection schedules)
– Data processing information (e.g. de-identification schema)

Terms and definitions
TERM: DEFINITION
• Microdata (aka Unit Record Data): The information that underlies a research project’s
analysis (i.e. the ‘thing’)
• Metadata: Data that describes a file or record on a database (for example, keywords,
author fields, ISBNs, DOIs)
• Research Data Management (RDM): Overall term for how individuals/projects/institutions
manage their data
• Data Management Plan (DMP): Outlines an individual or project’s strategy around all
aspects of data management
• Curation: Organising, storing/archiving and describing data to ensure & control its
long-term accessibility and usability. May include collating/concatenating from other
sources
• De-identification: Removing, eliding or replacing pieces of information that reveal
research participants’ (possibly also referents’) identity
• Anonymity: Personal details (identifiers) are not gathered
• Confidentiality: Personal details (identifiers) are not shared
• Curation platform: An on-premises or cloud-based storage space that contains metadata
capabilities, Search Engine Optimisation, and backup capabilities

Why should researchers share data?
• ROER4D motivations:
– Build the empirical base for future research
– Coherent with our generally ‘open’ approach – publishing open
access outputs, actively communicating with audiences and
stakeholders, etc.
• Good practice – many research funders now require some sort of data-
sharing activity or plan
• Improve rigour
– Sharing data openly demands that the dataset is well described
and organised
– Increased scrutiny of the dataset often leads to more refined
analysis

Five pillars of ROER4D data publication approach

Step 1: Evaluate contractual framework,
articulate strategy

Step 2: Get researchers on board

Recruiting participants
• Emphasising social justice through sharing
– Sharing open data allows for latitudinal studies using data from multiple sites
• Emphasising personal reputation
– Sharing open data as a means of building one’s personal profile as a
researcher
• Emphasising rigour
– Sharing data openly enhances the quality of the research

Step 3: Source sub-project micro-data
• Check ethics approval and consent
• Ensure first-tier de-identification takes place prior to Network Hub transfer in
order to ensure research subject confidentiality
• ROER4D is agnostic in its approach (in terms of scale, format and technical
sophistication)
• Challenges of varying researcher sophistication in terms of data collection and
presentation
• Challenges of varying researcher sophistication in terms of technology employed
to capture, present, and analyse data

Step 4: Network Hub curation and quality assurance
• Archive in LMS and secure institutional archive
• Network Hub C&D team audits researchers’ submitted dataset
> What is the dataset comprised of?
> Are all the pieces there?
> What were the data collection processes, and do we have all the instruments to
share?
> What languages are represented?
> Does something else like it exist?
> Who might it be of use to?
• Address file naming and format issues
• Articulate sub-project-specific data management plan
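The file naming issues addressed in Step 4 could be handled with a small helper along the following lines. This is a minimal sketch, not the Network Hub's actual tooling: the naming convention (lowercase, underscores, a sub-project prefix) and the function name `normalise_filename` are assumptions made for illustration.

```python
import re
from pathlib import PurePath


def normalise_filename(name: str, prefix: str = "sp") -> str:
    """Return a lowercase, underscore-separated file name with a prefix.

    The convention used here is hypothetical, not ROER4D's own.
    """
    path = PurePath(name)
    stem = path.stem.lower()
    # Collapse any run of characters other than a-z and 0-9 into one underscore.
    stem = re.sub(r"[^a-z0-9]+", "_", stem).strip("_")
    return f"{prefix}_{stem}{path.suffix.lower()}"


print(normalise_filename("Interview 03 (FINAL).DOCX", prefix="sp4"))
```

Note that this sketch only keeps ASCII letters and digits, so file names written in other scripts would need a different rule.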

Step 5: Preparing data for publication
• Scope and conceptualise the dataset
> Which components of the project-generated micro-data are you ethically and
legally allowed to share?
> Which components of the project-generated micro-data will you invest
resources in curating and sharing?
> Which instruments will you include?
• Identify focus of data and points of sensitivity
• Define appropriate second-tier de-identification approach

ROER4D data interrogation process
READ DATA
• Coherence: format & layout
• Editing: fix typos & identify anomalous data
• De-identifying: remove identifiers
• Validation: identify and account for missing data
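The validation stage of the data interrogation process (identifying and accounting for missing data) could be sketched for tabular data as follows. This is an illustrative sketch only: the set of missing-value markers, the function name `count_missing` and the sample rows are assumptions, not ROER4D data or tooling.

```python
import csv
import io

# Values treated as "missing" (an assumed list; adjust per dataset).
MISSING_MARKERS = {"", "na", "n/a", "none"}


def count_missing(csv_text: str) -> dict[str, int]:
    """Count missing values per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    missing = {field: 0 for field in fields}
    for row in reader:
        for field in fields:
            value = row.get(field)
            # A short row yields None for the columns it lacks.
            if value is None or value.strip().lower() in MISSING_MARKERS:
                missing[field] += 1
    return missing


# Made-up sample rows for illustration.
sample = "respondent,country,uses_oer\nr01,Chile,yes\nr02,,no\nr03,Mongolia,NA\n"
print(count_missing(sample))
```

The per-column counts can then be recorded in the dataset description so that users of the data know where gaps exist.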

The de-identification balancing act
First, do no harm
Remove as much as needed to ensure the
confidentiality or anonymity of the
research participants.
Ensure that all ethical and consent
processes have been adhered to.
Don’t go overboard
Remove as little as is ethical to ensure the
richness of the data.
Take the unit of analysis as the guide – de-identify
up to the Unit of Analysis.
E.g. if Study X compares two universities,
you can safely remove all identifiers lower
than the university affiliation.
HOWEVER
Your data may be useful to others. The
purpose of de-identification is to preserve
confidentiality – don’t de-identify for the
sake of it
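The "de-identify up to the Unit of Analysis" guideline could be sketched as follows for record-style data. This is a minimal sketch under stated assumptions: the identifier hierarchy in `LEVELS`, the function name `deidentify_to_unit` and the sample record are all hypothetical.

```python
# Identifier fields ordered from coarsest to finest (a hypothetical hierarchy).
LEVELS = ["country", "university", "department", "name"]


def deidentify_to_unit(record: dict, unit: str) -> dict:
    """Drop identifier fields finer than the unit of analysis.

    Non-identifier fields (e.g. responses) are kept unchanged.
    """
    keep = set(LEVELS[: LEVELS.index(unit) + 1])
    return {
        key: value
        for key, value in record.items()
        if key not in LEVELS or key in keep
    }


# Made-up record for illustration.
record = {
    "country": "Country A",
    "university": "University X",
    "department": "Department Y",
    "name": "Participant Z",
    "response": "I share my teaching materials openly.",
}
print(deidentify_to_unit(record, unit="university"))
```

With `unit="university"`, the department and name fields are dropped while the university affiliation and the response are retained, which mirrors the Study X example.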

ROER4D de-identification process
1. First-level de-identification by researcher
– Removal of direct identifiers (names of people/institutions/companies, ID
numbers, etc.)
– Important to ensure that raw data is not shared
2. Second-level de-identification by C&D team to catch remaining direct
identifiers
3. In-depth sweep of the text to identify indirect identifiers
– Meticulous, thorough, repeated reading of the text (which ties back to
general data enhancement)

Qualitative de-identification
• De-identification located in the same ecosystem as data cleaning and data
validation – no clear line between data improvement and de-identification
– Cleaning up typos
– Standardising presentation and layout
– Identifying unanswered questions (or additional questions), mislabelled
responses, etc.
• Many of these also apply to quantitative data
• Articulation of principles in RDM and description of these processes included in
metadata

Qualitative de-identification example
• Raw data
– Well my name is Susan Tsvangirai, and I’m the Head of the
Anthropology department at the University of Zimbabwe. I first
started getting involved in publishing my data – see I’m the only
person in the country who works on human ecologies, well it’s me
and Ishaan at Wits, but I’m the only one locally, and I started out
using the institutional repository but it didn’t really work. It kept
timing out when I tried to upload resources. So I switched to Zenodo
which was fine but it felt a little bit sterile…
• Cleaned/processed data
– Well my name is [redacted], and I’m the Head of [my] department at
the University of Zimbabwe. I first started getting involved in
publishing my data – see I’m the only person in the country who
works [in my area], well it’s me and [a colleague] at Wits, but I’m the
only one locally, and I started out using the institutional repository
but it didn’t really work. It kept timing out when I tried to upload
resources. So I switched to Zenodo which was fine but it felt a little
bit sterile…
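Once identifiers have been found by close reading, the replacement step shown in the example above could be sketched as follows. This is not the C&D team's actual procedure: the function name `redact` and the replacement table are assumptions, and the sketch only substitutes identifiers that a human reader has already listed.

```python
import re


def redact(text: str, replacements: dict[str, str]) -> str:
    """Replace each known identifier with its bracketed placeholder."""
    # Longest first, so a short identifier cannot clobber part of a longer one.
    for identifier in sorted(replacements, key=len, reverse=True):
        pattern = re.compile(re.escape(identifier), flags=re.IGNORECASE)
        placeholder = replacements[identifier]
        # A function replacement avoids any backslash processing of the placeholder.
        text = pattern.sub(lambda _match, p=placeholder: p, text)
    return text


raw = (
    "Well my name is Susan Tsvangirai, and I'm the Head of the "
    "Anthropology department at the University of Zimbabwe."
)
# Hypothetical replacement table compiled by the person reading the transcript.
replacements = {
    "Susan Tsvangirai": "[redacted]",
    "the Anthropology department": "[my] department",
}
print(redact(raw, replacements))
```

Indirect identifiers (such as "the only person in the country who works on human ecologies") still depend on meticulous reading; a lookup table cannot detect them.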

Step 6: Publish
• Generate metadata and dataset description (accompanying narrative)
• Submit content to publisher (in ROER4D instance, DataFirst)
• Link to published outputs
• Include description of process in research Methodology statements
• Profile in project communications activity
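The metadata and dataset description generated in Step 6 could be captured as a simple structured record. This is a sketch only: the field names, the function name `build_dataset_description` and all values are placeholders chosen for illustration, and do not reflect DataFirst's or any other publisher's required schema.

```python
import json


def build_dataset_description(title, languages, methods, deidentification_note):
    """Assemble a minimal dataset description as a JSON string.

    Field names are hypothetical, not a publisher's schema.
    """
    record = {
        "title": title,
        "languages": sorted(languages),
        "methods": methods,
        "deidentification": deidentification_note,
    }
    return json.dumps(record, indent=2, ensure_ascii=False)


# Placeholder values for illustration.
description = build_dataset_description(
    title="Example sub-project interview transcripts",
    languages={"English", "Spanish"},
    methods="Semi-structured interviews",
    deidentification_note=(
        "Direct identifiers removed; indirect identifiers replaced "
        "with bracketed placeholders."
    ),
)
print(description)
```

Recording the de-identification approach alongside the descriptive fields keeps the processing information with the dataset.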

Challenges
• Data collected in multiple languages
– De-identification (particularly in qualitative data) far more difficult –
greater reliance on the researcher to identify disclosive information
• Post-hoc consent process
– Departments merge or close, participants retire or disappear
• Data collected by multiple researchers
– Different collection strategies, adherence to interview schedules, use/non-
use of clarifying questions, etc.

Ways forward: ‘Open by design’
• Help researchers write consent forms to facilitate ethical open
data sharing.
• ‘Red flag’ clauses abound in template consent forms,
including:
– “will be used for research purposes only”
– “data will be destroyed after use”
– “only researchers will have access to the data”
• More open consent forms allow for data sharing but do not
mandate it.
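A consent form draft could be screened for the ‘red flag’ clauses listed above with a simple text check. This is a minimal sketch: the function name `find_red_flags` and the sample form text are assumptions, and a phrase match only catches the exact wordings listed, so it does not replace a careful reading of the form.

```python
# The three red-flag clauses quoted on the slide.
RED_FLAGS = [
    "will be used for research purposes only",
    "data will be destroyed after use",
    "only researchers will have access to the data",
]


def find_red_flags(consent_text: str) -> list[str]:
    """Return the red-flag clauses found in a consent form.

    Matching ignores case and line wrapping.
    """
    normalised = " ".join(consent_text.lower().split())
    return [clause for clause in RED_FLAGS if clause in normalised]


# Made-up consent form text for illustration.
form = """Your responses will be used for research
purposes only. Data will be destroyed after use."""
print(find_red_flags(form))
```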

Lessons learned
1. Openness increases rigour. Preparing data for publication promotes professional approach to
research process.
2. Preparing data for publication exposes weaknesses in instrument design and research
process.
3. Introducing C&D and data-sharing focus midway through a project poses many challenges,
particularly in terms of ethical and consent components.
4. Data sharing drives focus on reproducibility, transforming traditional approach to crafting
methodology statements.
5. The data preparation process takes time (approx. one week of researchers’ time in ROER4D
context).
6. Obtaining balance between utility and adequate protection in de-identification of qualitative
data is a challenge.
7. Openness is threatening to researchers in terms of exposing weakness in processes and
perceived threat of losing publication advantage.
8. C&D and data sharing activity require support, capacity development and resourcing.
