Challenges in Preparing and Sharing Open Data
OpenCon 2016 Cape Town
14 December 2016
Michelle Willmers and Thomas King
ROER4D Curation and Dissemination Manager
On Open Educational Resources (OER)
• Imperative to establish empirical baseline research on OER in Global South
• 86 researchers in 26 countries across 3 continents
• Project ‘Open’ ethos manifests in Open Research strategy, bridging ‘Open’
• Open content (typically used in a teaching and
learning context) that can be reused, revised, remixed,
redistributed and retained
• Made possible by open licensing, although increasing
focus on differentiating implicit vs. explicit open
• Focus on role OER can play in improving access to quality education
• Focus on role project can play in building Global South Open Education
• Strong advocacy and activism component (NGO, CBO sectors – not only
Focus on empirical baseline manifests in focus on curatorial and publishing capacity
within the research project. The project acts as publisher, providing greater agency and
control (but presenting some challenges in terms of accreditation/reward).
ROER4D Curation & Dissemination Strategy
• Provide a content management and publishing service to SP researchers and the
Network Hub team in order to advance research capacity development efforts and
increase visibility of outputs.
• Support Principal Investigators and SP researchers in editorial development of
• Address infrastructure deficits and provide content management solutions
(including content hosting) in a research community with uneven institutional
support and capacity challenges.
• Ensure that the ROER4D legacy is freely accessible for reuse in line with international
curatorial and publishing standards.
• Complement Network Hub Communications efforts in an integrated
• Data sharing as component of open content focus.
• Organising and profiling open content increases the potential for reuse and citation
• Well-organised, strategic research management and content organisation promotes
rigour in the research process.
• Copyright vests with the author, so data-sharing activity is determined by their willingness
and capacity to engage.
• Format and platform/tool agnostic.
• Share openly by default on condition that it is valuable, legal and ethical
ROER4D data management principles
ROER4D project data flow
Five pillars of
• Check ethics approval and consent
• Ensure first-tier de-identification takes place prior to Network Hub transfer in order to
ensure research subject confidentiality
Step 3: Obtain source sub-project micro-data
• ROER4D agnostic in its approach (in terms of scale, format and technical
• Challenges of varying researcher sophistication in terms of data collection and
• Challenges of varying researcher sophistication in terms of technology employed to
capture, present, and analyse data
Step 4: Network Hub curation and quality assurance
• Archive in Vula and UCT e-Research Centre secure institutional archive
• Network Hub C&D team audits researchers’ submitted dataset
> What is the dataset comprised of?
> Are all the pieces there?
> What were the data collection processes, and do we have all the instruments to share?
> What languages are represented?
> Does something else like it exist?
> Who might it be of use to?
• Address file naming and format issues
• Articulate sub-project-specific data management plan
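The "Are all the pieces there?" and file-naming checks in the audit above could be partly automated. What follows is a minimal sketch, not the ROER4D team's actual tooling: the function names, the naming convention (lowercase with underscores) and the example file names are all assumptions made for illustration.

```python
import re


def normalise_filename(name):
    """Lowercase a file name and replace runs of non-alphanumerics with underscores.

    The convention itself is an assumption, not a ROER4D rule.
    """
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_").lower()
    return f"{stem}.{ext.lower()}" if ext else stem


def audit_dataset(submitted, expected):
    """Compare submitted file names against a (hypothetical) list of expected components."""
    submitted_norm = {normalise_filename(n) for n in submitted}
    expected_norm = {normalise_filename(n) for n in expected}
    return {
        "missing": sorted(expected_norm - submitted_norm),
        "unexpected": sorted(submitted_norm - expected_norm),
    }


# Hypothetical example: the survey instrument was not submitted.
report = audit_dataset(
    submitted=["Survey Data (Final).CSV", "Interview 01.docx"],
    expected=["survey_data_final.csv", "interview_01.docx", "survey_instrument.pdf"],
)
print(report)
```

A check like this only answers whether the expected files arrived. Questions about usefulness, languages and comparable datasets still need a human reader.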
Step 5: Prepare data for publication
• Scope and conceptualise the dataset
> Which components of the project-generated micro-data are you ethically and
legally allowed to share?
> Which components of the project-generated micro-data will you invest
resources in curating and sharing?
> Which instruments will you include?
• Identify focus of data and points of sensitivity
• Define appropriate second-tier de-identification approach
Step 6: Publish
• Generate metadata and dataset description (accompanying narrative)
• Submit content to publisher (DataFirst)
• Link to published outputs
• Include description of process in research Methodology statements
• Profile in project communications activity
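As an illustration of the "generate metadata and dataset description" item above, a dataset record might be assembled as below. This is a sketch only: the field names are assumptions and do not represent DataFirst's submission schema or any formal metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetRecord:
    # All field names are illustrative assumptions, not a publisher schema.
    title: str
    description: str  # the accompanying narrative
    languages: list = field(default_factory=list)
    instruments: list = field(default_factory=list)
    related_outputs: list = field(default_factory=list)  # links to published outputs

    def to_json(self):
        """Serialise the record so it can travel alongside the data files."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Hypothetical placeholder values.
record = DatasetRecord(
    title="Example sub-project dataset",
    description="Placeholder narrative describing collection and de-identification.",
    languages=["en"],
    instruments=["survey_instrument.pdf"],
)
print(record.to_json())
```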
1. Openness increases rigour. Preparing data for publication promotes professional
approach to research process.
2. Preparing data for publication exposes weaknesses in instrument design and
3. Introducing C&D and data-sharing focus midway through a project poses many
challenges, particularly in terms of ethical and consent components.
4. Data sharing drives focus on reproducibility, transforming traditional approach to
crafting methodology statements.
5. The data preparation process takes time (approx. one week of researchers’ time in
6. Obtaining balance between utility and adequate protection in de-identification of
qualitative data is a challenge.
7. Openness is threatening to researchers in terms of exposing weakness in processes
and perceived threat of losing publication advantage.
8. C&D and data sharing activity require support, capacity development and
Terms and definitions
• De-identification – removing, eliding or replacing
pieces of information that reveal research
participants’ (possibly also referents’) identity.
• Anonymity – personal details are not gathered.
• Confidentiality – personal details are not shared.
• E.g. an anonymous survey contains no questions
about personal identifiers. A confidential survey
does contain these questions, but will not
The two pillars of open data sharing
Research Data Management &
Open Data sharing
The de-identification balancing act
First, do no harm
Remove as much as needed to ensure the
confidentiality or anonymity of the
Ensure that all ethical and consent
processes have been adhered to.
Don’t go overboard
Remove as little as is ethical to ensure the
richness of the data.
Take the unit of analysis as the guide – de-identify up to the unit of analysis.
E.g. if Study X compares two universities,
you can safely remove all identifiers lower
than the university affiliation.
Your data may be useful to others. The
purpose of de-identification is to preserve
confidentiality – don’t de-identify for the
sake of it.
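The unit-of-analysis rule can be sketched for tabular data as follows. The field names and records are hypothetical, and the sketch assumes the identifier fields have already been listed by the researcher. It keeps the unit-of-analysis field (here the university) and drops the identifiers beneath it.

```python
def deidentify_to_unit(records, identifier_fields, unit_field):
    """Drop every identifier field except the one naming the unit of analysis."""
    drop = set(identifier_fields) - {unit_field}
    return [{k: v for k, v in row.items() if k not in drop} for row in records]


# Hypothetical records for a study comparing two universities.
rows = [
    {"university": "University A", "department": "Dept 1",
     "respondent_name": "P1", "score": 4},
    {"university": "University B", "department": "Dept 2",
     "respondent_name": "P2", "score": 5},
]

cleaned = deidentify_to_unit(
    rows,
    identifier_fields=["university", "department", "respondent_name"],
    unit_field="university",
)
print(cleaned)
```

Non-identifying fields such as the score are left untouched, in line with removing as little as is ethical.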
• De-identification located in the same ecosystem
as data cleaning and data validation – no clear
line between data improvement and de-identification
– Cleaning up typos
– Standardising presentation and layout
– Identifying unanswered questions (or additional
questions), mislabelled responses, etc.
• Many of these also apply to quantitative data
• Articulation of principles in RDM and description
of these processes included in metadata
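One of the cleaning checks listed above, identifying unanswered or additional questions, might look like this for responses keyed by question ID. The data layout and names are assumptions for illustration. Qualitative transcripts would need a manual equivalent.

```python
def check_responses(responses, question_ids):
    """Report unanswered and additional (unexpected) questions per respondent."""
    expected = list(question_ids)
    report = {}
    for respondent, answers in responses.items():
        unanswered = []
        for q in expected:
            value = answers.get(q)
            # Treat a missing key, None, or whitespace-only text as unanswered.
            if value is None or str(value).strip() == "":
                unanswered.append(q)
        additional = sorted(set(answers) - set(expected))
        if unanswered or additional:
            report[respondent] = {"unanswered": unanswered, "additional": additional}
    return report


# Hypothetical responses; a numeric 0 counts as an answer.
responses = {
    "R1": {"q1": "Yes", "q2": " "},
    "R2": {"q1": 0, "q2": "No"},
    "R3": {"q2": "No", "q9": "Follow-up comment"},
}
gaps = check_responses(responses, ["q1", "q2"])
print(gaps)
```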
Fix typos &
account for missing
ROER4D project structure
[Diagram: Curation and Dissemination; Communication and Evaluation; Sub-projects]
Using largely mixed-methods data (both
quantitative and qualitative)
ROER4D de-identification process
1. First-level de-identification by researcher
– Removal of direct identifiers (names of
people/institutions/companies, ID numbers, etc.)
– Important to ensure that raw data is not shared
2. Second-level de-identification by C&D team to
catch remaining direct identifiers
3. In-depth sweep of the text to identify indirect identifiers
– Meticulous, thorough, repeated reading of the text
• (which ties back to general data enhancement)
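The first two passes (replacing known direct identifiers) lend themselves to partial automation. The third pass (indirect identifiers) can only be assisted, not replaced, by tooling. The sketch below assumes a hand-built mapping of identifiers to placeholders and a hand-picked list of cue words. Both are hypothetical, as are the function names, and none of this describes the tooling ROER4D actually used.

```python
import re


def replace_direct_identifiers(text, replacements):
    """Swap each known direct identifier for its placeholder (longest match first)."""
    for original in sorted(replacements, key=len, reverse=True):
        # Lookarounds stop "Jane" from matching inside a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(original) + r"(?!\w)", re.IGNORECASE
        )
        placeholder = replacements[original]
        text = pattern.sub(lambda _match, p=placeholder: p, text)
    return text


def flag_for_review(text, cue_words):
    """Return (line number, line) pairs that contain a cue word, for manual reading."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(cue.lower() in lowered for cue in cue_words):
            flagged.append((number, line))
    return flagged


# Hypothetical transcript excerpt and mapping.
transcript = (
    "Jane Smith teaches at Example University.\n"
    "As the only dean in the faculty, Jane adopted OER early."
)
mapping = {
    "Jane Smith": "[PARTICIPANT_1]",
    "Jane": "[PARTICIPANT_1]",
    "Example University": "[INSTITUTION_A]",
}
cleaned_text = replace_direct_identifiers(transcript, mapping)
to_review = flag_for_review(cleaned_text, ["dean"])
print(cleaned_text)
print(to_review)
```

A phrase such as "the only dean in the faculty" contains no names but could still identify a person, which is why flagged lines go back to a human reader.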
• Data collected in multiple languages
– De-identification (particularly in qualitative data) far
more difficult – greater reliance on the researcher
• Post-hoc consent process
– Departments merge or close, participants retire or
• Data collected by multiple researchers
– Different collection strategies, adherence to interview
schedules, use/non-use of clarifying questions, etc.
Open by design
• Help researchers write consent forms!
Particularly for open data sharing.
• ‘Red flag’ clauses abound in template consent
– “will be used for research purposes only”
– “data will be destroyed after use”
– “only researchers will have access to the data”
• More open consent forms allow for data
sharing but do not mandate it.
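Draft consent forms could be screened for the 'red flag' clauses quoted above before they go to participants. This is a minimal sketch: the patterns are derived from the three example clauses on this slide, and the labels and function name are assumptions. Real templates will phrase these restrictions in many other ways, so it supplements a careful read and does not replace it.

```python
import re

# Patterns based on the three example clauses; labels are illustrative.
RED_FLAG_PATTERNS = {
    "research purposes only": r"research\s+purposes\s+only",
    "data destroyed after use": r"destroyed\s+after\s+use",
    "researcher-only access": r"only\s+researchers\s+will\s+have\s+access",
}


def find_red_flags(consent_text):
    """Return the labels of red-flag clauses found in a consent form draft."""
    return [
        label
        for label, pattern in RED_FLAG_PATTERNS.items()
        if re.search(pattern, consent_text, re.IGNORECASE)
    ]


# Hypothetical draft wording.
draft = (
    "Your responses will be used for research purposes only. "
    "Data will be destroyed after use."
)
print(find_red_flags(draft))
```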