There cannot be many mature products where development meetings have not been interrupted with a rueful declaration that to make further progress “you wouldn’t start from here”. This encapsulates one key difference between the architect and engineer, the latter prepared to work with the set of tools provided, the former preferring to start with a blank sheet of paper or an open space. In building research data repositories using two different software platforms, Microsoft Sharepoint and EPrints, the DataPool Project is working somewhere between these extremes. Which approach will prove to be the more resilient for research data management (RDM)? In this talk we will look at the relevant factors.
To architect or engineer? Lessons from DataPool on building RDM repositories
1. To architect or engineer? Lessons from DataPool on building RDM repositories
Steve Hitchcock, JISC DataPool Project
9th DCC Research Data Management Forum (RDMF9)
Cambridge, 14-15 November 2012
4. DataPool
Building Capacity, Developing Skills, Supporting Researchers
October 2011 to March 2013
[Project summary poster; recoverable text follows]
Strands: Policy and guidance | Training | Data repository
Progress
• SharePoint
• EPrints 3.3; EPrints data apps
• Doctoral Training Centres; graduate & staff training services
• Case studies: Imaging, 3D; Geodata; ++
• University Strategic Research Groups
• Informed by IDMB: surveys of data practices among academics; 3-layer metadata
Developing/working with, e.g.
• Support for Data Management Plans
• Data capture/share with external sources, e.g. SWORD-ARM
• Large-scale data storage
• Assign DataCite DOIs
JISCMRD Progress Workshop, 24-25 October 2012, Nottingham
Byatt, D. (D.R.Byatt@soton.ac.uk)
Hitchcock, S. (sh94r@ecs.soton.ac.uk)
White, W. (whw@soton.ac.uk)
http://datapool.soton.ac.uk/
5. Data repository platforms
From a data repository perspective
Architected
• DataFlow
• MS Sharepoint
• EPrints
Engineered
Other platforms available
• DSpace, CKAN, data.bris, etc.
6. Implementations of DataFlow Model
DataFlow: two-stage architecture
• DataStage, connected via SWORD to a curated repository/archive
• Curated repository/archive: DataBank, EPrints, DSpace (QMUL)
• Addresses Dropbox effect for data producers
Two data deposit motivations for creators: want to (practice), need to (policy)
7. DataStage: Upload file
DataStage was developed at the University of Oxford
DataStage screenshots courtesy JISC Kaptur project http://www.vads.ac.uk/kaptur/
Thanks to Carlos Silva
9. 3-layer metadata model
Takeda et al., 6th IDCC, Dec. 2010
available from http://eprints.soton.ac.uk/169533/
JISC Institutional Data Management Blueprint (IDMB)
Project, University of Southampton
12. Prof. Simon Cox (engng) on Sharepoint
“The concept that formed part of SP thinking (at Southampton) from the very inception … that ability to use SP as a way to manage or at least collaborate as part of a 5-10 year programme of work.
“The other side is what we’re doing with intellectual property and what we’re offering for students. I chair a group design project, and every single student has said ‘I just do it all on Dropbox’. The same is happening with our research. So I think we have at least to provide a level of service and a level of integration between our research experience and our teaching experience. Would these people go to Southampton rather than University of Nowhereshire on the Web or the University of Google or the University of Dropbox? These are deep questions for us.”
13. ePrintsSoton: Item type: Dataset
Currently EPrints v3.2, customised to ePrintsSoton
Dataset Item Type from 2007
17. EPrints (test repo) Data Core enabled
Data Core “adds a few fields and doesn’t remove any fields from the eprint object. It creates an alternate workflow for datasets which is much smaller than a normal eprints workflow.”
App by Patrick McSweeney
19. Essex Research Data metadata profile aims
“Using metadata schema relevant to UK HE and
research data (DataCite, INSPIRE and DDI
2.1), we have developed a basic metadata profile
suited to describing research data generated at
institutions with disciplinary diversity. The
inclusion of fields like Funder and Grant number
will ensure future harvesting and linking
opportunities (like RCUK Research Outcome
Systems). The metadata also suits the EPSRC data
registry requirements.”
http://researchdataessex.posterous.com/repository-beta-metadata-profile-released
20. EPrints: Essex Research Data repository
Screenshots courtesy
JISC Research Data
@Essex project
Thanks to Louise
Corti, Tom Ensom,
Alexis Wolton
EPrints v3.3.10, customised to Essex Research Data
http://researchdata.essex.ac.uk/
22. Essex Research Data: observations
• Assumes data deposit, so no selection of EPrints Item Type
• No selection of e.g. Creative Commons licence, just copyright
• Requirement for Time Period suggests particular type of data expected
• Fields for Geographic info (not required) suggest particular type of data expected
23. Architects and surroundings
Nine Elms, London (usembassylondon)
“On one plot aggressively crystalline blocks by Rogers Stirk Harbour are going up, their diamond shapes having nothing in particular to do with anything around them. On another Foster and Partners have designed a series of curving, stepped, blobby things, of the kind usually designed to take advantage of views on the Med or the Gulf, but are here facing each other like rows of daleks. Again, it shows little interest in anything around it.”
R. Moore, Utopia on Thames, Observer, 11 Nov 2012
24. Open access repository interoperability
Confederation of Open Access Repositories (COAR)
Dublin Core, CRIS-CERIF
OpenAIRE, RepositoryNet+, Rioxx
RCUK: Research Outcomes System, Gateway to
Research, REF
Is there the same current debate about
interoperability of data repositories?
25. COAR on OA interoperability
Specific initiatives designed to support interoperability: AuthorClaim, CRIS-OAR, DataCite, DINI Certificate for Document and Publication Services, DOI, DRIVER, Handle System, KE Usage Statistics Guidelines, OAI-ORE, OAI-PMH, OA-Statistik, OA Repository Junction, OpenAIRE, ORCID, PersID, PIRUS, SURE, SWORD, and UK RepositoryNet+.
COAR, The Current State of Open Access Repository
Interoperability (2012), 26 Oct. 2012 v.02
MT @gknight2000 (Gareth Knight) Lincoln's CKan
instance impressive bit.ly/QQd1au Doesn't appear to
support OAIPMH or preservation function #jiscmrd
26. What next for DataPool repositories?
Sharepoint
• User test and feedback sessions scheduled, will
direct further development
EPrints apps (1 or 2 of the following, initially)
• Develop app based on Essex data repository, providing other repositories with a 1-click install of this profile
• Build interoperability (I/O) apps: e.g. Data Management Plans, Dropbox
• Automate record capture for producers of large-scale, regular data outputs
Editor's Notes
I thank Graham Pryor of DCC, organiser of this RDMF9 meeting, for inviting this talk, and for suggesting this topic based, presumably, on this project blog post. It sets out some of the higher-level issues while avoiding the trap of setting up a straw man pitting Sharepoint versus EPrints.
That blog post included this architectural diagram, produced by Peter Hancock, director of the iSolutions IT service provider at the University of Southampton. Although it leans heavily towards referencing Sharepoint, it can be viewed as a high-level reference model, analogous to the OAIS in digital preservation, and therefore as a model that can embrace other repository types.
Before we get into the detail of the presentation, here is a poster-based summary of the DataPool Project. It has a tripartite approach characteristic of similar institutional projects in the JISC MRD programme, covering data policy, training and, the area of interest here, building a data repository. It is worth noting as well, in this context, that the development partners shown in the row beneath the tripartite elements effectively represent ways of getting data in and out of the RDM service adopted, and are relevant factors in the repository design.
Here is how the different repository platforms might line up on a broad spectrum of Architected vs Engineered. This is a rough-and-ready approach to illustrate the basic point. Also included is DataFlow, from the University of Oxford, perhaps the most innovative repository platform to have emerged for RDM. Given its originality, it appears towards the architected end of the spectrum. We could not claim that Sharepoint is a new software platform in the same way as DataFlow, but from an RDM perspective you don’t get anything out of the box – you have to start from scratch and ‘architect’ an RDM solution. What developers can do is try and ‘engineer’ the designed RDM element with the IT services already provided in Sharepoint. EPrints first appeared in 2001 to manage research publications. It has offered a ‘dataset’ deposit type since 2007, so provides a ready-made solution for an RDM repository, and can be ‘engineered’ to enhance that solution. As the slide notes, other RDM repository platforms are available. In the following slides we will explore the features of our three highlighted RDM platforms, starting with DataFlow.
DataFlow is a two-stage architecture for data management: an open (Dropbox-like) space for data producers (DataStage), and a managed and curated repository (DataBank), connected by a standard content transfer protocol, SWORD. While DataBank provides a bespoke data management service for Oxford, we have recently noted experiments to connect an open source version of DataStage with EPrints- and DSpace-based curated repositories, thus providing the yearned-for Dropbox functionality that appears to be so much in demand among research data producers.
This is an example screenshot from the DataStage-EPrints experimental arrangement used by the JISC Kaptur project. It shows the Choose File-Upload button combination for uploading data, familiar to e.g. WordPress blog users. Uploaded data is then shown in a conventional file manager list.
Moving data from DataStage to the curated repository, again shown in the experimental Kaptur implementation, uses this surprisingly simple SWORD client interface. If this seems insufficient description for a curated item, presumably a more detailed SWORD client could be substituted.
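As a rough illustration of why such a client can be so small, the sketch below assembles the parts of a binary deposit request in Python without sending anything. It assumes SWORD v2-style headers (the Kaptur arrangement may have used a different SWORD profile), and the collection URL, function name and file name are all hypothetical.

```python
from urllib.parse import quote


def build_sword_deposit(collection_url: str, filename: str, payload: bytes) -> dict:
    """Assemble (but do not send) a SWORD v2-style binary deposit request."""
    headers = {
        "Content-Type": "application/octet-stream",
        # The file name travels in the header, percent-encoded to stay ASCII-safe.
        "Content-Disposition": f"attachment; filename={quote(filename)}",
        # Packaging URI for an opaque binary file, as defined by SWORD v2.
        "Packaging": "http://purl.org/net/sword/package/Binary",
        # Signals that further description may follow before the item is finalised.
        "In-Progress": "true",
        "Content-Length": str(len(payload)),
    }
    return {"method": "POST", "url": collection_url, "headers": headers, "body": payload}


deposit = build_sword_deposit(
    "https://repository.example.org/sword2/collection/data",  # hypothetical endpoint
    "survey results.csv",                                      # hypothetical file
    b"id,value\n1,42\n",
)
print(deposit["method"], deposit["url"])
for name, value in deposit["headers"].items():
    print(f"{name}: {value}")
```

Almost all of the request is the file itself plus a handful of headers, which is consistent with how little the client interface asks of the depositor. A richer client would add descriptive metadata alongside the file.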
One basis for building a more comprehensive description, or metadata, for research data is this 3-layer model produced by the Institutional Data Management Blueprint (IDMB) Project, the project that preceded DataPool at the University of Southampton. This is quite a general-purpose and flexible model, perhaps with more flexibility than meaning. Structurally, nevertheless, we will see that this has some relevance to repository deposit workflow design.
The 3-layer metadata model can be seen quite clearly in the emerging user interface for data deposit built on Sharepoint. Here we see the interface for collecting project descriptions, used once per project and then linked to each data record produced by the project.
In the same style, here is the Sharepoint user interface for collecting data descriptions. One of the most noticeable features within both the Project and Data forms is the small number of mandatory fields (indicated with a red asterisk), just one on each form. Mandatory fields have to be filled in for the form to submit successfully. Most people will have encountered such fields: when a Web shopping form is submitted with one left blank, it is invariably returned with a warning in red text. In this case you could feasibly submit a project or data description containing only a title. Aspects such as this will shortly be subject to user testing and review.
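The relationship between the two forms, and the effect of having a single mandatory field, can be sketched in a few lines of Python. This is an illustration only, not the Sharepoint implementation: the class names and the optional fields (funder, description) are hypothetical, and the only detail taken from the forms is that a title is the one required value.

```python
from dataclasses import dataclass
from typing import List, Optional

MANDATORY_FIELDS = ("title",)  # one mandatory field per form, as on the slides


@dataclass
class ProjectRecord:
    title: str = ""
    funder: Optional[str] = None  # hypothetical optional field


@dataclass
class DataRecord:
    title: str = ""
    project: Optional[ProjectRecord] = None  # link back to the project description
    description: Optional[str] = None        # hypothetical optional field


def missing_mandatory(record) -> List[str]:
    """Return the names of mandatory fields that are empty or whitespace only."""
    return [
        name for name in MANDATORY_FIELDS
        if not str(getattr(record, name) or "").strip()
    ]


# A project is described once, then linked from each data record it produces.
project = ProjectRecord(title="Coastal imaging survey")
records = [
    DataRecord(title="Scan batch 1", project=project),
    DataRecord(title="   ", project=project),
]

for record in records:
    problems = missing_mandatory(record)
    status = "accepted" if not problems else f"rejected, missing: {problems}"
    print(repr(record.title), status)
```

The first record is accepted with nothing but a title and a project link, which is exactly the minimal submission the forms currently allow.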
Sharepoint has its detractors as an IT service platform, principally bemoaning its complexity-to-functionality ratio. Prof Simon Cox from Southampton University takes the opposite view passionately. This is an extract from his intervention at a DataPool Steering Group meeting (May 2012) putting the case for Sharepoint. It is a good way of understanding the wider strengths of Sharepoint, which may not be immediately apparent to users of particular Sharepoint services. Building the range of services suggested is a difficult and long-term project.
EPrints supports the deposit of many item types, including datasets since 2007. When you open a new deposit process in EPrints you will first be shown this screen, where you can select an item type such as ‘dataset’.
Selecting ‘dataset’ will take you to this next screen, which might look something like this from ePrintsSoton, the Southampton Institutional Repository. This is not quite a default screen for standard EPrints installs; the workflow and fields have been customised in some areas by a repository developer.
EPrints users need not be restricted to standard interfaces or interfaces customised to a repository requirement. Interfaces in EPrints can be added or amended by simply installing an app from the app store, or Bazaar. Unlike apps in the Apple app store, with which the Bazaar might optimistically be compared, EPrints apps are not chosen and installed by users; installation is authorised by repository managers. There are already two apps for those managers to choose from to suit particular RDM workflow requirements: DataShare and Data Core. More data apps are expected to follow. EPrints is thus being engineered for flexibility in RDM deposit. In the following slides we will explore these first two data apps.
DataShare makes some minor modifications to the default EPrints workflow for deposit of datasets, highlighted with red circles here.
Data Core aims to implement a minimal ‘core’ metadata set for datasets. Implementing this app will overwrite the default EPrints workflow, replacing it with the minimal set, approximately half of which is shown here (the remainder in the next slide). In addition, we have a short description of the design aims for Data Core; no equivalent description is available for Sharepoint data deposit or the DataShare app.
Taking both slides showing the Data Core deposit workflow, this is comparable, in extent, with the Sharepoint ‘data’ interface shown earlier, although it has a few more mandatory fields.
Another example of an EPrints data deposit interface has been developed at the University of Essex. Like Data Core, the Essex approach has explicit design objectives, based on aligning with other metadata initiatives to support multi-disciplinary data. In other words, this does not simply expand or reduce the default EPrints workflow for data deposit, but starts with a new perspective. We have been liaising with its development team to investigate the possibility of building this approach into an Essex EPrints app for other repositories to share.
Here is a section of the Essex workflow, highlighting one area of major difference with the default workflow. It shows fields for time- and geographic-based information.
We’ve looked at getting data into the repository, but not yet how it is displayed as an output, or a data record from the repository. This is one example. It is not the most revealing record, but could be expanded.
Essex has cited specific design criteria for its research data repository. Additionally we have observed some characteristic features, indicated here. In particular, it is a data-only repository, without provision for the other item types offered by EPrints (shown in slide 13). The indication of mandatory fields adds a further layer of insight into the implementation of the design criteria.
So far in this presentation we have seen different implementations of data repository deposit interfaces, including DataFlow, Sharepoint, and multiple interfaces for EPrints. Where is this heading, and what are the common themes? Since we are exploring the difference between architecting and engineering these repositories, I was interested to see this national newspaper article about a major redevelopment of an area close to central London, Nine Elms, an area that interests me as I pass through it on a regular basis. Phrases that stand out refer to the relationship between the planned new high-rise buildings. What does this have to do with data repositories?
Interoperability is the relationship between repositories and how they interact with services, such as search, through shared metadata. If repositories have ‘nothing in particular to do with anything around them’ or ‘show little interest in anything around’ them, then they will not be interoperable. If repositories stand alone rather than interoperate then they become less effective at making their contents visible. Open access repositories have long recognised the importance of interoperability, being founded on the Open Archives Initiative (OAI) over a decade ago, and efforts to improve interoperability continue with current developments. Shown here are some current interoperability initiatives from one morning’s mailbox. Data repositories will be connected to this debate, but so far it has not been a priority in all the examples we have considered here.
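To make ‘shared metadata’ concrete, here is a minimal Python sketch of what a harvesting service does with an OAI-PMH repository: it builds a ListRecords request for Dublin Core and reads titles out of the response. The base URL and the sample response are invented for illustration, and nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL a harvester would request to list a repository's records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_titles(response_xml: str) -> list:
    """Extract every Dublin Core title from an OAI-PMH ListRecords response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.findall(".//dc:title", NS)]


# Invented sample response, trimmed to the parts the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")  # hypothetical endpoint
titles = record_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

A repository that cannot answer a request of this kind is invisible to any service that relies on harvesting, whatever the merits of its deposit interface.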
One of the organisations listed on the previous slide, COAR, produced a report that outlines more comprehensively the scope of current interoperability initiatives for open access. While some solutions to the capture of research data seen here have reasonably been ‘architected’, that is, starting with a blank sheet to focus on the specific design needs of data deposit, these will need to catch up quickly with interoperability requirements, including most of those listed here. Data repositories ‘engineered’ on a platform such as EPrints, originally designed for other data types, do not obviously lack the flexibility to accommodate research data, and by virtue of having contributed to repository interoperability since the original OAI, already support most of the requirements shown here.
As for the DataPool Project, it will continue its dual approach of developing and testing both Sharepoint and EPrints apps. As a project it does not get to choose what is ultimately adopted to run the emerging research data repository at the University of Southampton. There are repository-specific factors that will determine that; but there are other organisational factors to take into account as well. Institutions seeking to build research data repositories that are clearly focussed on this range of factors are likely to have most success in implementing a repository to attract data deposit and usage.