To architect or engineer? Lessons from DataPool on building RDM repositories


There cannot be many mature products where development meetings have not been interrupted by a rueful declaration that, to make further progress, “you wouldn’t start from here”. This encapsulates one key difference between the architect and the engineer: the engineer is prepared to work with the set of tools provided, while the architect prefers to start with a blank sheet of paper or an open space. In building research data repositories on two different software platforms, Microsoft Sharepoint and EPrints, the DataPool Project is working somewhere between these extremes. Which approach will prove the more resilient for research data management (RDM)? In this talk we will look at the relevant factors.

Published in: Technology

  • I thank Graham Pryor of DCC, organiser of this RDMF9 meeting, for inviting this talk, and for suggesting this topic based, presumably, on this project blog post. It sets out some of the higher-level issues while avoiding the trap of setting up a straw man pitting Sharepoint versus EPrints.
  • That blog post included this architectural diagram, produced by Peter Hancock, director of the iSolutions IT service provider at the University of Southampton. Although it leans heavily towards referencing Sharepoint, it can be viewed as a high-level reference model, analogous to the OAIS in digital preservation, and therefore as a model that can embrace other repository types.
  • Before we get into the detail of the presentation, here is a poster-based summary of the DataPool Project. It has a tripartite approach characteristic of similar institutional projects in the JISC MRD programme, covering data policy, training and, the area of interest here, building a data repository. It is worth noting as well, in this context, that the development partners shown in the row beneath the tripartite elements effectively represent ways of getting data in and out of the RDM service adopted, and are relevant factors in the repository design.
  • Here is how the different repository platforms might line up on a broad spectrum of Architected vs Engineered. This is a rough-and-ready approach to illustrate the basic point. Also included is DataFlow, from the University of Oxford, perhaps the most innovative repository platform to have emerged for RDM. Given its originality, it appears towards the architected end of the spectrum. We could not claim that Sharepoint is a new software platform in the same way as DataFlow, but from an RDM perspective you don’t get anything out of the box – you have to start from scratch and ‘architect’ an RDM solution. What developers can do is try and ‘engineer’ the designed RDM element with the IT services already provided in Sharepoint. EPrints first appeared in 2001 to manage research publications. It has offered a ‘dataset’ deposit type since 2007, so provides a ready-made solution for an RDM repository, and can be ‘engineered’ to enhance that solution. As the slide notes, other RDM repository platforms are available. In the following slides we will explore the features of our three highlighted RDM platforms, starting with DataFlow.
  • DataFlow is a two-stage architecture for data management: an open (Dropbox-like) space for data producers (DataStage), and a managed and curated repository (DataBank), connected by a standard content transfer protocol, SWORD. While DataBank provides a bespoke data management service for Oxford, we have recently noted experiments to connect an open source version of DataStage with EPrints- and DSpace-based curated repositories, thus providing the yearned-for Dropbox functionality apparently so much in demand among research data producers.
  • This is an example screenshot from the DataStage-EPrints experimental arrangement used by the JISC Kaptur project. It shows the Choose File-Upload button combination for uploading data, familiar to e.g. Wordpress blog users. Uploaded data is then shown in a conventional file manager list.
  • Moving data from DataStage to the curated repository, again shown in the experimental Kaptur implementation, uses this surprisingly simple SWORD client interface. If this seems an insufficient description for a curated item, presumably a more detailed SWORD client could be substituted.
  • One basis for building a more comprehensive description, or metadata, for research data is this 3-layer model produced by the Institutional Data Management Blueprint (IDMB) Project, the project that preceded DataPool at the University of Southampton. This is quite a general-purpose and flexible model, perhaps with more flexibility than meaning. Structurally, nevertheless, we will see that this has some relevance to repository deposit workflow design.
  • The 3-layer metadata model can be seen quite clearly in the emerging user interface for data deposit built on Sharepoint. Here we see the interface for collecting project descriptions, used once per project and then linked to each data record produced by the project.
  • In the same style, here is the Sharepoint user interface for collecting data descriptions. One of the most noticeable features of both the Project and Data forms is the small number of mandatory fields (indicated with a red asterisk), just one on each form. Mandatory fields have to be filled in for the form to submit successfully. Most people will have experienced these fields: when a Web shopping form is submitted incomplete, it is invariably returned with a red text warning against each one. In this case you could feasibly submit a project or data description containing only a title. Aspects such as this are shortly to be subjected to user testing and review of this implementation.
  • Sharepoint has its detractors as an IT service platform, principally bemoaning its complexity-to-functionality ratio. Prof Simon Cox from Southampton University takes the opposite view passionately. This is an extract from his intervention at a DataPool Steering Group meeting (May 2012) putting the case for Sharepoint. It is a good way of understanding the wider strengths of Sharepoint, which may not be immediately apparent to users of particular Sharepoint services. Building the range of services suggested is a difficult and long-term project.
  • EPrints supports the deposit of many item types, including datasets since 2007. When you open a new deposit process in EPrints you will first be shown this screen, where you can select an item type such as ‘dataset’.
  • Selecting ‘dataset’ will take you to this next screen, which might look something like this from ePrintsSoton, the Southampton Institutional Repository. This is not quite a default screen for standard EPrints installs; the workflow and fields have been customised in some areas by a repository developer.
  • EPrints users need not be restricted to standard interfaces or interfaces customised to a repository requirement. Interfaces in EPrints can be added or amended by simply installing an app from the app store, or Bazaar. Unlike the Apple app store, with which it might optimistically be compared, EPrints apps are not selected to be installed by users but installation is authorised by repository managers. There are already two apps for those managers to choose to suit particular RDM workflow requirements: DataShare and Data Core. More data apps are expected to follow. EPrints is thus being engineered for flexibility in RDM deposit. In the following slides we will explore these first two data apps.
  • DataShare makes some minor modifications to the default EPrints workflow for deposit of datasets, highlighted with red circles here.
  • Data Core aims to implement a minimal ‘core’ metadata for datasets. Implementing this app will overwrite the default EPrints workflow, replacing it with the minimal set, approximately half of which is shown here (the remainder in the next slide). In addition, we have a short description of the design aims for Data Core, which are unavailable for Sharepoint data deposit and the DataShare app.
  • Taking both slides showing the Data Core deposit workflow, this is comparable, in extent, with the Sharepoint ‘data’ interface shown earlier, although it has a few more mandatory fields.
  • Another example of an EPrints data deposit interface has been developed at the University of Essex. Like Data Core, the Essex approach has explicit design objectives, based on aligning with other metadata initiatives to support multi-disciplinary data. In other words, this does not simply expand or reduce the default EPrints workflow for data deposit, but starts with a new perspective. We have been liaising with its development team to investigate the possibility of building this approach into an Essex EPrints app for other repositories to share.
  • Here is a section of the Essex workflow, highlighting one area of major difference with the default workflow. It shows fields for time- and geographic-based information.
  • We’ve looked at getting data into the repository, but not yet how it is displayed as an output, or a data record from the repository. This is one example. It is not the most revealing record, but could be expanded.
  • Essex has cited specific design criteria for its research data repository. Additionally we have observed some characteristic features, indicated here. In particular, it is a data-only repository, without provision for other data-types offered by EPrints (shown in slide 13). The indication of mandatory fields adds a further layer of insight into the implementation of the design criteria.
  • So far in this presentation we have seen different implementations of data repository deposit interfaces, including DataFlow, Sharepoint, and multiple interfaces for EPrints. Where is this heading, and what are the common themes? Since we are exploring the difference between architecting and engineering these repositories, I was interested to see this national newspaper article about a major redevelopment of an area close to central London, Nine Elms, an area that interests me as I pass through it on a regular basis. Phrases that stand out refer to the relationship between the planned new high-rise buildings. What does this have to do with data repositories?
  • Interoperability is the relationship between repositories and how they interact with services, such as search, through shared metadata. If repositories have ‘nothing in particular to do with anything around them’ or ‘show little interest in anything around’ them, then they will not be interoperable. If repositories stand alone rather than interoperate then they become less effective at making their contents visible. Open access repositories have long recognised the importance of interoperability, being founded on the Open Archives Initiative (OAI) over a decade ago, and efforts to improve interoperability continue with current developments. Shown here are some current interoperability initiatives from one morning’s mailbox. Data repositories will be connected to this debate, but so far it has not been a priority in all the examples we have considered here.
  • One of the organisations listed on the previous slide, COAR, produced a report that outlines more comprehensively the scope of current interoperability initiatives for open access. While some solutions to the capture of research data seen here have reasonably been ‘architected’, that is, starting with a blank sheet to focus on the specific design needs of data deposit, these will need to catch up quickly with interoperability requirements, including most of those listed here. Data repositories ‘engineered’ on a platform such as EPrints, originally designed for other data types, do not obviously lack the flexibility to accommodate research data, and by virtue of having contributed to repository interoperability since the original OAI, already support most of the requirements shown here.
  • As for the DataPool Project, it will continue its dual approach of developing and testing both Sharepoint and EPrints apps. As a project it does not get to choose what is ultimately adopted to run the emerging research data repository at the University of Southampton. There are repository-specific factors that will determine that; but there are other organisational factors to take into account as well. Institutions seeking to build research data repositories that are clearly focussed on this range of factors are likely to have most success in implementing a repository to attract data deposit and usage.
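The deposit forms described in the notes above (a project description entered once, linked to each data record, with a single mandatory field on each form) can be pictured with a minimal Python sketch. This is not the Sharepoint or EPrints implementation: the class names, the `description` field and the `missing_mandatory` helper are hypothetical, chosen only for illustration. The `title`, `file_format` and `keywords` fields follow the forms as described in the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProjectDescription:
    """Project-level description, entered once per project."""
    title: str                         # the only mandatory field on the form
    description: Optional[str] = None  # hypothetical optional field


@dataclass
class DataDescription:
    """Description of one data record, linked back to its project."""
    title: str                         # again the only mandatory field
    project: ProjectDescription        # link to the shared project description
    file_format: Optional[str] = None  # optional, as on the data form
    keywords: List[str] = field(default_factory=list)


def missing_mandatory(record, mandatory=("title",)) -> List[str]:
    """Return the names of mandatory fields that are empty or only whitespace."""
    return [
        name
        for name in mandatory
        if not str(getattr(record, name, "") or "").strip()
    ]
```

With these definitions a record containing only a title passes the check, which is the situation the notes flag for user testing; adding names to the `mandatory` tuple is one way to model a stricter form such as Data Core.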

    1. To architect or engineer? Lessons from DataPool on building RDM repositories. Steve Hitchcock, JISC DataPool Project. 9th DCC Research Data Management Forum (RDMF9), Cambridge, 14-15 November 2012
    2. Why architecting?
    3. DataPool architecture (Sharepoint). Peter Hancock, iSolutions, University of Southampton
    4. DataPool: Building Capacity, Developing Skills, Supporting Researchers (project poster, October 2011 to March 2013). Strands: Policy and guidance; Training; Data repository. Other poster labels: SharePoint; EPrints 3.3; EPrints data apps; Doctoral Training Centres; Graduate & staff training services; Case studies (Imaging, 3D; Geodata; ++); University Strategic Research Groups. Informed by: IDMB; Surveys of data practices among academics; 3-layer metadata. Developing/working with: Support for Data Management Plans; Capture/share with external sources, e.g. SWORD-ARM; Large-scale data storage; Assign DataCite DOIs. Byatt, D., Hitchcock, S. and White, W., JISCMRD Progress Workshop, Nottingham, 24-25 October 2012
    5. Data repository platforms. Spectrum from Architected to Engineered: • DataFlow • MS Sharepoint • EPrints. From a data repository perspective. Other platforms available: • DSpace, CKAN, data.bris, etc.
    6. Implementations of DataFlow model. DataStage, connected via SWORD to a curated repository/archive. DataFlow: two data deposit motivations for creators: want to (practice), need to (policy). Two-stage architecture. Addresses Dropbox effect for data producers. Curated repositories shown: DataBank; EPrints; DSpace; QMUL
    7. DataStage: Upload file. DataStage was developed at the University of Oxford. DataStage screenshots courtesy JISC Kaptur project. Thanks to Carlos Silva
    8. DataStage: Submit as data package
    9. 3-layer metadata model. Takeda et al., 6th IDCC, Dec. 2010, available from JISC Institutional Data Management Blueprint (IDMB) Project, University of Southampton
    10. SharePoint user interface 1: project
    11. SharePoint user interface 2: data. + fields for format, keywords
    12. Prof. Simon Cox (engng) on Sharepoint: “The concept that formed part of SP thinking (at Southampton) from the very inception … that ability to use SP as a way to manage or at least collaborate as part of a 5-10 year programme of work. “The other side is what we’re doing with intellectual property and what we’re offering for students. I chair a group design project, and every single student has said ‘I just do it all on Dropbox’. The same is happening with our research. So I think we have at least to provide a level of service and a level of integration between our research experience and our teaching experience. Would these people go to Southampton rather than University of Nowhereshire on the Web or the University of Google or the University of Dropbox? These are deep questions for us.”
    13. ePrintsSoton: Item type: Dataset. Currently EPrints v3.2, customised to ePrintsSoton. Dataset Item Type from 2007
    14. ePrintsSoton: start to deposit Dataset
    15. EPrints data apps. Apps available from EPrints Bazaar. Apps work with EPrints v3.3 or later
    16. EPrints (test repo) DataShare enabled. App by Tim Brody, EPrints + DataPool
    17. EPrints (test repo) Data Core enabled. Data Core “adds a few fields and doesn’t remove any fields from the eprint object. It creates an alternate workflow for datasets which is much smaller than a normal eprints workflow.” App by Patrick McSweeney
    18. EPrints (test repo) Data Core enabled 2. App by Patrick McSweeney
    19. Essex Research Data metadata profile aims: “Using metadata schema relevant to UK HE and research data (DataCite, INSPIRE and DDI 2.1), we have developed a basic metadata profile suited to describing research data generated at institutions with disciplinary diversity. The inclusion of fields like Funder and Grant number will ensure future harvesting and linking opportunities (like RCUK Research Outcome Systems). The metadata also suits the EPSRC data registry requirements.”
    20. EPrints: Essex Research Data repository. Screenshots courtesy JISC Research Data @Essex project. Thanks to Louise Corti, Tom Ensom, Alexis Wolton. EPrints v3.3.10, customised to Essex Research Data
    21. Essex Research Data record
    22. Essex Research Data: observations • Assumes data deposit, so no selection of EPrints Item Type • No selection of e.g. Creative Commons licence, just copyright • Requirement for Time Period suggests particular type of data expected • Fields for Geographic info (not required) suggests particular type of data expected
    23. Architects and surroundings. “On one plot aggressively crystalline blocks by Rogers Stirk Harbour are going up, their diamond shapes having nothing in particular to do with anything around them. On another Foster and Partners have designed a series of curving, stepped, blobby things, of the kind usually designed to take advantage of views on the Med or the Gulf, but are here facing each other like rows of daleks. Again, it shows little interest in anything around it.” R. Moore, Utopia on Thames, Observer, 11 Nov 2012. Nine Elms, London; usembassylondon
    24. Open access repository interoperability. Confederation of Open Access Repositories (COAR); Dublin Core, CRIS-CERIF; OpenAIRE, RepositoryNet+, Rioxx; RCUK: Research Outcomes System, Gateway to Research, REF. Is there the same current debate about interoperability of data repositories?
    25. COAR on OA interoperability. Specific initiatives designed to support interoperability: AuthorClaim, CRIS-OAR, DataCite, DINI Certificate for Document and Publication Services, DOI, DRIVER, Handle System, KE Usage Statistics Guidelines, OAI-ORE, OAI-PMH, OA-Statistik, OA Repository Junction, OpenAIRE, ORCID, PersID, PIRUS, SURE, SWORD, and UK RepositoryNet+. COAR, The Current State of Open Access Repository Interoperability (2012), 26 Oct. 2012 v.02. MT @gknight2000 (Gareth Knight) Lincolns CKan instance impressive Doesnt appear to support OAIPMH or preservation function #jiscmrd
    26. What next for DataPool repositories? Sharepoint • User test and feedback sessions scheduled, will direct further development. EPrints apps (1 or 2 of following, initially) • Develop app based on Essex data repository, providing other repositories with a 1-click install of this profile • Build interoperability (I/O) apps: e.g. Data Management Plans, Dropbox • Automate record capture for producers of large-scale, regular data outputs
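OAI-PMH, one of the initiatives in the COAR list on slide 25, is the harvesting protocol through which repositories expose their metadata to services such as search. As a rough illustration of what support for it means in practice, the Python sketch below builds a harvest request URL. The six verbs and the `oai_dc` metadata prefix come from the OAI-PMH 2.0 specification; the repository address and the `oai_pmh_url` helper are hypothetical, and nothing here is taken from the DataPool, EPrints or Sharepoint code.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_pmh_url(base_url: str, verb: str,
                arguments: Optional[Dict[str, str]] = None) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments or {})
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"

# Ask for all records as unqualified Dublin Core.
url = oai_pmh_url(BASE_URL, "ListRecords", {"metadataPrefix": "oai_dc"})
```

A harvester would fetch this URL and parse the XML response. A repository that cannot answer such a request is, in the terms of the talk, standing alone rather than interoperating.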