Implementing Durham Etheses - Sebastian Palucha (Pecha Kucha)
Presented by Sebastian Palucha
CC BY jitze http://www.flickr.com/photos/jitze1942/3521700792
Initial project spring/summer
First deposit September 2009
~ 300 research theses per year
Simple deposit, single PDF
EPrints 3.1.3 (born 2009)
CC BY didbygraham
Registered: EThOS, Driver, OCLC Digital Gateway (2010 spr.)
EThOS harvest in operation (2010 sum.)
Google Analytics stats (2010 dec.)
EThOS digitised theses loaded (2011 sum.)
Google Custom Search (aut. 2011)
Collaboration with The BL
to improve EThOS services
(aut. 2011 – spr. 2012)
EU/ICO Cookie Law support (2013 sum.)
local digitisation project,
10k (2012 spr. – )
MySQL migrated to UTF-8 (2013 spring)
Creative Commons Licences
introduced (2012 aut.)
CC BY AlishaV http://www.flickr.com/photos/alishav/3156574283
Branding: uniform user experience
• Issues: browsers, branding
• Durham University CMS CSS
• EPrints 3 CSS
Simplistic single PDF deposit
• Details > Upload > Deposit
• LDAP integration + user field population
• Embargo implemented in first screen
CC BY Pink Sherbet Photography
Highly customized LaTeX code
Issues with UTF-8 in both plugin and LaTeX template
Issues with dynamic if/else
Google Analytics: full text
• Two steps:
1. PDF download link (core code)
2. special GA profile
• URL structure includes department
• Internal code modification
through OAI-PMH harvest
• Issues with out of the box plug-in, changes to XML schema needed
• uketdterms:qualificationlevel not defined in EPrints data model
• Embargo date not included. Plugin assumes embargo at record
level, whereas EPrints sets it at document level!
• Added department names
• Occasional issues with UTF-8 encoding
EThOS download WS
• Script for mass download https://github.com/paluchas/ethos-bl
groovy EthosDownloadClient.groovy -i 238830 -m download
EThOS avoiding duplication
• We store EThOS persistent IDs
• We modified /cgi/oai2 script to conditionally exclude ethos records
• Modified record can be exposed to EThOS harvest in future
Unknown copy/paste issues
Cover Pages LaTeX
Whole MySQL database migration to
UTF-8, fortunately double encoding
CC BY familymwr http://www.flickr.com/photos/familymwr/5548057120//
Creative Commons Licences
Approached by student:
specific query about
particular CC to be used
A lot of redefinition in code
Better search, DRO integration
Google Custom Search with modified search results
Retrospective digitisation project
• 10k paper theses being digitised by local company
• Mass upload with metadata in XML file and digitised material in PDF
files, web and archive version. A lot of metadata and quality issues
• Interesting samples of other materials:
big prints, DVDs, CDs, cassette tapes,
microfilms, small datasets and research software.
EU/ICO Cookies Law
CC BY USAG-Humphreys
Repository versus real life
• Users would like to deposit files other than PDF.
• Requested “Dark” storage
• Encrypted PDFs
• Take-down requests and Web cached content. How far should we liaise
with the external world?
• Some students are unaware of the consequences of web deposits: 3rd
party copyright, sensitive data not embargoed etc.
• Disciplinary differences; not only humanities vs. sciences.
• External users requesting contact with author or supervisors
virtualization, operating systems
Bespoke changes and technology
hard to coordinate across the
CC BY Rennett Stowe
Review process, be paper free, include pass list, extend workflow to exam
Actively encourage students to use CC licences by demonstrating their benefits
Encourage deposit of key data sets and explore data visualization
Migrate to new repository framework
Integration with Durham University RIS
Google Analytics live stats, integration with IRUS-UK
CC BY Boston Public Library
Repository of the future
CC by http://www.flickr.com/photos/keoni101/7069578953
CC BY Keoni Cabral http://www.flickr.com/photos/52193570@N04/7069578953
In the last decade a custom has become established among the tribes of higher education: to have a repository. At Durham we are very proud of our difference, so in following the other tribe members we decided to have three different repositories: Durham Research Online, Durham E-Theses, and Archives and Special Collections. This is the story of how we implemented Durham E-Theses.
We started our run with limited peripheral vision. We concentrated on being ready, within a short time scale, for the first deposit at the beginning of the 2009 academic year. Our goal was a deposit process as simple as possible: a single PDF file. We also had a strong inclination to be compliant with the EThOS project. We chose EPrints because we already had some experience of it with Durham Research Online. However, because of how DRO was implemented, it could not be used to handle theses, so we opted for a separate EPrints installation.
Outline:
• Durham E-Theses background
• Timeframe of the work
• Changes made to EPrints plugin
• Issues recognised in OAI-PMH metadata exchange
• Harvesting digitised data from EThOS
• Download Groovy script developed: code snippets and basic use example (Action: should I develop a GUI/GitHub version?)
• En masse load and metadata correction
• Changes to OAI-PMH to filter duplication
• Further digitised material upload
• Comments about EThOS harvesting procedures when metadata elements change (e.g. embargo introduced due to take-down procedures)
Over time, however, we changed or corrected our initial expectations. Life brought us new ideas, and implementing them was as challenging as running up a steep hill, but certainly very enjoyable once achieved.
One of the very first tasks was to brand our new repository. This might seem to be just a matter of merging two CSS style sheets, one from EPrints and one from the University CMS. Unfortunately, it took a lot of time to work around browser issues, most notably in IE. Imagine our excitement when the University recently modified its branding.
We also scaled down the EPrints data model to support only the thesis type and removed unnecessary functionality. The goal was a very simple deposit process. Our first deposit screen, “Details”, collects the thesis metadata. User, year and department are obtained from the user database. We redesigned the embargo functionality to underpin the University's five-year model. Metadata: title, abstract, award, full text status; predefined from LDAP: year, department; not required: keywords. Moving the embargo to the first stage came at a huge coding expense.
EPrints comes with a highly customisable LaTeX-based cover page. The initial template is rather uninteresting. We added to our cover pages information on how to cite the thesis, as well as the use policy. On some occasions we had character encoding problems, so we needed to modify the plugin and the LaTeX template to avoid this issue.
Like everybody, we love stats. We use Google Analytics to monitor the number of full text downloads. This was implemented in two stages. First, core code was modified to produce a link that triggers GA counting when clicked. Second, a special profile was set up in GA. The PDF URL carries department information, so it is easy to produce stats for an individual department.
It was our aspiration to be a member of the EThOS project. We registered our service very early to be harvested via the OAI-PMH protocol. We also modified the out-of-the-box UKETD-DC plugin to provide additional information, most notably the thesis embargo. Recently we participated in the EThOS workshop to share our knowledge and experience.
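The embargo mismatch that the modified plugin had to bridge (embargo held per document in EPrints, expected per record by the harvest) can be sketched as follows. This is a minimal Python illustration, not the actual Perl plugin code; the `Document` class, its field names, and the rule of reporting the latest embargo still in force are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass
class Document:
    # Hypothetical stand-in for one file attached to a thesis record.
    filename: str
    embargo_until: Optional[date] = None  # None means the file is open


def record_embargo(documents: Sequence[Document], today: date) -> Optional[date]:
    """Collapse document-level embargoes into one record-level date.

    Returns the latest embargo date still in force, or None when every
    document is already open.
    """
    active = [
        d.embargo_until
        for d in documents
        if d.embargo_until is not None and d.embargo_until > today
    ]
    return max(active) if active else None


# Illustrative data only: one embargoed file and one open file.
docs = [
    Document("thesis.pdf", date(2015, 9, 1)),
    Document("appendix.pdf"),
]
print(record_embargo(docs, today=date(2012, 7, 5)))
```

Taking the latest active date is the conservative choice: the record is reported as embargoed for as long as any of its files is.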
Through the British Library EThOS project, more than 1000 theses have been digitised on demand by users. We wanted to load those PDF files back into our service, so we developed a simple command-line client that talks to the EThOS WS API. Subsequently all files were mass uploaded to Durham E-Theses.
http://etheses.dur.ac.uk/cgi/oai2?verb=ListRecords&from=2012-07-05T22:16:45Z&metadataPrefix=uketd_dc
http://etheses.dur.ac.uk/cgi/oai2?verb=ListRecords&from=2012-07-05T22:16:45Z&metadataPrefix=oai_dc (will display Epid:1530)
Modifications include:
• Adding new fields to the EPrints data model: eprint_fields.pl, workflow, phrases.
• Adding a conditional OAI metadata format filter (metadataPrefix=uketd_dc) in the oai.pl configuration file.
• Adding metadata format filter support in the /cgi/oai2 script.
To proactively avoid record duplication through EThOS reharvesting, we extended the editorial data model with the EThOS persistent ID and a simple control for whether a record should be excluded from the EThOS/UKETD-DC harvesting process. We have modified the OAI-PMH core plugin to filter EThOS records.
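The conditional filter described above can be sketched as follows. This is a minimal Python illustration of the idea, not the modified Perl `/cgi/oai2` script: the record dictionaries and the `exclude_from_ethos` and `ethos_id` field names are hypothetical, and the persistent ID value is a placeholder. Only the endpoint, the two metadata prefixes and the record number 1530 come from the notes.

```python
from urllib.parse import urlencode

BASE_URL = "http://etheses.dur.ac.uk/cgi/oai2"  # endpoint quoted in the notes


def list_records_url(metadata_prefix: str, from_timestamp: str) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode(
        {
            "verb": "ListRecords",
            "from": from_timestamp,
            "metadataPrefix": metadata_prefix,
        },
        safe=":",  # keep the colons of the timestamp readable
    )
    return f"{BASE_URL}?{query}"


def visible_records(records, metadata_prefix):
    """Hide flagged records, but only from the UKETD-DC (EThOS) harvest."""
    if metadata_prefix != "uketd_dc":
        return list(records)
    return [r for r in records if not r.get("exclude_from_ethos")]


# Hypothetical records: 1530 came from EThOS and is flagged for exclusion.
records = [
    {"eprintid": 1530, "ethos_id": "placeholder-id", "exclude_from_ethos": True},
    {"eprintid": 1531, "ethos_id": None, "exclude_from_ethos": False},
]

for prefix in ("uketd_dc", "oai_dc"):
    ids = [r["eprintid"] for r in visible_records(records, prefix)]
    print(list_records_url(prefix, "2012-07-05T22:16:45Z"), ids)
```

Because the filter is keyed on the metadata format, the same record stays visible to general `oai_dc` harvesters and can be re-exposed to EThOS later simply by clearing the flag.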
One of our major technical concerns was that wrong character encoding could occasionally be introduced into the metadata. This looks rather unpleasant and can spoil the user experience. Recently we adopted a radical solution: we converted the whole internal database to be UTF-8 compliant and updated the EPrints database connector.
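The kind of damage involved, and why double encoding is a fortunate case, can be sketched as follows. This is a minimal Python illustration under the assumption that UTF-8 bytes were misread as Latin-1 and then stored again; it is not the migration procedure actually used, and the function name is invented for the example.

```python
def repair_double_encoding(text: str) -> str:
    """Undo one round of UTF-8 bytes having been misread as Latin-1.

    Text that does not fit this pattern is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or its
        # bytes are not valid UTF-8: assume it was stored correctly.
        return text


damaged = "caf\u00c3\u00a9"  # how a double-encoded "café" is displayed
print(repair_double_encoding(damaged))
print(repair_double_encoding("caf\u00e9"))  # already correct, left alone
```

Double encoding is reversible because no bytes are lost: the original UTF-8 sequence is still present, only wrapped in a second layer. A real MySQL `latin1` column behaves like Windows-1252 for some byte values, so a production repair would need to account for that.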
In our default implementation we assumed that copyright in the theses would be retained by the user. This was an unnecessarily conservative approach, which did not encourage our students to explore CC licences. So we were very glad to be approached by students asking us to implement a specific CC licence with a particular jurisdiction. We updated the wording and the CC version to support three jurisdictions: England & Wales, world and USA.
And then we realised that, once a work is licensed under CC, we need to state this clearly not only on the abstract page but also in the automated harvest and on the cover pages.
Durham E-Theses is the younger sister of the Durham Research Online service. One of our later goals was to unify the two repositories by providing a single search box for external users. We use Google Custom Search with bespoke search results. Durham E-Theses results are presented under the Theses tab.
In the last year we started a local project to digitise our stock of 10k paper theses. Some of those theses come with supplementary material, which is also being digitised. Like the BL, we implemented a robust take-down policy that acts on the author's request.
Currently we are in the process of implementing the EU cookie law. It is a rather murky area, with no clear interpretation of which cookies are essential and thus required by the service. With the additional Google services, we feel like we are running in the night without a light or a clear course to follow.
We also introduced permanent “dark” storage based on the deletion state. We had not anticipated that users would want to deposit encrypted PDFs or multimedia files.