The Digital Documents Harvesting and Processing Tool (Document Harvester)

Jennie Grimshaw Lead Curator, Social Policy & Official Publications The British Library
Presentation at ALISS xmas 2014 event.

Published in: Education

  1. The Digital Documents Harvesting and Processing Tool (Document Harvester)
     Jennie Grimshaw, Lead Curator, Social Policy & Official Publications, The British Library

  2. The Challenge
     • Ensuring long-term preservation of non-commercial online publications through the Legal Deposit Web Archive (LDWA)
     • Providing easy access at the level of the individual document
     • Providing a system for efficient processing of individual documents at scale
     • Storing documents economically

  3. The solution: DDHAPT
     • Crawls target websites at set intervals
     • Stores them in the LDWA
     • Presents a list of new documents for selection
     • Selector creates basic metadata
     • Basic record + hotlink available after seven days

  4. STEP 1: Selector (librarian, digital processing team member or publisher) logs in
     • DDHAPT is a web-based application based on W3ACT
     • Web Archiving Team sets up users with different roles + permissions
     • After login the selector is taken to a personalised homepage, with
       – List of crawled targets + date
       – Crawl succeeded/failed
       – Links to new documents harvested
       – Provision to set up a new watched target

  5. STEP 2: Selector searches for and enters details of the selected Watched Target
     • Watched target = URL of publications page
     • Crawl frequency can be set in light of volume/frequency of publishing
     • Use NPLD selection criteria

  6. STEP 3: Doc Harvester crawls websites, retrieves Docs and sends the Selector (or a generic email address) an email confirming that it has completed the crawl
     • Will report if no new documents are found
     • Will signal possible duplicates
     • Selector can ignore documents not required for individual cataloguing
     • Populates the metadata creation form

  7. STEP 4: Selector reviews and edits the metadata fields for each Document and clicks ‘Submit’, which sends the SIP to Ingest
     • Separate screens for books, journal issues and journal articles
     • Displays the document and the pre-populated form side by side
     • Selectors will need to check and edit the metadata
     • Hit the Submit button; the tool confirms whether the submission was successful

  8. Step 4: DDHAPT sample metadata
     • *Title: Living standards, poverty and inequality in the UK: 2014
     • ISBN: 9781909463523
     • Author(s) (up to 3): Belfield, Chris; Cribb, Jonathan; Hood, Andrew
     • Corporate Author: (blank)
     • *Publisher: Institute for Fiscal Studies
     • Edition: (blank)
     • *Year of publication: 2014
     • DOI (Digital Object Identifier): 10.1920/re.ifs.2014.0096
     • *Submission type (should autopopulate): Book
     • *Filename (automated): (blank)
     • URL landing page = http://www.ifs.org.uk/publications/7274
     • URL pdf = http://www.ifs.org.uk/uploads/publications/comms/r96.pdf
     • Comment: published in series: Report no. 96

  9. STEP 5: The SIP is ingested and flows through Aleph and into Primo
     • Basic catalogue records appear in Explore (our catalogue) 7 days after ingest, to make content available as soon as possible.
     • Catalogue records are upgraded later by the BL cataloguing team.
     • The Documents can also be found in the LDWA.

  10. Phase 2
     • In Phase 2, from January to the end of March 2015, we will develop the Tool to harvest content on websites with simple username-and-password barriers, as covered in the NPLD Regulations.
     • From April 2015, we hope to roll out use of the Tool within BL and LDLs

  11. An additional use case: the landing page approach
     • Aims to reduce effort by forming sets of harvested websites into collections
     • Each collection will have a landing page set up in the LDWA
     • A single entry will link from Explore to the collection landing page

  12. Any questions?
     www.bl.uk
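
Step 3 (slide 6) says the tool reports when no new documents are found and signals possible duplicates. The slides do not say how DDHAPT does this, so the following is only a minimal sketch under stated assumptions: a document counts as already seen if its URL was offered in an earlier crawl, and as a possible duplicate if its content hash matches an earlier document at a different URL. `HarvestedDoc`, `triage` and the example.org URLs are hypothetical names used for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HarvestedDoc:
    """Hypothetical record for one document retrieved by a crawl."""
    url: str
    content: bytes

    @property
    def digest(self) -> str:
        # SHA-256 of the content, used here as an assumed duplicate signal.
        return hashlib.sha256(self.content).hexdigest()


def triage(crawled, seen_urls, seen_digests):
    """Split a crawl's documents into new ones and possible duplicates.

    seen_urls and seen_digests are sets carried over between crawls
    and are updated in place.
    """
    new_docs, possible_duplicates = [], []
    for doc in crawled:
        if doc.url in seen_urls:
            continue  # already offered to the selector in an earlier crawl
        if doc.digest in seen_digests:
            possible_duplicates.append(doc)  # same content, different URL
        else:
            new_docs.append(doc)
        seen_urls.add(doc.url)
        seen_digests.add(doc.digest)
    return new_docs, possible_duplicates


seen_urls, seen_digests = set(), set()

first_crawl = [HarvestedDoc("https://example.org/pubs/r1.pdf", b"report one")]
new_docs, dups = triage(first_crawl, seen_urls, seen_digests)

second_crawl = [
    HarvestedDoc("https://example.org/pubs/r1.pdf", b"report one"),
    HarvestedDoc("https://example.org/pubs/r1-copy.pdf", b"report one"),
    HarvestedDoc("https://example.org/pubs/r2.pdf", b"report two"),
]
new_docs2, dups2 = triage(second_crawl, seen_urls, seen_digests)

if not new_docs2 and not dups2:
    print("No new documents found")
```

In this sketch a possible duplicate is still returned, so a selector could review it or ignore it, in line with the slide's point that selectors can skip documents not required for individual cataloguing.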

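
The sample metadata on slide 8 can be pictured as a simple record that is checked before the selector submits it. This is a hedged sketch, not DDHAPT's actual form logic: it assumes the asterisked fields are mandatory, that the automated filename is the last path segment of the PDF URL, and that the ISBN is checked with the standard ISBN-13 checksum. The field names and helper functions are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def isbn13_is_valid(isbn: str) -> bool:
    """Standard ISBN-13 check: weights 1,3,1,3,... must sum to a multiple of 10."""
    cleaned = isbn.replace("-", "").replace(" ", "")
    if len(cleaned) != 13 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(cleaned))
    return total % 10 == 0


def filename_from_url(url: str) -> str:
    """Assumed rule: the filename is the last path segment of the PDF URL."""
    return posixpath.basename(urlparse(url).path)


# Assumption: fields marked with an asterisk on the slide are mandatory.
REQUIRED = ("title", "publisher", "year", "submission_type", "filename")
MAX_AUTHORS = 3  # the slide says "Up to 3"


def problems(record: dict) -> list:
    """Return a list of human-readable issues; empty means ready to submit."""
    issues = [f"missing {name}" for name in REQUIRED if not record.get(name)]
    if len(record.get("authors", [])) > MAX_AUTHORS:
        issues.append("too many authors")
    if record.get("isbn") and not isbn13_is_valid(record["isbn"]):
        issues.append("invalid ISBN")
    return issues


pdf_url = "http://www.ifs.org.uk/uploads/publications/comms/r96.pdf"
sample = {
    "title": "Living standards, poverty and inequality in the UK: 2014",
    "isbn": "9781909463523",
    "authors": ["Belfield, Chris", "Cribb, Jonathan", "Hood, Andrew"],
    "publisher": "Institute for Fiscal Studies",
    "year": "2014",
    "doi": "10.1920/re.ifs.2014.0096",
    "submission_type": "Book",
    "filename": filename_from_url(pdf_url),
    "url_landing_page": "http://www.ifs.org.uk/publications/7274",
    "url_pdf": pdf_url,
    "comment": "published in series: Report no. 96",
}

print(problems(sample))
```

Returning a list of issues rather than a single pass/fail flag fits the slide's note that selectors need to check and edit the pre-populated metadata, since each issue points at a field to correct.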