Document Harvester Story. Evolution of a System

Presentation by Jennie Grimshaw (British Library) to the ALISS Christmas event, December 2019.
Covers web archiving.

  1. Document Harvester Story: Evolution of a System
  2. www.bl.uk The Technology
     • Based on the W3ACT web archiving tool
     • Built by the Austrian Institute of Technology
     • Supported by the Web Archiving Technical Team at the British Library
  3. The Weekly Cycle
     Website crawled → PDFs extracted → Selected by curator → Metadata created and submitted → Appears in catalogues of LDLs
  4. The Theory
     • Crawls document-heavy websites – watched targets – at set intervals
     • Finds and extracts PDFs – books and serial issues
     • Presents a list to the selector for review, with the most recent at the top
     • Selector creates metadata for documents that meet the selection criteria
     • Submits metadata to Aleph – documents remain in the web archive
     • Metadata exported to Explore and the catalogues of the other 5 LDLs, with a link to the document in the web archive
     • Can be used remotely by colleagues based in the other LDLs
     • Initially used for official publications, but could be developed for other formats, e.g. sheet music
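
The "find PDFs, then list the most recent first" steps on this slide can be sketched roughly as follows. This is a minimal illustration only, not the W3ACT implementation: the `CrawledResource` type, the `documents_for_selection` function and the example URLs are all hypothetical, and a real crawler would more likely check the content type than the file extension.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse


@dataclass
class CrawledResource:
    """Hypothetical record of one URL captured during a crawl."""
    url: str
    crawled_at: datetime


def is_pdf(url: str) -> bool:
    """Treat a URL as a PDF when its path ends in .pdf (case-insensitive)."""
    return urlparse(url).path.lower().endswith(".pdf")


def documents_for_selection(resources):
    """Keep only PDFs and list the most recently crawled first."""
    pdfs = [r for r in resources if is_pdf(r.url)]
    return sorted(pdfs, key=lambda r: r.crawled_at, reverse=True)


# Example data (made up for illustration).
crawl = [
    CrawledResource("https://example.org/report-2018.pdf", datetime(2019, 1, 7)),
    CrawledResource("https://example.org/about.html", datetime(2019, 3, 4)),
    CrawledResource("https://example.org/report-2019.pdf", datetime(2019, 2, 4)),
]
for doc in documents_for_selection(crawl):
    print(doc.crawled_at.date(), doc.url)
```

The HTML page is dropped, and the selector would see the 2019 report above the 2018 one.
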
  5. Choosing Your Watched Targets
     • Must be in scope for NPLD
     • Documents must be accessible to the crawler, i.e. not dynamic or held in the cloud
     • Documents must be PDFs
     • There may be hidden problems with the structure of the site – always consult a web archivist, e.g. Welsh National Parks
  6. Your Personalised Home Page
  7. Documents for Selection
  8. Metadata Creation Screen: MEX Tools
  9. Metadata Creation Screen
  10. Metadata Creation: Save and Submit
      • Once you have completed the metadata form, hit the save button
      • Review your work
      • Press edit to make changes
      • Press submit to upload metadata to Aleph
      • System creates a link to the document location in the web archive in the 856 field of the record – not through an ARK
      • Document remains in the web archive
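
As a rough illustration of the 856 link mentioned on this slide, the sketch below formats a MARC 856 field with the URL in subfield `$u` (first indicator 4 for HTTP, second indicator 0 for the resource itself). The function name, the textual layout of the field and the example URL are assumptions made for illustration; the slides do not show how the system actually writes the field to Aleph.

```python
def make_856_field(archive_url: str) -> str:
    """Format a MARC 856 field (indicators 4 and 0) with the link in subfield $u."""
    if not archive_url.startswith(("http://", "https://")):
        raise ValueError("expected an http(s) URL")
    return f"856 40 $u {archive_url}"


# Hypothetical archived-document URL, used only for illustration.
print(make_856_field("https://example.org/archive/20191201000000/https://example.org/report.pdf"))
```
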
  11. Metadata Enhancement: Automated Addition of DDC Number
      Identifies DH records in Aleph → Matches 1st FAST heading with look-up table → Populates 082 field with DDC number from table
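
The three steps on this slide amount to a table lookup keyed on the first FAST heading. A minimal sketch, assuming a simplified record held as a Python dictionary: the `source` and `fast_headings` keys, the `add_ddc` function and the three table entries are all illustrative stand-ins, not the Aleph data model or the British Library's actual look-up table.

```python
# Illustrative look-up table: FAST heading -> DDC number (example entries only).
FAST_TO_DDC = {
    "Education": "370",
    "Economics": "330",
    "Law": "340",
}


def add_ddc(record, table=FAST_TO_DDC):
    """Populate field 082 from the first FAST heading of a Document Harvester record."""
    # Step 1: only Document Harvester (DH) records are enhanced.
    if record.get("source") != "DH":
        return record
    # Step 2: match the first FAST heading against the look-up table.
    headings = record.get("fast_headings", [])
    if not headings:
        return record
    ddc = table.get(headings[0])
    # Step 3: populate 082 only when the table has a match.
    if ddc is not None:
        record["082"] = ddc
    return record


record = {"source": "DH", "fast_headings": ["Education", "Law"]}
print(add_ddc(record)["082"])
```

Only the first heading is consulted, so the record above is classed under the number for "Education" even though "Law" is also present.
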
  12. Metadata Enhancement: Name Authority Control
      • Check LC Name Authorities → Name established? → Use established form
      • Check LC Name Authorities → New name? → Send for Authority Control
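
The decision on this slide can be sketched as a single lookup with two outcomes. The dictionary below is a tiny hypothetical stand-in for the LC Name Authority File, and `resolve_name` and its return labels are invented for illustration; in practice the check is made against the authority file itself.

```python
# Illustrative stand-in for the LC Name Authority File:
# normalised name as entered -> established form.
ESTABLISHED_NAMES = {
    "department for education": "Great Britain. Department for Education",
}


def resolve_name(name, authorities=ESTABLISHED_NAMES):
    """Return the action to take for a name, following the slide's two branches."""
    # Normalise case and whitespace before looking the name up.
    key = " ".join(name.lower().split())
    established = authorities.get(key)
    if established is not None:
        # Name established: use the established form.
        return ("use_established_form", established)
    # New name: send it for authority control.
    return ("send_for_authority_control", name)


print(resolve_name("Department for  Education"))
print(resolve_name("A Newly Formed Agency"))
```
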
  13. Future Development: Document Harvester Review
      • Immediate Improvements
        • Implement serials issue processing
        • Implement metadata screen enhancements
        • Implement use for sheet music
      • Long Term Challenges
        • Changes in official publishing
        • Reduce system vulnerabilities
        • Changes in user behaviour
  14. Document Harvester Challenges: Metadata Enhancements
      • To improve record quality we need:
        • Access to full set of FAST headings
        • Boxes to add qualifiers and relator terms to names
        • A notes field
      • But these changes require end-to-end testing and cannot be resourced due to DAMPS, etc.
  15. Document Harvester Challenges: Serial Issues Processing
      • Module does not work due to basic design flaws
      • Cannot accession & upload serial issues, but will submit to Digital Processing Team
      • Labour intensive, as extensive checking is needed to find the bibliographic record manually, confirm the issue is not held, etc.
      • Then accessioned manually
      • Enabling auto-accessioning would require complete redesign of module
  16. Document Harvester Challenges: Changes in Publishing Practice
      • Move from PDF
        • Government prefers ODF or HTML
      • Move from single documents
        • Collections of related documents, in different formats, linked to a landing page
  17. Document Harvester Challenges: System Vulnerabilities
      • Website redesigns which crash crawls – beyond our control, e.g. GDS redesign of GOV.UK
      • Unforeseen impacts of changes to W3ACT, e.g. implementation of Python wayback
      • Lack of resources in the Web Archive Technical Team
  18. Document Harvester Challenges: User Behaviour
      • User behaviour has changed following print to electronic transition, but how?
      • Identify use cases
      • Search for evidence
      • What about your users?
  19. Contact Details
      • If you are able to share insights on user behaviour around official information, please contact me:
      • Email: jennie.grimshaw@bl.uk
      • Tel: 0207 412 7537
  20. Thank you
      Presentation © British Library Board, 2019