Newspaper digitisation workflows: presentation for cultural heritage digitisation professionals. 2008

  • In addition to the titles selected for the program, the National Library has received a $1 million grant from the Vincent Fairfax Family Foundation to digitise The Sydney Morning Herald through to 1954. This project will run concurrently with the Australian Newspapers Digitisation Program (ANDP), and the pages will be included in the delivery system being developed for the ANDP.

    1. Newspaper Digitisation Workflows
       • Rose Holley, Manager, ANDP
       • Presentation to Cultural Heritage Digitisation professionals
       • 26 November 2008
    2. General Workflow
       • Preparing for Digitisation
       • Creation of digital images
       • Adding metadata and Quality Assurance
       • Optical Character Recognition
       • Quality Assurance
       • Other information
       • Access & interaction
       • Statistics
    3. Preparing for Digitisation
       • Identify title to be digitised
       • Source master microfilm from owner
       • Send master microfilm to scanning contractors
       • Add title to Content Management System
    4. Add Title Screen
    5. Microfilm converted to digital images
    6. Image Reception
       • Images received from scanning contractor on LTO2 Tape
       • Tapes added to tape robot and extracted
       • Reels automatically added to Content Management System
       • Reel details are checked
       • Images ingested into Content Management System
    7. Check Reel Details
    8. Ingest Reels
    9. Quality Assurance (QA)
       • QA Phase 1 – Add metadata (dates and page numbers)
       • Supervisor reviews marked pages
       • QA Phase 2 – Define batches
       • QA Phase 2 – Resolve duplicates
       • QA Phase 2 – Create missing page targets
    10. Adding Metadata
        • Date and Page Sequence number added
    11. Supervisor Review
        • Supervisor reviews pages marked for attention
    12. Define Batches
        • Batches defined by date
        • Each batch contains 2-3000 images
        • Batches are automatically assigned a number
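The batching rules on the Define Batches slide can be illustrated with a small sketch. It is not the ANDP Content Management System's actual logic: the function name `define_batches`, the `(issue_date, image_id)` input shape, and the choice to keep each issue date whole within one batch are assumptions made for illustration. The default cap of 3000 reads the slide's "2-3000 images" as an upper bound of 3,000.

```python
from collections import defaultdict
from datetime import date
from itertools import count


def define_batches(pages, max_images=3000):
    """Group (issue_date, image_id) pairs into date-ordered, numbered batches.

    Hypothetical rule: all images for one issue date stay in the same batch,
    and a new batch starts when the next date would push the current batch
    past max_images.
    """
    by_date = defaultdict(list)
    for issue_date, image_id in pages:
        by_date[issue_date].append(image_id)

    batches = []
    current = []
    numbers = count(1)  # batches are numbered automatically, starting at 1
    for issue_date in sorted(by_date):
        images = by_date[issue_date]
        if current and len(current) + len(images) > max_images:
            batches.append({"number": next(numbers), "images": current})
            current = []
        current.extend((issue_date, image_id) for image_id in images)
    if current:
        batches.append({"number": next(numbers), "images": current})
    return batches
```

Keeping an issue date intact is a design choice of this sketch; the slides only say that batches are defined by date.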
    13. Resolve Duplicates
        • Duplicate pages compared and the best copy is selected
    14. Missing Pages
        • Missing page targets are generated
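One way to picture how missing page targets could be identified is to look for gaps in the page sequence numbers added during QA Phase 1. The sketch below is illustrative only: the function name `missing_page_targets` is hypothetical, and it assumes that pages in an issue are numbered from 1 and that the highest-numbered page was scanned.

```python
def missing_page_targets(page_numbers):
    """Return the page sequence numbers absent from a single issue.

    Assumes numbering starts at 1 and the last page of the issue is present,
    so only gaps below the highest scanned page number can be detected.
    """
    present = set(page_numbers)
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]
```

Each number returned would correspond to one placeholder target to generate for that issue.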
    15. Optical Character Recognition (OCR)
        • Complete batches are added to a tape
        • Tapes are generated and written by IT
        • Tapes sent to OCR contractor
        • Contractor completes OCR processes
        • OCR data (not images) is returned via FTP
    16. Tapes Created
        • Completed batches added to a tape
    17. Optical Character Recognition (OCR) of pages and article zoning
    18. OCR Data Reception (Automated process)
        • OCR contractor advises NLA server that a batch has been completed
        • NLA server downloads the batch
        • Batch is ingested into Content Management System
        • Checks are performed on data validity
        • QA Derivatives are generated
        • Articles may now be searched, but are not yet accessible
    19. Batch information
    20. Quality Assurance (QA)
        • A random sample of Issues and Articles is checked
        • Volume and Issue number are checked for accuracy
        • Sample articles are checked against Quality Acceptance Criteria (QAC)
        • Error rates calculated against QAC on the fly
        • Supervisor checks final result and decides on accepting the batch
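The sampling and error-rate steps on the Quality Assurance slide can be sketched as follows. This is a minimal illustration, not the ANDP system: the function names, the `(article_id, passed)` record shape, and the 5% threshold are all assumptions, since the slides do not state the actual Quality Acceptance Criteria. The final accept or reject decision remains with the supervisor, so the last function only produces a recommendation.

```python
import random


def sample_articles(article_ids, sample_size, seed=None):
    """Draw a random sample of articles from a batch, without replacement."""
    ids = list(article_ids)
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(ids, min(sample_size, len(ids)))


def error_rate(checked):
    """Fraction of checked articles that failed.

    `checked` is a list of (article_id, passed) pairs, where `passed` is
    True if the article met the (hypothetical) acceptance criteria.
    """
    if not checked:
        return 0.0
    failures = sum(1 for _, passed in checked if not passed)
    return failures / len(checked)


def recommend(checked, max_error_rate=0.05):
    """Suggest a result for the supervisor; the 5% threshold is an assumption."""
    return "accept" if error_rate(checked) <= max_error_rate else "reject"
```

Because `error_rate` takes whatever has been checked so far, it can be recomputed after each article, which matches the "on the fly" calculation described on the slide.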
    21. Selecting the batch
    22. Volume & Issue Number Check
    23. Article checked against QAC
    24. Clean fields checked for accuracy
    25. Supervisor checks result and makes a decision
    26. QA Results
        • Automated email sent to supplier advising the result
        • Emails for rejected batches include a summary of errors
        • Summary of errors saved for all batches
        • Accepted batches are immediately accessible
    27. Access
        • Access is provided through Australian Newspapers beta
        • Users can search or browse newspapers
        • Search results can be refined using filters
        • Users can browse by Newspaper title or Date.
    28. Search Results
    29. Newspaper information
    30. User Interaction
        • Users are able to:
            • Correct the text
            • Add tags
            • Add comments
        • User-added content is not currently moderated, but may be in future.
    31. Statistics
        • Statistics on content received and quality-assured are generated on request by the Content Management System
        • Statistics on usage volume of the beta are collected using Google Analytics
        • Statistics on user contributions to the beta are collected on an as-needed basis
    32. Content Statistics
    33. Work Statistics
    34. Usage Statistics
    35. Questions?
