Papers Past Revitalisation: NDF 2007
Presentation on the redevelopment of the National Library's Papers Past website, which contains over one million digitised copies of newspapers pages. Given at the National Digital Forum in 2007 by Tracey Powell and Gordon Paynter



Presentation Transcript

  • Papers Past: Present and Future
    • Revitalising the Papers Past Historic Newspaper Collection
    • Gordon Paynter & Tracy Powell, National Library of New Zealand
    • National Digital Forum Conference, 29 November 2007
  • Outline
    • What is Papers Past?
    • What users wanted
    • Large-scale OCR for newspapers
    • User interface development
    • User response to new site
    • Papers Past: Present and Future
  • Papers Past (2001-2007)
  • What users wanted
    • Papers Past was popular, but users wanted more:
      • Searchability
      • More newspapers
      • Better printability
      • Easier downloads
    • User Survey: “[I] would love to be able to search across the newspapers (but I guess that would be a pretty big project with OCR!).” (A respondent to the Papers Past user survey conducted in the planning stage of the project)
  • User research
    • Online survey:
      • On the front page of Papers Past
      • 212 responses in about a month
    • Comparative usability study:
      • Five Papers Past users were invited to the Library.
      • Performed tasks on Colorado and Utah collections
        • Observed using the features of these sites
        • Asked what features were important
  • Online survey: who are the users?
    • User types (based on 212 responses to online survey)
  • Comparative usability study
    • Everybody used search, and used it first
    • Essential:
      • Printing (with context, such as citation information)
      • Browsing from page to page within a paper
      • Search term highlighting
    • Important:
      • Browse by region and by title
      • Browse by date (important, but somehow confusing)
      • Background information – for advanced users
  • Our perspective
    • An old website, and looks it
    • It is not compliant with Government web standards
    • Web browser needs a Java Applet to view TIFF files
    • Valuable (and expensive) content not being exploited
  • Large-scale Optical Character Recognition for Newspapers
  • How do we make them searchable?
    • The collection is too large to transcribe:
      • over a million pages of text,
      • about 200,000 newspaper issues,
      • around 26 million articles,
      • approximately 7 billion words.
    • Where do we get a text equivalent to search?
    • Optical Character Recognition: “A process by which software reads a page image and translates it into a text file by recognizing the shapes of the letters.” (The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials)
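To get a feel for why manual transcription was ruled out, the collection figures quoted on this slide can be turned into rough per-page averages and a typing-time estimate. This is only a back-of-envelope sketch: the collection counts are the approximate figures from the slide, while the typing speed and the length of a working year are assumptions introduced here for illustration and do not come from the presentation.

```python
# Approximate collection figures quoted on the slide.
PAGES = 1_000_000          # "over a million pages"
ISSUES = 200_000           # "about 200,000 newspaper issues"
ARTICLES = 26_000_000      # "around 26 million articles"
WORDS = 7_000_000_000      # "approximately 7 billion words"

# Assumptions for illustration only (not from the presentation).
TYPING_WPM = 40            # sustained typing speed of one transcriber
HOURS_PER_WORK_YEAR = 1_600


def collection_scale():
    """Return rough averages and a manual-transcription effort estimate."""
    typing_hours = WORDS / TYPING_WPM / 60
    return {
        "pages_per_issue": PAGES / ISSUES,
        "articles_per_page": ARTICLES / PAGES,
        "words_per_page": WORDS / PAGES,
        "typing_hours": typing_hours,
        "typing_person_years": typing_hours / HOURS_PER_WORK_YEAR,
    }


scale = collection_scale()
for name, value in scale.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions the typing alone runs to well over a thousand person-years, before any proofreading, which is consistent with the slide's point that a text equivalent has to come from somewhere other than transcription.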
  • Large-scale OCR for newspapers
    • Performed by Planman Consulting in New Delhi
    • CCS docWORKs software
      • Abbyy Finereader 8.0 OCR engine (industrial version)
      • Includes our list of Māori and geographical terms
  • Large-scale OCR for newspapers
    • Pages cropped and deskewed
    • Pages organised into issues
    • Pages zoned into Articles, Advertisements and Illustrations
    • Text is captured with OCR software
    • Selected Issue metadata manually cleaned up
    • Headline metadata manually corrected
    • Output to XML and image files
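The workflow above implies a simple hierarchy: issues contain pages, and pages are zoned into articles, advertisements and illustrations that carry OCR text and, where relevant, a headline. The sketch below models that hierarchy and serialises it to XML. It is a hypothetical illustration only: the class names, element names and attributes are invented here and are not the schema that docWORKs or Planman actually produced, and the sample issue is placeholder data.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET

# Zone categories named on the slide.
ZONE_TYPES = {"Article", "Advertisement", "Illustration"}


@dataclass
class Zone:
    zone_type: str
    headline: str = ""
    ocr_text: str = ""

    def __post_init__(self):
        if self.zone_type not in ZONE_TYPES:
            raise ValueError(f"unknown zone type: {self.zone_type}")


@dataclass
class Page:
    number: int
    image_file: str
    zones: list = field(default_factory=list)


@dataclass
class Issue:
    title: str
    date: str
    pages: list = field(default_factory=list)


def issue_to_xml(issue):
    """Serialise one issue to an XML string (hypothetical element names)."""
    root = ET.Element("issue", title=issue.title, date=issue.date)
    for page in issue.pages:
        page_el = ET.SubElement(
            root, "page", number=str(page.number), image=page.image_file
        )
        for zone in page.zones:
            zone_el = ET.SubElement(page_el, "zone", type=zone.zone_type)
            if zone.headline:
                ET.SubElement(zone_el, "headline").text = zone.headline
            ET.SubElement(zone_el, "text").text = zone.ocr_text
    return ET.tostring(root, encoding="unicode")


# Placeholder data, not taken from the real collection.
sample = Issue(
    title="Example Gazette",
    date="1900-01-01",
    pages=[
        Page(
            number=1,
            image_file="page-0001.tif",
            zones=[
                Zone("Article", headline="Sample headline", ocr_text="Sample text."),
                Zone("Advertisement", ocr_text="Sample notice."),
            ],
        )
    ],
)
xml_output = issue_to_xml(sample)
print(xml_output)
```

Keeping the headline separate from the body text mirrors the workflow on the slide, where headline metadata was manually corrected while the body text was left as raw OCR output.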
  • User Interface
  • Building a new web interface
    • Prototype developed by DL Consulting
    • Based on Greenstone software
    • A hybrid collection, containing:
      • Searchable newspapers with OCR text
      • Browse-only newspapers from old Papers Past
    • User interface redesigned by ClickSuite
    • User interface testing and refinement
    • Launched 03 September 2007
  • User testing
    • 15 users observed
    • One on one sessions
    • Free browsing, then some fixed tasks
    • Positive response overall
    • Design of interface well-received
    • Some changes made in response:
      • Search page rearranged
      • Search history moved from search page to own page
    • Overall rating (chart)
  • New Papers Past
  • User Response
  • User response
    • We now have more people using the site
    • The terms Search and Browse are not well understood
      • “Searchable” interpreted as “online”
      • “Not searchable” interpreted as “not available”
    • Search functionality is very popular
    • Browse less well-received by hard-core researchers
    • Some of the issues relate only to material that is not searchable, and will disappear when all the material is searchable.
  • Web statistics after first month
    • Number of unique visitors: increase of 21%
    • Number of visits: increase of 331%
    • Average visits per visitor: was 2, now 6
    • Average length of visit: was 14 minutes, now 29 minutes
    • Visitor repeat rate: was 23%, now 73.2%
    • Number of page views: increase of 443%
    • Conclusion: we have more users, and they are using the site a lot more
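The web statistics mix two kinds of figures: relative increases (such as 443% for page views) and before/after pairs (such as 2 visits becoming 6). A small sketch can put them on a common footing. It assumes the percentages on the slide are relative increases over the earlier period, which the slide does not state explicitly; the function names are introduced here for illustration.

```python
def percent_increase(before, after):
    """Relative increase from before to after, as a percentage."""
    if before == 0:
        raise ValueError("percent increase is undefined for a zero baseline")
    return (after - before) / before * 100


def growth_factor(increase_pct):
    """Multiplier implied by a relative increase, e.g. 100% means 2x."""
    return 1 + increase_pct / 100


# Before/after pairs quoted on the slide.
visits_per_visitor = percent_increase(2, 6)
visit_length = percent_increase(14, 29)

# The repeat rate is itself a percentage, so its change is best read
# as a difference in percentage points.
repeat_rate_points = 73.2 - 23

print(f"Average visits per visitor: +{visits_per_visitor:.0f}%")
print(f"Average length of visit: +{visit_length:.0f}%")
print(f"Visitor repeat rate: +{repeat_rate_points:.1f} percentage points")
print(f"Page views multiplier: {growth_factor(443):.2f}x")
print(f"Visits multiplier: {growth_factor(331):.2f}x")
```

Read this way, visits and page views grew several-fold while unique visitors grew far less, which fits the slide's conclusion that existing users were using the site a lot more.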
  • Papers Past: Present and Future
    • Website
      • Promote website to new user groups
      • Potential new features in response to user feedback
      • Evaluating Planman’s prototype metadata editing tool
    • Annual digitisation programme
      • Digitising new newspapers (and filling gaps)
      • Making all the existing pages searchable
    • Research
      • Documenting the relative advantages of OCR and transcription of textual materials
      • Testing whether changing from bitonal digitisation to greyscale digitisation improves OCR accuracy
  • Thank you