Legal Research using digitised historic Australian Newspapers August 2010, by Rose Holley


Rose Holley gives an overview of the Australian Newspapers service, which is now integrated into the Trove discovery service. The digitisation workflow, user engagement and searching are covered.

Published in: Technology
  • Every state and territory library in Australia is involved in this national program. By 2011 we will have digitised 4 million newspaper pages, which is about 40 million articles. The Sydney Morning Herald will comprise 600,000 of these pages. Each state has selected a major daily newspaper to begin with. We are working in collaboration with the state and territory libraries to digitise these first 4 million pages, since many of them own the microfilm copies of the newspapers from which we create the digital images. Regional newspapers will be included from this year onwards. Regional titles are being contributed from libraries around Australia.
  • The purpose of the ANDP is to provide access to historic Australian newspapers in an online environment. Key features of this service will be that it is accessible online, access is free, and the service is full-text searchable.
  • At this stage, the ANDP was approved by the Minister in November 2006. A contract was signed in March 2007, with processing beginning in July. By December 2007, 500,000 pages had been digitised, with another 2.5 million pages to be digitised over the next 3 years.
  • In addition to the titles selected for the program, the National Library has received a $1 million grant from the Vincent Fairfax Family Foundation to digitise The Sydney Morning Herald through to 1954. This project will run concurrently with the ANDP, and the pages will be included in the delivery system being developed for the ANDP.
  • This is the home page of Australian Newspapers beta. Users can either keyword search or browse by date, title or state. The service is being heavily used with around 28,000 keyword searches per day and an unknown number of browses. We have not widely publicised the service as yet since we are still in a beta version.
  • This is the article view. Users can zoom in or out and choose to view the article in the context of the entire page. They can also navigate to any other page within the newspaper issue. The electronically generated text created through the OCR process is displayed on the left-hand side. This is also where users can use the 3 enhancement features. They can drag the viewing pane to see more or less. Users can tag the article with keywords, and they can write comments and notes about the article. If users log in they can choose to make their tags and comments public or private, so they can share their comments with all users or add private research notes that only they can access. One feature that we believe is innovative, and not available in any other online newspaper service, is the ability for the user to correct the electronically generated text. There are a number of reasons why the electronically created text is not always 100% accurate, mainly the quality of the original newspaper that the image was created from. Users can correct the text by clicking on the ‘Help fix this text’ button. We will now use these features on this article. The article we are looking at is the first report in an Australian paper of the sinking of the Titanic. It’s in the Northern Territory Times of 19 April 1912.
  • I want to tag the article with ‘Titanic sinking’. If a user does not log in when they first enter the service, then the first time they want to enhance an article they will be offered the option to log in. At this point they can either log in or enter the captcha to verify they are human (and not a robot attempting to do something undesirable). Once logged in or verified with the captcha, a user can enter their tags.
  • Now I want to add a comment. Those of you who read this article may have noticed that it reported that all passengers were safely rescued from the Titanic and that the weather was calm. I’ll just add a comment to say this was unfortunately not the case.
  • Now I have zoomed in on the image, and if the OCR text were inaccurate I would edit it in the box on the left. This is what we call the power edit mode. In this article the text is actually very accurate, so it has either been OCR’d very well or already been corrected by someone else.
  • Now we can review the article with all the enhancements we have made showing on the left: tags, comments and corrections. We can view the history of all the enhancements (both ours and other people’s).
  • One of the innovative features in the first release was the ability for members of the public to correct or enhance the OCR text. When digitising old newspapers, the process is to convert a digital image into full text using Optical Character Recognition (OCR) software. This works well on new, clear documents, but on old newspapers where the font and paper are of poor quality and the microfilm may be out of focus, the output is often gibberish. After investigating every possible technical way of improving this, we came to the conclusion that the best way was by hand and human eye. We could not possibly afford to pay contractors to do this ‘re-keying’, so the lead programmer, Kent Fitch, suggested we open it up for the public to do. If the text was made accurate, searching would be instantly improved for everyone, since the search works over the OCR text.
  • Several people can correct the same article. All corrections are saved and viewable in the history of the article. All versions of the corrections are searched. It is the last correction that is visible in the left-hand pane. Articles are corrected by many users when they are very long, very significant, or very illegible. For example, this article is from the first Australian newspaper, the Sydney Gazette and NSW Advertiser of March 1803. Around 20 people have made corrections to this article. It is particularly challenging because it uses the long s (ſ), which looks like an f, in place of an s.
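The correction behaviour described in this note (every correction kept, all versions searched, the latest one displayed) can be sketched as a small data structure. This is only an illustrative sketch, not the service's actual implementation: the `Article` class, its method names, and the decision to also search the original OCR text are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical model of an article and its correction history."""
    ocr_text: str
    # Each entry is (user, corrected_text); nothing is ever overwritten.
    corrections: list = field(default_factory=list)

    def correct(self, user: str, text: str) -> None:
        """Save a new correction alongside all earlier ones."""
        self.corrections.append((user, text))

    def visible_text(self) -> str:
        """The most recent correction is shown; fall back to raw OCR text."""
        return self.corrections[-1][1] if self.corrections else self.ocr_text

    def matches(self, keyword: str) -> bool:
        """Search every saved version (assumed here to include the OCR text)."""
        versions = [self.ocr_text] + [text for _, text in self.corrections]
        needle = keyword.lower()
        return any(needle in version.lower() for version in versions)


# Example: two users correct the same line, one after the other.
article = Article("Tbe fhip arrived")
article.correct("user_a", "The fhip arrived")
article.correct("user_b", "The ship arrived")
```

Keeping every version means a later, mistaken correction cannot make an article unfindable under a term that an earlier version contained.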
  • This is the text correction history of this article, showing all the different users and what parts they corrected.
  • In response to numerous requests we instigated the ‘hall of fame’. The top 5 correctors show on the home page as well as in the hall of fame. Originally the hall of fame only showed the top 10 but users wanted to see more, so now it is anyone who has corrected more than 5000 lines per month. Users are still asking for entire league tables however so they can see where they are in the big picture. This is a motivating factor for them. During development it was suggested that we need to use gaming technologies to encourage people to correct text but this has so far not proved necessary!
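The two rankings mentioned in this note (top 5 on the home page, and a hall of fame for anyone above 5000 lines in a month) could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `monthly_rankings` and the `(user, lines)` event format are hypothetical, and only the two thresholds come from the talk.

```python
from collections import Counter

HALL_OF_FAME_THRESHOLD = 5000  # lines per month, as stated in the talk
HOME_PAGE_SLOTS = 5            # top correctors shown on the home page


def monthly_rankings(events):
    """Rank correctors for one month.

    events: iterable of (user, lines_corrected) pairs (hypothetical format).
    Returns (home_page_top, hall_of_fame), each a list of (user, total_lines)
    sorted from most to fewest lines.
    """
    totals = Counter()
    for user, lines in events:
        totals[user] += lines
    ranked = totals.most_common()  # sorted by total, highest first
    hall_of_fame = [(u, n) for u, n in ranked if n > HALL_OF_FAME_THRESHOLD]
    return ranked[:HOME_PAGE_SLOTS], hall_of_fame


# Example with made-up numbers for three users.
top, hall = monthly_rankings([("a", 3000), ("b", 6000), ("a", 2500), ("c", 100)])
```

The full `ranked` list is the "league table" users were asking for; the two published views are just slices of it.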
  • In the first 6 months a total of 2 million lines in 100,000 articles had been corrected. The top 5 correctors had consistently remained in the top 5 each month and were working up to 45 hours per week on text correction. Top correctors are correcting up to 30,000 lines per month. Many users told us that text correction is proving to be an ‘addictive’ or compulsive activity. They sat down to fix a few words for 5 minutes, and before they knew it 3 hours had passed. This was very interesting.
  • Julie, the top corrector, has featured in the media and become a star. She loves correcting articles about Bendigo murders.
  • The recent interactions users have made are also displayed on the homepage for everyone to see. You can see the number of searches in the last hour, newspaper article corrections so far today, works merged or split this week, items tagged this week, and comments this month.
  • Thank you for listening to me today. I am happy to take questions.

    1. 1. An overview of the Australian Newspapers service, August 2010. Presenter: Rose Holley, Manager - Trove and Australian Newspapers, National Library of Australia. Australian Law Librarians Association Event, 28 August 2010: Legal research using digitised historic Australian Newspapers
    2. 2.
    3. 3. National Program and Content
       • Focus on major titles from each state and territory 2007-2010
       • Coverage: published between 1803 and 1954 (out of copyright)
       • Scope: 20 million articles complete, 40 million by 2011
       • ‘Regional’ titles being contributed by libraries 2011 onwards
       Titles shown: West Australian, Northern Territory Times, Courier Mail, Advertiser, Sydney Morning Herald, Sydney Gazette, Argus, Mercury, Canberra Times
    4. 4. Aims
       • Increase access to historic Australian newspapers
       • Key Features: online access, freely available, full text searchable
       Image: The Argus, 6 February 1945
    5. 5. 1803 to 1954
    6. 6. Popular titles
    7. 7.
       • Includes Sydney Morning Herald
       • $1 million donation
    8. 8. Groundbreaking
       • Biggest project the Library has undertaken for 10 years
       • No other country has attempted on this scale
       • Had to develop our own software and systems
       • Now made this freely available to others
       • Have let the public help to improve and correct the digitised text on a mass scale
    9. 9. The process behind the scenes… Microfilm reels
    10. 10. Microfilm scanned into digital images
    11. 11.
    12. 12. Students at the Library
    13. 13. Checking Pages
       • Page sequence
       • Metadata creation
       • Missing page targets
    14. 14. Tapes with digital images sent to India
    15. 15. Article zoning and categorising, Optical Character Recognition (OCR)
    16. 16. Modern Offices - India
    17. 17. Digitisation Managers Hyderabad
    18. 18. Chennai newspaper facility
    19. 19. 150 data operators
    20. 20.
    21. 21. Inspection of work
    22. 22. Final checking at Library
    23. 23. Articles go into public system
    24. 24.
    25. 25.
    26. 26. Search or browse – date, state, title
    27. 27. Limit Results
    28. 28. Interaction at article level
    29. 29.
    30. 30. Add a comment
    31. 31. Fix text – power edit mode
    32. 32. After enhancements
    33. 33. Text correction
    34. 34. One article corrected by many
    35. 35. View all corrections on this article
    36. 36. Show activity in results RSS feeds
    37. 37. Public activity
       • Highest usage of any service at the National Library of Australia
       • Extremely positive feedback
       • Being cited internationally as an exemplary service
       • Thousands of volunteers correcting millions of lines of text (12,000 volunteers, 17 million lines corrected)
    38. 38. Hall of Fame
    39. 39. Public Text correction
       • 10 million lines: Australia Day Awards
    40. 40. Don’t stop correcting!
    41. 41. 391,378 lines improved
    42. 42. Login - Your profile
    43. 43. finding information just got easier.....
    44. 44. View comments added to newspapers
    45. 45. User Forum
    46. 46. [email_address] “Rose, The site you manage is a nightmare! It’s addictive. Keeps me awake at night. Congratulations! Mary” Questions?