The Guardian and Observer Digital Archive

Technical and commercial challenges of publishing a digital newspaper archive.

Published in: Business, Technology


    1. 1. Torsten de Riese, Guardian News & Media
    2. 2. The digital archive project
       • why digitisation?
       • newspaper digitisation in context
       • choosing the technology
       • the production process
       • commercial models
       • archive in schools
       • demo
    3. 3.
       • 5 May 1821: roots in struggle for suffrage and free speech
       • daily since 1855 (abolition of stamp duty)
       • CP Scott editor from 1872 to 1929
       • dropped ‘Manchester’ in 1959 – move to London in 1964
       • Guardian Unlimited
       • global audience: 20 million
    4. 4.
       • the world’s oldest Sunday paper, since 1791
       • sided with the North during the American Civil War
       • Astor ownership in 1911
       • David Astor (1948-1975) modernises the paper
       • prominent writers such as Orwell, Koestler, Sackville-West
       • 1993: GMG acquires The Observer
    5. 5. why digitise?
       • preservation issues
           • bound paper copies are in danger of degrading beyond repair
           • older (acetate) microfilm is fading and becoming brittle
           • a significant step towards preservation
       • accessibility
           • access confined to libraries
           • ‘needle in a haystack’
       • commercial opportunity
           • b2b licensing
           • traffic driver
    6. 6. newspaper projects
       • Library of Congress
           • 20m pages, still in progress
       • British Library c19 project
           • ca. 1m pages of discontinued titles, in progress
       • ProQuest Historical Newspapers
           • largest online newspaper archive
           • 17m pages, New York Times, Washington Post etc.
       • recent UK projects
           • TDA: published by Thomson in 2002
           • Scotsman 2005
    7. 7. technical challenges
       • sheer volume of material
           • 1.2m pages, 20m clippings
       • constant changes in page design
       • missing issues!
       • quality of source material
           • microfilm quality varies
           • some original paper copies damaged
           • printing quality pre-1900
    8. 17. technology procurement
       • invited 10 companies
       • manual processing vs. automation
       • great variations in costs
       • main selection criteria:
           • functionality
           • cost effectiveness
           • open source platform/technology
       • Olive Software:
           • mix of automation and manual processes
           • sophisticated algorithms
           • complete solution
    9. 18. scanning
       • each page scanned from microfilm
       • automated and manual monitoring
       • settings adjusted to improve OCR process
    10. 19. segmenting
       • each page is segmented into individual elements
       • metadata:
           • date, page no., headline, byline, article, advert, photograph, relationships
       • automated and manual QA
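
The metadata listed on the segmenting slide can be pictured as a small record per page element. The sketch below is only an illustration: the class names, field names and the `related_to` helper are assumptions, not the schema used by Olive Software or GNM, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Segment:
    """One element cut out of a scanned page (hypothetical schema)."""
    segment_id: str
    kind: str                          # "article", "advert" or "photograph"
    headline: Optional[str] = None
    byline: Optional[str] = None
    # Relationships: ids of other segments this one belongs to,
    # for example a photograph pointing at its article.
    related_ids: List[str] = field(default_factory=list)


@dataclass
class Page:
    """A scanned page and the segments found on it (hypothetical schema)."""
    issue_date: date
    page_no: int
    segments: List[Segment] = field(default_factory=list)

    def related_to(self, segment_id: str) -> List[Segment]:
        """Return the segments on this page that point at segment_id."""
        return [s for s in self.segments if segment_id in s.related_ids]


# Invented sample data, for illustration only.
page = Page(issue_date=date(1964, 1, 1), page_no=3)
page.segments.append(
    Segment("a1", "article", headline="Example headline", byline="A. Writer")
)
page.segments.append(Segment("p1", "photograph", related_ids=["a1"]))
page.segments.append(Segment("ad1", "advert"))

linked = [s.segment_id for s in page.related_to("a1")]
```

Keeping relationships as ids rather than nested objects is one simple way to let a photograph or caption be found from its article during QA.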
    11. 20. OCR
       • reading every element
       • applying dictionary / ‘fuzzy logic’
       • recording word coordinates for highlighting
       • capturing continuations
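
Two of the OCR steps above, dictionary correction and recording word coordinates for highlighting, can be sketched as follows. This is a minimal illustration and not the algorithm used in the project: Python's standard `difflib` stands in for the ‘fuzzy logic’, and the word list, the 0.8 cutoff, the pixel boxes and the names `Word`, `correct` and `highlight_boxes` are all assumptions.

```python
import difflib
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Toy dictionary; a real one would be far larger.
DICTIONARY = ["guardian", "manchester", "observer", "suffrage"]


@dataclass
class Word:
    text: str  # corrected text used for search
    box: Box   # where the word sits on the page image


def correct(raw: str, cutoff: float = 0.8) -> str:
    """Snap an OCR token to the closest dictionary word, if one is close enough."""
    token = raw.lower()
    matches = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def highlight_boxes(words: List[Word], query: str) -> List[Box]:
    """Return the page coordinates of every word matching the search term."""
    return [w.box for w in words if w.text == query.lower()]


# Invented OCR output with typical character confusions (i/l, e/c).
raw_tokens = [
    ("Guardlan", (10, 10, 90, 30)),
    ("of", (95, 10, 110, 30)),
    ("Manchestcr", (115, 10, 220, 30)),
]
words = [Word(correct(token), box) for token, box in raw_tokens]
hits = highlight_boxes(words, "Guardian")
```

Storing the box next to the corrected text is what lets a search hit be drawn over the original page image rather than over re-typeset text.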
    12. 21. result
       • 1.2m searchable newspaper pages
       • 212 years of publishing heritage
       • 4 terabytes of data
       • equivalent of 5,000 DVDs
       • approx. 20m clippings
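
As a rough back-of-envelope reading of the result figures, the averages per page can be worked out as below. The calculation assumes decimal terabytes (1 TB = 10^12 bytes), which the slide does not specify, and the variable names are illustrative.

```python
PAGES = 1_200_000            # searchable newspaper pages (from the slide)
CLIPPINGS = 20_000_000       # approximate number of clippings (from the slide)
TOTAL_BYTES = 4 * 10**12     # 4 terabytes, assuming decimal units

# Average storage per page, in megabytes (10**6 bytes).
mb_per_page = TOTAL_BYTES / PAGES / 10**6

# Average number of clippings segmented out of each page.
clippings_per_page = CLIPPINGS / PAGES

print(f"about {mb_per_page:.1f} MB per page")
print(f"about {clippings_per_page:.1f} clippings per page")
```

These are averages only; actual page sizes and clipping counts will vary with page design and scan quality.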
    13. 22. GNM project office
       • develop business case/secure GMG funding
       • legal risk assessment
       • IT procurement
       • technical project management
           • overseeing Olive production/delivery of content
           • website integration/interface design
           • search indexing (Google)
           • further clean-up for library product
           • internal QA of digitised files
           • refilming
           • legal research and deletions
       • launch planning/marketing campaign for b2c and b2b
       • sales strategy/recruitment
       • negotiating global b2b distribution deals
    14. 23. business case
       • identifying the audience - segmentation
           • research, special interest, education
       • commercial licensing programme to global education market
           • universities
           • schools
           • corporate/government research
       • b2c subscription model
           • timed passes: £7.95 for 24 hrs
           • genealogy, writers, researchers, special interest
       • adding value to newspaper website
       • revolutionises internal journalistic research
       • environmental benefits
    15. 24. the digital archive in schools
       • pilot project in Durham, April to August 2007
       • easy to integrate in VLEs
       • covers all core subjects
       • allows pupils to see the development of important events and their context
       • represents a valuable resource for developing research skills
       • our vision
           • access to the Digital Archive for all pupils and teaching staff in the UK
           • part of JISC Collections
    16. 25. http://archive.guardian.co.uk
