20140408 digital newspapers collections [idlc kuala lumpur]

All about historic newspapers digitization at the 2014 International Digital Library Conference in Kuala Lumpur 8-Apr-2014.

Published in: Internet, Education

  1. 1. digital newspaper collections: if you build one, who will visit? Frederick Zarndt IFLA Newspapers Section frederick@frederickzarndt.com @cowboyMontana hashtag #IFLAnewspaper
  2. 2. about digital newspapers • programs • collections • users / crowdsourcing San Francisco Call 21 April 1906
  3. 3. why digitize newspapers? “News is only the first rough draft of history.” Alan Barth, writing for the Washington Post, 1943. Wikipedia contributors, "Alan Barth," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Alan_Barth (accessed March 2014).
  4. 4. to preserve to provide access why digitize newspapers?
  5. 5. • newspapers are deteriorating • microfilm is dissolving • no storage space or space is too expensive
  9. 9. the principal reason to digitize newspapers is to provide non-destructive, universal access to newspapers for as many users as possible
  10. 10. reading rooms by the numbers* (monthly averages). Photo by David Iliff. License: CC-BY-SA 3.0.
      Country        Population     Reading room visitors   Microform requests   Print requests
      Australia      22,876,000     5,130                   345                  240
      France         65,350,000     3,000                   2,000                1,000
      Netherlands    16,847,000     NA                      NA                   NA
      New Zealand    4,414,000      NA                      NA                   NA
      Norway         4,985,000      600                     400                  NA
      Singapore      5,184,000      NA                      300                  NA
      UK             62,262,000     2,000                   6,900                4,816
      USA            313,292,000    NA                      NA                   NA
      *numbers from 2012
  11. 11. physical versus digital: monthly averages, 2012
      Population     Requests for newspapers (paper + microform)   Digitised historical newspapers (unique visitors)
      22,876,000     585                                           150,000
      37,692,000     NA                                            12,800
      5,405,000      NA                                            NA
      65,350,000     3,000                                         22,000
      16,847,000     NA                                            50,000
      4,414,000      NA                                            83,333
      4,985,000      400                                           1,500
      5,184,000      300                                           12,400
      62,262,000     11,716                                        NA
      313,292,000    NA                                            NA
  12. 12. Image from http://www.visualinsight.net/nc/gallery/pages/e-Preservation.html • newspaper digitization is expensive • newspaper digitization is complicated • digital preservation is expensive • digital preservation is untested BUT …
  13. 13. programs
  14. 14. programs National Cooperative Individual
  15. 15. national: a single (national) library which funds and manages a national newspapers digitization program. • Papers Past, National Library of New Zealand • Newspaper SG, National Library of Singapore • Historiallinen Sanomalehtikirjasto, National Library of Finland • and others … programs
  16. 16. national: centrally funded and centrally managed program with several participants. strict standards for participants. • National Digital Newspaper Program (Library of Congress) • Australian Newspaper Digitisation Program programs
  17. 17. cooperative: organizations collaborate to achieve a common goal but digitization programs are managed separately. flexible standards. • Europeana newspapers • Digital Public Library of America programs
  18. 18. individual: an organization digitizes on its own. it may follow open standards but more usually does not. all are commercial organizations. • ProQuest Historical Newspapers • Newspapers.com • Newsbank • many others… programs
  19. 19. • the design of a digitization program requires careful thought and must be adapted to local circumstances • determine principal or targeted user demographic and use cases • ask those who have gone before • join the IFLA Newspapers Section! (ask me how) programs Image courtesy of Donald Zolan.
  20. 20. collections
  21. 21. digital historic newspaper collections (as of Mar 2014)
      Library                                   Collection                      ~Size (pages)   Dates
      National Library of Australia             Trove                           12,668,000      1803-1995
      California Digital Newspaper Collection   CDNC                            545,000         1846-2012
      National Library of Finland               Historical Newspaper Library    3,006,000       1771-1919
      Bibliotheque nationale de France          Gallica                         2,200,000       1293-2000
      Koninklijke Bibliotheek                   Historische Kranten             9,000,000       1618-1995
      National Library of New Zealand           Papers Past                     3,109,000       1839-1945
      National Library of Norway                NBDigital Aviser                12,000,000      1763-2012
      Singapore National Library                Newspaper SG                    2,400,000       1831-2009
      British Library                           British Newspaper Archive       7,598,000       1710-1954
      Library of Congress                       Chronicling America             7,293,000       1836-1922
  22. 22. [Chart: 2013 monthly averages of unique visits and page views, on an axis from 1 to 1,000,000, for Australian newspapers; books; pictures and photos; journal articles; music, sound and video; maps; archived websites; diaries, letters, archives; people and organisations]
  23. 23. [Chart: the same 2013 monthly averages of unique visits and page views, on an axis from 0 to 6,000,000, for Australian newspapers; books; pictures and photos; journal articles; music, sound and video; maps; archived websites; diaries, letters, archives; people and organisations]
  24. 24. [Chart: 2013 monthly averages of unique visits, number of visits and page views, on an axis from 0 to 800,000, for NewspaperSG, Infopedia and iRememberSG]
  25. 25. [Chart: February 2014 unique visits and page views, on an axis from 0 to 3,000,000, for Papers Past and for the National Library except Papers Past. Data labels: 517,823; 2,527,926; 53,897; 123,889]
  26. 26. [Chart: 2013 monthly averages, with segments labelled 0%, 10% and 90%, for Historic Cambridge Newspapers (1846-1923), Cambridge City Directories (1848-1910) and Cambridge Chronicle (August 2005 to present)]
  27. 27. users
  28. 28. Newspaper collection user survey • California Digital Newspaper Collection and Cambridge Public Library published a user survey in Mar 2013 • 604 / 32 responses • surveys are (mostly) identical except for organization name
  29. 29. User demographic: genealogists and family historians
  30. 30. User demographic: no spring chickens
  31. 31. User demographic: reasons for use
  32. 32. User demographic: types of information
  33. 33. Utah Digital Newspapers: 2012 user survey • 72% visit UDN for genealogical research • 20% visit for various other types of historical research • 87% find obituaries useful • Over 60% find the other genealogical article types (birth and wedding announcements) useful • Only 7% do not find genealogical articles useful • Many are writing family histories and consequently also look for general background information • Older content is much more highly valued than more recent content (see more detailed explanation that follows) • 44% find smaller, rural papers more useful, while only 15% find larger, metropolitan papers more useful. John Herbert and Randy Olsen. Small town papers: still delivering the news. WLIC 2012, Helsinki, Finland. http://conference.ifla.org/past-wlic/2012/119-herbert-en.pdf
  34. 34. Engaged users: who are they? “The ‘typical’ Trove user is a very well educated, highly paid, English speaking employed woman aged fifty or over, with a significant or primary interest in family or local history, who visits the Trove website very frequently. Users of Trove newspapers are older than the average Trove user; only 13% of newspaper users are under 40 years of age.” Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. WLIC 2013, Singapore. http://library.ifla.org/245/1/153-ayres-en.pdf
  35. 35. Engaged users: who are they? “Many of Trove’s user engagement features are very popular. More than 100,000 users have registered to date, and more than 2 million tags and nearly 60,000 comments had been added… [Trove] text correction, however, stands head and shoulders above any other user engagement features.” Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. WLIC 2013, Singapore. http://library.ifla.org/245/1/153-ayres-en.pdf
  36. 36. Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. ... [It] is different from ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific, named group. Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013).
  37. 37. Why correct text? Here’s why ...
  38. 38. Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last, Mr. Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. AsbtCnvHall, mar Lancaster, Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol " t h r o u g h I n s b e a d , 1 w h i c h instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week, raw OCR text Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3. newspaper image
  39. 39. Accuracy • Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers • Rose Holley (National Library of Australia) reports that raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers. Rose Holley. How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine. March/April 2009. Edwin Klijn. The current state-of-art in newspaper digitization. D-Lib Magazine. January/February 2008.
  40. 40. uncorrected OCR accuracy by newspaper title
      Code   Title                                              Dates         OCR character accuracy   ~OCR word accuracy*
      PRP    Pacific Rural Press                                1871 - 1922   92.6%                    68.1%
      SFC    San Francisco Call                                 1890 - 1913   92.6%                    68.1%
      LAH    Los Angeles Herald                                 1873 - 1910   88.7%                    54.9%
      LH     Livermore Herald                                   1877 - 1899   88.6%                    54.6%
      DAC    Daily Alta California                              1841 - 1891   88.2%                    53.4%
      CFJ    California Farmer and Journal of Useful Sciences   1855 - 1880   86.5%                    48.4%
      SN     Sausalito News                                     1885 - 1922   70.4%                    17.3%
      *Word accuracy assumes average word length is 5 characters
  41. 41. OCR accuracy by newspaper title
      Code   Title                                              Dates         OCR character accuracy   Corrected accuracy
      PRP    Pacific Rural Press                                1871 - 1922   92.6%                    99.3%
      SFC    San Francisco Call                                 1890 - 1913   92.6%                    99.6%
      LAH    Los Angeles Herald                                 1873 - 1910   88.7%                    99.1%
      LH     Livermore Herald                                   1877 - 1899   88.6%                    99.9%
      DAC    Daily Alta California                              1841 - 1891   88.2%                    99.9%
      CFJ    California Farmer and Journal of Useful Sciences   1855 - 1880   86.5%                    98.3%
      SN     Sausalito News                                     1885 - 1922   70.4%                    100.0%
  42. 42. corrected accuracy by newspaper title
      Code   Dates         OCR character accuracy   ~OCR word accuracy*   Corrected accuracy   ~Corrected word accuracy*
      PRP    1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
      SFC    1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
      LAH    1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
      LH     1877 - 1899   88.6%                    54.6%                 99.9%                99.5%
      DAC    1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
      CFJ    1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
      SN     1885 - 1922   70.4%                    17.3%                 100.0%               100.0%
      *Word accuracy assumes average word length is 5 characters
  43. 43. correction accuracy by user
      User   Average OCR accuracy   Correction accuracy
      A      70.4%                  100.0%
      B      87.1%                  99.5%
      C      95.4%                  99.5%
      D      86.5%                  98.3%
      E      95.3%                  100.0%
      F      91.0%                  100.0%
      G      91.0%                  99.8%
      H      90.5%                  99.0%
      I      96.6%                  99.8%
      J      94.8%                  100.0%
      K      86.8%                  99.3%
  44. 44. How does low text accuracy affect search recall? The Facts • Average uncorrected OCR character accuracy of the CDNC sample data is ~89% • Average length of an English word is 5 characters • Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8% - round up to 60% or 6 out of 10 words correct Accuracy
  45. 45. Search recall, no text correction. [Figure: ten instances of “ARNDT”, each marked as either found or not found by search]
  46. 46. Accuracy The Facts • Average corrected character accuracy of the CDNC sample data is ~99.4% • Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%
  47. 47. Search recall with text correction. [Figure: ten instances of “ARNDT”, each marked as either found or not found by search]
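
The word-accuracy arithmetic on slides 44 and 46 can be sketched in a few lines. This is a minimal illustration, assuming (as the slides do) that character errors are independent and that a word counts as correct only if every character is correct; the function name `word_accuracy` is hypothetical and not part of any CDNC tooling.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Estimate word accuracy from OCR character accuracy.

    Assumes independent character errors, so a word of `word_length`
    characters is fully correct with probability char_accuracy ** word_length.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


# Character accuracies quoted on slides 44 and 46 for the CDNC sample.
for label, p in (("uncorrected", 0.89), ("corrected", 0.994)):
    print(f"{label}: {word_accuracy(p):.1%}")
# uncorrected: 55.8%
# corrected: 97.0%
```

The same function covers longer names by passing a different `word_length`, which is why accuracy falls quickly for long surnames when the text is uncorrected.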
  48. 48. A search for “Arndt” at Chronicling America gives 10,267 results* • If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found • If text accuracy is 97.0%, then 317 instances of “Arndt” were not found Accuracy * Search performed 31 Oct 2012
  49. 49. Accuracy: suppose the word/name is longer than 5 characters? The Facts • Assume that average uncorrected / corrected OCR character accuracy is ~89% / ~99%, the same as CDNC.
      Name         Name length   Raw text accuracy   Corrected text accuracy
      Eklund       6             49.7%               94.2%
      Kennedy      7             44.2%               93.25%
      Espinosa     8             39.4%               92.3%
      Bonaparte    9             35%                 91.4%
      Chatterjee   10            31.2%               90.4%
  50. 50. Accuracy
      Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
      Eklund       2,951                      2,987                                    182
      Kennedy      360,723                    455,392                                  26,111
      Espinosa     1,918                      2,950                                    160
      Bonaparte    44,664                     82,947                                   4,203
      Chatterjee   19                         42                                       2
      Chronicling America searches done 19-Mar-2013 (6,025,474 pages from 1836 to 1922).
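
The missing-result figures on slide 50 appear consistent with a simple estimate: if search finds only a fraction of the true occurrences equal to the text accuracy, then the true count is roughly the results found divided by that accuracy, and the remainder is what search missed. The slides do not state this method, so the sketch below is an inference from the numbers; the function name `estimated_missing` is hypothetical, and the accuracies are the rounded percentages from slide 49.

```python
def estimated_missing(results_found: int, text_accuracy: float) -> float:
    """Estimate how many occurrences a search did not return.

    Assumes results_found = true_occurrences * text_accuracy, so
    missing = results_found / text_accuracy - results_found.
    """
    if not 0.0 < text_accuracy <= 1.0:
        raise ValueError("text_accuracy must be in (0, 1]")
    return results_found / text_accuracy - results_found


# (name, results found, raw accuracy, corrected accuracy) from slides 49-50.
examples = [
    ("Eklund", 2951, 0.497, 0.942),
    ("Chatterjee", 19, 0.312, 0.904),
]
for name, found, raw, corrected in examples:
    raw_missing = round(estimated_missing(found, raw))
    corrected_missing = round(estimated_missing(found, corrected))
    print(f"{name}: ~{raw_missing:,} missed (raw), ~{corrected_missing:,} missed (corrected)")
# Eklund: ~2,987 missed (raw), ~182 missed (corrected)
# Chatterjee: ~42 missed (raw), ~2 missed (corrected)
```

Because the estimate divides by the accuracy, small rounding differences in the percentage shift the result by a few hits, so the figures should be read as approximations.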
  51. 51. top ten text correctors, two tables (*numbers from Mar 2014)
      User   Lines corrected* (first table)   Lines corrected* (second table)
      1      646,873                          2,455,338
      2      236,323                          1,822,422
      3      111,749                          1,448,370
      4      100,749                          1,265,217
      5      99,999                           1,174,835
      6      87,720                           1,069,669
      7      82,768                           1,058,179
      8      63,786                           1,020,462
      9      57,441                           949,694
      10     56,458                           886,315
  52. 52. lines corrected, Mar 2014 versus Oct 2012
      User   Lines corrected Mar 2014   Lines corrected Oct 2012
      1      646,873                    242,965
      2      236,323                    87,515
      3      111,749                    31,318
      4      100,749                    24,144
      5      99,999                     23,184
      6      87,720                     19,240
      7      82,768                     18,898
      8      63,786                     16,875
      9      57,441                     11,784
      10     56,458                     9,762
  53. 53. • “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.” • “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.” Rose Holley. Many Hands Make Light Work. National Library of Australia March 2009. motivation Trove users’ report
  54. 54. “I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.” Wesley, California Personal communications with CDNC text correctors. motivation CDNC users’ report
  55. 55. “I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.” Ann, California Personal communications with CDNC text correctors. motivation CDNC users’ report
  56. 56. “I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.” David, United Kingdom motivation CDNC users’ report Personal communications with CDNC text correctors.
  57. 57. CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. .... I am not doing this very regularly as this is just my hobby and pleasure. Jerzey, Poland motivation CDNC users’ report Personal communications with CDNC text correctors.
  58. 58. As an amateur historical researcher my time for research is very limited.  Making time to travel to archives, libraries, and historical societies does not happen as often as I would like.  The Cambridge Public Library’s online newspaper collection has been an invaluable resource and it is fun.  I am very grateful for all the help I have received over the years from so many research organizations. Correcting text has several benefits.  It makes it much more likely that I will find a story if I decide to search for it in the future.  It is a way of saying ‘thank you’ to the Cambridge Library for having such a great resource available and maybe I can make the next person’s research a little easier. It is my own little historical preservation project. Cambridge Historical Newspapers Text Corrector motivation Cambridge users’ report Personal communications with Cambridge text correctors.
  59. 59. Hard-to-measure-but-shouldn’t-be-overlooked (HTMBSBO) benefits Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.
  60. 60. “when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages” HTMBSBO benefit Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
  61. 61. “in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections” HTMBSBO benefit Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
  62. 62. conclusions Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven • newspaper digitization may be difficult but there are many, many examples of successful digitization programs. ask for help! and join the IFLA Newspapers Section! • digital newspaper collections are the most used digital library collections • benefits to crowdsourced text correction and tagging are multi-faceted: data accuracy, patron engagement, increased web traffic • know your user community!!
  63. 63. • Library of Congress National Digital Newspaper Program http://www.loc.gov/ndnp/ • Australian Newspaper Digitisation Program http://www.nla.gov.au/content/newspaper-digitisation-program • IFLA Newspapers Section Digitisation projects and best practices http://www.ifla.org/node/6777 • ICON: International Coalition on Newspapers http://icon.crl.edu/digitization.htm
  64. 64. Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).
  65. 65. Become a member of the IFLA Newspapers Section! See http://www.ifla.org/membership or ask me. Frederick Zarndt, Secretary, IFLA Newspapers Section frederick@frederickzarndt.com
  66. 66. ?! Frederick Zarndt Secretary, IFLA Newspapers Section frederick@frederickzarndt.com Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
