Coming Clean About Dirty OCR

Presented at the 2013 Charleston Conference by:

Bob Scott
Digital Humanities Librarian, Columbia University Libraries

John Tofanelli
Librarian for British & American History & Literature, Columbia University Libraries

Published in: Education

  1. 1. Coming Clean About Dirty OCR: Bob Scott and John Tofanelli, Charleston 2013
  2. 2. A Series of Text Revolutions: Cuneiform Tablet
  3. 3. Papyrus Scroll
  4. 4. Manuscript Codex
  5. 5. Movable Type
  6. 6. Machine Type
  7. 7. Microform
  8. 8. Electronic Text
  9. 9. Incredible upward growth since 1985, but especially since 1995
  10. 10. Background: the manual input era, 1985-1995
      • Single or double keyboarding
      • “Letter-perfect” texts
      • Smaller corpora, typically for analytic projects by specialized scholarly groups
      • Texts originally “earth-bound” in labs or etext centers
      • Expanding audience in latter part of era thanks to work of publishers like Accessible Archives, Intelex, and especially Chadwyck-Healey, as well as Project Gutenberg
  11. 11. “Dirty OCR” Era, 1995-?
      • 1993 releases: Mosaic, Acrobat, FineReader
      • Good quality but not letter-perfect
      • OCR’ed text presented online behind page image, as in JSTOR
      • Rapid transformation of microform collections by major publishers
      • Enormous expansion of amount of available text and broadening of user public: “digital library” replaces “etext center”
      • Continuing role of keyed text
  12. 12. Benefits and Challenges of Dirty OCR
      BENEFITS
      • Faster and cheaper per unit
      • Mass production
      • Broadened audience
      • Much of the time, sufficiently accurate
      CHALLENGES
      • Can contain many errors and be uneven in quality
      • Not sufficiently accurate to support all types of research
      • Typically hidden behind a page image
      • Likely to shock at first encounter
  13. 13. The Man Behind the Curtain
  14. 14. Variability in Quality of Source …
  15. 15. And, Therefore, in OCR  THE HISTORY  OF THE LAST TRIAL BY JURY FOR ATHEISM.  ТТГЕ CHYSTAL.  CHAPTER I. — BEFORE THB IMPRISONMENT. That day is chilled in my memory when I first set out for Cheltenham. It was in December 1840. The snow had been frozen on the ground a fortnight. There were three of us, Mrs. Holyoake, Madeline (our first child), and myself. I had been residing in Worcester, which was the first station to which I had been appointed as a Social Missionary. My salary (16s. per week) was barely sufficient to keep us alive in summer. In win¬ ter it was inherent obstinacy alone which made us believe that we existed. I feel now the fierce blast which came in at the train windows from ' the fields of Tewkesbury,' on the day on which we travelled from Worcester to Cheltenham. The intense cold wrapped us round like a cloak of ice. The shop lights threw their red glare over the snow-bedded ground as we entered the town of Cheltenham, and nothing but the drift and ourselves moved through the deserted streets. When at last we found a fire we had to wait to thaw before we could begin to speak. When tea was over we were escorted to the house where we were to stay for the night. I was told it was ' a friend's house.' Cheltenham is a fashionable town, a water¬ ing, visiting place, where everything is genteel and thin. As the parlours of some prudent house-wives are kept for show, and not to sit in, so in Cheltenham numerous houses are kept ' to be let,' and not to live in. The people who belong to the apartments are like the supernumeraries on a stage, they are employed in walking over them. Their clothes  CURE FOR ATHEISM. SECOND SCITI. I'.MI'.NT To (»ΜΙΛ IN' TVXDAb, BY CAN l'Ali, М.Д I. What we call Nature is matter moving like a clock whose spring; ol' force is invisible. And. as we lind matter in itself to be essen¬ tially inert, we conclude that such spring ol' force must exist, though mem tracing mm force to another, do not arrive at Ρ WC s. 
e DO force. We see llotllillj but lll.lt t · ' ľ in motion. Γ> II t there is a remarkable Ime- ol' attraction called Pistinél between living beings and things necessarv for their good ; and these things, though ofti n distant and obscure, are i n val iabl v found objectively to exist. 'Ihme n, in the particular nat un- of' man, an uni met of an inter , si m (;,,([ winch has left ps mirk over all ages and countries of tlm world. His папой draws the inference that the invisible sprin<r nI Nature and (iod are one and the same. ['hen reason observes that the torce ol Nature is alwav» directed bv wisdom and harmony. iul, because wisdom and harmony can have no other source (Ь.щ a Consciously wise being, reason receives the existence of (¡od, a.» a personal and conscious Author of Nature, bv the assurance of a three corded rope which cannot he broken. Kverv new discovery has strengthened this assurance, and, altogether, a )///"////<// frame is suggested a» undei lying л the /.'//.//./, motions of mat ter, and h ning something to do with educating that reason wliicli is ¡n man, marking hun oul as the onlv being thai can be interested so much as to look at a panorama of motions that is most grand, most beau¬ tiful,
  16. 16. Best Practices for Librarians
      • Advise users that they are not directly searching the text images they see, but rather a text that has been generated from those images.
      • Show them examples of OCR text exemplifying varying levels of accuracy.
      • Advise them to avoid drawing any absolute negative conclusions, e.g., “this word never occurs in …”
      • Advise them to employ a range of different searches related to their topic.
      • Advise them to make use of the options a given database makes available for mitigating the inaccuracy of OCR, e.g., fuzzy searching, wildcards and truncation.
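
To illustrate the wildcard and truncation advice on slide 16, here is a minimal sketch in Python. It is not how any particular database implements these options; the function name `wildcard_search` is hypothetical, and the sample strings are taken from the OCR shown on slide 15. Here `*` stands for truncation (any run of characters) and `?` for a single unknown character.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern: str, text: str) -> list[str]:
    """Return the words of `text` that match a wildcard `pattern`.

    `*` matches any run of characters (truncation);
    `?` matches exactly one character (single-character wildcard).
    Hypothetical helper for illustration only.
    """
    # Lowercase each word and trim common surrounding punctuation.
    words = [w.strip(".,;:!?'\"()").lower() for w in text.split()]
    return [w for w in words if w and fnmatchcase(w, pattern.lower())]


# Truncation: catches every word that starts with "athe".
print(wildcard_search("athe*", "THE HISTORY OF THE LAST TRIAL BY JURY FOR ATHEISM."))

# Single-character wildcards: still finds a misread "always".
print(wildcard_search("alwa??", "the torce ol Nature is alwav» directed bv wisdom"))
```

The second search shows why wildcards help with dirty OCR: a search for the exact word "always" would miss the misread form entirely.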
  17. 17. Best Practices for Publishers: Let Us See the Dirty OCR
  18. 18. Best Practices for Publishers: Tools to Enhance Retrieval
      • Fuzzy searching
  19. 19. Fuzzy Search in ECCO Return
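
Fuzzy searching can be sketched as matching on edit distance: a word counts as a hit if it can be turned into the query with only a few single-character changes. The code below is an illustrative assumption, not a description of how ECCO or any other product implements its fuzzy search; `edit_distance` and `fuzzy_find` are hypothetical names, and the sample text comes from the OCR on slide 15.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_find(query: str, ocr_text: str, max_distance: int = 1) -> list[str]:
    """Return words in `ocr_text` within `max_distance` edits of `query`."""
    words = [w.strip(".,;:'\"").lower() for w in ocr_text.split()]
    return [w for w in words if edit_distance(query.lower(), w) <= max_distance]


sample = "the torce ol Nature is alwav» directed bv wisdom"
print(fuzzy_find("force", sample))                   # one edit allowed
print(fuzzy_find("always", sample, max_distance=2))  # two edits allowed
```

A larger `max_distance` recovers more damaged words but also returns more false hits, especially for short queries.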
  20. 20. Best Practices for Publishers: Tools to Enhance Retrieval
      • Fuzzy searching
      • Prominently displayed user guides
  21. 21. Readex User Guide for Searching Historical Newspapers
  22. 22. Ebsco User Guide for Searching Historical Texts Return
  23. 23. Best Practices for Publishers: Tools to Enhance Retrieval
      • Fuzzy searching
      • Prominently displayed user guides
      • Rich Subject metadata
  24. 24. Metadata Return
  25. 25. Best Practices for Publishers: Tools to Enhance Retrieval
      • Fuzzy searching
      • Prominently displayed user guides
      • Rich Subject metadata
      • Enhancing metadata by keying in title or first line
      • Regular expression searching?
  26. 26. Regular Expression <[s8][ce][il][ec][nu]t???> Return
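
The pattern on slide 26 can be approximated in Python's `re` syntax. This is a sketch under stated assumptions: the slide does not say which search tool the pattern was written for, so the reading of `<` and `>` as word boundaries and of each `?` as exactly one character is a guess, and `find_variants` is a hypothetical helper. Each bracket lists a letter together with a character that OCR commonly confuses with it (for example `s` and `8`, or `n` and `u`).

```python
import re

# Assumed translation of <[s8][ce][il][ec][nu]t???>:
#   <  and  >  -> \b (word boundary)
#   each ?     -> \w (exactly one word character)
PATTERN = re.compile(r"\b[s8][ce][il][ec][nu]t\w{3}\b", re.IGNORECASE)


def find_variants(text: str) -> list[str]:
    """Return every word in `text` that fits the OCR-tolerant pattern."""
    return PATTERN.findall(text)


sample = "The scientist, the 8cientist and the sclcutist met a scientific artist."
print(find_variants(sample))
```

Under this reading, the pattern accepts plausible misreadings of a nine-letter word such as "scientist" while rejecting longer words like "scientific".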
  27. 27. Best Practices for Publishers: Tools to Enhance Retrieval
      • Fuzzy searching
      • Prominently displayed user guides
      • Rich Subject metadata
      • Enhancing metadata by keying in title or first line
      • Regular expression searching?
      • Measures of accuracy?
      • Further transcription or cleanup?
      • Embedding additional text in OCR?
      • Continuing development and implementation of quality control standards
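
One possible measure of accuracy is word accuracy against a hand-checked transcription: the share of correct words that also appear, in order, in the OCR output. The sketch below is only one way to define such a measure and is not taken from the slides; `word_accuracy` is a hypothetical name, and the "true" reading of the chapter heading from slide 15 is presumed rather than verified against the printed page.

```python
from difflib import SequenceMatcher


def word_accuracy(truth: str, ocr: str) -> float:
    """Fraction of words in `truth` that are matched, in order, in `ocr`."""
    truth_words = truth.split()
    ocr_words = ocr.split()
    if not truth_words:
        return 1.0
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    # Sum the lengths of all aligned runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


truth = "BEFORE THE IMPRISONMENT"  # presumed correct reading
ocr = "BEFORE THB IMPRISONMENT"    # as it appears in the OCR on slide 15
print(round(word_accuracy(truth, ocr), 3))
```

A per-document score of this kind would let users judge how far to trust a negative search result.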
  28. 28. Best Practices for Publishers: Navigation of Results
      • Keyword in Context (KWIC) Displays
  29. 29. KWIC 1
  30. 30. KWIC 3
  31. 31. Return
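
A keyword-in-context display shows each hit with a few words on either side, so a user can judge relevance without opening every page image. The following is a minimal sketch, not any vendor's implementation; `kwic` is a hypothetical function and the sample sentence is taken from the cleaner OCR passage on slide 15.

```python
def kwic(text: str, keyword: str, width: int = 2) -> list[str]:
    """Return one line per hit: `width` words of context on each side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.strip(".,;:!?'\"").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


sample = ("The shop lights threw their red glare over the snow-bedded "
          "ground as we entered the town of Cheltenham")
for line in kwic(sample, "ground"):
    print(line)
```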
  32. 32. Best Practices for Publishers: Navigation of Results
      • Keyword in Context (KWIC) Displays
      • Navigation Points in Margin
  33. 33. margin Return
  34. 34. Best Practices for Publishers: Navigation of Results
      • Keyword in Context (KWIC) Displays
      • Navigation Points in Margin
      • Indication of Number of Hits per Record
      • Highlighting Text
  35. 35. Highlight Return
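
Hit counts and highlighting can both come from a single pass over a record's OCR text. This is a sketch for illustration only; `highlight` is a hypothetical function, and the `**` markers stand in for whatever visual highlighting an interface would actually use.

```python
import re


def highlight(text: str, term: str) -> tuple[str, int]:
    """Wrap each case-insensitive occurrence of `term` in ** markers.

    Returns the marked-up text and the number of hits.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    marked, hits = pattern.subn(lambda m: f"**{m.group(0)}**", text)
    return marked, hits


record = "Cheltenham is a fashionable town; in Cheltenham numerous houses are kept"
marked, hits = highlight(record, "cheltenham")
print(hits)
print(marked)
```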
  36. 36. Best Practices for Publishers: Navigation of Results
      • Keyword in Context (KWIC) Displays
      • Navigation Points in Margin
      • Indication of Number of Hits per Record
      • Highlighting Text
      • Robust Search within the Text
  37. 37. ATLA Monograph Search Within Text Return
  38. 38. Best Practices for Publishers: Downloading
      • Let users download the format they need
      • Provide searchable PDFs in downloads
      • Let users download as much or as little as they need
  39. 39. Best practices for publishers: let users create their own OCR
      • To make downloaded non-searchable PDFs searchable
      • To take advantage of the latest advances in scanning technology
      • To create an edited and marked-up version
      Avoid the following:
      • Explicit blocking of the process
  40. 40. Blocking of copying or OCR Return
  41. 41. Best practices for publishers: let users create their own OCR
      • To make downloaded non-searchable PDFs searchable
      • To take advantage of the latest advances in scanning technology
      • To create an edited and marked-up version
      Avoid the following:
      • Explicit blocking of the process
      • Downloaded text that is less than optimal for OCR
  42. 42. Our survey form
  43. 43. Preliminary Observations
  44. 44. Preliminary Observations (cont.)
  45. 45. Making friends with the man behind the curtain
