Presented at the 2013 Charleston Conference by:
Bob Scott
Digital Humanities Librarian, Columbia University Libraries
John Tofanelli
Librarian for British & American History & Literature, Columbia University Libraries
10. Background: the manual input era, 1985-1995
Single or double keyboarding
“Letter-perfect” texts
Smaller corpora, typically for analytic projects by specialized scholarly groups
Texts originally “earth-bound” in labs or etext centers
Expanding audience in the latter part of the era, thanks to the work of publishers like Accessible Archives, Intelex, and especially Chadwyck-Healey, as well as Project Gutenberg
11. “Dirty OCR” Era, 1995-?
1993 releases: Mosaic, Acrobat, FineReader
Good quality but not letter-perfect
OCR’ed text presented online behind page image
JSTOR
Rapid transformation of microform collections by major publishers
Enormous expansion of the amount of available text and broadening of the user public: “digital library” replaces “etext center”
Continuing role of keyed text
12. Benefits and Challenges of Dirty OCR
BENEFITS
Faster and cheaper per unit
Mass production
Broadened audience
Much of the time, sufficiently accurate
CHALLENGES
Can contain many errors and be uneven in quality
Not sufficiently accurate to support all types of research
Typically hidden behind a page image
Likely to shock at first encounter
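One common way to put a number on "sufficiently accurate" is the character error rate: the edit distance between a trusted transcription and the OCR output, divided by the length of the transcription. The sketch below is illustrative only. The function names are hypothetical, and the reference string is an assumed correction of a heading from the OCR sample shown later in these slides, not a verified transcription.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Assumed correction (reference) versus the OCR reading from the sample.
reference = "BEFORE THE IMPRISONMENT"
ocr_output = "BEFORE THB IMPRISONMENT"
print(round(character_error_rate(reference, ocr_output), 3))
```

A measure like this needs a trusted transcription to compare against, so in practice it would be computed on a sample of pages rather than on a whole collection.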
15. And, Therefore, in OCR
THE HISTORY
OF THE LAST TRIAL BY JURY FOR ATHEISM.
CHAPTER I. — BEFORE THB IMPRISONMENT. That
day is chilled in my memory when I first set out for
Cheltenham. It was in December 1840. The snow had
been frozen on the ground a fortnight. There were three
of us, Mrs. Holyoake, Madeline (our first child), and
myself. I had been residing in Worcester, which was the
first station to which I had been appointed as a Social
Missionary. My salary (16s. per week) was barely
sufficient to keep us alive in summer. In win¬ ter it was
inherent obstinacy alone which made us believe that we
existed. I feel now the fierce blast which came in at the
train windows from ' the fields of Tewkesbury,' on the
day on which we travelled from Worcester to
Cheltenham. The intense cold wrapped us round like a
cloak of ice. The shop lights threw their red glare over
the snow-bedded ground as we entered the town of
Cheltenham, and nothing but the drift and ourselves
moved through the deserted streets. When at last we
found a fire we had to wait to thaw before we could
begin to speak. When tea was over we were escorted to
the house where we were to stay for the night. I was
told it was ' a friend's house.' Cheltenham is a
fashionable town, a water¬ ing, visiting place, where
everything is genteel and thin. As the parlours of some
prudent house-wives are kept for show, and not to sit in,
so in Cheltenham numerous houses are kept ' to be let,'
and not to live in. The people who belong to the
apartments are like the supernumeraries on a stage,
they are employed in walking over them. Their clothes
ТТГЕ CHYSTAL.
CURE FOR ATHEISM. SECOND SCITI. I'.MI'.NT To (»ΜΙΛ
IN' TVXDAb, BY CAN l'Ali, М.Д I. What we call Nature is
matter moving like a clock whose spring; ol' force is
invisible. And. as we lind matter in itself to be essen¬ tially
inert, we conclude that such spring ol' force must exist,
though mem tracing mm force to another, do not arrive at Ρ
WC s. e DO force. We see llotllillj but lll.lt t · ' ľ in motion. Γ>
II t there is a remarkable Ime- ol' attraction called Pistinél
between living beings and things necessarv for their good ;
and these things, though ofti n distant and obscure, are i n
val iabl v found objectively to exist. 'Ihme n, in the particular
nat un- of' man, an uni met of an inter , si m (;,,([ winch has
left ps mirk over all ages and countries of tlm world. His
папой draws the inference that the invisible sprin<r nI
Nature and (iod are one and the same. ['hen reason
observes that the torce ol Nature is alwav» directed bv
wisdom and harmony. iul, because wisdom and harmony
can have no other source (Ь.щ a Consciously wise being,
reason receives the existence of (¡od, a.» a personal and
conscious Author of Nature, bv the assurance of a three
corded rope which cannot he broken. Kverv new discovery
has strengthened this assurance, and, altogether, a
)///"////<// frame is suggested a» undei lying л the /.'//.//./,
motions of mat ter, and h ning something to do with
educating that reason wliicli is ¡n man, marking hun oul as
the onlv being thai can be interested so much as to look at
a panorama of motions that is most grand, most beau¬ tiful,
16. Best Practices for Librarians
Advise users that they are not directly searching the text images they see, but rather a text that has been generated from those images.
Show them examples of OCR text exemplifying varying levels of accuracy.
Advise them to avoid drawing any absolute negative conclusions, e.g., "this word never occurs in …"
Advise them to employ a range of different searches related to their topic.
Advise them to make use of the options a given database makes available for mitigating the inaccuracy of OCR, e.g., fuzzy searching, wildcards, and truncation.
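To make the last point concrete, here is a minimal sketch of how fuzzy and truncated searches can recover a word that an exact search would miss. It uses only the Python standard library and is not how any particular database implements these features. The function names and the similarity cutoff are assumptions chosen for illustration, and the sample line is taken from the OCR excerpt on slide 15.

```python
import difflib
import fnmatch
import re

# One line from the slide 15 OCR sample: "torce" stands for "force".
SAMPLE = "the torce ol Nature is alwav» directed bv wisdom and harmony"


def distinct_words(ocr_text):
    """Lower-cased distinct alphabetic tokens found in the OCR text."""
    return sorted(set(re.findall(r"[^\W\d_]+", ocr_text.lower())))


def fuzzy_find(query, ocr_text, cutoff=0.75):
    """Words whose similarity to the query meets the cutoff (assumed value)."""
    return difflib.get_close_matches(
        query.lower(), distinct_words(ocr_text), n=10, cutoff=cutoff
    )


def wildcard_find(pattern, ocr_text):
    """Words matching a shell-style pattern, e.g. 'harmon*' for truncation."""
    return [w for w in distinct_words(ocr_text)
            if fnmatch.fnmatchcase(w, pattern.lower())]


print(fuzzy_find("force", SAMPLE))       # an exact search for "force" finds nothing here
print(wildcard_find("harmon*", SAMPLE))
```

A lower cutoff retrieves more damaged variants but also more unrelated words, which is one reason to run a range of searches rather than relying on a single one.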
27. Best Practices for Publishers:
Tools to Enhance Retrieval
• Fuzzy searching
• Prominently displayed user guides
• Rich Subject metadata
• Enhancing metadata by keying in title or first line
• Regular expression searching?
• Measures of accuracy?
• Further transcription or cleanup?
• Embedding additional text in OCR?
• Continuing development and implementation of
quality control standards
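As one illustration of what regular expression searching could offer, the sketch below expands a search word into a pattern that also accepts a few character confusions. The confusion table is an assumption built from errors visible in the slide 15 sample ("THB" for "THE", "torce" for "force", "lind" for "find", "bv" for "by"). It is not a general model of OCR errors, and the function name is hypothetical.

```python
import re

# Illustrative confusion classes only (assumed, drawn from the sample text).
CONFUSIONS = {
    "e": "eb",   # "THB" read for "THE"
    "f": "ftl",  # "torce" for "force", "lind" for "find"
    "y": "yv",   # "bv" for "by"
    "c": "ce",   # c/e are often confused in small type (assumption)
}


def ocr_tolerant_pattern(word):
    """Compile a whole-word, case-insensitive pattern for word in which
    each listed letter may also appear as one of its confusions."""
    parts = []
    for ch in word.lower():
        alternatives = CONFUSIONS.get(ch)
        if alternatives:
            parts.append("[" + re.escape(alternatives) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


text = "the torce ol Nature ... such spring ol' force must exist"
print(ocr_tolerant_pattern("force").findall(text))
print(ocr_tolerant_pattern("the").search("BEFORE THB IMPRISONMENT").group(0))
```

Widening the character classes raises recall at the cost of false hits, so a pattern like this works best alongside a display that lets the user check each match in context.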
37. Best Practices for Publishers:
Navigation of Results
• Keyword in Context (KWIC) Displays
• Navigation Points in Margin
• Indication of Number of Hits per Record
• Highlighting Text
• Robust Search within the Text
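For readers unfamiliar with the term, a Keyword in Context display shows each hit with a fixed window of surrounding text, so the user can judge a match without opening the page image. The following is a minimal sketch, not any vendor's implementation. The function name, window width, and bracket highlighting are assumptions, and the sample text is abridged from the slide 15 excerpt.

```python
import re


def kwic(text, keyword, width=30):
    """Return one display line per hit: left context, bracketed hit,
    right context. The list length doubles as the hit count for the record."""
    lines = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the hits line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


record = ("I had been residing in Worcester, which was the first station "
          "to which I had been appointed. We travelled from Worcester to "
          "Cheltenham.")
hits = kwic(record, "worcester", width=25)
print(f"{len(hits)} hits in this record")
for line in hits:
    print(line)
```

Because the display is built from the OCR text itself, it also lets users see how accurate the underlying text is around each hit.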
39. Best Practices for Publishers: Downloading
• Let users download the format they need
• Provide searchable PDFs in downloads
• Let users download as much or as little as they need
42. Best Practices for Publishers:
Let Users Create Their Own OCR
• To make downloaded non-searchable PDFs searchable
• To take advantage of the latest advances in scanning technology
• To create an edited and marked-up version
• Avoid the following:
Explicit blocking of the process
Downloaded text that is less than optimal for OCR
44. Preliminary Observations
• Kind of Text
2 keyed and 8 OCR
Of the 8 OCR’ed, 2 allow us to see the text: one in KWIC, the other in full text
• Tools to Enhance Retrieval
• KWIC
7 of 10 have KWIC display
3 of those 7 across all results
4 of the 7 provide sampling of hits
• Downloading PDFs
4 searchable PDF
5 non-searchable PDF
1 non-searchable image but no PDF
45. Preliminary Observations (cont.)
• Downloading text file (whether OCR’ed or keyed)
Only 3 of 10 allow downloading or extraction of the keyed or underlying text file
Of those 3, only 2 (1 keyed, 1 OCR’ed) allow download of the entire text at once, while the other allows only chunks of text to be copied at any one time
• Downloading a Whole Source at Once
Yes for 8 of 10
• Can Downloaded Text Images Be OCR’ed?
8 can
1 cannot
1 not applicable, since it is a manuscript