OCR: the What, Why, and How
Mackenzie Brooks, Metadata Librarian
Alston Cobourn, Digital Scholarship Librarian
What is Optical Character Recognition?
Wikipedia says:
the mechanical or electronic conversion of
scanned or photographed images of typewritten
or printed text into machine-encoded/computer-
readable text.
http://en.wikipedia.org/wiki/Optical_character_recognition
The Second Province of Sigma
Chi, embracing the chapters at the
University of Virginia, Uampden-
Sidney, Roanoke, Randolph-Macon,
University of North Carolina and
Washington and Lec.held its annual
convention here on Thursday night
aud Friday. The delegates arrived
on the evening trains on Thursday
mid went immediately to the Lex-
ington, where they put tip.
Why
● Textual analysis
● Your research, student work, DH projects
● ADA compliance
With the full text you can…
● Not all PDFs are created equal
○ Searchable
○ Extract text
● Textual analysis
○ Voyant Tools
○ TEI
Accessibility
● Screen reading
● Kurzweil
● Indexable
● General readability
How?
● Multiple tools, multiple methods
● Reads the image and tries to assign a value to each
character it sees
● Matrix matching vs. feature extraction
● Character vs. whole word recognition
● Checks candidates against an internal lexicon/dictionary
● Language options available
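
To make matrix matching and the lexicon check concrete, here is a minimal sketch. It is not how any of the tools in these slides is implemented: the 3x5 glyph templates, the function names, and the tiny lexicon are all hypothetical, and feature extraction is not shown. The misread word "aud" comes from the newspaper sample earlier in the slides.

```python
from difflib import get_close_matches

# Hypothetical 3x5 reference glyphs: "#" is ink, "." is blank paper.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}

def match_score(glyph, template):
    """Count the pixels where the scanned glyph agrees with a template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Matrix matching: pick the template with the most agreeing pixels."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))

def correct_with_lexicon(word, lexicon):
    """Swap a word for its closest lexicon entry, if one is close enough."""
    candidates = get_close_matches(word, lexicon, n=1, cutoff=0.6)
    return candidates[0] if candidates else word

# A "T" with one stray pixel of noise in the middle row.
noisy_t = ("###", ".#.", "##.", ".#.", ".#.")
print(recognize(noisy_t))

# "aud" is how the sample scan misread "and".
print(correct_with_lexicon("aud", ["and", "annual", "arrived"]))
```

The lexicon step also shows why old or unusual vocabulary causes trouble: a word missing from the dictionary is either left alone or pulled toward the wrong neighbor.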
Complications
● Fonts, formatting, line breaks, columns,
italics, etc.
● Thin paper, writing on back of page
● Stray marks, printing errors, margin notes,
footnotes
● Age of the language (historical spelling and vocabulary)
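
Line breaks are one complication that can be partly cleaned up after OCR. The sketch below is a rough heuristic, not a feature of any tool in these slides: the function name is hypothetical, and the rule (a lowercase letter after the break means a split word, a capital means a real compound) is an assumption that will misfire on some texts. The sample lines come from the newspaper excerpt earlier in the slides, OCR errors included.

```python
import re

def unwrap_lines(text):
    """Rejoin words split across line ends, then unwrap the remaining breaks."""
    # Hyphen + line break + lowercase letter: assume a word split for layout.
    text = re.sub(r"(?<=\w)-\n(?=[a-z])", "", text)
    # Hyphen + line break + capital letter: assume a real compound, keep the hyphen.
    text = re.sub(r"(?<=\w)-\n(?=[A-Z])", "-", text)
    # Any line break left over is ordinary wrapping.
    return text.replace("\n", " ")

sample = (
    "University of Virginia, Uampden-\n"
    "Sidney, Roanoke\n"
    "mid went immediately to the Lex-\n"
    "ington, where they put tip."
)
print(unwrap_lines(sample))
```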
Recommendations
● High resolution
● Binarization
● Deskewing
● Orientation
● Crop out extraneous marks
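
Binarization is the easiest of these steps to show in a few lines. This is a minimal sketch with a fixed global threshold; the pixel values and the threshold of 100 are made up for illustration, and real tools typically choose the threshold automatically (for example with Otsu's method) rather than hard-coding it.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (ink) or 255 (paper)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]

# Hypothetical 4x4 scan: dark ink (~40), bright paper (~245),
# and one faint gray pixel (120) showing through from the back of the page.
page = [
    [250, 245, 240, 248],
    [243,  40,  35, 246],
    [247,  38, 120, 244],
    [249, 242, 241, 250],
]

clean = binarize(page, threshold=100)
for row in clean:
    print(row)
```

With the threshold below the show-through value, the faint pixel is pushed to white while the ink stays black, which is the effect that helps with thin paper and writing on the back of the page.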
What tools can I use?
Adobe Acrobat Pro
Google Drive
Tesseract
Adobe Acrobat Pro
● Via the Stable
● Over 40 languages (not just Latin
characters)
● Automatically preprocesses the image
● Easy to use
● Enhances PDFs
● Will work with TIFFs and JPGs
Google Drive
● Will process PDF, GIF, JPG, and PNG
● Recommends text be 10 pixels high
● Size limit: 2MB per file or 10 pages of PDF
● Upload Settings > Convert Text from
Uploaded PDFs and Image Files
Project Naptha
http://projectnaptha.com/
● automatically applies state-of-the-art computer
vision algorithms on every image you see while
browsing the web. The result is a seamless
and intuitive experience, where you can
highlight as well as copy and paste and even
edit and translate the text formerly trapped
within an image.
Tesseract
● OCR engine
● Developed at HP, 1985-1994; sponsored by Google since 2006
● Highest accuracy
● Command line or front end options
https://code.google.com/p/tesseract-ocr/
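
For the command line option, the basic invocation is `tesseract <image> <output base> -l <language>`, which writes the recognized text to `<output base>.txt`. The sketch below wraps that call in Python; the helper names and the file names `page.tif` and `page` are hypothetical, and it assumes Tesseract and the English language data are already installed.

```python
import shutil
import subprocess

def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", lang]

def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract if it is installed and return the path of the text file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"

# Show the command that run_ocr("page.tif", "page") would execute.
print(" ".join(tesseract_command("page.tif", "page")))
```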
Tesseract Frontends
● FreeOCR
○ Windows
○ PDF, TIFF, JPG
○ 11 languages
○ Ability to input common errors
○ Functions as scanning software
● Other options:
○ https://code.google.com/p/tesseract-ocr/wiki/3rdParty
FreeOCR
SWMSFC Is Set
To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee. constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War ll, wffl interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.
Adobe
SWMSFC Is Set To Interview New Candidates
Class Ring Orders Now Being Taken
The SWMSFC. an autonomous committee,
cotultiluted to raise funds lor a scholarship In
memory of W&L men who l09t their lives in
World War II, will interview candidates for
membership in the organilation Tuesday Oct.
11, at 7 p.m. in the Student. Union.
Tesseract
SWMSFC Is Set
To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee, constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War II, will interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.
Google Drive
SWMSFC Is Set To Interview New Candidates
Class Ring Orders Now Being Taken
The SWMSFC, an autonomous committee,
constituted to raise funds for a scholarship in memory
of W&L men who lost their lives in World War II, will
interview candidates for membership in the
organization Tuesday Oct. 11, at 7 p.m. in the Student
Union.
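
One way to put a number on the comparison above is to count the words that differ between two outputs. This is a rough sketch: it treats the Tesseract and Google Drive reading of the article body as the reference, which is an assumption (the two agree and read cleanly, but neither was checked against the printed page here), and the function name is hypothetical.

```python
from difflib import SequenceMatcher

# Article body as read by Tesseract and Google Drive (assumed correct).
reference = (
    "The SWMSFC, an autonomous committee, constituted to raise funds "
    "for a scholarship in memory of W&L men who lost their lives in "
    "World War II, will interview candidates for membership in the "
    "organization Tuesday Oct. 11, at 7 p.m. in the Student Union."
)
# The same passage as read by Adobe Acrobat Pro.
adobe = (
    "The SWMSFC. an autonomous committee, cotultiluted to raise funds "
    "lor a scholarship In memory of W&L men who l09t their lives in "
    "World War II, will interview candidates for membership in the "
    "organilation Tuesday Oct. 11, at 7 p.m. in the Student. Union."
)

def word_errors(expected, candidate):
    """List (expected, got) pairs for each run of words that differs."""
    exp_words, cand_words = expected.split(), candidate.split()
    matcher = SequenceMatcher(a=exp_words, b=cand_words, autojunk=False)
    return [
        (" ".join(exp_words[i1:i2]), " ".join(cand_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

for expected_word, got in word_errors(reference, adobe):
    print(f"{expected_word!r} -> {got!r}")
```

Word-level counts are strict: a stray period counts as much as a badly mangled word, so a character-level comparison may be fairer for some projects.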
Contact
Mackenzie Brooks
brooksm@wlu.edu x8659
Alston Cobourn
cobourna@wlu.edu x8657
DHAT@wlu.edu
