OCR challenges in historic documents and the contribution of IMPACT

Presentation by Clemens Neudecker (KB) at the IFLA satellite meeting “New Techniques for Old Documents” (16-18 August, Uppsala, Sweden)

Presentation Transcript

  • OCR challenges in historic documents and the contribution of IMPACT
    • Clemens Neudecker, KB National Library of the Netherlands
  • Background
    • Text that is not digital is virtually invisible
    • OCR (optical character recognition) technology does not produce satisfactory results for historic documents
    • There is a lack of institutional knowledge and expertise which causes “re-inventing the wheel”
    • Innovate OCR software and language technology
    • Share best practice and build capacity across Europe (Guidelines, Training, Workshops)
  • IMPACT – Improving access to text
    • Funded by the EC as part of the 7th Framework Programme
    • Coordinated by KB – National Library of the Netherlands
    • EU funding: € 12 100 000
    • 26 partners: Libraries, Research Institutes, Industry Partners
    • Start date: 1 January 2008
    • Duration: 48 months (2011: Center of Competence)
  • Historic material: different problems
    • OCR errors
      • Damaged material, bad quality scans, difficult layout, historic fonts, …
    • Historical language
      • Spelling variants, orthographical variants, inflected forms, …
  • Bad OCR results… la 112 B ik e my lat arrived the >Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath, ' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn- .die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine and grocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;' 4Stalled the AluidonG.: ceror' Lkndon, with sundries; : ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,; ;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman, for eathly Newpot;agd llford; -Tw Br.otherAs, lawces, fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per- wIliti; iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for .:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foir ouck , + iii ballasto I _______~ ~ ~ ~~~Ai
  • Bleed-through & shine-through
    • General description: When the printing ink was not dry, the letters of one page also appear on the other page. Also, if the paper is relatively thin, the ink on the other side of the page may shine through.
    • Effects on OCRing: Effects are high, since it is the same ink (though lighter) and the shapes of the characters are directly disturbed.
  • IMPACT: Binarisation
  • Annotations in the text
    • General description: All notes, lines and drawings created by users, but also stamps, tapes etc. used within libraries.
    • Effects on OCRing: Effects are high, since both the segmentation and the recognition process itself are disturbed.
  • IMPACT: Improved binarisation
    • Original
    • State of the Art
    • IMPACT
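To make concrete what binarisation does with bleed-through, here is a minimal sketch assuming a plain global Otsu threshold on 8-bit grey levels. This is a textbook baseline, not the IMPACT method (which the slides do not describe); the function names and the toy pixel values are invented for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises the between-class variance.

    pixels: iterable of ints in 0..255. Pixels <= t are treated as ink.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue            # no dark class yet
        n_light = total - n_dark
        if n_light == 0:
            break               # no light class left
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between_var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(image, t):
    """Map a grey-level image (list of rows) to 1 = ink, 0 = background."""
    return [[1 if p <= t else 0 for p in row] for row in image]


# Toy page (assumed values): dark ink, lighter bleed-through, bright paper.
INK, BLEED, PAPER = 30, 170, 220
page = [
    [PAPER, INK,   INK,   PAPER, PAPER],
    [PAPER, INK,   BLEED, BLEED, PAPER],
    [PAPER, INK,   INK,   PAPER, PAPER],
    [PAPER, BLEED, BLEED, INK,   PAPER],
]

t = otsu_threshold(p for row in page for p in row)
binary = binarise(page, t)
```

On this toy page the threshold falls between the ink and the bleed-through levels, so the lighter bleed-through pixels end up as background. Real scans rarely separate this cleanly, which is why improved binarisation is a research topic in its own right.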
  • Warping of paper
    • General description: Due to humidity, a single page of an old book is very rarely really flat; instead it is warped. Even pressing the paper against a glass plate will not make the warping disappear.
    • Effects on OCRing: In some cases a relatively high effect, especially when combined with bad printing (e.g. characters not aligned on the baseline of a line).
  • IMPACT: Border removal
  • IMPACT: Geometric correction I
  • IMPACT: Geometric correction II
  • Gothic typeface
    • General description: Historic fonts, obsolete characters such as the long s.
    • Effects on OCRing: Effects are high, since such fonts and characters are often not recognised correctly.
  • IMPACT: Improved recognition
  • Complex layout
    • General description: Due to difficult layouts, pages can be segmented incorrectly.
    • Effects on OCRing: Effects are high, since the text is not put in the right order.
  • IMPACT: Segmentation
    • Blocks/Regions
    • Words
    • Glyphs
  • IMPACT: Functional extension parser
    • Recognition of the structure of book pages
      • Print space
      • Standard font of the main text
      • Page numbers
    • Enrichment of OCR results with structural information
  • Bad printing: blurred, broken, faded characters
    • General description: Depending on the printing technology used, letters may be blurred, broken or dotted.
    • Effects on OCRing: Effects are high, since characters are broken or joined together.
  • IMPACT: Cooperative correction
    • Integrated web-based system for cooperative correction of OCR results
    • Character/Word/Page mode
    • Collaboratively correct OCR errors and use results for improving OCR
  • IMPACT: Word spotting
    • Alternative technique for indexing historical documents
    • After word segmentation relevant words are detected and highlighted
    • Key words can be e.g. person and location names
  • Historical language
    • Historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald we ë led
  • IMPACT: Historical dictionaries
    • OCR:
      • Lexica for German, Dutch, English, French, Spanish, Polish, Bulgarian and Czech
      • Generic tools for building historical lexica
    • FineReader with built-in standard Dutch dictionary: werreid
    • FineReader with IMPACT dictionary of historical Dutch: werreld
    • RETRIEVAL:
      • Key in ‘wereld’ and find ‘werreld’
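The retrieval idea (key in the modern form, find the historical spelling) can be sketched as query expansion over a lexicon. This is a minimal illustration, assuming a lexicon that maps a modern lemma to its attested historical variants; the data structure, the function names and the two sample documents are hypothetical and do not reflect the actual IMPACT lexicon format. The variants are taken from the ‘wereld’ slide.

```python
import re

# Hypothetical lexicon: modern lemma -> historical spelling variants.
HISTORICAL_LEXICON = {
    "wereld": {"werelt", "weerelt", "werreld", "waereld", "weereld", "wareld"},
}


def expand_query(term, lexicon=HISTORICAL_LEXICON):
    """Return the query term plus every historical variant listed for it."""
    return {term} | lexicon.get(term, set())


def search(term, documents, lexicon=HISTORICAL_LEXICON):
    """Return ids of documents containing the term or any of its variants."""
    wanted = expand_query(term.lower(), lexicon)
    hits = []
    for doc_id, text in documents.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & wanted:
            hits.append(doc_id)
    return hits


# Made-up sample texts, for illustration only.
documents = {
    "doc1": "de gantsche werreld",
    "doc2": "een nieuwe kaart",
}

hits = search("wereld", documents)
```

Here the query ‘wereld’ matches `doc1` even though that text only contains the historical spelling ‘werreld’. Expanding the query rather than normalising the stored text leaves the OCR output untouched, which is one possible design choice among several.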
  • IMPACT: Linguistic post-correction
    • The colors indicate different types of analysis results, like a word being found in the historical or hypothetical dictionary, or a supposed OCR error, etc.
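The kind of per-word analysis described above can be sketched as a lookup against word lists, with a similarity search as fallback. This is a simplified illustration, not the IMPACT post-correction tool: the word lists are toy data, the category labels and the `analyse` function are hypothetical, the hypothetical dictionary mentioned on the slide is left out, and `difflib` string similarity stands in for a real OCR error model.

```python
from difflib import get_close_matches

# Toy word lists (assumed for illustration only).
MODERN = {"wereld", "de", "van"}
HISTORICAL = {"werreld", "werelt", "waereld"}


def analyse(token, cutoff=0.8):
    """Classify a token and return (category, suggested form)."""
    word = token.lower()
    if word in MODERN:
        return ("modern", word)
    if word in HISTORICAL:
        return ("historical", word)
    # Not in any lexicon: look for a sufficiently similar known word.
    candidates = get_close_matches(
        word, sorted(MODERN | HISTORICAL), n=1, cutoff=cutoff
    )
    if candidates:
        return ("suspected OCR error", candidates[0])
    return ("unknown", word)


result = analyse("werreid")
```

With these toy lists, the misrecognised ‘werreid’ from the FineReader example is flagged as a suspected OCR error with ‘werreld’ as the suggested form, while ‘werreld’ itself is accepted as a historical word.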
  • IMPACT: Interoperability framework
    • Interaction, Modularisation, Evaluation
  • Thank you!
    • http://www.impact-project.eu/
    • [email_address]
    • @impactocr
    • http://impactocr.wordpress.com/