Document Image Analysis for Text Recognition

    Document Image Analysis for Text Recognition - Presentation Transcript

    1. JISC Workshop – University of Bath. Apostolos Antonacopoulos, PRImA Research, University of Salford, UK. www.primaresearch.org
      • Documents – examples and characteristics
      • Issues affecting OCR and document presentation
      • Full-text conversion workflow
      • Document Image Analysis (DIA) workflow
        • Characteristics / Challenges
        • Open Issues / Opportunities
      • Range from manuscripts to books to newspapers
      • From older, usually better quality…
      • … to newer, mass produced lower quality
        • e.g. newspapers meant to last only a day
      • Languages: English, German, French etc.
        • also mixed with Latin (even at word level)
      • Typefaces: regular and presentation typefaces
        • even in the same text line
      • Layouts range widely (multi-column, decorative borders, illustrations, many types of drop caps etc.)
      • Artefacts inherent in the document when produced
      • Bleed-through
      • Smear-over from opposite page
      • Typesetting
        • Font peculiarities, e.g. the use of different widths of characters to achieve full justification
        • Layout issues
      • Marks from worn-out lead type
      • Paper texture.
      • Artefacts arising from using the documents
      • Folds
      • Tears
      • Annotations
      • Stains
      • Damage/dirt from regular handling
        • e.g. turning pages
      • Clear tape over tears (repairs)
      • Replacement paper (repair for missing parts)
      • Punch holes, staples etc.
      • Scratches on microfilm.
      • Artefacts arising from ageing and storage conditions
      • Uneven paper warping (due to humidity)
      • Acid discolouration
      • Mould
      • Non-straight paper edges (due to uneven shrinkage)
        • Cannot assume rectangular page shape (relied upon by some geometric correction techniques)
      • Fading ink
      • Discoloured paper (often unevenly).
      • Page curl
        • results in warping and missing parts of the page
      • Show-through
        • Can be eliminated by placing a black card behind the page being scanned (not always feasible)
      • Uneven illumination
        • Shadows created where ripples are present
      • Varying skew
        • Especially in newspapers scanned with wide format feeder scanners.
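
Skew of the kind listed above is commonly estimated from projection profiles: when the text lines are horizontal, ink concentrates in few rows and the row histogram becomes sharply peaked. The following is a minimal sketch of that idea, not a method from the presentation. The function names, the candidate angle list, and the list-of-lists binary image (1 = ink) are assumptions made for illustration. A positive angle here means baselines that descend to the right in image coordinates.

```python
import math

def profile_score(points, angle_deg):
    """Sum of squared row counts after shearing ink pixels back by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for x, y in points:
        # Undo the assumed skew: pixels on one baseline should land in one row.
        row = round(y - x * slope)
        counts[row] = counts.get(row, 0) + 1
    # A peaked profile (few rows with many pixels) gives a high score.
    return sum(c * c for c in counts.values())

def estimate_skew(image, candidates):
    """Return the candidate angle (degrees) whose profile is most peaked."""
    points = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    return max(candidates, key=lambda a: profile_score(points, a))

# Synthetic example: two thin "text lines" skewed by 2 degrees.
slope = math.tan(math.radians(2))
image = [[0] * 200 for _ in range(40)]
for x in range(200):
    for y0 in (5, 20):
        image[y0 + round(x * slope)][x] = 1

print(estimate_skew(image, range(-3, 4)))
```

Because this produces a single angle for the whole image, pages with varying skew would need it applied per strip or per region, which is one reason the problem is harder for wide-format newspaper scans.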
      • Document preparation
        • Physical examination and repair/recording of artefacts
        • Metadata recording
      • Scanning (and OCR) by contractor
        • Specification varies according to overall cost and contractor capability
      • Examination of documents
        • compare the state of original artefacts
      • Examination of scans (quality)
      • Validation and OCR error correction
        • Only if this involves a few keywords per page
      • Hosting/presentation
      • Counting the items above as steps 1 to 6 in order, steps 1, 3, 4, 5 and 6 together may cost about as much as step 2 (scanning and OCR by the contractor).
      • Main steps
        • Scanning
        • Image enhancement
        • Layout analysis
        • OCR
        • Post-processing
      • Scanning quality specifications vary a lot
      • Issues of
        • Resolution
        • Colour depth
        • Compression
      • Compression aversion
        • Questionable technical decisions: file space is saved by not scanning in colour/greyscale rather than by using compression
      • Existing bitonal scans
        • Bad: there are millions of pages already which the libraries are not eager to re-scan
        • Hopeful: only about 1% of documents in existence have been digitised.
      • [Images: overhead scanner (Zeutschel); book scanner (Treventus), © Bavarian State Library]
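
The trade-off between resolution, colour depth and compression comes down to simple arithmetic on raw image size. The sketch below works it through for one page. The page dimensions, the 300 dpi resolution and the function name are illustrative assumptions, not figures from the presentation.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size in bytes of a page scanned at the given dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assumes rows are packed with no padding; real file formats may pad rows.
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

# Assumed example page: A4 (8.27 x 11.69 in) at 300 dpi.
for name, bpp in [("bitonal", 1), ("greyscale", 8), ("colour", 24)]:
    size = uncompressed_scan_bytes(8.27, 11.69, 300, bpp)
    print(f"{name:9s} {size / 1e6:6.1f} MB")
```

Before compression, a 24-bit colour scan is 24 times the size of a bitonal scan of the same page, which is the saving that motivates bitonal scanning. The cost is that the discarded colour and grey-level information cannot be recovered later.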
      • Improve image by identifying and correcting artefacts
      • Dual objective:
        • Enable DIA methods to work better
        • Improve readability by humans
      • Operations typically performed:
        • Page splitting (if more than one page is present in the same image)
        • Border removal
        • Dewarping / deskewing / geometrical correction
        • Noise and artefact removal
        • Binarisation (necessary for many DIA methods).
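
As an illustration of the binarisation step, here is a minimal sketch of Otsu's global thresholding, a widely used baseline; the presentation does not name a specific method, so this choice is an assumption. It picks the grey level that maximises the between-class variance of "ink" and "paper" pixels. The function names and the flat list of 8-bit grey values are likewise assumptions.

```python
def otsu_threshold(pixels):
    """Return threshold t (0-255); pixels <= t form the dark class."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of grey values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def binarise(image, threshold):
    """Map a greyscale image (list of rows) to 1 = ink, 0 = background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]

# Tiny example: dark strokes (about 30) on light paper (about 200).
grey = [[200, 30, 200], [200, 35, 210], [190, 30, 200]]
t = otsu_threshold([p for row in grey for p in row])
print(binarise(grey, t))
```

A single global threshold of this kind tends to fail on pages with uneven illumination, stains or show-through, where local (adaptive) thresholds are usually preferred.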
      • Segmentation (and classification) of
        • Blocks
        • Text lines
        • Words
        • Characters
      • Tight spacing between columns and between text and non-text blocks
        • Sometimes tighter than inter-word spacing
      • Baselines are frequently wavy
        • Local warping (e.g. due to humidity)
      • Intra-word spacing can be wide
        • Challenge for word segmentation
      • Characters may be broken (gaps between their strokes) and may also touch neighbouring characters
        • Challenge for character segmentation
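
To make the spacing problems above concrete, here is a minimal sketch of projection-profile segmentation on a binarised image (1 = ink). It is a textbook baseline rather than the approach of any particular system. Text lines are taken as runs of non-empty rows, and words as runs of non-empty columns separated by at least `min_gap` empty columns. The function names and the `min_gap` parameter are assumptions for illustration.

```python
def split_runs(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of runs where profile > 0.

    Runs separated by fewer than min_gap empty positions are merged.
    """
    runs, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs

def text_lines(image):
    """Segment rows into text lines using the horizontal projection profile."""
    return split_runs([sum(row) for row in image])

def words_in_line(image, top, bottom, min_gap):
    """Segment one text line into words using the vertical projection profile."""
    width = len(image[0])
    profile = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return split_runs(profile, min_gap)

page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0],
]
print(text_lines(page))
print(words_in_line(page, 1, 3, min_gap=2))
print(words_in_line(page, 1, 3, min_gap=1))
```

The result depends heavily on `min_gap`: a small value splits one word into several when intra-word spacing is wide, while a large value merges neighbouring words or columns when spacing is tight. Wavy baselines similarly blur the gaps between rows, so a plain row profile can merge adjacent text lines.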
      • Binarisation is still an issue
        • Despite being very actively researched
      • Global dewarping (page curl correction)
        • Work is still needed on difficult cases, e.g. very tightly bound volumes that are only partially flattened during scanning
      • Correction of geometrical artefacts beyond page curl
        • e.g. folds, local warping etc.
      • Colour analysis to remove stains and to separate annotations etc.
      • Robustness in the presence of
        • noise and other artefacts
        • very narrow spacing between entities
        • wide spacing between characters (emphasis)
        • decorative drop capitals and illustrations
        • decorative borders
      • Distinction between advertisements and main text
      • Article tracking.
      • PRImA
        • http://www.primaresearch.org
      • IMPACT
        • http://www.impact-project.eu