Scanning, OCR, Document Analysis Services
Thomas M. Breuel
IUPR
DFKI & U. Kaiserslautern
IUPR R&D Efforts
► book scan capture software
► trainable quality control
► OCR
► book-level layout analysis
► logical structure
► library-level logical structure
► image-based document retrieval
► image-based document comparisons
► image search
► edition analysis
► statistical text analysis and alignment
► information & reference extraction
OCRopus
background and funding
► BMBF IPET project
► Google Funding
► PAREN project
OCR system
► preprocessing
● thresholding, noise removal, deskewing, ...
► layout analysis
● 2D analysis of page components
● division into text lines
► text line recognition
● transformation of text lines into character hypotheses
● results in “recognition lattice”
► language modeling
● incorporate syntactic and semantic knowledge
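The four stages above can be sketched as a chain of functions. This is only a toy illustration of the data flow: every name here (`preprocess`, `find_lines`, `recognize_line`, `apply_language_model`, `ocr`) is hypothetical and not part of OCRopus, the recognizer is a stub, and the "lattice" is reduced to a flat list of weighted hypotheses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    pixels: List[List[int]]  # rows of 0/1 values belonging to one text line


def preprocess(page, threshold=128):
    """Binarize a grayscale page: 1 = ink, 0 = background."""
    return [[1 if v < threshold else 0 for v in row] for row in page]


def find_lines(binary):
    """Toy layout analysis: split the page wherever a blank row occurs."""
    lines, current = [], []
    for row in binary:
        if any(row):
            current.append(row)
        elif current:
            lines.append(TextLine(current))
            current = []
    if current:
        lines.append(TextLine(current))
    return lines


def recognize_line(line):
    """Stub recognizer: returns (hypothesis, cost) pairs standing in for a lattice."""
    ink = sum(map(sum, line.pixels))
    return [("l", 1.0), ("I", 1.2)] if ink < 10 else [("m", 0.5), ("rn", 0.7)]


def apply_language_model(lattice, prior_cost):
    """Pick the hypothesis with the lowest recognition cost plus language model cost."""
    return min(lattice, key=lambda h: h[1] + prior_cost.get(h[0], 5.0))[0]


def ocr(page, prior_cost):
    binary = preprocess(page)
    return [apply_language_model(recognize_line(line), prior_cost)
            for line in find_lines(binary)]
```

The point of the sketch is that the language model can overrule the locally cheapest character hypothesis, which is why the recognizer hands on several weighted alternatives rather than a single string.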
general architecture
layout analysis → text line recognition → statistical language modeling → TEXT
OCR
► OCR is more than “character recognition”
► isolated character recognition is nearly useless by itself
► the most catastrophic errors are layout analysis errors
OCR questions
► there is no “universal” OCR system yet
► need to do engineering
● What kinds of scripts / languages are being targeted?
● How were the documents captured?
● What kinds of layouts are expected?
● What kind of training data is available?
● How is the data to be used?
● Indexing, search, publishing?
● What are the error rate requirements?
● What are the throughput requirements?
text recognition
► Questions to ask before choosing a text line
recognizer
● image resolution?
● size of character set?
● diacritics?
● connected or isolated writing style?
● ligatures? number of ligatures?
● redundancy?
tradeoffs
► HMM (e.g., screen OCR for English)
● good at low resolutions, easy to train/apply
● worse than alternatives at high resolutions
► MLP (e.g., flatbed scans for Hebrew)
● good statistical estimates for small character sets
● too slow and hard to train for large character sets
► shape-based matchers (rare character handling)
● highly accurate, easy to train
● too slow as primary classifier
OCRopus = toolbox
► goal: omni-script, omni-language
● no recognition/layout algorithm works for everything
● need to combine many implementations
► approach
● no coupling between components
● small, controlled set of data types in interfaces
► eventually
● put the right components for the task together
automatically
IUPR strengths
► quality related
● geometric algorithms for layout analysis
● statistical layout analysis
● statistical natural language processing
● adaptive recognition
► systems related
● standards-based representations
● simple intermediate formats, minimal coupling
● proven coding conventions
preprocessing
► challenges
● highly variable inputs, high throughput
► achievements
● novel, fast, accurate algorithms for thresholding and deskewing improve throughput and recognition accuracy
● novel boundary noise removal algorithm greatly improves recognition accuracy
► future work
● text extraction from complex backgrounds, highly
degraded documents
preprocessing
► released components
● general image processing library
● fast bitblit implementations
● run-length morphology methods
● multiple binarization methods
● page frame detection
► in development
● adaptive binarization
● color document processing
● script / language id
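The slides do not say which binarization methods are implemented. As a point of reference, a minimal sketch of one well-known global method, Otsu's threshold, is shown below in plain Python. It is a stand-in chosen for illustration, not the OCRopus algorithm, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    `pixels` is a flat list of integer gray values in 0..255.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]          # number of pixels with value <= t
        sum0 += t * hist[t]    # sum of their gray values
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance, up to a constant
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Binarize a list of rows: 1 = ink (dark), 0 = background."""
    t = otsu_threshold([v for row in image for v in row])
    return [[1 if v <= t else 0 for v in row] for row in image]
```

A single global threshold like this fails on unevenly lit pages, which is the case the adaptive binarization listed under "in development" is meant to address.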
fast morphology / cleanup
fast adaptive thresholding
page frame detection
bad scan detection
text / image segmentation
layout analysis
► challenges
● complex, highly variable physical and logical layout
structures designed for humans
► achievements
● developed demonstrably high-accuracy generic layout
analysis
● eliminated the need for “rule-based layout analysis”
through trainable machine-learning layout analysis
system
● developed high-throughput interactive segmenter
► future work
● integration, application to large scale document
collections
layout analysis
► released components
● xy-cuts
● Voronoi segmentation
● generic geometric page segmenter
► in development
● improved text/image segmentation
● trainable statistical layout analysis
● statistical non-Manhattan layout analysis
● layout analysis on distorted/curved pages
● integration of layout analysis results
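Of the released components, xy-cuts is the simplest to illustrate: recursively split the page at the widest empty horizontal or vertical gap until no gap of sufficient width remains. The sketch below is a generic textbook version under that assumption, not the OCRopus implementation; `xy_cut` and `min_gap` are hypothetical names.

```python
def xy_cut(img, box=None, min_gap=2):
    """Recursive XY-cut on a binary image (list of rows, 1 = ink).

    Returns a list of leaf boxes (x0, y0, x1, y1) with exclusive upper bounds.
    """
    if box is None:
        box = (0, 0, len(img[0]), len(img))
    x0, y0, x1, y1 = box

    # Rows and columns inside the box that contain ink (projection profiles).
    rows = [y for y in range(y0, y1) if any(img[y][x0:x1])]
    cols = [x for x in range(x0, x1) if any(img[y][x] for y in range(y0, y1))]
    if not rows:
        return []

    # Shrink the box to the ink it contains.
    x0, x1, y0, y1 = cols[0], cols[-1] + 1, rows[0], rows[-1] + 1

    # Find the widest empty gap in either direction.
    best = None  # (gap width, axis, first empty index, first filled index after gap)
    for axis, filled in (("y", rows), ("x", cols)):
        for a, b in zip(filled, filled[1:]):
            gap = b - a - 1
            if gap >= min_gap and (best is None or gap > best[0]):
                best = (gap, axis, a + 1, b)

    if best is None:
        return [(x0, y0, x1, y1)]

    _, axis, start, end = best
    if axis == "y":
        return (xy_cut(img, (x0, y0, x1, start), min_gap)
                + xy_cut(img, (x0, end, x1, y1), min_gap))
    return (xy_cut(img, (x0, y0, start, y1), min_gap)
            + xy_cut(img, (end, y0, x1, y1), min_gap))
```

Because every cut is axis-aligned, this approach only works on Manhattan layouts, which is one reason the non-Manhattan and curved-page items appear in the development list.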
high accuracy text line finding
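$$Q(\rho, \theta, \delta) = \sum_i \max\bigl(d_\epsilon(p_i, l_{\rho,\theta}),\ \alpha\, d_\epsilon(p_i, l_{\rho-\delta,\theta})\bigr)$$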
● lots of existing methods
● this algorithm:
● doesn't assume text lines are parallel
● enables correcting for perspective distortion
● finds well-defined, globally optimal solutions
● no search parameters to tune—only pick epsilon
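To make the role of epsilon concrete, here is a minimal sketch of a robust line-quality score for a single candidate line. It assumes a per-point contribution of max(0, 1 - dist²/ε²), which is one common robust least-squares form and is an assumption here, and it ignores any second (descender) line. The coarse grid search in `best_line` is a stand-in: unlike the algorithm on the slide, it carries no global optimality guarantee. All names are hypothetical.

```python
import math


def line_quality(points, r, theta, eps):
    """Sum of per-point scores for the line at angle `theta` and offset `r`.

    A point scores 1 when it lies on the line, falls off quadratically with
    distance, and contributes nothing beyond distance `eps`.
    """
    total = 0.0
    for x, y in points:
        dist = abs(-x * math.sin(theta) + y * math.cos(theta) - r)
        total += max(0.0, 1.0 - (dist / eps) ** 2)
    return total


def best_line(points, eps, offsets, angles):
    """Brute-force search over a grid of (offset, angle) candidates."""
    return max(((r, th) for th in angles for r in offsets),
               key=lambda c: line_quality(points, c[0], c[1], eps))
```

Because points farther than `eps` from the line contribute zero, outliers such as a stray mark far from the baseline cannot pull the fit away from the bulk of the characters.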
text line recognition
► challenges
● wide variety of writing systems, scan qualities
► released components
● template-based system based on Tesseract
● neural net based oversegmenting recognizer
► in development
● HMM recognizer
● shape-based recognizer
● word-based recognizer
line recognizers
► Tesseract
● alphabetic printed, well-segmented
► oversegmenting MLP
● alphabetic printed, handwritten
► large character set MLP
● Urdu, Indic, CJK printed
► HMM
● low resolution, poorly segmented inputs
► shape-based
● large character set, few training examples
► word recognizer
adaptive recognition
language modeling
statistical language models
► language models as weighted finite state transducers
► fully probabilistic foundation
► modular language models allow rapid retargeting
[diagram labels: Hypothesis Graph; Dictionary; Grammar; Dictionary; Semantic Constraints; Result]
statistical language modeling
► primary task
Given a string S, determine the probability of
occurrence of that string in text.
► not quite linguistics
● ungrammatical strings occur
● many grammatical strings are highly improbable
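The primary task can be illustrated with the smallest useful model: a character bigram model with add-one smoothing. This is a generic sketch, not the OCRopus language modeling code; `train` and `log_prob` are hypothetical names, and "^" and "$" are assumed start and end markers.

```python
import math
from collections import Counter


def train(words):
    """Count character bigrams over a list of training strings."""
    bigrams, contexts, alphabet = Counter(), Counter(), {"$"}
    for w in words:
        chars = ["^"] + list(w) + ["$"]
        alphabet.update(w)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(alphabet)


def log_prob(s, model):
    """Log probability of string `s` under the add-one smoothed bigram model."""
    bigrams, contexts, v = model
    chars = ["^"] + list(s) + ["$"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
               for a, b in zip(chars, chars[1:]))
```

Smoothing is what makes this "not quite linguistics": a string never seen in training still gets a small non-zero probability instead of being rejected outright.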
other tasks
► weighted finite state transducers for...
● grapheme sequence / unicode
● transliteration (including ambiguous)
● alignment
● orthographic variation
● morphology
● word spotting
● citation extraction
● dictionary entry parsing
● reading order determination
● language model adaptation
● character set conversions
probabilistic finite state transducers
► “little translation machines”
● input: one or more strings + probabilities
● output: one or more strings + probabilities
► related to HMMs
► high level operations
● minimize, reverse, complement, union, intersect,
compose, ...
● bestpath, bestpath-2, bestpath-n
► these can be combined dynamically
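The "little translation machine" view can be sketched without states and arcs by treating a transducer extensionally, as a finite table of (input, output) pairs with probabilities. Under that simplifying assumption, compose sums over the shared middle string and bestpath picks the most probable output. Real WFST libraries work on automata rather than tables; the names below are hypothetical.

```python
from collections import defaultdict


def compose(a, b):
    """Compose two weighted relations: (x, y) with p and (y, z) with q give (x, z) with p * q."""
    out = defaultdict(float)
    for (x, y), p in a.items():
        for (y2, z), q in b.items():
            if y == y2:
                out[(x, z)] += p * q
    return dict(out)


def best_output(t, x):
    """Most probable output string for input `x`, or None if `x` has no outputs."""
    candidates = {z: p for (xi, z), p in t.items() if xi == x}
    return max(candidates, key=candidates.get) if candidates else None
```

For example, if a recognizer maps a line image to "rnodel" with probability 0.6 and "model" with 0.4, composing it with a dictionary model that strongly prefers "model" flips the ranking, which is the combination of recognition lattice and language model described earlier.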
grapheme sequence
► Devanagari graphemes do not occur in phonetic / Unicode order
● vowels, “r”, ...
► OCR recognition
● left-to-right grapheme sequence
● diacritics
● ligatures as units
► translate: Unicode order, expand diacritics, ligatures
► also: rendering
● rule-based
● learn frequencies for different grapheme choices
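A minimal rule-based sketch of the translation step, for one concrete case: the Devanagari vowel sign I (U+093F) is drawn to the left of its consonant but stored after it in Unicode. The recognizer class name `<kssa>` and the one-entry ligature table are hypothetical, and the sketch handles only this single reordering rule, so it falls well short of a full Devanagari model.

```python
PREBASE_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, rendered left of its consonant

# Hypothetical recognizer output classes for ligatures, mapped to code points.
LIGATURES = {
    "<kssa>": "\u0915\u094D\u0937",  # KA + VIRAMA + SSA
}


def visual_to_logical(units):
    """Convert left-to-right recognized units into Unicode (logical) order."""
    out = []
    pending = None  # a pre-base vowel sign waiting for its consonant or ligature
    for unit in units:
        text = LIGATURES.get(unit, unit)  # expand ligature classes
        if text == PREBASE_I:
            pending = text
            continue
        out.append(text)
        if pending is not None:
            out.append(pending)  # the sign follows the whole unit it attaches to
            pending = None
    if pending is not None:
        out.append(pending)
    return "".join(out)
```

Recognizing a ligature as one unit and expanding it before reordering means the vowel sign lands after the whole conjunct, which a per-character swap would get wrong.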
transliteration and phonetic models
► Sanskrit written in...
● Devanagari
● other Indic scripts
● multiple Roman transliterations
● IPA
► we need to transliterate between them
► build models
● start with rule-based models
● ambiguities: learn frequencies and context to resolve
orthographic variation
► historical documents show variations in...
● accepted spelling
● spelling errors
● pronunciation
● transliteration
► build model mapping variants to standard spelling
► applications
● transform text / OCR result to standard orthography
● fuzzy search matching variant spellings
● use standard text as language model / ground truth
● use variant text as language model / ground truth
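A toy version of the variant-to-standard mapping and the fuzzy search built on it. The two rewrite rules are illustrative assumptions (long s to "s", doubled "v" to "w"); a real model would be a weighted transducer with learned, context-dependent rules rather than blind string replacement.

```python
# Illustrative rewrite rules only; applied blindly, "vv" -> "w" would also
# corrupt modern words such as "savvy".
RULES = [("ſ", "s"), ("vv", "w")]


def normalize(word):
    """Map a variant spelling to a (lower-cased) standard form."""
    w = word.lower()
    for old, new in RULES:
        w = w.replace(old, new)
    return w


def search(query, tokens):
    """Return the tokens whose normalized form matches the normalized query."""
    q = normalize(query)
    return [t for t in tokens if normalize(t) == q]
```

The same `normalize` function serves both applications on the slide: run it over OCR output to produce standard orthography, or run it over both query and index to match variant spellings.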
alignment
► text / image alignment
● use transcription as the “language model”
● perform recognition with that language model
● OCR output contains word-aligned bounding boxes
► routinely used for training
► other uses
● ground truth doesn't need to be perfect
● ground truth may be transliterated, orthographic variants
● line breaks etc. may be formatted differently
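The approach on the slide performs recognition with the transcription as the language model. A much simpler stand-in, shown here only to illustrate the end result, aligns finished OCR output to the transcription after the fact with Python's `difflib` and transfers the bounding boxes. The function name and data layout are hypothetical.

```python
from difflib import SequenceMatcher


def align(ocr_words, transcript):
    """Attach OCR bounding boxes to transcription words.

    `ocr_words` is a list of (text, bbox) pairs and `transcript` a list of
    words. Returns (transcript word, bbox) pairs for every stretch where the
    two sequences match or differ word-for-word.
    """
    a = [w.lower() for w, _ in ocr_words]
    b = [w.lower() for w in transcript]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    aligned = []
    for tag, i0, i1, j0, j1 in matcher.get_opcodes():
        # "replace" blocks of equal length are treated as OCR errors on the
        # same words; insertions and deletions are skipped.
        if tag in ("equal", "replace") and i1 - i0 == j1 - j0:
            for i, j in zip(range(i0, i1), range(j0, j1)):
                aligned.append((transcript[j], ocr_words[i][1]))
    return aligned
```

Since matching is on word sequences rather than lines, the transcription can be broken into lines differently from the scanned page.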
hOCR output format
► wanted an output format that...
● could represent all major languages
● could represent typography for all major scripts
● had lots of tools available for it
● could encapsulate existing OCR output formats
● was easy to incorporate into existing OCR systems
► solution
● use Unicode, HTML + CSS
● add optional, compatible markup for OCR
hOCR: reuse HTML/CSS markup
overview of hOCR markup elements
► logical markup
► engine-specific
► page layout
► language modeling
hOCR example
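The example from this slide did not survive extraction. As a rough substitute, the sketch below emits a small hOCR fragment using the `ocr_page`, `ocr_line`, and `ocrx_word` classes and the `bbox x0 y0 x1 y1` title property from the hOCR specification. The helper names and the input layout are hypothetical, and the output is a fragment rather than a complete HTML document.

```python
from html import escape


def bbox_title(box):
    """Format a 4-tuple (x0, y0, x1, y1) as an hOCR bbox property."""
    return "bbox %d %d %d %d" % box


def to_hocr(page_box, lines):
    """Render lines of (line_box, [(word, word_box), ...]) as an hOCR fragment."""
    parts = ["<div class='ocr_page' title='%s'>" % bbox_title(page_box)]
    for line_box, words in lines:
        parts.append(" <span class='ocr_line' title='%s'>" % bbox_title(line_box))
        for text, word_box in words:
            parts.append("  <span class='ocrx_word' title='%s'>%s</span>"
                         % (bbox_title(word_box), escape(text)))
        parts.append(" </span>")
    parts.append("</div>")
    return "\n".join(parts)
```

Because the result is ordinary HTML, it displays in any browser and can be processed with existing HTML tooling, which is the design goal stated on the previous slide.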
summary
error rates
► components
● RAST layout
● Tesseract char. recog.
► goal
● beat this with MLP in beta release
summary
► OCR is much more than “character recognition”
► current status
● excellent layout analysis support
● mature character recognition for alphabetic scripts
● powerful, high performance language modeling code
● standards-based
possible OCR tasks
► web and REST-based interfaces for...
● OCR and layout analysis
● quality control and correction
● training for new scripts
● language modeling based on corpora
● batch processing and workflow
► support for
● historical documents
● additional scripts / languages
● additional logical structure analysis
● on-screen reading
applications
► camera-based scanning
► mobile document capture
► scan quality control
► OCR and OCR correction
► adaptive layout analysis
► information extraction
► image-based edition analysis
status
► we have
● distinct technologies
● working prototypes / demonstrators
► funding needed for
● developing user interfaces
● packaging
● documentation
● training and performance
● new functionality based on user requirements