2008 05 OCR Services Presentation


    2008 05 OCR Services Presentation - Presentation Transcript

    1. Scanning, OCR, Document Analysis Services Thomas M. Breuel IUPR DFKI & U. Kaiserslautern
    2. IUPR R&D Efforts ● book capture software ● scan quality control ● trainable OCR ● layout analysis ● book-level logical structure ● library-level logical structure ● image-based document retrieval ● image-based document comparisons ● image search ● edition analysis ● statistical text analysis and alignment ● information & reference extraction
    3. OCRopus
    4. background and funding ► BMBF IPET project ► Google Funding ► PAREN project
    5. OCR system ► preprocessing ● thresholding, noise removal, deskewing, ... ► layout analysis ● 2D analysis of page components ● division into text lines ► text line recognition ● transformation of text lines into character hypotheses ● results in “recognition lattice” ► language modeling ● incorporate syntactic and semantic knowledge
    6. general architecture ► layout analysis ► text line recognition ► statistical language modeling ► TEXT
    7. OCR ► OCR is more than “character recognition” ► isolated character recognition is nearly useless by itself ► the most catastrophic errors are layout analysis errors
    8. OCR questions ► there is no “universal” OCR system yet ► need to do engineering ● What kinds of scripts / languages are being targeted? ● How were the documents captured? ● What kinds of layouts are expected? ● What kind of training data is available? ● How is the data to be used? ● Indexing, search, publishing? ● What are the error rate requirements? ● What are the throughput requirements?
    9. text recognition ► Questions to ask before choosing a text line recognizer ● image resolution? ● size of character set? ● diacritics? ● connected or isolated writing style? ● ligatures? number of ligatures? ● redundancy?
    10. tradeoffs ► HMM (e.g., screen OCR for English) ● good at low resolutions, easy to train/apply ● worse than alternatives at high resolutions ► MLP (e.g., flatbed scans for Hebrew) ● good statistical estimates for small character sets ● too slow and hard to train for large character sets ► shape-based matchers (rare character handling) ● highly accurate, easy to train ● too slow as primary classifier
    11. OCRopus = toolbox ► goal: omni-script, omni-language ● no recognition/layout algorithm works for everything ● need to combine many implementations ► approach ● no coupling between components ● small, controlled set of data types in interfaces ► eventually ● put the right components for the task together automatically
    12. IUPR strengths ► quality related ● geometric algorithms for layout analysis ● statistical layout analysis ● statistical natural language processing ● adaptive recognition ► systems related ● standards-based representations ● simple intermediate formats, minimal coupling ● proven coding conventions
    13. preprocessing
    14. preprocessing ► challenges ● highly variable inputs, high throughput ► achievements ● novel, fast, accurate algorithms for thresholding, deskewing improve throughput, recognition accuracy ● novel boundary noise removal algorithm greatly improves recognition accuracy ► future work ● text extraction from complex backgrounds, highly degraded documents
    15. preprocessing ► released components ● general image processing library ● fast bitblit implementations ● run-length morphology methods ● multiple binarization methods ● page frame detection ► in development ● adaptive binarization ● color document processing ● script / language id
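The "multiple binarization methods" item can be illustrated with one textbook method. The sketch below is Otsu's global threshold in plain Python; it is an assumption chosen for illustration, not necessarily one of the methods shipped with OCRopus, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels <= t form one class (ink on a dark-on-light scan),
    pixels > t form the other (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below t
    sum0 = 0    # sum of gray values at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1/total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a grayscale image (list of rows) to 1 = ink, 0 = background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    image = [[20, 30, 200], [210, 25, 220]]
    t = otsu_threshold([p for row in image for p in row])
    print(t, binarize(image, t))
```

A global threshold like this tends to fail on unevenly lit pages, which is the case the "adaptive binarization" item is aimed at.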
    16. fast morphology / cleanup
    17. fast adaptive thresholding
    18. page frame detection
    19. bad scan detection
    20. text / image segmentation
    21. layout analysis
    22. layout analysis ► challenges ● complex, highly variable physical and logical layout structures designed for humans ► achievements ● developed demonstrably high-accuracy generic layout analysis ● eliminated the need for “rule-based layout analysis” through trainable machine-learning layout analysis system ● developed high-throughput interactive segmenter ► future work ● integration, application to large scale document collections
    23. layout analysis ► released components ● xy-cuts ● Voronoi segmentation ● generic geometric page segmenter ► in development ● improved text/image segmentation ● trainable statistical layout analysis ● statistical non-Manhattan layout analysis ● layout analysis on distorted/curved pages ● integration of layout analysis results
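As an illustration of the "xy-cuts" component, here is a minimal sketch of a recursive XY-cut over a binary image (1 = ink). It assumes the common formulation: trim each region to its ink bounding box, then split at the widest empty row or column gap. The function names and the `min_gap` parameter are hypothetical and are not taken from the OCRopus code.

```python
def _widest_gap(profile, min_gap):
    """Return (start, end) of the widest interior run of zeros, or None."""
    best, start = None, None
    for i, v in enumerate(profile):
        if v == 0:
            if start is None:
                start = i
        elif start is not None:
            width = i - start
            if width >= min_gap and (best is None or width > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def _split(img, top, left, bottom, right, min_gap, out):
    # Shrink the region to the bounding box of its ink pixels.
    rows = [r for r in range(top, bottom) if any(img[r][left:right])]
    if not rows:
        return
    cols = [c for c in range(left, right) if any(img[r][c] for r in rows)]
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1

    # Projection profiles: ink count per row and per column.
    row_profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    col_profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    row_gap = _widest_gap(row_profile, min_gap)
    col_gap = _widest_gap(col_profile, min_gap)
    if row_gap is None and col_gap is None:
        out.append((top, left, bottom, right))  # leaf block
        return

    row_width = row_gap[1] - row_gap[0] if row_gap else 0
    col_width = col_gap[1] - col_gap[0] if col_gap else 0
    if row_width >= col_width:
        cut = top + row_gap[0]          # horizontal cut
        _split(img, top, left, cut, right, min_gap, out)
        _split(img, cut, left, bottom, right, min_gap, out)
    else:
        cut = left + col_gap[0]         # vertical cut
        _split(img, top, left, bottom, cut, min_gap, out)
        _split(img, top, cut, bottom, right, min_gap, out)


def xy_cut(img, min_gap=2):
    """Return leaf blocks as (top, left, bottom, right), end-exclusive."""
    blocks = []
    _split(img, 0, 0, len(img), len(img[0]), min_gap, blocks)
    return blocks
```

Because every cut must span the whole region, this approach only handles layouts that decompose into rectangles, which is why the list also names Voronoi and non-Manhattan methods.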
    24. high accuracy text line finding ► maximizes a quality function of the form $Q = \sum_i \max(\ldots)$, summed over points $p_i$ measured against a candidate line $l$ ► lots of existing methods ► this algorithm ● doesn't assume text lines are parallel ● enables correcting for perspective distortion ● finds well-defined, globally optimal solutions ● no search parameters to tune; only pick epsilon
    25. layout analysis
    26. text line recognition
    27. text line recognition ► challenges ● wide variety of writing systems, scan qualities ► released components ● template-based system based on Tesseract ● neural net based oversegmenting recognizer ► in development ● HMM recognizer ● shape-based recognizer ● word-based recognizer
    28. built-in segmenters ► connected components ► skeletal segmenters ► curved cut segmenters ► upper contour segmenters
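The simplest of the segmenters listed above, connected components, can be sketched as follows. This is a generic breadth-first labeling over a binary text-line image (1 = ink) with 8-connectivity; the function name and the choice to return bounding boxes sorted left to right are assumptions for illustration.

```python
from collections import deque


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right), inclusive,
    of 8-connected ink regions, sorted by left edge."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            # Flood-fill one component, tracking its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda b: b[1])
```

Touching characters end up in one component and broken characters in several, which is the motivation for the cut-based segmenters in the same list.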
    29. line recognizers ► Tesseract ● alphabetic printed, well-segmented ► oversegmenting MLP ● alphabetic printed, handwritten ► large character set MLP ● Urdu, Indic, CJK printed ► HMM ● low resolution, poorly segmented inputs ► shape-based ● large character set, small #training examples ► word recognizer
    30. adaptive recognition
    31. language modeling
    32. statistical language models ► language models as weighted finite state transducers ► fully probabilistic foundation ► modular language models allow rapid retargeting ► diagram labels: Dictionary, Semantic Constraints, Grammar, Dictionary, Hypothesis Graph, Result
    33. statistical language modeling ► primary task Given a string S, determine the probability of occurrence of that string in text. ► not quite linguistics ● ungrammatical strings occur ● many grammatical strings are highly improbable
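The primary task stated above, assigning a probability to a string S, can be sketched with a character bigram model. This is a deliberately small stand-in with add-alpha smoothing, not the transducer-based models the slides describe; the class name and the `^` / `$` boundary markers (assumed not to occur in the text) are hypothetical.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram model with add-alpha smoothing."""

    def __init__(self, corpus, alpha=1.0):
        self.alpha = alpha
        self.bigrams = Counter()   # counts of (previous char, char)
        self.contexts = Counter()  # counts of previous char
        self.vocab = {"$"}         # symbols that can be predicted
        for line in corpus:
            padded = "^" + line + "$"
            self.vocab.update(line)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[a, b] += 1
                self.contexts[a] += 1

    def log_prob(self, s):
        """Natural-log probability of s, including the end-of-string event.

        Characters never seen in training still receive smoothed mass,
        so the distribution is only approximately normalized for them.
        """
        padded = "^" + s + "$"
        v = len(self.vocab)
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            num = self.bigrams[a, b] + self.alpha
            den = self.contexts[a] + self.alpha * v
            total += math.log(num / den)
        return total


if __name__ == "__main__":
    model = CharBigramModel(["the cat", "the hat", "that"])
    print(model.log_prob("the"), model.log_prob("hte"))
```

The model scores "the" above "hte" purely from counts, with no notion of grammar, which matches the "not quite linguistics" point.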
    34. other tasks ► weighted finite state transducers for... ● grapheme sequence / unicode ● transliteration (including ambiguous) ● alignment ● orthographic variation ● morphology ● word spotting ● citation extraction ● dictionary entry parsing ● reading order determination ● language model adaptation ● character set conversions
    35. probabilistic finite state transducers ► “little translation machines” ● input: one or more strings + probabilities ● output: one or more strings + probabilities ► related to HMMs ► high level operations ● minimize, reverse, complement, union, intersect, compose, ... ● bestpath, bestpath-2, bestpath-n ► these can be combined dynamically
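The "bestpath" operation can be sketched on a small acyclic hypothesis graph. The sketch assumes edges of the form (source, destination, label, probability) with nodes numbered so that every edge goes from a lower to a higher number; this representation and the name `best_path` are illustrative assumptions, not the interface of any particular transducer library.

```python
import math


def best_path(edges, start, final):
    """Return (string, probability) of the most probable start-to-final path.

    edges: iterable of (src, dst, label, prob) with src < dst and prob > 0.
    Returns None if final is unreachable.
    """
    cost = {start: 0.0}   # node -> lowest negative log probability so far
    back = {}             # node -> (previous node, label on best edge)
    # Processing edges in order of source node is a valid topological
    # order because every edge points to a higher-numbered node.
    for src, dst, label, prob in sorted(edges, key=lambda e: e[0]):
        if src not in cost:
            continue
        c = cost[src] - math.log(prob)
        if dst not in cost or c < cost[dst]:
            cost[dst] = c
            back[dst] = (src, label)
    if final not in cost:
        return None
    labels = []
    node = final
    while node != start:
        node, label = back[node]
        labels.append(label)
    return "".join(reversed(labels)), math.exp(-cost[final])


if __name__ == "__main__":
    # Classic ambiguity: the glyph pair "rn" versus the single glyph "m".
    lattice = [
        (0, 1, "r", 0.3),
        (1, 2, "n", 0.9),
        (0, 2, "m", 0.5),
        (2, 3, "o", 1.0),
    ]
    print(best_path(lattice, 0, 3))
```

In a full system the lattice would first be composed with a language model, so that path probabilities reflect both the recognizer and the expected text.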
    36. grapheme sequence ► Devanagari graphemes do not occur in phonetic / Unicode order ● vowels, “r”, ... ► OCR recognition ● left-to-right grapheme sequence ● diacritics ● ligatures as units ► translate: Unicode order, expand diacritics, ligatures ► also: rendering ● rule-based ● learn frequencies for different grapheme choices
    37. transliteration and phonetic models ► Sanskrit written in... ● Devanagari ● other Indic scripts ● multiple Roman transliterations ● IPA ► we need to transliterate between them ► build models ● start with rule-based models ● ambiguities: learn frequencies and context to resolve
    38. orthographic variation ► historical documents show variations in... ● accepted spelling ● spelling errors ● pronunciation ● transliteration ► build model mapping variants to standard spelling ► applications ● transform text / OCR result to standard orthography ● fuzzy search matching variant spellings ● use standard text as language model / ground truth ● use variant text as language model / ground truth
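Mapping variants to a standard spelling can be approximated, for illustration, by nearest-neighbor lookup under plain edit distance. The slides describe a learned transducer model; the unweighted Levenshtein distance below is a simpler stand-in, and `normalize`, `lexicon`, and `max_dist` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def normalize(word, lexicon, max_dist=2):
    """Return the closest standard spelling, or the word itself if
    nothing in the lexicon is within max_dist edits."""
    best = min(lexicon, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_dist else word


if __name__ == "__main__":
    lexicon = ["public", "music", "old"]
    for variant in ["publick", "musick", "olde"]:
        print(variant, "->", normalize(variant, lexicon))
```

A weighted model would let frequent historical substitutions (a trailing "k" or "e", for example) cost less than arbitrary edits, which unit costs cannot express.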
    39. alignment ► text / image alignment ● use transcription as the “language model” ● perform recognition with that language model ● OCR output contains word aligned bounding boxes ► routinely used for training ► other uses ● ground truth doesn't need to be perfect ● ground truth may be transliterated, orthographic variants ● line breaks etc. may be formatted differently
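To show what word-aligned bounding boxes make possible, the sketch below transfers boxes from noisy OCR words to a clean transcription. It uses sequence alignment via the standard library's `difflib.SequenceMatcher`, which is a different and much simpler technique than recognizing with the transcription as the language model; the function name and data layout are assumptions.

```python
from difflib import SequenceMatcher


def align_words(ocr_words, truth_words):
    """Pair transcription words with OCR bounding boxes.

    ocr_words:   list of (text, bbox) from the recognizer
    truth_words: list of words from the transcription
    Returns a list of (truth_word, bbox). Words are paired where the
    two sequences match, or where a run of OCR words is replaced by a
    run of the same length; other regions are skipped as ambiguous.
    """
    ocr_text = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_text, b=truth_words, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned.append((truth_words[j], ocr_words[i][1]))
    return aligned


if __name__ == "__main__":
    ocr = [
        ("Tbe", (0, 0, 30, 10)),
        ("quick", (35, 0, 80, 10)),
        ("brown", (85, 0, 130, 10)),
        ("fox", (135, 0, 160, 10)),
    ]
    truth = ["The", "quick", "brown", "fox"]
    for word, bbox in align_words(ocr, truth):
        print(word, bbox)
```

Pairs produced this way can serve as training data, since each corrected word now has an image region attached.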
    40. hOCR output format
    41. hOCR output format ► wanted an output format that... ● could represent all major languages ● could represent typography for all major scripts ● had lots of tools available for it ● could encapsulate existing OCR output formats ● was easy to incorporate into existing OCR systems ► solution ● use Unicode, HTML + CSS ● add optional, compatible markup for OCR
    42. hOCR: reuse HTML/CSS markup
    43. overview of hOCR markup elements ● logical markup ● page layout ● engine-specific ● language modeling
    44. hOCR example
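The slide's own example is not preserved in this transcript. As a stand-in, the sketch below emits a small hOCR fragment using the `ocr_page`, `ocr_line`, and `ocrx_word` classes with `bbox` properties in the `title` attribute; the helper name `hocr_page` and its input layout are hypothetical, and a real document would wrap the fragment in a full HTML page.

```python
from html import escape


def hocr_page(lines, page_bbox):
    """Render recognized words as an hOCR fragment.

    lines:     list of text lines, each a list of (word, (x0, y0, x1, y1))
    page_bbox: (x0, y0, x1, y1) of the whole page image
    """
    def bbox(b):
        return "bbox %d %d %d %d" % tuple(b)

    out = ["<div class='ocr_page' title='%s'>" % bbox(page_bbox)]
    for words in lines:
        # The line box is the union of its word boxes.
        line_box = (
            min(b[0] for _, b in words),
            min(b[1] for _, b in words),
            max(b[2] for _, b in words),
            max(b[3] for _, b in words),
        )
        out.append(" <span class='ocr_line' title='%s'>" % bbox(line_box))
        for text, b in words:
            out.append("  <span class='ocrx_word' title='%s'>%s</span>"
                       % (bbox(b), escape(text)))
        out.append(" </span>")
    out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    page = [[("R&D", (10, 20, 60, 40)), ("efforts", (70, 20, 150, 40))]]
    print(hocr_page(page, (0, 0, 200, 100)))
```

Because the result is ordinary HTML, it renders in a browser and can be processed with standard HTML tooling, which is the design goal listed on the previous slide.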
    45. summary
    46. error rates ► components ● RAST layout ● Tesseract char. recog. ► goal ● beat this with MLP in beta release
    47. summary ► OCR is much more than “character recognition” ► current status ● excellent layout analysis support ● mature character recognition for alphabetic scripts ● powerful, high performance language modeling code ● standards-based
    48. possible tasks OCR tasks ► web and REST-based interfaces for... ● OCR and layout analysis ● quality control and correction ● training for new scripts ● language modeling based on corpora ● batch processing and workflow ► support for ● historical documents ● additional scripts / languages ● additional logical structure analysis ● on-screen reading
    49. camera-based book scanning
    50. camera-based book capture ► stereo-based dewarping ● optional structured light ► low-cost digital cameras ● 12 Mpixel, approx 300 dpi grayscale ► portable hardware
    51. handheld scanning
    52. handheld capture ► monocular scanning ● digital cameras, cell phones ► model-based dewarping ● frame, affine, curvilinear
    53. scan processing and quality control
    54. information extraction
    55. image based document analysis
    56. summary
    57. applications ► camera-based scanning ► mobile document capture ► scan quality control ► OCR and OCR correction ► adaptive layout analysis ► information extraction ► image-based edition analysis
    58. status ► we have ● distinct technologies ● working prototypes / demonstrators ► funding needed for ● developing user interfaces ● packaging ● documentation ● training and performance ● new functionality based on user requirements

    Thomas Breuel, 2 years ago
