Slideshow transcript
Slide 1: The OCRopus OCR System Project Background and Progress Report March 2007 Thomas M. Breuel
Slide 2: motivation
Slide 3: Google Book Search
Slide 4: search results
Slide 5: single page view
Slide 6: problem ?
Slide 7: technologies machine learning statistics artificial intelligence GFS MapReduce Cluster networks operating systems
Slide 8: commercial OCR systems
Slide 9: state of commercial OCR ► primary use cases ● desktop scanning, some bill and mail processing ► layout analysis ● rule-based, not trainable, prone to catastrophic failures ► character recognition ● ad-hoc classifier combination, speed-oriented ► language modeling ● dictionary lookup, backtracking ► adaptation ● per-page “retraining”, dictionary augmentation
Slide 10: commercial OCR – clean input 2 Browser and Design Testing There are multiple implementations of HTML rendering engines; some common ones are Microsoft's Internet Explorer, Mozilla's Gecko, Apple's Safari, Opera's browser, and KDE's KHTML. Each of these render web pages differently due to bugs and incomplete specifi cations of web standards. Common defects are missing text, text that is unintentionally rendered overlapping, text that unintentionally overlaps graphical elements, bad font sub stitutions, bad spacing, and unreadable choices of foreground and background colors. Our approach to this problem is to render the HTML into an imagebased representa tion and then subject the imagebased representation to OCR (including layout analysis)...
Slide 11: commercial OCR – scientific publications Indeed, it follows from (3.5') in view of (4.39) that (4.40) S k=l fc=i = (1 - a)C - (1 - a) C + From (4.40), �^(u)x^=� and hence The last inequality means that TTQ 6 SF(f,N). Further, it follows from (4.38) that (4.41) P'{X# > /} = P*{/ -C�N(U) I{ZN f} = P'{I{ZN< 0} = P*{ZN > A} = (1 - a). Finally, we get from (4.38) and (4.41) that (4.42) P{X^>/}=E'IW� > A(l-a) > 1- a. The relations (4.41) and (4.42) show that the condition (4.35) holds for the strategy na, and hence TTQ is an a-((l - a)C, /, A^)-hedge. What has been obtained shows that it is possible to hedge a contingent claim with a specified probability (1 � a). Further, the initial funds can be reduced by the amount a C, though with a risk a the accepted contingent claim cannot be repaid. PROBLEMS 4.1. Prove that on a no-arbitrage (B, 5)-market we have for a standard Euro- pean option to buy (sell) that C(7V2) > C(Ni) (respectively, P(JV2) > P(M)) when 4.2. Prove that the fair price C = C(N, So, K) of a standard European option to buy, where N is the exercise time, So is the initial price of a share, and K is the exercise price, has the following properties: a) C(S0, K) is monotone in So and K; d) C(So, K) is convex in So and K; c) C(S0,XK) = A C(S0,K) for A > 0.
Slide 12: commercial OCR – unusual fonts OR, AN ACCOUNT OP THE FELLOWSHIPS, SCHOLARSHIPS, and EXHIBITIONS, at the attttonvitto of <C2><A9>Tforfc anft <E2><82><AC>amfitiU0 BY WHOM FOUNDED, J.VJ> UHKrilKK OPEff TO IfATIfES OF SNOLAND AND WALES, Ott RKiTRICTEU TO PARTICULAR PLACES AND PERSONS; ALSO, OF SUCH CoKrgrs, IJutlir $rf)ool6, Kniutotti (Grammar 5rf)ool CHABTERED COMPANIES OF THE CITY OF LONDON, CORPORATE BODIES, TRUSTEES, &c. At BArS OXlrESSJTY ADrANTAOES ATTACHED TO TBEX, OS IN THEM PATRONAGE. WITH APPROPRIATE INDEXES AND REFERENCES. ^LONDON: PRINTED FOR C. J. O. & F. RIV1NGTON, . PAl L's Cllf RCII.YAlin, AMI WATERLOO.PLACE, PALUMALL. MDCCCXXIX.
Slide 13: commercial OCR – multiple languages â¬*Mlnv- Toy aleertennof n^m^ritn. qaeva* desTdos eosto 4 mi padre d snstnerlos a tu oniosidad, qae d eseri- birlos. Se" qne cometa ana impradtncia iilirfirir«dn on femenil deseo qne te aearreara modiM dokns; pcro ew- tigo mas quiero pecar de tolerant* qne de wrcro. Pra- fanart COD el secrete la memoria de mi boen padre. mas anadirt qoilates a tu carioo: eatre 1« respeto* de- bidos a. la memoria de on padre nmerlo, j d amor
Slide 14: problems and solutions ► problems ● unpredictable, inconsistent performance ● high error rates on some document types ● closed source = can't be improved ● based on old technologies ● fails on unexpected input ► solution ● develop new generation of OCR system ● bring up to state-of-the-art: machine learning, statistical natural language processing ● advance the state of the art: statistical adaptation, 2D segmentation, adaptive lang. models
Slide 15: ocropus
Slide 16: general architecture layout analysis → isolated character recognition → statistical language modeling → TEXT
Slide 17: high level goals ► software development ● address Google's OCR needs ► research ● machine learning, image understanding ► software engineering ● advance state of the art for large machine learning s/w
Slide 18: goals ► performance ● significantly reduce average character error rate ● greatly reduce undetected catastrophic failures ● meet production throughput/memory requirements ► functionality ● any script, any language ● pluggable architecture ● testable architecture ● fully statistical foundation
Slide 19: motivation for project ► incorporate past 20 years of advances ● improve architecture: statistically justifiable information integration; pluggable components; multi-lingual / multi-script support ● improve layout analysis: statistical, trainable layout analysis ● improve character recognition: adaptivity ● improve language modeling: statistical natural language processing
Slide 20: background technologies ► 20 years of research and development ● handwriting recognition system (1994-2000): top scoring in NIST evaluation in 1994 (among 14); deployed by US Census Bureau in 1995 ● probabilistic finite state transducers (1990's-present): Bell Labs, used for speech, adopted early in handwriting; Google's OpenFST project ● interval arithmetic geometry (1990's-present): novel layout analysis, applications in handheld readers (PARC) ● IPeT project (2004-2007): publicly funded project (imaging, OCR technologies) ► not yet used by commercial OCR systems
Slide 21: architecture ► proven architecture ● speech ● handwriting ► probabilistically sound ● finds MAP solutions ► feed-forward control flow abstraction ● “backtracking” via lazy evaluation if necessary
Slide 22: layout analysis
Slide 23: character recognition “T”
Slide 24: statistical language models ► language models as weighted finite state transducers ► fully probabilistic foundation ► modular language models allow rapid retargeting ► diagram labels: Semantic Dictionary Grammar Dictionary Constraints Result Hypothesis Graph
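A minimal sketch of the idea on this slide, not OCRopus or OpenFST code: combining a recognizer's hypothesis graph with a dictionary constraint and taking the cheapest path. It assumes a deliberately simplified lattice (one set of candidate characters per position, costs as negative log probabilities) and a dictionary with one cost per word; a real weighted finite state transducer also handles segmentation alternatives, which this toy ignores. The names `best_word`, `lattice`, and `dictionary` are illustrative.

```python
import math


def best_word(lattice, dictionary):
    """Return (cost, word) minimising recognition cost plus language-model cost.

    lattice:    list of dicts, one per character position, mapping each
                candidate character to its cost (negative log probability).
    dictionary: dict mapping each allowed word to its language-model cost.
    """
    best_cost, best = math.inf, None
    for word, lm_cost in dictionary.items():
        if len(word) != len(lattice):
            continue  # this word is not a path through the lattice
        cost = lm_cost
        for candidates, char in zip(lattice, word):
            cost += candidates.get(char, math.inf)  # missing arc = impossible
        if cost < best_cost:
            best_cost, best = cost, word
    return best_cost, best


# Toy hypothesis graph: the classifier slightly prefers "1" over "l".
lattice = [
    {"c": 0.2, "e": 1.8},
    {"l": 0.4, "1": 0.3},
    {"o": 0.3, "0": 0.5},
    {"s": 0.6, "g": 0.7},
    {"e": 0.1, "c": 2.0},
]
dictionary = {"close": 1.0, "clogs": 2.5, "else": 1.2}

cost, word = best_word(lattice, dictionary)
print(word, round(cost, 2))
```

Taken alone, the classifier's cheapest characters would spell "c1ose"; the dictionary constraint removes that path, so the combined best path is a dictionary word.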
Slide 25: status
Slide 26: accomplishments ► software engineering ● initial code releases (OCRopus, hocr-tools) ● design documents, documentation ● build systems (Tesseract, OCRopus) ● integration of Tesseract and OCRopus ● data structures, interchange formats ► error rates ● great improvement over existing open source ● may be usable for some applications
Slide 27: error rates (as of 02/2007) ► components ● RAST layout ● Tesseract char. recog.
Slide 28: layout analysis performance
Slide 29: simple OCR GUI
Slide 30: software engineering
Slide 31: subversion repositories open external tesseract ocropus ocropus IUPR code.google.com tesseract ocropus you?
Slide 32: compilation dependencies
Slide 33: code and building ► coding conventions: misc.iupr.org/docs ● K&R, UNIX/Linux, GNU ● fairly strict: well-defined memory management; no STL; limited library dependencies ► building and porting ● jam build system, working on autoconf ● nascent unit tests
Slide 34: evaluation tools ► stage-wise evaluation (in progress) ● pageseg, lineseg, classifier ► system evaluation ● edit distance (ISRI-like tool) ● hOCR-based (in progress) ► ground truth generation ● didegrade ● layout generation ► evaluation ● unit test (jam build system) ● large scale evaluation
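A minimal sketch of edit-distance-based system evaluation, assuming a plain character-level Levenshtein distance between ground truth and OCR output. This is not the ISRI tool or the OCRopus evaluation code; the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # delete ref_char
                               current[j - 1] + 1,   # insert hyp_char
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the ground truth."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


truth = "incomplete specifications"
ocr = "incomplete specifi cations"   # spurious space, as in the Slide 10 sample
print(edit_distance(truth, ocr), character_error_rate(truth, ocr))
```

Normalising by the ground-truth length makes scores comparable across pages of different sizes.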
Slide 35: OCRopus OCR (pipeline diagram) ► components: ICleanup, IBinarize; ISegmentPage; ISegmentLine, ILine...OCR; ILanguageModel (language model); PageOCR ► data: input images; binarized; segmented page; list of line hypothesis lattice; nbest strings; hocr output
Slide 36: OCRopus OCR (grayscale) (pipeline diagram) ► components: ICleanup, IBinarize; ISegmentPage; ISegmentLine, ILine...OCR; ILanguageModel (language model); PageOCR ► data: input images; binarized; segmented page; list of line hypothesis lattice; nbest strings; hocr output
Slide 37: OCRopus OCR – Tech Preview Rel. (pipeline diagram) ► components: ICleanup, IBinarize (Sauvola); ISegmentPage (RAST); ISegmentLine, ILine...OCR (Tesseract); ILanguageModel (language model: aspell); PageOCR ► data: input images; binarized; segmented page; list of line hypothesis lattice; nbest strings; hocr output
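A minimal sketch of Sauvola binarization, the method named for the binarization step in this diagram. Each pixel is compared with a local threshold T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the surrounding window. The window size and the values k = 0.5, R = 128 are commonly quoted defaults and are assumptions here; the slides do not state which parameters OCRopus uses. The pure-Python loop is written for clarity, not speed.

```python
import math


def sauvola_binarize(gray, window=15, k=0.5, dynamic_range=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    Returns rows of 0 (ink) and 1 (background).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the window around (x, y), clipped at the image border.
            values = [gray[yy][xx]
                      for yy in range(max(0, y - half), min(height, y + half + 1))
                      for xx in range(max(0, x - half), min(width, x + half + 1))]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            threshold = mean * (1 + k * (math.sqrt(variance) / dynamic_range - 1))
            row.append(1 if gray[y][x] > threshold else 0)
        result.append(row)
    return result


# Toy page: light background (200) with one dark vertical stroke (30).
page = [[30 if x == 4 else 200 for x in range(9)] for _ in range(9)]
binary = sauvola_binarize(page, window=5)
for row in binary:
    print("".join("#" if pixel == 0 else "." for pixel in row))
```

Because the threshold follows the local mean and contrast, a uniformly light region stays background while the stroke is kept as ink.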
Slide 38: Page Segmentation ► simple image in/image out components ► represented as classes (rather than functions) to make dependency injection/scripting easier ► note absence of parameters ● implementations need to figure out proper “scale” / “resolution” by themselves ► output is color-coded image
Slide 39: Color Coding Page Segmentations
Slide 40: LineOCR ► convenience interface for hooking up “old” OCR systems ► input ● line image ► output ● characters, bounding boxes, costs
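A minimal sketch of what such a line-level interface could look like, assuming only what the slide states: a line image goes in, and characters with bounding boxes and costs come out. The names `RecognizedChar`, `LineRecognizer`, and `FixedWidthStub` are hypothetical and are not OCRopus identifiers.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple


@dataclass
class RecognizedChar:
    char: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in line-image pixels
    cost: float                      # lower is better, e.g. -log probability


class LineRecognizer(Protocol):
    """Anything that turns one line image into characters, boxes and costs."""

    def recognize_line(self, line_image: Sequence[Sequence[int]]) -> List[RecognizedChar]:
        ...


class FixedWidthStub:
    """Stand-in for a wrapped legacy engine: emits a fixed string,
    spreading the characters over equal-width cells of the line image."""

    def __init__(self, text: str) -> None:
        self.text = text

    def recognize_line(self, line_image):
        height = len(line_image)
        width = len(line_image[0]) if height else 0
        cell = width // max(len(self.text), 1)
        return [RecognizedChar(c, (i * cell, 0, (i + 1) * cell, height), 1.0)
                for i, c in enumerate(self.text)]


def line_text(recognizer: LineRecognizer, line_image) -> str:
    """Callers depend only on the interface, so engines can be swapped."""
    return "".join(r.char for r in recognizer.recognize_line(line_image))


blank_line = [[255] * 40 for _ in range(10)]
print(line_text(FixedWidthStub("OCR"), blank_line))
```

Keeping the interface this narrow is what lets an existing engine be hooked up behind an adapter without the rest of the pipeline knowing which engine it is.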
Slide 41: Retrainable LineOCR ► retraining for line-based OCR engines ► retraining of individual characters or entire lines
Slide 42: OCRopus Interfaces
Slide 43: Adapter Classes
Slide 44: PageOCR Dependency Injection
Slide 45: Command Line: Simple
Slide 46: hOCR output format
Slide 47: OCR output format requirements ► requirements ● represent all common...: scripts; languages; styles / formatting; typographic / linguistic phenomena ● useful for intermediate and final results ● must be able to encapsulate most/all of current formats ● standards-based ● relate text and OCR information
Slide 48: existing formats ► XDOC, Abbyy, ... ► issues ● poor coverage of non-Latin scripts ● poor coverage of non-European languages ● poorly defined (what is a “word”?) ● poorly defined meaning of layout elements ● separate text/layout
Slide 49: hOCR approach ► basic idea... ● rely on HTML / CSS3 as much as possible: e.g., writing direction, languages, scripts, fonts, ...; kashida, ruby, half-height Japanese parentheses, ... ● typesetting model: – logical markup (ocr_section, ocr_subsection, ...) – page-markup (floats and boxes): XSLT, TeX output boxes, etc. (ocr_column, ocr_image, ...) – image-related markup: geometric groupings, etc. (ocrx_block, ocrx_line, ...)
Slide 50: hOCR microformat ► hOCR properties ● correctly rendering HTML ● choose any presentation you like ● easily processed with existing tools: search engines, editors, etc. ● OCR metadata stays associated with text ● fairly compact ● lots of tools for processing it: HTML DOM
Slide 51: hOCR example
Slide 52: hOCR processing example
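A minimal sketch of processing hOCR with standard HTML tooling, here Python's built-in `html.parser`. It assumes the hOCR convention that an element of class `ocr_line` carries its bounding box in the `title` attribute as `bbox x0 y0 x1 y1`. The sample markup and the names `HocrLineParser` and `parse_bbox` are made up for illustration and are not part of hocr-tools.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have end tags


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from a title such as 'bbox 120 200 1800 260'."""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(value) for value in fields[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0   # nesting depth inside the current line; 0 = outside
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.lines.append((self._bbox, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


sample = """
<div class="ocr_page" title="bbox 0 0 2480 3508">
  <span class="ocr_line" title="bbox 120 200 1800 260">Browser and <b>Design</b> Testing</span>
  <span class="ocr_line" title="bbox 120 280 1750 340">There are multiple implementations</span>
</div>
"""

parser = HocrLineParser()
parser.feed(sample)
for bbox, text in parser.lines:
    print(bbox, text)
```

Because the OCR metadata lives in ordinary HTML attributes, the same file still renders in a browser and can be indexed as plain HTML.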
Slide 53: hOCR tools ► OCR results ● hocr-check ● hocr-text ● hocr-extract-images ● hocr-combine, hocr-split ● hocr-eval-seg, hocr-eval-text ● hocr-to-xml, xml-to-hocr ● hocr-meta, hocr-add-bib ● pageseg-to-hocr, hocr-to-pageseg ► citation info ● hbib-to-bibtex, bibtex-to-hbib ● hbib Operator
Slide 54: roadmap
Slide 55: tentative release schedule ► Alpha Release (Q3 2007) ● The Alpha release will integrate important additional components (these components already exist, but we just haven't integrated them): MLP character recognition; OpenFST-based statistical language modeling; more layout information in the hOCR output; better testing and evaluation tool ► Technology Preview Release (Q1 2007) ● We're planning on a technology preview of OCRopus in Q1 2007. It already has most of the architecture down, but will initially only include the following components: Tesseract character recognition; RAST layout analysis; aspell-based language model; initial testing and evaluation tools
Slide 56: tentative release schedule (2) ► Beta Release (Q1 2008) ● The Beta release will focus on stability, performance, refactoring, and simplification of the codebase. Possible additional functionality includes: interface for incorporating prior layout knowledge; a GUI frontend; minimal adaptation for characters, scanners, and language models ► 1.0 Release (Q3 2008) ● The focus is on bug fixes, packaging, and ports to other platforms. Additional functionality may include: additional scripts and languages; character exception processing
Slide 57: evolution ► technology preview (Q1 2007): RAST layout; Tesseract; aspell ► 1.0 (Q3 2008): probabilistic layout; adaptive MLP; PFST
Slide 58: research challenges ► layout analysis ● statistical 2D image segmentation ● MRF, Bayesian Nets, machine learning ► character classification ● adaptation to style and over time ● exceptional classes ► language modeling ● adaptive language modeling ● incorporation of context and semantics ► system engineering ● classifier selection, combination, parameter optimization ● classifier testing, evaluation, validation
Slide 59: potential contributions ► layout analysis: document analysis community ► character recognition: Google Tesseract, pattern recognition community ► statistical language modeling: Google statistical NLP, speech community ► modular structure makes plugging in contributions easy ► designed for easy contributions from the community ► envision community developments around rare scripts/languages
Slide 60: upcoming issues ► software engineering ● complete testing/evaluation framework ● autoconf ● GUI, evaluation, debugging, correction interfaces ► existing tech ● MLP-based character recognizer ● more complete hOCR output ● integrate OpenFST (replace aspell) ► ongoing ● statistical, trainable layout analysis ● adaptive classifiers
Slide 61: ocroscript ► lua binding ● secure embedded scripting ● end-user rules (layout etc.) ● standard dynamically loading interface ● widely used, tiny, JIT, very efficient ● platform-independent scripting ► example script:
    -- read the page image, then segment it with RAST layout analysis
    page = Image:new(); segmented = Image:new()
    read_image(page, "page.png")
    RASTLayout:new():segment_page(segmented, page)
    -- language model and line recognizer
    langmod = Pfst:new("english.pfst")
    linrec = Tesseract:new()
    -- recognize each text line and collect the per-line results
    lines = lines_of_segmentation(segmented)
    mods = {}
    for line in lines do
        table.insert(mods, linrec:recognize_binary(line))
    end
    -- combine the line results with the language model and print the best path
    print( langmod:intersect(pfst.concat(mods)):bestpath() )
Slide 62: related projects
Slide 63: the following pages describe other, related projects at IUPR that might be of interest to Google; most of them are financed by public research funds; for many of them, we have working demonstrators
Slide 64: dewarping ► multiple approaches ● stereo-based, model-based
Slide 65: camera-based document interaction ► camera-based document capture ► interaction through pointing gestures ► integration with... ● OCR ● search/retrieval ► applicable also to... ● mobile document capture
Slide 66: image-based HTML verification ► detect spam, browser rendering errors, accessibility problems
Slide 67: image-based HTML verification ► approach ● automatically render in several browsers ● perform text and image comparisons of rendered pages ► achievements ● developed multi-browser rendering infrastructure ● special binarization methods for web page analysis ● screen OCR for text-in-image analysis
Slide 68: accessibility proxy ► automatically add missing ALT tags to web pages ● OCR, content-based image retrieval, tagging ► applications: accessibility, mobile
Slide 69: historical document analysis ► motivation: Giordano Bruno books ► capabilities ● compare and align degraded, warped document images ● language-free – no recognition required ● highlight differences for further analysis



