Slideshow transcript
Slide 1: The OCRopus OCR System: Project Background and Progress Report, March 2007, Thomas M. Breuel
Slide 2: background ► PARC ● reliable OCR-free conversion of scanned documents for handheld readers through layout analysis ► Image-Based Personal Computing Project ● key idea: the human-readable document image, not its structural markup, carries the meaning ● provide users with tools (incl. OCR) for dealing with image-based data ► OCRopus ● open source OCR project, sponsored by Google, for digital library/book scanning applications
Slide 3: motivation
Slide 4: Google Book Search
Slide 5: search results
Slide 6: single page view
Slide 7: problem ?
Slide 8: technologies machine learning statistics artificial intelligence GFS MapReduce Cluster networks operating systems
Slide 9: commercial OCR systems
Slide 10: state of commercial OCR ► primary use cases ● desktop scanning, some bill and mail processing ► layout analysis ● rule-based, not trainable, prone to catastrophic failures ► character recognition ● ad-hoc classifier combination, speed-oriented ► language modeling ● dictionary lookup, backtracking ► adaptation ● per-page “retraining”, dictionary augmentation
Slide 11: commercial OCR – clean input 2 Browser and Design Testing There are multiple implementations of HTML rendering engines; some common ones are Microsoft's Internet Explorer, Mozilla's Gecko, Apple's Safari, Opera's browser, and KDE's KHTML. Each of these render web pages differently due to bugs and incomplete specifi cations of web standards. Common defects are missing text, text that is unintentionally rendered overlapping, text that unintentionally overlaps graphical elements, bad font sub stitutions, bad spacing, and unreadable choices of foreground and background colors. Our approach to this problem is to render the HTML into an imagebased representa tion and then subject the imagebased representation to OCR (including layout analysis)...
Slide 12: commercial OCR – scientific publications Indeed, it follows from (3.5') in view of (4.39) that (4.40) S k=l fc=i = (1 - a)C - (1 - a) C + From (4.40), �^(u)x^=� and hence The last inequality means that TTQ 6 SF(f,N). Further, it follows from (4.38) that (4.41) P'{X# > /} = P*{/ -C�N(U) I{ZN f} = P'{I{ZN< 0} = P*{ZN > A} = (1 - a). Finally, we get from (4.38) and (4.41) that (4.42) P{X^>/}=E'IW� > A(l-a) > 1- a. The relations (4.41) and (4.42) show that the condition (4.35) holds for the strategy na, and hence TTQ is an a-((l - a)C, /, A^)-hedge. What has been obtained shows that it is possible to hedge a contingent claim with a specified probability (1 � a). Further, the initial funds can be reduced by the amount a C, though with a risk a the accepted contingent claim cannot be repaid. PROBLEMS 4.1. Prove that on a no-arbitrage (B, 5)-market we have for a standard Euro- pean option to buy (sell) that C(7V2) > C(Ni) (respectively, P(JV2) > P(M)) when 4.2. Prove that the fair price C = C(N, So, K) of a standard European option to buy, where N is the exercise time, So is the initial price of a share, and K is the exercise price, has the following properties: a) C(S0, K) is monotone in So and K; d) C(So, K) is convex in So and K; c) C(S0,XK) = A C(S0,K) for A > 0.
Slide 13: commercial OCR – unusual fonts OR, AN ACCOUNT OP THE FELLOWSHIPS, SCHOLARSHIPS, and EXHIBITIONS, at the attttonvitto of ©Tforfc anft €amfitiU0 BY WHOM FOUNDED, J.VJ> UHKrilKK OPEff TO IfATIfES OF SNOLAND AND WALES, Ott RKiTRICTEU TO PARTICULAR PLACES AND PERSONS; ALSO, OF SUCH CoKrgrs, IJutlir $rf)ool6, Kniutotti (Grammar 5rf)ool CHABTERED COMPANIES OF THE CITY OF LONDON, CORPORATE BODIES, TRUSTEES, &c. At BArS OXlrESSJTY ADrANTAOES ATTACHED TO TBEX, OS IN THEM PATRONAGE. WITH APPROPRIATE INDEXES AND REFERENCES. ^LONDON: PRINTED FOR C. J. O. & F. RIV1NGTON, . PAl L's Cllf RCII.YAlin, AMI WATERLOO.PLACE, PALUMALL. MDCCCXXIX.
Slide 14: commercial OCR – multiple languages €*Mlnv- Toy aleertennof n^m^ritn. qaeva* desTdos eosto 4 mi padre d snstnerlos a tu oniosidad, qae d eseri- birlos. Se" qne cometa ana impradtncia iilirfirir«dn on femenil deseo qne te aearreara modiM dokns; pcro ew- tigo mas quiero pecar de tolerant* qne de wrcro. Pra- fanart COD el secrete la memoria de mi boen padre. mas anadirt qoilates a tu carioo: eatre 1« respeto* de- bidos a. la memoria de on padre nmerlo, j d amor
Slide 15: problems and solutions ► problems ● unpredictable, inconsistent performance ● high error rates on some document types ● closed source = can't be improved ● based on old technologies ● fails on unexpected input ► solution ● develop new generation of OCR system ● bring up to state-of-the-art ● machine learning, statistical natural language processing ● advance the state of the art ● statistical adaptation, 2D segmentation, adaptive lang. models
Slide 16: ocropus
Slide 17: general architecture layout analysis isolated character recognition statistical language modeling TEXT
Slide 18: high level goals ► software development ● address Google's OCR needs ► research ● machine learning, image understanding ► software engineering ● advance state of the art for large machine learning s/w
Slide 19: goals ► performance ● significantly reduce average character error rate ● greatly reduce undetected catastrophic failures ● meet production throughput/memory requirements ► functionality ● any script, any language ● pluggable architecture ● testable architecture ● fully statistical foundation
Slide 20: motivation for project ► incorporate past 20 years of advances ● improve architecture ● statistically justifiable information integration ● pluggable components ● multi-lingual / multi-script support ● improve layout analysis ● statistical, trainable layout analysis ● improve character recognition ● adaptivity ● improve language modeling ● statistical natural language processing
Slide 21: background technologies ► 20 years of research and development ● handwriting recognition system (1994-2000) ● top scoring in NIST evaluation in 1994 (among 14) ● deployed by US Census Bureau in 1995 ● probabilistic finite state transducers (1990's-present) ● Bell Labs, used for speech, adopted early in handwriting ● Google's OpenFST project ● interval arithmetic geometry (1990's-present) ● novel layout analysis, applications in handheld readers (PARC) ● IPeT project (2004-2007) ● publicly funded project: imaging, OCR technologies ► not yet used by commercial OCR systems
Slide 22: architecture ► proven architecture ● speech ● handwriting ► probabilistically sound ● finds MAP solutions ► feed-forward control flow abstraction ● “backtracking” via lazy evaluation if necessary
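One way to read the "backtracking via lazy evaluation" bullet is that an upstream stage exposes its hypotheses as a lazy, cost-ordered stream, and a downstream stage simply pulls the next one when it rejects the current one, so control flow stays feed-forward. The following is a minimal Python sketch of that idea only; the function names `best_first` and `first_accepted` are hypothetical and are not part of OCRopus.

```python
import heapq
from typing import Callable, Iterable, Iterator, Optional, Tuple


def best_first(candidates: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Lazily yield (text, cost) pairs, cheapest first (hypothetical helper)."""
    heap = [(cost, text) for text, cost in candidates]
    heapq.heapify(heap)
    while heap:
        cost, text = heapq.heappop(heap)
        yield text, cost


def first_accepted(
    hypotheses: Iterator[Tuple[str, float]],
    accept: Callable[[str], bool],
) -> Optional[Tuple[str, float, int]]:
    """Pull hypotheses until one is accepted.

    Returns (text, cost, number_of_hypotheses_pulled), or None if the
    stream is exhausted. Pulling the next hypothesis plays the role of
    backtracking; nothing is computed for hypotheses that are never pulled.
    """
    for pulled, (text, cost) in enumerate(hypotheses, start=1):
        if accept(text):
            return text, cost, pulled
    return None


if __name__ == "__main__":
    stream = best_first([("c0t", 0.4), ("cot", 0.6), ("cat", 0.9)])
    print(first_accepted(stream, lambda t: t in {"cat", "cot"}))
```

In this sketch the downstream stage never sees the third hypothesis, because the second one is already acceptable.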
Slide 23: Color Coding Page Segmentations
Slide 24: LineOCR ► convenience interface for hooking up “old” OCR systems ► input ● line image ► output ● characters, bounding boxes, costs
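The LineOCR contract on this slide (line image in; characters, bounding boxes, and costs out) can be sketched as a small abstract interface. This is an illustrative Python sketch, not the OCRopus API: the names `CharResult`, `LineOCR.recognize_line`, and `FixedPitchStub` are assumptions, and the stub stands in for a wrapped legacy engine without doing any recognition.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical types for illustration only.
BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in line-image pixels


@dataclass
class CharResult:
    char: str    # recognized character
    bbox: BBox   # where it was found in the line image
    cost: float  # e.g. negative log probability; lower is better


class LineOCR(ABC):
    """Line image in; characters, bounding boxes, and costs out."""

    @abstractmethod
    def recognize_line(self, line_image: Sequence[Sequence[int]]) -> List[CharResult]:
        ...


class FixedPitchStub(LineOCR):
    """Toy stand-in for a wrapped engine: one '?' per fixed-width cell."""

    def __init__(self, cell_width: int = 8) -> None:
        self.cell_width = cell_width

    def recognize_line(self, line_image: Sequence[Sequence[int]]) -> List[CharResult]:
        height = len(line_image)
        width = len(line_image[0]) if height else 0
        results = []
        for x0 in range(0, width, self.cell_width):
            x1 = min(x0 + self.cell_width, width)
            results.append(CharResult("?", (x0, 0, x1, height), 1.0))
        return results


if __name__ == "__main__":
    image = [[0] * 20 for _ in range(4)]  # 4 rows x 20 columns
    for result in FixedPitchStub().recognize_line(image):
        print(result)
```

Keeping the output to plain characters, boxes, and costs is what would let an existing engine be wrapped without exposing its internals.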
Slide 25: layout analysis
Slide 26: character recognition “T”
Slide 27: statistical language models language models as weighted finite state transducers fully probabilistic foundation Dictionary Semantic Grammar Dictionary Constraints Result Hypothesis Graph modular language models allow rapid retargeting
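The way a language model reranks a recognizer's hypothesis graph can be illustrated with a much-reduced stand-in for transducer composition. In this Python sketch, which is an assumption-laden simplification rather than the OCRopus or OpenFST implementation, the hypothesis graph is a fixed-length list of per-position character alternatives, the language model is a word list with costs, and all costs are treated as negative log probabilities so that the minimum total cost corresponds to the MAP choice. The name `best_word` is hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def best_word(
    lattice: List[Dict[str, float]],
    lexicon: Dict[str, float],
) -> Optional[Tuple[str, float]]:
    """Return the lexicon word with the lowest combined cost, or None.

    lattice[i] maps each character alternative at position i to its
    recognition cost; lexicon maps each word to its language-model cost.
    """
    best: Optional[Tuple[str, float]] = None
    for word, lm_cost in lexicon.items():
        if len(word) != len(lattice):
            continue  # this toy lattice has no insertions or deletions
        total = lm_cost
        for ch, alternatives in zip(word, lattice):
            if ch not in alternatives:
                total = math.inf  # word is not a path through the lattice
                break
            total += alternatives[ch]
        if total < math.inf and (best is None or total < best[1]):
            best = (word, total)
    return best


if __name__ == "__main__":
    # The recognizer alone prefers "c", "o", "t" (lowest character costs).
    lattice = [
        {"c": 0.1, "e": 2.0},
        {"o": 0.5, "a": 0.7},
        {"t": 0.2, "l": 1.5},
    ]
    lexicon = {"cat": 1.0, "cot": 3.0, "eat": 1.2}
    print(best_word(lattice, lexicon))
```

Here the character costs alone favor "cot", but the combined cost favors "cat". A real weighted finite state transducer setup generalizes this to arbitrary graphs and composable models, which is what makes swapping in a different dictionary or grammar cheap.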
Slide 28: status
Slide 29: accomplishments ► software engineering ● initial code releases (OCRopus, hocr-tools) ● design documents, documentation ● build systems (Tesseract, OCRopus) ● integration of Tesseract and OCRopus ● data structures, interchange formats ► error rates ● great improvement over existing open source ● may be usable for some applications
Slide 30: error rates (as of 02/2007) ► components ● RAST layout ● Tesseract char. recog.
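The slides do not say how the error rates were computed. A common convention, assumed here, is character error rate: the edit (Levenshtein) distance between the ground truth and the OCR output, divided by the length of the ground truth. A minimal Python sketch, with hypothetical function names:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r from ref
                cur[j - 1] + 1,          # insert h into ref
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the ground-truth length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # "specifi cations" is one of the errors in the clean-input sample on Slide 11.
    print(character_error_rate("specifications", "specifi cations"))
```

For the spurious space in "specifi cations" the distance is 1 over a 14-character reference, so one stray character costs about 7% on that word.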
Slide 31: layout analysis performance
Slide 32: simple OCR GUI
Slide 33: hOCR output format
Slide 34: OCR output format requirements ► requirements ● represent all common... ● scripts ● languages ● styles / formatting ● typographic / linguistic phenomena ● useful for intermediate and final results ● must be able to encapsulate most/all of current formats ● standards-based ● relate text and OCR information
Slide 35: existing formats ► XDOC, Abbyy, ... ► issues ● poor coverage of non-Latin scripts ● poor coverage of non-European languages ● poorly defined (what is a “word”?) ● poorly defined meaning of layout elements ● separate text/layout
Slide 36: hOCR approach ► basic idea... ● rely on HTML / CSS3 as much as possible ● e.g., writing direction, languages, scripts, fonts, ... ● kashida, ruby, half-height Japanese parentheses, ... ● typesetting model ● logical markup – ocr_section, ocr_subsection, ... ● page-markup (floats and boxes) – XSLT, TeX output boxes, etc. – ocr_column, ocr_image, ... ● image-related markup – geometric groupings, etc. – ocrx_block, ocrx_line, ...
Slide 37: hOCR microformat ► hOCR properties ● correctly rendering HTML ● choose any presentation you like ● easily processed with existing tools ● search engines, editors, etc. ● OCR metadata stays associated with text ● fairly compact ● lots of tools for processing it ● HTML DOM
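Because hOCR is ordinary HTML, the claim that it is easily processed with existing tools can be illustrated with the Python standard library alone. The sketch below assumes the convention from the published hOCR specification that geometry is stored in the `title` attribute as `bbox x0 y0 x1 y1`; that detail, the sample document, and the class name `HocrLineExtractor` are not taken from these slides. The CSS class to match is a parameter, so `ocrx_line` works the same way as the default `ocr_line`.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Hypothetical hOCR fragment for illustration.
SAMPLE = """<html><body>
<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 400 50">Hello world</span>
<span class="ocr_line" title="bbox 10 60 380 90">second line</span>
</div></body></html>"""


class HocrLineExtractor(HTMLParser):
    """Collect (bbox, text) for every element carrying a given CSS class."""

    VOID = {"br", "img", "hr", "meta", "input", "link"}  # tags with no end tag

    def __init__(self, css_class: str = "ocr_line") -> None:
        super().__init__()
        self.css_class = css_class
        self.lines = []    # list of (bbox tuple or None, text)
        self._depth = 0    # > 0 while inside a matching element
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self._depth:
            self._depth += 1  # nested markup inside the current line
            return
        attr = dict(attrs)
        if self.css_class in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:  # the matching element just closed
            self.lines.append((self._bbox, "".join(self._chunks).strip()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


if __name__ == "__main__":
    parser = HocrLineExtractor()
    parser.feed(SAMPLE)
    for bbox, text in parser.lines:
        print(bbox, text)
```

The same file still renders as a normal web page, which is the point of keeping OCR metadata in attributes rather than in a separate layout file.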
Slide 38: hOCR example
Slide 39: hOCR processing example
Slide 40: hOCR tools ► OCR results ● hocr-check ● hocr-text ● hocr-extract-images ● hocr-combine, hocr-split ● hocr-eval-seg, hocr-eval-text ● hocr-to-xml, xml-to-hocr ● hocr-meta, hocr-add-bib ● pageseg-to-hocr, hocr-to-pageseg ► citation info ● hbib-to-bibtex, bibtex-to-hbib ● hbib Operator
Slide 41: roadmap
Slide 42: tentative release schedule ► Technology Preview Release (Q1 2007) ● We're planning on a technology preview of OCRopus in Q1 2007. It already has most of the architecture down, but will initially only include the following components: ● Tesseract character recognition ● RAST layout analysis ● aspell-based language model ● initial testing and evaluation tools ► Alpha Release (Q3 2007) ● The Alpha release will integrate important additional components (these components already exist, but we just haven't integrated them): ● MLP character recognition ● OpenFST-based statistical language modeling ● more layout information in the hOCR output ● better testing and evaluation tools
Slide 43: tentative release schedule (2) ► Beta Release (Q1 2008) ● The Beta release will focus on stability, performance, refactoring, and simplification of the codebase. Possible additional functionality includes: ● interface for incorporating prior layout knowledge ● character exception processing ● minimal adaptation for characters, scanners, and language models ► 1.0 Release (Q3 2008) ● The focus is on bug fixes, packaging, and ports to other platforms. Additional functionality may include: ● additional scripts and languages ● a GUI frontend
Slide 44: evolution [diagram: from the technology preview (Q1 2007) to 1.0 (Q3 2008): RAST layout → probabilistic layout; Tesseract → adaptive MLP; aspell → PFST]
Slide 45: research challenges ► layout analysis ● statistical 2D image segmentation ● MRF, Bayesian Nets, machine learning ► character classification ● adaptation to style and over time ● exceptional classes ► language modeling ● adaptive language modeling ● incorporation of context and semantics ► system engineering ● classifier selection, combination, parameter optimization ● classifier testing, evaluation, validation
Slide 46: potential contributions ► modular structure makes plugging in contributions easy ► designed for easy contributions from the community ► envision community developments around rare scripts/languages [diagram: layout analysis ← document analysis community; character recognition ← Google Tesseract, pattern recognition community; statistical language modeling ← Google statistical NLP, speech community]
Slide 47: upcoming issues ► software engineering ● complete testing/evaluation framework ● autoconf ● GUI, evaluation, debugging, correction interfaces ► existing tech ● MLP-based character recognizer ● more complete hOCR output ● integrate OpenFST (replace aspell) ► ongoing ● statistical, trainable layout analysis ● adaptive classifiers
Slide 48: ocroscript ► lua binding ● secure embedded scripting ● end-user rules (layout etc.) ● standard dynamically loading interface ● widely used, tiny, JIT, very efficient ● platform-independent scripting
-- load the page image before segmenting it
page = Image:new(); segmented = Image:new()
read_image(page, "page.png")
RASTLayout:new():segment_page(segmented, page)
langmod = Pfst:new("english.pfst")
linrec = Tesseract:new()
lines = lines_of_segmentation(segmented)
-- recognize each text line, collecting one hypothesis graph per line
mods = {}
for line in lines do
  table.insert(mods, linrec:recognize_binary(line))
end
-- intersect the concatenated line hypotheses with the language model
print( langmod:intersect(pfst.concat(mods)):bestpath() )
Slide 49: related projects
Slide 50: dewarping ► multiple approaches ● stereo-based, model-based
Slide 51: camera-based document interaction ► camera-based document capture ► interaction through pointing gestures ► integration with... ● OCR ● search/retrieval ► applicable also to... ● mobile document capture
Slide 52: historical document analysis ► motivation: Giordano Bruno books ► capabilities ● compare and align degraded, warped document images ● language-free – no recognition required ● highlight differences for further analysis



