Slideshare.net (beta)

 
2007 03 Ocropus Internet Archive

From tmbdev, 3 months ago

Slideshow transcript

Slide 1: The OCRopus OCR System Project Background and Progress Report March 2007 Thomas M. Breuel

Slide 2: motivation

Slide 3: Google Book Search

Slide 4: search results

Slide 5: single page view

Slide 6: problem ?

Slide 7: technologies machine learning statistics artificial intelligence GFS MapReduce Cluster networks operating systems

Slide 8: commercial OCR systems

Slide 9: state of commercial OCR ► primary use cases ● desktop scanning, some bill and mail processing ► layout analysis ● rule-based, not trainable, prone to catastrophic failures ► character recognition ● ad-hoc classifier combination, speed-oriented ► language modeling ● dictionary lookup, backtracking ► adaptation ● per-page “retraining”, dictionary augmentation

Slide 10: commercial OCR – clean input 2 Browser and Design Testing There are multiple implementations of HTML rendering engines; some common ones are Microsoft's Internet Explorer, Mozilla's Gecko, Apple's Safari, Opera's browser, and KDE's KHTML. Each of these render web pages differently due to bugs and incomplete specifi­ cations of web standards. Common defects are missing text, text that is unintentionally rendered overlapping, text that unintentionally overlaps graphical elements, bad font sub­ stitutions, bad spacing, and unreadable choices of foreground and background colors. Our approach to this problem is to render the HTML into an image­based representa­ tion and then subject the image­based representation to OCR (including layout analysis)...

Slide 11: commercial OCR – scientific publications Indeed, it follows from (3.5') in view of (4.39) that (4.40) S k=l fc=i = (1 - a)C - (1 - a) C + From (4.40), �^(u)x^=� and hence The last inequality means that TTQ 6 SF(f,N). Further, it follows from (4.38) that (4.41) P'{X# > /} = P*{/ -C�N(U) I{ZN f} = P'{I{ZN< 0} = P*{ZN > A} = (1 - a). Finally, we get from (4.38) and (4.41) that (4.42) P{X^>/}=E'IW� > A(l-a) > 1- a. The relations (4.41) and (4.42) show that the condition (4.35) holds for the strategy na, and hence TTQ is an a-((l - a)C, /, A^)-hedge. What has been obtained shows that it is possible to hedge a contingent claim with a specified probability (1 � a). Further, the initial funds can be reduced by the amount a C, though with a risk a the accepted contingent claim cannot be repaid. PROBLEMS 4.1. Prove that on a no-arbitrage (B, 5)-market we have for a standard Euro- pean option to buy (sell) that C(7V2) > C(Ni) (respectively, P(JV2) > P(M)) when 4.2. Prove that the fair price C = C(N, So, K) of a standard European option to buy, where N is the exercise time, So is the initial price of a share, and K is the exercise price, has the following properties: a) C(S0, K) is monotone in So and K; d) C(So, K) is convex in So and K; c) C(S0,XK) = A C(S0,K) for A > 0.

Slide 12: commercial OCR – unusual fonts OR, AN ACCOUNT OP THE FELLOWSHIPS, SCHOLARSHIPS, and EXHIBITIONS, at the attttonvitto of ©Tforfc anft €amfitiU0 BY WHOM FOUNDED, J.VJ> UHKrilKK OPEff TO IfATIfES OF SNOLAND AND WALES, Ott RKiTRICTEU TO PARTICULAR PLACES AND PERSONS; ALSO, OF SUCH CoKrgrs, IJutlir $rf)ool6, Kniutotti (Grammar 5rf)ool CHABTERED COMPANIES OF THE CITY OF LONDON, CORPORATE BODIES, TRUSTEES, &c. At BArS OXlrESSJTY ADrANTAOES ATTACHED TO TBEX, OS IN THEM PATRONAGE. WITH APPROPRIATE INDEXES AND REFERENCES. ^LONDON: PRINTED FOR C. J. O. & F. RIV1NGTON, . PAl L's Cllf RCII.YAlin, AMI WATERLOO.PLACE, PALUMALL. MDCCCXXIX.

Slide 13: commercial OCR – multiple languages â¬*Mlnv- Toy aleertennof n^m^ritn. qaeva* desTdos eosto 4 mi padre d snstnerlos a tu oniosidad, qae d eseri- birlos. Se" qne cometa ana impradtncia iilirfirir«dn on femenil deseo qne te aearreara modiM dokns; pcro ew- tigo mas quiero pecar de tolerant* qne de wrcro. Pra- fanart COD el secrete la memoria de mi boen padre. mas anadirt qoilates a tu carioo: eatre 1« respeto* de- bidos a. la memoria de on padre nmerlo, j d amor

Slide 14: problems and solutions ► problems ● unpredictable, inconsistent performance ● high error rates on some document types ● closed source = can't be improved ● based on old technologies ● fails on unexpected input ► solution ● develop new generation of OCR system ● bring up to state-of-the-art – machine learning, statistical natural language processing ● advance the state of the art – statistical adaptation, 2D segmentation, adaptive lang. models

Slide 15: ocropus

Slide 16: general architecture ► layout analysis → isolated character recognition → statistical language modeling → TEXT

Slide 17: high level goals ► software development ● address Google's OCR needs ► research ● machine learning, image understanding ► software engineering ● advance state of the art for large machine learning s/w

Slide 18: goals ► performance ● significantly reduce average character error rate ● greatly reduce undetected catastrophic failures ● meet production throughput/memory requirements ► functionality ● any script, any language ● pluggable architecture ● testable architecture ● fully statistical foundation

Slide 19: motivation for project ► incorporate past 20 years of advances ● improve architecture – statistically justifiable information integration – pluggable components – multi-lingual / multi-script support ● improve layout analysis – statistical, trainable layout analysis ● improve character recognition – adaptivity ● improve language modeling – statistical natural language processing

Slide 20: background technologies ► 20 years of research and development ● handwriting recognition system (1994-2000) – top scoring in NIST evaluation in 1994 (among 14) – deployed by US Census Bureau in 1995 ● probabilistic finite state transducers (1990's-present) – Bell Labs, used for speech, adopted early in handwriting – Google's OpenFST project ● interval arithmetic geometry (1990's-present) – novel layout analysis, applications in handheld readers (PARC) ● IPeT project (2004-2007) – publicly funded project: imaging, OCR technologies ► not yet used by commercial OCR systems

Slide 21: architecture ► proven architecture ● speech ● handwriting ► probabilistically sound ● finds MAP solutions ► feed-forward control flow abstraction ● “backtracking” via lazy evaluation if necessary

Slide 22: layout analysis

Slide 23: character recognition “T”

Slide 24: statistical language models ► language models as weighted finite state transducers ► fully probabilistic foundation ► modular language models allow rapid retargeting ► [diagram labels: Semantic, Dictionary, Grammar, Dictionary Constraints, Result, Hypothesis Graph]
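
A minimal sketch of the idea behind combining a hypothesis graph with a language model, assuming a drastically simplified setting: one set of character candidates per position and a word list with costs. It is a stand-in for weighted finite state transducer intersection followed by a best-path search, not the OCRopus or OpenFST implementation. The function name `best_word`, the cost values, and the lexicon are all made up for illustration.

```python
import math


def best_word(char_hyps, lexicon):
    """Return (word, cost) minimizing recognizer cost plus language-model cost.

    char_hyps: one dict per character position, mapping candidate -> cost
               (for example a negative log probability from the recognizer).
    lexicon:   dict mapping word -> cost (for example a negative log prior).
    """
    best = (None, math.inf)
    for word, lm_cost in lexicon.items():
        if len(word) != len(char_hyps):
            continue  # this toy lattice has no insertions or deletions
        total = lm_cost
        for ch, hyps in zip(word, char_hyps):
            total += hyps.get(ch, math.inf)  # candidate never proposed
        if total < best[1]:
            best = (word, total)
    return best


# The recognizer alone prefers "1" over "l" in position 2,
# but no lexicon entry contains "1", so the combined optimum is "clear".
hyps = [
    {"c": 0.1},
    {"1": 0.4, "l": 0.9},
    {"e": 0.1},
    {"a": 0.2, "o": 1.5},
    {"r": 0.1},
]
lexicon = {"clear": 2.0, "cloer": 9.0, "chair": 2.5}
word, cost = best_word(hyps, lexicon)
print(word, round(cost, 2))
```

Adding the costs and taking the minimum corresponds to finding the most probable (MAP) reading when the costs are negative log probabilities. A real transducer-based system would search the composed graph rather than enumerate the lexicon.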

Slide 25: status

Slide 26: accomplishments ► software engineering ● initial code releases (OCRopus, hocr-tools) ● design documents, documentation ● build systems (Tesseract, OCRopus) ● integration of Tesseract and OCRopus ● data structures, interchange formats ► error rates ● great improvement over existing open source ● may be usable for some applications

Slide 27: error rates (as of 02/2007) ► components ● RAST layout ● Tesseract char. recog.

Slide 28: layout analysis performance

Slide 29: simple OCR GUI

Slide 30: software engineering

Slide 31: subversion repositories open external tesseract ocropus ocropus IUPR code.google.com tesseract ocropus you?

Slide 32: compilation dependencies

Slide 33: code and building ► coding conventions: misc.iupr.org/docs ● K&R, UNIX/Linux, GNU ● fairly strict – well-defined memory management – no STL – limited library dependencies ► building and porting ● jam build system, working on autoconf ● nascent unit tests

Slide 34: evaluation tools ► stage-wise evaluation (in progress) ● pageseg, lineseg, classifier ► system evaluation ● edit distance (ISRI-like tool) ● hOCR-based (in progress) ► ground truth generation ● didegrade ● layout generation ► evaluation ● unit test (jam build system) ● large scale evaluation
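
The edit-distance evaluation mentioned above can be illustrated with a short sketch. It computes a plain Levenshtein distance and divides by the length of the reference text to get a character error rate. The exact counting rules of the ISRI tools are not reproduced here, and the function names and the assumed ground-truth string are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r
                cur[j - 1] + 1,          # insert h
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumed ground truth for a fragment of the OCR output shown on Slide 12.
print(char_error_rate("WATERLOO-PLACE", "WATERLOO.PLACE"))
```

Normalizing by the reference length means the rate can exceed 1.0 when the OCR output contains many spurious characters.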

Slide 35: OCRopus OCR [pipeline diagram] ► components: ICleanup, IBinarize, ISegmentPage, ISegmentLine, ILine...OCR, ILanguageModel, PageOCR ► data labels: input images, binarized, segmented page, list of lines, hypothesis lattice, language model, nbest strings, hocr output

Slide 36: OCRopus OCR (grayscale) [pipeline diagram] ► components: ICleanup, IBinarize, ISegmentPage, ISegmentLine, ILine...OCR, ILanguageModel, PageOCR ► data labels: input images, binarized, segmented page, list of lines, hypothesis lattice, language model, nbest strings, hocr output

Slide 37: OCRopus OCR – Tech Preview Rel. [pipeline diagram] ► components: ICleanup, IBinarize (Sauvola), ISegmentPage (RAST), ISegmentLine, ILine...OCR (Tesseract), ILanguageModel (aspell), PageOCR ► data labels: input images, binarized, segmented page, list of lines, hypothesis lattice, language model, nbest strings, hocr output
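
Slide 37 names Sauvola for the binarization step. As a rough illustration, the sketch below applies the Sauvola threshold T = m · (1 + k · (s / R − 1)), where m and s are the mean and standard deviation of the gray values in a window around each pixel. The defaults k = 0.5 and R = 128 are the values commonly quoted for the method. The window size, the function name, and the plain list-of-lists image are assumptions, and this slow pure-Python loop is not how OCRopus implements it.

```python
import math


def sauvola_binarize(gray, window=15, k=0.5, R=128.0):
    """gray: rows of 0..255 values. Returns rows of 0 (ink) or 255 (paper)."""
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Gather the local window, clipped at the image border.
            vals = [gray[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            threshold = mean * (1.0 + k * (math.sqrt(var) / R - 1.0))
            row.append(255 if gray[y][x] > threshold else 0)
        out.append(row)
    return out


# A bright 5x5 patch with one dark pixel in the middle.
img = [[200] * 5 for _ in range(5)]
img[2][2] = 20
binary = sauvola_binarize(img, window=3)
```

In flat regions the standard deviation is near zero, so the threshold drops to about half the local mean and uniform paper stays white. Near strokes the deviation rises and the threshold moves up toward the mean.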

Slide 38: Page Segmentation ► simple image in/image out components ► represented as classes (rather than functions) to make dependency injection/scripting easier ► note absence of parameters ● implementations need to figure out proper “scale” / “resolution” by themselves ► output is color-coded image
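
The "image in, image out" design can be sketched as a small class whose only input is the page image. The toy segmenter below finds text lines by looking for rows that contain ink and writes the line number into the output color. The class name, the `segment` method, and the color encoding (R = column number, G unused, B = line number, white for paper) are hypothetical and are not taken from the OCRopus sources.

```python
class ProjectionSegmenter:
    """Toy page segmenter: binary image in, color-coded image out."""

    PAPER = (255, 255, 255)

    def segment(self, page):
        """page: rows of 0 (ink) or 255 (paper). Returns rows of RGB tuples."""
        out = []
        line_no = 0
        in_line = False
        for row in page:
            has_ink = any(v == 0 for v in row)
            if has_ink and not in_line:
                line_no += 1  # a new text line starts at this row
            in_line = has_ink
            # Hypothetical encoding: single column 1, line number in blue.
            color = (1, 0, line_no)
            out.append([color if v == 0 else self.PAPER for v in row])
        return out


page = [
    [255, 255, 255],
    [0, 0, 255],      # first text line
    [255, 255, 255],
    [255, 0, 0],      # second text line, two rows tall
    [255, 0, 255],
]
coded = ProjectionSegmenter().segment(page)
```

Because the method takes no tuning parameters, any implementation with the same `segment` signature can be swapped in from a script or a dependency-injection container without changing the caller.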

Slide 39: Color Coding Page Segmentations

Slide 40: LineOCR ► convenience interface for hooking up “old” OCR systems ► input ● line image ► output ● characters, bounding boxes, costs
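
A sketch of what such a convenience interface might look like, assuming a legacy engine that can be wrapped in a single callable. The names `CharResult`, `LegacyLineOCR`, and `fake_engine` are invented for this example; the actual OCRopus interface is a C++ class and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # x0, y0, x1, y1 in line-image pixels


@dataclass
class CharResult:
    char: str
    bbox: BBox
    cost: float  # lower is better, e.g. a negative log probability


class LegacyLineOCR:
    """Adapter exposing an existing engine through one recognize() call."""

    def __init__(self, engine: Callable[[list], List[tuple]]):
        self._engine = engine  # hypothetical callable wrapping the old system

    def recognize(self, line_image) -> List[CharResult]:
        return [CharResult(c, tuple(b), float(cost))
                for c, b, cost in self._engine(line_image)]


def fake_engine(line_image):
    # Stand-in for a real recognizer; it ignores the pixels entirely.
    return [("T", [0, 0, 8, 12], 0.05), ("o", [9, 4, 15, 12], 0.30)]


ocr = LegacyLineOCR(fake_engine)
results = ocr.recognize([[255] * 16 for _ in range(13)])
text = "".join(r.char for r in results)
```

Returning per-character costs rather than a single string is what lets a downstream language model weigh the recognizer's uncertainty.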

Slide 41: Retrainable LineOCR ► retraining for line-based OCR engines ► retraining of individual characters or entire lines

Slide 42: OCRopus Interfaces

Slide 43: Adapter Classes

Slide 44: PageOCR Dependency Injection

Slide 45: Command Line: Simple

Slide 46: hOCR output format

Slide 47: OCR output format requirements ► requirements ● represent all common... – scripts – languages – styles / formatting – typographic / linguistic phenomena ● useful for intermediate and final results ● must be able to encapsulate most/all of current formats ● standards-based ● relate text and OCR information

Slide 48: existing formats ► XDOC, Abbyy, ... ► issues ● poor coverage of non-Latin scripts ● poor coverage of non-European languages ● poorly defined (what is a “word”?) ● poorly defined meaning of layout elements ● separate text/layout

Slide 49: hOCR approach ► basic idea... ● rely on HTML / CSS3 as much as possible – e.g., writing direction, languages, scripts, fonts, ... – kashida, ruby, half-height Japanese parentheses, ... ● typesetting model – logical markup: ocr_section, ocr_subsection, ... – page markup (floats and boxes): XSLT, TeX output boxes, etc.; ocr_column, ocr_image, ... – image-related markup: geometric groupings, etc.; ocrx_block, ocrx_line, ...

Slide 50: hOCR microformat ► hOCR properties ● correctly rendering HTML ● choose any presentation you like ● easily processed with existing tools – search engines, editors, etc. ● OCR metadata stays associated with text ● fairly compact ● lots of tools for processing it – HTML DOM
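
To illustrate "easily processed with existing tools", here is a sketch that reads line text and bounding boxes from an hOCR fragment using only the Python standard library HTML parser. The class name `ocrx_line` comes from the slides. Storing the box as `bbox x0 y0 x1 y1` in the `title` attribute follows the published hOCR specification, and assuming the 2007 draft used the same convention is a guess. The sample markup and the helper names are invented.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract 'bbox x0 y0 x1 y1' from a semicolon-separated title string."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLines(HTMLParser):
    """Collect (bbox, text) for every element carrying the line class."""

    def __init__(self, line_class="ocrx_line"):
        super().__init__()
        self.line_class = line_class
        self.lines = []
        self._open = None  # [tag, depth, bbox, text parts] while inside a line

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._open is None:
            if self.line_class in (attrs.get("class") or "").split():
                bbox = parse_bbox(attrs.get("title") or "")
                self._open = [tag, 1, bbox, []]
        elif tag == self._open[0]:
            self._open[1] += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self._open[1] -= 1
            if self._open[1] == 0:
                _, _, bbox, parts = self._open
                # Join the text pieces and collapse runs of whitespace.
                self.lines.append((bbox, " ".join("".join(parts).split())))
                self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open[3].append(data)


sample = """<div class='ocr_page'>
<span class='ocrx_line' title='bbox 10 20 300 50'>The <b>OCRopus</b> system</span>
<span class='ocrx_line' title='bbox 10 60 280 90'>reads pages</span>
</div>"""
parser = HocrLines()
parser.feed(sample)
parser.close()
for bbox, text in parser.lines:
    print(bbox, text)
```

Since the OCR metadata lives in ordinary HTML attributes, the same file still renders in a browser and can be indexed by a search engine without any OCR-specific tooling.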

Slide 51: hOCR example

Slide 52: hOCR processing example

Slide 53: hOCR tools ► OCR results ● hocr-check ● hocr-text ● hocr-extract-images ● hocr-combine, hocr-split ● hocr-eval-seg, hocr-eval-text ● hocr-to-xml, xml-to-hocr ● hocr-meta, hocr-add-bib ● pageseg-to-hocr, hocr-to-pageseg ► citation info ● hbib-to-bibtex, bibtex-to-hbib ● hbib Operator

Slide 54: roadmap

Slide 55: tentative release schedule ► Alpha Release (Q3 2007) ● The Alpha release will integrate important additional components (these components already exist, but we just haven't integrated them): – MLP character recognition – OpenFST-based statistical language modeling – more layout information in the hOCR output – better testing and evaluation tool ► Technology Preview Release (Q1 2007) ● We're planning on a technology preview of OCRopus in Q1 2007. It already has most of the architecture down, but will initially only include the following components: – Tesseract character recognition – RAST layout analysis – aspell-based language model – initial testing and evaluation tools

Slide 56: tentative release schedule (2) ► Beta Release (Q1 2008) ● The Beta release will focus on stability, performance, refactoring, and simplification of the codebase. Possible additional functionality includes: – interface for incorporating prior layout knowledge – a GUI frontend – minimal adaptation for characters, scanners, and language models ► 1.0 Release (Q3 2008) ● The focus is on bug fixes, packaging, and ports to other platforms. Additional functionality may include: – additional scripts and languages – character exception processing

Slide 57: evolution, from technology preview (Q1 2007) to 1.0 (Q3 2008) ► RAST layout → probabilistic layout ► Tesseract → adaptive MLP ► aspell → PFST

Slide 58: research challenges ► layout analysis ● statistical 2D image segmentation ● MRF, Bayesian Nets, machine learning ► character classification ● adaptation to style and over time ● exceptional classes ► language modeling ● adaptive language modeling ● incorporation of context and semantics ► system engineering ● classifier selection, combination, parameter optimization ● classifier testing, evaluation, validation

Slide 59: potential contributions ► modular structure makes plugging in contributions easy ► designed for easy contributions from the community ► envision community developments around rare scripts/languages ► layout analysis: document analysis community ► character recognition: Google Tesseract, pattern recognition community ► statistical language modeling: Google statistical NLP, speech community

Slide 60: upcoming issues ► software engineering ● complete testing/evaluation framework ● autoconf ● GUI, evaluation, debugging, correction interfaces ► existing tech ● MLP-based character recognizer ● more complete hOCR output ● integrate OpenFST (replace aspell) ► ongoing ● statistical, trainable layout analysis ● adaptive classifiers

Slide 61: ocroscript ► lua binding ● secure embedded scripting ● end-user rules (layout etc.) ● standard dynamically loading interface ● widely used, tiny, JIT, very efficient ● platform-independent scripting

    -- load the page image, then run RAST layout analysis on it
    page = Image:new(); segmented = Image:new()
    read_image(page, "page.png")
    RASTLayout:new():segment_page(segmented, page)

    -- language model and line recognizer
    langmod = Pfst:new("english.pfst")
    linrec = Tesseract:new()

    -- recognize each text line and collect the per-line results
    lines = lines_of_segmentation(segmented)
    mods = {}
    for line in lines do
      table.insert(mods, linrec:recognize_binary(line))
    end

    -- combine with the language model and print the best reading
    print( langmod:intersect(pfst.concat(mods)):bestpath() )

Slide 62: related projects

Slide 63: the following pages describe other, related projects at IUPR that might be of interest to Google; most of them are financed by public research funds; the status of many of these is that we have working demonstrators

Slide 64: dewarping ► multiple approaches ● stereo-based, model-based

Slide 65: camera-based document interaction ► camera-based document capture ► interaction through pointing gestures ► integration with... ● OCR ● search/retrieval ► applicable also to... ● mobile document capture

Slide 66: image-based HTML verification ► detect spam, browser rendering errors, accessibility problems

Slide 67: image-based HTML verification ► approach ● automatically render in several browsers ● perform text and image comparisons of rendered pages ► achievements ● developed multi-browser rendering infrastructure ● special binarization methods for web page analysis ● screen OCR for text-in-image analysis

Slide 68: accessibility proxy ► automatically add missing ALT tags to web pages ● OCR, content-based image retrieval, tagging ► applications: accessibility, mobile

Slide 69: historical document analysis ► motivation: Giordano Bruno books ► capabilities ● compare and align degraded, warped document images ● language-free – no recognition required ● highlight differences for further analysis