Slideshare.net (beta)

 

2007 09 Ocropus Architecture Google

From tmbdev, 3 months ago

Slideshow transcript

Slide 1: The OCRopus OCR System Progress and Status September 2007 Thomas M. Breuel

Slide 2: Summary of Recent Progress ► Alpha Release Planned for October ► Simplification of C++ Interfaces ► Code Reorganization ► Incorporation of OpenFST Language Models ► Scripting Language ► Extensive Test Cases ► Integration of MLP Recognizer ► Debugging Visualization Interface ► Complete End-to-End Pipeline

Slide 3: Foundations ► based on segmentation-based speech recognition ● input image is segmented into candidate segments ● classifier estimates posterior probabilities ● segmentation lattice is integrated with language model
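
The recipe on Slide 3 (candidate segments, classifier scores, integration with a language model) can be illustrated with a small best-path search. This is only a sketch under stated assumptions: the names `Arc`, `best_path`, and `lm_cost` are hypothetical and not part of OCRopus, costs are negative log probabilities, and a per-character cost function stands in for the real WFST language model. It also uses the C++ standard library for self-containment, which the project's own coding conventions avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// One candidate segment between two cut points of the text line
// (hypothetical type). cost = -log P(label | segment image).
struct Arc {
    int from;
    int to;
    char label;
    double cost;
};

// Finds the lowest-cost label sequence from node 0 to node n_nodes-1.
// Every arc must run from a lower to a higher node, so processing the
// nodes in increasing order is a valid topological order.
std::string best_path(int n_nodes, const std::vector<Arc>& arcs,
                      const std::function<double(char)>& lm_cost) {
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> best(n_nodes, inf);
    std::vector<int> back(n_nodes, -1);  // index of the arc used to reach a node
    best[0] = 0.0;
    for (int node = 1; node < n_nodes; ++node) {
        for (std::size_t a = 0; a < arcs.size(); ++a) {
            const Arc& arc = arcs[a];
            if (arc.to != node) continue;
            assert(arc.from >= 0 && arc.from < arc.to);
            // Classifier cost plus language-model cost (toy per-character model).
            double c = best[arc.from] + arc.cost + lm_cost(arc.label);
            if (c < best[node]) {
                best[node] = c;
                back[node] = static_cast<int>(a);
            }
        }
    }
    // Trace back from the final node; an unreachable node yields "".
    std::string out;
    int node = n_nodes - 1;
    while (node > 0 && back[node] >= 0) {
        out.insert(out.begin(), arcs[back[node]].label);
        node = arcs[back[node]].from;
    }
    return out;
}
```

With the classifier alone, a single wide segment read as "m" may beat two narrow segments read as "r" and "n"; a language-model cost can reverse that choice, which is the reason for integrating the lattice with a language model rather than committing to one segmentation early.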

Slide 4: Software Architecture ► proven architecture ● speech ● handwriting ► probabilistically sound ● finds MAP solutions ► feed-forward control flow abstraction ● “backtracking” via lazy evaluation if necessary

Slide 5: Coding Conventions ► Stroustrup / Boost / Bell Labs style ► all code must be exception safe ► strict memory ownership rules (caller owns) ► no naked pointers, no pointer arithmetic ► no STL, no <string> ► no global initializers, no order of initialization dep. ► controlled set of types in public interfaces ► separation of interface and implementation ► jam, autoconf, automake (packaging) ► subversion, hosted on Google Code

Slide 6: Scripting ► Lua ● mature, widely used configuration and scripting language ● easy C++ binding, dynamic loading, GC ● used for configuration and as C interface ● used for dynamic loading, integration, object broker ● about 100kbytes large, statically linked ● self-contained, no dependencies ● won't see it if you don't want to ► why not Python? ● great language—would love to use ● lots of external dependencies, hard to package, big

Slide 7: Lua Example
    dofile("utest.lua")
    a = floatarray:new()
    b = floatarray:new()
    test_failure(function() rowswap(a,0,1) end) -- empty array
    while 1 do
        -- make an array that is not row sorted
        make_random(a,25,1.0)
        a:reshape(5,5)
        if not rowsorted(a) then break end
    end
    note("got unsorted array")
    copy(b,a)
    test_assert(equal(a,b))
    rowsort(a)
    test_assert(not equal(a,b))
    test_success(function() check_rowsorted(a) end)

Slide 8: Binding C++ to Lua
C++ declarations (processed by tolua++):
    #include "narray.h"
    class floatarray {
        floatarray();
        floatarray(int d0);
        void resize(int d0);
        int length() const;
        float &at(int i0);
        void put(int i0,float &v);
        ....
    };
    bool equal(floatarray &a,floatarray &b);
    void fill(floatarray &a,float value);
    int argmin(floatarray &a);
Lua usage:
    a = floatarray:new()
    a:resize(100)
    fill(a,99.0)
    b = floatarray:new()
    fill(b,99.0)
    test_assert(equal(a,b))
    a:put(10,1.0)
    test_assert(not equal(a,b))
    print(argmin(a))
    test_assert(argmin(a)==10)
    a = nil
► simple binding using tolua++, C++ classes bound to Lua
► memory ownership rules work out well in Lua
► overloading, virtual functions, inheritance, etc.
► Lua garbage collection works for C++ objects

Slide 9: Interactive Usage
    $ ./ocroscript
    OCRoscript (interactive)
    > for i=1,3 do print(i) end
    1
    2
    3
    > dinit(300,300)
    > read_image_gray(image,"tests/images/italics.png")
    > dshow(image)
    > ^D
    $ ./ocroscript test.lua
    $ ./ocroscript -e 'dinit(800,800)' test.lua
    $ ./ocroscript check-segmentation.lua page.png

Slide 10: Available Modules ► Narray, Nustring, imgio, imglib, dgraphics ● basic array manipulation, sorting, searching, ... ● normalized Unicode strings (UTF-32 codepoints) ● image processing, I/O ● debugging graphics ► OpenFST ● weighted finite state transducers—language models ► Leptonica ● image processing ► OCR ● OCR-related functionality (more later)

Slide 11: Unicode ► goal: multi-script, multi-language recognition ► data types ● char * — UTF-8 strings ● nuchar — UTF-32 codepoints ● nustring — UTF-32 strings (normalized) ● intarray — UTF-32 strings used with algorithms ► Unicode processing via transductions ● ... rather than library code
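
To make the relation between the `char *` (UTF-8) and `nuchar` (UTF-32) types on Slide 11 concrete, here is a rough sketch of UTF-8 decoding into codepoints. The function name `utf8_to_utf32` is hypothetical and is not the OCRopus API. The sketch assumes that malformed input should become U+FFFD, it does not reject overlong encodings or surrogates, and it performs no normalization, which `nustring` is described as providing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes a UTF-8 byte string into UTF-32 codepoints.
// Bytes that do not form a valid sequence are replaced by U+FFFD.
std::vector<std::uint32_t> utf8_to_utf32(const std::string& s) {
    const std::uint32_t kReplacement = 0xFFFD;
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        bool ok = i + extra < s.size();  // false if the sequence is truncated
        for (std::size_t j = 1; ok && j <= extra; ++j) {
            unsigned char c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) {
            out.push_back(cp);
            i += extra + 1;
        } else {
            out.push_back(kReplacement);
            ++i;
        }
    }
    return out;
}
```

A fixed-width representation like this is what makes it practical to treat strings as plain integer arrays in algorithms, as the `intarray` bullet suggests.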

Slide 12: Testing ► extensive test cases ● C++ test cases ● Lua test cases ► test macros ● test_assert, test_eq, test_between ● test_failure, test_success ► automated nightly testing ► TODO ● code coverage ● more tests

Slide 13: C Interface ► issues ● need C interface for FFI, DLL ● raw C++ interfaces tricky to call from C ► using scripting language as C interface ● ocropus_set_image(variable,ptr,w,h) ● ocropus_get_image(&ptr,&w,&h,variable) ● ocropus_eval(expression) ● ocropus_recognize()

Slide 14: Division of Labor ► C++ ● core libraries ● all core algorithms ● some command line drivers ► Lua ● no essential algorithms, no big scripts ● parameter settings ● many test cases ● C interface ● dynamic loading ● end-user specific image processing/cleanup

Slide 15: Processing ► standardized steps with standardized interfaces ● cleanup ● binarization ● page segmentation ● line segmentation ● character grouping ● classification ● character lattices ● weighted finite state transducers ● language modeling ● force alignment and training

Slide 16: Reducing Coupling ► limit coupling / dependencies ● e.g. page segmentation determines line parameters ● text line recognizer needs line parameters ● line parameters are not passed to text line recognizer ● text line recognizer recomputes ► why? ● no guarantee that segmenter and recognizer work together ► issues? ● performance – may relax iff profiling indicates problem ● coding effort – no problem: common code in library ► each stage gets only what is sufficient for its task

Slide 17: Cleanup and Binarization
    struct ICleanupGray : IComponent {
        virtual void cleanup(bytearray &out,bytearray &in) = 0;
    };
    struct ICleanupBinary : IComponent {
        virtual void cleanup(bytearray &out,bytearray &in) = 0;
    };
    struct IBinarize : IComponent {
        virtual void binarize(bytearray &out,bytearray &in) = 0;
        virtual void binarize(bytearray &out,floatarray &in) = 0;
    };
    IBinarize *make_BinarizeBySauvola();
    IBinarize *make_BinarizeByRange();
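
Slide 17 names a Sauvola binarizer without showing the method. As a rough illustration, the sketch below applies the standard Sauvola threshold T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of a local window. It is not the OCRopus implementation: `GrayImage`, `binarize_sauvola`, the window size, and the defaults k = 0.5 and R = 128 are assumptions made here, and the window statistics are computed naively rather than with integral images. It also uses `std::vector` for self-containment instead of the project's `bytearray`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major 8-bit grayscale image (hypothetical stand-in for a 2D bytearray).
struct GrayImage {
    int w;
    int h;
    std::vector<std::uint8_t> pix;
    std::uint8_t at(int x, int y) const { return pix[y * w + x]; }
};

// Returns a row-major mask: 255 for background, 0 for ink.
// The window is (2 * half_window + 1) pixels wide, clipped at the border.
std::vector<std::uint8_t> binarize_sauvola(const GrayImage& in, int half_window = 7,
                                           double k = 0.5, double R = 128.0) {
    assert(in.w > 0 && in.h > 0);
    assert(in.pix.size() == static_cast<std::size_t>(in.w) * in.h);
    std::vector<std::uint8_t> out(in.pix.size());
    for (int y = 0; y < in.h; ++y) {
        for (int x = 0; x < in.w; ++x) {
            int x0 = std::max(0, x - half_window);
            int x1 = std::min(in.w - 1, x + half_window);
            int y0 = std::max(0, y - half_window);
            int y1 = std::min(in.h - 1, y + half_window);
            double sum = 0.0, sum2 = 0.0;
            int n = 0;
            for (int yy = y0; yy <= y1; ++yy) {
                for (int xx = x0; xx <= x1; ++xx) {
                    double v = in.at(xx, yy);
                    sum += v;
                    sum2 += v * v;
                    ++n;
                }
            }
            double mean = sum / n;
            // Guard against tiny negative values from rounding.
            double var = std::max(0.0, sum2 / n - mean * mean);
            double threshold = mean * (1.0 + k * (std::sqrt(var) / R - 1.0));
            out[y * in.w + x] = in.at(x, y) > threshold ? 255 : 0;
        }
    }
    return out;
}
```

Because the threshold follows the local mean and contrast, a method of this kind tolerates uneven illumination better than a single global threshold.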

Slide 18: Page Segmentation
    struct ISegmentPage : IComponent {
        virtual void segment(intarray &out,bytearray &in) = 0;
    };
    ISegmentPage *make_SegmentPageBySmear();
    ISegmentPage *make_SegmentPageBy1CP();
    ISegmentPage *make_SegmentPageByRAST();
► pixel accurate segmentation in RGB format
  ● R value: column number
  ● G value: paragraph number
  ● B value: line number
  ● lexicographic in reading order
  ● some special values for other page elements
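
The RGB encoding on Slide 18 can be sketched as bit packing into one integer per pixel. The names `RegionId`, `pack_region`, and `unpack_region` are hypothetical, and the layout 0xRRGGBB (R in the most significant of the three bytes) is an assumption; the slide only states which channel holds which number. The special values for other page elements are not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Position of a text line on the page (hypothetical type).
struct RegionId {
    int column;
    int paragraph;
    int line;
};

// Packs column, paragraph, and line into one 24-bit value, assumed 0xRRGGBB.
inline std::uint32_t pack_region(const RegionId& r) {
    assert(r.column >= 0 && r.column < 256);
    assert(r.paragraph >= 0 && r.paragraph < 256);
    assert(r.line >= 0 && r.line < 256);
    return (static_cast<std::uint32_t>(r.column) << 16) |
           (static_cast<std::uint32_t>(r.paragraph) << 8) |
           static_cast<std::uint32_t>(r.line);
}

// Recovers the three numbers from a packed pixel value.
inline RegionId unpack_region(std::uint32_t v) {
    RegionId r;
    r.column = static_cast<int>((v >> 16) & 0xFF);
    r.paragraph = static_cast<int>((v >> 8) & 0xFF);
    r.line = static_cast<int>(v & 0xFF);
    return r;
}
```

Under this layout, comparing packed values as integers orders regions by column, then paragraph, then line, which matches the "lexicographic in reading order" bullet.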

Slide 19: Page Segmentation (RAST Algorithm)

Slide 20: Representation of Page Segmentations

Slide 21: Page Segmentation
► pixel accurate segmentations...
  ● aren't they hard to deal with?
  ● region iterators (Lua and C++, like everything)
    segmenter = make_RastPageSegmenter()
    seg = intarray:new()
    segmenter:segment(seg,image)
    regions:setPageColumns(seg)
    ncols = regions:length()-1
    assert(ncols<8)
    for i = 1,ncols do
        b = regions:bbox(i)
        ...
    end

Slide 22: Line Segmentation
    struct ISegmentLine : IComponent {
        virtual void charseg(intarray &out,bytearray &in) = 0;
    };
► pixel accurate character parts
  ● lower 12 bits: character part number
  ● optional upper 12 bits: line number
► also used for pixel-accurate ground truth

Slide 23: Representation of Line Segmentations [Figure panels: output of segmenter (oversegmentation); manually generated ground truth]

Slide 24: Evaluating Segmentations
    cut = make_CurvedCutSegmenter()
    cut_seg = intarray:new()
    cut:charseg(cut_seg,image)
    dshowr(cut_seg,"yYY")
    test_success(function() check_line_segmentation(cut_seg) end)
    over,under,mis = evaluate_segmentation(0,0,0,reference_seg,cut_seg,0)
    test_assert(over<10)
    test_assert(under==0)

Slide 25: Character Grouping

Slide 26: Line Recognition
► core interface to character recognition
  ● IRecognizeLine::recognizeLine
  ● given oversegmentation and input image...
    ● generate a hypothesis lattice with classifications
    ● generate a map of input segments to character hypotheses
  ● map says: “which regions in the segmentation make up character 177 in the hypothesis lattice?”
  ● not needed for recognition, but useful for training
    linerec = make_NewBpnetLineOCR()
    linerec:recognizeLine(lattice,map,segmentation,image)

Slide 27: Character Grouping Code
    grouper = make_StandardGrouper()
    grouper:setSegmentation(seg)
    for i=0,grouper:length()-1 do
        grouper:extract(char,mask,image,i)
        classifier:classify(char,mask)
        for cls,score in classifier:results() do
            grouper:setClass(i,cls,score)
        end
    end
    character_lattice = make_FstBuilder()
    grouper:outputLattice(character_lattice)

Slide 28: Weighted Finite State Transducers ► unified representation and algorithms for... ● hypothesis lattice ● n-gram language model with back-off ● Unicode categorization and translation ● character-to-word, word-to-character translations ● modeling of misspellings (dangambis), transp. ● information extraction ► high-level manipulation and evaluation ● e.g., pre-compose language models or on-the-fly, lazy evaluation, infinite language models ● compile regex, linguistics into WFST ► Google OpenFST implementation

Slide 29: Character Hypothesis Lattice

Slide 30: Bigram Model over 3 Letter Alphabet ► back-off is easy...
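
As a hedged illustration of the back-off idea on Slide 30, the sketch below scores a character bigram and falls back to a unigram when the bigram was not observed. The type `BackoffBigram` and its members are hypothetical; the slide's model is a weighted finite state transducer, whereas this sketch uses plain lookup tables only to show the arithmetic. All values are costs (negative log probabilities), so they add.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <utility>

// Toy character bigram model with back-off (hypothetical type).
struct BackoffBigram {
    std::map<std::pair<char, char>, double> bigram;  // -log P(cur | prev)
    std::map<char, double> unigram;                  // -log P(cur)
    std::map<char, double> backoff;                  // back-off cost for context prev

    // Cost of seeing cur after prev. Observed bigrams use their own cost;
    // otherwise pay the back-off cost of prev plus the unigram cost of cur.
    double cost(char prev, char cur) const {
        auto bi = bigram.find(std::make_pair(prev, cur));
        if (bi != bigram.end()) return bi->second;
        auto uni = unigram.find(cur);
        if (uni == unigram.end()) {
            // Character outside the alphabet: treated as impossible here.
            return std::numeric_limits<double>::infinity();
        }
        auto bo = backoff.find(prev);
        double penalty = (bo != backoff.end()) ? bo->second : 0.0;
        return penalty + uni->second;
    }
};
```

In a transducer, the same fallback is typically drawn as an extra arc from each context state to a shared unigram state, so unseen bigrams do not each need their own arc.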

Slide 31: Modular Language Models ► language models as weighted finite state transducers ► fully probabilistic foundation ► modular language models allow rapid retargeting [Figure labels: Semantic Dictionary Grammar Dictionary Constraints Result Hypothesis Graph]

Slide 32: Language Modeling
► simple specialized modeling classes
    d = make_DictionaryModel()
    d:addWordTranscription("hello","hallo",1.0)
    d:addWordTranscription("world","welt",1.0)
    d:addWordTranscription("-","_",2.0)
    translator = d:take()
    openfst.ClosureStar(translator)
    input = as_fst("hello-world")
    openfst.Compose(input,translator,result)
    print(bestpath(result))
    ids = intarray:new()
    costs = floatarray:new()
    bestpath(str,costs,ids,result)

Slide 33: Language Modeling
    d = make_NgramModel()
    d:addNgram("1a",0.1)
    d:addNgram("1b",0.2)
    d:addNgram("ab",1)
    d:addNgram("ba",2)
    d:addNgram("aa",3)
    d:addNgram("bb",4)
    d:addNgram("a1",0.3)
    d:addNgram("b1",0.4)
    fst = d:take()
    openfst.RmEpsilon(fst)
    ngram = openfst.StdVectorFst:new()
    openfst.Determinize(fst,ngram)
    dfstdraw(ngram)

Slide 34: Coming Soon ► information extraction (porting) ► standard modular model ► char/word-level integration ► regex compiler ► language model evaluation ► multi-language language models ► multi-script language models ► language model adaptation ► topic segmentation ► grammar and tree automata compiler

Slide 35: Forced Alignment
► character lattice is a transducer from character hypothesis identifiers to characters
► for semi-supervised learning, use language model instead of ground truth
► for misaligned training data, compose ground truth with block move and error model
    linerec:recognizeLine(lattice,map,seg,image)
    truth = as_fst("hello world")
    openfst.Compose(lattice,truth,aligned)
    bestpath(text,costs,ids,aligned)
    ocr_result_to_charseg(cseg,map,ids,seg)

Slide 36: Page Recognition using Lines
    pages = Pages:new(arg)
    for p = 0,pages:length()-1 do
        pageseg:segment(seg,pages:getBinary(p))
        regions:setPageLines(seg)
        page_lattice = openfst.Fst:new()
        for l = 1,regions:length()-1 do
            regions:extract(line_image,pages:getGray(p),l)
            lineseg:charseg(lseg,
            linerec:recognizeLine(line_lattice,idmap,lseg,limage)
            page_lattice = openfst.Union(page_lattice,line_lattice)
        end
        openfst.Compose(page_lattice,language_model,result)
        print(bestpath(result))
    end

Slide 37: Line Recognition with Oversegmentation
    function linerec(segmentation,image)
        grouper:setSegmentation(segmentation)
        for c = 0,grouper:length()-1 do
            grouper:extract(cimage,cmask,image,c)
            classifier:classify(cimage,cmask)
            for cls,score in classifier:results() do
                grouper:setClass(c,cls,score)
            end
        end
        grouper:outputLattice(line_lattice,map)
        return line_lattice,map
    end

Slide 38: Page Recognition using Blocks
    pages = Pages:new(arg)
    for p = 0,pages:length()-1 do
        pageseg:segment(seg,pages:getBinary(p))
        page_image = pages:getGray(p)
        regions:setPageColumns(seg)
        for b = 1,regions:length()-1 do
            regions:extract(block,page_image,b)
            text = tesseract.recognize_block(block)
            fst.Compose(as_fst(text),ocr_correction,result)
            print(bestpath(result))
        end
    end

Slide 39: Common OCR Components ► binarizer = make_SauvolaBinarizer() ► pageseg = make_SegmentPageByRast() ► lineseg = make_CurvedCutSegmenter() ► linerec = make_TesseractRecognizeLine() ► linerec = make_NewBpnetLineOCR() ► grouper = make_StandardGrouper() ► lattice = make_TrivialCharLattice() ► lattice = make_FstBuilder()

Slide 40: Common Datatypes ► binarized image – 2D bytearray ► gray scale image – 2D bytearray ► page segmentation – 2D intarray (24bit RGB) ► line segmentation – 2D intarray (24bit RGB) ► ground truth – 2D intarray (24bit RGB) ► character lattice – OpenFST (ICharLattice) ► language model – OpenFST

Slide 41: GUI ► debugging ● dinit(w,h) ● dshow(image,where) ● dshowr(image,where) ● dwait() ► end user GUI ● dynamically load wxLua or ● embed ocroscript and use C API

Slide 42: Alpha Release ► test cases (fix RAST, bpnet; add more) ► remove deprecated features ► enhancements to Lua/tolua (nil, pointers) ► clean up namespaces & visibility ► add ● OCR evaluation, text generation, document degradation ● hOCR output for ocroscript ► maybe ● low memory page segmenters ● XML/DOM for ocroscript ● new Tesseract API

Slide 43: Possible ► existing code ● additional line segmenters ● non-STL WFST decoder ● run length binary morphology (much more efficient!) ● shape-based classifier ● HMM-based classifier ● text/image segmentation ► research code ● character shape adaptation ● ... ● (in separate repository)

Slide 44: Discussion End of Presentation