Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

Presentation of the proceedings article "Hybrid Page Layout Analysis via Tab-Stop Detection" by Ray Smith to the Page Segmentation Competition held at ICDAR 2009.


Transcript

  • 1. Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection", Ray Smith, Proc. ICDAR 2009, Barcelona, Spain, 2009. Javier de la Rosa {jdelaros at uwo dotca}, CS 9883.
  • 2. Internal use only | 2 of 25 | Javier de la Rosa | Hybrid Page Layout Analysis via Tab-Stop Detection | CS 9883
      Index.
      1. Context and background.
      2. Introduction.
      3. Page layout via tab-stop detection.
      4. Preprocessing.
      5. Finding tab positions as line segments.
      6. Finding the column layout.
      7. Finding the regions.
      8. Testing and results.
      9. Conclusion and further work.
      10. Criticism.
      11. References.
  • 3. 1. Context and background.
      • International Conference on Document Analysis and Recognition [1].
      • Page Segmentation competitions: 2001, 2003, 2005, 2007 and 2009 [2].
      • Tesseract, the OCR from Google [3].
      [1] Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011) <http://www.icdar2011.org/>
      [2] A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona, Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/>
      [3] The Tesseract OCR <http://code.google.com/p/tesseract-ocr/>
  • 4. 2. Introduction.
      Physical page layout analysis:
      • Bottom-up [4].
      • Top-down [5].
      • Whitespaces [6].
      Logical page layout analysis:
      • Voronoi.
      • Smearing.
      • Etc.
      [4] M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm," SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408.
      [5] G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents," Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984, pp347-349.
      [6] T.M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on Document Analysis Systems V, Springer-Verlag 2002, pp188-199.
  • 5. 3. Tab-stop detection.
      • Regions bounded by tab-stops.
      • Fixed x-positions.
      • Vertical alignment.
      Phases:
      1. Preprocessing.
      2. Bottom-up tab-stop detection.
      3. Finding the column layout.
      4. Set of typed regions.
  • 6. 4. Preprocessing.
      • Detection of vertical lines and image mask [7].
      • Connected components (CCs) analysis.
      • CCs filtering by width, w, and height, h:
        – Small: h < 7 (@300ppi) or h < h75 / 2
        – Large: h > 2h75 or w > 8h75
        – Medium: the remainder.
      [7] Leptonica image processing and analysis library <http://www.leptonica.com>
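The size filter on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not define h75, so the sketch assumes it is the 75th percentile of the CC heights on the page, and the names `height_percentile_75` and `classify_cc` are hypothetical.

```python
import math


def height_percentile_75(heights):
    """Nearest-rank 75th percentile of CC heights (an assumed reading of h75)."""
    if not heights:
        raise ValueError("at least one connected component is required")
    ordered = sorted(heights)
    rank = math.ceil(0.75 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def classify_cc(width, height, h75, min_height=7):
    """Apply the small / large / medium thresholds listed on the slide.

    min_height=7 is the slide's value for 300 ppi scans.
    """
    if height < min_height or height < h75 / 2:
        return "small"
    if height > 2 * h75 or width > 8 * h75:
        return "large"
    return "medium"


heights = [4, 20, 22, 24, 26, 28, 30, 90]
h75 = height_percentile_75(heights)
labels = [classify_cc(10, h, h75) for h in heights]
```

With these sample heights the tiny component is labelled small, the very tall one large, and the rest medium; a very wide component would be labelled large regardless of its height.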
  • 7. 5. Finding tab positions as line segments. (1/3)
      • Candidate tab-stop components:
        – A CC is a tab-stop by default.
        – Look for aligned neighbours.
        – Mark each CC as left tab, right tab or neither.
      • Grouping candidate tabs:
        – In lines and, if there are many, in groups.
        – Least median of squares to fit the lines (left or right).
        – Refit lines to the page-mean direction.
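As an illustration of the least-median-of-squares fit mentioned on this slide, here is a minimal Python sketch. It is not the Tesseract implementation: it assumes each candidate tab CC is reduced to one (x, y) point, models the near-vertical tab line as x = slope * y + intercept, and tries every pair of points exhaustively. The function name `lmeds_tab_line` is hypothetical.

```python
from itertools import combinations
from statistics import median


def lmeds_tab_line(points):
    """Fit x = slope * y + intercept by least median of squared residuals.

    Each pair of points proposes a line; the line whose median squared
    horizontal residual over all points is smallest wins.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if y1 == y2:
            continue  # two points on the same row cannot define a tab line
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        score = median((x - (slope * y + intercept)) ** 2 for x, y in points)
        if best is None or score < best[0]:
            best = (score, slope, intercept)
    if best is None:
        raise ValueError("need at least two points with different y values")
    return best[1], best[2]


# Five aligned left edges at x = 100 plus two stray components.
points = [(100, 0), (100, 10), (100, 20), (100, 30), (100, 40), (160, 50), (180, 60)]
slope, intercept = lmeds_tab_line(points)
```

Because the median ignores up to half of the points, the two stray components do not pull the fitted line away from x = 100, which is the property that makes this estimator attractive for noisy tab candidates.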
  • 8. 5. Finding tab positions as line segments. (2/3)
      • Tracking text lines to connect tab-stops:
        – From one tab-stop to another.
        – Associate tab-stops connected by text lines.
        – Discard tab-stops with no connections.
        – Record the most frequently occurring text-line widths.
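Recording the most frequent text-line widths amounts to building a histogram. The sketch below is an assumption about how that could be done, not the paper's method: widths are grouped into bins of `bin_size` pixels (a hypothetical parameter) so that small measurement differences fall into the same bucket.

```python
from collections import Counter


def common_line_widths(widths, bin_size=10, top_n=2):
    """Return the top_n most frequent widths after rounding to bin_size pixels."""
    bins = Counter(round(w / bin_size) * bin_size for w in widths)
    return [width for width, _count in bins.most_common(top_n)]


# Widths (in pixels) of text lines tracked between pairs of tab-stops.
widths = [398, 402, 401, 399, 812, 808, 150]
frequent = common_line_widths(widths)
```

Here the single-column width (about 400) and the double-column width (about 810) come out as the two frequent widths, while the one-off width of 150 is ignored.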
  • 9. 5. Finding tab positions as line segments. (3/3)
      • Cleaning up tab-stop ends:
        – Make connected tab lines end at the same y-coordinate by moving the ends between the last member CC and the first non-member CC.
      • Reclassify CCs as “Text” or “Unknown”:
        – A group of CCs of significant width forms a text line.
      • Create artificial CCs from the image mask.
  • 10. 6. Finding the column layout. (1/3)
      • Scan CCs from left to right and top to bottom, gathering them into Column Partitions (CPs).
      • A CP may not cross a tab-stop line.
      • Collections of CPs are stored in Column Partition Sets (CPsets).
      • Find the column layout → find an optimal set of CPsets that best “explains” all the CPsets on the page.
  • 11. 6. Finding the column layout. (2/3)
      • A good CP: it touches a tab line on both vertical edges.
      • A good CP: its width is close to a frequently occurring width (slide 8).
      • The coverage of a CPset = total width of all the good CPs that it contains.
      • A CPset A is better than CPset B if A has greater coverage.
      • What does “explain” mean? In short:
        – CPset A explains CPset B provided all of the following hold:
          • B does not have more text than A.
          • A has not split a column of common width.
          • A does not have a different number of columns from B.
          • A has not merged two columns of B.
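The coverage measure and the "better than" comparison can be sketched as follows. Several things here are assumptions made for illustration: the `ColumnPartition` class is hypothetical, the two "good CP" criteria are treated as alternatives (either one suffices), and "close to" is taken as within 10% of a frequent width.

```python
from dataclasses import dataclass


@dataclass
class ColumnPartition:
    left: int
    right: int
    touches_left_tab: bool
    touches_right_tab: bool

    @property
    def width(self):
        return self.right - self.left


def is_good(cp, common_widths, tolerance=0.1):
    """Good if bounded by tab lines on both sides, or of a frequent width."""
    if cp.touches_left_tab and cp.touches_right_tab:
        return True
    return any(abs(cp.width - w) <= tolerance * w for w in common_widths)


def coverage(cpset, common_widths):
    """Total width of the good CPs in a CPset."""
    return sum(cp.width for cp in cpset if is_good(cp, common_widths))


def is_better(a, b, common_widths):
    """CPset a is better than CPset b if it has greater coverage."""
    return coverage(a, common_widths) > coverage(b, common_widths)


common_widths = [400]
two_columns = [ColumnPartition(0, 400, True, True), ColumnPartition(420, 820, True, True)]
one_wide = [ColumnPartition(0, 820, True, False)]
```

In this example the two-column CPset has coverage 800, while the single wide partition touches only one tab line and has an unusual width, so its coverage is 0 and the two-column set is preferred.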
  • 12. 6. Finding the column layout. (3/3)
      • A list of candidates is built from the set of CPsets on the page.
      • Ordered with the best ones first.
      • Duplicates are eliminated by the "A explains B" rules.
      • Image CPs are ignored.
      • Improve the candidates by adding new CPs.
  • 13. 7. Finding the regions. (1/3)
      • Create flows of CPs:
        – Choose the best matching upper and lower partner.
        – Each list of partners is iteratively reduced to zero or one entry.
        – Different rules for image CPs and text CPs.
        – Each chain of CPs returned represents a candidate region:
          • Text is blue.
          • Heading text is cyan.
          • Heading image is magenta.
          • Pull-out image is orange.
  • 14. 7. Finding the regions. (2/3)
      • The rules to apply:
        1. Type. If there are multiple types, text can only stay with its own (exact) type, whereas an image can stay with any other image type.
        2. Transitive partner shortcuts are broken. If A has 2 partners B and C, and B also has C as a partner in the same direction, then delete C as a partner of A, leaving a clean chain A-B-C. Also, if A has a partner B, and B has a partner A in the same direction, break the cycle.
        3. (Text only) If A still has 2 partners B and C, chase B's and C's partners to see which has the longest chain. Delete from A the partner that has the shortest chain, and convert the type of the shortest chain to pull-out.
        4. (Image only) Choose the partner CP with the largest horizontal overlap.
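The first half of rule 2 (removing transitive shortcuts) can be sketched with a plain dictionary of partner sets. This is an illustrative sketch only: the representation (a dict mapping each CP name to its partners in one direction) and the function name are assumptions, and cycle breaking is left out because the slide does not say which link of a cycle is removed.

```python
def remove_transitive_shortcuts(partners):
    """Drop C from A's partners when some other partner B of A already leads to C.

    `partners` maps a CP name to the set of its partners in one direction
    (for example, the partners below it). The input is not modified.
    """
    cleaned = {cp: set(ps) for cp, ps in partners.items()}
    for a, a_partners in cleaned.items():
        shortcuts = {
            c
            for b in partners[a]
            for c in partners.get(b, ())
            if c in partners[a] and c != b
        }
        a_partners -= shortcuts
    return cleaned


# A points at both B and C, but B already points at C.
lower_partners = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
chain = remove_transitive_shortcuts(lower_partners)
```

After the call, A keeps only B as a partner, which leaves the clean chain A-B-C described in the rule.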
  • 15. 7. Finding the regions. (3/3)
      • Determine the reading order:
        1. Flowing blocks follow by y position within a column.
        2. Pull-out blocks follow by y position in an imaginary column between the real columns that they touch.
        3. A heading spans multiple columns and follows anything that is above it in the columns spanned, or between them.
        4. A change in column layout works just like a heading.
        5. Between headings, the content of columns is ordered from left to right.
      • Find the polygon boundary for each region:
        – Polygons are isothetic.
        – Polygon edges are chosen to minimize the number of vertices.
        – All CPs are contained within their region.
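A simplified version of the reading-order rules can be sketched as below. It covers only rules 1, 3 and 5 under strong assumptions: every heading is taken to span all columns, pull-out blocks and layout changes are ignored, and the `Block` class with its `kind`, `column` and `top` fields is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    kind: str      # "heading" or "flow"
    column: int    # column index, left to right (unused for headings)
    top: int       # y position of the top edge


def reading_order(blocks):
    """Order blocks: columns left to right between headings, top to bottom within a column."""
    headings = sorted((b for b in blocks if b.kind == "heading"), key=lambda b: b.top)
    remaining = [b for b in blocks if b.kind == "flow"]
    ordered = []
    for heading in headings:
        # Everything above the heading is read before it (rule 3).
        above = [b for b in remaining if b.top < heading.top]
        ordered += sorted(above, key=lambda b: (b.column, b.top))
        ordered.append(heading)
        remaining = [b for b in remaining if b.top >= heading.top]
    ordered += sorted(remaining, key=lambda b: (b.column, b.top))
    return ordered


page = [
    Block("d1", "flow", 1, 600), Block("a2", "flow", 0, 300),
    Block("title", "heading", 0, 0), Block("b1", "flow", 1, 100),
    Block("section", "heading", 0, 500), Block("a1", "flow", 0, 100),
    Block("c1", "flow", 0, 600),
]
order = [b.name for b in reading_order(page)]
```

The sample page is read as title, then the left column (a1, a2), then the right column (b1), then the second heading, then the blocks below it from left to right.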
  • 16. 8. Testing and results. (1/2)
      • Algorithm implemented in C++.
      • Part of the Tesseract Open Source OCR system [3].
      • Speed: one 8-megapixel image per second on a 3.4 GHz Pentium 4.
  • 17. 8. Testing and results. (2/2)
      [Figure: four bar charts, titled PRImA Metric, F-Measure, Recall and Precision, comparing the methods 2007-Besus, 2007-TH1, 2007-TH2 and Tesseract on the measures Noise, Sep, Text, Image and Overall, for the ICDAR 2007 set [2, 8].]
      [8] A. Antonacopoulos, et al. "ICDAR2007 Page Segmentation Competition," Proc. 9th Int. Conf. on Doc. Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283.
  • 18. 9. Conclusion and further work.
      • Tab-stops make an interesting and useful alternative to white rectangles.
      • They enable page layout analysis to easily handle the complex non-rectangular layouts of modern magazines.
      • Table detection will be added in the future.
  • 19. 10. Criticism. (1/4)
      • The idea is totally new and it works reasonably well, but:
        • No references.
        • No formulas.
        • No algorithms.
        • No mathematical justification.
        • Excess text and literature.
        • The process is too long and, on many occasions, given without justification.
  • 20. 10. Criticism. (2/4)
      • An example:
        – Preprocessing: Small CCs: h < 7 (@300ppi) ...
          • Why 7?
          • Does it only work at 300ppi?
          • Only on magazine papers (10.5” x78.5”)?
  • 21. 10. Criticism. (3/4)
      • More:
        – Reclassify CCs as “Text” or “Unknown”: a group of CCs of significant width forms a text line.
          • What's a “significant width”?
        – Find the polygon boundary for each region: polygon edges are chosen to minimize the number of vertices.
          • What's the algorithm or reference to do this?
  • 22. 10. Criticism. (4/4)
      ICDAR 2009 Results [2]
  • 23. 11. References. (1/2)
      1. Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011) <http://www.icdar2011.org/>
      2. A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona, Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/>
      3. The Tesseract OCR <http://code.google.com/p/tesseract-ocr/>
      4. M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm," SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408.
  • 24. 11. References. (2/2)
      5. G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents," Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984, pp347-349.
      6. T. M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on Document Analysis Systems V, Springer-Verlag 2002, pp188-199.
      7. Leptonica image processing and analysis library <http://www.leptonica.com>
      8. A. Antonacopoulos, et al. "ICDAR2007 Page Segmentation Competition," Proc. 9th Int. Conf. on Doc. Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283.
  • 25. Questions? Thank you.
