1. Table Understanding in DIADEM
   DIADEM 1.0
   Giorgio Orsi (1,2) and Ben Watson (2)
   (1) Institute for the Future of Computing, University of Oxford
   (2) Department of Computer Science, University of Oxford
2. Table Understanding
   - A process that
       - locates (or recognizes),
       - analyses, and
       - interprets
   - a tabular structure, with the goal of
       - classifying it (layout vs. data table),
       - extracting data,
       - translating it, or
       - other tasks.
3. What is a Table?
   - Penn et al. '01
       - a 2D assembly of cells, where
       - each cell is short in length and
       - contains no complex structures, and
       - there is semantic and syntactic coherence within the rows and columns.
4. What is a Table?
5. What Information do we Have?
   - HTML
       <table border="1">
         <tbody>
           <tr> <th colspan="2">NAME</th> <th rowspan="2">D.O.B.</th> </tr>
           <tr> <th>FIRST NAME</th> <th>SURNAME</th> </tr>
           <tr> <td>Sue</td> <td>Adams</td> <td>12th June 1980</td> </tr>
           <tr> <td>Jim</td> <td>Wright</td> <td>19th May 2000</td> </tr>
         </tbody>
       </table>
   - CSS boxes
   - Domain
       - ox:person, with ox:firstName (xsd:string), ox:surname (xsd:string), ox:dob (xsd:date)
6. Why Table Understanding in DIADEM
   - Recognize and extract data in tabular format
       - layout tables
       - data tables
   - Understand forms and result pages
       - labelling
       - segmentation
   - Let us focus first on HTML tables (e.g., <table>)
7. Why Table Understanding in DIADEM
8. Why Table Understanding in DIADEM
9. Leaf Tables
   - Goal: determine whether a table contains any inner table.
   - If T1 contains T2 (e.g., there is a <table> element in the subtree rooted at T1), then T1 is a layout table.
   [Figure labels: layout; recursive check]
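The leaf-table check can be sketched with Python's standard `html.parser`. This is only an illustration of the rule as stated on the slide, not DIADEM's implementation; the names `NestedTableDetector` and `is_leaf_table` are hypothetical, and the input is assumed to be the HTML of a single outermost table.

```python
from html.parser import HTMLParser


class NestedTableDetector(HTMLParser):
    """Tracks <table> nesting depth while parsing one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <table> nesting depth
        self.max_depth = 0  # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> is matched too.
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def is_leaf_table(table_html: str) -> bool:
    """Return True if the outermost table contains no inner <table>."""
    detector = NestedTableDetector()
    detector.feed(table_html)
    detector.close()
    return detector.max_depth <= 1
```

Under the rule above, a table for which `is_leaf_table` returns False would be treated as a layout table.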
10. Row and Column Count
   - Goal: identify “sane” tables
       - at least two coherent adjacent cells (TD, DIV, TH)
           - e.g., two data cells, two header cells, or one header cell and one data cell
   - Allow 1D tables (i.e., vectors)
   - Allow empty tables
11. Longest String
   - Goal: identify “sane” cells
       - find the longest string w in every cell; T is a data table if |w| < δ
       - layout tables are likely to contain a large amount of text
   - Ignore text nodes associated with <SELECT>, <FORM> and <TABLE>
       - in their subtree
       - siblings
   [Figure label: ignore]
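A minimal sketch of the longest-string rule, assuming each cell has already been reduced to its visible text (with the <SELECT>, <FORM> and <TABLE> text removed upstream). The `Table` type, the function names, and the default value of `delta` are assumptions made for illustration; the slides do not give a value for δ.

```python
from typing import List

Table = List[List[str]]  # rows of cells, each cell reduced to its text


def longest_string_length(table: Table) -> int:
    """Length |w| of the longest cell text in the table (0 if no cells)."""
    return max((len(cell.strip()) for row in table for cell in row), default=0)


def passes_longest_string_rule(table: Table, delta: int = 40) -> bool:
    """T passes (looks like a data table) if |w| < delta."""
    return longest_string_length(table) < delta
```

For the example table from slide 5, the longest cell is "12th June 1980", so the rule passes for any δ above 14.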
12. Empty Cell
   - Goal: identify “sane” cells
       - find empty cells; T is a data table if it contains no empty cells
       - layout tables are likely to contain empty cells
   [Figure label: empty]
13. TH Check
   - Goal: identify “sane” tables
       - find <TH> elements in a table
       - layout tables are not likely to contain <TH> elements
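The TH check amounts to counting header cells. The sketch below uses Python's standard `html.parser`; `TagCounter` and `has_header_cells` are hypothetical names, and how the result is weighed against the other rules is left open, since the slide only says that layout tables are unlikely to contain <TH>.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TH> and <th> are counted together.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def has_header_cells(table_html: str) -> bool:
    """Return True if the table markup contains at least one <th>."""
    counter = TagCounter()
    counter.feed(table_html)
    counter.close()
    return counter.counts.get("th", 0) > 0
```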
14. Largest Cell
15. Picture
   - Goal: identify “sane” cells
       - check the size of pictures in a cell
       - T is a data table if p-area < δ
       - layout tables are likely to contain large pictures
           - e.g., ads and logos
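A minimal sketch of the picture rule, assuming the rendered size of every picture in the table is already known as a (width, height) pair in pixels. The function name and the default value of `delta` are placeholders; the slides do not state the actual threshold.

```python
from typing import Iterable, Tuple


def passes_picture_rule(picture_sizes: Iterable[Tuple[int, int]],
                        delta: int = 10_000) -> bool:
    """True if every picture's area (width * height) is below delta."""
    return all(width * height < delta for width, height in picture_sizes)
```

With this placeholder threshold a 16x16 icon passes, while a 728x90 banner does not.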
16. Table Size
17. Combining Rules
   - Identify the combination of rules that maximizes the recognition accuracy
       - cut-off estimation
           - best-guess estimation
           - if T passes all the rules, it is a data table
       - cut-off calculation
           - cut-off = performance of each rule
           - if T passes all the rules, it is a data table
       - machine learning
           - decision trees (white-box model)
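Combining the rules in AND can be sketched as follows. The feature names, the rule set, and the thresholds are all hypothetical placeholders chosen for illustration; the `skip` parameter mimics evaluating a combination with some rules left out.

```python
from typing import Callable, Dict, Iterable

Features = Dict[str, float]        # hypothetical per-table measurements
Rule = Callable[[Features], bool]  # a rule returns True when T passes it

# Hypothetical rule set; thresholds are placeholders, not the slides' values.
RULES: Dict[str, Rule] = {
    "leaf_table": lambda f: f["inner_tables"] == 0,
    "longest_string": lambda f: f["longest_string"] < 40,
    "empty_cell": lambda f: f["empty_cells"] == 0,
    "picture": lambda f: f["max_picture_area"] < 10_000,
}


def classify(features: Features, skip: Iterable[str] = ()) -> str:
    """Label T as 'data' if it passes every active rule, else 'layout'."""
    skipped = set(skip)
    active = [rule for name, rule in RULES.items() if name not in skipped]
    return "data" if all(rule(features) for rule in active) else "layout"
```

A table with one empty cell but otherwise clean features is labelled "layout" with all rules active and "data" once the empty-cell rule is skipped, which is the kind of difference a rule-combination search would measure.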
18. Evaluation: Cut-Off Estimation
   - First run: all rules in AND
   - Second run: without the empty cell rule
   - Third run: without the empty cell and table size rules
   - Fourth run: without the empty cell, table size and picture rules
19. Evaluation: Cut-Off Computation
   - First run: all rules in AND
   - Second run: without the empty cell and table size rules
20. Evaluation: Decision Tree
   - Facts:
       - 65% training
       - 35% 10-fold validation
       - precision: 0.807
       - recall: 0.836
       - F-measure: 0.821
   - Comparison:
       - F-measure: 0.740 (Gatterbauer)
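The reported F-measure is consistent with the balanced F1 score, the harmonic mean of precision and recall. The slides do not say which F-measure variant was used, so F1 is an assumption here; the short check below reproduces the figure from the reported precision and recall.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.807, 0.836), 3))  # 0.821
```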
21. Discussion
   - Most of the errors are caused by missing information or a bad combination of rules.
       - use visual and semantic information
       - combine the heuristics in an “organic” way
           - PDF-inspired extraction
           - guided by the HTML and CSS structure
       - use a reference model, as in form and result-page analysis
22. Thank you!