ICDM2019 table tutorial

Tutorial for "table extraction and understanding for scientific and enterprise applications" as presented at ICDM 2019, organized by Yannis Katsis, Alexandre V Evfimievski, Nancy Wang, Douglas Burdick, Marina Danilevsky


  1. 1. Table Extraction and Understanding for Scientific and Enterprise Applications. Yannis Katsis, Doug Burdick, Nancy Wang, Alexandre V Evfimievski, Marina Danilevsky. IBM Research - Almaden
  2. 2. Outline § Introduction – Problem Definition – Challenges – Applications – Demo § Table Extraction § Table Understanding § Conclusion
  3. 3. Introduction
  4. 4. Introduction Outline § Problem definition – Table Extraction – Table Understanding § Challenges – Limited document format support for table structure – Table variety § Applications – Knowledge Base Population – Query Answering – Leaderboard Construction – Information Extraction § Demo Introduction
  5. 5. Tables are a popular data representation
      Examples: Government Reports, Scientific Papers, Financial Reports, Invoices, Contracts, Loan Agreements
      Tables are compact and easy to understand (for humans)
  6. 6. End-to-end example
      § What does the value 672 in the following table mean?
      § Answer: Net earnings for the three months ended July 29th, 2017 were $672 million USD
      Steps:
        1) Find location of table on page
        2) Find cells in column containing “672”
        3) Find cells in row corresponding to “672”
        4) Identify aligned row / column header cells
        5) Normalize using additional context from table
      Table Extraction: Identify table location and structure
      Table Understanding: Provide semantic context to table values
  7. 7. Table Extraction: Problem Definition
      Input: Document contents in native format (PDF, Image, Office Docs, …)
      Output: Document contents with tabular information:
        1) Table border for each table
        2) Partitioning table contents into cells
        3) Both vertical and horizontal alignment of cells
  8. 8. Table Understanding: Problem Definition
      Input: Document contents with tabular information:
        1) Table border for each table
        2) Partitioning table contents into cells
        3) Both vertical and horizontal alignment of cells
      Output: Table content representation:
        1) Captures semantic information
        2) Amenable to post-processing
  9. 9. Table Understanding: Example
      Input: Document contents with tabular information: 1) Table border for each table 2) Partitioning table contents into cells 3) Both vertical and horizontal alignment of cells
      Output: Table content representation: 1) Captures semantic information 2) Amenable to post-processing
      { "tables": [ {
          "column_headers": [
            { "cell_id": "colHeader-1050-1082", "text": "Expenses ($ in thousands)", ... },
            { "cell_id": "colHeader-1270-1301", "text": "Three months ended Sept. 30", ... },
            { "cell_id": "colHeader-1544-1548", "text": "2015" },
            ... ],
          "row_headers": [
            { "cell_id": "rowHeader-2244-2262", "text": "Aircraft fuel" },
            { "cell_id": "rowHeader-3197-3217", "text": "Airport operations" },
            { "cell_id": "rowHeader-4148-4176", "text": "Flight operations and navigational changes" },
            ... ],
          "body_cells": [
            { "cell_id": "bodyCell-2450-2455", "text": "206,924",
              "row_header_ids": [ "rowHeader-2244-2262" ],
              "column_header_ids": [ "colHeader-1050-1082", "colHeader-1270-1301", "colHeader-1544-1548" ] },
            { "cell_id": "bodyCell-5415-8945", "text": "142,176",
              "row_header_ids": [ "rowHeader-3197-3217" ],
              "column_header_ids": [ "colHeader-1050-1082", "colHeader-1270-1301", "colHeader-1544-1548" ],
              ...
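A minimal sketch of how a consumer might use this representation to recover the headers of a body cell. It assumes only the field names visible in the excerpt (`cell_id`, `text`, `row_header_ids`, `column_header_ids`); the `doc` sample is hand-built from the excerpt with elided fields left out, and `describe_cell` is a hypothetical helper, not part of any published API.

```python
# Hand-built sample shaped like the slide's excerpt (elided fields omitted).
doc = {
    "tables": [{
        "column_headers": [
            {"cell_id": "colHeader-1050-1082", "text": "Expenses ($ in thousands)"},
            {"cell_id": "colHeader-1270-1301", "text": "Three months ended Sept. 30"},
            {"cell_id": "colHeader-1544-1548", "text": "2015"},
        ],
        "row_headers": [
            {"cell_id": "rowHeader-2244-2262", "text": "Aircraft fuel"},
        ],
        "body_cells": [{
            "cell_id": "bodyCell-2450-2455",
            "text": "206,924",
            "row_header_ids": ["rowHeader-2244-2262"],
            "column_header_ids": [
                "colHeader-1050-1082", "colHeader-1270-1301", "colHeader-1544-1548",
            ],
        }],
    }]
}


def describe_cell(table, cell):
    """Return (row header texts, column header texts) for one body cell."""
    # Index every header cell by its id so the references can be resolved.
    by_id = {h["cell_id"]: h["text"]
             for h in table["column_headers"] + table["row_headers"]}
    rows = [by_id[i] for i in cell["row_header_ids"]]
    cols = [by_id[i] for i in cell["column_header_ids"]]
    return rows, cols


table = doc["tables"][0]
rows, cols = describe_cell(table, table["body_cells"][0])
print(rows, cols)
```

Resolving ids through a dictionary keeps the lookup independent of header order, which matters when a body cell points at several stacked column headers.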
  10. 10. Table Understanding: Example
      Input: Document contents with tabular information: 1) Table border for each table 2) Partitioning table contents into cells 3) Both vertical and horizontal alignment of cells
      Output: Table content representation: 1) Captures semantic information 2) Amenable to post-processing

      Value   | Norm Value   | Year | Time Period | Type    | LineItem
      206,924 | $206,924,000 | 2015 | Q3          | Expense | Aircraft Fuel
      286,817 | $286,817,000 | 2014 | Q3          | Expense | Airport Fuel
      142,176 | $142,176,000 | 2015 | Q3          | Expense | Airport operations
      …       | …            | …    | …           | …       | …

      Change  | Change Normalized | Begin Time Period | End Time Period
      (27.9%) | -27.9%            | Q3 2014           | Q3 2015
      10.7%   | 10.7%             | Q3 2014           | Q3 2015
      …       | …                 | …                 | …
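The normalization shown in this example can be sketched roughly as follows. This is an illustrative sketch, not the tutorial authors' implementation: the three function names are hypothetical, and it assumes the unit appears in a header as "in thousands / millions / billions" and that parentheses mark negative values, as in the slide's sample.

```python
import re


def unit_multiplier(header_text):
    """Infer the scale from a header such as 'Expenses ($ in thousands)'."""
    match = re.search(r"in (thousands|millions|billions)", header_text, re.IGNORECASE)
    scale = {"thousands": 10**3, "millions": 10**6, "billions": 10**9}
    return scale[match.group(1).lower()] if match else 1


def normalize_amount(text, multiplier=1):
    """Parse '206,924' (or '(1,234)' for a negative) and apply the header's scale."""
    negative = text.startswith("(") and text.endswith(")")
    value = int(text.strip("()").replace(",", "")) * multiplier
    return -value if negative else value


def normalize_percent(text):
    """Parse '10.7%' or the accounting-style negative '(27.9%)' into a float."""
    negative = text.startswith("(") and text.endswith(")")
    value = float(text.strip("()").rstrip("%"))
    return -value if negative else value


scale = unit_multiplier("Expenses ($ in thousands)")
print(normalize_amount("206,924", scale))   # value scaled by the header's unit
print(normalize_percent("(27.9%)"))
```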
  11. 11. Outline § Problem definition – Table Extraction – Table Understanding § Challenges – Limited document format support for table structure – Table variety § Applications – Knowledge Base Population – Query Answering – Leaderboard Construction – Information Extraction § Demo Introduction
  12. 12. Challenge: Table structure representation varies across document formats
      Native support for table structure ranges from None through Partial to Complete across HTML, MS Excel, MS Word, TXT, PDF, and Image
      Table Understanding is still required for all document types
      H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI ’19
      Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM ’17
      M. Cafarella et al. “WebTables: exploring the power of tables on the web”. VLDB ’08
  13. 13. Complete: HTML completely represents table structure
  14. 14. Partial: MS Excel. Table structure representation varies across Excel documents
      – Each sheet is a separate table
      – Multiple tables are defined in a single sheet
      H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI ’19
      Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM ’17
  15. 15. Partial: MS Word. Table structure representation varies across Word documents
      – Use Office Table Object for all tables
      – Omit Office Table Object
  16. 16. None: Document formats with no native table representation: Image, TXT, PDF
  17. 17. PDF Document Format
      PDF Binary:
        … BT 0.0503 Tc 8.503556 0 0 8.52 503.2795 688.92 Tm /Tc2 1 Tf [ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 (e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ] TJ 0 Tc ET …
        Q q 46.91952 776.52 m 242.04 776.52 l 242.04 729.96 l 144.48 729.96 l 46.91952 729.96 l h …
      Rendered PDF (what the instructions mean):
        Draw “m” at (503, 688) in 8.5 point font in color white
        Draw “o” at … Draw “n” at … Draw “t” at … Draw “h” at … Draw “s” at … …
        Draw green line segment from (46, 776) to (242, 776)
        Draw green line segment from (242, 776) to (242, 729) …
      • A programmatic PDF is a collection of instructions to draw characters and line segments to the page, with visual formatting information
      • 2 – 4 trillion PDFs in existence and rapidly growing
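To make the point about character-level drawing instructions concrete, here is a small sketch that recovers the text from the `TJ` array in the excerpt above. It is a deliberately simplified illustration: `text_from_tj` is a hypothetical helper that handles only plain parenthesized strings and ignores escaped parentheses, hex strings, and font encodings, all of which a real PDF parser has to deal with.

```python
import re


def text_from_tj(array_body):
    """Concatenate the string operands of a TJ array, ignoring the numbers.

    In a TJ array, parenthesized strings are the glyphs to show; the numbers
    between them only adjust horizontal spacing.
    """
    return "".join(re.findall(r"\(([^)]*)\)", array_body))


tj = ("[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 "
      "(e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ]")
print(repr(text_from_tj(tj)))
```

Note that nothing in the stream says these characters belong to a table header; that has to be inferred from positions, which is the table extraction problem.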
  18. 18. Challenge: Variety in Tables
      – Complex tables: graphical lines can be misleading (is this 1, 2 or 3 tables?)
      – Table with visual clues only
      – Multi-row, multi-column column headers
      – Nested row headers
      – Tables with textual content
      – Table with graphic lines
      – Table interleaved with text and charts
  19. 19. Outline § Problem definition – Table Extraction – Table Understanding § Challenges – Limited document format support for table structure – Table variety § Applications – Knowledge Base Population – Query Answering – Leaderboard Construction – Information Extraction § Demo Introduction
  20. 20. Application: Knowledge-base population
      Ex: Delta Air Lines, Inc. 2014 Annual Report Form 10-K
      Excerpt of HTML file with granular financial metric data: https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal1231201410k.htm
        Valuable metrics for airline industry, only present in HTML table
      Excerpt of semi-structured XBRL file for financial statements: https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal-20141231.xml
        Valuable metrics present in semi-structured raw data source
        :
        <xbrli:context id="FI2013Q4"><xbrli:entity> <xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier></xbrli:entity> <xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period> </xbrli:context>
        :
        <us-gaap:CashAndCashEquivalentsAtCarryingValue contextRef="FI2013Q4" decimals="-6" id="Fact-C39BEC178121A91816968BA9ADCF421F" unitRef="usd"> 2844000000 </us-gaap:CashAndCashEquivalentsAtCarryingValue>
        :
      The two sources MUST BE INTEGRATED
  21. 21. Application: Query Answering Introduction H. Sun et al. “Table Cell Search for Question Answering”. WWW '16
  22. 22. Application: Scientific Leaderboard Construction Introduction Y. Hou et al. “Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction”. ACL ‘19 Scientific Publication Leaderboard Annotations
  23. 23. Application: Biological Information Extraction Introduction G. Singh et al. “QTLTableMiner++: Semantic Mining of QTL Tables in Scientific Articles”. BMC BioInformatics ‘18 Article Trait Tables Trait Statements QTL Statements Extract info on Quantitative Trait Locus (QTL) (genomic regions that correlate with phenotypes) from tables in scientific publications
  24. 24. Takeaways
      § Widely used document formats have limited table representation
        – Limits of document format: Image, PDF
        – How documents are authored: Word, Excel
      § Wide variety of tables makes general model construction difficult
        – Tables are a form of art
        – Diverse visual encoding of semantic information
        – Different domains
      § Multiple applications for table extraction and understanding
  25. 25. Outline § Problem definition – Table Extraction – Table Understanding § Challenges – Limited document format support for table structure – Table variety § Applications – Knowledge Base Population – Query Answering – Leaderboard Construction – Information Extraction § Demo Introduction
  26. 26. Table Extraction
  27. 27. What Is Table Extraction?
      § Table region detection
        – Identify all tables
        – Separate tables from non-table text
        – Separate tables from each other
      § Cell structure recognition
        – Partition text into cells
        – Find cell span and cell-to-cell overlap (along X- or Y-axis)
  28. 28. Table Extraction: A Sample of Prior Work
      [CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ’93
      [I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ’93
      [H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ’95
      [KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ’98
      [H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ’99
      [H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ’00
      [H00b] J. Hu et al. “Table Structure Recognition and Its Evaluation”, SPIE Doc. Recog. & Retr. ’00
      [KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ’01
      [C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ’02
      [P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ’03
      [H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ’03
      [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ’04
      [Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ’05
      [S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ’06
      [M06] S. Mandal et al. “A Simple and Effective Table Detection System from Document Images”, IJDAR ’06
      [L07] Y. Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”, JCDL ’07
      [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ’07
      [L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
      [L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ’09
      [OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ’09
      [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ’10
      [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ’11
      [F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ’11
      [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ’12
      [CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ’12
      [K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ’13
      [K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ’14
      [B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ’14
      [R15] R. Rastan et al. “TEXUS: A Task-based Approach for Table Extraction and Understanding”, DocEng ’15
      [T16] T. A. Tran et al. “A Mixture Model Using Random Rotation Bounding Box to Detect Table Region in Document Image”, JVCIR ’16
      [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ’17
      [S18a] P. Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”, KDD ’18
      [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ’18
      [Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
      [C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
      [L19] M. Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”, arXiv, 2019
      [M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from Scanned Documents”, ICDAR ’19
  29. 29. Table Extraction: A Sample of Prior Work
      Most papers present an end-to-end system for:
        • Table detection,
        • Cell structure recognition (table parsing),
        • Or both
      🔥 ICDAR 2019 has ≥ 16 new papers on table extraction!
        – ICDAR = International Conference on Document Analysis and Recognition
  30. 30. Table Extraction Timeline
      § Early 1990s: Separator based “top-down” methods
        – Ruled line tables
        – Extend to white-space “lines”
      § 1990s – early 2000s: “Bottom-up” text clustering
        – Group text into columns (or rows), then to tables
        – Use space features (gaps, overlap, alignment) and keywords
      § 2000s – early 2010s: Machine Learning (supervised or not)
        – Classify text-rows using CRF, SVM, HMM, etc.
        – Probabilistic models for tables
        – Graph-based models for cell structure
        – Unsupervised ML (clustering)
      § Late 2010s: Deep Learning
        – Scanned image table detection with R-CNN or YOLO
        – Graph neural networks and language embeddings for cell structure
  31. 31. How to Build a Table Extraction System?
      § Analyze Page
        – Identify low-level structures & relations
      § The 2 Main Tasks
        – Table (region) detection
        – Cell structure recognition (given table region)
      § Refine Tables
        – Discard false positives
        – Adjust table border and structure
  32. 32. Common Sub-Tasks in Table Extraction
      Stages: Analyze, Detect, Refine
      Sub-tasks: Compute page features; Group text into larger units; Identify separator lines; Generate candidate table regions; Extract table’s cell structure; Compute table’s features & score; Adjust candidate tables; Select tables for output
      Learning Infrastructure: Accuracy metrics; Ground truth data; Optimization method
  33. 33. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  34. 34. Page Features
      § Documents can be:
        – scanned
        – programmatic (“born digital” PDF, TXT)
        – hybrid
      § Scanned page requires OCR, plus:
        – Reverse any rotation, distortion
        – Filter noise, sharpen if low resolution [M19]
        – Fix inconsistent font features, bounding boxes
        – Detect ruled lines and boxes
          • E.g., Gaussian filter + black hat transform [K13]
      [K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ’13
      [M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from Scanned Documents”, ICDAR ’19
  35. 35. Page Features
      § Programmatic PDFs (and TXTs)
        – Have letters, but no table markup
      § May contain spurious (invisible) text and lines
        – White-on-white lines or text
        – Occluded or out-of-range lines or text
        – Text repeated to simulate bold font
        – Need to filter them out
      § Deep Learning (CNN-based) methods need an image
        – Convert programmatic to scanned
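The filtering of spurious text described above might be sketched as follows. Everything here is an assumption made for illustration: the token dictionaries (`text`, `x`, `y`, `color`), the `filter_spurious` name, and the tolerance are hypothetical, and occlusion by other page objects is not handled at all.

```python
def filter_spurious(tokens, page_width, page_height,
                    background=(255, 255, 255), tol=1.0):
    """Drop tokens that a reader would never see on the rendered page."""
    kept = []
    for tok in tokens:
        if tok["color"] == background:
            continue  # text drawn in the background color is invisible
        if not (0 <= tok["x"] <= page_width and 0 <= tok["y"] <= page_height):
            continue  # drawn outside the visible page area
        duplicate = any(
            tok["text"] == k["text"]
            and abs(tok["x"] - k["x"]) <= tol
            and abs(tok["y"] - k["y"]) <= tol
            for k in kept
        )
        if duplicate:
            continue  # same string re-drawn at a tiny offset to simulate bold
        kept.append(tok)
    return kept


BLACK, WHITE = (0, 0, 0), (255, 255, 255)
tokens = [
    {"text": "Total", "x": 72.0, "y": 700.0, "color": BLACK},
    {"text": "Total", "x": 72.3, "y": 700.0, "color": BLACK},     # fake bold
    {"text": "hidden", "x": 100.0, "y": 600.0, "color": WHITE},   # white-on-white
    {"text": "offpage", "x": -50.0, "y": 700.0, "color": BLACK},  # out of range
]
kept = filter_spurious(tokens, page_width=612, page_height=792)
print([t["text"] for t in kept])
```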
  36. 36. Page Features
      § Plain text layout (1-column, 2-column, etc.)
        – Helps avoid false-positive “tables”
      § Obvious non-tables
        – Page & section headers, footers, lists, etc.
        – Short-cut computation if no tables on page
      § Low-level structure
        – Alignment @ different box positions & tolerance levels
        – A minimum spanning tree for clustering by distance
      § Deep learning features
        – CNN features shared across proposal regions
        – Natural language embeddings
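One way to read "a minimum spanning tree for clustering by distance" is sketched below: build the spanning forest with Kruskal's algorithm and stop once edges exceed a length limit, so the remaining connected components are the clusters. The function name `mst_clusters`, the use of box centers as points, and the `max_edge` threshold are assumptions for this sketch; the slides do not specify these details.

```python
from itertools import combinations
from math import dist


def mst_clusters(points, max_edge):
    """Cluster points by growing a minimum spanning forest with a length cap."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    for length, a, b in edges:
        if length > max_edge:
            break  # all remaining edges are longer, so stop growing the forest
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # accept the edge: it joins two separate components

    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return sorted(clusters.values())


centers = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(mst_clusters(centers, max_edge=2))
```

The all-pairs edge list is quadratic in the number of points, which is acceptable for a page's worth of text boxes but would need a spatial index for larger inputs.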
  37. 37. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  38. 38. Group Text into Larger Units
      § Most systems group text early on
        – Table detection systems may skip text grouping
      § Text is grouped in one of 3 ways:
        – Columns first
        – Rows first
        – Cell-units (“blobs”) first
      § Some systems partition text using separator lines
        – BUT: “Blob” detection reduces over- / under-partitioning
  39. 39. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Two Tables
  40. 40. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Columns
  41. 41. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Rows
  42. 42. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Multi-line “Blobs”
  43. 43. Start with Columns
      Many systems detect columns first:
        – T-Recs [KD98], pdf2table [Y05], Lixto [HB07], Tesseract [SS10], smartFIX [D11]
      Example – Tesseract [SS10]:
        1. Detect X-axis “tab-stops” (alignment positions)
        2. Group tokens between “tab-stops” horizontally into entries
        3. Group entries of the same font vertically into column fragments
        4. Group column fragments within page columns horizontally into table fragments
        5. Group table fragments if columns match vertically into tables
      [KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ’98
      [Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ’05
      [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ’07
      [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ’10
      [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ’11
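Step 1 above (tab-stop detection) can be illustrated with a much simplified stand-in: cluster the left edges of text tokens and keep positions shared by enough tokens. This is not Tesseract's actual algorithm; `find_tab_stops`, `tolerance`, and `min_support` are hypothetical names and values chosen for the sketch.

```python
def find_tab_stops(left_edges, tolerance=2.0, min_support=3):
    """Cluster left x-coordinates; well-supported clusters become tab-stops."""
    stops, cluster = [], []
    for x in sorted(left_edges):
        # A jump larger than the tolerance closes the current cluster.
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(sum(cluster) / len(cluster))
    return stops


# Left edges of tokens from several text lines; 150.0 is a one-off position.
left_edges = [72.0, 72.5, 71.8, 300.0, 300.4, 299.9, 150.0, 420.0, 420.2, 420.1]
stops = find_tab_stops(left_edges)
print([round(s, 1) for s in stops])
```

The same idea applies to right edges and centers, which is how right-aligned numeric columns and centered headers would be picked up.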
  44. 44. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Tab-Stops
  45. 45. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Column Fragments
  46. 46. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Table Fragments
  47. 47. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Table Fragments
  48. 48. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Tables
  49. 49. Example Table Extraction Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf Tables Multi-Column Headers
  50. 50. Start with Rows
      Systems with ML often detect rows first
        – Pinto-McCallum [P03], e Silva [S06], TableSeer [L08], PDF-TREX [OR09]
      Typical process:
        1. Identify text-lines
        2. Train an ML classifier to label text-lines:
           – “Table Dense”, “Table Sparse”, “Table Header”, “Non-table”, etc.
           – ML = CRF [P03], HMM [S06], SVM [L08], etc.
        3. Merge sparse rows into dense rows to get full table rows:
           – Merge up, down, or cluster around, by row alignment [H00a]
        4. Combine table rows into tables
      [H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ’00
      [P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ’03
      [S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ’06
      [L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
      [OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ’09
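The label-then-merge process above can be sketched with a simple rule in place of the trained classifier. To be clear about the substitution: the cited systems learn the labels with CRF, HMM, or SVM models, whereas this sketch uses a fill-ratio threshold, assumes text lines are already split into column slots, and implements only the "merge up" option. `label_line`, `merge_sparse_up`, and `dense_ratio` are hypothetical.

```python
def label_line(cells, dense_ratio=0.6):
    """Rule-based stand-in for the text-line classifier."""
    filled = sum(1 for c in cells if c)
    if filled == 0:
        return "Non-table"
    return "Table Dense" if filled / len(cells) >= dense_ratio else "Table Sparse"


def merge_sparse_up(lines):
    """Fold each sparse line into the closest dense row above it."""
    rows = []
    for cells in lines:
        label = label_line(cells)
        if label == "Non-table":
            continue
        if label == "Table Dense" or not rows:
            rows.append(list(cells))
        else:
            # Append the sparse line's text to the matching column slots.
            rows[-1] = [(a + " " + b).strip() for a, b in zip(rows[-1], cells)]
    return rows


lines = [
    ["Flight operations and", "142", "150"],
    ["navigational charges", "", ""],      # continuation of the row header
    ["Aircraft fuel", "206", "286"],
]
rows = merge_sparse_up(lines)
print(rows)
```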
  51. 51. Example. Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf [Figure: text lines labeled Table Header, Dense Row, or Sparse Row; Alignment]
  52. 52. Example. Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf [Figure: text lines labeled Table Header, Dense Row, or Sparse Row; Alignment marked ✓ or ✕]
  53. 53. Example. Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf [Figure: text lines labeled Table Header, Dense Row, Sparse Row, or Heading Row; Alignment]
  54. 54. Example. Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf [Figure: text lines labeled Table Header, Dense Row, Sparse Row, or Heading Row; Alignment marked ✓ throughout]
  55. 55. Text “Blobs” (Cell-Units, Paragraphs, …)
      § “Blob” = largest semantically bound text unit
        – Single-line or multi-line
        – If in a table, the whole “blob” must be in a single cell
      § “Blob” ≠ Cell
        – Cell has span and overlaps other cells
        – Some “blobs” end up in plain text or non-table text
      § “Blobs” help define table structure:
        – Trace alignment
        – Determine header cell spans
        – Fix over-split / over-merged cells, rows, columns
        – Reduce search space
  56. 56. How to Detect “Blobs”
      § [KD98] Distance based clustering:
        – Merge words horizontally
        – Merge text strings vertically if word-spans interleave
      § Problems with distance:
        – Multi-column headers: 1 justified phrase vs. ≥ 2 closely spaced phrases
        – Row headers / text cells: 1 multi-line cell vs. ≥ 2 closely spaced rows
      § Example:

        HEADER               Two Column Header     Two Column Header
                             Header     Header     Header     Header
        Row 1, text line 1   0.12       1.23       2.34       3.45
        Row 1, text line 2
        Row 1, text line 3
        Row 2, text line 1   4.56       5.67       6.78       7.89
        Row 2, text line 2
        Row 2, text line 3

      [KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ’98
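A rough sketch of the vertical merge rule ("merge text strings vertically if word-spans interleave"), in the spirit of [KD98] but not a reimplementation of T-Recs. It assumes words arrive as `(x0, x1, text)` tuples grouped by text line, skips the horizontal word merge, and links words on adjacent lines whose horizontal extents overlap. `spans_overlap` and `cluster_words` are hypothetical names.

```python
def spans_overlap(a, b):
    """True if two horizontal intervals (x0, x1) share any extent."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_words(lines):
    """Group words into blocks by linking overlapping words on adjacent lines."""
    words = [(i, w) for i, line in enumerate(lines) for w in line]
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, (line_a, word_a) in enumerate(words):
        for b, (line_b, word_b) in enumerate(words):
            if line_b == line_a + 1 and spans_overlap(word_a[:2], word_b[:2]):
                parent[find(a)] = find(b)

    blocks = {}
    for i, (_, word) in enumerate(words):
        blocks.setdefault(find(i), []).append(word[2])
    return sorted(blocks.values())


lines = [
    [(0, 40, "Item"), (100, 130, "0.12")],
    [(0, 40, "Fuel"), (100, 130, "1.23")],
]
print(cluster_words(lines))
```

The output also shows the weakness noted on the slide: two closely spaced rows end up in the same block as readily as two lines of one multi-line cell, because overlap alone cannot tell them apart.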
  57. 57. How to Detect “Blobs”
      § [H00a], [OR09] Merge “sparse” rows into “dense” rows
        – Merge up, merge down, or cluster around
      § [L09] Detect and follow reading order (an NLP challenge)
      § [B12] [B14] Train a classifier over “blob” features:
        – Proper termination (e.g. “blobs” don’t end with a dash or comma)
        – Number of numeric strings
        – Indentation, large space at the end of a string
        – Shared font properties
      § Deep learning approaches:
        – Cell-unit detection (over image) using CNNs
        – Semantic relationship detection (over text) using RNNs
      [H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ’00
      [OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ’09
      [L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ’09
      [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ’12
      [B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ’14
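The "blob" features listed above could be computed along these lines before being fed to a classifier. The feature names, the numeric-token pattern, and the 20-unit gap threshold are placeholders invented for this sketch; the cited papers define their own feature sets, and shared font properties are left out because they need font metadata.

```python
import re

# Matches tokens such as 1,234  $672  (27.9%)  -0.5
NUMERIC = re.compile(r"^[($-]*\d[\d,.]*[%)]*$")


def blob_features(text, left_indent, right_gap):
    """Compute a few per-line features for a candidate text 'blob'."""
    tokens = text.split()
    return {
        # A trailing dash or comma suggests the blob continues on the next line.
        "ends_properly": not text.rstrip().endswith(("-", ",")),
        "numeric_count": sum(1 for t in tokens if NUMERIC.match(t)),
        "indent": left_indent,
        "large_right_gap": right_gap > 20.0,  # arbitrary placeholder threshold
    }


first = blob_features("Flight operations and navigational-", 0.0, 5.0)
second = blob_features("Total 1,234 (27.9%)", 10.0, 30.0)
print(first)
print(second)
```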
  58. 58. Example Table Extraction Table Source: https://www.dollartreeinfo.com/static-files/0c3687d8-e6ce-4566-bc89-79fc8c8b665e (2016_Proxy_Statement_Final.pdf)
  59. 59. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  60. 60. § Ruled Lines & Colored Boxes – Extend ruled lines over small gaps, “snap” together – Merge touching colored boxes, then convert into lines – Filter out: highlighting, underlining, boxed comments, logos, charts etc. § BUT: A “perfect” ruled-line grid can be incomplete ! – Some lines may be missing – Lines may fail to extend to header rows / columns Separator Line Detection [CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93 [I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93 [F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11 [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12 Table Extraction
  61. 61. Example 1 Table Extraction Table Source: https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/quarterly-result/2015/2015_MDA_q3.pdf
  62. 62. Example 2 Table Extraction Table Source: https://www.ada.gov/restripe.pdf
  63. 63. Example 3 Table Extraction Table Source: http://educationaldatamining.org/files/conferences/EDM2018/EDM2018_Preface_TOC_Proceedings.pdf
  64. 64. § White-space separators (“virtual” lines) – Help define cell span / cell alignment in tables – Prune false-positives by ML or by heuristics [B12] § How to detect white-space separators – Cell-unit (“blob”) bounding box expansion [I93] – Axis projection histograms [CK93] – White-space cover by maximum-area white-space rectangles [F11] § How to prune them (features to use) – Adjacent “blobs” : alignment, size, and content – “Strong” separators that run parallel to or intersect the separator Separator Line Detection Table Extraction [CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93 [I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93 [F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11 [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
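The axis-projection idea can be sketched as follows: project all "blob" bounding boxes onto the X-axis and report the empty runs as candidate vertical white-space separators. This is an illustrative sketch, not the method of [CK93]; the function name, the integer coordinates, and the `min_width` parameter are assumptions.

```python
def vertical_whitespace_gaps(boxes, page_width, min_width=1):
    """Return (start, end) x-ranges covered by no box: candidate column separators.

    boxes: list of (x0, x1) integer extents, half-open [x0, x1).
    """
    # Projection profile: how many boxes cover each x position.
    profile = [0] * page_width
    for x0, x1 in boxes:
        for x in range(max(0, x0), min(page_width, x1)):
            profile[x] += 1

    gaps, start = [], None
    for x in range(page_width + 1):
        empty = x < page_width and profile[x] == 0
        if empty and start is None:
            start = x
        elif not empty and start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps

# Two columns of blobs on a page 60 units wide (made-up numbers).
boxes = [(5, 20), (8, 18), (30, 45), (32, 50)]
gaps = vertical_whitespace_gaps(boxes, page_width=60, min_width=5)
print(gaps)  # [(0, 5), (20, 30), (50, 60)]
```

The first and last gaps are the page margins; a real system would prune them and other false positives using the features listed above.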
  65. 65. § Commonly used to partition page and generate separators – By [C02], [W04], [K14], and others § [H95] The algorithm recursively, for each block: – Computes X- and Y-axis projection profiles – Divides the block into sub-blocks based on dips in profiles: Recursive X-Y Cut Algorithm [H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ‘95 [C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ‘02 [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ‘14 Table Extraction
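A compact sketch of the recursive X-Y cut, working on bounding boxes rather than pixel profiles. It is a simplified variant written for illustration, not the exact procedure of [H95]: the "widest gap first" rule, the `min_gap` threshold, and the function names are assumptions.

```python
def find_gaps(intervals, min_gap):
    """Return gaps (start, end) of width >= min_gap in the union of 1-D intervals."""
    gaps = []
    ordered = sorted(intervals)
    reach = ordered[0][1]
    for lo, hi in ordered[1:]:
        if lo - reach >= min_gap:
            gaps.append((reach, lo))
        reach = max(reach, hi)
    return gaps

def xy_cut(boxes, min_gap=10):
    """Recursively split boxes (x0, y0, x1, y1) at the widest projection gap.

    Returns a list of leaf blocks, each a list of boxes.
    """
    x_gaps = [(g, 0) for g in find_gaps([(b[0], b[2]) for b in boxes], min_gap)]
    y_gaps = [(g, 1) for g in find_gaps([(b[1], b[3]) for b in boxes], min_gap)]
    candidates = x_gaps + y_gaps
    if not candidates:
        return [boxes]  # no dip in either profile: this block is a leaf
    (lo, hi), axis = max(candidates, key=lambda c: c[0][1] - c[0][0])
    first = [b for b in boxes if b[axis + 2] <= lo]   # boxes ending before the gap
    second = [b for b in boxes if b[axis] >= hi]      # boxes starting after the gap
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)

heading = (0, 0, 100, 10)
left_col = (0, 30, 45, 80)
right_col = (55, 30, 100, 80)
blocks = xy_cut([heading, left_col, right_col], min_gap=8)
print(blocks)  # [[heading], [left_col], [right_col]]
```

The page is first cut horizontally below the heading (the Y-profile has a 20-unit dip), and the remaining block is then cut vertically between the two columns.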
  66. 66. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  67. 67. § Ruled Line grids / frames, connected components § (Rows 1st) Stack “table” rows whose “blobs” co-align [L08], [OR09] – Rows are labeled by an ML-classifier (CRF, SVM, HMM) – Labels & matching “blob” layout → table regions – NOTE: Be sure to label “header rows” to tell tables apart ! § (Cols 1st) Cluster overlapping column fragments [HB07], [SS10] – Group table columns horizontally, staying within page layout columns (when possible) – Group vertically if column fragments overlap, match, or subsume – NOTE: Column header areas require special handling ! Generate Candidate Table Regions [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07 [L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08 [OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09 [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10 [K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13 Table Extraction
  68. 68. § (Blobs 1st) Classify text “blobs”, cluster those labeled “table” – [B14] iteratively labels “blobs” given their neighbors’ labels – [B14] trains a Kernel Logistic Regression classifier § (Lines 1st) Find areas where “strong” separators make a grid – [CL12] uses Max-Flow / Min-Cut algorithm to extract grids – Bi-cluster the intersection matrix of horizontal vs. vertical separators Generate Candidate Table Regions [CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12 [B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  69. 69. Non-neg. Matrix Factorization for Grid Clustering: X ≈ UV^T with U ≥ 0, V ≥ 0 and inner dimension k • X_ij = 1 ⇔ lines i and j intersect • At intersections: 1 ≈ u_i1 v_j1 + u_i2 v_j2 + … + u_ik v_jk • Each u_ic v_jc ≥ 0 gives affinity of intersection X_ij to cluster c • u_ic v_jc is large ⇔ u_ic and v_jc both large [Figure: example 0/1 intersection matrix X and its non-negative factors U and V] Generate Candidate Table Regions Table Extraction
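A small pure-Python sketch of non-negative matrix factorization on a 0/1 intersection matrix. It uses the standard Lee-Seung multiplicative updates, which is an assumption: the slides do not say which NMF solver is used. The example matrix, the seed, and all function names are made up for illustration.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def frob_error(x, u, v):
    """Squared Frobenius norm of X - U V^T."""
    approx = matmul(u, transpose(v))
    return sum((x[i][j] - approx[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))

def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor X (n x m) into U (n x k) and V (m x k), both non-negative."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    v = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    for _ in range(iters):
        # U <- U * (X V) / (U V^T V), element-wise
        xv = matmul(x, v)
        uvtv = matmul(u, matmul(transpose(v), v))
        u = [[u[i][c] * xv[i][c] / (uvtv[i][c] + eps) for c in range(k)] for i in range(n)]
        # V <- V * (X^T U) / (V U^T U), element-wise
        xtu = matmul(transpose(x), u)
        vutu = matmul(v, matmul(transpose(u), u))
        v = [[v[j][c] * xtu[j][c] / (vutu[j][c] + eps) for c in range(k)] for j in range(m)]
    return u, v

# Rows = horizontal separators, columns = vertical separators (made-up example):
# lines 0-2 cross verticals 0-2 (one grid), lines 3-4 cross verticals 3-5 (another grid).
X = [[1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 1]]

U0, V0 = nmf(X, k=2, iters=0)   # random starting point, for comparison
U, V = nmf(X, k=2)
row_cluster = [max(range(2), key=lambda c: U[i][c]) for i in range(len(X))]
print(frob_error(X, U0, V0), frob_error(X, U, V))
print(row_cluster)  # cluster index of each horizontal line
```

Each horizontal line is assigned to the cluster where its row of U is largest; vertical lines can be assigned the same way from V. Lines that land in the same cluster are candidates for belonging to the same grid.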
  70. 70. § (Blobs 1st) Classify text “blobs”, cluster those labeled “table” – [B14] iteratively labels “blobs” given their neighbors’ labels – [B14] trains a Kernel Logistic Regression classifier § (Lines 1st) Find areas where “strong” separators make a grid – [CL12] uses Max-Flow / Min-Cut algorithm to extract grids – Bi-cluster the intersection matrix of horizontal vs. vertical separators § (CNN-based) Try a fixed set of table region proposals – CNN shares computation of features across all translations of a given proposal rectangle – Proposal rectangle shapes / sizes are fixed as hyperparameters – If a proposal hits a table, a regression decides table borders Generate Candidate Table Regions [CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12 [B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  71. 71. § Use existing object detection frameworks (Faster R-CNN or YOLO) retrained for table detection § The field is wide open for more table-specific DL approaches – E.g. involving text semantics Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. ArXiv 2019 Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”. KDD 2018 Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017 Gilani et al. “Table Detection using Deep Learning”. ICDAR 2017 Table Extraction Deep Learning for Table Detection
  72. 72. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  73. 73. § Cells define overlap relation along X- or Y-axis – Links headers with data – critical for table understanding § Cell borders ← ruled lines ∪ “strong” white-space lines – Extend lines to make rectangular cells, avoid crossing “blobs” § Ruled grids: test for incompleteness – Multiple numerics per cell – A “strong” white-space line splits text in ≥ 2 cells – A “mini-table” inside a ruled cell – Cell structure extends beyond table frame § White-space grids: clean up empty cells – Expand header cells by merging with empty cells [S06] – Merge (almost-) empty rows and columns Cell Structure: Line Based Table Extraction [S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06 [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
  74. 74. § Use Spatial Constraints to find an overlap DAG over cells [H03] § Use Graph Neural Networks to find 2 undirected graphs: Cell Structure: Graph Based [H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03 [Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019 [C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019 [Q19] [C19] Table Extraction – “Same Row” graph & “Same Column” graph – Two cells share an edge ⇔ share a row / a column – [Q19] : Rows and columns = maximal cliques – [C19] : Only adjacent cells share a graph edge
  75. 75. Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017 Table Extraction Cell Structure: CNN Based § Object detection networks have also been used for cell structure detection
  76. 76. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  77. 77. § Eliminate false positive tables § Detect malformed table regions – Plain text in tables – Missing row / column headers or split-off pieces – One region covers multiple tables § Compare alternative table candidates – Example: Is this 1 table or 2 tables? § Improve table region and structure – Pick the best adjustment out of a range of options – NOTE: Knowing cell structure helps region scoring / adjustment § Provide a confidence value for output tables Why Score Tables? Table Extraction
  78. 78. § Tables are very diverse – Tiny or huge, misaligned, text in cells, key-value pairs, confusing delimiters – Complex row / column headers – so different, easy to chop off ! § What’s around the table also matters – Can its columns or rows be extended? Should they be? § One table, or ≥ 2 adjacent tables? – 1 table may have: ruled bars, wide gaps, font / alignment changes – 2 tables may be: fully or partly co-aligned, separated in one of many ways § Non-table text can have complex structure, too – Page headers / footers, framed / highlighted text, hierarchical lists, … Table Scoring Challenges Table Extraction
  79. 79. Example 1 Table Extraction Table Source: https://www.legislation.gov.au/Details/F2010C00607/0d99393c-5c5b-4af0-9cc1-b5c2de8632c3 (F2010C00607.pdf) NOT A TABLE !
  80. 80. Example 2 Table Extraction Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf Row headers Column headers
  81. 81. Example 3 Table Extraction Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf Row headers Column headers
  82. 82. Example 4 Table Extraction Table Source: https://assets.ctfassets.net/rz9m1rynx8pv/2x3p5ompzZyrRtAHw4M3XB/be648275661795139cabcee29a730630/TELUS_Q1_2019_quarterly_report.pdf Row headers Column headers
  83. 83. § Rule-out patterns – Rule out charts, lists, signature blocks etc. § Aggregated column / row score – [KD01] Aggregate the similarities that led to the table’s column fragments § Dynamic programming score – [H99] Score (T) = max { Score (T – line) + Merit (line) } – Score the best split into 2 sub-tables § Probability of being a table (given the features) – [W04] Partition page into blocks labeled “table” and “plain text” – Compute label probability for block + neighboring blocks § A scoring neural network on top of CNN [G17, S18b] How to Score a Table [H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99 [KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01 [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  84. 84. § Columns and rows: – Number, span / extent, alignment, font / content similarity § Ruled and white-space separators: – Number, span / extent, width of their margins – If they match, reach (good) or cross (bad) table borders § Inside vs. outside table: – Border crossing ruled lines, aligned blocks, or highly similar text – The two sides have matching structure § Cell structure: – Oversized cells, misaligned pairs of cells, “runs” of empty cells § Content: – Numerics, repeated words; customizable keywords – Domain-specific “expectations,” e.g. header dictionary [D11] § CNN-generated features Features for Table Scoring [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11 Table Extraction
  85. 85. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  86. 86. § Leverage table features and score – Specify what a well-formed vs. malformed table looks like § Use a transparent, explainable method – If detection is a “black box”, adjustment uses explainable rules & features § Correct errors quickly – Bypass the need for extra ground-truth data, retraining § Customize to address specific concerns – Add custom features, rules, and constraints Why Adjust Tables? [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07 [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10 [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  87. 87. § Merge table with an adjacent table or text-block [W04] [SS10] § Adjust table border – add or drop rows or columns [HB07] [D11] § Split one table into two, possibly with plain text between § Re-compute table region by neural network regression [G17] [S18b] § Choose best-scoring border (or structure) out of a range of options § Iterate adjustment → traverse a search tree of candidate tables How to Adjust Candidate Tables [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07 [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10 [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  88. 88. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  89. 89. What if candidate tables overlap each other? § [H99] uses Dynamic Programming: – Only for top and bottom line-positions: [i, j] – Score disjoint unions of tables: § CNN-based object detection systems: – Greedy Approach: Pick the top-scoring region, repeat – PROBLEM: Lower-scoring table may have a high-scoring sub-table § Maximum Weighted Independent Set – Nodes = tables, edges = conflicts, weights = table scores – NP-hard even for 2-dim rectangles [RN95], but can be solved efficiently in real-life cases Select Best Tables for Output [H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99 [RN95] C.S. Rim and K. Nakajima. “On Rectangle Intersection and Overlap Graphs”, IEEE Trans. on Circuits & Systems I, 42(9), 1995 Table Extraction [Figure: conflict graph over candidate tables] Conflict = Table Overlap
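To illustrate why greedy selection can be suboptimal, here is a small sketch that compares it with an exact maximum weighted independent set found by brute force. The candidate boxes and scores are invented, and the exhaustive search is only meant for a handful of candidates per page; it is not a method taken from the cited papers.

```python
from itertools import combinations

def overlaps(a, b):
    """True if rectangles (x0, y0, x1, y1) share interior area (a conflict)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def greedy_select(cands):
    """Repeatedly pick the top-scoring candidate that conflicts with nothing chosen so far."""
    chosen = []
    for c in sorted(cands, key=lambda c: -c["score"]):
        if all(not overlaps(c["box"], o["box"]) for o in chosen):
            chosen.append(c)
    return chosen

def best_select(cands):
    """Exact maximum weighted independent set by enumerating all subsets."""
    best, best_score = [], 0.0
    for r in range(1, len(cands) + 1):
        for subset in combinations(cands, r):
            if any(overlaps(a["box"], b["box"]) for a, b in combinations(subset, 2)):
                continue
            score = sum(c["score"] for c in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best

# One region spanning two stacked tables vs. the two tables on their own (made-up scores).
cands = [
    {"name": "A", "box": (0, 0, 100, 100), "score": 0.6},
    {"name": "B", "box": (0, 0, 100, 45), "score": 0.5},
    {"name": "C", "box": (0, 55, 100, 100), "score": 0.5},
]
greedy_names = [c["name"] for c in greedy_select(cands)]
best_names = [c["name"] for c in best_select(cands)]
print(greedy_names)  # ['A']
print(best_names)    # ['B', 'C']
```

Greedy commits to the single top-scoring region A and can then add nothing else, while the exact solution keeps the two non-overlapping tables B and C with a higher total score.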
  90. 90. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  91. 91. § Accuracy Metrics – Exact match of table region or structure is too inflexible – Partial match: Text? Area? Cell relationship? Functional? § Ground Truth Labeling – Very time consuming, requires sophisticated UI tools – Humans disagree on what’s correct § Optimization (pre- deep learning) – Lots of discrete, non-differentiable steps – Learn sub-tasks, e.g. row labeling with CRF / SVM – [W04] Global parameter learning: Learning from Data: Challenges [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 Table Extraction
  92. 92. Table Boundary § Purity & Completeness § Character level recall, precision and F1 Table Structure § Recall and Precision of Cell Adjacency Relations Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12 ICDAR 2013 Competition Metrics Table Extraction Accuracy Metrics
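A simplified sketch of the cell-adjacency metric for table structure. It assumes tables are plain grids of strings with no spanning cells, links each non-blank cell to its nearest non-blank neighbor to the right and below, and compares relations as sets (so repeated cell texts collapse). These simplifications and the helper names are assumptions, not the official ICDAR 2013 evaluation code.

```python
def adjacency_relations(grid):
    """Set of (cell, neighbour, direction) for nearest non-blank right/below neighbours."""
    rels = set()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue  # blank cells generate no relations
            for c2 in range(c + 1, cols):
                if grid[r][c2]:
                    rels.add((grid[r][c], grid[r][c2], "horizontal"))
                    break
            for r2 in range(r + 1, rows):
                if grid[r2][c]:
                    rels.add((grid[r][c], grid[r2][c], "vertical"))
                    break
    return rels

def precision_recall(truth, predicted):
    t, p = adjacency_relations(truth), adjacency_relations(predicted)
    correct = len(t & p)
    return correct / len(p), correct / len(t)

truth = [["Item", "Q1", "Q2"],
         ["Revenue", "10", "12"]]
# Prediction that over-merged the two numeric columns into one.
predicted = [["Item", "Q1 Q2"],
             ["Revenue", "10 12"]]

precision, recall = precision_recall(truth, predicted)
print(precision, recall)  # 0.25 and 1/7
```

Only the vertical relation between "Item" and "Revenue" survives the over-merge, so one of the four predicted relations is correct and one of the seven true relations is recovered.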
  93. 93. § Measure what actually matters downstream § Capture accuracy of access paths to each cell § Need header annotation as well as cell structure Table Extraction Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12 Accuracy Metrics Functional Metrics
  94. 94. Ground Truth Datasets Complete Datasets with table boundary and cell structure: § ICDAR-2013 competition (PDF format) § ICDAR-2019 competition (image format) § SciTSR 2019 (generated from LaTeX files) Incomplete Datasets § TableBank (full table boundary information only) § PDF-TREX (financial table dataset without ground truth labels) § Marmot (ground truth for table boundary only, cells inaccessible) § UNLV, UW-3 (table structure and boundary annotations for scanned documents) Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. ArXiv 2019 Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12 Oro et al. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”. ICDAR '09 Fang et al. “Dataset Ground-Truth and Performance Metrics for Table Detection Evaluation”. DAS '12 Chi et al. “Complicated Table Structure Recognition”. arXiv 2019 Table Extraction
  95. 95. Example: Accuracy Comparison § Table detection accuracy on the ICDAR 2013 Competition dataset: Table Extraction
  96. 96. Table Understanding
  97. 97. Table Understanding Table Understanding HTML Document Table Extraction (Optional) Document (PDF, Image) Representation of table contents that: • Captures semantic information • Is amenable to post-processing Table Understanding Output Knowledge Base Creation Downstream Tasks Question Answering e.g., Air Canada’s oper. revenues in Q3 2015? HTML Understand the semantics of tabular data
  101. 101. Semantics of Tabular Data Table Understanding What does this cell represent?
  102. 102. Semantics of Tabular Data Table Understanding The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars. “ “
  103. 103. Semantics of Tabular Data Table Understanding The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars. “ “ Information about a single cell is derived from multiple places
  104. 104. What You Will Learn Table Understanding Components of table understanding • What are the different types of semantic information about a table? • Where can they be found? 1
  105. 105. What You Will Learn Table Understanding Components of table understanding Table understanding Methods • What are the different types of semantic information about a table? • Where can they be found? 1 2 • What techniques are used to extract info for table understanding? • What learning methods can be used?
  106. 106. What You Will Learn Table Understanding Components of table understanding Table understanding Methods • What are the different types of semantic information about a table? • Where can they be found? 1 2 • How do tables differ between domains? • How do the assumptions of proposed approaches affect their potential applicability to other domains? Importance of Domain3 • What techniques are used to extract info for table understanding? • What learning methods can be used?
  107. 107. Outline: Components of Table Understanding Table Understanding A. Table Regions (Column/Row Headers) B. Context Within Table C. Context Within Document D. Context Outside Document
  108. 108. Outline: A. Table Regions Table Understanding Column Headers (incl. nesting) Row Headers (incl. nesting) Data/Body Cells Main table regions Metadata
  109. 109. Unsupervised Methods Overview Table Understanding Header rows/cols "look different” than data rows/cols
  110. 110. Unsupervised Methods Overview Table Understanding Header rows/cols "look different” than data rows/cols Similarity Features
  111. 111. Unsupervised Methods Overview Table Understanding Header rows/cols "look different” than data rows/cols Heuristics Similarity Features • Which heuristics to use?
  112. 112. Unsupervised Methods: Local Minimum Table Understanding J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 For column (row) headers: Find first row (col) that looks “different” Pair-wise similarity of consecutive rows Local minimum of similarity
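The local-minimum heuristic can be sketched as follows: compute a similarity between each pair of consecutive rows and place the header boundary at the first strict local minimum. The similarity used here (agreement of numeric vs. text cell types) and the function names are stand-in assumptions; Fang et al. use their own feature set.

```python
def is_numeric(cell):
    """Crude numeric test: digits with an optional decimal point and thousands commas."""
    return cell.replace(".", "", 1).replace(",", "").isdigit()

def similarity(a, b):
    """Fraction of columns whose cells have the same type (numeric vs. text)."""
    same = sum(1 for x, y in zip(a, b) if is_numeric(x) == is_numeric(y))
    return same / len(a)

def header_boundary(rows):
    """Number of leading header rows, i.e. the first strict local minimum of similarity."""
    sims = [similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s < left and s < right:
            return i + 1
    return 0  # no row "looks different": assume no header

table = [["Region", "Q1", "Q2"],
         ["North", "10", "12"],
         ["South", "9", "11"],
         ["West", "7", "8"]]
n_header_rows = header_boundary(table)
print(n_header_rows)  # 1
```

The same routine applied to the transposed table would look for row-header columns instead of column-header rows.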
  113. 113. Unsupervised Methods: Indexing Table Understanding S. Seth et al. “Segmenting tables via indexing of value cells by table headers”. ICDAR ‘13 • Use empty and repeated cells to find critical cells that outline the stubhead • Independent of visual aspects of table Repeated cell implying hierarchical row header Empty cells implying hierarchical column header
  114. 114. Traditional ML Methods Overview Table Understanding Header rows/cols "look different” than data rows/cols Traditional ML Methods Similarity Features Column Headers Data Cells Classification Labels • How to model this as a classification problem? • Which ML method and features to use?
  115. 115. Traditional ML Methods: Row/Column Classification Table Understanding J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 Data row Data row Data row Data row Data row Data row Column header row Classify rows as column header rows (similarly for row header columns) D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
  116. 116. Header Identification Results Table Understanding S. Seth et al. “Segmenting tables via indexing of value cells by table headers”. ICDAR ‘13 R. Rastan et al. “TEXUS: A unified framework for extracting and understanding tables in PDF documents”. Information Processing & Management Government Statistic Table Set (Seth): Seth et al.: Correct Segmentation 99%, Correct Stub Head (Critical Cell) 100%; TEXUS: Correct Segmentation 100%, Correct Stub Head (Critical Cell) 100% ASX-Announcements Dataset (TEXUS): TEXUS: Correct Segmentation -, Correct Stub Head (Critical Cell) 42.9%
  117. 117. No standard benchmark or dataset Table Understanding J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03 FedStat Textfile Dataset (Pinto) CiteSeerX PDF Dataset (Fang)
  118. 118. Traditional ML Methods: Table Classification Table Understanding Web Data Commons – Web Table Corpora Classify entire tables, e.g., Relational Table, Entity/Listing Table, Matrix Table
  119. 119. Traditional ML Methods: Table Classification Table Understanding Web Data Commons – Web Table Corpora Classify entire tables • Table class implies header structure Relational Table Entity/Listing Table Matrix Table
  120. 120. Traditional ML Methods: Table Classification Table Understanding Web Data Commons – Web Table Corpora Classify entire tables • Table class implies header structure • Can be used for header identification under certain assumptions Relational Table: single col header row Entity/Listing Table: single row header col Matrix Table: single col header row and single row header col
  121. 121. Traditional ML Methods: Table Classification Table Understanding Table Classes Genuine vs Non-genuine Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web“. WWW ‘02 Relational vs Non-relational M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08 I. Relational Knowledge: Listing, Attribute/Value, Matrix, Calendar, Enumeration, Form II. Layout: Navigational, Formatting E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11 Vertical listings, horizontal listings, matrix tables J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
  122. 122. Traditional ML Methods: Table Classification Table Understanding ML Methods Decision Tree, SVM Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web“. WWW ‘02 Rule-based Classifier (WEKA) M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08 Gradient Boosted Decision Tree E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11 Decision Tree (CART, C4.5, Random Forest), SVM J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
  123. 123. Traditional ML Methods Table Understanding Neighborhood and Table Features • Difference in number of non-empty cells • Average alignment • Percentage of same cell data type • Percentage of same cell font style • Content repetition • Number and standard deviation of rows and columns Cell Features • Number of non-empty cells • Average cell length • Percentage of numeric characters • Percentage of symbolic characters • Average font size • Cell font styles • Cell positioning in the table • Percentage of cells spanning multiple cols/rows • HTML tags (if applicable) • Cell span J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”, BDC ‘15
  124. 124. Deep Learning Methods Overview Table Understanding Header rows/cols "look different” than data rows/cols Deep Learning Methods Similarity Features Column Headers Data Cells Classification Labels • Which deep learning architecture to use?
  125. 125. Deep Learning Methods: Hierarchical Attention Network Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables] Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16 Hierarchical RNN proposed to leverage document structure: • 2 layers: • Words • Sentences
  126. 126. Deep Learning Methods: Hierarchical Attention Network Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables] Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16 Extend to tables: • 3 layers • Tokens • Cells • Rows or Columns • Bidirectional network • Combine row-directional and column-directional network
  127. 127. Deep Learning Methods: RNN-CNN Hybrid (TabNet) Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ‘17 LSTM captures semantic representation of each cell CNN captures relationship between cells
  128. 128. Deep Learning Methods: RNN-CNN Hybrid (TabNet) Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ‘17 LSTM captures cell text together with coordinates and other HTML tags (i.e., formatting)
  129. 129. Deep Learning Methods: Results Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ‘17 (Rule-Based) (Decision Tree) (Decision Tree) (Hierarchical Attention for Documents) (RNN-CNN Hybrid)
  130. 130. Beyond Flat Headers: Hierarchical Row Headers Table Understanding Hierarchical Row Headers
  131. 131. Beyond Flat Headers: Hierarchical Row Headers Table Understanding Identify hierarchical relationship among row headers Complex semantic row header hierarchy: Multiple cells in the same row header column are semantically related to each other
  132. 132. Beyond Flat Headers: Hierarchy as a Graphical Model Table Understanding Z. Chen et al. “Integrating Spreadsheet Data via Accurate and Low-Effort Extraction”. KDD ‘14 Encode hierarchy as graphical model • Variable: Candidate parent-child pair • Node potentials: Features for predicting parent-child pairs • Edge potentials: Correlations of variables based on style, KB affinity, …
  133. 133. Pairwise vs Rectangle cell relationships Table Understanding • Pairwise classification can only utilize local information • Simply looking at the pair may not be sufficient to determine the relation • A rectangle is “interesting” if it is the support rectangle of some cell, called a header cell of that rectangle
  134. 134. Beyond Flat Headers: Hierarchy as Rectangle Relationship Table Understanding X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of Financial Tables”. ICDAR ‘17 Two “interesting” rectangles: • “Assets” (row 1) heads rows 2-17 • “Current” (row 2) heads rows 3-11
  135. 135. Beyond Flat Headers: Hierarchy as Rectangle Relationship Table Understanding X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of Financial Tables”. ICDAR ‘17 When a “total” row is considered as a parent candidate, it cannot take children. For each iteration: • Combine: Consecutive minimal rectangles with equal features • Attach: Minimal rectangle r_i to directly preceding rectangle r_{i-1} if r_{i-1} > r_i
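The parent-child idea behind hierarchical row headers can be illustrated with a much simpler stand-in than rectangle mining. The sketch below is not the method of the cited paper: it assumes that indentation is the only feature, that a less-indented row outranks a more-indented one, and that "total" rows are recognizable by their label. The function name `assign_parents` and the sample rows are hypothetical.

```python
from typing import List, Optional, Tuple


def assign_parents(rows: List[Tuple[str, int]]) -> List[Optional[int]]:
    """Assign each row header a parent row index (or None for top-level rows).

    rows: (label, indent) pairs in table order. Assumption: smaller indent
    means higher in the hierarchy. This is a simplified stack-based stand-in,
    not the rectangle mining procedure itself.
    """
    parents: List[Optional[int]] = []
    open_headers: List[int] = []  # indices of candidate parents, indent strictly increasing
    for i, (label, indent) in enumerate(rows):
        # Close every candidate that is not strictly less indented than this row.
        while open_headers and rows[open_headers[-1]][1] >= indent:
            open_headers.pop()
        parents.append(open_headers[-1] if open_headers else None)
        # A "total" row can have a parent but never takes children.
        if not label.lower().startswith("total"):
            open_headers.append(i)
    return parents


rows = [
    ("Assets", 0),
    ("Current", 1),
    ("Cash", 2),
    ("Receivables", 2),
    ("Total current", 1),
    ("Property", 1),
    ("Total assets", 0),
]
print(assign_parents(rows))
```

In this toy input, "Cash" and "Receivables" attach to "Current", while "Total current" attaches to "Assets" and is never offered as a parent to "Property".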
  136. 136. Outline: B. Context Within Table Table Understanding Additional semantic information within the table - of different types (e.g., Currency, Scale)
  137. 137. Outline: B. Context Within Table Table Understanding Additional semantic information within the table - of different types - of different scope Propagate to all data cells
  138. 138. Outline: B. Context Within Table Table Understanding Additional semantic information within the table - of different types - of different scope Propagate to subset of data cells
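The two scopes above (context that applies to all data cells versus context that applies to a subset) can be sketched as a merge of dictionaries. This is a minimal illustration under stated assumptions: context is modeled as key/value pairs, the narrower scope is a column, and a column-level value overrides a table-level one. The names `propagate_context` and `normalize` are hypothetical.

```python
from typing import Any, Dict, List

Context = Dict[str, Any]


def propagate_context(
    table_ctx: Context,
    column_ctx: Dict[int, Context],
    n_rows: int,
    n_cols: int,
) -> List[List[Context]]:
    """Build a per-cell context grid.

    Table-wide context reaches every data cell; column-scoped context
    reaches only the cells of that column and overrides table-wide keys.
    """
    grid: List[List[Context]] = []
    for _ in range(n_rows):
        row: List[Context] = []
        for c in range(n_cols):
            ctx = dict(table_ctx)              # start from the widest scope
            ctx.update(column_ctx.get(c, {}))  # narrower scope wins
            row.append(ctx)
        grid.append(row)
    return grid


def normalize(raw: float, ctx: Context) -> float:
    """Apply the scale factor found in the cell's context (default 1)."""
    return raw * ctx.get("scale", 1)


# Table header says "in millions of Canadian dollars"; column 1 holds percentages.
grid = propagate_context(
    {"currency": "CAD", "scale": 1_000_000},
    {1: {"currency": None, "scale": 1, "unit": "%"}},
    n_rows=2,
    n_cols=2,
)
print(normalize(13, grid[0][0]), grid[0][0]["currency"])
```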
  139. 139. Outline: C. Context Within Document Table Understanding Additional context outside the table within the same document - leverage relevant text and tables
  140. 140. Table Context Within Document Table Understanding Surrounding text often contains important info about a table Deeper Semantic Understanding • Link text to table • Generate table title Shallow Context Extraction • Extract table metadata
  141. 141. Extract Table Metadata Table Understanding Ying Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”. JCDL ’07 Document title Page Table Caption Document authors
  142. 142. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words Example sentence: “However, mirroring the overall softness of the tech sector, sales of computer hardware decreased 1% versus a year-ago to $1.6 billion.”
  143. 143. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words • Syntactic analysis
  144. 144. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words • Syntactic analysis • Semantic analysis (word2vec) Example sentence: “…talking about topics is an important reason to email with these special interest groups.”
  145. 145. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words • Syntactic analysis • Semantic analysis • Use rules to refine matches
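The "unique words" matching step can be sketched as follows. This covers only that one step and is not the cited system: it assumes headers are already identified, treats a word as unique if it occurs in exactly one header of its axis, and links a sentence to the cells where a matched row meets a matched column. The header strings and the function names are hypothetical; syntactic analysis, semantic analysis, and rule-based refinement are left out.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unique_words(headers: List[str]) -> List[Set[str]]:
    """For each header, keep only the words that occur in no other header."""
    toks = [tokens(h) for h in headers]
    result = []
    for i, own in enumerate(toks):
        others: Set[str] = set().union(*(toks[:i] + toks[i + 1:]))
        result.append(own - others)
    return result


def link_sentence(
    sentence: str, row_headers: List[str], col_headers: List[str]
) -> List[Tuple[int, int]]:
    """Return (row, column) indices of cells the sentence appears to describe."""
    words = tokens(sentence)
    rows = [i for i, u in enumerate(unique_words(row_headers)) if u & words]
    cols = [j for j, u in enumerate(unique_words(col_headers)) if u & words]
    return [(i, j) for i in rows for j in cols]


row_headers = ["Computer hardware", "Software", "Services"]
col_headers = ["Sales", "Change versus year-ago"]
sentence = ("However, mirroring the overall softness of the tech sector, sales of "
            "computer hardware decreased 1% versus a year-ago to $1.6 billion.")
print(link_sentence(sentence, row_headers, col_headers))
```

Restricting the match to words that are unique within an axis keeps generic terms shared by many headers from linking a sentence to every row.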
  146. 146. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Problem: • Web tables lack titles or • Existing titles lack context Table Title?
  147. 147. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Solution: • Leverage surrounding context to generate table title Table + Surrounding Context Table Title Problem: • Web tables lack titles or • Existing titles lack context
  148. 148. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Context used as input: • Page Title • Section headers (<h...> tags) • Column headers • Spanning column headers as a special case • Table caption (<caption> tag) Table + Surrounding Context Table Title Context ignored due to noise: • Text right before/after table • Table rows
  149. 149. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Model Design • Pointer-generator network • First proposed for abstractive summarization • Combines copy & generator mechanism Table + Surrounding Context Table Title
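The way a pointer-generator network combines copying and generating can be sketched with plain dictionaries. In the standard formulation from abstractive summarization, the final probability of a word is a mixture: a generation probability `p_gen` weights the decoder's vocabulary distribution, and `1 - p_gen` weights the attention mass on source positions holding that word. The numbers below are made up for illustration, and the function name `final_distribution` is hypothetical; a real model would compute `p_gen`, the vocabulary distribution, and the attention weights with learned layers.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_dist: Dict[str, float],
    source_tokens: List[str],
    attention: List[float],
) -> Dict[str, float]:
    """Mix the generator and copy distributions for one decoding step."""
    mixed: Dict[str, float] = defaultdict(float)
    # Generate: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # Copy: add attention mass for every source position holding the word,
    # which lets out-of-vocabulary source words receive probability.
    for word, weight in zip(source_tokens, attention):
        mixed[word] += (1.0 - p_gen) * weight
    return dict(mixed)


dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"results": 0.6, "table": 0.4},
    source_tokens=["ICDM", "results", "ICDM"],  # e.g. tokens from the page title
    attention=[0.5, 0.2, 0.3],
)
print(dist)
```

Here "ICDM" is absent from the generator vocabulary but still receives probability through the copy path, which is why this design suits titles that reuse words from the surrounding context.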
  150. 150. Outline: D. Context Outside Document Table Understanding Additional context outside the table from other resources - link to knowledge bases
  151. 151. Table to KB Linking Table Understanding Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19 Link different parts of the table to external knowledge bases Link Columns (known as Column Type Identification) Link Rows/Cells (known as Entity Linking)
  152. 152. Table to KB Linking: Link Columns Table Understanding Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
  153. 153. Table to KB Linking: Link Rows/Cells Table Understanding Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
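The two linking tasks can be illustrated with a toy in-memory knowledge base. This is a sketch under strong assumptions: the `KB` dictionary, its identifiers, and its types are invented for the example, cells are linked by exact match after lowercasing, and the column type is taken as the majority type of the linked cells. Real systems use candidate generation and disambiguation rather than exact lookup.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical knowledge base: normalized mention -> (entity id, entity type).
KB: Dict[str, Tuple[str, str]] = {
    "air canada": ("kb:AirCanada", "Airline"),
    "lufthansa": ("kb:Lufthansa", "Airline"),
    "toronto": ("kb:Toronto", "City"),
}


def link_cell(text: str) -> Optional[Tuple[str, str]]:
    """Entity linking for one cell: exact lookup of the normalized cell text."""
    return KB.get(text.strip().lower())


def column_type(cells: List[str]) -> Optional[str]:
    """Column type identification: majority type among the cells that link."""
    types = [link[1] for link in map(link_cell, cells) if link is not None]
    if not types:
        return None
    return Counter(types).most_common(1)[0][0]


column = ["Air Canada", "Lufthansa", "Unknown Air"]
print(column_type(column), [link_cell(c) for c in column])
```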
  154. 154. Understanding Tabular Data: Putting it All Together Table Understanding What does this cell represent?
  155. 155. Understanding Tabular Data: Putting it All Together Table Understanding What does this cell represent? A. Identify table regions (column/row headers)
  156. 156. Understanding Tabular Data: Putting it All Together Table Understanding B. Identify additional context within table
  157. 157. Understanding Tabular Data: Putting it All Together Table Understanding C. Identify context within document
  158. 158. Understanding Tabular Data: Putting it All Together Table Understanding D. Identify context outside document
  159. 159. Understanding Tabular Data: Putting it All Together Table Understanding “The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
  160. 160. Final Takeaways 1. Table extraction & understanding have a rich history of methods spanning many decades 2. Tables from different domains are not the same; a general table extraction & understanding system needs to consider the diversity of type, style, and content of tables 3. Both semantic and visual features are crucial for improving table extraction and understanding 4. As a community, we need to standardize tasks, evaluation metrics, and datasets
  161. 161. Build for the future by unlocking the past...
