Tutorial for "table extraction and understanding for scientific and enterprise applications" as presented at ICDM 2019, organized by Yannis Katsis, Alexandre V Evfimievski, Nancy Wang, Douglas Burdick, Marina Danilevsky
1. Table Extraction and Understanding
for Scientific and Enterprise
Applications
Yannis Katsis
Doug Burdick
Nancy Wang
Alexandre V Evfimievski
Marina Danilevsky
IBM Research - Almaden
4. Introduction Outline
§ Problem definition
– Table Extraction
– Table Understanding
§ Challenges
– Limited document format support for table structure
– Table variety
§ Applications
– Knowledge Base Population
– Query Answering
– Leaderboard Construction
– Information Extraction
§ Demo
Introduction
5. Tables are a popular data representation
Introduction
Government
Reports
Scientific Papers
Financial Reports
Invoices
Contracts
Loan Agreements
Compact
Easy to understand*
(*) For humans
6. End-to-end example
§ What does the value 672 in the following table mean?
§ Answer: Net earnings for the three months ended July 29, 2017 were $672 million (USD)
Steps:
1) Find the location of the table on the page
2) Find the cells in the column containing “672”
3) Find the cells in the row containing “672”
4) Identify the aligned row / column header cells
5) Normalize using additional context from the table
Introduction
Table Extraction: Identify table location and structure
Table Understanding: Provide semantic context to table values
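Steps 2 to 5 of the end-to-end example can be sketched on a toy grid. This is only an illustration: the grid contents, the function names `explain_value` and `normalize`, and the "(in millions)" unit note are hypothetical, since the slide's table image is not reproduced here. Only the value 672 and its meaning come from the slide.

```python
# Toy illustration of steps 2-5: locate a value in an already-extracted grid,
# read off its aligned row and column headers, then normalize it.
# HYPOTHETICAL data: every cell except "672" and its two headers is invented.
GRID = [
    ["(in millions)", "Three months ended July 29, 2017", "Three months ended July 30, 2016"],
    ["Revenue", "8,000", "7,500"],
    ["Net earnings", "672", "600"],
]

def explain_value(grid, target):
    """Return (row_header, column_header) for the first data cell equal to target."""
    header_row = grid[0]
    for row in grid[1:]:
        for col_idx, cell in enumerate(row[1:], start=1):
            if cell == target:
                return row[0], header_row[col_idx]
    return None

def normalize(value, unit_note):
    """Apply table-level context such as '(in millions)' to a raw cell value."""
    number = float(value.replace(",", ""))
    scale = 1_000_000 if "millions" in unit_note else 1
    return number * scale

row_header, col_header = explain_value(GRID, "672")
amount = normalize("672", GRID[0][0])
print(f"{row_header}, {col_header}: ${amount:,.0f}")
```

Step 1 (finding the table on the page) is assumed to have happened already; it is the subject of the Table Extraction part of the tutorial.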
7. Table Extraction: Problem Definition
Input: Document contents in native format
- PDF
- Image
- Office Docs
- …
Output: Document contents with tabular information:
1) Table border for each table
2) Partitioning table contents into cells
3) Both vertical and horizontal alignment of cells
8. Table Understanding: Problem Definition
Input: Document contents with tabular information:
1) Table border for each table
2) Partitioning table contents into cells
3) Both vertical and horizontal alignment of cells
Output: Table content representation:
1) Captures semantic information
2) Amenable to post-processing
10. Table Understanding: Example
Output: Table content representation:
1) Captures semantic information
2) Amenable to post-processing
Value     Norm Value     Year   Time Period   Type      LineItem
206,924   $206,924,000   2015   Q3            Expense   Aircraft Fuel
286,817   $286,817,000   2014   Q3            Expense   Airport Fuel
142,176   $142,176,000   2015   Q3            Expense   Airport operations
…         …              …      …             …         …

Change    Change Normalized   Begin Time Period   End Time Period
(27.9%)   -27.9%              Q3 2014             Q3 2015
10.7%     10.7%               Q3 2014             Q3 2015
…         …                   …                   …
Input: Document contents with tabular information:
1) Table border for each table
2) Partitioning table contents into cells
3) Both vertical and horizontal alignment of cells
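The value normalization in this example (raw cell text to "Norm Value" and "Change Normalized") can be sketched as follows. The helper names are hypothetical, and the scale factor of 1,000 is an assumption inferred from the example itself (206,924 becomes $206,924,000); in a real document it would come from table context such as a "(in thousands)" note.

```python
import re

def normalize_amount(text, scale=1000, currency="$"):
    """Turn a cell such as '206,924' (assumed reported in thousands) into '$206,924,000'."""
    digits = int(text.replace(",", ""))
    return f"{currency}{digits * scale:,}"

def normalize_change(text):
    """Turn accounting-style negatives such as '(27.9%)' into '-27.9%'."""
    stripped = text.strip()
    match = re.fullmatch(r"\((.+)\)", stripped)
    return f"-{match.group(1)}" if match else stripped

print(normalize_amount("206,924"))
print(normalize_change("(27.9%)"))
print(normalize_change("10.7%"))
```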
12. Challenge: Table structure representation varies across document formats
None Partial Complete
HTML
MS Excel
MS Word
TXT
PDF
Image
H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI ‘19
Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM ‘17
M. Cafarella et al. “WebTables: Exploring the Power of Tables on the Web”. VLDB ‘08
Table Understanding is still required for all document types
Introduction
14. Table structure representation varies across Excel documents
None Partial Complete
MS Excel
Each sheet is a separate table
Multiple tables defined in a single sheet
H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI ‘19
Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM ‘17
Introduction
15. Table structure representation varies across Word documents
None Partial Complete
MS Word
Omit Office Table Object
Use Office Table Object for all tables
Introduction
17. PDF Document Format
…
BT
0.0503 Tc
8.503556 0 0 8.52 503.2795 688.92 Tm
/Tc2 1 Tf
[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 (
) 28 (e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ] TJ
0 Tc
ET
…
Q
q
46.91952 776.52 m
242.04 776.52 l
242.04 729.96 l
144.48 729.96 l
46.91952 729.96 l
h
…
Draw “m” at (503, 688) in 8.5 point font in color white
Draw “o” at …
Draw “n” at …
Draw “t” at …
Draw “h” at …
Draw “s” at …
…
Draw green line segment from (46, 776) to (242, 776)
Draw green line segment from (242, 776) to (242, 729)
…
• A programmatic PDF is a collection of instructions for drawing characters and line segments on a page, with visual formatting information
• 2–4 trillion PDFs are in existence, and the number is growing rapidly
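To make the "collection of drawing instructions" point concrete, here is a minimal sketch that recovers the text from the `TJ` operand shown in the excerpt above. It is not a PDF parser: it assumes the operand has already been isolated, and it ignores escape sequences, font encodings, and the positioning operators (`Tm`, `Tf`, `Tc`). The function name `text_from_tj` is hypothetical.

```python
import re

# The TJ operand from the excerpt, joined onto one line.
TJ_OPERAND = (
    "[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 "
    "(e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ]"
)

def text_from_tj(operand):
    """Concatenate the parenthesised strings; the numbers between them are kerning offsets."""
    return "".join(re.findall(r"\(([^()]*)\)", operand))

print(repr(text_from_tj(TJ_OPERAND)))
```

Even after the characters are recovered this way, nothing in the stream says that they belong to a table cell; that structure has to be inferred from positions and drawn lines.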
Rendered PDF PDF Binary
Introduction
18. Complex tables – graphical lines can be misleading – is this 1, 2 or 3 tables?
Table with visual clues only
Multi-row, multi-column column headers
Nested row headers
Tables with textual content
Table with graphic lines
Table interleaved with text and charts
Challenge: Variety in Tables
Introduction
20. Excerpt of semi-structured XBRL file for financial statements
Ex: Delta Air Lines, Inc. 2014 Annual Report Form 10-K
https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal-20141231.xml
:
<xbrli:context id="FI2013Q4"><xbrli:entity>
<xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier></xbrli:entity>
<xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
</xbrli:context>
:
<us-gaap:CashAndCashEquivalentsAtCarryingValue
contextRef="FI2013Q4"
decimals="-6"
id="Fact-C39BEC178121A91816968BA9ADCF421F"
unitRef="usd">
2844000000
</us-gaap:CashAndCashEquivalentsAtCarryingValue>
:
Excerpt of HTML file with granular financial metric data
https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal1231201410k.htm
Valuable metrics for the airline industry are present only in the HTML table
Valuable metrics present in the semi-structured raw data source MUST BE INTEGRATED
Application: Knowledge-base population
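A minimal sketch of reading the XBRL fact above with Python's standard library, as one side of the integration. The excerpt does not show namespace declarations, so the wrapper element and both namespace URIs below are assumptions added only to make the fragment parse on its own; `read_fact` is a hypothetical helper, not part of any XBRL toolkit.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: wrapper element and xmlns URIs are not on the slide.
DOC = """<xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <xbrli:context id="FI2013Q4">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <us-gaap:CashAndCashEquivalentsAtCarryingValue contextRef="FI2013Q4"
      decimals="-6" unitRef="usd">2844000000</us-gaap:CashAndCashEquivalentsAtCarryingValue>
</xbrl>"""

NS = {
    "xbrli": "http://www.xbrl.org/2003/instance",
    "us-gaap": "http://fasb.org/us-gaap/2014-01-31",
}

def read_fact(xml_text, concept):
    """Join a us-gaap fact with the context (entity, date) it refers to."""
    root = ET.fromstring(xml_text)
    fact = root.find(f"us-gaap:{concept}", NS)
    context = root.find(f"xbrli:context[@id='{fact.get('contextRef')}']", NS)
    return {
        "concept": concept,
        "value": int(fact.text.strip()),
        "unit": fact.get("unitRef"),
        "instant": context.findtext("xbrli:period/xbrli:instant", namespaces=NS).strip(),
        "entity": context.findtext("xbrli:entity/xbrli:identifier", namespaces=NS).strip(),
    }

print(read_fact(DOC, "CashAndCashEquivalentsAtCarryingValue"))
```

Metrics that appear only in the HTML table have no such tagged fact, which is why table extraction and understanding are needed to populate the same knowledge base.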
Introduction
22. Application: Scientific Leaderboard Construction
Introduction
Y. Hou et al. “Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction”. ACL ‘19
Scientific Publication
Leaderboard Annotations
23. Application: Biological Information Extraction
Introduction
G. Singh et al. “QTLTableMiner++: Semantic Mining of QTL Tables in Scientific Articles”. BMC Bioinformatics ‘18
Article Trait Tables
Trait Statements
QTL Statements
Extract information on Quantitative Trait Loci (QTL), genomic regions that correlate with phenotypes, from tables in scientific publications
24. Takeaways
Introduction
§ Widely used document formats have limited table representation
– Limits of the document format: Image, PDF
– How documents are authored: Word, Excel
§ The wide variety of tables makes general model construction difficult
– Tables are a form of art
– Diverse visual encoding of semantic information
– Different domains
§ Multiple applications for table extraction and understanding
27. § Table region detection
– Identify all tables
– Separate tables from non-table text
– Separate tables from each other
§ Cell structure recognition
– Partition text into cells
– Find cell span and cell-to-cell overlap (along X- or Y-axis)
What Is Table Extraction?
Table Extraction
28. [CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93
[I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93
[H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ‘95
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[H00b] J. Hu et al. “Table Structure Recognition and Its Evaluation”, SPIE Doc. Recog. & Retr. ‘00
[KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01
[C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ‘02
[P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ‘03
[H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ‘05
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[M06] S. Mandal et al. “A Simple and Effective Table Detection System from Document Images”, IJDAR ‘06
[L07] Y. Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”, JCDL ‘07
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ‘09
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
[F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
[CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
[K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ‘14
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
[R15] R. Rastan et al. “TEXUS: A Task-based Approach for Table Extraction and Understanding”, DocEng ‘15
[T16] T. A. Tran et al. “A Mixture Model Using Random Rotation Bounding Box to Detect Table Region in Document Image”, JVCIR ‘16
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18a] P. Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”, KDD ‘18
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
[Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
[C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
[L19] M. Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”, arXiv, 2019
[M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from Scanned Documents”, ICDAR ‘19
Table Extraction
Table Extraction: A Sample of Prior Work
Table Extraction
Most papers present an end-to-end system for:
• Table detection,
• Cell structure recognition (table parsing),
• Or both
🔥 ICDAR 2019 has ≥ 16 new papers on table extraction!
– ICDAR = International Conference on Document Analysis and Recognition
Table Extraction: A Sample of Prior Work
30. § Early 1990s: Separator-based “top-down” methods
– Ruled line tables
– Extend to white-space “lines”
§ 1990s – early 2000s: “Bottom-up” text clustering
– Group text into columns (or rows), then into tables
– Use space features (gaps, overlap, alignment) and keywords
§ 2000s – early 2010s: Machine Learning (supervised or not)
– Classify text-rows using CRF, SVM, HMM, etc.
– Probabilistic models for tables
– Graph-based models for cell structure
– Unsupervised ML (clustering)
§ Late 2010s: Deep Learning
– Scanned image table detection with R-CNN or YOLO
– Graph neural networks and language embeddings for cell structure
Table Extraction Timeline
Table Extraction
31. § Analyze Page
– Identify low-level structures & relations
§ The 2 Main Tasks
– Table (region) detection
– Cell structure recognition (given table region)
§ Refine Tables
– Discard false positives
– Adjust table border and structure
How to Build a Table Extraction System?
Table Extraction
32. Common Sub-Tasks in Table Extraction
Table Extraction
Analyze: Compute page features; Group text into larger units; Identify separator lines
Detect: Generate candidate table regions; Extract table’s cell structure
Refine: Compute table’s features & score; Adjust candidate tables; Select tables for output
Learning Infrastructure: Accuracy metrics; Ground truth data; Optimization method
34. § Documents can be:
– scanned
– programmatic (“born digital” PDF, TXT)
– hybrid
§ Scanned page requires OCR, plus:
– Reverse any rotation, distortion
– Filter noise, sharpen if low resolution [M19]
– Fix inconsistent font features, bounding boxes
– Detect ruled lines and boxes
• E.g., Gaussian filter + black hat transform [K13]
Page Features
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
[M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from
Scanned Documents”, ICDAR ‘19
Table Extraction
35. § Programmatic PDFs (and TXTs)
– Have letters, but no table markup
§ May contain spurious (invisible) text and lines
– White-on-white lines or text
– Occluded or out-of-range lines or text
– Text repeated to simulate bold font
– Need to filter them out
§ Deep Learning (CNN-based) methods need an image
– Render programmatic pages to images
Page Features
Table Extraction
36. § Plain text layout (1-column, 2-column, etc.)
– Helps avoid false-positive “tables”
§ Obvious non-tables
– Page & section headers, footers, lists, etc.
– Short-circuit computation if the page has no tables
§ Low-level structure
– Alignment at different box positions & tolerance levels
– A minimum spanning tree for clustering by distance
§ Deep learning features
– CNN features shared across proposal regions
– Natural language embeddings
Page Features
Table Extraction
38. § Most systems group text early on
– Table detection systems may skip text grouping
§ Text is grouped in one of 3 ways:
– Columns first
– Rows first
– Cell-units (“blobs”) first
§ Some systems partition text using separator lines
– BUT: “Blob” detection reduces over- / under-partitioning
Group Text into Larger Units
Table Extraction
39, 40, 42. Example (figure slides)
Table Extraction
Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
Figure labels: Two Tables; Columns; Multi-line “Blobs”
43. Many systems detect columns first:
– T-Recs [KD98], pdf2table [Y05], Lixto [HB07], Tesseract [SS10], smartFIX [D11]
Example – Tesseract [SS10]:
Start with Columns
Table Extraction
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
[Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ‘05
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
1. Detect X-axis “tab-stops” (alignment positions)
2. Group tokens between “tab-stops” horizontally into entries
3. Group entries of the same font vertically into column fragments
4. Group column fragments within page columns horizontally into table fragments
5. Group table fragments if columns match vertically into tables
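Steps 1 and 2 of the column-first process above can be sketched as follows. This is a simplified illustration and not the Tesseract implementation from [SS10]: it assumes tokens are dictionaries with a left edge `x0` and a `text` field (a hypothetical representation), it uses left-edge alignment only, and the coordinates and the second value column are invented.

```python
def find_tab_stops(tokens, tolerance=2.0, min_support=2):
    """Step 1: X positions where at least min_support tokens share a left edge."""
    stops, cluster = [], []
    for x in sorted(t["x0"] for t in tokens):
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(sum(cluster) / len(cluster))
    return stops

def group_into_entries(line_tokens, tab_stops, tolerance=2.0):
    """Step 2: on one text line, start a new entry at each token that sits on a tab-stop."""
    entries = []
    for tok in sorted(line_tokens, key=lambda t: t["x0"]):
        on_stop = any(abs(tok["x0"] - s) <= tolerance for s in tab_stops)
        if entries and not on_stop:
            entries[-1] += " " + tok["text"]
        else:
            entries.append(tok["text"])
    return entries

# HYPOTHETICAL token positions for two text lines of a table.
line1 = [{"x0": 10, "text": "Aircraft"}, {"x0": 52, "text": "fuel"},
         {"x0": 200, "text": "206,924"}, {"x0": 260, "text": "286,817"}]
line2 = [{"x0": 10, "text": "Airport"}, {"x0": 47, "text": "operations"},
         {"x0": 200.5, "text": "142,176"}, {"x0": 259.5, "text": "128,000"}]

stops = find_tab_stops(line1 + line2)
print(stops)
print(group_into_entries(line1, stops))
print(group_into_entries(line2, stops))
```

The words "fuel" and "operations" do not share a left edge, so they produce no tab-stop and are folded into the entry that starts at the first tab-stop.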
44–47, 49. Example (figure slides)
Table Extraction
Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
Figure labels: Tab-Stops; Column Fragments; Table Fragments; Table Fragments; Tables, Multi-Column Headers
50. Start with Rows
Table Extraction
Systems with ML often detect rows first
– Pinto-McCallum [P03], e Silva [S06], TableSeer [L08], PDF-TREX [OR09]
Typical process:
1. Identify text-lines
2. Train an ML classifier to label text-lines:
– “Table Dense”, “Table Sparse”, “Table Header”, “Non-table”, etc.
– ML = CRF [P03], HMM [S06], SVM [L08], etc.
3. Merge sparse rows into dense rows – get full table rows:
– Merge up, down, or cluster around, by row alignment [H00a]
4. Combine table rows into tables
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ‘03
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
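Step 3 of the row-first process (merging sparse rows into dense rows) can be sketched with the "merge up" strategy alone. In the systems cited above the sparse / dense labels come from a trained classifier; the filled-cell count used here is a stand-in heuristic, and the function name, the threshold, and the row representation (equal-length lists of cell strings) are all assumptions.

```python
def merge_sparse_rows(rows, min_dense_cells=2):
    """Append each sparse row (few filled cells) to the dense row above it."""
    merged = []
    for row in rows:
        filled = sum(1 for cell in row if cell)
        if filled < min_dense_cells and merged:
            previous = merged[-1]
            merged[-1] = [
                (p + " " + c).strip() if c else p
                for p, c in zip(previous, row)
            ]
        else:
            merged.append(list(row))
    return merged

rows = [
    ["Row 1, text line 1", "0.12", "1.23"],
    ["Row 1, text line 2", "", ""],
    ["Row 2, text line 1", "4.56", "5.67"],
    ["Row 2, text line 2", "", ""],
]
for row in merge_sparse_rows(rows):
    print(row)
```

Merging down or clustering around a dense row, as in [H00a], would need an extra decision about which neighbour each sparse row belongs to.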
55. § “Blob” = largest semantically bound text unit
– Single-line or multi-line
– If in a table, the whole “blob” must be in a single cell
§ “Blob” ≠ Cell
– Cell has span and overlaps other cells
– Some “blobs” end up in plain text or non-table text
§ “Blobs” help define table structure:
– Trace alignment
– Determine header cell spans
– Fix over-split / over-merged cells, rows, columns
– Reduce search space
Text “Blobs” (Cell-Units, Paragraphs, …)
Table Extraction
56. § [KD98] Distance based clustering:
– Merge words horizontally
– Merge text strings vertically if word-spans interleave
§ Problems with distance:
– Multi-column headers: 1 justified phrase vs. ≥ 2 closely spaced phrases
– Row headers / text cells: 1 multi-line cell vs. ≥ 2 closely spaced rows
§ Example:
How to Detect “Blobs”
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
                      Two Column Header   Two Column Header
HEADER                Header    Header    Header    Header
Row 1, text line 1    0.12      1.23      2.34      3.45
Row 1, text line 2
Row 1, text line 3
Row 2, text line 1    4.56      5.67      6.78      7.89
Row 2, text line 2
Row 2, text line 3
Table Extraction
57. § [H00a], [OR09] Merge “sparse” rows into “dense” rows
– Merge up, merge down, or cluster around
§ [L09] Detect and follow reading order ← an NLP challenge
§ [B12] [B14] Train a classifier over “blob” features:
– Proper termination (e.g. “blobs” don’t end with a dash or comma)
– Number of numeric strings
– Indentation, large space at the end of a string
– Shared font properties
§ Deep learning approaches:
– Cell-unit detection (over image) using CNNs
– Semantic relationship detection (over text) using RNNs
How to Detect “Blobs”
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ’09
[L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ‘09
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
Table Extraction
58. Example
Table Extraction
Table Source: https://www.dollartreeinfo.com/static-files/0c3687d8-e6ce-4566-bc89-79fc8c8b665e (2016_Proxy_Statement_Final.pdf)
60. § Ruled Lines & Colored Boxes
– Extend ruled lines over small gaps, “snap” together
– Merge touching colored boxes, then convert into lines
– Filter out: highlighting, underlining, boxed comments, logos, charts, etc.
§ BUT: A “perfect” ruled-line grid can be incomplete!
– Some lines may be missing
– Lines may fail to extend to header rows / columns
Separator Line Detection
[CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93
[I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93
[F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
Table Extraction
61. Example 1
Table Extraction
Table Source: https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/quarterly-result/2015/2015_MDA_q3.pdf
63. Example 3
Table Extraction
Table Source: http://educationaldatamining.org/files/conferences/EDM2018/EDM2018_Preface_TOC_Proceedings.pdf
64. § White-space separators (“virtual” lines)
– Help define cell span / cell alignment in tables
– Prune false-positives by ML or by heuristics [B12]
§ How to detect white-space separators
– Cell-unit (“blob”) bounding box expansion [I93]
– Axis projection histograms [CK93]
– White-space cover by maximum-area white-space rectangles [F11]
§ How to prune them (features to use)
– Adjacent “blobs” : alignment, size, and content
– “Strong” separators that run parallel to or intersect the separator
Separator Line Detection
Table Extraction
[CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93
[I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93
[F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
65. § Commonly used to partition page and generate separators
– By [C02], [W04], [K14], and others
§ [H95] The algorithm recursively, for each block:
– Computes X- and Y-axis projection profiles
– Divides the block into sub-blocks based on dips in profiles:
Recursive X-Y Cut Algorithm
[H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ‘95
[C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ‘02
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ‘14
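A minimal sketch of the recursive X-Y cut idea on bounding boxes, in the spirit of [H95] but not a reproduction of that paper's algorithm. Boxes are assumed to be `(x0, y0, x1, y1)` tuples; a "dip" in a projection profile is approximated as a white-space gap of at least `min_gap` between the projected intervals. The function names and the `min_gap` parameter are hypothetical.

```python
def split_along_axis(boxes, lo, hi, min_gap):
    """Group boxes whose projections onto one axis are separated by gaps >= min_gap."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups, current, reach = [], [ordered[0]], ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:   # empty stretch in the projection profile
            groups.append(current)
            current = []
        current.append(box)
        reach = max(reach, box[hi])
    groups.append(current)
    return groups

def xy_cut(boxes, min_gap=5.0):
    """Recursively cut a block; return the leaf blocks that cannot be cut further."""
    if not boxes:
        return []
    # Try cuts along the Y axis (indices 1, 3) first, then along the X axis (0, 2).
    for lo, hi in ((1, 3), (0, 2)):
        groups = split_along_axis(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            return [leaf for group in groups for leaf in xy_cut(group, min_gap)]
    return [boxes]

# Four boxes laid out as a 2 x 2 grid with gaps of 20 units.
cells = [(0, 0, 10, 10), (30, 0, 40, 10), (0, 30, 10, 40), (30, 30, 40, 40)]
print(xy_cut(cells, min_gap=5))
print(xy_cut(cells, min_gap=25))
```

The positions where cuts succeed are the white-space separators; a larger `min_gap` keeps closely spaced boxes together in one block.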
Table Extraction
67. § Ruled Line grids / frames, connected components
§ (Rows 1st) Stack “table” rows whose “blobs” co-align [L08], [OR09]
– Rows are labeled by an ML-classifier (CRF, SVM, HMM)
– Labels & matching “blob” layout → table regions
– NOTE: Be sure to label “header rows” to tell tables apart!
§ (Cols 1st) Cluster overlapping column fragments [HB07], [SS10]
– Group table columns horizontally, staying within page layout columns (when possible)
– Group vertically if column fragments overlap, match, or subsume
– NOTE: Column header areas require special handling!
Generate Candidate Table Regions
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
Table Extraction
68. § (Blobs 1st) Classify text “blobs”, cluster those labeled “table”
– [B14] iteratively labels “blobs” given their neighbors’ labels
– [B14] trains a Kernel Logistic Regression classifier
§ (Lines 1st) Find areas where “strong” separators make a grid
– [CL12] uses Max-Flow / Min-Cut algorithm to extract grids
– Bi-cluster the intersection matrix of horizontal vs. vertical separators
Generate Candidate Table Regions
[CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
69. Non-negative Matrix Factorization for Grid Clustering
$X \approx U V^{T}$, with $U \ge 0$, $V \ge 0$, and $k$ columns in $U$ and $V$ (one per cluster)
• $X_{ij} = 1 \Leftrightarrow$ lines $i$ and $j$ intersect
• At intersections: $1 \approx u_{i1} v_{j1} + u_{i2} v_{j2} + \dots + u_{ik} v_{jk}$
• Each $u_{ic} v_{jc} \ge 0$ gives the affinity of intersection $X_{ij}$ to cluster $c$
• $u_{ic} v_{jc}$ is large $\Leftrightarrow$ $u_{ic}$ and $v_{jc}$ are both large
[Figure: binary intersection matrix X shown as the product of non-negative factors U and V^T]
Generate Candidate Table Regions
Table Extraction
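The factorization X ≈ UVᵀ can be sketched in plain Python with the standard multiplicative update rules for non-negative matrix factorization. The slide does not say which solver is used, so the update rule, the random initialisation, and the toy intersection matrix are all assumptions made for illustration.

```python
import random

def nmf(X, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (list of rows) as U * V^T, with U, V >= 0."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    for _ in range(iterations):
        # U <- U * (X V) / (U V^T V), element-wise
        VtV = [[sum(V[j][a] * V[j][b] for j in range(m)) for b in range(k)] for a in range(k)]
        for i in range(n):
            XV = [sum(X[i][j] * V[j][c] for j in range(m)) for c in range(k)]
            UVtV = [sum(U[i][a] * VtV[a][c] for a in range(k)) for c in range(k)]
            U[i] = [U[i][c] * XV[c] / (UVtV[c] + eps) for c in range(k)]
        # V <- V * (X^T U) / (V U^T U), element-wise
        UtU = [[sum(U[i][a] * U[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
        for j in range(m):
            XtU = [sum(X[i][j] * U[i][c] for i in range(n)) for c in range(k)]
            VUtU = [sum(V[j][a] * UtU[a][c] for a in range(k)) for c in range(k)]
            V[j] = [V[j][c] * XtU[c] / (VUtU[c] + eps) for c in range(k)]
    return U, V

def squared_error(X, U, V):
    """Sum of squared differences between X and U * V^T."""
    return sum((X[i][j] - sum(u * v for u, v in zip(U[i], V[j]))) ** 2
               for i in range(len(X)) for j in range(len(X[0])))

# HYPOTHETICAL page: horizontal lines 0-2 cross vertical lines 0-1 (one grid),
# horizontal lines 3-4 cross vertical lines 2-4 (a second grid).
X = [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 1, 1, 1]]

U, V = nmf(X, k=2)
row_cluster = [max(range(2), key=lambda c: U[i][c]) for i in range(len(X))]
print(row_cluster, squared_error(X, U, V))
```

Assigning each line to the cluster with the largest factor entry, as `row_cluster` does for the horizontal lines, is one simple way to read candidate grids off the factors; which cluster index a grid receives depends on the initialisation.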
70. § (CNN-based) Try a fixed set of table region proposals
– CNN shares computation of features across all translations of a given
proposal rectangle
– Proposal rectangle shapes / sizes are fixed as hyperparameters
– If a proposal hits a table, a regression decides table borders
Generate Candidate Table Regions
[CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
71. § Use existing object detection frameworks (Faster R-CNN or YOLO) retrained for table detection
§ The field is wide open for more table-specific DL approaches
– E.g. involving text semantics
Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. arXiv 2019
Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”. KDD 2018
Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017
Gilani et al. “Table Detection using Deep Learning”. ICDAR 2017
Table Extraction
Deep Learning for Table Detection
72. Common Sub-Tasks in Table Extraction
Table Extraction
Phases: Analyze, Detect, Refine
Sub-tasks:
– Compute page features
– Group text into larger units
– Identify separator lines
– Generate candidate table regions
– Extract table’s cell structure
– Compute table’s features & score
– Adjust candidate tables
– Select tables for output
Learning Infrastructure: Accuracy metrics, Ground truth data, Optimization method
73. § Cells define overlap relation along X- or Y-axis
– Links headers with data – critical for table understanding
§ Cell borders ← ruled lines ∪ “strong” white-space lines
– Extend lines to make rectangular cells, avoid crossing “blobs”
§ Ruled grids: test for incompleteness
– Multiple numerics per cell
– A “strong” white-space line splits text in ≥ 2 cells
– A “mini-table” inside a ruled cell
– Cell structure extends beyond table frame
§ White-space grids: clean up empty cells
– Expand header cells by merging with empty cells [S06]
– Merge (almost-) empty rows and columns
Cell Structure: Line Based
Table Extraction
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
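The overlap relation from the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions: cells are axis-aligned boxes, y grows downward, and the `Cell` type, helper names, and coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    text: str
    x0: float
    x1: float
    y0: float
    y1: float

def overlaps(a0, a1, b0, b1):
    """True if the open intervals (a0, a1) and (b0, b1) intersect."""
    return min(a1, b1) - max(a0, b0) > 0

def column_headers(cell, cells):
    """Cells above `cell` whose X-extent overlaps it, ordered top to bottom."""
    above = [c for c in cells
             if c is not cell and c.y1 <= cell.y0
             and overlaps(c.x0, c.x1, cell.x0, cell.x1)]
    return [c.text for c in sorted(above, key=lambda c: c.y0)]

def row_headers(cell, cells):
    """Cells left of `cell` whose Y-extent overlaps it, ordered left to right."""
    left = [c for c in cells
            if c is not cell and c.x1 <= cell.x0
            and overlaps(c.y0, c.y1, cell.y0, cell.y1)]
    return [c.text for c in sorted(left, key=lambda c: c.x0)]

# Toy layout (assumed): a spanning header over two year columns.
value = Cell("672", 100, 200, 20, 30)
cells = [
    Cell("Revenue", 100, 300, 0, 10),
    Cell("2018", 100, 200, 10, 20),
    Cell("2019", 200, 300, 10, 20),
    Cell("Net earnings", 0, 100, 20, 30),
    value,
]
col_path = column_headers(value, cells)
row_path = row_headers(value, cells)
```

Here the value cell is reached through the column path "Revenue", "2018" and the row path "Net earnings", which is the header-to-data link that table understanding depends on.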
74. § Use Spatial Constraints to find an overlap DAG over cells [H03]
§ Use Graph Neural Networks to find 2 undirected graphs [Q19] [C19]:
– “Same Row” graph & “Same Column” graph
– Two cells share an edge ⇔ share a row / a column
– [Q19]: Rows and columns = maximal cliques
– [C19]: Only adjacent cells share a graph edge
Cell Structure: Graph Based
[H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03
[Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
[C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
Table Extraction
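To make the "rows = maximal cliques" reading concrete, here is a small sketch that recovers rows from a "Same Row" graph. It assumes the edges have already been predicted; the Bron-Kerbosch enumeration and the toy cell names are illustrative choices, not taken from [Q19].

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch).

    adj maps each node to the set of its neighbours.
    """
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found

# Toy "Same Row" graph (assumed): cell "s" spans two rows, so it shares
# a row with a, b (row 1) and with c, d (row 2).
edges = [("a", "b"), ("a", "s"), ("b", "s"), ("c", "d"), ("c", "s"), ("d", "s")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

rows = maximal_cliques(adj)
```

A spanning cell simply appears in more than one clique, which is why the clique view handles merged cells without special cases.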
75. Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017
Table Extraction
Cell Structure: CNN Based
§ Object detection networks were also used for cell structure detection
77. § Eliminate false positive tables
§ Detect malformed table regions
– Plain text in tables
– Missing row / column headers or split-off pieces
– One region covers multiple tables
§ Compare alternative table candidates
– Example: Is this 1 table or 2 tables?
§ Improve table region and structure
– Pick the best adjustment out of a range of options
– NOTE: Knowing cell structure helps region scoring / adjustment
§ Provide a confidence value for output tables
Why Scoring Tables?
Table Extraction
78. § Tables are very diverse
– Tiny or huge, misaligned, text in cells, key-value pairs, confusing delimiters
– Complex row / column headers – so different, easy to chop off !
§ What’s around the table also matters
– Can its columns or rows be extended? Should they be?
§ One table, or ≥ 2 adjacent tables?
– 1 table may have: ruled bars, wide gaps, font / alignment changes
– 2 tables may be: fully or partly co-aligned, separated in one of many ways
§ Non-table text can have complex structure, too
– Page headers / footers, framed / highlighted text, hierarchical lists, …
Table Scoring Challenges
Table Extraction
79. Example 1
Table Extraction
Table Source: https://www.legislation.gov.au/Details/F2010C00607/0d99393c-5c5b-4af0-9cc1-b5c2de8632c3 (F2010C00607.pdf)
NOT A TABLE !
83. § Rule-out patterns
– Rule out charts, lists, signature blocks etc.
§ Aggregated column / row score
– [KD01] Aggregate the similarities that led to the table’s column fragments
§ Dynamic programming score
– [H99] Score (T) = max { Score (T – line) + Merit (line) }
– Score the best split into 2 sub-tables
§ Probability of being a table (given the features)
– [W04] Partition page into blocks labeled “table” and “plain text”
– Compute label probability for block + neighboring blocks
§ A scoring neural network on top of CNN [G17, S18b]
How to Score a Table
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
84. § Columns and rows:
– Number, span / extent, alignment, font / content similarity
§ Ruled and white-space separators:
– Number, span / extent, width of their margins
– If they match, reach (good) or cross (bad) table borders
§ Inside vs. outside table:
– Border crossing ruled lines, aligned blocks, or highly similar text
– The two sides have matching structure
§ Cell structure:
– Oversized cells, misaligned pairs of cells, “runs” of empty cells
§ Content:
– Numerics, repeated words; customizable keywords
– Domain-specific “expectations,” e.g. header dictionary [D11]
§ CNN-generated features
Features for Table Scoring
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
Table Extraction
86. § Leverage table features and score
– Specify what a well-formed vs. malformed table looks like
§ Use a transparent, explainable method
– If detection is a “black box”, adjustment uses explainable rules & features
§ Correct errors quickly
– Bypass the need for extra ground-truth data, retraining
§ Customize to address specific concerns
– Add custom features, rules, and constraints
Why Adjust Tables?
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
87. § Merge table with an adjacent table or text-block [W04] [SS10]
§ Adjust table border – add or drop rows or columns [HB07] [D11]
§ Split one table into two, possibly with plain text between
§ Re-compute table region by neural network regression [G17] [S18b]
§ Choose best-scoring border (or structure) out of a range of options
§ Iterate adjustment → traverse a search tree of candidate tables
How to Adjust Candidate Tables
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
89. What if candidate tables overlap each other?
§ [H99] uses Dynamic Programming:
– Only for top and bottom line-positions: [i, j]
– Score disjoint unions of tables:
§ CNN-based object detection systems:
– Greedy Approach: Pick the top-scoring region, repeat
– PROBLEM: Lower-scoring table may have a high-scoring sub-table
§ Maximum Weighted Independent Set
– Nodes = tables, edges = conflicts, weights = table scores
– NP-hard even for 2-dim rectangles [RN95], but can be solved
efficiently in real-life cases
Select Best Tables for Output
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[RN95] C.S. Rim and K. Nakajima. “On Rectangle Intersection and Overlap Graphs”, IEEE Trans. on Circuits & Systems I, 42(9), 1995
Table Extraction
[Figure: binary conflict matrix over candidate tables; Conflict = Table Overlap]
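The difference between greedy selection and a Maximum Weighted Independent Set can be seen on a tiny example. This sketch assumes candidates are axis-aligned boxes with integer scores and uses brute-force enumeration, which is only viable for a handful of candidates; the names and data are hypothetical.

```python
from itertools import combinations

def conflict(a, b):
    """Two candidate tables conflict if their boxes (x0, y0, x1, y1) overlap."""
    ax0, ay0, ax1, ay1 = a["box"]
    bx0, by0, bx1, by1 = b["box"]
    return min(ax1, bx1) > max(ax0, bx0) and min(ay1, by1) > max(ay0, by0)

def greedy_select(cands):
    """Repeatedly pick the top-scoring candidate that conflicts with nothing chosen."""
    chosen = []
    for c in sorted(cands, key=lambda c: c["score"], reverse=True):
        if not any(conflict(c, d) for d in chosen):
            chosen.append(c)
    return chosen

def exact_select(cands):
    """Brute-force maximum weighted independent set (exponential in len(cands))."""
    best, best_score = [], 0
    for r in range(1, len(cands) + 1):
        for subset in combinations(cands, r):
            if any(conflict(a, b) for a, b in combinations(subset, 2)):
                continue
            score = sum(c["score"] for c in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best

# Assumed candidates: one large region covering two well-scored sub-tables.
cands = [
    {"name": "big", "box": (0, 0, 10, 10), "score": 9},
    {"name": "top", "box": (0, 0, 10, 4), "score": 6},
    {"name": "bottom", "box": (0, 6, 10, 10), "score": 6},
]
greedy_names = [c["name"] for c in greedy_select(cands)]
exact_names = sorted(c["name"] for c in exact_select(cands))
```

Greedy keeps the single large region (total 9), while the exact search prefers the two non-overlapping sub-tables (total 12).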
91. § Accuracy Metrics
– Exact match of table region or structure is too inflexible
– Partial match: Text? Area? Cell relationship? Functional?
§ Ground Truth Labeling
– Very time consuming, requires sophisticated UI tools
– Humans disagree on what’s correct
§ Optimization (pre- deep learning)
– Lots of discrete, non-differentiable steps
– Learn sub-tasks, e.g. row labeling with CRF / SVM
– [W04] Global parameter learning:
Learning from Data: Challenges
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
Table Extraction
92. Table Boundary
§ Purity & Completeness
§ Character level recall, precision
and F1
Table Structure
§ Recall and Precision of Cell
Adjacency Relations
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
ICDAR 2013 Competition Metrics
Table Extraction
Accuracy Metrics
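Precision and recall over cell adjacency relations can be sketched as below. This is a simplified illustration: it assumes both tables are given as dense grids of cell text with no blank or spanning cells, whereas the full methodology of Göbel et al. handles those cases; the function names are hypothetical.

```python
def adjacency_relations(grid):
    """Set of (cell, neighbour, direction) triples for right and lower neighbours."""
    rels = set()
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if c + 1 < len(row):
                rels.add((text, row[c + 1], "horizontal"))
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                rels.add((text, grid[r + 1][c], "vertical"))
    return rels

def precision_recall_f1(predicted, truth):
    """Compare two sets of adjacency relations."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Assumed example: the extractor merged the two data rows into one.
truth_grid = [["Year", "Sales"], ["2018", "10"], ["2019", "12"]]
pred_grid = [["Year", "Sales"], ["2018 2019", "10 12"]]
precision, recall, f1 = precision_recall_f1(
    adjacency_relations(pred_grid), adjacency_relations(truth_grid)
)
```

Only the "Year"/"Sales" relation survives the row merge, so one structural mistake costs most of the score. This is the sense in which the metric rewards partial matches while still penalizing structure errors.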
93. § Measure what actually
matters downstream
§ Capture accuracy of
access paths to each cell
§ Need header annotation
as well as cell structure
Table Extraction
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
Accuracy Metrics
Functional Metrics
94. Ground Truth Datasets
Complete Datasets with table boundary and cell structure:
- ICDAR-2013 competition (PDF Format)
- ICDAR-2019 competition (Image Format)
- SciTSR 2019 (Generated from LaTeX files)
Incomplete Datasets
§ TableBank (Full table boundary information only)
§ PDF-TREX (Financial table dataset without ground truth labels)
§ Marmot (Only ground truth for table boundary, cells inaccessible)
§ UNLV, UW-3 (Table structure and boundary annotations for scanned documents)
Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. ArXiv 2019
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
Oro et al. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”. ICDAR '09
Fang et al. “Dataset Ground-Truth and Performance Metrics for Table Detection Evaluation”. DAS '12
Chi et al. “Complicated Table Structure Recognition” arXiv 2019
Table Extraction
97. Table Understanding
Table Understanding
HTML Document
Table Extraction (Optional)
Document (PDF, Image)
Representation of table contents that:
• Captures semantic information
• Is amenable to post-processing
Table Understanding Output
Knowledge Base
Creation
Downstream
Tasks
Question
Answering
e.g., Air Canada’s oper. revenues in Q3 2015?
HTML
Understand the semantics of tabular data
102. Semantics of Tabular Data
Table Understanding
“The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
103. Semantics of Tabular Data
Table Understanding
Information about a single cell is derived from multiple places
106. What You Will Learn
Table Understanding
1. Components of Table Understanding
• What are the different types of semantic information about a table?
• Where can they be found?
2. Table Understanding Methods
• What techniques are used to extract info for table understanding?
• What learning methods can be used?
3. Importance of Domain
• How do tables differ between domains?
• How do the assumptions of proposed approaches affect their potential applicability to other domains?
107. Outline: Components of Table Understanding
Table Understanding
A. Table Regions
(Column/Row Headers)
B. Context
Within Table
C. Context
Within Document
D. Context Outside
Document
108. Outline: A. Table Regions
Table Understanding
Column Headers
(incl. nesting)
Row Headers
(incl. nesting)
Data/Body
Cells
Main table regions
Metadata
111. Unsupervised Methods Overview
Table Understanding
Header rows/cols “look different” from data rows/cols
Heuristics
Similarity Features
• Which heuristics to use?
112. Unsupervised Methods: Local Minimum
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
For column (row) headers: Find first row (col) that looks “different”
Pair-wise similarity of
consecutive rows
Local minimum of similarity
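The local-minimum heuristic can be sketched as follows. This is an illustration under simplifying assumptions: similarity is just the fraction of columns whose coarse data type (empty, number, text) matches between consecutive rows, which is far cruder than the features used by Fang et al.; all function names are hypothetical.

```python
def cell_type(text):
    """Coarse data type of a cell: 'empty', 'number', or 'text'."""
    t = text.strip().replace(",", "")
    if not t:
        return "empty"
    try:
        float(t)
        return "number"
    except ValueError:
        return "text"

def row_similarity(a, b):
    """Fraction of aligned columns whose coarse type agrees."""
    pairs = list(zip(a, b))
    return sum(cell_type(x) == cell_type(y) for x, y in pairs) / len(pairs)

def header_row_count(rows):
    """Number of leading rows before the first local minimum of similarity."""
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    if len(sims) < 2:
        return 0  # too few rows to speak of a local minimum
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s < left and s < right:
            return i + 1
    return 0

table = [
    ["Region", "Q1", "Q2"],
    ["North", "10", "12"],
    ["South", "8", "9"],
    ["West", "7", "11"],
]
n_header = header_row_count(table)
n_header_plain = header_row_count([["1", "2"], ["3", "4"], ["5", "6"]])
```

The similarity between the first two rows is 1/3 while later pairs score 1, so the dip marks the boundary between the column header row and the data rows.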
113. Unsupervised Methods: Indexing
Table Understanding
S. Seth et al. “Segmenting tables via indexing of value cells by table headers”.
ICDAR ‘13
• Use empty and repeated
cells to find critical cells that
outline the stubhead
• Independent of visual
aspects of table
Repeated cell
implying hierarchical
row header
Empty cells implying
hierarchical column
header
114. Traditional ML Methods Overview
Table Understanding
Header rows/cols “look different” from data rows/cols
Traditional
ML Methods
Similarity Features
Column
Headers
Data
Cells
Classification Labels
• How to model this as a classification problem?
• Which ML method and features to use?
115. Traditional ML Methods: Row/Column Classification
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
Data row
Data row
Data row
Data row
Data row
Data row
Column header row
Classify rows as column header rows (similarly for row header columns)
D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
116. Header Identification Results
Table Understanding
S. Seth et al. “Segmenting tables via indexing of value cells by table headers”.
ICDAR ‘13
R. Rastan et al. “TEXUS: A unified framework for extracting and understanding
tables in PDF documents”. Information Processing & Management
Government Statistic Table Set (Seth):
Method | Correct Segmentation | Correct Stub Head (Critical Cell)
Seth et al. | 99% | 100%
TEXUS | 100% | 100%
ASX-Announcements Dataset (TEXUS):
Method | Correct Segmentation | Correct Stub Head (Critical Cell)
TEXUS | - | 42.9%
117. No standard benchmark or dataset
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
FedStat Textfile Dataset (Pinto) CiteSeerX PDF Dataset (Fang)
118. Traditional ML Methods: Table Classification
Table Understanding
Web Data Commons – Web Table Corpora
Classify entire tables
Relational Table Entity/Listing Table Matrix Table
119. Traditional ML Methods: Table Classification
Table Understanding
Web Data Commons – Web Table Corpora
Classify entire tables
• Table class implies header structure
Relational Table Entity/Listing Table Matrix Table
120. Traditional ML Methods: Table Classification
Table Understanding
Web Data Commons – Web Table Corpora
Classify entire tables
• Table class implies header structure
• Can be used for header identification under certain assumptions
Relational Table Entity/Listing Table Matrix Table
Single col header row
Single col header row
Single row header col
121. Traditional ML Methods: Table Classification
Table Understanding
Table Classes | Reference (in order of publication year)
Genuine vs Non-genuine | Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web”. WWW ‘02
Relational vs Non-relational | M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08
I. Relational Knowledge: Listing, Attribute/Value, Matrix, Calendar, Enumeration, Form; II. Layout: Navigational, Formatting | E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11
Vertical listings, horizontal listings, matrix tables | J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
122. Traditional ML Methods: Table Classification
Table Understanding
ML Methods | Reference
Decision Tree, SVM | Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web”. WWW ‘02
Rule-based Classifier (WEKA) | M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08
Gradient Boosted Decision Tree | E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11
Decision Tree (CART, C4.5, Random Forest), SVM | J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
123. Traditional ML Methods
Table Understanding
Neighborhood and Table Features
• Difference in number of non-empty cells
• Average alignment
• Percentage of same cell data type
• Percentage of same cell font style
• Content repetition
• Number and standard deviation of rows and columns
Cell Features
• Number of non-empty cells
• Average cell length
• Percentage of numeric characters
• Percentage of symbolic characters
• Average font size
• Cell font styles
• Cell positioning in the table
• Percentage of cells spanning multiple cols/rows
• HTML tags (if applicable)
• Cell span
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification
Approach”, BDC ‘15
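A few of the content-based features listed above can be computed directly from cell text. This sketch covers only features that need no layout or font information, and the function name and dictionary keys are hypothetical; a real system would feed such vectors to one of the classifiers named on the previous slides.

```python
def row_features(cells):
    """Content-based features for one table row, given its cell strings."""
    non_empty = [c.strip() for c in cells if c.strip()]
    chars = "".join(non_empty)
    n_chars = len(chars) or 1  # avoid division by zero for all-empty rows
    return {
        "non_empty_cells": len(non_empty),
        "avg_cell_length": (
            sum(len(c) for c in non_empty) / len(non_empty) if non_empty else 0.0
        ),
        "pct_numeric_chars": sum(ch.isdigit() for ch in chars) / n_chars,
        # "Symbolic" here means neither alphanumeric nor whitespace.
        "pct_symbolic_chars": sum(
            not ch.isalnum() and not ch.isspace() for ch in chars
        ) / n_chars,
    }

header_features = row_features(["Region", "Q1", "Q2"])
data_features = row_features(["Total", "1,234", ""])
```

Header rows tend to score low on numeric characters and data rows high, which is the kind of contrast a row classifier can pick up.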
124. Deep Learning Methods Overview
Table Understanding
Header rows/cols “look different” from data rows/cols
Deep
Learning
Methods
Similarity Features
Column
Headers
Data
Cells
Classification Labels
• Which deep learning architecture to use?
125. Deep Learning Methods: Hierarchical Attention Network
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables]
Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16
Hierarchical RNN proposed to leverage
document structure:
• 2 layers:
• Words
• Sentences
126. Deep Learning Methods: Hierarchical Attention Network
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables]
Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16
Extend to tables:
• 3 layers
• Tokens
• Cells
• Rows or Columns
• Bidirectional network
• Combine row-directional and
column-directional network
127. Deep Learning Methods: RNN-CNN Hybrid (TabNet)
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ‘17
LSTM captures semantic
representation of each cell
CNN captures
relationship between cells
128. Deep Learning Methods: RNN-CNN Hybrid (TabNet)
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ‘17
LSTM captures cell text
together with coordinates
and other HTML tags (i.e.,
formatting)
129. Deep Learning Methods: Results
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ‘17
[Results figure: compared methods are rule-based, decision tree (two variants), hierarchical attention for documents, and the RNN-CNN hybrid]
131. Beyond Flat Headers: Hierarchical Row Headers
Table Understanding
Identify hierarchical relationship among row headers
Complex semantic row header hierarchy: Multiple cells in the same row header
column are semantically related to each other
132. Beyond Flat Headers: Hierarchy as a Graphical Model
Table Understanding
Z. Chen et al. “Integrating Spreadsheet Data via Accurate and Low-Effort
Extraction”. KDD ‘14
Encode hierarchy as graphical model
• Variable: Candidate parent-child pair
• Node potentials: Features for predicting
parent-child pairs
• Edge potentials: Correlations of
variables based on style, KB affinity, …
133. Pairwise vs Rectangle cell relationships
Table Understanding
• Pairwise classification can only utilize local information
• Simply looking at the pair may not be sufficient to determine the relation
• A rectangle is “interesting” if it is the support rectangle of some cell,
called a header cell of that rectangle
134. Beyond Flat Headers: Hierarchy as Rectangle Relationship
Table Understanding
X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of
Financial Tables”. ICDAR ‘17
Two “interesting” rectangles:
• “Assets” (row 1) heads rows 2-17
• “Current” (row 2) heads rows 3-11
135. Beyond Flat Headers: Hierarchy as Rectangle Relationship
Table Understanding
X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of
Financial Tables”. ICDAR ‘17
When a “total” row is considered as a parent
candidate, it cannot take children
For each iteration:
• Combine: Consecutive minimal
rectangles with equal features
• Attach: Minimal rectangle $r_i$ to directly preceding rectangle $r_{i-1}$ if $r_{i-1} > r_i$
136. Outline: B. Context Within Table
Table Understanding
Currency
Additional semantic information within the table
- of different types
Scale
137. Outline: B. Context Within Table
Table Understanding
Additional semantic information within the table
- of different types
- of different scope
Propagate to all data cells
138. Outline: B. Context Within Table
Table Understanding
Additional semantic information within the table
- of different types
- of different scope
Propagate to
subset of data cells
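Propagating in-table context such as currency and scale to data cells can be sketched as below. This assumes the context annotations have already been found and that each one carries a scope (all columns, or a subset); the data layout and function names are hypothetical. Parenthesized numbers are treated as negative, following the usual accounting convention.

```python
def normalize(text, scale=1, currency=None):
    """Turn a cell string into (numeric value, currency) using table context."""
    t = text.strip().replace(",", "")
    negative = t.startswith("(") and t.endswith(")")
    if negative:
        t = t[1:-1]
    value = float(t) * scale
    return (-value if negative else value, currency)

def propagate(rows, contexts):
    """Apply each context to the data cells inside its scope.

    contexts: list of dicts with optional 'scale' and 'currency' and a
    'columns' entry (None means the context applies to every column).
    Later contexts override earlier ones where scopes overlap.
    """
    result = []
    for row in rows:
        new_row = []
        for col, text in enumerate(row):
            scale, currency = 1, None
            for ctx in contexts:
                if ctx["columns"] is None or col in ctx["columns"]:
                    scale = ctx.get("scale", scale)
                    currency = ctx.get("currency", currency)
            new_row.append(normalize(text, scale, currency))
        result.append(new_row)
    return result

# Assumed contexts: "in millions" applies to all cells, "USD" to column 0 only.
contexts = [
    {"scale": 10**6, "columns": None},
    {"currency": "USD", "columns": {0}},
]
cells = propagate([["672", "(13)"]], contexts)
```

With these contexts the raw string "672" becomes 672,000,000 USD, while "(13)" in the second column becomes -13,000,000 with no currency attached.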
139. Outline: C. Context Within Document
Table Understanding
Additional context outside the table within the same document
- leverage relevant text and tables
140. Table Context Within Document
Table Understanding
Surrounding text often contains important info about a table
Deeper
Semantic Understanding
• Link text to table
• Generate table title
Shallow
Context Extraction
• Extract table metadata
141. Extract Table Metadata
Table Understanding
Ying Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in
Digital Libraries”. JCDL ’07
Document title
Page
Table Caption
Document authors
142. Link Text to Table Cells
Table Understanding
D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18
Text
Cells described by text
Approach:
• Identify headers
• Match sentence to table cells
based on:
• Unique words
However, mirroring the overall
softness of the tech sector, sales of
computer hardware decreased 1%
versus a year-ago to $1.6 billion.
143. Link Text to Table Cells
Table Understanding
D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18
Text
Cells described by text
Approach:
• Identify headers
• Match sentence to table cells
based on:
• Unique words
• Syntactic analysis
144. Link Text to Table Cells
Table Understanding
D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18
Text
Cells described by text
Approach:
• Identify headers
• Match sentence to table cells
based on:
• Unique words
• Syntactic analysis
• Semantic analysis
”…talking about topics is an
important reason to email with
these special interest groups.”
word2vec
145. Link Text to Table Cells
Table Understanding
D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18
Text
Cells described by text
Approach:
• Identify headers
• Match sentence to table cells
based on:
• Unique words
• Syntactic analysis
• Semantic analysis
• Use rules to refine matches
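The "unique words" step can be illustrated with a small matcher. This is only a sketch of that one idea, not the pipeline of Kim et al.: it assumes headers are already identified, ignores the syntactic and semantic stages, and the table headers below are invented for the example.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unique_words(headers):
    """For each header, the words that occur in no other header."""
    counts = Counter(w for h in headers for w in set(tokens(h)))
    return {h: {w for w in tokens(h) if counts[w] == 1} for h in headers}

def match_cells(sentence, row_headers, col_headers):
    """Cells (row header, column header) whose unique words appear in the sentence."""
    words = set(tokens(sentence))
    rows = [h for h, u in unique_words(row_headers).items() if u & words]
    cols = [h for h, u in unique_words(col_headers).items() if u & words]
    return [(r, c) for r in rows for c in cols]

sentence = (
    "However, mirroring the overall softness of the tech sector, sales of "
    "computer hardware decreased 1% versus a year-ago to $1.6 billion."
)
# Hypothetical headers for the table the sentence describes.
matches = match_cells(
    sentence,
    row_headers=["Computer hardware", "Computer software"],
    col_headers=["Sales", "Operating income"],
)
```

"Computer" occurs in both row headers and is therefore ignored, so the match rests on "hardware" and "sales" alone.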
146. Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Problem:
• Web tables lack titles or
• Existing titles lack context
Table Title?
147. Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Solution:
• Leverage surrounding context
to generate table title
Table + Surrounding Context
Table Title
Problem:
• Web tables lack titles or
• Existing titles lack context
148. Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Context used as input:
• Page Title
• Section headers (<h...> tags)
• Column headers
• Spanning column headers
as a special case
• Table caption (<caption> tag)
Table + Surrounding Context
Table Title
Context ignored due to noise:
• Text right before/after table
• Table rows
149. Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Model Design
• Pointer-generator network
• First proposed for
abstractive summarization
• Combines copy & generator
mechanism
Table + Surrounding Context
Table Title
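For reference, the way a pointer-generator network combines copying and generating is usually written as a mixture of two distributions. The notation below is the standard formulation from the abstractive summarization literature and is an assumption here, since the slide does not give the formula: $p_{\text{gen}}$ is a learned gate, $P_{\text{vocab}}$ is the generator's vocabulary distribution, and $a_i$ is the attention weight on input token $w_i$ (for title generation, a token of the table's surrounding context).

```latex
P(w) \;=\; p_{\text{gen}} \, P_{\text{vocab}}(w)
      \;+\; \bigl(1 - p_{\text{gen}}\bigr) \sum_{i \,:\, w_i = w} a_i
```

When $p_{\text{gen}}$ is close to 0 the model copies words from the page title, section headers, or column headers; when it is close to 1 it generates from the vocabulary.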
150. Outline: D. Context Outside Document
Table Understanding
Additional context outside the table from other resources
- link to knowledge bases
151. Table to KB Linking
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
Link different parts of the table to external knowledge bases
Link Columns
(known as Column Type Identification)
Link Rows/Cells
(known as Entity Linking)
152. Table to KB Linking: Link Columns
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
153. Table to KB Linking: Link Rows/Cells
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
155. Understanding Tabular Data: Putting it All Together
Table Understanding
What does this cell represent?
A. Identify table regions (column/row headers)
156. Understanding Tabular Data: Putting it All Together
Table Understanding
B. Identify additional context within table
157. Understanding Tabular Data: Putting it All Together
Table Understanding
C. Identify context within document
158. Understanding Tabular Data: Putting it All Together
Table Understanding
D. Identify context outside document
159. Understanding Tabular Data: Putting it All Together
Table Understanding
“The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
160. Final Takeaways
1. Table extraction & understanding has a rich history of methods spanning many decades
2. Tables from different domains are not the same; a general table extraction & understanding system needs to consider the diversity of type, style, and content of tables
3. Both semantic and visual features are crucial to improving table extraction and understanding
4. As a community, we need to standardize tasks, evaluation metrics, and datasets