ICDM2019 table tutorial

Table Extraction and Understanding
for Scientific and Enterprise
Applications
Yannis Katsis
Doug Burdick
Nancy Wang
Alexandre V Evfimievski
Marina Danilevsky
IBM Research - Almaden

Outline
§ Introduction
– Problem Definition
– Challenges
– Applications
– Demo
§ Table Extraction
§ Table Understanding
§ Conclusion

Introduction Outline
§ Problem definition
– Table Extraction
– Table Understanding
§ Challenges
– Limited document format support for table structure
– Table variety
§ Applications
– Knowledge Base Population
– Query Answering
– Leaderboard Construction
– Information Extraction
§ Demo
Introduction

Tables are a popular data representation
Government Reports
Scientific Papers
Financial Reports
Invoices
Contracts
Loan Agreements
Compact
Easy to understand*
(*) For humans

End-to-end example
§ What does the value 672 in the following
table mean?
§ Answer: Net earnings for the three months
ended July 29, 2017 were $672 million (USD)
Steps:
1) Find location of table on page
2) Find cells in column containing “672”
3) Find cells in row corresponding to “672”
4) Identify aligned row / column header cells
5) Normalize using additional context from
table
Table Extraction: Identify
table location and structure
Table Understanding: Provide
semantic context to table values

Input: Document contents in native format
- PDF
- Image
- Office Docs
- …
Table Extraction: Problem Definition
Output: Document contents with tabular information:
1) Table border for each table
2) Partitioning table contents into cells
3) Both vertical and horizontal alignment of cells

Table Understanding:
Problem Definition
Output: Table content representation:
1) Captures semantic information
2) Amenable to post-processing
Input: Document contents with tabular information:

{
  "tables": [
    {
      "column_headers": [
        {
          "cell_id": "colHeader-1050-1082",
          "text": "Expenses ($ in thousands)",
          ...
        },
        {
          "text": "Three months ended Sept. 30",
          ...
        },
        {
          "text": "2015"
        },
        ...
      ],
      "row_headers": [
        {
          "cell_id": "rowHeader-2244-2262",
          "text": "Aircraft fuel"
        },
        {
          "text": "Airport operations"
        },
        {
          "text": "Flight operations and navigational changes"
        },
        ...
      ],
      "body_cells": [
        {
          "cell_id": "bodyCell-2450-2455",
          "text": "206,924",
          "row_header_ids": [
            "rowHeader-2244-2262"
          ],
          "column_header_ids": [
            "colHeader-1050-1082",
            "colHeader-1270-1301",
            "colHeader-1544-1548"
          ]
        },
        {
          "text": "142,176",
          "row_header_ids": [ ... ],
          ...
        },
        ...
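The header links in this representation can be followed mechanically to recover the context of a value. Below is a minimal sketch, assuming the JSON has been loaded into a Python dict with the field names shown in the excerpt. The helper `describe_cell` is hypothetical, and because the excerpt elides the ids of the second and third column headers, pairing them with the texts "Three months ended Sept. 30" and "2015" is an assumption made only for illustration.

```python
# One table from the excerpt, reduced to the fields needed for the lookup.
table = {
    "column_headers": [
        {"cell_id": "colHeader-1050-1082", "text": "Expenses ($ in thousands)"},
        # Assumption: the excerpt does not show which id belongs to these two
        # headers; the pairing below is illustrative only.
        {"cell_id": "colHeader-1270-1301", "text": "Three months ended Sept. 30"},
        {"cell_id": "colHeader-1544-1548", "text": "2015"},
    ],
    "row_headers": [
        {"cell_id": "rowHeader-2244-2262", "text": "Aircraft fuel"},
    ],
    "body_cells": [
        {
            "cell_id": "bodyCell-2450-2455",
            "text": "206,924",
            "row_header_ids": ["rowHeader-2244-2262"],
            "column_header_ids": [
                "colHeader-1050-1082",
                "colHeader-1270-1301",
                "colHeader-1544-1548",
            ],
        },
    ],
}


def describe_cell(table, cell_id):
    """Resolve the header ids of one body cell into header texts."""
    headers = {
        h["cell_id"]: h["text"]
        for h in table["column_headers"] + table["row_headers"]
    }
    cell = next(c for c in table["body_cells"] if c["cell_id"] == cell_id)
    return {
        "value": cell["text"],
        "row_headers": [headers[i] for i in cell["row_header_ids"]],
        "column_headers": [headers[i] for i in cell["column_header_ids"]],
    }


info = describe_cell(table, "bodyCell-2450-2455")
print(info)
```

The resolved header texts are the raw material for the normalization step (units, time period, line item).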
Example
Value     Norm Value      Year   Time Period   Type      LineItem
206,924   $206,924,000    2015   Q3            Expense   Aircraft Fuel
286,817   $286,817,000    2014   Q3            Expense   Aircraft Fuel
142,176   $142,176,000    2015   Q3            Expense   Airport operations
…         …               …      …             …         …

Change    Change Normalized   Begin Time Period   End Time Period
(27.9%)   -27.9%              Q3 2014             Q3 2015
10.7%     10.7%               Q3 2014             Q3 2015
…         …                   …                   …
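The normalized columns can be derived from the raw cell text plus header context. The following is a rough sketch under stated assumptions: the scale is read from a header phrase such as "($ in thousands)", and parentheses denote a negative value (a common accounting convention). The function names are hypothetical and not part of any system described in the tutorial.

```python
import re

# Multipliers for scale words that may appear in a header such as
# "Expenses ($ in thousands)".
SCALES = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}


def scale_from_header(header_text):
    """Return the multiplier implied by the header, or 1 if none is stated."""
    match = re.search(r"in (thousands|millions|billions)", header_text.lower())
    return SCALES[match.group(1)] if match else 1


def normalize_amount(text, header_text):
    """Turn '206,924' under a '$ in thousands' header into 206924000."""
    negative = text.startswith("(") and text.endswith(")")
    digits = text.strip("()").replace(",", "").replace("$", "")
    value = int(digits) * scale_from_header(header_text)
    return -value if negative else value


def normalize_percent(text):
    """Turn '(27.9%)' into -27.9 and '10.7%' into 10.7."""
    negative = text.startswith("(") and text.endswith(")")
    value = float(text.strip("()").rstrip("%"))
    return -value if negative else value


amount = normalize_amount("206,924", "Expenses ($ in thousands)")
change = normalize_percent("(27.9%)")
print(amount, change)
```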

Challenge: Table structure representation varies
across document formats
Complete: HTML
Partial: MS Excel, MS Word
None: TXT, PDF, Image
H. Dong et al. "TableSense: Spreadsheet Table Detection with Convolutional
Neural Networks". AAAI '19
Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active
Learning”. CIKM ‘17
M. Cafarella et al. “ WebTables: exploring the power of tables on the web".
VLDB ‘08
Table Understanding still
required for all document types

HTML completely represents table structure
HTML

MS Excel
Each sheet
separate table
Multiple tables defined
in single sheet
Table structure representation varies across
Excel documents
H. Dong et al. "TableSense: Spreadsheet Table
Detection with Convolutional Neural Networks". AAAI '19
Z. Chen et al. “Spreadsheet Property Detection With
Rule-assisted Active Learning”. CIKM ‘17

Table structure representation varies across
Word documents
MS Word
Omit Office Table Object
Use Office Table
Object for all tables

Document formats with no native table representation
TXT
PDF
Image

PDF Document Format
…
BT
0.0503 Tc
8.503556 0 0 8.52 503.2795 688.92 Tm
/Tc2 1 Tf
[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 (e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ] TJ
0 Tc
ET
…
Q
q
46.91952 776.52 m
242.04 776.52 l
242.04 729.96 l
144.48 729.96 l
46.91952 729.96 l
h
…
Draw “m” at (503, 688) in 8.5
point font in color white
Draw “o” at …
Draw “n” at …
Draw “t” at …
Draw “h” at …
Draw “s” at …
….
Draw green line segment from
(46, 776) to (242, 776)
Draw green line segment from
(242, 776) to (242, 729)
…
• A programmatic PDF is a collection of instructions that draw characters and line
segments on the page, together with visual formatting information
• 2 – 4 trillion PDFs are in existence, and the number is growing rapidly
Rendered PDF PDF Binary
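To make the "draw characters" view concrete, here is a small sketch that reads the two operators shown in the excerpt: `Tm` (text matrix, whose last two numbers give the position) and `TJ` (an array of string pieces interleaved with kerning adjustments). It is a toy reader under strong assumptions: literal strings only, no escape decoding, no hex strings, and no font encoding lookup, all of which a real PDF parser has to handle. The function names are hypothetical.

```python
import re

# A literal PDF string: text in parentheses, allowing backslash escapes.
STRING_IN_PARENS = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def text_from_tj(operand):
    """Concatenate the string pieces of a TJ array, ignoring kerning numbers."""
    return "".join(STRING_IN_PARENS.findall(operand))


def position_from_tm(line):
    """Return (x, y) from an 'a b c d e f Tm' text-matrix line."""
    a, b, c, d, e, f = (float(tok) for tok in line.split()[:6])
    return e, f


tj = ("[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 "
      "(e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ]")
text = text_from_tj(tj)
position = position_from_tm("8.503556 0 0 8.52 503.2795 688.92 Tm")
print(repr(text), position)
```

Note that nothing in these instructions says the text belongs to a table cell; that structure has to be inferred from positions and drawn lines.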

Complex tables – graphical lines can be
misleading – is this 1, 2 or 3 tables ?
Table with visual
clues only
Multi-row, multi-column
column headers
Nested row
headers
Tables with Textual
content
Table with
graphic
lines
Table
interleaved with
text and charts
Challenge: Variety in Tables

https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal1231201410k.htm
Excerpt of semi-structured XBRL file
For financial statements
Ex: Delta Air Lines, Inc. 2014 Annual Report Form 10-K
:
<xbrli:context id="FI2013Q4">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:instant>2013-12-31</xbrli:instant>
  </xbrli:period>
</xbrli:context>
:
<us-gaap:CashAndCashEquivalentsAtCarryingValue
    contextRef="FI2013Q4"
    decimals="-6"
    id="Fact-C39BEC178121A91816968BA9ADCF421F"
    unitRef="usd">
  2844000000
</us-gaap:CashAndCashEquivalentsAtCarryingValue>
:
Excerpt of HTML file with granular financial metric data
https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal-20141231.xml
Valuable metrics for airline industry, only
present in HTML table
Valuable metrics present in semi-structured raw data source
MUST BE INTEGRATED
Application: Knowledge-base population

Application: Query Answering
H. Sun et al. “Table Cell Search for Question Answering”. WWW '16

Application: Scientific Leaderboard Construction
Y. Hou et al. “Identification of Tasks, Datasets, Evaluation Metrics, and Numeric
Scores for Scientific Leaderboards Construction”. ACL ‘19
Scientific Publication
Leaderboard Annotations

Application: Biological Information Extraction
G. Singh et al. “QTLTableMiner++: Semantic Mining of QTL Tables in Scientific
Articles”. BMC BioInformatics ‘18
Article Trait Tables
Trait Statements
QTL Statements
Extract info on Quantitative
Trait Locus (QTL) (genomic
regions that correlate with
phenotypes) from tables in
scientific publications

Takeaways
§ Widely used document formats have limited table
representation
– Limits of document format: Image, PDF
– How documents are authored: Word, Excel
§ Wide variety of tables makes general model
construction difficult
– Tables are a form of art
– Diverse visual encoding of semantic information
– Different domains
§ Multiple applications for table extraction and
understanding

§ Table region detection
– Identify all tables
– Separate tables from non-table text
– Separate tables from each other
§ Cell structure recognition
– Partition text into cells
– Find cell span and cell-to-cell overlap (along X- or Y-axis)
What Is Table Extraction?
Table Extraction

[CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93
[I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93
[H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ‘95
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[H00b] J. Hu et al. “Table Structure Recognition and Its Evaluation”, SPIE Doc. Recog. & Retr. ‘00
[KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01
[C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ‘02
[P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ‘03
[H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ‘05
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[M06] S. Mandal et al. “A Simple and Effective Table Detection System from Document Images”, IJDAR ‘06
[L07] Y. Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”, JCDL ‘07
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ‘09
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
[F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
[CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
[K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ‘14
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
[R15] R. Rastan et al. “TEXUS: A Task-based Approach for Table Extraction and Understanding”, DocEng ‘15
[T16] T. A. Tran et al. “A Mixture Model Using Random Rotation Bounding Box to Detect Table Region in Document Image”, JVCIR ‘16
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18a] P. Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”, KDD ‘18
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
[Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
[C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
[L19] M. Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”, arXiv, 2019
[M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from Scanned Documents”, ICDAR ‘19
Table Extraction: A Sample of Prior Work

Most papers present an end-to-end system for:
• Table detection,
• Cell structure recognition (table parsing),
• Or both
🔥 ICDAR 2019 has ≥ 16 new papers on table extraction!
– ICDAR = International Conference on Document Analysis and Recognition

§ Early 1990s : Separator based “top-down” methods
– Ruled line tables
– Extend to white-space “lines”
§ 1990s – early 2000s : “Bottom-up” text clustering
– Group text into columns (or rows), then to tables
– Use space features (gaps, overlap, alignment) and keywords
§ 2000s – early 2010s : Machine Learning (supervised or not)
– Classify text-rows using CRF, SVM, HMM, etc.
– Probabilistic models for tables
– Graph-based models for cell structure
– Unsupervised ML (clustering)
§ Late 2010s : Deep Learning
– Scanned image table detection with R-CNN or YOLO
– Graph neural networks and language embeddings for cell structure
Table Extraction Timeline

§ Analyze Page
– Identify low-level structures & relations
§ The 2 Main Tasks
– Table (region) detection
– Cell structure recognition (given table region)
§ Refine Tables
– Discard false positives
– Adjust table border and structure
How to Build a Table Extraction System?

Common Sub-Tasks in Table Extraction
§ Analyze
– Compute page features
– Group text into larger units
– Identify separator lines
§ Detect
– Generate candidate table regions
– Extract table’s cell structure
§ Refine
– Compute table’s features & score
– Adjust candidate tables
– Select tables for output
§ Learning Infrastructure
– Accuracy metrics
– Ground truth data
– Optimization method

§ Documents can be:
– scanned
– programmatic (“born digital” PDF, TXT)
– hybrid
§ Scanned page requires OCR, plus:
– Reverse any rotation, distortion
– Filter noise, sharpen if low resolution [M19]
– Fix inconsistent font features, bounding boxes
– Detect ruled lines and boxes
• E.g., Gaussian filter + black hat transform [K13]
Page Features
[M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from
Scanned Documents”, ICDAR ‘19

§ Programmatic PDFs (and TXTs)
– Have letters, but no table markup
§ May contain spurious (invisible) text and lines
– White-on-white lines or text
– Occluded or out-of-range lines or text
– Text repeated to simulate bold font
– Need to filter them out
§ Deep Learning (CNN-based) methods need an image
– Convert programmatic to scanned
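The filtering of spurious text can be sketched as follows. This is an illustration only: the token structure (a dict with `text`, `bbox`, and `color`) is hypothetical, the page background is assumed to be white, and occlusion by other graphics is not handled here.

```python
def filter_spurious(tokens, page_width, page_height,
                    background=(1.0, 1.0, 1.0), tol=1.0):
    """Drop invisible, out-of-range, and duplicated ("fake bold") tokens."""
    kept = []
    for tok in tokens:
        x0, y0, x1, y1 = tok["bbox"]
        # White-on-white text: same color as the assumed page background.
        if tok["color"] == background:
            continue
        # Out-of-range text: the box lies entirely outside the page.
        if x1 <= 0 or y1 <= 0 or x0 >= page_width or y0 >= page_height:
            continue
        # Text repeated at (almost) the same position to simulate bold font.
        if any(k["text"] == tok["text"]
               and abs(k["bbox"][0] - x0) <= tol
               and abs(k["bbox"][1] - y0) <= tol
               for k in kept):
            continue
        kept.append(tok)
    return kept


black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
tokens = [
    {"text": "Total", "bbox": (50.0, 700.0, 80.0, 710.0), "color": black},
    {"text": "Total", "bbox": (50.3, 700.0, 80.3, 710.0), "color": black},
    {"text": "hidden", "bbox": (50.0, 650.0, 90.0, 660.0), "color": white},
    {"text": "offpage", "bbox": (700.0, 100.0, 750.0, 110.0), "color": black},
]
kept = filter_spurious(tokens, page_width=612, page_height=792)
print([t["text"] for t in kept])
```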
Page Features

§ Plain text layout (1-column, 2-column, etc.)
– Helps avoid false-positive “tables”
§ Obvious non-tables
– Page & section headers, footers, lists, etc.
– Short-cut computation – if no tables on page
§ Low-level structure
– Alignment @ different box positions & tolerance levels
– A minimum spanning tree for clustering by distance
§ Deep learning features
– CNN features shared across proposal regions
– Natural language embeddings
Page Features

§ Most systems group text early on
– Table detection systems may skip text grouping
§ Text is grouped in one of 3 ways:
– Columns first
– Rows first
– Cell-units (“blobs”) first
§ Some systems partition text using separator lines
– BUT: “Blob” detection reduces over- / under-partitioning
Group Text into Larger Units

Example: a page with two tables
Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
[Figures: the same page with its text grouped into Columns, Rows, and Multi-line “Blobs”]

Many systems detect columns first:
– T-Recs [KD98], Pdf2table [Y05], Lixto [HB07], Tesseract [SS10],
smartFIX [D11]
Example – Tesseract [SS10] :
Start with Columns
1. Detect X-axis “tab-stops” (alignment positions)
2. Group tokens between “tab-stops” horizontally into entries
3. Group entries of the same font vertically into column fragments
4. Group column fragments within page columns horizontally into table fragments
5. Group table fragments if columns match vertically into tables
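Step 1 can be illustrated with a simple clustering of token left edges. This is a simplified stand-in, not the actual procedure of [SS10]: it assumes tokens are already available with their left X coordinates, and the tolerance and minimum support values are arbitrary.

```python
def find_tab_stops(left_edges, tolerance=2.0, min_support=3):
    """Cluster left X coordinates; well-supported clusters become tab-stops."""
    stops = []
    cluster = []
    for x in sorted(left_edges):
        # A gap wider than the tolerance closes the current cluster.
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(sum(cluster) / len(cluster))
    return stops


# Left edges of tokens on a page: two aligned groups and one stray token.
edges = [50.0, 50.5, 49.8, 200.0, 200.2, 199.9, 120.0]
stops = find_tab_stops(edges)
print(stops)
```

The same idea applies to right edges and centers when columns are right-aligned or centered.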

Example
[Figures: the column-first steps on the example page: Tab-Stops, Column Fragments, Table Fragments, Tables, and Tables with Multi-Column Headers]

Start with Rows
Systems with ML often detect rows first
– Pinto-McCallum [P03], e Silva [S06], TableSeer [L08], PDF-TREX [OR09]
Typical process:
1. Identify text-lines
2. Train an ML classifier to label text-lines:
– “Table Dense”, “Table Sparse”, “Table Header”, “Non-table”, etc.
– ML = CRF [P03], HMM [S06], SVM [L08], etc.
3. Merge sparse rows into dense rows – get full table rows:
– Merge up, down, or cluster around, by row alignment [H00a]
4. Combine table rows into tables
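Steps 2 and 3 can be sketched with a rule standing in for the trained classifier. The assumptions are significant: each text-line has already been split into the same number of column slots, a line counts as "dense" when at least three slots are filled, and sparse lines are always merged up into the previous row. A real system would learn the labels (CRF, HMM, SVM) and choose the merge direction from alignment.

```python
def label_line(cells, min_filled=3):
    """Rule-based stand-in for the ML classifier: 'dense' or 'sparse'."""
    filled = sum(1 for c in cells if c.strip())
    return "dense" if filled >= min_filled else "sparse"


def merge_up(lines):
    """Merge each sparse text-line into the row above it, column by column."""
    rows = []
    for cells in lines:
        if label_line(cells) == "sparse" and rows:
            rows[-1] = [
                (old + " " + new).strip() if new.strip() else old
                for old, new in zip(rows[-1], cells)
            ]
        else:
            rows.append(list(cells))
    return rows


lines = [
    ["Row 1, text line 1", "0.12", "1.23", "2.34", "3.45"],
    ["Row 1, text line 2", "", "", "", ""],
    ["Row 1, text line 3", "", "", "", ""],
    ["Row 2, text line 1", "4.56", "5.67", "6.78", "7.89"],
    ["Row 2, text line 2", "", "", "", ""],
]
rows = merge_up(lines)
for row in rows:
    print(row)
```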

Example
[Figure sequence: text-lines of the example page labeled “Table Header”, “Dense Row”, and “Sparse Row”, with an alignment check (✓ / ✕) for groups of sparse rows; in the later steps, groups of sparse rows are relabeled “Heading Row” and all checks are marked ✓]

§ “Blob” = largest semantically bound text unit
– Single-line or multi-line
– If in a table, the whole “blob” must be in a single cell
§ “Blob” ≠ Cell
– Cell has span and overlaps other cells
– Some “blobs” end up in plain text or non-table text
§ “Blobs” help define table structure:
– Trace alignment
– Determine header cell spans
– Fix over-split / over-merged cells, rows, columns
– Reduce search space
Text “Blobs” (Cell-Units, Paragraphs, …)

§ [KD98] Distance based clustering:
– Merge words horizontally
– Merge text strings vertically if word-spans interleave
§ Problems with distance:
– Multi-column headers: 1 justified phrase vs. ≥ 2 closely spaced phrases
– Row headers / text cells: 1 multi-line cell vs. ≥ 2 closely spaced rows
§ Example:
How to Detect “Blobs”
Two Column Header Two Column Header
HEADER Header Header Header Header
Row 1, text line 1 0.12 1.23 2.34 3.45
Row 1, text line 2
Row 1, text line 3
Row 2, text line 1 4.56 5.67 6.78 7.89
Row 2, text line 2
Row 2, text line 3

§ [H00a], [OR09] Merge “sparse” rows into “dense” rows
– Merge up, merge down, or cluster around
§ [L09] Detect and follow reading order ← an NLP challenge
§ [B12] [B14] Train a classifier over “blob” features:
– Proper termination (e.g. “blobs” don’t end with a dash or comma)
– Number of numeric strings
– Indentation, large space at the end of a string
– Shared font properties
§ Deep learning approaches:
– Cell-unit detection (over image) using CNNs
– Semantic relationship detection (over text) using RNNs
How to Detect “Blobs”
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ’09

Example
Table Source: https://www.dollartreeinfo.com/static-files/0c3687d8-e6ce-4566-bc89-79fc8c8b665e (2016_Proxy_Statement_Final.pdf)

§ Ruled Lines & Colored Boxes
– Extend ruled lines over small gaps, “snap” together
– Merge touching colored boxes, then convert into lines
– Filter out: highlighting, underlining, boxed comments, logos, charts etc.
§ BUT: A “perfect” ruled-line grid can be incomplete !
– Some lines may be missing
– Lines may fail to extend to header rows / columns
Separator Line Detection

Example 1
Table Source: https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/quarterly-result/2015/2015_MDA_q3.pdf

Example 2
Table Source: https://www.ada.gov/restripe.pdf

Example 3
Table Source: http://educationaldatamining.org/files/conferences/EDM2018/EDM2018_Preface_TOC_Proceedings.pdf

§ White-space separators (“virtual” lines)
– Help define cell span / cell alignment in tables
– Prune false-positives by ML or by heuristics [B12]
§ How to detect white-space separators
– Cell-unit (“blob”) bounding box expansion [I93]
– Axis projection histograms [CK93]
– White-space cover by maximum-area white-space rectangles [F11]
§ How to prune them (features to use)
– Adjacent “blobs” : alignment, size, and content
– “Strong” separators that run parallel to or intersect the separator
Separator Line Detection

§ Commonly used to partition page and generate separators
– By [C02], [W04], [K14], and others
§ [H95] The algorithm recursively, for each block:
– Computes X- and Y-axis projection profiles
– Divides the block into sub-blocks based on dips in profiles:
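A compact sketch of the recursion, working on bounding boxes in the spirit of [H95] rather than on pixel profiles. It is simplified in one important way: a "dip" is taken to be a completely empty gap of at least `min_gap` units in the projection, whereas real profiles are usually thresholded. Boxes are `(x0, y0, x1, y1)` tuples, and all names are hypothetical.

```python
def find_gaps(intervals, min_gap):
    """Return cut positions in empty gaps (>= min_gap) of a 1-D projection."""
    cuts = []
    intervals = sorted(intervals)
    reach = intervals[0][1]
    for lo, hi in intervals[1:]:
        if lo - reach >= min_gap:
            cuts.append((reach + lo) / 2)
        reach = max(reach, hi)
    return cuts


def xy_cut(boxes, min_gap=5.0):
    """Recursively split boxes (x0, y0, x1, y1) into leaf blocks."""
    if len(boxes) <= 1:
        return [boxes]
    for axis in (0, 1):  # 0: project onto X, 1: project onto Y
        cuts = find_gaps([(b[axis], b[axis + 2]) for b in boxes], min_gap)
        if cuts:
            bounds = cuts + [float("inf")]
            groups = [[] for _ in bounds]
            for b in boxes:
                # Each box lies entirely between two consecutive cuts.
                idx = next(i for i, c in enumerate(bounds) if b[axis + 2] <= c)
                groups[idx].append(b)
            return [leaf for g in groups for leaf in xy_cut(g, min_gap)]
    return [boxes]  # no gap along either axis: this block is a leaf


# Four text boxes arranged as a 2 x 2 grid.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30), (20, 20, 30, 30)]
leaves = xy_cut(boxes)
print(leaves)
```

The cut positions found along the way are the white-space separator candidates.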
Recursive X-Y Cut Algorithm

§ Ruled Line grids / frames, connected components
§ (Rows 1st) Stack “table” rows whose “blobs” co-align [L08], [OR09]
– Rows are labeled by an ML-classifier (CRF, SVM, HMM)
– Labels & matching “blob” layout → table regions
– NOTE: Be sure to label “header rows” to tell tables apart !
§ (Cols 1st) Cluster overlapping column fragments [HB07], [SS10]
– Group table columns horizontally, staying within page layout columns
(when possible)
– Group vertically if column fragments overlap, match, or subsume
– NOTE: Column header areas require special handling !
Generate Candidate Table Regions

§ (Blobs 1st) Classify text “blobs”, cluster those labeled “table”
– [B14] iteratively labels “blobs” given their neighbors’ labels
– [B14] trains a Kernel Logistic Regression classifier
§ (Lines 1st) Find areas where “strong” separators make a grid
– [CL12] uses Max-Flow / Min-Cut algorithm to extract grids
– Bi-cluster the intersection matrix of horizontal vs. vertical separators

X ≈ U V^T, with U ≥ 0, V ≥ 0, and k clusters
• X_{ij} = 1 ⇔ lines i and j intersect
• At intersections: 1 ≈ u_{i1} v_{j1} + u_{i2} v_{j2} + … + u_{ik} v_{jk}
• Each u_{ic} v_{jc} ≥ 0 gives the affinity of intersection X_{ij} to cluster c
• u_{ic} v_{jc} is large ⇔ u_{ic} and v_{jc} are both large
[Figure: binary intersection matrix X with its non-negative factors U and V]
Non-neg. Matrix Factorization for Grid Clustering
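A small pure-Python sketch of the factorization, using the classic multiplicative update rules for the squared error ||X - U V^T||^2. Whether the tutorial's system uses these particular updates is not stated, so treat the choice as an assumption. Rows of X stand for horizontal separators and columns for vertical separators; the toy matrix below encodes two separate grids.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(x, u, v):
    """Squared Frobenius norm of X - U V^T."""
    approx = matmul(u, transpose(v))
    return sum((xij - aij) ** 2
               for xr, ar in zip(x, approx) for xij, aij in zip(xr, ar))


def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Factor X (m x n) into non-negative U (m x k) and V (n x k)."""
    rng = random.Random(seed)
    u = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in x]
    v = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in x[0]]
    for _ in range(iterations):
        # U <- U * (X V) / (U V^T V), element-wise
        xv, uvtv = matmul(x, v), matmul(u, matmul(transpose(v), v))
        u = [[u[i][c] * xv[i][c] / (uvtv[i][c] + eps) for c in range(k)]
             for i in range(len(u))]
        # V <- V * (X^T U) / (V U^T U), element-wise
        xtu, vutu = matmul(transpose(x), u), matmul(v, matmul(transpose(u), u))
        v = [[v[j][c] * xtu[j][c] / (vutu[j][c] + eps) for c in range(k)]
             for j in range(len(v))]
    return u, v


# 4 horizontal lines x 5 vertical lines: lines 0-1 cross verticals 0-2,
# lines 2-3 cross verticals 3-4 (two separate grids).
x = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
u, v = nmf(x, k=2)
error = frobenius_error(x, u, v)
# Assign each horizontal line to the cluster with the largest weight.
row_cluster = [max(range(2), key=lambda c: row[c]) for row in u]
print(error, row_cluster)
```

For well-separated grids the largest entry in each row of U (and of V) typically identifies the grid a line belongs to; touching or nested grids need additional care.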

§ (CNN-based) Try a fixed set of table region proposals
– CNN shares computation of features across all translations of a given
proposal rectangle
– Proposal rectangle shapes / sizes are fixed as hyperparameters
– If a proposal hits a table, a regression decides table borders

§ Use existing object detection
frameworks (Faster R-CNN or
YOLO) retrained for table
detection
§ The field is wide open for more
table-specific DL approaches
– E.g. involving text semantics
Li et al. “TableBank: Table Benchmark for Image-based Table Detection and
Recognition”. ArXiv 2019
Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest
Documents at Scale.”. KDD 2018
Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure
Recognition of Tables in Document Images” ICDAR 2017
Gilani et al. “Table Detection using Deep Learning” ICDAR 2017
Deep Learning for Table Detection

§ Cells define overlap relation along X- or Y-axis
– Links headers with data – critical for table understanding
§ Cell borders ← ruled lines ∪ “strong” white-space lines
– Extend lines to make rectangular cells, avoid crossing “blobs”
§ Ruled grids: test for incompleteness
– Multiple numerics per cell
– A “strong” white-space line splits text in ≥ 2 cells
– A “mini-table” inside a ruled cell
– Cell structure extends beyond table frame
§ White-space grids: clean up empty cells
– Expand header cells by merging with empty cells [S06]
– Merge (almost-) empty rows and columns
Cell Structure: Line Based

§ Use Spatial Constraints to find an overlap DAG over cells [H03]
§ Use Graph Neural Networks to find 2 undirected graphs [Q19] [C19]:
– “Same Row” graph & “Same Column” graph
– Two cells share an edge ⇔ share a row / a column
– [Q19] : Rows and columns = maximal cliques
– [C19] : Only adjacent cells share a graph edge
Cell Structure: Graph Based
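Once the edges are predicted, rows and columns have to be read back from the two graphs. A minimal sketch for the adjacency-only case: rows are taken as connected components of the “Same Row” graph, and columns as connected components of the “Same Column” graph. This is an illustration under assumptions, not the post-processing of [Q19] or [C19]; in particular, a spanning cell that belongs to several rows or columns would wrongly merge their components here.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# A 2 x 3 table with cells A B C / D E F; only adjacent cells share an edge.
cells = ["A", "B", "C", "D", "E", "F"]
same_row = [("A", "B"), ("B", "C"), ("D", "E"), ("E", "F")]
same_col = [("A", "D"), ("B", "E"), ("C", "F")]

rows = connected_components(cells, same_row)
cols = connected_components(cells, same_col)
print(rows, cols)
```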

Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure
Recognition of Tables in Document Images” ICDAR 2017
Cell Structure: CNN Based
§ Object detection networks were also used for cell structure detection

§ Eliminate false positive tables
§ Detect malformed table regions
– Plain text in tables
– Missing row / column headers or split-off pieces
– One region covers multiple tables
§ Compare alternative table candidates
– Example: Is this 1 table or 2 tables?
§ Improve table region and structure
– Pick the best adjustment out of a range of options
– NOTE: Knowing cell structure helps region scoring / adjustment
§ Provide a confidence value for output tables
Why Score Tables?

§ Tables are very diverse
– Tiny or huge, misaligned, text in cells, key-value pairs, confusing delimiters
– Complex row / column headers – so different, easy to chop off !
§ What’s around the table also matters
– Can its columns or rows be extended? Should they be?
§ One table, or ≥ 2 adjacent tables?
– 1 table may have: ruled bars, wide gaps, font / alignment changes
– 2 tables may be: fully or partly co-aligned, separated in one of many ways
§ Non-table text can have complex structure, too
– Page headers / footers, framed / highlighted text, hierarchical lists, …
Table Scoring Challenges

Example 1
Table Source: https://www.legislation.gov.au/Details/F2010C00607/0d99393c-5c5b-4af0-9cc1-b5c2de8632c3 (F2010C00607.pdf)
NOT A TABLE !

Example 2
Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf
Row
headers Column
headers

Example 3
Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf
Row
headers
Column
headers

Example 4
Table Source:
https://assets.ctfassets.net/rz9m1rynx8pv/2x3p5ompzZyrRtAHw4M3XB/be648275661795139cabcee29a730630/TELUS_Q1_2019_quarterly_report.pdf
Row
headers
Column
headers

§ Rule-out patterns
– Rule out charts, lists, signature blocks etc.
§ Aggregated column / row score
– [KD01] Aggregate the similarities that led to the table’s column fragments
§ Dynamic programming score
– [H99] Score (T) = max { Score (T – line) + Merit (line) }
– Score the best split into 2 sub-tables
§ Probability of being a table (given the features)
– [W04] Partition page into blocks labeled “table” and “plain text”
– Compute label probability for block + neighboring blocks
§ A scoring neural network on top of CNN [G17, S18b]
How to Score a Table

§ Columns and rows:
– Number, span / extent, alignment, font / content similarity
§ Ruled and white-space separators:
– Number, span / extent, width of their margins
– If they match, reach (good) or cross (bad) table borders
§ Inside vs. outside table:
– Border crossing ruled lines, aligned blocks, or highly similar text
– The two sides have matching structure
§ Cell structure:
– Oversized cells, misaligned pairs of cells, “runs” of empty cells
§ Content:
– Numerics, repeated words; customizable keywords
– Domain-specific “expectations,” e.g. header dictionary [D11]
§ CNN-generated features
Features for Table Scoring

§ Leverage table features and score
– Specify what a well-formed vs. malformed table looks like
§ Use a transparent, explainable method
– If detection is a “black box”, adjustment uses explainable rules & features
§ Correct errors quickly
– Bypass the need for extra ground-truth data, retraining
§ Customize to address specific concerns
– Add custom features, rules, and constraints
Why Adjust Tables?

§ Merge table with an adjacent table or text-block [W04] [SS10]
§ Adjust table border – add or drop rows or columns [HB07] [D11]
§ Split one table into two, possibly with plain text between
§ Re-compute table region by neural network regression [G17] [S18b]
§ Choose best-scoring border (or structure) out of a range of options
§ Iterate adjustment → traverse a search tree of candidate tables
How to Adjust Candidate Tables

What if candidate tables overlap each other?
§ [H99] uses Dynamic Programming:
– Only for top and bottom line-positions: [i, j]
– Score disjoint unions of tables:
§ CNN-based object detection systems:
– Greedy Approach: Pick the top-scoring region, repeat
– PROBLEM: Lower-scoring table may have a high-scoring sub-table
§ Maximum Weighted Independent Set
– Nodes = tables, edges = conflicts, weights = table scores
– NP-hard even for 2-dim rectangles [RN95], but can be solved
efficiently in real-life cases
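The difference between greedy selection and an exact method can be shown in the one-dimensional case, where each candidate table is a range of line positions [top, bottom] with a score. The sketch below uses the standard weighted interval scheduling recurrence as a stand-in; the exact recurrence of [H99] is not reproduced in these slides, and summing table scores as the objective is an assumption.

```python
from bisect import bisect_right


def select_greedy(candidates):
    """Repeatedly keep the top-scoring candidate that overlaps nothing kept."""
    chosen = []
    for top, bottom, score in sorted(candidates, key=lambda c: -c[2]):
        if all(bottom < t or top > b for t, b, _ in chosen):
            chosen.append((top, bottom, score))
    return sorted(chosen)


def select_optimal(candidates):
    """Maximum-weight set of non-overlapping [top, bottom] line ranges."""
    cands = sorted(candidates, key=lambda c: c[1])  # sort by bottom line
    bottoms = [c[1] for c in cands]
    best = [(0.0, [])]  # best[i]: (total score, tables) over the first i candidates
    for i, (top, bottom, score) in enumerate(cands):
        # p = number of earlier candidates that end strictly above this top line
        p = bisect_right(bottoms, top - 1, 0, i)
        take = (best[p][0] + score, best[p][1] + [(top, bottom, score)])
        best.append(max(best[i], take, key=lambda s: s[0]))
    return best[-1]


# One large candidate and two sub-tables that together score higher.
candidates = [(1, 10, 0.6), (1, 4, 0.5), (6, 10, 0.5)]
greedy = select_greedy(candidates)
total, tables = select_optimal(candidates)
print(greedy, total, tables)
```

For two-dimensional regions the same objective becomes a maximum weighted independent set problem on the conflict graph.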
Select Best Tables for Output
[RN95] C.S. Rim and K. Nakajima. “On Rectangle Intersection and Overlap Graphs”, IEEE Trans. on Circuits & Systems I, 42(9), 1995
[Figure: conflict graph over candidate tables; Conflict = Table Overlap]

§ Accuracy Metrics
– Exact match of table region or structure is too inflexible
– Partial match: Text? Area? Cell relationship? Functional?
§ Ground Truth Labeling
– Very time consuming, requires sophisticated UI tools
– Humans disagree on what’s correct
§ Optimization (pre- deep learning)
– Lots of discrete, non-differentiable steps
– Learn sub-tasks, e.g. row labeling with CRF / SVM
– [W04] Global parameter learning:
Learning from Data: Challenges
Table Extraction

Table Boundary
§ Purity & Completeness
§ Character level recall, precision
and F1
Table Structure
§ Recall and Precision of Cell
Adjacency Relations
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
ICDAR 2013 Competition Metrics
Table Extraction
Accuracy Metrics
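A simplified sketch of the cell adjacency relation metric: each non-blank cell is linked to its nearest non-blank neighbor to the right and below, and the predicted relations are compared with the ground truth. It assumes plain grids without spanning cells; the example tables are made up.

```python
from typing import List, Set, Tuple

Relation = Tuple[str, str, str]  # (cell text, neighbor text, direction)


def adjacency_relations(grid: List[List[str]]) -> Set[Relation]:
    """Link each non-blank cell to its nearest non-blank right and lower neighbor."""
    relations: Set[Relation] = set()
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if not text:
                continue
            right = next((row[k] for k in range(c + 1, len(row)) if row[k]), None)
            below = next(
                (grid[k][c] for k in range(r + 1, len(grid)) if c < len(grid[k]) and grid[k][c]),
                None,
            )
            if right:
                relations.add((text, right, "horizontal"))
            if below:
                relations.add((text, below, "vertical"))
    return relations


def precision_recall_f1(truth: Set[Relation], predicted: Set[Relation]):
    """Score predicted relations against ground-truth relations."""
    correct = len(truth & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


truth_grid = [["Year", "Revenue"], ["2018", "10"], ["2019", "12"]]
# Hypothetical extraction error: the two data rows were merged into one.
predicted_grid = [["Year", "Revenue"], ["2018 2019", "10 12"]]

truth = adjacency_relations(truth_grid)
predicted = adjacency_relations(predicted_grid)
precision, recall, f1 = precision_recall_f1(truth, predicted)
print(precision, recall, f1)
```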

§ Measure what actually
matters downstream
§ Capture accuracy of
access paths to each cell
§ Need header annotation
as well as cell structure
Table Extraction
Accuracy Metrics
Functional Metrics

Ground Truth Datasets
Complete Datasets with table boundary and cell structure:
- ICDAR-2013 competition (PDF Format)
- ICDAR-2019 competition (Image Format)
- SciTSR 2019 (Generated from LaTeX files)
Incomplete Datasets
§ TableBank (Full table boundary information only)
§ PDF-TREX (Financial Table dataset without ground truth Labels)
§ Marmot (Only ground truth for table boundary, cells inaccessible)
§ UNLV, UW-3 (Table structure and boundary annotations for scanned documents)
Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. ArXiv 2019
Oro et al. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”. ICDAR '09
Fang et al. “Dataset Ground-Truth and Performance Metrics for Table Detection Evaluation”. DAS '12
Chi et al. “Complicated Table Structure Recognition” arXiv 2019
Table Extraction

Example: Accuracy Comparison
§ Table detection accuracy on the ICDAR 2013 Competition dataset:
Table Extraction

Table Understanding
Table Understanding
HTML Document
Table Extraction (Optional)
Document (PDF, Image)
Representation of table contents that:
• Captures semantic information
• Is amenable to post-processing
Table Understanding Output
Knowledge Base
Creation
Downstream
Tasks
Question
Answering
e.g., Air Canada’s oper. revenues in Q3 2015?
HTML
Understand the semantics of tabular data

Semantics of Tabular Data
Table Understanding
What does this cell represent?

Table Understanding
“The unaudited comprehensive net loss of Air
Canada in the six months ended June 30, 2015
is $13 million Canadian dollars.”

Table Understanding
Information about a single cell is derived from multiple places

What You Will Learn
Table Understanding
Components of table understanding
• What are the different types of semantic information about a table?
• Where can they be found?
1

What You Will Learn
Table Understanding
Table understanding Methods
1
2
• What techniques are used to extract info for table understanding?
• What learning methods can be used?

What You Will Learn
Table Understanding
Table understanding Methods
1
2
• How do tables differ between domains?
• How do the assumptions of proposed approaches affect their
potential applicability to other domains?
Importance of Domain
3
• What techniques are used to extract info for table understanding?
• What learning methods can be used?

Outline: Components of Table Understanding
Table Understanding
A. Table Regions
(Column/Row Headers)
B. Context
Within Table
C. Context
Within Document
D. Context Outside
Document

Outline: A. Table Regions
Table Understanding
Column Headers
(incl. nesting)
Row Headers
(incl. nesting)
Data/Body
Cells
Main table regions
Metadata

Unsupervised Methods Overview
Table Understanding
Header rows/cols “look different” than data rows/cols

Table Understanding
Similarity Features

Table Understanding
Heuristics
Similarity Features
• Which heuristics to use?

Unsupervised Methods: Local Minimum
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
For column (row) headers: Find first row (col) that looks “different”
Pair-wise similarity of
consecutive rows
Local minimum of similarity
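A rough sketch of the local-minimum idea for column headers. The only similarity feature used here is whether aligned cells share a coarse type (number, text, empty); that choice and all names are assumptions for illustration, and the cited method uses richer features.

```python
from typing import List


def cell_type(text: str) -> str:
    """Coarse cell type: 'empty', 'number', or 'text'."""
    stripped = text.replace(",", "").replace("%", "").replace("$", "").strip()
    if not stripped:
        return "empty"
    try:
        float(stripped)
        return "number"
    except ValueError:
        return "text"


def row_similarity(a: List[str], b: List[str]) -> float:
    """Fraction of aligned cells with the same coarse type."""
    width = max(len(a), len(b))
    same = sum(cell_type(x) == cell_type(y) for x, y in zip(a, b))
    return same / width if width else 1.0


def header_row_count(rows: List[List[str]]) -> int:
    """Rows above the first local minimum of consecutive-row similarity."""
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s < 1.0 and s < left and s < right:
            return i + 1
    return 0  # no row looks different from its successor


table = [
    ["Region", "Q1", "Q2"],
    ["North", "10", "12"],
    ["South", "8", "9"],
    ["West", "7", "11"],
]
n_header_rows = header_row_count(table)
print(n_header_rows)
```

The same procedure applied to the transposed table gives a candidate for row header columns.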

Unsupervised Methods: Indexing
Table Understanding
S. Seth et al. “Segmenting tables via indexing of value cells by table headers”.
ICDAR ‘13
• Use empty and repeated
cells to find critical cells that
outline the stubhead
• Independent of visual
aspects of table
Repeated cell
implying hierarchical
row header
Empty cells implying
hierarchical column
header

Traditional ML Methods Overview
Table Understanding
Traditional
ML Methods
Similarity Features
Column
Headers
Data
Cells
Classification Labels
• How to model this as a classification problem?
• Which ML method and features to use?

Traditional ML Methods: Row/Column Classification
Table Understanding
Data row
Data row
Data row
Data row
Data row
Data row
Column header row
Classify rows as column header rows (similarly for row header columns)
D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03

Header Identification Results
Table Understanding
S. Seth et al. “Segmenting tables via indexing of value cells by table headers”.
ICDAR ‘13
R. Rastan et al. “TEXUS: A unified framework for extracting and understanding
tables in PDF documents”. Information Processing & Management
Government Statistic Table Set (Seth)
Method | Correct Segmentation | Correct Stub Head (Critical Cell)
Seth et al. | 99% | 100%
TEXUS | 100% | 100%
ASX-Announcements Dataset (TEXUS)
Method | Correct Segmentation | Correct Stub Head (Critical Cell)
TEXUS | - | 42.9%

No standard benchmark or dataset
Table Understanding
D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
FedStat Textfile Dataset (Pinto) CiteSeerX PDF Dataset (Fang)

Traditional ML Methods: Table Classification
Table Understanding
Web Data Commons – Web Table Corpora
Classify entire tables, e.g.,
Relational Table, Entity/Listing Table, Matrix Table

Table Understanding
• Table class implies header structure

Table Understanding
• Table class implies header structure
• Can be used for header identification under certain assumptions
Single col header row
Single col header row
Single row header col

Table Understanding
Table Classes
Genuine vs Non-genuine | Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web”. WWW ’02
Relational vs Non-relational | M. Cafarella et al. “Uncovering the Relational Web”. WebDB ’08
I. Relational Knowledge: Listing, Attribute/Value, Matrix, Calendar, Enumeration, Form; II. Layout: Navigational, Formatting | E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ’11
Vertical listings, horizontal listings, matrix tables | J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ’15

Table Understanding
ML Methods
Decision Tree, SVM | Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web”. WWW ’02
Rule-based Classifier (WEKA) | M. Cafarella et al. “Uncovering the Relational Web”. WebDB ’08
Gradient Boosted Decision Tree | E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ’11
Decision Tree (CART, C4.5, Random Forest), SVM | J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ’15

Traditional ML Methods
Table Understanding
Neighborhood and Table Features
• Number of non empty cells
difference
• Average alignment
• Percentage of same cell data type
• Percentage of same cell font style
• Content repetition
• Number and standard deviation of
rows and columns
Cell Features
• Number of non empty cells.
• Average cell length.
• Percentage of numeric characters.
• Percentage of symbolic characters
• Average font size.
• Cell Font Styles
• Cell positioning in the table
• Percentage of cells spanning
multiple cols/rows
• HTML Tags (if applicable)
• Cell Span
J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification
Approach”, BDC ‘15
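A sketch of how a few of the listed features could be computed for a table given as rows of cell strings. Feature names and exact definitions are assumptions for illustration; style and HTML features are left out because they need layout information.

```python
from statistics import mean, pstdev
from typing import Dict, List


def cell_features(text: str) -> Dict[str, float]:
    """Content features of a single non-empty cell."""
    n = len(text)
    return {
        "length": float(n),
        "pct_numeric": sum(ch.isdigit() for ch in text) / n,
        "pct_symbolic": sum(not ch.isalnum() and not ch.isspace() for ch in text) / n,
    }


def table_features(rows: List[List[str]]) -> Dict[str, float]:
    """Aggregate cell features plus simple shape features for a whole table."""
    non_empty = [c for row in rows for c in row if c.strip()]
    feats = [cell_features(c) for c in non_empty]
    widths = [len(row) for row in rows]

    def avg(key: str) -> float:
        return mean([f[key] for f in feats]) if feats else 0.0

    return {
        "n_rows": float(len(rows)),
        "n_cols": float(max(widths)),
        "std_cols_per_row": pstdev(widths),
        "non_empty_cells": float(len(non_empty)),
        "avg_cell_length": avg("length"),
        "avg_pct_numeric": avg("pct_numeric"),
        "avg_pct_symbolic": avg("pct_symbolic"),
    }


features = table_features([["Name", "Price"], ["Tea", "$4"], ["Milk", ""]])
print(features)
```

A vector like this would be the input to any of the classifiers listed earlier (decision tree, SVM, and so on).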

Deep Learning Methods Overview
Table Understanding
Deep
Learning
Methods
Similarity Features
Column
Headers
Data
Cells
Classification Labels
• Which deep learning architecture to use?

Deep Learning Methods: Hierarchical Attention Network
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables]
Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16
Hierarchical RNN proposed to leverage
document structure:
• 2 layers:
• Words
• Sentences

Deep Learning Methods: Hierarchical Attention Network
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables]
Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16
Extend to tables:
• 3 layers
• Tokens
• Cells
• Rows or Columns
• Bidirectional network
• Combine row-directional and
column-directional network

Deep Learning Methods: RNN-CNN Hybrid (TabNet)
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17
LSTM captures semantic
representation of each cell
CNN captures
relationship between cells

Deep Learning Methods: RNN-CNN Hybrid (TabNet)
Table Understanding
LSTM captures cell text
together with coordinates
and other HTML tags (i.e.,
formatting)

Deep Learning Methods: Results
Table Understanding
(Rule-Based)
(Decision Tree)
(Decision Tree)
(Hierarchical Attention
for Documents)
(RNN-CNN Hybrid)

Beyond Flat Headers: Hierarchical Row Headers
Table Understanding
Hierarchical
Row Headers

Beyond Flat Headers: Hierarchical Row Headers
Table Understanding
Identify hierarchical relationship among row headers
Complex semantic row header hierarchy: Multiple cells in the same row header
column are semantically related to each other

Beyond Flat Headers: Hierarchy as a Graphical Model
Table Understanding
Z. Chen et al. “Integrating Spreadsheet Data via Accurate and Low-Effort
Extraction”. KDD ‘14
Encode hierarchy as graphical model
• Variable: Candidate parent-child pair
• Node potentials: Features for predicting
parent-child pairs
• Edge potentials: Correlations of
variables based on style, KB affinity, …

Pairwise vs Rectangle cell relationships
Table Understanding
• Pairwise classification can only utilize local information
• Simply looking at the pair may not be sufficient to determine the relation
• A rectangle is “interesting” if it is the support rectangle of some cell,
called a header cell of that rectangle

Beyond Flat Headers: Hierarchy as Rectangle Relationship
Table Understanding
X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of
Financial Tables”. ICDAR ‘17
Two “interesting” rectangles:
• “Assets” (row 1) heads rows 2-17
• “Current” (row 2) heads rows 3-11

Beyond Flat Headers: Hierarchy as Rectangle Relationship
Table Understanding
X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of
Financial Tables”. ICDAR ‘17
When a “total” row is considered as a parent
candidate, it cannot take children
For each iteration:
• Combine: Consecutive minimal
rectangles with equal features
• Attach: Minimal rectangle r_i to
directly preceding rectangle r_{i-1} if
r_{i-1} > r_i

Outline: B. Context Within Table
Table Understanding
Currency
Additional semantic information within the table
- of different types
Scale

Table Understanding
- of different scope Propagate to all
data cells

Table Understanding
- of different scope
Propagate to
subset of data cells
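A minimal sketch of propagating scale and currency context to data cells. Table-level context applies to all cells, and a column header that carries its own scale overrides it for that subset. The regular expression, symbol list, and example table are assumptions for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

SCALES = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}
CURRENCY_SYMBOLS = ("$", "€", "£")


def parse_context(text: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (currency symbol, multiplier) hints found in a header or caption."""
    currency = next((s for s in CURRENCY_SYMBOLS if s in text), None)
    match = re.search(r"in (thousands|millions|billions)", text, re.IGNORECASE)
    scale = SCALES[match.group(1).lower()] if match else None
    return currency, scale


def normalize(table_context: str, column_headers: List[str], row: List[str]) -> List[Dict]:
    """Attach currency and apply scale; column-level context wins over table-level."""
    table_currency, table_scale = parse_context(table_context)
    cells = []
    for header, raw in zip(column_headers, row):
        col_currency, col_scale = parse_context(header)
        cells.append({
            "header": header,
            "value": float(raw.replace(",", "")) * (col_scale or table_scale or 1),
            "currency": col_currency or table_currency,
        })
    return cells


# Hypothetical table: the caption sets "$ in millions", one column overrides the scale.
cells = normalize(
    "Consolidated results ($ in millions)",
    ["Revenue", "Net income (in thousands)"],
    ["1,250", "672"],
)
print(cells)
```

Note that the symbol alone does not identify the currency (the earlier Air Canada example is in Canadian dollars), so resolving it needs further context.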

Outline: C. Context Within Document
Table Understanding
Additional context outside the table within the same document
- leverage relevant text and tables

Table Context Within Document
Table Understanding
Surrounding text often contains important info about a table
Deeper
Semantic Understanding
• Link text to table
• Generate table title
Shallow
Context Extraction
• Extract table metadata

Extract Table Metadata
Table Understanding
Ying Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in
Digital Libraries”. JCDL ’07
Document title
Page
Table Caption
Document authors

Link Text to Table Cells
Table Understanding
D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18
Text
Cells described by text
Approach:
• Identify headers
• Match sentence to table cells
based on:
• Unique words
However, mirroring the overall
softness of the tech sector, sales of
computer hardware decreased 1%
versus a year-ago to $1.6 billion.

Table Understanding
Text
Approach:
based on:
• Unique words
• Syntactic analysis

Table Understanding
Text
Approach:
based on:
• Unique words
• Semantic analysis
”…talking about topics is an
important reason to email with
these special interest groups.”
word2vec

Table Understanding
Text
Approach:
based on:
• Unique words
• Semantic analysis
• Use rules to refine matches
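A sketch of the "unique words" matching step only: a word that occurs in exactly one header identifies that header, and a sentence is linked to the cells where matched row and column headers intersect. The table headers below are hypothetical, and the syntactic, semantic, and rule-based refinements are not covered.

```python
import re
from typing import Dict, List, Set, Tuple


def words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unique_words(headers: List[str]) -> Dict[str, int]:
    """Map each word that occurs in exactly one header to that header's index."""
    counts: Dict[str, int] = {}
    owner: Dict[str, int] = {}
    for index, header in enumerate(headers):
        for word in words(header):
            counts[word] = counts.get(word, 0) + 1
            owner[word] = index
    return {w: i for w, i in owner.items() if counts[w] == 1}


def link(sentence: str, row_headers: List[str], col_headers: List[str]) -> List[Tuple[int, int]]:
    """Return (row, column) indices of cells whose headers share unique words with the sentence."""
    tokens = words(sentence)
    rows = {i for w, i in unique_words(row_headers).items() if w in tokens}
    cols = {i for w, i in unique_words(col_headers).items() if w in tokens}
    return sorted((r, c) for r in rows for c in cols)


sentence = ("However, mirroring the overall softness of the tech sector, sales of "
            "computer hardware decreased 1% versus a year-ago to $1.6 billion.")
row_headers = ["Computer hardware", "Software licenses", "Consulting services"]
col_headers = ["Sales ($ billion)", "Change versus year-ago"]

linked = link(sentence, row_headers, col_headers)
print(linked)
```

A practical version would also drop stop words, since a word like "of" can be unique to one header by accident.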

Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Problem:
• Web tables lack titles or
• Existing titles lack context
Table Title?

Table Understanding
Solution:
• Leverage surrounding context
to generate table title
Table + Surrounding Context
Table Title
Problem:
• Web tables lack titles or
• Existing titles lack context

Table Understanding
Context used as input:
• Page Title
• Section headers (<h...> tags)
• Column headers
• Spanning column headers
as a special case
• Table caption (<caption> tag)
Table Title
Context ignored due to noise:
• Text right before/after table
• Table rows

Table Understanding
Model Design
• Pointer-generator network
• First proposed for
abstractive summarization
• Combines copy & generator
mechanism
Table Title
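A sketch of how a pointer-generator network combines its copy and generator mechanisms at one decoding step: the output probability of a word mixes the vocabulary distribution with the attention mass on source positions holding that word. The numbers and tokens are made up; in a real model they come from the decoder.

```python
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_dist: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on copies of w."""
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Hypothetical step: "Wimbledon" is not in the output vocabulary but appears
# twice in the context (e.g. page title and section header), so it can be copied.
vocab_dist = {"results": 0.5, "of": 0.3, "season": 0.2}
source_tokens = ["Wimbledon", "season", "Wimbledon"]
attention = [0.5, 0.25, 0.25]

dist = final_distribution(0.5, vocab_dist, attention, source_tokens)
next_word = max(dist, key=dist.get)
print(next_word, dist)
```

The copy path is what lets a generated title reuse rare names from the page title or headers that a fixed vocabulary would miss.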

Outline: D. Context Outside Document
Table Understanding
Additional context outside the table from other resources
- link to knowledge bases

Table to KB Linking
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
Link different parts of the table to external knowledge bases
Link Columns
(known as Column Type Identification)
Link Rows/Cells
(known as Entity Linking)
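A toy sketch of both linking tasks, assuming a tiny in-memory knowledge base with exact string matching and a majority vote over cell types. The KB entries and identifiers are hypothetical; real systems use candidate generation, disambiguation, and a type hierarchy.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical knowledge base: surface form -> entity record.
TOY_KB: Dict[str, Dict[str, str]] = {
    "Toronto": {"id": "kb:Toronto", "type": "City"},
    "Montreal": {"id": "kb:Montreal", "type": "City"},
    "Air Canada": {"id": "kb:Air_Canada", "type": "Airline"},
    "WestJet": {"id": "kb:WestJet", "type": "Airline"},
}


def link_cells(column: List[str]) -> List[Optional[Dict[str, str]]]:
    """Entity linking by exact match; unmatched cells map to None."""
    return [TOY_KB.get(cell.strip()) for cell in column]


def column_type(column: List[str]) -> Optional[str]:
    """Column type identification as a majority vote over linked cell types."""
    types = Counter(entity["type"] for entity in link_cells(column) if entity)
    return types.most_common(1)[0][0] if types else None


carriers = ["Air Canada", "WestJet", "Unknown Air"]
carrier_type = column_type(carriers)
carrier_links = link_cells(carriers)
print(carrier_type, carrier_links)
```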

Table to KB Linking: Link Columns
Table Understanding

Table to KB Linking: Link Rows/Cells
Table Understanding

Understanding Tabular Data: Putting it All Together
Table Understanding

Table Understanding
A. Identify table regions (column/row headers)

Table Understanding
B. Identify additional context within table

Table Understanding
C. Identify context within document

Table Understanding
D. Identify context outside document

Table Understanding

Final Takeaways
1. Table extraction & understanding has a rich history of
methods spanning many decades
2. Tables from different domains are not the same; a general
table extraction & understanding system needs to
consider diversity of type, style, and content of tables
3. Both semantic and visual features are crucial to improve
table extraction and understanding
4. As a community, we need to standardize tasks, evaluation
metrics, and datasets

Build for the future by unlocking the past...
