Table Extraction and Understanding
for Scientific and Enterprise
Applications
Yannis Katsis
Doug Burdick, Nancy Wang, Alexandre V Evfimievski
Marina Danilevsky
IBM Research - Almaden
Outline
§ Introduction
– Problem Definition
– Challenges
– Applications
– Demo
§ Table Extraction
§ Table Understanding
§ Conclusion
Introduction
Introduction Outline
§ Problem definition
– Table Extraction
– Table Understanding
§ Challenges
– Limited document format support for table structure
– Table variety
§ Applications
– Knowledge Base Population
– Query Answering
– Leaderboard Construction
– Information Extraction
§ Demo
Introduction
Tables are a popular data representation
§ Found in government reports, scientific papers, financial reports, invoices, contracts, loan agreements
§ Compact
§ Easy to understand*
(*) For humans
End-to-end example
§ What does the value 672 in the following table mean?
§ Answer: Net earnings for the three months ended July 29th, 2017 were $672 million USD
Steps:
1) Find location of table on page
2) Find cells in column containing “672”
3) Find cells in row corresponding to “672”
4) Identify aligned row / column header cells
5) Normalize using additional context from table
Introduction
Table Extraction: Identify table location and structure
Table Understanding: Provide semantic context to table values
Table Extraction: Problem Definition
Input: Document contents in native format
- PDF
- Image
- Office Docs
- …
Output: Document contents with tabular information:
1) Table border for each table
2) Partitioning table contents into cells
3) Both vertical and horizontal alignment of cells
Table Understanding:
Problem Definition
Input: Document contents with tabular information:
1) Table border for each table
2) Partitioning table contents into cells
3) Both vertical and horizontal alignment of cells
Output: Table content representation:
1) Captures semantic information
2) Amenable to post-processing
{
  "tables": [
    {
      "column_headers": [
        {
          "cell_id": "colHeader-1050-1082",
          "text": "Expenses ($ in thousands)",
          ...
        },
        {
          "cell_id": "colHeader-1270-1301",
          "text": "Three months ended Sept. 30",
          ...
        },
        {
          "cell_id": "colHeader-1544-1548",
          "text": "2015"
        },
        ...
      ],
      "row_headers": [
        {
          "cell_id": "rowHeader-2244-2262",
          "text": "Aircraft fuel"
        },
        {
          "cell_id": "rowHeader-3197-3217",
          "text": "Airport operations"
        },
        {
          "cell_id": "rowHeader-4148-4176",
          "text": "Flight operations and navigational changes"
        },
        ...
      ],
      "body_cells": [
        {
          "cell_id": "bodyCell-2450-2455",
          "text": "206,924",
          "row_header_ids": [
            "rowHeader-2244-2262"
          ],
          "column_header_ids": [
            "colHeader-1050-1082",
            "colHeader-1270-1301",
            "colHeader-1544-1548"
          ]
        },
        {
          "cell_id": "bodyCell-5415-8945",
          "text": "142,176",
          "row_header_ids": [
            "rowHeader-3197-3217"
          ],
          "column_header_ids": [
            "colHeader-1050-1082",
            "colHeader-1270-1301",
            "colHeader-1544-1548"
          ],
          ...
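A minimal sketch of how a consumer might use this representation: resolve the header ids of a body cell back to header text. The dictionary below copies a few entries from the excerpt above; the function name `describe` and the " / " join are illustrative assumptions, not part of any published API.

```python
# Subset of the JSON excerpt above, as a Python dict (elided fields omitted).
TABLE = {
    "column_headers": [
        {"cell_id": "colHeader-1050-1082", "text": "Expenses ($ in thousands)"},
        {"cell_id": "colHeader-1270-1301", "text": "Three months ended Sept. 30"},
        {"cell_id": "colHeader-1544-1548", "text": "2015"},
    ],
    "row_headers": [
        {"cell_id": "rowHeader-2244-2262", "text": "Aircraft fuel"},
        {"cell_id": "rowHeader-3197-3217", "text": "Airport operations"},
    ],
    "body_cells": [
        {
            "cell_id": "bodyCell-2450-2455",
            "text": "206,924",
            "row_header_ids": ["rowHeader-2244-2262"],
            "column_header_ids": [
                "colHeader-1050-1082",
                "colHeader-1270-1301",
                "colHeader-1544-1548",
            ],
        },
    ],
}


def describe(table, cell):
    """Hypothetical helper: attach row / column header text to a body cell."""
    headers = table["column_headers"] + table["row_headers"]
    text_by_id = {h["cell_id"]: h["text"] for h in headers}
    return {
        "value": cell["text"],
        "row": " / ".join(text_by_id[i] for i in cell["row_header_ids"]),
        "column": " / ".join(text_by_id[i] for i in cell["column_header_ids"]),
    }


print(describe(TABLE, TABLE["body_cells"][0]))
```

Because every body cell carries its header ids, the semantic context of a value can be recovered without any knowledge of page layout.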
Table Understanding:
Example
Output: Table content representation:
1) Captures semantic information
2) Amenable to post-processing

Value   | Norm Value   | Year | Time Period | Type    | LineItem
206,924 | $206,924,000 | 2015 | Q3          | Expense | Aircraft fuel
286,817 | $286,817,000 | 2014 | Q3          | Expense | Aircraft fuel
142,176 | $142,176,000 | 2015 | Q3          | Expense | Airport operations
…       | …            | …    | …           | …       | …

Change  | Change Normalized | Begin Time Period | End Time Period
(27.9%) | -27.9%            | Q3 2014           | Q3 2015
10.7%   | 10.7%             | Q3 2014           | Q3 2015
…       | …                 | …                 | …
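The normalization shown above (scaling by the unit in the header, and turning accounting-style parentheses into a negative sign) can be sketched as follows. This is a sketch under assumptions: the function names and the unit keywords it recognizes are illustrative choices, not the authors' implementation.

```python
import re

# Assumed unit keywords; a real system would need a broader vocabulary.
SCALES = {"thousands": 10**3, "millions": 10**6, "billions": 10**9}


def normalize_amount(text, header):
    """Scale a cell value by the unit mentioned in its column header."""
    value = int(text.replace(",", ""))
    match = re.search(r"in (thousands|millions|billions)", header, re.IGNORECASE)
    scale = SCALES[match.group(1).lower()] if match else 1
    return value * scale


def normalize_change(text):
    """Convert '(27.9%)' to -27.9 and '10.7%' to 10.7."""
    negative = text.startswith("(") and text.endswith(")")
    number = float(text.strip("()%"))
    return -number if negative else number


print(normalize_amount("206,924", "Expenses ($ in thousands)"))
print(normalize_change("(27.9%)"), normalize_change("10.7%"))
```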
Challenge: Table structure representation varies across document formats
Native support for table structure:
– Complete: HTML
– Partial: MS Excel, MS Word
– None: TXT, PDF, Image
Table Understanding still required for all document types
H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI ’19
Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM ’17
M. Cafarella et al. “WebTables: exploring the power of tables on the web”. VLDB ’08
HTML completely represents table structure
Table structure representation varies across Excel documents
– Each sheet holds a separate table
– Multiple tables defined in a single sheet
H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI ’19
Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM ’17
Table structure representation varies across Word documents
– Use Office Table Object for all tables
– Omit Office Table Object
Document formats with no native table representation
– Image
– TXT
– PDF
PDF Document Format
• A programmatic PDF is a collection of instructions that draw characters and line segments on the page, together with visual formatting information
• 2 – 4 trillion PDFs in existence and rapidly growing
PDF binary (excerpt):
…
BT
0.0503 Tc
8.503556 0 0 8.52 503.2795 688.92 Tm
/Tc2 1 Tf
[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 (e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ] TJ
0 Tc
ET
…
Q
q
46.91952 776.52 m
242.04 776.52 l
242.04 729.96 l
144.48 729.96 l
46.91952 729.96 l
h
…
What the instructions mean when rendered:
Draw “m” at (503, 688) in 8.5 point font in color white
Draw “o” at …
Draw “n” at …
Draw “t” at …
Draw “h” at …
Draw “s” at …
…
Draw green line segment from (46, 776) to (242, 776)
Draw green line segment from (242, 776) to (242, 729)
…
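To make the point concrete, here is a minimal sketch that recovers the text and position from the two operators in the excerpt: the `TJ` array (strings interleaved with kerning adjustments) and the `Tm` text matrix. It assumes strings without escaped or nested parentheses and is nowhere near a full content-stream parser; the function names are illustrative.

```python
import re


def decode_tj(array_src):
    """Concatenate the literal strings of a TJ array, ignoring kerning numbers."""
    # Assumption: no escaped or nested parentheses inside the strings.
    return "".join(re.findall(r"\(([^()]*)\)", array_src))


def parse_tm(line):
    """Return (x, y, font_scale) from an 'a b c d e f Tm' text-matrix line."""
    a, _b, _c, _d, e, f = (float(v) for v in line.split()[:6])
    return e, f, a


tj = ("[ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 (e) 28 "
      "(n) 17 (d) 24 (e) 28 (d) 24 ( ) ] TJ")
print(repr(decode_tj(tj)))
print(parse_tm("8.503556 0 0 8.52 503.2795 688.92 Tm"))
```

Even after decoding, the result is only a positioned string: nothing in the stream says that " months ended " is a column header.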
Introduction
Challenge: Variety in Tables
– Table with visual clues only
– Table with graphic lines
– Multi-row, multi-column column headers
– Nested row headers
– Tables with textual content
– Table interleaved with text and charts
– Complex tables: graphical lines can be misleading (is this 1, 2 or 3 tables?)
Introduction
Application: Knowledge-base population
Ex: Delta Air Lines, Inc. 2014 Annual Report Form 10-K
Excerpt of HTML file with granular financial metric data
https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal1231201410k.htm
– Valuable metrics for airline industry, only present in HTML table
Excerpt of semi-structured XBRL file for financial statements
https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal-20141231.xml
:
<xbrli:context id="FI2013Q4"><xbrli:entity>
<xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier></xbrli:entity>
<xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
</xbrli:context>
:
<us-gaap:CashAndCashEquivalentsAtCarryingValue
contextRef="FI2013Q4"
decimals="-6"
id="Fact-C39BEC178121A91816968BA9ADCF421F"
unitRef="usd">
2844000000
</us-gaap:CashAndCashEquivalentsAtCarryingValue>
:
– Valuable metrics present in semi-structured raw data source
Metrics from both sources MUST BE INTEGRATED
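As a rough illustration of the structured side of this integration, the sketch below joins an XBRL fact to its context with the standard library. The sample wraps the excerpt in a root element so that it parses on its own; the `us-gaap` namespace URI is an assumption for this sketch (the excerpt does not show it), and the function name `facts` is hypothetical.

```python
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"
US_GAAP = "http://fasb.org/us-gaap/2014-01-31"  # assumed; not shown in the excerpt

SAMPLE = f"""<xbrli:xbrl xmlns:xbrli="{XBRLI}" xmlns:us-gaap="{US_GAAP}">
<xbrli:context id="FI2013Q4"><xbrli:entity>
<xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier>
</xbrli:entity>
<xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
</xbrli:context>
<us-gaap:CashAndCashEquivalentsAtCarryingValue contextRef="FI2013Q4"
 decimals="-6" unitRef="usd">
2844000000
</us-gaap:CashAndCashEquivalentsAtCarryingValue>
</xbrli:xbrl>"""


def facts(xml_text):
    """Return one record per fact, joined with the date of its context."""
    root = ET.fromstring(xml_text)
    as_of = {}
    for ctx in root.iter(f"{{{XBRLI}}}context"):
        instant = ctx.find(f".//{{{XBRLI}}}instant")
        as_of[ctx.get("id")] = instant.text.strip() if instant is not None else None
    records = []
    for el in root:
        ref = el.get("contextRef")
        if ref is None:  # contexts and units carry no contextRef
            continue
        records.append({
            "metric": el.tag.split("}", 1)[1],  # drop the namespace part
            "as_of": as_of.get(ref),
            "unit": el.get("unitRef"),
            "value": int(el.text.strip()),
        })
    return records


print(facts(SAMPLE))
```

Records in this shape can be keyed by metric and period, which is also the shape a table understanding system needs to produce for metrics that appear only in the HTML tables.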
Introduction
Application: Query Answering
Introduction
H. Sun et al. “Table Cell Search for Question Answering”. WWW '16
Application: Scientific Leaderboard Construction
Introduction
Y. Hou et al. “Identification of Tasks, Datasets, Evaluation Metrics, and Numeric
Scores for Scientific Leaderboards Construction”. ACL ‘19
Scientific Publication
Leaderboard Annotations
Application: Biological Information Extraction
Introduction
G. Singh et al. “QTLTableMiner++: Semantic Mining of QTL Tables in Scientific
Articles”. BMC BioInformatics ‘18
Extract info on Quantitative Trait Loci (QTL), i.e. genomic regions that correlate with phenotypes, from tables in scientific publications
Figure labels: Article; Trait Tables; Trait Statements; QTL Statements
Takeaways
§ Widely used document formats have limited table representation
– Limits of the document format: Image, PDF
– How documents are authored: Word, Excel
§ Wide variety of tables makes general model construction difficult
– Tables are a form of art
– Diverse visual encoding of semantic information
– Different domains
§ Multiple applications for table extraction and understanding
Table Extraction
§ Table region detection
– Identify all tables
– Separate tables from non-table text
– Separate tables from each other
§ Cell structure recognition
– Partition text into cells
– Find cell span and cell-to-cell overlap (along X- or Y-axis)
What Is Table Extraction?
Table Extraction
[CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93
[I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93
[H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ‘95
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[H00b] J. Hu et al. “Table Structure Recognition and Its Evaluation”, SPIE Doc. Recog. & Retr. ‘00
[KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01
[C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ‘02
[P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ‘03
[H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ‘05
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[M06] S. Mandal et al. “A Simple and Effective Table Detection System from Document Images”, IJDAR ‘06
[L07] Y. Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”, JCDL ‘07
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ‘09
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
[F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
[CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
[K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ‘14
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
[R15] R. Rastan et al. “TEXUS: A Task-based Approach for Table Extraction and Understanding”, DocEng ‘15
[T16] T. A. Tran et al. “A Mixture Model Using Random Rotation Bounding Box to Detect Table Region in Document Image”, JVCIR ‘16
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18a] P. Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”, KDD ‘18
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
[Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
[C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
[L19] M. Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”, arXiv, 2019
[M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from Scanned Documents”, ICDAR ‘19
Table Extraction
Table Extraction: A Sample of Prior Work
Most papers present an end-to-end system for:
• Table detection,
• Cell structure recognition (table parsing),
• Or both
🔥 ICDAR 2019 has ≥ 16 new papers on table extraction!
– ICDAR = International Conference on Document Analysis and Recognition
§ Early 1990s : Separator based “top-down” methods
– Ruled line tables
– Extend to white-space “lines”
§ 1990s – early 2000s : “Bottom-up” text clustering
– Group text into columns (or rows), then to tables
– Use space features (gaps, overlap, alignment) and keywords
§ 2000s – early 2010s : Machine Learning (supervised or not)
– Classify text-rows using CRF, SVM, HMM, etc.
– Probabilistic models for tables
– Graph-based models for cell structure
– Unsupervised ML (clustering)
§ Late 2010s : Deep Learning
– Scanned image table detection with R-CNN or YOLO
– Graph neural networks and language embeddings for cell structure
Table Extraction Timeline
Table Extraction
§ Analyze Page
– Identify low-level structures & relations
§ The 2 Main Tasks
– Table (region) detection
– Cell structure recognition (given table region)
§ Refine Tables
– Discard false positives
– Adjust table border and structure
How to Build a Table Extraction System?
Table Extraction
Common Sub-Tasks in Table Extraction
Stages: Analyze → Detect → Refine
Sub-tasks:
– Compute page features
– Group text into larger units
– Identify separator lines
– Generate candidate table regions
– Extract table’s cell structure
– Compute table’s features & score
– Adjust candidate tables
– Select tables for output
Learning Infrastructure: Accuracy metrics, Ground truth data, Optimization method
§ Documents can be:
– scanned
– programmatic (“born digital” PDF, TXT)
– hybrid
§ Scanned page requires OCR, plus:
– Reverse any rotation, distortion
– Filter noise, sharpen if low resolution [M19]
– Fix inconsistent font features, bounding boxes
– Detect ruled lines and boxes
• E.g., Gaussian filter + black hat transform [K13]
Page Features
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
[M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from
Scanned Documents”, ICDAR ‘19
Table Extraction
§ Programmatic PDFs (and TXTs)
– Have letters, but no table markup
§ May contain spurious (invisible) text and lines
– White-on-white lines or text
– Occluded or out-of-range lines or text
– Text repeated to simulate bold font
– Need to filter them out
§ Deep Learning (CNN-based) methods need an image
– Render programmatic pages to images first
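A minimal sketch of the filtering step for spurious content, assuming text has already been extracted into positioned glyphs. The `Glyph` record, the color encoding, and the tolerance are assumptions made for illustration; occlusion checks are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    text: str
    x: float
    y: float
    color: str  # fill color as a hex string, e.g. "#000000" (assumed encoding)


def filter_spurious(glyphs, page_w, page_h, background="#ffffff", tol=0.5):
    """Drop invisible, out-of-range, and overprinted (fake bold) glyphs."""
    kept = []
    for g in glyphs:
        if g.color.lower() == background:  # white-on-white text
            continue
        if not (0 <= g.x <= page_w and 0 <= g.y <= page_h):  # off the page
            continue
        # Same text drawn again at (almost) the same spot simulates bold.
        if any(k.text == g.text and abs(k.x - g.x) <= tol and abs(k.y - g.y) <= tol
               for k in kept):
            continue
        kept.append(g)
    return kept


glyphs = [
    Glyph("A", 10, 10, "#000000"),
    Glyph("A", 10.2, 10, "#000000"),  # overprint of the first "A"
    Glyph("B", 20, 10, "#FFFFFF"),    # invisible
    Glyph("C", 700, 10, "#000000"),   # outside a 612 x 792 page
    Glyph("D", 30, 10, "#000000"),
]
print([g.text for g in filter_spurious(glyphs, page_w=612, page_h=792)])
```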
Page Features
Table Extraction
§ Plain text layout (1-column, 2-column, etc.)
– Helps avoid false-positive “tables”
§ Obvious non-tables
– Page & section headers, footers, lists, etc.
– Short-circuit computation if there are no tables on the page
§ Low-level structure
– Alignment @ different box positions & tolerance levels
– A minimum spanning tree for clustering by distance
§ Deep learning features
– CNN features shared across proposal regions
– Natural language embeddings
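The minimum-spanning-tree idea for clustering by distance can be sketched with Kruskal's algorithm: build the tree from the shortest edges upward and stop at the first edge longer than a threshold, which gives the same clusters as cutting the long edges of the full tree. Points stand in for text-box centers; `max_gap` is an assumed tuning parameter.

```python
from itertools import combinations
from math import dist


def mst_clusters(points, max_gap):
    """Cluster points by cutting MST edges longer than max_gap."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    for d, a, b in edges:
        if d > max_gap:  # all remaining edges are longer: stop
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


centers = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(mst_clusters(centers, max_gap=1.5))
```

The all-pairs edge list is quadratic in the number of boxes; that is acceptable for a single page but is a simplification of this sketch.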
Page Features
Table Extraction
§ Most systems group text early on
– Table detection systems may skip text grouping
§ Text is grouped in one of 3 ways:
– Columns first
– Rows first
– Cell-units (“blobs”) first
§ Some systems partition text using separator lines
– BUT: “Blob” detection reduces over- / under-partitioning
Group Text into Larger Units
Table Extraction
Example: the same page shown with four annotations: two tables, columns, rows, and multi-line “blobs”
Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
Start with Columns
Many systems detect columns first:
– T-Recs [KD98], pdf2table [Y05], Lixto [HB07], Tesseract [SS10], smartFIX [D11]
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
[Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI ‘05
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
Example – Tesseract [SS10]:
1. Detect X-axis “tab-stops” (alignment positions)
2. Group tokens between “tab-stops” horizontally into entries
3. Group entries of the same font vertically into column fragments
4. Group column fragments within page columns horizontally into table fragments
5. Group table fragments if columns match vertically into tables
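Steps 1 and 2 can be sketched as follows, under simplifying assumptions: a tab-stop is taken to be an x position where several text lines start, and a token belongs to the nearest tab-stop at or to the left of its left edge. This is a stand-in heuristic to illustrate the idea, not the actual procedure of [SS10]; names and thresholds are illustrative.

```python
from bisect import bisect_right


def detect_tab_stops(left_edges, tol=2.0, min_support=3):
    """Return x positions shared by at least min_support left edges."""
    groups = []
    for x in sorted(left_edges):
        if groups and x - groups[-1][-1] <= tol:
            groups[-1].append(x)  # same alignment position, within tolerance
        else:
            groups.append([x])
    return [sum(g) / len(g) for g in groups if len(g) >= min_support]


def column_of(x, stops, tol=2.0):
    """Index of the tab-stop that a token starting at x belongs to (-1 if none)."""
    return bisect_right(stops, x + tol) - 1


left_edges = [50, 50.5, 51, 120, 200, 200, 201]
stops = detect_tab_stops(left_edges)
print(stops, column_of(201, stops))
```

The lone edge at x = 120 has too little support to form a tab-stop, so a token starting there is assigned to the column on its left.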
Example: column-first grouping on the same page, shown step by step: tab-stops, column fragments, table fragments, tables, and tables with multi-column headers
Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
Start with Rows
Table Extraction
Systems with ML often detect rows first
– Pinto-McCallum [P03], e Silva [S06], TableSeer [L08], PDF-TREX [OR09]
Typical process:
1. Identify text-lines
2. Train an ML classifier to label text-lines:
– “Table Dense”, “Table Sparse”, “Table Header”, “Non-table”, etc.
– ML = CRF [P03], HMM [S06], SVM [L08], etc.
3. Merge sparse rows into dense rows – get full table rows:
– Merge up, down, or cluster around, by row alignment [H00a]
4. Combine table rows into tables
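A minimal sketch of steps 2 and 3. A simple fill-ratio rule stands in for the trained classifier (the cited systems use CRF, HMM, or SVM models), and only the "merge up" strategy is shown. Each text-line is assumed to be already split into one string per column, with empty strings for blanks.

```python
def label_row(cells, dense_ratio=0.5):
    """Heuristic stand-in for the ML classifier: label by fraction of filled cells."""
    filled = sum(1 for c in cells if c)
    if filled == 0:
        return "non-table"
    return "dense" if filled / len(cells) >= dense_ratio else "sparse"


def merge_sparse_up(rows):
    """Merge each sparse row into the row above it, column by column."""
    merged = []
    for cells in rows:
        if label_row(cells) == "sparse" and merged:
            previous = merged[-1]
            merged[-1] = [
                " ".join(part for part in (a, b) if part)
                for a, b in zip(previous, cells)
            ]
        else:
            merged.append(list(cells))
    return merged


rows = [
    ["Row 1, text line 1", "0.12", "1.23"],
    ["Row 1, text line 2", "", ""],
    ["Row 2, text line 1", "4.56", "5.67"],
]
for row in merge_sparse_up(rows):
    print(row)
```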
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR ‘03
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
Example: text-lines of the same page labeled “Table Header”, “Dense Row”, “Sparse Row” and “Heading Row”, with alignment checks between rows marked ✓ or ✕
Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
§ “Blob” = largest semantically bound text unit
– Single-line or multi-line
– If in a table, the whole “blob” must be in a single cell
§ “Blob” ≠ Cell
– Cell has span and overlaps other cells
– Some “blobs” end up in plain text or non-table text
§ “Blobs” help define table structure:
– Trace alignment
– Determine header cell spans
– Fix over-split / over-merged cells, rows, columns
– Reduce search space
Text “Blobs” (Cell-Units, Paragraphs, …)
Table Extraction
How to Detect “Blobs”
§ [KD98] Distance based clustering:
– Merge words horizontally
– Merge text strings vertically if word-spans interleave
§ Problems with distance:
– Multi-column headers: 1 justified phrase vs. ≥ 2 closely spaced phrases
– Row headers / text cells: 1 multi-line cell vs. ≥ 2 closely spaced rows
§ Example:
                     Two Column Header    Two Column Header
HEADER               Header    Header     Header    Header
Row 1, text line 1   0.12      1.23       2.34      3.45
Row 1, text line 2
Row 1, text line 3
Row 2, text line 1   4.56      5.67       6.78      7.89
Row 2, text line 2
Row 2, text line 3
[KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS ‘98
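The distance-based clustering described for [KD98] can be sketched with a union-find over words: words on the same line are joined when the gap between them is small, and words on adjacent lines are joined when their horizontal spans overlap. The data layout and the `max_gap` threshold are assumptions of this sketch, not details taken from T-Recs.

```python
def cluster_blobs(lines, max_gap=5.0):
    """lines: text lines top to bottom, each a list of (x0, x1, text) words."""
    words = [(i, x0, x1, t) for i, line in enumerate(lines) for (x0, x1, t) in line]
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(words)):
        for b in range(a + 1, len(words)):
            la, ax0, ax1, _ = words[a]
            lb, bx0, bx1, _ = words[b]
            close = la == lb and bx0 - ax1 <= max_gap and ax0 - bx1 <= max_gap
            stacked = abs(la - lb) == 1 and ax0 < bx1 and bx0 < ax1
            if close or stacked:
                parent[find(a)] = find(b)

    blobs = {}
    for i, word in enumerate(words):
        blobs.setdefault(find(i), []).append(word)
    # Join each blob's words in reading order (line, then x position).
    return [" ".join(t for *_, t in sorted(ws)) for ws in blobs.values()]


lines = [
    [(0, 40, "Row 1,"), (42, 80, "line 1"), (200, 230, "0.12")],
    [(0, 40, "Row 1,"), (42, 80, "line 2")],
    [(0, 40, "Row 2,"), (42, 80, "line 1"), (200, 230, "4.56")],
]
print(cluster_blobs(lines))
```

On this input the row-header text of both rows ends up in one blob, because the lines are evenly spaced: this is the "1 multi-line cell vs. ≥ 2 closely spaced rows" ambiguity listed above, and distance alone cannot resolve it.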
Table Extraction
§ [H00a], [OR09] Merge “sparse” rows into “dense” rows
– Merge up, merge down, or cluster around
§ [L09] Detect and follow reading order ← an NLP challenge
§ [B12] [B14] Train a classifier over “blob” features:
– Proper termination (e.g. “blobs” don’t end with a dash or comma)
– Number of numeric strings
– Indentation, large space at the end of a string
– Shared font properties
§ Deep learning approaches:
– Cell-unit detection (over image) using CNNs
– Semantic relationship detection (over text) using RNNs
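A few of the “blob” features named above can be computed per text line as in the sketch below. The feature names and the numeric-token pattern are illustrative assumptions; the cited papers define their own feature sets, and shared font properties are omitted because they need layout data.

```python
import re

# Matches tokens such as 206,924  $672  10.7%  (27.9%)
NUMERIC = re.compile(r"\(?\$?\d[\d,]*(?:\.\d+)?%?\)?")


def line_features(line):
    """Per-line features that hint at whether a blob continues on the next line."""
    stripped = line.strip()
    return {
        "ends_open": stripped.endswith(("-", ",")),  # improper termination
        "numeric_tokens": sum(1 for t in stripped.split() if NUMERIC.fullmatch(t)),
        "indent": len(line) - len(line.lstrip()),
        "trailing_space": len(line) - len(line.rstrip()),
    }


print(line_features("  Flight operations and navigational-"))
print(line_features("Aircraft fuel 206,924 286,817 (27.9%)"))
```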
How to Detect “Blobs”
[H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. ‘00
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ’09
[L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR ‘09
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
Table Extraction
Example
Table Extraction
Table Source: https://www.dollartreeinfo.com/static-files/0c3687d8-e6ce-4566-bc89-79fc8c8b665e (2016_Proxy_Statement_Final.pdf)
§ Ruled Lines & Colored Boxes
– Extend ruled lines over small gaps, “snap” together
– Merge touching colored boxes, then convert into lines
– Filter out: highlighting, underlining, boxed comments, logos, charts etc.
§ BUT: A “perfect” ruled-line grid can be incomplete !
– Some lines may be missing
– Lines may fail to extend to header rows / columns
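Extending ruled lines over small gaps can be sketched for horizontal segments as below: segments are snapped to a common row by rounding their y coordinate, then merged left to right when the gap between them is small. The snapping grid and gap threshold are assumed parameters, and vertical lines would be handled symmetrically.

```python
def merge_horizontal(segments, gap=3.0, y_snap=1.0):
    """segments: (y, x0, x1) pieces of horizontal ruled lines."""
    snapped = sorted((round(y / y_snap), x0, x1) for y, x0, x1 in segments)
    merged = []
    for row, x0, x1 in snapped:
        if merged and merged[-1][0] == row and x0 - merged[-1][2] <= gap:
            _, prev_x0, prev_x1 = merged[-1]
            merged[-1] = (row, prev_x0, max(prev_x1, x1))  # extend over the gap
        else:
            merged.append((row, x0, x1))
    return [(row * y_snap, x0, x1) for row, x0, x1 in merged]


segments = [(100, 0, 48), (100, 50, 120), (100, 200, 260), (300, 0, 260)]
print(merge_horizontal(segments))
```

The wide gap between x = 120 and x = 200 is kept, since it may separate two tables or two column groups rather than being a drawing artifact.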
Separator Line Detection
[CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR ‘93
[I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR ‘93
[F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR ‘11
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
Table Extraction
Example 1
Table Extraction
Table Source: https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/quarterly-result/2015/2015_MDA_q3.pdf
Example 2
Table Extraction
Table Source: https://www.ada.gov/restripe.pdf
Example 3
Table Extraction
Table Source: http://educationaldatamining.org/files/conferences/EDM2018/EDM2018_Preface_TOC_Proceedings.pdf
§ White-space separators (“virtual” lines)
– Help define cell span / cell alignment in tables
– Prune false-positives by ML or by heuristics [B12]
§ How to detect white-space separators
– Cell-unit (“blob”) bounding box expansion [I93]
– Axis projection histograms [CK93]
– White-space cover by maximum-area white-space rectangles [F11]
§ How to prune them (features to use)
– Adjacent “blobs” : alignment, size, and content
– “Strong” separators that run parallel to or intersect the separator
Separator Line Detection
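The axis-projection idea can be sketched as follows: project all text boxes onto the X axis and report the runs of empty positions as candidate vertical white-space separators. The box format, the integer page width, and `min_width` are assumptions for this sketch; it is not the exact procedure of [CK93].

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def vertical_whitespace_gaps(boxes: List[Box], page_width: int,
                             min_width: int = 5) -> List[Tuple[int, int]]:
    """Return [start, end) x-ranges that no text box covers."""
    # X-axis projection histogram: how many boxes cover each x position.
    profile = [0] * page_width
    for x0, _, x1, _ in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            profile[x] += 1

    gaps: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile + [1]):  # the sentinel closes a trailing gap
        if count == 0 and start is None:
            start = x
        elif count != 0 and start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps


gaps = vertical_whitespace_gaps(
    [(10, 0, 40, 10), (60, 0, 90, 10), (10, 20, 38, 30), (62, 20, 90, 30)],
    page_width=100)
```

Note that the page margins also show up as gaps. The pruning step on this slide (alignment of adjacent "blobs", nearby "strong" separators) would then decide which gaps are real column separators.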
Table Extraction
§ Commonly used to partition page and generate separators
– By [C02], [W04], [K14], and others
§ [H95] The algorithm recursively, for each block:
– Computes X- and Y-axis projection profiles
– Divides the block into sub-blocks at dips (valleys) in the profiles
Recursive X-Y Cut Algorithm
[H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR ‘95
[C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR ‘02
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. ‘14
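A compact sketch of the recursion on bounding boxes, in the spirit of [H95]. Instead of a pixel histogram it finds the dips directly from the sorted box intervals. The choice to try horizontal cuts before vertical ones and the `min_gap` threshold are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def widest_gap(boxes: List[Box], axis: int, min_gap: float) -> Optional[float]:
    """Midpoint of the widest empty interval along axis (0 = X, 1 = Y), if any."""
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best_width, best_cut = min_gap, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for lo, hi in intervals[1:]:
        if lo - reach >= best_width:
            best_width, best_cut = lo - reach, (lo + reach) / 2
        reach = max(reach, hi)
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Recursively split boxes at the widest projection gap; return leaf blocks."""
    if len(boxes) < 2:
        return [boxes]
    for axis in (1, 0):  # try a gap in the Y profile first, then in the X profile
        cut = widest_gap(boxes, axis, min_gap)
        if cut is not None:
            # The cut lies in an empty interval, so every box falls on one side.
            first = [b for b in boxes if b[axis + 2] <= cut]
            second = [b for b in boxes if b[axis] >= cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    return [boxes]


blocks = xy_cut([(0, 0, 100, 10),
                 (0, 30, 40, 40), (60, 30, 100, 40),
                 (0, 45, 40, 55), (60, 45, 100, 55)])
```

On this toy page the first cut separates the full-width top line from the rest, and the second cut splits the remaining boxes into a left and a right column block.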
Table Extraction
§ Ruled Line grids / frames, connected components
§ (Rows 1st) Stack “table” rows whose “blobs” co-align [L08], [OR09]
– Rows are labeled by an ML-classifier (CRF, SVM, HMM)
– Labels & matching “blob” layout → table regions
– NOTE: Be sure to label “header rows” to tell tables apart!
§ (Cols 1st) Cluster overlapping column fragments [HB07], [SS10]
– Group table columns horizontally, staying within page layout columns (when possible)
– Group vertically if column fragments overlap, match, or subsume
– NOTE: Column header areas require special handling!
Generate Candidate Table Regions
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM ’08
[OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR ‘09
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR ‘13
Table Extraction
§ (Blobs 1st) Classify text “blobs”, cluster those labeled “table”
– [B14] iteratively labels “blobs” given their neighbors’ labels
– [B14] trains a Kernel Logistic Regression classifier
§ (Lines 1st) Find areas where “strong” separators make a grid
– [CL12] uses Max-Flow / Min-Cut algorithm to extract grids
– Bi-cluster the intersection matrix of horizontal vs. vertical separators
Generate Candidate Table Regions
[CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR ‘12
[B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP ‘14
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
$X \approx U V^{T}$ with $U \ge 0$, $V \ge 0$, and $k$ clusters
• $X_{ij} = 1$ ⇔ lines $i$ and $j$ intersect
• At intersections: $1 \approx u_{i1} v_{j1} + u_{i2} v_{j2} + \dots + u_{ik} v_{jk}$
• Each $u_{ic} v_{jc} \ge 0$ gives the affinity of intersection $X_{ij}$ to cluster $c$
• $u_{ic} v_{jc}$ is large ⇔ $u_{ic}$ and $v_{jc}$ are both large
[Figure: block-structured 0/1 intersection matrix $X$ shown with its non-negative factors $U$ and $V$]
Non-neg. Matrix Factorization for Grid Clustering
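A small pure-Python sketch of the factorization, using the standard multiplicative update rules for the squared error. The slide does not say which NMF solver is used, so the update rule, the iteration count, and the random initialization are assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def frobenius_error(x: Matrix, u: Matrix, v: Matrix) -> float:
    """Squared Frobenius norm of X - U V^T."""
    approx = matmul(u, transpose(v))
    return sum((xij - aij) ** 2
               for xr, ar in zip(x, approx) for xij, aij in zip(xr, ar))


def nmf(x: Matrix, k: int, iters: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor x (m x n) as U V^T with U (m x k) >= 0 and V (n x k) >= 0."""
    rng = random.Random(seed)
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in x]
    v = [[rng.random() + 0.1 for _ in range(k)] for _ in x[0]]
    xt = transpose(x)
    for _ in range(iters):
        # U <- U * (X V) / (U V^T V), element-wise
        num, den = matmul(x, v), matmul(u, matmul(transpose(v), v))
        u = [[u[i][c] * num[i][c] / (den[i][c] + eps) for c in range(k)]
             for i in range(len(u))]
        # V <- V * (X^T U) / (V U^T U), element-wise
        num, den = matmul(xt, u), matmul(v, matmul(transpose(u), u))
        v = [[v[j][c] * num[j][c] / (den[j][c] + eps) for c in range(k)]
             for j in range(len(v))]
    return u, v


# Rows = horizontal lines, columns = vertical lines; two separate grids.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
U, V = nmf(X, k=2)
```

An intersection (i, j) can then be assigned to the cluster c that maximizes `U[i][c] * V[j][c]`, which groups the lines of each grid together.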
Generate Candidate Table Regions
Table Extraction
§ (CNN-based) Try a fixed set of table region proposals
– CNN shares computation of features across all translations of a given proposal rectangle
– Proposal rectangle shapes / sizes are fixed as hyperparameters
– If a proposal hits a table, a regression decides table borders
Generate Candidate Table Regions
Table Extraction
§ Use existing object detection frameworks (Faster R-CNN or YOLO) retrained for table detection
§ The field is wide open for more table-specific DL approaches
– E.g. involving text semantics
Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. arXiv 2019
Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”. KDD 2018
Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017
Gilani et al. “Table Detection using Deep Learning”. ICDAR 2017
Table Extraction
Deep Learning for Table Detection
§ Cells define overlap relation along X- or Y-axis
– Links headers with data – critical for table understanding
§ Cell borders ← ruled lines ∪ “strong” white-space lines
– Extend lines to make rectangular cells, avoid crossing “blobs”
§ Ruled grids: test for incompleteness
– Multiple numerics per cell
– A “strong” white-space line splits text in ≥ 2 cells
– A “mini-table” inside a ruled cell
– Cell structure extends beyond table frame
§ White-space grids: clean up empty cells
– Expand header cells by merging with empty cells [S06]
– Merge (almost-) empty rows and columns
Cell Structure: Line Based
Table Extraction
[S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06
[B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
§ Use Spatial Constraints to find an overlap DAG over cells [H03]
§ Use Graph Neural Networks to find 2 undirected graphs:
– “Same Row” graph & “Same Column” graph
– Two cells share an edge ⇔ they share a row / a column
– [Q19]: Rows and columns = maximal cliques
– [C19]: Only adjacent cells share a graph edge
Cell Structure: Graph Based
[H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03
[Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
[C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
Table Extraction
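Once a model has predicted the “Same Row” (or “Same Column”) edges, rows can be recovered by grouping connected cells. The sketch below uses union-find over the predicted edges. It assumes the predictions are consistent, in which case connected components coincide with the maximal cliques of the [Q19] encoding and also cover the adjacent-only edges of [C19]. The cell identifiers are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def group_cells(cells: Iterable[Hashable],
                edges: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group cells into rows (or columns) as connected components of the edge graph."""
    parent: Dict[Hashable, Hashable] = {c: c for c in cells}

    def find(c: Hashable) -> Hashable:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps the trees shallow
            c = parent[c]
        return c

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    groups: Dict[Hashable, Set[Hashable]] = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())


# "Same Row" edges between horizontally adjacent cells of a 2 x 3 table.
rows = group_cells(
    ["A1", "B1", "C1", "A2", "B2", "C2"],
    [("A1", "B1"), ("B1", "C1"), ("A2", "B2"), ("B2", "C2")])
```

Running the same function on the “Same Column” edges yields the columns. With noisy predictions, a clique-based check would be stricter than connected components.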
Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017
Table Extraction
Cell Structure: CNN Based
§ Object detection networks were also used for cell structure detection
§ Eliminate false positive tables
§ Detect malformed table regions
– Plain text in tables
– Missing row / column headers or split-off pieces
– One region covers multiple tables
§ Compare alternative table candidates
– Example: Is this 1 table or 2 tables?
§ Improve table region and structure
– Pick the best adjustment out of a range of options
– NOTE: Knowing cell structure helps region scoring / adjustment
§ Provide a confidence value for output tables
Why Score Tables?
Table Extraction
§ Tables are very diverse
– Tiny or huge, misaligned, text in cells, key-value pairs, confusing delimiters
– Complex row / column headers – so different, easy to chop off!
§ What’s around the table also matters
– Can its columns or rows be extended? Should they be?
§ One table, or ≥ 2 adjacent tables?
– 1 table may have: ruled bars, wide gaps, font / alignment changes
– 2 tables may be: fully or partly co-aligned, separated in one of many ways
§ Non-table text can have complex structure, too
– Page headers / footers, framed / highlighted text, hierarchical lists, …
Table Scoring Challenges
Table Extraction
Example 1
Table Extraction
Table Source: https://www.legislation.gov.au/Details/F2010C00607/0d99393c-5c5b-4af0-9cc1-b5c2de8632c3 (F2010C00607.pdf)
NOT A TABLE !
Example 2
Table Extraction
Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf
Row headers
Column headers
Example 3
Table Extraction
Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf
Row
headers
Column
headers
Example 4
Table Extraction
Table Source:
https://assets.ctfassets.net/rz9m1rynx8pv/2x3p5ompzZyrRtAHw4M3XB/be648275661795139cabcee29a730630/TELUS_Q1_2019_quarterly_report.pdf
Row
headers
Column
headers
§ Rule-out patterns
– Rule out charts, lists, signature blocks etc.
§ Aggregated column / row score
– [KD01] Aggregate the similarities that led to the table’s column fragments
§ Dynamic programming score
– [H99] Score (T) = max { Score (T – line) + Merit (line) }
– Score the best split into 2 sub-tables
§ Probability of being a table (given the features)
– [W04] Partition page into blocks labeled “table” and “plain text”
– Compute label probability for block + neighboring blocks
§ A scoring neural network on top of CNN [G17, S18b]
How to Score a Table
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
§ Columns and rows:
– Number, span / extent, alignment, font / content similarity
§ Ruled and white-space separators:
– Number, span / extent, width of their margins
– If they match, reach (good) or cross (bad) table borders
§ Inside vs. outside table:
– Border crossing ruled lines, aligned blocks, or highly similar text
– The two sides have matching structure
§ Cell structure:
– Oversized cells, misaligned pairs of cells, “runs” of empty cells
§ Content:
– Numerics, repeated words; customizable keywords
– Domain-specific “expectations,” e.g. header dictionary [D11]
§ CNN-generated features
Features for Table Scoring
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
Table Extraction
§ Leverage table features and score
– Specify what a well-formed vs. malformed table looks like
§ Use a transparent, explainable method
– If detection is a “black box”, adjustment uses explainable rules & features
§ Correct errors quickly
– Bypass the need for extra ground-truth data, retraining
§ Customize to address specific concerns
– Add custom features, rules, and constraints
Why Adjust Tables?
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
[HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07
[SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10
[D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11
[G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17
[S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18
Table Extraction
§ Merge table with an adjacent table or text-block [W04] [SS10]
§ Adjust table border – add or drop rows or columns [HB07] [D11]
§ Split one table into two, possibly with plain text between
§ Re-compute table region by neural network regression [G17] [S18b]
§ Choose best-scoring border (or structure) out of a range of options
§ Iterate adjustment → traverse a search tree of candidate tables
How to Adjust Candidate Tables
Table Extraction
What if candidate tables overlap each other?
§ [H99] uses Dynamic Programming:
– Only for top and bottom line-positions: [i, j]
– Score disjoint unions of tables:
§ CNN-based object detection systems:
– Greedy Approach: Pick the top-scoring region, repeat
– PROBLEM: Lower-scoring table may have a high-scoring sub-table
§ Maximum Weighted Independent Set
– Nodes = tables, edges = conflicts, weights = table scores
– NP-hard even for 2-dim rectangles [RN95], but can be solved efficiently in real-life cases
Select Best Tables for Output
[H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99
[RN95] C.S. Rim and K. Nakajima. “On Rectangle Intersection and Overlap Graphs”, IEEE Trans. on Circuits & Systems I, 42(9), 1995
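To make the contrast concrete, here is a sketch for the one-dimensional case, where each candidate table spans text lines [i, j] and has a score. The greedy picker mimics the "pick the top-scoring region, repeat" strategy. The dynamic program is ordinary weighted interval scheduling, used here as a stand-in because the exact recurrence of [H99] is not reproduced on the slide. Candidate scores are made up.

```python
from bisect import bisect_right
from typing import List, Tuple

Candidate = Tuple[int, int, float]  # (top_line, bottom_line, score), inclusive


def select_greedy(cands: List[Candidate]) -> List[Candidate]:
    """Repeatedly keep the top-scoring candidate that overlaps no earlier pick."""
    chosen: List[Candidate] = []
    for c in sorted(cands, key=lambda c: -c[2]):
        if all(c[1] < k[0] or c[0] > k[1] for k in chosen):
            chosen.append(c)
    return sorted(chosen)


def select_optimal(cands: List[Candidate]) -> List[Candidate]:
    """Maximize the total score over pairwise disjoint candidates."""
    cands = sorted(cands, key=lambda c: c[1])  # order by bottom line
    ends = [c[1] for c in cands]
    best = [(0.0, [])]  # best[n]: optimum using only the first n candidates
    for n, (top, bottom, score) in enumerate(cands):
        p = bisect_right(ends, top - 1, 0, n)  # candidates ending above `top`
        take = (best[p][0] + score, best[p][1] + [cands[n]])
        best.append(max(best[n], take, key=lambda t: t[0]))
    return best[-1][1]


candidates = [(0, 9, 5.0), (0, 4, 4.0), (5, 9, 4.0)]
greedy_pick = select_greedy(candidates)
optimal_pick = select_optimal(candidates)
```

Greedy keeps the single region covering lines 0 to 9 (total 5.0), while the dynamic program prefers the two stacked tables (total 8.0). For two-dimensional overlaps the problem becomes the Maximum Weighted Independent Set formulation described above.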
Table Extraction
[Figure: conflict graph over candidate tables; Conflict = Table Overlap]
§ Accuracy Metrics
– Exact match of table region or structure is too inflexible
– Partial match: Text? Area? Cell relationship? Functional?
§ Ground Truth Labeling
– Very time consuming, requires sophisticated UI tools
– Humans disagree on what’s correct
§ Optimization (pre- deep learning)
– Lots of discrete, non-differentiable steps
– Learn sub-tasks, e.g. row labeling with CRF / SVM
– [W04] Global parameter learning:
Learning from Data: Challenges
[W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04
Table Extraction
Table Boundary
§ Purity & Completeness
§ Character level recall, precision and F1
Table Structure
§ Recall and Precision of Cell Adjacency Relations
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
ICDAR 2013 Competition Metrics
Table Extraction
Accuracy Metrics
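A simplified sketch of the cell adjacency metric: each table is turned into a set of relations between neighbouring non-empty cells, and the predicted set is compared with the ground-truth set. It assumes a rectangular grid of strings without spanning cells, and it collapses repeated identical relations, so it only approximates the competition protocol.

```python
from typing import List, Set, Tuple

Relation = Tuple[str, str, str]  # (cell text, neighbour text, "right" | "down")


def adjacency_relations(grid: List[List[str]]) -> Set[Relation]:
    """Link each non-empty cell to its next non-empty neighbour right and below."""
    relations: Set[Relation] = set()
    columns = [list(col) for col in zip(*grid)]
    for direction, lines in (("right", grid), ("down", columns)):
        for line in lines:
            filled = [text for text in line if text]  # blank cells are skipped
            relations.update((a, b, direction) for a, b in zip(filled, filled[1:]))
    return relations


def precision_recall_f1(predicted: List[List[str]],
                        truth: List[List[str]]) -> Tuple[float, float, float]:
    pred, gold = adjacency_relations(predicted), adjacency_relations(truth)
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


truth = [["Year", "Revenue"], ["2014", "10"], ["2015", "12"]]
# A prediction that wrongly merged the two data rows into one.
predicted = [["Year", "Revenue"], ["2014 2015", "10 12"]]
p, r, f1 = precision_recall_f1(predicted, truth)
```

Only the header relation survives the row merge, so one of four predicted relations is correct and one of seven true relations is recovered.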
§ Measure what actually matters downstream
§ Capture accuracy of access paths to each cell
§ Need header annotation as well as cell structure
Table Extraction
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
Accuracy Metrics
Functional Metrics
Ground Truth Datasets
Complete Datasets (table boundary and cell structure):
§ ICDAR-2013 competition (PDF format)
§ ICDAR-2019 competition (image format)
§ SciTSR 2019 (generated from LaTeX files)
Incomplete Datasets
§ TableBank (full table boundary information only)
§ PDF-TREX (financial table dataset without ground truth labels)
§ Marmot (only ground truth for table boundary, cells inaccessible)
§ UNLV, UW-3 (table structure and boundary annotations for scanned documents)
Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. ArXiv 2019
Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12
Oro et al. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”. ICDAR '09
Fang et al. “Dataset Ground-Truth and Performance Metrics for Table Detection Evaluation”. DAS '12
Chi et al. “Complicated Table Structure Recognition” arXiv 2019
Table Extraction
Example: Accuracy Comparison
§ Table detection accuracy on the ICDAR 2013 Competition dataset:
Table Extraction
Table Understanding
Table Understanding
Table Understanding
HTML Document
Table Extraction (Optional)
Document (PDF, Image)
Representation of table contents that:
• Captures semantic information
• Is amenable to post-processing
Table Understanding Output
Knowledge Base
Creation
Downstream
Tasks
Question
Answering
e.g., Air Canada’s oper. revenues in Q3 2015?
HTML
Understand the semantics of tabular data
Semantics of Tabular Data
Table Understanding
What does this cell represent?
Semantics of Tabular Data
Table Understanding
“The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
Information about a single cell is derived from multiple places
What You Will Learn
Table Understanding
1. Components of table understanding
• What are the different types of semantic information about a table?
• Where can they be found?
2. Table understanding methods
• What techniques are used to extract info for table understanding?
• What learning methods can be used?
3. Importance of domain
• How do tables differ between domains?
• How do the assumptions of proposed approaches affect their potential applicability to other domains?
Outline: Components of Table Understanding
Table Understanding
A. Table Regions
(Column/Row Headers)
B. Context
Within Table
C. Context
Within Document
D. Context Outside
Document
Outline: A. Table Regions
Table Understanding
Column Headers
(incl. nesting)
Row Headers
(incl. nesting)
Data/Body
Cells
Main table regions
Metadata
Unsupervised Methods Overview
Table Understanding
Header rows/cols "look different” than data rows/cols
Heuristics
Similarity Features
• Which heuristics to use?
Unsupervised Methods: Local Minimum
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
For column (row) headers: Find first row (col) that looks “different”
Pair-wise similarity of
consecutive rows
Local minimum of similarity
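The local-minimum idea can be sketched with a deliberately crude similarity: the share of columns whose cells have the same coarse type (empty, numeric, text). The actual features in Fang et al. are richer; the type regex and the rule that a dip must fall below perfect similarity are assumptions of this sketch.

```python
import re
from typing import List


def cell_type(text: str) -> str:
    """Coarse content type of a cell: empty, numeric, or text."""
    text = text.strip()
    if not text:
        return "empty"
    return "num" if re.fullmatch(r"[-+($]?[\d.,]+[%)]?", text) else "text"


def row_similarity(a: List[str], b: List[str]) -> float:
    """Fraction of columns whose cells have the same coarse type."""
    return sum(cell_type(x) == cell_type(y) for x, y in zip(a, b)) / max(len(a), 1)


def header_row_count(rows: List[List[str]]) -> int:
    """Rows above the first local minimum of consecutive-row similarity."""
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    for i, s in enumerate(sims):
        # Missing neighbours count as perfect similarity (1.0).
        left = sims[i - 1] if i > 0 else 1.0
        right = sims[i + 1] if i + 1 < len(sims) else 1.0
        if s <= left and s <= right and s < 1.0:
            return i + 1
    return 0  # no dip found: treat the table as having no header rows


table = [["Item", "Q3 2014", "Q3 2015"],
         ["Revenue", "10.5", "12.1"],
         ["Costs", "7.2", "8.0"],
         ["Profit", "3.3", "4.1"]]
n_header_rows = header_row_count(table)
```

The similarity between the first two rows is the only dip, so one header row is reported. Transposing the table applies the same test to row-header columns.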
Unsupervised Methods: Indexing
Table Understanding
S. Seth et al. “Segmenting tables via indexing of value cells by table headers”. ICDAR ‘13
• Use empty and repeated cells to find critical cells that outline the stubhead
• Independent of visual aspects of table
Repeated cell implying hierarchical row header
Empty cells implying hierarchical column header
Traditional ML Methods Overview
Table Understanding
Header rows/cols "look different” than data rows/cols
Traditional
ML Methods
Similarity Features
Column
Headers
Data
Cells
Classification Labels
• How to model this as a classification problem?
• Which ML method and features to use?
Traditional ML Methods: Row/Column Classification
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
Data row
Data row
Data row
Data row
Data row
Data row
Column header row
Classify rows as column header rows (similarly for row header columns)
D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
Header Identification Results
Table Understanding
S. Seth et al. “Segmenting tables via indexing of value cells by table headers”. ICDAR ‘13
R. Rastan et al. “TEXUS: A unified framework for extracting and understanding tables in PDF documents”. Information Processing & Management
Government Statistic Table Set (Seth)
Method        Correct Segmentation    Correct Stub Head (Critical Cell)
Seth et al.   99%                     100%
TEXUS         100%                    100%
ASX-Announcements Dataset (TEXUS)
Method        Correct Segmentation    Correct Stub Head (Critical Cell)
TEXUS         -                       42.9%
No standard benchmark or dataset
Table Understanding
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
FedStat Textfile Dataset (Pinto) CiteSeerX PDF Dataset (Fang)
Traditional ML Methods: Table Classification
Table Understanding
Web Data Commons – Web Table Corpora
Classify entire tables
• Table class implies header structure
• Can be used for header identification under certain assumptions
Relational Table Entity/Listing Table Matrix Table
Single col header row
Single col header row
Single row header col
Traditional ML Methods: Table Classification
Table Understanding
Table Classes (with source)
• Genuine vs Non-genuine: Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web”. WWW ‘02
• Relational vs Non-relational: M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08
• I. Relational Knowledge (Listing, Attribute/Value, Matrix, Calendar, Enumeration, Form); II. Layout (Navigational, Formatting): E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11
• Vertical listings, horizontal listings, matrix tables: J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
Traditional ML Methods: Table Classification
Table Understanding
ML Methods (with source)
• Decision Tree, SVM: Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web”. WWW ‘02
• Rule-based Classifier (WEKA): M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08
• Gradient Boosted Decision Tree: E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11
• Decision Tree (CART, C4.5, Random Forest), SVM: J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
Traditional ML Methods
Table Understanding
Neighborhood and Table Features
• Difference in the number of non-empty cells
• Average alignment
• Percentage of same cell data type
• Percentage of same cell font style
• Content repetition
• Number and standard deviation of rows and columns
Cell Features
• Number of non-empty cells
• Average cell length
• Percentage of numeric characters
• Percentage of symbolic characters
• Average font size
• Cell font styles
• Cell positioning in the table
• Percentage of cells spanning multiple cols/rows
• HTML tags (if applicable)
• Cell span
J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12
J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”, BDC ‘15
Deep Learning Methods Overview
Table Understanding
Header rows/cols "look different” than data rows/cols
Deep
Learning
Methods
Similarity Features
Column
Headers
Data
Cells
Classification Labels
• Which deep learning architecture to use?
Deep Learning Methods: Hierarchical Attention Network
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables]
Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16
Hierarchical RNN proposed to leverage
document structure:
• 2 layers:
• Words
• Sentences
Deep Learning Methods: Hierarchical Attention Network
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables]
Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16
Extend to tables:
• 3 layers
• Tokens
• Cells
• Rows or Columns
• Bidirectional network
• Combine row-directional and
column-directional network
Deep Learning Methods: RNN-CNN Hybrid (TabNet)
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ‘17
LSTM captures semantic
representation of each cell
CNN captures
relationship between cells
Deep Learning Methods: RNN-CNN Hybrid (TabNet)
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ‘17
LSTM captures cell text
together with coordinates
and other HTML tags (i.e.,
formatting)
Deep Learning Methods: Results
Table Understanding
K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid
Deep Neural Network Architecture”. AAAI ‘17
[Results table not recoverable; the compared methods were annotated as: Rule-Based, Decision Tree, Decision Tree, Hierarchical Attention for Documents, RNN-CNN Hybrid]
Beyond Flat Headers: Hierarchical Row Headers
Table Understanding
Identify hierarchical relationship among row headers
Complex semantic row header hierarchy: Multiple cells in the same row header column are semantically related to each other
Beyond Flat Headers: Hierarchy as a Graphical Model
Table Understanding
Z. Chen et al. “Integrating Spreadsheet Data via Accurate and Low-Effort
Extraction”. KDD ‘14
Encode hierarchy as graphical model
• Variable: Candidate parent-child pair
• Node potentials: Features for predicting
parent-child pairs
• Edge potentials: Correlations of
variables based on style, KB affinity, …
Pairwise vs Rectangle cell relationships
Table Understanding
• Pairwise classification can only utilize local information
• Simply looking at the pair may not be sufficient to determine the relation
• A rectangle is “interesting” if it is the support rectangle of some cell,
called a header cell of that rectangle
Beyond Flat Headers: Hierarchy as Rectangle Relationship
Table Understanding
X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of
Financial Tables”. ICDAR ‘17
Two “interesting” rectangles:
• “Assets” (row 1) heads rows 2-17
• “Current” (row 2) heads rows 3-11
Beyond Flat Headers: Hierarchy as Rectangle Relationship
Table Understanding
X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of
Financial Tables”. ICDAR ‘17
When a “total” row is considered as a parent
candidate, it cannot take children
For each iteration:
• Combine: Consecutive minimal rectangles with equal features
• Attach: Minimal rectangle $r_i$ to the directly preceding rectangle $r_{i-1}$ if $r_{i-1} > r_i$
Outline: B. Context Within Table
Table Understanding
Additional semantic information within the table
- of different types (e.g. Currency, Scale)
- of different scope (propagate to all data cells, or to a subset of data cells)
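As a small illustration of propagating table-level context, the sketch below reads a scale and a currency from a table note and applies them to every data cell. The note wording, the tiny keyword lists, and the convention that parentheses mark a negative amount are assumptions for this example, not rules taken from the tutorial.

```python
import re
from typing import Any, Dict, Optional

SCALES = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}


def parse_table_context(note: str) -> Dict[str, Any]:
    """Extract a scale factor and a currency code from a table-level note."""
    lowered = note.lower()
    scale = next((factor for word, factor in SCALES.items() if word in lowered), 1)
    currency = "CAD" if "canadian dollars" in lowered else None
    return {"scale": scale, "currency": currency}


def normalize_cell(text: str, context: Dict[str, Any]) -> Optional[float]:
    """Convert a data cell to an absolute amount using the propagated scale."""
    stripped = text.strip()
    match = re.fullmatch(r"\(?\$?\s*([\d,]+(?:\.\d+)?)\)?", stripped)
    if match is None:
        return None  # not a numeric cell
    value = float(match.group(1).replace(",", "")) * context["scale"]
    # Assumed accounting convention: "(13)" means minus 13.
    return -value if stripped.startswith("(") else value


context = parse_table_context("(Canadian dollars in millions)")
amounts = [normalize_cell(cell, context) for cell in ["(13)", "1,234", "n/a"]]
```

Context with a narrower scope (for example a percent sign that applies to one column only) would need the same lookup keyed by row or column instead of by table.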
Outline: C. Context Within Document
Table Understanding
Additional context outside the table within the same document
- leverage relevant text and tables
Table Context Within Document
Table Understanding
Surrounding text often contains important info about a table
Deeper
Semantic Understanding
• Link text to table
• Generate table title
Shallow
Context Extraction
• Extract table metadata
Extract Table Metadata
Table Understanding
Ying Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in
Digital Libraries”. JCDL ’07
Document title
Page
Table Caption
Document authors
Link Text to Table Cells
Table Understanding
D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18
Text
Cells described by text
Approach:
• Identify headers
• Match sentence to table cells based on:
  • Unique words
    “However, mirroring the overall softness of the tech sector, sales of computer hardware decreased 1% versus a year-ago to $1.6 billion.”
  • Syntactic analysis
  • Semantic analysis (word2vec)
    “…talking about topics is an important reason to email with these special interest groups.”
• Use rules to refine matches
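The "unique words" matching step can be sketched as follows. This is only an illustration of the idea, not the method of Kim et al.: the tokenizer, the `match_cells` helper, and the "Software" row of the toy table are invented here, and the table is assumed to contain only row headers and body cells (column headers already identified and set aside).

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column) position


def tokens(text: str) -> List[str]:
    """Lowercase word and number tokens, e.g. '$1.6 billion' -> ['1.6', 'billion']."""
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())


def match_cells(sentence: str, table: Dict[Cell, str]) -> List[Cell]:
    """Return cells holding a token that is unique in the table and occurs in the sentence."""
    cell_tokens = {pos: set(tokens(text)) for pos, text in table.items()}
    # Count in how many cells each token appears; a count of 1 means "unique word".
    freq = Counter(tok for toks in cell_tokens.values() for tok in toks)
    sent = set(tokens(sentence))
    return sorted(
        pos for pos, toks in cell_tokens.items()
        if any(freq[t] == 1 and t in sent for t in toks)
    )


# Toy table: row header, sales in $ billion, change versus a year ago.
table = {
    (1, 0): "Computer hardware", (1, 1): "1.6", (1, 2): "-1%",
    (2, 0): "Software",          (2, 1): "2.3", (2, 2): "+4%",
}
sentence = ("However, mirroring the overall softness of the tech sector, sales of "
            "computer hardware decreased 1% versus a year-ago to $1.6 billion.")
matched = match_cells(sentence, table)
```

Syntactic and semantic analysis would then handle sentences that paraphrase the table rather than repeat its words, which this sketch does not attempt.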
Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Problem:
• Web tables lack titles, or
• Existing titles lack context
Solution:
• Leverage surrounding context to generate table title
Table + Surrounding Context
Table Title
Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Context used as input:
• Page title
• Section headers (<h...> tags)
• Column headers
• Spanning column headers as a special case
• Table caption (<caption> tag)
Context ignored due to noise:
• Text right before/after table
• Table rows
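As a rough sketch of collecting this context from a web page, the standard-library `html.parser` module is enough for simple markup. The `TitleContext` class and the sample page are assumptions for illustration; treating every `<th>` as a column header and ignoring nested inline tags are simplifications.

```python
from html.parser import HTMLParser


class TitleContext(HTMLParser):
    """Collects page title, section headers, table caption and column headers."""

    KEEP = {"title": "page_title", "caption": "table_caption", "th": "column_headers"}

    def __init__(self):
        super().__init__()
        self.context = {"page_title": [], "section_headers": [],
                        "table_caption": [], "column_headers": []}
        self._field = None  # context field the current text belongs to, if any

    def _field_for(self, tag):
        if tag in self.KEEP:
            return self.KEEP[tag]
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            return "section_headers"  # <h1> ... <h6>
        return None  # <p>, <td>, ...: ignored as noisy context

    def handle_starttag(self, tag, attrs):
        self._field = self._field_for(tag)

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            self.context[self._field].append(text)


# Invented sample page; the figures echo the expenses example from the introduction.
html = """
<html><head><title>Quarterly report</title></head>
<body>
<h2>Operating expenses</h2>
<p>Text right before the table.</p>
<table>
<caption>Expenses ($ in thousands)</caption>
<tr><th>Line item</th><th>2015</th><th>2014</th></tr>
<tr><td>Aircraft fuel</td><td>206,924</td><td>286,817</td></tr>
</table>
</body></html>
"""
parser = TitleContext()
parser.feed(html)
context = parser.context
```

The paragraph text and the `<td>` row contents never reach `context`, mirroring the choice to ignore them as noise.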
Generate Table Titles (for Web Tables)
Table Understanding
B. Hancock et al. “Generating titles for web tables”. WWW ’19
Model Design
• Pointer-generator network
  • First proposed for abstractive summarization
  • Combines copy & generator mechanism
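The way a pointer-generator combines copying and generating can be illustrated with its standard output mixture, P(w) = p_gen · P_vocab(w) + (1 − p_gen) · (attention mass on source positions holding w). The sketch below uses plain Python and invented numbers; it shows only this mixing step, not the model of Hancock et al., and `final_distribution` is a hypothetical helper name.

```python
from typing import Dict, List


def final_distribution(p_gen: float,
                       vocab_dist: Dict[str, float],
                       attention: List[float],
                       source_tokens: List[str]) -> Dict[str, float]:
    """Mix the generator's vocabulary distribution with the copy (attention) distribution."""
    # Generator part: scale the vocabulary distribution by p_gen.
    dist = {w: p_gen * p for w, p in vocab_dist.items()}
    # Copy part: add (1 - p_gen) times the attention on each source position.
    for tok, a in zip(source_tokens, attention):
        dist[tok] = dist.get(tok, 0.0) + (1.0 - p_gen) * a
    return dist


# Invented toy inputs: three context tokens and a tiny output vocabulary.
source_tokens = ["air", "canada", "expenses"]
attention = [0.1, 0.6, 0.3]
vocab_dist = {"expenses": 0.5, "of": 0.3, "by": 0.2}
dist = final_distribution(0.5, vocab_dist, attention, source_tokens)
```

A word such as "canada" that is absent from the output vocabulary still receives probability through the copy term, which is useful for titles that reuse rare names from the page context.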
Outline: D. Context Outside Document
Table Understanding
Additional context outside the table from other resources
- link to knowledge bases
Table to KB Linking
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
Link different parts of the table to external knowledge bases
• Link Columns (known as Column Type Identification)
• Link Rows/Cells (known as Entity Linking)
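Both linking tasks can be sketched against a toy knowledge base. Everything here is hypothetical: the `TOY_KB` contents, the `kb:` identifiers, exact-match lookup for entity linking, and majority voting for column types are simple stand-ins for the candidate generation and ranking that real systems use.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical toy knowledge base: surface form -> (entity id, entity type).
TOY_KB = {
    "air canada": ("kb:Air_Canada", "Airline"),
    "delta air lines": ("kb:Delta_Air_Lines", "Airline"),
    "toronto": ("kb:Toronto", "City"),
}


def link_cell(text: str) -> Optional[str]:
    """Entity linking by exact, case-insensitive lookup."""
    entry = TOY_KB.get(text.strip().lower())
    return entry[0] if entry else None


def column_type(cells: List[str]) -> Optional[str]:
    """Column type identification by majority vote over the types of linked cells."""
    types = Counter(
        TOY_KB[c.strip().lower()][1] for c in cells if c.strip().lower() in TOY_KB
    )
    return types.most_common(1)[0][0] if types else None


column = ["Air Canada", "Delta Air Lines", "Unknown Carrier"]
links = [link_cell(c) for c in column]
col_type = column_type(column)
```

Cells without a match stay unlinked, and the column still gets a type as long as some of its cells link.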
Table to KB Linking: Link Columns
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
Table to KB Linking: Link Rows/Cells
Table Understanding
Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
Understanding Tabular Data: Putting it All Together
Table Understanding
What does this cell represent?
A. Identify table regions (column/row headers)
B. Identify additional context within table
C. Identify context within document
D. Identify context outside document
“The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
Final Takeaways
1. Table extraction & understanding has a rich history of methods spanning many decades
2. Tables from different domains are not the same; a general table extraction & understanding system needs to consider the diversity of type, style, and content of tables
3. Both semantic and visual features are crucial to improving table extraction and understanding
4. As a community, we need to standardize tasks, evaluation metrics, and datasets
Build for the future by unlocking the past...

ICDM2019 table tutorial

  • 1. Table Extraction and Understanding for Scientific and Enterprise Applications Yannis Katsis Doug Burdick Nancy WangAlexandre V Evfimievski Marina Danilevsky IBM Research - Almaden
  • 2. Outline § Introduction – Problem Definition – Challenges – Applications – Demo § Table Extraction § Table Understanding § Conclusion
  • 4. Introduction Outline § Problem definition – Table Extraction – Table Understanding § Challenges – Limited document format support for table structure – Table variety § Applications – Knowledge Base Population – Query Answering – Leaderboard Construction – Information Extraction § Demo Introduction
  • 5. Tables are popular data representation Introduction Government Reports Scientific Papers Financial ReportsInvoices Contracts Loan Agreements Compact Easy to understand* (*) For humans
  • 6. End-to-end example § What does the value 672 in the following table mean? § Answer: Net earnings for three months ended July 29th, 2017 was $672 million USD Steps: 1) Find location of table on page 2) Find cells in column containing ”672” 3) Find cells in row corresponding to “672” 4) Identify aligned row / column header cells 5) Normalize using additional context from table Introduction Table Extraction: Identify table location and structure Table Understanding: Provide semantic context to table values
  • 7. Introduction Input: Document contents in native format - PDF - Image - Office Docs - … Table Extraction: Problem Definition Output: Document contents with tabular information: 1) Table border for each table 2) Partitioning table contents into cells 3) Both vertical and horizontal alignment of cells
  • 8. Table Understanding: Problem Definition Introduction Output: Table content representation: 1) Captures semantic information 2) Amenable to post-processing Input: Document contents with tabular information: 1) Table border for each table 2) Partitioning table contents into cells 3) Both vertical and horizontal alignment of cells
  • 9. { "tables": [ { "column_headers": [ { "cell_id": "colHeader-1050-1082", "text": ”Expenses ($ in thousands)", ... }, { "cell_id": "colHeader-1270-1301", "text": ”Three months ended Sept. 30", ... }, { "cell_id": "colHeader-1544-1548", "text": "2015" }, ... ], "row_headers": [ { "cell_id": "rowHeader-2244-2262", "text": ”Aircraft fuel" }, { "cell_id": "rowHeader-3197-3217", "text": ”Airport operations" }, { "cell_id": "rowHeader-4148-4176", "text": ”Flight operations and navigational changes" }, ... ], "body_cells": [ { "cell_id": "bodyCell-2450-2455", "text": ”206,924", "row_header_ids": [ "rowHeader-2244-2262" ], "column_header_ids": [ "colHeader-1050-1082", "colHeader-1270-1301”, ”colHeader-1544-1548” ], }, { "cell_id": "bodyCell-5415-8945", "text": ”142,176", "row_header_ids": [ "rowHeader-3197-3217" ], "column_header_ids": [ "colHeader-1050-1082", "colHeader-1270-1301”, ”colHeader-1544-1548” ], ... Table Understanding: Example Introduction Output: Table content representation: 1) Captures semantic information 2) Amenable to post-processing Input: Document contents with tabular information: 1) Table border for each table 2) Partitioning table contents into cells 3) Both vertical and horizontal alignment of cells { "tables": [ { "column_headers": [ { "cell_id": "colHeader-1050-1082", "text": ”Expenses ($ in thousands)", ... }, { "cell_id": "colHeader-1270-1301", "text": ”Three months ended Sept. 30", ... }, { "cell_id": "colHeader-1544-1548", "text": "2015" }, ... ], "row_headers": [ { "cell_id": "rowHeader-2244-2262", "text": ”Aircraft fuel" }, { "cell_id": "rowHeader-3197-3217", "text": ”Airport operations" }, { "cell_id": "rowHeader-4148-4176", "text": ”Flight operations and navigational changes" }, ... 
], "body_cells": [ { "cell_id": "bodyCell-2450-2455", "text": ”206,924", "row_header_ids": [ "rowHeader-2244-2262" ], "column_header_ids": [ "colHeader-1050-1082", "colHeader-1270-1301”, ”colHeader-1544-1548” ], }, { "cell_id": "bodyCell-5415-8945", "text": ”142,176", "row_header_ids": [ "rowHeader-3197-3217" ], "column_header_ids": [ "colHeader-1050-1082", "colHeader-1270-1301”, ”colHeader-1544-1548” ], ...
  • 10. Table Understanding: Example Introduction Output: Table content representation: 1) Captures semantic information 2) Amenable to post-processing Value Norm Value Year Time Period Type LineItem 206,924 $206,924, 000 2015 Q3 Expense Aircraft Fuel 286,817 $286,817, 000 2014 Q3 Expense Airport Fuel 142,176 $142,176, 000 2015 Q3 Expense Airport operations … … … … … … Change Change Normalized Begin Time Period End Time Period (27.9%) -27.9% Q3 2014 Q3 2015 10.7% 10.7% Q3 2014 Q3 2015 … … … … Input: Document contents with tabular information: 1) Table border for each table 2) Partitioning table contents into cells 3) Both vertical and horizontal alignment of cells
  • 11. Outline § Problem definition – Table Extraction – Table Understanding § Challenges – Limited document format support for table structure – Table variety § Applications – Knowledge Base Population – Query Answering – Leaderboard Construction – Information Extraction § Demo Introduction
  • 12. Challenge: Table structure representation varies across document formats
    Native table structure support on a scale of None / Partial / Complete; formats shown, from complete to none: HTML, MS Excel, MS Word, TXT, PDF, Image
    Table Understanding is still required for all document types
    H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI '19
    Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM '17
    M. Cafarella et al. “WebTables: exploring the power of tables on the web”. VLDB '08
  • 13. HTML completely represents table structure
  • 14. MS Excel: Table structure representation varies across Excel documents
    – Each sheet is a separate table
    – Multiple tables are defined in a single sheet
    H. Dong et al. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. AAAI '19
    Z. Chen et al. “Spreadsheet Property Detection With Rule-assisted Active Learning”. CIKM '17
  • 15. MS Word: Table structure representation varies across Word documents
    – Use Office Table Object for all tables
    – Omit Office Table Object
  • 16. Document formats with no native table representation: TXT, PDF, Image
  • 17. PDF Document Format
    § A programmatic PDF is a collection of instructions that draw characters and line segments on the page, with visual formatting information
    § 2 – 4 trillion PDFs in existence, and the number is growing rapidly
    PDF binary (excerpt):
      … BT 0.0503 Tc 8.503556 0 0 8.52 503.2795 688.92 Tm /Tc2 1 Tf [ ( m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 (e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( ) ] TJ 0 Tc ET …
      Q q 46.91952 776.52 m 242.04 776.52 l 242.04 729.96 l 144.48 729.96 l 46.91952 729.96 l h …
    What the instructions mean (rendered PDF):
      Draw “m” at (503, 688) in 8.5 point font in color white
      Draw “o” at … Draw “n” at … Draw “t” at … Draw “h” at … Draw “s” at … …
      Draw green line segment from (46, 776) to (242, 776)
      Draw green line segment from (242, 776) to (242, 729)
      …
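To make the point concrete, the text shown by a `TJ` instruction like the one above can be recovered by concatenating its string operands. This is a rough sketch under simplifying assumptions: it handles only literal strings without escapes or nested parentheses, and the function name `text_from_tj` is hypothetical. A real PDF parser also has to deal with fonts, encodings, and the text matrix.

```python
import re

def text_from_tj(array_source):
    """Concatenate the string operands of a PDF TJ array.

    The numbers between the strings are kerning adjustments and are
    ignored here. Escaped or nested parentheses are NOT handled.
    """
    return "".join(re.findall(r"\(([^()]*)\)", array_source))

# A cleaned-up version of the TJ array from the slide.
tj = "[(m) 16 (o) 21 (n) 17 (t) 39 (h) 16 (s) 29 ( ) 28 (e) 28 (n) 17 (d) 24 (e) 28 (d) 24 ( )] TJ"
print(repr(text_from_tj(tj)))
```

Note that nothing in the instruction stream says these characters belong to a table cell; that structure has to be inferred from positions.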
  • 18. Challenge: Variety in Tables
    – Table with graphic lines
    – Table with visual clues only
    – Tables with textual content
    – Multi-row, multi-column column headers
    – Nested row headers
    – Table interleaved with text and charts
    – Complex tables: graphical lines can be misleading (is this 1, 2 or 3 tables?)
  • 20. Application: Knowledge-base population
    Ex: Delta Air Lines, Inc. 2014 Annual Report (Form 10-K)
    § Excerpt of HTML file with granular financial metric data: valuable metrics for the airline industry, only present in an HTML table
      https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal1231201410k.htm
    § Excerpt of semi-structured XBRL file for financial statements: valuable metrics present in a semi-structured raw data source
      https://www.sec.gov/Archives/edgar/data/27904/000002790415000003/dal-20141231.xml
      :
      <xbrli:context id="FI2013Q4"><xbrli:entity> <xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904 </xbrli:identifier></xbrli:entity> <xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period> </xbrli:context>
      :
      <us-gaap:CashAndCashEquivalentsAtCarryingValue contextRef="FI2013Q4" decimals="-6" id="Fact-C39BEC178121A91816968BA9ADCF421F" unitRef="usd"> 2844000000 </us-gaap:CashAndCashEquivalentsAtCarryingValue>
      :
    § The two sources MUST BE INTEGRATED
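Reading the semi-structured side of this integration problem is comparatively easy, as the following sketch suggests. It is an illustration only: the wrapper element and the two namespace URIs are assumptions added to make the excerpt parse on its own, and `read_facts` is a hypothetical helper. The table-only metrics in the HTML file are the hard part and are not covered here.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions; the slide excerpt does not show them.
NS = {
    "xbrli": "http://www.xbrl.org/2003/instance",
    "us-gaap": "http://fasb.org/us-gaap/2014-01-31",
}

DOC = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                     xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <xbrli:context id="FI2013Q4">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000027904</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <us-gaap:CashAndCashEquivalentsAtCarryingValue contextRef="FI2013Q4"
      decimals="-6" unitRef="usd">2844000000</us-gaap:CashAndCashEquivalentsAtCarryingValue>
</xbrli:xbrl>"""

def read_facts(xml_text):
    """Join each fact element with the context (entity, date) it refers to."""
    root = ET.fromstring(xml_text)
    contexts = {}
    for ctx in root.findall("xbrli:context", NS):
        cik = ctx.find("xbrli:entity/xbrli:identifier", NS).text.strip()
        instant = ctx.find("xbrli:period/xbrli:instant", NS).text.strip()
        contexts[ctx.get("id")] = {"cik": cik, "instant": instant}
    facts = []
    for element in root:
        ref = element.get("contextRef")
        if ref is None:
            continue  # not a fact (for example, a context definition)
        facts.append({
            "metric": element.tag.split("}", 1)[1],  # strip the "{namespace}" prefix
            "value": int(element.text.strip()),
            "unit": element.get("unitRef"),
            **contexts[ref],
        })
    return facts

print(read_facts(DOC))
```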
  • 21. Application: Query Answering
    H. Sun et al. “Table Cell Search for Question Answering”. WWW '16
  • 22. Application: Scientific Leaderboard Construction
    Figure labels: Scientific Publication; Leaderboard Annotations
    Y. Hou et al. “Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction”. ACL '19
  • 23. Application: Biological Information Extraction
    Extract information on Quantitative Trait Loci (QTL), genomic regions that correlate with phenotypes, from tables in scientific publications
    Figure labels: Article; Trait Tables; Trait Statements; QTL Statements
    G. Singh et al. “QTLTableMiner++: Semantic Mining of QTL Tables in Scientific Articles”. BMC Bioinformatics '18
  • 24. Takeaways
    § Widely used document formats have limited table representation
      – Limits of the document format: Image, PDF
      – How documents are authored: Word, Excel
    § Wide variety of tables makes general model construction difficult
      – Tables are a form of art
      – Diverse visual encoding of semantic information
      – Different domains
    § Multiple applications for table extraction and understanding
  • 27. What Is Table Extraction?
    § Table region detection
      – Identify all tables
      – Separate tables from non-table text
      – Separate tables from each other
    § Cell structure recognition
      – Partition text into cells
      – Find cell span and cell-to-cell overlap (along X- or Y-axis)
  • 28. Table Extraction: A Sample of Prior Work
    [CK93] S. Chandran and R. Kasturi. “Structural Recognition of Tabulated Data”, ICDAR '93
    [I93] K. Itonori. “Table Structure Recognition Based on Textblock Arrangement and Ruled Line Position”, ICDAR '93
    [H95] J. Ha et al. “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, ICDAR '95
    [KD98] T. Kieninger and A. Dengel. “The T-Recs Table Recognition and Analysis System”, DAS '98
    [H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. '99
    [H00a] J. C. Handley. “Table Analysis for Multi-line Cell Identification”, SPIE Doc. Recog. & Retr. '00
    [H00b] J. Hu et al. “Table Structure Recognition and Its Evaluation”, SPIE Doc. Recog. & Retr. '00
    [KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR '01
    [C02] F. Cesarini et al. “Trainable Table Location in Document Images”, ICPR '02
    [P03] D. Pinto et al. “Table Extraction Using Conditional Random Fields”, SIGIR '03
    [H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR '03
    [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. '04
    [Y05] B. Yildiz et al. “pdf2table: A Method to Extract Table Information from PDF Files”, IICAI '05
    [S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR '06
    [M06] S. Mandal et al. “A Simple and Effective Table Detection System from Document Images”, IJDAR '06
    [L07] Y. Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”, JCDL '07
    [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR '07
    [L08] Y. Liu et al. “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM '08
    [L09] Y. Liu et al. “Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines”, ICDAR '09
    [OR09] E. Oro and M. Ruffolo. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”, ICDAR '09
    [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS '10
    [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR '11
    [F11] J. Fang et al. “A Table Detection Method for Multipage PDF Documents via Visual Separators and Tabular Structures”, ICDAR '11
    [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS '12
    [CL12] J. Chen and D. Lopresti. “Model-Based Tabular Structure Detection and Recognition in Noisy Handwritten Documents”, ICFHR '12
    [K13] T. Kasar et al. “Learning to Detect Tables in Scanned Document Images Using Line Information”, ICDAR '13
    [K14] S. Klampfl et al. “A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles”, D-Lib Mag. '14
    [B14] A. Bansal et al. “Table Extraction from Document Images using Fixed Point Model”, ICVGIP '14
    [R15] R. Rastan et al. “TEXUS: A Task-based Approach for Table Extraction and Understanding”, DocEng '15
    [T16] T. A. Tran et al. “A Mixture Model Using Random Rotation Bounding Box to Detect Table Region in Document Image”, JVCIR '16
    [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR '17
    [S18a] P. Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”, KDD '18
    [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. '18
    [Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019
    [C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019
    [L19] M. Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”, arXiv, 2019
    [M19] S. Mujumdar et al. “Simultaneous Optimisation of Image Quality Improvement and Text Content Extraction from Scanned Documents”, ICDAR '19
  • 29. Table Extraction: A Sample of Prior Work
    Most papers present an end-to-end system for:
      • Table detection,
      • Cell structure recognition (table parsing),
      • Or both
    🔥 ICDAR 2019 has ≥ 16 new papers on table extraction!
      – ICDAR = International Conference on Document Analysis and Recognition
  • 30. Table Extraction Timeline
    § Early 1990s: Separator-based “top-down” methods
      – Ruled line tables
      – Extend to white-space “lines”
    § 1990s – early 2000s: “Bottom-up” text clustering
      – Group text into columns (or rows), then into tables
      – Use space features (gaps, overlap, alignment) and keywords
    § 2000s – early 2010s: Machine Learning (supervised or not)
      – Classify text rows using CRF, SVM, HMM, etc.
      – Probabilistic models for tables
      – Graph-based models for cell structure
      – Unsupervised ML (clustering)
    § Late 2010s: Deep Learning
      – Scanned image table detection with R-CNN or YOLO
      – Graph neural networks and language embeddings for cell structure
  • 31. How to Build a Table Extraction System?
    § Analyze Page
      – Identify low-level structures & relations
    § The 2 Main Tasks
      – Table (region) detection
      – Cell structure recognition (given table region)
    § Refine Tables
      – Discard false positives
      – Adjust table border and structure
  • 32. Common Sub-Tasks in Table Extraction
    § Analyze
      – Compute page features
      – Group text into larger units
      – Identify separator lines
    § Detect
      – Generate candidate table regions
      – Extract table’s cell structure
    § Refine
      – Compute table’s features & score
      – Adjust candidate tables
      – Select tables for output
    § Learning Infrastructure
      – Accuracy metrics
      – Ground truth data
      – Optimization method
  • 34. Page Features
    § Documents can be:
      – scanned
      – programmatic (“born digital” PDF, TXT)
      – hybrid
    § A scanned page requires OCR, plus:
      – Reverse any rotation or distortion
      – Filter noise; sharpen if low resolution [M19]
      – Fix inconsistent font features and bounding boxes
      – Detect ruled lines and boxes
        • E.g., Gaussian filter + black hat transform [K13]
  • 35. Page Features
    § Programmatic PDFs (and TXTs)
      – Have letters, but no table markup
    § May contain spurious (invisible) text and lines
      – White-on-white lines or text
      – Occluded or out-of-range lines or text
      – Text repeated to simulate bold font
      – These need to be filtered out
    § Deep Learning (CNN-based) methods need an image
      – Render programmatic pages to images, as if they were scanned
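A filter for the spurious text listed above might look like the sketch below. It assumes a hypothetical token format (a dict with `text`, `bbox` as `(x0, y0, x1, y1)`, and `color` as an RGB tuple) that is not from the deck, and it covers only three cases: text in the background color, text outside the page, and text repeated at nearly the same position. Occlusion by other graphics is not handled.

```python
def filter_spurious(tokens, page_width, page_height, background=(1.0, 1.0, 1.0)):
    """Drop invisible, out-of-range, and duplicated tokens (simplified)."""
    kept, seen = [], set()
    for tok in tokens:
        x0, y0, x1, y1 = tok["bbox"]
        if tok["color"] == background:
            continue  # e.g. white text on a white page
        if x1 <= 0 or y1 <= 0 or x0 >= page_width or y0 >= page_height:
            continue  # entirely outside the visible page
        # Bold is sometimes simulated by drawing the same text again with a
        # tiny offset; rounding the position collapses most such repeats.
        key = (tok["text"], round(x0), round(y0))
        if key in seen:
            continue
        seen.add(key)
        kept.append(tok)
    return kept

tokens = [
    {"text": "Revenue", "bbox": (100.0, 700.0, 150.0, 710.0), "color": (0, 0, 0)},
    {"text": "Revenue", "bbox": (100.2, 700.0, 150.2, 710.0), "color": (0, 0, 0)},
    {"text": "hidden", "bbox": (100.0, 680.0, 140.0, 690.0), "color": (1.0, 1.0, 1.0)},
    {"text": "off-page", "bbox": (700.0, 680.0, 740.0, 690.0), "color": (0, 0, 0)},
]
print([t["text"] for t in filter_spurious(tokens, 612, 792)])
```

Rounding is a crude duplicate test: two copies whose offsets straddle a rounding boundary would both survive, so a production filter would compare distances instead.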
  • 36. Page Features
    § Plain text layout (1-column, 2-column, etc.)
      – Helps avoid false-positive “tables”
    § Obvious non-tables
      – Page & section headers, footers, lists, etc.
      – Short-cut computation if there are no tables on the page
    § Low-level structure
      – Alignment at different box positions & tolerance levels
      – A minimum spanning tree for clustering by distance
    § Deep learning features
      – CNN features shared across proposal regions
      – Natural language embeddings
  • 38. Group Text into Larger Units
    § Most systems group text early on
      – Table detection systems may skip text grouping
    § Text is grouped in one of 3 ways:
      – Columns first
      – Rows first
      – Cell-units (“blobs”) first
    § Some systems partition text using separator lines
      – BUT: “Blob” detection reduces over- / under-partitioning
  • 39. Example: Two Tables
    Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
  • 40. Example: Columns (same table source)
  • 41. Example: Rows (same table source)
  • 42. Example: Multi-line “Blobs” (same table source)
  • 43. Start with Columns
    Many systems detect columns first:
      – T-Recs [KD98], pdf2table [Y05], Lixto [HB07], Tesseract [SS10], smartFIX [D11]
    Example – Tesseract [SS10]:
      1. Detect X-axis “tab-stops” (alignment positions)
      2. Group tokens between “tab-stops” horizontally into entries
      3. Group entries of the same font vertically into column fragments
      4. Group column fragments within page columns horizontally into table fragments
      5. Group table fragments vertically into tables if their columns match
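Step 1 of the list above can be approximated as follows. This is a simplified illustration and not Tesseract's actual algorithm: it looks only at left edges, and the `tolerance` and `min_support` parameters are hypothetical. A position counts as a tab-stop when enough tokens start at nearly the same X coordinate.

```python
def find_tab_stops(tokens, tolerance=2.0, min_support=3):
    """Return X positions where at least `min_support` tokens left-align.

    tokens: list of bounding boxes (x0, y0, x1, y1).
    Left edges are sorted and chained into clusters: a new cluster starts
    whenever the next edge is more than `tolerance` from the previous one.
    """
    stops, cluster = [], []
    for x in sorted(t[0] for t in tokens):
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(sum(cluster) / len(cluster))
    return stops

boxes = [
    (50.0, 100, 120, 110), (50.5, 115, 130, 125), (51.0, 130, 110, 140),
    (200.0, 100, 240, 110), (200.0, 115, 240, 125), (201.0, 130, 240, 140),
    (320.0, 100, 360, 110),  # a lone token: not enough support for a tab-stop
]
print(find_tab_stops(boxes))
```

Numeric columns are usually right-aligned, so a fuller version would run the same clustering on right edges and centers as well.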
  • 44. Example: Tab-Stops
    Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
  • 45. Example: Column Fragments (same table source)
  • 46. Example: Table Fragments (same table source)
  • 47. Example: Table Fragments (same table source)
  • 48. Example: Tables (same table source)
  • 49. Example: Tables, Multi-Column Headers (same table source)
  • 50. Start with Rows
    Systems with ML often detect rows first:
      – Pinto-McCallum [P03], e Silva [S06], TableSeer [L08], PDF-TREX [OR09]
    Typical process:
      1. Identify text-lines
      2. Train an ML classifier to label text-lines:
         – “Table Dense”, “Table Sparse”, “Table Header”, “Non-table”, etc.
         – ML = CRF [P03], HMM [S06], SVM [L08], etc.
      3. Merge sparse rows into dense rows to get full table rows:
         – Merge up, down, or cluster around, by row alignment [H00a]
      4. Combine table rows into tables
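Step 3 of the process above, in its simplest "merge up" form, can be sketched like this. The row representation (a label plus a dict from column index to text) and the sample cell contents are made up for illustration; the cited systems also merge down or cluster around a dense row, using row alignment to decide.

```python
def merge_sparse_rows(rows):
    """Merge each sparse row up into the nearest preceding dense row.

    rows: list of (label, cells) in reading order, where cells maps a
    column index to the text found in that column on that text line.
    """
    merged = []
    for label, cells in rows:
        if label == "sparse" and merged and merged[-1][0] == "dense":
            target = merged[-1][1]
            for col, text in cells.items():
                # Append continuation text to the cell in the same column.
                target[col] = (target.get(col, "") + " " + text).strip()
        else:
            merged.append((label, dict(cells)))
    return merged

rows = [
    ("header", {0: "Item", 1: "2015"}),
    ("dense", {0: "Aircraft fuel and", 1: "206,924"}),
    ("sparse", {0: "related taxes"}),
]
print(merge_sparse_rows(rows))
```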
  • 51. Example: text lines labeled Table Header, Dense Row, or Sparse Row, with alignment shown
    Table Source: http://iq.iradesso.ca/main/components/clients_profiles/55/financial_reports/LRE-2014-YearEnd-Combined.pdf
  • 52. Example: the same labels, with alignment checks marked ✓ or ✕ (same table source)
  • 53. Example: text lines labeled Table Header, Dense Row, Sparse Row, or Heading Row, with alignment shown (same table source)
  • 54. Example: the same labels with Heading Rows, all alignment checks marked ✓ (same table source)
  • 55. Text “Blobs” (Cell-Units, Paragraphs, …)
    § “Blob” = largest semantically bound text unit
      – Single-line or multi-line
      – If in a table, the whole “blob” must be in a single cell
    § “Blob” ≠ Cell
      – A cell has a span and overlaps other cells
      – Some “blobs” end up in plain text or non-table text
    § “Blobs” help define table structure:
      – Trace alignment
      – Determine header cell spans
      – Fix over-split / over-merged cells, rows, columns
      – Reduce search space
  • 56. How to Detect “Blobs”
    § [KD98] Distance-based clustering:
      – Merge words horizontally
      – Merge text strings vertically if word-spans interleave
    § Problems with distance:
      – Multi-column headers: 1 justified phrase vs. ≥ 2 closely spaced phrases
      – Row headers / text cells: 1 multi-line cell vs. ≥ 2 closely spaced rows
    § Example:
                           Two Column Header    Two Column Header
      HEADER               Header    Header     Header    Header
      Row 1, text line 1   0.12      1.23       2.34      3.45
      Row 1, text line 2
      Row 1, text line 3
      Row 2, text line 1   4.56      5.67       6.78      7.89
      Row 2, text line 2
      Row 2, text line 3
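The sensitivity of distance-based merging can be seen in a small sketch. This is not the T-Recs implementation; it only merges words on one text line whose horizontal gap is below a hypothetical threshold `max_gap`, with made-up coordinates for the header line of the example.

```python
def merge_words(words, max_gap):
    """Merge horizontally adjacent words into phrases.

    words: list of (x0, x1, text) on a single text line.
    Two neighbors are merged when the gap between them is <= max_gap.
    """
    phrases = []
    for x0, x1, text in sorted(words):
        if phrases and x0 - phrases[-1][1] <= max_gap:
            px0, px1, ptext = phrases[-1]
            phrases[-1] = (px0, max(px1, x1), ptext + " " + text)
        else:
            phrases.append((x0, x1, text))
    return phrases

# Word gaps are 4 units; the gap between the two header phrases is 8 units.
line = [
    (0, 20, "Two"), (24, 60, "Column"), (64, 100, "Header"),
    (108, 128, "Two"), (132, 168, "Column"), (172, 208, "Header"),
]
print(merge_words(line, max_gap=5))   # keeps the two column headers apart
print(merge_words(line, max_gap=10))  # merges them into a single phrase
```

A small change in the threshold flips the result between two header cells and one, which is the problem the slide points out.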
  • 57. How to Detect “Blobs”
    § [H00a], [OR09] Merge “sparse” rows into “dense” rows
      – Merge up, merge down, or cluster around
    § [L09] Detect and follow reading order (an NLP challenge)
    § [B12] [B14] Train a classifier over “blob” features:
      – Proper termination (e.g. “blobs” don’t end with a dash or comma)
      – Number of numeric strings
      – Indentation, large space at the end of a string
      – Shared font properties
    § Deep learning approaches:
      – Cell-unit detection (over image) using CNNs
      – Semantic relationship detection (over text) using RNNs
  • 58. Example
    Table Source: https://www.dollartreeinfo.com/static-files/0c3687d8-e6ce-4566-bc89-79fc8c8b665e (2016_Proxy_Statement_Final.pdf)
  • 60. Separator Line Detection
    § Ruled Lines & Colored Boxes
      – Extend ruled lines over small gaps, “snap” them together
      – Merge touching colored boxes, then convert them into lines
      – Filter out: highlighting, underlining, boxed comments, logos, charts, etc.
    § BUT: A “perfect” ruled-line grid can be incomplete!
      – Some lines may be missing
      – Lines may fail to extend to header rows / columns
    See also: [CK93], [I93], [F11], [B12]
  • 61. Example 1
    Table Source: https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/quarterly-result/2015/2015_MDA_q3.pdf
  • 62. Example 2
    Table Source: https://www.ada.gov/restripe.pdf
  • 63. Example 3
    Table Source: http://educationaldatamining.org/files/conferences/EDM2018/EDM2018_Preface_TOC_Proceedings.pdf
  • 64. Separator Line Detection
    § White-space separators (“virtual” lines)
      – Help define cell span / cell alignment in tables
      – Prune false positives by ML or by heuristics [B12]
    § How to detect white-space separators
      – Cell-unit (“blob”) bounding box expansion [I93]
      – Axis projection histograms [CK93]
      – White-space cover by maximum-area white-space rectangles [F11]
    § How to prune them (features to use)
      – Adjacent “blobs”: alignment, size, and content
      – “Strong” separators that run parallel to or intersect the separator
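The projection-histogram idea can be illustrated without building an actual histogram. The sketch below projects every box onto the X axis and reports the intervals that no box covers, which is the continuous equivalent of zero runs in an X-axis projection profile. The function name and the `min_width` parameter are hypothetical, and the pruning step is not included.

```python
def vertical_gaps(boxes, min_width):
    """Find candidate vertical white-space separators.

    boxes: list of (x0, y0, x1, y1).
    Returns (start, end) X intervals, at least `min_width` wide, that lie
    between the leftmost and rightmost box and are covered by no box.
    """
    if not boxes:
        return []
    spans = sorted((b[0], b[2]) for b in boxes)
    gaps, reach = [], spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - reach >= min_width:
            gaps.append((reach, x0))
        reach = max(reach, x1)  # rightmost X covered so far
    return gaps

boxes = [
    (0, 0, 40, 10), (0, 12, 35, 22),
    (60, 0, 90, 10), (62, 12, 95, 22),
    (120, 0, 150, 10),
]
print(vertical_gaps(boxes, min_width=10))
```

Swapping the X and Y coordinates gives horizontal separators in the same way.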
  • 65. Recursive X-Y Cut Algorithm
    § Commonly used to partition the page and generate separators
      – By [C02], [W04], [K14], and others
    § [H95] The algorithm recursively, for each block:
      – Computes X- and Y-axis projection profiles
      – Divides the block into sub-blocks based on dips in the profiles
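A compact version of the recursion is sketched below. It is a simplification rather than the algorithm of [H95]: instead of thresholding dips in a projection profile, it cuts at the middle of the widest completely empty gap, tries the Y axis before the X axis, and uses a hypothetical `min_gap` parameter as the stopping rule.

```python
def widest_gap(boxes, axis, min_gap):
    """Midpoint of the widest empty gap along `axis` (0 = X, 1 = Y), or None."""
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best, reach = None, spans[0][1]
    for lo, hi in spans[1:]:
        width = lo - reach
        if width >= min_gap and (best is None or width > best[0]):
            best = (width, (lo + reach) / 2)
        reach = max(reach, hi)
    return None if best is None else best[1]

def xy_cut(boxes, min_gap=5.0):
    """Recursively split boxes (x0, y0, x1, y1) into leaf blocks."""
    if len(boxes) < 2:
        return [boxes]
    for axis in (1, 0):  # look for a gap along Y first, then along X
        cut = widest_gap(boxes, axis, min_gap)
        if cut is not None:
            # The cut lies inside an empty gap, so every box falls on one side.
            first = [b for b in boxes if b[axis + 2] <= cut]
            second = [b for b in boxes if b[axis] >= cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    return [boxes]

grid = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 30, 10, 40), (20, 30, 30, 40)]
print(xy_cut(grid))
```

The cut positions found along the way are the white-space separators; the leaves are the blocks between them.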
  • 67. Generate Candidate Table Regions
    § Ruled line grids / frames, connected components
    § (Rows 1st) Stack “table” rows whose “blobs” co-align [L08], [OR09]
      – Rows are labeled by an ML classifier (CRF, SVM, HMM)
      – Labels & matching “blob” layout → table regions
      – NOTE: Be sure to label “header rows” to tell tables apart!
    § (Cols 1st) Cluster overlapping column fragments [HB07], [SS10]
      – Group table columns horizontally, staying within page layout columns (when possible)
      – Group vertically if column fragments overlap, match, or subsume
      – NOTE: Column header areas require special handling!
    See also: [K13]
  • 68. Generate Candidate Table Regions
    § (Blobs 1st) Classify text “blobs”, cluster those labeled “table”
      – [B14] iteratively labels “blobs” given their neighbors’ labels
      – [B14] trains a Kernel Logistic Regression classifier
    § (Lines 1st) Find areas where “strong” separators make a grid
      – [CL12] uses a Max-Flow / Min-Cut algorithm to extract grids
      – Bi-cluster the intersection matrix of horizontal vs. vertical separators
    See also: [G17], [S18b]
  • 69. Generate Candidate Table Regions: Non-negative Matrix Factorization for Grid Clustering
    $X \approx U V^{T}$, with $U \ge 0$, $V \ge 0$, and inner dimension $k$
    • $X_{ij} = 1 \iff$ lines $i$ and $j$ intersect
    • At intersections: $1 \approx u_{i1} v_{j1} + u_{i2} v_{j2} + \dots + u_{ik} v_{jk}$
    • Each $u_{ic} v_{jc} \ge 0$ gives the affinity of intersection $X_{ij}$ to cluster $c$
    • $u_{ic} v_{jc}$ is large $\iff$ $u_{ic}$ and $v_{jc}$ are both large
    [Figure: a 0/1 intersection matrix X and its non-negative factors U and V]
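The factorization itself can be computed in several ways; the slide does not say which one is used. The sketch below uses the classic multiplicative update rules as one common choice, written in plain Python so that it has no dependencies. All names, the iteration count, and the toy intersection matrix are assumptions for illustration.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(x, u, v):
    """Squared Frobenius norm of X - U V^T."""
    approx = matmul(u, transpose(v))
    return sum((x[i][j] - approx[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))

def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Factor X (m x n, non-negative) as U V^T with U (m x k), V (n x k) >= 0."""
    rng = random.Random(seed)
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in x]
    v = [[rng.random() + 0.1 for _ in range(k)] for _ in x[0]]
    xt = transpose(x)
    for _ in range(iterations):
        # U <- U * (X V) / (U V^T V), element-wise
        num, den = matmul(x, v), matmul(u, matmul(transpose(v), v))
        u = [[u[i][c] * num[i][c] / (den[i][c] + eps) for c in range(k)]
             for i in range(len(u))]
        # V <- V * (X^T U) / (V U^T U), element-wise
        num, den = matmul(xt, u), matmul(v, matmul(transpose(u), u))
        v = [[v[j][c] * num[j][c] / (den[j][c] + eps) for c in range(k)]
             for j in range(len(v))]
    return u, v

# Toy matrix: horizontal lines 0-1 cross vertical lines 0-1 (one grid),
# horizontal lines 2-3 cross vertical lines 2-4 (another grid).
X = [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 1, 1, 1]]
U, V = nmf(X, k=2)
print(frob_error(X, U, V))
```

Each line can then be assigned to the cluster `c` with the largest entry in its row of `U` (horizontal lines) or `V` (vertical lines), in line with the affinity reading given on the slide.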
  • 70. Generate Candidate Table Regions
    § (CNN-based) Try a fixed set of table region proposals
      – CNN shares computation of features across all translations of a given proposal rectangle
      – Proposal rectangle shapes / sizes are fixed as hyperparameters
      – If a proposal hits a table, a regression decides table borders
    See also: [G17], [S18b]
  • 71. Deep Learning for Table Detection
    § Use existing object detection frameworks (Faster R-CNN or YOLO) retrained for table detection
    § The field is wide open for more table-specific DL approaches
      – E.g. involving text semantics
    Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. arXiv 2019
    Staar et al. “Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale”. KDD 2018
    Schreiber et al. “DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images”. ICDAR 2017
    Gilani et al. “Table Detection using Deep Learning”. ICDAR 2017
  • 72. Common Sub-Tasks in Table Extraction Table Extraction Analyze Detect Refine Learning Infrastructure Extract table’s cell structure Generate candidate table regions Select tables for output Adjust candidate tables Compute table’s features & score Identify separator lines Group text into larger units Compute page features Accuracy metrics Ground truth data Optimization method
  • 73. § Cells define overlap relation along X- or Y-axis – Links headers with data – critical for table understanding § Cell borders ← ruled lines ∪ “strong” white-space lines – Extend lines to make rectangular cells, avoid crossing “blobs” § Ruled grids: test for incompleteness – Multiple numerics per cell – A “strong” white-space line splits text in ≥ 2 cells – A “mini-table” inside a ruled cell – Cell structure extends beyond table frame § White-space grids: clean up empty cells – Expand header cells by merging with empty cells [S06] – Merge (almost-) empty rows and columns Cell Structure: Line Based Table Extraction [S06] A. C. e Silva et al. “Design of an End-to-end Method to Extract Information from Tables”, IJDAR ‘06 [B12] E. Bart. “Parsing Tables by Probabilistic Modeling of Perceptual Cues”, DAS ‘12
  • 74. § Use Spatial Constraints to find an overlap DAG over cells [H03] § Use Graph Neural Networks to find 2 undirected graphs: Cell Structure: Graph Based [H03] M. Hurst. “A Constraint-based Approach to Table Structure Derivation”, ICDAR ‘03 [Q19] S. R. Qasim et al. “Rethinking Table Recognition using Graph Neural Networks”, 2019 [C19] Z. Chi et al. “Complicated Table Structure Recognition”, arXiv, 2019 [Q19] [C19] Table Extraction – “Same Row” graph & “Same Column” graph – Two cells share an edge ⇔ share a row / a column – [Q19] : Rows and columns = maximal cliques – [C19] : Only adjacent cells share a graph edge
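To make the "rows and columns = maximal cliques" idea concrete, here is a minimal sketch. It assumes the pairwise "same row" predictions are already available as a list of edges (in [Q19] these come from the neural network); the helper names and the toy cell labels are hypothetical, and the clique enumeration is a plain Bron-Kerbosch search rather than anything taken from the paper.

```python
def build_adjacency(edges):
    """Turn undirected edges into {node: set(neighbors)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), set(adj), set())
    return cliques

# Toy "same row" graph: cell A spans two rows, so it shares a row with
# B and C (first row) and with D and E (second row).
same_row_edges = [("A", "B"), ("A", "C"), ("B", "C"),
                  ("A", "D"), ("A", "E"), ("D", "E")]
rows = maximal_cliques(build_adjacency(same_row_edges))
```

A spanning cell ends up in more than one clique, which is why cliques rather than connected components are needed here. The same routine applied to the "same column" graph would yield columns.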
  • 75. Schreiber et al. “Deepdesrt: Deep learning for detection and structure recognition of tables in document images” ICDAR 2017 Table Extraction Cell Structure: CNN Based § Object detection networks were also used for cell structure detection
  • 77. § Eliminate false positive tables § Detect malformed table regions – Plain text in tables – Missing row / column headers or split-off pieces – One region covers multiple tables § Compare alternative table candidates – Example: Is this 1 table or 2 tables? § Improve table region and structure – Pick the best adjustment out of a range of options – NOTE: Knowing cell structure helps region scoring / adjustment § Provide a confidence value for output tables Why Score Tables? Table Extraction
  • 78. § Tables are very diverse – Tiny or huge, misaligned, text in cells, key-value pairs, confusing delimiters – Complex row / column headers – so different, easy to chop off ! § What’s around the table also matters – Can its columns or rows be extended? Should they be? § One table, or ≥ 2 adjacent tables? – 1 table may have: ruled bars, wide gaps, font / alignment changes – 2 tables may be: fully or partly co-aligned, separated in one of many ways § Non-table text can have complex structure, too – Page headers / footers, framed / highlighted text, hierarchical lists, … Table Scoring Challenges Table Extraction
  • 79. Example 1 Table Extraction Table Source: https://www.legislation.gov.au/Details/F2010C00607/0d99393c-5c5b-4af0-9cc1-b5c2de8632c3 (F2010C00607.pdf) NOT A TABLE !
  • 80. Example 2 Table Extraction Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf Row headers Column headers
  • 81. Example 3 Table Extraction Table Source: https://www.thewaltdisneycompany.com/wp-content/uploads/2019/01/2018-Annual-Report.pdf Row headers Column headers
  • 82. Example 4 Table Extraction Table Source: https://assets.ctfassets.net/rz9m1rynx8pv/2x3p5ompzZyrRtAHw4M3XB/be648275661795139cabcee29a730630/TELUS_Q1_2019_quarterly_report.pdf Row headers Column headers
  • 83. § Rule-out patterns – Rule out charts, lists, signature blocks etc. § Aggregated column / row score – [KD01] Aggregate the similarities that led to the table’s column fragments § Dynamic programming score – [H99] Score (T) = max { Score (T – line) + Merit (line) } – Score the best split into 2 sub-tables § Probability of being a table (given the features) – [W04] Partition page into blocks labeled “table” and “plain text” – Compute label probability for block + neighboring blocks § A scoring neural network on top of CNN [G17, S18b] How to Score a Table [H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99 [KD01] T. Kieninger and A. Dengel. “Applying the T-Recs Table Recognition System to the Business Letter Domain”, ICDAR ‘01 [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  • 84. § Columns and rows: – Number, span / extent, alignment, font / content similarity § Ruled and white-space separators: – Number, span / extent, width of their margins – If they match, reach (good) or cross (bad) table borders § Inside vs. outside table: – Border crossing ruled lines, aligned blocks, or highly similar text – The two sides have matching structure § Cell structure: – Oversized cells, misaligned pairs of cells, “runs” of empty cells § Content: – Numerics, repeated words; customizable keywords – Domain-specific “expectations,” e.g. header dictionary [D11] § CNN-generated features Features for Table Scoring [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11 Table Extraction
  • 86. § Leverage table features and score – Specify what a well-formed vs. malformed table looks like § Use a transparent, explainable method – If detection is a “black box”, adjustment uses explainable rules & features § Correct errors quickly – Bypass the need for extra ground-truth data, retraining § Customize to address specific concerns – Add custom features, rules, and constraints Why Adjust Tables? [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07 [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10 [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
  • 87. § Merge table with an adjacent table or text-block [W04] [SS10] § Adjust table border – add or drop rows or columns [HB07] [D11] § Split one table into two, possibly with plain text between § Re-compute table region by neural network regression [G17] [S18b] § Choose best-scoring border (or structure) out of a range of options § Iterate adjustment → traverse a search tree of candidate tables How to Adjust Candidate Tables [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 [HB07] T. Hassan and R. Baumgartner. “Table Recognition and Understanding from PDF Files”, ICDAR ‘07 [SS10] F. Shafait and R. Smith. “Table Detection in Heterogeneous Documents”, DAS ‘10 [D11] F. Deckert et al. “Table Content Understanding in smartFIX”, ICDAR ‘11 [G17] A. Gilani et al. “Table Detection using Deep Learning”, ICDAR ‘17 [S18b] S. A. Siddiqui et al. “DeCNT: Deep Deformable CNN for Table Detection”, IEEE Acc. ‘18 Table Extraction
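The last two bullets (pick the best-scoring border from a range of options, then iterate) can be sketched as a simple hill-climbing loop. This is an illustration under stated assumptions, not a method from any of the cited papers: a table region is reduced to a `(top, bottom)` pair of line indices, the only adjustments are adding or dropping one row at either edge, and `toy_score` is a hypothetical stand-in for a real table-scoring function.

```python
def adjust_table(region, score, n_lines):
    """Greedily move the top/bottom border while the score keeps improving."""
    top, bottom = region
    while True:
        # Candidate adjustments: add or drop one row at either border.
        options = [(top - 1, bottom), (top + 1, bottom),
                   (top, bottom + 1), (top, bottom - 1)]
        options = [(t, b) for t, b in options if 0 <= t <= b < n_lines]
        best = max(options, key=score, default=None)
        if best is None or score(best) <= score((top, bottom)):
            return top, bottom  # no adjustment improves the score
        top, bottom = best

# Hypothetical page of 8 text lines; True marks lines that look tabular.
tabular = [False, True, True, True, True, True, False, False]

def toy_score(region):
    """+1 for every tabular line inside the region, -1 for every other line."""
    top, bottom = region
    return sum(1 if tabular[i] else -1 for i in range(top, bottom + 1))

adjusted = adjust_table((2, 3), toy_score, n_lines=len(tabular))
```

Greedy iteration follows a single path; the slide's "search tree of candidate tables" would instead keep several alternatives alive, for example with beam search.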
  • 89. What if candidate tables overlap each other? § [H99] uses Dynamic Programming: – Only for top and bottom line-positions: [i, j] – Score disjoint unions of tables: § CNN-based object detection systems: – Greedy Approach: Pick the top-scoring region, repeat – PROBLEM: Lower-scoring table may have a high-scoring sub-table § Maximum Weighted Independent Set – Nodes = tables, edges = conflicts, weights = table scores – NP-hard even for 2-dim rectangles [RN95], but can be solved efficiently in real-life cases Select Best Tables for Output [H99] J. Hu et al. “Medium-Independent Table Detection”, SPIE Doc. Recog. & Retr. ‘99 [RN95] C.S. Rim and K. Nakajima. “On Rectangle Intersection and Overlap Graphs”, IEEE Trans. on Circuits & Systems I, 42(9), 1995 Table Extraction [Figure: conflict graph over candidate tables; Conflict = Table Overlap]
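When candidates are described only by their top and bottom line positions, choosing the best disjoint set is a one-dimensional problem that dynamic programming solves exactly. The sketch below is a generic weighted-interval-selection DP in that spirit; it is not claimed to be the exact recurrence of [H99], and the names `select_tables`, `greedy_select` and the toy candidates are hypothetical.

```python
def select_tables(candidates, n_lines):
    """Pick non-overlapping (top, bottom, score) candidates with maximum total score.

    Lines are numbered 0..n_lines-1 and a candidate covers top..bottom inclusive.
    best[j] is the best total score achievable using only lines 0..j-1.
    """
    ending_at = {}
    for cand in candidates:
        ending_at.setdefault(cand[1], []).append(cand)
    best = [0.0] * (n_lines + 1)
    choice = [None] * (n_lines + 1)
    for j in range(1, n_lines + 1):
        best[j] = best[j - 1]  # option 1: line j-1 belongs to no table
        for top, bottom, score in ending_at.get(j - 1, []):
            if best[top] + score > best[j]:  # option 2: a table ends at line j-1
                best[j] = best[top] + score
                choice[j] = (top, bottom, score)
    picked, j = [], n_lines
    while j > 0:  # backtrack through the recorded choices
        if choice[j] is None:
            j -= 1
        else:
            picked.append(choice[j])
            j = choice[j][0]
    return best[n_lines], picked[::-1]

def greedy_select(candidates):
    """Baseline: repeatedly take the top-scoring candidate that does not overlap."""
    picked = []
    for top, bottom, score in sorted(candidates, key=lambda c: -c[2]):
        if all(bottom < t or top > b for t, b, _ in picked):
            picked.append((top, bottom, score))
    return sum(s for _, _, s in picked), picked

# One large candidate competes with two smaller ones that together score higher.
candidates = [(0, 9, 5.0), (0, 4, 3.0), (5, 9, 3.0)]
dp_total, dp_tables = select_tables(candidates, n_lines=10)
greedy_total, greedy_tables = greedy_select(candidates)
```

In this toy case the greedy baseline keeps the single large region while the DP returns the two smaller tables. For general two-dimensional rectangles the DP no longer applies, which is where the maximum weighted independent set formulation comes in.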
  • 91. § Accuracy Metrics – Exact match of table region or structure is too inflexible – Partial match: Text? Area? Cell relationship? Functional? § Ground Truth Labeling – Very time consuming, requires sophisticated UI tools – Humans disagree on what’s correct § Optimization (pre- deep learning) – Lots of discrete, non-differentiable steps – Learn sub-tasks, e.g. row labeling with CRF / SVM – [W04] Global parameter learning: Learning from Data: Challenges [W04] Y. Wang et al. “Table Structure Understanding and Its Performance Evaluation”, Pattern Recog. ‘04 Table Extraction
  • 92. Table Boundary § Purity & Completeness § Character level recall, precision and F1 Table Structure § Recall and Precision of Cell Adjacency Relations Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12 ICDAR 2013 Competition Metrics Table Extraction Accuracy Metrics
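The table-structure metric can be illustrated with a small sketch. It is a simplification of the adjacency-relation idea rather than the competition's scoring code: tables are assumed to be rectangular grids of cell strings, relations link only directly neighboring non-empty cells, and repeated cell texts collapse into one relation. All names and the toy tables are hypothetical.

```python
def adjacency_relations(grid):
    """Collect (cell text, neighbor text, direction) for adjacent non-empty cells."""
    relations = set()
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if not text:
                continue
            if c + 1 < len(row) and row[c + 1]:
                relations.add((text, row[c + 1], "horizontal"))
            if r + 1 < len(grid) and c < len(grid[r + 1]) and grid[r + 1][c]:
                relations.add((text, grid[r + 1][c], "vertical"))
    return relations

def precision_recall_f1(truth, predicted):
    """Compare predicted adjacency relations against the ground truth."""
    t, p = adjacency_relations(truth), adjacency_relations(predicted)
    correct = len(t & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(t) if t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# The predicted structure wrongly merges the two cells of the second row.
truth = [["Year", "Revenue"],
         ["2015", "13"]]
predicted = [["Year", "Revenue"],
             ["2015 13", ""]]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

Because relations are compared by cell content, a single merge error removes several correct relations at once, so the metric penalizes structural mistakes more than a character-level boundary score would.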
  • 93. § Measure what actually matters downstream § Capture accuracy of access paths to each cell § Need header annotation as well as cell structure Table Extraction Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12 Accuracy Metrics Functional Metrics
  • 94. Ground Truth Datasets Complete Datasets with table boundary and cell structure: - ICDAR-2013 competition (PDF Format) - ICDAR-2019 competition (Image Format) - SciTSR 2019 (Generated from LaTeX files) Incomplete Datasets § TableBank (Full table boundary information only) § PDF-TREX (Financial Table dataset without ground truth Labels) § Marmot (Only ground truth for table boundary, cells inaccessible) § UNLV, UW-3 (Table structure and boundary annotations for scanned documents) Li et al. “TableBank: Table Benchmark for Image-based Table Detection and Recognition”. ArXiv 2019 Göbel et al. “A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents”. DocEng '12 Oro et al. “PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents”. ICDAR '09 Fang et al. “Dataset Ground-Truth and Performance Metrics for Table Detection Evaluation”. DAS '12 Chi et al. “Complicated Table Structure Recognition” arXiv 2019 Table Extraction
  • 95. Example: Accuracy Comparison § Table detection accuracy on the ICDAR 2013 Competition dataset: Table Extraction
  • 97. Table Understanding Table Understanding HTML Document Table Extraction (Optional) Document (PDF, Image) Representation of table contents that: • Captures semantic information • Is amenable to post-processing Table Understanding Output Knowledge Base Creation Downstream Tasks Question Answering e.g., Air Canada’s oper. revenues in Q3 2015? HTML Understand the semantics of tabular data
  • 101. Semantics of Tabular Data Table Understanding What does this cell represent?
  • 102. Semantics of Tabular Data Table Understanding “The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
  • 103. Semantics of Tabular Data Table Understanding “The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.” Information about a single cell is derived from multiple places
  • 104. What You Will Learn Table Understanding Components of table understanding • What are the different types of semantic information about a table? • Where can they be found? 1
  • 105. What You Will Learn Table Understanding Components of table understanding Table understanding Methods • What are the different types of semantic information about a table? • Where can they be found? 1 2 • What techniques are used to extract info for table understanding? • What learning methods can be used?
  • 106. What You Will Learn Table Understanding Components of table understanding Table understanding Methods • What are the different types of semantic information about a table? • Where can they be found? 1 2 • How do tables differ between domains? • How do the assumptions of proposed approaches affect their potential applicability to other domains? Importance of Domain3 • What techniques are used to extract info for table understanding? • What learning methods can be used?
  • 107. Outline: Components of Table Understanding Table Understanding A. Table Regions (Column/Row Headers) B. Context Within Table C. Context Within Document D. Context Outside Document
  • 108. Outline: A. Table Regions Table Understanding Column Headers (incl. nesting) Row Headers (incl. nesting) Data/Body Cells Main table regions Metadata
  • 109. Unsupervised Methods Overview Table Understanding Header rows/cols “look different” than data rows/cols
  • 110. Unsupervised Methods Overview Table Understanding Header rows/cols “look different” than data rows/cols Similarity Features
  • 111. Unsupervised Methods Overview Table Understanding Header rows/cols “look different” than data rows/cols Heuristics Similarity Features • Which heuristics to use?
  • 112. Unsupervised Methods: Local Minimum Table Understanding J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 For column (row) headers: Find first row (col) that looks “different” Pair-wise similarity of consecutive rows Local minimum of similarity
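A minimal sketch of the local-minimum idea follows. It is not the feature set of Fang et al.: row similarity is reduced to an assumed single cue (the fraction of columns whose cells share a coarse type), and the function names and the toy table are hypothetical.

```python
def cell_type(text):
    """Coarse cell type: 'empty', 'numeric' or 'text'."""
    stripped = text.strip().replace(",", "").replace("%", "").replace("$", "")
    if not stripped:
        return "empty"
    try:
        float(stripped)
        return "numeric"
    except ValueError:
        return "text"

def row_similarity(a, b):
    """Fraction of columns in which two rows hold cells of the same type."""
    same = sum(cell_type(x) == cell_type(y) for x, y in zip(a, b))
    return same / max(len(a), len(b))

def header_row_count(rows):
    """Number of leading header rows, cut at the first local minimum of similarity."""
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s < left and s < right:
            return i + 1  # rows 0..i look different from what follows
    return 0  # no row stands out

table = [
    ["Region", "Q1 revenue", "Q2 revenue"],
    ["North", "120", "135"],
    ["South", "98", "101"],
    ["West", "110", "97"],
]
n_header_rows = header_row_count(table)
```

Applying the same function to the transposed table would look for row-header columns. Real systems combine many more cues (fonts, alignment, spans) into the similarity.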
  • 113. Unsupervised Methods: Indexing Table Understanding S. Seth et al. “Segmenting tables via indexing of value cells by table headers”. ICDAR ‘13 • Use empty and repeated cells to find critical cells that outline the stubhead • Independent of visual aspects of table Repeated cell implying hierarchical row header Empty cells implying hierarchical column header
  • 114. Traditional ML Methods Overview Table Understanding Header rows/cols “look different” than data rows/cols Traditional ML Methods Similarity Features Column Headers Data Cells Classification Labels • How to model this as a classification problem? • Which ML method and features to use?
  • 115. Traditional ML Methods: Row/Column Classification Table Understanding J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 Data row Data row Data row Data row Data row Data row Column header row Classify rows as column header rows (similarly for row header columns) D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03
  • 116. Header Identification Results Table Understanding S. Seth et al. “Segmenting tables via indexing of value cells by table headers”. ICDAR ‘13 R. Rastan et al. “TEXUS: A unified framework for extracting and understanding tables in PDF documents”. Information Processing & Management Government Statistic Table Set (Seth): Seth et al.: Correct Segmentation 99%, Correct Stub Head (Critical Cell) 100%; TEXUS: Correct Segmentation 100%, Correct Stub Head (Critical Cell) 100% ASX-Announcements Dataset (TEXUS): TEXUS: Correct Segmentation -, Correct Stub Head (Critical Cell) 42.9%
  • 117. No standard benchmark or dataset Table Understanding J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 D. Pinto et al. “Table Extraction Using Conditional Random Fields”. SIGIR ‘03 FedStat Textfile Dataset (Pinto) CiteSeerX PDF Dataset (Fang)
  • 118. Traditional ML Methods: Table Classification Table Understanding Web Data Commons – Web Table Corpora Classify entire tables Relational Table Entity/Listing Table Matrix Table
  • 119. Traditional ML Methods: Table Classification Table Understanding Web Data Commons – Web Table Corpora Classify entire tables • Table class implies header structure Relational Table Entity/Listing Table Matrix Table
  • 120. Traditional ML Methods: Table Classification Table Understanding Web Data Commons – Web Table Corpora Classify entire tables • Table class implies header structure • Can be used for header identification under certain assumptions Relational Table Entity/Listing Table Matrix Table Single col header row Single col header row Single row header col
  • 121. Traditional ML Methods: Table Classification Table Understanding Table Classes Genuine vs Non-genuine Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web“. WWW ‘02 Relational vs Non-relational M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08 I. Relational Knowledge: Listing, Attribute/Value, Matrix, Calendar, Enumeration, Form II. Layout: Navigational, Formatting E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11 Vertical listings, horizontal listings, matrix tables J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
  • 122. Traditional ML Methods: Table Classification Table Understanding ML Methods Decision Tree, SVM Y. Wang et al. “A Machine Learning Based Approach for Table Detection on The Web“. WWW ‘02 Rule-based Classifier (WEKA) M. Cafarella et al. “Uncovering the Relational Web”. WebDB ‘08 Gradient Boosted Decision Tree E. Crestan et al. “Web-Scale Table Census and Classification”. WSDM ‘11 Decision Tree (CART, C4.5, Random Forest), SVM J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”. BDC ‘15
  • 123. Traditional ML Methods Table Understanding Neighborhood and Table Features • Number of non empty cells difference • Average alignment • Percentage of same cell data type • Percentage of same cell font style • Content repetition • Number and standard deviation of rows and columns Cell Features • Number of non empty cells. • Average cell length. • Percentage of numeric characters. • Percentage of symbolic characters • Average font size. • Cell Font Styles • Cell positioning in the table • Percentage of cells spanning multiple cols/rows • HTML Tags (if applicable) • Cell Span J. Fang et al. “Table Header Detection and Classification”. AAAI ‘12 J. Eberius et al. “Building the Dresden Web Table Corpus: A Classification Approach”, BDC ‘15
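A few of the listed cell features are easy to compute from cell text alone. The sketch below is only an illustration of what such a feature vector might look like for one row: the function name and feature keys are hypothetical, and layout features such as font size, style or HTML tags are left out because they need information beyond the text.

```python
def row_features(row):
    """Text-only features for one table row (list of cell strings)."""
    cells = [c.strip() for c in row]
    non_empty = [c for c in cells if c]
    chars = "".join(non_empty)
    n_chars = len(chars)
    return {
        "non_empty_cells": len(non_empty),
        "avg_cell_length": n_chars / len(non_empty) if non_empty else 0.0,
        "pct_numeric_chars": sum(ch.isdigit() for ch in chars) / n_chars if n_chars else 0.0,
        # "symbolic" is taken here to mean neither alphanumeric nor whitespace
        "pct_symbolic_chars": (sum(not ch.isalnum() and not ch.isspace() for ch in chars) / n_chars
                               if n_chars else 0.0),
    }

header_features = row_features(["Year", "Revenue", ""])
data_features = row_features(["2015", "$13", ""])
```

Feature dictionaries like these, one per row or column, could then be fed to any of the classifiers named on the previous slide (decision trees, SVMs, gradient boosted trees).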
  • 124. Deep Learning Methods Overview Table Understanding Header rows/cols “look different” than data rows/cols Deep Learning Methods Similarity Features Column Headers Data Cells Classification Labels • Which deep learning architecture to use?
  • 125. Deep Learning Methods: Hierarchical Attention Network Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables] Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16 Hierarchical RNN proposed to leverage document structure: • 2 layers: • Words • Sentences
  • 126. Deep Learning Methods: Hierarchical Attention Network Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ’17 [Adaptation to tables] Z. Yang et al. “Hierarchical Attention Networks for Document Classification”. ACL ‘16 Extend to tables: • 3 layers • Tokens • Cells • Rows or Columns • Bidirectional network • Combine row-directional and column-directional network
  • 127. Deep Learning Methods: RNN-CNN Hybrid (TabNet) Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ‘17 LSTM captures semantic representation of each cell CNN captures relationship between cells
  • 128. Deep Learning Methods: RNN-CNN Hybrid (TabNet) Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ‘17 LSTM captures cell text together with coordinates and other HTML tags (i.e., formatting)
  • 129. Deep Learning Methods: Results Table Understanding K. Nishida et al. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. AAAI ‘17 (Rule-Based) (Decision Tree) (Decision Tree) (Hierarchical Attention for Documents) (RNN-CNN Hybrid)
  • 130. Beyond Flat Headers: Hierarchical Row Headers Table Understanding Hierarchical Row Headers
  • 131. Beyond Flat Headers: Hierarchical Row Headers Table Understanding Identify hierarchical relationship among row headers Complex semantic row header hierarchy: Multiple cells in the same row header column are semantically related to each other
  • 132. Beyond Flat Headers: Hierarchy as a Graphical Model Table Understanding Z. Chen et al. “Integrating Spreadsheet Data via Accurate and Low-Effort Extraction”. KDD ‘14 Encode hierarchy as graphical model • Variable: Candidate parent-child pair • Node potentials: Features for predicting parent-child pairs • Edge potentials: Correlations of variables based on style, KB affinity, …
  • 133. Pairwise vs Rectangle cell relationships Table Understanding • Pairwise classification can only utilize local information • Simply looking at the pair may not be sufficient to determine the relation • A rectangle is “interesting” if it is the support rectangle of some cell, called a header cell of that rectangle
  • 134. Beyond Flat Headers: Hierarchy as Rectangle Relationship Table Understanding X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of Financial Tables”. ICDAR ‘17 Two “interesting” rectangles: • “Assets” (row 1) heads rows 2-17 • “Current” (row 2) heads rows 3-11
  • 135. Beyond Flat Headers: Hierarchy as Rectangle Relationship Table Understanding X. Chen et al. “A Rectangle Mining Method for Understanding the Semantics of Financial Tables”. ICDAR ‘17 When a “total” row is considered as a parent candidate, it cannot take children For each iteration: • Combine: Consecutive minimal rectangles with equal features • Attach: Minimal rectangle ri to directly preceding rectangle ri-1 if ri-1 > ri
  • 136. Outline: B. Context Within Table Table Understanding Currency Additional semantic information within the table - of different types Scale
  • 137. Outline: B. Context Within Table Table Understanding Additional semantic information within the table - of different types - of different scope Propagate to all data cells
  • 138. Outline: B. Context Within Table Table Understanding Additional semantic information within the table - of different types - of different scope Propagate to subset of data cells
  • 139. Outline: C. Context Within Document Table Understanding Additional context outside the table within the same document - leverage relevant text and tables
  • 140. Table Context Within Document Table Understanding Surrounding text often contains important info about a table Deeper Semantic Understanding • Link text to table • Generate table title Shallow Context Extraction • Extract table metadata
  • 141. Extract Table Metadata Table Understanding Ying Liu et al. “TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries”. JCDL ’07 Document title Page Table Caption Document authors
  • 142. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words However, mirroring the overall softness of the tech sector, sales of computer hardware decreased 1% versus a year-ago to $1.6 billion.
  • 143. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words • Syntactic analysis
  • 144. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words • Syntactic analysis • Semantic analysis ”…talking about topics is an important reason to email with these special interest groups.” word2vec
  • 145. Link Text to Table Cells Table Understanding D. H. Kim et al. “Facilitating document reading by linking text and tables”. UIST ’18 Text Cells described by text Approach: • Identify headers • Match sentence to table cells based on: • Unique words • Syntactic analysis • Semantic analysis • Use rules to refine matches
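The first matching cue, unique words, can be sketched as follows. This covers only that one step and is not the system of Kim et al.: headers are assumed to be already identified, a word counts as unique if it occurs in exactly one header of its kind, and the header lists are hypothetical. The sentence is the one quoted on the earlier slide.

```python
import re
from collections import Counter

def words(text):
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unique_words(headers):
    """Map each header to the words that occur in no other header."""
    counts = Counter(w for h in headers for w in words(h))
    return {h: {w for w in words(h) if counts[w] == 1} for h in headers}

def link_sentence(sentence, row_headers, col_headers):
    """Return (row header, column header) pairs of the cells a sentence refers to."""
    tokens = words(sentence)
    rows = [h for h, uniq in unique_words(row_headers).items() if uniq & tokens]
    cols = [h for h, uniq in unique_words(col_headers).items() if uniq & tokens]
    return [(r, c) for r in rows for c in cols]

# Hypothetical headers for a small sales table.
row_headers = ["Computer hardware", "Computer software"]
col_headers = ["Sales", "Growth"]
sentence = ("However, mirroring the overall softness of the tech sector, sales of "
            "computer hardware decreased 1% versus a year-ago to $1.6 billion.")
linked_cells = link_sentence(sentence, row_headers, col_headers)
```

The shared word "computer" is ignored because it appears in both row headers, so only "hardware" decides the row. The syntactic, semantic and rule-based steps on the slide would refine matches that this cue alone leaves ambiguous.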
  • 146. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Problem: • Web tables lack titles or • Existing titles lack context Table Title?
  • 147. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Solution: • Leverage surrounding context to generate table title Table + Surrounding Context Table Title Problem: • Web tables lack titles or • Existing titles lack context
  • 148. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Context used as input: • Page Title • Section headers (<h...> tags) • Column headers • Spanning column headers as a special case • Table caption (<caption> tag) Table + Surrounding Context Table Title Context ignored due to noise: • Text right before/after table • Table rows
  • 149. Generate Table Titles (for Web Tables) Table Understanding B. Hancock et al. “Generating titles for web tables”. WWW ’19 Model Design • Pointer-generator network • First proposed for abstractive summarization • Combines copy & generator mechanism Table + Surrounding Context Table Title
  • 150. Outline: D. Context Outside Document Table Understanding Additional context outside the table from other resources - link to knowledge bases
  • 151. Table to KB Linking Table Understanding Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19 Link different parts of the table to external knowledge bases Link Columns (known as Column Type Identification) Link Rows/Cells (known as Entity Linking)
  • 152. Table to KB Linking: Link Columns Table Understanding Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
  • 153. Table to KB Linking: Link Rows/Cells Table Understanding Zhang et al. “Web Table Extraction, Retrieval and Augmentation”, SIGIR Tutorial ’19
  • 154. Understanding Tabular Data: Putting it All Together Table Understanding What does this cell represent?
  • 155. Understanding Tabular Data: Putting it All Together Table Understanding What does this cell represent? A. Identify table regions (column/row headers)
  • 156. Understanding Tabular Data: Putting it All Together Table Understanding B. Identify additional context within table
  • 157. Understanding Tabular Data: Putting it All Together Table Understanding C. Identify context within document
  • 158. Understanding Tabular Data: Putting it All Together Table Understanding D. Identify context outside document
  • 159. Understanding Tabular Data: Putting it All Together Table Understanding “The unaudited comprehensive net loss of Air Canada in the six months ended June 30, 2015 is $13 million Canadian dollars.”
  • 160. Final Takeaways 1. Table extraction & understanding have a rich history of methods spanning many decades 2. Tables from different domains are not the same; a general table extraction & understanding system needs to consider the diversity of type, style, and content of tables 3. Both semantic and visual features are crucial to improve table extraction and understanding 4. As a community, we need to standardize tasks, evaluation metrics, and datasets
  • 161. Build for the future by unlocking the past...