www.karakun.com
Extracting information
from tables in documents
Holger Keibel
AI-SDV 2022, Vienna
© 2022 Karakun AG | 2
Karakun
Services
Software Engineering, UX Design, Consulting,
Training, Maintenance & Support
Platforms & Products
Efficiency-enhancing software platforms,
ready-made products for selected use cases,
e.g., HIBU Platform for search and LT solutions
Experienced & Established Team
60+ employees working in 4 locations
in CH (HQ), DE and IN
Competences / Skills
State-of-the-art tech stack (Java, web &
mobile), LT / AI / Big Data,
focus on open-source software
Sustainable Custom Solutions
Customers from various industries,
e.g., Insurance, Finance, Life Science,
Logistics
Community Engagement
Authors, speakers, lecturers at universities,
Java Champions, contributors to
open-source projects
HIBU Platform
Efficient development of custom solutions in the areas of
Artificial Intelligence: Rule-based, statistical, neural
Intelligent Search
Full-text search,
search filters,
convenience functions
Text Analysis
Classification,
information extraction,
sentiment analysis,
…
Document Automation
Content-driven,
input management,
smart actions
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed
do eiusmod tempor incididunt ut
labore et dolore magna aliqua.
Ut enim ad minim veniam, quis
nostrud exercitation ullamco
laboris nisi ut aliquip ex ea
commodo consequat.
Lorem ipsum: dolor sit
Lorem ipsum 1500
Information extraction methods are generally
tuned to sequential textual data
• Running text
• Example approach: modern language models
(transformer-based)
• Graphical information (coordinates) can
largely be ignored
• Horizontal label-value pairs
• Example approach: regular expressions
• Graphical/layout information rarely needed
Vertical label-value pairs
• Semi-sequential problem → more challenging
• Some graphical/layout information needed:
same x-coordinate on subsequent lines
• Example approach:
• regular expressions (to find labels)
• plus mild use of graphical information
(to extract the corresponding values)
• plus possibly again regular expressions
(to constrain values)
Date Reference
Oct 10, 2022 12345-67
Order date Your order
Oct 5, 2022 ABC-789/5
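The approach above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: tokens are already merged into text segments, each carries a left x-coordinate and a line index, and the names Token, extract_vertical_pairs and x_tolerance are hypothetical.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    x: float   # left x-coordinate of the text segment
    line: int  # index of the text line the segment sits on


def extract_vertical_pairs(tokens, label_pattern, value_pattern=None, x_tolerance=5.0):
    """Return {label: value} for values placed directly below their labels."""
    results = {}
    for label in tokens:
        # Regular expression finds the labels.
        if not re.fullmatch(label_pattern, label.text):
            continue
        for candidate in tokens:
            # Mild use of layout: next line, (almost) the same x-coordinate.
            below = candidate.line == label.line + 1
            aligned = abs(candidate.x - label.x) <= x_tolerance
            if below and aligned:
                # Optional second regular expression constrains the value.
                if value_pattern is None or re.fullmatch(value_pattern, candidate.text):
                    results[label.text] = candidate.text
                break
    return results


tokens = [
    Token("Date", 50.0, 0), Token("Reference", 200.0, 0),
    Token("Oct 10, 2022", 50.0, 1), Token("12345-67", 201.5, 1),
]
pairs = extract_vertical_pairs(tokens, r"Date|Reference")
```

The tolerance absorbs small alignment differences between a label and its value; how large it needs to be depends on the documents at hand.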
Information extraction from tables
Table extraction challenges
• How are row boundaries encoded?
Lines, spacing, aligned content, …
• How are column boundaries encoded?
Lines, spacing, aligned content, …
• Merged table cells
• …
• General lesson learned: Very difficult to solve by a
general-purpose table extraction solution
• Better: Limit the solution to specific table types
→ use any known constraints in the algorithm
Use case 1
• Land certificates (in Germany)
• Only scanned documents
Detect tables
Step 1:
Detect graphical
elements (red)
vs.
free text elements (blue)
Indicators:
• pixel density
• border lines
Using LAREX (Reul et al., 2017)
Step 2:
Straighten lines
(bounding boxes)
This and the subsequent steps were performed with OpenCV (https://opencv.org/).
Step 3: For each detected table: Cut out table image
Analyze table structure
Step 4: Blur image to smoothen and repair lines
Step 5: Invert colors
Step 6: Binarize image to increase width of lines
Step 6b: Detect horizontal lines
Step 7: Extend horizontal lines by means of dilation
Step 8: The same for vertical lines
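To illustrate the idea behind the line-detection steps, here is a plain-Python stand-in that works on a tiny binary image (1 = ink). It is not the OpenCV code used in the project; the function names and the min_length and reach parameters are assumptions made for this sketch.

```python
def find_runs(row, min_length):
    """Return (start, end) index pairs of consecutive 1-pixels at least min_length long."""
    runs, start = [], None
    for i, pixel in enumerate(list(row) + [0]):  # trailing 0 closes a run at the row end
        if pixel and start is None:
            start = i
        elif not pixel and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    return runs


def horizontal_line_mask(image, min_length):
    """Keep only pixels that belong to long horizontal runs (drops text strokes)."""
    mask = [[0] * len(row) for row in image]
    for y, row in enumerate(image):
        for start, end in find_runs(row, min_length):
            for x in range(start, end + 1):
                mask[y][x] = 1
    return mask


def dilate_horizontally(mask, reach):
    """Extend every set pixel by `reach` pixels to the left and right (closes gaps)."""
    out = [[0] * len(row) for row in mask]
    for y, row in enumerate(mask):
        for x, pixel in enumerate(row):
            if pixel:
                for nx in range(max(0, x - reach), min(len(row), x + reach + 1)):
                    out[y][nx] = 1
    return out


def transpose(image):
    return [list(column) for column in zip(*image)]


def vertical_line_mask(image, min_length):
    """The same detection for vertical lines, done on the transposed image."""
    return transpose(horizontal_line_mask(transpose(image), min_length))


image = [
    [1, 1, 1, 0, 1, 1],  # horizontal line with a one-pixel gap
    [0, 1, 0, 0, 0, 0],  # part of a vertical line
    [0, 1, 0, 0, 0, 0],
]
horizontal = dilate_horizontally(horizontal_line_mask(image, 2), 1)
vertical = vertical_line_mask(image, 3)
```

In the example, dilation closes the gap in the top line, and only column 1 survives as a vertical line.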
Step 9: Combine vertical and horizontal lines to a grid
and derive coordinates of cells.
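Once the positions of the grid lines are known, cell coordinates follow from neighbouring line pairs. A minimal sketch, assuming the lines have already been reduced to lists of y- and x-positions (the function name cell_boxes and the box layout are illustrative choices):

```python
def cell_boxes(horizontal_ys, vertical_xs):
    """Map (row, column) to a (left, top, right, bottom) box between adjacent grid lines."""
    ys = sorted(set(horizontal_ys))
    xs = sorted(set(vertical_xs))
    cells = {}
    for r, (top, bottom) in enumerate(zip(ys, ys[1:])):
        for c, (left, right) in enumerate(zip(xs, xs[1:])):
            cells[(r, c)] = (left, top, right, bottom)
    return cells


# Three horizontal and three vertical lines give a 2 x 2 grid.
cells = cell_boxes([0, 40, 80], [0, 100, 250])
```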
Step 10: Submit entire table to OCR engine
Step 11: Parse OCR result (hOCR) to assign words to table cells
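One plausible way to do this assignment is to place each word in the cell that contains the centre of its bounding box. The sketch below assumes that the words have already been pulled out of the hOCR markup as (text, title) pairs, where the title carries the usual "bbox x0 y0 x1 y1" property; the cell boxes and function names are made up for the example.

```python
import re

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    match = BBOX.search(title)
    return tuple(int(v) for v in match.groups()) if match else None


def assign_words(words, cells):
    """words: (text, title) pairs in reading order; cells: {(row, col): (left, top, right, bottom)}."""
    table = {key: [] for key in cells}
    for text, title in words:
        bbox = parse_bbox(title)
        if bbox is None:
            continue  # word without coordinates cannot be placed
        x0, y0, x1, y1 = bbox
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for key, (left, top, right, bottom) in cells.items():
            if left <= cx < right and top <= cy < bottom:
                table[key].append(text)
                break
    return {key: " ".join(parts) for key, parts in table.items()}


cells = {(0, 0): (0, 0, 100, 40), (0, 1): (100, 0, 250, 40)}
words = [
    ("1", "bbox 40 10 50 30; x_wconf 96"),
    ("234/15", "bbox 110 10 160 30; x_wconf 93"),
    ("Musterstraße", "bbox 170 10 240 30; x_wconf 91"),
]
content = assign_words(words, cells)
```

Using the centre point rather than the full box tolerates words that slightly overlap a cell border.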
Step 12: Resolve merged cells to derive structured representation
[Lfd.Nr. …] [Bish. …] [Bezeichnung …] [Bezeichnung …] [Bezeichnung …] [Größe] [Größe] [Größe]
[Lfd.Nr. …] [Bish. …] [a) Gemarkung …] [a) Gemarkung …] [c) Wirtschaftsart …] [ha] [a] [m²]
[Lfd.Nr. …] [Bish. …] [b) Karte] [Flurstück] [c) Wirtschaftsart …] [ha] [a] [m²]
[1] [2] [3] [3] [3] [4] [4] [4]
[1] [-] [123.45] [234/15] [Musterstraße 123\nGebäude- und Freifläche] [] [5] [94]
[2] [-] [123.45] [234/18] [Musterstraße 123a\nGebäude- und Freifläche] [] [3] [41]
[3] [-] [123.45] [137/8] [Musterstraße 58\nGebäude- und Freifläche] [] [11] [70]
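In the representation above, the content of a merged cell is repeated in every grid cell it covers. A small sketch of that expansion, assuming the merged spans have already been detected (for instance from missing line segments) and are given as (first row, first column, last row, last column) with their text; resolve_merged_cells is a hypothetical helper name:

```python
def resolve_merged_cells(n_rows, n_cols, spans):
    """Fill a regular grid, repeating each span's text in all cells it covers."""
    grid = [[""] * n_cols for _ in range(n_rows)]
    for (r0, c0, r1, c1), text in spans:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = text
    return grid


# Part of the three header rows shown above.
spans = [
    ((0, 0, 2, 0), "Lfd.Nr. …"),      # spans all three header rows
    ((0, 2, 0, 4), "Bezeichnung …"),  # spans three columns
    ((0, 5, 0, 7), "Größe"),          # spans three columns
]
header = resolve_merged_cells(3, 8, spans)
```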
Use case 2
• Order confirmations + invoices
• Digitally generated documents (PDFs)
• Focus: Known table layouts
Table extraction
{
  "freightCosts": 145.0,
  "orderDate": "2020-10-09",
  "orderId": "12345",
  "packagingCosts": 11.8,
  "positions": [
    { … },
    {
      "values": {
        "articleDescription": "Air/oil separator",
        "articleId": "341018-00",
        "articlePrice": 13.7,
        "deliveryDate": "2020-10-28",
        "positionPrice": 3425.0,
        "positionQuantity": 250.0,
        "positionReference": "107.246"
      }
    },
    { … }
  ],
  "positionsTotal": 7017.15
}
Considerations
• Broad range of fairly special table layouts
• Tables might spread across multiple pages
• But mostly with some commonalities:
• Column boundaries indicated by aligned content
• Column titles use fairly recurrent terms
• Strategy of four steps
• Each can be rule-based or an ML component
Step 1: Detect table area
Approaches:
• Train on textual
and graphical input
• Rule-based
(layout-specific
keywords)
Step 2: Detect columns
Approaches:
• Train on textual
and graphical input
• Cluster left/right
x-coordinates
of tokens
• Configure exact
x-coordinates
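The clustering approach can be sketched with a simple one-dimensional gap clustering of left x-coordinates. The thresholds max_gap and min_support and the function names are assumptions for illustration; a real solution might cluster right edges as well.

```python
def cluster_coordinates(values, max_gap=3.0):
    """Group sorted coordinates; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def column_anchors(left_xs, min_support=2, max_gap=3.0):
    """Return the mean x of every cluster backed by at least min_support tokens."""
    return [
        round(sum(cluster) / len(cluster), 1)
        for cluster in cluster_coordinates(left_xs, max_gap)
        if len(cluster) >= min_support
    ]


# Left edges of tokens collected over several table lines;
# 210.0 is a stray token that does not align with anything.
left_xs = [50.0, 50.5, 49.8, 120.0, 121.0, 300.2, 300.0, 299.9, 210.0]
anchors = column_anchors(left_xs)
```

Requiring a minimum number of aligned tokens keeps isolated tokens from being mistaken for columns.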
Step 3: Detect positions (logical rows)
Approaches:
• Train on textual
and graphical input
• Rule-based
(layout-specific)
Handle exceptions
(e.g. values across
cell boundaries)
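A rule-based variant might look like the following sketch. It assumes a layout in which every position starts on a line whose first column holds a position number, and that lines have already been split into cells per column; both the rule and the name group_positions are illustrative, not taken from the slides.

```python
import re


def group_positions(lines, start_pattern=r"\d+"):
    """Group table lines into positions (logical rows).

    lines: list of lines, each a list of cell strings (one per column).
    A line whose first cell matches start_pattern opens a new position;
    other lines are continuation lines of the current position.
    """
    positions = []
    for cells in lines:
        if re.fullmatch(start_pattern, cells[0].strip()):
            positions.append([cells])
        elif positions:
            positions[-1].append(cells)
        # Lines before the first position (e.g. the header) are skipped.
    return positions


lines = [
    ["Pos.", "Article", "Description"],
    ["10", "341018-00", "Air/oil"],
    ["", "", "separator"],
    ["20", "…", "…"],
]
positions = group_positions(lines)
```

The start pattern is the layout-specific part; for other layouts it could test a different column or a different pattern.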
Step 4: For each position:
Extract target field values from cells
Select proper cell
by column label.
Approaches:
• Train on cell
content
• Rule-based
(partly layout-
specific)
Rule-based steps
• Very efficient approach (unless large
number of different layouts)
• General rule logic with configurable
parameters or regex patterns
• Default configuration
• Parameters/patterns only have to be
configured if deviating
• Logic can also be applied to unknown layouts
and produce some results
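The idea of a general rule logic with a default configuration and per-layout deviations could be organised as in this sketch. The field names are borrowed from the JSON example; the patterns, the layout name "supplier_a" and the helper names are invented for illustration.

```python
import re

# Default configuration, used for every layout unless overridden.
DEFAULT_CONFIG = {
    "column_labels": {
        "articleId": r"(?i)article\s*(no\.?|id)",
        "positionQuantity": r"(?i)qty|quantity",
    },
    "x_tolerance": 3.0,
}

# Only deviating parameters/patterns are configured per layout.
LAYOUT_OVERRIDES = {
    "supplier_a": {"column_labels": {"articleId": r"(?i)part\s*number"}},
}


def config_for(layout):
    """Merge the layout's overrides into a copy of the default configuration."""
    config = dict(DEFAULT_CONFIG)
    for key, value in LAYOUT_OVERRIDES.get(layout, {}).items():
        if isinstance(value, dict):
            config[key] = {**DEFAULT_CONFIG.get(key, {}), **value}
        else:
            config[key] = value
    return config


def match_columns(header_cells, config):
    """Map target fields to column indices by matching the column titles."""
    mapping = {}
    for index, text in enumerate(header_cells):
        for field, pattern in config["column_labels"].items():
            if re.search(pattern, text):
                mapping[field] = index
                break
    return mapping


known = match_columns(["Pos.", "Part number", "Qty"], config_for("supplier_a"))
unknown = match_columns(["Article No.", "Quantity"], config_for("never_seen"))
```

An unknown layout simply falls back to the defaults, which is how the same logic can still produce some results for layouts that were never configured.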
Challenge: Map table data to table context
Defined globally
Per position, overhead
Per position, inside position
One table per position
Summary and insights
• No general-purpose table extraction
solution exists on the market
• Do not try to build one
• Instead:
• Limit the solution to specific table types
• Use any known constraints to inform the
extraction algorithm
• Split task into smaller tasks
• Decide for each task independently
which method to use (an ML method or a rule-based one)
Karakun AG
Elisabethenanlage 25
4051 Basel
Switzerland
P +41 61 551 36 00
E info@karakun.com
W www.karakun.com
Dr. Holger Keibel
Product Manager
holger.keibel@karakun.com
Thank you!

制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
 
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaalmanuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
 

AI-SDV 2022: Extracting information from tables in documents Holger Keibel (Karakun, CH)

  • 1. www.karakun.com Extracting information from tables in documents Holger Keibel AI-SDV 2022, Vienna
  • 2. © 2022 Karakun AG | 2 Karakun
      • Services: Software engineering, UX design, consulting, training, maintenance & support
      • Platforms & Products: Efficiency-enhancing software platforms and ready-made products for selected use cases, e.g., the HIBU Platform for search and LT solutions
      • Experienced & Established Team: 60+ employees working in 4 locations in CH (HQ), DE and IN
      • Competences / Skills: State-of-the-art tech stack (Java, web & mobile), LT / AI / Big Data, focus on open-source software
      • Sustainable Custom Solutions: Customers from various industries, e.g., insurance, finance, life science, logistics
      • Community Engagement: Authors, speakers, lecturers at universities, Java Champions, contributors to open-source projects
  • 3. HIBU Platform: Efficient development of custom solutions in the areas of
      • Artificial Intelligence: Rule-based, statistical, neural
      • Intelligent Search: Full-text search, search filters, convenience functions
      • Text Analysis: Classification, information extraction, sentiment analysis, …
      • Document Automation: Content-driven, input management, smart actions
  • 4. Information extraction methods generally tuned to sequential textual data
      • Running text
          • Example approach: modern language models (transformer-based)
          • Graphical information (coordinates) can largely be ignored
      • Horizontal label-value pairs
          • Example approach: regular expressions
          • Graphical/layout information rarely needed
    [Example document: lorem ipsum running text followed by the horizontal label-value pairs "Lorem ipsum: dolor sit" and "Lorem ipsum 1500"]
  • 5. Vertical label-value pairs
      • Semi-sequential problem → more challenging
      • Some graphical/layout information needed: same x-coordinate on subsequent lines
      • Example approach:
          • regular expressions (to find labels)
          • plus mild use of graphical information (to extract the corresponding values)
          • plus possibly again regular expressions (to constrain values)
    Examples (labels on one line, values directly below):
      Date            Reference
      Oct 10, 2022    12345-67

      Order date      Your order
      Oct 5, 2022     ABC-789/5
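The approach on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Segment` type, the function name, the field names and all patterns are hypothetical, and the input is assumed to be text segments that already carry a left x-coordinate and a line index.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    text: str   # text of a layout segment (one or more words)
    x: float    # left x-coordinate of the segment
    line: int   # index of the text line, counted from the top


def extract_vertical_pairs(segments, label_patterns, value_patterns=None, x_tolerance=3.0):
    """Find labels by regex, then read the value below each label.

    A value is a segment on the next line whose x-coordinate is (almost)
    the same as the label's. An optional regex per field constrains it.
    """
    value_patterns = value_patterns or {}
    result = {}
    for field, label_re in label_patterns.items():
        for label in segments:
            if not re.fullmatch(label_re, label.text, flags=re.IGNORECASE):
                continue
            for candidate in segments:
                same_column = abs(candidate.x - label.x) <= x_tolerance
                if candidate.line != label.line + 1 or not same_column:
                    continue
                constraint = value_patterns.get(field)
                if constraint is None or re.fullmatch(constraint, candidate.text):
                    result[field] = candidate.text
    return result


# Hypothetical segments modelled on the first example of the slide.
segments = [
    Segment("Date", 50.0, 0), Segment("Reference", 200.0, 0),
    Segment("Oct 10, 2022", 50.5, 1), Segment("12345-67", 200.0, 1),
]
pairs = extract_vertical_pairs(
    segments,
    label_patterns={"date": r"(order )?date", "reference": r"reference|your order"},
    value_patterns={"reference": r"[\w/-]+"},
)
print(pairs)
```

The tolerance on the x-coordinate is the "mild use of graphical information": without it, small alignment differences between label and value would break the match.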
  • 6. Information extraction from tables
  • 7. Table extraction challenges
      • How are row boundaries encoded? Lines, spacing, aligned content, …
      • How are column boundaries encoded? Lines, spacing, aligned content, …
      • Merged table cells
      • …
      • General lesson learned: Very difficult to solve by a general-purpose table extraction solution
      • Better: Limit the solution to specific table types → use any known constraints in the algorithm
  • 8. Use case 1
      • Land certificates (in Germany)
      • Only scanned documents
  • 9. Land certificates (in Germany)
  • 10. Detect tables. Step 1: Detect graphical elements (red) vs. free text elements (blue), using LAREX (Reul et al., 2017). Indicators:
      • pixel density
      • border lines
  • 11. Detect tables. Step 2: Straighten lines (bounding boxes). This and the subsequent steps are performed with OpenCV (https://opencv.org/).
  • 12. Detect tables. Step 3: For each detected table, cut out the table image.
  • 13. Analyze table structure. Step 4: Blur image to smoothen and repair lines
  • 14. Analyze table structure. Step 5: Invert colors
  • 15. Analyze table structure. Step 6: Binarize image to increase width of lines
  • 16. Analyze table structure. Step 6: Detect horizontal lines
  • 17. Analyze table structure. Step 7: Extend horizontal lines by means of dilation
  • 18. Analyze table structure. Step 8: The same for vertical lines
  • 19. Analyze table structure. Step 9: Combine vertical and horizontal lines to a grid and derive coordinates of cells
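The slides perform these steps with OpenCV. The sketch below deliberately avoids any library and only illustrates the idea behind line detection and step 9 on a tiny, already binarized image (1 = ink). All function names are hypothetical, and the row/column projection used here is a simplified stand-in for the morphological operations: it only finds lines that span the whole table.

```python
def find_lines(binary, axis, min_fill=0.8):
    """Return indices of rows (axis=0) or columns (axis=1) that are mostly ink."""
    lines = binary if axis == 0 else list(zip(*binary))
    return [i for i, line in enumerate(lines) if sum(line) >= min_fill * len(line)]


def merge_adjacent(indices):
    """Collapse runs of neighbouring indices (thick lines) into their first index."""
    merged = []
    previous = None
    for i in indices:
        if previous is None or i > previous + 1:
            merged.append(i)
        previous = i
    return merged


def grid_cells(ys, xs):
    """Combine line positions into cells: (row, col, x0, y0, x1, y1)."""
    return [
        (r, c, xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(len(ys) - 1)
        for c in range(len(xs) - 1)
    ]


# Tiny binarized table image: 2 x 2 cells separated by one-pixel lines.
image = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
ys = merge_adjacent(find_lines(image, axis=0))  # horizontal lines
xs = merge_adjacent(find_lines(image, axis=1))  # vertical lines
cells = grid_cells(ys, xs)
for cell in cells:
    print(cell)
```

On real scans the lines are broken and skewed, which is why the slides blur, binarize and dilate before combining the lines into a grid.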
  • 20. Analyze table structure. Step 10: Submit entire table to OCR engine
  • 21. Analyze table structure. Step 11: Parse OCR result (hOCR) to assign words to table cells
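One simple way to assign words to cells is to test in which cell the center of each word's bounding box falls. The sketch below assumes that the hOCR output has already been parsed into `(text, x0, y0, x1, y1)` tuples in reading order and that the cell coordinates come from the grid of step 9. The function name and the sample coordinates are hypothetical, and the slides do not state which assignment rule was actually used.

```python
def assign_words_to_cells(words, cells):
    """Assign each word to the cell that contains the word's center point.

    words: iterable of (text, x0, y0, x1, y1) in reading order
    cells: dict mapping (row, col) to (x0, y0, x1, y1)
    """
    table = {key: [] for key in cells}
    for text, x0, y0, x1, y1 in words:
        center_x, center_y = (x0 + x1) / 2, (y0 + y1) / 2
        for key, (cx0, cy0, cx1, cy1) in cells.items():
            if cx0 <= center_x < cx1 and cy0 <= center_y < cy1:
                table[key].append(text)
                break
    return {key: " ".join(parts) for key, parts in table.items()}


# Hypothetical cell grid and OCR words.
cells = {(0, 0): (0, 0, 100, 30), (0, 1): (100, 0, 200, 30)}
words = [
    ("Musterstraße", 5, 5, 70, 20),
    ("123", 72, 5, 95, 20),
    ("234/15", 110, 5, 160, 20),
]
content = assign_words_to_cells(words, cells)
print(content)
```

Using the center point rather than the full box tolerates words that slightly overlap a cell border.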
  • 22. Analyze table structure. Step 12: Resolve merged cells to derive structured representation
      [Lfd.Nr. …] [Bish. …] [Bezeichnung …] [Bezeichnung …] [Bezeichnung …] [Größe] [Größe] [Größe]
      [Lfd.Nr. …] [Bish. …] [a) Gemarkung …] [a) Gemarkung …] [c) Wirtschaftsart …] [ha] [a] [m²]
      [Lfd.Nr. …] [Bish. …] [b) Karte] [Flurstück] [c) Wirtschaftsart …] [ha] [a] [m²]
      [1] [2] [3] [3] [3] [4] [4] [4]
      [1] [-] [123.45] [234/15] [Musterstraße 123\nGebäude- und Freifläche] [] [5] [94]
      [2] [-] [123.45] [234/18] [Musterstraße 123a\nGebäude- und Freifläche] [] [3] [41]
      [3] [-] [123.45] [137/8] [Musterstraße 58\nGebäude- und Freifläche] [] [11] [70]
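In the representation above, the content of a merged cell is repeated in every grid cell it spans. A minimal sketch of that expansion, assuming each detected cell is given as a hypothetical `(row, col, rowspan, colspan, text)` tuple and using a shortened version of the header:

```python
def expand_merged_cells(cells, n_rows, n_cols):
    """Copy the text of every (possibly merged) cell into all grid cells it spans."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text
    return grid


# Shortened header: "Lfd.Nr." spans three rows, "Größe" spans three columns.
header_cells = [
    (0, 0, 3, 1, "Lfd.Nr."),
    (0, 1, 1, 3, "Größe"),
    (1, 1, 2, 1, "ha"),
    (1, 2, 2, 1, "a"),
    (1, 3, 2, 1, "m²"),
]
grid = expand_merged_cells(header_cells, n_rows=3, n_cols=4)
for row in grid:
    print(row)
```

After the expansion every row has the same number of cells, so a data cell can be interpreted by reading the header cells in the same column.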
  • 23. Use case 2
      • Order confirmations + invoices
      • Digitally generated documents (PDFs)
      • Focus: Known table layouts
  • 24. Table extraction. Example of the structured output:
      {
        "freightCosts": 145.0,
        "orderDate": "2020-10-09",
        "orderId": "12345",
        "packagingCosts": 11.8,
        "positions": [
          { … },
          {
            "values": {
              "articleDescription": "Air/oil separator",
              "articleId": "341018-00",
              "articlePrice": 13.7,
              "deliveryDate": "2020-10-28",
              "positionPrice": 3425.0,
              "positionQuantity": 250.0,
              "positionReference": "107.246"
            }
          },
          { … }
        ],
        "positionsTotal": 7017.15
      }
  • 25. Considerations
      • Broad range of fairly special table layouts
      • Tables might spread across multiple pages
      • But mostly with some commonalities:
          • Column boundaries indicated by aligned content
          • Column titles use fairly recurrent terms
      • Strategy of four steps
          • Each can be rule-based or an ML component
  • 26. Step 1: Detect table area. Approaches:
      • Train on textual and graphical input
      • Rule-based (layout-specific keywords)
  • 27. Step 2: Detect columns. Approaches:
      • Train on textual and graphical input
      • Cluster left/right x-coordinates of tokens
      • Configure exact x-coordinates
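The clustering approach for column detection can be illustrated with a simple one-dimensional gap-based clustering. This is a sketch under assumptions, not the method used in the talk: the function names, the gap threshold and the minimum cluster size are hypothetical, and only left x-coordinates are clustered here.

```python
def cluster_coordinates(values, max_gap=5.0):
    """Group sorted coordinates; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def column_starts(left_xs, max_gap=5.0, min_support=2):
    """Return the mean x of every cluster supported by at least min_support tokens."""
    return [
        round(sum(cluster) / len(cluster), 1)
        for cluster in cluster_coordinates(left_xs, max_gap)
        if len(cluster) >= min_support
    ]


# Hypothetical left x-coordinates of tokens collected over several table rows.
left_xs = [50.0, 50.4, 49.8, 180.0, 181.0, 320.5, 320.5, 95.0]
starts = column_starts(left_xs)
print(starts)
```

The single token at x = 95.0 does not recur on other rows and is therefore not treated as a column start. Right-aligned columns (typically numbers) would be found the same way from the right x-coordinates.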
  • 28. Step 3: Detect positions (logical rows). Approaches:
      • Train on textual and graphical input
      • Rule-based (layout-specific)
    Handle exceptions (e.g. values across cell boundaries)
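One possible layout-specific rule for detecting positions, given purely as an assumption for illustration: a new position starts on every physical row whose first cell holds a position number, and all following rows without one are continuation lines. The function name, the pattern and the sample rows are hypothetical.

```python
import re


def group_positions(rows, start_pattern=r"\d+\.?"):
    """Group physical table rows into positions (logical rows).

    A row starts a new position if its first cell matches start_pattern.
    Other rows are appended to the current position. Rows before the
    first position (e.g. the header) are ignored.
    """
    positions = []
    for cells in rows:
        if re.fullmatch(start_pattern, cells[0].strip()):
            positions.append([cells])
        elif positions:
            positions[-1].append(cells)
    return positions


# Hypothetical rows: the second row continues the description of position 1.
rows = [
    ["Pos.", "Article no.", "Description"],
    ["1", "341018-00", "Air/oil separator"],
    ["", "", "for compressor series X"],
    ["2", "341020-00", "Filter"],
]
positions = group_positions(rows)
print(len(positions), "positions")
```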
  • 29. Step 4: For each position, extract target field values from cells. Select the proper cell by column label. Approaches:
      • Train on cell content
      • Rule-based (partly layout-specific)
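Selecting cells by column label can be sketched with regular expressions over the column titles, relying on the observation that the titles use fairly recurrent terms. The patterns, function names and sample header below are assumptions for illustration. The target field names follow the example output shown earlier, and values are kept as raw strings.

```python
import re

# Hypothetical patterns for recurrent column titles.
COLUMN_LABEL_PATTERNS = {
    "articleId": r"art(icle)?\.?\s*(no|nr|id)",
    "articleDescription": r"description|bezeichnung",
    "positionQuantity": r"qty|quantity|menge",
    "articlePrice": r"unit\s+price|price/unit",
    "positionPrice": r"total|amount",
}


def map_columns(header_cells, patterns):
    """Map each target field to the index of the first column whose title matches."""
    mapping = {}
    for index, title in enumerate(header_cells):
        for field, pattern in patterns.items():
            if field not in mapping and re.search(pattern, title, flags=re.IGNORECASE):
                mapping[field] = index
                break
    return mapping


def extract_fields(row_cells, mapping):
    """Read the raw cell text of every mapped target field from one position."""
    return {field: row_cells[index].strip() for field, index in mapping.items()}


header = ["Pos.", "Article no.", "Description", "Qty", "Unit price", "Total"]
row = ["2", "341018-00", "Air/oil separator", "250", "13.70", "3,425.00"]
mapping = map_columns(header, COLUMN_LABEL_PATTERNS)
fields = extract_fields(row, mapping)
print(fields)
```

Parsing the raw strings into numbers and dates is a separate, locale-dependent step and is left out here.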
  • 30. Rule-based steps
      • Very efficient approach (unless large number of different layouts)
      • General rule logic with configurable parameters or regex patterns
      • Default configuration
          • Parameters/patterns only have to be configured if deviating
      • Logic can also be applied to unknown layouts and produce some results
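The idea of general rule logic with a default configuration and layout-specific deviations can be sketched as follows. All keys, patterns and layout names are hypothetical. The rule shown is a keyword-based table area detection: lines between a start pattern and an end pattern form the table body.

```python
import re

# Default configuration, used for every layout unless overridden.
DEFAULT_CONFIG = {
    "table_start_pattern": r"^pos\.?\s",
    "table_end_pattern": r"^total\b",
}

# Only deviating parameters are configured per layout.
LAYOUT_OVERRIDES = {
    "supplier_a": {"table_end_pattern": r"^sum\b"},
}


def config_for(layout_id):
    """Merge the defaults with the overrides of a layout (if there are any)."""
    return {**DEFAULT_CONFIG, **LAYOUT_OVERRIDES.get(layout_id, {})}


def find_table_lines(lines, config):
    """Return the lines between the table start line and the table end line."""
    inside, body = False, []
    for line in lines:
        if not inside:
            inside = re.search(config["table_start_pattern"], line, re.IGNORECASE) is not None
            continue
        if re.search(config["table_end_pattern"], line, re.IGNORECASE):
            break
        body.append(line)
    return body


lines = [
    "Order confirmation 12345",
    "Pos. Article no. Qty",
    "1 341018-00 250",
    "2 341020-00 10",
    "Sum 7017.15",
]
known = find_table_lines(lines, config_for("supplier_a"))
unknown = find_table_lines(lines, config_for("unknown_layout"))
print(known)
print(unknown)
```

For the unknown layout the defaults still find the table start but miss the end line, so the result is usable but imperfect, which matches the last point on the slide.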
  • 31. Challenge: Map table data to table context. Variants shown:
      • Defined globally
      • Per position, overhead
      • Per position, inside position
      • One table per position
  • 32. Summary and insights
      • No general-purpose table extraction solution exists on the market
      • Do not try to build one
      • Instead:
          • Limit the solution to specific table types
          • Use any known constraints to inform the extraction algorithm
          • Split task into smaller tasks
          • Decide for each task independently which method to use (some ML method? rule-based?)
  • 34. Thank you! Dr. Holger Keibel, Product Manager, holger.keibel@karakun.com. Karakun AG, Elisabethenanlage 25, 4051 Basel, Switzerland. P: +41 61 551 36 00, E: info@karakun.com, W: www.karakun.com