Learning from similarity and information extraction from structured documents
It is a human tendency to formulate assumptions when analyzing the difficulty of information extraction from documents. We automatically assume it is easier to extract information in the form of named entities from a set of similar documents. Nonetheless, similar-looking documents have a distinct set of problems: the named entities in these document types vary in size, in the number of characters and words, and in height, width, and location. These variations cannot be handled using heuristics or pre-trained language models.
When we have exhausted all the modeling options, we look for a new direction. The methods presented here are an exploration of deep learning techniques to improve information extraction results [4]. To evaluate the techniques, a dataset with more than 25,000 documents has been compiled, anonymized, and published. It is already known that convolutions, graph convolutions, and self-attention can work together and exploit all the information present in a structured document. Here, we examine deep learning approaches built on siamese networks, the concept of similarity, one-shot learning, and context/memory awareness.
Information extraction is not a new problem. It starts with a collection of texts, then transforms these into information that is more readily digested and analyzed. It isolates relevant text fragments, extracts pertinent information from the fragments, and then pieces together the targeted information in a coherent framework [1]. The relevant collection of texts for this study is the content within business documents such as invoices, pro forma invoices, and debit notes.
Figure: Example of an invoice and an extraction system together with its output.
A classical heuristic way to improve a target metric is to provide more relevant information to the network. The idea of providing more information is fundamental, even for simpler templating techniques, since the problem cannot necessarily be solved using templates alone. The research question is whether a similarity-based mechanism, in its various model implementations, can improve an existing solution. In our background review, we assessed that none of the existing methods is well-suited for working with structured documents (like invoices), since such documents generally do not have any fixed layout, language, caption set, delimiters, or fonts. For example, invoices vary across countries, companies, and departments and change over time. To retrieve any information from a structured document, it must be understood.
One-shot learning and similarity
In one-shot learning, we are usually able to correctly identify classes by comparing them with already-known data. One-shot learning works well when the concept of similarity is utilized. For similarity to work, two types of data must be recognized – unknown and known. For the known data, target values are known to the method and/or the model. To classify an unknown input, the usual practice is to assign it the same class as the most similar known input. In this work, a siamese network is used to learn similarity, which means that similar documents are retrieved and compared; the retrieval is performed using nearest neighbor search in the embedding space.
The loss used for similarity learning is called triplet loss because it is applied to a triplet of inputs (R reference, P positive, N negative) for each data point:

L(R, P, N) = max( ||f(R) − f(P)||² − ||f(R) − f(N)||² + α, 0 )

where α is the margin between the positive and negative classes and f is the model function mapping inputs to the embedding space (with the Euclidean norm). Note the hinge at zero: the loss vanishes once the negative example is at least α farther from the reference than the positive one.
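As a concrete illustration, here is a minimal NumPy sketch of this loss, assuming f has already mapped each input to an embedding vector (the names and margin value are illustrative):

import numpy as np

def triplet_loss(f_r, f_p, f_n, alpha=1.0):
    """Triplet loss on embeddings of a reference (R), positive (P) and
    negative (N) example, using the squared Euclidean norm."""
    pos_dist = np.sum((f_r - f_p) ** 2)  # distance to the positive example
    neg_dist = np.sum((f_r - f_n) ** 2)  # distance to the negative example
    # Hinge at zero: no loss once the negative is at least `alpha`
    # farther from the reference than the positive.
    return max(pos_dist - neg_dist + alpha, 0.0)

# Toy usage: the positive embedding is much closer, so the loss is zero.
r = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([1.0, 1.0])
print(triplet_loss(r, p, n))  # 0.0, the margin is already satisfied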
Methodology
The main unit of our scope is every individual word on every
individual page of each document. For the scope of this work, we
define a word as a text segment that is separated from the rest of
the text by (at least) a white space, and we will not consider any
other text segmentation.
Inputs and outputs: Conceptually, a document's entire page is
considered the input to the whole system. Each word – together
with its positional information (or word-box for short) – is to be
classified into zero, one, or more target classes as the output. We
are dealing with a multi-label problem with 35 possible classes in
total.
The dataset and the metric: Overall, we have a dataset with
25,071 PDF documents, totaling 35,880 pages. The documents
are of various vendors, layouts, and languages, and are split into
a training, validation, and test set at random (80 % / 10 % / 10 %).
A validation set is used for model selection and early stopping.
The metric is computed by first calculating the F1 score of each class and then aggregating the scores by the micro-averaging principle.
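For clarity, a small sketch of the micro-averaged F1 computation over multi-label predictions; scikit-learn is used here purely for illustration, as the text does not name a library:

import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label setup: rows are word-boxes, columns are classes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Micro-averaging pools true/false positives and false negatives over
# all classes before computing a single F1, so frequent classes dominate.
print(f1_score(y_true, y_pred, average="micro"))  # 0.75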
The architecture used in the paper is called a simple data
extraction model [2].
The features of each word-box are:
Geometrical – the normalized (left, top, right, bottom) coordinates, which are also used to construct the graph for the graph CNN and the reading order.
Textual – the counts of all characters and digits, the length of the word, and the counts of the first two and last two characters; trainable word features are built from one-hot encoded, deaccented, lowercase characters.
Image – each word-box is cropped from the page image.
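To make the textual features concrete, here is a hedged sketch of how they might be assembled for one word; the exact character set, counts, and encoding in the paper may differ:

import unicodedata
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"  # illustrative vocabulary

def one_hot(ch):
    vec = np.zeros(len(CHARS), dtype=np.float32)
    idx = CHARS.find(ch)
    if idx >= 0:
        vec[idx] = 1.0
    return vec

def textual_features(word):
    # Deaccent and lowercase the word first.
    norm = unicodedata.normalize("NFKD", word.lower())
    norm = "".join(c for c in norm if not unicodedata.combining(c))
    counts = np.array([
        len(norm),                        # length of the word
        sum(c.isdigit() for c in norm),   # count of digits
        sum(c.isalpha() for c in norm),   # count of letters
    ], dtype=np.float32)
    # One-hot encode the first two and last two characters.
    edge_chars = (norm[:2] + norm[-2:]).ljust(4)
    edges = np.concatenate([one_hot(c) for c in edge_chars])
    return np.concatenate([counts, edges])

print(textual_features("Číslo123").shape)  # (3 + 4 * 36,) = (147,)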
The five inputs – the downsampled picture of the page, the features of all word-boxes, 40 one-hot encoded characters per word-box, neighbor ids, and a position id given by the geometric ordering – are combined to generate an embedding vector. The transformer positional-encoding approach is applied to the position embedding. The picture is stacked, max-pooled, and morphologically dilated to generate 32 float features per word-box. Before any attention, dense, or graph convolution layers are applied, all the features are simply concatenated.
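A minimal TensorFlow sketch of this fusion step, assuming the per-word-box feature tensors already exist; all names and dimensions are illustrative rather than the paper's exact values:

import tensorflow as tf

num_boxes = 128  # word-boxes on one page (illustrative)

# Per-word-box feature tensors for a batch of one page.
geometry = tf.random.uniform((1, num_boxes, 4))     # normalized l, t, r, b
textual = tf.random.uniform((1, num_boxes, 147))    # counts and edge chars
chars = tf.random.uniform((1, num_boxes, 40 * 36))  # 40 one-hot characters
image = tf.random.uniform((1, num_boxes, 32))       # pooled, dilated crops

# Simple concatenation before any attention / dense / graph-conv layers,
# followed by a dense projection into the shared 640-d feature space.
features = tf.concat([geometry, textual, chars, image], axis=-1)
embedded = tf.keras.layers.Dense(640, activation="relu")(features)
print(embedded.shape)  # (1, 128, 640)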
The basic building block definition ends with each word-box embedded into a feature space of a specified dimension (640 unless stated otherwise in a specific experiment). For the “Simple data extraction model”, the following layer is a sigmoidal layer with binary cross-entropy as the loss function. This is a standard setting, since the output of this model is meant to solve a multi-class, multi-label problem.
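A Keras sketch of that output head, assuming the 640-dimensional word-box embeddings and the 35 target classes described above (a minimal sketch, not the original implementation):

import tensorflow as tf

embedding_dim, num_classes = 640, 35

# A variable number of word-boxes per page, each already embedded.
inputs = tf.keras.Input(shape=(None, embedding_dim))
# One sigmoid per class: a word-box may belong to zero or more classes.
outputs = tf.keras.layers.Dense(num_classes, activation="sigmoid")(inputs)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()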
The learning framework spans three steps: 1) the system keeps a notion of already-known documents in a reasonably sized set; 2) when a “new” or “unknown” page is presented to the system, it searches for the most similar page (given any reasonable algorithm) among the known pages; 3) the model is allowed to use all the information from both pages (and “learn from similarity”) to make the prediction.
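Expressed as a hedged Python sketch, the framework amounts to the loop below; embed_page and extract are hypothetical stand-ins for the embedding and extraction models:

import numpy as np

def predict_with_similarity(new_page, known_pages, known_embeddings,
                            embed_page, extract):
    # Step 2: embed the unknown page and find the nearest known page
    # by Euclidean distance in the embedding space.
    query = embed_page(new_page)
    dists = np.linalg.norm(known_embeddings - query, axis=1)
    nearest = known_pages[int(np.argmin(dists))]
    # Step 3: the model sees both pages and "learns from similarity".
    return extract(new_page, nearest)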
Nearest neighbor definition: For one-shot learning to work on a
new and unknown page (sometimes denoted reference), the
system always needs to have a known (also denoted as similar or
nearest) document with familiar annotations at its disposal.
The embedding for the nearest neighbor search [3] is prepared by removing the last layer of a document classification model and adding a simple pooling layer. The modified model outputs 4850 float features based only on the image input. These features are then assigned to each page as its embedding. The embeddings are held fixed during training and inference and are computed only once in advance.
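As an illustration, a hedged Keras sketch of deriving such an embedding model from an image-only page classifier; the classifier itself, the layer sizes, and the resulting embedding dimension are invented for the example and do not match the paper's 4850 features:

import tensorflow as tf

# Stand-in for a trained image-only page classification model.
inputs = tf.keras.Input(shape=(624, 880, 1))
x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)
x = tf.keras.layers.Conv2D(50, 3, strides=2, activation="relu")(x)
logits = tf.keras.layers.Dense(10)(tf.keras.layers.Flatten()(x))
classifier = tf.keras.Model(inputs, logits)

# Drop the final classification layer and pool the last feature map into
# a fixed-size vector that serves as the page embedding.
feature_map = classifier.layers[-3].output  # output of the last conv layer
embedding = tf.keras.layers.GlobalAveragePooling2D()(feature_map)
embedder = tf.keras.Model(classifier.input, embedding)

# Computed once per page, then held fixed for training and inference.
page_embedding = embedder.predict(tf.random.uniform((1, 624, 880, 1)))
print(page_embedding.shape)  # (1, 50) in this toy version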
Baselines: We do not have any external benchmark for comparison. Therefore, several models were prepared as baselines:
Simple data extraction model – without any access to the nearest known page.
Copypaste – overlays the target classes from the nearest known page's word-boxes (a sketch follows after this list). It provides a counterpart to the triplet loss and pairwise classification architectures.
Oracle – always correctly predicts the nearest known page's classes.
Fully linear model without picture features – provides a counterpart to the query-answer approach.
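A minimal sketch of what a Copypaste-style baseline could look like, matching word-boxes by the distance between their centers; the actual overlay rule in the paper may differ:

import numpy as np

def copypaste_predict(ref_boxes, near_boxes, near_labels):
    # ref_boxes: (R, 4) and near_boxes: (K, 4) normalized l, t, r, b;
    # near_labels: (K, num_classes) multi-label annotations.
    ref = np.asarray(ref_boxes, dtype=np.float32)
    near = np.asarray(near_boxes, dtype=np.float32)
    ref_centers = (ref[:, :2] + ref[:, 2:]) / 2
    near_centers = (near[:, :2] + near[:, 2:]) / 2
    # For every reference word-box, copy the labels of the geometrically
    # closest annotated word-box on the nearest known page.
    dists = np.linalg.norm(ref_centers[:, None, :] - near_centers[None, :, :],
                           axis=-1)
    return np.asarray(near_labels)[dists.argmin(axis=1)]  # (R, num_classes)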
Model architectures: Every one of the architectures is trained as a whole; no pre-training or transfer learning takes place, and every model is always implemented as a single computation graph in TensorFlow.
Triplet Loss architecture – using siamese networks canonically
with triplet loss.
Pairwise classification – using a trainable classifier pairwise over all combinations of word-box features from the reference and nearest pages.
Query answer architecture (or “QA” for short) – using the attention transformer as an answering machine to the question of which word-box's class is the most similar.
The filtering mechanism keeps only the annotated word-boxes from the nearest page. The tiling mechanism takes two sequences – first, the sequence of reference-page word-boxes, and second, the sequence of filtered nearest-page word-boxes – and produces a bipartite matrix pairing every element of one with every element of the other.
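A small TensorFlow sketch of such a tiling step, producing the bipartite grid of concatenated feature pairs (all shapes illustrative):

import tensorflow as tf

num_ref, num_near, dim = 120, 80, 640  # word-boxes per page (illustrative)
ref = tf.random.uniform((num_ref, dim))    # reference-page word-box features
near = tf.random.uniform((num_near, dim))  # filtered nearest-page features

# Tile the two sequences against each other so that every
# (reference, nearest) pair forms one cell of a bipartite matrix.
ref_tiled = tf.repeat(ref[:, None, :], num_near, axis=1)   # (120, 80, 640)
near_tiled = tf.repeat(near[None, :, :], num_ref, axis=0)  # (120, 80, 640)
pairs = tf.concat([ref_tiled, near_tiled], axis=-1)        # (120, 80, 1280)
print(pairs.shape)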
The model selected in each experimental run was always the one that performed best on the validation set in terms of loss. The basic building blocks present in every architecture were usually set to produce a feature space of dimensionality 640.
Table 1: The results can be interpreted as the model reaching its maximal reasonable complexity at one transformer layer and a smaller feature space.
Table 2: The low score indicates that it is not enough to simply overlay a different, similar known page over the unknown page, as the dataset does not contain completely identical layouts.
Table 3: Only roughly 60% of word-boxes have a counterpart (class-wise) in the found nearest page.
Table 4: The linear baseline's performance justifies the progression from the basic Copypaste model towards trainable architectures with similarity.
Table 5: Pairwise classification performed better than simple Copypaste, but still worse than the linear architecture.
Table 6: Verifies that all of the visual, geometric, and textual features are important for good-quality results.
Conclusion:
We have verified that all possible parts of the architecture are
needed in the training and prediction of the Query Answer model
to achieve the highest score.
What is the effect of the size of the datasets? By exploring the
effect of the size of the training dataset and/or the search space
for the nearest pages, we could ask if (and when) the model
needs to be retrained and determine what a sample of a
difficult-to-extract document looks like.
How can generalization be improved? Currently, the method generalizes to unseen documents. Ideally, we would also want it to generalize to new classes of words, since at present the model needs to be retrained whenever a new class is to be detected and extracted.
The model fits on just one consumer-grade GPU and trains from scratch in at most four days using only one CPU process.
References:
[1] Cowie, J., Lehnert, W.: Information extraction. Commun. ACM
39, 80–91 (1996)
[2] Holecek, M., Hoskovec, A., Baudis, P., Klinger, P.: Table
understanding in structured documents. In: 2019 International
Conference on Document Analysis and Recognition Workshops
(ICDARW), vol. 5, pp. 158–164 (2019).
[3] Burkov, A.: Machine Learning Engineering. True Positive
Incorporated (2020)
[4] Holecek, M.: Learning from similarity and information extraction from structured documents (2020)
