According to Gartner, "The market for document capture, extraction, and processing is highly fragmented. Data and analytics leaders should use this research to understand the process flow and differentiated capabilities offered by intelligent document processing solutions". Gartner's recently released "Infographic: Understand Intelligent Document Processing" covers these 6 critical flows in IDP.
1. Capture or Ingestion
2. Document Preprocessing
3. Document Classification
4. Data Extraction
5. Validation and Feedback Loop
6. Integration
This is the fourth post in the series exploring Data Validation and Feedback Loop.
Understanding IDP: Data Validation and Feedback Loop
When it comes to IDP systems, one of the key evaluation
parameters is the accuracy they offer. Beyond the quality of the
extraction process itself, IDP systems tap into external signals to
improve accuracy. Data validation against an external source is one
of many such signals.
When you think of these signals, try to draw a parallel to how
modern-day GPS location systems work. You may know that GPS
receivers measure their distance from three or more satellites and
apply a technique called trilateration (often loosely called
triangulation) to find an intersection point. It is impossible to
accurately pinpoint the location of the subject with a signal from
just one satellite.
To relate to this problem, stick out your arm, raise a finger, and
close one eye. You will notice that with one eye closed, you lose
your sense of distance. You cannot really tell how far away your
finger is. Getting visual signals from both eyes gives you a true
sense of depth. Similarly, GPS systems use three or more signals to
accurately place the subject's location. Opening an IDP
conversation with satellites is quite a stretch, but the point to note
here is that more signals lead to higher accuracy. In the same way,
data validation and feedback loops are techniques that modern IDP
systems use to improve accuracy and thereby mature much faster.
An efficient data validation system can lift your IDP
accuracy by 15 to 20%. Let's see how.
Data Validation
If IDP is the best option to automate data processing, what does
data validation add to it? Data validation, as the name suggests, is
the process of checking the extracted data on multiple points of
accuracy, such as whether the right data is being extracted and
whether the extracted data itself is accurate. A typical use case for
data validation is exception handling, such as weeding out documents
that are out of scope. For example, you may have a list of vendors
and want to extract documents only from those vendors, or a receipt
may be mixed in among the invoices you are processing and need to
be disregarded. If you experience these or similar cases,
then you need data validation.
Let us look at a scenario for data validation. Imagine you are
extracting information from loan documents. Borrowers have
taken loans from different banks, and you want to check each lender
against the list of approved lenders in your system to differentiate
between approved and unapproved lenders. In this case, you
implement data validation techniques in which the IDP system usually
connects to the third-party database through APIs, or to a copy of
the data in the IDP vendor's cloud system that is synced daily or
periodically from the third-party database. Let me simplify this. You
are extracting a loan document where the borrower has taken a loan
from Bank of America, and Bank of America is one of your approved
lenders. Then, with data validation, you can attach an identifier to
it, for example by listing the lender as a lien-holder in the
extraction results.
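The lender check above can be sketched in a few lines of Python. This is a minimal illustration, not Infrrd's implementation: the `APPROVED_LENDERS` set, the `normalize` helper, and the shape of the returned record are all assumptions made for the example, and in practice the approved list would come from the third-party database or its periodically synced copy.

```python
import re

# Hypothetical approved-lender list; in practice this would be fetched
# through the third-party API or read from the periodically synced copy.
APPROVED_LENDERS = {"bank of america", "wells fargo", "chase"}


def normalize(name: str) -> str:
    """Lower-case a name and collapse punctuation and extra whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.casefold())
    return " ".join(cleaned.split())


def validate_lender(extracted_name: str) -> dict:
    """Tag an extracted lender name as approved or unapproved."""
    approved = normalize(extracted_name) in APPROVED_LENDERS
    return {
        "lender": extracted_name,
        "approved": approved,
        # Only approved lenders are listed as lien-holders in the result.
        "lien_holder": extracted_name if approved else None,
    }
```

Normalizing before the lookup keeps small extraction differences, such as casing or a trailing comma, from turning an approved lender into an unapproved one.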
Data validation is one of the key factors behind a sharp increase
in extraction accuracy, which means your IDP models mature much
sooner. Let me give you a ballpark figure. After analyzing the
extraction results of our customers for the past few months, we have
observed that Infrrd's data validation algorithms immediately raise
accuracy levels by around 10 percentage points. It means if the IDP
system was providing 80% accuracy without data validation, it may
give 90% accuracy or more with data validation.
There are different types of validation. The most common ones are:
Pattern-based validation: Here, the data is validated based on
patterns. For example, the vehicle identification number (VIN),
which is a unique identifier for a car, is a combination of digits and
capital letters and usually consists of 17 characters. This number
has a pattern: the first three characters represent the
manufacturer, characters 4 to 8 may be alphanumeric and describe
the vehicle, and so on. In this case, pattern-based data
validation detects and corrects extraction errors in the VIN,
including tricky ones, such as the number 1 and the capital
letter I getting interchanged.
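As a rough sketch of pattern-based validation, the Python snippet below checks a VIN against the 17-character pattern and repairs the 1-versus-I confusion mentioned above. It relies on the fact that standard VINs never contain the letters I, O, or Q. The `correct_vin` function and the `OCR_FIXES` mapping are hypothetical names, and mapping O and Q to 0 is an assumption about likely OCR confusions rather than a documented rule.

```python
import re
from typing import Optional

# A VIN is 17 characters: digits and capital letters, never I, O, or Q.
VIN_PATTERN = re.compile(r"[A-HJ-NPR-Z0-9]{17}")

# Assumed OCR confusions: letters that cannot occur in a VIN are mapped
# to the digit they most resemble.
OCR_FIXES = str.maketrans({"I": "1", "O": "0", "Q": "0"})


def correct_vin(raw: str) -> Optional[str]:
    """Return a cleaned VIN, or None if it still fails the pattern."""
    candidate = raw.strip().upper().translate(OCR_FIXES)
    return candidate if VIN_PATTERN.fullmatch(candidate) else None
```

A value that still fails the pattern after correction returns `None`, which a pipeline could treat as a signal to send the field for human review.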
Dictionary-based validation: This is done against a set of data in
the system. For example, you can verify that the extracted invoice
approver name matches the name of an approver in the IDP
system. Similarly, an extracted currency code can be checked
against the list of valid codes, so that dictionary-based validation
detects and corrects it.
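One simple way to picture dictionary-based correction is fuzzy matching against the known values. The sketch below uses `difflib.get_close_matches` from the Python standard library. The `VALID_CURRENCIES` list, the function name, and the 0.6 similarity cutoff are assumptions chosen for illustration, and a production system may well use a different matching technique.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

# Hypothetical dictionary of currency codes accepted by the system.
VALID_CURRENCIES = ["USD", "EUR", "GBP", "INR", "SGD"]


def correct_with_dictionary(value: str, dictionary: Sequence[str],
                            cutoff: float = 0.6) -> Optional[str]:
    """Return the exact or closest dictionary entry, or None if none is close."""
    value = value.strip().upper()
    if value in dictionary:
        return value
    # Fall back to the single most similar entry above the cutoff.
    matches = get_close_matches(value, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With short values such as three-letter codes, a single wrong character already lowers the similarity a lot, so the cutoff is worth tuning per field.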
Context-based validation: This is done where the same value is
relevant in two contexts. For example, you are extracting an
insurance document in which two fields, say collision deductible and
comprehensive deductible, always have the value 500. In such cases,
the ML models may confuse the two contexts because the values are
the same and may learn incorrectly, which may eventually cause a dip
in accuracy. Context-based validation is the way to tell apart such
contexts that share similar or identical values.
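To make the idea concrete, here is a deliberately simple sketch that uses the label on the same line as the context for each value. It is only one possible approach and is not how any particular IDP product resolves context. The `FIELD_LABELS` mapping, the `AMOUNT` pattern, and the assumption that each label and its value share a line are all hypothetical.

```python
import re

# Hypothetical field labels as they appear on the insurance document.
FIELD_LABELS = {
    "collision_deductible": "collision deductible",
    "comprehensive_deductible": "comprehensive deductible",
}

# An amount such as 500, $500, or 1,000.50.
AMOUNT = re.compile(r"\$?\s*(\d[\d,]*(?:\.\d+)?)")


def extract_by_context(lines):
    """Assign each amount to the field whose label precedes it on the line."""
    result = {}
    for line in lines:
        lowered = line.lower()
        for field, label in FIELD_LABELS.items():
            pos = lowered.find(label)
            if pos == -1:
                continue
            # Search only the text that follows the label.
            match = AMOUNT.search(lowered, pos + len(label))
            if match:
                result[field] = match.group(1)
    return result
```

Because each value is tied to its label rather than to the number itself, two fields that both read 500 still end up in separate, correctly named slots.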
So, how do you implement data validation in IDP solutions? One of
the key strategies is configuring business rules.
Business Rules
Modern IDP solutions mostly validate extracted data using business
rules. Let us say you have an expense management system to
process invoices. You are extracting relevant information from
these invoices using an OCR system. In the initial stages, the
extraction accuracy is not expected to be high. However, you have
an agreement with your IDP vendor that an expected level of
accuracy can be achieved in a specific timeframe. Now, how do you
regularly measure the improvements in accuracy? You can do this
by configuring business rules.
Business rules can be configured in an IDP solution in two ways,
either through customization from the backend or through the user
interface. In modern IDP solutions, business rules are a high-value
offering in the user interface, where you can configure them based
on your requirements.
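Whichever way they are configured, business rules boil down to named checks that run against the extracted fields. The following Python sketch shows one way to represent them. The `BusinessRule` class, the two invoice rules, and the field names are hypothetical examples, since the actual rules depend entirely on your requirements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BusinessRule:
    name: str
    check: Callable[[Dict[str, float]], bool]  # True means the rule passes


# Hypothetical rules for an invoice; real rules depend on your requirements.
RULES = [
    BusinessRule("total_is_positive", lambda inv: inv["total"] > 0),
    BusinessRule(
        "subtotal_plus_tax_equals_total",
        lambda inv: abs(inv["subtotal"] + inv["tax"] - inv["total"]) < 0.01,
    ),
]


def failed_rules(invoice: Dict[str, float],
                 rules: List[BusinessRule]) -> List[str]:
    """Return the names of the rules that the extracted invoice violates."""
    return [rule.name for rule in rules if not rule.check(invoice)]
```

Keeping the rules as data rather than hard-coded logic is what makes it possible to expose them in a user interface later.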
Automated Accuracy Improvement
Any corrections performed by your data entry or correction users
act as input to the system. Modern ML-based IDP systems
automatically learn from these corrections so that the accuracy of
future extractions improves. The feedback loop brings the best
results when corrections are integrated with extraction.
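One way to see whether the feedback loop is paying off is to log every reviewed field and track per-field accuracy over time. The sketch below assumes a hypothetical log of `(field, extracted, corrected)` records and counts a field as accurate when the reviewer left the value unchanged. It illustrates the measurement only, not how a model is retrained.

```python
from collections import defaultdict


def field_accuracy(records):
    """Compute per-field accuracy from (field, extracted, corrected) records.

    A field counts as accurate when the reviewer left the value unchanged.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for field, extracted, corrected in records:
        totals[field] += 1
        if extracted == corrected:
            correct[field] += 1
    return {field: correct[field] / totals[field] for field in totals}
```

Running this over each week's or month's records gives a simple trend line to compare against the accuracy level agreed with the vendor.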
When you extract data, human-in-the-loop (HITL) plays the role of
correcting the data that are extracted with low confidence. IDP
solutions assign a confidence score while extracting data at a
granular level, usually at the field level. So, each field that is
automatically extracted has a confidence score assigned to it. You
can decide the fields that need correction based on the confidence
score.
Let us take an example. You are extracting the invoice number,
merchant name, merchant address, and total amount from an
invoice. In this case, you set a high confidence threshold for
critical fields, such as the invoice number. If the invoice number is
not extracted with high confidence, it is served to a human for
correction.
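The field-level decision can be sketched as a threshold lookup. The threshold values, the `THRESHOLDS` mapping, and the 0 to 1 confidence scale below are assumptions made for illustration. Real systems may score confidence differently.

```python
# Hypothetical per-field thresholds; critical fields get a stricter one.
THRESHOLDS = {
    "invoice_number": 0.95,
    "total_amount": 0.90,
    "merchant_name": 0.80,
    "merchant_address": 0.70,
}
DEFAULT_THRESHOLD = 0.80


def fields_needing_review(extraction):
    """Return fields whose confidence falls below their threshold.

    `extraction` maps field name -> (value, confidence in [0, 1]).
    """
    return [
        field
        for field, (_value, confidence) in extraction.items()
        if confidence < THRESHOLDS.get(field, DEFAULT_THRESHOLD)
    ]
```

In this example, an invoice number extracted at 0.91 confidence would be flagged, while a merchant name at 0.85 would pass, because the invoice number carries the stricter threshold.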
Some companies outsource corrections to manage costs. However,
the chances are that they incur higher costs in the long run. Let us
say you have an OCR system to extract data, but corrections are
outsourced to a BPO team because it is cheaper or more
convenient than employing data entry or correction users. What you
miss here is a mature IDP system that, in the long term, can
drastically reduce the correction effort.
Infrrd's IDP solution has an integrated dashboard to perform
corrections where the feedback loop is automated. There are
patent-pending capabilities Infrrd offers to ensure efficient and
intelligent analysis of data before triggering a feedback loop.
After Infrrd's IDP automatically extracts the data, two things can
happen based on the maturity of the models: either a document
goes through Straight Through Processing, or it is served for
correction. If some fields are extracted with low confidence, the
corresponding documents are sent to queues for correction by a
data entry user.
The queues are configured based on the confidence score
assigned by the system during extraction.
The corrections performed by the data entry user act as feedback
for the system to learn, and this ensures improved accuracy in
future extractions.
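The routing described above, straight-through processing versus a correction queue, can be pictured with a short sketch. The queue names, the single 0.9 threshold, and the rule that the lowest field confidence decides the route are assumptions for the example, not a description of Infrrd's queue configuration.

```python
def route_document(field_confidences, threshold=0.9):
    """Pick a destination for a document from its field confidence scores."""
    # A document with no scored fields is treated as low confidence.
    lowest = min(field_confidences.values(), default=0.0)
    return "straight_through" if lowest >= threshold else "correction_queue"


def build_queues(documents, threshold=0.9):
    """Group document IDs by destination."""
    queues = {"straight_through": [], "correction_queue": []}
    for doc_id, confidences in documents.items():
        queues[route_document(confidences, threshold)].append(doc_id)
    return queues
```

As the models mature and confidence scores rise, more documents clear the threshold and skip the correction queue altogether.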
There you go. Ensure that you choose a futuristic IDP solution to
stay competitive. It means choosing an IDP solution that offers
excellent extraction and classification features and has excellent
data validation and feedback loop capabilities to manage variations
and inaccuracies efficiently.
Here is a table that depicts the industry-relevant data validation and
feedback loop features and Infrrd's capabilities:
Feature                                  Infrrd's IDP
Pattern-based validation                 ✔
Dictionary-based validation              ✔
Context-based validation                 ✔
Business Rules Through Configuration     ✔
Self Service Business Rules              On The Roadmap
Automated Accuracy Improvements          ✔
In our next post, we explore Gartner's description of Integration and
how Infrrd stacks up.