Fundamentals for understanding what to look for and how to achieve high data quality. This eBook helps us dispel myths and face the realities of document processing (capture and recognition) in actual production environments:
* What we talk about when we talk about DATA QUALITY
* Understanding the recognition and capture technology options
* Differences between OCR and advanced recognition engines
* Examining case studies in full automation
2. DATA QUALITY IN DOCUMENT PROCESSING
There is a lot of misinformation about the accuracy of data extraction, especially for complex forms such as invoices.
Most data recognition engines do not provide tuned systems and require the user to do the configuration.
3. DATA QUALITY IN DOCUMENT PROCESSING
Most engines focus on read rates, not error rates, and this results in 100% manual verification of the data.
4. DATA QUALITY IN DOCUMENT PROCESSING
Many engines rely on templates to achieve higher rates of accuracy, but templates are insufficient in dynamic environments.
5. DATA QUALITY WHAT WE TALK ABOUT
What we talk about when we talk about DATA QUALITY
4 Must-Know Terms
6. DATA QUALITY WHAT WE TALK ABOUT
#1 Must-Know Term: Read Rate
Read Rate is the percentage of all the available data that the system extracts.
7. DATA QUALITY WHAT WE TALK ABOUT
#2 Must-Know Term: Accuracy
Accuracy is the percentage of extracted data (the read rate) that is accurate, while error rate is the percentage of extracted data that is erroneous.
8. DATA QUALITY WHAT WE TALK ABOUT
#3 Must-Know Term: Acceptance Rate
Acceptance Rate is the percentage of extracted data that flows through the system at an error rate acceptable to your application. Any data that meets or exceeds your threshold will be accepted.
9. DATA QUALITY WHAT WE TALK ABOUT
#4 Must-Know Term: Ground Truth Data
Ground Truth Data is a sample set of images or documents together with verified extracted data results (the truth data) that allows you to measure your capture system performance. Continuously adding to the samples and truth data allows you to measure your system over time.
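The first three terms can be made concrete with a small sketch. It assumes a hypothetical `Field` record that pairs a verified ground truth value with what the engine extracted (`None` when nothing was read). The names and sample values are illustrative only and are not taken from any particular product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    truth: str                # verified value from the ground truth set
    extracted: Optional[str]  # engine output, or None if nothing was read


def read_rate(fields: List[Field]) -> float:
    """Share of all available fields for which the engine returned a value."""
    return sum(f.extracted is not None for f in fields) / len(fields)


def accuracy(fields: List[Field]) -> float:
    """Share of the extracted fields whose value matches the ground truth."""
    read = [f for f in fields if f.extracted is not None]
    return sum(f.extracted == f.truth for f in read) / len(read)


def error_rate(fields: List[Field]) -> float:
    """Share of the extracted fields whose value is wrong."""
    return 1.0 - accuracy(fields)


# Hypothetical sample: four fields, one not read, one read incorrectly.
sample = [
    Field("Samuel Jones", "Samuel Jones"),
    Field("Ryan Short", "Ryan Scott"),
    Field("Mary Clark", None),
    Field("John Allen", "John Allen"),
]

print(f"read rate:  {read_rate(sample):.2f}")
print(f"accuracy:   {accuracy(sample):.2f}")
print(f"error rate: {error_rate(sample):.2f}")
```

Note that accuracy and error rate are computed only over the fields that were read, which is why a high read rate by itself says nothing about how many of those results are wrong.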
10. RECOGNITION TECHNOLOGY OPTIONS
OCR
The complexity of turning OCR output into useful information remains an expensive, time-consuming problem even today when OCR use is common. The primary reason for this is that OCR generates inconsistent random errors in its recognition results.
Because errors are randomly generated, validation of data results requires 100% manual verification to identify where the errors occurred.
* 100% recognition results
* Inconsistent, random errors
* No guaranteed accuracy rate
* 100% operator verification necessary
11. RECOGNITION TECHNOLOGY OPTIONS
OCR Example
For example, let’s process 1000 checks using OCR to read the check amount. If the OCR engine has a 10 percent error rate, 100 errors are randomly distributed among the 1000 results.
Because of this random distribution, finding these errors requires 100 percent manual (visual) verification of all the checks in order to locate the errors.
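The check example can be illustrated with a short simulation. This is only a sketch under stated assumptions: the error positions are drawn with a fixed random seed, and "reviewing" simply means looking at the first `reviewed_count` checks. It shows that reviewing anything less than the full batch cannot be relied on to catch every error when nothing indicates where the errors are.

```python
import random

TOTAL_CHECKS = 1000
ERROR_RATE = 0.10  # 10 percent, as in the example

# 10 percent of 1000 checks gives 100 erroneous results.
expected_errors = round(TOTAL_CHECKS * ERROR_RATE)

# Scatter the errors at random positions (fixed seed so runs are repeatable).
rng = random.Random(0)
error_positions = set(rng.sample(range(TOTAL_CHECKS), expected_errors))


def errors_found(reviewed_count: int) -> int:
    """Errors an operator finds when reviewing only the first reviewed_count checks."""
    return sum(1 for i in range(reviewed_count) if i in error_positions)


for reviewed in (250, 500, 1000):
    found = errors_found(reviewed)
    print(f"reviewed {reviewed:4d} checks, found {found} of {expected_errors} errors")
```

Only the full review of all 1000 checks is guaranteed to find all 100 errors.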
12. RECOGNITION TECHNOLOGY OPTIONS
Automated Recognition & Interpretation
Automated recognition and interpretation technology uses voting algorithms and machine learning that have proven to achieve much lower error rates than those provided by individuals doing manual double keying (data verification done by two separate individuals).
* N% Automated Recognition – no human intervention necessary
* Every result is assigned a confidence value that guarantees accuracy for the data stream
* N% operator verification depends on your acceptance rate
13. RECOGNITION TECHNOLOGY OPTIONS
Automated Recognition & Interpretation
With ground truth data, you can obtain statistical measurements of how often recognition errors occur.
The number of test images required to determine the error rate depends on the accuracy the project requires.
The test images must represent all the types of images and the quality of images that are encountered in the real production stream of documents.
14. RECOGNITION TECHNOLOGY OPTIONS
Automated Recognition & Interpretation
When performing recognition, the software evaluates an image and provides a data result (answer) with an associated metric called a confidence value, which is calculated using a highly complex algorithm.
You can easily tune the system for the goals of the application by setting a threshold at a specific confidence value above which a certain level of accuracy is guaranteed. This dramatically reduces the need for manual verification.
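A minimal sketch of threshold-based routing follows. It assumes each result is an `(answer, confidence)` pair with confidence on a 0 to 100 scale, and that a result is accepted when its confidence meets or exceeds the threshold. The function names, the item identifiers, and the numbers are hypothetical and are not taken from any specific recognition product.

```python
from typing import List, Tuple

Result = Tuple[str, int]  # (answer, confidence value)


def split_by_confidence(results: List[Result], threshold: int):
    """Separate results into an accepted stream and a manual verification stream."""
    accepted = [r for r in results if r[1] >= threshold]
    needs_review = [r for r in results if r[1] < threshold]
    return accepted, needs_review


def acceptance_rate(results: List[Result], threshold: int) -> float:
    """Share of results that pass without human intervention."""
    accepted, _ = split_by_confidence(results, threshold)
    return len(accepted) / len(results)


# Hypothetical results with made-up confidence values.
results = [("A-1001", 91), ("A-1002", 47), ("A-1003", 78), ("A-1004", 60)]

accepted, needs_review = split_by_confidence(results, threshold=70)
print("accepted:", accepted)
print("manual verification:", needs_review)
print("acceptance rate at 70:", acceptance_rate(results, 70))
print("acceptance rate at 50:", acceptance_rate(results, 50))
```

Lowering the threshold raises the acceptance rate, so less goes to manual verification, but it also admits lower-confidence answers into the automated stream.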
15. OCR & MANUAL DATA ENTRY
[Diagram: INPUT IMAGES (100%) → OCR → MANUAL VERIFICATION (100%) → CORRECT (X%) and Error (Y%), where X% + Y% = 100%.]
This diagram represents the typical OCR process that requires 100% manual verification and is subject to error.
100% manual verification and data entry is time consuming, costly, and prone to error.
16. AUTOMATED RECOGNITION & INTERPRETATION
[Diagram: all INPUT IMAGES (100%) go through AUTOMATED INTERPRETATION and split into an ACCEPTED stream (A%) and a MANUAL VERIFICATION stream (B%), where 100% = A% + B%. Each stream is further divided into CORRECT and Error shares (labeled C%, D%, E%, and F% in the diagram).]
* Higher accuracy
* Faster processing
* Data entry savings
All input images are fed to the software. After automated recognition and interpretation, data is divided into two streams: accepted answers and any answers requiring manual verification.
17. POSTAL AUTOMATED RECOGNITION
TIME COMPARISON EXAMPLE
Stage 1: The Benchmark
Initially, with no OCR involved, mail piece address data processing took 75 seconds per mail package.
T1 = 75 seconds
18. POSTAL AUTOMATED RECOGNITION
TIME COMPARISON EXAMPLE
Stage 2: OCR
With OCR reading alone and 100% verification of the OCR results, mail piece address data processing took 55 seconds per mail package.
T2 = 55 seconds
19. POSTAL AUTOMATED RECOGNITION
TIME COMPARISON EXAMPLE
Stage 3: AUTOMATION
With part of the data processed fully automatically and verification reduced accordingly, mail piece address data processing time fell to 25 seconds per mail package, about twice as fast as Stage 2.
T3 = 25 seconds
20. AUTOMATION
T1 = 75 sec > T2 = 55 sec > T3 = 25 sec
Automated Interpretation software provides significant time savings compared to OCR.
AUTOMATION is about 2 times as fast (55 / 25 = 2.2).
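The arithmetic behind the comparison can be spelled out in a few lines. The three timings come from the example. The daily volume of 10,000 mail packages is a hypothetical figure added only to show how per-piece savings scale.

```python
# Seconds per mail package at each stage (from the example).
T1_MANUAL = 75      # Stage 1: no OCR
T2_OCR = 55         # Stage 2: OCR with 100% verification
T3_AUTOMATION = 25  # Stage 3: automation with reduced verification

speedup_vs_ocr = T2_OCR / T3_AUTOMATION        # how much faster than Stage 2
speedup_vs_manual = T1_MANUAL / T3_AUTOMATION  # how much faster than Stage 1
seconds_saved_vs_ocr = T2_OCR - T3_AUTOMATION  # per mail package

# Hypothetical daily volume, used only to scale the per-piece saving.
DAILY_VOLUME = 10_000
hours_saved_per_day = seconds_saved_vs_ocr * DAILY_VOLUME / 3600

print(f"speedup vs OCR:    {speedup_vs_ocr:.1f}x")
print(f"speedup vs manual: {speedup_vs_manual:.1f}x")
print(f"saved vs OCR:      {seconds_saved_vs_ocr} s per package")
print(f"at {DAILY_VOLUME} packages/day: {hours_saved_per_day:.1f} operator hours saved")
```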
21. POSTAL AUTOMATED RECOGNITION
ERROR RATE COMPARISON EXAMPLE
Field          Operator Error   Automated Interpretation Error
Full Address   4%               2%
Full Name      5%               3%
Phone          1%               0.5%
The error rate comparison demonstrates automation to be significantly more accurate for
addresses, name and phone identification.
22. AUTOMATED INTERPRETATION VS. OCR
ANSWER STRUCTURE
The difference between OCR and Automated Interpretation is that Automated Interpretation associates a confidence score with every result.
Automated Interpretation:
file,answer,conf
000001.tif, Samuel Jones, 87
000002.tif, Ashly Thompson, 68
000003.tif, Nancy Wright, 64
000004.tif, Donald Taylor, 76
000005.tif, Mark White, 72
000006.tif, Jessica Hall, 69
000007.tif, Ryan Scott, 32
000008.tif, Sandra Moore, 67
000009.tif, William Brown, 86
000010.tif, Michael Wood, 59
000011.tif, Chris Walker, 66
000012.tif, Mary Clark, 72
000013.tif, John Allen, 58
000014.tif, Joseph Martin, 63
000015.tif, Sarah Wilson, 76
000016.tif, Gary Lewis, 54
000017.tif, Andrew Hill, 67
...
OCR:
file,answer
000001.tif, Samuel Jones
000002.tif, Ashly Thompson
000003.tif, Nancy Wright
000004.tif, Donald Taylor
000005.tif, Mark White
000006.tif, Jessica Hall
000007.tif, Ryan Scott
000008.tif, Sandra Moore
000009.tif, William Brown
000010.tif, Michael Wood
000011.tif, Chris Walker
000012.tif, Mary Clark
000013.tif, John Allen
000014.tif, Joseph Martin
000015.tif, Sarah Wilson
000016.tif, Gary Lewis
000017.tif, Andrew Hill
...
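As a sketch of how an answer file with confidence values might be consumed, the snippet below reads a few rows in the `file,answer,conf` layout shown on this slide using Python's standard `csv` module. The `load_results` helper is a hypothetical name, and sorting by ascending confidence is just one possible way to surface the least certain answers first.

```python
import csv
import io

# A few rows in the file,answer,conf layout shown above.
RESULTS_CSV = """file,answer,conf
000001.tif, Samuel Jones, 87
000007.tif, Ryan Scott, 32
000014.tif, Joseph Martin, 63
000016.tif, Gary Lewis, 54
"""


def load_results(text: str):
    """Parse rows into (file, answer, confidence) tuples with an integer confidence."""
    # skipinitialspace drops the blank that follows each comma in this layout.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [(row["file"], row["answer"], int(row["conf"])) for row in reader]


rows = load_results(RESULTS_CSV)

# List the least certain answers first.
for file, answer, conf in sorted(rows, key=lambda r: r[2]):
    print(f"{conf:3d}  {file}  {answer}")
```

A plain OCR answer file has no `conf` column, so there is nothing to sort or filter on and every row has to be treated alike.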
23. MOVING FROM 100% MANUAL VERIFICATION TO AUTOMATED RECOGNITION
Instead of examining the data results of every single file to verify accuracy, you set the acceptance threshold. Only results below the threshold need to be manually verified.
File Answer
000001.tif Samuel Jones
000002.tif Ashly Thompson
000003.tif Nancy Wright
000004.tif Donald Taylor
000005.tif Mark White
000006.tif Jessica Hall
000007.tif Ryan Short
000008.tif Sandra Moore
000009.tif William Brown
000010.tif Michael Wood
000011.tif Chris Walker
000012.tif Mary Clark
000013.tif John Allen
000014.tif Joshua Martin
000015.tif Sarah Wilson
000016.tif Cary Levis
000017.tif Andrew Hill
001001.tif Mary Miller
001002.tif Robert Johnson
001003.tif Thomas Thomas
001004.tif Betty Anderson
001005.tif David Davis
001006.tif Paul Robinson
001007.tif Dorothy Williams
001008.tif James Smith
002001.tif Ryan Scott
002002.tif Nicholas Young
002003.tif Edward Jackson
002004.tif Shirleigh Admon
002005.tif Margaret Tryniski
002006.tif Melissa Green
002007.tif Anna Evans
002008.tif Timothy King
002009.tif Steve Baker
002010.tif Amanda John
002011.tif Brian Howe
002012.tif Jason Roberts
002013.tif Dennis Campbell
002014.tif Samantha James
002015.tif Rachel Stewart
002016.tif Jerry Lee
TRUTH DATA (the list above): identified by human operators and results corrected.
File with Data Results & Acceptance Thresholds:
Eliminate errors by setting the threshold at a specific confidence value where you know that, above this threshold, the data is ALWAYS correct.
Threshold = 64
Accepted correct   68%
Accepted error      0%
Rejected error     12%
Rejected correct   20%
24. MOVING FROM 100% MANUAL VERIFICATION TO AUTOMATED RECOGNITION
Or set the threshold at a lower confidence value where you know that, above this threshold, the answer is USUALLY correct.
Threshold = 55
Accepted correct   78%
Accepted error      2%
Rejected error     10%
Rejected correct   10%
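The four categories can be computed by comparing recognition results with truth data at a chosen threshold. The sketch below uses the 17 rows visible in the answer file excerpt together with the matching truth data entries, and assumes a result is accepted when its confidence meets or exceeds the threshold. The percentages on the slides refer to the full data set, so the counts from this excerpt are not expected to match them. The `breakdown` function is an illustrative helper, not part of any product.

```python
# (file, recognized answer, truth data answer, confidence) for the visible excerpt.
ROWS = [
    ("000001.tif", "Samuel Jones", "Samuel Jones", 87),
    ("000002.tif", "Ashly Thompson", "Ashly Thompson", 68),
    ("000003.tif", "Nancy Wright", "Nancy Wright", 64),
    ("000004.tif", "Donald Taylor", "Donald Taylor", 76),
    ("000005.tif", "Mark White", "Mark White", 72),
    ("000006.tif", "Jessica Hall", "Jessica Hall", 69),
    ("000007.tif", "Ryan Scott", "Ryan Short", 32),
    ("000008.tif", "Sandra Moore", "Sandra Moore", 67),
    ("000009.tif", "William Brown", "William Brown", 86),
    ("000010.tif", "Michael Wood", "Michael Wood", 59),
    ("000011.tif", "Chris Walker", "Chris Walker", 66),
    ("000012.tif", "Mary Clark", "Mary Clark", 72),
    ("000013.tif", "John Allen", "John Allen", 58),
    ("000014.tif", "Joseph Martin", "Joshua Martin", 63),
    ("000015.tif", "Sarah Wilson", "Sarah Wilson", 76),
    ("000016.tif", "Gary Lewis", "Cary Levis", 54),
    ("000017.tif", "Andrew Hill", "Andrew Hill", 67),
]


def breakdown(rows, threshold):
    """Count results per category: accepted/rejected crossed with correct/error."""
    counts = {
        "accepted correct": 0,
        "accepted error": 0,
        "rejected error": 0,
        "rejected correct": 0,
    }
    for _file, recognized, truth, conf in rows:
        decision = "accepted" if conf >= threshold else "rejected"
        outcome = "correct" if recognized == truth else "error"
        counts[f"{decision} {outcome}"] += 1
    return counts


for threshold in (64, 55):
    print(f"Threshold = {threshold}")
    for category, count in breakdown(ROWS, threshold).items():
        print(f"  {category:17s} {count:2d}  ({count / len(ROWS):.0%})")
```

On this excerpt, a threshold of 64 accepts 12 answers with no errors and sends 5 to manual verification, 3 of which are actual errors. Lowering the threshold to 55 accepts 15 answers but lets 1 error through, which mirrors the ALWAYS correct versus USUALLY correct trade-off described on the two slides.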
25. MOVING FROM 100% MANUAL VERIFICATION TO AUTOMATED RECOGNITION: COST OF TRANSITION
To recap, recognition data results (answers X and Y) are fully automated above the threshold (confidence value) that you set. Below the threshold, documents are sent to be manually verified.
Recognition Results (by answer confidence value)
Above the threshold:
X – correct answers
Y – incorrect answers
Below the threshold:
M – correct answers
N – incorrect answers
* More correct data results from automated recognition
* Guaranteed accuracy of the fully automated part of the data stream
* Accuracy is HIGHER than the accuracy of an operator
* Quantifiable cost & time savings
26. AUTOMATED RECOGNITION & INTERPRETATION
* Much of the data is automatically recognized: SIGNIFICANT TIME SAVINGS
* Only part of the data needs human verification: SIGNIFICANT LABOR SAVINGS
* Controlled and guaranteed accuracy of automatically recognized data: OVERALL HIGHER ACCURACY
* Fewer answers sent for manual verification: HIGHER % CORRECT RESULTS