© 2017 Parascript, LLC.
DATA QUALITY
MYTHS | REALITIES
DATA QUALITY IN DOCUMENT PROCESSING
There is a lot of misinformation about the accuracy of data
extraction, especially for complex forms such as invoices.
Most data recognition engines do not provide tuned systems and require the user to do the configuration.
Most engines focus on read rates, not error rates, and this results in 100% manual verification of the data.
Many engines rely on templates to achieve higher rates of accuracy, but templates are insufficient in dynamic environments.
DATA QUALITY WHAT WE TALK ABOUT
What we talk about when we talk about
DATA QUALITY
4 Must-Know Terms
Read Rate is the percentage of all the available data that is extracted.
#1 Must-Know Term
Accuracy is the percentage of extracted data (the read rate) that is accurate. Error rate is the percentage of extracted data that is erroneous.
#2 Must-Know Term
Acceptance Rate is the percentage of extracted data that flows through the system at an error rate acceptable to your application. Any data that meets or exceeds your threshold will be accepted.
#3 Must-Know Term
Ground Truth Data is a sample set of images or documents together with verified extracted data results (truth data) that allows you to measure your capture system's performance. Continuously adding to the samples and truth data allows you to measure your system over time.
#4 Must-Know Term
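The four terms can be tied together with a little arithmetic. The sketch below is illustrative only: the counts are made up, and the class and function names are assumptions introduced here, not part of any product. It reads "extracted data that flows through the system" as the share of extracted fields at or above the threshold.

```python
from dataclasses import dataclass


@dataclass
class CaptureCounts:
    """Field counts from comparing engine output against ground truth data."""
    available: int  # fields present in the ground truth set
    extracted: int  # fields for which the engine returned an answer
    correct: int    # extracted fields that match the truth data
    accepted: int   # extracted fields at or above the confidence threshold


def read_rate(c: CaptureCounts) -> float:
    """Share of all available data that was extracted."""
    return c.extracted / c.available


def accuracy(c: CaptureCounts) -> float:
    """Share of extracted data that is correct."""
    return c.correct / c.extracted


def error_rate(c: CaptureCounts) -> float:
    """Share of extracted data that is erroneous."""
    return (c.extracted - c.correct) / c.extracted


def acceptance_rate(c: CaptureCounts) -> float:
    """Share of extracted data that passes the threshold without review."""
    return c.accepted / c.extracted


# Made-up counts, for illustration only.
counts = CaptureCounts(available=1000, extracted=900, correct=855, accepted=720)
print(f"read rate:       {read_rate(counts):.1%}")
print(f"accuracy:        {accuracy(counts):.1%}")
print(f"error rate:      {error_rate(counts):.1%}")
print(f"acceptance rate: {acceptance_rate(counts):.1%}")
```

Note that accuracy and error rate are both measured against extracted data, so they always sum to 100%, while read rate is measured against everything that was available to read.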
RECOGNITION TECHNOLOGY OPTIONS
OCR
The complexity of turning OCR output into useful information remains an expensive, time-consuming problem, even today when OCR use is common. The primary reason is that OCR generates inconsistent, random errors in its recognition results.
Because errors are randomly generated, validating the data results requires 100% manual verification to identify where the errors occurred.
• 100% recognition results
• Inconsistent, random errors
• No guaranteed accuracy rate
• 100% operator verification necessary
OCR Example
For example, let’s process 1000 checks using OCR to read the check amount. If the OCR engine has a 10 percent error rate, 100 errors are randomly distributed among the 1000 results. Because of this random distribution, finding these errors requires 100 percent manual (visual) verification of all the checks.
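The arithmetic of the check example can be written out as a small sketch. The function name and the returned keys are assumptions made for illustration; the numbers are the ones from the example.

```python
def ocr_verification_workload(documents: int, error_rate: float) -> dict:
    """Workload when results carry no confidence value to localize errors."""
    expected_errors = round(documents * error_rate)
    return {
        "expected_errors": expected_errors,
        # Errors are scattered at random, so every document must be checked.
        "documents_to_verify": documents,
        # How many documents an operator looks at per error actually found.
        "documents_checked_per_error": (
            documents / expected_errors if expected_errors else None
        ),
    }


workload = ocr_verification_workload(documents=1000, error_rate=0.10)
print(workload)
```

With 1000 checks and a 10 percent error rate, an operator reviews ten checks for every error found, which is why plain OCR output still carries the full verification cost.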
Automated Recognition
& Interpretation
Automated recognition and interpretation technology uses voting algorithms and machine learning that have proven to achieve much lower error rates than those of individuals doing manual double keying (data verification done by two separate individuals).
• N% Automated Recognition: no human intervention necessary
• Every result is assigned a confidence value that guarantees accuracy for the data stream
• N% operator verification depends on your acceptance rate
With ground truth data, you can obtain statistical measurements of how often these errors occur.
The number of test images required to determine the error rate depends on the accuracy the project requires: the lower the target error rate, the more images are needed to measure it.
The test images must represent all the types and qualities of images that are encountered in the real production stream of documents.
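One way to get a feel for how the required number of test images grows with the target accuracy is the standard zero-failure calculation from statistics (the basis of the "rule of three"). This is not taken from the slides; it is a sketch that assumes the test images are independent and representative of production, and the function name is hypothetical.

```python
import math


def images_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of error-free test images that supports the claim
    'the true error rate is below max_error_rate' at the given confidence.

    Solves (1 - max_error_rate) ** n <= 1 - confidence for n.
    """
    if not 0 < max_error_rate < 1 or not 0 < confidence < 1:
        raise ValueError("max_error_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_error_rate))


for target in (0.05, 0.01, 0.001):
    print(f"target error rate {target:.1%}: {images_needed(target)} images")
```

A tenfold tighter error target needs roughly ten times as many test images, and more still if any errors are actually observed in the sample.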
When performing recognition, the software evaluates an image and provides a data result (answer) with an associated metric called a confidence value, which is calculated using a highly complex algorithm.
You can easily tune the system for the goals of the application by setting a threshold at a specific confidence value above which a certain level of accuracy is guaranteed. This dramatically reduces the need for manual verification.
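Choosing the threshold can be sketched as a search over ground truth results: pick the lowest confidence value at which the accepted answers still meet the error rate the application tolerates. The data and the function name below are hypothetical, and this is only one plausible tuning procedure, not the product's own method.

```python
from typing import Optional


def lowest_threshold(scored: list[tuple[int, bool]],
                     max_accepted_error_rate: float) -> Optional[int]:
    """Return the lowest threshold whose accepted set meets the error target.

    scored holds (confidence, is_correct) pairs from a ground truth run.
    An answer is accepted when its confidence meets or exceeds the threshold.
    Returns None if no threshold satisfies the target.
    """
    best = None
    # Walk candidate thresholds from strictest to most permissive.
    for threshold in sorted({conf for conf, _ in scored}, reverse=True):
        accepted = [ok for conf, ok in scored if conf >= threshold]
        accepted_error_rate = accepted.count(False) / len(accepted)
        if accepted_error_rate <= max_accepted_error_rate:
            best = threshold
    return best


# Hypothetical ground truth results: (confidence, answer was correct).
truth_run = [(95, True), (90, True), (80, True),
             (70, False), (60, True), (50, False)]

print(lowest_threshold(truth_run, 0.0))   # zero tolerated errors
print(lowest_threshold(truth_run, 0.25))  # up to 25% errors tolerated
```

A stricter error target pushes the threshold up and sends more answers to manual verification; a looser one lowers it and automates more of the stream.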
OCR & MANUAL DATA ENTRY
[Diagram: 100% of the input images go through OCR and then 100% go through manual verification; the output is Correct (X%) and Error (Y%), where X% + Y% = 100%.]
This diagram represents the typical OCR process, which requires 100% manual verification and is subject to error.
100% manual verification and data entry is time-consuming, costly, and prone to error.
AUTOMATED RECOGNITION & INTERPRETATION
[Diagram: 100% of the input images go through automated interpretation and split into an accepted stream (A%) and a manual verification stream (B%), where 100% = A% + B%; the outputs are labeled Correct C%, Error D%, Correct E%, and Error F%.]
• Higher accuracy
• Faster processing
• Data entry savings
All input images are fed to the software. After automated recognition and interpretation, data is divided into two streams: accepted answers and any answers requiring manual verification.
POSTAL AUTOMATED RECOGNITION
TIME COMPARISON EXAMPLE
Stage 1: The Benchmark
Initially, with no OCR involved, mail piece address data processing took 75 seconds per package.
• T1 = 75 seconds
Stage 2: OCR
With OCR reading alone and 100% verification of the OCR results, mail piece address data processing took 55 seconds per mail package.
• T2 = 55 seconds
Stage 3: AUTOMATION
With part of the data processed fully automatically and verification reduced accordingly, mail piece address data processing time fell to 25 seconds per mail package, about twice as fast as Stage 2.
• T3 = 25 seconds
T1 = 75 sec > T2 = 55 sec > T3 = 25 sec
Automated Interpretation software provides significant time savings compared to OCR: automation is about twice as fast (25 seconds versus 55 seconds per mail package).
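The speedups implied by the three stage timings can be checked directly. The dictionary keys and the helper name are assumptions for this sketch; the timings are the ones given for T1, T2, and T3.

```python
# Seconds per mail piece at each stage, as given in the example.
seconds_per_piece = {
    "manual": 75,                 # T1: no OCR involved
    "ocr_full_verification": 55,  # T2: OCR plus 100% verification
    "automation": 25,             # T3: partial full automation
}


def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' process is than the 'before' one."""
    return before / after


t1 = seconds_per_piece["manual"]
t2 = seconds_per_piece["ocr_full_verification"]
t3 = seconds_per_piece["automation"]

print(f"OCR vs manual:        {speedup(t1, t2):.2f}x")
print(f"automation vs OCR:    {speedup(t2, t3):.2f}x")
print(f"automation vs manual: {speedup(t1, t3):.2f}x")
```

Measured against the fully manual benchmark rather than against OCR, the automated stage is three times as fast.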
POSTAL AUTOMATED RECOGNITION
ERROR RATE COMPARISON EXAMPLE
Operator Error
• Full Address: 4%
• Full Name: 5%
• Phone: 1%
Automated Interpretation Error
• Full Address: 2%
• Full Name: 3%
• Phone: 0.5%
The error rate comparison shows automation to be significantly more accurate for address, name, and phone identification.
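The size of the improvement per field can be expressed as a relative reduction in error rate. The helper name is an assumption for this sketch; the percentages are the ones listed in the comparison.

```python
# Error rates in percent, as listed in the comparison.
operator_error = {"Full Address": 4.0, "Full Name": 5.0, "Phone": 1.0}
automated_error = {"Full Address": 2.0, "Full Name": 3.0, "Phone": 0.5}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was eliminated."""
    return (before - after) / before


for field, before in operator_error.items():
    after = automated_error[field]
    print(f"{field}: {before}% -> {after}% "
          f"({relative_reduction(before, after):.0%} fewer errors)")
```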
AUTOMATED INTERPRETATION VS. OCR
ANSWER STRUCTURE
The difference between OCR and Automated Interpretation is that Automated Interpretation associates a confidence score with every result.
Automated Interpretation
file,answer,conf
000001.tif, Samuel Jones, 87
000002.tif, Ashly Thompson, 68
000003.tif, Nancy Wright, 64
000004.tif, Donald Taylor, 76
000005.tif, Mark White, 72
000006.tif, Jessica Hall, 69
000007.tif, Ryan Scott, 32
000008.tif, Sandra Moore, 67
000009.tif, William Brown, 86
000010.tif, Michael Wood, 59
000011.tif, Chris Walker, 66
000012.tif, Mary Clark, 72
000013.tif, John Allen, 58
000014.tif, Joseph Martin, 63
000015.tif, Sarah Wilson, 76
000016.tif, Gary Lewis, 54
000017.tif, Andrew Hill, 67
...
OCR
file,answer
000001.tif, Samuel Jones
000002.tif, Ashly Thompson
000003.tif, Nancy Wright
000004.tif, Donald Taylor
000005.tif, Mark White
000006.tif, Jessica Hall
000007.tif, Ryan Scott
000008.tif, Sandra Moore
000009.tif, William Brown
000010.tif, Michael Wood
000011.tif, Chris Walker
000012.tif, Mary Clark
000013.tif, John Allen
000014.tif, Joseph Martin
000015.tif, Sarah Wilson
000016.tif, Gary Lewis
000017.tif, Andrew Hill
...
Instead of examining the data results of every single file to verify accuracy, you set the
acceptance threshold. Only results below the threshold need to be manually verified.
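Routing by threshold can be sketched with the sample rows shown for Automated Interpretation. The function name and the "meets or exceeds" comparison are choices made for this sketch (the latter follows the Acceptance Rate definition); the threshold of 64 is used only as an example value.

```python
import csv
import io

# Sample Automated Interpretation output, copied from the rows above.
RESULTS_CSV = """file,answer,conf
000001.tif, Samuel Jones, 87
000002.tif, Ashly Thompson, 68
000003.tif, Nancy Wright, 64
000004.tif, Donald Taylor, 76
000005.tif, Mark White, 72
000006.tif, Jessica Hall, 69
000007.tif, Ryan Scott, 32
000008.tif, Sandra Moore, 67
000009.tif, William Brown, 86
000010.tif, Michael Wood, 59
000011.tif, Chris Walker, 66
000012.tif, Mary Clark, 72
000013.tif, John Allen, 58
000014.tif, Joseph Martin, 63
000015.tif, Sarah Wilson, 76
000016.tif, Gary Lewis, 54
000017.tif, Andrew Hill, 67
"""


def route(csv_text: str, threshold: int) -> tuple[list[str], list[str]]:
    """Split files into (accepted, needs manual verification)."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    accepted, to_verify = [], []
    for row in reader:
        # A result is accepted when its confidence meets or exceeds the threshold.
        target = accepted if int(row["conf"]) >= threshold else to_verify
        target.append(row["file"])
    return accepted, to_verify


accepted, to_verify = route(RESULTS_CSV, threshold=64)
print(f"accepted automatically: {len(accepted)}")
print(f"sent to an operator:    {to_verify}")
```

The OCR output has no `conf` column, so there is nothing to route on and every row has to go to an operator.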
MOVING FROM 100% MANUAL VERIFICATION TO
AUTOMATED RECOGNITION
File Answer
000001.tif Samuel Jones
000002.tif Ashly Thompson
000003.tif Nancy Wright
000004.tif Donald Taylor
000005.tif Mark White
000006.tif Jessica Hall
000007.tif Ryan Short
000008.tif Sandra Moore
000009.tif William Brown
000010.tif Michael Wood
000011.tif Chris Walker
000012.tif Mary Clark
000013.tif John Allen
000014.tif Joshua Martin
000015.tif Sarah Wilson
000016.tif Cary Levis
000017.tif Andrew Hill
001001.tif Mary Miller
001002.tif Robert Johnson
001003.tif Thomas Thomas
001004.tif Betty Anderson
001005.tif David Davis
001006.tif Paul Robinson
001007.tif Dorothy Williams
001008.tif James Smith
002001.tif Ryan Scott
002002.tif Nicholas Young
002003.tif Edward Jackson
002004.tif Shirleigh Admon
002005.tif Margaret Tryniski
002006.tif Melissa Green
002007.tif Anna Evans
002008.tif Timothy King
002009.tif Steve Baker
002010.tif Amanda John
002011.tif Brian Howe
002012.tif Jason Roberts
002013.tif Dennis Campbell
002014.tif Samantha James
002015.tif Rachel Stewart
002016.tif Jerry Lee
TRUTH DATA
File with Data Results & Acceptance Thresholds
Identified by human operators and results corrected
Eliminate errors by setting the threshold at a specific confidence value where you know that above this threshold, the data is ALWAYS correct.
Threshold = 64
• Accepted correct: 68%
• Accepted error: 0%
• Rejected error: 12%
• Rejected correct: 20%
Or set the threshold at a lower confidence value where you know that above this threshold, the answer is USUALLY correct.
Threshold = 55
• Accepted correct: 78%
• Accepted error: 2%
• Rejected error: 10%
• Rejected correct: 10%
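The four categories in the chart can be reproduced on a small scale. The sketch below assumes that the File/Answer list on the truth data slide is the operator-verified truth for files 000001 to 000017, and pairs it with the confidence values from the Automated Interpretation sample; only the three rows where the two lists differ are treated as errors. The function name is hypothetical, and because this is a 17-row excerpt the counts will not match the percentages in the chart.

```python
# (confidence, recognized answer, truth answer) for files 000001-000017.
# Assumption: the File/Answer list is the human-verified truth data.
ROWS = [
    (87, "Samuel Jones", "Samuel Jones"),
    (68, "Ashly Thompson", "Ashly Thompson"),
    (64, "Nancy Wright", "Nancy Wright"),
    (76, "Donald Taylor", "Donald Taylor"),
    (72, "Mark White", "Mark White"),
    (69, "Jessica Hall", "Jessica Hall"),
    (32, "Ryan Scott", "Ryan Short"),
    (67, "Sandra Moore", "Sandra Moore"),
    (86, "William Brown", "William Brown"),
    (59, "Michael Wood", "Michael Wood"),
    (66, "Chris Walker", "Chris Walker"),
    (72, "Mary Clark", "Mary Clark"),
    (58, "John Allen", "John Allen"),
    (63, "Joseph Martin", "Joshua Martin"),
    (76, "Sarah Wilson", "Sarah Wilson"),
    (54, "Gary Lewis", "Cary Levis"),
    (67, "Andrew Hill", "Andrew Hill"),
]


def breakdown(threshold: int) -> dict[str, int]:
    """Count results in each of the four chart categories."""
    counts = {"accepted_correct": 0, "accepted_error": 0,
              "rejected_error": 0, "rejected_correct": 0}
    for conf, recognized, truth in ROWS:
        stream = "accepted" if conf >= threshold else "rejected"
        outcome = "correct" if recognized == truth else "error"
        counts[f"{stream}_{outcome}"] += 1
    return counts


for threshold in (64, 55):
    print(threshold, breakdown(threshold))
```

On this excerpt the higher threshold lets no error through but sends two correct answers to an operator, while the lower threshold automates more answers at the cost of one accepted error, which is the same trade-off the two charts describe.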
To recap, recognition data results (answers X and Y) are fully automated above the
threshold (confidence value) that you set. Below the threshold, documents are sent to be
manually verified.
MOVING FROM 100% MANUAL VERIFICATION TO
AUTOMATED RECOGNITION: COST OF TRANSITION
Recognition Results
[Chart: answers plotted by confidence value, divided by the threshold]
X – correct answers
Y – incorrect answers
M – correct answers
N – incorrect answers
• More correct data results from automated recognition
• Guaranteed accuracy of the fully automated part of the data stream
• Accuracy is HIGHER than the accuracy of an operator
• Quantifiable cost & time savings
AUTOMATED RECOGNITION & INTERPRETATION
• Much of the data is automatically recognized: SIGNIFICANT TIME SAVINGS
• Only part of the data needs human verification: SIGNIFICANT LABOR SAVINGS
• Controlled and guaranteed accuracy of automatically recognized data: OVERALL HIGHER ACCURACY
• Fewer answers sent for manual verification: HIGHER % CORRECT RESULTS
© 2017 Parascript, LLC. parascript.com
www.parascript.com | info@parascript.com | 888.225.0169
YOUR SOLUTIONS POWER BUSINESSES.
OUR SOLUTIONS DELIVER YOUR DATA.
WANT TO LEARN MORE?

Data Quality | Myths & Realities

  • 1. © 2017 Parascript, LLC. DATA QUALITY MYTHS | REALITIES
  • 2. DATA QUALITY IN DOCUMENT PROCESSING There is a lot of misinformation about the accuracy of data extraction, especially for complex forms such as invoices. Most data recognition engines do not provide tuned systems and require the user to do the configuration.
  • 3. DATA QUALITY IN DOCUMENT PROCESSING There is a lot of misinformation about the accuracy of data extraction, especially for complex forms such as invoices. Most data recognition engines do not provide tuned systems and require the user to do the configuration. Most engines focus on read rates, not error rates, and this results in 100% manual verification of the data.
  • 4. DATA QUALITY IN DOCUMENT PROCESSING There is a lot of misinformation about the accuracy of data extraction, especially for complex forms such as invoices. Most data recognition engines do not provide tuned systems and require the user to do the configuration. Most engines focus on read rates, not error rates, and this results in 100% manual verification of the data. Many engines rely on templates to achieve higher rates of accuracy, but templates are insufficient in dynamic environments.
  • 5. DATA QUALITY WHAT WE TALK ABOUT What we talk about when we talk about DATA QUALITY 4 Must-Know Terms
  • 6. DATA QUALITY WHAT WE TALK ABOUT Read Rate is the percent of extracted data from all the available data. Accuracy Acceptance Rate Ground Truth Data #1 Must-Know Term
  • 7. DATA QUALITY WHAT WE TALK ABOUT Acceptance Rate Ground Truth Data Accuracy is the percentage of extracted data–the read rate—that is accurate while error rate is the percentage of extracted data that is erroneous. Read Rate is the percent of extracted data from all the available data. #2 Must-Know Term
  • 8. DATA QUALITY WHAT WE TALK ABOUT Acceptance Rate is the percentage of extracted data that flows through the system at an error rate acceptable by your application. Any data that meets or exceeds your threshold will be accepted. Ground Truth Data Read Rate is the percent of extracted data from all the available data. Accuracy is the percentage of extracted data (the read rate) that is accurate. Error rate is the percentage of extracted data that is erroneous. #3 Must-Know Term
  • 9. DATA QUALITY WHAT WE TALK ABOUT Read Rate is the percent of extracted data from all the available data. Ground Truth Data is a sample set of images or documents and verified extracted data results— truth data—that allows you to measure your capture system performance. Continuously adding to samples and truth data allows you to measure your system over time. Acceptance Rate is the percentage of extracted data that flows through the system at an error rate acceptable by your application. Any data that meets or exceeds your threshold will be accepted. Accuracy is the percentage of extracted data (the read rate) that is accurate. Error rate is the percentage of extracted data that is erroneous. #4 Must-Know Term
  • 10. RECOGNITION TECHNOLOGY OPTIONS OCR The complexity of turning OCR output into useful information remains an expensive, time-consuming problem even today when OCR use is common. The primary reason for this is that OCR generates inconsistent random errors in its recognition results. Because errors are randomly generated, validation of data results requires 100% manual verification to identify where the errors occurred.  100% recognition results  Inconsistent, random errors  No guaranteed accuracy rate  100% operator verification necessary OCR
  • 11. RECOGNITION TECHNOLOGY OPTIONS OCR Example  100% recognition results  Inconsistent, random errors  No guaranteed accuracy rate  100% manual verification necessary OCR For example, let’s process 1000 checks using OCR to read the check amount. If the OCR engine has a 10 percent error rate, 100 errors are randomly distributed among the 1000 results. Because of this random distribution, finding these errors requires 100 percent manual (visual) verification of all the checks in order to locate the errors.
  • 12. RECOGNITION TECHNOLOGY OPTIONS Automated Recognition & Interpretation Automated recognition and interpretation technology uses voting algorithms and machine learning that have proven to achieve much lower error rates than those provided by individuals doing manual double keying (data verification done by two separate individuals).  N% Automated Recognition – no human intervention necessary  Every result is assigned a confidence value that guarantees accuracy for the data stream  N% operator verification depends on your acceptance rate AI
  • 13. RECOGNITION TECHNOLOGY OPTIONS Automated Recognition & Interpretation With ground truth data, statistical measurements of how often these errors occur can be obtained. The number of test images required to determine the error rate depends on the accuracy of the project. The test images must represent all the types of images and the quality of images that are encountered in the real production stream of documents.  N% Automated Recognition – no human intervention necessary  Every result is assigned a confidence value that guarantees accuracy for the data stream  N% operator verification depends on your acceptance rate AI
  • 14. RECOGNITION TECHNOLOGY OPTIONS: Automated Recognition & Interpretation. When performing recognition, the software evaluates an image and returns a data result (answer) with an associated metric called a confidence value, which is calculated by a complex algorithm. You can tune the system to the goals of the application by setting a threshold at a specific confidence value above which a certain level of accuracy is guaranteed. This dramatically reduces the need for manual verification.
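Tuning the threshold can be sketched as a search over scored test results. The confidence values below are the 17 shown on slide 22. Marking three of them as errors assumes that the names on slide 23 are the ground truth and that any mismatch counts as an error. The function names are hypothetical, and results that meet or exceed the threshold are treated as accepted, following the deck's definition of acceptance rate.

```python
# (confidence, is_correct) pairs for the 17 results visible on slide 22.
# Assumption: a result is an error when it differs from the slide 23 name.
SCORED = [
    (87, True), (68, True), (64, True), (76, True), (72, True), (69, True),
    (32, False), (67, True), (86, True), (59, True), (66, True), (72, True),
    (58, True), (63, False), (76, True), (54, False), (67, True),
]


def accepted_error_rate(scored, threshold):
    """Error rate among results whose confidence meets the threshold."""
    accepted = [ok for conf, ok in scored if conf >= threshold]
    if not accepted:
        return 0.0
    return accepted.count(False) / len(accepted)


def lowest_threshold(scored, max_error_rate):
    """Lowest observed confidence value that keeps the accepted stream
    at or below the target error rate, or None if no value does."""
    for threshold in sorted({conf for conf, _ in scored}):
        if accepted_error_rate(scored, threshold) <= max_error_rate:
            return threshold
    return None


print(lowest_threshold(SCORED, 0.0))   # strict target: no accepted errors
print(lowest_threshold(SCORED, 0.10))  # looser target: up to 10% accepted errors
```

A stricter target pushes the threshold up and sends more results to manual verification. A looser target accepts more results automatically at the price of some accepted errors.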
  • 15. OCR & MANUAL DATA ENTRY. [Diagram: input images (100%) pass through OCR and then manual verification (100%); the output is X% correct and Y% error, with X% + Y% = 100%.] This diagram represents the typical OCR process, which requires 100% manual verification and is still subject to error. Fully manual verification and data entry is time-consuming, costly and prone to error.
  • 16. AUTOMATED RECOGNITION & INTERPRETATION. [Diagram: input images (100%) pass through automated interpretation and split into an accepted stream and a manual verification stream, labeled A% and B%, with 100% = A% + B%; the outputs are labeled correct C%, error D%, correct E% and error F%.] All input images are fed to the software. After automated recognition and interpretation, the data is divided into two streams: accepted answers and answers requiring manual verification. Benefits: higher accuracy; faster processing; data entry savings.
  • 17. POSTAL AUTOMATED RECOGNITION TIME COMPARISON EXAMPLE. Stage 1, The Benchmark: initially, processing mail piece address data with no OCR involved took 75 seconds per package. T1 = 75 seconds.
  • 18. POSTAL AUTOMATED RECOGNITION TIME COMPARISON EXAMPLE. Stage 2, OCR: processing mail piece address data with OCR reading alone and 100% verification of the OCR results took 55 seconds per mail package. T2 = 55 seconds.
  • 19. POSTAL AUTOMATED RECOGNITION TIME COMPARISON EXAMPLE. Stage 3, AUTOMATION: with part of the data processed fully automatically and verification reduced accordingly, mail piece address data processing time fell to 25 seconds per mail package, about twice as fast as Stage 2. T3 = 25 seconds.
  • 20. POSTAL AUTOMATED RECOGNITION TIME COMPARISON EXAMPLE. Summary: T1 = 75 sec > T2 = 55 sec > T3 = 25 sec. Automatic interpretation software provides significant time savings compared to OCR: automation is about 2 times as fast.
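The "2 times as fast" figure can be checked with a few lines of arithmetic on the three stage times. The dictionary keys and the function name are only labels chosen for this sketch.

```python
# Seconds per mail package for the three stages in the postal example.
STAGE_SECONDS = {"manual": 75, "ocr_full_verification": 55, "automation": 25}


def speedup(before, after):
    """How many times faster the 'after' stage is than the 'before' stage."""
    return before / after


ocr_to_auto = speedup(STAGE_SECONDS["ocr_full_verification"], STAGE_SECONDS["automation"])
manual_to_auto = speedup(STAGE_SECONDS["manual"], STAGE_SECONDS["automation"])
manual_to_ocr = speedup(STAGE_SECONDS["manual"], STAGE_SECONDS["ocr_full_verification"])

print(round(ocr_to_auto, 2))     # 2.2
print(round(manual_to_auto, 2))  # 3.0
print(round(manual_to_ocr, 2))   # 1.36
```

Automation is 2.2 times as fast as OCR with full verification and 3 times as fast as the manual benchmark, while OCR alone improves on the benchmark by a factor of about 1.36.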
  • 21. POSTAL AUTOMATED RECOGNITION ERROR RATE COMPARISON EXAMPLE. Operator error: full address 4%, full name 5%, phone 1%. Automated interpretation error: full address 2%, full name 3%, phone 0.5%. The error rate comparison shows automation to be significantly more accurate for address, name and phone identification.
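One way to read the comparison is as a relative reduction in error rate per field. The sketch below uses the percentages from the slide. The variable and function names are assumptions made for the example.

```python
# Error rates in percent, per field, taken from the slide.
OPERATOR = {"full address": 4.0, "full name": 5.0, "phone": 1.0}
AUTOMATED = {"full address": 2.0, "full name": 3.0, "phone": 0.5}


def relative_reduction(before, after):
    """Fraction by which the error rate dropped, relative to the starting rate."""
    return (before - after) / before


for field in OPERATOR:
    reduction = relative_reduction(OPERATOR[field], AUTOMATED[field])
    print(f"{field}: {reduction:.0%} fewer errors")
```

On these figures, automated interpretation halves the error rate for addresses and phone numbers and reduces it by 40% for names.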
  • 22. AUTOMATED INTERPRETATION VS. OCR ANSWER STRUCTURE. The difference between OCR and Automated Interpretation is that Automated Interpretation associates a confidence score with every result. Automated Interpretation output (file,answer,conf): 000001.tif, Samuel Jones, 87; 000002.tif, Ashly Thompson, 68; 000003.tif, Nancy Wright, 64; 000004.tif, Donald Taylor, 76; 000005.tif, Mark White, 72; 000006.tif, Jessica Hall, 69; 000007.tif, Ryan Scott, 32; 000008.tif, Sandra Moore, 67; 000009.tif, William Brown, 86; 000010.tif, Michael Wood, 59; 000011.tif, Chris Walker, 66; 000012.tif, Mary Clark, 72; 000013.tif, John Allen, 58; 000014.tif, Joseph Martin, 63; 000015.tif, Sarah Wilson, 76; 000016.tif, Gary Lewis, 54; 000017.tif, Andrew Hill, 67; ... OCR output (file,answer): 000001.tif, Samuel Jones; 000002.tif, Ashly Thompson; 000003.tif, Nancy Wright; 000004.tif, Donald Taylor; 000005.tif, Mark White; 000006.tif, Jessica Hall; 000007.tif, Ryan Scott; 000008.tif, Sandra Moore; 000009.tif, William Brown; 000010.tif, Michael Wood; 000011.tif, Chris Walker; 000012.tif, Mary Clark; 000013.tif, John Allen; 000014.tif, Joseph Martin; 000015.tif, Sarah Wilson; 000016.tif, Gary Lewis; 000017.tif, Andrew Hill; ...
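The extra `conf` column is what makes automatic routing possible. A minimal sketch, assuming the output really is a comma-separated file with the header shown on the slide: it reads six of the slide's rows and sends each file either to the accepted stream or to manual review. The function name and the threshold of 64 are chosen for illustration.

```python
import csv
import io

# Six rows copied from the slide, in the same "file,answer,conf" layout.
CSV_TEXT = """file,answer,conf
000001.tif, Samuel Jones, 87
000002.tif, Ashly Thompson, 68
000003.tif, Nancy Wright, 64
000007.tif, Ryan Scott, 32
000010.tif, Michael Wood, 59
000016.tif, Gary Lewis, 54
"""


def route(csv_text, threshold):
    """Split files into (accepted, needs_review) by confidence value.

    Results that meet or exceed the threshold are accepted.
    """
    accepted, needs_review = [], []
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        target = accepted if int(row["conf"]) >= threshold else needs_review
        target.append(row["file"])
    return accepted, needs_review


accepted, needs_review = route(CSV_TEXT, threshold=64)
print("accepted:", accepted)
print("manual review:", needs_review)
```

With plain OCR output there is no such column, so every row would have to go to manual review.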
  • 23. MOVING FROM 100% MANUAL VERIFICATION TO AUTOMATED RECOGNITION. Instead of examining the data results of every single file to verify accuracy, you set the acceptance threshold. Only results below the threshold need to be manually verified. TRUTH DATA (identified by human operators and results corrected); file with data results and acceptance thresholds: 000001.tif Samuel Jones; 000002.tif Ashly Thompson; 000003.tif Nancy Wright; 000004.tif Donald Taylor; 000005.tif Mark White; 000006.tif Jessica Hall; 000007.tif Ryan Short; 000008.tif Sandra Moore; 000009.tif William Brown; 000010.tif Michael Wood; 000011.tif Chris Walker; 000012.tif Mary Clark; 000013.tif John Allen; 000014.tif Joshua Martin; 000015.tif Sarah Wilson; 000016.tif Cary Levis; 000017.tif Andrew Hill; 001001.tif Mary Miller; 001002.tif Robert Johnson; 001003.tif Thomas Thomas; 001004.tif Betty Anderson; 001005.tif David Davis; 001006.tif Paul Robinson; 001007.tif Dorothy Williams; 001008.tif James Smith; 002001.tif Ryan Scott; 002002.tif Nicholas Young; 002003.tif Edward Jackson; 002004.tif Shirleigh Admon; 002005.tif Margaret Tryniski; 002006.tif Melissa Green; 002007.tif Anna Evans; 002008.tif Timothy King; 002009.tif Steve Baker; 002010.tif Amanda John; 002011.tif Brian Howe; 002012.tif Jason Roberts; 002013.tif Dennis Campbell; 002014.tif Samantha James; 002015.tif Rachel Stewart; 002016.tif Jerry Lee. Eliminate errors by setting the threshold at a specific confidence value where you know that, above this threshold, the data is ALWAYS correct. Threshold = 64: accepted correct 68%, accepted error 0%, rejected error 12%, rejected correct 20%.
  • 24. MOVING FROM 100% MANUAL VERIFICATION TO AUTOMATED RECOGNITION (same truth data as slide 23). Or set the threshold at a lower confidence value where you know that, above this threshold, the answer is USUALLY correct. Threshold = 55: accepted correct 78%, accepted error 2%, rejected error 10%, rejected correct 10%.
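The four chart categories can be reproduced on a small scale by comparing recognition results with the truth data. The sketch below covers only the 17 files for which slide 22 shows a confidence value, so its counts will not match the chart percentages, which presumably describe a larger data stream. It assumes that the slide 23 names are the ground truth and that any mismatch is an error. All function and variable names are hypothetical.

```python
from collections import Counter

# (file, recognized answer, confidence) for the 17 rows shown on slide 22.
RESULTS = [
    ("000001.tif", "Samuel Jones", 87), ("000002.tif", "Ashly Thompson", 68),
    ("000003.tif", "Nancy Wright", 64), ("000004.tif", "Donald Taylor", 76),
    ("000005.tif", "Mark White", 72), ("000006.tif", "Jessica Hall", 69),
    ("000007.tif", "Ryan Scott", 32), ("000008.tif", "Sandra Moore", 67),
    ("000009.tif", "William Brown", 86), ("000010.tif", "Michael Wood", 59),
    ("000011.tif", "Chris Walker", 66), ("000012.tif", "Mary Clark", 72),
    ("000013.tif", "John Allen", 58), ("000014.tif", "Joseph Martin", 63),
    ("000015.tif", "Sarah Wilson", 76), ("000016.tif", "Gary Lewis", 54),
    ("000017.tif", "Andrew Hill", 67),
]

# Truth data entries that differ from the recognized answer (slide 23).
TRUTH_OVERRIDES = {
    "000007.tif": "Ryan Short",
    "000014.tif": "Joshua Martin",
    "000016.tif": "Cary Levis",
}


def breakdown(results, truth_overrides, threshold):
    """Count accepted/rejected results that are correct/in error."""
    counts = Counter()
    for file, answer, conf in results:
        truth = truth_overrides.get(file, answer)
        decision = "accepted" if conf >= threshold else "rejected"
        outcome = "correct" if answer == truth else "error"
        counts[f"{decision} {outcome}"] += 1
    return counts


for threshold in (64, 55):
    print(threshold, dict(breakdown(RESULTS, TRUTH_OVERRIDES, threshold)))
```

On this subset the pattern matches the slides: at threshold 64 no error is accepted but two correct results are sent for review, while at threshold 55 more results are accepted automatically and one error slips through.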
  • 25. MOVING FROM 100% MANUAL VERIFICATION TO AUTOMATED RECOGNITION: COST OF TRANSITION. To recap, recognition results above the confidence threshold that you set (X correct answers and Y incorrect answers) are processed fully automatically. Results below the threshold (M correct answers and N incorrect answers) are sent for manual verification. [Diagram: recognition results plotted by answer confidence value, divided by the threshold.] Benefits: more correct data results from automated recognition; guaranteed accuracy of the fully automated part of the data stream; accuracy HIGHER than that of an operator; quantifiable cost and time savings.
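The trade-off between the two thresholds can be put into numbers from the four quantities on this slide. The sketch below plugs in the chart percentages from slides 23 and 24, reading "accepted correct" as X, "accepted error" as Y, "rejected correct" as M and "rejected error" as N. That mapping, the function name and the metric names are assumptions made for the example.

```python
def stream_metrics(x, y, m, n):
    """Summarize a threshold choice.

    x, y: correct and incorrect answers above the threshold (automated).
    m, n: correct and incorrect answers below it (manual verification).
    Inputs may be counts or percentages, as long as they are consistent.
    """
    total = x + y + m + n
    accepted = x + y
    return {
        "acceptance_rate": accepted / total,
        "accepted_error_rate": y / accepted if accepted else 0.0,
        "manual_share": (m + n) / total,
    }


strict = stream_metrics(x=68, y=0, m=20, n=12)   # threshold = 64
relaxed = stream_metrics(x=78, y=2, m=10, n=10)  # threshold = 55

print(strict)
print(relaxed)
```

Lowering the threshold from 64 to 55 cuts the manual verification share from 32% to 20% of the stream, at the cost of a 2.5% error rate in the automated part.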
  • 26. AUTOMATED RECOGNITION & INTERPRETATION. Much of the data is automatically recognized: SIGNIFICANT TIME SAVINGS. Only part of the data needs human verification: SIGNIFICANT LABOR SAVINGS. Controlled and guaranteed accuracy of automatically recognized data: OVERALL HIGHER ACCURACY. Fewer answers sent for manual verification: HIGHER % CORRECT RESULTS.
  • 27. WANT TO LEARN MORE? www.parascript.com | info@parascript.com | 888.225.0169. YOUR SOLUTIONS POWER BUSINESSES. OUR SOLUTIONS DELIVER YOUR DATA. © 2017 Parascript, LLC.