© 2002-2013 Nuance Communications, Inc. All rights reserved. Page 1
How to Keep OCR
Errors from
Spoiling Your
eDiscovery Party
ACEDS Webinar, May 21st 2014
ACEDS Membership Benefits
Training, Resources and Networking for the
E-Discovery Community
Join Today! aceds.org/join
or Call ACEDS Member Services 786-517-2701
Exclusive News and Analysis
Weekly Web Seminars
Podcasts
On-Demand Training
Networking
Resources
Jobs Board & Career Center
bits + bytes Newsletter
CEDS Certification
And Much More!
“ACEDS provides an excellent, much needed forum… to train, network and stay
current on critical information.”
Kimarie Stratos, General Counsel, Memorial Health Systems, Ft. Lauderdale
Speaker Introduction
Greg Gies
Director, Product Marketing, Imaging
Nuance Communications
Leads go-to-market planning for
Nuance’s print, capture & PDF
solutions.
Agenda
– What OCR is.
– What are OCR errors and what causes them.
– Why eDiscovery professionals should care.
– How common these problems are.
– What can be done to prevent OCR errors.
– How to correct OCR errors when they happen.
What is OCR?
Wikipedia: “Optical character recognition, usually
abbreviated to OCR, is the mechanical or electronic
conversion of scanned or photographed images of
typewritten or printed text into machine encoded /
computer-readable text.”
2009 Computerworld Article: “Optical character
recognition (OCR) is the translation of optically scanned
bitmaps of printed or written text characters into character
codes.”
I say, “OCR is the digital transcription of
bitmaps containing machine-printed text
into encoded text characters, using a
coding scheme such as ASCII, which
among other capabilities enables indexing
software to decipher textual elements
contained within bitmaps.”
Encoded text is ‘searchable’ text.
What are OCR errors and what causes
them.
Types of OCR errors
Transcription errors. Result: misspelled words.
Impact: the word is unsearchable. Proportional fonts are
especially problematic.
Example: “learning” becomes “leaming”.
Formatting errors. Result: poor legibility.
Impact: the word is unsearchable.
Example: “learning” becomes “l e a r n i n g”.
Deleted metadata. Result: data loss.
Impact: the metadata is unsearchable.
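To see why both transcription and formatting errors make a term unsearchable, here is a minimal Python sketch of a plain whole-word keyword match. The function name `exact_hit` and the sample strings are illustrative assumptions, not taken from any eDiscovery product.

```python
import re

def exact_hit(term: str, text: str) -> bool:
    """Return True if `term` appears in `text` as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

# Hypothetical sample strings modelled on the slide's examples.
clean_text = "Notes on machine learning"
transcription_error = "Notes on machine leaming"        # "rn" misread as "m"
formatting_error = "Notes on machine l e a r n i n g"   # letters split apart

for label, text in [("clean", clean_text),
                    ("transcription error", transcription_error),
                    ("formatting error", formatting_error)]:
    print(label, "->", exact_hit("learning", text))
```

Only the clean text matches. The two damaged versions are still readable to a person but invisible to the keyword search.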
Some image defects and causes
Faulty printing equipment
DEFECTS: Toner specks; vertical lines; toner smear; gray background; page skew; light/dark print
CAUSES: Defective toner cartridge; worn rollers; wrong settings; worn pickup roller; low toner or ink cartridge; clogged nozzles

Paper or form elements
DEFECTS: Halftones; vertical/horizontal lines; noise
CAUSES: Colored paper; carbon copies; shaded and lined forms; low/high contrast background

Faulty scanning equipment
DEFECTS: Specks; page skew; light/dark image
CAUSES: Dirty platen; feeder misadjusted; worn pickup roller; low resolution; misfeed
Examples of image defects
Halftone
Specks
Color Background
Fuzzy edges
Gridlines
Cause & effect
SKEWED IMAGE
TONER SPECKS
DARK TEXT
OCR results
•..... ; .,
1HIt#~!':tl'Ol<fLoi;l;i~: •••• 'do ••••• ~""r.bf .. . ,200., by- ••• ~ •• ODNEY
D~"'!Ob$Y~s.IN(; ..• ~~"""""" Iaws "'''''.S_",c.rm",,;, ~,-"'"'_U "s.,.",) •••
JOE STANDUi' ~Irnown •• """,,"). _ and SclIa _<01"'"""y
'l>eknoWhllereinasutlJ.e paiti~S",' . . . ..... ..
. "
Formatting errors
Original Image OCR conversion to .docx
Why should eDiscovery professionals
care about OCR errors?
PDF Files Created by Scanners Aren’t
Necessarily Searchable
1. Unlike PDF Files that were “born digital”, a PDF file from
a scanner is an image of a paper document.
2. While the text in an image may appear similar or the
same as text in a “born digital” PDF, it’s invisible as far
as search algorithms are concerned.
3. Images aren’t searchable until processed with OCR.
4. A “searchable PDF” carries the OCR output as a hidden text layer stored with the page image, not merely in its metadata.
Digitized Versus Native Documents
Electronic
• Originated online
• Many native file formats
• Encoded text – “machine-
readable”
• eDiscovery software designed
for these documents
Digitized
• Scans of paper originals
• TIFF & PDF common formats
• Text is a bitmap; not encoded
• OCR makes scans “machine-
readable”
Why special handling is required
All Searchable PDFs Aren’t Equally
Findable
– OCR Word Accuracy Matters
– Character vs. word accuracy
– 1 character error = 1 word error
– 1 char / 10,000 = .0001
– 1 word / 1,000 = .001
– Seemingly small differences in
OCR error rate lead to very large
differences in errors
– 98.25% vs. 96.05%
– 948 pages
– ~ 2% delta
– Over 12,000 more word errors
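The arithmetic behind these figures can be sketched in a few lines of Python. The page count and the two accuracy rates come from the slide. The figure of 600 words per page is an assumption made only for illustration (the slide does not state the word count), as is the independent-error model in `word_error_rate`. The ten characters per word used below is simply what the slide's 1-in-10,000 versus 1-in-1,000 comparison appears to imply.

```python
def word_error_rate(char_error_rate: float, chars_per_word: int) -> float:
    """Chance that a word contains at least one bad character,
    assuming character errors are independent (a simplifying assumption)."""
    return 1.0 - (1.0 - char_error_rate) ** chars_per_word

def word_errors(total_words: int, word_accuracy: float) -> int:
    """Expected number of misrecognized words at a given word accuracy."""
    return round(total_words * (1.0 - word_accuracy))

# One bad character per 10,000 becomes roughly one bad word per 1,000
# when words average about ten characters.
print(f"word error rate: {word_error_rate(0.0001, 10):.4f}")

PAGES = 948            # from the slide
WORDS_PER_PAGE = 600   # assumption, NOT from the slide
total_words = PAGES * WORDS_PER_PAGE

errors_high = word_errors(total_words, 0.9825)  # higher-accuracy engine
errors_low = word_errors(total_words, 0.9605)   # lower-accuracy engine
extra_errors = errors_low - errors_high

print(f"errors at 98.25%: {errors_high}")
print(f"errors at 96.05%: {errors_low}")
print(f"additional word errors: {extra_errors}")
```

Under that page-density assumption, a gap of about two percentage points yields more than 12,000 additional word errors, which is consistent with the slide.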
PDF Isn’t Just PDF Anymore
Improper handling may lead to data loss
Electronic “sticky
notes” & stamps
obscure text
behind
PDF contains
digitized pages
or elements
OCR software may
flatten the image,
i.e., convert PDF to
TIF, then OCR
PDF contains
native pages or
elements
Four Important Ways PDF is Different
1. PDF files can be assembled from multiple files.
2. PDF elements can be rearranged. Example: a user copied a snip of a
scanned document.
3. PDF files can have multiple layers.
4. PDF files can be “born digital” or created by a scanner.
Why These Differences Matter
When these properties converge with scanned
text, eDiscovery pitfalls arise that can lead to
inadvertent data loss through processing errors.
Single PDF Page Can Contain Both
“Born Digital” and Scanned Content
This PDF
contains 2
scanned pages.
Notes, Text Boxes, Callouts, Stamps,
Etc. Can Be Overlaid On Images
How Data Gets Inadvertently Destroyed
OCR may ‘flatten’ objects overlaid on text, making the text
underneath unreadable and metadata unsearchable
Why Is Data Destruction A Problem?
“Spoliation is the destruction or significant alteration of
evidence, or the failure to preserve property for another's
use as evidence in pending or reasonably foreseeable
litigation.”
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
Penalties for Spoliation
$2,750,000
United States v. Philip Morris USA, Inc.
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
What If The Data Was Destroyed
Unintentionally? I’m Okay, Right?
“The intent to alter or destroy electronic data is
not required for spoliation to occur.”
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
Other Reasons Data Loss is Bad
– Destruction of evidence – can affect the case outcome.
– Time wasted recreating lost data – reduced profitability.
– Client relations – loss of credibility and future business.
Spoliation isn’t the only concern
How common are OCR errors?
Typical OCR Word Accuracy Rates
Firms With Initiatives To Convert Paper
Documents And Process To Digital
[Chart: percentage of respondents (0% to 100%) answering Yes, No, or Don’t know.]
What Percentage Of The Knowledge
Workers Have Access To Scanners?
[Bar chart, 0% to 100%, by segment: Total, Education, Financial, Healthcare, Insurance, Legal. Data labels as extracted: 72.8%, 66.7%, 75.2%, 64.5%, 78.6%, 80.9%.]
The Use Of Scanning Within
Organizations
[Chart: percentage of respondents (0% to 100%) reporting that scanning is Increasing, Staying the same, or Decreasing.]
Why scanning is increasing
[Chart: percentage of respondents (0% to 100%) who have seen an increase in scanning, N = 158. Reasons: individuals are scanning more documents; more people have been given access to scanning; a mix of more documents being scanned and more people gaining access to scanning.]
What can be done to prevent OCR
errors from occurring
Image enhancement filters out defects
Halftone
Removal
Despeckle
Color
Background
Removal
Smooth Characters
Remove Gridlines
Selectively Processing PDF Files Isn’t a
Viable Strategy
– When a client delivers a disk with millions of
files, how do you know which PDFs
are “compound” and require special
handling?
– It’s like looking for needles in
a haystack.
Segregate All PDF Files Before
Conversion
– After collecting documents,…
– …but before processing, review and analysis
– Identify all PDF documents
– Move them to a separate directory to be run through OCR
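The segregation step above could be scripted along these lines. This is a minimal sketch using only Python's standard library. The function names, the 1,024-byte header window, and the numbered file naming are illustrative assumptions. The sketch copies rather than moves so the collected originals stay untouched, which is a design choice of this example and not something the slide prescribes. It also assumes `pdf_dir` sits outside `collection_dir`.

```python
import shutil
from pathlib import Path
from typing import List

PDF_MAGIC = b"%PDF-"  # marker that opens a PDF file header

def looks_like_pdf(path: Path) -> bool:
    """Identify a PDF by its content, not its extension."""
    try:
        with path.open("rb") as handle:
            return PDF_MAGIC in handle.read(1024)
    except OSError:
        return False

def segregate_pdfs(collection_dir: Path, pdf_dir: Path) -> List[Path]:
    """Copy every PDF found under collection_dir into pdf_dir.

    Returns the list of copies. copy2 preserves file timestamps where
    the platform allows; a numeric prefix avoids name collisions.
    """
    pdf_dir.mkdir(parents=True, exist_ok=True)
    candidates = sorted(p for p in collection_dir.rglob("*") if p.is_file())
    copies = []
    for index, path in enumerate(candidates):
        if looks_like_pdf(path):
            target = pdf_dir / f"{index:06d}_{path.name}"
            shutil.copy2(path, target)
            copies.append(target)
    return copies
```

A call such as `segregate_pdfs(Path("collection"), Path("pdf_for_ocr"))` (both directory names hypothetical) would leave every PDF, including ones with a wrong or missing extension, in one place ready for OCR.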
Pre-process all PDF files with OCR
– Text images are potentially hidden within files
– Newer OCR tools will make image text searchable,
without disturbing the rest of the data within the document
– Ensures all text within every PDF document can be
searched by eDiscovery system
– Should also convert to PDF/A at the same time so files
are ready for court filing
How to correct OCR errors when they
happen
Post processing error-correction
options
1. Proofreading – too time-intensive
2. Run an automated spell check
– Effective at identifying and correcting spelling errors
– Doesn’t solve contextual errors
• Example: “How is you day?”
• Spelled correctly but clearly wrong
• A search for “your” won’t find this instance
• No commercial solutions today solve this problem
• It helps to understand this potential problem
• Can either manually review and correct, or adjust the search strategy to
minimize the impact, e.g., with fuzzy search techniques
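The fuzzy-search idea can be sketched with Python's standard `difflib` module. The function name `fuzzy_hits` and the 0.75 similarity threshold are illustrative assumptions, not settings from any eDiscovery tool. A real review platform would expose its own fuzzy-matching options.

```python
from difflib import SequenceMatcher
from typing import List

def fuzzy_hits(term: str, text: str, threshold: float = 0.75) -> List[str]:
    """Return the words in `text` whose similarity to `term` meets the threshold.

    Similarity is difflib's ratio: 1.0 for identical strings, lower as they diverge.
    """
    hits = []
    for token in text.split():
        word = token.strip(".,;:!?\"'()").lower()
        if word and SequenceMatcher(None, term.lower(), word).ratio() >= threshold:
            hits.append(word)
    return hits

# The contextual error from the slide: "you" typed or recognized instead of "your".
print(fuzzy_hits("your", "How is you day?"))

# A transcription error of the kind OCR produces: "rn" read as "m".
print(fuzzy_hits("learning", "Machine leaming is fun"))
```

Lowering the threshold recovers more damaged words but also returns more false positives, so the setting is a trade-off between recall and review effort.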
Final Thought – Why You Should Care
General business / transaction documents: Purchase orders, invoices, contracts, employee identification documents…
Healthcare organizations / patient medical records: Physician’s notes, discharge summaries, test results, post-operative reports, etc.
Personal identification documents: Driver’s licenses, Social Security cards, professional certificates
Public institutions / police records: Incident / accident reports, police logs, court records…
Insurance & banking: Claims documents, medical records, financial records…
Schools: Inoculation records, transcripts, applications…
If it was important enough to print and then to scan, it is important!
Q&A
Next Steps
• A recording of today’s Webinar will be available shortly for you to review
at your leisure
• Contact Nuance with any questions you may have:
781-565-5000 or imaging@nuance.com
• For more information visit: http://www.nuance.com/for-business/by-industry/legal/legal-solution
• Try our 30-Day free trial of Power PDF at http://www.powerpdf.com
Thank you

More Related Content

Similar to Nuance-ACEDS May 21 OCR Webcast

Streamline your data security with user-friendly DRM controls.
Streamline your data security with user-friendly DRM controls.Streamline your data security with user-friendly DRM controls.
Streamline your data security with user-friendly DRM controls.Home
 
Effortless Data Security: Unlock the Power of DRM in Virtual Data Rooms
Effortless Data Security: Unlock the Power of DRM in Virtual Data RoomsEffortless Data Security: Unlock the Power of DRM in Virtual Data Rooms
Effortless Data Security: Unlock the Power of DRM in Virtual Data RoomsHome
 
Elevate Data Protection: Embrace DRM in Virtual Data Rooms
Elevate Data Protection: Embrace DRM in Virtual Data RoomsElevate Data Protection: Embrace DRM in Virtual Data Rooms
Elevate Data Protection: Embrace DRM in Virtual Data RoomsHome
 
Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...
Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...
Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...Home
 
Experience peace of mind with our comprehensive DRM controls.
Experience peace of mind with our comprehensive DRM controls.Experience peace of mind with our comprehensive DRM controls.
Experience peace of mind with our comprehensive DRM controls.Home
 
PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS!
 PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS! PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS!
PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS!Home
 
How Your Document Habits are Destroying Productivity
How Your Document Habits are Destroying Productivity How Your Document Habits are Destroying Productivity
How Your Document Habits are Destroying Productivity Nitro, Inc.
 
3 Advantages (And 1 Disadvantage) Of Edge Computing
3 Advantages (And 1 Disadvantage) Of Edge Computing3 Advantages (And 1 Disadvantage) Of Edge Computing
3 Advantages (And 1 Disadvantage) Of Edge ComputingBernard Marr
 
ICT's role in Successful Studiies
ICT's role in Successful StudiiesICT's role in Successful Studiies
ICT's role in Successful Studiiesakinwunmi adelanwa
 
Digital Archiving Solutions Presentation English
Digital Archiving Solutions Presentation EnglishDigital Archiving Solutions Presentation English
Digital Archiving Solutions Presentation Englishamangu
 
Why and how Law Firms go Paperless
Why and how Law Firms go PaperlessWhy and how Law Firms go Paperless
Why and how Law Firms go PaperlessFrancois Thevenot
 
Doing Information Management Right
Doing Information Management Right Doing Information Management Right
Doing Information Management Right Lane Severson
 
2013.01.17 the mechanics of setting up and running a successful law practice
2013.01.17 the mechanics of setting up and running a successful law practice2013.01.17 the mechanics of setting up and running a successful law practice
2013.01.17 the mechanics of setting up and running a successful law practiceAlan Klevan
 
ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...
ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...
ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...Michael Lew
 
Starting the Small Case: Technical Considerations
Starting the Small Case: Technical ConsiderationsStarting the Small Case: Technical Considerations
Starting the Small Case: Technical ConsiderationsMuruga J
 
Endeca business white paper for media and publishing
Endeca business white paper for media and publishingEndeca business white paper for media and publishing
Endeca business white paper for media and publishingDeidre Caldbeck
 
End-user computing is not a trend, it's a transformational shift
End-user computing is not a trend, it's a transformational shiftEnd-user computing is not a trend, it's a transformational shift
End-user computing is not a trend, it's a transformational shiftUni Systems S.M.S.A.
 
Going green kl presentation
Going green kl presentationGoing green kl presentation
Going green kl presentationPeter1020
 

Similar to Nuance-ACEDS May 21 OCR Webcast (20)

Streamline your data security with user-friendly DRM controls.
Streamline your data security with user-friendly DRM controls.Streamline your data security with user-friendly DRM controls.
Streamline your data security with user-friendly DRM controls.
 
Effortless Data Security: Unlock the Power of DRM in Virtual Data Rooms
Effortless Data Security: Unlock the Power of DRM in Virtual Data RoomsEffortless Data Security: Unlock the Power of DRM in Virtual Data Rooms
Effortless Data Security: Unlock the Power of DRM in Virtual Data Rooms
 
Elevate Data Protection: Embrace DRM in Virtual Data Rooms
Elevate Data Protection: Embrace DRM in Virtual Data RoomsElevate Data Protection: Embrace DRM in Virtual Data Rooms
Elevate Data Protection: Embrace DRM in Virtual Data Rooms
 
Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...
Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...
Achieve regulatory compliance effortlessly with DRM controls. Simplify your c...
 
Experience peace of mind with our comprehensive DRM controls.
Experience peace of mind with our comprehensive DRM controls.Experience peace of mind with our comprehensive DRM controls.
Experience peace of mind with our comprehensive DRM controls.
 
PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS!
 PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS! PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS!
PROTECT YOUR DIGITAL ASSETS WITH DRM CONTROLS!
 
How Your Document Habits are Destroying Productivity
How Your Document Habits are Destroying Productivity How Your Document Habits are Destroying Productivity
How Your Document Habits are Destroying Productivity
 
3 Advantages (And 1 Disadvantage) Of Edge Computing
3 Advantages (And 1 Disadvantage) Of Edge Computing3 Advantages (And 1 Disadvantage) Of Edge Computing
3 Advantages (And 1 Disadvantage) Of Edge Computing
 
ICT's role in Successful Studiies
ICT's role in Successful StudiiesICT's role in Successful Studiies
ICT's role in Successful Studiies
 
Digital Archiving Solutions Presentation English
Digital Archiving Solutions Presentation EnglishDigital Archiving Solutions Presentation English
Digital Archiving Solutions Presentation English
 
Why and how Law Firms go Paperless
Why and how Law Firms go PaperlessWhy and how Law Firms go Paperless
Why and how Law Firms go Paperless
 
Doing Information Management Right
Doing Information Management Right Doing Information Management Right
Doing Information Management Right
 
2013.01.17 the mechanics of setting up and running a successful law practice
2013.01.17 the mechanics of setting up and running a successful law practice2013.01.17 the mechanics of setting up and running a successful law practice
2013.01.17 the mechanics of setting up and running a successful law practice
 
ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...
ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...
ICDL Module 1 - Concepts of ICT (Information and Communication Technology) - ...
 
Ict
IctIct
Ict
 
Starting the Small Case: Technical Considerations
Starting the Small Case: Technical ConsiderationsStarting the Small Case: Technical Considerations
Starting the Small Case: Technical Considerations
 
Endeca business white paper for media and publishing
Endeca business white paper for media and publishingEndeca business white paper for media and publishing
Endeca business white paper for media and publishing
 
End-user computing is not a trend, it's a transformational shift
End-user computing is not a trend, it's a transformational shiftEnd-user computing is not a trend, it's a transformational shift
End-user computing is not a trend, it's a transformational shift
 
Going green kl presentation
Going green kl presentationGoing green kl presentation
Going green kl presentation
 
PROSTEP 3D PDF Technologies
PROSTEP 3D PDF TechnologiesPROSTEP 3D PDF Technologies
PROSTEP 3D PDF Technologies
 

More from Robbie Hilson

ACEDS-Stroock 9-4-14 Webcast Presentation
ACEDS-Stroock 9-4-14 Webcast Presentation ACEDS-Stroock 9-4-14 Webcast Presentation
ACEDS-Stroock 9-4-14 Webcast Presentation Robbie Hilson
 
ACEDS-TRU Staffing Partners 7-23-14 Webcast
ACEDS-TRU Staffing Partners 7-23-14 WebcastACEDS-TRU Staffing Partners 7-23-14 Webcast
ACEDS-TRU Staffing Partners 7-23-14 WebcastRobbie Hilson
 
The Ethics of Predictive Coding
The Ethics of Predictive CodingThe Ethics of Predictive Coding
The Ethics of Predictive CodingRobbie Hilson
 
Aceds presentation formatted final v2
Aceds presentation formatted final v2Aceds presentation formatted final v2
Aceds presentation formatted final v2Robbie Hilson
 
Ppt day in the life of a case manager final v 2
Ppt day in the life of a case manager    final v 2Ppt day in the life of a case manager    final v 2
Ppt day in the life of a case manager final v 2Robbie Hilson
 
Slides from ACEDS-Xact Data Discovery 5-7-14 Webcast
Slides from ACEDS-Xact Data Discovery 5-7-14 WebcastSlides from ACEDS-Xact Data Discovery 5-7-14 Webcast
Slides from ACEDS-Xact Data Discovery 5-7-14 WebcastRobbie Hilson
 
Xact aceds 5-7-14 webcast
Xact aceds 5-7-14 webcastXact aceds 5-7-14 webcast
Xact aceds 5-7-14 webcastRobbie Hilson
 

More from Robbie Hilson (7)

ACEDS-Stroock 9-4-14 Webcast Presentation
ACEDS-Stroock 9-4-14 Webcast Presentation ACEDS-Stroock 9-4-14 Webcast Presentation
ACEDS-Stroock 9-4-14 Webcast Presentation
 
ACEDS-TRU Staffing Partners 7-23-14 Webcast
ACEDS-TRU Staffing Partners 7-23-14 WebcastACEDS-TRU Staffing Partners 7-23-14 Webcast
ACEDS-TRU Staffing Partners 7-23-14 Webcast
 
The Ethics of Predictive Coding
The Ethics of Predictive CodingThe Ethics of Predictive Coding
The Ethics of Predictive Coding
 
Aceds presentation formatted final v2
Aceds presentation formatted final v2Aceds presentation formatted final v2
Aceds presentation formatted final v2
 
Ppt day in the life of a case manager final v 2
Ppt day in the life of a case manager    final v 2Ppt day in the life of a case manager    final v 2
Ppt day in the life of a case manager final v 2
 
Slides from ACEDS-Xact Data Discovery 5-7-14 Webcast
Slides from ACEDS-Xact Data Discovery 5-7-14 WebcastSlides from ACEDS-Xact Data Discovery 5-7-14 Webcast
Slides from ACEDS-Xact Data Discovery 5-7-14 Webcast
 
Xact aceds 5-7-14 webcast
Xact aceds 5-7-14 webcastXact aceds 5-7-14 webcast
Xact aceds 5-7-14 webcast
 

Recently uploaded

Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 

Recently uploaded (20)

Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners

Nuance-ACEDS May 21 OCR Webcast

  • 1. © 2002-2013 Nuance Communications, Inc. All rights reserved. Page 1 How to Keep OCR Errors from Spoiling Your eDiscovery Party ACEDS Webinar, May 21st 2014
  • 2. ACEDS Membership Benefits: Training, Resources and Networking for the E-Discovery Community. Join today at aceds.org/join or call ACEDS Member Services at 786-517-2701. Benefits: Exclusive News and Analysis; Weekly Web Seminars; Podcasts; On-Demand Training; Networking; Resources; Jobs Board & Career Center; bits + bytes Newsletter; CEDS Certification; and much more. “ACEDS provides an excellent, much needed forum… to train, network and stay current on critical information.” (Kimarie Stratos, General Counsel, Memorial Health Systems, Ft. Lauderdale)
  • 3. Speaker Introduction: Greg Gies, Director, Product Marketing, Imaging, Nuance Communications. Leads go-to-market planning for Nuance’s print, capture & PDF solutions.
  • 4. Agenda: – What OCR is. – What are OCR errors and what causes them. – Why eDiscovery professionals should care. – How common these problems are. – What can be done to prevent OCR errors. – How to correct OCR errors when they happen.
  • 5. What is OCR? Wikipedia: “Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine encoded / computer-readable text.” 2009 Computerworld article: “Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text characters into character codes.”
  • 6. I say, “OCR is the digital transcription of bitmaps containing machine-printed text into encoded text characters, using a coding scheme such as ASCII, which among other capabilities enables indexing software to decipher textual elements contained within bitmaps. Encoded text is ‘searchable’ text.”
  • 7. What are OCR errors and what causes them?
  • 8. Types of OCR errors. Transcription errors: result: misspelled words; impact: unsearchable. Proportional fonts are especially problematic. Example: “learning” becomes “leaming”. Formatting errors: result: poor legibility; impact: unsearchable. Example: “learning” becomes “l e a r n i n g”. Deleted metadata: result: data loss; impact: metadata is unsearchable.
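The two text-level error types on this slide can be illustrated with a small Python sketch. It is only an illustration: the sample pages and the helper names `exact_hits` and `collapse_spaced_letters` are hypothetical, and the spaced-letter repair is a simple heuristic assumed here, not a method described in the webinar.

```python
import re


def collapse_spaced_letters(text: str) -> str:
    """Rejoin runs of three or more single characters separated by whitespace."""
    pattern = re.compile(r"\b(?:\w\s){2,}\w\b")
    return pattern.sub(lambda m: re.sub(r"\s", "", m.group(0)), text)


def exact_hits(term: str, documents: list[str]) -> list[int]:
    """Return the indexes of documents that contain the term as a plain substring."""
    return [i for i, doc in enumerate(documents) if term.lower() in doc.lower()]


# Hypothetical OCR output for three pages that all said "machine learning methods".
ocr_pages = [
    "machine learning methods",          # clean transcription
    "machine leaming methods",           # transcription error: "rn" read as "m"
    "machine l e a r n i n g methods",   # formatting error: letters spaced out
]

# Only the clean page matches an exact search.
print(exact_hits("learning", ocr_pages))

# Collapsing spaced letters recovers the formatting error, not the misspelling.
repaired = [collapse_spaced_letters(page) for page in ocr_pages]
print(exact_hits("learning", repaired))
```

The misspelled page stays invisible to exact search even after the repair, which is why transcription errors need either correction or a more tolerant search strategy.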
  • 9. Some image defects and causes. Faulty printing equipment: defects: toner specks, vertical lines, toner smear, gray background, page skew, light/dark print; causes: defective toner cartridge, worn rollers, wrong settings, worn pickup roller, low toner or ink cartridge, clogged nozzles. Paper or form elements: defects: halftones, vertical/horizontal lines, noise; causes: colored paper, carbon copies, shaded and lined forms, low/high contrast background. Faulty scanning equipment: defects: specks, page skew, light/dark image; causes: dirty platen, feeder misadjusted, worn pickup roller, low resolution, misfeed.
  • 10. Examples of image defects: Halftone; Specks; Color Background; Fuzzy edges; Gridlines.
  • 11. Cause & effect: skewed image; toner specks; dark text.
  • 12. © 2002-2013 Nuance Communications, Inc. All rights reserved. Page 12 OCR results •..... ; ., 1HIt#~!':tl'Ol<fLoi;l;i~: •••• 'do ••••• ~""r.bf .. . ,200., by- ••• ~ •• ODNEY D~"'!Ob$Y~s.IN(; ..• ~~"""""" Iaws "'''''.S_",c.rm",,;, ~,-"'"'_U "s.,.",) ••• JOE STANDUi' ~Irnown •• """,,"). _ and SclIa _<01"'"""y 'l>eknoWhllereinasutlJ.e paiti~S",' . . . ..... .. . "
  • 13. Formatting errors: Original Image; OCR conversion to .docx.
  • 14. Why should eDiscovery professionals care about OCR errors?
  • 15. PDF Files Created by Scanners Aren’t Necessarily Searchable: 1. Unlike PDF files that were “born digital”, a PDF file from a scanner is an image of a paper document. 2. While the text in an image may appear similar to or the same as text in a “born digital” PDF, it’s invisible as far as search algorithms are concerned. 3. Images aren’t searchable until processed with OCR. 4. “Searchable PDF” contains OCR output in its metadata.
  • 16. Digitized Versus Native Documents: why special handling is required. Electronic: • Originated online • Many native file formats • Encoded text – “machine-readable” • eDiscovery software designed for these documents. Digitized: • Scans of paper originals • TIFF & PDF common formats • Text is a bitmap; not encoded • OCR makes scans “machine-readable”.
  • 17. All Searchable PDFs Aren’t Equally Findable
  • 18. All Searchable PDFs Aren’t Equally Findable – OCR Word Accuracy Matters – Character vs. word accuracy – 1 character error = 1 word error – 1 char / 10,000 = .0001 – 1 word / 1,000 = .001 – Seemingly small differences in OCR error rate lead to very large differences in errors – 98.25% vs. 96.05% – 948 pages – ~2% delta – Over 12,000 more word errors
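The arithmetic behind this slide can be checked with a short Python sketch. The slide gives the two word-accuracy rates, the page count and the resulting "over 12,000" figure, but not the number of words per page, so `WORDS_PER_PAGE = 600` below is purely an assumption for illustration; the helper names are hypothetical as well.

```python
def word_errors(total_words: int, word_accuracy: float) -> float:
    """Expected number of misrecognized words at a given word accuracy."""
    return total_words * (1.0 - word_accuracy)


def implied_words_per_page(extra_errors: float, pages: int,
                           acc_high: float, acc_low: float) -> float:
    """Words per page needed for the accuracy gap to yield extra_errors."""
    return extra_errors / (pages * (acc_high - acc_low))


PAGES = 948                         # from the slide
ACC_HIGH, ACC_LOW = 0.9825, 0.9605  # from the slide
WORDS_PER_PAGE = 600                # assumption, not stated on the slide

total_words = PAGES * WORDS_PER_PAGE
extra = word_errors(total_words, ACC_LOW) - word_errors(total_words, ACC_HIGH)
print(f"Extra word errors at {WORDS_PER_PAGE} words/page: {extra:.0f}")

# Working backwards from the slide's figure of 12,000 extra errors.
density = implied_words_per_page(12_000, PAGES, ACC_HIGH, ACC_LOW)
print(f"Words/page implied by 12,000 extra errors: {density:.0f}")
```

Under these assumptions, a gap of 2.2 percentage points exceeds 12,000 additional word errors once pages average roughly 575 words or more.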
  • 19. PDF Isn’t Just PDF Anymore. Improper handling may lead to data loss. Callouts: electronic “sticky notes” & stamps obscure text behind; PDF contains digitized pages or elements; OCR software may flatten the image, i.e., convert PDF to TIF, then OCR; PDF contains native pages or elements.
  • 20. Four Important Ways PDF is Different: 1. PDF files can be assembled from multiple files.
  • 21. Four Important Ways PDF is Different: 2. PDF elements can be rearranged. User copied snip of scanned document.
  • 22. Four Important Ways PDF is Different: 3. PDF files can have multiple layers.
  • 23. Four Important Ways PDF is Different: 4. PDF files can be “born digitally” or created by a scanner.
  • 24. Why These Differences Matter: When these properties converge with scanned text, eDiscovery pitfalls arise that can lead to inadvertent data loss due to processing errors.
  • 25. Single PDF Page Can Contain Both “Born Digital” and Scanned Content. This PDF contains 2 scanned pages.
  • 26. Notes, Text Boxes, Callouts, Stamps, Etc… Can Be Overlaid On Images
  • 27. How Data Gets Inadvertently Destroyed: OCR may ‘flatten’ objects overlaid on text, making the text underneath unreadable and metadata unsearchable.
  • 28. Why Is Data Destruction A Problem? “Spoliation is the destruction or significant alteration of evidence, or the failure to preserve property for another's use as evidence in pending or reasonably foreseeable litigation.” Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
  • 29. Penalties for Spoliation: $2,750,000, United States v. Philip Morris USA, Inc. Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
  • 30. What If The Data Was Destroyed Unintentionally? I’m Okay, Right? “The intent to alter or destroy electronic data is not required for spoliation to occur.” Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
  • 31. Other Reasons Data Loss is Bad (spoliation isn’t the only concern): – Destruction of evidence: can affect the case outcome. – Time wasted recreating lost data: reduced profitability. – Client relations: loss of credibility and future business.
  • 32. How common are OCR errors?
  • 33. Typical OCR Word Accuracy Rates
  • 34. Firms With Initiatives To Convert Paper Documents And Process To Digital. [Chart: percentage of respondents, 0% to 100%; categories: Yes, No, Don’t know.]
  • 35. What Percentage Of The Knowledge Workers Have Access To Scanners? [Chart: 0% to 100%; axis label: Workers; categories as extracted: Total, Education, Financial, Healthcare, Insurance, Legal; values as extracted: 72.8%, 66.7%, 75.2%, 64.5%, 78.6%, 80.9%.]
  • 36. The Use Of Scanning Within Organizations. [Chart: percentage of respondents, 0% to 100%; categories: Increasing, Stay the same, Decreasing.]
  • 37. Why scanning is increasing. [Chart: percentage of respondents, 0% to 100%; categories: More people have been given access to scanning; Mix of more documents being scanned and more people gaining access to scanning; Individuals are scanning more documents. N = 158 respondents who have seen an increase in scanning.]
  • 38. What can be done to prevent OCR errors from occurring?
  • 39. Image enhancement filters out defects: Halftone Removal; Despeckle; Color Background Removal; Smooth Characters; Remove Gridlines.
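As a rough illustration of what a despeckle filter does, the Python sketch below applies a 3x3 majority (median) filter to a tiny binary image. This is a generic textbook technique chosen for illustration; the slide does not say how any particular product implements despeckling, and the function name `despeckle` and the sample image are hypothetical.

```python
def despeckle(image: list[list[int]]) -> list[list[int]]:
    """Apply a 3x3 majority filter to a binary image (1 = ink, 0 = paper)."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Collect the pixel and its neighbours, clipped at the image border.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # Keep ink only where ink is the strict majority of the window.
            cleaned[y][x] = 1 if 2 * sum(window) > len(window) else 0
    return cleaned


# A three-pixel-wide vertical stroke plus one isolated speck at row 2, column 6.
stroke_row = [0, 1, 1, 1, 0, 0, 0, 0]
noisy = [stroke_row[:] for _ in range(6)]
noisy[2][6] = 1

for row in despeckle(noisy):
    print("".join("#" if pixel else "." for pixel in row))
```

The isolated speck disappears while the thick stroke survives. A filter this simple would also erase one-pixel-wide strokes, so it is a sketch of the idea rather than something to run on real scans.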
  • 40. Selectively Processing PDF Files Isn’t a Viable Strategy – When a client delivers a disk with millions of files, how do you know which PDFs are “compound” and require special handling? – It’s kind of like looking for needles in a haystack.
  • 41. Segregate All PDF Files Before Conversion – After collecting documents,… – …but before processing, review and analysis – Identify all PDF documents – Move them to a separate directory to be run through OCR
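The segregation step on this slide could be scripted along the following lines. This is a minimal Python sketch under stated assumptions: the function names, the directory layout and the choice to detect PDFs by either the `.pdf` extension or the `%PDF-` file header are illustrative, not part of the webinar.

```python
import shutil
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # header bytes at the start of a typical PDF file


def is_pdf(path: Path) -> bool:
    """Treat a file as PDF if it has a .pdf extension or starts with the PDF header."""
    if path.suffix.lower() == ".pdf":
        return True
    try:
        with path.open("rb") as handle:
            return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        return False


def segregate_pdfs(collection_root: Path, ocr_queue: Path) -> list[Path]:
    """Move every PDF under collection_root into ocr_queue, keeping relative paths."""
    moved = []
    # List the files first so that moving them does not disturb the directory walk.
    candidates = [p for p in collection_root.rglob("*") if p.is_file()]
    for source in candidates:
        if ocr_queue in source.parents:
            continue  # already in the queue (if the queue lives inside the root)
        if is_pdf(source):
            target = ocr_queue / source.relative_to(collection_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))
            moved.append(target)
    return moved
```

A call such as `segregate_pdfs(Path("collection"), Path("ocr_queue"))` (both paths hypothetical) returns the new locations of the moved files. Copying with `shutil.copy2` instead of moving is a possible variation if the original collection has to stay untouched.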
  • 42. Pre-process all PDF files with OCR – Text images are potentially hidden within files – Newer OCR tools will make image text searchable, without disturbing the rest of the data within the document – Ensures all text within every PDF document can be searched by eDiscovery system – Should also convert to PDF/A at the same time so files are ready for court filing
  • 43. How to correct OCR errors when they happen
  • 44. Post processing error-correction options: 1. Proofreading – too time-intensive. 2. Run an automated spell check – Effective at identifying and correcting spelling errors – Doesn’t solve contextual errors: example, “How is you day?”; spelled correctly but clearly is wrong; if searching for “your”, this instance won’t be found; no commercial solutions today solve this problem; helps to understand this potential problem; can either manually review and correct or adjust search strategy to minimize the impact, e.g., fuzzy search techniques.
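One way to picture the "fuzzy search techniques" mentioned on this slide is similarity matching between the search term and each word in the OCR text. The Python sketch below uses `difflib.SequenceMatcher` from the standard library; the function name `fuzzy_hits`, the 0.75 threshold and the sample sentences are assumptions for illustration and do not describe how any eDiscovery platform implements fuzzy search.

```python
from difflib import SequenceMatcher


def fuzzy_hits(term: str, text: str, threshold: float = 0.75) -> list[tuple[str, float]]:
    """Return (word, similarity) pairs for words that resemble the search term."""
    hits = []
    for word in text.lower().split():
        token = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        score = SequenceMatcher(None, term.lower(), token).ratio()
        if score >= threshold:
            hits.append((token, round(score, 2)))
    return hits


# The misspelling "leaming" is found alongside the correct word.
print(fuzzy_hits("learning", "Notes on machine leaming and deep learning."))

# The contextual error from the slide: "you" where "your" was meant.
print(fuzzy_hits("your", "How is you day?"))
```

Lowering the threshold finds more damaged words but also returns more false positives (every legitimate “you” would match a search for “your” here), so the setting is a trade-off between recall and review effort.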
  • 45. Final Thought – Why You Should Care. General business / transaction documents: purchase orders, invoices, contracts, employee identification documents… Healthcare organizations / patient medical records: physician’s notes, discharge summaries, test results, post operative reports, etc… Personal identification documents: driver’s licenses, social security cards, professional certificates. Public institutions / police records: incident / accident reports, police logs, court records… Insurance & banking: claims documents, medical records, financial records… Schools: inoculation records, transcripts, applications… If it was important enough to print, then to scan, it is important!
  • 46. Q&A
  • 47. Next Steps • A recording of today’s Webinar will be available shortly for you to review at your leisure • Contact Nuance with any questions you may have: 781-565-5000 or imaging@nuance.com • For more information visit: http://www.nuance.com/for-business/by-industry/legal/legal-solution • Try our 30-Day free trial of Power PDF at http://www.powerpdf.com
  • 48. Thank you