Workshop on international standards, contemporary technologies and regional cooperation
Noumea, New Caledonia, 4 – 8 February 2008
Introduction to Optical Character
Recognition (OCR)
Summary
 Overview of OCR
 System Requirements
 Advantages and Disadvantages
 Operation and Management
 Questionnaire Design and Preparation
 OCR Field Operation
 OCR Country Outlook
OCR (Optical Character Recognition)
 Function & Features of OCR/ICR
 ICR, OCR and OMR Compared
 Optical Mark Reader (OMR)
 OCR/ ICR
OCR (Optical Character Recognition)
 Also referred to as Optical Character Reader
 “…a system that provides a full alphanumeric
recognition of printed or handwritten characters at
electronic speed by simply scanning the form.” (UNESCAP, Pop-IT
project, 1997-2001)
 Intelligent Character Recognition (ICR) is used to
describe the process of interpreting image data, in
particular alphanumeric text.
 OCR is sometimes also referred to as ICR.
Functions & Features of OCR
 Forms are fed through a scanner, and the recognition
engine of the OCR system interprets the images, turning
images of handwritten or printed characters into
ASCII data (machine-readable characters).
 The technology provides a complete form processing and
document capture solution.
 Allows an open, scalable workflow.
 Includes forms definition, scanning, image
pre-processing, and recognition capabilities.
ICR, OCR and OMR Differences
 ICR and OCR are recognition engines used with
imaging;
 OMR is a data collection technology that does
not require a recognition engine.
 OMR cannot recognize hand-printed or
machine-printed characters.
Optical Mark Reader (OMR)
 Forms
 OMR works with a specialized form that carries timing
tracks along one edge to tell the scanner where to read for
marks. The timing tracks look like black boxes along the
top or bottom of the form.
 The cut of the form must be very precise, and the bubbles
must be in the same location on every form.
 Storage
 With OMR, the image of a document is not scanned and
stored.
 Accuracy
 OMR is simpler than OCR.
 If designed properly, OMR is more accurate than OCR.
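Because the bubbles sit at the same location on every form, reading a mark needs no recognition engine: it is enough to check how dark each known position is. The sketch below illustrates that idea only. The small grid stands in for the reader's sensor output, and the names `fill_ratio`, `read_marks`, the box coordinates and the 0.5 threshold are all assumptions made for illustration, not part of any OMR product.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]           # 1 = dark reading, 0 = light reading
Box = Tuple[int, int, int, int]  # (top, left, height, width), hypothetical layout


def fill_ratio(grid: Grid, box: Box) -> float:
    """Fraction of dark readings inside one bubble position."""
    top, left, height, width = box
    dark = sum(
        grid[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )
    return dark / (height * width)


def read_marks(grid: Grid, bubbles: Dict[str, Box], threshold: float = 0.5) -> List[str]:
    """Return the names of the bubbles that are filled at least `threshold`."""
    return [name for name, box in bubbles.items() if fill_ratio(grid, box) >= threshold]


# A tiny 2 x 6 form: bubble "A" is filled, bubble "B" is empty.
form = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
bubbles = {"A": (0, 0, 2, 2), "B": (0, 4, 2, 2)}
marked = read_marks(form, bubbles)
```

If the form is cut or printed imprecisely, the fixed boxes no longer line up with the bubbles, which is why the slide stresses a precise cut.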
OCR/ ICR
 Forms
 OCR/ICR is more flexible, since no timing tracks or
block-like form IDs are required.
 The image can float on a page.
 ICR/OCR technology uses registration marks on the four
corners of a document to locate the image for recognition.
 Respondents place one character per box on this form.
 The use of a drop-out color reduces the size of the scanner’s
output and enhances accuracy.
 Storage/retrieval
 If the document needs to be electronically stored and
maintained, then OCR/ICR is needed.
 With OCR/ICR technologies, images can be scanned, indexed,
and written to optical media.
OMR-OCR/ICR Compared
System Requirements
 Minimum PC requirements:
 Processor: Pentium 200 MHz; RAM: 32 MB; Disk: 4 GB
 Form modules are designed to operate in batch
processing mode;
 They run on LAN and PC-based platforms and take full
advantage of the graphical user interface and 32-bit
processing power available with most Windows
versions.
 Software:
 OCR with ICR capability software
 Questionnaire Design Software
System Requirements (cont.)
 Scanner
 OCR scanners with minimum capacity:
 Duplex scanning
 Speed: 60 sheets/min
 Automatic Document Feeder (ADF):
Scanning can take a significant amount of time,
and the system lets the user scan ahead without
running the OCR at the same time.
Advantages and Disadvantages
 Advantages of Using Images Rather Than Paper
 Quicker processing; no moving or storage of questionnaires near
operators
 Savings in costs and efficiencies by not having the paper
questionnaires
 Scanning and recognition allow efficient management and planning
for the rest of the processing workload
 Reduced long-term storage requirements; questionnaires can be
destroyed after the initial scanning, recognition and repair
 Quick retrieval for editing and reprocessing
 Minimizes errors associated with physical handling of the
questionnaires
Advantages and Disadvantages
 Disadvantages of Using Images Rather Than Paper
 Accuracy
 While OCR technology can be effective in
converting handwritten or typed characters, it
is not as accurate as OMR for reading data
where users are simply marking forms
 Additional workload for data collectors
 OCR has severe limitations when it comes to human
handwriting
 Characters must be hand-printed as separate
characters in boxes
Operation and Management
 OCR Process Stages
 Document Scanning process
 Scanning speed is determined by the quality of the
scanners, the amount of non-drop-out color on the form, and
the quality, cleanliness and weight of the paper.
 Recognizing process
 The recognizing process interprets the images. The right
memory (dictionary) and the configured threshold
determine the accuracy of the ICR interpretation.
 Verifying Process
 Compares the interpreted value with the actual
image of the form.
 Processing can be in geographic order or in random order.
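One way to picture how a configured threshold links the recognizing and verifying stages is sketched below. It assumes that the recognition engine reports a confidence score between 0 and 1 for each character, which the slides do not state. The function names and the 0.90 threshold are hypothetical.

```python
from typing import Dict, List, Tuple

# One recognized field: a list of (character, confidence) pairs.
# The confidence score is an assumption about what the engine reports.
Field = List[Tuple[str, float]]


def needs_verification(field: Field, threshold: float) -> bool:
    """A field goes to an operator if any character falls below the threshold."""
    return any(confidence < threshold for _, confidence in field)


def route_fields(fields: Dict[str, Field], threshold: float = 0.90):
    """Split fields into accepted values and names queued for verification."""
    accepted: Dict[str, str] = {}
    to_verify: List[str] = []
    for name, field in fields.items():
        if needs_verification(field, threshold):
            to_verify.append(name)
        else:
            accepted[name] = "".join(char for char, _ in field)
    return accepted, to_verify


recognized = {
    "age": [("3", 0.99), ("4", 0.97)],
    "sex": [("1", 0.62)],  # weak read, should be checked against the image
}
accepted, to_verify = route_fields(recognized)
```

Raising the threshold sends more fields to operators and lowers the risk of accepting a wrong value; lowering it does the opposite.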
Operation and Management (cont.)
 Image Manipulation
 Electronic questionnaires can be sent to specialist operators then
back to the original operator if necessary
 Same questionnaire can be worked on simultaneously by two or
more persons
 Electronic questionnaires are readily available for post-census
analysis (easier access to questionnaires)
 Parts of several questionnaires can be shown on screen at once for
inter-record editing
 The relevant field book entry can be viewed on screen alongside
the questionnaires, which is helpful for coding and editing
Operation and Management (cont.)
 Coding Assistance
 Problems are simpler for the operator to identify
 Images of questions that are not captured (scanned but
not recognized) can help the coding process, e.g. light pencil.
 Operator can magnify images to read characters not discernible to
the naked eye
 Appropriate software ensures that the data is validated as the
forms are read.
 Checks to ensure selections on a form are filled in.
 Possible to distinguish between intended marks and marks that
have been erased.
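Validation as the forms are read can be as simple as checking each captured value against a rule. The sketch below is one possible shape for such a check. The field names, the rule table and the message wording are invented for illustration and are not taken from any particular package.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule table: field name -> (required?, set of allowed values).
Rules = Dict[str, Tuple[bool, Set[str]]]


def validate_record(record: Dict[str, str], rules: Rules) -> List[str]:
    """Return a list of problems found in one captured questionnaire record."""
    problems: List[str] = []
    for field, (required, allowed) in rules.items():
        value = record.get(field, "")
        if value == "":
            # Check that required selections on the form are filled in.
            if required:
                problems.append(f"{field}: blank")
        elif value not in allowed:
            problems.append(f"{field}: invalid value {value!r}")
    return problems


rules: Rules = {
    "sex": (True, {"1", "2"}),
    "age": (True, {str(n) for n in range(0, 121)}),
}
problems = validate_record({"sex": "3", "age": ""}, rules)
```

Records with a non-empty problem list would be routed to an operator together with the stored image.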
Operation and Management (cont.)
 OMR Scanner Speed
 Factors
 Skew: Each document is moved from an automatic
feeder into a scanner, and an angle of skew is
sometimes introduced.
 De-skew: Analyzes the image bitmap, then calculates
and returns the angle of skew, up to +/-25.
Example: de-skew is often expressed in %, which is the
pixel shift. 10% is a 20-pixel shift in a line of 200
pixels, or one tenth of an inch in an inch-long line.
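The percentage in the example above can be worked through in a few lines. The conversion to an angle is an added assumption: it treats the shift as the rise over the length of the line, so the angle is the arctangent of that ratio. Both function names are hypothetical.

```python
import math


def pixel_shift(skew_percent: float, line_length_px: float) -> float:
    """Pixel shift across a line of the given length, for a skew given in %."""
    return skew_percent * line_length_px / 100.0


def skew_angle_degrees(skew_percent: float) -> float:
    """Angle whose tangent is the skew ratio (assumed interpretation)."""
    return math.degrees(math.atan(skew_percent / 100.0))


shift = pixel_shift(10, 200)       # 20.0 pixels, as in the slide's example
angle = skew_angle_degrees(10)     # roughly 5.71 degrees
```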
Operation and Management (cont.)
 Landscape Detection and Auto Rotation:
 Landscape detection automatically detects
and rotates the appropriate images 90 degrees.
 White Page Detection:
 Normally, a double-sided scanner creates two
images per scanned page.
 However, if the back or front page is blank,
there is no need to store this image.
 White page detection allows the user to avoid
storing blank pages.
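A minimal sketch of white page detection follows, assuming each side of a sheet arrives as a bitonal grid (1 for a dark pixel, 0 for a light one). The 0.5% tolerance for stray specks and the function names are assumptions; a real scanner driver would expose its own setting.

```python
from typing import List

Image = List[List[int]]  # bitonal image: 1 = dark pixel, 0 = light pixel


def is_blank_page(image: Image, max_dark_fraction: float = 0.005) -> bool:
    """Treat a side as blank if almost none of its pixels are dark."""
    total = sum(len(row) for row in image)
    if total == 0:
        return True
    dark = sum(sum(row) for row in image)
    return dark / total <= max_dark_fraction


def sides_to_store(front: Image, back: Image) -> List[str]:
    """Names of the sides of one duplex-scanned sheet that are worth storing."""
    sides = (("front", front), ("back", back))
    return [name for name, image in sides if not is_blank_page(image)]


front = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
back = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
stored = sides_to_store(front, back)
```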
Operation and Management (cont.)
 Other Factors
 Automatic Image Registration
 De-Speckle and Shade Removal
 Character Enhancer
 Cost Savings
 Automatic processes to improve recognition
rates
 Voting techniques, Multiple engines, Learning
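The voting idea can be illustrated with a simple majority vote over the outputs of several recognition engines. This is a sketch only: it assumes every engine returns a string of the same length for the field, and the "?" placeholder for characters without agreement is an invented convention.

```python
from collections import Counter
from typing import List, Optional


def vote(candidates: List[str], min_agreement: int = 2) -> Optional[str]:
    """Return the most common candidate if enough engines agree, else None."""
    best, count = Counter(candidates).most_common(1)[0]
    return best if count >= min_agreement else None


def vote_field(engine_outputs: List[str], min_agreement: int = 2) -> str:
    """Vote character by character; '?' marks positions left for an operator."""
    return "".join(
        vote(list(chars), min_agreement) or "?"
        for chars in zip(*engine_outputs)
    )


# Three hypothetical engines read the same year-of-birth field.
result = vote_field(["1975", "1915", "1975"])
unresolved = vote_field(["12", "34", "56"])
```

Positions marked "?" would go to manual verification rather than being guessed.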
Questionnaire Design and
Preparation
 Drop Out Color
 Usually red. The drop-out color facility of an OCR system
allows the system to pick up only the meaningful
information from an OCR form.
 The system does not need to see anything printed in the
drop-out color, including tick boxes.
 The OCR system only needs to see the black parts,
and compares them to the form specifications to see which parts
are filled in or written on.
 Characters or Marks
 To speed up the data capture process
and reduce error rates, it is advisable to use marks or
“ticks” as much as possible
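The effect of a red drop-out color can be imitated in software by looking only at the red channel of a color scan: red ink is as bright as white paper in that channel and so disappears, while black or blue writing stays dark. In practice drop-out is often handled by the scanner itself. The sketch assumes RGB pixels with values from 0 to 255, and the threshold of 100 is arbitrary.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def to_bitonal(pixel: Pixel, dark_threshold: int = 100) -> int:
    """Return 1 (kept as a mark) if the pixel is dark in the red channel, else 0."""
    red, _green, _blue = pixel
    return 1 if red < dark_threshold else 0


def drop_out_red(row: List[Pixel]) -> List[int]:
    """Convert one row of a color scan to bitonal, dropping red printing."""
    return [to_bitonal(pixel) for pixel in row]


row = [
    (255, 255, 255),  # white paper
    (220, 40, 40),    # red box outline, printed in the drop-out color
    (20, 20, 20),     # black pen stroke
    (30, 30, 160),    # blue pen stroke
]
bits = drop_out_red(row)
```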
Questionnaire Design and
Preparation (cont.)
 How to Obtain Good Results of Scanning
 Select adequate paper quality and a reliable printing press.
 Use appropriate ink for the questionnaires, taking the drop-out color into account.
 Paper heavier than 80 grams per square meter can help avoid paper jams and
reading through to the other side of a page.
 Form Design Advice
 Number items to be included in a form; design the size of the boxes for each
character answer carefully.
 Define the drop-out color properly; use registration marks.
 Pre-print the codes near the place where the tick boxes are located.
 Maintain a consistent pattern for where the information to be collected is
located.
 Do not obscure the ticks and marks with titles, labels or
instructions.
 Avoid placing the answers for a field on a different page from its question;
avoid open-ended questions.
OCR Field Operation
 Training for Collection and Processing Staff
 Basic software and scanner operations, including
installation and troubleshooting.
 Applications, with emphasis on the development of
custom applications, including configuring
nonstandard forms
 Pre-marking of forms; use of overprinting to customize
forms
 Processing of surveys
 Creating custom output file formats
OCR Field Operation (cont.)
 Causes of OCR Reading Errors
 Forms in bad condition: dirty, folded, crumpled, etc.
 Forms fed into the OCR scanner at an angle rather than straight; forms incompletely filled in
 Reducing OCR Reading Errors
 Check the questionnaires for completeness and consistency; prepare your own memory (dictionary);
define permissible margins for OCR reading errors
 Particular Care in Writing Numbers or Letters
 One box contains only one character; characters should not extend outside their designated boxes; unnecessary
additions to characters such as points, decorative strokes, hooks, etc. are prohibited. Strokes should not end
with flourishes or extensions.
 All lines should be connected without breaks; all lines or dots should be written with the same pressure.
 Value Checking Steps: Verify that the information captured by OMR is the same as on the questionnaire.
 Control for Blank: Decide what type of control must be applied if the information is blank.
 Control steps should be taken if the information image is partial or missing, to assure the quality of
generated files.
 Missing Questionnaire: Make sure that all questionnaires are scanned
completely, with none missing and none duplicated.
 Control procedures therefore include producing control tables to compare with manual work.
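A control table for missing and duplicated questionnaires can be produced by comparing the identifiers that were scanned with the identifiers expected from the manual control list. The sketch below assumes each questionnaire carries a unique ID; the ID format and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def control_check(expected_ids: List[str], scanned_ids: List[str]) -> Dict[str, List[str]]:
    """Compare scanned questionnaire IDs against the expected list."""
    counts = Counter(scanned_ids)
    expected = set(expected_ids)
    seen = set(counts)
    return {
        "missing": sorted(expected - seen),        # expected but never scanned
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        "unexpected": sorted(seen - expected),     # scanned but not on the list
    }


report = control_check(
    expected_ids=["001", "002", "003", "004"],
    scanned_ids=["001", "003", "003", "009"],
)
```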
OCR Country Outlook
 Countries using optical mark recognition
 (Greece)
 Countries using optical character recognition
 (Croatia) in use for the next census round
 (Japan) outsources the entire process; in use for the next
census round
 Countries using both
 Belgium
 Countries planning to use OCR
 Tajikistan
 (Tonga) looking to introduce and use OCR for our next Census
OCR Country Outlook
 Common device/scanner and software used by
NSOs
 (Croatia) KODAK DS3520 bitonal scanners, IBM IFP
(Intelligent Forms Processing)
 (Greece) OMR devices/scanners were “axm
990/995” with FORM/ AXF/ ADELE+ software
 (New Zealand) Kodak scanners i830 and i7620 -
scanning and raw data capture process (recognition
aspect) were outsourced.- For the next census -end
scanning and data capture process will more than
likely be outsourced but it really is a variation to a
current supplier agreement.
 (Belgium) AGFA (high resolution) scanner
OCR in Use
 Editing method used for the census
 (Japan) cold-deck method, hot-deck method, etc.
 (Croatia) developed in-house: logical checking, with automatic
and manual correction
 (Greece) via PC: an editor (officer of N.S.S.G.) confirms or rejects
an inaccurate value or inputs a missing one.
 (New Zealand) mixture of micro and macro editing practices.
Individual responses may have range or validity edits, inter-field
edits and also inter-form edits (within a household).
Macro editing is particularly used during the data evaluation
process, and data may be reprocessed as a result.
OCR Country Outlook
 Common commercial or free software used in
OCR
 (Croatia) Uses ACTR (Automated Coding by Text
Recognition) for coding; software developed by
Statistics Canada.
 (Greece) Commercial software, after an open
bidding, according to the budgetary plan of the
population census
 (New Zealand) IBM Intelligent Forms Processing
(IFP) system through an established user agreement.
 (Belgium) IRIS (Image Recognition Integrated
Systems)
OCR Country Outlook
Concerns/issues with the use of optical
character recognition for data capture
for the census?
 (Japan) Speed of data capture and recognition,
recognition accuracy of Japanese characters, etc.
 (Greece) OMR: concerns related to the optical recognition of
numbers, the speed of optical recognition itself
and the electronic storage of the questionnaires.
 (Tajikistan) Getting equipment and training.
 (Samoa) Not enough financial support and
technical human resources.
THANK YOU!

More Related Content

Similar to 05a(1).ppt

IRJET- Text Recognization of Product for Blind Person using MATLAB
IRJET- Text Recognization of Product for Blind Person using MATLABIRJET- Text Recognization of Product for Blind Person using MATLAB
IRJET- Text Recognization of Product for Blind Person using MATLAB
IRJET Journal
 
Product Label Reading System for visually challenged people
Product Label Reading System for visually challenged peopleProduct Label Reading System for visually challenged people
Product Label Reading System for visually challenged people
IRJET Journal
 
IRJET- Offline Transcription using AI
IRJET-  	  Offline Transcription using AIIRJET-  	  Offline Transcription using AI
IRJET- Offline Transcription using AI
IRJET Journal
 
A Deep Learning Approach to Recognize Cursive Handwriting
A Deep Learning Approach to Recognize Cursive HandwritingA Deep Learning Approach to Recognize Cursive Handwriting
A Deep Learning Approach to Recognize Cursive Handwriting
IRJET Journal
 
50120130406005
5012013040600550120130406005
50120130406005
IAEME Publication
 
Real Time Character Recognition on FPGA for Braille Devices
Real Time Character Recognition on FPGA for Braille DevicesReal Time Character Recognition on FPGA for Braille Devices
Real Time Character Recognition on FPGA for Braille Devices
IRJET Journal
 
IRJET- Optical Character Recognition using Image Processing
IRJET-  	  Optical Character Recognition using Image ProcessingIRJET-  	  Optical Character Recognition using Image Processing
IRJET- Optical Character Recognition using Image Processing
IRJET Journal
 
Visual Product Identification For Blind Peoples
Visual Product Identification For Blind PeoplesVisual Product Identification For Blind Peoples
Visual Product Identification For Blind Peoples
IRJET Journal
 
OPTICAL CHARACTER RECOGNITION IN HEALTHCARE
OPTICAL CHARACTER RECOGNITION IN HEALTHCAREOPTICAL CHARACTER RECOGNITION IN HEALTHCARE
OPTICAL CHARACTER RECOGNITION IN HEALTHCARE
IRJET Journal
 
Text reader [OCR]
Text reader [OCR]Text reader [OCR]
Text reader [OCR]
MisbahUddin52
 
IRJET- Intelligent Character Recognition of Handwritten Characters
IRJET- Intelligent Character Recognition of Handwritten CharactersIRJET- Intelligent Character Recognition of Handwritten Characters
IRJET- Intelligent Character Recognition of Handwritten Characters
IRJET Journal
 
Optical Recognition of Handwritten Text
Optical Recognition of Handwritten TextOptical Recognition of Handwritten Text
Optical Recognition of Handwritten Text
IRJET Journal
 
Information Extraction from Product Labels: A Machine Vision Approach
Information Extraction from Product Labels: A Machine Vision ApproachInformation Extraction from Product Labels: A Machine Vision Approach
Information Extraction from Product Labels: A Machine Vision Approach
gerogepatton
 
INFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACH
INFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACHINFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACH
INFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACH
ijaia
 
Drive Paper Out of Your Processes
Drive Paper Out of Your ProcessesDrive Paper Out of Your Processes
Drive Paper Out of Your Processes
AIIM International
 
OCR 's Functions
OCR 's FunctionsOCR 's Functions
OCR 's Functions
prithvi764
 
New Age Digital Pen Presentation 05 2009
New Age Digital Pen Presentation 05 2009New Age Digital Pen Presentation 05 2009
New Age Digital Pen Presentation 05 2009
manos99
 
IRJET- Sign Language Interpreter
IRJET- Sign Language InterpreterIRJET- Sign Language Interpreter
IRJET- Sign Language Interpreter
IRJET Journal
 
Scanning 101 Standards
Scanning 101 StandardsScanning 101 Standards
Scanning 101 Standards
Jenel Farrell
 
Automated Identification of Road Identifications using CNN and Keras
Automated Identification of Road Identifications using CNN and KerasAutomated Identification of Road Identifications using CNN and Keras
Automated Identification of Road Identifications using CNN and Keras
IRJET Journal
 

Similar to 05a(1).ppt (20)

IRJET- Text Recognization of Product for Blind Person using MATLAB
IRJET- Text Recognization of Product for Blind Person using MATLABIRJET- Text Recognization of Product for Blind Person using MATLAB
IRJET- Text Recognization of Product for Blind Person using MATLAB
 
Product Label Reading System for visually challenged people
Product Label Reading System for visually challenged peopleProduct Label Reading System for visually challenged people
Product Label Reading System for visually challenged people
 
IRJET- Offline Transcription using AI
IRJET-  	  Offline Transcription using AIIRJET-  	  Offline Transcription using AI
IRJET- Offline Transcription using AI
 
A Deep Learning Approach to Recognize Cursive Handwriting
A Deep Learning Approach to Recognize Cursive HandwritingA Deep Learning Approach to Recognize Cursive Handwriting
A Deep Learning Approach to Recognize Cursive Handwriting
 
50120130406005
5012013040600550120130406005
50120130406005
 
Real Time Character Recognition on FPGA for Braille Devices
Real Time Character Recognition on FPGA for Braille DevicesReal Time Character Recognition on FPGA for Braille Devices
Real Time Character Recognition on FPGA for Braille Devices
 
IRJET- Optical Character Recognition using Image Processing
IRJET-  	  Optical Character Recognition using Image ProcessingIRJET-  	  Optical Character Recognition using Image Processing
IRJET- Optical Character Recognition using Image Processing
 
Visual Product Identification For Blind Peoples
Visual Product Identification For Blind PeoplesVisual Product Identification For Blind Peoples
Visual Product Identification For Blind Peoples
 
OPTICAL CHARACTER RECOGNITION IN HEALTHCARE
OPTICAL CHARACTER RECOGNITION IN HEALTHCAREOPTICAL CHARACTER RECOGNITION IN HEALTHCARE
OPTICAL CHARACTER RECOGNITION IN HEALTHCARE
 
Text reader [OCR]
Text reader [OCR]Text reader [OCR]
Text reader [OCR]
 
IRJET- Intelligent Character Recognition of Handwritten Characters
IRJET- Intelligent Character Recognition of Handwritten CharactersIRJET- Intelligent Character Recognition of Handwritten Characters
IRJET- Intelligent Character Recognition of Handwritten Characters
 
Optical Recognition of Handwritten Text
Optical Recognition of Handwritten TextOptical Recognition of Handwritten Text
Optical Recognition of Handwritten Text
 
Information Extraction from Product Labels: A Machine Vision Approach
Information Extraction from Product Labels: A Machine Vision ApproachInformation Extraction from Product Labels: A Machine Vision Approach
Information Extraction from Product Labels: A Machine Vision Approach
 
INFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACH
INFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACHINFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACH
INFORMATION EXTRACTION FROM PRODUCT LABELS: A MACHINE VISION APPROACH
 
Drive Paper Out of Your Processes
Drive Paper Out of Your ProcessesDrive Paper Out of Your Processes
Drive Paper Out of Your Processes
 
OCR 's Functions
OCR 's FunctionsOCR 's Functions
OCR 's Functions
 
New Age Digital Pen Presentation 05 2009
New Age Digital Pen Presentation 05 2009New Age Digital Pen Presentation 05 2009
New Age Digital Pen Presentation 05 2009
 
IRJET- Sign Language Interpreter
IRJET- Sign Language InterpreterIRJET- Sign Language Interpreter
IRJET- Sign Language Interpreter
 
Scanning 101 Standards
Scanning 101 StandardsScanning 101 Standards
Scanning 101 Standards
 
Automated Identification of Road Identifications using CNN and Keras
Automated Identification of Road Identifications using CNN and KerasAutomated Identification of Road Identifications using CNN and Keras
Automated Identification of Road Identifications using CNN and Keras
 

Recently uploaded

State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 

Recently uploaded (20)

State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important

05a(1).ppt

  • 1. Workshop on international standards, contemporary technologies and regional cooperation, Noumea, New Caledonia, 4 – 8 February 2008: Introduction to Optical Character Recognition (OCR)
  • 2. Summary
      ◦ Overview of OCR
      ◦ System Requirements
      ◦ Advantages and Disadvantages
      ◦ Operation and Management
      ◦ Questionnaire Design and Preparation
      ◦ OCR Field Operation
      ◦ OCR Country Outlook
  • 3. OCR (Optical Character Recognition)
      ◦ Function & Features of OCR/ICR
      ◦ ICR, OCR and OMR Compared
      ◦ Optical Mark Reader (OMR)
      ◦ OCR/ICR
  • 4. OCR (Optical Character Recognition)
      ◦ Also referred to as Optical Character Reader
      ◦ “…a system that provides a full alphanumeric recognition of printed or handwritten characters at electronic speed by simply scanning the form.” (UNESCAP, Pop-IT project, 1997-2001)
      ◦ Intelligent Character Recognition (ICR) is used to describe the process of interpreting image data, in particular alphanumeric text.
      ◦ OCR is sometimes referred to as ICR.
  • 5. Functions & Features of OCR
      ◦ Forms are scanned, and the recognition engine of the OCR system then interprets the images, turning images of handwritten or printed characters into ASCII data (machine-readable characters).
      ◦ The technology provides a complete form processing and document capture solution.
      ◦ Allows an open and scalable workflow.
      ◦ Includes forms definition, scanning, image pre-processing, and recognition capabilities.
  • 6. ICR, OCR and OMR Differences
      ◦ ICR and OCR are recognition engines used with imaging.
      ◦ OMR is a data collection technology that does not require a recognition engine.
      ◦ OMR cannot recognize hand-printed or machine-printed characters.
  • 7. Optical Mark Reader (OMR)
      ◦ Forms
          ▪ An OMR works with a specialized document that contains timing tracks along one edge of the form to tell the scanner where to read for marks. The timing tracks look like black boxes on the top or bottom of a form.
          ▪ The cut of the form is very precise, and the bubbles must be in the same location on every form.
      ◦ Storage
          ▪ With OMR, the image of a document is not scanned and stored.
      ◦ Accuracy
          ▪ OMR is simpler than OCR.
          ▪ When designed properly, OMR is more accurate than OCR.
  • 8. OCR/ICR
      ◦ Forms
          ▪ OCR/ICR is more flexible since no timing tracks or block-like form IDs are required.
          ▪ The image can float on a page.
          ▪ ICR/OCR technology uses registration marks on the four corners of a document to recognize an image. Respondents place one character per box on this form.
          ▪ The use of drop-out color reduces the size of the scanner’s output and enhances accuracy.
      ◦ Storage/retrieval
          ▪ If the document needs to be electronically stored and maintained, then OCR/ICR is needed.
          ▪ With OCR/ICR technologies, images can be scanned, indexed, and written to optical media.
  • 9. OMR-OCR/ICR Compared
  • 10. System Requirements
      ◦ Minimum PC requirements:
          ▪ Processor: Pentium 200 MHz; RAM: 32 MB; Disk: 4 GB
          ▪ Form modules are designed to operate in batch processing mode.
          ▪ Run on LAN and PC-based platforms and take full advantage of the graphical user interface and 32-bit processing power available with most Windows versions.
      ◦ Software:
          ▪ OCR software with ICR capability
          ▪ Questionnaire design software
  • 11. System Requirements (cont.)
      ◦ Scanner
          ▪ OCR scanners with minimum capacity:
              - Duplex scanning
              - Speed: 60 sheets/min
              - Automatic Document Feeder (ADF): scanning can take a significant amount of time, and the system lets the user scan without running the OCR at the same time.
  • 12. Advantages and Disadvantages
      ◦ Advantages of Using Images Rather Than Paper
          ▪ Quicker processing; no moving or storage of questionnaires near operators
          ▪ Cost savings and efficiencies from not handling the paper questionnaires
          ▪ Scanning and recognition allow efficient management and planning for the rest of the processing workload
          ▪ Reduced long-term storage requirements: questionnaires can be destroyed after the initial scanning, recognition and repair
          ▪ Quick retrieval for editing and reprocessing
          ▪ Minimizes errors associated with physical handling of the questionnaires
  • 13. Advantages and Disadvantages
      ◦ Disadvantages of Using Images Rather Than Paper
          ▪ Accuracy
              - While OCR technology can be effective in converting handwritten or typed characters, it is not as accurate as OMR, where users mark the forms directly.
          ▪ Additional workload for data collectors
              - OCR has severe limitations when it comes to human handwriting.
              - Characters must be hand-printed separately, one per box.
  • 14. Operation and Management
      ◦ OCR Process Stages
          ▪ Document scanning process
              - Scanning speed is determined by the quality of the scanner, the amount of non-drop-out color, and the paper's quality, cleanliness and weight.
          ▪ Recognition process
              - The recognition process interprets the images. The right memory (dictionary) and the configured threshold determine the accuracy of the ICR interpretation.
          ▪ Verification process
              - Compares the interpreted value with the actual image of the form.
              - Processing can be in geographic order or in random order.
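As a rough illustration of how a configured threshold might feed the verification stage, the sketch below routes each recognized field either to automatic acceptance or to an operator. The field names, the confidence scores and the 0.90 threshold are assumptions made for illustration; the slides do not say how any particular ICR product scores its output.

```python
from typing import NamedTuple


class RecognizedField(NamedTuple):
    name: str          # hypothetical field label on the form
    value: str         # text returned by the recognition engine
    confidence: float  # assumed engine score between 0.0 and 1.0


def split_for_verification(fields, threshold=0.90):
    """Return (accepted, to_verify) lists based on the confidence threshold."""
    accepted = [f for f in fields if f.confidence >= threshold]
    to_verify = [f for f in fields if f.confidence < threshold]
    return accepted, to_verify


fields = [
    RecognizedField("age", "34", 0.98),
    RecognizedField("surname", "T0NGA", 0.61),
]
accepted, to_verify = split_for_verification(fields)
```

Raising the threshold sends more fields to operators for comparison against the form image; lowering it accepts more values unchecked.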
  • 15. Operation and Management (cont.)
      ◦ Image Manipulation
          ▪ Electronic questionnaires can be sent to specialist operators and then back to the original operator if necessary
          ▪ The same questionnaire can be worked on simultaneously by two or more people
          ▪ Electronic questionnaires are readily available for post-census analysis (easier access to questionnaires)
          ▪ Parts of several questionnaires can be shown on screen at once for inter-record editing
          ▪ The relevant field book entry can be viewed on screen alongside the questionnaires, which helps coding and editing
  • 16. Operation and Management (cont.)
      ◦ Coding Assistance
          ▪ Problems are simpler for the operator to identify
          ▪ Images of questions that will not be captured (scanned but not recognized) can be used to help the coding process, e.g. light pencil
          ▪ Operators can magnify images to read characters not discernible to the naked eye
          ▪ Appropriate software ensures that the data is validated as the forms are read
          ▪ Checks ensure that selections on a form are filled in
          ▪ It is possible to distinguish between intended marks and marks that have been erased
  • 17. Operation and Management (cont.)
      ◦ OMR Scanner Speed
          ▪ Factors
              - Skew: each document is moved from an automatic feeder into a scanner, and an angle of skew is sometimes introduced.
              - De-skew: analyzes the image bitmap, then calculates and returns the angle of skew, up to +/-25. Example: de-skew is often expressed as a percentage, which is the pixel shift. 10% is a 20-pixel shift in a line of 200 pixels, or one tenth of an inch in an inch-long line.
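The percentage example above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are made up for illustration, and the conversion from percentage to an angle (the arctangent of shift over length) is a geometric reading added here, not something the slide states.

```python
import math


def pixel_shift(line_length_px: int, skew_percent: float) -> float:
    """Shift, in pixels, across a line of the given length."""
    return line_length_px * skew_percent / 100.0


def skew_angle_degrees(skew_percent: float) -> float:
    """Angle whose tangent equals the skew ratio (shift / length)."""
    return math.degrees(math.atan(skew_percent / 100.0))


shift = pixel_shift(200, 10)    # 20.0 pixels, matching the slide's example
angle = skew_angle_degrees(10)  # roughly 5.7 degrees
```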
  • 18. Operation and Management (cont.)
      ◦ Landscape Detection and Auto Rotation
          ▪ Landscape detection automatically detects landscape images and rotates them 90 degrees.
      ◦ White Page Detection
          ▪ Normally, a double-sided scanner creates two images per scanned page.
          ▪ However, if the back or front page is blank, there is no need to store this image.
          ▪ White page detection allows the user to avoid storing blank pages.
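One simple way to picture white page detection is to count the dark pixels in a bitonal image and skip the page when almost none are present. The sketch below assumes the page is already available as rows of 0/1 pixels; the 0.5% tolerance for specks is an arbitrary illustrative value, and commercial scanners may use a different method entirely.

```python
def is_blank_page(bitmap, max_dark_fraction=0.005):
    """bitmap: rows of 0 (white) / 1 (black) pixels from a bitonal scan."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        return True
    dark = sum(sum(row) for row in bitmap)
    # Allow a small fraction of dark pixels so dust specks do not count as content.
    return dark / total <= max_dark_fraction


front = [[0, 1, 1, 0], [0, 1, 0, 0]]  # side with marks on it
back = [[0, 0, 0, 0], [0, 0, 0, 0]]   # empty reverse side
pages_to_store = [
    name for name, page in (("front", front), ("back", back))
    if not is_blank_page(page)
]
```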
  • 19. Operation and Management (cont.)
      ◦ Other Factors
          ▪ Automatic Image Registration
          ▪ De-Speckle and Shade Removal
          ▪ Character Enhancer
      ◦ Cost Savings
          ▪ Automatic processes to improve recognition rates
          ▪ Voting techniques, multiple engines, learning
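The slide only names voting across multiple engines; one plausible reading is a per-character majority vote over the outputs of several recognition engines. The sketch below assumes each engine returns a string of the same length for a field, and uses "?" as a hypothetical marker for characters that need operator verification.

```python
from collections import Counter


def vote(readings, unresolved="?"):
    """Combine equal-length strings from several engines, one character at a time."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("all readings must have the same length")
    result = []
    for chars in zip(*readings):
        char, count = Counter(chars).most_common(1)[0]
        # Accept only a strict majority; otherwise flag for an operator.
        result.append(char if count > len(readings) / 2 else unresolved)
    return "".join(result)


year = vote(["1984", "1984", "1934"])  # two of three engines agree on "8"
unclear = vote(["7", "1", "I"])        # no majority for this character
```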
  • 20. Questionnaire Design and Preparation
      ◦ Drop-Out Color
          ▪ Usually red. A color facility in the OCR system that lets the system pick up only the meaningful information from an OCR form.
          ▪ The system does not need to see anything printed in the drop-out color, including tick boxes.
          ▪ The OCR system only needs to see the black parts, and compares them to specifications to determine which parts are filled in or written on.
      ◦ Characters or Marks
          ▪ To speed up the data capture process and reduce error rates, it is advisable to use marks or “ticks” as much as possible
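The effect of a red drop-out color can be imitated in software by looking only at the red channel of a color scan: red form lines then appear as bright as white paper, while dark ink stays dark. The sketch below is an illustration under that assumption, with made-up pixel values and an arbitrary darkness limit; real scanners typically achieve drop-out optically or in firmware.

```python
def drop_out_red(rgb_rows, dark_limit=100):
    """Convert RGB pixels to bitonal: 1 = ink to recognize, 0 = background."""
    bitonal = []
    for row in rgb_rows:
        out = []
        for r, g, b in row:
            # Red form lines have a high red value, so judging darkness by the
            # red channel alone makes them disappear along with the white paper.
            out.append(1 if r < dark_limit else 0)
        bitonal.append(out)
    return bitonal


WHITE, RED, BLACK = (255, 255, 255), (220, 30, 30), (20, 20, 20)
row = [WHITE, RED, BLACK, RED, WHITE]  # a red box outline with a pen stroke inside
bitonal = drop_out_red([row])
```

Only the pen stroke survives in `bitonal`, which is what keeps the scanner output small and the recognition input clean.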
  • 21. Questionnaire Design and Preparation (cont.)
      ◦ How to Obtain Good Scanning Results
          ▪ Select adequate paper quality and a reliable printing press.
          ▪ Use appropriate ink, considering the drop-out color. Paper heavier than 80 grams per square meter can help avoid paper jams and stop the scanner from reading through to the other side of the page.
      ◦ Form Design Advice
          ▪ Number the items to be included in a form; design the size of the boxes for each character answer carefully.
          ▪ Define the drop-out color properly; use registration marks.
          ▪ Pre-print the codes near the tick boxes.
          ▪ Maintain a consistent pattern for where the information to be collected is located.
          ▪ Do not obscure ticks and marks with titles, labels or instructions.
          ▪ Avoid placing the answers to a question on a different page from the question; avoid open-ended questions.
  • 22. OCR Field Operation
      ◦ Training for Collection and Processing Staff
          ▪ Basic software and scanner operations, including installation and troubleshooting
          ▪ Applications, with emphasis on developing custom applications, including configuring nonstandard forms
          ▪ Pre-marking of forms; use of overprinting to customize forms
          ▪ Processing of surveys
          ▪ Creating custom output file formats
  • 23. OCR Field Operation (cont.)
      ◦ Causes of OCR Reading Errors
          ▪ Forms in bad condition (dirty, folded, crumpled, etc.)
          ▪ Forms fed into the OCR scanner at an angle rather than straight; incompletely filled forms
      ◦ Reducing OCR Reading Errors
          ▪ Check the questionnaires for completeness and consistency; prepare your own memory (dictionary); define permissible margins of OCR reading errors
      ◦ Particular Care in Writing Numbers or Letters
          ▪ One box contains only one character; characters should not extend outside their boxes; unnecessary additions to characters such as points, decorative strokes, hooks, etc. are prohibited. Strokes should not end with flourishes or extensions.
          ▪ All lines should be connected without breaks; all lines and dots should be drawn with the same pressure.
      ◦ Value Checking Steps: verify that the information captured by OMR matches the questionnaire.
      ◦ Control for Blank: decide what type of control must be applied if the information is blank.
          ▪ Control steps should be taken if the information image is partial or missing, to assure the quality of the generated files.
      ◦ Missing Questionnaires: make sure that all questionnaires are scanned completely, with none missing and none duplicated.
          ▪ Control procedures should therefore include producing control tables to compare with the manual work.
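A control table for the missing and duplicated questionnaire checks could look like the sketch below, which compares a manually kept register of questionnaire IDs with the IDs read from the scanned forms. The ID format and the function name are hypothetical; the slides do not describe a specific control procedure.

```python
from collections import Counter


def control_table(expected_ids, scanned_ids):
    """Compare the manual register with the IDs read from scanned forms."""
    counts = Counter(scanned_ids)
    return {
        "missing": sorted(set(expected_ids) - set(counts)),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        "unexpected": sorted(set(counts) - set(expected_ids)),
    }


report = control_table(
    expected_ids=["EA01-001", "EA01-002", "EA01-003"],
    scanned_ids=["EA01-001", "EA01-003", "EA01-003"],
)
```

Here `report` shows one questionnaire never scanned and one scanned twice, which are the two cases the slide asks the control procedure to catch.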
  • 24. OCR Country Outlook
      ◦ Countries using optical mark recognition
          ▪ (Greece)
      ◦ Countries using optical character recognition
          ▪ (Croatia: in use for the next census round)
          ▪ (Japan: outsources the entire process; in use for the next census round)
      ◦ Countries using both
          ▪ Belgium
      ◦ Countries planning to use OCR
          ▪ Tajikistan
          ▪ (Tonga) looking to introduce and use OCR for our next Census
  • 25. OCR Country Outlook
      ◦ Common devices/scanners and software used by NSOs
          ▪ (Croatia) KODAK DS3520 bitonal scanners, IBM IFP (Intelligent Forms Processing)
          ▪ (Greece) OMR devices/scanners were ‘’axm 990/995’’ with FORM/AXF/ADELE+ software
          ▪ (New Zealand) Kodak scanners i830 and i7620; the scanning and raw data capture process (recognition aspect) was outsourced. For the next census, the scanning and data capture process will more than likely be outsourced, but it really is a variation to a current supplier agreement.
          ▪ (Belgium) AGFA (high resolution) scanner
  • 26. OCR in Use
      ◦ Editing methods used for the census
          ▪ (Japan) cold-deck method, hot-deck method, etc.
          ▪ (Croatia) developed in house: logical checking with automatic and manual correction
          ▪ (Greece) via PC: an editor (an officer of N.S.S.G.) confirms or rejects an inaccurate value or inputs a missing one.
          ▪ (New Zealand) a mixture of micro and macro editing practices. Individual responses may have range or validity edits, inter-field edits and also inter-form edits (within a household). Macro editing is used particularly during the data evaluation process, and data may be reprocessed as a result.
  • 27. OCR Country Outlook
      ◦ Common commercial or free software used in OCR
          ▪ (Croatia) ACTR (Automated Coding by Text Recognition) for coding; software developed by Statistics Canada.
          ▪ (Greece) Commercial software, selected after open bidding, according to the budgetary plan of the population census
          ▪ (New Zealand) IBM Intelligent Forms Processing (IFP) system through an established user agreement.
          ▪ (Belgium) IRIS (Image Recognition Integrated Systems)
  • 28. OCR Country Outlook
      ◦ Concerns/issues with the use of optical character recognition for census data capture
          ▪ (Japan) Speed of data capture and recognition, recognition accuracy of Japanese characters, etc.
          ▪ (Greece) OMR: concerns related to the optical recognition of numbers, the speed of optical recognition itself and the electronic storage of the questionnaires.
          ▪ (Tajikistan) Getting equipment and training.
          ▪ (Samoa) Not enough financial support and technical human resources.
  • 29. THANK YOU!