2. Nowadays, many paper documents are converted to
electronic form, which makes information processing
tasks such as searching, analysis and conversion easier.
Many companies and other institutions decide to digitize
their documents. Working with files is cheaper than
processing traditional documents, because no physical
space is required for document storage. There are three
main steps in document digitization: scanning,
indexing (data entry) and presentation of the digitized
documents.
Research has shown that recognition of both barcodes
and printed text through Optical Character Recognition
(OCR) is reliable and significantly accelerates data
processing. In contrast, handwritten text has proved
difficult for OCR systems to recognize.
3. Optical Character Recognition (OCR) is a technology
that recognizes printed or handwritten alphanumeric
characters at electronic speed, simply by scanning
the document.
It is the mechanical or electronic conversion of
scanned or photographed images of typewritten or
printed text into machine-encoded (computer-readable)
text.
OCR is a field of research in pattern recognition,
artificial intelligence and computer vision. It is the
electronic translation of images of handwritten, typewritten
or printed text into machine-encoded text.
4. History of OCR:
1928/9: Gustav Tauschek of Vienna, Austria patents a basic OCR "reading
machine."
1949: L.E. Flory and W.S. Pike of RCA Laboratories develop a photocell-based
machine that can read text to blind people at a rate of 60 words per minute.
1950: David H. Shepard develops machines that can turn printed information
into machine-readable form for the US military and later founds a pioneering
OCR company called Intelligent Machines Research (IMR).
1960: Lawrence (Larry) Roberts, a computer graphics researcher working at
MIT, develops early text recognition using specially simplified fonts such as
OCR-A.
1950s/1960s: Reader's Digest and RCA work together to develop some of the
first commercial OCR systems.
1960s: Postal services around the world begin to use OCR technology for mail-
sorting.
1974: Raymond Kurzweil develops the Kurzweil Reading Machine that can read
printed pages aloud to blind people. Kurzweil's OCR software is acquired by
Xerox and marketed under the names ScanSoft and (later) Nuance
Communications.
1993: The Apple Newton MessagePad (PDA) is one of the first handheld
computers to feature handwriting recognition on a touch-sensitive screen.
2000: Researchers at Carnegie Mellon University flip the problem of developing
a good OCR system on its head and develop a spam-busting system called
CAPTCHA.
5. Pre-processing:
Improves the quality of the image so that the system can
recognize it more reliably. Techniques include –
De-skew
Despeckle
Binarization
Line removal
Zoning, etc.
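Two of the techniques listed above, binarization and despeckling, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample image, the fixed threshold of 128 and the function names are hypothetical, and real OCR systems typically use adaptive thresholds and more robust noise filters.

```python
# Hypothetical 5x6 grayscale scan: low values are dark ink, high values are paper.
# The pixel in the top-right corner is an isolated speck of noise.
IMAGE = [
    [250, 250, 250, 250, 250,  40],
    [250,  30,  25,  35, 250, 250],
    [250,  28, 240,  32, 250, 250],
    [250,  31,  27,  29, 250, 250],
    [250, 250, 250, 250, 250, 250],
]


def binarize(gray, threshold=128):
    """Map each pixel to 1 (ink) or 0 (background) using a fixed threshold (assumed value)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink pixel among their 8 neighbours."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            has_ink_neighbour = any(
                binary[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 0
    return cleaned


if __name__ == "__main__":
    for row in despeckle(binarize(IMAGE)):
        print("".join("#" if px else "." for px in row))
```

In this sample the ring-shaped glyph survives, while the lone dark pixel in the corner is removed because none of its neighbours is ink.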
Character recognition:
There are two basic types of core OCR algorithm, each of which
may produce a ranked list of candidate characters –
Matrix matching
Feature extraction
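As a rough illustration of matrix matching, the sketch below compares a glyph bitmap with stored templates pixel by pixel and returns a ranked list of candidate characters. The 3x3 templates, the sample glyph and the function name are hypothetical, and the scoring rule (fraction of agreeing pixels) is just one simple choice.

```python
# Hypothetical 3x3 templates: 1 = ink, 0 = background.
TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}

# A scanned "L" with one missing pixel in the bottom-right corner.
NOISY_L = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 0]]


def rank_candidates(glyph, templates):
    """Return (label, score) pairs sorted from best to worst match.

    The score is the fraction of pixels on which glyph and template agree.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = []
    for label, template in templates.items():
        matches = sum(
            g == t
            for glyph_row, template_row in zip(glyph, template)
            for g, t in zip(glyph_row, template_row)
        )
        total = len(template) * len(template[0])
        scores.append((label, matches / total))
    return sorted(scores, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for label, score in rank_candidates(NOISY_L, TEMPLATES):
        print(label, round(score, 2))
```

Feature extraction would replace the raw pixel comparison with a comparison of derived features such as lines, loops and intersections; that variant is not shown here.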
Post-processing:
OCR accuracy can be increased if the output is constrained by
a lexicon, i.e. a list of words allowed to occur in the document,
e.g. all the words in the English language. This can be
problematic if the document contains words that are not in
the lexicon, like proper nouns.
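A minimal sketch of lexicon-constrained post-processing, using the standard-library `difflib.get_close_matches`. The tiny lexicon, the cutoff of 0.8 and the function name are assumptions chosen for illustration. Words with no close match are left unchanged, which is one simple way to avoid damaging proper nouns.

```python
import difflib

# Hypothetical lexicon; a real system would load a full dictionary.
LEXICON = ["optical", "character", "recognition", "document", "scanner"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Replace an OCR'd word with its closest lexicon entry, if one is close enough.

    Words already in the lexicon, and words with no match above the cutoff
    (for example proper nouns), are returned unchanged. Corrected words are
    returned in lower case because the lexicon is lower case.
    """
    lowered = word.lower()
    if lowered in lexicon:
        return word
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    for raw in ["recogniti0n", "docurnent", "Kurzweil"]:
        print(raw, "->", correct_word(raw))
```

The cutoff controls the trade-off: a lower value corrects more OCR errors but also increases the risk of rewriting words that were correct but missing from the lexicon.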
6. Data entry for business documents, e.g. checks, passports,
invoices, bank statements and receipts
Automatic number plate recognition
Automatic extraction of key information from insurance
documents
Extracting business card information into a contact list
Making textual versions of printed documents more quickly,
e.g. book scanning for Project Gutenberg
Making electronic images of printed documents searchable,
e.g. Google Books
Converting handwriting in real time to control a computer
(pen computing)
Assistive technology for blind and visually impaired users
7. Once a printed page is in machine-readable
text form, one can do all kinds of things with it
that were not possible before.
Machine-readable text can also be decoded by
screen readers, tools that use speech synthesizers
to read out the words on a screen so blind and
visually impaired people can understand them.
In the 1970s, one of the first major uses of OCR
was in a photocopier-like device called the
Kurzweil Reading Machine, which could read
printed books out loud to blind people.
8. Institutional repositories are digital collections of the
outputs created within an institution. A repository
collects the intellectual output of an institution,
especially a research institution, where it is collected,
preserved and disseminated. It is basically a collection
of peer-reviewed journal articles, conference proceedings,
research data, monographs, books, theses, dissertations
and presentations. A practical implementation
involves setting up a system with a scanner that
scans the documents. Each scanned document is then
fed as input to an Optical Character Recognition
system, where the information is extracted and
retained in digitized form.
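Once the OCR step has produced text, a repository can make it searchable. The sketch below builds a simple inverted index over OCR output. The document IDs and texts are hypothetical sample data standing in for the OCR results, and the scanning and OCR steps themselves are assumed to have happened already.

```python
import re
from collections import defaultdict

# Hypothetical OCR output, keyed by repository document ID.
OCR_TEXT = {
    "thesis-001": "Optical character recognition of printed text",
    "article-002": "Handwritten text remains difficult to recognize",
}


def build_index(documents):
    """Map each lower-cased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted IDs of the documents containing the given word."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    index = build_index(OCR_TEXT)
    print(search(index, "text"))
    print(search(index, "Optical"))
```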
9. Nowadays, many documents are still produced in paper form, yet
automatic data recognition systems are clearly very popular.
Though researchers have suggested various sophisticated ideas
and techniques, practical OCR systems still fall short in
several respects. This is because the claims made by
researchers are not adequately validated by exposing the
systems to real working environments, and because such advanced
techniques are often not economically feasible with the available
hardware. From these constraints
and the resulting lack of performance it can be concluded that the
ability of machines to read text with the same fluency as humans
remains an unachieved goal, though a great amount of effort has
already been expended on the subject.
However, the frontiers of character recognition have now moved
to the recognition of cursive script, that is, the recognition of
characters which may be connected or written in calligraphy.
10. Asif, Ali Mir Arif Mir; Hannan, Shaikh Abdul; Perwej,
Yusuf; Vithalrao, Mane Arjun. An Overview and
Applications of Optical Character Recognition.
International Journal of Advance Research In Science
And Engineering, Vol. 3(7), pp. 261-274.
https://en.wikipedia.org/wiki/Optical_character_recognition (accessed 10/03/2017)
http://www.webopedia.com/TERM/O/optical_character_recognition.html (accessed 10/03/2017)
http://www.computerhope.com/jargon/o/ocr.htm (accessed 10/03/2017)
http://www.explainthatstuff.com/how-ocr-works.html (accessed 10/03/2017)