Computer Vision: Optical Character Recognition
Mark Kurtz
Fontbonne University
Department of Mathematics and Computer Science
ABSTRACT
Humans are very visual creatures. Our visual system is involved in nearly everything we do or think about, in activities spanning everything from reading to driving. Computer Vision is the study of how to implement the human visual system, and the visual tasks we perform, in machines and programs. My studies and this paper revolve around this topic, specifically Optical Character Recognition (OCR), which tries to recognize characters or words inside of images. My focus was on implementing OCR through custom algorithms I wrote after studying techniques used in the computer vision field. I successfully created a program implementing a few of the algorithms; the rest remain untested but are documented. The results for the tested algorithms are documented in this paper.
1.1 Introduction
Computer Vision has slowly worked its way into our lives in limited aspects, in fields such as automotive drive assistance, eye and head tracking, film and video analysis, gesture recognition, industrial automation and inspection, medical analysis, object recognition, photography, security, 3D modeling, etc. (Lowe) These applications and the algorithms behind them are extremely specific. Because the programs created from the algorithms are so specific, much of the code does not successfully transfer among different applications, and so there is no master code or algorithm for the computer vision field. Hence even the visual system of a two-year-old cannot be replicated; computer programs still cannot successfully find all the animals in a picture. The reasons for this are many, but they simplify down to one point: the human visual system is extremely hard to understand and replicate. The process becomes an inverse problem in which we try to form a solution that resembles what our eyes process. This seems easy, but there are many hidden layers between the input images on our eyes and what we perceive. With so many unknowns, much of the focus in the computer vision field has shifted to physics-based or statistical models to determine potential solutions. (Szeliski)
Because of how well our visual system handles images, I underestimated the complexity of computer vision. In our everyday world it seems easy to pick out the different objects in an image. Looking around, you can readily distinguish objects in your surroundings, what they are, how far away they are, and their three-dimensional shape. To determine all of this, our brains transform the images taken in by our eyes through many different steps. As an example, we may perceive colors as darker or lighter than their actual values. This is how we see the same shade of red
on an apple throughout the day despite the changing colors of light reflecting through the
atmosphere. An example is in the picture that follows. The cells A and B are the exact same
color, but our visual system changes the color we perceive based on the surrounding colors:
Current algorithms have yet to effectively replicate the human color perception system.
(McCann) Other visual tricks our eyes perform range from reconstructing a three-dimensional
reality from two-dimensional images in our retinas to perceiving lines and edges from missing
image data. An article titled What Visual Perception Tells Us about Mind and Brain explains this
in more detail. “What we see is actually more than what is imaged on the retina. For example,
we perceive a three-dimensional world full of objects despite the fact that there is a simple
two-dimensional image on each retina. … Thus perception is inevitably an ambiguity-solving process. The perceptual system generally reaches the most plausible global interpretation of the retinal input by integrating local cues.” (Shimojo, Paradiso, and Fujita)
Despite the complexity of the human visual system, many people have tried to replicate it and implement it in machines and robotics. Many algorithms have been developed for specific applications across different fields and disciplines. One of the most prevalent in consumer applications is facial recognition. In fact, it has nearly become part of everyday life through images in social networks such as Facebook, images processed in digital
cameras, and images processed in photo editing software such as Picasa. The most successful
algorithm creates Eigenfaces, which are a set of eigenvectors, and then looks for these inside of
an image. The eigenvectors are derived from statistical analysis of many different pictures of
faces. (Szeliski) In other words, it creates a template from other pre-labeled faces and searches
through an image to see where they occur. This approach seems surprising to me since the algorithm never tries to break apart the constituents of the image or even determine what is contained within the image. The algorithms never process corresponding shapes, three-dimensional figures, edges, etc. from the image data. With no processing performed other than a simple template search, the program has no clues about context, leading it to label faces within
an image that we may consider unimportant. For example, I used Picasa to sort my images by
who was in each picture. It does this through facial recognition. However, in some images it
recognized faces that were far off in the background, faces that were out of focus, and even a
face on the cover of a magazine someone was reading in the background of the image.
Not only does facial recognition suffer from a lack of context, but the exact template matching with Eigenfaces also makes the algorithm extremely specific. This speaks volumes about the methods used in computer vision. The methods and algorithms for facial recognition cannot readily be used to identify animals in an image without completely retraining the algorithm. By extension, the algorithms developed for one application in computer vision are too specific to transfer over to another application. I see this as a huge problem, since we cannot possibly have thousands of different image analysis processes running in our brain all at once, each looking for a specific kind of object. I believe the way our brain determines there is a person in an image is the same as how it determines there are animals in
an image. The current field of computer vision is moving towards this template matching. While
this may work in specific situations, I cannot see how this will ever replicate human vision
because of the reasons stated before. To break away from the specificity of the computer vision field, I took a few ideas from lower-level processing techniques, developed my own algorithms in place of the ones commonly used, and then outlined a new matching technique.
1.2 New Techniques Needed in Computer Vision
Over the decades that computer vision has been studied and developed, it has not
progressed as well as most predicted or hoped. In fact, it has led some who contributed to and pioneered the field to the extreme view that computer vision is dead. (Chetverikov) While
I do not believe the study of computer vision is dead, I think new algorithms are needed in the
field. The algorithms need to move away from the specific applications and involve more ideas
from neuroscience and biological processes.
Jeff Hawkins, a pioneer of PDAs (Personal Digital Assistants, essentially the precursors to the smartphone) and now an artificial intelligence researcher, spoke at a TED (Technology, Entertainment, and Design) conference in 2003. His speech focused on artificial intelligence. In it he states that not only computer vision but the entire field of artificial intelligence is moving in the wrong direction. He explains that this comes from an incorrect but strongly held belief that intelligence is defined by behavior: we are intelligent because of the way we do things. He counters that our intelligence comes from experiencing the world through a sequence of patterns that we store and recall to match with reality. In this way we make predictions about what will happen next, or what should happen next. An example he gives is our recognition of
faces. As has been observed through study, when humans view a person we first look at one eye, then the other eye, then at the nose, and finally at the mouth. This is simply explained by predictions happening and then being confirmed as we observe our world. We expect an eye, then an eye next to it, then a nose, then a mouth. If we see something different, it will not match up with our predictions, and learning or more concentrated analysis will occur. (How Brain Science Will Change Computing)
I developed my research around the direction of prediction as explained by Jeff
Hawkins. In this way patterns, sequences, and predictions can be used in object recognition and
classification. Logically pattern matching seems to make more sense than specific template
matching. By looking for exact matches to templates, we will never be able to reproduce visual
tricks such as when we see shapes in clouds. Also, template matching often produces false
results if an object is shaded differently or partially hidden. (Ritter and Wilson)
Pattern matching seems to offer a fix for the different lighting conditions that occur in images.
Also, pattern matching may have a better chance of identifying objects which are slightly
obscured or are missing some data. This is something that must be studied further through
experimentation, however.
1.3 Simplifying Things to Optical Character Recognition
In my beginning research, I hoped to apply the ideas I developed to full images for
classification and recognition of objects. Unfortunately, time constraints limited my research and implementation, so I decided to focus on OCR (Optical Character Recognition). This field
is a subset of computer vision which focuses on text recognition in images. In OCR there is a
limited number of objects that can occur within the images. Also, the images processed in OCR
are much simpler than the full images processed in computer vision. In most cases the input is a two-dimensional binary image, such as an image of a page within a book. While OCR is simpler than general computer vision, the algorithms I created can still be implemented within it since it is a field within computer vision.
OCR is used by many industries and businesses. It is often packaged with Adobe PDF
software and document scanning software. Also, the United States Post Office has heavily
implemented OCR to recognize addresses written on packages and letters. OCR is generally divided into two methods. The first is matrix matching, where characters are matched pixel by pixel to a template. The second is feature extraction, where the program looks for general features such as open areas, closed shapes, diagonal lines, etc. Feature extraction is applied far more often than matrix matching and is much more successful. (What's OCR?)
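As a point of reference, here is a minimal sketch of the matrix matching idea described above (an illustration only, not code from this project); the binarized glyph and template arrays are hypothetical inputs:

    import numpy as np

    def matrix_match_score(glyph, template):
        # Matrix matching: compare a binarized character image to a stored
        # template pixel by pixel and return the fraction of matching pixels.
        glyph = np.asarray(glyph, dtype=bool)
        template = np.asarray(template, dtype=bool)
        if glyph.shape != template.shape:
            raise ValueError("glyph and template must be the same size")
        return float(np.mean(glyph == template))

    # The recognized character would be the stored template with the highest score:
    # best = max(templates, key=lambda name: matrix_match_score(glyph, templates[name]))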
Feature extraction has become the default for OCR software. It works by analyzing
characters to determine strokes, lines of discontinuity, and the background. From here the
program builds up from the character to the word and then assigns a confidence level after
comparing to a dictionary of words. This works well for images converted straight from computer documents, with accuracy rates around 99%. For older, scanned papers the accuracy rate drops and varies widely, from 71% to 98%. (Holley)
The method I have developed follows the feature extraction method. The ideas of
feature extraction have worked very well within OCR so far and seem to resemble the idea of
prediction matching described earlier. The problem is defining what features are and how to
decode them inside of an image.
1.4 Describing the Overall Idea
The general idea of my algorithms in computer vision and OCR is to separate an image into constituent objects and then define those objects by their surprising features. In doing this, the algorithm builds a definition of an object rather than a template for an object. It seems more natural to describe objects by a definition rather than a template, so I pursued algorithms that build such definitions. For example, when we describe a face we do not build an
exact template from thousands of faces we have seen in the past. Instead we describe a face as
having two eyes, a nose, and a mouth. These are features that distinguish a face from anything
else. If there were no features that protruded from the interior of the face, we would have a
hard time distinguishing it. Definitions work the same for outlines of an object, too. We can
easily draw the outline of a dolphin because we know the border points that are most
memorable and stick out from a regular oval or other shape. For a dolphin it is the tail, dorsal
fin, and the mouth.
While definition building seems to be a more natural way of understanding images, it also may offer reasons as to why we see objects in clouds or ink blots. I believe we see images in these shapes because certain features resemble patterns we have seen before in other objects. We may see a face in an ink blot because there are two dots to represent eyes and a
nose all in the correct relation to each other. We may see a dolphin in the clouds because a part
of it resembles the dorsal fin and the nose. In both of these examples general objects portray
specific objects because certain key features match up.
1.5 Lower Level Processing
Lower-level processing in computer vision means finding key points, features, edges, lines, etc. in an image. At the lower level no matching takes place; the focus is to decode the image into basic points, lines, and shapes for later processing. (Szeliski) One of the many lower-level processes is edge detection. Edges define boundaries between regions in an image and occur where there is a distinct change in the intensity of an image. There are tens if not hundreds of algorithms written for edge detection, ranging from the simple to the extremely complex. (Nadernejad)
The most widely used algorithms for edge detection are the Marr-Hildreth edge detector and the Canny edge detector. (Szeliski) The Marr-Hildreth edge detector first applies a Gaussian smoothing operator (a matrix which approximates a bell-shaped curve) and then a two-dimensional Laplacian (another matrix, equivalent to taking the second derivative of the image). The Gaussian reduces the amount of noise in the image simply by blurring it, though this has the unwanted effect of losing fine detail. The idea behind the Laplacian is that a step difference in the intensity of an image is represented by a zero crossing in the second derivative. The Canny edge detector also begins by applying a Gaussian smoothing operator. It then finds the gradient of the image at each point to indicate the presence of edges while suppressing any points that are not maximum gradient values in a small region. After all this has been performed, thresholding is applied using hysteresis, which applies a high and a low threshold. Again, the Gaussian loses detail as it reduces the amount of noise in the image. Both the Marr-Hildreth and the Canny edge detectors
are very expensive in terms of computation time because of the operations that are involved. I
stepped away from these approaches and tried to look at edge detection in a simpler way that
could be reproduced by artificial neural networks.
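For comparison with the approach developed below, here is a minimal sketch of the Marr-Hildreth (Laplacian-of-Gaussian) idea just described, using SciPy; the sigma value is illustrative, and this is not the code written for this project:

    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def marr_hildreth_edges(gray, sigma=2.0):
        # Smooth with a Gaussian and take the Laplacian in one operator,
        # then mark the zero crossings of the result as edges.
        log = gaussian_laplace(gray.astype(float), sigma)
        edges = np.zeros(gray.shape, dtype=bool)
        # A zero crossing occurs where the sign changes between neighbors.
        edges[:, :-1] |= np.signbit(log[:, :-1]) != np.signbit(log[:, 1:])
        edges[:-1, :] |= np.signbit(log[:-1, :]) != np.signbit(log[1:, :])
        return edges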
Artificial neural networks were inspired by the way biological nervous systems process
information. The neural networks are composed of a large number of processing elements that
work together to solve specific problems. Neurons inside the neural networks are
interconnected and feed inputs into each other. If an operation applied to all of the inputs into a neuron exceeds a certain threshold, then the neuron sends an input on to other neurons. The way the neurons are connected, the thresholds set for each, and the operations performed on each can all be changed and adapted to improve the performance of an algorithm. This replicates
the learning process in biological brains and nervous systems. Artificial neural networks have
been effectively applied in pattern recognition and other applications because of their ability to
change and adapt to new inputs. (Stergiou, Christopher, and Dimitrios Siganos)
Neural networks have a downside, though: they need many sets of training data in order to achieve accurate results, and these training sets take a lot of time to build. So, instead of abandoning neural networks, I tried to combine the best of traditional computing and neural networks in my edge detection algorithms. I used arrays as inputs and outputs to form the neurons of the network, predetermined what the operations would be, and then applied a threshold which could be changed to maximize the accuracy of the algorithm.
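A minimal sketch of the kind of thresholded unit this describes (the weights and threshold here are illustrative, not values from my program):

    def threshold_neuron(inputs, weights, threshold):
        # Fire (output 1) only when the weighted sum of the inputs exceeds
        # the threshold; otherwise stay silent (output 0).
        activation = sum(w * x for w, x in zip(weights, inputs))
        return 1 if activation > threshold else 0

    # Example: a unit that fires when the intensity jump between two
    # neighboring pixels is large (weights +1 and -1, threshold 30).
    print(threshold_neuron([200, 120], [1, -1], 30))   # -> 1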
My first idea built upon the previously described combination of neural networks and predefined operations. I explored the idea that edges are step (non-algebraic) changes in light
intensity in an image. I figured out that I could calculate the change in light intensity along a
specific direction in the image. By doing this, I could approximate the derivative at each point in
an image. After this I could approximate the second derivative, the third derivative, and so on.
With the derivatives approximated, the algorithm can then work backwards and figure out
what the next pixel value should be for an algebraic equation. The following image explains this
further:
This system works perfectly for predicting the next value in algebraic equations such as y = 2x + 20 or y = 5x^3 + 3x - 10. If the array is expanded to include more pixel values, it can work with
even higher order equations. The algorithm is able to do this because eventually the
approximate derivative is a constant value or 0. Surprisingly, the algorithm also was able to
reasonably approximate the next value in y = cos(x) and y = ln(x). I then designed a program to
apply the algorithm to images. For each pixel it would calculate a predicted value from the
surrounding pixels. If the predicted value and the actual value differed by more than a certain threshold,
then the pixel would be marked as an edge. The algorithm did not work as well in practice,
though.
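Before describing those problems, here is a minimal sketch of the prediction idea, applied to a single row of grayscale pixels; the window size and threshold are illustrative, and the real program works on full images:

    import numpy as np

    def predict_next(values):
        # Approximate successive derivatives with finite differences, then
        # work backwards to the value an algebraic (polynomial) curve
        # through these points would take next.
        diffs = np.array(values, dtype=float)
        lasts = [diffs[-1]]
        while len(diffs) > 1:
            diffs = np.diff(diffs)      # next higher approximate derivative
            lasts.append(diffs[-1])
        return sum(lasts)

    def mark_edges_1d(row, window=4, threshold=30):
        # A pixel is marked as an edge when the prediction made from the
        # preceding pixels misses the actual value by more than the threshold.
        edges = np.zeros(len(row), dtype=bool)
        for i in range(window, len(row)):
            predicted = predict_next(row[i - window:i])
            edges[i] = abs(predicted - row[i]) > threshold
        return edges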
Small variations in light intensity in an image would create large changes in values
higher up in the array. This led to extreme predicted values which did not match with what a
human might expect the next value to be. Another problem appeared whenever edges
occurred between the values used for the prediction. For example, if an edge occurred at Pixel
2 in the image shown above, then the value of PR4 became extreme. Noise such as a bad pixel or speck in the image also created the same problem as an edge occurring among the values
used for the prediction. All three of these problems created false edges and spots inside of the
generated images. Here are the results of this algorithm (the top pictures are gray scale, and
the bottom are the edge detection):
After days of trying and several algorithms, I derived an algorithm based on the previously described combination of a neural network and predefined operations. Essentially it approximates the derivative of order n on each side of the pixel being tested. This brings local pixels into the reasoning for finding edges; the benefit of including these local pixels is to reduce the effect of noise that might already be present in the image. It also approximates the first
derivative on each side of the pixel to be tested. The purpose of the first derivative is to make
sure the edge occurred at the pixel being tested instead of in the general region of the pixel
being tested. Next, instead of predicting the next values, my algorithm simply compared the
values of the approximated derivatives. The following picture explains the algorithm in a visual
way.
The new algorithm seemed to work better, and fairly quickly, since all operations are performed on integer values. I believe it is faster than the Canny and Marr-Hildreth edge detectors, but that
is only speculation. To speed up the algorithm and use less memory, I figured out the relation of
each successive approximate derivative. It follows Pascal’s Triangle with alternating signs. For
example, to approximate only the first derivative you take PL1 – P1 (when referring back to the
image above). To approximate the second derivative you would normally calculate the first and
then find the difference between these first derivatives. This equation can be simplified from
DLA2-DLA3 to PL2 – 2*PL1 + P1 (again when referring to the image above). To approximate the
third derivative, the equation simplifies to PL3 – 3*PL2 + 3*PL1 – P1 (when referring to the
image above).
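A minimal sketch of these one-sided derivative approximations and the comparison between the two sides, using the Pascal's Triangle coefficients above; the derivative order and the thresholds are illustrative and would be tuned in practice:

    from math import comb

    def side_derivative(pixels, center, order, direction):
        # Approximate the derivative of the given order on one side of the
        # tested pixel using binomial coefficients with alternating signs,
        # e.g. order 3 to the left: PL3 - 3*PL2 + 3*PL1 - P1.
        total = 0
        for k in range(order + 1):
            idx = center + direction * k        # step outward from the pixel
            total += (-1) ** (order - k) * comb(order, k) * pixels[idx]
        return total

    def is_edge(pixels, i, order=3, diff_threshold=40, slope_threshold=20):
        # Compare the approximations from the two sides of pixel i; a large
        # disagreement suggests a step there, and the first-derivative check
        # helps make sure the edge sits at i rather than somewhere nearby.
        left_n = side_derivative(pixels, i, order, -1)
        right_n = side_derivative(pixels, i, order, +1)
        left_1 = side_derivative(pixels, i, 1, -1)
        right_1 = side_derivative(pixels, i, 1, +1)
        return (abs(left_n - right_n) > diff_threshold
                and abs(left_1 - right_1) > slope_threshold)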
After the program finds the necessary approximate derivatives, it then calculates the
difference between these derivatives. If the difference is more than a specific threshold, then it
records the results as an edge. The thresholds are adjustable, but I was not able to experiment
with many thresholds or design the program to choose the appropriate thresholds. Despite the
limited testing, the results seem promising. Here are the results of this algorithm (the top
pictures are gray scale, and the bottom are the edge detection):
As shown in the above image, the algorithm seems to work fairly well for single objects
within a landscape. The algorithm has problems decoding edges when there are a lot of objects
within the same image such as with the picture of trees. However, I believe all edge detectors
have this problem. It also has problems with texture such as the water in the dolphin picture or
the fur on the kangaroo. Some of these problems with texture occurred because the algorithm
only uses the gray scale version of the images.
To fix the problem of only being able to analyze gray scale images, I decided to create a three-dimensional color mapping that plots every possible color combination as a point. I
worked on the color mapping using two main facts. The first is that the lowest possible color is black and the highest possible color is white; all other colors lie between these two extremes. Having two colors that all others build from gave me two limiting points, one at each end of the map, to anchor the mapping. This gave me a basis for placing colors along the Z-axis of the three-dimensional mapping. The sum of the red,
blue, and green pixel values would determine where the color was on the Z-axis. Black (with
RGB pixel values 0,0,0) occurs at Z=0. White (with RGB pixel values of 255,255,255) occurs at
Z=765. Red, green and blue then occur at Z=255 (Red has an RGB value of 255,0,0 for example)
and yellow, cyan, and magenta occur at Z=510 (yellow has an RGB value of 255,255,0 for
example). The second fact I used comes from color theory where the colors red, blue, and
green are all separated by an angle of 120 degrees on a color wheel. By using this color wheel
and the separation of the three primary colors by 120 degrees, I created an equilateral triangle
mapping for color values with the maximum red values at the top corner of the triangle, the
maximum green values in the left corner of the triangle, and the maximum blue values in the
right corner of the triangle. Using this triangle, I mapped an XY axis over it with the Y axis
aligned with the red corner of the triangle and the origin at the centroid of the triangle. Also,
the Z-axis was shifted so that 382 (the approximate center of the range 0 to 765) occurs at Z =
0. With all of this explained, here are the equations for the mapping and a pictorial diagram:
X = Blue * cos(-π/6) + Green * cos(7π/6)
Y = Red + Blue * sin(-π/6) + Green * sin(7π/6)
Z = (Red + Green + Blue) - 382
Computation time was decreased by rounding and scaling up the values so that everything was kept as an integer. The formulas then became:
X = (Blue - Green) * 173
Y = Red * 100 - (Blue + Green) * 50
Z = (Red + Green + Blue - 382) * 100
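A minimal sketch of the mapping, in both the floating-point and the integer-scaled form given above:

    import math

    def color_map(red, green, blue):
        # Floating-point version of the mapping equations above.
        x = blue * math.cos(-math.pi / 6) + green * math.cos(7 * math.pi / 6)
        y = red + blue * math.sin(-math.pi / 6) + green * math.sin(7 * math.pi / 6)
        z = (red + green + blue) - 382
        return x, y, z

    def color_map_int(red, green, blue):
        # Integer-scaled version used to avoid floating-point work.
        x = (blue - green) * 173
        y = red * 100 - (blue + green) * 50
        z = (red + green + blue - 382) * 100
        return x, y, z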
The first algorithm written with this new color mapping worked very well despite its simplicity. The algorithm compared the distances between the color values of the pixels on either side of the pixel being tested. It did this for all pixels within a predetermined distance of the tested pixel along the horizontal and vertical directions of the image. If the calculated distance between the color values of the pixels on either side exceeded a threshold, then the tested pixel was marked as an edge. Here is an image diagram of the algorithm, the results, and the results compared with other edge detectors:
(Comparison images from Ehsan Nadernejad's "Edge Detection Techniques: Evaluations and Comparisons")
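Here is a minimal sketch of the comparison just described, reusing the color_map_int helper from the earlier sketch; the image layout (pixels[y][x] as an RGB tuple), the reach, and the threshold are all illustrative assumptions:

    import math

    def color_edge(pixels, x, y, reach=2, threshold=15000):
        # Mark (x, y) as an edge when the mapped colors on opposite sides of
        # it, along the horizontal or vertical direction, are far apart.
        for dx, dy in ((1, 0), (0, 1)):             # horizontal, then vertical
            for d in range(1, reach + 1):
                a = color_map_int(*pixels[y - dy * d][x - dx * d])
                b = color_map_int(*pixels[y + dy * d][x + dx * d])
                if math.dist(a, b) > threshold:     # distance in mapped space
                    return True
        return False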
The edge detection algorithms listed so far are all the ones that have been developed
and tested. I still have two more I would like to develop and test, so the general ideas behind
the algorithms are included here. The first builds further on the color mapping I developed. The
algorithm uses the color mapping to parse through an image and group pixels together
according to which pixels are closest in value to each other. It does this by recursively matching every pixel in an image to the neighboring pixels that are closest in value. I created an
algorithm to attempt this methodology, but it took several minutes to go through one image
and did not produce accurate results. The second algorithm yet to be developed and tested can
be added onto any of the algorithms listed in this paper. Essentially the algorithm checks to
make sure a pixel marked as an edge is part of a line and not just random values or single pixels
in an image. Also, the algorithm will be able to fill in missing pixels based on the presence of
surrounding edge values.
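A minimal sketch of the first of these two ideas, the color-based grouping, is shown below; it reuses the color_map_int helper from the earlier sketch, uses an iterative flood fill instead of literal recursion to avoid deep call stacks, and the similarity threshold is illustrative:

    import math
    from collections import deque

    def grow_region(pixels, start, max_dist=5000):
        # Group a pixel with the neighboring pixels whose mapped colors are
        # closest to it, spreading outward from the starting pixel.
        height, width = len(pixels), len(pixels[0])
        region, queue = {start}, deque([start])
        while queue:
            y, x = queue.popleft()
            current = color_map_int(*pixels[y][x])
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < height and 0 <= nx < width and (ny, nx) not in region:
                    neighbor = color_map_int(*pixels[ny][nx])
                    if math.dist(current, neighbor) <= max_dist:
                        region.add((ny, nx))
                        queue.append((ny, nx))
        return region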
The results of the edge detection were easiest to see in regular images, but the
algorithms are fully applicable in OCR. Here is an example of text in an image that has been run
through the color mapping edge detection algorithm:
The edge detection algorithms will help with recognition in the next section.
1.6 Dopamine Maps
With the lower-level processing resolved, I still needed a way to match characters and
objects. The general ideas for the algorithms to match characters and objects come from a
prediction and pattern approach. Generally humans define objects by their features. For
example, a face contains two eyes, a nose, and a mouth. The letter “A” is made up of two
slanted lines which meet at a point at the top with a horizontal line in the middle. I followed
this methodology rather than trying to match objects by templates like many algorithms before.
My way of matching assumes smooth transitions between areas and lines. In this way it can
predict what the next value would be much like the first algorithm I described for edge
detection. If the predictions the algorithm makes do not match up, then it marks the points as
surprises or unexpected values. In the same way that dopamine in the brain is released by unexpected events, the algorithm marks points that it did not expect. I will explain these
algorithms in detail, but the programs behind the algorithms are not working yet.
There are three types of dopamine maps I have developed and plan to implement. The
first map is called the boundary dopamine map which describes the outside shape of an object.
The algorithm traces along the boundary of an object, predicting where the next point should be based on previous points and their slopes. If the predicted point is not part of the boundary of the object, then that point is marked in the boundary dopamine map. A
good example is the letter “A” again. Here the algorithm explores the boundary of the letter
and marks anything it did not expect. In this case it marks the ends of the two slanted lines, since it expected the lines to continue. It also marks the point at the top and the intersections of the horizontal and the slanted lines, because the change in slope there is extreme. It does not mark those points because they are corners as such; any extreme change in slope will create a surprise
point according to the algorithm. The image below shows the final output from this type of
algorithm.
The points marked by the boundary dopamine algorithm are defined only relative to each other. In other words, these points can show up in any orientation and will still be matched as long as their distances and orientations relative to each other are the same.
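A simplified sketch of this boundary tracing, where the prediction is reduced to "continue in the current direction"; the boundary is assumed to be an ordered list of (x, y) points and the turn threshold is illustrative:

    import math

    def boundary_surprises(boundary, turn_threshold=math.radians(40)):
        # Walk the ordered boundary points and mark every point where the
        # local direction changes sharply: the surprise points that make up
        # the boundary dopamine map.
        surprises = []
        for i in range(1, len(boundary) - 1):
            (x0, y0), (x1, y1), (x2, y2) = boundary[i - 1], boundary[i], boundary[i + 1]
            incoming = math.atan2(y1 - y0, x1 - x0)
            outgoing = math.atan2(y2 - y1, x2 - x1)
            turn = abs((outgoing - incoming + math.pi) % (2 * math.pi) - math.pi)
            if turn > turn_threshold:
                surprises.append(boundary[i])       # extreme slope change
        return surprises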
The next map is called the interior dopamine map. Again, the algorithm makes predictions about what it expects to see and marks anything that does not match up. The interior dopamine map algorithm assumes smooth transitions between surfaces. It tracks
along the image and continues predicting what the next values should be based on the previous
values. For example, we defined a face earlier as being made up of two eyes, a nose, and a
mouth. Those objects distinguish a face, and that is exactly what this type of algorithm would
mark. The features that show up inside of an object and their relation to each other define
what an object is. It only makes sense to design an algorithm to try this. Also, once the algorithm has been trained on what a nose, eyes, and mouth are, it can define a face in the same way. So, instead of saying there should be feature points in this position relative to that
position, it can build from the bottom up. If the algorithm finds an eye, it would expect another
eye and a nose to be present for a face.
The final map is called the background dopamine map. It is built using the same border and interior methods; the only difference is that every object also stores the types of backgrounds it can be found on. This is to avoid confusion between objects that look like other objects. If the
algorithm saw a marble that looked like an eye, the above algorithms would try to label it as an
eye. By including the background, the algorithm can figure out that it is not an eye since it does
not occur on a face.
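A toy sketch of this background check; the object names and background labels are hypothetical:

    # Each object definition stores the backgrounds it is expected to occur on.
    OBJECT_BACKGROUNDS = {
        "eye": {"face"},
        "marble": {"table", "floor"},
    }

    def accept_match(candidate, background, allowed=OBJECT_BACKGROUNDS):
        # Reject a candidate label when the background it was found on is not
        # one of the backgrounds stored for that object.
        return background in allowed.get(candidate, set())

    print(accept_match("eye", "table"))   # -> False (an eye-like blob on a table)
    print(accept_match("eye", "face"))    # -> True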
1.7 Generating Dopamine Maps and their Importance
The maps are generated from a pre-labeled data set. Dopamine maps would be
generated for each image in the pre-labeled data set. An algorithm would then find the
correlation among the maps and create a new map taking into account the differences between
the pictures. If a dopamine map is sufficiently different from the existing maps, then learning will take place and a new map will be created alongside the old one. The need for learning becomes apparent when we look at the letter "a" in different fonts, which may be rendered in a double-story or a single-story form. These correlation maps give a general description of an object that is self-relational and definition-based, so the program can recognize versions of objects it has not seen before.
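A minimal sketch of this correlation-and-learning step, treating each dopamine map as a set of surprise points; the distance measure and the newness threshold are illustrative assumptions:

    import math

    def map_distance(map_a, map_b):
        # Average distance from each surprise point in one map to its nearest
        # point in the other map, made symmetric by averaging both directions.
        def one_way(a, b):
            return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
        return (one_way(map_a, map_b) + one_way(map_b, map_a)) / 2

    def learn_map(new_map, prototypes, newness_threshold=10.0):
        # Fold a new dopamine map into the closest stored prototype, or keep
        # it as a new prototype when it differs enough from all of them.
        if not prototypes:
            prototypes.append(list(new_map))
            return
        best = min(prototypes, key=lambda p: map_distance(new_map, p))
        if map_distance(new_map, best) > newness_threshold:
            prototypes.append(list(new_map))        # learning: a new map
        else:
            # Correlate: nudge each prototype point toward its nearest new point.
            for i, p in enumerate(best):
                q = min(new_map, key=lambda m: math.dist(p, m))
                best[i] = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)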
These maps seem to be a much more human way of thinking about things than template matching. When we are asked to describe what an object looks like, we are not drawing an exact object from a template stored in our brain. Instead we draw an object based on its definition. For the letter "E" we understand that it is a vertical line with horizontal lines that extend a certain distance from the top and bottom and another horizontal line that extends from the middle. These maps also allow learning and adjustments to be made within the program constantly.
1.8 Smart Algorithms
The focus so far has been on algorithms that allow for learning and prediction of values.
These are present in the human cognitive process and provide what should be a much more
dynamic object recognition process. Also, all the implementations have been simple
mathematical operations which can be easily and quickly performed. I intentionally stayed away from complex mathematics such as Gaussian smoothing, eigenvectors, and statistical analysis because it is hard to see how the neural cells in our brains could implement these operations.
Also, the methods I have written so far have all had customizable values or thresholds. This
allows the program to learn and adapt so that it runs efficiently and outputs the best results.
1.9 Conclusion and Results
I have very few results to report at this time because of time constraints. I can report
results on the edge detection algorithm, though. It appears to perform at the same level as
more complicated algorithms as shown in the included pictures. Also, the implementation in
OCR will help out greatly. I ran the program on sample text, and it performed better than I could
have expected. The text images were full of different text colors, text sizes, and background
colors. The program was able to successfully separate out the characters into a binary black and
white image. The next step will be to implement the border prediction algorithm.
I plan to continue this research outside of class since it looks promising, and I look forward to working on it. The working code used so far is attached.
2.0 Acknowledgments
I would like to thank Dr. Abkemeier and the Fontbonne Math Department for letting me
pursue this area of interest. This paper was submitted to the faculty of Fontbonne University’s
Department of Mathematics and Computer Science as partial requirement for the degree of
Bachelor of Science in Mathematics.
2.1 References
Chetverikov, Dmitry. Is Computer Vision Possible? Rep. N.p.: n.p., n.d. Print.
Holley, Rose. "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale
Historic Newspaper Digitisation Programs." D-Lib Magazine N.p., n.d. Web. 5 Apr. 2013.
<http://www.dlib.org/dlib/march09/holley/03holley.html>.
How Brain Science Will Change Computing. Dir. Jeff Hawkins. TED, n.d. Web.
Lowe, David. The Computer Vision Industry. N.p., n.d. Web. 6 Apr. 2013.
<http://www.cs.ubc.ca/~lowe/vision.html>.
McCann, John J. Human Color Perception. Cambridge: Polaroid Corporation, 1973. Print.
Nadernejad, Ehsan. "Edge Detection Techniques: Evaluations and Comparisons." Mazandaran
Institute of Technology, n.d. Print.
Ritter, G. X., and Joseph N. Wilson. Handbook of Computer Vision Algorithms in Image Algebra.
Boca Raton: CRC, 1996. Print.
Shimojo, Shinsuke, Michael Paradiso, and Ichiro Fujita. What Visual Perception Tells Us about
Mind and Brain. Rep. N.p., n.d. Web. 5 Apr. 2013.
<http://www.pnas.org/content/98/22/12340.full>.
Stergiou, Christopher, and Dimitrios Siganos. "Neural Networks." Neural Networks. N.p., n.d.
Web. 5 Apr. 2013.
<http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html>.
Szeliski, Richard. Computer Vision: Algorithms and Applications. London: Springer, 2011. Print.
"What's OCR?" Data ID. N.p., n.d. Web. 5 Apr. 2013. <http://www.dataid.com/aboutocr.htm>.
More Related Content

What's hot

Elderly Assistance- Deep Learning Theme detection
Elderly Assistance- Deep Learning Theme detectionElderly Assistance- Deep Learning Theme detection
Elderly Assistance- Deep Learning Theme detectionTanvi Mittal
 
Ai complete note
Ai complete noteAi complete note
Ai complete noteNajar Aryal
 
Artificial Intelligence Research Topics for PhD Manuscripts 2021 - Phdassistance
Artificial Intelligence Research Topics for PhD Manuscripts 2021 - PhdassistanceArtificial Intelligence Research Topics for PhD Manuscripts 2021 - Phdassistance
Artificial Intelligence Research Topics for PhD Manuscripts 2021 - PhdassistancePhD Assistance
 
Emotinal Design
Emotinal DesignEmotinal Design
Emotinal DesignHamed Abdi
 
HUMAN FACE IDENTIFICATION
HUMAN FACE IDENTIFICATION HUMAN FACE IDENTIFICATION
HUMAN FACE IDENTIFICATION bhupesh lahare
 
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...Willy Marroquin (WillyDevNET)
 
Facial Emotion Recognition using Convolution Neural Network
Facial Emotion Recognition using Convolution Neural NetworkFacial Emotion Recognition using Convolution Neural Network
Facial Emotion Recognition using Convolution Neural NetworkYogeshIJTSRD
 
Case study on deep learning
Case study on deep learningCase study on deep learning
Case study on deep learningHarshitBarde
 
Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013
Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013
Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013TEST Huddle
 
Study on Different Human Emotions Using Back Propagation Method
Study on Different Human Emotions Using Back Propagation MethodStudy on Different Human Emotions Using Back Propagation Method
Study on Different Human Emotions Using Back Propagation Methodijiert bestjournal
 
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikDeep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikThe Hive
 
Case study on machine learning
Case study on machine learningCase study on machine learning
Case study on machine learningHarshitBarde
 
[TRANSCRIPT] Do we have a right to freedom of thought?
 [TRANSCRIPT] Do we have a right to freedom of thought?  [TRANSCRIPT] Do we have a right to freedom of thought?
[TRANSCRIPT] Do we have a right to freedom of thought? Jim Stroud
 
Robotic models of active perception
Robotic models of active perceptionRobotic models of active perception
Robotic models of active perceptionDimitri Ognibene
 
Meaning and the Semantic Web
Meaning and the Semantic WebMeaning and the Semantic Web
Meaning and the Semantic WebPhiloWeb
 

What's hot (20)

Elderly Assistance- Deep Learning Theme detection
Elderly Assistance- Deep Learning Theme detectionElderly Assistance- Deep Learning Theme detection
Elderly Assistance- Deep Learning Theme detection
 
Ai complete note
Ai complete noteAi complete note
Ai complete note
 
Artificial Intelligence Research Topics for PhD Manuscripts 2021 - Phdassistance
Artificial Intelligence Research Topics for PhD Manuscripts 2021 - PhdassistanceArtificial Intelligence Research Topics for PhD Manuscripts 2021 - Phdassistance
Artificial Intelligence Research Topics for PhD Manuscripts 2021 - Phdassistance
 
Emotinal Design
Emotinal DesignEmotinal Design
Emotinal Design
 
06 intelligence
06 intelligence06 intelligence
06 intelligence
 
Intelligence & Computers
Intelligence & ComputersIntelligence & Computers
Intelligence & Computers
 
HUMAN FACE IDENTIFICATION
HUMAN FACE IDENTIFICATION HUMAN FACE IDENTIFICATION
HUMAN FACE IDENTIFICATION
 
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Net...
 
Facial Emotion Recognition using Convolution Neural Network
Facial Emotion Recognition using Convolution Neural NetworkFacial Emotion Recognition using Convolution Neural Network
Facial Emotion Recognition using Convolution Neural Network
 
Case study on deep learning
Case study on deep learningCase study on deep learning
Case study on deep learning
 
Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013
Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013
Harry Collins - Testing Machines as Social Prostheses - EuroSTAR 2013
 
Study on Different Human Emotions Using Back Propagation Method
Study on Different Human Emotions Using Back Propagation MethodStudy on Different Human Emotions Using Back Propagation Method
Study on Different Human Emotions Using Back Propagation Method
 
Image recognition
Image recognitionImage recognition
Image recognition
 
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikDeep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
 
SIXTH SENSE TECHNOLOGY
SIXTH SENSE TECHNOLOGYSIXTH SENSE TECHNOLOGY
SIXTH SENSE TECHNOLOGY
 
Case study on machine learning
Case study on machine learningCase study on machine learning
Case study on machine learning
 
[TRANSCRIPT] Do we have a right to freedom of thought?
 [TRANSCRIPT] Do we have a right to freedom of thought?  [TRANSCRIPT] Do we have a right to freedom of thought?
[TRANSCRIPT] Do we have a right to freedom of thought?
 
Robotic models of active perception
Robotic models of active perceptionRobotic models of active perception
Robotic models of active perception
 
Meaning and the Semantic Web
Meaning and the Semantic WebMeaning and the Semantic Web
Meaning and the Semantic Web
 
Mind reading-computer
Mind reading-computerMind reading-computer
Mind reading-computer
 

Viewers also liked

"Я - учитель!" эссе
"Я  - учитель!"  эссе"Я  - учитель!"  эссе
"Я - учитель!" эссеcdoarg01
 
Анель Хасенова + аренда квартир + клиенты
Анель Хасенова + аренда квартир + клиентыАнель Хасенова + аренда квартир + клиенты
Анель Хасенова + аренда квартир + клиентыAnel Khassenova
 
Volatilidad en el precio del petróleo
Volatilidad en el precio del petróleo Volatilidad en el precio del petróleo
Volatilidad en el precio del petróleo Carolina Lo
 
Анель Хасенова+переводческие услуги+конкуренты
Анель Хасенова+переводческие услуги+конкурентыАнель Хасенова+переводческие услуги+конкуренты
Анель Хасенова+переводческие услуги+конкурентыAnel Khassenova
 
Fircroft Oil & Gas Powerpoint
Fircroft Oil & Gas PowerpointFircroft Oil & Gas Powerpoint
Fircroft Oil & Gas PowerpointShaun Garrathy
 
Kim Ross-Smith Resume_February2016
Kim Ross-Smith Resume_February2016Kim Ross-Smith Resume_February2016
Kim Ross-Smith Resume_February2016Kimberly Ross-Smith
 
RITIKA CHOPRA-ACCOUNTS EXECUTIVE - Copy
RITIKA CHOPRA-ACCOUNTS EXECUTIVE - CopyRITIKA CHOPRA-ACCOUNTS EXECUTIVE - Copy
RITIKA CHOPRA-ACCOUNTS EXECUTIVE - CopyRitika Chopra
 

Viewers also liked (20)

Matrices
MatricesMatrices
Matrices
 
Reseña Contable
Reseña ContableReseña Contable
Reseña Contable
 
"Я - учитель!" эссе
"Я  - учитель!"  эссе"Я  - учитель!"  эссе
"Я - учитель!" эссе
 
Intro to bus
Intro to busIntro to bus
Intro to bus
 
Анель Хасенова + аренда квартир + клиенты
Анель Хасенова + аренда квартир + клиентыАнель Хасенова + аренда квартир + клиенты
Анель Хасенова + аренда квартир + клиенты
 
Oca 3
Oca 3Oca 3
Oca 3
 
Trabajo de el romanticismo de jose luis
Trabajo de el romanticismo de jose luisTrabajo de el romanticismo de jose luis
Trabajo de el romanticismo de jose luis
 
Volatilidad en el precio del petróleo
Volatilidad en el precio del petróleo Volatilidad en el precio del petróleo
Volatilidad en el precio del petróleo
 
Анель Хасенова+переводческие услуги+конкуренты
Анель Хасенова+переводческие услуги+конкурентыАнель Хасенова+переводческие услуги+конкуренты
Анель Хасенова+переводческие услуги+конкуренты
 
Matt's Resumes_r
Matt's Resumes_rMatt's Resumes_r
Matt's Resumes_r
 
Syed_Khaja_Nooruddin.04042016
Syed_Khaja_Nooruddin.04042016Syed_Khaja_Nooruddin.04042016
Syed_Khaja_Nooruddin.04042016
 
Los ecosistemas terrestres
Los ecosistemas terrestresLos ecosistemas terrestres
Los ecosistemas terrestres
 
Poder
PoderPoder
Poder
 
MATTS VISUAL AID
MATTS VISUAL AIDMATTS VISUAL AID
MATTS VISUAL AID
 
Poder
PoderPoder
Poder
 
Title
TitleTitle
Title
 
first_assignment_Report
first_assignment_Reportfirst_assignment_Report
first_assignment_Report
 
Fircroft Oil & Gas Powerpoint
Fircroft Oil & Gas PowerpointFircroft Oil & Gas Powerpoint
Fircroft Oil & Gas Powerpoint
 
Kim Ross-Smith Resume_February2016
Kim Ross-Smith Resume_February2016Kim Ross-Smith Resume_February2016
Kim Ross-Smith Resume_February2016
 
RITIKA CHOPRA-ACCOUNTS EXECUTIVE - Copy
RITIKA CHOPRA-ACCOUNTS EXECUTIVE - CopyRITIKA CHOPRA-ACCOUNTS EXECUTIVE - Copy
RITIKA CHOPRA-ACCOUNTS EXECUTIVE - Copy
 

Similar to Senior Project Paper

0-1--Introduction FPCV-0-1.pdf
0-1--Introduction FPCV-0-1.pdf0-1--Introduction FPCV-0-1.pdf
0-1--Introduction FPCV-0-1.pdfPatrickMatthewChan
 
Everything You Need to Know About Computer Vision
Everything You Need to Know About Computer VisionEverything You Need to Know About Computer Vision
Everything You Need to Know About Computer VisionKavika Roy
 
The relationship between artificial intelligence and psychological theories
The relationship between artificial intelligence and psychological theoriesThe relationship between artificial intelligence and psychological theories
The relationship between artificial intelligence and psychological theoriesEr. rahul abhishek
 
The Magic Behind AI
The Magic Behind AIThe Magic Behind AI
The Magic Behind AIOthman Gacem
 
Paper on Computer Vision
Paper on Computer VisionPaper on Computer Vision
Paper on Computer VisionSanjayS117
 
Machine creativity TED Talk 2.0
Machine creativity TED Talk 2.0Machine creativity TED Talk 2.0
Machine creativity TED Talk 2.0Cameron Aaron
 
Applied Computer Vision - a Deep Learning Approach
Applied Computer Vision - a Deep Learning ApproachApplied Computer Vision - a Deep Learning Approach
Applied Computer Vision - a Deep Learning ApproachJose Berengueres
 
What is Computer Vision?
What is Computer Vision?What is Computer Vision?
What is Computer Vision?Kavika Roy
 
IRJET- ATM Security using Machine Learning
IRJET- ATM Security using Machine LearningIRJET- ATM Security using Machine Learning
IRJET- ATM Security using Machine LearningIRJET Journal
 
Computer vision lightning talk castaway week
Computer vision lightning talk castaway weekComputer vision lightning talk castaway week
Computer vision lightning talk castaway weekChristopher Decker
 
Face Recognition Human Computer Interaction
Face Recognition Human Computer InteractionFace Recognition Human Computer Interaction
Face Recognition Human Computer Interactionines beltaief
 
scene description
scene descriptionscene description
scene descriptionkhushi2551
 
Dragos_Papava_dissertation
Dragos_Papava_dissertationDragos_Papava_dissertation
Dragos_Papava_dissertationDragoș Papavă
 
Mind reading computer
Mind reading computerMind reading computer
Mind reading computerJudy Francis
 
Intellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
Intellectual Person Identification Using 3DMM, GPSO and Genetic AlgorithmIntellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
Intellectual Person Identification Using 3DMM, GPSO and Genetic AlgorithmIJCSIS Research Publications
 
AI Therapist – Emotion Detection using Facial Detection and Recognition and S...
AI Therapist – Emotion Detection using Facial Detection and Recognition and S...AI Therapist – Emotion Detection using Facial Detection and Recognition and S...
AI Therapist – Emotion Detection using Facial Detection and Recognition and S...ijtsrd
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introductionathirakurup3
 

Similar to Senior Project Paper (20)

0-1--Introduction FPCV-0-1.pdf
0-1--Introduction FPCV-0-1.pdf0-1--Introduction FPCV-0-1.pdf
0-1--Introduction FPCV-0-1.pdf
 
Everything You Need to Know About Computer Vision
Everything You Need to Know About Computer VisionEverything You Need to Know About Computer Vision
Everything You Need to Know About Computer Vision
 
The relationship between artificial intelligence and psychological theories
The relationship between artificial intelligence and psychological theoriesThe relationship between artificial intelligence and psychological theories
The relationship between artificial intelligence and psychological theories
 
The Magic Behind AI
The Magic Behind AIThe Magic Behind AI
The Magic Behind AI
 
Paper on Computer Vision
Paper on Computer VisionPaper on Computer Vision
Paper on Computer Vision
 
Machine creativity TED Talk 2.0
Machine creativity TED Talk 2.0Machine creativity TED Talk 2.0
Machine creativity TED Talk 2.0
 
Applied Computer Vision - a Deep Learning Approach
Applied Computer Vision - a Deep Learning ApproachApplied Computer Vision - a Deep Learning Approach
Applied Computer Vision - a Deep Learning Approach
 
1. The Game Of The Century
1. The Game Of The Century1. The Game Of The Century
1. The Game Of The Century
 
What is Computer Vision?
What is Computer Vision?What is Computer Vision?
What is Computer Vision?
 
IRJET- ATM Security using Machine Learning
IRJET- ATM Security using Machine LearningIRJET- ATM Security using Machine Learning
IRJET- ATM Security using Machine Learning
 
Computer vision lightning talk castaway week
Computer vision lightning talk castaway weekComputer vision lightning talk castaway week
Computer vision lightning talk castaway week
 
Beekman5 std ppt_14
Beekman5 std ppt_14Beekman5 std ppt_14
Beekman5 std ppt_14
 
Face Recognition Human Computer Interaction
Face Recognition Human Computer InteractionFace Recognition Human Computer Interaction
Face Recognition Human Computer Interaction
 
scene description
scene descriptionscene description
scene description
 
Computer vision
Computer visionComputer vision
Computer vision
 
Dragos_Papava_dissertation
Dragos_Papava_dissertationDragos_Papava_dissertation
Dragos_Papava_dissertation
 
Mind reading computer
Mind reading computerMind reading computer
Mind reading computer
 
Intellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
Intellectual Person Identification Using 3DMM, GPSO and Genetic AlgorithmIntellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
Intellectual Person Identification Using 3DMM, GPSO and Genetic Algorithm
 
AI Therapist – Emotion Detection using Facial Detection and Recognition and S...
AI Therapist – Emotion Detection using Facial Detection and Recognition and S...AI Therapist – Emotion Detection using Facial Detection and Recognition and S...
AI Therapist – Emotion Detection using Facial Detection and Recognition and S...
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introduction
 

Senior Project Paper

  • 1. Kurtz1 Computer Vision: Optical Character Recognition Mark Kurtz Fontbonne University Department of Mathematics and Computer Science ABSTRACT Humans are very visual creatures. With everything we do or think about, our visual system is generally involved. The activities that incorporate our visual systemspan everything from reading to driving. Computer Vision is the study of how to implement the human visual system and how we perform visual tasks into machines and programs. My studies and this paper revolve around this topic, specifically Optical Character Recognition. Here OCR (Optical Character Recognition) tries to recognize characters or words inside of images. My focus revolved around implementing OCR through custom algorithms I wrote after studying techniques used in the computer vision field. I successfully created a program with a few algorithms with the rest of the algorithms untested but documented. The results for the tested algorithms are documented in this paper.
  • 2. Kurtz2 1.1 Introduction Computer Vision has slowly implemented itself into our lives in limited aspects in the fields of automotive drive assistance, eye and head tracking, filmand video analysis, gesture recognition, industrial automation and inspection, medical analysis, object recognition, photography, security, 3D modeling, etc. (Lowe) These applications and the algorithms behind them are extremely specific. Because the programs created from the algorithms are so specific, much of the code does not successfully transfer among different applications. Since much of the code does not transfer among applications, there is no master code or algorithm for the computer vision field. Hense even the visual systemof a two-year old cannot be replicated— computer programs still cannot successfully find all the animals in a picture. The reasons for this are many which simplify down to one point: the human visual systemis extremely hard to understand and replicate. The process becomes an inverse problem where we try to form a solution that resembles what our eyes process. This seems easy, but there are many different hidden layers between the input images on our eyes to what we perceive. With numerous unknowns, much of the focus in the computer vision field has resorted to physics-based or statistical models to determine potential solutions. (Szeliski) Because of how well our visual systemhandles images, I underestimated the complexity of computer vision. It seems easy to select different objects in an image in our everyday world. Looking around you can readily distinguish objects in your surroundings, what they are, how far away they are, and their three-dimensional shape. To determine all of this our brains transform the images taken in by our eyes in many different steps. As an example, we may perceive colors darker or lighter than what the actual color value is. This is how we see the same shade of red
  • 3. Kurtz3 on an apple throughout the day despite the changing colors of light reflecting through the atmosphere. An example is in the picture that follows. The cells A and B are the exact same color, but our visual systemchanges the color we perceive based on the surrounding colors: Current algorithms have yet to effectively replicate the human color perception system. (McCann) Other visual tricks our eyes perform range from reconstructing a three-dimensional reality from two-dimensional images in our retinas to perceiving lines and edges from missing image data. An article titled What Visual Perception Tells Us about Mind and Brain explains this in more detail. “What we see is actually more than what is imaged on the retina. For example, we perceive a three-dimensional world full of objects despite the fact that there is a simple two-dimensional image on each retina. … Thus perception is inevitably an ambiguity-solving process. The perceptual systemgenerally reaches the most plausible global interpretation of the retinal input by integrating local cues”. Despite the complexity of the human visual system, many people have tried to replicate it and implement it in machines and robotics. Many algorithms have been developed for specific applications across different fields and disciplines. One of the most prevalent in consumer applications is facial recognition. In fact, it has nearly become a part of everyday use because of its use in images in social networks such as Facebook, images processed in digital
  • 4. Kurtz4 cameras, and images processed in photo editing software such as Picasa. The most successful algorithm creates Eigenfaces, which are a set of eigenvectors, and then looks for these inside of an image. The eigenvectors are derived from statistical analysis of many different pictures of faces. (Szeliski) In other words, it creates a template from other pre-labeled faces and searches through an image to see where they occur. This way seems surprising to me since the algorithm never tries to break apart the constituents of the image or even determine what is contained within the image. The algorithms never process corresponding shapes, three-dimensional figures, edges, etc. from the image data. With no other processing performed other than a simple template search, the program has no clues about context leading it to label faces within an image that we may consider unimportant. For example, I used Picasa to sort my images by who was in each picture. It does this through facial recognition. However, in some images it recognized faces that were far off in the background, faces that were out of focus, and even a face on the cover of a magazine someone was reading in the background of the image. Not only does facial recognition suffer from a lack of context, but also the exact method of template matching by using Eigenfaces in facial recognition makes the algorithm extremely specific. This speaks volumes about the methods used in computer vision. The methods and algorithms for facial recognition cannot readily be used to identify animals in an image without completely retraining the algorithm. By extension, the specificity of algorithms developed for each application in computer vision cannot transfer over to another application. I see this as a huge problem since we cannot possibly have thousands of different image analysis processes in our brain running all at once to look specifically for certain objects. I believe the way our brain determines there is a person in an image is the same as how it determines there are animals in
an image. The current field of computer vision is moving towards this kind of template matching. While it may work in specific situations, I cannot see how it will ever replicate human vision for the reasons stated above. To break away from the specificity of the computer vision field, I took a few ideas from lower level processing techniques, developed my own algorithms in place of the ones commonly used, and then explained a new matching technique.

1.2 New Techniques Needed in Computer Vision

Over the decades that computer vision has been studied and developed, it has not progressed as well as most predicted or hoped. In fact, it has led some who contributed to and pioneered the field to form the extreme view that computer vision is dead. (Chetverikov) While I do not believe the study of computer vision is dead, I think new algorithms are needed in the field. The algorithms need to move away from specific applications and involve more ideas from neuroscience and biological processes.

Jeff Hawkins, a pioneer of PDAs (Personal Digital Assistants, essentially the precursors to the smartphone) and now an artificial intelligence researcher, spoke at a TED (Technology, Entertainment, and Design) conference in 2003. His speech focused on artificial intelligence. In it he states that not only computer vision but the entire field of artificial intelligence is moving in the wrong direction. He explains that this comes from an incorrect but strongly held belief that intelligence is defined by behavior: we are intelligent because of the way we do things. He counters that our intelligence comes from experiencing the world through a sequence of patterns that we store and recall to match against reality. In this way we make predictions about what will happen next, or what should happen next. An example he gives is our recognition of
faces. As observed in studies, when humans view a person we first look at one eye, then the other eye, then the nose, and finally the mouth. This is simply explained by predictions being made and then confirmed as we observe our world. We expect an eye, then an eye next to it, then a nose, then a mouth. If we see something different, it will not match our predictions, and learning or more concentrated analysis will occur. (How Brain Science Will Change Computing)

I developed my research around the direction of prediction as explained by Jeff Hawkins. In this way patterns, sequences, and predictions can be used in object recognition and classification. Logically, pattern matching seems to make more sense than specific template matching. By looking for exact matches to templates, we will never be able to reproduce visual tricks such as seeing shapes in clouds. Also, template matching often produces false results if an object is shaded differently or partially hidden. (Ritter and Wilson) Pattern matching seems to offer a fix for the different lighting conditions that occur in images, and it may have a better chance of identifying objects that are slightly obscured or missing some data. This must be studied further through experimentation, however.

1.3 Simplifying Things to Optical Character Recognition

When I began my research, I hoped to apply the ideas I developed to full images for classification and recognition of objects. Unfortunately, time constraints limited my research and implementation, so I decided to focus on OCR (Optical Character Recognition). This field is a subset of computer vision which focuses on text recognition in images. In OCR there is a
limited number of objects that can occur within the images. Also, the images processed in OCR are much simpler than the full images processed in computer vision; in most cases the input is a two-dimensional binary image, such as a scan of a page from a book. While OCR is simpler than general computer vision, the algorithms I created can still be implemented within it since it is a field within computer vision.

OCR is used by many industries and businesses. It is often packaged with Adobe PDF software and document scanning software, and the United States Post Office has heavily implemented OCR to recognize addresses written on packages and letters. OCR is generally divided into two methods. The first is matrix matching, where characters are matched pixel by pixel to a template. The second is feature extraction, where the program looks for general features such as open areas, closed shapes, diagonal lines, etc. This method is applied much more often than matrix matching, and is much more successful. (What's OCR?)

Feature extraction has become the default for OCR software. It works by analyzing characters to determine strokes, lines of discontinuity, and the background. From here the program builds up from the character to the word and then assigns a confidence level after comparing against a dictionary of words. This works well for images converted straight from computer documents, with a 99% accuracy rate; for older, scanned papers the accuracy rate drops and varies widely, from 71% to 98%. (Holley)

The method I have developed follows the feature extraction approach. The ideas of feature extraction have worked very well within OCR so far and seem to resemble the idea of prediction matching described earlier. The problem is defining what features are and how to decode them inside an image.
1.4 Describing the Overall Idea

The general idea of my algorithms in computer vision and OCR is to separate an image into constituent objects and then define those objects by their surprising features. In doing this, the algorithm builds a definition of an object rather than a template for an object. It seems more natural to describe objects by a definition rather than a template, so I pursued algorithms that define objects this way. For example, when we describe a face we do not build an exact template from the thousands of faces we have seen in the past. Instead we describe a face as having two eyes, a nose, and a mouth. These are the features that distinguish a face from anything else; if no features protruded from the interior of the face, we would have a hard time distinguishing it. Definitions work the same way for the outlines of an object. We can easily draw the outline of a dolphin because we know the border points that are most memorable and stick out from a regular oval or other shape. For a dolphin these are the tail, the dorsal fin, and the mouth.

While definition building seems to be a more natural way of understanding images, it may also explain why we see objects in clouds or ink blots. I believe we see images in these objects because certain features resemble patterns we have seen before in other objects. We may see a face in an ink blot because there are two dots to represent eyes and a nose, all in the correct relation to each other. We may see a dolphin in the clouds because a part of it resembles the dorsal fin and the nose. In both of these examples general objects portray specific objects because certain key features match up.
1.5 Lower Level Processing

Lower level processing in computer vision refers to finding key points, features, edges, lines, etc. in an image. At the lower level no matching takes place; the focus is to decode the image into basic points, lines, and shapes for processing later on. (Szeliski) One of the many lower level processes is edge detection. Edges define boundaries between regions in an image and occur where there is a distinct change in image intensity. There are tens if not hundreds of algorithms written for edge detection, ranging from the simple to the extremely complex. (Nadernejad)

The most widely used algorithms for edge detection are the Marr-Hildreth edge detector and the Canny edge detector. (Szeliski) The general Marr-Hildreth algorithm first applies a Gaussian smoothing operator (a matrix which approximates a bell-shaped curve) and then a two-dimensional Laplacian (another matrix, equivalent to taking the second derivative of the image). The Gaussian reduces the amount of noise in the image simply by blurring it, though this has the unwanted effect of losing fine detail. The Laplacian takes the second derivative of the image; the idea is that a step difference in image intensity shows up as a zero crossing in the second derivative. The Canny edge detector also begins by applying a Gaussian smoothing operator. It then finds the gradient of the image at each point to indicate the presence of edges while suppressing any points that are not maximum gradient values in a small region. After all this has been performed, thresholding with hysteresis is applied, using a high and a low threshold. Again, the Gaussian loses detail as it tries to reduce the amount of noise in the image.
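To make the zero-crossing idea concrete, here is a minimal Marr-Hildreth-style sketch in Python. It is not one of the detectors developed in this paper, and it assumes SciPy's Laplacian-of-Gaussian filter with an arbitrary sigma value:

```python
import numpy as np
from scipy import ndimage

def marr_hildreth_edges(gray, sigma=2.0):
    """Rough Marr-Hildreth sketch: smooth with a Gaussian, take the Laplacian,
    then mark pixels where the result changes sign (zero crossings)."""
    log = ndimage.gaussian_laplace(gray.astype(float), sigma=sigma)
    edges = np.zeros(log.shape, dtype=bool)
    # A zero crossing between horizontal or vertical neighbors indicates an edge.
    edges[:, :-1] |= np.sign(log[:, :-1]) != np.sign(log[:, 1:])
    edges[:-1, :] |= np.sign(log[:-1, :]) != np.sign(log[1:, :])
    return edges
```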
Both the Marr-Hildreth and the Canny edge detectors are very expensive in terms of computation time because of the operations involved. I stepped away from these approaches and tried to look at edge detection in a simpler way that could be reproduced by artificial neural networks.

Artificial neural networks were inspired by the way biological nervous systems process information. They are composed of a large number of processing elements that work together to solve specific problems. Neurons inside the network are interconnected and feed inputs into each other: if the result of an operation applied to all of a neuron's inputs is above a certain threshold, the neuron passes its output on to other neurons. The way the neurons are connected, the threshold set for each, and the operation performed by each can all change and are adapted to improve the performance of the algorithm. This replicates the learning process in biological brains and nervous systems. Artificial neural networks have been effectively applied in pattern recognition and other applications because of their ability to change and adapt to new inputs. (Stergiou and Siganos)

Neural networks have a downside, though: they need many sets of training data in order to achieve accurate results, and those training sets take a lot of time to create. So, instead of abandoning neural networks, I tried to combine the best of traditional computing and neural networks for my edge detection algorithms. I used arrays as the inputs and outputs that form the neurons, predetermined what the operations would be, and then applied a threshold which could be adjusted to maximize the accuracy of the algorithm.
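As a sketch of this hybrid idea, a "neuron" here is nothing more than an array of inputs, a predefined operation, and an adjustable threshold; the operation and threshold below are placeholders for illustration, not the values used in my programs:

```python
import numpy as np

def neuron(inputs, operation, threshold):
    """Apply a predefined operation to an array of inputs and fire (pass the
    value on) only if the result exceeds an adjustable threshold."""
    value = operation(np.asarray(inputs, dtype=float))
    return value if value > threshold else 0.0

# Example: a neuron that fires when neighboring pixel values differ sharply.
step_size = lambda pixels: abs(pixels[-1] - pixels[0])
print(neuron([118, 120, 119, 240], step_size, threshold=50))  # prints 122.0
```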
My first idea built upon this combination of neural networks and predefined operations. I explored the idea that edges are step (non-algebraic) changes in light intensity in an image. I realized that I could calculate the change in light intensity along a specific direction in the image and, by doing so, approximate the derivative at each point. From there I could approximate the second derivative, the third derivative, and so on. With the derivatives approximated, the algorithm can then work backwards and figure out what the next pixel value should be for an algebraic equation. The following image explains this further:

This system works perfectly for predicting the next value of algebraic equations such as y = 2x + 20 or y = 5x³ + 3x – 10. If the array is expanded to include more pixel values, it can work with even higher order equations. The algorithm is able to do this because, for a polynomial, the approximate derivative eventually becomes a constant value or 0. Surprisingly, the algorithm was also able to reasonably approximate the next value of y = cos(x) and y = ln(x). I then designed a program to apply the algorithm to images. For each pixel it would calculate a predicted value from the surrounding pixels; if the predicted value and the actual value were off by more than a certain threshold, the pixel would be marked as an edge.
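A minimal sketch of this prediction step on a one-dimensional run of values might look like the following. It leaves out the fixed pixel window and the image-scanning code, and simply assumes the highest-order approximate derivative stays constant:

```python
import numpy as np

def predict_next(values):
    """Predict the next value in a sequence by assuming its highest-order
    finite difference (approximate derivative) stays constant."""
    levels = [np.asarray(values, dtype=float)]
    while len(levels[-1]) > 1:
        levels.append(np.diff(levels[-1]))   # successive approximate derivatives
    # Work backwards: add the last entry of each level to rebuild the next value.
    prediction = 0.0
    for level in reversed(levels):
        prediction += level[-1]
    return prediction

print(predict_next([20, 22, 24, 26]))         # y = 2x + 20        -> 28.0
print(predict_next([-10, -2, 36, 134, 322]))  # y = 5x^3 + 3x - 10 -> 630.0
```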
The algorithm did not work as well in practice, though. Small variations in light intensity in an image would create large changes in the values higher up in the array, which led to extreme predicted values that did not match what a human would expect the next value to be. Another problem appeared whenever an edge occurred between the values used for the prediction; for example, if an edge occurred at Pixel 2 in the image shown above, then the value of PR4 became extreme. Noise such as a bad pixel or a speck in the image created the same problem as an edge occurring among the values used for the prediction. All three of these problems created false edges and spots inside the generated images. Here are the results of this algorithm (the top pictures are gray scale, and the bottom are the edge detection):

After days of trying and several algorithms, I derived an algorithm based on the previously described combination of a neural network and predefined operations. Essentially it approximates the derivative of order n on each side of the pixel being tested. This brings local pixels into the reasoning for finding edges, and the benefit of including these local pixels is to eliminate noise that might already be present in the image. It also approximates the first
derivative on each side of the pixel being tested. The purpose of the first derivative is to make sure the edge occurred at the pixel being tested rather than somewhere in the general region around it. Then, instead of predicting the next values, my algorithm simply compares the values of the approximated derivatives. The following picture explains the algorithm in a visual way.

The new algorithm seemed to work better, and it ran fairly quickly since all operations are performed on integer values. I believe it is faster than the Canny and Marr-Hildreth edge detectors, but that is only speculation. To speed up the algorithm and use less memory, I worked out the relation between each successive approximate derivative: it follows Pascal's Triangle with alternating signs. For example, to approximate only the first derivative you take PL1 – P1 (referring back to the image above). To approximate the second derivative you would normally calculate the first derivatives and then find the difference between them; this simplifies from DLA2 – DLA3 to PL2 – 2*PL1 + P1 (again referring to the image above). To approximate the
third derivative, the equation simplifies to PL3 – 3*PL2 + 3*PL1 – P1 (referring to the image above). After the program finds the necessary approximate derivatives, it calculates the difference between them; if the difference is more than a specific threshold, it records the pixel as an edge. The thresholds are adjustable, but I was not able to experiment with many thresholds or design the program to choose appropriate thresholds on its own. Despite the limited testing, the results seem promising. Here are the results of this algorithm (the top pictures are gray scale, and the bottom are the edge detection):

As shown in the above image, the algorithm seems to work fairly well for single objects within a landscape. It has problems decoding edges when there are many objects within the same image, such as in the picture of trees; however, I believe all edge detectors have this problem. It also has problems with texture, such as the water in the dolphin picture or the fur on the kangaroo. Some of these problems with texture occurred because the algorithm only uses the gray scale version of the images.
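Before moving to color, here is a minimal sketch of the derivative-comparison test above for a single gray scale row. The derivative order, the exact way the first-derivative check is combined, and the thresholds are illustrative assumptions rather than the values used in my program:

```python
from math import comb

def side_derivative(pixels, order):
    """Approximate the order-th derivative from a run of pixels using
    Pascal's Triangle coefficients with alternating signs
    (order 2: p[2] - 2*p[1] + p[0]; order 3: p[3] - 3*p[2] + 3*p[1] - p[0])."""
    return sum((-1) ** k * comb(order, k) * pixels[order - k] for k in range(order + 1))

def is_edge(row, x, order=3, threshold=40):
    """Test an interior pixel x of a gray scale row: compare the approximate
    derivatives built from the pixels to its left and to its right."""
    left = [row[x - i] for i in range(order + 1)]    # tested pixel plus pixels to the left
    right = [row[x + i] for i in range(order + 1)]   # tested pixel plus pixels to the right
    high_order_gap = abs(side_derivative(left, order) - side_derivative(right, order))
    first_order_gap = abs((left[1] - left[0]) - (right[1] - right[0]))
    # Both the order-n and the first-derivative approximations must disagree sharply,
    # so an edge a few pixels away does not mark the tested pixel.
    return high_order_gap > threshold and first_order_gap > threshold
```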
To fix the problem of only being able to analyze gray scale images, I decided to make a three-dimensional color mapping that plots the points of every possible color combination. I built the color mapping using two main facts. The first is that the lowest possible color is black and the highest possible color is white; every other color falls somewhere between these two extremes. Having two colors that all others build from gave me two limiting points, one at each end of the map, and a basis for placing colors along the Z-axis of the three-dimensional mapping. The sum of the red, green, and blue pixel values determines where a color sits on the Z-axis. Black (with RGB pixel values 0,0,0) occurs at Z = 0. White (with RGB pixel values 255,255,255) occurs at Z = 765. Red, green, and blue then occur at Z = 255 (red has an RGB value of 255,0,0, for example) and yellow, cyan, and magenta occur at Z = 510 (yellow has an RGB value of 255,255,0, for example).

The second fact comes from color theory, where the colors red, green, and blue are separated by an angle of 120 degrees on a color wheel. Using this separation of the three primary colors, I created an equilateral triangle mapping for color values, with the maximum red values at the top corner of the triangle, the maximum green values in the left corner, and the maximum blue values in the right corner. Over this triangle I placed an XY axis with the Y-axis aligned with the red corner and the origin at the centroid of the triangle. Also, the Z-axis was shifted so that 382 (the approximate center of the range 0 to 765) occurs at Z = 0. With all of this explained, here are the equations for the mapping and a pictorial diagram:

X = Blue * cos(−π/6) + Green * cos(7π/6)
Y = Red + Blue * sin(−π/6) + Green * sin(7π/6)
Z = (Red + Green + Blue) − 382
Computation time was decreased by rounding and scaling the values so that everything was kept as an integer. The formulas then became:

X = (Blue − Green) * 173
Y = Red * 100 − (Blue + Green) * 50
Z = (Red + Green + Blue − 382) * 100

The first algorithm written with this new color mapping worked very well despite its simplicity. The algorithm compared the distances, in the color mapping, between the color values of pixels on either side of the pixel being tested. It did this for all pixels within a predetermined distance of the pixel being tested, along the horizontal and vertical directions in the image. If the calculated distance between the color values on either side exceeded a threshold, the pixel being tested was marked as an edge.
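Here is a minimal sketch of the integer mapping and the distance-threshold test just described; the window radius and threshold are illustrative assumptions, and the image is assumed to be an RGB array:

```python
import numpy as np

def color_to_xyz(r, g, b):
    """Integer version of the color mapping for one RGB pixel."""
    r, g, b = int(r), int(g), int(b)
    return np.array([(b - g) * 173,
                     r * 100 - (b + g) * 50,
                     (r + g + b - 382) * 100])

def is_color_edge(image, row, col, radius=2, threshold=15_000):
    """Mark (row, col) as an edge if pixels on opposite sides of it, along the
    horizontal or vertical direction, lie far apart in the color mapping."""
    for d in range(1, radius + 1):
        horiz = np.linalg.norm(color_to_xyz(*image[row, col - d]) -
                               color_to_xyz(*image[row, col + d]))
        vert = np.linalg.norm(color_to_xyz(*image[row - d, col]) -
                              color_to_xyz(*image[row + d, col]))
        if horiz > threshold or vert > threshold:
            return True
    return False
```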
Here is an image diagram of the algorithm, its results, and a comparison with other edge detectors (images from Ehsan Nadernejad's "Edge Detection Techniques: Evaluations and Comparisons"):
The edge detection algorithms described so far are the ones that have been developed and tested. I have two more I would like to develop and test, so the general ideas behind them are included here. The first builds further on the color mapping I developed: the algorithm uses the color mapping to parse through an image and group pixels together according to which pixels are closest in value to each other. It does this recursively by matching every pixel in the image to the neighboring pixels that are closest in value (a rough sketch of this grouping idea appears at the end of this section). I created an algorithm to attempt this methodology, but it took several minutes to go through one image and did not produce accurate results. The second algorithm yet to be developed and tested can be added onto any of the algorithms listed in this paper. Essentially it checks that a pixel marked as an edge is part of a line and not just a random value or single pixel in the image. It will also be able to fill in missing pixels based on the presence of surrounding edge values.

The results of the edge detection were easiest to see in regular images, but the algorithms are fully applicable to OCR. Here is an example of text in an image that has been run through the color mapping edge detection algorithm:

The edge detection algorithms will help with recognition in the next section.
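As promised above, here is a rough sketch of how the pixel-grouping idea might be approached. This is not the implementation I attempted; the four-neighbor connectivity and the distance threshold are assumptions:

```python
import numpy as np
from collections import deque

def group_pixels(xyz_image, max_distance=8_000):
    """Group neighboring pixels whose mapped color values are close together.
    xyz_image is an (H, W, 3) array of already-mapped color coordinates."""
    height, width, _ = xyz_image.shape
    labels = np.full((height, width), -1, dtype=int)
    current = 0
    for start in np.ndindex(height, width):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:                       # breadth-first flood fill
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < height and 0 <= nc < width and labels[nr, nc] == -1:
                    diff = xyz_image[r, c].astype(float) - xyz_image[nr, nc]
                    if np.linalg.norm(diff) < max_distance:
                        labels[nr, nc] = current
                        queue.append((nr, nc))
        current += 1
    return labels
```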
1.6 Dopamine Maps

With the lower level processing resolved, I still needed a way to match characters and objects. The general ideas for these matching algorithms come from a prediction and pattern approach. Generally, humans define objects by their features. For example, a face contains two eyes, a nose, and a mouth. The letter “A” is made up of two slanted lines which meet at a point at the top, with a horizontal line in the middle. I followed this methodology rather than trying to match objects to templates like many algorithms before. My way of matching assumes smooth transitions between areas and lines; in this way it can predict what the next value should be, much like the first algorithm I described for edge detection. If the predictions the algorithm makes do not match up, it marks those points as surprises or unexpected values. In the same way that dopamine in your brain is released by unexpected events, the algorithm marks points it did not expect. I will explain these algorithms in detail, but the programs behind them are not working yet.

There are three types of dopamine maps I have developed and plan to implement. The first is called the boundary dopamine map, which describes the outside shape of an object. The algorithm traces along the boundary of an object, predicting where the next point should be based on the previous points and their slopes. If the point predicted by the algorithm is not part of the boundary of the object, that point is marked in the boundary dopamine map. A good example is the letter “A” again. Here the algorithm explores the boundary of the letter and marks anything it did not expect. In this case it marks the ends of the two slanted lines, since it expected the lines to continue. It also marks the point at the top and the intersections of the horizontal and slanted lines because the change in slope there is very extreme. It does not mark those points because they are corners as such; any extreme change in slope will create a surprise point according to the algorithm.
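A rough sketch of this boundary tracing idea follows. It assumes the boundary has already been extracted as an ordered list of points, and the window and tolerance values are illustrative assumptions:

```python
import numpy as np

def boundary_surprises(points, window=3, tolerance=2.0):
    """Walk an ordered boundary and return the indices of 'surprise' points:
    places where the actual next point is far from where the recent direction
    of travel predicted it would be."""
    pts = np.asarray(points, dtype=float)
    surprises = []
    for i in range(window, len(pts) - 1):
        step = (pts[i] - pts[i - window]) / window   # average recent direction
        predicted = pts[i] + step
        if np.linalg.norm(pts[i + 1] - predicted) > tolerance:
            surprises.append(i + 1)
    return surprises
```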
The image below shows the final output from this type of algorithm.

The points marked by the boundary dopamine algorithm are only relational to each other. In other words, these points can show up in any orientation and will still be matched as long as their distance and orientation relative to each other are the same.

The next map is called the interior dopamine map. Again, the algorithm makes predictions about what it expects to see and marks anything that does not match up. The interior dopamine map algorithm assumes smooth transitions between surfaces. It tracks along the image and keeps predicting what the next values should be based on the previous values. For example, we defined a face earlier as being made up of two eyes, a nose, and a mouth. Those features distinguish a face, and they are exactly what this type of algorithm would mark. The features that show up inside an object, and their relation to each other, define what the object is, so it only makes sense to design an algorithm to try this. Also, once the algorithm has been trained on what a nose, eyes, and a mouth are, it can define a face in the same way. So, instead of saying there should be feature points in this position relative to this
position, it can build from the bottom up: if the algorithm finds an eye, it would expect another eye and a nose to be present for a face.

The final map is called the background dopamine map. It is defined using the boundary and interior methods; the only difference is that every object stores the types of backgrounds it can be found on. This is to avoid confusion between objects that look like other objects. If the algorithm saw a marble that looked like an eye, the above algorithms would try to label it as an eye. By including the background, the algorithm can figure out that it is not an eye since it does not occur on a face.

1.7 Generating Dopamine Maps and Their Importance

The maps are generated from a pre-labeled data set. Dopamine maps would be generated for each image in the pre-labeled data set, and an algorithm would then find the correlation among the maps and create a new map that takes into account the differences between the pictures. If a dopamine map is sufficiently different from the other maps, learning takes place and a new map is created alongside the old one. The need for learning becomes apparent when we look at the letter “a” in different fonts, which may be represented as “a” or “a”. These correlation maps allow a general description of an object that is self-relational and definition based, so that the program can recognize types of objects it has not seen before.

These maps seem to be a much more human way of thinking about things than template matching. When we are asked to describe what an object looks like, we are not drawing an exact object from a template stored in our brain. Instead we draw an object based on its definition. For the letter “E” we understand that it is a vertical line with horizontal lines
that extend from the top and the bottom a certain distance, and another horizontal line that extends from the middle. These maps allow learning and adjustments to be made within the program constantly.

1.8 Smart Algorithms

The focus so far has been on algorithms that allow for learning and for prediction of values. Both are present in the human cognitive process and should provide a much more dynamic object recognition process. Also, all the implementations have been simple mathematical operations which can be performed easily and quickly. I intentionally stayed away from complex mathematics such as Gaussian smoothing, eigenvectors, or statistical analysis because it is hard to see how our brain's neural cells could implement these operations. Also, the methods I have written so far all have customizable values or thresholds. This allows the program to learn and adapt so that it runs efficiently and outputs the best results.

1.9 Conclusion and Results

I have very few results to report at this time because of time constraints. I can report results on the edge detection algorithm, though. It appears to perform at the same level as more complicated algorithms, as shown in the included pictures. Also, the implementation in OCR will help greatly. I ran the program on sample text and it performed better than I could have expected. The text images were full of different text colors, text sizes, and background colors, and the program was able to successfully separate out the characters into a binary black and white image. The next step will be to implement the border prediction algorithm.
I plan to continue this research outside of class since it looks promising, and I look forward to working on it. The working code used so far is attached.

2.0 Acknowledgments

I would like to thank Dr. Abkemeier and the Fontbonne Math Department for letting me pursue this area of interest. This paper was submitted to the faculty of Fontbonne University's Department of Mathematics and Computer Science as a partial requirement for the degree of Bachelor of Science in Mathematics.
2.1 References

Chetverikov, Dmitry. Is Computer Vision Possible? Rep. N.p.: n.p., n.d. Print.

Holley, Rose. "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs." D-Lib Magazine, n.d. Web. 5 Apr. 2013. <http://www.dlib.org/dlib/march09/holley/03holley.html>.

How Brain Science Will Change Computing. Dir. Jeff Hawkins. TED, n.d. Web.

Lowe, David. The Computer Vision Industry. N.p., n.d. Web. 6 Apr. 2013. <http://www.cs.ubc.ca/~lowe/vision.html>.

McCann, John J. Human Color Perception. Cambridge: Polaroid Corporation, 1973. Print.

Nadernejad, Ehsan. "Edge Detection Techniques: Evaluations and Comparisons." Mazandaran Institute of Technology, n.d. Print.

Ritter, G. X., and Joseph N. Wilson. Handbook of Computer Vision Algorithms in Image Algebra. Boca Raton: CRC, 1996. Print.

Shimojo, Shinsuke, Michael Paradiso, and Ichiro Fujita. What Visual Perception Tells Us about Mind and Brain. Rep. N.p., n.d. Web. 5 Apr. 2013. <http://www.pnas.org/content/98/22/12340.full>.

Stergiou, Christopher, and Dimitrios Siganos. "Neural Networks." N.p., n.d. Web. 5 Apr. 2013. <http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html>.

Szeliski, Richard. Computer Vision: Algorithms and Applications. London: Springer, 2011. Print.

"What's OCR?" Data ID. N.p., n.d. Web. 5 Apr. 2013. <http://www.dataid.com/aboutocr.htm>.