Computer vision is the study and application of methods which allow computers to
"understand" image content or content of multidimensional data in general. The term
"understand" means here that specific information is being extracted from the image data
for a specific purpose: either for presenting it to a human operator (e.g., if cancerous
cells have been detected in a microscopy image), or for controlling some process (e.g.,
an industrial robot or an autonomous vehicle). The image data that is fed into a computer
vision system is often a digital gray-scale or colour image, but can also be in the form of
two or more such images (e.g., from a stereo camera pair), a video sequence, or a 3D
volume (e.g., from a tomography device). In most practical computer vision applications,
the computers are pre-programmed to solve a particular task, but methods based on
learning are now becoming increasingly common. Computer vision can also be described
as the complement (but not necessarily the opposite) of biological vision. Biological
vision and visual perception study the real vision systems of humans and various
animals, resulting in models of how these systems are implemented in terms of neural
processing at various levels.
State Of The Art
Relation between Computer vision and various other fields
The field of computer vision can be characterized as immature and diverse. Even though
earlier work exists, it was not until the late 1970s that a more focused study of the field
started when computers could manage the processing of large data sets such as images.
However, these studies usually originated from various other fields, and consequently
there is no standard formulation of the "computer vision problem". Also, and to an even
larger extent, there is no standard formulation of how computer vision problems should
be solved. Instead, there exists an abundance of methods for solving various well-defined
computer vision tasks, where the methods often are very task specific and seldom can be
generalized over a wide range of applications. Many of the methods and applications are
still in the state of basic research, but more and more methods have found their way into
commercial products, where they often constitute a part of a larger system which can
solve complex tasks (e.g., in the area of medical images, or quality control and
measurements in industrial processes).
A significant part of artificial intelligence deals with planning or deliberation for systems
that can perform mechanical actions, such as moving a robot through some
environment. This type of processing typically needs input data provided by a computer
vision system, acting as a vision sensor and providing high-level information about the
environment and the robot. Other areas that are sometimes described as belonging to
artificial intelligence, and that are used in relation to computer vision, are pattern
recognition and learning techniques. As a consequence, computer vision is sometimes
seen as a part of the artificial intelligence field.
Since a camera can be seen as a light sensor, there are various methods in computer
vision based on correspondences between a physical phenomenon related to light and
images of that phenomenon. For example, it is possible to extract information about
motion in fluids and about waves by analyzing images of these phenomena. Also, a
subfield within computer vision deals with the physical process which given a scene of
objects, light sources, and camera lenses forms the image in a camera. Consequently,
computer vision can also be seen as an extension of physics.
A third field which plays an important role is neurobiology, specifically the study of the
biological vision system.
Over the last century, there has been an extensive study of eyes, neurons, and the brain
structures devoted to processing of visual stimuli in both humans and various animals.
This has led to a coarse, yet complicated, description of how "real" vision systems
operate in order to solve certain vision related tasks. These results have led to a subfield
within computer vision where artificial systems are designed to mimic the processing and
behaviour of biological systems, at different levels of complexity. Also, some of the
learning-based methods developed within computer vision have their background in biology.
Yet another field related to computer vision is signal processing. Many existing methods
for processing of one-variable signals, typically temporal signals, can be extended in a
natural way to processing of two-variable signals or multi-variable signals in computer
vision. However, because of the specific nature of images, there are many methods
developed within computer vision that have no counterpart in the processing of one-
variable signals. A distinct character of these methods is that they are non-linear; this,
together with the multi-dimensionality of the signal, defines a subfield of signal
processing that forms part of computer vision.
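The extension from one-variable to multi-variable signals mentioned above can be illustrated with a small sketch (plain Python, purely illustrative; the function names are my own): a one-dimensional three-tap moving average and its natural two-dimensional counterpart, a 3x3 mean filter applied to interior pixels.

```python
# Illustrative sketch: extending a 1D smoothing filter to 2D image data.

def smooth_1d(signal):
    """3-tap moving average on interior samples; endpoints left unchanged."""
    return ([signal[0]] +
            [(signal[i - 1] + signal[i] + signal[i + 1]) / 3
             for i in range(1, len(signal) - 1)] +
            [signal[-1]])

def smooth_2d(image):
    """The same idea over two variables: a 3x3 mean on interior pixels."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(image[y + j][x + i]
                            for j in (-1, 0, 1) for i in (-1, 0, 1)) / 9
    return out

print(smooth_1d([0, 0, 9, 0, 0]))   # -> [0, 3.0, 3.0, 3.0, 0]
```

The two functions share the same averaging idea; only the dimensionality of the neighbourhood changes.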
Besides the above-mentioned views on computer vision, many of the related research
topics can also be studied from a purely mathematical point of view. For example, many
methods in computer vision are based on statistics, optimization or geometry. Finally, a
significant part of the field is devoted to the implementation aspect of computer vision;
how existing methods can be realized in various combinations of software and hardware,
or how these methods can be modified in order to gain processing speed without losing
too much performance.
Computer vision, Image processing, Image analysis, Robot vision and Machine vision are
closely related fields. Textbooks with any of these names in the title show a significant
overlap in the techniques and applications they cover. This implies that the basic
techniques used and developed in these fields are more or less identical, which can be
interpreted as there being only one field with different names. On the other hand, it
appears to be necessary for research groups,
scientific journals, conferences and companies to present or market themselves as
belonging specifically to one of these fields and, hence, various characterizations which
distinguish each of the fields from the others have been presented. The following
characterizations appear relevant but should not be taken as universally accepted.
Image processing and Image analysis tend to focus on 2D images, how to transform one
image to another, e.g., by pixel-wise operations such as contrast enhancement, local
operations such as edge extraction or noise removal, or geometrical transformations such
as rotating the image. This characterization implies that image processing/analysis
neither requires assumptions about, nor produces interpretations of, the image content.
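As an illustration of such a pixel-wise operation, the following minimal sketch performs linear contrast stretching, mapping the darkest gray level to 0 and the brightest to 255 without interpreting the image content. The function name and the tiny 2x2 "image" are purely illustrative.

```python
# Minimal sketch of a pixel-wise image-processing operation:
# linear contrast stretching over the full 0-255 gray-level range.

def contrast_stretch(image):
    """Stretch gray levels of a 2D image (list of rows) to span 0-255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:                      # flat image: nothing to stretch
        return [row[:] for row in image]
    scale = 255.0 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in image]

img = [[50, 100], [150, 200]]
print(contrast_stretch(img))   # -> [[0, 85], [170, 255]]
```

Note that the operation is defined pixel by pixel: it produces another image, not any interpretation of what the image depicts.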
Computer vision tends to focus on the 3D scene projected onto one or several images,
e.g., how to reconstruct structure or other information about the 3D scene from one or
several images. Computer vision often relies on more or less complex assumptions about
the scene depicted in an image.
Machine vision tends to focus on applications, mainly in industry, e.g., vision based
autonomous robots and systems for vision based inspection or measurement. This implies
that image sensor technologies and control theory often are integrated with the processing
of image data to control a robot and that real-time processing is emphasized by means of
efficient implementations in hardware and software. There is also a field called Imaging,
which primarily focuses on the process of producing images but sometimes also deals
with the processing and analysis of images. For example, Medical imaging contains a
large amount of work on the analysis of image data in medical applications.
Finally, pattern recognition is a field which uses various methods to extract information
from signals in general, mainly based on statistical approaches. A significant part of this
field is devoted to applying these methods to image data. A consequence of this state of
affairs is that you can be working in a lab related to one of these fields, apply methods
from a second field to solve a problem in a third field, and present the result at a
conference related to a fourth field!
Typical Tasks Of Computer Vision
Each of the application areas described above employs a range of computer vision tasks:
more or less well-defined measurement or processing problems that can be solved using
a variety of methods. Some examples of typical computer vision tasks are described below.
The classical problem in computer vision, image processing and machine vision is that of
determining whether or not the image data contains some specific object, feature, or
activity. This task can normally be solved robustly and without effort by a human, but is
still not satisfactorily solved in computer vision for the general case: arbitrary objects in
arbitrary situations. The existing methods for dealing with this problem can at best solve
it only for specific objects, such as simple geometric objects (e.g., polyhedrons), human
faces, printed or hand-written characters, or vehicles, and in specific situations, typically
described in terms of well-defined illumination, background, and pose of the object
relative to the camera.
Different varieties of the recognition problem are described in the literature:
• Recognition: one or several pre-specified or learned objects or object classes can
be recognized, usually together with their 2D positions in the image or 3D poses
in the scene.
• Identification: an individual instance of an object is recognized. Examples:
identification of a specific person's face or fingerprint, or identification of a specific vehicle.
• Detection: the image data is scanned for a specific condition. Examples: detection
of possible abnormal cells or tissues in medical images or detection of a vehicle in
an automatic road toll system. Detection based on relatively simple and fast
computations is sometimes used for finding smaller regions of interesting image
data which can be further analyzed by more computationally demanding
techniques to produce a correct interpretation.
Several specialized tasks based on recognition exist, such as:
• Content-based image retrieval: find all images which have a specific content in a
larger set or database of images.
• Pose estimation: estimation of the position and orientation of a specific object
relative to the camera. Example: allowing a robot arm to pick up objects from a
conveyor belt.
• Optical character recognition (or OCR): images of printed or handwritten text
are converted to computer readable text such as ASCII or Unicode.
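The simplest form of recognition can be sketched as matching an unknown binary glyph against stored templates by counting pixel agreement. This is a deliberately naive sketch (real OCR and recognition systems use far richer features and classifiers); the templates and function names are purely illustrative.

```python
# Hedged sketch of recognition by template matching on tiny binary glyphs.

TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
}

def match_score(a, b):
    """Count matching pixels between two equally sized binary glyphs."""
    return sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(glyph):
    """Return the template label with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda label: match_score(glyph, TEMPLATES[label]))

noisy_l = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 0]]          # an "L" with one corrupted pixel
print(classify(noisy_l))       # -> L
```

Even this toy classifier shows why the general case is hard: it only works for a fixed glyph size, perfect alignment, and a known illumination/binarization, exactly the "specific situations" discussed above.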
Several tasks relate to motion estimation, in which an image sequence is processed to
produce an estimate of the local image velocity at each point. Examples of such tasks are:
• Egomotion: determine the 3D rigid motion of the camera.
• Tracking of one or several objects (e.g. vehicles or humans) through the image
sequence.
• Surveillance: detection of possible activities based on motion.
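A very simple way to estimate local motion is block matching: the displacement of a small patch between two frames is found by exhaustively searching for the position with the lowest sum of absolute differences (SAD). This is a hedged sketch of the idea only; production optical-flow and tracking methods are far more elaborate, and the function names are my own.

```python
# Sketch of motion estimation by exhaustive block matching (SAD criterion).

def sad(frame, patch, x, y):
    """Sum of absolute differences between patch and the frame region at (x, y)."""
    return sum(abs(frame[y + j][x + i] - patch[j][i])
               for j in range(len(patch)) for i in range(len(patch[0])))

def find_motion(patch, next_frame):
    """Return (x, y) of the best-matching position of patch in next_frame."""
    h, w = len(patch), len(patch[0])
    H, W = len(next_frame), len(next_frame[0])
    return min(((x, y) for y in range(H - h + 1) for x in range(W - w + 1)),
               key=lambda p: sad(next_frame, patch, p[0], p[1]))

frame = [[0, 0, 0, 0],
         [0, 0, 5, 6],
         [0, 0, 7, 8]]
print(find_motion([[5, 6], [7, 8]], frame))   # -> (2, 1)
```

Applying this per patch over a frame yields a coarse field of local image velocities, the raw material for tracking and surveillance tasks.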
Given two or more images of a scene, or a video, scene reconstruction aims at computing
a 3D model of the scene. In the simplest case the model can be a set of 3D points. More
sophisticated methods produce a complete 3D surface model.
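The simplest case of scene reconstruction is recovering depth from a rectified stereo pair, where depth follows from Z = f * B / d for focal length f (in pixels), baseline B, and pixel disparity d. The sketch below only illustrates this relation; the numbers are made up.

```python
# Sketch of depth recovery from stereo disparity (rectified cameras assumed).

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point observed with the given disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("point at infinity or invalid match")
    return focal_px * baseline_m / disparity_px

# A point with 10 px disparity, 500 px focal length, 0.1 m baseline:
print(depth_from_disparity(500, 0.1, 10))   # -> 5.0 (metres)
```

Computing such a depth for every matched point yields the set of 3D points mentioned above; surface fitting over those points gives the more sophisticated 3D surface models.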
Given an image, an image sequence, or a 3D volume, which has been degraded by noise,
image restoration aims at producing the image data without the noise. Examples of noise
processes which are considered are sensor noise (e.g., ultrasonic images) and motion blur
(e.g., because of a moving camera or moving objects in the scene).
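A classic restoration operation is the median filter, which suppresses impulse ("salt and pepper") noise while preserving edges better than simple averaging. The sketch below is a minimal 3x3 version that leaves border pixels unchanged; it is illustrative only.

```python
# Minimal sketch of image restoration: a 3x3 median filter.

def median_filter(image):
    """Replace each interior pixel with the median of its 3x3 neighbourhood."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(image[y + j][x + i]
                            for j in (-1, 0, 1) for i in (-1, 0, 1))
            out[y][x] = window[4]          # median of 9 values
    return out

noisy = [[10, 10, 10],
         [10, 255, 10],    # single corrupted pixel
         [10, 10, 10]]
print(median_filter(noisy)[1][1])   # -> 10
```

The corrupted pixel is an outlier in its neighbourhood, so the median discards it; a mean filter would instead smear it across the neighbourhood.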
Computer Vision Systems
A typical computer vision system can be divided into the following subsystems:
The image or image sequence is acquired with an imaging system (camera, radar, lidar,
or tomography system). Often the imaging system has to be calibrated before being used.
In the preprocessing step, the image is processed with "low-level" operations. The aim
of this step is to reduce noise in the image (i.e. to dissociate the signal from the
noise) and to reduce the overall amount of data. This is typically done by
employing different (digital) image processing methods, such as:
1. Downsampling the image.
2. Applying digital filters.
3. Computing the x- and y-gradients (possibly also the time-gradient).
4. Segmenting the image.
a. Pixelwise thresholding.
5. Performing an eigentransform on the image.
a. Fourier transform.
6. Doing motion estimation for local regions of the image (also known as optical flow).
7. Estimating disparity in stereo images.
8. Multiresolution analysis.
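Two of the simplest preprocessing steps listed above, downsampling and pixelwise thresholding, can be sketched in a few lines of plain Python (illustrative only; the function names are my own):

```python
# Sketch of two "low-level" preprocessing operations.

def downsample(image):
    """Halve resolution by keeping every second pixel in each direction."""
    return [row[::2] for row in image[::2]]

def threshold(image, t):
    """Binarize: 1 where the pixel exceeds t, else 0 (pixelwise segmentation)."""
    return [[1 if p > t else 0 for p in row] for row in image]

img = [[10, 20, 200, 210],
       [12, 22, 205, 215],
       [11, 21, 202, 212],
       [13, 23, 207, 217]]
small = downsample(img)        # -> [[10, 200], [11, 202]]
print(threshold(small, 128))   # -> [[0, 1], [0, 1]]
```

Both operations serve the stated aim of the step: the amount of data shrinks (here by a factor of four, then from gray levels to one bit per pixel) before the more expensive stages run.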
The aim of feature extraction is to further reduce the data to a set of features, which ought
to be invariant to disturbances such as lighting conditions, camera position, noise and
distortion. Examples of feature extraction are:
1. Performing edge detection or estimation of local orientation.
2. Extracting corner features.
3. Detecting blob features.
4. Extracting spin images from depth maps.
5. Extracting geons or other three-dimensional primitives, such as superquadrics.
6. Acquiring contour lines and maybe curvature zero crossings.
7. Generating features with the Scale-invariant feature transform.
8. Calculating the co-occurrence matrix of the image or sub-images to measure texture.
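The first item above, edge detection via local gradients, can be sketched with central finite differences (a hedged toy version; real detectors such as Sobel or Canny add smoothing and non-maximum suppression, and the names below are my own):

```python
# Sketch of edge-strength estimation from finite-difference gradients.

def gradient_magnitude(image, x, y):
    """Approximate |gradient| at interior pixel (x, y) by central differences."""
    gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
    gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
    return (gx * gx + gy * gy) ** 0.5

step = [[0, 0, 0, 100, 100],
        [0, 0, 0, 100, 100],
        [0, 0, 0, 100, 100]]
# Strong response on the vertical edge, none in the flat region:
print(gradient_magnitude(step, 3, 1), gradient_magnitude(step, 1, 1))
```

Thresholding this magnitude over the image yields an edge map, one of the feature sets that later stages match against models.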
The aim of the registration step is to establish correspondence between the features in the
acquired set and the features of known objects in a model-database and/or the features of
the preceding image. The registration step has to bring up a final hypothesis. To name a
few methods:
1. Least squares estimation
2. Hough transform in many variations
3. Geometric hashing
4. Particle filtering
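Of the registration techniques listed, least-squares estimation is the simplest to sketch: for matched feature points related by a pure translation, the least-squares estimate is just the mean displacement. This toy example assumes exact one-to-one matches (real registration must also handle outliers, e.g. via the Hough transform or particle filtering listed above):

```python
# Sketch of least-squares registration for a pure translation model.

def estimate_translation(src, dst):
    """Least-squares translation (dx, dy) mapping points src onto matches dst."""
    n = len(src)
    dx = sum(b[0] - a[0] for a, b in zip(src, dst)) / n
    dy = sum(b[1] - a[1] for a, b in zip(src, dst)) / n
    return dx, dy

model = [(0, 0), (1, 0), (0, 1)]
observed = [(2.0, 3.0), (3.0, 3.0), (2.0, 4.0)]   # model shifted by (2, 3)
print(estimate_translation(model, observed))       # -> (2.0, 3.0)
```

The estimated translation is the "final hypothesis" of the registration step for this simple motion model; richer models (rotation, scale) are estimated the same way with more parameters.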
Applications Of Computer Vision
The following is an incomplete list of applications which are studied in computer vision.
In this category, the term application should be interpreted as a high level function which
solves a problem at a higher level of complexity. Typically, the various technical
problems related to an application can be solved and implemented in different ways.
A facial recognition system is a computer-driven application for automatically
identifying a person from a digital image. It does so by comparing selected facial
features in the live image with a facial database. It is typically used in security systems
and can be compared to other biometrics such as fingerprint or iris recognition.
Popular recognition algorithms include eigenface, fisherface, the hidden Markov model,
and the neurally motivated dynamic link matching. A newly emerging trend, claimed to
achieve previously unseen accuracies, is three-dimensional face recognition. Another
emerging trend uses the visual details of the skin, as captured in standard digital or
scanned images. Tests on the FERET database, the widely used industry benchmark,
showed that this approach is substantially more reliable than previous algorithms.
Polly was a robot created at the MIT Artificial Intelligence Laboratory by Ian Horswill
for his PhD, which was published in 1993 as a technical report. It was the first mobile
robot to move at animal-like speeds (1 m per second) using computer vision for its
navigation. It was an example of behavior based robotics. For a few years, Polly was able
to give tours of the AI laboratory's seventh floor, using canned speech to point out
landmarks such as Anita Flynn's office. The Polly algorithm is a way to navigate in a
cluttered space using very low resolution vision to find uncluttered areas to move forward
into, assuming that the pixels at the bottom of the frame (the closest to the robot) show an
example of an uncluttered area. Since this could be done 60 times a second, the algorithm
only needed to discriminate three categories: telling the robot at each instant to go
straight, towards the right or towards the left.
Mobile Robots are automatic machines that are capable of movement in a given
environment. Robots generally fall into two classes, linked manipulators (or Industrial
robots) and mobile robots. Mobile robots have the capability to move around in their
environment and are not fixed to one physical location. In contrast, industrial
manipulators usually consist of a jointed arm and gripper assembly (or end effector) that
is attached to a fixed surface.
The most common class of mobile robots are wheeled robots. A second class of mobile
robots includes legged robots while a third smaller class includes aerial robots, usually
referred to as unmanned aerial vehicles (UAVs). Mobile robots are the focus of a great
deal of current research, and almost every major university has one or more labs that
focus on mobile robot research. Mobile robots are also found in industry, military and
security environments, and appear as consumer products.
A humanoid robot manufactured by Toyota "playing" a trumpet
The word robot is used to refer to a wide range of machines, the common feature of
which is that they are all capable of movement and can be used to perform physical tasks.
Robots take on many different forms, ranging from humanoid, which mimic the human
form and way of moving, to industrial, whose appearance is dictated by the function they
are to perform. Robots can be grouped generally as mobile robots (e.g. autonomous
vehicles), manipulator robots (e.g. industrial robots) and self-reconfigurable robots, which
can conform themselves to the task at hand.
Robots may be controlled directly by a human, such as remotely-controlled bomb-
disposal robots, robotic arms, or shuttles, or may act according to their own decision
making ability, provided by artificial intelligence. However, the majority of robots fall in-
between these extremes, being controlled by pre-programmed computers. Such robots
may include feedback loops such that they can interact with their environment, but do not
display actual intelligence.
The word "robot" is also used in a general sense to mean any machine which mimics the
actions of a human (biomimicry), in the physical sense or in the mental sense. It comes
from the Czech and Slovak word robota, meaning labour or work (also used in the sense of a serf).
The word robot first appeared in Karel Čapek's science fiction play R.U.R. (Rossum's
Universal Robots) in 1921.
The construction of a Soviet-made robot of the 1970s. The robot was able to move,
reproduce pre-recorded sounds, imitate clever conversation using a built-in radio
station, and show movies on a built-in screen. It was used in various shows.
The word robot was introduced by Czech writer Karel Čapek in his play R.U.R.
(Rossum's Universal Robots), which was written in 1920 (see also Robots in literature for
details of the play). However, the verb robotovat, meaning "to work" or "to slave", and
the noun robota (meaning corvée), used in the Czech and Slovak languages, have been
used since the early 10th century. It was suggested that the word robot had been coined
by Karel Čapek's brother, painter and writer Josef Čapek.
An early automaton was built in 1738 by Jacques de Vaucanson, who created a
mechanical duck that was able to eat grain, flap its wings, and excrete.
The first human to be killed by a robot was 37-year-old Kenji Urada, a Japanese factory
worker, in 1981. According to Economist.com, Urada "climbed over a safety fence at a
Kawasaki plant to carry out some maintenance work on a robot. In his haste, he failed to
switch the robot off properly. Unable to sense him, the robot's powerful hydraulic arm
kept on working and accidentally pushed the engineer into a grinding machine."
A smart camera is an integrated machine vision system which, in addition to image
capture circuitry, includes a processor that can extract information from images without
the need for an external processing unit, and interface devices used to make results available
to other devices.
A smart camera, or "intelligent camera", is a self-contained, standalone vision system
with a built-in image sensor in the housing of an industrial video camera. It contains all
necessary communication interfaces, e.g. Ethernet. It is not necessarily larger than an
industrial or surveillance camera. This architecture has the advantage of a more compact
volume compared to PC-based vision systems and often achieves lower cost, at the
expense of a somewhat simpler (or missing altogether) user interface.
Early smart camera (ca. 1985, in red) with an 8 MHz Z80, compared to a modern device
featuring Texas Instruments' C64 @ 1 GHz.
A smart camera usually consists of several (but not necessarily all) of the following
components:
1. Image sensor (matrix or linear, CCD or CMOS)
2. Image digitization circuitry
3. Image memory
4. Communication interface (RS232, Ethernet)
5. I/O lines (often optoisolated)
6. Lens holder or built-in lens (usually C- or CS-mount)
Examples Of Applications For Computer Vision
Another way to describe computer vision is in terms of application areas. One of the
most prominent application fields is medical computer vision or medical image
processing. This area is characterized by the extraction of information from image data
for the purpose of making a medical diagnosis of a patient. Typically image data is in the
form of microscopy images, X-ray images, angiography images, ultrasonic images, and
tomography images. An example of information which can be extracted from such image
data is detection of tumours, arteriosclerosis or other malignant changes. It can also be
measurements of organ dimensions, blood flow, etc. This application area also supports
medical research by providing new information, e.g., about the structure of the brain, or
about the quality of medical treatments.
A second application area in computer vision is in industry. Here, information is
extracted for the purpose of supporting a manufacturing process. One example is quality
control, where parts or final products are automatically inspected in order to find
defects. Another example is measurement of the position and orientation of parts to be
picked up by a robot arm. See the article on machine vision for more details on this area.
Military applications are probably one of the largest areas for computer vision, even
though only a small part of this work is open to the public. The obvious examples are
detection of enemy soldiers or vehicles and guidance of missiles to a designated target.
More advanced systems for missile guidance send the missile to an area rather than a
specific target, and target selection is made when the missile reaches the area based on
locally acquired image data. Modern military concepts, such as "battlefield
awareness", imply that various sensors, including image sensors, provide a rich set of
information about a combat scene which can be used to support strategic decisions. In
this case, automatic processing of the data is used to reduce complexity and to fuse
information from multiple sensors to increase reliability.
Artist's concept of a rover on Mars; notice the stereo cameras mounted on top of the
rover (credit: Maas Digital LLC).
One of the newer application areas is autonomous
vehicles, which include submersibles, land-based vehicles (small robots with wheels, cars
vehicles, which include submersibles, land-based vehicles (small robots with wheels, cars
or trucks), and aerial vehicles. An unmanned aerial vehicle is often denoted UAV. The
level of autonomy ranges from fully autonomous (unmanned) vehicles to vehicles where
computer vision based systems support a driver or a pilot in various situations. Fully
autonomous vehicles typically use computer vision for navigation, e.g., a UAV looking
for forest fires. Examples of supporting systems are obstacle warning systems in cars and
systems for autonomous landing of aircraft. Several car manufacturers have demonstrated
systems for autonomous driving of cars, but this technology has still not reached a level
where it can be put on the market.
Software For Computer Vision
Animal (first implementation: 1988; revised: 2004) is an interactive environment for
image processing that is oriented toward the rapid prototyping, testing, and modification
of algorithms. To create ANIMAL (AN IMage ALgebra), XLISP of David Betz was
extended with some new types: sockets, arrays, images, masks, and drawables. The
theoretical framework and the implementation of the working environment are described
in the paper ANIMAL: AN IMage ALgebra. In the theoretical framework of ANIMAL, a
digital image is a boundless matrix. However, in the implementation it is bounded by a
rectangular region in the discrete plane and the elements outside the region have a
constant value. The size and position of the region in the plane (focus) is defined by the
coordinates of the rectangle. In this way all the pixels, including those on the border, have
the same number of neighbors (useful in local operators, such as digital filters).
Furthermore, pixelwise commutative operations remain commutative at the image level,
independently of the focus.
OpenCV is an open source computer vision library developed by Intel. The library is
cross-platform and runs on both Windows and Linux. It focuses mainly on real-time
image processing. The application areas include:
1. Human-Computer Interface (HCI)
2. Object Identification
3. Segmentation and Recognition
4. Face Recognition
5. Gesture Recognition
6. Motion Tracking
Visualization Toolkit (VTK)
Visualization Toolkit (VTK) is an open source, freely available software system for 3D
computer graphics, image processing, and visualization used by thousands of researchers
and developers around the world. VTK consists of a C++ class library, and several
interpreted interface layers including Tcl/Tk, Java, and Python. Professional support and
products for VTK are provided by Kitware, Inc. VTK supports a wide variety
of visualization algorithms including scalar, vector, tensor, texture, and volumetric
methods; and advanced modeling techniques such as implicit modelling, polygon
reduction, mesh smoothing, cutting, contouring, and Delaunay triangulation.
Commercial Computer Vision Systems
Automatix Inc., founded in January 1980, was the first company to market industrial
robots with built-in machine vision. Its founders were Victor Scheinman, inventor of the
Stanford arm; Phillippe Villers, Michael Cronin, and Arnold Reinhold of
Computervision; Jake Dias and Dan Nigro of Data General; Gordon VanderBrug, of NBS
and Norman Wittels of Clark University.
Automatix Robots at the Robots 1985 show in Detroit, Michigan. Clockwise from lower
left: AID 600, AID 900 Seamtracker, Yaskawa Motoman.
Automatix mostly used robot
mechanisms imported from Hitachi at first and later from Yaskawa and KUKA. It did
design and manufacture a Cartesian robot called the AID-600. The 600 was intended for
use in precision assembly but was adapted for welding use, particularly tungsten inert
gas (TIG) welding, which demands high accuracy and immunity from the intense
electromagnetic interference that the TIG process creates. Automatix was the first
company to market a vision-guided welding robot called Seamtracker. Structured laser
light and monochromatic filters were used to allow an image to be seen in the presence of
the welding arc. Another concept, invented by Mr. Scheinman, was RobotWorld, a
system of cooperating small modules suspended from a 2-D linear motor. The product
line was later sold to Yaskawa.
Automatix raised large amounts of venture capital, and went public in 1983, but was not
profitable until the early 1990s. In 1994, Automatix merged with another machine vision
company, Itran Corp., to form Acuity Imaging, Inc. Acuity was acquired by Robotics
Vision Systems Inc. (RVSI) in September 1995. As of 2004, RVSI still supported the
evolved Automatix machine vision package under the PowerVision brand.
RapidEye is a commercial multispectral remote sensing satellite mission being designed
and implemented by MDA for RapidEye AG. The RapidEye sensor images five optical
bands in the 400-850 nm range and provides a 5 m pixel size at nadir. Rapid delivery and
short revisit times are provided through the use of a five-satellite constellation.
Scantron is the name of a United States company that makes and sells Scantron exam
answer sheets and the machines to grade them. The Scantron system usually takes the
form of a "multiple choice, fill-in-the-circle/square/rectangle" form of varying length and
width, from single column 50 answer tests, to multiple 8.5" x 11" page forms used in
standardized testing such as the SAT and ACT. The forms are sensed optically, using
optical mark recognition to detect markings in each place, in a "Scantron Machine" that
tabulates and can automatically grade results. Earlier versions were sensed electrically.
A typical 100-answer Scantron answer sheet. This is only half of it (the front side); the
back side is not shown.
Commonly, there are two sides to Scantron answer sheets.
They can contain 50 answer blanks, 100 answer blanks, and so on. There is even a
smaller form called a "Quiz Strip" that contains only about 20 answer boxes to bubble-in.
On the larger sheets, there is a space on the back where answers can be manually written
in for separate questions, if a test giver assigns them. The full-sized 8.5" x 11" form
may contain a larger area for using it to work on math formulas, write short answers, etc.
Answers "A" and "B" are commonly used for "True" and "False" questions, as shown in
the image to the right on the top of each row.
Grading of Scantron sheets is performed first by creating an answer key. The answer key
is simply a standard Scantron answer sheet with all of the correct answers filled in, along
with the "key" rectangle at the top of the sheet. Once the answer key is ready, the
Scantron machine is powered on and the answer key is fed through. This stores the
answer key in the memory of the machine, and any further sheets that are fed
through are graded and marked according to the key in memory. Switching off the
Scantron machine stops the paper feed and clears the memory.
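Once optical mark recognition has turned each bubbled row into a letter, the grading procedure described above reduces to comparing each sheet against the stored key. A minimal sketch (the function and data names are illustrative, not Scantron's actual software):

```python
# Sketch of answer-key grading: count positions where sheet matches key.

def grade(key, sheet):
    """Return the number of answers on the sheet that match the key."""
    return sum(k == s for k, s in zip(key, sheet))

key   = ["A", "C", "B", "D", "A"]
sheet = ["A", "C", "C", "D", "B"]
print(grade(key, sheet), "/", len(key))   # -> 3 / 5
```

Feeding the key sheet first corresponds to storing the `key` list; each subsequent sheet is then scored against it until the machine's memory is cleared.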
Computer vision, unlike for example factory machine vision, happens in unconstrained
environments, potentially with changing cameras and changing lighting and camera
views. Also, some “objects” such as roads, rivers, bushes, etc. are just difficult to
describe. In these situations, engineering a model a-priori can be difficult. With learning-
based vision, one just “points” the algorithm at the data and useful models for detection,
segmentation, and identification can often be formed. Learning can often easily fuse or
incorporate other sensing modalities such as sound, vibration, or heat. Since cameras and
sensors are becoming cheap and powerful and learning algorithms have a vast appetite
for computational threads, Intel is very interested in enabling geometric and learning-
based vision routines in its OpenCV library, since such routines are vast consumers of
computing power.