IMAGE PROCESSING 
REAL WORLD, REAL TIME AUTOMATIC RECOGNITION 
OF FACIAL EXPRESSIONS 
ABSTRACT: 
The most expressive way humans display emotions is through facial expressions. 
Enabling computer systems to recognize facial expressions and infer emotions from them 
in real time presents a challenging research topic. Most facial expression analysis systems 
attempt to recognize facial expressions from data collected in a highly controlled 
laboratory with very high resolution frontal faces (face regions greater than 200 x 200 
pixels) and also cannot handle large head motions. But in real environments such as 
smart meetings, a facial expression analysis system must be able to automatically 
recognize expressions at lower resolution and handle the full range of head motion. This 
paper describes a real-time system to automatically recognize facial expressions in 
relatively low-resolution face images (around 50x70 to 75x100 pixels). The system deals 
successfully with complex real world interactions and is a highly promising approach. 
2. INTRODUCTION: 
The human face possesses superior expressive ability and provides one of the 
most powerful, versatile and natural means of communicating motivational and affective 
state. We use facial expressions not only to express our emotions, but also to provide 
important communicative cues during social interaction, such as our level of interest, our 
desire to take a speaking turn and continuous feedback signaling understanding of the 
information conveyed. Facial expression constitutes 55 percent of the effect of a 
communicated message and is hence a major modality in human communication. 
Face recognition and facial expression recognition are inherent capabilities of 
human beings; identifying a person by the face is one of the most fundamental 
human functions and has been so since time immemorial. Facial expression recognition by 
computer aims to endow a machine with a capability that approximates, in some sense, 
this human ability. Imparting this basic human capability to a machine has been a 
subject of interest over the last few years. 
Before discussing how facial expressions can be recognized, we first outline the 
main problems that a face recognition system faces. 
3. IMPORTANT SUB-PROBLEMS: 
The problem of recognizing and interpreting faces comprises four main sub problem 
areas: 
· Finding faces and facial features: This problem would be considered a 
segmentation problem in the machine vision literature, and a detection problem in 
the pattern recognition literature. 
· Recognizing faces and facial features: This problem requires defining a similarity 
metric that allows comparison between examples; this is the fundamental 
operation in database access. 
· Tracking faces and facial features: Because facial motion is very fast (with respect 
to either human or biological vision systems), the techniques of optimal 
estimation and control are required to obtain robust performance. 
· Temporal interpretation: The problem of interpretation is often too difficult to 
solve from a single frame and requires temporal context for its solution. Similar 
problems of interpretation are found in speech processing, and it is likely that 
speech methods such as hidden Markov models, discrete Kalman filters, and 
dynamic time warping will prove useful in the facial domain as well. 
Current approaches to automated facial expression analysis typically attempt to 
recognize a small set of prototypic emotional expressions, i.e. joy, surprise, anger, 
sadness, fear, and disgust. One group of researchers presented an early attempt to 
analyze facial expressions by tracking the motion of twenty identified spots in an image 
sequence. Others developed a dynamic parametric model based on a 3D geometric mesh 
face model to recognize five prototypic expressions. Some manually selected facial 
regions that corresponded to facial muscles and computed motion within these regions 
using optical flow. Another group also used optical flow, but tracked the motion of the 
surface regions of facial features (brows, eyes, nose, and mouth) instead of the motion 
of the underlying muscle groups. 
4. LIMITATIONS OF EXISTING SYSTEMS: 
The limitations of the existing systems are summarized as follows: 
· Most systems attempt to recognize facial expressions from data collected in a 
highly controlled laboratory with very high-resolution frontal faces (face regions 
greater than 200 x 200 pixels). 
· Most systems need some manual preprocessing. 
· Most systems cannot handle large out-of-plane head motion. 
· None of these systems deals with complex real world interactions. 
· Except for the system proposed by Moses et al. [3], none of these systems performs 
in real time. 
In this paper, we report an expression recognition system, which addresses many of these 
limitations. In real environments, a facial expression analysis system must be able to: 
· Automatically recognize expressions without manual intervention. 
· Handle a full range of head motion. 
· Recognize expressions in face images of relatively low resolution. 
· Recognize expressions of lower intensity. 
· Perform in real time. 
Figure 1: A face at different resolutions. All images are enlarged to the same size. 
Figure 1 shows a face at different resolutions. Most automated face processing 
tasks should be possible for a 69x93 pixel image. At 48x64 pixels, facial features such 
as the corners of the eyes and the mouth become hard to detect; facial expressions can 
still be recognized at 48x64 pixels but not at 24x32 pixels. This paper describes 
a real-time system to automatically recognize facial expressions in relatively low-resolution 
face images (50x70 to 75x100 pixels). To handle the full range of head motion, 
the head (rather than the face) is detected, and the head pose is then estimated from the 
detected head. For frontal and near-frontal views of the face, location and shape features 
are computed for expression recognition. This system successfully deals with complex 
real world interactions. We present the overall architecture of the system and its 
components: background subtraction, head detection, and head pose estimation. The 
methods for facial feature extraction and tracking and for expression recognition are 
then explained, and the last part of the paper summarizes the work and presents future 
directions. 
5. SYSTEM ARCHITECTURE: 
In this paper we describe a new facial expression analysis system designed to 
automatically recognize facial expressions in real-time and real environments, using 
relatively low-resolution face images. Figure 2 shows the structure of the tracking 
system. The input video sequence is used to estimate a background model, which is then 
used to perform background subtraction, as described in Section 6. In Section 7, the 
resulting foreground regions are used to detect the head. After finding the head, head 
pose estimation is performed to find the head in frontal or near-frontal views. The facial 
features are extracted only for those faces in which both eyes and mouth corners are 
visible. The normalized facial features are input to a neural network based expression 
classifier to recognize different expressions. 
Figure 2. Block diagram of the expression recognition system. 
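The control flow that Figure 2 depicts can be summarized in a short Python skeleton. This is only an illustrative sketch of the data flow between the components described in Sections 6 to 10; the class and method names are our own placeholders, not part of the original system.

class ExpressionRecognitionPipeline:
    """Skeleton of the per-frame control flow (Sections 6-10)."""

    def subtract_background(self, frame):         # Section 6: background model + subtraction
        raise NotImplementedError

    def detect_head(self, foreground):            # Section 7: head cut from the silhouette
        raise NotImplementedError

    def estimate_pose(self, frame, head_box):     # Section 8: 'frontal', 'profile' or 'other'
        raise NotImplementedError

    def extract_features(self, frame, head_box):  # Section 9: 17-dimensional feature vector
        raise NotImplementedError

    def classify_expression(self, features):      # Section 10: neural-network classifier
        raise NotImplementedError

    def process_frame(self, frame):
        foreground = self.subtract_background(frame)
        head_box = self.detect_head(foreground)
        if head_box is None:
            return None                            # no head found in this frame
        if self.estimate_pose(frame, head_box) != 'frontal':
            return None                            # features are extracted only for (near-)frontal views
        features = self.extract_features(frame, head_box)
        if features is None:
            return None                            # eyes or mouth corners not visible
        return self.classify_expression(features)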
6. BACKGROUND ESTIMATION AND SUBTRACTION: 
The background subtraction approach presented is an attempt to make the 
background subtraction robust to illumination changes. The background is modeled 
statistically at each pixel. The estimation process computes the brightness distortion and 
color distortion in RGB color space. It also proposes an active background estimation 
method that can deal with moving objects in the frame. First, we calculate image 
difference over three frames to detect the moving objects. Then the statistical background 
model is constructed, excluding these moving object regions. By comparing the 
difference between the background image and the current image, a given pixel is 
classified into one of four categories: original background, shaded background or 
shadow, highlighted background, and foreground objects. Finally, a morphology step is 
applied to remove small isolated spots and fill holes in the foreground image. 
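A simplified sketch of the per-pixel classification is given below: it computes the brightness and color distortion of each pixel against the background mean in RGB space. The full statistical model also normalizes by per-pixel variances learned during background estimation; the thresholds used here are illustrative assumptions only.

import numpy as np

BACKGROUND, SHADOW, HIGHLIGHT, FOREGROUND = 0, 1, 2, 3

def classify_pixels(frame, bg_mean, tau_cd=20.0, alpha_low=0.6, alpha_high=1.2):
    frame = frame.astype(np.float64)     # H x W x 3 current image
    bg = bg_mean.astype(np.float64)      # H x W x 3 background model (per-pixel mean)

    # Brightness distortion: the scale factor that best matches the background color.
    alpha = (frame * bg).sum(axis=2) / np.maximum((bg * bg).sum(axis=2), 1e-6)
    # Color distortion: distance from the brightness-scaled background color.
    cd = np.linalg.norm(frame - alpha[..., None] * bg, axis=2)

    labels = np.full(frame.shape[:2], FOREGROUND, dtype=np.uint8)
    same_chroma = cd < tau_cd
    labels[same_chroma & (alpha >= alpha_low) & (alpha <= alpha_high)] = BACKGROUND
    labels[same_chroma & (alpha < alpha_low)] = SHADOW       # darker than the background
    labels[same_chroma & (alpha > alpha_high)] = HIGHLIGHT   # brighter than the background
    return labels                                            # followed by the morphology step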
7. HEAD DETECTION: 
In order to handle the full range of head motion, we detect the head instead of 
detecting the face. The head detection uses the smoothed silhouette of the foreground 
object as segmented using background subtraction. Based on human intuition about the 
parts of an object, a segmentation into parts generally occurs at the negative curvature 
minima (NCM) points of the silhouette as shown with small circles in Figure 3. The 
boundaries between parts are called cuts (shown as the line L in Figure 3(a)). Some 
researchers have noted that human vision prefers the partitioning scheme that uses the 
shortest cuts. They proposed a short-cut rule which requires a cut to: 1) be a straight line, 
2) cross an axis of local symmetry, 3) join two points on the outline of a silhouette, at 
least one of which is an NCM, and 4) be the shortest one if there are several possible 
competing cuts. 
In this system, the following steps are used to calculate the cut of the head: 
· Calculate the contour centroid C and the vertical symmetry axis y of the silhouette. 
· Compute the cuts for the NCMs that are located above the contour centroid C. 
· Measure the salience of a part's protrusion, defined as the ratio of the perimeter of 
the part (excluding the cut) to the length of the cut. 
· Test whether the salience of a part exceeds a low threshold. 
· Test whether the cut crosses the vertical symmetry axis y of the silhouette. 
· Select the top one as the cut of the head if there are several competing cuts. 
After the cut of the head L is detected, the head region can easily be determined as the 
part above the cut. As shown in Figure 3(b), in most situations only part of the head lies 
above the cut. To obtain the correct head region, we first calculate the head width W; the 
head height H is then taken as α·W measured from the top of the head. In our system, 
α = 1.4. 
Figure 3. Head detection steps. (a) Calculate the cut of the head part. (b) Obtain the 
correct head region from the cut of the head part. 
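Assuming the silhouette is available as a binary mask and the detected cut is roughly horizontal at a given image row, the head region of Figure 3(b) can be obtained as sketched below. The mask representation and the horizontal-cut simplification are our assumptions; the 1.4 ratio follows the text.

import numpy as np

def head_region_from_cut(silhouette, cut_y, ratio=1.4):
    above = silhouette[:cut_y, :]                   # part of the silhouette above the cut
    rows, cols = np.nonzero(above)
    if rows.size == 0:
        return None

    top = rows.min()                                # top of the head
    left, right = cols.min(), cols.max()
    head_width = right - left + 1                   # head width W
    head_height = int(round(ratio * head_width))    # H = 1.4 * W, measured from the top
    bottom = min(top + head_height, silhouette.shape[0])
    return top, bottom, left, right                 # bounding box of the head region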
8. HEAD POSE DETECTION: 
After the head is located, the head image is converted to gray-scale, histogram 
equalized and resized to the estimated resolution. Then we employ a three layer neural 
network (NN) to estimate the head pose. The input to the network is the processed head 
image, and the outputs are the pose classes. Only 3 head pose classes are trained for 
expression analysis: 1) frontal or near frontal view, 2) side view or profile, 3) others such 
as back of the head or occluded face. 
Figure 4. The definitions and examples of the 3 head pose classes: 1) frontal or near 
frontal view, 2) side view or profile, 3) others such as back of the head or occluded faces. 
Figure 4 shows the definitions and some examples of the 3 head pose classes. In the 
frontal or near frontal view, both eyes and lip corners are visible. In side view or profile, 
at least one eye or one corner of the mouth becomes self-occluded because of the head 
turn. All remaining cases, in which even more facial features cannot be detected, such as 
the back of the head, an occluded face, or a face with an extreme tilt angle, are treated as one class. 
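The preprocessing of the detected head image (grayscale conversion, histogram equalization, resizing) and a generic three-layer pose classifier can be sketched as follows. The 20x30 input resolution and the weight matrices are illustrative assumptions; the paper specifies only the preprocessing steps and the three pose classes.

import cv2
import numpy as np

POSE_CLASSES = ("frontal_or_near_frontal", "side_view_or_profile", "other")

def preprocess_head(head_bgr, size=(20, 30)):
    gray = cv2.cvtColor(head_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                      # reduce lighting variation
    gray = cv2.resize(gray, size)                      # (width, height) of the network input
    return gray.astype(np.float32).ravel() / 255.0     # flattened input vector

def classify_pose(head_vector, W1, b1, W2, b2):
    """A generic three-layer network: input layer, one hidden layer, 3 outputs."""
    hidden = np.tanh(head_vector @ W1 + b1)
    scores = hidden @ W2 + b2
    return POSE_CLASSES[int(np.argmax(scores))]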
9. FACIAL FEATURE EXTRACTION FOR FRONTAL OR NEAR-FRONTAL 
FACE: 
After estimating the head pose, the facial features are extracted only for the face 
in the frontal or near-frontal view. Since the face images are in relatively low resolution 
in most real environments, the detailed facial features such as the corners of the eyes and 
the upper or lower eyelids are not available. To recognize facial expressions, however, we 
need to detect reliable facial features. We observe that most facial feature changes that 
are caused by an expression are in the areas of eyes, brows and mouth. In this paper, two 
types of facial features in these areas are extracted: location features and shape features. 
9.1 LOCATION FEATURE EXTRACTION 
In this system, six location features are extracted for expression analysis. They are 
eye centers (2), eyebrow inner endpoints (2), and corners of the mouth (2). 
Eye centers and eyebrow inner endpoints: To find the eye centers and eyebrow inner 
endpoints inside the detected frontal or near frontal face, we have developed an algorithm 
that searches for two pairs of dark regions which correspond to the eyes and the brows by 
using certain geometric constraints such as position inside the face, size and symmetry to 
the facial symmetry axis. The algorithm employs an iterative threshold method to find 
these dark regions under different or changing lighting conditions. 
Figure 5. Iterative thresholding of the face to find eyes and brows. (a) Gray-scale face 
image, (b) threshold = 30, (c) threshold = 42, (d) threshold = 54. 
Figure 5 shows the iterative thresholding method to find eyes and brows. 
Generally, after five iterations, all the eyes and brows are found. If satisfactory results are 
not found after 20 iterations, we assume that the eyes or the brows are occluded or that the 
face is not in a near-frontal view. Rather than finding only one pair of dark regions for the 
eyes, we find two pairs of parallel dark regions for both the eyes and the eyebrows. By doing this, not 
only are more features obtained, but also the accuracy of the extracted features is 
improved. Then the eye centers and eyebrow inner endpoints can be easily determined. If 
the face image is continually in the frontal or near frontal view in an image sequence, the 
eyes and brows can be tracked by simply searching for the dark pixels around their 
positions in the last frame. 
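A sketch of the iterative thresholding loop is shown below. The start value and step of the threshold schedule are assumptions consistent with the thresholds shown in Figure 5 (30, 42, 54), and the geometric pair-selection test is left as a placeholder for the position, size and symmetry constraints described above.

import cv2
import numpy as np

def find_eyes_and_brows(gray_face, start=30, step=12, max_iters=20):
    for i in range(max_iters):
        threshold = start + i * step
        dark = (gray_face < threshold).astype(np.uint8)        # candidate dark pixels
        num, labels, stats, centroids = cv2.connectedComponentsWithStats(dark)

        # Keep reasonably sized blobs only (label 0 is the background).
        blobs = [(centroids[k], stats[k]) for k in range(1, num)
                 if 5 < stats[k, cv2.CC_STAT_AREA] < 0.05 * gray_face.size]

        pairs = select_parallel_pairs(blobs, gray_face.shape)   # geometric constraints
        if pairs is not None:
            return pairs                    # ((left eye, right eye), (left brow, right brow))
    return None                             # eyes/brows occluded or face not near-frontal

def select_parallel_pairs(blobs, image_shape):
    """Placeholder for the position, size and symmetry constraints of Section 9.1."""
    raise NotImplementedError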
Mouth corners: After finding the positions of the eyes, the location of the mouth is first 
predicted. Then the vertical position of the line between the lips is found using an integral 
projection of the mouth region. Finally, the horizontal borders of the line between the lips 
are found using an integral projection over an edge image of the mouth. The following 
steps are used to track the corners of the mouth: 1) find two points on the line between the 
lips near the previous positions of the corners in the image; 2) search along the darkest 
path to the left and right until the corners are found. Finding the points on the line 
between the lips can be done by searching for the darkest pixels in search windows near 
the previous mouth corner positions. Because there is a strong change from dark to bright 
at the location of the corners, the corners can be found by looking for the maximum 
contrast along the search path. 
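The integral projections used to locate the line between the lips can be sketched as follows; the simple horizontal-gradient edge image is an assumption, since the exact edge operator is not specified.

import numpy as np

def lip_line_row(mouth_region):
    # Integral projection: sum of the intensities of each row of the mouth region.
    row_projection = mouth_region.astype(np.float64).sum(axis=1)
    return int(np.argmin(row_projection))        # darkest row = line between the lips

def mouth_horizontal_extent(mouth_region, edge_threshold=30.0):
    # Integral projection over a simple edge image of the mouth.
    grad = np.abs(np.diff(mouth_region.astype(np.float64), axis=1))
    col_projection = grad.sum(axis=0)
    cols = np.nonzero(col_projection > edge_threshold)[0]
    if cols.size == 0:
        return None
    return int(cols[0]), int(cols[-1])           # approximate left and right mouth borders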
9.2 LOCATION FEATURE REPRESENTATION: 
After extracting the location features, the face can be normalized to a canonical 
face size based on two of these features, i.e., the eye-separation after the line connecting 
two eyes (eye-line) is rotated to horizontal. In our system, all faces are normalized to 90 x 
90 pixels by re-sampling. We transform the extracted features into a set of parameters for 
expression recognition. We represent the face location features by 5 parameters, which 
are shown in Figure 6. 
Figure 6. Face location feature representation for expression recognition. 
These parameters are the distances between the eye-line and the corners of the 
mouth, the distances between the eye-line and the inner eyebrows, and the width of the 
mouth (the distance between two corners of the mouth). 
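Given the six normalized feature points, the five parameters of Figure 6 can be computed as sketched below; the points are assumed to be (x, y) coordinates in the 90 x 90 normalized face with the eye-line already rotated to horizontal.

import numpy as np

def location_parameters(left_eye, right_eye, left_brow, right_brow,
                        left_mouth, right_mouth):
    eye_line_y = 0.5 * (left_eye[1] + right_eye[1])     # y of the (horizontal) eye-line

    d_mouth_left = abs(left_mouth[1] - eye_line_y)      # eye-line to left mouth corner
    d_mouth_right = abs(right_mouth[1] - eye_line_y)    # eye-line to right mouth corner
    d_brow_left = abs(left_brow[1] - eye_line_y)        # eye-line to left inner eyebrow
    d_brow_right = abs(right_brow[1] - eye_line_y)      # eye-line to right inner eyebrow
    mouth_width = np.hypot(right_mouth[0] - left_mouth[0],
                           right_mouth[1] - left_mouth[1])

    return np.array([d_mouth_left, d_mouth_right,
                     d_brow_left, d_brow_right, mouth_width])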
9.3 SHAPE FEATURE EXTRACTION: 
Another type of distinguishing feature is the shape of the mouth. Global shape 
features are not adequate to describe the shape of the mouth. Therefore, in order to 
extract the mouth shape features, an edge detector is applied to the normalized face to get 
an edge map. This edge map is divided into 3 x 3 zones as shown in Figure7 (b). The size 
of the zones is selected to be half of the distance between the eyes. The mouth shape 
features are computed from zonal shape histograms of the edges in the mouth region. 
Figure 7. Zonal-histogram features. (a) Normalized face, 
(b) Zones of the edge map of the normalized face, 
(c) Four quantization levels for calculating histogram features, 
(d) Histogram corresponding to the middle zone of the mouth. 
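A possible way to compute the 12 zonal shape components of the mouth is sketched below: edge orientations are quantized into four levels (Figure 7(c)) and a normalized histogram is accumulated for each of the three zones covering the mouth row. The Sobel operator and the zone coordinates are assumptions, since the paper does not name the edge detector.

import cv2
import numpy as np

def zonal_mouth_histograms(normalized_face, mouth_zones, bins=4, mag_thresh=40.0):
    gray = normalized_face.astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)              # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)              # vertical gradient
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)     # edge direction in [0, pi)
    quantized = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    features = []
    for (y0, y1, x0, x1) in mouth_zones:                # the three zones over the mouth
        zone_q = quantized[y0:y1, x0:x1]
        zone_m = magnitude[y0:y1, x0:x1]
        hist = np.array([np.count_nonzero((zone_q == b) & (zone_m > mag_thresh))
                         for b in range(bins)], dtype=np.float64)
        features.append(hist / max(hist.sum(), 1.0))    # normalized 4-bin histogram
    return np.concatenate(features)                     # 3 zones x 4 bins = 12 values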
10. EXPRESSION RECOGNITION: 
This system has a neural network-based recognizer with the structure shown in 
Figure 8. Standard back-propagation in the form of a three-layer neural network with 
one hidden layer was used to recognize facial expressions. The inputs to the network 
were the 5 location features (Figure 6) and the 12 zone components of the mouth shape 
features (Figure 7); hence, a total of 17 features were used to represent the expression in 
a face image. The outputs were a set of expressions. In this system, 5 expressions were 
recognized: neutral, smile, anger, surprise, and others (including fear, sadness, and 
disgust). Various numbers of hidden units were tested, and 6 hidden units gave the best 
performance. 
Figure 8. Neural network-based recognizer for facial expressions. 
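For illustration, the classifier can be reproduced with an off-the-shelf multi-layer perceptron; scikit-learn is used here purely as an example, whereas the original system uses its own back-propagation implementation.

import numpy as np
from sklearn.neural_network import MLPClassifier

EXPRESSIONS = ["neutral", "smile", "angry", "surprise", "others"]

def train_expression_classifier(features, labels):
    # features: (n_samples, 17) array of the 5 location + 12 shape features;
    # labels: indices into EXPRESSIONS.
    clf = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                        solver="sgd", learning_rate_init=0.1, max_iter=2000)
    clf.fit(features, labels)
    return clf

def recognize_expression(clf, feature_vector):
    return EXPRESSIONS[int(clf.predict(feature_vector.reshape(1, -1))[0])]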
11. SUMMARY AND CONCLUSIONS: 
Automatically recognizing facial expressions is important to understand human 
emotion and paralinguistic communication so as to design multi modal user interfaces, 
and for related applications such as human identification. Incorporating emotive 
information in computer-human interfaces will allow for much more natural and efficient 
interaction paradigms to be established. It is very challenging to develop a system that 
can perform in real time and in the real world because of low image resolution, low 
expression intensity, and the full range of head motion. The system reported here is an 
automatic expression recognition system that addresses all of these challenges and 
successfully deals with complex real world interactions. In most real world interactions, 
facial feature changes are caused both by talking and by expression changes. We feel 
that further efforts will be required to distinguish talking from expression changes by 
fusing audio signal processing with visual image analysis. Expression recognition 
accuracy would also benefit from using sequential information rather than treating each 
frame independently. 
REFERENCES: 
1. T. Kanade, J.F. Cohn, and Y.L. Tian. Comprehensive database for facial 
expression analysis. In Proceedings of International Conference on Face and 
Gesture Recognition, pages 46–53, March 2000. 
2. Z. Zhang. Feature-based facial expression recognition: Sensitivity analysis and 
experiments with a multi-layer perceptron. International Journal of Pattern 
Recognition and Artificial Intelligence, 13(6):893–911, 1999. 
3. Y. Moses, D. Reynard, and A. Blake. Determining facial expressions in real time. 
In Proc. of Int. Conf. on Automatic Face and Gesture Recognition, pages 332– 
337, 1995. 
4. B. Fasel and J. Luttin. Recognition of asymmetric facial action unit 
activities and intensities. In Proceedings of International Conference of Pattern 
Recognition, 2000. 
CODE NO:EC 79 IS 2 
Advanced Video Coding : MPEG-4/H.264 and Beyond 
Bhavana, K.B.Jyothsna 
III/IV E.C.E. 
Padmasri Dr. B.V.Raju Institute of Technology 
reshaboinabhavana@yahoo.co.in, dch_jyo@yahoo.com 
Abstract : 
With the high demand for digital video products in popular applications such 
as video communications, security and surveillance, industrial automation and 
entertainment, video compression is an essential enabler for video application design. 
Video coding standards are under continual development for various applications, with 
the aims of better picture quality, higher coding efficiency and greater error robustness. 
The new international video coding standard, MPEG-4 Part 10 / H.264 / AVC, achieves 
significant improvements in coding efficiency and error robustness in comparison with 
previous standards such as MPEG-2 and MPEG-4 Visual. This paper provides an 
overview of H.264 and surveys other video coding methods currently in use. 
Introduction: 
Video coding deals with the compression of digital video data. Digital video is 
basically a three-dimensional array of color pixels. Two dimensions serve as spatial 
(horizontal and vertical) directions of the moving pictures, and one dimension represents 
the time domain. The video data contains a fair amount of spatial and temporal 
redundancy. Similarities can thus be encoded by merely registering differences within a 
frame (spatial) and/or between frames (temporal). Video coding typically reduces this 
redundancy by using lossy compression. Usually this is achieved by image compression 
techniques to reduce spatial redundancy from frames (this is known as intraframe 
compression or spatial compression) and motion compensation, and other techniques to 
reduce temporal redundancy (known as interframe compression or temporal 
compression). 
Video coding for telecommunication applications has evolved through the 
development of the ITU-T H.261, H.262 (MPEG-2 Video) and H.263 video coding 
standards (and later enhancements known as H.263+ and H.263++), the ISO/IEC 
MPEG-4 Part 2 (Visual) standard, and now H.264 (MPEG-4 Part 10). 
MPEG-4 Part 10, or H.264, is a high-compression digital video codec 
standard written by the ITU-T Video Coding Experts Group (VCEG) together with the 
ISO/IEC Moving Picture Experts Group (MPEG) as the product of a collective 
partnership effort known as the Joint Video Team (JVT). The ISO/IEC MPEG-4 Part 10 
standard and the ITU-T H.264 standard are technically identical, and the technology is 
also known as AVC (Advanced Video Coding). The main objective of the H.264 
project is to develop a high-performance video coding standard by adopting a "back-to-basics" 
approach with a simple and straightforward design using well-known building 
blocks. 
The intent of the H.264/AVC project is to create a standard capable of providing 
superb video quality at bit rates substantially lower (e.g., half or less) than those required 
by previous standards (e.g., H.262, H.263), without increasing complexity so much as to 
make the design impractical (i.e., excessively expensive) to implement. Another 
ambitious goal is to do this in a flexible way that allows the standard to be applied to a 
very wide variety of applications (e.g., both low and high bit rate, and low and high 
resolution video) and to work well on a very wide variety of networks and systems 
(e.g., RTP/IP packet networks and ITU-T multimedia telephony systems). 
Overview : The Advanced Video Coding / H.264 
The new standard is designed for technical solutions including at least the 
following application areas: 
* Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc. 
* Interactive or serial storage on optical and magnetic devices, high definition DVD, etc. 
* Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile 
networks, modems, etc. or mixtures of these. 
* Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, 
LAN, wireless networks, etc. 
* Multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and 
mobile networks, etc. 
Fig.1 H.264/AVC Conceptual layers. 
For efficient transmission in different environments not only coding 
efficiency is relevant, but also the seamless and easy integration of the coded video into 
all current and future protocol and network architectures. This includes the public 
Internet with best effort delivery, as well as wireless networks expected to be a major 
application for the new video coding standard. The adaptation of the coded video 
representation or bitstream to different transport networks was typically defined in the 
systems specification in previous MPEG standards or separate standards like H.320 or 
H.324. However, only the close integration of network adaptation and video coding can 
bring the best possible performance of a video communication system. Therefore, 
H.264/AVC consists of two conceptual layers (Figure 1): the VCL (Video Coding 
Layer), which is designed to efficiently represent the video content, and the NAL 
(Network Abstraction Layer), which formats the VCL representation of the video and 
provides header information in a manner appropriate for conveyance by a variety of 
transport layers or storage media. 
H.264 Technical Description : 
The main objective of the emerging H.264 standard is to provide a means to 
achieve substantially higher video quality compared to what could be achieved by using 
any one of the existing video coding standards. Nonetheless, the underlying approach of 
H.264 is similar to that adopted in previous standards such as MPEG-2 and MPEG-4 
part-2 and consists of the following four main stages: 
a. Dividing each video frame into blocks of pixels so that processing of the 
video frame can be conducted at a block level. 
b. Exploiting the spatial redundancies that exist within the video frame by 
coding some of the original blocks through spatial prediction, transform, 
quantization and entropy coding. 
c. Exploiting the temporal dependencies that exist between blocks in successive 
frames, so that only changes between successive frames need to be encoded. 
This is accomplished by using motion estimation and compensation. For any 
given block, a search is performed in one or more previously coded frames to 
determine the motion vectors that are then used by the encoder and decoder to 
predict the subject block. 
d. Exploiting any remaining spatial redundancies that exist within the video 
frame by coding the residual blocks, i.e., the difference between the original 
blocks and the corresponding predicted blocks, again through transform, 
quantization and entropy coding. 
On the motion estimation/compensation side, H.264 employs blocks of different sizes 
and shapes, higher-resolution ¼-pel motion estimation, multiple reference frame selection 
and complex multiple bidirectional mode selection. On the transform side, H.264 uses an 
integer-based transform that roughly approximates the discrete cosine transform (DCT) 
used in MPEG-2, but does not have the mismatch problem in the inverse transform. 
Entropy coding can be performed using either a combination of a Universal Variable 
Length Codes (UVLC) table with Context-Adaptive Variable Length Codes (CAVLC) 
for the transform coefficients, or Context-based Adaptive Binary Arithmetic Coding 
(CABAC). 
Fig.2 Block Diagram of the H.264 Encoder. 
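As a concrete illustration of the motion estimation in stage (c) above, the following is a minimal full-search block-matching sketch using the sum of absolute differences (SAD); a real H.264 encoder adds quarter-pel refinement, multiple reference frames, variable block sizes and much faster search strategies.

import numpy as np

def best_motion_vector(current, reference, top, left, block=16, search=8):
    cur_block = current[top:top + block, left:left + block].astype(np.int32)
    best_sad, best_mv = None, (0, 0)

    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > reference.shape[0] or x + block > reference.shape[1]:
                continue                                # candidate outside the reference frame
            ref_block = reference[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(cur_block - ref_block).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)

    return best_mv, best_sad                            # vector used to predict the block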
Organization of Bitstream: 
The input image is divided into macroblocks. Each macroblock consists of the 
three components Y, Cr and Cb. Y is the luminance component which represents the 
brightness information, while Cr and Cb represent the color information. Because the 
human visual system is less sensitive to chrominance than to luminance, the chrominance 
signals are both subsampled by a factor of 2 in the horizontal and vertical directions. 
Therefore, a macroblock consists of one block of 16 by 16 picture elements for 
the luminance component and of two blocks of 8 by 8 picture elements for the color 
components. These macroblocks are coded in Intra or Inter mode. In Inter mode, a 
macroblock is predicted using motion compensation. For motion compensated prediction 
a displacement vector is estimated and transmitted for each block (motion data) that 
refers to the corresponding position of its image signal in an already transmitted reference 
image stored in memory. In Intra mode, former standards set the prediction signal to zero 
such that the image can be coded without reference to previously sent information. This 
is important to provide for error resilience and for entry points into the bit streams 
enabling random access. The prediction error, which is the difference between the 
original and the predicted block, is transformed, quantized and entropy coded. In 
order to reconstruct the same image on the decoder side, the quantized coefficients are 
inverse transformed and added to the prediction signal. The result is the reconstructed 
macroblock that is also available at the decoder side. This macroblock is stored in a 
memory. Macroblocks are typically stored in raster scan order. H.264/AVC introduces 
the following changes: 
1. In order to reduce the block-artifacts an adaptive deblocking filter is used in 
the prediction loop. The deblocked macroblock is stored in the memory and can 
be used to predict future macroblocks. 
2. Whereas the memory contains one video frame in previous standards, 
H.264/AVC allows storing multiple video frames in the memory. 
3. In H.264/AVC a prediction scheme is used also in Intra mode that uses the 
image signal of already transmitted macro blocks of the same image in order to 
predict the block to code. 
4. The Direct Cosine Transform (DCT) used in former standards is replaced by an 
integer transform. 
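The 4:2:0 macroblock organization described at the beginning of this section can be illustrated with a short helper that extracts one macroblock from separate Y, Cb and Cr planes (planar storage, with the chroma planes already subsampled by 2 in both directions, is an assumption of the sketch).

def get_macroblock(y_plane, cb_plane, cr_plane, mb_row, mb_col):
    y0, x0 = 16 * mb_row, 16 * mb_col
    c0, d0 = 8 * mb_row, 8 * mb_col
    luma = y_plane[y0:y0 + 16, x0:x0 + 16]      # 16x16 luminance samples
    cb = cb_plane[c0:c0 + 8, d0:d0 + 8]         # 8x8 Cb samples
    cr = cr_plane[c0:c0 + 8, d0:d0 + 8]         # 8x8 Cr samples
    return luma, cb, cr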
In H.264/AVC, the macroblocks are processed in so-called slices, where a slice is 
usually a group of macroblocks; this grouping is valuable for resynchronization should 
some data be lost. 
Fig. 3 Division of an image into several slices. 
An H.264 video stream is organized in discrete packets, called “NAL units”. Each of 
these packets can contain a part of a slice, i.e., there may be one or more NAL units per 
slice. The slices, in turn, contain a part of a video frame. The decoder may resynchronize 
after each NAL unit instead of skipping a whole frame if a single error occurs. H.264 
also supports optional interlaced encoding. In this encoding mode, a frame is split into 
two fields. Fields may be encoded using spatial or temporal interleaving. To encode 
color images, H.264 uses the YCbCr color space like its predecessors, separating the 
image into luminance (or “luma”, brightness) and chrominance (or “chroma”, color) 
planes. It is, however, fixed at 4:2:0 subsampling, i.e., the chroma channels each have 
half the resolution of the luma channel. 
Five different slice types are supported which are I, P, B, SI and SP. 
‘I’ slices or “Intra” slices describe a full still image, containing only references to itself. 
The first frame of a sequence always needs to be built out of I slices. 
‘P’ slices or “Predicted” slices use one or more recently decoded slices as a reference 
(or prediction) for constructing the picture using motion-compensated prediction. The 
prediction is usually not exactly the same as the actual picture content, so a “residual” 
may be added. 
‘B’ slices or “Bi-directional Predicted” slices work like P slices with the exception that 
former and future I or P slices (in playback order) may be used as reference pictures. For 
this to work, B slices must be decoded after the following I or P slices. 
‘SI’ or ‘SP’ slices or “Switching” slices may be used for efficient transitions between two 
different H.264 video streams. 
The tools that make H.264 such a successful video coding scheme are 
discussed below. 
Intra Prediction and Coding : 
Intra prediction means that the samples of a macroblock are predicted by using 
only information from already transmitted macroblocks of the same image, thereby 
exploiting only spatial redundancies within a video picture. The resulting frame is 
referred to as an I-picture. I-pictures are typically encoded by directly applying the 
transform to different macroblocks in the frame. In order to increase the efficiency of the 
intra coding process in H.264, spatial correlation between adjacent macroblocks in a 
given frame is exploited. The idea is based on the observation that adjacent macroblocks 
tend to have similar properties. The difference between the actual macroblock and its 
prediction is then coded, which results in fewer bits to represent the macroblock of 
interest compared to applying the transform directly to the macroblock itself. 
In H.264/AVC, two different types of intra prediction are possible for the 
prediction of the luminance component Y. 
The first type is called INTRA_4×4 and the second one INTRA_16×16. Using the 
INTRA_4×4 type, the macroblock, which is of the size 16 by 16 picture elements 
(16×16), is divided into sixteen 4×4 sub-blocks and a prediction for each 4×4 sub-block 
of the luminance signal is applied individually. For the prediction purpose, nine different 
prediction modes are supported. One mode is the DC prediction mode (mode 2), in which 
all samples of the current 4×4 sub-block are predicted by the mean of all neighboring 
samples to the left of and above the current block that have already been reconstructed at 
the encoder and at the decoder side (see Figure 4(b)). In addition to the DC prediction 
mode, eight prediction modes labeled 0, 1, 3, 4, 5, 6, 7 and 8, each corresponding to a 
specific prediction direction, are supported as shown in Figure 4(c). 
Fig. 4 Intra prediction modes for 4x4 luminance blocks. 
Pixels A to M from neighboring blocks have already been encoded and may be 
used for prediction. For example, if Mode 0 (vertical prediction) is selected, then the 
values of the pixels a to p are assigned as follows: 
· a, e, i and m are equal to A 
· b, f, j and n are equal to B 
· c, g, k and o are equal to C 
· d, h, l and p are equal to D 
For regions with less spatial detail (i.e., flat regions), H.264 supports 16x16 intra coding, 
in which one of four prediction modes (DC, Vertical, Horizontal and Planar) is 
chosen for the prediction of the entire luminance component of the macroblock. In 
addition, H.264 supports intra prediction for 8x8 chrominance blocks also using four 
prediction modes (DC, Vertical, Horizontal and Planar). Finally, the prediction mode for 
each block is efficiently coded by assigning shorter symbols to more likely modes, where 
the probability of each mode is determined based on the modes used for coding the 
surrounding blocks. 
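The vertical (mode 0) and DC (mode 2) predictions described above can be sketched as follows for a 4x4 luminance sub-block; the fallback to a mid-gray value when no neighbors are available follows the usual convention for 8-bit samples.

import numpy as np

def intra4x4_vertical(above):                  # above = [A, B, C, D]
    # a,e,i,m = A;  b,f,j,n = B;  c,g,k,o = C;  d,h,l,p = D
    return np.tile(np.asarray(above, dtype=np.int32), (4, 1))

def intra4x4_dc(above=None, left=None):
    neighbors = []
    if above is not None:
        neighbors.extend(above)                # A..D
    if left is not None:
        neighbors.extend(left)                 # I..L
    if not neighbors:                          # no reconstructed neighbors available
        return np.full((4, 4), 128, dtype=np.int32)
    mean = (sum(int(v) for v in neighbors) + len(neighbors) // 2) // len(neighbors)
    return np.full((4, 4), mean, dtype=np.int32)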
Inter prediction and coding: 
Inter prediction and coding is based on using motion estimation and 
compensation to take advantage of the temporal redundancies that exist between 
successive frames, hence, providing very efficient coding of video sequences. When a 
selected reference frame for motion estimation is a previously encoded frame, the frame 
to be encoded is referred to as a P-picture. When both a previously encoded frame and a 
future frame are chosen as reference frames, the frame to be encoded is referred to 
as a B-picture. The inclusion of a new inter-stream transitional picture, called an SP-picture, 
in a bit stream enables efficient switching between bit streams with similar 
content encoded at different bit rates, as well as random access and fast playback modes. 
Motion compensation prediction for different block sizes : 
In H.264/AVC it is possible to refer to several preceding images. For this 
purpose, an additional picture reference parameter has to be transmitted together with the 
motion vector. This technique is denoted as motion-compensated prediction with multiple 
reference frames. In the classical concept, B-pictures are pictures that are encoded using 
both past and future pictures as references. The prediction is obtained by a linear 
combination of forward and backward prediction signals. In former standards, this linear 
combination is just an averaging of the two prediction signals whereas H.264/AVC 
allows arbitrary weights. In this generalized concept, the linear combination of prediction 
signals is also made regardless of the temporal direction. For example, a linear 
combination of two forward-prediction signals may be used (see Figure 5). Furthermore, 
using H.264/AVC it is possible to use images containing B-slices as reference images for 
further predictions, which was not possible in any former standard. 
Fig. 5 Motion-compensated prediction with 
multiple reference images 
In the case of motion-compensated prediction, macroblocks are predicted from the 
image signal of already transmitted reference images. For this purpose, each macroblock 
can be divided into smaller partitions. Partitions with 
luminance block sizes of 16×16, 16×8, 8×16, and 8×8 samples are supported. In case of 
an 8×8 sub-macroblock in a P-slice, one additional syntax element specifies if the 
corresponding 8×8 sub-macroblock is further divided into partitions with block sizes of 
8×4, 4×8 or 4×4. 
Fig. 6 Partitioning of a macroblock and a sub-macroblock for motion compensated 
prediction 
The availability of smaller motion compensation blocks improves prediction in general, 
and in particular, the small blocks improve the ability of the model to handle fine motion 
detail and result in better subjective viewing quality, more efficient coding and more 
error resilience because they do not produce large blocking artifacts. 
Fig. 7 Example of 16x16 macroblock 
Adaptive de-blocking loop filter : 
H.264 specifies the use of an adaptive de-blocking filter that operates on the 
horizontal and vertical block edges within the prediction loop, in order to remove 
artifacts caused by block prediction errors and achieve higher visual quality. Another 
reason to make de-blocking a mandatory in-loop tool in H.264/AVC is to force the 
decoder to deliver approximately the quality intended by the producer, rather than 
leaving this basic picture enhancement tool to optional post-processing. The filtering is 
generally based on 4x4 block boundaries, in which two pixels on either side of the 
boundary may be updated using a different filter. 
The filter described in the H.264/AVC standard is highly adaptive. Several parameters 
and thresholds, and also the local characteristics of the picture itself, control the strength 
of the filtering process. All involved thresholds are quantizer dependent, because 
blocking artifacts always become more severe when quantization gets coarse. 
H.264/MPEG-4 AVC deblocking is adaptive on three levels: 
■ On slice level, the global filtering strength can be adjusted to the individual 
characteristics of the video sequence. 
■ On block edge level, the filtering strength is made dependent on inter/intra prediction 
decision, motion differences, and the presence of coded residuals in the two participating 
blocks. From these variables a filtering-strength parameter is calculated, which can take 
values from 0 to 4 causing modes from no filtering to very strong filtering of the involved 
block edge. 
■ On sample level, it is crucially important to be able to distinguish between true edges 
in the image and those created by the quantization of the transform-coefficients. True 
edges should be left unfiltered as much as possible. In order to separate the two cases, the 
sample values across every edge are analyzed. 
Integer transform : 
As in former standards, transform coding is applied in order to code the 
prediction error signal. The task of the transform is to reduce the spatial redundancy of 
the prediction error signal. For the purpose of transform coding, all former standards such 
as MPEG-1 and MPEG-2 applied a two-dimensional Discrete Cosine Transform (DCT), 
which had to define rounding-error tolerances for fixed-point implementations of the 
inverse transform. Drift caused by mismatches in IDCT precision between the encoder 
and decoder was a source of quality loss. H.264/AVC gets around the problem by using 
an integer 4x4 spatial transform that is an approximation of the DCT and that helps 
reduce blocking and ringing artifacts. Much lower bit rates with reasonable performance 
are reported based on the application of these techniques. 
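The 4x4 integer core transform can be written down directly: the forward transform is Y = C X C^T with the integer matrix C below, while the scaling that makes it approximate an orthonormal DCT is folded into quantization and omitted here. The inverse shown is a numerical illustration rather than the integer-friendly inverse specified by the standard.

import numpy as np

C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]], dtype=np.int64)

def forward_core_transform(residual_4x4):
    x = np.asarray(residual_4x4, dtype=np.int64)
    return C @ x @ C.T                 # exact integer arithmetic, no rounding mismatch

def inverse_core_transform(coefficients_4x4):
    Cinv = np.linalg.inv(C.astype(np.float64))
    y = np.asarray(coefficients_4x4, dtype=np.float64)
    return Cinv @ y @ Cinv.T           # recovers the residual block (scaling left to dequantization)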
Quantization and Transform coefficient scanning : 
The quantization step is where a significant portion of the data compression takes 
place. In H.264, the transform coefficients are quantized using scalar quantization with 
no widened dead zone. Fifty-two different quantization step sizes can be chosen on a 
macroblock basis, which differs from prior standards. Moreover, in H.264, the step 
sizes are increased at a compounding rate of approximately 12.5%, rather than by a 
constant increment. The fidelity of chrominance components is improved by using finer 
quantization step sizes compared to those used for luminance coefficients. The quantized 
transform coefficients correspond to different frequencies, with the coefficient at the top 
left hand corner representing the DC value, and the rest of the coefficients corresponding 
to different non-zero frequency values. The next step in the encoding process is to 
arrange the quantized coefficients in an array, starting with the DC coefficients. 
Fig.8 Scan pattern for frame coding in H.264. 
The zig-zag scan illustrated in Fig. 8 is used in all frame coding cases. The zig-zag scan 
arranges the coefficients in ascending order of the corresponding spatial frequencies. 
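The relationship between QP and the quantizer step size, and the 4x4 zig-zag scan of Fig. 8, can be sketched as follows; the base step values for QP 0 to 5 follow the commonly published table.

import numpy as np

QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def quantizer_step(qp):
    # Qstep doubles for every increase of 6 in QP (52 values, QP = 0..51),
    # i.e. each step is roughly 12.5% larger than the previous one.
    assert 0 <= qp <= 51
    return QSTEP_BASE[qp % 6] * (2 ** (qp // 6))

# Zig-zag scan order for a 4x4 block in frame coding: coefficients are read in
# ascending order of spatial frequency, starting with the DC term.
ZIGZAG_4x4 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]

def zigzag_scan(block_4x4):
    block = np.asarray(block_4x4)
    return np.array([block[r, c] for r, c in ZIGZAG_4x4])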
Entropy coding : 
Entropy coding is based on assigning shorter code words to symbols with 
higher probabilities of occurrence and longer code words to symbols with less frequent 
occurrences. Some of the parameters to be entropy coded include transform coefficients 
for the residual data, motion vectors and other encoder information. H.264/AVC 
specifies two alternative methods of entropy coding: a low-complexity technique based 
on the usage of context-adaptively switched sets of variable length codes, so-called 
CAVLC, and the computationally more demanding algorithm of context-based adaptive 
binary arithmetic coding (CABAC). Both methods represent major improvements in 
terms of coding efficiency compared to the techniques of statistical coding traditionally 
used in prior video coding standards. 
CAVLC is the baseline entropy coding method of H.264/AVC. Its basic coding tool 
consists of a single VLC of structured Exp-Golomb codes, which by means of 
individually customized mappings is applied to all syntax elements except those related 
to quantized transform coefficients. For typical coding conditions and test material, bit 
rate reductions of 2–7% are obtained by CAVLC. 
For significantly improved coding efficiency, CABAC as the alternative entropy coding 
mode of H.264/AVC is the method of choice. The CABAC design is based on the key 
elements: binarization, context modeling, and binary arithmetic coding. Binarization 
enables efficient binary arithmetic coding via a unique mapping of non-binary syntax 
elements to a sequence of bits, a so-called bin string. Each element of this bin string can 
either be processed in the regular coding mode or the bypass mode. The latter is chosen 
for selected bins, such as the sign information or lower significant bins, in order to 
speed up the whole encoding (and decoding) process by means of a simplified coding 
engine bypass. Typically, CABAC provides bit rate reductions of 5–15% compared to 
CAVLC. 
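The structured Exp-Golomb code that CAVLC builds on is simple enough to sketch directly: a value k is written as a run of leading zero bits followed by the binary form of k + 1.

def exp_golomb_encode(k):
    assert k >= 0
    info = bin(k + 1)[2:]                  # binary representation of k + 1
    return "0" * (len(info) - 1) + info    # prefix zeros + info bits

def exp_golomb_decode(bits, pos=0):
    zeros = 0
    while bits[pos + zeros] == "0":        # count leading zeros
        zeros += 1
    info = bits[pos + zeros:pos + 2 * zeros + 1]
    value = int(info, 2) - 1
    return value, pos + 2 * zeros + 1      # decoded value and new bit position

# Examples: 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', 4 -> '00101'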
Robustness and error resilience : 
Switching slices (called SP and SI slices) allow an encoder to direct a decoder 
to jump into an ongoing video stream for such purposes as video streaming bitrate 
switching and "trick mode" operation. When a decoder jumps into the middle of a video 
stream using the SP/SI feature, it can get an exact match to the decoded pictures at that 
location in the video stream despite using different pictures (or no pictures at all) as 
references prior to the switch. 
Flexible macroblock ordering (FMO, also known as slice groups) and arbitrary 
slice ordering (ASO), which are techniques for restructuring the ordering of the 
representation of the fundamental regions (called macroblocks) in pictures. Typically 
considered an error/loss robustness feature, FMO and ASO can also be used for other 
purposes. 
Data partitioning (DP), a feature providing the ability to separate more 
important and less important syntax elements into different packets of data, enabling the 
application of unequal error protection (UEP) and other types of improvement of 
error/loss robustness. 
Redundant slices (RS), an error/loss robustness feature allowing an encoder to 
send an extra representation of a picture region (typically at lower fidelity) that can be 
used if the primary representation is corrupted or lost. 
Supplemental enhancement information (SEI) and video usability information 
(VUI), which are extra information that can be inserted into the bitstream to enhance the 
use of the video for a wide variety of purposes. 
Frame numbering, a feature that allows the creation of "sub-sequences" 
(enabling temporal scalability by optional inclusion of extra pictures between other 
pictures), and the detection and concealment of losses of entire pictures (which can occur 
due to network packet losses or channel errors). 
Picture order count, a feature that serves to keep the ordering of the pictures 
and the values of samples in the decoded pictures isolated from timing information 
(allowing timing information to be carried and controlled/changed separately by a system 
without affecting decoded picture content). 
These techniques, along with several others, help H.264 to perform significantly better 
than any prior standard can, under a wide variety of circumstances in a wide variety of 
application environments. H.264 can often perform radically better than MPEG-2 video 
—typically obtaining the same quality at half of the bitrate or less. 
Comparison to previous standard : 
Coding efficiency : 
The coding efficiency is measured by the average bit rate savings for a constant peak 
signal-to-noise ratio (PSNR). Therefore, the required bit rates of several test sequences 
at different qualities are taken into account. 
For video streaming applications, H.264/AVC, MPEG-4 Visual ASP, H.263 HLP and 
MPEG-2 Video are considered. Fig.9 shows the PSNR of the luminance component 
versus the average bit rate for the single test sequence Tempete encoded at 15 Hz and 
Table 1 presents the average bit rate savings for a variety of test sequences and bit rates. 
It can be seen from Table 1 that H.264/AVC outperforms all the other encoders considered. 
For example, H.264/AVC MP allows an average bit rate saving of about 63% compared 
to MPEG-2 Video and about 37% compared to MPEG-4 Visual ASP. 
For video conferencing applications, H.264/AVC MPEG-4 Visual SP, H.263 Baseline, 
and H.263 CHC are considered. Figure 10 shows the luminance PSNR versus average bit 
rate for the single test sequence Paris encoded at 15 Hz and Table 2 presents the average 
bit rate savings for a variety of test sequences and bit rates. As for video streaming 
applications, H.264/AVC outperforms all other considered encoders. H.264/AVC BP 
allows an average bit rate saving of about 40% compared to H.263 Baseline and about 
27% compared to H.263 CHC. 
Fig. 9 Luminance PSNR versus average 
bit rate for different coding standards 
measured for Tempete. 
Fig. 10 Luminance PSNR versus 
average bit rate for different coding 
standards measured for Paris. 
Hardware complexity: 
In relative terms, the encoder complexity increases by more than one order of magnitude 
between MPEG-4 Part 2 and H.264/AVC, and by a factor of 2 for the decoder. The H.264/AVC 
encoder/decoder complexity ratio is in the order of 10 for basic configurations and can 
grow up to 2 orders of magnitude for complex ones. 
Table 3. Comparison of MPEG standards 
Conclusion : 
Compared to previous video coding standards, H.264/AVC provides an improved 
coding efficiency and a significant improvement in flexibility for effective use over a 
wide range of networks. While H.264/AVC still uses the concept of block-based motion 
compensation, it provides some significant changes: 
■ Enhanced motion compensation capability using high precision and multiple reference 
frames 
■ Use of an integer DCT-like transform instead of the DCT 
■ Enhanced adaptive entropy coding including arithmetic coding 
■ Adaptive in-loop deblocking filter 
The coding tools of H.264/AVC, when used in an optimized mode, allow bit savings of 
about 50% compared to previous video coding standards like MPEG-4 and MPEG-2 for a 
wide range of bit rates and resolutions. H.264 performs significantly better than any prior 
standard can, under a wide variety of circumstances in a wide variety of application 
environments. H.264 can often perform radically better than MPEG-2 video—typically 
obtaining the same quality at half of the bitrate or less. 
CODE NO: EC 66 IS 3 
HUMAN-ROBOT INTERFACE BASED ON THE MUTUAL ASSISTANCE BETWEEN 
SPEECH AND VISION 
Submitted by: 
1. Harleen Kaur Chadha, EEE 3rd year, Guru Nanak Engineering College, Ibrahimpatnam, Hyderabad 
2. Sonia Kapoor, EEE 3rd year, Guru Nanak Engineering College, Ibrahimpatnam, Hyderabad 
Abstract: 
In this paper we describe a helper robot that brings objects ordered by the user, using mutual 
assistance between speech and vision. The robot needs a vision system to recognize the objects 
appearing in the orders. However, conventional vision systems cannot recognize objects in complex 
scenes; they may find many objects and cannot determine which one is the target. This paper proposes 
a method of using a conversation with the user to solve this problem. A speech-based interface is 
appropriate for this application. The robot asks questions to which the user can easily answer and 
whose answers can efficiently reduce the number of candidate objects. The method considers the 
characteristics of the features used for object recognition, such as how easily humans can specify them 
in words, generating a user-friendly and efficient sequence of questions. The robot can detect target 
objects by asking the questions generated by the method. After the target object has been detected, the 
robot hands it over to its master; for this purpose, the robot is equipped with sensors, lasers and a 
pan-tilt camera. 
I. INTRODUCTION 
Helper robots or service robots in the welfare domain have attracted much attention from researchers in 
view of the coming aged society. Here we describe a helper robot that carries out tasks ordered by the user 
through voice and/or gestures. In addition to gesture recognition, such robots need to have vision systems 
that can recognize the objects mentioned in speech. It is, however, difficult to realize vision systems that 
can work in various conditions. Thus, we have proposed to use the human user’s assistance through speech. 
When the vision system cannot achieve a task, the robot speaks to the user so that the user's natural 
response can give helpful information to its vision system. Thus, even though detecting the 
target object may be difficult and need the user’s assistance, once the robot has detected an object, it can 
assume the object as the target. However, in actual complex scenes, the vision system may detect various 
objects. The robot must choose the target object among them, which is a hard problem especially if it does 
not have much a priori knowledge about the object. This paper tackles this problem. The robot determines 
the target through a conversation with the user. The point of this research is how to generate a sequence of 
utterances that leads to determining the object efficiently and in a user-friendly way. This paper presents such a 
dialog generation method. It determines what and how to ask the user by considering image processing 
results and the characteristics of object attributes. 
After the object has been selected through the mutual assistance between the robot's speech and vision 
capabilities, it should be handed over to the master; for this we use a robot equipped with sonar, laser 
and infrared sensors and a pan-tilt camera. 
II. SYSTEM CONFIGURATION 
Our system consists of a speech module, a gesture module, a vision module, an action module and 
a central processing module. 
Speech Module: The speech module consists of a voice 
recognition sub module and a text-to-speech sub 
module. Via Voice Millennium is used for speech 
recognition, and ProTalker97 software is used to do 
text-to-speech. 
Vision Module: The vision module performs image 
processing when the central processing module sends it 
a command. We have equipped it with the ability to 
recognize objects based on color segmentation and 
simple shape detection. Gesture recognition methods 
are also used to detect the objects and its result is sent to 
the central processing module. 
Action Module: The action module waits for commands 
from the central processing module to carry out the 
actions intended for the robot and the camera. 
Central Processing Module: The central processing module is the center of the system. It uses various 
information and knowledge to analyze the meaning of the recognized voice input. It sends commands to the 
vision module when it decides that visual information is needed, sends commands to the speech module to 
generate sentences when it needs to ask the user for additional information, and sends commands to the 
action module when an action is to be carried out by the robot. 
III. FEATURE CHARACTERISTICS 
We consider the characteristics of features to determine which feature the robot uses and how to 
use it, from the following four viewpoints. In the current implementation, we use four features: color, size, 
position, and shape. We classify recognized words into the following categories: objects, actions, 
directions, adjectives, adverbs, emergency words, numerals, colors, names of people and others, and we 
train the robot so that it can understand any of these features. 
A. Vocabulary 
Humans can easily describe some features in words. If we can represent a particular feature easily 
by a word for any given object, we call it a vocabulary-rich feature. The robot can ask relatively complex 
questions, such as 'what-type' questions, for a vocabulary-rich feature, since the user can easily find an 
appropriate word for the answer. 
For example, we have a rich vocabulary for color description, such as red, green, and blue. When the robot 
asks what color the target is, we can easily give an answer. Position is also a vocabulary-rich feature; we 
have a large vocabulary to express position, such as left, right, upper, and lower. 
B. Distribution 
Although we consider features for each object, we may find it difficult to express some features in 
words depending on the spatial distribution of objects. We call a feature with this problem a 
distribution-dependent feature. Position is a distribution-dependent feature: if several objects exist close 
together, it is difficult to specify the position of each object. Color, size, and shape are not 
such features. 
C. Uniqueness: 
If the value of a particular feature is different for each object, we call it a unique feature. Position 
can be a unique feature since no two objects share the same position. 
D. Absoluteness/Relativeness: 
If we can describe a particular feature by a word even when only one object exists, we call it an absolute 
feature. Otherwise, we call it a relative feature. Color and shape are absolute features in general. Size and 
position are not absolute features but relative features. We say ‘big’ or ‘small’ by comparing with other 
objects. 
IV. ASSISTANCE BY SPEECH 
The basic strategy for generating dialog is 'ask-and-remove'. The robot asks the user about a certain feature, then removes unrelated objects from the detected objects using the information given by the user. It iterates this process until only one object remains. The robot applies color segmentation. When the number of objects is large, it may be difficult to use distribution-dependent or relative features, so we mainly consider vocabulary-rich, absolute features. We consider unique features only when the other features cannot work, because in the current implementation position is the only unique feature and it is a distribution-dependent feature.
The robot generates its utterances for dialog with the user as follows. First, it classifies the features of all regions into classes. For example, it assigns a color label to each region based on the average hue value of the region; how to classify the data is determined for each feature in advance. For color, it classifies regions into seven colors: blue, yellow, green, red, magenta, white, and black. Then, the robot computes the percentage of objects in each class relative to the total number of objects. It classifies the situation of each feature into three categories depending on the maximum percentage: the variation category, the medium category, and the concentration category. The variation category is the case where the maximum percentage is less than 33% (1/3); the concentration category is the case where it is more than about 67% (2/3); and the medium category covers the range from 33% through 67%. These percentage values were determined experimentally. If the robot can obtain information about a feature classified into the variation category, it can remove many unrelated objects from the regions. Therefore, the first rule for determining which feature the robot asks the user about is to give priority to variation category features. If no such feature exists, medium category features are given the second priority and concentration category features the last.
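As a rough illustration of this category logic, the following Python sketch assigns each feature to a category from the per-region labels and picks the feature to ask about first. The feature names, the input format, and the tie-breaking rule are hypothetical simplifications, not the paper's implementation.

from collections import Counter

def categorize(labels, low=1/3, high=2/3):
    """Return 'variation', 'medium' or 'concentration' for one feature,
    based on the largest share of regions falling into a single class."""
    counts = Counter(labels)
    top_share = max(counts.values()) / len(labels)
    if top_share < low:
        return "variation"
    if top_share > high:
        return "concentration"
    return "medium"

def choose_feature(region_labels):
    """Pick the feature to ask about: variation category first,
    then medium, then concentration."""
    priority = {"variation": 0, "medium": 1, "concentration": 2}
    scored = []
    for feature, labels in region_labels.items():
        cat = categorize(labels)
        top_share = max(Counter(labels).values()) / len(labels)
        scored.append((priority[cat], top_share, feature, cat))
    scored.sort()          # best category, then smallest maximum percentage
    _, _, feature, cat = scored[0]
    return feature, cat

# Example: color varies a lot, size is concentrated on one class.
regions = {
    "color": ["red", "blue", "green", "yellow", "white", "magenta"],
    "size":  ["big", "big", "big", "big", "small", "big"],
}
print(choose_feature(regions))   # -> ('color', 'variation')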
A. Case with variation category features: 
If there are any variation category features, the robot asks the user about one of them. If the features classified into the variation category include a vocabulary-rich feature, the robot asks the user a 'what-type' question. For example, if the color feature satisfies the variation category condition, the robot asks, "What is the color of the target object?", since color is a vocabulary-rich feature. If there are multiple vocabulary-rich features, the first priority is given to the feature with the smallest maximum percentage. If there is no such vocabulary-rich feature, the robot needs to adopt absolute features. Since these are not vocabulary-rich, the user may find it difficult to answer if the robot asks a 'what-type' question. Thus, the robot examines whether or not each region can be described easily in words in terms of the feature. If all regions satisfy this, the robot adopts a 'what-type' question. Otherwise, it uses a multiple-choice question such as, "Is the target object A, B, or others?", where 'A' and 'B' are feature values that can easily be expressed in words. For example, in the case of shape, the robot uses the shape factor, which helps to decide the shape; we deal with this concept in a later section. There can be a case where all regions are hard to express in words. The robot then classifies the regions into classes that can be expressed in words and assigns the label 'others' to the regions that cannot. The number of regions with the 'others' label should thus be less than one third of the total number if the feature is classified into the variation category.
B. Case with medium category features 
If no features are classified into the variation category but some fall into the medium category, the robot considers using the features in the medium category. In this case, the robot uses a 'yes/no-type' question such as, "Is the target object A?", where 'A' is the label of the feature class with the largest percentage. An example is, "Is the target object red?" With one such question the robot can, on average, reduce the number of candidates by about half. If there are multiple such features, the robot gives them priorities according to an order fixed in advance; we determine the priority in the order in which we can obtain reliable information.
C. Case with concentration category features 
This is the case where all features are classified into the concentration category, which means that all regions (objects) are similar in several respects. Thus, the robot considers using unique features and asks a 'yes/no-type' question about a unique feature. In the current implementation, position is the only unique feature. An example question is, "Is the target object on the right?" The robot computes the spatial distribution pattern of the objects; the word specifying the positional relationship, 'right' in the above example, is determined by this pattern. We determine the relationships between the pattern and the word in advance. When we use position, we need to consider two things. One is that position is a distribution-dependent feature. Since we have trained the robot about positional relationships, the user can use words specifying positional relationships, such as 'right' and 'left', by considering the robot's camera direction; thus 'right' means the right part as seen by the robot, and 'close' means the lower part of the robot's view. However, such an interpretation may be wrong, and asking the user to confirm it would not conform to the purpose of this research. To solve such problems, we are planning to specify positional relationships with respect to some distinguished object in the scene. For example, if the robot finds a red object in a scene where no other red objects can be seen, it can ask the user whether the target object is to the right of the red object.
V. IMAGE PROCESSING 
In the current implementation we apply color segmentation and compute four features for each foreground region in the segmentation result: color, shape, position, and size. The size is the number of pixels in the region.
A. Color Segmentation 
We use a robust feature-space method: the mean shift algorithm combined with the HSI (Hue, Saturation, Intensity) color space for color image segmentation. Although the mean shift algorithm and the HSI color space can each be used separately for color image segmentation, either alone may fail to segment objects when the illumination conditions change. To address this problem, we use the mean shift algorithm as an image preprocessing tool to reduce the number of regions and colors, and then use the HSI color space to merge regions so that single-colored objects are segmented under different illumination conditions. Our method consists of the following steps:
· Apply the mean shift algorithm to the real image to reduce colors and divide it into several regions.
· Merge regions of a specific color based on the H, S, and I components of the HSI color space.
· Filter the result using a median filter.
· Eliminate small regions using a region growing algorithm.
The input image is first analyzed using the mean shift algorithm. The image may contain many colors and several regions; the algorithm significantly reduces the number of colors and regions. Thus, the output of the mean shift algorithm is a set of regions with few colors in comparison with the input image. These regions, however, do not necessarily each come from a single object: the mean shift algorithm may divide even a single-color object into several regions with more than one color. To remove this ambiguity, we use the Hue, Saturation, and Intensity components of the HSI color space to merge homogeneous regions that are likely to come from a single object. We use threshold values for each HSI component to obtain homogeneous regions; the threshold values are selected dynamically. Then, we use a median filter as image post-processing, which helps to smooth region boundaries and to reduce unwanted regions. Finally, we use a region growing procedure as another post-processing step to avoid over-segmentation and to remove small highlights from the objects.
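The pipeline above can be approximated with OpenCV as in the following sketch. The mean shift radii, the hue-bin merging, and the minimum region size are assumed values standing in for the paper's dynamically selected thresholds, not parameters taken from the text.

import cv2
import numpy as np

def segment(bgr_image):
    # 1) Mean shift filtering reduces the number of colors/regions.
    filtered = cv2.pyrMeanShiftFiltering(bgr_image, sp=21, sr=30)

    # 2) Merge regions of similar color in HSV (a close relative of HSI) by
    #    quantizing the hue channel into coarse bins (stand-in for the
    #    threshold-based merging described above).
    hsv = cv2.cvtColor(filtered, cv2.COLOR_BGR2HSV)
    hue_bins = (hsv[:, :, 0] // 30).astype(np.uint8)   # ~6 hue classes

    # 3) Median filter as post-processing to smooth region boundaries.
    hue_bins = cv2.medianBlur(hue_bins, 5)

    # 4) Drop small regions (a simple substitute for region growing):
    #    connected components below a pixel-count threshold are discarded.
    labels = np.zeros(hue_bins.shape, dtype=np.int32)
    next_label = 1
    for b in np.unique(hue_bins):
        mask = (hue_bins == b).astype(np.uint8)
        n, comp = cv2.connectedComponents(mask)
        for c in range(1, n):
            region = comp == c
            if region.sum() >= 200:        # minimum region size (assumed)
                labels[region] = next_label
                next_label += 1
    return labels                           # 0 = background / removed regions

# labels = segment(cv2.imread("scene.png"))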
B. Shape Detection 
We compute the shape factor S for each segmented region and classify regions into shape categories by this value: if S is around 1, the shape is a circle; around 0.8, a square; and less than 0.6, an irregular shape.
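The shape factor itself is not written out in the text. The sketch below assumes the common circularity measure S = 4*pi*A / P^2, which is consistent with the stated values (about 1 for a circle, about 0.785 for a square); the classification cutoffs are likewise our reading of "around 1", "around 0.8", and "less than 0.6".

import cv2
import numpy as np

def shape_factor(binary_mask):
    # Circularity of the largest contour in the region mask (assumed definition).
    contours, _ = cv2.findContours(binary_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    c = max(contours, key=cv2.contourArea)
    area = cv2.contourArea(c)
    perimeter = cv2.arcLength(c, closed=True)
    return 4 * np.pi * area / (perimeter ** 2)

def classify_shape(s):
    if s > 0.9:
        return "circle"
    if s > 0.7:
        return "square"
    if s < 0.6:
        return "irregular"
    return "unknown"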
Gesture Recognition: 
Although we can convey much information about target objects by speech, some attributes remain difficult to express in words. We often use gestures to explain the objects in such cases. Gestures can be useful where the shape factor fails to work; in those situations we use gestures to determine shape.
VI. OBJECT RETRIEVAL PHASE
The object retrieval phase is explained with an example. Consider a table on which three drinks are placed; the user wants a Sprite bottle and orders the robot to get it. Using some of the above-mentioned methods, the robot selects the desired drink, as shown in the second figure. Based on the data from the laser scanner, a collision-free trajectory for moving the end effector of the manipulator to the detected object is computed. After detecting the desired bottle, the robot moves, using image processing techniques, to the location where the bottle of Sprite is standing.
Once the object to grasp is identified, the robot scans the area and matches the identified image region with the 3D information. The robot grasps the object along a collision-free trajectory. This method, using both camera and 3D information, provides robustness against positioning inaccuracy and changing object positions. After lifting the object, the manipulator is moved on a collision-free trajectory to a safe position for driving. The hand-over is accomplished by using a force-torque sensor in the end effector to detect that the user has grasped the object. Conversely, a bottle can also be handed over to the robot and placed on furniture by the robot. When handing an object over to the robot, the in-finger sensors are used to detect the object and close the gripper. When placing objects on furniture, the location is first analyzed with the 3D laser scanner. Once a free position has been detected, a collision-free path is planned and the arm is moved to this position. The force-torque sensor data are used to detect the point where the object touches the table; the gripper is then opened and the object is released.
VII. APPLICATIONS
1. Robots help elderly and handicapped people: Technical aids allow elderly and handicapped people to live independently and supported in their private homes for a longer time. The robot's manipulator arm is equipped with a gripper, a hand camera, a force-torque sensor, and optical in-finger distance sensors. The tilting head contains two cameras, a laser scanner, and speakers. A hand-held panel with a touch screen on the robot's back is detachable, so the user can stay in touch even if the robot moves to a different room. Depending on the user's needs, the user can order the robot to retrieve an object, and the robot serves the user much as a human assistant would.
2. Robots help during surgery: Robots can be used during surgeries to help the surgeon by bringing the operating equipment.
3. Robots help in industries: In an industry there may be circumstances where robots can replace human workers; a robot equipped with the above-mentioned facilities can reduce human labour, cost, and so on.
VIII. CONCLUSION
We have proposed a human-robot interface based on mutual assistance between speech and vision. Robots need vision to carry out their tasks, and we have proposed to use the human user's assistance when vision alone is not enough. The robot asks the user a question when it cannot detect the target object, generating a sequence of utterances that leads to determining the object efficiently and in a user-friendly way. It determines what to ask the user, and how, by considering the image processing results and the characteristics of the object (image) attributes. After the target object is detected, it is handed over to the user by the robot.
References: 
[1] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 603-619, 2002.
[2] T. Takahashi, S. Nakanishi, Y. Kuno, and Y. Shirai, "Human-robot interface by verbal and nonverbal communication," Proceedings of the 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems.
[3] P. McGuire, J. Fritsch, J. J. Steil, F. Roothling, G. A. Fink, S. Wachsmuth, G. Sagerer, and H. Ritter, "Multi-modal human-machine communication for instructing robot grasping tasks," in Proc. International Conference on Intelligent Robots and Systems, pp. 1082-1089, September-October 2002.
[4] D. Roy, B. Schiele, and A. Pentland, "Learning audio-visual associations using mutual information," International Conference on Computer Vision, Workshop on Integrating Speech and Image Understanding, 1999.
CODE NO:EC 23 IS 4 
DIGITAL IMAGE WATERMARKING 
V.BHARGAV L.SANJAY REDDY 
ECE 3/4 ECE 3/4 
bhargavvaddavalli@gmail.com san_jay87@yahoo.co.in 
ABSTRACT : 
Digital watermarking is a relatively new technology that allows the imperceptible 
insertion of information into multimedia data. The supplementary information called 
watermark is embedded into the cover work through its slight modification. Watermarks 
are classified as being visible and invisible. A visible watermark is intended to be seen 
with the content of the images and at the same time it is embedded into the material in 
such a way that its unauthorized removal will cause damage to the image. In case of the 
invisible watermark, it is hidden from view during normal use and only becomes visible 
as a result of special visualization process. An important point of watermarking technique 
is that the embedded mark must carry information about the host in which it is hidden. 
There are several techniques of digital watermarking, such as spatial domain encoding, frequency domain embedding, DCT domain watermarking, and wavelet domain embedding. In this paper we examine spatial domain and DCT domain watermarking techniques. Both techniques were implemented on gray-scale images of Lena and Baboon.
INTRODUCTION : 
Digital watermarking is a technique which allows an individual to add hidden 
copyright notices or other verification messages to digital audio, video, or image signals 
and documents. Such hidden message is a group of bits describing information pertaining 
to the signal or to the author of the signal (name, place, etc.). The technique takes its 
name from watermarking of paper or money as a security measure. Digital watermarking 
is not a form of steganography, in which data is hidden in the message without the end 
user's knowledge, although some watermarking techniques have the steganographic 
feature of not being perceivable by the human eye. 
While the addition of the hidden message to the signal does not restrict that 
signal's use, it provides a mechanism to track the signal to the original owner. 
A watermark can be classified into two sub-types: visible and invisible. Visible 
watermarks change the signal altogether such that the watermarked signal is totally 
different from the actual signal, e.g., adding an image as a watermark to another image. 
Invisible watermarks do not change the signal to a perceptually great extent, i.e., there are only minor variations in the output signal. An example of an invisible watermark is when some bits are added to an image by modifying only its least significant bits.
1. Spatial Domain Watermarking 
One of the simplest techniques in digital watermarking works in the spatial domain, using the two-dimensional array of pixels in the container image to hold hidden data via the least significant bit (LSB) method. The human eye is not very attuned to small variations in color, so changes confined to the LSBs are not noticeable. The steps to embed the watermark image are given below.
1.1 Steps of Spatial Domain Watermarking 
1) Convert the RGB image to a gray-scale image.
2) Convert the image to double precision.
3) Shift the most significant bits of the watermark image down to the least significant bit positions.
4) Set the least significant bits of the host image to zero.
5) Add the shifted version (step 3) of the watermark image to the modified (step 4) host image.
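A minimal NumPy sketch of steps 3-5, assuming two equally sized 8-bit gray-scale images and the 3-bit embedding depth used in the experiments below:

import numpy as np

def embed_lsb(host, watermark, bits=3):
    """Replace the `bits` least significant bits of `host` with the
    `bits` most significant bits of `watermark` (both uint8, same shape)."""
    host = host.astype(np.uint8)
    watermark = watermark.astype(np.uint8)
    shifted = watermark >> (8 - bits)            # step 3: MSBs of watermark -> low bits
    cleared = host & ~np.uint8((1 << bits) - 1)  # step 4: zero the LSBs of the host
    return cleared | shifted                     # step 5: combine

# watermarked = embed_lsb(lena, baboon, bits=3)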
To implement the above algorithm, we used a 512 x 512 8-bit image of Lena and a 512 x 512 8-bit image of Baboon, which are shown in Figure 1 below. The embedded images are shown in Figure 2.
Figure 1: 512 x 512 8-bit gray-scale images: (a) Image of Lena. (b) Image of Baboon.
1.2 Embedding Watermark Image:
Figure 2: Digital image watermarking of two equal-size images using LSB: (a) Image of Baboon hidden in image of Lena; (b) Image of Lena hidden in image of Baboon.
Note that it was determined that the 5 most significant bits give a good visualization of any image. Figure 2(a) shows the host image of Lena, where the 3 MSBs of Baboon are stored in the 3 LSBs of Lena. The same experiment was performed with Baboon as the host image: Figure 2(b) shows Baboon as the host, where the 3 MSBs of Lena are used as the 3 LSBs of Baboon. To obtain the extracted image, the 3 LSBs of the watermarked image are simply extracted and shifted back, as shown in Figure 3 below.
1.3 Extracting Watermark Image:
Figure 3: Extracted Images from Watermarked Images: (a) Extracted image of Baboon 
from Lena, (b) Extracted image of Lena from Baboon. 
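Continuing the embedding sketch above, extraction simply reads back the 3 LSBs and shifts them to the MSB positions; the 5 bits lost during embedding explain the reduced quality of the extracted images in Figure 3.

import numpy as np

def extract_lsb(watermarked, bits=3):
    lsbs = watermarked.astype(np.uint8) & np.uint8((1 << bits) - 1)
    return lsbs << (8 - bits)          # shift back to the MSB positions

# recovered_baboon = extract_lsb(watermarked, bits=3)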
Note that the resolution of the embedded image and that of the extracted image trade off against each other. If a higher-resolution extracted image is required, more MSBs of the watermark image can be used; however, in that case the resolution of the embedded (watermarked) image will be reduced.
To compare the quality of the embedded image and the extracted image, the MSE and PSNR were calculated between the original host image and the embedded image, as well as between the original watermark image and the extracted image. The results are shown in Table 1 below.
Table 1: MSE / PSNR of embedded and extracted images.
                    Lena                          Baboon
Embedded Image      1.0861 x 10^-4 / 87.77 dB     1.1642 x 10^-4 / 87.47 dB
Extracted Image     1.3898 x 10^-4 / 6.7011 dB    1.4557 x 10^-4 / 6.5002 dB
Notice from Table 1 that the MSE for the Baboon image, for both the embedded and the extracted image, is higher than for the image of Lena. This was expected, since the variation in gray scale is greater in the Baboon image than in the image of Lena.
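For reference, the two metrics in Table 1 follow the standard definitions for 8-bit images; the sketch below computes them (the table values themselves come from the paper's experiments).

import numpy as np

def mse(a, b):
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, peak=255.0):
    e = mse(a, b)
    return np.inf if e == 0 else 10 * np.log10(peak ** 2 / e)

# print(mse(lena, watermarked), psnr(lena, watermarked))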
2. DCT Domain Watermarking:
The classic and still popular domain for image processing is that of the discrete cosine transform (DCT). The DCT allows an image to be broken up into different frequency bands, making it much easier to embed watermarking information into the middle-frequency bands of an image. The middle-frequency bands are chosen because they avoid the most visually important parts of the image (the low frequencies) without over-exposing the watermark to removal through compression and noise attacks.
Flow chart of DCT Domain Watermarking:
The embedding procedure (Figure 4) works block by block: perform an 8x8 block DCT on the host image; calculate the variance of each block; if the variance is greater than 45, embed a watermark value in that block, otherwise leave the DCT coefficients of that block unmodified; once all blocks are done, apply the inverse DCT (IDCT).
Figure 4: Flow chart of the Watermark Embedding Procedure.
2.2 Embedding Watermark Image
The embedding algorithm can be described by the following equation:
Watermarked Image = DCT Transformed Image x (1 + Scaling Factor x Watermark)
The DCT-domain 2-D signal is then passed through an 8x8 block inverse DCT to obtain the watermarked image. The result is shown in Figure 5 below.
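A hedged sketch of the Figure 4 procedure using SciPy's block DCT follows. The 8x8 blocks and the variance threshold of 45 come from the flow chart; the scaling factor value and the assignment of one watermark bit per qualifying block are assumptions.

import numpy as np
from scipy.fft import dctn, idctn

def embed_dct(host, watermark_bits, alpha=0.1, var_thresh=45.0, block=8):
    host = host.astype(np.float64)
    out = host.copy()
    flat = np.ravel(watermark_bits)
    k = 0
    for r in range(0, host.shape[0] - block + 1, block):
        for c in range(0, host.shape[1] - block + 1, block):
            if k >= flat.size:
                break
            blk = host[r:r + block, c:c + block]
            if blk.var() <= var_thresh:
                continue                        # low-variance block: leave unmodified
            coeffs = dctn(blk, norm="ortho")
            coeffs *= (1 + alpha * flat[k])     # multiplicative rule from Section 2.2
            out[r:r + block, c:c + block] = idctn(coeffs, norm="ortho")
            k += 1
    return np.clip(out, 0, 255).astype(np.uint8)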
Figure 5 shows the DCT-based embedding on the 512 x 512 image of Lena. A binary (2-level) 16 x 16 image of the Temple logo was taken as the watermark image. The watermark image was embedded in the image of Lena, and the resulting watermarked image is shown in Figure 5(c).
2.3 Extracting Watermark Image 
To obtain the extracted watermark from the watermarked image, the following procedure was performed:
1. Perform the DCT transform on the watermarked image and on the original host image.
2. Subtract the original host image from the watermarked image (in the DCT domain).
3. Multiply the extracted watermark by the scaling factor for display.
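A matching sketch of these extraction steps, assuming access to the original host image (non-blind extraction) and the same scaling factor used for embedding; blocks that were left unmodified during embedding simply yield values near zero.

import numpy as np
from scipy.fft import dctn

def extract_dct(watermarked, host, alpha=0.1, block=8):
    recovered = []
    w = watermarked.astype(np.float64)
    h = host.astype(np.float64)
    for r in range(0, h.shape[0] - block + 1, block):
        for c in range(0, h.shape[1] - block + 1, block):
            cw = dctn(w[r:r + block, c:c + block], norm="ortho")  # step 1
            ch = dctn(h[r:r + block, c:c + block], norm="ortho")
            # Steps 2-3: the difference equals alpha * w * (host coefficients),
            # so the per-block watermark value can be estimated from the DC term.
            recovered.append((cw[0, 0] / ch[0, 0] - 1.0) / alpha)
    return np.array(recovered)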
The extracted Temple logo is shown in Figure 5(d). Note that, because of the DCT transform of the watermarked image, the recovered message is not exactly the same as the original; however, it contains enough information for authentication. It should also be noted that the watermark was only a 16 x 16 binary image; in the case of a larger watermark image, the extracted watermark would be expected to have better resolution.
Figure 6: Scaled difference between the original image of Lena and the watermarked image.
Finally, the scaled difference between the original image of Lena and the watermarked image of Lena is shown in Figure 6. The difference is not noticeable, which shows that the watermarked image is close to the original host image.
The same experiment was performed on the image of Baboon, and the result is shown in Figure 7 below:
Figure 7: DCT-based watermarking: (a) 512 x 512 8-bit original image of Baboon; (b) 16 x 16 binary (2-level) image of the Temple logo; (c) DCT watermarked image; (d) recovered Temple logo from the watermarked image.
Applications:
· Steganography
· Copyright protection and authentication
· Anti-piracy
· Broadcast monitoring
Limitations of Spatial Domain Watermarking: 
While this method is comparatively simple, it lacks the basic robustness that may be expected in watermarking applications. It can survive simple operations such as cropping and some addition of noise; however, lossy compression will defeat the watermark. An even better attack is to set all the LSBs to '1', fully defeating the watermark at the cost of negligible perceptual impact on the cover object. Furthermore, once the algorithm is discovered, it becomes very easy for an intermediate party to alter the watermark. To overcome these problems, we investigated the DCT domain watermarking discussed above.
Conclusion: 
Two different techniques of digital watermarking were investigated in this project. It was determined that DCT domain watermarking is considerably better than spatial domain encoding, since DCT domain watermarking can survive attacks such as noise, compression, sharpening, and filtering. However, the extraction of the watermark image depends on the original host image. It was also noted that the PSNR is lower for the Baboon host image than for the Lena host image.
References:
· Digital Watermarking Alliance - Furthering the Adoption of Digital Watermarking (http://www.digitalwatermarkingalliance.org/)
· Digital Watermarking & Data Hiding research papers (http://www.forensics.nl/digital-watermarking) at Forensics.nl
· Retrieved from http://en.wikipedia.org/wiki/Digital_watermarking
· Digital Image Processing - Rafael C. Gonzalez, Richard E. Woods - 2nd edition, Pearson Education (PHI)
· Digital Image Processing Using MATLAB - Rafael C. Gonzalez, Richard E. Woods, Steven L. Eddins - Pearson Education
· Digital Image Processing and Analysis - B. Chanda, D. Dutta Majumder - Prentice-Hall of India, 2003
· Under the guidance of Associate Prof. Mr. Kishore Kumar, Bapatla Engineering College
CODE NO:93 IS 5 
By: 
B.Jeevan Jyothi Kumar. 
K.Sri Sai Koteswara Rao. 
Email:jeevan_koti@yahoo.co.in 
Abstract: 
The influx of sophisticated technologies in the field of image processing was concomitant with that of digitization in the computer arena. Today image processing is used in fields such as astronomy, medicine, crime and fingerprint analysis, remote sensing, manufacturing, aerospace and defense, movies and entertainment, and multimedia.
In this paper, we propose a new scheme for image compression using neural networks. Image data compression deals with minimizing the amount of data required to represent an image while maintaining an acceptable quality. Several image compression techniques have been developed in recent years. Over the last few years the neural network has emerged as an effective tool for solving a wide range of problems involving adaptivity and learning. A multilayer feed-forward neural network trained using the backward error propagation algorithm is used in many applications; however, this model alone is not well suited for image compression because of its poor coding performance. Recently, a self-organizing feature map (SOFM) algorithm has been proposed which yields good coding performance, but it requires a long training time because the network starts with random initial weights. In this paper we use the backward error propagation (BEP) algorithm to quickly obtain the initial weights, which are then used to speed up the training required by the SOFM algorithm. We also propose an architecture and an improved training method that attempt to solve some of the shortcomings of traditional data compression systems based on feed-forward neural networks trained with back propagation: the dynamic auto-association neural network (DANN). Image compression is of particular use where efficiency is the crucial factor, and the quality of the processed image varies according to the specialized image signal processing applied.
History: 
The history of digital image processing and analysis is quite short; it cannot be older than the first electronic computer. However, the concept of the digital image can be found in the literature as early as 1920, with the transmission of images through the Bartlane cable picture transmission system (McFarlane, 1972). Images were coded for submarine cable transmission and then reconstructed at the receiving end by a specialized printing device. The first computers powerful enough to carry out image processing tasks appeared in the early 1960s, and the birth of what we call digital image processing today can be traced to the availability of those machines and the onset of the space program during that period. Attention was then concentrated on improving the visual quality of transmitted (or reconstructed) images. In fact, the potential of image processing techniques came into focus with the advancement of large-scale digital computers and with the journey to the moon. Improving image quality using computers started at the Jet Propulsion Laboratory, California, USA in 1964, where the images of the moon transmitted by Ranger 7 were processed. In parallel with space applications, digital image processing techniques began in the late 1960s and early 1970s to be used in medical imaging, remote earth resources observation, and astronomy. Since 1964 the field has experienced vigorous growth; certain efficient computer processing techniques (e.g., the fast Fourier transform) have also contributed to this development.
Introduction 
Modern digital technology has made it possible to manipulate multi-dimensional image 
signals with systems that range from simple digital circuits to advanced parallel 
computers. The goal of this manipulation can be divided into three categories: 
* Image Processing image in -> image out 
* Image Analysis image in -> measurements out 
* Image Understanding image in -> high-level description out 
We will focus on the fundamental concepts of image processing. Space does not permit 
us to make more than a few introductory remarks about image analysis. Image 
understanding requires an approach that differs fundamentally from the theme of our 
discussion. Further, we will restrict ourselves to two-dimensional (2D) image processing 
although most of the concepts and techniques that are to be described can be extended 
easily to three or more dimensions. We begin with certain basic definitions. 
An image defined in the "real world" is considered to be a function of two real variables, 
for example, a(x, y) with a as the amplitude (e.g. brightness) of the image at the real 
coordinate position (x, y). An image may be considered to contain sub-images sometimes 
referred to as regions-of-interest, ROIs, or simply regions. This concept reflects the fact 
that images frequently contain collections of objects, each of which can be the basis for a region. In certain image-forming processes, however, the signal may involve photon counting, which implies that the amplitude is inherently quantized.
Figure: Components of a general-purpose image processing system: image sensors, specialized image processing hardware, computer, image processing software, mass storage, hard-copy device, and display.
Image compression: 
Compression is one of the techniques used to make the file size of an image smaller. The 
file size may decrease only slightly or tremendously depending upon the type of 
compression used. Think of compression much like you would a balloon. You start out 
with a balloon filled with air - this is your image. By squeezing out (or compressing) all 
of the air, your balloon shrinks to a fraction of the size of the air-filled original. This 
balloon will now fit in a lot of spaces that were initially impossible. The end result is that 
you still have the exact same balloon, but just in a slightly different form. To get back the 
original balloon, simply blow up the balloon to its original size. A direct analogy can be 
drawn with image compression. You start out with a very large file size of an image. By 
applying compression to the file, the file shrinks to a fraction of its original size. You can 
now fit more images onto a floppy disk or hard disk because they have been compressed 
and take up less space. More importantly, the smaller file size also means that the file can be sent over the World Wide Web much faster. This is good news for people browsing your web site, and good news for network congestion problems.
There are two different types of compression - lossless and lossy: 
Lossless: 
A compression scheme in which no bits of information are permanently lost during 
compression/decompression of an image. This means that, just like the balloon 
analogy, even though the air is out of the balloon, it is capable of returning to its 
original state. Your image will look exactly the same before and after you've 
compressed it using a lossless compression scheme. The most common image format 
on the WWW that uses a lossless compression scheme is the GIF (Graphics 
Interchange Format) format. Although it is lossless, it has the capability of showing 
you a maximum of only 256 colors at a time. The GIF format is used mainly when 
there are distinct lines and colors in your image, as is the case in logos and illustration 
work. Cartoons are an excellent example of the type of work that is best suited for the 
GIF format. At this time, all web browsers support the GIF format. 
When converting an image to GIF format, you have the option to have the image 
display any number of colors up to 256 (the maximum number of colors for this 
format). A lot of images appropriate for the GIF format can be saved with as little as 
8 to 16 colors which will greatly decrease the required file size compared to the same 
image saved with 256 colors. These settings can be specified when using Photoshop, 
an image editing tool that we will discuss later on. 
· Lossy 
A compression scheme in which some bits of information are permanently lost 
during compression and decompression of an image. This means that, unlike the 
balloon analogy, an image will permanently lose some of the information that it 
originally contained. Fortunately, the loss is usually only minimal and hardly 
detectable. The most common image format on the WWW that uses a lossy 
compression scheme is the JPEG (Joint Photographic Experts Group) format. 
JPEG is a very efficient, true-color, compressed image format. Although it is 
lossy, it has the capability of showing you more colors than GIF (more than 256 
colors). The JPEG format is used mainly when your image contains gradients, 
blends, and inconsistent color variations, as is the case with photographic images. 
Because it is lossy, JPEG has the ability to compress an image tremendously, with 
little loss in image quality. It is usually able to compress more efficiently than the 
lossless GIF format, achieving much greater compression. The more popular 
browsers such as Netscape do support JPEG, and it is expected that very shortly 
all browsers will have built-in support for it. 
It's important to note that since JPEG is a lossy image format, it is very important 
to have a non-JPEG image as your original. This way, you can make changes to 
the original and save it as a JPEG under a different name. If you need to make 
revisions, you can go back to the original non-JPEG image and make your 
corrections and only then should you save it as a JPEG. By opening a JPEG 
image, revising it, and saving it back out as a JPEG time and time again, you may 
introduce unwanted artifacts or "noise" that you may otherwise be able to avoid. 
A lot of people are scared off by the term "lossy" compression. But when it comes 
to real-world scenes, no digital image format can retain all the information that 
your eyes can see. By comparison with a real-world scene, JPEG loses far less 
information than GIF. 
Both GIF and JPEG have their distinct advantages, depending on the types of images you are including on your page. If you are uncertain which is best, it doesn't hurt to try both on the same image. Experiment and see which format gives you the best picture at the lowest cost in disk space.
Neural networks: 
The area of neural networks probably belongs to the borderline between artificial intelligence and approximation algorithms. Think of them as algorithms for "smart approximation". Neural networks are used (to name a few applications) for universal approximation (mapping input to output), as tools capable of learning from their environment, and as tools for finding non-evident dependencies between data.
Neural networking algorithms (at least some of them) are modeled after the brain (not necessarily the human brain) and the way it processes information. The brain is a very efficient tool: although its response time is about 100,000 times slower than that of computer chips, it (so far) beats the computer in complex tasks such as image and sound recognition and motion control, and it is roughly 10,000,000,000 times more efficient than a computer chip in terms of energy consumption per operation. The brain is a multilayer structure (think 6-7 layers of neurons, if we are talking about the human cortex) with about 10^11 neurons, a structure that works as a parallel computer capable of learning from the "feedback" it receives from the world and changing its design (think of computer hardware changing while performing a task) by growing new links between neurons or altering the activities of existing ones. To make the picture a bit more complete, a typical neuron is connected to 50-100 other neurons, and sometimes to itself as well.
To put it simply, the brain is composed of interconnected neurons.
Structure of a neuron: 
Image compression using back prop: 
Computer images are extremely data intensive and hence require large amounts of 
memory for storage. As a result, the transmission of an image from one machine to 
another can be very time consuming. By using data compression techniques, it is possible 
to remove some of the redundant information contained in images, requiring less storage 
space and less time to transmit. Neural nets can be used for the purpose of image 
compression, as shown in the following demonstration. 
A neural net architecture suitable for solving the image compression problem is shown below. This type of structure, a large input layer feeding into a small hidden layer which then feeds into a large output layer, is referred to as a bottleneck network. The idea is this: suppose that the neural net shown below had been trained to implement the identity map; then a tiny image presented to the network as input would appear exactly the same at the output layer.
Bottleneck-type Neural Net Architecture for Image Compression: 
In this case, the network could be used for image compression by breaking it in two, as shown in the figure below. The transmitter encodes and then transmits the output of the hidden layer (only 16 values, as compared to the 64 values of the original image). The receiver receives and decodes the 16 hidden outputs and generates the 64 outputs. Since the network implements an identity map, the output at the receiver is an exact reconstruction of the original image.
The Image Compression Scheme using the Trained Neural Net 
Actually, even though the bottleneck takes us from 64 nodes down to 16 nodes, no real compression has occurred yet, because unlike the 64 original inputs, which are 8-bit pixel values, the outputs of the hidden layer are real-valued (between -1 and 1) and would require a potentially unlimited number of bits to transmit. True image compression occurs when the hidden layer outputs are quantized before transmission. The figure below shows a typical quantization scheme using 3 bits to encode each hidden output. In this case, there are 8 possible binary codes: 000, 001, 010, 011, 100, 101, 110, 111. Each of these codes represents a range of values for a hidden unit output. To compute the amount of image compression (measured in bits per pixel) for this level of quantization, we take the ratio of the total number of bits transmitted per block (16 hidden outputs x 3 bits = 48 bits) to the total number of pixels in the original 8x8 block (64 pixels), which gives 48/64 = 0.75 bits per pixel.
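The quantization and the resulting bit rate can be sketched as follows; the uniform 3-bit quantizer over [-1, 1] is an assumption consistent with the description above.

import numpy as np

def quantize(hidden_outputs, bits=3):
    """Map values in [-1, 1] to integer codes 0 .. 2**bits - 1."""
    levels = 2 ** bits
    codes = np.floor((hidden_outputs + 1.0) / 2.0 * levels)
    return np.clip(codes, 0, levels - 1).astype(np.uint8)

def dequantize(codes, bits=3):
    levels = 2 ** bits
    return (codes + 0.5) / levels * 2.0 - 1.0   # centre of each quantization bin

bits_per_pixel = 16 * 3 / 64                     # = 0.75 bpp for this configuration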
The Quantization of Hidden Unit Outputs 
The training of the neural net proceeds as follows. A 256 x 256 training image is used to train the bottleneck network to learn the required identity map. Training input-output pairs are produced from the training image by extracting small 8 x 8 chunks of the image chosen at uniformly random locations. The easiest way to extract such a random chunk is to generate a pair of random integers to serve as the upper left-hand corner of the extracted chunk: we choose random integers i and j, each between 0 and 248, and then (i, j) is the coordinate of the upper left-hand corner of the extracted chunk. The pixel values of the extracted image chunk are sent (left to right, top to bottom) through the pixel-to-real mapping shown in the figure below to construct the 64-dimensional neural net input. Since the goal is to learn the identity map, the desired target for the constructed input is the input itself; this training pair is then used to update the weights of the network.
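A small sketch of this training-pair construction follows. The pixel-to-real mapping of the figure is not reproduced in the text, so a linear map from [0, 255] to [-1, 1] is assumed here; the 256 x 256 image and 8 x 8 chunk size follow the description.

import numpy as np

rng = np.random.default_rng(0)

def pixel_to_real(chunk):
    return chunk.astype(np.float64) / 127.5 - 1.0        # assumed mapping

def random_training_pair(image, chunk=8):
    # Upper-left corner anywhere in [0, 248] for a 256 x 256 image.
    i = rng.integers(0, image.shape[0] - chunk + 1)
    j = rng.integers(0, image.shape[1] - chunk + 1)
    x = pixel_to_real(image[i:i + chunk, j:j + chunk]).reshape(-1)   # 64-d input
    return x, x                                           # identity map: target == input

# x, target = random_training_pair(training_image)
# ... feed (x, target) to the 64-16-64 bottleneck network's weight update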
The Pixel-to-Real and Real-to-Pixel Conversions: 
Once training is complete, image compression is demonstrated in the recall phase. In this case, we still present the neural net with 8 x 8 chunks of the image, but now, instead of randomly selecting the location of each chunk, we select the chunks in sequence from left to right and from top to bottom. For each such 8 x 8 chunk, the output of the network can be computed and displayed on the screen to visually observe the performance of neural net image compression. In addition, the 16 outputs of the hidden layer can be grouped into a 4 x 4 "compressed image", which can be displayed as well.
CODE NO: EC 105 IS 6 
BIOMETRIC AUTHENTICATION
SYSTEM IN CREDIT CARDS
By 
B.VARSHA (04251A1711) 
C.NITHYA (04251A1712) 
ETM 3/4 
GNITS 
ID:b_varshareddy@yahoo.com 
nitchunduri_87@yahoo.com
ABSTRACT: 
Catching ID thieves is like spear fishing during a salmon run: 
skewering one big fish barely registers when the vast majority just keeps on going. With 
birth dates, addresses, and Social Security and credit card numbers in hand, we can use a 
computer at a public library to order merchandise online, withdraw money from 
brokerage accounts, and apply for credit cards in other people’s names. 
It's a security-obsessed world. Identity thefts, racial profiling, border checkpoints, 
computer passwords ... it all boils down to a simple question: "Are you who you say you 
are?" 
Biometrics has developed a means to reliably answer this 
deceptively simple question by using fingerprint sensing in any type of wallet-sized 
plastic card-credit cards, ID cards, smart cards, drivers licenses, passports and more. 
In this paper we discuss biometric credit cards that use fingerprint sensing: their functioning, their improvement over conventional authentication techniques, how effective they have been in increasing security and preventing ID theft, their limitations, and how they can be made more effective in the future.
INTRODUCTION: 
It is far too easy to steal personal information these days, especially credit card numbers, which are involved in more than 67 percent of identity thefts, according to a U.S. Federal Trade Commission study. It is also relatively easy to fake someone's signature or guess a password; thieves can often just look at the back of a credit or ATM card, where some 30 percent of people actually write down their personal identification number (PIN), giving the thief all that is needed to raid the account. But what if we all had to present our fingers to a scanner built into our credit cards to authenticate our identities before completing a transaction? Faking fingerprints would prove challenging to even the most technologically sophisticated identity thief.
The sensors, processors, and software needed to make secure credit cards that authenticate users on the basis of their physical, or biometric, attributes are already on the market. But concerned about biometric system performance, customer acceptance, and the cost of making changes to their existing infrastructure, the credit card issuers apparently would rather go on absorbing an expense equal to 0.25 percent of Internet transaction revenues and 0.08 percent of off-line revenues that now comes from stolen credit card numbers.
Our proposed system uses fingerprint sensors, though other biometric 
technologies, either alone or in combination, could be incorporated. The system could be 
economical, protect privacy, and guarantee the validity of all kinds of credit card 
transactions, including ones that take place at a store, over the telephone, or with an 
Internet-based retailer. By preventing identity thieves from entering the transaction loop, 
credit card companies could quickly recoup their infrastructure investments and save 
businesses, consumers, and themselves billions of dollars every year. 
Current credit card authentication systems validate anyone, including 
impostors, who can reproduce the exclusive possessions or knowledge of legitimate 
cardholders. Presenting a physical card at a cash register proves only that you have a 
credit card in your possession, not that you are who the card says you are. Similarly, 
passwords or PINs do not authenticate your identity but rather your knowledge. Most passwords or PINs can be guessed with just a little information: an address, a license plate number, a birth date, or a pet's name. Patient thieves can and do take pieces of information gleaned from the Internet or from mail found in the trash and eventually associate enough bits to bring a victim to financial grief.
To ensure truly secure credit card transactions, we need to minimize this 
kind of human intervention in the authentication process. Such a major transition will 
come by using fingerprint sensing in credit cards at a cost that credit card companies have 
so far declined to pay. They are particularly worried about the cost of transmitting and 
receiving biometric information between point-of-sale terminals and the credit card 
payment system. They also fret that some customers, anxious about having their 
biometric information floating around cyberspace, might not adopt the cards. To address 
these concerns, we offer an outline for a self-contained smart-card system that we believe 
could be implemented within the next few years. 
WORKING OF THIS AUTHENTICATION SYSTEM:
When activating your new card, you would load an image of your 
fingerprint onto the card. To do this, you would press your finger against a sensor in the 
card—a silicon chip containing an array of microcapacitor plates. (In large quantities, 
these fingerprint-sensing chips cost only about $5 each.) The surface of the skin serves as 
a second layer of plates for each microcapacitor, and the air gap acts as the dielectric 
medium. A small electrical charge is created between the finger surface and the capacitor 
plates in the chip. The magnitude of the charge depends on the distance between the skin 
surface and the plates. Because the ridges in the fingerprint pattern are closer to the 
silicon chip than the valleys, the ridges and valleys result in different capacitance values across the matrix of plates. The capacitance values of the different plates are measured and converted into pixel intensities to form a digital image of the fingerprint (with reference to the figure).
Next, a microprocessor in the smart card extracts a few specific 
details, called minutiae, from the digital image of the fingerprint. Minutiae include 
locations where the ridges end abruptly and locations where two or more ridges merge, or 
a single ridge branches out into two or more ridges. Typically, in a live-scan fingerprint 
image of good quality, there are 20 to 70 minutiae; the actual number depends on the size 
of the sensor surface and the placement of the finger on the sensor. The minutiae 
information is encrypted and stored, along with the cardholder’s identifying information, 
as a template in the smart card’s flash memory. 
At the start of a credit card transaction, you would present your smart 
credit card to a point-of-sale terminal. The terminal would establish secure 
communications channels between itself and your card via communications chips 
embedded in the card and with the credit card company’s central database via Ethernet. 
The terminal then would verify that your card has not been reported lost or stolen, by 
exchanging encrypted information with the card in a predetermined sequence and 
checking its responses against the credit card database. 
Next, you would touch your credit card’s fingerprint sensor pad. The 
matcher, a software program running on the card’s microprocessor, would compare the 
signals from the sensor to the biometric template stored in the card’s memory. The 
matcher would determine the number of corresponding minutiae and calculate a 
fingerprint similarity result, known as a matching score. Even in ideal situations, not all 
minutiae from the input and template prints taken from the same finger will match. So the 
matcher uses what’s called a threshold parameter to decide whether a given pair of 
feature sets belong to the same finger or not. If there’s a match, the card sends a digital 
signature and a time stamp to the point-of-sale terminal. The entire matching process 
could take less than a second, after which the card is accepted or rejected. 
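A hedged sketch of the matcher's threshold decision follows. Real minutiae matchers also align the two prints and compare ridge angles; here minutiae are simply (x, y) points, and the distance tolerance and acceptance threshold are assumed values, not those of an actual card.

import numpy as np

def matching_score(template, probe, tol=10.0):
    """Fraction of probe minutiae falling within `tol` pixels of a template minutia."""
    matched = 0
    for p in probe:
        d = np.linalg.norm(np.asarray(template) - np.asarray(p), axis=1)
        if d.min() <= tol:
            matched += 1
    return matched / max(len(probe), 1)

def accept(template, probe, threshold=0.6):
    return matching_score(template, probe) >= threshold

# accept(stored_template_minutiae, live_scan_minutiae)  -> True / False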
The point-of-sale terminal sends both the vendor information and your 
account information to the credit card company’s transaction-processing system. Your 
private biometric information remains safely on the card, which ideally never leaves your 
possession. 
But say your card is lost or stolen. First of all, it is unlikely that a thief 
could recover your fingerprint data, because it is encrypted and stored on a flash memory 
chip that very, very few thieves would have the resources to access and decrypt. 
Nevertheless, suppose that an especially industrious, and perhaps unusually attractive, 
operator does get hold of the fingerprint of your right index finger—say, off a cocktail 
glass at a hotel bar where you really should not have been drinking. Then this industrious 
thief manages to fashion a latex glove molded in a slab of gelatin containing a nearly 
flawless print of your right index finger, painstakingly transferred from the cocktail glass. 
Even such an effort would fail, thanks to new applications that test 
the vitality of the biometric signal. One identifies sweat pores, which are just 0.1 
millimeter across, in the ridges using high-resolution fingerprint sensors. We could also 
detect spoofs by measuring the conduction properties of the finger using electric field 
sensors. Software-based spoof detectors aren’t far behind. Researchers are differentiating 
the way a live finger deforms the surface of a sensor from the way a dummy finger does. 
With software that applies the deformation parameters to live scans, we can automatically 
distinguish between a real and a dummy finger 85 percent of the time—enough to make 
your average identity thief think twice before fashioning a fake finger. 
FINGERPRINT MATCHING: In this simplified diagram, the matching process 
consists of minutiae extraction followed by alignment and determination of 
corresponding minutiae stored as a template in the card’s flash memory. Even prints from 
the same finger won’t ever exactly match, because of dirt, sweat, smudging, or placement 
on the sensor. Therefore, the system has a threshold parameter: a maximum number of 
mismatched minutiae that a scanned fingerprint can have, beyond which the card will 
reject the print as inauthentic. In the case shown, just three minutiae don't match up, and the user is positively authenticated.
A version of the system designed to protect Internet shoppers might be 
even easier to implement, and less expensive, too. When mulling the costs and benefits of 
biometric credit cards, card issuers might well decide to first deploy biometric 
authentication systems for Internet transactions, which is where ID thieves cause them 
the most pain. A number of approaches could work, but here’s a simple one that adapts 
some of the basic concepts from our proposed smart-card system. 
To begin with, you’d need a PC equipped with a biometric sensing 
device such as a fingerprint sensor, a camera for iris scans, or a microphone for taking a 
voice signature. Next, you’d need to enroll in your credit card company’s secure e-commerce 
system. You would first download and install a biometric credit card protocol 
plug-in for your Web browser. The plug-in, certified by the credit card company, would 
enable the computer to identify its sensor peripherals so that biometric information 
registered during the enrollment process could be traced back to specific sensors on a 
specific PC. After the sensor scanned your fingerprints, you would have to answer some 
of the old authentication questions—such as your Social Security number, mother’s 
maiden name, or PIN. Once the system authenticated you, the biometric information 
would be officially certified as valid by the credit card company and stored as an 
encrypted template on your PC’s hard drive. 
During your initial purchase after enrollment, perhaps buying a nice 
shirt from your favorite online retailer, you would go through a conventional 
authentication procedure that would prompt you to touch your PC’s finger scanner. The 
credit card protocol plug-in would then function as a matcher and would compare the live 
biometric scan with the encrypted, certified template on the hard drive. If there were a 
match, your PC would send a certified digital signature to the credit card company, which 
would release funds to the retailer, and your shirt would be on its way. Accepting the 
charge for the shirt on the next bill by paying for it would confirm to the card issuer that 
you are the person who enrolled the fingerprints stored on the PC. From then on, each 
time you made an online purchase, you would touch the fingerprint sensor, the plug-in 
would confirm your identity, and your PC would send the digital signature to your credit 
card company, authorizing it to release funds to the vendor. 
If someone else tried to use his fingerprints on your machine, the plug-in would 
recognize that the live scan didn’t match the stored template and would reject the 
attempted purchase. If someone stole your credit card number, enrolled her own 
fingerprints on her own PC, and went on an online shopping spree, you would dispute the 
charges on your next bill and the credit card issuer would have to investigate. 
OTHER APPLICATIONS OF FINGER PRINT SENSING: 
· Fingerprint-sensing biometric pen: The new pen uses biometric authentication to verify the identity of a signer. This occurs in less than one second after the signer grips the pen.
· Fingerprint sensing biometrics can also be used in ID Cards, Passport & visas, 
driver’s license, traffic control. 
· Access control: Since the 9/11 tragedy, the need for secure access to buildings and various facilities has become more apparent. Each person needing access to the facility has an ID card that contains their personal biometric information and any other additional data necessary for the particular application. All information is stored on the ID card. Building entrances are equipped with a fingerprint scanner, and a control box connects the security system to the building's local area network and/or the Internet. The security system can then be accessed either through the facility LAN (an "intranet method") or through the Internet (an "internet method") to monitor the entire security system in the building or facility, including access, the video monitoring system, visitor passes, entry and exit times, etc.
· Pay by touch fingerprint scanners are used to buy groceries. 
· Fingerprint sensors are used in mobile phones for security. 
ADVANTAGES: 
· High Security 
· Reduces ID thefts 
· Reduces the burden of remembering passwords 
· Easier to implement 
LIMITATIONS & REMEDIES: 
Any biometric system is prone to two basic types of errors 
· False positive: In a false positive, the system incorrectly declares a successful 
match between, in our case, the fingerprint of an impostor and that of the 
legitimate cardholder, in other words, a thief manages to pass himself off as you 
and gains access to your accounts. 
· False negative: In the case of a false negative, on the other hand, the system fails to make a match between your fingerprint and your stored template, i.e., the system doesn't recognize you and therefore denies you access to your own account.
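The two error types can be related to the matcher's threshold with a small sketch: raising the threshold lowers false positives (impostor accepted) but raises false negatives (genuine user rejected). The score lists below are hypothetical.

import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    false_negative_rate = np.mean(genuine < threshold)    # legitimate user rejected
    false_positive_rate = np.mean(impostor >= threshold)  # impostor accepted
    return false_positive_rate, false_negative_rate

# fpr, fnr = error_rates([0.8, 0.9, 0.7], [0.2, 0.5, 0.4], threshold=0.6)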
Some errors might be avoided by using improved sensors. For 
instance, optical sensors capture fingerprint details better than capacitive fingerprint 
sensors and are as much as four times as accurate. Even more accurate than 
conventional optical sensors, the new multi-spectral sensor distinguishes structures in 
living skin according to the light-absorbing and -scattering properties of different 
layers. By illuminating the finger surface with light of different wavelengths, the 
sensor captures an especially detailed image of the fingerprint pattern just below the 
skin surface to do a better job of taking prints from dry, wet, or dirty fingers. 
· Cost: Cost remains a concern, but costs are declining for all of the major smart-card components, including flash memory, microprocessors, communications chips, and fingerprint sensors.
CONCLUSION: 
Biometric authentication systems based on available technology 
would be a major improvement over conventional authentication techniques. If widely 
implemented, such systems could put thousands of ID thieves out of business and spare 
countless individuals the nightmare of trying to get their good names and credit back. 
Though the technology to implement these systems already exists, ongoing research 
efforts aimed at improving the performance of biometric systems in general and sensors 
in particular will make them even more reliable, robust, and convenient. 
REFERENCES: 
· www.ieee.org 
· www.spectrum.ieee.org 
· www.howstuffworks.com 
· www.google.com 
· IEEE Journals 
CODE NO: EC 108 IS 7 
AUTOMATIC SPEAKER RECOGNITION SYSTEM 
BY
P. MEGHANA REDDY (00660050044775)        D. VEENA RAO (00660050051166)
ECE THIRD YEAR,                          ECE THIRD YEAR,
VASAVI COLLEGE OF ENGG.                  VASAVI COLLEGE OF ENGG.
IBRAHIMBAGH,                             IBRAHIMBAGH,
HYDERABAD.                               HYDERABAD.
MAIL ID: megha2828@gmail.com             MAIL ID: veenarao_sep@yahoo.co.uk
PH.NO: 9989194272                        PH NO: 9866160356
ABSTRACT 
Speaker recognition is the process of automatically recognizing who is 
speaking on the basis of individual information included in speech waves. This 
technique makes it possible to use the speaker’s voice to verify their identity and 
control access to services such as voice dialing, banking by telephone, telephone 
shopping, database access services, information services, voice mail, security control 
for confidential information areas, and remote access to computers. 
The goal of this work is to build a simple, yet complete and representative 
automatic speaker recognition system using MATLAB software. The system 
developed here is tested on a small (but already non-trivial) speech database. There 
are 8 male speakers, labeled from S1 to S8. All speakers uttered the same single 
digit "zero" once in a training session and once in a testing session later on. The 
vocabulary of digit is used very often in testing speaker recognition because of its 
applicability to many security applications. For example, users have to speak a PIN 
(Personal Identification Number) in order to gain access to the laboratory door, or 
users have to speak their credit card number over the telephone line. By checking 
the voice characteristics of the input utterance using an automatic speaker 
67
recognition system similar to the one that has been developed now, the system is 
able to add an extra level of security. 
INTRODUCTION 
Speaker recognition can be classified into identification and verification. Speaker 
identification is the process of determining which registered speaker provides a given 
utterance. Speaker verification, on the other hand, is the process of accepting or rejecting 
the identity claim of a speaker. Figure 1 shows the basic structures of speaker 
identification and verification systems. Speaker recognition methods can also be divided 
into text-independent and text-dependent methods. In a text-independent system, speaker 
models capture characteristics of somebody’s speech which show up irrespective of what 
one is saying. In a text-dependent system, on the other hand, the recognition of the 
speaker’s identity is based on his speaking one or more specific phrases, like passwords, 
card numbers, PIN codes, etc. When the task is to identify the person talking rather than 
what he is saying, the speech signal must be processed to extract measures of speaker 
variability instead of segmental features. There are two sources of variation among 
speakers: differences in vocal cords and vocal tract shape, and differences in speaking 
style. 
At the highest level, all speaker recognition systems contain two main modules 
(refer to Figure 1): feature extraction and feature matching. Feature extraction is the 
process that extracts a small amount of data from the voice signal that can later be used to 
represent each speaker. Feature matching involves the actual procedure to identify the 
unknown speaker by comparing extracted features from his voice input with the ones 
from a set of known speakers. We will discuss each module in detail in later sections. 
All speaker recognition systems operate in two distinct phases. The first is referred to as the enrollment session or training phase, while the second is referred to as the operation session or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made. 
Figure 1(a): Speaker identification 
Figure 1(b): Speaker verification 
Automatic speaker recognition is based on the premise that a person's speech exhibits characteristics that are unique to the speaker. The task is made difficult, however, by the high variability of input speech signals. The principal source of this variability is the speakers themselves: speech signals recorded in the training and testing sessions can differ greatly because people's voices change over time, with health conditions (e.g. when the speaker has a cold), with speaking rate, and so on. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology, such as acoustical noise and variations in the recording environment (e.g. the speaker uses a different telephone handset). 
Speech Feature Extraction 
The purpose of this module is to convert the speech waveform to some type of 
parametric representation (at a considerably lower information rate) for further analysis 
and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is said to be quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (of the order of 1/5 of a second or more) the signal characteristics change to reflect the different speech sounds being spoken. 
Therefore, short-time spectral analysis is the most common way to characterize the 
speech signal. 
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task. Mel-Frequency Cepstrum Coefficients (MFCC) are perhaps the best known and most popular, and they are used in this project. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing 
below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing 
MFCCs is described in more detail next. 
Mel-frequency cepstrum coefficients processor 
A block diagram of the structure of an MFCC processor is given in Figure 2. The 
speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the variations mentioned above than the speech waveforms themselves. 
Figure 2. Block diagram of the MFCC processor 
1. Frame Blocking 
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. 
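As a concrete illustration, the frame-blocking step can be sketched in MATLAB as follows. This is a minimal sketch, not the authors' original code; the function name and the assumption that the signal is at least N samples long are ours. 

    % Block a speech vector s into overlapping frames of N samples with shift M (M < N).
    function frames = frame_blocking(s, N, M)
        s = s(:);                                     % work with a column vector
        numFrames = 1 + floor((length(s) - N) / M);   % frames that fit completely
        frames = zeros(N, numFrames);                 % one frame per column
        for k = 1:numFrames
            startIdx = (k - 1) * M + 1;               % k-th frame starts M samples later
            frames(:, k) = s(startIdx : startIdx + N - 1);
        end
    end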
2. Windowing 
The next step in the processing is to window each individual frame so as to 
minimize the signal discontinuities at the beginning and end of each frame. The concept 
here is to minimize the spectral distortion by using the window to taper the signal to zero 
at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, then the result of windowing is the signal 

    y(n) = x(n) · w(n),   0 ≤ n ≤ N − 1. 

Typically the Hamming window is used, which has the form 

    w(n) = 0.54 − 0.46 · cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1. 
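A matching MATLAB sketch of the windowing step is given below, computing the Hamming window directly from the formula above so that no toolbox function is needed. Again this is a sketch under our own naming, not the original implementation. 

    % Apply a Hamming window to every frame (column) produced by frame_blocking.
    function windowed = apply_hamming(frames)
        N = size(frames, 1);
        n = (0:N - 1)';
        w = 0.54 - 0.46 * cos(2 * pi * n / (N - 1));   % Hamming window w(n)
        windowed = zeros(size(frames));
        for k = 1:size(frames, 2)
            windowed(:, k) = frames(:, k) .* w;        % y(n) = x(n) * w(n)
        end
    end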
3. Fast Fourier Transform (FFT) 
The next processing step is the Fast Fourier Transform, which converts each 
frame of N samples from the time domain into the frequency domain. The FFT is a 
fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows: 

    X_n = Σ_{k=0}^{N-1} x_k · exp(−j2πkn / N),   n = 0, 1, 2, ..., N − 1. 

In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 ≤ n ≤ N/2 − 1, while negative frequencies −Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1. Here, Fs denotes the sampling frequency. The result obtained after this step is often referred to as the signal's spectrum or periodogram. 
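In MATLAB this step is essentially one line, since fft applied to a matrix transforms each column (i.e., each windowed frame); the squared magnitude gives the periodogram used in the next step. Here windowed is assumed to be the matrix of windowed frames from the previous sketch. 

    % Power spectrum (periodogram) of every windowed frame, one column per frame.
    powSpec = abs(fft(windowed)) .^ 2;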
4. Mel-frequency Wrapping 
As mentioned above, psychophysical studies have shown that human perception 
of the frequency contents of sounds for speech signals does not follow a linear scale. 
Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is 
measured on a scale called the ‘mel’ scale. The mel-frequency scale is a linear frequency 
spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, 
the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 
1000 mels. Therefore we can use the following approximate formula to compute the mel value for a given frequency f in Hz: 

    mel(f) = 2595 · log10(1 + f / 700). 

One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component. Each filter has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. Note that this filter bank is applied 
in the frequency domain. 
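A possible MATLAB sketch of such a triangular, mel-spaced filter bank is given below. The function name, the 1-based bin mapping and the assumption of an even FFT length are our own choices, not a prescription from the paper. 

    % Build K triangular filters spaced evenly on the mel scale between 0 Hz and fs/2.
    % Returns a K x (Nfft/2 + 1) weight matrix (Nfft assumed even).
    function H = mel_filterbank(K, Nfft, fs)
        mel  = @(f) 2595 * log10(1 + f / 700);             % Hz -> mel (formula above)
        imel = @(m) 700 * (10 .^ (m / 2595) - 1);          % mel -> Hz
        hzPts = imel(linspace(0, mel(fs / 2), K + 2));     % K filters need K + 2 edges
        bins  = floor((Nfft + 1) * hzPts / fs) + 1;        % 1-based FFT bin of each edge
        H = zeros(K, Nfft / 2 + 1);
        for k = 1:K
            lo = bins(k); ce = bins(k + 1); hi = bins(k + 2);
            for b = lo:ce                                  % rising edge of the triangle
                if ce > lo, H(k, b) = (b - lo) / (ce - lo); end
            end
            for b = ce:hi                                  % falling edge of the triangle
                if hi > ce, H(k, b) = (hi - b) / (hi - ce); end
            end
        end
    end

The mel spectrum of a frame is then simply H multiplied by the first Nfft/2 + 1 values of that frame's power spectrum. 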
5. Cepstrum: 
In this final step, the log mel spectrum is converted back to time. The result is called the mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. 
Because the mel spectrum coefficients (and so their logarithm) are real numbers, they can be converted back to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients resulting from the last step as S̃_k, k = 1, 2, ..., K, we can calculate the MFCCs as 

    c_n = Σ_{k=1}^{K} (log S̃_k) · cos[ n (k − 1/2) π / K ],   n = 1, 2, ..., K. 

Note that we exclude the first component, c_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information. 
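The cepstrum step, as quoted above, reduces in MATLAB to applying that cosine sum to the log mel spectrum of every frame. A sketch follows, using our own function name and adding a small eps to avoid taking the logarithm of zero. 

    % Convert the mel power spectrum (K x numFrames) to MFCCs, dropping c0.
    function c = mel_cepstrum(melSpec)
        [K, numFrames] = size(melSpec);
        logS = log(melSpec + eps);                   % log mel spectrum (eps avoids log(0))
        c = zeros(K - 1, numFrames);
        for n = 1:K - 1                              % c0 (the mean term) is excluded
            basis = cos(n * ((1:K)' - 0.5) * pi / K);
            c(n, :) = sum(logS .* repmat(basis, 1, numFrames), 1);
        end
    end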
Summary 
By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is transformed into a sequence of acoustic vectors. In the next section we will see how these acoustic vectors can be used to represent and recognize the voice characteristics of the speaker. 
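Putting the pieces together, a single utterance is turned into a sequence of acoustic vectors roughly as follows. This is only a sketch: the file name and the frame, FFT and filter-bank parameters are assumed values, and the helper functions are the ones sketched in the previous subsections. 

    [s, fs] = audioread('train_s1.wav');             % hypothetical file name
    N = 256; M = 100; Nfft = N; K = 20;              % assumed frame / filter-bank settings
    frames   = frame_blocking(s, N, M);
    windowed = apply_hamming(frames);
    powSpec  = abs(fft(windowed, Nfft)) .^ 2;        % periodogram of every frame
    melSpec  = mel_filterbank(K, Nfft, fs) * powSpec(1:Nfft/2 + 1, :);
    mfcc     = mel_cepstrum(melSpec);                % each column is one acoustic vector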
FEATURE MATCHING 
The problem of speaker recognition belongs to a much broader topic in science and engineering known as pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case they are the sequences of acoustic vectors extracted from the input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching. Furthermore, if there exists a set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. 
This is exactly our case since during the training session, we label each input 
speech with the ID of the speaker (S1 to S8). These patterns comprise the training set and 
are used to derive a classification algorithm. The remaining patterns are then used to test 
the classification algorithm; these patterns are collectively referred to as the test set. 
There are many feature matching techniques used in speaker recognition. In this project the Vector Quantization (VQ) approach is used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a 
finite number of regions in that space. Each region is called a cluster and can be 
represented by its center called a codeword. The collection of all codewords is called a 
codebook. 
Figure 3 shows a conceptual diagram to illustrate this recognition process. In the 
figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his training acoustic vectors. The resulting codewords (centroids) are shown in Figure 3 by black circles and black triangles for speaker 1 and speaker 2, respectively. 
The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. 
In the recognition phase, an input utterance of an unknown voice is “vector-quantized” 
using each trained codebook and the total VQ distortion is computed. The 
speaker corresponding to the VQ codebook with the smallest total distortion is identified. 
Figure 3: Conceptual diagram illustrating vector quantization codebook 
formation. 
One speaker can be discriminated from another based on the locations of the centroids. 
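The recognition rule described above can be sketched in MATLAB as below. The function name and the use of squared Euclidean distance are our assumptions, and each codebook is stored as a matrix with one codeword per column. 

    % Pick the speaker whose codebook gives the smallest total VQ distortion.
    function best = identify_speaker(testVecs, codebooks)   % codebooks: cell array
        best = 1; bestDist = inf;
        for sp = 1:numel(codebooks)
            cb = codebooks{sp};                              % dim x M codewords
            total = 0;
            for t = 1:size(testVecs, 2)
                d = sum((cb - repmat(testVecs(:, t), 1, size(cb, 2))) .^ 2, 1);
                total = total + min(d);                      % distance to closest codeword
            end
            if total < bestDist
                bestDist = total; best = sp;
            end
        end
    end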
Clustering the Training Vectors 
After the enrollment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray], for clustering a set of L training vectors into a set of M codebook vectors. 
The algorithm is formally implemented by the following recursive procedure: 
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors 
(hence, no iteration is required here). 
2. Double the size of the codebook by splitting each current codeword y_n according to the rule 

       y_n⁺ = y_n (1 + ε) 
       y_n⁻ = y_n (1 − ε) 

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01). 
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current 
codebook that is closest (in terms of similarity measurement), and assign that vector to 
the corresponding cell (associated with the closest codeword). 
4. Centroid Update: update the codeword in each cell using the centroid of the training 
vectors assigned to that cell. 
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold. 
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed. 
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts 
first by designing a 1-vector codebook, then uses a splitting technique on the codewords 
to initialize the search for a 2-vector codebook, and continues the splitting process until 
the desired M-vector codebook is obtained. 
Figure 4 shows, in a flow diagram, the detailed steps of the LBG algorithm. 
“Cluster vectors” is the nearest-neighbor search procedure which assigns each training 
vector to a cluster associated with the closest codeword. “Find centroids” is the centroid 
update procedure. “Compute D (distortion)” sums the distances of all training vectors in 
the nearest-neighbor search so as to determine whether the procedure has converged. 
Figure 4. Flow diagram of the LBG algorithm 
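For reference, the LBG procedure described above can be sketched in MATLAB as follows. This is a minimal sketch under our own naming and convergence test, assuming the desired codebook size M is a power of two (so that repeated doubling reaches it exactly). 

    % Grow a codebook from 1 to M codewords by splitting (epsilon = 0.01) and refinement.
    function codebook = lbg(trainVecs, M)            % trainVecs: dim x L acoustic vectors
        epsSplit = 0.01;
        codebook = mean(trainVecs, 2);               % step 1: 1-vector codebook (centroid)
        while size(codebook, 2) < M
            codebook = [codebook * (1 + epsSplit), codebook * (1 - epsSplit)];  % step 2: split
            prevD = inf;
            while true
                L = size(trainVecs, 2); C = size(codebook, 2);
                idx = zeros(1, L); dmin = zeros(1, L);
                for t = 1:L                           % step 3: nearest-neighbour search
                    d = sum((codebook - repmat(trainVecs(:, t), 1, C)) .^ 2, 1);
                    [dmin(t), idx(t)] = min(d);
                end
                for c = 1:C                           % step 4: centroid update
                    members = trainVecs(:, idx == c);
                    if ~isempty(members), codebook(:, c) = mean(members, 2); end
                end
                D = mean(dmin);                       % average distortion
                if abs(prevD - D) <= 1e-3 * max(D, eps), break; end   % step 5: converged
                prevD = D;
            end
        end
    end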
IMPLEMENTATION 
All the steps outlined in the previous sections are implemented in MATLAB (The MathWorks, Inc.), and the system developed here is tested on a small speech database. There are 8 male speakers, labeled S1 to S8. All speakers uttered the same single digit "zero" once in a training session and once in a testing session later on. Figures 5 to 15 show the results of all the steps in the speaker recognition task. First, the MFCCs for one speaker are computed; this is illustrated in Figures 5 to 11. In Figure 5, the input speech signal of one of the speakers is plotted against time. The raw data in the time domain contain a very large amount of data and are difficult to analyze for voice characteristics, which is the motivation for the speech feature extraction step. 
Next the speech signal (a vector) is cut into overlapping frames. The output is a matrix in which each column is a frame of N samples from the original speech signal, as displayed in Figure 6. The signal is then windowed by means of a Hamming window; the result is a similar matrix, except that each frame (column) has been windowed, as shown in Figure 7. The FFT is applied to transform each frame into the frequency domain, and the output is displayed in Figure 8. Applying these two steps, windowing and FFT, is referred to as the Windowed Fourier Transform (WFT) or Short-Time Fourier Transform (STFT), and the result is often called the spectrum or periodogram. The last step in the speech processing is converting the spectrum into mel-frequency cepstrum coefficients. This is accomplished by generating a mel-frequency filter bank with the characteristics shown in Figure 9 and multiplying it in the frequency domain with the FFT output of the previous step, yielding the mel spectrum shown in Figure 10. Finally the mel-frequency cepstrum coefficients (MFCC) are generated by taking the discrete cosine transform of the logarithm of the mel spectrum; the MFCCs are shown in Figure 11. 
A similar procedure is followed for all the remaining speakers, and the MFCCs for all speakers are computed. To inspect the acoustic space (the MFCC vectors), any two dimensions (say the 5th and the 6th) are picked and the data points are plotted in a 2D plane, as shown in Figure 12. The LBG algorithm is then applied to the set of MFCC vectors obtained in the previous stage, and the intermediate stages are shown in Figures 13, 14 and 15. 
Finally the system is trained for all the speakers and a speaker-specific codebook is generated for each. After this training step, the system has knowledge of the voice characteristics of each (known) speaker. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified. 
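As a final sketch, the training and recognition loop can be written as below, assuming a hypothetical wrapper extract_mfcc that chains the feature-extraction steps shown earlier, hypothetical file names, and the lbg and identify_speaker sketches above. 

    numSpeakers = 8; codebookSize = 16;              % speakers S1..S8, 16-vector codebooks
    codebooks = cell(1, numSpeakers);
    for sp = 1:numSpeakers
        % extract_mfcc: hypothetical wrapper around frame_blocking .. mel_cepstrum
        trainVecs = extract_mfcc(sprintf('train_s%d.wav', sp));
        codebooks{sp} = lbg(trainVecs, codebookSize);
    end
    testVecs  = extract_mfcc('test.wav');            % unknown utterance (hypothetical file)
    speakerID = identify_speaker(testVecs, codebooks)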
RESULTS 
Figure 5: An Input Speech Signal 
Figure 6: After Frame Blocking 
Figure 7: After Windowing 
Figure 8: After the short-time Fourier transform 
Figure 9: A Mel Spaced Filter Bank 
Figure 10: After mel frequency wrapping 
Figure 11: Mel Frequency Cepstrum Coefficients 
Figure 12: Training Vectors as points in a 2D-space 
Figure 13: The centroid of the entire set. 
Figure 14: The centroid is split into two using the LBG algorithm. 
Figure 15: Finally a 16-vector codebook is generated using the LBG algorithm. 
CONCLUSIONS & DISCUSSIONS 
As the codebook size is increased, the recognition performance improves, but with further increases the improvement is smaller than expected; that is, the rate of improvement in performance diminishes as the codebook size grows. 
The most distinctive feature of the proposed speaker-based VQ model is its multiple 
representation or partitioning of a speaker's spectral space. The VQ speaker model, while 
allowing some amount of overlap between different speakers' codebooks, is quite capable of discriminating impostors from a true speaker because of this distinctive feature. 
MFCCs allow better suppression of insignificant spectral variation in the higher frequency bands. Another obvious advantage is that mel-frequency cepstrum coefficients form a particularly compact representation. 
It is useful to examine the lack of commercial success for Automatic Speaker 
Recognition compared to that for speech recognition. Both speech and speaker 
recognition analyze speech signals to extract spectral parameters such as cepstral 
coefficients. Both often employ similar template matching methods, the same distance 
measures, and similar decision procedures. Speech and speaker recognition, however, 
have different objectives: selecting which of M words was spoken vs. which of N 
speakers spoke. Speech analysis techniques have primarily been developed for phonemic 
analysis, e.g., to preserve phonemic content during speech coding or to aid phoneme 
identification in speech recognition. Our understanding of how listeners exploit spectral 
cues to identify human sounds exceeds our knowledge of how we distinguish speakers. 
For text-dependent Automatic Speaker Recognition, using template-matching methods 
borrowed directly from speech recognition yields good results in limited tests, but 
performance decreases under adverse conditions that might be found in practical 
applications. For example, telephone distortions, uncooperative speakers, and speaker 
variability over time often lead to accuracy levels unacceptable for many applications. 
REFERENCES 
[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993. 
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978. 
[3] S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980. 
[4] F.K. Soong, A.E. Rosenberg and B.H. Juang, "A vector quantization approach to speaker recognition", AT&T Technical Journal, March 1987. 
[5] Douglas O'Shaughnessy, "Speaker Recognition", IEEE Acoustic, Speech, Signal 
Processing Magazine, October 1986. 
[6] S. Furui, "A Training Procedure for Isolated Word Recognition Systems", IEEE 
Transactions on Acoustic, Speech, Signal Processing, April 1980. 
  • 3.
     Fully automaticallyrecognize expressions.  Handle a full range of head motions.  Recognize expressions in face images with relatively lower resolution.  Recognize expressions in lower intensity.  Perform in real-time. Figure 1: A face at different resolutions. All images are enlarged to the same size Figure 1 shows a face at different resolutions. Most automated face processing tasks should be possible for a 69x93 pixel image. At 48x64 pixels the facial features such as the corners of the eyes and the mouth become hard to detect. The facial expressions may be recognized at 48x64 and are not recognized at 24x32 pixels. This paper describes a real-time system to automatically recognize facial expressions in relatively low-resolution face images (50x70 to 75x100 pixels). To handle the full range of head motion, instead of detecting the face, the head pose is estimated based on the detected head. For frontal and near frontal views of the face, the location and shape features are computed for expression recognition. This system successfully deals with complex real world interactions. We present the overall architecture of the system and its components: background subtraction, head detection and head pose estimation respectively. The method for facial feature extraction and tracking is also explained clearly. The method for recognizing expressions is reported and at we summarized our paper and presented future directions in the last part of our paper. 5. SYSTEM ARCHITECTURE: In this paper we describe a new facial expression analysis system designed to automatically recognize facial expressions in real-time and real environments, using relatively low-resolution face images. Figure 2 shows the structure of the tracking system. The input video sequence is used to estimate a background model, which is then used to perform background subtraction, as described in Section 3. In Section 4, the resulting foreground regions are used to detect the head. After finding the head, head pose estimation is performed to find the head in frontal or near-frontal views. The facial features are extracted only for those faces in which both eyes and mouth corners are visible. The normalized facial features are input to a neural network based expression classifier to recognize different expressions. 3
  • 4.
    Figure 2. Blockdiagram of the expression Recognition system 6. BACKGROUND ESTIMATION AND SUBTRACTION: The background subtraction approach presented is an attempt to make the background subtraction robust to illumination changes. The background is modeled statistically at each pixel. The estimation process computes the brightness distortion and color distortion in RGB color space. It also proposes an active background estimation method that can deal with moving objects in the frame. First, we calculate image difference over three frames to detect the moving objects. Then the statistical background model is constructed, excluding these moving object regions. By comparing the difference between the background image and the current image, a given pixel is classified into one of four categories: original background, shaded background or 4
  • 5.
    shadow, highlighted background,and foreground objects. Finally, a morphology step is applied to remove small isolated spots and fill holes in the foreground image. 7.HEAD DETECTION: In order to handle the full range of head motion, we detect the head instead of detecting the face. The head detection uses the smoothed silhouette of the foreground object as segmented using background subtraction. Based on human intuition about the parts of an object, a segmentation into parts generally occurs at the negative curvature minima (NCM) points of the silhouette as shown with small circles in Figure 3. The boundaries between parts are called cuts (shown as the line L in Figure 3(a). some researchers noted that human vision prefers the partitioning scheme which uses the shortest cuts. They proposed a shortcut rule which requires a cut: 1) be a straight line, 2) cross an axis of local symmetry, 3) join two points on the outline of a silhouette and at least one of the two points is NCM, 4) be the shortest one if there are several possible competing cuts. In this system, the following steps are used to calculate the cut of the head:  Calculate the contour centroid C and the vertically symmetry axis y of the silhouette.  Compute the cuts for the NCMs, which are located above the contour centroid C.  Measure the salience of a part’s protrusion, which is defined as the ratio of the perimeter of the Part (excluding the cut) to the length of the cut.  Test if the salience of a part exceeds a low threshold.  Test if the cut crosses the vertical symmetry axis y of the silhouette.  Select the top one as the cut of the head if there are several possible competing cuts. After the cut of the head L is detected, the head region can be easily determined as the part above the cut. As shown in Figure 3(b), in most situations, only part of the head lies above the cut. To obtain the correct head region, we first calculate the head width W, then the head height H is enlarged to * W from the top of the head. In our system, = 1:4. 5
  • 6.
    Figure 3. Headdetection steps. (a) Calculate The cut of the head part. (b) Obtain the correct Head region from the cut of the head part. 6
  • 7.
    8. HEAD POSEDETECTION: After the head is located, the head image is converted to gray-scale, histogram equalized and resized to the estimated resolution. Then we employ a three layer neural network (NN) to estimate the head pose. The inputs to the network are the processed head image. The outputs are the head poses. Here only 3 head pose classes are trained for expression analysis: 1) frontal or near frontal view, 2) side view or profile, 3) others such as back of the head or occluded face. Figure 4. The definitions and examples of the 3 head pose classes: 1) frontal or near frontal View, 2) side view or profile, 3) others such as Back of the head or occluded faces. Figure 4 shows the definitions and some examples of the 3 head pose classes. In the frontal or near frontal view, both eyes and lip corners are visible. In side view or profile, at least one eye or one corner of the mouth becomes self occluded because of the head turn. All other reasons cause more facial features to not be detected such as the back of the head, occluded face, and face with extreme tilt angles is treated as one class. 9.FACIAL FEATURE EXTRACTION FOR FRONTAL OR NEAR-FRONTAL FACE: After estimating the head pose, the facial features are extracted only for the face in the frontal or near-frontal view. Since the face images are in relatively low resolution in most real environments, the detailed facial features such as the corners of the eyes and 7
  • 8.
    the upper orlower eyelids are not available To recognize facial expressions, however, we need to detect reliable facial features. We observe that most facial feature changes that are caused by an expression are in the areas of eyes, brows and mouth. In this paper, two types of facial features in these areas are extracted: location features and shape features. 9.1 LOCATION FEATURE EXTRACTION In this system, six location features are extracted for expression analysis. They are eye centers (2), eyebrow inner endpoints (2), and corners of the mouth (2). Eye centers and eyebrow inner endpoints: To find the eye centers and eyebrow inner endpoints inside the detected frontal or near frontal face, we have developed an algorithm that searches for two pairs of dark regions which correspond to the eyes and the brows by using certain geometric constraints such as position inside the face, size and symmetry to the facial symmetry axis. The algorithm employs an iterative threshold method to find these dark regions under different or changing lighting conditions. Figure 5. Iterative thresholding of the face to Find eyes and brows. (a) Gray-scale face image, (b) Threshold = 30, (c) threshold = 42, (d) Threshold = 54. Figure 5 shows the iterative thresholding method to find eyes and brows. Generally, after five iterations, all the eyes and brows are found. If satisfactory results are not found after 20 iterations, we think the eyes or the brows are occluded or the face is not in a near frontal view. Unlike to find one pair of dark regions for the eyes only, we find two pairs of parallel dark regions for both the eyes and eyebrows. By doing this, not only are more features obtained, but also the accuracy of the extracted features is improved. Then the eye centers and eyebrow inner endpoints can be easily determined. If the face image is continually in the frontal or near frontal view in an image sequence, the eyes and brows can be tracked by simply searching for the dark pixels around their positions in the last frame. Mouth corners: After finding the positions of the eyes, the location of the mouth is first predicted. Then the vertical position of the line between the lips is found using an integral projection of the mouth region Finally, the horizontal borders of the line between the lips is found using an integral projection over an edge-image of the mouth. The following 8
  • 9.
    steps are useto track the corners of the mouth: 1) Find two points on the line between the lips near the previous positions of the corners in the image 2) Search along the darkest path to the left and right, until the corners are found. Finding the points on the line between the lips can be done by searching for the darkest pixels in search windows near the previous mouth corner positions. Because there is a strong change from dark to bright at the location of the corners, the corners can be found by looking for the maximum contrast along the search path 9.2 LOCATION FEATURE REPRESENTATION: After extracting the location features, the face can be normalized to a canonical face size based on two of these features, i.e., the eye-separation after the line connecting two eyes (eye-line) is rotated to horizontal. In our system, all faces are normalized to 90 x 90 pixels by re-sampling. We transform the extracted features into a set of parameters for expression recognition. We represent the face location features by 5 parameters, which are shown in Figure 6. Figure 6. Face location feature representation For expression recognition. These parameters are the distances between the eye-line and the corners of the mouth, the distances between the eye-line and the inner eyebrows, and the width of the mouth (the distance between two corners of the mouth). 9
  • 10.
    9.3 SHAPE FEATUREEXTRACTION: Another type of distinguishing feature is the shape of the mouth. Global shape features are not adequate to describe the shape of the mouth. Therefore, in order to extract the mouth shape features, an edge detector is applied to the normalized face to get an edge map. This edge map is divided into 3 x 3 zones as shown in Figure7 (b). The size of the zones is selected to be half of the distance between the eyes. The mouth shape features are computed from zonal shape histograms of the edges in the mouth region. 10
  • 11.
    Figure 7. Zonal-histogramfeatures. (a) Normalized face, (b) Zones of the edge map of the normalized face, (c) Four quantization levels for calculating histogram features, (d) Histogram corresponding to the middle zone of the mouth. 11
  • 12.
    10. EXPRESSION RECOGNITION: This system has neural network-based recognizer having the structure as shown in Figure 8. The standard back-propagation in the form of a three-layer neural network with one hidden layer was used to recognize facial expressions. The inputs to the network were the 5 location features (Figure 6) and the 12 zone components of shape features of the mouth (Figure7). Hence, a total of 17 features were used to represent the amount of expression in a face image. The outputs were a set of expressions. In this system, 5 expressions were recognized. They were neutral, smile, angry, surprise, and others (including fear, sad, and disgust). Researchers tested various numbers of hidden units and found that 6 hidden units gave the best performance. 12
  • 13.
    Figure 8. Neuralnetwork-based recognizer for Facial expressions. 11. SUMMARY AND CONCLUSIONS: Automatically recognizing facial expressions is important to understand human emotion and paralinguistic communication so as to design multi modal user interfaces, and for related applications such as human identification. Incorporating emotive information in computer-human interfaces will allow for much more natural and efficient interaction paradigms to be established. It is very challenging to develop a system that can perform in real time and in real world because of low image resolution, low expression intensity, and the full range of head motion and the system that we have reported is an automatic expression recognition system that addresses all the above 13
  • 14.
    challenges and successfullydeals with complex real world interactions. In most real word interactions, the facial feature changes are caused by both talking and expression changes. We feel that further efforts will be required for distinguishing talking and expression changes by fusing audio signal processing and visual image analysis. Also it will benefit the expression recognition accuracy by using the sequential information instead of using each frame. REFERENCES: 1. T. Kanade, J.F. Cohn, and Y.L. Tian. Comprehensive database for facial expression analysis. In Proceedings of International Conference on Face and Gesture Recognition, pages 46–53, March 2000. 2. Z. Zhang. Feature-based facial expression recognition: Sensitivity analysis and experiments with a multi-layer perceptron. International Journal of Pattern Recognition and Artificial Intelligence, 13(6):893–911, 1999. 3. Y. Moses, D. Reynard, and A. Blake. Determiningfacial expressions in real time. In Proc. Of Int. Conf.On Automatic Face and Gesture Recognition, pages 332– 337, 1995. 4. B. Fasel and J. Luttin. Recognition of asymmetric facial action unit activities and intensities. In Proceedings of International Conference of Pattern Recognition, 2000. CODE NO:EC 79 IS 2 Advanced Video Coding : MPEG-4/H.264 and Beyond Bhavana, K.B.Jyothsna III/IV E.C.E. Padmasri Dr. B.V.Raju Institute of Technology reshaboinabhavana@yahoo.co.in, _dch_jyo@yahoo.com Advanced Video Coding : MPEG-4/H.264 and Beyond Bhavana, K.B.Jyothsna III/IV E.C.E. Padmasri Dr. B.V.Raju Institute of Technology reshaboinabhavana@yahoo.co.in, dch_jyo@yahoo.com Abstract : With the high demand for digital video products in popular applications such as video communications, security and surveillance, industrial automation and 14
  • 15.
    entertainment, video compressionis an essential enabler for video applications design. The video coding standards are being under development for various applications. The purposes include better picture quality, higher coding efficiency and more error robustness. The new international video coding standard MPEG-4 part -10/H.264/AVC achieves significant improvements in coding efficiency and error robustness in comparison with the previous standards such as MPEG-2, MPEG-4 Visual. This paper provides an overview of H.264 and surveys the other current in use video coding methods. Introduction: Video coding deals with the compression of digital video data. Digital video is basically a three-dimensional array of color pixels. Two dimensions serve as spatial (horizontal and vertical) directions of the moving pictures, and one dimension represents the time domain. The video data contains fair amount of spatial and temporal redundancy. Similarities can thus be encoded by merely registering differences within a frame (spatial) and/or between frames (temporal). Video coding typically reduces this redundancy by using lossy compression. Usually this is achieved by image compression techniques to reduce spatial redundancy from frames (this is known as intraframe compression or spatial compression) and motion compensation, and other techniques to reduce temporal redundancy (known as interframe compression or temporal compression). Video coding for telecommunication applications has evolved through the development of the ITU-T H.261, H.262 (MPEG-2), H.263 (MPEG-4 Part-2, Visual) video coding standards (and later enhancements of known as H.263+ and H.263++) and now the H.264 (MPEG-4 Part-10). MPEG-4 Part-10 or H.264, is a high compression digital video codec standard written by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Pictures Experts Group (MPEG) as the product of a collective partnership effort known as the Joint Video Team (JVT). The ISO/IEC MPEG-4 Part-10 standards and the ITU-T H.264 standard are technically identical and the technology is also known as AVC (Advanced Video Coding). The main objective behind the H.264 project is to develop a high-performance video coding standard by adopting a “back-to-basics” approach with simple and straightforward design using well known building blocks. The intent of H.264/AVC project is to create a standard that would be capable of providing superb video quality at bitrate that is substantially lower (e.g., half or less) than what previous standards would need (e.g., relative to H.262, H.263) and to do so without so much of an increase in complexity as to make the design impractical (i.e. excessively expensive) to implement. Another ambitious goal is to do this in a flexible way that would allow the standard to be applied to a very wide variety of applications (e.g., for both low and high bitrate, and low and high resolution video) and to work well on a very wide variety of networks and systems (e.g., for RTP/IP packet networks, and ITU-T multimedia telephony systems). 15
  • 16.
    Overview : TheAdvanced Video Coding / H.264 The new standard is designed for technical solutions including at least the following application areas * Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc. * Interactive or serial storage on optical and magnetic devices, high definition DVD, etc. * Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc. or mixtures of these. * Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc. * Multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc. Fig.1 H.264/AVC Conceptual layers. For efficient transmission in different environments not only coding efficiency is relevant, but also the seamless and easy integration of the coded video into all current and future protocol and network architectures. This includes the public Internet with best effort delivery, as well as wireless networks expected to be a major application for the new video coding standard. The adaptation of the coded video representation or bitstream to different transport networks was typically defined in the systems specification in previous MPEG standards or separate standards like H.320 or H.324. However, only the close integration of network adaptation and video coding can bring the best possible performance of a video communication system. Therefore H.264/AVC consists of two conceptual layers (Figure1). The VCL (Video Coding Layer), which is designed to efficiently represent the video content, and a NAL (Network Abstraction Layer), which formats the VCL representation of the video and provides header information in a manner appropriate for conveyance by a variety of transport layers or storage media. H.264 Technical Description : The main objective of the emerging H.264 standard is to provide a means to achieve substantially higher video quality compared to what could be achieved by using anyone of the existing video coding standards. Nonetheless, the underlying approach of H.264 is similar to that adopted in previous standards such as MPEG-2 and MPEG-4 part-2 and consists of the following four main stages: a. Dividing each video frame into blocks of pixels so that processing of the video frame can be conducted at a block level. 16
  • 17.
    b. Exploiting thespatial redundancies that exist within the video frame by coding some of the original blocks through spatial prediction, transform, quantization and entropy coding. c. Exploiting the temporal dependencies that exist between blocks in successive frames, so that only changes between successive frames need to be encoded. This is accomplished by using motion estimation and compensation. For any given block, a search is performed in the previously coded one or more frames to determine the motion vectors that are then used by the encoder and decoder to predict the subject block. d. Exploiting any remaining spatial redundancies that exist within the video frame by coding the residual blocks, i.e., the difference between the original blocks and the corresponding predicted blocks, again through transform, quantization and entropy coding. On the motion estimation/compensation side, H.264 employs blocks of different sizes and shape, higher resolution ¼-pel motion estimation, multiple reference frame selection and complex multiple bidirectional mode selection. On the transform side H.264 uses an integer based transform that approximates roughly the discrete cosine transform (DCT) used in the MPEG-2, but does not have the mismatch problem in the inverse transform. Entropy coding can be performed using either a combination of Universal Variable Length Codes (UVLC) table with a Context Adaptive Variable Length Codes (CAVLC) for the transform 17
  • 18.
    Fig.2 Block Diagramof the H.264 Encoder. Coefficients or using Context-based Adaptive Binary Arithmetic Coding. Organization of Bitstream: The input image is divided into macroblocks. Each macroblock consists of the three components Y, Cr and Cb. Y is the luminance component which represents the brightness information. Cr and Cb represent the color information. Due to the fact that the human eye system is less sensitive to the chrominance than to the luminance the chrominance signals are both subsampled by a factor of 2 in horizontal and vertical direction. Therefore, a macroblock consists of one block of 16 by 16 picture elements for the luminance component and of two blocks of 8 by 8 picture elements for the color components. These macroblocks are coded in Intra or Inter mode. In Inter mode, a macroblock is predicted using motion compensation. For motion compensated prediction a displacement vector is estimated and transmitted for each block (motion data) that refers to the corresponding position of its image signal in an already transmitted reference image stored in memory. In Intra mode, former standards set the prediction signal to zero such that the image can be coded without reference to previously sent information. This is important to provide for error resilience and for entry points into the bit streams enabling random access. The prediction error, which is the difference between the original and the predicted block, is transformed, quantized and entropy coded. In 18
  • 19.
    order to reconstructthe same image on the decoder side, the quantized coefficients are inverse transformed and added to the prediction signal. The result is the reconstructed macroblock that is also available at the decoder side. This macroblock is stored in a memory. Macroblocks are typically stored in raster scan order. H.264/AVC introduces the following changes: 1. In order to reduce the block-artifacts an adaptive deblocking filter is used in the prediction loop. The deblocked macroblock is stored in the memory and can be used to predict future macroblocks. 2. Whereas the memory contains one video frame in previous standards, H.264/AVC allows storing multiple video frames in the memory. 3. In H.264/AVC a prediction scheme is used also in Intra mode that uses the image signal of already transmitted macro blocks of the same image in order to predict the block to code. 4. The Direct Cosine Transform (DCT) used in former standards is replaced by an integer transform. In H.264/AVC, the macroblocks are processed in so called slices whereas a slice is usually a group of macroblocks which is valuable for resynchronization should some data be lost. Fig. 3 Division of image into several slices A H.264 video stream is organized in discrete packets, called “NAL units”. Each of these packets can contain a part of a slice, i.e., there may be one or more NAL units per slice. The slices, in turn, contain a part of a video frame. The decoder may resynchronize after each NAL unit instead of skipping a whole frame if a single error occurs. H.264 also supports optional interlaced encoding. In this encoding mode, a frame is split into two fields. Fields may be encoded using spatial or temporal interleaving. To encode color images, H.264 uses the YCbCr color space like its predecessors, separating the image into luminance (or “luma”, brightness) and chrominance (or “chroma”, color) planes. It is , however , fixed at 4:2:0 subsampling, i.e., the chroma channels each have half the resolution of the luma channel Five different slice types are supported which are I, P, B, SI and SP. ‘I’ slices or “Intra” slices describe a full still image, containing only references to itself. The first frame of a sequence always needs to be built out of I slices. ‘P’ slices or “Predicted” slices use one or more recently decoded slices as a reference or prediction for picture constructed using motion compensated prediction. The prediction is usually not exactly the same as the actual picture content, so a “residual” may be added. ‘B’ slices or “Bi-directional Predicted” slices work like P slices with the exception that former and future I or P slices (in playback order) may be used as reference pictures. For this to work, B slices may be decoded after the following I or P slices. 19
  • 20.
    ‘SI’ or ‘SP’slices or “Switching” slices may be used for efficient transitions between two different H.264 video streams. The tools that make H.264 such a successful video coding scheme are discussed below. Intra Prediction and Coding : Intra prediction means that the samples of a macroblock are predicted by using only information of already transmitted macroblocks of the same image thereby exploiting only spatial redundancies with in a video picture .The resulting frame is referred to as an I-picture .I-pictures are typically encoded by directly applying the transform to different macroblocks in the frame .in order to increase the efficiency of the intra coding process in H.264, spatial correlation between adjacent macrablocks in a given frame is exploited .The idea is based on the observation that adjacent macroblocks tend to have similar properties. The difference between the actual macroblock and its prediction is then coded, which results in fewer bits to represent the macroblocks of interest compared to when applying the transform directly to the macroblock itself. In H.264/AVC, two different types of intra prediction are possible for the prediction of the luminance component Y. The first type is called INTRA_4×4 and the second one INTRA_16×16. Using the INTRA_4×4 type, the macroblock, which is of the size 16 by 16 picture elements (16×16), is divided into sixteen 4×4 sub-blocks and a prediction for each 4×4 sub-block of the luminance signal is applied individually. For the prediction purpose, nine different prediction modes are supported. One mode is DC prediction mode, whereas all samples of the current 4×4 Sub-block are predicted by the mean of all samples neighboring to the left and to the top of the current block and which have been already reconstructed at the encoder and at the decoder side (see Figure4b). In addition to DC-prediction mode(mode 2), eight prediction modes labeled 0,1,3,4,5,6,7and 8, each for a specific prediction direction are supported as shown in fig.4c. (a) (b) Fig. 4 Intra prediction modes for 4x4 luminance blocks Pixels A to M from neighboring blocks have already been encoded and may be used for prediction .For example, if Mode 0 (vertical prediction) is selected, then the values of the pixels a to p are assigned as follows: 20
  • 21.
    · a ,e, i and m are equal to A · b , f, j and n are equal to B · c , g , k and o are equal to C · d , h ,l and p are equal to D For regions with less spatial detail (i.e., flat regions), H.264 supports 16x16 intra coding , in which one of the four prediction modes (DC, Vertical, Horizontal and Planar ) is chosen for the prediction of the entire luminance component of the macroblock. In addition, H.264 supports intra prediction for 8x8 chrominance blocks also using four prediction modes (DC, Vertical, Horizontal and Planar). Finally, the prediction mode for each block is efficiently coded by assigning shorter symbols to more likely modes, where the probability of each mode is determined based on the modes used for coding the surrounding blocks. Inter prediction and coding: Inter prediction and coding is based on using motion estimation and compensation to take advantage of the temporal redundancies that exist between successive frames, hence, providing very efficient coding of video sequences. When a selected reference frame for motion estimation is a previously encoded frame, the frame to be encoded is referred to as a P-picture. When both a previously encoded frame and a future frame are chosen as reference frames , then the frame to be encoded is referred to as a B-picture. The inclusion of a new inter-stream transitional picture called an SP-picture in a bit stream enables efficient switching between bit streams with similar content encoded at different bit rates , as well as random access and fast playback modes. Motion compensation prediction for different block sizes : In H.264/AVC it is possible to refer to several preceding images. For this purpose, an additional picture reference parameter has to be transmitted together with the motion vector. This technique is denoted as motion-compensated prediction with multiple reference frames. In the classical concept, B-pictures are pictures that are encoded using both past and future pictures as references. The prediction is obtained by a linear combination of forward and backward prediction signals. In former standards, this linear combination is just an averaging of the two prediction signals whereas H.264/AVC allows arbitrary weights. In this generalized concept, the linear combination of prediction signals is also made regardless of the temporal direction. For example, a linear combination of two forward-prediction signals may be used (see Figure 5). Furthermore, using H.264/AVC it is possible to use images containing B-slices as reference images for further predictions which were not possible in any former standard. 21
    Fig. 5 Motion-compensatedprediction with multiple reference images In case of motion compensated prediction macroblocks are predicted from the image signal of already transmitted reference images. For this purpose, each macroblock can be divided into smaller partitions. Partitions with luminance block sizes of 16×16, 16×8, 8×16, and 8×8 samples are supported. In case of an 8×8 sub-macroblock in a P-slice, one additional syntax element specifies if the corresponding 8×8 sub-macroblock is further divided into partitions with block sizes of 8×4, 4×8 or 4×4. Fig. 6 Partitioning of a macroblock and a sub-macroblock for motion compensated prediction The availability of smaller motion compensation blocks improves prediction in general, and in particular, the small blocks improve the ability of the model to handle fine motion detail and result in better subjective viewing quality, more efficient coding and more error resilience because they do not produce large blocking artifacts. Fig. 7 Example of 16x16 macroblock Adaptive de-blocking loop filter : H.264 specifies the use of an adaptive de-blocking filter that operates on the horizontal and vertical block edges with in the prediction loop in order to remove artifacts caused by block prediction errors in order to achieve higher visual quality. Another reason to make de-blocking a mandatory in-loop tool in H.264/AVC is to enforce a decoder to approximately deliver a quality to the customer, which was intended by the produce and not leaving this basic picture enhancement tool to the optional good will. The filtering is generally is based on 4x4 block boundaries, in which two pixels on either side of the boundary may be updated using a different filter. 22
    The filter describedin the H.264/AVC standard is highly adaptive. Several parameters and thresholds and also the local characteristics of the picture itself control the strength of the filtering process. All involved thresholds are quantizer dependent, because blocking artifacts will always become more severe when quantization gets coarse. H.264/MPEG-4 AVC deblocking is adaptive on three levels: ■ On slice level, the global filtering strength can be adjusted to the individual characteristics of the video sequence. ■ On block edge level, the filtering strength is made dependent on inter/intra prediction decision, motion differences, and the presence of coded residuals in the two participating blocks. From these variables a filtering-strength parameter is calculated, which can take values from 0 to 4 causing modes from no filtering to very strong filtering of the involved block edge. ■ On sample level, it is crucially important to be able to distinguish between true edges in the image and those created by the quantization of the transform-coefficients. True edges should be left unfiltered as much as possible. In order to separate the two cases, the sample values across every edge are analyzed. Integer transform : Similar to former standards transform coding is applied in order to code the prediction error signal. The task of the transform is to reduce the spatial redundancy of the prediction error signal. For the purpose of transform coding, all former standards such as MPEG-1 and MPEG- 2 applied a two dimensional Discrete Cosine Transform (DCT) which had to define rounding-error tolerances for fixed point implementation of the inverse transform. Drift caused by the mismatches in the IDCT precision between the encoder and decoder were a source of quality loss. H.264/AVC gets round the problem by using an integer 4x4 spatial transform which is an approximation of the DCT, which helps reduce blocking and ringing artifacts. Much lower bit rate and reasonable performance are reported based on the application of these techniques. Quantization and Transform coefficient scanning : The quantization step is where a significant portion data compression takes place. In H.264, the transform coefficients are quantized using scalar quantization with no widened dead zone. Fifty-two different quantization step sizes can be chosen on a macroblock basis, this being different from prior standards. Moreover, in H.264, the step sizes are increased at a compounding rate of approx. 12.5%, rather than increasing it by a constant increment. The fidelity of chrominance components is improved by using finer quantization step sizes compared to those used for luminance coefficients. The quantized transform coefficients correspond to different frequencies, with the coefficient at the top left hand corner representing the DC value, and the rest of the coefficients corresponding to different non-zero frequency values. The next step in the encoding process is to arrange the quantized coefficients in an array, starting with the DC coefficients. 23
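As a concrete illustration of the integer transform and quantizer step sizes described above, the sketch below applies the widely published 4×4 forward core transform matrix of H.264/AVC together with a step-size rule that doubles every 6 QP values. The base step size of 0.625 is the commonly quoted value, the example residual block is invented, and the normalization that the standard folds into the quantization stage is omitted; treat this as a sketch rather than a conformant implementation.

```python
import numpy as np

# 4x4 forward core transform of H.264/AVC: an integer approximation of the
# DCT whose normalisation is folded into the quantisation stage (omitted here).
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_transform_4x4(x):
    """Y = Cf . X . Cf^T, computed in exact integer arithmetic (no IDCT drift)."""
    return Cf @ x @ Cf.T

def qstep(qp):
    """Approximate quantiser step size: ~12.5% growth per QP, i.e. doubling
    every 6 steps; 0.625 at QP 0 is the commonly quoted base value."""
    return 0.625 * 2.0 ** (qp / 6.0)

residual = np.array([[ 5,  4,  3,  2],     # illustrative prediction error block
                     [ 4,  3,  2,  1],
                     [ 3,  2,  1,  0],
                     [ 2,  1,  0, -1]])
print(forward_transform_4x4(residual))
print([round(qstep(qp), 2) for qp in (0, 6, 12, 24, 51)])
```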
    Fig.8 Scan patternfor frame coding in H.264. The zig-zag scan illustrated in fig.8 is used in all frame coding cases. The zig-zag scan arranges the coefficient in an ascending order of the corresponding frequencies. Entropy coding : Entropy coding is based on assigning shorter code words to symbols with higher probabilities of occurrence and longer code words to symbols with less frequent occurrences. Some of the parameters to be entropy coded include transform coefficients for the residual data, motion vectors and other encoder information. H.264/AVC specifies two alternative methods of entropy coding: a low-complexity technique based on the usage of context-adaptively switched sets of variable length codes, so-called CAVLC, and the computationally more demanding algorithm of context-based adaptive binary arithmetic coding (CABAC). Both methods represent major improvements in terms of coding efficiency compared to the techniques of statistical coding traditionally used in prior video coding standards. CAVLC is the baseline entropy coding method of H.264/AVC. Its basic coding tool consists of a single VLC of structured Exp-Golomb codes, which by means of individually customized mappings is applied to all syntax elements except those related to quantized transform coefficients. For typical coding conditions and test material, bit rate reductions of 2–7% are obtained by CAVLC. For significantly improved coding efficiency, CABAC as the alternative entropy coding mode of H.264/AVC is the method of choice. The CABAC design is based on the key elements: binarization, context modeling, and binary arithmetic coding. Binarization enables efficient binary arithmetic coding via a unique mapping of non-binary syntax elements to a sequence of bits, a so-called bin string. Each element of this bin string can either be processed in the regular coding mode or the bypass mode. The latter is chosen for selected bins such as for the sign information or lower significant bins, in order to speedup the whole encoding (and decoding) process by means of a simplified coding engine bypass. Typically, CABAC provides bit rate reductions of 5–15% compared to CAVLC. Robustness and error resilience : Switching slices (called SP and SI slices), allow an encoder to direct a decoder to jump into an ongoing video stream for such purposes as video streaming bitrate switching and "trick mode" operation. When a decoder jumps into the middle of a video stream using the SP/SI feature, it can get an exact match to the decoded pictures at that 24
    location in thevideo stream despite using different pictures (or no pictures at all) as references prior to the switch. Flexible macroblock ordering (FMO, also known as slice groups) and arbitrary slice ordering (ASO), which are techniques for restructuring the ordering of the representation of the fundamental regions (called macroblocks) in pictures. Typically considered an error/loss robustness feature, FMO and ASO can also be used for other purposes. Data partitioning (DP), a feature providing the ability to separate more important and less important syntax elements into different packets of data, enabling the application of unequal error protection (UEP) and other types of improvement of error/loss robustness. Redundant slices (RS), an error/loss robustness feature allowing an encoder to send an extra representation of a picture region (typically at lower fidelity) that can be used if the primary representation is corrupted or lost. Supplemental enhancement information (SEI) and video usability information (VUI), which are extra information that can be inserted into the bitstream to enhance the use of the video for a wide variety of purposes. Frame numbering, a feature that allows the creation of "sub-sequences" (enabling temporal scalability by optional inclusion of extra pictures between other pictures), and the detection and concealment of losses of entire pictures (which can occur due to network packet losses or channel errors). Picture order count, a feature that serves to keep the ordering of the pictures and the values of samples in the decoded pictures isolated from timing information (allowing timing information to be carried and controlled/changed separately by a system without affecting decoded picture content). These techniques, along with several others, help H.264 to perform significantly better than any prior standard can, under a wide variety of circumstances in a wide variety of application environments. H.264 can often perform radically better than MPEG-2 video —typically obtaining the same quality at half of the bitrate or less. Comparison to previous standard : Coding efficiency : The coding efficiency is measured by average bit rate savings for a constant peak 25
    signal to noiseratio (PSNR). Therefore the required bit rates of several test sequences and different qualities are taken into account. For video streaming applications, H.264/AVC, MPEG-4 Visual ASP, H.263 HLP and MPEG-2 Video are considered. Fig.9 shows the PSNR of the luminance component versus the average bit rate for the single test sequence Tempete encoded at 15 Hz and Table 1 presents the average bit rate savings for a variety of test sequences and bit rates. It can be drawn from Table 1 that H.264/AVC outperforms all other considered encoders. For example, H.264/AVC MP allows an average bit rate saving of about 63% compared to MPEG-2 Video and about 37% compared to MPEG-4 Visual ASP. For video conferencing applications, H.264/AVC MPEG-4 Visual SP, H.263 Baseline, and H.263 CHC are considered. Figure 10 shows the luminance PSNR versus average bit rate for the single test sequence Paris encoded at 15 Hz and Table 2 presents the average bit rate savings for a variety of test sequences and bit rates. As for video streaming applications, H.264/AVC outperforms all other considered encoders. H.264/AVC BP allows an average bit rate saving of about 40% compared to H.263 Baseline and about 27% compared to H.263 CHC. Fig. 9 Luminance PSNR versus average bit rate for different coding standards measured for Tempete. Fig. 10 Luminance PSNR versus average bit rate for different coding standards measured for Paris. Hardware complexity : In relative terms, the encoder complexity increases with more than one order of magnitude between 26
    MPEG-4 Part 2and H.264/AVC and with a factor of 2 for the decoder. The H.264/AVC encoder/decoder complexity ratio is in the order of 10 for basic configurations and can grow up to 2 orders of magnitude for complex ones. Table 3. Comparison of MPEG standards Conclusion : Compared to previous video coding standards, H.264/AVC provides an improved coding efficiency and a significant improvement in flexibility for effective use over a wide range of networks. While H.264/AVC still uses the concept of block-based motion compensation, it provides some significant changes: ■ Enhanced motion compensation capability using high precision and multiple reference frames ■ Use of an integer DCT-like transform instead of the DCT ■ Enhanced adaptive entropy coding including arithmetic coding ■ Adaptive in-loop deblocking filter The coding tools of H.264/AVC when used in an optimized mode allow for bit savings of about 50% compared to previous video coding standards like MPEG-4 and MPEG-2 for a wide range of bit rates and resolutions. H.264 performs significantly better than any prior standard can, under a wide variety of circumstances in a wide variety of application environments. H.264 can often perform radically better than MPEG-2 video—typically obtaining the same quality at half of the bitrate or less. References: 27
1. www.scientificatlanta.com
2. www.techrepublic.com
3. www.mpeg.org
4. www.vcodex.com/h.264.html
5. www.sciencedirect.org
6. www.ieee.org
7. The MPEG Handbook by John Watkinson.
CODE NO: EC 66 IS 3
HUMAN-ROBOT INTERFACE BASED ON THE MUTUAL ASSISTANCE BETWEEN SPEECH AND VISION
Submitted by: 1. Harleen Kaur Chadha, EEE 3rd year, Guru Nanak Engineering College, Ibrahimpatnam, Hyderabad; 2. Sonia Kapoor, EEE 3rd year, Guru Nanak Engineering College, Ibrahimpatnam, Hyderabad
Abstract: In this paper we develop a helper robot that fetches objects ordered by the user through the mutual assistance of speech and vision. The robot needs a vision system to recognize the objects appearing in the orders. However, conventional vision systems cannot reliably recognize objects in complex scenes: they may find many objects and cannot determine which one is the target. This paper proposes a method that uses a conversation with the user to solve this problem; a speech-based interface is appropriate for this application. The robot asks a question that the user can easily answer and whose answer can efficiently reduce the number of candidate objects. It
    considers the characteristicsof features used for object recognition such as the easiness for humans to specify them by word, generating a user-friendly and efficient sequence of questions. Robot can detect target objects by asking the questions generated by the method. After the target object has been detected by the robot it will handover that target object to its master and for doing this we equip the robot with sensors, lasers and pan-tilt camera. I. INTRODUCTION Helper robots or service robots in welfare domain have attracted much attention of researchers for the coming aged society. Here we are developing a helper robot that carries out tasks ordered by the user through voice and/or gestures. In addition to gesture recognition, such robots need to have vision systems that can recognize the objects mentioned in speech. It is, however, difficult to realize vision systems that can work in various conditions. Thus, we have proposed to use the human user’s assistance through speech. When the vision system cannot achieve a task, the robot makes a speech to the user so that the natural response by the user can give helpful information for its vision system. Thus, even though detecting the target object may be difficult and need the user’s assistance, once the robot has detected an object, it can assume the object as the target. However, in actual complex scenes, the vision system may detect various objects. The robot must choose the target object among them, which is a hard problem especially if it does not have much a priori knowledge about the object. This paper tackles this problem. The robot determines the target through a conversation with the user. The point of research is how to generate a sequence of utterances that can lead to determine the object efficiently and user-friendly. This paper presents such a dialog generation method. It determines what and how to ask the user by considering image processing results and the characteristics of object attributes. After the object has been selected by the mutual assistance between the speech and vision capabilities of the robot with the help of master it should be handed over to its master, for this we are going to use robot that is equipped with the sonar, laser, infrared and pan-tilt camera. II. SYSTEM CONFIGURATION Our system consists of a speech module, a gesture module, a vision module, an action module and a central processing module. Speech Module: The speech module consists of a voice recognition sub module and a text-to-speech sub module. Via Voice Millennium is used for speech recognition, and ProTalker97 software is used to do text-to-speech. Vision Module: The vision module performs image processing when the central processing module sends it a command. We have equipped it with the ability to recognize objects based on color segmentation and simple shape detection. Gesture recognition methods are also used to detect the objects and its result is sent to the central processing module. Action Module: The action module waits for commands from the central processing module to carry out the actions intended for the robot and the camera. 29
    Central Processing Module:The Central processing module is the center of the system. It uses various information and knowledge to analyze the meanings of recognized voice input. It sends commands to the Vision module when it thinks that visual information is needed and sends commands to the Speech module to make sentences when it thinks that that it needs to ask the user for additional information, it sends commands to action module when action is to be carried out by the robot . III. FEATURE CHARACTERISTICS We consider the characteristics of features todetermine which feature the robot uses and how to use it from the following four viewpoints. In the current implementation, we use four features: color, size, position, and shape. Here, we classify recognized words into thedifferent categories: objects, actions, directions, adjectives, adverbs, emergency words, numerals, colors, names of people and others and we train the robot that it can understand any of these features. A. Vocabulary Humans can easily describe some features by word .If we can represent a particular feature easily by word for any given object, we call it a vocabulary-rich feature. The robot can ask relatively complex questions such as ‘what-type’ questions for a vocabulary-rich feature since we can easily find an appropriate word for answer. For example, we have rich vocabulary for color description: such as, red, green, blue, etc. When the robot asks what color it is, we can easily give an answer. Position is also a vocabulary-rich feature. We have a large vocabulary to express position such as left, right, upper, and lower. B. Distribution Although we consider features for each object, we may find it difficult to express some features by word depending on the spatial distribution of objects. We call a feature with this problem a distribution-dependent feature. Position is a distribution-dependent feature. If several objects exist close together, it is difficult specify the position of each object. Color, size, and shape are not such features. C.Uniqueness: If the value of a particular feature is different for eachobject, we call it a unique feature. Position can be a unique feature since no multiple objects share the same position. D. Absoluteness/Relativeness: If we can describe a particular feature by word even if only an object exists, we call it an absolute feature. Otherwise, we call it a relative feature. Color and shape are absolute features in general. Size and position are not absolute features but relative features. We say ‘big’ or ‘small’ by comparing with other objects. IV. ASSISTANCE BY SPEECH The basic strategy for generating dialog is ‘ask-and remove’. The robot asks the user about a certain feature. Then, it removes unrelated objects from the detected objects using the information given by the user. It iterates this process until only an object remains. The robot applies color segmentation .When the number of object is large; it may be difficult to use distribution-dependent, relative features. So we 30
mainly consider vocabulary-rich, absolute features. We consider unique features only when the other features cannot work, because in the current implementation position is the only unique feature, and it is a distribution-dependent feature. The robot generates its utterances for dialog with the user as follows. First, it classifies the features of all regions into classes. For example, it assigns a color label to each region based on the average hue value of the region. How to classify the data is determined for each feature in advance. For color, regions are classified into seven colors: blue, yellow, green, red, magenta, white, and black. Then, the robot computes the percentage of objects in each class relative to the total number of objects. It classifies the situation of each feature into three categories depending on the maximum percentage: the variation category, the medium category, and the concentration category. The variation category is the case where the maximum percentage is less than 33% (1/3); for the concentration category it is more than about 67% (2/3); for the medium category it is from 33% through 67%. These percentage values were determined experimentally. If the robot can obtain information about a feature classified in the variation category, it can remove many unrelated objects among the regions. Therefore, the first rule for determining which feature the robot chooses for its question to the user is to give priority to variation category features. If no such feature exists, the medium category features are given the second priority and the concentration category features the last. A. Case with variation category features: If there are any variation category features, the robot asks the user about them. If the features classified into the variation category include a vocabulary-rich feature, the robot asks the user a ‘what-type’ question. For example, if the color feature satisfies the variation category condition, the robot asks, “What is the color of the target object?”, as color is a vocabulary-rich feature. If there are multiple vocabulary-rich features, the first priority is given to the feature with the smallest maximum percentage. If there is no such vocabulary-rich feature, the robot needs to adopt absolute features. Since these are not vocabulary-rich, the user may find it difficult to answer if the robot asks a ‘what-type’ question. Thus, the robot examines whether or not each region can be described easily by a word in terms of the feature. If all regions satisfy this, the robot adopts a ‘what-type’ question. Otherwise, it uses a multiple choice question such as, “Is the target object A, B, or others?”, where ‘A’ and ‘B’ are feature values that can be expressed easily by a word. For example, in the case of shape, the robot uses the shape factor, which helps to decide the shape; we deal with this concept in a coming section. There could also be a case where all regions are hard to express by word. The robot classifies the regions into classes that can be expressed by word and assigns the label ‘others’ to the regions that cannot. Thus, the number of regions with the ‘others’ label should be less than one third of the total if the feature is classified into the variation category. B. Case with medium category features: If no features are classified into the variation category but some fall into the medium category, the robot considers using the features in the medium category.
In this case, the robot uses a ‘yes/no-type’ question. The robot generates a question such as, “Is the target object A?”, where ‘A’ is the label of the feature value with the largest percentage. An example is, “Is the target object red?” The robot can reduce the number of candidates by half on average with one question. If there are multiple such features, the robot gives them priorities according to an order fixed in advance. We determine the priority in the order in which we can obtain reliable information. C. Case with concentration category features: This is the case where all features are classified into the concentration category, which means that all regions (objects) are similar in several respects. Thus, the robot considers using unique features and asks a ‘yes/no-type’ question about them. In the current implementation, position is the only unique feature. An example question is, “Is the target object on the right?” (A compact sketch of this question-selection logic is given below.)
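The category thresholds and priority rules above translate naturally into a small decision routine. The following Python sketch is only illustrative: the feature sets, function names and example labels are assumptions, and the finer tie-breaking rules (smallest maximum percentage first, the fixed priority order among medium-category features) are simplified.

```python
from collections import Counter

VOCAB_RICH = {"color", "position"}   # easy to name with a word (Sec. III-A)
ABSOLUTE = {"color", "shape"}        # meaningful even for a single object
UNIQUE = {"position"}                # no two objects share the same value

def category(labels):
    """Classify one feature by the largest share of regions in a single class:
    < 1/3 -> variation, 1/3..2/3 -> medium, > 2/3 -> concentration."""
    share = max(Counter(labels).values()) / len(labels)
    if share < 1 / 3:
        return "variation"
    return "concentration" if share > 2 / 3 else "medium"

def choose_question(features):
    """features maps a feature name to one class label per detected region.
    Returns the feature to ask about and the question type, following the
    priority order variation > medium > concentration."""
    cats = {name: category(labels) for name, labels in features.items()}
    variation = [n for n, c in cats.items() if c == "variation"]
    if variation:
        rich = [n for n in variation if n in VOCAB_RICH]
        if rich:
            return rich[0], "what-type"          # e.g. "What color is it?"
        absolute = [n for n in variation if n in ABSOLUTE]
        return (absolute or variation)[0], "multiple-choice"
    medium = [n for n, c in cats.items() if c == "medium"]
    if medium:
        return medium[0], "yes/no"               # e.g. "Is the target object red?"
    return next(iter(UNIQUE)), "yes/no"          # all concentrated: use position

features = {"color": ["red", "blue", "green", "yellow", "magenta", "white"],
            "size": ["big", "big", "big", "big", "big", "small"]}
print(choose_question(features))                 # -> ('color', 'what-type')
```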
The robot computes the spatial distribution pattern of the objects. The word specifying the positional relationship, ‘right’ in the above example, is determined by this pattern. We determine the relationships between the pattern and the word in advance. When we use position, we need to consider two things. One is that position is a distribution-dependent feature. Since we have trained the robot about positional relationships, the user can use words specifying them, such as ‘right’ and ‘left’, by considering the robot’s camera direction. Thus, ‘right’ means the right part as seen by the robot, and ‘close’ means the lower part of the robot’s view. However, such an interpretation may be wrong and does not conform to the purpose of this research. To solve such problems, we are planning to specify positional relationships with respect to some distinguished object in the scene. For example, if the robot finds a red object in a scene where no other red objects can be seen, it asks the user whether the target object is on the right of the red object. V. IMAGE PROCESSING In the current implementation we apply color segmentation and compute four features for each foreground region in the segmentation result: color, shape, position, and size. The size is the number of pixels of the region. A. Color Segmentation We use a robust feature-space method: the mean shift algorithm combined with the HSI (Hue, Saturation, Intensity) color space for color image segmentation. Although the mean shift algorithm and the HSI color space can each be used separately for color image segmentation, they tend to fail to segment objects when the illumination conditions change. To solve this problem, we use the mean shift algorithm as a preprocessing tool to reduce the number of regions and colors, and then use the HSI color space for merging regions to segment single-colored objects under different illumination conditions. Our method consists of the following parts:
· Apply the mean shift algorithm to the real image to reduce colors and divide it into several regions.
· Merge regions of a specific color based on the H, S, I components of the HSI color space.
· Filter the result using a median filter.
· Eliminate small regions using a region growing algorithm.
The input image is first analyzed using the mean shift algorithm. The image may contain many colors and several regions. The algorithm significantly and accurately reduces the number of colors and regions. Thus, the output of the mean shift algorithm is several regions with few colors in comparison with the input image. These regions, however, do not imply that each comes from a single object. The mean shift algorithm may divide even a single-color object into several regions with more than one color. To remove this ambiguity, we use the Hue, Saturation and Intensity components of the HSI color space to merge homogeneous regions that likely come from a single object. We use threshold values for each HSI component to obtain homogeneous regions; the threshold values are selected dynamically. Then, we use the median filter as post-processing, which helps to smooth region boundaries and to reduce unwanted regions. Finally, we use the region growing procedure as another post-processing step to avoid over-segmentation and to remove small highlights from the objects. (A short code sketch of this pipeline is given below.)
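A rough Python/OpenCV sketch of the segmentation pipeline follows. It is not the authors' implementation: OpenCV's pyrMeanShiftFiltering and the HSV color space stand in for the mean shift stage and HSI processing, the thresholds are fixed rather than selected dynamically, the file name is hypothetical, and a connected-component area filter replaces the region-growing step.

```python
import cv2
import numpy as np

# "scene.png" is a hypothetical file name standing in for the camera image.
img = cv2.imread("scene.png")                          # BGR table scene

# 1. Mean shift as pre-processing: fewer colours, fewer regions.
smoothed = cv2.pyrMeanShiftFiltering(img, 12, 24)      # spatial / colour radii

# 2. Merge regions of one target colour in HSV space (stand-in for HSI);
#    the paper selects its thresholds dynamically, here they are fixed.
hsv = cv2.cvtColor(smoothed, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, (0, 80, 50), (10, 255, 255))   # rough hue band for "red"

# 3. Median filter to smooth boundaries and suppress speckle.
mask = cv2.medianBlur(mask, 5)

# 4. Drop tiny regions (a simple substitute for the region-growing step).
n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
regions = [i for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] > 200]
print(len(regions), "candidate red object(s) found")
```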
B. Shape Detection: We compute the shape factor S for each segmented region and classify the regions into shape categories by this value. If it is around 1, the shape is a circle; around 0.8, a square; less than 0.6, an irregular shape. Gesture Recognition: Although we can convey much information about target objects by speech, some attributes remain difficult to express in words. We often use gestures to explain the objects in such cases. This can be useful where the shape factor fails to work; in those situations we use gestures to determine shape. VI. OBJECT RETRIEVAL PHASE Object retrieval is best seen with an example. Consider a table on which three drinks are placed; the user wants the bottle of Sprite and orders the robot to get it. The robot, using some of the above-mentioned methods, selects the desired drink as shown in the second figure. Based on the data from the laser scanner, a collision-free trajectory for moving the end effector of the manipulator to the detected object is computed. After detecting the desired bottle, the robot moves, using image processing techniques, to the location where the bottle of Sprite is standing. Once the object to grasp is identified, the robot scans the area and matches the identified image region with the 3D information. The robot grasps the object along a collision-free trajectory. This method, using the camera and 3D information, guarantees robustness against positioning inaccuracy and changing object positions. After lifting the object, the manipulator is moved along a collision-free trajectory to a safe position for driving. The hand-over is accomplished by using a force-torque sensor in the end effector to detect that the user has grasped the object. Vice versa, the bottle can also be handed over to the robot and put on furniture by the robot. When handing over an object to the robot, the in-finger sensors are used to detect the object and close the gripper. When placing objects on furniture, the location is first analyzed with the 3D laser scanner. Once a free position has been detected, a collision-free path is planned and the arm is moved to this position. The force-torque sensor data are required to detect the point where the object touches the table. The gripper is then opened and the object is released. VII. APPLICATIONS 1. Help for elderly and handicapped people: Technical aids allow elderly and handicapped people to live independently and supported in their private homes for a longer time. The robot's manipulator arm is equipped with a gripper, hand camera, force-torque sensor and optical in-finger distance sensors. The tilting head contains
two cameras, a laser scanner and speakers. A hand-held panel with a touch screen on the robot's back is detachable so that the user can stay in touch even if the robot moves to a different room; depending on need, the user can give an order through it and the robot can perform the retrieval of the object, helping people much as a human assistant would. 2. Help during surgery: Robots can be used during surgeries to bring operating equipment to the surgeon. 3. Help in industry: In industry there are circumstances where such a robot, equipped with the facilities described above, can take over work from people and so reduce human labour and cost. VIII. CONCLUSION We have proposed a human-robot interface based on the mutual assistance between speech and vision. Robots need vision to carry out their tasks, and we have proposed to use the human user's assistance to support it. The robot asks the user a question when it cannot detect the target object. It generates a sequence of utterances that can lead to determining the object efficiently and in a user-friendly manner. It determines what and how to ask the user by considering the image processing results and the characteristics of the object (image) attributes. After the target object is detected, it is handed over to the user by the robot.
References:
[1] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, pp. 603–619, 2002.
[2] T. Takahashi, S. Nakanishi, Y. Kuno and Y. Shirai, “Human-robot interface by verbal and non-verbal communication,” Proceedings of the 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems.
[3] P. McGuire, J. Fritsch, J. J. Steil, F. Roothling, G. A. Fink, S. Wachsmuth, G. Sagerer, H. Ritter, “Multi-modal human machine communication for instructing robot grasping tasks,” in Proc. International Conference on Intelligent Robots and Systems, pp. 1082–1089, September–October 2002.
[4] D. Roy, B. Schiele, and A. Pentland, “Learning audio-visual associations using mutual information,” International Conference on Computer Vision, Workshop on Integrating Speech and Image Understanding, 1999.
CODE NO: EC 23 IS 4
    DIGITAL IMAGE WATERMARKING V.BHARGAV L.SANJAY REDDY ECE 3/4 ECE 3/4 bhargavvaddavalli@gmail.com san_jay87@yahoo.co.in ABSTRACT : Digital watermarking is a relatively new technology that allows the imperceptible insertion of information into multimedia data. The supplementary information called watermark is embedded into the cover work through its slight modification. Watermarks are classified as being visible and invisible. A visible watermark is intended to be seen with the content of the images and at the same time it is embedded into the material in such a way that its unauthorized removal will cause damage to the image. In case of the invisible watermark, it is hidden from view during normal use and only becomes visible as a result of special visualization process. An important point of watermarking technique is that the embedded mark must carry information about the host in which it is hidden. There are several techniques of digital watermarking such as spatial domain encoding, frequency domain embedding. DCT domain watermarking, and wavelet domain embedding. In this paper we have examined spatial domain and DCT domain watermarking technique. Both techniques were implemented on gray scale image of Lena and Baboon. INTRODUCTION : Digital watermarking is a technique which allows an individual to add hidden copyright notices or other verification messages to digital audio, video, or image signals and documents. Such hidden message is a group of bits describing information pertaining to the signal or to the author of the signal (name, place, etc.). The technique takes its name from watermarking of paper or money as a security measure. Digital watermarking is not a form of steganography, in which data is hidden in the message without the end user's knowledge, although some watermarking techniques have the steganographic feature of not being perceivable by the human eye. While the addition of the hidden message to the signal does not restrict that signal's use, it provides a mechanism to track the signal to the original owner. 35
A watermark can be classified into two sub-types: visible and invisible. Visible watermarks change the signal altogether, such that the watermarked signal is clearly different from the actual signal, e.g., adding an image as a watermark to another image. Invisible watermarks do not change the signal to a perceptually great extent, i.e., there are only minor variations in the output signal. An example of an invisible watermark is when some bits are added to an image, modifying only its least significant bits. 1. Spatial Domain Watermarking One of the simplest techniques in digital watermarking works in the spatial domain, using the two-dimensional array of pixels in the container image to hold hidden data via the least significant bit (LSB) method. The human eye is not very attuned to small variations in color, so small differences in the LSBs will not be noticeable. The steps to embed the watermark image are given below (a code sketch of these steps follows).
1.1 Steps of Spatial Domain Watermarking
1) Convert the RGB image to a gray scale image.
2) Convert the image to double precision.
3) Shift the most significant bits of the watermark image into the least significant bit positions.
4) Set the least significant bits of the host image to zero.
5) Add the shifted version (step 3) of the watermark image to the modified (step 4) host image.
To implement the above algorithm, we used a 512 x 512 8-bit image of Lena and a 512 x 512 8-bit image of Baboon, which are shown in Figure 1 below. Embedded images are shown in Figure 2:
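A compact NumPy sketch of steps 3-5, together with the matching extraction of Section 1.3, might look as follows. The random arrays simply stand in for the grayscale Lena and Baboon images, and k = 3 matches the 3-MSB/3-LSB configuration used in the experiments.

```python
import numpy as np

def embed_lsb(host, mark, k=3):
    """Steps 3-5: put the k MSBs of the watermark into the k LSBs of the host
    (both images are 8-bit grayscale arrays of the same size)."""
    mask = np.uint8((1 << k) - 1)        # e.g. 0b00000111 for k = 3
    return (host & ~mask) | (mark >> (8 - k))

def extract_lsb(marked, k=3):
    """Section 1.3: pull the k LSBs back up into the MSB positions."""
    mask = np.uint8((1 << k) - 1)
    return (marked & mask) << (8 - k)

# Random arrays stand in for the 512x512 Lena (host) and Baboon (watermark).
rng = np.random.default_rng(0)
lena = rng.integers(0, 256, (512, 512), dtype=np.uint8)
baboon = rng.integers(0, 256, (512, 512), dtype=np.uint8)

watermarked = embed_lsb(lena, baboon)
recovered = extract_lsb(watermarked)
print(int(np.abs(watermarked.astype(int) - lena.astype(int)).max()))    # <= 7
print(int(np.abs(recovered.astype(int) - baboon.astype(int)).max()))    # <= 31
```

The host is distorted by at most 2^k - 1 gray levels, while the recovered watermark keeps only its k most significant bits, which is exactly the resolution trade-off discussed in the text.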
Figure 1: 512 x 512 8-bit gray scale images: (a) Image of Lena. (b) Image of Baboon.
1.2 Embedding Watermark Image:
Figure 2: Digital image watermarking of two equal-size images using LSB: (a) Image of Baboon hidden in image of Lena, (b) Image of Lena hidden in image of Baboon.
Note that it was determined that the 5 most significant bits give a good visualization of any image. Figure 2(a) shows the host image of Lena, where the 3 MSBs of Baboon are carried in the 3 LSBs of Lena. The same experiment was performed with Baboon as the host image: Figure 2(b) shows Baboon as the host image, where the 3 MSBs of Lena are used as the 3 LSBs of Baboon. To obtain the extracted image, the 3 LSBs of the watermarked (embedded) image are simply extracted, as shown in Figure 3 below.
1.3 Extracting Watermark Image:
Figure 3: Extracted images from watermarked images: (a) Extracted image of Baboon from Lena, (b) Extracted image of Lena from Baboon.
Note that the resolution of the embedded image and the resolution of the extracted image trade off against
each other: if a higher-resolution extracted image is required, more MSBs of the watermark image can be used in the embedded image; however, in that case the resolution of the embedded image is reduced. To compare the embedded and extracted images, the MSE and PSNR were calculated between the original host image and the embedded image, as well as between the original watermark image and the extracted image. The result is shown in Table 1 below.
Table 1: MSE / PSNR of embedded and extracted images.
                   Lena                        Baboon
Embedded image     1.0861 × 10^-4 / 87.77 dB   1.1642 × 10^-4 / 87.47 dB
Extracted image    1.3898 × 10^4 / 6.7011 dB   1.4557 × 10^4 / 6.5002 dB
Notice from Table 1 that the MSE for the Baboon image, for both the embedded and the extracted image, is higher than for the image of Lena, which was expected since the variation in gray scale is greater in Baboon than in Lena. 2. DCT Domain Watermarking: The classic and still popular domain for image processing is that of the discrete cosine transform (DCT). The DCT allows an image to be broken up into different frequency bands, making it much easier to embed watermarking information into the middle frequency bands of an image. The middle frequency bands are chosen because they avoid the most visually important parts of the image (low frequencies) without over-exposing the watermark to removal through compression and noise attacks. Flow chart of DCT Domain Watermarking:
Figure 4: Flow chart of the watermark embedding procedure (start; perform 8x8 block DCT on the host image; calculate the variance of the next block; if the variance is greater than 45, embed the watermarking value, otherwise leave the DCT coefficients of the block unmodified; repeat until all blocks are done; apply the IDCT; end).
2.2 Embedding Watermark Image
The embedding algorithm can be described by the following equation:
Watermarked image = DCT-transformed image × (1 + scaling factor × watermark)
An 8x8 block inverse DCT is then applied to the DCT-domain 2-D signal to obtain the watermarked image. The result is shown in Figure 5 below. (A code sketch of the embedding procedure follows.)
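The procedure of Figure 4 can be sketched as follows in Python/OpenCV. This is an illustration under stated assumptions, not the paper's code: the scaling factor, the choice of a single mid-frequency coefficient per block, the ±1 mapping of the watermark bits and the one-symbol-per-block layout of the 16×16 logo are all simplifications, and random arrays stand in for the real images.

```python
import cv2
import numpy as np

ALPHA = 0.05            # scaling factor (illustrative value)
VAR_THRESHOLD = 45      # embed only in blocks with enough detail (Fig. 4)

def embed_dct(host, bits, alpha=ALPHA):
    """Watermarked block = IDCT( DCT(block) with one mid-frequency coefficient
    scaled by (1 + alpha * w) ); one +/-1 watermark symbol per eligible block."""
    out = host.astype(np.float32).copy()
    k = 0
    for r in range(0, out.shape[0], 8):
        for c in range(0, out.shape[1], 8):
            block = out[r:r + 8, c:c + 8].copy()
            if block.var() <= VAR_THRESHOLD or k >= len(bits):
                continue                       # flat block: coefficients left unmodified
            coeffs = cv2.dct(block)
            w = 1.0 if bits[k] else -1.0
            coeffs[3, 3] *= (1 + alpha * w)    # mark a single mid-frequency coefficient
            out[r:r + 8, c:c + 8] = cv2.idct(coeffs)
            k += 1
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(1)
lena = rng.integers(0, 256, (512, 512), dtype=np.uint8)   # stand-in host image
logo_bits = rng.integers(0, 2, 256)                       # stand-in for the 16x16 logo
marked = embed_dct(lena, logo_bits)
print(float(np.abs(marked.astype(int) - lena.astype(int)).mean()))   # small change
```

The non-blind extraction of Section 2.3 then mirrors these steps: take the DCT of the watermarked and original images block by block, subtract, and rescale by the scaling factor.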
Figure 5 shows the DCT-based embedding on the 512 x 512 image of Lena. A binary (2-level) 16 x 16 image of the Temple logo was taken as the watermark image. The watermark image was embedded in the image of Lena and the resulting watermarked image is shown in Figure 5(c). 2.3 Extracting Watermark Image To obtain the extracted watermark from the watermarked image, the following procedure was performed.
1. Perform the DCT transform on the watermarked image and on the original host image.
2. Subtract the original host image from the watermarked image.
3. Multiply the extracted watermark by the scaling factor for display.
The extracted Temple logo is shown in Figure 5(d). Note that, due to the DCT transform of the watermarked image, the recovered message is not exactly the same as the original; however, it contains enough information for authentication. It should also be noted that the watermark was only a 16x16 binary image; in the case of a larger image, the extracted watermark would be expected to have better resolution.
Figure 6: Scaled difference between the original image of Lena and the watermarked image.
Finally, the scaled difference between the original image of Lena and the watermarked image of Lena is shown in Figure 6. The difference is not noticeable, which shows that the watermarked image is close to the original host image. The same experiment was performed on the image of Baboon and the result is shown in Figure 7 below:
Figure 7: DCT-based watermarking: (a) 512 x 512 8-bit original image of Baboon, (b) 16 x 16 two-level (binary) image of the Temple logo, (c) DCT watermarked image, (d) Temple logo recovered from the watermarked image.
Applications:
· Steganography
· Copyright protection and authentication
· Anti-piracy
· Broadcast monitoring
Limitations of Spatial Domain Watermarking: This method is comparatively simple and lacks the basic robustness that may be expected in watermarking applications. It can survive simple operations such as cropping and some addition of noise; however, lossy compression is going to defeat the watermark. An even better attack is to set all the LSB bits to ‘1’, fully defeating the watermark at the cost of a negligible perceptual impact on the cover object. Furthermore, once the algorithm is discovered, it is very easy for an intermediate party to alter the watermark. To overcome this problem, we have investigated DCT domain watermarking, which is discussed above. Conclusion: Two different techniques of digital watermarking were investigated in this project. It was determined that DCT domain watermarking is much better than spatial domain encoding, since DCT domain watermarking can survive attacks such as noise addition, compression, sharpening, and filtering. Also, the extraction of the watermark image depends on the original host image. It was noted that the PSNR is lower in the case of the Baboon host image than the Lena host image.
References:
· "Digital Watermarking Alliance – Furthering the Adoption of Digital Watermarking" (http://www.digitalwatermarkingalliance.org/)
· Digital Watermarking & Data Hiding research papers (http://www.forensics.nl/digital-watermarking) at Forensics.nl
· Retrieved from http://en.wikipedia.org/wiki/Digital_watermarking
· Digital Image Processing – Rafael C. Gonzalez, Richard E. Woods – 2nd edition, Pearson Education (PHI)
· Digital Image Processing Using MATLAB – Rafael C. Gonzalez, Richard E. Woods, Steven L. Eddins – Pearson Education
· Digital Image Processing and Analysis – B. Chanda, D. Dutta Majumder – Prentice Hall of India, 2003
· Under the guidance of Associate Prof. Mr. Kishore Kumar, Bapatla Engineering College
    CODE NO:93 IS5 By: B.Jeevan Jyothi Kumar. K.Sri Sai Koteswara Rao. Email:jeevan_koti@yahoo.co.in Abstract: The influx of sophisticated technologies in the field of image processing was concomitant with that of digitization in the computers arena. Today image processing is used in fields of astronomy, medicine, crime &finger print analysis, remote sensing, manufacturing, aerospace &defense, movies & entertainment, and multimedia. In this paper, we propose a new scheme for image compression using neural networks. Image data compression deals with minimization of the amount of data required to represent an image while maintaining an acceptable quality. Several image compression techniques have been developed in recent years Over the last few years neural network has emerged as an effective tool for solving a wide range of problems involving adaptivity and learning. A multilayer feed-forward neural network trained using the backward error propagation algorithm is used in many applications. However, this model is not suitable for image compression because of its poor coding performance. Recently, a self-organizing feature map (SOFM) algorithm has been proposed which yields a good coding performance. However, this algorithm requires a long training time because the network starts with random initial weights. In this paper we have used the backward error propagation algorithm (BEP) to quickly obtain the initial weights which are then used to speedup the training time required by the SOFM algorithm .In this paper we propose an architecture and an improved training method to attempt to solve some of the shortcomings of traditional data compression systems based on feed forward neural networks trained with back propagation—the dynamic auto association neural network (DANN).Image compression will be of rigorous use where the crucial factor is the utmost 45
    efficacy Quality ofthe image process varies according to specialized image signal processing. History: The history of digital image processing and analysis is quite short. It cannot be older than the first electronic computer which was built. However, the concept of digital image could be found in literature as early as in 1920 with transmission of images through the Bart lane cable picture transmission system (McFarlane, 1972). Images were coded for submarine cable transmission and then reconstructed at the receiving end by a specialized printing device. The first computers powerful enough to carry out image processing tasks appeared in the early 1960’s.the birth of what we call digital image processing today can be traced to the availability of those machines and the onset of the space program during that period. Attention was then concentrated on improving visual quality of transmitter (or reconstructed) images. In fact, potentials of image processing techniques came into focus with the advancement of large scale digital computer and with the journey to the moon. Improving image quality using computer, started at Jet Propulsion Laboratory, California, USA in 1964, and the images of the moon transmitted by RANGER-7 were processed. In parallel with space applications, digital image processing techniques begun in the late 1960’s and early 1970’s to be used in medical imaging, remote earth resources observations and astronomy. Since 1964, the field has experienced vigorous growth] certain efficient computer processing techniques (ex: fast Fourier transform) have also contributed to this development. Introduction Modern digital technology has made it possible to manipulate multi-dimensional image signals with systems that range from simple digital circuits to advanced parallel computers. The goal of this manipulation can be divided into three categories: * Image Processing image in -> image out * Image Analysis image in -> measurements out * Image Understanding image in -> high-level description out 46
    We will focuson the fundamental concepts of image processing. Space does not permit us to make more than a few introductory remarks about image analysis. Image understanding requires an approach that differs fundamentally from the theme of our discussion. Further, we will restrict ourselves to two-dimensional (2D) image processing although most of the concepts and techniques that are to be described can be extended easily to three or more dimensions. We begin with certain basic definitions. An image defined in the "real world" is considered to be a function of two real variables, for example, a(x, y) with a as the amplitude (e.g. brightness) of the image at the real coordinate position (x, y). An image may be considered to contain sub-images sometimes referred to as regions-of-interest, ROIs, or simply regions. This concept reflects the fact that images frequently contain collections of objects each of which can be the basis for a region.. In certain image-forming processes, however, the signal may involve photon counting which implies that the amplitude would be inherently quantized. 47
Components of a general-purpose image processing system: image sensors, specialized image processing hardware, computer, image processing software, mass storage, display, and hard copy. Image compression: Compression is one of the techniques used to make the file size of an image smaller. The file size may decrease only slightly or tremendously, depending upon the type of compression used. Think of compression much like you would a balloon. You start out with a balloon filled with air - this is your image. By squeezing out (or compressing) all of the air, your balloon shrinks to a fraction of the size of the air-filled original. This balloon will now fit in a lot of spaces where it could not fit before. The end result is that you still have the exact same balloon, just in a slightly different form. To get back the original balloon, simply blow it up to its original size. A direct analogy can be drawn with image compression. You start out with an image with a very large file size. By applying compression to the file, the file shrinks to a fraction of its original size. You can now fit more images onto a floppy disk or hard disk because they have been compressed and take up less space. More importantly, the smaller file size also means that the file can
    be sent overthe Worldwide Web much faster. This is good news for people browsing your web site, and good news for network congestion problems. There are two different types of compression - lossless and lossy: Lossless: A compression scheme in which no bits of information are permanently lost during compression/decompression of an image. This means that, just like the balloon analogy, even though the air is out of the balloon, it is capable of returning to its original state. Your image will look exactly the same before and after you've compressed it using a lossless compression scheme. The most common image format on the WWW that uses a lossless compression scheme is the GIF (Graphics Interchange Format) format. Although it is lossless, it has the capability of showing you a maximum of only 256 colors at a time. The GIF format is used mainly when there are distinct lines and colors in your image, as is the case in logos and illustration work. Cartoons are an excellent example of the type of work that is best suited for the GIF format. At this time, all web browsers support the GIF format. When converting an image to GIF format, you have the option to have the image display any number of colors up to 256 (the maximum number of colors for this format). A lot of images appropriate for the GIF format can be saved with as little as 8 to 16 colors which will greatly decrease the required file size compared to the same image saved with 256 colors. These settings can be specified when using Photoshop, an image editing tool that we will discuss later on. · Lossy A compression scheme in which some bits of information are permanently lost during compression and decompression of an image. This means that, unlike the balloon analogy, an image will permanently lose some of the information that it originally contained. Fortunately, the loss is usually only minimal and hardly detectable. The most common image format on the WWW that uses a lossy compression scheme is the JPEG (Joint Photographic Experts Group) format. 49
    JPEG is avery efficient, true-color, compressed image format. Although it is lossy, it has the capability of showing you more colors than GIF (more than 256 colors). The JPEG format is used mainly when your image contains gradients, blends, and inconsistent color variations, as is the case with photographic images. Because it is lossy, JPEG has the ability to compress an image tremendously, with little loss in image quality. It is usually able to compress more efficiently than the lossless GIF format, achieving much greater compression. The more popular browsers such as Netscape do support JPEG, and it is expected that very shortly all browsers will have built-in support for it. It's important to note that since JPEG is a lossy image format, it is very important to have a non-JPEG image as your original. This way, you can make changes to the original and save it as a JPEG under a different name. If you need to make revisions, you can go back to the original non-JPEG image and make your corrections and only then should you save it as a JPEG. By opening a JPEG image, revising it, and saving it back out as a JPEG time and time again, you may introduce unwanted artifacts or "noise" that you may otherwise be able to avoid. A lot of people are scared off by the term "lossy" compression. But when it comes to real-world scenes, no digital image format can retain all the information that your eyes can see. By comparison with a real-world scene, JPEG loses far less information than GIF. Both GIF and JPEG have their distinct advantages, depending on the types of images you are including on your page. If you are uncertain which the best is, it doesn't hurt to try both on the same image. Experiment and see which format gives you the best picture and the lowest cost in disk space. Neural networks: The area of Neural Networks probably belongs to the borderline between the Artificial Intelligence and Approximation Algorithms. Think of it as of algorithms for "smart approximation". The NNs are used in (to name few) universal approximation (mapping 50
    input to theoutput), tools capable of learning from their environment, tools for finding non-evident dependencies between data and so on. The Neural Networking algorithms (at least some of them) are modeled after the brain (not necessarily - human brain) and how it processes the information. The brain is a very efficient tool. Having about 100,000 times slower response time than computer chips, it (so far) beats the computer in complex tasks, such as image and sound recognition, motion control and so on. It is also about 10,000,000,000 times more efficient than the computer chip in terms of energy consumption per operation. The brain is a multi layer structure (think 6-7 layers of neurons, if we are talking about human cortex) with 10^11 neurons, structure, that works as a parallel computer capable of learning from the "feedback" it receives from the world and changing its design (think of the computer hardware changing while performing the task) by growing new neural links between neurons or altering activities of existing ones. To make picture a bit more complete, let's also mention, that a typical neuron is connected to 50-100 of the other neurons, sometimes, to itself, too. To put it simple, the brain is composed of neurons, interconnected. Structure of a neuron: 51
    Image compression usingback prop: Computer images are extremely data intensive and hence require large amounts of memory for storage. As a result, the transmission of an image from one machine to another can be very time consuming. By using data compression techniques, it is possible to remove some of the redundant information contained in images, requiring less storage space and less time to transmit. Neural nets can be used for the purpose of image compression, as shown in the following demonstration. A neural net architecture suitable for solving the image compression problem is shown below. This type of structure--a large input layer feeding into a small hidden layer, which then feeds into a large output layer--, is referred to as a bottleneck type network. The idea is this: suppose that the neural net shown below had been trained to implement the identity map. Then, a tiny image presented to the network as input would appear exactly the same at the output layer. Bottleneck-type Neural Net Architecture for Image Compression: In this case, the network could be used for image compression by breaking it in two as shown in the Figure below. The transmitter encodes and then transmits the output of the hidden layer (only 16 values as compared to the 64 values of the original image).The receiver receives and decodes the 16 hidden outputs and generates the 64 outputs. Since 52
the network is implementing an identity map, the output at the receiver is an exact reconstruction of the original image. The Image Compression Scheme using the Trained Neural Net: Actually, even though the bottleneck takes us from 64 nodes down to 16 nodes, no real compression has occurred, because unlike the 64 original inputs, which are 8-bit pixel values, the outputs of the hidden layer are real-valued (between -1 and 1), which could require an arbitrarily large number of bits to transmit. True image compression occurs when the hidden layer outputs are quantized before transmission. The figure below shows a typical quantization scheme using 3 bits to encode each hidden output. In this case, there are 8 possible binary codes which may be formed: 000, 001, 010, 011, 100, 101, 110, 111. Each of these codes represents a range of values for a hidden unit output. To compute the amount of image compression (measured in bits per pixel) for this level of quantization, we compute the ratio of the total number of bits transmitted (16 hidden outputs × 3 bits = 48 bits) to the total number of pixels in the original image chunk (64), giving 48/64 = 0.75 bits per pixel. (A small end-to-end sketch of the bottleneck network and this quantization step is given below.)
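A self-contained NumPy sketch of the whole scheme, training the 64-16-64 bottleneck network on random 8×8 chunks and then quantizing the 16 hidden outputs to 3 bits for transmission, is given below. The random "training image", learning rate, iteration count and weight initialization are illustrative assumptions; a real experiment would use an actual 256×256 image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 256x256 "training image", pixel values mapped to [-1, 1]
# (the pixel-to-real conversion mentioned in the text).
image = rng.integers(0, 256, (256, 256)).astype(np.float32) / 127.5 - 1.0

# Bottleneck network: 64 inputs -> 16 hidden -> 64 outputs, tanh units.
W1 = rng.normal(0, 0.1, (64, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 64)); b2 = np.zeros(64)
lr = 0.01

for step in range(20000):
    i, j = rng.integers(0, 249, 2)            # random 8x8 chunk, corner in 0..248
    x = image[i:i + 8, j:j + 8].reshape(64)   # input and target (identity map)
    h = np.tanh(x @ W1 + b1)                  # 16-value code sent by the transmitter
    y = np.tanh(h @ W2 + b2)                  # 64-value reconstruction at the receiver
    dy = (y - x) * (1 - y ** 2)               # backprop of the squared error
    dh = (dy @ W2.T) * (1 - h ** 2)
    W2 -= lr * np.outer(h, dy); b2 -= lr * dy
    W1 -= lr * np.outer(x, dh); b1 -= lr * dh

def quantise(h, bits=3):
    """3-bit quantisation of a hidden output in [-1, 1]: 8 levels, giving
    16 * 3 / 64 = 0.75 bits per pixel for an 8x8 chunk."""
    levels = 2 ** bits
    idx = np.clip(((h + 1) / 2 * levels).astype(int), 0, levels - 1)
    return (idx + 0.5) / levels * 2 - 1

x = image[0:8, 0:8].reshape(64)
y = np.tanh(quantise(np.tanh(x @ W1 + b1)) @ W2 + b2)
print("reconstruction MSE:", float(np.mean((y - x) ** 2)))
```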
The Quantization of Hidden Unit Outputs

The training of the neural net proceeds as follows: a 256x256 training image is used to train the bottleneck-type network to learn the required identity map. Training input-output pairs are produced from the training image by extracting small 8x8 chunks of the image chosen at uniformly random locations in the image. The easiest way to extract such a random chunk is to generate a pair of random integers to serve as the upper left-hand corner of the extracted chunk. In this case, we choose random integers i and j, each between 0 and 248, and then (i, j) is the coordinate of the upper left-hand corner of the extracted chunk. The pixel values of the extracted image chunk are sent (left to right, top to bottom) through the pixel-to-real mapping shown in the Figure below to construct the 64-dimensional neural net input. Since the goal is to learn the identity map, the desired target for the constructed input is the input itself; hence, this training pair is used to update the weights of the network.

The Pixel-to-Real and Real-to-Pixel Conversions:
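The sketch below illustrates this construction of training pairs in Python/NumPy: a random 8x8 chunk is cut from a 256x256 image, its 8-bit pixel values are passed through a pixel-to-real mapping (assumed here to be a simple linear map into [-1, 1], since the paper's exact conversion figure is not reproduced), and the flattened 64-dimensional vector serves as both input and target. The names and the stand-in training image are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256))   # stand-in for the 256x256 training image

def pixel_to_real(pixels):
    # Assumed linear pixel-to-real mapping: 8-bit values 0..255 -> roughly [-1, 1].
    return pixels / 127.5 - 1.0

def random_training_pair(img, chunk=8):
    # Upper left-hand corner: random integers i, j between 0 and 248 inclusive.
    i = rng.integers(0, img.shape[0] - chunk + 1)
    j = rng.integers(0, img.shape[1] - chunk + 1)
    block = img[i:i + chunk, j:j + chunk]
    x = pixel_to_real(block).reshape(-1)         # 64-dimensional input, row by row
    return x, x.copy()                           # identity map: the target is the input

x, target = random_training_pair(image)
print(x.shape, np.allclose(x, target))           # (64,) True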
Once training is complete, image compression is demonstrated in the recall phase. In this case, we still present the neural net with 8x8 chunks of the image, but now, instead of randomly selecting the location of each chunk, we select the chunks in sequence from left to right and from top to bottom. For each such 8x8 chunk, the output of the network can be computed and displayed on the screen to visually observe the performance of neural net image compression. In addition, the 16 outputs of the hidden layer can be grouped into a 4x4 "compressed image", which can be displayed as well.

CODE NO: EC 105 IS 6
BIOMETRIC AUTHENTICATION SYSTEM IN CREDIT CARDS
By B.VARSHA (04251A1711), C.NITHYA (04251A1712)
ETM 3/4, GNITS
ID: b_varshareddy@yahoo.com, nitchunduri_87@yahoo.com

ABSTRACT:
Catching ID thieves is like spear fishing during a salmon run: skewering one big fish barely registers when the vast majority just keeps on going. With birth dates, addresses, and Social Security and credit card numbers in hand, a thief can use a computer at a public library to order merchandise online, withdraw money from brokerage accounts, and apply for credit cards in other people's names. It's a security-obsessed world. Identity theft, racial profiling, border checkpoints, computer passwords ... it all boils down to a simple question: "Are you who you say you are?" Biometrics has developed a means to reliably answer this deceptively simple question by using fingerprint sensing in any type of wallet-sized plastic card: credit cards, ID cards, smart cards, driver's licenses, passports and more. In this paper we discuss biometric credit cards that use fingerprint sensing: how they work, how they improve on conventional authentication techniques, how effective they have been in increasing security and preventing ID theft, their limitations, and how they can be made more effective in the future.

INTRODUCTION:
It is far too easy to steal personal information these days, especially credit card numbers, which are involved in more than 67 percent of identity thefts,
according to a U.S. Federal Trade Commission study. It's also relatively easy to fake someone's signature or guess a password; thieves can often just look at the back of a credit or an ATM card, where some 30 percent of people actually write down their personal identification number (PIN) and so give the thief all that's needed to raid the account. But what if we all had to present our fingers to a scanner built into our credit cards to authenticate our identities before completing a transaction? Faking fingerprints would prove challenging to even the most technologically sophisticated identity thief. The sensors, processors, and software needed to make secure credit cards that authenticate users on the basis of their physical, or biometric, attributes are already on the market. But, concerned about biometric system performance, customer acceptance, and the cost of making changes to their existing infrastructure, the credit card issuers apparently would rather go on absorbing an expense equal to 0.25 percent of Internet transaction revenues and 0.08 percent of off-line revenues that now results from stolen credit card numbers. Our proposed system uses fingerprint sensors, though other biometric technologies, either alone or in combination, could be incorporated. The system could be economical, protect privacy, and guarantee the validity of all kinds of credit card transactions, including ones that take place at a store, over the telephone, or with an Internet-based retailer. By preventing identity thieves from entering the transaction loop, credit card companies could quickly recoup their infrastructure investments and save businesses, consumers, and themselves billions of dollars every year.
Current credit card authentication systems validate anyone, including impostors, who can reproduce the exclusive possessions or knowledge of legitimate cardholders. Presenting a physical card at a cash register proves only that you have a credit card in your possession, not that you are who the card says you are. Similarly, passwords or PINs do not authenticate your identity but rather your knowledge. Most passwords or PINs can be guessed with just a little information: an address, license plate number, birth date, or pet's name. Patient thieves can and do take pieces of information gleaned from the Internet or from mail found in the trash and eventually associate enough bits to bring a victim to financial grief. To ensure truly secure credit card transactions, we need to minimize this kind of human intervention in the authentication process. Such a major transition will come from using fingerprint sensing in credit cards, at a cost that credit card companies have so far declined to pay. They are particularly worried about the cost of transmitting and receiving biometric information between point-of-sale terminals and the credit card payment system. They also fret that some customers, anxious about having their biometric information floating around cyberspace, might not adopt the cards. To address these concerns, we offer an outline for a self-contained smart-card system that we believe could be implemented within the next few years.

WORKING OF THE AUTHENTICATION SYSTEM:
When activating your new card, you would load an image of your fingerprint onto the card. To do this, you would press your finger against a sensor in the card: a silicon chip containing an array of microcapacitor plates. (In large quantities, these fingerprint-sensing chips cost only about $5 each.) The surface of the skin serves as a second layer of plates for each microcapacitor, and the air gap acts as the dielectric medium. A small electrical charge is created between the finger surface and the capacitor plates in the chip. The magnitude of the charge depends on the distance between the skin surface and the plates. Because the ridges in the fingerprint pattern are closer to the silicon chip than the valleys, ridges and valleys result in different capacitance values across the matrix of plates. The capacitance values of different plates are measured and
converted into pixel intensities to form a digital image of the fingerprint (with reference to the figure). Next, a microprocessor in the smart card extracts a few specific details, called minutiae, from the digital image of the fingerprint. Minutiae include locations where the ridges end abruptly and locations where two or more ridges merge, or where a single ridge branches out into two or more ridges. Typically, in a live-scan fingerprint image of good quality, there are 20 to 70 minutiae; the actual number depends on the size of the sensor surface and the placement of the finger on the sensor. The minutiae information is encrypted and stored, along with the cardholder's identifying information, as a template in the smart card's flash memory. At the start of a credit card transaction, you would present your smart credit card to a point-of-sale terminal. The terminal would establish secure communications channels between itself and your card, via communications chips embedded in the card, and with the credit card company's central database, via Ethernet. The terminal would then verify that your card has not been reported lost or stolen, by exchanging encrypted information with the card in a predetermined sequence and checking its responses against the credit card database. Next, you would touch your credit card's fingerprint sensor pad. The matcher, a software program running on the card's microprocessor, would compare the signals from the sensor to the biometric template stored in the card's memory. The matcher would determine the number of corresponding minutiae and calculate a fingerprint similarity result, known as a matching score. Even in ideal situations, not all minutiae from input and template prints taken from the same finger will match. So the matcher uses what's called a threshold parameter to decide whether a given pair of feature sets belongs to the same finger or not. If there's a match, the card sends a digital signature and a time stamp to the point-of-sale terminal. The entire matching process could take less than a second, after which the card is accepted or rejected.
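A highly simplified sketch of such a minutiae matcher is given below in Python/NumPy. A real matcher must first align the two prints and cope with rotation, distortion and missing minutiae; here we only count template minutiae that have a scanned minutia within a small distance tolerance and compare the resulting score with a threshold. All names and parameter values are illustrative assumptions, not taken from the paper.

import numpy as np

def matching_score(template, probe, tol=10.0):
    # Fraction of template minutiae with a probe minutia within 'tol' pixels.
    # 'template' and 'probe' are arrays of (x, y) minutiae locations, assumed
    # already aligned; ridge angle and minutia type are ignored in this sketch.
    matched = 0
    for t in template:
        dists = np.linalg.norm(probe - t, axis=1)
        if dists.size and dists.min() <= tol:
            matched += 1
    return matched / len(template)

def accept(template, probe, threshold=0.6):
    # Threshold decision: accept the finger only if enough minutiae correspond.
    return matching_score(template, probe) >= threshold

template = np.array([[10, 20], [35, 80], [60, 45], [90, 90]], dtype=float)
probe = template + np.random.normal(0, 3, template.shape)   # same finger, noisy scan
print(matching_score(template, probe), accept(template, probe))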
The point-of-sale terminal sends both the vendor information and your account information to the credit card company's transaction-processing system. Your private biometric information remains safely on the card, which ideally never leaves your possession. But say your card is lost or stolen. First of all, it is unlikely that a thief could recover your fingerprint data, because it is encrypted and stored on a flash memory chip that very, very few thieves would have the resources to access and decrypt. Nevertheless, suppose that an especially industrious, and perhaps unusually attractive, operator does get hold of the fingerprint of your right index finger, say off a cocktail glass at a hotel bar where you really should not have been drinking. Then this industrious thief manages to fashion a latex glove molded in a slab of gelatin containing a nearly flawless print of your right index finger, painstakingly transferred from the cocktail glass. Even such an effort would fail, thanks to new applications that test the vitality of the biometric signal. One identifies sweat pores, which are just 0.1 millimeter across, in the ridges using high-resolution fingerprint sensors. We could also detect spoofs by measuring the conduction properties of the finger using electric field sensors. Software-based spoof detectors aren't far behind. Researchers are differentiating the way a live finger deforms the surface of a sensor from the way a dummy finger does. With software that applies the deformation parameters to live scans, we can automatically distinguish between a real and a dummy finger 85 percent of the time, enough to make your average identity thief think twice before fashioning a fake finger.
FINGERPRINT MATCHING: In this simplified diagram, the matching process consists of minutiae extraction followed by alignment and determination of the minutiae that correspond to those stored as a template in the card's flash memory. Even prints from the same finger won't ever match exactly, because of dirt, sweat, smudging, or placement on the sensor. Therefore, the system has a threshold parameter: a maximum number of mismatched minutiae that a scanned fingerprint can have, beyond which the card will reject the print as inauthentic. In the case shown, just three minutiae don't match up, and the user is positively authenticated.

A version of the system designed to protect Internet shoppers might be even easier to implement, and less expensive, too. When mulling the costs and benefits of biometric credit cards, card issuers might well decide to first deploy biometric authentication systems for Internet transactions, which is where ID thieves cause them
the most pain. A number of approaches could work, but here's a simple one that adapts some of the basic concepts from our proposed smart-card system. To begin with, you'd need a PC equipped with a biometric sensing device such as a fingerprint sensor, a camera for iris scans, or a microphone for taking a voice signature. Next, you'd need to enroll in your credit card company's secure e-commerce system. You would first download and install a biometric credit card protocol plug-in for your Web browser. The plug-in, certified by the credit card company, would enable the computer to identify its sensor peripherals so that biometric information registered during the enrollment process could be traced back to specific sensors on a specific PC. After the sensor scanned your fingerprints, you would have to answer some of the old authentication questions, such as your Social Security number, mother's maiden name, or PIN. Once the system authenticated you, the biometric information would be officially certified as valid by the credit card company and stored as an encrypted template on your PC's hard drive. During your initial purchase after enrollment, perhaps buying a nice shirt from your favorite online retailer, you would go through a conventional authentication procedure that would prompt you to touch your PC's finger scanner. The credit card protocol plug-in would then function as a matcher and would compare the live biometric scan with the encrypted, certified template on the hard drive. If there were a match, your PC would send a certified digital signature to the credit card company, which would release funds to the retailer, and your shirt would be on its way. Accepting the charge for the shirt on the next bill by paying for it would confirm to the card issuer that you are the person who enrolled the fingerprints stored on the PC. From then on, each time you made an online purchase, you would touch the fingerprint sensor, the plug-in would confirm your identity, and your PC would send the digital signature to your credit card company, authorizing it to release funds to the vendor.
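The client-side logic of such a plug-in might look roughly like the sketch below in Python: match the live scan against the locally stored template and, only on success, sign the transaction details with key material certified during enrollment. The byte-comparison "matcher", the HMAC-based stand-in for the certified digital signature, and all names are illustrative assumptions, not the actual protocol described here.

import hashlib
import hmac

# Illustrative stand-ins; a real plug-in would use issuer-certified key material
# and a proper minutiae matcher, not these simplifications.
ENROLLED_TEMPLATE = b"...encrypted template stored at enrollment..."
CERTIFIED_KEY = b"key-certified-by-card-issuer"

def scans_match(live_scan, template):
    # Placeholder matcher; real matching compares minutiae, not raw bytes.
    return live_scan == template

def authorize_purchase(live_scan, transaction):
    # Return a signature over the transaction if the live scan matches, else None.
    if not scans_match(live_scan, ENROLLED_TEMPLATE):
        return None                      # plug-in rejects the attempted purchase
    # On a match, sign the transaction details so the issuer can release funds.
    return hmac.new(CERTIFIED_KEY, transaction.encode(), hashlib.sha256).hexdigest()

print(authorize_purchase(ENROLLED_TEMPLATE, "order#1234: one nice shirt"))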
If someone else tried to use his fingerprints on your machine, the plug-in would recognize that the live scan didn't match the stored template and would reject the attempted purchase. If someone stole your credit card number, enrolled her own fingerprints on her own PC, and went on an online shopping spree, you would dispute the charges on your next bill and the credit card issuer would have to investigate.
OTHER APPLICATIONS OF FINGERPRINT SENSING:
· Fingerprint-sensing biometric pen: the pen uses biometric authentication to verify the identity of a signer, in less than one second after the signer grips the pen.
· Fingerprint-sensing biometrics can also be used in ID cards, passports and visas, driver's licenses, and traffic control.
· Access control: since the 9/11 tragedy, the need for secure access to buildings and various facilities has become more apparent. Each person needing access to a facility carries an ID card containing their personal biometric information and any other data necessary for the particular application; all of this information is stored on the ID card. Building entrances are equipped with a fingerprint scanner, and a control box connects the security system to the building's local area network and/or the Internet. The security system can then be accessed either through the facility LAN (intranet method) or through the Internet (Internet method) to monitor the entire security system in the building or facility, including access, video monitoring, visitor passes, entry and exit times, etc.
· Pay-by-touch fingerprint scanners are used to buy groceries.
· Fingerprint sensors are used in mobile phones for security.
ADVANTAGES:
· High security
· Reduces ID theft
· Reduces the burden of remembering passwords
· Easier to implement

LIMITATIONS & REMEDIES:
Any biometric system is prone to two basic types of errors:
· False positive: the system incorrectly declares a successful match between, in our case, the fingerprint of an impostor and that of the legitimate cardholder; in other words, a thief manages to pass himself off as you and gains access to your accounts.
· False negative: the system fails to make a match between your fingerprint and your stored template, i.e., the system does not recognize you and therefore denies you access to your own account.
Some errors might be avoided by using improved sensors. For instance, optical sensors capture fingerprint details better than capacitive fingerprint sensors and are as much as four times as accurate. Even more accurate than conventional optical sensors, the new multi-spectral sensor distinguishes structures in living skin according to the light-absorbing and light-scattering properties of different layers. By illuminating the finger surface with light of different wavelengths, the sensor captures an especially detailed image of the fingerprint pattern just below the skin surface and does a better job of taking prints from dry, wet, or dirty fingers.
· Cost: costs are declining for all of the major smart-card components, including flash memory, microprocessors, communications chips, and fingerprint sensors.

CONCLUSION:
Biometric authentication systems based on available technology would be a major improvement over conventional authentication techniques. If widely implemented, such systems could put thousands of ID thieves out of business and spare countless individuals the nightmare of trying to get their good names and credit back. Though the technology to implement these systems already exists, ongoing research efforts aimed at improving the performance of biometric systems in general, and sensors in particular, will make them even more reliable, robust, and convenient.

REFERENCES:
· www.ieee.org
· www.spectrum.ieee.org
· www.howstuffworks.com
· www.google.com
· IEEE Journals

CODE NO: EC 108 IS 7
AUTOMATIC SPEAKER RECOGNITION SYSTEM
BY
P. MEGHANA REDDY (06050475), D. VEENA RAO (06050516)
ECE THIRD YEAR, VASAVI COLLEGE OF ENGG., IBRAHIMBAGH, HYDERABAD
MAIL ID: megha2828@gmail.com, veenarao_sep@yahoo.co.uk
PH. NO: 9989194272, 9866160356

ABSTRACT
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers. The goal of this work is to build a simple, yet complete and representative, automatic speaker recognition system using MATLAB software. The system developed here is tested on a small (but already non-trivial) speech database. There are 8 male speakers, labeled S1 to S8. All speakers uttered the same single digit "zero" once in a training session and once in a testing session later on. A digit vocabulary is used very often in testing speaker recognition because of its applicability to many security applications. For example, users have to speak a PIN (Personal Identification Number) in order to gain access to a laboratory door, or users have to speak their credit card number over the telephone line. By checking the voice characteristics of the input utterance using an automatic speaker
recognition system similar to the one developed here, the system is able to add an extra level of security.

INTRODUCTION
Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems. Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. When the task is to identify the person talking rather than what he is saying, the speech signal must be processed to extract measures of speaker variability instead of segmental features. There are two sources of variation among speakers: differences in vocal cords and vocal tract shape, and differences in speaking style. At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure of identifying the unknown speaker by comparing features extracted from his voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections. All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment session or training phase, while the second is referred to as the operation session or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a
reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.

Figure 1(a): Speaker identification
Figure 1(b): Speaker verification

Automatic speaker recognition is based on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speakers themselves: speech signals in training and testing sessions can be greatly different due to many factors, such as people's voices changing with time, health
conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).

Speech Feature Extraction
The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (of the order of 1/5 of a second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task. Mel-Frequency Cepstrum Coefficients (MFCCs) are perhaps the best known and most popular, and these will be used in this project. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.

Mel-frequency cepstrum coefficients processor
A block diagram of the structure of an MFCC processor is given in Figure 2. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital
conversion. These sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.

Figure 2: Block diagram of the MFCC processor

1. Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames.

2. Windowing
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N-1,
where N is the number of samples in each frame, then the result of windowing is the signal y(n) = x(n) w(n), 0 ≤ n ≤ N-1. Typically the Hamming window is used, which has the form w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1.

3. Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_k} as follows:

X_n = Σ_{k=0..N-1} x_k e^(-j2πkn/N),  n = 0, 1, 2, ..., N-1.

In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 ≤ n ≤ N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N - 1. Here, Fs denotes the sampling frequency. The result obtained after this step is often referred to as the signal's spectrum or periodogram.

4. Mel-frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as
1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 log10(1 + f/700).

One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component. That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. Note that this filter bank is applied in the frequency domain.

5. Cepstrum
In this final step, the log mel spectrum is converted back to time. The result is called the mel-frequency cepstrum coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, they can be converted to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the last step by S_k, k = 1, 2, ..., K, we can calculate the MFCCs, c_n, as

c_n = Σ_{k=1..K} (log S_k) cos[n (k - 1/2) π / K],  n = 1, 2, ..., K.

Note that we exclude the first component, c_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
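As an illustration of steps 1-5 (frame blocking, windowing, FFT, mel wrapping and the DCT), here is a compact Python/NumPy sketch of an MFCC-style front end. It is not the authors' MATLAB implementation; the filter-bank construction is simplified, and the frame length, frame step and number of filters are arbitrary example values.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters spaced uniformly on the mel scale (simplified).
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(signal, fs, frame_len=256, frame_step=100, n_filters=20, n_ceps=12):
    # 1. Frame blocking: frames of N samples, adjacent frames separated by M samples.
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    frames = np.stack([signal[i * frame_step: i * frame_step + frame_len]
                       for i in range(n_frames)])
    # 2. Windowing with a Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)).
    frames = frames * np.hamming(frame_len)
    # 3. FFT: power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # 4. Mel-frequency wrapping: output power of the triangular mel filters.
    mel_spec = power @ mel_filterbank(n_filters, frame_len, fs).T
    # 5. Cepstrum: DCT of the log mel spectrum; c_0 (the mean term) is dropped.
    log_mel = np.log(mel_spec + 1e-10)
    k = np.arange(1, n_filters + 1)
    basis = np.array([np.cos(n * (k - 0.5) * np.pi / n_filters)
                      for n in range(1, n_ceps + 1)])
    return log_mel @ basis.T            # one acoustic vector per frame

fs = 16000
test_signal = np.random.randn(fs)       # one second of noise as a stand-in utterance
print(mfcc(test_signal, fs).shape)      # (number of frames, n_ceps)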
Summary
By applying the procedure described above, a set of mel-frequency cepstrum coefficients is computed for each speech frame of around 30 msec, with overlap. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector; therefore each input utterance is transformed into a sequence of acoustic vectors. In the next section we will see how those acoustic vectors can be used to represent and recognize the voice characteristics of the speaker.

FEATURE MATCHING
The problem of speaker recognition belongs to a much broader topic in science and engineering called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are the sequences of acoustic vectors extracted from the input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching. Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. This is exactly our case, since during the training session we label each input speech sample with the ID of the speaker (S1 to S8). These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. There are many feature matching techniques used in speaker recognition. In this project the Vector Quantization (VQ) approach is used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. Figure 3 shows a conceptual diagram illustrating this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The
circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his training acoustic vectors. The resulting codewords (centroids) are shown in Figure 3 by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

Figure 3: Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.

Clustering the Training Vectors
After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There
is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according to the rule y_n+ = y_n(1 + ε), y_n- = y_n(1 - ε), where n varies from 1 to the current size of the codebook and ε is a splitting parameter (we choose ε = 0.01).
3. Nearest-neighbor search: for each training vector, find the codeword in the current codebook that is closest (in terms of the similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. Figure 4 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure, which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.
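Below is a minimal Python/NumPy sketch of the LBG procedure just described: splitting with ε = 0.01, nearest-neighbour assignment, centroid update, and iteration until the average distortion stops improving. It is an illustrative re-implementation, not the authors' MATLAB code, and it omits refinements such as special handling of empty cells. M is assumed to be a power of two, as with the 16-vector codebook used in this project.

import numpy as np

def lbg(training_vectors, M, eps=0.01, tol=1e-3):
    # Design an M-vector codebook from the L training vectors (LBG algorithm).
    # 1. Start with a 1-vector codebook: the centroid of all training vectors.
    codebook = training_vectors.mean(axis=0, keepdims=True)
    while codebook.shape[0] < M:
        # 2. Split each codeword y_n into y_n(1 + eps) and y_n(1 - eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev_distortion = np.inf
        while True:
            # 3. Nearest-neighbour search: assign each vector to its closest codeword.
            d = np.linalg.norm(training_vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # 5. Compute the average distortion D and stop when it no longer improves.
            distortion = d[np.arange(len(training_vectors)), nearest].mean()
            if prev_distortion - distortion < tol:
                break
            prev_distortion = distortion
            # 4. Centroid update: recompute each codeword from the vectors in its cell.
            for n in range(codebook.shape[0]):
                members = training_vectors[nearest == n]
                if len(members) > 0:
                    codebook[n] = members.mean(axis=0)
    return codebook

acoustic_vectors = np.random.randn(500, 12)   # e.g. 500 twelve-dimensional MFCC vectors
print(lbg(acoustic_vectors, M=16).shape)      # (16, 12)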
Figure 4: Flow diagram of the LBG algorithm

IMPLEMENTATION
All the steps outlined in the previous sections are implemented in MATLAB (The MathWorks, Inc.), and the system developed here is tested on a small speech database. There are 8 male speakers, labeled S1 to S8. All speakers uttered the same single digit "zero" once in a training session and once in a testing session later on. Figures 5 to 15 show the results of all the steps in the speaker recognition task. First, the MFCCs for one speaker are computed; this is illustrated in Figures 5 to 11. In Figure 5, the input speech signal of one of the speakers is plotted against time. It should be obvious that the raw data in the time domain is very voluminous and difficult to analyze for voice characteristics, so the motivation for the speech feature extraction step should now be clear. Next the speech signal (a vector) is cut into frames with overlap. The output of this is a matrix where each column is a frame of N samples from the original speech signal, which is displayed in Figure 6. The signal is then windowed by means of a Hamming window. The result is again a similar matrix, except that each frame (column) has been windowed, as shown in Figure 7. The FFT is applied, transforming the signal into the frequency domain; the output is displayed in Figure 8. Applying these two steps, windowing and FFT, is referred to as the Windowed Fourier Transform (WFT) or
Short-Time Fourier Transform (STFT). The result is often called the spectrum or periodogram. The last step in speech processing is converting the spectrum into mel-frequency cepstrum coefficients. This is accomplished by generating a mel-frequency filter bank with the characteristics shown in Figure 9 and multiplying it, in the frequency domain, with the FFT output obtained in the previous step, yielding the mel spectrum shown in Figure 10. Finally, the mel-frequency cepstrum coefficients (MFCCs) are generated by taking the discrete cosine transform of the logarithm of the mel spectrum obtained in the previous step; the MFCCs are shown in Figure 11. A similar procedure is followed for all the remaining speakers, and MFCCs for all the speakers are computed. To inspect the acoustic space (MFCC vectors), any two dimensions (say the 5th and the 6th) are picked and the data points are plotted in a 2D plane, as shown in Figure 12. The LBG algorithm is then applied to the set of MFCC vectors obtained in the previous stage; the intermediate stages are shown in Figures 13, 14 and 15. Finally, the system is trained for all the speakers and a speaker-specific codebook is generated for each. After this training step, the system has knowledge of the voice characteristics of each (known) speaker. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.
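The decision rule of this recognition phase can be sketched in a few lines of Python/NumPy: quantize the unknown utterance's acoustic vectors with each speaker's codebook and pick the speaker whose codebook gives the smallest average distortion. As before, this is an illustrative sketch rather than the authors' MATLAB code, and the random codebooks and test vectors are invented stand-ins.

import numpy as np

def vq_distortion(acoustic_vectors, codebook):
    # Average distance from each acoustic vector to its closest codeword.
    d = np.linalg.norm(acoustic_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify_speaker(acoustic_vectors, codebooks):
    # codebooks: dict mapping a speaker label (e.g. 'S1'..'S8') to its VQ codebook.
    distortions = {spk: vq_distortion(acoustic_vectors, cb) for spk, cb in codebooks.items()}
    return min(distortions, key=distortions.get)

rng = np.random.default_rng(1)
codebooks = {"S1": rng.normal(0, 1, (16, 12)), "S2": rng.normal(3, 1, (16, 12))}
test_vectors = rng.normal(3, 1, (200, 12))        # utterance resembling speaker S2
print(identify_speaker(test_vectors, codebooks))  # expected: S2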
RESULTS

Figure 5: An Input Speech Signal
Figure 6: After Frame Blocking
Figure 7: After Windowing
Figure 8: After Short-Time Fourier Transform
Figure 9: A Mel-Spaced Filter Bank
Figure 10: After Mel-Frequency Wrapping
Figure 11: Mel-Frequency Cepstrum Coefficients
Figure 12: Training Vectors as Points in a 2D Space
Figure 13: The Centroid of the Entire Set
Figure 14: The Centroid Split into Two Using the LBG Algorithm
Figure 15: The Final 16-Vector Codebook Generated Using the LBG Algorithm

CONCLUSIONS & DISCUSSIONS
As the codebook size is increased, recognition performance improves, but with further increases the improvement is smaller than expected; that is, the rate of improvement diminishes as the codebook size grows. The most distinctive feature of the proposed speaker-based VQ model is its multiple representation, or partitioning, of a speaker's spectral space. The VQ speaker model, while allowing some amount of overlap between different speakers' codebooks, is quite capable of discriminating impostors from a true speaker because of this distinctive feature.
MFCCs allow better suppression of insignificant spectral variation in the higher frequency bands. Another obvious advantage is that mel-frequency cepstrum coefficients form a particularly compact representation. It is useful to examine the lack of commercial success of automatic speaker recognition compared to that of speech recognition. Both speech and speaker recognition analyze speech signals to extract spectral parameters such as cepstral coefficients. Both often employ similar template matching methods, the same distance measures, and similar decision procedures. Speech and speaker recognition, however, have different objectives: selecting which of M words was spoken vs. which of N speakers spoke. Speech analysis techniques have primarily been developed for phonemic analysis, e.g., to preserve phonemic content during speech coding or to aid phoneme identification in speech recognition. Our understanding of how listeners exploit spectral cues to identify human sounds exceeds our knowledge of how we distinguish speakers. For text-dependent automatic speaker recognition, using template-matching methods borrowed directly from speech recognition yields good results in limited tests, but performance decreases under adverse conditions that might be found in practical applications. For example, telephone distortions, uncooperative speakers, and speaker variability over time often lead to accuracy levels unacceptable for many applications.

REFERENCES
[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
[3] S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences".
[4] F.K. Song, A.E. Rosenberg and B.H. Juang, "A vector quantisation approach to speaker recognition", AT&T Technical Journal, March 1987.
[5] Douglas O'Shaughnessy, "Speaker Recognition", IEEE Acoustic, Speech, Signal Processing Magazine, October 1986.
[6] S. Furui, "A Training Procedure for Isolated Word Recognition Systems", IEEE Transactions on Acoustic, Speech, Signal Processing, April 1980.