HGR Project Report Page 1
Chapter 1
Introduction
This chapter will give the reader an insight into what this project work is all about.
1.1 Overview:
Computers are used by many people, both at work and in their spare time. Special input and output devices have been designed over the years with the purpose of easing the communication between computers and humans; the two best known are the keyboard and the mouse. Every new device can be seen as an attempt to make the computer more intelligent and to enable humans to carry out more complex communication with it. This has been possible due to the result-oriented efforts made by computer professionals in creating successful human-computer interfaces. As the complexity of human needs has grown many-fold and continues to grow, the ability to handle complex programming and intuitiveness have become critical attributes for computer programmers to survive in a competitive environment. Computer programmers have been incredibly successful in easing the communication between computers and humans. Every new product that emerges in the market attempts to ease the complexity of the jobs performed; for instance, such interfaces have facilitated tele-operation, robotic use, and better human control over complex work systems such as cars, planes and monitoring systems. Earlier, computer programmers avoided such complex programs, as the focus was more on speed than on other modifiable features. However, a shift towards a user-friendly environment has driven them to revisit this focus area.
With the development of information technology in our society, we can expect computer systems to be embedded into our environment to an ever larger extent. These environments will impose the need for new types of human-computer interaction, with interfaces that are natural and easy to use. The hand is a natural and powerful means of communication that conveys information very effectively. Hand gesture recognition is therefore an important aspect of human-computer interaction, and can be used in various applications, such as virtual reality and computer games.
As this technological world advances, the interaction between humans and devices is becoming ever closer. To take a step further in making these devices more user friendly, indirect interaction between users and devices is needed instead of direct contact. By utilizing generic human capabilities such as vision, voice and gestures, we can establish indirect communication between devices and humans. With this view of human-computer interaction, this project, "Vision Based Hand Gesture Recognition", uses one such capability to enable the user to operate the computer mouse on the display with hand gestures.
The user interface (UI) of the personal computer has evolved from a text-based command line to a graphical interface with keyboard and mouse inputs. However, these devices are inconvenient and unnatural for many tasks. The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction (HCI). In particular, visual interpretation of hand gestures can help in achieving the ease and naturalness desired for HCI.
Vision has the potential of carrying a wealth of information in a non-intrusive manner and at a low cost; therefore it constitutes a very attractive sensing modality for hand gesture recognition. Recent research in computer vision has established the importance of gesture recognition systems for the purpose of human-computer interaction. Two approaches are commonly used to interpret gestures for human-computer interaction:
1.1.1 Methods Which Use Data Gloves:
This method employs sensors (mechanical or optical) attached to a glove that
transduces finger flexions into electrical signals for determining the hand posture. This
approach forces the user to carry a load of cables which are connected to the computer
and hinders the ease and naturalness of the user interaction.
1.1.2 Methods Which are Vision Based:
Computer vision based techniques are non-invasive and based on the way human beings perceive information about their surroundings. Although it is difficult to design a vision based interface for generic usage, it is feasible to design such an interface for a controlled environment.
1.2 Gestures:
It is hard to settle on a specific, useful definition of gestures due to their wide variety of applications, and a statement can only specify a particular domain of gestures. Many researchers have tried to define gestures, but their actual meaning is still arbitrary. Bobick and Wilson have defined gestures as the motion of the body that is intended to communicate with other agents. In the context of this project, a gesture is defined as an expressive movement of body parts carrying a particular message, to be communicated precisely between a sender and a receiver.
A gesture is scientifically categorized into two distinct categories: dynamic and static. A dynamic gesture changes over a period of time, whereas a static gesture is observed at a single instant of time. A waving hand meaning goodbye is an example of a dynamic gesture, and the stop sign is an example of a static gesture. To understand a full message, it is necessary to interpret all the static and dynamic gestures over a period of time. This complex process is called gesture recognition: the process of recognizing and interpreting a continuous stream of sequential gestures from the given set of input data.
1.3 Gesture Based Applications:
Gesture based applications are broadly classified into two groups on the basis of
their purpose: multidirectional control and a symbolic language.
3D Design: CAD (computer aided design) is an HCI domain that provides a platform for the interpretation and manipulation of 3-dimensional inputs, which can be gestures. Manipulating 3D inputs with a mouse is a time-consuming task, as it involves the complicated process of decomposing a six-degree-of-freedom task into at least three sequential two-degree tasks. The Massachusetts Institute of Technology [3] has come up with the 3DRAW technology, which uses a pen embedded in a Polhemus device to track the pen's position and orientation in 3D. A 3Space sensor is embedded in a flat palette representing the plane on which the objects rest. The CAD model is moved synchronously with the user's gesture movements, and objects can thus be rotated and translated in order to view them from all sides as they are being created and altered.
Telepresence: The need for manual operations may arise in cases such as system failure, emergency or hostile conditions, or inaccessible remote areas, where it is often impossible for human operators to be physically present near the machines [4]. Telepresence is the area of technical intelligence that aims to provide physical operation support by mapping the operator's arm to a robotic arm to carry out the necessary task. For instance, the real-time ROBOGEST system constructed at the University of California, San Diego presents a natural way of controlling an outdoor autonomous vehicle by the use of a language of hand gestures. The prospects of telepresence include space and undersea missions, medicine, manufacturing, and the maintenance of nuclear power reactors.
Virtual reality: Virtual reality refers to computer-simulated environments that can simulate physical presence in places in the real world, as well as in imaginary worlds. Most current virtual reality environments are primarily visual experiences, displayed either on a computer screen or through special stereoscopic displays [6]. Some simulations also include additional sensory information, such as sound through speakers or headphones. Some advanced haptic systems now include tactile information, generally known as force feedback, in medical and gaming applications.
Sign Language: Sign languages are among the most raw and natural forms of language and can be dated back to as early as the advent of human civilization, when the first theories of sign language appeared in history; they may have started even before the emergence of spoken languages. Since then, sign language has evolved and been adopted as an integral part of our day-to-day communication. Today, sign languages are used extensively in international sign communication among the deaf, in the world of sports, for religious practices and also at workplaces. Gestures are one of the first forms of communication a child uses to express its need for food, warmth and comfort. They enhance the emphasis of spoken language and help in expressing thoughts and feelings effectively.
Nowadays, a great deal of research is being carried out on human hand gestures in order to interpret them for PC control.
1.4 Purpose and Objective:
The main purpose of this project is to create an application which can identify specific human hand gestures and interpret them for PC control (mouse operations). The use of hand gestures provides an attractive alternative to cumbersome interface devices for human-computer interaction. Basically, this project involves two parts: first, image acquisition from a static webcam and recognition of specific hand gestures; second, after a specific hand gesture is determined, passing this output to a Java program which performs mouse events based on the given hand gesture. A further aim of this project is to help new researchers learn about, and carry out further research on, their topic of interest, which in this case is human hand gesture recognition for PC control.
1.5 Layout of the thesis:
Chapter-1 introduces the reader to the Human Motion Detection. All the
background details and the prior knowledge required to correctly understand this project
work have been briefly covered in this chapter.
Chapter-2 includes the details of the literature survey carried out before starting
this work as well as in between the project work so as to meet the desired
objective.
Chapter-3, titled "Proposed Work", contains the various features that have been used and the proposed method, along with the details of the database used.
Chapter-4 demonstrates the correctness of the proposed system by showing the results and performance values.
Chapter-5 talks about the conclusions derived and the future work along with the listing
of the references that have been used.
1.6 Conclusion:
In this chapter, an outline of the project work has been sketched to give an insight into "Vision Based Hand Gesture Recognition". The motivation for this work has been discussed, along with the report organization details.
Chapter 2
Literature Review
In this chapter, we will look at several hand gesture identification techniques and
methodologies that have been researched and implemented by other researchers.
There are several approaches to identifying human hand gestures; here we present some of those we studied during the literature survey for this project.
1. Feature Matching
2. Machine Learning
3. Segmentation based
2.1 Feature Matching:
This approach is based on comparing image features between two images. Two matchers are commonly used to perform this operation.
1. Brute Force Matcher
2. FLANN Based Matcher
2.1.1 Basics of Brute-Force Matcher:
The Brute-Force matcher is simple: it takes the descriptor of one feature in the first set and matches it against all features in the second set using some distance calculation, and the closest one is returned.
For the BF matcher, we first have to create the BFMatcher object using cv2.BFMatcher(). It takes two optional parameters. The first one is normType, which specifies the distance measurement to be used. By default it is cv2.NORM_L2, which is good for SIFT, SURF, etc.
(cv2.NORM_L1 is also available). For binary-string-based descriptors like ORB, BRIEF, BRISK, etc., cv2.NORM_HAMMING should be used, which uses Hamming distance as the measurement. If ORB is used with WTA_K == 3 or 4, cv2.NORM_HAMMING2 should be used.
The second parameter is a boolean variable, crossCheck, which is false by default. If it is true, the matcher returns only those matches (i, j) such that the i-th descriptor in set A has the j-th descriptor in set B as its best match and vice versa; that is, the two features in both sets should match each other. This provides consistent results and is a good alternative to the ratio test proposed by D. Lowe in the SIFT paper.
Once the matcher is created, two important methods are BFMatcher.match() and BFMatcher.knnMatch(). The first returns the best match; the second returns the k best matches, where k is specified by the user. This is useful when we need to do additional work on the matches. Just as cv2.drawKeypoints() draws keypoints, cv2.drawMatches() helps us draw the matches: it stacks the two images horizontally and draws lines from the first image to the second showing the best matches. There is also cv2.drawMatchesKnn, which draws all k best matches; if k = 2, it will draw two match lines for each keypoint, so we have to pass a mask if we want to draw them selectively.
2.1.2 FLANN Based Matcher:
FLANN stands for Fast Library for Approximate Nearest Neighbors. It contains a collection of algorithms optimized for fast nearest-neighbor search in large datasets and for high-dimensional features. It works faster than BFMatcher for large datasets. We will see an example with the FLANN-based matcher.
For the FLANN-based matcher, we need to pass two dictionaries that specify the algorithm to be used, its related parameters, etc. The first one is IndexParams. For various algorithms, the information to be passed is explained in the FLANN docs. As a summary, for algorithms like SIFT, SURF, etc., you can pass the following:
index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)
While using ORB, you can pass the following. The commented values are recommended in the docs, but they did not provide the required results in some cases; other values worked fine:
index_params= dict(algorithm = FLANN_INDEX_LSH,
table_number = 6, # 12
key_size = 12, # 20
multi_probe_level = 1) #2
The second dictionary is SearchParams. It specifies the number of times the trees in the index should be recursively traversed. Higher values give better precision but also take more time. If you want to change the value, pass search_params = dict(checks = 100).
Both the Brute-Force matcher and the FLANN-based matcher can work with descriptors from either the SIFT or the SURF algorithm. Let us discuss what the SIFT and SURF algorithms are.
2.1.3 Scale Invariant Feature Transform (SIFT):
SIFT is designed to be robust to image rotation, scaling, viewpoint changes, noise and illumination changes. First, keypoints are extracted from the image; then neighborhood regions are picked around each keypoint and feature descriptors are computed. The feature descriptors are extracted and stored in a database, and descriptor matching is based on Euclidean distance.
Algorithm:
1. Scale space extrema detection
2. Keypoint localization
3. Orientation assignment
4. Descriptor generation
Scale space extrema detection:
Keypoints are detected here. The image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken. Keypoints are then taken as the maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.
Keypoint localization:
Once potential keypoint locations are found, they have to be refined to get more accurate results. A Taylor series expansion of the scale space is used to get a more accurate location of each extremum, and if the intensity at this extremum is less than a threshold value (0.03), it is rejected. The DoG function also has strong responses along edges, so to increase stability we need to eliminate keypoints that have poorly determined locations but high edge responses.
Orientation assignment:
Each keypoint is assigned one or more orientations based on local image gradient directions. A neighborhood is taken around the keypoint location depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created, and the highest peak in the histogram is taken to compute the orientation.
Keypoint descriptor:
A 16x16 neighborhood around the keypoint is taken and divided into 16 sub-blocks of 4x4 size. For each sub-block, an 8-bin orientation histogram is created, giving a 128-dimensional descriptor (16 x 8).
2.1.4 Speeded Up Robust Features (SURF):
SURF works much faster than SIFT. For feature description, SURF uses wavelet responses in the horizontal and vertical directions. The detector is based on the Hessian matrix, chosen for its good performance in accuracy.
2.2 Machine Learning:
In classification system will be trained with some sample train data. Later test data
will be given to the system , it will find the given gesture belongs to which family.
1. K-nearest neighbor
2. Support Vector Machine
2.2.1 K-nearest neighbor:
KNN is one of the simplest classification algorithms available for supervised learning. The idea is to search for the closest match to the test data in feature space: we check its k nearest neighbours, and whichever class holds the majority among them is the class the new sample belongs to.
Figure 2: K-nearest neighbor example
In the image, there are two families, Blue Squares and Red Triangles. We call each family a Class. Their houses are shown in their town map, which we call the feature space. Now a new member comes into the town and builds a new home, which is shown as a green circle. He should be added to one of the Blue/Red families. We call that process Classification. Since we are dealing with kNN, let us apply this algorithm.
One method is to check who his nearest neighbour is. From the image, it is clear that this is the Red Triangle family, so he is added to the Red Triangles. This method is called simply Nearest Neighbour, because classification depends only on the single nearest neighbour.
But there is a problem with that. The Red Triangle may be the nearest, but what if there are a lot of Blue Squares near him? Then the Blue Squares have more strength in that locality than the Red Triangles, so just checking the nearest one is not sufficient. Instead, we check some k nearest families; then whichever is the majority among them is the family the newcomer belongs to. In our image, let us take k = 3, i.e. the 3 nearest families. He has two Red and one Blue (there are two Blues equidistant, but since k = 3, we take only one of them), so again he should be added to the Red family. But what if we take k = 7? Then he has 5 Blue families and 2 Red families, so now he should be added to the Blue family. So it all changes with the value of k. Even stranger, what if k = 4? He has 2 Red and 2 Blue neighbours: a tie! So it is better to take k as an odd number. This method is called k-Nearest Neighbour, since classification depends on the k nearest neighbours.
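The majority-vote idea above can be sketched in a few lines of plain Python/NumPy on made-up "town map" data:

```python
import numpy as np
from collections import Counter

def knn_classify(train_pts, train_labels, test_pt, k):
    """Classify test_pt by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_pts - test_pt, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative "town map": Red Triangles cluster near the origin,
# Blue Squares cluster around (5, 5).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]])
labels = np.array(["red", "red", "red", "blue", "blue", "blue", "blue"])

print(knn_classify(pts, labels, np.array([0.5, 0.5]), k=3))  # red
print(knn_classify(pts, labels, np.array([5.5, 5.5]), k=3))  # blue
```

A newcomer near the origin is voted into the red class, and one near (5, 5) into the blue class, exactly as in the town-map discussion.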
2.2.2 Support Vector Machine (SVM):
A support vector machine (SVM) is a computer algorithm that learns by example
to assign labels to objects.
Figure 3: SVM example 1
In SVM we find a line, f(x) = ax1 + bx2 + c, which divides the data into two regions. When we get new test data X, we just substitute it into f(x): if f(X) > 0 it belongs to the blue group, else it belongs to the red group. We can call this line the Decision Boundary. It is very simple and memory-efficient. Data which can be divided into two with a straight line (or hyperplane in higher dimensions) is called Linearly Separable.
In the above image, we can see that plenty of such lines are possible. Which one should we take? Very intuitively, we can say that the line should pass as far as possible from all the points. Why? Because there can be noise in the incoming data, and this noise should not affect the classification accuracy; taking the farthest line provides more immunity against noise. So what SVM does is find the straight line (or hyperplane) with the largest minimum distance to the training samples. See the bold line in the image below passing through the center.
Figure 4: SVM example 2
So to find this Decision Boundary, you need training data. Do you need all of it? No: just the samples which are close to the opposite group are sufficient. In our image, they are the one blue filled circle and the two red filled squares. We call them Support Vectors, and the lines passing through them are called Support Planes. They are adequate for finding our decision boundary; we need not worry about all the data, which helps in data reduction.
What happens is that first two hyperplanes are found which best represent the data. For example, the blue data is represented by w^T x + b >= 1, while the red data is represented by w^T x + b <= -1, where w is the weight vector (w = [w1, w2, ..., wn]) and x is the feature vector (x = [x1, x2, ..., xn]); b is the bias. The weight vector decides the orientation of the decision boundary, while the bias decides its location. The decision boundary is defined to be midway between these hyperplanes, so it is expressed as w^T x + b = 0. The minimum distance from a support vector to the decision boundary is 1 / ||w||. The margin is twice this distance, 2 / ||w||, and we need to maximize this margin; i.e. we need to minimize a new function with some constraints, which can be expressed as:

minimize (1/2) ||w||^2 subject to t_i (w^T x_i + b) >= 1 for all i,

where t_i is the label of each class, t_i in {-1, +1}.
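The decision-function and margin quantities discussed above can be illustrated numerically with a hand-picked weight vector and bias (illustrative values, not learned from data):

```python
import numpy as np

# Hand-picked weight vector and bias for illustration: boundary x1 + x2 = 3.
w = np.array([1.0, 1.0])
b = -3.0

def classify(x):
    """The sign of the decision function w.x + b picks the side of the boundary."""
    return "blue" if np.dot(w, x) + b > 0 else "red"

# The margin is twice the distance from a support vector to the boundary: 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)

print(classify(np.array([4.0, 4.0])))  # w.x + b = 5 > 0 -> blue
print(classify(np.array([0.0, 0.0])))  # w.x + b = -3 < 0 -> red
print(margin)                          # 2 / sqrt(2)
```

A real SVM would choose w and b by solving the constrained minimization above; this sketch only evaluates the resulting decision rule.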
2.3 Image Segmentation:
Hand segmentation deals with separating the user's hand from the background in the image. This can be done using various methods. The most important step for hand segmentation is thresholding, which is used in most of the methods described below to separate the hand from the background.
Thresholding can be used to extract an object from its background by assigning an intensity value to each pixel such that each pixel is classified as either an object pixel or a background pixel. Thresholding is done on the input image according to a threshold value: any pixel with intensity less than the threshold value is set to 0, and any pixel with intensity greater than the threshold value is set to 1. Thus the output of thresholding is a binary image in which all pixels of value 0 belong to the background and all pixels of value 1 represent the hand. The white blob, that is, the set of pixels having value 1, is the object area; in our case the object is the user's hand. The most important component of thresholding is the threshold value, and there are various methods to select an appropriate one.
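The binarization just described amounts to a single comparison per pixel; a minimal NumPy sketch on a toy image:

```python
import numpy as np

# A toy grayscale "image": dark background (20) with a brighter "hand" patch (180).
img = np.full((6, 6), 20, dtype=np.uint8)
img[2:5, 2:5] = 180

threshold = 100
binary = (img > threshold).astype(np.uint8)  # 1 = object (hand), 0 = background

print(binary)
print("object pixels:", int(binary.sum()))  # the 3x3 patch -> 9
```

Every segmentation variant below differs only in how this threshold (or range of values) is chosen.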
Types of segmentation:
1. Static Thresholding
2. Incremental Thresholding
3. Thresholding using Otsu’s method
4. Dynamic thresholding using color at real time
5. Color-based thresholding (using the inRange function)
2.3.1 Static Thresholding:
An image frame is taken as input from the webcam in RGB format and converted into grayscale. Then either a static threshold value is used, or a threshold value from 0 to 255 is selected according to the user's specification. This threshold value should be chosen by the user in such a way that the white blob of the hand is segmented with the minimum noise possible. A trackbar can be provided to adjust the threshold value for the current usage scenario.
ThresholdValue = 0-255, set by the user's requirement
For every usage, either the threshold value is static, that is, the same value is used each time, or the user is required to set the threshold value to ensure a good level of hand segmentation. This method is therefore not preferred, since it makes the system's success or failure dependent on the user setting a proper threshold value, or on the quality of the static threshold value. This method is useful where the intensity of the hand is almost the same whenever the system is used; the background intensity should also be similar every time. But even under constant lighting conditions, the system might fail depending on the user's hand color: if the user's hand is darker, the system might not be able to separate the user's hand from a dark background. Figure 5 below shows the input image thresholded with static threshold values of 20 and 70. The noise introduced at the poorer threshold clearly shows how a bad threshold value can reduce the accuracy of hand detection.
Figure 5: Static Thresholding (input image; thresholded with value 20; thresholded with value 70)
2.3.2 Incremental Thresholding:
In this method, the same pre-processing as in static thresholding is done on the input image frame, converting it from RGB to grayscale. However, instead of using a constant value for every input frame, the threshold value is incremented until a condition is met. A minimum threshold value is set, and the input frame is thresholded using this value. If the current threshold value does not fulfil the condition, the value is incremented and the same procedure is repeated until the condition is met. The condition used to detect the hand is that only one white blob is present in the thresholded image. The detected white blob may also be some other object, so whether the detected object is a hand or not is decided by the hand detection step explained further on. This method can select the threshold value automatically and generally gives good results, especially in environments where the intensity values of the input frames change continuously. It ensures that the entire hand is detected as a whole blob without any internal fragmentation. On the negative side, however, sometimes
the background pixels near the hand might also get included in the white blob. Also, if the background is not uniformly dark, some areas of the background might merge with the hand in the white blob at certain threshold values and still form only one white blob; that is, the image would pass the single-blob condition, but the blob would consist of the hand plus the lighter background areas connected to it.
ThresVal = initially set to some value (0-255); the value is increased until we get the result
To remove these problems, a test has to be conducted to find out whether the white blob has a structure similar to a hand, using convexity defects, as explained further on.
2.3.3 Thresholding using Otsu's Method:
Otsu's method is used to automatically select a threshold value based on the shape of the histogram of the image. The algorithm assumes that the image contains two dominant peaks of pixel intensities in the histogram, that is, two classes of pixels; the histogram should be bimodal for this method to apply. The two classes are foreground and background. The algorithm tries to find the optimal threshold value such that the two classes are separated with minimal intra-class variance (combined spread). The threshold value given by Otsu's method thus works well in our case, since the images contain two types of pixels, background pixels and hand pixels; the threshold value tries to separate the two peaks so as to give minimal intra-class variance.
As an illustration, consider a small image with gray levels 0 to 5. The 'sum of weighted variances' is calculated for every possible threshold value from 0 to 5, and the threshold with the lowest sum of weighted variances, in this example 3, is selected as the final threshold: all pixels with a level less than 3 are background, and all those with a level equal to or greater than 3 are foreground.
This approach to calculating Otsu's threshold is useful for explaining the theory, but it is computationally intensive, especially for a full 8-bit grayscale image with 256 possible thresholds.
The advantage of this method is that it works well under any circumstances, as long as the hand and background pixels create distinct peaks in the histogram of the image.
The only problem with this method is that if the user's hand is not in view, it would still give a threshold value, one which splits the background pixels into two separate classes, making it difficult to tell that the hand is not in view. This problem can again be solved by using the tests explained further on to make sure the detected white blob is indeed a hand. Since the chances of the background being thresholded in a way that the resulting white blob passes the hand detection test are extremely small, the system practically gives no false positives.
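The brute-force search described above (computing the sum of weighted variances for every candidate threshold and keeping the minimum) can be sketched in NumPy; the 6-level toy image is made up for illustration:

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Brute-force Otsu: pick the threshold minimizing the sum of weighted
    intra-class variances of background and foreground."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    total = hist.sum()
    vals = np.arange(levels, dtype=float)
    best_t, best_score = 0, np.inf
    for t in range(1, levels):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class empty: no valid split at this threshold
        mu0 = (vals[:t] * hist[:t]).sum() / w0
        mu1 = (vals[t:] * hist[t:]).sum() / w1
        var0 = ((vals[:t] - mu0) ** 2 * hist[:t]).sum() / w0
        var1 = ((vals[t:] - mu1) ** 2 * hist[t:]).sum() / w1
        score = (w0 * var0 + w1 * var1) / total  # sum of weighted variances
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Bimodal toy image: background around level 1, hand around level 4 (6 levels).
img = np.array([[1, 1, 0, 1], [1, 2, 1, 1],
                [4, 5, 4, 4], [5, 4, 4, 3]], dtype=np.uint8)
print(otsu_threshold(img, levels=6))  # 3
```

In practice OpenCV's cv2.threshold with the THRESH_OTSU flag performs this selection far more efficiently, but the loop above mirrors the calculation the text explains.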
2.3.4 Dynamic thresholding using color at real time:
Unlike the previous thresholding methods, this method performs color-based thresholding, which can also be termed color-level slicing. Initially, the user has to provide some dummy input image frames with the hand to be detected in the central part of the image. The system analyses these dummy input frames and generates dynamic threshold values in RGB. In this analysis, a small central circular region of the dummy input frames, with arbitrary radius, is considered. The first two pixels of the central region are taken as the initial minimum and maximum pixel values; then all the remaining pixels in the central region are processed. Every scanned pixel value is compared with the current minimum and maximum: if the scanned value is less than the minimum pixel value, the minimum is updated to the scanned value; similarly, if the scanned value is greater than the maximum pixel value, the maximum is updated to the scanned value. The range defined by the minimum and maximum pixel values is then used to threshold the image: any pixel that falls within this range is considered a hand pixel.
This method is very accurate at segmenting the hand if the intensity of the hand does not alter much during usage. It can detect a hand of any color, making it independent of the user's skin color. The dummy input frames must have the hand in the central part, or else the entire system collapses, since the decided range is then not actually that of the hand. The background should not contain pixels with values that fall within the decided range, as they too would be included as hand pixels.
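The min/max scan of the central region can be sketched as follows; the frame contents and region size are made up for illustration:

```python
import numpy as np

# Toy BGR frame: background (40, 40, 40) with a colored "hand" in the center.
frame = np.full((60, 60, 3), 40, dtype=np.uint8)
frame[20:40, 20:40] = (90, 120, 200)  # hand pixels (B, G, R)

# Scan a small central region and track per-channel min/max pixel values.
center = frame[25:35, 25:35].reshape(-1, 3)
min_val = center.min(axis=0)
max_val = center.max(axis=0)

# Threshold the whole frame with the learned range: in-range pixels -> hand (1).
mask = np.all((frame >= min_val) & (frame <= max_val), axis=2).astype(np.uint8)

print("range:", min_val.tolist(), "-", max_val.tolist())
print("hand pixels:", int(mask.sum()))  # the 20x20 hand patch -> 400
```

Because the calibration range is learned from the central region, only pixels whose color falls inside it survive, which is exactly why a hand outside the center, or a background sharing the range, breaks the method.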
2.3.5 Color based thresholding(using inRange function) :
In color-based thresholding, static values of hand color are used for thresholding. RGB values of the hand are taken, with a minimum RGB value and a maximum RGB value forming a range; these ranges are selected after analysing the general range of colors of human hands. The input image frame is then thresholded using these minimum and maximum RGB values: any pixel within this range is considered a hand pixel and is set to 1, and pixels outside this range are considered background pixels and set to 0.
As motion tracking is not possible with the first two approaches, i.e. feature matching and machine learning, we chose color segmentation for our project.
2.4 Conclusion:
The past has a lot to teach us. Work in image processing started in the 1970s, so a great deal has already been done in this field, and there is much one can learn by reviewing the works of earlier researchers. In this chapter, the approaches used in the past have been discussed.
Chapter 3
Proposed Work
So far, the discussion has covered every detail necessary to better understand "Vision Based Hand Gesture Recognition".
Hand Gesture Recognition System for pc control:
3.1 System Overview
Figure 6: Block Diagram of Hand Gesture Recognition system for pc control
3.2 Image Acquisition:
This is the first step in any gesture recognition system. The system developed here captures a sequence of images from real-time video from a static web camera of a computer. The resolution of the camera has no major effect on the functionality of the system. During image acquisition, we should make sure that sufficient illumination is present. As the system proceeds with color segmentation, it will not give proper results under fluorescent bulb illumination, so it is strongly recommended that image acquisition be done under sunlight for better and more accurate results.
3.3 HSV color model:
RGB is useful for hardware implementations and matches nicely with the fact that the human eye is strongly responsive to the red, green and blue primaries. However, RGB is not a particularly intuitive way of describing colors in terms that are practical for human interpretation. Rather, when people describe colors they tend to use hue, saturation and brightness. RGB concerns the "implementation details" of the way displays produce color, whereas HSV concerns the "actual color" components. Put another way, RGB is how computers treat color, while HSV tries to capture the components of the way humans perceive color. RGB is great for color generation, but HSV is great for color description.
Figure 7: HSV Color Model
3.3.1 RGB to HSV conversion:
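The conversion appears as a figure in the original report; the standard formulas can be sketched as follows (a reference sketch, with H in degrees [0, 360) and S, V in [0, 1]; the function name is ours):

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to HSV: H in [0, 360) degrees, S and V in [0, 1]."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    v = max(r, g, b)                     # value: the largest component
    delta = v - min(r, g, b)
    s = 0.0 if v == 0 else delta / v     # saturation: chroma relative to value
    if delta == 0:                       # achromatic (gray): hue is undefined
        h = 0.0
    elif v == r:
        h = 60 * (((g - b) / delta) % 6)
    elif v == g:
        h = 60 * (((b - r) / delta) + 2)
    else:                                # v == b
        h = 60 * (((r - g) / delta) + 4)
    return h, s, v
```

The standard-library colorsys module implements the same conversion with all three components scaled to [0, 1].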
3.4 Color Segmentation:
In color based thresholding, static values of hand color are used for thresholding. RGB values of the hand are taken, with a minimum RGB value and a maximum RGB value forming a range. These ranges are selected after analysing the general range of colors of human hands. Each input image frame is then thresholded with these minimum and maximum values: any pixel within the range is considered a hand pixel and set to 1, and any pixel outside the range is considered background and set to 0.
Figure 8: Color Segmentation Example
As there are no background constraints in this method, it is highly prone to noise. It can still work on general backgrounds, with the slight constraint that the background should not contain pixels lying within the specified range. Otherwise, extra processing, such as selecting the largest contour (explained further below), is required to ensure such white blobs are not detected as the hand. The threshold values are tricky to select: for some users the color of the hand can vary considerably and fall outside the specified range, making the system unable to detect their hand.
lower_hand = np.array([0, 30, 60])    # lower bound of the hand-color range
upper_hand = np.array([20, 150, 250]) # upper bound
mask = cv2.inRange(img, lower_hand, upper_hand)
3.5 Contour detection:
A contour of a function of two variables is a curve along which the function has a constant value. A contour joins points of equal elevation above a given level. A contour map illustrates contours using contour lines, which show the steepness of slopes, valleys and hills. The function's gradient is always perpendicular to the contour lines; when the lines are close together, the magnitude of the gradient is usually large. Contours are straight lines or curves describing the intersection of one or more horizontal planes with a real or hypothetical surface.
The contour is drawn around the white blob of the hand found by thresholding the input image. More than one blob may be formed in the image due to noise in the background, so contours are drawn on such smaller white blobs too. Assuming all blobs formed by noise are small, the largest contour is taken for further processing as the contour of the hand.
In this implementation, after preprocessing of the image frame, a white blob is formed and a contour is drawn around it. The contour is stored as a vector of points in coordinate form. Figure 9 shows the detected contour for the input image.
Figure 9: Detected Contour for the Input Image
3.6 Convex Hull:
The convex hull of a set of points in Euclidean space is the smallest convex set that contains all the given points. For example, when this set of points is a bounded subset of the plane, the convex hull can be visualized as the shape formed by a rubber band stretched around the set. The convex hull is drawn around the contour of the hand, such that all contour points lie within it. This forms an envelope around the hand contour. Figure 10 shows the convex hull formed around the detected hand.
Figure 10: Convex Hull of the Input Image
hull = cv2.convexHull(points[, hull[, clockwise[, returnPoints]]])
Argument details:
 points is the contour we pass in.
 hull is the output array; normally we avoid passing it.
 clockwise: orientation flag. If it is True, the output convex hull is oriented clockwise; otherwise it is oriented counter-clockwise.
 returnPoints: if True (the default), the hull is returned as points; if False, as indices of the contour points.
To draw all the contours in an image:
cv2.drawContours(img, contours, -1, (0,255,0), 3)
To draw an individual contour, say 4th contour:
cv2.drawContours(img, contours, 3, (0,255,0), 3)
But most of the time, the method below is useful:
cnt = contours[4]
cv2.drawContours(img, [cnt], 0, (0,255,0), 3)
Convex hulls are also drawn on an image using the same drawContours function explained above, because both contours and convex hulls are simply collections of points that need to be connected with straight lines.
When we draw convex hulls for a given color range, i.e. the color range of the human palm, we may get many such hulls if other objects match that color range, so we consider only the convex hull with maximum area, which is most likely the hull of the palm.
3.7 Convexity Defects:
When the convex hull is drawn around the contour of the hand, it encloses the set of contour points of the hand. The hull uses the minimum number of points needed to include all contour points inside or on it while maintaining convexity. This causes the formation of defects in the convex hull with respect to the contour drawn on the hand.
A defect is present wherever the contour of the object lies away from the convex hull drawn around it. The convexity-defect computation gives a vector of values for every defect. This vector contains the start and end points of the line of the defect in the convex hull; these points are indices into the coordinate points of the contour, so they can easily be retrieved from the contour vector using the start and end indices of the defect. Each convexity defect also includes the index of the depth point in the contour and its depth value from the line. Figure 11 shows an example of convexity defects calculated for the detected hand using the input image of Figure 9.
Figure 11: Major Convexity Defects Calculated for the given image
Final result:
Figure 12: Input image with all calculations
hull = cv2.convexHull(cnt,returnPoints = False)
defects = cv2.convexityDefects(cnt,hull)
3.8 Gesture Recognition:
After finding the convexity defects we can identify the type of gesture based on the number of convexity defects. A gesture with all five fingers open gives four major convexity defects, four fingers give three, and three fingers give two. A gesture with all fingers closed gives no convexity defects. Here the result is taken over every 60 frames: the most frequent gesture within the 60 frames is kept and the rest are discarded. The result is an integer value, the number of convexity defects present in the given gesture. This value is sent to a Java class through a gateway called Py4J (Python for Java).
Figure 13: Example Of Hand Gestures With Convexity Defects
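The counting and majority-vote logic described above can be sketched in plain Python (the defect-to-finger rule and the 60-frame window follow the text; the function names are ours):

```python
from collections import Counter

def fingers_from_defects(defect_count):
    """An open hand showing n fingers yields n-1 defects; 0 defects means a fist."""
    return 0 if defect_count == 0 else defect_count + 1

def most_frequent_gesture(defect_counts):
    """Majority vote over a window of per-frame defect counts (e.g. 60 frames)."""
    return Counter(defect_counts).most_common(1)[0][0]
```

The majority vote makes the recognized gesture robust to a few misclassified frames within the window.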
3.9 Mouse Events:
Mouse events are performed using the Robot class in Java, available in the java.awt package. A Java class takes as input from the Python program an integer value, the number of convexity defects in the given gesture, based on which mouse events are performed.
3.9.1 Robot Class:
This class is used to generate native system input events for the purposes of test
automation, self-running demos, and other applications where control of the mouse and
keyboard is needed. The primary purpose of Robot is to facilitate automated testing of
Java platform implementations.
Using the class to generate input events differs from posting events to the AWT event
queue or AWT components in that the events are generated in the platform's native input
queue. For example, Robot.mouseMove will actually move the mouse cursor instead of
just generating mouse move events.
Constructor and Description:
Robot( )
public Robot( ) throws AWTException
Constructs a Robot object in the coordinate system of the primary screen.
Method Details:
3.9.1.1 mouseMove
public void mouseMove(int x, int y)
Moves mouse pointer to given screen coordinates.
Parameters:
x - X position(x-coordinate)
y - Y position(y-coordinate)
The mouse cursor is moved based on the movement of the centroid of the convex hull drawn for the hand. The system calculates the difference between the previous and present positions, i.e. the (x, y) coordinates of the convex hull centroid.
3.9.1.2 mousePress
public void mousePress(int buttons)
Presses one or more mouse buttons. The mouse buttons should be released using
the mouseRelease(int) method.
Parameters:
buttons - the Button mask; a combination of one or more mouse button masks.
It is allowed to use only a combination of valid values as the buttons parameter.
A valid combination consists of InputEvent.BUTTON1_DOWN_MASK,
InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK
and values returned by the InputEvent.getMaskForButton(button) method. The
valid combination also depends on
the Toolkit.areExtraMouseButtonsEnabled() value, as follows:
 If support for extended mouse buttons is disabled by Java, only the
following standard button masks may be used:
InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK,
InputEvent.BUTTON3_DOWN_MASK.
 If support for extended mouse buttons is enabled by Java, the standard
button masks and masks for existing extended mouse buttons may be used,
if the mouse has more than three buttons. In that case, it is allowed
to use the button masks corresponding to the buttons in the range from 1
to MouseInfo.getNumberOfButtons().
It is recommended to use the InputEvent.getMaskForButton(button) method
to obtain the mask for any mouse button by its number.
The following standard button masks are also accepted:
 InputEvent.BUTTON1_MASK
 InputEvent.BUTTON2_MASK
 InputEvent.BUTTON3_MASK
However, it is recommended to use InputEvent.BUTTON1_DOWN_MASK,
InputEvent.BUTTON2_DOWN_MASK and InputEvent.BUTTON3_DOWN_MASK
instead. Either the extended _DOWN_MASK values or the old _MASK values
should be used, but the two models should not be mixed.
Throws:
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button and support for extended mouse buttons is disabled by Java
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button that does not exist on the mouse and support for extended mouse buttons
is enabled by Java
3.9.1.3 mouseRelease
public void mouseRelease(int buttons)
Releases one or more mouse buttons.
Parameters:
buttons - the Button mask; a combination of one or more mouse button masks.
It is allowed to use only a combination of valid values as a buttons parameter.
A valid combination consists of InputEvent.BUTTON1_DOWN_MASK,
InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK
and values returned by the InputEvent.getMaskForButton(button) method. The
valid combination also depends on
a Toolkit.areExtraMouseButtonsEnabled() value as follows:
 If support for extended mouse buttons is disabled by Java, only the
following standard button masks may be used:
InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK,
InputEvent.BUTTON3_DOWN_MASK.
 If support for extended mouse buttons is enabled by Java, the standard
button masks and masks for existing extended mouse buttons may be used,
if the mouse has more than three buttons. In that case, it is allowed
to use the button masks corresponding to the buttons in the range from 1
to MouseInfo.getNumberOfButtons().
It is recommended to use the InputEvent.getMaskForButton(button) method
to obtain the mask for any mouse button by its number.
The following standard button masks are also accepted:
 InputEvent.BUTTON1_MASK
 InputEvent.BUTTON2_MASK
 InputEvent.BUTTON3_MASK
Throws:
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button and support for extended mouse buttons is disabled by Java
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button that does not exist on the mouse and support for extended mouse buttons
is enabled by Java
3.9.1.4 mouseWheel
public void mouseWheel(int wheelAmt)
Rotates the scroll wheel on wheel-equipped mice.
Parameters:
wheelAmt - number of "notches" to move the mouse wheel. Negative values indicate movement up/away from the user; positive values indicate movement down/towards the user.
3.9.1.5 keyPress
public void keyPress(int keycode)
Presses a given key. The key should be released using the keyRelease method.
Parameters:
keycode - Key to press (e.g. KeyEvent.VK_A)
Throws:
IllegalArgumentException - if keycode is not a valid key
3.9.1.6 keyRelease
public void keyRelease(int keycode)
Releases a given key.
Key codes that have more than one physical key associated with them
(e.g. KeyEvent.VK_SHIFT could mean either the left or right shift key) will map to the
left key.
Parameters:
keycode - Key to release (e.g. KeyEvent.VK_A)
Proposed Gestures and Corresponding Mouse Operations:
(The gesture images from the original report are not reproduced here; each proposed gesture maps to one of the following operations.)
 Mouse movement / release click
 Right click
 Scroll
 Middle click
 Left click
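Since the gesture images are not reproduced here, the exact gesture-to-action assignment cannot be read from the table; purely as a hypothetical illustration, the dispatch from a defect count to an operation could look like:

```python
# HYPOTHETICAL mapping from defect count to mouse action; the report's real
# mapping is defined by its gesture images, which are not reproduced here.
ACTIONS = {
    0: "left_click",
    1: "middle_click",
    2: "scroll",
    3: "right_click",
    4: "move_or_release",
}

def action_for(defect_count):
    """Return the action name for a defect count, or 'none' if unmapped."""
    return ACTIONS.get(defect_count, "none")
```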
3.10 Py4J (Python for Java) Gateway:
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects. The goal is to enable developers to program in Python and benefit from Python libraries such as lxml while being able to reuse Java libraries and frameworks such as Eclipse and NetBeans. Py4J can be seen as a hybrid between a glorified Remote Procedure Call and using the Java Virtual Machine to run a Python program. Through this gateway we create an object of the Java class in the Python program and pass the output of the Python program, an integer value (the number of convexity defects in the given gesture), to the method defined in the Java class.
3.11 Technologies used:
3.11.1 Python:
Python is a general-purpose programming and scripting language. It serves applications in different areas such as scripting, hardware-interaction events, device coding and GUI programming, and it provides many libraries for operations involving the operating system, file system, hardware, web application development and advanced fields such as image and voice processing. Some of the libraries used in this project are listed below.
PIL – the Python Imaging Library, which provides various image processing operations.
cv2 – a computer vision algorithms library used to process videos and images.
Matplotlib – provides a graphical interface to visualize data during development as well as processing.
NumPy – a library providing matrix-oriented operations. It plays a crucial role when working with image-based tasks.
Ref.: www.python.org, http://pypi.python.org/
3.11.2 OpenCV (Open Source Computer Vision):
OpenCV provides libraries for different image operations, video-based filters, hardware interaction, etc. It makes developers' work easier by providing standard library routines to compute the data a program requires. It spans wide areas such as image processing, computer vision, video processing, object detection and machine learning. OpenCV was started at Intel in 1999 by Gary Bradski and the first release came out in 2000. It has many algorithms related to computer vision and machine learning and is expanding day by day. It supports programming languages such as C++, Python and Java, and is available on different platforms including Windows, Linux, OS X, Android and iOS. Interfaces based on CUDA and OpenCL are also under active development for high-speed GPU operations.
Ref.: http://docs.opencv.org/
The "cv2" module of Python exposes the OpenCV operations across these functionalities. Some of the functions used are listed below.
cv2.VideoCapture("file" or index)
This function allows the user to capture frames from a video file or from a device. If a filename is provided instead of a device index, frames are taken from that file. Device indices are 0, 1, 2, etc.: 0 specifies the default web camera, 1 specifies a USB camera connected to the machine, and so on.
cv2.imread(“Image”,mode)
This function reads the image data into a matrix for further operations. In the parameters, Image specifies the filename of the image to be read and mode specifies the image mode to convert to, e.g. mode=0 reads the image at gray level.
cv2.imwrite(“Filename”, source)
This function writes image data to the specified filename, stored in the current working directory.
cv2.imshow("Name", source)
This function is used to visualize image data at runtime. Name specifies the window name and source specifies the image data to be visualized.
cv2.cvtColor(Image, cv2.COLOR_BGR2GRAY)
Converts a color image to a gray-level image. Note that OpenCV stores color images in BGR channel order, so the input is expected to be a BGR image.
3.11.3 Java AWT:
AWT (Abstract Window Toolkit) is a collection of classes providing graphical components such as buttons and text boxes, as well as Robot actions, for use in graphical programming.
3.12 Conclusion:
This chapter first described the system overview and gave a detailed description of all steps involved in the application system, and then described the proposed methodologies and algorithms. The correctness of this proposed work will be discussed in the next chapter.
Chapter 4
Results and Discussion
4.1 Snapshots of the Result
4.1.1 Cursor move based on centroid position of the convex hull
4.1.2 Moving cursor onto the folder
4.1.3 Right click on the folder
4.1.4 Cursor moved onto the “open” option
4.1.5 Left click on the selected option
4.1.6 Result of left click on the selected option
Chapter 5
Conclusion and Future Work
5.1 Conclusion of the work:
This application system, "Vision Based Hand Gesture Recognition" for PC control, helps perform mouse operations with hand gestures. It gives more accurate results when provided with a constant background and proper illumination conditions. As it is a research project, we are able to perform some operations, but with the limitations mentioned above: a noisy background and poor illumination. Because it performs color based segmentation, it may not perform well under some illumination such as fluorescent bulbs, but it performs very well under daylight illumination conditions. The application system is implemented to be platform independent.
5.2 Future Work
Future work includes adding more gestures to the system to execute more mouse operations, making it robust to environmental conditions and background variation, and enhancing its performance to make it more user friendly.
REFERENCES:
[1] G. R. S. Murthy, R. S. Jadon, "A Review of Vision Based Hand Gestures Recognition," International Journal of Information Technology and Knowledge Management, vol. 2(2), pp. 405-410, 2009.
[2] R. Lockton, "Hand Gesture Recognition Using Computer Vision," http://research.microsoft.com/en-us/um/people/awf/bmvc02/project.pdf
[3] S. Mitra, T. Acharya, "Gesture recognition: a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(3):311-324, 2007.
[4] F. Karray, M. Alemzadeh, J. A. Saleh, M. N. Arab, "Human Computer Interaction: Overview on State of the Art," International Journal on Smart Sensing and Intelligent Systems, vol. 1(1), 2008.
[5] R. Fergus, P. Perona, A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in CVPR, vol. 2, pp. 264-271, 2003.
[6] S. Ullman, M. Vidal-Naquet, E. Sali, "Visual features of intermediate complexity and their use in classification," Nature Neuroscience, 5(7):682-687, 2002.
[7] M. Weber, M. Welling, P. Perona, "Unsupervised learning of models for recognition," in ECCV, Dublin, Ireland, 2000.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 2004.
[9] K. Mikolajczyk, "Detection of local features invariant to affine transformations," Ph.D. thesis, Institut National Polytechnique de Grenoble, France, 2002.
[10] A. R. Pope, D. G. Lowe, "Probabilistic models of appearance for 3-D object recognition," International Journal of Computer Vision, 40(2):149-167, 2000.
[11] Y. Ke, R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in CVPR (2), pp. 506-513, 2004.
[12] A. Blake, M. Isard, "3D position, attitude and shape input using video tracking of hands and lips," in Proceedings of SIGGRAPH 94, pp. 185-192, 1994.
[13] J. Segen, "Gest: a learning computer vision system that recognizes gestures," in Machine Learning IV, Morgan Kaufmann, 1992, edited by Michalski et al.
[14] J. M. Rehg, T. Kanade, "DigitEyes: vision-based human hand tracking," Technical Report CMU-CS-93-220, Carnegie Mellon School of Computer Science, Pittsburgh, PA 15213, 1993.
[15] D. Rubine, P. McAvinney, "Programmable finger-tracking instrument controllers," Computer Music Journal, 14(1):26-41, 1990.
[16] R. Watson, "Gesture recognition techniques," Technical Report TCD-CS-93-11, Department of Computer Science, Trinity College, Dublin, July 1993.
[17] C. W. Ng, S. Ranganath, "Real-time gesture recognition system and application," Image and Vision Computing, 20(13-14):993-1007, 2002.
[18] T. G. Zimmerman, J. Lanier, C. Blanchard, S. Bryson, Y. Harvill, "A hand gesture interface device," in SIGCHI/GI Conference on Human Factors in Computing Systems and Graphics Interface, pp. 189-192, Toronto, Ontario, Canada, April 1987.
[19] L. Gupta, S. Ma, "Gesture-Based Interaction and Communication: Automated Classification of Hand Gesture Contours," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 1, February 2001.

HGR-thesis

  • 1.
    HGR Project ReportPage 1 Chapter 1 Introduction This chapter will give the reader an insight into what this project work is all about. 1.1 Overview: Computer is used by many people either at their work or in their spare-time. Special input and output devices have been designed over the years with the purpose of easing the communication between computers and humans, the two most known are the keyboard and mouse . Every new device can be seen as an attempt to make the computer more intelligent and making humans able to perform more complicated communication with the computer. This has been possible due to the result oriented efforts made by computer professionals for creating successful human computer interfaces . As the complexities of human needs have turned into many folds and continues to grow so, the need for Complex programming ability and intuitiveness are critical attributes of computer programmers to survive in a competitive environment. The computer programmers have been incredibly successful in easing the communication between computers and human. With the emergence of every new product in the market; it attempts to ease the complexity of jobs performed. For instance, it has helped in facilitating tele operating, robotic use, better human control over complex work systems like cars, planes and monitoring systems. Earlier, Computer programmers were avoiding such kind of complex programs as the focus was more on speed than other modifiable features. However, a shift towards a user friendly environment has driven them to revisit the focus area . With the development of information technology in our society, we can expect that computer systems to a larger extent will be embedded into our environment. These environments will impose needs for new types of human computer-interaction, with
  • 2.
    HGR Project ReportPage 2 interfaces that are natural and easy to use. Hand is a natural and powerful means of communication that conveys information very effectively. Hand gesture recognition is an important aspect in Human-Computer interaction , and can be used in various applications, such as virtual reality and computer games. In moving advanced with this technological world the interaction between the human and devices is becoming very closer to each other. To move step ahead in making these devices more user friendly, indirect interaction between the users and the devices is needed instead of direct contact. By utilizing the generic features of living beings like vision, voice, gestures we can establish an indirect communication between the devices and humans. In the view of interaction between the computer and the human, this project ―Vision Based Hand Gesture Recognition‖ extends the one of the feature to enable the user to use the computer mouse over the display with human hand gestures. The user interface (UI) of the personal computer has evolved from a text-based command line to a graphical interface with keyboard and mouse inputs. However, they are inconvenient and unnatural. The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction (HCI). In particular, visual interpretation of hand gestures can help in achieving the ease and naturalness desired for HCI. Vision has the potential of carrying a wealth of information in a nonintrusive manner and at a low cost, therefore it constitutes a very attractive sensing modality for developing hand gestures recognition. Recent researches in computer vision have established the importance of gesture recognition systems for the purpose of human computer interaction. Two approaches are commonly used to interpret gestures for Human Computer interaction. They are
  • 3.
    HGR Project ReportPage 3 1.1.1 Methods Which Use Data Gloves: This method employs sensors (mechanical or optical) attached to a glove that transduces finger flexions into electrical signals for determining the hand posture. This approach forces the user to carry a load of cables which are connected to the computer and hinders the ease and naturalness of the user interaction. 1.1.2 Methods Which are Vision Based: Computer vision based techniques are non invasive and based on the way human beings perceive information about their surroundings. Although it is difficult to design a vision based interface for generic usage, yet it is feasible to design such an interface for a controlled environment 1.2 Gestures: It is hard to settle on a specific useful definition of gestures due to its wide variety of applications and a statement can only specify a particular domain of gestures. Many researchers had tried to define gestures but their actual meaning is still arbitrary. Bobick and Wilson have defined gestures as the motion of the body that is intended to communicate with other agents. As per the context of the project, gesture is defined as an expressive movement of body parts which has a particular message, to be communicated precisely between a sender and a receiver. A gesture is scientifically categorized into two distinctive categories: dynamic and static. A dynamic gesture is intended to change over a period of time whereas a static gesture is observed at the spurt of time. A waving hand means goodbye is an example of dynamic gesture and the stop sign is an example of static gesture. To understand a full message, it is necessary to interpret all the static and dynamic gestures over a period of time. This complex process is called gesture recognition. Gesture recognition is the process of recognizing and interpreting a stream continuous sequential gesture from the given set of input data.
  • 4.
    HGR Project ReportPage 4 1.3 Gesture Based Applications: Gesture based applications are broadly classified into two groups on the basis of their purpose: multidirectional control and a symbolic language. 3D Design: CAD (computer aided design) is an HCI which provides a platform for interpretation and manipulation of 3-Dimensional inputs which can be the gestures. Manipulating 3D inputs with a mouse is a time consuming task as the task involves a complicated process of decomposing a six degree freedom task into at least three sequential two degree tasks. Massachuchetttes institute of technology [3] has come up with the 3DRAW technology that uses a pen embedded in polhemus device to track the pen position and orientation in 3D.A 3space sensor is embedded in a flat palette, representing the plane in which the objects rest .The CAD model is moved synchronously with the users gesture movements and objects can thus be rotated and translated in order to view them from all sides as they are being created and altered. Tele presence: There may raise the need of manual operations in some cases such as system failure or emergency hostile conditions or inaccessible remote areas. Often it is impossible for human operators to be physically present near the machines [4]. Tele presence is that area of technical intelligence which aims to provide physical operation support that maps the operator arm to the robotic arm to carry out the necessary task, for instance the real time ROBOGEST system constructed at University of California, San Diego presents a natural way of controlling an outdoor autonomous vehicle by use of a language of hand gestures. The prospects of tele presence includes space, undersea mission, medicine manufacturing and in maintenance of nuclear power reactors. Virtual reality: Virtual reality is applied to computer-simulated environments that can simulate physical presence in places in the real world, as well as in imaginary worlds. 
Most current virtual reality environments are primarily visual experiences, displayed either on a computer screen or through special stereoscopic displays [6]. There are also some simulations include additional sensory information, such as sound through speakers
  • 5.
    HGR Project ReportPage 5 or headphones. Some advanced, haptic systems now include tactile information, generally known as force feedback, in medical and gaming applications. Sign Language: Sign languages are the most raw and natural form of languages could be dated back to as early as the advent of the human civilization, when the first theories of sign languages appeared in history. It has started even before the emergence of spoken languages. Since then the sign language has evolved and been adopted as an integral part of our day to day communication process. Now, sign languages are being used extensively in international sign use of deaf and dumb, in the world of sports, for religious practices and also at work places. Gestures are one of the first forms of communication when a child learns to express its need for food, warmth and comfort. It enhances the emphasis of spoken language and helps in expressing thoughts and feelings effectively. Now a days a lot of research is going on the human hand gestures to interpret them for pc control. 1.4 Purpose and Objective: The main purpose of this project is to create an application which can identify specific human hand gestures and interpret them for pc control (mouse operations). The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction. Basically this project involves two parts. First, image acquisition using static system webcam and recognizing specific hand gestures. Second, after determining specific hand gesture this output should be given to a java program which performs mouse events based on the given hand gesture. The purpose of this project is to help new researchers learn and further research on their topic of interest, which in this case is the human hand gesture recognition for pc control.
1.5 Layout of the thesis:

Chapter 1 introduces the reader to human motion detection. All the background details and prior knowledge required to correctly understand this project work are briefly covered in this chapter. Chapter 2 includes the details of the literature survey carried out before starting this work, as well as during the project work, so as to meet the desired objective. Chapter 3, titled "Proposed Work", describes the various features that have been used and the proposed method, along with details of the database used. Chapter 4 demonstrates the correctness of the proposed system by presenting the results and performance values. Chapter 5 discusses the conclusions drawn and the future work, along with the list of references that have been used.

1.6 Conclusion:

In this chapter an outline of the project work has been sketched to give an insight into "Vision Based Hand Gesture Recognition". The motivation for implementing this work has been discussed, along with the organization of the report.
Chapter 2
Literature Review

In this chapter, we look at several hand gesture identification techniques and methodologies that have been researched and implemented by other researchers. There are several approaches to identifying human hand gestures; here we present some of those studied during the literature survey of this project.
1. Feature Matching
2. Machine Learning
3. Segmentation based

2.1 Feature Matching:

This approach is based on comparing image features between two images. There are two matchers that perform this operation.
1. Brute-Force Matcher
2. FLANN Based Matcher

2.1.1 Basics of Brute-Force Matcher:

The Brute-Force matcher is simple: it takes the descriptor of one feature in the first set and matches it against all features in the second set using some distance calculation, and the closest one is returned. For the BF matcher, we first have to create the BFMatcher object using cv2.BFMatcher(). It takes two optional parameters. The first is normType, which specifies the distance measurement to be used. By default it is cv2.NORM_L2, which is good for SIFT, SURF etc.
(cv2.NORM_L1 is also there). For binary-string-based descriptors like ORB, BRIEF and BRISK, cv2.NORM_HAMMING should be used, which uses Hamming distance as the measurement. If ORB is used with WTA_K == 3 or 4, cv2.NORM_HAMMING2 should be used. The second parameter is a boolean variable, crossCheck, which is false by default. If it is true, the matcher returns only those matches (i, j) such that the i-th descriptor in set A has the j-th descriptor in set B as its best match and vice versa; that is, the two features in both sets should match each other. This provides consistent results, and is a good alternative to the ratio test proposed by D. Lowe in the SIFT paper.

Once the matcher is created, two important methods are BFMatcher.match() and BFMatcher.knnMatch(). The first returns the best match. The second returns the k best matches, where k is specified by the user; this may be useful when we need to do additional work on the matches. Just as cv2.drawKeypoints() is used to draw keypoints, cv2.drawMatches() helps us draw the matches: it stacks two images horizontally and draws lines from the first image to the second showing the best matches. There is also cv2.drawMatchesKnn, which draws all the k best matches; if k=2, it will draw two match lines for each keypoint, so we have to pass a mask if we want to draw selectively.

2.1.2 FLANN Based Matcher:

FLANN stands for Fast Library for Approximate Nearest Neighbors. It contains a collection of algorithms optimized for fast nearest-neighbor search in large datasets and for high-dimensional features. It works much faster than BFMatcher for large datasets. For the FLANN based matcher, we need to pass two dictionaries which specify the algorithm to be used, its related parameters etc. The first is IndexParams. For various
algorithms, the information to be passed is explained in the FLANN docs. As a summary, for algorithms like SIFT, SURF etc. you can pass the following:

index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)

While using ORB, you can pass the following. The commented values are the ones recommended in the docs, but they did not give the required results in some cases; other values worked fine:

index_params = dict(algorithm = FLANN_INDEX_LSH,
                    table_number = 6,      # 12
                    key_size = 12,         # 20
                    multi_probe_level = 1) # 2

The second dictionary is SearchParams. It specifies the number of times the trees in the index should be recursively traversed. Higher values give better precision, but also take more time. If you want to change the value, pass

search_params = dict(checks = 100)

Both the Brute-Force matcher and the FLANN based matcher can use either SIFT or SURF descriptors. Let us discuss the SIFT and SURF algorithms.

2.1.3 Scale Invariant Feature Transform (SIFT):

SIFT is designed to cope with image rotation, scaling, viewpoint change, noise and illumination changes. First the keypoints are extracted from the image; then neighbourhood regions are picked around each keypoint and feature descriptors are computed. The feature descriptors are extracted and stored in a database, and descriptor matching is based on Euclidean distance.
Algorithm:
 Scale-space extrema detection
 Keypoint localization
 Orientation assignment
 Descriptor generation

Scale-space extrema detection: Keypoints are detected here. The image is convolved with Gaussian filters at different scales, and the difference of successive Gaussian-blurred images is taken. Keypoints are then taken as the maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.

Keypoint localization: Once potential keypoint locations are found, they have to be refined to get more accurate results. A Taylor series expansion of the scale space is used to get a more accurate location of each extremum, and if the intensity at the extremum is less than a threshold value (0.03), it is rejected. The DoG function has strong responses along edges, so to increase stability we also need to eliminate keypoints that have poorly determined locations but high edge responses.
Orientation assignment: Each keypoint is assigned one or more orientations based on local image gradient directions. A neighbourhood is taken around the keypoint location depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created, and the highest peak in the histogram is taken to compute the orientation.

Keypoint descriptor: A 16x16 neighbourhood around the keypoint is taken and divided into 16 sub-blocks of 4x4 size. For each sub-block, an 8-bin orientation histogram is created.

2.1.4 Speeded Up Robust Features (SURF):

SURF works much faster than SIFT. For feature description, SURF uses wavelet responses in the horizontal and vertical directions. The detector is based on the Hessian matrix, chosen for its good performance in accuracy.
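The brute-force matching idea described in Section 2.1.1 can be illustrated with a small, self-contained sketch. This is a toy example in pure Python using Hamming distance on made-up 8-bit binary "descriptors"; in practice cv2.BFMatcher performs this search over real ORB/BRIEF descriptors:

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def brute_force_match(set_a, set_b):
    """For each descriptor in set_a, return (index in A, index of the
    closest descriptor in B, distance) using exhaustive comparison."""
    matches = []
    for i, da in enumerate(set_a):
        dists = [hamming(da, db) for db in set_b]
        j = min(range(len(set_b)), key=dists.__getitem__)
        matches.append((i, j, dists[j]))
    return matches

# Toy 8-bit "descriptors" standing in for ORB output
A = ["10110010", "00001111"]
B = ["10110011", "11110000", "00001110"]
print(brute_force_match(A, B))  # [(0, 0, 1), (1, 2, 1)]
```

The exhaustive loop over set B is exactly what makes the brute-force matcher slow on large descriptor sets, and why FLANN's approximate nearest-neighbor indices pay off there.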
2.2 Machine Learning:

In classification, the system is first trained with some sample training data. Later, test data is given to the system, and it determines which family (class) the given gesture belongs to.
1. K-nearest neighbor
2. Support Vector Machine

2.2.1 K-nearest neighbor:

kNN is one of the simplest classification algorithms available for supervised learning. The idea is to search for the closest match to the test data in feature space: we check the k nearest families, and whichever class holds the majority among them is the class the new sample belongs to.

Figure 2: K-nearest neighbor example

In the image there are two families, Blue Squares and Red Triangles. We call each family a Class. Their houses are shown in their town map, which we call the feature space. Now a new member comes into the town and creates a new home, shown as a green circle. He should be added to one of the Blue/Red families. We call that process Classification. Since we are dealing with kNN, let us apply this algorithm.
One method is to check who his nearest neighbour is. From the image, it is clearly the Red Triangle family, so he is added to Red Triangle. This method is called simply Nearest Neighbour, because classification depends only on the single nearest neighbour.

But there is a problem with that. Red Triangle may be the nearest, but what if there are a lot of Blue Squares near him? Then Blue Squares have more strength in that locality than Red Triangle, so checking only the nearest one is not sufficient. Instead we check the k nearest families; whichever class is the majority among them, the newcomer belongs to that family. In our image, let us take k=3, i.e. the 3 nearest families. He has two Red and one Blue (there are two Blues equidistant, but since k=3 we take only one of them), so again he should be added to the Red family. But what if we take k=7? Then he has 5 Blue families and 2 Red families, so now he should be added to the Blue family. So it all changes with the value of k. More interestingly, what if k=4? He has 2 Red and 2 Blue neighbours, which is a tie, so it is better to take k as an odd number. This method is called k-Nearest Neighbour, since classification depends on the k nearest neighbours.

2.2.2 Support Vector Machine (SVM):

A support vector machine (SVM) is a computer algorithm that learns by example to assign labels to objects.

Figure 3: SVM example 1
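Before going into SVM in detail, the kNN vote just described can be sketched in a few lines of plain Python. This is a toy 2-D example with made-up points and labels, not tied to the gesture features used later in the project:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of ((x, y), label) pairs; query: an (x, y) point.
    Vote among the k nearest training points (squared Euclidean distance)."""
    dist = lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2
    nearest = sorted(train, key=dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two made-up "families" in feature space
train = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_classify(train, (1, 1), k=3))   # red
print(knn_classify(train, (5, 4), k=3))   # blue
```

Using an odd k, as the text recommends, avoids ties in the two-class vote.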
In SVM we find a line, f(x) = ax1 + bx2 + c, which divides the data into two regions. When we get new test data X, we just substitute it into f(x): if f(X) > 0, it belongs to the blue group, else it belongs to the red group. We call this line the Decision Boundary. It is very simple and memory-efficient. Data which can be divided in two with a straight line (or a hyperplane in higher dimensions) is called linearly separable.

In the above image we can see that plenty of such lines are possible. Which one should we take? Very intuitively, the line should pass as far as possible from all the points. Why? Because there can be noise in the incoming data, and this noise should not affect the classification accuracy; taking the farthest line provides more immunity against noise. So what SVM does is find a straight line (or hyperplane) with the largest minimum distance to the training samples. See the bold line in the image below passing through the center.

Figure 4: SVM example 2

To find this Decision Boundary, you need training data. Do you need all of it? No. Just the samples which are close to the opposite group are sufficient. In our image, they are the one blue filled circle and the two red filled squares. We call them Support Vectors, and the lines passing through them are called Support Planes. They are adequate for finding our decision boundary, so we need not worry about all the data; this helps in data reduction. First, two hyperplanes are found which best represent the data: the blue data is represented by w.x + b >= 1 while the red data is represented by w.x + b <= -1,
where w is the weight vector and x is the feature vector; b is the bias. The weight vector decides the orientation of the decision boundary, while the bias decides its location. The decision boundary is defined to be midway between these hyperplanes, and is expressed as w.x + b = 0. The minimum distance from a support vector to the decision boundary is given by 1/||w||. The margin is twice this distance, and we need to maximize it; equivalently, we need to minimize ||w||^2 / 2 subject to the constraints y_i (w.x_i + b) >= 1, where y_i is the label of each class, y_i ∈ {-1, +1}.

2.3 Image Segmentation:

Hand segmentation deals with separating the user's hand from the background in the image. This can be done using various methods. The most important step for hand segmentation is thresholding, which is used in most of the methods described below to separate the hand from the background. Thresholding can be used to extract an object from its background by assigning an intensity value to each pixel such that each pixel is classified as either an object pixel or a background pixel. Thresholding is done on the input image according to a threshold value: any pixel with intensity less than the threshold value is set to 0, and any pixel with intensity greater than the threshold value is set to 1. Thus the output of thresholding is a binary image, with 0-pixels belonging to the background and 1-pixels representing the hand. The white blob, that is, the region of pixels having value 1, is the object area; in our case the object is the user's hand. The most important component of thresholding is the threshold value itself, and there are various methods to select an appropriate one.
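The basic thresholding rule just described can be sketched with NumPy. This is a minimal illustration on a made-up grayscale array, using >= as the object test; in OpenCV the same operation is provided by cv2.threshold:

```python
import numpy as np

def threshold(gray, t):
    """Binary threshold: pixels >= t become 1 (object), the rest 0 (background)."""
    return (gray >= t).astype(np.uint8)

# Made-up 3x3 grayscale frame: a bright vertical stripe on a dark background
gray = np.array([[10, 200, 30],
                 [220, 250, 40],
                 [20, 210, 35]], dtype=np.uint8)
print(threshold(gray, 128))
```

The resulting 0/1 mask is the "binary image" referred to throughout this section; the connected region of 1-pixels is the white blob.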
Types of segmentation:
1. Static Thresholding
2. Incremental Thresholding
3. Thresholding using Otsu's method
4. Dynamic thresholding using color at real time
5. Color based thresholding (using the inRange function)

2.3.1 Static Thresholding:

An image frame is taken as input from the webcam in RGB format and converted into grayscale. Then either a static threshold value is used, or a threshold value from 0 to 255 is selected according to the user's specification. This threshold value should be chosen by the user in such a way that the white blob of the hand is segmented with the minimum noise possible. A trackbar can be provided to adjust the threshold value for the current usage scenario.

ThresholdValue = 0-255, set by the user requirement

For every use, either the threshold value is static, i.e. the same value is used each time, or the user is required to set the threshold value to ensure a good level of hand segmentation. This method is therefore not used, since it makes the system's success or failure dependent on the user setting a proper threshold value, or on the quality of the static threshold value. The method is useful where the intensity of the hand is almost the same whenever the system is used; the background intensity should also be similar each time. But even under constant lighting conditions, the system might fail depending on the user's hand color: if the user's hand is darker, the system might not be able to separate the hand from a dark background. Figures 2 and 3 below show the thresholded input image from Figure 1 using static threshold values of 70 and 20. The noise introduced in Figure 3
clearly shows how a bad threshold value can introduce noise, which reduces the accuracy of hand detection.

Figure 5: Static Thresholding (Fig 1: input image; Fig 2: thresholded image with value 20; Fig 3: thresholded image with value 70)

2.3.2 Incremental Thresholding:

In this method, the same pre-processing as in static thresholding is done on the input image frame, converting from RGB to grayscale. But instead of using a constant value for every input frame, the threshold value is incremented until a condition is met. A minimum threshold value is set and the input image frame is thresholded using this value. If the current threshold value does not fulfil the condition, it is incremented and the same procedure is followed until the condition is met. The condition for detecting the hand is that only one white blob is present in the thresholded image. The detected white blob can also be some other object, so whether the detected object is a hand or not is decided by the hand detection part explained further on.

This method can automatically select the threshold value. It generally gives good results, especially in environments where the intensity values of the input frame change continuously, and it ensures that the entire hand is detected as a whole blob without any internal fragmentation. But on the negative side, sometimes
the background pixels near the hand might also get included in the white blob. Also, if the background is not constantly dark, some areas of the background might add up with the hand in the white blob at certain threshold values and still form only one white blob. That is, even though the image would pass the condition that only one white blob is present, the blob would consist of the hand together with the lighter background areas connected to it.

ThresVal = initially set to some value (0-255); the value is increased until we get the result

To remove these problems, a test has to be conducted to find whether the white blob has a structure similar to a hand, using the convexity defects explained further on.

2.3.3 Thresholding using Otsu's Method:

Otsu's Method is used to automatically select a threshold value based on the shape of the histogram of the image. The algorithm assumes that the image contains two dominant peaks of pixel intensities in the histogram, that is, two classes of pixels; the histogram should be bimodal for this method to apply. The two classes are foreground and background. The algorithm tries to find the optimal threshold value for which the two classes are separated in such a way that their intra-class variance, or combined spread, is minimal. The threshold value given by Otsu's Method thus works well in our case, since the images contain two types of pixels, background pixels and hand pixels, so the two classes are background and hand. The threshold value tries to separate the two peaks so as to give minimal intra-class variance.
As a worked example, consider a small image with six gray levels (0-5). For a candidate threshold value of 3, the 'sum of weighted variances' of the two resulting classes is computed; the same calculation is then performed for all the possible threshold values 0 to 5. The threshold value 3 turns out to have the lowest sum of weighted variances, so it is the final selected threshold: all pixels with a level less than 3 are background, and all those with a level equal to or greater than 3 are foreground.
This approach to calculating Otsu's threshold is useful for explaining the theory, but it is computationally intensive, especially over a full 8-bit grayscale range. The advantage is that the method works well under almost any circumstances, as long as the hand and background pixels create distinct peaks in the histogram of the image. The only problem is that if the user's hand is not in view, the method will still produce a threshold value, one that splits the background pixels into two separate classes, making it difficult to recognize that the hand is absent. This problem can again be solved by using the tests explained further on to make sure the detected white blob really is a hand. Since the chances of the background being thresholded in a way that the resulting white blob passes the hand detection test are extremely low, the system practically gives no false positives.

2.3.4 Dynamic thresholding using color at real time:

Unlike the previous thresholding methods, in this method color based thresholding is done; this can also be termed color level slicing. Initially the user has to provide some dummy input image frames with the hand to be detected in the central part of the image. The system analyses these dummy frames and generates dynamic threshold values in RGB. In this analysis, a small central circular part of the dummy input frames, with arbitrary radius, is considered. The first two pixels of the central part are set as the minimum and maximum pixel values, and then all the remaining pixels in the central part are processed. Every scanned pixel value is compared with the minimum and maximum pixel values: if the scanned pixel value is less than the minimum pixel value, then the minimum pixel value is updated to the
scanned pixel value. Similarly, if the scanned pixel value is greater than the maximum pixel value, the maximum pixel value is updated to the scanned pixel value. The range defined by the minimum and maximum pixel values is then used to threshold the image: any pixel that falls within this range is considered a hand pixel. This method is very accurate at segmenting the hand if the intensity of the hand does not change much during usage, and it can detect any color of hand, making it independent of the user's skin color. However, the dummy input frames must have the hand in the central part, or else the entire system collapses, since the range decided is not actually that of the hand. The background also should not contain pixels with values that fall within the decided range, as they too would be counted as hand pixels.

2.3.5 Color based thresholding (using the inRange function):

In color based thresholding, static values of hand color are considered for thresholding. Color values of the hand are taken, with a minimum value and a maximum value defining a range; these ranges are selected after analysing the general range of colors of human hands. The input image frame is then thresholded using these minimum and maximum values: any pixel within the range is considered a hand pixel and set to 1, and any pixel outside the range is considered a background pixel and set to 0.

As motion tracking is not practical with the first two approaches, i.e. Feature Matching and Machine Learning, we chose color segmentation for our project.

2.4 Conclusion:

The past has a lot to teach us. Work in image processing started in the 1970s, so a great deal has already been done in this field, and there is much one can learn by reviewing the work of earlier researchers. In this chapter the approaches used in the past have been discussed.
Chapter 3
Proposed Work

So far the discussion has covered every detail necessary to better understand "Vision Based Hand Gesture Recognition". This chapter describes the hand gesture recognition system for PC control.

3.1 System Overview

Figure 6: Block Diagram of the Hand Gesture Recognition system for PC control

3.2 Image Acquisition:

This is the first step in any gesture recognition system. The system developed here can capture a sequence of images from real-time video from a static web camera on a computer. The resolution of the camera has no major effect on the functionality of the system. During image acquisition we should make sure that sufficient illumination is present. As the system relies on color segmentation, it will not give proper results under fluorescent lighting, so it is strongly recommended that image acquisition be done under sunlight for better and more accurate results.
3.3 HSV color model:

RGB is useful for hardware implementations and matches nicely with the fact that the human eye is strongly receptive to the red, green and blue primaries. However, RGB is not a particularly intuitive way of describing colors in terms that are practical for human interpretation: when people describe colors they tend to use hue, saturation and brightness. RGB concerns the "implementation details" of the way displays produce color, while HSV concerns the "actual color" components. Another way to say this is that RGB is the way computers treat color, while HSV tries to capture the components of the way we humans perceive color. RGB is great for color generation, but HSV is great for color description.

Figure 7: HSV Color Model

3.3.1 RGB to HSV conversion:
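The standard conversion, with R, G, B scaled to [0, 1], is: V = max(R, G, B); S = (V - min(R, G, B)) / V (taken as 0 when V = 0); and H is computed from whichever channel is the maximum. Python's standard library module colorsys provides a convenient reference implementation. Note that colorsys returns H in [0, 1), whereas OpenCV scales H to 0-179 for 8-bit images; the skin-tone values below are illustrative only:

```python
import colorsys

# Pure red: maximum value, full saturation, hue 0
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # (0.0, 1.0, 1.0)

# A made-up skin-like tone, scaled from 0-255 down to 0-1
r, g, b = 224 / 255, 172 / 255, 105 / 255
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(round(h * 360), round(s, 2), round(v, 2))  # hue in degrees, then S and V
```

Hues of skin tones cluster in a narrow band (roughly orange), which is precisely why thresholding on H in HSV space is more robust to lighting changes than thresholding raw RGB.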
3.4 Color Segmentation:

As discussed in Section 2.3.5, in color based thresholding static values of hand color are considered for thresholding: a minimum and a maximum hand color value are taken as a range, selected after analysing the general range of colors of human hands. The input image frame is then thresholded using these two values. Any pixel within the range is considered a hand pixel and set to 1, and any pixel outside the range is considered a background pixel and set to 0.

Figure 8: Color Segmentation Example

As there are no background constraints in this method, it is highly prone to noise. It can still work on general backgrounds, with the slight constraint that the background should not contain pixels that lie within the specified range; otherwise extra processing, such as selecting the largest contour (explained further on), is required to ensure such white blobs are not detected as the hand. The threshold values are very tricky to select: for some users, the color of the hand could vary a lot and fall outside the specified range, making the system unable to detect that user's hand.

The range check is performed with OpenCV's inRange function (the bounds below are in the HSV space described in Section 3.3):

import cv2
import numpy as np

lower_hand = np.array([0, 30, 60])     # lower bound of hand color (H, S, V)
upper_hand = np.array([20, 150, 250])  # upper bound of hand color (H, S, V)
mask = cv2.inRange(img, lower_hand, upper_hand)
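The semantics of inRange can be illustrated with a small NumPy sketch on a toy 2x2 "image" with made-up pixel values. As in OpenCV, an output pixel is 255 where every channel lies within the bounds (inclusive) and 0 elsewhere:

```python
import numpy as np

def in_range(img, lower, upper):
    """NumPy equivalent of cv2.inRange: per-pixel 255 where all channels
    fall within [lower, upper] inclusive, 0 elsewhere."""
    mask = np.all((img >= lower) & (img <= upper), axis=-1)
    return mask.astype(np.uint8) * 255

lower = np.array([0, 30, 60])
upper = np.array([20, 150, 250])
img = np.array([[[10, 100, 200], [30, 100, 200]],
                [[10, 20, 200], [5, 30, 60]]], dtype=np.uint8)
print(in_range(img, lower, upper))
```

Only the pixels whose first channel is at most 20, second channel between 30 and 150, and third channel between 60 and 250 survive in the mask; the rest become background.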
3.5 Contour detection:

A contour is a curve along which a function of two variables has a constant value; it joins points of equal value (elevation). A contour map illustrates contours using contour lines, which show the steepness of slopes, valleys and hills. The function's gradient is always perpendicular to the contour lines, and where the lines are close together the magnitude of the gradient is usually large. Contours are straight lines or curves describing the intersection of one or more horizontal planes with a real or hypothetical surface.

The contour is drawn around the white blob of the hand found by thresholding the input image. More than one blob may be formed in the image due to noise in the background, so contours are drawn on such smaller white blobs too. Assuming all blobs formed due to noise are small, the largest contour is taken for further processing as the contour of the hand. In this implementation, after pre-processing of the image frame, a white blob is formed and a contour is drawn around it. The contour is returned as a vector of points in coordinate form. Figure 9 shows the detected contour for the input image.

Figure 9: Detected Contour for the Input Image
3.6 Convex Hull:

The convex hull of a set of points in Euclidean space is the smallest convex set that contains all the given points. For example, when the set of points is a bounded subset of the plane, the convex hull can be visualized as the shape formed by a rubber band stretched around the set. The convex hull is drawn around the contour of the hand, such that all contour points are within the hull; this makes an envelope around the hand contour. Figure 10 shows the convex hull formed around the detected hand.

Figure 10: Convex Hull of the Input Image

hull = cv2.convexHull(points[, hull[, clockwise[, returnPoints]]])

Argument details:
 points: the contour we pass in.
 hull: the output; normally we omit it.
 clockwise: orientation flag. If it is True, the output convex hull is oriented clockwise; otherwise, it is oriented counter-clockwise.

To draw all the contours in an image:
cv2.drawContours(img, contours, -1, (0,255,0), 3)

To draw an individual contour, say the 4th contour:
cv2.drawContours(img, contours, 3, (0,255,0), 3)

But most of the time, the method below is useful:
cnt = contours[4]
cv2.drawContours(img, [cnt], 0, (0,255,0), 3)
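To make the "rubber band" idea concrete, here is a small pure-Python convex hull (Andrew's monotone chain algorithm) run on made-up points; in the project itself cv2.convexHull performs this computation on the hand contour:

```python
def convex_hull(points):
    """Andrew's monotone chain: returns the hull vertices of a list of
    (x, y) points in counter-clockwise order, starting from the
    lexicographically smallest point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means a clockwise or straight turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                      # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # endpoints shared, drop duplicates

# The interior point (1, 1) is dropped; only the square's corners remain
print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
```

Points strictly inside the hull (like the palm interior) never appear in the output, which is exactly why the hull forms an envelope around the hand contour.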
The convex hulls are also drawn on an image using the same drawContours function explained for contours; both contours and convex hulls are simply collections of points which need to be connected with straight lines. When we draw convex hulls for a given color range, i.e. the color range of a human palm, we may get many such hulls if other objects match that color range, so we consider only the convex hull with maximum area, which is most likely the hull of the palm.

3.7 Convexity Defects:

When the convex hull is drawn around the contour of the hand, it fits the set of contour points within the hull, using the minimum number of points needed to include all contour points inside or on the hull while maintaining convexity. This causes the formation of defects in the convex hull with respect to the contour of the hand: a defect is present wherever the contour of the object pulls away from the convex hull drawn around it. The convexity-defect computation gives a set of values for every defect in the form of a vector. This vector contains the start and end points of the defect line on the convex hull, given as indices into the contour, from which the coordinate points can easily be retrieved. It also includes the index of the deepest contour point of the defect and its depth value from the hull line. Figure 11 shows an example of the convexity defects calculated for the detected hand using the input image of Figure 9.
Figure 11: Major Convexity Defects Calculated for the Given Image

Final result:

Figure 12: Input Image with All Calculations

hull = cv2.convexHull(cnt, returnPoints = False)
defects = cv2.convexityDefects(cnt, hull)
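The geometry behind a defect's depth can be sketched in pure Python: for each hull edge, find the contour point between its endpoints that lies farthest from the line through the edge, and count a defect when that depth exceeds a threshold. This is a simplified stand-in for cv2.convexityDefects, run on a made-up contour (a square with one deep notch, like the valley between two fingers):

```python
from math import hypot

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    num = abs((b[0] - a[0]) * (a[1] - p[1]) - (a[0] - p[0]) * (b[1] - a[1]))
    return num / hypot(b[0] - a[0], b[1] - a[1])

def count_defects(contour, hull_idx, min_depth=1.0):
    """hull_idx: indices of hull vertices along the contour, in order.
    For each hull edge, the deepest in-between contour point counts as a
    defect if its depth exceeds min_depth."""
    defects = 0
    n = len(hull_idx)
    for k in range(n):
        i, j = hull_idx[k], hull_idx[(k + 1) % n]
        between = (list(range(i + 1, j)) if i < j
                   else list(range(i + 1, len(contour))) + list(range(0, j)))
        if not between:
            continue  # no contour points between these hull vertices
        depth = max(point_line_distance(contour[m], contour[i], contour[j])
                    for m in between)
        if depth > min_depth:
            defects += 1
    return defects

# Square contour with one deep notch on the top edge
contour = [(0, 0), (4, 0), (4, 4), (3, 4), (2, 1), (1, 4), (0, 4)]
hull_idx = [0, 1, 2, 3, 5, 6]  # index 4, the notch point (2, 1), is off the hull
print(count_defects(contour, hull_idx))  # 1
```

The min_depth filter corresponds to discarding shallow defects so that only the pronounced valleys between extended fingers are counted.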
3.8 Gesture Recognition:

After finding the convexity defects we can identify the type of gesture based on the number of convexity defects: a gesture with all five fingers open gives four major convexity defects, four fingers give three, and three fingers give two. A gesture with all fingers closed gives no convexity defects. The result is taken over every 60 frames: the most frequent gesture in the 60 frames is kept and the rest are discarded. The result is an integer value, the number of convexity defects present in the given gesture. This value is sent to a Java class through a gateway called py4j (python4java).

Figure 13: Example of Hand Gestures with Convexity Defects
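The decision rule above (defect count mapped to a gesture, then a majority vote over a window of frames) can be sketched in plain Python. The 60-frame window comes from the text; the gesture names themselves are illustrative labels:

```python
from collections import Counter

def gesture_from_defects(n_defects):
    """Four major defects ~ five open fingers, three ~ four fingers,
    two ~ three fingers, zero ~ closed fist."""
    return {4: "five_fingers", 3: "four_fingers",
            2: "three_fingers", 0: "fist"}.get(n_defects, "unknown")

def majority_gesture(defect_counts):
    """Most frequent gesture over a window of frames (60 in this project);
    less frequent readings in the window are discarded as noise."""
    votes = Counter(gesture_from_defects(d) for d in defect_counts)
    return votes.most_common(1)[0][0]

# 60 noisy frames: mostly 4 defects (open palm), with a few misreads
frames = [4] * 50 + [3] * 7 + [0] * 3
print(majority_gesture(frames))  # five_fingers
```

Voting over a window trades a little latency for robustness: a few frames of segmentation noise cannot flip the recognized gesture.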
3.9 Mouse Events:

Mouse events are performed using the Robot class in Java, available in the java.awt package. A Java class takes as input from the Python program an integer value, the number of convexity defects in the given gesture, based on which mouse events are performed.

3.9.1 Robot Class:

This class is used to generate native system input events for the purposes of test automation, self-running demos, and other applications where control of the mouse and keyboard is needed. The primary purpose of Robot is to facilitate automated testing of Java platform implementations. Using the class to generate input events differs from posting events to the AWT event queue or AWT components, in that the events are generated in the platform's native input queue. For example, Robot.mouseMove will actually move the mouse cursor instead of just generating mouse move events.

Constructor and Description:

Robot()
public Robot() throws AWTException
Constructs a Robot object in the coordinate system of the primary screen.

Method Details:

3.9.1.1 mouseMove

public void mouseMove(int x, int y)
Moves the mouse pointer to the given screen coordinates.
Parameters:
x - X position (x-coordinate)
y - Y position (y-coordinate)
The mouse cursor is moved based on the movement of the centroid of the convex hull drawn for the hand: the system calculates the difference between the previous and present positions, i.e. the (x, y) coordinates of the convex hull centroid.

3.9.1.2 mousePress

public void mousePress(int buttons)
Presses one or more mouse buttons. The mouse buttons should be released using the mouseRelease(int) method.
Parameters:
buttons - the button mask; a combination of one or more mouse button masks.
It is allowed to use only a combination of valid values as the buttons parameter. A valid combination consists of InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK and values returned by the InputEvent.getMaskForButton(button) method. The valid combination also depends on the Toolkit.areExtraMouseButtonsEnabled() value, as follows:
 If support for extended mouse buttons is disabled by Java, then it is allowed to use only the following standard button masks: InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK.
 If support for extended mouse buttons is enabled by Java, then it is allowed to use the standard button masks and the masks for existing extended mouse buttons, if the mouse has more than three buttons. In that case, it is allowed to use the button masks corresponding to the buttons in the range from 1 to MouseInfo.getNumberOfButtons(). It is recommended to use the InputEvent.getMaskForButton(button) method to obtain the mask for any mouse button by its number.
The following standard button masks are also accepted:
- InputEvent.BUTTON1_MASK
- InputEvent.BUTTON2_MASK
- InputEvent.BUTTON3_MASK
However, it is recommended to use InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK and InputEvent.BUTTON3_DOWN_MASK instead. Either the extended _DOWN_MASK values or the old _MASK values should be used, but the two models should not be mixed.
Throws:
IllegalArgumentException - if the buttons mask contains the mask for an extra mouse button and support for extended mouse buttons is disabled by Java
IllegalArgumentException - if the buttons mask contains the mask for an extra mouse button that does not exist on the mouse and support for extended mouse buttons is enabled by Java

3.9.1.3 mouseRelease

public void mouseRelease(int buttons)
Releases one or more mouse buttons.
Parameters:
buttons - the button mask; a combination of one or more mouse button masks. The same combinations of standard and extended button masks are valid as for mousePress(int) above, including the dependence on the Toolkit.areExtraMouseButtonsEnabled() value and the recommendation to obtain masks via the InputEvent.getMaskForButton(button) method. The standard masks InputEvent.BUTTON1_MASK, InputEvent.BUTTON2_MASK and InputEvent.BUTTON3_MASK are also accepted.
Throws:
IllegalArgumentException - under the same conditions as described for mousePress(int) above
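As a rough illustration of how the _DOWN_MASK values combine, the sketch below mirrors the bit layout of java.awt.event.InputEvent's button-down masks (BUTTON1_DOWN_MASK is 1 << 10, and so on). It is a model of the mask arithmetic only, not the Java API itself.

```python
# Python model of InputEvent's button-down masks and how they combine.
BUTTON1_DOWN_MASK = 1 << 10   # left button
BUTTON2_DOWN_MASK = 1 << 11   # middle button
BUTTON3_DOWN_MASK = 1 << 12   # right button

# A combined press of the left and right buttons is the bitwise OR:
combined = BUTTON1_DOWN_MASK | BUTTON3_DOWN_MASK

def has_button(mask, button_mask):
    """True if the combined mask includes the given button's bit."""
    return (mask & button_mask) != 0
```

Because each button occupies its own bit, a mask can be tested or extended without disturbing the other buttons, which is why Robot accepts an OR-ed combination in a single mousePress call.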
3.9.1.4 mouseWheel

public void mouseWheel(int wheelAmt)
Rotates the scroll wheel on wheel-equipped mice.
Parameters:
wheelAmt - number of "notches" to move the mouse wheel. Negative values indicate movement up/away from the user; positive values indicate movement down/towards the user.

3.9.1.5 keyPress

public void keyPress(int keycode)
Presses a given key. The key should be released using the keyRelease method.
Parameters:
keycode - key to press (e.g. KeyEvent.VK_A)
Throws:
IllegalArgumentException - if keycode is not a valid key

3.9.1.6 keyRelease

public void keyRelease(int keycode)
Releases a given key. Key codes that have more than one physical key associated with them (e.g. KeyEvent.VK_SHIFT could mean either the left or right shift key) will map to the left key.
Parameters:
keycode - key to release (e.g. KeyEvent.VK_A)
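The sign convention of mouseWheel can be wrapped in a tiny helper; the function name and the direction strings below are illustrative only, not part of the project's code.

```python
# Helper reflecting mouseWheel's convention: negative notch counts scroll
# up/away from the user, positive counts scroll down/towards the user.
def wheel_amount(direction, notches=1):
    """Return the signed wheelAmt for a scroll direction."""
    if direction == "up":
        return -notches
    if direction == "down":
        return notches
    raise ValueError("direction must be 'up' or 'down'")
```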
Proposed Gestures and Corresponding Mouse Operations:

[Table: each proposed hand gesture, shown as an image in the original report, corresponds to one mouse operation: mouse movement, release click, right click, scroll, middle click or left click.]
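Since the Python side reports only a convexity-defect count to the Java side (Section 3.9), the gesture table above amounts to a mapping from counts to operations. The specific counts in the sketch below are assumptions for illustration; the report does not state which count triggers which operation.

```python
# Hypothetical defect-count-to-operation mapping; the actual counts used
# by the project are not given in the report.
DEFECTS_TO_OPERATION = {
    0: "mouse movement",
    1: "left click",
    2: "right click",
    3: "middle click",
    4: "scroll",
}

def operation_for_defects(n_defects):
    """Select the mouse operation for a convexity-defect count."""
    return DEFECTS_TO_OPERATION.get(n_defects, "release click")
```

On the Java side, each selected operation would then be realized with the Robot methods documented above (mouseMove, mousePress/mouseRelease, mouseWheel).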
3.10 Python4Java (Py4J) Gateway:

Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects. The goal is to let developers program in Python and benefit from Python libraries such as lxml, while still being able to reuse Java libraries and frameworks such as Eclipse and NetBeans. Py4J can be seen as a hybrid between a glorified remote procedure call and using the Java Virtual Machine to run a Python program. Through this gateway we create an object for the Java class in the Python program and pass the output of the Python program, an integer value (the number of convexity defects in the given gesture), to the method defined in the Java class.

3.11 Technologies Used:

3.11.1 Python:

Python is both a programming and a scripting language. Its applications span scripting, programming, hardware-interaction events, device coding, GUI programming, etc. It provides many libraries for operations involving the operating system, file system, hardware and web application development, as well as advanced fields like image processing and voice processing. Some of the libraries used in this project are listed below.

PIL – Python Imaging Library, which provides various image processing operations.
cv2 – Computer vision algorithms library used to process videos and images.
Matplotlib – Provides a graphical interface to visualize data during development as well as processing.
Numpy – A library providing matrix-oriented operations. It plays a crucial role when working with image-based tasks.

Ref.: www.python.org, http://pypi.python.org/

3.11.2 OpenCV (Open Computer Vision):

OpenCV provides libraries for different image operations, video-based filters, hardware interaction, etc. It eases the developer's work by providing standard routines to compute the data required by a program. It spans wide areas such as image processing, computer vision, video processing, object detection and machine learning. OpenCV was started at Intel in 1999 by Gary Bradsky, and the first release came out in 2000. It contains many algorithms related to computer vision and machine learning and is expanding day by day. It supports programming languages such as C++, Python and Java, and is available on different platforms including Windows, Linux, OS X, Android and iOS. Interfaces based on CUDA and OpenCL are also under active development for high-speed GPU operations.

Ref.: http://docs.opencv.org/

The "cv2" library of Python provides the opencv-python operations over different functionalities. Some of them are listed below.

cv2.VideoCapture("file" or index)
This function allows the user to capture frames from a video file or from a device. If a filename is provided instead of a device index, frames are taken from that file. Device indices are 0, 1, 2, etc.: 0 specifies the default web camera, 1 specifies a USB camera connected to the machine, and so on.

cv2.imread("Image", mode)
This function reads the image data into a matrix for further operations. The Image parameter specifies the filename of the image to be read, and mode specifies the image mode to convert to. E.g. mode=0 reads the image at gray level.

cv2.imwrite("Filename", source)
This function writes image data to the specified filename, stored in the current working directory.

cv2.imshow("Name", source)
This function is used to visualize image data at runtime. Name specifies the window name and source specifies the image data to be visualized.

cv2.cvtColor(Image, cv2.COLOR_BGR2GRAY)
Converts a colour image to a gray-level image. The input image should be a colour image in OpenCV's BGR channel order.

3.11.3 Java AWT:

AWT (Abstract Window Toolkit) is a collection of classes that provides graphical components such as buttons, text boxes and Robot actions, which are used in graphical programming.

3.12 Conclusion:

This chapter first described the system overview and gave a detailed description of all the steps involved in the application system, and then described the proposed methodologies and algorithms. The correctness of this proposed work will be discussed in the next chapter.
Chapter 4
Results and Discussion

4.1 Snapshots of the Result

4.1.1 Cursor move based on the centroid position of the convex hull
4.1.2 Moving the cursor onto the folder

4.1.3 Right click on the folder
4.1.4 Cursor moved onto the "open" option

4.1.5 Left click on the selected option
4.1.6 Result of the left click on the selected option
Chapter 5
Conclusion and Future Work

5.1 Conclusion of the Work:

This application system, "Vision Based Hand Gesture Recognition" for PC control, helps the user perform mouse operations with hand gestures. It gives more accurate results when provided with a constant background and proper illumination conditions. As this is a research project, we are able to perform several operations, but with the limitations mentioned above: a noisy background and poor illumination. Since the system uses colour-based segmentation, it may not perform well under some light sources such as fluorescent bulbs, but it performs very well under daylight illumination conditions. The application system is implemented to be platform independent.

5.2 Future Work

Adding more gestures to the system to execute more mouse operations.
Making the system robust to environmental conditions and background variation.
Enhancing the performance of the system to make it more user friendly.
REFERENCES:

[1] G. R. S. Murthy and R. S. Jadon. "A Review of Vision Based Hand Gestures Recognition," International Journal of Information Technology and Knowledge Management, vol. 2(2), pp. 405-410, 2009.
[2] R. Lockton. "Hand Gesture Recognition Using Computer Vision." http://research.microsoft.com/en-us/um/people/awf/bmvc02/project.pdf
[3] S. Mitra and T. Acharya. "Gesture recognition: a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(3):311-324, 2007.
[4] Fakhreddine Karray, Milad Alemzadeh, Jamil Abou Saleh, and Mo Nours Arab. "Human Computer Interaction: Overview on State of the Art," International Journal on Smart Sensing and Intelligent Systems, vol. 1(1), 2008.
[5] R. Fergus, P. Perona, and A. Zisserman. "Object class recognition by unsupervised scale-invariant learning," in CVPR, volume 2, pages 264-271, 2003.
[6] S. Ullman, M. Vidal-Naquet, and E. Sali. "Visual features of intermediate complexity and their use in classification," Nature Neuroscience, 5(7):682-687, 2002.
[7] M. Weber, M. Welling, and P. Perona. "Unsupervised learning of models for recognition," in ECCV, Dublin, Ireland, 2000.
[8] D. G. Lowe. "Distinctive image features from scale invariant keypoints," International Journal of Computer Vision, 2004.
[9] K. Mikolajczyk. "Detection of local features invariant to affine transformations," Ph.D. thesis, Institut National Polytechnique de Grenoble, France, 2002.
[10] A. R. Pope and D. G. Lowe. "Probabilistic models of appearance for 3-D object recognition," International Journal of Computer Vision, 40(2):149-167, 2000.
[11] Y. Ke and R. Sukthankar. "PCA-SIFT: A more distinctive representation for local image descriptors," in CVPR (2), pages 506-513, 2004.
[12] A. Blake and M. Isard. "3D position, attitude and shape input using video tracking of hands and lips," in Proceedings of SIGGRAPH 94, pages 185-192, 1994.
[13] J. Segen. "GEST: a learning computer vision system that recognizes gestures," in Machine Learning IV, edited by Michalski et al., Morgan Kaufmann, 1992.
[14] J. M. Rehg and T. Kanade. "DigitEyes: vision-based human hand tracking," Technical Report CMU-CS-93-220, Carnegie Mellon School of Computer Science, Pittsburgh, PA 15213, 1993.
[15] D. Rubine and P. McAvinney. "Programmable finger-tracking instrument controllers," Computer Music Journal, 14(1):26-41, 1990.
[16] Richard Watson. "Gesture recognition techniques," Technical Report No. TCD-CS-93-11, Trinity College, Department of Computer Science, Dublin, July 1993.
[17] Chan Wah Ng and Surendra Ranganath. "Real-time gesture recognition system and application," Image and Vision Computing, 20(13-14):993-1007, 2002.
[18] Thomas G. Zimmerman, Jaron Lanier, Chuck Blanchard, Steve Bryson, and Young Harvill. "A hand gesture interface device," in SIGCHI/GI Proceedings, Conference on Human Factors in Computing Systems and Graphics Interface, pages 189-192, Toronto, Ontario, Canada, April 05-09, 1987.
[19] Lalit Gupta and Suwei Ma. "Gesture-Based Interaction and Communication: Automated Classification of Hand Gesture Contours," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 1, February 2001.