HGR Project Report Page 1
Chapter 1
Introduction
This chapter will give the reader an insight into what this project work is all about.
1.1 Overview:
Computers are used by many people, both at work and in their spare time. Special input and output devices have been designed over the years with the purpose of easing the communication between computers and humans; the two best known are the keyboard and the mouse. Every new device can be seen as an attempt to make the computer more intelligent and to enable humans to carry out more complex communication with it. This has been possible due to the result-oriented efforts made by computer professionals in creating successful human-computer interfaces. As the complexity of human needs has grown many-fold and continues to grow, the ability to handle complex programming and intuitiveness have become critical attributes for computer programmers to survive in a competitive environment. Computer programmers have been incredibly successful in easing the communication between computers and humans. Every new product that emerges in the market attempts to ease the complexity of the jobs performed; for instance, such interfaces have facilitated tele-operation, robotic use, and better human control over complex work systems such as cars, planes and monitoring systems. Earlier, computer programmers avoided such complex programs, as the focus was more on speed than on other modifiable features. However, a shift towards a user-friendly environment has driven them to revisit this focus area.
With the development of information technology in our society, we can expect computer systems to be embedded into our environment to an ever larger extent. These environments will impose the need for new types of human-computer interaction, with interfaces that are natural and easy to use. The hand is a natural and powerful means of communication that conveys information very effectively. Hand gesture recognition is therefore an important aspect of human-computer interaction, and can be used in various applications, such as virtual reality and computer games.
As this technological world advances, the interaction between humans and devices is becoming ever closer. To take a step further in making these devices more user friendly, indirect interaction between users and devices is needed instead of direct contact. By utilizing generic human capabilities such as vision, voice and gestures, we can establish indirect communication between devices and humans. With this view of human-computer interaction, this project, "Vision Based Hand Gesture Recognition", uses one such capability to enable the user to operate the computer mouse on the display with hand gestures.
The user interface (UI) of the personal computer has evolved from a text-based command line to a graphical interface with keyboard and mouse inputs. However, these devices are inconvenient and unnatural for many tasks. The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction (HCI). In particular, visual interpretation of hand gestures can help in achieving the ease and naturalness desired for HCI.
Vision has the potential of carrying a wealth of information in a non-intrusive manner and at a low cost; therefore it constitutes a very attractive sensing modality for hand gesture recognition. Recent research in computer vision has established the importance of gesture recognition systems for the purpose of human-computer interaction. Two approaches are commonly used to interpret gestures for human-computer interaction:
1.1.1 Methods Which Use Data Gloves:
This method employs sensors (mechanical or optical) attached to a glove that
transduces finger flexions into electrical signals for determining the hand posture. This
approach forces the user to carry a load of cables which are connected to the computer
and hinders the ease and naturalness of the user interaction.
1.1.2 Methods Which are Vision Based:
Computer vision based techniques are non-invasive and based on the way human beings perceive information about their surroundings. Although it is difficult to design a vision based interface for generic usage, it is feasible to design such an interface for a controlled environment.
1.2 Gestures:
It is hard to settle on a specific, useful definition of gestures due to their wide variety of applications, and a statement can only specify a particular domain of gestures. Many researchers have tried to define gestures, but their actual meaning is still arbitrary. Bobick and Wilson have defined gestures as the motion of the body that is intended to communicate with other agents. In the context of this project, a gesture is defined as an expressive movement of body parts carrying a particular message, to be communicated precisely between a sender and a receiver.
A gesture is scientifically categorized into two distinct categories: dynamic and static. A dynamic gesture changes over a period of time, whereas a static gesture is observed at a single instant of time. A waving hand meaning goodbye is an example of a dynamic gesture, and the stop sign is an example of a static gesture. To understand a full message, it is necessary to interpret all the static and dynamic gestures over a period of time. This complex process is called gesture recognition: the process of recognizing and interpreting a continuous stream of sequential gestures from the given set of input data.
1.3 Gesture Based Applications:
Gesture based applications are broadly classified into two groups on the basis of
their purpose: multidirectional control and a symbolic language.
3D Design: CAD (computer aided design) is an HCI domain that provides a platform for the interpretation and manipulation of 3-dimensional inputs, which can be gestures. Manipulating 3D inputs with a mouse is a time-consuming task, as it involves the complicated process of decomposing a six-degree-of-freedom task into at least three sequential two-degree tasks. The Massachusetts Institute of Technology [3] has come up with the 3DRAW technology, which uses a pen embedded in a Polhemus device to track the pen's position and orientation in 3D. A 3Space sensor is embedded in a flat palette representing the plane on which the objects rest. The CAD model is moved synchronously with the user's gesture movements, and objects can thus be rotated and translated in order to view them from all sides as they are being created and altered.
Telepresence: The need for manual operations may arise in cases such as system failure, emergency or hostile conditions, or inaccessible remote areas, where it is often impossible for human operators to be physically present near the machines [4]. Telepresence is the area of technical intelligence that aims to provide physical operation support by mapping the operator's arm to a robotic arm to carry out the necessary task. For instance, the real-time ROBOGEST system constructed at the University of California, San Diego presents a natural way of controlling an outdoor autonomous vehicle by the use of a language of hand gestures. The prospects of telepresence include space and undersea missions, medicine, manufacturing, and the maintenance of nuclear power reactors.
Virtual reality: Virtual reality refers to computer-simulated environments that can simulate physical presence in places in the real world, as well as in imaginary worlds. Most current virtual reality environments are primarily visual experiences, displayed either on a computer screen or through special stereoscopic displays [6]. Some simulations also include additional sensory information, such as sound through speakers or headphones. Some advanced haptic systems now include tactile information, generally known as force feedback, in medical and gaming applications.
Sign Language: Sign languages are among the most raw and natural forms of language and can be dated back to as early as the advent of human civilization, when the first theories of sign language appeared in history; they may have started even before the emergence of spoken languages. Since then, sign language has evolved and been adopted as an integral part of our day-to-day communication. Today, sign languages are used extensively in international sign communication among the deaf, in the world of sports, for religious practices and also at workplaces. Gestures are one of the first forms of communication a child uses to express its need for food, warmth and comfort. They enhance the emphasis of spoken language and help in expressing thoughts and feelings effectively.
Nowadays, a great deal of research is being carried out on human hand gestures in order to interpret them for PC control.
1.4 Purpose and Objective:
The main purpose of this project is to create an application which can identify specific human hand gestures and interpret them for PC control (mouse operations). The use of hand gestures provides an attractive alternative to cumbersome interface devices for human-computer interaction. Basically, this project involves two parts: first, image acquisition from a static webcam and recognition of specific hand gestures; second, after a specific hand gesture is determined, passing this output to a Java program which performs mouse events based on the given hand gesture. A further aim of this project is to help new researchers learn about, and carry out further research on, their topic of interest, which in this case is human hand gesture recognition for PC control.
1.5 Layout of the thesis:
Chapter-1 introduces the reader to the Human Motion Detection. All the
background details and the prior knowledge required to correctly understand this project
work have been briefly covered in this chapter.
Chapter-2 includes the details of the literature survey carried out before starting
this work as well as in between the project work so as to meet the desired
objective.
Chapter-3, titled "Proposed Work", contains the various features that have been used and the proposed method, along with the details of the database used.
Chapter-4 demonstrates the correctness of the proposed system by showing the results and performance values.
Chapter-5 talks about the conclusions derived and the future work along with the listing
of the references that have been used.
1.6 Conclusion:
In this chapter, an outline of the project work has been sketched to give an insight into "Vision Based Hand Gesture Recognition". The motivation for this work has been discussed, along with the report organization details.
Chapter 2
Literature Review
In this chapter, we will look at several hand gesture identification techniques and
methodologies that have been researched and implemented by other researchers.
There are several approaches to identifying human hand gestures; here we present some of those we studied during the literature survey for this project.
1. Feature Matching
2. Machine Learning
3. Segmentation based
2.1 Feature Matching:
This approach is based on comparing image features between two images. Two matchers are commonly used to perform this operation.
1. Brute Force Matcher
2. FLANN Based Matcher
2.1.1 Basics of Brute-Force Matcher:
The Brute-Force matcher is simple: it takes the descriptor of one feature in the first set and matches it against all features in the second set using some distance calculation, and the closest one is returned.
For the BF matcher, we first have to create the BFMatcher object using cv2.BFMatcher(). It takes two optional parameters. The first one is normType, which specifies the distance measurement to be used. By default it is cv2.NORM_L2, which is good for SIFT, SURF, etc.
(cv2.NORM_L1 is also available). For binary-string-based descriptors like ORB, BRIEF, BRISK, etc., cv2.NORM_HAMMING should be used, which uses Hamming distance as the measurement. If ORB is used with WTA_K == 3 or 4, cv2.NORM_HAMMING2 should be used.
The second parameter is a boolean variable, crossCheck, which is false by default. If it is true, the matcher returns only those matches (i, j) such that the i-th descriptor in set A has the j-th descriptor in set B as its best match and vice versa; that is, the two features in both sets should match each other. This provides consistent results and is a good alternative to the ratio test proposed by D. Lowe in the SIFT paper.
Once the matcher is created, two important methods are BFMatcher.match() and BFMatcher.knnMatch(). The first returns the best match; the second returns the k best matches, where k is specified by the user. This is useful when we need to do additional work on the matches. Just as cv2.drawKeypoints() draws keypoints, cv2.drawMatches() helps us draw the matches: it stacks the two images horizontally and draws lines from the first image to the second showing the best matches. There is also cv2.drawMatchesKnn, which draws all k best matches; if k = 2, it will draw two match lines for each keypoint, so we have to pass a mask if we want to draw them selectively.
2.1.2 FLANN Based Matcher:
FLANN stands for Fast Library for Approximate Nearest Neighbors. It contains a collection of algorithms optimized for fast nearest-neighbor search in large datasets and for high-dimensional features. It works faster than BFMatcher for large datasets. We will see an example with the FLANN-based matcher.
For the FLANN-based matcher, we need to pass two dictionaries that specify the algorithm to be used, its related parameters, etc. The first one is IndexParams. For various algorithms, the information to be passed is explained in the FLANN docs. As a summary, for algorithms like SIFT, SURF, etc., you can pass the following:
index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)
While using ORB, you can pass the following. The commented values are recommended in the docs, but they did not provide the required results in some cases; other values worked fine:
index_params= dict(algorithm = FLANN_INDEX_LSH,
table_number = 6, # 12
key_size = 12, # 20
multi_probe_level = 1) #2
The second dictionary is SearchParams. It specifies the number of times the trees in the index should be recursively traversed. Higher values give better precision but also take more time. If you want to change the value, pass search_params = dict(checks = 100).
Both the Brute-Force matcher and the FLANN-based matcher can work with descriptors from either the SIFT or the SURF algorithm. Let us discuss what the SIFT and SURF algorithms are.
2.1.3 Scale Invariant Feature Transform (SIFT):
SIFT is designed to be robust to image rotation, scaling, viewpoint changes, noise and illumination changes. First, keypoints are extracted from the image; then neighborhood regions are picked around each keypoint and feature descriptors are computed. The feature descriptors are extracted and stored in a database, and descriptor matching is based on Euclidean distance.
Algorithm:
1. Scale space extrema detection
2. Keypoint localization
3. Orientation assignment
4. Descriptor generation
Scale space extrema detection:
Keypoints are detected here. The image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken. Keypoints are then taken as the maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.
Keypoint localization:
Once potential keypoint locations are found, they have to be refined to get more accurate results. A Taylor series expansion of the scale space is used to get a more accurate location of each extremum, and if the intensity at this extremum is less than a threshold value (0.03), it is rejected. The DoG function also has strong responses along edges, so to increase stability we need to eliminate keypoints that have poorly determined locations but high edge responses.
Orientation assignment:
Each keypoint is assigned one or more orientations based on local image gradient directions. A neighborhood is taken around the keypoint location depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created, and the highest peak in the histogram is taken to compute the orientation.
Keypoint descriptor:
A 16x16 neighborhood around the keypoint is taken and divided into 16 sub-blocks of 4x4 size. For each sub-block, an 8-bin orientation histogram is created, giving a 128-dimensional descriptor (16 x 8).
2.1.4 Speeded Up Robust Features (SURF):
SURF works much faster than SIFT. For feature description, SURF uses wavelet responses in the horizontal and vertical directions. The detector is based on the Hessian matrix, chosen for its good performance in accuracy.
2.2 Machine Learning:
In classification system will be trained with some sample train data. Later test data
will be given to the system , it will find the given gesture belongs to which family.
1. K-nearest neighbor
2. Support Vector Machine
2.2.1 K-nearest neighbor:
KNN is one of the simplest classification algorithms available for supervised learning. The idea is to search for the closest match to the test data in feature space: we check its k nearest neighbours, and whichever class holds the majority among them is the class the new sample belongs to.
Figure 2: K-nearest neighbor example
In the image, there are two families, Blue Squares and Red Triangles. We call each family a Class. Their houses are shown in their town map, which we call the feature space. Now a new member comes into the town and builds a new home, which is shown as a green circle. He should be added to one of the Blue/Red families. We call that process Classification. Since we are dealing with kNN, let us apply this algorithm.
One method is to check who his nearest neighbour is. From the image, it is clear that this is the Red Triangle family, so he is added to the Red Triangles. This method is called simply Nearest Neighbour, because classification depends only on the single nearest neighbour.
But there is a problem with that. The Red Triangle may be the nearest, but what if there are a lot of Blue Squares near him? Then the Blue Squares have more strength in that locality than the Red Triangles, so just checking the nearest one is not sufficient. Instead, we check some k nearest families; then whichever is the majority among them is the family the newcomer belongs to. In our image, let us take k = 3, i.e. the 3 nearest families. He has two Red and one Blue (there are two Blues equidistant, but since k = 3, we take only one of them), so again he should be added to the Red family. But what if we take k = 7? Then he has 5 Blue families and 2 Red families, so now he should be added to the Blue family. So it all changes with the value of k. Even stranger, what if k = 4? He has 2 Red and 2 Blue neighbours: a tie! So it is better to take k as an odd number. This method is called k-Nearest Neighbour, since classification depends on the k nearest neighbours.
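The majority-vote idea above can be sketched in a few lines of plain Python/NumPy on made-up "town map" data:

```python
import numpy as np
from collections import Counter

def knn_classify(train_pts, train_labels, test_pt, k):
    """Classify test_pt by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_pts - test_pt, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative "town map": Red Triangles cluster near the origin,
# Blue Squares cluster around (5, 5).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]])
labels = np.array(["red", "red", "red", "blue", "blue", "blue", "blue"])

print(knn_classify(pts, labels, np.array([0.5, 0.5]), k=3))  # red
print(knn_classify(pts, labels, np.array([5.5, 5.5]), k=3))  # blue
```

A newcomer near the origin is voted into the red class, and one near (5, 5) into the blue class, exactly as in the town-map discussion.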
2.2.2 Support Vector Machine (SVM):
A support vector machine (SVM) is a computer algorithm that learns by example
to assign labels to objects.
Figure 3: SVM example 1
In SVM we find a line, f(x) = ax1 + bx2 + c, which divides the data into two regions. When we get new test data X, we just substitute it into f(x): if f(X) > 0 it belongs to the blue group, else it belongs to the red group. We can call this line the Decision Boundary. It is very simple and memory-efficient. Data which can be divided into two with a straight line (or hyperplane in higher dimensions) is called Linearly Separable.
In the above image, we can see that plenty of such lines are possible. Which one should we take? Very intuitively, we can say that the line should pass as far as possible from all the points. Why? Because there can be noise in the incoming data, and this noise should not affect the classification accuracy; taking the farthest line provides more immunity against noise. So what SVM does is find the straight line (or hyperplane) with the largest minimum distance to the training samples. See the bold line in the image below passing through the center.
Figure 4: SVM example 2
So to find this Decision Boundary, you need training data. Do you need all of it? No: just the samples which are close to the opposite group are sufficient. In our image, they are the one blue filled circle and the two red filled squares. We call them Support Vectors, and the lines passing through them are called Support Planes. They are adequate for finding our decision boundary; we need not worry about all the data, which helps in data reduction.
What happens is that first two hyperplanes are found which best represent the data. For example, the blue data is represented by w^T x + b >= 1, while the red data is represented by w^T x + b <= -1, where w is the weight vector (w = [w1, w2, ..., wn]) and x is the feature vector (x = [x1, x2, ..., xn]); b is the bias. The weight vector decides the orientation of the decision boundary, while the bias decides its location. The decision boundary is defined to be midway between these hyperplanes, so it is expressed as w^T x + b = 0. The minimum distance from a support vector to the decision boundary is 1 / ||w||. The margin is twice this distance, 2 / ||w||, and we need to maximize this margin; i.e. we need to minimize a new function with some constraints, which can be expressed as:

minimize (1/2) ||w||^2 subject to t_i (w^T x_i + b) >= 1 for all i,

where t_i is the label of each class, t_i in {-1, +1}.
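The decision-function and margin quantities discussed above can be illustrated numerically with a hand-picked weight vector and bias (illustrative values, not learned from data):

```python
import numpy as np

# Hand-picked weight vector and bias for illustration: boundary x1 + x2 = 3.
w = np.array([1.0, 1.0])
b = -3.0

def classify(x):
    """The sign of the decision function w.x + b picks the side of the boundary."""
    return "blue" if np.dot(w, x) + b > 0 else "red"

# The margin is twice the distance from a support vector to the boundary: 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)

print(classify(np.array([4.0, 4.0])))  # w.x + b = 5 > 0 -> blue
print(classify(np.array([0.0, 0.0])))  # w.x + b = -3 < 0 -> red
print(margin)                          # 2 / sqrt(2)
```

A real SVM would choose w and b by solving the constrained minimization above; this sketch only evaluates the resulting decision rule.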
2.3 Image Segmentation:
Hand segmentation deals with separating the user's hand from the background in the image. This can be done using various methods. The most important step for hand segmentation is thresholding, which is used in most of the methods described below to separate the hand from the background.
Thresholding can be used to extract an object from its background by assigning an intensity value to each pixel such that each pixel is classified as either an object pixel or a background pixel. Thresholding is done on the input image according to a threshold value: any pixel with intensity less than the threshold value is set to 0, and any pixel with intensity greater than the threshold value is set to 1. Thus the output of thresholding is a binary image in which all pixels of value 0 belong to the background and all pixels of value 1 represent the hand. The white blob, that is, the set of pixels having value 1, is the object area; in our case the object is the user's hand. The most important component of thresholding is the threshold value, and there are various methods to select an appropriate one.
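The binarization just described amounts to a single comparison per pixel; a minimal NumPy sketch on a toy image:

```python
import numpy as np

# A toy grayscale "image": dark background (20) with a brighter "hand" patch (180).
img = np.full((6, 6), 20, dtype=np.uint8)
img[2:5, 2:5] = 180

threshold = 100
binary = (img > threshold).astype(np.uint8)  # 1 = object (hand), 0 = background

print(binary)
print("object pixels:", int(binary.sum()))  # the 3x3 patch -> 9
```

Every segmentation variant below differs only in how this threshold (or range of values) is chosen.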
Types of segmentation:
1. Static Thresholding
2. Incremental Thresholding
3. Thresholding using Otsu’s method
4. Dynamic thresholding using color at real time
5. Color-based thresholding (using the inRange function)
2.3.1 Static Thresholding:
An image frame is taken as input from the webcam in RGB format and converted into grayscale. Then either a static threshold value is used, or a threshold value from 0 to 255 is selected according to the user's specification. This threshold value should be chosen by the user in such a way that the white blob of the hand is segmented with the minimum noise possible. A trackbar can be provided to adjust the threshold value for the current usage scenario.
ThresholdValue = 0-255, set by the user's requirement
For every usage, either the threshold value is static, that is, the same value is used each time, or the user is required to set the threshold value to ensure a good level of hand segmentation. This method is therefore not preferred, since it makes the system's success or failure dependent on the user setting a proper threshold value, or on the quality of the static threshold value. This method is useful where the intensity of the hand is almost the same whenever the system is used; the background intensity should also be similar every time. But even under constant lighting conditions, the system might fail depending on the user's hand color: if the user's hand is darker, the system might not be able to separate the user's hand from a dark background. Figure 5 below shows the input image thresholded with static threshold values of 20 and 70. The noise introduced at the poorer threshold clearly shows how a bad threshold value can reduce the accuracy of hand detection.
Figure 5: Static Thresholding (input image; thresholded with value 20; thresholded with value 70)
2.3.2 Incremental Thresholding:
In this method, the same pre-processing as in static thresholding is done on the input image frame, converting it from RGB to grayscale. However, instead of using a constant value for every input frame, the threshold value is incremented until a condition is met. A minimum threshold value is set, and the input frame is thresholded using this value. If the current threshold value does not fulfil the condition, the value is incremented and the same procedure is repeated until the condition is met. The condition used to detect the hand is that only one white blob is present in the thresholded image. The detected white blob may also be some other object, so whether the detected object is a hand or not is decided by the hand detection step explained further on. This method can select the threshold value automatically and generally gives good results, especially in environments where the intensity values of the input frames change continuously. It ensures that the entire hand is detected as a whole blob without any internal fragmentation. On the negative side, however, sometimes
the background pixels near the hand might also get included in the white blob. Also, if the background is not uniformly dark, some areas of the background might merge with the hand in the white blob at certain threshold values and still form only one white blob; that is, the image would pass the single-blob condition, but the blob would consist of the hand plus the lighter background areas connected to it.
ThresVal = initially set to some value (0-255); the value is increased until we get the result
To remove these problems, a test has to be conducted to find out whether the white blob has a structure similar to a hand, using convexity defects, as explained further on.
2.3.3 Thresholding using Otsu's Method:
Otsu's method is used to automatically select a threshold value based on the shape of the histogram of the image. The algorithm assumes that the image contains two dominant peaks of pixel intensities in the histogram, that is, two classes of pixels; the histogram should be bimodal for this method to apply. The two classes are foreground and background. The algorithm tries to find the optimal threshold value such that the two classes are separated with minimal intra-class variance (combined spread). The threshold value given by Otsu's method thus works well in our case, since the images contain two types of pixels, background pixels and hand pixels; the threshold value tries to separate the two peaks so as to give minimal intra-class variance.
As an illustration, consider a small image with gray levels 0 to 5. The 'sum of weighted variances' is calculated for every possible threshold value from 0 to 5, and the threshold with the lowest sum of weighted variances, in this example 3, is selected as the final threshold: all pixels with a level less than 3 are background, and all those with a level equal to or greater than 3 are foreground.
This approach to calculating Otsu's threshold is useful for explaining the theory, but it is computationally intensive, especially for a full 8-bit grayscale image with 256 possible thresholds.
The advantage of this method is that it works well under any circumstances, as long as the hand and background pixels create distinct peaks in the histogram of the image.
The only problem with this method is that if the user's hand is not in view, it would still give a threshold value, one which splits the background pixels into two separate classes, making it difficult to tell that the hand is not in view. This problem can again be solved by using the tests explained further on to make sure the detected white blob is indeed a hand. Since the chances of the background being thresholded in a way that the resulting white blob passes the hand detection test are extremely small, the system practically gives no false positives.
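The brute-force search described above (computing the sum of weighted variances for every candidate threshold and keeping the minimum) can be sketched in NumPy; the 6-level toy image is made up for illustration:

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Brute-force Otsu: pick the threshold minimizing the sum of weighted
    intra-class variances of background and foreground."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    total = hist.sum()
    vals = np.arange(levels, dtype=float)
    best_t, best_score = 0, np.inf
    for t in range(1, levels):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class empty: no valid split at this threshold
        mu0 = (vals[:t] * hist[:t]).sum() / w0
        mu1 = (vals[t:] * hist[t:]).sum() / w1
        var0 = ((vals[:t] - mu0) ** 2 * hist[:t]).sum() / w0
        var1 = ((vals[t:] - mu1) ** 2 * hist[t:]).sum() / w1
        score = (w0 * var0 + w1 * var1) / total  # sum of weighted variances
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Bimodal toy image: background around level 1, hand around level 4 (6 levels).
img = np.array([[1, 1, 0, 1], [1, 2, 1, 1],
                [4, 5, 4, 4], [5, 4, 4, 3]], dtype=np.uint8)
print(otsu_threshold(img, levels=6))  # 3
```

In practice OpenCV's cv2.threshold with the THRESH_OTSU flag performs this selection far more efficiently, but the loop above mirrors the calculation the text explains.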
2.3.4 Dynamic thresholding using color at real time:
Unlike the previous thresholding methods, this method performs color-based thresholding, which can also be termed color-level slicing. Initially, the user has to provide some dummy input image frames with the hand to be detected in the central part of the image. The system analyses these dummy input frames and generates dynamic threshold values in RGB. In this analysis, a small central circular region of the dummy input frames, with arbitrary radius, is considered. The first two pixels of the central region are taken as the initial minimum and maximum pixel values; then all the remaining pixels in the central region are processed. Every scanned pixel value is compared with the current minimum and maximum: if the scanned value is less than the minimum pixel value, the minimum is updated to the scanned value; similarly, if the scanned value is greater than the maximum pixel value, the maximum is updated to the scanned value. The range defined by the minimum and maximum pixel values is then used to threshold the image: any pixel that falls within this range is considered a hand pixel.
This method is very accurate at segmenting the hand if the intensity of the hand does not alter much during usage. It can detect a hand of any color, making it independent of the user's skin color. The dummy input frames must have the hand in the central part, or else the entire system collapses, since the decided range is then not actually that of the hand. The background should not contain pixels with values that fall within the decided range, as they too would be included as hand pixels.
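The min/max scan of the central region can be sketched as follows; the frame contents and region size are made up for illustration:

```python
import numpy as np

# Toy BGR frame: background (40, 40, 40) with a colored "hand" in the center.
frame = np.full((60, 60, 3), 40, dtype=np.uint8)
frame[20:40, 20:40] = (90, 120, 200)  # hand pixels (B, G, R)

# Scan a small central region and track per-channel min/max pixel values.
center = frame[25:35, 25:35].reshape(-1, 3)
min_val = center.min(axis=0)
max_val = center.max(axis=0)

# Threshold the whole frame with the learned range: in-range pixels -> hand (1).
mask = np.all((frame >= min_val) & (frame <= max_val), axis=2).astype(np.uint8)

print("range:", min_val.tolist(), "-", max_val.tolist())
print("hand pixels:", int(mask.sum()))  # the 20x20 hand patch -> 400
```

Because the calibration range is learned from the central region, only pixels whose color falls inside it survive, which is exactly why a hand outside the center, or a background sharing the range, breaks the method.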
2.3.5 Color based thresholding(using inRange function) :
In color-based thresholding, static values of hand color are used for thresholding. RGB values of the hand are taken, with a minimum RGB value and a maximum RGB value forming a range; these ranges are selected after analysing the general range of colors of human hands. The input image frame is then thresholded using these minimum and maximum RGB values: any pixel within this range is considered a hand pixel and is set to 1, and pixels outside this range are considered background pixels and set to 0.
As motion tracking is not possible with the first two approaches, i.e. feature matching and machine learning, we chose color segmentation for our project.
2.4 Conclusion:
The past has a lot to teach us. Work in image processing started in the 1970s, so a great deal has already been done in this field, and there is much one can learn by reviewing the works of earlier researchers. In this chapter, the approaches used in the past have been discussed.
Chapter 3
Proposed Work
So far, the discussion has covered every detail necessary to better understand "Vision Based Hand Gesture Recognition".
Hand Gesture Recognition System for pc control:
3.1 System Overview
Figure 6: Block Diagram of Hand Gesture Recognition system for pc control
3.2 Image Acquisition:
This is the first step in any gesture recognition system. The system developed here captures a sequence of images from real-time video from a static web camera of a computer. The resolution of the camera has no major effect on the functionality of the system. During image acquisition, we should make sure that sufficient illumination is present. As the system proceeds with color segmentation, it will not give proper results under fluorescent bulb illumination, so it is strongly recommended that image acquisition be done under sunlight for better and more accurate results.
3.3 HSV color model:
RGB is useful for hardware implementations and matches nicely with the fact that the human eye is strongly responsive to the red, green and blue primaries. However, RGB is not a particularly intuitive way of describing colors in terms that are practical for human interpretation. Rather, when people describe colors they tend to use hue, saturation and brightness. RGB concerns the "implementation details" of the way displays produce color, whereas HSV concerns the "actual color" components. Put another way, RGB is how computers treat color, while HSV tries to capture the components of the way humans perceive color. RGB is great for color generation, but HSV is great for color description.
Figure 7: HSV Color Model
3.3.1 RGB to HSV conversion:
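The conversion appears as a figure in the original report; the standard formulas can be sketched as follows (a reference sketch, with H in degrees [0, 360) and S, V in [0, 1]; the function name is ours):

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to HSV: H in [0, 360) degrees, S and V in [0, 1]."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    v = max(r, g, b)                     # value: the largest component
    delta = v - min(r, g, b)
    s = 0.0 if v == 0 else delta / v     # saturation: chroma relative to value
    if delta == 0:                       # achromatic (gray): hue is undefined
        h = 0.0
    elif v == r:
        h = 60 * (((g - b) / delta) % 6)
    elif v == g:
        h = 60 * (((b - r) / delta) + 2)
    else:                                # v == b
        h = 60 * (((r - g) / delta) + 4)
    return h, s, v
```

The standard-library colorsys module implements the same conversion with all three components scaled to [0, 1].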
3.4 Color Segmentation:
In color based thresholding, static values of hand color are used for thresholding. RGB values of the hand are taken, with a minimum RGB value and a maximum RGB value forming a range. These ranges are selected after analysing the general range of colors of human hands. Each input image frame is then thresholded with these minimum and maximum values: any pixel within the range is considered a hand pixel and set to 1, and any pixel outside the range is considered background and set to 0.
Figure 8: Color Segmentation Example
As there are no background constraints in this method, it is highly prone to noise. It can still work on general backgrounds, with the slight constraint that the background should not contain pixels lying within the specified range. Otherwise, extra processing, such as selecting the largest contour (explained further below), is required to ensure such white blobs are not detected as the hand. The threshold values are tricky to select: for some users the color of the hand can vary considerably and fall outside the specified range, making the system unable to detect their hand.
lower_hand = np.array([0, 30, 60])    # lower bound of the hand-color range
upper_hand = np.array([20, 150, 250]) # upper bound
mask = cv2.inRange(img, lower_hand, upper_hand)
3.5 Contour detection:
A contour of a function of two variables is a curve along which the function has a constant value. A contour joins points of equal elevation above a given level. A contour map illustrates contours using contour lines, which show the steepness of slopes, valleys and hills. The function's gradient is always perpendicular to the contour lines; when the lines are close together, the magnitude of the gradient is usually large. Contours are straight lines or curves describing the intersection of one or more horizontal planes with a real or hypothetical surface.
The contour is drawn around the white blob of the hand found by thresholding the input image. More than one blob may be formed in the image due to noise in the background, so contours are drawn on such smaller white blobs too. Assuming all blobs formed by noise are small, the largest contour is taken for further processing as the contour of the hand.
In this implementation, after preprocessing of the image frame, a white blob is formed and a contour is drawn around it. The contour is stored as a vector of points in coordinate form. Figure 9 shows the detected contour for the input image.
Figure 9: Detected Contour for the Input Image
3.6 Convex Hull:
The convex hull of a set of points in Euclidean space is the smallest convex set that contains all the given points. For example, when this set of points is a bounded subset of the plane, the convex hull can be visualized as the shape formed by a rubber band stretched around the set. The convex hull is drawn around the contour of the hand, such that all contour points lie within it. This forms an envelope around the hand contour. Figure 10 shows the convex hull formed around the detected hand.
Figure 10: Convex Hull of the Input Image
hull = cv2.convexHull(points[, hull[, clockwise[, returnPoints]]])
Argument details:
 points is the contour we pass in.
 hull is the output array; normally we avoid passing it.
 clockwise: orientation flag. If it is True, the output convex hull is oriented clockwise; otherwise it is oriented counter-clockwise.
 returnPoints: if True (the default), the hull is returned as points; if False, as indices of the contour points.
To draw all the contours in an image:
cv2.drawContours(img, contours, -1, (0,255,0), 3)
To draw an individual contour, say 4th contour:
cv2.drawContours(img, contours, 3, (0,255,0), 3)
But most of the time, the method below is useful:
cnt = contours[4]
cv2.drawContours(img, [cnt], 0, (0,255,0), 3)
Convex hulls are also drawn on an image using the same drawContours function explained above, because both contours and convex hulls are simply collections of points that need to be connected with straight lines.
When we draw convex hulls for a given color range, i.e. the color range of the human palm, we may get many such hulls if other objects match that color range, so we consider only the convex hull with maximum area, which is most likely the hull of the palm.
3.7 Convexity Defects:
When the convex hull is drawn around the contour of the hand, it encloses the set of contour points of the hand. The hull uses the minimum number of points needed to include all contour points inside or on it while maintaining convexity. This causes the formation of defects in the convex hull with respect to the contour drawn on the hand.
A defect is present wherever the contour of the object lies away from the convex hull drawn around it. The convexity-defect computation gives a vector of values for every defect. This vector contains the start and end points of the line of the defect in the convex hull; these points are indices into the coordinate points of the contour, so they can easily be retrieved from the contour vector using the start and end indices of the defect. Each convexity defect also includes the index of the depth point in the contour and its depth value from the line. Figure 11 shows an example of convexity defects calculated for the detected hand using the input image of Figure 9.
Figure 11: Major Convexity Defects Calculated for the given image
Final result:
Figure 12: Input image with all calculations
hull = cv2.convexHull(cnt,returnPoints = False)
defects = cv2.convexityDefects(cnt,hull)
3.8 Gesture Recognition:
After finding the convexity defects we can identify the type of gesture based on the number of convexity defects. A gesture with all five fingers open gives four major convexity defects, four fingers give three, and three fingers give two. A gesture with all fingers closed gives no convexity defects. Here the result is taken over every 60 frames: the most frequent gesture within the 60 frames is kept and the rest are discarded. The result is an integer value, the number of convexity defects present in the given gesture. This value is sent to a Java class through a gateway called Py4J (Python for Java).
Figure 13: Example Of Hand Gestures With Convexity Defects
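The counting and majority-vote logic described above can be sketched in plain Python (the defect-to-finger rule and the 60-frame window follow the text; the function names are ours):

```python
from collections import Counter

def fingers_from_defects(defect_count):
    """An open hand showing n fingers yields n-1 defects; 0 defects means a fist."""
    return 0 if defect_count == 0 else defect_count + 1

def most_frequent_gesture(defect_counts):
    """Majority vote over a window of per-frame defect counts (e.g. 60 frames)."""
    return Counter(defect_counts).most_common(1)[0][0]
```

The majority vote makes the recognized gesture robust to a few misclassified frames within the window.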
3.9 Mouse Events:
Mouse events are performed using the Robot class in Java, available in the java.awt package. A Java class takes as input from the Python program an integer value, the number of convexity defects in the given gesture, based on which mouse events are performed.
3.9.1 Robot Class:
This class is used to generate native system input events for the purposes of test
automation, self-running demos, and other applications where control of the mouse and
keyboard is needed. The primary purpose of Robot is to facilitate automated testing of
Java platform implementations.
Using the class to generate input events differs from posting events to the AWT event
queue or AWT components in that the events are generated in the platform's native input
queue. For example, Robot.mouseMove will actually move the mouse cursor instead of
just generating mouse move events.
Constructor and Description:
Robot( )
public Robot( ) throws AWTException
Constructs a Robot object in the coordinate system of the primary screen.
Method Details:
3.9.1.1 mouseMove
public void mouseMove(int x, int y)
Moves mouse pointer to given screen coordinates.
Parameters:
x - X position(x-coordinate)
y - Y position(y-coordinate)
The mouse cursor is moved based on the movement of the centroid of the convex hull drawn for the hand. The system calculates the difference between the previous and present positions, i.e. the (x, y) coordinates of the convex hull centroid.
3.9.1.2 mousePress
public void mousePress(int buttons)
Presses one or more mouse buttons. The mouse buttons should be released using
the mouseRelease(int) method.
Parameters:
buttons - the Button mask; a combination of one or more mouse button masks.
It is allowed to use only a combination of valid values as the buttons parameter.
A valid combination consists of InputEvent.BUTTON1_DOWN_MASK,
InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK
and values returned by the InputEvent.getMaskForButton(button) method. The
valid combination also depends on
the Toolkit.areExtraMouseButtonsEnabled() value, as follows:
 If support for extended mouse buttons is disabled by Java, only the
following standard button masks may be used:
InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK,
InputEvent.BUTTON3_DOWN_MASK.
 If support for extended mouse buttons is enabled by Java, the standard
button masks and masks for existing extended mouse buttons may be used,
if the mouse has more than three buttons. In that case, it is allowed
to use the button masks corresponding to the buttons in the range from 1
to MouseInfo.getNumberOfButtons().
It is recommended to use the InputEvent.getMaskForButton(button) method
to obtain the mask for any mouse button by its number.
The following standard button masks are also accepted:
 InputEvent.BUTTON1_MASK
 InputEvent.BUTTON2_MASK
 InputEvent.BUTTON3_MASK
However, it is recommended to use InputEvent.BUTTON1_DOWN_MASK,
InputEvent.BUTTON2_DOWN_MASK and InputEvent.BUTTON3_DOWN_MASK
instead. Either the extended _DOWN_MASK values or the old _MASK values
should be used, but the two models should not be mixed.
Throws:
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button and support for extended mouse buttons is disabled by Java
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button that does not exist on the mouse and support for extended mouse buttons
is enabled by Java
3.9.1.3 mouseRelease
public void mouseRelease(int buttons)
Releases one or more mouse buttons.
Parameters:
buttons - the Button mask; a combination of one or more mouse button masks.
It is allowed to use only a combination of valid values as a buttons parameter.
A valid combination consists of InputEvent.BUTTON1_DOWN_MASK,
InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK
and values returned by the InputEvent.getMaskForButton(button) method. The
valid combination also depends on
a Toolkit.areExtraMouseButtonsEnabled() value as follows:
 If support for extended mouse buttons is disabled by Java, only the
following standard button masks may be used:
InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK,
InputEvent.BUTTON3_DOWN_MASK.
 If support for extended mouse buttons is enabled by Java, the standard
button masks and masks for existing extended mouse buttons may be used,
if the mouse has more than three buttons. In that case, it is allowed
to use the button masks corresponding to the buttons in the range from 1
to MouseInfo.getNumberOfButtons().
It is recommended to use the InputEvent.getMaskForButton(button) method
to obtain the mask for any mouse button by its number.
The following standard button masks are also accepted:
 InputEvent.BUTTON1_MASK
 InputEvent.BUTTON2_MASK
 InputEvent.BUTTON3_MASK
Throws:
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button and support for extended mouse buttons is disabled by Java
IllegalArgumentException - if the buttons mask contains the mask for extra mouse
button that does not exist on the mouse and support for extended mouse buttons
is enabled by Java
3.9.1.4 mouseWheel
public void mouseWheel(int wheelAmt)
Rotates the scroll wheel on wheel-equipped mice.
Parameters:
wheelAmt - number of "notches" to move the mouse wheel. Negative values indicate movement up/away from the user; positive values indicate movement down/towards the user.
3.9.1.5 keyPress
public void keyPress(int keycode)
Presses a given key. The key should be released using the keyRelease method.
Parameters:
keycode - Key to press (e.g. KeyEvent.VK_A)
Throws:
IllegalArgumentException - if keycode is not a valid key
3.9.1.6 keyRelease
public void keyRelease(int keycode)
Releases a given key.
Key codes that have more than one physical key associated with them
(e.g. KeyEvent.VK_SHIFT could mean either the left or right shift key) will map to the
left key.
Parameters:
keycode - Key to release (e.g. KeyEvent.VK_A)
Proposed Gestures and Corresponding Mouse Operations:
(The gesture images from the original report are not reproduced here; each proposed gesture maps to one of the following operations.)
 Mouse movement / release click
 Right click
 Scroll
 Middle click
 Left click
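Since the gesture images are not reproduced here, the exact gesture-to-action assignment cannot be read from the table; purely as a hypothetical illustration, the dispatch from a defect count to an operation could look like:

```python
# HYPOTHETICAL mapping from defect count to mouse action; the report's real
# mapping is defined by its gesture images, which are not reproduced here.
ACTIONS = {
    0: "left_click",
    1: "middle_click",
    2: "scroll",
    3: "right_click",
    4: "move_or_release",
}

def action_for(defect_count):
    """Return the action name for a defect count, or 'none' if unmapped."""
    return ACTIONS.get(defect_count, "none")
```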
3.10 Py4J (Python for Java) Gateway:
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects. The goal is to enable developers to program in Python and benefit from Python libraries such as lxml while being able to reuse Java libraries and frameworks such as Eclipse and NetBeans. Py4J can be seen as a hybrid between a glorified Remote Procedure Call and using the Java Virtual Machine to run a Python program. Through this gateway we create an object of the Java class in the Python program and pass the output of the Python program, an integer value (the number of convexity defects in the given gesture), to the method defined in the Java class.
3.11 Technologies used:
3.11.1 Python:
Python is a general-purpose programming and scripting language. It serves applications in different areas such as scripting, hardware-interaction events, device coding and GUI programming, and it provides many libraries for operations involving the operating system, file system, hardware, web application development and advanced fields such as image and voice processing. Some of the libraries used in this project are listed below.
PIL – the Python Imaging Library, which provides various image processing operations.
cv2 – a computer vision algorithms library used to process videos and images.
Matplotlib – provides a graphical interface to visualize data during development as well as processing.
NumPy – a library providing matrix-oriented operations. It plays a crucial role when working with image-based tasks.
Ref.: www.python.org, http://pypi.python.org/
3.11.2 OpenCV (Open Source Computer Vision):
OpenCV provides libraries for different image operations, video-based filters, hardware interaction, etc. It makes developers' work easier by providing standard library routines to compute the data a program requires. It spans wide areas such as image processing, computer vision, video processing, object detection and machine learning. OpenCV was started at Intel in 1999 by Gary Bradski and the first release came out in 2000. It has many algorithms related to computer vision and machine learning and is expanding day by day. It supports programming languages such as C++, Python and Java, and is available on different platforms including Windows, Linux, OS X, Android and iOS. Interfaces based on CUDA and OpenCL are also under active development for high-speed GPU operations.
Ref.: http://docs.opencv.org/
The "cv2" module of Python exposes the OpenCV operations across these functionalities. Some of the functions used are listed below.
cv2.VideoCapture("file" or index)
This function allows the user to capture frames from a video file or from a device. If a filename is provided instead of a device index, frames are taken from that file. Device indices are 0, 1, 2, etc.: 0 specifies the default web camera, 1 specifies a USB camera connected to the machine, and so on.
cv2.imread(“Image”,mode)
This function reads the image data into a matrix for further operations. In the parameters, Image specifies the filename of the image to be read and mode specifies the image mode to convert to, e.g. mode=0 reads the image at gray level.
cv2.imwrite(“Filename”, source)
This function writes image data to the specified filename, stored in the current working directory.
cv2.imshow("Name", source)
This function is used to visualize image data at runtime. Name specifies the window name and source specifies the image data to be visualized.
cv2.cvtColor(Image, cv2.COLOR_BGR2GRAY)
Converts a color image to a gray-level image. Note that OpenCV stores color images in BGR channel order, so the input is expected to be a BGR image.
3.11.3 Java AWT:
AWT (Abstract Window Toolkit) is a collection of classes providing graphical components such as buttons and text boxes, as well as Robot actions, for use in graphical programming.
3.12 Conclusion:
This chapter first described the system overview and gave a detailed description of all steps involved in the application system, and then described the proposed methodologies and algorithms. The correctness of this proposed work will be discussed in the next chapter.
Chapter 4
Results and Discussion
4.1 Snapshots of the Result
4.1.1 Cursor move based on centroid position of the convex hull
4.1.2 Moving cursor onto the folder
4.1.3 Right click on the folder
4.1.4 Cursor moved onto the “open” option
4.1.5 Left click on the selected option
4.1.6 Result of left click on the selected option
Chapter 5
Conclusion and Future Work
5.1 Conclusion of the work:
This application system, "Vision Based Hand Gesture Recognition" for PC control, helps perform mouse operations with hand gestures. It gives more accurate results when provided with a constant background and proper illumination conditions. As it is a research project, we are able to perform some operations, but with the limitations mentioned above: a noisy background and poor illumination. Because it performs color based segmentation, it may not perform well under some illumination such as fluorescent bulbs, but it performs very well under daylight illumination conditions. The application system is implemented to be platform independent.
5.2 Future Work
Future work includes adding more gestures to the system to execute more mouse operations, making it robust to environmental conditions and background variation, and enhancing its performance to make it more user friendly.
REFERENCES:
[1] G. R. S. Murthy, R. S. Jadon, "A Review of Vision Based Hand Gestures Recognition," International Journal of Information Technology and Knowledge Management, vol. 2(2), pp. 405-410, 2009.
[2] R. Lockton, "Hand Gesture Recognition Using Computer Vision," http://research.microsoft.com/en-us/um/people/awf/bmvc02/project.pdf
[3] S. Mitra, T. Acharya, "Gesture recognition: a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(3):311-324, 2007.
[4] F. Karray, M. Alemzadeh, J. A. Saleh, M. N. Arab, "Human Computer Interaction: Overview on State of the Art," International Journal on Smart Sensing and Intelligent Systems, vol. 1(1), 2008.
[5] R. Fergus, P. Perona, A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in CVPR, vol. 2, pp. 264-271, 2003.
[6] S. Ullman, M. Vidal-Naquet, E. Sali, "Visual features of intermediate complexity and their use in classification," Nature Neuroscience, 5(7):682-687, 2002.
[7] M. Weber, M. Welling, P. Perona, "Unsupervised learning of models for recognition," in ECCV, Dublin, Ireland, 2000.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 2004.
[9] K. Mikolajczyk, "Detection of local features invariant to affine transformations," Ph.D. thesis, Institut National Polytechnique de Grenoble, France, 2002.
[10] A. R. Pope, D. G. Lowe, "Probabilistic models of appearance for 3-D object recognition," International Journal of Computer Vision, 40(2):149-167, 2000.
[11] Y. Ke, R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in CVPR (2), pp. 506-513, 2004.
[12] A. Blake, M. Isard, "3D position, attitude and shape input using video tracking of hands and lips," in Proceedings of SIGGRAPH 94, pp. 185-192, 1994.
[13] J. Segen, "Gest: a learning computer vision system that recognizes gestures," in Machine Learning IV, Morgan Kaufmann, 1992, edited by Michalski et al.
[14] J. M. Rehg, T. Kanade, "DigitEyes: vision-based human hand tracking," Technical Report CMU-CS-93-220, Carnegie Mellon School of Computer Science, Pittsburgh, PA 15213, 1993.
[15] D. Rubine, P. McAvinney, "Programmable finger-tracking instrument controllers," Computer Music Journal, 14(1):26-41, 1990.
[16] R. Watson, "Gesture recognition techniques," Technical Report TCD-CS-93-11, Department of Computer Science, Trinity College, Dublin, July 1993.
[17] C. W. Ng, S. Ranganath, "Real-time gesture recognition system and application," Image and Vision Computing, 20(13-14):993-1007, 2002.
[18] T. G. Zimmerman, J. Lanier, C. Blanchard, S. Bryson, Y. Harvill, "A hand gesture interface device," in SIGCHI/GI Conference on Human Factors in Computing Systems and Graphics Interface, pp. 189-192, Toronto, Ontario, Canada, April 1987.
[19] L. Gupta, S. Ma, "Gesture-Based Interaction and Communication: Automated Classification of Hand Gesture Contours," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 1, February 2001.

HGR-thesis

  • 1.
    HGR Project ReportPage 1 Chapter 1 Introduction This chapter will give the reader an insight into what this project work is all about. 1.1 Overview: Computer is used by many people either at their work or in their spare-time. Special input and output devices have been designed over the years with the purpose of easing the communication between computers and humans, the two most known are the keyboard and mouse . Every new device can be seen as an attempt to make the computer more intelligent and making humans able to perform more complicated communication with the computer. This has been possible due to the result oriented efforts made by computer professionals for creating successful human computer interfaces . As the complexities of human needs have turned into many folds and continues to grow so, the need for Complex programming ability and intuitiveness are critical attributes of computer programmers to survive in a competitive environment. The computer programmers have been incredibly successful in easing the communication between computers and human. With the emergence of every new product in the market; it attempts to ease the complexity of jobs performed. For instance, it has helped in facilitating tele operating, robotic use, better human control over complex work systems like cars, planes and monitoring systems. Earlier, Computer programmers were avoiding such kind of complex programs as the focus was more on speed than other modifiable features. However, a shift towards a user friendly environment has driven them to revisit the focus area . With the development of information technology in our society, we can expect that computer systems to a larger extent will be embedded into our environment. These environments will impose needs for new types of human computer-interaction, with
  • 2.
    HGR Project ReportPage 2 interfaces that are natural and easy to use. Hand is a natural and powerful means of communication that conveys information very effectively. Hand gesture recognition is an important aspect in Human-Computer interaction , and can be used in various applications, such as virtual reality and computer games. In moving advanced with this technological world the interaction between the human and devices is becoming very closer to each other. To move step ahead in making these devices more user friendly, indirect interaction between the users and the devices is needed instead of direct contact. By utilizing the generic features of living beings like vision, voice, gestures we can establish an indirect communication between the devices and humans. In the view of interaction between the computer and the human, this project ―Vision Based Hand Gesture Recognition‖ extends the one of the feature to enable the user to use the computer mouse over the display with human hand gestures. The user interface (UI) of the personal computer has evolved from a text-based command line to a graphical interface with keyboard and mouse inputs. However, they are inconvenient and unnatural. The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction (HCI). In particular, visual interpretation of hand gestures can help in achieving the ease and naturalness desired for HCI. Vision has the potential of carrying a wealth of information in a nonintrusive manner and at a low cost, therefore it constitutes a very attractive sensing modality for developing hand gestures recognition. Recent researches in computer vision have established the importance of gesture recognition systems for the purpose of human computer interaction. Two approaches are commonly used to interpret gestures for Human Computer interaction. They are
  • 3.
    HGR Project ReportPage 3 1.1.1 Methods Which Use Data Gloves: This method employs sensors (mechanical or optical) attached to a glove that transduces finger flexions into electrical signals for determining the hand posture. This approach forces the user to carry a load of cables which are connected to the computer and hinders the ease and naturalness of the user interaction. 1.1.2 Methods Which are Vision Based: Computer vision based techniques are non invasive and based on the way human beings perceive information about their surroundings. Although it is difficult to design a vision based interface for generic usage, yet it is feasible to design such an interface for a controlled environment 1.2 Gestures: It is hard to settle on a specific useful definition of gestures due to its wide variety of applications and a statement can only specify a particular domain of gestures. Many researchers had tried to define gestures but their actual meaning is still arbitrary. Bobick and Wilson have defined gestures as the motion of the body that is intended to communicate with other agents. As per the context of the project, gesture is defined as an expressive movement of body parts which has a particular message, to be communicated precisely between a sender and a receiver. A gesture is scientifically categorized into two distinctive categories: dynamic and static. A dynamic gesture is intended to change over a period of time whereas a static gesture is observed at the spurt of time. A waving hand means goodbye is an example of dynamic gesture and the stop sign is an example of static gesture. To understand a full message, it is necessary to interpret all the static and dynamic gestures over a period of time. This complex process is called gesture recognition. Gesture recognition is the process of recognizing and interpreting a stream continuous sequential gesture from the given set of input data.
  • 4.
    HGR Project ReportPage 4 1.3 Gesture Based Applications: Gesture based applications are broadly classified into two groups on the basis of their purpose: multidirectional control and a symbolic language. 3D Design: CAD (computer aided design) is an HCI which provides a platform for interpretation and manipulation of 3-Dimensional inputs which can be the gestures. Manipulating 3D inputs with a mouse is a time consuming task as the task involves a complicated process of decomposing a six degree freedom task into at least three sequential two degree tasks. Massachuchetttes institute of technology [3] has come up with the 3DRAW technology that uses a pen embedded in polhemus device to track the pen position and orientation in 3D.A 3space sensor is embedded in a flat palette, representing the plane in which the objects rest .The CAD model is moved synchronously with the users gesture movements and objects can thus be rotated and translated in order to view them from all sides as they are being created and altered. Tele presence: There may raise the need of manual operations in some cases such as system failure or emergency hostile conditions or inaccessible remote areas. Often it is impossible for human operators to be physically present near the machines [4]. Tele presence is that area of technical intelligence which aims to provide physical operation support that maps the operator arm to the robotic arm to carry out the necessary task, for instance the real time ROBOGEST system constructed at University of California, San Diego presents a natural way of controlling an outdoor autonomous vehicle by use of a language of hand gestures. The prospects of tele presence includes space, undersea mission, medicine manufacturing and in maintenance of nuclear power reactors. Virtual reality: Virtual reality is applied to computer-simulated environments that can simulate physical presence in places in the real world, as well as in imaginary worlds. 
Most current virtual reality environments are primarily visual experiences, displayed either on a computer screen or through special stereoscopic displays [6]. There are also some simulations include additional sensory information, such as sound through speakers
  • 5.
    HGR Project ReportPage 5 or headphones. Some advanced, haptic systems now include tactile information, generally known as force feedback, in medical and gaming applications. Sign Language: Sign languages are the most raw and natural form of languages could be dated back to as early as the advent of the human civilization, when the first theories of sign languages appeared in history. It has started even before the emergence of spoken languages. Since then the sign language has evolved and been adopted as an integral part of our day to day communication process. Now, sign languages are being used extensively in international sign use of deaf and dumb, in the world of sports, for religious practices and also at work places. Gestures are one of the first forms of communication when a child learns to express its need for food, warmth and comfort. It enhances the emphasis of spoken language and helps in expressing thoughts and feelings effectively. Now a days a lot of research is going on the human hand gestures to interpret them for pc control. 1.4 Purpose and Objective: The main purpose of this project is to create an application which can identify specific human hand gestures and interpret them for pc control (mouse operations). The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction. Basically this project involves two parts. First, image acquisition using static system webcam and recognizing specific hand gestures. Second, after determining specific hand gesture this output should be given to a java program which performs mouse events based on the given hand gesture. The purpose of this project is to help new researchers learn and further research on their topic of interest, which in this case is the human hand gesture recognition for pc control.
1.5 Layout of the thesis:

Chapter 1 introduces the reader to human motion detection. All the background details and prior knowledge required to correctly understand this project work are briefly covered in this chapter. Chapter 2 includes the details of the literature survey carried out before starting this work, as well as during the project work, so as to meet the desired objective. Chapter 3, titled "Proposed Work", describes the various features that have been used and the proposed method, along with details of the database used. Chapter 4 demonstrates the correctness of the proposed system by presenting the results and performance values. Chapter 5 discusses the conclusions drawn and the future work, along with the list of references that have been used.

1.6 Conclusion:

In this chapter an outline of the project work has been sketched to give an insight into "Vision Based Hand Gesture Recognition". The motivation for implementing this work has been discussed, along with the organization of the report.
Chapter 2
Literature Review

In this chapter, we look at several hand gesture identification techniques and methodologies that have been researched and implemented by other researchers. There are several approaches to identifying human hand gestures; here we present some of those studied during the literature survey of this project.
1. Feature Matching
2. Machine Learning
3. Segmentation based

2.1 Feature Matching:

This approach is based on comparing image features between two images. There are two matchers that perform this operation.
1. Brute-Force Matcher
2. FLANN Based Matcher

2.1.1 Basics of Brute-Force Matcher:

The Brute-Force matcher is simple: it takes the descriptor of one feature in the first set and matches it against all features in the second set using some distance calculation, and the closest one is returned. For the BF matcher, we first have to create the BFMatcher object using cv2.BFMatcher(). It takes two optional parameters. The first is normType, which specifies the distance measurement to be used. By default it is cv2.NORM_L2, which is good for SIFT, SURF etc.
(cv2.NORM_L1 is also there). For binary-string-based descriptors like ORB, BRIEF and BRISK, cv2.NORM_HAMMING should be used, which uses Hamming distance as the measurement. If ORB is used with WTA_K == 3 or 4, cv2.NORM_HAMMING2 should be used. The second parameter is a boolean variable, crossCheck, which is false by default. If it is true, the matcher returns only those matches (i, j) such that the i-th descriptor in set A has the j-th descriptor in set B as its best match and vice versa; that is, the two features in both sets should match each other. This provides consistent results, and is a good alternative to the ratio test proposed by D. Lowe in the SIFT paper.

Once the matcher is created, two important methods are BFMatcher.match() and BFMatcher.knnMatch(). The first returns the best match. The second returns the k best matches, where k is specified by the user; this may be useful when we need to do additional work on the matches. Just as cv2.drawKeypoints() is used to draw keypoints, cv2.drawMatches() helps us draw the matches: it stacks two images horizontally and draws lines from the first image to the second showing the best matches. There is also cv2.drawMatchesKnn, which draws all the k best matches; if k=2, it will draw two match lines for each keypoint, so we have to pass a mask if we want to draw selectively.

2.1.2 FLANN Based Matcher:

FLANN stands for Fast Library for Approximate Nearest Neighbors. It contains a collection of algorithms optimized for fast nearest-neighbor search in large datasets and for high-dimensional features. It works much faster than BFMatcher for large datasets. For the FLANN based matcher, we need to pass two dictionaries which specify the algorithm to be used, its related parameters etc. The first is IndexParams. For various
algorithms, the information to be passed is explained in the FLANN docs. As a summary, for algorithms like SIFT, SURF etc. you can pass the following:

index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)

While using ORB, you can pass the following. The commented values are the ones recommended in the docs, but they did not give the required results in some cases; other values worked fine:

index_params = dict(algorithm = FLANN_INDEX_LSH,
                    table_number = 6,      # 12
                    key_size = 12,         # 20
                    multi_probe_level = 1) # 2

The second dictionary is SearchParams. It specifies the number of times the trees in the index should be recursively traversed. Higher values give better precision, but also take more time. If you want to change the value, pass

search_params = dict(checks = 100)

Both the Brute-Force matcher and the FLANN based matcher can use either SIFT or SURF descriptors. Let us discuss the SIFT and SURF algorithms.

2.1.3 Scale Invariant Feature Transform (SIFT):

SIFT is designed to cope with image rotation, scaling, viewpoint change, noise and illumination changes. First the keypoints are extracted from the image; then neighbourhood regions are picked around each keypoint and feature descriptors are computed. The feature descriptors are extracted and stored in a database, and descriptor matching is based on Euclidean distance.
Algorithm:
 Scale-space extrema detection
 Keypoint localization
 Orientation assignment
 Descriptor generation

Scale-space extrema detection: Keypoints are detected here. The image is convolved with Gaussian filters at different scales, and the difference of successive Gaussian-blurred images is taken. Keypoints are then taken as the maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.

Keypoint localization: Once potential keypoint locations are found, they have to be refined to get more accurate results. A Taylor series expansion of the scale space is used to get a more accurate location of each extremum, and if the intensity at the extremum is less than a threshold value (0.03), it is rejected. The DoG function has strong responses along edges, so to increase stability we also need to eliminate keypoints that have poorly determined locations but high edge responses.
Orientation assignment: Each keypoint is assigned one or more orientations based on local image gradient directions. A neighbourhood is taken around the keypoint location depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created, and the highest peak in the histogram is taken to compute the orientation.

Keypoint descriptor: A 16x16 neighbourhood around the keypoint is taken and divided into 16 sub-blocks of 4x4 size. For each sub-block, an 8-bin orientation histogram is created.

2.1.4 Speeded Up Robust Features (SURF):

SURF works much faster than SIFT. For feature description, SURF uses wavelet responses in the horizontal and vertical directions. The detector is based on the Hessian matrix, chosen for its good performance in accuracy.
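The brute-force matching idea described in Section 2.1.1 can be illustrated with a small, self-contained sketch. This is a toy example in pure Python using Hamming distance on made-up 8-bit binary "descriptors"; in practice cv2.BFMatcher performs this search over real ORB/BRIEF descriptors:

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def brute_force_match(set_a, set_b):
    """For each descriptor in set_a, return (index in A, index of the
    closest descriptor in B, distance) using exhaustive comparison."""
    matches = []
    for i, da in enumerate(set_a):
        dists = [hamming(da, db) for db in set_b]
        j = min(range(len(set_b)), key=dists.__getitem__)
        matches.append((i, j, dists[j]))
    return matches

# Toy 8-bit "descriptors" standing in for ORB output
A = ["10110010", "00001111"]
B = ["10110011", "11110000", "00001110"]
print(brute_force_match(A, B))  # [(0, 0, 1), (1, 2, 1)]
```

The exhaustive loop over set B is exactly what makes the brute-force matcher slow on large descriptor sets, and why FLANN's approximate nearest-neighbor indices pay off there.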
2.2 Machine Learning:

In classification, the system is first trained with some sample training data. Later, test data is given to the system, and it determines which family (class) the given gesture belongs to.
1. K-nearest neighbor
2. Support Vector Machine

2.2.1 K-nearest neighbor:

kNN is one of the simplest classification algorithms available for supervised learning. The idea is to search for the closest match to the test data in feature space: we check the k nearest families, and whichever class holds the majority among them is the class the new sample belongs to.

Figure 2: K-nearest neighbor example

In the image there are two families, Blue Squares and Red Triangles. We call each family a Class. Their houses are shown in their town map, which we call the feature space. Now a new member comes into the town and creates a new home, shown as a green circle. He should be added to one of the Blue/Red families. We call that process Classification. Since we are dealing with kNN, let us apply this algorithm.
One method is to check who his nearest neighbour is. From the image, it is clearly the Red Triangle family, so he is added to Red Triangle. This method is called simply Nearest Neighbour, because classification depends only on the single nearest neighbour.

But there is a problem with that. Red Triangle may be the nearest, but what if there are a lot of Blue Squares near him? Then Blue Squares have more strength in that locality than Red Triangle, so checking only the nearest one is not sufficient. Instead we check the k nearest families; whichever class is the majority among them, the newcomer belongs to that family. In our image, let us take k=3, i.e. the 3 nearest families. He has two Red and one Blue (there are two Blues equidistant, but since k=3 we take only one of them), so again he should be added to the Red family. But what if we take k=7? Then he has 5 Blue families and 2 Red families, so now he should be added to the Blue family. So it all changes with the value of k. More interestingly, what if k=4? He has 2 Red and 2 Blue neighbours, which is a tie, so it is better to take k as an odd number. This method is called k-Nearest Neighbour, since classification depends on the k nearest neighbours.

2.2.2 Support Vector Machine (SVM):

A support vector machine (SVM) is a computer algorithm that learns by example to assign labels to objects.

Figure 3: SVM example 1
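Before going into SVM in detail, the kNN vote just described can be sketched in a few lines of plain Python. This is a toy 2-D example with made-up points and labels, not tied to the gesture features used later in the project:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of ((x, y), label) pairs; query: an (x, y) point.
    Vote among the k nearest training points (squared Euclidean distance)."""
    dist = lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2
    nearest = sorted(train, key=dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two made-up "families" in feature space
train = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_classify(train, (1, 1), k=3))   # red
print(knn_classify(train, (5, 4), k=3))   # blue
```

Using an odd k, as the text recommends, avoids ties in the two-class vote.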
In SVM we find a line, f(x) = ax1 + bx2 + c, which divides the data into two regions. When we get new test data X, we just substitute it into f(x): if f(X) > 0, it belongs to the blue group, else it belongs to the red group. We call this line the Decision Boundary. It is very simple and memory-efficient. Data which can be divided in two with a straight line (or a hyperplane in higher dimensions) is called linearly separable.

In the above image we can see that plenty of such lines are possible. Which one should we take? Very intuitively, the line should pass as far as possible from all the points. Why? Because there can be noise in the incoming data, and this noise should not affect the classification accuracy; taking the farthest line provides more immunity against noise. So what SVM does is find a straight line (or hyperplane) with the largest minimum distance to the training samples. See the bold line in the image below passing through the center.

Figure 4: SVM example 2

To find this Decision Boundary, you need training data. Do you need all of it? No. Just the samples which are close to the opposite group are sufficient. In our image, they are the one blue filled circle and the two red filled squares. We call them Support Vectors, and the lines passing through them are called Support Planes. They are adequate for finding our decision boundary, so we need not worry about all the data; this helps in data reduction. First, two hyperplanes are found which best represent the data: the blue data is represented by w.x + b >= 1 while the red data is represented by w.x + b <= -1,
where w is the weight vector and x is the feature vector; b is the bias. The weight vector decides the orientation of the decision boundary, while the bias decides its location. The decision boundary is defined to be midway between these hyperplanes, and is expressed as w.x + b = 0. The minimum distance from a support vector to the decision boundary is given by 1/||w||. The margin is twice this distance, and we need to maximize it; equivalently, we need to minimize ||w||^2 / 2 subject to the constraints y_i (w.x_i + b) >= 1, where y_i is the label of each class, y_i ∈ {-1, +1}.

2.3 Image Segmentation:

Hand segmentation deals with separating the user's hand from the background in the image. This can be done using various methods. The most important step for hand segmentation is thresholding, which is used in most of the methods described below to separate the hand from the background. Thresholding can be used to extract an object from its background by assigning an intensity value to each pixel such that each pixel is classified as either an object pixel or a background pixel. Thresholding is done on the input image according to a threshold value: any pixel with intensity less than the threshold value is set to 0, and any pixel with intensity greater than the threshold value is set to 1. Thus the output of thresholding is a binary image, with 0-pixels belonging to the background and 1-pixels representing the hand. The white blob, that is, the region of pixels having value 1, is the object area; in our case the object is the user's hand. The most important component of thresholding is the threshold value itself, and there are various methods to select an appropriate one.
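The basic thresholding rule just described can be sketched with NumPy. This is a minimal illustration on a made-up grayscale array, using >= as the object test; in OpenCV the same operation is provided by cv2.threshold:

```python
import numpy as np

def threshold(gray, t):
    """Binary threshold: pixels >= t become 1 (object), the rest 0 (background)."""
    return (gray >= t).astype(np.uint8)

# Made-up 3x3 grayscale frame: a bright vertical stripe on a dark background
gray = np.array([[10, 200, 30],
                 [220, 250, 40],
                 [20, 210, 35]], dtype=np.uint8)
print(threshold(gray, 128))
```

The resulting 0/1 mask is the "binary image" referred to throughout this section; the connected region of 1-pixels is the white blob.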
Types of segmentation:
1. Static Thresholding
2. Incremental Thresholding
3. Thresholding using Otsu's method
4. Dynamic thresholding using color at real time
5. Color based thresholding (using the inRange function)

2.3.1 Static Thresholding:

An image frame is taken as input from the webcam in RGB format and converted into grayscale. Then either a static threshold value is used, or a threshold value from 0 to 255 is selected according to the user's specification. This threshold value should be chosen by the user in such a way that the white blob of the hand is segmented with the minimum noise possible. A trackbar can be provided to adjust the threshold value for the current usage scenario.

ThresholdValue = 0-255, set by the user requirement

For every use, either the threshold value is static, i.e. the same value is used each time, or the user is required to set the threshold value to ensure a good level of hand segmentation. This method is therefore not used, since it makes the system's success or failure dependent on the user setting a proper threshold value, or on the quality of the static threshold value. The method is useful where the intensity of the hand is almost the same whenever the system is used; the background intensity should also be similar each time. But even under constant lighting conditions, the system might fail depending on the user's hand color: if the user's hand is darker, the system might not be able to separate the hand from a dark background. Figures 2 and 3 below show the thresholded input image from Figure 1 using static threshold values of 70 and 20. The noise introduced in Figure 3
clearly shows how a bad threshold value can introduce noise, which reduces the accuracy of hand detection.

Figure 5: Static Thresholding (Fig 1: input image; Fig 2: thresholded image with value 20; Fig 3: thresholded image with value 70)

2.3.2 Incremental Thresholding:

In this method, the same pre-processing as in static thresholding is done on the input image frame, converting from RGB to grayscale. But instead of using a constant value for every input frame, the threshold value is incremented until a condition is met. A minimum threshold value is set and the input image frame is thresholded using this value. If the current threshold value does not fulfil the condition, it is incremented and the same procedure is followed until the condition is met. The condition for detecting the hand is that only one white blob is present in the thresholded image. The detected white blob can also be some other object, so whether the detected object is a hand or not is decided by the hand detection part explained further on.

This method can automatically select the threshold value. It generally gives good results, especially in environments where the intensity values of the input frame change continuously, and it ensures that the entire hand is detected as a whole blob without any internal fragmentation. But on the negative side, sometimes
the background pixels near the hand might also get included in the white blob. Also, if the background is not constantly dark, some areas of the background might add up with the hand in the white blob at certain threshold values and still form only one white blob. That is, even though the image would pass the condition that only one white blob is present, the blob would consist of the hand together with the lighter background areas connected to it.

ThresVal = initially set to some value (0-255); the value is increased until we get the result

To remove these problems, a test has to be conducted to find whether the white blob has a structure similar to a hand, using the convexity defects explained further on.

2.3.3 Thresholding using Otsu's Method:

Otsu's Method is used to automatically select a threshold value based on the shape of the histogram of the image. The algorithm assumes that the image contains two dominant peaks of pixel intensities in the histogram, that is, two classes of pixels; the histogram should be bimodal for this method to apply. The two classes are foreground and background. The algorithm tries to find the optimal threshold value for which the two classes are separated in such a way that their intra-class variance, or combined spread, is minimal. The threshold value given by Otsu's Method thus works well in our case, since the images contain two types of pixels, background pixels and hand pixels, so the two classes are background and hand. The threshold value tries to separate the two peaks so as to give minimal intra-class variance.
As a worked example, consider a small image with six gray levels (0-5). For a candidate threshold value of 3, the 'sum of weighted variances' of the two resulting classes is computed; the same calculation is then performed for all the possible threshold values 0 to 5. The threshold value 3 turns out to have the lowest sum of weighted variances, so it is the final selected threshold: all pixels with a level less than 3 are background, and all those with a level equal to or greater than 3 are foreground.
This approach to calculating Otsu's threshold is useful for explaining the theory, but it is computationally intensive, especially over a full 8-bit grayscale range. The advantage is that the method works well under almost any circumstances, as long as the hand and background pixels create distinct peaks in the histogram of the image. The only problem is that if the user's hand is not in view, the method will still produce a threshold value, one that splits the background pixels into two separate classes, making it difficult to recognize that the hand is absent. This problem can again be solved by using the tests explained further on to make sure the detected white blob really is a hand. Since the chances of the background being thresholded in a way that the resulting white blob passes the hand detection test are extremely low, the system practically gives no false positives.

2.3.4 Dynamic thresholding using color at real time:

Unlike the previous thresholding methods, in this method color based thresholding is done; this can also be termed color level slicing. Initially the user has to provide some dummy input image frames with the hand to be detected in the central part of the image. The system analyses these dummy frames and generates dynamic threshold values in RGB. In this analysis, a small central circular part of the dummy input frames, with arbitrary radius, is considered. The first two pixels of the central part are set as the minimum and maximum pixel values, and then all the remaining pixels in the central part are processed. Every scanned pixel value is compared with the minimum and maximum pixel values: if the scanned pixel value is less than the minimum pixel value, then the minimum pixel value is updated to the
scanned pixel value. Similarly, if the scanned pixel value is greater than the maximum pixel value, the maximum pixel value is updated to the scanned pixel value. The range defined by the minimum and maximum pixel values is then used to threshold the image: any pixel that falls within this range is considered a hand pixel. This method is very accurate at segmenting the hand if the intensity of the hand does not change much during usage, and it can detect any color of hand, making it independent of the user's skin color. However, the dummy input frames must have the hand in the central part, or else the entire system collapses, since the range decided is not actually that of the hand. The background also should not contain pixels with values that fall within the decided range, as they too would be counted as hand pixels.

2.3.5 Color based thresholding (using the inRange function):

In color based thresholding, static values of hand color are considered for thresholding. Color values of the hand are taken, with a minimum value and a maximum value defining a range; these ranges are selected after analysing the general range of colors of human hands. The input image frame is then thresholded using these minimum and maximum values: any pixel within the range is considered a hand pixel and set to 1, and any pixel outside the range is considered a background pixel and set to 0.

As motion tracking is not practical with the first two approaches, i.e. Feature Matching and Machine Learning, we chose color segmentation for our project.

2.4 Conclusion:

The past has a lot to teach us. Work in image processing started in the 1970s, so a great deal has already been done in this field, and there is much one can learn by reviewing the work of earlier researchers. In this chapter the approaches used in the past have been discussed.
Chapter 3
Proposed Work

So far the discussion has covered every detail necessary to better understand "Vision Based Hand Gesture Recognition". This chapter describes the hand gesture recognition system for PC control.

3.1 System Overview

Figure 6: Block Diagram of the Hand Gesture Recognition system for PC control

3.2 Image Acquisition:

This is the first step in any gesture recognition system. The system developed here can capture a sequence of images from real-time video from a static web camera on a computer. The resolution of the camera has no major effect on the functionality of the system. During image acquisition we should make sure that sufficient illumination is present. As the system relies on color segmentation, it will not give proper results under fluorescent lighting, so it is strongly recommended that image acquisition be done under sunlight for better and more accurate results.
3.3 HSV color model:

RGB is useful for hardware implementations and matches nicely with the fact that the human eye is strongly receptive to the red, green and blue primaries. However, RGB is not a particularly intuitive way of describing colors in terms that are practical for human interpretation: when people describe colors they tend to use hue, saturation and brightness. RGB concerns the "implementation details" of the way displays produce color, while HSV concerns the "actual color" components. Another way to say this is that RGB is the way computers treat color, while HSV tries to capture the components of the way we humans perceive color. RGB is great for color generation, but HSV is great for color description.

Figure 7: HSV Color Model

3.3.1 RGB to HSV conversion:
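The standard conversion, with R, G, B scaled to [0, 1], is: V = max(R, G, B); S = (V - min(R, G, B)) / V (taken as 0 when V = 0); and H is computed from whichever channel is the maximum. Python's standard library module colorsys provides a convenient reference implementation. Note that colorsys returns H in [0, 1), whereas OpenCV scales H to 0-179 for 8-bit images; the skin-tone values below are illustrative only:

```python
import colorsys

# Pure red: maximum value, full saturation, hue 0
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # (0.0, 1.0, 1.0)

# A made-up skin-like tone, scaled from 0-255 down to 0-1
r, g, b = 224 / 255, 172 / 255, 105 / 255
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(round(h * 360), round(s, 2), round(v, 2))  # hue in degrees, then S and V
```

Hues of skin tones cluster in a narrow band (roughly orange), which is precisely why thresholding on H in HSV space is more robust to lighting changes than thresholding raw RGB.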
3.4 Color Segmentation:

As discussed in Section 2.3.5, in color based thresholding static values of hand color are considered for thresholding: a minimum and a maximum hand color value are taken as a range, selected after analysing the general range of colors of human hands. The input image frame is then thresholded using these two values. Any pixel within the range is considered a hand pixel and set to 1, and any pixel outside the range is considered a background pixel and set to 0.

Figure 8: Color Segmentation Example

As there are no background constraints in this method, it is highly prone to noise. It can still work on general backgrounds, with the slight constraint that the background should not contain pixels that lie within the specified range; otherwise extra processing, such as selecting the largest contour (explained further on), is required to ensure such white blobs are not detected as the hand. The threshold values are very tricky to select: for some users, the color of the hand could vary a lot and fall outside the specified range, making the system unable to detect that user's hand.

The range check is performed with OpenCV's inRange function (the bounds below are in the HSV space described in Section 3.3):

import cv2
import numpy as np

lower_hand = np.array([0, 30, 60])     # lower bound of hand color (H, S, V)
upper_hand = np.array([20, 150, 250])  # upper bound of hand color (H, S, V)
mask = cv2.inRange(img, lower_hand, upper_hand)
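The semantics of inRange can be illustrated with a small NumPy sketch on a toy 2x2 "image" with made-up pixel values. As in OpenCV, an output pixel is 255 where every channel lies within the bounds (inclusive) and 0 elsewhere:

```python
import numpy as np

def in_range(img, lower, upper):
    """NumPy equivalent of cv2.inRange: per-pixel 255 where all channels
    fall within [lower, upper] inclusive, 0 elsewhere."""
    mask = np.all((img >= lower) & (img <= upper), axis=-1)
    return mask.astype(np.uint8) * 255

lower = np.array([0, 30, 60])
upper = np.array([20, 150, 250])
img = np.array([[[10, 100, 200], [30, 100, 200]],
                [[10, 20, 200], [5, 30, 60]]], dtype=np.uint8)
print(in_range(img, lower, upper))
```

Only the pixels whose first channel is at most 20, second channel between 30 and 150, and third channel between 60 and 250 survive in the mask; the rest become background.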
3.5 Contour detection:

A contour is a curve along which a function of two variables has a constant value; it joins points of equal value (elevation). A contour map illustrates contours using contour lines, which show the steepness of slopes, valleys and hills. The function's gradient is always perpendicular to the contour lines, and where the lines are close together the magnitude of the gradient is usually large. Contours are straight lines or curves describing the intersection of one or more horizontal planes with a real or hypothetical surface.

The contour is drawn around the white blob of the hand found by thresholding the input image. More than one blob may be formed in the image due to noise in the background, so contours are drawn on such smaller white blobs too. Assuming all blobs formed due to noise are small, the largest contour is taken for further processing as the contour of the hand. In this implementation, after pre-processing of the image frame, a white blob is formed and a contour is drawn around it. The contour is returned as a vector of points in coordinate form. Figure 9 shows the detected contour for the input image.

Figure 9: Detected Contour for the Input Image
3.6 Convex Hull:

The convex hull of a set of points in Euclidean space is the smallest convex set that contains all the given points. For example, when the set of points is a bounded subset of the plane, the convex hull can be visualized as the shape formed by a rubber band stretched around the set. The convex hull is drawn around the contour of the hand, such that all contour points are within the hull; this makes an envelope around the hand contour. Figure 10 shows the convex hull formed around the detected hand.

Figure 10: Convex Hull of the Input Image

hull = cv2.convexHull(points[, hull[, clockwise[, returnPoints]]])

Argument details:
 points: the contour we pass in.
 hull: the output; normally we omit it.
 clockwise: orientation flag. If it is True, the output convex hull is oriented clockwise; otherwise, it is oriented counter-clockwise.

To draw all the contours in an image:
cv2.drawContours(img, contours, -1, (0,255,0), 3)

To draw an individual contour, say the 4th contour:
cv2.drawContours(img, contours, 3, (0,255,0), 3)

But most of the time, the method below is useful:
cnt = contours[4]
cv2.drawContours(img, [cnt], 0, (0,255,0), 3)
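To make the "rubber band" idea concrete, here is a small pure-Python convex hull (Andrew's monotone chain algorithm) run on made-up points; in the project itself cv2.convexHull performs this computation on the hand contour:

```python
def convex_hull(points):
    """Andrew's monotone chain: returns the hull vertices of a list of
    (x, y) points in counter-clockwise order, starting from the
    lexicographically smallest point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means a clockwise or straight turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                      # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # endpoints shared, drop duplicates

# The interior point (1, 1) is dropped; only the square's corners remain
print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
```

Points strictly inside the hull (like the palm interior) never appear in the output, which is exactly why the hull forms an envelope around the hand contour.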
The convex hulls are also drawn on an image using the same drawContours function explained for contours; both contours and convex hulls are simply collections of points which need to be connected with straight lines. When we draw convex hulls for a given color range, i.e. the color range of a human palm, we may get many such hulls if other objects match that color range, so we consider only the convex hull with maximum area, which is most likely the hull of the palm.

3.7 Convexity Defects:

When the convex hull is drawn around the contour of the hand, it fits the set of contour points within the hull, using the minimum number of points needed to include all contour points inside or on the hull while maintaining convexity. This causes the formation of defects in the convex hull with respect to the contour of the hand: a defect is present wherever the contour of the object pulls away from the convex hull drawn around it. The convexity-defect computation gives a set of values for every defect in the form of a vector. This vector contains the start and end points of the defect line on the convex hull, given as indices into the contour, from which the coordinate points can easily be retrieved. It also includes the index of the deepest contour point of the defect and its depth value from the hull line. Figure 11 shows an example of the convexity defects calculated for the detected hand using the input image of Figure 9.
Figure 11: Major Convexity Defects Calculated for the Given Image

Final result:

Figure 12: Input Image with All Calculations

hull = cv2.convexHull(cnt, returnPoints = False)
defects = cv2.convexityDefects(cnt, hull)
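The geometry behind a defect's depth can be sketched in pure Python: for each hull edge, find the contour point between its endpoints that lies farthest from the line through the edge, and count a defect when that depth exceeds a threshold. This is a simplified stand-in for cv2.convexityDefects, run on a made-up contour (a square with one deep notch, like the valley between two fingers):

```python
from math import hypot

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    num = abs((b[0] - a[0]) * (a[1] - p[1]) - (a[0] - p[0]) * (b[1] - a[1]))
    return num / hypot(b[0] - a[0], b[1] - a[1])

def count_defects(contour, hull_idx, min_depth=1.0):
    """hull_idx: indices of hull vertices along the contour, in order.
    For each hull edge, the deepest in-between contour point counts as a
    defect if its depth exceeds min_depth."""
    defects = 0
    n = len(hull_idx)
    for k in range(n):
        i, j = hull_idx[k], hull_idx[(k + 1) % n]
        between = (list(range(i + 1, j)) if i < j
                   else list(range(i + 1, len(contour))) + list(range(0, j)))
        if not between:
            continue  # no contour points between these hull vertices
        depth = max(point_line_distance(contour[m], contour[i], contour[j])
                    for m in between)
        if depth > min_depth:
            defects += 1
    return defects

# Square contour with one deep notch on the top edge
contour = [(0, 0), (4, 0), (4, 4), (3, 4), (2, 1), (1, 4), (0, 4)]
hull_idx = [0, 1, 2, 3, 5, 6]  # index 4, the notch point (2, 1), is off the hull
print(count_defects(contour, hull_idx))  # 1
```

The min_depth filter corresponds to discarding shallow defects so that only the pronounced valleys between extended fingers are counted.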
3.8 Gesture Recognition:

After finding the convexity defects we can identify the type of gesture based on the number of convexity defects: a gesture with all five fingers open gives four major convexity defects, four fingers give three, and three fingers give two. A gesture with all fingers closed gives no convexity defects. The result is taken over every 60 frames: the most frequent gesture in the 60 frames is kept and the rest are discarded. The result is an integer value, the number of convexity defects present in the given gesture. This value is sent to a Java class through a gateway called py4j (python4java).

Figure 13: Example of Hand Gestures with Convexity Defects
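The decision rule above (defect count mapped to a gesture, then a majority vote over a window of frames) can be sketched in plain Python. The 60-frame window comes from the text; the gesture names themselves are illustrative labels:

```python
from collections import Counter

def gesture_from_defects(n_defects):
    """Four major defects ~ five open fingers, three ~ four fingers,
    two ~ three fingers, zero ~ closed fist."""
    return {4: "five_fingers", 3: "four_fingers",
            2: "three_fingers", 0: "fist"}.get(n_defects, "unknown")

def majority_gesture(defect_counts):
    """Most frequent gesture over a window of frames (60 in this project);
    less frequent readings in the window are discarded as noise."""
    votes = Counter(gesture_from_defects(d) for d in defect_counts)
    return votes.most_common(1)[0][0]

# 60 noisy frames: mostly 4 defects (open palm), with a few misreads
frames = [4] * 50 + [3] * 7 + [0] * 3
print(majority_gesture(frames))  # five_fingers
```

Voting over a window trades a little latency for robustness: a few frames of segmentation noise cannot flip the recognized gesture.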
3.9 Mouse Events:

Mouse events are performed using the Robot class in Java, available in the java.awt package. A Java class takes as input from the Python program an integer value, the number of convexity defects in the given gesture, based on which mouse events are performed.

3.9.1 Robot Class:

This class is used to generate native system input events for the purposes of test automation, self-running demos, and other applications where control of the mouse and keyboard is needed. The primary purpose of Robot is to facilitate automated testing of Java platform implementations. Using the class to generate input events differs from posting events to the AWT event queue or AWT components, in that the events are generated in the platform's native input queue. For example, Robot.mouseMove will actually move the mouse cursor instead of just generating mouse move events.

Constructor and Description:

Robot()
public Robot() throws AWTException
Constructs a Robot object in the coordinate system of the primary screen.

Method Details:

3.9.1.1 mouseMove

public void mouseMove(int x, int y)
Moves the mouse pointer to the given screen coordinates.
Parameters:
x - X position (x-coordinate)
y - Y position (y-coordinate)
The mouse cursor is moved based on the movement of the centroid of the convex hull drawn for the hand: the system calculates the difference between the previous and present positions, i.e. the (x, y) coordinates of the convex hull centroid.

3.9.1.2 mousePress

public void mousePress(int buttons)
Presses one or more mouse buttons. The mouse buttons should be released using the mouseRelease(int) method.
Parameters:
buttons - the button mask; a combination of one or more mouse button masks.
It is allowed to use only a combination of valid values as the buttons parameter. A valid combination consists of InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK and values returned by the InputEvent.getMaskForButton(button) method. The valid combination also depends on the Toolkit.areExtraMouseButtonsEnabled() value, as follows:
 If support for extended mouse buttons is disabled by Java, then it is allowed to use only the following standard button masks: InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK, InputEvent.BUTTON3_DOWN_MASK.
 If support for extended mouse buttons is enabled by Java, then it is allowed to use the standard button masks and the masks for existing extended mouse buttons, if the mouse has more than three buttons. In that case, it is allowed to use the button masks corresponding to the buttons in the range from 1 to MouseInfo.getNumberOfButtons(). It is recommended to use the InputEvent.getMaskForButton(button) method to obtain the mask for any mouse button by its number.
The following standard button masks are also accepted:
- InputEvent.BUTTON1_MASK
- InputEvent.BUTTON2_MASK
- InputEvent.BUTTON3_MASK
However, it is recommended to use InputEvent.BUTTON1_DOWN_MASK, InputEvent.BUTTON2_DOWN_MASK and InputEvent.BUTTON3_DOWN_MASK instead. Either the extended _DOWN_MASK values or the old _MASK values should be used, but the two models should not be mixed.
Throws:
IllegalArgumentException - if the buttons mask contains the mask for an extra mouse button and support for extended mouse buttons is disabled by Java
IllegalArgumentException - if the buttons mask contains the mask for an extra mouse button that does not exist on the mouse and support for extended mouse buttons is enabled by Java

3.9.1.3 mouseRelease

public void mouseRelease(int buttons)
Releases one or more mouse buttons.
Parameters:
buttons - the button mask; a combination of one or more mouse button masks. The same combinations of standard and extended button masks are valid as for mousePress(int) above, including the dependence on the Toolkit.areExtraMouseButtonsEnabled() value and the recommendation to obtain masks via the InputEvent.getMaskForButton(button) method. The standard masks InputEvent.BUTTON1_MASK, InputEvent.BUTTON2_MASK and InputEvent.BUTTON3_MASK are also accepted.
Throws:
IllegalArgumentException - under the same conditions as described for mousePress(int) above
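As a rough illustration of how the _DOWN_MASK values combine, the sketch below mirrors the bit layout of java.awt.event.InputEvent's button-down masks (BUTTON1_DOWN_MASK is 1 << 10, and so on). It is a model of the mask arithmetic only, not the Java API itself.

```python
# Python model of InputEvent's button-down masks and how they combine.
BUTTON1_DOWN_MASK = 1 << 10   # left button
BUTTON2_DOWN_MASK = 1 << 11   # middle button
BUTTON3_DOWN_MASK = 1 << 12   # right button

# A combined press of the left and right buttons is the bitwise OR:
combined = BUTTON1_DOWN_MASK | BUTTON3_DOWN_MASK

def has_button(mask, button_mask):
    """True if the combined mask includes the given button's bit."""
    return (mask & button_mask) != 0
```

Because each button occupies its own bit, a mask can be tested or extended without disturbing the other buttons, which is why Robot accepts an OR-ed combination in a single mousePress call.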
3.9.1.4 mouseWheel

public void mouseWheel(int wheelAmt)
Rotates the scroll wheel on wheel-equipped mice.
Parameters:
wheelAmt - number of "notches" to move the mouse wheel. Negative values indicate movement up/away from the user; positive values indicate movement down/towards the user.

3.9.1.5 keyPress

public void keyPress(int keycode)
Presses a given key. The key should be released using the keyRelease method.
Parameters:
keycode - key to press (e.g. KeyEvent.VK_A)
Throws:
IllegalArgumentException - if keycode is not a valid key

3.9.1.6 keyRelease

public void keyRelease(int keycode)
Releases a given key. Key codes that have more than one physical key associated with them (e.g. KeyEvent.VK_SHIFT could mean either the left or right shift key) will map to the left key.
Parameters:
keycode - key to release (e.g. KeyEvent.VK_A)
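The sign convention of mouseWheel can be wrapped in a tiny helper; the function name and the direction strings below are illustrative only, not part of the project's code.

```python
# Helper reflecting mouseWheel's convention: negative notch counts scroll
# up/away from the user, positive counts scroll down/towards the user.
def wheel_amount(direction, notches=1):
    """Return the signed wheelAmt for a scroll direction."""
    if direction == "up":
        return -notches
    if direction == "down":
        return notches
    raise ValueError("direction must be 'up' or 'down'")
```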
Proposed Gestures and Corresponding Mouse Operations:

[Table: each proposed hand gesture, shown as an image in the original report, corresponds to one mouse operation: mouse movement, release click, right click, scroll, middle click or left click.]
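Since the Python side reports only a convexity-defect count to the Java side (Section 3.9), the gesture table above amounts to a mapping from counts to operations. The specific counts in the sketch below are assumptions for illustration; the report does not state which count triggers which operation.

```python
# Hypothetical defect-count-to-operation mapping; the actual counts used
# by the project are not given in the report.
DEFECTS_TO_OPERATION = {
    0: "mouse movement",
    1: "left click",
    2: "right click",
    3: "middle click",
    4: "scroll",
}

def operation_for_defects(n_defects):
    """Select the mouse operation for a convexity-defect count."""
    return DEFECTS_TO_OPERATION.get(n_defects, "release click")
```

On the Java side, each selected operation would then be realized with the Robot methods documented above (mouseMove, mousePress/mouseRelease, mouseWheel).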
3.10 Python4Java (Py4J) Gateway:

Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects. The goal is to let developers program in Python and benefit from Python libraries such as lxml, while still being able to reuse Java libraries and frameworks such as Eclipse and NetBeans. Py4J can be seen as a hybrid between a glorified remote procedure call and using the Java Virtual Machine to run a Python program. Through this gateway we create an object for the Java class in the Python program and pass the output of the Python program, an integer value (the number of convexity defects in the given gesture), to the method defined in the Java class.

3.11 Technologies Used:

3.11.1 Python:

Python is both a programming and a scripting language. Its applications span scripting, programming, hardware-interaction events, device coding, GUI programming, etc. It provides many libraries for operations involving the operating system, file system, hardware and web application development, as well as advanced fields like image processing and voice processing. Some of the libraries used in this project are listed below.

PIL – Python Imaging Library, which provides various image processing operations.
cv2 – Computer vision algorithms library used to process videos and images.
Matplotlib – Provides a graphical interface to visualize data during development as well as processing.
Numpy – A library providing matrix-oriented operations. It plays a crucial role when working with image-based tasks.

Ref.: www.python.org, http://pypi.python.org/

3.11.2 OpenCV (Open Computer Vision):

OpenCV provides libraries for different image operations, video-based filters, hardware interaction, etc. It eases the developer's work by providing standard routines to compute the data required by a program. It spans wide areas such as image processing, computer vision, video processing, object detection and machine learning. OpenCV was started at Intel in 1999 by Gary Bradsky, and the first release came out in 2000. It contains many algorithms related to computer vision and machine learning and is expanding day by day. It supports programming languages such as C++, Python and Java, and is available on different platforms including Windows, Linux, OS X, Android and iOS. Interfaces based on CUDA and OpenCL are also under active development for high-speed GPU operations.

Ref.: http://docs.opencv.org/

The "cv2" library of Python provides the opencv-python operations over different functionalities. Some of them are listed below.

cv2.VideoCapture("file" or index)
This function allows the user to capture frames from a video file or from a device. If a filename is provided instead of a device index, frames are taken from that file. Device indices are 0, 1, 2, etc.: 0 specifies the default web camera, 1 specifies a USB camera connected to the machine, and so on.

cv2.imread("Image", mode)
This function reads the image data into a matrix for further operations. The Image parameter specifies the filename of the image to be read, and mode specifies the image mode to convert to. E.g. mode=0 reads the image at gray level.

cv2.imwrite("Filename", source)
This function writes image data to the specified filename, stored in the current working directory.

cv2.imshow("Name", source)
This function is used to visualize image data at runtime. Name specifies the window name and source specifies the image data to be visualized.

cv2.cvtColor(Image, cv2.COLOR_BGR2GRAY)
Converts a colour image to a gray-level image. The input image should be a colour image in OpenCV's BGR channel order.

3.11.3 Java AWT:

AWT (Abstract Window Toolkit) is a collection of classes that provides graphical components such as buttons, text boxes and Robot actions, which are used in graphical programming.

3.12 Conclusion:

This chapter first described the system overview and gave a detailed description of all the steps involved in the application system, and then described the proposed methodologies and algorithms. The correctness of this proposed work will be discussed in the next chapter.
Chapter 4
Results and Discussion

4.1 Snapshots of the Result

4.1.1 Cursor move based on the centroid position of the convex hull
4.1.2 Moving the cursor onto the folder

4.1.3 Right click on the folder
4.1.4 Cursor moved onto the "open" option

4.1.5 Left click on the selected option
4.1.6 Result of the left click on the selected option
Chapter 5
Conclusion and Future Work

5.1 Conclusion of the Work:

This application system, "Vision Based Hand Gesture Recognition" for PC control, helps the user perform mouse operations with hand gestures. It gives more accurate results when provided with a constant background and proper illumination conditions. As this is a research project, we are able to perform several operations, but with the limitations mentioned above: a noisy background and poor illumination. Since the system uses colour-based segmentation, it may not perform well under some light sources such as fluorescent bulbs, but it performs very well under daylight illumination conditions. The application system is implemented to be platform independent.

5.2 Future Work

Adding more gestures to the system to execute more mouse operations.
Making the system robust to environmental conditions and background variation.
Enhancing the performance of the system to make it more user friendly.
REFERENCES:

[1] G. R. S. Murthy and R. S. Jadon. "A Review of Vision Based Hand Gestures Recognition," International Journal of Information Technology and Knowledge Management, vol. 2(2), pp. 405-410, 2009.
[2] R. Lockton. "Hand Gesture Recognition Using Computer Vision." http://research.microsoft.com/en-us/um/people/awf/bmvc02/project.pdf
[3] S. Mitra and T. Acharya. "Gesture recognition: a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(3):311-324, 2007.
[4] Fakhreddine Karray, Milad Alemzadeh, Jamil Abou Saleh, and Mo Nours Arab. "Human Computer Interaction: Overview on State of the Art," International Journal on Smart Sensing and Intelligent Systems, vol. 1(1), 2008.
[5] R. Fergus, P. Perona, and A. Zisserman. "Object class recognition by unsupervised scale-invariant learning," in CVPR, volume 2, pages 264-271, 2003.
[6] S. Ullman, M. Vidal-Naquet, and E. Sali. "Visual features of intermediate complexity and their use in classification," Nature Neuroscience, 5(7):682-687, 2002.
[7] M. Weber, M. Welling, and P. Perona. "Unsupervised learning of models for recognition," in ECCV, Dublin, Ireland, 2000.
[8] D. G. Lowe. "Distinctive image features from scale invariant keypoints," International Journal of Computer Vision, 2004.
[9] K. Mikolajczyk. "Detection of local features invariant to affine transformations," Ph.D. thesis, Institut National Polytechnique de Grenoble, France, 2002.
[10] A. R. Pope and D. G. Lowe. "Probabilistic models of appearance for 3-D object recognition," International Journal of Computer Vision, 40(2):149-167, 2000.
[11] Y. Ke and R. Sukthankar. "PCA-SIFT: A more distinctive representation for local image descriptors," in CVPR (2), pages 506-513, 2004.
[12] A. Blake and M. Isard. "3D position, attitude and shape input using video tracking of hands and lips," in Proceedings of SIGGRAPH 94, pages 185-192, 1994.
[13] J. Segen. "GEST: a learning computer vision system that recognizes gestures," in Machine Learning IV, edited by Michalski et al., Morgan Kaufmann, 1992.
[14] J. M. Rehg and T. Kanade. "DigitEyes: vision-based human hand tracking," Technical Report CMU-CS-93-220, Carnegie Mellon School of Computer Science, Pittsburgh, PA 15213, 1993.
[15] D. Rubine and P. McAvinney. "Programmable finger-tracking instrument controllers," Computer Music Journal, 14(1):26-41, 1990.
[16] Richard Watson. "Gesture recognition techniques," Technical Report No. TCD-CS-93-11, Trinity College, Department of Computer Science, Dublin, July 1993.
[17] Chan Wah Ng and Surendra Ranganath. "Real-time gesture recognition system and application," Image and Vision Computing, 20(13-14):993-1007, 2002.
[18] Thomas G. Zimmerman, Jaron Lanier, Chuck Blanchard, Steve Bryson, and Young Harvill. "A hand gesture interface device," in SIGCHI/GI Proceedings, Conference on Human Factors in Computing Systems and Graphics Interface, pages 189-192, Toronto, Ontario, Canada, April 05-09, 1987.
[19] Lalit Gupta and Suwei Ma. "Gesture-Based Interaction and Communication: Automated Classification of Hand Gesture Contours," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 1, February 2001.