Advance KNN classification of brain tumor

Advance Boost Classifier using Random Tree and KNN for
segmentation and Classification of Brain tumor
Chapter 1: Introduction
The accurate diagnosis of diseases with high prevalence rate, such as Brain Tumor diseases, is
one of the most important biomedical problems whose administration is imperative. In this
project, we present a new method for the automated diagnosis of diseases based on the
improvement of random forests classification algorithm. More specifically, the dynamic
determination of the optimum number of base classifiers composing the random forests is
addressed. The proposed method is different from most of the methods reported in the literature,
which follow an overproduce-and-choose strategy, where the members of the ensemble are
selected from a pool of classifiers, which is known a priori. In our case, the number of classifiers
is determined during the growing procedure of the forest. Additionally, the proposed method
produces an ensemble not only accurate, but also diverse, ensuring the two important properties
that should characterize an ensemble classifier. The method is based on an online fitting
procedure and it is evaluated using eight biomedical datasets and five versions of the random
forests algorithm (40 cases). The method decided correctly the number of trees in 90% of the test
cases.
Chapter 2: Literature Survey
INTRODUCTION
Paper1: FCM and KNN BasedAutomatic Brain Tumor Detection
A brain tumor is formed when abnormal cells get accumulated within the brain. These cells
multiply in an uncontrolled manner and damage the brain tissues. Magnetic Resonance Image
scans are commonly used to diagnose brain tumors. However, segmenting and detecting the
brain tumor manually is a tedious task for the radiologists. Hence, there is a need for automatic
systems which yield accurate results. A fully automatic method is introduced to detect brain
tumors. It consists of five stages Image Acquisition, Preprocessing, Segmentation, using Fuzzy
C-means technique; Harris Corner Detection based feature extraction and classification using K-
NN. Performance metrics such as accuracy, precision, sensitivity and specificity are used to
evaluate the performance.
Methodology
A schematic overview of the proposed approach is illustrated in . A random forest classifier was
applied to the feature data from each modality independently, not only to obtain single-modality
classification results for comparison, but also to derive the similarities required for manifold
learning. The resulting similarity matrices were combined, and classicalMDSwas applied to
generate a joint embedding for multi-modality classification. Full details of the data collection
and feature extraction are presented in the Neuroimaging and biological feature data section.

Paper 2: Automated Diagnosis of Diseases Based on Classification: Dynamic Determination
of the Number of Trees in Random Forests Algorithm
An important task of any diagnostic system is the process of attempting to determine and/or
identify a possible disease or disorder and the decision reached by this process. For this purpose,
machine learning algorithms are widely employed [1], [2]. For these machine learning
techniques to be useful in medical diagnostic problems, they must be characterized by high
performance, the ability to deal with missing data and with noisy data, the transparency of
diagnostic knowledge, and the ability to explain decisions. In this paper, the improvement of the
random forests classification algorithm, which meets the aforementioned characteristics, is
addressed. This is achieved by determining automatically the only tuning parameter of the
algorithm, which is the number of base classifiers that compose the ensemble and affects its
performance. Random forests are a substantial modification of KNN [3]–[6]. It constructs a large
number of unpruned and decorrelated trees. The generation of the trees is based on the
combination of two sources of randomness. First, each tree is constructed on a bootstrap replicate
of the original dataset, as in KNN, and second a random feature subset, of fixed predefined size,
is considered for splitting each node of the tree. Gini index is used as the feature evaluation
measure that determines the best split. The decision tree is built to the maximum size without
pruning. The random forests classify each new instance by the majority vote of the full set of
trees.
One of the most important issues in the creation of an ensemble classifier, such as
random forests, is the size of the ensemble, the number of classifiers composing the ensemble,
and how the unnecessary classifiers are removed from the ensemble. The factors that may affect
the size of the ensemble are:
1) The desired accuracy,
2) The computational cost,
3) The nature of the classification problem, and
4) The number of available processors.
The methods reported in the literature, dealing with this problem, can be grouped into three
categories:
1) Methods that preselect the ensemble size,
2) Methods that postselect the ensemble size (pruning of the ensemble), and
3) Methods that select the ensemble size during training.
Preselection methods are the simplest way to determine the ensemble size. More specifically, the
number of the base classifiers is a tuning parameter of the algorithm, which can be set by the
user. Pruning methods contain precombining and postcombining methods [8]. In the first case,
pruning is performed before combining the classifiers. The classifiers that seem to perform well
are included in the ensemble. The predictive strength of a classifier is determined using different
evaluation measures. In postcombining pruning methods, the classifiers are removed from the
ensemble based on their contribution to the collective. More specifically, most of the

postcombining pruning methods are based on the overproduce-and-choose strategy, which
consists of two phases. The overproduction phase aims to produce a large initial pool of
candidate classifiers, while the selection phase aims to choose adequate classifiers from the pool
of classifiers so that the selected group of classifiers can achieve optimum positive predictive
rate. In the second phase (selection phase), different approaches are used. More specifically,
ensemble selection methods can be grouped into the following categories:
1) Weighted voting methods
2) search-based methods
3) clustering-based methods
4) Ranking methods
5) Optimization of a measure or function methods.
Architecture for previous method
The proposed method is based on the iterative procedure shown in Fig. 1. The method consists of
three basic steps: 1) the construction of the initial forest, 2) the application of the fitting
procedure, and 3) the examination of the termination criterion.
1) Construction of the Forest:
In the first step, the method constructs a forest with ten trees. For the construction of the
forest, the classical random forests and some modifications of it are used. More
specifically, random forests with ReliefF (RF with ReliefF) [9], random forests with
multiple estimators (RF with me) [9], RK Random Forests (RK-RF) [24], and RK
Random Forests with multiple estimators (RK-RF with me) [16] are employed. The
classical random forests constructs a collection of trees. For the construction of each tree,

a bootstrap sample of the dataset is selected. The tree is built to the maximum size
without pruning. The tree is just grown until each terminal node contains only members
of a single class. The Gini index [9] is used to determine the best split of each node. Only
a subset m of the total set of M features is employed as the candidate splitters of the node
of the tree. The number of the selected features (m) remains constant throughout the
construction of the forest. If a popularity of the trees agrees on a given classification, then
that is the predicted class of the sample being classified. In RF with ReliefF, the Gini
index is replaced by ReliefF. ReliefF evaluates the partitioning capability of attributes
according to how well their values distinguish between similar instances. The
replacement of Gini index is the core idea of RF with me too. However, five evaluation
measures are used instead of one. Those evaluation measures are: Gini index [9], grain
ratio [9], ReliefF [9], minimum description length [9], and myopic ReliefF [9]. The
differentiation between classical random forests and RK-RF lies in the value of the
parameter m. More specifically, it is not the same throughout the construction of the
forest, but it is randomly chosen for each node of the tree. Finally, in RK-RF with me, the
random selection of the parameter m is combined with multiple estimators to accomplish
the construction of the forest. The detailed description of the previous algorithms is
provided in [16]. During the construction of the forest, the accuracy and the average
correlation of the forest are computed each time a new tree is added. The forest primarily
consists of ten trees. The reason for selecting this initial value is due to the fact that the
fitting procedure, that is going to be applied in the next step, needs an adequate number
of points to start.
2) Fitting Procedure:
After the construction of the initial forest, an iterative procedure is used. The procedure
consists of three basic stages: 1) add a new tree, 2) apply polynomial fits, and 3) select
the best fit. The polynomial fits that are employed are given by: fn−1 (x) = pn xn + pn−1
xn−1 + ・・・ + p0, n= 2, ..., 9. (1) where x is the data to be fitted and pn are the
coefficients of the polynomial. The best fit is the one with the minimum rms error (the
root of the average of the squares of the differences between the predicted and actual
values).
3) Examination of the Termination Criterion:
In this step, the method examines if the stopping criterion is fulfilled. For this purpose,
three different criteria were tested in order to conclude to the best one. The first criterion
(criterion 1) searches for consecutive points in the fitted curve, where the difference
between the fitted curve and the curve of the accuracy is greater than a predefined
threshold. If there is such a region, the method terminates and returns the point of this
region for which the maximum accuracy is observed. The number of consecutive points
should be at least 10 and the difference, point by point, between the curves should be
greater than 0.004 [25]. In the case that the criterion is fulfilled and there are more than
one points, where the maximum accuracy is observed, the one with the lowest Brier score
[26] is selected. Brier score is a function, which measures the average squared deviation

between predicted probabilities for a set of events and their outcomes. Thus, the lowest
score corresponds to the highest accuracy.
Chapter 3: Proposed Work
Statement of Problem:
Attributed their success to the two main components of their system: discrimination and
randomization. Discrimination refers to the use of SVM to learn the splits at each node, whereas
randomization refers to a random selection of image patches, which are used as a form of
features to learn the splits at each node. There are several problems that may arise from this
randomization procedure. Firstly, if we consider image patches of size 50X50 in an 500X500
image, sampling space may contain thousands of patches, which makes it less likely that a
randomly selected patch will contain an object of interest for the image categorization. In
addition, randomly selected samples are more likely to over-lap with each other, which would
cause redundancy. Therefore, in this project, I investigated new ways for selecting image
patches. In theory, more informative patch selection should result in higher quality splits at each
tree node, which in turn should increase overall accuracy of the classifier.
Features and Scope:
To Fix the problems related to random patch selection I integrated a selective search
segmentation algorithm into the original random forest framework. Image patches selected using
selective search segmentation is more likely to contain the objects of interest. In addition,
segmentation should eliminate redundant overlapping between the image patches, which will
make our feature space more diverse. Fixing these two problems should result in an increased
discriminative power of random forest.
Goals:
Before beginning Random Forest procedure, I standardize each image by rescaling them to the
same size and then apply Selective Search Segmentation to extract important regions from each
image. Each region is represented by 4 coordinates in the image (points in the bottom left and top
right corners of the region). Then, SVM is applied to all the regions that were returned by

Selective Search Segmentation and its centroids are chosen as then all candidate regions. In this
particular case, I used 1024 centroids.
INTRODUCTION
Decision tree is one of the machine learning methods widely used to analyze proteomics data.
Generated from a given dataset, a single decision tree reports a classification result by each of its
terminal leaves (classifiers). Even though there exist many algorithms such as C4.5 that can be
used to generate a well-modeled single decision tree, it is still possible that its prediction is
biased, thus adversely affecting its accuracy.
To overcome this problem, more than one decision trees are used to analyze the data. It is based
on the concept of forming a panel of experts who will then vote to decide the final outcome. The
panel of experts is analogous to the ensemble of decision trees, which provide a pool of
classifiers. Similar to voting, the classifier that has a majority becomes the true classification
result for the data. the decision tree ensemble is more accurate than a single decision tree.
The diagram shown summarizes the process of generating decision tree ensemble and
classification of data using KNN algorithm. To explain briefly, KNN algorithm begins with
randomly sampling (with replacement) the data from the original dataset to form a training set. A
multiple of training sets are usually generated. Note that, since the replacement is allowed, the
data in the training set can be duplicated. Each of training sets then generates a decision tree. For
a given test data, each decision tree predicts an outcome, represented by a classifier. The
ensemble of decision trees forms a panel of experts whose votes determine the final classification
result from this group of classifiers.

ARCHITECTURE DIAGRAM
The performance in identifying biomarkers for premalignant pancreatic Alzheimer could be
enhanced by using the decision tree ensemble techniques instead of a single algorithm
counterpart. These techniques had proven more likely to accurately distinguish disease class
from normal class as indicated by a larger area under the Receiver Operating Characteristic
curve. Moreover, they achieved comparatively lower root mean squared errors.
According to their method, the peptide mass-spectrometry data were processed first to improve
data integrity and reduce variations among data due to the differences in sample loading
conditions. The preprocessing steps involved baseline adjustment using group median,
smoothing to remove noise using a Gaussian kernel, and normalization to make all the data
comparable. After that, the data were randomly sampled such that 90% formed a training set and
the remaining 10% formed a test set.
The training set was used in feature selection. In the study, the authors considered three different
feature selection methods. The first method was a two-sample homoscedastic t test, which was
used under the assumption that all the features from either normal or disease class had normal
distribution. Unlike the first method, the second method based on ANDI rank test considered that
the features had no distribution. The last feature selection method was a genetic algorithm.
The test set was used to generate a single decision tree including the decision tree ensembles.
The ensemble methods being studied were Random Forest, Random Tree, KNN, boost, Stacking,

Adaboost, and Multiboost. Their performances were measured in terms of accuracy and error in
the classification of the features, selected by each selection method. Then, they were compared
against the performance of a single decision tree generated by C4.5 algorithm. The process
repeated ten times to validate the resulting performance consistency.
According to the results reported, the decision tree ensembles achieved higher accuracy up to
70% regardless of the feature selection methods used. In terms of biomarker identification, both
the t test and the ANDI rank test had similarly impressive performance by consistently selecting
the same biomarker-suspect features. Unlike the first two methods, the performance of the
genetic algorithm was considerably poor. also noted that 70% accuracy was still lower than
expected. This could be as a result from a naturally low concentration of the biomarkers at the
premalignant stage of the Alzheimer. In addition, it was also possible that one dataset might not
be suitable for all algorithms, thus underestimating the accuracy.

Raw spectrum data:
We use GAUSSIAN EDGE with 4 levels.
Gaussian kernel smoothing:
A process of averaging the data points by applying a Gaussian functions. Basically, the
Gaussian function is used to generate a set of normalized weighting coefficients for the
data points whose weighted sum generates a new value. This new value replaces the old
one at the center of Gaussian curve.
Goal and overview of this research:
The goal of this research work was to extract the meaningful knowledge lied in the database and
transform them into meaningful rules.

Block diagram of the research work
process
Then the rules are used to predict the class labels of unknown data. Finally we introduced KNN
and Boosting to improve the accuracy of this whole process.
Keeping the aimed goal of this research in mind we constructed the whole research process as
shown in the block diagram in Fig. 1. Here the decision tree induction algorithms are used to turn
the hidden knowledge into a large dataset into decision rules. Again enhancements are made to
these algorithms to extract and use the rules more precisely to improve the accuracy. In the
research we have used heart disease dataset which is collected from UCI machine learning
repository. At first, ID3 algorithm is used to extract rules from the dataset and to use the rules to
classify new data which is implemented in C#. C4.5, the successor of ID3 is then used to classify
data more accurately. Finally, two new approaches named KNN and Boosting are introduced to
improve the predictive accuracy of C4.5.
Background study

Classification and prediction:
Data classification is a two-step process . In the first step, a model is built describing
predetermined set of data classes or concepts. The model is constructed by analyzing database
tuples described by attributes. Each tuple is assumed to belong to a predefined class as
determined by one of the attributes, called the class label attribute. In the context of
classification, data tuples are also referred to as samples or objects. The data tuples analyzed to
build the model collectively form the training dataset. The individual tuples making up the
training set are referred to as training samples and are randomly selected from the sample
population . Since the class label of each training sample is provided, this step is also known as
supervised learning. It contrasts with unsupervised learning, in which the class label of each
training sample is not known and the number or set of classes to be learned may not be known in
advance .
Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled sample or to assess the value or value ranges of an attribute that a given sample is
likely to have. In this view, classification and regression are the two major types of prediction
problems where classification is used to predict discrete or nominal values, while regression is
used to predict continuous or ordered values. In the view, however, refer to the use of prediction
to predict class label as classification and the use of prediction to predict continuous values as
prediction.
Decision tree induction:
Decision tree induction is a greedy algorithm that constructs decision tree in a top-down
recursive divide and conquer manner. A decision tree is a tree in which each branch node
represents a choice between a numbers of alternatives and each leaf node represents a decision.
Decision trees are commonly used for gaining information for the purpose of decision-making. It
starts with a root node and forms this node; users split each node recursively according to
decision tree learning algorithm. The final result is a decision tree in which each branch
represents a possible scenario of decision and its outcome.
For extracting rules, information gain measure is used to select the test attribute at each node in
the tree. The attribute with the highest information gain is chosen as the test attribute for the
current node and the path from the root node to each leaf node in the tree is tracked to construct
rules from the dataset.
They use induction in order to provide an appropriate classification of objects in terms of their
attributes, inferring decision tree rules. In their learning phase, explicit rules or interactions
among relevant features are induced. Such a learning method differs from non-linear classifiers
such as support vector machines or neural networks where the learning phase is to determine the
parameters of the non-linear kernel functions.
ID3 algorithm:

The ID3 (Iterative Dichotomiser 3) technique to building a decision tree is based on information
theory and attempts to minimize the expected number of comparisons. The basic idea of the
induction algorithm is to ask questions whose answers provide the most information. The first
question divides the search space into two large search domains while the second performs little
division of the space. The basic strategy used by ID3 is to choose splitting attributes with the
highest information gain first. The amount of information associated with an attribute value is
related to the probability of occurrence.
Let node N represents or hold the tuples of partition D. The attribute with the highest information
gain is chosen as the splitting attribute for node N. This attribute minimizes the information
needed to classify the tuples in the resulting partitions and reflects the least randomness or
impurity in these partitions. To calculate the of an attribute, at first we calculate the entropy of
that attribute by the following formula:
(1)
Where, Pj is the probability that an arbitrary tuple in S belongs to class Cj and estimated by |Ci,D|
/ |D|. A log function to the base 2 is used because the information is encoded in bits. Entropy (S)
is just the average amount of information needed to identify the class label of the tuple in S.
Now, the gain of an attribute is calculated by the formula
(2)
where, Si ={S1, S2......Sn} = partitions of S according to values of attribute A:
n = Number of attributes A
|Si| = Number of cases in the partition Si
|S| = Total number of cases in S
Information gain is defined as the difference between the original information requirement and
new requirement.
(3)

In other words, Gain (A) tell us how much would be gained by branching on A. It is the expected
reduction in the information requirement caused by knowing the value of A. The attribute A with
highest information gain is chosen as the splitting attribute at node N.
MATERIALS AND METHODS
New decission tree learning algorithms:
The C4.5 algorithm extension of his own ID3 algorithm for generating decision trees. KNN and
Boosting are general strategies for improving classifier and predictor accuracy. Suppose that we
are a patient and would like to have a diagnosis made based on the symptoms. Instead of asking
one doctor, we may choose to ask several. If a certain diagnosis occurs more than any others, we
may choose this as the final or best diagnosis. That is the final diagnosis is made based on a
majority vote where each doctor gets an equal vote. Now replace each doctor by a classifier, we
have the basic idea behind KNN.
In boosting, we assign weights to the value of each doctor’s diagnosis, based on the accuracies of
previous diagnoses they have made. The final diagnosis is then a combination of the weighted
diagnoses.
C4.5 Algorithm:
Just as with CART, the C4.5 algorithm recursively visits each decision node, selecting the
optimal split, until no further splits are possible. The steps of C4.5 algorithm for growing a
decision tree is given below:
• Choose attribute for root node
• Create branch for each value of that attribute
• Split cases according to branches
• Repeat process for each branch until all cases in the branch have the same class
A question that, how an attribute is chosen as a root node? At first, we calculate of the gain ratio
of each attribute. The root node will be that attribute whose gain ratio is maximum. Gain ratio is
calculated by the formula.
(4)

Where, A is an attribute whose gain ratio will be calculated. The attribute A with the maximum
gain ratio is selected as the splitting attribute. This attribute minimizes the information needed to
classify the tuples in the resulting partitions. Such an approach minimizes the expected number
of tests needed to classify a given tuple and guarantees that a simple tree if found. To calculate
the gain of an attribute, at first we calculate the entropy of that attribute by the following formula
(5)
Where, Pi is the probability that an arbitrary tuple in S belongs to class Ci and estimated by
|Ci;D|/|D|. A log function to the base 2 is used because the information is encoded in bits. Entropy
(S) is just the average amount of information needed to identify the class label of the tuple in S.
Now gain of an attribute is calculated by the formula
(6)
Where, Si = {S1, S2.....Sn} = partitions of S according to values of attribute A:
n = Number of attributes A
|Si| = Number of cases in the partition Si
|S| = Total number of cases in S
The gain ratio divides the gain by the evaluated split information. This penalizes splits with
many outcomes.
(7)
The split information is the weighted average calculation of the information using the proportion
of cases which are passed to each child. When there are cases with unknown outcomes on the
split attribute, the split information treats this as an additional split direction. This is done to
penalize splits which are made using cases with missing values. After finding the best split, the
tree continues to be grown recursively using the same process.
KNN:

We first take an intuitive look at how researchers as a method of increasing accuracy. Suppose
that we are a patient and would like to have a diagnosis made based on the symptoms. Instead of
asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than any
others, you may choose this as the final or best diagnosis. That is the final diagnosis is made
based on a majority vote where each doctor gets an equal vote. Now replace each doctor by a
classifier, you have the basic idea behind KNN. Intuitively, a majority vote made by a large
group of doctors may be more reliable than a majority vote made by a small group.
Given a set, D, of d tuples, KNN works as . For iteration i (I =1, 2, 3,....., k), a training set, Di of
d tuples is sampled with replacement from the original set of tuples, D. Note that the term KNN
stands bootstrap aggregation. Each training set is a bootstrap sample. Because sampling with
replacement is used, some of the original tuples of D may not be included in Di, whereas others
may occurs more than once. A classifier model Mi is learned for each training set, Di. To classify
an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote.
The bagged classifier, M*, counts the votes and assigns the class with the most vote to X. KNN
can be applied to the prediction of continuous values by taking the average value of each
prediction for a given test tuple.
Algorithm: KNN: The KNN algorithm creates an ensemble of models (classifiers or predictors)
for a learning scheme where each model gives an equally-weighted prediction.
Input:
• D, a set of training tuples
• k, the number of models in the ensemble
• A learning scheme (e.g., decision tree algorithm, backpropagation, etc.)
Output: A composite model, M*
Method:
• For I = 1 to k do// create k models
• Create bootstrap sample, Di by sampling D with replacement
• Use Di to derive a model, Mi
• Endfor

To use the composite model on a tuple, X:
• If classification then
• Let each of the k models classify X and return the majority vote
• If prediction then
• Let each of the k models predict a value for X and return the average predicted value
The bagged classifier often has significantly greater accuracy than a single classifier derived
from D, the original training data. It will not be considerably worse and is more robust to the
effects of noisy data. The increased accuracy occurs because the composite model reduces the
variance of the individual classifiers. For prediction, it was theoretically proven that a bagged
predictor will always have improved accuracy over a single predictor derived from D.
Boosting:
Boosting is a general method for improving accuracy of any given learning algorithm. It is an
effective method of producing a very accurate prediction rule by combining rough and
moderately inaccurate rules of thumb. In the research we have focused especially on the
AdaBoost .
Adaboost algorithm:
In AdaBoost, the input includes a dataset D of d class-labeled tuples, an integer k specifying the
number of classifiers in the ensemble and a classification-learning scheme.
Each tuple in the dataset is assigned a weight. The higher the weight is the more it influences the
learned theory. Initially, all weights are assigned a same value of 1/d. The algorithm repeats k
times. At each time, a model Mi is built on current dataset Di which is obtained by sampling with
replacement on original training dataset D. The framework of this algorithm is as follows:
Algorithm: AdaBoost
Input:
• D, a set of d class-labeled training tuples
• K, the number of rounds
• A classification learning scheme

Output: A composite model
Method:
• Initialize the weight of each tuple in D to 1/d
• For I = 1-k do
• Sample D with replacement according to the tuple weights to obtain Di
• Use training set Di to drive a model, Mi
• Compute the error rate error(Mi) of Mi
• If error(Mi) >0.5 then
• Reinitialize the weights to 1/d
• Go back to step 3 and try again
• Endif
• Update and normalize the weight of each tuple;
• Endfor
The error rate of Mi is the sum of the weights of all tuples in Di that of the tuples in Di that Mi
misclassified:
(8)
Where, err (Xj) = 1, if Xj is misclassified and err (Xj) = 0 otherwise. Then the weight of each
tuple is updated so that the weights of misclassified tuples are increased and the weights of
correctly classified tuples are decreased. This can be done by multiplying the weights of each
correctly classified tuple by error (Mi)/(1- error (Mi)). The weights of all tuples are then
normalized so that the sum of them of them is equal to 1. In order to keep this constraint, the
weight of each tuple is divided by the sum of the new weights.

After K rounds, a composite model will be generated, or an ensemble of classifiers which is then
used to classify new data. When a new tuple X comes, it is classified through these steps:
• Initialize weight of each class to 0
• For i = 1-k do
• Get weight wi of classifier Mi
• Get class prediction for X from Mi:c = Mi (Xi)
• Add βi to weight for class c
• endfor
• Return the class with the largest weight
The weight wi of each classifier Mi is calculated by this Eq. 9:
(9)
Requirements for KNN and boosting: These two methods for utilizing multiple classifiers
make different assumptions about the learning system. As above, KNN requires that the learning
system should not be stable, so that small changes to the training set should lead to different
classifiers. Breiman also notes that, poor predictors can be transformed into worse ones by KNN.
Boosting, on the other hand, does not preclude the use of learning systems that produce poor
predictors, provided that their error on the given distribution can be kept below 50%. However,
boosting implicitly requires the same instability as KNN; if Ct is the same Ct-1 the weight
adjustment scheme has the probability that error (Mt) = 0.5.

Chapter 4: Project Design
Hardware Requirements
• SYSTEM : Pentium IV 2.4 GHz
• HARD DISK : 40 GB
• FLOPPY DRIVE : 1.44 MB
• MONITOR : 15 VGA colour
• MOUSE : Logitech.
• RAM : 256 MB
Software Requirements

• Operating system :- Windows XP Professional
• Front End : - Asp .Net 2.0.
• Coding Language :- Visual C# .Net
• Back-End : - Sql Server 2000.
Module I/O
Preprocessing
Given Input- Image.
Expected Output-Normalize image
DFT
Given Input- Image and Dataset.
Expected Output- Classified Image.
KNN

Given Input- Classified image.
Expected Output-Image Bins.
Boosting
Given Input- Image Bins.
Expected Output- Rank Classified Image.
Verification
Given Input- Checks with user’s stored details like security answers or hidden details.
Expected Output-If the verification is success, user can perform transaction, else blocks the card.
Module diagram

UML Diagrams
Start
Image
RDT
KNN Bin
Bagging
Dataset
Stop

Use case diagram
Dataset
RDT
Bagging
KNN Bin
System
Image

Class diagram
Classification.
Segment
Image_name
RDT()
ID5()
Boosting.
_ans
details
verify()
Dataset.
Sequence
Sequence()
KNN Bin
qst
ans
info

Object diagram
Sequence diagram
Preprocessing
Boosting
RDT
Classification
KNN Bin

Component Diagram
image RDT KNN Bin Boosting DB Ratio
Transaction
Feature
Block
Block Details
DB feature
Ranked image

Image Library
ID4.5
Tree Componentn
KNN
Storage
Detail

Dataflow diagram
Directory
Preproc
essing
image
Feature
Classific
ation
Matrix
Dataset

Chapter 5: Proposed Simulation/Experiments/Results/Analysis
This study explores the utility of three different feature selection schemas to reduce the high
dimensionality of a pancreatic Alzheimer proteomic dataset. Using the top features selected from
each method, we compared the prediction performances of a single decision tree algorithm C4.5
with six different decision-tree based classifier ensembles (Random forest, Stacked
generalization, KNN, Adaboost, Logitboost and Multiboost). We show that ensemble classifiers
always outperform single decision tree classifier in having greater accuracies and smaller
prediction errors when applied to a pancreatic Alzheimer proteomics dataset.
Classification results using features selected by Student t test.
Algorithm
Accuracy(%
)
TP
rate
FP
rate
TN
rate
FN
rate
Sensitivit
y
Specificit
y
Precisio
n
Fmeasur
e
RMSE
Random
Forest
0.6500
0.7
9
0.5
3
0.4
8
0.2
1
0.79 0.48 0.65 0.71
0.456
9
KNN 0.6833
0.7
8
0.4
4
0.5
6
0.2
2
0.78 0.56 0.69 0.73
0.428
5
Logitboost 0.6889
0.8
3
0.4
9
0.5
1
0.1
7
0.83 0.51 0.69 0.75
0.440
2
Stacking 0.6444
0.9
9
0.7
9
0.2
1
0.0
1
0.99 0.21 0.61 0.76
0.476
1
Multiboost 0.6889
0.8
1
0.4
6
0.5
4
0.1
9
0.81 0.54 0.70 0.74
0.517
5
Logistic 0.7500
0.7
9
0.3
0
0.7
0
0.2
1
0.79 0.70 0.78 0.78
0.422
4
Naivebaye
s
0.6833
0.6
4
0.2
6
0.7
4
0.3
6
0.64 0.74 0.76 0.68
0.528
9
Bayesnet 0.6722
0.6
3
0.2
8
0.7
3
0.3
7
0.63 0.73 0.74 0.67
0.530
8
Neural
Network
0.7000
0.7
0
0.3
0
0.7
0
0.3
0
0.70 0.70 0.75 0.72
0.451
7
RBFnet 0.6722
0.7
6
0.4
4
0.5
6
0.2
4
0.76 0.56 0.69 0.71
0.463
2
CRDTNN 0.9644
0.7
1
0.3
3
0.6
8
0.2
9
0.71 0.68 0.74 0.71
0.548
9
TP rate: True positive rate, FP rate: False positive rate, TN rate: True negative rate, FN rate:
False negative rate, RMSE: Root Mean Squared Error. RBFnet: Radio Basis Function network,
SVM: Support Vector Machine.

Chapter 6: Testing
TESTING
Testing is an activity in which a system or component is executed under specified
conditions, whose results are observed or recorded, and an evaluation is made about some aspect
of the system or component. A successful testing uncovers errors in the software. So in general
testing demonstrates that the system is working according to the specifications, and that it meets
the performance requirements. This is the final stage of any project. Testing is a process of
executing the program with the intent of finding an error, it is a set of activities that can be
planned in advance and conducted systematically. The purpose of System testing is to correct the
errors in the system. Nothing is completed without the testing, as it is vital to the success of the
system.
6.1 Testing Phases:
Software testing phases include the following:
Test activities are determined and Test data is selected. The test is conducted and test
results are compared with the expected results.
There are various types of Testing:
Unit Testing:
Unit testing is a procedure used to validate that individual units of source code are
working properly. A unit is the smallest testable part of an application. In procedural
programming a unit may be an individual program, function, procedure etc, while in object-
oriented programming, the smallest unit is always a Class; which may be a base/super class,
abstract class or derived/child class. Units are distinguished from modules in that modules are
typically made up of units
Integration Testing:

Integration Testing is the phase of software testing in which individual software modules
are combined and tested as a group. It follows unit testing and precedes system testing. The goal
is to see if the modules are properly integrated and the emphasis being on the testing interfaces
among modules.
System Testing:
System testing is testing conducted on a complete, integrated system to evaluate the
system's compliance with its specified requirements. System testing is actually done to the entire
system against the Functional Requirement Specification(s) (FRS) and/or the System
Requirement Specification (SRS). It is also intended to test up to and beyond the bounds defined
in the software/hardware requirements specification(s).
Acceptance Testing:
Acceptance testing generally involves running a suite of tests on the completed system.
The acceptance test suite is run against the supplied input data or using an acceptance test script
to direct the testers. Then the results obtained are compared with the expected results. If there is
a correct match for every case, the test suite is said to pass. If not, the system may either be
rejected or accepted on conditions previously agreed between the sponsor and the manufacturer.
6.2 Testing Methods:
Testing is a process of executing a program to find out errors. Any testing can be done in
two ways:
White Box Testing:
White Box testing uses an internal perspective of the system to design test cases based on
internal structure. It requires programming skills to identify all paths through the software. The
tester chooses test case inputs to exercise paths through the code and determines the appropriate
outputs. Using the testing a software engineer can derive the following
Test cases:
Exercise all the logical decisions on either true or false sides. Execute all
loops at their boundaries and within their operational boundaries. Exercise the internal data
structures to assure their validity.
Black Box Testing:

Black box testing takes an external perspective of the test object to derive test cases.
These tests can be functional or non-functional, though usually functional. The test designer
selects valid and invalid input and determines the correct output. There is no knowledge of the
test object's internal structure.
Black Box testing attempts to find errors in the following categories:
 Incorrect or missing functions
 Interface errors
 Errors in data structures
 Performance errors
 Initialization and termination errors
6.3 Test Approach:
Testing can be done in two ways:
o Bottom-up approach
o Top-down approach
Bottom-up approach:
In a bottom-up approach the individual base elements of the system are first specified in
great detail. These elements are then linked together to form larger subsystems, which then in
turn are linked, sometimes in many levels, until a complete top-level system is formed. This
strategy often resembles a "seed" model, whereby the beginnings are small, but eventually grow
in complexity and completeness. However, "organic strategies", may result in a tangle of
elements and subsystems, developed in isolation, and subject to local optimization as opposed to
meeting a global purpose.
Top-down approach:

In a top-down approach an overview of the system is first formulated, specifying but not
detailing any first-level subsystems. Each subsystem is then detailed enough to realistically
validate the model.
7.1 Black box Testing:
Test Case 1: Color space conversion
Objective: To check whether the RGB space converted into YUV
Description: After putting RGB pixelite matrix get YUV factor
Expected Behavior Observed Behavior
Y is Luminance (brightness).
U & V are Chrominance factor
Conversion is done for obtaining
brightness & color factors.
Y is Luminance (brightness).
U & V are Chrominance factor.
Y=0.299*R +0.587*G+0.11*B
U= (B-Y)*0.565
V= (R-Y)* 0.713 for y= 0 step 0
toImageHeight
Test Case 2: Calculate histogram
Objective: To check whether the Histogram computed or not
Description: To check whether the Histogram computed or not while camera start
capturing
Compute Histogram for Color Image Y(Lumi histogram) in Array index
U(Chromi Histogram) in Array index

V(Chromi Histogram) in Array index
YLumi = (int)(Blue * 0.1133 + Green *
0.5859 + Red * 0.3008);
UChro = (int)(0.493 * (Blue - YLumi) +
128);
VChro = (int)(0.877 * (Red - YLumi) +
128);
HIndex = YLumi / 4;
HIndex = UChro / 4;
HIndex = VChro / 4;
Test Case 3: Analysis
Objective: To Compare histogram using similarity HMM function
Description: Compare two histogram using DCOS function till out layer equals 1
Compare two histogram using DCOS
function and out layer equals 1
Dcos(A,B) = 1 for two hiastogram
Test Case 4: Record
Objective: To check whether the record module is working properly or not
Description: After selecting this option, the recording should start

Pixelgrabber grab pixel for image array Recording starts. And Pixelgrabber grab
pixel for image array
Test Case 5: File Indexing
Objective: To check the working of file indexing module.
Description: After selecting this option, the file indexing should start and frames should
be captured.
File Indexing Starts and keyframes are
captured.
File Indexing Starts and Keyframes are
captured.
Test Case 6: Image Feature Extraction.
Objective: To check if all the indexed Image list and their keyframes are displayed.
Description: After selecting this option, a list of all indexed Image is displayed and when
a video is selected, its keyframes are displayed.
A list of indexed Image and their keyframes
are displayed.
A list of indexed Image and their keyframes
are displayed.
Test Case 7: Query.

Objective: To check if the query works properly and searches the image in indexed
Image.
Description: If the image queried is present in any video then the search is positive and a
path is displayed.
If image is present in an indexed video then
path is displayed else not found.
If image is present in an indexed video then
path is displayed else not found.
Test Case 8: Exit function.
Objective: To check whether exit function is working correctly or not.
Description: When we click on exit button project should be closed.
When we click on exit button project
should be closed.
Project is closed successfully.
7.2 GUI testing:-
Graphical User Interface (GUI) presents interesting challenges for software
engineers. Because of reusable components provided as part of GUI development
environments, the creation of the user interface has become less time consuming
and more precise. But, the same time, the complexity of GUIs has grown, leading
to more difficulty in the design and execution of the test cases. Because many
modern GUIs have the same look and same feel, a series of test cases can be
derived.
Test Case 1:

Objective: To check whether the menu selection process is working properly.
Description: When we select any option from the menu, then that chosen option is
selected and appropriate action is taken.
Chosen option is selected and appropriate
action is taken.
Chosen option is selected and appropriate
action is taken.
Test Case 2:
Objective: To check working of right-click menu which are on the main form.
Description: To check whether right-click menu shortcut are working properly.
The shortcuts work properly. The shortcuts work properly.
7.3 System Testing:-
System testing is actually a series of different tests whose primary purpose
is to fully exercise a computer based system. Although each test has different
purpose, all work to verify that system elements have been properly integrated and
allocated functions.
Test Case 1:

Objective: To check whether the system is working properly.
Description: The analysis, Classification and Detection are working properly.
The analysis, indexing and Detection are
working properly.
The analysis, indexing and Detection are
working properly.

Chapter 7:Schedule Work And Estimate

Estimation and Efforts:-
The costing feasibility of the project can be estimated using current estimated
models such as the Lines of Code, which allow us to estimate cost as a function of size.
Thus, this also allows us to estimate and analyze the feasibility of completion of the
system in the given timeframe. This allows us to have a realistic estimate as well as a
continuous evaluative perspective of the progress of the project.
Number of people working on this project = 3
Duration of project = August 2010 to April 2011
The project is divided over a period from August 2010 to April 2011. This time span is divided
into two major parts as follow.
DURATION FROM
DATE
TO DATE WEEKS HOURS/
WEEK
Duration I August November 14 6
Duration II Jan April 16 10
Table 4.2.1 Duration Table
Due to the academic compulsions we will be available for the project for following man-hours.
For Duration I : 14 * 6 = 84 MAN HOURS
For Duration II : 16 * 10 = 160 MAN HOURS
TOTAL availability = 224 MAN HOURS
Name of module LOC count

Capture 667
Analysis 445
Recording 430
File Indexing 460
Query 221
Total 2223
Table 4.2.2 KLOC Table
Constructive Cost Model (COCOMO)computes software development effort (and
cost) as a function of program size. Program size is expressed in estimated thousands of
lines of code (KLOC).
COCOMO applies to three classes of software projects:
• Organic projects: Small teams with good experience working with less than rigid
requirements
• Semi-detached projects: Medium teams with mixed experience working with a mix of
rigid and less than rigid requirements
• Embedded projects: Developed within a set of tight constraints (hardware, software,
operational)
KLOC is the estimated number of delivered lines (expressed in thousands) of code for
project, The coefficients a, b, c and d are given in the following table.
Software project a b c d
Organic 2.4 1.05 2.5 0.38
Semi- detached 3.0 1.12 2.5 0.35
Embedded 3.6 1.20 2.5 0.32

Table 4.2.3 COCOMO coefficients Table
In COCOMO model the effort can be calculated as:
Effort Applied (E) = a * (KLOC) ^ b (man-months)
And duration of the project can be estimated as:
Development Time (D) = c* E ^ d (months)
Our project ‘Content Based Video Indexing & Image Retrieval’ comes under image
processing category.
So, a = 2.4, b=1.05, c=2.5, d=0.38
E = 2.4 * (2.223) ^ 1.05
= 5.5526 man-months
D = 2.5 * 2.223 ^ 0.38
= 3.3867 months
According COCOMO model, the average cost per person month is Rs 10,000 so overall
software cost can be estimated as,
Software cost = E * 10,000
= 5.5526 * 10,000
= Rs 55,526.00 (approx)
4.3 Time Line Schedule:-
From To Task

01-08-2010 06-08-2010 Group Formation and finalization
07-08-2010 13-08-2010 Topic Search and Finalization
14-08-2010 20-08-2010 Preliminary Information Gathering
21-08-2010 27-08-2010 Synopsis Preparation and Submission
28-08-2010 03-09-2010 Project Discussion with Coordinator and Topic
Finalization
04-09-2010 10-09-2010 Detailed Literature Survey
11-09-2010 24-09-2010 Algorithm Finalization and Detailed Study
25-09-2010 01-10-2010 Drawing UML diagrams
02-10-2010 08-10-2010 Preparing PPT
09-10-2010 15-10-2010 Preparing Mid Term Report
16-10-2010 26-11-2010 Language Study (Visual C# .NET)
01-01-2011 02-03-2011 Coding and Implementation
03-03-2011 27-04-2011 Documentation
Table 4.3 Time Line Schedule Table

4.4 Time Line Chart:
Figure 4.4 Time line chart

Chapter 8: Conclusion and Future Direction
Our proposed system implements a novel classification mechanism for efficiently analyze the
brain tumor images using RDTNN classifier. We utilized ROI (Region of Interest) segmentation
method for CT image. Using DWT, the key features are extracted; the extracted features are
taken as input for RDT to reduce the dimensionality of features. Then the images were trained
with KNN classifier. Finally, the proposed algorithm is significantly efficient for classification of
the human brain image is benign and malignant with high sensitivity, specificity and accuracy
rates. The performance of this study appears some advantages of this technique: it is accurate,
robust easy to operate, non-invasive and inexpensive. In future work, we have a plan to explore
different types of medicinal images as well as some other application domains and study some
formal properties of image features.
Reference
[1] I. Kononenko, “Machine learning for medical diagnosis: History, state of the art and
perspective,” Artif. Intell. Med., vol. 23, no. 1, pp. 89–109,
2001.
[2] G. D. Magoulas and A. Prentza, “Machine learning in medical applications,” Mach. Learning
Appl. (Lecture Notes Comput. Sci.),
Berlin/Heidelberg, Germany: Springer, vol. 2049, pp. 300–307, 2001.
[3] L. Breiman, “KNN predictors,” Mach. Learning, vol. 24, no. 2, pp. 123–140, 1996.
[4] Y. Freud and R. E. Schapire, “A decision-theoretic generalization of online learning and an
application to boosting,” J. Comput. Syst. Sci., vol. 55,
no. 1, pp. 119–139, 1997.
[5] T.K.Ho, “The random subspace method for constructing decision forests,” IEEE Trans.
Pattern Anal.Mach. Intell., vol. 20, no. 8, pp. 832–844, 1998.
[6] L. Breiman, “Random forests,” Mach. Learning, vol. 45, pp. 5–32, 2001.
[7] L. Rokach and O. Maimon, Data Mining with Decision Trees Theory and Applications
(Machine Perception and Artificial Intelligence Series 69). H. Bunke and P. S. P. Wang, Eds.
Singapore: World Scientific, 2008.
[8] A. L. Prodromidis, S. J. Stolfo, and P. K. Chan, “Effective and efficient pruning of
metaclassifiers in a distributed data mining system,”, Columbia
Univ., New York, Tech. Rep. CUCS-017-99, 1999.
[9] M. Robnik-Sikonja, “Improving random forests,” in Proc. Eur. Conf. Mach. Learning, 2004,
pp. 359–369.
[10] A. Tsymbal, M. Pechenizkiy, and P. Cunningham, “Dynamic integration with random
forests,” in Proc. Eur. Conf. Mach. Learning, vol. 4212,
Berlin/Heidelberg, Germany: Springer, 2006.
[11] P. Cunningham, “A taxonomy of similarity mechanisms for case-based reasoning,”,
University College Dublin, Dublin, Ireland, Tech. Rep. UCDCSI-
2008-01, 2008.

[12] H. Hu, J. Li, H. Wang, G. Daggard, and M. Shi, “A maximally diversified multiple decision
tree algorithm for microarray data classification,”, presented at the Workshop Intell. Syst.
Bioinformat., Hobart, Australia
2006.
[13] S. Gunter and H. Bunke, “Optimization of weights in a multiple classifier handwritten word
recognition system using a genetic algorithm,”
Electron. Letters Comput. Vision Image Anal., pp. 25–41, 2004.
[14] E. E. Tripoliti, D. I. Fotiadis, M. Argyropoulou, and G. Manis, “A six stage approach for the
diagnosis of the Alzheimer’s disease based on
fMRI data,” J. Biomed. Informat., vol. 43, pp. 307–310, 2010.
[15] S. Bernard, L. Heutte, and S. Adam, “On the selection of decision trees
in random forests,” in Proc. IEEE-ENNS Int. Joint Conf. Neural Netw.,
2009, pp. 302–307.
[16] E. E. Tripoliti, D. I. Fotiadis, and G. Manis, “Modifications of random
forests algorithm,” Data Knowl. Eng., to be published.
[17] E. Gatnar, “A diversity measure for tree-based classifier ensembles,” in
Data Analysis and Decision Support, D. Baier, et al.., Eds. Heidelberg,
Germany: Springer, pp. 30–38, 2005.
[18] G. Giacinto, F. Roli, and G. Fumera, “Design of effective multiple classifiers
systems by clustering of classifiers,” in Proc. 15th Int. Conf. Pattern
Recog., 2000, pp. 160–163.
[19] G. Martinez-Munoz and A. Suarez, “Pruning in ordered KNN ensembles,”
in Proc. 23rd Int. Conf. Mach. Learning, 2006, pp. 609–616.
[20] C. Orrite, M. Rodriquez, F. Martinez, and M. Fairhurst, “Classifier ensemble
generation for the majority vote rule,” in Lecture Notes on Computer
Science, J. Ruiz-Shulcloper et al., Eds. BerlinHeidelberg, Germany:
Springer-Verlag, pp. 340–347, 2008.
[21] P. Letinne, O. Bebeir, and C. Decaestecker, “Limiting the number
of trees in random forests,” in Lecture Notes on Computer Science.
BerlinHeidelberg, Germany: Springer-Verlag, 2001, pp. 178–187.
[22] J. Xiao and Ch. He, “Dynamic classifier ensemble selection based on
GMDH,” in Proc. Int. Joint Conf. Comput. Sci. Optimization, 2009,
pp. 731–734.
[23] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, “A
comparison of decision tree ensemble creation techniques,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 173–180, Jan. 2007.
[24] S. Bernard, L. Heutte, and S. Adam, “Forest-RK: A new random forest
induction method,” in Proc. Int. Conf. Intell. Comput. 2008. Lecture
Notes in Artificial Intelligence 5227, D.-S. Huang, et al., Eds. Heidelberg,
Germany: Springer, 2008a, pp. 430–437.
[25] E. E. Tripoliti, D. I. Fotiadis, and G. Manis, “Dynamic construction of
random forests: Evaluation using biomedical engineering problems,”, presented
at the 10th Int. Conf. Inf. Technol. Appl. Biomed. Corfu, Greece,
2010.
[26] G. W. Brier, “Verification of forecasts expressed in terms of probability,”
Monthly Weather Review, vol. 78, pp. 1–3, 1950.

Advance KNN classification of brain tumor

Advance KNN classification of brain tumor

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Advance KNN classification of brain tumor

Similar to Advance KNN classification of brain tumor (20)

Recently uploaded

Recently uploaded (20)

Advance KNN classification of brain tumor