Progress REPORT ON
(Dept. Of Information Technology, 2016-2020)
(GROUP-03) - 7th
VIKASH KUMAR (2016-3028)
RAKESH RANJAN (2016-3027)
SUMIT ABHISHEK (2016-3031)
1. Abstract
2. Introduction
3. Problem Statement and Data sets
4. Some terminologies
5. Software & Hardware Requirement
6. Different models used (Algorithms)
a. K-Nearest Neighbors
b. Random Forest Classification
c. Adaptive Boosting
d. Support Vector Machine
7. Implementation of our models on problem set
8. Comparison between various Algorithms
9. Future improvements and scopes
10. Conclusion
11. References
Image classification is a complex process that may be affected by many
factors. This paper examines current practices, problems, and prospects
of image classification. The emphasis is placed on the summarization of
major advanced classification approaches and the techniques used for
improving classification accuracy. In addition, some important issues
affecting classification performance are discussed. This literature review
suggests that designing a suitable image‐processing procedure is a
prerequisite for a successful classification of remotely sensed data into a
thematic map. Effective use of multiple features of remotely sensed data
and the selection of a suitable classification method are especially
significant for improving classification accuracy. Non‐parametric
classifiers such as neural network, decision tree classifier, and
knowledge‐based classification have increasingly become important
approaches for multisource data classification. Integration of remote
sensing, geographical information systems (GIS), and expert system
emerges as a new research frontier.
More research, however, is needed to identify and reduce uncertainties
in the image‐processing chain to improve classification accuracy.
Image classification follows the steps of pre-processing, segmentation,
feature extraction and classification. In a classification system the
database is very important: it contains predefined sample patterns of the
object under consideration, which are compared with the test object to
assign it to the appropriate class. Image classification is an important
task in various fields such as biometry, remote sensing, and biomedical
imaging. In a typical classification system an image is captured by a
camera and subsequently processed. In supervised classification, training
first takes place on a known group of pixels. The trained classifier is
then used to classify other images. Unsupervised classification uses the
properties of the pixels to group them; these groups are known as clusters
and the process is called clustering. The number of clusters is decided by
the user. Unsupervised classification is used when trained pixels are not
available. Examples of classification methods are Decision Tree,
Artificial Neural Network (ANN) and Support Vector Machine (SVM).
Problem statement: To study a retina image dataset and to model a
classifier for predicting whether a person is suffering from glaucoma or not.
The problem statement for a document classifier has two aspects: the
document space and set of document class. The former defines the range
of input documents and the latter defines the output that the classifier can
Here in our project, the document space is a database consisting of several
numerical data sets derived from retinal images.
Data Sets: We have taken a data set of 255 retinal images and performed our
classification operations on it. We have used 70% of the image data set for
training our model and the remaining 30% for testing the model.
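The 70/30 split described above can be sketched as follows. This is only a minimal illustration in plain Python, not the code used in the project: the function name train_test_split_indices, the shuffling step and the seed are assumptions.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train/test lists.

    The shuffle and the fixed seed are assumptions made for reproducibility.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

# 255 records, as in the data set described above.
train_idx, test_idx = train_test_split_indices(255)
print(len(train_idx), len(test_idx))  # 178 77
```

With 255 records this gives 178 training records and 77 test records; no record appears in both parts.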
The features are extracted from the fundus images using image processing
techniques (kurtosis, k-statistic, mean, median and standard deviation), and
the obtained numerical features are stored in a dataset.
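As a rough sketch of how such statistical features could be computed from a list of pixel intensities, consider the plain-Python function below. The function name extract_features and the sample pixel values are hypothetical, "k-stat" is assumed to mean the second k-statistic (the unbiased variance estimate), and kurtosis is taken as excess kurtosis computed from population moments; the project may have used different definitions.

```python
import statistics

def extract_features(pixels):
    """Return summary statistics for a flat list of pixel intensities.

    Requires at least two pixels that are not all equal.
    """
    n = len(pixels)
    mean = statistics.fmean(pixels)
    m2 = sum((p - mean) ** 2 for p in pixels) / n  # second central moment
    m4 = sum((p - mean) ** 4 for p in pixels) / n  # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(pixels),
        "std": m2 ** 0.5,                 # population standard deviation
        "kurtosis": m4 / m2 ** 2 - 3.0,   # excess kurtosis (assumed definition)
        "kstat2": m2 * n / (n - 1),       # second k-statistic (assumed meaning of k-stat)
    }

# Made-up intensities standing in for one fundus image.
print(extract_features([10, 12, 12, 13, 200, 14, 11, 12]))
```

Each image then contributes one row of such numbers to the numerical dataset.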
Some Terminologies
Confusion Matrix:
A confusion matrix is a summary of prediction results on a classification problem.
The number of correct and incorrect predictions are summarized with count values
and broken down by each class. This is the key to the confusion matrix.
The confusion matrix shows the ways in which your classification model is
confused when it makes predictions.
It gives us insight not only into the errors being made by a classifier but more
importantly the types of errors that are being made.
Definition of the Terms:
• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an apple).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.
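The four counts defined above can be tallied directly from the true and predicted labels. The sketch below is illustrative only; the function names and the example labels are made up, and accuracy and specificity are computed with their usual definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN and FP for a binary classification problem."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive, predicted positive
            else:
                fn += 1   # positive, predicted negative
        else:
            if predicted == positive:
                fp += 1   # negative, predicted positive
            else:
                tn += 1   # negative, predicted negative
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

def accuracy(c):
    """Fraction of all observations that were predicted correctly."""
    return (c["TP"] + c["TN"]) / sum(c.values())

def specificity(c):
    """Fraction of negative observations that were predicted negative."""
    return c["TN"] / (c["TN"] + c["FP"])

# Made-up labels: 1 = positive, 0 = negative.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
print(counts)                # {'TP': 3, 'FN': 1, 'TN': 3, 'FP': 1}
print(accuracy(counts))      # 0.75
print(specificity(counts))   # 0.75
```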
Software Requirement
1. Jupyter Notebook (Anaconda): Anaconda is a free and open-source
distribution of the Python and R programming languages
for scientific computing (data science, machine
learning applications, large-scale data processing, predictive
analytics, etc.) that aims to simplify package management and
deployment. Package versions are managed by the package
management system conda. The Anaconda distribution includes
data-science packages suitable for Windows, Linux, and macOS.
The following packages are installed for the implementation:
a) NumPy Library
b) Pandas Library
c) Matplotlib
2. Browser
Hardware Requirement
1. Windows 7/8/10
2. RAM 2GB
3. Minimum Storage 20GB
We have used four algorithms:
➢ K-Nearest Neighbors
➢ Random Forest Classification
➢ Adaptive Boosting
➢ Support Vector Machine
K-NN is a classifier in the category of supervised learning algorithms. In
supervised learning the targets are known to us but the pathway to the target is
not. Nearest-neighbor classification is a good example for understanding machine
learning. Consider several clusters of labelled samples, where the items within
each cluster are homogeneous. Now suppose an unlabeled item needs to be assigned
to one of the labelled groups. K-nearest neighbors is a simple algorithm for this:
it keeps a record of all available labelled samples and puts the new item into the
class that receives the largest number of votes among its k nearest neighbors. In
this way KNN is one way to classify an unlabeled item into a known class.
Selecting the number of nearest neighbors, in other words choosing the value of k,
plays an important role in determining the efficiency of the designed model. The
accuracy and efficiency of the k-NN algorithm are largely determined by the chosen
k value. A larger k has the advantage of reducing the variance caused by noisy
data.
Advantage: KNN is an unbiased algorithm that makes no assumptions about the data
under consideration. It is very popular because of its simplicity, ease of
implementation and effectiveness.
Disadvantage: k-NN does not build a model, so no abstraction process is involved.
It takes a long time to predict the class of an item. It also requires
considerable time to prepare the data to design a robust system.
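The voting rule described above can be sketched in a few lines of plain Python. This is an illustration rather than the project's implementation: Euclidean distance is an assumption (the report does not name a distance measure), and the feature values and labels in the example are made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort all training points by Euclidean distance to the query (assumed metric).
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Two made-up clusters of labelled samples.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["healthy", "healthy", "healthy", "glaucoma", "glaucoma", "glaucoma"]

print(knn_predict(points, labels, (1.1, 1.0)))  # healthy
print(knn_predict(points, labels, (5.1, 5.0)))  # glaucoma
```

Note that every prediction scans the whole training set, which is why prediction is slow for large data sets.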
Random Forest is a method that operates by constructing multiple decision trees
during the training phase. The decision of the majority of the trees is chosen by
the random forest as the final decision.
Random Forests grows many classification trees. To classify a new object from an
input vector, put the input vector down each of the trees in the forest. Each tree
gives a classification, and we say the tree "votes" for that class. The forest chooses
the classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random -
but with replacement, from the original data. This sample will be the training
set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each
node, m variables are selected at random out of the M and the best split on
these m is used to split the node. The value of m is held constant during the
forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
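Steps 1 and 2 above can be sketched as two small helpers: one draws the bootstrap sample for a tree, the other picks the m candidate variables for a node. The function names are hypothetical, and the "out-of-bag" list (cases never drawn into the sample) is included only to show which cases are left over.

```python
import random

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; also return the cases never drawn."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(n_features, m, rng):
    """Pick m distinct variable indices out of n_features for one node split."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
in_bag, out_of_bag = bootstrap_sample(10, rng)
print(in_bag)       # 10 indices, some repeated
print(out_of_bag)   # indices that were never drawn
print(random_feature_subset(5, 2, rng))  # 2 of the 5 variable indices
```

Because sampling is with replacement, some cases appear several times in a tree's training set while others are left out entirely.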
Algorithm for Construction of Random Forest
Step 1: Let the number of training cases be “n” and let the number of
variables included in the classifier be “m”.
Step 2: Let the number of input variables used to make the decision at a
node of a tree be “p”. We assume that p is always less than “m”.
Step 3: Choose a training set for the decision tree by drawing n times
with replacement from all “n” available training cases, i.e. by taking a
bootstrap sample. We use the remaining cases to estimate the error of the
tree. Bootstrapping estimates the accuracy of a statistic computed from a
given set of data, for example its deviation from the mean. It is often
used for hypothesis tests and for estimating the properties of the given
training data. A simple block bootstrap can be used when the data can be
divided into non-overlapping blocks. A moving block bootstrap is used
when the data are divided into overlapping blocks, where the overlap “k”
between the first and second block is always equal to the overlap “k”
between the second and third block, and so on.
Step 4: For each node of the tree, randomly choose p variables on which to
search for the best split. New data can be predicted by considering the
majority vote of the trees. Predict the data that is not in the bootstrap
sample and compute the aggregate.
Step 5: Calculate the best split based on these chosen variables in the
training set. Base the decision at that node on the best split.
Step 6: Each tree is fully grown and not pruned. Pruning cuts off branches
and leaf nodes to reduce the size of a tree; it is not applied here, so
the tree is completely retained.
Step 7: The best split is the one with the least error, i.e. the least
deviation from the observed data set.
Random Forest Advantages
1. It provides accurate predictions for many types of applications.
2. It can measure the importance of each feature with respect to the
training data set.
3. Pairwise proximity between samples can be measured from the
training data set.
Random Forest Disadvantages
1. For data including categorical variables with different numbers of
levels, random forests are biased in favor of those attributes
with more levels.
2. If the data contain groups of correlated features of similar
relevance for the output, then smaller groups are favored over
larger groups.
Random Forest Application
1. It is used for image classification for pixel analysis.
2. It is used in the field of bioinformatics for complex data analysis.
3. It is used for video segmentation (high dimensional data).
AdaBoost is short for Adaptive Boosting. It was the first really successful
boosting algorithm developed for binary classification, and it is the best
starting point for understanding boosting. Modern boosting methods build on
AdaBoost, most notably stochastic gradient boosting.
AdaBoost is generally used with short decision trees. After the first tree is
created, its performance on each training instance is used to weight how much
attention the next tree should pay to each training instance. Training data that
is hard to predict is given more weight, whereas easy-to-predict instances are
given less weight.
Learn AdaBoost Model from Data
AdaBoost is best used to boost the performance of decision trees on binary
classification problems.
Each instance in the training dataset is weighted. The initial weight is set to:
weight(xi) = 1/n
where xi is the i’th training instance and n is the number of training instances.
How To Train One Model?
A weak classifier is prepared on the training data using the weighted samples. Only
binary classification problems are supported, so each decision stump makes one
decision on one input variable and outputs a +1.0 or -1.0 value for the first or
second class value.
The misclassification rate is calculated for the trained model. Traditionally, this is
calculated as:
error = (N - correct) / N
where error is the misclassification rate, correct is the number of training
instances predicted correctly by the model, and N is the total number of training
instances.
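A minimal sketch of these pieces follows: the initial weights of 1/n, a decision stump on one input variable, and the misclassification rate taken as the fraction of instances predicted incorrectly, (N - correct) / N. The function names, the threshold rule (values above the threshold map to +1.0) and the toy data are assumptions made for illustration.

```python
def initial_weights(n):
    """Every training instance starts with the same weight 1/n."""
    return [1.0 / n] * n

def stump_predict(x, feature_index, threshold):
    """Decision stump: one decision on one input variable, output +1.0 or -1.0."""
    return 1.0 if x[feature_index] > threshold else -1.0

def misclassification_rate(y_true, y_pred):
    """Fraction of instances predicted incorrectly: (N - correct) / N."""
    n = len(y_true)
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return (n - correct) / n

# Toy data with a single input variable; labels are +1.0 / -1.0.
X = [(0.2,), (0.4,), (0.6,), (0.8,)]
y = [-1.0, -1.0, 1.0, -1.0]
predictions = [stump_predict(x, 0, 0.5) for x in X]

print(initial_weights(len(X)))                 # [0.25, 0.25, 0.25, 0.25]
print(misclassification_rate(y, predictions))  # 0.25
```

Here the stump gets three of four instances right, so the error is 0.25.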
AdaBoost Ensemble
• Weak models are added sequentially, trained using the weighted
training data.
• The process continues until a pre-set number of weak learners
have been created.
• Once completed, you are left with a pool of weak learners, each with a stage
value.
Making Predictions with AdaBoost
Predictions are made by calculating the weighted average of the weak classifiers.
For a new input instance, each weak learner calculates a predicted value as either
+1.0 or -1.0. The predicted values are weighted by each weak learner stage value.
The prediction for the ensemble model is taken as the sum of the weighted
predictions. If the sum is positive, the first class is predicted; if it is
negative, the second class is predicted.
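The weighted vote can be sketched as below. The three weak learners and their stage values are invented for the example, and treating a sum of exactly zero as the second class is an assumption, since the text does not cover that case.

```python
def adaboost_predict(weak_learners, stage_values, x):
    """Sum each weak learner's +1.0/-1.0 output weighted by its stage value."""
    total = sum(stage * learner(x) for learner, stage in zip(weak_learners, stage_values))
    # Positive sum: first class (+1.0). Otherwise: second class (-1.0).
    # A sum of exactly zero falls into the second class here (an assumption).
    return 1.0 if total > 0 else -1.0

# Hypothetical pool of three decision stumps with made-up stage values.
weak_learners = [
    lambda x: 1.0 if x[0] > 2 else -1.0,
    lambda x: 1.0 if x[1] > 1 else -1.0,
    lambda x: 1.0 if x[0] > 5 else -1.0,
]
stage_values = [0.8, 0.3, 0.4]

print(adaboost_predict(weak_learners, stage_values, (3, 0)))  # 1.0
print(adaboost_predict(weak_learners, stage_values, (1, 0)))  # -1.0
```

In the first call only the first stump votes +1.0, but its stage value of 0.8 outweighs the other two (0.3 + 0.4), so the ensemble predicts the first class.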
Data Preparation for AdaBoost
This section lists some heuristics for best preparing your data for AdaBoost.
Quality Data: Because the ensemble method attempts to correct
misclassifications in the training data, you need to be careful that the training
data is of high quality.
Outliers: Outliers will force the ensemble down the rabbit hole of working hard
to correct for cases that are unrealistic. These could be removed from the
training dataset.
Noisy Data: Noisy data, specifically noise in the output variable, can be
problematic. If possible, attempt to isolate and clean these from your training
dataset.
AdaBoost algorithm advantages:
It makes very good use of weak classifiers by cascading them;
Different classification algorithms can be used as weak classifiers;
AdaBoost has a high degree of precision;
Relative to the bagging algorithm and the Random Forest algorithm, AdaBoost fully
considers the weight of each classifier.
AdaBoost algorithm disadvantages:
The number of AdaBoost iterations, i.e. the number of weak classifiers, is hard to
set well; it can be determined using cross-validation;
Data imbalance leads to a decrease in classification accuracy;
Training is time consuming, because the best split point of the current
classifier is re-selected at each iteration.
The support vector machine (SVM) comes in the category of supervised learning. The
SVM is used for regression and classification, but it is best known for
classification, where it is a very efficient classifier. Every object or item is
represented by a point in n-dimensional space, and the value of each feature is
the value of a particular coordinate. The items are then divided into classes by
finding a hyperplane, as shown in the figure.
The diagram shows the support vectors, which are the coordinates of individual
items. The SVM is the frontier (hyperplane) that best segregates the two classes.
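Once a separating hyperplane is known, classifying a point comes down to checking which side of it the point lies on. The sketch below assumes a linear SVM whose hyperplane is given by a weight vector and a bias; the values used are hypothetical and not learned from any data, and finding them is the training step that is not shown here.

```python
def linear_svm_predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Hypothetical hyperplane x1 - x2 = 0 in a 2-dimensional feature space.
weights, bias = (1.0, -1.0), 0.0

print(linear_svm_predict(weights, bias, (3.0, 1.0)))  # 1
print(linear_svm_predict(weights, bias, (1.0, 3.0)))  # -1
```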
SVM Advantages
SVMs work well when we have little prior knowledge about the data.
Works well with even unstructured and semi structured data like text, Images and
The kernel trick is the real strength of SVM. With an appropriate kernel function,
we can solve many complex problems.
Unlike neural networks, SVM training does not get stuck in local optima.
It scales relatively well to high dimensional data.
SVM models generalize well in practice; the risk of over-fitting is less in SVM.
SVM is always compared with ANN. When compared to ANN models, SVMs
often give better results.
SVM Disadvantages
Choosing a “good” kernel function is not easy.
Long training time for large datasets.
Difficult to understand and interpret the final model, variable weights and
individual impact.
Since the final model is not so easy to see, we cannot do small calibrations to the
model, hence it is tough to incorporate our business logic.
The SVM hyper-parameters are the cost C and gamma. It is not easy to fine-tune
these hyper-parameters, and it is hard to visualize their impact.
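To give some feeling for gamma, the sketch below evaluates the RBF (Gaussian) kernel, in which gamma commonly appears; whether the project used this kernel is an assumption. The cost C is not shown: it controls how strongly training errors are penalized relative to a wide margin.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = (0.0, 0.0), (1.0, 1.0)
for gamma in (0.1, 1.0, 10.0):
    # The larger gamma is, the faster similarity decays with distance.
    print(gamma, rbf_kernel(a, b, gamma))
```

A small gamma makes distant points still look similar, giving smoother decision boundaries; a large gamma makes each training point influence only its immediate neighborhood.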
SVM Application
• Protein Structure Prediction
• Intrusion Detection
• Handwriting Recognition
• Detecting Steganography in digital images
• Breast Cancer Diagnosis
• Almost all the applications where ANN is used
In our glaucoma dataset, we achieved an accuracy of 82% in detecting the disease,
and in future we will try to increase the accuracy further.
We will use algorithms such as Convolutional Neural Networks to increase the
accuracy rate.
Currently we are using a numerical data set as our input for classification; in
future we will take the image data set directly as input.
Advances in image processing and classification will be helpful in diagnosing
medical conditions correctly.
They will be helpful in recognizing people, performing surgery, reprogramming
defects in human DNA, etc.
This paper gives beginners in this field a brief idea of classifiers.
It helps researchers select the appropriate classifier for their problem.
This paper explains the KNN, SVM, Random Forest and AdaBoost algorithms,
which are very popular classifiers in the field of image processing. Classifiers
are mainly categorized as supervised or unsupervised. In short, this paper
provides theoretical knowledge of the concepts behind the above-mentioned
classifiers.
We applied four algorithms to our glaucoma dataset and found that the random
forest algorithm has the highest accuracy, 82%, in detecting glaucoma.
We found that the KNN algorithm has the highest specificity value.
All these algorithms can be used for better medical diagnosis of diseases such as
cancer, eye disease, etc.
It can also be used for biometric purposes such as identity, face and finger print
• Digital Image Processing: Kenneth R. Castleman

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Kamal Acharya
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf

Recently uploaded (20)

Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf


  • 2. Contents 1. Abstract 2. Introduction 3. Problem Statement and Data sets 4. Some terminologies 5. Software & Hardware Requirement 6. Different models used (Algorithms) a. K-Nearest Neighbors b. Random Forest Classification c. Adaptive Boosting d. Support Vector Machine 7. Implementation of our models on problem set 8. Comparison between various Algorithms 9. Future improvements and scopes 10. Conclusion 11. References
  • 3. ABSTRACT Image classification is a complex process that may be affected by many factors. This paper examines current practices, problems, and prospects of image classification. The emphasis is placed on the summarization of major advanced classification approaches and the techniques used for improving classification accuracy. In addition, some important issues affecting classification performance are discussed. This literature review suggests that designing a suitable image‐processing procedure is a prerequisite for a successful classification of remotely sensed data into a thematic map. Effective use of multiple features of remotely sensed data and the selection of a suitable classification method are especially significant for improving classification accuracy. Non‐parametric classifiers such as neural network, decision tree classifier, and knowledge‐based classification have increasingly become important approaches for multisource data classification. Integration of remote sensing, geographical information systems (GIS), and expert system emerges as a new research frontier. More research, however, is needed to identify and reduce uncertainties in the image‐processing chain to improve classification accuracy.
  • 4. INTRODUCTION Image classification follows these steps: pre-processing, segmentation, feature extraction and classification. In a classification system the database is very important: it contains predefined sample patterns of the object under consideration, which are compared with the test object to assign it to the appropriate class. Image classification is an important task in various fields such as biometry, remote sensing, and biomedical imaging. In a typical classification system, an image is captured by a camera and then processed. In supervised classification, training is first carried out on a known group of pixels, and the trained classifier is then used to classify other images. Unsupervised classification uses the properties of the pixels to group them; these groups are known as clusters, and the process is called clustering. The number of clusters is decided by the user. Unsupervised classification is used when labelled training pixels are not available. Examples of classification methods are Decision Tree, Artificial Neural Network (ANN) and Support Vector Machine (SVM).
  • 5. PROBLEM STATEMENTS AND DATA SETS Problem statement: To study a retina image dataset and to model a classifier for predicting whether a person is suffering from glaucoma or not. The problem statement for a document classifier has two aspects: the document space and the set of document classes. The former defines the range of input documents and the latter defines the output that the classifier can produce. In our project, the document space is a database of numerical data derived from retinal images. Data sets: We took 255 retinal images and performed our classification operations on them. We used 70% of the image data set for training our model and kept the remaining 30% for testing it. The features (kurtosis, k-stat, mean, median and standard deviation) are extracted from the fundus images using image processing techniques, and the resulting numerical features are stored in a dataset.
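The feature extraction and the 70/30 split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the use of population statistics, the excess-kurtosis definition and the fixed random seed are all assumptions, and the k-stat feature is left out.

```python
import random
import statistics


def extract_features(pixels):
    """Return a few summary statistics for one image's pixel intensities.

    Assumption: population (not sample) statistics and excess kurtosis.
    """
    n = len(pixels)
    mean = statistics.fmean(pixels)
    m2 = sum((p - mean) ** 2 for p in pixels) / n  # second central moment
    m4 = sum((p - mean) ** 4 for p in pixels) / n  # fourth central moment
    # A constant image has zero variance; report kurtosis 0.0 in that case.
    kurtosis = m4 / (m2 ** 2) - 3.0 if m2 > 0 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(pixels),
        "std": statistics.pstdev(pixels),
        "kurtosis": kurtosis,
    }


def split_70_30(rows, seed=0):
    """Shuffle the rows and return (train, test) with a 70% / 30% split."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 7) // 10
    return shuffled[:cut], shuffled[cut:]


train, test = split_70_30(range(255))
print(len(train), len(test))
```

With 255 rows this gives 178 training rows and 77 test rows.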
  • 6. Some Terminologies Confusion Matrix: A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class; this is the key to the confusion matrix. The confusion matrix shows the ways in which a classification model is confused when it makes predictions. It gives insight not only into the errors being made by a classifier but, more importantly, into the types of errors being made. Definition of the Terms: • Positive (P): Observation is positive (for example: is an apple). • Negative (N): Observation is not positive (for example: is not an apple). • True Positive (TP): Observation is positive, and is predicted to be positive. • False Negative (FN): Observation is positive, but is predicted negative. • True Negative (TN): Observation is negative, and is predicted to be negative. • False Positive (FP): Observation is negative, but is predicted positive.
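As a small illustration of the terms above, the four counts can be tallied from lists of true and predicted labels and turned into summary rates. The function names and the label encoding (1 for positive, 0 for negative) are assumptions made for this sketch; sensitivity is included next to accuracy and specificity only for completeness.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN and FP for a binary classification result."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # positive, predicted positive
        elif truth == positive:
            fn += 1  # positive, predicted negative
        elif pred == positive:
            fp += 1  # negative, predicted positive
        else:
            tn += 1  # negative, predicted negative
    return tp, fn, tn, fp


def summarize(tp, fn, tn, fp):
    """Derive common rates; assumes both classes occur in the data."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(summarize(*confusion_counts(y_true, y_pred)))
```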
  • 7. SOFTWARE AND HARDWARE REQUIREMENTS • SOFTWARE 1. Jupyter Notebook (Anaconda): Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution includes data-science packages suitable for Windows, Linux, and macOS. The following packages are installed for the implementation: a) NumPy b) Pandas c) Matplotlib 2. Browser • HARDWARE 1. A PC running Windows 7/8/10 2. 2 GB RAM 3. Minimum 20 GB storage
  • 8. DIFFERENT MODELS USED (Algorithms) We have used four algorithms: ➢ K-Nearest Neighbors ➢ Random Forest Classification ➢ Adaptive Boosting ➢ Support Vector Machine K-NEAREST NEIGHBORS K-NN is a classifier in the category of supervised learning algorithms. In supervised learning the targets are known to us, but the pathway to the target is not. Nearest-neighbor classification is a good first example for understanding machine learning. Suppose there are several clusters of labelled samples, where the items within each cluster are homogeneous, and an unlabelled item needs to be assigned to one of the labelled groups. K-nearest neighbors is a simple algorithm for this: it keeps a record of all available labelled samples and puts the new item into the class that receives the largest number of votes among its k nearest neighbors. In this way KNN is one option for classifying an unlabelled item into a known class. Selecting the number of nearest neighbors, in other words choosing the value of k, plays an important role in determining the efficiency of the designed model: the accuracy and efficiency of the k-NN algorithm are largely determined by the chosen k. A larger value of k has the advantage of reducing the variance caused by noisy data.
  • 9. Advantage: KNN is a non-parametric algorithm and makes no assumptions about the data under consideration. It is very popular because of its simplicity, ease of implementation and effectiveness. Disadvantage: k-NN does not build a model, so no abstraction step is involved. Prediction is slow, because each new item has to be compared with the stored samples. Considerable time is also needed to prepare the data for a robust system. ALGORITHM FOR KNN:
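The slides that followed here were images and did not survive in the transcript. As a stand-in, the voting procedure described above can be sketched as below; this is not the authors' original listing. Euclidean distance, the function name and the toy data are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored sample (Euclidean, by assumption).
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    # Count the labels of the k closest samples and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature data; the labels are purely illustrative.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["healthy", "healthy", "healthy", "glaucoma", "glaucoma", "glaucoma"]
print(knn_predict(train_X, train_y, (2, 2), k=3))
```

Using an odd k with two classes avoids tied votes.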
  • 12. RANDOM FOREST ALGORITHM Random Forest is a method that operates by constructing multiple decision trees during the training phase. The decision of the majority of the trees is chosen by the random forest as the final decision. A random forest grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Each tree is grown as follows: 1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree. 2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown. 3. Each tree is grown to the largest extent possible. There is no pruning.
  • 13. The algorithm for constructing a Random Forest is: Step 1: Let the number of training cases be “n” and let the number of variables included in the classifier be “m”. Step 2: Let the number of input variables used to make a decision at a node of a tree be “p”. We assume that p is always less than m. Step 3: Choose a training set for the decision tree by taking a bootstrap sample, i.e. by choosing n times with replacement from all n available training cases. Bootstrapping is a resampling technique for estimating properties of a data set, such as the deviation of a statistic from its mean, and it is also used in hypothesis tests. A simple block bootstrap can be used when the data can be divided into non-overlapping blocks; a moving block bootstrap is used when the data are divided into overlapping blocks, where the overlap “k” between consecutive blocks is always the same. The cases left out of the bootstrap sample are used to estimate the error of the tree. Step 4: For each node of the tree, randomly choose p variables on which to search for the best split. New data are predicted by taking the majority vote over the trees; the error estimate is obtained by predicting the cases that are not in the bootstrap sample and aggregating the results. Step 5: Calculate the best split based on these chosen variables in the training set, and base the decision at that node on the best split. Step 6: Each tree is fully grown and not pruned. Pruning would cut back branches of the tree to reduce its size; here the tree is retained completely. Step 7: The best split is the one with the least error, i.e. the least deviation from the observed data set.
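The randomization steps above (bootstrap sample, random variable subset per node, majority vote) can be sketched as follows. Growing the trees themselves is omitted, and all function names are hypothetical helpers for this illustration; n, m and p follow the notation of the steps above.

```python
import random
from collections import Counter


def bootstrap_sample(n, rng):
    """Draw n case indices with replacement; also return the left-out cases."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))  # used to estimate tree error
    return in_bag, out_of_bag


def random_variable_subset(m, p, rng):
    """Choose p of the m input variables to search for the best split at a node."""
    return rng.sample(range(m), p)


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into the forest's decision."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # hypothetical fixed seed
in_bag, out_of_bag = bootstrap_sample(10, rng)
print(in_bag, out_of_bag)
print(random_variable_subset(m=5, p=2, rng=rng))
print(majority_vote(["glaucoma", "healthy", "glaucoma"]))
```

Because sampling is done with replacement, some cases appear more than once in the bag and others not at all; the latter are the cases available for estimating the error of that tree.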
  • 14. Advantages: 1. It provides accurate predictions for many types of applications. 2. It can measure the importance of each feature with respect to the training data set. 3. Pairwise proximity between samples can be measured from the training data set. Disadvantages: 1. For data including categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. 2. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups. Applications: 1. It is used in image classification for pixel analysis. 2. It is used in the field of bioinformatics for complex data analysis. 3. It is used for video segmentation (high-dimensional data).
  • 16. ADABOOST ALGORITHM AdaBoost is short for Adaptive Boosting. It was the first really successful boosting algorithm developed for binary classification and is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines. AdaBoost is generally used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Learn AdaBoost Model from Data AdaBoost is best used to boost the performance of decision trees on binary classification problems. Each instance in the training dataset is weighted. The initial weight is set to: weight(xi) = 1/n, where xi is the i-th training instance and n is the number of training instances. How To Train One Model? A weak classifier (a decision stump) is prepared on the training data using the weighted samples. Only binary classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class. The misclassification rate is calculated for the trained model. Traditionally, this is calculated as: error = (N - correct) / N, where error is the misclassification rate, correct is the number of training instances predicted correctly by the model, and N is the total number of training instances.
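A rough sketch of the quantities on this slide: the uniform initial weights, a decision stump on one input variable, and the misclassification rate taken as the fraction of instances not predicted correctly, (N - correct) / N. The function names, the threshold form of the stump and the toy data are assumptions made for illustration.

```python
def initial_weights(n):
    """Every training instance starts with the same weight, 1/n."""
    return [1.0 / n] * n


def stump_predict(x, feature, threshold):
    """A decision stump: one decision on one input variable, output +1.0 or -1.0."""
    return 1.0 if x[feature] > threshold else -1.0


def misclassification_rate(y_true, y_pred):
    """error = (N - correct) / N, with `correct` the number of correct predictions."""
    n = len(y_true)
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return (n - correct) / n


X = [(1.0,), (2.0,), (3.0,), (4.0,)]
y = [-1.0, 1.0, 1.0, 1.0]
predictions = [stump_predict(x, feature=0, threshold=2.5) for x in X]
print(initial_weights(len(X)))
print(misclassification_rate(y, predictions))
```

In this toy case the stump gets one of the four instances wrong, so the rate is 0.25.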
  • 17. AdaBoost Ensemble • Weak models are added sequentially, each trained using the weighted training data. • The process continues until a pre-set number of weak learners has been created. • Once completed, you are left with a pool of weak learners, each with a stage value. Making Predictions with AdaBoost Predictions are made by calculating the weighted average of the weak classifiers. For a new input instance, each weak learner calculates a predicted value of either +1.0 or -1.0. The predicted values are weighted by each weak learner's stage value. The prediction of the ensemble model is taken as the sum of the weighted predictions: if the sum is positive, the first class is predicted; if negative, the second class is predicted. Data Preparation for AdaBoost This section lists some heuristics for preparing your data for AdaBoost. Quality Data: Because the ensemble method attempts to correct misclassifications in the training data, you need to be careful that the training data is of high quality. Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset. Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.
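The prediction rule described above can be sketched as a weighted sum over the pool of weak learners. The stage values and stumps below are made-up numbers for illustration, and mapping an exact zero sum to the second class is an assumption, since the slide only covers the positive and negative cases.

```python
def adaboost_predict(x, learners):
    """Sum each weak learner's +1.0/-1.0 vote, weighted by its stage value."""
    total = sum(stage * predict(x) for predict, stage in learners)
    # Positive sum -> first class, negative sum -> second class.
    return "first" if total > 0 else "second"


# Hypothetical pool of (weak learner, stage value) pairs.
learners = [
    (lambda x: 1.0 if x[0] > 2 else -1.0, 0.9),
    (lambda x: 1.0 if x[1] > 5 else -1.0, 0.4),
    (lambda x: -1.0, 0.3),
]
print(adaboost_predict((3, 1), learners))
print(adaboost_predict((1, 9), learners))
```

For the input (3, 1) the weighted votes are +0.9, -0.4 and -0.3, so the sum is positive and the first class is predicted; for (1, 9) the sum is negative.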
  • 18. AdaBoost algorithm advantages: It makes very good use of weak classifiers for cascading; different classification algorithms can be used as weak classifiers; AdaBoost has a high degree of precision; relative to the bagging algorithm and the Random Forest algorithm, AdaBoost fully considers the weight of each classifier. AdaBoost algorithm disadvantages: The number of AdaBoost iterations, i.e. the number of weak classifiers, is hard to set, although it can be determined using cross-validation; data imbalance leads to a decrease in classification accuracy; training is time consuming, because the best split point has to be re-selected each time a weak classifier is trained.
  • 20. SUPPORT VECTOR MACHINE The Support Vector Machine (SVM) belongs to the category of supervised learning. SVM can be used for both regression and classification, but it is best known for classification, where it is a very efficient classifier. Every object or item is represented by a point in n-dimensional space, with the value of each feature given by a particular coordinate. The items are then divided into classes by finding a separating hyperplane, as shown in the figure. The support vectors shown in the diagram are the data points closest to the hyperplane, and they determine its position. The SVM algorithm is a good choice for segregating two classes. SVM Advantages SVMs work well even when we have little prior knowledge about the data. They work well even with unstructured and semi-structured data such as text, images and trees. The kernel trick is the real strength of SVM: with an appropriate kernel function, many complex problems can be solved. Unlike neural networks, SVM training does not get stuck in local optima.
  • 21. It scales relatively well to high-dimensional data. SVM models generalize well in practice, so the risk of over-fitting is lower. SVM is often compared with ANN, and in such comparisons SVMs frequently give better results. SVM Disadvantages Choosing a “good” kernel function is not easy. Training time is long for large datasets. The final model, its variable weights and their individual impact are difficult to understand and interpret. Since the final model is not easy to inspect, we cannot make small calibrations to it, so it is hard to incorporate business logic. The SVM hyperparameters are the cost C and gamma. They are not easy to fine-tune, and it is hard to visualize their impact. SVM Applications • Protein Structure Prediction • Intrusion Detection • Handwriting Recognition • Detecting Steganography in digital images • Breast Cancer Diagnosis • Almost all the applications where ANN is used
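To make the hyperplane and the gamma hyperparameter from the SVM slides more concrete, here is a minimal sketch of a linear decision function and a kernel function. The weights w and bias b are hand-picked rather than learned, and the RBF (Gaussian) kernel is an assumed choice; the slides do not say which kernel was used in the project.

```python
import math


def svm_decision(x, w, b):
    """Signed score of point x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_classify(x, w, b):
    """Assign +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if svm_decision(x, w, b) >= 0 else -1


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2); larger gamma means a narrower kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


w, b = (1.0, -1.0), 0.0  # hypothetical hyperplane: x0 - x1 = 0
print(svm_classify((3.0, 1.0), w, b), svm_classify((1.0, 3.0), w, b))
print(rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5))
```

With a larger gamma the kernel value between two distinct points drops faster, so each training point influences a smaller neighbourhood; this is one reason the parameter is hard to tune by hand.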
  • 24. FURTHER IMPROVEMENTS AND FUTURE SCOPES On our glaucoma dataset we achieved an accuracy of 82% in detecting the disease, and in future we aim to increase this accuracy. We will use algorithms such as Convolutional Neural Networks to improve the accuracy rate. Currently we use a numerical data set as the input for classification; in future we will take the image data set directly as input. Advances in image processing and classification will be helpful in diagnosing medical conditions correctly. They will also be helpful in recognizing people, performing surgery, reprogramming defects in human DNA, etc.
  • 25. CONCLUSION This paper gives beginners in the field a brief idea of classifiers and helps researchers select an appropriate classifier for their problem. It explains the KNN, SVM, Random Forest and AdaBoost algorithms, which are very popular classifiers in the field of image processing. Classifiers are mainly categorized as supervised or unsupervised; in short, this paper provides theoretical knowledge of the classifiers mentioned above. We applied four algorithms to our glaucoma dataset and found that the random forest algorithm has the highest accuracy, 82%, in detecting glaucoma. We found that the KNN algorithm has the highest specificity. All these algorithms can be used for better medical diagnosis of diseases such as cancer and eye disease. They can also be used for biometric purposes such as identity, face and fingerprint documentation.
  • 26. References • Digital Image Processing: Kenneth R. Castleman • classification-models