Q. Write the advantages, disadvantages and applications of the different
algorithms used in Data Mining.
Ans. Decision Trees
In simple words, a decision tree is a structure made of nodes (drawn as
rectangular boxes) and edges (drawn as arrows). It is built from a dataset (a
table whose columns represent features/attributes and whose rows correspond to
records). Each node either makes a decision (a decision node) or represents an
outcome (a leaf node).
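The node-and-edge structure described above can be sketched in a few lines of
Python. This is only an illustration: the names Leaf, Decision and predict, and
the toy "outlook / windy" tree, are hypothetical and do not come from any
particular library or dataset.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Leaf node: holds the outcome for records that reach it."""
    outcome: str


@dataclass
class Decision:
    """Decision node: tests one feature and follows the matching edge."""
    feature: str
    children: Dict[str, "Node"]  # edge label (feature value) -> child node


Node = Union[Leaf, Decision]


def predict(node: Node, record: Dict[str, str]) -> str:
    # Walk from the root along the edges until a leaf is reached.
    while isinstance(node, Decision):
        node = node.children[record[node.feature]]
    return node.outcome


# Hypothetical toy tree: decide whether to play outside.
tree = Decision("outlook", {
    "sunny": Decision("windy", {"yes": Leaf("stay in"), "no": Leaf("play")}),
    "rainy": Leaf("stay in"),
})

print(predict(tree, {"outlook": "sunny", "windy": "no"}))
```

Each record is a row of the table; each Decision tests one column and each Leaf
carries the final outcome.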
1. Naive Bayes classifier (NBC)
Naive Bayes is a machine learning algorithm used to solve classification
problems. It is based on Bayes' Theorem. It is one of the simplest yet most
powerful ML algorithms in use and finds applications in many industries.
Suppose you have to solve a classification problem and have created the features
and generated the hypothesis, but your superiors want to see the model. You have
numerous data points (lakhs, i.e. hundreds of thousands) and many variables to
train on. A good choice in this situation is the Naive Bayes classifier, which
is typically faster than other classification algorithms.
Advantages of NBC:
1) The naive Bayesian model originates from classical mathematical theory
and has a solid mathematical foundation and stable classification
efficiency.
2) It is fast to train and to query, even on large datasets. Even with very
large training sets, there is usually only a relatively small number of
features for each item, and training on or classifying an item involves
only arithmetic on the feature probabilities.
3) It works well on small-scale data, can handle multi-class tasks, and is
suitable for incremental training (that is, it can be updated with new
samples in real time).
4) It is less sensitive to missing data, and the algorithm is relatively
simple; it is often used for text classification.
5) The results of Naive Bayes are easy to explain.
Disadvantages of NBC:
1) The classification decision always carries some error rate.
2) It is very sensitive to the form in which the input data is represented.
3) It assumes that the sample attributes are independent, so it performs
poorly when the attributes are correlated.
4) This independence of all predictors (or features) rarely holds in real
life, which limits the applicability of the algorithm in real-world use
cases.
5) The algorithm faces the 'zero-frequency problem': it assigns zero
probability to a categorical value that appears in the test dataset but
was not present in the training dataset. A smoothing technique should be
used to overcome this issue.
6) Its probability estimates can be wrong in some cases, so its probability
outputs should not be taken too literally.
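The zero-frequency problem and its fix can be illustrated with a small
categorical Naive Bayes written from scratch. The sketch below assumes Laplace
(add-alpha) smoothing, which is one common smoothing technique; the source does
not name a specific one. The function names train_nb and predict_nb and the toy
weather data are hypothetical.

```python
from collections import Counter, defaultdict
import math


def train_nb(rows, labels, alpha=1.0):
    """Count class and per-class feature-value frequencies.

    alpha is the smoothing constant (alpha=1.0 is add-one smoothing).
    """
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen in training
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(y, j)][v] += 1
            vocab[j].add(v)
    return class_counts, dict(value_counts), dict(vocab), alpha


def predict_nb(model, row):
    class_counts, value_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        log_p = math.log(n_c / total)  # log prior of class c
        for j, v in enumerate(row):
            # Assumption: reserve one extra slot for values unseen in training.
            k = len(vocab.get(j, set())) + 1
            count = value_counts.get((c, j), Counter())[v]
            # Smoothed likelihood is never zero, even when count == 0.
            log_p += math.log((count + alpha) / (n_c + alpha * k))
        scores[c] = log_p
    return max(scores, key=scores.get)


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)

# "overcast" never appears in training; smoothing keeps both classes comparable.
print(predict_nb(model, ("overcast", "cool")))
```

Working in log space avoids numeric underflow when many small probabilities are
multiplied. Without smoothing (alpha = 0), the unseen value "overcast" would
give every class a probability of zero.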
Applications of Naive Bayes Algorithms
• Real-time Prediction: Because Naive Bayes is very fast, it can be used for
making predictions in real time.
• Multi-class Prediction: The algorithm can predict the posterior
probability of multiple classes of the target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes
classifiers are widely used in text classification, where their good
results on multi-class problems and the independence assumption often give
them a high success rate compared to other algorithms. As a result, they
are widely used in spam filtering (identifying spam e-mail) and sentiment
analysis (in social media analysis, to identify positive and negative
customer sentiments).
• Recommendation System: A Naive Bayes classifier combined with algorithms
like Collaborative Filtering makes a recommendation system that uses
machine learning and data mining techniques to filter unseen information
and predict whether a user would like a given resource or not.
2. Iterative Dichotomiser 3 (ID3)
ID3 stands for Iterative Dichotomiser 3. It is a classification algorithm, so
named because it iteratively (repeatedly) dichotomizes (divides) the data on a
feature into two or more groups at each step. Invented by Ross Quinlan, ID3
uses a top-down greedy approach to build a decision tree. In simple words,
top-down means that we start building the tree from the top (the root), and
greedy means that at each step we select the best attribute available at that
moment: the one that yields the maximum Information Gain (IG), or equivalently
the minimum remaining Entropy (H).
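The greedy attribute choice can be made concrete with the standard definitions
of entropy and information gain. The following is a minimal sketch of a single
selection step, not a full ID3 implementation; the function names and the toy
records are hypothetical.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy H (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, features):
    """Greedy step: pick the attribute with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

print(best_attribute(rows, labels, ["outlook", "windy"]))
```

In this toy data "outlook" separates the classes perfectly (gain of 1 bit),
while "windy" tells us nothing (gain of 0), so the greedy step picks "outlook".
ID3 would then repeat the same step on each resulting subset.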
Advantages of using ID3
1) Understandable prediction rules are created from the training data.
2) Builds the tree quickly.
3) Builds a short tree.
4) Only needs to test enough attributes until all data is classified.
5) Finding leaf nodes enables test data to be pruned, reducing the number of
tests.
6) The whole dataset is searched to create the tree.
Disadvantages of using ID3
1) Data may be over-fitted or over-classified if a small sample is tested.
2) Only one attribute at a time is tested for making a decision.
3) Classifying continuous data may be computationally expensive, as many
trees must be generated to see where to break the continuum.
Applications of ID3
The ID3 algorithm is used in many areas, for example land capability
classification and information asset identification.
3. K-Nearest Neighbours (KNN)
KNN for Nearest Neighbour Search: The KNN algorithm retrieves the K data
points that are nearest in distance to a query point. It can be used for
classification or regression by aggregating the target values of the nearest
neighbours to make a prediction. However, simply retrieving the nearest
neighbours is itself important in several applications. For instance,
suppose we write a movie recommender system: once we find a suitable vector
representation for all the movies, recommending the five movies closest to a
given movie amounts to retrieving its five nearest neighbour vectors.
KNN for Classification: KNN can be used for classification in a supervised
setting where we are given a dataset with target labels. For classification, KNN
finds the k nearest data points in the training set, and the target label is
computed as the mode of the target labels of these k nearest neighbours.
KNN for Regression: KNN can be used for regression in a supervised setting
where we are given a dataset with continuous target values. For regression, KNN
finds the k nearest data points in the training set, and the target value is
computed as the mean of the target values of these k nearest neighbours.
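The two prediction rules above (mode for classification, mean for regression)
can be sketched with a brute-force neighbour search. This is a minimal
illustration assuming Euclidean distance and Python 3.8+ (for math.dist); the
function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Indices of the k training points closest to the query (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return order[:k]


def knn_classify(train_X, train_y, query, k=3):
    # Target label = mode of the labels of the k nearest neighbours.
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    # Target value = mean of the values of the k nearest neighbours.
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
classes = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

print(knn_classify(X, classes, (1.1, 1.0), k=3))
print(knn_regress(X, values, (5.0, 5.0), k=3))
```

Note that there is no fitting step: every prediction scans the whole training
set, which is why the method is simple but slows down as the data grows.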
Advantages of KNN
1) K-NN is intuitive and simple: The K-NN algorithm is very simple to
understand and equally easy to implement. To classify a new data point,
the K-NN algorithm reads through the whole dataset to find the K nearest
neighbours.
2) K-NN makes no assumptions: K-NN is a non-parametric algorithm, which
means there are no assumptions about the data to be met before using it.
Parametric models like linear regression have many assumptions that the
data must meet before they can be applied, which is not the case with
K-NN.
3) No training step: K-NN does not explicitly build any model; it simply
tags a new data entry based on the stored historical data. The new entry
is tagged with the majority class among its nearest neighbours.
4) It constantly evolves: Because it is an instance-based (memory-based)
learner, the classifier adapts immediately as we collect new training
data. This allows the algorithm to respond quickly to changes in the input
during real-time use.
5) Very easy to implement for multi-class problems: Many classifier
algorithms are easy to implement for binary problems and need extra effort
for multi-class problems, whereas K-NN handles multiple classes without
any extra effort.
6) Can be used for both classification and regression: One of the biggest
advantages of K-NN is that it can be used for both classification and
regression problems.
7) One hyperparameter: K-NN has essentially one hyperparameter to tune, K.
Choosing it may take some time, but once it is chosen the remaining
settings follow from it.
8) A variety of distance criteria to choose from: K-NN gives the user the
flexibility to choose the distance measure when building the model.
a. Euclidean Distance
b. Hamming Distance
c. Manhattan Distance
d. Minkowski Distance
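The distance measures listed above can be written out directly. In this short
sketch Manhattan and Euclidean distance are expressed as the p = 1 and p = 2
cases of the Minkowski distance; the function names are illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    return minkowski(a, b, 1)  # sum of absolute differences


def euclidean(a, b):
    return minkowski(a, b, 2)  # straight-line distance


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7.0
print(hamming("karolin", "kathrin"))  # 3
```

Hamming distance suits categorical or binary features, while the Minkowski
family suits numeric features.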
Although K-NN has several advantages, it also has some important
disadvantages and constraints.
Disadvantages of KNN
1) K-NN is a slow algorithm: K-NN is very easy to implement, but as the
dataset grows, the speed of the algorithm declines quickly.
2) Curse of dimensionality: K-NN works well with a small number of input
variables, but as the number of variables grows, the algorithm struggles
to predict the output for a new data point.
3) K-NN needs homogeneous features: If you build K-NN using a common
distance such as Euclidean or Manhattan, it is essential that the
features have the same scale, since absolute differences in each feature
are weighted equally, i.e., a given distance in feature 1 must mean the
same as in feature 2.
4) Optimal number of neighbours: One of the biggest issues with K-NN is
choosing the optimal number of neighbours to consider when classifying a
new data entry.
5) Imbalanced data causes problems: K-NN does not perform well on
imbalanced data. If we consider two classes, A and B, and the majority of
the training data is labelled A, then the model will give a lot of
preference to A. This might result in the less common class B being
wrongly classified.
6) Outlier sensitivity: The K-NN algorithm is very sensitive to outliers,
as it simply chooses the neighbours based on the distance criterion.
7) Missing value treatment: K-NN inherently has no capability of dealing
with missing values.
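The need for features on the same scale (item 3) is easy to demonstrate. The
sketch below assumes min-max scaling to the range [0, 1], which is one common
choice among several; the (age, income) records and the function names are
hypothetical.

```python
import math


def min_max_scale(rows):
    """Rescale every column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(row, lo, hi))
        for row in rows
    ]


def nearest(rows, i):
    """Index of the row closest to rows[i] (Euclidean), excluding itself."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


# Hypothetical records: (age in years, income in currency units).
raw = [(25, 50_000), (60, 50_500), (26, 52_000), (40, 90_000)]

print(nearest(raw, 0))                 # income dominates the raw distance
print(nearest(min_max_scale(raw), 0))  # after scaling, age counts too
```

On the raw data the nearest neighbour of the 25-year-old is the 60-year-old,
because an income gap of 500 outweighs an age gap of 35 years. After scaling,
the 26-year-old with a similar income becomes the nearest neighbour.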
Applications of KNN
• Classification and interpretation (legal, news, banking)
• Filling in missing values
• Pattern recognition
• Gene expression analysis
• Protein-protein interaction prediction
• 3D structure prediction
• Measuring document similarity
• Problem solving (planning, pronunciation)
• Functional learning (dynamic control)
• Teaching and aiding (help desk, user training)
4. Classification and Regression Trees (CART) Algorithm
Classification and Regression Trees (CART) is a modern term for what are
otherwise known as Decision Trees. Decision Trees have been around for a very
long time and are important for predictive modelling in Machine Learning. As
the name suggests, these trees are used for classification and regression
(prediction) problems.
These models are obtained by partitioning the data space and fitting a simple
prediction model within each partition. This is done recursively. We can
represent the partitioning graphically as a tree; hence the name.
They have withstood the test of time for the following reasons:
1. Very competitive with other methods
2. High efficiency
There are two types. Classification trees are used to separate a dataset into
different classes (generally when the target classes are categorical).
Regression trees are used when the target variable is continuous (or
numerical).
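One partitioning step of a classification tree can be sketched as a search for
the best threshold on a single numeric feature. The text above does not name a
split criterion; the sketch assumes Gini impurity, which is the criterion
commonly associated with CART classification trees. The function names and the
toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Try each midpoint between consecutive distinct sorted values.

    Returns (threshold, weighted Gini impurity of the two partitions).
    """
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between equal values
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best


values = [2.0, 3.0, 10.0, 11.0]
labels = ["low", "low", "high", "high"]

print(best_split(values, labels))
```

A full tree applies this step recursively to each partition. A regression tree
would follow the same pattern with a different score, for example the variance
of the target within each partition.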
Advantages of CART
1) CART does not require any assumptions about underlying distributions.
2) It is easy to use and can quickly provide valuable insights.
3) CART can be used efficiently to assess massive datasets.
4) It can be further used to drill down to a particular cause and find
effective, quick solutions.
5) The solution is easily interpretable, intuitive and can be verified with
existing data.
6) It is a good way to present solutions to management.
Disadvantages of CART
1) The biggest limitation is that it is a nonparametric technique: it is
not recommended to generalize about the underlying phenomenon from the
results observed. Although the rules obtained through the analysis can be
tested on new data, it must be remembered that the model is built from
the sample without making any inference about the underlying probability
distribution.
2) Another limitation of CART is that the tree becomes quite complex after
seven or eight layers.
3) Interpreting the results of such a deep tree is not intuitive.
Applications of CART:
CART is used in many areas of machine learning, such as blood donor
classification, spatial, environmental and ecological data analysis, and
hepatitis disease diagnosis.
5. K-Means Clustering
The K-means algorithm is an iterative algorithm that tries to partition the
dataset into K pre-defined, distinct, non-overlapping subgroups (clusters),
where each data point belongs to only one group. It tries to make the
intra-cluster data points as similar as possible while keeping the clusters as
different (far apart) as possible. It assigns data points to clusters such that
the sum of the squared distances between the data points and their cluster's
centroid (the arithmetic mean of all the data points that belong to that
cluster) is at a minimum. The less variation we have within clusters, the more
homogeneous (similar) the data points are within the same cluster.
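The iterative procedure can be sketched as the usual two alternating steps:
assign each point to its nearest centroid, then move each centroid to the mean
of its points. This is a minimal sketch under stated assumptions: the initial
centroids are K randomly chosen data points, distance is Euclidean, and Python
3.8+ is available (for math.dist). The function names and the toy points are
hypothetical.

```python
import math
import random


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    for _ in range(iters):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                # Centroid = arithmetic mean of the cluster's points.
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    labels = assign(points, centroids)
    # Objective: sum of squared distances to the assigned centroid.
    sse = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, sse


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, labels, sse = kmeans(points, k=2)
print(labels, round(sse, 3))
```

The returned sse is the quantity the algorithm tries to minimise, so it can be
used to compare runs started from different initial centroids.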
Advantages of K-means
1) Relatively simple to implement.
2) Scales to large data sets.
3) Guarantees convergence (to a local optimum, not necessarily the best
possible clustering).
4) Can warm-start the positions of centroids.
5) Easily adapts to new examples.
6) Can be generalized to clusters of different shapes and sizes, such as
elliptical clusters.
Disadvantages of K-means
1) Dependence on initial values
For a low k, you can mitigate this dependence by running k-means several
times with different initial values and picking the best result.
As k increases, you need advanced versions of k-means to pick better
values of the initial centroids (called k-means seeding).
2) Clustering data of varying sizes and density
K-means has trouble clustering data where clusters are of varying sizes and
density. To cluster such data, you need to generalize k-means.
3) Clustering outliers
Centroids can be dragged by outliers, or outliers might get their own cluster
instead of being ignored. Consider removing or clipping outliers before
clustering.
4) Scaling with number of dimensions
As the number of dimensions increases, a distance-based similarity
measure converges to a constant value between any given examples.
Reduce dimensionality either by using PCA on the feature data, or by using
"spectral clustering" to modify the clustering algorithm.
Applications of K-Means Clustering
K-means clustering is used in a variety of real-life examples and business
cases, such as:
• Academic performance: Based on their scores, students are categorized
into grades like A, B, or C.
• Diagnostic systems: The medical profession uses k-means in creating
smarter medical decision support systems, especially in the treatment of
liver ailments.
• Search engines: Clustering forms a backbone of search engines. When a
search is performed, the search results need to be grouped, and search
engines very often use clustering to do this.
• Wireless sensor networks: The clustering algorithm is used to find the
cluster heads, each of which collects all the data in its respective
cluster.
END

ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationAadityaSharma884161
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 

Recently uploaded (20)

Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up Friday
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint Presentation
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 

Advantages, disadvantages and applications of decision trees, naive bayes, ID3, KNN and CART algorithms

  • 1. Q. Write the advantages, disadvantages and applications of the different algorithms used in data mining. Ans. Decision Trees: In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node either makes a decision (a decision node) or represents an outcome (a leaf node). 1. Naive Bayes classifier (NBC) Naive Bayes is a machine learning algorithm used to solve classification problems. It is based on Bayes' theorem. It is one of the simplest yet most powerful ML algorithms in use and finds applications in many industries. Suppose you have to solve a classification problem and have created the features and generated the hypothesis, but your superiors want to see the model. You have numerous data points (lakhs of data points) and many variables to train on. A good solution in this situation is the Naive Bayes classifier, which is considerably faster than many other classification algorithms. Advantages of NBC:
  • 2. 1) The naive Bayesian model originates from classical mathematical theory and has a solid mathematical foundation and stable classification efficiency. 2) It is fast for large volumes of training data and queries. Even with very large training sets, there is usually only a relatively small number of features for each item, and training and classifying an item is only a matter of arithmetic on the feature probabilities. 3) It works well on small-scale data, can handle multi-category tasks, and is suitable for incremental training (that is, it can be trained on new samples in real time). 4) It is less sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification. 5) Naive Bayes results are easy to explain. Disadvantages of NBC: 1) There is an error rate in the classification decision. 2) It is very sensitive to the form of the input data. 3) It assumes that sample attributes are independent, so if the attributes are correlated, the results are poor. 4) Naive Bayes assumes that all predictors (or features) are independent, which rarely happens in real life. This limits the applicability of the algorithm in real-world use cases. 5) The algorithm faces the ‘zero-frequency problem’: it assigns zero probability to a categorical variable whose category in the test dataset was not present in the training dataset. A smoothing technique should be used to overcome this issue.
  • 3. 6) Its estimates can be wrong in some cases, so you should not take its probability outputs too seriously. Applications of Naive Bayes Algorithms ▪ Real-time Prediction: As Naive Bayes is very fast, it can be used for making predictions in real time. ▪ Multi-class Prediction: This algorithm can predict the posterior probability of multiple classes of the target variable. ▪ Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification, where (due to their good results on multi-class problems and the independence rule) they have a higher success rate than many other algorithms. As a result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments). ▪ Recommendation System: A Naive Bayes classifier together with algorithms like Collaborative Filtering makes a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not. 2. Iterative Dichotomiser 3 ID3, which stands for Iterative Dichotomiser 3, is a classification algorithm, so named because it iteratively (repeatedly) dichotomizes (divides) the features into two or more groups at each step. Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each step we select the best attribute at that moment, i.e. the one that yields maximum Information Gain (IG) or minimum Entropy (H). Advantages of using ID3 1) Understandable prediction rules are created from the training data. 2) Builds the fastest tree.
  • 4. 3) Builds a short tree. 4) Only needs to test enough attributes until all data is classified. 5) Finding leaf nodes enables test data to be pruned, reducing the number of tests. 6) The whole dataset is searched to create the tree. Disadvantages of using ID3 1) Data may be over-fitted or over-classified if a small sample is tested. 2) Only one attribute at a time is tested for making a decision. 3) Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum. Applications of ID3 The ID3 algorithm is used in many areas, for example land capability classification and information asset identification. 3. K-Nearest Neighbours KNN for Nearest Neighbour Search: The KNN algorithm involves retrieving the K data points that are nearest in distance to the query point. It can be used for classification or regression by aggregating the target values of the nearest neighbours to make a prediction. However, just retrieving the nearest neighbours is itself an important operation in several applications. For instance, suppose we write a movie recommender system: once we find a suitable vector representation for all the movies, recommending the five closest movies to a given movie involves retrieving its five nearest neighbour vectors.
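The nearest-neighbour retrieval step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `movies` dictionary, its 2-D vectors and the helper names `euclidean` and `k_nearest` are hypothetical and not part of the original slides, and a real recommender would use learned, higher-dimensional vectors and most likely an indexed search rather than a full sort.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, points, k):
    """Return the indices of the k points closest to `query`, nearest first."""
    order = sorted(range(len(points)), key=lambda i: euclidean(query, points[i]))
    return order[:k]


# Hypothetical 2-D "movie vectors" (made up for illustration only).
movies = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 2.0),
    "D": (5.0, 5.0),
    "E": (6.0, 5.0),
}
names = list(movies)
vectors = [movies[n] for n in names]

# Retrieve the three movies closest to a query vector.
nearest = [names[i] for i in k_nearest((0.2, 0.1), vectors, 3)]
print(nearest)
```

The same `k_nearest` helper could serve as the first step of KNN classification or regression, with the labels or values of the returned neighbours aggregated afterwards.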
  • 5. KNN for Classification: KNN can be used for classification in a supervised setting where we are given a dataset with target labels. For classification, KNN finds the k nearest data points in the training set, and the target label is computed as the mode of the target labels of these k nearest neighbours. KNN for Regression: KNN can be used for regression in a supervised setting where we are given a dataset with continuous target values. For regression, KNN finds the k nearest data points in the training set, and the target value is computed as the mean of the target values of these k nearest neighbours. Advantages of KNN 1) K-NN is intuitive and simple: The K-NN algorithm is very simple to understand and equally easy to implement. To classify a new data point, the K-NN algorithm reads through the whole dataset to find the K nearest neighbours. 2) K-NN has no assumptions: K-NN is a non-parametric algorithm, which means there are no assumptions to be met in order to apply it. Parametric models like linear regression have many assumptions that the data must meet before they can be applied, which is not the case with K-NN.
  • 6. 3) No Training Step: K-NN does not explicitly build any model; it simply tags the new data entry based on learning from historical data. The new data entry is tagged with the majority class among its nearest neighbours. 4) It constantly evolves: Because it is instance-based learning, k-NN is a memory-based approach. The classifier immediately adapts as we collect new training data. This allows the algorithm to respond quickly to changes in the input during real-time use. 5) Very easy to implement for multi-class problems: Most classifier algorithms are easy to implement for binary problems and need extra effort for multi-class problems, whereas K-NN adjusts to multiple classes without any extra effort. 6) Can be used for both Classification and Regression: One of the biggest advantages of K-NN is that it can be used for both classification and regression problems. 7) One Hyper Parameter: K-NN might take some time while selecting the first hyper parameter, but after that the rest of the parameters are aligned to it. 8) Variety of distance criteria to choose from: The K-NN algorithm gives the user the flexibility to choose the distance measure while building a K-NN model. a. Euclidean Distance b. Hamming Distance c. Manhattan Distance d. Minkowski Distance Even though K-NN has several advantages, it also has some very important disadvantages or constraints. Disadvantages of KNN 1) K-NN is a slow algorithm: K-NN might be very easy to implement, but as the dataset grows the efficiency or speed of the algorithm declines very fast.
  • 7. 2) Curse of Dimensionality: KNN works well with a small number of input variables, but as the number of variables grows the K-NN algorithm struggles to predict the output for a new data point. 3) K-NN needs homogeneous features: If you decide to build k-NN using a common distance, like the Euclidean or Manhattan distance, it is essential that the features have the same scale, since absolute differences in features are weighted the same, i.e., a given distance in feature 1 must mean the same for feature 2. 4) Optimal number of neighbours: One of the biggest issues with K-NN is choosing the optimal number of neighbours to be considered while classifying a new data entry. 5) Imbalanced data causes problems: k-NN doesn’t perform well on imbalanced data. If we consider two classes, A and B, and the majority of the training data is labelled as A, then the model will ultimately give a lot of preference to A. This might result in the less common class B being wrongly classified. 6) Outlier sensitivity: The K-NN algorithm is very sensitive to outliers, as it simply chooses the neighbours based on distance criteria. 7) Missing value treatment: K-NN inherently has no capability of dealing with missing values. Applications of KNN ▪ Used in classification and interpretation (legal, news, banking) ▪ Used to get missing values ▪ Used in pattern recognition ▪ Used in gene expression ▪ Used in protein-protein prediction ▪ Used to get 3D structure of problem ▪ Used to measure document similarity
  • 8. ▪ Problem solving (planning, pronunciation) ▪ Functional learning (dynamic control) ▪ Teaching and aiding (help desk, user training) 4. Classification and Regression Trees (CART) Algorithm Classification and Regression Trees (CART) is a modern term for what are otherwise known as Decision Trees. Decision Trees have been around for a very long time and are important for predictive modelling in Machine Learning. As the name suggests, these trees are used for classification and prediction problems. The models are obtained by partitioning the data space and fitting a simple prediction model within each partition. This is done recursively. We can represent the partitioning graphically as a tree; hence the name. They have withstood the test of time for the following reasons: 1. Very competitive with other methods 2. High efficiency There are two types of tree. Classification trees are used to separate a dataset into different classes (generally used when we expect categorical classes). Regression trees are used when the class variable is continuous (or numerical). Advantages of CART 1) CART does not require any assumptions about the underlying distributions. 2) It is easy to use and can quickly provide valuable insights. 3) CART can be used efficiently to assess massive datasets. 4) It can be further used to drill down to a particular cause and find effective, quick solutions.
  • 9. 5) The solution is easily interpretable, intuitive and can be verified with existing data. 6) It is a good way to present solutions to management. Disadvantages of CART 1) The biggest limitation is that it is a nonparametric technique; it is not recommended to make any generalization about the underlying phenomenon based on the results observed. Although the rules obtained through the analysis can be tested on new data, it must be remembered that the model is built from the sample without making any inference about the underlying probability distribution. 2) Another limitation of CART is that the tree becomes quite complex after seven or eight layers. 3) Interpreting the results in this situation is not intuitive. Applications of CART: CART is used in many areas of machine learning, such as blood donor classification, spatial, environmental and ecological data, and hepatitis disease diagnosis. 5. K-Means Clustering The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distance between
  • 10. the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster. Advantages of K-means 1) Relatively simple to implement. 2) Scales to large data sets. 3) Guarantees convergence. 4) Can warm-start the positions of centroids. 5) Easily adapts to new examples. 6) Generalizes to clusters of different shapes and sizes, such as elliptical clusters. Disadvantages of K-means 1) Being dependent on initial values: For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding). 2) Clustering data of varying sizes and density: K-means has trouble clustering data where clusters are of varying sizes and density. To cluster such data, you need to generalize k-means. 3) Clustering outliers: Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
  • 11. 4) Scaling with number of dimensions: As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to modify the clustering algorithm. Applications of K-Means Clustering K-Means clustering is used in a variety of real-life examples or business cases, like: ▪ Academic performance ▪ Diagnostic systems ▪ Search engines ▪ Wireless sensor networks ▪ Academic performance: Based on their scores, students are categorized into grades like A, B, or C. ▪ Diagnostic systems: The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments. ▪ Search engines: Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this. ▪ Wireless sensor networks: The clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters. END
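As a supplement to section 5, the K-means procedure (assign each point to its nearest centroid, recompute each centroid as the mean of its points, repeat) and the "run several times and keep the best result" mitigation can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a production implementation: the function names `kmeans` and `best_of`, the random-sample initialisation, the `seed`, `max_iter` and `n_runs` parameters and the toy `data` are all hypothetical choices made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, seed=0, max_iter=100):
    """One run of K-means from randomly sampled initial centroids (assumed init)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    labels = assign(points, centroids)
    # Sum of squared distances of points to their cluster centroid.
    sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, sse


def best_of(points, k, n_runs=10):
    """Run K-means with several seeds and keep the run with the lowest SSE."""
    return min((kmeans(points, k, seed=s) for s in range(n_runs)),
               key=lambda run: run[2])


# Toy data: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels, sse = best_of(data, k=2)
print(sorted(centroids), labels, sse)
```

Choosing the run with the lowest sum of squared distances mirrors the objective described in section 5; for larger k, a seeding scheme such as k-means++ would usually replace the plain random initialisation assumed here.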