Q. Write the advantages, disadvantages and applications of different
algorithms used in Data Mining.
Ans. Decision Trees
In simple words, a decision tree is a structure that contains nodes (rectangular
boxes) and edges (arrows) and is built from a dataset (a table whose columns
represent features/attributes and whose rows correspond to records). Each node
is either used to make a decision (known as a decision node) or to represent an
outcome (known as a leaf node).
1. Naive Bayes Classifier (NBC)
Naive Bayes is a machine learning algorithm we use to solve classification
problems. It is based on the Bayes Theorem. It is one of the simplest yet most
powerful ML algorithms in use and finds applications in many industries.
Suppose you have to solve a classification problem, have created the features
and generated the hypothesis, but your superiors want to see the model. You have
numerous data points (lakhs, i.e., hundreds of thousands) and many variables in
the training dataset. The best solution in this situation would be the Naive Bayes
classifier, which is considerably faster than most other classification algorithms.
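As a quick illustration, here is a minimal sketch of training a Naive Bayes classifier with scikit-learn; the synthetic dataset and all parameter choices below are hypothetical, purely for illustration:

```python
# Minimal sketch: training a Naive Bayes classifier with scikit-learn.
# The toy dataset here is synthetic, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Generate a small synthetic classification dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GaussianNB()         # assumes features are conditionally Gaussian
model.fit(X_train, y_train)  # training is a single pass over the data

print("Accuracy:", model.score(X_test, y_test))
```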
Advantages of NBC:
1) The naive Bayesian model originated from classical mathematical theory
and has a solid mathematical foundation and stable classification
efficiency.
2) It is fast for both training and querying. Even with very large training
sets, there is usually only a relatively small number of features for each
record, and training and classification amount to simple mathematical
operations on the feature probabilities.
3) It works well for small-scale data, can handle multi-category tasks, and is
suitable for incremental training (that is, it can train on new samples in
real time).
4) It is less sensitive to missing data, the algorithm is relatively simple, and
it is often used for text classification.
5) Naive Bayes results are easy to interpret.
Disadvantages of NBC:
1) There is an error rate in its classification decisions.
2) It is very sensitive to the form of the input data.
3) It assumes that sample attributes are independent, so if the attributes are
correlated, it performs poorly.
4) Naive Bayes assumes that all predictors (or features) are independent,
which rarely happens in real life. This limits the applicability of the
algorithm in real-world use cases.
5) The algorithm faces the 'zero-frequency problem': it assigns zero
probability to a categorical variable whose category in the test dataset
was not present in the training dataset. A smoothing technique should be
used to overcome this issue (see the sketch after this list).
6) Its estimates can be wrong in some cases, so its probability outputs
should not be taken too literally.
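To make the zero-frequency problem and its fix concrete, here is a small plain-Python sketch of Laplace (add-one) smoothing; the word counts and class names are made up for illustration:

```python
# Sketch of Laplace (add-one) smoothing for the zero-frequency problem.
# Hypothetical counts: the word "offer" never appears in the "ham" class.
counts = {"spam": {"offer": 12, "meeting": 1},
          "ham":  {"offer": 0,  "meeting": 9}}
vocab_size = 2                 # number of distinct feature values
total = {"spam": 13, "ham": 9}  # total word counts per class

def smoothed_prob(word, cls, alpha=1.0):
    """P(word | cls) with add-alpha smoothing; never returns zero."""
    return (counts[cls].get(word, 0) + alpha) / (total[cls] + alpha * vocab_size)

print(smoothed_prob("offer", "ham"))  # small but non-zero, ~0.091
```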
Applications of Naive Bayes Algorithms
• Real-time Prediction: Since Naive Bayes is very fast, it can be used for
making predictions in real time.
• Multi-class Prediction: This algorithm can predict the posterior probability
of multiple classes of the target variable.
• Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes
classifiers are widely used in text classification (due to their good results
in multi-class problems and the independence assumption) and have a higher
success rate than many other algorithms. As a result, they are widely used
in spam filtering (identifying spam e-mail) and sentiment analysis (in social
media analysis, to identify positive and negative customer sentiment).
• Recommendation Systems: A Naive Bayes classifier, along with algorithms
like Collaborative Filtering, makes a recommendation system that uses
machine learning and data mining techniques to filter unseen information
and predict whether a user would like a given resource or not.
2. Iterative Dichotomiser 3
ID3 stands for Iterative Dichotomiser 3. It is a classification algorithm, named
as such because the algorithm iteratively (repeatedly) dichotomizes (divides)
features into two or more groups at each step. Invented by Ross Quinlan, ID3
uses a top-down greedy approach to build a decision tree. In simple words, the
top-down approach means that we start building the tree from the top, and the
greedy approach means that at each step we select the best attribute, the one
that yields maximum Information Gain (IG) or minimum Entropy (H).
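To make IG and H concrete, here is a small plain-Python sketch that computes entropy and information gain; the label counts and the split below are hypothetical:

```python
# Sketch of the two quantities ID3 uses to pick a splitting attribute.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """IG = H(S) minus the size-weighted entropy of the subsets
    produced by splitting on an attribute."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

labels = ["yes"] * 9 + ["no"] * 5       # 9 positives, 5 negatives
split  = [["yes"] * 6 + ["no"] * 2,     # records with attribute value A
          ["yes"] * 3 + ["no"] * 3]     # records with attribute value B
print(entropy(labels))                  # ~0.940
print(information_gain(labels, split))  # ~0.048
```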
Advantages of using ID3
1) Understandable prediction rules are created from the training data.
2) Builds the fastest tree.
3) Builds a short tree.
4) Only needs to test enough attributes until all data is classified.
5) Finding leaf nodes enables test data to be pruned, reducing the number of
tests.
6) The whole dataset is searched to create the tree.
Disadvantages of using ID3
1) Data may be over-fitted or over-classified if a small sample is tested.
2) Only one attribute at a time is tested for making a decision.
3) Classifying continuous data may be computationally expensive, as many
trees must be generated to see where to break the continuum.
Applications of ID3
The ID3 algorithm is used in many places; some examples are land capability
classification, information asset identification, etc.
3. K-Nearest Neighbours
KNN for Nearest Neighbour Search: The KNN algorithm involves retrieving the K
data points that are nearest in distance to the query point. It can be used for
classification or regression by aggregating the target values of the nearest
neighbours to make a prediction. However, just retrieving the nearest
neighbours is itself an important operation in several applications. For instance,
suppose we write a movie recommender system: once we find a suitable vector
representation for all the movies, then, given a movie, recommending the five
closest movies involves retrieving the five nearest neighbour vectors.
KNN for Classification: KNN can be used for classification in a supervised
setting where we are given a dataset with target labels. For classification, KNN
finds the k nearest data points in the training set, and the predicted label is the
mode of the target labels of these k nearest neighbours.
KNN for Regression: KNN can be used for regression in a supervised setting
where we are given a dataset with continuous target values. For regression, KNN
finds the k nearest data points in the training set, and the predicted value is the
mean of the target values of these k nearest neighbours.
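A minimal sketch of both uses with scikit-learn's KNN estimators; the synthetic data and the choice k = 5 are hypothetical:

```python
# Minimal sketch: KNN for classification and regression with scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Classification: label is the mode of the 5 nearest neighbours' labels.
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_cls)
print(clf.predict(X[:2]))

# Regression: prediction is the mean of the 5 nearest neighbours' targets.
y_reg = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)
print(reg.predict(X[:2]))
```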
Advantages of KNN
1) K-NN is intuitive and simple: The K-NN algorithm is very simple to
understand and equally easy to implement. To classify a new data point,
K-NN reads through the whole dataset to find its K nearest neighbours.
2) K-NN has no assumptions: K-NN is a non-parametric algorithm, which
means there are no assumptions about the data that must be met to
implement K-NN. Parametric models like linear regression have many
assumptions that the data must satisfy before they can be applied, which
is not the case with K-NN.
3) No training step: K-NN does not explicitly build any model; it simply
tags the new data entry based on learning from historical data. A new data
entry is tagged with the majority class among its nearest neighbours.
4) It constantly evolves: Since it is an instance-based learner, K-NN is a
memory-based approach. The classifier immediately adapts as we collect
new training data. This allows the algorithm to respond quickly to changes
in the input during real-time use.
5) Very easy to implement for multi-class problems: Most classifier
algorithms are easy to implement for binary problems and need extra effort
for multi-class problems, whereas K-NN adjusts to multiple classes without
any extra effort.
6) Can be used both for classification and regression: One of the biggest
advantages of K-NN is that it can be used for both classification and
regression problems.
7) One hyperparameter: K-NN might take some time while selecting its single
hyperparameter, the number of neighbours K, but after that there is little
else to tune.
8) Variety of distance criteria to choose from: The K-NN algorithm gives the
user the flexibility to choose a distance metric while building the model
(see the sketch after this list):
a. Euclidean Distance
b. Hamming Distance
c. Manhattan Distance
d. Minkowski Distance
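As a quick illustration of these metrics, here is a small plain-Python sketch; the Minkowski distance of order p reduces to the Manhattan distance at p = 1 and the Euclidean distance at p = 2 (the sample points are made up):

```python
# Sketch: Minkowski distance of order p; p=1 gives Manhattan,
# p=2 gives Euclidean. Points a and b are hypothetical.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """Count of positions where the values differ (for categorical data)."""
    return sum(x != y for x, y in zip(a, b))

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, p=1))           # Manhattan: 5.0
print(minkowski(a, b, p=2))           # Euclidean: ~3.606
print(hamming("karolin", "kathrin"))  # 3 differing positions
```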
Even though K-NN has several advantages, there are certain very important
disadvantages or constraints of K-NN.
Disadvantages of KNN
1) K-NN is a slow algorithm: K-NN may be very easy to implement, but as the
dataset grows, the efficiency or speed of the algorithm declines very fast.
2) Curse of dimensionality: KNN works well with a small number of input
variables, but as the number of variables grows, the K-NN algorithm
struggles to predict the output for a new data point.
3) K-NN needs homogeneous features: If you decide to build K-NN using a
common distance, like the Euclidean or Manhattan distance, it is essential
that the features have the same scale, since absolute differences in
features carry the same weight, i.e., a given distance in feature 1 must
mean the same as in feature 2 (see the scaling sketch after this list).
4) Optimal number of neighbours: One of the biggest issues with K-NN is
choosing the optimal number of neighbours to consider when classifying
a new data entry.
5) Imbalanced data causes problems: K-NN does not perform well on
imbalanced data. If we consider two classes, A and B, and the majority of
the training data is labelled as A, then the model will ultimately give a lot
of preference to A. This might result in the less common class B being
wrongly classified.
6) Outlier sensitivity: The K-NN algorithm is very sensitive to outliers, as it
simply chooses the neighbours based on distance criteria.
7) Missing value treatment: K-NN inherently has no capability of dealing
with missing values.
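As noted in point 3 above, features should share a common scale; here is a minimal sketch of standardizing features before K-NN with scikit-learn (the synthetic data is illustrative):

```python
# Sketch: standardizing features before K-NN so that no single feature
# dominates the distance computation. Dataset is a hypothetical example.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1000  # simulate one feature on a much larger scale

# Without scaling, feature 0 would dominate the Euclidean distance.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.score(X, y))
```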
Applications of KNN
• Used in classification and interpretation (legal, news, banking)
• Used to impute missing values
• Used in pattern recognition
• Used in gene expression analysis
• Used in protein-protein interaction prediction
• Used to predict 3D structures (e.g., of proteins)
• Used to measure document similarity
• Problem solving (planning, pronunciation)
• Functional learning (dynamic control)
• Teaching and aiding (help desk, user training)
4. Classification and Regression Trees (CART) Algorithm
Classification and Regression Trees (CART) is simply a modern term for what are
otherwise known as Decision Trees. Decision Trees have been around for a very
long time and are important for predictive modelling in Machine Learning. As
the name suggests, these trees are used for classification and prediction problems.
The models are obtained by partitioning the data space and fitting a simple
prediction model within each partition. This is done recursively. We can represent
the partitioning graphically as a tree; hence the name.
They have withstood the test of time because of the following reasons:
1. Very competitive with other methods
2. High efficiency
Classification trees are used to separate a dataset into different classes
(generally used when we expect categorical classes). The other type is the
Regression Tree, which is used when the class variable is continuous (or
numerical).
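A minimal sketch of a classification tree with scikit-learn, whose DecisionTreeClassifier implements an optimized CART-style algorithm; the depth limit of 3 is an arbitrary illustrative choice:

```python
# Minimal sketch: a CART-style classification tree with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree stays interpretable (cf. the depth limitation below).
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(export_text(tree))  # the learned rules, easy to present to management
```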
Advantages of CART
1) CART does not require any assumptions about the underlying distributions.
2) It is easy to use and can quickly provide valuable insights.
3) CART can be used efficiently to assess massive datasets.
4) It can be further used to drill down to a particular cause and find
effective, quick solutions.
5) The solution is easily interpretable, intuitive and can be verified against
existing data.
6) It is a good way to present solutions to management.
Disadvantages of CART
1) The biggest limitation is the fact that it is a nonparametric technique; it is
not recommended to make any generalization on the underlying
phenomenon based upon the results observed. Although the rules obtained
through the analysis can be tested on new data, it must be remembered that
the model is built based upon the sample without making any inference
about the underlying probability distribution.
2) Another limitation of CART is that the tree becomes quite complex after
seven or eight layers.
3) Interpreting the results of such a complex tree is not intuitive.
Applications of CART:
CART is used in many places in machine learning, such as blood donor
classification, spatial, environmental and ecological data analysis, and hepatitis
disease diagnosis.
5. K-Means Clustering
The K-Means algorithm is an iterative algorithm that tries to partition the dataset
into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data
point belongs to only one group. It tries to make the intra-cluster data points as
similar as possible while also keeping the clusters as different (far apart) as
possible. It assigns data points to a cluster such that the sum of the squared
distances between the data points and the cluster's centroid (the arithmetic mean
of all the data points that belong to that cluster) is at a minimum. The less
variation we have within clusters, the more homogeneous (similar) the data
points are within the same cluster.
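A minimal sketch of K-Means with scikit-learn on synthetic blob data; k = 3 and all other parameters are hypothetical choices:

```python
# Minimal sketch: K-Means with scikit-learn on synthetic blob data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 re-runs with different initial centroids and keeps the best.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # centroids (mean of each cluster's points)
print(km.inertia_)          # within-cluster sum of squared distances
```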
Advantages of K-means
1) Relatively simple to implement.
2) Scales to large data sets.
3) Guarantees convergence.
4) Can warm-start the positions of centroids.
5) Easily adapts to new examples.
6) Generalizes to clusters of different shapes and sizes, such as elliptical
clusters.
Disadvantages of K-means
1) Dependence on initial values.
For a low k, you can mitigate this dependence by running k-means several
times with different initial values and picking the best result.
As k increases, you need advanced versions of k-means to pick better
values for the initial centroids (called k-means seeding).
2) Clustering data of varying sizes and densities.
K-means has trouble clustering data where clusters are of varying sizes and
densities. To cluster such data, you need to generalize k-means.
3) Clustering outliers
Centroids can be dragged by outliers, or outliers might get their own cluster
instead of being ignored. Consider removing or clipping outliers before
clustering.
4) Scaling with the number of dimensions.
As the number of dimensions increases, a distance-based similarity
measure converges to a constant value between any given examples.
Reduce dimensionality either by using PCA on the feature data, or by using
"spectral clustering" to modify the clustering algorithm (see the sketch
below).
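As mentioned in point 4, one mitigation is reducing dimensionality before clustering; here is a minimal sketch of a PCA-then-K-Means pipeline with scikit-learn (the data shape and component counts are illustrative):

```python
# Sketch: reducing dimensionality with PCA before K-Means, one way to
# counter the distance-concentration problem in high dimensions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=1)

# Project 50 features down to 10 principal components, then cluster.
model = make_pipeline(PCA(n_components=10), KMeans(n_clusters=4, n_init=10))
labels = model.fit_predict(X)
print(labels[:10])
```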
Applications of K-Means Clustering
K-Means clustering is used in a variety of examples or business cases in real life,
like:
• Academic performance
• Diagnostic systems
• Search engines
• Wireless sensor networks
• Academic performance: Based on their scores, students are categorized
into grades like A, B, or C.
• Diagnostic systems: The medical profession uses k-means in creating
smarter medical decision support systems, especially in the treatment of
liver ailments.
• Search engines: Clustering forms a backbone of search engines. When a
search is performed, the search results need to be grouped, and search
engines very often use clustering to do this.
• Wireless sensor networks: The clustering algorithm plays the role of
finding the cluster heads, each of which collects all the data in its
respective cluster.
END