Data Mining
Unit – III
Classification
Dr. YOGESH K M
Assistant Professor
M S Ramaiah University of Applied Science
Dept. of Computer Science and Engineering
Bangalore, Karnataka INDIA - 560 058
yogeshkm.mrj@gmail.com
Classification
• Classification is a data mining technique that assigns data
objects to predefined classes, categories, or groups based on
their features or attributes.
• It is a supervised learning technique: a model is built from
labelled data and then used to predict the class of new, unseen
data. It is an important task in data mining because it enables
organizations to make informed decisions based on their data.
• For example, a retailer may use classification to assign
customers to predefined segments based on their purchase history
and demographic data. Each segment can then be targeted with its
own marketing campaign to improve customer satisfaction.
What is Classification in Data Mining?
• Classification in data mining is a technique used to assign a
label to each instance, record, or data object in a dataset based
on its features or attributes.
• The objective of classification is to accurately predict the
class labels of new, unseen data. It is an important task in data
mining because it enables organizations to make data-driven
decisions.
• For example, businesses can classify the sentiment of customer
feedback, reviews, or social media posts to understand how well
their products or services are received.
• Classification techniques can be divided into two categories:
binary classification and multi-class classification.
• Binary classification assigns each instance to one of two
classes, such as Yes or No. Multi-class classification assigns
each instance to one of more than two classes, such as Happy,
Neutral, or Sad.
Steps to Build a Classification Model
There are several steps involved in building a classification
model, as listed below:
Data preparation - The first step in building a classification model
is to prepare the data. This involves collecting, cleaning, and
transforming the data into a suitable format for further analysis.
Feature selection - The next step is to select the most important
and relevant features that will be used to build the classification
model. This can be done using various techniques, such as
correlation, feature importance analysis, or domain knowledge.
Prepare train and test data - Once the data is prepared and
relevant features are selected, the dataset is divided into two
parts: a training set and a test set. The training set is used to
build the model, while the test set is used to evaluate the model's
performance.
Model selection - Many algorithms can be used to build a
classification model, such as decision trees, logistic regression, k-
nearest neighbors, and neural networks. The choice of algorithm
depends on the type of data, the number of features, and the
desired accuracy.
Model training - Once the algorithm is selected, the model is
trained on the training dataset. This involves adjusting the model
parameters to minimize the error between the predicted and actual
class labels.
Model evaluation - The model's performance is evaluated on the
test set. Accuracy, precision, recall, and F1 score are commonly
used metrics for evaluating a classification model.
Model tuning - If the model's performance is not satisfactory, the
model can be tuned by adjusting the parameters or selecting a
different algorithm. This process is repeated until the desired
performance is achieved.
Model deployment - Once the model is built and evaluated, it can
be deployed in production to classify new data. The model should
be monitored regularly to ensure its accuracy and effectiveness
over time.
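
The split, training, and evaluation steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the toy data set, the function names, and the choice of a nearest-centroid model are all hypothetical and are not taken from the slides.

```python
import random

# Hypothetical toy data: (hours_studied, classes_attended) -> "pass"/"fail".
# The values are made up purely for illustration.
DATA = [
    ((1.0, 2.0), "fail"), ((1.5, 1.0), "fail"), ((2.0, 3.0), "fail"),
    ((2.5, 2.0), "fail"), ((3.0, 1.5), "fail"), ((6.0, 8.0), "pass"),
    ((7.0, 9.0), "pass"), ((7.5, 7.0), "pass"), ((8.0, 8.5), "pass"),
    ((9.0, 9.5), "pass"),
]

def train_test_split(rows, test_ratio=0.3, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]

def train_centroids(rows):
    """Model training: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(model, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def accuracy(model, rows):
    """Model evaluation: fraction of rows whose label is predicted correctly."""
    correct = sum(predict(model, f) == label for f, label in rows)
    return correct / len(rows)

train_rows, test_rows = train_test_split(DATA)
model = train_centroids(train_rows)
print("test accuracy:", accuracy(model, test_rows))
```

The model never sees the test rows during training, which is what makes the reported accuracy an estimate of performance on new data.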
Categorization of Classification in
Data Mining
There are different types of classification algorithms based on
their approach, complexity, and performance. Here are some
common categories of classification algorithms in data mining:
• Decision tree-based classification - This type of algorithm
builds a tree-like model of decisions and their possible
consequences. Decision trees are easy to understand and
interpret, making them a popular choice for classification
problems.
• Rule-based classification - This type of algorithm uses a set of
rules to determine the class label of an observation. The rules
are typically expressed as IF-THEN statements, where each
statement pairs a condition with the class label to assign.
• Instance-based classification - This type of algorithm
classifies new, unseen instances by comparing them with stored
training instances. The classification is based on the similarity
between the features of the training instances and those of the
new instance.
• Bayesian classification - This type of algorithm uses Bayes'
theorem to compute the probability of each class label given the
observed features. Bayesian classification is particularly useful
when dealing with incomplete or uncertain data.
• Neural network-based classification - This type of algorithm
uses a network of interconnected nodes, or neurons, to learn a
mapping between the input features and the output class labels.
Neural networks can handle complex and nonlinear relationships
between the features and the class labels.
• Ensemble-based classification - This approach combines the
predictions of multiple classifiers to improve the overall
accuracy and robustness of the classification model. Ensemble
methods include bagging, boosting, and stacking.
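
Two of the categories above, rule-based and ensemble-based classification, can be illustrated together. The sketch below assumes three hypothetical IF-THEN rules for e-mail messages and combines them by simple majority voting. Majority voting is the plainest way to combine classifiers and is used here instead of bagging, boosting, or stacking. The feature names and thresholds are invented for the example.

```python
from collections import Counter

# Three hypothetical IF-THEN rule classifiers for e-mail messages.
# Each takes a dict of features and returns "spam" or "ham".
def rule_links(msg):
    # IF the message contains more than 3 links THEN spam
    return "spam" if msg["num_links"] > 3 else "ham"

def rule_keyword(msg):
    # IF the text contains the word "free" THEN spam
    return "spam" if "free" in msg["text"].lower() else "ham"

def rule_sender(msg):
    # IF the sender is already known THEN ham
    return "ham" if msg["known_sender"] else "spam"

def majority_vote(classifiers, msg):
    """Ensemble prediction: the label returned by most classifiers."""
    votes = Counter(clf(msg) for clf in classifiers)
    return votes.most_common(1)[0][0]

ensemble = [rule_links, rule_keyword, rule_sender]
message = {"num_links": 5, "text": "Claim your FREE prize", "known_sender": True}
print(majority_vote(ensemble, message))
```

Using an odd number of classifiers with two possible labels means a vote can never end in a tie.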
Classification Vs. Regression in Data
Mining
Factor: Task/Objective
  Classification: Identifying or assigning the class label of a new
  observation based on its features.
  Regression: Estimating a numeric value for a new observation
  based on its features.

Factor: Outcome
  Classification: Categorical variable, i.e., a class label or
  category.
  Regression: Numeric variable, i.e., a continuous (or sometimes
  discrete) value.

Factor: Evaluation
  Classification: Accuracy, precision, recall, F1 score, AUC.
  Regression: Mean squared error, root mean squared error,
  correlation coefficient.

Factor: Algorithms
  Classification: Decision trees, rule-based systems, neural
  networks, support vector machines, k-nearest neighbors, logistic
  regression.
  Regression: Linear regression, polynomial regression, time series
  analysis, neural networks.

Factor: Examples
  Classification: Spam email classification, sentiment analysis,
  fraud detection, etc.
  Regression: Housing price prediction, stock price prediction,
  predicting a customer's purchase amount, etc.
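
The difference in evaluation metrics can be made concrete with a small sketch. It assumes hand-made label and value lists (not real results) and computes accuracy, precision, recall, and F1 for a classifier, and MSE and RMSE for a regressor. The function names are illustrative only.

```python
import math

def classification_metrics(actual, predicted, positive):
    """Accuracy, precision, recall and F1 for one chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(actual, predicted):
    """Mean squared error and root mean squared error."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return {"mse": mse, "rmse": math.sqrt(mse)}

# Made-up example values, for illustration only.
clf_scores = classification_metrics(
    actual=["spam", "spam", "ham", "ham", "spam"],
    predicted=["spam", "ham", "ham", "spam", "spam"],
    positive="spam",
)
reg_scores = regression_metrics(
    actual=[3.0, 5.0, 2.0, 4.0],
    predicted=[1.0, 5.0, 2.0, 4.0],
)
print(clf_scores)
print(reg_scores)
```

Classification metrics count matches between discrete labels, while regression metrics measure how far each numeric prediction is from the true value.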
Naïve Bayes Classifier Algorithm
• Naïve Bayes is a supervised learning algorithm based on Bayes'
theorem and used for solving classification problems.
• It is mainly used in text classification, where the training
dataset is high-dimensional.
• The Naïve Bayes classifier is one of the simplest and most
effective classification algorithms. It helps build fast data
mining and machine learning models that can make quick
predictions.
• It is a probabilistic classifier: it predicts the class of an
object from the probability that the object belongs to each class.
• Popular applications of the Naïve Bayes algorithm include spam
filtering, sentiment analysis, and classifying articles.
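
The classifier is called "naïve" because it assumes that the features are independent of each other given the class, so the score of a class c for features x1..xn is proportional to P(c) * P(x1|c) * ... * P(xn|c). A minimal sketch for categorical features follows. The weather-style data set and the function names are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that unseen feature values do not produce a zero probability.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """rows: list of (features_tuple, label). Returns the counts used later."""
    class_counts = Counter(label for _, label in rows)
    # value_counts[(label, feature_index)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    # distinct values seen per feature, needed for add-one smoothing
    vocab = defaultdict(set)
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def predict_naive_bayes(model, features):
    """Return the class with the highest log-probability score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)          # log prior P(c)
        for i, value in enumerate(features):
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(vocab[i])
            score += math.log(numerator / denominator)  # log P(x_i | c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, wind) -> play tennis?
rows = [
    (("sunny", "weak"), "yes"),
    (("sunny", "weak"), "yes"),
    (("overcast", "weak"), "yes"),
    (("rainy", "strong"), "no"),
    (("rainy", "weak"), "no"),
    (("sunny", "strong"), "no"),
]
nb_model = train_naive_bayes(rows)
print(predict_naive_bayes(nb_model, ("sunny", "weak")))
print(predict_naive_bayes(nb_model, ("rainy", "strong")))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when there are many features, as in text classification.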
Naïve Bayes Classifier
Decision Tree
Decision Tree | ID3 Algorithm
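
At each node, ID3 splits on the attribute with the highest information gain, that is, the largest reduction in the entropy of the class labels. The sketch below shows only that attribute-selection step, not the full recursive tree construction. The four-row data set, the attribute names, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after the split."""
    total = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """The attribute ID3 would split on at this node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical data: each row is a dict of attribute -> value.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["no", "no", "yes", "yes"]
print(best_attribute(rows, labels, ["outlook", "windy"]))
```

In this toy data "outlook" separates the two classes perfectly, while "windy" leaves each subset as mixed as before, so "outlook" has the higher gain.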
Clustering Algorithm
Clustering
Cluster analysis in data mining means finding groups of objects
such that the objects in a group are similar to each other but
different from the objects in other groups. In clustering, the
data set is divided into groups based on data similarity.
The process of grouping a set of abstract objects into classes of
similar objects is known as clustering.
Each group is treated as one cluster of data objects.
In cluster analysis, the first step is to partition the data set
into groups based on data similarity; labels are then assigned to
the groups.
The biggest advantage of clustering over classification is that it
can adapt to changes in the data and helps single out useful
features that distinguish the different groups.
Applications of cluster analysis:
It is widely used in applications such as image processing, data
analysis, and pattern recognition.
It helps marketers find distinct groups in their customer base and
characterize those groups by their purchasing patterns.
In biology, it can be used to derive animal and plant taxonomies
and to identify genes with similar functionality.
It also helps in information discovery by grouping documents on
the web.
Clustering Methods:
Clustering methods can be classified into the following categories:
Model-Based Method
Hierarchical Method
Constraint-Based Method
Grid-Based Method
Partitioning Method
Density-Based Method
Requirements of clustering in data mining:
A clustering algorithm used in data mining should meet the
following requirements.
• Scalability – Highly scalable clustering algorithms are needed to
work with large databases.
• Ability to deal with different kinds of attributes – Algorithms
should be able to work with different types of data, such as
categorical, numerical, and binary data.
• Discovery of clusters with arbitrary shape – The algorithm should
be able to detect clusters of arbitrary shape and should not be
limited to distance measures that tend to find only small
spherical clusters.
• Interpretability – The results should be comprehensible, usable,
and interpretable.
• High dimensionality – The algorithm should be able to handle
high-dimensional data, not only low-dimensional data.
K Means Clustering Algorithm
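
K-means is a partitioning method: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, until the assignments stop changing. The following is a rough sketch under stated assumptions: the six 2-D points, the random choice of initial centroids from the data, and the function names are all illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, assignments) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k random data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

# Made-up points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

The cluster indices themselves are arbitrary. What matters is which points share an index.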
DBSCAN Clustering Algorithm
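
DBSCAN is a density-based method: a point with at least min_pts neighbours within radius eps is a core point, clusters grow outward from core points, and points that belong to no cluster are marked as noise. The sketch below follows that standard procedure with a brute-force neighbour search. The sample points, parameter values, and function names are assumptions chosen for illustration.

```python
NOISE = -1

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points)
            if sq_dist(points[i], q) <= eps * eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also core: keep expanding
    return labels

# Made-up points: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (5.0, 5.0), (5.1, 5.2), (4.9, 4.8), (5.2, 4.9),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.6, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not specified in advance. It follows from eps and min_pts.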
