DM_Notes.pptx

DATA_MINING_NOTES
1. Explain steps in KDD process. [5]
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets. The KDD process in
data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include
tasks such as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a
graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful
information and insights. This may include tasks such as clustering, classification, association
rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include
tasks such as visualizing the results, evaluating the quality of the discovered patterns, and
identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and
meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.
2. What is text mining? [2]
o Definition: Text mining is the process of extracting meaningful information from text data.
o Process: It involves using natural language processing (NLP) techniques and machine
learning algorithms to analyze large volumes of unstructured text data and identify
patterns, trends, and insights that would be difficult to uncover manually.
o Application: It can be applied to various tasks such as sentiment analysis, topic modeling,
and text classification.
o Goal: The goal of text mining is to extract valuable information from text data and use it to
make data-driven decisions or predictions.
3. What do you mean by Clustering? [2]
o Clustering is an unsupervised machine learning technique that groups data points into
clusters so that objects in the same cluster are more similar to each other than to objects in
other clusters.
o Clustering is a method of partitioning a set of data or objects into a set of significant
subclasses called clusters.
o It helps users to understand the structure or natural grouping in a data set and is used either
as a stand-alone instrument to get a better insight into data distribution or as a
pre-processing step for other algorithms.
4. Linear Regression.
o It is the simplest form of regression. Linear regression attempts to model the relationship
between two variables by fitting a linear equation to the observed data.
o Linear regression attempts to find the mathematical relationship between variables.
o If the fitted relationship is a straight line, it is considered a linear model; if it is a curved
line, it is a non-linear model.

o In simple linear regression, the relationship between the dependent variable Y and a single
independent variable X is given by a straight line:
Y = α + β X
The model 'Y' is a linear function of 'X', where α is the intercept and β is the slope.
The value of 'Y' increases or decreases linearly as the value of 'X' changes.
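As a rough illustration of how α and β in Y = α + β X can be estimated, here is a minimal least-squares sketch in plain Python. The function name and the sample data are hypothetical, chosen only so the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (alpha, beta) that minimise the squared error of y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha, beta = fit_simple_linear_regression(xs, ys)
print(alpha, beta)
```

With real data the points will not lie exactly on a line, and the same formulas give the line with the smallest total squared error.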
4. Difference between Data Mining and Text Mining. [3/5]
Data Mining | Text Mining
Data mining is a process to extract useful information from huge datasets. | Text mining is a part of data mining that includes the processing of text from huge documents.
In data mining, we get the stored data in a structured format. | In text mining, we get the stored data in an unstructured format.
It allows the mining of mixed data. | It allows mining of text only.
Data processing is done directly. | Data processing is done linguistically.
It is a homogeneous process. | It is a heterogeneous process.
Pre-defined databases and sheets are used to collect the information. | The text is used to gather high-quality data.
The statistical method is used for data evaluation. | Computational linguistic principles are used to evaluate the text.
5. Difference between DM and OLAP. [3/5]
Data Mining | OLAP
Data mining refers to the field of computer science that deals with the extraction of data, trends and patterns from huge sets of data. | OLAP is a technology for immediate access to data with the help of multidimensional structures.
It deals with detailed transaction-level data. | It deals with the data summary.
It is discovery-driven. | It is query-driven.
It is used for future data prediction. | It is used for analyzing past data.
It has huge numbers of dimensions. | It has a limited number of dimensions.
Bottom-up approach. | Top-down approach.
It is an emerging field. | It is widely used.
6. Difference between Descriptive and predictive data mining. [3/5]
Descriptive data mining | Predictive data mining
Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc. | The term 'Predictive' means to predict something, so predictive data mining is the analysis done to predict future events or other data or trends.
It is based on the reactive approach. | It is based on the proactive approach.
It specifies the characteristics of the data in a target data set. | It performs induction over current and past data so that predictions can be made.
It needs data aggregation and data mining. | It needs statistics and data forecasting procedures.
It provides precise data. | It produces outcomes without ensuring accuracy.
7. Difference between Classification and Clustering [3/5]
Classification | Clustering
Classification is a supervised learning approach where a specific label is provided to the machine to classify new observations. Here the machine needs proper testing and training for the label verification. | Clustering is an unsupervised learning approach where grouping is done on the basis of similarity.
Supervised learning approach. | Unsupervised learning approach.
It uses a training dataset. | It does not use a training dataset.
It uses algorithms to categorize the new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with similar features.
In classification, there are labels for training data. | In clustering, there are no labels for training data.
Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.
It is more complex as compared to clustering. | It is less complex as compared to classification.

8. Difference between Supervised and un-supervised Learning [3/5]
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
Supervised learning models generally produce more accurate results. | Unsupervised learning models may give less accurate results as compared to supervised learning.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes various algorithms such as K-Means clustering, hierarchical clustering, and the Apriori algorithm.
9. Difference between OLAP and OLTP [3/5]
Category | OLAP (Online analytical processing) | OLTP (Online transaction processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists only of current operational data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Purpose | It serves the purpose to extract information for analysis and decision-making. | It serves the purpose to insert, update, and delete information from the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small as the historical data is archived, for example MB or GB.
Queries | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast as the queries operate on about 5% of the data.

10. Difference between Data Mining and Data Warehousing. [3/5]
Data Mining | Data Warehousing
Data mining is the process of determining data patterns. | A data warehouse is a database system designed for analytics.
Data mining is generally considered as the process of extracting useful data from a large set of data. | Data warehousing is the process of combining all the relevant data.
Business entrepreneurs carry out data mining with the help of engineers. | Data warehousing is entirely carried out by the engineers.
In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically.
Data mining uses pattern recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data in a way that allows easier reporting.
One of the most useful data mining techniques is the detection and identification of unwanted errors that occur in the system. | One of the advantages of the data warehouse is its ability to update frequently. That is why it is ideal for business entrepreneurs who want to stay up to date with the latest information.
The data mining techniques are cost-efficient as compared to other statistical data applications. | The responsibility of the data warehouse is to simplify every type of business data.
The data mining techniques are not 100 percent accurate, which may lead to serious consequences in certain conditions. | In the data warehouse, there is a high possibility that the data required for analysis by the company may not be integrated into the warehouse, which can lead to loss of data.
Companies can benefit from this analytical tool by equipping suitable and accessible knowledge-based data. | A data warehouse stores a huge amount of historical data that helps users to analyze different periods and trends to make future predictions.
11. K-Means vs KNN [3/5]
Category | K-Means | KNN
Algorithm | Unsupervised learning algorithm | Supervised learning algorithm
Process | Clusters data points into k clusters based on their similarity | Classifies data points based on the majority class of their k nearest neighbors
Number | Requires the number of clusters (k) to be specified in advance | Requires the number of nearest neighbors (k) to be specified in advance
Method | Clustering is done using the mean of the data points in each cluster | Classification is done using majority vote of the k nearest neighbors
Suitability | Suitable for continuous variables | Suitable for both continuous and categorical variables
Scalability | K-Means is generally faster and more scalable than KNN, especially for large datasets. | KNN is generally slower and less scalable than K-Means for large datasets.
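To make the contrast concrete, the sketch below puts a tiny KNN classifier next to a tiny K-Means loop in plain Python. It is only an illustration under simple assumptions: Euclidean distance, caller-supplied starting centroids, and hypothetical function names and data.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """train is a list of (point, label); return the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def k_means(points, centroids, iterations=10):
    """Alternate assignment and mean-update steps; return (centroids, assignments)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]

# KNN needs labels (supervised).
labelled = list(zip(points, ["A", "A", "B", "B"]))
predicted = knn_classify(labelled, (2.0, 2.0), k=3)

# K-Means needs only the points and the number of clusters (unsupervised).
centroids, assignments = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(predicted, centroids, assignments)
```

Note that KNN does its work at prediction time by scanning the stored training points, while K-Means does its work up front and afterwards only needs the centroids.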

12. What do you mean by an outlier? [2]
An outlier in data mining is an observation that is significantly different from the other observations
in a dataset.
o Outliers can have a major impact on the results of data mining and statistical analysis, and
are often considered to be undesirable because they can skew the results and lead to
inaccurate conclusions.
o Outliers can be identified by a number of methods, including statistical tests, visualization
techniques, and machine learning algorithms.
o Once identified, outliers can be handled in a number of ways, such as removing them from
the dataset, treating them as special cases, or including them in the analysis but with
appropriate caution.
It is important to note that the definition of an outlier is context dependent. In some cases an
outlier can be valuable information; for example, in fraud detection, identifying an outlier can be
the key to finding a fraudulent transaction.
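One common statistical way to flag outliers is the interquartile range (IQR) rule. The sketch below is a minimal version in plain Python; the 1.5 multiplier is the usual convention, and the function name and sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

Whether a flagged value is removed or investigated depends on the context, as noted above.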
13. What is Knowledge Discovery in Databases? [2]
Knowledge Discovery in Databases (KDD) is the iterative process of extracting useful and valuable
information from large and complex sets of data.
o The goal of KDD is to identify patterns, trends, and insights hidden within the data that can
be used to make better decisions and improve business processes.
o The KDD process typically involves several steps, including data cleaning and preprocessing,
data mining, pattern evaluation, and knowledge representation.
o This process can be used in a variety of applications, including business intelligence, fraud
detection, and customer relationship management.
14. Hierarchical Clustering in Data Mining: [4]
• Definition: A hierarchical clustering method works by grouping data into a tree of clusters.
Agglomerative (bottom-up) hierarchical clustering begins by treating every data point as a
separate cluster. Then, it repeatedly executes the following steps:
o Identify the two clusters that are closest together, and
o Merge these two most similar clusters. These steps continue until all
the clusters are merged together.
• Steps:
1. Compute the pairwise similarity or distance between all data points.
2. Start with each data point as a separate cluster.
3. Merge the two closest clusters into a new, larger cluster.
4. Repeat step 3 until all data points belong to a single cluster or a stopping criterion is
met.

• Representation: The hierarchy of clusters can be represented using a tree-based structure
called a dendrogram.
• Advantages:
o It can handle non-linearly separable data.
o It can handle different shapes and sizes of clusters.
o It allows for incremental and dynamic updates of the clustering results.
o It can be used to visualize the relationships between clusters.
• Disadvantages:
o It is sensitive to the choice of the similarity or distance metric.
o It is sensitive to the choice of linkage method used to merge clusters.
o It can be computationally expensive for large datasets.
o It can be hard to interpret the results for higher dimensions.
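The merge steps can be sketched as a small bottom-up loop in plain Python. This is only an illustration: it assumes Euclidean distance and single linkage (cluster distance is the distance between the closest pair of points), and the function names and data are hypothetical. The recorded merges are the information a dendrogram would draw.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Distance between two clusters: distance between their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, num_clusters=1):
    """Merge the two closest clusters until num_clusters remain.

    Returns (clusters, merges); merges is a list of (cluster_a, cluster_b, distance).
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical 1-D points: two tight pairs and one distant point.
points = [(1.0,), (2.0,), (9.0,), (10.0,), (25.0,)]
clusters, merges = agglomerative(points, num_clusters=2)
print(clusters)
print([distance for _, _, distance in merges])
```

Stopping at two clusters here plays the role of the stopping criterion; with num_clusters=1 the loop runs until everything is in one cluster. The brute-force search over all pairs is what makes the method expensive on large datasets.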
15. Associative Classification in Data Mining. [2]
• Definition: A data mining technique that uses association rule mining to discover rules
linking feature values to class labels, and then uses those rules to classify new records.
• Advantages:
o It can handle noisy and incomplete data.
o It can discover important features and relationships between features and class labels.
• Disadvantages:
o It is only applicable for binary or nominal class labels.
o It can be computationally expensive for large datasets.
16. Explain the following terms in the context of association rule mining:
(i) Support of an itemset.
(ii) Frequent closed itemset.
(iii) Lift of a rule. [3X2]
i. Support of an Itemset:
• Definition: The proportion of transactions in a transaction database that contain a particular
itemset.
• Calculation: The support of an itemset X can be calculated as the number of transactions
containing X divided by the total number of transactions in the database.
• Significance: Support is a measure of the popularity of an itemset and is used as a threshold to
determine which itemsets are considered frequent.
• Advantages:
o It provides a simple and intuitive measure of the popularity of an itemset.
o It can be easily calculated from transaction data.
ii. Frequent Closed Itemset:
• Definition: A frequent itemset is closed if there is no proper superset of the itemset that has
the same support.
• Significance: The frequent closed itemsets are a more compact result than the full set of
frequent itemsets, yet they capture the complete support information of every frequent itemset.
• Advantages:
o It can avoid generating redundant and less meaningful results.
o It can capture the complete information of the itemset and its subsets.
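The closed-itemset definition can be checked by brute force on a tiny example. The sketch below is plain Python and only suitable for very small item counts, since it enumerates every itemset; the function names, the transactions and the minimum count of 2 are all hypothetical.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_closed_itemsets(transactions, min_count):
    """Return {itemset: count} for frequent itemsets with no equal-support proper superset."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            count = support_count(itemset, transactions)
            if count >= min_count:
                frequent[itemset] = count
    # A superset with the same support is itself frequent, so checking within
    # the frequent itemsets is enough.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < other and c == frequent[other] for other in frequent)
    }


transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"eggs"},
]
closed = frequent_closed_itemsets(transactions, min_count=2)
print(closed)
```

Here {bread} and {milk} are frequent but not closed, because {bread, milk} appears in exactly the same three transactions.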

iii. Lift of a Rule:
• Definition: A measure of the degree of association between the antecedent and the consequent
of a rule, compared to their individual frequencies in the transaction database.
• Calculation: The lift of a rule X -> Y is the support of X ∪ Y divided by the product of the
support of X and the support of Y: lift(X -> Y) = support(X ∪ Y) / (support(X) × support(Y)).
• Significance: Lift is a measure of the strength of the association between the two sides of a
rule, and is used to rank and select the most interesting rules.
• Advantages:
o It provides a measure of the strength of the association between two items in a rule.
o It can adjust for the overall popularity of the items in the transaction database.
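Support and lift can be computed directly from a list of transactions. The sketch below is plain Python with hypothetical function names and a hypothetical five-transaction database; confidence is included because lift can also be written as confidence divided by the support of the consequent.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule x -> y."""
    return support(x | y, transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Lift of the rule x -> y: joint support over the product of the two supports."""
    return support(x | y, transactions) / (support(x, transactions) * support(y, transactions))


transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]
bread, butter = {"bread"}, {"butter"}
print(support(bread | butter, transactions))   # 2 of 5 transactions
print(confidence(bread, butter, transactions))  # 2 of the 3 bread transactions
print(lift(bread, butter, transactions))        # above 1: positive association
```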
17. Why data preprocessing is required? [2]
Real-world data generally contains noise and missing values, and may be in an unusable format that
cannot be directly used for machine learning models. Data preprocessing is the set of tasks required
for cleaning the data and making it suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.
It involves the following steps:
o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling
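The later steps in this list can be sketched on a toy table using plain Python, with no external libraries. Everything here is an assumption made for illustration: the column names, the records, mean imputation for missing values, one-hot encoding, a simple 75%/25% split without shuffling, and min-max scaling fitted on the training rows only.

```python
# Hypothetical records: (age, city, purchased). None marks a missing value.
rows = [
    (25.0, "Pune", 0),
    (None, "Mumbai", 1),
    (45.0, "Delhi", 0),
    (35.0, "Pune", 1),
]

# Finding missing data: fill a missing age with the mean of the known ages.
known_ages = [age for age, _, _ in rows if age is not None]
mean_age = sum(known_ages) / len(known_ages)
ages = [mean_age if age is None else age for age, _, _ in rows]

# Encoding categorical data: one-hot encode the city column.
cities = sorted({city for _, city, _ in rows})
one_hot = [[1 if city == c else 0 for c in cities] for _, city, _ in rows]
labels = [label for _, _, label in rows]

# Splitting into training and test sets (real pipelines usually shuffle first).
cut = int(0.75 * len(rows))
train_ages, test_ages = ages[:cut], ages[cut:]
train_y, test_y = labels[:cut], labels[cut:]

# Feature scaling: min-max scaling, using the training rows to set the range.
lo, hi = min(train_ages), max(train_ages)


def scale(age):
    return (age - lo) / (hi - lo)


train_X = [[scale(a)] + enc for a, enc in zip(train_ages, one_hot[:cut])]
test_X = [[scale(a)] + enc for a, enc in zip(test_ages, one_hot[cut:])]
print(train_X, test_X)
```

Fitting the scaling range on the training rows only keeps information from the test set out of the model inputs.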
18. Explain Market Basket Analysis with suitable example.
Suppose 5000 transactions have been made through a popular e-commerce website, and we want to
calculate the support, confidence, and lift for two products, say pen and notebook. Out of the
5000 transactions, 500 contain a pen, 700 contain a notebook, and 100 contain both.
Using the information provided, we can calculate the support, confidence, and lift for the two
products: pen and notebook.
• Support:
Support for pen = (number of transactions containing pen) / (total number of transactions) = 500 /
5000 = 0.1 or 10%
Support for notebook = (number of transactions containing notebook) / (total number of
transactions) = 700 / 5000 = 0.14 or 14%
Support for pen and notebook = (number of transactions containing both pen and notebook) /
(total number of transactions) = 100 / 5000 = 0.02 or 2%
• Confidence:
Confidence of the rule "If a customer buys a pen, they will also buy a notebook" = (number of
transactions containing both pen and notebook) / (number of transactions containing pen) = 100 /
500 = 0.2 or 20%
Confidence of the rule "If a customer buys a notebook, they will also buy a pen" = (number of
transactions containing both pen and notebook) / (number of transactions containing notebook) =
100 / 700 = 0.143 or 14.3%
• Lift:

Lift of the rule "If a customer buys a pen, they will also buy a notebook" = (confidence of the rule)
/ (support of notebook) = 0.2 / 0.14 = 1.43
Lift of the rule "If a customer buys a notebook, they will also buy a pen" = (confidence of the rule)
/ (support of pen) = 0.143 / 0.1 = 1.43
Note that the lift is the same for both rules. This is because lift is symmetric: it does not depend
on the order of the antecedent and the consequent.
A lift value of 1 indicates that there is no association between the antecedent and consequent,
values greater than 1 indicate a positive association, and values less than 1 indicate a negative
association. Here the lift is about 1.43, which indicates a positive association between buying a pen
and buying a notebook: the higher the lift value, the stronger the association.
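The same calculation can be written as a few lines of Python. This sketch assumes 5000 transactions in total, 500 with a pen, 700 with a notebook, and 100 with both; the joint count is an assumption chosen so that it does not exceed either individual count, which any valid example requires.

```python
total = 5000
pen = 500
notebook = 700
both = 100  # assumed joint count; it can never exceed the count of either item alone

support_pen = pen / total
support_notebook = notebook / total
support_both = both / total

# Confidence: how often the consequent appears among transactions with the antecedent.
confidence_pen_to_notebook = both / pen
confidence_notebook_to_pen = both / notebook

# Lift is symmetric, so one value covers both rule directions.
lift = support_both / (support_pen * support_notebook)

print(round(support_both, 2))                # 0.02
print(round(confidence_pen_to_notebook, 3))  # 0.2
print(round(confidence_notebook_to_pen, 3))  # 0.143
print(round(lift, 2))                        # 1.43
```

Because confidence is a fraction of the antecedent's transactions, it always lies between 0 and 1.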
19. Support Vector Machine
A support vector machine (SVM) is a type of machine learning algorithm that performs supervised
learning for classification or regression of data groups.
In AI and machine learning, a supervised learning system is provided with both input and desired
output data, which are labeled for classification.
