DATA_MINING_NOTES
1. Explain steps in KDD process. [5]
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets. The KDD process in
data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include
tasks such as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a
graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful
information and insights. This may include tasks such as clustering, classification, association
rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include
tasks such as visualizing the results, evaluating the quality of the discovered patterns, and
identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and
meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.
2. What is text mining? [2]
o Definition: Text mining is the process of extracting meaningful information from text data.
o Process: It involves using natural language processing (NLP) techniques and machine
learning algorithms to analyze large volumes of unstructured text data and identify
patterns, trends, and insights that would be difficult to uncover manually.
o Application: It can be applied in various fields, such as sentiment analysis, topic modeling,
and text classification.
o Goal: The goal of text mining is to extract valuable information from text data and use it to
make data-driven decisions or predictions.
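As a minimal sketch of the idea, the snippet below counts how often each word occurs in a few short
texts after removing common filler words. The sample reviews, the tiny stop-word list, and the
function name term_frequencies are all assumptions made for illustration; real text mining systems
use much richer NLP steps.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "is", "and", "a", "was", "but", "very"}

def term_frequencies(documents):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in documents:
        # Lower-case the text and keep only runs of letters as tokens.
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

# Hypothetical unstructured text data.
reviews = [
    "The battery is great and the screen is great.",
    "Battery life was poor, but the screen is very sharp.",
]
freq = term_frequencies(reviews)
print(freq.most_common(3))
```

Frequent terms such as "battery" and "screen" hint at the topics people write about, which is the
kind of pattern that is hard to spot manually in large volumes of text.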
3. What do you mean by Clustering? [2]
o Clustering is an unsupervised machine learning technique that groups data points into clusters
so that objects in the same cluster are more similar to one another than to objects in other
clusters.
o Clustering is a method of partitioning a set of data or objects into a set of meaningful
subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either
as a stand-alone tool to get better insight into the data distribution or as a pre-processing
step for other algorithms.
4. Linear Regression.
o It is the simplest form of regression. Linear regression attempts to model the relationship
between two variables by fitting a linear equation to the observed data.
o Linear regression attempts to find the mathematical relationship between variables.
o If the fitted relationship is a straight line, it is a linear model; if it is a curved line,
it is a non-linear model.
o In simple linear regression, the relationship between the dependent variable and a single
independent variable is given by a straight line:
Y = α + β X
The model 'Y' is a linear function of 'X', where α is the intercept and β is the slope.
The value of 'Y' increases or decreases linearly as the value of 'X' changes.
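A small sketch of how α and β can be estimated from data with the ordinary least-squares formulas
is given below. The function name fit_line and the sample points are assumptions chosen for
illustration; the points lie exactly on Y = 1 + 2X so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) minimising the squared error of y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)  # 1.0 2.0
```

With noisy real data the points do not fall exactly on the line, and the same formulas give the
line with the smallest total squared error.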
4. Difference between Data Mining and Text Mining. [3/5]
Data Mining | Text Mining
Data mining is a process to extract useful information from huge datasets. | Text mining is a part of data mining that includes the processing of text from huge documents.
In data mining, we get the stored data in a structured format. | In text mining, we get the stored data in an unstructured format.
It allows the mining of mixed data. | It allows mining of text only.
Data processing is done directly. | Data processing is done linguistically.
It is a homogeneous process. | It is a heterogeneous process.
Pre-defined databases and sheets are used to collect the information. | The text is used to gather high-quality data.
The statistical method is used for data evaluation. | Computational linguistic principles are used to evaluate the text.
5. Difference between DM and OLAP. [3/5]
Data Mining | OLAP
Data mining refers to the field of computer science that deals with the extraction of data, trends, and patterns from huge sets of data. | OLAP is a technology for immediate access to data with the help of multidimensional structures.
It deals with detailed transaction-level data. | It deals with summarized (aggregated) data.
It is discovery-driven. | It is query-driven.
It is used for future data prediction. | It is used for analyzing past data.
It has huge numbers of dimensions. | It has a limited number of dimensions.
Bottom-up approach. | Top-down approach.
It is an emerging field. | It is widely used.
6. Difference between Descriptive and predictive data mining. [3/5]
Descriptive data mining | Predictive data mining
Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc. | Predictive data mining is the analysis done to predict future events, other data, or trends.
It is based on the reactive approach. | It is based on the proactive approach.
It specifies the characteristics of the data in a target data set. | It performs induction over the current and past data so that predictions can be made.
It needs data aggregation and data mining. | It needs statistics and data forecasting procedures.
It describes what has already happened, so it provides precise data. | It produces estimates of outcomes without ensuring accuracy.
7. Difference between Classification and Clustering [3/5]
Classification | Clustering
Classification is a supervised learning approach in which labeled examples are provided so that the machine can classify new observations. The model needs proper training and testing to verify the labels it assigns. | Clustering is an unsupervised learning approach in which grouping is done on the basis of similarity.
Supervised learning approach. | Unsupervised learning approach.
It uses a training dataset. | It does not use a labeled training dataset.
It uses algorithms to categorize new data according to the observations in the training set. | It uses statistical concepts to divide the data set into subsets with similar features.
In classification, there are labels for training data. | In clustering, there are no labels for training data.
Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.
It is more complex as compared to clustering. | It is less complex as compared to classification.
8. Difference between Supervised and un-supervised Learning [3/5]
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be categorized into Clustering and Association problems.
A supervised learning model generally produces more accurate results, because it learns from known outputs. | An unsupervised learning model may give less accurate results as compared to supervised learning.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, KNN, etc. | It includes various algorithms such as Clustering (for example K-Means) and the Apriori algorithm.
9. Difference between OLAP and OLTP [3/5]
Category | OLAP (Online analytical processing) | OLTP (Online transaction processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists only of current operational data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Purpose | It serves to extract information for analysis and decision-making. | It serves to insert, update, and delete information in the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small because historical data is archived, for example MB or GB.
Queries | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast, as each query operates on only a small part of the data.
10. Difference between Data Mining and Data Warehousing. [3/5]
Data Mining | Data Warehousing
Data mining is the process of determining data patterns. | A data warehouse is a database system designed for analytics.
Data mining is generally considered the process of extracting useful information from a large set of data. | Data warehousing is the process of combining all the relevant data.
Business entrepreneurs carry out data mining with the help of engineers. | Data warehousing is carried out entirely by engineers.
In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically.
Data mining uses pattern recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data in a way that allows easier reporting.
One notable data mining technique is the detection and identification of unwanted errors that occur in the system. | One advantage of the data warehouse is that it can be updated frequently, which makes it well suited to business users who need up-to-date information.
Data mining techniques are cost-efficient as compared to other statistical data applications. | The responsibility of the data warehouse is to simplify every type of business data.
Data mining techniques are not 100 percent accurate, which may lead to serious consequences under certain conditions. | In a data warehouse, there is a high possibility that data the company needs for analysis has not been integrated into the warehouse, which can lead to loss of data.
Companies can benefit from this analytical tool by equipping themselves with suitable and accessible knowledge-based data. | A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
11. K-Means vs KNN [3/5]
Category | K-Means | KNN
Algorithm | Unsupervised learning algorithm | Supervised learning algorithm
Process | Clusters data points into k clusters based on their similarity | Classifies data points based on the majority class of their k nearest neighbors
Number | Requires the number of clusters (k) to be specified in advance | Requires the number of nearest neighbors (k) to be specified in advance
Method | Clustering is done using the mean of the data points in each cluster | Classification is done using majority vote of the k nearest neighbors
Suitability | Suitable for continuous variables | Suitable for both continuous and categorical variables
Scalability | K-Means is generally faster and more scalable than KNN, especially for large datasets. | KNN is generally slower and less scalable than K-Means for large datasets.
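The contrast between the two algorithms can be sketched in a few lines of Python on one-dimensional
data. This is a simplified illustration, not a production implementation: the function names, the
sample points, the labels "low" and "high", and the starting centroids are all assumptions.

```python
from collections import Counter

def knn_predict(train, query, k):
    """KNN (supervised): majority vote among the k nearest labelled points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def k_means(points, centroids, iterations=10):
    """K-Means (unsupervised): assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# KNN needs labels for its training data.
labelled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
            (8.0, "high"), (9.0, "high"), (9.5, "high")]
print(knn_predict(labelled, 2.5, k=3))  # low

# K-Means sees only the values, with k = 2 starting centroids.
values = [x for x, _ in labelled]
centroids, clusters = k_means(values, centroids=[1.0, 9.5])
print(clusters)  # [[1.0, 1.5, 2.0], [8.0, 9.0, 9.5]]
```

Note that KNN compares the query against every stored point at prediction time, which is one reason
it scales poorly to large datasets.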
12. What do you mean by an outlier? [2]
An outlier in data mining is an observation that is significantly different from the other observations
in a dataset.
o Outliers can have a major impact on the results of data mining and statistical analysis, and
are often considered to be undesirable because they can skew the results and lead to
inaccurate conclusions.
o Outliers can be identified by a number of methods, including statistical tests, visualization
techniques, and machine learning algorithms.
o Once identified, outliers can be handled in a number of ways, such as removing them from
the dataset, treating them as special cases, or including them in the analysis but with
appropriate caution.
It's important to note that the definition of an outlier is context dependent. In some cases an
outlier is valuable information: in fraud detection, for example, identifying an outlier can be
the key to finding a fraudulent transaction.
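One common statistical test for outliers is the interquartile range (IQR) rule, sketched below.
The 1.5 multiplier is the usual convention rather than something fixed by these notes, and the
function name and the transaction amounts are hypothetical.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts with one suspicious value.
amounts = [20, 22, 23, 25, 26, 27, 29, 30, 31, 500]
print(iqr_outliers(amounts))  # [500]
```

Whether the flagged value is removed or investigated depends on the context, as in the fraud
detection case.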
13. What is Knowledge Discovery in Databases? [2]
Knowledge Discovery in Databases (KDD) is the iterative process of extracting useful and valuable
information from large and complex sets of data.
o The goal of KDD is to identify patterns, trends, and insights hidden within the data that can
be used to make better decisions and improve business processes.
o The KDD process typically involves several steps, including data cleaning and preprocessing,
data mining, pattern evaluation, and knowledge representation.
o This process can be used in a variety of applications, including business intelligence, fraud
detection, and customer relationship management.
14. Hierarchical Clustering in Data Mining: [4]
 Definition: A hierarchical clustering method works by grouping data into a tree of clusters.
Agglomerative (bottom-up) hierarchical clustering begins by treating every data point as a
separate cluster. Then, it repeatedly executes the following steps:
o Identify the two clusters that are closest together, and
o Merge these two most similar clusters.
These steps continue until all the clusters are merged together.
 Steps:
1. Compute the pairwise similarity or distance between all data points.
2. Start with each data point as a separate cluster.
3. Merge the two closest clusters into a new larger cluster.
4. Repeat step 3 until all data points belong to a single cluster or some stopping criterion is
met.
 Representation: The hierarchy of clusters can be represented using a tree-based structure
called dendrogram.
 Advantages:
o It can handle non-linearly separable data.
o It can handle different shapes and sizes of clusters.
o It does not require the number of clusters to be specified in advance; the tree can be cut at
any level.
o It can be used to visualize the relationships between clusters.
 Disadvantages:
o It is sensitive to the choice of the similarity or distance metric.
o It is sensitive to the choice of linkage method used to merge clusters.
o It can be computationally expensive for large datasets.
o It can be hard to interpret the results for higher dimensions.
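The merging steps described above can be sketched as follows. This is a minimal illustration on
one-dimensional points that assumes single linkage (the distance between two clusters is the
distance between their closest members); the function names and the sample points are hypothetical.

```python
def single_linkage(a, b):
    """Distance between two clusters: distance of their closest pair of points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, target_clusters=1):
    # Start with every data point as a separate cluster.
    clusters = [[p] for p in points]
    merges = []  # merge history: the information a dendrogram would draw
    while len(clusters) > target_clusters:
        # Find the pair of clusters that are closest together ...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        # ... and merge them into a new, larger cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters, merges = agglomerative(points, target_clusters=2)
print(clusters)  # [[9.0], [5.0, 5.1, 1.0, 1.2]]
```

Stopping at two clusters plays the role of the stopping criterion; with target_clusters=1 the loop
runs until everything is merged. Recomputing all pairwise distances on every pass is what makes
the method expensive for large datasets.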
15. Associative Classification in Data Mining. [2]
 Definition: A data mining technique that discovers association rules between feature values
and class labels (class association rules) and uses these rules to build a classifier that
predicts the class labels.
 Advantages:
o It can handle noisy and incomplete data.
o It can discover important features and relationships between features and class labels.
 Disadvantages:
o It is only applicable for binary or nominal class labels.
o It can be computationally expensive for large datasets.
16. Explain the following terms in the context of association rule mining:
(i) Support of an itemset.
(ii) Frequent closed itemset.
(iii) Lift of a rule. [3X2]
i. Support of an Itemset:
 Definition: The proportion of transactions in a transaction database that contain a particular
itemset.
 Calculation: The support of an itemset X can be calculated as the number of transactions
containing X divided by the total number of transactions in the database.
 Significance: Support is a measure of the popularity of an itemset and is used as a threshold to
determine which itemsets are considered frequent.
 Advantages:
o It provides a simple and intuitive measure of the popularity of an itemset.
o It can be easily calculated from transaction data.
ii. Frequent Closed Itemset:
 Definition: A frequent itemset is closed if there is no superset of the itemset that has the same
support.
 Significance: A frequent closed itemset is considered a more meaningful result than a frequent
itemset as it captures the complete information of the itemset and its subsets.
 Advantages:
o It can avoid generating redundant and less meaningful results.
o It can capture the complete information of the itemset and its subsets.
iii. Lift of a Rule:
 Definition: A measure of the degree of association between two items in a rule, compared to
their individual frequencies in the transaction database.
 Calculation: The lift of a rule X -> Y is the support of X U Y divided by the product of the
support of X and the support of Y: lift(X -> Y) = support(X U Y) / (support(X) x support(Y)).
 Significance: Lift is a measure of the strength of the association between two items in a rule, and
is used to rank and select the most interesting rules.
 Advantages:
o It provides a measure of the strength of the association between two items in a rule.
o It can adjust for the overall popularity of the items in the transaction database.
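The three terms can be checked on a toy transaction database. The sketch below is illustrative
only: the five transactions and the function names are assumptions, and the brute-force
enumeration of itemsets is practical only for a handful of items.

```python
from itertools import combinations

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(x, y):
    """lift(X -> Y) = support(X U Y) / (support(X) * support(Y))."""
    return support(set(x) | set(y)) / (support(x) * support(y))

def frequent_closed_itemsets(min_support):
    items = sorted(set().union(*transactions))
    frequent = {
        frozenset(c): support(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(c) >= min_support
    }
    # Closed: no proper superset has the same support.
    return {
        s: sup for s, sup in frequent.items()
        if not any(s < t and sup == frequent[t] for t in frequent)
    }

print(support({"bread", "milk"}))        # 0.6
print(lift({"bread"}, {"butter"}))       # approximately 1.25
closed = frequent_closed_itemsets(min_support=0.4)
print(frozenset({"butter"}) in closed)   # False
```

Here {butter} is frequent but not closed, because {bread, butter} appears in exactly the same
transactions and so has the same support.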
17. Why data preprocessing is required? [2]
Real-world data generally contains noise and missing values, and may be in an unusable format that
cannot be used directly by machine learning models. Data preprocessing cleans the data and makes
it suitable for a machine learning model, which also increases the accuracy and efficiency of the
model.
It involves the following steps:
o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling
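A compact sketch of the later steps (missing data, encoding, splitting, and scaling) using only the
Python standard library is shown below. The five-row dataset, its column names, the 80/20 split,
and the choice of mean imputation and min-max scaling are assumptions for illustration; libraries
such as pandas and scikit-learn are normally used for this work.

```python
import random

# Hypothetical dataset with one missing value and one categorical column.
rows = [
    {"age": 25, "city": "Pune", "bought": 1},
    {"age": None, "city": "Delhi", "bought": 0},
    {"age": 35, "city": "Pune", "bought": 1},
    {"age": 45, "city": "Mumbai", "bought": 0},
    {"age": 30, "city": "Delhi", "bought": 1},
]

# Finding missing data: replace a missing age with the mean of the known ages.
known = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Encoding categorical data: one-hot encode the city column.
cities = sorted({r["city"] for r in rows})
encoded = [[r["age"]] + [1 if r["city"] == c else 0 for c in cities] for r in rows]
labels = [r["bought"] for r in rows]

# Splitting the dataset into training and test sets (80/20, fixed seed).
indices = list(range(len(rows)))
random.Random(0).shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Feature scaling: min-max scale age using the training rows only.
train_ages = [encoded[i][0] for i in train_idx]
lo, hi = min(train_ages), max(train_ages)
scaled = [[(row[0] - lo) / (hi - lo)] + row[1:] for row in encoded]

print(mean_age)                       # 33.75
print(len(train_idx), len(test_idx))  # 4 1
```

The scaling limits are computed from the training rows only, so that no information from the test
set leaks into the model.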
18. Explain Market Basket Analysis with suitable example.
Suppose 5000 transactions have been made through a popular e-commerce website, and we want to
calculate the support, confidence, and lift for two products, say pen and notebook. Out of the 5000
transactions, 500 contain a pen, 700 contain a notebook, and 100 contain both. (The count for both
items together can never exceed the count for either item alone.)
Using the information provided, we can calculate the support, confidence, and lift for the two
products: pen and notebook.
 Support:
Support for pen = (number of transactions containing pen) / (total number of transactions) = 500 /
5000 = 0.1 or 10%
Support for notebook = (number of transactions containing notebook) / (total number of
transactions) = 700 / 5000 = 0.14 or 14%
Support for pen and notebook = (number of transactions containing both pen and notebook) /
(total number of transactions) = 100 / 5000 = 0.02 or 2%
 Confidence:
Confidence of the rule "If a customer buys a pen, they will also buy a notebook" = (number of
transactions containing both pen and notebook) / (number of transactions containing pen) = 100 /
500 = 0.2 or 20%
Confidence of the rule "If a customer buys a notebook, they will also buy a pen" = (number of
transactions containing both pen and notebook) / (number of transactions containing notebook) =
100 / 700 = 0.143 or 14.3%
 Lift:
Lift of the rule "If a customer buys a pen, they will also buy a notebook" = (confidence of the rule)
/ (support of notebook) = 0.2 / 0.14 = 1.43
Lift of the rule "If a customer buys a notebook, they will also buy a pen" = (confidence of the rule)
/ (support of pen) = 0.143 / 0.1 = 1.43
Note that the lift is the same for both rules. This is because lift is symmetric: it does not depend
on the order of the antecedent and the consequent.
A lift value of 1 indicates that there is no association between the antecedent and consequent,
values greater than 1 indicate a positive association, and values less than 1 indicate a negative
association. Here the lift is about 1.43, which is a positive association between buying a pen and
buying a notebook. The higher the lift value, the stronger the association.
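The same arithmetic can be written as a short script. This sketch assumes 5000 transactions in
total, 500 containing a pen, 700 containing a notebook, and 100 containing both; the joint count
is an assumption chosen so that it does not exceed either single-item count, and the variable
names are hypothetical.

```python
total = 5000
pen = 500
notebook = 700
both = 100  # assumed joint count; must not exceed pen or notebook

# A joint count larger than either single count would be impossible.
assert both <= min(pen, notebook)

support_pen = pen / total            # 0.1
support_notebook = notebook / total  # 0.14
support_both = both / total          # 0.02

# Confidence: how often the consequent appears when the antecedent does.
conf_pen_to_notebook = both / pen
conf_notebook_to_pen = both / notebook

# Lift: confidence divided by the support of the consequent.
lift_pen_to_notebook = conf_pen_to_notebook / support_notebook
lift_notebook_to_pen = conf_notebook_to_pen / support_pen

print(round(conf_pen_to_notebook, 3), round(conf_notebook_to_pen, 3))  # 0.2 0.143
print(round(lift_pen_to_notebook, 2), round(lift_notebook_to_pen, 2))  # 1.43 1.43
```

Both lift values agree because each reduces to support(pen and notebook) / (support(pen) x
support(notebook)), and a confidence computed this way always stays between 0 and 1.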
19. Support Vector Machine
A support vector machine (SVM) is a type of supervised machine learning algorithm used for
classification or regression of data groups.
In AI and machine learning, supervised learning systems are provided with both the input and the
desired output data, which are labeled for classification.
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 

DM_Notes.pptx

  • 1. DATA_MINING_NOTES
1. Explain steps in KDD process. [5]
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process in data mining typically involves the following steps:
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and insights. This may include tasks such as clustering, classification, association rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as visualizing the results, evaluating the quality of the discovered patterns, and identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.
2. What is text mining? [2]
o Definition: Text mining is the process of extracting meaningful information from text data.
o Process: It involves using natural language processing (NLP) techniques and machine learning algorithms to analyze large volumes of unstructured text data and identify patterns, trends, and insights that would be difficult to uncover manually.
o Application: It can be applied in various fields, such as sentiment analysis, topic modeling, and text classification.
o Goal: The goal of text mining is to extract valuable information from text data and use it to make data-driven decisions or predictions.
3. What do you mean by Clustering? [2]
o Clustering is an unsupervised machine learning technique that groups data points into clusters so that similar objects belong to the same group.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.
4. Linear Regression.
o It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data.
o Linear regression attempts to find the mathematical relationship between variables.
o If the fitted relationship is a straight line, it is considered a linear model; if it is a curved line, it is a non-linear model.
  • 2.
o With only one independent variable, the relationship between the dependent variable 'Y' and the independent variable 'X' is given by a straight line:
  Y = α + βX
  The model 'Y' is a linear function of 'X': as the value of 'X' changes, the value of 'Y' increases or decreases linearly.
4. Difference between Data Mining and Text Mining. [3/5]
Data Mining | Text Mining
Data mining is a process to extract useful information from huge datasets. | Text mining is a part of data mining that involves processing text from large document collections.
In data mining, the stored data is in a structured format. | In text mining, the stored data is in an unstructured format.
It allows the mining of mixed data. | It allows the mining of text only.
Data processing is done directly. | Data processing is done linguistically.
It is a homogeneous process. | It is a heterogeneous process.
Pre-defined databases and sheets are used to collect the information. | The text is used to gather high-quality data.
Statistical methods are used for data evaluation. | Computational linguistic principles are used to evaluate the text.
5. Difference between DM and OLAP. [3/5]
Data Mining | OLAP
Data mining refers to the field of computer science that deals with the extraction of trends and patterns from huge sets of data. | OLAP is a technology for immediate access to data with the help of multidimensional structures.
It works on detailed, transaction-level data. | It works on summarized (aggregated) data.
It is discovery-driven. | It is query-driven.
It is used for future data prediction. | It is used for analyzing past data.
  • 3.
Data Mining | OLAP (continued)
It has a huge number of dimensions. | It has a limited number of dimensions.
Bottom-up approach. | Top-down approach.
It is an emerging field. | It is widely used.
6. Difference between Descriptive and predictive data mining. [3/5]
Descriptive data mining | Predictive data mining
Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc. | The term 'Predictive' means to predict something, so predictive data mining is the analysis done to predict future events, data, or trends.
It is based on the reactive approach. | It is based on the proactive approach.
It specifies the characteristics of the data in a target data set. | It performs induction over current and past data so that predictions can be made.
It needs data aggregation and data mining. | It needs statistics and data forecasting procedures.
It provides precise data. | It produces outcomes without ensuring accuracy.
7. Difference between Classification and Clustering [3/5]
Classification | Clustering
Classification is a supervised learning approach where specific labels are provided to the machine to classify new observations. Here the machine needs proper training and testing for label verification. | Clustering is an unsupervised learning approach where grouping is done on the basis of similarity.
Supervised learning approach. | Unsupervised learning approach.
It uses a training dataset. | It does not use a training dataset.
It uses algorithms to categorize new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with similar features.
In classification, there are labels for training data. | In clustering, there are no labels for training data.
Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.
It is more complex as compared to clustering. | It is less complex as compared to classification.
  • 4.
8. Difference between Supervised and un-supervised Learning [3/5]
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in the unknown dataset.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be categorized into Clustering and Association problems.
A supervised learning model generally produces more accurate results. | An unsupervised learning model may give less accurate results as compared to supervised learning.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as K-Means clustering, hierarchical clustering, and the Apriori algorithm.
9. Difference between OLAP and OLTP [3/5]
Category | OLAP (Online analytical processing) | OLTP (Online transaction processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists only of current operational data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Purpose | It serves to extract information for analysis and decision-making. | It serves to insert, update, and delete information in the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, for example MB or GB, as the historical data is archived.
Queries | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast as the queries operate on 5% of the data.
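The Purpose and Queries rows of the OLAP and OLTP comparison can be illustrated with a small sketch. It uses Python's built-in sqlite3 module only as a stand-in for both kinds of system; the sales table, its columns, and the values are hypothetical and not taken from the notes.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# OLTP-style work: short transactions that insert or update individual rows.
with conn:
    conn.executemany(
        "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
        [("North", 2022, 100.0), ("North", 2023, 150.0),
         ("South", 2022, 80.0), ("South", 2023, 120.0)],
    )
    conn.execute("UPDATE sales SET amount = 90.0 WHERE region = 'South' AND year = 2022")

# OLAP-style work: a read-only query that aggregates many rows along a dimension.
olap_summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(olap_summary)
```

In practice the analytical query would run against a separate data warehouse rather than the transactional database; a single SQLite file is used here only to keep the sketch self-contained.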
  • 5.
10. Difference between Data Mining and Data Warehousing. [3/5]
Data Mining | Data Warehousing
Data mining is the process of determining data patterns. | A data warehouse is a database system designed for analytics.
Data mining is generally considered the process of extracting useful data from a large set of data. | Data warehousing is the process of combining all the relevant data.
Business entrepreneurs carry out data mining with the help of engineers. | Data warehousing is entirely carried out by engineers.
In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically.
Data mining uses pattern recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data to allow easier reporting.
One notable use of data mining is detecting and identifying unwanted errors that occur in the system. | One of the advantages of the data warehouse is its ability to be updated frequently. That is why it is ideal for business entrepreneurs who want to stay up to date.
Data mining techniques are cost-efficient as compared to other statistical data applications. | The responsibility of the data warehouse is to simplify every type of business data.
Data mining techniques are not 100 percent accurate, which may lead to serious consequences in certain conditions. | In the data warehouse, there is a high possibility that the data required for analysis by the company may not be integrated into the warehouse. This can lead to loss of data.
Companies can benefit from this analytical tool by equipping suitable and accessible knowledge-based data. | A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
11. K-Means vs KNN [3/5]
Category | K-Means | KNN
Algorithm | Unsupervised learning algorithm | Supervised learning algorithm
Process | Clusters data points into k clusters based on their similarity | Classifies data points based on the majority class of their k nearest neighbors
Number | Requires the number of clusters (k) to be specified in advance | Requires the number of nearest neighbors (k) to be specified in advance
Method | Clustering is done using the mean of the data points in each cluster | Classification is done using a majority vote of the k nearest neighbors
Suitability | Suitable for continuous variables | Suitable for both continuous and categorical variables
Scalability | K-Means is generally faster and more scalable than KNN, especially for large datasets. | KNN is generally slower and less scalable than K-Means for large datasets.
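The Process and Method rows of the K-Means vs KNN comparison can be made concrete with a minimal sketch in plain Python. The sample points, the labels "low" and "high", and the choice of the first k points as initial centroids are all assumptions made for illustration; practical implementations usually pick initial centroids more carefully.

```python
from collections import Counter
from math import dist

def k_means(points, k, iterations=10):
    """Unsupervised: group unlabeled points around k centroids (cluster means)."""
    centroids = points[:k]  # assumption: the first k points seed the centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep old one if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

def knn_classify(train, query, k=3):
    """Supervised: majority vote among the k nearest labeled points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# K-Means sees only coordinates, with no labels.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)

# KNN needs labeled training data before it can classify a new point.
labeled = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((1.0, 0.5), "low"),
           ((8.0, 8.0), "high"), ((9.0, 8.5), "high"), ((8.5, 9.0), "high")]
predicted = knn_classify(labeled, (2.0, 1.5), k=3)
print(centroids, predicted)
```

Note that k means different things in the two functions: the number of clusters in k_means and the number of neighbors consulted in knn_classify.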
  • 6.
12. What do you mean by an outlier? [2]
An outlier in data mining is an observation that is significantly different from the other observations in a dataset.
o Outliers can have a major impact on the results of data mining and statistical analysis, and are often considered undesirable because they can skew the results and lead to inaccurate conclusions.
o Outliers can be identified by a number of methods, including statistical tests, visualization techniques, and machine learning algorithms.
o Once identified, outliers can be handled in a number of ways, such as removing them from the dataset, treating them as special cases, or including them in the analysis with appropriate caution.
The definition of an outlier is context-dependent. In some cases an outlier is valuable information: in fraud detection, for example, identifying an outlier can be the key to finding a fraudulent transaction.
13. What is Knowledge Discovery in Databases? [2]
Knowledge Discovery in Databases (KDD) is the iterative process of extracting useful and valuable information from large and complex sets of data.
o The goal of KDD is to identify patterns, trends, and insights hidden within the data that can be used to make better decisions and improve business processes.
o The KDD process typically involves several steps, including data cleaning and preprocessing, data mining, pattern evaluation, and knowledge representation.
o This process can be used in a variety of applications, including business intelligence, fraud detection, and customer relationship management.
14. Hierarchical Clustering in Data Mining: [4]
Definition: A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
o Identify the two clusters that are closest together, and
o Merge these two most similar clusters.
These steps continue until all the clusters are merged together.
Steps:
1. Compute the pairwise similarity or distance between all data points.
2. Start with each data point as a separate cluster.
3. Merge the two closest clusters into a new larger cluster.
4. Repeat step 3 until all data points belong to a single cluster or some stopping criterion is met.
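The steps above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: it assumes Euclidean distance and single linkage (cluster distance is the distance between the closest pair of points), neither of which is prescribed by the notes, and the sample points are made up.

```python
from math import dist

def single_linkage(a, b):
    """Cluster distance under single linkage: the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, target_clusters=1):
    clusters = [[p] for p in points]  # step 2: every point starts as its own cluster
    merges = []                       # merge order, i.e. the dendrogram read bottom-up
    while len(clusters) > target_clusters:  # step 4: repeat until the stopping criterion
        # steps 1 and 3: find the two closest clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, target_clusters=2)
print(clusters)
```

Recomputing every pairwise cluster distance on each pass makes this sketch slow for large inputs, which matches the computational cost usually attributed to hierarchical clustering.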
  • 7.
Representation: The hierarchy of clusters can be represented using a tree-based structure called a dendrogram.
Advantages:
o It can handle non-linearly separable data.
o It can handle different shapes and sizes of clusters.
o It allows for incremental and dynamic updates of the clustering results.
o It can be used to visualize the relationships between clusters.
Disadvantages:
o It is sensitive to the choice of the similarity or distance metric.
o It is sensitive to the choice of linkage method used to merge clusters.
o It can be computationally expensive for large datasets.
o It can be hard to interpret the results for higher dimensions.
15. Associative Classification in Data Mining. [2]
Definition: A data mining technique that discovers associations between features and class labels, instead of building a predictive model for the class labels.
Advantages:
o It can handle noisy and incomplete data.
o It can discover important features and relationships between features and class labels.
Disadvantages:
o It is only applicable for binary or nominal class labels.
o It can be computationally expensive for large datasets.
16. Explain the following terms in the context of association rule mining: (i) Support of an itemset. (ii) Frequent closed itemset. (iii) Lift of a rule. [3X2]
i. Support of an Itemset:
Definition: The proportion of transactions in a transaction database that contain a particular itemset.
Calculation: The support of an itemset X is the number of transactions containing X divided by the total number of transactions in the database.
Significance: Support is a measure of the popularity of an itemset and is used as a threshold to determine which itemsets are considered frequent.
Advantages:
o It provides a simple and intuitive measure of the popularity of an itemset.
o It can be easily calculated from transaction data.
ii. Frequent Closed Itemset:
Definition: A frequent itemset is closed if there is no proper superset of the itemset that has the same support.
Significance: A frequent closed itemset is considered a more meaningful result than a frequent itemset as it captures the complete information of the itemset and its subsets.
Advantages:
o It can avoid generating redundant and less meaningful results.
o It can capture the complete information of the itemset and its subsets.
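The definitions of support and of a frequent closed itemset can be checked on a toy example. The sketch below is a brute-force illustration under stated assumptions: the four transactions and the minimum support of 0.5 are invented, and every candidate itemset is enumerated, which is only practical for very small data.

```python
from itertools import combinations

# Hypothetical transaction database.
transactions = [
    {"pen", "notebook"},
    {"pen", "notebook", "eraser"},
    {"pen", "notebook"},
    {"eraser"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets and keep those meeting the support threshold."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

def closed_itemsets(frequent):
    """Keep frequent itemsets with no proper superset of equal support."""
    return {
        itemset: s
        for itemset, s in frequent.items()
        if not any(itemset < other and s == other_s for other, other_s in frequent.items())
    }

frequent = frequent_itemsets(transactions, min_support=0.5)
closed = closed_itemsets(frequent)
print(closed)
```

Here {pen} and {notebook} are frequent but not closed, because {pen, notebook} has the same support; only {pen, notebook} and {eraser} remain. Checking supersets among the frequent itemsets is sufficient, since a superset with equal support is itself frequent.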
  • 8.
iii. Lift of a Rule:
Definition: A measure of the degree of association between two items in a rule, compared to their individual frequencies in the transaction database.
Calculation: The lift of a rule X -> Y is the support of X ∪ Y divided by the product of the support of X and the support of Y:
  lift(X -> Y) = support(X ∪ Y) / (support(X) × support(Y))
Significance: Lift is a measure of the strength of the association between two items in a rule, and is used to rank and select the most interesting rules.
Advantages:
o It provides a measure of the strength of the association between two items in a rule.
o It can adjust for the overall popularity of the items in the transaction database.
17. Why data preprocessing is required? [2]
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models. Data preprocessing is the set of tasks required to clean the data and make it suitable for a machine learning model, which also increases the accuracy and efficiency of the model. It involves the following steps:
o Getting the dataset
o Importing libraries
o Importing datasets
o Finding missing data
o Encoding categorical data
o Splitting the dataset into training and test sets
o Feature scaling
18. Explain Market Basket Analysis with suitable example.
Suppose 5000 transactions have been made through a popular e-Commerce website, and the company wants to calculate the support, confidence, and lift for two products, say pen and notebook. Out of the 5000 transactions, 500 contain a pen, 700 contain a notebook, and 1000 contain both. Using this information, we can calculate the support, confidence, and lift for the two products.
Support:
Support for pen = (number of transactions containing pen) / (total number of transactions) = 500 / 5000 = 0.1 or 10%
Support for notebook = (number of transactions containing notebook) / (total number of transactions) = 700 / 5000 = 0.14 or 14%
Support for pen and notebook = (number of transactions containing both pen and notebook) / (total number of transactions) = 1000 / 5000 = 0.2 or 20%
Confidence:
Confidence of the rule "If a customer buys a pen, they will also buy a notebook" = (number of transactions containing both pen and notebook) / (number of transactions containing pen) = 1000 / 500 = 2 or 200%
Confidence of the rule "If a customer buys a notebook, they will also buy a pen" = (number of transactions containing both pen and notebook) / (number of transactions containing notebook) = 1000 / 700 = 1.43 or 143%
Note: the counts given in this example are not mutually consistent. The number of transactions containing both items (1000) cannot exceed the number containing either item alone (500 and 700), which is why the confidence values exceed 100%. With real transaction data, confidence always lies between 0 and 1.
Lift:
  • 9.
Lift of the rule "If a customer buys a pen, they will also buy a notebook" = (confidence of the rule) / (support of notebook) = 2 / 0.14 = 14.3
Lift of the rule "If a customer buys a notebook, they will also buy a pen" = (confidence of the rule) / (support of pen) = 1.43 / 0.1 = 14.3
Note that the lift is the same for both rules. This is because lift is symmetric: it does not depend on the order of the antecedent and the consequent. A lift value of 1 indicates that there is no association between the antecedent and consequent, and values greater than 1 indicate a positive association. Here the lift is 14.3, which indicates a strong positive association between buying a pen and buying a notebook; the higher the lift value, the stronger the association.
19. Support Vector Machine
A support vector machine (SVM) is a type of machine learning algorithm that performs supervised learning for classification or regression of data groups. In AI and machine learning, supervised learning systems are given both input and desired output data, which are labeled for classification.
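Returning to the market basket calculation in question 18, the support, confidence, and lift formulas can be sketched as a small Python function. The function name rule_metrics and the joint count of 280 are assumptions chosen for illustration: the pen and notebook counts (500 and 700 out of 5000) follow the example, but the joint count is a hypothetical value that does not exceed either individual count.

```python
def rule_metrics(total, count_x, count_y, count_both):
    """Support, confidence, and lift for the rule X -> Y from raw transaction counts."""
    if count_both > min(count_x, count_y):
        # A transaction containing both items also contains each one individually.
        raise ValueError("joint count cannot exceed either individual count")
    support_x = count_x / total
    support_y = count_y / total
    support_xy = count_both / total
    confidence = count_both / count_x              # estimate of P(Y | X)
    lift = support_xy / (support_x * support_y)    # symmetric in X and Y
    return {
        "support_x": support_x,
        "support_y": support_y,
        "support_xy": support_xy,
        "confidence": confidence,
        "lift": lift,
    }

# Hypothetical joint count of 280 transactions containing both pen and notebook.
pen_to_notebook = rule_metrics(total=5000, count_x=500, count_y=700, count_both=280)
notebook_to_pen = rule_metrics(total=5000, count_x=700, count_y=500, count_both=280)
print(pen_to_notebook)
print(notebook_to_pen)
```

With these assumed counts, the two rules have different confidence values but the same lift, which reflects the symmetry of lift noted above.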