DATA_MINING_NOTES

1. Explain steps in KDD process. [5]

KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process in data mining typically involves the following steps:

1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such as data normalization, missing-value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and insights. This may include tasks such as clustering, classification, association rule mining, and anomaly detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as visualizing the results, evaluating the quality of the discovered patterns, and identifying relationships and associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.

2. What is text mining? [2]

o Definition: Text mining is the process of extracting meaningful information from text data.
o Process: It involves using natural language processing (NLP) techniques and machine learning algorithms to analyze large volumes of unstructured text and identify patterns, trends, and insights that would be difficult to uncover manually.
o Application: It can be applied in various fields, such as sentiment analysis, topic modeling, and text classification.
o Goal: The goal of text mining is to extract valuable information from text data and use it to make data-driven decisions or predictions.

3. What do you mean by Clustering?
[2]

o Clustering is an unsupervised machine-learning technique that groups data points into clusters so that objects in the same group are more similar to each other than to objects in other groups.
o Clustering is a method of partitioning a set of data or objects into a set of meaningful subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set, and is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.

4. Linear Regression.

o It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data.
o Linear regression attempts to find the mathematical relationship between variables.
o If the outcome is a straight line, the model is linear; if it is a curved line, the model is non-linear.
o The relationship between the dependent variable and the single independent variable is given by a straight line:

Y = α + βX

The model 'Y' is a linear function of 'X': the value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.

4. Difference between Data Mining and Text Mining. [3/5]

| Data Mining | Text Mining |
| --- | --- |
| Data mining is a process to extract useful information from huge datasets. | Text mining is a part of data mining that includes the processing of text from huge documents. |
| In data mining, the stored data is in a structured format. | In text mining, the stored data is in an unstructured format. |
| It allows the mining of mixed data. | It allows mining of text only. |
| Data processing is done directly. | Data processing is done linguistically. |
| It is a homogeneous process. | It is a heterogeneous process. |
| Pre-defined databases and sheets are used to collect the information. | The text is used to gather high-quality data. |
| Statistical methods are used for data evaluation. | Computational-linguistics principles are used to evaluate the text. |

5. Difference between DM and OLAP. [3/5]

| Data Mining | OLAP |
| --- | --- |
| Data mining refers to the field of computer science which deals with the extraction of data, trends, and patterns from huge data sets. | OLAP is a technology for immediate access to data with the help of multidimensional structures. |
| It deals with detailed, transaction-level data. | It deals with data summaries. |
| It is discovery-driven. | It is query-driven. |
| It is used for future data prediction. | It is used for analyzing past data. |
| It can handle a huge number of dimensions. | It has a limited number of dimensions. |
| It follows a bottom-up approach. | It follows a top-down approach. |
| It is an emerging field. | It is widely used. |

6. Difference between Descriptive and predictive data mining. [3/5]

| Descriptive data mining | Predictive data mining |
| --- | --- |
| Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc. | The term 'predictive' means to predict something, so predictive data mining is analysis done to predict a future event or other data or trends. |
| It is based on a reactive approach. | It is based on a proactive approach. |
| It specifies the characteristics of the data in a target data set. | It performs induction over current and past data so that predictions can be made. |
| It needs data aggregation and data mining. | It needs statistics and forecasting procedures. |
| It provides precise data. | It produces outcomes without guaranteed accuracy. |

7. Difference between Classification and Clustering [3/5]

| Classification | Clustering |
| --- | --- |
| Classification is a supervised learning approach where a specific label is provided to the machine to classify new observations. The machine needs proper training and testing for label verification. | Clustering is an unsupervised learning approach where grouping is done on the basis of similarity. |
| Supervised learning approach. | Unsupervised learning approach. |
| It uses a training dataset. | It does not use a training dataset. |
| It uses algorithms to categorize new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with the same features. |
| In classification, there are labels for the training data. | In clustering, there are no labels for the training data. |
| Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects and find whether there is any relationship between them. |
| It is more complex as compared to clustering. | It is less complex as compared to classification. |
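The difference can be seen in a small sketch (the data and helper names here are invented for illustration, not taken from the notes): the classifier consults the provided labels, while the clustering routine groups the same points by similarity alone.

```python
# Illustrative sketch: the same 1-D points handled two ways.
# Supervised classification uses the provided labels; clustering ignores them.

def classify_nearest(labeled, x):
    """Supervised: predict the label of the closest labeled training point."""
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

def cluster_two_means(points, iters=10):
    """Unsupervised: a minimal 2-means pass that groups points by similarity only."""
    c1, c2 = min(points), max(points)          # initial centroids
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

train = [(1.0, "low"), (1.2, "low"), (8.0, "high"), (8.5, "high")]
print(classify_nearest(train, 1.1))             # -> low  (uses the labels)
print(cluster_two_means([1.0, 1.2, 8.0, 8.5]))  # -> ([1.0, 1.2], [8.0, 8.5])
```

Note that the clustering routine recovers the same two groups without ever seeing a label; it just cannot name them "low" and "high".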
8. Difference between Supervised and un-supervised Learning [3/5]

| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data. |
| A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data. |
| In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model. |
| The goal of supervised learning is to train the model so that it can predict the output when given new data. | The goal of unsupervised learning is to find hidden patterns and useful insights in an unknown dataset. |
| Supervised learning can be categorized into classification and regression problems. | Unsupervised learning can be classified into clustering and association problems. |
| A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result compared to supervised learning. |
| It includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, KNN, Decision Tree, Naive Bayes, etc. | It includes algorithms such as K-Means clustering, hierarchical clustering, and the Apriori algorithm. |

9. Difference between OLAP and OLTP [3/5]

| Category | OLAP (Online analytical processing) | OLTP (Online transaction processing) |
| --- | --- | --- |
| Definition | It is well known as an online database query management system. | It is well known as an online database modifying system. |
| Data source | Consists of historical data from various databases. | Consists only of current operational data. |
| Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS). |
| Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks. |
| Normalization | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF). |
| Usage of data | The data is used in planning, problem solving, and decision making. | The data is used to perform day-to-day fundamental operations. |
| Purpose | It serves to extract information for analysis and decision making. | It serves to insert, update, and delete information in the database. |
| Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, as historical data is archived (e.g., MB, GB). |
| Queries | Relatively slow, as the amount of data involved is large; queries may take hours. | Very fast, as the queries operate on only a small fraction of the data. |
10. Difference between Data Mining and Data Warehousing. [3/5]

| Data Mining | Data Warehousing |
| --- | --- |
| Data mining is the process of discovering patterns in data. | A data warehouse is a database system designed for analytics. |
| Data mining is generally considered the process of extracting useful information from a large set of data. | Data warehousing is the process of combining all the relevant data. |
| Business users carry out data mining with the help of engineers. | Data warehousing is entirely carried out by engineers. |
| In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically. |
| Data mining uses pattern-recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data to allow easier reporting. |
| One valuable data mining application is the detection and identification of unwanted errors that occur in a system. | One advantage of a data warehouse is its ability to be updated frequently, which makes it ideal for businesses that want to stay up to date. |
| Data mining techniques are cost-efficient compared to other statistical data applications. | The responsibility of the data warehouse is to simplify every type of business data. |
| Data mining techniques are not 100 percent accurate and may lead to serious consequences in certain conditions. | In a data warehouse, there is a possibility that the data required for analysis is not integrated into the warehouse, which can lead to loss of information. |
| Companies can benefit from this analytical tool by equipping themselves with suitable and accessible knowledge-based data. | A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions. |

11. K-Means vs KNN [3/5]

| Category | K-Means | KNN |
| --- | --- | --- |
| Algorithm | Unsupervised learning algorithm. | Supervised learning algorithm. |
| Process | Clusters data points into k clusters based on their similarity. | Classifies data points based on the majority class of their k nearest neighbors. |
| Number k | Requires the number of clusters (k) to be specified in advance. | Requires the number of nearest neighbors (k) to be specified in advance. |
| Method | Clustering is done using the mean of the data points in each cluster. | Classification is done using a majority vote of the k nearest neighbors. |
| Suitability | Suitable for continuous variables. | Suitable for both continuous and categorical variables. |
| Scalability | K-Means is generally faster and more scalable than KNN, especially for large datasets. | KNN is generally slower and less scalable than K-Means for large datasets. |
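The contrast between the two k's can be sketched in a few lines (an illustrative toy with invented 2-D points, not a full implementation of either algorithm): KNN takes a majority vote among the k nearest labeled points, while a K-Means assignment step sends each unlabeled point to its nearest centroid.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """KNN (supervised): majority class among the k closest labeled points."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans_assign(points, centroids):
    """One K-Means assignment step (unsupervised): each point joins its
    nearest centroid; no labels are involved."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))                             # -> A
print(kmeans_assign([(0, 0), (6, 5)], [(0.3, 0.3), (5.5, 5.0)]))  # -> [0, 1]
```

A full K-Means run would alternate this assignment step with recomputing each centroid as the mean of its cluster, which is the "Method" row of the table above.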
12. What do you mean by an outlier? [2]

An outlier in data mining is an observation that is significantly different from the other observations in a dataset.

o Outliers can have a major impact on the results of data mining and statistical analysis, and are often considered undesirable because they can skew the results and lead to inaccurate conclusions.
o Outliers can be identified by a number of methods, including statistical tests, visualization techniques, and machine learning algorithms.
o Once identified, outliers can be handled in a number of ways, such as removing them from the dataset, treating them as special cases, or including them in the analysis with appropriate caution.

It is important to note that the definition of an outlier is context-dependent; in some cases an outlier carries valuable information. In fraud detection, for example, identifying an outlier can be the key to finding a fraudulent transaction.

13. What is Knowledge Discovery in Databases? [2]

Knowledge Discovery in Databases (KDD) is the iterative process of extracting useful and valuable information from large and complex sets of data.

o The goal of KDD is to identify patterns, trends, and insights hidden within the data that can be used to make better decisions and improve business processes.
o The KDD process typically involves several steps, including data cleaning and preprocessing, data mining, pattern evaluation, and knowledge representation.
o This process can be used in a variety of applications, including business intelligence, fraud detection, and customer relationship management.

14. Hierarchical Clustering in Data Mining: [4]

Definition: A hierarchical clustering method works by grouping data into a tree of clusters. Agglomerative hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following steps:

o Identify the two clusters that are closest together, and
o Merge these two most similar clusters.

These steps are repeated until all the clusters are merged together.

Steps:
1. Compute the pairwise similarity or distance between all data points.
2. Start with each data point as a separate cluster.
3. Merge the two closest clusters into a new, larger cluster.
4. Repeat step 3 until all data points belong to a single cluster or some stopping criterion is met.
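The four steps can be sketched as a tiny single-linkage agglomerative pass (an illustrative toy on made-up 1-D points, not an efficient implementation):

```python
def single_link_dist(c1, c2):
    """Single linkage: cluster distance = distance between closest members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points):
    """Start with singletons, repeatedly merge the two closest clusters,
    and record the merge order until one cluster remains."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link_dist(clusters[ij[0]], clusters[ij[1]]))
        merged = sorted(clusters[i] + clusters[j])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

print(agglomerate([1, 2, 9, 10, 25]))
# the closest pairs [1, 2] and [9, 10] merge first; 25 joins last
```

Cutting this merge sequence at a chosen distance threshold is what turns the resulting dendrogram into a flat clustering.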
Representation: The hierarchy of clusters can be represented using a tree-based structure called a dendrogram.

Advantages:
o It can handle non-linearly separable data.
o It can handle clusters of different shapes and sizes.
o It allows for incremental and dynamic updates of the clustering results.
o It can be used to visualize the relationships between clusters.

Disadvantages:
o It is sensitive to the choice of the similarity or distance metric.
o It is sensitive to the choice of linkage method used to merge clusters.
o It can be computationally expensive for large datasets.
o The results can be hard to interpret in higher dimensions.

15. Associative Classification in Data Mining. [2]

Definition: A data mining technique that discovers association rules between features and class labels and uses them for classification, instead of building a conventional predictive model directly.

Advantages:
o It can handle noisy and incomplete data.
o It can discover important features and relationships between features and class labels.

Disadvantages:
o It is only applicable to binary or nominal class labels.
o It can be computationally expensive for large datasets.

16. Explain the following terms in the context of association rule mining: (i) Support of an itemset. (ii) Frequent closed itemset. (iii) Lift of a rule. [3x2]

i. Support of an Itemset:

Definition: The proportion of transactions in a transaction database that contain a particular itemset.
Calculation: The support of an itemset X is the number of transactions containing X divided by the total number of transactions in the database.
Significance: Support is a measure of the popularity of an itemset and is used as a threshold to determine which itemsets are considered frequent.
Advantages:
o It provides a simple and intuitive measure of the popularity of an itemset.
o It can be easily calculated from transaction data.

ii. Frequent Closed Itemset:

Definition: A frequent itemset is closed if no superset of the itemset has the same support.
Significance: A frequent closed itemset is considered a more meaningful result than an ordinary frequent itemset, as it captures the complete support information of the itemset and its subsets.
Advantages:
o It avoids generating redundant and less meaningful results.
o It captures the complete support information of the itemset and its subsets.
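These two definitions can be checked on a toy basket (the transactions below are invented for illustration): support counts the containing transactions, and an itemset is closed when every strict superset has strictly lower support.

```python
from itertools import combinations

transactions = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# All frequent itemsets of size 1 or 2 at minimum support 0.5
items = sorted(set().union(*transactions))
frequent = [frozenset(c) for r in (1, 2) for c in combinations(items, r)
            if support(c) >= 0.5]

def is_closed(itemset):
    """Closed: no strict superset has the same support."""
    return all(support(set(itemset) | {i}) < support(itemset)
               for i in items if i not in itemset)

print(support({"a", "b"}))    # -> 0.75
print(is_closed({"b"}))       # -> False: {"a", "b"} has the same support
print(is_closed({"a", "b"}))  # -> True
```

Here {"b"} is frequent but not closed, because every transaction containing "b" also contains "a"; reporting only closed itemsets drops such redundant results without losing support information.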
iii. Lift of a Rule:

Definition: A measure of the degree of association between the two sides of a rule, compared to their individual frequencies in the transaction database.
Calculation: The lift of a rule X -> Y is calculated as the support of X ∪ Y divided by the product of the support of X and the support of Y (equivalently, the confidence of the rule divided by the support of Y).
Significance: Lift measures the strength of the association between the antecedent and consequent of a rule, and is used to rank and select the most interesting rules.
Advantages:
o It provides a measure of the strength of the association between the two sides of a rule.
o It adjusts for the overall popularity of the items in the transaction database.

17. Why data preprocessing is required? [2]

Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly by machine learning models. Data preprocessing comprises the tasks for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model. It involves the steps below:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding missing data
o Encoding categorical data
o Splitting the dataset into training and test sets
o Feature scaling

18. Explain Market Basket Analysis with suitable example.

Suppose 5,000 transactions have been made through a popular e-commerce website, and we want to calculate the support, confidence, and lift for two products, say pen and notebook. Out of the 5,000 transactions, 500 contain a pen, 700 contain a notebook, and 400 contain both. (Note that the count for each single item must be at least the count for the pair, since every transaction containing both items also contains each item alone.)

Using this information, we can calculate the support, confidence, and lift for the two products.

Support:
Support for pen = (number of transactions containing pen) / (total number of transactions) = 500 / 5,000 = 0.1 or 10%
Support for notebook = (number of transactions containing notebook) / (total number of transactions) = 700 / 5,000 = 0.14 or 14%
Support for pen and notebook = (number of transactions containing both pen and notebook) / (total number of transactions) = 400 / 5,000 = 0.08 or 8%

Confidence:
Confidence of the rule "If a customer buys a pen, they will also buy a notebook" = (transactions containing both pen and notebook) / (transactions containing pen) = 400 / 500 = 0.8 or 80%
Confidence of the rule "If a customer buys a notebook, they will also buy a pen" = (transactions containing both pen and notebook) / (transactions containing notebook) = 400 / 700 ≈ 0.571 or 57.1%

Lift:
Lift of the rule "If a customer buys a pen, they will also buy a notebook" = (confidence of the rule) / (support of notebook) = 0.8 / 0.14 ≈ 5.71
Lift of the rule "If a customer buys a notebook, they will also buy a pen" = (confidence of the rule) / (support of pen) = 0.571 / 0.1 ≈ 5.71

Note that the lift is the same for both rules: lift is symmetric and does not depend on which item is the antecedent and which is the consequent. A lift value of 1 indicates no association between the antecedent and consequent, and values greater than 1 indicate a positive association. Here the lift is about 5.71, indicating a strong positive association between buying a pen and buying a notebook; the larger the lift, the stronger the association.

19. Support Vector Machine

A support vector machine (SVM) is a supervised machine learning algorithm that can be used for classification or regression. In supervised learning, the system is provided with both input data and the desired output, which is labeled for classification; the SVM then learns the decision boundary that separates the classes with the widest possible margin.
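The maximum-margin idea behind SVMs can be illustrated numerically (a hand-made sketch with invented points and candidate separating lines, not an SVM solver): among lines that separate the two classes, the SVM objective prefers the one whose closest point is farthest away.

```python
import math

# Labeled 2-D points: y = -1 for one class, y = +1 for the other (invented data).
points = [((0, 0), -1), ((1, 0), -1), ((4, 4), 1), ((5, 4), 1)]

def min_margin(w, b):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points.
    Positive for every point means the line (w, b) separates the classes."""
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for (x, y) in points)

# Two hand-picked separating lines (hypothetical candidates, not fitted):
candidates = {
    "tilted":   ((1.0, 0.0), -3.5),   # vertical line x1 = 3.5, hugs the + class
    "balanced": ((1.0, 1.0), -4.5),   # line x1 + x2 = 4.5, midway between classes
}
best = max(candidates, key=lambda name: min_margin(*candidates[name]))
print(best, round(min_margin(*candidates[best]), 3))  # -> balanced 2.475
```

Both candidates classify every point correctly, but the "balanced" line leaves a much larger margin (about 2.475 vs. 0.5), which is exactly the quantity an SVM maximizes; the points that sit closest to the chosen boundary are its support vectors.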