Q. Write the advantages, disadvantages and applications of the different algorithms used in data mining.

Ans.

Decision Trees
In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node either makes a decision (a decision node) or represents an outcome (a leaf node).

1. Naive Bayes Classifier (NBC)
Naive Bayes is a machine learning algorithm used to solve classification problems. It is based on Bayes' theorem. It is one of the simplest yet most powerful ML algorithms in use and finds applications in many industries. Suppose you have to solve a classification problem, have created the features and generated the hypothesis, and your superiors want to see the model. You have a very large number of data points (hundreds of thousands) and many variables to train on. A good solution in this situation is the Naive Bayes classifier, which is considerably faster than many other classification algorithms.

Advantages of NBC:
1) The naive Bayesian model originates from classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency.
2) It is fast for large volumes of training data and queries. Even with very large training sets, each item usually has only a relatively small number of features, and training and classifying an item is just arithmetic on the feature probabilities.
3) It works well on small-scale data, can handle multi-class tasks, and is suitable for incremental training (that is, it can be updated with new samples in real time).
4) It is less sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification.
5) Naive Bayes results are easy to explain.

Disadvantages of NBC:
1) Because the class is decided from estimated probabilities, classification decisions carry some error rate.
2) It is very sensitive to the form in which the input data is represented.
3) It assumes that sample attributes are independent, so performance suffers when the attributes are correlated.
4) Fully independent predictors (features) rarely occur in real life, which limits the applicability of the algorithm in real-world use cases.
5) The algorithm faces the 'zero-frequency problem': it assigns zero probability to a categorical value that appears in the test data set but was not present in the training data set. A smoothing technique (for example, Laplace smoothing) should be used to overcome this issue.
6) Its probability estimates can be wrong in some cases, so its probability outputs should not be taken too literally.

Applications of Naive Bayes Algorithms
Real-time prediction: Because Naive Bayes is very fast, it can be used for making predictions in real time.
Multi-class prediction: The algorithm can predict the posterior probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification, where (owing to their good results on multi-class problems and the independence assumption) they often have a higher success rate than other algorithms. As a result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation systems: A Naive Bayes classifier combined with algorithms such as collaborative filtering makes a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

2. Iterative Dichotomiser 3 (ID3)
ID3 stands for Iterative Dichotomiser 3. It is a classification algorithm, named so because it iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step. Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words, top-down means that we start building the tree from the top, and greedy means that at each step we select the best attribute available at that moment: the one that yields the maximum Information Gain (IG), that is, the minimum Entropy (H) after the split.

Advantages of using ID3
1) Understandable prediction rules are created from the training data.
2) Builds the tree quickly.
3) Builds a short tree.
4) Only needs to test enough attributes until all data is classified.
5) Finding leaf nodes enables test data to be pruned, reducing the number of tests.
6) The whole dataset is searched to create the tree.

Disadvantages of using ID3
1) Data may be over-fitted or over-classified if a small sample is tested.
2) Only one attribute at a time is tested for making a decision.
3) Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum.

Applications of ID3
The ID3 algorithm is used in many places, for example land capability classification and information asset identification.

3. K-Nearest Neighbours (KNN)
KNN for nearest neighbour search: The KNN algorithm retrieves the K data points that are nearest in distance to a query point. It can be used for classification or regression by aggregating the target values of the nearest neighbours to make a prediction. However, simply retrieving the nearest neighbours is itself important in several applications. For instance, suppose we are writing a movie recommender system. Once we have a suitable vector representation for all the movies, recommending the five closest movies to a given movie amounts to retrieving its five nearest neighbour vectors.
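The neighbour retrieval described above can be sketched in a few lines. This is a minimal brute-force illustration, not a production recommender: the two-dimensional movie vectors, the titles, and the helper names `euclidean`, `k_nearest` and `recommend` are all assumptions made up for the example, and Euclidean distance is just one possible choice of metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, vectors, k):
    """Return the names of the k items whose vectors are closest to `query`.

    Brute force: every stored vector is compared with the query.
    """
    ranked = sorted(vectors, key=lambda name: euclidean(query, vectors[name]))
    return ranked[:k]


# Hypothetical 2-D movie embeddings (illustrative values only).
movies = {
    "Alien": (0.9, 0.1),
    "Aliens": (0.85, 0.2),
    "Blade Runner": (0.7, 0.3),
    "Notting Hill": (0.1, 0.9),
    "Love Actually": (0.15, 0.85),
}


def recommend(title, k):
    """Recommend the k movies closest to `title`, excluding the movie itself."""
    others = {name: vec for name, vec in movies.items() if name != title}
    return k_nearest(movies[title], others, k)


print(recommend("Alien", 2))  # ['Aliens', 'Blade Runner']
```

Because every query scans the whole collection, this approach becomes slow as the dataset grows, which is the main cost of KNN at prediction time.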
KNN for classification: KNN can be used for classification in a supervised setting where we are given a dataset with target labels. For classification, KNN finds the k nearest data points in the training set, and the predicted label is the mode of the target labels of these k nearest neighbours.

KNN for regression: KNN can be used for regression in a supervised setting where we are given a dataset with continuous target values. For regression, KNN finds the k nearest data points in the training set, and the predicted value is the mean of the target values of these k nearest neighbours.

Advantages of KNN
1) K-NN is intuitive and simple: The K-NN algorithm is very simple to understand and equally easy to implement. To classify a new data point, the K-NN algorithm reads through the whole dataset to find the K nearest neighbours.
2) K-NN has no assumptions: K-NN is a non-parametric algorithm, which means there are no assumptions about the data that must be met before applying it. Parametric models such as linear regression have many assumptions that the data must satisfy before they can be used, which is not the case with K-NN.
3) No training step: K-NN does not explicitly build any model; it simply labels a new data entry based on the historical data. The new entry is given the majority class among its nearest neighbours.
4) It constantly evolves: Because it is instance-based learning, k-NN is a memory-based approach. The classifier adapts immediately as we collect new training data, which allows the algorithm to respond quickly to changes in the input during real-time use.
5) Very easy to implement for multi-class problems: Most classifier algorithms are easy to implement for binary problems and need extra effort for multi-class problems, whereas K-NN adjusts to multiple classes without any extra effort.
6) Can be used for both classification and regression: One of the biggest advantages of K-NN is that it can be used for both classification and regression problems.
7) One hyperparameter: K-NN has essentially one main hyperparameter, K. Choosing it may take some time, but once it is selected the remaining choices follow from it.
8) Variety of distance criteria to choose from: The K-NN algorithm gives the user the flexibility to choose the distance measure when building the model.
a. Euclidean distance
b. Hamming distance
c. Manhattan distance
d. Minkowski distance

Although K-NN has several advantages, it also has some very important disadvantages or constraints.

Disadvantages of KNN
1) K-NN is a slow algorithm: K-NN may be very easy to implement, but as the dataset grows the speed of the algorithm declines very fast.
2) Curse of dimensionality: KNN works well with a small number of input variables, but as the number of variables grows the algorithm struggles to predict the output for a new data point.
3) K-NN needs homogeneous features: If you build k-NN using a common distance such as Euclidean or Manhattan distance, the features must have the same scale, because absolute differences in each feature are weighted equally; that is, a given distance in feature 1 must mean the same as in feature 2.
4) Optimal number of neighbours: One of the biggest issues with K-NN is choosing the optimal number of neighbours to consider when classifying a new data entry.
5) Imbalanced data causes problems: k-NN does not perform well on imbalanced data. If we consider two classes, A and B, and the majority of the training data is labelled A, then the model will give a lot of preference to A. This may result in the less common class B being wrongly classified.
6) Outlier sensitivity: The K-NN algorithm is very sensitive to outliers, as it simply chooses the neighbours based on the distance criterion.
7) Missing value treatment: K-NN inherently has no capability for dealing with missing values.

Applications of KNN
Used in classification and interpretation (legal, news, banking)
Used to fill in missing values
Used in pattern recognition
Used in gene expression
Used in protein-protein prediction
Used to get 3D structure of problem
Used to measure document similarity
Problem solving (planning, pronunciation)
Functional learning (dynamic control)
Teaching and aiding (help desk, user training)

4. Classification and Regression Trees (CART) Algorithm
Classification and Regression Trees (CART) is simply a modern term for what are otherwise known as decision trees. Decision trees have been around for a very long time and are important for predictive modelling in machine learning. As the name suggests, these trees are used for classification and regression (prediction) problems. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. The partitioning can be represented graphically as a tree; hence the name.

They have withstood the test of time for the following reasons:
1. Very competitive with other methods
2. High efficiency

There are two types. Classification trees are used to separate a dataset into different classes (generally when the target variable is categorical). Regression trees are used when the target variable is continuous (numerical).

Advantages of CART
1) CART does not require any assumptions about underlying distributions.
2) It is easy to use and can quickly provide valuable insights.
3) CART can be used efficiently to assess massive datasets.
4) It can further be used to drill down to a particular cause and find effective, quick solutions.
5) The solution is easily interpretable, intuitive and can be verified with existing data.
6) It is a good way to present solutions to management.

Disadvantages of CART
1) The biggest limitation is that it is a nonparametric technique; it is not recommended to generalize about the underlying phenomenon from the results observed. Although the rules obtained through the analysis can be tested on new data, it must be remembered that the model is built from the sample without making any inference about the underlying probability distribution.
2) Another limitation of CART is that the tree becomes quite complex after seven or eight layers.
3) Interpreting the results in this situation is not intuitive.

Applications of CART
CART is used in many places in machine learning, such as blood donor classification, spatial, environmental and ecological data, and hepatitis diagnosis.

5. K-Means Clustering
The K-Means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

Advantages of K-means
1) Relatively simple to implement.
2) Scales to large data sets.
3) Guarantees convergence (to a local optimum, not necessarily the global one).
4) Can warm-start the positions of centroids.
5) Easily adapts to new examples.
6) Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages of K-means
1) Dependence on initial values: For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means that pick better initial centroids (called k-means seeding).
2) Clustering data of varying sizes and density: K-means has trouble clustering data where clusters are of varying sizes and density. To cluster such data, you need to generalize k-means.
3) Clustering outliers: Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
4) Scaling with number of dimensions: As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data or by using “spectral clustering” to modify the clustering algorithm.

Applications of K-Means Clustering
K-Means clustering is used in a variety of real-life examples and business cases, such as:
Academic performance
Diagnostic systems
Search engines
Wireless sensor networks

Academic performance: Based on their scores, students are categorized into grades such as A, B, or C.
Diagnostic systems: The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
Search engines: Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and search engines very often use clustering to do this.
Wireless sensor networks: The clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.
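As a rough illustration of the K-Means procedure described in this section (assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat), here is a minimal sketch in plain Python. The function names, the sample points, the random initialisation from the data, and the fixed seed are assumptions chosen for the example; practical implementations usually add better seeding and several restarts.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Arithmetic mean of a non-empty list of points (the centroid)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Basic K-Means: returns (centroids, clusters).

    Assumption for this sketch: initial centroids are k distinct points
    sampled at random from the data.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilised
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of hypothetical 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this small dataset the loop settles on one centroid near each group. On less clearly separated data the result depends on the initial centroids, which is the first disadvantage listed above.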