Given a collection of records (a training set), where each record contains a set of attributes and one of the attributes is the class, classification finds a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
3. Classification: Definition
Given a collection of records (a training set): each record contains a set of attributes, and one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
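A minimal sketch of this train/test protocol in plain Python. The records mirror the tax-cheat table used later in the deck; the shuffle, the 70/30 split ratio, and the accuracy helper are illustrative assumptions, not from the slides:

```python
import random

# Toy labeled records: (Refund, Marital Status, Taxable Income, Cheat).
records = [
    ("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes"),
]

random.seed(42)                  # reproducible shuffle
random.shuffle(records)
split = int(0.7 * len(records))  # 70/30 is a common but arbitrary choice
train_set, test_set = records[:split], records[split:]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true label."""
    return sum(model(r) == r[-1] for r in test_set) / len(test_set)
```

The model is built from train_set only; test_set is held out so the reported accuracy reflects performance on unseen records.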
5. Examples of Classification Task
Predicting tumor cells as benign or malignant.
Classifying credit card transactions as legitimate or fraudulent.
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
Categorizing news stories as finance, weather, entertainment, sports, etc.
6. Classification Using Distance
Place items in the class to which they are "closest".
Must determine the distance between an item and a class.
Classes can be represented by a centroid (central value), a medoid (representative point), or individual points.
Algorithm: KNN
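To make the centroid representation concrete, here is a minimal nearest-centroid classifier sketch (the 2-D toy points and the use of Euclidean distance are illustrative assumptions; the slides do not fix a distance measure):

```python
import math
from collections import defaultdict

def centroids(labeled_points):
    """One centroid (mean vector) per class from labeled 2-D points."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in labeled_points:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}

def classify_by_centroid(item, cents):
    """Assign the item to the class whose centroid is nearest (Euclidean)."""
    return min(cents, key=lambda c: math.dist(item, cents[c]))

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(classify_by_centroid((7, 8), centroids(train)))  # -> "B"
```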
7. K Nearest Neighbor (KNN)
The training set includes the classes.
Examine the K items nearest to the item to be classified.
The new item is placed in the class with the largest number of close items.
Cost is O(q) for each tuple to be classified, where q is the size of the training set.
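A minimal KNN sketch following the procedure above (the toy 2-D data, K = 3, and Euclidean distance are illustrative assumptions):

```python
import heapq
import math
from collections import Counter

def knn_classify(item, training_set, k=3):
    """The distance to every one of the q training tuples is computed, which
    is where the per-tuple classification cost noted above comes from."""
    nearest = heapq.nsmallest(k, training_set,
                              key=lambda rec: math.dist(item, rec[0]))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D training set with class labels (illustrative data, not from the deck).
train = [((1, 1), "A"), ((2, 2), "A"), ((1, 2), "A"),
         ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B")]
print(knn_classify((2, 1), train))  # -> "A"
```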
9. Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory-based Reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
10. Example of a Decision Tree
Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree), with Refund, MarSt, and TaxInc as the splitting attributes:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
11. Another Example of Decision Tree
Same training data as slide 10, but a different model:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
13. Apply Model to Test Data (slides 14-18 repeat this slide, advancing the traversal one step at a time)
Test data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching each attribute test: Refund = No takes the branch to the MarSt node; MarSt = Married takes the branch to the leaf labeled NO.
Assign Cheat to "No".
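This tree is small enough to transcribe directly as nested conditionals; a sketch (the function and record layout are illustrative, and since the slide does not specify how an income exactly at the 80K boundary is routed, treating 80K as the YES side is an assumption):

```python
def classify(refund, marital_status, taxable_income):
    """The slide-10 decision tree as nested attribute tests."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: apply the continuous-attribute test.
    return "No" if taxable_income < 80 else "Yes"  # boundary handling assumed

# Test record from slides 13-18: Refund=No, Marital Status=Married, Income=80K.
print(classify("No", "Married", 80))  # -> "No"
```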
20. Decision Tree Induction
Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
21. General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
General procedure (see the sketch below):
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, then recursively apply the procedure to each subset.
(Figure: the training data from slide 10, with Dt marking the subset of records that reaches the node currently being processed.)
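A compact sketch of Hunt's procedure in Python (the names hunt and choose_test are illustrative assumptions; how to pick the attribute test is exactly what the following slides discuss):

```python
def hunt(records, default_class, choose_test):
    """Recursively build a tree following Hunt's general procedure.

    records      : list of (attributes_dict, class_label) pairs (the set Dt)
    default_class: the label yd used when Dt is empty
    choose_test  : function that picks an attribute test and partitions the
                   records into {outcome: subset} (left abstract on purpose)
    """
    if not records:                        # Dt is empty -> leaf labeled yd
        return {"leaf": default_class}
    labels = {label for _, label in records}
    if len(labels) == 1:                   # all records in one class yt
        return {"leaf": labels.pop()}
    test, partitions = choose_test(records)    # split into smaller subsets
    majority = max(labels,
                   key=lambda c: sum(1 for _, l in records if l == c))
    return {"test": test,
            "children": {outcome: hunt(subset, majority, choose_test)
                         for outcome, subset in partitions.items()}}
```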
23. Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records:
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting.
25. How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
26. Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values.
  CarType -> {Family}, {Sports}, {Luxury}
Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType -> {Family, Luxury} vs. {Sports}, OR {Sports, Luxury} vs. {Family}
27. Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as there are distinct values.
  Size -> {Small}, {Medium}, {Large}
Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size -> {Small} vs. {Medium, Large}, OR {Small, Medium} vs. {Large}
What about the split {Small, Large} vs. {Medium}? It does not respect the ordering of the attribute values, so it is generally not used for ordinal attributes.
28. Splitting Based on Continuous Attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute:
Static – discretize once at the beginning.
Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
Binary decision: (A < v) or (A ≥ v):
Considers all possible splits and finds the best cut.
Can be more compute-intensive.
31. How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
(Figure: several candidate test conditions with the resulting class distributions.)
Which test condition is the best?
32. How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
Non-homogeneous distribution: high degree of impurity.
Homogeneous distribution: low degree of impurity.
33. Measures of Node Impurity
Gini Index
Entropy
Misclassification error
34. How to Find the Best Split
Before splitting, the parent node has class counts (N00, N01) and impurity M0.
Candidate test A? splits the records into nodes N1 and N2, with count matrices (N10, N11), (N20, N21) and impurities M1, M2; candidate test B? splits them into nodes N3 and N4, with counts (N30, N31), (N40, N41) and impurities M3, M4.
M12 and M34 are the weighted impurities of the children under A and B, respectively.
Gain = M0 – M12 vs. M0 – M34: choose the attribute test with the higher gain.
35. Measure of Impurity: GINI
Gini index for a given node t:
GINI(t) = 1 − Σj [p(j|t)]²
(NOTE: p(j|t) is the relative frequency of class j at node t.)
Maximum (1 − 1/nc) when records are equally distributed among all nc classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.
Examples (class counts C1, C2 over six records):
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
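A short sketch of the Gini computation, checked against the four example nodes above (the function name gini is illustrative):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The example nodes above (counts are [C1, C2]):
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```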
37. Splitting Based on GINI
Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as:
GINI_split = Σ(i=1..k) (ni / n) · GINI(i)
where ni = number of records at child i, and n = number of records at node p.
39. Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset; use the count matrix to make decisions.
Multi-way split:
CarType:  Family  Sports  Luxury
C1:       1       2       1
C2:       4       1       1
Gini = 0.393
Two-way splits (find the best partition of values):
CarType:  {Sports, Luxury}  {Family}
C1:       3                 1
C2:       2                 4
Gini = 0.400
CarType:  {Sports}  {Family, Luxury}
C1:       2         2
C2:       1         5
Gini = 0.419
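A sketch of the weighted split quality GINI_split, reproducing the three CarType values above:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; `children` is a list of class-count lists."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Counts are [C1, C2] per child, taken from the matrices above:
print(round(gini_split([[1, 4], [2, 1], [1, 1]]), 3))  # multi-way: 0.393
print(round(gini_split([[3, 2], [1, 4]]), 3))          # {Sports,Luxury}|{Family}: 0.400
print(round(gini_split([[2, 1], [2, 5]]), 3))          # {Sports}|{Family,Luxury}: 0.419
```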
40. Continuous Attributes: Computing Gini Index
Use binary decisions based on one value.
Several choices for the splitting value: the number of possible splitting values = the number of distinct values.
Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A ≥ v.
Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index.
Computationally inefficient! Repetition of work.
41. Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
Sort the attribute values.
Linearly scan these values, each time updating the count matrix and computing the Gini index.
Choose the split position that has the least Gini index.
Sorted values (Taxable Income): 60 70 75 85 90 95 100 120 125 220
Class (Cheat):                  No No No Yes Yes Yes No No No No
Candidate split positions:      55 65 72 80 87 92 97 110 122 172 230
Gini at each position:          0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
The best split is Taxable Income ≤ 97, with Gini = 0.300.
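A sketch of the sorted linear scan: sort once, then shift one record at a time across the cut and update the count matrix incrementally. It reproduces the Taxable Income example (the slide rounds the best cut of 97.5 down to 97):

```python
from collections import Counter

def best_split(values, labels):
    """Find the binary cut A <= v minimizing weighted Gini with one sorted scan."""
    def gini(counter, total):
        return 1.0 - sum((c / total) ** 2 for c in counter.values())

    data = sorted(zip(values, labels))
    n = len(data)
    left, right = Counter(), Counter(labels)
    best_v, best_g = None, float("inf")
    for i in range(n - 1):
        v, label = data[i]
        left[label] += 1            # shift one record from right to left
        right[label] -= 1
        if data[i + 1][0] == v:     # only consider cuts between distinct values
            continue
        g = ((i + 1) / n) * gini(left, i + 1) \
            + ((n - i - 1) / n) * gini(right, n - i - 1)
        if g < best_g:
            best_v, best_g = (v + data[i + 1][0]) / 2, g
    return best_v, best_g

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(income, cheat))   # -> (97.5, 0.3)
```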
42. Alternative Splitting Criteria Based on INFO
Entropy at a given node t:
Entropy(t) = − Σj p(j|t) log p(j|t)
(NOTE: p(j|t) is the relative frequency of class j at node t.)
Measures the homogeneity of a node.
Maximum (log nc) when records are equally distributed among all classes, implying least information.
Minimum (0.0) when all records belong to one class, implying most information.
Entropy-based computations are similar to the GINI index computations.
44. Splitting Based on INFO...
Information Gain:
GAIN_split = Entropy(p) − Σ(i=1..k) (ni / n) · Entropy(i)
Parent node p is split into k partitions; ni is the number of records in partition i.
Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
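A sketch of entropy and information gain, using log base 2 (consistent with the log nc maximum above). The example reuses the [3 Yes, 7 No] class distribution and the Taxable Income ≤ 97 split from the neighboring slides:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t) for the given class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    return entropy(parent_counts) - weighted

# The [3 Yes, 7 No] training set split by Taxable Income <= 97 into
# [3 Yes, 3 No] and [0 Yes, 4 No]:
print(round(entropy([3, 7]), 4))                        # 0.8813
print(round(info_gain([3, 7], [[3, 3], [0, 4]]), 4))    # 0.8813 - 0.6 = 0.2813
```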
45. Splitting Based on INFO...
Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO
SplitINFO = − Σ(i=1..k) (ni / n) log(ni / n)
Parent node p is split into k partitions; ni is the number of records in partition i.
Adjusts information gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5; designed to overcome the disadvantage of information gain.
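A sketch of the gain-ratio adjustment. The example shows two splits that both separate the classes perfectly (GAIN = 1), where the 4-way partition is penalized by its larger SplitINFO:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_info(children):
    """SplitINFO = -sum_i (n_i/n) log2(n_i/n) over the k partitions."""
    sizes = [sum(ch) for ch in children]
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(parent, children):
    gain = entropy(parent) - sum(sum(ch) / sum(parent) * entropy(ch)
                                 for ch in children)
    return gain / split_info(children)

# 2-way split: SplitINFO = 1; 4-way split: SplitINFO = 2 -> gain ratio halved.
print(gain_ratio([4, 4], [[4, 0], [0, 4]]))                    # 1.0
print(gain_ratio([4, 4], [[2, 0], [2, 0], [0, 2], [0, 2]]))    # 0.5
```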
46. Splitting Criteria Based on Classification Error
Classification error at a node t:
Error(t) = 1 − max_i P(i|t)
Measures the misclassification error made by a node.
Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.
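A small sketch comparing the three impurity measures on a two-class node: all three are zero for a pure node and maximal for the uniform one, but they differ in between:

```python
import math

def gini(ps):     return 1 - sum(p * p for p in ps)
def entropy(ps):  return -sum(p * math.log2(p) for p in ps if p > 0)
def error(ps):    return 1 - max(ps)   # Error(t) = 1 - max_i P(i|t)

for p in (0.0, 0.25, 0.5):
    ps = [p, 1 - p]
    print(p, round(gini(ps), 3), round(entropy(ps), 3), round(error(ps), 3))
# p = 0.5 (uniform) maximizes all three; p = 0 (pure) zeroes all three.
```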
51. Stopping Criteria for Tree Induction
Stop expanding a node when all the
records belong to the same class
Stop expanding a node when all the
records have similar attribute values
Early termination (to be discussed later)
52. Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets
53. Example: C4.5
Simple depth-first construction.
Uses information gain.
Sorts continuous attributes at each node.
Needs the entire dataset to fit in memory; unsuitable for large datasets (would need out-of-core sorting).
You can download the software from:
http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
54. Practical Issues of Classification
Underfitting and Overfitting
Missing Values
Costs of Classification
55. Underfitting and Overfitting (Example)
500 circular and 500 triangular data points.
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
58. Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
59. Notes on Overfitting
Overfitting results in decision trees that
are more complex than necessary
Training error no longer provides a good
estimate of how well the tree will perform
on previously unseen records
Need new ways for estimating errors
60. Estimating Generalization Errors
Re-substitution errors: error on the training set (Σ e(t)).
Generalization errors: error on the test set (Σ e’(t)).
Methods for estimating generalization errors:
Optimistic approach: e’(t) = e(t)
Pessimistic approach: for each leaf node, e’(t) = e(t) + 0.5; total errors: e’(T) = e(T) + N × 0.5 (N = number of leaf nodes).
For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
Reduced error pruning (REP): uses a validation data set to estimate generalization error.
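The pessimistic estimate as a tiny sketch, reproducing the 30-leaf example above (the 0.5-per-leaf penalty is the slide's choice; the function name is illustrative):

```python
def pessimistic_error(train_errors, num_leaves, num_instances, penalty=0.5):
    """e'(T) = (e(T) + N * penalty) / num_instances, per the formula above."""
    return (train_errors + num_leaves * penalty) / num_instances

print(pessimistic_error(10, 30, 1000))   # 0.025 -> the 2.5% estimate
```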
61. Occam’s Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
For complex models, there is a greater chance that the model was fitted accidentally to errors in the data.
Therefore, one should include model complexity when evaluating a model.
62. Minimum Description Length (MDL)
Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
Cost is the number of bits needed for encoding; search for the least costly model.
Cost(Data|Model) encodes the misclassification errors.
Cost(Model) uses node encoding (number of children) plus splitting-condition encoding.
(Figure: two copies of a record table X1…Xn, one with known class labels y and one with the labels unknown (“?”), illustrating the sender/receiver view: transmitting the model plus its errors is the cheapest way to communicate the labels.)
63. How to Address Overfitting
Pre-pruning (early stopping rule):
Stop the algorithm before it becomes a fully grown tree.
Typical stopping conditions for a node:
Stop if all instances belong to the same class.
Stop if all the attribute values are the same.
More restrictive conditions:
Stop if the number of instances is less than some user-specified threshold.
Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
64. How to Address Overfitting…
Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a
bottom-up fashion
If generalization error improves after trimming,
replace sub-tree by a leaf node.
Class label of leaf node is determined from
majority class of instances in the sub-tree
Can use MDL for post-pruning
65. Example of Post-Pruning
Root node before splitting: Class = Yes: 20, Class = No: 10; training error = 10/30.
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30.
The split on A? produces four children A1–A4 with class counts:
A1: Yes = 8, No = 4
A2: Yes = 3, No = 4
A3: Yes = 4, No = 1
A4: Yes = 5, No = 1
Training error (after splitting) = 9/30.
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30.
Since 11/30 > 10.5/30, PRUNE!
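The same pruning decision as a sketch: compare the pessimistic error of the node as a single leaf against the split version (class counts taken from the example above; the helper name is illustrative):

```python
def pessimistic(errors, leaves, penalty=0.5):
    return errors + leaves * penalty

# Root: 20 Yes / 10 No -> 10 training errors as a single leaf.
before = pessimistic(errors=10, leaves=1)              # 10.5
# Children A1..A4: (Yes, No) = (8,4), (3,4), (4,1), (5,1) -> 4+3+1+1 = 9 errors.
children = [(8, 4), (3, 4), (4, 1), (5, 1)]
after = pessimistic(errors=sum(min(c) for c in children),
                    leaves=len(children))              # 11.0
print("PRUNE!" if after >= before else "keep split")   # 11/30 > 10.5/30 -> PRUNE!
```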
66. Examples of Post-pruning
Case 1: a subtree whose two leaves have class counts (C0: 11, C1: 3) and (C0: 2, C1: 4).
Case 2: a subtree whose two leaves have class counts (C0: 14, C1: 3) and (C0: 2, C1: 2).
Optimistic error? Don’t prune in either case.
Pessimistic error? Don’t prune case 1; prune case 2.
Reduced error pruning? Depends on the validation set.
67. Handling Missing Attribute Values
Missing values affect decision tree
construction in three different ways:
Affects how impurity measures are computed
Affects how to distribute instance with missing
value to child nodes
Affects how a test instance with missing value
is classified
68. Computing Impurity Measure
Training data: the same ten records as before, except that record 10 has a missing Refund value:
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes
Count matrix for a split on Refund:
              Class=Yes  Class=No
Refund=Yes    0          3
Refund=No     2          4
Refund=?      1          0
Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813
Split on Refund (using the 9 records with known values):
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.297
(The factor 0.9 is the fraction of records with a known Refund value.)
69. Distribute Instances
The nine records with a known Refund value go down the tree normally:
Refund = Yes: Class=Yes: 0, Class=No: 3
Refund = No:  Class=Yes: 2, Class=No: 4
Record 10 (Refund = ?, Single, 90K, Class = Yes) is sent down both branches with fractional weights:
Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
Assign the record to the left (Yes) child with weight 3/9 and to the right (No) child with weight 6/9, giving:
Refund = Yes: Class=Yes: 0 + 3/9, Class=No: 3
Refund = No:  Class=Yes: 2 + 6/9, Class=No: 4
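A sketch of this fractional-weight distribution (the distribute function and the dict-based record format are illustrative assumptions):

```python
from collections import Counter, defaultdict

def distribute(records, attribute):
    """Send records down a split; a record with a missing value (None) goes to
    every child with weight proportional to the known-value distribution."""
    known = [r for r in records if r[attribute] is not None]
    freq = Counter(r[attribute] for r in known)
    total = sum(freq.values())
    children = defaultdict(list)              # child value -> [(record, weight)]
    for r in records:
        if r[attribute] is not None:
            children[r[attribute]].append((r, 1.0))
        else:
            for value, count in freq.items():
                children[value].append((r, count / total))
    return children

records = [{"Refund": "Yes"}] * 3 + [{"Refund": "No"}] * 6 + [{"Refund": None}]
parts = distribute(records, "Refund")
print([round(w, 3) for _, w in parts["Yes"]])   # three 1.0s plus 3/9 ≈ 0.333
print([round(w, 3) for _, w in parts["No"]])    # six 1.0s plus 6/9 ≈ 0.667
```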
70. Classify Instances
New record: Tid = 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?
Following the tree (Refund → MarSt → TaxInc), the record takes the Refund = No branch and arrives at the MarSt node with its Marital Status missing.
Weighted class counts at the MarSt node:
            Married  Single  Divorced  Total
Class=No    3        1       0         4
Class=Yes   6/9      1       1         2.67
Total       3.67     2       1         6.67
Probability that Marital Status = Married is 3.67/6.67.
Probability that Marital Status = {Single, Divorced} is 3/6.67.
71. Scalable Decision Tree Induction Methods
SLIQ (EDBT’96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
SPRINT (VLDB’96 — J. Shafer et al.): constructs an attribute-list data structure.
PUBLIC (VLDB’98 — Rastogi & Shim): integrates tree splitting and tree pruning: stops growing the tree earlier.
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti): builds an AVC-list (attribute, value, class label).
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh): uses bootstrapping to create several small samples.