The document discusses classification techniques in data mining. Classification involves using a training dataset containing records with attributes and class labels to build a model that predicts the class labels of new records. Several classification techniques are described, including decision trees which use a tree structure to split the data into purer subsets based on attribute values. The Hunt's algorithm for generating decision trees is explained in detail, including how it recursively splits the training data based on evaluating attribute tests until reaching single class leaf nodes. Design considerations for decision tree induction like attribute test methods and stopping criteria are also covered.
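The recursive splitting described above can be sketched in a few lines. This is a minimal illustration, not Hunt's algorithm in full: the attribute test here simply takes the next available attribute, whereas real decision tree induction chooses the test by an impurity measure such as Gini or entropy, and the toy records below are made-up examples.

```python
from collections import Counter

def hunts_algorithm(records, attributes):
    """Recursively grow a decision tree (minimal sketch of Hunt's algorithm).

    Each record is (features_dict, label). Returns a nested dict tree,
    with class labels at the leaves.
    """
    labels = [label for _, label in records]
    # Stopping criterion: all records share one class -> pure leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping criterion: no attributes left -> majority-class leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Attribute test (naively the next attribute; real implementations
    # choose by an impurity measure such as Gini or entropy).
    attr = attributes[0]
    tree = {attr: {}}
    for v in {feats[attr] for feats, _ in records}:
        subset = [(f, l) for f, l in records if f[attr] == v]
        tree[attr][v] = hunts_algorithm(subset, attributes[1:])
    return tree

data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "stay"),
        ({"outlook": "rain", "windy": "no"}, "play")]
tree = hunts_algorithm(data, ["outlook", "windy"])
print(tree)
```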
The document provides an introduction to linear algebra concepts for machine learning. It defines vectors as ordered tuples of numbers that express magnitude and direction. Vector spaces are sets that contain all linear combinations of vectors. Linear independence and basis of vector spaces are discussed. Norms measure the magnitude of a vector, with examples given of the 1-norm and 2-norm. Inner products measure the correlation between vectors. Matrices can represent linear operators between vector spaces. Key linear algebra concepts such as trace, determinant, and matrix decompositions are outlined for machine learning applications.
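A quick numeric illustration of the norms and inner products described above, using NumPy (an assumption; the document itself names no library):

```python
import numpy as np

v = np.array([3.0, -4.0])
w = np.array([1.0, 2.0])

# 1-norm: sum of absolute values -> 7.0
norm1 = np.sum(np.abs(v))
# 2-norm (Euclidean norm): square root of the sum of squares -> 5.0
norm2 = np.linalg.norm(v)
# Inner product measures the alignment/correlation between two vectors.
dot = np.dot(v, w)            # 3*1 + (-4)*2 = -5.0
# Cosine similarity normalizes the inner product into [-1, 1].
cos = dot / (np.linalg.norm(v) * np.linalg.norm(w))
print(norm1, norm2, dot, cos)
```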
Introduction to text classification using Naive Bayes (Dhwaj Raj)
This document provides an overview of text classification and the Naive Bayes classification method. It defines text classification as assigning categories, topics or genres to documents. It describes classification methods like hand-coded rules and supervised machine learning. It explains the bag-of-words representation and how Naive Bayes classification works by calculating the probability of a document belonging to a class using Bayes' rule and independence assumptions. It discusses parameter estimation and how to build a multinomial Naive Bayes classifier for text classification tasks.
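The multinomial Naive Bayes training and scoring described above can be sketched as follows. This is a minimal illustration over made-up toy documents: it works in log space, applies add-one (Laplace) smoothing during parameter estimation, and simply skips words unseen in training at prediction time.

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (word_list, class_label).
    Returns log priors and add-one-smoothed log likelihoods."""
    classes = {c for _, c in docs}
    vocab = {w for words, _ in docs for w in words}
    log_prior, log_lik = {}, {}
    for c in classes:
        class_docs = [words for words, lab in docs if lab == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for words in class_docs for w in words)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing avoids zero probabilities.
        log_lik[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik

def predict(words, log_prior, log_lik):
    # Bayes' rule with the bag-of-words independence assumption:
    # argmax_c log P(c) + sum_w log P(w|c). Unseen words contribute 0.
    def score(c):
        return log_prior[c] + sum(log_lik[c].get(w, 0.0) for w in words)
    return max(log_prior, key=score)

docs = [("good great fun".split(), "pos"),
        ("bad boring awful".split(), "neg"),
        ("great movie".split(), "pos")]
print(predict("great fun movie".split(), *train_multinomial_nb(docs)))
```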
Process sentiments in NLP with Naive Bayes, Random Forest, Support Vector Machines, and more.
Thanks for your time. If you enjoyed this short slide deck, there are many more topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
The document discusses different approaches for concept learning from examples, including viewing it as a search problem to find the hypothesis that best fits the training examples. It also describes learning over the general-to-specific ordering of hypotheses, where the goal is to find the maximally specific hypothesis consistent with the positive training examples by starting with the most specific hypothesis and generalizing its constraints just enough to cover each example. The document also discusses the version space and candidate elimination algorithms for obtaining the version space of all hypotheses consistent with the training data.
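The maximally-specific-hypothesis idea (the Find-S procedure in concept learning) can be sketched as below; the attribute values and the `"?"` wildcard convention are illustrative assumptions:

```python
def find_s(examples):
    """Find-S: start with the most specific hypothesis (the first positive
    example itself) and generalize it just enough to cover each further
    positive example. "?" means the attribute is unconstrained."""
    positives = [x for x, label in examples if label]
    h = list(positives[0])                 # most specific consistent start
    for x in positives[1:]:
        # Keep matching constraints; relax mismatches to the wildcard.
        h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

examples = [
    (("sunny", "warm", "normal"), True),
    (("sunny", "warm", "high"),   True),
    (("rainy", "cold", "high"),   False),   # negatives are ignored by Find-S
]
print(find_s(examples))
```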
The document discusses various problem solving techniques in artificial intelligence, including different types of problems, components of well-defined problems, measuring problem solving performance, and different search strategies. It describes single-state and multiple-state problems, and defines the key components of a problem including the data type, operators, goal test, and path cost. It also explains different search strategies such as breadth-first search, uniform cost search, depth-first search, depth-limited search, iterative deepening search, and bidirectional search.
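Breadth-first search, the first of the strategies listed, can be sketched over a small adjacency-list graph (the graph itself is a made-up example); it returns a shallowest path from start to goal:

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """BFS over an adjacency dict; returns a shallowest path to goal,
    or None if the goal is unreachable."""
    frontier = deque([[start]])     # FIFO queue of partial paths
    explored = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:            # goal test
            return path
        for nbr in graph.get(node, []):
            if nbr not in explored:
                explored.add(nbr)
                frontier.append(path + [nbr])
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(breadth_first_search(graph, "A", "E"))
```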
In this tutorial, we will learn the following topics:
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
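The "CART Training Algorithm" and "Gini Impurity or Entropy?" topics above can be illustrated with a tiny CART-style split search on one numeric feature. This is a sketch of the core idea (one greedy split), not the full recursive algorithm; the data is made-up:

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_k^2). 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """CART-style search for the threshold on a single numeric feature
    that minimizes the weighted Gini impurity of the two child nodes."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if cost < best[1]:
            best = (t, cost)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))   # threshold 3.0 separates the classes perfectly
```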
This document provides a step-by-step guide to learning R. It begins with the basics of R, including downloading and installing R and R Studio, understanding the R environment and basic operations. It then covers R packages, vectors, data frames, scripts, and functions. The second section discusses data handling in R, including importing data from external files like CSV and SAS files, working with datasets, creating new variables, data manipulations, sorting, removing duplicates, and exporting data. The document is intended to guide users through the essential skills needed to work with data in R.
Hive is a data warehouse infrastructure tool that allows users to query and analyze large datasets stored in Hadoop. It uses a SQL-like language called HiveQL to process structured data stored in HDFS. Hive stores metadata about the schema in a database and processes data into HDFS. It provides a familiar interface for querying large datasets using SQL-like queries and scales easily to large datasets.
UNIT - 5: Data Warehousing and Data Mining (Nandakumar P)
UNIT-V
Mining Object, Spatial, Multimedia, Text, and Web Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects – Spatial Data Mining – Multimedia Data Mining – Text Mining – Mining the World Wide Web.
1. The document provides an overview of key concepts in data science and machine learning including the data science process, types of data, machine learning techniques, and Python tools used for machine learning.
2. It describes the typical 6 step data science process: setting goals, data retrieval, data preparation, exploration, modeling, and presentation.
3. Different types of data are discussed including structured, unstructured, machine-generated, graph-based, and audio/video data.
4. Machine learning techniques can be supervised, unsupervised, or semi-supervised depending on whether labeled data is used.
The Text Classification slides contains the research results about the possible natural language processing algorithms. Specifically, it contains the brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
Reinforcement learning is a machine learning technique where an agent learns how to behave in an environment by receiving rewards or punishments for its actions. The goal of the agent is to learn an optimal policy that maximizes long-term rewards. Reinforcement learning can be applied to problems like game playing, robot control, scheduling, and economic modeling. The reinforcement learning process involves an agent interacting with an environment to learn through trial-and-error using state, action, reward, and policy. Common algorithms include Q-learning which uses a Q-table to learn the optimal action-selection policy.
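The Q-table update at the heart of Q-learning can be sketched directly from its formula; the states, actions, and hyperparameter values below are arbitrary illustrations:

```python
def q_learning_step(Q, state, action, reward, next_state,
                    alpha=0.5, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# Toy Q-table: two states, two actions, all values initialized to zero.
Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 0.0}}
q_learning_step(Q, "s0", "right", reward=1.0, next_state="s1")
print(Q["s0"]["right"])   # 0.5 after one update
```

Repeating the same rewarding transition moves the estimate further toward the reward, which is the trial-and-error convergence the document describes.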
Tweets Classification using Naive Bayes and SVM (Trilok Sharma)
This document summarizes a project to automatically classify tweets into predefined Wikipedia categories. It discusses using three algorithms - Naive Bayes, SVM, and rule-based - to classify tweets into 11 categories like business, sports, politics etc. It explains the concepts used like removing outliers, stemming, spell checking. Accuracy results using 10-fold cross validation show SVM and rule-based achieving over 80% accuracy on most categories. The project analyzed real-time tweet data using an API and achieved high performance speeds for classification.
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
The document discusses decision trees and random forest algorithms. It begins with an outline and defines the problem as determining target attribute values for new examples given a training data set. It then explains key requirements like discrete classes and sufficient data. The document goes on to describe the principles of decision trees, including entropy and information gain as criteria for splitting nodes. Random forests are introduced as consisting of multiple decision trees to help reduce variance. The summary concludes by noting out-of-bag error rate can estimate classification error as trees are added.
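Entropy and information gain, the splitting criteria mentioned above, reduce to a few lines; the parent/child label lists are made-up examples:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children` subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes"] * 5 + ["no"] * 5          # 50/50 mix: entropy = 1.0 bit
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(round(information_gain(parent, split), 3))
```

A split is chosen at each node to maximize this gain; a perfectly separating split of the same parent would score a full 1.0 bit.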
The document discusses statistical learning approaches like Naive Bayes and Bayesian networks. It provides an example of using Bayesian learning to predict the flavor of candy in a bag based on observations, calculating the probability of hypotheses given data. The document also covers parameter estimation, the naive Bayes assumption of conditional independence between variables, and using maximum likelihood estimates from training data to learn probabilities.
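The Bayesian update in the candy example can be sketched generically. The two hypothetical bags and their candy ratios below are assumed for illustration (they echo, but do not reproduce, the document's example):

```python
def posterior(priors, likelihoods, observations):
    """Bayesian update: P(h|D) is proportional to P(h) * prod_i P(d_i|h),
    then normalized across hypotheses."""
    post = dict(priors)
    for h in post:
        for d in observations:
            post[h] *= likelihoods[h][d]
    total = sum(post.values())
    return {h: p / total for h, p in post.items()}

# Hypothetical bags: h_cherry is 75% cherry candy, h_lime is 75% lime.
priors = {"h_cherry": 0.5, "h_lime": 0.5}
likelihoods = {"h_cherry": {"cherry": 0.75, "lime": 0.25},
               "h_lime":   {"cherry": 0.25, "lime": 0.75}}
post = posterior(priors, likelihoods, ["lime", "lime"])
print(post)   # two lime draws shift belief strongly toward h_lime
```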
This document discusses decision trees, a classification technique in data mining. It defines classification as assigning class labels to unlabeled data based on a training set. Decision trees generate a tree structure to classify data, with internal nodes representing attributes, branches representing attribute values, and leaf nodes holding class labels. An algorithm is used to recursively split the data set into purer subsets based on attribute tests until each subset belongs to a single class. The tree can then classify new examples by traversing it from root to leaf.
ID3 and C4.5 are algorithms used to generate decision trees, developed by Ross Quinlan and typically used in the machine learning and natural language processing domains; the document gives an overview of these algorithms with illustrated examples.
The document discusses social recommender systems and how they can improve on traditional collaborative filtering approaches by incorporating trust relationships between users. It outlines research that used trust propagation algorithms to make recommendations for cold start users who lack sufficient rating histories. The author proposes to further explore how different types of social relationships (e.g. trust, friendship) differentially impact recommendation performance and to evaluate social and similarity-based collaborative filtering approaches.
Heart disease prediction using machine learning algorithms (Kedar Damkondwar)
The document summarizes a seminar presentation on predicting heart disease using machine learning algorithms. It introduces the problem of heart disease prediction and the motivation to develop an automated system to assist in diagnosis and treatment. It reviews several existing studies that used methods like decision trees, naive Bayes, neural networks, and support vector machines to predict heart disease risk factors. The objectives of the presented model are to develop a predictive system using machine learning techniques to analyze heart data and help reduce medical costs and human biases. The proposed model and applications in medical institutions and hospitals are also discussed.
Data Mining: Mining associations and correlations (Datamining Tools)
Market basket analysis examines customer purchasing patterns to determine which items are commonly bought together. This can help retailers with marketing strategies like product bundling and complementary product placement. Association rule mining is a two-step process: it first finds frequent itemsets that occur together above a minimum support threshold, and then generates strong association rules from these frequent itemsets based on minimum support and confidence. Various techniques can improve the efficiency of the Apriori algorithm for mining association rules, such as hashing, transaction reduction, partitioning, sampling, and dynamic itemset counting. Pruning strategies like item merging, sub-itemset pruning, and item skipping can also enhance efficiency. Constraint-based mining allows users to specify constraints on the types of patterns to be mined.
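The first step of that two-step process, level-wise frequent-itemset mining, can be sketched as follows. This is a bare Apriori sketch: it performs the level-wise join but omits the subset-based prune step and the efficiency techniques the document lists, and the baskets are made-up:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (minimal Apriori sketch).
    transactions: list of sets. Returns {frozenset: support}."""
    n = len(transactions)
    candidates = list({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    while candidates:
        # Count support for each candidate; keep those above the threshold.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        survivors = {c: v / n for c, v in counts.items()
                     if v / n >= min_support}
        frequent.update(survivors)
        # Join step: combine surviving k-itemsets into (k+1)-item candidates.
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == len(a) + 1})
    return frequent

baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
           {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
result = apriori(baskets, min_support=0.5)
print(result[frozenset({"diapers", "beer"})])   # 0.5
```

Rule generation (the second step) would then filter rules like {diapers} -> {beer} by confidence over these frequent itemsets.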
This document discusses representation learning on graphs. It begins by explaining why graph representation learning is important as graphs are pervasive in many scientific disciplines. It then discusses various techniques for graph representation learning including graph embedding methods like DeepWalk and Node2Vec, message passing neural networks, and graph generation methods like variational autoencoders. The document concludes by discussing challenges in graph reasoning to deduce knowledge from graphs in response to queries.
This course is all about data mining and how we get optimized results. It covers all the types of data mining and how we use these techniques.
The document describes the author's experience with evaluating a new machine learning algorithm. It discusses how:
- The author and a student developed a new algorithm that performed better than state-of-the-art according to standard evaluation practices, winning an award, but the NIH disagreed.
- This surprised the author and led to a realization that evaluation is more complex than initially understood, motivating the author to learn more about evaluation methods from other fields.
- The author has since co-written a book on evaluating machine learning algorithms from a classification perspective, and the tutorial presented is based on this book to provide an overview of issues in evaluation and available resources.
Data Science Training | Data Science For Beginners | Data Science With Python... (Simplilearn)
This Data Science presentation will help you understand what is Data Science, who is a Data Scientist, what does a Data Scientist do and also how Python is used for Data Science. Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. This Data Science tutorial will help you establish your skills at analytical techniques using Python. With this Data Science video, you’ll learn the essential concepts of Data Science with Python programming and also understand how data acquisition, data preparation, data mining, model building & testing, data visualization is done. This Data Science tutorial is ideal for beginners who aspire to become a Data Scientist.
This Data Science presentation will cover the following topics:
1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. A data scientist is the pinnacle rank in an analytics organization. Glassdoor has ranked data scientist first in the 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes, data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries
3. Understand the essential concepts of Python programming such as data types, tuples, lists, dicts, basic operators and functions
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions.
Learn more at: https://www.simplilearn.com
The document discusses classification, which is the task of assigning objects to predefined categories. Classification involves learning a target function that maps attribute sets to class labels. Decision tree induction is a common classification technique that builds a tree-like model by recursively splitting the data into purer subsets based on attribute values. The document covers concepts like decision tree structure, algorithms like Hunt's algorithm for building trees, evaluating performance, and handling different attribute types when constructing decision tree models.
Classification: Basic Concepts and Decision Trees (sathish sak)
Given a collection of records (the training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
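The train/test split and accuracy measurement described above can be sketched in plain Python; the toy records and the threshold "model" below are illustrative assumptions:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle and split records into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the label."""
    correct = sum(model(features) == label for features, label in test_set)
    return correct / len(test_set)

# Toy data: the class is determined by a single numeric threshold.
records = [((x,), "high" if x > 5 else "low") for x in range(10)]
train, test = train_test_split(records)
model = lambda feats: "high" if feats[0] > 5 else "low"
print(accuracy(model, test))   # 1.0, since the model matches the labeling rule
```

In practice the model would be built from `train` (e.g. by decision tree induction) and only then scored on the held-out `test` set.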
Hive is a data warehouse infrastructure tool that allows users to query and analyze large datasets stored in Hadoop. It uses a SQL-like language called HiveQL to process structured data stored in HDFS. Hive stores metadata about the schema in a database and processes data into HDFS. It provides a familiar interface for querying large datasets using SQL-like queries and scales easily to large datasets.
UNIT - 5: Data Warehousing and Data MiningNandakumar P
UNIT-V
Mining Object, Spatial, Multimedia, Text, and Web Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects – Spatial Data Mining – Multimedia Data Mining – Text Mining – Mining the World Wide Web.
1. The document provides an overview of key concepts in data science and machine learning including the data science process, types of data, machine learning techniques, and Python tools used for machine learning.
2. It describes the typical 6 step data science process: setting goals, data retrieval, data preparation, exploration, modeling, and presentation.
3. Different types of data are discussed including structured, unstructured, machine-generated, graph-based, and audio/video data.
4. Machine learning techniques can be supervised, unsupervised, or semi-supervised depending on whether labeled data is used.
The Text Classification slides contains the research results about the possible natural language processing algorithms. Specifically, it contains the brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
Reinforcement learning is a machine learning technique where an agent learns how to behave in an environment by receiving rewards or punishments for its actions. The goal of the agent is to learn an optimal policy that maximizes long-term rewards. Reinforcement learning can be applied to problems like game playing, robot control, scheduling, and economic modeling. The reinforcement learning process involves an agent interacting with an environment to learn through trial-and-error using state, action, reward, and policy. Common algorithms include Q-learning which uses a Q-table to learn the optimal action-selection policy.
Tweets Classification using Naive Bayes and SVMTrilok Sharma
This document summarizes a project to automatically classify tweets into predefined Wikipedia categories. It discusses using three algorithms - Naive Bayes, SVM, and rule-based - to classify tweets into 11 categories like business, sports, politics etc. It explains the concepts used like removing outliers, stemming, spell checking. Accuracy results using 10-fold cross validation show SVM and rule-based achieving over 80% accuracy on most categories. The project analyzed real-time tweet data using an API and achieved high performance speeds for classification.
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
The document discusses decision trees and random forest algorithms. It begins with an outline and defines the problem as determining target attribute values for new examples given a training data set. It then explains key requirements like discrete classes and sufficient data. The document goes on to describe the principles of decision trees, including entropy and information gain as criteria for splitting nodes. Random forests are introduced as consisting of multiple decision trees to help reduce variance. The summary concludes by noting out-of-bag error rate can estimate classification error as trees are added.
The document discusses statistical learning approaches like Naive Bayes and Bayesian networks. It provides an example of using Bayesian learning to predict the flavor of candy in a bag based on observations, calculating the probability of hypotheses given data. The document also covers parameter estimation, the naive Bayes assumption of conditional independence between variables, and using maximum likelihood estimates from training data to learn probabilities.
This document discusses decision trees, a classification technique in data mining. It defines classification as assigning class labels to unlabeled data based on a training set. Decision trees generate a tree structure to classify data, with internal nodes representing attributes, branches representing attribute values, and leaf nodes holding class labels. An algorithm is used to recursively split the data set into purer subsets based on attribute tests until each subset belongs to a single class. The tree can then classify new examples by traversing it from root to leaf.
ID3, C4.5 :used to generate a decision tree developed by Ross Quinlan typically used in the machine learning and natural language processing domains, overview about these algorithms with illustrated examples
The document discusses social recommender systems and how they can improve on traditional collaborative filtering approaches by incorporating trust relationships between users. It outlines research that used trust propagation algorithms to make recommendations for cold start users who lack sufficient rating histories. The author proposes to further explore how different types of social relationships (e.g. trust, friendship) differentially impact recommendation performance and to evaluate social and similarity-based collaborative filtering approaches.
Heart disease prediction using machine learning algorithm Kedar Damkondwar
The document summarizes a seminar presentation on predicting heart disease using machine learning algorithms. It introduces the problem of heart disease prediction and the motivation to develop an automated system to assist in diagnosis and treatment. It reviews several existing studies that used methods like decision trees, naive Bayes, neural networks, and support vector machines to predict heart disease risk factors. The objectives of the presented model are to develop a predictive system using machine learning techniques to analyze heart data and help reduce medical costs and human biases. The proposed model and applications in medical institutions and hospitals are also discussed.
Data Mining: Mining ,associations, and correlationsDatamining Tools
Market basket analysis examines customer purchasing patterns to determine which items are commonly bought together. This can help retailers with marketing strategies like product bundling and complementary product placement. Association rule mining is a two-step process that first finds frequent item sets that occur together above a minimum support threshold, and then generates strong association rules from these frequent item sets based on minimum support and confidence. Various techniques can improve the efficiency of the Apriori algorithm for mining association rules, such as hashing, transaction reduction, partitioning, sampling, and dynamic item-set counting. Pruning strategies like item merging, sub-item-set pruning, and item skipping can also enhance efficiency. Constraint-based mining allows users to specify constraints on the type of
This document discusses representation learning on graphs. It begins by explaining why graph representation learning is important as graphs are pervasive in many scientific disciplines. It then discusses various techniques for graph representation learning including graph embedding methods like DeepWalk and Node2Vec, message passing neural networks, and graph generation methods like variational autoencoders. The document concludes by discussing challenges in graph reasoning to deduce knowledge from graphs in response to queries.
This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques
The document describes the author's experience with evaluating a new machine learning algorithm. It discusses how:
- The author and a student developed a new algorithm that performed better than state-of-the-art according to standard evaluation practices, winning an award, but the NIH disagreed.
- This surprised the author and led to an realization that evaluation is more complex than initially understood, motivating the author to learn more about evaluation methods from other fields.
- The author has since co-written a book on evaluating machine learning algorithms from a classification perspective, and the tutorial presented is based on this book to provide an overview of issues in evaluation and available resources.
Data Science Training | Data Science For Beginners | Data Science With Python...Simplilearn
This Data Science presentation will help you understand what is Data Science, who is a Data Scientist, what does a Data Scientist do and also how Python is used for Data Science. Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. This Data Science tutorial will help you establish your skills at analytical techniques using Python. With this Data Science video, you’ll learn the essential concepts of Data Science with Python programming and also understand how data acquisition, data preparation, data mining, model building & testing, data visualization is done. This Data Science tutorial is ideal for beginners who aspire to become a Data Scientist.
This Data Science presentation will cover the following topics:
1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. A data scientist is the pinnacle rank in an analytics organization. Glassdoor has ranked data scientist first in the 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes, data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries
3. Understand the essential concepts of Python programming such as data types, tuples, lists, dicts, basic operators and functions
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions.
Learn more at: https://www.simplilearn.com
The document discusses classification, which is the task of assigning objects to predefined categories. Classification involves learning a target function that maps attribute sets to class labels. Decision tree induction is a common classification technique that builds a tree-like model by recursively splitting the data into purer subsets based on attribute values. The document covers concepts like decision tree structure, algorithms like Hunt's algorithm for building trees, evaluating performance, and handling different attribute types when constructing decision tree models.
Classification: Basic Concepts and Decision Trees (sathish sak)
Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
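This train/test protocol can be sketched in a few lines of Python. Everything below is invented for illustration: the records and the single-threshold "model" are stand-ins, not the document's data or method.

```python
# Toy labeled records: (annual_income, defaulted) pairs; purely illustrative.
data = [(125, "no"), (100, "no"), (70, "no"), (120, "no"),
        (95, "yes"), (60, "no"), (220, "yes"), (85, "no")]

# Hold out the last quarter as a test set; build the model on the rest.
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]

# A trivial "model": predict "yes" (default) when income is at least the
# mean income of defaulters observed in the training set.
defaulter_incomes = [x for x, y in train if y == "yes"]
threshold = sum(defaulter_incomes) / len(defaulter_incomes)

def predict(income):
    return "yes" if income >= threshold else "no"

# Accuracy is measured only on the held-out test set, never on train.
correct = sum(predict(x) == y for x, y in test)
accuracy = correct / len(test)
print(f"threshold={threshold}, test accuracy={accuracy:.2f}")
```

A real classifier would replace the threshold rule, but the split-then-validate structure stays the same.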
Classification is a supervised machine learning task where a model is trained on labeled data to predict the class labels of new unlabeled data. The given document discusses classification and decision tree induction for classification. It defines classification, provides examples of classification tasks, and describes how decision trees are constructed through a greedy top-down induction approach that aims to split the training data into homogeneous subsets based on measures of impurity like Gini index or information gain. The goal is to build an accurate model that can classify new unlabeled data.
The document discusses decision tree classification and the decision tree induction algorithm. It begins by defining classification and providing an example of classifying loan applicants as likely to cheat. It then discusses evaluating classification methods and lists common classification techniques, including decision trees. The remainder of the document focuses on decision tree classification and induction. It provides examples of decision trees built from sample training data and the process of applying a decision tree to classify new test data. It also discusses issues in Hunt's algorithm for decision tree induction, such as how to specify test conditions for splitting nodes and how to determine the best attribute and split to partition data.
This document provides an overview of classification tasks in data mining. Classification involves using a training dataset to build a model that can predict the class of new, unlabeled data. The document discusses different classification techniques like decision trees and neural networks. It also explains how decision trees are built using a greedy algorithm that recursively splits the data based on attributes. The goal is to find the split that optimizes some criterion at each node. Issues like determining the optimal split condition and knowing when to stop splitting are also covered.
The document discusses classification, which is a type of data mining task where a model is created to predict the class of new data based on attributes of training data. Specifically, it covers:
- The goal of classification is to assign classes to new, unseen data as accurately as possible based on a model built from a training dataset.
- Common classification techniques include decision trees, rules, neural networks, naive Bayes, and support vector machines.
- Decision tree induction works by splitting the training data into purer subsets based on attribute tests that optimize an impurity measure, with the splits forming the branches of the tree.
- Key considerations in decision tree induction include how to specify the attribute test conditions for splits.
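The impurity-based split evaluation referred to above can be made concrete. Below is a minimal sketch of the Gini index and the size-weighted impurity of a candidate split; the example labels are invented.

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

def weighted_gini(partitions):
    """Impurity of a candidate split: size-weighted average of child Gini."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

# A pure node has impurity 0; a 50/50 node has the maximum for 2 classes.
print(gini(["no"] * 4))                    # 0.0
print(gini(["no", "no", "yes", "yes"]))    # 0.5

# Splitting 6 records into one pure child and one mixed child:
print(weighted_gini([["no", "no", "no"], ["no", "yes", "yes"]]))
```

The induction algorithm would evaluate `weighted_gini` for every candidate test and pick the split with the lowest value.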
The document provides an overview of decision tree learning algorithms:
- Decision trees are a supervised learning method that can represent discrete functions and efficiently process large datasets.
- Basic algorithms like ID3 use a top-down greedy search to build decision trees by selecting attributes that best split the training data at each node.
- The quality of a split is typically measured by metrics like information gain, with the goal of creating pure, homogeneous child nodes.
- Fully grown trees may overfit, so algorithms incorporate a bias toward smaller, simpler trees with informative splits near the root.
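The information-gain criterion mentioned above reduces to a small amount of arithmetic: the entropy of the parent node minus the weighted entropy of the children. A sketch, with an invented toy split:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a node's class distribution, in bits."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in props if p > 0)

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 3 + ["no"] * 3       # maximally mixed: entropy is 1 bit
split = [["yes"] * 3, ["no"] * 3]       # perfect split: both children pure
print(entropy(parent))                  # 1.0
print(information_gain(parent, split))  # 1.0
```

ID3-style induction picks, at each node, the attribute whose split maximizes this gain.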
The document discusses classification and decision trees. It provides an overview of classification, including defining it as discriminating between classes of objects and describing the classification process. It then details decision trees, including that they are flow-chart structures with internal nodes as tests on attributes, branches as outcomes, and leaf nodes as class labels. The document explains Hunt's algorithm for generating a decision tree through recursive splitting of training data based on optimizing an attribute test criterion until terminating conditions are met.
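Hunt's recursive scheme described here can be outlined in a few lines. This is a simplified sketch: the "best attribute" selection is replaced by taking attributes in a fixed order, and the records, attribute names, and labels are all invented.

```python
def hunt(records, attrs):
    """Grow a decision tree in the style of Hunt's algorithm.

    records: list of (attribute_dict, label); attrs: attribute names.
    Returns a class label (leaf) or a tuple (attr, {value: subtree}).
    """
    labels = [y for _, y in records]
    if len(set(labels)) == 1:          # terminating condition: pure node
        return labels[0]
    if not attrs:                      # no tests left: majority vote
        return max(set(labels), key=labels.count)
    attr = attrs[0]                    # stand-in for real split selection
    node = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        node[value] = hunt(subset, attrs[1:])
    return attr, node

def classify(tree, record):
    """Follow branches until a leaf; assumes values were seen in training."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[record[attr]]
    return tree

data = [({"owner": "yes", "married": "no"}, "no"),
        ({"owner": "no",  "married": "yes"}, "no"),
        ({"owner": "no",  "married": "no"}, "yes")]
tree = hunt(data, ["owner", "married"])
print(classify(tree, {"owner": "no", "married": "no"}))   # yes
```

A production algorithm (ID3, C4.5) would choose `attr` by an impurity measure rather than list order, but the recursion and stopping conditions are the same.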
Dr. Oner Celepcikay, ITS 632, Week 4: Classification (DustiBuckner14)
This document discusses machine learning classification methods. It provides an example of using a decision tree to classify tax refund requests based on attributes like marital status, income, and refund amount. It explains how decision trees are constructed using a training set of records to learn a model, which can then be applied to a test set to classify new records. The document outlines the process of splitting nodes based on attribute tests to minimize impurity measures like Gini index or classification error.
Date: September 25, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
The document discusses knowledge discovery and data mining from big data. It provides examples of areas where big data is generated such as business, science, health, and discusses how data mining can be used to extract useful information from large and complex datasets. Specifically, it discusses how data mining techniques like classification can be applied to areas like medical diagnosis, gene function prediction, and fraud detection. It then provides details on building classification models using decision trees and concepts like impurity measures, splitting criteria, and continuous attribute handling.
The document discusses decision trees and their algorithms. It introduces decision trees, describing their structure as having root, internal, and leaf nodes. It then discusses Hunt's algorithm, the basis for decision tree induction algorithms like ID3 and C4.5. Hunt's algorithm grows a decision tree recursively by partitioning training records into purer subsets based on attribute tests. The document also covers methods for expressing test conditions based on attribute type, measures for selecting the best split like information gain, and advantages and disadvantages of decision trees.
Data mining and machine learning techniques like classification and clustering are increasingly being used to extract useful information from large datasets. Data mining helps provide better customer service and aids scientists in hypothesis formation by analyzing patterns in data from various sources like business transactions, sensor networks, and scientific experiments. Classification algorithms such as decision trees can be applied to datasets containing attributes for individuals and a target variable to predict, like credit worthiness, to build a predictive model. Clustering algorithms like K-means group unlabeled data into clusters without a predefined target variable to discover hidden patterns in the data.
data mining presentation power point for the study (anjanishah774)
This document provides an overview of a data mining course. It discusses that the course will be taught by George Kollios and will cover topics like data warehouses, association rule mining, clustering, classification, and advanced topics. It also outlines the grading breakdown and schedule. Additionally, it defines data mining and describes common data mining tasks like classification, clustering, and association rule mining. It provides examples of applications and discusses the data mining process.
This document provides an overview and introduction to a data mining course. It discusses the instructor, meeting times, grading breakdown, and an overview of topics to be covered including data warehousing, association rules mining, clustering, classification, sequential pattern mining, and advanced topics. Key terms related to data mining like data, patterns, attributes, and interestingness are also defined. The data mining process and examples of applications are outlined.
The document discusses classification, which involves using a training dataset containing records with attributes and classes to build a model that can predict the class of previously unseen records. It describes dividing the dataset into training and test sets, with the training set used to build the model and test set to validate it. Decision trees are presented as a classification method that splits records recursively into partitions based on attribute values to optimize a metric like GINI impurity, allowing predictions to be made by following the tree branches. The key steps of building a decision tree classifier are outlined.
This document discusses advanced concepts in association analysis, including handling continuous and categorical attributes, multi-level association rules, and sequential patterns. It describes various methods for applying association rule mining to datasets with non-binary attributes, such as discretization techniques and statistics-based approaches. It also discusses challenges like varying discretization intervals and generating rules at different levels of a concept hierarchy. Finally, it provides examples of sequential patterns that can be mined, such as sequences of customer transactions or nuclear accident events.
NN Classification Neural Network NN.pptx (cmpt cmpt)
The document provides an overview of classification methods. It begins with defining classification tasks and discussing evaluation metrics like accuracy, recall, and precision. It then describes several common classification techniques including nearest neighbor classification, decision tree induction, Naive Bayes, logistic regression, and support vector machines. For decision trees specifically, it explains the basic structure of decision trees, the induction process using Hunt's algorithm, and issues in tree induction.
Lecture Notes on Recommender System Introduction (PerumalPitchandi)
This document provides an overview of recommender systems and the techniques used to build them. It discusses collaborative filtering, content-based filtering, knowledge-based recommendations, and hybrid approaches. For collaborative filtering, it describes user-based and item-based approaches, including measuring similarity, making predictions, and generating recommendations. It also discusses evaluation techniques and advanced topics like explanations.
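For the user-based similarity measurement mentioned here, cosine similarity between rating vectors is one common choice. A minimal sketch; the users and ratings below are invented:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Ratings by two users over the same five items (0 = unrated).
alice = [5, 3, 4, 4, 0]
bob   = [3, 1, 2, 3, 3]
print(round(cosine(alice, bob), 3))
```

A user-based collaborative filter would compute this against every other user, keep the nearest neighbors, and predict a missing rating as a similarity-weighted average of their ratings.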
The document provides an overview of a business analytics model that illustrates how business analytics is a layered and hierarchical discipline. It describes the different layers in the model from the business-driven environment at the top where strategy is set, to the technically oriented environment at the bottom where data is generated. It then discusses four scenarios for how business analytics can support organizational strategy, from no formal link between the two, to business analytics supporting strategy at the functional level, to a dialogue between the functions, to information being treated as a strategic resource.
This document discusses the differences between bivariate and multivariate analyses and their results. It explains that a bivariate correlation shows the relationship between two variables, while regression weights from simple regression show the relationship between a predictor and criterion while holding other predictors constant. Regression weights from multiple regression also show this relationship between a predictor and criterion while controlling for other predictors. The document provides examples of different patterns that can emerge between bivariate and multivariate results and discusses factors like collinearity that can influence weights. It also addresses issues like proxy variables that may be standing in for other causal factors.
This document discusses data science, its key concepts, and its role in big data analytics. It defines data science as the study of where information comes from, what it represents, and how it can be turned into a valuable resource. Data science involves using automated methods to analyze massive amounts of data and extract knowledge from them. It is an interdisciplinary field that incorporates techniques from fields like mathematics, statistics, computer science, and domain knowledge.
This document provides an overview of analysis of variance (ANOVA). It explains that ANOVA allows researchers to compare means across multiple groups simultaneously, reducing the risk of type 1 errors associated with multiple t-tests. ANOVA separates overall variance into between-group variance, reflecting differences in treatment means, and within-group variance, reflecting individual differences. If the between-group variance is sufficiently large compared to the within-group variance, then there are significant differences between treatment means.
This document provides an introduction to data science. It defines data science as a multi-disciplinary field that uses scientific methods and processes to extract knowledge and insights from structured and unstructured data. The document discusses the importance and impact of data science on organizations and society. It also outlines common applications of data science and the roles and skills required for a career in data science.
This document discusses descriptive statistics and how to calculate them. It covers preparing data for analysis through coding and tabulation. It then defines four main types of descriptive statistics: measures of central tendency like mean, median, and mode; measures of variability like range and standard deviation; measures of relative position like percentiles and z-scores; and measures of relationships like correlation coefficients. It provides formulas for calculating common descriptive statistics like the mean, standard deviation, and Pearson correlation.
This document discusses software cost estimation techniques. It introduces software productivity metrics like lines of code and function points. It explains that software cost estimation is needed early for pricing purposes. Several models for software estimation are described, including COCOMO, LOC, function points and use case points. Factors that influence software costs like effort, duration and components are outlined. The advantages and disadvantages of lines of code and function points as size metrics are also summarized.
This document discusses software cost estimation and different techniques for estimating software costs. It covers topics like software cost components, metrics for assessing software productivity like lines of code and function points, and challenges with measurement. It also describes different estimation techniques like top-down and bottom-up, and how changing technologies can impact estimates. The document emphasizes that estimation requires using multiple techniques and that "pricing to win" may be necessary when information is limited.
The document discusses microprogram sequencing and pipelining. It describes how the task of microprogram sequencing is performed by a microprogram sequencer. It discusses factors like microinstruction size and address generation time that must be considered in microprogram sequencer design. It also explains different methods for microinstruction address generation including next sequential address, branching, bit-oring, and using a next address field in each microinstruction. Additionally, it provides comparisons between hardwired and microprogrammed control approaches. Finally, it provides an overview of pipelining, describing how pipelining improves processor throughput by allowing partial processing of multiple instructions simultaneously.
This document discusses hardwired control and microprogrammed control in computer processors. It provides details on:
- Hardwired control generates control signals using logic circuits like gates and counters. It can operate at high speed but has little flexibility and complexity is limited.
- Microprogrammed control stores sequences of control words (microinstructions) in a control store. It uses a microprogram counter to sequentially fetch microinstructions and generate control signals. This allows for more flexible control but is slower than hardwired.
- Key components of a microprogrammed control unit include the control store, microprogram counter, and starting address generator block to load addresses and support conditional branching in the microcode.
The Capability Maturity Model (CMM) is a framework for software process improvement composed of 5 levels of process maturity. It was developed by the Software Engineering Institute to help organizations improve their software development process. The CMM describes key process areas that must be addressed to achieve each increasing level of process maturity, from initial/ad hoc processes at level 1 to optimized processes at level 5. Achieving higher levels involves more defined, measured, controlled, and continuously improving processes. While implementation takes significant time and effort, following the CMM helps organizations establish a foundation for consistent, predictable processes that improve quality.
The document compares the waterfall and agile project management methodologies. It provides details on the key aspects of each like phases, requirements, flexibility, and execution. For the CapraTek project to develop an iOS app for their Alfred! software, the document recommends using an agile methodology. Agile is deemed more appropriate because it allows requirements changes, encourages customer feedback, and can adapt to the changing technological landscape faster than waterfall.
This document provides an overview of a data science course. It discusses applications of data science in healthcare for predicting disease infection risks using patient data. It also introduces concepts around data engineering, data science, and data analysis roles.
This document outlines the key topics to be covered in a course on foundations of data science. The course will cover basic concepts of big data and data science including data collection, analysis, modeling and inference using statistical concepts. Students will learn to identify appropriate data mining algorithms to solve real-world problems and analyze relevant data, models and tools for applications. The syllabus will include modules on big data overview, data warehousing, data mining, data analysis, data science life cycle and more.
The document discusses agile software development. It defines agility as the ability to create and respond to change in a turbulent business environment. It also discusses chaordic structures, which blend chaos and order similar to how most organizations operate. The document outlines how agile development emerged in response to heavy, bureaucratic processes like waterfall and V-model approaches. It summarizes the Agile Manifesto and its focus on individuals, working software, customer collaboration, and responding to change over strict plans and processes. Finally, it provides overviews of popular agile methods like Extreme Programming, Scrum, and Crystal.
Agile and its impact to Project Management 022218.pptx (PerumalPitchandi)
This document provides an introduction to Agile project management. It discusses the history and evolution of Agile, including the Agile Manifesto. It then describes several common Agile methodologies like Scrum, Kanban, and Extreme Programming. The document also introduces key Agile concepts like iterative development, user stories, and velocity. It discusses how project scheduling, cost estimation, and DevOps relate to Agile. Finally, it provides an overview of the Scaled Agile Framework (SAFe) for implementing Agile at an enterprise level.
A preliminary exploration of the data to better understand its characteristics. Techniques used in data exploration include summary statistics, visualization, and online analytical processing (OLAP). Summary statistics provide numerical summaries of the data, visualization converts data into visual formats to detect patterns, and OLAP represents data in multidimensional arrays to enable analysis across different dimensions and levels of aggregation.
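The summary-statistics portion of this workflow needs nothing beyond the standard library. A small illustration; the values are invented:

```python
import statistics

values = [4, 8, 15, 16, 23, 42]

# Numerical summaries commonly reported during data exploration.
print(statistics.mean(values))     # 18
print(statistics.median(values))   # 15.5
print(statistics.stdev(values))    # sample standard deviation
```

Visualization and OLAP build on the same idea, summarizing the data along chosen dimensions rather than over the whole set.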
Software Engineering and Project Management - Introduction, Modeling Concepts... (Prakhyath Rai)
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams.
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 (Sinan KOZAK)
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
An improved modulation technique suitable for a three level flying capacitor ... (IJECEIAES)
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed simplified modulation technique paves the way for more straightforward and efficient control of multilevel inverters, enabling their widespread adoption and integration into modern power electronic systems. Through the amalgamation of sinusoidal pulse width modulation (SPWM) with a high-frequency square wave pulse, this controlling technique attains energy equilibrium across the coupling capacitor. The modulation scheme incorporates a simplified switching pattern and a decreased count of voltage references, thereby simplifying the control algorithm.
Embedded machine learning-based road conditions and driving behavior monitoring (IJECEIAES)
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Use PyCharm for remote debugging of WSL on a Windows machine (shadow0702a)
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
Rainfall intensity duration frequency curve statistical analysis and modeling... (bijceesjournal)
Using data from 41 years in Patna, India, the study's goal is to analyze the trends of how often it rains on a weekly, seasonal, and annual basis (1981−2020). First, the historical rainfall data set for Patna, India, over the 41-year period (1981−2020) was evaluated for its quality, utilizing the intensity-duration-frequency (IDF) curve and the relationship obtained by statistically analyzing rainfall. Changes in the hydrologic cycle as a result of increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as log-normal, normal, and Gumbel (EV-I) are used. Distributions were created with durations of 1, 2, 3, 6, and 24 h and return periods of 2, 5, 10, 25, and 100 years. Mathematical correlations were also discovered between rainfall and recurrence interval.
Findings: Based on findings, the Gumbel approach produced the highest intensity values, whereas the other approaches produced values that were close to each other. The data indicates that 461.9 mm of rain fell during the monsoon season’s 301st week. However, it was found that the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall. Calculations revealed that the yearly rainfall averaged 1171.1 mm. Using Weibull’s method, the study was subsequently expanded to examine rainfall distribution at different recurrence intervals of 2, 5, 10, and 25 years. Rainfall and recurrence interval mathematical correlations were also developed. Further regression analysis revealed that short wave irrigation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
cnn.pptx Convolutional neural network used for image classication
Basic Classification.pptx
1. Data Mining
Classification: Basic Concepts and Techniques
Lecture Notes for Chapter 3
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
2/1/2021 Introduction to Data Mining, 2nd Edition 1
2. Classification: Definition
Given a collection of records (training set):
– Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
  x: attribute, predictor, independent variable, input
  y: class, response, dependent variable, output
Task:
– Learn a model that maps each attribute set x into one of the predefined class labels y
3. Examples of Classification Task

Task | Attribute set, x | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells | Features extracted from x-rays or MRI scans | malignant or benign cells
Cataloging galaxies | Features extracted from telescope images | elliptical, spiral, or irregular-shaped galaxies
4. General Approach for Building Classification Model
5. Classification Techniques
Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets
Ensemble Classifiers
– Boosting, Bagging, Random Forests
6. Example of a Decision Tree

Training Data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Model: Decision Tree
[Root splits on Home Owner: Yes → NO. The No branch splits on MarSt: Married → NO; Single, Divorced → split on Income: < 80K → NO, > 80K → YES. Splitting attributes: Home Owner, MarSt, Income.]
7. Apply Model to Test Data

Test Data:
Home Owner | Marital Status | Annual Income | Defaulted Borrower
No | Married | 80K | ?

Start from the root of the tree and follow the branch that matches each attribute test.
12. Apply Model to Test Data
For the test record (Home Owner = No, Married, 80K), the Home Owner = No branch leads to the MarSt test, and Married leads to the leaf labeled NO.
Assign Defaulted to “No”
13. Another Example of Decision Tree
[Model: root splits on MarSt: Married → NO. The Single, Divorced branch splits on Home Owner: Yes → NO; the No branch splits on Income: < 80K → NO, > 80K → YES.]
There could be more than one tree that fits the same data!
(Training data: the same ten-record Defaulted Borrower table as before.)
14. Decision Tree Classification Task

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?

[Framework: Training Set → Tree Induction algorithm → Learn Model (Induction) → Decision Tree; Decision Tree + Test Set → Apply Model (Deduction).]
15. Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
16. General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
(Illustrated on the ten-record Defaulted Borrower training data shown earlier.)
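The recursive procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: it assumes categorical attributes, and the attribute-test step here simply takes the first attribute that still splits Dt, whereas the following slides develop proper measures for choosing the test.

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a tree from records, a list of (attribute_dict, class_label)."""
    labels = [y for _, y in records]
    # Case 1: all records in Dt belong to the same class -> leaf labeled yt
    if len(set(labels)) == 1:
        return labels[0]
    # Pick an attribute test (illustrative: first attribute that still splits Dt)
    split_attr = next((a for a in attributes
                       if len({x[a] for x, _ in records}) > 1), None)
    if split_attr is None:  # identical attribute values: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Case 2: split Dt into smaller subsets and recursively apply the procedure
    branches = {}
    for value in {x[split_attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[split_attr] == value]
        branches[value] = hunt(subset, [a for a in attributes if a != split_attr])
    return (split_attr, branches)

def classify(tree, record):
    while isinstance(tree, tuple):   # internal node: follow the matching branch
        attr, branches = tree
        tree = branches[record[attr]]
    return tree                      # leaf: the class label

# Toy usage on a three-record fragment of the Defaulted Borrower data
data = [
    ({"HomeOwner": "Yes", "Marital": "Single"}, "No"),
    ({"HomeOwner": "No", "Marital": "Married"}, "No"),
    ({"HomeOwner": "No", "Marital": "Single"}, "Yes"),
]
tree = hunt(data, ["HomeOwner", "Marital"])
```

Note that the sketch has no rule for attribute values unseen during training; handling those is one of the design issues discussed later.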
17. Hunt’s Algorithm
Applied to the Defaulted Borrower training data; class counts at each node are shown as (Defaulted = No, Defaulted = Yes):
(a) Single leaf: Defaulted = No (7,3)
(b) Split on Home Owner: Yes → Defaulted = No (3,0); No → Defaulted = No (4,3)
(c) Split the Home Owner = No branch on Marital Status: Married → Defaulted = No (3,0); Single, Divorced → Defaulted = Yes (1,3)
(d) Split the Single, Divorced branch on Annual Income: < 80K → Defaulted = No (1,0); >= 80K → Defaulted = Yes (0,3)
21. Design Issues of Decision Tree Induction
How should training records be split?
– Method for expressing test condition
depending on attribute types
– Measure for evaluating the goodness of a test
condition
How should the splitting procedure stop?
– Stop splitting if all the records belong to the
same class or have identical attribute values
– Early termination
22. Methods for Expressing Test Conditions
Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
23. Test Condition for Nominal Attributes
Multi-way split:
– Use as many partitions as distinct values.
  e.g., Marital Status → Single / Divorced / Married
Binary split:
– Divides values into two subsets, e.g., {Married} vs {Single, Divorced}, {Single} vs {Married, Divorced}, or {Single, Married} vs {Divorced}
24. Test Condition for Ordinal Attributes
Multi-way split:
– Use as many partitions as distinct values
  e.g., Shirt Size → Small / Medium / Large / Extra Large
Binary split:
– Divides values into two subsets
– Preserve order property among attribute values
  e.g., {Small} vs {Medium, Large, Extra Large}, or {Small, Medium} vs {Large, Extra Large}
  The grouping {Small, Large} vs {Medium, Extra Large} violates the order property.
25. Test Condition for Continuous Attributes
(i) Binary split: Annual Income > 80K? (Yes / No)
(ii) Multi-way split: Annual Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
26. Splitting Based on Continuous Attributes
Different ways of handling
– Discretization to form an ordinal categorical attribute
  Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  Static – discretize once at the beginning
  Dynamic – repeat at each node
– Binary Decision: (A < v) or (A >= v)
  considers all possible splits and finds the best cut
  can be more compute intensive
27. How to determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1
Candidate test conditions (class counts C0 : C1 in each child):
– Gender: two children with 6 : 4 and 4 : 6
– Car Type (Family / Sports / Luxury): 1 : 3, 8 : 0, and 1 : 7
– Customer ID (c1 ... c20): one record per child, each 1 : 0 or 0 : 1
Which test condition is the best?
28. How to determine the Best Split
Greedy approach:
– Nodes with purer class distribution are preferred
Need a measure of node impurity:
– C0: 5, C1: 5 → high degree of impurity
– C0: 9, C1: 1 → low degree of impurity
29. Measures of Node Impurity
Gini Index:
  Gini Index = 1 − \sum_{i=0}^{c-1} p_i(t)^2
Entropy:
  Entropy = − \sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)
Misclassification error:
  Classification error = 1 − \max_i [p_i(t)]
where p_i(t) is the frequency of class i at node t, and c is the total number of classes
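The three measures can be stated as a small Python sketch (function names are mine, not from the slides); the printed values match the extreme cases discussed on the following slides:

```python
import math

def gini(counts):
    """Gini index from a list of class counts at a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; 0 * log2(0) is treated as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

# A maximally impure 2-class node vs a pure node
print(gini([5, 5]))                  # 0.5, the maximum 1 - 1/c for c = 2
print(entropy([5, 5]))               # 1.0, the maximum log2(c) for c = 2
print(classification_error([5, 5]))  # 0.5
print(gini([10, 0]))                 # 0.0, all records in one class
```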
30. Finding the Best Split
1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
   Compute impurity measure of each child node
   M is the weighted impurity of the child nodes
3. Choose the attribute test condition that produces the highest gain
   Gain = P − M
   or equivalently, the lowest impurity measure after splitting (M)
31. Finding the Best Split
Before Splitting: parent node with class counts (C0: N00, C1: N01) and impurity P.
Candidate split A? (Yes / No) → Node N1 (C0: N10, C1: N11) with impurity M11 and Node N2 (C0: N20, C1: N21) with impurity M12; their weighted impurity is M1.
Candidate split B? (Yes / No) → Node N3 (C0: N30, C1: N31) with impurity M21 and Node N4 (C0: N40, C1: N41) with impurity M22; their weighted impurity is M2.
Gain = P − M1 vs P − M2
32. Measure of Impurity: GINI
Gini Index for a given node t:
  Gini Index = 1 − \sum_{i=0}^{c-1} p_i(t)^2
where p_i(t) is the frequency of class i at node t, and c is the total number of classes
– Maximum of 1 − 1/c when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
– Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
33. Measure of Impurity: GINI
Gini Index for a given node t:
  Gini Index = 1 − \sum_{i=0}^{c-1} p_i(t)^2
– For a 2-class problem (p, 1 − p):
  GINI = 1 − p^2 − (1 − p)^2 = 2p(1 − p)
Examples:
  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
35. Computing Gini Index for a Collection of Nodes
When a node p is split into k partitions (children):
  GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)
where n_i = number of records at child i, and n = number of records at parent node p.
36. Binary Attributes: Computing GINI Index
Splits into two partitions (child nodes)
Effect of weighing partitions:
– Larger and purer partitions are sought

Parent: C1 = 7, C2 = 5, Gini = 0.486
Split on B?: N1 (C1: 5, C2: 1), N2 (C1: 2, C2: 4)
  Gini(N1) = 1 − (5/6)^2 − (1/6)^2 = 0.278
  Gini(N2) = 1 − (2/6)^2 − (4/6)^2 = 0.444
  Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
  Gain = 0.486 − 0.361 = 0.125
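The arithmetic on this slide can be reproduced directly; a small sketch in Python (helper names are mine) computing the weighted Gini of the binary split:

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini over child nodes; children is a list of class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 5]                        # C1 = 7, C2 = 5
children = [[5, 1], [2, 4]]            # N1 and N2 after splitting on B

print(round(gini(parent), 3))          # 0.486
print(round(gini_split(children), 3))  # 0.361
print(round(gini(parent) - gini_split(children), 3))  # gain: 0.125
```

Because each child's Gini is weighted by its share of the records, a large pure partition lowers the split's overall impurity more than a small one.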
37. Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions

Multi-way split, CarType (Family / Sports / Luxury):
  C1: 1, 8, 1; C2: 3, 0, 7 → Gini = 0.163
Two-way split (find best partition of values):
  {Sports, Luxury} vs {Family}: C1: 9, 1; C2: 7, 3 → Gini = 0.468
  {Sports} vs {Family, Luxury}: C1: 8, 2; C2: 0, 10 → Gini = 0.167
Which of these is the best?
38. Continuous Attributes: Computing Gini Index
Use binary decisions based on one value
Several choices for the splitting value
– Number of possible splitting values = number of distinct values
Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A ≤ v and A > v
Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

Example (Annual Income, from the Defaulted Borrower data):
  ≤ 80 | > 80
  Defaulted = Yes: 0 | 3
  Defaulted = No: 3 | 4
39. Continuous Attributes: Computing Gini Index...
For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted Values (Annual Income): 60 70 75 85 90 95 100 120 125 220
Class (Cheat):                 No No No Yes Yes Yes No No No No
Split Positions:    55    65    72    80    87    92    97    110   122   172   230
Yes counts (≤ | >): 0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No counts (≤ | >):  0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini:               0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
The best split position is 97, with Gini = 0.300.
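The sort-and-scan procedure can be sketched as follows, assuming the ten-record Annual Income data above (Python; candidate positions are taken at midpoints between consecutive sorted values, so the slide's position 97 appears here as 97.5):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

pairs = sorted(zip(incomes, cheat))          # 1. sort the attribute on values
total = {"Yes": cheat.count("Yes"), "No": cheat.count("No")}
left = {"Yes": 0, "No": 0}                   # running count matrix for A <= v
n = len(pairs)
best = None
for i in range(n - 1):                       # 2. linearly scan the sorted values
    left[pairs[i][1]] += 1                   # one record moves to the left side
    v = (pairs[i][0] + pairs[i + 1][0]) / 2  # candidate split at the midpoint
    right = {k: total[k] - left[k] for k in total}
    w = ((i + 1) / n * gini(list(left.values()))
         + (n - i - 1) / n * gini(list(right.values())))
    if best is None or w < best[1]:          # 3. keep the least-Gini position
        best = (v, w)

print(best)  # best split between 95 and 100, weighted Gini 0.3
```

One sort plus one linear scan replaces a full database scan per candidate value, which is the efficiency gain this slide is after.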
44. Measure of Impurity: Entropy
Entropy at a given node t:
  Entropy = − \sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)
where p_i(t) is the frequency of class i at node t, and c is the total number of classes
– Maximum of log_2(c) when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
– Entropy-based computations are quite similar to the GINI index computations
46. Computing Information Gain After Splitting
Information Gain:
  Gain_split = Entropy(p) − \sum_{i=1}^{k} (n_i / n) Entropy(i)
Parent node p is split into k partitions (children); n_i is the number of records in child node i
– Choose the split that achieves the most reduction (maximizes GAIN)
– Used in the ID3 and C4.5 decision tree algorithms
– Information gain is the mutual information between the class variable and the splitting variable
47. Problem with large number of partitions
Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy for all the children is zero
(Same candidate splits as before: Gender with children 6:4 and 4:6; Car Type with children 1:3, 8:0, 1:7; Customer ID with one record per child.)
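This preference can be checked numerically; a short sketch (Python, with the 20-record class counts taken from the figure):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """Gain_split = Entropy(p) - weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [10, 10]                            # 10 records of C0, 10 of C1
gender = [[6, 4], [4, 6]]
car_type = [[1, 3], [8, 0], [1, 7]]
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10  # one pure record per child

print(round(info_gain(parent, gender), 3))    # 0.029: barely informative
print(round(info_gain(parent, car_type), 2))  # 0.62
print(info_gain(parent, customer_id))         # 1.0: every child is pure
```

Customer ID attains the maximum possible gain (the full parent entropy) despite being useless for prediction, which motivates the gain ratio on the next slide.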
48. Gain Ratio
Gain Ratio:
  Gain Ratio = Gain_split / Split Info
  Split Info = − \sum_{i=1}^{k} (n_i / n) \log_2(n_i / n)
Parent node p is split into k partitions (children); n_i is the number of records in child node i
– Adjusts Information Gain by the entropy of the partitioning (Split Info)
  Higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain
49. Gain Ratio
Gain Ratio = Gain_split / Split Info, with Split Info = − \sum_{i=1}^{k} (n_i / n) \log_2(n_i / n)
For the CarType count matrices from before:
  Multi-way split (Family / Sports / Luxury): Gini = 0.163, SplitINFO = 1.52
  {Sports, Luxury} vs {Family}: Gini = 0.468, SplitINFO = 0.72
  {Sports} vs {Family, Luxury}: Gini = 0.167, SplitINFO = 0.97
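The SplitINFO values above depend only on the partition sizes; a sketch reproducing them (Python, helper name mine):

```python
import math

def split_info(sizes):
    """Entropy of the partitioning itself, from the child-node sizes."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes)

print(round(split_info([4, 8, 8]), 2))  # 1.52  multi-way: Family, Sports, Luxury
print(round(split_info([16, 4]), 2))    # 0.72  {Sports, Luxury} vs {Family}
print(round(split_info([8, 12]), 2))    # 0.97  {Sports} vs {Family, Luxury}
```

The many-small-partitions case has the largest Split Info, so dividing by it penalizes exactly the splits that plain information gain over-rewards.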
50. Measure of Impurity: Classification Error
Classification error at a node t:
  Error(t) = 1 − \max_i [p_i(t)]
– Maximum of 1 − 1/c when records are equally distributed among all classes, implying the least interesting situation
– Minimum of 0 when all records belong to one class, implying the most interesting situation
54. Misclassification Error vs Gini Index
Parent: C1 = 7, C2 = 3, Gini = 0.42
Candidate splits A? (Yes / No) into nodes N1 and N2:
  N1 (C1: 3, C2: 0), N2 (C1: 4, C2: 3) → weighted Gini = 0.342
  N1 (C1: 3, C2: 1), N2 (C1: 4, C2: 2) → weighted Gini = 0.416
Misclassification error for all three cases = 0.3!
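The contrast between the two measures can be verified with a short sketch (Python; the slide's Gini values 0.342 and 0.416 appear truncated rather than rounded):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1 - max(counts) / sum(counts)

def weighted(measure, children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

parent = [7, 3]
split1 = [[3, 0], [4, 3]]
split2 = [[3, 1], [4, 2]]

print(round(gini(parent), 2))            # 0.42
print(round(weighted(gini, split1), 3))  # 0.343 (slide shows 0.342)
print(round(weighted(gini, split2), 3))  # 0.417 (slide shows 0.416)
# Misclassification error cannot distinguish the three cases:
print(round(error(parent), 3),
      round(weighted(error, split1), 3),
      round(weighted(error, split2), 3))  # 0.3 0.3 0.3
```

Gini still prefers split1 (it produces a pure child), while the error measure sees no improvement at all, which is why Gini and entropy are the usual splitting criteria.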
55. Decision Tree Based Classification
Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes (unless the attributes are interacting)
Disadvantages:
– Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating
– Each decision boundary involves only a single attribute
56. Handling interactions
[Figure: scatter plot of two interacting attributes X and Y; + : 1000 instances, o : 1000 instances]
Entropy(X): 0.99, Entropy(Y): 0.99
58. Handling interactions given irrelevant attributes
[Figure: same X, Y scatter (+ : 1000 instances, o : 1000 instances), with Z added as a noisy attribute generated from a uniform distribution]
Entropy(X): 0.99, Entropy(Y): 0.99, Entropy(Z): 0.98
Attribute Z will be chosen for splitting!
59. Limitations of single attribute-based decision boundaries
[Figure: both positive (+) and negative (o) classes generated from skewed Gaussians with centers at (8,8) and (12,12), respectively]