This document summarizes a lecture on decision tree learning. It introduces decision trees and algorithms like ID3 for building trees from data. Key concepts discussed include information gain, overfitting, pruning trees, handling continuous attributes, and predicting continuous values with regression trees. Decision trees are built by recursively splitting the training data on attributes that maximize information gain until reaching leaf nodes with class predictions.
Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 2: Decision Tree Learning

Today’s Class

Decision Trees
Decision tree learning algorithms
Learning bias
Overfitting
Pruning
Extensions to learning with real values

Decision Tree Learning

Main algorithms introduced by Quinlan in the 1980s
A decision tree is a set of hierarchically nested classification rules
Each node in the tree investigates a specific attribute
Branches correspond to the values of the attribute

Example Data

When to play tennis? (training data)

day  outlook   temperature  humidity  wind    play
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rain      mild         high      weak    yes
5    rain      cool         normal    weak    yes
6    rain      cool         normal    strong  no
7    overcast  cool         normal    strong  yes
8    sunny     mild         high      weak    no
9    sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no

Decision Tree

Nodes check attribute values
Leaves are final classifications

Decision Tree

Decision trees can be represented as logical expressions in disjunctive normal form
Each path from the root corresponds to a conjunction of attribute-value equations
All paths of the tree are combined by disjunction

(outlook=sunny ∧ humidity=normal)
∨ (outlook=overcast)
∨ (outlook=rain ∧ wind=weak)

Appropriate Problems for DTs

Attributes have discrete values (real-value extension discussed later)
The class values are discrete (real-value extension discussed later)
Training data may contain errors
Training data may contain instances with missing/unknown attribute values

Learning Decision Trees

Many different decision trees can be learned for a given training set
A number of criteria apply
• The tree should be as accurate as possible
• The tree should be as simple as possible
• The tree should generalize as well as possible
Basic questions
• Which attributes should be included in the tree?
• In which order should they be used in the tree?
Standard decision tree learning algorithms: ID3 and C4.5

Entropy

The better an attribute discriminates the classes in the data, the higher it should be in the tree
How do we quantify the degree of discrimination?
One way to do this is to use entropy
Entropy measures the uncertainty/ambiguity in the data

H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖

where p⊕ (p⊖) is the probability of a positive (negative) class occurring in S

Entropy

In general, the entropy of a subset S of the training examples with respect to the target class is defined as:

H(S) = ∑_{c∈C} −p_c log2 p_c

where C is the set of possible classes and p_c is the probability that an instance in S belongs to class c
Note, we define 0 log2 0 = 0

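As a concrete illustration (not from the slides), here is a minimal Python sketch of this definition; the function name and the use of collections.Counter are my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = sum over c in C of -p_c * log2(p_c); classes with zero count
    never appear in the Counter, so the 0*log2(0) = 0 convention holds."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

# The tennis training data has 9 positive and 5 negative examples:
print(round(entropy(["yes"] * 9 + ["no"] * 5), 2))  # -> 0.94
```
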
Information Gain

Information gain is the reduction in entropy:

gain(S, A) = H(S) − ∑_{v∈values(A)} (|S_v|/|S|) H(S_v)

where values(A) is the set of possible values of attribute A and S_v is the subset of S for which attribute A has value v
gain(S, A) is the number of bits saved when encoding an arbitrary member of S by knowing the value of attribute A

Information Gain Example

S = [9+, 5−]
values(wind) = {weak, strong}
S_weak = [6+, 2−]
S_strong = [3+, 3−]

gain(S, wind) = H(S) − ∑_{v∈{weak,strong}} (|S_v|/|S|) H(S_v)
             = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong)
             = 0.94 − (8/14) 0.811 − (6/14) 1.0
             = 0.048

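The worked example can be checked mechanically; the sketch below (helper names are mine) recomputes gain(S, wind):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(subsets):
    """subsets maps each value v of the attribute to the class labels of S_v."""
    pooled = [c for sv in subsets.values() for c in sv]
    return entropy(pooled) - sum(
        (len(sv) / len(pooled)) * entropy(sv) for sv in subsets.values())

wind = {"weak": ["+"] * 6 + ["-"] * 2, "strong": ["+"] * 3 + ["-"] * 3}
print(round(gain(wind), 3))  # -> 0.048, as on the slide
```
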
Comparing Information Gains

[figure comparing the information gains of the candidate attributes]

ID3 DT Learning

The ID3 algorithm computes the information gain for each node in the tree and each attribute, and chooses the attribute with the highest gain
For instance, at the root (S) the gains are:
• gain(S, outlook) = 0.246
• gain(S, humidity) = 0.151
• gain(S, wind) = 0.048
• gain(S, temperature) = 0.029
• Hence outlook is chosen for the top node
ID3 then iteratively selects the attribute with the highest gain for each daughter of the previous node, ...

ID3 Algorithm

ID3(S', S, node, attr)
  if (for all s in S: class(s) = c)
    return leaf node with class c
  else if (attr is empty)
    return leaf node with most frequent class in S
  else if (S is empty)
    return leaf node with most frequent class in S'
  else
    a = argmax_{a'∈attr} gain(S, a')
    attribute(node) = a
    for each v ∈ values(a)
      new(node_v); new_edge(node, node_v); label(node, node_v) = v
      ID3(S, S_v, node_v, attr − {a})

Initial call: ID3(∅, S, root, A)

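The pseudocode maps almost line-for-line onto Python. Below is a compact sketch under my own representation (not from the slides): instances are dicts with a "class" key, and a learned tree is either a class label or a nested dict {attribute: {value: subtree}}:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(examples, attr):
    labels = [x["class"] for x in examples]
    remainder = 0.0
    for v in {x[attr] for x in examples}:
        sv = [x["class"] for x in examples if x[attr] == v]
        remainder += (len(sv) / len(labels)) * entropy(sv)
    return entropy(labels) - remainder

def id3(parent_examples, examples, attrs):
    if not examples:                      # S empty: most frequent class in S'
        return Counter(x["class"] for x in parent_examples).most_common(1)[0][0]
    classes = [x["class"] for x in examples]
    if len(set(classes)) == 1:            # all of S has the same class
        return classes[0]
    if not attrs:                         # attr empty: most frequent class in S
        return Counter(classes).most_common(1)[0][0]
    a = max(attrs, key=lambda cand: gain(examples, cand))
    # branch on the values of a observed in S (values(a) on the slide)
    return {a: {v: id3(examples, [x for x in examples if x[a] == v], attrs - {a})
                for v in {x[a] for x in examples}}}

# Initial call, mirroring ID3(∅, S, root, A):
# tree = id3([], data, {"outlook", "temperature", "humidity", "wind"})
```
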
Hypothesis Search Space

The hypothesis space searched by ID3 is the set of possible decision trees
Hill-climbing (greedy) search guided purely by the information gain measure
• Only one hypothesis is considered for further extension
• No back-tracking to hypotheses dismissed earlier
All (relevant) training examples are used to guide the search
Due to greedy search, ID3 can get stuck in a local optimum

Inductive Bias of ID3

ID3 has a preference for small trees (in particular short trees)
ID3 has a preference for trees with high information gain attributes near the root
Note, a bias is a preference for some hypotheses, rather than a restriction of the hypothesis space
Some form of bias is required in order to generalize beyond the training data

Evaluation

How good is the learned decision tree?
Split the available data into a training set and a test set
Sometimes data comes already with a pre-defined split
Rule of thumb: use 80% for training and 20% for testing
The test set should be big enough to draw stable conclusions

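For illustration only (scikit-learn is not part of the lecture), the 80/20 rule of thumb in code:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data for testing, train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```
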
Cross-Validation

What if the available data is rather small?
Re-run training and testing on n different portions of the data
Known as n-fold cross-validation
Compute the accuracy over the combined test portions from all folds
Allows one also to report variation across different folds
Stratified cross-validation makes sure that the different folds contain the same proportions of class labels

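A sketch of stratified n-fold cross-validation with scikit-learn (again an illustration, not the lecture's own tooling):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 5-fold stratified CV: each fold preserves the class proportions
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=folds)
print("per-fold accuracy:", scores)    # variation across folds
print("mean accuracy:", scores.mean())
```
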
Occam’s Razor

Occam’s Razor (OR): Prefer the simplest hypothesis that fits the data
Pro OR: A long hypothesis that fits the data merely describes the data; it does not model the underlying principle that generated the data
Pro OR: A short hypothesis that fits the data is unlikely to be a coincidence
Con OR: There are numerous ways to define the size of hypotheses

Overfitting

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has a smaller error than h' over the training data, but h' has a smaller error than h on the entire distribution of data.
Roughly speaking, a hypothesis h overfits if it does not generalize as well beyond the training data as another hypothesis h'

Overfitting

Reasons for overfitting
• The training set is too small
• The training data is not representative of the real distribution
• The training data contains errors (measurement errors, human annotation errors, ...)
Overfitting is a significant issue in practical data mining applications
Two approaches to avoid/reduce overfitting
• Stop tree growth early (before it classifies the data perfectly)
• Post-pruning of the tree (after the complete tree has been learned)

Reduced Error Pruning

Split the training set into two sets:
• training subset (approx. 80%)
• validation set (approx. 20%)
A decision tree T is learned from the training subset
Prune the node in T that leads to the highest improvement on the validation set (and repeat for all nodes until accuracy drops)
Pruning a node in a tree substitutes the node and its subtree by the most common class under the node

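A sketch of the idea on the dict-based trees from the ID3 sketch above. Note this is a simplified bottom-up variant (each subtree is replaced by its majority class whenever that does not hurt validation accuracy), not the globally greedy repeat-until-drop loop described on the slide; all names are mine:

```python
from collections import Counter

def predict(tree, x):
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(x[attr])
    return tree   # a class label, or None for an unseen attribute value

def accuracy(tree, validation):
    return sum(predict(tree, x) == x["class"] for x in validation) / len(validation)

def reduced_error_prune(tree, examples, validation):
    if not isinstance(tree, dict) or not validation:
        return tree
    attr, branches = next(iter(tree.items()))
    for v in branches:   # prune children first, routing data down each branch
        branches[v] = reduced_error_prune(
            branches[v],
            [x for x in examples if x[attr] == v],
            [x for x in validation if x[attr] == v])
    # candidate: replace this node and its subtree by the most common class
    majority = Counter(x["class"] for x in examples).most_common(1)[0][0]
    if accuracy(majority, validation) >= accuracy(tree, validation):
        return majority
    return tree
```
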
Effect of Pruning

[figure illustrating the effect of pruning]

Rule Post-Pruning

Instead of pruning entire subtrees, rule post-pruning affects only parts of the decision chain
Convert the tree into a set of rules
• Each path is represented as a rule of the form:
  If a_1 = v_1 ∧ ... ∧ a_n = v_n then class = c
• For example:
  If outlook = sunny ∧ humidity = high then play = no
Remove conjuncts in order of improvement on the validation set until there are no further improvements
All paths are pruned independently of each other

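Converting paths to rules is a simple traversal; here is a sketch over the dict-based trees used earlier (the toy tree encodes the example tree from the slides):

```python
def tree_to_rules(tree, conjuncts=()):
    """Each root-to-leaf path becomes (list of (attr, value) tests, class)."""
    if not isinstance(tree, dict):
        return [(list(conjuncts), tree)]
    attr, branches = next(iter(tree.items()))
    rules = []
    for v, subtree in branches.items():
        rules += tree_to_rules(subtree, conjuncts + ((attr, v),))
    return rules

tennis = {"outlook": {
    "sunny": {"humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"wind": {"weak": "yes", "strong": "no"}}}}
for tests, label in tree_to_rules(tennis):
    print("If " + " ∧ ".join(f"{a}={v}" for a, v in tests) + f" then play={label}")
```
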
Continuous-Valued Attributes

Treating real-valued attributes (like temperature) as discrete values is clearly inappropriate
Use a threshold: if value(a) < c then ... else ...
The threshold c can be determined by computing the maximum information gain over different candidate thresholds
Note: Numeric attributes can be repeated along the same path

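A sketch of the threshold search (function name and toy data are mine): candidate thresholds are midpoints between consecutive sorted values, scored by information gain:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_c = float("-inf"), None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2                       # candidate threshold
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best_gain:
            best_gain, best_c = g, c
    return best_c, best_gain

temperature = [64, 69, 70, 72, 75, 81, 83, 85]   # toy numeric attribute
play = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]
print(best_threshold(temperature, play))
```
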
Information Gain Revisited

H(S) = ∑_{c∈C} −p_c log2 p_c

gain(S, A) = H(S) − ∑_{v∈values(A)} (|S_v|/|S|) H(S_v)

Information gain favors attributes with many values over those with few. Why?
Extension: Measure how broadly and uniformly the attribute splits the data:

split(S, A) = − ∑_{v∈values(A)} (|S_v|/|S|) log2 (|S_v|/|S|)

split is the entropy of the attribute-value distribution in S

Information Gain Revisited

Information gain and split can be combined:

gain_ratio(S, A) = gain(S, A) / split(S, A)

If |values(A)| = n and A completely determines the class, then split(S, A) = log2 n
If |S_v| ≈ |S| for one v, then split(S, A) becomes small and boosts the gain ratio
Heuristic: Compute information gain first, remove attributes with below-average gain, and then select the attribute with the highest gain ratio

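A sketch of split and gain ratio (it assumes the attribute takes at least two values in S, so split(S, A) > 0; names are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def split_info(subsets):
    """split(S, A): entropy of the attribute-value distribution itself."""
    n = sum(len(sv) for sv in subsets.values())
    return -sum((len(sv) / n) * log2(len(sv) / n)
                for sv in subsets.values() if sv)

def gain_ratio(subsets):
    pooled = [c for sv in subsets.values() for c in sv]
    g = entropy(pooled) - sum(
        (len(sv) / len(pooled)) * entropy(sv) for sv in subsets.values())
    return g / split_info(subsets)

wind = {"weak": ["+"] * 6 + ["-"] * 2, "strong": ["+"] * 3 + ["-"] * 3}
print(round(gain_ratio(wind), 3))  # 0.048 / 0.985 -> 0.049
```
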
Missing Attribute Values

In real-world data it is not unusual that some instances have missing attribute values
How to compute gain(S, A) for instances with missing values?
Assume an instance x with class(x) = c and a(x) = ?
• Take the most frequent value of a over all instances in S (with the same class)
• For numeric attributes, take the average value of a over all instances in S (with the same class)

Missing Attribute Values

Instead of choosing the single most frequent value, use fractional instances
E.g., if p(a(x)=1 | S) = 0.6 and p(a(x)=0 | S) = 0.4, then 0.6 (respectively 0.4) fractional instances with missing values for a are passed down the a=1 (a=0) branch
The entropy computation has to be adapted accordingly

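Fractional instances amount to giving each instance a weight; entropy then uses weight mass instead of counts. A minimal sketch (the 0.6/0.4 split mirrors the example above; everything else is my own naming):

```python
from collections import defaultdict
from math import log2

def weighted_entropy(weighted_labels):
    """weighted_labels: iterable of (class label, weight) pairs."""
    mass = defaultdict(float)
    for label, w in weighted_labels:
        mass[label] += w
    total = sum(mass.values())
    return -sum((m / total) * log2(m / total) for m in mass.values() if m > 0)

# An instance with a missing value for a is passed down both branches,
# weighted by p(a=1|S)=0.6 and p(a=0|S)=0.4:
branch_a1 = [("yes", 1.0), ("yes", 1.0), ("no", 0.6)]
branch_a0 = [("yes", 1.0), ("no", 0.4)]
print(weighted_entropy(branch_a1), weighted_entropy(branch_a0))
```
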
Attributes with Different Costs

In real-world scenarios there can be costs associated with computing the values of attributes (medical tests, computing time, ...)
Considering costs might favor usage of lower-cost attributes
Suggested measures include:
• gain(S, A) / cost(A)
• (2^gain(S,A) − 1) / (cost(A) + 1)^w
  where w ∈ [0, 1] is a weight determining the importance of cost

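Both suggested measures are one-liners; the function names and the example cost below are mine:

```python
def gain_over_cost(g, cost):
    """First measure: gain(S, A) / cost(A)."""
    return g / cost

def cost_weighted_gain(g, cost, w=0.5):
    """Second measure: (2^gain(S,A) - 1) / (cost(A) + 1)^w, w in [0, 1]."""
    return (2 ** g - 1) / (cost + 1) ** w

# E.g. outlook's gain of 0.246 with a hypothetical measurement cost of 2.0:
print(gain_over_cost(0.246, 2.0), cost_weighted_gain(0.246, 2.0))
```
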
Predicting Continuous Values

So far we have focused on predicting discrete classes (i.e. nominal classification)
What has to change when predicting real values?
The splitting criterion is redefined
• Information gain:
  gain(S, A) = H(S) − ∑_{v∈values(A)} (|S_v|/|S|) H(S_v)
• Standard deviation reduction (SDR):
  sdr(S, A) = std_dev(S) − ∑_{v∈values(A)} (|S_v|/|S|) std_dev(S_v)
  where std_dev(S) = sqrt( (1/|S|) ∑_{s∈S} (val(s) − avg_val(S))² )

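A sketch of SDR; statistics.pstdev is the population standard deviation, matching the 1/|S| in the formula (the toy target values are mine):

```python
from statistics import pstdev

def sdr(subsets):
    """subsets maps each value v of attribute A to the target values of S_v."""
    pooled = [y for sv in subsets.values() for y in sv]
    return pstdev(pooled) - sum(
        (len(sv) / len(pooled)) * pstdev(sv) for sv in subsets.values())

outlook = {"sunny": [26.0, 30.0, 48.0],
           "overcast": [46.0, 43.0, 52.0],
           "rain": [62.0, 23.0]}
print(round(sdr(outlook), 2))
```
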
Predicting Continuous Values

Stopping criterion in nominal classification: stop when all instances under a node have the same class
Too fine-grained for real-value prediction
Stop when the standard deviation of node n is less than some predefined ratio of the standard deviation of the original instance set:

Stop if std_dev(S_n) / std_dev(S_all) < θ, where, e.g., θ = 0.05

Predicting Continuous Values

If we decide not to split further on node n, what should the predicted value be?
Simple solution: the average target value of the instances underneath node n:

class(n) = avg_val(S_n)

This approach is used in regression trees
More sophisticated: associate linear regression models with the leaf nodes (model trees)

Model Tree Learning

Suppose we have a leaf node n; regression trees use the average target value of the instances under n
A more fine-grained approach is to apply linear regression to all instances under n:

class(n) = a + b_1 x_1 + b_2 x_2 + ··· + b_m x_m

where x_1, x_2, ..., x_m are the values of the attributes that lead to n in the tree
a and the b_i are estimated just like in linear regression
Problem: Not all attributes are numerical!

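At a leaf, the coefficients a, b_1, ..., b_m can be estimated by ordinary least squares; a sketch with NumPy (toy data and function name are mine):

```python
import numpy as np

def fit_leaf_model(X, y):
    """Least-squares estimate of class(n) = a + b1*x1 + ... + bm*xm."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]                    # a, (b1, ..., bm)

X = np.array([[64.0, 65.0], [70.0, 90.0], [75.0, 70.0], [80.0, 85.0]])
y = np.array([40.0, 52.0, 55.0, 61.0])
a, b = fit_leaf_model(X, y)
print(a, b)
```
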
Converting Nominal Attributes

Assume a nominal attribute such as outlook = {sunny, overcast, rain}
We can convert this into numerical values simply by choosing equi-distant values from a specific interval: outlook = {1, 0.5, 0}
This assumes an intuitive ordering of the values: sunny > overcast > rain
A direct ordering of values is not always possible:
city = {london, new york, tokyo}
london > new york > tokyo ???

Converting Nominal Attributes

Sort the nominal values of attribute A by their average target values
If nominal attribute A has k values, introduce k−1 synthetic binary attributes
The i-th binary attribute checks whether the instance's value is among the first i nominal values in the ordering
For instance, if avg_trg_val(new york) < avg_trg_val(london) < avg_trg_val(tokyo), then the k−1 = 2 synthetic binary attributes are:
is_new_york and is_new_york_or_london

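A sketch of the conversion (the average target values below are hypothetical, used only to fix the ordering):

```python
def synthetic_binaries(avg_target):
    """Order the nominal values by their average target value and build
    k-1 binary attributes: the i-th tests membership in the first i values."""
    order = sorted(avg_target, key=avg_target.get)
    return order, [set(order[:i + 1]) for i in range(len(order) - 1)]

avg = {"new york": 40.0, "london": 55.0, "tokyo": 70.0}  # hypothetical averages
order, prefixes = synthetic_binaries(avg)
print(order)                              # ['new york', 'london', 'tokyo']
# is_new_york and is_new_york_or_london, evaluated for city = 'london':
print(["london" in p for p in prefixes])  # [False, True]
```
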
Recap

Elements of a decision tree
Information gain
ID3 algorithm
Bias of ID3
Overfitting and Pruning
Attributes with many values (gain ratio)
Attributes with continuous values
Attributes with missing values
Predicting continuous classes:
• Regression trees
• Model trees