This presentation was prepared as part of the coursework for CSCI-B 659 Topics in Artificial Intelligence - Machine Learning in Computational Linguistics, under the guidance of Prof. Sandra Kubler.
2. Decision Tree Learning Agenda
• Decision Tree Representation
• ID3 Learning Algorithm
• Entropy, Information Gain
• An Illustrative Example
• Issues in Decision Tree Learning
CSCI-B 659 | Decision Trees
3. Decision Tree Learning
• It is a method of approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive expressions.
• The learned function is represented by a decision tree.
• Disjunctive expressions – e.g., (A ∧ B ∧ C) ∨ (D ∧ E ∧ F)
4. Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
PlayTennis: This decision tree classifies Saturday mornings according to whether or not they are suitable for playing tennis
5. Decision Tree Representation - Classification
• An example is classified by sorting it through the tree from the root to a leaf node
• Example – (Outlook = Sunny, Humidity = High) => (PlayTennis = No)
PlayTennis: This decision tree classifies Saturday mornings according to whether or not they are suitable for playing tennis
6. Appropriate problems for decision tree learning
• Instances describable by attribute-value pairs
• Target function is discrete-valued
• Disjunctive hypotheses may be required
• Possibly noisy data:
  • Training data may contain errors
  • Training data may contain missing attribute values
• Examples – classification problems:
  1. Equipment or medical diagnosis
  2. Credit risk analysis
7. Basic ID3 Learning Algorithm approach
• Top-down construction of the tree, beginning with the question "which attribute should be tested at the root of the tree?"
• Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.
• The best attribute is selected and used as the test at the root node of the tree.
• A descendant of the root node is then created for each possible value of this attribute.
• The training examples are sorted to the appropriate descendant node.
• The entire process is then repeated at each descendant node, using the training examples associated with it.
• GREEDY approach
• No backtracking – so we may get a suboptimal solution.
9. Top-Down induction of decision trees
1. Find A = the best decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are classified perfectly, STOP; else iterate over the new leaf nodes
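The five steps above can be sketched as a short recursive procedure. This is an illustrative sketch only (no pruning, no handling of attribute values unseen in training); the information-gain measure it uses for "best attribute" is defined on the following slides, and the data is Mitchell's standard 14-example PlayTennis table.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(examples, attr, target):
    """Expected reduction in entropy from sorting the examples on attr."""
    g = entropy([e[target] for e in examples])
    for value in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == value]
        g -= len(subset) / len(examples) * entropy(subset)
    return g

def id3(examples, attrs, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # classified perfectly: STOP
        return labels[0]
    if not attrs:                             # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a, target))    # step 1
    node = {"attr": best, "branches": {}}                         # step 2
    for value in set(e[best] for e in examples):                  # step 3
        subset = [e for e in examples if e[best] == value]        # step 4
        node["branches"][value] = id3(                            # step 5
            subset, [a for a in attrs if a != best], target)
    return node

# Mitchell's 14-example PlayTennis table (Table 3.2).
data = [dict(zip(["Outlook", "Temperature", "Humidity", "Wind", "Play"], r.split()))
        for r in [
            "Sunny Hot High Weak No",          "Sunny Hot High Strong No",
            "Overcast Hot High Weak Yes",      "Rain Mild High Weak Yes",
            "Rain Cool Normal Weak Yes",       "Rain Cool Normal Strong No",
            "Overcast Cool Normal Strong Yes", "Sunny Mild High Weak No",
            "Sunny Cool Normal Weak Yes",      "Rain Mild Normal Weak Yes",
            "Sunny Mild Normal Strong Yes",    "Overcast Mild High Strong Yes",
            "Overcast Hot Normal Weak Yes",    "Rain Mild High Strong No"]]

tree = id3(data, ["Outlook", "Temperature", "Humidity", "Wind"], "Play")
```

On this data the sketch selects Outlook at the root, matching the illustrative example worked by hand later in the deck.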
10. Which attribute is the best classifier?
• Information Gain – a statistical property that measures how well a given attribute separates the training examples according to their target classification.
• This measure is used to select among the candidate attributes at each step while growing the tree.
11. Entropy
• S is a sample of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Then the entropy measures the impurity of S:
  Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
• If the target attribute can take c different values:
  Entropy(S) = Σ_{i=1..c} −p_i log2 p_i
• The entropy varies between 0 and 1:
  • if all the members belong to the same class => the entropy = 0
  • if there are equal numbers of positive and negative examples => the entropy = 1
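The formula above translates directly into a few lines of Python. A minimal sketch, assuming the collection is given as per-class example counts:

```python
import math

def entropy(counts):
    """Entropy(S) = sum over classes of -p_i * log2(p_i), from per-class counts."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c:                      # a class with zero examples contributes nothing
            p = c / total
            ent -= p * math.log2(p)
    return ent

print(entropy([7, 7]))    # 1.0  (equal positives and negatives: maximally impure)
print(entropy([10, 0]))   # 0.0  (all members in one class: pure)
```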
13. Information Gain
• Gain(S, A) = expected reduction in entropy due to sorting on A:
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
• Here Sv is the subset of S for which attribute A has value v; the entropy of each subset is weighted by the fraction |Sv| / |S| of examples that belong to it.
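The definition can be coded as a one-pass weighted sum. A sketch working on per-class counts; the example counts below are chosen to reproduce the entropies in the A1/A2 example on the next slide (29 positive and 35 negative examples overall, consistent with Mitchell's Figure 3.3):

```python
import math

def entropy(counts):
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c:
            p = c / total
            ent -= p * math.log2(p)
    return ent

def information_gain(parent_counts, subset_counts):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    total = sum(parent_counts)
    g = entropy(parent_counts)
    for counts in subset_counts:
        g -= sum(counts) / total * entropy(counts)
    return g

# A1 splits [29+, 35-] into [21+, 5-] and [8+, 30-]; A2 into [18+, 33-] and [11+, 2-].
print(round(information_gain([29, 35], [[21, 5], [8, 30]]), 3))   # 0.266
print(round(information_gain([29, 35], [[18, 33], [11, 2]]), 3))  # 0.121
```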
14. Information Gain - Example
• Entropy of the full collection S (64 examples): E = 0.994
• A1 splits S into subsets of 26 and 38 examples with E = 0.706 and E = 0.742:
  Gain(S, A1) = 0.994 – (26/64)·0.706 – (38/64)·0.742 = 0.266
• Information gained by partitioning along attribute A1 is 0.266
• A2 splits S into subsets of 51 and 13 examples with E = 0.937 and E = 0.619:
  Gain(S, A2) = 0.994 – (51/64)·0.937 – (13/64)·0.619 = 0.121
• Information gained by partitioning along attribute A2 is 0.121
16. An Illustrative Example
• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
• Since the Outlook attribute provides the best prediction of the target attribute, PlayTennis, it is selected as the decision attribute for the root node, and branches are created with its possible values (i.e., Sunny, Overcast, and Rain).
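These four gains can be reproduced from the 14-example PlayTennis table; the per-value class counts below are read off Mitchell's Table 3.2. A sketch:

```python
import math

def entropy(counts):
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c:
            p = c / total
            ent -= p * math.log2(p)
    return ent

def gain(parent, subsets):
    total = sum(parent)
    return entropy(parent) - sum(sum(s) / total * entropy(s) for s in subsets)

# S holds 9 positive and 5 negative examples; [yes, no] counts per attribute value.
S = [9, 5]
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # Sunny, Overcast, Rain
    "Humidity":    [[3, 4], [6, 1]],           # High, Normal
    "Wind":        [[6, 2], [3, 3]],           # Weak, Strong
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # Hot, Mild, Cool
}
gains = {attr: gain(S, subsets) for attr, subsets in splits.items()}
for attr, g in gains.items():
    print(attr, g)
```

The computed values agree with the slide up to rounding (the slide truncates, e.g. 0.2467 is shown as 0.246), and Outlook comes out highest.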
17. An Illustrative Example
• Ssunny = {D1, D2, D8, D9, D11}
• Gain(Ssunny, Humidity) = .970 − (3/5)·0.0 − (2/5)·0.0 = .970
• Gain(Ssunny, Temperature) = .970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = .570
• Gain(Ssunny, Wind) = .970 − (2/5)·1.0 − (3/5)·.918 = .019
18. Inductive Bias in ID3
• Inductive bias is the set of assumptions that, along with the training data, justify the classifications assigned by the learner to future instances.
• Given H as the power set of instances X:
• ID3 has a preference for short trees, with high-information-gain attributes near the root.
• ID3 has a preference for certain hypotheses over others, with no hard restriction on the hypothesis space H.
19. Occam’s Razor
• Prefer the simplest hypothesis that fits the data
• Arguments in favor:
  • A short hypothesis that fits the data is unlikely to be a coincidence
  • A long hypothesis that fits the data might be a coincidence
• Arguments opposed:
  • There are many ways to define small sets of hypotheses
  • Two learners that perceive the same training examples in terms of different internal representations can arrive at two different hypotheses
20. Issues in Decision Tree Learning
• Overfitting
• Incorporating Continuous-valued attributes
• Attributes with many values
• Handling attributes with costs
• Handling examples with missing attribute values
21. Overfitting
• Consider a hypothesis h and its error over
  • the training data: error_train(h)
  • the entire distribution D of data: error_D(h)
• The hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h’ ∈ H such that
  error_train(h) < error_train(h’) AND error_D(h) > error_D(h’)
23. Avoiding Overfitting
• Causes
  1. The training data contains errors or noise
  2. Small numbers of examples are associated with leaf nodes
• Avoiding overfitting
  1. Stop growing when a data split is not statistically significant
  2. Grow the full tree, then post-prune it
• Selecting the best tree
  1. Measure performance over the training data
  2. Measure performance over separate validation data
24. Reduced-Error Pruning
• Split data into training and validation sets
• Do until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node
  2. Greedily remove the one that most improves the validation set accuracy
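The loop above can be sketched on the PlayTennis tree. This is an illustrative sketch, not Quinlan's exact procedure: it accepts the first improving prune and rescans, rather than scoring every candidate per pass, and the validation examples below are fabricated for illustration (they make Rain days always playable, so the Wind test under Rain only fits noise and should be pruned away).

```python
import copy
from collections import Counter

def classify(tree, example):
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["Play"] for e in examples) / len(examples)

def internal_nodes(tree, path=()):
    """Yield the (attr, value) path to every internal node, deepest first."""
    if isinstance(tree, dict):
        for value, sub in tree["branches"].items():
            yield from internal_nodes(sub, path + ((tree["attr"], value),))
        yield path

def prune_at(tree, path, leaf):
    """Return a copy of the tree with the node at `path` replaced by `leaf`."""
    if not path:
        return leaf
    new = copy.deepcopy(tree)
    node = new
    for attr, value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1][1]] = leaf
    return new

def reduced_error_prune(tree, train, val):
    improved = True
    while improved:
        improved = False
        base = accuracy(tree, val)
        for path in list(internal_nodes(tree)):
            # Replace the node by the most common class of the training
            # examples reaching it, and keep the prune if validation improves.
            reaching = [e for e in train if all(e[a] == v for a, v in path)]
            majority = Counter(e["Play"] for e in reaching).most_common(1)[0][0]
            candidate = prune_at(tree, path, majority)
            if accuracy(candidate, val) > base:
                tree, improved = candidate, True
                break                      # restart the scan on the pruned tree
    return tree

playtennis_tree = {"attr": "Outlook", "branches": {
    "Sunny": {"attr": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attr": "Wind", "branches": {"Weak": "Yes", "Strong": "No"}},
}}

cols = ["Outlook", "Humidity", "Wind", "Play"]
train = [dict(zip(cols, r.split())) for r in [
    "Sunny High Weak No", "Sunny High Strong No", "Overcast High Weak Yes",
    "Rain High Weak Yes", "Rain Normal Weak Yes", "Rain Normal Strong No",
    "Overcast Normal Strong Yes", "Sunny High Weak No", "Sunny Normal Weak Yes",
    "Rain Normal Weak Yes", "Sunny Normal Strong Yes", "Overcast High Strong Yes",
    "Overcast Normal Weak Yes", "Rain High Strong No"]]

# Fabricated validation set: every Rain day is playable regardless of Wind.
val = [dict(zip(cols, r.split())) for r in [
    "Rain High Strong Yes", "Rain Normal Strong Yes", "Rain High Weak Yes",
    "Sunny High Weak No", "Sunny Normal Weak Yes", "Overcast High Strong Yes"]]

pruned = reduced_error_prune(playtennis_tree, train, val)
```

On this data the Wind subtree under Rain is collapsed to the leaf "Yes", while the Humidity test under Sunny survives because pruning it would hurt validation accuracy.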
26. Rule Post-Pruning
• The major drawback of reduced-error pruning is that when data is limited, withholding a validation set reduces even further the number of examples available for training.
Hence rule post-pruning:
• Convert the tree to an equivalent set of rules
• Prune each rule independently of the others
• Sort the final rules into the desired sequence for use
27. Converting a tree to rules
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
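The conversion is a walk over root-to-leaf paths: each path's attribute tests form the rule's antecedent and the leaf supplies the consequent. A sketch, with the tree encoded as a nested dict (an assumption of this example, not a format from the slides):

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one IF <conjunction> THEN <class> rule."""
    if not isinstance(tree, dict):          # leaf: the path so far is the rule body
        return [(conditions, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

play_tennis = {"attr": "Outlook", "branches": {
    "Sunny": {"attr": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attr": "Wind", "branches": {"Weak": "Yes", "Strong": "No"}},
}}

for conds, label in tree_to_rules(play_tennis):
    body = " ∧ ".join(f"({a} = {v})" for a, v in conds)
    print(f"IF {body} THEN PlayTennis = {label}")
```

The five leaves of the PlayTennis tree yield five rules, including the two shown on the slide; each can now be pruned independently of the others.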
28. Continuous Valued-Attributes
• Create a discrete-valued attribute that tests whether the continuous attribute exceeds some threshold
• So if Temperature = 75, we can infer that PlayTennis = Yes
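The threshold is chosen to maximize information gain; candidates only need to lie midway between adjacent (sorted) examples whose labels differ. A sketch on Mitchell's Temperature example, where the two candidates are 54 and 85:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the label-boundary midpoint with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best, best_gain = None, -1.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2:                        # candidate only where the label changes
            t = (v1 + v2) / 2
            below = [l for v, l in pairs if v < t]
            above = [l for v, l in pairs if v >= t]
            g = (base - len(below) / len(pairs) * entropy(below)
                      - len(above) / len(pairs) * entropy(above))
            if g > best_gain:
                best, best_gain = t, g
    return best

temps = [40, 48, 60, 72, 80, 90]
plays = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, plays))   # 54.0
```

The winning boolean attribute here is Temperature > 54, and an example with Temperature = 75 falls on its "Yes"-dominated side.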
29. Attributes with many values
• Problem:
  • If an attribute has many values, Gain will tend to select it
  • Example – using a Date attribute
• One approach – Gain Ratio:
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
  SplitInformation(S, A) = −Σ_{i=1..c} (|Si| / |S|) log2 (|Si| / |S|)
  where Si is the subset of S for which A has value vi
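A small sketch shows why the ratio penalizes a Date-like attribute: splitting 14 examples into 14 singletons gives SplitInformation = log2(14), which divides down even a maximal gain.

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i (|Si| / |S|) * log2(|Si| / |S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

# A boolean attribute that splits S in half has SplitInformation = 1,
# so its gain passes through unchanged; a 14-way singleton split is
# penalized by log2(14).
print(round(split_information([1] * 14), 3))   # 3.807
print(split_information([7, 7]))               # 1.0
```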
30. Attributes with costs
• Problem:
  • Medical diagnosis: BloodTest has cost $150
  • Robotics: Width_from_1ft has cost 23 sec
• One approach – replace Gain with a cost-sensitive measure:
  • Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
  • Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w
    where w ∈ [0, 1] is a constant that determines the relative importance of cost versus information gain.
31. Examples with missing attribute values
• What if some examples are missing values of attribute A? Use the training examples anyway and sort them through the tree:
  • If node n tests A, assign the example the most common value of A among the examples at node n
  • Or assign a probability pi to each possible value vi of A, and pass a fraction pi of the example down each corresponding branch of the tree
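Both strategies reduce to simple frequency statistics over the examples at the node. A sketch, using `None` to mark a missing value (the example records here are made up for illustration):

```python
from collections import Counter

def most_common_value(examples, attr):
    """Strategy 1: fill a missing value of A with the most common
    value of A among the training examples at the node."""
    values = [e[attr] for e in examples if e[attr] is not None]
    return Counter(values).most_common(1)[0][0]

def value_probabilities(examples, attr):
    """Strategy 2: estimate p_i for each observed value v_i of A, so the
    example can be passed down each branch with fractional weight p_i."""
    values = [e[attr] for e in examples if e[attr] is not None]
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

node_examples = [{"Wind": "Weak"}, {"Wind": "Weak"},
                 {"Wind": "Strong"}, {"Wind": None}]
```

For the node above, strategy 1 imputes "Weak", while strategy 2 sends 2/3 of the missing-value example down the Weak branch and 1/3 down the Strong branch.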
32. Some of the latest Applications
• Gesture Recognition
• Motion Detection
• Xbox 360 Kinect
34. References
• Mitchell, Tom M. "Decision Tree Learning." In Machine Learning. New York: McGraw-Hill, 1997.
• Flach, Peter A. "Tree Models: Decision Trees." In Machine Learning: The Art and Science of Algorithms That Make Sense of Data. Cambridge: Cambridge University Press, 2012.
Editor's Notes
We will see what is decision tree learning
Then decision tree representation
ID3 learning algorithm
Concepts like entropy and information gain, which shape the inductive bias of the learned decision trees.
Overfitting problem which can occur in the decision trees and ways to solve the problem.
Decision tree learning is a method of approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive expressions.
Disjunctive expressions are expressions that are disjunctions of conjunctions.
Here each internal node in PlayTennis (Outlook, Humidity, Wind) tests an attribute.
The values of an attribute correspond to branches: Outlook being Sunny, Overcast, or Rain; similarly Humidity as High or Normal, and Wind as Strong or Weak.
An instance is classified by starting at the root node of the tree and testing the attribute specified by this node and then moving down the tree branch corresponding to the value of the attribute in the given example. And then repeat the process for the sub-tree rooted at this node.
Example – (Outlook = Sunny, Humidity = High) will be sorted down the tree to the leftmost leaf node and hence be classified as a negative instance (PlayTennis = No)
Instances are represented by attribute-value pairs
The target function has discrete values
Disjunctive expressions
Training data may contain errors
Training data may contain missing attribute values
Examples as given on the slide:
1. Problems such as learning to classify medical patients by their diseases or symptoms
2. Equipment malfunctions by their cause
3. Classification of loan applicants by their likelihood of defaulting on payments
learns decision trees by constructing them top-down, beginning with the question "which attribute should be tested at the root of the tree?'
each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples. i.e. attribute which is most useful for classifying examples.
The best attribute is selected and used as the test at the root node of the tree.
A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node
The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
GREEDY Approach
No Backtracking - So we may get a suboptimal solution.
Find the attribute A that is the best decision attribute for the next node.
Assign A as the decision attribute for the node, making it an internal node.
Each value of A becomes a branch of the tree, i.e., a descendant of the internal node.
Then sort the training examples down these newly created branches.
If the training examples are perfectly classified, stop; otherwise iterate over the new leaf nodes, again finding the best attribute to classify the remaining examples.
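The loop above can be sketched in Python. This is a minimal sketch, not a full ID3 implementation: the function and variable names (`id3`, `info_gain`, etc.) are my own, examples are assumed to be dicts mapping attribute names to values, and the gain measure it relies on is the entropy-based one defined in the next section.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Expected reduction in entropy from partitioning on attr."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def id3(examples, labels, attributes):
    """Greedy top-down tree construction; no backtracking."""
    if len(set(labels)) == 1:            # perfectly classified -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Pick the best decision attribute for this node (greedy step).
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    node = {}
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        node[v] = id3([examples[i] for i in idx],
                      [labels[i] for i in idx],
                      [a for a in attributes if a != best])
    return {best: node}
```

On Mitchell's 14-example PlayTennis data this sketch selects Outlook at the root and makes the Overcast branch a pure Yes leaf.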
The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. We would like to select the attribute that is most useful for classifying examples. What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.
In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples.
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to the attribute.
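As a worked check of these definitions on Mitchell's PlayTennis data: the full sample S has 9 positive and 5 negative examples, and for Wind the 8 Weak examples split [6+, 2-] while the 6 Strong examples split [3+, 3-]. `H` below is my own shorthand for the entropy of a two-class collection.

```python
import math

def H(p, n):
    """Entropy of a two-class collection with p positive and n negative examples."""
    total = p + n
    out = 0.0
    for c in (p, n):
        if c:  # 0 * log(0) is taken to be 0
            out -= (c / total) * math.log2(c / total)
    return out

# Entropy of the full PlayTennis sample, [9+, 5-]:
entropy_s = H(9, 5)                                              # ~ 0.940

# Gain(S, Wind) = Entropy(S) - (8/14) H([6+, 2-]) - (6/14) H([3+, 3-])
gain_wind = entropy_s - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)  # ~ 0.048
```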
Since the Outlook attribute provides the best prediction of the target attribute, PlayTennis, over the training examples, it is selected as the decision attribute for the root node, and branches are created below the root for each of its possible values (Sunny, Overcast, and Rain).
Prefer the simplest hypothesis that fits the data
Argument in favor: because there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis that coincidentally fits the training data. In contrast, there are often many very complex hypotheses that fit the current training data but fail to generalize correctly to subsequent data.
Argument opposed: there are many ways to define small sets of hypotheses, and Occam's razor will produce two different hypotheses from the same training examples when it is applied by two learners that perceive these examples in terms of different internal representations.
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.
Predictably, the accuracy of the tree over the training examples increases monotonically as the tree is grown. However, the accuracy measured over the independent test examples first increases, then decreases.
Causes
This can happen when the training data contains errors or noise.
Overfitting is possible even when the training data are noise-free, especially when small numbers of examples are associated with leaf nodes
Avoiding Overfitting
Approaches that stop growing the tree early, before it reaches the point where it perfectly classifies the training data. However, it is difficult to estimate precisely when to stop growing the tree.
Approaches that allow the tree to overfit the data and then post-prune it.
What criterion is to be used to determine the final correct tree size?
Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.
The available data are separated into a training set, which is used to form the learned hypothesis, and a separate validation set, which is used to evaluate the accuracy of this hypothesis over subsequent data and, in particular, to evaluate the impact of pruning this hypothesis.
Reduced-error pruning considers each of the decision nodes in the tree to be a candidate for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
Nodes are removed only if the resulting pruned tree performs no worse than the original tree over the validation set.
Thus any leaf node added due to coincidental regularities in the training set is likely to be pruned because these same coincidences are unlikely to occur in the validation set.
Nodes are pruned iteratively, always choosing the node whose removal most increases decision tree accuracy over the validation set. Pruning continues until further pruning is harmful, i.e., until it decreases the accuracy of the tree over the validation set.
The major drawback of this approach is that when data are limited, withholding a validation set reduces even further the number of examples available for training.
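The pruning loop can be sketched as follows, assuming the tree is represented as nested dicts of the form {attribute: {value: subtree-or-leaf}}. All names (`reduced_error_prune`, `prune_at`, etc.) are my own, and the sketch assumes every validation example follows branches that exist in the tree and every node receives at least one training example.

```python
import copy
from collections import Counter

def classify(tree, example):
    """Walk from the root, following the branch for each tested attribute."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

def accuracy(tree, examples, labels):
    hits = sum(classify(tree, ex) == lab for ex, lab in zip(examples, labels))
    return hits / len(labels)

def internal_paths(tree, path=()):
    """Yield the (attribute, value) path to every internal decision node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, sub in tree[attr].items():
            yield from internal_paths(sub, path + ((attr, value),))

def majority_at(path, examples, labels):
    """Most common training label among examples routed down `path`."""
    subset = [lab for ex, lab in zip(examples, labels)
              if all(ex[a] == v for a, v in path)]
    return Counter(subset).most_common(1)[0][0]

def prune_at(tree, path, leaf):
    """Return a copy of the tree with the subtree at `path` replaced by `leaf`."""
    tree = copy.deepcopy(tree)
    if not path:
        return leaf
    node = tree
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf
    return node

def reduced_error_prune(tree, train_ex, train_lab, val_ex, val_lab):
    """Greedily prune while the pruned tree performs no worse on validation."""
    while True:
        best_tree, best_acc = None, accuracy(tree, val_ex, val_lab)
        for path in internal_paths(tree):
            leaf = majority_at(path, train_ex, train_lab)
            candidate = prune_at(tree, path, leaf)
            acc = accuracy(candidate, val_ex, val_lab)
            if acc >= best_acc:     # "performs no worse" on the validation set
                best_tree, best_acc = candidate, acc
        if best_tree is None:       # every pruning step was harmful
            return tree
        tree = best_tree
```

Each iteration removes at least one node, so the loop terminates even when several candidates tie with the unpruned tree.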
Infer the decision tree from the available training data, growing the tree until the training data fit as well as possible and allowing overfitting to occur.
Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.
Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
In rule postpruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition). For example, the leftmost path of the tree in Figure 3.1 is translated into the rule
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
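The tree-to-rules conversion can be sketched as follows. The nested-dict tree representation and the name `tree_to_rules` are my own; each rule pairs a tuple of (attribute, value) preconditions with the leaf classification.

```python
def tree_to_rules(tree, preconditions=()):
    """Yield one (antecedents, consequent) rule per root-to-leaf path."""
    if not isinstance(tree, dict):      # leaf: emit the accumulated rule
        yield preconditions, tree
        return
    attr = next(iter(tree))             # attribute tested at this node
    for value, subtree in tree[attr].items():
        yield from tree_to_rules(subtree, preconditions + ((attr, value),))
```

Applied to the PlayTennis tree, the leftmost path becomes the pair ((('Outlook', 'Sunny'), ('Humidity', 'High')), 'No'), i.e. the IF/THEN rule above.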
For a continuous attribute such as Temperature, ID3 dynamically defines candidate thresholds at the midpoints between adjacent examples (sorted by value) whose PlayTennis values differ: (48 + 60)/2 = 54 and (80 + 90)/2 = 85. The information gain can then be computed for each of the candidate boolean attributes, Temperature > 54 and Temperature > 85, and the best one selected (Temperature > 54).
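The threshold-selection step can be sketched as below; `candidate_thresholds` is a hypothetical helper name. With Mitchell's Temperature sequence 40, 48, 60, 72, 80, 90 and labels No, No, Yes, Yes, Yes, No, it yields the two candidates 54.0 and 85.0.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent examples (sorted by value) whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(left[0] + right[0]) / 2
            for left, right in zip(pairs, pairs[1:])
            if left[1] != right[1]]
```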
In some learning tasks the instance attributes may have associated costs. For example, in learning to classify medical diseases we might describe patients in terms of attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc. These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort. In such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.
In such cases we can modify the ID3 algorithm to use a cost-sensitive variant of the Gain measure, as given on the slide.
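Mitchell's chapter discusses two such cost-sensitive measures, likely what the slide shows: Tan and Schlimmer's Gain²(S, A) / Cost(A), and Nunez's (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines how heavily cost is weighted. A sketch, with a function name of my own:

```python
def cost_sensitive_gain(gain, cost, w=None):
    """Cost-sensitive attribute-selection measures.

    With w=None: Tan and Schlimmer's  Gain^2 / Cost.
    Otherwise:   Nunez's  (2^Gain - 1) / (Cost + 1)^w,
                 with w in [0, 1] weighting the importance of cost.
    """
    if w is None:
        return gain ** 2 / cost
    return (2 ** gain - 1) / (cost + 1) ** w
```

Either way, a high-cost attribute must offer proportionally more information gain to be selected over a cheap one.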
The pose recognition algorithm in the Kinect motion sensing device for the Xbox game console has decision tree classifiers at its heart (in fact, an ensemble of decision trees called a random forest, about which you will learn more in Chapter 11).