This course is about data mining: how we obtain optimized results, the types of techniques involved, and how these techniques are used.
A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice among a number of alternatives, and each leaf node represents a decision.
Basics of Decision Tree Learning. This slide set includes the definition of a decision tree, a basic example, the basic construction of a decision tree, and a MATLAB example.
Classification of common clustering algorithms and techniques, e.g., hierarchical clustering, distance measures, K-means, squared error, SOFM, and clustering of large databases.
This presentation covers Data Mining: Classification and Prediction, neural network representation, neural network application development, benefits and limitations of neural networks, a real estate appraiser example, kinds of data mining problems, data mining techniques, learning in ANNs, elements of ANNs, neural network architectures, recurrent neural networks, and ANN software.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
This presentation briefly discusses the following topics:
Data Analytics Lifecycle
Importance of Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communication Results
Phase 6: Operationalize
Data Analytics Lifecycle Example
Date: September 25, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
Classification: Basic Concepts and Decision Trees (by sathish sak)
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Lecture 11: Naive Bayes classifier. Supervised Learning. Web Search and PageRank (ppt,pdf)
Chapter 5 from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Chapter 13 from the book "Introduction to Information Retrieval" by C. Manning, P. Raghavan, H. Schutze
2. Classification: Definition
• Given a collection of records (training set):
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
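The workflow described above can be sketched in a few lines of Python. The snippet below is a minimal illustration, not part of the original slides; it assumes scikit-learn is available and encodes the Refund / Marital Status / Taxable Income records from the later example slides as numeric features.

# Minimal sketch of the classification workflow: split labeled records into
# training and test sets, learn a model from the training set, and validate
# it on the test set. The numeric encoding is an illustrative assumption.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records: [refund (1 = Yes, 0 = No), marital status (0 = Single, 1 = Married, 2 = Divorced), taxable income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # class attribute (Cheat)

# Divide the labeled data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Induction: learn a model (here, a decision tree) from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Deduction: apply the model to previously unseen records and measure its accuracy.
print("Test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))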
3. Illustrating Classification Task
[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
4. Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
5. Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
6. Example of a Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
• Refund = Yes → NO
• Refund = No → test MarSt
  – MarSt = Married → NO
  – MarSt = Single, Divorced → test TaxInc
    • TaxInc < 80K → NO
    • TaxInc > 80K → YES
7. Another Example of Decision Tree
(Same training data as slide 6.)
Model: Decision Tree with MarSt as the first splitting attribute
• MarSt = Married → NO
• MarSt = Single, Divorced → test Refund
  – Refund = Yes → NO
  – Refund = No → test TaxInc
    • TaxInc < 80K → NO
    • TaxInc > 80K → YES
There could be more than one tree that fits the same data!
8. Decision Tree Classification Task
[Figure: the same workflow and training/test sets as slide 3; a tree induction algorithm learns a decision tree from the training set (induction), and the tree is then applied to the test set (deduction).]
9.–14. Apply Model to Test Data
Model (from slide 6): Refund = Yes → NO; Refund = No → MarSt; MarSt = Married → NO; MarSt = Single, Divorced → TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch that matches the test record at each node:
• Refund = No → follow the "No" branch to the MarSt node.
• Marital Status = Married → follow the "Married" branch, which is a leaf labeled NO.
Assign Cheat to "No".
15. Decision Tree Classification Task
[Figure: repeat of slide 8; the tree induction algorithm learns a decision tree from the training set, and the learned tree is applied to the test set.]
16. Decision Tree Induction
• Many algorithms:
  1. Hunt's Algorithm (one of the earliest)
  2. CART (Classification And Regression Tree)
  3. ID3 (Iterative Dichotomiser 3)
  4. C4.5 (successor of ID3)
  5. SLIQ (does not require loading the entire dataset into main memory)
  6. SPRINT (similar approach to SLIQ; induces decision trees relatively quickly)
  7. CHAID (CHi-squared Automatic Interaction Detector); performs multi-level splits when computing classification trees
  8. MARS (extends decision trees to handle numerical data better)
  9. Conditional Inference Trees (a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting)
17. General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
(Training data: the Refund / Marital Status / Taxable Income / Cheat table from slide 6; Dt is the subset of those records reaching the current node.)
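The recursive procedure above can be written compactly. The sketch below is an illustrative rendering, not the slide author's implementation; the record format (dicts with a "class" key), the helper names choose_split and majority_class, and the use of the majority class as the default label are all assumptions.

# Hedged sketch of Hunt's algorithm as described above. choose_split is a
# plug-in that picks an attribute test (e.g., by information gain) and returns
# (attribute, {outcome: subset_of_records}) or None if no useful split exists.
from collections import Counter

def majority_class(records):
    # Most frequent class among the given records (used as the default label yd).
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def hunt(records, default, choose_split):
    if not records:                        # Dt is empty -> leaf labeled with the default class yd
        return {"leaf": default}
    classes = {r["class"] for r in records}
    if len(classes) == 1:                  # all records in Dt belong to one class yt -> leaf yt
        return {"leaf": classes.pop()}
    split = choose_split(records)          # attribute test to split Dt into smaller subsets
    if split is None:                      # no useful attribute test left -> majority-class leaf
        return {"leaf": majority_class(records)}
    attribute, partition = split
    children = {outcome: hunt(subset, majority_class(records), choose_split)
                for outcome, subset in partition.items()}
    return {"split_on": attribute, "children": children}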
18. Hunt’s Algorithm
[Figure: the tree is grown step by step on the training data from slide 6.]
• Step 1: a single leaf labeled "Don't Cheat".
• Step 2: split on Refund; Refund = Yes → Don't Cheat, Refund = No → Don't Cheat.
• Step 3: under Refund = No, split on Marital Status; Married → Don't Cheat, Single, Divorced → Cheat.
• Step 4: under Single, Divorced, split on Taxable Income; < 80K → Don't Cheat, >= 80K → Cheat.
19. Evaluation of a Classifier
• How predictive is the model we learned?
  – Which performance measure to use?
• Natural performance measure for classification problems: error rate on a test set
  – Success: instance's class is predicted correctly
  – Error: instance's class is predicted incorrectly
  – Error rate: proportion of errors made over the whole set of instances
  – Accuracy: proportion of correctly classified instances over the whole set of instances
    accuracy = 1 - error rate
20. Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

                        PREDICTED CLASS
                        Class = Yes    Class = No
ACTUAL   Class = Yes    a              b
CLASS    Class = No     c              d

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
21. Confusion Matrix - Example
• What can we learn from this matrix?
  – There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
  – The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
  – Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
  – In reality, 105 patients in the sample have the disease, and 60 patients do not.
22. Confusion Matrix – Confusion?
• False positives are actually negatives
• False negatives are actually positives
23. Confusion Matrix - Example
• Let's now define the most basic terms, which are whole numbers (not rates):
  – true positives (TP): these are cases in which we predicted yes (they have the disease), and they do have the disease.
  – true negatives (TN): we predicted no, and they don't have the disease.
  – false positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
  – false negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
24. Confusion Matrix - Computations
• This is a list of rates that are often computed from a confusion matrix:
• Accuracy: Overall, how often is the classifier correct?
  (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP+FN)/total = (10+5)/165 = 0.09
  (equivalent to 1 minus Accuracy; also known as "Error Rate")
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP/actual yes = 100/105 = 0.95
  (also known as "Sensitivity" or "Recall")
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP/actual no = 10/60 = 0.17
25. Confusion Matrix - Computations
• This is a list of rates that are often computed from a confusion matrix:
• Specificity: When it's actually no, how often does it predict no?
  TN/actual no = 50/60 = 0.83
  (equivalent to 1 minus False Positive Rate)
• Precision: When it predicts yes, how often is it correct?
  TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes/total = 105/165 = 0.64
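The rates on slides 24 and 25 follow directly from the four counts of the example (TP = 100, TN = 50, FP = 10, FN = 5). The short script below is an illustrative check, not part of the original deck.

# Reproducing the confusion-matrix rates from the disease example above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                 # 165 predictions in total

accuracy       = (TP + TN) / total        # (100 + 50) / 165 ~ 0.91
error_rate     = (FP + FN) / total        # (10 + 5) / 165 ~ 0.09  (1 - accuracy)
recall         = TP / (TP + FN)           # true positive rate / sensitivity: 100/105 ~ 0.95
false_pos_rate = FP / (FP + TN)           # 10/60 ~ 0.17
specificity    = TN / (FP + TN)           # 50/60 ~ 0.83  (1 - false positive rate)
precision      = TP / (TP + FP)           # 100/110 ~ 0.91
prevalence     = (TP + FN) / total        # 105/165 ~ 0.64

for name, value in [("Accuracy", accuracy), ("Error rate", error_rate),
                    ("Recall", recall), ("False positive rate", false_pos_rate),
                    ("Specificity", specificity), ("Precision", precision),
                    ("Prevalence", prevalence)]:
    print(f"{name}: {value:.2f}")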
26. Confusion Matrix – Example 2
• Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).
• Of the 33 'Spam' patterns, 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'.
• On the other hand, out of the 67 'Non-Spam' patterns, 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'.
Source: http://aimotion.blogspot.com/2010/08/tools-for-machine-learning-performance.html
28. Tree Induction
• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
29. How to Specify Test Condition?
• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split
  – 2-way split
  – Multi-way split
30. Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  Example (CarType): Family / Sports / Luxury
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Example (CarType): {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
31. Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Example (Size): Small / Medium / Large
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Example (Size): {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}
• What about this split? Size: {Small, Large} vs {Medium}
32. Splitting Based on Continuous Attributes
• Different ways of handling:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A >= v)
    • consider all possible splits and find the best cut
    • can be more compute intensive
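To illustrate the binary-decision approach (consider all possible cuts and keep the best one), here is a small sketch that is not from the slides: it scores every midpoint between consecutive sorted values by weighted Gini impurity and returns the best threshold. The example data reuses the taxable-income column and Cheat labels from slide 6; note that the full tree in that slide also tests Refund and MarSt, so the best single-attribute cut found here need not equal the 80K threshold used there.

# Sketch: exhaustive search for the best binary split (A < v) vs (A >= v)
# on a continuous attribute, scored by weighted Gini impurity.
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                                     # no cut between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate cut point v
        left  = [label for value, label in pairs[:i]]    # records with A < v
        right = [label for value, label in pairs[i:]]    # records with A >= v
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))   # best threshold for this single attribute and its weighted Gini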
34. Tree Induction
• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
35. How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
[Figure: three candidate test conditions and the class distributions they produce.]
• Own Car? Yes: (C0: 6, C1: 4); No: (C0: 4, C1: 6)
• Car Type? Family: (C0: 1, C1: 3); Sports: (C0: 8, C1: 0); Luxury: (C0: 1, C1: 7)
• Student ID? c1 through c20: each value contains a single record (either C0: 1, C1: 0 or C0: 0, C1: 1)
Which test condition is the best?
36. How to determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity
37. How to Measure Impurity?
• Given a data table that contains attributes and the class of the attributes, we can measure the homogeneity (or heterogeneity) of the table based on the classes.
• We say a table is pure or homogeneous if it contains only a single class.
• If a data table contains several classes, then we say that the table is impure or heterogeneous.
Source: http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm
38. How to Measure Impurity?
• There are several indices to measure the degree of impurity quantitatively.
• The most well-known indices are:
  – Entropy = -Σj pj log2(pj)
  – Gini index = 1 - Σj pj^2
  – Misclassification error = 1 - maxj {pj}
• All of the above formulas use pj, the probability of class j.
39. How to Measure Impurity? - Example
• In our example, the classes of Transportation mode consist of three groups: Bus, Car, and Train. In this case, we have 4 buses, 3 cars, and 3 trains (in short, we write 4B, 3C, 3T). The total data is 10 rows.
40. How to Measure Impurity? - Example
• Based on the data, we can compute the probability of each class. Since probability is equal to relative frequency, we have:
  – Prob(Bus) = 4/10 = 0.4
  – Prob(Car) = 3/10 = 0.3
  – Prob(Train) = 3/10 = 0.3
• Observe that when we compute the probabilities, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative indices of impurity degree.
41. How to Measure Impurity? - Entropy
• One way to measure the impurity degree is using entropy: Entropy = -Σj pj log2(pj).
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can now compute the entropy as:
  Entropy = -0.4 log2(0.4) - 0.3 log2(0.3) - 0.3 log2(0.3) = 1.571
42. How to Measure Impurity? - Entropy
• The entropy of a pure table (consisting of a single class) is zero because the probability is 1 and log2(1) = 0.
• Entropy reaches its maximum value when all classes in the table have equal probability.
• [Figure: plot of the maximum entropy for different numbers of classes n, where each class has probability p = 1/n.]
• In this case, the maximum entropy is equal to -n * p * log2(p) = log2(n).
• Notice that the value of entropy is larger than 1 if the number of classes is more than 2.
43. How to Measure Impurity? - Gini
• Another way to measure the impurity degree is using the Gini index: Gini index = 1 - Σj pj^2.
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can now compute the Gini index as:
  Gini index = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660
44. How to Measure Impurity? - Gini
• The Gini index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 - (1)^2 = 0.
• Similar to entropy, the Gini index also reaches its maximum value when all classes in the table have equal probability.
• [Figure: plot of the maximum Gini index for different numbers of classes n, where each class has probability p = 1/n.]
• Notice that the value of the Gini index is always between 0 and 1 regardless of the number of classes.
45. How to Measure Impurity? - Misclassification Error
• Still another way to measure the impurity degree is the misclassification error: Misclassification error = 1 - maxj {pj}.
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can now compute the index as:
  Misclassification error = 1 - Max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60
46. How to Measure Impurity? - Misclassification Error
• The misclassification error index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 - Max(1) = 0.
• The value of the misclassification error index is always between 0 and 1.
• In fact, the maximum Gini index for a given number of classes is always equal to the maximum misclassification error index: for n classes with equal probability p = 1/n, the maximum Gini index is 1 - n * (1/n)^2 = 1 - 1/n, while the maximum misclassification error index is 1 - max{1/n} = 1 - 1/n.
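The three impurity measures can be checked with a short script. The sketch below is illustrative rather than part of the deck; it reproduces the worked example with 4 buses, 3 cars, and 3 trains.

# Computing entropy, Gini index, and misclassification error for the
# transportation-mode example (4 Bus, 3 Car, 3 Train).
from collections import Counter
from math import log2

def class_probabilities(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    return -sum(p * log2(p) for p in class_probabilities(labels))

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def misclassification_error(labels):
    return 1.0 - max(class_probabilities(labels))

modes = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3
print(round(entropy(modes), 3))                   # 1.571
print(round(gini(modes), 3))                      # 0.66
print(round(misclassification_error(modes), 2))   # 0.6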
47. Information Gain
• The reason for computing the impurity degree both for the data table D and for the subset tables Si is that we would like to compare the impurity degree before splitting the table (i.e., data table D) with the impurity degree after splitting it according to the values of an attribute i (i.e., subset tables Si). The measure used to compare this difference in impurity degrees is called information gain. We would like to know what our gain is if we split the data table based on some attribute's values.
48. Information Gain - Example
• For example, in the parent table below, we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short: 4B, 3C, 3T).
49. Information Gain - Example
• For example, we split using the travel cost attribute and compute the degree of impurity of each resulting subset.
50. Information Gain - Example
• Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables, where the weights are based on the number of records for each attribute value. Suppose we use entropy as the measurement of impurity degree; then we have:
  Information gain(i) = Entropy of parent table D - Σk (nk / n) * Entropy of subset table Si for value k
• The information gain of the attribute Travel cost per km is computed as 1.571 - (5/10 * 0.722 + 2/10 * 0 + 3/10 * 0) = 1.210.
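The gain computation above can be reproduced directly. The sketch below is not from the slides; the class composition of each subset is not spelled out in the text, so the counts used here are an assumption chosen to match the stated subset sizes (5, 2, 3) and subset entropies (0.722, 0, 0).

# Information gain for the "Travel cost per km" split in the example above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, subsets):
    n = len(parent)
    weighted = sum(len(subset) / n * entropy(subset) for subset in subsets)
    return entropy(parent) - weighted

parent = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3   # 4B, 3C, 3T; entropy 1.571
subsets = [
    ["Bus"] * 4 + ["Train"],   # assumed subset of 5 records, entropy ~ 0.722
    ["Train"] * 2,             # assumed subset of 2 records, entropy 0
    ["Car"] * 3,               # assumed subset of 3 records, entropy 0
]
print(round(information_gain(parent, subsets), 3))   # ~ 1.21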
51. Information Gain - Example
• You can also compute the information gain based on the Gini index or the classification error in the same way. The results are given in the table on the slide.
55. Information Gain - Example
• The table below summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree based on all three methods; you can use any one of entropy, the Gini index, or the classification error index.
• Now we find the optimum attribute, the one that produces the maximum information gain (i* = argmax {information gain of attribute i}). In our case, Travel cost per km produces the maximum information gain.
56. Information Gain - Example
• So we split using the "Travel cost per km" attribute, as this produces the maximum information gain.