Data mining
Assignment week 4




BARRY KOLLEE

10349863
Assignment 4
Exercise 1: Pruning
1. Which problem do we try to address when using pruning?

“Overfitting and lack of generalization beyond training data, i.e. models that describe the training data
(too) well, but do not model the principles and characteristics underlying the data.”

In terms of the tree diagram, pruning collapses part of the tree into a single node. The difference
is shown in the two diagrams below:

[Figure: two tree diagrams, before and after pruning]

2. Describe the purpose of separating the data into training, development, and test data.

“Training data is used to build the model, and test data to test it. The training data by itself cannot
measure to what extent the model will perform on (i.e. generalize to) unseen data. Test data
measures this, but we should not use the test data to directly inform our model construction. For this
purpose a third set is used: the development data set, which behaves like the test set but whose feedback
can be used to change the model.”

We use the training set to fit the classifier. In general, the more training data we have, the more
accurate the resulting model tends to be.

The other two sets are used to evaluate the performance of the classifier. The development set
is used to compare the accuracy of different configurations of our classifier. It is called the development
set because we evaluate on it repeatedly while developing the model.

In the end we have a model that performs well on the development data. To estimate how well this
final model will deal with new data, we use the test data, which played no part in building or tuning it.
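
The three-way split described above can be sketched as follows. This is only an illustration: the function name split_data, the 60/20/20 proportions and the fixed seed are assumptions, not part of the assignment.

```python
import random

def split_data(instances, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle the instances and split them into training, development and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)      # fixed seed so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]                 # used to build the model
    dev = shuffled[n_train:n_train + n_dev]    # used to compare configurations
    test = shuffled[n_train + n_dev:]          # used once, for the final estimate
    return train, dev, test

train, dev, test = split_data(range(100))
print(len(train), len(dev), len(test))
```

The test set is the remainder after the other two are taken, so every instance ends up in exactly one of the three sets.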




Exercise 2: Information Gain and Attributes with Many Values.

Information gain is defined as:

       Gain(S, A) = H(S) – sum over v of ( |Sv(A)| / |S| ) * H(Sv(A))

According to this definition, information gain favors attributes with many values.
Why? Give an example.


We use a training set with (as shown in the table):

        •        n instances
        •        k ordinary attributes A1 … Ak, plus an attribute A* that takes a different value (V1 … Vn) for every instance


            Instance            A1              …                  Ak                   A*                class
            1                   T               …                 Black                 V1                 C1
            2                   T               …                 White                 V2                 C2
            ..                  ..              …                  …                    …                   …
            n                   F               …                 Black                 Vn                 Cn

Assume a binary class with a 50/50 chance of a ‘-’ and a ‘+’ classification. Because every value of A*
occurs in exactly one instance, splitting on A* gives n subsets that each hold a single instance,
which is either a plus or a minus. We note this as follows.


       Svi(A*) = [1+, 0-]   or   [0+, 1-]



We can calculate the entropy (uncertainty) of both kinds of subset, using the convention 0 log2 0 = 0:



       H(S+) = - (1/1 log2 1/1 + 0/1 log2 0/1) = 0

       H(S-) = - (0/1 log2 0/1 + 1/1 log2 1/1) = 0


To calculate the information gain we apply the following formula:


       Gain(S, A*) = H(S) – sum over v of ( |Sv(A*)| / |S| ) * H(Sv(A*))

       Gain(S, A*) = H(S) – (weighted H(S+) + weighted H(S-))
       Gain(S, A*) = H(S) – (0 + 0)
       Gain(S, A*) = H(S)


We see that H(S+) and H(S-) are both 0, so nothing is subtracted and the gain equals H(S), the highest
value it can take. An attribute with many values, in the extreme case one value per instance, therefore gets the
highest information gain, even though it tells us nothing about unseen instances.
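
The calculation above can be checked with a small sketch. The four-instance data set, the attribute a1 and the helper names entropy and information_gain are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy H(S) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H(S) - sum over v of |Sv| / |S| * H(Sv) for one attribute."""
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)   # group labels by attribute value
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["+", "+", "-", "-"]        # hypothetical 50/50 class distribution
a1 = ["T", "F", "T", "F"]            # hypothetical two-valued attribute
a_star = ["V1", "V2", "V3", "V4"]    # A*: a unique value for every instance

print(information_gain(a1, labels))
print(information_gain(a_star, labels))
```

In this toy data a1 splits each class evenly and gains nothing, while A* reaches the full H(S) of 1 bit purely because each of its subsets contains a single instance.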




Exercise 3: Missing Attribute Values
Consider the following set of training instances. Instance 2 has a missing value for attribute a1.

Apply at least two different strategies for dealing with missing attribute values and show how they
work in this concrete example.

Example 1:

We can predict the missing true/false value of attribute ‘a1’ by looking at the values of attribute
a2. Within a2, ‘true’ and ‘false’ are equally likely (50 % each). If we assume the same distribution
for a1, the value behind the question mark could just as well be ‘false’.

Example 2:

We can also focus on the class attribute. Within a2 we can state the following:
   •    There’s a 100 % chance of a ‘+’ class when the attribute value is ‘true’.
   •    There’s a 50 % chance of a ‘+’ class when the attribute value is ‘false’.

Following this reasoning, we fill in ‘true’ at the question mark.

Example 3:

Now we look only at attribute a1. From its known values we can estimate the probability of each
value that could replace the question mark:


       P(true) = 2/3
       P(false) = 1/3

The most common value, ‘true’, is therefore the most likely replacement.
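
Example 3 can be sketched in code. The list a1 below is hypothetical: it only reproduces the counts used above (two known ‘true’ values, one known ‘false’, one missing), not the actual table. Splitting the instance into weighted fractions (strategy B) is a common alternative that is not named in the answer above.

```python
from collections import Counter

# Hypothetical values of a1; None marks the missing value of instance 2.
a1 = [True, None, True, False]

known = [v for v in a1 if v is not None]
counts = Counter(known)

# Strategy A: replace the missing value with the most common known value.
most_common_value = counts.most_common(1)[0][0]
imputed = [most_common_value if v is None else v for v in a1]

# Strategy B: keep both options and weight them by their relative frequency,
# i.e. treat instance 2 as 2/3 of a 'true' instance and 1/3 of a 'false' one.
weights = {value: count / len(known) for value, count in counts.items()}

print(imputed)
print(weights)
```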




Exercise 4: Regression Trees

1. What are the stopping conditions for decision trees predicting discrete
classes?

       1.   All instances under a node have the same label.
       2.   All attributes have been used along a branch.
       3.   There are no instances under a node.


Every training instance carries exactly one label from a predefined set of discrete outcomes, as in
the weather example from the lecture. Because the outcomes are predefined, we can check whether a node
is pure: we stop when all of its instances are ‘Yes’ or all are ‘No’.
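
The three stopping conditions can be written as a single check. This is a minimal sketch; the function name should_stop_classification and its arguments are assumptions for illustration.

```python
def should_stop_classification(labels, remaining_attributes):
    """Return True if a classification tree should stop splitting at this node."""
    if not labels:                  # 3. there are no instances under the node
        return True
    if len(set(labels)) == 1:       # 1. all instances have the same label
        return True
    if not remaining_attributes:    # 2. all attributes were used along the branch
        return True
    return False

print(should_stop_classification(["Yes", "Yes"], ["humidity"]))
print(should_stop_classification(["Yes", "No"], ["humidity"]))
```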




2. Why and how do the stopping conditions have to be changed for decision
trees that predict numerical values (e.g., regression trees)?

1. Measure the standard deviation of the target values of all instances under a node. If this value is
below a pre-defined threshold, we stop.
2. and 3. remain as before.

Why: numerical target values are rarely exactly equal, so the condition that all instances under a node
have the same label would almost never be met. Instead of requiring one fixed value like ‘yes’ or ‘no’,
we accept a certain range within which the values may lie. For example, for temperature we predict a
particular number of degrees instead of ‘hot’ or ‘warm’. With this change we can still use several
stopping conditions in our decision tree.
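
The adapted stopping conditions for a regression tree might look like this. The function name, the threshold max_std and the use of the population standard deviation are assumptions made for the sketch.

```python
from statistics import pstdev

def should_stop_regression(targets, remaining_attributes, max_std=1.0):
    """Return True if a regression tree should stop splitting at this node."""
    if not targets:                  # 3. there are no instances under the node
        return True
    if pstdev(targets) < max_std:    # 1. target values are close enough together
        return True
    if not remaining_attributes:     # 2. all attributes were used along the branch
        return True
    return False

print(should_stop_regression([20.1, 20.3, 19.9], ["humidity"]))
print(should_stop_regression([5.0, 25.0], ["humidity"]))
```

The threshold controls how fine-grained the tree becomes: a smaller max_std forces more splits before a node is accepted as a leaf.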



