1. INTRODUCTION
1.1 Project Overview:
Data mining:
Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help companies focus on the most important information
in their data warehouses. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses
offered by data mining move beyond the analyses of past events provided by retrospective tools
typical of decision support systems. Data mining tools can answer business questions that
traditionally were too time consuming to resolve. They scour databases for hidden patterns,
finding predictive information that experts may miss because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining techniques
can be implemented rapidly on existing software and hardware platforms to enhance the value of
existing information resources, and can be integrated with new products and systems as they are
brought on-line. When implemented on high performance client/server or parallel processing
computers, data mining tools can analyze massive databases to deliver answers to questions such
as, "Which clients are most likely to respond to my next promotional mailing, and why?"
The most commonly used techniques in data mining are:
Artificial neural networks: Non-linear predictive models that learn through training and
resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions
generate rules for the classification of a dataset. Specific decision tree methods include
Classification and Regression Trees (CART) and Chi Square Automatic Interaction
Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic
combination, mutation, and natural selection in a design based on the concepts of
evolution.
Nearest neighbor method: A technique that classifies each record in a dataset based on a
combination of the classes of the k record(s) most similar to it in a historical dataset
(where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical
significance.
DECISION TREES:
Decision trees have become one of the most powerful and popular approaches in knowledge
discovery and data mining, the science and technology of exploring large and complex bodies of
data in order to discover useful patterns. The area is of great importance because it enables
modeling and knowledge extraction from the abundance of data available. Both theoreticians and
practitioners are continually seeking techniques to make the process more efficient, cost-
effective and accurate. Decision trees, originally implemented in decision theory and statistics,
are highly effective tools in other areas such as data mining, text mining, information extraction,
machine learning, and pattern recognition. Decision trees offer many benefits in data mining:
Self-explanatory and easy to follow when compacted
Able to handle a variety of input data: nominal, numeric and textual
Able to process datasets that may have errors or missing values
High predictive performance for a relatively small computational effort
Available in many data mining packages over a variety of platforms
Useful for various tasks, such as classification, regression, clustering and feature
selection
ADVANTAGES AND DISADVANTAGES OF DECISION TREES
Advantages
Simple to understand and interpret.
Requires little data preparation.
Able to handle both numerical and categorical data.
Uses a white box model.
Performs well with large data in a short time.
Disadvantages
Output attribute must be categorical.
Limited to one output attribute.
1.2 Software Process Model
The Waterfall Model is one of the most widely used software development processes. It is also called the "linear sequential model" or the "classic life cycle"; with feedback paths between the phases it becomes the iterative waterfall model, which is widely used in commercial development projects. It is called so because we move to the next phase (step) after getting input from the previous phase, like in a waterfall, where water flows down from the upper steps.
In this iterative waterfall model the software development process is divided into five phases:
a) SRS (Software Requirement Specifications)
b) System Design and Software Design
c) Implementation and Unit testing
d) Integration and System Testing
e) Operation and Maintenance
Iterative Waterfall Model with its stages
Let's discuss all these stages of the waterfall model in detail.
Software Requirements Specifications:
This is the most crucial phase for the whole project; here the project team, along with the customer, makes a detailed list of user requirements. The project team chalks out the functionality and limitations (if there are any) of the software being developed, in detail. The document that contains all this information is called the SRS, and it states the requirements clearly and unambiguously. A small amount of top-level analysis and design is also documented. This document is verified and endorsed by the customer before the project starts. The SRS serves as the input for the further phases.
4. System Design and Software Design:
Using the SRS as input, the system design is done. System design includes the design of software and hardware, i.e. the functionality of hardware and software is separated out. After this separation, the design of the software modules (see what modularity is) is done. The design process translates the requirements into a representation of the software that can be assessed for quality before code generation begins. At the same time a test plan is prepared; the test plan describes the various tests which will be carried out on the system after completion of development.
Implementation and Unit Testing:
Now that we have the system design, code generation begins. Code generation is the conversion of the design into machine-readable form. If the design of the software and system is done well, code generation can be done easily. The software modules are now further divided into units. A unit is a logically separable part of the software, and units can be tested separately. In this phase unit testing is done by the developers themselves, to ensure that there are no defects.
Integration and System testing:
Now the units of the software are integrated together and a system is built, so we have a complete software product at hand, which is tested to check whether it meets the functional and performance requirements of the customer. Testing is done as per the steps defined in the test plan, to ensure that defined inputs produce actual results which agree with the required results. A test report is generated which contains the test results.
Operation & maintenance:
Now that the tested software is complete, we deliver it to the client. The client's feedback is taken and any changes, if required, are made in this phase. This phase goes on till the software is retired.
1.4 Roles and Responsibilities:
All the team members did the job collaboratively. Each of us took part in every task, sharing the work and working as a team rather than as individuals.
1.5 Tools and Techniques:
ABOUT ECLIPSE
Eclipse is a universal platform for integrating development tools, with an open, extensible architecture based on plug-ins. Eclipse provides a number of aids that make writing Java code
much quicker and easier than using a text editor. This means that you can spend more time
learning Java, and less time typing and looking up documentation. The Eclipse debugger and
scrapbook allow you to look inside the execution of the Java code. This allows you to “see”
objects and to understand how Java is working behind the scenes. Eclipse provides full support
for agile software development practices such as test-driven development and refactoring.
2. Existing System
2.1 Description of Existing System:
ID3 is a mathematical algorithm for building the decision tree, invented by J. Ross Quinlan in 1979. It uses information theory, invented by Shannon in 1948, and builds the tree from the top down, with no backtracking. It employs a top-down greedy search through the space of possible decision trees: greedy because there is no backtracking, picking the highest value first. Information Gain is used to select the most useful attribute for classification, i.e. the attribute that is most useful for classifying examples (the attribute that has the highest Information Gain).
Iterative Dichotomiser 3 Algorithm is a Decision Tree learning algorithm. The name is correct in
that it creates Decision Trees for “dichotomizing” data instances, or classifying them discretely
through branching nodes until a classification “bucket” is reached (leaf node). By using ID3 and
other machine-learning algorithms from Artificial Intelligence, expert systems can engage in
tasks usually done by human experts, such as doctors diagnosing diseases by examining various
symptoms (the attributes) of patients (the data instances) in a complex Decision Tree. Accurate
Decision Trees are fundamental to Data Mining and Databases.
Decision tree learning is a method for approximating discrete-valued target functions, in which
the learned function is represented by a decision tree. Decision tree learning is one of the most
widely used and practical methods for inductive inference. The input data of ID3 is known as
sets of “training” or “learning” data instances, which will be used by the algorithm to generate
the Decision Tree. The machine is “learning” from this set of preliminary data.
The ID3 algorithm generates a Decision Tree by using a greedy search through the inputted sets
of data instances so as to determine nodes and the attributes they use for branching. Also, the
emerging tree is traversed in a top-down (root to leaf) approach through each of the nodes within
the tree. This occurs recursively, in the manner of classic tree-traversal strategies. The traversal attempts to determine, using the inputted sets of data, whether the decision attribute on which the branching will be based for any particular emerging node is the most suitable branching attribute. One particular metric that can be used to determine whether a branching attribute is adequate is INFORMATION GAIN, which is computed from ENTROPY.
ID3 uses Entropy to determine if, based on the inputted set of data, the selected branching
attribute for a particular node of the emerging tree is adequate. Specifically, the attribute that
results in the most reduction of Entropy related to the learning sets is the best.
GOAL: Find a way to optimally classify a learning set, such that the resulting Decision Tree is
not too deep and the number of branches (internal nodes) of the Tree is minimized.
SOLUTION: The more entropy (impurity) there is in a collection of data sets, the more branches and depth the tree will have. FIND entropy-reducing attributes in the learning sets and use them for branching.
Information Gain = measuring the expected reduction in Entropy. The higher the Information
Gain, the more expected reduction in Entropy.
It turns out that Entropy, a measure of the non-homogeneity within a set of learning examples, can be calculated in a straightforward manner.
CALCULATION AND CONSTRUCTION OF ID3 ALGORITHM
STEP1: Take the dataset and calculate the count of each unique attribute value of the class label.
STEP2: Calculate the amount of information for the class label using the formula
I (s1, s2) = -(s1/s)log2 (s1/s) - (s2/s)log2 (s2/s), where s1 and s2 are the counts of the two class labels and s = s1 + s2.
STEP3: Calculate the count of each unique attribute value against the unique attribute values of the class label.
STEP4: Calculate the expected information for each unique value of each attribute with the same formula, I (s1j, s2j), applied to the tuples that have that value.
STEP5: Calculate the entropy value for each attribute. Here take the expected information values of all unique attribute values of the attribute, weighted by their share of the tuples:
E (A) = sum over all unique values j of ((s1j + s2j)/s) x I (s1j, s2j)
STEP6: Calculate the gain value with the help of the amount of information value and the entropy value:
Gain (A) = I (s1, s2) - E (A) (a Java sketch of these formulas is given below)
STEP7: Construct the decision tree based on the decision tree algorithm given later (CONSTRUCTION OF ID3 DECISION TREE ALGORITHM).
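These formulas can be checked with a small amount of Java. The sketch below is illustrative only (the class Id3Math and its methods are our own names, not part of the project code); it reproduces the Gain(age) computation from the worked example that follows.

public class Id3Math {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // I(s1, s2) = -(s1/s)log2(s1/s) - (s2/s)log2(s2/s), a zero count contributing 0
    static double info(int s1, int s2) {
        double s = s1 + s2, r = 0;
        if (s1 > 0) r -= (s1 / s) * log2(s1 / s);
        if (s2 > 0) r -= (s2 / s) * log2(s2 / s);
        return r;
    }

    // E(A) = sum of ((s1j + s2j)/total) * I(s1j, s2j) over the unique values of A
    static double entropy(int[][] countsPerValue, int total) {
        double e = 0;
        for (int[] c : countsPerValue)
            e += ((double) (c[0] + c[1]) / total) * info(c[0], c[1]);
        return e;
    }

    public static void main(String[] args) {
        double i = info(6, 4);                            // amount of information, 0.9709
        int[][] age = { {3, 1}, {3, 1}, {0, 2} };         // (no, yes) counts for >40, <30, 30-40
        double eAge = entropy(age, 10);                   // 0.6490
        System.out.println("Gain(age) = " + (i - eAge)); // prints about 0.3219
    }
}

Running it prints Gain(age) = 0.3219..., matching the hand calculation in the example below.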
EXPLANATION WITH EXAMPLE FOR ID3 DECISION TREE
Age, colour-cloth, income, student, buys _computer
>40, red, high, no, no
<30, yellow, high, no, no
30--40, blue, high, no, yes
>40, red, medium, no, yes
<30, white, low, yes, no
>40, red, low, yes, no
30--40, blue, low, yes, yes
<30, yellow, medium, no, yes
<30, yellow, low, yes, no
>40, white, medium, no, no
This is the sample dataset for buys_computer.
Attribute names: Age, colour-cloth, income, student, buys_computer
Unique values of the Age attribute: >40, <30, 30-40
Class labels of the dataset: yes, no
Step1:
For the given dataset we first have to calculate the amount of information; for that we have to know the counts of the class labels in the given dataset, that is, NO's count is 6 and YES's count is 4. Then the amount of information is calculated using the formula:
No = 6 and yes = 4
I (6, 4) = -6/10log2 (6/10) - 4/10log2 (4/10) = 0.4422 + 0.5287 = 0.9709
This calculated quantity is called the amount of information, and it is a single value for the whole dataset.
Step2: for each attribute take its unique values; after that the expected information, entropy and gain values are calculated by the formulae.
Take the Age attribute and separate the values by comparison with the class labels:
No , yes count
>40 3 , 1 4
<30 3 , 1 4
30-40 - , 2 2
Total 10
This is the table for extracting data, based on the class labels of buys_computer.
Now calculate the Expected Information for the unique attribute values >40, <30, 30-40
>40 I (3, 1) = -3/4log2 (3/4) - 1/4log2 (1/4) = 0.8113
<30 I (3, 1) = -3/4log2 (3/4) - 1/4log2 (1/4) = 0.8113
30-40 I (0, 2) = -2/2log2 (2/2) = 0
Now calculate the Entropy value for the Age attribute
E (age) = 4/10(0.8113) + 4/10(0.8113) + 2/10(0) = 0.3245 + 0.3245 = 0.6490
Step3: calculate the Gain value for the age attribute from the amount of information and the entropy value
Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0.6490 = 0.3219
Second attribute: For color _cloth extracting the data based on the class labels of No, yes
No , yes count
Red 2 , 1 3
Yellow 2 , 1 3
Blue - , 2 2
White 2 , - 2
Total 10
Calculate the Expected information for the color _ cloth
Red I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
Yellow I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
Blue I (0, 2) = -2/2log2 (2/2) = 0
White I (2, 0) = -2/2log2 (2/2) = 0
Now calculate the Entropy value for attribute color_cloth
E (color_cloth) = 3/10(0.9183) + 3/10(0.9183) + 2/10(0) + 2/10(0) = 0.5510
Gain value for color _ cloth
Gain (color_cloth) = I (s1, s2) - E (color_cloth) = 0.9709 - 0.5510 = 0.4199
Third Attribute: For Income extracting the data based on the class labels of No, yes
No , yes count
High 1 , 2 3
Medium 2 , 1 3
Low 3 , 1 4
Total 10
Calculate the Expected information for the Income
High I (1, 2) = -1/3log2 (1/3) - 2/3log2 (2/3) = 0.9183
Medium I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
Low I (3, 1) = -3/4log2 (3/4) - 1/4log2 (1/4) = 0.8113
Now calculate the Entropy value for attribute Income
E (Income) = 3/10(0.9183) + 3/10(0.9183) + 4/10(0.8113) = 0.8755
Gain value for Income
Gain (income) = I (s1, s2) - E (Income) = 0.9709 - 0.8755 = 0.0954
Fourth Attribute: For Student extracting the data based on the class labels of No, yes
No , yes count
No 3 , 3 6
Yes 1 , 3 4
Total 10
Calculate the Expected information for the student
No I (3, 3) = -3/6log2 (3/6) - 3/6log2 (3/6) = 1
Yes I (1, 3) = -1/4log2 (1/4) - 3/4log2 (3/4) = 0.8113
Now calculate the Entropy value for attribute student
E (student) = 6/10(1) + 4/10(0.8113) = 0.9245
Gain value for student
Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.9245 = 0.0464
The important point now is to compare all the Gain values calculated above:
Gain (age) = 0.3219
Gain (color_cloth) = 0.4199  highest Gain value
Gain (income) = 0.0954
Gain (student) = 0.0464
Color_cloth has the highest Gain value of the four attributes, so as per the algorithm it becomes the Root Node in the construction of the decision tree. Then take its unique attribute values (red, blue, white, yellow) as the arcs. For the value red, extract every row of the dataset in which the colour-cloth value is red; these rows form one table, called the red attribute table. In the same way we calculate all the Gain values and then decide where to place the right node, left node and leaf nodes.
In the above decision tree color_cloth is the root node, and the arcs are the attribute values of color_cloth. Based on each attribute value, extract the data while dropping the color_cloth attribute. Observe the first arc: we extract the data based on the red attribute value, take the resulting tuples, and calculate the expected information, the entropy value and finally the gain value. This is continued for the left node, right node and leaf node selections, as sketched below.
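This extraction of the red attribute table can be written in a few lines of Java. The snippet is hypothetical (RedTable is an illustrative name and only a fragment of the ten-row dataset is shown): it keeps the rows whose colour-cloth value is red and drops that column.

import java.util.*;
import java.util.stream.*;

public class RedTable {
    public static void main(String[] args) {
        // columns: age, colour-cloth, income, student, buys_computer
        List<String[]> rows = Arrays.asList(
            new String[]{">40", "red", "high", "no", "no"},
            new String[]{"<30", "yellow", "high", "no", "no"},
            new String[]{">40", "red", "medium", "no", "yes"},
            new String[]{">40", "red", "low", "yes", "no"});
        List<String[]> red = rows.stream()
            .filter(r -> r[1].equals("red"))                  // keep only colour-cloth = red
            .map(r -> new String[]{r[0], r[2], r[3], r[4]})   // drop the colour-cloth column
            .collect(Collectors.toList());
        red.forEach(r -> System.out.println(String.join(", ", r)));
    }
}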
Red reference tuples
age,income,student,buys_computer
>40,high,no,no
>40,medium,no,yes
>40,low,yes,no
For Age extracting the data based on the class labels of No, yes
No , yes count
>40 2 , 1 3
Total 3
Calculate the Expected information for Age
>40 I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
Now calculate the Entropy value for attribute Age
E (age) = 3/3(0.9183) = 0.9183
Gain value for Age
Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0.9183 = 0.0526
For Income extracting the data based on the class labels of No, yes
No , yes count
High 1 , 0 1
Medium 0 , 1 1
Low 1 , 0 1
Total 3
Calculate the Expected information for the Income
High I (1, 0) =-1/1log2 (1/1) = 0
Medium I (0, 1) =-1/1log2 (1/1) = 0
Low I (1, 0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute Income
E( Income ) = 1/3(0)+1/3(0)+1/3(0) = 0
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 1 , 1 2
Yes 1 , 0 1
Total 3
Calculate the Expected information for the student
No I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1
Yes I (1,0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student
E (student) = 2/3(1) + 1/3(0) = 0.6667
Gain value for student
Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.6667 = 0.3042
These are the expected information, entropy and gain values calculated for the Red table.
The Gain values of age, income and student are 0.0526, 0.9709 and 0.3042.
Income has the highest gain value, so the left node is income.
Blue reference tuples
age,income,student,buys_computer
30-40, high, no, yes
30-40, low, yes, yes
For Age extracting the data based on the class labels of No, yes
No , yes count
30-40 0 , 2 2
Total 2
Calculate the Expected information for Age
30-40 I (0, 2) = -2/2log2 (2/2) = 0
Now calculate the Entropy value for attribute age
E (age) = 2/2(0) = 0
Gain value for Age
Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0 = 0.9709
For Income extracting the data based on the class labels of No, yes
No , yes count
High 0 , 1 1
Low 0 , 1 1
Total 2
Calculate the Expected information for the Income
High I (0, 1) =-1/1log2 (1/1) = 0
Low I (0, 1) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute Income
E( Income ) = 1/2(0)+1/2(0) = 0
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 0 , 1 1
Yes 0 , 1 1
Total 2
Calculate the Expected information for the student
No I (0,1) =-1/1log2 (1/1) = 0
Yes I (0,1) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student
E (student) = 1/2(0) + 1/2(0) = 0
Gain value for student
Gain(student) = I (s1, s2)-E (student) = 0.9709-0= 0.9709
The Gain values of age, income and student are all 0.9709.
Since all the values are the same (both blue tuples have the class label yes), the branch ends here in a leaf node.
White reference tuples
age,income,student,buys_computer
<30, low, yes, no
>40, medium, no, no
For Age extracting the data based on the class labels of No, yes
No , yes count
<30 1 , 0 1
>40 1 , 0 1
Total 2
Calculate the Expected information for Age
<30 I (1, 0) = -1/1log2 (1/1) = 0
>40 I (1, 0) = -1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute age
E (age) = 1/2(0) + 1/2(0) = 0
Gain value for Age
Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0 = 0.9709
For Income extracting the data based on the class labels of No, yes
No , yes count
Low 1 , 0 1
Medium 1 , 0 1
Total 2
Calculate the Expected information for the Income
Medium I (1, 0) =-1/1log2 (1/1) = 0
Low I (1, 0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute Income
E( Income ) = 1/2(0)+1/2(0) = 0
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 1 , 0 1
Yes 1 , 0 1
Total 2
Calculate the Expected information for the student
No I (1,0) =-1/1log2 (1/1) = 0
Yes I (1,0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student
E (student) = 1/2(0) + 1/2(0) = 0
Gain value for student
Gain(student) = I (s1, s2)-E (student) = 0.9709-0= 0.9709
The Gain values of age, income and student are all 0.9709.
Since all the values are the same (both white tuples have the class label no), the branch ends here in a leaf node.
Yellow reference tuples
age,income,student,buys_computer
<30, high, no, no
<30, medium, no, yes
<30, low, yes, no
For Age extracting the data based on the class labels of No, yes
No , yes count
<30 2 , 1 3
Total 3
Calculate the Expected information for Age
<30 I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
Now calculate the Entropy value for attribute Age
E (age) = 3/3(0.9183) = 0.9183
Gain value for Age
Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0.9183 = 0.0526
For Income extracting the data based on the class labels of No, yes
No , yes count
High 1 , 0 1
Medium 0 , 1 1
Low 1 , 0 1
Total 3
Calculate the Expected information for the Income
High I (1, 0) =-1/1log2 (1/1) = 0
Medium I (0, 1) =-1/1log2 (1/1) = 0
Low I (1, 0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute Income
E( Income ) = 1/3(0)+1/3(0)+1/3(0) = 0
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 1 , 1 2
Yes 1 , 0 1
Total 3
Calculate the Expected information for the student
No I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1
Yes I (1,0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student
E (student) = 2/3(1) + 1/3(0) = 0.6667
Gain value for student
Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.6667 = 0.3042
The Gain values of age, income and student are 0.0526, 0.9709 and 0.3042.
Income has the highest gain value, so the right node is constructed with income.
CONSTRUCTION OF ID3 DECISION TREE ALGORITHM
STEP1: Calculate the gains of the attributes.
STEP2: Make the attribute with the highest gain the Root Node.
STEP3: Build one sub-dataset per unique attribute value of the current node, neglecting the current node's attribute column; the current node becomes the parent node.
STEP4: Take each sub-dataset from these sub-datasets and go to STEP5.
STEP5: If the sub-dataset contains only one class label value, go to STEP6; else go to STEP7.
STEP6: Attach the current sub-dataset as a leaf node to the parent node.
STEP7: Calculate the attribute gains.
STEP8: If all attribute gains are the same, go to STEP6; else make the attribute with the highest gain a node, consider it the current node, and go to STEP3 (see the Java sketch after this list).
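The whole recursion can be sketched compactly in Java. This is a minimal illustration, not the project source: rows are string arrays whose last column is the class label, and the gain here is the standard per-subset ID3 gain (the worked example above reuses the root value 0.9709 instead).

import java.util.*;

public class Id3Builder {

    static double log2(double x) { return x == 0 ? 0 : Math.log(x) / Math.log(2); }

    // entropy of the class-label column (last column) of the given rows
    static double classInfo(List<String[]> rows) {
        Map<String, Integer> c = new HashMap<>();
        for (String[] r : rows) c.merge(r[r.length - 1], 1, Integer::sum);
        double h = 0, n = rows.size();
        for (int k : c.values()) h -= (k / n) * log2(k / n);
        return h;
    }

    static Map<String, List<String[]>> partition(List<String[]> rows, int attr) {
        Map<String, List<String[]>> m = new LinkedHashMap<>();
        for (String[] r : rows) m.computeIfAbsent(r[attr], k -> new ArrayList<>()).add(r);
        return m;
    }

    // Gain(A) = I(rows) - sum of (|rows_v| / |rows|) * I(rows_v)
    static double gain(List<String[]> rows, int attr) {
        double e = 0, n = rows.size();
        for (List<String[]> p : partition(rows, attr).values())
            e += (p.size() / n) * classInfo(p);
        return classInfo(rows) - e;
    }

    // STEP1-STEP8: pick the highest-gain attribute, split, recurse; leaf when pure
    static void build(List<String[]> rows, List<Integer> attrs, String indent) {
        if (classInfo(rows) == 0 || attrs.isEmpty()) {    // one class label left: a leaf (STEP6)
            System.out.println(indent + "leaf: " + rows.get(0)[rows.get(0).length - 1]);
            return;
        }
        int best = attrs.get(0);                          // STEP2/STEP8: highest gain wins
        for (int a : attrs) if (gain(rows, a) > gain(rows, best)) best = a;
        for (Map.Entry<String, List<String[]>> e : partition(rows, best).entrySet()) {
            System.out.println(indent + "attribute " + best + " = " + e.getKey());
            List<Integer> rest = new ArrayList<>(attrs);  // STEP3: drop the used attribute
            rest.remove(Integer.valueOf(best));
            build(e.getValue(), rest, indent + "  ");
        }
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(              // a three-row fragment of the example
            new String[]{">40", "red", "no"},
            new String[]{"<30", "yellow", "no"},
            new String[]{"30-40", "blue", "yes"});
        build(data, new ArrayList<>(Arrays.asList(0, 1)), "");
    }
}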
2.2 Flaws or Drawbacks of the Existing System:
The principle by which ID3 selects attribute A as the test attribute is to make E(A), the entropy of attribute A, the smallest. Studies suggest there is a problem with this method: it is often biased towards attributes with more values, which are not necessarily the best attributes. In other words, in real situations it is not so important for the attributes selected by the ID3 algorithm to be judged first by minimising the value of entropy. Besides, the ID3 algorithm selects attributes in terms of information entropy, which is computed from probabilities, while the probability method is only suitable for solving stochastic problems. Aiming at these shortcomings of the ID3 algorithm, some improvements are made and an improved decision tree algorithm is introduced.
3. Proposed System
3.1 Description of Proposed System:
As discussed in Section 2.2, ID3 selects the test attribute by minimising E(A), which biases it towards attributes with many values that are not necessarily the best, and it relies on probability-based entropy, which suits only stochastic problems. To overcome these shortcomings, the proposed system combines ID3 with an Association Function (AF): after the ordinary ID3 gain of each attribute is computed, an importance weight is derived from the association function and each gain is rescaled by its normalized weight. The attribute with the highest rescaled gain is chosen at each node, yielding an improved decision tree algorithm.
3.2 Problem Statement:
The decision tree is an important method for both induction research and data mining, mainly used for model classification and prediction. The ID3 algorithm is the most widely used decision tree algorithm. Starting from the basic ideas of decision trees in data mining, this work addresses ID3's shortcoming of inclining to choose attributes with many values, which are not necessarily the best attributes. In other words, in real situations it is not so important for the attributes selected by the ID3 algorithm to be judged first by minimising the value of entropy. Besides, the ID3 algorithm selects attributes in terms of information entropy, which is computed from probabilities, while the probability method is only suitable for solving stochastic problems. Aiming at these shortcomings of the ID3 algorithm, a new decision tree algorithm combining ID3 and an Association Function is presented. The experimental results show that the proposed algorithm can overcome ID3's shortcoming effectively and produce more reasonable and effective rules.
3.3 Working of Proposed System:
CALCULATION OF IMPROVED ID3 ALGORITHM
STEP1: Take the dataset and calculate the count of each unique attribute value of the class label.
STEP2: Calculate the amount of information for the class label: I (s1, s2) = -(s1/s)log2 (s1/s) - (s2/s)log2 (s2/s), with s = s1 + s2.
STEP3: Calculate the count of each unique attribute value against the unique attribute values of the class label.
STEP4: Calculate the expected information for each unique attribute value of each attribute, as in ID3.
STEP5: Calculate the entropy value for each attribute from the expected information values of all its unique attribute values: E (A) = sum of ((s1j + s2j)/s) x I (s1j, s2j).
STEP6: Calculate the gain value with the help of the amount of information value and the entropy value: Gain (A) = I (s1, s2) - E (A).
STEP7: After calculating the gain values, calculate the importance of each attribute using the association function. The formula is given for the two-class case only; it works only under this condition:
AF (A) = (sum over the v unique values of A of |s1j - s2j|) / v
STEP8: Calculate the improved ID3 gain of each attribute with the help of the ID3 gain and the normalized association function:
Gain' (A_k) = Gain (A_k) x V (A_k)
Up to the gain values the calculation is the same as in the ID3 algorithm. The improved ID3 algorithm additionally calculates the importance of each attribute and multiplies the normalized correlation factor V with the ID3 gain value. Using the datasets from the ID3 run, the ASSOCIATION FUNCTION is evaluated on the attribute values and applied to each retrieved dataset, as expanded below. A small Java sketch of STEP7 and STEP8 follows.
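STEP7 and STEP8 can be sketched in Java. One assumption is made here: the association function is taken as the mean absolute difference of the two class counts over an attribute's values, the form reconstructed from the worked figures below; the class name ImprovedId3 and the hard-coded ID3 gain values are illustrative.

import java.util.*;

public class ImprovedId3 {

    // AF(A) = (sum of |s1j - s2j| over the v unique values of A) / v
    static double af(int[][] countsPerValue) {
        double sum = 0;
        for (int[] c : countsPerValue) sum += Math.abs(c[0] - c[1]);
        return sum / countsPerValue.length;
    }

    public static void main(String[] args) {
        double[] afs = {                                     // (no, yes) counts per value
            af(new int[][]{{3, 1}, {3, 1}, {0, 2}}),         // age         -> 2.0
            af(new int[][]{{2, 1}, {2, 1}, {0, 2}, {2, 0}}), // color_cloth -> 1.5
            af(new int[][]{{1, 2}, {2, 1}, {3, 1}}),         // income      -> 1.3333
            af(new int[][]{{3, 3}, {1, 3}})                  // student     -> 1.0
        };
        double total = Arrays.stream(afs).sum();             // about 5.8333
        double[] id3Gain = {0.3219, 0.4199, 0.0954, 0.0464}; // from the ID3 run above
        String[] name = {"age", "color_cloth", "income", "student"};
        for (int k = 0; k < afs.length; k++) {
            double v = afs[k] / total;                       // normalization relation degree V(k)
            System.out.printf("Gain'(%s) = %.4f x %.4f = %.4f%n",
                    name[k], id3Gain[k], v, id3Gain[k] * v);
        }
    }
}

With these inputs the age attribute gets the largest product (about 0.1104), matching the root-node choice made below.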
As an example, take all the attribute values of age, color_cloth, income and student.
Age data
No , yes count
>40 3 , 1 4
<30 3 , 1 4
30-40 - , 2 2
Total 10
Calculation for all attributes:
AF (age) = (|3-1| + |3-1| + |0-2|) / 3 = 2
AF (color_cloth) = (|2-1| + |2-1| + |0-2| + |2-0|) / 4 = 1.5
AF (income) = (|1-2| + |2-1| + |3-1|) / 3 = 1.3334
AF (student) = (|3-3| + |1-3|) / 2 = 1
THE NORMALIZATION RELATION DEGREE FORMULA
V (A_k) = AF (A_k) / (AF (age) + AF (color_cloth) + AF (income) + AF (student)) = AF (A_k) / 5.8334
V (age) = 2 / 5.8334 = 0.3429
V (color_cloth) = 1.5 / 5.8334 = 0.2571
V (income) = 1.3334 / 5.8334 = 0.2286
V (student) = 1 / 5.8334 = 0.1714
NOW CALCULATE THE IMPROVED ID3 GAIN VALUE BY USING THE FORMULA Gain' (A_k) = Gain (A_k) x V (A_k)
Gain' (age) = 0.3219 x 0.3429 = 0.1104 (highest gain value in the improved ID3 decision tree, so age is the root node)
Gain' (color_cloth) = 0.4199 x 0.2571 = 0.1080
Gain' (income) = 0.0954 x 0.2286 = 0.0218
Gain' (student) = 0.0464 x 0.1714 = 0.0080
So the AGE attribute becomes the root node; <30, 30-40 and >40 are its unique attribute values.
Now extract the <30, 30-40 and >40 data from the dataset.
The remaining columns are color_cloth, income, student and buys_computer; we neglect the age attribute because it was the root node. So, taking the age attribute values as references, retrieve all the <30 tuple values.
<30 dataset
Color_cloth, income, student, buys_computer
Yellow, high, no, no
White, low, yes, no
Yellow, medium, no, yes
Yellow, low, yes, no
Here calculate the gain values for color-cloth
No , yes count
Yellow 2 , 1 3
White 1 , - 1
Total 4
For color-cloth extracting the data based on the class labels of No, yes
No , yes count
yellow 2 , 1 3
white 1 , 0 1
Total 4
Calculate the Expected information for the color-cloth
yellow I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
white I (1, 0) = -1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute color-cloth
E (color-cloth) = 3/4(0.9183) + 1/4(0) = 0.6887
Gain value for color-cloth
Gain (color-cloth) = I (s1, s2) - E (color-cloth) = 0.9709 - 0.6887 = 0.2822
For Income extracting the data based on the class labels of No, yes
No , yes count
High 1 , 0 1
Medium 0 , 1 1
Low 2 , 0 2
Total 4
Calculate the Expected information for the Income
High I (1, 0) =-1/1log2 (1/1) = 0
Medium I (0, 1) =-1/1log2 (1/1) = 0
Low I (2, 0) =-2/2log2 (2/2) = 0
Now calculate the Entropy value for attribute Income
E (Income) = 1/4(0) + 1/4(0) + 2/4(0) = 0
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 1 , 1 2
Yes 2 , 0 2
Total 4
Calculate the Expected information for the student
No I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1
Yes I (2,0) =-2/2log2 (2/2) = 0
Now calculate the Entropy value for attribute student
E (student) = 2/4(1) + 2/4(0) = 0.5
Gain value for student
Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.5 = 0.4709
AF (color_cloth) = (|2-1| + |1-0|) / 2 = 1
AF (income) = (|1-0| + |0-1| + |2-0|) / 3 = 1.3334
AF (student) = (|1-1| + |2-0|) / 2 = 1
V (color_cloth) = 1 / 3.3334 = 0.2999
V (income) = 1.3334 / 3.3334 = 0.4
V (student) = 1 / 3.3334 = 0.2999
Gain values are
Gain' (color_cloth) = 0.2822 x 0.2999 = 0.0846
Gain' (income) = 0.9709 x 0.4 = 0.3884
Gain' (student) = 0.4709 x 0.2999 = 0.1412
Income has the highest gain value
30 - 40 dataset
Color_cloth, income, student, buys_computer
Blue , high, no, yes
Blue , low, yes, yes
Here calculate the gain values for color-cloth
No , yes count
Blue 0 , 2 2
Total 2
Calculate the Expected information for the color-cloth
Blue I (0, 2) = -2/2log2 (2/2) = 0
Now calculate the Entropy value for attribute color-cloth
E (color-cloth) = 2/2(0) = 0
Gain value for color-cloth
Gain(color-cloth) = I (s1, s2)-E (color-cloth) = 0.9709-0= 0.9709
For Income extracting the data based on the class labels of No, yes
No , yes count
High 0 , 1 1
Low 0 , 1 1
Total 2
Calculate the Expected information for the Income
High I (0, 1) = -1/1log2 (1/1) = 0
Low I (0, 1) = -1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute Income
E( Income ) = 1/2(0)+1/2(0) = 0
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 0 , 1 1
Yes 0 , 1 1
Total 2
Calculate the Expected information for the student
No I (0,1) =-1/1log2 (1/1) = 0
Yes I (0,1) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student
E (student) = 1/2(0) + 1/2(0) = 0
Gain value for student
Gain(student) = I (s1, s2)-E (student) = 0.9709-0= 0.9709
AF (color_cloth) = |0-2| / 1 = 2
AF (income) = (|0-1| + |0-1|) / 2 = 1
AF (student) = (|0-1| + |0-1|) / 2 = 1
V (color_cloth) = 2 / 4 = 0.5
V (income) = 1 / 4 = 0.25
V (student) = 1 / 4 = 0.25
Gain values are
Gain' (color_cloth) = 0.9709 x 0.5 = 0.4855
Gain' (income) = 0.9709 x 0.25 = 0.2427
Gain' (student) = 0.9709 x 0.25 = 0.2427
Color-cloth has the highest gain value
>40 dataset
Color_cloth, income, student, buys_computer
Red , high, no, no
Red , low, yes, no
Red , medium, no, yes
White , medium, no, no
For color-cloth extracting the data based on the class labels of No, yes
No , yes count
white 1 , 0 1
red 2 , 1 3
Total 4
Calculate the Expected information for the color-cloth
Red I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
white I (1, 0) = -1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute color-cloth
E (color-cloth) = 3/4(0.9183) + 1/4(0) = 0.6887
Gain value for color-cloth
Gain (color-cloth) = I (s1, s2) - E (color-cloth) = 0.9709 - 0.6887 = 0.2822
For Income extracting the data based on the class labels of No, yes
No , yes count
High 1 , 0 1
Medium 1 , 1 2
Low 1 , 0 1
Total 4
Calculate the Expected information for the Income
High I (1, 0) =-1/1log2 (1/1) = 0
Medium I (1, 1) = -1/2log2 (1/2) - 1/2log2 (1/2) = 1
Low I (1, 0) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute Income
E( Income ) = 1/4(0)+2/4(1)+1/4(0) = 0.5
Gain value for Income
Gain(income) = I (s1, s2)-E (Income) = 0.9709-0.5= 0.4709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 2 , 1 3
Yes 1 , 0 1
Total 4
Calculate the Expected information for the student
No I (2, 1) = -2/3log2 (2/3) - 1/3log2 (1/3) = 0.9183
Yes I (1, 0) = -1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student
E (student) = 3/4(0.9183) + 1/4(0) = 0.6887
Gain value for student
Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.6887 = 0.2822
AF (color_cloth) = (|1-0| + |2-1|) / 2 = 1
AF (income) = (|1-0| + |1-1| + |1-0|) / 3 = 0.6667
AF (student) = (|2-1| + |1-0|) / 2 = 1
V (color_cloth) = 1 / 2.6667 = 0.3749
V (income) = 0.6667 / 2.6667 = 0.25
V (student) = 1 / 2.6667 = 0.3749
Gain values are
Gain' (color_cloth) = 0.2822 x 0.3749 = 0.1058
Gain' (income) = 0.4709 x 0.25 = 0.1177
Gain' (student) = 0.2822 x 0.3749 = 0.1058
Income has the highest gain value
>40 Income dataset for high
Color-cloth, student, buys-computer
Red, no, no
The class label is the same for both color-cloth and student, hence it becomes a leaf node.
>40 Income dataset for low
Color-cloth, student, buys-computer
Red, yes , no
The class label is the same for both color-cloth and student, hence it becomes a leaf node.
>40 Income dataset for medium
Color-cloth, student, buys-computer
Red, no , yes
White, no, no
For color-cloth extracting the data based on the class labels of No, yes
No , yes count
white 1 , 0 1
red 0 , 1 1
Total 2
Calculate the Expected information for the color-cloth
Red I (0, 1) =-1/1log2 (1/1)=0
white I (1, 0) = -1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute color-cloth
E (color-cloth) = 1/2(0) + 1/2(0) = 0
Gain value for color-cloth
Gain(color-cloth) = I (s1, s2)-E (color-cloth) = 0.9709-0= 0.9709
For Student extracting the data based on the class labels of No, yes
No , yes count
No 1 , 1 2
Total 2
Calculate the Expected information for the student
No I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1
Now calculate the Entropy value for attribute student
E (student) = 2/2(1) = 1
Gain value for student
Gain (student) = I (s1, s2) - E (student) = 0.9709 - 1 = -0.0291
AF (color_cloth) = (|1-0| + |0-1|) / 2 = 1
AF (student) = |1-1| / 1 = 0
V (color_cloth) = 1 / 1 = 1
V (student) = 0 / 1 = 0
Gain values are
Gain' (color_cloth) = 0.9709 x 1 = 0.9709
Gain' (student) = -0.0291 x 0 = 0
<30 Income dataset for high
Color-cloth, student, buys-computer
Yellow , no , no
The class label is the same for both color-cloth and student, hence it becomes a leaf node.
<30 Income dataset for low
Color-cloth, student, buys-computer
White , yes , no
Yellow, yes, no
The class label is the same for both color-cloth and student, hence it becomes a leaf node.
<30 Income dataset for medium
Color-cloth, student, buys-computer
Yellow , no , yes
The class label is the same for both color-cloth and student, hence it becomes a leaf node.
Construction of Improved ID3 decision tree:
4. Requirement Analysis or Requirements Elicitation
REQUIREMENT SPECIFICATION:
A requirement is a feature that the system must have or a constraint that it must satisfy to be accepted by the client. Requirement engineering aims at defining the requirements of the system under construction. Requirement engineering includes two main activities: requirements elicitation, which results in a specification of the system that the client understands, and analysis, which results in an analysis model that the developer can unambiguously interpret. A requirement is a statement about what the proposed system will do. Requirements can be divided into two major categories: functional requirements and non-functional requirements.
4.1 Functional requirements:
Functional requirements describe the interactions between the system and its
environment independent of its implementation. The environment includes the user and any
other external system with which the system interacts.
4.2 Non-Functional Requirements:
Non-functional requirements describe aspects of the system that are not directly related to the functional behaviour of the system. Non-functional requirements include a broad variety of requirements that apply to many different aspects of the system, from usability to performance.
4.2.1 Software Requirements:
Operating System : Windows 2000/XP
IDE : Eclipse
Language : JDK 1.5
Documentation : MS-Word
Designing : Rational Rose
4.2.2 Hardware Requirements:
CPU : Pentium IV
RAM : 512MB.
Hard Disk : 40GB.
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor