1. INTRODUCTION

1.1 Project Overview:

Data mining:
Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help companies focus on the most important information
in their data warehouses. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses
offered by data mining move beyond the analyses of past events provided by retrospective tools
typical of decision support systems. Data mining tools can answer business questions that
traditionally were too time consuming to resolve. They scour databases for hidden patterns,
finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques
can be implemented rapidly on existing software and hardware platforms to enhance the value of
existing information resources, and can be integrated with new products and systems as they are
brought on-line. When implemented on high performance client/server or parallel processing
computers, data mining tools can analyze massive databases to deliver answers to questions such
as, "Which clients are most likely to respond to my next promotional mailing, and why?"

The most commonly used techniques in data mining are:

       Artificial neural networks: Non-linear predictive models that learn through training and
       resemble biological neural networks in structure.

       Decision trees: Tree-shaped structures that represent sets of decisions. These decisions
       generate rules for the classification of a dataset. Specific decision tree methods include
       Classification and Regression Trees (CART) and Chi Square Automatic Interaction
       Detection (CHAID) .

       Genetic algorithms: Optimization techniques that use processes such as genetic
       combination, mutation, and natural selection in a design based on the concepts of
       evolution.

       Nearest neighbor method: A technique that classifies each record in a dataset based on a
       combination of the classes of the k record(s) most similar to it in a historical dataset
       (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

       Rule induction: The extraction of useful if-then rules from data based on statistical
       significance.
DECISION TREES:

Decision trees have become one of the most powerful and popular approaches in knowledge
discovery and data mining, the science and technology of exploring large and complex bodies of
data in order to discover useful patterns. The area is of great importance because it enables
modeling and knowledge extraction from the abundance of data available. Both theoreticians and
practitioners are continually seeking techniques to make the process more efficient, cost-
effective and accurate. Decision trees, originally implemented in decision theory and statistics,
are highly effective tools in other areas such as data mining, text mining, information extraction,
machine learning, and pattern recognition. Decision trees offer many benefits in data mining,
including the following:

    •  Self-explanatory and easy to follow when compacted

    •  Able to handle a variety of input data: nominal, numeric and textual

    •  Able to process datasets that may have errors or missing values

    •  High predictive performance for a relatively small computational effort

    •  Available in many data mining packages over a variety of platforms

    •  Useful for various tasks, such as classification, regression, clustering and feature
       selection

ADVANTAGES AND DISADVANTAGES OF DECISION TREES

Advantages

 •  Simple to understand and interpret.

 •  Requires little data preparation.

 •  Able to handle both numerical and categorical data.

 •  Uses a white box model.

 •  Performs well with large data in a short time.

Disadvantages

 •  Output attribute must be categorical.

 •  Limited to one output attribute.
1.2. Software Process Model
The Waterfall Model is one of the most widely used software development processes. It is also called
the "linear sequential model" or the "classic life cycle"; the variant used here is the iterative waterfall
model. It is widely used in commercial development projects. It is called so because we move to the
next phase (step) only after getting the output of the previous phase, just as in a waterfall the water
flows down from the upper steps.

In the iterative waterfall model the software development process is divided into five phases:
a) SRS (Software Requirement Specifications)
b) System Design and Software Design
c) Implementation and Unit testing
d) Integration and System Testing
e) Operation and Maintenance




                                    Iterative Waterfall Model with its stages



Let's discuss all these stages of the waterfall model in detail.


Software Requirements Specifications:
This is the most crucial phase for the whole project. Here the project team, along with the customer,
makes a detailed list of user requirements. The project team chalks out, in detail, the functionality and
limitations (if any) of the software being developed. The document which contains all this information
is called the SRS, and it states the requirements clearly and unambiguously. A small amount of top-level
analysis and design is also documented. This document is verified and approved by the customer before
the project starts. The SRS serves as the input for the subsequent phases.
System Design and Software Design:

Using the SRS as input, system design is carried out. System design includes the design of hardware
and software, i.e. the functionality of hardware and software is separated out. After this separation, the
design of the software modules is done. The design process translates requirements into a
representation of the software that can be assessed for quality before code generation begins. At the
same time a test plan is prepared; the test plan describes the various tests which will be carried out on
the system after development is complete.



Implementation and Unit Testing:
Now that the system design is available, code generation begins. Code generation is the conversion of
the design into machine-readable form. If the design of the software and system is done well, code
generation can be done easily. Software modules are further divided into units. A unit is a logically
separable part of the software and can be tested independently. In this phase unit testing is done by the
developers themselves, to ensure that the units are free of defects.


Integration and System testing:
Now the units of the software are integrated and the complete system is built. The complete software is
then tested to check whether it meets the functional and performance requirements of the customer.
Testing is done as per the steps defined in the test plan, to ensure that the defined inputs produce actual
results which agree with the required results. A test report containing the test results is generated.


Operation & maintenance:
Once the software has been tested, it is delivered to the client. Feedback is collected and any required
changes are made in this phase. This phase continues until the software is retired.

1.4 Roles and Responsibilities:

      All the team members did the work collaboratively. Each of us took part in every task,
sharing the work and functioning as a team rather than as individuals.

1.5 Tools and Techniques:
ABOUT ECLIPSE

        Eclipse is a universal platform for integrating development tools, with an open, extensible
architecture based on plug-ins. Eclipse provides a number of aids that make writing Java code
much quicker and easier than using a text editor. This means that you can spend more time
learning Java, and less time typing and looking up documentation. The Eclipse debugger and
scrapbook allow you to look inside the execution of the Java code. This allows you to “see”
objects and to understand how Java is working behind the scenes. Eclipse provides full support
for agile software development practices such as test-driven development and refactoring.

2. Existing System

2.1 Description of Existing System:

ID3 is a mathematical algorithm for building the decision tree, invented by J. Ross Quinlan in
1979. It uses information theory, introduced by Shannon in 1948. It builds the tree from the top down,
with no backtracking, employing a top-down greedy search through the space of possible decision
trees. It is greedy because there is no backtracking: it picks the highest value first. Information Gain is
used to select the most useful attribute for classification, i.e. the attribute that is most useful for
classifying examples (the attribute that has the highest Information Gain).

Iterative Dichotomiser 3 Algorithm is a Decision Tree learning algorithm. The name is correct in
that it creates Decision Trees for “dichotomizing” data instances, or classifying them discretely
through branching nodes until a classification “bucket” is reached (leaf node). By using ID3 and
other machine-learning algorithms from Artificial Intelligence, expert systems can engage in
tasks usually done by human experts, such as doctors diagnosing diseases by examining various
symptoms (the attributes) of patients (the data instances) in a complex Decision Tree. Accurate
Decision Trees are fundamental to Data Mining and Databases.

Decision tree learning is a method for approximating discrete-valued target functions, in which
the learned function is represented by a decision tree. Decision tree learning is one of the most
widely used and practical methods for inductive inference. The input data of ID3 is known as
sets of “training” or “learning” data instances, which will be used by the algorithm to generate
the Decision Tree. The machine is “learning” from this set of preliminary data.

The ID3 algorithm generates a Decision Tree by using a greedy search through the inputted sets
of data instances so as to determine nodes and the attributes they use for branching. Also, the
emerging tree is traversed in a top-down (root to leaf) approach through each of the nodes within
the tree. This occurs recursively, in the manner of a standard top-down tree traversal. The traversal
attempts to determine, using the input data, whether the decision attribute on which branching is based
at any particular emerging node is the best branching attribute. One particular metric that can be used
to determine whether a branching attribute is adequate is INFORMATION GAIN, which is computed
from ENTROPY.

ID3 uses Entropy to determine if, based on the inputted set of data, the selected branching
attribute for a particular node of the emerging tree is adequate. Specifically, the attribute that
results in the most reduction of Entropy related to the learning sets is the best.

GOAL: Find a way to optimally classify a learning set, such that the resulting Decision Tree is
not too deep and the number of branches (internal nodes) of the Tree is minimized.

SOLUTION: The more Entropy (the measure of impurity) in a collection of data sets, the more
branches and depth the tree will have. FIND entropy-reducing attributes in the learning sets and use
them for branching.

Information Gain = measuring the expected reduction in Entropy. The higher the Information
Gain, the more expected reduction in Entropy.

It turns out that Entropy, a measure of non-homogeneity within a set of learning instances, can be
calculated in a straightforward manner.

CALCULATION AND CONSTRUCTION OF ID3 ALGORITHM

STEP1: Take the dataset and calculate the count of unique attribute values of the class label.

STEP2: Calculate the amount of information for the class label, based on the following formula:

I (s1, s2) = -(s1/s) log2 (s1/s) - (s2/s) log2 (s2/s)

where s1 and s2 are the counts of the two class-label values and s = s1 + s2.

STEP3: Calculate the count of the each unique attribute values based on the unique attribute
values of the class label.

STEP4: Calculate the expected information for each unique attribute value of each attribute, based
on the following formula:

I (s1j, s2j) = -(s1j/sj) log2 (s1j/sj) - (s2j/sj) log2 (s2j/sj)

where s1j and s2j are the class-label counts among the rows having the j-th value of the attribute and
sj = s1j + s2j.

STEP5: Calculate the entropy value for each attribute. Here, take the expected information values of
all unique attribute values of the attribute, weighted by their share of the dataset, based on the
following formula:

E (A) = Σj ( (s1j + s2j) / s ) x I (s1j, s2j)

STEP6: Calculate the gain value from the amount of information value and the entropy value, based
on the following formula:

Gain (A) = I (s1, s2) - E (A)

STEP7: Construct the decision tree based on following decision tree algorithm.
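
A minimal Java sketch of the calculations in steps 1 to 6 is given below. The class and method names
are illustrative assumptions, not the project's actual source; the values in the comments correspond to
the buys_computer example worked out in the next section.

// Illustrative sketch of the ID3 calculations in STEP1-STEP6.
// Class and method names are assumptions, not the project's real code.
public class Id3Math {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Amount of information I(s1, s2, ...) for the class-label counts.
    public static double info(int... classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double i = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                      // treat 0 * log2(0) as 0
            double p = (double) c / total;
            i -= p * log2(p);
        }
        return i;
    }

    // Entropy E(A): expected information of each value of A, weighted by its share of the rows.
    // valueCounts[j] holds the class-label counts for the j-th unique value of attribute A.
    public static double entropy(int[][] valueCounts, int totalRecords) {
        double e = 0.0;
        for (int[] counts : valueCounts) {
            int subtotal = 0;
            for (int c : counts) subtotal += c;
            e += ((double) subtotal / totalRecords) * info(counts);
        }
        return e;
    }

    // Gain(A) = I(class counts) - E(A)
    public static double gain(int[] classCounts, int[][] valueCounts, int totalRecords) {
        return info(classCounts) - entropy(valueCounts, totalRecords);
    }

    public static void main(String[] args) {
        int[] classCounts = {6, 4};                         // buys_computer: 6 no, 4 yes
        int[][] age     = {{3, 1}, {3, 1}, {0, 2}};         // >40, <30, 30-40
        int[][] colour  = {{2, 1}, {2, 1}, {0, 2}, {2, 0}}; // red, yellow, blue, white
        int[][] income  = {{2, 1}, {1, 2}, {3, 1}};         // high, medium, low
        int[][] student = {{3, 3}, {3, 1}};                 // no, yes
        System.out.println("I(6,4)            = " + info(classCounts));              // ~0.971
        System.out.println("Gain(age)         = " + gain(classCounts, age, 10));     // ~0.322
        System.out.println("Gain(color_cloth) = " + gain(classCounts, colour, 10));  // ~0.420
        System.out.println("Gain(income)      = " + gain(classCounts, income, 10));  // ~0.095
        System.out.println("Gain(student)     = " + gain(classCounts, student, 10)); // ~0.046
    }
}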

EXPLANATION WITH EXAMPLE FOR ID3 DECISION TREE

Age, colour-cloth, income, student, buys_computer

>40, red, high, no, no

<30, yellow, high, no, no

30--40, blue, high, no, yes

>40, red, medium, no, yes

<30, white, low, yes, no

>40, red, low, yes, no

30--40, blue, low, yes, yes

<30, yellow, medium, no, yes

<30, yellow, low, yes, no

>40, white, medium, no, no

This is the sample dataset for buys_computer.

Attribute names: Age, colour-cloth, income, student, buys_computer

Unique values of the Age attribute: >40, <30, 30-40

Class labels of the dataset: yes, no
Step1:

For the given dataset we first have to calculate the amount of information for the class label. For that
we need the counts of the class labels in the given dataset: the count of NO is 6 and the count of YES
is 4. Then, using the formula, we calculate the amount of information.

No = 6 and Yes = 4

I (6, 4) = -6/10 log2 (6/10) - 4/10 log2 (4/10) = 0.4421 + 0.5288 = 0.9709

This calculated value is called the amount of information, and it is a single value for the whole
dataset.

Step2: For each attribute, take its unique values; then the expected information, entropy and gain
values are calculated using the formulae.

Take the Age attribute and separate its values against the class labels

            No    , yes    count

>40          3    ,   1     4

<30          3    ,   1     4

30-40        -    ,   2     2

Total                       10

This is the table for extracting data, based on the class labels of buys_computer.

Now calculate the Expected Information for the unique attribute values of >40, <30, 30-40

>40        I (3,1) = -3/4 log2 (3/4) - 1/4 log2 (1/4) = 0.8113

<30        I (3,1) = -3/4 log2 (3/4) - 1/4 log2 (1/4) = 0.8113

30-40      I (0,2) = -2/2 log2 (2/2) = 0

Now calculate the Entropy value for the Age attribute

E (age) = 4/10 (0.8113) + 4/10 (0.8113) + 2/10 (0) = 0.3245 + 0.3245 = 0.6490

Step3: calculate the Gain value for the age attribute based on the amount of information and the
entropy value

Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0.6490 = 0.3219

Second attribute: For color _cloth extracting the data based on the class labels of No, yes

                 No         ,   yes           count

Red               2       ,        1          3

Yellow            2        ,       1          3

Blue              -        ,       2          2

White            2         ,       -          2

Total                                        10

Calculate the Expected information for the color _ cloth

Red     I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

Yellow  I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

Blue    I (0, 2) = -2/2 log2 (2/2) = 0

White   I (2, 0) = -2/2 log2 (2/2) = 0

Now calculate the Entropy value for attribute color_cloth

E (color_cloth) = 3/10 (0.9183) + 3/10 (0.9183) + 2/10 (0) + 2/10 (0) = 0.5510

Gain value for color_cloth

Gain (color_cloth) = I (s1, s2) - E (color_cloth) = 0.9709 - 0.5510 = 0.4199

Third Attribute: For Income extracting the data based on the class labels of No, yes

                       No       ,       yes        count

High                   2        ,        1          3

Medium                 1        ,        2          3

Low                    3        ,        1          4

Total                                              10

Calculate the Expected information for the Income

High      I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

Medium    I (1, 2) = -1/3 log2 (1/3) - 2/3 log2 (2/3) = 0.9183

Low       I (3, 1) = -3/4 log2 (3/4) - 1/4 log2 (1/4) = 0.8113

Now calculate the Entropy value for attribute Income

E (Income) = 3/10 (0.9183) + 3/10 (0.9183) + 4/10 (0.8113) = 0.8755

Gain value for Income

Gain (income) = I (s1, s2) - E (Income) = 0.9709 - 0.8755 = 0.0954

Fourth Attribute: For Student extracting the data based on the class labels of No, yes

                     No      ,   yes   count

No                   3       ,    3        6

Yes                  3       ,    1        4

Total                                      10

Calculate the Expected information for the student

No      I (3,3) = -3/6 log2 (3/6) - 3/6 log2 (3/6) = 1

Yes     I (3,1) = -3/4 log2 (3/4) - 1/4 log2 (1/4) = 0.8113

Now calculate the Entropy value for attribute student

E (student) = 6/10 (1) + 4/10 (0.8113) = 0.9245

Gain value for student

Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.9245 = 0.0464

The important point is that we now compare all the Gain values calculated above:

Gain (age) = 0.3219

Gain (color_cloth) = 0.4199  (highest Gain value)

Gain (income) = 0.0954

Gain (student) = 0.0464

       Color_cloth has the highest Gain value of the four attributes, so as per the algorithm it
becomes the Root Node during construction of the decision tree. After that, take its unique
attribute values (red, blue, white, yellow) as the arcs. For each value, for example red, all the rows
of the dataset containing that value are extracted to form one table, called the red attribute table.
In the same way we calculate all the Gain values and then decide where to place the right node,
left node and leaf nodes.




       In the above decision tree, color_cloth is the root node and the arcs are the attribute values of
color_cloth. Based on each attribute value, extract the data while neglecting the color_cloth attribute.
Observe the first arc: we extract the data based on the red attribute value, take all the matching tuples,
and calculate the expected information, the entropy value and finally the gain value. This is continued
for the left node, right node and leaf node selections.
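
As a rough illustration of that extraction step (the Partition class, the String[] row layout and the
method name are assumptions, not the project's code), the tuples for one branch can be filtered out
and the branching column dropped like this:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: build the "red attribute table" by keeping only the rows
// whose color_cloth column equals "red" and dropping that column from each row.
public class Partition {

    // rows are plain String arrays; columnIndex selects the attribute chosen for branching
    public static List<String[]> subset(List<String[]> rows, int columnIndex, String value) {
        List<String[]> result = new ArrayList<String[]>();
        for (String[] row : rows) {
            if (!row[columnIndex].equals(value)) continue;   // keep only the matching tuples
            String[] reduced = new String[row.length - 1];
            int k = 0;
            for (int i = 0; i < row.length; i++) {
                if (i != columnIndex) reduced[k++] = row[i]; // copy every other column
            }
            result.add(reduced);
        }
        return result;
    }
}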

Red reference tuples

age,income,student,buys_computer

>40,high,no,no

>40,medium,no,yes

>40,low,yes,no

For Age extracting the data based on the class labels of No, yes

                    No     ,   yes   count

>40                   2   ,    1    3
Total                                   3

Calculate the Expected information for Age

>40     I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

Now calculate the Entropy value for attribute Age

E (age) = 3/3 (0.9183) = 0.9183

Gain value for Age

Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0.9183 = 0.0526

For Income extracting the data based on the class labels of No, yes

                     No   ,   yes      count

High                1    ,    0        1

Medium               0   ,    1        1

Low                 1    ,    0        1

Total                                   3

Calculate the Expected information for the Income

High     I (1, 0) =-1/1log2 (1/1) = 0

Medium I (0, 1) =-1/1log2 (1/1) = 0

Low     I (1, 0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute Income

E( Income ) = 1/3(0)+1/3(0)+1/3(0) = 0

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709




For Student extracting the data based on the class labels of No, yes
No      ,   yes   count

No                    1       ,    1        2

Yes                   1       ,    0        1

Total                                        3

Calculate the Expected information for the student

No      I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1

Yes     I (1,0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute student

E (student) = 2/3 (1) + 1/3 (0) = 0.6667

Gain value for student

Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.6667 = 0.3042

In the same way we calculate the expected information, entropy and gain values for the Red table.

Gain values of age, income, student are 0.0526, 0.9709, 0.3042

Here income has the highest gain value, so the node under the red branch is income.

Blue reference tuples

age,income,student,buys_computer

30-40, high, no, yes

30-40, low, yes, yes

For Age extracting the data based on the class labels of No, yes

                       No      ,   yes   count

30-40                     0   ,    2    2

Total                                    2

Calculate the Expected information for Age

30-40     I (0, 2) =-2/2log2 (2/2) = 0
Now calculate the Entropy value for attribute age

E( age ) = 2/2(0) = 0

Gain value for Age

Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0 = 0.9709

For Income extracting the data based on the class labels of No, yes

                     No     ,       yes       count

High                   0   ,        1        1

Low                    0   ,        1         1

Total                                         2

Calculate the Expected information for the Income

High     I (0, 1) =-1/1log2 (1/1) = 0

Low I (0, 1) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute Income

E( Income ) = 1/2(0)+1/2(0) = 0

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709

For Student extracting the data based on the class labels of No, yes

                     No         ,    yes       count

No                  0          ,         1        1

Yes                    0       ,         1        1

Total                                              2

Calculate the Expected information for the student

No      I (0,1) =-1/1log2 (1/1) = 0

Yes     I (0,1) =-1/1log2 (1/1) = 0
Now calculate the Entropy value for attribute student

E (student) = 1/2 (0) + 1/2 (0) = 0

Gain value for student

Gain(student) = I (s1, s2)-E (student) = 0.9709-0= 0.9709

Gain values of age, income, and student are 0.9709, 0.9709, 0.9709

Here all the gain values are the same, so this branch ends here as a leaf node.

White reference tuples

age,income,student,buys_computer

<30, low, yes, no

>40, medium, no, no

For Age extracting the data based on the class labels of No, yes

                     No    ,   yes     count

<30                  1    ,    0      1

>40                  1    ,    0      1

Total                                  2

Calculate the Expected information for Age

<30     I (1, 0) = -1/1 log2 (1/1) = 0

>40     I (1, 0) = -1/1 log2 (1/1) = 0

Now calculate the Entropy value for attribute age

E (age) = 1/2 (0) + 1/2 (0) = 0

Gain value for Age

Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0 = 0.9709

For Income extracting the data based on the class labels of No, yes

                     No    ,   yes     count
Low                 1    ,       0       1

Medium               1   ,       0       1

Total                                     2

Calculate the Expected information for the Income

Medium       I (1, 0) =-1/1log2 (1/1) = 0

Low I (1, 0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute Income

E( Income ) = 1/2(0)+1/2(0) = 0

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709

For Student extracting the data based on the class labels of No, yes

                    No        ,   yes     count

No                  1        ,       0       1

Yes                 1        ,       0       1

Total                                         2

Calculate the Expected information for the student

No      I (1,0) =-1/1log2 (1/1) = 0

Yes     I (1,0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute student

E (student) = 1/2 (0) + 1/2 (0) = 0

Gain value for student

Gain(student) = I (s1, s2)-E (student) = 0.9709-0= 0.9709

Gain values of age, income, and student are 0.9709, 0.9709, and 0.9709

Here all the gain values are the same, so this branch ends here as a leaf node.
Yellow reference tuples

age,income,student,buys_computer

<30, high, no, no

<30, medium, no, yes

<30, low, yes, no




For Age extracting the data based on the class labels of No, yes

                     No        ,   yes   count

<30                   2       ,    1     3

Total                                     3

Calculate the Expected information for Age

<30     I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

Now calculate the Entropy value for attribute Age

E (age) = 3/3 (0.9183) = 0.9183

Gain value for Age

Gain (age) = I (s1, s2) - E (age) = 0.9709 - 0.9183 = 0.0526

For Income extracting the data based on the class labels of No, yes

                     No    ,       yes   count

High                1     ,        0     1

Medium               0    ,        1     1

Low                 1     ,        0     1

Total                                    3

Calculate the Expected information for the Income
High     I (1, 0) =-1/1log2 (1/1) = 0

Medium I (0, 1) =-1/1log2 (1/1) = 0

Low     I (1, 0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute Income

E( Income ) = 1/3(0)+1/3(0)+1/3(0) = 0

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709




For Student extracting the data based on the class labels of No, yes

                    No    ,    yes      count

No                 1     ,     1         2

Yes                 1     ,    0         1

Total                                      3

Calculate the Expected information for the student

No      I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1

Yes     I (1,0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute student

E (student) = 2/3 (1) + 1/3 (0) = 0.6667

Gain value for student

Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.6667 = 0.3042

Gain values of age, income, student are 0.0526, 0.9709, 0.3042

Here income has the highest gain value, so the node under the yellow branch is income.

CONSTRUCTION OF ID3 DECISION TREE ALGORITHM
STEP1: Calculate gains of the Attributes.

STEP2: Make one attribute as a Root Node which has the highest gain.

STEP3: Calculate datasets based on the current node unique attribute values by neglecting the
current node attribute values dataset and current node is the parent node.

STEP4: Take each dataset from the datasets and go to step5.

STEP5: Take current one as attributes. If attribute values size is one then go to step6 else go to
step7

STEP6: Consider the current one as a leaf node to the parent node.

STEP7: Calculate Attribute Gains

STEP8: If all Attribute Gains are same then go to step6 else make one attribute as a Node which
has the highest gain and consider this node as a current node and go to step3.
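
A rough, self-contained Java sketch of these construction steps is given below. The class names, the
String[] row layout (attribute values followed by the class label) and the simplified stopping rule are
assumptions for illustration, not the project's actual implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative recursive construction of an ID3-style tree (simplified sketch).
public class Id3TreeSketch {

    static class Node {
        String label;                                       // attribute name, or class label at a leaf
        Map<String, Node> children = new HashMap<String, Node>();
        Node(String label) { this.label = label; }
    }

    // Each row holds the attribute values followed by the class label in the last column.
    static Node build(List<String[]> rows, List<String> attributes) {
        String firstClass = classOf(rows.get(0));
        boolean pure = true;
        for (String[] r : rows) if (!classOf(r).equals(firstClass)) pure = false;
        // STEP5/STEP6: a pure subset (or no attributes left) becomes a leaf node
        if (pure || attributes.isEmpty()) return new Node(firstClass);

        // STEP1/STEP2 (and STEP7/STEP8 on recursion): pick the attribute with the highest gain
        int best = 0;
        double bestGain = -1.0;
        for (int a = 0; a < attributes.size(); a++) {
            double g = info(rows) - splitEntropy(rows, a);
            if (g > bestGain) { bestGain = g; best = a; }
        }
        Node node = new Node(attributes.get(best));

        // STEP3/STEP4: build one dataset per attribute value and recurse on each of them
        Map<String, List<String[]>> parts = split(rows, best);
        List<String> remaining = new ArrayList<String>(attributes);
        remaining.remove(best);
        for (Map.Entry<String, List<String[]>> e : parts.entrySet())
            node.children.put(e.getKey(), build(e.getValue(), remaining));
        return node;
    }

    static String classOf(String[] row) { return row[row.length - 1]; }

    // Amount of information of the class label over the given rows.
    static double info(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String[] r : rows) {
            Integer c = counts.get(classOf(r));
            counts.put(classOf(r), c == null ? 1 : c + 1);
        }
        double i = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            i -= p * Math.log(p) / Math.log(2);
        }
        return i;
    }

    // E(A): expected information of each subset, weighted by the subset size.
    static double splitEntropy(List<String[]> rows, int attr) {
        double e = 0.0;
        for (List<String[]> part : split(rows, attr).values())
            e += (double) part.size() / rows.size() * info(part);
        return e;
    }

    // Group the rows by the value of one attribute, dropping that attribute's column.
    static Map<String, List<String[]>> split(List<String[]> rows, int attr) {
        Map<String, List<String[]>> parts = new HashMap<String, List<String[]>>();
        for (String[] r : rows) {
            String[] reduced = new String[r.length - 1];
            for (int i = 0, k = 0; i < r.length; i++) if (i != attr) reduced[k++] = r[i];
            List<String[]> part = parts.get(r[attr]);
            if (part == null) { part = new ArrayList<String[]>(); parts.put(r[attr], part); }
            part.add(reduced);
        }
        return parts;
    }
}

When a subset is pure it is labelled with its class; when no attributes remain, this sketch simply keeps
the first row's class label, whereas a fuller implementation would take the majority class.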




2.2 Flaws or Drawbacks with Existing System:


The principle of selecting attribute A as the test attribute in ID3 is to make E (A) of attribute A the
smallest. Studies suggest that there is a problem with this method: it is often biased towards selecting
attributes that take more values, which are not necessarily the best attributes. In other words, in real
situations it is not so important for the attributes selected by the ID3 algorithm to be judged first by
making the value of entropy minimal. Besides, the ID3 algorithm selects attributes in terms of
information entropy, which is computed based on probabilities, while the probability method is only
suitable for solving stochastic problems. Aiming at these shortcomings of the ID3 algorithm, some
improvements on ID3 are made and an improved decision tree algorithm is introduced.



3. Proposed System

3.1 Description of Proposed System:

         The principle of selecting attribute A as the test attribute in ID3 is to make E (A) of
attribute A the smallest. Studies suggest that there is a problem with this method: it is often biased
towards selecting attributes that take more values, which are not necessarily the best attributes. In
other words, in real situations it is not so important for the attributes selected by the ID3 algorithm
to be judged first by making the value of entropy minimal. Besides, the ID3 algorithm selects
attributes in terms of information entropy, which is computed based on probabilities, while the
probability method is only suitable for solving stochastic problems. Aiming at these shortcomings of
the ID3 algorithm, some improvements on ID3 are made and an improved decision tree algorithm is
introduced.


3.2 Problem Statement:
Decision tree is an important method for both induction research and data mining, and is mainly
used for model classification and prediction. The ID3 algorithm is the most widely used decision
tree algorithm. Starting from the basic ideas of decision trees in data mining, this work addresses
ID3's shortcoming of inclining to choose attributes with many values, which are not necessarily the
best attributes. In other words, in real situations it is not so important for the attributes selected by
the ID3 algorithm to be judged first by making the value of entropy minimal. Besides, the ID3
algorithm selects attributes in terms of information entropy, which is computed based on
probabilities, while the probability method is only suitable for solving stochastic problems. Aiming
at these shortcomings of the ID3 algorithm, a new decision tree algorithm combining ID3 and an
Association Function is presented. The experimental results show that the proposed algorithm can
overcome ID3's shortcoming effectively and produce more reasonable and effective rules.

3.3 Working of Proposed System:

CALCULATION OF IMPROVED ID3 ALGORITHM

STEP1: Take the dataset and calculate the count of unique attribute values of the class label.

STEP2: Calculate the amount of information for the class label, using the same formula as in the
ID3 algorithm.

STEP3: Calculate the count of the each unique attribute values based on the unique attribute
values of the class label.

STEP4: Calculate the expected information for each unique attribute value of each attribute, using
the same formula as in the ID3 algorithm.

STEP5: Calculate the entropy value for each attribute. Here take the expected information values of
all unique attribute values of the attribute, using the same formula as in the ID3 algorithm.

STEP6: Calculate the gain value with the help of the amount of information value and the entropy
value, using the same formula as in the ID3 algorithm.

STEP7: After calculating the gain values, we again calculate the importance of each attribute using
the correlation (association) function method, based on the following formula. Note that the formula
given here applies only when the class label takes two values.

STEP8: Calculate the Improved ID3 gain of each attribute with the help of the ID3 gain, the amount
of information and the correlation function, by using the following formula:

Improved Gain (A) = Gain (A) x V (k)

Up to this point the calculation is the same as in the ID3 algorithm. The improved ID3 algorithm
additionally calculates the importance of each attribute, and then multiplies this correlation factor
with the ID3 gain values. Using the same datasets as ID3, the ASSOCIATION FUNCTION is
expanded from the attribute values as below:

AF (A) = ( |n1(no) - n1(yes)| + |n2(no) - n2(yes)| + ... + |nm(no) - nm(yes)| ) / m

where m is the number of distinct values of attribute A, and nj(no), nj(yes) are the counts of the two
class labels among the rows having the j-th value. This formula is applied on the retrieved datasets.
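
A small Java sketch of this association function is given below; the class and method names are
illustrative, and the formula is the reconstruction given above, checked against the worked values
that follow.

// Illustrative sketch of AF(A): the average absolute difference between the
// "no" and "yes" counts over the distinct values of an attribute.
public class AssociationFunction {

    // valueCounts[j] = {noCount, yesCount} for the j-th distinct value of the attribute
    public static double af(int[][] valueCounts) {
        double sum = 0.0;
        for (int[] counts : valueCounts) {
            sum += Math.abs(counts[0] - counts[1]);
        }
        return sum / valueCounts.length;
    }

    public static void main(String[] args) {
        System.out.println(af(new int[][] {{3, 1}, {3, 1}, {0, 2}}));         // age         -> 2.0
        System.out.println(af(new int[][] {{2, 1}, {2, 1}, {0, 2}, {2, 0}})); // color_cloth -> 1.5
        System.out.println(af(new int[][] {{2, 1}, {1, 2}, {3, 1}}));         // income      -> ~1.333
        System.out.println(af(new int[][] {{3, 3}, {3, 1}}));                 // student     -> 1.0
    }
}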

As an example, take all the attribute values of age, color_cloth, income and student.

Age data

           No    , yes     count

>40         3   ,    1        4

<30         3   ,    1        4

30-40       -   ,    2        2

Total                      10

Calculation for all attributes

AF (age) = ( |3 - 1| + |3 - 1| + |0 - 2| ) / 3 = 2

AF (color_cloth) = ( |2 - 1| + |2 - 1| + |0 - 2| + |2 - 0| ) / 4 = 1.5

AF (income) = ( |2 - 1| + |1 - 2| + |3 - 1| ) / 3 = 1.3334

AF (student) = ( |3 - 3| + |3 - 1| ) / 2 = 1

THE NORMALIZED RELATION DEGREE FORMULA

V (k) = AF (Ak) / ( AF (A1) + AF (A2) + ... + AF (An) )

V (age) = 2 / 5.8334 = 0.3429

V (color_cloth) = 1.5 / 5.8334 = 0.2571

V (income) = 1.3334 / 5.8334 = 0.2286

V (student) = 1 / 5.8334 = 0.1714




NOW CALCULATE THE IMPROVED ID3 GAIN VALUE BY USING THE FORMULA

Improved Gain (A) = Gain (A) x V (k)

Gain (age) = 0.3219 x 0.3429 = 0.1104  (highest gain value in the improved ID3 decision tree, so it is the root node)

Gain (color_cloth) = 0.4199 x 0.2571 = 0.1080

Gain (income) = 0.0954 x 0.2286 = 0.0218

Gain (student) = 0.0464 x 0.1714 = 0.0080
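
The following short Java check (illustrative class name; values taken from the example above) shows
how these improved gains are obtained from the plain ID3 gains and the association-function values:

// Illustrative check of the improved-ID3 gain: normalize the AF values into
// relation degrees V(k) and multiply each one with the corresponding ID3 gain.
public class ImprovedGainCheck {

    public static void main(String[] args) {
        String[] names   = {"age", "color_cloth", "income", "student"};
        double[] id3Gain = {0.3219, 0.4199, 0.0954, 0.0464};   // plain ID3 gains from above
        double[] af      = {2.0, 1.5, 1.3334, 1.0};            // association-function values

        double afSum = 0.0;
        for (double a : af) afSum += a;                        // 5.8334

        for (int k = 0; k < names.length; k++) {
            double v = af[k] / afSum;                          // normalized relation degree V(k)
            System.out.println(names[k] + ": V = " + v + ", improved gain = " + (id3Gain[k] * v));
        }
        // age has the largest improved gain (~0.110), so it becomes the root node
    }
}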




       So the AGE attribute becomes the root node, and <30, 30-40 and >40 are its unique attribute
values.

Now extract the <30, 30-40 and >40 data from the dataset.

The extracted tables contain color_cloth, income, student and buys_computer; here we neglect the age
attribute because it is the root node. So, taking the age attribute values as references, retrieve all the
<30 tuple values.

<30 dataset

Color_cloth, income, student, buys_computer

Yellow, high, no, no
White, low, yes, no

Yellow, medium, no, yes

Yellow, low, yes, no

Here calculate the gain values for color-cloth

               No   , yes            count

Yellow        2    ,       1            3

White         1    ,       -            1

Total                                    4

For color-cloth extracting the data based on the class labels of No, yes

                        No           ,       yes         count

yellow                      2       ,           1       3

white                       1       ,           0       1

Total                                                4

Calculate the Expected information for the color-cloth

yellow     I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

white      I (1, 0) = -1/1 log2 (1/1) = 0

Now calculate the Entropy value for attribute color-cloth

E (color-cloth) = 3/4 (0.9183) + 1/4 (0) = 0.6887

Gain value for color-cloth

Gain (color-cloth) = I (s1, s2) - E (color-cloth) = 0.9709 - 0.6887 = 0.2822

For Income extracting the data based on the class labels of No, yes

                        No       ,       yes         count

High                   1        ,           0       1

Medium                 0        ,           1       1
Low                 2    ,        0      2

Total                                     4

Calculate the Expected information for the Income

High     I (1, 0) =-1/1log2 (1/1) = 0

Medium I (0, 1) =-1/1log2 (1/1) = 0

Low     I (2, 0) =-2/2log2 (2/2) = 0

Now calculate the Entropy value for attribute Income

E (Income) = 1/4 (0) + 1/4 (0) + 2/4 (0) = 0

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709

For Student extracting the data based on the class labels of No, yes

                     No       ,   yes    count

No                  1    ,        1          2

Yes                 2    ,        0          2

Total                                         4

Calculate the Expected information for the student

No      I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1

Yes     I (2,0) =-2/2log2 (2/2) = 0

Now calculate the Entropy value for attribute student

E (student) = 2/4 (1) + 2/4 (0) = 0.5

Gain value for student

Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.5 = 0.4709

AF (color_cloth) = ( |2 - 1| + |1 - 0| ) / 2 = 1

AF (income) = ( |1 - 0| + |0 - 1| + |2 - 0| ) / 3 = 1.3334

AF (student) = ( |1 - 1| + |2 - 0| ) / 2 = 1

V (color_cloth) = 1 / 3.3334 = 0.2999

V (income) = 1.3334 / 3.3334 = 0.4

V (student) = 1 / 3.3334 = 0.2999




Gain values are

Gain (color_cloth) = 0.2822 x 0.2999 = 0.0846

Gain (income) = 0.9709 x 0.4 = 0.3884

Gain (student) = 0.4709 x 0.2999 = 0.1412

Income has the highest gain value

30 - 40 dataset

Color_cloth, income, student, buys_computer

Blue , high, no, yes

Blue , low, yes, yes

Here calculate the gain values for color-cloth

            No    , yes      count

Blue        0    ,    2          2

Total                             2

Calculate the Expected information for the color-cloth

Blue      I (0, 2) = -2/2 log2 (2/2) = 0

Now calculate the Entropy value for attribute color-cloth

E( color-cloth ) = 2/2(0)= 0

Gain value for color-cloth

Gain(color-cloth) = I (s1, s2)-E (color-cloth) = 0.9709-0= 0.9709
For Income extracting the data based on the class labels of No, yes

                    No    ,   yes           count

High                0    ,       1         1

Low                 0    ,       1         1

Total                                       2

Calculate the Expected information for the Income

High     I (1, 0) =-1/1log2 (1/1) = 0

Low     I (1, 0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute Income

E( Income ) = 1/2(0)+1/2(0) = 0

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0= 0.9709

For Student extracting the data based on the class labels of No, yes



                     No       ,       yes       count

No                 0         ,        1             1

Yes                 0        ,        1             1

Total                                            2

Calculate the Expected information for the student

No      I (0,1) =-1/1log2 (1/1) = 0

Yes     I (0,1) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute student

E (student) = 1/2 (0) + 1/2 (0) = 0

Gain value for student
Gain(student) = I (s1, s2)-E (student) = 0.9709-0= 0.9709

AF (color_cloth) = |0 - 2| / 1 = 2

AF (income) = ( |0 - 1| + |0 - 1| ) / 2 = 1

AF (student) = ( |0 - 1| + |0 - 1| ) / 2 = 1

V (color_cloth) = 2 / 4 = 0.5

V (income) = 1 / 4 = 0.25

V (student) = 1 / 4 = 0.25




Gain values are

Gain (color_cloth) = 0.9709 x 0.5 = 0.4855

Gain (income) = 0.9709 x 0.25= 0.2427

Gain (student) = 0.9709 x 0.25= 0.2427

Color-cloth has the highest gain value

>40 dataset

Color_cloth, income, student, buys_computer

Red , high, no, no

Red , low, yes, no

Red , medium, no, yes

White , medium, no, no

For color-cloth extracting the data based on the class labels of No, yes

                     No        ,    yes   count

white               1    ,         0     1

red                 2     ,        1     3
Total                                   4

Calculate the Expected information for the color-cloth

Red      I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

white    I (1, 0) = -1/1 log2 (1/1) = 0

Now calculate the Entropy value for attribute color-cloth

E (color-cloth) = 3/4 (0.9183) + 1/4 (0) = 0.6887

Gain value for color-cloth

Gain (color-cloth) = I (s1, s2) - E (color-cloth) = 0.9709 - 0.6887 = 0.2822

For Income extracting the data based on the class labels of No, yes

                    No        ,   yes   count

High                1    ,       0     1

Medium              1   ,        1     2

Low                 1   ,        0     1

Total                                   4

Calculate the Expected information for the Income

High     I (1, 0) =-1/1log2 (1/1) = 0

Medium   I (1, 1) = -1/2 log2 (1/2) - 1/2 log2 (1/2) = 1

Low     I (1, 0) =-1/1log2 (1/1) = 0

Now calculate the Entropy value for attribute Income

E( Income ) = 1/4(0)+2/4(1)+1/4(0) = 0.5

Gain value for Income

Gain(income) = I (s1, s2)-E (Income) = 0.9709-0.5= 0.4709

For Student extracting the data based on the class labels of No, yes

                    No        ,   yes   count
No                  2    ,        1       3

Yes                 1    ,        0       1

Total                                       4

Calculate the Expected information for the student

No      I (2, 1) = -2/3 log2 (2/3) - 1/3 log2 (1/3) = 0.9183

Yes     I (1, 0) = -1/1 log2 (1/1) = 0

Now calculate the Entropy value for attribute student

E (student) = 3/4 (0.9183) + 1/4 (0) = 0.6887

Gain value for student

Gain (student) = I (s1, s2) - E (student) = 0.9709 - 0.6887 = 0.2822

AF (color_cloth) = ( |1 - 0| + |2 - 1| ) / 2 = 1

AF (income) = ( |1 - 0| + |1 - 1| + |1 - 0| ) / 3 = 0.6667

AF (student) = ( |2 - 1| + |1 - 0| ) / 2 = 1

V (color_cloth) = 1 / 2.6667 = 0.3749

V (income) = 0.6667 / 2.6667 = 0.25

V (student) = 1 / 2.6667 = 0.3749




Gain values are

Gain (color_cloth) = 0.2822 x 0.3749 = 0.1058

Gain (income) = 0.4709 x 0.25 = 0.1177

Gain (student) = 0.2822 x 0.3749 = 0.1058

Income has the highest gain value

>40 Income dataset for high
Color-cloth, student, buys-computer

Red, no, no

The class label is the same for both color-cloth and student, hence this becomes a leaf node.

>40 Income dataset for low

Color-cloth, student, buys-computer

Red, yes , no

The class label is the same for both color-cloth and student, hence this becomes a leaf node.

>40 Income dataset for medium

Color-cloth, student, buys-computer

Red, no , yes

White, no, no

For color-cloth extracting the data based on the class labels of No, yes

                    No       ,   yes   count

white               1   ,       0     1

red                 0   ,        1    1

Total                                  2

Calculate the Expected information for the color-cloth

Red      I (0, 1) = -1/1 log2 (1/1) = 0

white    I (1, 0) = -1/1 log2 (1/1) = 0

Now calculate the Entropy value for attribute color-cloth

E( color-cloth) = 1/2(0) +1/2(0)= 0

Gain value for color-cloth

Gain(color-cloth) = I (s1, s2)-E (color-cloth) = 0.9709-0= 0.9709

For Student extracting the data based on the class labels of No, yes
No    ,   yes   count

No                  1     ,    1      2

Total                                   2

Calculate the Expected information for the student

No      I (1,1) =-1/2log2 (1/2)-1/2log2 (1/2) = 1

Now calculate the Entropy value for attribute student

E (student) = 2/2 (1) = 1

Gain value for student

Gain(student) = I (s1, s2)-E (student) = 0.9709-1= -0.0292

AF (color_cloth) = ( |1 - 0| + |0 - 1| ) / 2 = 1

AF (student) = |1 - 1| / 1 = 0

V (color_cloth) = 1 / 1 = 1

V (student) = 0 / 1 = 0




Gain values are

Gain (color_cloth) = 0.9709 x 1= 0.9709

Gain (student) = -0.0292 x 0= 0

<30 Income dataset for high

Color-cloth, student, buys-computer

Yellow , no , no

The class label is the same for both color-cloth and student, hence this becomes a leaf node.

<30 Income dataset for low

Color-cloth, student, buys-computer

White , yes , no
Yellow, yes, no

The class label is the same for both color-cloth and student, hence this becomes a leaf node.

<30 Income dataset for medium

Color-cloth, student, buys-computer

Yellow , no , yes

The class label is the same for both color-cloth and student, hence this becomes a leaf node.

Construction of Improved ID3 decision tree:
4. Requirement Analysis or Requirements Elicitation
REQUIREMENT SPECIFICATION:

       A requirement is a feature that the system must have or a constraint that it must satisfy to
be accepted by the client. Requirements engineering aims at defining the requirements of the
system under construction. Requirements engineering includes two main activities: requirements
elicitation, which results in a specification of the system that the client understands, and analysis,
which results in an analysis model that the developer can unambiguously interpret. A requirement
is a statement about what the proposed system will do. Requirements can be divided into two
major categories: functional requirements and non-functional requirements.



4.1 Functional requirements:

       Functional requirements describe the interactions between the system and its
environment independent of its implementation. The environment includes the user and any
other external system with which the system interacts.


4.2 Non Functional Requirements:

       Non-functional requirements describe aspects of the system that are not directly related to
the functional behaviour of the system. Non-functional requirements include a broad variety of
requirements that apply to many different aspects of the system, from usability to performance.


4.2.1 Software Requirements:

Operating System                                 : Windows 2000/XP

IDE                                              : Eclipse

Language                                         : JDK 1.5

Documentation                                    : MS-Word

Designing                                        : Rational Rose
4.2.2 Hardware Requirements:


CPU                            : Pentium IV

RAM                            : 512MB.

Hard Disk                      : 40GB.

Input device                   : Standard Keyboard and Mouse.
Output device                  : VGA and High Resolution Monitor

More Related Content

What's hot

SAD ASSIGN :)
SAD ASSIGN :)SAD ASSIGN :)
SAD ASSIGN :)Roy Reyes
 
Knowledge based systems
Knowledge based systemsKnowledge based systems
Knowledge based systemsYowan Rdotexe
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And DesignLijo Stalin
 
Expert System
Expert SystemExpert System
Expert Systemhasanshan
 
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...IRJET Journal
 
Chapter 8 system analysis and design
Chapter 8   system analysis and designChapter 8   system analysis and design
Chapter 8 system analysis and designPratik Gupta
 
Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}FellowBuddy.com
 
New Fuzzy Logic Based Intrusion Detection System
New Fuzzy Logic Based Intrusion Detection SystemNew Fuzzy Logic Based Intrusion Detection System
New Fuzzy Logic Based Intrusion Detection Systemijsrd.com
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencesanjay_asati
 
KBS Lecture Notes
KBS Lecture NotesKBS Lecture Notes
KBS Lecture Notesbutest
 
Comparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining SoftwareComparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining SoftwareUniversitas Pembangunan Panca Budi
 
A Survey of Security of Multimodal Biometric Systems
A Survey of Security of Multimodal Biometric SystemsA Survey of Security of Multimodal Biometric Systems
A Survey of Security of Multimodal Biometric SystemsIJERA Editor
 
Expert Systems
Expert SystemsExpert Systems
Expert Systemsosmancikk
 

What's hot (20)

SAD ASSIGN :)
SAD ASSIGN :)SAD ASSIGN :)
SAD ASSIGN :)
 
Knowledge based systems
Knowledge based systemsKnowledge based systems
Knowledge based systems
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And Design
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Artificial Neural Networks for Data Mining
 
Expert System
Expert SystemExpert System
Expert System
 
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
 
Chapter 8 system analysis and design
Chapter 8   system analysis and designChapter 8   system analysis and design
Chapter 8 system analysis and design
 
Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}Introduction to Expert Systems {Artificial Intelligence}
Introduction to Expert Systems {Artificial Intelligence}
 
Hci2013 lecture8
Hci2013 lecture8Hci2013 lecture8
Hci2013 lecture8
 
New Fuzzy Logic Based Intrusion Detection System
New Fuzzy Logic Based Intrusion Detection SystemNew Fuzzy Logic Based Intrusion Detection System
New Fuzzy Logic Based Intrusion Detection System
 
SA Chapter 2
SA Chapter 2SA Chapter 2
SA Chapter 2
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Dsal#01
Dsal#01Dsal#01
Dsal#01
 
Software engineering srs, dfd
Software engineering srs, dfdSoftware engineering srs, dfd
Software engineering srs, dfd
 
SA Chapter 7
SA Chapter 7SA Chapter 7
SA Chapter 7
 
KBS Lecture Notes
KBS Lecture NotesKBS Lecture Notes
KBS Lecture Notes
 
System design
System designSystem design
System design
 
Comparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining SoftwareComparison Between WEKA and Salford System in Data Mining Software
Comparison Between WEKA and Salford System in Data Mining Software
 
A Survey of Security of Multimodal Biometric Systems
A Survey of Security of Multimodal Biometric SystemsA Survey of Security of Multimodal Biometric Systems
A Survey of Security of Multimodal Biometric Systems
 
Expert Systems
Expert SystemsExpert Systems
Expert Systems
 

Viewers also liked (20)

Nieuwsbrief tuinbouw juli 2013
Nieuwsbrief tuinbouw juli 2013Nieuwsbrief tuinbouw juli 2013
Nieuwsbrief tuinbouw juli 2013
 
Ordklasser
OrdklasserOrdklasser
Ordklasser
 
ľudská aura
ľudská auraľudská aura
ľudská aura
 
Závěrečný úkol KPI
Závěrečný úkol KPIZávěrečný úkol KPI
Závěrečný úkol KPI
 
Reported speech
Reported speechReported speech
Reported speech
 
Presentation1
Presentation1Presentation1
Presentation1
 
אנזימים
אנזימיםאנזימים
אנזימים
 
электронные сми
электронные смиэлектронные сми
электронные сми
 
Kpi zaverecny ukol
Kpi zaverecny ukolKpi zaverecny ukol
Kpi zaverecny ukol
 
Mia vamil tuinbouw 2014
Mia vamil tuinbouw 2014Mia vamil tuinbouw 2014
Mia vamil tuinbouw 2014
 
Careers at easy web
Careers at easy webCareers at easy web
Careers at easy web
 
Stinger Casino Night 2013
Stinger Casino Night 2013Stinger Casino Night 2013
Stinger Casino Night 2013
 
Resume 2016
Resume 2016Resume 2016
Resume 2016
 
Závěrečný úkol KPI
Závěrečný úkol KPIZávěrečný úkol KPI
Závěrečný úkol KPI
 
Mia vamil tuinbouw 2013
Mia vamil tuinbouw 2013Mia vamil tuinbouw 2013
Mia vamil tuinbouw 2013
 
Update taalgeschiedenis
Update taalgeschiedenisUpdate taalgeschiedenis
Update taalgeschiedenis
 
Autisme
AutismeAutisme
Autisme
 
Reported speech vijfdes leerlingen
Reported speech  vijfdes leerlingenReported speech  vijfdes leerlingen
Reported speech vijfdes leerlingen
 
Content Language Integrated Learning (CLIL)
Content Language Integrated Learning (CLIL)Content Language Integrated Learning (CLIL)
Content Language Integrated Learning (CLIL)
 
Literatuur in de 17de eeuw
Literatuur in de 17de eeuwLiteratuur in de 17de eeuw
Literatuur in de 17de eeuw
 

Similar to Introduction

Software_Engineering_Presentation (1).pptx
Software_Engineering_Presentation (1).pptxSoftware_Engineering_Presentation (1).pptx
Software_Engineering_Presentation (1).pptxArifaMehreen1
 
System analysis and design
System analysis and designSystem analysis and design
System analysis and designRobinsonObura
 
WELCOME TO AI PROJECT shidhant mittaal.pptx
WELCOME TO AI PROJECT shidhant mittaal.pptxWELCOME TO AI PROJECT shidhant mittaal.pptx
WELCOME TO AI PROJECT shidhant mittaal.pptx9D38SHIDHANTMITTAL
 
System_Analysis_and_Design_Assignment_New2.ppt
System_Analysis_and_Design_Assignment_New2.pptSystem_Analysis_and_Design_Assignment_New2.ppt
System_Analysis_and_Design_Assignment_New2.pptMarissaPedragosa
 
Hardware Design Practices For Modern Hardware
Hardware Design Practices For Modern HardwareHardware Design Practices For Modern Hardware
Hardware Design Practices For Modern HardwareWinstina Kennedy
 
Library mangement system project srs documentation
Library mangement system project srs documentationLibrary mangement system project srs documentation
Library mangement system project srs documentationSuchit Moon
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAjaved75
 
Different Approaches To Sys Bldg
Different Approaches To Sys BldgDifferent Approaches To Sys Bldg
Different Approaches To Sys BldgUSeP
 
Data modelling tool in CASE
Data modelling tool in CASEData modelling tool in CASE
Data modelling tool in CASEManju Pillai
 
06 styles and_greenfield_design
06 styles and_greenfield_design06 styles and_greenfield_design
06 styles and_greenfield_designMajong DevJfu
 
Software development life cycle
Software development life cycle Software development life cycle
Software development life cycle shefali mishra
 
DSS Presentation1.pptx
DSS Presentation1.pptxDSS Presentation1.pptx
DSS Presentation1.pptxLuciaMakwasha1
 
Project Documentation Student Management System format.pptx
Project Documentation Student Management System format.pptxProject Documentation Student Management System format.pptx
Project Documentation Student Management System format.pptxAjayPatre1
 
BCA V Sem System design and Implementation
BCA V Sem System design and ImplementationBCA V Sem System design and Implementation
BCA V Sem System design and Implementationgauravgoswami78
 
sadfinal2007-121022230733-phpapp01.pdf
sadfinal2007-121022230733-phpapp01.pdfsadfinal2007-121022230733-phpapp01.pdf
sadfinal2007-121022230733-phpapp01.pdfshoukatali154717
 

Similar to Introduction (20)

Software_Engineering_Presentation (1).pptx
Software_Engineering_Presentation (1).pptxSoftware_Engineering_Presentation (1).pptx
Software_Engineering_Presentation (1).pptx
 
System analysis and design
System analysis and designSystem analysis and design
System analysis and design
 
WELCOME TO AI PROJECT shidhant mittaal.pptx
WELCOME TO AI PROJECT shidhant mittaal.pptxWELCOME TO AI PROJECT shidhant mittaal.pptx
WELCOME TO AI PROJECT shidhant mittaal.pptx
 
Database Design
Database DesignDatabase Design
Database Design
 
System_Analysis_and_Design_Assignment_New2.ppt
System_Analysis_and_Design_Assignment_New2.pptSystem_Analysis_and_Design_Assignment_New2.ppt
System_Analysis_and_Design_Assignment_New2.ppt
 
Hardware Design Practices For Modern Hardware
Hardware Design Practices For Modern HardwareHardware Design Practices For Modern Hardware
Hardware Design Practices For Modern Hardware
 
Library mangement system project srs documentation
Library mangement system project srs documentationLibrary mangement system project srs documentation
Library mangement system project srs documentation
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
Different Approaches To Sys Bldg
Different Approaches To Sys BldgDifferent Approaches To Sys Bldg
Different Approaches To Sys Bldg
 
Data modelling tool in CASE
Data modelling tool in CASEData modelling tool in CASE
Data modelling tool in CASE
 
06 styles and_greenfield_design
06 styles and_greenfield_design06 styles and_greenfield_design
06 styles and_greenfield_design
 
Sadchap3
Sadchap3Sadchap3
Sadchap3
 
Software development life cycle
Software development life cycle Software development life cycle
Software development life cycle
 
Information systems
Information systemsInformation systems
Information systems
 
DSS Presentation1.pptx
DSS Presentation1.pptxDSS Presentation1.pptx
DSS Presentation1.pptx
 
Project Documentation Student Management System format.pptx
Project Documentation Student Management System format.pptxProject Documentation Student Management System format.pptx
Project Documentation Student Management System format.pptx
 
Computers in management
Computers in managementComputers in management
Computers in management
 
BCA V Sem System design and Implementation
BCA V Sem System design and ImplementationBCA V Sem System design and Implementation
BCA V Sem System design and Implementation
 
sadfinal2007-121022230733-phpapp01.pdf
sadfinal2007-121022230733-phpapp01.pdfsadfinal2007-121022230733-phpapp01.pdf
sadfinal2007-121022230733-phpapp01.pdf
 
Database Design
Database Design Database Design
Database Design
 

Introduction

  • 1. INTRODUCTION 1.1Project Overview: Data mining: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?" The most commonly used techniques in data mining are: Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure. Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) . Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution. Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ³ 1). Sometimes called the k-nearest neighbor technique. Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Decision trees are available in many data mining packages over a variety of platforms, and are useful for various tasks such as classification, regression, clustering and feature selection.

ADVANTAGES AND DISADVANTAGES OF DECISION TREES

Advantages

    Simple to understand and interpret.
    Require little data preparation.
    Able to handle both numerical and categorical data.
    Use a white-box model.
    Perform well on large data in a short time.

Disadvantages

    The output attribute must be categorical.
    Limited to one output attribute.
1.2 Software Process Model

The Waterfall Model is one of the most widely used software development processes. It is also called the "linear sequential model" or the "classic life cycle"; the variant used in this project is the iterative waterfall model. It is widely used in commercial development projects. It is called a waterfall because we move to the next phase (step) only after getting the output of the previous phase, just as water flows down from one step to the next.

In the iterative waterfall model the software development process is divided into five phases:

a) SRS (Software Requirements Specification)
b) System design and software design
c) Implementation and unit testing
d) Integration and system testing
e) Operation and maintenance

(Figure: Iterative waterfall model with its stages.)

Let us discuss these stages of the waterfall model in detail.

Software Requirements Specification: This is the most crucial phase for the whole project. Here the project team, along with the customer, makes a detailed list of user requirements. The project team chalks out the functionality and limitations (if any) of the software being developed, in detail. The document which contains all this information is called the SRS, and it indicates the requirements clearly and unambiguously. A small amount of top-level analysis and design is also documented. This document is verified and endorsed by the customer before the project starts. The SRS serves as the input for the further phases.
System Design and Software Design: Using the SRS as input, the system design is done. System design includes the design of both hardware and software, i.e. the functionality of hardware and software is separated out. After this separation the design of the software modules is done. The design process translates the requirements into a representation of the software that can be assessed for quality before code generation begins. At the same time a test plan is prepared; the test plan describes the various tests which will be carried out on the system after development is complete.

Implementation and Unit Testing: With the system design in hand, code generation begins. Code generation is the conversion of the design into a machine-readable form. If the design of the system and software is done well, code generation can be done easily. The software modules are now further divided into units. A unit is a logically separable part of the software that can be tested on its own. In this phase unit testing is done by the developers themselves, to ensure that there are no defects.

Integration and System Testing: The units of the software are now integrated and the system is built, so we have complete software at hand, which is tested to check whether it meets the functional and performance requirements of the customer. Testing is done as per the steps defined in the test plan, to ensure that the defined inputs produce actual results which agree with the required results. A test report is generated which contains the test results.

Operation and Maintenance: Once the tested software is complete, it is delivered to the client. The client's feedback is taken, and any changes required are made in this phase. This phase goes on until the software is retired.

1.4 Roles and Responsibilities: Every member of the team did the work collaboratively. Each of us took part in every task in a distributed manner; we shared the work and worked as a team rather than as individuals.

1.5 Tools and Techniques:

ABOUT ECLIPSE

Eclipse is a universal platform for integrating development tools, with an open, extensible architecture based on plug-ins. Eclipse provides a number of aids that make writing Java code much quicker and easier than using a text editor, which means you can spend more time learning Java and less time typing and looking up documentation.
The Eclipse debugger and scrapbook allow you to look inside the execution of the Java code. This allows you to "see" objects and to understand how Java is working behind the scenes. Eclipse also provides full support for agile software development practices such as test-driven development and refactoring.

2. Existing System

2.1 Description of Existing System:

ID3 is a mathematical algorithm for building a decision tree, invented by J. Ross Quinlan in 1979. It uses information theory, introduced by Shannon in 1948. It builds the tree from the top down, with no backtracking, employing a top-down greedy search through the space of possible decision trees: greedy because there is no backtracking, and because it always picks the highest-valued attribute first. Information gain is used to select the attribute that is most useful for classifying the examples (the attribute with the highest information gain).

The Iterative Dichotomiser 3 (ID3) algorithm is a decision tree learning algorithm. The name reflects that it creates decision trees for "dichotomizing" data instances, classifying them discretely through branching nodes until a classification "bucket" (a leaf node) is reached. By using ID3 and other machine-learning algorithms from artificial intelligence, expert systems can engage in tasks usually done by human experts, such as doctors diagnosing diseases by examining various symptoms (the attributes) of patients (the data instances) in a complex decision tree. Accurate decision trees are fundamental to data mining and databases.

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. It is one of the most widely used and practical methods for inductive inference. The input data of ID3 is known as the set of "training" or "learning" data instances, which the algorithm uses to generate the decision tree; the machine is "learning" from this set of preliminary data.

The ID3 algorithm generates a decision tree by using a greedy search through the inputted sets of data instances to determine the nodes and the attributes they use for branching. The emerging tree is traversed recursively in a top-down (root to leaf) manner through each of its nodes. The traversal attempts to determine whether the decision attribute on which the branching of a particular emerging node will be based is the most suitable branching attribute, using the inputted sets of data.
One particular metric that can be used to determine whether a branching attribute is adequate is INFORMATION GAIN, which is based on ENTROPY. ID3 uses entropy to determine whether, for the inputted set of data, the selected branching attribute for a particular node of the emerging tree is adequate. Specifically, the attribute that results in the greatest reduction of entropy over the learning sets is the best.

GOAL: Find a way to optimally classify a learning set, such that the resulting decision tree is not too deep and the number of branches (internal nodes) of the tree is minimized.

SOLUTION: The more entropy (the measure of impurity in a collection of data sets) a system has, the more branches and depth the tree will have. Find entropy-reducing attributes in the learning sets and use them for branching. Information gain measures the expected reduction in entropy: the higher the information gain, the greater the expected reduction in entropy. It turns out that entropy, the measure of non-homogeneousness within a set of learning sets, can be calculated in a straightforward manner.

CALCULATION AND CONSTRUCTION OF ID3 ALGORITHM

STEP 1: Take the dataset and calculate the count of each unique value of the class label.

STEP 2: Calculate the amount of information for the class label. With p and n tuples in the two classes, I(p, n) = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n)).

STEP 3: For each attribute, calculate the count of each unique attribute value broken down by the unique values of the class label.

STEP 4: Calculate the expected information for each unique value of each attribute, I(p_i, n_i), using the same formula on the counts of that value.

STEP 5: Calculate the entropy value for each attribute, taking the total of the expected information values of all unique values of the attribute: E(A) = sum over the values of (p_i + n_i)/(p + n) x I(p_i, n_i).

STEP 6: Calculate the gain value from the amount of information and the entropy value: Gain(A) = I(p, n) - E(A).

STEP 7: Construct the decision tree based on the decision tree construction algorithm given later (see CONSTRUCTION OF ID3 DECISION TREE ALGORITHM).

EXPLANATION WITH AN EXAMPLE FOR THE ID3 DECISION TREE

Age, color_cloth, income, student, buys_computer
>40, red, high, no, no
<30, yellow, high, no, no
30-40, blue, high, no, yes
>40, red, medium, no, yes
<30, white, low, yes, no
>40, red, low, yes, no
30-40, blue, low, yes, yes
<30, yellow, medium, no, yes
<30, yellow, low, yes, no
>40, white, medium, no, no

This is the sample buys_computer dataset.
Attribute names: Age, color_cloth, income, student, buys_computer
Unique values of the Age attribute: >40, <30, 30-40
Class labels of the dataset: yes, no

A small code sketch of these formulas is given below.
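The formulas in Steps 1-6 can be written down almost verbatim in code. The following is a minimal hypothetical sketch (the class name Id3Math and the methods info, expectedInfo and gain are our own illustration, not part of the project code); its main method checks the Age figures that are worked out by hand in the next step.

import java.util.*;

public class Id3Math {

    /** I(p, n) = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n)); 0 log 0 is treated as 0. */
    static double info(int p, int n) {
        double total = p + n;
        return term(p / total) + term(n / total);
    }

    private static double term(double fraction) {
        return fraction == 0.0 ? 0.0 : -fraction * (Math.log(fraction) / Math.log(2));
    }

    /** E(A) = sum over the values of A of (p_i + n_i)/(p + n) * I(p_i, n_i);
     *  each int[] holds the {yes, no} counts of one attribute value. */
    static double expectedInfo(List<int[]> countsPerValue, int totalRows) {
        double e = 0.0;
        for (int[] c : countsPerValue) {
            e += ((double) (c[0] + c[1]) / totalRows) * info(c[0], c[1]);
        }
        return e;
    }

    /** Gain(A) = I(p, n) - E(A). */
    static double gain(int p, int n, List<int[]> countsPerValue) {
        return info(p, n) - expectedInfo(countsPerValue, p + n);
    }

    public static void main(String[] args) {
        // Age attribute of the buys_computer example: >40 -> (1 yes, 3 no),
        // <30 -> (1 yes, 3 no), 30-40 -> (2 yes, 0 no); class label: 4 yes, 6 no.
        List<int[]> age = Arrays.asList(new int[]{1, 3}, new int[]{1, 3}, new int[]{2, 0});
        System.out.printf("I(6, 4)   = %.4f%n", info(6, 4));             // ~0.9710
        System.out.printf("E(age)    = %.4f%n", expectedInfo(age, 10));  // ~0.6490
        System.out.printf("Gain(age) = %.4f%n", gain(4, 6, age));        // ~0.3219
    }
}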
Step 1: For the given dataset we first calculate the amount of information for the class label, for which we need the counts of the class labels in the dataset: the count of NO is 6 and the count of YES is 4. Using the formula:

no = 6 and yes = 4
I(6, 4) = -6/10 log2(6/10) - 4/10 log2(4/10) = 0.4422 + 0.5288 = 0.9710

This quantity is called the amount of information, and it is a single value for the whole dataset.

Step 2: For each attribute, split the counts by unique value, then calculate the expected information, the entropy value and the gain value using the formulas above.

Take the Age attribute and separate the values against the class labels:

            no , yes   count
>40          3 ,  1      4
<30          3 ,  1      4
30-40        - ,  2      2
Total                   10

This table is extracted based on the class labels of buys_computer. Now calculate the expected information for the unique attribute values >40, <30 and 30-40:

>40:   I(3, 1) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.8113
<30:   I(3, 1) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.8113
30-40: I(0, 2) = -2/2 log2(2/2) = 0

Now calculate the entropy value for the Age attribute:

E(age) = 4/10(0.8113) + 4/10(0.8113) + 2/10(0) = 0.3245 + 0.3245 = 0.6490

Step 3: Calculate the gain value for the Age attribute from the amount of information and the entropy value:

Gain(age) = I(6, 4) - E(age) = 0.9710 - 0.6490 = 0.3219

Second attribute: for color_cloth, extract the data based on the class labels:

            no , yes   count
Red          2 ,  1      3
Yellow       2 ,  1      3
Blue         - ,  2      2
White        2 ,  -      2
Total                   10

Expected information for color_cloth:

Red:    I(2, 1) = -2/3 log2(2/3) - 1/3 log2(1/3) = 0.9183
Yellow: I(2, 1) = -2/3 log2(2/3) - 1/3 log2(1/3) = 0.9183
Blue:   I(0, 2) = -2/2 log2(2/2) = 0
White:  I(2, 0) = -2/2 log2(2/2) = 0

E(color_cloth) = 3/10(0.9183) + 3/10(0.9183) + 2/10(0) + 2/10(0) = 0.5510
Gain(color_cloth) = I(6, 4) - E(color_cloth) = 0.9710 - 0.5510 = 0.4199

Third attribute: for Income, extract the data based on the class labels:

            no , yes   count
High         2 ,  1      3
Medium       1 ,  2      3
Low          3 ,  1      4
Total                   10

Expected information for Income:

High:   I(2, 1) = 0.9183
Medium: I(1, 2) = 0.9183
Low:    I(3, 1) = 0.8113

E(income) = 3/10(0.9183) + 3/10(0.9183) + 4/10(0.8113) = 0.8755
Gain(income) = I(6, 4) - E(income) = 0.9710 - 0.8755 = 0.0954

Fourth attribute: for Student, extract the data based on the class labels:

            no , yes   count
No           3 ,  3      6
Yes          3 ,  1      4
Total                   10

Expected information for Student:

No:  I(3, 3) = -3/6 log2(3/6) - 3/6 log2(3/6) = 1
Yes: I(3, 1) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.8113

E(student) = 6/10(1) + 4/10(0.8113) = 0.9245
Gain(student) = I(6, 4) - E(student) = 0.9710 - 0.9245 = 0.0464

The important point is to compare all the gain values just calculated:

Gain(age)         = 0.3219
Gain(color_cloth) = 0.4199  (highest gain value)
Gain(income)      = 0.0954
Gain(student)     = 0.0464

Color_cloth has the highest gain value of the four attributes, so as per the algorithm it becomes the root node of the decision tree. The following sketch reproduces these figures in code.
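As a usage example, a small hypothetical driver (again not the project code; it assumes the Id3Math sketch above is compiled in the same package) can recompute the four gains directly from the ten-row table and confirm that color_cloth wins:

import java.util.*;

public class RootNodeDemo {

    // buys_computer example rows: {age, color_cloth, income, student, class}
    static final String[][] ROWS = {
        {">40",   "red",    "high",   "no",  "no"},
        {"<30",   "yellow", "high",   "no",  "no"},
        {"30-40", "blue",   "high",   "no",  "yes"},
        {">40",   "red",    "medium", "no",  "yes"},
        {"<30",   "white",  "low",    "yes", "no"},
        {">40",   "red",    "low",    "yes", "no"},
        {"30-40", "blue",   "low",    "yes", "yes"},
        {"<30",   "yellow", "medium", "no",  "yes"},
        {"<30",   "yellow", "low",    "yes", "no"},
        {">40",   "white",  "medium", "no",  "no"},
    };
    static final String[] NAMES = {"age", "color_cloth", "income", "student"};

    public static void main(String[] args) {
        for (int a = 0; a < NAMES.length; a++) {
            // (yes, no) counts per distinct value of attribute a
            Map<String, int[]> counts = new LinkedHashMap<String, int[]>();
            for (String[] r : ROWS) {
                int[] c = counts.get(r[a]);
                if (c == null) { c = new int[2]; counts.put(r[a], c); }
                c[r[4].equals("yes") ? 0 : 1]++;
            }
            double gain = Id3Math.gain(4, 6, new ArrayList<int[]>(counts.values()));
            System.out.printf("Gain(%s) = %.4f%n", NAMES[a], gain);
        }
        // Approx. output: age 0.3219, color_cloth 0.4199, income 0.0954,
        // student 0.0464 -> color_cloth has the highest gain and becomes the root.
    }
}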
After that, take the unique attribute values (red, blue, white, yellow) as the arcs leaving the root. For each value, for example red, identify every row of the dataset carrying that value and extract those rows into a new table (the "red" table); on each such table we again calculate all the gain values and then decide where to place the inner nodes and the leaf nodes. In this partial decision tree color_cloth is the root node and its unique attribute values label the arcs. Based on each attribute value we extract the data, neglecting the color_cloth attribute itself. Observe the first arc: we extract the data based on the red attribute value, take those tuples, and calculate the expected information, the entropy value and finally the gain value. This is continued for the remaining arcs and for the leaf-node selections.

Red reference tuples (age, income, student, buys_computer):
>40, high, no, no
>40, medium, no, yes
>40, low, yes, no

For Age, extracting the data based on the class labels:

          no , yes   count
>40        2 ,  1      3
Total                  3

Expected information: >40: I(2, 1) = -2/3 log2(2/3) - 1/3 log2(1/3) = 0.9183
E(age) = 3/3(0.9183) = 0.9183
Gain(age) = I(6, 4) - E(age) = 0.9710 - 0.9183 = 0.0526

For Income:

          no , yes   count
High       1 ,  0      1
Medium     0 ,  1      1
Low        1 ,  0      1
Total                  3

Expected information: High I(1, 0) = 0, Medium I(0, 1) = 0, Low I(1, 0) = 0
E(income) = 1/3(0) + 1/3(0) + 1/3(0) = 0
Gain(income) = 0.9710 - 0 = 0.9710

For Student:

          no , yes   count
No         1 ,  1      2
Yes        1 ,  0      1
Total                  3

Expected information: No I(1, 1) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1, Yes I(1, 0) = 0
E(student) = 2/3(1) + 1/3(0) = 0.6667
Gain(student) = 0.9710 - 0.6667 = 0.3043

For the red table the gain values of age, income and student are 0.0526, 0.9710 and 0.3043. Income has the highest gain value, so the child node under the red arc is income.

Blue reference tuples (age, income, student, buys_computer):
30-40, high, no, yes
30-40, low, yes, yes

For Age: 30-40 has 0 no and 2 yes (count 2). I(0, 2) = 0, E(age) = 2/2(0) = 0, Gain(age) = 0.9710 - 0 = 0.9710.
For Income: High 0, 1 and Low 0, 1. Both expected information values are 0, so E(income) = 0 and Gain(income) = 0.9710.
For Student: No 0, 1 and Yes 0, 1. Both expected information values are 0, so E(student) = 0 and Gain(student) = 0.9710.

The gain values of age, income and student are all 0.9710. Since all the values are the same, the node ends here (every tuple in this branch has the class label yes, so the blue arc leads to a leaf).

White reference tuples (age, income, student, buys_computer):
<30, low, yes, no
>40, medium, no, no

For Age: <30 has 1, 0 and >40 has 1, 0. Both expected information values are 0, so E(age) = 0 and Gain(age) = 0.9710.
For Income: Low 1, 0 and Medium 1, 0. E(income) = 0, Gain(income) = 0.9710.
For Student: No 1, 0 and Yes 1, 0. E(student) = 0, Gain(student) = 0.9710.

The gain values of age, income and student are all 0.9710. Since all the values are the same, the node ends here (every tuple in this branch has the class label no, so the white arc leads to a leaf).

Yellow reference tuples (age, income, student, buys_computer):
<30, high, no, no
<30, medium, no, yes
<30, low, yes, no

For Age: <30 has 2 no and 1 yes (count 3). I(2, 1) = 0.9183, E(age) = 3/3(0.9183) = 0.9183, Gain(age) = 0.9710 - 0.9183 = 0.0526.
For Income: High 1, 0; Medium 0, 1; Low 1, 0. All expected information values are 0, so E(income) = 0 and Gain(income) = 0.9710.
For Student: No 1, 1 (I = 1) and Yes 1, 0 (I = 0). E(student) = 2/3(1) + 1/3(0) = 0.6667, Gain(student) = 0.9710 - 0.6667 = 0.3043.

For the yellow table the gain values of age, income and student are 0.0526, 0.9710 and 0.3043. Income has the highest gain value, so the child node under the yellow arc is also income.

The sketch below shows how the figures for one such branch can be reproduced in code.
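For one branch, the same computation looks as follows. This hypothetical snippet reuses the Id3Math helper and the ROWS table from the earlier sketches, keeps only the rows whose color_cloth value is red, and re-ranks the remaining attributes; like the text above, it keeps the full-table amount of information I(6, 4) when forming the gains.

import java.util.*;

public class RedBranchDemo {
    public static void main(String[] args) {
        // Keep only the rows whose color_cloth value (column 1) is "red".
        List<String[]> red = new ArrayList<String[]>();
        for (String[] r : RootNodeDemo.ROWS) {
            if (r[1].equals("red")) red.add(r);
        }
        int[] remaining = {0, 2, 3};                          // age, income, student
        String[] names = {"age", "color_cloth", "income", "student"};
        for (int a : remaining) {
            // (yes, no) counts per value of the attribute within the red branch
            Map<String, int[]> counts = new LinkedHashMap<String, int[]>();
            for (String[] r : red) {
                int[] c = counts.get(r[a]);
                if (c == null) { c = new int[2]; counts.put(r[a], c); }
                c[r[4].equals("yes") ? 0 : 1]++;
            }
            // The document keeps I(6, 4) = 0.9710 from the full table; the same
            // convention is used here so the printed gains match the text above.
            double gain = Id3Math.info(6, 4)
                    - Id3Math.expectedInfo(new ArrayList<int[]>(counts.values()), red.size());
            System.out.printf("Gain(%s | red) = %.4f%n", names[a], gain);
        }
        // Approx. output: age 0.0526, income 0.9710, student 0.3043 -> income wins.
    }
}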
CONSTRUCTION OF ID3 DECISION TREE ALGORITHM

STEP 1: Calculate the gains of the attributes.
STEP 2: Make the attribute with the highest gain the root node.
STEP 3: Build one dataset per unique value of the current node's attribute, dropping that attribute from the dataset; the current node becomes the parent node.
STEP 4: Take each dataset from these datasets and go to step 5.
STEP 5: Take the attributes of the current dataset. If only one attribute value is left, go to step 6; otherwise go to step 7.
STEP 6: Attach the current one as a leaf node to the parent node.
STEP 7: Calculate the attribute gains.
STEP 8: If all attribute gains are the same, go to step 6; otherwise make the attribute with the highest gain a node, consider this node as the current node and go to step 3.

A recursive sketch of this construction is given below.
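The following is a rough, hypothetical recursive sketch of these construction steps, not the project's implementation. It reuses the Id3Math helper and the {attribute values..., class} row layout from the earlier sketches; STEP 8's all-gains-equal stop is left out for brevity, and an impure leaf simply takes the first row's class where a fuller version would take the majority class.

import java.util.*;

public class Id3Builder {

    static class Node {
        String attribute;                                   // branching attribute, null for a leaf
        String label;                                       // class label when this is a leaf
        Map<String, Node> children = new LinkedHashMap<String, Node>();
    }

    /** rows: each String[] is {attribute values..., class}; attrs: indices of attributes still unused. */
    static Node build(List<String[]> rows, List<Integer> attrs, String[] names) {
        Node node = new Node();
        Set<String> classes = new HashSet<String>();
        for (String[] r : rows) classes.add(r[r.length - 1]);
        // STEP 6: leaf when the class is already pure or no attributes are left.
        if (classes.size() == 1 || attrs.isEmpty()) {
            node.label = rows.get(0)[rows.get(0).length - 1];
            return node;
        }
        // STEP 1 / STEP 2: pick the attribute with the highest gain.
        int best = attrs.get(0);
        double bestGain = -1.0;
        for (int a : attrs) {
            double g = gainFor(rows, a);
            if (g > bestGain) { bestGain = g; best = a; }
        }
        node.attribute = names[best];
        // STEP 3 / STEP 4: split the rows on the chosen attribute and recurse on each part.
        Map<String, List<String[]>> parts = new LinkedHashMap<String, List<String[]>>();
        for (String[] r : rows) {
            List<String[]> part = parts.get(r[best]);
            if (part == null) { part = new ArrayList<String[]>(); parts.put(r[best], part); }
            part.add(r);
        }
        List<Integer> rest = new ArrayList<Integer>(attrs);
        rest.remove(Integer.valueOf(best));
        for (Map.Entry<String, List<String[]>> e : parts.entrySet()) {
            node.children.put(e.getKey(), build(e.getValue(), rest, names));
        }
        return node;
    }

    /** Gain of attribute a over the given rows, delegating to the Id3Math sketch. */
    static double gainFor(List<String[]> rows, int a) {
        Map<String, int[]> counts = new LinkedHashMap<String, int[]>();
        int yes = 0, no = 0;
        for (String[] r : rows) {
            int[] c = counts.get(r[a]);
            if (c == null) { c = new int[2]; counts.put(r[a], c); }
            if (r[r.length - 1].equals("yes")) { c[0]++; yes++; } else { c[1]++; no++; }
        }
        return Id3Math.gain(yes, no, new ArrayList<int[]>(counts.values()));
    }
}

Calling build with the full ten-row table and attribute indices {0, 1, 2, 3} would reproduce the tree described above, with color_cloth at the root.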
2.2 Flaws or Drawbacks of the Existing System:

The principle of selecting attribute A as the test attribute for ID3 is to make E(A), the entropy of attribute A, the smallest. Studies suggest that there is a problem with this method: it is often biased towards selecting attributes with more distinct values, which are not necessarily the best attributes. In other words, in real situations it is not so important for the attributes selected by the ID3 algorithm to be judged first by making the entropy value minimal. Besides, the ID3 algorithm selects attributes in terms of information entropy, which is computed from probabilities, while the probability method is only suitable for solving stochastic problems. Aiming at these shortcomings of the ID3 algorithm, some improvements are made and an improved decision tree algorithm is introduced.

3. Proposed System

3.1 Description of Proposed System:

The principle of selecting attribute A as the test attribute for ID3 is to make E(A) of attribute A the smallest. Studies suggest that there is a problem with this method: it is often biased towards selecting attributes with more distinct values, which are not necessarily the best attributes. In other words, in real situations it is not so important for the attributes selected by the ID3 algorithm to be judged first by making the entropy value minimal. Besides, the ID3 algorithm selects attributes in terms of information entropy, which is computed from probabilities, while the probability method is only suitable for solving stochastic problems. Aiming at these shortcomings of the ID3 algorithm, some improvements are made and an improved decision tree algorithm is introduced.

3.2 Problem Statement:

The decision tree is an important method for both induction research and data mining, mainly used for model classification and prediction. ID3 is the most widely used decision tree algorithm. By illustrating the basic ideas of decision trees in data mining, this work addresses ID3's shortcoming of inclining to choose attributes with many values, which are not necessarily the best attributes. In other words, in real situations it is not so important for the attributes selected by the ID3 algorithm to be judged first by making the entropy value minimal. Besides, the ID3 algorithm selects attributes in terms of information entropy, which is computed from probabilities, while the probability method is only suitable for solving stochastic problems. Aiming at these shortcomings of the ID3 algorithm, a new decision tree algorithm combining ID3 and an Association Function is presented. The experimental results show that the proposed algorithm can overcome ID3's shortcoming effectively and produce more reasonable and effective rules.
3.3 Working of Proposed System:

CALCULATION OF IMPROVED ID3 ALGORITHM

STEP 1: Take the dataset and calculate the count of each unique value of the class label.
STEP 2: Calculate the amount of information for the class label, I(p, n), as before.
STEP 3: For each attribute, calculate the count of each unique attribute value broken down by the unique values of the class label.
STEP 4: Calculate the expected information for each unique value of each attribute.
STEP 5: Calculate the entropy value for each attribute from the total of the expected information values of all its unique values.
STEP 6: Calculate the gain value from the amount of information and the entropy value.
STEP 7: After calculating the gain values, additionally calculate the importance of each attribute using the association (correlation) function given below. The formula is given for the case of exactly two class labels, and it works only under this two-class condition.
STEP 8: Calculate the improved ID3 gain of each attribute from the ID3 gain and the normalized association function value: Gain'(A) = Gain(A) x V(A).
Up to this point the calculation is the same as in the ID3 algorithm. The improved ID3 algorithm additionally calculates the importance of each attribute and multiplies the ID3 gain by this factor. Using the datasets of ID3, the ASSOCIATION FUNCTION is expanded as follows:

AF(A) = (1/m) x sum over k = 1..m of |x_k - y_k|

where m is the number of unique values of attribute A, and x_k and y_k are the counts of the two class labels among the tuples having the k-th value. This formula is applied to the retrieved datasets.

As an example, take all the attribute values of age, color_cloth, income and student. Age data:

            no , yes   count
>40          3 ,  1      4
<30          3 ,  1      4
30-40        - ,  2      2
Total                   10

Calculation for all attributes:

AF(age)         = (|3-1| + |3-1| + |0-2|) / 3         = 2
AF(color_cloth) = (|2-1| + |2-1| + |0-2| + |2-0|) / 4 = 1.5
AF(income)      = (|2-1| + |1-2| + |3-1|) / 3         = 1.3334
AF(student)     = (|3-3| + |3-1|) / 2                 = 1

THE NORMALIZATION RELATION DEGREE FORMULA
V(A) = AF(A) / sum of the AF values of all attributes, i.e. the share of attribute A in the total association function value.

V(age)         = 2 / 5.8334      = 0.3429
V(color_cloth) = 1.5 / 5.8334    = 0.2571
V(income)      = 1.3334 / 5.8334 = 0.2286
V(student)     = 1 / 5.8334      = 0.1714

NOW CALCULATE THE IMPROVED ID3 GAIN VALUE USING THE FORMULA Gain'(A) = Gain(A) x V(A):

Gain'(age)         = 0.3219 x 0.3429 = 0.1104   (highest gain value in the improved ID3 decision tree, so it is the root node)
Gain'(color_cloth) = 0.4199 x 0.2571 = 0.1080
Gain'(income)      = 0.0954 x 0.2286 = 0.0218
Gain'(student)     = 0.0464 x 0.1714 = 0.0080

So the AGE attribute becomes the root node. <30, 30-40 and >40 are its unique attribute values. Now extract the <30, 30-40 and >40 data from the dataset over the attributes color_cloth, income, student and buys_computer; the age attribute is neglected because it is the root node. Taking the references of the age attribute values, retrieve all the <30 tuples first.

The association function, the normalization relation degree and the improved gain of this step are sketched in code below.
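The class ImprovedId3Demo below is a hypothetical sketch (the class and method names are ours); the AF formula it implements is the one reconstructed above from the worked figures, and the plain ID3 gains are taken from the earlier tables rather than recomputed.

import java.util.*;

public class ImprovedId3Demo {

    /** AF(A): mean absolute difference between the two class counts over the m values of A. */
    static double af(List<int[]> countsPerValue) {
        double sum = 0;
        for (int[] c : countsPerValue) sum += Math.abs(c[0] - c[1]);
        return sum / countsPerValue.size();
    }

    public static void main(String[] args) {
        // Per-value (yes, no) counts and plain ID3 gains taken from the tables above.
        Map<String, List<int[]>> counts = new LinkedHashMap<String, List<int[]>>();
        counts.put("age",         Arrays.asList(new int[]{1, 3}, new int[]{1, 3}, new int[]{2, 0}));
        counts.put("color_cloth", Arrays.asList(new int[]{1, 2}, new int[]{1, 2}, new int[]{2, 0}, new int[]{0, 2}));
        counts.put("income",      Arrays.asList(new int[]{1, 2}, new int[]{2, 1}, new int[]{1, 3}));
        counts.put("student",     Arrays.asList(new int[]{3, 3}, new int[]{1, 3}));
        Map<String, Double> id3Gain = new LinkedHashMap<String, Double>();
        id3Gain.put("age", 0.3219);    id3Gain.put("color_cloth", 0.4199);
        id3Gain.put("income", 0.0954); id3Gain.put("student", 0.0464);

        double afTotal = 0;
        for (List<int[]> c : counts.values()) afTotal += af(c);
        for (String a : counts.keySet()) {
            double v = af(counts.get(a)) / afTotal;   // normalization relation degree V(A)
            System.out.printf("%-12s AF=%.4f V=%.4f improved gain=%.4f%n",
                    a, af(counts.get(a)), v, id3Gain.get(a) * v);
        }
        // Approx. output: age 0.1104 (largest), color_cloth 0.1080, income 0.0218, student 0.0080.
    }
}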
<30 dataset (color_cloth, income, student, buys_computer):
Yellow, high, no, no
White, low, yes, no
Yellow, medium, no, yes
Yellow, low, yes, no

Here we calculate the gain values. For color_cloth, extracting the data based on the class labels:

          no , yes   count
Yellow     2 ,  1      3
White      1 ,  -      1
Total                  4

Expected information: Yellow I(2, 1) = 0.9183, White I(1, 0) = 0
E(color_cloth) = 3/4(0.9183) + 1/4(0) = 0.6887
Gain(color_cloth) = I(6, 4) - E(color_cloth) = 0.9710 - 0.6887 = 0.2822

For Income: High 1, 0; Medium 0, 1; Low 2, 0 (total 4). All expected information values are 0, so
E(income) = 0 and Gain(income) = 0.9710 - 0 = 0.9710

For Student: No 1, 1 (count 2) and Yes 2, 0 (count 2). No: I(1, 1) = 1, Yes: I(2, 0) = 0.
E(student) = 2/4(1) + 2/4(0) = 0.5 and Gain(student) = 0.9710 - 0.5 = 0.4710

AF(color_cloth) = (|2-1| + |1-0|) / 2 = 1
AF(income)      = (1 + 1 + 2) / 3     = 1.3334
AF(student)     = (0 + 2) / 2         = 1

V(color_cloth) = 1 / 3.3334      = 0.3
V(income)      = 1.3334 / 3.3334 = 0.4
V(student)     = 1 / 3.3334      = 0.3

The improved gain values are:
Gain'(color_cloth) = 0.2822 x 0.3 = 0.0847
Gain'(income)      = 0.9710 x 0.4 = 0.3884
Gain'(student)     = 0.4710 x 0.3 = 0.1413

Income has the highest gain value, so it becomes the node under the <30 arc.

30-40 dataset (color_cloth, income, student, buys_computer):
Blue, high, no, yes
Blue, low, yes, yes

For color_cloth: Blue 0, 2 (count 2). I(0, 2) = 0, E(color_cloth) = 0, Gain(color_cloth) = 0.9710.
For Income: High 0, 1 and Low 0, 1. E(income) = 0, Gain(income) = 0.9710.
For Student: No 0, 1 and Yes 0, 1. E(student) = 0, Gain(student) = 0.9710.

AF(color_cloth) = |0-2| / 1 = 2,  AF(income) = (1 + 1) / 2 = 1,  AF(student) = (1 + 1) / 2 = 1
V(color_cloth) = 2 / 4 = 0.5,  V(income) = 1 / 4 = 0.25,  V(student) = 1 / 4 = 0.25

Gain'(color_cloth) = 0.9710 x 0.5  = 0.4855
Gain'(income)      = 0.9710 x 0.25 = 0.2427
Gain'(student)     = 0.9710 x 0.25 = 0.2427

Color_cloth has the highest gain value (every tuple in this branch has the class label yes).

>40 dataset (color_cloth, income, student, buys_computer):
Red, high, no, no
Red, low, yes, no
Red, medium, no, yes
White, medium, no, no

For color_cloth, extracting the data based on the class labels:

          no , yes   count
White      1 ,  -      1
Red        2 ,  1      3
Total                  4

Expected information: Red I(2, 1) = 0.9183, White I(1, 0) = 0
E(color_cloth) = 3/4(0.9183) + 1/4(0) = 0.6887
Gain(color_cloth) = 0.9710 - 0.6887 = 0.2822

For Income: High 1, 0 (count 1); Medium 1, 1 (count 2); Low 1, 0 (count 1).
Expected information: High 0, Medium I(1, 1) = 1, Low 0.
E(income) = 1/4(0) + 2/4(1) + 1/4(0) = 0.5 and Gain(income) = 0.9710 - 0.5 = 0.4710

For Student: No 2, 1 (count 3) and Yes 1, 0 (count 1). No: I(2, 1) = 0.9183, Yes: I(1, 0) = 0.
E(student) = 3/4(0.9183) + 1/4(0) = 0.6887 and Gain(student) = 0.9710 - 0.6887 = 0.2822

AF(color_cloth) = (|1-0| + |2-1|) / 2 = 1
AF(income)      = (1 + 0 + 1) / 3     = 0.6667
AF(student)     = (|2-1| + |1-0|) / 2 = 1

V(color_cloth) = 1 / 2.6667      = 0.375
V(income)      = 0.6667 / 2.6667 = 0.25
V(student)     = 1 / 2.6667      = 0.375

Gain'(color_cloth) = 0.2822 x 0.375 = 0.1058
Gain'(income)      = 0.4710 x 0.25  = 0.1177
Gain'(student)     = 0.2822 x 0.375 = 0.1058

Income has the highest gain value, so it becomes the node under the >40 arc.

>40, income = high dataset (color_cloth, student, buys_computer):
Red, no, no
The class label is the same for every tuple, hence this branch becomes a leaf node (buys_computer = no).

>40, income = low dataset (color_cloth, student, buys_computer):
Red, yes, no
The class label is the same for every tuple, hence this branch becomes a leaf node (buys_computer = no).

>40, income = medium dataset (color_cloth, student, buys_computer):
Red, no, yes
White, no, no

For color_cloth: Red 0, 1 and White 1, 0. Both expected information values are 0, so E(color_cloth) = 0 and Gain(color_cloth) = 0.9710 - 0 = 0.9710.
For Student: No 1, 1 (count 2). I(1, 1) = 1, E(student) = 2/2(1) = 1, Gain(student) = 0.9710 - 1 = -0.0290.

AF(color_cloth) = (|0-1| + |1-0|) / 2 = 1,  AF(student) = |1-1| / 1 = 0
V(color_cloth) = 1 / 1 = 1,  V(student) = 0

Gain'(color_cloth) = 0.9710 x 1  = 0.9710
Gain'(student)     = -0.0290 x 0 = 0

Color_cloth has the highest gain value, so it becomes the node under this arc; its red branch leads to yes and its white branch leads to no.

<30, income = high dataset (color_cloth, student, buys_computer):
Yellow, no, no
The class label is the same for every tuple, hence this branch becomes a leaf node (buys_computer = no).

<30, income = low dataset (color_cloth, student, buys_computer):
White, yes, no
Yellow, yes, no
The class label is the same for every tuple, hence this branch becomes a leaf node (buys_computer = no).

<30, income = medium dataset (color_cloth, student, buys_computer):
Yellow, no, yes
The class label is the same for every tuple, hence this branch becomes a leaf node (buys_computer = yes).

(Figure: Construction of the improved ID3 decision tree.)
4. Requirement Analysis or Requirements Elicitation

REQUIREMENT SPECIFICATION:

A requirement is a feature that the system must have, or a constraint that it must satisfy, to be accepted by the client. Requirements engineering aims at defining the requirements of the system under construction. It includes two main activities: requirements elicitation, which results in a specification of the system that the client understands, and analysis, which results in an analysis model that the developers can unambiguously interpret. A requirement is a statement about what the proposed system will do. Requirements can be divided into two major categories: functional requirements and non-functional requirements.

4.1 Functional Requirements:

Functional requirements describe the interactions between the system and its environment independently of its implementation. The environment includes the user and any other external system with which the system interacts.

4.2 Non-Functional Requirements:

Non-functional requirements describe aspects of the system that are not directly related to the functional behaviour of the system. They include a broad variety of requirements that apply to many different aspects of the system, from usability to performance.

4.2.1 Software Requirements:

Operating System : Windows 2000/XP
Language         : JDK 1.5
IDE              : Eclipse
Documentation    : MS-Word
Designing        : Rational Rose
4.2.2 Hardware Requirements:

CPU           : Pentium IV
RAM           : 512 MB
Hard Disk     : 40 GB
Input devices : Standard keyboard and mouse
Output device : VGA and high-resolution monitor