1. INTRODUCTION

1.1 Project Overview

Data mining:
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

The most commonly used techniques in data mining are:
- Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
- Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
- Nearest neighbor method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: the extraction of useful if-then rules from data based on statistical significance.
2. DECISION TREES

Decision trees have become one of the most powerful and popular approaches in knowledge discovery and data mining, the science and technology of exploring large and complex bodies of data in order to discover useful patterns. The area is of great importance because it enables modeling and knowledge extraction from the abundance of data available. Both theoreticians and practitioners are continually seeking techniques to make the process more efficient, cost-effective and accurate. Decision trees, originally implemented in decision theory and statistics, are highly effective tools in other areas such as data mining, text mining, information extraction, machine learning, and pattern recognition. Decision trees offer many benefits for data mining:
- Self-explanatory and easy to follow when compacted
- Able to handle a variety of input data: nominal, numeric and textual
- Able to process datasets that may have errors or missing values
- High predictive performance for a relatively small computational effort
- Available in many data mining packages over a variety of platforms
- Useful for various tasks, such as classification, regression, clustering and feature selection

ADVANTAGES AND DISADVANTAGES OF DECISION TREES

Advantages
- Simple to understand and interpret.
- Requires little data preparation.
- Able to handle both numerical and categorical data.
- Uses a white box model.
- Performs well with large data in a short time.

Disadvantages
- The output attribute must be categorical.
- Limited to one output attribute.
1.2 Software Process Model

The Waterfall Model is one of the most widely used software development processes. It is also called the "linear sequential model" or the "classic life cycle"; in the form used here it is the iterative waterfall model. It is widely used in commercial development projects. It is called so because we move to the next phase (step) after getting input from the previous phase, like in a waterfall, where water flows down from the upper steps.

In the iterative waterfall model the software development process is divided into five phases:
a) SRS (Software Requirement Specification)
b) System Design and Software Design
c) Implementation and Unit Testing
d) Integration and System Testing
e) Operation and Maintenance

[Figure: Iterative Waterfall Model with its stages]

Let us discuss all these stages of the waterfall model in detail.

Software Requirements Specification:
This is the most crucial phase for the whole project; here the project team, along with the customer, makes a detailed list of user requirements. The project team chalks out the functionality and limitations (if there are any) of the software they are developing, in detail. The document which contains all this information is called the SRS, and it clearly and unambiguously indicates the requirements. A small amount of top-level analysis and design is also documented. This document is verified and endorsed by the customer before starting the project. The SRS serves as the input for further phases.
System Design and Software Design:
Using the SRS as input, system design is done. System design includes the design of both software and hardware, i.e. the functionality of hardware and software is separated out. After this separation, the design of the software modules is done. The design process translates requirements into a representation of the software that can be assessed for quality before generation of code begins. At the same time a test plan is prepared; the test plan describes the various tests which will be carried out on the system after completion of development.

Implementation and Unit Testing:
Now that we have the system design, code generation begins. Code generation is the conversion of the design into machine-readable form. If the design of the software and system is done well, code generation can be done easily. Software modules are now further divided into units. A unit is a logically separable part of the software. Testing of units can be done separately. In this phase unit testing is done by the developers themselves, to ensure that there are no defects.

Integration and System Testing:
Now the units of the software are integrated together and a system is built. So we have a complete software at hand, which is tested to check whether it meets the functional and performance requirements of the customer. Testing is done, as per the steps defined in the test plan, to ensure that defined input produces actual results which agree with the required results. A test report is generated which contains the test results.

Operation and Maintenance:
Now that we have the completed, tested software, we deliver it to the client. Feedback is taken and any changes, if required, are made in this phase. This phase goes on till the software is retired.

1.4 Roles and Responsibilities:
All of the team members did the job collaboratively. Each of us took part in every task; we shared the work and worked as a team rather than as individuals.

1.5 Tools and Techniques:

ABOUT ECLIPSE
Eclipse is a universal platform for integrating development tools, with an open, extensible architecture based on plug-ins. Eclipse provides a number of aids that make writing Java code much quicker and easier than using a text editor. This means that you can spend more time learning Java, and less time typing and looking up documentation. The Eclipse debugger and scrapbook allow you to look inside the execution of the Java code. This allows you to "see" objects and to understand how Java is working behind the scenes. Eclipse provides full support for agile software development practices such as test-driven development and refactoring.

2. Existing System

2.1 Description of Existing System:
ID3 is a mathematical algorithm for building a decision tree, invented by J. Ross Quinlan in 1979. It uses information theory, introduced by Shannon in 1948. It builds the tree from the top down, with no backtracking, employing a top-down greedy search through the space of possible decision trees. It is greedy because there is no backtracking: it picks the highest value first. Information gain is used to select the most useful attribute for classification, i.e. the attribute that is most useful for classifying the examples (the attribute with the highest information gain).

The Iterative Dichotomiser 3 (ID3) algorithm is a decision tree learning algorithm. The name is apt in that it creates decision trees for "dichotomizing" data instances, classifying them discretely through branching nodes until a classification "bucket" (a leaf node) is reached. By using ID3 and other machine-learning algorithms from artificial intelligence, expert systems can engage in tasks usually done by human experts, such as doctors diagnosing diseases by examining various symptoms (the attributes) of patients (the data instances) in a complex decision tree. Accurate decision trees are fundamental to data mining and databases.

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Decision tree learning is one of the most widely used and practical methods for inductive inference. The input data of ID3 is known as a set of "training" or "learning" data instances, which will be used by the algorithm to generate the decision tree. The machine is "learning" from this set of preliminary data.

The ID3 algorithm generates a decision tree by using a greedy search through the inputted sets of data instances so as to determine the nodes and the attributes they use for branching. The emerging tree is traversed in a top-down (root to leaf) manner through each of the nodes within the tree. This occurs recursively, much like standard tree-traversal strategies.
The traversal attempts to determine whether the decision "attribute" on which the branching is based at a particular emerging node is the most suitable branching attribute, using the inputted sets of data. One particular metric that can be used to decide whether a branching attribute is adequate is INFORMATION GAIN, which is based on ENTROPY.

ID3 uses entropy to determine whether, based on the inputted set of data, the selected branching attribute for a particular node of the emerging tree is adequate. Specifically, the attribute that results in the greatest reduction of entropy over the learning sets is the best.

GOAL: Find a way to optimally classify a learning set, such that the resulting decision tree is not too deep and the number of branches (internal nodes) of the tree is minimized.

SOLUTION: The more entropy (the measure of impurity in a collection of data sets) in a system, the more branches and depth the tree will have. FIND entropy-reducing attributes in the learning sets and use them for branching.

Information gain measures the expected reduction in entropy: the higher the information gain, the greater the expected reduction in entropy. Entropy, the measure of non-homogeneousness within a set of learning instances, can be calculated in a straightforward manner.

CALCULATION AND CONSTRUCTION OF ID3 ALGORITHM

STEP 1: Take the dataset and calculate the count of each unique attribute value of the class label.
STEP 2: Calculate the amount of information for the class label, using the formula for I(s1, s2) given below.
STEP 3: Calculate the count of each unique attribute value against the unique values of the class label.
STEP 4: Calculate the expected information for each unique value of each attribute, using the formula for I given below.
STEP 5: Calculate the entropy value for each attribute, taking the total of the expected information values over all unique values of the attribute, using the formula for E(A) given below.
STEP 6: Calculate the gain value from the amount of information and the entropy value, using the formula for Gain(A) given below.
STEP 7: Construct the decision tree based on the decision tree construction algorithm given later.
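For reference, the quantities used in the steps above are, in the document's own notation, the standard two-class ID3 definitions:

    I(s1, s2) = -(s1/s) log2(s1/s) - (s2/s) log2(s2/s),  where s = s1 + s2

    E(A) = sum over the values v_j of attribute A of ((s1j + s2j)/s) * I(s1j, s2j)

    Gain(A) = I(s1, s2) - E(A)

Here s1 and s2 are the counts of the two class labels in the dataset, and s1j, s2j are those counts restricted to the tuples where A takes its j-th value. These definitions are consistent with the worked example that follows.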
EXPLANATION WITH EXAMPLE FOR ID3 DECISION TREE

Sample dataset for buys_computer (attributes: age, colour_cloth, income, student; class label: buys_computer):

age     colour_cloth  income   student  buys_computer
>40     red           high     no       no
<30     yellow        high     no       no
30-40   blue          high     no       yes
>40     red           medium   no       yes
<30     white         low      yes      no
>40     red           low      yes      no
30-40   blue          low      yes      yes
<30     yellow        medium   no       yes
<30     yellow        low      yes      no
>40     white         medium   no       no

Attribute names: age, colour_cloth, income, student, buys_computer
Unique values of the age attribute: >40, <30, 30-40
Class labels of the dataset: yes, no
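Before walking through the hand calculation, the following minimal Java sketch shows how the quantities defined above can be computed for this dataset. It is only an illustration of the calculation; the class and method names are not taken from the original project.

```java
import java.util.*;

/** Minimal sketch of the ID3 information/entropy/gain calculation for the buys_computer example. */
public class Id3GainDemo {

    // Each row: age, colour_cloth, income, student, buys_computer (class label last).
    static final String[][] DATA = {
        {">40", "red", "high", "no", "no"},      {"<30", "yellow", "high", "no", "no"},
        {"30-40", "blue", "high", "no", "yes"},  {">40", "red", "medium", "no", "yes"},
        {"<30", "white", "low", "yes", "no"},    {">40", "red", "low", "yes", "no"},
        {"30-40", "blue", "low", "yes", "yes"},  {"<30", "yellow", "medium", "no", "yes"},
        {"<30", "yellow", "low", "yes", "no"},   {">40", "white", "medium", "no", "no"}
    };
    static final String[] ATTRIBUTES = {"age", "colour_cloth", "income", "student"};

    /** I(s1, s2): amount of information of a list of class labels. */
    static double info(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double i = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            i -= p * Math.log(p) / Math.log(2);
        }
        return i;
    }

    /** E(A): expected information after partitioning the rows on attribute index a. */
    static double expectedInfo(List<String[]> rows, int a) {
        Map<String, List<String>> parts = new HashMap<>();
        for (String[] r : rows)
            parts.computeIfAbsent(r[a], k -> new ArrayList<>()).add(r[r.length - 1]);
        double e = 0.0;
        for (List<String> p : parts.values())
            e += (double) p.size() / rows.size() * info(p);
        return e;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(DATA);
        List<String> labels = new ArrayList<>();
        for (String[] r : rows) labels.add(r[r.length - 1]);

        double i = info(labels);                       // I(6, 4) = 0.971
        System.out.printf("I(s1, s2) = %.4f%n", i);
        for (int a = 0; a < ATTRIBUTES.length; a++) {  // Gain(A) = I(s1, s2) - E(A)
            double e = expectedInfo(rows, a);
            System.out.printf("%-13s E = %.4f  Gain = %.4f%n", ATTRIBUTES[a], e, i - e);
        }
    }
}
```

Running it reproduces the values derived by hand below: Gain(age) is about 0.322, Gain(colour_cloth) about 0.420, Gain(income) about 0.095 and Gain(student) about 0.046.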
Step 1: For the given dataset, first calculate the amount of information. For that we need the counts of the class labels in the dataset, which are no = 6 and yes = 4. Then, using the formula, the amount of information is:

I(6, 4) = -(6/10) log2(6/10) - (4/10) log2(4/10) = 0.4422 + 0.5288 = 0.9710

This quantity is called the amount of information and is a single value for the whole dataset.

Step 2: Take each attribute with its unique values; the expected information, entropy and gain values are then calculated by the formulas.

Take the age attribute and separate its values in comparison with the class labels:

        no   yes   count
>40     3    1     4
<30     3    1     4
30-40   0    2     2
Total              10

This is the table extracted from the data, based on the class labels of buys_computer.

Now calculate the expected information for the unique attribute values >40, <30 and 30-40:
>40:   I(3, 1) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
<30:   I(3, 1) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
30-40: I(0, 2) = -(2/2) log2(2/2) = 0

Now calculate the entropy value for the age attribute:
E(age) = (4/10)(0.8113) + (4/10)(0.8113) + (2/10)(0) = 0.3245 + 0.3245 = 0.6490

Step 3: Calculate the gain value for the age attribute from the amount of information and the entropy value:
Gain(age) = I(s1, s2) - E(age) = 0.9710 - 0.6490 = 0.3220

Second attribute: for colour_cloth, extract the data based on the class labels no and yes:

        no   yes   count
Red     2    1     3
Yellow  2    1     3
Blue    0    2     2
White   2    0     2
Total              10

Calculate the expected information for colour_cloth:
Red:    I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
Yellow: I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
Blue:   I(0, 2) = -(2/2) log2(2/2) = 0
White:  I(2, 0) = -(2/2) log2(2/2) = 0

Now calculate the entropy value for the attribute colour_cloth:
E(colour_cloth) = (3/10)(0.9183) + (3/10)(0.9183) + (2/10)(0) + (2/10)(0) = 0.5510

Gain value for colour_cloth:
Gain(colour_cloth) = I(s1, s2) - E(colour_cloth) = 0.9710 - 0.5510 = 0.4200

Third attribute: for income, extract the data based on the class labels no and yes:

        no   yes   count
High    2    1     3
Medium  1    2     3
Low     3    1     4
Total              10

Calculate the expected information for income:
High:   I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
Medium: I(1, 2) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
Low:    I(3, 1) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113

Now calculate the entropy value for the attribute income:
E(income) = (3/10)(0.9183) + (3/10)(0.9183) + (4/10)(0.8113) = 0.8755

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0.8755 = 0.0955

Fourth attribute: for student, extract the data based on the class labels no and yes:

        no   yes   count
No      3    3     6
Yes     3    1     4
Total              10

Calculate the expected information for student:
No:  I(3, 3) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Yes: I(3, 1) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113

Now calculate the entropy value for the attribute student:
E(student) = (6/10)(1) + (4/10)(0.8113) = 0.9245

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0.9245 = 0.0465

The important point is to compare all the gain values that have been calculated:
Gain(age) = 0.3220
Gain(colour_cloth) = 0.4200  <- highest gain value
Gain(income) = 0.0955
Gain(student) = 0.0465
Colour_cloth has the highest gain value of the four attributes, so as per the algorithm it becomes the root node of the decision tree. Its unique attribute values (red, blue, white, yellow) become the arcs. For each value, the rows of the dataset that carry that value are extracted and form a new table; for example, the rows where colour_cloth = red form the "red" table. For each such table the gain values are calculated again, and from them the internal nodes and leaf nodes below the root are decided. (Note: here the amount of information of the full dataset, I(s1, s2) = 0.9710, is reused at every node; standard ID3 recomputes it from the node's own class counts, but this does not change which attribute has the highest gain at any node.)

In the resulting decision tree, colour_cloth is the root node and the branches correspond to the attribute values of colour_cloth. Based on each attribute value, the data is extracted while neglecting the colour_cloth attribute. For the first arc we extract the data for the attribute value red, take those tuples, and calculate the expected information, entropy value and finally the gain value. This is continued for the selection of the left node, right node and leaf nodes.

Red reference tuples (age, income, student, buys_computer):
>40, high, no, no
>40, medium, no, yes
>40, low, yes, no

For age, extract the data based on the class labels no and yes:

       no   yes   count
>40    2    1     3
Total             3

Calculate the expected information for age:
>40: I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183

Now calculate the entropy value for the attribute age:
E(age) = (3/3)(0.9183) = 0.9183

Gain value for age:
Gain(age) = I(s1, s2) - E(age) = 0.9710 - 0.9183 = 0.0527

For income, extract the data based on the class labels no and yes:

        no   yes   count
High    1    0     1
Medium  0    1     1
Low     1    0     1
Total              3

Calculate the expected information for income:
High:   I(1, 0) = -(1/1) log2(1/1) = 0
Medium: I(0, 1) = -(1/1) log2(1/1) = 0
Low:    I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/3)(0) + (1/3)(0) + (1/3)(0) = 0

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     1    1     2
Yes    1    0     1
Total             3

Calculate the expected information for student:
No:  I(1, 1) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Yes: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute student:
E(student) = (2/3)(1) + (1/3)(0) = 0.6667

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0.6667 = 0.3043

So for the red table the gain values of age, income and student are 0.0527, 0.9710 and 0.3043. Income has the highest gain value, so the left node (the child node under the red branch) is income.

Blue reference tuples (age, income, student, buys_computer):
30-40, high, no, yes
30-40, low, yes, yes

For age, extract the data based on the class labels no and yes:

        no   yes   count
30-40   0    2     2
Total              2

Calculate the expected information for age:
30-40: I(0, 2) = -(2/2) log2(2/2) = 0
Now calculate the entropy value for the attribute age:
E(age) = (2/2)(0) = 0

Gain value for age:
Gain(age) = I(s1, s2) - E(age) = 0.9710 - 0 = 0.9710

For income, extract the data based on the class labels no and yes:

       no   yes   count
High   0    1     1
Low    0    1     1
Total             2

Calculate the expected information for income:
High: I(0, 1) = -(1/1) log2(1/1) = 0
Low:  I(0, 1) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/2)(0) + (1/2)(0) = 0

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     0    1     1
Yes    0    1     1
Total             2

Calculate the expected information for student:
No:  I(0, 1) = -(1/1) log2(1/1) = 0
Yes: I(0, 1) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute student:
E(student) = (1/2)(0) + (1/2)(0) = 0

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0 = 0.9710

The gain values of age, income and student are 0.9710, 0.9710 and 0.9710. Here all the values are the same, so the node ends here (both blue tuples have the class label yes, so the blue branch becomes a leaf labelled yes).

White reference tuples (age, income, student, buys_computer):
<30, low, yes, no
>40, medium, no, no

For age, extract the data based on the class labels no and yes:

       no   yes   count
<30    1    0     1
>40    1    0     1
Total             2

Calculate the expected information for age:
<30: I(1, 0) = -(1/1) log2(1/1) = 0
>40: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute age:
E(age) = (1/2)(0) + (1/2)(0) = 0

Gain value for age:
Gain(age) = I(s1, s2) - E(age) = 0.9710 - 0 = 0.9710

For income, extract the data based on the class labels no and yes:

        no   yes   count
Low     1    0     1
Medium  1    0     1
Total              2

Calculate the expected information for income:
Low:    I(1, 0) = -(1/1) log2(1/1) = 0
Medium: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/2)(0) + (1/2)(0) = 0

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     1    0     1
Yes    1    0     1
Total             2

Calculate the expected information for student:
No:  I(1, 0) = -(1/1) log2(1/1) = 0
Yes: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute student:
E(student) = (1/2)(0) + (1/2)(0) = 0

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0 = 0.9710

The gain values of age, income and student are 0.9710, 0.9710 and 0.9710. Here all the values are the same, so the node ends here (both white tuples have the class label no, so the white branch becomes a leaf labelled no).
Yellow reference tuples (age, income, student, buys_computer):
<30, high, no, no
<30, medium, no, yes
<30, low, yes, no

For age, extract the data based on the class labels no and yes:

       no   yes   count
<30    2    1     3
Total             3

Calculate the expected information for age:
<30: I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183

Now calculate the entropy value for the attribute age:
E(age) = (3/3)(0.9183) = 0.9183

Gain value for age:
Gain(age) = I(s1, s2) - E(age) = 0.9710 - 0.9183 = 0.0527

For income, extract the data based on the class labels no and yes:

        no   yes   count
High    1    0     1
Medium  0    1     1
Low     1    0     1
Total              3

Calculate the expected information for income:
High:   I(1, 0) = -(1/1) log2(1/1) = 0
Medium: I(0, 1) = -(1/1) log2(1/1) = 0
Low:    I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/3)(0) + (1/3)(0) + (1/3)(0) = 0

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     1    1     2
Yes    1    0     1
Total             3

Calculate the expected information for student:
No:  I(1, 1) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Yes: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute student:
E(student) = (2/3)(1) + (1/3)(0) = 0.6667

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0.6667 = 0.3043

The gain values of age, income and student are 0.0527, 0.9710 and 0.3043. Here income has the highest gain value, so the right node (the child node under the yellow branch) is income.

CONSTRUCTION OF ID3 DECISION TREE ALGORITHM
STEP 1: Calculate the gains of the attributes.
STEP 2: Make the attribute with the highest gain the root node.
STEP 3: Build the datasets for each unique value of the current node's attribute, neglecting that attribute in the extracted datasets; the current node is the parent node.
STEP 4: Take each dataset from these datasets and go to step 5.
STEP 5: Take the current dataset's attributes. If the dataset's class label takes only one value, go to step 6, else go to step 7.
STEP 6: Make the current dataset a leaf node of the parent node.
STEP 7: Calculate the attribute gains.
STEP 8: If all attribute gains are the same, go to step 6; otherwise make the attribute with the highest gain a node, consider it the current node, and go to step 3.
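Below is a compact Java sketch of the recursive construction described in these steps. It is only an illustration under the conventions used in this example: the gain is computed against the fixed amount of information of the full dataset, leaves take the majority class label, and all class and method names are illustrative rather than taken from the original project.

```java
import java.util.*;

/** Sketch of the recursive ID3 construction (STEP 1 - STEP 8 above). */
public class Id3TreeSketch {

    static class Node {
        String attribute;                              // branching attribute for an internal node
        String label;                                  // class label for a leaf node
        Map<String, Node> children = new LinkedHashMap<>();
    }

    /** Amount of information of the class labels (last column) of the given rows. */
    static double info(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[r.length - 1], 1, Integer::sum);
        double i = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            i -= p * Math.log(p) / Math.log(2);
        }
        return i;
    }

    /** Gain of attribute index a, using the fixed I(s1, s2) of the full dataset as in the example. */
    static double gain(List<String[]> rows, int a, double rootInfo) {
        double e = 0.0;
        for (List<String[]> p : partition(rows, a).values())
            e += (double) p.size() / rows.size() * info(p);
        return rootInfo - e;
    }

    static Map<String, List<String[]>> partition(List<String[]> rows, int a) {
        Map<String, List<String[]>> parts = new LinkedHashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[a], k -> new ArrayList<>()).add(r);
        return parts;
    }

    /** STEP 1 - STEP 8: pick the highest-gain attribute, branch on its values, recurse. */
    static Node build(List<String[]> rows, List<Integer> attrs, String[] names, double rootInfo) {
        Node node = new Node();
        // STEP 5/6 and STEP 8: stop when the node is pure, no attributes remain,
        // or all remaining attribute gains are equal.
        if (attrs.isEmpty() || info(rows) == 0.0 || allGainsEqual(rows, attrs, rootInfo)) {
            node.label = majorityLabel(rows);
            return node;
        }
        int best = attrs.get(0);                        // STEP 2/8: attribute with the highest gain
        for (int a : attrs) if (gain(rows, a, rootInfo) > gain(rows, best, rootInfo)) best = a;
        node.attribute = names[best];

        List<Integer> rest = new ArrayList<>(attrs);    // STEP 3: neglect the chosen attribute
        rest.remove(Integer.valueOf(best));
        for (Map.Entry<String, List<String[]>> e : partition(rows, best).entrySet())
            node.children.put(e.getKey(), build(e.getValue(), rest, names, rootInfo));
        return node;
    }

    static boolean allGainsEqual(List<String[]> rows, List<Integer> attrs, double rootInfo) {
        double first = gain(rows, attrs.get(0), rootInfo);
        for (int a : attrs) if (Math.abs(gain(rows, a, rootInfo) - first) > 1e-9) return false;
        return true;
    }

    static String majorityLabel(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[r.length - 1], 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

Calling build with the full buys_computer dataset, all four attribute indices and rootInfo = 0.9710 reproduces the tree derived by hand above: colour_cloth at the root, an income node under the red and yellow branches, and leaf nodes under the blue and white branches.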
  • 20. attributes. In other words, it is not so important in real situation for those attributes selected byID3 algorithm to be judged firstly according to make value of entropy minimal. Besides, ID3algorithm selects attributes in terms of information entropy which is computed based onprobabilities, while probability method is only suitable for solving stochastic problems. Aimingat these shortcomings for ID3 algorithm, some improvements on ID3 algorithm are made and aimproved decision tree algorithm is introduced.3. Proposed System3.1Description of Proposed System: The principle of selecting attribute A as test attribute for ID3 is to make E (A) ofattribute A, the smallest. Study suggest that there exists a problem with this method, this meansthat it often biased to select attributes with more taken values however, which are not necessarilythe best attributes. In other words, it is not so important in real situation for those attributesselected by ID3 algorithm to be judged firstly according to make value of entropy minimal.Besides, ID3 algorithm selects attributes in terms of information entropy which is computedbased on probabilities, while probability method is only suitable for solving stochastic problems.Aiming at these shortcomings for ID3 algorithm, some improvements on ID3 algorithm are madeand a improved decision tree algorithm is introduced.3.2Problem Statement:Decision tree is an important method for both induction research and data mining, which ismainly used for model classification and prediction. ID3 algorithm is the most widely usedalgorithm in the decision tree. Through illustrating on the basic ideas of decision tree in datamining, in my paper, the shortcoming of ID3’s inclining to choose attributes with many values,however, which are not necessarily the best attributes. In other words, it is not so important inreal situation for those attributes selected by ID3 algorithm to be judged firstly according tomake value of entropy minimal. Besides, ID3 algorithm selects attributes in terms of informationentropy which is computed based on probabilities, while probability method is only suitable forsolving stochastic problems. Aiming at these shortcomings for ID3 algorithm, a new decision
Aiming at these shortcomings, a new decision tree algorithm combining ID3 and an Association Function is presented. The experimental results show that the proposed algorithm can overcome ID3's shortcoming effectively and obtain more reasonable and effective rules.

3.3 Working of Proposed System:

CALCULATION OF IMPROVED ID3 ALGORITHM

STEP 1: Take the dataset and calculate the count of each unique attribute value of the class label.
STEP 2: Calculate the amount of information for the class label, I(s1, s2), as in the ID3 algorithm.
STEP 3: Calculate the count of each unique attribute value against the unique values of the class label.
STEP 4: Calculate the expected information for each unique value of each attribute, as in the ID3 algorithm.
STEP 5: Calculate the entropy value E(A) for each attribute, taking the total of the expected information values over all unique values of the attribute.
STEP 6: Calculate the gain value from the amount of information and the entropy value, Gain(A) = I(s1, s2) - E(A).
STEP 7: After calculating the gain values, calculate the importance of each attribute using the association (correlation) function given below. The formula given here applies only when the dataset has two class labels.
STEP 8: Calculate the improved ID3 gain of each attribute from the ID3 gain and the normalized association value: improved gain = Gain(A) x V(A).

Up to the gain values the calculation is the same as in the ID3 algorithm. The improved ID3 algorithm then calculates the importance of each attribute and multiplies the normalized importance with the ID3 gain values. Using the same datasets as in ID3, the ASSOCIATION FUNCTION is applied on the retrieved dataset:

AF(A) = (1/m) * sum over the m unique values of A of |n_no(j) - n_yes(j)|

where m is the number of unique values of attribute A, and n_no(j), n_yes(j) are the counts of the class labels no and yes among the tuples taking the j-th value.

Example: take the attribute values of age, colour_cloth, income and student.

Age data:
        no   yes   count
>40     3    1     4
<30     3    1     4
30-40   0    2     2
Total              10

Calculation for all attributes:
AF(age) = (|3-1| + |3-1| + |0-2|) / 3 = 6/3 = 2
AF(colour_cloth) = (|2-1| + |2-1| + |0-2| + |2-0|) / 4 = 6/4 = 1.5
AF(income) = (|2-1| + |1-2| + |3-1|) / 3 = 4/3 = 1.3333
AF(student) = (|3-3| + |3-1|) / 2 = 2/2 = 1
THE NORMALIZATION RELATION DEGREE FORMULA

V(k) = AF(k) / (AF(age) + AF(colour_cloth) + AF(income) + AF(student)), where the denominator is 2 + 1.5 + 1.3333 + 1 = 5.8333:

V(age) = 2 / 5.8333 = 0.3429
V(colour_cloth) = 1.5 / 5.8333 = 0.2571
V(income) = 1.3333 / 5.8333 = 0.2286
V(student) = 1 / 5.8333 = 0.1714

NOW CALCULATE THE IMPROVED ID3 GAIN VALUE BY USING THE FORMULA Gain(k) x V(k):

Gain(age) = 0.3220 x 0.3429 = 0.1104   <- highest gain value in the improved ID3 decision tree, so age is the root node
Gain(colour_cloth) = 0.4200 x 0.2571 = 0.1080
Gain(income) = 0.0955 x 0.2286 = 0.0218
Gain(student) = 0.0465 x 0.1714 = 0.0080

So the AGE attribute becomes the root node, with <30, 30-40 and >40 as its unique attribute values. Now extract the <30, 30-40 and >40 tuples from the dataset over the attributes colour_cloth, income, student and buys_computer; the age attribute is neglected because it has become the root node.
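The following Java sketch illustrates this improved gain calculation. The association function and normalization are written as reconstructed from the worked values in this section, the DATA array is reused from the earlier Id3GainDemo sketch, and all names are illustrative rather than taken from the original project.

```java
import java.util.*;

/** Sketch of the improved ID3 gain: Gain'(A) = Gain(A) x V(A), with V from the association function. */
public class ImprovedId3Sketch {

    /** AF(A) = (1/m) * sum_j |count(no, value_j) - count(yes, value_j)| for a two-class dataset. */
    static double associationFunction(List<String[]> rows, int a) {
        Map<String, int[]> counts = new LinkedHashMap<>();      // attribute value -> {no count, yes count}
        for (String[] r : rows) {
            int[] c = counts.computeIfAbsent(r[a], k -> new int[2]);
            c["yes".equals(r[r.length - 1]) ? 1 : 0]++;
        }
        double sum = 0.0;
        for (int[] c : counts.values()) sum += Math.abs(c[0] - c[1]);
        return sum / counts.size();
    }

    /** Normalized relation degree V(A) and improved gain Gain(A) * V(A) for each attribute index. */
    static Map<Integer, Double> improvedGains(List<String[]> rows, List<Integer> attrs,
                                              Map<Integer, Double> id3Gains) {
        Map<Integer, Double> af = new LinkedHashMap<>();
        double total = 0.0;
        for (int a : attrs) {
            double v = associationFunction(rows, a);
            af.put(a, v);
            total += v;
        }
        Map<Integer, Double> improved = new LinkedHashMap<>();
        for (int a : attrs) improved.put(a, id3Gains.get(a) * af.get(a) / total);
        return improved;
    }

    public static void main(String[] args) {
        // ID3 gains of the full buys_computer dataset, as computed earlier by hand.
        Map<Integer, Double> id3Gains = new LinkedHashMap<>();
        id3Gains.put(0, 0.3220);   // age
        id3Gains.put(1, 0.4200);   // colour_cloth
        id3Gains.put(2, 0.0955);   // income
        id3Gains.put(3, 0.0465);   // student

        List<String[]> rows = Arrays.asList(Id3GainDemo.DATA);  // dataset from the earlier sketch
        Map<Integer, Double> improved = improvedGains(rows, Arrays.asList(0, 1, 2, 3), id3Gains);
        improved.forEach((a, g) ->
            System.out.printf("%-13s improved gain = %.4f%n", Id3GainDemo.ATTRIBUTES[a], g));
    }
}
```

With the hand-computed gains above this prints approximately 0.110 for age, 0.108 for colour_cloth, 0.022 for income and 0.008 for student, which matches the selection of age as the root node.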
<30 dataset (colour_cloth, income, student, buys_computer):
Yellow, high, no, no
White, low, yes, no
Yellow, medium, no, yes
Yellow, low, yes, no

For colour_cloth, extract the data based on the class labels no and yes:

        no   yes   count
Yellow  2    1     3
White   1    0     1
Total              4

Calculate the expected information for colour_cloth:
Yellow: I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
White:  I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute colour_cloth:
E(colour_cloth) = (3/4)(0.9183) + (1/4)(0) = 0.6887

Gain value for colour_cloth:
Gain(colour_cloth) = I(s1, s2) - E(colour_cloth) = 0.9710 - 0.6887 = 0.2823

For income, extract the data based on the class labels no and yes:

        no   yes   count
High    1    0     1
Medium  0    1     1
Low     2    0     2
Total              4

Calculate the expected information for income:
High:   I(1, 0) = -(1/1) log2(1/1) = 0
Medium: I(0, 1) = -(1/1) log2(1/1) = 0
Low:    I(2, 0) = -(2/2) log2(2/2) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/4)(0) + (1/4)(0) + (2/4)(0) = 0

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     1    1     2
Yes    2    0     2
Total             4

Calculate the expected information for student:
No:  I(1, 1) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Yes: I(2, 0) = -(2/2) log2(2/2) = 0

Now calculate the entropy value for the attribute student:
E(student) = (2/4)(1) + (2/4)(0) = 0.5

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0.5 = 0.4710

Association function and normalization for the <30 dataset:
AF(colour_cloth) = (|2-1| + |1-0|) / 2 = 1
AF(income) = (|1-0| + |0-1| + |2-0|) / 3 = 1.3333
AF(student) = (|1-1| + |2-0|) / 2 = 1

V(colour_cloth) = 1 / 3.3333 = 0.3
V(income) = 1.3333 / 3.3333 = 0.4
V(student) = 1 / 3.3333 = 0.3

Improved gain values:
Gain(colour_cloth) = 0.2823 x 0.3 = 0.0847
Gain(income) = 0.9710 x 0.4 = 0.3884
Gain(student) = 0.4710 x 0.3 = 0.1413

Income has the highest gain value, so income becomes the node under the <30 branch.

30-40 dataset (colour_cloth, income, student, buys_computer):
Blue, high, no, yes
Blue, low, yes, yes

For colour_cloth, extract the data based on the class labels no and yes:

       no   yes   count
Blue   0    2     2
Total             2

Calculate the expected information for colour_cloth:
Blue: I(0, 2) = -(2/2) log2(2/2) = 0

Now calculate the entropy value for the attribute colour_cloth:
E(colour_cloth) = (2/2)(0) = 0

Gain value for colour_cloth:
Gain(colour_cloth) = I(s1, s2) - E(colour_cloth) = 0.9710 - 0 = 0.9710

For income, extract the data based on the class labels no and yes:

       no   yes   count
High   0    1     1
Low    0    1     1
Total             2

Calculate the expected information for income:
High: I(0, 1) = -(1/1) log2(1/1) = 0
Low:  I(0, 1) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/2)(0) + (1/2)(0) = 0

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     0    1     1
Yes    0    1     1
Total             2

Calculate the expected information for student:
No:  I(0, 1) = -(1/1) log2(1/1) = 0
Yes: I(0, 1) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute student:
E(student) = (1/2)(0) + (1/2)(0) = 0

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0 = 0.9710

Association function and normalization for the 30-40 dataset:
AF(colour_cloth) = |0-2| / 1 = 2
AF(income) = (|0-1| + |0-1|) / 2 = 1
AF(student) = (|0-1| + |0-1|) / 2 = 1

V(colour_cloth) = 2 / 4 = 0.5
V(income) = 1 / 4 = 0.25
V(student) = 1 / 4 = 0.25

Improved gain values:
Gain(colour_cloth) = 0.9710 x 0.5 = 0.4855
Gain(income) = 0.9710 x 0.25 = 0.2428
Gain(student) = 0.9710 x 0.25 = 0.2428

Colour_cloth has the highest gain value, so it becomes the node under the 30-40 branch (both tuples have the class label yes, so this branch ends in yes leaves).

>40 dataset (colour_cloth, income, student, buys_computer):
Red, high, no, no
Red, low, yes, no
Red, medium, no, yes
White, medium, no, no

For colour_cloth, extract the data based on the class labels no and yes:

       no   yes   count
Red    2    1     3
White  1    0     1
Total             4

Calculate the expected information for colour_cloth:
Red:   I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
White: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute colour_cloth:
E(colour_cloth) = (3/4)(0.9183) + (1/4)(0) = 0.6887

Gain value for colour_cloth:
Gain(colour_cloth) = I(s1, s2) - E(colour_cloth) = 0.9710 - 0.6887 = 0.2823

For income, extract the data based on the class labels no and yes:

        no   yes   count
High    1    0     1
Medium  1    1     2
Low     1    0     1
Total              4

Calculate the expected information for income:
High:   I(1, 0) = -(1/1) log2(1/1) = 0
Medium: I(1, 1) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Low:    I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute income:
E(income) = (1/4)(0) + (2/4)(1) + (1/4)(0) = 0.5

Gain value for income:
Gain(income) = I(s1, s2) - E(income) = 0.9710 - 0.5 = 0.4710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     2    1     3
Yes    1    0     1
Total             4

Calculate the expected information for student:
No:  I(2, 1) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
Yes: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute student:
E(student) = (3/4)(0.9183) + (1/4)(0) = 0.6887

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 0.6887 = 0.2823

Association function and normalization for the >40 dataset:
AF(colour_cloth) = (|2-1| + |1-0|) / 2 = 1
AF(income) = (|1-0| + |1-1| + |1-0|) / 3 = 0.6667
AF(student) = (|2-1| + |1-0|) / 2 = 1

V(colour_cloth) = 1 / 2.6667 = 0.375
V(income) = 0.6667 / 2.6667 = 0.25
V(student) = 1 / 2.6667 = 0.375

Improved gain values:
Gain(colour_cloth) = 0.2823 x 0.375 = 0.1059
Gain(income) = 0.4710 x 0.25 = 0.1178
Gain(student) = 0.2823 x 0.375 = 0.1059

Income has the highest gain value, so income becomes the node under the >40 branch.
>40 income dataset for high (colour_cloth, student, buys_computer):
Red, no, no
Only one tuple remains and its class label is no, so this becomes a leaf node (no).

>40 income dataset for low (colour_cloth, student, buys_computer):
Red, yes, no
Only one tuple remains and its class label is no, so this becomes a leaf node (no).

>40 income dataset for medium (colour_cloth, student, buys_computer):
Red, no, yes
White, no, no

For colour_cloth, extract the data based on the class labels no and yes:

       no   yes   count
White  1    0     1
Red    0    1     1
Total             2

Calculate the expected information for colour_cloth:
Red:   I(0, 1) = -(1/1) log2(1/1) = 0
White: I(1, 0) = -(1/1) log2(1/1) = 0

Now calculate the entropy value for the attribute colour_cloth:
E(colour_cloth) = (1/2)(0) + (1/2)(0) = 0

Gain value for colour_cloth:
Gain(colour_cloth) = I(s1, s2) - E(colour_cloth) = 0.9710 - 0 = 0.9710

For student, extract the data based on the class labels no and yes:

       no   yes   count
No     1    1     2
Total             2

Calculate the expected information for student:
No: I(1, 1) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

Now calculate the entropy value for the attribute student:
E(student) = (2/2)(1) = 1

Gain value for student:
Gain(student) = I(s1, s2) - E(student) = 0.9710 - 1 = -0.0290

Association function and normalization:
AF(colour_cloth) = (|0-1| + |1-0|) / 2 = 1
AF(student) = |1-1| / 1 = 0

V(colour_cloth) = 1 / 1 = 1
V(student) = 0 / 1 = 0

Improved gain values:
Gain(colour_cloth) = 0.9710 x 1 = 0.9710
Gain(student) = -0.0290 x 0 = 0

Colour_cloth has the highest gain value, so it becomes the node under the medium branch, with leaf nodes yes (red) and no (white).

<30 income dataset for high (colour_cloth, student, buys_computer):
Yellow, no, no
Only one tuple remains and its class label is no, so this becomes a leaf node (no).

<30 income dataset for low (colour_cloth, student, buys_computer):
White, yes, no
Yellow, yes, no
The class label is no for both tuples, so this becomes a leaf node (no).

<30 income dataset for medium (colour_cloth, student, buys_computer):
Yellow, no, yes
Only one tuple remains and its class label is yes, so this becomes a leaf node (yes).

Construction of the Improved ID3 decision tree:
[Figure: the improved ID3 decision tree, with age as the root node; the <30 and >40 branches split further on income, and the 30-40 branch splits on colour_cloth.]
4. Requirement Analysis or Requirements Elicitation

REQUIREMENT SPECIFICATION:
A requirement is a feature that the system must have or a constraint that it must satisfy to be accepted by the client. Requirement engineering aims at defining the requirements of the system under construction. Requirement engineering includes two main activities: requirements elicitation, which results in a specification of the system that the client understands, and analysis, which results in an analysis model that the developer can unambiguously interpret. A requirement is a statement about what the proposed system will do. Requirements can be divided into two major categories: functional requirements and non-functional requirements.

4.1 Functional Requirements:
Functional requirements describe the interactions between the system and its environment independent of its implementation. The environment includes the user and any other external system with which the system interacts.

4.2 Non-Functional Requirements:
Non-functional requirements describe aspects of the system that are not directly related to the functional behaviour of the system. Non-functional requirements include a broad variety of requirements that apply to many different aspects of the system, from usability to performance.

4.2.1 Software Requirements:
Operating System : Windows 2000/XP
IDE              : Eclipse
Language         : JDK 1.5
Documentation    : MS-Word
Designing        : Rational Rose
4.2.2 Hardware Requirements:
CPU           : Pentium IV
RAM           : 512 MB
Hard Disk     : 40 GB
Input device  : Standard keyboard and mouse
Output device : VGA and high-resolution monitor