
IT for Business Intelligence Term Paper



Prepared as part of the course requirements for the subject IT for Business Intelligence at Vinod Gupta School of Management, IIT Kharagpur. This paper discusses some of the data mining techniques using examples in the software WEKA.

Published in: Technology, Education


  1. 1. IT for Business Intelligence
     Term Paper on Data Mining Techniques
     Prepared By: Niloy Ghosh
     Roll No: 10BM60054
     Second Year, MBA
     Vinod Gupta School of Management (VGSOM)
     IIT Kharagpur
  2. 2. Introduction

     The purpose of this term paper is to demonstrate data mining techniques using the software tool WEKA. Data mining aims at transforming large amounts of data into meaningful patterns and rules. Deriving meaning from vast amounts of data has numerous business applications and is generating a tremendous amount of interest.

     Waikato Environment for Knowledge Analysis (WEKA) is free and open source software that can be used to mine data and generate useful information. WEKA's native input format is the Attribute-Relation File Format (ARFF), a flat file format in which the type of data being loaded is defined first, followed by the data itself.

     In this paper two techniques, Linear Regression and Decision Tree, are discussed with examples. The source of the data used to demonstrate the techniques is provided in the reference section.

     Technique I
     Linear Regression

     Linear regression is used to predict the value of an unknown dependent variable based on the values of a number of independent variables. In this example, the model tries to predict housing prices in the Boston area.

     Description of dataset

     The dataset contains details about housing in the Boston area. It contains 14 variables, defined as follows.

     1. CRIM: per capita crime rate by town
     2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
     3. INDUS: proportion of non-retail business acres per town
     4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
     5. NOX: nitric oxides concentration (parts per 10 million)
     6. RM: average number of rooms per dwelling
     7. AGE: proportion of owner-occupied units built prior to 1940
     8. DIS: weighted distances to five Boston employment centres
     9. RAD: index of accessibility to radial highways
     10. TAX: full-value property-tax rate per $10,000
     11. PTRATIO: pupil-teacher ratio by town
     12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
     13. LSTAT: percentage of lower status of the population
     14. MEDV: median value of owner-occupied homes in $1000s

     The objective is to predict the housing values (i.e. the variable MEDV) using Linear Regression.
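The ARFF layout described in the Introduction (attribute types first, then the data) can be sketched in a few lines of Python. This is only an illustration: the helper `to_arff` is a hypothetical name, only three of the fourteen housing attributes are declared, and the two data rows are invented values, not records from the dataset used in this paper.

```python
def to_arff(relation, attributes, rows):
    """Render a minimal ARFF document.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of allowed nominal values.
    rows: list of tuples, one value per declared attribute.
    """
    lines = [f"@relation {relation}", ""]
    # Header section: one @attribute declaration per column.
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(str(v) for v in kind) + "}"
        else:
            spec = kind
        lines.append(f"@attribute {name} {spec}")
    # Data section: comma-separated values in declaration order.
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical subset of the housing attributes with invented rows.
attributes = [("RM", "numeric"), ("CHAS", ["0", "1"]), ("MEDV", "numeric")]
rows = [(6.5, 0, 24.0), (5.9, 1, 19.5)]

arff_text = to_arff("housing", attributes, rows)
print(arff_text)
```

The header and data sections are kept strictly separate, which is what lets a loader know each column's type before it reads the first record.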
  3. 3. Output

     On running the model in WEKA, the following output was obtained.

     === Run information ===

     Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
     Relation:     housing
     Instances:    506
     Attributes:   14
                   CRIM
                   ZN
                   INDUS
                   CHAS
                   NOX
                   RM
                   AGE
                   DIS
                   RAD
                   TAX
                   PTRATIO
                   B
                   LSTAT
                   CLASS
     Test mode:    split 70.0% train, remainder test

     === Classifier model (full training set) ===
  4. 4. Linear Regression Model

     CLASS =

          -0.1084 * CRIM +
           0.0458 * ZN +
           2.7188 * CHAS +
         -17.3768 * NOX +
           3.8016 * RM +
          -1.4927 * DIS +
           0.2996 * RAD +
          -0.0118 * TAX +
          -0.9466 * PTRATIO +
           0.0093 * B +
          -0.5225 * LSTAT +
          36.342

     Time taken to build model: 0.05 seconds

     === Evaluation on test split ===
     === Summary ===

     Correlation coefficient                  0.8547
     Mean absolute error                      3.3219
     Root mean squared error                  4.6107
     Relative absolute error                 52.2759 %
     Root relative squared error             51.9447 %
     Total Number of Instances              152

     The experiment was conducted using a 70-30 split of the data (70% used to build the model, 30% used to test it).

     Interpretation

     The correlation coefficient between the predicted and actual values is about 0.85, so the model is acceptable. The error values are quite high (a relative absolute error of about 52%), but other methods have yielded only slightly better results.

     The following conclusions can be made:

  • The proportion of non-retail business (INDUS) and the age of the buildings (AGE) do not appear in the fitted model, so they are not factors in the valuation.
  • As expected, crime rates, air pollution and (high) tax rates have a negative effect on the house value.
  • The proportion of lower status population has a negative effect. Thus, low income neighbourhoods have lower house values than affluent neighbourhoods.
  • Interestingly, the pupil-teacher ratio has a prominent negative effect. Educational facilities are evidently a big concern when looking for a home, and people are ready to pay more for areas with better educational facilities.
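As a rough illustration of how the fitted equation and the summary statistics above work, the sketch below applies the reported coefficients to one tract and computes the five error measures for a pair of actual/predicted lists. The coefficients and intercept are copied from the WEKA output; everything else is an assumption: the function names are hypothetical, the example tract and the three-value test lists are invented, and the relative errors use the mean of the supplied actual values as the baseline, which may differ from the baseline WEKA itself uses.

```python
import math

# Coefficients and intercept as reported in the WEKA output above.
COEFFICIENTS = {
    "CRIM": -0.1084, "ZN": 0.0458, "CHAS": 2.7188, "NOX": -17.3768,
    "RM": 3.8016, "DIS": -1.4927, "RAD": 0.2996, "TAX": -0.0118,
    "PTRATIO": -0.9466, "B": 0.0093, "LSTAT": -0.5225,
}
INTERCEPT = 36.342


def predict_medv(tract):
    """Predicted median home value (in $1000s) for one tract.

    INDUS and AGE are not part of the fitted model, so they are ignored
    even if present in the input dictionary.
    """
    return INTERCEPT + sum(w * tract[name] for name, w in COEFFICIENTS.items())


def evaluation_summary(actual, predicted):
    """Correlation, MAE, RMSE and the two relative errors (in percent)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: always predicting the mean of the actual values (assumption).
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_pct": 100 * abs_err / base_abs,
        "rrse_pct": 100 * math.sqrt(sq_err / base_sq),
    }


# Invented attribute values for one tract, for illustration only.
example_tract = {
    "CRIM": 0.00632, "ZN": 18.0, "INDUS": 2.31, "CHAS": 0, "NOX": 0.538,
    "RM": 6.575, "AGE": 65.2, "DIS": 4.09, "RAD": 1, "TAX": 296,
    "PTRATIO": 15.3, "B": 396.9, "LSTAT": 4.98,
}
example_prediction = predict_medv(example_tract)

# Invented actual/predicted values to exercise the metrics.
summary = evaluation_summary([10, 20, 30], [12, 18, 33])
print(round(example_prediction, 2), summary)
```

A relative absolute error near 52%, as in the output above, would mean the model's absolute errors are roughly half those of a baseline that always predicts the mean.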
  5. 5. Technique II
     Decision Tree

     In data mining, a decision tree is a predictive model which maps observations about an item to conclusions about the item's target value. In these trees, also known as classification trees, the leaves represent class labels and the branches represent conjunctions of features that lead to those class labels.

     The WEKA classifier used in the example is J48. The model tries to make a diagnosis of urinary system disease.

     Description of dataset

     The dataset contains the following variables.

     1. Temperature of patient
     2. Occurrence of nausea { yes, no }
     3. Lumbar pain { yes, no }
     4. Urine pushing (continuous need for urination) { yes, no }
     5. Micturition pains { yes, no }
     6. Burning of urethra, itch, swelling of urethra outlet { yes, no }
     7. Decision: Inflammation of urinary bladder { yes, no }
     8. Decision: Nephritis of renal pelvis origin { yes, no }

     For the purpose of the demonstration, the variable ‘Nephritis of renal pelvis origin’ was first removed, and a decision tree was created for the prediction of inflammation of the urinary bladder.

     Next, the variable ‘Inflammation of urinary bladder’ was removed instead, and a new decision tree was created for the prediction of nephritis of renal pelvis origin.
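The preparation step described above, dropping one of the two decision variables before each run, can be sketched in plain Python. This is not WEKA's own filter code: `remove_attribute` is a hypothetical helper that mimics removal by 1-based attribute position, the attribute names follow the list above, and the two patient records are invented for illustration.

```python
ATTRIBUTES = [
    "temperature", "nausea", "Lumbar_pain", "Urine_pushing",
    "Micturition_pains", "Burning_of_urethra",
    "Inflammation_of_urinary_bladder", "Nephritis_of_renal_pelvis_origin",
]

# Invented patient records, one value per attribute above.
ROWS = [
    (35.5, "no", "yes", "no", "no", "no", "no", "no"),
    (41.0, "yes", "yes", "yes", "yes", "no", "yes", "yes"),
]


def remove_attribute(names, rows, position):
    """Drop the attribute at a 1-based position from names and every row."""
    i = position - 1
    if not 0 <= i < len(names):
        raise IndexError(f"attribute position {position} out of range")
    kept_names = names[:i] + names[i + 1:]
    kept_rows = [row[:i] + row[i + 1:] for row in rows]
    return kept_names, kept_rows


# Model 1: remove attribute 8 so bladder inflammation is the last (class) column.
names_model1, rows_model1 = remove_attribute(ATTRIBUTES, ROWS, 8)
# Model 2: remove attribute 7 so nephritis is the last (class) column.
names_model2, rows_model2 = remove_attribute(ATTRIBUTES, ROWS, 7)

print(names_model1[-1], names_model2[-1])
```

Both runs start from the same eight-attribute data, so each model sees the six symptom attributes plus exactly one decision variable as its class.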
  6. 6. Output

     The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.

     Model 1

     === Run information ===

     Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
     Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R8
     Instances:    120
     Attributes:   7
                   temperature
                   nausea
                   Lumbar_pain
                   Urine_pushing
                   Micturition_pains
                   Burning_of_urethra
                   Inflammation_of_urinary_bladder
     Test mode:    10-fold cross-validation

     === Classifier model (full training set) ===

     J48 pruned tree
     ------------------

     Urine_pushing = yes
     |   Micturition_pains = yes: yes (49.0)
     |   Micturition_pains = no
  7. 7. |   |   Lumbar_pain = yes: no (21.0)
     |   |   Lumbar_pain = no: yes (10.0)
     Urine_pushing = no: no (40.0)

     Number of Leaves  :     4
     Size of the tree  :     7

     Time taken to build model: 0.01 seconds

     === Stratified cross-validation ===
     === Summary ===

     Correctly Classified Instances         120      100 %
     Incorrectly Classified Instances         0        0 %
     Kappa statistic                          1
     Mean absolute error                      0
     Root mean squared error                  0
     Relative absolute error                  0 %
     Root relative squared error              0 %
     Total Number of Instances              120

     === Detailed Accuracy By Class ===

                    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                    1         0         1           1        1           1          yes
                    1         0         1           1        1           1          no
     Weighted Avg.  1         0         1           1        1           1
  8. 8. === Confusion Matrix ===

       a  b   <-- classified as
      59  0   a = yes
       0 61   b = no

     The tree is visualised as shown below.

     The same experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin. The following results were obtained.
  9. 9. Model 2

     === Run information ===

     Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
     Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R7
     Instances:    120
     Attributes:   7
                   temperature
                   nausea
                   Lumbar_pain
                   Urine_pushing
                   Micturition_pains
                   Burning_of_urethra
                   Nephritis_of_renal_pelvis_origin
     Test mode:    evaluate on training data

     === Classifier model (full training set) ===

     J48 pruned tree
     ------------------

     temperature <= 37.9: no (60.0)
     temperature > 37.9
     |   Lumbar_pain = yes: yes (50.0)
     |   Lumbar_pain = no: no (10.0)

     Number of Leaves  :     3
  10. 10. Size of the tree  :     5

     Time taken to build model: 0 seconds

     === Evaluation on training set ===
     === Summary ===

     Correctly Classified Instances         120      100 %
     Incorrectly Classified Instances         0        0 %
     Kappa statistic                          1
     Mean absolute error                      0
     Root mean squared error                  0
     Relative absolute error                  0 %
     Root relative squared error              0 %
     Total Number of Instances              120

     === Detailed Accuracy By Class ===

                    TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                    1         0         1           1        1           1          yes
                    1         0         1           1        1           1          no
     Weighted Avg.  1         0         1           1        1           1

     === Confusion Matrix ===

       a  b   <-- classified as
      50  0   a = yes
       0 70   b = no
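Both pruned trees are small enough to transcribe as plain conditionals, which makes the branch logic easy to check by hand. The sketch below is a hand transcription of the two J48 trees printed above, not code produced by WEKA, and the function names are hypothetical. It also recomputes accuracy and the kappa statistic from the two reported confusion matrices, using the standard Cohen's kappa formula.

```python
def predict_bladder_inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Model 1: hand transcription of the first pruned tree."""
    if urine_pushing == "yes":
        if micturition_pains == "yes":
            return "yes"
        # Micturition_pains = no: the decision falls to lumbar pain.
        return "no" if lumbar_pain == "yes" else "yes"
    return "no"


def predict_nephritis(temperature, lumbar_pain):
    """Model 2: hand transcription of the second pruned tree."""
    if temperature <= 37.9:
        return "no"
    return "yes" if lumbar_pain == "yes" else "no"


def accuracy_and_kappa(matrix):
    """Accuracy and Cohen's kappa for a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)


# Confusion matrices as reported in the WEKA output above.
MODEL1_MATRIX = [[59, 0], [0, 61]]
MODEL2_MATRIX = [[50, 0], [0, 70]]

# Leaf counts from the Model 1 tree, grouped by predicted class.
MODEL1_LEAF_TOTALS = {"yes": 49 + 10, "no": 21 + 40}

print(accuracy_and_kappa(MODEL1_MATRIX), accuracy_and_kappa(MODEL2_MATRIX))
```

The Model 1 leaf counts add up to the row totals of its confusion matrix (59 yes, 61 no), which is a quick consistency check on the transcription.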
  11. 11. The visual tree is as below.

     Interpretation

     Both models classified 100% of the instances correctly: Model 1 under 10-fold cross-validation and Model 2 when evaluated on the training data.

     In Model 1, the differentiating factors were Urine pushing, Micturition pains and Lumbar pain.

     In Model 2, the differentiating factors were Temperature and Lumbar pain.

     Lumbar pain appears in both trees, which suggests that it is an important factor in determining urinary infections.

     Conclusion

     The paper barely scratches the surface of the possible applications of data mining. This powerful technique can have unique applications in business as well as in academic research. It may provide clues to numerous questions by allowing us to make sense of the ever-growing volume of data.
  12. 12. Reference 1. 2. 3. 4.