Data mining: Computer assignment 1



Computer assignment 1 of the data mining course. This assignment is about using WEKA on an external database.


  1. Data mining: 'REGRESSION: CPU Performance'. Visualized data with WEKA. Computer assignment 1. Barry Kollee, 10349863
  2. Regression | CPU performance
     1. Do you think that ERP should be at least partially predictable from the input attributes?
     Not in all cases. It is only possible if we can see a correlation between ERP and the attribute we compare it with. If the two correlate, we can predict ERP values from that input attribute.
     2. Do any attributes exhibit significant correlations?
     I loaded the supplied data file into WEKA and visualised it as a grid of scatter plots, one plot for each pair of attributes. A linear correlation shows up when the 'dots' follow a straight-line pattern. The following attributes seem to correlate with ERP: MYCT, MMIN and MMAX.
       • Green, MMAX: with MMAX on the x-axis, ERP increases slowly at first and rapidly afterwards. If we swap the x- and y-axis we see the opposite: a rapid increase at first and a slow increase afterwards.
       • Blue, MYCT: with MYCT on the x-axis, the pattern looks like a 1/n curve. It starts at a high value, drops steeply towards zero on the y-axis as MYCT increases, and then flattens out. If we swap the x- and y-axis we see a similar pattern.
       • Red, MMIN: the pattern is similar to that of MMAX.
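The remark about needing a linear pattern can be illustrated with the Pearson correlation coefficient. The sketch below uses made-up numbers (not values from cpu.arff) and a hypothetical helper named `pearson`; it only shows that a straight-line relationship gives a coefficient of 1, while a 1/x-shaped relationship like the one described for MYCT gives a weaker, negative coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustration data, NOT taken from cpu.arff.
x = [1, 2, 4, 8, 16, 32]
linear = [3 * v + 5 for v in x]   # straight-line relationship
inverse = [100 / v for v in x]    # 1/x-shaped relationship

r_linear = pearson(x, linear)
r_inverse = pearson(x, inverse)
print(f"linear:  r = {r_linear:.3f}")
print(f"inverse: r = {r_inverse:.3f}")
```

A strongly curved relationship can therefore be clearly visible in a scatter plot and still be captured only partially by a linear model.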
  3. Regression | CPU performance
     3. Now we have a feel for the data and we will try fitting a simple linear regression model to the data. On the Classify tab, select Choose > functions > LinearRegression.
       • Use the default options and click Start. This will use 10-fold cross-validation to fit the linear regression model. Examine the results.
       • Record the Root relative squared error and the Relative absolute error. The Relative squared error is computed by dividing (normalizing) the sum of the squared prediction errors by the sum of the squared prediction errors obtained by always predicting the mean. The Root relative squared error is obtained by taking the square root of the Relative squared error. The Relative absolute error is similar to the Relative squared error, but uses absolute values rather than squares. Therefore, if we have a relative error of 100%, the learned model is no better than this very dumb predictor.
     When I run the linear regression function with ERP as the target attribute, I get the following output. The 'Root relative squared error' is given in red.

         Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
         Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1
         Instances:    209
         Attributes:   7
                       MYCT
                       MMIN
                       MMAX
                       CACH
                       CHMIN
                       CHMAX
                       ERP
         Test mode:    10-fold cross-validation

         === Classifier model (full training set) ===

         Linear Regression Model

         ERP =
               0.0661 * MYCT +
               0.0142 * MMIN +
               0.0066 * MMAX +
               0.4871 * CACH +
               1.1868 * CHMAX +
             -66.5968

         Time taken to build model: 0 seconds

         === Cross-validation ===
         === Summary ===

         Correlation coefficient                  0.928
         Mean absolute error                     35.4878
         Root mean squared error                 57.5296
         Relative absolute error                 40.4842 %
         Root relative squared error             37.1725 %
         Total Number of Instances              209

     The Root relative squared error looks pretty high to me. I think that is because all of the attributes are taken into account and fitted into the calculation. The fitted model uses 5 of the 6 input attributes (CHMIN is left out) and reaches a correlation coefficient of 0.928. The model has the form of the line y = ax + b, generalised to several attributes, so that the term ax becomes a weighted sum of the attributes:

         y  = ax + b
         ax = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX
         b  = -66.5968
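The two relative error measures and the fitted model can be written out as a short sketch. The function names and the sample values are hypothetical, and the baseline here is simply the mean of the actual values passed in; WEKA may compute that baseline mean differently (for example from the training folds), so the numbers are not expected to match WEKA's output exactly. The coefficients in `predict_erp` are copied from the model printed above.

```python
import math


def relative_absolute_error(actual, predicted):
    """Relative absolute error in percent, against a predict-the-mean baseline."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(mean_actual - a) for a in actual)
    return 100.0 * model_error / baseline_error


def root_relative_squared_error(actual, predicted):
    """Root relative squared error in percent, against the same baseline."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((mean_actual - a) ** 2 for a in actual)
    return 100.0 * math.sqrt(model_error / baseline_error)


def predict_erp(myct, mmin, mmax, cach, chmax):
    """Weighted sum plus intercept, using the coefficients reported by WEKA."""
    return (0.0661 * myct + 0.0142 * mmin + 0.0066 * mmax
            + 0.4871 * cach + 1.1868 * chmax - 66.5968)


# Made-up values for illustration, NOT taken from cpu.arff.
actual = [10.0, 20.0, 40.0, 80.0]
mean_only = [sum(actual) / len(actual)] * len(actual)  # the "very dumb predictor"
close = [12.0, 18.0, 43.0, 77.0]                       # a better predictor

print(relative_absolute_error(actual, mean_only))
print(root_relative_squared_error(actual, mean_only))
print(relative_absolute_error(actual, close))
print(root_relative_squared_error(actual, close))
```

Always predicting the mean scores 100% on both measures, which is why values well below 100% indicate that the model has learned something from the attributes.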
  4. Regression | CPU performance
     I expect that we can get a better-fitting linear regression model if we only use the attributes that correlate best with ERP, as found in answer 2. Judging by the plots in answer 2, MMIN and MMAX seem to correlate best, so I made another linear regression model using only the MMIN and MMAX attributes, which is given below (Root relative squared error in red):

         === Run information ===

         Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
         Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1,4-6
         Instances:    209
         Attributes:   3
                       MMIN
                       MMAX
                       ERP
         Test mode:    10-fold cross-validation

         === Classifier model (full training set) ===

         Linear Regression Model

         ERP =
               0.0128 * MMIN +
               0.0087 * MMAX +
             -39.814

         Time taken to build model: 0 seconds

         === Cross-validation ===
         === Summary ===

         Correlation coefficient                  0.9022
         Mean absolute error                     39.8811
         Root mean squared error                 66.584
         Relative absolute error                 45.4961 %
         Root relative squared error             43.023 %
         Total Number of Instances              209

     My assumption was wrong. With only MMIN and MMAX the correlation coefficient is lower and the errors are higher. The Mean absolute error, for example, which is the average absolute difference between the actual and the predicted values over all test cases, has gone up. The Root relative squared error has also increased, by about 6 percentage points.
     4. Did you expect such a performance given your earlier observations? Hint: We are fitting a linear model.
     Because we are fitting a linear model, we are looking for the attributes that correlate best with ERP. The good performance is clearly visible in the correlation coefficient: a value of about 0.93 is really close to 1, which is the best value possible.
     However, the root relative squared error is pretty high. To get a better-fitting linear regression model, I would expect that we should only use the attributes that correlate best with ERP. This should result in a correlation coefficient closer to 1 and an error closer to 0%. However, my results when using only MMIN and MMAX were not that hopeful. Perhaps that is because these errors weigh less when more attributes are included; using more attributes seems to decrease the error. On the other hand, I would have expected that including more attributes makes the model more error-sensitive.
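The comparison between the two runs is plain arithmetic on the reported summaries, sketched below. The metric values are copied from the two WEKA outputs above; the dictionary keys and the helper `metric_deltas` are names chosen here for illustration.

```python
# Summary values copied from the two WEKA runs reported above.
all_attributes = {
    "correlation": 0.928,
    "mae": 35.4878,
    "rmse": 57.5296,
    "rae_pct": 40.4842,
    "rrse_pct": 37.1725,
}
mmin_mmax_only = {
    "correlation": 0.9022,
    "mae": 39.8811,
    "rmse": 66.584,
    "rae_pct": 45.4961,
    "rrse_pct": 43.023,
}


def metric_deltas(before, after):
    """Change in each metric when going from the first run to the second."""
    return {name: round(after[name] - before[name], 4) for name in before}


deltas = metric_deltas(all_attributes, mmin_mmax_only)
for name, delta in deltas.items():
    print(f"{name:12s} {delta:+.4f}")
```

Every error measure goes up and the correlation coefficient goes down, which matches the conclusion that dropping MYCT, CACH and CHMAX made the model worse.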
  5. Regression | CPU performance
     5. Above we deleted the vendor variable. However, we can use nominal attributes in regression by converting them to numeric. The standard way of doing so is to replace the nominal variable with a bunch of binary variables of the form "is_first_nominal_value, is_second_nominal_value" and so on. Reload the unmodified data file cpu.arff.
       • On the Preprocess tab select Choose > filters > unsupervised > attribute > NominalToBinary and click Apply. This replaces the vendor variable with 30 binary variables and we now have 37 attributes (we started with 8). Now train a linear regression model as in (4) and examine the results.
       • Record the Relative absolute error and the Root relative squared error.

         Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
         Relation:     cpu-weka.filters.unsupervised.attribute.NominalToBinary-Rfirst-last
         Instances:    209
         Attributes:   37
                       vendor=adviser vendor=amdahl vendor=apollo vendor=basf vendor=bti
                       vendor=burroughs vendor=c.r.d vendor=cdc vendor=cambex vendor=dec
                       vendor=dg vendor=formation vendor=four-phase vendor=gould vendor=hp
                       vendor=harris vendor=honeywell vendor=ibm vendor=ipl vendor=magnuson
                       vendor=microdata vendor=nas vendor=ncr vendor=nixdorf vendor=perkin-elmer
                       vendor=prime vendor=siemens vendor=sperry vendor=sratus vendor=wang
                       MYCT MMIN MMAX CACH CHMIN CHMAX ERP
         Test mode:    10-fold cross-validation

         === Classifier model (full training set) ===

         Linear Regression Model

         ERP =
            -132.1272 * vendor=adviser +
             -34.3319 * vendor=burroughs +
             -52.3128 * vendor=gould +
             -35.8202 * vendor=honeywell +
             -16.7597 * vendor=ibm +
            -144.1856 * vendor=microdata +
             -22.7172 * vendor=nas +
              41.5185 * vendor=sperry +
               0.0696 * MYCT +
               0.0167 * MMIN +
               0.0055 * MMAX +
               0.6304 * CACH +
              -1.5416 * CHMIN +
               1.6106 * CHMAX +
             -57.432

         Time taken to build model: 0.02 seconds

         === Cross-validation ===
         === Summary ===
  6. Regression | CPU performance

         Correlation coefficient                  0.9252
         Mean absolute error                     35.9725
         Root mean squared error                 58.5821
         Relative absolute error                 41.0372 %
         Root relative squared error             37.8525 %
         Total Number of Instances              209

     6. Compare the performance to the one we had previously. Did adding the binarized vendor variable help?
     The errors of the first linear model were:
         Relative absolute error                 40.4842 %
         Root relative squared error             37.1725 %
     The errors of the latest linear regression model are:
         Relative absolute error                 41.0372 %
         Root relative squared error             37.8525 %
     Both errors have increased slightly, so adding the vendor variable did not help. I think that is because we now take many more attributes into account, which makes our slope (the a in y = ax + b) more complex and error-sensitive. I predict that the errors would be lower if we only used the attributes that correlate best with ERP.
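The conversion of the nominal vendor attribute into binary attributes can be sketched as follows. This is a minimal illustration of the idea, not WEKA's NominalToBinary implementation; the function name `nominal_to_binary` and the numeric values in the sample rows are made up, and only the `vendor=<value>` naming follows the WEKA output above.

```python
def nominal_to_binary(rows, attribute):
    """Replace one nominal attribute by a 0/1 attribute per distinct value."""
    values = sorted({row[attribute] for row in rows})
    encoded = []
    for row in rows:
        # One indicator column per value, named like WEKA's "vendor=ibm".
        new_row = {f"{attribute}={v}": int(row[attribute] == v) for v in values}
        # Keep all other attributes unchanged.
        new_row.update({k: v for k, v in row.items() if k != attribute})
        encoded.append(new_row)
    return encoded


# Vendor names appear in the data set; the numeric values are made up.
rows = [
    {"vendor": "ibm", "MYCT": 100, "ERP": 50},
    {"vendor": "wang", "MYCT": 480, "ERP": 20},
    {"vendor": "sperry", "MYCT": 30, "ERP": 300},
    {"vendor": "ibm", "MYCT": 60, "ERP": 120},
]

encoded = nominal_to_binary(rows, "vendor")
for row in encoded:
    print(row)
```

With 30 distinct vendors this turns one nominal attribute into 30 binary ones, which is how the 8 original attributes become 37.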