Data mining
‘REGRESSION: CPU Performance’

Visualized data with WEKA
COMPUTER ASSIGNMENT 1

BARRY KOLLEE
10349863
1. Do you think that ERP should be at least partially predictable from the input attributes?

Not in all cases. This is only possible if we can see a correlation between the input attribute and ERP.
When an input attribute correlates with ERP, we can state that ERP is at least partially predictable
from that attribute.

2. Do any attributes exhibit significant correlations?

I’ve loaded the delivered database file into WEKA. Visualising the data as a scatter-plot matrix (which
shows the correlation between all pairs of attributes) gives the plots listed below. To see a correlation
between the ‘dots’ we need to see a roughly linear pattern. The following attributes seem to correlate
with ERP: MYCT, MMIN and MMAX. A small sketch for checking these correlations numerically follows the list.

       •   Green (MMAX): with MMAX plotted on the x-axis I see a pattern that increases slowly at
           first and afterwards increases rapidly. If we swap the x and y axes we see the opposite:
           it starts by increasing rapidly and afterwards increases slowly.
       •   Blue (MYCT): with MYCT plotted on the x-axis I see a pattern in the correlation between
           ERP and MYCT. It looks like a 1/x curve that starts at a high value; moving along the
           x-axis the values drop quickly towards the zero point of the y-axis, and further along
           the x-axis the slope hardly decreases anymore. If we swap the x and y axes we see a
           similar pattern.
       •   Red (MMIN): the pattern I see for MMIN is similar to the one for MMAX.
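
These visual impressions can be checked numerically. Below is a minimal sketch that computes the Pearson
correlation of every numeric attribute with ERP. It assumes cpu.arff is available locally, that SciPy and
pandas are installed, and that the column names match the attribute names shown in the WEKA output later on.

        # Minimal sketch: Pearson correlation of each numeric attribute with ERP.
        # Assumes cpu.arff is in the working directory; column names are assumed
        # to match the attribute names listed in the WEKA output (MYCT, ..., ERP).
        from scipy.io import arff
        import pandas as pd

        data, meta = arff.loadarff("cpu.arff")
        df = pd.DataFrame(data)

        # Keep only the numeric columns (the nominal 'vendor' attribute is dropped here).
        numeric = df.select_dtypes(include="number")

        # Correlation of every numeric attribute with the target ERP.
        print(numeric.corr()["ERP"].sort_values(ascending=False))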





3. Now we have a feel for the data and we will try fitting a simple linear regression model to
the data. On the Classify tab, select Choose > functions > LinearRegression.

        •      Use the default options and click Start. This will use 10-fold cross-validation to fit the linear
               regression model. Examine the results:
        •      Record the Root relative squared error and the Relative absolute error. The Relative squared
               error is computed by dividing (normalizing) the sum of the squared prediction errors by the sum
               of the squared prediction errors obtained by always predicting the mean. The Root relative
               squared error is obtained by taking the square root of the Relative squared error. The Relative
               absolute error is similar to the Relative squared error, but uses absolute values rather than
               squares. Therefore, if we have a relative error of 100%, the learned model is no better than
               this very dumb predictor.
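
To make these two definitions concrete, here is a minimal sketch of both measures, assuming `actual` and
`predicted` are NumPy arrays of the same length; this is my own illustration, not WEKA's exact implementation.

        # Minimal sketch of the two relative error measures described above.
        # `actual` and `predicted` are assumed to be NumPy arrays of equal length.
        import numpy as np

        def relative_absolute_error(actual, predicted):
            # Baseline: total absolute error of always predicting the mean.
            baseline = np.abs(actual - actual.mean()).sum()
            return np.abs(actual - predicted).sum() / baseline

        def root_relative_squared_error(actual, predicted):
            # Baseline: total squared error of always predicting the mean.
            baseline = ((actual - actual.mean()) ** 2).sum()
            return np.sqrt(((actual - predicted) ** 2).sum() / baseline)

        # A value of 1.0 (100%) means the model is no better than the mean predictor.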

When I run the linear regression function with ERP as the target attribute, WEKA gives the following
output. The Root relative squared error is listed in the cross-validation summary at the bottom.



       Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
       Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1
       Instances:    209
       Attributes:   7
                     MYCT
                     MMIN
                     MMAX
                     CACH
                     CHMIN
                     CHMAX
                     ERP
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===
       Linear Regression Model

       ERP =

              0.0661    *   MYCT +
              0.0142    *   MMIN +
              0.0066    *   MMAX +
              0.4871    *   CACH +
              1.1868    *   CHMAX +
            -66.5968

       Time taken to build model: 0 seconds

       === Cross-validation ===
       === Summary ===

       Correlation coefficient                               0.928
       Mean absolute error                                   35.4878
       Root mean squared error                               57.5296
       Relative absolute error                               40.4842 %
       Root relative squared error                           37.1725 %
       Total Number of Instances                             209



The Root relative squared error looks pretty high. That’s because we take almost all of the attributes into
account when fitting the model: five attributes (MYCT, MMIN, MMAX, CACH and CHMAX) end up in the linear
regression. With these attributes included we obtain a correlation coefficient of 0.928. The model has the
familiar form y = ax + b, except that the ‘ax’ part is a weighted sum over the attributes:


       y = a x + b

       a x = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX

       b = -66.5968

So the weighted sum of the attributes plays the role of ax, and -66.5968 is the intercept b.




I expect that we can make a better-fitting linear regression model if we only take the attributes into
account that correlate best with ERP, as identified in answer 2. To achieve this we only keep MMIN and
MMAX, because these look like the strongest correlates based on the output in answer 2. I built another
linear regression model using only the MMIN and MMAX attributes; its output is given below, with the Root
relative squared error again listed in the summary.


           === Run information ===

           Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
           Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1-
           weka.filters.unsupervised.attribute.Remove-R1,4-6
           Instances:    209
           Attributes:   3
                         MMIN
                         MMAX
                         ERP
           Test mode:10-fold cross-validation

           === Classifier model (full training set) ===


           Linear Regression Model

           ERP =

                 0.0128 * MMIN +
                 0.0087 * MMAX +
               -39.814

           Time taken to build model: 0 seconds

           === Cross-validation ===
           === Summary ===

           Correlation coefficient                      0.9022
           Mean absolute error                          39.8811
           Root mean squared error                      66.584
           Relative absolute error                      45.4961 %
           Root relative squared error                  43.023 %
            Total Number of Instances                    209
My assumption was actually wrong. When only MMIN and MMAX are taken into account, the correlation
coefficient is lower and the error rates are higher; for instance, the Mean absolute error increases. That
value gives the average absolute difference between the actual and predicted values over all test cases.
The Root relative squared error has also increased by roughly 6 percentage points.

4. Did you expect such a performance given your earlier observations? Hint: We are fitting a
linear model.

Because we’re trying to fit a linear model, we’re searching for the attributes that correlate best
(linearly) with ERP. The quality of the fit is clearly visible in the correlation coefficient: a value of
about 0.93 is close to 1, which is the best value possible.

However, the Root relative squared error is still pretty high. To get a better-fitting linear regression
model I expected that we should only take the attributes into account that correlate best with ERP, which
would give a correlation coefficient closer to 1 and an error rate closer to 0%. My observation when only
using MMIN and MMAX was not that hopeful, though. Perhaps that’s because individual errors are smoothed
out when more attributes are included; using more attributes seems to decrease the error rate.

On the other hand, I would expect that including more attributes also makes the model more complex and
more sensitive to errors.




5. Above we deleted the vendor variable. However, we can use nominal attributes in
regression by converting them to numeric. The standard way of so doing is to replace the
nominal variable with a bunch of binary variables of the form "is_first_nominal_value,
is_second_nominal_value" and so on. Reload the unmodified data file cpu.arff.
    • On the Preprocess tab select Choose > filters > unsupervised > attribute >
        NominalToBinary and click Apply. This replaces the vendor variable with 30 binary
        variables and we now have 37 attributes (we started with 8).
        Now train a linear regression model as in (4) and examine the results.
    • Record the Relative absolute error and the Root relative squared error
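
Before looking at the WEKA output, the NominalToBinary step can be illustrated with ordinary one-hot
encoding. A minimal sketch, assuming the DataFrame from the earlier sketches was loaded from the unmodified
cpu.arff (so it still contains the nominal vendor column):

        # Illustration of what NominalToBinary does: one 0/1 indicator column per vendor.
        # ARFF nominals load as bytes, so they are decoded to plain strings first.
        import pandas as pd

        df["vendor"] = df["vendor"].str.decode("utf-8")
        binarized = pd.get_dummies(df, columns=["vendor"],
                                   prefix="vendor", prefix_sep="=", dtype=int)

        # Expect 209 rows and 37 columns: 30 vendor indicators + 6 numeric attributes + ERP.
        print(binarized.shape)
        print([c for c in binarized.columns if c.startswith("vendor=")][:5])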



       Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
       Relation:     cpu-weka.filters.unsupervised.attribute.NominalToBinary-Rfirst-last
       Instances:    209
       Attributes:   37
                     vendor=adviser
                     vendor=amdahl
                     vendor=apollo
                     vendor=basf
                     vendor=bti
                     vendor=burroughs
                     vendor=c.r.d
                     vendor=cdc
                     vendor=cambex
                     vendor=dec
                     vendor=dg
                     vendor=formation
                     vendor=four-phase
                     vendor=gould
                     vendor=hp
                     vendor=harris
                     vendor=honeywell
                     vendor=ibm
                     vendor=ipl
                     vendor=magnuson
                     vendor=microdata
                     vendor=nas
                     vendor=ncr
                     vendor=nixdorf
                     vendor=perkin-elmer
                     vendor=prime
                     vendor=siemens
                     vendor=sperry
                     vendor=sratus
                     vendor=wang
                     MYCT
                     MMIN
                     MMAX
                     CACH
                     CHMIN
                     CHMAX
                     ERP
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       Linear Regression Model

       ERP =

           -132.1272 * vendor=adviser +
           -34.3319 * vendor=burroughs +
           -52.3128 * vendor=gould +
           -35.8202 * vendor=honeywell +
           -16.7597 * vendor=ibm +
           -144.1856 * vendor=microdata +
           -22.7172 * vendor=nas +
           41.5185 * vendor=sperry +
           0.0696 * MYCT +
           0.0167 * MMIN +
           0.0055 * MMAX +
           0.6304 * CACH +
           -1.5416 * CHMIN +
           1.6106 * CHMAX +
          -57.432

       Time taken to build model: 0.02 seconds

       === Cross-validation ===
       === Summary ===





       Correlation coefficient                          0.9252
       Mean absolute error                              35.9725
       Root mean squared error                          58.5821
       Relative absolute error                          41.0372 %
       Root relative squared error                      37.8525 %
        Total Number of Instances                        209
6. Compare the performance to the one we had previously. Did adding the binarized vendor
variable help?

The errors of the first linear model were:

Relative absolute error                     40.4842 %
Root relative squared error                 37.1725 %

The errors of the latest linear regression model (with the binarized vendor variable) are:

Relative absolute error                     41.0372 %
Root relative squared error                 37.8525 %


It looks like the error rate has only increased, so adding the binarized vendor variable did not help. I
think that’s because we now take many more attributes into account, which makes the weighted sum (the ‘ax’
part of y = ax + b) more complex and more error sensitive. I predict that the error rate would be lower if
we only took the attributes into account that correlate best with ERP.
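
That prediction could be tested by ranking the binarized attributes on their absolute correlation with ERP
and refitting on only the strongest ones. The sketch below is one way to do that, assuming the `binarized`
DataFrame from the previous snippet; it was not run as part of this assignment, so the outcome is an open
question.

        # One way to test the prediction: keep only the attributes that correlate most
        # strongly with ERP and refit with 10-fold cross-validation.
        # Assumes `binarized` from the previous snippet; not run for this assignment.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import cross_val_predict

        y = binarized["ERP"].to_numpy()
        features = binarized.drop(columns=["ERP"])

        # Rank attributes by absolute Pearson correlation with ERP and keep the top 5.
        ranking = features.corrwith(binarized["ERP"]).abs().sort_values(ascending=False)
        top = ranking.head(5).index.tolist()

        predictions = cross_val_predict(LinearRegression(), features[top].to_numpy(), y, cv=10)
        rrse = np.sqrt(((y - predictions) ** 2).sum() / ((y - y.mean()) ** 2).sum())
        print(top, f"RRSE ~ {rrse:.1%}")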



