REGRESSION, CLUSTERING AND CLASSIFICATION IN R-STUDIO

BUSINESS ANALYTICS -2
NAME: MOHAMMAD YASEEN DAR
REG: 11715830
FROM: JAMMU AND KASHMIR (TANGMARG)
COURSE: MBA (2017-2019)
PHONE: (7006304863)

SEC: Q1747, GROUP: 01
The data are taken from the website OPENML.ORG and relate to the stock performance of ten
aerospace companies from January 1988 to October 1991.
Data description link: https://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance#
Download link: https://archive.ics.uci.edu/ml/machine-learning-databases/00390/

Below is the file imported into RStudio. It was downloaded from the link given above and then
imported by clicking on Import Dataset and choosing the Excel option.
The first step in this assignment is to download the data from OPENML.ORG, as shown in the
snapshot below. After downloading the data and converting it to an Excel worksheet, we perform
the operations required for the assignment: Regression, Classification and Clustering.
Above are the variables and the data under each variable. All of the machine learning in this
assignment is done on this data, and every model below is built from it.

1) (A) CLUSTERING OF DATA
Interpretation:
Clustering is a widely used technique for finding subgroups of observations within a data set. When
we cluster observations, we want observations in the same group to be alike and observations in
different groups to be dissimilar.
K-means clustering is computed in R with the kmeans function.
We can split the data into two clusters by setting centers = 2. The kmeans function also has an
nstart option that tries multiple initial configurations and reports the best one. For example,
nstart = 25 generates 25 initial configurations.
• In the above graph most observations fall in the blue cluster, which means that these
observations lie closer to the mean (center) of the blue cluster than to the mean of the
other cluster.

• The other cluster contains fewer observations, which means that fewer observations lie close
to its mean and that its observations deviate more from those of the blue cluster.
• na.omit removes the observations that have missing values in any variable.
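The steps described above can be sketched as follows. This is a minimal illustration, not the code used in the assignment: the data frame `stock` and its columns are hypothetical stand-ins for the imported worksheet, generated randomly so that the example runs on its own.

```r
set.seed(42)  # make the random data and the random starts reproducible

# Hypothetical stand-in for the imported data: two numeric columns
stock <- data.frame(
  total_risk      = c(rnorm(30, mean = 0.3, sd = 0.05), rnorm(30, mean = 0.7, sd = 0.05)),
  systematic_risk = c(rnorm(30, mean = 0.3, sd = 0.05), rnorm(30, mean = 0.7, sd = 0.05))
)
stock$total_risk[5] <- NA  # introduce one missing value

# na.omit drops every row that contains a missing value
clean <- na.omit(stock)

# Two clusters, keeping the best of 25 random initial configurations
km2 <- kmeans(clean, centers = 2, nstart = 25)

print(km2$size)     # number of observations in each cluster
print(km2$centers)  # mean of each variable within each cluster
```

The `size` element shows how many observations fall in each cluster, and `centers` gives the cluster means that the interpretation above refers to.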
1) (B) 4 CLUSTERS OF DATA
We can make as many clusters as we require, although analysts mostly prefer to use the optimal
number of clusters. To divide the same data into more groups we use the same command as in the
previous part and change the centers value (K) to the number of groups required; the nstart value
can stay the same.
• Even after making four clusters, most observations fall in the blue cluster, followed by the
red one.
• With four clusters, the observations are grouped around four cluster means, and each
observation is placed in the cluster whose mean is nearest to it.
There are two types of clustering, GROUP CLUSTERING and HIERARCHICAL CLUSTERING; the clusters
above are based on GROUP CLUSTERING.
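As a rough sketch of the four-cluster case, the same kmeans call can be repeated with a different centers value. The data frame `x` is hypothetical random data, and the comparison of the total within-cluster sum of squares over several values of K is one common way to judge the number of clusters; it is an assumption here, not a step taken from the assignment.

```r
set.seed(42)

# Hypothetical numeric data standing in for the cleaned worksheet
x <- data.frame(a = runif(60), b = runif(60))

# Same command as before, with centers changed to 4
km4 <- kmeans(x, centers = 4, nstart = 25)
print(table(km4$cluster))  # observations per cluster

# Total within-cluster sum of squares for K = 2..6;
# a clear bend in these values suggests a reasonable K
wss <- sapply(2:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 3))
```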

2) REGRESSION OF DATA
Interpretation:
• The circled part of the output gives the intercept and the beta coefficient for Total Risk.
• From that output the regression equation can be written as:
Systematic Risk = 0.1934 + 0.5213 * Total Risk
• The intercept (b0) is 0.1934, which is the predicted Systematic Risk when Total Risk is
zero. As the data are in percentage format rather than units, for a Total Risk equal to
zero we can expect a Systematic Risk of 0.1934*100 = 19.34%.
• The regression beta coefficient for Total Risk (b1), also known as the slope, is 0.5213.
It means that an increase in Total Risk of 100 percentage points (1.0) is expected to
increase Systematic Risk by 52.13 percentage points.
• The expected Systematic Risk follows from the intercept and the Total Risk coefficient.
For a Total Risk of 100% (1.0): Systematic Risk = 0.1934 + 0.5213*1 = 0.7147, i.e. 71.47%.
• Hypotheses
• Null hypothesis (H0): the coefficients are equal to zero (no relationship between Total
Risk (x) and Systematic Risk (y)).
• Alternative hypothesis (Ha): the coefficients are not equal to zero (there is a relationship
between these two variables).
• In our data the p-values for both the intercept and the predictor variable are
significant, as shown by the 2 and 3 stars against them, which mark the significance
level of each term. So we can reject the null hypothesis in favour of the
alternative hypothesis, which means that there is an association between the variables.
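The fitted line can be evaluated directly from the two reported coefficients. The sketch below assumes Total Risk is entered as a fraction (1.0 = 100%); the function name `predict_systematic_risk` is hypothetical and only the values 0.1934 and 0.5213 come from the output discussed above.

```r
# Coefficients reported in the regression output
b0 <- 0.1934  # intercept
b1 <- 0.5213  # slope for Total Risk

# Hypothetical helper: predicted Systematic Risk for a given Total Risk
predict_systematic_risk <- function(total_risk) {
  b0 + b1 * total_risk
}

# Predictions at Total Risk = 0, 0.5 and 1.0 (0%, 50% and 100%)
predictions <- predict_systematic_risk(c(0, 0.5, 1))
print(predictions)
```

At a Total Risk of zero the prediction equals the intercept, and each further unit of Total Risk adds the slope.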

MODEL ACCURACY
The accuracy of the linear regression fit can be assessed on three quantities reported by
summary(model):
##Residual standard error: 0.1267
##Multiple R-squared: 0.2695,
##F-statistic: 22.5
##p-value: 1.307e-05
1) Residual Standard Error (RSE).
The output above shows RSE = 0.1267, which means that the observed Systematic Risk values deviate
from the fitted regression line by approximately 0.1267 (12.67 percentage points) on average.
We can also express the error relative to the mean value of Systematic Risk, which is
0.449569231.
So the percentage error is 0.1267/0.449569231 = 28.18%.
2) R-Squared.
From the above output R2 is 0.2695, which is closer to zero than to one. It indicates that our
regression model explains only about 27% of the variability in the outcome, so a large proportion
remains unexplained.
3) F-Statistic.
In our regression model the F-statistic is equal to 22.5, giving a p-value of 1.307e-05, which is
highly significant.
Summary
RSE: closer to zero the better
R-Squared: Higher the better
F-Statistic: Higher the better
The model is statistically significant on the F-statistic, although it falls short on R-Squared,
which shows that its explanatory power is limited.
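The three quantities can also be read programmatically from the summary object rather than from the printed output. The sketch below fits a model on hypothetical simulated data (the sample size, coefficients and noise level are arbitrary assumptions), so its numbers will not match the assignment's output.

```r
set.seed(7)

# Hypothetical data with a linear relationship plus noise
n <- 63
total_risk <- runif(n)
systematic_risk <- 0.2 + 0.5 * total_risk + rnorm(n, sd = 0.1)

model <- lm(systematic_risk ~ total_risk)
s <- summary(model)

rse <- s$sigma        # residual standard error
r2  <- s$r.squared    # multiple R-squared
f   <- s$fstatistic   # named vector: value, numdf, dendf

# p-value of the overall F-test
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# RSE relative to the mean of the response, as a fraction
relative_error <- rse / mean(systematic_risk)

print(c(RSE = rse, R2 = r2, F = unname(f["value"]), rel_error = relative_error))
print(unname(p_value))
```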

3) CLASSIFICATION OF DATA:
Interpretation:
Above is a classification of the data in the form of a tree model, from which we can see whether
an observation falls in the low category or the high category.
We applied an ifelse condition at 0.6 on Annual Return and replaced the existing variable with
the result. Because of the replace "true" condition, the tree model is built on the other
variables and shows the type of Annual Return, low or high, as per the above condition.

• 70% of the data is used for training and the prediction is made only on the remaining 30%
of the data. The 30% is chosen at random, as there is no particular condition on which
observations should or should not be picked for prediction.
• On the left side of the graph, if the Excess Return is less than 0.687719 and the Systematic
Risk is less than 0.448411, then the Annual Return is low; but if the Systematic Risk
and the Excess Return are greater than 0.448411 and 0.455994 respectively, the Annual
Return is high.
• On the other hand, if the Systematic Risk is less than 0.448411 and the Excess Return is
less than 0.455994, then the Annual Return is low under both conditions, which is a negative
indication for investors and shareholders.
• On the right side of the tree model there are two conditions on Excess Return, based on
two different values, 0.687719 and 0.713147.
• If the Excess Return is greater than both of these values, then the Annual Return is high
under both conditions, which is a positive sign.
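The workflow described above (labelling Annual Return with ifelse at 0.6, a random 70/30 split, and a tree model) can be sketched as below. The source does not name the tree package, so the use of rpart is an assumption, and the data frame `df` with its column names is hypothetical simulated data.

```r
library(rpart)  # assumption: the assignment does not say which tree package was used

set.seed(1)

# Hypothetical data standing in for the imported worksheet
n <- 63
df <- data.frame(
  excess_return   = runif(n),
  systematic_risk = runif(n)
)
annual_return <- 0.2 + 0.6 * df$excess_return + rnorm(n, sd = 0.1)

# Label Annual Return as High or Low at the 0.6 threshold
df$return_class <- factor(ifelse(annual_return > 0.6, "High", "Low"))

# Random 70/30 split into training and test sets
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit the classification tree on the training set
fit <- rpart(return_class ~ ., data = train, method = "class")

# Predict on the held-out 30% and tabulate predicted vs actual classes
pred <- predict(fit, newdata = test, type = "class")
conf_mat <- table(predicted = pred, actual = test$return_class)
print(conf_mat)
```

Printing `fit` lists the split variables and thresholds, which correspond to the kind of conditions read off the tree above.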
• ACCURACY OF THE MODEL
The accuracy of the model is calculated from the predicted values of the model, which
are given below:
Hence, the accuracy of the model is
(10 + 4) / (4 + 4 + 0 + 10) = 14/18 * 100 = 77.8%
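The same calculation can be done from a confusion matrix. The four counts (10, 4, 4, 0) come from the calculation above; which off-diagonal cell holds the 4 and which holds the 0 is an assumption, as are the row and column labels. The accuracy does not depend on that choice.

```r
# Confusion matrix with the correct predictions (10 and 4) on the diagonal.
# The placement of the misclassified counts (4 and 0) is assumed.
conf_mat <- matrix(
  c(10, 4,
     0, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("Low", "High"), actual = c("Low", "High"))
)

# Accuracy = correct predictions / all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(round(100 * accuracy, 1))
```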
