Measure of Association
1. Dr. Manoj Kumar Meher
Kalahandi University
meher.manoj@gmail.com
2. The measures of association refer to a wide variety of
coefficients that measure the statistical strength of the
relationship between the variables of interest; these measures of
strength, or association, can be described in several ways,
depending on the analysis.
There are certain statistical distinctions that a researcher should know
in order to better understand the measures of statistical association.
1. The researcher should know that measures of association are
not the same as measures of statistical significance. Tests of
significance have a null hypothesis stating that there is no
difference between the strength of an observed relationship and
the strength of the relationship expected under simple random
sampling. It is therefore possible to have a relationship that
shows a strong measure of association but is not statistically
significant, and a relationship that shows a weak measure of
association but is highly significant.
3. 2. A coefficient that measures statistical association (which
coefficient applies can vary depending on the analysis) has a
value of zero when no relationship exists between the variables.
i. In correlation analyses, if the coefficient (r) has a value of one, it
signifies a perfect relationship between the variables of interest.
ii. In regression analyses, if the standardized beta weight (β) has a
value of one, it also signifies a perfect relationship between the
variables of interest.
iii. With regard to linear relationships, the measures of association
deal with strictly monotonic, ordered monotonic, predictive
monotonic, and weak monotonic relationships.
iv. The researcher should note that if a relationship is perfect
under strict monotonicity, then it is perfect under the other
conditions as well.
v. However, in measures of association, one cannot have perfect
ordered and perfect predictive monotonicity at the same time.
The researcher should note that the linear definitions of perfect
relationships in measures of association are inappropriate for
curvilinear or discontinuous relationships.
4. 3. The measures of association define the strength of the linear
relationship in terms of the degree of monotonicity, which is
based on counting various types of pairs in a relationship. There
are four basic types of pairs in the measures of association:
concordant pairs (i.e. pairs that agree with each other),
discordant pairs (i.e. pairs that do not agree with each other),
pairs tied on one variable, and pairs tied on the other variable.
The researcher should note that as the number of concordant
pairs increases, all the linear definitions of perfect relationships
push the coefficient of association towards +1.
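The pair counting described above can be sketched in Python. This is an illustrative helper (not from the slides, with made-up data) that classifies every pair of observations as concordant, discordant, or tied on one variable, the counts on which Kendall-style coefficients are built:

```python
from itertools import combinations

def pair_counts(x, y):
    """Count concordant, discordant, tied-on-X and tied-on-Y pairs."""
    c = d = tx = ty = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2 and y1 == y2:
            continue            # tied on both variables
        elif x1 == x2:
            tx += 1             # tied on the first variable only
        elif y1 == y2:
            ty += 1             # tied on the second variable only
        elif (x1 - x2) * (y1 - y2) > 0:
            c += 1              # concordant: both variables move together
        else:
            d += 1              # discordant: they move in opposite directions
    return c, d, tx, ty

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pair_counts(x, y))  # → (8, 2, 0, 0): mostly concordant, coefficient near +1
```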
5. Certain assumptions are made by the measures of
association:
The measures of association assume categorical
(nominal or ordinal) or continuous levels of data.
The measures of association assume a symmetrical or
asymmetrical type of causal direction.
The measures of association that define an ideal
relationship in terms of strict monotonicity will
attain the value of one only if the two variables have
evolved from the same marginal distribution. The
measures of association also ignore those rows and
columns which have null values.
7. The Pearson product-moment correlation
coefficient (or Pearson correlation coefficient, for
short) is a measure of the strength of a linear
association between two variables and is denoted
by r. Basically, a Pearson product-moment
correlation attempts to draw a line of best fit through
the data of two variables, and the Pearson correlation
coefficient, r, indicates how far away all these data
points are from this line of best fit (i.e., how well the
data points fit this new model/line of best fit).
Product Moment Correlation
9. r          Strength of relationship
   < 0.2      Negligible
   0.2 – 0.4  Low
   0.4 – 0.7  Moderate
   0.7 – 0.9  High
   > 0.9      Very High
Rule of thumb
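As a sketch, r can be computed directly from its definition, the covariance divided by the product of the standard deviations. The data here are the Xi/Yi values from the worked example on slide 25; the resulting r ≈ 0.89 falls in the "High" band of the rule of thumb:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation, r = cov(X, Y) / (sd_X * sd_Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [16, 14, 12, 11, 12, 15, 13, 11]            # Xi from slide 25
y = [3.5, 3.2, 3.0, 2.6, 2.9, 3.3, 2.7, 2.8]    # Yi from slide 25
print(round(pearson_r(x, y), 4))  # → 0.8911
```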
14. Spearman's rank correlation coefficient is a
non-parametric statistical measure used to study the
strength of association between two ranked variables.
This method is applied to ordinal data, which can be
arranged in order, i.e. one after the other, so that a rank
can be given to each observation.
In the rank correlation coefficient method, a rank is
given to each individual on the basis of its quality or
quantity, starting from position 1 and going to
position N for the one ranked last in the group.
Rank Correlation
15. R = 1 − (6ΣD²) / (N(N² − 1)) = 1 − (6ΣD²) / (N³ − N)
Where,
R = rank coefficient of correlation
D = difference of ranks
N = number of observations
Equal ranks or tie in ranks: in case the same rank is
assigned to two or more entities, the ranks are assigned
on an average basis. For example, if two individuals tie
after the first three positions, they jointly occupy ranks 4
and 5, so each receives (4 + 5)/2 = 4.5.
Formula
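A minimal sketch of the procedure above: assign average ranks for ties, then apply R = 1 − 6ΣD² / (N(N² − 1)). The sample scores are made up; note that with ties this formula is the usual classroom approximation rather than the exact tie-corrected coefficient:

```python
def avg_ranks(values):
    """Rank 1 for the smallest value; tied values share the average rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1          # first position occupied by v
        last = first + ordered.count(v) - 1   # last position occupied by v
        ranks.append((first + last) / 2)      # average of the tied positions
    return ranks

def spearman(x, y):
    """R = 1 - 6*sum(D^2) / (N*(N^2 - 1)) on the (average) ranks."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(avg_ranks(x), avg_ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(avg_ranks([10, 20, 20, 30]))                  # → [1.0, 2.5, 2.5, 4.0]
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # → 0.8
```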
16. Population density   No. of districts
    100–120              1
    120–140              3
    140–160              4
    160–180              6
    180–200              8
    200–220              14
    220–240              12
    240–260              11
    260–280              15
    280–300              7
20. The coefficient of determination, denoted R2 or r2 and pronounced "R
squared", is the proportion of the variance in the dependent variable that
is predictable from the independent variable(s).
It is a statistic used in the context of statistical models whose main
purpose is either the prediction of future outcomes or the testing
of hypotheses, on the basis of related information. It provides a measure
of how well observed outcomes are replicated by the model, based on
the proportion of total variation of outcomes explained by the model.
There are several definitions of R2 that are only sometimes equivalent.
One class of such cases includes that of simple linear
regression where r2 is used instead of R2. When an intercept is included,
then r2 is simply the square of the sample correlation coefficient (r)
between the observed outcomes and the observed predictor values. If
additional regressors are included, R2 is the square of the coefficient of
multiple correlation. In both such cases, the coefficient of determination
normally ranges from 0 to 1.
Coefficient of Determination
21. Steps to Find the Coefficient of Determination
Find r, the correlation coefficient.
Square r.
Multiply by 100 to express it as a percentage.
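These three steps can be sketched on the Xi/Yi data from the worked example on slide 25 (the 0.7941 computed there is exactly this value):

```python
import math

x = [16, 14, 12, 11, 12, 15, 13, 11]            # Xi from slide 25
y = [3.5, 3.2, 3.0, 2.6, 2.9, 3.3, 2.7, 2.8]    # Yi from slide 25

# Step 1: find r, the Pearson correlation coefficient
n = len(x)
mx, my = sum(x) / n, sum(y) / n
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(
    sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Step 2: square r.  Step 3: express it as a percentage.
r2 = r ** 2
print(f"r = {r:.4f}, R² = {r2:.4f} = {r2 * 100:.1f}% of variance explained")
```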
22. How to interpret the coefficient of determination?
The coefficient of determination, or the R-squared value, is a value
between 0.0 and 1.0 that expresses what proportion of the variance
in Y can be explained by X:
If R2 = 1, then we have a perfect fit, which means that the values
of Y are fully determined (i.e., without any error) by the values
of X, and all data points lie precisely at the estimated best-fit line.
If R2 = 0, then our model is no better at predicting the values
of Y than the model which always returns the average value
of Y as a prediction.
Multiplying R2 by 100%, you get the percentage of the variance
in Y which is explained with help of X. For instance:
If R2 = 0.8, then 80% of the variance in Y is predicted by X
If R2 = 0.5 then half of the variance in Y can be explained by X
The complementary percentage, i.e., (1 - R2) * 100%, quantifies the
unexplained variance:
If R2 = 0.6, then 60% of the variance in Y has been explained with
help of X, while the remaining 40% remains unaccounted for.
23. Formula for the Coefficient of Determination, R2
Here are a few (equivalent) formulae:
R2 = SSR / SST
or
R2 = 1 - SSE / SST
or
R2 = SSR / (SSR + SSE)
To be discussed after regression.
24. The sum of squares of errors (SSE for short), also called
the residual sum of squares:
SSE = ∑(yᵢ − ŷᵢ)²
SSE quantifies the discrepancy between the real values of Y
and those predicted by our model.
The regression sum of squares (shortened to SSR), which
is sometimes also called the explained sum of squares:
SSR = ∑(ŷᵢ − ȳ)²
SSR measures the difference between the values predicted
by the regression model and those predicted in the most
basic way, namely by ignoring X completely and using only
the average value of Y as a universal predictor.
The total sum of squares (SST), which quantifies the total
variability in Y:
SST = ∑(yᵢ − ȳ)²
It turns out that these three sums of squares satisfy:
SST = SSR + SSE
so you only need to calculate any two of them, and the
remaining one can be easily found!
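The identity SST = SSR + SSE and the three equivalent R² formulas can be checked numerically. This sketch uses the slide-25 data and its least-squares line Ŷ = 1.05 + 0.15X:

```python
x = [16, 14, 12, 11, 12, 15, 13, 11]
y = [3.5, 3.2, 3.0, 2.6, 2.9, 3.3, 2.7, 2.8]
y_hat = [1.05 + 0.15 * xi for xi in x]           # predictions from the fitted line
y_bar = sum(y) / len(y)                          # Ȳ = 3.0

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares

print(round(sse, 2), round(ssr, 2), round(sst, 2))      # → 0.14 0.54 0.68
print(round(ssr / sst, 4), round(1 - sse / sst, 4),
      round(ssr / (ssr + sse), 4))                      # → 0.7941 0.7941 0.7941
```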
25. Worked example (Ŷ = 1.05 + 0.15X, Ȳ = 3.0):
Yi    Xi   Ŷ      SSE = (Ŷ − Yi)²   SSR = (Ŷ − Ȳ)²   SST = (Yi − Ȳ)²
3.5   16   3.45   0.0025            0.2025           0.25
3.2   14   3.15   0.0025            0.0225           0.04
3.0   12   2.85   0.0225            0.0225           0.00
2.6   11   2.70   0.0100            0.0900           0.16
2.9   12   2.85   0.0025            0.0225           0.01
3.3   15   3.30   0.0000            0.0900           0.09
2.7   13   3.00   0.0900            0.0000           0.09
2.8   11   2.70   0.0100            0.0900           0.04
SUM               0.1400            0.5400           0.68
Mean: Ȳ = 3.0, mean(Ŷ) = 3.0
R² = SSR/SST = 0.7941
R² = 1 − SSE/SST = 0.7941
R² = SSR/(SSR + SSE) = 0.7941
26. Worked example (Ŷ = 20.57 + 0.86X, mean of Ŷ = 36.91):
Yi      Xi      Ŷ       SSE = (Ŷ − Yi)²   SSR = (Ŷ − mean Ŷ)²
25.00   12.00   30.89   34.69             36.24
35.00   18.00   36.05   1.10              0.74
58.00   22.00   39.49   342.62            6.66
37.00   15.00   33.47   12.46             11.83
27.00   25.00   42.07   227.10            26.63
45.00   17.00   35.19   96.24             2.96
62.00   22.00   39.49   506.70            6.66
32.00   32.00   48.09   258.89            124.99
12.00   8.00    27.45   238.70            89.49
SUM                     1718.51           306.19
Fitted line (Y = a + bX): y = 0.8647x + 20.57, i.e. a = 20.57
and b ≈ 0.86 (the rounded slope used for the Ŷ column)
R² = SSR/(SSR + SSE) = 0.1512
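The slide-26 computation can be reproduced as follows. Note that it uses the rounded slope b = 0.86 for the predicted values and R² = SSR/(SSR + SSE), with SSR measured around the mean of the predictions (≈ 36.91):

```python
x = [12, 18, 22, 15, 25, 17, 22, 32, 8]
y = [25, 35, 58, 37, 27, 45, 62, 32, 12]

y_hat = [20.57 + 0.86 * xi for xi in x]          # fitted line from the slide
y_hat_bar = sum(y_hat) / len(y_hat)              # ≈ 36.91

sse = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))   # ≈ 1718.51
ssr = sum((yh - y_hat_bar) ** 2 for yh in y_hat)        # ≈ 306.19
r2 = ssr / (ssr + sse)
print(round(r2, 4))  # → 0.1512
```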
27. In any distribution the line of best fit is known as the
regression line. In a bivariate distribution there are two
regression lines because there are two variables. If X and Y
are the two variables, we get the regression of X on Y and
of Y on X, i.e., by assigning a set of values to X we obtain
a set of values for Y, and likewise a set of values for X can
be obtained for a given set of values of Y.
The line can be fitted by the method of least squares, i.e.,
the sum of the squared deviations from the expected values
is a minimum.
The least-squares method states that the curve that best fits a
given set of observations is the curve having the minimum
sum of squared residuals (or deviations, or errors) from the
given data points.
Linear regression
28. Formula
Y = a + bX
Now, here we need to find the value of the slope of the line, b,
plotted in the scatter plot, and the intercept, a. For a
least-squares fit these are:
b = (NΣXY − ΣX·ΣY) / (NΣX² − (ΣX)²)
a = (ΣY − b·ΣX) / N
Where
N = number of observations
X = variable X
Y = variable Y
X Y
1 1
2 4
3 5
4 6
5
32. X Y
12 25
18 35
22 58
15 37
25 27
17 45
22 62
32 32
8 12
Find Y if X = 25, 50, 75 & 100
Exercise
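A least-squares sketch for this exercise (the data are the same as in slide 26, so a ≈ 20.57 and b ≈ 0.8647 should come out of the formulas above):

```python
x = [12, 18, 22, 15, 25, 17, 22, 32, 8]
y = [25, 35, 58, 37, 27, 45, 62, 32, 12]
n = len(x)

# Least-squares slope and intercept for Y = a + bX
b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / (
    n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = (sum(y) - b * sum(x)) / n
print(f"Y = {a:.2f} + {b:.4f}·X")   # → Y = 20.57 + 0.8647·X

# Predicted Y for the requested X values (extrapolation beyond X = 32)
for xv in (25, 50, 75, 100):
    print(xv, round(a + b * xv, 2))
```

Note that X = 50, 75 and 100 lie well outside the observed range of X (8 to 32), so those predictions are extrapolations.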
Editor's Notes
Monotonic: a sequence or function; consistently increasing and never decreasing or consistently decreasing and never increasing in value
In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.