This document provides an overview of correlation analysis and how it can be used. It defines key terms like correlation, outliers, and ranked variables. It then gives examples of calculating Pearson's correlation coefficient and Spearman's rank correlation coefficient to demonstrate positive, negative, and weak/no correlations. It also discusses input parameters, sample user interfaces, interpreting outputs, and limitations. Finally, it provides business use cases for how correlation analysis can be applied to understand customer purchasing behavior, set credit limits, and cluster customers.
4. Terminologies
CORRELATION :
Correlation is a statistical measure that indicates the extent to which two variables fluctuate together.
A positive correlation indicates the extent to which those variables increase or decrease in parallel.
A negative correlation indicates the extent to which one variable increases as the other decreases.
OUTLIERS :
Observations that lie outside the overall pattern of a distribution.
RANKED/ORDINAL VARIABLES :
A variable whose set of values is ordered. For instance:
High school class rankings: 1st, 2nd, 3rd, etc.
Socioeconomic class: working, middle, upper.
The Likert scale: strongly disagree, disagree, agree, strongly agree, etc.
Terminologies > Introduction & example > Standard input/tuning parameters & Sample UI > Sample output UI > Interpretation of Output > Limitations > Business use cases
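5. Introduction : Karl Pearson's correlation coefficient
• Karl Pearson's correlation coefficient (r) measures the degree of linear relationship between two variables
• If the relationship between two variables X and Y is to be ascertained through the Karl Pearson method, then the following formula is used, where n is the number of (X, Y) observations:
$$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[n \sum X^2 - \left(\sum X\right)^2\right]\left[n \sum Y^2 - \left(\sum Y\right)^2\right]}}$$
• The value of the coefficient of correlation always lies between −1 and +1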
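The raw-sum form of Pearson's formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the toy input values are assumptions made for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums n, ΣX, ΣY, ΣX², ΣY² and ΣXY."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # A constant variable has zero variance, so r is undefined.
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Toy data (made up for illustration): Y is exactly 2 * X.
print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
```

The zero-denominator check matters in practice: if either variable never changes, the coefficient is undefined and should not be reported as 0.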
6. Introduction : Interpretation of correlation coefficient
r value            Interpretation
+.70 or higher     Very strong positive relationship
+.40 to +.69       Strong positive relationship
+.30 to +.39       Moderate positive relationship
+.20 to +.29       Weak positive relationship
+.01 to +.19       Negligible relationship
0                  No relationship [zero order correlation]
-.01 to -.19       Negligible relationship
-.20 to -.29       Weak negative relationship
-.30 to -.39       Moderate negative relationship
-.40 to -.69       Strong negative relationship
-.70 or lower      Very strong negative relationship
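The interpretation table can be turned into a small lookup, sketched below. The function name `interpret_r` is hypothetical. The table is stated to two decimals, so this sketch assumes that a value falling between two bands (for example 0.195) belongs to the lower band.

```python
def interpret_r(r):
    """Map a correlation coefficient to the verbal label used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if r == 0:
        return "No relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= 0.70:
        return f"Very strong {direction} relationship"
    if size >= 0.40:
        return f"Strong {direction} relationship"
    if size >= 0.30:
        return f"Moderate {direction} relationship"
    if size >= 0.20:
        return f"Weak {direction} relationship"
    return "Negligible relationship"


print(interpret_r(-0.96))  # Very strong negative relationship
print(interpret_r(0.23))   # Weak positive relationship
```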
7. Example : Karl Pearson's correlation : Positive correlation
Let’s compute the Pearson correlation coefficient between the X and Y variables :
X    1    2    3    4    5
Y    10   20   30   40   50

X         Y          X²         Y²           XY
1         10         1          100          10
2         20         4          400          40
3         30         9          900          90
4         40         16         1600         160
5         50         25         2500         250
ΣX = 15   ΣY = 150   ΣX² = 55   ΣY² = 5500   ΣXY = 550

Substituting the relevant values (n = 5) in the formula we get :
$$r = \frac{5(550) - (15)(150)}{\sqrt{\left[5(55) - 15^2\right]\left[5(5500) - 150^2\right]}} = \frac{500}{\sqrt{50 \times 5000}} = \frac{500}{500} = 1$$
Hence there is perfect positive correlation between X and Y, i.e. if X increases, Y also increases and vice versa.
8. Example : Karl Pearson's correlation : Weak/No correlation
Let’s compute the Pearson correlation coefficient between the X and Y variables :
X    1    2    3    4    5
Y    20   5    9    25   16

X         Y         X²         Y²           XY
1         20        1          400          20
2         5         4          25           10
3         9         9          81           27
4         25        16         625          100
5         16        25         256          80
ΣX = 15   ΣY = 75   ΣX² = 55   ΣY² = 1387   ΣXY = 237

Substituting the relevant values (n = 5) in the formula we get :
$$r = \frac{5(237) - (15)(75)}{\sqrt{\left[5(55) - 15^2\right]\left[5(1387) - 75^2\right]}} = \frac{60}{\sqrt{50 \times 1310}} \approx \frac{60}{255.93} \approx 0.23$$
Hence there is weak/no significant correlation between X and Y, i.e. an increase or decrease in X is not accompanied by a consistent change in Y and vice versa.
9. Example : Karl Pearson's correlation : Negative correlation
Let’s compute the Pearson correlation coefficient between the X and Y variables :
X    1    2    3    4    5
Y    25   20   9    5    4

X         Y         X²         Y²           XY
1         25        1          625          25
2         20        4          400          40
3         9         9          81           27
4         5         16         25           20
5         4         25         16           20
ΣX = 15   ΣY = 63   ΣX² = 55   ΣY² = 1147   ΣXY = 132

Substituting the relevant values (n = 5) in the formula we get :
$$r = \frac{5(132) - (15)(63)}{\sqrt{\left[5(55) - 15^2\right]\left[5(1147) - 63^2\right]}} = \frac{-285}{\sqrt{50 \times 1766}} \approx \frac{-285}{297.15} \approx -0.96$$
Hence there is strong negative correlation between X and Y, i.e. if X increases, Y decreases and vice versa.
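As a cross-check on the three Pearson examples, the coefficient can also be computed in its equivalent deviation-from-mean form, r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²). The sketch below uses the X and Y values from the three example tables; the function name `pearson_r_centered` and the dictionary labels are assumptions made for this illustration.

```python
from math import sqrt


def pearson_r_centered(xs, ys):
    """Pearson's r from deviations about the means (same value as the raw-sum form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5]
examples = {
    "positive": [10, 20, 30, 40, 50],
    "weak/no": [20, 5, 9, 25, 16],
    "negative": [25, 20, 9, 5, 4],
}
for label, y in examples.items():
    print(f"{label}: r = {pearson_r_centered(x, y):.2f}")
```

Run as written, this prints r = 1.00, r = 0.23 and r = -0.96 for the positive, weak/no and negative examples respectively.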
10. Introduction : Spearman’s Rank correlation coefficient
Spearman’s rank correlation is a measure of correlation between two ranked (ordered) variables
It measures the strength and direction of association between two sets of data when each is ranked by its values
If the strength of association between two variables is to be ascertained through Spearman’s rank correlation method, then the following formula is used:
Spearman’s coefficient :
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
Where
n : Number of observations
d : Difference between the two ranks of each observation
The value of rs is :
−1 ≤ rs ≤ +1
rs = +1 when there is complete agreement among rankings
rs = −1 when there is complete disagreement among rankings
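Spearman's formula can be sketched in Python as follows. The helper names `ranks` and `spearman_rs` and the toy inputs are assumptions for illustration. The simplified 6Σd² formula is exact only when there are no tied values, so this sketch rejects ties rather than guessing how to handle them.

```python
def ranks(values):
    """Rank each value by its position in ascending order (rank 1 = smallest)."""
    if len(set(values)) != len(values):
        raise ValueError("tied values: the simplified formula does not apply")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(xs, ys):
    """Spearman's rs = 1 - 6 * Σd² / (n * (n² - 1)) for untied data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Toy data (made up): identical ordering, then fully reversed ordering.
print(spearman_rs([10, 20, 30, 40], [1, 2, 3, 4]))  # 1.0
print(spearman_rs([10, 20, 30, 40], [4, 3, 2, 1]))  # -1.0
```

Because only the ranks enter the formula, any monotonic rescaling of X or Y leaves rs unchanged.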
11. Example : Spearman’s Rank Correlation : Positive correlation
Let’s compute the Spearman’s correlation coefficient between two ranked variables X and Y :
X    Y    Rank of X    Rank of Y    d    d²
5    9    1            1            0    0
6    10   2            2            0    0
7    11   3            3            0    0
                                    Σd² = 0
Spearman coefficient :
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 0}{3(3^2 - 1)} = 1 - 0 = 1$$
rs = 1 ~ Perfect positive correlation
The closer this value is to ±1, the stronger the relationship between the variables
The closer this value is to 0, the weaker the relationship/association between the variables
12. Example : Spearman’s Rank Correlation : Negative correlation
Let’s compute the Spearman’s correlation coefficient between two ranked variables X and Y :
X    Y    Rank of X    Rank of Y    d     d²
6    3    1            3            −2    4
8    2    2            2            0     0
9    1    3            1            2     4
                                    Σd² = 8
Spearman coefficient :
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 8}{3(3^2 - 1)} = 1 - \frac{48}{24} = 1 - 2 = -1$$
rs = −1 ~ Perfect negative correlation
13. Example : Spearman’s Rank Correlation : Weak/No correlation
Let’s compute the Spearman’s correlation coefficient between two ranked variables X and Y :
X    Y    Rank of X    Rank of Y    d     d²
6    9    2            3            −1    1
7    3    3            2            1     1
4    2    1            1            0     0
                                    Σd² = 2
Spearman coefficient :
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 2}{3(3^2 - 1)} = 1 - \frac{12}{24} = 1 - 0.5 = 0.5$$
rs = 0.5 ~ The rankings agree only partly. With just three observations, three of the six possible orderings of Y give rs of 0.5 or more, so this value is not evidence of a real association
14. Input/Tuning Parameters And Sample User Interface
Step 1 : Select any two input variables for which you want to find the correlation, for example:
Age in years
Loan amount
Monthly income in thousands
Debt to income ratio
Step 2 : Is the data ranked?
Yes : Select Spearman’s Correlation
No : Select Pearson Correlation
Step 3 : Display a scatter plot along with the correlation coefficient and its interpretation
"Ranking" : Numerical values are replaced by their rank when the data are sorted. Rank is the position of a value in ascending order.
For example, the ranks of the data items 3.4, 5.1, 2.6, 7.3 would be 2, 3, 1 and 4 respectively
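The ranking rule and the Step 2 decision can be sketched as below. Both function names are hypothetical. The sketch assumes distinct values: tied values would simply receive consecutive ranks in their original order, which may not be the treatment a real tool applies.

```python
def rank_ascending(values):
    """Replace each value by its 1-based position in ascending order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def choose_method(data_is_ranked):
    """Step 2 of the flow: ranked data -> Spearman, otherwise Pearson."""
    return "spearman" if data_is_ranked else "pearson"


print(rank_ascending([3.4, 5.1, 2.6, 7.3]))  # [2, 3, 1, 4]
print(choose_method(data_is_ranked=True))    # spearman
```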
15. Sample Output User Interface : Positive Correlation
A scatterplot of the selected input variables should be shown along with the correlation coefficient value and its interpretation, as shown in the image below :
16. Sample Output User Interface : Negative Correlation
17. Sample Output User Interface : No Correlation
18. Limitations
Karl Pearson correlation is affected by outliers
Both correlation methods measure the strength of relationship between only two variables, without taking into consideration the fact that both these variables may be influenced by a third variable
•For example, sales of ice cream and sales of cold drinks both depend on the weather conditions of the area. They may show a positive correlation even though neither one drives the other
Both methods can handle only numeric data
In case of categorical ordinal (ranked) variables, they need to be converted into numeric ranks in order to proceed with Spearman’s correlation
•For instance, a survey response variable may take values such as “very dissatisfied”, “dissatisfied”, “neutral”, “satisfied”, “very satisfied”; these responses have to be converted into numeric ranks 1, 2, 3, 4, 5 respectively, as shown in the figure on the right
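Two of these limitations are easy to see in a short sketch: converting ordinal survey labels to numeric ranks, and the effect of a single outlier on Pearson's r. The names `LIKERT_RANKS`, `to_ranks` and `pearson_r` are assumptions for this illustration, and the outlier point (6, 0) is made up.

```python
from math import sqrt

# Ordinal labels mapped to the numeric ranks 1..5 described above.
LIKERT_RANKS = {
    "very dissatisfied": 1,
    "dissatisfied": 2,
    "neutral": 3,
    "satisfied": 4,
    "very satisfied": 5,
}


def to_ranks(responses):
    """Convert survey labels to numeric ranks (case and padding are ignored)."""
    return [LIKERT_RANKS[r.strip().lower()] for r in responses]


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


print(to_ranks(["Neutral", "Very satisfied", "dissatisfied"]))  # [3, 5, 2]

# A perfectly linear sample, then the same sample plus one made-up outlier.
xs, ys = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
r_clean = pearson_r(xs, ys)
r_outlier = pearson_r(xs + [6], ys + [0])
print(f"{r_clean:.2f} -> {r_outlier:.2f}")  # 1.00 -> 0.14
```

One extra point is enough to move the coefficient from a perfect relationship to a negligible one, which is why a scatter plot should be inspected alongside the number.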
19. Business use case 1 : Karl Pearson Correlation
Business problem :
The marketing manager of an online retailer wants to understand the influence of age on purchasing frequency
Input data :
Variable containing the age of each customer
Variable containing the frequency of purchases per month for each customer
Business benefit :
Depending on the correlation between the above two factors, the marketing manager can identify which age segments are frequent or infrequent purchasers. Infrequent purchasers can then be targeted with more attractive offers, such as discounts or cash backs, to convert them into frequent purchasers.
20. Business use case 1 : Business decisions based on correlation output
For instance, in case of negative correlation
• (r < −0.3), i.e. the higher the age, the lower the purchasing frequency; older customer segments can be targeted more effectively to convert them to frequent purchasers and in turn increase revenue
In case of weak or no correlation
• (−0.3 ≤ r ≤ 0.3), there is no meaningful relation between age and purchase frequency, therefore there is no need to take the age factor into account while designing a target marketing strategy
[Figure: purchase frequency plotted against age]
21. Business use case 2 : Karl Pearson Correlation
Business problem :
•A bank wants to find the correlation between the income and the credit card delinquency rate of its credit card holders
Input data :
•Delinquency rate of each credit card customer
•Monthly income of each credit card customer
Business benefit :
•The bank's credit card manager can decide on an individual’s credit limit eligibility depending on the correlation coefficient value between the above two factors, i.e. income and delinquency rate of customers
22. Business use case 2 : Business decisions based on correlation output
If r < −0.5 (strong negative correlation between income and credit card delinquency rate, i.e. the higher the income, the lower the delinquency, which is usually the case)
•then a lower credit limit should be given to the low income segment and a higher credit limit to the high income segment in order to reduce the number of credit card delinquents
If r > 0.5 (strong positive correlation), it means the higher the income, the higher the delinquency rate (which is unusual).
•In this case, the high income segment is prone to become delinquent on credit card payments and hence the credit limit should be set lower for them compared to the low income segment
If −0.3 ≤ r ≤ 0.3 then
•delinquency has little to do with the income of a customer, hence there is no need to decide the credit limit depending on an individual’s income
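The three rules above can be written as a small decision function, sketched here. The function name and the returned wording are hypothetical. The rules do not say what to do when r falls between the stated bands (for example r = 0.4), so the sketch reports that case explicitly instead of inventing a policy.

```python
def credit_limit_guidance(r):
    """Translate the income/delinquency correlation into the guidance above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if r < -0.5:
        return "lower limit for low income segment, higher limit for high income segment"
    if r > 0.5:
        return "lower limit for high income segment than for low income segment"
    if -0.3 <= r <= 0.3:
        return "income need not drive the credit limit"
    # 0.3 < |r| <= 0.5 is not addressed by the stated rules.
    return "not covered by the stated rules"


for value in (-0.8, 0.1, 0.4, 0.7):
    print(value, "->", credit_limit_guidance(value))
```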
[Figure: delinquency plotted against income]
23. Business use case 3 : Spearman’s Rank Correlation
Business problem :
• The head of an educational institute wants to see the extent of agreement on students’ ratings between two different observers of faculty teaching
Input data :
• Students’ ratings by the department chairs
• Students’ ratings by the faculty members
Business benefit :
•This will help the institute head to understand how consistent the two observers were in rating the students
•If the ranks given by the two observers were similar to each other, the institute head would be able to put more faith in the ratings than if the observers' ranks varied widely from each other
•This will also reduce the chance of unethical / biased rating
24. Business Use Case 4 : Spearman’s Rank Correlation
Business problem :
•A market research agency wants to cluster survey responders based on the rank correlation output
Input data :
•Responders’ responses on brand loyalty, given on a Likert scale of 1 to 5, where 1 represents disloyal, 2 somewhat disloyal and so on
•Responders’ frequency of brand visits per month (here responders with visits above 10 per quarter can be ranked as “1”, between 8 and 10 as “2” and so on)
Business benefit :
•If the values of brand visits and brand loyalty turn out to be positively correlated then we can cluster the frequently visiting customers into a “brand loyal” segment and rarely visiting customers into a “brand disloyal” segment
•Upon identification of these disloyal customers, they can be converted into loyal ones by identifying a way of increasing their visit frequency