What is Karl Pearson Correlation Analysis and How Can it be Used for Enterprise Analysis Needs?

Master the Science of Analytics
A Series of Data Science Topics that Helps with your Journey towards Augmented Analytics

Correlation Analysis
Parameter Tuning & Use cases

What Is Covered
Terminologies
Introduction & Example
Input/tuning parameters & Sample UI
Sample output UI
Interpretation of Output
Limitations
Business use cases

Terminologies
CORRELATION :
Correlation is a statistical measure that indicates the extent to which two variables fluctuate together
A positive correlation indicates the extent to which those variables increase or decrease in parallel
A negative correlation indicates the extent to which one variable increases as the other decreases
OUTLIERS :
Observations lying outside the overall pattern of a distribution
[Figure: Outliers]
RANKED/ORDINAL VARIABLES :
A variable whose set of values is ordered. For instance:
High school class rankings: 1st, 2nd, 3rd, etc.
Socioeconomic class: working, middle, upper
The Likert scale: disagree, agree, strongly agree, etc.

Introduction : Karl Pearson's correlation coefficient
• Karl Pearson's correlation coefficient (r) measures the degree of linear relationship between two variables
• If the relationship between two variables X and Y is to be ascertained through the Karl Pearson method, then the following formula is used, where n is the number of paired observations:
r = ( nΣXY − ΣX ΣY ) / √( [ nΣX² − (ΣX)² ] [ nΣY² − (ΣY)² ] )
• The value of the coefficient of correlation always lies between −1 and +1
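The Pearson coefficient is commonly computed from the totals ΣX, ΣY, ΣX², ΣY² and ΣXY. The sketch below is one minimal way to do that in Python, assuming the data arrive as two plain lists of numbers; the function name `pearson_r` is a hypothetical helper, not part of any product described here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from sums of X, Y, X^2, Y^2 and XY."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # A constant variable has zero variance, so r is undefined.
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator


print(pearson_r([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # 1.0
```

The guard on a zero denominator is a design choice of this sketch: it raises an error instead of returning a misleading number when one variable never changes.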

Introduction : Interpretation of correlation coefficient
r value           Interpretation
+.70 or higher    Very strong positive relationship
+.40 to +.69      Strong positive relationship
+.30 to +.39      Moderate positive relationship
+.20 to +.29      Weak positive relationship
+.01 to +.19      Negligible relationship
0                 No relationship (zero correlation)
-.01 to -.19      Negligible relationship
-.20 to -.29      Weak negative relationship
-.30 to -.39      Moderate negative relationship
-.40 to -.69      Strong negative relationship
-.70 or lower     Very strong negative relationship
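The interpretation bands can be expressed as a small lookup. This is a sketch only: the function name `interpret_r` is hypothetical, and it assumes that a value falling between two listed bands (for example 0.695) belongs to the lower band, which the table itself does not specify.

```python
def interpret_r(r):
    """Map a correlation coefficient to the verbal label used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if r == 0:
        return "No relationship"
    direction = "positive" if r > 0 else "negative"
    strength = abs(r)
    # Assumption: each band starts at its lower bound and runs up to the next one.
    if strength >= 0.70:
        return f"Very strong {direction} relationship"
    if strength >= 0.40:
        return f"Strong {direction} relationship"
    if strength >= 0.30:
        return f"Moderate {direction} relationship"
    if strength >= 0.20:
        return f"Weak {direction} relationship"
    return "Negligible relationship"


print(interpret_r(0.45))   # Strong positive relationship
print(interpret_r(-0.96))  # Very strong negative relationship
```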

Example : Karl Pearson's correlation : Positive correlation
Let’s compute the Pearson correlation coefficient between X and Y variables :
X    1    2    3    4    5
Y    10   20   30   40   50
X    Y     X²    Y²      XY
1    10    1     100     10
2    20    4     400     40
3    30    9     900     90
4    40    16    1600    160
5    50    25    2500    250
ΣX = 15   ΣY = 150   ΣX² = 55   ΣY² = 5500   ΣXY = 550
Substituting the relevant values in the formula (n = 5) we get :
r = ( 5×550 − 15×150 ) / √( (5×55 − 15²) (5×5500 − 150²) ) = 500 / √( 50 × 5000 ) = 500/500 = +1
Hence there is perfect positive correlation between X and Y
i.e. If X increases, Y also increases and vice versa

Example : Karl Pearson's correlation : Weak/No correlation
X    1    2    3    4    5
Y    20   5    9    25   16
X    Y     X²    Y²     XY
1    20    1     400    20
2    5     4     25     10
3    9     9     81     27
4    25    16    625    100
5    16    25    256    80
ΣX = 15   ΣY = 75   ΣX² = 55   ΣY² = 1387   ΣXY = 237
r = ( 5×237 − 15×75 ) / √( (5×55 − 15²) (5×1387 − 75²) ) = 60 / √( 50 × 1310 ) ≈ 60/256 ≈ 0.23
Hence there is weak/no significant correlation between X and Y
i.e. If X increases or decreases, it has no significant impact on Y and vice versa

Example : Karl Pearson's correlation : Negative correlation
X    1    2    3    4    5
Y    25   20   9    5    4
X    Y     X²    Y²     XY
1    25    1     625    25
2    20    4     400    40
3    9     9     81     27
4    5     16    25     20
5    4     25    16     20
ΣX = 15   ΣY = 63   ΣX² = 55   ΣY² = 1147   ΣXY = 132
r = ( 5×132 − 15×63 ) / √( (5×55 − 15²) (5×1147 − 63²) ) = −285 / √( 50 × 1766 ) ≈ −285/297 ≈ −0.96
Hence there is very strong negative correlation between X and Y
i.e. If X increases, Y decreases and vice versa
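The three worked examples can be checked by plugging the column totals from each table into the sums-based Pearson formula. The sketch below assumes exactly that; `pearson_from_sums` and the dictionary keys are hypothetical names chosen for illustration.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson r from the totals of the X, Y, X^2, Y^2 and XY columns."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Totals copied from the three example tables: (n, ΣX, ΣY, ΣX², ΣY², ΣXY)
examples = {
    "positive": (5, 15, 150, 55, 5500, 550),
    "weak": (5, 15, 75, 55, 1387, 237),
    "negative": (5, 15, 63, 55, 1147, 132),
}

results = {name: pearson_from_sums(*totals) for name, totals in examples.items()}
for name, r in results.items():
    print(f"{name}: r = {r:.2f}")
# positive: r = 1.00
# weak: r = 0.23
# negative: r = -0.96
```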

Introduction : Spearman’s Rank correlation coefficient
Spearman’s rank correlation is a measure of correlation between two ranked (ordered) variables
It measures the strength and direction of association between two sets of data when each is ranked by its values
If the strength of association between two variables is to be ascertained through Spearman’s rank correlation method, then the following formula is used:
Spearman’s coefficient : rs = 1 − ( 6 Σd² / n (n² − 1) )
Where
n : Number of observations
d : Difference between the two ranks of each observation
The value of rs is :
+1 when there is complete agreement among rankings
−1 when there is complete disagreement among rankings
−1 ≤ rs ≤ +1
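Spearman’s coefficient, rs = 1 − 6Σd² / (n(n² − 1)), translates almost directly into code. The sketch below assumes the two inputs are already ranks and that no rank is tied, which is the case this shortcut formula is meant for; `spearman_rs` is a hypothetical helper name.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks (no ties assumed)."""
    if len(rank_x) != len(rank_y) or len(rank_x) < 2:
        raise ValueError("need two equal-length rank lists with at least 2 items")
    n = len(rank_x)
    # d is the difference between the two ranks of each observation.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


print(spearman_rs([1, 2, 3], [1, 2, 3]))  # 1.0  (complete agreement)
print(spearman_rs([1, 2, 3], [3, 2, 1]))  # -1.0 (complete disagreement)
```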

Example : Spearman’s Rank Correlation : Positive correlation
Let’s compute the Spearman’s correlation coefficient between two ranked variables X and Y :
X    Y     Rank of X    Rank of Y    d    d²
5    9     1            1            0    0
6    10    2            2            0    0
7    11    3            3            0    0
Σd² = 0
Spearman coefficient rs = 1 − ( 6 Σd² / n (n² − 1) )
= 1 − ( 6×0 / 3 (9 − 1) )
= 1 − 0
= 1 ~ Perfect positive correlation
The closer this value is to ±1, the stronger the relationship between the variables
The closer this value is to 0, the weaker the relationship/association between both variables

Example : Spearman’s Rank Correlation : Negative correlation
X    Y    Rank of X    Rank of Y    d     d²
6    3    1            3            −2    4
8    2    2            2            0     0
9    1    3            1            2     4
Σd² = 8
rs = 1 − ( 6 Σd² / n (n² − 1) )
= 1 − ( 6×8 / 3 (9 − 1) )
= 1 − ( 48/24 )
= 1 − 2
= −1 ~ Perfect negative correlation

Example : Spearman’s Rank Correlation : Weaker correlation
X    Y    Rank of X    Rank of Y    d     d²
6    9    2            3            −1    1
7    3    3            2            1     1
4    2    1            1            0     0
Σd² = 2
rs = 1 − ( 6 Σd² / n (n² − 1) )
= 1 − ( 6×2 / 3 (9 − 1) )
= 1 − ( 12/24 )
= 1 − 0.5
= 0.5 ~ Imperfect positive correlation: the two rankings agree only partly
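The Spearman examples can be reproduced step by step from the rank columns of the three tables. In this sketch the d and d² columns are recomputed from the ranks rather than copied from the tables; `spearman_steps` and the dictionary keys are hypothetical names.

```python
def spearman_steps(rank_x, rank_y):
    """Return the d column, the d^2 column and rs for two rank lists (no ties)."""
    d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
    d2 = [value * value for value in d]
    n = len(rank_x)
    rs = 1 - 6 * sum(d2) / (n * (n ** 2 - 1))
    return d, d2, rs


# (Rank of X, Rank of Y) columns taken from the three example tables.
examples = {
    "X = 5, 6, 7 / Y = 9, 10, 11": ([1, 2, 3], [1, 2, 3]),
    "X = 6, 8, 9 / Y = 3, 2, 1": ([1, 2, 3], [3, 2, 1]),
    "X = 6, 7, 4 / Y = 9, 3, 2": ([2, 3, 1], [3, 2, 1]),
}

for label, (rank_x, rank_y) in examples.items():
    d, d2, rs = spearman_steps(rank_x, rank_y)
    print(f"{label}: d = {d}, d^2 = {d2}, sum = {sum(d2)}, rs = {rs}")
# X = 5, 6, 7 / Y = 9, 10, 11: d = [0, 0, 0], d^2 = [0, 0, 0], sum = 0, rs = 1.0
# X = 6, 8, 9 / Y = 3, 2, 1: d = [-2, 0, 2], d^2 = [4, 0, 4], sum = 8, rs = -1.0
# X = 6, 7, 4 / Y = 9, 3, 2: d = [-1, 1, 0], d^2 = [1, 1, 0], sum = 2, rs = 0.5
```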

Input/Tuning Parameters And Sample User Interface
Step 1 : Select any two input variables for which you want to find correlation
Age in years
Loan amount
Monthly income in thousands
Debt to income ratio
Step 2 : Is the data ranked?
Yes : Select Spearman’s Correlation
No : Select Pearson Correlation
Step 3 : Display scatter plot along with correlation coefficient and its interpretation
"Ranking" : Numerical values are replaced by their rank when the data are sorted. Rank is a position in the ascending order of values.
For example, the ranks of the data items 3.4, 5.1, 2.6, 7.3 would be 2, 3, 1 and 4 respectively
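The ranking rule and the choice made in Step 2 can be sketched in a few lines. Both function names are hypothetical, and the ranking helper assumes all values are distinct (tied values would need a tie-handling rule that the slide does not describe).

```python
def rank_values(values):
    """Replace each value by its position in ascending order (1 = smallest)."""
    ordered = sorted(values)
    return [ordered.index(value) + 1 for value in values]


def choose_method(is_ranked):
    """Step 2 of the flow: ranked data uses Spearman, otherwise Pearson."""
    return "Spearman's correlation" if is_ranked else "Pearson correlation"


print(rank_values([3.4, 5.1, 2.6, 7.3]))  # [2, 3, 1, 4]
print(choose_method(is_ranked=True))      # Spearman's correlation
print(choose_method(is_ranked=False))     # Pearson correlation
```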

Sample Output User Interface : Positive Correlation
A scatterplot of the selected input variables should be shown along with the correlation coefficient value and its interpretation, as shown in the image below :

Sample Output User Interface : Negative Correlation

Sample Output User Interface : No Correlation

Limitations
Karl Pearson correlation is affected by outliers
Both correlation methods measure the strength of relationship between only two variables, without taking into consideration the fact that both these variables may be influenced by a third variable
• For example, sale of ice cream and sale of cold drinks are both related to the weather conditions of the area. They may show a positive correlation even though neither one drives the other
Both methods can handle only numeric data
In case of categorical ordinal (ranked) variables, they need to be converted into numeric ranks in order to proceed with Spearman’s correlation
• For instance, a survey response variable may have values such as “very dissatisfied”, “dissatisfied”, “neutral”, “satisfied”, “very satisfied”; these responses have to be converted into numeric ranks 1, 2, 3, 4, 5 respectively, as shown in the figure on the right
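Two of these limitations are easy to see in a short sketch. The first part shows how a single outlier can change the Pearson coefficient; the second converts ordinal survey labels into numeric ranks. The outlier value, the sample responses and all names (`pearson_r`, `SATISFACTION_RANK`) are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson r in its deviation-from-mean form (equivalent to the sums formula)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# 1) Outlier sensitivity: the same data with one extreme (made-up) observation.
x = [1, 2, 3, 4, 5]
y_clean = [10, 20, 30, 40, 50]
y_outlier = [10, 20, 30, 40, -100]  # hypothetical outlier in the last position
r_clean = pearson_r(x, y_clean)
r_outlier = pearson_r(x, y_outlier)
print(f"without outlier: {r_clean:.2f}, with outlier: {r_outlier:.2f}")
# without outlier: 1.00, with outlier: -0.55

# 2) Ordinal labels must become numeric ranks before Spearman's method applies.
SATISFACTION_RANK = {
    "very dissatisfied": 1,
    "dissatisfied": 2,
    "neutral": 3,
    "satisfied": 4,
    "very satisfied": 5,
}
responses = ["satisfied", "neutral", "very satisfied"]  # hypothetical survey answers
numeric_ranks = [SATISFACTION_RANK[answer] for answer in responses]
print(numeric_ranks)  # [4, 3, 5]
```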

Business use case 1 : Karl Pearson Correlation
Business problem :
The marketing manager of an online retailer wants to understand the influence of age on purchasing frequency
Input data :
Variable containing the age of each customer
Variable containing the frequency of purchases per month for each customer
Business benefit :
Depending on the correlation between the above two factors, the marketing manager can identify which age segments are frequent or infrequent purchasers. Infrequent purchasers can then be converted to frequent purchasers by targeting them with more attractive offers such as discounts, cash backs, etc.

Business use case 1 : Business decisions based on correlation output
For instance, in case of negative correlation
• (r < −0.3), i.e. the higher the age, the lower the purchasing frequency; higher-age customer segments can be targeted more effectively to convert them to frequent purchasers and in turn increase the revenue
In case of weak or no correlation
• (−0.3 ≤ r ≤ 0.3), there is no meaningful relation between age and purchase frequency, therefore there is no need to take the age factor into account while designing a target marketing strategy
[Figures: scatter plots with axes Age and Purchase Frequency]
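The decision rule for this use case can be written as a small function. This is a sketch of the two cases the slide states; the function name `age_strategy` and the wording of the returned strings are hypothetical, and the case r > 0.3 is deliberately left open because the slide does not describe it.

```python
def age_strategy(r):
    """Translate the age vs. purchase-frequency correlation into a marketing action."""
    if r < -0.3:
        # Older customers buy less often: target the higher-age segments.
        return "target higher-age segments with offers"
    if -0.3 <= r <= 0.3:
        # Weak or no correlation: age need not drive the strategy.
        return "ignore age when designing the targeting strategy"
    # r > 0.3 is not covered by the slide, so no action is invented here.
    return "case not covered by the slide"


print(age_strategy(-0.6))  # target higher-age segments with offers
print(age_strategy(0.1))   # ignore age when designing the targeting strategy
```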

Business use case 2 : Karl Pearson Correlation
Business problem :
• A bank wants to find the correlation between income and credit card delinquency rate of its credit card holders
Input data :
• Delinquency rate of each credit card customer
• Monthly income of each credit card customer
Business benefit :
• The bank’s credit card manager can decide on an individual’s credit limit eligibility depending on the correlation coefficient value between the above two factors, i.e. income and delinquency rate of a customer

Business use case 2 : Business decisions based on correlation output
If r < −0.5 (strong negative correlation between income and credit card delinquency rate, i.e. the higher the income, the lower the delinquency, which is usually the case)
• then a lower credit limit should be given to the low income segment and a higher credit limit to the high income segment in order to reduce the number of credit card delinquents
If r > 0.5 (strong positive correlation), it means the higher the income, the higher the delinquency rate (which is unusual)
• In this case, the high income segment is prone to become delinquent on credit card and hence the credit limit should be set lower for them compared to the low income segment
If −0.3 ≤ r ≤ 0.3 then
• delinquency has little to do with the income of a customer, hence there is no need to decide the credit limit depending on an individual’s income
[Figures: scatter plots with axes Income and Delinquency]
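The three credit-limit rules leave gaps between 0.3 and 0.5 in absolute value, which becomes visible once they are written as code. The sketch below encodes only the stated thresholds; `credit_limit_policy` and its return strings are hypothetical, and values in the gaps are reported as uncovered instead of being assigned a policy.

```python
def credit_limit_policy(r):
    """Map the income vs. delinquency correlation to the credit-limit rule stated above."""
    if r < -0.5:
        return "lower limits for low income, higher limits for high income"
    if r > 0.5:
        return "lower limits for high income than for low income"
    if -0.3 <= r <= 0.3:
        return "do not base credit limits on income"
    # 0.3 < |r| <= 0.5 is not addressed by the stated rules.
    return "case not covered by the stated rules"


for r in (-0.8, 0.7, 0.0, 0.4):
    print(r, "->", credit_limit_policy(r))
```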

Business use case 3 : Spearman’s Rank Correlation
Business problem :
• The head of an educational institute wants to see the extent of agreement on students’ ratings between two different observers (department chairs and faculty members)
Input data :
• Students’ ratings by the department chairs
• Students’ ratings by the faculty members
Business benefit :
• This will help the institute head to understand how consistent both the observers were in rating the students
• If the ranks given by both the observers were similar to each other, the institute head would be able to put more faith in the ratings than if the observer ranks varied widely from each other
• This will also reduce the chance of unethical / biased rating

Business Use Case 4 : Spearman’s Rank Correlation
Business problem :
• A market research agency wants to cluster various survey responders based on the rank correlation output
Input data :
• Responders’ responses on brand loyalty, containing values on a Likert scale of 1 to 5, where 1 represents disloyal, 2 somewhat disloyal and so on
• Responders’ frequency of brand visits per month (here responders with visits above 10 per quarter can be ranked as “1”, between 8 and 10 as “2” and so on)
Business benefit :
• If the values of brand visits and brand loyalty turn out to be positively correlated then we can cluster the frequently visiting customers into a “brand loyal” segment and rarely visiting customers into a “brand disloyal” segment
• Upon identification of these disloyal customers, they can be converted into loyal ones by identifying a way of increasing their visit frequency

Want to Learn More?
Get in touch with us @
support@Smarten.com
June 2018
