CONTENTS
• ABSTRACT
• A FAMILY OF CORRELATION COEFFICIENTS
• MADE-UP EXAMPLE
• APPLICATION TO GDP PER CAPITA
• SIMULATION
• CONCLUSIONS
ABSTRACT
If we held a competition for the most valuable statistical quantity in
exploratory data analysis, the correlation coefficient would most likely
win by a wide margin over its nearest competitor. Yet most data
applications involve non-normal data with outliers that cannot be
transformed to normality. We therefore search for correlation coefficients
that are robust to nonnormality and/or outliers, that can be applied in
any application, and that detect distorted or hidden correlations missed
by the most popular correlation coefficients. We introduce a family of
correlation coefficients that includes the Pearson and Spearman
coefficients as special cases. In the problems described above, other
members of the family yield desirably lower p-values than the standard
coefficients. The proposed family of coefficients, together with their
cut-off points and p-values computed by permutation tests, can be applied
by any scientist analyzing data. We share simulations, code, and real data
by email or online.
INTRODUCTION
• The existing literature recommends the Pearson (P) correlation for
normal data and the Spearman (S) correlation for nonnormal data.
• We propose alternative coefficients that perform better than the P & S
coefficients in applications.
• Data-analysis software typically computes three classic correlation
coefficients: Pearson's, Spearman's, and Kendall's.
• Strikingly, although these three coefficients were developed in the late
19th and early 20th centuries, and despite the rapid development of
computers, they still dominate practice.
THE CORRELATION COEFFICIENT FAMILY
Define the Minkowski distance:
$$D_p(x_i, y_i) = \left( \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
In this study we mainly use p = 1 (Manhattan distance).
Compute the standardized values of order p as
$$x_{p,i}^{(s)} = \frac{x_i - \bar{x}}{D_p(x_i, \bar{x})}$$
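The distance and the standardization of order p can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `d_p` and `standardize` are hypothetical, and constant input (zero spread) is not handled.

```python
import numpy as np

def d_p(x, y, p=1):
    """Mean-based Minkowski distance: (mean(|x_i - y_i|^p))^(1/p).

    y may be an array of the same length or a scalar (e.g. the mean of x).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

def standardize(x, p=1):
    """Center by the arithmetic mean and scale by D_p(x_i, mean(x))."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / d_p(x, x.mean(), p)

x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])
xs1 = standardize(x, p=1)  # mean absolute deviation of xs1 is 1
xs2 = standardize(x, p=2)  # population standard deviation of xs2 is 1
```

With p = 1 the scale is the mean absolute deviation; with p = 2 it is the population standard deviation, which is the usual z-score.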
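Proposed 1: Value Correlation for positive & negative relationships
$$r_{p,V} = \begin{cases} r_{p,V+} = 1 - \dfrac{1}{L^2} \cdot D_p^2\left(x_{p,i}^{(s)},\, y_{p,i}^{(s)}\right), & \text{if } r_{p,V+} \ge -r_{p,V-}, \\[2ex] r_{p,V-} = \dfrac{1}{L^2} \cdot D_p^2\left(x_{p,i}^{(s)},\, -y_{p,i}^{(s)}\right) - 1, & \text{if } r_{p,V+} < -r_{p,V-} \end{cases}$$
$$D_p\left(x_{p,i}^{(s)},\, y_{p,i}^{(s)}\right) \xrightarrow{\;p\;} L, \quad \text{as } n \to \infty \text{ (convergence in probability)}$$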
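A minimal sketch of the family coefficient for a general order p follows. The names `r_family`, `d_p`, and `standardize` are hypothetical, and the limit L is passed in by the caller because the slides obtain it separately for each member of the family.

```python
import numpy as np

def d_p(a, b, p):
    """Mean-based Minkowski distance of order p."""
    return np.mean(np.abs(a - b) ** p) ** (1.0 / p)

def standardize(v, p):
    """Standardized values of order p."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / d_p(v, v.mean(), p)

def r_family(x, y, p, L):
    """Value correlation r_{p,V}: pick the positive or negative branch."""
    xs, ys = standardize(x, p), standardize(y, p)
    r_plus = 1.0 - d_p(xs, ys, p) ** 2 / L ** 2
    r_minus = d_p(xs, -ys, p) ** 2 / L ** 2 - 1.0
    return r_plus if r_plus >= -r_minus else r_minus

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
r_pos = r_family(x, 2.0 * x + 1.0, p=1, L=np.sqrt(2.0))  # exact increasing line
r_neg = r_family(x, -x, p=1, L=np.sqrt(2.0))             # exact decreasing line
```

An exact increasing line gives a distance of zero and hence a coefficient of +1; an exact decreasing line falls in the negative branch and gives -1.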
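Pearson Correlation Coefficient, $r_{PE}$, as a Special Case (p=2):
$$r_{PE} = r_{PE}(x_i, y_i) = \begin{cases} r_{PE+} = 1 - \dfrac{D_2^2\left(x_{2,i}^{(s)},\, y_{2,i}^{(s)}\right)}{2}, & \text{if } r_{PE+} \ge -r_{PE-}, \\[2ex] r_{PE-} = \dfrac{D_2^2\left(x_{2,i}^{(s)},\, -y_{2,i}^{(s)}\right)}{2} - 1, & \text{if } r_{PE+} < -r_{PE-} \end{cases}$$
$$D_2\left(x_{2,i}^{(s)},\, y_{2,i}^{(s)}\right) \xrightarrow{\;p\;} \sqrt{2}, \quad \text{as } n \to \infty \text{ for independent } x \text{ and } y$$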
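The p = 2 special case can be checked numerically: with population z-scores, both branches reduce to the usual Pearson coefficient. The sketch below uses made-up random data and compares against `numpy.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)   # illustrative data, not from the slides

def standardize_p2(v):
    """Order-2 standardization: population z-scores (ddof = 0)."""
    return (v - v.mean()) / v.std()

xs, ys = standardize_p2(x), standardize_p2(y)

r_plus = 1.0 - np.mean((xs - ys) ** 2) / 2.0    # 1 - D_2^2(xs, ys) / 2
r_minus = np.mean((xs + ys) ** 2) / 2.0 - 1.0   # D_2^2(xs, -ys) / 2 - 1
r_numpy = np.corrcoef(x, y)[0, 1]               # reference Pearson value
```

Both identities follow from mean((xs - ys)^2) = 2 - 2r and mean((xs + ys)^2) = 2 + 2r for standardized data, so for p = 2 the two branches coincide.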
Spearman [2] Correlation Coefficient, $r_S$, as a Special Case (p=2):
$$r_S = r_{PE}[R(x_i), R(y_i)] = 1 - \frac{6}{n^2 - 1} \cdot D_2^2[R(x_i), R(y_i)]$$
$$\frac{D_2[R(x_i), R(y_i)]}{n} \xrightarrow{\;p\;} \frac{1}{\sqrt{6}}, \quad \text{as } n \to \infty \text{ for independent } x \text{ and } y$$
[2] https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
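The rank identity can be verified the same way. The sketch assumes continuous data without ties (ranks are computed with a double `argsort`, which does not average tied ranks); the helper name `ordinal_ranks` is hypothetical and the data are made up.

```python
import numpy as np

def ordinal_ranks(v):
    """Ranks 1..n; assumes no ties (continuous data)."""
    return np.argsort(np.argsort(v)) + 1.0

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = x + rng.normal(size=n)
rx, ry = ordinal_ranks(x), ordinal_ranks(y)

d2_squared = np.mean((rx - ry) ** 2)                 # D_2^2[R(x), R(y)]
r_s_distance = 1.0 - 6.0 * d2_squared / (n**2 - 1)   # distance form of Spearman
r_s_pearson = np.corrcoef(rx, ry)[0, 1]              # Pearson applied to the ranks
```

Without ties the two values agree up to floating-point error.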
Kendall [3] Correlation Coefficient, $r_K$:
$$r_K = \frac{2}{n \cdot (n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \operatorname{sgn}(x_i - x_j) \cdot \operatorname{sgn}(y_i - y_j)$$
(sgn ≡ the sign function)
Spearman & Kendall coefficients are special cases of a general
rank correlation coefficient [4]
[3] https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
[4] https://en.wikipedia.org/wiki/Rank_correlation#General_correlation_coefficient
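The double sum in the Kendall formula translates directly into array operations. The following is a sketch with a hypothetical function name; it implements the formula exactly as written (no tie correction) and uses O(n²) memory, which is fine for samples such as n = 40.

```python
import numpy as np

def kendall_r(x, y):
    """Kendall coefficient from the double sum over pairs j < i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])   # sgn(x_i - x_j) for all i, j
    sy = np.sign(y[:, None] - y[None, :])   # sgn(y_i - y_j) for all i, j
    # The strict lower triangle holds exactly the pairs i = 2..n, j = 1..i-1.
    s = np.sum(np.tril(sx * sy, k=-1))
    return 2.0 * s / (n * (n - 1))

r_k = kendall_r([1, 2, 3, 4], [1, 3, 2, 4])  # 5 concordant, 1 discordant pair
```

In this toy example the sum is 5 - 1 = 4 over 6 pairs, so the coefficient is 4/6.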
PROPERTIES FOR $r_{p,V}$ AND $r_{p,RV}$
• $-1 \le r_{p,V} \le 1$, $-1 \le r_{p,RV} \le 1$
• An exact value of +1 or -1 indicates a perfect positive or
negative relationship.
• A correlation value close to 0 indicates no relationship.
• The closer the coefficient is to +1 or -1, the stronger the
bivariate association.
• We square the distance, $D_p^2\left(x_{p,i}^{(s)},\, y_{p,i}^{(s)}\right)$, so that $r_{p,V}$ is on the
same scale as Pearson's correlation coefficient.
THE PROPOSED CORRELATION COEFFICIENT
FOR DATA NOT REJECTED AS NORMAL
Compute the standardized values (s) of order p=1 as
$$z_{1,i}^{(s)} = \frac{z_i - \bar{z}}{\overline{|z_i - \bar{z}|}}, \quad \text{bar} \equiv \text{arithmetic mean}$$
$$r_{1,V} = \begin{cases} r_{1,V+}, & \text{if } r_{1,V+} \ge -r_{1,V-}, \\ r_{1,V-}, & \text{if } r_{1,V+} < -r_{1,V-} \end{cases}$$
For positive correlation:
$$r_{1,V+} = 1 - \frac{1}{2} \cdot \left( \overline{\left| x_{1,i}^{(s)} - y_{1,i}^{(s)} \right|} \right)^2$$
For negative correlation:
$$r_{1,V-} = \frac{1}{2} \cdot \left( \overline{\left| x_{1,i}^{(s)} + y_{1,i}^{(s)} \right|} \right)^2 - 1$$
$$\overline{\left| x_{1,i}^{(s)} - y_{1,i}^{(s)} \right|} \xrightarrow{\;p\;} \sqrt{2}, \quad \text{as } n \to \infty \text{ for independent } x \text{ and } y \text{ (numerical finding)}$$
We use this version in the applications and simulations.
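The value coefficient of order 1 can be sketched as follows. The function names are hypothetical, and the large-sample check at the end uses made-up independent normal data to illustrate that the coefficient is close to 0 in that case.

```python
import numpy as np

def standardize_mad(z):
    """Order-1 standardization: center by the mean, scale by the mean absolute deviation."""
    z = np.asarray(z, dtype=float)
    centered = z - z.mean()
    return centered / np.mean(np.abs(centered))

def r_1v(x, y):
    """Proposed value correlation of order 1, with L^2 = 2."""
    xs, ys = standardize_mad(x), standardize_mad(y)
    r_plus = 1.0 - 0.5 * np.mean(np.abs(xs - ys)) ** 2
    r_minus = 0.5 * np.mean(np.abs(xs + ys)) ** 2 - 1.0
    return r_plus if r_plus >= -r_minus else r_minus

rng = np.random.default_rng(2)
x = rng.normal(size=20_000)
y = rng.normal(size=20_000)          # independent of x
r_independent = r_1v(x, y)           # should be near 0
r_line_up = r_1v(x, 3.0 * x + 2.0)   # exact increasing line
r_line_down = r_1v(x, -x)            # exact decreasing line
```

For independent normal samples the mean absolute difference of the standardized values is close to √2, which is what makes the coefficient vanish.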
THE PROPOSED CORRELATION COEFFICIENT
FOR NONNORMAL DATA
Compute the rankings and their standardized values of order 1 as
$$R_s(z_i) = \frac{R(z_i) - \overline{R(z)}}{\overline{\left| R(z_j) - \overline{R(z)} \right|}}, \quad \text{bar} \equiv \text{arithmetic mean}$$
$$r_{1,RV} = \begin{cases} r_{1,RV+}, & \text{if } r_{1,RV+} \ge -r_{1,RV-}, \\ r_{1,RV-}, & \text{if } r_{1,RV+} < -r_{1,RV-} \end{cases}$$
For positive correlation:
$$r_{1,RV+} = 1 - \left( \frac{1}{L} \cdot \overline{\left| R_s(x_j) - R_s(y_j) \right|} \right)^2$$
For negative correlation:
$$r_{1,RV-} = \left( \frac{1}{L} \cdot \overline{\left| R_s(x_j) + R_s(y_j) \right|} \right)^2 - 1$$
$$\overline{\left| R_s(x_j) - R_s(y_j) \right|} \xrightarrow{\;p\;} L = 1.344, \quad \text{as } n \to \infty \text{ for independent } x \text{ and } y \text{ (numerical finding)}$$
We use this version in the applications and simulations.
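A sketch of the rank-value coefficient, using the constant L = 1.344 reported on the slide. The function names are hypothetical, and ranks are computed without tie handling, which is an assumption suited to continuous data.

```python
import numpy as np

L_RANK = 1.344  # limit reported on the slide (numerical finding)

def standardized_ranks(z):
    """Rank the data (no tie handling), center, and scale by the mean absolute deviation."""
    z = np.asarray(z, dtype=float)
    ranks = np.argsort(np.argsort(z)) + 1.0
    centered = ranks - ranks.mean()
    return centered / np.mean(np.abs(centered))

def r_1rv(x, y, L=L_RANK):
    """Proposed rank-value correlation of order 1."""
    rx, ry = standardized_ranks(x), standardized_ranks(y)
    r_plus = 1.0 - (np.mean(np.abs(rx - ry)) / L) ** 2
    r_minus = (np.mean(np.abs(rx + ry)) / L) ** 2 - 1.0
    return r_plus if r_plus >= -r_minus else r_minus

x = np.array([0.1, 0.5, 1.2, 2.0, 3.7, 5.1])
r_increasing = r_1rv(x, np.exp(x))   # monotone increasing, nonlinear
r_decreasing = r_1rv(x, -x ** 3)     # monotone decreasing, nonlinear
```

Because only the ranks enter, any strictly monotone transformation of y leaves the coefficient unchanged.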
THE CUT-OFF POINTS AND P-VALUES FOR $r_{1,V}$ AND
$r_{1,RV}$ ARE COMPUTED BY PERMUTATION TESTS
Cut-off points for $r_{1,RV}$ (two-sided α = 0.05 or one-sided α = 0.025)
   n      c |    n      c |    n      c |    n      c |    n      c |     n      c
   5  0.938 |   12  0.616 |   19  0.518 |   30  0.422 |   80  0.276 |   500  0.123
   6  0.891 |   13  0.594 |   20  0.511 |   35  0.395 |   90  0.259 |  1000  0.091
   7  0.784 |   14  0.592 |   21  0.491 |   40  0.372 |  100  0.248 |  2000  0.069
   8  0.754 |   15  0.576 |   22  0.486 |   45  0.355 |  150  0.208 |  5000  0.050
   9  0.729 |   16  0.559 |   23  0.479 |   50  0.337 |  200  0.184 |  10^4  0.039
  10  0.713 |   17  0.538 |   24  0.477 |   60  0.311 |  300  0.151 |  10^5  0.023
  11  0.646 |   18  0.535 |   25  0.460 |   70  0.293 |  400  0.135 |  10^6  0.019
Example: In a case with nonnormal data and n = 37, we observe $r_{1,RV} = 0.41$.
Then we can reject $H_0: \rho = 0$ and accept $H_1: \rho > 0$ with α = 0.025, since
$|r_{1,RV}| > c_{1,RV} = 0.395$ (the tabulated cut-off for n = 35, the nearest
tabulated sample size below 37).
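The table lookup in the example can be written as a small helper. The rule used here, taking the cut-off of the largest tabulated n that does not exceed the sample size, is an assumption inferred from the n = 37 example; it is conservative because the cut-offs decrease with n. The function name is hypothetical.

```python
import bisect

# Cut-off points for r_{1,RV}, copied from the table (two-sided alpha = 0.05).
CUTOFFS = {
    5: 0.938, 6: 0.891, 7: 0.784, 8: 0.754, 9: 0.729, 10: 0.713, 11: 0.646,
    12: 0.616, 13: 0.594, 14: 0.592, 15: 0.576, 16: 0.559, 17: 0.538, 18: 0.535,
    19: 0.518, 20: 0.511, 21: 0.491, 22: 0.486, 23: 0.479, 24: 0.477, 25: 0.460,
    30: 0.422, 35: 0.395, 40: 0.372, 45: 0.355, 50: 0.337, 60: 0.311, 70: 0.293,
    80: 0.276, 90: 0.259, 100: 0.248, 150: 0.208, 200: 0.184, 300: 0.151,
    400: 0.135, 500: 0.123, 1000: 0.091, 2000: 0.069, 5000: 0.050,
    10**4: 0.039, 10**5: 0.023, 10**6: 0.019,
}

def cutoff_for(n):
    """Cut-off of the largest tabulated sample size <= n (assumed lookup rule)."""
    sizes = sorted(CUTOFFS)
    idx = bisect.bisect_right(sizes, n) - 1
    if idx < 0:
        raise ValueError("n is below the smallest tabulated sample size")
    return CUTOFFS[sizes[idx]]

c = cutoff_for(37)            # uses the row for n = 35
reject_h0 = abs(0.41) > c     # the slide's example
```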
PERMUTATION TESTS
Permutation tests [5] have been used for hypothesis testing of correlation
coefficients between two variables, x and y. First, compute the
correlation coefficient repeatedly after shuffling the observations of
variable y while keeping the order of the observations of variable x
fixed. Then derive p-values from the distribution of the computed
correlation coefficients. Permutation tests [6] have the following
advantages over other standard statistical tests:
• They approximate p-values very satisfactorily.
• They do not assume any particular distribution (distribution-free).
• They are suitable for small samples.
• They are applicable to non-random samples, e.g., time-series data.
[5] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
[6] Berry, K. J., Johnston, J. E., & Mielke, P. W., Jr. (2018). The Measurement of Association: A Permutation Statistical
Approach. Springer International Publishing.
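The shuffling procedure described above can be sketched generically for any correlation statistic. Everything named here is an assumption for illustration: the function name, the number of permutations, the two-sided comparison on absolute values, and the +1 correction that keeps the p-value away from exactly 0. The Pearson coefficient from `numpy.corrcoef` stands in for the statistic, and the data are made up.

```python
import numpy as np

def permutation_test(x, y, stat, n_perm=2000, alpha=0.05, seed=0):
    """Return (observed statistic, two-sided p-value, two-sided cut-off)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = stat(x, y)
    # Shuffle y, keep the order of x fixed, and recompute the statistic.
    perm = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(np.abs(perm) >= abs(observed))) / (n_perm + 1)
    cutoff = np.quantile(np.abs(perm), 1.0 - alpha)
    return observed, p_value, cutoff

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = x + 0.5 * rng.normal(size=40)   # strongly correlated toy data
obs, p_value, cutoff = permutation_test(x, y, pearson)
```

Any of the proposed coefficients could be passed as `stat` in place of `pearson`; the cut-off is the empirical (1 - α) quantile of the absolute permuted statistics.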
A CLASSIFICATION MODEL
We consider the test: $H_0: \rho = 0$ against $H_1: \rho \ne 0$
Binary observed variable:
$$y = [\rho \ne 0] = \begin{cases} 1, & \text{correlation exists} \\ 0, & \text{no correlation} \end{cases}$$
(Iverson bracket: 1 if the condition is true, 0 otherwise)
First, we test:
$H_0: (x, y) \sim MN$ (MN ≡ multivariate normal distribution)
$H_1: (x, y) \not\sim MN$ (does NOT follow an MN distribution; Henze-Zirkler MN test)
Predicted binary variable:
$$\hat{y} = \begin{cases} \left[\, |r_{1,V}| > c_{1,V} \,\right], & \text{if } H_0 \text{ for MN cannot be rejected} \\ \left[\, |r_{1,RV}| > c_{1,RV} \,\right], & \text{if } H_0 \text{ for MN is rejected} \end{cases}$$
$r_{1,V}$ & $r_{1,RV}$ ≡ proposed correlation coefficients for normal and nonnormal data
$c_{1,V}$ & $c_{1,RV}$ ≡ critical values, computed by permutation tests
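The decision rule can be sketched as a small function. The outcome of the multivariate-normality test is taken as an input flag here, since the Henze-Zirkler test itself is not implemented in this sketch; the function and argument names are hypothetical, and the numbers in the usage lines are illustrative only.

```python
def predict_correlation(r_value, c_value, r_rank, c_rank, mn_rejected):
    """Iverson-bracket prediction: 1 if a correlation is detected, else 0.

    r_value, c_value: value coefficient and its cut-off (data not rejected as MN).
    r_rank, c_rank:   rank-value coefficient and its cut-off (MN rejected).
    mn_rejected:      True if the multivariate-normality test rejects H0.
    """
    if mn_rejected:
        return int(abs(r_rank) > c_rank)
    return int(abs(r_value) > c_value)

# Nonnormal case: the rank-value coefficient decides.
y_hat_nonnormal = predict_correlation(0.10, 0.30, 0.41, 0.395, mn_rejected=True)
# Not rejected as normal: the value coefficient decides.
y_hat_normal = predict_correlation(0.10, 0.30, 0.41, 0.395, mn_rejected=False)
```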
AN APPLICATION TO GDP PER CAPITA
• Publicly available data [9] from the WORLD BANK
• N = 61 countries with GDP per capita > $10,000 in 2020 and full
annual data for 1981-2020
• T = 40 for the period 1981-2020. We analyze growth rates (%).
• (61² - 61)/2 = 1830 pairs (x, y), i.e., correlation cases
• 1187 not rejected as normal
• 643 rejected as normal (nonnormal)
• No causality. Lurking variables: the global or continental economy
• Compare the economic growth of a country with its correlated
countries by regression residuals.
[9] https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
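The pair count follows from choosing 2 countries out of 61 without regard to order. A short check, using placeholder integer labels instead of country names:

```python
from itertools import combinations

n_countries = 61
n_pairs = (n_countries**2 - n_countries) // 2        # formula from the slide
pairs = list(combinations(range(n_countries), 2))    # unordered country pairs

not_rejected_as_normal = 1187
rejected_as_normal = 643
```

The two normality groups add up to the total number of pairs.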
APP TO GDP: $H_0$ FOR MN CANNOT BE REJECTED
• For bivariate data not rejected as normal, we compare
Pearson with the proposed $r_{1,V}$.
• 1 = Reject $H_0: \rho = 0$, 0 = otherwise
• In 1045 = 690 + 355 cases, 88%, the two coefficients agree
[(1,1), (0,0)], and we assume that this is the true outcome.
This assumption may not be entirely accurate, but it does
not affect the conclusions for correlation comparisons.
Pearson   Proposed_Value, $r_{1,V}$   Frequencies
1         1                            690
1         0                             44
0         1                             98
0         0                            355
Total                                 1187
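The agreement rate quoted above can be recomputed from the frequency table. The dictionary keys are (Pearson decision, proposed decision); the variable names are illustrative.

```python
# (Pearson, proposed r_{1,V}) -> frequency, copied from the table
counts = {(1, 1): 690, (1, 0): 44, (0, 1): 98, (0, 0): 355}

total = sum(counts.values())
agree = counts[(1, 1)] + counts[(0, 0)]
agreement_rate = agree / total          # share of cases where both decide alike

only_pearson_rejects = counts[(1, 0)]   # Pearson rejects, proposed does not
only_proposed_rejects = counts[(0, 1)]  # proposed rejects, Pearson does not
```

Among the disagreements, the proposed coefficient rejects the null hypothesis alone in 98 cases, against 44 for Pearson.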
A CASE FOR DATA NOT REJECTED AS NORMAL
Correlation coefficients & p-values for Euro Area vs Qatar
                        1981-2020, n=40       1981-2019 excluding outliers 1986 & 2000, n=38
                        Correl.   p-value     Correl.   p-value
Proposed, $r_{1,V}$     0.346     0.022       0.482     0.000
Pearson                 0.143     0.379       0.478     0.002
On the full sample, only the proposed correlation, $r_{1,V}$, detects the relationship (p-value < 0.05).
After removing the 2 outliers, both coefficients show a significant positive relationship.
A CASE FOR NONNORMAL DATA
Correlation coefficients & p-values for UK vs Trinidad and Tobago
                        1981-2020, n=40       1981-2019 excluding outliers 1986, 1987, 1988, 2008 & 2009, n=35
                        Correl.   p-value     Correl.   p-value
Proposed, $r_{1,RV}$    0.452     0.009       0.613     0.002
Kendall                 0.169     0.124       0.348     0.003
Spearman                0.229     0.155       0.494     0.003
On the full sample, only the proposed correlation, $r_{1,RV}$, detects the relationship (p-value < 0.05).
After removing the 5 outliers, all three coefficients show a significant positive relationship.
AN APPLICATION TO GDP FOR NONNORMAL DATA
• For NONNORMAL data, we compare the Spearman and
Kendall coefficients with the proposed $r_{1,RV}$.
• 1 = Reject $H_0: \rho = 0$, 0 = otherwise
• In 593 = 297 + 296 of the 643 cases, 92.2%, the 3 coefficients
agree [(1,1,1), (0,0,0)], and we assume that this is the true
outcome. This assumption may not be entirely accurate, but
it does not affect the conclusions for correlation
comparisons.
A NEW PERFORMANCE MEASURE
ERROR WEIGHTED HARMONIC MEAN (EWHM) [12]
$$EWHM(TPR, TNR, PPV, NPV) = \frac{4 - TPR - TNR - PPV - NPV}{\dfrac{1 - TPR}{TPR} + \dfrac{1 - TNR}{TNR} + \dfrac{1 - PPV}{PPV} + \dfrac{1 - NPV}{NPV}}$$
$$= \frac{1 - AM(TPR, TNR, PPV, NPV)}{1 - HM(TPR, TNR, PPV, NPV)} \cdot HM(TPR, TNR, PPV, NPV)$$
$$= \frac{1 - AM(TPR, TNR, PPV, NPV)}{AM\left(\dfrac{1}{TPR}, \dfrac{1}{TNR}, \dfrac{1}{PPV}, \dfrac{1}{NPV}\right) - 1}$$
The higher the variance of TPR, PPV, TNR, & NPV, the smaller the EWHM.
HM ≡ harmonic mean, AM ≡ arithmetic mean
[12] Papadopoulos, S., Stavroulias, P., & Sager, T. (2019). Systemic early warning systems for EU14 based on the 2008 crisis:
proposed estimation and model assessment for classification forecasting. Journal of Banking Regulation, 20(3), 226-244.
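The three equivalent forms of the EWHM can be checked against each other numerically. The sketch below implements the first form and the AM/HM form; the function names and the four input rates are illustrative, and the case where all four rates equal 1 (a 0/0 ratio) is not handled.

```python
import numpy as np

def ewhm(tpr, tnr, ppv, npv):
    """Error weighted harmonic mean: harmonic mean with weights (1 - rate)."""
    m = np.array([tpr, tnr, ppv, npv], dtype=float)
    return np.sum(1.0 - m) / np.sum((1.0 - m) / m)

def ewhm_am_hm(tpr, tnr, ppv, npv):
    """Same quantity written as (1 - AM) / (1 - HM) * HM."""
    m = np.array([tpr, tnr, ppv, npv], dtype=float)
    am = m.mean()
    hm = len(m) / np.sum(1.0 / m)
    return (1.0 - am) / (1.0 - hm) * hm

unequal = ewhm(0.9, 0.8, 0.7, 0.6)   # rates differ: EWHM falls below HM and AM
equal = ewhm(0.8, 0.8, 0.8, 0.8)     # all rates equal: EWHM equals that rate
harmonic = 4.0 / (1 / 0.9 + 1 / 0.8 + 1 / 0.7 + 1 / 0.6)
```

Since AM ≥ HM, the factor (1 - AM)/(1 - HM) is at most 1, so the EWHM never exceeds the harmonic mean and penalizes dispersion among the four rates.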
NOTATION FOR PERFORMANCE MEASURES
FOR THE GDP APP
By Normal we really mean: not rejected as Normal
All_PS   All observations, with Pearson for Normal & Spearman for Nonnormal
All_VR   All observations, with $r_{1,V}$ for Normal & $r_{1,RV}$ for Nonnormal
N_P      Only Normal cases with Pearson
N_V      Only Normal cases with $r_{1,V}$
NN_S     Only Nonnormal cases with Spearman
NN_K     Only Nonnormal cases with Kendall
NN_RV    Only Nonnormal cases with $r_{1,RV}$
PERFORMANCE-MEASURE DISCUSSION
• The overall measures ACC, Fb & EWHM are much higher for the
proposed correlation coefficients $r_{1,V}$ & $r_{1,RV}$ than for the
classic Pearson, Spearman & Kendall coefficients, both
separately for normal & nonnormal data and for all data
together.
• While ACC, Fb & T1+T2 indicate $r_{1,V}$ (N_V) as the best
coefficient, our EWHM measure shows $r_{1,RV}$ (NN_RV) as the
best. The higher the variance of TPR, PPV, TNR, & NPV, the
smaller the EWHM.
SIMULATION DESIGN
• 10,000 simulations
• Python (NumPy library)
• Two schemes of n correlated bivariate data, $x_i$ and $y_i$, with Pearson's coefficient
$\rho = r_{PE}$
• Scheme 1 contains all the data correlated as follows:
  ◦ $x_i, e_i \sim N(0, 1)$, $i = 1, 2, \ldots, n$, independent, and
  ◦ $y_i = \rho \cdot x_i + \sqrt{1 - \rho^2} \cdot e_i$
• Scheme 2 retains
  ◦ 90% of the observations as in Scheme 1 and (**), and
  ◦ the remaining 10% NONNORMAL, from the uniform distribution U(a, b),
  ◦ NONNORMAL within the area between two circles with radii, q, of 3 and 3.5.
  ◦ The random circle coordinates are given by:
  ◦ $x_i = q_i \cdot \cos(w_i)$, $y_i = q_i \cdot \sin(w_i)$,
  ◦ $q_i \sim U(3, 3.5)$ and $w_i \sim U(0, 2\pi)$
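The two data-generating schemes can be sketched with NumPy's random generator. This is not the authors' simulation code: the function names are hypothetical, and placing the ring observations in the first 10% of positions is an arbitrary choice made for the illustration.

```python
import numpy as np

def scheme1(n, rho, rng):
    """Bivariate normal data with population Pearson correlation rho."""
    x = rng.standard_normal(n)
    e = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * e
    return x, y

def scheme2(n, rho, rng, frac_ring=0.10):
    """90% as in Scheme 1; 10% uniform in radius and angle on the ring 3 <= q <= 3.5."""
    x, y = scheme1(n, rho, rng)
    k = int(round(frac_ring * n))               # number of nonnormal points
    q = rng.uniform(3.0, 3.5, size=k)           # radii
    w = rng.uniform(0.0, 2.0 * np.pi, size=k)   # angles
    x[:k] = q * np.cos(w)
    y[:k] = q * np.sin(w)
    return x, y

rng = np.random.default_rng(42)
x1, y1 = scheme1(100_000, 0.5, rng)
x2, y2 = scheme2(1_000, 0.5, rng)
ring_radii = np.hypot(x2[:100], y2[:100])       # radii of the contaminated points
```

Repeating such draws 10,000 times and applying a permutation-based test to each draw would give empirical rejection rates under each scheme.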
SIMULATION CONCLUSIONS
• For nonnormal data, the proposed correlation coefficients
$r_{1,RV}$ & $r_{1,V}$ have higher power (1 - β) and smaller total error
(α + β) than the classic coefficients.
• For normal data, the inverse order holds, but in practice we
get bivariate data NOT REJECTED AS NORMAL, which may have
a few outliers or nonnormalities.
• In addition, the linear relationships in real data are
hypothetical, while in simulations they are real.
• The Pearson coefficient performs best in simulations for
normal data but not in the application.
CONCLUSIONS
• The proposed $r_{1,V}$ & $r_{1,RV}$ coefficients are more powerful than
the standard Pearson, Spearman, & Kendall coefficients
• Could be applied by all scientists analyzing data
• Provide substantive interpretation
• Robust to nonnormality & outliers
• Cut-off points for the proposed rank coefficient are given
• The Kendall coefficient performs better than the Spearman coefficient.
OUR RECOMMENDATION: Use $r_{1,RV}$ for nonnormal data & $r_{1,V}$
when multivariate normality (MN) cannot be rejected.