This document discusses a method for visualizing direct and partial correlations using ELI (Exploratory Linear Information) plots. The method allows correlations between any number of variables to be plotted in an overlay fashion. The plots can show correlations against a single "with" variable, sorted by absolute value. Partial correlations can also be plotted. The method is implemented in a SAS macro. An example uses continuous variables from a dataset to demonstrate plotting correlations without a "with" variable.
“On visualizing Direct and Partial Correlations – ELI plots”
Leonardo E. Auslender
SAS Institute, Inc., Bedminster, NJ
1. Introduction
Statisticians and data analysts focus on correlations among pairs of variables to understand the strength of linear relationships in the data. Since correlations measure relations among pairs of variables, the standard output is in matrix form, which tends to be difficult to interpret for a large number of variables. The superlative analyst may also incorporate partial correlations to further deepen the analysis, which at least doubles the standard output. The hapless data-miner who faces hundreds, if not thousands, of variables does not long to wade through reams of correlation output to find “interesting” patterns.
In this paper, I present a method that makes it possible to visualize any number of Pearson (and partial) correlations by using a Proc-Timeplot-like output that I call Exploratory Linear Information (ELI) plots. Proc Timeplot is a procedure available in base SAS software, from SAS Institute Inc., since at least version 5.18. Proc Timeplot “plots one or more variables over time intervals” (SAS Procedures Guide, v. 6, 3rd edition, p. 579); the time-interval variable acts as an index for the observations being plotted. Notice that the index variable is itself not plotted and, moreover, that it is not at all necessary to have a time variable as an index (p. 581 of the same manual, ‘date’ variable). In this paper, our index is a variable that contains the names of the variables being correlated against a ‘with’ variable, and we plot correlations (and partial correlations, if so desired) in an overlay fashion.
The proposed method, embedded in a SAS macro, allows the analyst to:

a) Plot correlations of either all variables against each other or against a single ‘with’ variable, properly sorted by the absolute value of the correlation (a sketch of the underlying Timeplot call follows this list).

b) Plot on the same graph described in a) the ‘n’ largest absolute-value partial correlations, ‘n’ being a chosen parameter dependent upon the desired crowding of information in the plot.

c) Print the correlation and p-value matrices in a tabulate fashion. The standard output is usually difficult to read because of the difficulty of conceptualizing long sequences of numbers. The tabulate presentation, neater but still difficult to interpret, is necessary for documentation.
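To make the mechanism concrete, here is a minimal sketch — mine, not the author’s macro — of the kind of Proc Timeplot step that produces an ELI plot. The data set work.eli and its variables varname, corr, pcorr and abs_corr are assumed names for this example, and the AXIS= and REF= settings are illustrative:

PROC SORT DATA = work.eli;
   BY DESCENDING abs_corr;       /* sort by |correlation|, as in a) above */
RUN;

PROC TIMEPLOT DATA = work.eli;
   PLOT corr = 'C' pcorr = 'P' / OVERLAY REF = 0 AXIS = -1 TO 1 BY .25;
   ID varname;                   /* the index: names of the correlated variables */
RUN;

The ID variable supplies the row labels, so each printed line shows a variable name followed by the positions of its direct (‘C’) and partial (‘P’) correlations on a common -1 to 1 axis.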
2. Exploratory data analysis, variable selection and correlation matrices.
The typical practice of data analysis includes, at least in principle, exploratory data analysis, as espoused by Tukey (1977). More recently, Cleveland (1993) emphasized visualization techniques, and many research papers investigate the topic. This paper addresses the issue of visualizing correlations, itself a component of EDA, with simple tools available in the SAS System.
In addition, the hurried data mining practitioner often finds himself or herself selecting variables for a model, a segmentation algorithm or a customer profile, in an environment of hundreds and perhaps thousands of variables. Stepwise methods, however much criticized, are among the methodologies presently used to address variable selection.
In addition to variable selection techniques, practitioners also look at correlations among variables to investigate linear dependencies. Less frequently, practitioners look at squared partial (first-order) correlation coefficients. Given the linear model Y = α + β X + δ Z + ε with the typical assumptions, these coefficients measure the proportion of the variation of Y not estimated by X that is estimated by Z. Equivalently, they measure the correlation between Y and Z holding X constant. Direct and indirect effects of X and Z on Y can be measured by the partial correlation coefficients. In the same vein, second-order partial correlation coefficients can be defined by partialling out an additional variable from a first-order partial correlation, and likewise for third, fourth and higher orders.
Specifically, given X, Y and Z, the zero-order correlation between X and Y is given by:

   r_xy = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ],

where the bar denotes the mean value. The partial correlation of X and Y, given Z, is:

   r_xy.z = ( r_xy − r_xz r_yz ) / √[ (1 − r_xz²)(1 − r_yz²) ].
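Iterating the same formula gives the higher orders mentioned above; as a worked completion of that remark (this step is mine, added for concreteness), the second-order coefficient that additionally partials out a fourth variable W from the first-order coefficients is:

   r_xy.zw = ( r_xy.z − r_xw.z r_yw.z ) / √[ (1 − r_xw.z²)(1 − r_yw.z²) ].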
3. Programming considerations.
The Corr Procedure (with which the reader should be familiar in order to fully understand this paper) is the basic tool for finding correlations, as in the following code embedded in a macro:
PROC CORR DATA = &INDATA. OUTP = &OUTDATA. (WHERE = (_TYPE_ IN ("CORR", "N"))
          RENAME = (_NAME_ = WITH)) NOPRINT;
   /* Emit the WITH statement only if &WITH. is non-blank. */
   %IF %NRBQUOTE(&WITH.) > %THEN WITH &WITH.; %STR(;)
   VAR %DO K = 1 %TO &NUMVAR.; &&VAR&K. %END; %STR(;)
RUN;
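For concreteness, the following is a sketch of what the macro code above
resolves to for one hypothetical call, assuming &INDATA. = MYDATA,
&OUTDATA. = CORROUT, &WITH. = LN_DAY and &NUMVAR. = 3 with VAR1-VAR3 set
to N_DAYLST, RESPONSE and TENURE (these particular values are
illustrative only, not from the paper):

PROC CORR DATA = MYDATA OUTP = CORROUT (WHERE = (_TYPE_ IN ("CORR", "N"))
          RENAME = (_NAME_ = WITH)) NOPRINT;
   WITH LN_DAY;
   VAR N_DAYLST RESPONSE TENURE;
RUN;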
In this macro code, we request that the correlations not be printed
(NOPRINT) but instead be kept in the data set &OUTDATA. The rest of
the code allows for the use of a 'with' variable and of selected VAR
variables. The names of the variables have been kept in the macro
variables var1 through var&numvar. (&numvar. being the number of
variables) because we require the variables to be alphabetically ordered
to search for missing values later on. The standard output data set
referenced by &OUTDATA. provides the correlations but not the number
of observations for the 'with' variable. This number is critical in
determining p-values and, given the prevalence of missing values in
large databases, it forces us to re-capture that information.[4] (The
typical Proc Corr output data set is shown below.)
OUTDATA AFTER PROC CORR

OBS  _TYPE_  _WITH      LN_DAY   N_DAYLS2  N_DAYLST  N_DAYSEX  N_INTRST  RESPONSE
 1   N                 26610.00  38185.00  38185.00  38185.00  38185.00  22931.00
 2   CORR    LN_DAY       1.00      0.77      0.92      0.72      0.11      0.99
 3   CORR    N_DAYLS2     0.77      1.00      0.95      0.86      0.03      0.68
 4   CORR    N_DAYLST     0.92      0.95      1.00      0.85      0.06      0.87
 5   CORR    N_DAYSEX     0.72      0.86      0.85      1.00      0.03      0.66
 6   CORR    N_INTRST     0.11      0.03      0.06      0.03      1.00      0.12
 7   CORR    RESPONSE     0.99      0.68      0.87      0.66      0.12      1.00
 8   CORR    SEXUNKN     -0.21      0.02     -0.08      0.32     -0.07     -0.24
 9   CORR    TENURE      -0.05      0.01     -0.01      0.03     -0.05     -0.04
Due to the likelihood of the presence of missing values, it is necessary to
find out the number of non-missing observations for every pair of
variables. Since the &outdata. data set provides the number of present
observations for individual variables (but not for the ‘with’ variable), it
is necessary to obtain the information for those pairs in which at least
one variable has missing values. Once the number of non-missing values
is determined for every pair of variables, the p-values are computed by:
$$ \frac{\sqrt{N-2}\;\cdot\;\mathrm{Corr}}
{\sqrt{1-\mathrm{Corr}^2}} \;\sim\; t_{(N-2)} $$
which can be programmed as:

_STAT = ABS(SQRT(_NUMOBS - 2) * _CORR / SQRT(1 - (_CORR * _CORR)));
/* For large N (or an extreme statistic) use the normal approximation;
   otherwise use the exact t distribution with N - 2 d.f. */
IF _NUMOBS > 100 OR _STAT > 40 THEN
   _P_VAL = ROUND(2 * (1 - PROBNORM(_STAT)), .00001);
ELSE IF _STAT > . THEN
   _P_VAL = ROUND(2 * (1 - PROBT(_STAT, _NUMOBS - 2, 0)), .00001);
ELSE _P_VAL = .;
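The pairwise counts in _NUMOBS must be gathered first. The paper does
not show that step; the following DATA step is a minimal sketch, for a
single hypothetical pair (LN_DAY, RESPONSE), of one way to obtain it:

/* Sketch only (not the author's macro code): count the pairwise
   non-missing observations for LN_DAY and RESPONSE. */
DATA _NULL_;
   SET &INDATA. END = _EOF;
   /* N() returns the number of non-missing arguments. */
   IF N(LN_DAY, RESPONSE) = 2 THEN _NUMOBS + 1;
   IF _EOF THEN CALL SYMPUT('NUMOBS_PAIR', LEFT(PUT(_NUMOBS, 8.)));
RUN;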
At this point, we have obtained or calculated the correlations and
p-values that allow us to "timeplot". Since we have p-value information
(in the SAS data set &SASWORK.7 below), the analyst may desire to plot
only the significant correlations, usually selected by a p-value threshold.
The Timeplot code is:
PROC TIMEPLOT DATA = &SASWORK.7;
PLOT _CORR = "0" %IF &PARTIAL. = Y %THEN %DO K = 1 %TO &N_PRTLS.;
MXPART&K. = "&K."
%END;
/ OVERLAY NPP POS = 60 HILOC REF = 0 REFCHAR = '|' OVPCHAR = "*"
AXIS = -1 TO 1 BY .02 ;
ID _VARLBL ; /* VAR NAME + LABEL */
BY _WITH; /* SET OF WITH VARS */
TITLE2
%IF &PARTIAL. = Y %THEN "CORRS BY #BYVAL1, &N_PRTLS. PARTIALS REQUESTED";
%ELSE "CORRELATIONS BY #BYVAL1";
%STR(;)
%IF &SGNFCNT. = Y %THEN TITLE3 "SIGNIFICANT CORRS 95% ONLY"; %STR(;)
RUN;
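For clarity, with &PARTIAL. = Y and &N_PRTLS. = 4 (the settings used in
the case study below), the PLOT statement above resolves to something
like:

PLOT _CORR = "0" MXPART1 = "1" MXPART2 = "2" MXPART3 = "3" MXPART4 = "4"
   / OVERLAY NPP POS = 60 HILOC REF = 0 REFCHAR = '|' OVPCHAR = "*"
     AXIS = -1 TO 1 BY .02;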
In this code, we request, at a minimum, a plot of the correlations
between the 'with' variable and the 'var' variables (_WITH, _CORR),
identified in the plot by the value 0 (zero-order correlation). If partial
correlations are requested as well (calculated in a Proc IML step:
"%DO K = 1 %TO &N_PRTLS. ..."), their values are identified by 1, 2, 3,
..., &N_PRTLS. in descending order, where &N_PRTLS. is a
user-determined parameter. The names of the variables partialled out,
corresponding to 1, 2, 3, ..., are found in a later printout under the names
PART1, PART2, PART3, etc. We use '*' to denote overprinting (the
OVPCHAR option).
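The Proc IML step itself is not reproduced in the paper. As a rough,
self-contained sketch of the core computation it performs, the
first-order partial correlation formula of section 2 can be coded as
follows (the module name PARTIAL1 and the three sample correlations,
taken from the worked example of section 2, are illustrative):

PROC IML;
   /* First-order partial correlation r(x,y|z) from three
      zero-order correlations. */
   START PARTIAL1(RXY, RXZ, RYZ);
      RETURN((RXY - RXZ # RYZ) / SQRT((1 - RXZ##2) # (1 - RYZ##2)));
   FINISH PARTIAL1;
   /* LN_DAY vs N_DAYLS2, partialling out RESPONSE. */
   R = PARTIAL1(0.76645, 0.99097, 0.67704);
   PRINT R;   /* prints approximately 0.97 */
QUIT;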
4. Case Study.
I present one case, without a 'with' variable.[5] The 'with' variable case
is merely a subset of the more general case. All the variables are
continuous, and their meaning is unimportant for this exercise. The usual
(clipped) printout of Proc Corr and the (clipped) output data set
generated in this case are:
LN_DAY
          LN_DAY  RESPONSE  N_DAYLST  N_DAYLS2  N_DAYSEX  TOT_RCVD
         1.00000   0.99097   0.92451   0.76645   0.72429   0.22447
          0.0      0.0001    0.0001    0.0001    0.0001    0.0001
         26610     16057     26610     26610     26610     26610

         SEXUNKN  N_INTRST   TENURE    V3        V1        V2
        -0.21161   0.10958  -0.05324  -0.01432   0.00437  -0.00137
          0.0001    0.0001    0.0001    0.0195    0.4757    0.8228
         26610     26610     26610     26610     26610     26610

N_DAYLS2
        N_DAYLS2  N_DAYLST  N_DAYSEX  LN_DAY   RESPONSE  TOT_RCVD
         1.00000   0.95119   0.86207   0.76645   0.67704   0.19900
          0.0      0.0001    0.0001    0.0001    0.0001    0.0001
         38185     38185     38185     26610     22931     38185

        N_INTRST  SEXUNKN   TENURE    V3        V1        V2
         0.02730   0.01862   0.00980  -0.00816   0.00204   0.00102
         0.0001    0.0003    0.0555    0.1109    0.6904    0.8423
         38185     38185     38185     38185     38185     38185
In the Proc Corr printout, the first line of numbers under each heading
holds the correlation coefficients, the second the corresponding
p-values, and the third the number of non-missing observations. For the
case of hundreds or thousands of variables, this presentation is
uninformative, and the wrap-around effect makes it tedious to review. It
becomes more cumbersome still when the analyst wants to simplify the
task by looking only at correlations with significant p-values. In this
light, we propose the following Timeplot-like output (which corresponds
to the set of correlations associated with LN_DAY), adapted for
visualization:
ELI PLOT: CORRELATIONS BY LN_DAY
WITH := LN_DAY
VAR_NAME_+_LABEL                      min (-1) ............. max (+1)
[Monospace Timeplot panel: one row per variable (N_DAYLS2, N_DAYLST,
N_DAYSEX, N_INTRST, RESPONSE, SEXUNKN, TENURE, TOT_RCVD, V1, V2, V3),
each showing its zero-order correlation with LN_DAY as a '0' on the
-1 to +1 axis; '|' marks the zero-correlation reference line.]
The previous ELI plot illustrates the correlation patterns among the
variables. '0' marks direct (or zero-order) correlations. The plot allows
the 'stepwise-prone' analyst to focus directly on areas of high correlation
if interested in variable selection; in this case, N_DAYLS2, N_DAYLST,
N_DAYSEX, etc. These areas are the ones closer to the -1 and +1 ends of
the axis. The midpoint of the plot marks zero correlation.
Further, for every "(with, var)" pair, we can also plot the four (or any
number so desired) largest first-order partial correlations, denoted by the
numbers 1 through 4. Overlaps are denoted by '*'. The printout titled
"DIRECT & PARTIAL VAR NAMES" details the names of the
variables for each of the plotted correlations.
ELI PLOT: CORRS BY LN_DAY, 4 PARTIALS REQUESTED
WITH := LN_DAY
VAR_NAME_+_LABEL                      min (-1) ............. max (+1)
[Monospace Timeplot panel on the same -1 to +1 axis: each row overlays
the zero-order correlation ('0') and the four largest first-order
partial correlations ('1' through '4'), joined by hyphens; '*' marks
overprinted symbols.]
ELI PLOT: CORRS BY N_DAYLS2, 4 PARTIALS REQUESTED
WITH := N_DAYLS2
VAR_NAME_+_LABEL                      min (-1) ............. max (+1)
[Analogous Timeplot panel for the 'with' variable N_DAYLS2.]
Let us concentrate on a specific example: the first line of the first
partial-correlation diagram above (reproduced just below for clarity of
exposition), which plots LN_DAY ('with' variable) against N_DAYLS2
together with four first-order partials in decreasing absolute order of
magnitude. The plotted values are joined by hyphens, which allows for a
more compact view. '1' in that line corresponds to the correlation
between LN_DAY and N_DAYLS2 after partialling out RESPONSE
(which corresponds to variable PART1 in the first observation of the
printout below). '2' corresponds to the next largest absolute partial
correlation, which corresponds to N_DAYLST, etc. In the diagram, there
is an overlap between the zero-order correlation and the partial
corresponding to N_INTRST (PART4), denoted by '*'. Given the
distance of all these correlations from the midpoint of zero correlation,
the analyst might deem these variables worthy of further study. While
p-values for direct correlations are given in the tabular output below,
corresponding p-values for the partial correlations are not calculated at
present.
WITH := LN_DAY
VAR_NAME_+_LABEL                      min (-1) ............. max (+1)
N_DAYLS2:              |  2-----------|-----------*3----1  |
DIRECT & PARTIAL VAR NAMES

WITH = LN_DAY
OBS  VAR       PART1     PART2     PART3     PART4
1 N_DAYLS2 RESPONSE N_DAYLST SEXUNKN N_INTRST
2 N_DAYLST RESPONSE N_DAYLS2 SEXUNKN TENURE
3 N_DAYSEX SEXUNKN TENURE N_INTRST V1
4 N_INTRST N_DAYLS2 N_DAYLST N_DAYSEX V3
5 RESPONSE N_DAYLST N_DAYLS2 V1 TENURE
6 SEXUNKN N_DAYSEX N_DAYLST N_DAYLS2 TOT_RCVD
7 TENURE N_DAYLST N_DAYSEX N_DAYLS2 RESPONSE
8 TOT_RCVD SEXUNKN TENURE V3 V1
9 V1 RESPONSE N_DAYSEX N_DAYLST TOT_RCVD
10 V2 RESPONSE N_DAYLS2 N_DAYLST N_DAYSEX
11 V3 RESPONSE TOT_RCVD N_INTRST V2
WITH = N_DAYLS2
OBS  VAR       PART1     PART2     PART3     PART4
12 LN_DAY RESPONSE N_DAYLST SEXUNKN N_INTRST
13 N_DAYLST LN_DAY RESPONSE SEXUNKN N_INTRST
14 N_DAYSEX SEXUNKN TENURE V1 V2
15 N_INTRST N_DAYLST LN_DAY RESPONSE TOT_RCVD
16 RESPONSE LN_DAY N_DAYLST SEXUNKN N_INTRST
17 SEXUNKN N_DAYSEX N_DAYLST LN_DAY RESPONSE
18 TENURE LN_DAY N_DAYLST RESPONSE N_DAYSEX
19 TOT_RCVD N_INTRST V3 V1 V2
20 V1 RESPONSE N_DAYSEX TOT_RCVD TENURE
21 V2 N_DAYLST LN_DAY N_DAYSEX RESPONSE
22 V3 TOT_RCVD SEXUNKN TENURE N_INTRST
ELI plots allow for a different configuration as well. Instead of plotting
the largest first-order partial correlations in addition to the zero-order
one, we can plot the largest first-order partial, the largest second-order
partial, the largest third-order partial, etc. For the sake of brevity, this
excursion is omitted.
Finally, and for documentation purposes, the correlation coefficients and
corresponding p-values are also tabulated:
P_VALS OF CORRS (upper triangle)

VARIABLE    LN_DAY  N_DAYLS2  N_DAYLST  N_DAYSEX  N_INTRST  RESPONSE  SEXUNKN  TENURE  TOT_RCVD
LN_DAY                 0.000     0.000     0.000     0.000     0.000    0.000    0.000     0.000
N_DAYLS2                         0.000     0.000     0.000     0.000    0.000    0.056     0.000
N_DAYLST                                   0.000     0.000     0.000    0.000    0.028     0.000
N_DAYSEX                                             0.000     0.000    0.000    0.000     0.000
N_INTRST                                                       0.000    0.000    0.000     0.000
RESPONSE                                                                0.000    0.000     0.000
SEXUNKN                                                                          0.000     0.000
TENURE                                                                                     0.831
TOT_RCVD
5. Conclusion.
Since many correlations may not be significant at a confidence level of,
say, 95%, the ELI graphs can be made to portray significant correlations
only. In our example, however, we presented all possible effects with
their corresponding partial correlations.
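The paper does not show the filtering step tied to the &SGNFCNT.
parameter; as a minimal sketch (assuming the p-values reside in the data
set &SASWORK.7 that feeds Proc Timeplot), one could subset before
plotting:

DATA &SASWORK.7;
   SET &SASWORK.7;
   /* Keep only correlations significant at the 5% level
      (95% confidence); drop missing p-values. */
   IF _P_VAL > . AND _P_VAL < 0.05;
RUN;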
6. Trademarks.
SAS and all other SAS Institute Inc. product or service names
are registered trademarks or trademarks of SAS Institute Inc.
in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or
trademarks of their respective companies.
7. End Notes.
[1] Data mining has often been defined as the search for patterns,
interesting or otherwise. Curiously, "interesting" is in the eye of the
beholder, and patterns are not well defined. Ergo, any tool that purports
to find interesting patterns belongs under the rubric of data mining,
which thus cannot properly delimit any scientific discipline, since
almost anything can belong to it. My own preference is "Giga-data
analysis" (as opposed to the more traditional statistician's "small data
set analysis"). It is in this spirit that I envision this paper.
Since information from data requires the processes of summarization,
conceptualization, interpretation and application, the data analyst
victorious in all these steps after successful perusal of reams of pages
might require hospitalization as well.
[2] Yes, I am that old. This paper deals only with Pearson correlation
coefficients, but the additional use of other measures contained in Proc
Corr is straightforward. Programming Timeplot-like diagrams in other
software should not pose an insurmountable task; I created my first such
diagram in Basic in 1980. Additionally, the adjustment necessary for
correlations among continuous and categorical variables, as well as
among categorical variables, can easily be added.
[3] I consider the name Timeplot a limiting and misleading
denomination. C'est la vie.
[4] Missing values are excluded from the calculation of correlations in a
pair-wise fashion. For a proposed solution to the problem of missing
values in the context of large databases, see Auslender (1997).
[5] Partial correlations can also be understood as the correlation
between the residuals of a regression between Y and X, and between Y
and Z. See Cohen and Cohen (1983) for an overall discussion, and
Leahy (1996) for suppression effects in the area of database marketing.
[6] The skillful programmer might be enticed to utilize Proc Printto. My
preference for a more arduous route is based on the additional flexibility
provided to enhance the overall procedure, such as including partial
correlations in one step, multiple comparisons of correlations, Drezner's
Multirelation (1995), etc.
[7] The macro at present accepts only one 'with' variable. It is a
straightforward modification to enhance the code to accept multiple
'with' variables.
8. Bibliography
Auslender, L., "Missing Value Imputation Methods for Large
Databases", Proceedings of the 1997 Northeastern SAS Users Group
Meeting, 1997.
Cleveland, W., Visualizing Data, Hobart Press, USA, 1993.
Cohen, J., Cohen, P., Applied Multiple Regression/Correlation Analysis
for the Behavioral Sciences, Lawrence Erlbaum Associates, 1983.
Drezner, Z., "Multirelation: a Correlation among More than Two
Variables", Computational Statistics and Data Analysis, March 1995.
Hoaglin, D., Mosteller, F., Tukey, J., Understanding Robust and
Exploratory Data Analysis, John Wiley & Sons, 1983.
Leahy, K., "Nature, Prevalence, and Benefits of Suppression Effects in
Direct Response Segmentation", Proceedings of the American Statistical
Association 1995 Meeting, 1996.
Tukey, J. W., Exploratory Data Analysis, Addison-Wesley, 1977.
9. Contact Information
Your comments and questions are valued and encouraged.
Contact the author at:
Leonardo E. Auslender
SAS Institute
1545 Rt. 206 N, Suite 270
Bedminster, NJ 07921
908 470 0080 x 8217 (o)
908 470 0081 (f)
leonardo.auslender@sas.com