Leonardo Auslender – Ch. 1. Copyright 2004. 9/17/2019
Analysts search for relations among PAIRS of variables to create a model or
to understand a phenomenon or event.
Primary question: DOES one variable change pari passu with another
variable of interest? One of the best-known tools is the correlation between
two variables, which measures the degree to which mean-centered variables
are linearly related. Below, we also present correlations based on the mode and the
median, which are not standard practice. The mode is not simple to estimate, and
distributions can be multimodal. The results presented below are just illustrative.
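As a minimal sketch of "mean-centered variables are linearly related", the correlation can be written as the cosine of the angle between the centered vectors. The function name `centered_corr` and the simulated data are illustrative assumptions. The median-centered variant is only one plausible reading of a "median corr", not necessarily the definition used in these slides.

```python
import numpy as np

def centered_corr(x, y, center=np.mean):
    """Cosine of the angle between x and y after subtracting a center from each."""
    xc = np.asarray(x, dtype=float) - center(x)
    yc = np.asarray(y, dtype=float) - center(y)
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Simulated data (assumption, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

pearson = centered_corr(x, y)                   # mean-centered: Pearson correlation
median_based = centered_corr(x, y, np.median)   # median-centered variant (assumed definition)
print(pearson, median_based)
```

With `center=np.mean` the result agrees with `np.corrcoef`; other centers give a different number whenever the center differs from the mean.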
Linear relations are easier to understand and, by reference to the
Taylor/Maclaurin series, can be used as approximations to the function that
represents the model being sought.
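For reference, the first-order Taylor expansion behind this remark is sketched below. The symbols f, x and x_0 are generic notation, not taken from the slides. The Maclaurin case is x_0 = 0.

```latex
f(x) \approx f(x_0) + f'(x_0)\,(x - x_0),
\qquad
f(x) \approx f(0) + f'(0)\,x \quad \text{(Maclaurin, } x_0 = 0\text{)}
```

The approximation is good only near x_0, which is one reason a linear relation can look different over different ranges of the variable.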
But BIVARIATE relationships do not necessarily transfer to multivariate
relationships, reviewed in MEDA and Modeling later on.
Next: Trellis graphic visualization of many variables at
once for Data set 1. Useful for a small number of variables.
Partial and semi-partial correlations coincide in this simulation; notice the change of slope
with respect to the zero-order correlations (continued in Linear Regression). Notice that
zero-order correlations do not translate into similar partial correlations.
Corrs may differ according to the variable ranges under study
(X ~ U(0,1), error ~ N(0, 0.01)). The “Small” regression is very different from “Big” and “Full”.
Ranges DO matter.
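The range effect can be reproduced with a small simulation. This is a sketch under stated assumptions: the true relation is taken to be y = x + error, N(0, 0.01) is read as variance 0.01, and the cut-off 0.1 separating "Small" from "Big" is hypothetical. The slides do not give any of these details.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
x = rng.uniform(0.0, 1.0, size=n)
error = rng.normal(0.0, 0.1, size=n)   # assumption: variance 0.01, i.e. sd 0.1
y = x + error                          # assumption: true relation is y = x

def corr_in_range(lo, hi):
    """Pearson correlation using only observations with lo <= x < hi."""
    keep = (x >= lo) & (x < hi)
    return float(np.corrcoef(x[keep], y[keep])[0, 1])

corr_small = corr_in_range(0.0, 0.1)   # hypothetical "Small" range
corr_big = corr_in_range(0.1, 1.0)     # hypothetical "Big" range
corr_full = corr_in_range(0.0, 1.0)    # "Full" range
print(corr_small, corr_big, corr_full)
```

Restricting X to a narrow range shrinks its variance relative to the error variance, so the correlation drops even though the underlying relation is unchanged.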
DS1: BEDA correlations. Note: max(abs(corr)) < 0.2; models can still be
obtained later on.
DS1: semi-partial and partial correlations (listing too long to show
all possibilities). All are mostly equal because the data is artificial.
Zero order.
E.g.: Corr(Total_spend, Doctor_Visits / No_claims) = 0.10.
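Partial and semi-partial correlations can be sketched with regression residuals. The helper names and the simulated x, y, z below are illustrative assumptions; they are not the DS1 variables. The partial correlation removes z from both variables, and the semi-partial removes z from only one.

```python
import numpy as np

def residuals(y, z):
    """Residuals of an ordinary least squares fit of y on z (with intercept)."""
    Z = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def partial_corr(x, y, z):
    """Correlation of x and y after removing z from both."""
    return corr(residuals(x, z), residuals(y, z))

def semipartial_corr(x, y, z):
    """Correlation of x with the part of y not explained by z."""
    return corr(x, residuals(y, z))

# Simulated data (assumption, for illustration only).
rng = np.random.default_rng(7)
z = rng.normal(size=1_000)
x = 0.6 * z + rng.normal(size=1_000)
y = 0.6 * z + 0.3 * x + rng.normal(size=1_000)

zero_order = corr(x, y)
partial = partial_corr(x, y, z)
semipartial = semipartial_corr(x, y, z)
print(zero_order, partial, semipartial)
```

In this simulation the zero-order correlation is inflated by the shared dependence on z, so the partial and semi-partial values come out smaller.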
NB: This is not Median Abs. Dev. Correlation. See Gideon (2007)
Divergence in abs(value) and/or sign reversal between Modal and others. Example => ***
L1 represents the median corr, Reg the mean corr, and Mode the Mode_corr; Loess is a smoother that
follows the data more closely. Note that L1 and Mode have flatter slopes than Reg because they are not so
affected by the extreme NE point (L1 and Loess are not covered in this course at the moment).
***
Correlation is not causation, or is it? (web example)
More than 98 percent of convicted felons are bread users. Fully HALF of all children who grow
up in bread-consuming households score below average on standardized tests.
In the 18th century, when virtually all bread was baked in the home, the average life expectancy
was less than 50 years; infant mortality rates were unacceptably high; many women died in
childbirth; and diseases such as typhoid, yellow fever, and influenza ravaged whole nations.
Bread is associated with all the major diseases of the body. For example, nearly all sick people
have eaten bread. The effects are obviously cumulative:
99.9 percent of all people who die from cancer have eaten bread. 99.7 percent of the people
involved in air and auto accidents ate bread within 6 months preceding the accident.
93.1 percent of juvenile delinquents came from homes where bread is served frequently.
Evidence points to the long-term effects of bread eating: Of all the people born since 1839 who
later dined on bread, there has been a 100% mortality rate.
Bread is made from a substance called "dough." It has been proven that as little as a teaspoon
of dough can be used to suffocate a lab rat. The average American eats more bread than that in
one day!
Primitive tribal societies that have no bread exhibit a low incidence of cancer, Alzheimer's,
Parkinson's disease, and osteoporosis. Bread has been proven to be addictive. Subjects
deprived of bread and given only water to eat begged for bread after as little as two days.
Bread is often a "gateway" food item, leading the user to "harder" items such as butter, jelly,
peanut butter, and even cold cuts. Bread has been proven to absorb water. Since the human
body is more than 80 percent water, it follows that eating bread could lead to your body
being taken over by this absorptive food product, turning you into a soggy, gooey bread-
pudding person.
Newborn babies can choke on bread. Bread is baked at temperatures as high as 400 degrees
Fahrenheit! That kind of heat can kill an adult in less than one minute.
Most bread eaters are utterly unable to distinguish between significant scientific fact (as
given in these pages) and meaningless statistical babbling by Professors.
In light of these frightening statistics, we propose the following bread restrictions:
No sale of bread to minors.
A nationwide "Just Say No To Toast" campaign complete with celebrity TV spots and bumper
stickers.
A 300 percent federal tax on all bread to pay for all the societal ills we might associate with
bread.
No animal or human images, nor any primary colors (which may appeal to children) may be
used to promote bread usage.
The establishment of "Bread-free" zones around schools.
From (“articles.mercola.com/sites/articles/archive/2018/08/02/beef-jerky-might-cause-mood-swings-mental-illness.aspx?utm_source=dnl&utm_medium=email&utm_content=art2&utm_campaign=20180802Z1_UCM&et_cid=DM225499&et_rid=381827292”)
On the other hand … (my notation in Italics)
Hypothesis: Beef jerky with nitrates added was linked to a host of
concerning mental changes, including mania in humans and altered
behavior and brain gene expression in rats.
Evidence 1: People who were hospitalized with mania were 3.5 times
more likely to have eaten cured meats like beef jerky than people without a
history of psychiatric disorders
Evidence 2: Rats fed beef jerky with nitrates experienced mania-like
hyperactivity and irregular sleeping patterns, along with alterations in brain
pathways that have been implicated in human bipolar disorder and
changes in intestinal microbiota.
Conclusion: Nitrates in processed meats may influence mental health by
altering inflammatory processes and gut bacteria.
Shifting the constant ‘a’ in (X1 + a)**2 leads to
different correlations. Note the negative corr for a < 0.
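A minimal sketch of the shift effect, assuming the correlation in question is between X1 and (X1 + a)**2 and that X1 is symmetric around zero (standard normal here). Both are assumptions; the slide does not state the distribution of X1.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=10_000)   # assumption: X1 symmetric around zero

def corr_with_shifted_square(x, a):
    """Pearson correlation between x and (x + a)**2."""
    return float(np.corrcoef(x, (x + a) ** 2)[0, 1])

shifts = [-2.0, -0.5, 0.0, 0.5, 2.0]
corrs = {a: corr_with_shifted_square(x1, a) for a in shifts}
for a, r in corrs.items():
    print(f"a = {a:+.1f}  corr = {r:+.3f}")
```

For a symmetric, zero-mean X1, cov(X1, (X1 + a)**2) = 2a·var(X1), so the sign of the correlation follows the sign of 'a'. For a non-symmetric X1 the sign change need not occur exactly at a = 0.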
Notice the disparities in values, except in RMSE. All coeffs are significant. To create an
interaction term (x – a)(y – b) in linear or logistic regression, ‘a’ and ‘b’ can be selected so that
the interaction is uncorrelated with X and Y.
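One way to read the remark about choosing 'a' and 'b' is sketched below: requiring the sample covariance of (x – a)(y – b) with x and with y to be zero gives two linear equations in a and b. The function name `decorrelating_shifts` and the simulated data are hypothetical. This is an illustration, not necessarily the procedure behind the slide.

```python
import numpy as np

def decorrelating_shifts(x, y):
    """Solve for (a, b) so that (x - a) * (y - b) has zero sample covariance
    with x and with y."""
    c = np.cov(np.vstack([x, y, x * y]))          # 3x3 sample covariance matrix
    var_x, var_y, cov_xy = c[0, 0], c[1, 1], c[0, 1]
    cov_x_xy, cov_y_xy = c[0, 2], c[1, 2]
    # cov(x, (x-a)(y-b)) = cov(x, xy) - a*cov(x, y) - b*var(x)    = 0
    # cov(y, (x-a)(y-b)) = cov(y, xy) - a*var(y)    - b*cov(x, y) = 0
    lhs = np.array([[cov_xy, var_x], [var_y, cov_xy]])
    rhs = np.array([cov_x_xy, cov_y_xy])
    a, b = np.linalg.solve(lhs, rhs)              # singular only if |corr(x, y)| = 1
    return float(a), float(b)

# Simulated data (assumption, for illustration only).
rng = np.random.default_rng(3)
x = rng.uniform(1.0, 3.0, size=2_000)
y = 2.0 + 0.5 * x + rng.normal(size=2_000)

a, b = decorrelating_shifts(x, y)
raw_interaction = x * y
shifted_interaction = (x - a) * (y - b)
print(np.corrcoef(x, raw_interaction)[0, 1], np.corrcoef(x, shifted_interaction)[0, 1])
```

The raw product x*y is strongly correlated with x in this simulation, while the shifted product is uncorrelated with both x and y up to floating-point error.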
Comments on LI, corrs, and orthogonality.
Orthogonality is the most extreme case of LI. Remember: 2 vars are LI
when one is NOT a multiple of the other.
If two variables are linearly dependent, then corr = -1 or 1.
Correlated variables are not linearly dependent unless abs(corr) = 1, and
even then only the centered variables are guaranteed to be LD (e.g., Y = X + 1).
Linear dependence between X and Y ➔ centered X and
centered Y also LD ➔ X and Y perfectly correlated. When 2 LI
vectors, orthogonal or not, are centered, the angle between them
may or may not change ➔
for LI vectors, corr may be positive, negative, or zero.
 X   Y
 0   1
 0   0
 1   1
 1   0
X, Y are LI iff there is no constant ‘a’ such that aX – Y = 0.
Here Corr(X, Y) = 0 and X’Y = 1 ≠ 0 (non-orthogonal), so
X, Y are LI, uncorrelated, and non-orthogonal in this case.
 X   Y
-1   1
-1  -1
 1   1
 1  -1
X, Y are orthogonal iff ∑ Xi Yi = 0. Here X and Y are orthogonal,
uncorrelated (Corr(X, Y) = 0), and LI.
 X   Y
 1   2
 1   3
 2   4
 3   5
LI, correlated (corr = 0.94), and not orthogonal.
 X   Y
 1   5
-5   1
 3   1
-1   3
LI, orthogonal, and correlated (corr = 0.25).
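The four small data sets above can be checked numerically. A minimal sketch follows; the function name `describe` and the example labels are illustrative. Linear independence and orthogonality are evaluated on the raw vectors, correlation on the centered ones.

```python
import numpy as np

def describe(x, y):
    """Report linear independence, orthogonality (raw data) and Pearson corr."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    independent = np.linalg.matrix_rank(np.column_stack([x, y])) == 2
    return {
        "linearly_independent": bool(independent),
        "orthogonal": float(x @ y) == 0.0,          # exact for these integer data
        "corr": round(float(np.corrcoef(x, y)[0, 1]), 2),
    }

examples = {
    "uncorrelated, not orthogonal": ([0, 0, 1, 1], [1, 0, 1, 0]),
    "uncorrelated, orthogonal": ([-1, -1, 1, 1], [1, -1, 1, -1]),
    "correlated, not orthogonal": ([1, 1, 2, 3], [2, 3, 4, 5]),
    "correlated, orthogonal": ([1, -5, 3, -1], [5, 1, 1, 3]),
}
results = {name: describe(x, y) for name, (x, y) in examples.items()}
for name, res in results.items():
    print(name, res)
```

All four pairs are linearly independent; they differ only in whether the raw inner product and the centered inner product vanish.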
Diagrammatically
[Diagram with regions labeled: Uncorrelated, Linearly Independent, Orthogonal.]
Note: Orthogonality and linear independence are calculated from raw data;
corrs are calculated from centered variables.
Overall comments on correlations.
By far, Pearson correlation is the most used, possibly followed by
Kendall rank correlation (not shown).
Notice that Pearson is based on mean-centered variables, while
mode- and median-corrs are not. Median-corrs are not so
affected by ‘outliers’, while mode-corrs can produce sign
reversals because the mode tends to take more extreme values.
Correlation is not causation: learn how to recognize that type of
reasoning and learn NOT to use it.
Most practical reasoning (e.g., intuition) is based on correlations.
Learn to be skeptical when the distribution is likely asymmetric.
Chi-square for contingency tables (2 nominal variables).
Example of 2 x 2 (i.e., 2 binary variables)
Q. Are HIV and Smoking related?
Chi-square test of independence.
             Non-smoker   Smoker   Total
No HIV           A           B       C
Has HIV          D           E       F
Total            G           H       I
TEST: $\chi^2 = \sum_{k} \frac{(O_k - E_k)^2}{E_k} \sim \chi^2[(r-1)(c-1)]$
O: observed, E: expected; for instance, $E_{1,1} = \frac{C \cdot G}{I}$.
H0: Vars are independent. H1: not.
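The test can be sketched by hand as follows. The 2 x 2 counts are hypothetical, since the slide only uses letters; the function name is illustrative, and no continuity correction is applied. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom.

```python
import math
import numpy as np

def chi_square_independence(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    obs = np.asarray(table, dtype=float)
    row_totals = obs.sum(axis=1, keepdims=True)
    col_totals = obs.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / obs.sum()   # E_ij = row_i * col_j / total
    stat = float(((obs - expected) ** 2 / expected).sum())
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return stat, df, expected

# Hypothetical counts: rows = No HIV / Has HIV, columns = Non-smoker / Smoker.
counts = [[40, 10],
          [20, 30]]
stat, df, expected = chi_square_independence(counts)

# For df = 1 only: P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, df, p_value)
```

A small p-value leads to rejecting H0 (independence); it says nothing about the direction or cause of the association.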
Odds and Odds Ratios.
               Diseased   Non-diseased   Total
Exposed            7           10          17
Non-exposed        6           56          62
Odds of disease in the exposed group are 7/10.
Odds of disease in the non-exposed group are 6/56.
Odds ratio: (7/10) / (6/56) = 6.53. The risk ratio is (7/17) / (6/62) = 4.25;
the odds ratio is close to the risk ratio only when the disease is rare.
If the odds ratio > 1 and significant, then Prob(disease | 1st row) >
Prob(disease | 2nd row).
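The arithmetic for this table can be sketched as below. The Wald interval on the log odds ratio is an added assumption about how "significant" might be assessed; the slide does not name a test.

```python
import math

a, b = 7, 10    # exposed: diseased, non-diseased
c, d = 6, 56    # non-exposed: diseased, non-diseased

odds_ratio = (a / b) / (c / d)
risk_ratio = (a / (a + b)) / (c / (c + d))

# Wald 95% interval on the log odds ratio (large-sample approximation, assumed here).
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"odds ratio = {odds_ratio:.2f}, risk ratio = {risk_ratio:.2f}")
print(f"95% CI for odds ratio: ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that excludes 1 is consistent with higher odds of disease in the exposed group. With cell counts this small, an exact test may be preferable.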
Danger: Simpson’s Paradox (1951)
Example from Agresti (1996, p. 54).
Victim’s Race (Z)   Defendant’s Race (X)   Death Penalty: YES    NO   Total   % Yes
------------------------------------------------------------------------------------
White               White                                 53   414     467    11.3
White               Black                                 11    37      48    22.9
Black               White                                  0    16      16     0.0
Black               Black                                  4   139     143     2.8
All victims         White                                 53   430     483    11.0
All victims         Black                                 15   176     191     7.9
White               All defendants                        64   451     515    12.4
Black               All defendants                         4   155     159     2.5
------------------------------------------------------------------------------------
Grand total                                               68   606     674    10.1
The % of black defendants condemned to death is higher than for whites when the victim’s race is
considered (conditional prob.), but lower when looking at the overall numbers (unconditional prob.).
Simpson’s paradox: conclusions from conditional and
unconditional probs are opposed.
Overall, the death penalty rate is higher for white than for black defendants
(11.0 vs 7.9).
Whites tend to kill whites rather than blacks (467 vs 16 cases), and
white victimhood leads more often to the death penalty
(12.4 vs 2.5), i.e., victim’s race is correlated with defendant’s
race and with the death penalty.
Need overall knowledge of the situation in which events took
place.
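The reversal can be verified directly from the four cell counts in the Agresti table. A minimal sketch; the dictionary layout and helper name are illustrative.

```python
# (victim's race, defendant's race) -> (death penalty yes, death penalty no)
counts = {
    ("White", "White"): (53, 414),
    ("White", "Black"): (11, 37),
    ("Black", "White"): (0, 16),
    ("Black", "Black"): (4, 139),
}

def pct_yes(yes, no):
    return 100.0 * yes / (yes + no)

# Conditional on victim's race: one rate per (victim, defendant) cell.
conditional = {key: pct_yes(*cell) for key, cell in counts.items()}

# Unconditional: collapse over victim's race for each defendant's race.
marginal = {}
for defendant in ("White", "Black"):
    yes = sum(cell[0] for (_, d), cell in counts.items() if d == defendant)
    no = sum(cell[1] for (_, d), cell in counts.items() if d == defendant)
    marginal[defendant] = pct_yes(yes, no)

for key, pct in conditional.items():
    print(f"victim {key[0]}, defendant {key[1]}: {pct:.1f}% yes")
for defendant, pct in marginal.items():
    print(f"defendant {defendant}, all victims: {pct:.1f}% yes")
```

Within each victim group the rate is higher for black defendants, yet the collapsed rate is higher for white defendants, because most white defendants appear in the white-victim group, where death sentences are more frequent.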
Interview Questions.
In your data set, choose two binary variables, obtain the odds ratio, and report
your findings.
Find arguments in favor of pirates causing global warming (this is not
a joke question).
In a medical trial, you suspect Simpson’s paradox when, for instance,
tobacco smoking is found to have helped some patients to recover
from cancer, while the prevailing view is that smoking is bad for your
health. Do some reading and present pro/con arguments.
In The Idiot Brain: A Neuroscientist Explains What Your Head is Really
Up To (2016), Dean Burnett states “The correlation between height and
intelligence is usually cited as being about 0.2, meaning height and
intelligence seem to be associated in only 1 in 5 people.”
Comment, criticism?
References
Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley.
Gideon, R. (2007). The Correlation Coefficients. Journal of Modern Applied
Statistical Methods.
Rodgers, J. et al. (1984). Linearly Independent, Orthogonal, and Uncorrelated
Variables. The American Statistician.
Simpson, E. (1951). The interpretation of interaction in contingency tables.
Journal of the Royal Statistical Society, Series B, 13, 238-241.