Compact Letter Display (CLD) makes ANOVA & Tukey HSD test results a lot easier to interpret. It readily ranks and differentiates the tested variables. With CLD you can readily identify the variables that are statistically dissimilar vs. the ones that are similar.
Compact Letter Display (CLD). How it works
1. Making ANOVA & Tukey HSD testing
clearer with Compact Letter Display
Gaetan Lion, September 3, 2022
2. Introduction
ANOVA is an incomplete test because it only tells you if several variables, or factors, have
different Means. But, it does not tell you which specific ones are truly different. Maybe
out of 5 variables (A, B, C, D, E) only E is truly different. And, this sole variable causes the
ANOVA F test to be statistically significant. The other 4 variables could have similar Means.
The Tukey Honestly Significant Difference test (Tukey HSD) remedies the above situation. This
is a post-ANOVA test that tests whether each variable is different from any of the other
ones. And, Tukey HSD is conducted on a one-on-one matched variable basis just like an
unpaired t test. So, Tukey HSD tests the difference in Means for A vs. B, A vs. C, A vs. D, etc.
While the Tukey HSD test provides an abundance of supplementary information to ANOVA,
its output is overwhelming for non-statisticians.
Compact Letter Display (CLD) dramatically improves the clarity of the ANOVA & Tukey HSD
test output.
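To give a sense of how quickly the Tukey HSD output grows, the pairwise comparisons for the five variables A to E mentioned above can be enumerated. This is only a minimal sketch using the Python standard library; the name `pairs` is illustrative and not part of the original analysis.

```python
from itertools import combinations

variables = ["A", "B", "C", "D", "E"]

# Every unordered pair of variables gets its own Tukey HSD comparison.
pairs = list(combinations(variables, 2))

for first, second in pairs:
    print(f"{first} vs. {second}")

print(f"{len(pairs)} pairwise comparisons for {len(variables)} variables")
```

With k variables there are k(k-1)/2 comparisons, so 5 variables already produce 10 rows of output.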
3. CLD basics
1. CLD identifies where the statistically significant differences are.
Variables whose Means are not statistically different from one another share a letter.
For example:
“a” “ab” “b”
The above indicates that the first variable “a” has a Mean that is statistically different from the third one “b”.
But, the second variable “ab” has a Mean that is not statistically different from either the first or third
variable.
“a” “ab” “bc” “c”
The above indicates that the first variable “a” has a Mean that is statistically different from the third variable
“bc” and the fourth one “c”. But, this first variable “a” is not statistically different from the second one “ab”.
2. CLD also ranks the variables in descending Mean order.
So, the variable with the highest Mean will be labeled “a” (and only “a” if it is statistically different from all the others).
And, the variable with the lowest Mean will carry the last letter used.
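The two rules above can be sketched in code. The function below is a simplified illustration in the spirit of the insert-and-absorb approach to letter displays; it is an assumption about how the letters could be derived, not necessarily what the software behind these slides does, and it does not try to minimize the number of letters beyond the absorb step. The function name, its arguments, and the variable names `V1` to `V4` are hypothetical.

```python
from string import ascii_lowercase

def compact_letter_display(groups, different_pairs):
    """Assign CLD letters.

    groups: variable names already sorted by descending Mean.
    different_pairs: pairs of names whose Means ARE statistically different.
    Returns a dict mapping each name to its letters.
    """
    # Start with a single letter column shared by every variable.
    columns = [set(groups)]
    for g1, g2 in different_pairs:
        # Insert: split any column that still joins a different pair.
        split = []
        for col in columns:
            if g1 in col and g2 in col:
                split.extend([col - {g1}, col - {g2}])
            else:
                split.append(col)
        # Absorb: drop columns fully contained in another column.
        columns = []
        for col in split:
            if any(col <= kept for kept in columns):
                continue
            columns = [kept for kept in columns if not kept <= col]
            columns.append(col)
    # Letter the columns so that "a" goes to the highest Means.
    rank = {name: position for position, name in enumerate(groups)}
    columns.sort(key=lambda col: sorted(rank[name] for name in col))
    letters = {name: "" for name in groups}
    for letter, col in zip(ascii_lowercase, columns):
        for name in col:
            letters[name] += letter
    return letters

# Second example from the slide: V1 differs from V3 and V4, V2 differs from V4.
example = compact_letter_display(
    ["V1", "V2", "V3", "V4"],
    [("V1", "V3"), ("V1", "V4"), ("V2", "V4")],
)
print(example)  # {'V1': 'a', 'V2': 'ab', 'V3': 'bc', 'V4': 'c'}
```

Two variables end up sharing a letter only when no test declared them different, which is the reading rule described in point 1.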
4. Working through an Example
We are going to test if the average rainfall in 5 West Coast cities is statistically
different. These cities are:
Eugene (OR)
Portland (OR)
San Francisco (CA)
Seattle (WA)
Spokane (WA)
The data is annual rainfall (1951 – 2021). The data source is NOAA.
6. ANOVA F test.
So, we know that the Cities do not all have the same Average Rainfall.
But, as shown this ANOVA F test really does not tell you much if anything.
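For readers who want to see what the F test condenses into a single number, here is a minimal sketch of the one-way ANOVA F statistic using only the Python standard library. The numbers in `toy_groups` are made up for illustration and are not the NOAA rainfall data; the function name is hypothetical. Turning F into a p-value additionally requires the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) handles.

```python
from statistics import mean

def anova_f_statistic(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group Means around the grand Mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group Mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Toy data (NOT the rainfall data): the third group is clearly higher.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = anova_f_statistic(toy_groups)
print(round(f_stat, 2))  # 13.0
```

The single F value is large here only because the third group stands apart, yet nothing in the number says which group is responsible.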
7. Tukey HSD test identifies the difference between specific matched cities
Tukey HSD test output
We can observe that two pairs of matched cities
have Means that are not statistically different.
These are:
Portland – Seattle. p-value 0.54
San Francisco – Spokane. p-value 0.08
San Francisco – Spokane is not quite statistically
significant, when using an alpha level of 0.05.
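The decision rule applied to each pair can be sketched as follows. Only the two p-values quoted on this slide are used; the names `ALPHA`, `reported_p_values` and `is_significant` are illustrative.

```python
ALPHA = 0.05

# p-values quoted on the slide for the two pairs that are not significant.
reported_p_values = {
    ("Portland", "Seattle"): 0.54,
    ("San Francisco", "Spokane"): 0.08,
}

def is_significant(p_value, alpha=ALPHA):
    """A pair of Means is declared different when p falls below alpha."""
    return p_value < alpha

for (city_1, city_2), p in reported_p_values.items():
    verdict = "different" if is_significant(p) else "not statistically different"
    print(f"{city_1} - {city_2}: p = {p:.2f} -> {verdict}")
```

With a looser alpha of 0.10, `is_significant(0.08, alpha=0.10)` would return `True`, which is why San Francisco – Spokane is the borderline case.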
8. Using a Box Plot to visualize this data
Box Plot explanation
Top whisker = ~ 99.7th percentile
Top of box = 75th percentile
Line near middle of box = Median or 50th percentile
Bottom of box = 25th percentile
Bottom whisker = ~ 0.3rd percentile
This box plot has a lot of information. But, it is a bit challenging to readily identify the cities’ rainfall levels
that are different from each other vs. the ones that are similar.
There is more info on Box Plot visual
interpretation on the last slide in the Appendix
section.
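The approximate whisker percentiles quoted in the Box Plot explanation can be checked with a short calculation. This sketch assumes two things that the slide does not state: that the whiskers sit at the usual fences of 1.5 times the interquartile range beyond the box, and that the data are roughly Normal. Under those assumptions the fences land close to the 99.7th and 0.3rd percentiles.

```python
from statistics import NormalDist

# Assumption: data are roughly Normal, so a standard Normal stands in for them.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = standard_normal.inv_cdf(0.25)  # bottom of box
q3 = standard_normal.inv_cdf(0.75)  # top of box
iqr = q3 - q1

# Assumption: whiskers drawn at the conventional 1.5 x IQR fences.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

upper_pct = 100 * standard_normal.cdf(upper_fence)
lower_pct = 100 * standard_normal.cdf(lower_fence)
print(f"top whisker ~ {upper_pct:.2f}th percentile")
print(f"bottom whisker ~ {lower_pct:.2f}th percentile")
```

For skewed data such as rainfall the actual whisker percentiles will differ somewhat, so the figures on the slide are best read as rough guides.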
9. Putting an information package together
Tukey HSD test output
All the info is there. But, it is
rather challenging to interpret.
10. Rearranging the basic data with CLD
The original data set just sorted in
alphabetical order is not that
informative.
The revised data set using CLD is a lot
more informative. The cities are
ranked by Mean rainfall descending
order. And, the CLD identifies readily
which cities have statistically
significant Mean differences, and
which do not.
Eugene has a statistically significantly higher Mean rainfall than all the other cities. So, it is “a”.
Seattle and Portland have similar Mean rainfall (not statistically different). So, they both come in as “b”.
San Francisco and Spokane have less rainfall than the other cities. And, their respective rainfall levels
are similar. So, they come in as “c”.
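The letters above can be checked against the Tukey HSD results with a few lines of code: two cities should share a letter exactly when their pair was not statistically different. This is a minimal sketch; the letters and the two non-significant pairs come from the slides, while the names `cld_letters`, `not_different` and `share_a_letter` are illustrative.

```python
from itertools import combinations

# CLD letters as assigned on the slide.
cld_letters = {
    "Eugene": "a",
    "Seattle": "b",
    "Portland": "b",
    "San Francisco": "c",
    "Spokane": "c",
}

# The only two pairs that Tukey HSD did not find statistically different.
not_different = {
    frozenset({"Portland", "Seattle"}),
    frozenset({"San Francisco", "Spokane"}),
}

def share_a_letter(city_1, city_2):
    """True when the two cities have at least one CLD letter in common."""
    return bool(set(cld_letters[city_1]) & set(cld_letters[city_2]))

# Collect every pair where the letters disagree with the test result.
mismatches = [
    (c1, c2)
    for c1, c2 in combinations(cld_letters, 2)
    if share_a_letter(c1, c2) != (frozenset({c1, c2}) in not_different)
]
print(mismatches)  # an empty list means the letters agree with Tukey HSD
```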
11. Rearranging the Box Plots using CLD
Within this Box Plot, it is challenging to differentiate
cities’ rainfall relative levels and to figure out which
ones are similar vs. dissimilar.
This Box Plot using CLD is more informative. The cities’ rainfall
levels are sorted in descending order. The color intensity is tiered,
with denser texture reflecting higher rainfall levels. And,
the CLD letters identify which cities have similar rainfall levels
and which do not.
12. Upgraded information package with CLD
Using CLD, you can readily
identify the cities with similar
vs. dissimilar rainfall levels.