Exploratory data analysis

Data Analysis
Yabebal Ayalew
Statistics Department, Addis Ababa University
College of Natural & Computational Science

Exploratory Data Analysis
“It is important to understand what you CAN DO before you learn to measure how WELL you
seem to have DONE it.”

Outline
1 Introduction
2 Bivariate Correlation Analysis
— Data Visualization
— Pearson’s Correlation Coefficient
— Spearman’s Rank Correlation
— The Kendall Rank Correlation Coefficient
3 Inference on Correlation Coefficient
— Inference under Pearson’s Correlation
— Fisher Transformation of r
— Inference under Spearman’s and Kendall’s Rank Correlation
4 Multivariate Correlation Analysis
— Data Visualization
— Sample Correlation Coefficient Matrix
— Inference
5 Principal Component Analysis
Statistics Department, Exploratory Data Analysis, 17.9.2021

Introduction
• Almost every discipline from biology and economics to engineering
and marketing measures, gathers, and stores data in some digital
form
• Retail companies store information on sales transactions, insurance
companies keep track of insurance claims, and meteorological
organizations measure and collect data concerning weather
conditions
• Timely and well-founded decisions need to be made using the
information collected. These decisions will be used to maximize
sales, improve research and development projects, and trim costs
• Data are being produced at faster rates due to the explosion of
internet related information and the increased use of
operational systems to collect business, engineering and scientific
data, and measurements from sensors or monitors

Definition—Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach of analyzing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods
(source: https://en.wikipedia.org/wiki/Exploratory_data_analysis)
• According to John Tukey (June 16, 1915 – July 26, 2000), exploratory data analysis is detective
work: numerical detective work, or counting detective work, or
graphical detective work
• A detective investigating a crime needs both tools and
understanding. If he has no fingerprint powder, he will fail to find
fingerprints on most surfaces. If he does not understand where the
criminal is likely to have put his fingers, he will not look in the
right places. Equally, the analyst of data needs both tools and
understanding.

• The processes of criminal justice are clearly divided between the
search for the evidence—the responsibility of the police and other
investigative forces—and the evaluation of the evidence’s
strength—a matter for juries and judges.
• In data analysis a similar distinction is helpful. Exploratory data
analysis is detective in character. Confirmatory data analysis is
judicial or quasi-judicial in character
• Unless the detective finds the clues, judge or jury has nothing to
consider. Unless exploratory data analysis uncovers indications,
usually quantitative ones, there is likely to be nothing for
confirmatory data analysis to consider
Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation
stone—as the first step.

EDA Objectives
• Exploratory Data Analysis (EDA) is an
approach/philosophy for data analysis which
employs a variety of techniques (mostly
graphical) to
1 Maximize insight into a data set
2 Uncover underlying structure
3 Extract important variables
4 Detect outliers and anomalies
5 Test underlying assumptions
6 Develop parsimonious models and
7 Determine optimal factor settings

• The R programming language is one of the best tools for big
data analytics. It is free software and comes with no warranty.
The current stable version is R 4.1.0, released on May 18, 2021
— Some important libraries for EDA are DataExplorer and
tidyverse
• R packages can be installed with the
install.packages("<package name>") command. For
instance,
# Installing R packages
install.packages("DataExplorer")
install.packages("tidyverse")
• RStudio is a widely used graphical environment (IDE) for the R programming language. It is free,
though an enhanced version of RStudio is not.

• We have two options for using the functions in installed packages (libraries)
— The first option is to load the library and call the function by its name. For instance,
# Loading the library
library(DataExplorer)

# Using the create_report function
create_report(airquality)
— The second option is to skip loading the library and call
library_name::function_name() instead. For instance,
DataExplorer::create_report(airquality)
• Note: airquality is a dataset available in the datasets package, and the
create_report() function generates a report about the data set given as its argument
data("airquality")
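Before generating a full report, it can help to take a quick look at the data with base R alone. The lines below are a minimal sketch that assumes only the built-in datasets package; the object names dims and missing_per_column are illustrative and are not part of DataExplorer.

```r
# Base-R first look at airquality (ships with R in the datasets package)
data("airquality")

dims <- dim(airquality)                            # number of rows and columns
str(airquality)                                    # variable names and types
summary(airquality)                                # summaries and NA counts per variable
missing_per_column <- colSums(is.na(airquality))   # missing values per column
print(missing_per_column)
```

The same questions (size, types, missing values) are the ones an automated report answers in more detail.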

Bivariate Correlation Analysis
Bivariate Correlation Analysis—Scatter Plot
• A critical step in making sense of data is an understanding of the relationships between different
variables.
— For example, is there a relationship between interest rates and inflation or education level and
income?
• These relationships or associations can be established through an examination of different summary
tables and data visualizations as well as calculations that measure the strength and confidence in the
relationship.

• The simplest graphical tool to assess the relationship between two quantitative variables of interest is
the scatter plot
— A scatter plot is a 2D graph on the XY coordinate plane where the x-axis represents one of the variables
(usually believed to be the independent variable) and the y-axis represents the second variable
(usually the dependent variable)
— A pair of values (xi, yi) is represented by a dot; the collection of dots forms some kind of pattern
• The plot(x, y, ...) function can be used to draw a scatter plot in R. You can get more
information about the function and its arguments by typing ?plot in the RStudio console
• Let’s consider the EuStockMarkets data set that comes with the R installation for demonstration. This
data set contains daily closing prices of major European stock indices, 1991–1998. Type
?EuStockMarkets for more information
# Reading EuStockMarkets
data(EuStockMarkets)
# Let's convert this data to a data frame
stockPrice <- as.data.frame(EuStockMarkets)
# Let's see the dataset
head(stockPrice)
# Rename the columns to more appropriate names
colnames(stockPrice) <- c("Germany", "Switzerland", "France", "UK")
# Scatter plot of Germany and UK
xlab <- "Germany Stock Price" # x-axis label
ylab <- "UK Stock Price" # y-axis label
plot(x = stockPrice$Germany, y = stockPrice$UK, xlab = xlab, ylab = ylab)

• Example 1: A sample of 10 claims and corresponding payments on settlement for household policies
is taken from the business of an insurance company
Claim (x) 2.10 2.40 2.50 3.20 3.60 3.80 4.10 4.20 4.50 5.00
Payment (y) 2.18 2.06 2.54 2.61 3.67 3.25 4.02 3.71 4.38 4.45
Draw a scatter plot and comment on the relationship between claims and payments.
• Solution: Let’s use the plot() function
# Enter the data as vectors using the c() function
Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

# Scatter plot: pch is the marker symbol and col is the color
plot(x = Claim, y = Payment, pch = 16, col = "blue")

• Example 2: A professional body wishes to analyse the performance of its students on a particular
two-part examination. It records the scores obtained by a sample of 12 students on the first part of
the exam, and the scores obtained by the same students on the second part of the exam. The results
are as follows:
Student A B C D E F G H I J K L
First-part exam score x (%) 82 49 73 60 61 77 65 85 91 53 59 73
Second-part exam score y (%) 76 58 75 66 70 71 76 92 87 59 63 71
Draw a scatter plot and comment on the relationship between the first-part and second-part
exam scores.
# Enter the data (backticks allow variable names that contain spaces)
`First-part exam score` <- c(82, 49, 73, 60, 61, 77, 65, 85, 91, 53, 59, 73)
`Second-part exam score` <- c(76, 58, 75, 66, 70, 71, 76, 92, 87, 59, 63, 71)

# Scatter plot
plot(x = `First-part exam score`, y = `Second-part exam score`, pch = 16, col = 6)

Comment: Both scatter plots suggest a positive linear relationship between the variables, i.e., as
one variable tends to increase, the second variable also tends to increase.

Merits
• It can be easily understood and interpreted
• Values of extreme items do not affect this
method. Such points are always isolated in the
plot
• It is the best method to show you a non-linear
pattern
• Shows both positive and negative type of
graphical correlation
Demerits
• These diagrams are unable to measure the
precise extent of correlation and interpretation
can be subjective
• Limited to two or three dimensions at a time
• Data on both axes have to be continuous data
• It is not a quantitative measure of the
relationship between the variables. It is only a
qualitative expression of the quantitative change
• Exercise: The rate of interest of borrowing, over the next five years, for ten companies is compared
to each company’s leverage ratio (its debt to equity ratio). The data is as follows:
Leverage ratio (x) 0.1 0.4 0.5 0.8 1.0 1.8 2.0 2.5 2.8 3.0
Interest rate % (y) 2.8 3.4 3.5 3.6 4.6 6.3 10.2 19.7 31.3 42.9
Draw a scatterplot and comment on the relationship between company borrowing (leverage) and
interest rate. Hence apply a transformation to obtain a linear relationship.

Bivariate Correlation Analysis—Pearson’s Correlation Coefficient
Review—Statistic Vs Parameter
• A statistic is any measure that is computed from a
sample. It is usually denoted by a Latin letter. For
example, X̄ is the sample mean (it is a statistic)
• A parameter is any measure that is computed from a
population. It is usually denoted by a Greek letter. For
instance, µ is the population mean.
• The degree of association between the x and y values is
summarised by the value of an appropriate correlation
coefficient, each of which takes values from −1 to +1.
• One widely used correlation coefficient is
Pearson’s correlation coefficient, named after the
well-known English mathematician and biostatistician
Karl Pearson (27 March 1857 – 27 April 1936).

• The Pearson’s correlation coefficient computed from the sample is denoted by r. Suppose we have
two quantitative variables, say x and y, with a total of n observations
\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]
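Let
\begin{align*}
S_{xy} &= \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i y_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}) && \text{distribute the summation} \\
&= \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i\bar{y} - \sum_{i=1}^{n} y_i\bar{x} + \sum_{i=1}^{n} \bar{x}\bar{y} && \text{note that } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \text{ and } \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} && \bar{x} \text{ and } \bar{y} \text{ are constants over the index } i \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{align*}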

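• Let
\begin{align*}
S_{xx} &= \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i^2 - 2x_i\bar{x} + \bar{x}^2) && \text{polynomial expansion} \\
&= \sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i\bar{x} + \sum_{i=1}^{n} \bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}\bar{x} + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{align*}
With the same technique
\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \]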
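Therefore, the sample Pearson’s correlation coefficient r can be re-written as
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}} \]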
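The shortcut forms of Sxy, Sxx and Syy can be checked numerically against their definitions. The sketch below uses the claim and payment data of Example 1; the object names (Sxy_def, r_manual and so on) are illustrative only.

```r
# Example 1 data (claims and payments)
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)
n <- length(Claim)

# Definitions: sums of centred products and squares
Sxy_def <- sum((Claim - mean(Claim)) * (Payment - mean(Payment)))
Sxx_def <- sum((Claim - mean(Claim))^2)
Syy_def <- sum((Payment - mean(Payment))^2)

# Shortcut forms: raw sums minus n times the product of means
Sxy <- sum(Claim * Payment) - n * mean(Claim) * mean(Payment)
Sxx <- sum(Claim^2) - n * mean(Claim)^2
Syy <- sum(Payment^2) - n * mean(Payment)^2

# Correlation coefficient from the shortcut forms
r_manual <- Sxy / (sqrt(Sxx) * sqrt(Syy))
print(c(Sxy = Sxy, Sxx = Sxx, Syy = Syy, r = r_manual))
```

Both routes give the same value up to floating-point rounding, and r_manual agrees with cor(Claim, Payment).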

• The population correlation coefficient is denoted by the Greek letter ρ
\[ \rho = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{\sigma_{xy}}{\sigma_x\sigma_y} \]
r is a sample estimator of the unknown population correlation coefficient ρ.
• Relationship with Simple Linear Regression: Simple linear regression has the form
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
where \varepsilon_i is the random disturbance term and β0 and β1 are model parameters, estimated by β̂0
and β̂1, respectively.
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \]

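• Since β̂1 = Sxy/Sxx and Sxy = r \sqrt{S_{xx}S_{yy}}, a simple algebraic manipulation gives
\[ \hat{\beta}_1 = r\sqrt{\frac{S_{yy}}{S_{xx}}} \]
Moreover, the coefficient of determination R^2 of the simple linear regression is
\[ R^2 = r^2 \]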
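The link between the regression slope, R^2 and r can be verified with lm(). This is a sketch on the Example 1 data, assuming an ordinary least squares fit of Payment on Claim; the names slope_from_r and slope_from_lm are illustrative.

```r
# Example 1 data (claims and payments)
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

fit <- lm(Payment ~ Claim)            # simple linear regression
r   <- cor(Claim, Payment)            # Pearson's correlation coefficient

Sxx <- sum((Claim - mean(Claim))^2)
Syy <- sum((Payment - mean(Payment))^2)

slope_from_r  <- r * sqrt(Syy / Sxx)          # slope via r
slope_from_lm <- unname(coef(fit)["Claim"])   # slope estimated by lm()
r_squared     <- summary(fit)$r.squared       # coefficient of determination

print(c(slope_from_r = slope_from_r, slope_from_lm = slope_from_lm))
print(c(r_squared_from_r = r^2, r_squared_from_lm = r_squared))
```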
• Properties of Pearson’s correlation coefficient
— The value of r is between −1 and +1. r = 0 indicates the absence of a linear relationship between the
variables. r = −1 indicates a perfect negative linear relationship while r = +1 indicates a perfect
positive linear relationship
— Pearson’s correlation coefficient doesn’t imply a cause-and-effect type of relationship
— A non-linear type of correlation can’t be quantified by r
— No distinction is made between the roles of the variables considered, i.e., dependent vs independent
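The point that r only measures linear association can be illustrated with made-up data in which y is completely determined by x, yet r is zero. The values below are hypothetical and chosen so that the relationship is symmetric around x = 0.

```r
# Hypothetical data: y depends perfectly on x, but not linearly
x <- -5:5
y <- x^2

r_quadratic <- cor(x, y)   # Pearson's r sees no linear trend
print(r_quadratic)

plot(x, y, pch = 16)       # the scatter plot reveals the pattern r misses
```

A value of r near zero therefore does not rule out a strong relationship; it only rules out a linear one.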

• The cor(x, y = NULL, use = "everything", method = c("pearson",
"kendall", "spearman")) function in R can be used to compute Pearson’s correlation
coefficient
• Example 3: Let’s consider the data in Example 1.
Claim (x) 2.10 2.40 2.50 3.20 3.60 3.80 4.10 4.20 4.50 5.00
Payment (y) 2.18 2.06 2.54 2.61 3.67 3.25 4.02 3.71 4.38 4.45
Compute Pearson’s correlation coefficient, r
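• Solution: The summaries of the above data are
\[ \sum_{i=1}^{10} x_i = 35.4, \quad \sum_{i=1}^{10} x_i^2 = 133.76, \quad \sum_{i=1}^{10} y_i = 32.87, \quad \sum_{i=1}^{10} y_i^2 = 115.2025, \quad \sum_{i=1}^{10} x_i y_i = 123.81 \]
Thus,
\[ \bar{x} = \frac{\sum_{i=1}^{10} x_i}{n} = \frac{35.4}{10} = 3.54 \quad \text{and} \quad \bar{y} = \frac{\sum_{i=1}^{10} y_i}{n} = \frac{32.87}{10} = 3.287 \]

• By definition,
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}} = \frac{123.81 - 10(3.54)(3.287)}{\sqrt{133.76 - 10(3.54)^2}\,\sqrt{115.2025 - 10(3.287)^2}} = 0.96 \]
This value indicates a strong positive linear relationship between claim and payment.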
• Using the R programming language with the cor() function
# Enter the data as vectors using the c() function
Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

# Pearson's correlation coefficient
cor(x = Claim, y = Payment)

• Example 4: Let’s consider the data in Example 2 and compute Pearson’s correlation coefficient.
Student A B C D E F G H I J K L
First-part exam score x (%) 82 49 73 60 61 77 65 85 91 53 59 73
Second-part exam score y (%) 76 58 75 66 70 71 76 92 87 59 63 71
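• Solution: The summaries are
\[ \sum_{i=1}^{12} x_i = 828, \quad \sum_{i=1}^{12} x_i^2 = 59054, \quad \sum_{i=1}^{12} y_i = 864, \quad \sum_{i=1}^{12} y_i^2 = 63362, \quad \sum_{i=1}^{12} x_i y_i = 60950 \]
The means are
\[ \bar{x} = 69 \quad \text{and} \quad \bar{y} = 72 \]
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}} = \frac{60950 - 12(69)(72)}{\sqrt{59054 - 12(69)^2}\,\sqrt{63362 - 12(72)^2}} = 0.896 \]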

# Enter the data
`First-part exam score` <- c(82, 49, 73, 60, 61, 77, 65, 85, 91, 53, 59, 73)
`Second-part exam score` <- c(76, 58, 75, 66, 70, 71, 76, 92, 87, 59, 63, 71)

# Pearson's correlation coefficient
cor(x = `First-part exam score`, y = `Second-part exam score`, method = "pearson")
• Example 5: A new computerised ultrasound scanning technique has enabled doctors to monitor the
weights of unborn babies. The table below shows the estimated weights for one particular baby at
fortnightly intervals during the pregnancy.
Gestation period (weeks) 30 32 34 36 38 40
Estimated baby weight (kg) 1.6 1.7 2.5 2.8 3.2 3.5
\[ \sum_{i=1}^{6} x_i = 210, \quad \sum_{i=1}^{6} x_i^2 = 7420, \quad \sum_{i=1}^{6} y_i = 15.3, \quad \sum_{i=1}^{6} y_i^2 = 42.03, \quad \sum_{i=1}^{6} x_i y_i = 549.8 \]
Compute Sxx, Syy, Sxy and r.

• Solution: From the question, gestation period is x and estimated baby weight is
y. Then,
\[ \bar{x} = \frac{210}{6} = 35 \quad \text{and} \quad \bar{y} = \frac{15.3}{6} = 2.55 \]
By definition,
\[ S_{xx} = \sum_{i=1}^{6} x_i^2 - n\bar{x}^2 = 7420 - 6(35)^2 = 70 \]
\[ S_{yy} = \sum_{i=1}^{6} y_i^2 - n\bar{y}^2 = 42.03 - 6 \times 2.55^2 = 3.015 \]
\[ S_{xy} = \sum_{i=1}^{6} x_i y_i - n\bar{y}\bar{x} = 549.8 - 6(2.55)(35) = 14.3 \]
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{14.3}{\sqrt{70 \times 3.015}} = 0.984 \]
There is a very strong positive correlation.

# Enter the data
`Gestation period` <- c(30, 32, 34, 36, 38, 40)
`Estimated baby weight` <- c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5)

# Computing
n <- length(`Gestation period`) # sample size
x_bar <- mean(`Gestation period`) # mean of gestation period
y_bar <- mean(`Estimated baby weight`) # mean of baby weight

Sxx <- sum(`Gestation period`^2) - n*x_bar^2
Syy <- sum(`Estimated baby weight`^2) - n*y_bar^2

Sxy <- sum(`Gestation period`*`Estimated baby weight`) - n*x_bar*y_bar

r <- Sxy/sqrt(Sxx*Syy)

# Since we have the raw dataset, cor() gives the same result
cor(x = `Gestation period`, y = `Estimated baby weight`)

• Exercise: A schoolteacher is investigating the claim that class size does not affect GCSE results. His
observations of nine GCSE classes are as follows:
Students in class (c) 35 32 27 21 34 30 28 24 7
Average GCSE point score for class (p) 5.9 4.1 2.4 1.7 6.3 5.3 3.5 2.6 1.6
\[ \sum c = 238, \quad \sum c^2 = 6884, \quad \sum p = 33.4, \quad \sum p^2 = 149.62, \quad \sum cp = 983 \]
Compute Pearson’s correlation coefficient
• Final Thoughts:
— Pearson’s correlation coefficient is affected by outliers
— Even when the variables have a strong non-linear relationship, Pearson’s correlation coefficient may return
a low value
— The data to be correlated should be approximately normally distributed. If the
data are normally distributed, the data points tend to lie closer to the mean
— Pearson’s correlation coefficient is also called the Pearson product-moment correlation
Bivariate Correlation Analysis—Spearman’s Rank Correlation Coefficient
• Often, a bivariate population is far from normal. In that event, the
computation of Pearson’s correlation coefficient r as an estimator
of ρ is no longer valid
• In some cases a transformation of the variables x and y brings
their joint distribution close to the bivariate normal, making it
possible to estimate ρ in the new scale
• The Spearman’s rank-order correlation is the nonparametric
version of the Pearson product-moment correlation r.
• Spearman’s correlation coefficient, rs, measures the strength and
direction of association between two ranked variables. It is named after
Charles Spearman (10 September 1863 – 17 September 1945)
• Spearman’s correlation determines the strength and direction of the monotonic relationship
between your two variables rather than the strength and direction of the linear relationship between
your two variables

• There are two methods to calculate Spearman’s correlation depending on whether:
1 Your data does not have tied ranks or
2 Your data has tied ranks
The formula for when there are no tied ranks is
\[ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
where d_i is the difference in paired ranks and n is the number of observations

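• Let r(x_i) be the rank of x_i and r(y_i) be the rank of y_i, and let \overline{r(x)} and \overline{r(y)} be the mean ranks. If any of the ranks are tied,
Spearman’s rank correlation coefficient is computed as Pearson’s correlation coefficient of the ranks
\[ r_s = \frac{\sum_{i=1}^{n}\left(r(x_i) - \overline{r(x)}\right)\left(r(y_i) - \overline{r(y)}\right)}{\sqrt{\sum_{i=1}^{n}\left(r(x_i) - \overline{r(x)}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(r(y_i) - \overline{r(y)}\right)^2}} \]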
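With tied ranks, Spearman’s coefficient is Pearson’s r applied to the (average) ranks. The sketch below uses a small made-up data set with ties in both variables; the data and object names are hypothetical.

```r
# Hypothetical data with ties in both x and y
x <- c(10, 20, 20, 30, 40, 40, 50)
y <- c( 1,  3,  2,  3,  5,  4,  6)

# Tied values share the average of the ranks they occupy
rx <- rank(x, ties.method = "average")
ry <- rank(y, ties.method = "average")

rs_from_ranks <- cor(rx, ry)                    # Pearson's r on the ranks
rs_builtin    <- cor(x, y, method = "spearman") # built-in Spearman

# The no-ties shortcut, shown only for comparison
n <- length(x)
rs_shortcut <- 1 - 6 * sum((rx - ry)^2) / (n * (n^2 - 1))

print(c(from_ranks = rs_from_ranks, builtin = rs_builtin, shortcut = rs_shortcut))
```

The first two values coincide, while the shortcut formula differs slightly because it assumes there are no ties.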
• Example 6: A school teacher is investigating the claim that class size does not affect GCSE results.
His observations of nine GCSE classes are presented below. Compute rs
Students in class (c) Average GCSE point (p) r(c) r(p) d = r(c) − r(p) d2
35 5.9 9 8 1 1
32 4.1 7 6 1 1
27 2.4 4 3 1 1
21 1.7 2 2 0 0
34 6.3 8 9 -1 1
30 5.3 6 7 -1 1
28 3.5 5 5 0 0
24 2.6 3 4 -1 1
7 1.6 1 1 0 0

• Solution: \sum d^2 = 6. The Spearman’s rank correlation coefficient rs is
\[ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 6}{9(9^2 - 1)} = 0.95 \]
# Enter data (descriptive names avoid masking the c() function)
class_size <- c(35, 32, 27, 21, 34, 30, 28, 24, 7)
points <- c(5.9, 4.1, 2.4, 1.7, 6.3, 5.3, 3.5, 2.6, 1.6)
# Ranking
r_c <- rank(class_size)
r_p <- rank(points)
# Sample size
n <- length(class_size) # same as length(points)
# Rank difference
d <- r_c - r_p
# Spearman's rank correlation
rs <- 1 - 6*sum(d^2)/(n*(n^2 - 1))
# Alternatively
cor(x = class_size, y = points, method = "spearman")

• Example 7: The rate of interest of borrowing, over the next five years, for ten companies is
compared to each company’s leverage ratio (its debt to equity ratio). The data is as follows:
Leverage ratio (x) Interest rate (y) r(x) r(y) d d2
0.1 2.8 1 1 0 0
0.4 3.4 2 2 0 0
0.5 3.5 3 3 0 0
0.8 3.6 4 4 0 0
1.0 4.6 5 5 0 0
1.8 6.3 6 6 0 0
2.0 10.2 7 7 0 0
2.5 19.7 8 8 0 0
2.8 31.3 9 9 0 0
3.0 42.9 10 10 0 0
The Spearman rank correlation is one since \sum d^2 = 0

# Enter the data
`Leverage ratio` <- c(0.1, 0.4, 0.5, 0.8, 1.0, 1.8, 2.0, 2.5, 2.8, 3.0)
`Interest rate` <- c(2.8, 3.4, 3.5, 3.6, 4.6, 6.3, 10.2, 19.7, 31.3, 42.9)
# Rank the data (there are no ties here)
r_leverage <- rank(`Leverage ratio`, ties.method = "average")
r_interest <- rank(`Interest rate`, ties.method = "average")

# Correlation coefficients: Spearman, then Pearson
cor(`Leverage ratio`, `Interest rate`, method = "spearman")
cor(`Leverage ratio`, `Interest rate`)

# Plots
plot(`Leverage ratio`, `Interest rate`, pch = 16, col = 18)
scatter.smooth(`Leverage ratio`, `Interest rate`, pch = 16, col = 18)

Bivariate Correlation Analysis—The Kendall’s Rank Correlation Coefficient
• Kendall’s rank correlation, also called Kendall’s
Tau, is a correlation suitable for quantitative and ordinal variables.
It indicates how strongly two variables are monotonically related
— Kendall’s Tau serves the same purpose as
Spearman’s rank correlation
— Kendall’s Tau is computationally more intensive than
Spearman’s rank correlation
— It is named after Maurice George Kendall
(16 September 1907 – 29 March 1983)
• Despite the more complicated calculation, it is considered to have better statistical properties than
Spearman’s rank correlation coefficient, particularly for small data sets with large numbers of tied
ranks
• Any pair of observations (xi, yi) and (xj, yj), where i ≠ j, is said to be concordant if the ranks for both
elements agree, i.e., xi > xj and yi > yj, or xi < xj and yi < yj; otherwise they are said to be
discordant.

• Let nc be the number of concordant pairs, and let nd be the number of discordant pairs. Assuming
that there are no ties, the Kendall coefficient τ is defined as
\[ \tau = \frac{2(n_c - n_d)}{n(n - 1)} \]
Note that there are a total of
\[ \binom{n}{2} = \frac{n!}{2!(n - 2)!} = \frac{n \times (n - 1) \times (n - 2)!}{2!(n - 2)!} = \frac{n(n - 1)}{2} \]
pairs, each of which is either concordant or discordant
• Intuitively, it is clear that if the number of concordant pairs is much larger than the number of
discordant pairs, then the random variables are positively correlated. Whereas if the number of
concordant pairs is much less than the number of discordant pairs, then the variables are negatively
correlated

• Example 8: A new computerised ultrasound scanning technique has enabled doctors to monitor the
weights of unborn babies. The table below shows the estimated weights for one particular baby at
fortnightly intervals during the pregnancy.
Gestation period (weeks) 30 32 34 36 38 40
Estimated baby weight (kg) 1.6 1.7 2.5 2.8 3.2 3.5
Compute Kendall’s Tau
• Solution: The observations are (30, 1.6), (32, 1.7), (34, 2.5), (36, 2.8), (38, 3.2) and (40, 3.5). Let’s
create a table marking each pair of observations as concordant (c) or discordant (d)

(x, y) (30,1.6) (32,1.7) (34,2.5) (36,2.8) (38,3.2) (40,3.5)
(30,1.6) c c c c c
(32,1.7) c c c c
(34,2.5) c c c
(36,2.8) c c
(38,3.2) c
• As can be seen in the above table, nc = 15 and nd = 0. Therefore,
\[ \tau = \frac{2(n_c - n_d)}{n(n - 1)} = \frac{2(15 - 0)}{6(6 - 1)} = 1 \]
# Enter the data
`Gestation period` <- c(30, 32, 34, 36, 38, 40)
`Estimated baby weight` <- c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5)

# Kendall's tau
cor(x = `Gestation period`, y = `Estimated baby weight`, method = "kendall")

• Example 9: The data below show the consumption of alcohol (x, liters per year per person, 14 years
or older) and the death rate from cirrhosis, a liver disease (y, death per 100,000 population) in 9
countries (each country is an observation unit)
Country Alc. Consumption Death Rate from Cirrhosis
France 24.7 46.1
Italy 15.2 23.6
Germany 12.3 23.7
Australia 10.9 7.0
Belgium 10.8 12.3
USA 9.9 14.2
Canada 8.3 7.4
England 7.2 3.0
Israel 3.1 5.4
Computing Kendall’s tau using the approach of Example 8 is time consuming and prone to
error.
• It’s often easier to determine concordant and discordant pairs by using the ranks instead of the actual
numbers

• First arrange the observations in increasing order of x. Then, for each observation, the number of concordant pairs is the number of
observations below it that have a higher rank for y, and the number of discordant pairs is the number of
observations below it that have a lower rank for y
Country Alc. Consumption (x) Death Rate (y) r(y) Concordant Discordant
Israel 3.1 5.4 2 7 1
England 7.2 3.0 1 7 0
Canada 8.3 7.4 4 5 1
USA 9.9 14.2 6 3 2
Belgium 10.8 12.3 5 3 1
Australia 10.9 7.0 3 3 0
Germany 12.3 23.7 8 1 1
Italy 15.2 23.6 7 1 0
France 24.7 46.1 9
\[ n_c = 7 + 7 + \cdots + 1 = 30, \quad n_d = 1 + 0 + \cdots + 0 = 6 \]
\[ \tau = \frac{2(n_c - n_d)}{n(n - 1)} = \frac{2(30 - 6)}{9(9 - 1)} = 0.67 \]

library(oii)

# Enter data
`Alc consumption` <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

# Correlation coefficient
cor(`Alc consumption`, death, method = "kendall")

# Numbers of concordant and discordant pairs
concordant.pairs(`Alc consumption`, death)
discordant.pairs(`Alc consumption`, death)
• The oii library is loaded in order to use the concordant.pairs and discordant.pairs functions.
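The pair counts of Example 9 can also be obtained with base R only, which makes the definition of concordance explicit. This is a sketch, not the oii implementation; the names alc, sx, sy and pair_sign are illustrative, and the approach assumes there are no ties.

```r
# Example 9 data
alc   <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)
n <- length(alc)

# Sign of every pairwise difference: entry [i, j] compares observation i with j
sx <- sign(outer(alc, alc, "-"))
sy <- sign(outer(death, death, "-"))

# A pair is concordant when both differences have the same sign
pair_sign <- sx * sy
upper <- upper.tri(pair_sign)          # count each pair (i < j) once

nc <- sum(pair_sign[upper] > 0)        # concordant pairs
nd <- sum(pair_sign[upper] < 0)        # discordant pairs
tau <- 2 * (nc - nd) / (n * (n - 1))

print(c(nc = nc, nd = nd, tau = tau))
```

The counts match the table of Example 9, and tau agrees with cor(alc, death, method = "kendall").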

Inference on Correlation Coefficient
Inference on Correlation Coefficient—Under Pearson’s Correlation
• One of the assumptions of Pearson’s correlation coefficient is that the two variables are coming from
a bivariate normal distribution
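• The joint pdf of the two variables, say x ∈ R and y ∈ R, is
\[ f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right) + \left(\frac{y - \mu_y}{\sigma_y}\right)^2 \right] \right\} \]
where ρ ∈ (−1, +1) is the population correlation coefficient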
• To assess the significance of any calculated r, the sampling distribution of this statistic is needed.
When ρ is positive, the distribution of r is negatively skewed, and for small samples it has high spread/variability.
• For the hypothesis H0 : ρ = 0 vs H1 : ρ ≠ 0, the test statistic
\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \]
follows a t-distribution with n − 2 degrees of freedom under H0

# Sampling distribution of r
set.seed(23) # set seed number
# Read the data
data(EuStockMarkets)
# Take only the first two variables
D <- as.data.frame(EuStockMarkets[, 1:2])
n <- 30 # sample size
r <- numeric(1000); t_stat <- numeric(1000)

# The simulation
for (i in 1:1000) {
  # Take a random sample of n rows
  idx <- sample(1:nrow(D), size = n)
  r[i] <- cor(D$DAX[idx], D$SMI[idx])
  t_stat[i] <- r[i]*sqrt(n - 2)/(sqrt(1 - r[i]^2))
}

# Plot histograms
hist(r)
hist(t_stat)

• Example 10: Let’s consider the data in Example 9. The Pearson correlation coefficient between
alcohol consumption and death rate is found to be r = 0.937788 (0.94 to two decimals). Test H0 : ρ = 0
• Solution: The sample size is n = 9, so the test statistic, computed with the unrounded r, is
\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.937788\sqrt{9 - 2}}{\sqrt{1 - 0.937788^2}} = 7.146 \]
The p-value is 2Pr(t_7 > 7.146) = 0.000186. Since the p-value is less than the 5% level of significance,
we have enough evidence to conclude that alcohol consumption and death rate are linearly related
• The cor.test(x, y, alternative = c("two.sided", "less", "greater"), method
= c("pearson", "kendall", "spearman"), exact = NULL, conf.level =
0.95, continuity = FALSE, ...) function performs this test in R

# Enter data
`Alc consumption` <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

# Hypothesis testing
cor.test(`Alc consumption`, death)
The output is
Pearson's product-moment correlation

data: Alc consumption and death
t = 7.146, df = 7, p-value = 0.000186
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7255307 0.9871237
sample estimates:
cor
0.937788
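The t statistic and p-value reported by cor.test() can be reproduced by hand from the formula for t. The sketch below uses the Example 9 data; the names t_manual and p_manual are illustrative.

```r
# Example 9 data
alc   <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)
n <- length(alc)

r <- cor(alc, death)                                   # sample correlation
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)            # test statistic
p_manual <- 2 * pt(abs(t_manual), df = n - 2,
                   lower.tail = FALSE)                 # two-sided p-value

ct <- cor.test(alc, death)                             # built-in test
print(c(t_manual = t_manual, t_cor_test = unname(ct$statistic)))
print(c(p_manual = p_manual, p_cor_test = ct$p.value))
```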

• However, if our interest lies in testing a hypothesis like
H0 : ρ = ρ0
where ρ0 is a hypothesised nonzero value of the correlation coefficient, then the test procedure for H0 : ρ = 0
doesn’t handle it
• A more general result holds for any value of ρ0 ∈ (−1, +1): under the null hypothesis H0 : ρ = ρ0, the test
statistic
\[ W = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) = \tanh^{-1}(r) \]
follows approximately a normal distribution with mean \frac{1}{2}\ln\left(\frac{1 + \rho}{1 - \rho}\right) = \tanh^{-1}(\rho) and standard
deviation
\[ \frac{1}{\sqrt{n - 3}} \]
W is usually referred to as the Fisher Z transformation

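• A 100(1 − α)% confidence interval for \tanh^{-1}(\rho) can be computed as
\[ W \pm Z_{\alpha/2}\frac{1}{\sqrt{n - 3}} \]
Transforming the lower and upper limits back with tanh() gives the confidence interval for ρ
\[ \tanh\left( W \pm Z_{\alpha/2}\frac{1}{\sqrt{n - 3}} \right) \]
Note that
\[ \tanh(w) = \frac{e^{2w} - 1}{e^{2w} + 1} \]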
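The confidence interval can be computed step by step with atanh() and tanh(). The sketch below applies it to the Example 9 data with α = 0.05; the names half_width and ci_rho are illustrative.

```r
# Example 9 data
alc   <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)
n <- length(alc)
alpha <- 0.05

r <- cor(alc, death)
W <- atanh(r)                                     # Fisher transformation of r
half_width <- qnorm(1 - alpha / 2) / sqrt(n - 3)  # z quantile times the SD of W

# Interval on the transformed scale, mapped back to the correlation scale
ci_rho <- tanh(c(W - half_width, W + half_width))
print(ci_rho)

# cor.test() reports a confidence interval for Pearson's correlation as well
ci_builtin <- as.numeric(cor.test(alc, death, conf.level = 1 - alpha)$conf.int)
print(ci_builtin)
```

The interval is not symmetric around r, because tanh() compresses values near +1.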
• Example 11: Consider the data and results in Example 9. Test the claim that
H0 : ρ = 0.9
H1 : ρ ≠ 0.9

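• Solution: The correlation coefficient is r = 0.94 and n = 9. The test statistic is
\[ W = \tanh^{-1}(r) = \tanh^{-1}(0.94) = 1.738 \]
Under H0, the mean and standard deviation of W are
\[ \text{mean} = \tanh^{-1}(\rho_0) = \tanh^{-1}(0.9) = 1.472 \quad \text{and} \quad \text{SD} = \frac{1}{\sqrt{n - 3}} = 0.408 \]
The p-value is
\[ 2\Pr\left( Z > \frac{W - \text{mean}}{\text{SD}} \right) = 2\Pr\left( Z > \frac{1.738 - 1.472}{0.408} \right) = 2\Pr(Z > 0.652) = 0.5144 \]
Since the p-value exceeds 0.05, we don’t reject the null hypothesis. The atanh() function can be used to compute \tanh^{-1}(\cdot) in R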
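The calculation in Example 11 can be scripted directly from the summary values r = 0.94, n = 9 and ρ0 = 0.9. This is a sketch; small differences from the hand calculation arise because the intermediate values are not rounded here.

```r
# Summary values from Example 11
r    <- 0.94   # sample correlation (rounded, as in the example)
n    <- 9      # sample size
rho0 <- 0.9    # hypothesised correlation

W      <- atanh(r)          # Fisher transformation of r
mean_W <- atanh(rho0)       # mean of W under H0
sd_W   <- 1 / sqrt(n - 3)   # standard deviation of W

z <- (W - mean_W) / sd_W
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value

print(c(W = W, mean = mean_W, SD = sd_W, z = z, p_value = p_value))
```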

Inference on Correlation Coefficient—Under Spearman’s and Kendall’s Rank
Correlation
• The test procedure associated with the Kendall and Spearman rank correlations depends on the sample
size. For relatively large samples, however, the statistic used to test H0 : ρ = 0
follows a known probability distribution
— Case 1: Inference under Spearman’s rank correlation. For larger samples, n > 20, we
can use the test procedure for Pearson’s correlation. The limiting normal distribution of rs under H0
has mean 0 and standard deviation
\[ \frac{1}{\sqrt{n - 1}} \]
— Case 2: Inference under Kendall’s rank correlation. For larger samples, n > 10, the
Central Limit Theorem means that an approximate normal distribution can be used for τ under H0,
with mean 0 and standard deviation
\[ \sqrt{\frac{2(2n + 5)}{9n(n - 1)}} \]

• Example 12: An actuary wants to investigate if there is any correlation between students’ scores in
the CS1 mock exam and the CS2 mock exam. Data values from 22 students were collected and the
results are:
\[ \sum_{i=1}^{22} d_i^2 = 494, \quad n_c = 174, \quad \text{and} \quad n_d = 57 \]
Test H0 : ρ = 0 vs H1 : ρ > 0 for the mock score data using the Spearman’s rank correlation
coefficient and the Kendall’s rank correlation coefficient along with normal approximations
• Solution: The Spearman’s and Kendall’s rank correlation coefficients are
\[ r_s = 1 - \frac{6\sum_{i=1}^{22} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(494)}{22(22^2 - 1)} = 0.72 \]
\[ \tau = \frac{2(n_c - n_d)}{n(n - 1)} = \frac{2(174 - 57)}{22(22 - 1)} = 0.51 \]
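The two coefficients of Example 12, together with the standard deviations of their normal approximations, can be computed from the summary values alone. This is a sketch; the object names sum_d2, sd_spearman and sd_kendall are illustrative.

```r
# Summary values from Example 12
n      <- 22    # number of students
sum_d2 <- 494   # sum of squared rank differences
nc     <- 174   # concordant pairs
nd     <- 57    # discordant pairs

rs  <- 1 - 6 * sum_d2 / (n * (n^2 - 1))   # Spearman's rank correlation
tau <- 2 * (nc - nd) / (n * (n - 1))      # Kendall's tau

# Standard deviations of the large-sample normal approximations under H0
sd_spearman <- 1 / sqrt(n - 1)
sd_kendall  <- sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

print(round(c(rs = rs, tau = tau, sd_spearman = sd_spearman, sd_kendall = sd_kendall), 3))
```

As a consistency check, nc + nd equals choose(22, 2) = 231, the total number of pairs.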

• Let’s use Spearman’s rank correlation coefficient to test the hypothesis. Under H0, rs is
approximately normal with mean zero and standard deviation 1/\sqrt{22 - 1} = 0.218, so the standardized value is
\[ \frac{r_s - 0}{\text{SD}} = \frac{0.72 - 0}{0.218} = 3.30 \]
The p-value is
\[ \Pr(Z > 3.30) \approx 0.0005 \]
We have enough evidence to reject the null hypothesis and decide in favor of the alternative
hypothesis
• Let’s use Kendall’s rank correlation. Under H0, τ is approximately normal with mean zero and standard deviation
\[ \sqrt{\frac{2(2n + 5)}{9n(n - 1)}} = \sqrt{\frac{2(2 \times 22 + 5)}{9(22)(22 - 1)}} = 0.154 \]
• The standardized value is
\[ \frac{\tau - 0}{\text{SD}} = \frac{0.51 - 0}{0.154} = 3.31 \]
The p-value is
\[ \Pr(Z > 3.31) \approx 0.0005 \]
The conclusion is still the same.

Multivariate correlation analysis
Multivariate Correlation Analysis—Visualization
• As the number of variables increases, it becomes very hard to visualize the data with one plot. For
instance, if we have five variables, we need to draw \binom{5}{2} = 10 scatter plots. This problem is
usually referred to as the curse of dimensionality.
• The relationship between variables in a multivariate setting can be investigated by drawing a scatter
plot matrix and a correlogram
# Load library
library(DataExplorer)
# Load dataset
data(EuStockMarkets)
stockData <- as.data.frame(EuStockMarkets)
# Rename the columns to more appropriate names
colnames(stockData) <- c("Germany", "Switzerland", "France", "UK")
# Correlogram
plot_correlation(stockData)

# Scatter plot matrix
plot(stockData)
Multivariate Correlation Analysis—Sample Correlation Coefficient Matrix
• The correlation coefficient matrix is a compact way to examine the degree and sign of the relationship
between each pair of variables
cor(stockData)
            Germany Switzerland    France        UK
Germany   1.0000000   0.9911539 0.9662274 0.9751778
Switzerland 0.9911539 1.0000000 0.9468139 0.9899691
France    0.9662274   0.9468139 1.0000000 0.9157265
UK        0.9751778   0.9899691 0.9157265 1.0000000
cor(stockData, method = "spearman")
            Germany Switzerland    France        UK
Germany   1.0000000   0.9727973 0.8290434 0.9670180
Switzerland 0.9727973 1.0000000 0.8131776 0.9867515
France    0.8290434   0.8131776 1.0000000 0.8051102
UK        0.9670180   0.9867515 0.8051102 1.0000000
• Exercise: Consider the data set called mtcars. Explore this data set, draw scatter plots, and compute
the correlation matrix. Use data(mtcars) to load the data into R

Multivariate Correlation Analysis—Inference
• The hypothesis H0 : ρ = 0 can be tested by corr.test() function of psych library
library(psych)
corr.test(stockData)
Call:corr.test(x = stockData)
Correlation matrix
            Germany Switzerland France   UK
Germany        1.00        0.99   0.97 0.98
Switzerland    0.99        1.00   0.95 0.99
France         0.97        0.95   1.00 0.92
UK             0.98        0.99   0.92 1.00
Sample Size
[1] 1860
Probability values (Entries above the diagonal are adjusted for multiple tests.)
            Germany Switzerland France UK
Germany           0           0      0  0
Switzerland       0           0      0  0
France            0           0      0  0
UK                0           0      0  0
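
The zeros in the probability table are p-values too small to display. As a sketch, the test for a single pair can be reproduced with base R's cor.test(), and the t statistic can be checked by hand with t = r * sqrt((n - 2) / (1 - r^2)). The pair DAX and FTSE (Germany and UK) is chosen here only for illustration, and the object names are not from the slides.

```r
stock <- as.data.frame(EuStockMarkets)

# Test H0: rho = 0 for one pair (DAX vs FTSE, i.e. Germany vs UK)
ct <- cor.test(stock$DAX, stock$FTSE, method = "pearson")

# The same t statistic computed by hand
r <- cor(stock$DAX, stock$FTSE)
n <- nrow(stock)
tStat <- r * sqrt((n - 2) / (1 - r^2))

c(r = r, t = tStat, df = n - 2)
ct$p.value   # numerically indistinguishable from 0
```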

• The function chart.Correlation() in the package PerformanceAnalytics can be used
to display a chart of a correlation matrix
library(PerformanceAnalytics)
chart.Correlation(stockData, histogram = TRUE, pch = 16)

Principal component analysis
Principal Component Analysis
• PCA is a statistical procedure that converts a set of observations of possibly correlated variables into
a set of values of linearly uncorrelated variables called principal components
• PCA will help us to find a reduced number of features that will represent our original dataset in a
compressed way, capturing up to a certain portion of its variance depending on the number of new
features we end up selecting
• This transformation is defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data as possible), and each
succeeding component, in turn, has the highest possible variance under the constraint that it is
orthogonal to the preceding components
Review—Matrix Algebra
Eigenvalues: The number λ is an eigenvalue of A if and only if A − λI is singular:
det(A − λI) = 0
For each λ, solve (A − λI)x = 0 to find an eigenvector x

• Example 13: Find the eigenvalues λ and eigenvectors x of
$A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$
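• Solution: When A is singular, λ = 0 is one of the eigenvalues, because the equation Ax = 0x has nonzero solutions.
$\det(A - \lambda I) = \det \begin{bmatrix} 1 - \lambda & 2 \\ 2 & 4 - \lambda \end{bmatrix} = (1 - \lambda)(4 - \lambda) - 4 = \lambda^2 - 5\lambda = \lambda(\lambda - 5) = 0$
Then λ = 5 and λ = 0. Solve (A − λI)x = 0 for λ = 5 and λ = 0. For λ = 5,
$(A - 5I)x = \begin{bmatrix} -4 & 2 \\ 2 & -1 \end{bmatrix} \begin{bmatrix} y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
so z = 2y and an eigenvector is $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$.
For λ = 0, an eigenvector is $\begin{bmatrix} 2 \\ -1 \end{bmatrix}$.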

• The unit eigenvectors are
$\begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}$ and $\begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix}$
A <- matrix(c(1, 2, 2, 4), ncol = 2, byrow = TRUE)
eigen(A)
eigen() decomposition
$values
[1] 5 0

$vectors
          [,1]       [,2]
[1,] 0.4472136 -0.8944272
[2,] 0.8944272  0.4472136
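
As a quick check of the result above, the following sketch verifies that each column v of the eigenvector matrix satisfies A v = λ v and has unit length. The names e and residual are illustrative only. Note that eigen() may return an eigenvector with the opposite sign to the hand calculation, which is still a valid eigenvector.

```r
A <- matrix(c(1, 2, 2, 4), ncol = 2, byrow = TRUE)
e <- eigen(A)

# A %*% V - V %*% diag(lambda) should be numerically zero for the eigenpairs
residual <- A %*% e$vectors - e$vectors %*% diag(e$values)
max(abs(residual))

# eigen() scales each eigenvector to unit length
colSums(e$vectors^2)

# det(A - lambda * I) vanishes (up to rounding) at each eigenvalue
sapply(e$values, function(lambda) det(A - lambda * diag(2)))
```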
• Let’s form the matrix of eigenvectors W, where the first column is the eigenvector of the largest
eigenvalue, the second column is the eigenvector of the second largest eigenvalue, and so on

• W is orthogonal (i.e., $W^{-1} = W^T$). Suppose X is a centered data matrix of dimension n × p, and W
holds the eigenvectors of $X^T X$. The principal components decomposition P of X is
$P = XW$
By doing so, we transform the data into a set of p orthogonal components
• To assess the explanatory power of each component, consider $S = P^T P$. S is a diagonal matrix where
each diagonal element is the (scaled) variance of the corresponding component of the transformed data (the
covariance between components is zero by construction).
• Example 14: Let’s consider the EuStockMarkets data set and construct principal components
stockData <- as.matrix(EuStockMarkets)

# Center the data set
x <- scale(stockData, scale = FALSE)
# Cross-product matrix of the centered data
X <- t(x) %*% x

X
            DAX        SMI        CAC       FTSE
DAX  2187625263 3324041230 1130755526 1920782024
SMI  3324041230 5141356405 1698659725 2989291485
CAC  1130755526 1698659725  626045333  964886991
FTSE 1920782024 2989291485  964886991 1773436263
# Eigenvector matrix
W <- eigen(X)$vectors
P <- x %*% W
# (Scaled) variance-covariance matrix of the components
S <- round(t(P) %*% P, 2)
# Variation explained (%)
diag(S) * 100 / sum(diag(S))
[1] 98.7472656  0.9389800  0.1701669  0.1435875

prcomp(stockData)
Standard deviations (1, .., p=4):
[1] 2273.23831  221.67188   94.36696   86.68436

Rotation (n x k) = (4 x 4):
           PC1        PC2        PC3        PC4
DAX  0.4747658  0.3776953 -0.4010222 -0.6863854
SMI  0.7309214 -0.1735069 -0.3023225  0.5867285
CAC  0.2437347  0.7304373  0.5985832  0.2208009
FTSE 0.4253759 -0.5419438  0.6240837 -0.3686079
W
           [,1]       [,2]       [,3]       [,4]
[1,] -0.4747658 -0.3776953 -0.4010222  0.6863854
[2,] -0.7309214  0.1735069 -0.3023225 -0.5867285
[3,] -0.2437347 -0.7304373  0.5985832 -0.2208009
[4,] -0.4253759  0.5419438  0.6240837  0.3686079
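
The link between the manual decomposition and prcomp() can be checked numerically. The sketch below rebuilds W, P and S as in Example 14 (without rounding S) and compares them with the prcomp() result. The object names e and pc are illustrative, and the tolerance is an arbitrary choice for the comparison.

```r
stockData <- as.matrix(EuStockMarkets)
x <- scale(stockData, scale = FALSE)   # centered data
n <- nrow(x)

e <- eigen(t(x) %*% x)                 # eigen decomposition of the cross-product matrix
W <- e$vectors
P <- x %*% W                           # principal component scores
S <- t(P) %*% P

pc <- prcomp(stockData)

# W is orthogonal: t(W) %*% W is the identity matrix
all.equal(t(W) %*% W, diag(4), tolerance = 1e-6)

# diag(S) / (n - 1) is the variance of each component, so its square root matches pc$sdev
all.equal(sqrt(diag(S) / (n - 1)), pc$sdev, tolerance = 1e-6)

# The columns of W agree with the rotation matrix up to sign
all.equal(abs(W), abs(unname(pc$rotation)), tolerance = 1e-6)
```

The sign of an eigenvector is arbitrary, which is why some columns of W appear with flipped signs relative to the rotation matrix.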

• How many principal components should be retained? There are three simple approaches that
can guide the choice of the number of relevant principal components. These are
— The visual examination of a scree plot (searching for an elbow in the scree plot)
— The variance explained criterion or
— The Kaiser rule (only those principal components whose variances exceed 1 are retained; this
rule is meant for components computed from standardized variables)
• Under the variance explained criterion, we set a threshold, say 80%, and stop when the first
k components account for a percentage of total variation greater than this threshold. Note that the
threshold is set somewhat arbitrarily; 70 to 90% are the usual sort of values, but it depends on the
context of the data set and can be higher or lower
# Variation explained
varExplained <- diag(S) * 100 / sum(diag(S))
plot(varExplained, type = "o", ylab = "Variance explained", xlab = "Principal component")
plot(cumsum(varExplained), type = "o", ylab = "Cumulative variance explained", xlab = "Principal component")
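
The threshold and Kaiser rules can also be applied numerically rather than read off a plot. The following is a minimal sketch using prcomp(); the 80% threshold is the example value from the text, the names k and kKaiser are illustrative, and applying the Kaiser rule to standardized variables (scale. = TRUE) is an assumption made here, since the raw indices are not on a unit-variance scale.

```r
stockData <- as.matrix(EuStockMarkets)

# Percentage of variance explained by each component (centered, unscaled data)
pc <- prcomp(stockData)
varExplained <- 100 * pc$sdev^2 / sum(pc$sdev^2)

# Variance explained criterion: smallest k whose cumulative share reaches the threshold
threshold <- 80
k <- which(cumsum(varExplained) >= threshold)[1]

# Kaiser rule, applied here to standardized variables (an assumption of this sketch)
pcStd <- prcomp(stockData, scale. = TRUE)
kKaiser <- sum(pcStd$sdev^2 > 1)

c(threshold80 = k, kaiser = kKaiser)
```

For this data set both rules keep a single component, which agrees with the first component explaining about 98.7% of the variation.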

Yabebal Ayalew
Statistics Department, Addis Ababa University(AAU)
Office: Freshman building, Room 115 email: yabebala@gmail.com
Training Title: Exploratory Data Analysis
