Data Analysis
Yabebal Ayalew
Statistics Department, Addis Ababa University
College of Natural & Computational Science
Statistics Department
Exploratory Data Analysis
“It is important to understand what you CAN DO before you learn to measure how WELL you
seem to have DONE it.”
Exploratory Data Analysis
Outline
1 Introduction
2 Bivariate Correlation Analysis
— Data Visualization
— Pearson’s Correlation Coefficient
— Spearman’s Rank Correlation
— Kendall’s Rank Correlation Coefficient
3 Inference on Correlation Coefficient
— Inference under Pearson’s Correlation
— Fisher Transformation of r
— Inference under Spearman’s and Kendall’s Rank Correlation
4 Multivariate Correlation Analysis
— Data Visualization
— Sample Correlation Coefficient Matrix
— Inference
5 Principal Component Analysis
3 Statistics Department Exploratory Data Analysis 17.9.2021
Introduction
• Almost every discipline from biology and economics to engineering
and marketing measures, gathers, and stores data in some digital
form
• Retail companies store information on sales transactions, insurance
companies keep track of insurance claims, and meteorological
organizations measure and collect data concerning weather
conditions
• Timely and well-founded decisions need to be made using the
information collected. These decisions will be used to maximize
sales, improve research and development projects, and trim costs
• Data are being produced at faster rates due to the explosion of
internet related information and the increased use of
operational systems to collect business, engineering and scientific
data, and measurements from sensors or monitors
Definition—Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach of analyzing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods.
Source: https://en.wikipedia.org/wiki/Exploratory_data_analysis
• According to John Tukey (June 16, 1915 – July 26, 2000), exploratory data analysis is detective
work: numerical detective work, counting detective work, or graphical detective work
• “A detective investigating a crime needs both tools and understanding. If he has no fingerprint
powder, he will fail to find fingerprints on most surfaces. If he does not understand where the
criminal is likely to have put his fingers, he will not look in the right places. Equally, the
analyst of data needs both tools and understanding.” (John Tukey)
• The processes of criminal justice are clearly divided between the
search for the evidence—the responsibility of the police and other
investigative forces—and the evaluation of the evidence’s
strength—a matter for juries and judges.
• In data analysis a similar distinction is helpful. Exploratory data
analysis is detective in character. Confirmatory data analysis is
judicial or quasi-judicial in character
• Unless the detective finds the clues, judge or jury has nothing to
consider. Unless exploratory data analysis uncovers indications,
usually quantitative ones, there is likely to be nothing for
confirmatory data analysis to consider
Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation
stone—as the first step.
EDA Objectives
• Exploratory Data Analysis (EDA) is an
approach/philosophy for data analysis which
employs a variety of techniques (mostly
graphical) to
1 Maximize insight into a data set
2 Uncover underlying structure
3 Extract important variables
4 Detect outliers and anomalies
5 Test underlying assumptions
6 Develop parsimonious models and
7 Determine optimal factor settings
• The R programming language is one of the best tools for big data analytics. It is free as in
freedom and comes with no warranty. The current stable version is R 4.1.0, released on May 18, 2021
— Some important libraries for EDA are DataExplorer and tidyverse
• R packages can be installed with the install.packages("<package name>") command. For instance,
    # Installing R packages
    install.packages("DataExplorer")
    install.packages("tidyverse")
• RStudio is a widely used IDE for the R programming language. It is free, though an enhanced
version of RStudio is not.
• There are two ways to use the functions in an installed package (library)
— The first option is to load the library and call the function by its name. For instance,
    # Loading the library
    library(DataExplorer)

    # Using the create_report function
    create_report(airquality)
— The second option is to skip loading the library and call library_name::function_name()
instead. For instance,
    DataExplorer::create_report(airquality)
• Note: airquality is a data set available in the datasets package, and create_report()
generates a report about the data set given as its argument
    data("airquality")
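When DataExplorer is not installed, a first look at the same data set can be taken with base R alone. The following is only a minimal sketch of that idea, not a replacement for the generated report; the object names dims and na_counts are chosen here for illustration.

```r
# First look at airquality using base R only (no extra packages needed)
data("airquality")

dims <- dim(airquality)                  # number of rows and columns
str(airquality)                          # variable names and types
print(summary(airquality))               # five-number summaries and NA counts
na_counts <- colSums(is.na(airquality))  # missing values per variable

print(dims)
print(na_counts)
```

The missing-value counts are worth checking before any correlation is computed, because cor() returns NA for a pair of variables containing missing values unless its use argument is changed.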
Bivariate Correlation Analysis
Bivariate Correlation Analysis—Scatter Plot
• A critical step in making sense of data is an understanding of the relationships between different
variables.
— For example, is there a relationship between interest rates and inflation or education level and
income?
• These relationships or associations can be established through an examination of different summary
tables and data visualizations as well as calculations that measure the strength and confidence in the
relationship.
• The simplest graphical tool to assess the relationship between two quantitative variables of interest is
the scatter plot
— A scatter plot is a 2D graph on the XY coordinate plane where the x-axis represents one of the variables
(usually believed to be the independent variable) and the y-axis represents the second variable
(usually the dependent variable)
— Each pair of values (xi, yi) is represented by a dot, and the collection of dots forms some kind of pattern
• The plot(x, y, ...) function can be used to draw a scatter plot in R. You can get more
information about the function and its arguments by typing ?plot in the RStudio console
• Let’s consider the EuStockMarkets data set that comes with the R installation for demonstration. This
data set contains the daily closing prices of major European stock indices, 1991–1998. Type
?EuStockMarkets for more information
    # Load EuStockMarkets
    data(EuStockMarkets)
    # Convert the time series matrix to a data frame
    stockPrice <- as.data.frame(EuStockMarkets)
    # Look at the first rows
    head(stockPrice)
    # Rename the columns (DAX, SMI, CAC, FTSE) by country
    colnames(stockPrice) <- c("Germany", "Switzerland", "France", "UK")
    # Scatter plot of Germany and UK
    xlab <- "Germany Stock Price" # x-axis label
    ylab <- "UK Stock Price" # y-axis label
    plot(x = stockPrice$Germany, y = stockPrice$UK, xlab = xlab, ylab = ylab)
• Example 1: A sample of 10 claims and corresponding payments on settlement for household policies
is taken from the business of an insurance company
Claim (x)     2.10  2.40  2.50  3.20  3.60  3.80  4.10  4.20  4.50  5.00
Payment (y)   2.18  2.06  2.54  2.61  3.67  3.25  4.02  3.71  4.38  4.45
Draw a scatter plot and comment on the relationship between claims and payments.
• Solution: Let’s use the plot() function
    # Enter the data as vectors using the c() function
    Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
    Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

    # Scatter plot: pch is the marker symbol and col is the color
    plot(x = Claim, y = Payment, pch = 16, col = "blue")
• Example 2: A professional body wishes to analyse the performance of its students on a particular
two-part examination. It records the scores obtained by a sample of 12 students on the first part of
the exam, and the scores obtained by the same students on the second part of the exam. The results
are as follows:
Student                        A   B   C   D   E   F   G   H   I   J   K   L
First-part exam score x (%)    82  49  73  60  61  77  65  85  91  53  59  73
Second-part exam score y (%)   76  58  75  66  70  71  76  92  87  59  63  71
Draw a scatter plot and comment on the relationship between the first-part exam score and the
second-part exam score.
    # Enter the data (backticks allow variable names that contain spaces)
    `First-part exam score` <- c(82, 49, 73, 60, 61, 77, 65, 85, 91, 53, 59, 73)
    `Second-part exam score` <- c(76, 58, 75, 66, 70, 71, 76, 92, 87, 59, 63, 71)

    # Scatter plot
    plot(x = `First-part exam score`, y = `Second-part exam score`, pch = 16, col = 6)
Comment: Both scatter plots suggest a positive linear relationship between the variables, i.e., as
one variable tends to increase, the second variable also tends to increase.
Merits
• It can be easily understood and interpreted
• Extreme values do not distort the method. Such points simply appear isolated in the plot
• It is the best method to show a non-linear pattern
• Shows both positive and negative types of correlation graphically
Demerits
• These diagrams cannot measure the precise extent of correlation, and interpretation can be
subjective
• Limited to two or three dimensions at a time
• Data on both axes have to be continuous
• It is not a quantitative measure of the relationship between the variables. It is only a
qualitative expression of the quantitative change
• Exercise: The rate of interest of borrowing, over the next five years, for ten companies is compared
to each company’s leverage ratio (its debt to equity ratio). The data is as follows:
Leverage ratio (x) 0.1 0.4 0.5 0.8 1.0 1.8 2.0 2.5 2.8 3.0
Interest rate % (y) 2.8 3.4 3.5 3.6 4.6 6.3 10.2 19.7 31.3 42.9
Draw a scatterplot and comment on the relationship between company borrowing (leverage) and
interest rate. Hence apply a transformation to obtain a linear relationship.
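One possible way to approach the transformation step is sketched below. It assumes a natural-log transformation of the interest rate, which is only one candidate (the exercise does not say which transformation is intended), and it uses cor() as a rough numeric check of how close each version is to a straight line. The names leverage, interest and log_interest are illustrative.

```r
# Exercise data: leverage ratio (x) and interest rate in % (y)
leverage <- c(0.1, 0.4, 0.5, 0.8, 1.0, 1.8, 2.0, 2.5, 2.8, 3.0)
interest <- c(2.8, 3.4, 3.5, 3.6, 4.6, 6.3, 10.2, 19.7, 31.3, 42.9)

# Assumed transformation: natural log of the interest rate
log_interest <- log(interest)

# Linear correlation before and after the transformation
r_raw <- cor(leverage, interest)
r_log <- cor(leverage, log_interest)
print(c(raw = r_raw, log = r_log))
```

Calling plot(leverage, log_interest) afterwards shows the same thing visually: the curved pattern of the raw data becomes much closer to a straight line.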
Bivariate Correlation Analysis—Pearson’s Correlation Coefficient
Review—Statistic Vs Parameter
• A statistic is any measure that is computed from a sample. It is usually denoted by a Latin
letter. For example, X̄ is the sample mean (a statistic)
• A parameter is any measure that is computed from a population. It is usually denoted by a Greek
letter. For instance, µ is the population mean.
• The degree of association between the x and y values is summarised by the value of an appropriate
correlation coefficient, each of which takes values from −1 to +1.
• One widely used correlation coefficient is Pearson’s correlation coefficient, named after the
well-known English mathematician and biostatistician Karl Pearson (27 March 1857 – 27 April 1936)
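• Pearson’s correlation coefficient computed from the sample is denoted by r. Suppose we have
two quantitative variables, say x and y, with a total of n observations
$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$
Let
$$
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i y_i - x_i\bar{y} - \bar{x} y_i + \bar{x}\bar{y}) && \text{distribute the summation} \\
&= \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i\bar{y} - \sum_{i=1}^{n} y_i\bar{x} + \sum_{i=1}^{n} \bar{x}\bar{y} && \text{note that } \bar{x} = \tfrac{1}{n}\sum_{i=1}^{n} x_i \text{ and } \bar{y} = \tfrac{1}{n}\sum_{i=1}^{n} y_i \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} && \bar{x} \text{ and } \bar{y} \text{ are constants over the index } i \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{aligned}
$$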
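• Let
$$
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i^2 - 2x_i\bar{x} + \bar{x}^2) && \text{polynomial expansion} \\
&= \sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i\bar{x} + \sum_{i=1}^{n} \bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}\bar{x} + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}
$$
With the same technique
$$
S_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2
$$
Therefore, the sample Pearson’s correlation coefficient r can be re-written as
$$
r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}
$$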
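The shortcut forms of Sxy, Sxx and Syy can be checked numerically against their definitions. The sketch below does this with the claim and payment data of Example 1; the variable names ending in _def are introduced here only to separate the definitional sums from the shortcut sums.

```r
# Example 1 data
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)
n <- length(Claim)

# Sums of squares and cross-products from the definitions
Sxy_def <- sum((Claim - mean(Claim)) * (Payment - mean(Payment)))
Sxx_def <- sum((Claim - mean(Claim))^2)
Syy_def <- sum((Payment - mean(Payment))^2)

# The same quantities from the shortcut formulas
Sxy <- sum(Claim * Payment) - n * mean(Claim) * mean(Payment)
Sxx <- sum(Claim^2) - n * mean(Claim)^2
Syy <- sum(Payment^2) - n * mean(Payment)^2

# Correlation coefficient from the shortcut quantities
r <- Sxy / sqrt(Sxx * Syy)
print(c(Sxy = Sxy, Sxx = Sxx, Syy = Syy, r = r))
```

Both routes give the same numbers up to floating-point rounding, and r agrees with the value returned by cor(Claim, Payment).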
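• The population correlation coefficient is denoted by the Greek letter ρ
$$
\rho = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{Var}(x)\operatorname{Var}(y)}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$
r is a sample estimator of the unknown population correlation coefficient ρ.
• Relationship with Simple Linear Regression: Simple linear regression has the form
$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$
where $\varepsilon_i$ is the random disturbance term and β0 and β1 are model parameters, estimated by β̂0
and β̂1, respectively.
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}
$$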
• With a very simple algebraic manipulation, we can derive that
$$
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = r\sqrt{\frac{S_{yy}}{S_{xx}}}
$$
Moreover, the coefficient of determination R² of the simple linear regression is
$$
R^2 = r^2
$$
• Properties of Pearson’s correlation coefficient
— The value of r is between −1 and +1. r = 0 indicates the absence of a linear relationship between
the variables. r = −1 indicates a perfect negative linear relationship while r = +1 indicates a perfect
positive linear relationship
— Pearson’s correlation coefficient doesn’t imply a cause-and-effect type of relationship
— A non-linear type of correlation can’t be quantified by r
— It makes no distinction between the roles of the variables considered, i.e., dependent vs independent
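The link between the regression slope, r and R² can be verified with lm(). The sketch below uses the claim and payment data of Example 1 and treats Payment as the response, which is an assumption made here for illustration; the names fit, slope and slope_from_r are likewise illustrative.

```r
# Example 1 data
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

# Fit the simple linear regression Payment = b0 + b1 * Claim + error
fit   <- lm(Payment ~ Claim)
slope <- unname(coef(fit)["Claim"])

# Slope reconstructed from r, Sxx and Syy
r   <- cor(Claim, Payment)
Sxx <- sum((Claim - mean(Claim))^2)
Syy <- sum((Payment - mean(Payment))^2)
slope_from_r <- r * sqrt(Syy / Sxx)

# Coefficient of determination reported by lm()
r_squared <- summary(fit)$r.squared

print(c(slope = slope, slope_from_r = slope_from_r))
print(c(R2 = r_squared, r2 = r^2))
```

Swapping the roles of the two variables changes the slope but leaves r and R² unchanged, which is the last property listed above.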
• The cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
function in R can be used to compute Pearson’s correlation coefficient
• Example 3: Let’s consider the data in Example 1.
Claim (x)     2.10  2.40  2.50  3.20  3.60  3.80  4.10  4.20  4.50  5.00
Payment (y)   2.18  2.06  2.54  2.61  3.67  3.25  4.02  3.71  4.38  4.45
Compute Pearson’s correlation coefficient, r
• Solution: The summaries of the above data are
$$
\sum_{i=1}^{10} x_i = 35.4, \quad \sum_{i=1}^{10} x_i^2 = 133.76, \quad \sum_{i=1}^{10} y_i = 32.87, \quad \sum_{i=1}^{10} y_i^2 = 115.2025, \quad \sum_{i=1}^{10} x_i y_i = 123.81
$$
Thus,
$$
\bar{x} = \frac{\sum_{i=1}^{10} x_i}{n} = \frac{35.4}{10} = 3.54 \quad \text{and} \quad \bar{y} = \frac{\sum_{i=1}^{10} y_i}{n} = \frac{32.87}{10} = 3.287
$$
• By definition,
$$
r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}} = \frac{123.81 - 10(3.54)(3.287)}{\sqrt{133.76 - 10(3.54)^2}\,\sqrt{115.2025 - 10(3.287)^2}} = 0.96
$$
This value indicates a strong positive linear relationship between claim and payment.
• Using the R programming language with the cor() function
    # Enter the data as vectors using the c() function
    Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
    Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

    # Pearson's correlation coefficient
    cor(x = Claim, y = Payment)
• Example 4: Let’s consider the data in Example 2 and compute Pearson’s correlation coefficient.
Student                        A   B   C   D   E   F   G   H   I   J   K   L
First-part exam score x (%)    82  49  73  60  61  77  65  85  91  53  59  73
Second-part exam score y (%)   76  58  75  66  70  71  76  92  87  59  63  71
• Solution: The summaries are
$$
\sum_{i=1}^{12} x_i = 828, \quad \sum_{i=1}^{12} x_i^2 = 59054, \quad \sum_{i=1}^{12} y_i = 864, \quad \sum_{i=1}^{12} y_i^2 = 63362, \quad \sum_{i=1}^{12} x_i y_i = 60950
$$
The means are
$$
\bar{x} = 69 \quad \text{and} \quad \bar{y} = 72
$$
$$
r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}} = \frac{60950 - 12(69)(72)}{\sqrt{59054 - 12(69)^2}\,\sqrt{63362 - 12(72)^2}} = 0.896
$$
    # Enter the data
    `First-part exam score` <- c(82, 49, 73, 60, 61, 77, 65, 85, 91, 53, 59, 73)
    `Second-part exam score` <- c(76, 58, 75, 66, 70, 71, 76, 92, 87, 59, 63, 71)

    # Pearson's correlation coefficient
    cor(x = `First-part exam score`, y = `Second-part exam score`, method = "pearson")
• Example 5: A new computerised ultrasound scanning technique has enabled doctors to monitor the
weights of unborn babies. The table below shows the estimated weights for one particular baby at
fortnightly intervals during the pregnancy.
Gestation period (weeks)     30   32   34   36   38   40
Estimated baby weight (kg)   1.6  1.7  2.5  2.8  3.2  3.5
$$
\sum_{i=1}^{6} x_i = 210, \quad \sum_{i=1}^{6} x_i^2 = 7420, \quad \sum_{i=1}^{6} y_i = 15.3, \quad \sum_{i=1}^{6} y_i^2 = 42.03, \quad \sum_{i=1}^{6} x_i y_i = 549.8
$$
Compute Sxx, Syy, Sxy and r.
• Solution: As stated in the question, gestation period is x and estimated baby weight is
y. Then,
$$
\bar{x} = \frac{210}{6} = 35 \quad \text{and} \quad \bar{y} = \frac{15.3}{6} = 2.55
$$
By definition,
$$
S_{xx} = \sum_{i=1}^{6} x_i^2 - n\bar{x}^2 = 7420 - 6(35)^2 = 70
$$
$$
S_{yy} = \sum_{i=1}^{6} y_i^2 - n\bar{y}^2 = 42.03 - 6 \times 2.55^2 = 3.015
$$
$$
S_{xy} = \sum_{i=1}^{6} x_i y_i - n\bar{y}\bar{x} = 549.8 - 6(2.55)(35) = 14.3
$$
$$
r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{14.3}{\sqrt{70 \times 3.015}} = 0.984
$$
A very strong positive correlation is observed.
    # Enter the data
    `Gestation period` <- c(30, 32, 34, 36, 38, 40)
    `Estimated baby weight` <- c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5)

    # Computing
    n <- length(`Gestation period`) # sample size
    x_bar <- mean(`Gestation period`) # mean of gestation period
    y_bar <- mean(`Estimated baby weight`) # mean of baby weight

    Sxx <- sum(`Gestation period`^2) - n * x_bar^2
    Syy <- sum(`Estimated baby weight`^2) - n * y_bar^2

    Sxy <- sum(`Gestation period` * `Estimated baby weight`) - n * x_bar * y_bar

    r <- Sxy / sqrt(Sxx * Syy)

    # Since we have the raw data set
    cor(x = `Gestation period`, y = `Estimated baby weight`)
• Exercise: A schoolteacher is investigating the claim that class size does not affect GCSE results. His
observations of nine GCSE classes are as follows:
Students in class (c)                     35   32   27   21   34   30   28   24   7
Average GCSE point score for class (p)    5.9  4.1  2.4  1.7  6.3  5.3  3.5  2.6  1.6
$$
\sum c = 238, \quad \sum c^2 = 6884, \quad \sum p = 33.4, \quad \sum p^2 = 149.62, \quad \sum cp = 983
$$
Compute Pearson’s correlation coefficient
• Final Thought:
— Pearson’s correlation coefficient is affected by outliers
— Even when the variables have a strong non-linear relationship, Pearson’s correlation coefficient
can return a low value
— The data set which is to be correlated should approximate the normal distribution. If the
data is normally distributed, then the data points tend to lie closer to the mean
— Pearson’s correlation coefficient is also called the Pearson product-moment correlation
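The first two points can be illustrated with a small sketch. The parabola and the extra observation (Claim = 5.0, Payment = 0.5) are hypothetical values made up here for illustration; only the Claim and Payment vectors come from Example 1.

```r
# 1) A perfect but non-linear relationship: y is fully determined by x
x <- -5:5
y <- x^2
r_parabola <- cor(x, y)   # no linear trend, so r is essentially zero

# 2) Effect of a single outlier on the Example 1 data
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)
r_clean <- cor(Claim, Payment)

# Hypothetical outlier: a large claim settled with a very small payment
r_outlier <- cor(c(Claim, 5.0), c(Payment, 0.5))

print(c(parabola = r_parabola, clean = r_clean, with_outlier = r_outlier))
```

One added point out of eleven is enough to pull r from a strong correlation down to a weak one, which is why a scatter plot should be inspected before r is interpreted.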
Bivariate Correlation Analysis—Spearman’s Rank Correlation Coefficient
• Often, a bivariate population is far from normal. In that event, the
computation of Pearson’s correlation coefficient r as an estimator
of ρ is no longer valid
• In some cases a transformation of the variables x and y brings
their joint distribution close to the bivariate normal, making it
possible to estimate ρ in the new scale
• Spearman’s rank-order correlation is the nonparametric version of the Pearson product-moment
correlation r.
• Spearman’s correlation coefficient, rs, measures the strength and direction of association
between two ranked variables.
Charles Spearman (10 September 1863 – 17 September 1945)
• Spearman’s correlation determines the strength and direction of the monotonic relationship
between your two variables rather than the strength and direction of the linear relationship between
your two variables
• There are two methods to calculate Spearman’s correlation depending on whether:
1 Your data does not have tied ranks or
2 Your data has tied ranks
The formula for when there are no tied ranks is
$$
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$
where $d_i$ is the difference in paired ranks and n is the number of observations
• Let $r(x_i)$ be the rank of $x_i$ and $r(y_i)$ the rank of $y_i$, with mean ranks $\overline{r(x)}$ and
$\overline{r(y)}$. If any of the ranks are tied, Spearman’s rank correlation coefficient is computed as
Pearson’s correlation coefficient of the ranks
$$
r_s = \frac{\sum_{i=1}^{n}\left(r(x_i) - \overline{r(x)}\right)\left(r(y_i) - \overline{r(y)}\right)}{\sqrt{\sum_{i=1}^{n}\left(r(x_i) - \overline{r(x)}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(r(y_i) - \overline{r(y)}\right)^2}}
$$
• Example 6: A school teacher is investigating the claim that class size does not affect GCSE results.
His observations of nine GCSE classes are presented below. Compute rs
Students in class (c)   Average GCSE point (p)   r(c)   r(p)   d = r(c) − r(p)   d²
35                      5.9                      9      8       1                1
32                      4.1                      7      6       1                1
27                      2.4                      4      3       1                1
21                      1.7                      2      2       0                0
34                      6.3                      8      9      −1                1
30                      5.3                      6      7      −1                1
28                      3.5                      5      5       0                0
24                      2.6                      3      4      −1                1
7                       1.6                      1      1       0                0
• Solution: $\sum d^2 = 6$. The Spearman’s rank correlation coefficient rs is
$$
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 6}{9(9^2 - 1)} = 0.95
$$
    # Enter data (named so that they do not mask the c() function)
    class_size <- c(35, 32, 27, 21, 34, 30, 28, 24, 7)
    gcse_points <- c(5.9, 4.1, 2.4, 1.7, 6.3, 5.3, 3.5, 2.6, 1.6)
    # Ranking
    r_c <- rank(class_size)
    r_p <- rank(gcse_points)
    # Sample size
    n <- length(class_size)
    # Rank difference
    d <- r_c - r_p
    # Spearman's rank correlation
    rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    # Alternatively
    cor(x = class_size, y = gcse_points, method = "spearman")
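Example 6 has no tied ranks, so the shortcut formula and cor() agree. The sketch below shows what happens when ties are present, using a small hypothetical data set (the vectors x and y are made up for illustration): Pearson's formula applied to average ranks matches cor(..., method = "spearman"), while the shortcut formula no longer does.

```r
# Hypothetical data with tied values in both variables
x <- c(1, 2, 2, 3, 4)
y <- c(2, 1, 3, 3, 5)

# Average ranks are assigned to tied observations
r_x <- rank(x, ties.method = "average")
r_y <- rank(y, ties.method = "average")
n   <- length(x)

# Tied-rank formula: Pearson's correlation of the ranks
rs_ties <- cor(r_x, r_y)

# Shortcut formula, which assumes no ties
rs_shortcut <- 1 - 6 * sum((r_x - r_y)^2) / (n * (n^2 - 1))

print(c(ties_formula = rs_ties,
        shortcut     = rs_shortcut,
        builtin      = cor(x, y, method = "spearman")))
```

The gap between the two values is small here, but it grows with the number of ties, so the rank-based Pearson form is the safer default.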
• Example 7: The rate of interest of borrowing, over the next five years, for ten companies is
compared to each company’s leverage ratio (its debt to equity ratio). The data is as follows:
Leverage ratio (x)   Interest rate (y)   r(x)   r(y)   d   d²
0.1                  2.8                 1      1      0   0
0.4                  3.4                 2      2      0   0
0.5                  3.5                 3      3      0   0
0.8                  3.6                 4      4      0   0
1.0                  4.6                 5      5      0   0
1.8                  6.3                 6      6      0   0
2.0                  10.2                7      7      0   0
2.5                  19.7                8      8      0   0
2.8                  31.3                9      9      0   0
3.0                  42.9                10     10     0   0
The Spearman rank correlation is one since $\sum d^2 = 0$
    # Enter the data
    `Leverage ratio` <- c(0.1, 0.4, 0.5, 0.8, 1.0, 1.8, 2.0, 2.5, 2.8, 3.0)
    `Interest rate` <- c(2.8, 3.4, 3.5, 3.6, 4.6, 6.3, 10.2, 19.7, 31.3, 42.9)
    # Rank the data (there are no ties here)
    r_leverage <- rank(`Leverage ratio`, ties.method = "average")
    r_interest <- rank(`Interest rate`, ties.method = "average")

    # Correlation coefficients: Spearman first, then Pearson for comparison
    cor(`Leverage ratio`, `Interest rate`, method = "spearman")
    cor(`Leverage ratio`, `Interest rate`)

    # Plots: plain scatter plot, then scatter plot with a smoothed curve
    plot(`Leverage ratio`, `Interest rate`, pch = 16, col = 18)
    scatter.smooth(`Leverage ratio`, `Interest rate`, pch = 16, col = 18)
Bivariate Correlation Analysis—The Kendall’s Rank Correlation Coefficient
• Kendall’s rank correlation, also called Kendall’s Tau, is a correlation measure suitable for
quantitative and ordinal variables. It indicates how strongly two variables are monotonically related
— Kendall’s Tau serves the same purpose as Spearman’s rank correlation
— Kendall’s Tau is computationally more intensive than Spearman’s rank correlation
Maurice George Kendall (6 September 1907 – 29 March 1983)
• Despite the more complicated calculation, it is considered to have better statistical properties than
Spearman’s rank correlation coefficient, particularly for small data sets with large numbers of tied
ranks
• Any pair of observations (xi, yi) and (xj, yj), where i ≠ j, is said to be concordant if the ranks for
both elements agree, i.e., xi > xj and yi > yj, or xi < xj and yi < yj; otherwise they are said to be
discordant.
• Let $n_c$ be the number of concordant pairs, and let $n_d$ be the number of discordant pairs. Assuming
that there are no ties, the Kendall coefficient τ is defined as
$$
\tau = \frac{2(n_c - n_d)}{n(n - 1)}
$$
Note that there are
$$
\binom{n}{2} = \frac{n!}{2!(n - 2)!} = \frac{n \times (n - 1) \times (n - 2)!}{2!(n - 2)!} = \frac{n(n - 1)}{2}
$$
pairs in total, each of which is either concordant or discordant
• Intuitively, it is clear that if the number of concordant pairs is much larger than the number of
discordant pairs, then the random variables are positively correlated. Whereas if the number of
concordant pairs is much less than the number of discordant pairs, then the variables are negatively
correlated
• Example 8: A new computerised ultrasound scanning technique has enabled doctors to monitor the
weights of unborn babies. The table below shows the estimated weights for one particular baby at
fortnightly intervals during the pregnancy.
Gestation period (weeks)     30   32   34   36   38   40
Estimated baby weight (kg)   1.6  1.7  2.5  2.8  3.2  3.5
Compute Kendall’s Tau
• Solution: The observations are (30, 1.6), (32, 1.7), (34, 2.5), (36, 2.8), (38, 3.2) and (40, 3.5). Let’s
create a table to identify the concordant and discordant pairs
(x, y)      (30,1.6)  (32,1.7)  (34,2.5)  (36,2.8)  (38,3.2)  (40,3.5)
(30,1.6)              c         c         c         c         c
(32,1.7)                        c         c         c         c
(34,2.5)                                  c         c         c
(36,2.8)                                            c         c
(38,3.2)                                                      c
• As can be seen in the above table, $n_c = 15$ and $n_d = 0$. Therefore,
$$
\tau = \frac{2(n_c - n_d)}{n(n - 1)} = \frac{2(15 - 0)}{6(6 - 1)} = 1
$$
    # Enter the data
    `Gestation period` <- c(30, 32, 34, 36, 38, 40)
    `Estimated baby weight` <- c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5)

    # Kendall's tau
    cor(x = `Gestation period`, y = `Estimated baby weight`, method = "kendall")
• Example 9: The data below show the consumption of alcohol (x, liters per year per person, 14 years
or older) and the death rate from cirrhosis, a liver disease (y, deaths per 100,000 population) in 9
countries (each country is an observation unit)
Country     Alc. Consumption   Death Rate from Cirrhosis
France      24.7               46.1
Italy       15.2               23.6
Germany     12.3               23.7
Australia   10.9               7.0
Belgium     10.8               12.3
USA         9.9                14.2
Canada      8.3                7.4
England     7.2                3.0
Israel      3.1                5.4
Computing Kendall’s tau using the approach of Example 8 is time consuming and prone to
error.
• It’s often easier to determine concordant and discordant pairs by using the ranks instead of the actual
numbers
• First arrange the observations in increasing order of x. Then, for each row, the number of concordant
pairs (c) is the number of observations below it that have a higher rank for y, and the number of
discordant pairs (d) is the number of observations below it that have a lower rank for y
Country     Alc. Consumption (x)   Death Rate (y)   r(y)   Concordant   Discordant
Israel      3.1                    5.4              2      7            1
England     7.2                    3.0              1      7            0
Canada      8.3                    7.4              4      5            1
USA         9.9                    14.2             6      3            2
Belgium     10.8                   12.3             5      3            1
Australia   10.9                   7.0              3      3            0
Germany     12.3                   23.7             8      1            1
Italy       15.2                   23.6             7      1            0
France      24.7                   46.1             9
$$
n_c = 7 + 7 + \cdots + 1 = 30, \quad n_d = 1 + 0 + \cdots + 0 = 6
$$
$$
\tau = \frac{2(n_c - n_d)}{n(n - 1)} = \frac{2(30 - 6)}{9(9 - 1)} = 0.67
$$
    library(oii)

    # Enter data
    `Alc consumption` <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
    death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

    # Correlation coefficient
    cor(`Alc consumption`, death, method = "kendall")

    # Numbers of concordant and discordant pairs
    concordant.pairs(`Alc consumption`, death)
    discordant.pairs(`Alc consumption`, death)
• The oii library is loaded in order to use the concordant.pairs and discordant.pairs functions.
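If the oii package is not available, the pair counts can also be obtained with base R. The following is a minimal sketch under the no-ties assumption of Example 9: it enumerates all pairs with combn() and classifies each by the sign of the product of the differences. The names alc, pair_idx and pair_sign are illustrative.

```r
# Example 9 data
alc   <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)
n     <- length(alc)

# All n(n - 1)/2 index pairs (i, j) with i < j, one pair per column
pair_idx <- combn(n, 2)
i <- pair_idx[1, ]
j <- pair_idx[2, ]

# +1 for a concordant pair, -1 for a discordant pair
pair_sign <- sign(alc[i] - alc[j]) * sign(death[i] - death[j])
n_c <- sum(pair_sign > 0)
n_d <- sum(pair_sign < 0)

# Kendall's tau from the counts
tau <- 2 * (n_c - n_d) / (n * (n - 1))
print(c(concordant = n_c, discordant = n_d, tau = tau))
```

The counts reproduce the table of Example 9 (30 concordant and 6 discordant pairs), and tau matches cor(alc, death, method = "kendall").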
Inference on Correlation Coefficient
Inference on Correlation Coefficient—Under Pearson’s Correlation
• One of the assumptions of Pearson’s correlation coefficient is that the two variables are coming from
a bivariate normal distribution
• The joint pdf of the two variables, say x ∈ R and y ∈ R, is
$$
f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right) + \left(\frac{y - \mu_y}{\sigma_y}\right)^2\right]\right)
$$
where ρ ∈ (−1, +1) is the population correlation coefficient
• To assess the significance of any calculated r, the sampling distribution of this statistic is needed.
When ρ is positive, the distribution of r is negatively skewed, and for small samples it has high
spread/variability.
• For the hypothesis H0 : ρ = 0 vs H1 : ρ ≠ 0, the test statistic
$$
t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}
$$
follows a t-distribution with n − 2 degrees of freedom under H0
    # Sampling distribution of r
    set.seed(23) # set seed number
    # Read the data
    data(EuStockMarkets)
    # Take only the first two variables (DAX and SMI)
    D <- as.data.frame(EuStockMarkets[, 1:2])
    n <- 30 # sample size
    r <- numeric(1000)
    t_stat <- numeric(1000) # named t_stat so that the t() function is not masked

    # The simulation
    for (i in 1:1000) {
      # Take a random sample of n rows
      idx <- sample(1:nrow(D), size = n)
      r[i] <- cor(D$DAX[idx], D$SMI[idx])
      t_stat[i] <- r[i] * sqrt(n - 2) / sqrt(1 - r[i]^2)
    }

    # Plot histograms
    hist(r)
    hist(t_stat)
• Example 10: Consider the data in Example 9. The Pearson correlation coefficient between alcohol consumption and death is found to be r = 0.94 (0.937788 before rounding). Test H0 : ρ = 0
• Solution: The sample size is n = 9, so the test statistic, computed with the unrounded r, is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.937788\sqrt{9-2}}{\sqrt{1-0.937788^2}} = 7.146$$
The p-value is 2Pr(t_7 > 7.146) = 0.000186. Since the p-value is below the 5% level of significance, we have enough evidence to conclude that alcohol consumption and death are linearly related
• The function cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), exact = NULL, conf.level = 0.95, continuity = FALSE, ...) can be used to carry out this test in R
# Enter data
`Alc consumption` <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

# Hypothesis testing
cor.test(`Alc consumption`, death)
The output is
    Pearson's product-moment correlation

data:  Alc consumption and death
t = 7.146, df = 7, p-value = 0.000186
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7255307 0.9871237
sample estimates:
     cor
0.937788
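The object returned by cor.test() can also be inspected piece by piece, which is handy when the numbers are needed in later calculations. The sketch below reuses the data from this example; the variable name alc is a stand-in for `Alc consumption` and is not from the slides.

```r
# Data from the example (alc is an assumed short name for `Alc consumption`)
alc   <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

res <- cor.test(alc, death)    # returns an object of class "htest"

r_hat  <- unname(res$estimate)   # sample correlation
t_stat <- unname(res$statistic)  # t statistic
df     <- unname(res$parameter)  # degrees of freedom, n - 2
p_val  <- res$p.value            # two-sided p-value
ci     <- as.numeric(res$conf.int)  # 95% confidence interval for rho

# The t statistic can be rebuilt from r and n as a consistency check
n <- length(alc)
c(from_cor.test = t_stat, by_hand = r_hat * sqrt(n - 2) / sqrt(1 - r_hat^2))
```

Working from the stored estimate rather than the rounded 0.94 avoids the rounding discrepancy that appears when the formula is evaluated by hand.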
• However, if our interest lies in testing a hypothesis such as
H0 : ρ = ρ0
where ρ0 is a hypothesised nonzero value of the correlation coefficient, the test procedure for H0 : ρ = 0 cannot handle it
• A more general result holds for any value ρ0 ∈ (−1, +1): under the null hypothesis H0 : ρ = ρ0, the test statistic
$$W = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \tanh^{-1}(r)$$
follows approximately a normal distribution with mean $\frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right) = \tanh^{-1}(\rho_0)$ and standard deviation $\frac{1}{\sqrt{n-3}}$
W is usually referred to as the Fisher Z transformation
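• A 100(1 − α)% confidence interval for $\tanh^{-1}(\rho)$ can be computed as
$$W \pm Z_{\alpha/2}\frac{1}{\sqrt{n-3}}$$
Transforming the lower and upper limits back with tanh() gives the confidence interval for ρ
$$\tanh\left(W \pm Z_{\alpha/2}\frac{1}{\sqrt{n-3}}\right)$$
Note that
$$\tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}$$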
• Example 11: Consider the data and results in Example 9. Test the claim that
H0 : ρ = 0.9
H1 : ρ ≠ 0.9
• Solution: The correlation coefficient is r = 0.94 and n = 9. The test statistic is
$$W = \tanh^{-1}(r) = \tanh^{-1}(0.94) = 1.738$$
The mean and standard deviation of W under H0 are
$$\text{mean} = \tanh^{-1}(\rho_0) = \tanh^{-1}(0.9) = 1.472 \quad\text{and}\quad \text{SD} = \frac{1}{\sqrt{n-3}} = \frac{1}{\sqrt{6}} = 0.408$$
The p-value is
$$2\Pr\left(Z > \frac{W-\text{mean}}{\text{SD}}\right) = 2\Pr\left(Z > \frac{1.738-1.472}{0.408}\right) = 2\Pr(Z > 0.652) = 0.5144$$
We don’t reject the null hypothesis. The atanh() function can be used to compute $\tanh^{-1}(\cdot)$ in R
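The Fisher Z test and the accompanying confidence interval can be wrapped in a small function. This is a sketch under the normal approximation described above; fisher_z_test() is a hypothetical helper name, not a function from base R or from the slides.

```r
# Fisher Z test of H0: rho = rho0, with a confidence interval for rho.
# fisher_z_test() is a hypothetical helper, not part of base R.
fisher_z_test <- function(r, n, rho0 = 0, conf.level = 0.95) {
  w  <- atanh(r)             # Fisher transformation of the sample correlation
  mu <- atanh(rho0)          # mean of W under H0
  se <- 1 / sqrt(n - 3)      # approximate standard deviation of W
  z  <- (w - mu) / se        # standardized statistic
  p  <- 2 * pnorm(-abs(z))   # two-sided p-value
  zc <- qnorm(1 - (1 - conf.level) / 2)
  ci <- tanh(w + c(-1, 1) * zc * se)  # limits transformed back to the rho scale
  list(W = w, z = z, p.value = p, conf.int = ci)
}

# Example 11: r = 0.94, n = 9, H0: rho = 0.9
ex11 <- fisher_z_test(r = 0.94, n = 9, rho0 = 0.9)
print(ex11)
```

Because the function carries full precision, its standardized value and p-value differ slightly in the third decimal from the hand calculation, which rounds W, the mean, and the SD first.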
Inference on Correlation Coefficient: Under Spearman’s and Kendall’s Rank Correlation
• The test procedure for the Spearman and Kendall rank correlations depends on the sample size. For relatively large samples, however, the statistic used to test H0 : ρ = 0 follows a known probability distribution
- Case 1: Inference under Spearman’s rank correlation. For larger samples, n > 20, we can use the test procedure for the Pearson correlation. The limiting normal distribution has mean 0 and standard deviation $\frac{1}{\sqrt{n-1}}$
- Case 2: Inference under Kendall’s rank correlation. For larger samples, n > 10, the Central Limit Theorem justifies an approximate normal distribution with mean 0 and standard deviation $\sqrt{\frac{2(2n+5)}{9n(n-1)}}$
• Example 12: An actuary wants to investigate whether there is any correlation between students’ scores in the CS1 mock exam and the CS2 mock exam. Data values from 22 students were collected and the results are:
$$\sum_{i=1}^{22} d_i^2 = 494, \quad n_c = 174, \quad\text{and}\quad n_d = 57$$
Test H0 : ρ = 0 vs H1 : ρ > 0 for the mock score data using Spearman’s rank correlation coefficient and Kendall’s rank correlation coefficient along with normal approximations
• Solution: The Spearman’s and Kendall’s rank correlation coefficients are
$$r_s = 1 - \frac{6\sum_{i=1}^{22} d_i^2}{n(n^2-1)} = 1 - \frac{6(494)}{22(22^2-1)} = 0.72$$
$$\tau = \frac{2(n_c-n_d)}{n(n-1)} = \frac{2(174-57)}{22(22-1)} = 0.51$$
• Let’s use Spearman’s rank correlation coefficient to test the hypothesis
$$W = \tanh^{-1}(r_s) = \tanh^{-1}(0.72) = 0.908$$
Note that the mean is zero and the standard deviation is $1/\sqrt{22-1} = 0.218$. The p-value is
$$\Pr\left(Z > \frac{0.908-0}{0.218}\right) = \Pr(Z > 4.165) \approx 0.000$$
We have enough evidence to reject the null hypothesis in favor of the alternative hypothesis
• Let’s use Kendall’s rank correlation
$$W = \tanh^{-1}(\tau) = \tanh^{-1}(0.51) = 0.563$$
The mean is zero and the standard deviation is $\sqrt{\frac{2(2n+5)}{9n(n-1)}} = \sqrt{\frac{2(2\times 22+5)}{9(22)(22-1)}} = 0.154$
• The standardized value is
$$\frac{W-\text{mean}}{\text{SD}} = \frac{0.563-0}{0.154} = 3.656$$
The p-value is
Pr(Z > 3.656) = 0.0001
The conclusion is still the same.
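The arithmetic of Example 12 can be reproduced from the reported summary statistics alone. The sketch below follows the steps of the worked solution (including the atanh() transformation used there); the variable names are illustrative choices, not from the slides.

```r
# Summary statistics reported in Example 12
n <- 22; sum_d2 <- 494; n_c <- 174; n_d <- 57

# Rank correlation coefficients
r_s <- 1 - 6 * sum_d2 / (n * (n^2 - 1))   # Spearman
tau <- 2 * (n_c - n_d) / (n * (n - 1))    # Kendall

# Standard deviations of the normal approximations under H0
sd_s   <- 1 / sqrt(n - 1)
sd_tau <- sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

# Standardized statistics, transformed with atanh() as in the worked solution
z_s   <- atanh(r_s) / sd_s
z_tau <- atanh(tau) / sd_tau

# One-sided p-values for H1: rho > 0
p_s   <- pnorm(z_s, lower.tail = FALSE)
p_tau <- pnorm(z_tau, lower.tail = FALSE)

round(c(r_s = r_s, tau = tau, z_s = z_s, z_tau = z_tau), 3)
c(p_s = p_s, p_tau = p_tau)
```

The standardized values differ slightly from the hand calculation because the coefficients are not rounded to two decimals before being transformed. With the raw scores available, cor.test(x, y, method = "spearman") or method = "kendall" would be the usual route.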
Multivariate correlation analysis
Multivariate Correlation Analysis—Visualization
• As the number of variables increases, it becomes very hard to visualize the data with one plot. For instance, if we have five variables, we need to draw $\binom{5}{2} = 10$ scatter plots. This problem is usually referred to as the curse of dimensionality.
• The relationship between variables in a multivariate setting can be investigated by drawing a scatter plot matrix and a correlogram
# Load library
library(DataExplorer)
# Load dataset
data(EuStockMarkets)
stockData <- as.data.frame(EuStockMarkets)
# Rename the columns to more descriptive names
colnames(stockData) <- c("Germany", "Switzerland", "France", "UK")
# Correlogram
plot_correlation(stockData)

# Scatter plot matrix
plot(stockData)
Multivariate Correlation Analysis—Sample Correlation Coefficient Matrix
• The correlation coefficient matrix is a compact way to examine the degree and sign of the relationship between each pair of variables
> cor(stockData)
              Germany Switzerland    France        UK
Germany     1.0000000   0.9911539 0.9662274 0.9751778
Switzerland 0.9911539   1.0000000 0.9468139 0.9899691
France      0.9662274   0.9468139 1.0000000 0.9157265
UK          0.9751778   0.9899691 0.9157265 1.0000000
> cor(stockData, method = "spearman")
              Germany Switzerland    France        UK
Germany     1.0000000   0.9727973 0.8290434 0.9670180
Switzerland 0.9727973   1.0000000 0.8131776 0.9867515
France      0.8290434   0.8131776 1.0000000 0.8051102
UK          0.9670180   0.9867515 0.8051102 1.0000000
• Exercise: Consider the data set mtcars. Explore this data set, draw scatter plots, and compute the correlation matrix. Use data(mtcars) to load the data into R
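One possible starting point for the exercise is sketched below using base R only. The choice of the four columns is an assumption made to keep the plot readable; the exercise itself does not prescribe which variables to use.

```r
# Load the data set that ships with R
data(mtcars)

# An illustrative subset of numeric variables (assumed, not prescribed)
vars <- c("mpg", "disp", "hp", "wt")

# Quick look at the structure of the chosen columns
str(mtcars[, vars])

# Scatter plot matrix of the chosen variables
pairs(mtcars[, vars], pch = 16)

# Pearson and Spearman correlation matrices
cm    <- cor(mtcars[, vars])
cm_sp <- cor(mtcars[, vars], method = "spearman")
round(cm, 2)
round(cm_sp, 2)
```

Comparing the Pearson and Spearman matrices is a simple way to spot pairs whose relationship is monotone but not linear.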
Multivariate Correlation Analysis—Inference
• The hypothesis H0 : ρ = 0 can be tested for every pair of variables with the corr.test() function of the psych package
> library(psych)
> corr.test(stockData)
Call:corr.test(x = stockData)
Correlation matrix
            Germany Switzerland France   UK
Germany        1.00        0.99   0.97 0.98
Switzerland    0.99        1.00   0.95 0.99
France         0.97        0.95   1.00 0.92
UK             0.98        0.99   0.92 1.00
Sample Size
[1] 1860
Probability values (Entries above the diagonal are adjusted for multiple tests.)
            Germany Switzerland France UK
Germany           0           0      0  0
Switzerland       0           0      0  0
France            0           0      0  0
UK                0           0      0  0
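The same idea can be sketched with base R alone, which makes the two steps (pairwise tests, then a multiple-testing adjustment) explicit. The Holm method is used here on the assumption that it matches the default documented for corr.test(); the variable names are illustrative.

```r
data(EuStockMarkets)
stockData <- as.data.frame(EuStockMarkets)

# All pairs of column indices (6 pairs for 4 variables)
pairs_idx <- combn(ncol(stockData), 2)

# Raw p-value of the Pearson test of H0: rho = 0 for each pair
raw_p <- apply(pairs_idx, 2, function(ix) {
  cor.test(stockData[[ix[1]]], stockData[[ix[2]]])$p.value
})
names(raw_p) <- apply(pairs_idx, 2, function(ix) {
  paste(names(stockData)[ix], collapse = "-")
})

# Adjust for multiple testing (Holm's method)
adj_p <- p.adjust(raw_p, method = "holm")

cbind(raw = raw_p, holm = adj_p)
```

With 1860 observations and correlations above 0.9, every p-value is numerically zero, which is why the probability table above shows only zeros.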
• The function chart.Correlation() in the package PerformanceAnalytics can be used to display a chart of a correlation matrix
library(PerformanceAnalytics)
chart.Correlation(stockData, histogram = TRUE, pch = 16)
Principal component analysis
Principal Component Analysis
• PCA is a statistical procedure that converts a set of observations of possibly correlated variables into
a set of values of linearly uncorrelated variables called principal components
• PCA will help us to find a reduced number of features that will represent our original dataset in a
compressed way, capturing up to a certain portion of its variance depending on the number of new
features we end up selecting
• This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component, in turn, has the highest variance possible under the constraint that it is uncorrelated with (orthogonal to) the preceding components
Review—Matrix Algebra
Eigenvalues: The number λ is an eigenvalue of A if and only if A − λI is singular:
det(A − λI) = 0
For each λ, solve (A − λI)x = 0 to find an eigenvector x
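• Example 13: Find the λ’s and x’s of
$$A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$$
• Solution: When A is singular, λ = 0 is one of the eigenvalues, because the equation Ax = 0x then has nonzero solutions. The characteristic equation is
$$\det(A-\lambda I) = \det\begin{bmatrix} 1-\lambda & 2 \\ 2 & 4-\lambda \end{bmatrix} = (1-\lambda)(4-\lambda) - 4 = \lambda^2 - 5\lambda = \lambda(\lambda-5) = 0$$
Then λ = 5 and λ = 0. Solve (A − λI)x = 0 for λ = 5 and λ = 0. For λ = 5,
$$(A-5I)x = \begin{bmatrix} -4 & 2 \\ 2 & -1 \end{bmatrix}\begin{bmatrix} y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
so z = 2y and an eigenvector is $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$. For λ = 0, an eigenvector is $\begin{bmatrix} 2 \\ -1 \end{bmatrix}$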
• The unit vectors are $\begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}$ and $\begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix}$
> A <- matrix(c(1, 2, 2, 4), ncol = 2, byrow = TRUE)
> eigen(A)
eigen() decomposition
$values
[1] 5 0

$vectors
          [,1]       [,2]
[1,] 0.4472136 -0.8944272
[2,] 0.8944272  0.4472136
• Let’s form the matrix of eigenvectors W, where the first column is the eigenvector of the largest eigenvalue, the second column is the eigenvector of the second largest eigenvalue, and so on
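The eigenvalues and eigenvectors returned by eigen() can be checked numerically against the definition. The short sketch below verifies that A W = W diag(λ) and that the columns of W are orthonormal; the names W, lambda and v1 are illustrative.

```r
A <- matrix(c(1, 2, 2, 4), ncol = 2, byrow = TRUE)
e <- eigen(A)
W <- e$vectors        # columns are unit-length eigenvectors
lambda <- e$values    # eigenvalues in decreasing order

# Definition check: A w_j = lambda_j w_j for every column j,
# which in matrix form is A W = W diag(lambda)
max(abs(A %*% W - W %*% diag(lambda)))

# The columns of W are orthonormal, so t(W) %*% W is the identity
round(crossprod(W), 10)

# Eigenvectors are defined only up to sign: compare the first column
# with the hand-computed unit vector (1, 2) / sqrt(5)
v1 <- c(1, 2) / sqrt(5)
abs(sum(W[, 1] * v1))
```

The last line returns 1 (up to rounding) whether eigen() reports the vector or its negative, which explains why software output can differ in sign from a hand calculation.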
• W is orthogonal (i.e., $W^{-1} = W^T$). Suppose X is the centered n × p data matrix and W is the eigenvector matrix of $X^T X$. The principal components decomposition P of X is
P = XW
By doing so we transform the data into a set of p orthogonal components
• To assess the explanatory power of each component, consider $S = P^T P$. S is a diagonal matrix where each diagonal element is the (scaled) variance of the corresponding component of the transformed data (the covariance between components is zero by construction).
• Example 14: Let’s consider the EuStockMarkets data set and construct principal components
data(EuStockMarkets)
stockData <- as.matrix(EuStockMarkets)

# Center the data set
x <- scale(stockData, scale = FALSE)
# Cross-product matrix t(x) %*% x of the centered data x
X <- t(x) %*% x
> X
            DAX        SMI        CAC       FTSE
DAX  2187625263 3324041230 1130755526 1920782024
SMI  3324041230 5141356405 1698659725 2989291485
CAC  1130755526 1698659725  626045333  964886991
FTSE 1920782024 2989291485  964886991 1773436263
# Eigenvector matrix
W <- eigen(X)$vectors
P <- x %*% W
# Cross-product matrix of the components (diagonal, proportional to their variances)
S <- round(t(P) %*% P, 2)
# Percentage of variation explained
diag(S) * 100 / sum(diag(S))
> diag(S)*100/sum(diag(S))
[1] 98.7472656  0.9389800  0.1701669  0.1435875
> prcomp(stockData)
Standard deviations (1, .., p=4):
[1] 2273.23831  221.67188   94.36696   86.68436

Rotation (n x k) = (4 x 4):
           PC1        PC2        PC3        PC4
DAX  0.4747658  0.3776953 -0.4010222 -0.6863854
SMI  0.7309214 -0.1735069 -0.3023225  0.5867285
CAC  0.2437347  0.7304373  0.5985832  0.2208009
FTSE 0.4253759 -0.5419438  0.6240837 -0.3686079
> W
           [,1]       [,2]       [,3]       [,4]
[1,] -0.4747658 -0.3776953 -0.4010222  0.6863854
[2,] -0.7309214  0.1735069 -0.3023225 -0.5867285
[3,] -0.2437347 -0.7304373  0.5985832 -0.2208009
[4,] -0.4253759  0.5419438  0.6240837  0.3686079
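The two outputs above agree except for the sign of some columns, which is expected because eigenvectors are defined only up to sign. A minimal sketch of that comparison follows; the object names (pc, ve_manual, ve_prcomp) are illustrative, and the data are converted to a plain matrix first as a precaution.

```r
data(EuStockMarkets)
stockData <- as.matrix(as.data.frame(EuStockMarkets))

# Manual route: eigen decomposition of the cross-product of the centered data
x  <- scale(stockData, scale = FALSE)
ev <- eigen(crossprod(x))          # crossprod(x) is t(x) %*% x
W  <- ev$vectors

# Built-in route: prcomp() centers by default and does not scale
pc <- prcomp(stockData)

# Loadings agree up to the sign of each column
max(abs(abs(W) - abs(unname(pc$rotation))))

# Percentage of variance explained, computed both ways
ve_manual <- 100 * ev$values / sum(ev$values)
ve_prcomp <- 100 * pc$sdev^2 / sum(pc$sdev^2)
round(rbind(ve_manual, ve_prcomp), 4)
```

The eigenvalues of the cross-product matrix equal (n − 1) times the squared standard deviations reported by prcomp(), so the percentages coincide even though the raw magnitudes differ.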
• How many principal components should be retained? There are three simple approaches that can guide the choice of the number of relevant principal components:
- Visual examination of a scree plot (searching for an elbow in the plot)
- The variance explained criterion
- The Kaiser rule (only those principal components whose variances exceed 1 are retained; this presumes standardized variables, i.e. PCA on the correlation matrix)
• Under the variance explained criterion, set a threshold, say 80%, and stop when the first k components account for a percentage of total variation greater than this threshold. Note that the threshold is somewhat arbitrary; 70 to 90% are the usual values, but it depends on the context of the data set and can be higher or lower
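# Variation explained
varExplained <- diag(S) * 100 / sum(diag(S))
# Scree-type plot of the percentage explained by each component
plot(varExplained, type = "o", ylab = "Variance explained", xlab = "Principal component")
# Cumulative percentage explained by the first k components
plot(cumsum(varExplained), type = "o", ylab = "Cumulative variance explained", xlab = "Principal component")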
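The threshold and Kaiser rules can also be applied programmatically. The sketch below uses prcomp() on the stock data; the 80% threshold is the illustrative value mentioned above, and the Kaiser rule is applied to standardized data (scale. = TRUE) on the assumption that the rule is meant for the correlation matrix.

```r
data(EuStockMarkets)
stockData <- as.matrix(as.data.frame(EuStockMarkets))

# Variance explained criterion on the centered, unscaled data
pc <- prcomp(stockData)
cum_pct <- cumsum(100 * pc$sdev^2 / sum(pc$sdev^2))
threshold <- 80                                # illustrative threshold (%)
k_threshold <- which(cum_pct > threshold)[1]   # smallest k exceeding it

# Kaiser rule on standardized data (PCA of the correlation matrix):
# keep components whose variance (eigenvalue) exceeds 1
pc_std <- prcomp(stockData, scale. = TRUE)
k_kaiser <- sum(pc_std$sdev^2 > 1)

round(cum_pct, 2)
c(k_threshold = k_threshold, k_kaiser = k_kaiser)
```

For these four highly correlated indices both rules retain a single component, consistent with the first component explaining almost 99% of the variation.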
Yabebal Ayalew
Statistics Department, Addis Ababa University(AAU)
Office: Freshman building, Room 115 email: yabebala@gmail.com
Training Title: Exploratory Data Analysis

Identifying Appropriate Test Statistics Involving Population Mean
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Exploratory data analysis

  • 1. Data Analysis Yabebal Ayalew Statistics Department, Addis Ababa University College of Natural & Computational Science Statistics Department
  • 2. Exploratory Data Analysis “It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.”
  • 3. Exploratory Data Analysis: Outline
    1 Introduction
    2 Bivariate Correlation Analysis
      - Data Visualization
      - Pearson’s Correlation Coefficient
      - Spearman’s Rank Correlation
      - The Kendall Rank Correlation Coefficient
    3 Inference on Correlation Coefficient
      - Inference under Pearson’s Correlation
      - Fisher Transformation of r
      - Inference under Spearman’s and Kendall’s Rank Correlation
    4 Multivariate Correlation Analysis
      - Data Visualization
      - Sample Correlation Coefficient Matrix
      - Inference
    5 Principal Component Analysis
    Statistics Department Exploratory Data Analysis 17.9.2021
  • 4. Introduction
    • Almost every discipline, from biology and economics to engineering and marketing, measures, gathers, and stores data in some digital form
    • Retail companies store information on sales transactions, insurance companies keep track of insurance claims, and meteorological organizations measure and collect data concerning weather conditions
    • Timely and well-founded decisions need to be made using the information collected. These decisions will be used to maximize sales, improve research and development projects, and trim costs
    • Data are being produced at faster rates due to the explosion of internet-related information and the increased use of operational systems to collect business, engineering and scientific data, and measurements from sensors or monitors
  • 5. Introduction
    Definition: Exploratory Data Analysis
    Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods (source: https://en.wikipedia.org/wiki/Exploratory_data_analysis)
    • According to John Tukey (June 16, 1915 – July 26, 2000), exploratory data analysis is detective work: numerical detective work, counting detective work, or graphical detective work
    • A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he does not understand where the criminal is likely to have put his fingers, he will not look in the right places. Equally, the analyst of data needs both tools and understanding.
  • 6. Introduction
    • The processes of criminal justice are clearly divided between the search for the evidence (the responsibility of the police and other investigative forces) and the evaluation of the evidence’s strength (a matter for juries and judges).
    • In data analysis a similar distinction is helpful. Exploratory data analysis is detective in character. Confirmatory data analysis is judicial or quasi-judicial in character
    • Unless the detective finds the clues, judge or jury has nothing to consider. Unless exploratory data analysis uncovers indications, usually quantitative ones, there is likely to be nothing for confirmatory data analysis to consider
    Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone, as the first step.
  • 7. Introduction: EDA Objectives
    • Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis which employs a variety of techniques (mostly graphical) to
      1 Maximize insight into a data set
      2 Uncover underlying structure
      3 Extract important variables
      4 Detect outliers and anomalies
      5 Test underlying assumptions
      6 Develop parsimonious models and
      7 Determine optimal factor settings
  • 8. Introduction
    • The R programming language is one of the best tools for big data analytics. It is free software (free as in freedom) and comes with no warranty. At the time of these slides, the current stable version was R 4.1.0, released on May 18, 2021
      - Some important libraries for EDA are DataExplorer and tidyverse
    • R packages are installed with the install.packages("<package name>") command. For instance,
        # Installing R packages
        install.packages("DataExplorer")
        install.packages("tidyverse")
    • RStudio is a widely used graphical interface (IDE) for the R programming language. It is free, though an enhanced version of RStudio is not.
  • 9. Introduction
    • There are two ways to use the built-in functions of installed packages (libraries)
      - The first option is to load the library and call the function by its name. For instance,
            # Loading the library
            library(DataExplorer)

            # Using the create_report function
            create_report(airquality)
      - The second option is to skip loading the library and use library_name::function_name(). For instance,
            DataExplorer::create_report(airquality)
    • Note: airquality is a dataset available in the datasets package, and the create_report() function generates a report about the data set given as its argument
            data("airquality")
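As a small illustration of the package::name form that needs no extra installation, the sketch below uses only the datasets, utils and stats packages that ship with R. The object names aq and ozone_temp_cor are chosen here for illustration and are not part of the slides.

```r
# Access a dataset and functions through package::name, without library()
aq <- datasets::airquality   # daily air quality measurements (data frame)

# utils::head() shows the first rows of the data frame
print(utils::head(aq, n = 3))

# stats::cor() computes a correlation; rows with missing values are dropped
ozone_temp_cor <- stats::cor(aq$Ozone, aq$Temp, use = "complete.obs")
print(ozone_temp_cor)
```

The `::` form is handy when only one or two functions of a package are needed, or when two loaded packages export functions with the same name.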
  • 10. Bivariate Correlation Analysis: Scatter Plot
    • A critical step in making sense of data is understanding the relationships between different variables.
      - For example, is there a relationship between interest rates and inflation, or between education level and income?
    • These relationships or associations can be established by examining summary tables and data visualizations, together with calculations that measure the strength of, and confidence in, the relationship.
  • 11. Bivariate Correlation Analysis: Scatter Plot
    • The simplest graphical tool for assessing the relationship between two quantitative variables of interest is the scatter plot
      - A scatter plot is a 2D graph on the XY coordinate plane where the x-axis represents one of the variables (usually the one believed to be the independent variable) and the y-axis represents the second variable (usually the dependent variable)
      - Each pair of values $(x_i, y_i)$ is represented by a dot, and the collection of dots forms some kind of pattern
  • 12. Bivariate Correlation Analysis: Scatter Plot
    • The plot(x, y, ...) function can be used to draw a scatter plot in R. You can get more information about the function and its arguments by typing ?plot in the RStudio console
    • Let’s consider the EuStockMarkets data set that comes with the R installation for demonstration. This data set contains daily closing prices of major European stock indices, 1991–1998. Type ?EuStockMarkets for more information
        # Load EuStockMarkets
        data(EuStockMarkets)
        # Convert the time-series matrix to a data frame
        stockPrice <- as.data.frame(EuStockMarkets)
        # Look at the first rows of the dataset
        head(stockPrice)
        # Rename the columns (DAX, SMI, CAC, FTSE) to country names
        colnames(stockPrice) <- c("Germany", "Switzerland", "France", "UK")
        # Scatter plot of Germany and UK
        xlab <- "Germany Stock Price"  # x-axis label
        ylab <- "UK Stock Price"       # y-axis label
        plot(x = stockPrice$Germany, y = stockPrice$UK, xlab = xlab, ylab = ylab)
  • 13. Bivariate Correlation Analysis: Scatter Plot (figure slide; the plot image is not part of this transcript)
  • 14. Bivariate Correlation Analysis: Scatter Plot
    • Example 1: A sample of 10 claims and corresponding payments on settlement for household policies is taken from the business of an insurance company
        Claim (x)     2.10  2.40  2.50  3.20  3.60  3.80  4.10  4.20  4.50  5.00
        Payment (y)   2.18  2.06  2.54  2.61  3.67  3.25  4.02  3.71  4.38  4.45
      Draw a scatter plot and comment on the relationship between claims and payments.
    • Solution: Let’s use the plot() function
        # Enter the data as vectors using the c() function
        Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
        Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

        # Scatter plot: pch is the marker symbol and col is the color
        plot(x = Claim, y = Payment, pch = 16, col = "blue")
  • 15. Bivariate Correlation Analysis: Scatter Plot
    • Example 2: A professional body wishes to analyse the performance of its students on a particular two-part examination. It records the scores obtained by a sample of 12 students on the first part of the exam, and the scores obtained by the same students on the second part of the exam. The results are as follows:
        Student                        A   B   C   D   E   F   G   H   I   J   K   L
        First-part exam score x (%)    82  49  73  60  61  77  65  85  91  53  59  73
        Second-part exam score y (%)   76  58  75  66  70  71  76  92  87  59  63  71
      Draw a scatter plot and comment on the relationship between the first-part and second-part exam scores.
        # Enter the data (backticks allow variable names that contain spaces)
        `First-part exam score` <- c(82, 49, 73, 60, 61, 77, 65, 85, 91, 53, 59, 73)
        `Second-part exam score` <- c(76, 58, 75, 66, 70, 71, 76, 92, 87, 59, 63, 71)

        # Scatter plot
        plot(x = `First-part exam score`, y = `Second-part exam score`, pch = 16, col = 6)
  • 16. Bivariate Correlation Analysis: Scatter Plot
    Comment: Both scatter plots suggest a positive linear relationship between the variables, i.e., as one variable tends to increase, the other also tends to increase.
  • 17. Bivariate Correlation Analysis: Scatter Plot
    Merits
    • It can be easily understood and interpreted
    • Values of extreme items do not affect this method. Such points are always isolated in the plot
    • It is the best method to show a non-linear pattern
    • It shows both positive and negative correlation graphically
    Demerits
    • These diagrams cannot measure the precise extent of correlation, and interpretation can be subjective
    • Limited to two or three dimensions at a time
    • Data on both axes have to be continuous
    • It is not a quantitative measure of the relationship between the variables. It is only a qualitative expression of the quantitative change
    • Exercise: The rate of interest of borrowing, over the next five years, for ten companies is compared to each company’s leverage ratio (its debt to equity ratio). The data are as follows:
        Leverage ratio (x)    0.1  0.4  0.5  0.8  1.0  1.8  2.0   2.5   2.8   3.0
        Interest rate % (y)   2.8  3.4  3.5  3.6  4.6  6.3  10.2  19.7  31.3  42.9
      Draw a scatter plot and comment on the relationship between company borrowing (leverage) and interest rate. Hence apply a transformation to obtain a linear relationship.
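One possible way to approach the exercise is sketched below. The choice of a natural-log transformation of the interest rate is an assumption made here, not something the slide prescribes, and the variable names (leverage, interest, log_interest) are illustrative.

```r
# Data from the exercise
leverage <- c(0.1, 0.4, 0.5, 0.8, 1.0, 1.8, 2.0, 2.5, 2.8, 3.0)
interest <- c(2.8, 3.4, 3.5, 3.6, 4.6, 6.3, 10.2, 19.7, 31.3, 42.9)

# Raw scatter plot: the points follow a curve rather than a straight line
plot(leverage, interest, pch = 16, col = "blue",
     xlab = "Leverage ratio", ylab = "Interest rate (%)")

# Assumed transformation: natural log of the interest rate
log_interest <- log(interest)
plot(leverage, log_interest, pch = 16, col = "blue",
     xlab = "Leverage ratio", ylab = "log(interest rate)")

# Compare the strength of the linear relationship before and after
r_raw <- cor(leverage, interest)
r_log <- cor(leverage, log_interest)
print(c(raw = r_raw, log = r_log))
```

If the second plot looks closer to a straight line and r_log is larger than r_raw, the transformation has made the relationship more nearly linear. Other transformations may work as well.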
  • 18. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    Review: Statistic vs Parameter
    • A statistic is any measure that is computed from a sample. It is usually denoted by Latin letters. For example, $\bar{X}$ is the sample mean (a statistic)
    • A parameter is any measure that is computed from a population. It is usually denoted by Greek letters. For instance, $\mu$ is the population mean.
    • The degree of association between the x and y values is summarised by the value of an appropriate correlation coefficient, each of which takes values from −1 to +1.
    • One widely used correlation coefficient is Pearson’s correlation coefficient, named after the well-known English mathematician and biostatistician Karl Pearson (27 March 1857 - 27 April 1936)
  • 19. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • Pearson’s correlation coefficient computed from the sample is denoted by r. Suppose we have two quantitative variables, say x and y, with a total of n observations. Then
      $$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
    • Let
      $$\begin{aligned}
      S_{xy} &= \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \\
             &= \sum_{i=1}^{n}\left(x_i y_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}\right) \\
             &= \sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}x_i\bar{y} - \sum_{i=1}^{n}y_i\bar{x} + \sum_{i=1}^{n}\bar{x}\bar{y} && \text{(distribute the summation)} \\
             &= \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} && \text{($\bar{x}$ and $\bar{y}$ are constants over the index $i$)} \\
             &= \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}
      \end{aligned}$$
      where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$
  • 20. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • Let
      $$\begin{aligned}
      S_{xx} &= \sum_{i=1}^{n}(x_i-\bar{x})^2 \\
             &= \sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) && \text{(polynomial expansion)} \\
             &= \sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\bar{x} + \sum_{i=1}^{n}\bar{x}^2 \\
             &= \sum_{i=1}^{n}x_i^2 - 2n\bar{x}\bar{x} + n\bar{x}^2 \\
             &= \sum_{i=1}^{n}x_i^2 - n\bar{x}^2
      \end{aligned}$$
      With the same technique,
      $$S_{yy} = \sum_{i=1}^{n}y_i^2 - n\bar{y}^2$$
    • Therefore, the sample Pearson’s correlation coefficient r can be rewritten as
      $$r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n}y_i^2 - n\bar{y}^2}}$$
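The computational formulas for Sxy, Sxx and Syy translate almost line by line into R. The function name pearson_r and the small data set below are made up here purely to exercise the formula; they are not from the slides.

```r
# Pearson's r from the computational formulas for Sxy, Sxx and Syy
pearson_r <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n   <- length(x)
  Sxy <- sum(x * y) - n * mean(x) * mean(y)
  Sxx <- sum(x^2)   - n * mean(x)^2
  Syy <- sum(y^2)   - n * mean(y)^2
  Sxy / (sqrt(Sxx) * sqrt(Syy))
}

# Made-up data, only to check the function against the built-in cor()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
print(pearson_r(x, y))
print(cor(x, y))   # built-in value for comparison
```

The two printed values should agree up to floating-point rounding. For data with very large values the centered form (sums of squared deviations) is numerically safer than the computational form.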
  • 21. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • The population correlation coefficient is denoted by the Greek letter $\rho$
      $$\rho = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{Var}(x)\operatorname{Var}(y)}} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$$
      r is a sample estimator of the unknown population correlation coefficient $\rho$.
    • Relationship with Simple Linear Regression: Simple linear regression has the form
      $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
      where $\varepsilon_i$ is the random disturbance term and $\beta_0$ and $\beta_1$ are model parameters, estimated by $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively:
      $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
  • 22. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • With a very simple algebraic manipulation, we can derive that
      $$\hat{\beta}_1 = r\sqrt{\frac{S_{yy}}{S_{xx}}}$$
      Moreover, the coefficient of determination $R^2$ of the simple linear regression is
      $$R^2 = r^2$$
    • Properties of Pearson’s correlation coefficient
      - The value of r lies between −1 and +1. r = 0 indicates the absence of a linear relationship between the variables, r = −1 indicates a perfect negative linear relationship, and r = +1 indicates a perfect positive linear relationship
      - Pearson’s correlation coefficient doesn’t imply a cause-and-effect relationship
      - A non-linear type of correlation can’t be quantified by r
      - It makes no distinction between the roles of the variables considered, i.e., dependent vs independent
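The algebra behind the slope identity is short: since the least-squares slope is Sxy/Sxx and r = Sxy/sqrt(Sxx Syy), multiplying r by sqrt(Syy/Sxx) gives back Sxy/Sxx. The sketch below checks both identities numerically with lm() on the claim and payment data of Example 1; the object names fit, slope_from_r, slope_lm and r_squared are illustrative.

```r
# Claim/payment data from Example 1
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

# Simple linear regression of Payment on Claim
fit <- lm(Payment ~ Claim)

# Ingredients of the identity beta1_hat = r * sqrt(Syy / Sxx)
r   <- cor(Claim, Payment)
Sxx <- sum((Claim - mean(Claim))^2)
Syy <- sum((Payment - mean(Payment))^2)

slope_from_r <- r * sqrt(Syy / Sxx)
slope_lm     <- unname(coef(fit)["Claim"])
r_squared    <- summary(fit)$r.squared

print(c(slope_from_r = slope_from_r, slope_lm = slope_lm))
print(c(r_squared_from_r = r^2, r_squared_lm = r_squared))
```

Both pairs of values should agree up to rounding. Note that R² = r² holds for simple linear regression with one predictor and an intercept.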
  • 23. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • The cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")) function in R can be used to compute Pearson’s correlation coefficient
    • Example 3: Let’s consider the data in Example 1.
        Claim (x)     2.10  2.40  2.50  3.20  3.60  3.80  4.10  4.20  4.50  5.00
        Payment (y)   2.18  2.06  2.54  2.61  3.67  3.25  4.02  3.71  4.38  4.45
      Compute Pearson’s correlation coefficient, r
    • Solution: The summaries of the above data are
      $$\sum_{i=1}^{10}x_i = 35.4,\quad \sum_{i=1}^{10}x_i^2 = 133.76,\quad \sum_{i=1}^{10}y_i = 32.87,\quad \sum_{i=1}^{10}y_i^2 = 115.2025,\quad \sum_{i=1}^{10}x_i y_i = 123.81$$
      Thus,
      $$\bar{x} = \frac{\sum_{i=1}^{10}x_i}{n} = \frac{35.4}{10} = 3.54 \quad\text{and}\quad \bar{y} = \frac{\sum_{i=1}^{10}y_i}{n} = \frac{32.87}{10} = 3.287$$
  • 24. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • By definition,
      $$r = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n}y_i^2 - n\bar{y}^2}} = \frac{123.81 - 10(3.54)(3.287)}{\sqrt{133.76 - 10(3.54)^2}\,\sqrt{115.2025 - 10(3.287)^2}} = 0.96$$
      This value indicates a strong positive linear relationship between claim and payment.
    • Using the R programming language with the cor() function
        # Enter the data as vectors using the c() function
        Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
        Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

        # Pearson's correlation coefficient
        cor(x = Claim, y = Payment)
  • 25. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • Example 4: Let’s consider the data in Example 2 and compute Pearson’s correlation coefficient.
        Student                        A   B   C   D   E   F   G   H   I   J   K   L
        First-part exam score x (%)    82  49  73  60  61  77  65  85  91  53  59  73
        Second-part exam score y (%)   76  58  75  66  70  71  76  92  87  59  63  71
    • Solution: The summaries are
      $$\sum_{i=1}^{12}x_i = 828,\quad \sum_{i=1}^{12}x_i^2 = 59054,\quad \sum_{i=1}^{12}y_i = 864,\quad \sum_{i=1}^{12}y_i^2 = 63362,\quad \sum_{i=1}^{12}x_i y_i = 60950$$
      The means are $\bar{x} = 69$ and $\bar{y} = 72$, so
      $$r = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n}y_i^2 - n\bar{y}^2}} = \frac{60950 - 12(69)(72)}{\sqrt{59054 - 12(69)^2}\,\sqrt{63362 - 12(72)^2}} = 0.896$$
  • 26. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
        # Enter the data
        `First-part exam score` <- c(82, 49, 73, 60, 61, 77, 65, 85, 91, 53, 59, 73)
        `Second-part exam score` <- c(76, 58, 75, 66, 70, 71, 76, 92, 87, 59, 63, 71)

        # Pearson's correlation coefficient
        cor(x = `First-part exam score`, y = `Second-part exam score`, method = "pearson")
    • Example 5: A new computerised ultrasound scanning technique has enabled doctors to monitor the weights of unborn babies. The table below shows the estimated weights for one particular baby at fortnightly intervals during the pregnancy.
        Gestation period (weeks)     30   32   34   36   38   40
        Estimated baby weight (kg)   1.6  1.7  2.5  2.8  3.2  3.5
      $$\sum_{i=1}^{6}x_i = 210,\quad \sum_{i=1}^{6}x_i^2 = 7420,\quad \sum_{i=1}^{6}y_i = 15.3,\quad \sum_{i=1}^{6}y_i^2 = 42.03,\quad \sum_{i=1}^{6}x_i y_i = 549.8$$
      Compute $S_{xx}$, $S_{yy}$, $S_{xy}$ and r.
  • 27. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • Solution: From the question, the gestation period is x and the estimated baby weight is y. Then,
      $$\bar{x} = \frac{210}{6} = 35 \quad\text{and}\quad \bar{y} = \frac{15.3}{6} = 2.55$$
      By definition,
      $$S_{xx} = \sum_{i=1}^{6}x_i^2 - n\bar{x}^2 = 7420 - 6(35)^2 = 70$$
      $$S_{yy} = \sum_{i=1}^{6}y_i^2 - n\bar{y}^2 = 42.03 - 6(2.55)^2 = 3.015$$
      $$S_{xy} = \sum_{i=1}^{6}x_i y_i - n\bar{x}\bar{y} = 549.8 - 6(35)(2.55) = 14.3$$
      $$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{14.3}{\sqrt{70 \times 3.015}} = 0.984$$
      A very strong positive correlation is observed.
  • 28. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
        # Enter the data
        `Gestation period` <- c(30, 32, 34, 36, 38, 40)
        `Estimated baby weight` <- c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5)

        # Sample size and means
        n <- length(`Gestation period`)          # sample size
        x_bar <- mean(`Gestation period`)        # mean of gestation period
        y_bar <- mean(`Estimated baby weight`)   # mean of baby weight

        # Corrected sums of squares and cross-products
        Sxx <- sum(`Gestation period`^2) - n * x_bar^2
        Syy <- sum(`Estimated baby weight`^2) - n * y_bar^2
        Sxy <- sum(`Gestation period` * `Estimated baby weight`) - n * x_bar * y_bar

        r <- Sxy / sqrt(Sxx * Syy)

        # Since we have the raw dataset, cor() gives the same value
        cor(x = `Gestation period`, y = `Estimated baby weight`)
  • 29. Bivariate Correlation Analysis: Pearson’s Correlation Coefficient
    • Exercise: A schoolteacher is investigating the claim that class size does not affect GCSE results. His observations of nine GCSE classes are as follows:
        Students in class (c)                    35   32   27   21   34   30   28   24   7
        Average GCSE point score for class (p)   5.9  4.1  2.4  1.7  6.3  5.3  3.5  2.6  1.6
      $$\sum c = 238,\quad \sum c^2 = 6884,\quad \sum p = 33.4,\quad \sum p^2 = 149.62,\quad \sum cp = 983$$
      Compute Pearson’s correlation coefficient
    • Final Thought:
      - Pearson’s correlation coefficient is affected by outliers
      - Even when the variables have a strong non-linear relationship, Pearson’s correlation coefficient can return a low value
      - The data to be correlated should be approximately normally distributed. If the data are normally distributed, the data points tend to lie closer to the mean
      - Pearson’s correlation coefficient is also called the Pearson product-moment correlation
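The sensitivity to outliers can be seen with a small experiment. The sketch below reuses the claim and payment data of Example 1 and appends one hypothetical observation (a claim of 10 with a payment of 0), which is invented here for illustration and is not part of the original data.

```r
# Claim/payment data from Example 1
Claim   <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)

# Pearson's r on the original data
r_clean <- cor(Claim, Payment)

# Append one hypothetical outlier: a large claim that was not paid at all
Claim_out   <- c(Claim, 10)
Payment_out <- c(Payment, 0)
r_outlier   <- cor(Claim_out, Payment_out)

print(c(original = r_clean, with_outlier = r_outlier))

# The scatter plot makes the isolated point easy to spot
plot(Claim_out, Payment_out, pch = 16, col = "blue")
```

A single extreme point is enough to change the coefficient substantially, which is one reason to look at the scatter plot before reporting r.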
  • 30. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient
    • Often, a bivariate population is far from normal. In that event, the computation of Pearson’s correlation coefficient r as an estimator of $\rho$ is no longer valid
    • In some cases a transformation of the variables x and y brings their joint distribution close to the bivariate normal, making it possible to estimate $\rho$ in the new scale
    • Spearman’s rank-order correlation, named after Charles Spearman (10 September 1863 - 17 September 1945), is the nonparametric version of the Pearson product-moment correlation r.
    • Spearman’s correlation coefficient, $r_s$, measures the strength and direction of association between two ranked variables.
    • Spearman’s correlation determines the strength and direction of the monotonic relationship between the two variables, rather than the strength and direction of the linear relationship between them
  • 31. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient
    • There are two methods to calculate Spearman’s correlation, depending on whether:
      1 Your data does not have tied ranks or
      2 Your data has tied ranks
    • The formula for when there are no tied ranks is
      $$r_s = 1 - \frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$$
      where $d_i$ is the difference in paired ranks and n is the number of observations
  • 32. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient
    • Let $r(x)$ be the rank of variable x and $r(y)$ the rank of variable y, with mean ranks $\overline{r(x)}$ and $\overline{r(y)}$. If any of the ranks are tied, Spearman’s rank correlation coefficient is computed as Pearson’s correlation of the ranks:
      $$r_s = \frac{\sum_{i=1}^{n}\left(r(x_i)-\overline{r(x)}\right)\left(r(y_i)-\overline{r(y)}\right)}{\sqrt{\sum_{i=1}^{n}\left(r(x_i)-\overline{r(x)}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(r(y_i)-\overline{r(y)}\right)^2}}$$
    • Example 6: A schoolteacher is investigating the claim that class size does not affect GCSE results. His observations of nine GCSE classes are presented below. Compute $r_s$
        Students in class (c)   Average GCSE point (p)   r(c)   r(p)   d = r(c) − r(p)   d²
        35                      5.9                      9      8       1                1
        32                      4.1                      7      6       1                1
        27                      2.4                      4      3       1                1
        21                      1.7                      2      2       0                0
        34                      6.3                      8      9      −1                1
        30                      5.3                      6      7      −1                1
        28                      3.5                      5      5       0                0
        24                      2.6                      3      4      −1                1
        7                       1.6                      1      1       0                0
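To see why the tied-ranks formula matters, the sketch below compares three computations on a small made-up data set that contains ties (the vectors x and y are hypothetical, chosen only so that both variables have repeated values). rank() assigns average ranks to ties by default.

```r
# Hypothetical data with tied values in both variables
x <- c(10, 20, 20, 30, 40, 40, 50)
y <- c(1, 3, 2, 3, 5, 4, 6)

# Average ranks are assigned to ties (the default ties.method)
r_x <- rank(x)
r_y <- rank(y)

# (a) Pearson's correlation applied to the ranks: the tied-ranks formula
rs_ranks <- cor(r_x, r_y)

# (b) Built-in Spearman correlation
rs_builtin <- cor(x, y, method = "spearman")

# (c) The d^2 shortcut, which is exact only when there are no ties
n <- length(x)
d <- r_x - r_y
rs_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

print(c(ranks = rs_ranks, builtin = rs_builtin, shortcut = rs_shortcut))
```

Computations (a) and (b) agree, while the shortcut (c) gives a slightly different value because of the ties.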
  • 33. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient
    • Solution: $\sum d^2 = 6$. Spearman’s rank correlation coefficient $r_s$ is
      $$r_s = 1 - \frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 6}{9(9^2-1)} = 0.95$$
        # Enter data (the vector named c does not stop the function c() from working)
        c <- c(35, 32, 27, 21, 34, 30, 28, 24, 7)
        p <- c(5.9, 4.1, 2.4, 1.7, 6.3, 5.3, 3.5, 2.6, 1.6)
        # Ranking
        r_c <- rank(c)
        r_p <- rank(p)
        # Sample size
        n <- length(c)  # equivalently, n <- length(r_p)
        # Rank difference
        d <- r_c - r_p
        # Spearman's rank correlation
        rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
        # Alternatively
        cor(x = c, y = p, method = "spearman")
  • 34. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient
    • Example 7: The rate of interest of borrowing, over the next five years, for ten companies is compared to each company’s leverage ratio (its debt to equity ratio). The data are as follows:
        Leverage ratio (x)   Interest rate (y)   r(x)   r(y)   d   d²
        0.1                  2.8                 1      1      0   0
        0.4                  3.4                 2      2      0   0
        0.5                  3.5                 3      3      0   0
        0.8                  3.6                 4      4      0   0
        1.0                  4.6                 5      5      0   0
        1.8                  6.3                 6      6      0   0
        2.0                  10.2                7      7      0   0
        2.5                  19.7                8      8      0   0
        2.8                  31.3                9      9      0   0
        3.0                  42.9                10     10     0   0
      The Spearman rank correlation is one, since $\sum d^2 = 0$
  • 35. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient
        # Enter the data
        `Leverage ratio` <- c(0.1, 0.4, 0.5, 0.8, 1.0, 1.8, 2.0, 2.5, 2.8, 3.0)
        `Interest rate` <- c(2.8, 3.4, 3.5, 3.6, 4.6, 6.3, 10.2, 19.7, 31.3, 42.9)
        # Rank the data (there are no ties here)
        r_leverage <- rank(`Leverage ratio`, ties.method = "average")
        r_interest <- rank(`Interest rate`, ties.method = "average")

        # Correlation coefficients: Spearman first, then Pearson for comparison
        cor(`Leverage ratio`, `Interest rate`, method = "spearman")
        cor(`Leverage ratio`, `Interest rate`)

        # Plots: plain scatter plot, then scatter plot with a smooth curve
        plot(`Leverage ratio`, `Interest rate`, pch = 16, col = 18)
        scatter.smooth(`Leverage ratio`, `Interest rate`, pch = 16, col = 18)
  • 36. Bivariate Correlation Analysis: Spearman’s Rank Correlation Coefficient (figure slide; the plot image is not part of this transcript)
  • 37. Bivariate Correlation Analysis: The Kendall’s Rank Correlation Coefficient
    • Kendall’s rank correlation, also called Kendall’s Tau and named after Maurice George Kendall (16 September 1907 – 29 March 1983), is a correlation suitable for quantitative and ordinal variables. It indicates how strongly two variables are monotonically related
      - Kendall’s Tau serves the exact same purpose as Spearman’s rank correlation
      - Kendall’s Tau is computationally more intensive than Spearman’s rank correlation
    • Despite the more complicated calculation, it is considered to have better statistical properties than Spearman’s rank correlation coefficient, particularly for small data sets with large numbers of tied ranks
    • Any pair of observations $(x_i, y_i)$ and $(x_j, y_j)$, where $i \neq j$, is said to be concordant if the ranks for both elements agree, i.e., $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$; otherwise they are said to be discordant.
  • 38. Bivariate Correlation Analysis: The Kendall’s Rank Correlation Coefficient
    • Let $n_c$ be the number of concordant pairs, and let $n_d$ be the number of discordant pairs. Assuming that there are no ties, the Kendall coefficient $\tau$ is defined as
      $$\tau = \frac{2(n_c - n_d)}{n(n-1)}$$
      Note that there are in total
      $$\binom{n}{2} = \frac{n!}{2!(n-2)!} = \frac{n \times (n-1) \times (n-2)!}{2!(n-2)!} = \frac{n(n-1)}{2}$$
      pairs, each of which is either concordant or discordant
    • Intuitively, it is clear that if the number of concordant pairs is much larger than the number of discordant pairs, then the random variables are positively correlated. If the number of concordant pairs is much smaller than the number of discordant pairs, then the variables are negatively correlated
  • 39. Bivariate Correlation Analysis: The Kendall’s Rank Correlation Coefficient
    • Example 8: A new computerised ultrasound scanning technique has enabled doctors to monitor the weights of unborn babies. The table below shows the estimated weights for one particular baby at fortnightly intervals during the pregnancy.
        Gestation period (weeks)     30   32   34   36   38   40
        Estimated baby weight (kg)   1.6  1.7  2.5  2.8  3.2  3.5
      Compute Kendall’s Tau
    • Solution: The observations are (30, 1.6), (32, 1.7), (34, 2.5), (36, 2.8), (38, 3.2) and (40, 3.5). Let’s create a table to identify the concordant and discordant pairs
  • 40. Bivariate Correlation Analysis: The Kendall’s Rank Correlation Coefficient
        (x, y)      (30,1.6)  (32,1.7)  (34,2.5)  (36,2.8)  (38,3.2)  (40,3.5)
        (30,1.6)              c         c         c         c         c
        (32,1.7)                        c         c         c         c
        (34,2.5)                                  c         c         c
        (36,2.8)                                            c         c
        (38,3.2)                                                      c
    • As can be seen in the above table, $n_c = 15$ and $n_d = 0$. Therefore,
      $$\tau = \frac{2(n_c - n_d)}{n(n-1)} = \frac{2(15 - 0)}{6(6-1)} = 1$$
        # Enter the data
        `Gestation period` <- c(30, 32, 34, 36, 38, 40)
        `Estimated baby weight` <- c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5)

        # Kendall's tau
        cor(x = `Gestation period`, y = `Estimated baby weight`, method = "kendall")
  • 41. Bivariate Correlation Analysis: The Kendall’s Rank Correlation Coefficient
    • Example 9: The data below show the consumption of alcohol (x, liters per year per person, 14 years or older) and the death rate from cirrhosis, a liver disease (y, deaths per 100,000 population) in 9 countries (each country is an observation unit)
        Country     Alc. Consumption   Death Rate from Cirrhosis
        France      24.7               46.1
        Italy       15.2               23.6
        Germany     12.3               23.7
        Australia   10.9               7.0
        Belgium     10.8               12.3
        USA         9.9                14.2
        Canada      8.3                7.4
        England     7.2                3.0
        Israel      3.1                5.4
      Computing Kendall’s tau using the approach of Example 8 is time-consuming and error-prone.
    • It’s often easier to determine concordant and discordant pairs by using the ranks instead of the actual numbers
  • 42. Bivariate Correlation Analysis: The Kendall’s Rank Correlation Coefficient
    • First arrange the observations in increasing order of x. Then, for each row, the number of concordant pairs (c) is the number of observations below it that have a higher rank for y, and the number of discordant pairs (d) is the number of observations below it that have a lower rank for y
        Country     Alc. Consumption   Death Rate (y)   r(y)   Concordant   Discordant
        Israel      3.1                5.4              2      7            1
        England     7.2                3.0              1      7            0
        Canada      8.3                7.4              4      5            1
        USA         9.9                14.2             6      3            2
        Belgium     10.8               12.3             5      3            1
        Australia   10.9               7.0              3      3            0
        Germany     12.3               23.7             8      1            1
        Italy       15.2               23.6             7      1            0
        France      24.7               46.1             9
      $$n_c = 7 + 7 + \cdots + 1 = 30, \qquad n_d = 1 + 0 + \cdots + 0 = 6$$
      $$\tau = \frac{2(n_c - n_d)}{n(n-1)} = \frac{2(30 - 6)}{9(9-1)} = 0.67$$
  • 43. Bivariate Correlation Analysis: Kendall's Rank Correlation Coefficient

        library(oii)

        # Enter data
        `Alc consumption` <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
        death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

        # Correlation coefficient
        cor(`Alc consumption`, death, method = "kendall")

        # Numbers of concordant and discordant pairs
        concordant.pairs(`Alc consumption`, death)
        discordant.pairs(`Alc consumption`, death)

    • The oii library is loaded in order to use the concordant.pairs and discordant.pairs functions.
  • 44. Inference on Correlation Coefficient: Under Pearson's Correlation
    • Inference on Pearson's correlation coefficient assumes that the two variables come from a bivariate normal distribution.
    • The joint pdf of the two variables, say $x \in \mathbb{R}$ and $y \in \mathbb{R}$, is
      $$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right) + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\}$$
      where $\rho \in [-1, +1]$ is the population correlation coefficient.
    • To assess the significance of any calculated r, the sampling distribution of this statistic is needed. The distribution of r is generally skewed (negatively skewed when $\rho > 0$) and has high spread/variability.
    • For the hypothesis $H_0: \rho = 0$ vs $H_1: \rho \neq 0$, the test statistic
      $$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
      follows a t-distribution with $n - 2$ degrees of freedom under $H_0$.
  • 45. Inference on Correlation Coefficient: Under Pearson's Correlation

        # Sampling distribution of r
        set.seed(23)  # fix the seed for reproducibility
        # Load the data
        data(EuStockMarkets)
        # Keep only the first two variables (DAX and SMI)
        D <- as.data.frame(EuStockMarkets[, 1:2])
        n <- 30        # sample size
        n_sim <- 1000  # number of simulated samples
        r <- numeric(n_sim)
        t <- numeric(n_sim)

        # The simulation
        for (i in seq_len(n_sim)) {
          # Take a random sample of n rows
          idx <- sample(seq_len(nrow(D)), size = n)
          r[i] <- cor(D$DAX[idx], D$SMI[idx])
          t[i] <- r[i] * sqrt(n - 2) / sqrt(1 - r[i]^2)
        }

        # Plot histograms
        hist(r)
        hist(t)
  • 46. Inference on Correlation Coefficient: Under Pearson's Correlation [Figure]
  • 47. Inference on Correlation Coefficient: Under Pearson's Correlation [Figure]
  • 48. Inference on Correlation Coefficient: Under Pearson's Correlation
    • Example 10: Consider the data in Example 9. The Pearson correlation coefficient between alcohol consumption and death rate is r = 0.94 (0.937788 before rounding). Test $H_0: \rho = 0$.
    • Solution: The sample size is n = 9, so the test statistic, computed with the unrounded r, is
      $$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.937788\sqrt{9-2}}{\sqrt{1-0.937788^2}} = 7.146$$
      The p-value is $2\Pr(t_7 > 7.146) = 0.000186$. Since the p-value is below the 5% level of significance, we have enough evidence to conclude that alcohol consumption and the death rate are linearly related.
    • The function cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), exact = NULL, conf.level = 0.95, continuity = FALSE, ...) can be used to carry out this test in R.
  • 49. Inference on Correlation Coefficient: Under Pearson's Correlation

        # Enter data
        `Alc consumption` <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
        death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

        # Hypothesis testing
        cor.test(`Alc consumption`, death)

    The output is

                Pearson's product-moment correlation

        data:  Alc consumption and death
        t = 7.146, df = 7, p-value = 0.000186
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
         0.7255307 0.9871237
        sample estimates:
             cor
        0.937788
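The t statistic and p-value reported by cor.test() can be reproduced by hand from the formula $t = r\sqrt{n-2}/\sqrt{1-r^2}$. This is a minimal sketch; the short variable names are illustrative assumptions.

```r
alc   <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 3.1)
death <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3, 5.4)

n <- length(alc)
r <- cor(alc, death)  # Pearson correlation (unrounded)

# Test statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
```

Using the unrounded r matters: plugging in r = 0.94 gives a noticeably larger statistic than the one cor.test() prints.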
  • 50. Inference on Correlation Coefficient: Under Pearson's Correlation
    • However, if the interest lies in testing a hypothesis such as $H_0: \rho = \rho_0$, where $\rho_0$ is a hypothesised non-zero value of the correlation coefficient, the test procedure for $H_0: \rho = 0$ cannot be used.
    • A more general result holds for any $\rho_0 \in (-1, +1)$: under the null hypothesis $H_0: \rho = \rho_0$, the test statistic
      $$W = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \tanh^{-1}(r)$$
      approximately follows a normal distribution with mean
      $$\frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right) = \tanh^{-1}(\rho_0)$$
      and standard deviation $\frac{1}{\sqrt{n-3}}$.
      W is usually referred to as the Fisher Z transformation.
  • 51. Inference on Correlation Coefficient: Under Pearson's Correlation
    • A $100(1-\alpha)\%$ confidence interval for $\tanh^{-1}(\rho)$ can be computed as
      $$W \pm Z_{\alpha/2}\frac{1}{\sqrt{n-3}}$$
      Transforming the lower and upper limits back with tanh() gives the confidence interval for $\rho$:
      $$\tanh\left(W \pm Z_{\alpha/2}\frac{1}{\sqrt{n-3}}\right)$$
      Note that $\tanh(w) = \frac{e^{2w} - 1}{e^{2w} + 1}$.
    • Example 11: Consider the data and results in Example 9. Test the claim $H_0: \rho = 0.9$ vs $H_1: \rho \neq 0.9$.
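The confidence interval formula can be sketched in R as below. The helper `fisher_ci` is a hypothetical name introduced for illustration; the inputs r = 0.937788 and n = 9 are taken from the cor.test() output for the alcohol data.

```r
# Confidence interval for rho via the Fisher Z transformation.
# `fisher_ci` is an illustrative helper, not a function from any package.
fisher_ci <- function(r, n, conf_level = 0.95) {
  z_crit <- qnorm(1 - (1 - conf_level) / 2)  # Z_{alpha/2}
  w <- atanh(r)                              # Fisher Z transform of r
  tanh(w + c(-1, 1) * z_crit / sqrt(n - 3))  # back-transform both limits
}

ci <- fisher_ci(r = 0.937788, n = 9)
```

The resulting limits should match the 95 percent interval printed by cor.test() for the same data, up to rounding.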
  • 52. Inference on Correlation Coefficient: Under Pearson's Correlation
    • Solution: The correlation coefficient is r = 0.94 and n = 9. The test statistic is
      $$W = \tanh^{-1}(r) = \tanh^{-1}(0.94) = 1.738$$
      Under $H_0$, the mean and standard deviation of W are
      $$\text{mean} = \tanh^{-1}(\rho_0) = \tanh^{-1}(0.9) = 1.472 \quad\text{and}\quad \text{SD} = \frac{1}{\sqrt{n-3}} = 0.408$$
      The p-value is
      $$2\Pr\left(Z > \frac{W - \text{mean}}{\text{SD}}\right) = 2\Pr\left(Z > \frac{1.738 - 1.472}{0.408}\right) = 2\Pr(Z > 0.652) = 0.5144$$
      Since the p-value exceeds 5%, we do not reject the null hypothesis. The atanh() function can be used to compute $\tanh^{-1}(\cdot)$ in R.
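The calculation in Example 11 can be checked with a few lines of R. This is a sketch using the rounded r = 0.94 from the example; the variable names are illustrative.

```r
r    <- 0.94  # sample correlation (rounded, as in the example)
n    <- 9     # sample size
rho0 <- 0.9   # hypothesised correlation under H0

w       <- atanh(r)          # Fisher Z transform of r
mean_h0 <- atanh(rho0)       # mean of W under H0
sd_w    <- 1 / sqrt(n - 3)   # standard deviation of W

z       <- (w - mean_h0) / sd_w
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value
```

Small differences from the hand calculation are expected because the slide rounds the intermediate values to three decimals.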
  • 53. Inference on Correlation Coefficient: Under Spearman's and Kendall's Rank Correlation
    • The test procedure for Kendall's and Spearman's rank correlation depends on the sample size. For relatively large samples, however, the test statistic for $H_0: \rho = 0$ approximately follows a known probability distribution.
        - Case 1: Inference under Spearman's rank correlation. For larger samples ($n > 20$), the test procedure for the Pearson correlation can be used. Under $H_0$, the limiting normal distribution of $r_s$ has mean 0 and standard deviation $\frac{1}{\sqrt{n-1}}$.
        - Case 2: Inference under Kendall's rank correlation. For larger samples ($n > 10$), the Central Limit Theorem implies that, under $H_0$, $\tau$ is approximately normal with mean 0 and standard deviation $\sqrt{\frac{2(2n+5)}{9n(n-1)}}$.
  • 54. Inference on Correlation Coefficient: Under Spearman's and Kendall's Rank Correlation
    • Example 12: An actuary wants to investigate whether there is any correlation between students' scores in the CS1 mock exam and the CS2 mock exam. Data from 22 students were collected and the results are
      $$\sum_{i=1}^{22} d_i^2 = 494, \quad n_c = 174, \quad n_d = 57$$
      Test $H_0: \rho = 0$ vs $H_1: \rho > 0$ for the mock score data using Spearman's and Kendall's rank correlation coefficients together with their normal approximations.
    • Solution: Spearman's and Kendall's rank correlation coefficients are
      $$r_s = 1 - \frac{6\sum_{i=1}^{22} d_i^2}{n(n^2-1)} = 1 - \frac{6(494)}{22(22^2-1)} = 0.72$$
      $$\tau = \frac{2(n_c - n_d)}{n(n-1)} = \frac{2(174 - 57)}{22(22-1)} = 0.51$$
  • 55. Inference on Correlation Coefficient: Under Spearman's and Kendall's Rank Correlation
    • Using Spearman's rank correlation coefficient: under $H_0$, $r_s$ is approximately normal with mean zero and standard deviation $1/\sqrt{22-1} = 0.218$, so the normal approximation is applied to $r_s$ itself (no Fisher transformation is needed). The standardized value is
      $$z = \frac{r_s - 0}{\text{SD}} = \frac{0.7211}{0.2182} = 3.30$$
      The p-value is $\Pr(Z > 3.30) \approx 0.0005$. We have enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
    • Using Kendall's rank correlation coefficient: under $H_0$, $\tau$ is approximately normal with mean zero and standard deviation
      $$\sqrt{\frac{2(2n+5)}{9n(n-1)}} = \sqrt{\frac{2(2 \times 22 + 5)}{9(22)(22-1)}} = 0.154$$
  • 56. Inference on Correlation Coefficient: Under Spearman's and Kendall's Rank Correlation
    • The standardized value for Kendall's $\tau$ is
      $$z = \frac{\tau - 0}{\text{SD}} = \frac{0.5065}{0.1535} = 3.30$$
      The p-value is $\Pr(Z > 3.30) \approx 0.0005$. The conclusion is still the same.
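The quantities in Example 12 can be computed from the summary statistics alone. The sketch below standardizes the coefficients themselves by their null standard deviations; that choice, and the variable names, are assumptions of this illustration.

```r
n      <- 22
sum_d2 <- 494   # sum of squared rank differences
nc     <- 174   # concordant pairs
nd     <- 57    # discordant pairs

# Rank correlation coefficients
r_s <- 1 - 6 * sum_d2 / (n * (n^2 - 1))
tau <- 2 * (nc - nd) / (n * (n - 1))

# Standard deviations of the normal approximations under H0
sd_rs  <- 1 / sqrt(n - 1)
sd_tau <- sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

# One-sided tests of H0: rho = 0 against H1: rho > 0
z_rs  <- r_s / sd_rs
z_tau <- tau / sd_tau
p_rs  <- pnorm(z_rs, lower.tail = FALSE)
p_tau <- pnorm(z_tau, lower.tail = FALSE)
```

Both p-values fall well below 5%, so either coefficient leads to rejecting the null hypothesis.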
  • 57. Multivariate Correlation Analysis: Visualization
    • As the number of variables increases, it becomes very hard to visualize the data with one plot. For instance, with five variables we need to draw $\binom{5}{2} = 10$ scatter plots. This problem is often referred to as the curse of dimensionality.
    • The relationships between variables in a multivariate setting can be investigated by drawing a scatter plot matrix and a correlogram.

        # Load library
        library(DataExplorer)
        # Load dataset
        data(EuStockMarkets)
        stockData <- as.data.frame(EuStockMarkets)
        # Rename the columns to more descriptive names
        colnames(stockData) <- c("Germany", "Switzerland", "France", "UK")
        # Correlogram
        plot_correlation(stockData)

        # Scatter plot matrix
        plot(stockData)
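The count of pairwise scatter plots grows quickly with the number of variables. A small base R sketch, using the column names of EuStockMarkets purely as an illustration:

```r
# Number of distinct scatter plots needed for five variables
n_plots <- choose(5, 2)

# Enumerate the variable pairs for the four stock indices
vars      <- colnames(EuStockMarkets)
var_pairs <- t(combn(vars, 2))  # one row per pair of variables
```

Each row of `var_pairs` corresponds to one panel above the diagonal of the scatter plot matrix.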
  • 58. Multivariate Correlation Analysis: Visualization [Figure]
  • 59. Multivariate Correlation Analysis: Visualization [Figure]
  • 60. Multivariate Correlation Analysis: Sample Correlation Coefficient Matrix
    • The correlation coefficient matrix is a compact way to examine the strength and sign of the relationship between each pair of variables.

        > cor(stockData)
                      Germany Switzerland    France        UK
        Germany     1.0000000   0.9911539 0.9662274 0.9751778
        Switzerland 0.9911539   1.0000000 0.9468139 0.9899691
        France      0.9662274   0.9468139 1.0000000 0.9157265
        UK          0.9751778   0.9899691 0.9157265 1.0000000

        > cor(stockData, method = "spearman")
                      Germany Switzerland    France        UK
        Germany     1.0000000   0.9727973 0.8290434 0.9670180
        Switzerland 0.9727973   1.0000000 0.8131776 0.9867515
        France      0.8290434   0.8131776 1.0000000 0.8051102
        UK          0.9670180   0.9867515 0.8051102 1.0000000

    • Exercise: Consider the data set mtcars. Explore this data set, draw scatter plots, and compute the correlation matrix. Use data(mtcars) to load the data into R.
  • 61. Multivariate Correlation Analysis: Inference
    • The hypothesis $H_0: \rho = 0$ can be tested for every pair of variables with the corr.test() function of the psych library.

        > library(psych)
        > corr.test(stockData)
        Call:corr.test(x = stockData)
        Correlation matrix
                    Germany Switzerland France   UK
        Germany        1.00        0.99   0.97 0.98
        Switzerland    0.99        1.00   0.95 0.99
        France         0.97        0.95   1.00 0.92
        UK             0.98        0.99   0.92 1.00
        Sample Size
        [1] 1860
        Probability values (Entries above the diagonal are adjusted for multiple tests.)
                    Germany Switzerland France UK
        Germany           0           0      0  0
        Switzerland       0           0      0  0
        France            0           0      0  0
        UK                0           0      0  0
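The same pairwise tests can be sketched in base R with cor.test() and p.adjust(). Holm's method is used for the multiplicity adjustment here as an assumption of this sketch, and the object names are illustrative.

```r
stock <- as.data.frame(EuStockMarkets)
vars  <- colnames(stock)

# All pairs of variables, one pair per column
var_pairs <- combn(vars, 2)

# Raw p-value of the Pearson test H0: rho = 0 for each pair
p_raw <- apply(var_pairs, 2, function(v) {
  cor.test(stock[[v[1]]], stock[[v[2]]])$p.value
})

# Adjust for multiple testing (Holm's method)
p_holm <- p.adjust(p_raw, method = "holm")

result <- data.frame(var1 = var_pairs[1, ], var2 = var_pairs[2, ],
                     p_raw = p_raw, p_holm = p_holm)
```

With 1860 observations and correlations above 0.9, every p-value is numerically indistinguishable from zero, which is why the table above prints 0 throughout.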
  • 62. Multivariate Correlation Analysis: Inference
    • The function chart.Correlation() in the PerformanceAnalytics package can be used to display a chart of a correlation matrix.

        library(PerformanceAnalytics)
        chart.Correlation(stockData, histogram = TRUE, pch = 16)
  • 63. Multivariate Correlation Analysis: Inference [Figure]
  • 64. Principal Component Analysis
    • PCA is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
    • PCA helps us find a reduced number of features that represent the original data set in a compressed way, capturing a certain portion of its variance that depends on the number of new features we select.
    • The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component, in turn, has the highest possible variance under the constraint that it is orthogonal to the preceding components.
    Review: Matrix Algebra
    Eigenvalues: The number $\lambda$ is an eigenvalue of A if and only if $A - \lambda I$ is singular:
      $$\det(A - \lambda I) = 0$$
    For each $\lambda$, solve $(A - \lambda I)x = 0$ to find an eigenvector x.
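  • 65. Principal Component Analysis
    • Example 13: Find the eigenvalues ($\lambda$'s) and eigenvectors (x's) of
      $$A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$$
    • Solution: When A is singular, $\lambda = 0$ is one of the eigenvalues, because the equation $Ax = 0x$ has nonzero solutions. The eigenvalues solve
      $$\det(A - \lambda I) = \det\begin{pmatrix} 1-\lambda & 2 \\ 2 & 4-\lambda \end{pmatrix} = (1-\lambda)(4-\lambda) - 4 = \lambda^2 - 5\lambda = \lambda(\lambda - 5) = 0$$
      so $\lambda = 5$ and $\lambda = 0$. Solve $(A - \lambda I)x = 0$ for each of them. For $\lambda = 5$,
      $$(A - 5I)x = \begin{pmatrix} -4 & 2 \\ 2 & -1 \end{pmatrix}\begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
      and the eigenvector is $\begin{pmatrix} 1 \\ 2 \end{pmatrix}$. For $\lambda = 0$, the eigenvector is $\begin{pmatrix} 2 \\ -1 \end{pmatrix}$.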
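The hand-computed eigenpairs can be verified directly from the definition $Ax = \lambda x$. A minimal sketch in base R, with illustrative variable names:

```r
A <- matrix(c(1, 2, 2, 4), ncol = 2, byrow = TRUE)

v1 <- c(1, 2)    # eigenvector for lambda = 5
v2 <- c(2, -1)   # eigenvector for lambda = 0

# A %*% v should equal lambda * v for each eigenpair
lhs1 <- as.vector(A %*% v1)
lhs2 <- as.vector(A %*% v2)

# A singular matrix has determinant zero, hence an eigenvalue of 0
det_A <- det(A)
```

Comparing `lhs1` with `5 * v1` and `lhs2` with `0 * v2` confirms both eigenpairs.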
  • 66. Principal Component Analysis
    • The unit eigenvectors are
      $$\begin{pmatrix} \frac{1}{\sqrt{5}} \\ \frac{2}{\sqrt{5}} \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} \frac{2}{\sqrt{5}} \\ \frac{-1}{\sqrt{5}} \end{pmatrix}$$

        > A <- matrix(c(1, 2, 2, 4), ncol = 2, byrow = TRUE)
        > eigen(A)
        eigen() decomposition
        $values
        [1] 5 0

        $vectors
                  [,1]       [,2]
        [1,] 0.4472136 -0.8944272
        [2,] 0.8944272  0.4472136

    • Let us form the matrix of eigenvectors W, where the first column is the eigenvector of the largest eigenvalue, the second column is the eigenvector of the second largest eigenvalue, and so on.
  • 67. Principal Component Analysis
    • W is orthogonal (i.e., $W^{-1} = W^T$). Suppose X is a centered $n \times p$ data matrix and the columns of W are the eigenvectors of $X^T X$. The principal components decomposition P of X is
      $$P = XW$$
      In this way the data are transformed into a set of p orthogonal components.
    • To assess the explanatory power of each component, consider $S = P^T P$. S is a diagonal matrix in which each diagonal element is the (scaled) variance of one component of the transformed data (the covariance between components is zero by construction).
    • Example 14: Consider the EuStockMarkets data set and construct its principal components.

        data(EuStockMarkets)
        stockData <- as.matrix(EuStockMarkets)

        # Center the data set
        x <- scale(stockData, scale = FALSE)
        # Cross-product matrix of the centered data
        X <- t(x) %*% x
  • 68. Principal Component Analysis

        > X
                    DAX        SMI        CAC       FTSE
        DAX  2187625263 3324041230 1130755526 1920782024
        SMI  3324041230 5141356405 1698659725 2989291485
        CAC  1130755526 1698659725  626045333  964886991
        FTSE 1920782024 2989291485  964886991 1773436263

        # Eigenvector matrix
        W <- eigen(X)$vectors
        # Principal components
        P <- x %*% W
        # (Scaled) variance-covariance matrix of the components
        S <- round(t(P) %*% P, 2)
        # Percentage of variation explained by each component
        diag(S) * 100 / sum(diag(S))

        [1] 98.7472656  0.9389800  0.1701669  0.1435875
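Two properties used in this construction, that W is orthogonal and that $S = P^T P$ is diagonal, can be checked numerically. This is a sketch under the assumption that the eigenvectors come from the cross-product of the centered data; the names `orth_err` and `rel_off` are illustrative.

```r
# Centered data and eigenvectors of the cross-product matrix
x <- scale(as.matrix(EuStockMarkets), scale = FALSE)
W <- eigen(crossprod(x))$vectors

# Principal components and their cross-product
P <- x %*% W
S <- crossprod(P)

# W'W should be the identity matrix (orthogonality)
orth_err <- max(abs(crossprod(W) - diag(ncol(W))))

# Off-diagonal entries of S should be negligible relative to the diagonal
off_diag <- S - diag(diag(S))
rel_off  <- max(abs(off_diag)) / max(diag(S))
```

Both error measures should be at the level of floating-point rounding, confirming that the components are mutually uncorrelated.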
  • 69. Principal Component Analysis

        > prcomp(stockData)
        Standard deviations (1, .., p=4):
        [1] 2273.23831  221.67188   94.36696   86.68436

        Rotation (n x k) = (4 x 4):
                   PC1        PC2        PC3        PC4
        DAX  0.4747658  0.3776953 -0.4010222 -0.6863854
        SMI  0.7309214 -0.1735069 -0.3023225  0.5867285
        CAC  0.2437347  0.7304373  0.5985832  0.2208009
        FTSE 0.4253759 -0.5419438  0.6240837 -0.3686079

        > W
                   [,1]       [,2]       [,3]       [,4]
        [1,] -0.4747658 -0.3776953 -0.4010222  0.6863854
        [2,] -0.7309214  0.1735069 -0.3023225 -0.5867285
        [3,] -0.2437347 -0.7304373  0.5985832 -0.2208009
        [4,] -0.4253759  0.5419438  0.6240837  0.3686079
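The prcomp() rotation and the eigenvector matrix W differ only in the signs of some columns, because an eigenvector is defined only up to sign. The sketch below compares the two routes numerically; the object names are illustrative assumptions.

```r
stock_mat <- as.matrix(EuStockMarkets)

# Route 1: eigen decomposition of the centered cross-product matrix
x   <- scale(stock_mat, scale = FALSE)
eig <- eigen(crossprod(x))

# Route 2: prcomp() on the same data (centered by default, not scaled)
pca <- prcomp(stock_mat)

# Percentage of variance explained under each route
var_eigen  <- 100 * eig$values / sum(eig$values)
var_prcomp <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Loadings agree once the column signs are ignored
max_diff <- max(abs(abs(eig$vectors) - abs(unname(pca$rotation))))
```

Because the sign of each component is arbitrary, only the absolute values of the loadings are compared here.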
  • 70. Principal Component Analysis
    • How many principal components should be retained? Three simple approaches can guide the choice of the number of relevant principal components:
        - Visual examination of a scree plot (searching for an elbow in the plot)
        - The variance explained criterion
        - The Kaiser rule (only those principal components whose variances exceed 1 are retained; this applies when the variables are standardized)
    • Under the variance explained criterion, we set a threshold, say 80%, and stop when the first k components account for a percentage of the total variation greater than this threshold. Note that the threshold is somewhat arbitrary; 70 to 90% are the usual values, but it depends on the context of the data set and can be higher or lower.

        # Variation explained
        varExplained <- diag(S) * 100 / sum(diag(S))
        # Scree plot
        plot(varExplained, type = "o", ylab = "Variance explained", xlab = "Principal component")
        # Cumulative variance explained
        plot(cumsum(varExplained), type = "o", ylab = "Variance explained", xlab = "Principal component")
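The variance explained criterion and the Kaiser rule can be sketched in R as follows. The 80% threshold is the example value from the text; applying the Kaiser rule to standardized variables via `scale. = TRUE` is an assumption of this illustration.

```r
stock_mat <- as.matrix(EuStockMarkets)

# Variance explained criterion on the centered (unscaled) data
pca           <- prcomp(stock_mat)
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

threshold <- 80
# Smallest k whose cumulative variance exceeds the threshold
k <- which(cum_var > threshold)[1]

# Kaiser rule: PCA on standardized variables, keep components with variance > 1
pca_std  <- prcomp(stock_mat, scale. = TRUE)
k_kaiser <- sum(pca_std$sdev^2 > 1)
```

For these strongly correlated stock indices, both rules point to a single dominant component.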
  • 71. Principal Component Analysis [Figure]
  • 72. Yabebal Ayalew Statistics Department, Addis Ababa University(AAU) Office: Freshman building, Room 115 email: yabebala@gmail.com Training Title: Exploratory Data Analysis