Shahid Lecture-5- MKAG1273

MAL1303: Statistical Hydrology
Correlation
Dr. Shamsuddin Shahid
Department of Hydraulics and Hydrology
Faculty of Civil Engineering, Universiti Teknologi Malaysia
Room No. M46-332; E-mail: sshahid@utm.my
Mobile: 0182051586
11/23/2015 Shamsuddin Shahid, FKA, UTM

Research Questions: Are two variables related?
Example questions in hydrology:
– “Is there any relation between rainfall and river
discharge?”
– “Is there any relation between low river flow and river
water quality?”
– “Is there any relation between elevation and rainfall?”
– “Is there any relation between rainfall intensity and
landslides?”
Test the relationship: Correlation

Correlation
Definition: Correlation is a statistical method that is used to
examine the extent to which two variables have a simple linear
relationship.
Questions:
 What does it mean to say that two variables are associated with
one another?
 How can we mathematically formalize the concept of
association?
Answer:
Correlation

Correlation describes the relationship between two variables in terms of:
– direction
– strength
– significance
Sign indicates direction
Size indicates strength
Comparison with critical values gives significance
Correlation

Scatter Plots
• Plot each pair of observations (X, Y)
• x = predictor variable (independent)
• y = criterion variable (dependent)
• Check for:
– outliers
– linearity

How do you study the relationship between two variables?
Groundwater temperature data are collected at different depths below the earth
surface.
A list of these data is difficult to interpret.
The relationship between the two variables can be visualized using a scatter
diagram, where each pair depth-temperature is represented as a point in a
plane.

Types of Correlation
Correlation
Positive Correlation Negative Correlation
Positive Correlation: The correlation is said to be positive if the values of
the two variables change in the same direction.
Negative Correlation: The correlation is said to be negative if the values of
the two variables change in opposite directions.
Type I

Positive & Negative Association
At each depth two measurements are collected: temperature and nitrogen concentration.
Two scatter plots are obtained:
(i) Depth vs. Groundwater Temperature;
(ii) Depth vs. Nitrogen Concentration in Groundwater.
In the first graph, temperature increases with depth as a general tendency.
This corresponds to a positive association.
In the second graph, nitrogen concentration decreases with depth. This
corresponds to a negative association.

Correlation
Simple Multiple
Partial Total
Type II

Types of Correlation Type II
• Simple correlation: only two variables are studied.
• Multiple correlation: three or more variables are studied.
• Partial correlation: the analysis recognizes more than two
variables but considers only two of them, keeping the others
constant.
• Total correlation: based on all the relevant variables, which
is normally not feasible.

Correlation
LINEAR NON LINEAR
Type III

Types of Correlation Type III
• Linear correlation: Correlation is said to be linear when the amount of
change in one variable tends to bear a constant ratio to the amount of
change in the other. The graph of the variables having a linear relationship
will form a straight line.
• Non Linear correlation: The correlation would be non linear if the amount of
change in one variable does not bear a constant ratio to the amount of
change in the other variable.

Correlation Coefficient
• The correlation coefficient gives a measure of the linear association
of two variables. It defines the degree of relationship.
• The correlation coefficient is usually denoted by r and takes values
between -1 and 1.
[Figure: two scatter plots, one where r is positive (between 0 and 1) and one where r is negative (between 0 and -1).]

Correlation Coefficient
• Nitrogen concentration data collected at two different locations give two
plots. Both show a negative correlation between depth and nitrogen
concentration. The correlation coefficient r is more negative (closer to -1)
for the first plot than for the second.
• If the points of the scatter plot lie very close to a straight line, the
correlation is close to 1 in absolute value. A near-zero correlation corresponds
to a diagram where the data are widely scattered around the line.

Correlation Coefficient - Summary
• A positive coefficient means that the data are clustered around a line with a
positive slope. That is, as one variable increases, the other one also
increases.
• A negative coefficient means that the data are clustered around a line with a
negative slope. That is, as one variable increases, the other one decreases.
• The closer r is to 1, the stronger the positive linear association between the
variables.
• The closer r is to -1, the stronger the negative linear association between the
variables.
• When r is equal to or near 1 or -1, there is a strong linear association between
the variables.
• When r is equal to or near 0, there is no linear association between the variables.

Pearson Correlation
• Pearson correlation is used to describe the relationship between
two variables that are both interval or ratio variables.
• Pearson correlation measures how consistently each Y value is
paired with each X value in a linear fashion.

Covariance
• Covariance is a measure of how much two variables change together.
• It is the variance shared by two variables.
• Covariance reflects the direction of the relationship:
  – Positive covariance indicates a positive (+) relationship
  – Negative covariance indicates a negative (-) relationship

Computational Formula
Sum of Squares (SS) measures the amount of variation or variability of
a single variable.
Sum of Products (SP) provides a parallel procedure for measuring the
amount of covariation or covariability between two variables.

Calculation of Pearson’s Correlation Coefficient
 Pearson’s correlation coefficient is a ratio comparing the
covariability of X and Y with variability of X and Y separately.
 SP measures the covariability of X and Y
 The variability of X and Y is measured by calculating the SS for X
and Y scores separately
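
The ratio described above can be sketched in a few lines of Python. This is only an illustration: the depth and nitrate values are hypothetical (the slide's data table is not reproduced here), and the function names are chosen for this example.

```python
import math

def sum_of_squares(values):
    """SS: sum of squared deviations of one variable from its mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def sum_of_products(x, y):
    """SP: sum of the products of paired deviations of X and Y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

def pearson_r(x, y):
    """r = SP / sqrt(SS_x * SS_y)."""
    return sum_of_products(x, y) / math.sqrt(sum_of_squares(x) * sum_of_squares(y))

# Hypothetical data, for illustration only
depth = [10, 20, 30, 40, 50]            # feet
nitrate = [8.0, 7.1, 6.5, 4.9, 4.0]     # mg/l

r = pearson_r(depth, nitrate)
print(round(r, 3))
```

The sign of SP sets the sign of r, while dividing by the square root of the two SS values scales r into the range -1 to 1.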

Calculation of Pearson’s Correlation Coefficient
Let, X represent Depth in feet and Y represent Nitrate Concentration in
mg/l. The association between Groundwater Depth and Nitrate
Concentration can be found as below:

Hypothesis Testing
• H0: there is no correlation between depth and nitrate concentration, i.e. the
population correlation is 0.
• H1: there is a real non-zero correlation in the population.
• The population correlation is traditionally represented by ρ (rho), so in
symbols we can write:
H0: ρ = 0
H1: ρ ≠ 0
• For Pearson’s correlation, the degrees of freedom are df = n-2, where n is the
sample size. We lose 2 degrees of freedom because we need to estimate two
means, one for each variance estimate.
• If the absolute value of the calculated r equals or exceeds the critical value
(given in the table), the obtained r is significant.

Hypothesis Testing
In the present case, r = 0.875
df = n-2
= 5-2
= 3
Critical value for α = 0.05, df = 3 is 0.878.
Since 0.875 < 0.878, we fail to reject H0: ρ = 0
There is no significant evidence of correlation in the population
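
The comparison with a critical r is equivalent to a t-test on r with df = n-2. The sketch below, offered as a cross-check rather than as the lecture's own procedure, converts r = 0.875 to a t statistic and converts the tabulated two-tailed t value for df = 3 (3.182, a standard table value) back to a critical r.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, with df = n - 2."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1.0 - r ** 2)

def critical_r(t_critical, df):
    """Critical r implied by a critical t value at the same df."""
    return t_critical / math.sqrt(t_critical ** 2 + df)

r, n = 0.875, 5
T_CRITICAL = 3.182  # two-tailed t for alpha = 0.05, df = 3 (standard table value)

t = t_from_r(r, n)
r_crit = critical_r(T_CRITICAL, n - 2)
significant = abs(t) >= T_CRITICAL

print(round(t, 2), round(r_crit, 3), significant)
```

Both routes give the same decision: t is about 3.13, just below 3.182, so H0 is not rejected at α = 0.05.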

Significance of Correlation
df (N-2)    Critical value (p = .05)
5           .67
10          .50
15          .41
20          .36
25          .32
30          .30
50          .23
200         .11
500         .07
1000        .05

Correlation: r and r²
• As a matter of routine it is the squared correlations
that should be interpreted. This is because the
correlation coefficient is misleading in suggesting
the existence of more covariation than exists, and
this problem gets worse as the correlation
approaches zero.
• Note that as the correlation r decreases by tenths,
r² decreases by much more. A correlation of .50
only shows that 25 percent of the variance is in common;
a correlation of .20 shows 4 percent in common;
and a correlation of .10 shows 1 percent in common
(or 99 percent not in common).
• Thus, squaring should be a healthy corrective to the
tendency to consider low correlations, such as .20
and .30, as indicating a meaningful or practical
covariation.
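
The percentages quoted above follow directly from squaring r. A minimal sketch (the helper name is chosen for this example):

```python
def shared_variance_percent(r):
    """Percentage of variance in common, 100 * r squared."""
    return round(100 * r * r, 2)

for r in (0.50, 0.30, 0.20, 0.10):
    print(r, shared_variance_percent(r))
```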

Assumptions
• Scale of measurement is interval
• Linear relationships
• Homoscedasticity
• Similar normal underlying distributions
• No outliers

Homoscedasticity

Advantages and Disadvantages of Pearson’s Coefficient
Advantages
• It summarizes in one value both the degree and the
direction of correlation.
Limitations
• It always assumes a linear relationship.
• Interpreting the value of r is difficult.
• The value of the correlation coefficient is affected by
extreme values.

Parametric and Non-parametric Correlation
Parametric correlation:
when distribution of data is normal.
Example: Pearson Correlation
Non-parametric correlation:
when distribution of data is not normal
Example: Spearman’s Rank Correlation, Kendall’s tau (τ) Correlation

The Spearman Correlation
 Spearman’s correlation is designed to measure the relationship between
variables measured on an ordinal scale of measurement
 A perfectly positive relationship means that every time X increases Y also
increases; i.e., the smallest value of X is paired with the smallest value of
Y and so on
 The original scores are first converted to ranks, then the Spearman
correlation coefficient is used to measure the relationship for the ranks.
The degree of relationship for the ranks provides a measure of the
degree of consistency for the original scores.
Calculation of Spearman’s Correlation Coefficient
 Be sure you have ordinal data for X and Y scores
 The smallest value gets the rank 1 and the second smallest 2 and so on
 Rank X and Y separately
 Use the same formula on the ranked data as you used for Pearson’s r

Rank Correlation
• Spearman Rank-Correlation Coefficient, rs:
rs = 1 - (6 x Σ di²) / [n x (n² - 1)]
where: n = number of items being ranked
xi = rank of item i with respect to one variable
yi = rank of item i with respect to a second variable
di = xi - yi

Test for Significant Rank Correlation
• We may want to use sample results to make an inference
about the population rank correlation ρs.
• To do so, we must test the hypotheses:
H0: ρs = 0
Ha: ρs ≠ 0

Spearman Rank Correlation
Monthly Rainfall (mm): Sample 1: {79, 71, 108, 54, 67, 90}
Monthly Discharge (cusec): Sample 2: {122, 100, 121, 43, 54, 80}
Null Hypothesis: There exists no association (or correlation) between
the samples.
Decision rule: If rs > critical value, there is a significant correlation.
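
For the rainfall and discharge samples above, the rank-difference calculation can be sketched as follows. The ranking helper assumes there are no tied values, which holds for these two samples; the critical value still has to be read from a table for n = 6.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank difference."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

rainfall = [79, 71, 108, 54, 67, 90]      # mm
discharge = [122, 100, 121, 43, 54, 80]   # cusec

print(ranks(rainfall))    # [4, 3, 6, 1, 2, 5]
print(ranks(discharge))   # [6, 4, 5, 1, 2, 3]

rs = spearman_rs(rainfall, discharge)
print(round(rs, 3))       # 0.714
```

The squared rank differences sum to 10, so rs = 1 - 60/210, which is about 0.714.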

Merits of Spearman’s Rank Correlation
• This method is simpler to understand and easier to apply
than Karl Pearson’s correlation method.
• This method is useful where we can give the ranks but
not the actual data (qualitative terms).
• This method is used where the initial data are in the form
of ranks.

Limitations of Spearman’s Correlation
• Cannot be used for finding out correlation in a grouped
frequency distribution.
• This method should not be applied where N exceeds 30, because the ranking becomes tedious unless the ranks are already given.

Kendall-tau Rank Correlation
Kendall's rank correlation provides a distribution-free test of
independence and a measure of the strength of dependence
between two variables.
Spearman's rank correlation is satisfactory for testing a null
hypothesis of independence between two variables, but Kendall's
rank correlation is more powerful.

Steps for Kendall-tau Rank Correlation
1. Arrange the data in increasing order of magnitude of the first
variable and label the objects with the resulting rank: 1 for the
smallest up to N for the largest.
2. Rearrange the data in order of increasing magnitude of the
second variable and record the rearranged order of the variable-1
ranks.
3. For each rank in this rearranged list, scan down the list and count
the number of subsequent ranks that are larger.
4. Repeat step 3, this time counting the number of subsequent ranks
that are smaller.
5. For each rank, subtract “smaller” from “larger”, then sum the
differences to obtain the total (S).

Steps for Kendall-tau Rank Correlation (continued)
6. Kendall’s τ is given by:
τ = (2 x S) / [N x (N-1)]
7. Compute the z-statistic as
z = τ x sqrt{ [9 x N x (N-1)] / [2 x (2N + 5)] }
8. The null hypothesis is rejected (at α = 0.05) if z falls outside the following range:
-1.96 ≤ z ≤ 1.96
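
The counting procedure in the steps can be sketched in Python as below. This is an illustrative version that assumes no tied values and uses the square-root form of the normal approximation for z, which is the form that reproduces the worked result z = 3.67 later in the lecture. The rainfall and discharge samples from the Spearman slide are reused as input.

```python
import math

def kendall_s(x, y):
    """S = sum over items of (later ranks larger) - (later ranks smaller)."""
    # Steps 1-2: order the pairs by the second variable and keep the
    # first variable in that order (comparing values is equivalent to
    # comparing ranks when there are no ties).
    x_ordered = [a for _, a in sorted(zip(y, x))]
    s = 0
    # Steps 3-5: scan down the list, counting larger and smaller later values.
    for i, current in enumerate(x_ordered):
        later = x_ordered[i + 1:]
        larger = sum(1 for v in later if v > current)
        smaller = sum(1 for v in later if v < current)
        s += larger - smaller
    return s

def kendall_tau(s, n):
    """Step 6: tau = 2S / (N (N - 1))."""
    return 2 * s / (n * (n - 1))

def kendall_z(tau, n):
    """Step 7: z = tau * sqrt(9 N (N - 1) / (2 (2N + 5)))."""
    return tau * math.sqrt(9 * n * (n - 1) / (2 * (2 * n + 5)))

rainfall = [79, 71, 108, 54, 67, 90]
discharge = [122, 100, 121, 43, 54, 80]

n = len(rainfall)
s = kendall_s(rainfall, discharge)
tau = kendall_tau(s, n)
z = kendall_z(tau, n)
print(s, round(tau, 2), round(z, 2))
```

For these six pairs S = 9 and τ = 0.6; z is about 1.69, inside the ±1.96 range, so the null hypothesis of independence would not be rejected at α = 0.05 with this test.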

Problem: Ten groundwater samples
are collected from different points
to see is there any relation between
groundwater depth and
contamination. Data are given in
the table. Is there any association
between depth and contamination.
Null Hypothesis: There exists no
association. Contamination is
independent of Groundwater
Depth.

Step-1: Rank the data
separately
Step-2: Re-arrange the
second ranks according to
the rank of the first variable

τ = (2 x S) / [N x (N-1)]
z = τ x sqrt{ [9 x N x (N-1)] / [2 x (2N + 5)] }

Null Hypothesis:
There exists no relation between depth and contamination
The null hypothesis is rejected (α = 0.05) if z falls outside the following range:
-1.96 ≤ z ≤ 1.96
z (calculated) = 3.67
z (calculated) > z (critical) = 1.96, therefore the null hypothesis is rejected.
Decision: There exists a significant correlation between depth and
groundwater contamination
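
As a consistency check on the reported z = 3.67, the sketch below evaluates τ and z for N = 10. The slide's data table is not reproduced in this text, so S = 41 is an inferred value (the only S for N = 10 that rounds to z = 3.67 under the square-root form of the z formula), not a number taken from the lecture.

```python
import math

def tau_and_z(s, n):
    """Kendall's tau and its normal-approximation z statistic."""
    tau = 2 * s / (n * (n - 1))
    z = tau * math.sqrt(9 * n * (n - 1) / (2 * (2 * n + 5)))
    return tau, z

# S = 41 is inferred from the reported z, not read from the slide's table
tau, z = tau_and_z(41, 10)
print(round(tau, 3), round(z, 2))
print(abs(z) > 1.96)  # True means the null hypothesis is rejected
```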

Features of Correlation Coefficient
The correlation coefficient has the following properties:
• The correlation is not affected when the two variables are
interchanged.
• The correlation is not changed if the same number is added to all
the values of one of the variables.
• The correlation is not changed if all the values of one of the
variables are multiplied by the same positive number. It will change
sign if the number is negative.
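
These properties can be checked numerically. The sketch below reuses the rainfall and discharge samples from the Spearman slide; the shift and scale constants are arbitrary choices for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed as SP / sqrt(SS_x * SS_y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

x = [79, 71, 108, 54, 67, 90]
y = [122, 100, 121, 43, 54, 80]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                            # variables interchanged
r_shift = pearson_r([v + 100 for v in x], y)      # constant added
r_scale = pearson_r([2.5 * v for v in x], y)      # positive multiplier
r_neg = pearson_r([-2.5 * v for v in x], y)       # negative multiplier

print(math.isclose(r_xy, r_yx, abs_tol=1e-9))
print(math.isclose(r_xy, r_shift, abs_tol=1e-9))
print(math.isclose(r_xy, r_scale, abs_tol=1e-9))
print(math.isclose(r_xy, -r_neg, abs_tol=1e-9))
```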

Factors affecting correlation
• Restricted range
• Heterogeneous samples
• Outliers
• Scale

Range restriction
• Range restriction occurs when the sample contains a restricted (or
truncated) range of scores
– e.g., Groundwater Recharge and Rainfall > 5mm
• If the range is restricted, be cautious in generalising beyond
the range for which data are available
– e.g., Groundwater recharge is less when rainfall is less, but below
a threshold level there is no relation

Range restriction

Heterogeneous samples
• Sub-samples may artificially increase or decrease the overall r.
• Solution: calculate r separately for the sub-samples and overall, then
look for differences
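
A small numerical illustration of this effect, using made-up numbers: two hypothetical sub-samples each show a negative relationship, yet pooling them produces a strong positive r because the sub-samples sit at different levels.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed as SP / sqrt(SS_x * SS_y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical sub-samples (e.g. two different catchments), illustration only
x_a, y_a = [1, 2, 3], [3, 2, 1]
x_b, y_b = [11, 12, 13], [13, 12, 11]

r_a = pearson_r(x_a, y_a)
r_b = pearson_r(x_b, y_b)
r_pooled = pearson_r(x_a + x_b, y_a + y_b)

print(round(r_a, 2), round(r_b, 2), round(r_pooled, 2))
```

Computing r separately for each sub-sample, as the slide recommends, exposes the difference that the pooled value hides.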

Heterogeneous samples

Effect of Outliers
• Outliers can disproportionately increase or decrease r.
• Options
– compute r with & without outliers
– get more data for outlying values
– recode outliers as having more conservative scores
– transformation
– recode variable into lower level of measurement
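
The first option, computing r with and without the outlier, can be sketched as follows. The data are hypothetical: six points with a weak negative relationship plus one extreme point.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed as SP / sqrt(SS_x * SS_y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 1.9, 2.2, 2.0, 1.8, 2.0]
outlier = (20, 10.0)

r_without = pearson_r(x, y)
r_with = pearson_r(x + [outlier[0]], y + [outlier[1]])

print(round(r_without, 2), round(r_with, 2))
```

Here a single extreme point turns a weak negative correlation into a strong positive one, which is why reporting both values is informative.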


Closed Data
Sometimes closed data or some discrete data show a high
correlation.

Log Transformed Data
If data are transformed to a log scale, the relation between the
log-transformed data can show a high correlation.

Checklist
1. Graphs & Scatterplots
– Outliers?
– Linear?
– Does each variable have a reasonable range?
– Are there subsamples to consider?
2. Choose appropriate measure of Association
3. Conduct inferential test
4. Interpret/Discuss

Association and Causation
ASSOCIATION
• If two attributes, say A and B, are found to co-exist more often
than expected by chance, they are correlated. We can
say that there is an association between attributes A and B.
• Correlation indicates the degree of association between two
variables.
CAUSATION
If one of these attributes, say A, is the suspected cause and the
other, say B, is the outcome, then we have a reason to suspect
that A has caused B.

Association and Causation
• Association does not mean causation.
• If the association is consistent, then there may be
causation.
• If a relationship is causal, the findings should be
consistent with other data.
• Causation always implies correlation, but correlation
does not necessarily imply causation.

Reporting
• State the research hypothesis
• Describe & interpret correlation
– direction of relationship
– size/strength of relationship
– Significance of relationship
• Acknowledge limitations e.g.,
– Heterogeneity (sub-samples)
– Range restriction
– Causality?

Partial Correlation
River discharge depends on many factors, such as rainfall, soil
properties, evapotranspiration, groundwater storage, etc. These
independent factors are also correlated with each other.

Partial Correlation

Three (or more) Variables
• Three variables means three relationships
• Each can affect the other two
• Partial & semi-partial correlation—remove contributions of 3rd variable
Partial Correlation

• Sometimes it is desirable to know the relationship between two
variables with the effects of a third variable held constant. We
can do it by using Partial correlation
• It helps us to find the ‘pure’ correlation between two variables while
holding the others constant.
• ‘Holding constant’ in this situation is known as partialling out, and
the technique for partialling out the effects of one or more
variables from two others, in order to find the relationship
between them is called partial correlation.
Partial Correlation

A partial correlation is a correlation between two variables from
which the linear relations, or effects, of another variable(s) have
been removed.
Partial Correlation

Partial Correlation
Correlation = 0.72

Partial Correlation
Correlation = 0.73

Higher-Order Partial Correlation
The second-order partial correlation is the correlation between two
variables with the effects of two other variables being removed.

Semipartial Correlation
With partial correlation, we find the correlation between X and Y
holding Z constant for both X and Y. Sometimes, however, we want
to hold Z constant for just X or just Y. In that case, we compute a
semipartial correlation.
Comparison between the partial and semipartial correlation:
Partial: r_XY.Z = (r_XY - r_XZ x r_YZ) / sqrt[(1 - r_XZ²) x (1 - r_YZ²)]
Semi-partial (Z removed from X only): r_Y(X.Z) = (r_XY - r_XZ x r_YZ) / sqrt(1 - r_XZ²)

Partial Correlation
The result doesn't make much
intuitive sense, but it does remind us
that the absolute value of the partial
is larger than the semipartial.

• The partial and semipartial correlation formulas are the
same in the numerator and almost the same in the
denominator.
• The denominator of the partial contains an extra factor
that is missing from the denominator of the semipartial.
• This means that the partial correlation is at least as
large in absolute value as the semipartial.
• The two are equal only when the controlling or partialling
variable is uncorrelated with the variable to be controlled.
Semipartial Correlation
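
The comparison can be made concrete with the standard first-order formulas, written here in terms of the three pairwise correlations. The correlation values below are hypothetical, and the semipartial shown removes Z from X only.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of X and Y with Z removed from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def semipartial_r(r_xy, r_xz, r_yz):
    """Correlation of Y with X after Z is removed from X only."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz ** 2)

# Hypothetical pairwise correlations, for illustration only
r_xy, r_xz, r_yz = 0.6, 0.5, 0.4

pr = partial_r(r_xy, r_xz, r_yz)
sr = semipartial_r(r_xy, r_xz, r_yz)
print(round(pr, 3), round(sr, 3))

# With r_yz = 0 the extra denominator factor equals 1 and the two coincide
print(math.isclose(partial_r(0.6, 0.5, 0.0), semipartial_r(0.6, 0.5, 0.0)))
```

The numerators are identical; the partial divides by the additional factor sqrt(1 - r_yz²), which is at most 1, so its absolute value cannot be smaller than that of the semipartial.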

Advantages of Correlation studies
• Show the amount (strength) of relationship present
• Can be used to make predictions about the variables
under study.
• Can be used in many places, including natural settings,
libraries, etc.
• Easier to collect correlational data

Disadvantages of correlation studies
• Can’t assume that a cause-effect relationship exists
• Little or no control (experimental manipulation) of the
variables is possible
• Relationships may be accidental or due to a third,
unmeasured factor common to the 2 variables that are
measured
