Quantitative Data Analysis

Probability and basic statistics
Probability
The most familiar way of thinking about probability is within a
framework of repeatable random experiments. In this view the
probability of an event is defined as the limiting proportion of times
the event would occur given many repetitions.
Probability
Instead of exclusively relying on knowledge of the proportion of times
an event occurs in repeated sampling, this approach allows the
incorporation of subjective knowledge, so-called prior probabilities,
that are then updated. The common name for this approach is
Bayesian statistics.
The Fundamental Rules of Probability
Rule 1: Probability is always non-negative (0 ≤ P(A) ≤ 1)
Rule 2: For a given sample space, the sum of probabilities is 1
Rule 3: For disjoint (mutually exclusive) events, P(A ∪ B) = P(A) + P(B)
Counting
Permutations (order is important): P(n, r) = n! / (n − r)!

Combinations (order is not important): C(n, r) = n! / (r! (n − r)!)
Probability functions
The factorial function
   factorial(n)
   gamma(n+1)


Combinations can be calculated with
   choose(n, k)
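As a quick illustration (the numbers are arbitrary):
   factorial(5)                      # 120
   gamma(5 + 1)                      # 120, since gamma(n + 1) = n!
   choose(5, 2)                      # 10 ways to choose 2 items from 5, ignoring order
   factorial(5) / factorial(5 - 2)   # 20 ordered arrangements of 2 items from 5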
Simple statistics
mean(x) arithmetic average of the values in x
median(x) median value in x
var(x) sample variance of x
cor(x,y) correlation between vectors x and y
quantile(x) vector containing the minimum, lower quartile, median,
upper quartile, and maximum of x
rowMeans(x) row means of dataframe or matrix x
colMeans(x) column means
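A minimal sketch of these functions on made-up data (the values are illustrative only):
   x <- c(2, 4, 4, 5, 7, 9)
   y <- c(1, 3, 2, 6, 8, 8)
   mean(x); median(x); var(x)
   cor(x, y)            # correlation between the two vectors
   quantile(x)          # minimum, lower quartile, median, upper quartile, maximum
   m <- matrix(1:6, nrow = 2)
   rowMeans(m); colMeans(m)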
Cumulative probability function
The cumulative probability function is, for any value of x, the probability of obtaining a sample value that is less than or equal to x.

   curve(pnorm(x), -3, 3)
Probability density function
The probability density is the slope of this curve (its ‘derivative’).

   curve(dnorm(x), -3, 3)
Continuous Probability Distributions
R has a wide range of built-in probability distributions, for each of
which four functions are available: the probability density function
(which has a d prefix); the cumulative probability (p); the quantiles of
the distribution (q); and random numbers generated from the
distribution (r).
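As a sketch, the four functions for the standard normal distribution:
   dnorm(0)       # density at x = 0
   pnorm(1.96)    # cumulative probability P(X <= 1.96), about 0.975
   qnorm(0.975)   # quantile: the x whose cumulative probability is 0.975, about 1.96
   rnorm(5)       # five random draws from the distribution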
Normal distribution
par(mfrow=c(2,2))            # 2 x 2 grid of plots
x <- seq(-3, 3, 0.01)
y <- exp(-abs(x))            # exponent 1: sharp peak at zero
plot(x, y, type="l")
y <- exp(-abs(x)^2)          # exponent 2: the bell-shaped normal kernel
plot(x, y, type="l")
y <- exp(-abs(x)^3)
plot(x, y, type="l")
y <- exp(-abs(x)^8)          # higher powers give a flatter top
plot(x, y, type="l")
Normal distribution

norm.R
Exercise
Suppose we have measured the heights of 100 people. The mean
height was 170 cm and the standard deviation was 8 cm. We can ask
three sorts of questions about data like these: what is the probability
that a randomly selected individual will be:
shorter than a particular height?
 taller than a particular height?
 between one specified height and another?
Exercise

normal.R
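normal.R is not reproduced here; a minimal sketch of the three calculations, assuming mean 170 cm and standard deviation 8 cm (the particular heights used are illustrative):
   pnorm(160, mean = 170, sd = 8)              # P(height < 160)
   1 - pnorm(185, mean = 170, sd = 8)          # P(height > 185)
   pnorm(180, 170, 8) - pnorm(165, 170, 8)     # P(165 < height < 180)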
The central limit theorem
If you take repeated samples from a population with finite variance
and calculate their averages, then the averages will be normally
distributed.
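A quick simulation sketch (the sample size and number of replicates are arbitrary): draw samples from a distinctly non-normal parent distribution and look at the distribution of their means.
   set.seed(1)
   means <- replicate(10000, mean(runif(30)))   # means of 10,000 samples of size 30 from a uniform distribution
   hist(means)                                   # roughly bell-shaped despite the flat parent distribution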
Checking normality

fishes.R
Checking normality
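fishes.R is not reproduced here; a minimal sketch of the usual normality checks in R, assuming a numeric vector y (generated here purely for illustration):
   set.seed(1)
   y <- rgamma(100, shape = 2, scale = 2)   # illustrative, positively skewed data
   hist(y)
   qqnorm(y); qqline(y)                     # points should lie near the line if y is normal
   shapiro.test(y)                          # formal test: a small p-value suggests non-normality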
The gamma distribution
The gamma distribution is useful for describing a wide range of processes where the data are positively skewed (i.e. non-normal, with a long tail on the right).
The gamma distribution
x <- seq(0.01, 4, 0.01)
par(mfrow=c(2,2))              # 2 x 2 grid of plots
y <- dgamma(x, 0.5, 0.5)       # dgamma(x, shape, rate)
plot(x, y, type="l")
y <- dgamma(x, 0.8, 0.8)
plot(x, y, type="l")
y <- dgamma(x, 2, 2)
plot(x, y, type="l")
y <- dgamma(x, 10, 10)         # larger shape: more symmetric, closer to normal
plot(x, y, type="l")

gammas.R
The gamma distribution
α is the shape parameter and β is the scale parameter. Special cases of the gamma distribution are the exponential (α = 1) and the chi-squared (α = ν/2, β = 2).
The mean of the distribution is αβ, the variance is αβ², the skewness is 2/√α and the kurtosis is 6/α.
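As a quick numerical check of the mean and variance formulas (shape α = 2 and scale β = 3 are arbitrary, so the mean should be near 6 and the variance near 18):
   set.seed(1)
   g <- rgamma(100000, shape = 2, scale = 3)
   mean(g)   # close to 2 * 3 = 6
   var(g)    # close to 2 * 3^2 = 18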
The gamma distribution

gammas.R
Exercise

fishes2.R
The exponential distribution
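No code accompanies this slide; a minimal sketch, plotting the density of the exponential distribution (the gamma with α = 1) with an arbitrary rate of 1:
   curve(dexp(x, rate = 1), 0, 4)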
Quantitative Data Analysis

Hypothesis testing
Why Test?
Statistics is an experimental science, not really a branch of mathematics.
It is a tool that can tell you whether an observed pattern or difference in the data is likely to be real or could easily have arisen by chance.
It does not give you certainty.
Steps in hypothesis testing
1.    Set the null hypothesis and the alternative hypothesis.
2.    Calculate the p-value.
3.    Decision rule: if the p-value is less than 5%, reject the null hypothesis; otherwise the null hypothesis is not rejected. In either case, you must give the p-value as a justification for your decision.
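A minimal sketch of these steps in R, using two small simulated samples (the data are illustrative only); the null hypothesis is that the two population means are equal:
   set.seed(1)
   a <- rnorm(20, mean = 10, sd = 2)
   b <- rnorm(20, mean = 12, sd = 2)
   t.test(a, b)      # the printed p-value drives the decision rule above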
Types of Errors…
A Type I error occurs when we reject a true null hypothesis (i.e.
Reject H0 when it is TRUE)

                     H0 is TRUE          H0 is FALSE
Reject H0            Type I error        correct decision
Do not reject H0     correct decision    Type II error
A Type II error occurs when we don’t reject a false null hypothesis
(i.e. Do NOT reject H0 when it is FALSE)
Critical regions and power
The table below schematically shows the relevant probabilities under the null and alternative hypotheses.

                            Do not reject        Reject
Null hypothesis is true     1 − α                α (Type I error)
Null hypothesis is false    β (Type II error)    1 − β (power)
Significance
It is common in hypothesis testing to set the probability of a Type I error, α, to some fixed value called the significance level. These levels are usually set to 0.1, 0.05 or 0.01. If the null hypothesis is true and the probability of observing a value of the test statistic at least as extreme as the current one is lower than the significance level, then the hypothesis is rejected.
Sometimes, instead of setting a pre-defined significance level, the p-value is reported. This is also called the observed significance level.
Significance Level
When we reject the null hypothesis there is a risk of drawing a wrong conclusion.
The risk of drawing a wrong conclusion (called the p-value or observed significance level) can be calculated.
The researcher decides the maximum risk (called the significance level) they are ready to take.
The usual significance level is 5%.
P-value
We start from the basic assumption that the null hypothesis is true.
The p-value is the probability of getting a value equal to or more extreme than the sample result, given that the null hypothesis is true.
Decision rule: if the p-value is less than 5%, reject the null hypothesis; if the p-value is 5% or more, the null hypothesis is not rejected.
In either case, you must give the p-value as a justification for your decision.
Interpreting the p-value…
p < 0.01: overwhelming evidence (highly significant)
0.01 ≤ p < 0.05: strong evidence (significant)
0.05 ≤ p < 0.10: weak evidence (not significant)
p ≥ 0.10: no evidence (not significant)
Power analysis
The power of a test is the probability of rejecting the null hypothesis when it is false.
It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false. In an ideal world, we would obviously make β as small as possible.
But the smaller we make the probability of committing a Type II error, the greater we make the probability of committing a Type I error: rejecting the null hypothesis when, in fact, it is correct.
Most statisticians work with α = 0.05 and β = 0.2, so the power of a test is defined as 1 − β = 0.8.
Confidence
A confidence interval with a particular confidence level is intended to give this assurance: if the statistical model is correct then, over all the data sets that might have been obtained, the procedure for constructing the interval would deliver an interval containing the true value of the parameter the proportion of the time set by the confidence level.
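In R, most test functions report a confidence interval alongside the p-value; a minimal sketch with illustrative data:
   set.seed(1)
   x <- rnorm(30, mean = 170, sd = 8)
   t.test(x)$conf.int                      # 95% confidence interval for the mean
   t.test(x, conf.level = 0.99)$conf.int   # a 99% interval is wider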
Don't Complicate Things

Use the classical tests:
var.test to compare two variances (Fisher's F)
t.test to compare two means (Student's t)
wilcox.test to compare two means with non-normal errors (Wilcoxon's rank test)
prop.test (binomial test) to compare two proportions
cor.test (Pearson's or Spearman's rank correlation) to correlate two variables
chisq.test (chi-squared test) or fisher.test (Fisher's exact test) to test for independence in contingency tables
Comparing Two Variances
Before comparing means, verify that the variances are not significantly different.
    var.test(set1, set2)
This performs Fisher's F test.
If the variances are significantly different, you can transform the output (y) variable to equalise the variances, or you can still use t.test (Welch's modified test).
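A minimal sketch with simulated samples (illustrative only):
   set.seed(1)
   set1 <- rnorm(25, mean = 10, sd = 2)
   set2 <- rnorm(25, mean = 10, sd = 4)
   var.test(set1, set2)   # Fisher's F test; a small p-value means the variances differ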
Comparing Two Means
Student's t-test (t.test) assumes the samples are independent, the variances constant, and the errors normally distributed. It will use the Welch-Satterthwaite approximation (the default, with less power) if the variances are different. This test can also be used for paired data.
The Wilcoxon rank-sum test (wilcox.test) is used for independent samples whose errors are not normally distributed. If you transform the data to get constant variance, you will probably have to use this test.
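A minimal sketch of both tests on the same illustrative samples:
   set.seed(1)
   a <- rnorm(20, mean = 10, sd = 2)
   b <- rnorm(20, mean = 12, sd = 2)
   t.test(a, b)        # Welch's test by default; add var.equal = TRUE for classical Student's t
   wilcox.test(a, b)   # rank-based alternative when the errors are not normal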
Student’s t
The test statistic is the number of standard errors by which the two sample means are separated:
t = (mean(A) − mean(B)) / SE_diff, where SE_diff = sqrt(s_A²/n_A + s_B²/n_B)
Power analysis
So how many replicates do we need in each of two samples to detect a difference of 10% with power = 80%, when the mean is 20 (i.e. delta = 2) and the standard deviation is about 3.5?
    power.t.test(delta=2,sd=3.5,power=0.8)
You can work out what size of difference your sample of 30 would
allow you to detect, by specifying n and omitting delta:
    power.t.test(n=30,sd=3.5,power=0.8)
Paired Observations
The measurements will not be independent.
Use t.test with paired=TRUE. Now you are doing a single-sample test of the differences against 0.
When you can do a paired t-test, you should always do the paired test: it is more powerful.
Pairing deals with blocking, spatial correlation, and temporal correlation.
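A minimal sketch, assuming before/after measurements on the same units (the data are illustrative):
   before <- c(12.1, 9.8, 11.4, 10.2, 13.0, 9.5)
   after  <- c(12.8, 10.1, 12.0, 10.9, 13.4, 10.1)
   t.test(after, before, paired = TRUE)   # a single-sample test of the differences against 0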
Sign Test
Used when you can't measure a difference but can see it.
Use the binomial test (binom.test) for this.
Binomial tests can also be used to compare proportions (prop.test).
Chi-squared contingency tables
The contingencies are all the events that could possibly happen. A contingency table shows the counts of how many times each of the contingencies actually happened in a particular sample.
Chi-square Contingency Tables
Deals with count data.
Suppose there are two characteristics (hair colour and eye colour).
The null hypothesis is that they are uncorrelated.
Create a matrix that contains the data and apply
chisq.test(matrix).
This gives you a p-value for the observed counts under the assumption of independence.
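A minimal sketch with a made-up 2 x 2 table of counts (hair colour by eye colour; the numbers are purely illustrative):
   counts <- matrix(c(38, 14, 11, 51), nrow = 2,
                    dimnames = list(hair = c("fair", "dark"),
                                    eyes = c("blue", "brown")))
   chisq.test(counts)    # a small p-value is evidence against independence
   fisher.test(counts)   # exact alternative when expected counts are small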
Fisher's Exact Test
Used for analysis of contingency tables when one or more of the
expected frequencies is less than 5.
Use fisher.test(x)
Comparing two proportions
It turns out that 196 men were promoted out of 3270 candidates,
compared with 4 promotions out of only 40 candidates for the
women.
     prop.test(c(4,196),c(40,3270))
Correlation and covariance
Covariance is a measure of how much two variables change together: cov(x, y) = E[(x − μx)(y − μy)].
The Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) is a measure of the correlation (linear dependence) between two variables: r = cov(x, y) / (sx sy).
Correlation and Covariance
Are two variables correlated significantly?
Create and attach the data frame, then apply cor(dataframe) to get the matrix of pairwise correlations.
To determine the significance of a correlation between two variables, apply cor.test(x, y).
You have three options: Kendall's tau (method = "k"), Spearman's rank (method = "s"), or (the default) Pearson's product-moment correlation (method = "p").
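A minimal sketch with two illustrative vectors:
   set.seed(1)
   x <- rnorm(30)
   y <- 0.5 * x + rnorm(30)
   cor(x, y)                      # the correlation coefficient itself
   cor.test(x, y)                 # default Pearson test, with p-value and confidence interval
   cor.test(x, y, method = "s")   # Spearman's rank alternative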
Kolmogorov-Smirnov Test
Are two sample distributions significantly different?
or
Does a sample distribution arise from a specific distribution?

ks.test(A, B)
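A minimal sketch of both uses (the data are illustrative):
   set.seed(1)
   A <- rnorm(50)
   B <- runif(50)
   ks.test(A, B)               # are the two sample distributions different?
   ks.test(A, "pnorm", 0, 1)   # does A come from a standard normal distribution?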