FUNDAMENTALS
OF STATISTICS
Dr. Josep M. Vilaseca
CAPSE 2009
PROBABILITY
The probability of an event is the
proportion of times we would observe it
if we repeated the experiment a large
number of times
THE NORMAL DISTRIBUTION
AND THE STANDARD NORMAL
DISTRIBUTION
INFERENCE FROM A SAMPLE
MEAN
 The mean of the sampling distribution of
means is the true population mean
 Its standard deviation is the population
standard deviation divided by the square
root of the sample size (this is called the
standard error). It measures the precision
of my sample.
 Confidence interval: estimated mean ±
multiplier × standard error of the estimate
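A minimal Python sketch of this calculation, with made-up sample values and using the Normal multiplier 1.96 for a 95% interval (for small samples a t multiplier would be used, as a later slide notes):

import math

sample = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7]  # hypothetical measurements
n = len(sample)
mean = sum(sample) / n
# sample standard deviation (n - 1 denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = sd / math.sqrt(n)                    # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se  # 95% confidence interval
print(f"mean = {mean:.3f}, SE = {se:.3f}, 95% CI = ({low:.3f}, {high:.3f})")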
INFERENCE FROM A SAMPLE
MEAN
 95% confidence interval: we are 95% confident
that the true mean in the population lies within
this interval
 Z and t: using tables we can obtain a probability
for the calculated value
 P-value: the area under the curve
corresponding to values outside the range (−z, z)
or (−t, t). That is, the area in the tails of the
distribution gives the probability of observing
values at least as extreme
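As a sketch of how a two-sided p-value follows from a calculated z value, here using scipy.stats.norm in place of printed tables (the z value is made up):

from scipy.stats import norm

z = 2.1                                   # hypothetical calculated statistic
p_value = 2 * (1 - norm.cdf(abs(z)))      # two-sided: area in both tails
print(f"p = {p_value:.4f}")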
INFERENCE FROM A SAMPLE
MEAN
 Null hypothesis: the two population means
are the same
 Alternative hypothesis: the two population
means are not the same
 Hypothesis test: we calculate the
probability of obtaining the observed data
if the null hypothesis were true (a large
p-value means we do not reject the null
hypothesis; a small one means we reject it)
COMPARISON OF TWO MEANS
 Paired samples occur when the individual
observations in the first sample are
matched to individual observations in the
second sample. For quantitative data this
usually occurs when there are repeated
measurements on the same person
 Unpaired data occur when individual
observations in one sample are
independent of individual observations in
the other
COMPARISON OF TWO MEANS
 Paired data: we calculate the difference between
the first and second measurements, then the
mean difference, the standard deviation of the
differences and the standard error of the mean
difference. We can also calculate the probability
that, on average, there is no difference between
the paired observations in the population using a
hypothesis test. The null hypothesis is that the
mean population difference is zero. We assume
that the differences are normally distributed with
a mean of zero
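A minimal sketch of the paired calculation described above, with made-up before/after measurements; the test statistic is the mean difference divided by its standard error:

import math

before = [142, 138, 150, 145, 160, 155, 148, 152]  # hypothetical 1st measurements
after  = [138, 136, 147, 144, 158, 150, 146, 149]  # hypothetical 2nd measurements

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
se_d = sd_d / math.sqrt(n)     # standard error of the mean difference
t = mean_d / se_d              # test statistic under H0: mean difference = 0
print(f"mean difference = {mean_d:.2f}, SE = {se_d:.2f}, t = {t:.2f}")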
COMPARISON OF TWO MEANS
 Unpaired data: we calculate the difference
between two independent means, the standard
deviation in two independent samples, and the
standard error of the difference in two
independent means, which is a combination of the
standard errors of the two independent sample
distributions. Using the standard error of the
difference in means, we can calculate the
confidence interval for the estimated difference
and test whether it is significantly different from
zero. We can use a z test in the same way as we
did before for a single sample mean or for
paired samples
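A sketch of the unpaired comparison under the same large-sample assumptions (made-up data); the standard error of the difference combines the two sample standard errors:

import math

def mean_sd(xs):
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

group1 = [5.2, 4.9, 5.5, 5.1, 5.0, 5.3]   # hypothetical sample 1
group2 = [4.6, 4.8, 4.5, 4.9, 4.7, 4.4]   # hypothetical sample 2

m1, sd1 = mean_sd(group1)
m2, sd2 = mean_sd(group2)
n1, n2 = len(group1), len(group2)
# SE of the difference: combination of the two independent standard errors
se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
diff = m1 - m2
low, high = diff - 1.96 * se_diff, diff + 1.96 * se_diff  # 95% CI
z = diff / se_diff                        # test of H0: no difference
print(f"difference = {diff:.2f}, 95% CI = ({low:.2f}, {high:.2f}), z = {z:.2f}")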
COMPARISON OF TWO MEANS
 When the sample size is small, we use the
t-distribution to calculate confidence
intervals and test hypotheses (for either
paired or unpaired data).
 To compare independent samples,
however, we need to assume that the
variances of the two populations are the
same.
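For small samples the same comparisons are usually run with the t-distribution; a sketch using SciPy's standard paired and unpaired t-tests, with the equal-variances assumption for the unpaired case as the slide notes (all data are made up):

from scipy.stats import ttest_rel, ttest_ind

before = [142, 138, 150, 145, 160, 155]   # hypothetical paired data
after  = [138, 136, 147, 144, 158, 150]
group1 = [5.2, 4.9, 5.5, 5.1, 5.0, 5.3]   # hypothetical independent samples
group2 = [4.6, 4.8, 4.5, 4.9, 4.7, 4.4]

t_p, p_p = ttest_rel(before, after)                    # paired t-test
t_u, p_u = ttest_ind(group1, group2, equal_var=True)   # pooled-variance t-test
print(f"paired:   t = {t_p:.2f}, p = {p_p:.4f}")
print(f"unpaired: t = {t_u:.2f}, p = {p_u:.4f}")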
INFERENCE FROM A SAMPLE
PROPORTION (7)
 The sampling distribution of a proportion is
approximately Normal when the sample is
large
 The SE of a sample estimate is equal to
the standard deviation divided by √n.
 95% CI = p ± 1.96 × SE(proportion)
 95% CI = p ± 1.96 × √(p(1 − p) / n)
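A sketch of the proportion interval, with made-up counts:

import math

k, n = 30, 120                       # hypothetical: 30 events in 120 subjects
p = k / n
se = math.sqrt(p * (1 - p) / n)      # SE of the sample proportion
low, high = p - 1.96 * se, p + 1.96 * se  # 95% confidence interval
print(f"p = {p:.3f}, 95% CI = ({low:.3f}, {high:.3f})")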
INFERENCE FROM A SAMPLE
PROPORTION (7)
 If we want to assess whether the
population proportion has a certain value:
1. First we state the null hypothesis:
Π = Π0
2. Then we state the alternative hypothesis:
Π ≠ Π0
3. Finally we compute the test statistic:
z = (p − Π0) / SE(Π0)
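A sketch of this test; as the next slide notes, the standard error is computed under the null value Π0 (all numbers are made up):

import math
from scipy.stats import norm

k, n = 30, 120                         # hypothetical observed counts
p = k / n
pi0 = 0.20                             # hypothetical null value
se0 = math.sqrt(pi0 * (1 - pi0) / n)   # SE(Π0), assuming H0 true
z = (p - pi0) / se0
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")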
INFERENCE FROM A SAMPLE
PROPORTION (7)
 Remember: we calculate SE(Π0)
assuming the null hypothesis to be true.
 Remember: these methods are only
reliable if the sample is large (say, if the
proportion is less than 0.5 and the number
of subjects with the disease is 5 or more)
 When these conditions are not satisfied,
we use the binomial distribution.
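When the large-sample conditions fail, an exact binomial test can be used directly; a sketch with SciPy (binomtest requires SciPy ≥ 1.7; the counts and null value are made up):

from scipy.stats import binomtest

# hypothetical: 3 events out of 18 subjects, testing H0: Π = 0.20
result = binomtest(k=3, n=18, p=0.20)
print(f"exact two-sided p = {result.pvalue:.4f}")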
COMPARISON OF TWO
PROPORTIONS (8)
 We want to make comparisons between the
proportions in two independent populations
(case-control study, cohort study, clinical
trial).
 For a large sample we can use a normal
approximation to the binomial distribution
 When comparing proportions for
independent samples, the first thing we do
is calculate the difference between the two
proportions
COMPARISON OF TWO
PROPORTIONS (8)
 The analysis for comparing two independent
proportions is similar to the comparison of
two independent means
 The standard error of the difference in two
proportions is a combination of the standard
errors of the two independent distributions
 Hypothesis test: we use a common
proportion (because under the null
hypothesis the two proportions are
assumed to be the same) and the pooled
standard error
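A sketch of the two-proportion comparison; note the pooled proportion for the hypothesis test versus the unpooled SE for the confidence interval (counts are made up):

import math

k1, n1 = 40, 200     # hypothetical events / subjects, group 1
k2, n2 = 25, 210     # hypothetical events / subjects, group 2
p1, p2 = k1 / n1, k2 / n2
diff = p1 - p2

# CI: combine the two independent standard errors
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
low, high = diff - 1.96 * se, diff + 1.96 * se

# Hypothesis test: common (pooled) proportion under H0: Π1 = Π2
p_pool = (k1 + k2) / (n1 + n2)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se_pool
print(f"difference = {diff:.3f}, 95% CI = ({low:.3f}, {high:.3f}), z = {z:.2f}")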
ASSOCIATION BETWEEN TWO
CATEGORICAL VARIABLES
 When we want to examine the relationship
between two categorical variables we
tabulate one against the other. This is
called a two-way table (also known as a
cross-tabulation)
 An association exists between two
categorical variables if the distribution of
one variable varies according to the value
of the other
ASSOCIATION BETWEEN TWO
CATEGORICAL VARIABLES
 The chi-squared test for 2x2 tables is
identical to the z-test for comparing two
proportions: the value of z is the square
root of chi-squared.
 Fisher's exact test may also be used.
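A sketch of both tests on a hypothetical 2x2 table with SciPy; the continuity correction is turned off so that chi-squared equals the square of the two-proportion z statistic:

from scipy.stats import chi2_contingency, fisher_exact

table = [[40, 160],   # hypothetical 2x2 table: exposed cases / non-cases
         [25, 185]]   # unexposed cases / non-cases

chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
odds_ratio, p_fisher = fisher_exact(table)
print(f"chi-squared = {chi2:.2f} (z = {chi2 ** 0.5:.2f}), p = {p_chi2:.4f}")
print(f"Fisher's exact p = {p_fisher:.4f}")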
CORRELATION (10)
 Do the values of a variable tend to be
higher (or lower) for higher values of the
other? CORRELATION
 What is the value of one of the variables
likely to be when we know the value of the
other? LINEAR REGRESSION
CORRELATION (10)
 Correlation is used to study a possible linear
(straight-line) relationship between two
quantitative variables. This tells us how much
the two variables are associated
 To measure the degree of linear association
we calculate a correlation coefficient
 The standard method is to calculate
Pearson's correlation coefficient, denoted r
CORRELATION (10)
Pearson's correlation coefficient
 Measures the scatter of the points around
an underlying linear (straight-line) trend
 Can take any value from −1 to +1
 If there is no linear relationship then the
correlation is zero. But be careful: there
can be a strong non-linear relationship
between two variables.
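The original slide showed the coefficient's formula as an image; the standard definition, reconstructed here, is:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}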
CORRELATION (10)
 We can think of the square of r as the
proportion of the variability in the y variable
that is accounted for by the linear
relationship with the x variable
 Assumptions for use of correlation:
 the two variables have an approximately
Normal distribution
 all observations should be independent
 Causation cannot be directly inferred from a
strong correlation coefficient
LINEAR REGRESSION (11)
 Regression studies the relationship
between two variables when one of them
depends on the other. This also allows one
variable to be estimated given the value of
the other.
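A minimal sketch with made-up data, using scipy.stats.linregress to fit the straight line and estimate the dependent variable at a new value:

from scipy.stats import linregress

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # hypothetical independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]   # hypothetical dependent variable

fit = linregress(x, y)                 # least-squares straight line
y_hat = fit.intercept + fit.slope * 7.0  # estimate y for a new x value
print(f"y = {fit.intercept:.2f} + {fit.slope:.2f}x, r = {fit.rvalue:.3f}")
print(f"prediction at x = 7: {y_hat:.2f}")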
MULTI-VARIABLE ANALYSIS 1:
STRATIFICATION (12)
Summary measures:
 Standardisation
 Mantel–Haenszel
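A sketch of the Mantel–Haenszel summary odds ratio pooled over strata, each stratum being a 2x2 table (counts are made up):

# each stratum as (a, b, c, d): exposed cases, exposed controls,
# unexposed cases, unexposed controls
strata = [(20, 80, 10, 90),    # hypothetical stratum 1
          (15, 45, 12, 48)]    # hypothetical stratum 2

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
or_mh = num / den              # Mantel-Haenszel pooled odds ratio
print(f"Mantel-Haenszel OR = {or_mh:.2f}")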
MULTI-VARIABLE ANALYSIS 2:
MULTIPLE LINEAR REGRESSION