Introduction to Biostatistics

AiiHC00 Introduction to Biostatistics
Ramzi EL FEGHALI, Ph.D.
Senior Lecturer
ref@aiihc.com
Biostatistics course EL FEGHALI R.
Artificial intelligence in HealthCare
AIIHC International Ltd. U.K.

AiiHC Program
• 1st Section : Applied Statistics to Healthcare (AiiHC00 - AiiHC10)
• 2nd Section : Artificial Intelligence (AiiHC11 - AiiHC15)
• 3rd Section : Coding languages (AiiHC16 - AiiHC20)
• 4th Section : Business in Artificial Intelligence (AiiHC21 - AiiHC24)

Statistics History

AiiHC00 Syllabus

Steps in statistical studies
• Data collection
– Simple observation :
•Data are collected over the study period, without any specific intervention.
•Sampling plan
– Experimentation
•Controlled phenomena are induced. Example: administration of a drug to a specific group of subjects.
• Statistical analysis
– "Deductive" or descriptive analysis
•Its aim is to summarize and present the observed data in tables, graphs, …
– "Inductive" or inferential analysis
•It allows the conclusions obtained to be extended and generalized, under certain conditions.

Terminology
• Variable : measurable entity (X).
• Population : set of all the values corresponding to a variable.
• Individual or case : one or many observations performed on the same entity.
• Observation : particular measurement of a variable (xi) for one case.
• Statistic or statistical descriptor : number which describes or summarizes a set of observations.
• Sample : subset of the statistical population, composed of many observations. The size of the sample is generally called n. The process which leads to the creation of a sample is called sampling.

Sampling (1)
• Basic statistical methods assume that samples are random and independent.
• Non-random samples must be avoided.
• The probability of being sampled, or of belonging to a group, should be equal for all the cases studied.
• From the sample values X and S (known), we want to estimate the population parameters m and σ (unknown).
[Figure: the population, with unknown m and σ, and the sample drawn from it, with known X and S]

Sampling (2)
• A statistical sample consists of a limited number of cases drawn from the studied population.
• Sampling provides a representation of the population.
• A sample of size n of a random variable X is obtained by repeating n times the draw that gives X.
• Notation : (X1, X2, … , Xn)
• A particular realization of the sample : (x1, x2, … , xn)
• Estimator : a quantity computed from the observations which allows the values of unknown parameters of a statistical distribution to be estimated.
• Unbiased estimator : its expected value (its mean over repeated samples) is equal to the parameter being estimated.
• Convergent estimator : gets closer to the parameter being estimated when n increases.
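The two sampling slides can be illustrated with a minimal Python sketch. Everything numeric here is an assumption made for the illustration: the population is a hypothetical list of 1,000 simulated heights, and the seed is fixed only so that the run is reproducible. The sketch draws a simple random sample in which every case has the same probability of being selected, then uses the sample mean and standard deviation as estimators of m and σ.

```python
import random
import statistics

random.seed(42)  # fixed seed: only to make the illustration reproducible

# Hypothetical population: 1,000 simulated heights in cm (an assumption).
population = [random.gauss(170, 10) for _ in range(1000)]

# Simple random sample without replacement: equal selection probability.
n = 50
sample = random.sample(population, n)

# Estimators computed from the sample (known) ...
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # divides by n - 1

# ... compared with the population parameters (normally unknown).
m = statistics.mean(population)
sigma = statistics.pstdev(population)   # divides by the population size

print(f"sample:     mean = {x_bar:.2f}, sd = {s:.2f}")
print(f"population: mean = {m:.2f}, sd = {sigma:.2f}")
```

Increasing n in this sketch should, on average, bring the sample values closer to the population values, which is what the convergence property describes.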

Types of variables
• Qualitative variables :
– Nominal / Non ordinal
Examples : color, form, type of treatment, …
– Ordinal
Examples : vegetation cover, wind scale, presence/absence, …
• Quantitative variables :
– Continuous
Examples : temperature, height, weight, …
– Discrete
Examples : counts of individuals, number of occurrences of an event, etc.

Frequency distribution
• Statistical series :
– simple enumeration or counting of the observations
– can be ordered (quantitative variable)
– the total number of observations N is called the size of the sample.
• Ungrouped distributions
– When we have multiple observations, the same value can be observed many times.
– We write xi for the different observed values; the number of occurrences of xi is ni and is called the absolute frequency; p is the number of different observed values.
– ni/N is called the relative frequency.
– For a quantitative variable, we order the xi, and the absolute or relative frequencies can be summed to obtain the cumulative frequencies Ni or Fi: $N_i = \sum_{j \le i} n_j$ and $F_i = N_i / N$.
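As a small illustration of these definitions, the following Python sketch builds an ungrouped frequency table. The observations are hypothetical values chosen for the example; the variable names mirror the slide's notation (xi, ni, N, p, Ni, Fi).

```python
from collections import Counter

observations = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]   # hypothetical data
N = len(observations)                           # size of the sample

counts = Counter(observations)
values = sorted(counts)          # the p distinct values x_i, in order
p = len(values)

table = []
cumulative = 0
for x in values:
    n_i = counts[x]              # absolute frequency of x_i
    cumulative += n_i            # cumulative absolute frequency N_i
    table.append((x, n_i, n_i / N, cumulative, cumulative / N))

print("x_i  n_i  n_i/N  N_i   F_i")
for x, n_i, rel, N_i, F_i in table:
    print(f"{x:>3}  {n_i:>3}  {rel:>5.2f}  {N_i:>3}  {F_i:>4.2f}")
```

The last cumulative absolute frequency is always N and the last cumulative relative frequency is always 1, which is a quick check on any hand-built table.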

Descriptive statistics (1)
• Descriptive statistics are used to describe a dataset or sample.
• A data table alone is not enough to analyze the data
=> other methods are used to better understand and interpret the data.
• Two groups of methods : numerical methods (statistical descriptors)
and graphical methods.
• They are complementary.
• Distinction between univariate datasets (only one variable), bivariate
(two variables) and multivariate (more than two variables in the
dataset).

• Mean or average : sum of the observations divided by their number. For the variable Y, its mean (called Y bar) is : $\bar{y} = \sum_{i=1}^{n} y_i / n$
• Property : the sum of the deviations from the mean is null : $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$
• Median : value in the middle of the observations.
– Sort the observations in increasing order.
– If n is odd, take the middle value (eg : n = 7 => y4)
– If n is even, take the value halfway between the two values around the middle (eg : n = 8 => (y4 + y5) / 2)
• A statistic is called robust, or resistant, if its value is little influenced by a change in a small proportion of the data.
– The median is more robust than the mean
– The mean gives more information about the data. It is more powerful when the data are ready to be analyzed (without outliers)
– These two measures are complementary statistical tools
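The difference in robustness between the mean and the median can be checked with a short Python sketch. The observations are hypothetical; the second list replaces a single value with an outlier to show how each statistic reacts.

```python
import statistics

y = [3, 5, 5, 6, 7, 8, 8]               # hypothetical observations, n = 7
y_bar = statistics.mean(y)

# Property: the deviations from the mean sum to zero.
deviations = [yi - y_bar for yi in y]

median_odd = statistics.median(y)            # n = 7: middle value y4
median_even = statistics.median(y + [10])    # n = 8: (y4 + y5) / 2

# Replace one value by an outlier: the mean moves, the median does not.
y_outlier = [3, 5, 5, 6, 7, 8, 80]
mean_outlier = statistics.mean(y_outlier)
median_outlier = statistics.median(y_outlier)

print(y_bar, sum(deviations), median_odd, median_even)
print(round(mean_outlier, 2), median_outlier)
```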

• Range of the values : ymax - ymin (less robust)
• Quartiles : the median divides the sample into 2 sets, the quartiles divide each set into 2 equal subsets.
• The interquartile range is : Q3 – Q1.
• The 5 numbers min, Q1, median, Q3, max summarize the sample.
• The quantile or percentile p (0 <= p <= 1) is the value below which the proportion p of the smallest values lies, eg : p = 0.1 => 10% of the values are smaller and 90% are larger.
[Figure: axis from min to max marking Q1, Q2 = median and Q3, with the interquartile range and the range of the values]
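A minimal sketch of the five-number summary in Python follows. The sample is hypothetical, and the quartile convention is an assumption: `statistics.quantiles` with `method="inclusive"` is used here, and other conventions (including the "median of each half" rule on the slide) can give slightly different quartiles on other data.

```python
import statistics

y = [2, 4, 4, 5, 6, 7, 8, 9, 10]     # hypothetical sample, n = 9

# Quartiles under the "inclusive" convention (an assumption, see above).
q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")

five_numbers = (min(y), q1, q2, q3, max(y))
iqr = q3 - q1                        # interquartile range
value_range = max(y) - min(y)        # range of the values

print("min, Q1, median, Q3, max =", five_numbers)
print("IQR =", iqr, " range =", value_range)
```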

• Variance of a population : $\sigma^2 = \sum_{i=1}^{n} (y_i - \mu)^2 / n$
• Variance of a sample : $s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$
• Standard deviation : $s = \sqrt{s^2}$
• The standard deviation measures the typical deviation of the observations from the mean.
• Coefficient of variation : $cv = s / \bar{y} \cdot 100\%$
• Variance, standard deviation and coefficient of variation are associated with the mean.
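The dispersion formulas can be computed by hand and checked against Python's `statistics` module. The data are hypothetical; for the population variance the list is treated as if it were the whole population, which is an assumption made only to show the difference between dividing by n and by n - 1.

```python
import math
import statistics

y = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations
n = len(y)
y_bar = sum(y) / n

ss = sum((yi - y_bar) ** 2 for yi in y)   # sum of squared deviations

var_population = ss / n                # divides by n
var_sample = ss / (n - 1)              # divides by n - 1
s = math.sqrt(var_sample)              # standard deviation
cv = s / y_bar * 100                   # coefficient of variation, in %

print(f"population variance = {var_population:.3f}")
print(f"sample variance     = {var_sample:.3f}")
print(f"standard deviation  = {s:.3f}")
print(f"cv                  = {cv:.1f}%")
```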

• The boxplot represents graphically the famous «5 numbers».
• The outliers are the values lying more than 1.5 times the interquartile range away from the closest quartile (box borders).
[Figure: boxplot on a scale from 10 to 50 showing min, Q1, median, Q3 and maximum, the IQR, the 1.5 . IQR limit, and an extreme value (max) beyond that limit]
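The 1.5 . IQR rule can be sketched in Python as follows. The data are hypothetical, with one deliberately extreme value, and the quartiles use the "inclusive" convention of `statistics.quantiles`, which is an assumption; plotting software may use a slightly different quartile rule.

```python
import statistics

y = [12, 14, 15, 16, 17, 18, 19, 21, 48]   # hypothetical data, one extreme value

q1, median, q3 = statistics.quantiles(y, n=4, method="inclusive")
iqr = q3 - q1

# Limits located 1.5 * IQR beyond the box borders.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in y if v < lower_limit or v > upper_limit]
# Largest observation that is not an outlier (end of the upper whisker).
whisker_max = max(v for v in y if v <= upper_limit)

print("limits:", lower_limit, upper_limit)
print("outliers:", outliers, " upper whisker ends at:", whisker_max)
```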

• Grouping into classes (choice of the class borders, and of whether the left or the right border value is included)
• Graphical representation : the histogram = frequency of the observations in each class.
• Unimodal, bimodal or multimodal distribution.
[Figure: three histograms of Frequency against y, including a panel titled "Unimodal distr." and a panel titled "Bimodal distr."]

• The dotplot
– 2 observed variables
[Figure: dotplot of iris$Petal.Length (vertical axis) against iris$Sepal.Length (horizontal axis)]

• The Q-Q plot
– 1 observed variable and 1 theoretical distribution
– 2 observed variables

Probability distributions
• Discrete distributions:
•Uniform, Bernoulli, binomial, Poisson… distributions
–Discrete random variables take separate integer values over a predefined interval. They are generally the result of counting.
• Continuous distributions:
•Uniform, normal, standard normal… distributions
–Continuous random variables can take any value over a predefined interval.

Normal distribution
• The normal distribution is a well-known bell-shaped mathematical distribution.
• It was defined by Laplace and Gauss.
• The equation of the frequency curve of a normal distribution depends on two parameters : the mean and the standard deviation of the variable
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - m)^2}{2\sigma^2}}$
$x \sim N(m ; \sigma)$
$\pi \approx 3.14159$
[Figure: bell-shaped curve of the normal distribution against x]
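To make the density formula concrete, here is a short Python sketch that evaluates it directly and compares the result with `statistics.NormalDist` from the standard library. The mean and standard deviation (170 and 10) are hypothetical values chosen for the illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, m, sigma):
    """Density of N(m; sigma) at x, written out from the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - m) ** 2) / (2.0 * sigma ** 2))

m, sigma = 170.0, 10.0               # hypothetical parameters

peak = normal_pdf(m, m, sigma)       # the curve is highest at x = m
reference = NormalDist(mu=m, sigma=sigma).pdf(m)

# Symmetry of the bell shape around the mean.
left = normal_pdf(m - sigma, m, sigma)
right = normal_pdf(m + sigma, m, sigma)

print(f"f(m) = {peak:.5f} (library value: {reference:.5f})")
print(f"f(m - sigma) = {left:.5f}, f(m + sigma) = {right:.5f}")
```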

The standard normal distribution
[Figure: density f(z) of the standard normal distribution for z from -3 to 3, with the limits -1.96 and +1.96 marked]
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2 / 2}$
$z = \frac{x - m}{\sigma} \sim N(0 ; 1)$
$P(-1.96 \le z \le +1.96) = 0.95$
For a standard normal distribution and α = 5%, 95% of the z values belong to the interval [-1.96 ; +1.96]
The confidence interval : $CI_\alpha = m \pm z_\alpha \cdot s / \sqrt{n}$
z follows a standard normal distribution when n is large; for small samples, z is replaced by the Student t value with n-1 DF (degrees of freedom)
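The 95% interval and the confidence interval formula can be reproduced with the standard library's `NormalDist`. This is a sketch under assumptions: the sample is hypothetical, and the normal value z is used as on the slide, although with only 10 observations the Student t value with n-1 DF would give a somewhat wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

std_normal = NormalDist(0.0, 1.0)

# P(-1.96 <= z <= +1.96) for the standard normal distribution.
coverage = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Two-sided critical value z_alpha for alpha = 5%.
alpha = 0.05
z_alpha = std_normal.inv_cdf(1 - alpha / 2)

# Hypothetical sample (an assumption, for illustration only).
x = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]
n = len(x)
centre = mean(x)                              # mean estimated from the sample
half_width = z_alpha * stdev(x) / math.sqrt(n)
ci = (centre - half_width, centre + half_width)

print(f"coverage = {coverage:.4f}, z_alpha = {z_alpha:.3f}")
print(f"CI = [{ci[0]:.3f} ; {ci[1]:.3f}]")
```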



Statistical Tests
• Statistical hypotheses
• One-tailed and two-tailed tests
• Type I and II errors (alpha and beta risks)
• Significance ‘p-value’
• Univariate, bivariate, multifactorial, and multivariate tests

Statistical hypotheses
• They are statements about the frequency distribution.
• These statements may be true or false.
• In the majority of tests we state a hypothesis with the aim of rejecting it.
• Example :
the observed percentage in a population is 10%. We want to verify whether the observed percentage in a particular group differs from the observed percentage in the population. We suppose that there is no difference. So the hypothesis will be :
“All observed differences are the results of sampling fluctuations, or chance.”
• This hypothesis is called the null hypothesis and is noted H0.
• All other hypotheses are called alternative hypotheses and are noted H1.

Test of hypotheses and significance
• Tests of hypotheses, or tests of significance, are statistical procedures used to decide whether the stated hypotheses are true or false, in order to learn more about the unknown reality.
• They belong to inferential statistics.
• Different tests exist according to:
– the type of studied variables (quantitative/qualitative)
– the type of problem (comparison between 2 means or more…)
– the conditions of application (modeling in terms of a probability distribution)
• However, the logical steps of a test are always the same.

One-tailed and two-tailed tests
• The null hypothesis H0 retained is often equality. The alternative hypotheses H1 are then all the other possible situations, which are divided into two categories : greater than; less than.
• When we consider all the alternative hypotheses, we use a two-tailed test.
• When we consider only a part of the alternative hypotheses (greater than, or less than), we use a one-tailed test.
• eg : we compare the height of 3-year-old and 4-year-old children. The test is a one-tailed test because height increases with age.

Type I and II errors (alpha and beta risks)
• The type I error : the decision to reject the null hypothesis when it is true.
– Example: in a trial comparing a new drug with a previous one, we conclude that there is a difference between them although in reality there is none. We are making the type I error.
• The type II error : the opposite of the first error. We accept the null hypothesis when it is false.
– Example: in a trial comparing a new drug with a previous one, we conclude that there is no difference between them although in reality there is one. We are making the type II error.

Significance ‘p-value’
• When we test a hypothesis, the probability of making a type I error is called the threshold of significance of the test and is usually noted alpha. This risk is set before the experiment, when the problem is formulated.
• The probability of making a type II error is usually noted beta. The probability of rejecting H0 when it is false is called the power of the test: Power = 1 - beta.
• There is no direct relation between alpha and beta, although for a fixed sample size lowering one tends to raise the other. Both should be as close to 0 as possible. In general, we choose alpha = 0.05 and we try to minimize beta (in general 0.1).
• The degree of significance p-value is the probability, under the null hypothesis, of observing a difference at least as large as the observed one through chance alone.

Univariate test: Chi2 for goodness of fit (1)
• The elements :
– Generalization of the comparison between an observed percentage and a theoretical one.
– 1 qualitative variable defining classes (or a quantitative variable grouped into classes). We have the observations (number) of subjects corresponding to each class.
– 1 theoretical distribution over the same classes, either an empirical one or one following a theoretical probability distribution.
• The question :
– Does the observed distribution conform to the theoretical distribution?
– Could the difference between the observed values and the theoretical ones be due to chance?

Univariate test: Chi2 for goodness of fit (2)
• Hypotheses :
– Null hypothesis H0 :
•The differences between the observed values and the theoretical ones are due to chance. The observed distribution follows the theoretical probability distribution.
– Alternative hypothesis H1 :
•The observed distribution does not follow the theoretical probability distribution.
• Elements necessary for the calculation
– Contingency table
Classes                 A   B   C   D
Observed frequency      O1  O2  O3  O4
Theoretical frequency   T1  T2  T3  T4

Univariate test: Chi2 for goodness of fit (3)
• Statistic :
– Chi2 “goodness of fit”
•Degrees of freedom : number of classes - 1 - number of parameters estimated for the theoretical distribution (if necessary).
– Calculation of each theoretical frequency
– Condition of application : all theoretical frequencies must be greater than 5, otherwise we group the classes
– Calculation of the Chi2 : $\chi^2 = \sum_{i=1}^{p} \frac{(O_i - T_i)^2}{T_i}$
p = number of classes after grouping
DF = p - 1 - number of estimated parameters
• Decision: If Chi2 > Chi2 alpha => reject H0 : the distribution does not conform to the theoretical distribution. Equivalently, if the degree of significance p-value is less than alpha => reject H0.
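The calculation steps above can be sketched in Python with the standard library only. The observed counts and the uniform theoretical distribution are hypothetical. Because the standard library has no Chi2 distribution, the decision uses the tabulated critical value for alpha = 0.05 and 3 DF (7.815) instead of a p-value.

```python
observed = [30, 20, 25, 25]              # hypothetical counts for classes A-D
proportions = [0.25, 0.25, 0.25, 0.25]   # theoretical distribution (assumed)

total = sum(observed)
theoretical = [total * pr for pr in proportions]

# Condition of application: every theoretical frequency greater than 5.
assert all(t > 5 for t in theoretical), "group the classes first"

chi2 = sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))
df = len(observed) - 1                   # no parameter estimated here

CHI2_ALPHA_5PCT_DF3 = 7.815              # tabulated value, alpha = 0.05, 3 DF
reject_h0 = chi2 > CHI2_ALPHA_5PCT_DF3

print(f"Chi2 = {chi2:.3f} with {df} DF, reject H0: {reject_h0}")
```

With these counts the statistic stays below the critical value, so H0 is not rejected: the observed distribution is compatible with the theoretical one.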

Univariate tests: Tests of normality
• In addition to the Chi2 for goodness of fit, other graphical and/or statistical approaches are used in order to test the normality of a distribution.
• The graphical methods and empirical techniques are: Histogram, Boxplot, Q-Q plot, and skewness and kurtosis coefficients
skewness : $G_1 = \frac{n}{(n-1)(n-2)} \sum_{i} \left( \frac{x_i - \bar{x}}{s} \right)^3$
kurtosis : $G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
• Statistical methods: Univariate test of Kolmogorov-Smirnov
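The two shape coefficients can be computed directly from their formulas. The Python sketch below assumes the usual sample-corrected definitions of G1 and G2 (with s the sample standard deviation, dividing by n - 1); the data are hypothetical. Values near 0 for both coefficients are compatible with a normal distribution.

```python
import math

def _standardized(x):
    """Return n and the standardized deviations (x_i - mean) / s."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    return n, [(xi - mean) / s for xi in x]

def skewness_g1(x):
    n, z = _standardized(x)
    return n / ((n - 1) * (n - 2)) * sum(zi ** 3 for zi in z)

def kurtosis_g2(x):
    n, z = _standardized(x)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(zi ** 4 for zi in z)
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - correction

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_tailed = [1, 1, 1, 2, 10]      # hypothetical data with a long right tail

print(skewness_g1(symmetric), kurtosis_g2(symmetric))
print(skewness_g1(right_tailed))
```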

Bivariate test: Contingency Chi2 or Chi2 for Independence
• The elements :
– Two qualitative variables (with C « columns » and L « lines » modalities) are measured for each subject. We obtain a contingency table with C*L cells; each subject is classified in one cell and only one.
• Hypotheses :
– Null hypothesis H0 : the variables are independent.
– Alternative hypothesis H1 : the variables are not independent.
• Statistic :
$\chi^2 = \sum_{i,j} \frac{(O_{ij} - T_{ij})^2}{T_{ij}}$, i, j : modalities of C and L
DF = (L-1)*(C-1)
• Decision :
– If Chi2 > Chi2 alpha => reject H0 : the 2 variables are not independent
– Equivalently, if the degree of significance p-value is less than alpha => reject H0.
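A minimal Python sketch of the Chi2 for independence follows. The 2 x 2 table is hypothetical. The slide does not state how the theoretical counts are obtained; the sketch uses the standard rule under H0, Tij = (line total x column total) / N, and the tabulated critical value for alpha = 0.05 and 1 DF (3.841).

```python
table = [[20, 30],
         [30, 20]]          # hypothetical counts: L = 2 lines, C = 2 columns

line_totals = [sum(line) for line in table]
column_totals = [sum(column) for column in zip(*table)]
n = sum(line_totals)

chi2 = 0.0
for i, line in enumerate(table):
    for j, o_ij in enumerate(line):
        # Theoretical count of cell (i, j) if the variables are independent.
        t_ij = line_totals[i] * column_totals[j] / n
        chi2 += (o_ij - t_ij) ** 2 / t_ij

df = (len(table) - 1) * (len(table[0]) - 1)

CHI2_ALPHA_5PCT_DF1 = 3.841          # tabulated value, alpha = 0.05, 1 DF
reject_h0 = chi2 > CHI2_ALPHA_5PCT_DF1

print(f"Chi2 = {chi2:.3f} with {df} DF, reject H0: {reject_h0}")
```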

Bivariate test: Correlation and simple linear regression
• Regression and correlation :
– x and y are two continuous random variables : x and y have a degree of association => correlation
– y is explained by x => regression
– A statistical test can be performed with both methods in order to estimate a p-value and verify whether the association between the 2 variables is significant or not.
• Correlation:
– Coefficient of correlation of Pearson : $r_{xy} = \frac{cov_{xy}}{\sqrt{var_x \cdot var_y}}$
•covxy, the covariance between x and y : $cov_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$
• Simple linear regression
– y = ax + b + ε => Estimation with the least squares method of the parameters of the regression line between the 2 variables x and y.
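The correlation coefficient and the least squares line can be computed together, since both rely on the covariance. The Python sketch below uses hypothetical (x, y) pairs. The closed-form least squares estimates a = covxy / varx and b = ȳ - a x̄ are standard results that the slide does not spell out.

```python
import math

x = [1, 2, 3, 4, 5]                  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]                  # hypothetical responses
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n

# Pearson correlation coefficient.
r_xy = cov_xy / math.sqrt(var_x * var_y)

# Least squares estimates of the line y = a x + b.
a = cov_xy / var_x
b = y_bar - a * x_bar

print(f"r = {r_xy:.4f}, regression line: y = {a:.2f} x + {b:.2f}")
```

The same divisor n is used for the covariance and both variances, so it cancels in r and in a; using n - 1 throughout would give the same two values.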

Bivariate test: test for the equal
distribution between 2 samples
• Parametric tests require a normal distribution and an equal distribution between the 2 samples.
• In order to test this equality of distributions, we use the non-parametric test of Kolmogorov-Smirnov, also used as the univariate test of normality.
• If the degree of significance p-value is less than alpha, we reject H0 and we conclude that the distributions are not equal; thus, we use non-parametric tests (Wilcoxon, Mann-Whitney U, Kruskal-Wallis). If the distributions are equal, we use parametric tests (Student's t-test for dependent or independent samples, one way ANOVA).

Bivariate test: Test of Student (means comparison)
• Hypotheses :
– Null hypothesis :
•the two observed means xa and xb are estimators of the two means µa and µb, with µa = µb
– Alternative hypotheses :
•Two-tailed test : µa ≠ µb
•One-tailed test : µa > µb or (exclusive) µa < µb
• Statistic :
$t = \frac{|\bar{x}_a - \bar{x}_b|}{\sqrt{\frac{s^2_{common}}{N_a} + \frac{s^2_{common}}{N_b}}}$ has a Student distribution with Na + Nb - 2 DF
$s^2_{common} = \frac{SSD_a + SSD_b}{N_a + N_b - 2} = \frac{s_a^2 (N_a - 1) + s_b^2 (N_b - 1)}{N_a + N_b - 2}$
• Decision :
– If t > t alpha => reject H0 : the 2 means are not equal
– Equivalently, if the degree of significance p-value is less than alpha => reject H0.
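The Student statistic with a common (pooled) variance can be written out in a few lines of Python. The two samples are hypothetical, SSD is taken to mean the sum of squared deviations from the group mean, and the decision uses the tabulated two-tailed critical value for alpha = 0.05 and 8 DF (2.306).

```python
import math

a = [5, 6, 7, 8, 9]                  # hypothetical sample a
b = [1, 2, 3, 4, 5]                  # hypothetical sample b

def mean(values):
    return sum(values) / len(values)

def ssd(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

na, nb = len(a), len(b)
df = na + nb - 2

s2_common = (ssd(a) + ssd(b)) / df
t = abs(mean(a) - mean(b)) / math.sqrt(s2_common / na + s2_common / nb)

T_ALPHA_5PCT_DF8 = 2.306             # tabulated two-tailed value, 8 DF
reject_h0 = t > T_ALPHA_5PCT_DF8

print(f"t = {t:.3f} with {df} DF, reject H0: {reject_h0}")
```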

Bivariate test: One way ANalysis Of VAriance (One way ANOVA)
• The elements :
– One qualitative variable or factor with multiple modalities or levels and 1 quantitative variable. N is the total number of observations and k the number of levels.
• Hypotheses :
– Null hypothesis :
•The observed means in the different groups : xa, xb, xc,... are estimators of the means ma, mb, mc,...
H0 : ma = mb = mc…
– Alternative hypothesis :
•At least one of the means ma, mb, mc,... is different from the others.
• Statistic :
Let Vartotal = Varbetweenclasses + Varwithinclasses
$F = \frac{Var_{between} / (k - 1)}{Var_{within} / (N - k)}$
• Decision :
– If F > F alpha => reject H0 : the group means are not all equal
– Equivalently, if the degree of significance p-value is less than alpha => reject H0.
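The decomposition of the total variation and the F statistic can be checked with a short Python sketch. The three groups are hypothetical, and "Var" is interpreted here as a sum of squared deviations, which is the quantity for which total = between + within holds exactly. The tabulated critical value for alpha = 0.05 with (2, 6) DF is 5.14.

```python
groups = {
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": [6, 7, 8],
}                                     # hypothetical measurements per level

all_values = [v for g in groups.values() for v in g]
N = len(all_values)                   # total number of observations
k = len(groups)                       # number of levels
grand_mean = sum(all_values) / N

def mean(values):
    return sum(values) / len(values)

# Sums of squared deviations: between groups, within groups, and total.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

F = (ss_between / (k - 1)) / (ss_within / (N - k))

F_ALPHA_5PCT_DF_2_6 = 5.14            # tabulated value, alpha = 0.05
reject_h0 = F > F_ALPHA_5PCT_DF_2_6

print(f"total = {ss_total}, between = {ss_between}, within = {ss_within}")
print(f"F = {F:.2f}, reject H0: {reject_h0}")
```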




Multifactorial test: Multiway ANalysis Of VAriance (Multiway ANOVA - GLM ANOVA)
• The elements :
– Multiple qualitative variables or factors with multiple levels and 1 quantitative variable.
• Hypotheses :
– To the previous hypotheses H0 and H1 of the one way ANOVA we add the interaction between factors.
• Statistic : We estimate the parameters of the general linear model (GLM) $X_{ijk} = \mu + \alpha_i + \beta_{ij} + \varepsilon_{ijk}$ …, with the maximum likelihood method
i = 1...a (number of classes of factor A)
j = 1...b (number of classes of factor B)
k = 1...n (number of repetitions in each sub-group)
µ : parametric mean of the population;
αi : effect of the controlled factor A on the observation: fixed deviation of the group from the mean µ;
βij : random contribution of the jth sub-group of the ith group;
εijk : random fluctuation of the xijk value: independent, normally distributed random variable with a mean of 0 and a variance s2.

Multifactorial test: Multiple linear
regression
• The elements :
– We would like to study the effect of multiple independent variables (quantitative or qualitative) on a continuous variable, called the response variable, and to estimate the linear model which best predicts the response variable.
• Hypotheses :
– Null hypothesis :
•The independent variables have no effect on the response variable.
– Alternative hypotheses :
•The independent variables influence the response variable.
• Statistic :
$y = a + b_1 X_1 + b_2 X_2 + \dots + b_j X_j + \varepsilon_{1,\dots,j}$. A multiple correlation R2 is calculated.
We estimate the parameters of the general linear model (GLM) with the method which maximizes the likelihood. We could include in the model covariates with a known effect on the other independent random variables.

Multifactorial test: Multiple logistic regression
• The elements :
– In many applications, the response variable Y has just 2 possible values and can be represented by a binary indicator variable taking the values 0 and 1.
• Hypotheses :
– Null hypothesis :
•the independent variables have no effect on the response variable
– Alternative hypotheses :
•the independent variables influence the response variable.
• Statistic :
•The logistic formula is: $E(Y) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$
GLM: Estimation of parameters
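To show how the logistic formula maps any value of X to a probability between 0 and 1, here is a small Python sketch. The coefficients β0 = -4 and β1 = 0.1 and the ages are hypothetical; in practice the coefficients would be estimated from data, which this sketch does not attempt.

```python
import math

def logistic_mean(x, beta0, beta1):
    """E(Y) for a binary response, following the logistic formula."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

beta0, beta1 = -4.0, 0.1              # hypothetical coefficients
ages = (20, 40, 60)                   # hypothetical values of X

probabilities = [logistic_mean(age, beta0, beta1) for age in ages]

for age, prob in zip(ages, probabilities):
    print(f"X = {age}: E(Y) = {prob:.3f}")
```

With a positive β1 the expected value of Y rises with X, and it equals 0.5 exactly where β0 + β1 X = 0.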





Multifactorial test: Survival Curves (1)
• Probability of occurrence of an event (B) in subjects sharing a common initial event (A), taking into consideration the time between the two events. Example : the occurrence of death (B) in subjects with a severe disease (A)
• Survival function (Si): probability to «survive» up to an instant t.
• Method for the survival analysis:
Method of Kaplan-Meier
• Method for the comparison between 2 survival curves:
The Log Rank test

Multifactorial test: Survival Curves (2)
• Calculation of the probability of survival at each time at which at least one « death » occurs.
• Let:
Vi : number of survivors at the beginning of the interval ti - ti-1
Di : number of deaths during the interval ti - ti-1
Ei : number of exclusions at the beginning of the interval ti - ti-1
qi : probability of death during the interval ti - ti-1 : qi = Di/(Vi - Ei)
pi : probability of survival during the interval ti - ti-1 : pi = 1 - qi
Si : survival function at the instant ti : Si = p0p1…pi = piSi-1
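The recurrence Si = pi Si-1 can be followed step by step in Python. The interval data below (Vi, Di, Ei) are hypothetical, and the sketch applies the slide's formulas exactly as written, starting from a survival of 1 before the first interval.

```python
# One tuple per interval: (V_i survivors at start, D_i deaths, E_i exclusions).
intervals = [
    (10, 1, 0),
    (9, 0, 1),
    (8, 2, 0),
]                                     # hypothetical follow-up data

survival = []
s_previous = 1.0                      # everyone is alive at the origin
for v_i, d_i, e_i in intervals:
    q_i = d_i / (v_i - e_i)           # probability of death in the interval
    p_i = 1 - q_i                     # probability of survival in the interval
    s_i = p_i * s_previous            # S_i = p_i * S_(i-1)
    survival.append(s_i)
    s_previous = s_i

for i, s_i in enumerate(survival):
    print(f"S_{i} = {s_i:.3f}")
```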

Multifactorial test: Survival Curves (3)
Comparison of 2 survival curves with the Log rank method.
• Hypotheses:
– H0 = the 2 survival curves have identical profiles, the risk of « death » at a given time is thus the same in both groups
– H1 = the 2 survival curves have different profiles
• We calculate the theoretical probability of « death » at the instant i:
Pi = (D1i + D2i) / (V1i + V2i)
• We deduce that the calculated frequency of « death » in each group at the instant i is:
c1i = Pi * V1i c2i = Pi * V2i
We note c1 = Σc1i, c2 = Σc2i, o1 = ΣD1i and o2 = ΣD2i

Multifactorial test: Survival Curves (4)
• Statistic: Formula with results and interpretation similar to a Chi2 (χ²) with 1 df.
$\chi^2 = \frac{(o_1 - c_1)^2}{c_1} + \frac{(o_2 - c_2)^2}{c_2}$
• Decision: We use the table of χ² to look for the value corresponding to the risk α, with 1 df:
χ² > χ²α , reject H0
• Remark: when we want to include one or many factors and to study their effect on the survival analysis (multivariate analysis), we use the Cox model…
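The Log rank calculation can be traced in Python using only the quantities defined on these slides. The two death times and the group counts are hypothetical. The decision uses the tabulated χ² critical value for alpha = 0.05 and 1 df (3.841).

```python
# One tuple per instant i: (V1i, D1i, V2i, D2i), i.e. survivors at risk
# and observed deaths in group 1 and in group 2. Hypothetical data.
times = [
    (10, 2, 10, 0),
    (8, 2, 10, 0),
]

c1 = c2 = 0.0                         # calculated (expected) deaths
o1 = o2 = 0                           # observed deaths
for v1, d1, v2, d2 in times:
    p_i = (d1 + d2) / (v1 + v2)       # theoretical probability of death
    c1 += p_i * v1
    c2 += p_i * v2
    o1 += d1
    o2 += d2

chi2 = (o1 - c1) ** 2 / c1 + (o2 - c2) ** 2 / c2

CHI2_ALPHA_5PCT_DF1 = 3.841           # tabulated value, alpha = 0.05, 1 df
reject_h0 = chi2 > CHI2_ALPHA_5PCT_DF1

print(f"o1 = {o1}, c1 = {c1:.3f}, o2 = {o2}, c2 = {c2:.3f}")
print(f"Chi2 = {chi2:.3f}, reject H0: {reject_h0}")
```

A useful check is that the calculated deaths c1 + c2 add up to the observed deaths o1 + o2.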

Multivariate test: Multivariate ANalysis Of VAriance (MANOVA) or covariance (MANCOVA)
• The elements :
2 or more qualitative variables or factors with multiple modalities or levels and 2 or more quantitative variables.
Eg: To study the differences in age and satisfaction according to the questionnaire and the number of persons.
Ques. AGE NBPERS SATISF
1 33 3 18
2 29 2 9
3 45 1 14

Multivariate test: Principal Component Analysis (PCA)
• Methods for orthogonal projection
– The objective of the PCA is to study globally the relation between multiple quantitative variables. It can also be used for qualitative ordinal (numeric) variables.
– The objectives of the PCA are :
– the information reduction : the variables are regrouped in a small number of new variables called principal components;
– the typology of cases : the positioning of the cases relative to the principal components allows the detection of groups of cases by increasing the variance between different groups.
[Figure: cases plotted as points on the plane of the first two principal components F1 and F2]

Fields of Application
• Genomics, Transcriptomics, Proteomics, Metabolomics
• Clinical studies & Epidemiology
• Pharmacology, Pharmacogenomics & Biology
• Agronomy & Ecology
Summary of the different steps of a statistical analysis:
• Encode each observation numerically if it is not numeric
• Remove the noise from the signal & replace the missing values
• Normalize the data and prepare the data table for analysis
• Perform the differential analysis (t-test, ANOVA,...)
• Find similar sub-groups (clustering, K-means,…)
