AiiHC00 Introduction to
Biostatistics
Ramzi EL FEGHALI, Ph.D.
Senior Lecturer
ref@aiihc.com
Biostatistics course EL FEGHALI R.
Artificial intelligence in HealthCare
AIIHC International Ltd. U.K.
AiiHC Program
• 1st Section : Applied Statistics to Healthcare (AiiHC00 - AiiHC10)
• 2nd Section : Artificial Intelligence (AiiHC11 - AiiHC15)
• 3rd Section : Coding languages (AiiHC16 - AiiHC20)
• 4th Section : Business in Artificial Intelligence (AiiHC21 - AiiHC24)
Statistics History
AiiHC00 Syllabus
Steps in statistical studies
• Data collection
– Simple observation :
•No specific intervention; data are collected over the course of the study.
•Sampling plan
– Experimentation :
•Controlled phenomena are induced. Example: administration of a drug to a specific group of subjects.
• Statistical analysis
– Descriptive ("deductive") analysis
•Its aim is to summarize and present the observed data in tables, graphs, …
– Inferential ("inductive") analysis
•It extends and generalizes, under some conditions, the conclusions obtained.
Terminology
• Variable : measurable entity (X).
• Population : set of all values corresponding to a variable.
• Individual or case : an entity on which one or many observations are performed.
• Observation : particular measure of a variable (xi) for one case.
• Statistic or statistical descriptor : number which describes or summarizes a set of observations.
• Sample : subset of the statistical population composed of many observations. The size of the sample is generally called n. The process which leads to the creation of a sample is called sampling.
Sampling (1)
• Basic statistics consider samples to be random and independent.
• Non-random samples must be avoided.
• The probability of being sampled, or of belonging to a group, should be equal for all studied cases.
• From the sample mean X and standard deviation S (known), we want to estimate the population mean m and standard deviation σ (unknown).
Sampling (2)
• A statistical sample is composed of a limited number of cases sampled from the studied population.
• The sampling gives a representation of the population.
• A sample of size n of a random variable X is obtained by repeating n times the sampling that gives X.
• Notation : (X1, X2, … , Xn)
• A particular realization of the sample : (x1, x2, … , xn)
• Estimator : a characteristic computed from the observations which allows the unknown parameters of a statistical distribution to be estimated.
• Unbiased estimator : its mean (expected value) equals the true value of the parameter.
• Convergent estimator : it gets closer to the true value of the parameter as n increases.
Types of variables
• Qualitative variables :
– Nominal (non-ordinal)
Examples : color, form, type of treatment, …
– Ordinal
Examples : vegetation coverage, scale of the wind, presence/absence, …
• Quantitative variables :
– Continuous
Examples : temperature, height, weight, …
– Discrete
Examples : counts of individuals, number of occurrences of an event, etc.
Frequency distribution
• Statistical series :
– simple enumeration or counting of the observations
– can be ordered (quantitative variable)
– the total number of observations N is called the size of the sample.
• Ungrouped distributions
– When we have multiple observations, the same value can be observed many times.
– We use xi to denote the different values; the number of occurrences of xi is ni, called the absolute frequency; p represents the number of different observed values.
– ni/N is called the relative frequency.
– For a quantitative variable, we order the xi, and the absolute or relative frequencies can be summed to obtain the cumulative frequencies Ni or Fi: $N_i = \sum_{j \le i} n_j$ and $F_i = N_i / N$.
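The frequencies defined above can be sketched in a few lines of Python. The data are hypothetical (number of visits recorded for 10 patients) and the variable names are illustrative, not part of the course material.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: number of visits recorded for 10 patients.
observations = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]

N = len(observations)                        # size of the sample
counts = Counter(observations)
values = sorted(counts)                      # the p distinct values x_i, ordered
absolute = [counts[x] for x in values]       # absolute frequencies n_i
relative = [n_i / N for n_i in absolute]     # relative frequencies n_i / N
cumulative_abs = list(accumulate(absolute))  # cumulative frequencies N_i
cumulative_rel = [N_i / N for N_i in cumulative_abs]  # cumulative frequencies F_i

for row in zip(values, absolute, relative, cumulative_abs, cumulative_rel):
    print(row)
```

The last cumulative absolute frequency always equals N, and the last cumulative relative frequency equals 1.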
Descriptive statistics (1)
• Descriptive statistics are used to describe a dataset or sample.
• A data table alone is not enough to analyze the data
=> other methods are needed to understand and interpret the data.
• Two groups of methods : numerical methods (statistical descriptors) and graphical methods.
• They are complementary.
• A distinction is made between univariate datasets (only one variable), bivariate (two variables) and multivariate (more than two variables in the dataset).
Descriptive statistics (2)
• Mean or average : sum of the observations divided by their number. For the variable Y, its mean (called Y bar) is : $\bar{y} = \sum_{i=1}^{n} y_i / n$
• Property : the sum of deviations from the mean is null : $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$
• Median : value in the middle of the observations.
– Sort the observations in increasing order.
– If n is odd, take the middle value (eg : n = 7 => y4)
– If n is even, take the mean of the two values around the middle (eg : n = 8 => (y4 + y5) / 2)
• A statistic is called robust, or resistant, if its value is little influenced by a change in a small proportion of the data.
– The median is more robust than the mean.
– The mean uses more of the information in the data. It is more powerful once the data are ready to be analyzed (without outliers).
– The two measures are complementary.
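As an illustration of the mean, the median and robustness, here is a minimal sketch on hypothetical measurements. Replacing one value by an outlier moves the mean considerably while the median stays where it was.

```python
import statistics

y = [4.1, 4.8, 5.0, 5.2, 5.4, 5.9, 6.0]   # hypothetical measurements, n = 7

mean_y = statistics.mean(y)
median_y = statistics.median(y)            # n is odd: the 4th sorted value

# Property: the deviations from the mean sum to zero (up to rounding error).
sum_dev = sum(v - mean_y for v in y)

# Replace the largest value by an outlier and compare both statistics.
y_outlier = y[:-1] + [60.0]
mean_shift = abs(statistics.mean(y_outlier) - mean_y)
median_shift = abs(statistics.median(y_outlier) - median_y)

print(mean_y, median_y, sum_dev, mean_shift, median_shift)
```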
Descriptive statistics (3)
• Range of the values : ymax - ymin (less robust).
• Quartiles : the median divides the sample into 2 sets; the quartiles divide each set into 2 equal subsets.
• The interquartile range is : Q3 – Q1.
• The 5 numbers min, Q1, median, Q3, max summarize the sample.
• The quantile (or percentile) of order p (0 ≤ p ≤ 1) is the value below which a proportion p of the observations lie, eg : p = 0.1 => 10% of the values are smaller and 90% are larger.
[Figure: axis marking min, Q1, Q2 = median, Q3 and max, with the interquartile range and the range of the values.]
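The «5 numbers» can be computed with the Python standard library, as in the sketch below on a hypothetical sample. Several conventions exist for quartiles; the "inclusive" method used here is one choice among them and is an assumption of this example, not something the slides prescribe.

```python
import statistics

y = sorted([7, 1, 9, 3, 5, 2, 8, 4, 6])   # hypothetical sample, n = 9

# "inclusive" treats the sample min and max as the 0% and 100% quantiles;
# other conventions can give slightly different quartiles.
q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")

five_numbers = (min(y), q1, q2, q3, max(y))
iqr = q3 - q1                  # interquartile range Q3 - Q1
value_range = max(y) - min(y)  # range of the values

print(five_numbers, iqr, value_range)
```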
Descriptive statistics (4)
• Variance of a population : $\sigma^2 = \sum_{i=1}^{n} (y_i - \mu)^2 / n$, where µ is the population mean
• Variance of a sample : $s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$
• Standard deviation : $\sigma = \sqrt{\sigma^2}$ ; $s = \sqrt{s^2}$
• The standard deviation measures the typical deviation of the observations from the mean, in the same unit as the variable.
• Coefficient of variation : $cv = s / \bar{y} \cdot 100\%$
• Variance, standard deviation and coefficient of variation are associated with the mean.
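A short sketch of these dispersion measures on hypothetical data. The population formula divides by n and the sample formula by n - 1. For the population variance, the data are treated as the whole population, so that their mean plays the role of µ.

```python
import math
import statistics

y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical data
n = len(y)
y_bar = sum(y) / n

ss = sum((v - y_bar) ** 2 for v in y)   # sum of squared deviations
var_pop = ss / n                        # population variance (divides by n)
var_sample = ss / (n - 1)               # sample variance (divides by n - 1)
sd_sample = math.sqrt(var_sample)       # sample standard deviation
cv = sd_sample / y_bar * 100            # coefficient of variation, in percent

print(var_pop, var_sample, sd_sample, cv)
```

The hand-written formulas agree with `statistics.pvariance`, `statistics.variance` and `statistics.stdev` from the standard library.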
Descriptive statistics (5)
• The boxplot represents graphically the famous «5 numbers».
• Outliers are observations lying more than 1.5 times the interquartile range (IQR) beyond the closest quartile (box borders).
[Figure: boxplot on a scale from 10 to 50, labelled min, Q1, median, Q3, IQR, 1.5 . IQR, maximum and max (extreme value).]
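The 1.5 × IQR rule can be sketched as follows. The function name `outliers` and the data are hypothetical, and the quartile convention ("inclusive") is an assumption of this example.

```python
import statistics

def outliers(data):
    """Return the values lying more than 1.5 * IQR outside the box."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in data if v < low or v > high]

y = [12, 15, 17, 18, 20, 21, 22, 24, 50]   # hypothetical sample
print(outliers(y))
```

Here Q1 = 17 and Q3 = 22, so the limits are 9.5 and 29.5 and only the value 50 is flagged.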
Descriptive statistics (6)
• Grouping into classes (choice of the class borders and of whether the left or right border value is included).
• Graphical representation : the histogram = frequency of the observations in each class.
• Unimodal, bimodal or multimodal distribution.
[Figure: three histograms of Frequency against y, including panels titled "Unimodal distr." and "Bimodal distr."]
Descriptive statistics (7)
• The dotplot
– 2 observed variables
[Figure: plot of iris$Petal.Length (vertical axis, 1 to 7) against iris$Sepal.Length (horizontal axis, 4.5 to 8.0).]
Descriptive statistics (8)
• The Q-Q plot
– 1 observed variable and 1 theoretical
– 2 observed variables
Probability distributions
• Discrete distributions:
•Uniform, Bernoulli, binomial, Poisson… distributions
–Discrete random variables take separate, generally integer, values over a predefined interval. They are generally the result of counting.
• Continuous distributions:
•Uniform, normal, standard normal… distributions
–Continuous random variables can take any value over a predefined interval.
Normal distribution
• The normal distribution is a well-known bell-shaped mathematical distribution.
• It was defined by Laplace and Gauss.
• The equation of the frequency curve of a normal distribution depends on two parameters : the mean m and the standard deviation σ of the variable :
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - m)^2}{2\sigma^2}}$, with $x \sim N(m ; \sigma)$ and $\pi \approx 3.14159$
[Figure: bell-shaped curve of the normal distribution plotted against x.]
The standard normal distribution
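[Figure: standard normal distribution, f(z) (0 to 0.5) against z (-3 to 3), with -1.96 and +1.96 marked.]
• $f(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}}$, with $z = \frac{x - m}{\sigma} \sim N(0 ; 1)$
• $P(-1.96 \le z \le 1.96) = 0.95$
• For a standard normal distribution and α = 5%, 95% of the z values belong to the interval [-1.96 ; +1.96].
• The confidence interval : $CI_\alpha : m = \bar{x} \pm z_\alpha \, s / \sqrt{n}$
• z follows a standard normal distribution. When σ is estimated by s on a small sample, z_α is replaced by the corresponding value of the Student distribution with n-1 DF (degrees of freedom).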
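The 1.96 value and the confidence interval of the standard normal distribution slide can be checked with the standard library. This is a sketch under the normal approximation (it uses z, not the Student distribution), and the function name `confidence_interval` and the sample are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, alpha=0.05):
    """Normal-approximation CI for the mean: x_bar +/- z_alpha * s / sqrt(n)."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    half_width = z * stdev(sample) / math.sqrt(n)
    x_bar = mean(sample)
    return x_bar - half_width, x_bar + half_width

z_95 = NormalDist().inv_cdf(0.975)                            # two-sided 5% quantile
coverage = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)   # P(-1.96 <= z <= 1.96)

sample = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]                       # hypothetical measurements
low, high = confidence_interval(sample)
print(z_95, coverage, low, high)
```

For a sample this small, the Student quantile with n - 1 DF would give a wider interval than the one computed here.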
Statistical Tests
• Statistical hypotheses
• One-tailed and two-tailed tests
• Type I and II errors (alpha and beta risks)
• Significance ‘p-value’
• Univariate, bivariate, multifactorial, and multivariate tests
Statistical hypotheses
• They are assumptions about the frequency distribution of the population.
• These assumptions could be true or false.
• In the majority of tests we define a hypothesis with the purpose of rejecting it.
• Example :
the observed percentage in a population is 10%. We want to verify whether the observed percentage in a particular group differs from the observed percentage in the population. We suppose that there is no difference, so the hypothesis will be :
“All observed differences are the result of sampling fluctuations, or chance.”
• This hypothesis is called the null hypothesis and noted H0.
• All other hypotheses are called alternative hypotheses and noted H1.
Test of hypotheses and significance
• Tests of hypotheses, or tests of significance, are statistical procedures used to decide whether the defined hypotheses should be rejected or not, in order to learn more about the unknown reality.
• They belong to inferential statistics.
• Different tests exist according to:
– the type of studied variables (quantitative/qualitative)
– the type of problem (comparison between 2 means or more…)
– the conditions of application (modeling in terms of probability distribution)
• However, the logical steps of a test are always the same.
One-tailed and two-tailed tests
• The null hypothesis H0 retained is often equality. The alternative hypotheses H1 are then all other possible situations, which fall into two categories : greater than; less than.
• When we consider all the alternative hypotheses we use a two-tailed test.
• When we consider only one part of the alternative hypotheses (greater than, or less than) we use a one-tailed test.
• eg : we compare the height of 3 and 4 year old children. The test is a one-tailed test because height increases with age.
Type I and II errors (alpha and beta risks)
• The type I error : the decision to reject the null hypothesis when it is true.
– Example: in a trial comparing a new drug to a previous one, we conclude that there is a difference between them although in reality there is none. We are making a type I error.
• The type II error : the opposite of the first error. We do not reject the null hypothesis when it is false.
– Example: in a trial comparing a new drug to a previous one, we conclude that there is no difference between them although in reality there is one. We are making a type II error.
Significance ‘p-value’
• When we test a hypothesis, the probability of making a type I error is called the threshold of significance of the test and usually noted alpha. This risk is fixed before the experiment, when the problem is formulated.
• The probability of making a type II error is usually noted beta. The probability of rejecting H0 when it is false is called the power of the test: Power = 1 - beta.
• There is no direct formula linking alpha and beta, although for a fixed sample size lowering one raises the other. Both should be as close to 0 as possible. In general, we choose alpha = 0.05 and we try to minimize beta (in general 0.1).
• The degree of significance (p-value) is the probability, under the null hypothesis, of observing by chance alone a difference at least as large as the one observed.
Univariate test: Chi2 for goodness of fit (1)
• The elements :
– Generalization of the comparison between an observed percentage and a theoretical one.
– 1 qualitative variable defining classes (or a quantitative variable grouped into classes). We have the observed number of subjects in each class.
– 1 theoretical distribution over the same classes, either empirical or following a theoretical probability distribution.
• The question :
– Does the observed distribution conform to the theoretical distribution?
– Could the difference between the observed values and the theoretical ones be due to chance ?
Univariate test: Chi2 for goodness of fit (2)
• Hypotheses :
– Null hypothesis H0 :
•The differences between the observed values and the theoretical ones are due to chance. The observed distribution follows the theoretical probability distribution.
– Alternative hypothesis H1 :
•The observed distribution doesn’t follow the theoretical probability distribution.
• Elements necessary for the calculation
– Contingency table
Classes                 A    B    C    D
Observed frequency      O1   O2   O3   O4
Theoretical frequency   T1   T2   T3   T4
Univariate test: Chi2 for goodness of fit (3)
• Statistic :
– Chi2 “goodness of fit” : $Chi2 = \sum_{1}^{p} \frac{(O - T)^2}{T}$, where p = number of classes after grouping
– Degree of freedom : DF = p - 1 - number of parameters estimated for the theoretical distribution (if necessary).
• Steps :
– Calculation of each theoretical frequency
– Condition of application : all theoretical frequencies must be greater than 5; otherwise we group the classes
– Calculation of the Chi2
• Decision: If Chi2 > Chi2 alpha => reject H0 : the distribution does not conform to the theoretical distribution. Equivalently, if the degree of significance p-value is less than alpha => reject H0.
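The goodness-of-fit calculation can be sketched as below. The function name, the data and the small table of critical values at alpha = 0.05 (taken from a standard Chi2 table) are assumptions of this example; a statistics package would normally supply the p-value directly.

```python
# Critical values of the Chi2 distribution at alpha = 0.05 (standard table).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def chi2_goodness_of_fit(observed, theoretical, n_estimated_params=0):
    """Return (Chi2, DF) for observed and theoretical class frequencies."""
    if any(t <= 5 for t in theoretical):
        raise ValueError("group classes until every theoretical frequency exceeds 5")
    chi2 = sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))
    df = len(observed) - 1 - n_estimated_params
    return chi2, df

# Hypothetical example: 100 subjects, 4 classes expected in equal proportions.
observed = [30, 20, 28, 22]
theoretical = [25, 25, 25, 25]
chi2, df = chi2_goodness_of_fit(observed, theoretical)
reject_h0 = chi2 > CHI2_CRIT_05[df]
print(chi2, df, reject_h0)
```

With these numbers Chi2 is about 2.72 with 3 DF, below the critical value 7.815, so H0 is not rejected.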
Univariate tests: Tests of normality
• In addition to the Chi2 for goodness of fit, other graphical and/or statistical approaches are used in order to test the normality of a distribution.
• The graphical methods and empirical techniques are: Histogram, Boxplot, Q-Q plot, and skewness and kurtosis coefficients
skewness : $G_1 = \frac{n}{(n-1)(n-2)} \sum_i \left( \frac{x_i - \bar{x}}{s} \right)^3$
kurtosis : $G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
• Statistical methods: Univariate test of Kolmogorov-Smirnov
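A minimal sketch of the two coefficients, written directly from the sample formulas (s is the sample standard deviation, computed with n - 1). The function name and the data are hypothetical. Both coefficients are close to 0 for normally distributed data.

```python
import statistics

def skewness_kurtosis(x):
    """Return the sample skewness G1 and kurtosis G2 (needs n >= 4)."""
    n = len(x)
    x_bar = statistics.mean(x)
    s = statistics.stdev(x)                     # sample standard deviation
    z3 = sum(((v - x_bar) / s) ** 3 for v in x)
    z4 = sum(((v - x_bar) / s) ** 4 for v in x)
    g1 = n / ((n - 1) * (n - 2)) * z3
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return g1, g2

g1_sym, g2_sym = skewness_kurtosis([1, 2, 3, 4, 5])     # symmetric, flat sample
g1_right, _ = skewness_kurtosis([1, 1, 2, 2, 3, 10])    # long right tail
print(g1_sym, g2_sym, g1_right)
```

A symmetric sample gives a skewness of 0, and a sample with a long right tail gives a positive skewness.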
Bivariate test: Contingency Chi2 or Chi2 for Independence
• The elements :
– Two qualitative variables (with C « columns » and L « lines » modalities) are measured for each subject. We obtain a contingency table with C*L cells; each subject is classified in one cell and only one.
• Hypotheses :
– Null hypothesis H0 : variables are independent.
– Alternative hypothesis H1 : variables are not independent.
• Statistic : $Chi2 = \sum_{i,j} \frac{(O_{ij} - T_{ij})^2}{T_{ij}}$, where i, j are the modalities of C and L, and DF = (L-1)*(C-1)
• Decision :
– If Chi2 > Chi2 alpha => reject H0 : the 2 variables are not independent
– Equivalently, if the degree of significance p-value is less than alpha => reject H0.
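A sketch of the Chi2 for independence on a hypothetical 2 x 2 table. The slide does not spell out how the theoretical counts are obtained; the example uses the usual formula under H0, Tij = (total of line i) * (total of column j) / N, which is stated here as an assumption.

```python
def chi2_independence(table):
    """table[i][j] = observed count for line modality i and column modality j."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o_ij in enumerate(row):
            t_ij = row_totals[i] * col_totals[j] / n   # theoretical count under H0
            chi2 += (o_ij - t_ij) ** 2 / t_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

table = [[30, 10], [20, 40]]   # hypothetical: treatment (lines) by outcome (columns)
chi2, df = chi2_independence(table)
print(chi2, df)
```

Here Chi2 is about 16.67 with 1 DF, above the tabulated 3.841 for alpha = 0.05, so H0 (independence) is rejected.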
Bivariate test: Correlation and simple linear regression
• Regression and correlation :
– x and y are two continuous random variables: x and y have a degree of association => correlation
– y is explained by x => regression
– With both methods a statistical test can be performed to estimate a p-value and verify whether the association between the 2 variables is significant.
• Correlation:
– Pearson coefficient of correlation : $r_{xy} = cov_{xy} / \sqrt{var_x \cdot var_y}$
•cov_xy, the covariance between x and y : $cov_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / n$
• Simple linear regression
– y = ax + b + ε => the parameters of the regression line between the 2 variables x and y are estimated with the least squares method.
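The Pearson coefficient and the least-squares line can be computed from the same quantities. In this sketch the slope is taken as a = cov_xy / var_x and the intercept as b = mean(y) - a * mean(x), the standard least-squares solution. The function name and data are hypothetical.

```python
import math

def pearson_and_line(x, y):
    """Return (r, a, b): Pearson r and the least-squares line y = a*x + b."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    var_y = sum((yi - y_bar) ** 2 for yi in y) / n
    r = cov_xy / math.sqrt(var_x * var_y)
    a = cov_xy / var_x           # least-squares slope
    b = y_bar - a * x_bar        # least-squares intercept
    return r, a, b

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical, roughly y = 2x
r, a, b = pearson_and_line(x, y)
print(r, a, b)
```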
Bivariate test: test for the equal distribution between 2 samples
• Parametric tests require the variable to follow a normal distribution and to have the same distribution in the 2 samples.
• In order to test this equal distribution we use the non parametric test of Kolmogorov-Smirnov, already used for the univariate test of normality.
• If the degree of significance p-value is less than alpha we reject H0 and conclude that the distributions are not equal; thus, we use non parametric tests (Wilcoxon, Mann-Whitney U, Kruskal-Wallis). When the distributions are equal we use parametric tests (Student for dependent or independent samples, one way ANOVA).
Bivariate test: Test of Student (means comparison)
• Hypotheses :
– Null hypothesis :
•the observed means xa and xb are estimators of the means µa and µb, with µa = µb
– Alternative hypotheses :
•Two-tailed test : µa ≠ µb
•One-tailed test : µa > µb or (exclusive) µa < µb
• Statistic : $t = \frac{|\bar{x}_a - \bar{x}_b|}{\sqrt{\frac{\sigma^2_{common}}{N_a} + \frac{\sigma^2_{common}}{N_b}}}$ has a Student distribution with Na + Nb - 2 DF, where
$\sigma^2_{common} = \frac{SSD_a + SSD_b}{N_a + N_b - 2} = \frac{\sigma^2_a (N_a - 1) + \sigma^2_b (N_b - 1)}{N_a + N_b - 2}$, SSD being the sum of squared deviations from the group mean
• Decision :
– If t > t alpha => reject H0 : the 2 means are not equal
– Equivalently, if the degree of significance p-value is less than alpha => reject H0.
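The t statistic with a common (pooled) variance can be sketched as follows for two independent samples. The function name and the measurements are hypothetical.

```python
import math

def student_t(sample_a, sample_b):
    """Two-sample t statistic with a common (pooled) variance, and its DF."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / na, sum(sample_b) / nb
    ssd_a = sum((v - mean_a) ** 2 for v in sample_a)   # sum of squared deviations
    ssd_b = sum((v - mean_b) ** 2 for v in sample_b)
    df = na + nb - 2
    var_common = (ssd_a + ssd_b) / df
    t = abs(mean_a - mean_b) / math.sqrt(var_common / na + var_common / nb)
    return t, df

group_a = [5.1, 4.9, 5.6, 5.8, 5.0]   # hypothetical measurements
group_b = [6.2, 6.0, 6.8, 5.9, 6.6]
t, df = student_t(group_a, group_b)
print(t, df)
```

Here t is about 4.12 with 8 DF. The tabulated two-tailed value for alpha = 0.05 and 8 DF is 2.306, so H0 is rejected for these data.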
Bivariate test: One way ANalysis Of VAriance (One way ANOVA)
• The elements :
– One qualitative variable or factor with multiple modalities or levels and 1 quantitative variable. N is the total number of observations and k the number of levels.
• Hypotheses :
– Null hypothesis :
•The observed means in the different groups xa, xb, xc, … are estimators of the means ma, mb, mc, …
H0 : ma = mb = mc = …
– Alternative hypothesis :
•At least one of the means ma, mb, mc, … is different from the others.
• Statistic :
Let Var_total = Var_between classes + Var_within classes (sums of squared deviations), then $F = \frac{Var_{between} / (k - 1)}{Var_{within} / (N - k)}$
• Decision :
– If F > F alpha => reject H0 : the means of the groups are not all equal
– Equivalently, if the degree of significance p-value is less than alpha => reject H0.
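The F statistic of the one way ANOVA can be sketched as below, taking the between-class and within-class terms as sums of squared deviations divided by their degrees of freedom, k - 1 and N - k. The function name and the data are hypothetical.

```python
def one_way_anova(groups):
    """Return F and its degrees of freedom (k - 1, N - k)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between classes: group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within classes: observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    f = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return f, (k - 1, n_total - k)

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]   # hypothetical: 3 levels, N = 9
f, dfs = one_way_anova(groups)
print(f, dfs)
```

With these data the between-class sum of squares is 38 and the within-class sum is 6, which gives F = 19 with (2, 6) degrees of freedom.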
Multifactorial test: Multiway ANalysis Of VAriance (Multiway ANOVA - GLM ANOVA)
• The elements :
– Multiple qualitative variables or factors with multiple levels and 1 quantitative variable.
• Hypotheses :
– To the previous hypotheses H0 and H1 of the one way ANOVA we add the interaction between factors.
• Statistic : We estimate the parameters of the general linear model (GLM) Xijk = µ + αi + βij + εijk, …, with the maximum likelihood method
i = 1...a (number of classes of factor A)
j = 1...b (number of classes of factor B)
k = 1...n (number of repetitions in each sub-group)
µ : parametric mean of the population;
αi : effect of the controlled factor A on the observation: fixed deviation of the group compared to the mean µ;
βij : random contribution of the jth sub-group of the ith group;
εijk : random fluctuation of the xijk value: random variable, independent, normally distributed, with a mean of 0 and a variance s2.
Multifactorial test: Multiple linear regression
• The elements :
– We would like to study the effect of multiple independent variables (quantitative or qualitative) on a continuous variable called the response variable, and to estimate the linear function which best predicts the response variable.
• Hypotheses :
– Null hypothesis :
•The independent variables have no effect on the response variable.
– Alternative hypotheses :
•The independent variables influence the response variable.
• Statistic :
y = a + b1X1 + b2X2 + … + bjXj + ε. The squared multiple correlation coefficient R2 is calculated.
We estimate the parameters of the general linear model (GLM) with the method which maximizes the likelihood. We could include in the model covariates with a known effect on the other independent random variables.
Multifactorial test: Multiple logistic regression
• The elements :
– In many applications, the response variable Y has just 2 possible values, and can be represented by a binary indicator variable taking the values 0 and 1.
• Hypotheses :
– Null hypothesis :
•the independent variables have no effect on the response variable
– Alternative hypotheses :
•the independent variables influence the response variable.
• Statistic :
•The logistic formula is: $E(Y) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$
GLM: Estimation of parameters
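The logistic formula of the multiple logistic regression slide can be evaluated directly. The coefficients below are hypothetical values chosen for illustration; in practice they are estimated from the data. Note that raising X by one unit multiplies the odds E(Y) / (1 - E(Y)) by exp(β1).

```python
import math

def logistic_mean(x, beta0, beta1):
    """E(Y) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), a probability in (0, 1)."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1 + math.exp(eta))

beta0, beta1 = -3.0, 0.5   # hypothetical coefficients
p0 = logistic_mean(0, beta0, beta1)
p1 = logistic_mean(1, beta0, beta1)
p6 = logistic_mean(6, beta0, beta1)   # beta0 + beta1 * 6 = 0, so E(Y) = 0.5

odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
print(p0, p6, odds_ratio)
```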
Multifactorial test: Survival Curves (1)
• Probability of the occurrence of an event (B) in subjects sharing a common initial event (A), taking into consideration the time between the two events. For example, the occurrence of death (B) in subjects with a severe disease (A).
• Survival function (Si): probability to «survive» up to an instant t.
• Method for the survival analysis: Method of Kaplan-Meier
• Method for the comparison between 2 survival curves: The Log Rank test
Multifactorial test: Survival Curves (2)
• Calculation of the probability to survive at each time at which at least one « death » occurs.
• Let:
Vi : number of survivors at the beginning of the interval ti - ti-1
Di : number of deaths during the interval ti - ti-1
Ei : number of exclusions at the beginning of the interval ti - ti-1
qi : probability of death during the interval ti - ti-1 : qi = Di/(Vi - Ei)
pi : probability of survival during the interval ti - ti-1 : pi = 1 - qi
Si : survival function at instant ti : Si = p0p1…pi = piSi-1
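The recurrence Si = pi * Si-1 can be sketched as follows, using the quantities Vi, Di and Ei defined above. The function name and the follow-up data are hypothetical.

```python
def survival_curve(intervals):
    """intervals: list of (V_i, D_i, E_i) in time order; returns [S_1, S_2, ...]."""
    s = 1.0
    curve = []
    for v_i, d_i, e_i in intervals:
        q_i = d_i / (v_i - e_i)   # probability of death during the interval
        p_i = 1 - q_i             # probability of survival during the interval
        s *= p_i                  # S_i = p_i * S_(i-1)
        curve.append(s)
    return curve

# Hypothetical follow-up of 10 subjects: (survivors at start, deaths, exclusions)
intervals = [(10, 1, 0), (9, 2, 1), (6, 1, 0)]
curve = survival_curve(intervals)
print(curve)
```

For these data the survival function takes the values 0.9, 0.675 and 0.5625.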
Multifactorial test: Survival Curves (3)
Comparison of 2 survival curves with the Log rank method.
• Hypotheses:
– H0 = the 2 survival curves have identical profiles; the risk of « death » at a given time is thus the same in both groups
– H1 = the 2 survival curves have different profiles
• We calculate the theoretical probability of « death » at the instant i:
Pi = (D1i + D2i) / (V1i + V2i)
• We deduce that the calculated frequency of « death » in each group at instant i is:
c1i = Pi * V1i c2i = Pi * V2i
We note c1 = Σc1i, c2 = Σc2i, o1 = ΣD1i and o2 = ΣD2i
Multifactorial test: Survival Curves (4)
• Statistic: Formula with results and interpretation similar to a Chi2 (χ2) with 1 df : $\chi^2 = \frac{(o_1 - c_1)^2}{c_1} + \frac{(o_2 - c_2)^2}{c_2}$
• Decision: We use the table of χ² to look for the value χ²α corresponding to the risk α, with 1 df: if χ² > χ²α, reject H0
• Remark: when we want to include one or many factors and to study their effect on survival (multivariate analysis), we use the Cox model…
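The Log rank calculation can be sketched as follows: calculated and observed « deaths » are accumulated over the instants, then combined as χ² = (o1 - c1)²/c1 + (o2 - c2)²/c2. The function name and the data are hypothetical.

```python
def log_rank(group1, group2):
    """group1, group2: lists of (V_i, D_i) per instant i, aligned in time."""
    c1 = c2 = 0.0
    o1 = o2 = 0
    for (v1, d1), (v2, d2) in zip(group1, group2):
        p_i = (d1 + d2) / (v1 + v2)   # theoretical probability of death at instant i
        c1 += p_i * v1                # calculated deaths in group 1
        c2 += p_i * v2                # calculated deaths in group 2
        o1 += d1                      # observed deaths in group 1
        o2 += d2                      # observed deaths in group 2
    chi2 = (o1 - c1) ** 2 / c1 + (o2 - c2) ** 2 / c2
    return chi2, (o1, c1, o2, c2)

# Hypothetical data: (survivors at the instant, deaths at the instant)
group1 = [(10, 1), (9, 1), (8, 1)]
group2 = [(10, 3), (7, 3), (4, 2)]
chi2, (o1, c1, o2, c2) = log_rank(group1, group2)
print(chi2, o1, c1, o2, c2)
```

The totals always satisfy c1 + c2 = o1 + o2. Here χ² is about 3.91, slightly above the tabulated 3.841 for 1 df and α = 0.05.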
Multivariate test: Multivariate ANalysis Of VAriance (MANOVA) or covariance (MANCOVA)
• The elements :
2 or more qualitative variables or factors with multiple modalities or levels and 2 or more quantitative variables.
Eg: To study the difference of age and satisfaction according to the questionnaires and the number of persons.
Ques.   AGE   NBPERS   SATISF
1       33    3        18
2       29    2        9
3       45    1        14
Multivariate test: Principal Component Analysis (PCA)
• Methods for orthogonal projection
– The PCA objective is to study globally the relation between multiple quantitative variables. It can also be used for ordinal qualitative variables coded numerically.
– The objectives of the PCA are :
– information reduction: the variables are regrouped into a small number of new variables called principal components;
– the typology of cases : the position of the cases relative to the principal components allows groups of cases to be detected by increasing the variance between different groups.
[Figure: cases plotted as points in the plane of the first two principal components, F1 and F2.]
Fields of Application
• Genomics, Transcriptomics, Proteomics, Metabolomics
• Clinical studies & Epidemiology
• Pharmacology, Pharmacogenomics & Biology
• Agronomy & Ecology
Summary of the different steps of a statistical analysis:
• Transform each observation into a numerical value if it is not numeric
• Remove the noise from the signal & replace the missing values
• Normalize the data and prepare the data table for analysis
• Perform the differential analysis (t-test, ANOVA,...)
• Find similar sub-groups (clustering, K-means,…)
Biostatistics course EL FEGHALI R. AIIHC International Ltd.

More Related Content

What's hot

News And The Public Sphere
News And The Public SphereNews And The Public Sphere
News And The Public SphereRob Jewitt
 
Retroperitoneal fibrosis radiology
Retroperitoneal fibrosis radiologyRetroperitoneal fibrosis radiology
Retroperitoneal fibrosis radiologyAli Jiwani
 
Newspaper price control act
Newspaper price control actNewspaper price control act
Newspaper price control actAnirban Mandal
 
Arterial duplex images and interp
Arterial duplex images and interpArterial duplex images and interp
Arterial duplex images and interpchris profota
 
Presentation1.pptx, radiological imaging of pulmonary infection.
Presentation1.pptx, radiological imaging of pulmonary infection.Presentation1.pptx, radiological imaging of pulmonary infection.
Presentation1.pptx, radiological imaging of pulmonary infection.Abdellah Nazeer
 
News agency pm
News agency pmNews agency pm
News agency pmRhea Gupta
 
Normative theories
Normative theoriesNormative theories
Normative theoriesnadia naseem
 
Historia de la legislación de medios en Argentina
Historia de la legislación de medios en ArgentinaHistoria de la legislación de medios en Argentina
Historia de la legislación de medios en Argentinaeleperbe
 
Normative theory of the press ( Libertarian)
Normative theory of the press ( Libertarian)Normative theory of the press ( Libertarian)
Normative theory of the press ( Libertarian)JoannaDel
 
media and public sphere
media and public spheremedia and public sphere
media and public sphereVivie Chabie
 

What's hot (20)

Hypodermic needle
Hypodermic needleHypodermic needle
Hypodermic needle
 
Chronic PE
Chronic PEChronic PE
Chronic PE
 
News And The Public Sphere
News And The Public SphereNews And The Public Sphere
News And The Public Sphere
 
Retroperitoneal fibrosis radiology
Retroperitoneal fibrosis radiologyRetroperitoneal fibrosis radiology
Retroperitoneal fibrosis radiology
 
Newspaper price control act
Newspaper price control actNewspaper price control act
Newspaper price control act
 
Historia De Las TeoríAs De La Comunicacion. Cap.3
Historia De Las TeoríAs De La Comunicacion. Cap.3Historia De Las TeoríAs De La Comunicacion. Cap.3
Historia De Las TeoríAs De La Comunicacion. Cap.3
 
Digital Media Platforms
Digital Media PlatformsDigital Media Platforms
Digital Media Platforms
 
Arterial duplex images and interp
Arterial duplex images and interpArterial duplex images and interp
Arterial duplex images and interp
 
¿Quién fue Roland Barthes?
¿Quién fue Roland Barthes?¿Quién fue Roland Barthes?
¿Quién fue Roland Barthes?
 
Presentation1.pptx, radiological imaging of pulmonary infection.
Presentation1.pptx, radiological imaging of pulmonary infection.Presentation1.pptx, radiological imaging of pulmonary infection.
Presentation1.pptx, radiological imaging of pulmonary infection.
 
News agency pm
News agency pmNews agency pm
News agency pm
 
Indian television 2021
Indian television 2021Indian television 2021
Indian television 2021
 
Normative theories
Normative theoriesNormative theories
Normative theories
 
Media economics
Media economicsMedia economics
Media economics
 
Historia de la legislación de medios en Argentina
Historia de la legislación de medios en ArgentinaHistoria de la legislación de medios en Argentina
Historia de la legislación de medios en Argentina
 
Contemporary Media Issues Intro to Postmodern Media
Contemporary Media Issues  Intro to Postmodern MediaContemporary Media Issues  Intro to Postmodern Media
Contemporary Media Issues Intro to Postmodern Media
 
Imaging of Thoracic Trauma
Imaging of Thoracic TraumaImaging of Thoracic Trauma
Imaging of Thoracic Trauma
 
Normative theory of the press ( Libertarian)
Normative theory of the press ( Libertarian)Normative theory of the press ( Libertarian)
Normative theory of the press ( Libertarian)
 
media and public sphere
media and public spheremedia and public sphere
media and public sphere
 
Press council of India
Press council of IndiaPress council of India
Press council of India
 

Similar to Introduction to Biostatistics

Review of Chapters 1-5.ppt
Introduction to Biostatistics

  • 1. AiiHC00 Introduction to Biostatistics Ramzi EL FEGHALI, Ph.D. Senior Lecturer ref@aiihc.com Biostatistics course EL FEGHALI R. Artificial intelligence in HealthCare AIIHC International Ltd. U.K.
  • 2. AiiHC Program: Artificial Intelligence in HealthCare
      – 1st Section: Applied Statistics to Healthcare (AiiHC00 - AiiHC10)
      – 2nd Section: Artificial Intelligence (AiiHC11 - AiiHC15)
      – 3rd Section: Coding languages (AiiHC16 - AiiHC20)
      – 4th Section: Business in Artificial Intelligence (AiiHC21 - AiiHC24)
  • 3. Statistics History
  • 4. AiiHC00 Syllabus
  • 5. Steps in statistical studies
      • Data collection
          – Simple observation:
              • No specific intervention; data are collected over the course of the study.
              • Sampling plan
          – Experimentation:
              • Controlled phenomena are induced. Example: administration of a drug to a specific group of subjects.
      • Statistical analysis
          – "Deductive" or descriptive analysis:
              • Its aim is to summarize and present the observed data in tables, graphs, …
          – "Inductive" or inferential analysis:
              • Allows the conclusions obtained to be extended and generalized under certain conditions.
  • 6. Terminology
      • Variable: measurable entity (X).
      • Population: the set of values corresponding to a variable.
      • Individual or case: one or more observations performed on an entity.
      • Observation: a particular measurement of a variable ($x_i$) for one case.
      • Statistic or statistical descriptor: a number that describes or summarizes a set of observations.
      • Sample: a subset of the statistical population made up of many observations. The sample size is generally denoted n. The process that leads to the creation of a sample is called sampling.
  • 7. Sampling (1)
      • Basic statistics treat samples as random and independent.
      • Non-random samples must be avoided.
      • The probability of being sampled, or of belonging to a group, should be equal for all studied cases.
      • From $\bar{X}$ and $S$, we want to estimate $m$ and $\sigma$:
          – Population: $m$, $\sigma$ (unknown)
          – Sample: $\bar{X}$, $S$ (known)
  • 8. Sampling (2)
      • A statistical sample is made up of a limited number of cases drawn from the studied population.
      • Sampling gives a representation of the population.
      • A sample of size n of a random variable X is obtained by repeating n times the draw that yields X.
      • Notation: $(X_1, X_2, \dots, X_n)$
      • A particular realization of the sample: $(x_1, x_2, \dots, x_n)$
      • Estimator: a quantity computed from the observations that is used to estimate the unknown parameters of a statistical distribution.
      • Unbiased estimator: its mean (expected value) equals the true value of the parameter.
      • Convergent estimator: it gets closer to the true value as n increases.
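The notion of an unbiased estimator can be checked by brute force on a tiny example. The sketch below is only an illustration under stated assumptions: the population `[1, 2, 3, 4]` and the sample size of 2 are hypothetical, and samples are drawn with replacement so that every possible sample can be enumerated. It compares the average of the sample variance computed with n - 1 against the one computed with n.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical toy population, chosen only for illustration.
population = [1, 2, 3, 4]
true_variance = pvariance(population)  # population variance (divides by N)

# Enumerate every ordered sample of size 2 drawn with replacement.
samples = list(product(population, repeat=2))

# Average each estimator over all possible samples.
mean_unbiased = mean(variance(s) for s in samples)   # divides by n - 1
mean_biased = mean(pvariance(s) for s in samples)    # divides by n

print(true_variance, mean_unbiased, mean_biased)
```

Averaged over all samples, the n - 1 version matches the population variance, while the n version underestimates it.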
  • 9. Types of variables
      • Qualitative variables:
          – Nominal (non-ordinal). Examples: color, form, type of treatment, …
          – Ordinal. Examples: vegetation coverage, wind scale, presence/absence, …
      • Quantitative variables:
          – Continuous. Examples: temperature, height, weight, …
          – Discrete. Examples: counts of individuals, number of occurrences of an event, etc.
  • 10. Frequency distribution
      • Statistical series:
          – a simple enumeration or count of the observations
          – can be ordered (quantitative variable)
          – the total number of observations N is called the sample size
      • Ungrouped distributions:
          – With multiple observations, the same value can be observed many times.
          – $x_i$ denotes the distinct values; the number of occurrences $n_i$ of $x_i$ is called the absolute frequency; p is the number of distinct observed values.
          – $n_i / N$ is called the relative frequency.
          – For a quantitative variable, the $x_i$ are ordered and the absolute or relative frequencies can be summed to obtain the cumulative frequencies $N_i$ or $F_i$.
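The frequency definitions above can be sketched in a few lines. This is a minimal illustration, not part of the original deck; the list `observations` is hypothetical data and the variable names simply mirror the slide's symbols.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations of a discrete quantitative variable.
observations = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]
N = len(observations)

counts = Counter(observations)
values = sorted(counts)                           # distinct values x_i, ordered
absolute = [counts[x] for x in values]            # absolute frequencies n_i
relative = [n_i / N for n_i in absolute]          # relative frequencies n_i / N
cumulative_abs = list(accumulate(absolute))       # cumulative frequencies N_i
cumulative_rel = [c / N for c in cumulative_abs]  # cumulative frequencies F_i

for row in zip(values, absolute, relative, cumulative_abs, cumulative_rel):
    print(row)
```

The last cumulative absolute frequency equals N, and the last cumulative relative frequency equals 1.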
  • 11. Descriptive statistics (1)
      • Descriptive statistics are used to describe a dataset or sample.
      • A data table alone is not enough to analyze the data, so other methods are used to understand and interpret it.
      • Two groups of methods: numerical methods (statistical descriptors) and graphical methods.
      • They are complementary.
      • Datasets are univariate (only one variable), bivariate (two variables) or multivariate (more than two variables).
  • 12. Descriptive statistics (2)
      • Mean or average: sum of the observations divided by their number. For the variable Y, the mean (called "Y bar") is: $\bar{y} = \sum_{i=1}^{n} y_i / n$
      • Property: the sum of deviations from the mean is zero: $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$
      • Median: the value in the middle of the observations.
          – Sort the observations in increasing order.
          – If n is odd, take the middle value (e.g. n = 7 => $y_4$).
          – If n is even, take the midpoint of the two middle values (e.g. n = 8 => $(y_4 + y_5) / 2$).
      • A statistic is called robust, or resistant, if its value is little affected by changes to a small proportion of the data.
          – The median is more robust than the mean.
          – The mean carries more information about the data. It is more powerful when the data are ready for analysis (no outliers).
          – The two measures are complementary.
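A short sketch may help to see the robustness argument at work. The two lists below are hypothetical; the second replaces a single value with an outlier so that the effect on the mean and on the median can be compared. The zero-sum property of the deviations is checked up to floating-point rounding.

```python
from statistics import mean, median

# Hypothetical measurements; the second list replaces one value with an outlier.
clean = [4, 5, 5, 6, 7, 8, 9]
with_outlier = [4, 5, 5, 6, 7, 8, 90]

# The sum of deviations from the mean is zero (up to rounding error).
deviation_sum = sum(y - mean(clean) for y in clean)

print(mean(clean), median(clean))
print(mean(with_outlier), median(with_outlier))  # only the mean shifts
print(abs(deviation_sum) < 1e-9)
```

Changing one value out of seven leaves the median unchanged but moves the mean well away from the bulk of the data.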
  • 13. Descriptive statistics (3)
      • Range of the values: $y_{max} - y_{min}$ (less robust).
      • Quartiles: the median divides the sample into 2 sets; the quartiles divide each set into 2 equal subsets.
      • The interquartile range is $Q_3 - Q_1$.
      • The 5 numbers min, Q1, median, Q3, max summarize the sample.
      • The quantile (or percentile) of order p (0 <= p <= 1) is the value below which a proportion p of the observations lies, e.g. p = 0.1 => 10% of the values are smaller and 90% are larger.
      • [Figure: axis from min to max marking Q1, Q2 = median and Q3, with the interquartile range and the range of the values.]
  • 14. Descriptive statistics (4)
      • Variance of a population: $\sigma^2 = \sum_{i=1}^{n} (y_i - \mu)^2 / n$
      • Variance of a sample: $s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$
      • Standard deviation: $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$
      • The standard deviation represents the deviation from the mean.
      • Coefficient of variation: $cv = (s / \bar{y}) \cdot 100\%$
      • Variance, standard deviation and coefficient of variation are associated with the mean.
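The dispersion formulas above map directly onto Python's standard `statistics` module. The following is a minimal sketch with a hypothetical sample `y`; `pvariance` divides by n and `variance` divides by n - 1.

```python
from statistics import mean, pvariance, stdev, variance

y = [10, 12, 9, 11, 13]  # hypothetical sample

sigma2 = pvariance(y)    # population formula: divides by n
s2 = variance(y)         # sample formula: divides by n - 1
s = stdev(y)             # square root of s2
cv = s / mean(y) * 100   # coefficient of variation, in percent

print(sigma2, s2, round(s, 3), round(cv, 2))
```

Because the coefficient of variation is expressed relative to the mean, it can be compared across variables measured in different units.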
  • 15. Descriptive statistics (5)
      • The boxplot is a graphical representation of the "5 numbers".
      • Outliers are values lying more than 1.5 times the interquartile range beyond the nearest quartile (the box borders).
      • [Figure: boxplot on a 10 to 50 axis showing min, Q1, median, Q3, the IQR, the 1.5 · IQR limit, the maximum and an extreme value (max).]
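The five-number summary and the 1.5 · IQR rule can be sketched as follows. The data are hypothetical, and the quartile convention is an assumption: statistical tools differ in how they interpolate quartiles, and the `"inclusive"` method of `statistics.quantiles` is only one common choice.

```python
from statistics import quantiles

data = sorted([12, 15, 17, 18, 20, 21, 23, 25, 48])  # hypothetical values

# Quartile conventions differ between tools; "inclusive" is one common choice.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_numbers = (min(data), q1, q2, q3, max(data))

# Values beyond 1.5 * IQR from the nearest quartile are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(five_numbers, iqr, outliers)
```

In a boxplot, the whiskers would stop at the most extreme values inside the fences and the flagged value would be drawn as a separate point.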
  • 16. Descriptive statistics (6)
      • Grouping into classes (choice of the class borders and inclusion of the left and right values).
      • Graphical representation: the histogram = frequency of observations in each class.
      • Unimodal, bimodal or multimodal distribution.
      • [Figure: three histograms of frequency against y, including a unimodal and a bimodal distribution.]
  • 17. Descriptive statistics (7)
      • The dotplot: 2 observed variables.
      • [Figure: iris$Petal.Length plotted against iris$Sepal.Length.]
  • 18. Descriptive statistics (8)
      • The Q-Q plot
          – 1 observed variable and 1 theoretical distribution
          – 2 observed variables
  • 19. Probability distributions
      • Discrete distributions: uniform, Bernoulli, binomial, Poisson, …
          – Discrete random variables take separate, integer values over a predefined interval. They are generally the result of counting.
      • Continuous distributions: uniform, normal, standard normal, …
          – Continuous random variables take continuous values over a predefined interval.
  • 20. Normal distribution
      • The normal distribution is a well-known bell-shaped mathematical distribution.
      • It was defined by Laplace and Gauss.
      • The equation of the frequency curve of a normal distribution depends on two parameters, the mean m and the standard deviation σ of the variable:
        $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - m)^2}{2\sigma^2}}$, with $x \sim N(m; \sigma)$ and $\pi \approx 3.14159$
      • [Figure: bell-shaped curve of the normal distribution against x.]
  • 21. The standard normal distribution
      • $f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2 / 2}$, with $z = \frac{x - m}{\sigma} \sim N(0; 1)$
      • $P(-1.96 \le z \le +1.96) = 0.95$: for a standard normal distribution with α = 5%, 95% of the z values belong to the interval [-1.96 ; +1.96].
      • The confidence interval: $CI = m \pm z_{\alpha} \frac{s}{\sqrt{n}}$, where $z_{\alpha}$ is the standard normal quantile. When σ is estimated by s on a small sample, the quantile is taken instead from a Student distribution with n-1 DF (degrees of freedom).
      • [Figure: standard normal density f(z) for z from -3 to 3, with -1.96 and +1.96 marked.]
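The 1.96 bound and the confidence interval can be reproduced with `statistics.NormalDist` from the standard library. This is a sketch under assumptions: the sample is hypothetical, and the normal quantile is used throughout. With a sample this small, a Student quantile with n-1 DF would give a somewhat wider interval; the standard library does not provide it.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

z = NormalDist()                             # standard normal N(0; 1)
alpha = 0.05
z_alpha = z.inv_cdf(1 - alpha / 2)           # two-sided quantile, about 1.96
coverage = z.cdf(z_alpha) - z.cdf(-z_alpha)  # P(-z_alpha <= z <= +z_alpha)

sample = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3]  # hypothetical measurements
m, s, n = mean(sample), stdev(sample), len(sample)
half_width = z_alpha * s / sqrt(n)
ci = (m - half_width, m + half_width)

print(round(z_alpha, 2), round(coverage, 2), ci)
```

The interval narrows as n grows, since its half-width is proportional to 1 / sqrt(n).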
  • 22. Statistical tests
      • Statistical hypotheses
      • One-tailed and two-tailed tests
      • Type I and II errors (alpha and beta risks)
      • Significance: the p-value
      • Univariate, bivariate, multifactorial, and multivariate tests
  • 23. Statistical hypotheses
      • They are statements about the frequency distribution.
      • These statements may be true or false.
      • In most tests we define a hypothesis with the aim of rejecting it.
      • Example: the observed percentage in a population is 10%, and we want to check whether the percentage observed in a particular group differs from it. We suppose that there is no difference, so the hypothesis is: "All observed differences are the result of sampling fluctuations, or chance."
      • This hypothesis is called the null hypothesis and is denoted H0.
      • All other hypotheses are called alternative hypotheses and are denoted H1.
  • 24. Tests of hypotheses and significance
      • Tests of hypotheses, or of significance, are statistical procedures used to decide whether the stated hypotheses are true or false, in order to learn more about the unknown reality.
      • They belong to inferential statistics.
      • Different tests exist according to:
          – the type of variables studied (quantitative/qualitative)
          – the type of problem (comparison between 2 or more means, …)
          – the conditions of application (modeling in terms of a probability distribution)
      • However, the logical steps of a test are always the same.
  • 25. One-tailed and two-tailed tests
      • The null hypothesis H0 is usually equality. The alternative hypotheses H1 are then all the other possible situations, which fall into two categories: greater than, and less than.
      • When we consider all the alternative hypotheses, we use a two-tailed test.
      • When we consider only one part of the alternative hypotheses (greater than, or less than), we use a one-tailed test.
      • E.g. we compare the heights of 3-year-old and 4-year-old children. The test is one-tailed because height increases with age.
  • 26. Type I and II errors (alpha and beta risks)
      • Type I error: rejecting the null hypothesis when it is true.
          – Example: in a trial comparing a new drug with a previous one, we conclude that the two differ when in reality they do not. We are making a type I error.
      • Type II error: the opposite of the type I error. We accept the null hypothesis when it is false.
          – Example: in a trial comparing a new drug with a previous one, we conclude that the two do not differ when in reality they do. We are making a type II error.
  • 27. Significance: the p-value
      • When we test a hypothesis, the probability of making a type I error is called the significance threshold of the test and is usually denoted alpha. This risk is set before the experiment, when the problem is formulated.
      • The probability of making a type II error is usually denoted beta. The probability of rejecting H0 when it is false is called the power of the test: Power = 1 - beta.
      • There is no direct relation between alpha and beta. Both should be as close to 0 as possible. In general, we choose alpha = 0.05 and try to minimize beta (typically 0.1).
      • The degree of significance, or p-value, is the probability, under the null hypothesis, of observing by chance alone a difference at least as large as the one observed.
  • 28. Univariate test: Chi2 for goodness of fit (1)
      • The elements:
          – Generalization of the comparison between an observed percentage and a theoretical one.
          – 1 qualitative variable defining classes (or a quantitative variable grouped into classes). We have the observed number of subjects in each class.
          – 1 theoretical distribution over the same classes, either empirical or derived from a theoretical probability distribution.
      • The question:
          – Does the observed distribution conform to the theoretical distribution?
          – Could the differences between the observed and theoretical values be due to chance?
  • 29. Univariate test: Chi2 for goodness of fit (2)
      • Hypotheses:
          – Null hypothesis H0:
              • The differences between the observed and theoretical values are due to chance. The observed distribution follows the theoretical probability distribution.
          – Alternative hypothesis H1:
              • The observed distribution does not follow the theoretical probability distribution.
      • Elements necessary for the calculation:
          – Contingency table

            Classes                  A    B    C    D
            Observed frequency       O1   O2   O3   O4
            Theoretical frequency    T1   T2   T3   T4
  • 30. Univariate test: Chi2 for goodness of fit (3)
      • Statistic: Chi2 "goodness of fit"
          – Calculation of each theoretical frequency.
          – Condition of application: all theoretical frequencies must be greater than 5; otherwise we group the classes.
          – Calculation of the Chi2: $\chi^2 = \sum_{i=1}^{p} \frac{(O_i - T_i)^2}{T_i}$, where p = number of classes after grouping.
          – Degrees of freedom: DF = p - 1 - number of parameters estimated for the theoretical distribution (if any).
      • Decision: if Chi2 > Chi2 alpha => reject H0: the distribution does not conform to the theoretical distribution. Equivalently, if the degree of significance (p-value) is less than alpha => reject H0.
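The goodness-of-fit steps above can be sketched as follows. The observed counts and the uniform theoretical proportions are hypothetical, and the critical value 7.815 is the tabulated Chi2 value for alpha = 0.05 and DF = 3, hard-coded here because the standard library has no Chi2 distribution.

```python
observed = [18, 22, 27, 33]                   # hypothetical counts, classes A to D
theoretical_props = [0.25, 0.25, 0.25, 0.25]  # assumed theoretical distribution

n = sum(observed)
theoretical = [p * n for p in theoretical_props]

# Condition of application: every theoretical frequency greater than 5.
assert all(t > 5 for t in theoretical), "group classes before testing"

chi2 = sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))
df = len(observed) - 1   # no parameter was estimated from the data

CHI2_CRIT_DF3 = 7.815    # tabulated value for alpha = 0.05, DF = 3
reject_h0 = chi2 > CHI2_CRIT_DF3

print(round(chi2, 2), df, reject_h0)
```

With these counts the statistic stays below the critical value, so H0 is not rejected at the 5% threshold.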
  • 31. Univariate tests: Tests of normality
      • In addition to the Chi2 for goodness of fit, other graphical and/or statistical approaches are used to test the normality of a distribution.
      • Graphical methods and empirical techniques: histogram, boxplot, Q-Q plot, and the skewness and kurtosis coefficients:
          – Skewness: $G_1 = \frac{n}{(n-1)(n-2)} \sum_i \left( \frac{x_i - \bar{x}}{s} \right)^3$
          – Kurtosis: $G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
      • Statistical methods: the univariate Kolmogorov-Smirnov test.
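The two coefficients can be computed directly from their formulas. The helper names `skewness_g1` and `kurtosis_g2` and both data lists are hypothetical, introduced only for this sketch; `s` is the sample standard deviation (n - 1 denominator).

```python
from statistics import mean, stdev

def skewness_g1(x):
    """Sample skewness G1 with the n / ((n-1)(n-2)) correction."""
    n, m, s = len(x), mean(x), stdev(x)
    return n / ((n - 1) * (n - 2)) * sum(((xi - m) / s) ** 3 for xi in x)

def kurtosis_g2(x):
    """Sample excess kurtosis G2 (close to 0 for normal data)."""
    n, m, s = len(x), mean(x), stdev(x)
    fourth = sum(((xi - m) / s) ** 4 for xi in x)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical data with a long right tail

print(skewness_g1(symmetric), kurtosis_g2(symmetric))
print(skewness_g1(right_skewed))
```

A skewness near 0 indicates symmetry, a positive value a longer right tail; a negative G2 indicates a flatter distribution than the normal one.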
  • 32. Bivariate test: Contingency Chi2 or Chi2 for independence
      • The elements:
          – Two qualitative variables (with C "column" and L "line" modalities) are measured for each subject. We obtain a contingency table with C*L cells; each subject is classified in one and only one cell.
      • Hypotheses:
          – Null hypothesis H0: the variables are independent.
          – Alternative hypothesis H1: the variables are not independent.
      • Statistic: $\chi^2 = \sum_{i,j} \frac{(O_{ij} - T_{ij})^2}{T_{ij}}$, where i, j index the modalities of C and L; DF = (L-1)*(C-1).
      • Decision:
          – If Chi2 > Chi2 alpha => reject H0: the 2 variables are not independent.
          – Equivalently, if the degree of significance (p-value) is less than alpha => reject H0.
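A minimal sketch of the independence test on a 2 x 2 table follows. The table is hypothetical. The theoretical counts use the usual rule under independence, T_ij = (line total x column total) / n, which the slide does not spell out, and the critical value 3.841 is the tabulated Chi2 value for alpha = 0.05 and DF = 1.

```python
# Hypothetical 2 x 2 contingency table: lines = treatment, columns = outcome.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Theoretical count under independence: T_ij = line total * column total / n
theoretical = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - t) ** 2 / t
           for o_row, t_row in zip(observed, theoretical)
           for o, t in zip(o_row, t_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

CHI2_CRIT_DF1 = 3.841   # tabulated value for alpha = 0.05, DF = 1
reject_h0 = chi2 > CHI2_CRIT_DF1

print(chi2, df, reject_h0)
```

Here the statistic exceeds the critical value, so independence is rejected at the 5% threshold.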
  • 33. Bivariate test: Correlation and simple linear regression
      • Regression and correlation:
          – x and y are two continuous random variables: x and y have a degree of association => correlation.
          – y is explained by x => regression.
          – With either method a statistical test can be performed to obtain a p-value and check whether the association between the 2 variables is significant.
      • Correlation:
          – Pearson's correlation coefficient: $r_{xy} = \mathrm{cov}_{xy} / \sqrt{\mathrm{var}_x \, \mathrm{var}_y}$
          – $\mathrm{cov}_{xy}$, the covariance between x and y: $\mathrm{cov}_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$
      • Simple linear regression:
          – y = ax + b + ε => the parameters of the regression line between the 2 variables x and y are estimated with the least squares method.
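The correlation coefficient and the least squares line can both be derived from the covariance. The sketch below uses hypothetical paired data and the closed-form least squares solution a = cov_xy / var_x, b = ȳ - a·x̄, which is standard but not written on the slide.

```python
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]             # hypothetical paired observations
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n

r_xy = cov_xy / sqrt(var_x * var_y)   # Pearson's correlation coefficient

# Least squares estimates for y = a*x + b
a = cov_xy / var_x
b = y_bar - a * x_bar

print(round(r_xy, 4), round(a, 2), round(b, 2))
```

Dividing the covariance and both variances by n or by n - 1 gives the same r and the same slope, as long as the choice is consistent.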
  • 34. Bivariate test: test for equal distribution between 2 samples
      • Parametric tests require a normal distribution and equal distributions in the 2 samples.
      • To test this equality we use the non-parametric Kolmogorov-Smirnov test, also used as the univariate test of normality.
      • If the degree of significance (p-value) is less than alpha, we reject H0 and conclude that the distributions are not equal; we then use non-parametric tests (Wilcoxon, Mann-Whitney U, Kruskal-Wallis). If the distributions are equal, we use parametric tests (Student for dependent or independent samples, one-way ANOVA).
  • 35. Bivariate test: Student's test (comparison of means)
      • Hypotheses:
          – Null hypothesis: the observed means $\bar{x}_a$ and $\bar{x}_b$ are estimators of the means $\mu_a$ and $\mu_b$, with $\mu_a = \mu_b$.
          – Alternative hypotheses:
              • Two-tailed test: $\mu_a \neq \mu_b$
              • One-tailed test: $\mu_a > \mu_b$ or (exclusive) $\mu_a < \mu_b$
      • Statistic: $t = \frac{|\bar{x}_a - \bar{x}_b|}{\sqrt{\frac{\sigma^2_{common}}{N_a} + \frac{\sigma^2_{common}}{N_b}}}$ has a Student distribution with $N_a + N_b - 2$ DF, where $\sigma^2_{common} = \frac{SSD_a + SSD_b}{N_a + N_b - 2} = \frac{\sigma_a^2 (N_a - 1) + \sigma_b^2 (N_b - 1)}{N_a + N_b - 2}$
      • Decision:
          – If t > t alpha => reject H0: the 2 means are not equal.
          – Equivalently, if the degree of significance (p-value) is less than alpha => reject H0.
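The pooled-variance statistic can be sketched as follows for two independent samples. Both groups are hypothetical data, and 2.262 is the tabulated two-tailed Student value for alpha = 0.05 and 9 DF, hard-coded because the standard library has no Student distribution.

```python
from math import sqrt
from statistics import mean, variance

group_a = [5.1, 4.8, 5.6, 5.3, 5.0]        # hypothetical measurements
group_b = [4.2, 4.6, 4.4, 4.9, 4.3, 4.5]

na, nb = len(group_a), len(group_b)
ssd_a = variance(group_a) * (na - 1)   # sum of squared deviations, group a
ssd_b = variance(group_b) * (nb - 1)   # sum of squared deviations, group b
var_common = (ssd_a + ssd_b) / (na + nb - 2)

t = abs(mean(group_a) - mean(group_b)) / sqrt(var_common / na + var_common / nb)
df = na + nb - 2

T_CRIT_DF9 = 2.262   # tabulated two-tailed value for alpha = 0.05, DF = 9
reject_h0 = t > T_CRIT_DF9

print(round(t, 2), df, reject_h0)
```

For a one-tailed test, the sign of the difference would be kept and the one-tailed critical value used instead.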
  • 36. Bivariate test: One-way ANalysis Of VAriance (one-way ANOVA)
      • The elements:
          – One qualitative variable, or factor, with multiple modalities or levels, and 1 quantitative variable. N is the total number of observations and k the number of levels.
      • Hypotheses:
          – Null hypothesis: the means observed in the different groups, $\bar{x}_a, \bar{x}_b, \bar{x}_c, \dots$, are estimators of the means $m_a, m_b, m_c, \dots$ with H0: $m_a = m_b = m_c = \dots$
          – Alternative hypothesis: at least one of the means $m_a, m_b, m_c, \dots$ is different from the others.
      • Statistic: let $Var_{total} = Var_{between\ classes} + Var_{within\ classes}$; then $F = \frac{Var_{between} / (k - 1)}{Var_{within} / (N - k)}$
      • Decision:
          – If F > F alpha => reject H0: the group means are not all equal.
          – Equivalently, if the degree of significance (p-value) is less than alpha => reject H0.
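The decomposition of the total variation and the F ratio can be sketched as follows. The three groups are hypothetical data, the "Var" terms are computed as sums of squared deviations, and 5.14 is the tabulated F value for alpha = 0.05 with (2, 6) DF.

```python
from statistics import mean

groups = {                        # hypothetical measurements per level
    "a": [4.0, 5.0, 6.0],
    "b": [6.0, 7.0, 8.0],
    "c": [8.0, 9.0, 10.0],
}
values = [v for g in groups.values() for v in g]
N, k = len(values), len(groups)
grand_mean = mean(values)

# Between-classes and within-classes sums of squared deviations.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
ss_total = sum((v - grand_mean) ** 2 for v in values)

F = (ss_between / (k - 1)) / (ss_within / (N - k))

F_CRIT_2_6 = 5.14   # tabulated value for alpha = 0.05, DF = (2, 6)
reject_h0 = F > F_CRIT_2_6

print(ss_total, ss_between, ss_within, F, reject_h0)
```

The total sum of squares equals the sum of the between and within parts, which is the identity the F ratio relies on.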
  • 37. Multifactorial test: Multiway ANalysis Of VAriance (multiway ANOVA, GLM ANOVA)
      • The elements:
          – Multiple qualitative variables, or factors, with multiple levels, and 1 quantitative variable.
      • Hypotheses:
          – To the hypotheses H0 and H1 of the one-way ANOVA we add the interaction between factors.
      • Statistic: we estimate the parameters of the general linear model (GLM) $X_{ijk} = \mu + \alpha_i + \beta_{ij} + \varepsilon_{ijk}$ with the maximum likelihood method, where:
          – i = 1, …, a (number of classes of factor A)
          – j = 1, …, b (number of classes of factor B)
          – k = 1, …, n (number of repetitions in each sub-group)
          – $\mu$: parametric mean of the population
          – $\alpha_i$: effect of the controlled factor A on the observation: fixed deviation of the group from the mean $\mu$
          – $\beta_{ij}$: random contribution of the j-th sub-group of the i-th group
          – $\varepsilon_{ijk}$: random fluctuation of the value $x_{ijk}$: an independent, normally distributed random variable with mean 0 and variance $s^2$
  • 38. Multifactorial test: Multiple linear regression
      • The elements:
          – We want to study the effect of multiple independent variables (quantitative or qualitative) on a continuous variable, called the response variable, and to estimate the linear model that best predicts the response variable.
      • Hypotheses:
          – Null hypothesis: the independent variables have no effect on the response variable.
          – Alternative hypothesis: the independent variables influence the response variable.
      • Statistic: $y = a + b_1 X_1 + b_2 X_2 + \dots + b_j X_j + \varepsilon_{1,\dots,j}$
          – A multiple correlation coefficient $R^2$ is calculated.
          – The parameters of the general linear model (GLM) are estimated by maximum likelihood.
          – Covariates with a known effect on the other independent random variables can be included in the model.
  • 39. Multifactorial test: Multiple logistic regression
• The elements :
– In many applications, the response variable Y may take just 2 possible values, and can be represented by a binary indicator variable taking the values 0 and 1.
• Hypotheses :
– Null hypothesis :
•The independent variables have no effect on the response variable.
– Alternative hypothesis :
•The independent variables influence the response variable.
• Statistic :
•The logistic formula is:
$E(Y) = \dfrac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$
GLM: estimation of the parameters.
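The logistic formula can be evaluated directly once the parameters are known. The sketch below only computes E(Y) for given coefficients; the function name and the values β0 = −3, β1 = 0.5 are hypothetical and chosen for illustration, and the estimation step itself is not shown.

```python
import math

def logistic_mean(beta0, beta1, x):
    """E(Y) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), the probability that Y = 1."""
    eta = beta0 + beta1 * x                        # linear predictor
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients, not estimated from any data
beta0, beta1 = -3.0, 0.5
probabilities = [logistic_mean(beta0, beta1, x) for x in (0, 6, 12)]
print(probabilities)
```

Whatever the value of X, the result stays strictly between 0 and 1, which is why this form suits a binary response.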
  • 40. Multifactorial test: Survival Curves (1)
• Probability that an event (B) occurs in subjects who share a common origin event (A), taking into consideration the time between the two events. Example: the occurrence of death (B) in subjects with a severe disease (A).
• Survival function (Si): probability to "survive" up to an instant t.
• Method for the survival analysis: the Kaplan-Meier method.
• Method for the comparison between 2 survival curves: the Log Rank test.
  • 41. Multifactorial test: Survival Curves (2)
• Calculation of the probability to survive at each time at which at least one "death" occurs.
• Let:
Vi : number of survivors at the beginning of the interval [ti-1, ti]
Di : number of deaths during the interval [ti-1, ti]
Ei : number of exclusions at the beginning of the interval [ti-1, ti]
qi : probability of death during the interval [ti-1, ti] : qi = Di / (Vi − Ei)
pi : probability of survival during the interval [ti-1, ti] : pi = 1 − qi
Si : survival function at instant ti : Si = p0 p1 … pi = pi Si-1
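The recurrence Si = pi Si-1 translates almost literally into code. The following is a minimal sketch using the slide's notation; the function name `survival_curve` and the three intervals are made up for illustration.

```python
def survival_curve(intervals):
    """intervals: list of (V_i, D_i, E_i) tuples, one per interval.

    Returns the list of S_i values, with S_i = p_i * S_(i-1) and S starting at 1.
    """
    s = 1.0
    curve = []
    for v, d, e in intervals:
        q = d / (v - e)        # q_i: probability of death during the interval
        p = 1.0 - q            # p_i: probability of survival during the interval
        s *= p                 # S_i = p_i * S_(i-1)
        curve.append(s)
    return curve

# Hypothetical follow-up: (survivors at start, deaths, exclusions)
intervals = [(10, 1, 0), (9, 2, 1), (6, 1, 0)]
curve = survival_curve(intervals)
print(curve)
```

Plotting the Si values as a step function against the ti gives the survival curve.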
  • 42. Multifactorial test: Survival Curves (3)
Comparison of 2 survival curves with the Log Rank method.
• Hypotheses:
– H0 : the 2 survival curves have identical profiles; the risk of "death" at a given time is thus the same in both groups
– H1 : the 2 survival curves have different profiles
• We calculate the theoretical probability of "death" at the instant i:
Pi = (D1i + D2i) / (V1i + V2i)
• We deduce the calculated (expected) number of "deaths" in each group at instant i:
c1i = Pi * V1i
c2i = Pi * V2i
We note c1 = Σc1i, c2 = Σc2i, o1 = ΣD1i and o2 = ΣD2i
  • 43. Multifactorial test: Survival Curves (4)
• Statistic: Formula with results and interpretation similar to a Chi2 (χ²) with 1 df.
$\chi^2 = \dfrac{(o_1 - c_1)^2}{c_1} + \dfrac{(o_2 - c_2)^2}{c_2}$
• Decision: We use the table of χ² to look for the critical value at the risk α, with 1 df: if χ² > χ²α, reject H0.
• Remark: when we want to include one or many factors and to study their effect on survival (multivariate analysis), we use the Cox model.
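The Log Rank calculation described on the two slides above can be sketched as follows. The code follows the slide's formula with observed (o) and calculated (c) numbers of deaths; the function name and the small table of counts are assumptions made for illustration, and 3.841 is the standard χ² critical value for 1 df at α = 0.05.

```python
def log_rank_chi2(table):
    """table: list of (V1i, D1i, V2i, D2i) tuples, one per instant i.

    Returns (chi2, c1, c2, o1, o2) as defined on the slides.
    """
    c1 = c2 = 0.0
    o1 = o2 = 0
    for v1, d1, v2, d2 in table:
        p = (d1 + d2) / (v1 + v2)    # Pi: theoretical probability of death at i
        c1 += p * v1                 # calculated deaths in group 1
        c2 += p * v2                 # calculated deaths in group 2
        o1 += d1                     # observed deaths in group 1
        o2 += d2                     # observed deaths in group 2
    chi2 = (o1 - c1) ** 2 / c1 + (o2 - c2) ** 2 / c2
    return chi2, c1, c2, o1, o2

# Hypothetical counts at three instants
table = [(10, 2, 10, 0), (8, 2, 10, 1), (6, 1, 9, 0)]
chi2, c1, c2, o1, o2 = log_rank_chi2(table)

CHI2_CRITICAL_1DF_05 = 3.841         # chi-square table value, 1 df, alpha = 0.05
reject_h0 = chi2 > CHI2_CRITICAL_1DF_05
print(chi2, reject_h0)
```

A useful sanity check is that c1 + c2 always equals o1 + o2, since the calculated deaths only redistribute the observed ones between the two groups.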
  • 44. Multivariate test: Multiple ANalysis Of VAriance (MANOVA) or covariance (MANCOVA)
• The elements : 2 or more qualitative variables (factors) with multiple modalities or levels, and 2 or more quantitative variables.
E.g.: to study the difference of age and satisfaction according to the questionnaires and the number of persons.
Ques.   AGE   NBPERS   SATISF
1       33    3        18
2       29    2        9
3       45    1        14
  • 45. Multivariate test: Principal Component Analysis (PCA)
• Methods for orthogonal projection
– The PCA objective is to study globally the relation between multiple quantitative variables. It can also be used for qualitative ordinal (numeric) variables.
– The objectives of the PCA are :
•information reduction: the variables are regrouped into a small number of new variables called principal components;
•typology of cases: the positioning of cases relative to the principal components allows the detection of groups of cases by maximizing the variance between different groups.
[Figure: cases plotted on the first two principal components, axes F1 and F2]
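To make the idea of a principal component concrete, the sketch below extracts the first component F1 from a small data table by power iteration on the covariance matrix. This is one simple way to obtain it, not necessarily the method used in the course; the function name and the data values are illustrative assumptions, and the sketch assumes the starting vector is not orthogonal to the first component.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (direction, explained_share, scores) for the first component F1."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix of the centered variables
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p                                  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                  # keep unit length
    # Variance carried by F1 (Rayleigh quotient) versus total variance
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, eigenvalue / total_variance, scores

# Hypothetical table: 10 cases, 2 correlated quantitative variables
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
direction, share, scores = first_principal_component(data)
print(direction, share)
```

The `scores` are the coordinates of the cases on F1; because the two variables are strongly correlated, a single component carries most of the total variance, which is the information reduction described above.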
  • 46. Fields of Application
• Genomics, Transcriptomics, Proteomics, Metabolomics
• Clinical studies & Epidemiology
• Pharmacology, Pharmacogenomics & Biology
• Agronomy & Ecology
Summary of the different steps of a statistical analysis:
• Convert each observation to a numeric value if it is not already numeric
• Remove the noise from the signal & replace the missing values
• Normalize the data and prepare the data table for analysis
• Perform the differential analysis (t-test, ANOVA,...)
• Find similar sub-groups (clustering, K-means,…)