3. How?
• Book: Statistics at Square One, 11th ed. (Campbell and Swinscow)
• SPSS Practical sessions-PASW guide.
• Practical sessions using SPSS v. 17.0
4. "Statistics": an overview
Population
Parameters
Data
Analysis
Interpretation
Information
Sample
Statistics
Statistical analysis
Reference range
Researches
7. diab
PATNR  AGE  SEX  SMOKE  HEIGHT  WEIGHT  SBP1  SBP2  INSUL  CHOL   HBA1C  DIABDUR  DEAD
1      57   0    0      177     98      140   154   0      6.30   7.62   5        #NULL!
2      74   1    0      172     69      150   145   1      5.10   8.30   11       0
3      38   1    0      155     70      120   126   0      6.50   11.00  2        #NULL!
4      73   1    0      165     72      180   157   0      5.80   7.00   21       0
5      53   1    2      174     109     140   119   1      6.80   10.60  7        0
6      74   1    0      171     83      151   145   0      6.25   7.62   7        0
7      81   0    2      175     60      140   113   0      6.50   6.40   6        0
8      86   1    0      164     59      140   158   0      5.20   5.30   4        0
9      78   0    1      171     83      151   148   0      5.60   5.90   1        1
10     78   1    0      171     83      151   159   1      5.00   8.00   23       1
11     91   0    0      171     83      151   140   0      4.30   9.70   4        1
12     77   0    2      176     87      170   198   0      6.40   6.60   7        2
13     77   1    0      171     83      151   152   0      5.20   4.90   26       1
14     84   0    0      171     62      160   148   0      7.00   7.80   8        1
15     72   1    0      154     63      145   148   0      6.20   7.80   0        1
8. I-Tables
Tables can summarize counts, frequency (categorical), measures (numerical).

Frequency table: SEX
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   male      145         52.2      52.2            52.2
        female    133         47.8      47.8            100.0
        Total     278         100.0     100.0

Contingency table: smoking history * SEX Crosstabulation (Count)
                     SEX
smoking history      male   female   Total
never                26     110      136
stopped smoking      64     14       78
yes                  55     9        64
Total                145    133      278

For comparison (2 or more variables).
9. Table 3: Daily servings of calcium and vitamin D rich foods in relation to body mass index classification of the included adults.

Food items (servings/day)*                       Obese (N=91)        Non-obese (N=125)   P value
Milk                                             0.52 (0.71±0.3)     0.65 (0.88±0.7)     0.031
Milk beverage                                    0.45 (0.59±0.4)     0.35 (0.53±0.4)     0.279
Milk in cereals                                  0.20 (0.33±0.2)     0.50 (0.58±0.4)     0.001
Milk in coffee or tea                            0.15 (0.25±0.6)     0.20 (0.23±0.6)     0.790
- Total milk                                     0.90 (1.03±0.3)     1.20 (1.34±0.7)     0.001
Yoghurt                                          0.10 (0.12±0.6)     0.20 (0.14±0.5)     0.790
Cheese                                           0.20 (0.24±0.9)     0.20 (0.29±0.8)     0.661
Ice cream                                        0.15 (0.14±0.6)     0.06 (0.09±0.3)     0.422
- Total dairy                                    0.25 (0.45±0.6)     0.30 (0.43±0.7)     0.826
Tuna (canned)                                    0.05 (0.03±0.1)     0.03 (0.04±0.3)     0.761
Fish                                             0.15 (0.19±0.7)     0.10 (0.18±0.5)     0.902
Half cooked fish                                 0.06 (0.11±0.5)     0.25 (0.27±0.6)     0.029
Shrimp/oyster                                    0.05 (0.08±0.1)     0.05 (0.06±0.1)     0.149
Eggs                                             0.85 (0.81±1.1)     0.80 (0.76±0.7)     0.797
Liver (including chicken livers)                 0.02 (0.04±0.4)     0.05 (0.06±0.3)     0.834
Others!                                          0.20 (0.23±0.3)     0.40 (0.55±0.5)     0.549
- Dietary vitamin D (IU/day): Median (mean ±SD)  111.6 (118.1±73.5)  123.7 (132.2±67.4)  0.034
  Low dietary intake c (< 200 IU/day): No. (%)   56 (62.2)           47 (37.6)           0.003b
- Dietary calcium (mg/day): Median (mean ±SD)    660.0 (698.8±261.9) 692.0 (717.9±245.9) 0.223
  Low calcium intake d (< 1000 mg/day): No. (%)  51 (56.7)           49 (39.2)           0.011b
10. Assignment I
Table 1: Basic characteristics for the patients examined (N=278).

Baseline characteristics 1996                                      Total (N=278)
1- Men (%)                                                         52.2
2- Insulin users (%)                                               25.5
3- Smokers (%)                                                     23.0
4- Ex-smokers (%)                                                  28.1
5- Non-smokers (%)                                                 48.9
6- Age in years (mean ±SD)                                         67.24 ±11.74
7- Systolic blood pressure at starting point, mmHg (mean ±SD)      151.20 ±22.00
8- Systolic blood pressure at two years, mmHg (mean ±SD)           153.83 ±29.1
9- Duration of diabetes (median / quartiles 1-3)                   6.0 (2.75-12.25)
10- Missed values                                                  0.0
12. II- Graphs
Selection of graphs depends on:
1- Types of variables (categorical, numerical)
2- Number of variables
3- Objectives
Categorical variables:
Figure 1: Outcomes of the included diabetic patients (1996)
[Pie chart: alive, died from CVD, other cause of death, missing]
Figure 2: Smoking status of the included diabetic patients
[Bar chart: percent by smoking history (never, stopped smoking, yes)]
13. For numerical variables
Figure 3: Total cholesterol level in diabetic patients 1996 (in mmol/l)
[Histogram of total cholesterol: Mean = 6.25, Std. Dev = 1.33, N = 278]
14. Figure 4: Systolic blood pressure at starting point among diabetic patients 1996 (mmHg)
[Boxplot of syst. blood pressure at start by SEX (male, female), with outlying cases labelled]
15. Figure 6: Total cholesterol level in relation to gender and smoking status among diabetic patients 1996
[Error-bar chart: 95% CI of total cholesterol (mmol/l) by SEX and smoking history (never, stopped smoking, yes); N = 26, 64, 55 for males and 110, 14, 9 for females]
16. Figure 7: Duration of diabetes among the included patients 1996 (in years)
Checking for normality
[Histogram of duration of diabetes with a normal distribution curve: Mean = 7.9, Median = 6.0, Mode = 1, Std. Dev = 6.96, N = 278. Mode < Median < Mean; outliers lie in the right tail.]
17. III-Measures (numerical variables)
Central tendency: how the data aggregate around a central point
  Mean
  Median
  Mode
  Percentiles
Dispersion: how the data vary
  Range (max-min)
  Interquartile range
  Variance
  Standard deviation
  Coefficient of variation
18. Central Tendency
Mean = sum of the observations / their number = (x1 + x2 + x3 + ...)/number
  Affected by extreme values
Mode = the most frequently occurring value in a set of observations
Median = the middle value, which divides the ordered data set 50/50
  Not affected by extreme values
20. Dispersion
Ordered data: 1, 1, 6, 8, 10, 16, 17, 23, 43, 53
Range = 53 - 1 = 52
  Affected by extreme values
25th percentile (1st quartile) = 6: 25% of the data
50th percentile (median) = 13: 50% of the data
75th percentile (3rd quartile) = 23: 75% of the data
Interquartile range = 3rd quartile - 1st quartile = 23 - 6 = 17
  The IQR is not affected by extreme values
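The figures on this slide can be checked with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data, which reproduces the slide's values of 6 and 23; other quartile conventions (including those used by statistical packages) can give slightly different numbers. The variable names are illustrative only.

```python
from statistics import median

data = sorted([1, 1, 6, 8, 10, 16, 17, 23, 43, 53])

# Quartiles as medians of the lower and upper halves (an assumed convention
# that matches the slide; other definitions exist).
half = len(data) // 2
q1 = median(data[:half])
q2 = median(data)
q3 = median(data[-half:])

data_range = max(data) - min(data)
iqr = q3 - q1

print(f"range={data_range}, Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

Replacing the largest value (53) with a much larger one changes the range but leaves the IQR untouched, which is the point the slide makes about extreme values.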
21. Standard deviation and variance
Sample of 3, their ages in years: 3, 7, 17
Mean age = (3 + 7 + 17)/3 = 9
Differences from the mean: -6, -2, +8
The sum of the differences between the mean and the individual values = 0
The mean deviation = 0
To overcome the 0: sum the squared differences and divide by (number - 1) = variance
Variance = [(3-9)² + (7-9)² + (17-9)²]/(3-1) = 104/2 = 52
The amount of dispersion around the mean = 52 years² (wrong scale)
Hence we need to convert back to the usual (natural) scale: use the standard deviation
SD = √variance = √52 = ±7.2 years
22. The sample disperses around the mean (= 9 years) by 7.2 years in both directions
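As a quick check of the worked example (ages 3, 7 and 17), here is a minimal Python sketch using the standard library's sample variance and standard deviation, both of which divide by n - 1. Variable names are illustrative.

```python
from statistics import mean, stdev, variance

ages = [3, 7, 17]

m = mean(ages)                      # arithmetic mean
deviations = [x - m for x in ages]  # differences from the mean; they sum to 0
var = variance(ages)                # sample variance, denominator n - 1
sd = stdev(ages)                    # square root of the sample variance

print(f"mean={m}, deviations={deviations}, variance={var}, SD={sd:.1f}")
```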
23. Description of a binary (dichotomous) variable
o A binary variable has only two outcomes (diseased or not diseased).
o The proportion of the population that is diseased (at a certain point of time) is called prevalence.
o The occurrence of new cases is called incidence.
25. Probability and Odds
o Odds = chance
o In a population of 1000, 200 have a certain disease.
o When we randomly take one person out, the probability that this person is diseased = 200/1000 = 0.2 (this is the probability).
o The chance (the odds) that this person is diseased = probability of having the disease / probability of not having the disease.
o Odds = P/(1-P), where P = probability of disease and 1-P = probability of not having the disease
  = 0.2/0.8 = 1/4, so the odds are 1 to 4.
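The conversion between probability and odds can be written as two small functions. This is only an illustrative sketch; the function names are not from the slides.

```python
def odds_from_probability(p: float) -> float:
    """Odds = P / (1 - P)."""
    return p / (1 - p)


def probability_from_odds(odds: float) -> float:
    """Inverse conversion: P = odds / (1 + odds)."""
    return odds / (1 + odds)


p = 200 / 1000                    # 200 diseased in a population of 1000
odds = odds_from_probability(p)   # 0.2 / 0.8, i.e. 1 to 4

print(f"probability={p}, odds={odds:.2f}")
```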
26. The following table depicts the outcomes of an isoniazid/placebo trial among children with HIV (death within 6 months)

Interventions   Dead (within 6 months)   Alive   Total
Placebo         21                       110     131
Isoniazid       11                       121     132

What is the risk of dying?
Placebo: risk = 21/131 = 0.160
Isoniazid: risk = 11/132 = 0.083
Absolute risk difference (ARD) = risk in placebo - risk in isoniazid = 0.077
Net relative risk (NRR) = risk in placebo / risk in isoniazid = 0.160/0.083 = 1.928
Relative risk reduction (RRR) = (risk in placebo - risk in isoniazid) / risk in placebo = 0.48
Number needed to treat (NNT) = 1/ARD = 1/0.077 = 13
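The four measures on this slide follow from the two risks. A minimal sketch, with a hypothetical helper function, is shown below. Note that with unrounded risks the relative risk comes out at about 1.92; the slide's 1.928 results from dividing the rounded values 0.160 and 0.083.

```python
import math


def risk_measures(events_control, n_control, events_treated, n_treated):
    """Return (ARD, RR, RRR, NNT) for a two-arm trial (hypothetical helper)."""
    risk_c = events_control / n_control
    risk_t = events_treated / n_treated
    ard = risk_c - risk_t      # absolute risk difference
    rr = risk_c / risk_t       # risk in control relative to treated
    rrr = ard / risk_c         # relative risk reduction
    nnt = 1 / ard              # number needed to treat
    return ard, rr, rrr, nnt


ard, rr, rrr, nnt = risk_measures(21, 131, 11, 132)
print(f"ARD={ard:.3f}, RR={rr:.2f}, RRR={rrr:.2f}, NNT={math.ceil(nnt)}")
```

The NNT is rounded up because a fraction of a patient cannot be treated.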
27. Odds ratio (OR)
o An odds ratio (OR) is a measure of association between an exposure and an outcome.
o The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
o Odds ratios are most commonly used in case-control studies; however, they can also be used in cross-sectional and cohort study designs (with some modifications and/or assumptions).
28. Basic structure of case-control design
Population → Sample, traced from the present time (starting point) back to past time:
  Diseased (cases):        Exposed to factor (a)   Unexposed to factor (b)
  Disease-free (controls): Exposed to factor (c)   Unexposed to factor (d)
The odds ("chance") of exposure is calculated in both groups.
29. Calculation

Case control study   Diseased                  Not diseased                    Total
Exposed              Cases + exposed (a)       Exposed + not diseased (b)      a+b
Non-exposed          Cases + not exposed (c)   Not exposed + not diseased (d)  c+d

Odds ratio = (a/c) ÷ (b/d) = ad/bc
= odds of exposure among the diseased / odds of exposure among the non-diseased
OR=1 Exposure does not affect odds of outcome
OR>1 Exposure associated with higher odds of outcome
OR<1 Exposure associated with lower odds of outcome
30. Odds ratio

Case control study   Lung cancer   No lung cancer   Total
Smoking              a = 80        b = 30           110
None                 c = 20        d = 70           90

OR = ad/bc: 80 × 70 = 5600, 30 × 20 = 600, 5600/600 = 9.3
Or (80/20) ÷ (30/70) = 9.3
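A minimal sketch of the cross-product calculation for this 2×2 table follows; the function name is illustrative. Both routes shown on the slide (ad/bc and the ratio of the two odds of exposure) give the same number.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product odds ratio for a 2x2 table laid out as
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)


or_smoking = odds_ratio(80, 30, 20, 70)
same_via_odds = (80 / 20) / (30 / 70)   # odds of exposure in cases / in controls

print(f"OR={or_smoking:.2f}")
```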
31. Basic structure of cohort study
Population → Sample (disease-free at the present time, the starting point), followed into future time:
  Exposed to factor:   Develop disease (a)   Disease-free (b)
  Unexposed to factor: Develop disease (c)   Disease-free (d)
The relative risk is calculated for exposure, comparing the incidence of disease in each group.
32. Relative risk (RR)

Mammography   Breast cancer   No breast cancer   Total
Positive      a = 10          b = 90             100
Negative      c = 20          d = 100,080        100,100

In cohort design:
RR = [a/(a+b)] ÷ [c/(c+d)] = (10/100) ÷ (20/100,100) = 0.1/0.0002 = 500
33. Cohort study: the relative risk (RR)

              Lung cancer   No lung cancer   Total
Smokers       18            582              600
Non-smokers   6             1194             1200

Risk for smokers = 18/600 = 0.03
Risk for non-smokers = 6/1200 = 0.005
RR = 0.03/0.005 = 6
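The relative risk from a cohort table is the ratio of the two incidences. A short illustrative sketch (the function name is an assumption, not from the slides):

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """RR = incidence among the exposed / incidence among the unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


rr_smoking = relative_risk(18, 600, 6, 1200)
print(f"RR={rr_smoking:.1f}")
```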
34. Case control study: the odds ratio (OR)

              Lung cancer   No lung cancer   Total
Smokers       80            30               110
Non-smokers   20            70               90

Odds for smokers = 80/30 = 2.67
Odds for non-smokers = 20/70 = 0.29
OR = (80 × 70)/(30 × 20) = 9.33
35. Assignment I
Table 1: Basic characteristics for the patients examined (N=278).

Baseline characteristics 1996                                      Total (N=278)
1- Men (%)                                                         52.2
2- Insulin users (%)                                               25.5
3- Smokers (%)                                                     23.0
4- Ex-smokers (%)                                                  28.1
5- Non-smokers (%)                                                 48.9
6- Age in years (mean ±SD)                                         67.24 ±11.74
7- Systolic blood pressure at starting point, mmHg (mean ±SD)      151.20 ±22.00
8- Systolic blood pressure at two years, mmHg (mean ±SD)           153.83 ±29.1
9- Duration of diabetes (median / quartiles 1st-3rd)               6.0 (2.75-12.25)
10- Missed values                                                  --
37. 2b
Smoking history by sex
[Clustered bar chart, percent within SEX by smoking history: never 18% male, 83% female; stopped smoking 44% male, 11% female; yes 38% male, 7% female]
38. 3a
Age using Bar (mean used as summary)
[Bar chart of mean age (years) by SEX (male, female); vertical axis from 64 to 70]
39. 3b
Boxplot age by Sex
[Boxplot of age (years) by SEX: male N = 145, female N = 133; one outlying case (195) labelled]
This graph checks the data distribution and checks for outliers.
43. p95, p5 = Mean ± Z score (probability) at the specified percentile × (standard deviation)
Probability distribution of the normal curve: page 180
P95 SBP1 = 151.2 + 1.645(22.0) = 187.4 mmHg
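The same percentile can be obtained without a printed table. The sketch below assumes systolic blood pressure is approximately normal with the mean and SD quoted on the slide, and uses the standard library's `NormalDist`; it is an illustration, not the procedure used in the course.

```python
from statistics import NormalDist

# Assumed normal model for SBP at start: mean 151.2 mmHg, SD 22.0 mmHg
sbp = NormalDist(mu=151.2, sigma=22.0)

z95 = NormalDist().inv_cdf(0.95)   # about 1.645
p95 = sbp.inv_cdf(0.95)            # mean + z95 * SD
p5 = sbp.inv_cdf(0.05)             # mean - z95 * SD

print(f"z95={z95:.3f}, P95={p95:.1f} mmHg, P5={p5:.1f} mmHg")
```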
48. Population and Sample
o In scientific research we want to make a statement
(conclusion) about the population.
o Studying the whole population is impossible in terms
of money/time/labor.
o Instead, we take a random sample from the population and infer the needed conclusions from the sample data.
o The task of statistics is to quantify the uncertainty (whether the sample really represents that population).
49. The concept of sampling
Study population: you select a few sampling units from the study population.
Sample: you collect information from these people to find answers to your research questions.
You make an estimate ("prediction") extrapolated to the study population (prevalence, outcomes etc.).
50. What would be the mean systolic blood pressure of older subjects (65+) in Al Hassa?
Population mean (μ) = unknown
[Figure: individual values 175, 165, 180, 155]
From our sample we calculate an estimate of the population parameter
51. The good sample (the estimator) should be:
Unbiased: the mean of the sample = population mean
Precise (narrow dispersion about the mean): the dispersion in repeated samples is small
This is a dream
52. Sampling error
Four individuals A, B, C, D:
A = 18 years
B = 20 years
C = 23 years
D = 25 years
Their mean age = (18 + 20 + 23 + 25)/4 = 86/4 = 21.5 years (population mean μ).
53. Possible samples of two individuals (6 possibilities):
A+B = (18+20)/2 = 38/2 = 19.0 years
A+C = (18+23)/2 = 20.5 years
A+D = (18+25)/2 = 21.5 years
B+C = (20+23)/2 = 21.5 years
B+D = (20+25)/2 = 22.5 years
C+D = (23+25)/2 = 24.0 years
Sampling error = population mean - sample mean = ranges from -2.5 to +2.5 years.
Possible samples of three individuals (4 possibilities):
A+B+C = (18+20+23)/3 = 20.33 years
A+B+D = (18+20+25)/3 = 21.00 years
A+C+D = (18+23+25)/3 = 22.00 years
B+C+D = (20+23+25)/3 = 22.67 years
Error = ranges from -1.17 to +1.17 years.
If C = 32 (instead of 23) years and D = 40 (instead of 25) years, the population mean becomes 27.5 years: sampling of 2 gives a sampling error of -8.5 to +8.5 years, and sampling of 3 gives -3.17 to +4.17 years.
The greater the variability of a given variable, the larger the sampling error for a given sample size.
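The enumeration of all possible samples can be reproduced with `itertools.combinations`. This sketch defines sampling error as population mean minus sample mean, as on the slide; the function name is illustrative.

```python
from itertools import combinations
from statistics import mean


def sampling_error_range(values, k):
    """Smallest and largest (population mean - sample mean)
    over every possible sample of size k drawn without replacement."""
    mu = mean(values)
    errors = [mu - mean(sample) for sample in combinations(values, k)]
    return min(errors), max(errors)


ages = [18, 20, 23, 25]         # individuals A, B, C, D
wider_ages = [18, 20, 32, 40]   # same group with more variable C and D

for k in (2, 3):
    print(k, sampling_error_range(ages, k), sampling_error_range(wider_ages, k))
```

Running it shows both effects the slide describes: the error range narrows as the sample size grows and widens as the variability of the population grows.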
55. 2
o The normal distribution
o The standard error of the mean
o Estimation:
  Reference interval
  Confidence intervals for: mean, proportion, difference between means/proportions, RR and OR
56. Normal Distribution:
Many human traits, such as intelligence, personality and attitudes, as well as weight and height, are distributed among populations in a fairly normal way.
57.
58. The normal distribution
68% of values lie within μ ± 1 SD (σ)
95% of values lie within μ ± 2 SD (σ)
> 2 SDs: possible outliers
> 3 SDs: definite outliers
59. One more
The Z score measures how many standard deviations a particular data point is above or below the mean.
o Unusual observations would have a Z score over 2 or under -2.
o Extreme observations would have Z scores over 3 or under -3 and should be investigated as potential outliers.
Z = (x - x̄)/s
60. Areas under the standard normal curve.

Z        Area between both points   Beyond both points   Beyond one point
         (around the mean)          (two tails)          (one tail)
±0.1     0.080                      0.920                0.4600
±0.2     0.159                      0.841                0.4205
±0.3     0.236                      0.764                0.3820
±0.4     0.311                      0.689                0.3445
±0.5     0.383                      0.617                0.3085
±0.6     0.451                      0.549                0.2745
±0.7     0.516                      0.484                0.2420
±0.8     0.576                      0.424                0.2120
±0.9     0.632                      0.368                0.1840
±1       0.683                      0.317                0.1585
±1.1     0.729                      0.271                0.1355
±1.2     0.770                      0.230                0.1150
±1.3     0.806                      0.194                0.0970
±1.4     0.838                      0.162                0.0810
±1.5     0.866                      0.134                0.0670
±1.6     0.890                      0.110                0.0550
±1.645   0.900                      0.100                0.0500
±1.7     0.911                      0.089                0.0445
±1.8     0.928                      0.072                0.0360
±1.9     0.943                      0.057                0.0290
±1.96    0.950                      0.050                0.0250
±2       0.954                      0.046                0.0230
±2.1     0.964                      0.036                0.0180
±2.2     0.972                      0.028                0.0140
±2.3     0.979                      0.021                0.0105
±2.4     0.984                      0.016                0.0080
±2.578   0.99                       0.010                0.0050
61. Calculating values from Z-scores
Xi = Mean ± Z × (standard deviation)
Value (percentile) = Mean ± Z score × (SD)
62. Random sample for estimating a population mean
μ = ?
X1 = 128
X2 = 133
X3 = 129
From the information in the sample, we will estimate the unknown population mean (x̄ is an estimator for μ).
What could have happened if we had another random sample?
What is the measure of variation of sample means?
63. The Sampling Distribution of a Sample Statistic
≈ Let's assume that we want to survey a community of 400 whose ages were recorded, with the following parameters:
µ = 35 years
σ = 13 years
≈ Let's assume, however, that we do not survey all 400; instead we randomly select 120 people, ask them about their ages and calculate the mean age.
≈ Then we put them back into the community and randomly select another 120 residents (which may include members of the first sample). We do this over and over, and each time we calculate the mean age.
≈ The results will be like those in the following table.
64. Distribution of 20 random sample means (n=20)

Sample number   Sample mean
1               34.7
2               35.9
3               35.5
4               34.7
5               34.5
6               34.4
7               35.7
8               34.6
9               37.4
10              35.3
11              34.1
12              35.5
13              34.9
14              36.2
15              35.6
16              35.0
17              35.1
18              36.4
19              35.6
20              33.6
SD of the means 13.37

[Dot plot of the 20 sample means around μ; axis from 33 to 37]
All the results are clustered around the population value (35 years), with a few scores a bit further out and one extreme score of 37.4 years (random variation = 1/20 = 5%).
Those 400 people have an age range from 2 to 69 years, while the means of the samples have a very narrow range of values of about 4 years, and 10 samples coincide with the population mean (35 years).
65. Most of the samples will cluster around the population parameter, with an occasional sample result falling relatively further to one side or the other of the distribution (this is called the sampling distribution of sample means).
It has the following properties:
The mean of the sampling distribution is equal to the population mean: the average of the averages (µx̄) will be the same as the population mean.
The standard deviation of the sample means = the standard error, SE = σ/√n (σ = population SD).
The distribution of the sample means is Normal if the population distribution is Normal.
If the population distribution is not Normal, the distribution of the sample means is almost Normal when n is large (Central Limit Theorem).
66. Standard error of the mean
Population parameters: mean, S.D
Sample statistics: sample mean, S.D
The standard error is the degree to which the sample statistics deviate (differ) from the population parameters.
The term "error" indicates that, due to sampling error, each sample mean is likely to deviate somewhat from the true population mean.
67.
68. Central Limit Theorem
The formula for SE = SD/√n.
The formula indicates that we are estimating the SE given the S.D of a sample of size n.
For a sample of 100 and S.D of 40, the SE = 40/√100 = 4.
For a sample of 1000 and S.D of 40, the SE = 40/√1000 = 1.26.
Two factors influence the SE: the sample size and the S.D of the sample.
Sample size has the greater impact, as it is used in the denominator.
For a sample of 100 and S.D of 20, the SE = 20/√100 = 2.
For a sample of 100 and S.D of 40, the SE = 40/√100 = 4.
The more variability there is within a sample, the greater the SE.
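The four numerical examples above can be reproduced with a one-line function. A minimal sketch (the function name is illustrative):

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: SE = SD / sqrt(n)."""
    return sd / math.sqrt(n)


# Effect of sample size (same SD) and of variability (same n)
for sd, n in [(40, 100), (40, 1000), (20, 100)]:
    print(f"SD={sd}, n={n}: SE={standard_error(sd, n):.2f}")
```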
69.
70. Confidence Interval (CI)
A confidence interval gives an estimated
range of values which is likely to include
an unknown population parameter, the
estimated range being calculated from a
given set of sample data.
71. We need to know the smallest and the largest μ (range) we think is likely, using sample statistics.
The mean of the sample is our estimate of μ.
72.
73. c = level of confidence; Zc = Z critical values (under the normal curve)

c      Zc
90%    1.645
95%    1.960
99%    2.578

C.I = x̄ ± Zc × σ/√n
C.I = mean of the sample ± Z critical score × (SEM)
SEM = SD/√n
74. C.I
• The confidence interval provides a range that
is highly likely (often 95% or 99%) to contain
the true population parameter that is being
estimated.
• The narrower the interval the more informative
is the result.
• It is usually calculated using the estimate
(sample mean) and its standard error (SEM).
75. CI for μ
Systolic blood pressure in 278 diabetic patients

Descriptives: syst. blood pressure at start
                             All patients              Random sample of 50
                             Statistic   Std. Error    Statistic   Std. Error
Mean                         151.20      1.319         155.06      3.064
90% CI for Mean, Lower Bound 149.02                    149.92
90% CI for Mean, Upper Bound 153.38                    160.20
5% Trimmed Mean              150.30                    154.72
Median                       150.00                    151.20
Variance                     483.880                   460.033
Std. Deviation               21.997                    21.448
Minimum                      100                       115
Maximum                      220                       205
Range                        120                       90
Interquartile Range          30.00                     30.00
Skewness                     .540        .146          .263        .340
Kurtosis                     .152        .291          -.506       .668

90% C.I = 151.20 ± 1.65 (21.997/√278)
C.I = 149.02-153.38 mmHg
76. Descriptives: syst. blood pressure at start
                             All patients              Random sample of 50
                             Statistic   Std. Error    Statistic   Std. Error
Mean                         151.20      1.319         155.06      3.064
95% CI for Mean, Lower Bound 148.60                    148.90
95% CI for Mean, Upper Bound 153.80                    161.22
5% Trimmed Mean              150.30                    154.72
Median                       150.00                    151.20
Variance                     483.880                   460.033
Std. Deviation               21.997                    21.448
Minimum                      100                       115
Maximum                      220                       205
Range                        120                       90
Interquartile Range          30.00                     30.00
Skewness                     .540        .146          .263        .340
Kurtosis                     .152        .291          -.506       .668

95% C.I = 151.20 ± 1.96 (21.997/√278)
C.I = 148.60-153.80 mmHg
77. Descriptives: syst. blood pressure at start
                             All patients              Random sample of 50
                             Statistic   Std. Error    Statistic   Std. Error
Mean                         151.20      1.319         155.06      3.064
99% CI for Mean, Lower Bound 147.78                    146.84
99% CI for Mean, Upper Bound 154.62                    163.28
5% Trimmed Mean              150.30                    154.72
Median                       150.00                    151.20
Variance                     483.880                   460.033
Std. Deviation               21.997                    21.448
Minimum                      100                       115
Maximum                      220                       205
Range                        120                       90
Interquartile Range          30.00                     30.00
Skewness                     .540        .146          .263        .340
Kurtosis                     .152        .291          -.506       .668

99% C.I = 151.20 ± 2.58 (21.997/√278)
C.I = 147.78-154.62 mmHg
78. 90% C.I = 151.20 ± 1.65 (21.997/√278)
C.I = 149.02-153.38 mmHg
95% C.I = 151.20 ± 1.96 (21.997/√278)
C.I = 148.60-153.80 mmHg
99% C.I = 151.20 ± 2.58 (21.997/√278)
C.I = 147.78-154.62 mmHg
What does this mean? It means that if the same population is sampled on numerous occasions and interval estimates are made on each occasion, the resulting intervals would bracket the true population parameter in approximately 90, 95 and 99% of the cases.
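The three intervals can be recomputed from the sample mean, SD and size. The sketch below is a normal-approximation version using z critical values from `NormalDist`; it takes n = 278, the value implied by the reported standard error of 1.319. SPSS bases its intervals on the t distribution, so its bounds differ from these in the second decimal place. The function name is illustrative.

```python
import math
from statistics import NormalDist


def ci_mean(mean: float, sd: float, n: int, level: float):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided critical value
    sem = sd / math.sqrt(n)                     # standard error of the mean
    return mean - z * sem, mean + z * sem


for level in (0.90, 0.95, 0.99):
    lo, hi = ci_mean(151.20, 21.997, 278, level)
    print(f"{level:.0%} CI: {lo:.2f} to {hi:.2f} mmHg")
```

The output makes the trade-off visible: a higher level of confidence gives a wider, less informative interval.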
79. The sampling distribution of a proportion
p = K/n
µp = π
SE(p) = √[p(1 - p)/n]
95% CI for p = p ± 1.96 × SE(p)
(1.96 is the Z critical score for 95%)
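As an illustration of these formulas, the sketch below computes a 95% CI for the proportion of men in the diabetes data (145 of 278, taken from the frequency table earlier in the slides). Choosing that example is an assumption made here; the slide itself gives no numbers. The function name is illustrative.

```python
import math


def ci_proportion(k: int, n: int, z: float = 1.96):
    """Proportion with its normal-approximation CI: p ± z * sqrt(p(1-p)/n)."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se


p, lo, hi = ci_proportion(145, 278)   # men among the 278 patients
print(f"p={p:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```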
81. 95% CI for the difference between two means (μ1 - μ2)

Smoke        n     Mean SBP   SE (mean)
No           214   153.1      1.50
Yes          64    144.8      2.62
Difference         8.3

95% CI = (x̄1 - x̄2) ± 1.96 × SE(x̄1 - x̄2)
SE(x̄1 - x̄2) = √[(SE(x̄1))² + (SE(x̄2))²]
C.I = 2.4 to 14.2
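A short sketch of this calculation, combining the two standard errors as on the slide (variable names are illustrative):

```python
import math

mean_no, se_no = 153.1, 1.50     # non-smokers (n = 214)
mean_yes, se_yes = 144.8, 2.62   # smokers (n = 64)

diff = mean_no - mean_yes
se_diff = math.sqrt(se_no**2 + se_yes**2)   # SE of the difference
lo, hi = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"difference={diff:.1f}, SE={se_diff:.2f}, 95% CI {lo:.1f} to {hi:.1f}")
```

Because the interval does not include 0, the difference in mean SBP is unlikely to be due to sampling variation alone.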
82. 95% CI for percentage

Smoke (n)   died %   SE
No (212)    28.8     3.11
Yes (64)    23.4     5.30

Difference = 5.4%
95% CI = (Pns - Ps) ± 1.96 × SE(Pns - Ps)
SE = √[P1 × (100 - P1)/n1 + P2 × (100 - P2)/n2]
95% C.I = -6.7% to 17.4%
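The same steps for two percentages are sketched below, working on the percentage scale as the slide does. With unrounded intermediate values the lower limit comes out at about -6.6%, marginally different from the slide's -6.7%, presumably because of rounding in the original calculation.

```python
import math

p_ns, n_ns = 28.8, 212   # % died among non-smokers
p_s, n_s = 23.4, 64      # % died among smokers

diff = p_ns - p_s
se_diff = math.sqrt(p_ns * (100 - p_ns) / n_ns + p_s * (100 - p_s) / n_s)
lo, hi = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"difference={diff:.1f}%, SE={se_diff:.2f}, 95% CI {lo:.1f}% to {hi:.1f}%")
```

Here the interval includes 0, so the data are compatible with no difference in mortality between the two groups.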
83. 95% CI for RR and OR
Use available software:
http://www.medcalc.org/calc/odds_ratio.php
http://www.medcalc.org/calc/relative_risk.php
vl.academicdirect.org/applied_statistics/.../CIcalculator.xls
85. Inferential Statistics
Testing in research
o In scientific research we would like to test if our
research ideas are true.
o Based on previous observations (studies) we know
that the mean cholesterol of patients with diabetes is
higher than those without the disease.
o We will take samples and check whether the results
will agree with our expectations.
o Meaning we are going to test the situation using a
statistical test.
86. The Z-test for one sample
Serum cholesterol in the population: μ = 5 mmol/L, σ = ±1.5
Diabetic patients: mean cholesterol > 5
Considering σ = ±1.5, is there any difference between the diabetes-free population and the diabetic patients regarding serum cholesterol? Let's perform a Z test.
87. Research question (hypothesis)
The research hypothesis would be: the mean cholesterol of diabetics is > 5 mmol/L
Null hypothesis
H0: μ = 5 (the mean of the diabetics equals the population mean)
Alternative hypothesis
H1: μ > 5 (one sided)
Or
H1: μ ≠ 5 (two sided)
88. Procedure
Compare μ = 5 with the mean of the sample.
If the sample mean is close to the population mean: we do not reject the null hypothesis
If the sample mean differs from the population mean: we REJECT the null
[Histogram: cholesterol level of diabetic patients in mmol/L (total cholesterol): Mean = 6.25, Std. Dev = 1.33, N = 278]
89. The α level (P value)
P value: the probability of obtaining a result like the observed one if the null hypothesis is true, i.e. if there is no difference between the population mean and the sample mean.
α level: the maximum probability we accept of rejecting the null hypothesis falsely.
α = 0.05
90. Alpha level
P > 0.05 (α): accept (do not reject) the null. Sample mean = population mean
P ≤ 0.05 (α): reject the null. Sample mean ≠ population mean
91. Calculation (σ = 1.5)
SEM = σ/√n = 0.3
Z = (mean of sample - μ)/SEM
P(mean of the sample ≥ 6) = P(Z ≥ (6 - 5)/0.3) = P(Z ≥ 3.33) = 0.0005
Under the normal curve, the area of rejection is Z > 1.96
P = 0.0005: the cholesterol blood level of diabetic patients would coincide with that of the (disease-free) population in 5 out of 10,000 times if we repeated this test 10,000 times.
P < 0.05, so we reject the null
The diabetics have a larger mean cholesterol level than the normal population
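The Z calculation can be sketched in a few lines. The sample size is not stated on the slide; n = 25 is assumed here because it is the value that gives SEM = 1.5/√n = 0.3. The one-sided P value is read from the standard normal distribution instead of the printed table.

```python
import math
from statistics import NormalDist

mu, sigma = 5.0, 1.5   # population mean and SD (mmol/L)
sample_mean = 6.0
n = 25                 # assumption: implied by SEM = 0.3 on the slide

sem = sigma / math.sqrt(n)
z = (sample_mean - mu) / sem
p_one_sided = 1 - NormalDist().cdf(z)   # P(Z >= z) under H0

print(f"SEM={sem:.1f}, Z={z:.2f}, one-sided P={p_one_sided:.4f}")
```

The result, roughly 0.0004, rounds to the order of magnitude quoted on the slide (0.0005) and is far below α = 0.05.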
92. In reality
It is unlikely that the σ (population SD) is
known.
In most of the cases, σ will be unknown and
we will be able to apply neither the formula
nor the table of normal distribution (areas
under the curve=Z score).
We resort to other statistical tests.
94. Possible situations in hypothesis testing

                     Reality
Decision             H0 is true          H0 is not true
Reject H0            Type I error (α)    OK (1-β)
Do not reject H0     OK (1-α)            Type II error (β)

α = level of significance
Power = 1-β: the probability of rejecting the null hypothesis if it is NOT TRUE
Usually 80% is the least required for any test
95. Errors of Hypothesis Testing and Power
Decisions and errors in hypothesis testing (conclusion from hypothesis testing)

Study results: difference exists (reject H0)
  True situation, difference exists (H1): correct decision (power or 1-β)
  True situation, no difference (H0): Type I error or α. False rejection: rejection when H0 is true, concluding there is a difference when there really is not.
Study results: no difference (do not reject H0)
  True situation, difference exists (H1): Type II or β error. False acceptance: concluding there is no difference when it is really present.
  True situation, no difference (H0): correct decision
96. Passive smoking and lung cancer
Truth about the population: passive smoking is related to lung cancer, or not related to lung cancer.
Conclusions, based on results from a study of a sample of the population:
Reject the null hypothesis (rates in the study appear to be different)
  Type I Error (incorrect rejection): concluding that passive smoking is related to lung cancer when it really is not.
Accept the null hypothesis (rates in the study appear similar)
  Type II Error (incorrect acceptance): concluding that passive smoking is not related to lung cancer when it really is.
97. The Alpha-Fetoprotein (AFP) test has both Type I and Type II error possibilities.
This test screens the mother's blood during pregnancy for AFP and determines risk.
Abnormally high or low levels may indicate Down syndrome.
H0: patient is healthy
Ha: patient is unhealthy
Error Type I (False positive or False Rejection): the test wrongly indicates that the patient has Down syndrome, which means that the pregnancy may be aborted for no reason.
Error Type II (False negative or False Acceptance): the test is negative and the child will be born with multiple anomalies.
101. t-distribution
In real life situations we will estimate the unknown population SD using the sample SD.
Results are standardized to the t-distribution:
t = (x̄ - µ)/(s/√n)
Z test for normal distribution (the population SD is known):
Z = (x̄ - µ)/(σ/√n)
103. Degree of freedom (df)
For all sample statistics (variance, SD) we used n-1.
All the observations in any given sample are free except one = complementary effect.
106. t-test: steps to determine the statistical difference
When? Descriptive statistics: mean ± standard deviation
Number of samples:
One sample vs. population mean: t = (x̄ - µ)/(SD/√n)
Two independent samples: t = (x̄1 - x̄2)/√(SD1²/n1 + SD2²/n2)
Two dependent (paired) samples, repeated measures or matched pairs: t dependent = d̄/SE(d̄)
Steps:
1- State the hypothesis to be tested:
   Null: mean = mean
   Alternative: mean ≠ mean (non-directional, two tailed), or mean > mean / mean < mean (unidirectional, one tailed)
2- Find the calculated t value using the formulae.
3- Find the degree of freedom: n-1 for one sample and for paired samples; for two independent samples df = (n1-1) + (n2-1) = n1+n2-2.
4- Find the P value using the tables of the t-distribution.
5- Conclude: if P < 0.05 the null is rejected. If P > 0.05 the null is accepted.
107. t-test (Student's t-test) one sample
t = (x̄ - µ)/(SD/√n)
Using diabetes data: is the mean age of diabetics > 65 years?
H0: μ = 65
H1: μ ≠ 65
t one sample = (67.24 - 65)/(SD/√n) = 3.18
t distribution: P = 0.002
Reject the null
Diabetics are significantly older than 65 years

Statistics: age (years)
N Valid              278
N Missing            0
Mean                 67.24
Std. Error of Mean   .704
Std. Deviation       11.743
Variance             137.902
108. One-Sample Test (Test Value = 65)

              t       df    Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                                Lower    Upper
age (years)   3.182   277   .002              2.24              .85      3.63

df = degree of freedom; Sig. (2-tailed) = P value (two sided)
Assuming that the distribution of age is normal
Population SD (σ) is unknown
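The t statistic in the SPSS output can be reproduced from the summary statistics alone. The sketch below computes only t and its degrees of freedom with the standard library; the P value would then be looked up in a t table or obtained from a statistics package, which is not done here.

```python
import math

sample_mean, sd, n = 67.24, 11.743, 278   # age (years) in the diabetes data
mu0 = 65                                  # test value under H0

sem = sd / math.sqrt(n)          # standard error of the mean
t = (sample_mean - mu0) / sem    # one-sample t statistic
df = n - 1

print(f"SEM={sem:.3f}, t={t:.2f}, df={df}")
```

The small difference from the SPSS value of 3.182 comes from using the rounded mean 67.24.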
109. t-test for comparison of means of two
independent samples
H0: Smoking has no effect on systolic blood pressure
Mean S= Mean NS or Mean S-mean NS=0
H1: smoking has an effect
Mean S≠ Mean NS or Mean S-Mean NS≠0
Assumptions:
•Independent observations (2 samples)
•Normally distributed
•Equal variances (for the pooled t-test)
110. Three formulae
Standardized:
t = [(x̄1 - x̄2) - 0]/√(S1²/n1 + S2²/n2)
(0 = expected difference if H0 is true; the denominator is the SD of the difference)
If SDs are equal:
t = (x̄1 - x̄2)/√(Sp²/n1 + Sp²/n2)
Pooled SD: Sp² = [(n1 - 1)S1² + (n2 - 1)S2²]/[(n1 - 1) + (n2 - 1)]
If SDs are not equal:
t = (x̄1 - x̄2)/√(S1²/n1 + S2²/n2)
Decision based on Levene's test
111. Variances are apparently equal

Group Statistics: syst. blood pressure at start
SMOKING    N     Mean     Std. Deviation   Std. Error Mean
no         214   153.11   21.995           1.504
smokers    64    144.82   20.934           2.617

Independent Samples Test: syst. blood pressure at start
Levene's Test for Equality of Variances: F = .006, Sig. = .936 (not significant: it means equal variances)

t-test for Equality of Means (two separate t-tests)
                              t       df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI of the Difference
                                                                                                            Lower    Upper
Equal variances assumed       2.674   276       .008              8.29              3.100                   2.188    14.392
Equal variances not assumed   2.747   107.982   .007              8.29              3.018                   2.308    14.272

P value < 0.05, reject H0
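Both rows of the SPSS table can be reproduced from the group statistics. This sketch implements the pooled (equal variances) and unpooled (Welch) formulae, including the Welch-Satterthwaite degrees of freedom; it stops at t and df and does not compute P values. Variable names are illustrative.

```python
import math

n1, m1, s1 = 214, 153.11, 21.995   # non-smokers
n2, m2, s2 = 64, 144.82, 20.934    # smokers
diff = m1 - m2

# Equal variances assumed: pooled variance
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_pooled = diff / math.sqrt(sp2 / n1 + sp2 / n2)
df_pooled = n1 + n2 - 2

# Equal variances not assumed: Welch's t with Welch-Satterthwaite df
v1, v2 = s1**2 / n1, s2**2 / n2
t_welch = diff / math.sqrt(v1 + v2)
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled: t={t_pooled:.3f}, df={df_pooled}")
print(f"Welch:  t={t_welch:.3f}, df={df_welch:.1f}")
```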
112. Paired t-test
Used if we have paired data (two repeated measurements on the same subjects, or before and after),
and if the differences of the paired observations are Normally distributed.
113. Paired samples (dependent)
• Paired/dependent 2-sample t-test
• To compare observations collected from the same group of individuals on 2 separate occasions (dependent observations or paired samples).
The paired t statistic is calculated by:
- Calculate the difference between the 2 measurements taken on each individual.
- Calculate the mean of the differences (md).
- Calculate the SE of the observed differences (SEd).
- Under the null hypothesis of no difference (difference = 0), the paired t statistic takes the form t = (md - 0)/SEd
- t = mean difference / SE of the difference.
- It has a t distribution with degrees of freedom = (n-1)
114. Example
Four students had the following scores in 2 subsequent tests.
Is there a significant difference in their performance?

Number   Name       Test 1   Test 2   Dif
1        Mike       35%      67%      -32
2        Melanie    50%      46%      4
3        Melissa    90%      86%      4
4        Mitchell   78%      91%      -13

Mean Dif = -9.25, SD Dif = 17.152, SE Dif = 8.58
Calculated paired t = (md - 0)/SEd = -9.25/8.58 = -1.078
df = n-1 = 3
116. Conclusion
The observed difference can be encountered in 36 out of 100 cases (actual P value = 0.362),
i.e. we accept the null hypothesis of no difference between the first and 2nd test.
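The paired t statistic for the four students can be recomputed step by step, following the procedure listed earlier (differences, their mean, their SE, then t). A minimal sketch with the standard library; the P value itself is not computed here.

```python
import math
from statistics import mean, stdev

test1 = [35, 50, 90, 78]   # Mike, Melanie, Melissa, Mitchell
test2 = [67, 46, 86, 91]

diffs = [a - b for a, b in zip(test1, test2)]   # Test 1 - Test 2
md = mean(diffs)                                # mean difference
sed = stdev(diffs) / math.sqrt(len(diffs))      # SE of the differences
t = (md - 0) / sed
df = len(diffs) - 1

print(f"differences={diffs}, md={md}, SEd={sed:.2f}, t={t:.3f}, df={df}")
```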
117. Paired Samples Statistics

Pair 1                               Mean     N     Std. Deviation   Std. Error Mean
syst. blood pressure at start        151.20   278   21.997           1.319
syst. blood pressure after 2 years   153.83   278   29.076           1.744

Paired Samples Test
Pair 1: syst. blood pressure at start - syst. blood pressure after 2 years
Paired Differences: Mean = -2.63, Std. Deviation = 17.920, Std. Error Mean = 1.075
95% Confidence Interval of the Difference: Lower = -4.74, Upper = -.51
t = -2.443, df = 277, Sig. (2-tailed) = .015
118. Test of significance
Interval/ratio data
Parametric, assuming normal distribution
Number of samples:
One sample vs. population
  Known population variance (σ): one sample Z-test, Z = (x̄ - µ)/(σ/√n), rejection limit > ±1.96
  Unknown population variance: one sample t-test, reject if P ≤ 0.05
Two samples
  Dependent: paired t-test
  Independent: independent t-test
119. The Chi-Square test (χ²)
Used for hypothesis testing for categorical variables
Many types, depending on design, distribution of variables and objectives of testing
120. χ²: Example
Vaccination against influenza decreases the risk of getting the disease.
Study: compare the effectiveness of 5 vaccines with respect to the probability of getting influenza.
Comparison will be in respect to a nominal variable (getting influenza: yes or no).
121. Effectiveness of Five Vaccines
Data cross tabulated 2X5: response variable: Influenza

Frequency
Vaccines   Influenza No   Influenza Yes   Total
1          237            43              280
2          198            52              250
3          245            25              270
4          212            48              260
5          233            57              290
Total      1125           225             1350

% within Vaccines
Vaccines   Influenza No   Influenza Yes   Total
1          84.6           15.4            100
2          79.2           20.8            100
3          90.7           9.3             100
4          81.5           18.5            100
5          80.3           19.7            100
Total      83.3           16.7            100

The "Influenza Yes" percentages give the probability to get influenza.
The null hypothesis states that the probability to get influenza is independent of the vaccine.
The alternative states that a dependency exists.
122. Effectiveness of Five Vaccines
If H0 is true:
The probability of getting influenza in every group should be the same = the probability in the total population, equal to 225/1350 = 0.167 (16.7%).
Vaccine 1 was used in 280; if H0 is true, we expect 16.7% (≈47) to get influenza.
However, this is not what was observed (43 got influenza).
123. Expected frequencies
For any cell: Expected frequency = row total × column total / grand total

Vaccines            Influenza No   Influenza Yes   Total (row total)
1   Observed        237            43              280
    Expected        233.3          46.7
2   Observed        198            52              250
    Expected        208.3          41.7
3   Observed        245            25              270
    Expected        225.0          45.0
4   Observed        212            48              260
    Expected        216.7          43.3
5   Observed        233            57              290
    Expected        241.7          48.3
Total (column total) 1125          225             1350 (grand total)

Examples: expected "Yes" for vaccine 1 = 280 × 225/1350; expected "No" for vaccine 4 = 1125/1350 × 260
124. Pearson Chi-square test
Calculate the expected frequencies (assuming H0 is true) for all the ten cells.
Calculate Chi square: χ² = Σ (Of - Ef)²/Ef
Of = observed frequency, Ef = expected frequency
Reject H0 if χ² is large
Use the Chi-square distribution after determining the degree of freedom (df):
df = (r-1) × (c-1)
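Applied to the five-vaccine table, the steps above can be sketched as follows. The code computes the expected frequencies, the χ² statistic and the degrees of freedom with plain Python; the comparison with the 5% critical value for df = 4 (9.488, from standard χ² tables) is added here for illustration and is not on the slides.

```python
# Observed counts per vaccine: (influenza no, influenza yes)
observed = [(237, 43), (198, 52), (245, 25), (212, 48), (233, 57)]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(col_totals) - 1)

CRITICAL_5PCT_DF4 = 9.488   # tabulated chi-square critical value, df = 4
print(f"chi2={chi2:.2f}, df={df}, reject H0: {chi2 > CRITICAL_5PCT_DF4}")
```

Most of the statistic comes from vaccine 3, where only 25 cases were observed against 45 expected.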
127. SMOKING * SEX Crosstabulation

                                   SEX
SMOKING                            male     female   Total
no        Count                    90       124      214
          % within SMOKING         42.1%    57.9%    100.0%
smokers   Count                    55       9        64
          % within SMOKING         85.9%    14.1%    100.0%
Total     Count                    145      133      278
          % within SMOKING         52.2%    47.8%    100.0%

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             38.017b   1    .000
Continuity Correction a        36.279    1    .000
Likelihood Ratio               41.649    1    .000
Fisher's Exact Test                                                   .000                   .000
Linear-by-Linear Association   37.880    1    .000
N of Valid Cases               278
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.62.

At least 80% of cells must have Ef > 5
128. We can't use the Pearson Chi-square if the expected frequency is < 5
In this case we use Fisher's Exact test
129. status * SEX Crosstabulation (Count)

                         SEX
status                   male   female   Total
alive                    24     15       39
died from CVD            4      1        5
other cause of death     2      2        4
Total                    30     18       48

Expected f = 4 × 30/48 = 2.5 (< 5)
Expected f = 5 × 18/48 = 1.875 (< 5)
Fisher Exact test provides correction
130. Chi-Square Tests
                               Value   df   Asymp. Sig. (2-sided)
Pearson Chi-Square             .935a   2    .626
Likelihood Ratio               .991    2    .609
Linear-by-Linear Association   .004    1    .951
N of Valid Cases               48
a. 4 cells (66.7%) have expected count less than 5. The minimum expected count is 1.50.
Chi-square is not valid
131. Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             38.017b   1    .000
Continuity Correction a        36.279    1    .000
Likelihood Ratio               41.649    1    .000
Fisher's Exact Test                                                   .000                   .000
Linear-by-Linear Association   37.880    1    .000
N of Valid Cases               278
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.62.
132. McNemar test
Paired data in a cross tabulation
54 eczematous persons use ointment A or B on both arms (randomized)

               Ointment B
Ointment A     No     +      Total
+              10     16     26
No             5      23     28
Total          15     39     54

The McNemar test only takes the discordant pairs into account
χ² = (23-10)²/(23+10)
df = 1
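The McNemar statistic uses only the two discordant cells. The sketch below evaluates the slide's formula (without continuity correction) and compares it with 3.841, the tabulated 5% critical value of χ² for df = 1; that comparison is added here for illustration.

```python
def mcnemar_chi2(discordant_1: int, discordant_2: int) -> float:
    """McNemar chi-square (no continuity correction) from the discordant pairs."""
    return (discordant_1 - discordant_2) ** 2 / (discordant_1 + discordant_2)


chi2 = mcnemar_chi2(23, 10)   # discordant pairs from the ointment table
CRITICAL_5PCT_DF1 = 3.841     # tabulated chi-square critical value, df = 1

print(f"chi2={chi2:.2f}, significant at 5%: {chi2 > CRITICAL_5PCT_DF1}")
```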