6. Types of Data
Discrete data: limited number of choices
Binary: two choices (yes/no)
Dead or alive
Disease-free or not
Categorical: more than two choices, not ordered
Race
Age group
Ordinal: more than two choices, ordered
Stages of a cancer
Likert scale for response
E.g. strongly agree, agree, neither agree nor disagree, etc.
7. Types of data
Continuous data
Theoretically infinite possible values (within
physiologic limits), including fractional values
Height, age, weight
Can be interval
Interval between measures has meaning.
Ratio of two interval data points has no meaning
(temperature in Celsius, day of the year).
Can be ratio
Ratio of the measures has meaning
Weight, height
8. Types of Data
Why important?
The type of data defines:
The summary measures used
Mean, Standard deviation for continuous data
Proportions for discrete data
Statistics used for analysis:
Examples:
T-test for normally distributed continuous
Wilcoxon Rank Sum for non-normally distributed
continuous
9. Descriptive Statistics
Characterize data set
Graphical presentation
Histograms
Frequency distribution
Box and whiskers plot
Numeric description
Mean, median, SD, interquartile range
13. Box and Whisker Plots
Popular in Epidemiologic Studies
Useful for presenting comparative data graphically
14. Numeric Descriptive Statistics
Measures of central tendency of data
Mean
Median
Mode
Measures of variability (dispersion)
of data
Standard Deviation, mean deviation
Interquartile range, variance
15. Mean
Most commonly used measure of central tendency
Best applied in normally distributed continuous data.
Not applicable in categorical data
Definition:
Sum of all the values in a sample, divided by the number of
values.
16. E.g. mean height of 6 adolescent
children: 146, 142, 150, 148, 156, 140
Ans?
882/6 = 147
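The arithmetic can be checked with a short Python sketch. The variable names are illustrative; the six heights are the ones given on the slide.

```python
# Heights of the six adolescents from the slide (units as given there).
heights = [146, 142, 150, 148, 156, 140]

total = sum(heights)             # sum of all the values in the sample
average = total / len(heights)   # divided by the number of values

print(total, average)
```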
17. Median
Used to indicate the “average” in a
skewed population
Often reported with the mean
If the mean and the median are the same,
the distribution is symmetric (consistent with,
but not proof of, a normal distribution).
18. It is the middle value from an ordered
listing of the values
If an odd number of values, it is the middle
value: 1, 2, 3, 4, 5, i.e. 3
If an even number of values, it is the average
of the two middle values: 1, 2, 3, 4, 5, 6, i.e.
(3 + 4)/2 = 3.5
It is the mid-value (50th percentile) within the interquartile range
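A minimal sketch of the odd/even rule, assuming a plain list of numbers; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Middle value of an ordered listing; average of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle pair

print(median([1, 2, 3, 4, 5]))
print(median([1, 2, 3, 4, 5, 6]))
```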
20. Interquartile range
Is the range of data from the 25th percentile
to the 75th percentile
Common component of a box and whiskers
plot
It is the box, and the line across the box is the
median or middle value
Rarely, mean will also be displayed.
22. Mean deviation and standard
deviation
Mean deviation = Σ|X − X̄| / n
n is the number of observations, X̄ is the mean,
X is each observation
Mean square deviation = variance = Σ(X − X̄)² / n
Root mean square deviation = standard deviation (SD) = √( Σ(X − X̄)² / n )
(for a sample, n − 1 is used in place of n)
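A sketch of these three measures, reusing the six heights from the mean example as illustrative data. It assumes the divide-by-n form shown on the slide and also shows the n − 1 (sample) form mentioned in the editor's notes.

```python
import math

heights = [146, 142, 150, 148, 156, 140]   # data reused from the mean example
n = len(heights)
x_bar = sum(heights) / n                   # mean

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - x_bar) for x in heights) / n

# Variance: mean of squared deviations (divide-by-n form).
sum_sq = sum((x - x_bar) ** 2 for x in heights)
variance = sum_sq / n

# Standard deviation: root of the mean squared deviation.
sd = math.sqrt(variance)

# Sample SD divides by n - 1 instead of n.
sample_sd = math.sqrt(sum_sq / (n - 1))

print(mean_deviation, variance, sd, sample_sd)
```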
23. Variance
Square of SD(standard deviation )
Coefficient of variation = (SD / mean) × 100
E.g. if SD is 3 and mean is 150:
variance is 9, coefficient of variation is
300/150 = 2%
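The same example as a quick sketch (variable names are illustrative):

```python
sd = 3
mean = 150

variance = sd ** 2              # variance is the square of the SD
cv_percent = sd / mean * 100    # coefficient of variation, in percent

print(variance, cv_percent)
```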
24. Standard Error
A fundamental goal of statistical analysis is to
estimate a parameter of a population based
on a sample
The values of a specific variable from a
sample are an estimate of the entire
population of individuals who might have
been eligible for the study.
Standard error is a measure of the precision of a sample estimate
25. Standard Error
Standard error of the mean
Standard deviation / square root of (sample
size)
(if sample greater than 60)
Sd/ √n
Important: dependent on sample size
Larger the sample, the smaller the standard error
26. Clarification
Standard Deviation measures the
variability or spread of the data in an
individual sample.
Standard error measures the precision
of the estimate of a population
parameter provided by the sample
mean or proportion.
27. Standard Error
Significance:
Is the basis of confidence intervals
A 95% confidence interval is defined by
Sample mean (or proportion) ± 1.96 × standard error
Since standard error is inversely related to the
sample size:
The larger the study (sample size), the smaller the
confidence intervals and the greater the precision of the
estimate.
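A small sketch of how the interval narrows as the sample grows. The mean of 50 and SD of 10 are hypothetical values chosen only for illustration, and the helper names are not from the slides.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)

def ci95(estimate, se):
    """95% confidence interval: estimate +/- 1.96 x standard error."""
    margin = 1.96 * se
    return estimate - margin, estimate + margin

# Hypothetical sample mean 50 and SD 10, at increasing sample sizes.
for n in (25, 100, 400):
    se = standard_error(10, n)
    low, high = ci95(50, se)
    print(n, se, round(low, 2), round(high, 2))
```

Quadrupling the sample size halves the standard error, and with it the width of the interval.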
28.
Mean +/- 1 SD = 68.27% of values
Mean +/- 2 SD = 95.45% of values
Mean +/- 3 SD = 99.73% of values
Mean +/- 4 SD = 99.99% of values
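These percentages hold for a normal distribution and can be reproduced with the error function. This is a sketch; `coverage_within` is a name chosen here.

```python
import math

def coverage_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(k, round(100 * coverage_within(k), 2))
```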
29. Confidence Intervals
May be used to assess a single point
estimate such as mean or proportion.
Most commonly used in assessing the
estimate of the difference between two
groups.
31. P Values
The probability of obtaining a result at least
as extreme as the one observed, assuming that
the null hypothesis is true
Typically, an estimate that has a p
value of 0.05 or less is considered to
be “statistically significant”, i.e. unlikely
to occur due to chance alone, and the null
hypothesis is rejected
32. The P value threshold used is an arbitrary value
P value of 0.05 equals 1 in 20
chance
P value of 0.01 equals 1 in 100
chance
P value of 0.001 equals 1 in 1000
chance.
33. Errors
Type I error
Claiming a difference between two
samples when in fact there is none.
Remember there is variability among samples-
they might seem to come from different
populations but they may not.
Also called the α error.
Typically 0.05 is used
34. Errors
Type II error
Claiming there is no difference between
two samples when in fact there is.
Also called a β error.
The probability of not making a Type II
error is 1 - β, which is called the power of
the test.
Hidden error because can’t be detected
without a proper power analysis
36. Sample Size Calculation
Also called “power analysis”.
When designing a study, one needs to
determine how large a study is needed.
Power is the ability of a study to avoid a
Type II error.
Sample size calculation yields the
number of study subjects needed, given
a certain desired power to detect a
difference and a certain level of P value
that will be considered significant.
37. Sample Size Calculation
Depends on:
Level of Type I error: 0.05 typical
Level of Type II error: 0.20 typical
One sided vs two sided: nearly always two
Inherent variability of population
Usually estimated from preliminary data
The difference that would be meaningful
between the two assessment arms.
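As a rough sketch of how these inputs combine, the following uses the common normal-approximation formula for comparing two means, n per group = 2 × ((z for α/2 + z for β) × σ / δ)². The formula and the function name are assumptions for illustration, not taken from the slides; dedicated software should be used for a real study.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided comparison of two means.

    sigma: assumed SD of the outcome (inherent variability)
    delta: smallest difference between arms considered meaningful
    alpha: Type I error level; power = 1 - beta (Type II error level)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Hypothetical inputs: SD of 10, meaningful difference of 5.
print(n_per_group(sigma=10, delta=5))
```

Halving the meaningful difference roughly quadruples the required sample size, which is why the choice of difference matters so much.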
38. One-sided vs. Two-sided
Most tests should be framed as a two-
sided test.
When comparing two samples, we usually
cannot be sure which is going to be
better.
You never know which directions study results
will go.
For routine medical research, use only two-
sided tests.
39. Statistical Tests
Parametric tests
Continuous data normally distributed
Non-parametric tests
Continuous data not normally distributed
Categorical or Ordinal data
40. Comparison of 2 Sample Means
Student’s T test
Assumes normally distributed continuous
data.
T value = (difference between means) /
(standard error of the difference)
T value then looked up in Table to
determine significance
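A minimal sketch of the T value calculation. It assumes the unpooled (Welch) form of the standard error of the difference, which is one common choice; the classic Student's test pools the two variances instead. The data are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Difference between means divided by the standard error of the difference."""
    # Unpooled (Welch) standard error; variance() is the sample variance (n - 1).
    se_diff = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se_diff

# Hypothetical measurements from two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(t_statistic(group_a, group_b))
```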
41. Paired T Tests
Uses the change before
and after intervention in a
single individual
Reduces the degree of
variability between the
groups
Given the same number
of patients, has greater
power to detect a
difference between groups
42. Analysis of Variance(ANOVA)
Used to determine if two or more
samples are from the same
population-
If two samples, is the same as
the T test.
Usually used for 3 or more
samples.
43. Non-parametric Tests
Testing proportions
(Pearson’s) Chi-Squared (χ2) Test
Fisher’s Exact Test
Testing ordinal variables
Mann-Whitney “U” Test
Kruskal-Wallis One-way ANOVA
Testing Ordinal Paired Variables
Sign Test
Wilcoxon Signed Rank Test
44. Use of non-parametric tests
Use for categorical, ordinal or non-normally
distributed continuous data
May check both parametric and non-
parametric tests to check for congruity
Most non-parametric tests are based on
ranks or other non- value related methods
Interpretation:
Is the P value significant?
45. (Pearson’s) Chi-Squared (χ2) Test
Used to compare observed proportions of an
event compared to expected.
Used with nominal data (better/ worse;
dead/alive)
If there is a substantial difference between
observed and expected, then it is likely that
the null hypothesis is rejected.
Often presented graphically as a 2 X 2 Table
46. Non parametric test
For comparing 2 related samples
-Wilcoxon Signed Rank Test
For comparing 2 unrelated samples
-Mann- Whitney U Test
For comparing >2groups
-Kruskal-Wallis Test
47. Mann–Whitney U test
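The Mann–Whitney U test (also called the
Mann–Whitney–Wilcoxon (MWW), Wilcoxon
rank-sum, or Wilcoxon–Mann–Whitney test) is a
non-parametric test of the null hypothesis that
two populations are the same, especially against
the alternative that a particular population tends
to have larger values than the other.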
It has greater efficiency than the t-test on non-
normal distributions, such as
a mixture of normal distributions, and it is
nearly as efficient as the t-test on normal
distributions.
48. STUDENT T TEST
A t-test is any statistical hypothesis
test in which the test statistic follows
a Student’s t distribution if the null
hypothesis is supported.
It can be used to determine if two sets of
data are significantly different from each
other, and is most commonly applied
when the test statistic would follow
a normal distribution if the value of a
scaling term in the test statistic were known.
49. The Kaplan–Meier estimator, also known
as the product limit estimator, is
an estimator for estimating the survival
function from lifetime data.
In medical research, it is often used to
measure the fraction of patients living for a
certain amount of time after treatment.
The estimator is named after Edward L.
Kaplan and Paul Meier.
50. A plot of the Kaplan–Meier
estimate of the survival function is
a series of horizontal steps of
declining magnitude which, when
a large enough sample is taken,
approaches the true survival
function for that population.
53. ODDS RATIO
In case control study –
measure of the strength of the
association between risk factor
and outcome
59. RR = Incidence of disease among exposed /
incidence among non-exposed
Relative risk of lung cancer = 10/1 = 10
Incidence of lung cancer is 10 times higher in
the exposed group (smokers), i.e. there is a
positive relationship with smoking
The larger the RR, the greater the strength of association
60. Attributable risk
It is the difference in incidence
rates of disease between exposed
group(EG) and non exposed
group(NEG)
Often expressed in percent
61. AR = ((Incidence rate of disease in EG −
incidence rate of disease in NEG) /
incidence rate in EG) × 100
AR = (10 − 1)/10 × 100 = 90%
I.e. 90% of lung cancers in smokers were
due to their smoking
62. Population attributable risk
It is (the incidence of the disease in the total
population − the incidence of disease
among those who were not exposed to
the suspected causal factor) / incidence of
disease in the total population
PAR = (7.3 − 1)/7.3 × 100 = 86.3%, i.e. 86.3% of
disease could be avoided if risk factors like
cigarettes were avoided
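The three measures can be sketched as small helper functions. The function names are illustrative; the incidence figures (10, 1 and 7.3) are the ones used in the lung cancer example.

```python
def relative_risk(inc_exposed, inc_unexposed):
    """Incidence among exposed divided by incidence among non-exposed."""
    return inc_exposed / inc_unexposed

def attributable_risk_pct(inc_exposed, inc_unexposed):
    """Share of disease among the exposed that is attributable to the exposure."""
    return (inc_exposed - inc_unexposed) / inc_exposed * 100

def population_attributable_risk_pct(inc_total, inc_unexposed):
    """Share of disease in the total population attributable to the exposure."""
    return (inc_total - inc_unexposed) / inc_total * 100

print(relative_risk(10, 1))
print(round(attributable_risk_pct(10, 1), 1))
print(round(population_attributable_risk_pct(7.3, 1), 1))
```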
63. Mortality rates & Ratios
Crude Death rate
No of deaths (from all causes) per
1000 estimated mid year
population (MYP) in one year in a
given place
CDR = (No. of deaths during the year / MYP) × 1000
64. CDR in Panchayath A is
15.2/1000
Panchayath B is 8.2/1000
population
Health status of Panchayath B is
better than A
65. Specific Death rate=(No of deaths due to a
specific disease during a calendar year/
MYP)*1,000
Can calculate death rate in separate diseases
eg . TB, HIV 2/1000, 1/1000 resp
Age groups 5-20yrs, <5yrs - 1/1000, 3/3000
resp.
Sex eg. More in males,
Specific months,etc
66. Case fatality rate(ratio)
(Total no of deaths due to a particular
disease/Total no of cases due to same
disease)*100
Usually described in acute infectious
diseases
Dengue, cholera, food poisoning etc
Represent killing power of the disease
67. Proportional mortality rate(ratio)
Due to a specific disease=(No of
deaths from the specific disease in a
year/ Total deaths in an year )*100
Under 5 proportionate mortality rate=(No of deaths
under 5 years of age in a given
year/Total no of deaths during the
same period)*100
68. Survival rate
(Total no of patients alive after
5yrs/Total no of patients diagnosed
or treated)*100
Method of prognosis of certain
disease conditions mainly in
cancers
69. INCIDENCE
No of new cases occurring in a defined
population during a specified period of time
(No of new cases of specific disease during a
given time period / Population at risk)*1000
Eg 500 new cases of TB in a year in a population of
30000, Incidence is (500/30000)*1000
ie 16.7/1000/yr expressed as incidence rate
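The TB example as a sketch, taking 500 new cases in a population at risk of 30000 (the function name is illustrative):

```python
def incidence_per_1000(new_cases, population_at_risk):
    """New cases during the period per 1000 population at risk."""
    return new_cases / population_at_risk * 1000

rate = incidence_per_1000(500, 30000)
print(round(rate, 1))
```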
70. Incidence-uses
Can be expressed as Special
incidence rate , Attack rate ,
Hospital admission rate , case rate
etc
Measures the rate at which new
cases are occurring in a population
Not influenced by duration
Generally use is restricted to acute conditions
71. PREVALENCE
Refers specifically to all current
cases (old & new) existing at a
given point of time, or a period of
time in a given population
Although referred to as a rate, it is really
a ratio
72. Point prevalence=(No of all current cases
(old& new) of a specified disease existing
at a given point of time / Estimated
population at the same point of time)*100
Period prevalence=(No of existing cases
(old& new) of a specified disease during
a given period of time / Estimated mid
interval population at risk)*100
74. Incidence - 3,4,5,8
Point prevalence at jan 1- 1,2& 7
Point prevalence at Dec 31- 1,3,5&8
Period prevalence(jan-Dec)-
1,2,3,4,5,7&8
76. PREVALENCE-USES
Helps to estimate magnitude of
health/disease problems in the
community, & identify potential high risk
populations
Prevalence rates are especially useful
for administrative and planning
purposes
eg: hospital beds, manpower
needs, rehabilitation facilities etc.
78. P value & its interpretation
The probability of seeing a difference or
association at least as large as the one observed
when actually there is none.
The threshold it is compared with (α, usually 0.05)
is the accepted probability of a type 1 error.
79. Study of prevalence of obesity in male
& female child in a classroom.
50 students
of 25 boys- 10 obese
of 25 girls - 16 obese
p value : 0.02
81. study ,Bubble vs conventional CPAP for
prevention of extubation Failure( EF) in
preterm very low birth weight infants.
EF bCPAP =4(16)
cCPAP =9(16)
p value-0.14
82. Null hypothesis: “ no difference in EF
among preterm babies treated with
bCPAP &cCPAP.”
83. 95% CI
95% CI = Mean ± 1.96 SE
≈ Mean ± 2 SE, where SE = SD/√n
(Mean ± 1.96 SD covers 95% of individual values, not the CI of the mean)
1) 100 children attending pediatric OP.
mean wt=15kg SD=2
95%CI =?
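One way to work the example, as a sketch that takes the interval as mean ± 1.96 × SE with SE = SD/√n. The range mean ± 1.96 × SD is shown alongside for contrast, since it describes individual children rather than the mean.

```python
import math

n, mean_wt, sd = 100, 15.0, 2.0       # values from the example

se = sd / math.sqrt(n)                # standard error of the mean

# 95% confidence interval for the mean weight.
ci_low, ci_high = mean_wt - 1.96 * se, mean_wt + 1.96 * se

# Range expected to contain about 95% of individual weights.
ref_low, ref_high = mean_wt - 1.96 * sd, mean_wt + 1.96 * sd

print(se, round(ci_low, 2), round(ci_high, 2))
print(round(ref_low, 2), round(ref_high, 2))
```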
84. Interpretation of 95%CI
If a study is repeated 100 times, about 95 of
the intervals calculated would contain the true
population mean.
If the CIs of 2 groups overlap, a significant
difference between them is less likely.
86. Chi-Squared (χ2) Test
Chi-Squared (χ2) formula: χ2 = Σ (O − E)² / E,
where O is the observed and E the expected count in each cell
Not applicable in small samples
If the expected count is fewer than 5 in any cell, use
Fisher’s exact test
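A sketch of the calculation for a 2 × 2 table, without continuity correction. The counts are hypothetical, and the p value uses the fact that, with 1 degree of freedom, P(χ2 > x) = erfc(√(x/2)). The function name is chosen here for illustration.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared for a 2x2 table: statistic, p value (1 df), expected counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Expected count in each cell = row total x column total / grand total.
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
                for i in range(2)]
    chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # valid for 1 degree of freedom only
    return chi2, p_value, expected

# Hypothetical counts: rows = two groups, columns = event / no event.
chi2, p_value, expected = chi_squared_2x2([[20, 30], [30, 20]])
print(chi2, round(p_value, 4))

# Check the small-sample rule before trusting the result.
print(min(min(row) for row in expected) >= 5)
```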
88. Correlation
Assesses the linear relationship between two variables
Example: height and weight
Strength of the association is described by a correlation
coefficient- r
r = 0 - .2 low, probably meaningless
r = .2 - .4 low, possible importance
r = .4 - .6 moderate correlation
r = .6 - .8 high correlation
r = .8 - 1 very high correlation
Can be positive or negative
Pearson’s, Spearman correlation coefficient
Tells nothing about causation
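A sketch of Pearson's r computed from its definition. The height and weight figures are hypothetical, and `pearson_r` is a name chosen here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and weights (kg).
heights = [150, 155, 160, 165, 170]
weights = [50, 54, 59, 63, 70]
r = pearson_r(heights, weights)
print(round(r, 3))
```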
91. Regression
Based on fitting a line to data
Provides a regression coefficient, which is the slope of the
line
Y = ax + b
Use to predict a dependent variable’s value based on the
value of an independent variable.
Very helpful- In analysis of height and weight, for a known
height, one can predict weight.
Much more useful than correlation
Allows prediction of values of Y rather than just whether
there is a relationship between two variables.
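A least-squares sketch of Y = ax + b, predicting weight from height. The data are hypothetical and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))   # regression coefficient (slope)
    b = mean_y - a * mean_x                       # intercept
    return a, b

# Hypothetical heights (cm) and weights (kg).
heights = [150, 155, 160, 165, 170]
weights = [50, 54, 59, 63, 70]

a, b = fit_line(heights, weights)
predicted_weight = a * 162 + b    # predicted weight for a known height of 162 cm
print(round(a, 2), round(b, 1), round(predicted_weight, 2))
```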
92. Regression
Types of regression
Linear- uses continuous data to predict continuous
data outcome
Logistic- uses continuous data to predict
probability of a dichotomous outcome
Poisson regression- counts (rates) of rare events.
Cox proportional hazards regression- survival
analysis.
93. Multiple Regression Models
Determining the association between two
variables while controlling for the values of
others.
Example: Uterine Fibroids
Both age and race impact the incidence of
fibroids.
Multiple regression allows one to test the impact of
age on the incidence while controlling for race
(and all other factors)
94. Multiple Regression Models
In published papers, the multivariable models are
more powerful than univariable models and take
precedence.
Therefore we discount the univariable model as it does not
control for confounding variables.
Eg: Coronary disease is potentially affected by age, HTN,
smoking status, gender and many other factors.
If assessing whether height is a factor:
If it is significant on univariable analysis, but not on
multivariable analysis, these other factors confounded the
analysis.
95. Survival Analysis
Evaluation of time to an event (death,
recurrence, recovery).
Provides means of handling censored data
Patients who do not reach the event by the end of
the study or who are lost to follow-up
Most common type is Kaplan-Meier analysis
Curves presented as stepwise change from
baseline
There are no fixed intervals of follow-up- survival
proportion recalculated after each event.
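A minimal sketch of the product-limit calculation, showing how the survival proportion is recalculated after each event and how censored patients leave the risk set without causing a step. The follow-up times are hypothetical and the function name is chosen here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] steps; events: 1 = event occurred, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Step down by the fraction of those still at risk who had the event.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving   # events and censored patients both leave the risk set
    return curve

# Hypothetical follow-up in months; 0 marks a censored patient.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 2))
```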
98. Kaplan-Meier Analysis
Provides a graphical means of comparing the
outcomes of two groups that vary by intervention or
other factor.
Survival rates can be measured directly from curve.
Difference between curves can be tested for
statistical significance.
99. Cox Regression Model
Proportional Hazards Survival Model.
Used to investigate relationship between an event
(death, recurrence) occurring over time and possible
explanatory factors.
Reported result: Hazard ratio (HR).
Ratio of the hazard in one group divided by the hazard in
another.
Interpreted same as risk ratios and odds ratios
HR 1 = no effect
HR > 1 increased risk
HR < 1 decreased risk
100. Cox Regression Model
Common use in long-term studies
where various factors might predispose
to an event.
Example: after uterine embolization, which
factors (age, race, uterine size, etc) might
make recurrence more likely.
101. True disease state vs. Test result
                      Test negative           Test positive
                      (H0 not rejected)       (H0 rejected)
No disease (D = 0)    Correct: specificity    Type I error (false +), α
Disease (D = 1)       Type II error           Correct: power 1 - β;
                      (false -), β            sensitivity
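Sensitivity and specificity follow directly from the four cells of such a table. In this sketch the counts are hypothetical and the function names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of diseased patients the test calls positive (1 - beta)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of disease-free patients the test calls negative (1 - alpha)."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts: 100 patients with the disease, 100 without.
print(sensitivity(true_pos=90, false_neg=10))
print(specificity(true_neg=80, false_pos=20))
```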
104. Test Result: some definitions
[Figure: overlapping distributions of test results for patients without the disease and with the disease; a threshold divides patients called “negative” from patients called “positive”]
True positives: patients with the disease who are called “positive”
105. Test Result
False positives: patients without the disease who are called “positive”
106. Test Result
True negatives: patients without the disease who are called “negative”
107. Test Result
False negatives: patients with the disease who are called “negative”
108. Test Result
[Figure: same distributions] Moving the threshold: right
109. Test Result
[Figure: same distributions] Moving the threshold: left
115. An example forest plot of five odds
ratios (squares) with the summary
measure (centre line of diamond)
and associated confidence
intervals (lateral tips of diamond),
and solid vertical line of no effect.
Names of (fictional) studies are
shown on the left, odds ratios and
116. A forest plot (or blobbogram) is a
graphical display designed to illustrate
the relative strength of treatment effects
in multiple quantitative scientific studies
addressing the same question. It was
developed for use in medical research
as a means of graphically representing
a meta-analysis of the results of
randomized controlled trials.
118. i. Probably a small study, with a wide
CI, crossing the line of no effect (OR =
1). Unable to say if the intervention
works
ii. Probably a small study, wide CI , but
does not cross OR = 1; suggests
intervention works but weak evidence
iii. Larger study, narrow CI: but crosses
OR = 1; no evidence that intervention works
119. iv. Large study, narrow confidence
intervals: entirely to left of OR = 1;
suggests intervention works
v. Small study, wide confidence
intervals, suggests intervention is
detrimental
vi. Meta-analysis of all identified
studies: suggests intervention works.
120. PICOT
Used to test evidence based research
Population
Intervention or issue
Comparison with another intervention
Outcome
Time frame
Editor's Notes
Similar: use both to compare groups
SD: take the difference between each value and the mean, square it, add the squares together, divide by (n-1), then take the square root of this value