Week 2 measures of disease occurrence

Week 2: Measures of disease
occurrence and related statistics
Dr. Hamdi Alhakimi
MD, MPH, M- epidemiology

Goals
• Describe the steps of descriptive data
analysis
• Be able to define variables
• Understand basic coding principles
• Learn simple descriptive data analysis
• Learn simple inferential statistics

Biostatistics
DATA → RESULTS + conclusions
4/7/2021 3

Types of Biostatistics
• Descriptive statistics
• Inferential statistics

Descriptive Statistics
• Calculations:
– Proportions, rates & ratios
– Measures of central tendency (mean, mode & median)
– Measures of dispersion (SD, range)
– Quintiles
• Tabulations:
– Frequency distribution tables
– Cross tabs
• Graphs:
– Bar graphs
– Pie chart
– Histogram
– Scatter plot

Types of Variables
• (Quantitative) Numerical variables:
– Always numbers
– Examples: age in years, weight, blood pressure readings,
temperature, concentrations of pollutants, counts of
cases per week, or any other measurements
• (Qualitative) Categorical variables:
– Information that can be sorted into categories
– Types of categorical variables: ordinal, nominal and
dichotomous (binary)

Categorical Variables:
Ordinal Variables
• Ordinal variable—a categorical variable with some
intrinsic order
• Examples of ordinal variables:
– Education (illiterate, HS degree, some college, college
degree)
– Agreement (strongly disagree, disagree, neutral, agree,
strongly agree)
– Rating (excellent, good, fair, poor)
– Frequency (always, often, sometimes, never)
– Any other scale (“On a scale of 1 to 5...”)

Nominal Variables
• Nominal variable – a categorical variable without an
intrinsic order
• Examples of nominal variables:
– Where a person lives in the U.S. (Northeast, South,
Midwest, etc.)
– Nationality (American, Mexican, French)
– Race/ethnicity (African American, Hispanic, White, Asian
American)
– Favorite pet (dog, cat, fish, snake)

Dichotomous Variables
• Dichotomous (or binary) variables – a categorical
variable with only 2 levels of categories
– Often represents the answer to a yes or no question
• For example:
– “Did you attend the church on May 24?” Yes /No
– “Did you eat potato salad ?” Yes/No
– Anything with only 2 categories
– Gender (male, female)

Coding
• Coding – process of translating information gathered
from questionnaires or other sources into something
that can be analyzed
• Involves assigning a value to the information given—
often value is given a label
• Coding can make data more consistent:
– Example: Question = Gender
Answers = Male, Female, M, or F → coded as 0 or 1
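A minimal sketch of how inconsistent answers could be coded consistently. The raw answers and the choice of 0 = male, 1 = female are assumptions made for illustration; any scheme works as long as it is documented.

```python
# Hypothetical raw answers as they might appear on returned questionnaires.
raw_answers = ["Male", "female", "M", "F", " m ", "FEMALE"]

# Assumed coding scheme (illustrative only): 0 = male, 1 = female.
CODES = {"male": 0, "m": 0, "female": 1, "f": 1}

def code_gender(answer):
    """Return the numeric code for a free-text answer, or None if unrecognised."""
    return CODES.get(answer.strip().lower())

coded = [code_gender(a) for a in raw_answers]
print(coded)
```

Unrecognised answers come back as None, so they can be reviewed rather than silently miscoded.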

Coding Systems
• Common coding systems (code and label) for dichotomous
variables:
– 0=No 1=Yes
(1 = value assigned, Yes= label of value)
– OR: 1=No 2=Yes
• When you assign a value you must also make it clear what
that value means
– As long as it is clear how the data are coded, either is fine
• You can make it clear by creating a data dictionary to
accompany the dataset

Coding:
Attaching Labels to Values
• Many analysis software packages allow you to attach a label
to the variable values
Example: Label 0’s as male and 1’s as female
• Makes reading data output easier:
Without label:
Variable SEX   Frequency   Percent
0              21          60%
1              14          40%
With label:
Variable SEX   Frequency   Percent
Male           21          60%
Female         14          40%
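As a rough illustration of attaching labels to coded values, the sketch below rebuilds the labelled frequency table from coded data. The list of codes is constructed to match the counts in the example (21 zeros, 14 ones); the variable names are hypothetical.

```python
from collections import Counter

# Coded SEX values matching the example counts: 21 zeros and 14 ones.
sex = [0] * 21 + [1] * 14

# Value labels as in the example: 0 = Male, 1 = Female.
LABELS = {0: "Male", 1: "Female"}

counts = Counter(sex)
total = len(sex)

# One row per code: (label, frequency, percent rounded to a whole number).
table = [(LABELS[code], n, round(100 * n / total))
         for code, n in sorted(counts.items())]

for label, n, pct in table:
    print(f"{label:<8}{n:>4}{pct:>5}%")
```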

Coding: Ordinal Variables
• Coding process is similar with other categorical variables
• Example: variable EDUCATION, possible coding:
0 = Did not graduate from high school
1 = High school graduate
2 = Some college or post-high school education
3 = College graduate
• Could be coded in reverse order (0=college graduate, 3=did
not graduate high school).
• For this ordinal categorical variable we want to be consistent
with numbering because the value of the code assigned has
significance.

Coding: Nominal Variables
• For coding nominal variables, order makes no
difference
• Example: variable RESIDE
1 = Northeast
2 = South
3 = Northwest
4 = Midwest
5 = Southwest
• Order does not matter, no ordered value associated
with each response

Coding: Continuous Variables
• Creating categories from a continuous variable (age) is
common
• Example: variable = AGE_CAT
Children = 0–9 years old
Teenagers = 10–19 years old
Young adults = 20–39 years old
Middle aged = 40–59 years old
Elderly = 60 years or older
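A short sketch of how a continuous age could be turned into the AGE_CAT groups listed above. The function name is hypothetical, and age is assumed to be recorded in completed years.

```python
def age_category(age):
    """Map age in completed years to the AGE_CAT groups listed above."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= 9:
        return "Children"
    if age <= 19:
        return "Teenagers"
    if age <= 39:
        return "Young adults"
    if age <= 59:
        return "Middle aged"
    return "Elderly"

# Hypothetical ages, one per category.
ages = [4, 15, 27, 45, 72]
categories = [age_category(a) for a in ages]
print(categories)
```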

Data Cleaning
• One of the first steps in analyzing data is to “clean” it
of any obvious data entry errors:
– Outliers? (really high or low numbers)
Example: Age = 110 (really 10 or 11?)
– Value entered that doesn’t exist for variable?
Example: 2 entered where 1=male, 0=female
– Missing values?
Did the person not give an answer? Was answer
accidentally not entered into the database?
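The three checks above (outliers, impossible codes, missing values) can be sketched as a simple screening pass. The records, the plausible age range and the set of valid codes are all assumptions for illustration; flagged records should be checked against the original forms, not corrected automatically.

```python
# Hypothetical records; None marks a value that was never entered.
records = [
    {"id": 1, "age": 34, "sex": 0},
    {"id": 2, "age": 110, "sex": 1},   # suspicious age (10 or 11 mistyped?)
    {"id": 3, "age": 27, "sex": 2},    # 2 is not a valid code
    {"id": 4, "age": None, "sex": 1},  # missing age
]

VALID_SEX_CODES = {0, 1}
AGE_RANGE = (0, 105)  # assumed plausible range; adjust to the study population

def check_record(rec):
    """Return a list of problems found in one record."""
    problems = []
    for field in ("age", "sex"):
        if rec[field] is None:
            problems.append(f"{field} missing")
    if rec["age"] is not None and not AGE_RANGE[0] <= rec["age"] <= AGE_RANGE[1]:
        problems.append("age out of range")
    if rec["sex"] is not None and rec["sex"] not in VALID_SEX_CODES:
        problems.append("invalid sex code")
    return problems

# Keep only the records that have at least one problem.
flagged = {rec["id"]: check_record(rec) for rec in records if check_record(rec)}
print(flagged)
```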

Data Cleaning (cont.)
• “Double entry” – i.e., entering the data twice and then
comparing both entries for discrepancies
• Univariate data analysis is a useful way to check the quality of
the data

Univariate Data Analysis
• Univariate data analysis explores each variable in a
data set separately:
– Serves as a good method to check the quality of the data
– Inconsistencies or unexpected results should be
investigated using the original data as the reference point
• Frequencies (percentages) can tell you if many study
participants share a characteristic of interest (age,
gender, etc.)
– Graphs and tables can be helpful

Univariate Data Analysis (cont.)
• Examining variables can give you important
information:
– Do all subjects have data, or are values missing?
– Are most values clumped together, or is there a lot of
variation?
– Are there outliers?
– Do the minimum and maximum values make sense, or
could there be mistakes in the coding?

Recap:
• All these descriptive statistics are univariate
(describe only one variable).
• Next week, we will discuss bivariate
descriptive analysis (2 variables involved).

Descriptive statistics in
qualitative data
Lecture 4

Use of descriptive
statistics in quantitative data
• Calculations:
– Measures of central tendency (mean, mode & median)
– Measures of dispersion (SD, range)
– Correlation coefficient
– Regression coefficient
– Quintiles
• Graphs:
– Histogram
– Scatter plot

Use of descriptive
statistics in
qualitative data
• Calculations:
– Proportions, rates & ratios
• Tabulations:
– Frequency distribution tables
– Cross tabs
• Graphs:
– Bar graphs
– Pie chart

Proportion (percentage, frequency):
Proportion = a / (a + b)
• The numerator a is included in the denominator (a + b)
• No measurement unit
• Ranges from 0 to 1
• Often expressed as %
• Example: Of 7,999 females, 2,496 use modern contraceptive
methods.
• The proportion of those who use modern contraceptive methods
= 2,496 / 7,999 × 100 = 31.2%

Prevalence rate:
A rate is a specific type of proportion.
Prevalence rate: the proportion of a defined group or population
that has a clinical condition or outcome at a given point in time
– Prevalence rate = (Number of cases observed at time t) /
(Total number of individuals at time t)
• Ranges from 0 to 1 (it’s a proportion), but is usually referred
to as a rate and is often shown as a %

Prevalence rate:
Example:
• Of 100 patients hospitalized with stroke, 18 had a
myocardial infarction (MI)
• Prevalence of MI among hospitalized stroke
patients = 18%
• The prevalence rate answers the question:
– “what fraction of the group is affected at this moment
in time?”
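The contraceptive-use proportion and the MI prevalence are the same calculation: cases divided by the whole group that contains them. A minimal sketch, with a hypothetical helper name:

```python
def proportion(cases, total):
    """Cases divided by the whole group; the cases are part of the total."""
    if total <= 0 or not 0 <= cases <= total:
        raise ValueError("need total > 0 and 0 <= cases <= total")
    return cases / total

# 2,496 of 7,999 females use modern contraceptive methods.
contraceptive_use = proportion(2496, 7999)

# 18 of 100 hospitalized stroke patients had an MI.
mi_prevalence = proportion(18, 100)

print(f"{contraceptive_use:.1%}")
print(f"{mi_prevalence:.0%}")
```

The validation step reflects the defining property of a proportion: the numerator can never exceed the denominator.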

Incidence rate in population based data:

Incidence rate: (usually in clinical data)

Descriptive statistics of Categorical Data
• Distribution of categorical
variables should be
examined before more in-depth
analyses.
– Bar graph
[Figure: bar graph “Distribution of Area of Residence” (Example Questionnaire Data). x-axis: variable RESIDE (Midwest, Northeast, Northwest, South, Southwest); y-axis: Number of People (0 to 30). Caption: Number of people answering example questionnaire who reside in 5 regions of the United States]

Descriptive statistics of Categorical Data
• Another way to look at
the data is to list the data
categories in tables.
• Frequency distribution
table.
Table: Number of people answering sample
questionnaire who reside in 5 regions of the United
States
Region      Frequency   Percent
Midwest     16          20%
Northeast   13          16%
Northwest   19          24%
South       24          30%
Southwest   8           10%
Total       80          100%
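The percent column of the frequency table follows directly from the counts. A small sketch using the counts from the table (percentages rounded to whole numbers, as on the slide), with a crude text bar per region standing in for the bar graph:

```python
# Counts per region, taken from the frequency table.
counts = {
    "Midwest": 16,
    "Northeast": 13,
    "Northwest": 19,
    "South": 24,
    "Southwest": 8,
}

total = sum(counts.values())

# Percent of respondents in each region, rounded to a whole number.
percents = {region: round(100 * n / total) for region, n in counts.items()}

for region, n in counts.items():
    print(f"{region:<10}{n:>4}{percents[region]:>5}%  {'#' * n}")
print(f"{'Total':<10}{total:>4}")
```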

Descriptive statistics of continuous
variable

Measures of Central Tendency
• Measures of central tendency yield information about the
center of the data.
• Common Measures of Location
–Mode
–Median
–Mean
© 2002 Thomson / South-Western Slide 3-35

Mean
• Is the average of a group of numbers.
• Applicable for continuous and discrete data, not applicable for
nominal or ordinal data.
• Affected by each value in the data set, including extreme
values.
• Computed by summing all values in the data set and dividing
the sum by the number of values in the data set.

Descriptive statistics
• Commonly used statistics with univariate analysis of
continuous variables:
– Mean – average of all values of this variable in the dataset
– Median – the middle of the distribution, the number
where half of the values are above and half are below
– Mode – the value that occurs the most times
– Range of values – from minimum value to maximum value
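These four statistics can be computed with Python's standard `statistics` module. The ages below are hypothetical and are not the data behind the lecture's charts.

```python
import statistics

# Hypothetical ages (in years) for seven study participants.
ages = [21, 25, 25, 30, 34, 39, 50]

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value
age_range = (min(ages), max(ages))    # minimum to maximum

print(mean_age, median_age, mode_age, age_range)
```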

Statistics describing a continuous variable distribution
[Figure: Example Scatter Chart: Age (in years), y-axis 0 to 90]
• 84 = Maximum (an outlier)
• 2 = Minimum
• 28 = Mode (occurs twice)
• 33 = Mean
• 36 = Median (50th percentile)

Median
• Middle value in an ordered array of numbers.
• Applicable for ordinal, interval, and ratio data.
• Unaffected by extremely large and extremely
small values.

Median: Computational Procedure
• First Procedure
– Arrange observations in an ordered array.
– If number of terms is odd, the median is the
middle term of the ordered array.
– If number of terms is even, the median is the
average of the middle two terms.

Median: Example
Ordered Array includes:
4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
• Median is 15.
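The computational procedure for the median can be written out step by step. This is an illustrative sketch (in practice `statistics.median` does the same job), applied to the ordered array from the example.

```python
def median(values):
    """Median by the procedure above: sort, then take the middle term(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of terms: the middle term.
        return ordered[mid]
    # Even number of terms: average of the two middle terms.
    return (ordered[mid - 1] + ordered[mid]) / 2

array = [4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21]
print(median(array))
```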

Measures of Central Tendency
The mean is the most frequently used measure but is
sensitive to extreme scores
e.g. 1 2 3 4 5 6 7 8 9 10
Mean = 5.5 (median = 5.5)
e.g. 1 2 3 4 5 6 7 8 9 20
Mean = 6.5 (median = 5.5)
e.g. 1 2 3 4 5 6 7 8 9 100
Mean = 14.5 (median = 5.5)
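The three examples can be reproduced in a few lines: only the largest value changes, yet the mean moves each time while the median stays put.

```python
import statistics

results = []
for largest in (10, 20, 100):
    # Values 1..9 plus one varying largest value.
    data = list(range(1, 10)) + [largest]
    results.append((statistics.mean(data), statistics.median(data)))

for mean_value, median_value in results:
    print(f"mean = {mean_value}, median = {median_value}")
```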

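Quartiles: after sorting of data
Q1, Q2 and Q3 split the sorted data into four parts, each holding 25% of the observations:
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%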
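A brief sketch of computing quartiles with `statistics.quantiles` (Python 3.8+), reusing the ordered array from the median example. Software packages use slightly different interpolation rules, so Q1 and Q3 may differ a little between tools; Q2 is always the median.

```python
import statistics

data = [4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)
```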

Measures of Variability
• Measures of variability describe the spread or
the dispersion of a set of data.
• Common Measures of Variability
–Standard Deviation

Variation: Standard Deviation
[Figures: Example Scatter Chart 1: Age and Example Scatter Chart 2: Age; y-axis: Age (in years), 0 to 90]
• Figure left: narrowly distributed age values (SD = 7.6), mean = 33
• Figure right: widely distributed age values (SD = 20.4), mean = 33

Variability
[Figure: two panels, “No Variability in Cash Flow” and “Variability in Cash Flow”, each with the mean marked]

Standard Deviation: measures of variation
• Square root of the
sample variance

X        X - \bar{X}    (X - \bar{X})^2
2,398    625            390,625
1,844    71             5,041
1,539    -234           54,756
1,311    -462           213,444
7,092    0              663,866    (column totals)

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$

$$S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41$$

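The standard deviation example (values 2,398, 1,844, 1,539 and 1,311) can be checked step by step. This sketch follows the sample formula, dividing by n - 1; `statistics.stdev` gives the same result.

```python
import math

values = [2398, 1844, 1539, 1311]
n = len(values)

mean = sum(values) / n                        # 7,092 / 4
deviations = [x - mean for x in values]       # X minus the mean
sum_squares = sum(d ** 2 for d in deviations)

variance = sum_squares / (n - 1)              # sample variance S^2
sd = math.sqrt(variance)                      # sample standard deviation S

print(mean, deviations, sum_squares)
print(round(variance, 2), round(sd, 2))
```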
Graphs to describe a numerical variable

Histogram (only for a numerical variable)
• Divide measurement up into equal-sized
categories.
• Determine number of measurements falling
into each category.
• Draw a bar for each category so bars’
heights represent number (or percent)
falling into the categories.
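The three steps for building a histogram can be sketched without any plotting library. The ages, the starting point and the bin width are hypothetical; changing `width` shows how too few or too many categories alter the picture.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each equal-sized category."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical ages (in years).
ages = [18, 19, 19, 20, 21, 21, 22, 23, 25, 27]

# Five categories, each 2 years wide: 18-19, 20-21, 22-23, 24-25, 26-27.
counts = histogram_counts(ages, start=18, width=2, n_bins=5)

# Draw one text bar per category; bar length is the count.
for i, count in enumerate(counts):
    low = 18 + 2 * i
    print(f"{low}-{low + 1}: {'#' * count}")
```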

Histograms
Graph that uses
bars to show
frequencies or
percentage of a
possible outcome.

Too few categories
[Figure: histogram “Age of Spring 1998 Stat 250 Students”, n=92 students. x-axis: Age (in years), ticks at 18, 23, 28; y-axis: 0 to 60]

Too many categories
[Figure: histogram “GPAs of Spring 1998 Stat 250 Students”, n=92 students. x-axis: GPA, ticks at 2, 3, 4; y-axis: Frequency (Count), 0 to 7]

Normal distribution of a continuous variable

Why is the normal distribution
important?
• The answer comes next week.

Inferential statistics
Confidence Interval of the Mean

Properties of the Normal Distribution
Empirical Rule
• Bell-shaped density function
• Symmetric around the mean
• Mean = Median = Mode
• 68% of the area under the curve lies within μ ± σ
• 95% of the area under the curve lies within μ ± 2σ
• 99.7% of the area under the curve lies within μ ± 3σ
[Figure: Standard Normal Form, a bell curve centered at μ with area .68 between μ−σ and μ+σ and area .95 between μ−2σ and μ+2σ]
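The empirical rule percentages can be verified with `statistics.NormalDist` from the Python standard library (3.8+). The sketch uses the standard normal distribution (mean 0, SD 1); the same areas hold for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.1%}")
```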

Estimation
• Estimation is one of the main purposes of
statistics.
• The basic idea is that we take a sample of data
and use it to make inferences about the
population of interest.

Estimation
• Estimation involves the calculation of confidence
intervals for some statistic (for example, a mean or
proportion)

Example I
• What is the complication rate of heart surgery in KFH
hospital?
• Using 3 years of data from KFH , a sample of 52
patients who had a heart surgery was selected; of these,
4 patients had a complication.
• 7.7% complication rate (95% Confidence Interval = 2.5%
to 12.5%)

Confidence interval
• Interpretation of 95% confidence interval:
Based on our sample data,
“we are 95% confident that the "true"
complication rate at KFH is between 2.5% and
12.5%.”

Advantages of using confidence intervals:
• (1) Confidence intervals remind us that study estimates
have variability (i.e. the width of the CI).
• (2) Confidence intervals show clearly the role that sample
size plays in the estimation.
– Large sample size = narrow confidence limits
– Small sample size = wide confidence limits

Calculation of confidence interval of the mean
1. Compute the standard error of the mean.
2. Add and subtract 2 SE to and from the mean to form the
interval (from the lower limit to the upper limit)

95% Confidence Interval
Formula in English:
Estimate ± (1.96 × standard error)

Example
• A random sample of 16 students reported
having an average age of 31 with a
standard deviation of 6 years.
• In what range of values can we be 95%
confident that the true mean age lies?

Example
• 95% confidence interval = mean ± 1.96 × (SD/√n)
• C.I. = 31 ± 1.96 × (6/√16) = 31 ± 1.96 × (6/4) = 31 ± 2.94 ≈ 31 ± 3
• C.I. = (31 − 3 to 31 + 3) = (28 to 34) years old
• Interpretation?
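The calculation in this example can be sketched as a small function. It follows the lecture's use of the multiplier 1.96; for a sample as small as 16, a t-based multiplier would give a slightly wider interval. The function name is hypothetical.

```python
import math

def mean_confidence_interval(mean, sd, n, z=1.96):
    """Estimate ± z × standard error, where SE = SD / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin

# 16 students, mean age 31, standard deviation 6 years.
low, high = mean_confidence_interval(mean=31, sd=6, n=16)
print(f"95% CI: {low:.2f} to {high:.2f} years")
```

Rounded to whole years, this matches the 28 to 34 interval in the example.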

Length of Confidence Interval
• We want confidence interval to be as
narrow as possible.
• Length = Upper Limit - Lower Limit

How length of CI is affected?
• As the standard deviation decreases…
• As we decrease the confidence level…
• As we increase sample size …
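One way to explore these three questions is to compute the interval length directly. A sketch assuming a normal-based interval for a mean; the baseline values (SD 6, n = 16, 95% confidence) are borrowed from the student age example.

```python
import math
from statistics import NormalDist

def ci_length(sd, n, confidence=0.95):
    """Length = upper limit - lower limit = 2 × z × SD / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sd / math.sqrt(n)

baseline = ci_length(sd=6, n=16)
smaller_sd = ci_length(sd=3, n=16)                         # SD decreases
lower_confidence = ci_length(sd=6, n=16, confidence=0.90)  # level decreases
larger_sample = ci_length(sd=6, n=64)                      # n increases

print(round(baseline, 2), round(smaller_sd, 2),
      round(lower_confidence, 2), round(larger_sample, 2))
```

In all three cases the interval becomes narrower than the baseline.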

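Population: mean = μ, s.d. = σ
Sample: mean = x̄ (sample size n)
[Figure: normal distribution (ND) of the sample mean, centered at μ, with μ − 2σ/√n and μ + 2σ/√n marked]
There is a 95% chance that x̄ will fall
inside the interval μ ± 2σ/√n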
