Introduction to biostatistics:
Biostatistics: the collection and analysis of data related to an area of research (variables).
Data: information (Observations) about one or more variables.
Variables: any quantity that varies.
Biostatistics deals with:
Individual life from birth to death.
Events affecting life like marriage, sickness, education.
Factors affecting such vital events and their outcome.
Applications of biostatistics:
Public health, including epidemiology, nutrition, and environmental health.
Design and analysis of clinical trials in medicine.
Population genetics and statistical genetics.
Biological sequence analysis.
Biostatistics is concerned with the following indicators:
Demographic and vital events.
Environmental health statistics.
Measuring health status (mortality, morbidity, disability).
Measuring health resources (facilities, beds, manpower).
Planning, administering, and managing health services.
Research on health and disease.
Preparation of life tables.
Estimation of population (census)
Data collection refers to the methods used to collect data about a particular phenomenon.
Methods of data collection:
Data can be entered directly into the computer, but in order to move the data we must
store it in either a spreadsheet or a data package, both of which are limited.
A more flexible approach is to have your data available as an ASCII (text) file.
ASCII / Text file:
- Consists of rows of text that can be reviewed on the computer screen.
- Each variable is separated by a delimiter (comma, space, …).
Can saved in
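A minimal sketch of reading a delimited text file of the kind described above, assuming Python and its standard csv module. The variable names and values (patient_id, age, blood_group) are invented for illustration and are not from the notes.

```python
import csv
import io

# Hypothetical comma-delimited text; the first row holds the variable names.
raw_text = """patient_id,age,blood_group
1,34,A
2,51,O
3,29,AB
"""

def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

records = read_delimited(raw_text)
for record in records:
    print(record)
```

Every value is read as text, so numerical variables such as age still need converting (for example with int) before analysis.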
Data Organization & presentation:
There are three ways of organizing and presenting data:
Classification: different observations are grouped together into classes.
Types of Data:
1. Numerical (Quantitative) data.
2. Categorical (Qualitative) data.
3. Derived data.
4. Censored data.
1- Numerical (Quantitative) data: variables take numerical values.
Discrete data: values are counts (integers).
Continuous data: values can take any value within a range.
2- Categorical (Qualitative) data: each individual belongs to one of a number
of categories.
Nominal data: categories have names but no order.
(Ex. Blood groups A, B, AB, and O)
Ordinal data: categories are ordered.
(Ex. Disease staging system: advanced, moderate, mild, none)
Note: data are binary (dichotomous) when there are only two possible categories.
(Ex. Yes/No, Dead/Alive, Diseased/Healthy)
3- Derived data: the observations in a group (category) are divided into those
with a given characteristic (a) and those without it (b).
1. Proportion = a / (a + b)
2. Percentage (per hundred) = a / (a + b) × 100
3. Ratio (quotient) = a / b
4. Rate: proportion multiplied by a base = a / (a + b) × base
5. Score: used when a quality cannot be measured directly; individual item
values are summed to give an overall value for that quality.
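The derived quantities above can be sketched in a few lines of Python. The counts a = 30 and b = 70 and the base of 1000 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def proportion(a, b):
    """Share of the group that has the characteristic: a / (a + b)."""
    return a / (a + b)

def percentage(a, b):
    """Proportion expressed per hundred."""
    return 100 * a / (a + b)

def ratio(a, b):
    """Those with the characteristic relative to those without: a / b."""
    return a / b

def rate(a, b, base=1000):
    """Proportion multiplied by a base (for example per 1000)."""
    return base * a / (a + b)

a, b = 30, 70  # hypothetical counts: with (a) and without (b) the characteristic
print(proportion(a, b), percentage(a, b), ratio(a, b), rate(a, b))
```

Note that the ratio divides by b alone, while the proportion, percentage, and rate all divide by the whole group (a + b).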
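4- Censored data: arise when:
- lab values lie beyond a detection cut-off value, so the exact value is not known;
- following patients in trials, where some patients are not observed up to the outcome.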
Tabulation: putting data into tables, that is, a systematic arrangement of data in
columns and rows.
Tabulation has three kinds:
a. Single tabulation.
It represents one variable.
For example: number of patients by marital status.
Data: the marital status of 15 patients is as follows: married, divorced,
single, widow, single, married, divorced, single, widow, married, married,
divorced, single, divorced, single.
Show the data in a single table.
Marital status Number of patients
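One way to build the single table is to count how often each category occurs. The sketch below assumes Python's collections.Counter; the 15 observations are the ones listed above.

```python
from collections import Counter

marital_status = [
    "married", "divorced", "single", "widow", "single",
    "married", "divorced", "single", "widow", "married",
    "married", "divorced", "single", "divorced", "single",
]

# Count the number of patients in each marital-status category.
counts = Counter(marital_status)

print(f"{'Marital status':<16}{'Number of patients':>20}")
for status in ["married", "divorced", "single", "widow"]:
    print(f"{status:<16}{counts[status]:>20}")
print(f"{'Total':<16}{sum(counts.values()):>20}")
```

The total row is a useful check: it should equal the number of patients (15).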
b. Double tabulation:
It represents two variables.
For example: nationality and marital status for a number of patients.
Data: the nationality and marital status of 15 patients:
married s, divorced n, single s, widow s, single n, married n, divorced s, single
n, widow n, married s, married s, divorced n, single n, divorced s, single s.
Note: s = Saudi, n = non-Saudi.
Draw a table.
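A double table cross-classifies each patient by both variables. The sketch below counts (marital status, nationality) pairs with Python's collections.Counter, using the 15 observations listed above; the row and column layout is one reasonable choice, not the only one.

```python
from collections import Counter

# (marital status, nationality) for each patient; s = Saudi, n = non-Saudi
observations = [
    ("married", "s"), ("divorced", "n"), ("single", "s"), ("widow", "s"),
    ("single", "n"), ("married", "n"), ("divorced", "s"), ("single", "n"),
    ("widow", "n"), ("married", "s"), ("married", "s"), ("divorced", "n"),
    ("single", "n"), ("divorced", "s"), ("single", "s"),
]

cross_table = Counter(observations)

print(f"{'Marital status':<16}{'Saudi':>8}{'Non-Saudi':>11}{'Total':>7}")
for status in ["married", "divorced", "single", "widow"]:
    saudi = cross_table[(status, "s")]
    non_saudi = cross_table[(status, "n")]
    print(f"{status:<16}{saudi:>8}{non_saudi:>11}{saudi + non_saudi:>7}")

total_saudi = sum(n for (_, nat), n in cross_table.items() if nat == "s")
total_non_saudi = sum(n for (_, nat), n in cross_table.items() if nat == "n")
print(f"{'Total':<16}{total_saudi:>8}{total_non_saudi:>11}{total_saudi + total_non_saudi:>7}")
```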
c. Manifold Tabulation
It represents more than two variables.
Displaying data graphically:
Frequency Distribution (FD): relates every possible observation to its observed
frequency of occurrence.
Relative Frequency: the frequency expressed as a percentage of the total frequency;
used to compare the FDs of two or more groups of individuals.
Displaying frequency distribution:
1- One variable
a) Bar or column chart b) Pie chart c) Histogram
d) Dot plot e) Stem and leaf plot f) Box plot (box and whisker plot)
2- Two variables
a) Segmented column chart b) Scatter diagram
Shape of frequency distribution:
Distribution of data with a single peak:
1- Symmetrical (Normal or Gaussian) distribution.
2- Skewed distribution:
To the left (-ve skewed)
To the right (+ve skewed)
Statistical (Experimental) Parameters:
Measures of central tendency: 1. Mean (X̄) 2. Median 3. Mode
Measures of dispersion (spread): 1. Range 2. Standard Deviation (SD)
3. Mean Absolute Deviation 4. Variance (S²) 5. Coefficient of Variation (CV)
a) Measures of Central Tendency:
1. Mean (X̄): the average (used when the data are normally distributed)
X̄ = Σx / n
Σx: the sum of all the values
n: the number of values
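2. Median: (used when the data have a skewed distribution)
It is the midpoint of the distribution of the values, and how it is found depends
on the number of observations (n):
  1. Arrange the numerical observations in ascending or descending order.
  2. For an odd number of observations: the median is the ((n + 1) / 2)th value.
  3. For an even number of observations: the median is the average of the
     (n / 2)th and ((n / 2) + 1)th values.
3. Mode: the most frequent value.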
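The three measures of central tendency can be sketched as follows. The data values are hypothetical, and median_by_rank is an illustrative helper (not a library function) that follows the sort-then-pick-the-middle procedure, using the standard rules for odd and even numbers of observations.

```python
import statistics

def median_by_rank(values):
    """Median found by sorting and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th value, which is index n // 2 when counting from 0.
        return ordered[n // 2]
    # Even n: average of the (n / 2)th and ((n / 2) + 1)th values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

values = [2, 3, 3, 5, 7, 10]  # hypothetical observations

mean_value = sum(values) / len(values)   # sum of x divided by n
median_value = median_by_rank(values)
mode_value = statistics.mode(values)     # most frequent value

print(mean_value, median_value, mode_value)
```

The standard library's statistics.median gives the same result as the helper and can be used instead once the procedure is understood.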
Symbols of Population vs. Sample
Symbol              Population  Sample
Variable            X           x
Size                N           n
Mean                µ           X̄
Standard deviation  σ           S
b) Measures of Dispersion (spread):
1. Range = largest observation value – smallest observation value
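2. Standard Deviation:
S = √( Σ(x − X̄)² / (n − 1) )
Note: in a normal distribution:
mean ± 1 SD includes about 68.3% of the data,
mean ± 2 SD includes about 95.4%,
mean ± 3 SD includes about 99.7%.
3. Variance: the square of the standard deviation.
S² = Σ(x − X̄)² / (n − 1)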
4. Coefficient of Variation: a measure of spread that is independent of the unit of
measurement.
CV = (S / X̄) × 100
5. Mean Absolute Deviation: the average absolute distance of the observations from the mean.
MAD = Σ|x − X̄| / n
6. Standard Error of the Mean: quantifies the precision of the sample mean as an
estimate of the population mean.
S_X̄ = S / √n
7. Percentiles: used to describe a skewed distribution. We also need to calculate the
median, which is the point that divides the population in half.
Computing the percentile points of a population is a good way to see how close
to a normal distribution it is.
When a population follows a normal distribution, we can describe its location
and variability completely with two parameters, the mean and standard
deviation. When the population does not follow a normal distribution we can
describe it with the median and other percentiles. The standard error quantifies
the precision of these estimates.
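The measures of dispersion can be sketched together in one function. The data are hypothetical, and the sample formulas (dividing by n − 1 for the variance and standard deviation) are assumed, as is usual when working with a sample rather than a whole population.

```python
import math

def dispersion_summary(values):
    """Range, variance, SD, CV, mean absolute deviation, and SEM of a sample."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    sd = math.sqrt(variance)
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "sd": sd,
        "cv_percent": 100 * sd / mean,                            # unit-free spread
        "mean_abs_dev": sum(abs(x - mean) for x in values) / n,
        "sem": sd / math.sqrt(n),                                 # standard error of the mean
    }

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5
summary = dispersion_summary(values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The standard error is always smaller than the standard deviation for n > 1, because it describes the uncertainty of the mean rather than the spread of individual observations.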
The Standard Normal distribution has a mean of zero and a variance of one. If
the random variable x has a Normal distribution with mean µ and variance σ²,
then the Standardized Normal Deviate (SND),
z = (x − µ) / σ,
is a random variable that has a Standard Normal distribution.
Percentiles of a normal distribution (approximate):
Percentile          Value
2.5th               mean − 2 standard deviations
16th                mean − 1 standard deviation
25th                mean − 0.67 standard deviations
50th (median)       mean
75th                mean + 0.67 standard deviations
84th                mean + 1 standard deviation
97.5th              mean + 2 standard deviations
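The percentile table can be checked against the standard normal distribution. This sketch assumes Python 3.8 or later, where statistics.NormalDist is available; the helper name standardize and its example numbers are illustrative only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def standardize(x, mu, sigma):
    """Standardized normal deviate: distance from the mean in SD units."""
    return (x - mu) / sigma

# Number of standard deviations from the mean at each percentile.
percentiles = [2.5, 16, 25, 50, 75, 84, 97.5]
z_at_percentile = {p: standard_normal.inv_cdf(p / 100) for p in percentiles}

for p, z in z_at_percentile.items():
    print(f"{p:>5}th percentile: mean {z:+.2f} SD")

# Hypothetical example: a value of 130 when the mean is 100 and the SD is 15.
print(standardize(130, mu=100, sigma=15))
```

The exact multipliers are close to, but not exactly, the rounded values in the table (for example about 1.96 rather than 2 at the 97.5th percentile).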
1. Viral load of HIV-1 is a known risk factor for heterosexual
transmission of HIV; people with higher viral loads of HIV-1
are significantly more likely to transmit the virus to their
uninfected partners. Thomas Quinn and associates studied this
question by measuring the amount of HIV-1 RNA detected in
blood serum. The following data represent HIV-1 RNA levels in
the group whose partners seroconverted, which means that an
initially uninfected partner became HIV positive during the
course of the study; 79725, 12862, 18022, 76712, 256440,
14013, 46083, 6808, 85781, 1251, 6081, 50397, 11020,
13633, 1064, 496433, 25308, 6616, 11210, 13900 RNA
copies/mL. Find the mean, median, standard deviation, and
25th and 75th percentiles of these concentrations. Do these
data seem to be drawn from a normally distributed
population? Why or why not?
2. When data are not normally distributed, researchers can
sometimes transform their data to obtain values that more
closely approximate a normal distribution. One approach to
this is to take the logarithm of the observations. The following
numbers represent the same data described in Prob. 2-1
following log (base 10) transformation: 4.90, 4.11, 4.26, 4.88,
5.41, 4.15, 4.66, 3.83, 4.93, 3.10, 3.78, 4.70, 4.04, 4.13, 3.03,
5.70, 4.40, 3.82, 4.05, 4.14. Find the mean, median, standard
deviation, and 25th and 75th percentiles of these
concentrations. Do these data seem to be drawn from a
normally distributed population? Why or why not?
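As a hedged starting point for the first two problems, the sketch below computes the requested summaries for the raw and log-transformed HIV-1 RNA values using Python's statistics module. The helper summarize is illustrative. Quartile conventions differ between textbooks and software, so the 25th and 75th percentiles from statistics.quantiles (default method) may not match a hand calculation exactly.

```python
import statistics

raw = [79725, 12862, 18022, 76712, 256440, 14013, 46083, 6808, 85781, 1251,
       6081, 50397, 11020, 13633, 1064, 496433, 25308, 6616, 11210, 13900]

logged = [4.90, 4.11, 4.26, 4.88, 5.41, 4.15, 4.66, 3.83, 4.93, 3.10,
          3.78, 4.70, 4.04, 4.13, 3.03, 5.70, 4.40, 3.82, 4.05, 4.14]

def summarize(values):
    """Mean, median, sample SD, and quartiles of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "p25": q1,
        "p75": q3,
    }

raw_summary = summarize(raw)
log_summary = summarize(logged)

for label, summary in [("raw", raw_summary), ("log10", log_summary)]:
    print(label, {name: round(value, 2) for name, value in summary.items()})
```

One thing to look for in the output: when the mean is far from the median, the data are skewed and a normal distribution is a poor description; when the two are close, a normal distribution is more plausible.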
3. Polychlorinated biphenyls (PCBs) are a class of
environmental chemicals associated with a variety of adverse
health effects, including intellectual impairment in children
exposed in utero while their mothers were pregnant. PCBs are
also one of the most abundant contaminants found in human
fat. Tu Binh Minh and colleagues analyzed PCB concentrations
in the fat of a group of Japanese adults (“Occurrence of Tris
(4-chlorophenyl)methane, Tris (4-chlorophenyl)methanol, and
Some Other Persistent Organochlorines in Japanese Human
Adipose Tissue”). They detected 1800, 1800, 2600, 1300, 520,
3200, 1700, 2500, 560, 930, 2300, 2300, 1700, 720 ng/g lipid
weight of PCBs in the people they studied. Find the mean,
median, standard deviation, and 25th and 75th percentiles of
these concentrations. Do these data seem to be drawn from a
normally distributed population? Why or why not?
4. Sketch the distribution of all possible values of the number
on the upright face of a die. What is the mean of this
population of possible values?
5. Roll a pair of dice and note the numbers on each of the
upright faces. These two numbers can be considered a sample
of size 2 drawn from the population described in Prob. 2-4.
This sample can be averaged. What does this average
estimate? Repeat this procedure 20 times and plot the
averages observed after each roll. What is this distribution?
Compute its mean and standard deviation. What do they represent?
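The dice problems can be explored by simulation instead of physical rolls. This is a sketch under stated assumptions: a fair six-sided die, Python's random module, and an arbitrary fixed seed so the run is repeatable. It does not replace sketching the distributions by hand.

```python
import random
import statistics

faces = [1, 2, 3, 4, 5, 6]

# Population of possible values on one fair die.
population_mean = statistics.mean(faces)
population_sd = statistics.pstdev(faces)  # population SD (divide by N)

rng = random.Random(0)  # arbitrary seed, chosen only for repeatability

# Roll a pair of dice 20 times and average each pair (a sample of size 2).
averages = []
for _ in range(20):
    pair = (rng.choice(faces), rng.choice(faces))
    averages.append(sum(pair) / 2)

mean_of_averages = statistics.mean(averages)
sd_of_averages = statistics.stdev(averages)

# Theory: the SD of means of samples of size 2 is the population SD / sqrt(2).
expected_sd_of_averages = population_sd / 2 ** 0.5

print(population_mean, round(population_sd, 3))
print(round(mean_of_averages, 3), round(sd_of_averages, 3), round(expected_sd_of_averages, 3))
```

With only 20 rolls the observed values will differ somewhat from the theoretical ones; increasing the number of rolls brings them closer.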
6. Robert Fletcher and Suzanne Fletcher (“Clinical Research in
General Medical Journals: A 30-Year Perspective,” N. Engl. J.
Med., 301:180–183, 1979, used by permission) studied the
characteristics of 612 randomly selected articles published in
the Journal of the American Medical Association, New England
Journal of Medicine, and Lancet since 1946. One of the
attributes they examined was the number of authors; they
found:
Year  No. of articles examined  Mean no. of authors  SD
1946  151                       2.0                  1.4
1956  149                       2.3                  1.6
1966  157                       2.8                  1.2
1976  155                       4.9                  7.3
Sketch the populations of numbers of authors for each of these
years. How closely do you expect the normal distribution to
approximate the actual population of all authors in each of
these years? Why? Estimate the certainty with which these
samples permit you to estimate the true mean number of
authors for all articles published in comparable journals each
year.
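The certainty of each yearly mean can be estimated with the standard error of the mean, SD / √n. The sketch below applies it to the four rows of the table; reporting the mean ± 2 standard errors as an approximate 95% interval is an assumption of this sketch, not something stated in the problem.

```python
import math

# (year, number of articles examined, mean number of authors, SD)
author_table = [
    (1946, 151, 2.0, 1.4),
    (1956, 149, 2.3, 1.6),
    (1966, 157, 2.8, 1.2),
    (1976, 155, 4.9, 7.3),
]

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

results = {}
for year, n, mean, sd in author_table:
    sem = standard_error(sd, n)
    # Approximate 95% interval for the true mean: mean ± 2 standard errors.
    results[year] = (sem, mean - 2 * sem, mean + 2 * sem)

for year, (sem, low, high) in results.items():
    print(f"{year}: SEM = {sem:.2f}, approx. 95% interval = {low:.2f} to {high:.2f}")
```

The 1976 row stands out: its SD is larger than its mean, which cannot happen for a normally distributed count of authors that is never below 1, so the population is strongly skewed even though the standard error can still be computed.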
1. Crude birth rate:
= (Total number of live births during a year / Total mid-year population) × 1000
2. General fertility rate:
= (Number of live births during a year / Total number of women of childbearing age) × 1000
3. Age-specific fertility rate:
= (Number of births to women of a certain age in a year / Total number of women of the specified age) × 1000
4. Rate of natural increase (RNI):
= Crude birth rate − crude death rate of a population
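A short sketch of the fertility measures, assuming the mid-year population as the denominator of the crude rates. All counts are hypothetical, and the crude death rate is computed the same way as the crude birth rate so that the two can be subtracted.

```python
def per_thousand(events, population):
    """Events per 1000 members of the reference population."""
    return 1000 * events / population

# Hypothetical figures for one year in one population.
live_births = 3000
deaths = 900
mid_year_population = 150_000
women_of_childbearing_age = 40_000

crude_birth_rate = per_thousand(live_births, mid_year_population)
general_fertility_rate = per_thousand(live_births, women_of_childbearing_age)
crude_death_rate = per_thousand(deaths, mid_year_population)

# Rate of natural increase: births minus deaths, both per 1000 population.
rate_of_natural_increase = crude_birth_rate - crude_death_rate

print(crude_birth_rate, general_fertility_rate, rate_of_natural_increase)
```

The general fertility rate is higher than the crude birth rate for the same births because its denominator is restricted to women of childbearing age.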
Morbidity: the state of being diseased, or the number of sick persons or cases of
disease in relation to a specific population.
1. Incidence rate:
= (Total number of new cases of a specific disease during a year / Total population at risk during that year) × k
2. Prevalence rate:
= (Total number of cases, new or old, existing at a point in time / Total population at that point in time) × k
3. Case fatality ratio:
= (Total number of deaths due to a disease / Total number of cases of the disease) × k
k: a multiplier (such as 100, 1000, or 100,000) chosen according to the magnitude of the numerator.
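The three morbidity measures share one shape: a count divided by a reference count, multiplied by k. The sketch below uses hypothetical counts, and it assumes the population at risk as the incidence denominator.

```python
def scaled_rate(numerator, denominator, k):
    """Generic rate: (numerator / denominator) multiplied by k."""
    return k * numerator / denominator

# Hypothetical figures for one disease in one year.
new_cases = 50
existing_cases = 400        # new and old cases at one point in time
deaths_from_disease = 5
population = 25_000

incidence_rate = scaled_rate(new_cases, population, k=1000)
prevalence_rate = scaled_rate(existing_cases, population, k=1000)
case_fatality_ratio = scaled_rate(deaths_from_disease, new_cases, k=100)

print(f"Incidence: {incidence_rate} per 1000")
print(f"Prevalence: {prevalence_rate} per 1000")
print(f"Case fatality: {case_fatality_ratio} %")
```

Here k = 1000 keeps the population rates as readable whole numbers, while k = 100 expresses case fatality as a percentage.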
Mortality (death rate): the proportion of inpatient hospitalizations that end in
death, usually expressed as a percentage.
1. Annual crude death rate:
= (Total number of deaths during a year / Total mid-year population) × k
2. Age-specific death rate:
= (Total number of deaths in a specific age group during a year / Total population in that age group) × k
3. Age-adjusted death rate:
= (Total number of expected deaths / Total standard population) × 1000
4. Maternal mortality rate:
= (Number of maternal deaths in a given geographic area in a given year / Number of live births that occurred among the population of the given geographic area during the same year) × k
5. Infant mortality rate:
= (Deaths under 1 year of age during a year / Total number of live births during the year) × k
6. Neonatal mortality rate:
= (Deaths under 28 days of age during a year / Total number of live births during the year) × k
7. Fetal mortality rate:
= (Total number of fetal deaths during a year / Total deliveries during the year) × k
8. Total fertility rate:
= Σ (age-specific fertility rate × width of the age interval used for grouping)
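The age-adjusted death rate is the least obvious formula in this list, because the "expected deaths" have to be computed first. The sketch below assumes the direct method of standardization: each age-specific death rate is applied to a standard population. The age groups and all counts are hypothetical.

```python
# Hypothetical data: (age group, deaths, study population, standard population)
age_groups = [
    ("0-14", 10, 10_000, 30_000),
    ("15-64", 60, 20_000, 60_000),
    ("65+", 200, 5_000, 10_000),
]

total_deaths = sum(deaths for _, deaths, _, _ in age_groups)
total_population = sum(pop for _, _, pop, _ in age_groups)
total_standard = sum(std for _, _, _, std in age_groups)

# Crude death rate per 1000: ignores the age structure.
crude_death_rate = 1000 * total_deaths / total_population

# Expected deaths: age-specific death rate applied to the standard population.
expected_deaths = sum(deaths * std / pop for _, deaths, pop, std in age_groups)

# Age-adjusted death rate per 1000 of the standard population.
age_adjusted_death_rate = 1000 * expected_deaths / total_standard

print(round(crude_death_rate, 2), round(age_adjusted_death_rate, 2))
```

In this made-up example the adjusted rate is lower than the crude rate because the study population has a larger share of people aged 65 and over than the standard population.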
Health and hospital Statistics
1. Hospital Beds
   1. Nursery beds
   2. Recovery beds
   3. Day ward
   4. Labour beds
   5. Observation beds
2. Bed Days
4. Inpatient admission
5. Inpatient discharge
6. Inpatient census
7. Inpatient day
8. Bed Count: the number of available facility inpatient beds, both occupied and
vacant, on any given day.
9. Bed Count Day: a unit of measure denoting the presence of one inpatient bed
(either occupied or vacant), set up and staffed for use, in one 24-hour period.
10. Total Bed Count Days: the sum of inpatient bed count days for each of the days in
the period under consideration.
11. Bed Count Ratio: the proportion of beds occupied, defined as the ratio of inpatient
service days to bed count days in the period under consideration.
12. Length of Stay (LOS): the number of days a patient stays in the hospital.
13. Total Length of Stay (for all inpatients): the sum of the days of stay of any group of
inpatients discharged during a specified period of time.
14. Average Length of Stay (ALOS): the average number of days that inpatients
discharged during the period under consideration stayed in the hospital.
1. Average daily inpatient census: the average number of inpatients
present each day for a given period of time.
= Total inpatient service days for a period (excluding newborns) / Total number of days in the period
2. Average daily newborn census:
= Total newborn inpatient service days for a period / Total number of days in the period
3. Bed occupancy ratio:
= (Total inpatient service days in a period / Total bed count days in the period) × 100
where total bed count days = bed count × number of days in the period
4. Bed turnover rate:
= Number of discharges (including deaths) for a period / Average bed count during the period
5. Bed turnover interval:
= ((Bed count × 365) − inpatient days) / Number of discharges
6. Average length of stay (ALOS):
ALOS = Total length of stay (discharge days) / Total discharges (including deaths)
7. Postoperative infection rate:
= (Number of infections in clean surgical cases for a period / Number of surgical operations for the period) × 100
8. Postoperative death rate:
= (Number of deaths within 10 days after surgery for a period / Number of surgical operations for the period) × 100
9. Cancer mortality rate:
= (Number of cancer deaths during a period / Total number of cancer patients) × 100,000
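The main hospital utilization formulas can be sketched for one year of a hypothetical 100-bed hospital. All figures are invented for illustration, and the period is taken as 365 days so that the bed turnover interval matches the formula as written above.

```python
# Hypothetical annual figures for one hospital.
bed_count = 100
days_in_period = 365
inpatient_service_days = 29_200
discharges = 5_840               # including deaths
total_length_of_stay = 29_200    # discharge days of the discharged patients

bed_count_days = bed_count * days_in_period

average_daily_census = inpatient_service_days / days_in_period
bed_occupancy_ratio = 100 * inpatient_service_days / bed_count_days
bed_turnover_rate = discharges / bed_count
bed_turnover_interval = (bed_count_days - inpatient_service_days) / discharges
average_length_of_stay = total_length_of_stay / discharges

print(f"Average daily census: {average_daily_census:.1f} inpatients")
print(f"Bed occupancy ratio: {bed_occupancy_ratio:.1f} %")
print(f"Bed turnover rate: {bed_turnover_rate:.1f} discharges per bed")
print(f"Bed turnover interval: {bed_turnover_interval:.2f} days")
print(f"Average length of stay: {average_length_of_stay:.1f} days")
```

The turnover interval is the average time a bed stays empty between one discharge and the next admission, so it shrinks as occupancy rises.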