BIOSTATISTICS + EXERCISES

BIOSTATISTICS
SESSION 1, LEARNING NOTES
BY STUDENT MT-EINSTEIN, YEAR 2016-2017
COLLEGE OF MEDICINE AND HEALTH SCIENCES
SCHOOL OF MEDICINES AND PHARMACY
DEPARTMENT OF PHARMACY, YEAR 2, 2016-2017

WHAT IS BIOSTATISTICS
Etymologically,
biostatistics refers to the application of statistics to a wide
range of topics in biology, including medical sciences.
Specifically, biostatistics is the science which deals with the
development and application of the most appropriate methods for
the:
Collection of data;
Presentation/organization of the collected data in quantitative
form;
Analysis of the data;
Interpretation of the results and decision-making on the basis of such analysis.

WHY BIOSTATISTICS
Role of biostatisticians
GENERAL
To guide the design of an experiment or survey prior to data collection
To analyze data using proper statistical procedures and techniques
To present and interpret the results to researchers and other decision
makers
SPECIFIC
Identify and develop treatments for disease and estimate their
effects.
Identify risk factors for diseases.
Develop statistical methodologies to address questions arising from
medical/public health data
Design, monitor, analyze, interpret, and report results of clinical
studies.

AREA OF BIOSTATISTICS APPLICATION
Areas of application of biostatistics
Biostatistics concepts are applied to biological problems, including in:
Public health
Medicine
Ecology and environmental science
Biostatisticians must have knowledge of the above areas.

WHAT IS DATA
Data – definition and source
- “Data” are the starting point for all biostatistical operations, from the
design of an experiment up to the interpretation of the results of clinical
studies through accurate data analysis.
- We therefore first need to know what data are, and how and where they are
obtained.

DATA VS VARIABLE
Data: measurements or observations of a
variable made from a sample of a given
population.
Variable: a characteristic or
phenomenon that can be measured or
classified.

TYPE EXAMPLE
Example: class survey
Students in an introductory statistics course were asked the following
questions as part of a class survey:
1 What is your gender, male or female?
2 Are you introverted or extraverted?
3 On average, how many hours of sleep do you get per night?
4 What is your bedtime: 8pm-10pm, 10pm-12am, 12am-2am, later
than 2am?
5 How many countries have you visited?
6 On a scale of 1 (very little) - 5 (a lot), how much do you dread
this semester?

TYPE OF VARIABLE
Gender, personality: categorical, nominal (dichotomous here, since each has two categories)
no inherent order between male and female, therefore gender is
not ordinal
sleep: numerical, continuous
even though data is reported as whole numbers, sleep is measured
on a continuous scale, people just tend to round their responses in
surveys
bedtime: categorical, ordinal
there is an inherent ordering in these time intervals

TYPE OF VARIABLE
countries: numerical, discrete
data are counted, and can only take on whole numbers
dread: categorical, ordinal, could also be used as numerical
categories have an inherent ordering
Demographic data such as sex or race are recorded as nominal variables.
Categorical variables can be nominal or ordinal.
A nominal variable is assigned (not measured) and could be a demographic
characteristic such as sex or race.
An ordinal variable is a ranking, such as mild, moderate, or severe.

POPULATION VS SAMPLE
Population
We want to learn something about
the characteristics of the
population (parameters).
A population is described by parameters.
Sample
We collect data from a sample of the
population, since studying the
whole population is too time-consuming.
A sample is described by statistics.

Statistic: Summary data from a sample.
Examples:
The observed proportion of the sample that responds to treatment;
The observed association between a risk factor and a disease in this
sample.
Parameter: Summary data from a population.
Examples:
The proportion of the population that would respond to a certain
drug
The association between a risk factor and a disease in a population

Population
A group of individuals
that we would like to
know something about
Sample
A subset of a population
(hopefully
representative)

RANDOM VS NON RANDOM SAMPLING
Random samples
Subjects are selected from a population so that
each individual has an equal chance of being selected.
Random samples tend to be representative of the source population.
Non-random samples may not be representative.
They may be biased regarding age, severity of the condition,
socioeconomic status, etc.

RANDOM VS NON RANDOM
Random samples are rarely utilized in health care
research.
 Instead, patients are randomly assigned to treatment and
control groups.
Each person has an equal chance of being assigned to either
of the groups.
Random assignment is also known as randomization.

LEVELS OF MEASUREMENT IN BIOSTATISTICS
Variables in a study are measured on a certain scale of measurement.
Scales or levels of measurement refer to how the properties of numbers
can change with different uses.
There are 4 levels (scales) of measurement, which define
different kinds of variables, hence different kinds of data:
Nominal
Ordinal
Interval
Ratio

NOMINAL DATA/VARIABLE
Nominal = categorical
Data that are classified into categories and
cannot be arranged in any particular order.
A nominal variable with only two categories (e.g. yes or no) is dichotomous.
E.g. gender (male and female); country of birth (Rwanda, USA,
India...); personality type; yes or no; demographic characteristics.

ORDINAL VARIABLE
Ordinal = ranked
Data that are ranked or ordered: 1st, 2nd, 3rd, etc.
Used to rank and order the levels of the data or variable being
studied. No particular value is placed on the distance between the numbers in the
rating scale.
E.g. Adverse events, such as the grading of an ocular problem in a patient:
Mild, moderate, severe, life-threatening, death
E.g. Income, where the levels differ and are ordered:
Low, medium, high

INTERVAL DATA/VARIABLE
Interval
The difference between the numbers on the scale is meaningful and intervals
are equal in size. There is NO absolute zero.
E.g. Temperatures on a thermometer: the difference between 60 and 70
is the same as the difference between 90 and 100.
(The length of a person or an object, by contrast, has an absolute zero and
is therefore measured on a ratio scale.)
Intervals allow for comparisons between the things being measured.

RATIO VARIABLE/ DATA
Ratio: has an absolute zero point
Scales that do have an absolute zero point,
which indicates the absence of the variable
being studied.
E.g. Body weight, height, family size, age, ...

MEASURE OF CENTRAL TENDENCY
Scale of measurement    Best measure of central tendency
Nominal                 Mode
Ordinal                 Median
Interval                Symmetrical: Mean; Skewed: Median
Ratio                   Symmetrical: Mean; Skewed: Median

COMMENT
Quantitative variables are measured values.
A discrete quantitative variable has a finite number of possible
measurements.
A continuous quantitative variable has an infinite number of possible
measurements within a range, as would be typical for a serum
chemistry test such as glucose.

CLINICAL CASE
A 33-year-old woman comes to you complaining of lower
abdominal pain which she has had for the past day. She
left her job as a nurse's aide (her second day on the job)
because the pain was so bad. She says the pain began
after she had fallen off a stepstool while getting a
bedpan off a top shelf. No one saw her fall, but she
convinced her supervisor that she had an industrial
accident and needed medical attention because of blood
in her urine. To prove it, she brings in a urine specimen

QUESTIONS 1
How do you correlate the macroscopic and microscopic findings?
The macroscopic appearance is red, but the test for blood is
negative and there are no RBC's microscopically. It is unlikely
to be rhabdomyolysis. This specimen could be factitious.
It would be a simple matter to have the patient produce
another sample (though she might still be carrying the same
bottle of red food coloring with her). Remember that various
drugs can also produce colored urine. Eating fresh beets can
color the urine red temporarily

QUESTIONS 2
What do you think is happening?
Although care and concern should be the
immediate response of health care workers to
a patient, and historical findings should be duly
noted, remember that patients may not always
be telling you everything, or telling you
correctly, particularly when compensation is
being sought.

QUESTIONS 3
What kind of variables are pH and protein?
These measurements represent a quantitative (measured) variable
that is discrete, with a finite number of possible measurements in
the range of 5 to 8 for pH and from 0 to 4+ for protein.
The other form of quantitative variable is continuous with an infinite
number of possible measurements within a range, as would be
typical for a serum chemistry test such as urea nitrogen or
creatinine.
Categorical variables could be nominal or ordinal. A nominal variable
is assigned (not measured) and could be a demographic
characteristic such as sex or race. An ordinal variable is a ranking,
such as mild, moderate, or severe.

CONTINUOUS VS DISCRETE DATA
Continuous
Definition: A set of data is said to be continuous if the values
belonging to the set can take on ANY value within a finite or
infinite interval.
Example:
• A person's height could be any value (within the range of human
heights), not just certain fixed heights.
Discrete
Definition: A set of data is said to be discrete if the values
belonging to the set are distinct and separate (unconnected values).
Examples:
• The number of students in a class.
• The number of questions on a pharmacology test.

CONTINUOUS DATA VS DISCRETE
Continuous data (CD)
Further examples:
• A person's body weight, age, ...
• The outdoor temperature (T°) at noon (any value within the
possible range of temperatures).
NOTE: Continuous data are measured.
Function: In the graph of CD, the points are connected with a
continuous line, since every point has meaning to the original
problem.
Discrete data (DD)
NOTE: Discrete data are counted.
Function: In the graph of DD, only separate, distinct points are
plotted, and only these points have meaning to the original
problem.

MORE TO KEEP IN MIND
Continuous numeric data are of interest in investigations such as:
Average age of patients compared to average age of non-patients
Respiratory rate of those exposed to a chemical vs. respiratory rate
of those who were not exposed
If there are many different discrete values, then discrete data is
often treated as continuous.
Examples: CD4 count, HIV viral load
If there are very few discrete values, then discrete data is often
treated as ordinal.

TYPE OF VARIABLE (NOT KIND)
Variables can be classified into 2 types:
independent or
dependent
1. Independent variable (IV)
is the variable that is manipulated (or selected) by the researcher in an experiment and that
is not affected by the other variables being observed (hence “independent”).
The IV is believed to influence the outcome measure (dependent variable) and is
the “presumed cause.”
E.g. time, age, ...

TYPE OF VARIABLE
2. Dependent variable (DV)
is the variable that depends on the independent variable(s),
i.e. a DV is the variable that is believed to change in the presence of the
independent variable.
It is the “presumed effect.”
The measured variable in an experiment (e.g. plasma concentration) is
referred to as the DV.
DV vs IV, plasma concentration and time: take the example of a patient who
has taken a drug in the morning. The plasma concentration of this drug is a DV
since it changes over time (the IV) during the day after drug intake.

TYPE OF VARIABLE
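An intervening variable
is the variable that links the independent and dependent variables.
A confounding variable, as used here, is a variable that has many other variables
or dimensions built into it, so that it is not clear what it contains or measures.
For example: Socio-Economic Status (SES). How can we measure
SES? Income, employment status, etc.
More generally, a confounder is a variable associated with both the independent
and the dependent variable, which can distort the apparent relationship between them.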

EXAMPLE OF CONFOUNDING VARIABLE
We need to be careful when using confounding variables.
Example
A researcher wants to study the effect of Vitamin C on cancer.
Vitamin C would be the independent variable because it is hypothesized
that it will have an effect on cancer, and cancer would be the dependent
variable because it is the variable that may be influenced by Vitamin C.

DATA PRESENTATION METHOD
Numerical presentation
Graphical presentation
Mathematical presentation

NUMERICAL PRESENTATION
E.g. frequency tables and other tabular summaries

GRAPHICAL PRESENTATION
1. Graphs drawn using Cartesian coordinates
In graphs, the data can be concisely summarized into:
• Bar graph (or bar chart), Histogram, Box plot, Line graph, Frequency polygon,
Frequency curve, Scatter plot
Use bar graphs when presenting nominal data (no order to the horizontal axis).
Use histograms when presenting continuous or ordinal data (these should be on the
horizontal axis).
Use box plots when presenting continuous data.
2. Pie chart
3. Statistical maps

WHY IT IS ALWAYS BETTER TO SUMMARIZE YOUR DATA
It is ALWAYS a good idea to summarize your data (at least for important
variables)
You become familiar with the data and the characteristics of the sample
that you are studying
You can also identify problems with data collection or errors in the data
(data management issues)
Dataset structure: presenting data first requires building a dataset.
Think of data as a rectangular matrix of rows and columns.
This is the simplest structure.
Rows represent the “experimental units”. NB: Each row is an independent
observation.
Columns represent the “variables” measured on the experimental unit.

SCATTER PLOTS ARE USED TO SHOW CORRELATION

MATHEMATICAL PRESENTATION
Data presentation is usually performed through Descriptive statistics.
Some measures that are commonly used to describe a data set are the
following:
Measures of Central Tendency
-mean
-median
-mode
Measures of Variability (Dispersion)
-range
-variance
-standard deviation

MEASURE OF CENTRAL TENDENCY
Mode: the most frequently occurring score.
Median: the value that divides the ordered scores into 2 halves. With an odd number of
scores it is the middle score; with an even number it is the average of the two middle scores.
Mean: the sum of all the scores divided by the total number of scores (= average).
When the distribution of the data is normal, the mean lies in the middle of the distribution
of the scores and equals the median.
The mean is a good measure of central tendency.
It is preferred whenever possible and is the measure of central tendency most often
used in advanced statistical calculations:
o More reliable and accurate
o Better suited to arithmetic calculations
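
The three measures above can be computed with Python's standard `statistics` module. This is only a minimal sketch; the scores are made-up values chosen for illustration and do not come from the notes.

```python
from statistics import mean, median, mode

# Hypothetical scores (illustrative only).
scores = [2, 3, 3, 4, 6, 8, 9]

print(mean(scores))    # sum of the scores divided by the number of scores
print(median(scores))  # middle value of the ordered scores (odd count)
print(mode(scores))    # most frequently occurring score

# With an even number of scores, the median is the average of the two middle values.
even_scores = [2, 3, 4, 6]
print(median(even_scores))
```

Note how the odd and even cases are handled differently for the median, as the definition requires.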

C.T
The mean can be misleading because it can be greatly influenced by extreme
scores, called outliers.
For example, the average length of stay at a hospital could be greatly
influenced by one patient who stays for 5 years.
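
A small sketch of the length-of-stay example, assuming invented stays (in days) for seven patients, one of whom stays about 5 years (1825 days). The numbers are hypothetical and only meant to show how differently the mean and the median react to one outlier.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days; the last patient stays about 5 years.
stays = [2, 3, 3, 4, 5, 6, 1825]

mean_stay = mean(stays)      # pulled far upward by the single extreme value
median_stay = median(stays)  # barely affected by the extreme value

print(f"mean = {mean_stay:.1f} days, median = {median_stay} days")
```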

C.T
Sometimes the median may yield more information when your
distribution contains outliers, or is skewed (not normally distributed).
What is a median?

MEASURE OF THE VARIABILITYRange = MAX-MIN
Used only for Ordinal, Interval, and Ratio scales as the data must be ordered
Example: 2 3 4 6 8 11 24 (Range is 22)
 Variance (S2)
- The variance is the extent to which individual scores in a distribution of scores
differ from one another. The larger the variance, the further spread out the data. IS
a measure of how spread out a distribution S
- The variance is the average squared deviation of the observations from their
mean (how the observations ‘vary’ from the mean).
Standard Deviation (SD)
SD=The square root of the variance
SD is a measure of the variability of a set of data in a distribution (most widely
used measure of the dispersion)
SD reflects how the data/observations/scores vary from the mean
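
As a rough illustration, the range, variance and SD of the example values 2 3 4 6 8 11 24 can be computed as below. The notes define the variance as the average squared deviation (dividing by n); many textbooks and software packages instead report the sample variance, which divides by n - 1. Both are shown here, and which one the course intends is an assumption left open.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 3, 4, 6, 8, 11, 24]

data_range = max(data) - min(data)  # MAX - MIN

# Average squared deviation from the mean (divides by n).
pop_var = pvariance(data)
pop_sd = pstdev(data)

# Sample versions divide by n - 1 instead of n.
sample_var = variance(data)
sample_sd = stdev(data)

print(data_range, pop_var, pop_sd, sample_var, sample_sd)
```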

STANDARD DEVIATION AND VARIANCE

QUARTILES
Quartiles are the three
values that split the sorted
data into four equal parts.
-Second quartile (Q2) =
median.
-Lower quartile (Q1) =
median of the lower half of
the data
-Upper quartile (Q3) =
median of the upper half of
the data
-Need to order the
individuals first (from 1 to
“N” individuals)
-One quarter of the
individuals are in each
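
The quartile definition above (Q2 is the median, the lower quartile Q1 is the median of the lower half, the upper quartile Q3 is the median of the upper half) can be sketched as follows. The helper name `quartiles` is hypothetical, and leaving the middle value out of both halves when the count is odd is one common convention; statistical packages may use slightly different rules.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)          # order the individuals first
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + (n % 2):]  # values above it (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([24, 2, 8, 3, 11, 4, 6])
print(q1, q2, q3)
```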

STANDARD ERROR OF MEAN
A measure of variability among means of samples selected from
certain population.

PROBABILITY DISTRIBUTION
A probability distribution
is a device for indicating the values that a random variable may have.
There are two categories of random variables:
a. discrete random variables, and
b. continuous random variables.
1. The probability distribution of a discrete random variable
specifies all possible values of the variable along with their respective
probabilities. Examples can be:
Frequency distribution
Probability distribution (relative frequency distribution)
Cumulative frequency
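
A minimal sketch of how the three tables listed above could be built for a discrete variable. The answers are invented responses to a "how many countries have you visited?" type of question and are purely illustrative.

```python
from collections import Counter

# Hypothetical discrete data (illustrative only).
visits = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1]

counts = Counter(visits)
n = len(visits)

cumulative = 0
table = []
for value in sorted(counts):
    frequency = counts[value]
    cumulative += frequency
    # (value, frequency, relative frequency, cumulative frequency)
    table.append((value, frequency, frequency / n, cumulative))

for row in table:
    print(row)
```

The relative frequencies add up to 1, and the last cumulative frequency equals the number of observations.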

PROBABILITY DISTRIBUTION
A continuous random variable can assume any value
within a specified interval of values assumed by the
variable.
In a general case, with a large number of class
intervals, the frequency polygon begins to resemble a
smooth curve.

NORMAL DISTRIBUTION=GAUSSIAN DISTRIBUTION
The shape of data
Histograms of frequency distributions
show the shape of the data best.
Distributions are often symmetrical,
with most scores falling in the middle
and fewer toward the extremes.
Many biological data are symmetrically
distributed and form a normal curve
(also called a “bell-shaped curve”). Such
data are said to be normally
distributed.

PROPERTIES OF A NORMAL DISTRIBUTION
The area under a normal curve describes a normal distribution.
Properties of a normal distribution are:
It is symmetric about its mean.
The highest point is at its mean.
The mean, median and mode are all equal.
The total area under the curve above the x-axis is 1 square unit.
Therefore 50% is to the right of the median and 50% is to the left of the median.

PROPERTIES OF A NORMAL DISTRIBUTION
Perpendiculars drawn at:
mean ± 1s contain about 68%;
mean ± 2s contain about 95%;
mean ± 3s contain about 99.7% of the area under the curve.
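
The 68% / 95% / 99.7% figures can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `area_within` is hypothetical; the sketch uses the standard normal curve, but the proportions are the same for any mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve between mean - k*s and mean + k*s."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```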

WHY SPREAD IS IMPORTANT
Spread is important
when comparing 2 or
more group means.
For instance, it is
more difficult to see
a clear distinction
between groups in
the upper example
because the spread is
wider, even though
the means are the
same.

STANDARD NORMAL DISTRIBUTION
A normal distribution is determined by its mean (m) and standard deviation (s).
This creates a family of distributions depending on whatever the
values of m and s are.
- The standard normal distribution has
mean = 0 and standard deviation = 1.
Standard Z-Scores
The standard z score is obtained by creating a variable z whose
value is:
z = (x - m) / s

STANDARD NORMAL DISTRIBUTION
Given the values of m and s we can convert a value of x to a value of z.
A Z-score
is the number of standard deviations above or below the mean.
A Z-score of 1.5 means
that the score is 1.5 standard deviations above the mean;
a Z score of -1.5 means
that the score is 1.5 standard deviations below the mean.
It always has the same meaning in all distributions.
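
A short sketch of the conversion from x to z, assuming a hypothetical distribution with mean 100 and standard deviation 10 (these numbers are illustrative and not from the notes). The function name `z_score` is also only illustrative.

```python
def z_score(x, m, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean m."""
    return (x - m) / s

# Hypothetical distribution: mean 100, standard deviation 10.
print(z_score(115, 100, 10))  # 1.5 standard deviations above the mean
print(z_score(85, 100, 10))   # 1.5 standard deviations below the mean
```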

DISTORTION OF NORMAL CURVE
Data may not be normally
distributed:
- There may be data that
are outliers that distort
the mean. This is called
skewed distribution
(SKEWNESS).
- Data may be bunched
about the mean in a non-
normal fashion. This is
called kurtotic
distribution (KURTOSIS).
Normal Distribution Graph-Box
Plot:

POSITIVE (+) AND NEGATIVE (-) SKEWNESS
Skewness: the data are not distributed symmetrically
around the mean. Consequently:
The mean, median, and mode are not
equal and are in different positions;
Scores (data) are clustered at one end of
the distribution (right or left);
A small number of outliers are located at
the opposite end.
A variable that is positively skewed has
large outliers to the right of the mean, that
is, greater than the mean. In that case, the
positively skewed distribution ‘points’
towards the right.
A variable that is negatively skewed has
outliers to the left of the mean, that
is, smaller than the mean. In that case, the
negatively skewed distribution ‘points’
towards the left.

POSITIVE (LEPTO) AND NEGATIVE (PLATY) KURTOSIS
Kurtosis examines how peaked or flat
a distribution is compared with
a perfect normal ‘bell shape’.
A variable that is positively
kurtic (has a positive kurtosis) is
leptokurtic and is too ‘pointed’,
with a low standard deviation
value. In this case, the data are
bunched together and give a
tall, thin distribution which is
not normal.
A variable that is negatively
kurtic is platykurtic and is too
‘flat’. In this case, the data are
spread out and give a low, flat
distribution which is not normal.

HOW TO EXAMINE THE NORMAL DISTRIBUTION OF THE DATA
There are both graphical and statistical methods for evaluating normality.
Graphical methods mainly include the histogram, Box-Whisker plot,
dot plot, and the normality plots (= Q-Q and P-P plots), etc. Normality
plots are widely used.
Statistical methods include:
o diagnostic tests for skewness and kurtosis (values within the ±0.5
interval are taken as normal)
o Normality statistical tests

WHAT SHOULD BE DONE FOR THE ABNORMAL DISTRIBUTION OF THE DATA
When testing shows that the data are not normally distributed, a
transformation is required in order to analyze the data parametrically.
If no transformation is done, we conduct a non-parametric analysis of the data.
Three common transformations are:
the logarithmic transformation (the commonest),
the square root transformation, and
the inverse transformation.
They correct for skew and unequal variances.
Notice
Transformation should be justified: it is recommended when
including a non-normally distributed variable in the analysis
would reduce the effectiveness at identifying statistical
relationships, i.e. when the lack of normal distribution of the
variable to be analyzed leads to a loss of power.
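
The three transformations can be sketched in plain Python as below. The function names and the sample values are hypothetical, and base 10 is used for the logarithm only as an assumption (the notes do not say which base is intended; the natural logarithm is also common).

```python
import math

def log_transform(values):
    """Logarithmic transformation (base 10 here); every value must be > 0."""
    return [math.log10(v) for v in values]

def sqrt_transform(values):
    """Square root transformation; every value must be >= 0."""
    return [math.sqrt(v) for v in values]

def inverse_transform(values):
    """Inverse (reciprocal) transformation; no value may be 0."""
    return [1 / v for v in values]

# Hypothetical right-skewed measurements: most values are small, one is very large.
raw = [10, 20, 40, 100, 10000]
print(log_transform(raw))
print(sqrt_transform(raw))
print(inverse_transform(raw))
```

After the log transformation the very large value sits much closer to the others, which is why this transformation is often tried first on right-skewed data.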

TYPES OF STATISTICS
There are two types of statistics:
Descriptive statistics
Inferential statistics
1. Descriptive statistics
are used to summarize, organize, and make sense of a set of data (scores or
observations);
are typically presented graphically, in tabular form (in tables), or as summary
statistics (single values),
e.g. mean, median, mode, frequencies, range, variance, standard deviation,
quartiles, standard error of the mean;
also help to describe the relationship between variables.
NB: descriptive statistics have been largely discussed in the previous paragraphs.

INFERENTIAL STATISTICS
Inferential Statistics are used to draw inferences about a population
from a sample.
Specifically it allows researchers to infer (make inferences) or
generalize observations made with samples to the larger population from
which they were selected.
Population and samples (reminder!):
Population: Group that the researcher wishes to study.
 Sample: A group of individuals selected from the population.
 Census: Gathering data from all units of a population, no sampling.
Inferential statistics generally require that data come from a random
sample (i.e. Probability sampling/equal chance of being chosen).

STATISTICAL SIGNIFICANCE
Significance level
Statistical analyses:
Allow us to quantify the degree of relationship between variables
Allow generalization about populations using data from samples (inferential)
Specifically, the goal of statistical analysis is to answer the question of whether there is a
significant effect/association/difference between the variables of interest, and how big it is
(if there is any).
The significance level is the pre-determined value used to reject or retain the
hypothesis.
A value of 0.05 is commonly used; the probability computed from the data and compared against it is called the “p-value”.
Statistically significant findings mean that the probability of obtaining such findings by
chance only is less than 5% (i.e. findings would occur no more than 5 out of 100 times by
chance alone).
Therefore, findings would be deemed
statistically significant if the p-value were found to be 0.05 or less (p ≤ 0.05)
not statistically significant (insignificant) if it were found to be greater than 0.05 (p > 0.05)

MEASURE OF ASSOCIATION
What if there is an effect?
You need to measure how big the effect is by using a measure of
association like odds ratio, relative risk, absolute risk, attributable risk
etc..
Absolute Risk is the chance that a person will develop a certain disease
over a period of time (comparable to the notion of hazard in toxicology).
E.g.: out of 20,000 people, 1,600 developed lung cancer over 10 years,
therefore the absolute risk of developing lung cancer is 8%.
Relative Risk (RR) is a measure of association between the presence or
absence of an exposure and the occurrence of an event.
o RR is when we compare one group of people to another to see if there
is an increased risk from being exposed.

o Used in randomized control trials and cohort studies.
o Can't use RR unless looking forward in time.
o RR is the measure of risk for those exposed compared to those
unexposed.
E.g. :
The 20 year risk of lung cancer for smokers is 15%
The 20 year risk of lung cancer among non-smokers is 1%
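
The two examples above can be worked through in a few lines. This is only a sketch: the function names are hypothetical, and the relative risk is simply the ratio of the two 20-year risks quoted above (15% for smokers and 1% for non-smokers).

```python
def absolute_risk(events, total):
    """Proportion of people who develop the disease over the period."""
    return events / total

def relative_risk(risk_exposed, risk_unexposed):
    """Risk among the exposed divided by the risk among the unexposed."""
    return risk_exposed / risk_unexposed

ar = absolute_risk(1600, 20000)   # lung cancer example above
rr = relative_risk(0.15, 0.01)    # smokers compared with non-smokers

print(f"absolute risk = {ar:.0%}")
print(f"relative risk = {rr:.0f}")
```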

Odds Ratio (OR) is a way of comparing whether the probability of a certain event
is the same for two groups.
Used for cross-sectional studies, case-control studies, and retrospective studies
(studies that refer to past events).
o In case-control studies you can't estimate the rate of disease among study
subjects because subjects are selected according to disease/no disease. So, you can't
take the rate of disease in both populations (in order to calculate RR).
o OR is the comparison between the odds of exposure among cases and the odds of
exposure among controls.
o Odds are the same as betting odds. Example: if you have a 1 in 3 chance of winning a
draw, your odds are 1:2.
o To calculate OR, take the odds of exposure (cases) / odds of exposure (controls).
E.g. Smokers are 2.3 times more likely to develop lung cancer than non-smokers.
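
A minimal sketch of the odds and odds ratio calculations. The function names are hypothetical and the case-control counts are invented for illustration; they are not the data behind the 2.3 figure quoted above.

```python
def odds(probability):
    """Convert a probability into odds: chance of the event / chance of no event."""
    return probability / (1 - probability)

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# A 1 in 3 chance of winning corresponds to odds of 1:2, i.e. about 0.5.
print(round(odds(1 / 3), 3))

# Hypothetical case-control counts (illustrative only):
# 60 exposed and 40 unexposed cases, 30 exposed and 70 unexposed controls.
print(round(odds_ratio(60, 40, 30, 70), 2))
```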

CONFIDENCE INTERVALS
When we measure the size of the effect we use confidence intervals
(CI). A CI is the range* in risk we would expect to see in the population.
A CI provides an expected upper and lower limit (= range*) for a statistic
at a specified probability level (usually 95%, and sometimes 99%).
The odds ratio we found from our sample (e.g. smokers are 2.3 times
more likely to develop cancer than non-smokers) is only true for the
sample we examined; it might be slightly different if we used another sample.
For this reason we calculate a confidence interval, which is the range in
risk we would expect to see in this population.

C.I
E.g: “a study of the effect of smoking on developing cancer”:
o A 95% confidence interval of 2.1 to 3.4 tells us that while smokers in
our study were 2.3 times more likely to develop cancer, in the general
population, smokers are between 2.1 and 3.4 times more likely to develop
cancer. We are 95% confident that this is true.
Calculating a CI:
For example, a sample mean is an estimate of the population mean.
A CI provides a band within which the population mean is likely to fall:
CI = mean ± (Sm × confidence level), where Sm is the standard error of the mean

CI
Example: n = 30, M = 40, s = 8
CI = 40 ± (1.46 × 2.045)
CI = 40 ± 2.99 = 37.01 to 42.99
The value “1.46” came from the following formula: Sm = s / √n = 8 / √30 ≈ 1.46
The value “2.045” (confidence level) came from appropriate tables.
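
The worked example can be reproduced as follows. This sketch assumes the standard error of the mean is s / √n, which gives the 1.46 used above, and takes 2.045 directly from the example as the tabulated value (it matches a t-table at 95% with n - 1 = 29 degrees of freedom).

```python
import math

n, sample_mean, s = 30, 40, 8  # sample size, sample mean, sample standard deviation
table_value = 2.045            # taken from the example (95%, 29 degrees of freedom)

sm = s / math.sqrt(n)          # standard error of the mean
margin = sm * table_value
lower, upper = sample_mean - margin, sample_mean + margin

print(f"Sm = {sm:.2f}")
print(f"CI = {lower:.2f} to {upper:.2f}")
```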

POWER
If findings are statistically significant, then conclusions can be
easily drawn, but what if findings are insignificant? Power
is the probability that a test or study will detect a
statistically significant result when an effect really exists.
Did the independent variables or treatment have zero effect? If
an effect really occurs, what is the chance that the experiment
will find a "statistically significant" result?
Determining power depends on several factors:
1) Sample size: how big was your sample?
2) Effect size: what size of an effect are you looking for? E.g.
How large of a difference (association, correlation) are you looking
for? What would be the most scientifically interesting?
3) Standard deviation: how scattered was your data?

POWER
For example:
a large sample, with a large effect, and a small standard
deviation
would be very likely to have found a statistically significant
finding, if it existed.
A power of 80%-95% is desirable.
 One of the best ways to increase the power of your study is to increase
your sample size.
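
To see how sample size, effect size and standard deviation act together, here is a rough sketch based on a normal approximation for comparing the means of two equally sized groups with a two-sided test. This particular formula is an assumption chosen for illustration (the notes do not specify a method), the function name `approx_power` is hypothetical, and real studies would normally rely on dedicated power software.

```python
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test for a difference in means."""
    z = NormalDist()
    standard_error = sd * (2 / n_per_group) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / standard_error
    # Probability of landing beyond either critical value when the effect is real.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Hypothetical scenario: true difference of 5 units, SD of 10.
for n in (16, 64, 128):
    print(n, round(approx_power(effect=5, sd=10, n_per_group=n), 2))
```

In this sketch the power rises as the sample size or the effect size grows, and falls as the standard deviation grows, in line with the three factors listed above.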

STATISTICAL ANALYSES
Statistical analyses are either
parametric or
non-parametric.
Therefore, statistical analyses are performed using
parametric tests = assume that the variable in question is from a normal
distribution;
non-parametric tests = do not require any assumption of normal
distribution and are not sensitive to departures from it.
Most non-parametric tests do not require an interval or ratio
level of measurement; can be used with nominal/ordinal level data.

EXAMPLES OF PARAMETRIC AND NON-
PARAMETRIC TESTS

INTRODUCTION TO SPSS FOR DATA HANDLING
Data entry in SPSS
Drawing graphs in SPSS
Computing descriptive statistics
Testing for normality assumptions
SPSS (Statistical Package for the Social Sciences) was designed to offer a more
user-friendly data analysis presentation than other statistical software.
It has had different names and versions over the past years (SPSS, IBM-SPSS, PASW -
Predictive Analytics Software).

TYPE OF THE DATA
Types of data (reminder):
Nominal, Ranked, Scales (measures: Interval, Ratio), Mixed types
Text answers (open ended questions)
Nominal (categorical)
− Order is arbitrary when entering data in SPSS
− e.g. Gender, country of birth, personality type, yes or no.
− Use numeric in SPSS and give value labels.
(e.g. 1=Female, 2=Male, 99=Missing)
(e.g. 1=Yes, 2=No, 99=Missing)
(e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=other, 99=Missing)

Ranks or Ordinal
Data must be ordered, 1st, 2nd, 3rd etc. e.g. status, social class
Use numeric in SPSS with value labels
E.g. 1=Working class, 2=Middle class, 3=Upper class
• E.g. Class of degree, 1=First, 2=Upper second, 3=Lower second, 4=Third,
5=Ordinary, 99=Missing
Measures, scales
− Interval - equal units
− Ratio - equal units, zero on scale
• e.g. Family size, Salary
• Makes sense to say one value is twice another
− Use numeric (or comma, dot or scientific) in SPSS
• NB: numeric if you can manage to use numbers
• E.g. Family size, 1, 2, 3, 4 etc.
• E.g. Salary per year, 25000, 14500, 18650 etc.

Mixed type
− Categorised data
− Actually ranked, but used to identify categories or groups
e.g. age groups
= ratio data put into groups
− Use numeric in SPSS and use value labels.
E.g. Age group, 1=Under 15
2=15-34
3=35-54
4=55 or greater
Text answers
− E.g. answers to open-ended questions
− Either enter text as given (Use String in SPSS) or
− Code or classify answers into one of a small number of types (Use numeric/nominal in
SPSS)
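
The idea of storing a number and attaching a value label to it can be sketched outside SPSS as well. The Python snippet below codes ages into the age groups listed above; it illustrates the coding scheme only, not SPSS syntax. The names `AGE_GROUP_LABELS` and `age_group_code` are hypothetical, and using 99 for a missing age is an assumption borrowed from the nominal examples.

```python
# Value labels for the age-group codes (99 = Missing is an assumption).
AGE_GROUP_LABELS = {1: "Under 15", 2: "15-34", 3: "35-54", 4: "55 or greater", 99: "Missing"}

def age_group_code(age):
    """Turn an age in years (ratio data) into a numeric group code."""
    if age is None:
        return 99
    if age < 15:
        return 1
    if age < 35:
        return 2
    if age < 55:
        return 3
    return 4

ages = [12, 15, 34, 40, 70, None]
codes = [age_group_code(a) for a in ages]
print(codes)
print([AGE_GROUP_LABELS[c] for c in codes])
```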

COMPUTING DESCRIPTIVE
STATISTICS
Steps for statistical data analysis
Statistical data analysis is conducted in two steps:
1st step = Descriptive Statistics (to describe the sample) including Testing for
NORMALITY ASSUMPTIONS
2nd step = making inference (Inferential Statistics) (making inferences about the
population using what is observed in the sample).
Association statistics
Comparative statistics
Notice: As an introduction to SPSS for data analysis, we will focus on the first step
(Descriptive statistics); the second step is better covered after or combined with
“Research Methodology” courses/lectures

SPSS
For more on SPSS,
come back to the course notes.

TESTING FOR NORMALITY
ASSUMPTIONS
Evaluating normality
There are both graphical and statistical methods for evaluating normality.
Graphical methods mainly include the histogram, Box-Whisker plot, dot plot, and the
normality plots (= Q-Q and P-P plots), etc. Normality plots are widely used (the Q-Q
plot is more common).
The assumption of univariate normality can be investigated using Statistical
methods including:
o diagnostic tests for skewness and kurtosis
o Normality Statistical tests
(=Kolmogorov-Smirnov Test and Shapiro-Wilk Test)

GRAPHICAL METHOD VS STATISTICAL
METHOD
Statistical tests
make an objective judgment of normality,
but are sometimes not sensitive enough at low sample sizes or overly sensitive at
large sample sizes.
As such, some statisticians prefer to use their experience to make a subjective
judgment about the data from plots/graphs.
Graphical interpretation
allows good judgment to assess normality in situations when statistical tests
might be over- or under-sensitive,
but graphical methods do lack objectivity.
Conclusion: In some cases, both methods complement each other (sometimes
you need to rely on statistical methods when graphical methods do not help you
to decide whether your data are normally distributed or not).

ASSESSING NORMALITY GRAPHICALLY
Q-Q plot and P-P plot are called probability plots.
Probability plot helps to compare two data sets in terms of
distribution;
one data set being from the data to be analyzed (data you collected
yourself) and another one from reference normally distributed data
(usually shown as a straight solid line) (theoretical normally
distributed data).
If the data is normally distributed, the result would be a straight line
with positive slope like in the figure on right below indicating a good
match for both data distributions.
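
As a rough illustration of the idea above, the following Python sketch builds the points of a normal Q-Q plot and measures how close they lie to a straight line. It is not the SPSS procedure: the function names qq_points and straightness, the plotting positions (i - 0.5)/n, and both samples are assumptions made for this example only.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each ordered observation with the matching normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution using the sample's own mean and SD.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Pearson correlation of the Q-Q points (1.0 = perfectly straight line)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical samples, invented for illustration only.
symmetric = [52, 55, 58, 60, 61, 63, 64, 66, 69, 72]
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 9, 30]

r_symmetric = straightness(qq_points(symmetric))
r_skewed = straightness(qq_points(skewed))
print(round(r_symmetric, 3), round(r_skewed, 3))
```

The closer the correlation is to 1, the closer the points follow the straight reference line; the skewed sample should give a visibly lower value than the symmetric one.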

WHY DO WE EVEN NEED Q-Q PLOT OR P-P PLOT?
If we plot the quantiles of the two data sets against each other,
the result is called a Q-Q (quantile-quantile) plot.
If we plot the cumulative probabilities of the two data sets against
each other, the result is called a P-P (probability-probability) plot.
The Q-Q plot is more common.
A histogram can be difficult to interpret, which is why Q-Q or P-P plots are preferred.

BOX-WHISKER PLOT
Usually used as a measure of variability (dispersion).
The Box-Whisker plot divides the data into four equal parts
using three quartiles:
• Second quartile (Q2) = median.
• Lower quartile (Q1) = median of the lower half of the
data
• Upper quartile (Q3) = median of the upper half of the
data
• Need to order the individuals first (from 1 to “N”
individuals)
• One quarter of the individuals are in each inter-
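
The quartile definitions above (median of the lower half, overall median, median of the upper half) can be sketched in Python as follows. This is only one convention: statistical packages may use slightly different interpolation rules, and the function name quartiles and the scores are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)          # order the individuals first
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half of the data
    upper = ordered[half + (n % 2):]  # upper half; the middle value is skipped when n is odd
    return median(lower), median(ordered), median(upper)

# Hypothetical scores, invented for illustration only.
scores = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, "IQR =", q3 - q1)
```

The box of the plot runs from Q1 to Q3, so its length is the interquartile range printed at the end.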

ASSESSING NORMALITY STATISTICALLY
Statistical methods include
a) diagnostic tests for skewness and kurtosis
b) Normality Statistical tests :
Kolmogorov-Smirnov Test
Shapiro-Wilk Test
The skewness and kurtosis diagnostics follow a rule of thumb:
a distribution is considered approximately normal if its skewness and kurtosis
have values between –1.0 and +1.0.
A perfectly normal distribution will have a skewness statistic of zero.

ASSESSING NORMALITY STATISTICALLY
Positive values of the skewness score describe a positively skewed
distribution (tail pointing towards large positive scores), and
negative skewness scores describe a negatively skewed distribution.
A perfectly normal distribution will also have a kurtosis statistic of
zero (this is the excess kurtosis, as reported by SPSS).
Values above zero (positive kurtosis score) describe “pointed”
distributions (leptokurtosis), and
values below zero describe flat distributions (platykurtosis, negative kurtosis).
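
To make the rule of thumb concrete, here is a minimal Python sketch that computes skewness and excess kurtosis from the sample moments and checks them against the –1.0 to +1.0 range. It assumes the simple moment formulas; packages such as SPSS apply small-sample corrections, so their values can differ slightly. The function names and both samples are hypothetical.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from the central moments."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    m4 = sum((x - centre) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # zero for a perfectly normal distribution
    return skewness, excess_kurtosis

def within_rule_of_thumb(values, limit=1.0):
    """True when both statistics lie between -limit and +limit."""
    skewness, excess_kurtosis = shape_statistics(values)
    return abs(skewness) <= limit and abs(excess_kurtosis) <= limit

# Hypothetical samples, invented for illustration only.
symmetric = [52, 55, 58, 60, 61, 63, 64, 66, 69, 72]
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 9, 30]

print(shape_statistics(symmetric), within_rule_of_thumb(symmetric))
print(shape_statistics(skewed), within_rule_of_thumb(skewed))
```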

NORMALITY STATISTICAL TESTS
Normality Statistical Tests include
Shapiro-Wilk Test (SW)
Kolmogorov-Smirnov Test (KS).
The KS is for a completely specified distribution (so if you are testing
normality, you must specify the mean and variance; they can't be
estimated from the data).
The SW is for normality, with unspecified mean and variance.
So the SW test is better suited to testing normality.
The KS test is a good method for comparing the shapes of two
sample distributions.
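
The point that the KS test needs a completely specified distribution can be seen in a small Python sketch of the KS statistic D, the largest gap between the empirical cumulative distribution and the reference normal curve. The mean and standard deviation must be supplied by the user. Only D is computed here, not the p-value; the function name ks_statistic, the sample, and the parameter values are assumptions for illustration.

```python
from statistics import NormalDist

def ks_statistic(sample, mu, sigma):
    """Largest gap between the empirical CDF and a fully specified normal CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mu, sigma)  # mean and SD are given, not estimated
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = reference.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical sample, invented for illustration only.
weights = [52, 55, 58, 60, 61, 63, 64, 66, 69, 72]

d_matching = ks_statistic(weights, mu=62, sigma=6)  # plausible reference
d_shifted = ks_statistic(weights, mu=80, sigma=6)   # badly chosen reference
print(round(d_matching, 3), round(d_shifted, 3))
```

A small D means the sample follows the specified curve closely; choosing a different mean changes D even though the data are the same, which is why the reference distribution has to be fixed in advance.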

TST
however.
As such, the SW is more appropriate for small sample
sizes (< 50 samples), but can also handle sample sizes
as large as 2000, which makes it the best test for
normality.
How do you ascertain statistically normal distribution of
the data?
If the p-value (shown as “Sig.” in the output table) of the
Shapiro-Wilk Test is greater than 0.05 (> 0.05), there is no significant
evidence of deviation from normality, and the data can be treated as
normally distributed.
If it is below 0.05 (< 0.05), the data significantly
deviate from a normal distribution.

QUESTIONS
Study and analyze the SPSS result about the normality below

TRANSFORMATION REMINDER
When a variable is not normally distributed, we can create a
transformed variable to achieve normality. After transformation,
normality should be tested.
Then the transformed variable (normally distributed) is analyzed by
parametric methods.
Three common transformations are: the logarithmic transformation
(the commonest), the square root transformation, and the inverse
transformation. They help to correct for skewness and unequal variances.
Transformation should be justified: it is recommended when
including a non-normally distributed variable in the analysis will
reduce the effectiveness at identifying statistical relationships, i.e.
when this leads to losing power, due to lack of normal distribution of
the variable to be analyzed.
When transformations do not work, we do have the option of
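
As a hedged illustration of the logarithmic transformation, the Python sketch below compares the skewness of a right-skewed sample before and after taking natural logarithms. The data are invented, the skewness formula is the simple moment version (SPSS output may differ slightly), and the log transformation only applies to strictly positive values.

```python
import math

def skewness(values):
    """Moment-based skewness: zero for a perfectly symmetric sample."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed measurements, invented for illustration only.
raw = [1, 2, 2, 3, 4, 6, 9, 15, 28, 60]

# Logarithmic transformation (requires every value to be greater than zero).
log_transformed = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_log = skewness(log_transformed)
print(round(skew_raw, 2), round(skew_log, 2))
```

After the transformation, normality should still be tested again on the new variable before any parametric method is applied.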

BIOSTATISTICS
SESSION 2 EXERCISING YOURSELF REMINDING YOUR WORK DONE

I. A Fahrenheit thermometer is an
example of what:
A. Nominal
B. Ordinal
C. Interval
D. Ratio
II. Within 3 standard deviations of
the mean, what percentage of
the scores is picked up?
A. 68
B. 78
C. 99
D. 99.7
E. 99.9
III. Classification of dental
disease is an example of what:
A. Nominal
B. Ordinal
C. Interval
D. Ratio
IV. Has categorical variables and
bars are separate, but equal
distances apart:
A. Bar Graph
B. Histogram
C. Frequency Polygon

V. Has continuous variables, bars
touch and you can always find a third
value:
A. Bar Graph
B. Histogram
C. Frequency Polygon
VI. Within 1 standard deviation of the
mean, over what percentage of the
values is picked up?
A. 60
B. 62
C. 65
D. 66
E. 68
VII. The degree to which the
independent variable alone brings
about the change in the dependent
variable is what:
A. Internal Validity
B. External Validity
VIII. The Student's t-test measures what:
A. Test the difference between 2 means
B. Test for differences between 3 or more
means
C. Differences between two frequency
distributions
D. Whether two distributions are
independent or dependent
IX. The Scientific Method is:
A. Qualitative Research
B. Quantitative Research
X. As income level declines, tooth decay
increases. This is an example of what:
A. Positive correlation
B. Negative correlation
C. Internal Validity
D. External Validity

XI. Randomly selecting a
proportionate amount from
subgroups is an example of
what:
A. Random Sampling
B. Stratified Sampling
C. Systematic Sampling
D. Convenience Sampling
XII. Retrospective and Prospective
are what types of Epidemiological
Studies?
A. Analytical
B. Descriptive
XIII. Descriptive statistics make no
attempt to generalize the
research findings beyond the
immediate sample.
A. True
B. False
XIV. Randomly selecting a proportionate
amount from subgroups is an example
of what:
A. Random Sampling
B. Stratified Sampling
C. Systematic Sampling
D. Convenience Sampling
XV. In systematic sampling, every person
has an equal or random chance of being
selected.
A. True
B. False
XVI. A zero correlation coefficient shows:
A. A strong relationship
B. No relationship
What is the difference between positive
correlation and negative correlation?
