Basic concepts in biostatistics edited pc-1.pptx

Arba Minch University
College of Medicine & Health Sciences
School of Public Health, Department of Public Health
Epidemiology and Biostatistics Unit
By: Kusse Otayto (BSc, MPH in Epidemiology & Biostatistics)

Descriptive statistics
 It deals with the description of data in a clear &
informative manner using tables, graphs &
numerical summary
 It involves the organization & summarization of a
body of data with one or more meaningful tools.
 It helps to identify the general features & trends in a
set of data & to extract useful information
 Also very important in conveying the final results of
a study
2

Descriptive statistics
 Data
 Are information collected from a source
 Are the raw materials of statistics
 Are numbers which can be obtained by
measurement or counting
 Data are made up of a set of variables
 They can be obtained from counting, routinely kept
records, surveys, experiments, reports…
 Types of data
1. Primary data
2. Secondary data
3

1. Primary data
 Are data collected from the items or individual
respondents directly by the researcher themselves
for the purpose of a study.
 Advantages of primary data
1. The data is original
2. Possibility of flexibility
3. Source for extensive research
 Disadvantages of primary data
1. Expensive & time consuming
2. Possibility of personal prejudice(biases)
4

2. Secondary data
2. Secondary data:
 Are data which have been collected & statistically treated by
certain people or an organization, & which are then used by
other people for another purpose
 Obtained from journals, reports, government
publications
 Advantages of secondary data
1. Are readymade
2. Relatively cheaper
3. Lesser degree of personal prejudice
 Disadvantages of secondary data
1. Lacks originality
2. May or may not suit the objectives of the enquiry (Not a source
for extensive research)
3. Must be used with great care & caution

Methods of data collection
 Before any statistical work can be done data must be
collected.
 Data collection is a crucial stage in the planning &
implementation of a study
 If the data collection has been superficial, biased or
incomplete, data analysis becomes difficult, & the research
report will be of poor quality.
 Therefore, we should concentrate all possible efforts on
developing appropriate tools, & should test them several
times.
 Depending on the type of variable & the objective of the
study, different data collection methods can be employed:
observation, interview, or a self-administered written
questionnaire

A. Observation
 Is a technique that involves systematically selecting,
watching & recording behavior & characteristics of
living things, objects or phenomena.
 It includes all methods from simple visual
observations to the use of sophisticated equipment
 It can be undertaken in the following ways:
1. Participant observation:
 The observer takes part in the situation he or she
observes.
2. Non-participant observation:
 The observer watches the situation, openly or
concealed, but does not participate

 Observations can give additional, more accurate information
on behavior of people than interviews or questionnaires
 Observations can also be made on objects
 Outline the guidelines for the observations prior to actual
data collection.
 Advantages
 Gives relatively more accurate data on behavior &
activities
 Disadvantages:
 The investigator’s or observer’s own biases
 Needs more resources & skilled human power when
sophisticated equipment is used.

B. Interview (face-to-face)
 Is a data collection technique that involves oral
questioning of respondents, either individually or as a
group
 Answers to the questions posed during an interview can
be recorded by:
1. Writing them down (either during the interview itself
or immediately after the interview) or
2. By tape-recording the responses, or
3. By a combination of both.
 Advantages of face-to-face interview
 Can stimulate & maintain the respondent’s interest
 Can create rapport (a bond of understanding) with the respondent
 Observations can be made as well.
 Disadvantage
 It is time consuming & expensive
1. In-depth interview
 It is a conversation between the researcher & the
subject about the research area or topic.
 It is designed to allow the respondent to tell their
story in their own way
 Issues are covered in detail; respondent leads the
interviews/sets the agenda; no fixed order
 Important in:
 Highly sensitive issues
 Geographically dispersed respondents
 When peer pressure is expected to distort facts
 It costs more & takes more time than a focus group discussion (FGD)

2. Focus group discussions
 It allows a group of 8-12 informants to freely discuss
a certain subject with the guidance of a facilitator or
reporter
Advantages
 Group interaction stimulates richer responses & the
emergence of new ideas
 The researcher observes & gets first-hand insights
 Can be done more quickly & is generally less expensive
than in-depth interviews
Disadvantage
 Not good for highly sensitive issues

C.Using self-administered written questionnaire
 Is a data collection tool in which written questions
are presented that are to be answered by the
respondents in written form
 It can be administered in different ways, such as by:
 Sending questionnaires by mail with clear
instructions
 Gathering all or part of the respondents in one place
at one time, giving oral or written instructions, &
letting the respondents fill out the questionnaires
 Hand-delivering questionnaires to respondents &
collecting them later

 The questions can be either open-ended or
closed-ended
A. Example of a closed-ended question
1. What is the current breastfeeding status of the mother?
A. Exclusive breastfeeding
B. Partial breastfeeding
C. Not breastfeeding
B. Example of an open-ended question
1. At what age should the child start supplementary
food? Why?

Advantages
 Is simpler & cheaper than interview
 Can be administered to many persons
simultaneously
 Can be sent by post.
Disadvantages
 It demands a certain level of education & skill of
respondents
 If the questionnaire is mailed, people of a low socio-economic
status are less likely to respond to it

Variable
 Is a characteristic which takes different values in
different persons, places, or things (PPT).
 Any aspect of an individual or object that is
measured (e.g. BP) or recorded (e.g. age, sex) &
can take different values.
 There may be one or many variables in a study

Types of variables
A. Qualitative (categorical) variables
 Nominal
 Ordinal
B. Quantitative (numerical) variables
 Continuous
 Discrete
Variables can also be classified by their role in a study:
1. Dependent (outcome, response) variable
2. Independent (exposure, explanatory) variable

A. Categorical (qualitative) variable
 A variable which cannot be measured in
quantitative form but can only be sorted by name or
category
 Cannot be measured in the way we measure height or
weight
 The notion of magnitude is absent or implicit.
 Categories must not overlap & must cover all
possibilities

Categorical variable is divided into two:
1. Nominal variable
 The values fall into un-ordered categories or classes
 Uses names, labels or symbols to assign each
measurement.
 Examples: Blood type (A, B, AB, O), Sex
(male/female)
2. Ordinal variable
 Assigns each measurement to one of a limited number of
categories that are ranked in terms of order.
 Although non-numerical, can be considered to have a
natural ordering
 Examples:
1. Cancer stages: 1, 2, 3, 4
2. Pain severity: no pain, slight pain, moderate pain, severe
pain

B. Quantitative (numerical) variable
 A variable that can be measured or counted & expressed
numerically.
 Has the notion of magnitude.
 E.g. Height, weight, # of children, etc.
 Quantitative variable is divided into two:
1. Discrete variable
 It can only have a limited number of discrete values &
hence takes on integer values only
 Characterized by gaps or interruptions in the values.
 Both the order & magnitude of the values matter.
 The values are not just labels, but are actual measurable
quantities.
 E.g. Number of children in household (0, 1, 2, 3, etc.)

Variables…
2. Continuous variable
 It can have an infinite number of possible values in
any given interval or within some range
 Both the magnitude & the order of the values matter
 Does not possess the gaps or interruptions
 E.g. Weight (50.123...), Height (1.342...)
20

Variables…
Manipulation of variables
 Continuous variables can be discretized
 E.g. Age (1 & 1/12 yr → 1 yr) can be rounded to whole
numbers
 Continuous or discrete variables can be categorized
 E.g. Age categories- 1(1-5), 2(6-10), 3(11-15)
 Categorical variables can be re-categorized
 E.g. marital status (Single, Married, Divorced,
Widowed) lumping from 4 categories down to 2
(married, single)
21

Variables…
1. Independent variables
 Precede(come first) dependent variables in time
 Are often manipulated by the researcher
2. Dependent variables
 What is measured as an outcome in a study
 Values depend on the independent variable
 Example
1. Health education involving active participation of mothers
will produce more positive changes in child feeding than
health education based on lectures.
 Independent variable:
 Type of health education
 Dependent variable:
 Changes in child feeding

Scales of Measurement
 Scales of measurement
 Is an assignment of numbers to subjects, objects or
events(variables) in which we are interested according to
a set of rules
 Measurement is a way of refining our ordinary
observations so that we can assign numerical values to
our observations.
 These numbers will provide the raw material for our
statistical analysis.
 Why do we measure things, or worry about the different forms
that measurement may take?
 It allows us to go beyond simply describing the presence
or absence of an event or thing to specifying how much,
how long, or how intense it is.
 With measurement, our observations become more
accurate & more reliable.

Scales...
 There are four types of scales of measurement.
1. Nominal scale
 Used when data are classified into one of two or
more categories
 The values fall into un-ordered categories or classes
(they aren’t hierarchical: one category isn’t “better” or
“higher” than another)
 Uses names, labels or symbols to assign each
measurement.
 Labeling or naming allows us to make qualitative
distinctions or to categorize & then count the
frequency of persons, objects, or things in each
category.
25

 It should be: Exhaustive & Mutually exclusive
1. Exhaustive :
 Should include all possible answerable responses.
2. Mutually exclusive :
 No respondent should be able to have two attributes
simultaneously
 Not really a ‘scale’ because it does not scale objects along
any dimension
 Assignment of numbers to the categories has no
mathematical meaning, simply for identification
purposes.
 Examples:
1. Marital status (Single, Married, Divorced)
2. Religion (Muslim, Protestant, Orthodox, Catholic)

Scales...
2. Ordinal scale
 Used when data are classified into logically ordered ranks
 Assigns each measurement to one of a limited number of
categories that are logically ranked in terms of order
 Although non-numerical, can be considered to have a
natural ordering (The numbers have limited meaning
4>3>2>1)
 No consistent distance between points of measurement
 Example: Social class (Very poor, Poor, Rich, Very rich)
 The intervals b/n adjacent numbers are not equal
27

Scales...
3. Interval scale
 Used when data are classified on a scale that assumes
equal distance between numbers
 There are Magnitude + Constant distance b/n points
+ No true zero point + Equal interval b/n adjacent
numbers
 Example: Temp. in °F on 4 consecutive days
Days:        A    B    C    D
Temp. (°F):  50   55   60   65
 For these data, not only is day A with 50 °F cooler
than day D with 65 °F, but it is 15 °F cooler.
 It has no true zero point (“0” is arbitrarily chosen &
doesn’t reflect the absence of temp.)

Scales...
4. Ratio scale
 Used when data are classified on a scale that assumes
equal distance & a true zero value
 Measurement begins at a true zero point & the scale has
equal space
 There are Magnitude + Constant distance b/n points +
Equal ratios + True zero.
 Examples: Height, weight, BP, etc.
 Zero weight or height means the complete absence of
weight or height.
 A 100-kg person has one-half the weight of a 200-kg
person & twice the weight of a 50-kg person.
 It is the most sensitive & powerful type of scale b/c it contains
the most precise information about each observation that is
made

Decision tree to determine the appropriate scale of measurement.
Question 1: Is there any order to the numbers?
    No  → Nominal scale
    Yes → Question 2: Are there equal intervals b/n adjacent numbers?
        No  → Ordinal scale
        Yes → Question 3: Is there an absolute zero?
            No  → Interval scale
            Yes → Ratio scale
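The three questions of the decision tree can be written as a small function. This is only an illustrative sketch: the function name `scale_of_measurement` and its three yes/no arguments are assumptions made for this example, and the example variables are the ones used in these slides.

```python
def scale_of_measurement(ordered: bool, equal_intervals: bool, true_zero: bool) -> str:
    """Answer the three decision-tree questions in order (hypothetical helper)."""
    if not ordered:            # Question 1: is there any order to the numbers?
        return "nominal"
    if not equal_intervals:    # Question 2: equal intervals between adjacent numbers?
        return "ordinal"
    if not true_zero:          # Question 3: is there an absolute zero?
        return "interval"
    return "ratio"


# Examples taken from the slides
blood_type = scale_of_measurement(ordered=False, equal_intervals=False, true_zero=False)
pain_severity = scale_of_measurement(ordered=True, equal_intervals=False, true_zero=False)
temperature_f = scale_of_measurement(ordered=True, equal_intervals=True, true_zero=False)
weight_kg = scale_of_measurement(ordered=True, equal_intervals=True, true_zero=True)

print(blood_type, pain_severity, temperature_f, weight_kg)
```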

Why Is Level of Measurement Important?
 Helps you to decide:
1. What kind of data display or summary method & what
statistical analysis is appropriate for the values that
were assigned
2. How to interpret the data from that variable.

Data organization & presentation
33

Data Organization & Presentation
1. For categorical variables
A. Using table of frequency distribution
1. Frequency counts
2. Relative frequency
3. Cumulative frequency
4. Relative cumulative frequency
B. Using pictorial forms
1. Bar charts(graph)
2. Pie charts
 Ordered array:
 A simple arrangement of individual observations in
order of magnitude.
 Very difficult with large sample size
34

2. For Quantitative variable
A. Using table of frequency distributions
1. Frequency counts
2. Relative frequency
3. Cumulative frequencies
4. Relative cumulative frequency
B. Using pictorial forms
1. Histogram
2. Frequency polygon
3. Line graph
4. Scatter plot
5. Box plot
6. Ogive/cumulative frequency…

 Frequency table:
 It involves a listing of all the observed values of the variable
being studied & How many times each value is observed.
 Frequency distribution:
 The distribution of the total number of observations among
the various categories is called a frequency distribution.
 Simple & effective way for summarizing large amounts of
data
 Relative Frequency
 It is the proportion or percentages of observations in each
category.
 The distribution of proportions is called the relative
frequency distribution of the variable
 Given a total number of observations, the relative frequency
distribution is easily derived from the frequency distribution.
36
Frequency table & Frequency Distributions…

Frequency table & Frequency Distributions…..
Cumulative frequency
 It is the number of observations in the category plus
observations in all categories smaller than it.
Cumulative relative frequency
 It is the proportion of observations in the category
plus observations in all categories smaller than it.
 It is obtained by dividing the cumulative frequency
by the total number of observations.
37

For example: Birth weight (BWT) of newborns, with levels:
1. Very low
2. Low
3. Normal &
4. Big
Table 1. Distribution of birth weight of newborns b/n 1976-1996 at “X” town.
BWT        Freq.   Cum. freq.          Rel. freq. (%)          Cum. rel. freq. (%)
Very low   43      43                  43/9974*100 = 0.4       43/9974*100 = 0.4
Low        793     43+793 = 836        793/9974*100 = 8.0      836/9974*100 = 8.4
Normal     8870    836+8870 = 9706     8870/9974*100 = 88.9    9706/9974*100 = 97.3
Big        268     9706+268 = 9974     268/9974*100 = 2.7      9974/9974*100 = 100
Total      9974                        100
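The four columns of the birth weight table can be computed from the frequency counts alone. The following is a minimal sketch using only the counts given in Table 1; the variable names are chosen for this example.

```python
from itertools import accumulate

# Frequency counts from Table 1 (birth weight levels in their natural order)
counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

total = sum(counts.values())
cum_counts = list(accumulate(counts.values()))  # running totals

rows = []
for (level, freq), cum in zip(counts.items(), cum_counts):
    rel_pct = round(100 * freq / total, 1)      # relative frequency (%)
    cum_rel_pct = round(100 * cum / total, 1)   # cumulative relative frequency (%)
    rows.append((level, freq, cum, rel_pct, cum_rel_pct))

for row in rows:
    print(row)
```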

 For Quantitative variable,
 Select a set of continuous, non-overlapping intervals
such that each value can be placed in one & only one
of the intervals.
 The first consideration is how many intervals to
include
 To determine the number of class intervals & the
corresponding width, we may use:
 Sturges’ rule:
$$K = 1 + 3.322\,\log_{10} n, \qquad W = \frac{L - S}{K}$$
 Where
K = Number of class intervals
n = No. of observations
W = Width of the class interval
L = Largest value
S = Smallest value

1. Example: Leisure time (hours) per week for 40
college students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 (log n)
K = 1 + 3.322 (log 40) = 6.32 ≈ 6
Maximum value (L) = 38, Minimum value (S) = 10
W = (L - S)/K
W = (38 - 10)/6 = 4.67 ≈ 5
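The same calculation can be checked in a few lines. This is a sketch under two assumptions that are not stated in the slides: K is rounded to the nearest whole number, and W is rounded up so that K classes cover the whole range of the data.

```python
import math

# Leisure time (hours) per week for 40 college students
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]

n = len(leisure_hours)
k_exact = 1 + 3.322 * math.log10(n)   # Sturges' rule
k = round(k_exact)                    # assumption: nearest whole number of classes

largest, smallest = max(leisure_hours), min(leisure_hours)
width_exact = (largest - smallest) / k
width = math.ceil(width_exact)        # assumption: round up to a whole-number width

print(n, round(k_exact, 2), k, round(width_exact, 2), width)
```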

Time (Hours)   Frequency   Relative Frequency   Cumulative Relative Frequency
10-14          5           0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29          7           0.175                0.875
30-34          3           0.075                0.950
35-39          2           0.050                1.00
Total          40          1.00
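The frequency distribution for the leisure-time example can be tallied directly from the 40 raw values. The sketch below assumes six classes of width 5 starting at 10 (as in the table) and divides cumulative counts by n so that the proportions are not affected by accumulated rounding.

```python
from itertools import accumulate

leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]

start, width, k = 10, 5, 6
# Class limits: (10, 14), (15, 19), ..., (35, 39)
classes = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]

n = len(leisure_hours)
freqs = [sum(lo <= x <= hi for x in leisure_hours) for lo, hi in classes]
rel_freqs = [f / n for f in freqs]
cum_rel_freqs = [c / n for c in accumulate(freqs)]

for (lo, hi), f, r, c in zip(classes, freqs, rel_freqs, cum_rel_freqs):
    print(f"{lo}-{hi}: {f:2d}  {r:.3f}  {c:.3f}")
```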

42
 Class Limit: The range for each class
 Upper class limit
 Lower class limit
 Mid-point (Class mark):
 The value of the interval which lies midway b/n the
lower & the upper limits of a class.
 Class boundary (True limits):
 Are those limits that make an interval of a continuous
variable continuous in both directions
 Upper class boundary
 Lower class boundary
 Subtract 0.5 from the lower class limit & add 0.5 to the upper class limit (for limits recorded in whole units)

Time (Hours)   True limit (class boundary)   Mid-point         Frequency
10-14          9.5 – 14.5                    (10+14)/2 = 12    5
15-19          14.5 – 19.5                   (15+19)/2 = 17    11
20-24          19.5 – 24.5                   (20+24)/2 = 22    12
25-29          24.5 – 29.5                   (25+29)/2 = 27    7
30-34          29.5 – 34.5                   (30+34)/2 = 32    3
35-39          34.5 – 39.5                   (35+39)/2 = 37    2
Total                                                          40
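Class boundaries and mid-points follow mechanically from the class limits. A minimal sketch, assuming the limits are recorded in whole hours so that the gap between adjacent classes is 1 and half of it (0.5) is moved to each side:

```python
# Class limits from the leisure-time table
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

half_gap = 0.5  # assumption: limits are whole numbers, so the gap between classes is 1

boundaries = [(lo - half_gap, hi + half_gap) for lo, hi in class_limits]
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

for limits, bounds, mid in zip(class_limits, boundaries, midpoints):
    print(limits, bounds, mid)
```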

Guidelines for constructing tables
1. Keep them simple (Limit the number of variables to
three or less)
2. All tables should be self-explanatory (Include clear
title telling what, when & where)
3. Clearly label the rows & columns
4. State clearly the unit of measurement used
5. Explain codes & abbreviations in the foot-note
6. Show totals
7. If data is not original, indicate the source in foot-
note.
44

Pictorial /Diagrammatic presentation
Importance of diagrammatic presentation
1. Diagrams have greater attraction than mere figures
2. They give quick overall impression of the data
3. They have greater memorizing value than mere figures
4. They facilitate comparison
5. Used to understand patterns & trends
 E.g.,
 Skewed or symmetric distribution
 Multiple peaks / modes
 Presence of outliers
 Relationship between variables.

1. Bar charts (Graphs)
1. Graphical equivalent of a frequency table
2. Categories are listed on the horizontal axis (X-axis)
3. Frequencies or relative frequencies are represented
on the Y-axis (ordinate)
4. The height of each bar is proportional to the
frequency or relative frequency of observations in
that category
46
Qualitative variable presentation

A. Simple bar chart: used to represent a single variable
[Figure: simple bar chart. X-axis: Immunization status (Not immunized, Partially immunized, Fully immunized); Y-axis: Number of children (0-100)]
Fig. 1. Immunization status of Children in Adami Tulu Woreda, Feb. 1995

B. Sub-divided (component) bar chart
1. If there are different quantities forming the sub-
divisions of the totals, simple bars may be sub-
divided in the ratio of the various sub-divisions to
exhibit the relationship of the parts to the whole.
2. The order in which the components are shown in a
“bar” is followed in all bars used in the diagram
48

Example of 100% component bar chart:
[Figure: 100% component bar chart. X-axis: month of 2003 (August, October, December); Y-axis: Percent (0-100); components: Mixed, P. vivax, P. falciparum]
Fig. 1. Plasmodium species distribution for confirmed malaria cases, Zeway, 2003

 Method of constructing bar chart
1. All the bars must have equal width
2. The bars are not joined together (leave space b/n
bars)
3. The different bars should be separated by equal
distances
4. All the bars should rest on the same line called the
base
5. Both axes should be clearly labelled
 Instead of “stacks” rising up from the horizontal (bar
chart), we could plot instead the shares of a pie.
50

2. Pie chart
1. It shows the relative frequency for each category by
dividing a circle into sectors
2. The angles are proportional to the relative frequency.
3. Used for a single categorical variable
4. Use percentage distributions
 Steps to construct a pie-chart
1. Construct a frequency table
2. Change the frequency into percentage (P)
3. Change the percentages into degrees, where,
 Degrees = (Percentage / 100) × 360°
4. Draw a circle & divide it accordingly

Steps to construct a pie-chart
Example: Distribution of deaths for females, in England and Wales, 1989.
Cause of death        No. of deaths   Percentage   Angle
Circulatory system    100 000         42%          100,000/236,000*360° = 153°
Neoplasm              70 000          30%          70,000/236,000*360° = 107°
Respiratory system    30 000          13%          30,000/236,000*360° = 46°
Injury & poisoning    6 000           3%           6,000/236,000*360° = 9°
Digestive system      10 000          4%           10,000/236,000*360° = 15°
Others                20 000          8%           20,000/236,000*360° = 30.5°, taken as 30° so that the angles total 360°
Total                 236 000         100%         360°
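The conversion from counts to percentages and sector angles can be sketched as follows, using the death counts from the example. The angles are left unrounded here, so they sum to exactly 360° apart from floating-point error; rounding each one to a whole degree is a presentation choice.

```python
# Number of deaths by cause (females, England and Wales, 1989)
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
percent = {cause: 100 * n / total for cause, n in deaths.items()}
angle_deg = {cause: 360 * n / total for cause, n in deaths.items()}

for cause in deaths:
    print(f"{cause:20s} {percent[cause]:5.1f}%  {angle_deg[cause]:6.1f} degrees")
```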

 Recalling that a circle has 360 degrees, that 50% means 180 degrees, 25%
means 90 degrees, etc., we can identify “wedges” according to relative
frequency
[Figure: pie chart. Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Injury and poisoning 3%, Digestive system 4%, Others 8%]
Distribution of cause of death for females, in England and Wales, 1989

3. Histogram
1. Histograms are frequency distributions with
continuous class interval that have been turned into
graphs
2. A histogram is a type of bar chart, but there are no
spaces b/n the bars(continuous data)
3. Histograms are used to visually represent frequency
distributions of continuous data
4. Given a set of numerical data, we can obtain an
impression of the shape of its distribution by
constructing a histogram

3. Histogram
5. Constructed by choosing a set of non-overlapping class
intervals & counting the number of observations that fall in
each class.
6. It is necessary that the class intervals be non-overlapping so
that each observation falls in one & only one interval.
7. Bars are drawn over the intervals
8. The area of each bar is proportional to the frequency of
observations in the interval
 Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective groups are lost
& difficult to reconstruct
 Stem-and-leaf plot overcomes these problems

Example: Distribution of the age of women at the time of marriage
Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number      11      36      28      13      7       3       2
[Figure: histogram of the age of women at the time of marriage. X-axis: Age group (class boundaries 14.5-19.5 to 44.5-49.5); Y-axis: No. of women (0-40)]

5. Frequency polygon
1. Instead of drawing bars for each class interval,
sometimes a single point is drawn at the mid point of
each class interval & consecutive points joined by
straight line.
2. Graphs drawn in this way are called frequency
polygons
3. The total area under the frequency polygon is equal
to the area under the histogram
4. Frequency polygons are superior to histograms for
comparing two/more sets of data.
57

[Figure: frequency polygon of the age of women at the time of marriage. X-axis: Age (class mid-points 12 to 47); Y-axis: No. of women (0-40)]

6. Scatter plot
1. Most studies in medicine involve measuring more
than one characteristic
2. For two quantitative variables we use bivariate plots
(also called scatter plots or scatter diagrams).
3. E.g. In a study on percentage saturation of bile,
information was also collected on the age of each patient
to see whether a relationship existed between the
two measures (saturation of bile & age).

6. Scatter plot….
 When both the variables are qualitative then we can
use a bar graph.
 When one of the characteristics is qualitative & the
other is quantitative, the data can be displayed in box
& whisker plots.
 A scatter diagram is constructed by drawing X- & Y-
axes.
 Each point or dot (•) represents
a pair of values measured for a single study subject
 The graph suggests the possibility of a positive
relationship between age & percentage saturation of
bile in women.

[Figure: scatter plot. X-axis: Age (0-80); Y-axis: Saturation of bile (0-160)]
Age and percentage saturation of bile for women patients in hospital Z, 1998

7. Line graph
1. Useful for assessing the trend of a particular situation
over time.
2. Helps in monitoring the trend of epidemics.
3. Values for each category are connected by a
continuous line.
4. Sometimes two or more lines are drawn on the
same graph using the same scale so that they are
comparable.
62

Line graph
[Figure: line graph. X-axis: Year (1967-1979); Y-axis: Rate (%) (0.0-5.5)]
Fig 5: Malaria Parasite Prevalence Rates in Ethiopia, 1967 – 1979 E.C.

Line Graph
Year   MMR/1000
1960   50
1970   45
1980   26
1990   15
2000   12
[Figure: line graph of the data above. X-axis: Year (1960-2000); Y-axis: MMR/1000 (0-60)]
Figure (1): Maternal mortality rate of (country), 1960-2000

Reading assignment
1. Ogive curve
2. Box & whisker plot
3. Stem-and-leaf plot

Numerical summary measures
1. Measures of central tendency
2. Measures of dispersion
66

Measures of Central Tendency
67

1. Measures of Central Tendency
 Statistic:–
 Descriptive measure computed from sample data
 Parameter:–
 Descriptive measure computed from population data
 Measures of central tendency:-
 Are measures that summarize, in a single number or
statistic, the point around which the data tend to
cluster.
 The most commonly used measures of central
tendency are:
1. Arithmetic Mean,
2. Median &
3. Mode.
68

1. Arithmetic mean
 It is the average of the data set
 The sum of the observations divided by the number of
observations.
 Mean for ungrouped data
 Mean of a sample: $\bar{X} = \dfrac{\sum X}{n}$
 Mean of a population: $\mu = \dfrac{\sum X}{N}$
$\bar{X}$ (X bar) refers to the mean of a sample &
$\mu$ refers to the mean of a population
$\sum X$ is an instruction to add all of the X values
n = the total number of values in a sample &
N = the total number of values in a population
Arithmetic mean …..
 Example: 19 21 20 20 34 22 24 27 27 27
 Calculate the mean , n=10
 Mean = (19 + 21 + 20 + 20 + 34 + 22 + 24 + 27 + 27 + 27)/10 = 241/10 = 24.1
 Mean for grouped data
 We assume that all values falling into a particular class
interval are located at the mid-point of the interval.
 It is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
Where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class
Example. Compute the mean age of 169 subjects from the
grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total            __               169              5810.5
Mean = 5810.5/169 = 34.4 years
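The grouped-data mean can be checked by computing the mid-points and the products mi × fi from the class limits and frequencies. A minimal sketch with the age data from the example; the variable names are chosen for illustration.

```python
# (lower limit, upper limit, frequency) for each age class
age_classes = [
    (10, 19, 4),
    (20, 29, 66),
    (30, 39, 47),
    (40, 49, 36),
    (50, 59, 12),
    (60, 69, 4),
]

midpoints = [(lo + hi) / 2 for lo, hi, _ in age_classes]
freqs = [f for _, _, f in age_classes]

n = sum(freqs)                                         # sum of f_i
sum_mf = sum(m * f for m, f in zip(midpoints, freqs))  # sum of m_i * f_i
grouped_mean = sum_mf / n

print(n, sum_mf, round(grouped_mean, 1))
```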

Properties of the arithmetic mean
1. Can be used for both discrete & continuous data.
 However, it is not appropriate for either nominal
or ordinal data.
2. For given set of data there is one & only one
arithmetic mean.
3. It is easily understood & easy to compute.
4. Algebraic sum of the deviations of the given values
from their arithmetic mean is always zero.
5. It is greatly affected by the extreme values.
72

2. Median
 Is the value that divides a series of values into two equal
halves when they are listed in order
 If the number of observations is odd, the median is the
[(n+1)/2]th observation.
 E.g. 19 20 20 21 22 23 24 27 27 27 34, n = 11
 Median = [(n+1)/2]th = [(11+1)/2]th = 6th observation = 23
 If the number of observations is even, there is no single
middle observation; the median is the average of the two
middle values, i.e. [(n/2)th + ((n/2)+1)th]/2.
 E.g. 19 20 20 21 22 24 27 27 27 34, n = 10
 Median = [(n/2)th + ((n/2)+1)th]/2 = [(10/2)th +
((10/2)+1)th]/2 = [5th + 6th]/2 = (22 + 24)/2 = 23
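The odd and even cases can be written as one short function. This is an illustrative sketch (the function name `median_of` is an assumption); Python's standard library `statistics.median` applies the same rule and is used here only as a cross-check.

```python
import statistics


def median_of(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the [(n+1)/2]th observation (index mid when counting from 0)
        return ordered[mid]
    # Even n: average of the (n/2)th and [(n/2)+1]th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [19, 20, 20, 21, 22, 23, 24, 27, 27, 27, 34]   # n = 11
even_data = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]      # n = 10

print(median_of(odd_data), median_of(even_data))
```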

 Median for Grouped data
 We assume that the values within a class-interval are
evenly distributed through the interval.
 The first step is to locate the class interval in which it
is located.
 Find n/2 & identify the first class interval whose
cumulative frequency is greater than or equal to n/2.
 Note: This class interval (the median class) contains
the median.

To find a unique median value, use the following
interpolation formula.
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
 Where,
• Lm = lower true class boundary of the interval containing the median
• Fc = cumulative frequency of the interval just above (before) the median class
interval
• fm = frequency of the interval containing the median
• W = class interval width
• n = total number of observations

Ex. Compute the median age of 169 subjects from the
grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169

 Median:
n/2 = 169/2 = 84.5
84.5 falls in the 3rd class interval (30-39), whose cumulative frequency is 117
Lower class boundary (Lm) = 29.5
Upper class boundary = 39.5
Frequency of the median class (fm) = 47
Cumulative frequency of the class just above the median class (Fc) = 70
Class width (W) = 10
Median = 29.5 + ((84.5 - 70)/47) × 10 = 32.59 ≈ 33
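The interpolation for the grouped median can be sketched as a function that locates the median class and then applies the formula. The function name `grouped_median` is hypothetical, and the sketch assumes class limits recorded in whole units, so the lower boundary is the lower limit minus 0.5 and the width is upper limit minus lower limit plus 1.

```python
def grouped_median(class_limits, freqs):
    """Interpolated median for grouped data (assumes whole-number class limits)."""
    n = sum(freqs)
    half = n / 2
    cum_before = 0  # Fc: cumulative frequency of the classes before the current one
    for (lo, hi), f in zip(class_limits, freqs):
        if cum_before + f >= half:
            # First class whose cumulative frequency reaches n/2: the median class
            lower_boundary = lo - 0.5   # Lm
            width = hi - lo + 1         # W
            return lower_boundary + (half - cum_before) / f * width
        cum_before += f
    raise ValueError("frequencies must contain at least one observation")


age_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
age_freqs = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(age_limits, age_freqs)
print(round(median_age, 2))
```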

Properties of median
1. Can be used for ordinal, discrete & continuous data.
 However, it is not appropriate for nominal data.
2. There is only one median for a given set of data
3. The median is easy to calculate
4. Median is a positional average & hence it is not
drastically affected by extreme values
5. It is not a good representative of data if the number
of items is small
78

3. Mode
 Mode
 It is the value/ observation which occurs most frequently.
 Most distributions have one peak & are described as uni-
modal.
 E.g. 19 21 20 20 34 22 24 27 27 27
 Mode = 27
 The mode of grouped data usually refers to the modal class
with the highest frequency.
 The modal value is the highest bar in a histogram
 Not a good summary
 A data set may have one mode, more than one mode, or no mode
79

To find a single value of mode for grouped data, use
the following formula:
$$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$$
 Where:
 Lmo is the lower boundary of the modal class
 i is the class width
 Δ1 is the difference b/n the frequency of the modal class & the frequency
of the class before (above) the modal class
 Δ2 is the difference b/n the frequency of the modal class & the frequency
of the class after (below) the modal class
Ex. Find the mode for the following data
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169
 Solution
 The modal class is 20-29 (highest frequency, 66)
 Lmo = 19.5, Δ1 = 66 - 4 = 62, Δ2 = 66 - 47 = 19, i = 10
 Mode = 19.5 + (62/(62 + 19)) × 10 = 27.15 ≈ 27

Properties of mode
1. Can be used for nominal, ordinal, discrete &
continuous data.
 However, it is more appropriate for nominal &
ordinal data.
2. It is not affected by extreme values
3. Often its value is not unique
4. The main drawback of mode is that often it does not
exist
82

2. Measures of Dispersion
Measures of Dispersion
 Measures that quantify the variation or dispersion of a
set of data from its central location
 Dispersion of a set of observations is the variety exhibited by
the observations
1. If all the values are the same → There is no dispersion
2. If the values are not all the same → There is dispersion
3. If the values are close to each other → The amount of
dispersion is small
4. If the values are widely scattered/spread → The
dispersion is greater
84

Common measures of dispersion
1. Range
2. Inter quartile range
3. Variance
4. Standard deviation
5. Coefficient of variation
85
Measures of Dispersion….

1. Range (R)
Range (R)
 Is the difference b/n the largest & smallest
observations in a sample.
 The range depends on only two values
 Range = Maximum value – Minimum value
 The range is the simplest measure of dispersion.
 A data set with higher range shows more variability
 Example –
 Data values: 5, 9, 12, 16, 23, 34, 37, 42
 Maximum value= 42,
 Minimum value= 5
 Range = 42-5 = 37
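The range calculation is a one-liner, and the same few lines show how strongly a single extreme value affects it. The extra value 120 below is hypothetical, added only to illustrate this sensitivity; it is not part of the example data.

```python
data = [5, 9, 12, 16, 23, 34, 37, 42]

value_range = max(data) - min(data)   # Range = maximum value - minimum value

# Hypothetical outlier (not in the original example) to show the effect on the range
data_with_outlier = data + [120]
range_with_outlier = max(data_with_outlier) - min(data_with_outlier)

print(value_range, range_with_outlier)
```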

 Properties of range
1. It is the simplest, crudest measure & can be easily understood
2. It takes into account only two values, which makes it a poor measure of dispersion
3. Very sensitive to extreme observations (outliers)
4. It tends to increase as the sample size increases

2. Inter-quartile range (IQR)
 Inter-quartile range (IQR)
 It is used when the median is used as the measure of
central tendency.
 It gives the range in which the middle 50% of the
distribution lies.
 The inter-quartile range quantifies the difference b/n
the third & first quartiles.
IQR = Q3 - Q1
 A large IQR indicates a large amount of variability
among the middle 50% of the observations &
 A small IQR indicates a small amount of variability

 The inter-quartile range is particularly useful for describing data sets that contain a few extreme values.
 Unlike the range, & to a lesser extent the standard deviation, it is not sensitive to extreme values because it relies on the spread of the middle 50% of the distribution.
 So, for data sets with extreme values, it can be more appropriate to use the median to describe central tendency & the inter-quartile range to describe the spread.

What do quartiles mean?
 If the data are divided into four equal parts, we speak of quartiles.
 The quartiles (Q1, Q2, Q3) are the three values that divide the ranked data into 4 equal parts, with 25% of the observations in each part.
 The first quartile (Q1):
 Is the point which gives us 25% of the area to the left of it & 75% to the right of it.
 This means that 25% of the ranked observations are less than or equal to the first quartile & 75% of the observations are greater than or equal to it.
 The first quartile is also called the 25th percentile.

 The second quartile (Q2):
 The point which gives us 50% of the area to the left
of it & 50% to the right of it
 The second quartile is called the median.
 The third quartile (Q3):
 Is the point which gives us 75% of the area to the left
of it & 25% of the area to the right of it.
 This means that 75% of the observations are less than or equal to the third quartile & 25% of the observations are greater than or equal to the third quartile.
 The third quartile is also called the 75th percentile.

 Ex.1: Suppose we have a small data set of
twelve observations
 15 18 19 20 20 20 21 23 23 24 24 25
1. We want to divide the data into four equal sets
2. First, we find the median
 15 18 19 20 20 20 ↑ median 21 23 23 24 24 25
 Median = 20.5 (halfway b/n the 6th & 7th observations)
 The median divides the data into two equal sets with exactly 50% of the observations in each:
 The 1st - 6th observations in the first set &
 The 7th - 12th observations in the other.

 To find the first quartile we consider the observations less than the median.
 15 18 19 ↑ 20 20 20
 The first quartile is the median of these data.
 In this case, the first quartile is halfway b/n the 3rd & 4th observations & is equal to 19.5.
 Now, we consider the observations which are greater than the median.
 21 23 23 ↑ 24 24 25
 The third quartile is the median of these data & is equal to 23.5.
 15 18 19 ↑(Q1) 20 20 20 ↑(Q2) 21 23 23 ↑(Q3) 24 24 25
 IQR = Q3 - Q1 = 23.5 - 19.5 = 4
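The median-of-halves procedure used in this example can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical. For an odd number of observations the sketch leaves the middle value out of both halves, which is an assumption because the example only covers an even n.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1, Q2, Q3 as the medians of the lower half, all data, upper half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: for odd n the middle observation is excluded from both halves
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

data = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```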

 Example 1: Suppose the first & third quartiles for the weights of girls 12 months of age are 8.8 Kg & 10.2 Kg, respectively.
 IQR = 10.2 Kg – 8.8 Kg = 1.4 Kg
 i.e., the middle 50% of the infant girls weigh between 8.8 & 10.2 Kg.
 Example 2: Given the following data set (age of patients):
 18, 59, 24, 42, 21, 23, 24, 32
 Find the inter-quartile range
 Solution: ranked data: 18 21 23 24 24 32 42 59 (n = 8)
 Q1 = {(n+1)/4}th = (2.25)th value = 21 + (23-21) × 0.25 = 21.5
 Q3 = {3(n+1)/4}th = (6.75)th value = 32 + (42-32) × 0.75 = 39.5
 Hence, IQR = 39.5 - 21.5 = 18
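The (n+1) position rule with linear interpolation, as applied to the patient ages, can be sketched in Python. The function name `quantile_n_plus_1` is hypothetical. Clamping positions that fall before the first or after the last observation is an assumption added so that the sketch never indexes outside the data.

```python
def quantile_n_plus_1(values, p):
    """Value at position p*(n+1) in the ranked data, interpolating linearly."""
    data = sorted(values)
    n = len(data)
    pos = p * (n + 1)      # 1-based position, e.g. 2.25
    k = int(pos)           # whole part: the kth observation
    frac = pos - k         # fractional part used for interpolation
    if k < 1:              # assumption: clamp to the smallest value
        return data[0]
    if k >= n:             # assumption: clamp to the largest value
        return data[-1]
    return data[k - 1] + frac * (data[k] - data[k - 1])

ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1 = quantile_n_plus_1(ages, 0.25)
q3 = quantile_n_plus_1(ages, 0.75)
print(q1, q3, q3 - q1)
```

This rule and the median-of-halves approach are different conventions, so they can give slightly different quartiles for the same data.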

 Ex.2: Given these data: 13, 7, 9, 15, 11, 5, 8, 4
a. Arrange the observations in increasing order.
 4, 5, 7, 8, 9, 11, 13, 15.
b. Find the position of the 1st & 3rd quartiles.
= n = 8
= Position of Q1 = ¼(n+1) = ¼(8+1) = 2.25th
= Q1 lies b/n the 2nd & 3rd observations
= Position of Q3 = ¾(n+1) = ¾(8+1) = 6.75th
= Q3 lies b/n the 6th & 7th observations

c. Identify the value of the 1st & 3rd quartiles.
 The value of Q1 is equal to the value of the 2nd observation plus 1/4th of the difference b/n the values of the 3rd & 2nd observations:
 Value of the 3rd observation =7
 Value of the 2nd observation = 5
 Q1 = 5 +1/4(7-5) = 5 +2/4 = 5.5
 The value of Q3 is equal to the value of the 6th
observation plus 3/4ths of the difference b/n the value
of the 7th & 6th observations:
 Value of the 7th observation =13
 Value of the 6th observation=11
 Q3 = 11 +3/4 (13-11) = 11 +3(2)/4 = 11+6/4 = 12.5

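d. Calculate the inter-quartile range
 Q3 = 12.5; Q1 = 5.5
 IQR = Q3 - Q1 = 12.5 - 5.5 = 7
 Generally we apply this formula:
1. Qk = ((kn/4)th + (kn/4 + 1)th)/2 - if n is even
2. Qk = ((kn/4 + 1)/2)th - if n is odd
 Quartiles for grouped data
 Apply the same method as for the median:
= $Q_1 = L_{Q1} + \left(\dfrac{n/4 - f_c}{f_{Q1}}\right) i$ & $Q_3 = L_{Q3} + \left(\dfrac{3n/4 - f_c}{f_{Q3}}\right) i$
= Here L is the lower boundary of the quartile class, fc is the cumulative frequency of the classes before it, f is the frequency of the quartile class & i is the class width
 To find the class of each:
= The Q1 class contains the (n/4)th observation & the Q3 class contains the (3n/4)th observation
= IQR = Q3 - Q1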
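The grouped-data quartile formula can be sketched in Python. As an illustration it reuses the age frequency table from the mode and variance examples in this deck; the function name `grouped_quantile` is hypothetical, and fc is taken to be the cumulative frequency of the classes before the quartile class.

```python
def grouped_quantile(lower_boundaries, freqs, width, p):
    """Interpolated quantile for grouped data: L + ((p*n - fc) / f) * i."""
    n = sum(freqs)
    target = p * n                  # n/4 for Q1, 3n/4 for Q3
    cum = 0                         # cumulative frequency before the class
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cum + f >= target:
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("p must be between 0 and 1")

# Age table used elsewhere in the deck (classes 10-19, ..., 60-69)
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [4, 66, 47, 36, 12, 4]
q1 = grouped_quantile(lower_boundaries, freqs, 10, 0.25)
q3 = grouped_quantile(lower_boundaries, freqs, 10, 0.75)
print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))
```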

Properties of IQR
1. It is a simple & versatile measure
2. It encloses the central 50% of the observations
3. It is not based on all observations but only on two
specific values
4. It is important in selecting cut-off points in the
formulation of clinical standards
5. Since it excludes the lowest & highest 25% of the values, it is little affected by outliers
6. Less sensitive to the size of the sample

Percentiles
 Percentiles:
 Divide the ranked data into 100 equal parts.
 Are less sensitive to outliers &
 Are not greatly affected by the sample size (n).
 P0:
 The minimum
 P25:
 25% of the sample values are less than or equal to this
value.
 1st Quartile, P25 means 25th percentile
 P50:
 50% of the sample values are less than or equal to this value.
 2nd Quartile
 P75:
 75% of the sample values are less than or equal to this
value.
 3rd Quartile
 P100:
 The maximum

 The pth percentile:
 Is a value that is greater than or equal to p% of the observations & less than or equal to the remaining (100-p)%.
 With p written as a proportion (e.g. 0.10 for the 10th percentile), it is:
 The observation at the p(n+1)th position if p(n+1) is an integer
 The average of the (k)th & (k+1)th observations if p(n+1) is not an integer, where k is the largest integer less than p(n+1).
 E.g. if p(n+1) = 3.6, take the average of the 3rd & 4th observations

 Example: Birth weight in grams
 2069, 2581, 2759, 2834, 2838, 2841, 3031,
 3101, 3200, 3245, 3248, 3260, 3265, 3314,
 3323, 3484, 3541, 3609, 3649, 4146
 Find the 10th & 90th percentiles of the data set. n = 20
 Solution: Pt = ((tn/100)th + (tn/100 + 1)th)/2 - if n is even
 10th percentile: tn/100 = 20 × 0.1 = 2, so we take the (2)th & (2+1) = (3)th values
 The average of the 2nd & 3rd values
 = (2581 + 2759)/2 = 2670 g
 90th percentile: tn/100 = 20 × 0.9 = 18, so we take the (18)th & (18+1) = (19)th values
 The average of the 18th & 19th values
 = (3609 + 3649)/2 = 3629 g
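The p(n+1) percentile rule can be sketched in Python and checked against the birth-weight data. The function name `percentile_rule` is hypothetical, p is passed as a proportion, and clamping positions outside 1..n to the nearest observation is an assumption of this sketch.

```python
import math

def percentile_rule(values, p):
    """pth percentile by the p(n+1) rule (p given as a proportion)."""
    data = sorted(values)
    n = len(data)
    pos = p * (n + 1)
    nearest = round(pos)
    if math.isclose(pos, nearest):
        # p(n+1) is an integer: take that observation (clamped to 1..n)
        k = min(max(nearest, 1), n)
        return data[k - 1]
    k = math.floor(pos)     # largest integer less than p(n+1)
    if k < 1:               # assumption: clamp to the smallest value
        return data[0]
    if k >= n:              # assumption: clamp to the largest value
        return data[-1]
    return (data[k - 1] + data[k]) / 2   # average of kth & (k+1)th values

birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031,
                 3101, 3200, 3245, 3248, 3260, 3265, 3314,
                 3323, 3484, 3541, 3609, 3649, 4146]
p10 = percentile_rule(birth_weights, 0.10)
p90 = percentile_rule(birth_weights, 0.90)
print(p10, p90)
```

For these data the rule picks the 2nd & 3rd values for the 10th percentile and the 18th & 19th values for the 90th percentile, the same pairs averaged in the worked solution.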

 Generally we apply this formula:
1. Pt = ((tn/100)th + (tn/100 + 1)th)/2 - if n is even
2. Pt = ((tn/100 + 1)/2)th - if n is odd
 For grouped data use the following formula:
 $P_t = L_P + \left(\dfrac{tn/100 - f_c}{f}\right) i$
 To find the percentile class, use the value tn/100
 Where
 t represents the percentile we're finding,
 n is the total number of observations in the data set.

Variance (σ², s²)
 The variance
 Is the average of the squares of the deviations taken from the mean
 A good measure of dispersion should make use of all the data.
 The variance achieves this by averaging the squared deviations from the mean.
 The sample variance of the set x1, x2, ..., xn of n observations with mean $\bar{x}$ is
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
 Degrees of freedom
 n-1 is used because if we know n-1 deviations, the nth deviation is known
 Deviations have to sum to zero
 The deviations are squared because the sum of the deviations of the individual observations of a sample about the sample mean is always zero
 Degrees of freedom
 In computing the variance there are (n-1) degrees of freedom because only (n-1) of the deviations are independent of each other
 This is because the deviations from the mean, $(x_i - \bar{x})$, must add to zero.
 The last one can always be calculated from the others automatically (it is not free to vary).

 Example
 Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
 Find the sample variance of the data
 Mean = 56
 S² = [(43 - 56)² + (66 - 56)² + ... + (50 - 56)²]/(10 - 1) = 810/9 = 90
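The sample variance calculation, together with the fact that the deviations from the mean sum to zero, can be checked with a minimal Python sketch. The function name `sample_variance` is hypothetical.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    squared_sum = sum(d ** 2 for d in deviations)
    return squared_sum / (n - 1), deviations

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
s2, deviations = sample_variance(data)
print(s2)                  # sample variance
print(sum(deviations))     # deviations about the mean add up to zero
```

Because the deviations must sum to zero, knowing any nine of them fixes the tenth, which is why the divisor is n - 1.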
Variance for grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
 Where
 mi = the mid-point of the ith class interval
 fi = the frequency of the ith class interval
 $\bar{x}$ = the sample mean
 k = the number of class intervals

 Ex. Compute the variance of the age of 169 subjects from the grouped data.

Class interval   mi     fi    (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
10-19            14.5   4     -19.88      395.21       1580.86
20-29            24.5   66    -9.88       97.61        6442.55
30-39            34.5   47    0.12        0.0144       0.68
40-49            44.5   36    10.12       102.41       3686.92
50-59            54.5   12    20.12       404.81       4857.77
60-69            64.5   4     30.12       907.21       3628.86
Total                   169                            20197.63

 Mean = Σ(mi fi)/Σfi = 5810.5/169 = 34.38 years
 S² = 20197.63/(169 - 1) = 120.22
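The grouped-data mean and variance can be recomputed directly from the mid-points and frequencies with a short Python sketch. The function name `grouped_mean_variance` is hypothetical. Working from unrounded values avoids rounding drift in the intermediate columns.

```python
def grouped_mean_variance(midpoints, freqs):
    """Mean and sample variance of grouped data from mid-points & frequencies."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    squared_sum = sum((m - mean) ** 2 * f for m, f in zip(midpoints, freqs))
    return mean, squared_sum / (n - 1)

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]
mean_age, var_age = grouped_mean_variance(midpoints, freqs)
print(round(mean_age, 2), round(var_age, 2))
```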

Properties of Variance
1. The main disadvantage of variance is that its unit is the square of the unit of the original measurement values
2. The variance gives more weight to extreme values than to values near the mean, because the differences are squared.
3. The unit drawback of the variance is overcome by the standard deviation.

Standard deviation (σ, s)
 It is the square root of the variance.
 This produces a measure having the same scale as that of the individual values.
 It shows variation about the mean
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$
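As a small illustration, the standard deviation of the ten observations used in the variance example can be obtained in Python by taking the square root of the sample variance. The sketch uses the standard library `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor.

```python
import math
from statistics import stdev, variance

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
s2 = variance(data)        # sample variance (divisor n - 1)
s = math.sqrt(s2)          # standard deviation, in the original units
print(s2, round(s, 2))
print(math.isclose(s, stdev(data)))   # same value from stdev() directly
```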

Properties of SD
1. Has the advantage of being expressed in the same units of measurement as the mean
2. It is the most widely used measure of dispersion because of its link to the properties of the theoretical normal curve.
3. However, if the units of measurement of two data sets are not the same, their variability can't be compared by comparing the values of the SD.

[Figure: a wide spread of values results in a higher SD; a narrow spread results in a lower SD]

Coefficient of variation (CV)
 When two data sets have different units of measurement, or their means differ considerably in size, the CV should be used as the measure of dispersion.
 It is the best measure for comparing the variability of two sets of observations.
 Data with a smaller coefficient of variation are considered more consistent.
 CV is the ratio of the SD to the mean, multiplied by 100.

$$CV = \frac{S}{\bar{x}} \times 100$$
“Cholesterol is more variable than systolic blood pressure”
              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0
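The CV comparison between systolic blood pressure and cholesterol can be reproduced with a minimal Python sketch. The function name `coefficient_of_variation` is hypothetical. The SDs and means are the values given in the table.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; unit-free, so different scales can be compared."""
    return 100 * sd / mean

cv_sbp = coefficient_of_variation(15, 130)    # systolic blood pressure
cv_chol = coefficient_of_variation(40, 200)   # cholesterol
print(round(cv_sbp, 1), round(cv_chol, 1))
print(cv_chol > cv_sbp)   # cholesterol is relatively more variable
```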

Distributions
Distributions used in statistical analysis:
1. Discrete random variables:
1) Binomial,
2) Poisson &
3) Hypergeometric distributions.
 E.g. The analysis of a discrete random variable, such as the position of a nucleotide on a given sequence, may use techniques based on a binomial distribution rather than techniques that assume a normal distribution.
2. Continuous random variables:
1) Normal distribution,
2) Z (standard normal) distribution.

Normal distribution
 It is symmetric about its mean: one half of the curve is the mirror image of the other half
 The mean, median, & mode are equal & are located at the same position (the center of the curve)
 The highest point is at its mean
 The height of the curve decreases as one moves away from the mean in either direction, approaching, but never reaching, zero

[Figure: normal curve, symmetric about its mean. The highest point of the curve is at the mean; as one moves away from the mean in either direction the height of the curve decreases, approaching, but never reaching, zero]

Skewed distributions
 The data are not distributed symmetrically in skewed distributions
 The mean, median, & mode are not equal & are in different positions
 Scores are clustered at one end of the distribution
 A small number of extreme values are located at the opposite end
 The skew is always in the direction of the longer tail

A. Negatively skewed distribution (skewed to the left)
 Occurs when the majority of scores are at the right end of the curve & a few extreme small scores are scattered at the left end
B. Positively skewed distribution (skewed to the right)
 Occurs when the majority of scores are at the left end of the curve & a few extreme large scores are scattered at the right end.

[Figure: three curves showing the positions of the mean, median & mode]
(a). Symmetric distribution: Mean = Median = Mode
(b). Distribution skewed to the right: Mean > Median > Mode
(c). Distribution skewed to the left: Mean < Median < Mode
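The ordering of the mean, median & mode in skewed data can be illustrated with a small Python sketch. Both data sets are hypothetical values made up for this illustration: the second is simply the mirror image of the first.

```python
from statistics import mean, median, mode

def summarize(values):
    """Return (mean, median, mode) of a data set."""
    return mean(values), median(values), mode(values)

# Hypothetical right-skewed data: most values are small, one is very large
right_skewed = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]
# Mirror image of the same data, giving a left-skewed set
left_skewed = [24 - x for x in right_skewed]

r_mean, r_median, r_mode = summarize(right_skewed)
l_mean, l_median, l_mode = summarize(left_skewed)
print(r_mean, r_median, r_mode)   # mean > median > mode
print(l_mean, l_median, l_mode)   # mean < median < mode
```

In both sets the mean is pulled toward the long tail, while the median stays closer to where most of the values lie.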

Which measures to use?
1. When the distribution is symmetric & uni-modal,
summarize the data using means & standard deviations.
2. When the data are skewed, it is preferable to use the
median & quartiles as summary statistics.
3. Median & quartiles are not easily influenced by extreme
values in a skewed distribution unlike means & standard
deviations.
A. Symmetric & uni-modal distribution —
 Mean, median, & mode should all be approximately the
same
B. Skewed to the right (Positively skewed) —
 Mean is sensitive to extreme values, so median might be
more appropriate
C. Skewed to the left (Negatively skewed) –
 Mean is sensitive to extreme values, so median might be
more appropriate
