Basic concepts in biostatistics edited pc-1.pptx
1. Arba Minch University
College of Medicine & Health Sciences
School of Public Health, Department of Public Health
Epidemiology and Biostatistics unit
By Kusse Otayto(BSc,MPH in Epi/Biostatistics)
2. Descriptive statistics
It deals with the description of data in a clear &
informative manner using tables, graphs &
numerical summary
It involves the organization & summarization of a
body of data with one or more meaningful tools.
It helps to identify the general features & trends in a
set of data & extracting useful information
Also very important in conveying the final results of
a study
2
3. Descriptive statistics
Data
Are information collected from the source or
Are the raw materials of statistics
Are numbers which can be obtained by
measurements or counting
Data are made up of a set of variables
It Can be obtained from Counting, Routinely kept
records, Surveys, Experiments, Reports…
Types of data
1. Primary data
2. Secondary data
3
4. 1. Primary data
1. Primary data:
Are data collected from the items or individual
respondents directly by the researcher themselves
for the purpose of a study.
Advantages of primary data
1. The data is original
2. Possibility of flexibility
3. Source for extensive research
Disadvantages of primary data
1. Expensive & time consuming
2. Possibility of personal prejudice(biases)
4
5. 2. Secondary data
2. Secondary data:
Are data which had been collected by certain people or
organization & statistically treated & the information
contained in it is used for other purpose by other people
Obtained from journals, reports, government
publications
Advantages of secondary data
1. Are readymade
2. Relatively cheaper
3. Lesser degree of personal prejudice
Disadvantages of secondary data
1. Lacks originality
2. May or may not suit the objects of enquiry (Not source
for extensive research)
3. It is used with great care & caution
5
6. Methods of data collection
Before any statistical work can be done data must be
collected.
Data collection is a crucial stage in the planning &
implementation of a study
If the data collection has been superficial, biased or
incomplete, data analysis becomes difficult, & the research
report will be of poor quality.
Therefore, we should concentrate all possible efforts on
developing appropriate tools, & should test them several
times.
Depending on the type of variable & the objective of the
study different data collection methods can be employed:
observation, interview, or a self-administered written questionnaire.
7. A. Observation
Is a technique that involves systematically selecting,
watching & recording behavior & characteristics of
living things, objects or phenomena.
It includes all methods from simple visual
observations to the use of high-level equipment
It can be undertaken in the following ways:
1. Participant observation:
The observer takes part in the situation he or she
observes.
2. Non-participant observation:
The observer watches the situation, openly or
concealed, but does not participate
7
Cont…
8. Observations can give additional, more accurate information
on behavior of people than interviews or questionnaires
Observations can also be made on objects
Outline the guidelines for the observations prior to actual
data collection.
Advantages
Gives relatively more accurate data on behavior &
activities
Disadvantages:
The investigator's or observer's own biases may influence what is recorded
Needs more resources & skilled personnel when high-level
equipment is used.
8
Cont…
9. B. Interview (face-to-face)
Is a data collection technique that involves oral
questioning of respondents, either individually or as a
group
Answers to the questions posed during an interview can
be recorded by:
1. Writing them down (either during the interview itself
or immediately after the interview) or
2. By tape-recording the responses, or
3. By a combination of both.
Advantages of face-to-face interview
Can stimulate & maintain the respondent’s interest
Can create a rapport(bond) (understanding, concord)
Observations can be made as well.
Disadvantage
It is time consuming & expensive
Cont…
10. Cont…
1. In-depth interview
It is a conversation between the researcher & the
subject about the research area or topic.
It is designed to allow the respondent to tell their
story in their own way
Issues are covered in detail; respondent leads the
interviews/sets the agenda; no fixed order
Important in:
Highly sensitive issues
Geographically dispersed respondents
When peer pressure is expected to distort facts
It costs more & takes more time than a focus group discussion (FGD)
11. 2. Focus group discussions
It allows a group of 8 -12 informants to freely discuss
a certain subject with the guidance of a facilitator or
reporter
Advantages
Group interaction stimulates richer responses & the
emergence of new ideas
The researcher observes & gets first hand insights
Can be done more quickly & generally less expensive
than in- depth interviews
Disadvantage
Not good in highly sensitive issues
11
Cont…
12. C.Using self-administered written questionnaire
Is a data collection tool in which written questions
are presented that are to be answered by the
respondents in written form
It can be administered in different ways, such as by:
Sending questionnaires by mail with clear
instructions
Gathering all or part of the respondents in one place
at one time, giving oral or written instructions, &
letting the respondents fill out
Hand-delivering questionnaires to respondents &
collecting them later
12
Cont…
13. The questions can be either open-ended or
closed
A. Example of closed ended question
1. What is the current breastfeeding status of mother ?
A. Exclusive breastfeeding
B. Partial breastfeeding
C. Not breastfeeding
B. Example of Open ended question
1. At what age should the child start supplementary
food? why?
13
Cont….
14. Advantages
Is simpler & cheaper than interview
Can be administered to many persons
simultaneously
Can be sent by post.
Disadvantages
It demands a certain level of education & skill of
respondents
If the questionnaire is mailed, people of low socio-economic
status are less likely to respond to it
Cont….
15. Variable
Variable
Is a characteristic which takes different values in
different PPT (persons, places, or things).
Any aspect of an individual or object that is
measured (e.g. BP) or recorded (e.g. age, sex) &
takes any value.
There may be one or many variables in a study
15
16. Types of variables
A. Qualitative (categorical) variables
Nominal
Ordinal
B. Quantitative (numerical) variables
Continuous
Discrete
1. Dependent (outcome,Response) variable
2. Independent (exposure,Explanatory) variable
16
Variable
17. 1. Categorical(Qualitative) variable
A variable which can not be measured in
quantitative form but can only be sorted by name or
categories
Not able to be measured as we measure height or
weight
The notion of magnitude is absent or implicit.
Categories must not overlap & must cover all
possibilities
17
Variable….
18. Categorical variable is divided into two:
1. Nominal variable
The values fall into un-ordered categories or classes
Uses names, labels or symbols to assign each
measurement.
Examples: Blood type (A, B, AB, O), Sex (male/female)
2. Ordinal variable
Assigns each measurement to one of a limited number of
categories that are ranked in terms of order.
Although non-numerical, can be considered to have a
natural ordering
Examples:
1. Cancer stages: 1, 2, 3, 4
2. Pain severity: no pain, slight pain, moderate pain, severe pain
Variable….
19. B. Quantitative (numerical) variable
A variable that can be measured or counted & expressed
numerically.
Has the notion of magnitude.
E.g. Height, weight, # of children, etc.
Quantitative variable is divided into two:
1. Discrete variable
It can only have a limited number of discrete values &
hence takes on integer values only
Characterized by gaps or interruptions in the values.
Both the order & magnitude of the values matter.
The values are not just labels, but are actual measurable
quantities.
E.g. Number of children in a household (0, 1, 2, 3, etc.)
Variable….
20. Variables…
2. Continuous variable
It can have an infinite number of possible values in
any given interval or within some range
Both the magnitude & the order of the values matter
Does not possess the gaps or interruptions
E.g. Weight (50.123...), Height (1.342...)
20
21. Variables…
Manipulation of variables
Continuous variables can be discretized
E.g. Age (1 1/12 yr → 1 yr) can be rounded to whole
numbers
Continuous or discrete variables can be categorized
E.g. Age categories- 1(1-5), 2(6-10), 3(11-15)
Categorical variables can be re-categorized
E.g. marital status (Single, Married, Divorced,
Widowed) lumping from 4 categories down to 2
(married, single)
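The manipulations above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, age is discretized by keeping completed years (one possible rounding rule), and the mapping of Divorced & Widowed to "single" is an assumption about how the four categories are lumped.

```python
def age_category(age_years):
    """Return the slide's category code for an age, or None if outside 1-15."""
    # Cut points taken from the slide: 1 = 1-5, 2 = 6-10, 3 = 11-15.
    bins = [(1, 5, 1), (6, 10, 2), (11, 15, 3)]
    whole_years = int(age_years)  # discretize: keep completed years only (assumed rule)
    for low, high, code in bins:
        if low <= whole_years <= high:
            return code
    return None


def collapse_marital_status(status):
    """Lump four marital-status categories into two (assumed mapping)."""
    return "married" if status == "Married" else "single"


ages = [1.5, 4.0, 6.25, 10.9, 12.0]
codes = [age_category(a) for a in ages]
print(codes)
print([collapse_marital_status(s) for s in ["Single", "Married", "Divorced", "Widowed"]])
```

Each step loses information: once age is stored only as a category code, the original continuous value cannot be recovered.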
21
22. Variables…
1. Independent variables
Precede(come first) dependent variables in time
Are often manipulated by the researcher
2. Dependent variables
What is measured as an outcome in a study
Values depend on the independent variable
Example
1. Health education involving active participation of mothers
will produce more positive changes in child feeding than
health education based on lectures.
Independent variable:
Type of health education
Dependent variable:
Changes in child feeding
24. Scales of Measurement
Scales of measurement
Is an assignment of numbers to subjects, objects or
events(variables) in which we are interested according to
a set of rules
Measurement is a way of refining our ordinary
observations so that we can assign numerical values to
our observations.
These numbers will provide the raw material for our
statistical analysis.
Why do we measure things, or worry about the different forms
that measurement may take?
It allows us to go beyond simply describing the presence
or absence of an event or thing to specifying how much,
how long, or how intense it is.
With measurement, our observations become more
accurate & more reliable.
25. Scales...
There are four types of scales of measurement.
1. Nominal scale
Used when data are classified into one of two or
more categories
The values fall into un-ordered categories or classes(
aren’t hierarchical, one category isn’t “better” or
“higher” than another)
Uses names, labels or symbols to assign each
measurement.
Labeling or naming allows us to make qualitative
distinctions or to categorize & then count the
frequency of persons, objects, or things in each
category.
25
26. It should be: Exhaustive & Mutually exclusive
1. Exhaustive :
Should include all possible answerable responses.
2. Mutually exclusive :
No respondent should be able to have two attributes
simultaneously
Not really a ‘scale’ because it does not scale objects along
any dimension
Assignment of numbers to the categories has no
mathematical meaning, simply for identification
purposes.
Examples:
1. Marital status(Single, Married, Divorced)
2. Religion (Muslim, Protestant, Orthodox, Catholic)
27. Scales...
2. Ordinal scale
Used when data are classified into logically ordered ranks
Assigns each measurement to one of a limited number of
categories that are logically ranked in terms of order
Although non-numerical, can be considered to have a
natural ordering (The numbers have limited meaning
4>3>2>1)
No consistent distance between points of measurement: the
intervals b/n adjacent numbers are not equal
Example: Social class (Very poor, Poor, Rich, Very rich)
28. Scales...
3. Interval scale
Used when data are classified on a scale that assumes
equal distance between numbers
There are Magnitude + Constant distance b/n points
+ No true zero point + Equal interval b/n adjacent
numbers
Example: Temp. in °F on 4 consecutive days
Days:        A    B    C    D
Temp. (°F):  50   55   60   65
For these data, not only is day A with 50 °F cooler
than day D with 65 °F, but it is 15 °F cooler.
It has no true zero point (“0” is arbitrarily chosen &
doesn’t reflect the absence of temp.)
29. Scales...
4. Ratio scale
Used when data are classified on a scale that assumes
equal distance & a true zero value
Measurement begins at a true zero point & the scale has
equal space
There are Magnitude + Constant distance b/n points +
Equal ratios + True zero.
Examples: Height, weight, BP, etc.
Zero weight or height means the complete absence of
weight or height.
A 100-kg person has one-half the weight of a 200-kg
person & twice the weight of a 50-kg person.
It is the most sensitive & powerful type of scale, b/c it contains
the most precise information about each observation that is
made
30. Decision tree to determine the appropriate scale of
measurement.
Question 1: Is there any order to the numbers?
No → Nominal scale
Yes → go to Question 2
Question 2: Are there equal intervals b/n adjacent numbers?
No → Ordinal scale
Yes → go to Question 3
Question 3: Is there an absolute zero?
No → Interval scale
Yes → Ratio scale
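The three questions of the decision tree can be sketched as a small function. The function name and its boolean parameters are hypothetical, chosen only to mirror Questions 1 to 3.

```python
def scale_of_measurement(ordered, equal_intervals, true_zero):
    """Walk the three questions of the decision tree and name the scale."""
    if not ordered:            # Question 1: is there any order to the numbers?
        return "nominal"
    if not equal_intervals:    # Question 2: equal intervals b/n adjacent numbers?
        return "ordinal"
    if not true_zero:          # Question 3: is there an absolute zero?
        return "interval"
    return "ratio"


# Examples taken from the slides on each scale
print(scale_of_measurement(False, False, False))  # marital status
print(scale_of_measurement(True, False, False))   # social class
print(scale_of_measurement(True, True, False))    # temperature in degrees F
print(scale_of_measurement(True, True, True))     # weight
```

The questions are asked in order, so a "No" at any step stops the walk; later answers are ignored.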
32. Why Is Level of Measurement Important?
Helps you to decide
1. What kind of data display or summary method &
What statistical analysis is appropriate on the values
that were assigned &
2. How to interpret the data from that variable.
32
34. Data Organization & Presentation
1. For categorical variables
A. Using table of frequency distribution
1. Frequency counts
2. Relative frequency
3. Cumulative frequency
4. Relative cumulative frequency
B. Using pictorial forms
1. Bar charts(graph)
2. Pie charts
Ordered array:
A simple arrangement of individual observations in
order of magnitude.
Very difficult with large sample size
34
35. 2. For Quantitative variable
A. Using table of frequency distributions
1. Frequency counts
2. Relative frequency
3. Cumulative frequencies
4. Relative cumulative frequency
B. Using pictorial forms
1. Histogram
2. Frequency polygon
3. Line graph
4. Scatter plot
5. Box plot
6. Ogive/cumulative frequency…
36. Frequency table:
It involves a listing of all the observed values of the variable
being studied & How many times each value is observed.
Frequency distribution:
The distribution of the total number of observations among
the various categories is called a frequency distribution.
Simple & effective way for summarizing large amounts of
data
Relative Frequency
It is the proportion or percentages of observations in each
category.
The distribution of proportions is called the relative
frequency distribution of the variable
Given a total number of observations, the relative frequency
distribution is easily derived from the frequency distribution.
36
Frequency table & Frequency Distributions…
37. Frequency table & Frequency Distributions…..
Cumulative frequency
It is the number of observations in the category plus
observations in all categories smaller than it.
Cumulative relative frequency
It is the proportion of observations in the category
plus observations in all categories smaller than it.
It is obtained by dividing the cumulative frequency
by the total number of observations.
37
38. For example: Birth weight for newborns with levels: 1. Very low, 2. Low, 3. Normal & 4. Big
Table 1. Distribution of birth weight of newborns b/n 1976-1996 at “X” town.
BWT        Freq.   Cum. freq.         Rel. freq. (%)          Cum. rel. freq. (%)
Very low   43      43                 43/9974*100 = 0.4       43/9974*100 = 0.4
Low        793     43+793 = 836       793/9974*100 = 8.0      836/9974*100 = 8.4
Normal     8870    836+8870 = 9706    8870/9974*100 = 88.9    9706/9974*100 = 97.3
Big        268     9706+268 = 9974    268/9974*100 = 2.7      9974/9974*100 = 100
Total      9974                       100
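As a minimal sketch, the four columns of the birth weight table can be computed from the frequency counts alone. The function name `frequency_table` is hypothetical; the counts are those given in Table 1, and percentages are rounded to one decimal place as in the table.

```python
# Frequency counts from Table 1 (birth weight of newborns), in category order
counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}


def frequency_table(counts):
    """Return rows of (category, freq, cum. freq, rel. freq %, cum. rel. freq %)."""
    total = sum(counts.values())
    rows = []
    running = 0  # cumulative frequency so far
    for category, freq in counts.items():
        running += freq
        rows.append((
            category,
            freq,
            running,
            round(100 * freq / total, 1),
            round(100 * running / total, 1),
        ))
    return rows


rows = frequency_table(counts)
for row in rows:
    print(row)
```

Cumulative columns are only meaningful when the categories have a natural order, as the birth weight levels do.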
39. For Quantitative variable,
Select a set of continuous, non-overlapping intervals
such that each value can be placed in one & only one
of the intervals.
The first consideration is how many intervals to
include
To determine the number of class intervals & the
corresponding width, we may use:
Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \frac{L - S}{K}$
Where
K = Number of class intervals
n = No. of observations
W = Width of the class interval
L = Largest value
S = Smallest value
Quantitative variable
40. 1. Example: Leisure time (hours) per week for 40
college students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 (log n)
K = 1 + 3.322 (log40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
W = (L - S)/K
W = (38 - 10)/6 = 4.67 ≈ 5
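The calculation above can be checked with a short sketch. The function name `sturges` is hypothetical; the width is rounded up here so that K classes are guaranteed to cover the whole range, which is a design choice rather than part of the rule itself.

```python
import math

# Leisure time (hours) per week for 40 college students, as listed in the example
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]


def sturges(values):
    """Return (K, rounded K, raw width, rounded-up width) by Sturges' rule."""
    n = len(values)
    k = 1 + 3.322 * math.log10(n)        # number of class intervals
    k_rounded = round(k)
    width = (max(values) - min(values)) / k_rounded
    return k, k_rounded, width, math.ceil(width)


k, k_rounded, width, class_width = sturges(leisure_hours)
print(f"K = {k:.2f} -> {k_rounded}, W = {width:.2f} -> {class_width}")
```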
Quantitative variable....
41. Leisure time (hours) per week for 40 college students: frequency distribution
Time (Hours)   Frequency   Relative Frequency   Cumulative Relative Frequency
10-14          5           0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29          7           0.175                0.875
30-34          3           0.075                0.950
35-39          2           0.050                1.00
Total          40          1.00
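A sketch of how the grouped table can be built from the raw data, assuming 6 classes of width 5 starting at the minimum value 10. The function name `grouped_frequencies` is hypothetical.

```python
# Leisure time (hours) per week for 40 college students, as listed in the example
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]


def grouped_frequencies(values, start, width, n_classes):
    """Return rows of (class label, freq, rel. freq, cum. rel. freq)."""
    n = len(values)
    rows = []
    cumulative = 0
    for i in range(n_classes):
        low = start + i * width
        high = low + width - 1           # class limits for whole-number data
        freq = sum(1 for v in values if low <= v <= high)
        cumulative += freq
        rows.append((f"{low}-{high}", freq, freq / n, cumulative / n))
    return rows


table = grouped_frequencies(leisure_hours, start=10, width=5, n_classes=6)
for label, freq, rel, cum_rel in table:
    print(f"{label:6} {freq:3d} {rel:6.3f} {cum_rel:6.3f}")
```

Because the intervals do not overlap, every observation is counted exactly once and the frequencies add up to 40.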
42. 42
Class Limit: The range for each class
Upper class limit
Lower class limit
Mid-point (Class mark):
The value of the interval which lies midway b/n the
lower & the upper limits of a class.
Class boundary (True limits):
Are those limits that make an interval of a continuous
variable continuous in both directions
Upper class boundary
Lower class boundary
To obtain the class boundaries, subtract 0.5 from the lower class limit & add 0.5 to the upper class limit (for limits recorded as whole numbers)
44. Guidelines for constructing tables
1. Keep them simple (Limit the number of variables to
three or less)
2. All tables should be self-explanatory (Include clear
title telling what, when & where)
3. Clearly label the rows & columns
4. State clearly the unit of measurement used
5. Explain codes & abbreviations in the foot-note
6. Show totals
7. If the data are not original, indicate the source in a footnote.
45. Pictorial /Diagrammatic presentation
Importance of diagrammatic presentation
1. Diagrams have greater attraction than mere figures
2. They give quick overall impression of the data
3. They have greater memorizing value than mere figures
4. They facilitate comparison
5. Used to understand patterns & trends
E.g.,
Skewed or symmetric distribution
Multiple peaks / mode
Are there any outliers ?
Relationship between variables.
46. 1. Bar charts (Graphs)
1. Graphical equivalent of a frequency table
2. Categories are listed on the horizontal axis (X-axis)
3. Frequencies or relative frequencies are represented
on the Y-axis (ordinate)
4. The height of each bar is proportional to the
frequency or relative frequency of observations in
that category
46
Qualitative variable presentation
47. A. Simple bar chart: used to represent a single
variable
[Figure: simple bar chart. X-axis: Immunization status (Not immunized, Partially immunized, Fully immunized); Y-axis: Number of children]
Fig. 1. Immunization status of Children in Adami Tulu Woreda, Feb.1995
48. B. Sub-divided (component) bar chart
1. If there are different quantities forming the sub-
divisions of the totals, simple bars may be sub-
divided in the ratio of the various sub-divisions to
exhibit the relationship of the parts to the whole.
2. The order in which the components are shown in a
“bar” is followed in all bars used in the diagram
48
Qualitative variable presentation
49. Example of 100% component bar chart:
[Figure: 100% component bar chart. X-axis: month of 2003 (August, October, December); Y-axis: Percent; components: P. falciparum, P. vivax, Mixed]
Fig.1 Plasmodium species distribution for confirmed malaria cases, Zeway, 2003
50. Method of constructing bar chart
1. All the bars must have equal width
2. The bars are not joined together (leave space b/n
bars)
3. The different bars should be separated by equal
distances
4. All the bars should rest on the same line called the
base
5. Both axes should be clearly labelled
Instead of “stacks” rising up from the horizontal (bar
chart), we could plot instead the shares of a pie.
50
Qualitative variable presentation
51. 2. Pie chart
1. It shows the relative frequency for each category by
dividing a circle into sectors
2. The angles are proportional to the relative frequency.
3. Used for a single categorical variable
4. Use percentage distributions
Steps to construct a pie-chart
1. Construct a frequency table
2. Change the frequency into percentage (P)
3. Change the percentages into degrees, where
Degree = (Percentage / 100) × 360°
4. Draw a circle & divide it accordingly
Qualitative variable presentation
52. Steps to construct a pie-chart
Example: Distribution of deaths for females, in England and Wales, 1989.
Cause of death        No. of deaths   Angle of sector
Circulatory system    100 000         100,000/236,000*360° = 153°
Neoplasm              70 000          70,000/236,000*360° = 107°
Respiratory system    30 000          30,000/236,000*360° = 46°
Injury & poisoning    6 000           6,000/236,000*360° = 9°
Digestive system      10 000          10,000/236,000*360° = 15°
Others                20 000          20,000/236,000*360° = 30°
Total                 236 000         100% (360°)
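The steps for constructing a pie chart reduce to two conversions per category: count to percentage, then share of the total to degrees. A minimal sketch, with a hypothetical function name `pie_sectors` and the death counts from the example:

```python
# Number of deaths by cause, females, England and Wales, 1989 (from the example)
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}


def pie_sectors(counts):
    """Return {category: (percentage, angle in degrees)} for a pie chart."""
    total = sum(counts.values())
    return {
        category: (100 * freq / total, 360 * freq / total)
        for category, freq in counts.items()
    }


sectors = pie_sectors(deaths)
for category, (percent, angle) in sectors.items():
    print(f"{category:20} {percent:5.1f}%  {angle:6.1f} degrees")
```

The unrounded angles always sum to 360; rounded values may need a small adjustment so that the drawn sectors still close the circle.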
53. Recalling that a circle has 360 degrees, that 50% means 180 degrees, 25%
means 90 degrees, etc., we can identify “wedges” according to relative
frequency
[Figure: pie chart. Distribution of cause of death for females, in England and Wales, 1989: Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Injury and poisoning 3%, Digestive system 4%, Others 8%]
54. 3. Histogram
1. Histograms are frequency distributions with
continuous class interval that have been turned into
graphs
2. A histogram is a type of bar chart, but there are no
spaces b/n the bars(continuous data)
3. Histograms are used to visually represent frequency
distributions of continuous data
4. Given a set of numerical data, we can obtain
impression of the shape of its distribution by
constructing a histogram
54
Quantitative variable presentation
55. 3. Histogram
5. Constructed by choosing a set of non-overlapping class
intervals & counting the number of observations that fall in
each class.
6. It is necessary that the class intervals be non-overlapping so
that each observation falls in one & only one interval.
7. Bars are drawn over the intervals
8. The area of each bar is proportional to the frequency of
observations in the interval
Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective groups are lost
& difficult to reconstruct
Stem-and-leaf plot overcomes these problems
55
Quantitative variable presentation….
56. Example: Distribution of the age of women at the time of marriage
Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number      11      36      28      13      7       3       2
[Figure: histogram, Age of women at the time of marriage. X-axis: Age group (class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5); Y-axis: No. of women]
57. 5. Frequency polygon
1. Instead of drawing bars for each class interval,
sometimes a single point is drawn at the mid point of
each class interval & consecutive points joined by
straight line.
2. Graphs drawn in this way are called frequency
polygons
3. The total area under the frequency polygon is equal
to the area under the histogram
4. Frequency polygons are superior to histograms for
comparing two/more sets of data.
57
Quantitative variable presentation….
58. [Figure: frequency polygon, Age of women at the time of marriage. X-axis: Age (class mid-points 12, 17, 22, 27, 32, 37, 42, 47); Y-axis: No. of women]
59. 6. Scatter plot
1. Most studies in medicine involve measuring more
than one characteristic
2. For two quantitative variables we use bivariate plots
(also called scatter plots or scatter diagrams).
3. E.g. in the study on percentage saturation of bile,
information was also collected on the age of each patient,
to see whether a relationship existed between the
two measures (saturation of bile & age).
60. 6. Scatter plot….
When both the variables are qualitative then we can
use a bar graph.
When one of the characteristics is qualitative & the
other is quantitative, the data can be displayed in box
& whisker plots.
A scatter diagram is constructed by drawing X- & Y-
axes.
Each point or dot represents
a pair of values measured for a single study subject
The graph suggests the possibility of a positive
relationship between age & percentage saturation of
bile in women.
61. [Figure: scatter plot. X-axis: Age; Y-axis: Saturation of bile]
Age and percentage saturation of bile for women patients in
hospital Z, 1998
62. 7. Line graph
1. Useful for assessing the trend of a particular situation
over time.
2. Helps for monitoring the trend of epidemics.
3. Values for each category are connected by
continuous line.
4. Sometimes two or more graphs are drawn on the
same graph taking the same scale so that the plotted
graphs are comparable.
62
Quantitative variable presentation….
68. 1. Measures of Central Tendency
Statistic:–
Descriptive measure computed from sample data
Parameter:–
Descriptive measure computed from population data
Measures of central tendency:-
Are the measures used to summarize the point at
which the data tend to cluster in a single number or
statistic.
The most commonly used measures of central
tendency are:
1. Arithmetic Mean,
2. Median &
3. Mode.
68
69. 1. Arithmetic mean
It is the average of the data set: the sum of the observations
divided by the number of observations.
Mean for ungrouped data
Mean of a sample: $\bar{X} = \frac{\sum X}{n}$
Mean of a population: $\mu = \frac{\sum X}{N}$
$\bar{X}$ (X bar) refers to the mean of a sample &
$\mu$ refers to the mean of a population
$\sum X$ is an instruction to add all of the X values
n is the total number of values in a sample &
N is the total number of values in a population
70. Arithmetic mean …..
Example: 19 21 20 20 34 22 24 27 27 27
Calculate the mean , n=10
Mean = (19 + 21 + 20 + 20 + 34 + 22 + 24 + 27 + 27 + 27)/10 = 241/10 = 24.1
Mean for grouped data
We assume that all values falling into a particular class
interval are located at the mid-point of the interval.
It is calculated as follows:
$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
Where,
k = the number of class intervals
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
71. Example. Compute the mean age of 169 subjects from the
grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total            __               169              5810.5
Mean = 5810.5/169 = 34.4 years
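A minimal sketch of the grouped-data mean, using the class limits & frequencies from the example. The function name `grouped_mean` is hypothetical; each class is represented by its mid-point, as the method assumes.

```python
# (lower class limit, upper class limit, frequency) for the 169 subjects
age_classes = [
    (10, 19, 4),
    (20, 29, 66),
    (30, 39, 47),
    (40, 49, 36),
    (50, 59, 12),
    (60, 69, 4),
]


def grouped_mean(classes):
    """Mean of grouped data: sum of (mid-point x frequency) / sum of frequencies."""
    total_f = sum(freq for _, _, freq in classes)
    total_mf = sum((low + high) / 2 * freq for low, high, freq in classes)
    return total_mf / total_f


mean_age = grouped_mean(age_classes)
print(f"Mean age = {mean_age:.2f} years")
```

The result is an approximation: it would equal the mean of the raw ages only if the values in each class were centred on the class mid-point.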
72. Properties of the arithmetic mean
1. Can be used for both discrete & continuous data.
However, it is not appropriate for either nominal
or ordinal data.
2. For given set of data there is one & only one
arithmetic mean.
3. It is easily understood & easy to compute.
4. Algebraic sum of the deviations of the given values
from their arithmetic mean is always zero.
5. It is greatly affected by the extreme values.
72
73. 2. Median
Median
Is the value that divides a series of values in half when
they are listed in order
If the number of observations is odd, the median is the
[(n+1)/2]th observation.
E.g. 19 20 20 21 22 23 24 27 27 27 34 n = 11
Median = [(n+1)/2]th = [(11+1)/2]th = 6th observation = 23
If the number of observations is even, there is no single middle
observation; the median is the average of the two middle values,
the (n/2)th & [(n/2)+1]th observations.
E.g. 19 20 20 21 22 24 27 27 27 34 n = 10
Median = [(n/2)th + ((n/2)+1)th]/2 = [(10/2)th + ((10/2)+1)th]/2
= (5th + 6th)/2 = (22 + 24)/2 = 23
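The odd & even cases can be sketched as one function. The name `median_of` is hypothetical; Python's standard library offers `statistics.median` for the same purpose, and the version below only spells out the positional rule.

```python
def median_of(values):
    """Median by position: middle value if n is odd, mean of the two middle if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # the [(n+1)/2]th observation; index mid in 0-based counting
        return ordered[mid]
    # average of the (n/2)th and [(n/2)+1]th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = [19, 20, 20, 21, 22, 23, 24, 27, 27, 27, 34]
even_example = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
print(median_of(odd_example), median_of(even_example))
```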
74. Median for Grouped data
We assume that the values within a class-interval are
evenly distributed through the interval.
The first step is to locate the class interval in which it
is located.
Find n/2 & locate the first class interval whose
cumulative frequency is greater than or equal to n/2;
this interval contains the median.
74
Median….
75. To find a unique median value, use the following
interpolation formula.
$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$
Where,
• $L_m$ = lower true class boundary of the interval containing the median
• $F_c$ = cumulative frequency of the interval just above (preceding) the median class
interval
• $f_m$ = frequency of the interval containing the median
• $W$ = class interval width
• $n$ = total number of observations
76. Ex. Compute the median age of 169 subjects from the
grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169
77. Median:
n/2 = 169/2 = 84.5
84.5 falls in the 3rd class interval (30-39)
Lower class boundary = 29.5, upper class boundary = 39.5
Frequency of the class = 47
Fc (cumulative frequency of the interval just above the median class) = 70
Median = 29.5 + ((84.5 - 70)/47) × 10 = 32.58 ≈ 33
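The interpolation can be sketched as follows, assuming classes are given as (lower class boundary, frequency) pairs of equal width. The function name `grouped_median` is hypothetical.

```python
# (lower true class boundary, frequency) for the 169 subjects; class width = 10
age_classes = [(9.5, 4), (19.5, 66), (29.5, 47), (39.5, 36), (49.5, 12), (59.5, 4)]


def grouped_median(classes, width):
    """Median of grouped data by linear interpolation within the median class."""
    n = sum(freq for _, freq in classes)
    half = n / 2
    cumulative_before = 0  # Fc: cumulative frequency of the classes before this one
    for lower_boundary, freq in classes:
        # the median class is the first whose cumulative frequency reaches n/2
        if cumulative_before + freq >= half:
            return lower_boundary + (half - cumulative_before) / freq * width
        cumulative_before += freq
    raise ValueError("classes must contain at least one observation")


median_age = grouped_median(age_classes, width=10)
print(f"Median age = {median_age:.2f} years")
```

The interpolation relies on the stated assumption that values are spread evenly through the median class.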
78. Properties of median
1. Can be used for ordinal, discrete & continuous data.
However, it is not appropriate for nominal data.
2. There is only one median for a given set of data
3. The median is easy to calculate
4. Median is a positional average & hence it is not
drastically affected by extreme values
5. It is not a good representative of data if the number
of items is small
78
79. 3. Mode
Mode
It is the value/ observation which occurs most frequently.
Most distributions have one peak & are described as uni-
modal.
E.g. 19 21 20 20 34 22 24 27 27 27
Mode = 27
The mode of grouped data usually refers to the modal class
with the highest frequency.
The modal value is the highest bar in a histogram
Not a good summary
A data set may have one mode, more than one mode, or no mode
79
80. To find a single value of mode for grouped data, use
the following formula:
$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
Where:
$L_{mo}$ is the lower boundary of class mode
$\Delta_1$ is the difference b/n the frequency of class mode & the frequency
of the class before (above) the class mode
$\Delta_2$ is the difference b/n the frequency of class mode & the frequency
of the class after (below) the class mode
$i$ is the class width
81. Ex. Find the mode for the following data
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169
Solution
The class mode is 20-29: Lmo = 19.5, frequency of class mode = 66, frequency of the class before = 4, frequency of the class after = 47, i = 10
Δ1 = 66 - 4 = 62, Δ2 = 66 - 47 = 19
Mode = 19.5 + (62/(62 + 19)) × 10 = 27.15 ≈ 27
The mode lies in the upper part of the class mode b/c the class after it (47) is far more frequent than the class before it (4).
82. Properties of mode
1. Can be used for nominal, ordinal, discrete &
continuous data.
However, it is more appropriate for nominal &
ordinal data.
2. It is not affected by extreme values
3. Often its value is not unique
4. The main drawback of mode is that often it does not
exist
82
84. 2. Measures of Dispersion
Measures of Dispersion
Measures that quantify the variation or dispersion of a
set of data from its central location
Dispersion of a set of observations is the variety exhibited by
the observations
1. If all the values are the same → there is no dispersion
2. If the values differ → there is dispersion
3. If the values are close to each other → the amount of
dispersion is small
4. If the values are widely scattered/spread → the
dispersion is greater
85. Common measures of dispersion
1. Range
2. Inter quartile range
3. Variance
4. Standard deviation
5. Coefficient of variation
85
Measures of Dispersion….
86. 1. Range (R)
Range (R)
Is the difference b/n the largest & smallest
observations in a sample.
The range depends on only two values
Range = Maximum value – Minimum value
The range is the simplest measure of dispersion.
A data set with higher range shows more variability
Example –
Data values: 5, 9, 12, 16, 23, 34, 37, 42
Maximum value= 42,
Minimum value= 5
Range = 42 - 5 = 37
87. Properties of range
1. It is the simplest crude measure & can be easily
understood
2. It takes into account only two values which causes it
to be a poor measure of dispersion
3. Very sensitive to extreme observations (outliers)
4. It tends to increase as the sample size increases
88. 2. Inter-quartile range (IQR)
Inter-quartile range (IQR)
It is used when the median is used as the measure of
central tendency.
It gives the range in which the middle 50% of the
distribution lies.
The inter-quartile range quantifies the difference b/n
the third & first quartiles.
IQR = Q3 - Q1
A large IQR indicates a large amount of variability
among the middle 50% of the observations &
A small IQR indicates a small amount of variability
88
89. 2. Inter-quartile range (IQR).....
The inter-quartile range is particularly useful to
describe data sets where there are a few extreme
values.
Unlike the range, & to a lesser extent the standard
deviation, it is not sensitive to extreme values as it
relies on the spread of the middle 50% of the
distribution.
So, if there are data sets which have extreme values,
it can be more appropriate to use the median to
describe central tendency & the inter-quartile range
to describe the spread.
90. What do quartiles mean?
If the data are divided into four equal parts, we speak of
quartiles.
Quartiles (Q1, Q2, Q3) are the three cut points that divide
the ordered data into 4 equal parts, with 25% of the
observations in each part.
The first quartile (Q1):
Is the point which has 25% of the area to the left of
it & 75% to the right of it.
This means that 25% of the observations are less than or
equal to the first quartile & 75% of the observations
are greater than or equal to the first quartile.
The first quartile is also called the 25th percentile.
91. The second quartile (Q2):
The point which gives us 50% of the area to the left
of it & 50% to the right of it
The second quartile is called the median.
The third quartile (Q3):
Is the point which gives us 75% of the area to the left
of it & 25% of the area to the right of it.
This means that 75% of the observations are less
than or equal to the third quartile & 25% of the
observations are greater than or equal to the third
quartile.
The third quartile is also called the 75th percentile.
92. Ex.1: Suppose we have a small data set of
twelve observations
15 18 19 20 20 20 21 23 23 24 24 25
1. We want to divide the data into four equal sets
2. First, we find the median
15 18 19 20 20 20 ↑ median 21 23 23 24 24 25
Median = 20.5 (half way b/n the 6th & 7th
observations),
The median divides the data into two equal sets with exactly
50% of the observations in each:
The 1st - 6th observations in the first set &
The 7th - 12th observations in the other.
93. To find the first quartile we consider the observations
less than the median.
15 18 19 ↑ 20 20 20
The first quartile is the median of these data.
In this case, the first quartile is halfway between the 3rd &
4th observations & is equal to 19.5.
Now, we consider the observations which are greater than
the median.
21 23 23 ↑ 24 24 25
The third quartile is the median of these data & is equal to
23.5.
15 18 19 ↑(Q1) 20 20 20 ↑(Q2) 21 23 23 ↑(Q3) 24 24 25
IQR = Q3 - Q1 = 23.5 - 19.5 = 4
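The "median of each half" procedure used in Ex.1 can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and leaving the middle observation out of both halves when n is odd is an assumption (the slides only show an even n).

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: for odd n the middle observation is left out of both halves.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)


data = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3)    # 19.5 20.5 23.5
print(q3 - q1)       # IQR = 4.0
```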
94. Example 1: Suppose the first & third quartiles for weights of
girls 12 months of age are 8.8 kg & 10.2 kg, respectively.
IQR = 10.2 kg - 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 &
10.2 kg.
Example 2: Given the following data set (age of patients):
18, 59, 24, 42, 21, 23, 24, 32
Find the inter-quartile range
Solution: 18 21 23 24 24 32 42 59 (ordered), n = 8
Q1 = {(n+1)/4}th = 2.25th observation = 21 + (23 - 21) × 0.25 = 21.5
Q3 = {3(n+1)/4}th = 6.75th observation = 32 + (42 - 32) × 0.75 =
39.5
Hence, IQR = 39.5 - 21.5 = 18
95. Ex.2 :Given these data: 13, 7, 9, 15, 11, 5, 8, 4
a. Arrange the observations in increasing order.
4, 5, 7, 8, 9, 11, 13, 15.
b. Find the position of the 1st & 3rd quartiles.
n = 8
Position of Q1 = ¼(n+1) = ¼(8+1) = 2.25th
Q1 lies between the 2nd & 3rd observations
Position of Q3 = ¾(n+1) = ¾(8+1) = 6.75th
Q3 lies between the 6th & 7th observations
96. C. Identify the value of the 1st & 3rd quartiles.
The value of Q1 is equal to the value of the 2nd
observation plus 1/4th the difference b/n the values of
the 3rd & 2nd observations:
Value of the 3rd observation =7
Value of the 2nd observation = 5
Q1 = 5 +1/4(7-5) = 5 +2/4 = 5.5
The value of Q3 is equal to the value of the 6th
observation plus 3/4ths of the difference b/n the value
of the 7th & 6th observations:
Value of the 7th observation =13
Value of the 6th observation=11
Q3 = 11 +3/4 (13-11) = 11 +3(2)/4 = 11+6/4 = 12.5
97. d. Calculate the inter-quartile range
Q3 = 12.5 ; Q1 = 5.5
IQR = Q3-Q1 = 12.5–5.5 = 7
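The position-and-interpolation steps (a to d) can be sketched in Python. The function name `quartile_by_position` is hypothetical, and returning the smallest or largest observation when the position falls outside 1..n is an assumption the slides do not address.

```python
def quartile_by_position(values, k):
    """k-th quartile (k = 1, 2, 3) at the 1-based position k(n+1)/4,
    interpolating linearly between neighbouring ordered observations."""
    xs = sorted(values)
    n = len(xs)
    pos = k * (n + 1) / 4          # e.g. 2.25 means "a quarter of the way from the 2nd to the 3rd"
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part of the position
    if lower < 1:                  # assumption: clamp to the smallest value
        return xs[0]
    if lower >= n:                 # assumption: clamp to the largest value
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


ages = [18, 59, 24, 42, 21, 23, 24, 32]
print(quartile_by_position(ages, 1), quartile_by_position(ages, 3))   # 21.5 39.5

ex2 = [13, 7, 9, 15, 11, 5, 8, 4]
q1, q3 = quartile_by_position(ex2, 1), quartile_by_position(ex2, 3)
print(q1, q3, q3 - q1)                                                # 5.5 12.5 7.0
```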
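Generally we apply this formula for the kth quartile (k = 1, 2, 3)
of the ordered data:
1. Qk = [(kn/4)th + (kn/4 + 1)th observation]/2, if n is even
2. Qk = ((kn/4 + 1)/2)th observation, if n is odd
Note that this rule can give slightly different values from the
k(n+1)/4 interpolation used in the examples (for Ex.2 it gives
Q1 = (5 + 7)/2 = 6 rather than 5.5).
Quartiles for grouped data
Apply the same method as for the median:
Q1 = Q1L + ((n/4 - fc)/fQ1) × i  &  Q3 = Q3L + ((3n/4 - fc)/fQ3) × i
To find the class containing each quartile, use the positions
n/4 for Q1 & 3n/4 for Q3
IQR = Q3 - Q1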
98. Properties of IQR
1. It is a simple & versatile measure
2. It encloses the central 50% of the observations
3. It is not based on all observations but only on two
specific values
4. It is important in selecting cut-off points in the
formulation of clinical standards
5. Since it excludes the lowest & highest 25% of values, it
is largely unaffected by outliers
6. Less sensitive to the size of the sample
99. Percentiles
Percentiles:
Divide the ordered data into 100 equal parts.
Are less sensitive to outliers &
Are not greatly affected by the sample size (n).
100. Percentiles….
P0: The minimum
P25: 25% of the sample values are less than or equal to this
value (1st quartile; P25 means 25th percentile)
P50: 50% of the sample values are less than or equal to this
value (2nd quartile, the median)
P75: 75% of the sample values are less than or equal to this
value (3rd quartile)
P100: The maximum
101. Percentiles…..
The pth percentile:
Is a value such that p% of the observations are less than or
equal to it & the remaining (100-p)% are greater than or equal
to it.
With p written as a proportion (e.g. 0.10 for the 10th
percentile), it is:
The observation at position p(n+1), if p(n+1) is an integer
The average of the kth & (k+1)th observations, if p(n+1)
is not an integer, where k is the largest integer less
than p(n+1).
E.g. if p(n+1) = 3.6, take the average of the 3rd & 4th observations
102. Percentiles…..
Example: Birth weight in grams
2069, 2581, 2759, 2834, 2838, 2841, 3031,
3101, 3200, 3245, 3248, 3260, 3265, 3314,
3323, 3484, 3541, 3609, 3649, 4146
Find the 10th & 90th percentiles of the data set. n = 20
Solution: Pt = [(tn/100)th + (tn/100 + 1)th observation]/2, if n is even
10th percentile: tn/100 = 20 × 0.1 = 2 & tn/100 + 1 = 3,
so take the average of the 2nd & 3rd values
= (2581 + 2759)/2 = 2670 g
90th percentile: tn/100 = 20 × 0.9 = 18 & tn/100 + 1 = 19,
so take the average of the 18th & 19th values
= (3609 + 3649)/2 = 3629 g
(The p(n+1) rule gives the same answers here: 0.1 × 21 = 2.1 &
0.9 × 21 = 18.9 are not integers, so k = 2 & k = 18.)
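The p(n+1) percentile rule can be sketched in Python. The function name `percentile_by_rule` is hypothetical. Passing the percentile as a whole number t (so p = t/100) is a design choice that keeps the "is the position an integer?" test in exact integer arithmetic, and raising an error when the position falls outside the data is an assumption.

```python
def percentile_by_rule(values, t):
    """t-th percentile (t a whole number, 0 < t < 100) using the p(n+1) rule."""
    xs = sorted(values)
    n = len(xs)
    # Position p(n+1) = t(n+1)/100, split into whole part k and remainder.
    k, rem = divmod(t * (n + 1), 100)
    if k < 1 or k > n or (rem and k == n):
        raise ValueError("percentile position falls outside the data")
    if rem == 0:
        return xs[k - 1]                   # position is an integer: take that observation
    return (xs[k - 1] + xs[k]) / 2         # otherwise average the kth and (k+1)th


weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031,
           3101, 3200, 3245, 3248, 3260, 3265, 3314,
           3323, 3484, 3541, 3609, 3649, 4146]
print(percentile_by_rule(weights, 10))     # 2670.0
print(percentile_by_rule(weights, 90))     # 3629.0
```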
103. Generally we apply this formula:
1. Pt = [(tn/100)th + (tn/100 + 1)th observation]/2, if n is even
2. Pt = ((tn/100 + 1)/2)th observation, if n is odd
For grouped data use the following formula:
P = PL + ((p·n - fc)/f) × i
To find the class, use p(n) value or
Where
m represents the percentile we're finding,
N is the total number of observations in the data set.
104. Variance (σ², s²)
The variance
Is the average of the squares of the deviations taken
from the mean
A good measure of dispersion should make use of all
the data.
The variance achieves this by averaging the sum of
the squares of the deviations from the mean.
The sample variance of the set x1, x2, ..., xn of n
observations with mean x̄ is
$$ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
Degrees of freedom
n-1 is used because if we know n-1 of the deviations, the nth deviation is known
Deviations have to sum to zero
105. It is squared because the sum of the deviations of the
individual observations of a sample about the sample
mean is always zero
Degrees of freedom
In computing the variance there are (n-1) degrees of
freedom because only (n-1) of the deviations are
independent from each other
This is because the sum of the deviations from their
mean (Xi-Mean) must add to zero.
The last one can always be calculated from the
others automatically (It is not free to vary).
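A short derivation may make the degrees-of-freedom argument concrete. It uses only the definition of the sample mean (the sum of the observations equals n times the mean), so the last deviation is fixed once the other n-1 are known:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
\quad\Longrightarrow\quad
x_n - \bar{x} = -\sum_{i=1}^{n-1} (x_i - \bar{x})
```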
106. Example
Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
Find the sample variance of the data.
Mean = 56
S² = [(43 - 56)² + (66 - 56)² + … + (50 - 56)²]/(10 - 1) =
810/9 = 90
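The worked example can be checked with a short Python sketch. The function name `sample_variance` is hypothetical, and the standard-library `statistics.variance` is used only as a cross-check, since it also divides by n-1.

```python
from statistics import variance


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(sum(data) / len(data))          # mean = 56.0
print(sum(x - 56 for x in data))      # deviations sum to 0
print(sample_variance(data))          # 810 / 9 = 90.0
print(variance(data))                 # library cross-check, also 90
```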
Variance for grouped data
$$ S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1} $$
Where
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals
107. Ex. Compute the variance of the age of 169 subjects
from the grouped data.
Class interval   mi     fi    (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
10-19            14.5    4     -19.88      395.21        1580.86
20-29            24.5   66      -9.88       97.61        6442.55
30-39            34.5   47       0.12        0.01           0.68
40-49            44.5   36      10.12      102.41        3686.92
50-59            54.5   12      20.12      404.81        4857.77
60-69            64.5    4      30.12      907.21        3628.86
Total                  169                              20197.63
(Products & total are computed before rounding.)
Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 - 1) = 120.22
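The grouped-data calculation can be sketched in Python from the class mid-points and frequencies alone. The function name `grouped_mean_variance` is hypothetical. Carrying the unrounded mean through the calculation gives roughly 34.38 years for the mean and 120.22 for the variance.

```python
def grouped_mean_variance(midpoints, freqs):
    """Mean and sample variance of grouped data, using class mid-points."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    ss = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return mean, ss / (n - 1)


midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]   # class mid-points mi
freqs = [4, 66, 47, 36, 12, 4]                     # class frequencies fi

mean_age, var_age = grouped_mean_variance(midpoints, freqs)
print(round(mean_age, 2))   # about 34.38
print(round(var_age, 2))    # about 120.22
```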
108. Properties of Variance
1. The main disadvantage of the variance is that its unit is
the square of the unit of the original measurement
values
2. The variance gives more weight to extreme
values than to those near the mean
value, because the differences are squared.
3. The drawbacks of the variance are overcome by the
standard deviation.
109. Standard deviation (σ, s)
Standard deviation (σ, s)
It is the square root of the variance.
This produces a measure having the same scale as
that of the individual values.
It shows variation about the mean
$$ \sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2} $$
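As a small illustration, the standard deviation of the ten observations from the variance example (sample variance 90) can be computed in Python. The variable names are illustrative. `statistics.stdev` divides by n-1, matching the sample formula.

```python
import math
from statistics import stdev, variance

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

s2 = variance(data)          # sample variance, in squared units
s = math.sqrt(s2)            # standard deviation, in the original units
print(s2, round(s, 2))       # 90 9.49
print(round(stdev(data), 2)) # same value from the library helper
```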
111. Properties of SD
1. Has the advantage of being expressed in the same
units of measurement as the mean
2. The best measure of dispersion & is used widely
because of the properties of the theoretical normal
curve.
3. However, if the units of measurement of the variables in
two data sets are not the same, then their variability
can't be compared by comparing the values of SD.
112. Standard deviation (σ, s).....
[Figure: two distributions compared. A wide spread results in a higher SD; a narrow spread results in a lower SD.]
113. Coefficient of variation (CV)
Coefficient of variation (CV)
When two data sets have different units of
measurements, or their means differ sufficiently in
size, the CV should be used as a measure of
dispersion.
It is the best measure to compare the variability of
two sets of observations.
A data set with a smaller coefficient of variation is
considered more consistent.
CV is the ratio of the SD to the mean multiplied by
100.
114. Coefficient of variation (CV).....
$$ CV = \frac{S}{\bar{x}} \times 100 $$
              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0
"Cholesterol is more variable than systolic blood pressure"
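The comparison of SBP and cholesterol can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is hypothetical, and rejecting a zero mean is an assumption, since the ratio is undefined there.

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: the SD expressed relative to the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_sbp = coefficient_of_variation(15, 130)     # SD and mean of SBP
cv_chol = coefficient_of_variation(40, 200)    # SD and mean of cholesterol
print(round(cv_sbp, 1), round(cv_chol, 1))     # 11.5 20.0
print(cv_chol > cv_sbp)                        # True: cholesterol is relatively more variable
```

Because the units cancel in the ratio, the two CVs can be compared directly even though the SDs are in different units.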
115. Distributions
Distributions used in statistical analysis:
1. Discrete random variables:
1) Binomial,
2) Poisson &
3) Hypergeometric distributions.
E.g. The analysis of discrete random variables,
such as the position of a nucleotide on a given
sequence may use techniques based on a binomial
distribution & not techniques that assume a
normal distribution.
2. Continuous random variables:
1) Normal distribution,
2) Z distribution.
116. Normal distribution
Normal distribution
It is symmetric about its mean/one half of the curve
is the mirror image of the other half
The mean, median, & mode are equal & are located at
the same position (the centre of the curve)
The highest point is at its mean
The height of the curve decreases as one moves away
from the mean in either direction, approaching, but
never reaching zero
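These properties can be checked numerically with the standard normal density formula, which is not given on the slides and is brought in here only for illustration. The function name `normal_pdf` and the default mean 0 and SD 1 are assumptions.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve with mean mu and SD sigma at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Symmetry: equal heights at equal distances on either side of the mean.
print(math.isclose(normal_pdf(-1.5), normal_pdf(1.5)))            # True
# Highest point at the mean; height falls as we move away from it.
print(normal_pdf(0.0) > normal_pdf(1.0) > normal_pdf(2.0))        # True
# The curve approaches zero but stays positive far from the mean.
print(0 < normal_pdf(10.0) < 1e-20)                               # True
```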
117. Normal distribution…..
[Figure: normal curve centred on the mean, with three annotations: the distribution is symmetric about its mean; the highest point of the curve is at the mean; the height of the curve decreases as one moves away from the mean in either direction, approaching but never reaching zero.]
118. Skewed distributions
Skewed distributions
The data are not distributed symmetrically in
skewed distributions
The mean, median, & mode are not equal & are in
different positions
Scores are clustered at one end of the distribution
A small number of extreme values are located in the
limits of the opposite end
Skew is always toward the direction of the longer tail
119. Skewed distributions….
A. Negatively skewed distribution (skewed to the left)
Occurs when the majority of scores are at the right end
of the curve & a few extreme small scores are scattered at
the left end
B. Positively skewed distribution (skewed to the right)
Occurs when the majority of scores are at the left
end of the curve & a few extreme large scores are
scattered at the right end.
120. [Figure: three curves showing the relative positions of the mean, median & mode]
(a) Symmetric distribution: Mean = Median = Mode
(b) Distribution skewed to the right: Mean > Median > Mode
(c) Distribution skewed to the left: Mean < Median < Mode
121. Which measures to use?
1. When the distribution is symmetric & uni-modal,
summarize the data using means & standard deviations.
2. When the data are skewed, it is preferable to use the
median & quartiles as summary statistics.
3. Median & quartiles are not easily influenced by extreme
values in a skewed distribution unlike means & standard
deviations.
A. Symmetric & uni-modal distribution:
Mean, median, & mode should all be approximately the
same
B. Skewed to the right (positively skewed):
The mean is sensitive to extreme values, so the median might be
more appropriate
C. Skewed to the left (negatively skewed):
The mean is sensitive to extreme values, so the median might be
more appropriate
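As an illustration of this guidance, the patient ages from the earlier inter-quartile range example (18, 59, 24, 42, 21, 23, 24, 32) have a longer right tail. A minimal Python sketch, with illustrative variable names, shows the mean being pulled towards the large values while the median stays with the bulk of the data:

```python
from statistics import mean, median, mode

ages = [18, 59, 24, 42, 21, 23, 24, 32]   # ages from the IQR example

print(mean(ages))     # 30.375, pulled up by 42 and 59
print(median(ages))   # 24.0, the middle of the ordered data
print(mode(ages))     # 24, the most frequent value

# Mean > median suggests a right (positive) skew, so the median and
# quartiles would be the more robust summary for these data.
print(mean(ages) > median(ages))          # True
```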