2. Learning Objectives
• Define biostatistics, data & variable
• Differentiate between quantitative variable & qualitative
variable.
• Know different types of quantitative & qualitative variables
• Differentiate between independent variables & dependent
variables.
• Differentiate between raw data & treated data.
• Differentiate between primary data & secondary data.
• Differentiate between grouped data & ungrouped data
• Describe various sources of data.
• List the different uses of health data
3. BIOSTATISTICS
It is a science of
Collecting,
Organizing,
Summarizing,
Analyzing,
Interpreting,
Presenting and
Disseminating DATA
Pertaining to the Medical Science.
4. DATA
Any piece of information about a
characteristic that can be measured, judged,
assessed or observed.
Data comes from two sources:
Measurements
Counting.
5. VARIABLE
A characteristic of a person, object or phenomenon that can be
measured, assessed, judged or observed and that takes on different
values in different persons, places or things is called a
VARIABLE.
EXAMPLES:
Age
Height
Weight
Blood Pressure
Gender
Income
Family size
7. QUANTITATIVE VARIABLE
A quantitative variable is one that can be measured
in the usual sense.
Examples:
Heights of Adult males
Weights of School children
Ages of 4th year Students.
Measurements made on quantitative variables
convey information regarding amount / quantity. It
is also called a “Numerical Variable” as the values
of the variable are based on numbers.
8. QUALITATIVE VARIABLE
Some characteristics cannot be measured in the sense that
height, weight and age are measured; they can only be
categorized. However, we can count the number of things,
persons or places belonging to the various categories, so
the counts or frequencies are the numbers that we
manipulate during analysis of a qualitative variable.
Measurements made on qualitative variables convey
information regarding an attribute of a thing, person or place.
Examples:
Male / Female
Rural / Urban
Diseased / Not Diseased
9. TYPES OF QUANTITATIVE VARIABLE
DISCRETE VARIABLE:
A quantitative or numerical variable that can assume only whole-number
(integer) values and not fractions or decimals; that is, there is a gap
between any two consecutive values.
Example
05 Pencils
04 Pens
07 Children
CONTINUOUS VARIABLE:
A quantitative or numerical variable that can assume any possible
value between any two values. It can be a fraction or a decimal.
Examples
Height 5.5 feet
Weight 60.4 Kg
10. TYPES OF QUALITATIVE VARIABLE
NOMINAL VARIABLE :
A qualitative variable that only categorizes; no order can be
assigned to the categories.
Example:
Blood group
Gender
ORDINAL VARIABLE:
A qualitative variable that incorporates an ordered position or
ranking in categories.
Example:
levels of satisfaction
11. INDEPENDENT VARIABLE VS DEPENDENT VARIABLE
INDEPENDENT VARIABLE:
The characteristic that is thought to influence another event
or characteristic, or the characteristic that comes first.
Example:
SMOKING is an independent variable.
DEPENDENT VARIABLE:
The outcome characteristic that results from the influence of an
independent variable is called the “Dependent” variable.
Example:
Lung Cancer
12. QUALITATIVE DATA VS QUANTITATIVE DATA
QUALITATIVE DATA
This is non-numerical data, based on the values of qualitative variables.
It is of 02 types:
Nominal data: based on the values of a nominal variable.
Ordinal / ranked data: based on the values of an ordinal variable.
QUANTITATIVE DATA:
Also called NUMERICAL data. It is based on numbers and
consists of the values of quantitative variables.
It is of 02 types:
Discrete data: no. of children in a family, pulse rate.
Continuous data: weights of class students, heights of class students,
blood pressure of class students.
13. RAW DATA VS TREATED DATA
Data as initially collected (first-hand data), containing a lot
of unnecessary, irrelevant and unwanted
information because no data cleaning
or statistical treatment has been applied, is called raw data.
When raw data is cleaned by removing unwanted,
irrelevant, unnecessary information, or is given some
statistical shape, it is called treated data.
14. PRIMARY DATA VS SECONDARY DATA
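Primary data is collected first hand by the investigator; at the
time of collection it is raw data.
Secondary data has already been collected, and usually treated, by
someone else (e.g. hospital records, census reports).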
15. GROUPED DATA VS UNGROUPED DATA
Data that is presented or observed individually
is called ungrouped data.
Example: No. of children in 10 families.
2, 6, 6, 4, 4, 3, 4, 7, 5, 4
Data in which identical or similar values are grouped together
by frequency, for better & quicker understanding, is called grouped data.
Example:
Families having 2-4 children: 6
Families having 5-7 children: 4
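The grouping in this example can be reproduced with a short sketch. It assumes the two class intervals used on the slide (2-4 and 5-7, both limits inclusive); the variable names are illustrative only.

```python
from collections import Counter

# Ungrouped data from the slide: number of children in 10 families.
children = [2, 6, 6, 4, 4, 3, 4, 7, 5, 4]

# Class intervals as (lower, upper); both limits are inclusive.
classes = [(2, 4), (5, 7)]

grouped = Counter()
for value in children:
    for lower, upper in classes:
        if lower <= value <= upper:
            grouped[(lower, upper)] += 1
            break

for (lower, upper), freq in sorted(grouped.items()):
    print(f"Families having {lower}-{upper} children: {freq}")
```

The printed frequencies (6 and 4) match the grouped example above.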
17. Uses of Health Data
• To uncover problems and their magnitude
• To know the reasons / root causes of problems
• Priority setting
• Planning and implementation
• Monitoring and surveillance
• Output assessment
• Assessing client satisfaction
• Impact assessment
• For research purposes
• For local / national and international comparisons
• Used by politicians, administrators, social scientists
etc.
• Data can provide feedback
19. Descriptive statistics:
Are used to collect, organize, summarize and present data. They
are simply a way to describe the data.
Inferential statistics:
Involve making inferences that go beyond the actual data.
Inferential statistics are techniques that allow us to use
sample results to make generalizations about the populations
from which the samples were drawn.
Descriptive statistics, by contrast, do not allow us to draw
conclusions beyond the data.
Descriptive: mean, median, standard deviation.
Inferential: tests of significance, construction of confidence
intervals.
20. POPULATION
A collection or set of individuals or objects or events
whose properties are to be analyzed.
A population is the universe (whole) about which an
investigator wishes to draw conclusions.
It need not consist of people but may be a population of
measurements.
Example: If we want to draw conclusions about the blood
pressure of class students, the population will be the blood
pressure measurements, not the class students. The size of the
population is denoted by capital “N”.
SAMPLE
A subset of the population. Its size is denoted by small “n”.
21. ELEMENT:
A single observation is called element. Denoted
by “X”
PARAMETER:
A numerical value summarizing all the data of
an entire population.
STATISTIC:
A numerical value summarizing the sample data.
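Different symbols are used for the two, e.g. µ and σ for the population
mean and standard deviation, X̅ and s for the sample mean and standard deviation.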
22. PRESENTATION OF DATA
• Text presentation
• Tabular presentation
• Graphic & diagrammatic presentation
24. TABULAR PRESENTATION
A TABLE is a systematic arrangement of data into vertical
columns and horizontal rows. The process of arranging
data into rows and columns is called TABULATION.
General principles for designing tables
Tables should be numbered, e.g. Table 1, Table 2.
A title must be given to each table, brief & explanatory.
Headings of the columns or rows should be clear &
concise.
No table should be too large.
Data must be presented according to size or importance,
chronologically, alphabetically or geographically.
Footnotes may be given where necessary.
25. TYPES OF TABLES
Simple table:
Measurements of a single set of data are presented.
Complex table:
Measurements of multiple sets of data are presented.
Frequency distribution table:
In the frequency distribution table, the data is first split up
into convenient groups called “class intervals” and the no.
of items (frequency) that occur in each group is shown
in adjacent columns.
Hence it is a table showing the frequency with which the
values are distributed in different groups or classes with
some defined characteristic.
26. Title: Number of cases of various diseases in Teaching Hospital in 2015
28. STEPS OF CONSTRUCTION OF FREQUENCY TABLE
Arrange the data in ascending order.
Locate the highest and lowest values.
Calculate the “RANGE” by subtracting the lowest value from the
highest value.
Determine the no. of classes (rule of thumb: 5-15).
Calculate the width of the class interval by dividing the range by the no. of
classes.
Construct the classes and make a table with rows and columns.
Tally the data.
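The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming integer data and class limits that are both inclusive; the function name `frequency_table` is hypothetical, and the width is taken as (range + 1) / classes, rounded up, so that the last class still contains the highest value.

```python
import math

def frequency_table(data, n_classes):
    """Return a list of (lower limit, upper limit, frequency) rows."""
    ordered = sorted(data)                      # arrange in ascending order
    lowest, highest = ordered[0], ordered[-1]   # locate lowest and highest values
    data_range = highest - lowest               # range = highest - lowest
    # Class width, rounded up (assumption: integer data, inclusive limits).
    width = math.ceil((data_range + 1) / n_classes)
    table = []
    for i in range(n_classes):
        lower = lowest + i * width
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in ordered)   # tally the data
        table.append((lower, upper, freq))
    return table

# Number of children in 10 families (the ungrouped example from slide 15).
for lower, upper, freq in frequency_table([2, 6, 6, 4, 4, 3, 4, 7, 5, 4], n_classes=2):
    print(f"{lower}-{upper}: {freq}")
```

Two classes are used here only to reproduce the small grouped example from the lecture; for a real data set the 5-15 rule of thumb applies.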
30. CHARTS/ GRAPHS & DIAGRAMS
They have a powerful impact on the imagination
of people.
They give information at a glance.
Diagrams are better retained in memory than tables.
It should be remembered that a lot of the detail and
accuracy of the original data is lost in charts and
diagrams; for real study, we have to
go back to the original data.
31. BAR CHARTS
• Better retained in the memory
• Have powerful impact
• Used as a tool for comparing mutually exclusive
discrete data.
Type of bar charts
• Simple bar chart
• Multiple bar chart
• Component bar chart
32. SIMPLE BAR CHART
The most useful, widely used, popular and easy way of
expressing statistical data of nominal or ordinal
variables.
Each bar represents one attribute / variable.
The width of the bars and the gaps between the bars
should be equal throughout.
The length of each bar is proportional to the magnitude /
frequency of the variable.
The bars may be vertical or horizontal.
36. MULTIPLE BAR CHARTS
Also called compound bar chart.
Two or more bars pertaining to the same category can
be grouped together, for better and easier understanding.
COMPONENT BAR CHART
The simple bars are divided into 2 or more components /
parts.
Each part represents a certain variable and is proportional to the
magnitude of that variable.
Also called stacked bar chart.
42. PIE CHART
A popular and powerful way of expressing qualitative
data.
The value of each category is divided by the total of all values
and then multiplied by 360; each category is then
allocated the resulting angle (in degrees) to represent the proportion it
has.
It is often necessary to show percentages in the
segments to make it easy to compare the areas of the segments.
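The angle calculation described above is a one-line formula: angle = value / total × 360. A minimal sketch follows; the blood-group counts are hypothetical and serve only to show the arithmetic.

```python
def pie_angles(counts):
    """Return the angle in degrees allocated to each category."""
    total = sum(counts.values())
    return {category: value / total * 360 for category, value in counts.items()}

# Hypothetical counts, for illustration only.
blood_groups = {"A": 30, "B": 40, "AB": 10, "O": 20}
angles = pie_angles(blood_groups)

for group, angle in angles.items():
    share = blood_groups[group] / sum(blood_groups.values()) * 100
    print(f"{group}: {angle:.0f} degrees ({share:.0f} %)")
```

The angles always add up to 360 degrees, which is a quick check on the calculation.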
44. PICTOGRAM
A popular method of presenting data to those
who cannot understand orthodox charts.
Small pictures or symbols are used to present
the data, e.g. a picture of a doctor to represent
the population per physician.
48. HISTOGRAM
• Used for a quantitative, continuous variable.
• Consists of a series of adjacent bars; the height of
each bar indicates the frequency.
• The class intervals are given along the horizontal axis
and the frequencies along the vertical axis.
• A gap or space between bars occurs only if a class
interval has zero frequency.
52. FREQUENCY POLYGON
• A frequency polygon is an area diagram of a
frequency distribution, drawn over a histogram.
• It is obtained by joining the midpoints of the tops of the
histogram blocks (one per class interval).
• It is constructed by plotting the midpoint of each
class interval against the corresponding
frequency and then connecting these points
with straight lines.
56. CUMULATIVE FREQUENCY POLYGON
Also called “Ogive”.
Indicates the cumulative frequency of a data set. Here the
frequency shown for each category is the sum
of the frequencies from that category and all preceding
categories.
Cumulative frequencies are plotted against the
upper class limits of the variable.
These points are joined by a smooth freehand curve to
get a cumulative frequency polygon or Ogive.
It is useful for comparing 02 sets of data.
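Cumulative frequencies are simply running totals of the class frequencies. The sketch below assumes the ten family sizes from the ungrouped-data example (slide 15) regrouped into three classes, which is an illustrative choice and not a table from the lecture.

```python
from itertools import accumulate

# Family sizes 2, 6, 6, 4, 4, 3, 4, 7, 5, 4 regrouped into three classes.
classes = ["2-3", "4-5", "6-7"]
frequencies = [2, 5, 3]

# Each cumulative frequency is the sum of that class and all preceding classes.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{label}: frequency {freq}, cumulative frequency {cum}")
```

The last cumulative frequency equals the total number of observations (10), and these are the values that would be plotted to draw the Ogive.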
61. LINE DIAGRAM
Used to show the trend of events with the
passage of time.
Future predictions can be presented efficiently
in this type of presentation.
64. SCATTER DIAGRAM
It represents the relationship between 02
numerical variables measured on the same subject.
The independent variable is on the X-axis and the dependent variable on
the Y-axis.
For each subject a pair of readings is taken, one
for each variable.
A dot is placed where the two readings intersect.
A line is drawn through the dots.
68. Quantitative indices that describe the center
of a distribution are called measures of central tendency.
Common measures of central tendency are:
•Mean
•Median
•Mode
69. MEAN
Most commonly used measure.
Also called “Arithmetic mean”.
How to calculate?
By adding all the values in the population or sample and dividing
by the total no. of values that are added.
Formulas are
$\mu = \frac{\sum X}{N}$ (population) or $\bar{X} = \frac{\sum X}{n}$ (sample)
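As a quick illustration of the calculation, the sketch below uses the marks of eight students that appear later in this lecture (slide 85); the variable names are illustrative.

```python
# Marks of eight students (the data set used on slide 85).
marks = [2, 4, 4, 4, 5, 5, 7, 9]

# Add all the values and divide by the number of values.
mean = sum(marks) / len(marks)
print(mean)  # 5.0
```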
71. Advantages of Arithmetic Mean
Easy to calculate
Easy to understand
Based on all the values
There is only a single mean in the data
Used to calculate mean deviation, variance,
coefficient of variation & standard deviation
Used in further statistical tests.
Disadvantages of Mean
It is distorted by extreme values.
Sometimes it looks ridiculous (e.g. a mean of 2.4 children per family).
72. Median
Median is the value that divides the data into 02 halves, i.e. the
central value of the data when the data is arranged in ascending order.
How to calculate:
For an ODD number of observations (it is the middle value)
Median = $\left(\frac{N+1}{2}\right)$th observation
For an EVEN number of observations (it is the average of the 02 middle values)
Median = $\frac{\left(\frac{N}{2}\right)\text{th observation} + \left(\frac{N}{2}+1\right)\text{th observation}}{2}$
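The odd and even cases can be written as one small function. This is a sketch with an illustrative function name; Python's standard library also provides `statistics.median`, which gives the same result.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)          # data must be arranged in order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[middle]
    # even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([2, 4, 4, 5, 9]))            # 4
print(median([2, 4, 4, 4, 5, 5, 7, 9]))   # 4.5
```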
75. Advantages of Median:
Easy to calculate
Easy to understand
Single median in the data set
Not affected by extreme values, as it
depends only on the middle values
In skewed data it proves to be a better measure
Disadvantages of median
It is not based on all values
Cannot be used for further mathematical calculation
It necessitates arrangement of data in ascending or descending order,
which is a tedious job.
76. MODE
Is the most frequently occurring value in the data
series, i.e. the most repeated value in the data series.
Data may be
Non-modal (when no mode)
Uni-modal (when single mode)
Bimodal (when two modes)
Multimodal (when more than two modes)
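A short sketch for finding the mode(s) follows. It assumes that data in which every value occurs only once is treated as non-modal; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value occurs once: non-modal
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 4, 4, 4, 5, 5, 7, 9]))   # [4]      uni-modal
print(modes([1, 1, 2, 2, 3]))            # [1, 2]   bimodal
print(modes([1, 2, 3]))                  # []       non-modal
print(modes(["A", "O", "O", "B"]))       # ['O']    works for qualitative data too
```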
77. Advantages of Mode
Easy to calculate
Not affected by extreme value
Only measure of central tendency that can be used for
qualitative as well as quantitative data.
Disadvantages of Mode
It is not based on all values
No further mathematical calculation can be carried out
No analytical concepts are based on mode
It may not exist or may not be unique (data can be non-modal, bimodal or multimodal).
80. Quantitative indices that describe the spread of
a data set are called measures of dispersion.
Common measures are:
• RANGE
• MEAN DEVIATION
• VARIANCE
• STANDARD DEVIATION
81. RANGE
Range is the difference between the
highest and the lowest values, in a
given data set.
83. Advantages:
Easy to compute
Easy to understand
Simplest measure of dispersion
It is used to calculate the “class interval”.
Disadvantages:
It has no role in inferential statistics
Poor measure of dispersion; it provides no
knowledge
about the spread of values within the data
series.
It is based on the 02 extreme values and ignores
all other observations
84. Mean Deviation
It is the average of the absolute deviations from the
arithmetic mean, i.e. the
mean of the absolute deviations of all
observations from the arithmetic mean.
How to calculate:
Mean deviation (M.D) = $\frac{\sum |X - \bar{X}|}{n}$
85. The marks of eight students are
2, 4, 4, 4, 5, 5, 7, 9
These eight data points have a mean (average) of 5.
First, calculate the deviation of each data point from
the mean, and take the absolute values:
|2-5| = 3    |5-5| = 0
|4-5| = 1    |5-5| = 0
|4-5| = 1    |7-5| = 2
|4-5| = 1    |9-5| = 4
How to calculate:
Mean deviation (M.D) = $\frac{\sum |X - \bar{X}|}{n}$
Mean deviation = 12/8 = 1.5
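The same worked example can be checked with a few lines of code; the variable names are illustrative.

```python
marks = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(marks) / len(marks)                      # 5.0
absolute_deviations = [abs(x - mean) for x in marks]
mean_deviation = sum(absolute_deviations) / len(marks)

print(sum(absolute_deviations))   # 12.0
print(mean_deviation)             # 1.5
```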
86. STANDARD DEVIATION
Most frequently used measure of dispersion
Most useful measure of dispersion
How to calculate:
Step 1: Calculate the mean.
Step 2: Find the difference of each observation from the mean.
Step 3: Square each difference.
Step 4: Add the squared values to get the sum of squares.
Step 5: Divide this sum by the total number of observations to get the
mean squared deviation, called the VARIANCE.
Step 6: Find the square root of the variance to get the root mean
squared deviation, i.e. the STANDARD DEVIATION.
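88. The marks of eight students are
2, 4, 4, 4, 5, 5, 7, 9
These eight data points have a mean (average) of 5.
First, calculate the deviation of each data point from
the mean, and square the result of each:
(2-5)² = 9    (5-5)² = 0
(4-5)² = 1    (5-5)² = 0
(4-5)² = 1    (7-5)² = 4
(4-5)² = 1    (9-5)² = 16
The variance is the mean of these values:
Variance = (9+1+1+1+0+0+4+16)/8 = 32/8 = 4
The standard deviation is the square root of the variance: S.D = √4 = 2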
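The six steps listed for the standard deviation can be followed one by one in code. This sketch divides by the number of observations, as the lecture does; note that for sample data many textbooks and software packages divide by n − 1 instead.

```python
import math

marks = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(marks)

mean = sum(marks) / n                                 # step 1: mean
squared_devs = [(x - mean) ** 2 for x in marks]       # steps 2-3: squared differences
sum_of_squares = sum(squared_devs)                    # step 4: sum of squares
variance = sum_of_squares / n                         # step 5: variance
sd = math.sqrt(variance)                              # step 6: standard deviation

print(sum_of_squares, variance, sd)   # 32.0 4.0 2.0
```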
90. STANDARD DEVIATION
Advantages:
Easy to calculate
Easy to understand
Easy to interpret
It uses every observation
It is used to calculate the coefficient of variation
Mathematically manageable
Disadvantages:
Sensitive to outliers (extreme values)
Inappropriate for skewed data
91. COEFFICIENT OF VARIATION
It is the RATIO of the standard deviation of a data
series to the arithmetic mean of the series, expressed
as a percentage.
Formula
$CV = \frac{SD}{\text{Mean}} \times 100$
For the marks example (SD = 2, mean = 5): $CV = \frac{2}{5} \times 100 = 40\%$
It compares the relative spread of the distributions of
different series, irrespective of the units used.
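A minimal sketch of the formula, using the standard deviation (2) and mean (5) from the slide; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean

print(coefficient_of_variation(2, 5))   # 40.0
```

Because the result is a percentage with no units, series measured in different units (e.g. kg and cm) can be compared directly.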
92. MEASURES OF LOCATION / POSITION
Descriptive measures that locate the relative position of an
observation in relation to the other observations; hence they are called
measures of relative standing. These are
Quartile
Decile
Percentile
Quartiles divide the data into 4 equal parts
Deciles divide the data into 10 equal parts
Percentiles divide the data into 100 equal parts
1st Quartile, i.e. Q1 = 25th percentile
2nd Quartile, i.e. Q2 = 50th percentile
3rd Quartile, i.e. Q3 = 75th percentile
An observation at the 9th percentile means that 9% of the observations are equal to or less than that
observation and (100-9) = 91% of the observations are greater than it.
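The definition above ("equal to or less than") translates directly into code. The sketch below uses the marks from slide 85; `percentile_rank` is an illustrative helper, while `statistics.quantiles` is part of the Python standard library (3.8+). Different software uses slightly different interpolation rules for Q1 and Q3, so only Q2 is commented on here.

```python
import statistics

marks = [2, 4, 4, 4, 5, 5, 7, 9]

def percentile_rank(values, x):
    """Percentage of observations that are equal to or less than x."""
    at_or_below = sum(v <= x for v in values)
    return 100 * at_or_below / len(values)

# Three cut points that divide the data into 4 equal parts.
q1, q2, q3 = statistics.quantiles(marks, n=4)

print(percentile_rank(marks, 5))   # 75.0
print(q2)                          # 4.5, the same as the median
```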
94. Question No = 05
• An eight-year-old female child presented to the emergency
department with complaints of severe pain
in the abdomen and 04 episodes of vomiting over the last six
hours. On examination the weight of the girl was
18 kg, height 116 cm, fair skin, blue iris, brown hair
and temperature 102 °F. Categorize the following
variables of the child into nominal, ordinal, discrete &
continuous.
Age, Sex, Weight, Height, Temperature, Severe pain,
Fair skin, Iris colour, Vomiting episodes, Hair
colour.
95. Question No = 06
• A researcher recorded the duration of night
sleep among medical college students before an
examination.
(a): Categorize the data.
(b): What choices are available to the researcher
to present this type of data?
96. NORMAL DISTRIBUTION CURVE
What is a Distribution Curve:
A graphic presentation of the distribution of a set of
data is called a distribution curve.
Frequency curves or frequency polygons may
take many different shapes, but many naturally
occurring phenomena or characteristics are
distributed according to a distribution called the
“Normal Distribution” or “Gaussian
Distribution”.
97. Characteristics of normal distribution curve
It is a continuous distribution.
It is a bell shaped curve.
It is symmetrical
It is unimodal
Mean, median and mode all lie at the center and are
equal.
As it is a probability distribution, its area is taken
as 1 (100 %).
The normal distribution is determined by 02 parameters: the mean (µ)
and the standard deviation (σ).
98. TYPES OF DISTRIBUTION
ON THE BASIS OF SYMMETRY
SYMMETRICAL
NON-SYMMETRICAL OR SKEWED
Positively Skewed
Negatively Skewed
99. Different Shapes of Distributions
Source: http://faculty.vassar.edu/lowry/f0204.gif
100. Types of distribution
On the basis of peakedness
(Kurtosis)
PLATYKURTIC (values spread throughout the distribution)
MESOKURTIC (balanced normal distribution)
LEPTOKURTIC (piling up of values in the centre of the distribution)
103. STANDARD NORMAL DISTRIBUTION
OR STANDARD NORMAL CURVE, ALSO CALLED “Z”
DISTRIBUTION
The characteristics of the normal distribution imply that the normal
distribution is a family of distributions in which one member is
distinguished from another on the basis of the values of the mean &
standard deviation.
The most important member of this family is the STANDARD NORMAL
DISTRIBUTION,
which has mean zero and standard
deviation 1.
104. When the mean of a Gaussian distribution is not 0 and the S.D is
not 1, a simple transformation called the Z transformation
must be made so that we can use the standard normal table.
The Z transformation expresses the deviation from the mean in
standard deviation units. That is, any normal distribution can be
transformed to the standard normal distribution by using the
following steps.
Move the distribution up or down the number line so that the
mean is 0. This step is accomplished by subtracting the mean
(µ) from the value of X.
Make the distribution either narrower or wider so that the standard
deviation is equal to 1. This step is accomplished by dividing
by σ.
To summarize: $Z = \frac{X - \mu}{\sigma}$
Called the Z score, normal deviate, standard score or critical ratio.
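The two steps of the Z transformation (subtract the mean, divide by the standard deviation) can be sketched as a single function. The example values reuse the marks data from earlier in the lecture (mean 5, standard deviation 2); the function name is illustrative.

```python
def z_score(x, mean, sd):
    """Deviation of x from the mean, expressed in standard deviation units."""
    return (x - mean) / sd

print(z_score(9, mean=5, sd=2))   # 2.0  (two S.D above the mean)
print(z_score(4, mean=5, sd=2))   # -0.5 (half an S.D below the mean)
```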
105. About 68 % of values (area) lie within mean ± 1 S.D
About 95 % of values (area) lie within mean ± 2 S.D
About 99.7 % of values (area) lie within mean ± 3 S.D
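These three percentages can be verified numerically. The sketch below uses the error function from Python's `math` module, relying on the standard identity that the area within k standard deviations of the mean of a normal curve is erf(k / √2); the function name is illustrative.

```python
import math

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} S.D: {area_within(k) * 100:.1f} %")
```

This prints 68.3 %, 95.4 % and 99.7 %, which round to the figures quoted on the slide.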
114. SAMPLING DISTRIBUTION OF THE
SAMPLE MEANS
The distribution of individual observations is very different from the distribution of means,
which is called a “SAMPLING DISTRIBUTION”.
If we take a random sample from the population, and similar samples over and over again,
we will find that every sample has a different mean.
If we make a frequency distribution of all the sample means drawn from the same
population, we will find that the distribution of the means is nearly a normal distribution
and the mean of the sample means is practically the same as the population mean.
This is a very important observation: the sample means are distributed normally about
the population mean.
The standard deviation of the means is a measure of sampling variability. It is given by the
formula $SE = \frac{\sigma}{\sqrt{n}}$ and is called the Standard Error of the mean, or simply S.E.
Since the distribution of the means follows the pattern of a normal distribution, it is not
difficult to visualize that 95% of the sample means will lie within the limits of
2 SE on either side of the population mean, i.e.
M ± 2 SE or $M \pm \frac{2\sigma}{\sqrt{n}}$.
115. CENTRAL LIMIT THEOREM
The mean of the sampling distribution, or the mean of the
means, is equal to the population mean µ based on
individual observations.
The standard deviation of the sampling distribution of the mean
is equal to $\frac{\sigma}{\sqrt{n}}$. This quantity, called the SE of the mean, plays an important
role in many of the statistical procedures in inferential
statistics.
If the distribution in the population is normal, then the
sampling distribution of the means is also normal. More
IMPORTANTLY, for sufficiently large sample sizes, the
sampling distribution of the means is approximately
normally distributed, regardless of the shape of the original
distribution in the population.
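A small simulation can make these statements concrete. The sketch below is an illustration under stated assumptions, not part of the lecture: it draws repeated samples from a deliberately skewed (exponential) population with mean 1 and standard deviation 1, using a fixed random seed, and compares the mean and standard deviation of the sample means with µ and σ/√n.

```python
import random
import statistics

rng = random.Random(42)      # fixed seed so the run is repeatable
n = 50                       # size of each sample
n_samples = 2000             # number of repeated samples

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to 1
sd_of_means = statistics.pstdev(sample_means)    # should be close to the S.E
expected_se = 1.0 / n ** 0.5                     # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"S.D of sample means:  {sd_of_means:.3f}")
print(f"sigma / sqrt(n):      {expected_se:.3f}")
```

Even though the population is far from normal, the sample means cluster around the population mean with a spread close to σ/√n.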
119. 02 APPROACHES
Objective: to permit generalization from a sample
to the population from which it came.
1. CONSTRUCTION OF CONFIDENCE
INTERVAL:
Point Estimate
Interval Estimate
2. TEST OF SIGNIFICANCE
120. CONFIDENCE INTERVAL
An interval estimate with a specified level of
confidence. It defines an upper limit and a lower limit
with an associated probability. The ends of the
confidence interval are called “confidence limits”.
CONFIDENCE LEVEL
It is the probability / chance that a constructed
confidence interval actually contains the TRUE population
value. 02 confidence levels, 95% & 99%, are used and
taken as significant in medicine.
LEVEL OF SIGNIFICANCE
Probability of rejecting the null hypothesis when it is
true.
121. HOW TO CONSTRUCT CONFIDENCE INTERVAL
OBSERVED MEAN ± (CONFIDENCE COEFFICIENT)
× MEASURE OF VARIABILITY OF THE MEAN
$\bar{X} \pm Z \times SE_{\bar{X}}$
Example:
The mean weight of a sample of 100 children aged 3 years from a rural
village “A” of the Punjab was 12 kg, with a standard deviation of 3 kg.
Construct the 95% confidence interval.
What are the lower and upper confidence limits?
95% CI = $\bar{X} \pm 2 \times SE$
= $\bar{X} \pm 2 \times \frac{SD}{\sqrt{n}}$
= $12 \pm 2 \times \frac{3}{\sqrt{100}}$
= $12 \pm 2 \times \frac{3}{10}$
= $12 \pm 2 \times 0.3$
= $12 \pm 0.6$
= 11.4 to 12.6
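The worked example can be reproduced with a short function. This is a sketch using the lecture's rounded confidence coefficient of 2 for 95% (the more precise value is 1.96); the function name is illustrative.

```python
import math

def confidence_interval(mean, sd, n, z=2.0):
    """Return (lower limit, upper limit) as mean ± z × S.E."""
    se = sd / math.sqrt(n)       # standard error of the mean
    margin = z * se
    return mean - margin, mean + margin

# Mean weight 12 kg, S.D 3 kg, sample of 100 children.
lower, upper = confidence_interval(mean=12, sd=3, n=100)
print(f"{lower:.1f} to {upper:.1f}")   # 11.4 to 12.6
```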
122. TESTS OF SIGNIFICANCE
• STANDARD ERROR OF THE MEAN
• STANDARD ERROR OF THE DIFFERENCE
BETWEEN 02 MEANS
• STANDARD ERROR OF PROPORTION
• STANDARD ERROR OF THE DIFFERENCE BETWEEN
02 PROPORTIONS
• HYPOTHESIS TESTING (T-TEST, CHI-SQUARE)
123. HYPOTHESIS TESTING
STEPS
1. State the statistical hypotheses in the form of
Null Hypothesis Ho
Alternate Hypothesis Ha
2. Decide on an appropriate test statistic
3. Select the level of confidence
4. Determine the value the test statistic must attain to
be declared significant. This value divides the
acceptance & rejection zones.
5. Calculate the value of the test statistic
6. Draw & state the conclusion.
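The steps above can be illustrated with a one-sample z-test. This test is chosen here only because it needs nothing beyond the standard library; the lecture itself names the t-test and chi-square test. The sample figures (mean 12 kg, S.D 3 kg, n = 100) come from the confidence interval example, while the reference means of 13 kg and 12.5 kg are hypothetical.

```python
import math

def one_sample_z_test(sample_mean, sd, n, null_mean, alpha=0.05):
    """Two-sided z-test of Ho: population mean = null_mean."""
    se = sd / math.sqrt(n)                        # standard error of the mean
    z = (sample_mean - null_mean) / se            # step 5: test statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    reject_null = p_value < alpha                 # step 6: conclusion
    return z, p_value, reject_null

# Ho: mean weight = 13 kg (hypothetical reference value); Ha: mean weight ≠ 13 kg.
z, p, reject = one_sample_z_test(12, 3, 100, null_mean=13)
print(f"z = {z:.2f}, p = {p:.4f}, reject Ho: {reject}")

# Against a hypothetical reference of 12.5 kg the difference is not significant.
z2, p2, reject2 = one_sample_z_test(12, 3, 100, null_mean=12.5)
print(f"z = {z2:.2f}, p = {p2:.4f}, reject Ho: {reject2}")
```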
125. ERRORS IN HYPOTHESIS TESTING
ALPHA ERROR (α)
Probability of rejecting the null hypothesis when it is
true. Increasing the sample size will decrease α error.
1 − α = level of confidence
α error should not be more than 0.05 or 5%
BETA ERROR (β)
Probability of not rejecting the null hypothesis when it
is false
(chance of failing to detect a real effect).
1 − β = power of the study
β error should not be more than 10%
126. SAMPLING
What is sampling and why is it done?
TYPES
PROBABILITY SAMPLING
Simple Random Sampling
Systematic Random Sampling
Stratified Random Sampling
NON-PROBABILITY SAMPLING
Convenience Sampling
Purposive Sampling
Quota Sampling
Snowball Sampling
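The three probability sampling methods can be sketched as follows. This is an illustration under stated assumptions: the population is a numbered list of 100 hypothetical subjects, the strata are made up, and stratified sampling uses proportional allocation. All function names are illustrative.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Random start, then every k-th member."""
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random start within the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Simple random sample of the same fraction from each stratum."""
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(0)                         # fixed seed for repeatability
population = list(range(1, 101))               # hypothetical subject numbers 1-100
strata = {"rural": population[:60], "urban": population[60:]}   # hypothetical strata

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)
print(sorted(srs), systematic, sorted(stratified), sep="\n")
```

With a 10% sampling fraction the stratified sample contains 6 rural and 4 urban subjects, mirroring the 60:40 split in the hypothetical population.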