Biostatistics i

A. Thangamani Ramalingam
Biostatistics

THE CONCEPT OF MEASUREMENT AND SCALING (Meaning of scaling)
Measurement can be defined as a standardized process of assigning numbers or other symbols to certain characteristics of the objects of interest.
Measurement is “the assignment of numbers to observations [or responses] according to some set of rules”.
Researchers carry out the measurement process by assigning either numbers or labels.

CHARACTERISTICS OF SCALES
DESCRIPTION (FOR INSTANCE, “YES” OR “NO”, “AGREE” OR “DISAGREE”, AND THE NUMBER OF YEARS OF A RESPONDENT’S AGE)
ORDER (“1 IS LESS THAN 5”; “EXTREMELY SATISFIED” IS MORE INTENSE THAN “SOMEWHAT SATISFIED”; “MOST IMPORTANT” HAS GREATER IMPORTANCE THAN “ONLY SLIGHTLY IMPORTANT”)
DISTANCE (ABSOLUTE DIFFERENCES BETWEEN THE DESCRIPTORS ARE KNOWN AND MAY BE EXPRESSED IN UNITS)
ORIGIN (THERE IS A UNIQUE BEGINNING OR TRUE ZERO POINT FOR THE SCALE)
“EACH SCALING PROPERTY BUILDS ON THE PREVIOUS ONE”

Nominal, Ordinal, Interval, and Ratio Scales Provide Different Information

RELATIONSHIP BETWEEN SCALES AND SCALING PROPERTIES
            SCALING PROPERTIES
SCALE       DESCRIPTION   ORDER   DISTANCE   ORIGIN
NOMINAL     YES           NO      NO         NO
ORDINAL     YES           YES     NO         NO
INTERVAL    YES           YES     YES        NO
RATIO       YES           YES     YES        YES
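The table of scales and scaling properties can be written down as a small lookup, which also lets us check the statement that each property builds on the previous one. This is only an illustrative sketch; the names SCALE_PROPERTIES, PROPERTY_ORDER and builds_on_previous are invented for the example.

```python
# Scaling properties of each measurement scale, as in the table of
# scales and scaling properties. All names here are illustrative.
SCALE_PROPERTIES = {
    "nominal":  {"description": True, "order": False, "distance": False, "origin": False},
    "ordinal":  {"description": True, "order": True,  "distance": False, "origin": False},
    "interval": {"description": True, "order": True,  "distance": True,  "origin": False},
    "ratio":    {"description": True, "order": True,  "distance": True,  "origin": True},
}

# The order in which the properties build on each other.
PROPERTY_ORDER = ["description", "order", "distance", "origin"]


def builds_on_previous(scale):
    """Return True if no property is present while an earlier one is absent."""
    flags = [SCALE_PROPERTIES[scale][prop] for prop in PROPERTY_ORDER]
    # Once a property is missing, every later property must be missing too.
    return all(earlier or not later for earlier, later in zip(flags, flags[1:]))


for scale in SCALE_PROPERTIES:
    print(scale, builds_on_previous(scale))
```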

Methods of data collection
Primary data – data you collect:
 Surveys
 Focus groups
 Questionnaires
 Personal interviews
 Experiments and observational studies
Secondary data – data someone else has collected:
 County health departments
 Vital statistics – birth and death certificates
 Hospital, clinic, and school nurse records
 Private and foundation databases
 City and county governments
 Surveillance data from state government programs
 Federal agency statistics – Census, NIH, etc.

Data Processing operations
 Editing (field editing, central editing)
 Coding
 Classification (based on attributes or class intervals)
 Tabulation (simple or complex)

Data Processing Cycle
 Input: collected data is transformed into a form that a computer can understand.
 Verification (checking the collected data for errors)
 Coding (e.g. Male = 1, Female = 2)
 Storing
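The coding step can be sketched as a lookup from text answers to numeric codes. The Male = 1, Female = 2 codes follow the slide; the code 9 for blank or unrecognised answers and the names SEX_CODES, MISSING and code_sex are assumptions made for the example.

```python
# Code book following the slide's example: Male = 1, Female = 2.
SEX_CODES = {"male": 1, "female": 2}
MISSING = 9  # assumed code for blank or unrecognised answers


def code_sex(response):
    """Convert a free-text answer into its numeric code."""
    key = response.strip().lower()
    return SEX_CODES.get(key, MISSING)


responses = ["Male", "female ", "FEMALE", "", "male"]
coded = [code_sex(r) for r in responses]
print(coded)  # [1, 2, 2, 9, 1]
```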

Processing denotes the actual data manipulation techniques, such as classifying, sorting, calculating, summarizing, and comparing, that convert data into information.
i) Classification - The data is classified into different groups and subgroups, so that each group or subgroup can be handled separately.
ii) Sorting - The data is arranged in an order so that it can be accessed quickly when required.
iii) Calculations - Arithmetic operations are performed on the numeric data to get the required results.
iv) Summarizing - The data is processed to represent it in a summarized form.
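The four processing operations can be illustrated on a handful of records. The records below are made up for the example, and the field names (id, sex, age) are assumptions; this is a sketch, not a prescribed procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical respondent records, invented for illustration.
records = [
    {"id": 1, "sex": "F", "age": 34},
    {"id": 2, "sex": "M", "age": 51},
    {"id": 3, "sex": "F", "age": 29},
    {"id": 4, "sex": "M", "age": 45},
    {"id": 5, "sex": "F", "age": 62},
]

# i) Classification: put the records into groups by an attribute.
groups = defaultdict(list)
for rec in records:
    groups[rec["sex"]].append(rec)

# ii) Sorting: arrange the records in order of age.
by_age = sorted(records, key=lambda rec: rec["age"])

# iii) Calculation: arithmetic on a numeric field.
total_age = sum(rec["age"] for rec in records)

# iv) Summarizing: one summary line per group.
summary = {
    sex: {"n": len(recs), "mean_age": mean(r["age"] for r in recs)}
    for sex, recs in groups.items()
}
print(summary)
```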

Output - After the processing step is complete, output is generated. The main purpose of data processing is to get the required result. Mostly, the output is stored on storage media for later use.
i) Retrieval - Output stored on the storage media can be retrieved at any time.
ii) Conversion - The generated output can be converted into different forms. For example, it can be represented in graphical form.
iii) Communication - The generated output is sent to different places.

Problems in processing
 “Don’t know” responses
 Missing forms
 Internal consistency of the data, e.g. age and date of birth
 Validity checks, e.g. extreme values
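A consistency check (age against date of birth) and a validity check (extreme values) might look like the sketch below. The plausible age range of 0 to 110 years, the one-year tolerance, and the names age_on and check_record are all assumptions chosen for the example.

```python
from datetime import date


def age_on(born, on_date):
    """Completed years of age on a given date."""
    had_birthday = (on_date.month, on_date.day) >= (born.month, born.day)
    return on_date.year - born.year - (0 if had_birthday else 1)


def check_record(rec, interview_date, max_age=110):
    """Return a list of problems found in one record (empty if none)."""
    if rec.get("age") is None or rec.get("dob") is None:
        return ["missing value"]
    problems = []
    # Validity check: flag extreme values (assumed plausible range 0..max_age).
    if not 0 <= rec["age"] <= max_age:
        problems.append("age outside plausible range")
    # Consistency check: reported age should agree with date of birth
    # (a difference of one year is tolerated here, as an assumption).
    if abs(age_on(rec["dob"], interview_date) - rec["age"]) > 1:
        problems.append("age inconsistent with date of birth")
    return problems


interview = date(2020, 6, 1)
print(check_record({"age": 40, "dob": date(1980, 3, 15)}, interview))
print(check_record({"age": 25, "dob": date(1980, 3, 15)}, interview))
```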

Types of analysis
 Descriptive
 Inferential
 Univariate
 Bivariate
 Multivariate (regression, MANOVA, canonical, and discriminant analysis)
 Causal analysis
 Correlational analysis

 https://www.yourarticlelibrary.com/education/statistics/graphic-representation-of-data-meaning-principles-and-methods/64884

Common descriptive statistics
 Count (frequencies)
 Percentage
 Mean
 Mode
 Median
 Range
 Standard deviation
 Variance
 Ranking
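Most of these statistics are available in Python's standard library. The sketch below applies them to the twelve observations used in the "Mean" example of these slides; the ranking rule (1 = largest distinct value) is an assumption made for the example.

```python
from collections import Counter
from statistics import mean, median, mode, stdev, variance

# The twelve observations from the "Mean" example in these slides.
data = [3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12]

counts = Counter(data)  # count (frequencies)
percentages = {x: 100 * f / len(data) for x, f in counts.items()}

summary = {
    "n": len(data),
    "mean": mean(data),
    "mode": mode(data),
    "median": median(data),
    "range": max(data) - min(data),
    "variance": variance(data),  # sample variance (divides by n - 1)
    "std_dev": stdev(data),      # sample standard deviation
}

# Ranking of the distinct values, 1 = largest (an assumed convention).
ranking = {x: r for r, x in enumerate(sorted(counts, reverse=True), start=1)}

print(summary)
```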

Basic Concepts
 Population: the whole set of a “universe”
 Sample: a subset of a population
 Parameter: an unknown but fixed value of a population characteristic
 Statistic: a known or calculable value of a sample characteristic that represents the corresponding population value. E.g. μ = mean of population, x̄ = mean of sample

“Central Tendency”
Measure: Mean (sum of all values ÷ no. of values)
Advantages:
 Best known average
 Exactly calculable
 Makes use of all data
 Useful for statistical analysis
Disadvantages:
 Affected by extreme values
 Can be absurd for discrete data (e.g. family size = 4.5 persons)
 Cannot be obtained graphically
Measure: Median (middle value)
Advantages:
 Not influenced by extreme values
 Obtainable even if the data distribution is unknown (e.g. group/aggregate data)
 Unaffected by irregular class width
 Unaffected by open-ended classes
Disadvantages:
 Needs interpolation for group/aggregate data (cumulative frequency curve)
 May not be characteristic of the group when: (1) items are only few; (2) the distribution is irregular
 Very limited statistical use
Measure: Mode (most frequent value)
Advantages:
 Unaffected by extreme values
 Easy to obtain from a histogram
 Determinable from only the values near the modal class
Disadvantages:
 Cannot be determined exactly in group data
 Very limited statistical use

Central Tendency – “Mean”
 For individual observations, x̄ = ∑x / n. E.g.
X = {3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12}
∑x = 96; n = 12
 Thus, x̄ = 96/12 = 8
 The above observations can be organised into a frequency table and the mean calculated on the basis of frequencies:
x     3    5    7    8    9   10   12
f     1    1    2    3    2    2    1
fx    3    5   14   24   18   20   12
∑fx = 96; ∑f = 12
Thus, x̄ = ∑fx / ∑f = 96/12 = 8

Central Tendency – “Mean of Grouped Data”
 House rentals or prices in the PMR are frequently tabulated as a range of values. E.g.
Rental (RM/month)      135-140   140-145   145-150   150-155   155-160
Mid-point value (x)      137.5     142.5     147.5     152.5     157.5
Number of Taman (f)          5         9         6         2         1
fx                       687.5    1282.5     885.0     305.0     157.5
 What is the mean rental across the areas?
∑f = 23; ∑fx = 3317.5
Thus, x̄ = ∑fx / ∑f = 3317.5/23 = RM 144.24/month
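The grouped-mean calculation can be reproduced in a few lines, using each class mid-point as x. The function name grouped_mean and the (lower, upper, frequency) layout are choices made for this sketch.

```python
# Rental classes from the slide: (lower limit, upper limit, number of Taman).
rental_classes = [
    (135, 140, 5),
    (140, 145, 9),
    (145, 150, 6),
    (150, 155, 2),
    (155, 160, 1),
]


def grouped_mean(classes):
    """Mean of grouped data: sum(f * midpoint) / sum(f)."""
    sum_f = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return sum_fx / sum_f


print(round(grouped_mean(rental_classes), 2))  # 144.24
```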

Central Tendency – “Median”
 Let us say house rentals in a particular town are tabulated as follows:
Rental (RM/month)      130-135   135-140   140-145   145-150   150-155
Number of Taman (f)          3         5         9         6         2
 Calculation of the “median” rental needs a graphical aid (the ogive, or cumulative frequency curve):
Rental (RM/month)         <135      <140      <145      <150      <155
Cumulative frequency         3         8        17        23        25
1. Median position = (n+1)/2 = (25+1)/2 = 13th Taman
2. This lies between 10 and 15 on the vertical axis of the ogive.
3. It corresponds to RM 140-145/month on the horizontal axis.
4. There are (17-8) = 9 Taman in the range RM 140-145/month.
5. The 13th Taman is the 5th of these 9 Taman.
6. The interval width is 5.
7. Therefore, the median rental can be calculated as:
140 + (5/9 x 5) = RM 142.8

Central Tendency – “Median” (contd.)

Central Tendency – “Quartiles”
Upper quartile = ¾(n+1) = 19.5th Taman
UQ = 145 + (2.5/6 x 5) = RM 147.1/month
Lower quartile = (n+1)/4 = 26/4 = 6.5th Taman
LQ = 135 + (3.5/5 x 5) = RM 138.5/month
Inter-quartile range = UQ – LQ = 147.1 – 138.5 = RM 8.6/month
IQ = 138.5 + (4/5 x 5) = RM
142.5/month
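The interpolation used for the median and the quartiles follows one pattern: find the class that contains the required position, then move proportionally into that class. A minimal sketch, assuming the fourth rental class is 145-150 and using the invented name value_at_position:

```python
def value_at_position(classes, position):
    """Interpolate the value of the item at a (possibly fractional) rank.

    classes:  list of (lower, upper, frequency) in ascending order.
    position: 1-based rank, e.g. (n + 1) / 2 for the median.
    """
    cumulative = 0
    for lower, upper, freq in classes:
        if position <= cumulative + freq:
            # Share of the way into this class, times the class width.
            fraction = (position - cumulative) / freq
            return lower + fraction * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds the number of observations")


# Rental table from the median slide (fourth class assumed to be 145-150).
rentals = [(130, 135, 3), (135, 140, 5), (140, 145, 9), (145, 150, 6), (150, 155, 2)]
n = sum(freq for _, _, freq in rentals)

median = value_at_position(rentals, (n + 1) / 2)       # 13th Taman
lower_q = value_at_position(rentals, (n + 1) / 4)      # 6.5th Taman
upper_q = value_at_position(rentals, 3 * (n + 1) / 4)  # 19.5th Taman

print(round(median, 1), round(lower_q, 1), round(upper_q, 1))
```

Rounded to one decimal place, the results agree with the hand calculation: RM 142.8, RM 138.5 and RM 147.1 per month.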

“Variability”
 Indicates dispersion, spread, variation, deviation
 For single population or sample data:
σ² = ∑(xi – μ)² / n          s² = ∑(xi – x̄)² / (n – 1)
where σ² and s² = population and sample variance respectively, xi = individual observations, μ = population mean, x̄ = sample mean, and n = total number of individual observations.
 The square roots are:
σ = √σ² (population standard deviation)          s = √s² (sample standard deviation)

“Variability”
 Why is a “measure of dispersion” important?
 Consider returns from two categories of shares:
* Shares A (%) = {1.8, 1.9, 2.0, 2.1, 3.6}
* Shares B (%) = {1.0, 1.5, 2.0, 3.0, 3.9}
Mean A = mean B = 2.28%
But the variability differs:
Var(A) = 0.557, Var(B) = 1.367
* Would you invest in category A shares or category B shares?

“Variability”
 Coefficient of variation (COV) – std. deviation as a % of the mean:
COV = (s / x̄) x 100%
 Could be a better measure than the std. dev. alone, because it is expressed relative to the mean.
COV(A) = 32.73%, COV(B) = 51.28%
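The share figures can be checked with the standard library. The slide values match the sample variance (dividing by n – 1), so that is what this sketch uses; the function name coefficient_of_variation is a choice made for the example.

```python
from statistics import mean, stdev, variance

shares_a = [1.8, 1.9, 2.0, 2.1, 3.6]  # returns in %
shares_b = [1.0, 1.5, 2.0, 3.0, 3.9]  # returns in %


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


for name, data in (("A", shares_a), ("B", shares_b)):
    print(
        name,
        round(mean(data), 2),                      # mean return
        round(variance(data), 3),                  # sample variance
        round(coefficient_of_variation(data), 2),  # COV in %
    )
```

Both categories have the same mean return of 2.28%, but category B shows roughly two and a half times the variance of category A.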

“Variability”
 Std. dev. of a frequency distribution
The following table shows the age distribution of second-time home buyers:
[Table: age distribution of second-time home buyers; not recovered from the slide]

Skewness is a measure of asymmetry and shows the manner in which the items are clustered around the average.

Kurtosis is the measure of the flat-toppedness of a curve. A bell-shaped curve, or the normal curve, is mesokurtic because it is kurtic in the centre; if a curve is relatively more peaked than the normal curve, it is called leptokurtic, whereas if it is flatter than the normal curve, it is called platykurtic.
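The slides do not give formulas for skewness and kurtosis. One common choice, assumed here, is the moment coefficients: the third and fourth central moments divided by the appropriate power of the second. On this convention a normal curve has excess kurtosis 0, a leptokurtic curve a positive value and a platykurtic curve a negative value.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)


def skewness(values):
    """Moment coefficient of skewness: 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment coefficient of kurtosis minus 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1.8, 1.9, 2.0, 2.1, 3.6]  # Shares A from the variability example

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric set has zero skewness and negative excess kurtosis (flatter than normal), while the Shares A returns are positively skewed because of the single high value.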

MEASURES OF RELATIONSHIP
 Correlation can be studied through
(a) cross tabulation;
(b) Charles Spearman’s coefficient of correlation;
(c) Karl Pearson’s coefficient of correlation;
whereas cause-and-effect relationships can be studied through simple regression equations.
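The two coefficients can be sketched in plain Python. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks, with tied values sharing their average rank; the data and the function names pearson, ranks and spearman are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Karl Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Charles Spearman's rank correlation: Pearson's r on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because y always increases with x, the rank correlation is 1 even though the relationship is not linear, whereas Pearson's coefficient stays slightly below 1.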

Forms of “statistical” relationship
 Correlation
 Contingency
 Cause-and-effect
* Causal
* Feedback
* Multi-directional
* Recursive
 The last two categories are normally dealt with through regression.
