Dr. Dalia El-Shafei
Assistant Professor, Community Medicine Department, Zagazig University
STATISTICS
It is the science of dealing with numbers.
It is used for collection, summarization, presentation
and analysis of data.
It provides a way of organizing data to get information
on a wider and more formal (objective) basis than
relying on personal experience (subjective).
Collection Summarization Presentation Analysis
USES OF MEDICAL STATISTICS:
Planning, monitoring & evaluating community health
care programs.
Epidemiological research studies.
Diagnosis of community health problems.
Comparison of health status & diseases in different
countries and in one country over years.
Forming standards for different biological measurements, such as weight & height.
Differentiating between diseased & normal groups.
TYPES OF STATISTICS
Descriptive statistics:
• Describe or summarize the data of a target population.
• Describe data that are already known.
• Organize, analyze & present data in a meaningful manner.
• Final results are shown in the form of tables and graphs.
• Tools: measures of central tendency & dispersion.
Inferential statistics:
• Use data to make inferences or generalizations about a population.
• Make conclusions about a population that go beyond the available data.
• Compare, test and predict future outcomes.
• Final results are probability scores.
• Tools: hypothesis tests.
TYPES OF DATA
Quantitative:
• Discrete (no decimals): no. of hospitals, no. of patients.
• Continuous (decimals allowed): weight, height, hemoglobin level.
Qualitative:
• Categorical: blood groups, male & female, black & white.
• Ordinal: have levels such as low, moderate, high.
SOURCES OF DATA COLLECTION
Primary (1ry) sources.
Secondary (2ry) sources.
PRESENTATION OF DATA
Tabular
presentation.
Graphical
presentation
Graphic presentations usually accompany tables to illustrate &
clarify information.
Tables are essential in presentation of scientific data & diagrams
are complementary to summarize these tables in an easy way.
TABULATION
 Basic form of presentation
• Table must be self-explanatory.
• Title: written at the top of table to define
precisely the content, the place and the time.
• Clear heading of the columns & rows
• Units of measurements should be indicated.
• The size of the table depends on the number of classes (2–10 rows or classes).
TYPES OF TABLES
• List
• Frequency distribution table
LIST
The number of patients in each hospital department is:
Medicine: 100 patients
Surgery: 80 patients
ENT: 28 patients
Ophthalmology: 30 patients
FREQUENCY DISTRIBUTION TABLE
Assume we have a group of 20 individuals whose blood groups are as follows: A, AB, AB, O, B, A, A, B, B, AB, O, AB, AB, A, B, B, B, A, O, A. We want to present these data in a table.
Distribution of the studied individuals according to blood group:
Blood group | Frequency | %
A           | 6         | 30
B           | 6         | 30
AB          | 5         | 25
O           | 3         | 15
Total       | 20        | 100
These are the blood pressure measurements of 30 patients with hypertension. Present these data in a frequency table: 150, 155, 160, 154, 162, 170, 165, 155, 190, 186, 180, 178, 195, 200, 180, 156, 173, 188, 173, 189, 190, 177, 186, 177, 174, 155, 164, 163, 172, 160.
Frequency distribution of blood pressure measurements among the studied patients:
Blood pressure “mmHg” | Tally    | Frequency (Number) | %
150 –                 | 1111 1   | 6                  | 20
160 –                 | 1111 1   | 6                  | 20
170 –                 | 1111 111 | 8                  | 26.7
180 –                 | 1111 1   | 6                  | 20
190 –                 | 111      | 3                  | 10
200 –                 | 1        | 1                  | 3.3
Total                 |          | 30                 | 100
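The same frequency table can be generated programmatically. Below is a minimal Python sketch (variable names are illustrative, not from the slides) that tallies the 30 readings into the 10-mmHg classes used above.

```python
from collections import Counter

# Blood pressure readings of the 30 hypertensive patients from the example above
readings = [150, 155, 160, 154, 162, 170, 165, 155, 190, 186,
            180, 178, 195, 200, 180, 156, 173, 188, 173, 189,
            190, 177, 186, 177, 174, 155, 164, 163, 172, 160]

# Group each reading into a 10-mmHg class: 150-, 160-, ..., 200-
classes = Counter((bp // 10) * 10 for bp in readings)
total = len(readings)

print("Class (mmHg)  Frequency  %")
for lower in sorted(classes):
    freq = classes[lower]
    print(f"{lower} -          {freq:>9}  {100 * freq / total:.1f}")
```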
GRAPHICAL PRESENTATION
Simple & easy to understand.
Saves a lot of words.
Self-explanatory.
Has a clear title indicating its content, “written under the graph”.
Fully labeled.
The y axis (vertical) is usually used for frequency.
Graphs
Bar chart
Pie diagram
Histogram
Scatter diagram
Line graph
Frequency polygon
BAR CHART
 Used for presenting discrete or qualitative data.
 It is a graphical presentation of magnitude (value or percentage) by rectangles of constant width & lengths proportional to the frequency, separated by gaps.
Types: simple, multiple & component.
SIMPLE BAR CHART
MULTIPLE BAR CHART
Percentage of Persons Aged ≥18 Years Who Were Current Smokers,
by Age and Sex — United States, 2002
COMPONENT BAR CHART
PIE DIAGRAM
 Consists of a circle whose area represents the total frequency (100%) and which is divided into segments.
 Each segment represents a proportional composition of
the total frequency.
HISTOGRAM
• It is very similar to a bar chart, with the difference that the rectangles or bars are adherent (without gaps).
• It is used for presenting a class frequency table (continuous data).
• Each bar represents a class: its height represents the frequency (number of cases) & its width represents the class interval.
SCATTER DIAGRAM
It is useful for representing the relationship between 2 numeric measurements, each observation being represented by a point corresponding to its value on each axis.
LINE GRAPH
• It is a diagram showing the relationship between two numeric variables (as in the scatter diagram), but the points are joined together to form a line (either a broken line or a smooth curve).
FREQUENCY POLYGON
 Derived from a histogram by connecting the mid points of the tops
of the rectangles in the histogram.
 The line connecting the centers of the histogram rectangles is called the frequency polygon. We can draw the polygon without the rectangles, giving a simpler form of line graph.
 A special type of frequency polygon is “the Normal Distribution
Curve”.
NORMAL DISTRIBUTION CURVE
“GAUSSIAN DISTRIBUTION CURVE”
The NDC is the frequency polygon of a quantitative continuous variable measured in a large number of individuals.
It is a form of presentation of frequency distribution of biologic
variables “weights, heights, hemoglobin level and blood pressure”.
CHARACTERISTICS OF THE CURVE:
Bell shaped, continuous curve
Symmetrical i.e. can be divided into 2 equal halves
vertically
Tails never touch the base line but extend to infinity in either direction
The mean, median and mode values coincide
Described by 2 parameters: arithmetic mean (X)
“location of the center of the curve” & standard
deviation (SD) “scatter around the mean”
AREAS UNDER THE NORMAL CURVE:
X ± 1 SD covers ≈68% of the area under the curve.
X ± 2 SD covers ≈95% of the area under the curve.
X ± 3 SD covers ≈99.7% of the area under the curve.
SKEWED DATA
If we represent collected data by a frequency polygon & the resulting curve does not simulate the NDC (with all its characteristics), then these data are
“Not normally distributed”
“The curve may be skewed to the Rt. or to the Lt. side”
CAUSES OF SKEWED CURVE
The data collected are from:
• A heterogeneous group, or
• A diseased or abnormal population.
So, the results obtained from these data cannot be applied or generalized to the whole population.
Example:
Suppose we have an NDC for Hb levels in a population of normal adult males with mean ± SD = 11 ± 1.5.
We obtain an Hb reading of 8.1 for an individual & want to know whether he is normal or anemic.
If this reading lies within the area under the curve covering 95% of normal values (i.e. mean ± 2 SD), he will be considered normal; if his reading is lower than that, he is anemic.
The NDC can thus be used to distinguish normal from abnormal measurements.
• The normal range for Hb in this example is:
Upper Hb level: 11 + 2(1.5) = 14.
Lower Hb level: 11 − 2(1.5) = 8.
i.e. the normal Hb range of adult males is from 8 to 14.
Our reading (8.1) lies within the 95% range of this population, so this individual is considered normal.
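As a quick illustration of the mean ± 2 SD rule, here is a short Python sketch of the Hb example (the values are taken from the slide; the variable names are mine).

```python
# Classify a reading as inside or outside the mean +/- 2 SD "normal range"
mean_hb, sd_hb = 11.0, 1.5   # reference mean and SD for normal adult males (from the example)
reading = 8.1                # the individual's hemoglobin level

lower, upper = mean_hb - 2 * sd_hb, mean_hb + 2 * sd_hb   # 8.0 to 14.0
if lower <= reading <= upper:
    print(f"{reading} lies within the normal range ({lower}-{upper}) -> normal")
else:
    print(f"{reading} lies outside the normal range ({lower}-{upper}) -> anemic")
```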
DATA SUMMARIZATION
• Measures of central tendency: mean, median & mode.
• Measures of dispersion: range, variance, standard deviation & coefficient of variation.
ARITHMETIC MEAN
The sum of the observations divided by the number of observations:
x̄ = ∑x / n
where x̄ = the mean, ∑ denotes “the sum of”, x = the values of the observations, and n = the number of observations.
For frequency distribution data we calculate the mean by the same idea, weighting each value by its frequency:
x̄ = ∑(f × x) / n
where f = the frequency of each value and n = ∑f.
 If data are presented in a frequency table with class intervals, we calculate the mean by the same equation but using the midpoint of each class interval.
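As a sketch of the grouped-data calculation, the snippet below applies the midpoint method to the blood-pressure table from earlier; the upper bound of the last class (200–210) is an assumption, since the table leaves it open.

```python
# Mean from a frequency table with class intervals, using class midpoints
classes = [(150, 160), (160, 170), (170, 180), (180, 190), (190, 200), (200, 210)]
freqs   = [6, 6, 8, 6, 3, 1]

n = sum(freqs)                                   # n = sum of frequencies = 30
grouped_mean = sum(f * (lo + hi) / 2             # frequency x class midpoint
                   for (lo, hi), f in zip(classes, freqs)) / n
print(round(grouped_mean, 1))                    # ~174.0 mmHg
```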
MEDIAN
 The middle observation in a series of observations after arranging them in ascending or descending order
Rank of the median:
• Odd number of observations: the (n + 1)/2 th observation.
• Even number of observations: midway between the (n/2)th and the (n/2 + 1)th observations, i.e. their average.
MODE
The most frequently occurring value in the data.
ADVANTAGES & DISADVANTAGES OF THE
MEASURES OF CENTRAL TENDENCY:
Mean
• Usually preferred since it takes into account each individual
observation
• Main disadvantage is that it is affected by the value of extreme
observations.
Median
• Useful descriptive measure if there are one or two
extremely high or low values.
Mode
• Seldom used.
MEASURE OF DISPERSION
Describes the degree of variation or scatter or dispersion of the data around its central values (dispersion = variation = spread = scatter).
RANGE
 The difference between the largest & smallest values.
 It is the simplest measure of variation
It can be expressed as an interval such as 4–10, where 4 is the smallest value & 10 is the highest.
But it is often expressed as an interval width; for example, the range of 4–10 can also be expressed as a range of 6.
VARIANCE
The range depends only on the two extreme values, so a measure that uses every observation is preferred.
• To get the average of the differences between the mean & each observation in the data, we subtract each value from the mean, then sum these differences and divide by the number of observations:
V = ∑(mean − x) / n
• The value of this expression is always zero, because the negative and positive differences between each value & the mean cancel out on algebraic summation.
• To overcome this, we square the difference between the mean & each value so that the sign is always positive. Thus we get:
V = ∑(mean − x)² / (n − 1)
STANDARD DEVIATION “SD”
The main disadvantage of the variance is that it is the
square of the units used.
So, it is more convenient to express the variation in the
original units by taking the square root of the variance.
This is called the standard deviation (SD). Therefore:
SD = √V, i.e. SD = √[ ∑(mean − x)² / (n − 1) ]
COEFFICIENT OF VARIATION “COV”
• The coefficient of variation expresses the standard
deviation as a percentage of the sample mean.
• C.V. is useful when we are interested in the relative size of the variability in the data.
• Example:
If we have the observations 5, 7, 10, 12 and 16, their mean = 50/5 = 10.
SD = √[(25 + 9 + 0 + 4 + 36) / (5 − 1)] = √(74/4) = 4.3
C.V. = 4.3/10 × 100 = 43%
Another set of observations is 2, 2, 5, 10 and 11. Their mean = 30/5 = 6.
SD = √[(16 + 16 + 1 + 16 + 25) / (5 − 1)] = √(74/4) = 4.3
C.V. = 4.3/6 × 100 = 71.6%
Both sets of observations have the same SD but they differ in C.V., because the data in the 1st group are homogeneous (so the C.V. is not high), while the data in the 2nd group are heterogeneous (so the C.V. is high).
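A small Python check of the two C.V. calculations above (the helper function cv is illustrative; statistics.stdev uses the n − 1 denominator, matching the slides):

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: SD expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

print(round(cv([5, 7, 10, 12, 16]), 1))   # ~43.0 %
print(round(cv([2, 2, 5, 10, 11]), 1))    # ~71.7 % (71.6 in the slide due to rounding)
```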
• Example: In a study where age was recorded, the observed values were 6, 8, 9, 7 and 6, and the number of observations was 5.
• Calculate the mean, SD, range, mode and median.
Mean = (6 + 8 + 9 + 7 + 6) / 5 = 7.2
Variance = [(7.2 − 6)² + (7.2 − 8)² + (7.2 − 9)² + (7.2 − 7)² + (7.2 − 6)²] / (5 − 1)
         = [(1.2)² + (−0.8)² + (−1.8)² + (0.2)² + (1.2)²] / 4 = 1.7
S.D. = √1.7 = 1.3
Range = 9 − 6 = 3
Mode = 6
Median = 7
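The same worked example can be verified with Python's standard statistics module; a minimal sketch:

```python
from statistics import mean, median, mode, variance, stdev

ages = [6, 8, 9, 7, 6]

print(mean(ages))                 # 7.2
print(round(variance(ages), 1))   # 1.7  (n - 1 denominator)
print(round(stdev(ages), 1))      # 1.3
print(max(ages) - min(ages))      # 3  (range)
print(mode(ages))                 # 6
print(median(ages))               # 7
```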
INFERENTIAL STATISTICS
INFERENCE
Inference involves making a generalization about a larger
group of individuals on the basis of a subset or sample.
HYPOTHESIS TESTING
To find out whether an observed difference is explained by sampling variation (chance) or reflects a real difference between groups.
The method of assessing a hypothesis is known as a “significance test”.
Significance testing is a method for assessing whether a result is likely to be due to chance or due to a real effect.
NULL & ALTERNATIVE HYPOTHESES:
 In hypothesis testing, a specific hypothesis is formulated & data are collected to accept or reject it.
 The null hypothesis, H0: x1 = x2, means that there is no difference between x1 & x2.
 If we reject the null hypothesis, i.e. there is a difference between the 2 readings, the alternative is either H1: x1 < x2 or H2: x1 > x2.
 In other words, the null hypothesis is rejected because x1 is different from x2.
GENERAL PRINCIPLES OF TESTS OF SIGNIFICANCE
Set up a null hypothesis and its alternative.
Find the value of the test statistic.
Refer the value of the test statistic to a known distribution which it would follow if the null hypothesis were true.
Conclude that the data are consistent or
inconsistent with the null hypothesis.
 If the data are not consistent with the null hypothesis, the difference is said to be “statistically significant”.
 If the data are consistent with the null hypothesis, we accept it, i.e. the difference is statistically insignificant.
 In medicine, we usually consider differences significant if the probability is < 0.05.
 This means that if the null hypothesis is true, we shall make a wrong decision fewer than 5 times in 100.
TESTS OF SIGNIFICANCE
Quantitative variables:
• 2 means, large samples “> 60”: z test
• 2 means, small samples “< 60”: t-test or paired t-test
• > 2 means: ANOVA
Qualitative variables:
• X² test or Z test for percentages
COMPARING TWO MEANS OF LARGE SAMPLES USING
THE NORMAL DISTRIBUTION: (Z TEST OR SND
STANDARD NORMAL DEVIATE)
If we have a large sample size (≥ 60) & the data follow a normal distribution, we use the z-test.
z = (difference between the two means) / (standard error of the difference).
If z > 2, there is a significant difference.
The normal range for any biological reading lies between the population mean ± 2 SD (this includes 95% of the area under the normal distribution curve).
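A minimal sketch of a two-sample z test from summary figures; the numbers below are hypothetical and only illustrate the calculation, they are not from the slides.

```python
from math import sqrt

def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """z = (difference between the two means) / (standard error of the difference)."""
    se_diff = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se_diff

# Hypothetical summary data for two large groups (n >= 60 each)
z = z_two_means(mean1=120.0, sd1=10.0, n1=80, mean2=116.0, sd2=11.0, n2=75)
print(round(abs(z), 2), "-> significant" if abs(z) > 2 else "-> not significant")
```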
COMPARING TWO MEANS OF SMALL SAMPLES
USING T-TEST
 If we have a small sample size (<60), we can use the t
distribution instead of the normal distribution.
Degree of freedom = (n1+n2)-2
The value of t is compared with the values in the “t distribution” table at that degree of freedom.
If the t-value is less than the tabulated value, the difference between the samples is insignificant.
If the t-value is larger than the tabulated value, the difference is significant, i.e. the null hypothesis is rejected.
Big t-value → small P-value → statistical significance.
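A sketch of an unpaired (independent-samples) t test, assuming SciPy is available; the two small samples are hypothetical.

```python
from scipy import stats

group_a = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]   # hypothetical small sample 1
group_b = [5.9, 6.1, 5.4, 6.0, 5.8, 5.7]   # hypothetical small sample 2

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # df = n1 + n2 - 2
print(round(t_stat, 2), round(p_value, 4))
# If p < 0.05, the null hypothesis (no difference between the means) is rejected.
```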
PAIRED T-TEST:
If we are comparing repeated observations in the same individuals or differences between paired data, we use the paired t-test, where the analysis is carried out using the mean and standard deviation of the differences between the pairs.
Paired t = (mean of the differences) / √(SD² of the differences / n)
d.f. = n − 1
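A sketch of the paired t test on hypothetical before/after readings from the same individuals, assuming SciPy is available.

```python
from scipy import stats

before = [140, 152, 138, 147, 160, 155, 149, 142]   # hypothetical repeated observations
after  = [135, 149, 136, 140, 151, 150, 147, 139]   # on the same 8 individuals

t_stat, p_value = stats.ttest_rel(before, after)    # df = n - 1 = 7
print(round(t_stat, 2), round(p_value, 4))
```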
ANALYSIS OF VARIANCE “ANOVA”
 The main idea in ANOVA is that we have to take into account the variability within the groups & between the groups.
One-way ANOVA: the subgroups to be compared are defined by just one factor, e.g. comparison between the means of different socio-economic classes.
Two-way ANOVA: used when the subdivision is based upon more than one factor.
The F value is the ratio of the between-groups mean sum of squares to the within-groups mean sum of squares:
F = between-groups MS / within-groups MS
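A sketch of a one-way ANOVA comparing the means of three groups, assuming SciPy is available; the three samples are hypothetical.

```python
from scipy import stats

low    = [62, 65, 68, 61, 66]   # hypothetical measurements, one group per factor level
middle = [70, 72, 69, 74, 71]
high   = [75, 78, 74, 77, 79]

f_stat, p_value = stats.f_oneway(low, middle, high)   # F = between-groups MS / within-groups MS
print(round(f_stat, 2), round(p_value, 6))
```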
CHI -SQUARED TEST
A chi-squared test is used to test whether there is an
association between the row variable & the column
variable or, in other words whether the distribution of
individuals among the categories of one variable is
independent of their distribution among the categories of
the other.
Qualitative data are arranged in a table formed by rows & columns; one variable defines the rows & the categories of the other variable define the columns.
Χ² = ∑ (O − E)² / E
O = observed value in the table
E = expected value
Expected (E) = (Row total × Column total) / Grand total
Degrees of freedom = (rows − 1) × (columns − 1)
EXAMPLE HYPOTHETICAL STUDY
 Two groups of patients are treated using different spinal
manipulation techniques
 Gonstead vs. Diversified
 The presence or absence of pain after treatment is the
outcome measure.
 Two categorical variables:
 Technique used
 Pain after treatment
GONSTEAD VS. DIVERSIFIED EXAMPLE -
RESULTS
Pain after treatment:
Technique    | Yes | No | Row total
Gonstead     | 9   | 21 | 30
Diversified  | 11  | 29 | 40
Column total | 20  | 50 | 70 (grand total)
9 out of 30 (30%) still had pain after Gonstead treatment and 11 out of 40 (27.5%) still had pain after Diversified, but is this difference statistically significant?
FIRST FIND THE EXPECTED VALUES FOR EACH CELL
Expected (E) = (Row total × Column total) / Grand total
 To find E for a given cell (and similarly for the rest): multiply its row total by its column total, then divide by the grand total.
 Find E for all cells:
Technique    | Yes                      | No                       | Row total
Gonstead     | 9 (E = 30×20/70 = 8.6)   | 21 (E = 30×50/70 = 21.4) | 30
Diversified  | 11 (E = 40×20/70 = 11.4) | 29 (E = 40×50/70 = 28.6) | 40
Column total | 20                       | 50                       | 70 (grand total)
 Use the Χ² formula for each cell and then add the results together:
Χ² = (9 − 8.6)²/8.6 + (21 − 21.4)²/21.4 + (11 − 11.4)²/11.4 + (29 − 28.6)²/28.6
   = 0.0186 + 0.0075 + 0.0140 + 0.0056 ≈ 0.046 (≈ 0.05 with unrounded expected values)
 Find the df and then consult a Χ² table to see whether the result is statistically significant.
 There are two categories for each variable in this case, so df = (2 − 1) × (2 − 1) = 1.
 The critical value at the 0.05 level with one df is 3.84.
 Therefore, this Χ² is not statistically significant.
Degrees of freedom = (rows − 1) × (columns − 1)
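The same 2×2 table can be tested with SciPy (a sketch, assuming SciPy is available); correction=False reproduces the plain Pearson Χ² of the hand calculation, since SciPy otherwise applies Yates' continuity correction to 2×2 tables.

```python
from scipy.stats import chi2_contingency

observed = [[9, 21],    # Gonstead:    pain yes / no
            [11, 29]]   # Diversified: pain yes / no

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3), round(p_value, 3), dof)   # ~0.05, p ~0.82, df = 1
```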
Z TEST FOR COMPARING TWO PERCENTAGES
z = (p1 − p2) / √(p1q1/n1 + p2q2/n2)
p1 = % in the 1st group, p2 = % in the 2nd group
q1 = 100 − p1, q2 = 100 − p2
n1 = sample size of the 1st group, n2 = sample size of the 2nd group
The Z test is significant (at the 0.05 level) if the result > 2.
Example:
The number of anemic patients in group 1, which includes 50 patients, is 5, & the number of anemic patients in group 2, which contains 60 patients, is 20.
To find whether groups 1 & 2 differ statistically in the prevalence of anemia, we calculate the z test.
p1 = 5/50 = 10%, p2 = 20/60 = 33%
q1 = 100 − 10 = 90, q2 = 100 − 33 = 67
Z = (33 − 10) / √(10×90/50 + 33×67/60)
Z = 23 / √(18 + 36.85) = 23 / 7.4 = 3.1
Therefore there is a statistically significant difference between the percentages of anemia in the studied groups (because z > 2).
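A short Python sketch of the two-percentage z test, reproducing the anemia example (the helper name is illustrative):

```python
from math import sqrt

def z_two_percentages(x1, n1, x2, n2):
    """z for the difference between two percentages, using the formula above."""
    p1, p2 = 100 * x1 / n1, 100 * x2 / n2
    q1, q2 = 100 - p1, 100 - p2
    return abs(p1 - p2) / sqrt(p1 * q1 / n1 + p2 * q2 / n2)

# 5 anemic out of 50 patients in group 1 vs. 20 anemic out of 60 in group 2
z = z_two_percentages(5, 50, 20, 60)
print(round(z, 1), "-> significant" if z > 2 else "-> not significant")   # ~3.1
```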
CORRELATION & REGRESSION
Correlation measures the closeness of the
association between 2 continuous variables, while
Linear regression gives the equation of the straight
line that best describes & enables the prediction of
one variable from the other.
CORRELATION
A t-test for correlation is used to test the significance of the association.
CORRELATION IS NOT CAUSATION!!!
LINEAR REGRESSION
Same as correlation:
• Determines the relation & the prediction of the change in one variable due to changes in the other variable.
• A t-test is also used to assess the level of significance.
Differs from correlation:
• The independent variable has to be specified and distinguished from the dependent variable.
• The dependent variable in linear regression must be a continuous one.
• Allows prediction of the dependent variable for a particular value of the independent variable, “but it should not be used outside the range of the original data”.
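A sketch of simple correlation & linear regression on hypothetical paired data, assuming SciPy is available:

```python
from scipy.stats import linregress

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical independent variable
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]   # hypothetical dependent variable

result = linregress(x, y)
print(round(result.slope, 2), round(result.intercept, 2))   # regression line: y = intercept + slope * x
print(round(result.rvalue, 3), round(result.pvalue, 4))     # correlation r and its significance
```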
SCATTERPLOTS
 An X-Y graph with symbols that represent the values of two variables; the fitted regression line can be drawn through the points.
MULTIPLE REGRESSION
 The dependency of a dependent variable on several independent variables, not just one.
 The test of significance used is ANOVA (F test).
For example, if neonatal birth weight depends on these factors: gestational age, length of the baby and head circumference, and each factor correlates significantly with birth weight (i.e. has a +ve linear correlation), we can do a multiple regression analysis to obtain a mathematical equation by which we can predict the birth weight of any neonate if we know the values of these factors.
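A minimal sketch of the birth-weight idea using an ordinary least-squares fit with NumPy; the arrays are hypothetical placeholder values, not real data, and the coefficients are purely illustrative.

```python
import numpy as np

gest_age = np.array([38, 40, 36, 39, 41, 37, 40, 38], dtype=float)   # weeks (hypothetical)
length   = np.array([48, 51, 45, 50, 52, 46, 51, 49], dtype=float)   # cm   (hypothetical)
head_c   = np.array([33, 35, 31, 34, 36, 32, 35, 33], dtype=float)   # cm   (hypothetical)
birth_wt = np.array([3.0, 3.5, 2.5, 3.3, 3.7, 2.7, 3.4, 3.1])        # kg   (hypothetical)

# Design matrix with an intercept column; least squares gives the equation
# birth_wt ~ b0 + b1*gest_age + b2*length + b3*head_c
X = np.column_stack([np.ones_like(gest_age), gest_age, length, head_c])
coefs, *_ = np.linalg.lstsq(X, birth_wt, rcond=None)
print(np.round(coefs, 3))
```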
Statistics "Descriptive & Inferential"

Statistics "Descriptive & Inferential"

  • 1.
    Dr. Dalia El-Shafei Assistantprofessor, Community Medicine Department, Zagazig University
  • 2.
    STATISTICS It is thescience of dealing with numbers. It is used for collection, summarization, presentation and analysis of data. It provides a way of organizing data to get information on a wider and more formal (objective) basis than relying on personal experience (subjective). Collection Summarization Presentation Analysis
  • 3.
    USES OF MEDICALSTATISTICS: Planning, monitoring & evaluating community health care programs. Epidemiological research studies. Diagnosis of community health problems. Comparison of health status & diseases in different countries and in one country over years. Form standards for the different biological measurements as weight, height. Differentiate between diseased & normal groups
  • 4.
    TYPES OF STATISTICS •Describe or summarize the data of a target population. • Describe the data which is already known. • Organize, analyze & present data in a meaningful manner. • Final results are shown in forms of tables and graphs. • Tools: measures of central tendency & dispersion. Descriptiv e • Use data to make inferences or generalizations about population. • Make conclusions for population that is beyond available data. • Compare, test and predicts future outcomes. • Final results is the probability scores. • Tools: hypothesis tests Inferential
  • 5.
  • 6.
    Data Quantitative Discrete (no decimal) No. ofhospitals, No. of patients Continuous (decimals allowed) Weight, height, Hemoglobin level Qualitative Categorical Blood groups, Male & female Black & white Ordinal Have levels as low, moderate, high.
  • 8.
    SOURCES OF DATACOLLECTION
  • 9.
  • 10.
  • 11.
    Tabular presentation. Graphical presentation Graphic presentations usuallyaccompany tables to illustrate & clarify information. Tables are essential in presentation of scientific data & diagrams are complementary to summarize these tables in an easy way.
  • 12.
    TABULATION  Basic formof presentation • Table must be self-explanatory. • Title: written at the top of table to define precisely the content, the place and the time. • Clear heading of the columns & rows • Units of measurements should be indicated. • The size of the table depends on the number of classes “2 -10 rows or classes”.
  • 13.
  • 14.
    LIST Number of patientsin each hospital department are: Medicine 100 patients Surgery 80 “ ENT 28 “ Ophthalmology 30 “
  • 15.
  • 16.
    Assume we havea group of 20 individuals whose blood groups were as followed: A, AB, AB, O, B, A, A, B, B, AB, O, AB, AB, A, B, B, B, A, O, A. we want to present these data by table. Distribution of the studied individuals according to blood group:
  • 17.
    These are bloodpressure measurements of 30 patients with hypertension. Present these data in frequency table: 150, 155, 160, 154, 162, 170, 165, 155, 190, 186, 180, 178, 195, 200, 180,156, 173, 188, 173, 189, 190, 177, 186, 177, 174, 155, 164, 163, 172, 160. Blood pressure “mmHg” Frequency % Tally Number 150 – 160 – 170 – 180 – 190 - 200 - 1111 1 1111 1 1111 111 1111 1 111 1 6 6 8 6 3 1 20 20 26.7 20 10 3.3 Total 30 100 Frequency distribution of blood pressure measurements among studied patients:
  • 18.
    GRAPHICAL PRESENTATION Simple easyto understand. Save a lot of words. Simple easy to understand. Save a lot of words. Self explanatory. Has a clear title indicating its content “written under the graph”. Fully labeled. The y axis (vertical) is usually used for frequency.
  • 19.
    Graphs Bar chart Pie diagram Histogram Scatterdiagram Line graph Frequency polygon
  • 20.
    BAR CHART  Usedfor presenting discrete or qualitative data.  It is a graphical presentation of magnitude (value or percentage) by rectangles of constant width & lengths proportional to the frequency & separated by gaps Simple MultipleComponent
  • 21.
  • 23.
    MULTIPLE BAR CHART Percentageof Persons Aged ≥18 Years Who Were Current Smokers, by Age and Sex — United States, 2002
  • 25.
  • 27.
    PIE DIAGRAM  Consistof a circle whose area represents the total frequency (100%) which is divided into segments.  Each segment represents a proportional composition of the total frequency.
  • 28.
    HISTOGRAM • It isvery similar to bar chart with the difference that the rectangles or bars are adherent (without gaps). • It is used for presenting class frequency table (continuous data). • Each bar represents a class and its height represents the frequency (number of cases), its width represent the class interval.
  • 31.
    SCATTER DIAGRAM It isuseful to represent the relationship between 2 numeric measurements, each observation being represented by a point corresponding to its value on each axis.
  • 33.
    LINE GRAPH • Itis diagram showing the relationship between two numeric variables (as the scatter) but the points are joined together to form a line (either broken line or smooth curve)
  • 34.
  • 35.
    FREQUENCY POLYGON  Derivedfrom a histogram by connecting the mid points of the tops of the rectangles in the histogram.  The line connecting the centers of histogram rectangles is called frequency polygon. We can draw polygon without rectangles so we will get simpler form of line graph.  A special type of frequency polygon is “the Normal Distribution Curve”.
  • 37.
  • 38.
    The NDC isthe frequency polygon of a quantitative continuous variable measured in large number. It is a form of presentation of frequency distribution of biologic variables “weights, heights, hemoglobin level and blood pressure”.
  • 39.
    CHARACTERISTICS OF THECURVE: Bell shaped, continuous curve Symmetrical i.e. can be divided into 2 equal halves vertically Tails never touch the base line but extended to infinity in either direction The mean, median and mode values coincide Described by 2 parameters: arithmetic mean (X) “location of the center of the curve” & standard deviation (SD) “scatter around the mean”
  • 40.
    AREAS UNDER THENORMAL CURVE: X ± 1 SD = 68% of the area on each side of the mean. X ± 2 SD = 95% of area on each side of the mean. X ± 3 SD = 99% of area on each side of the mean.
  • 41.
    SKEWED DATA If werepresent a collected data by a frequency polygon & the resulted curve does not simulate the NDC (with all its characteristics) then these data are “Not normally distributed” “Curve may be skewed to the Rt. or to the Lt. side”
  • 42.
    CAUSES OF SKEWEDCURVE The data collected are from: So; the results obtained from these data can not be applied or generalized on the whole population. Heterogeneous group Diseased or abnormal population
  • 43.
    Example: If we haveNDC for Hb levels for a population of normal adult males with mean±SD = 11±1.5 If we obtain a Hb reading for an individual = 8.1 & we want to know if he/she is normal or anemic. If this reading lies within the area under the curve at 95% of normal (i.e. mean±2 SD)he /she will be considered normal. If his reading is less then he is anemic. NDC can be used in distinguishing between normal from abnormal measurements.
  • 44.
    • Normal rangefor Hb in this example will be: Higher HB level: 11+2 (1.5) =14. Lower Hb level: 11–2 (1.5) = 8. i.e the normal Hb range of adult males is from 8 to 14. Our sample (8.1) lies within the 95% of his population. So; this individual is normal because his reading lies within the 95% of his population.
  • 45.
  • 46.
    Datasummarization Measures of Central tendency Mean Mode Median Measuresof Dispersion Range Variance Standard deviation Coefficient of variation
  • 48.
    Datasummarization Measures of Central tendency Mean Mode Median Measuresof Dispersion Range Variance Standard deviation Coefficient of variation
  • 49.
    ARITHMETIC MEAN Sum ofobservation divided by the number of observations. x = mean ∑ denotes the (sum of) x the values of observation n the number of observation
  • 50.
  • 51.
    In case offrequency distribution data we calculate the mean by this equation: ARITHMETIC MEAN
  • 52.
  • 53.
     If datais presented in frequency table with class intervals we calculate mean by the same equation but using the midpoint of class interval.
  • 54.
    MEDIAN  The middleobservation in a series of observation after arranging them in an ascending or descending manner Rank of median Odd no. (n + 1)/2 Even no. (n + 1)/2 n/2
  • 56.
  • 58.
  • 59.
    MODE The most frequentoccurring value in the data.
  • 61.
    ADVANTAGES & DISADVANTAGESOF THE MEASURES OF CENTRAL TENDENCY: Mean • Usually preferred since it takes into account each individual observation • Main disadvantage is that it is affected by the value of extreme observations. Median • Useful descriptive measure if there are one or two extremely high or low values. Mode • Seldom used.
  • 63.
    Datasummarization Measures of Central tendency Mean Mode Median Measuresof Dispersion Range Variance Standard deviation Coefficient of variation
  • 64.
    MEASURE OF DISPERSION Describesthe degree of variations or scatter or dispersion of the data around its central values (dispersion = variation = spread = scatter).
  • 65.
    RANGE  The differencebetween the largest & smallest values.  It is the simplest measure of variation It can be expressed as an interval such as 4-10, where 4 is the smallest value & 10 is highest. But often, it is expressed as interval width. For example, the range of 4-10 can also be expressed as a range of 6.
  • 66.
  • 67.
    • To getthe average of differences between the mean & each observation in the data; we have to reduce each value from the mean & then sum these differences and divide it by the number of observation. V = ∑ (mean - x) / n • The value of this equation will be equal to zero, because the differences between each value & the mean will have negative and positive signs that will equalize zero on algebraic summation. • To overcome this zero we square the difference between the mean & each value so the sign will be always positive . Thus we get: • V = ∑ (mean - x)2 / n-1 VARIANCE
  • 68.
    STANDARD DEVIATION “SD” Themain disadvantage of the variance is that it is the square of the units used. So, it is more convenient to express the variation in the original units by taking the square root of the variance. This is called the standard deviation (SD). Therefore SD = √ V i.e. SD = √ ∑ (mean – x)2 / n - 1
  • 70.
    COEFFICIENT OF VARIATION“COV” • The coefficient of variation expresses the standard deviation as a percentage of the sample mean. • C.V is useful when, we are interested in the relative size of the variability in the data.
  • 73.
    • Example: If wehave observations 5, 7, 10, 12 and 16. Their mean will be 50/5=10. SD = √ (25+9 +0 + 4 + 36 ) / (5-1) = √ 74 / 4 = 4.3 C.V. = 4.3 / 10 x 100 = 43% Another observations are 2, 2, 5, 10, and 11. Their mean = 30 / 5 = 6 SD = √ (16 + 16 + 1 + 16 + 25)/(5 –1) = √ 74 / 4 = 4.3 C.V = 4.3 /6 x 100 = 71.6 % Both observations have the same SD but they are different in C.V. because data in the 1st group is homogenous (so C.V. is not high), while data in the 2nd observations is heterogeneous (so C.V. is high).
  • 75.
    • Example: Ina study where age was recorded the following were the observed values: 6, 8, 9, 7, 6. and the number of observations were 5. • Calculate the mean, SD and range, mode and median. Mean = (6 + 8 + 9 + 7 + 6) / 5 = 7.2 Variance = (7.2-6)2 + (7.2-8)2 + (7.2-9)2 + (7.2-7)2 + (7.2- 6)2 / 5-1 = (1.2)2 + (- 0.8)2 + (-1.8) 2 +(0.2)2 + (1.2)2 / 4 = 1.7 S.D. = √ 1.7 = 1.3 Range = 9 – 6 = 3 Mode= 6Median = 7
  • 76.
  • 78.
    TYPES OF STATISTICS •Describe or summarize the data of a target population. • Describe the data which is already known. • Organize, analyze & present data in a meaningful manner. • Final results are shown in forms of tables and graphs. • Tools: measures of central tendency & dispersion. Descriptiv e • Use data to make inferences or generalizations about population. • Make conclusions for population that is beyond available data. • Compare, test and predicts future outcomes. • Final results is the probability scores. • Tools: hypothesis tests Inferential
  • 79.
    INFERENCE Inference involves makinga generalization about a larger group of individuals on the basis of a subset or sample.
  • 81.
    HYPOTHESIS TESTING To findout whether the observed variation among sampling is explained by sampling variations, chance or is really a difference between groups. The method of assessing the hypotheses testing is known as “significance test”. Significance testing is a method for assessing whether a result is likely to be due to chance or due to a real effect.
  • 82.
    NULL & ALTERNATIVEHYPOTHESES:  In hypotheses testing, a specific hypothesis is formulated & data is collected to accept or to reject it.  Null hypotheses means: H0: x1=x2 this means that there is no difference between x1 & x2.  If we reject the null hypothesis, i.e there is a difference between the 2 readings, it is either H1: x1 < x2 or H2: x1> x2  In other words the null hypothesis is rejected because x1 is different from x2.
  • 83.
    GENERAL PRINCIPLES OFTESTS OF SIGNIFICANCE Set up a null hypothesis and its alternative. Find the value of the test statistic. Refer the value of the test statistic to a known distribution which it would follow if the null hypothesis was true. Conclude that the data are consistent or inconsistent with the null hypothesis.
  • 84.
     If thedata are not consistent with the null hypotheses, the difference is said to be “statistically significant”.  If the data are consistent with the null hypotheses it is said that we accept it i.e. statistically insignificant.  In medicine, we usually consider that differences are significant if the probability is <0.05.  This means that if the null hypothesis is true, we shall make a wrong decision <5 in a 100 times.
  • 86.
    TESTS OF SIGNIFICANCETestsof significance Quantitative variables 2 Means Large sample “>60” z test Small sample “<60” t-test Paired t- test >2 Means ANOVA Qualitative variables X2 test Z test
  • 87.
    COMPARING TWO MEANSOF LARGE SAMPLES USING THE NORMAL DISTRIBUTION: (Z TEST OR SND STANDARD NORMAL DEVIATE) If we have a large sample size “≥ 60” & it follows a normal distribution then we have to use the z-test. z = (population mean - sample mean) / SD. If the result of z >2 then there is significant difference. The normal range for any biological reading lies between the mean value of the population reading ± 2 SD. (includes 95% of the area under the normal distribution curve).
  • 88.
    COMPARING TWO MEANSOF SMALL SAMPLES USING T-TEST  If we have a small sample size (<60), we can use the t distribution instead of the normal distribution.
  • 89.
    Degree of freedom= (n1+n2)-2 The value of t will be compared to values in the specific table of "t distribution test" at the value of the degree of freedom. If t-value is less than that in the table, then the difference between samples is insignificant. If t-value is larger than that in the table so the difference is significant i.e. the null hypothesis is rejected.
  • 91.
  • 92.
    PAIRED T-TEST: If weare comparing repeated observation in the same individual or difference between paired data, we have to use paired t-test where the analysis is carried out using the mean and standard deviation of the difference between each pair.
  • 93.
    Paired t= meanof difference/sq r of SD² of difference/number of sample. d.f=n – 1
  • 94.
    ANALYSIS OF VARIANCE“ANOVA”  The main idea in ANOVA is that we have to take into account the variability within the groups & between the groups One-way ANOVA • Subgroups to be compared are defined by just one factor • Comparison between means of different socio-economic classes Two-way ANOVA • When the subdivision is based upon more than one factor
  • 95.
    F-value is equalto the ratio between the means sum square of between the groups & within the groups. F = between-groups MS / within-groups MS
  • 96.
    TESTS OF SIGNIFICANCETestsof significance Quantitative variables 2 Means Large sample “>60” z test Small sample “<60” t-test Paired t- test >2 Means ANOVA Qualitative variables X2 test Z test
  • 97.
    CHI -SQUARED TEST Achi-squared test is used to test whether there is an association between the row variable & the column variable or, in other words whether the distribution of individuals among the categories of one variable is independent of their distribution among the categories of the other. Qualitative data are arranged in table formed by rows & columns. One variable define the rows & the categories of the other variable define the columns.
  • 98.
    O = observedvalue in the table E = expected value Expected (E) = Row total Χ Column total Grand total Degree of freedom = (row - 1) (column - 1)
  • 99.
    EXAMPLE HYPOTHETICAL STUDY Two groups of patients are treated using different spinal manipulation techniques  Gonstead vs. Diversified  The presence or absence of pain after treatment is the outcome measure.  Two categories  Technique used  Pain after treatment
  • 100.
    GONSTEAD VS. DIVERSIFIEDEXAMPLE - RESULTS Yes No Row Total Gonstead 9 21 30 Diversified 11 29 40 Column Total 20 50 70 Grand Total Technique Pain after treatment 9 out of 30 (30%) still had pain after Gonstead treatment and 11 out of 40 (27.5%) still had pain after Diversified, but is this difference statistically significant?
  • 101.
     To findE for cell a (and similarly for the rest) Yes No Row Total Gonstead 9 21 30 Diversified 11 29 40 Column Total 20 50 70 Grand Total Technique Pain after treatment Multiply row total Times column total Divide by grand total FIRST FIND THE EXPECTED VALUES FOR EACH CELL Expected (E) = Row total Χ Column total Grand total
  • 102.
    Evidence-based Chiropractic  FindE for all cells Yes No Row Total Gonstead 9 E = 30*20/70=8.6 21 E = 30*50/70=21.4 30 Diversified 11 E=40*20/70=11.4 29 E=40*50/70=28.6 40 Column Total 20 50 70 Grand Total Technique Pain after treatment
  • 103.
     Use theΧ2 formula with each cell and then add them together Χ2 = 0.0186 + 0.0168 + 0.0316 + 0.0056 = 0.0726 (9 - 8.6)2 8.6 (21 - 21.4)2 21.4 = 0.0186 0.0168 (11 - 11.4)2 11.4 (29 - 28.6)2 28.6 0.0316 0.0056
  • 104.
    Evidence-based Chiropractic o Finddf and then consult a Χ 2 table to see if statistically significant o There are two categories for each variable in this case, so df = 1 o Critical value at the 0.05 level and one df is 3.84 o Therefore, Χ 2 is not statistically significant Degree of freedom = (row - 1) (column - 1)
  • 105.
    Z TEST FORCOMPARING TWO PERCENTAGES p1=% in the 1st group. p2 = % in the 2nd group q1=100-p1 q2=100-p2 n1= sample size of 1st group n2=sample size of 2nd group . Z test is significant (at 0.05 level) if the result>2.
  • 106.
    Example: If the no.of anemic patients in group 1 which includes 50 patients is 5 & the no. of anemic patients in group 2 which contains 60 patients is 20. To find if groups 1 & 2 are statistically different in prevalence of anemia we calculate z test. P1=5/50=10%, p2=20/60=33%, q1=100-10=90, q2=100-33=67 Z=10 – 33/ √ 10x90/50 + 33x67/60 Z= 23 / √ 18 + 36.85 z= 23/ 7.4 z= 3.1 Therefore there is statistical significant difference between percentages of anemia in the studied groups (because z >2).
  • 107.
  • 108.
    CORRELATION & REGRESSION Correlationmeasures the closeness of the association between 2 continuous variables, while Linear regression gives the equation of the straight line that best describes & enables the prediction of one variable from the other.
  • 109.
  • 110.
    t-test for correlation is usedto test the significance of the association.
  • 112.
    CORRELATION IS NOTCAUSATION!!!
  • 113.
    LINEAR REGRESSION Same ascorrelation •Determine the relation & prediction of the change in a variable due to changes in other variable. •t-test is also used for the assessment of the level of significance. Differ than correlation •The independent factor has to be specified from the dependent variable. •The dependent variable in linear regression must be a continuous one. •Allows the prediction of dependent variable for a particular independent variable “But, should not be used outside the range of original data”.
  • 114.
    Evidence-based Chiropractic SCATTERPLOTS  AnX-Y graph with symbols that represent the values of two variables Regression line
  • 115.
    MULTIPLE REGRESSION  Thedependency of a dependent variable on several independent variables, not just one.  Test of significance used is the ANOVA. (F test).
  • 116.
    For example: ifneonatal birth weight depends on these factors: gestational age, length of baby and head circumference. Each factor correlates significantly with baby birth weight (i.e. has +ve linear correlation). We can do multiple regression analysis to obtain a mathematical equation by which we can predict the birth weight of any neonate if we know the values of these factors.