Advanced statistics

Prof. JOY V. LORIN-PICAR
DAVAO DEL NORTE STATE COLLEGE
NEW VISAYAS, PANABO CITY

TOPIC OUTLINE
PART 1
Role of Statistics in Research
Descriptive Statistics
Hands-On Statistical Software
Sample and Population
Sampling Procedures
Sample Size
Inferential Statistics
Hypothesis Testing

TOPIC OUTLINE
PART 2
Choice of Statistical Tests
Defining Independent and Dependent
Variables
Scales of Measurements
How many Samples / Groups are in the Design
PART 3
Parametric Tests
PART 4
Non-Parametric Tests

TOPIC OUTLINE
PART 5
Goodness of Fit
PART 6
Choosing the Correct Statistical Tests
Introduction to Multiple and Non-Linear
Regression

Role of Statistics in Research
Normally used to analyze data
To organize and make sense out of large amounts
of data
This is basic to the intelligent reading of research
articles
Has made significant contributions in the social sciences,
applied sciences, and even business and
economics
Statistical research makes inferences about
population characteristics on the basis of one or
more samples that have been studied.

How Is Statistics Looked Into?
1. Descriptive – this gives us information about,
or simply describes, the sample we are
studying.
2. Correlational – this enables us to relate
variables and establish relationships
between and among variables, which is
useful in making predictions.
3. Inferential – this goes beyond the
sample and makes inferences about the
population.

Descriptive Statistics
 N - total population/sample size from any given
population
Example
Minutes Spent on the Phone
102 124 108 86 103 82
71 104 112 118 87 95
103 116 85 122 87 100
105 97 107 67 78 125
109 99 105 99 101 92

Example 2
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Range, Mean, Median and Mode
The terms mean, median, mode, and range describe
properties of statistical distributions. In statistics, a distribution
is the set of all possible values for terms that represent defined
events. The value of a term, when expressed as a variable, is
called a random variable. There are two major types of
statistical distributions. The first type has a discrete random
variable. This means that every term has a precise, isolated
numerical value. An example of a distribution with a discrete
random variable is the set of results for a test taken by a class in
school. The second major type of distribution has a continuous
random variable. In this situation, a term can acquire any
value within an unbroken interval or span. Such a distribution
is described by a probability density function. This is the sort of
function that might, for example, be used by a computer in an
attempt to forecast the path of a weather system.

Mean
The most common expression for the mean of a statistical
distribution with a discrete random variable is the
mathematical average of all the terms. To calculate it,
add up the values of all the terms and then divide by the
number of terms. This expression is also called the
arithmetic mean. There are other expressions for the
mean of a finite set of terms but these forms are rarely
used in statistics. The mean of a statistical distribution
with a continuous random variable, also called the
expected value, is obtained by integrating the product of
the variable with its probability density as defined by the
distribution. The expected value is denoted by the
lowercase Greek letter mu (µ).
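As a minimal sketch of the arithmetic mean described above, the snippet below uses the "Minutes Spent on the Phone" values from the earlier example. The function name `arithmetic_mean` is only illustrative.

```python
# Minutes spent on the phone (the 30 values from the earlier example).
minutes = [
    102, 124, 108, 86, 103, 82,
    71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100,
    105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]


def arithmetic_mean(values):
    """Add up the values of all the terms, then divide by the number of terms."""
    return sum(values) / len(values)


mean_minutes = arithmetic_mean(minutes)
print(f"N = {len(minutes)}, sum = {sum(minutes)}, mean = {mean_minutes:.2f}")
# N = 30, sum = 2989, mean = 99.63
```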

Median
 The median of a distribution with a discrete random variable
depends on whether the number of terms in the distribution is
even or odd. If the number of terms is odd, then the median is
the value of the term in the middle. This is the value such that
the number of terms having values greater than or equal to it is
the same as the number of terms having values less than or
equal to it. If the number of terms is even, then the median is
the average of the two terms in the middle, such that the
number of terms having values greater than or equal to it is the
same as the number of terms having values less than or equal to
it. The median of a distribution with a continuous random
variable is the value m such that the probability is at least
1/2 (50%) that a randomly chosen point on the function
will be less than or equal to m, and the probability is at
least 1/2 that a randomly chosen point on the function will
be greater than or equal to m.
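The odd and even cases of the discrete median can be sketched as follows, again using the phone-minutes example. The helper name `median` is an illustrative choice; Python's standard `statistics.median` behaves the same way.

```python
def median(values):
    """Middle term if the count is odd; average of the two middle terms if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


minutes = [
    102, 124, 108, 86, 103, 82,
    71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100,
    105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

even_case = median(minutes)      # 30 terms: average of the 15th and 16th
odd_case = median(minutes[:5])   # first 5 terms only: the middle term
print(even_case, odd_case)
# 101.5 103
```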

Mode
The mode of a distribution with a discrete random
variable is the value of the term that occurs the most
often. It is not uncommon for a distribution with a
discrete random variable to have more than one mode,
especially if there are not many terms. This happens when
two or more terms occur with equal frequency, and more
often than any of the others. A distribution with two
modes is called bimodal. A distribution with three modes
is called trimodal. The mode of a distribution with a
continuous random variable is the value at which
the function reaches its maximum. As with discrete distributions, there may be
more than one mode.
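A short sketch of finding every mode of a discrete data set, using the phone-minutes example. The helper `modes` is illustrative; it returns all values tied for the highest frequency, so it also handles bimodal and trimodal data.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


minutes = [
    102, 124, 108, 86, 103, 82,
    71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100,
    105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

print(modes(minutes))
# [87, 99, 103, 105]  (four values each occur twice)
```

With only 30 terms, several values tie for the most frequent, which illustrates why small data sets often have more than one mode.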

RangeThe range of a distribution with a discrete
random variable is the difference between the
maximum value and the minimum value. For a
distribution with a continuous random variable,
the range is the difference between the two
extreme points on the distribution curve,
where the value of the function falls to zero.
For any value outside the range of a distribution,
the value of the function is equal to 0.
The least reliable of the measure and is use
only when one is in a hurry to get a measure
of variability
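For a discrete data set the range is a one-line computation. The sketch below applies it to the phone-minutes example; the function name `data_range` is illustrative.

```python
def data_range(values):
    """Difference between the maximum value and the minimum value."""
    return max(values) - min(values)


minutes = [
    102, 124, 108, 86, 103, 82,
    71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100,
    105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

print(max(minutes), min(minutes), data_range(minutes))
# 125 67 58
```

Because it depends on only two observations, a single extreme value changes the range completely, which is one reason it is considered the least reliable measure.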

Standard Deviation
The standard deviation formula is very simple: it
is the square root of the variance. It is the most
commonly used measure of spread.
An important attribute of the standard deviation
as a measure of spread is that if the mean and
standard deviation of a normal distribution
are known, it is possible to compute the
percentile rank associated with any given score.

Standard Deviation
In a normal distribution, about 68% of the
scores are within one standard deviation of the
mean and about 95% of the scores are within
two standard deviations of the mean.
The standard deviation has proven to be an
extremely useful measure of spread in part
because it is mathematically tractable. Many
formulas in inferential statistics use the
standard deviation.
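The sketch below computes a standard deviation for the phone-minutes example and checks the 68% and 95% figures with Python's `statistics.NormalDist`. Two assumptions are made here that the slides do not state: the variance uses the sample (n - 1) denominator, and the percentile rank treats the data as normally distributed. The score of 110 is a hypothetical value chosen only for illustration.

```python
import math
from statistics import NormalDist

minutes = [
    102, 124, 108, 86, 103, 82,
    71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100,
    105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

n = len(minutes)
mean = sum(minutes) / n
# Assumption: sample variance (divide by n - 1), not population variance.
variance = sum((x - mean) ** 2 for x in minutes) / (n - 1)
sd = math.sqrt(variance)  # standard deviation = square root of the variance

# Share of a normal distribution within 1 and 2 standard deviations of the mean.
z = NormalDist()  # standard normal: mean 0, standard deviation 1
within_1_sd = z.cdf(1) - z.cdf(-1)
within_2_sd = z.cdf(2) - z.cdf(-2)

# Percentile rank of a hypothetical score of 110, assuming normality.
percentile_rank = NormalDist(mean, sd).cdf(110) * 100

print(f"sd = {sd:.2f}")
print(f"within 1 sd: {within_1_sd:.1%}, within 2 sd: {within_2_sd:.1%}")
print(f"percentile rank of 110: {percentile_rank:.0f}")
```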

KURTOSIS - refers to how sharply peaked
a distribution is. A value for kurtosis is included
with the graphical summary:
· Values close to 0 indicate normally peaked
data.
· Negative values indicate a distribution that is
flatter than normal.
· Positive values indicate a distribution with a
sharper than normal peak.
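One common way to obtain a kurtosis value that reads as described above (0 for normally peaked data) is the moment-based excess kurtosis, m4 / m2^2 - 3. This is an assumption about the formula; statistical packages often apply a small-sample correction, so their output can differ slightly. Both data sets below are made up for illustration.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (the normal value)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                      # evenly spread, no peak
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]    # sharp peak with two extreme values

print(round(excess_kurtosis(flat), 2))    # -1.3 -> flatter than normal
print(round(excess_kurtosis(peaked), 2))  # 2.0  -> sharper than normal
```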

Samples and Population
Population – as used in research, refers to all
the members of a particular group.
It is the group of interest to the researcher
This is the group to whom the researcher
would like to generalize the results of a
study

 A target population is the actual population to
whom the researcher would like to generalize
 Accessible population is the population to whom
the researcher is entitled to generalize

SAMPLING
This is the process of selecting the individuals
who will participate in a research study.
A sample is any part of the population of individuals from whom
information is obtained.
A representative sample is a sample that is similar to
the population to whom the researcher is entitled
to generalize

PROBABILITY AND NON-PROBABILITY
SAMPLING
A sampling procedure that gives every element of
the population a (known) nonzero chance of
being selected in the sample is called probability
sampling. Otherwise, the sampling procedure is
called non-probability sampling.
Whenever possible, probability sampling is
used because there is no objective way of
assessing the reliability of inferences under
non-probability sampling.

METHODS OF PROBABILITY
SAMPLING
1. simple random sampling
2. systematic sampling
3. stratified sampling
4. cluster sampling
5. two-stage random sampling

Simple Random Sampling
This is a sample selected from
a population in such a manner
that all members of the
population have an equal
chance of being selected
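A minimal sketch of simple random sampling with Python's standard `random` module. The sampling frame of 500 numbered students, the sample size, and the seed are hypothetical values used only to make the example reproducible.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical sampling frame of 500 numbered students.
population = [f"student_{i:03d}" for i in range(1, 501)]
sample = simple_random_sample(population, n=30, seed=2024)

print(len(sample))
print(sample[:5])
```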

Stratified Random Sampling
Sample selected so that certain
characteristics are represented in
the sample in the same proportion
as they occur in the population
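Proportional stratified sampling can be sketched as below: each stratum contributes to the sample in the same proportion as it occurs in the population. The strata (600 females, 400 males), the sample size, and the function name are hypothetical.

```python
import random


def proportional_stratified_sample(strata, n, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounded allocation; with awkward proportions the total may differ
        # from n by a unit or two.
        k = round(n * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population: 60% female, 40% male.
strata = {
    "female": [f"F{i}" for i in range(600)],
    "male": [f"M{i}" for i in range(400)],
}
sample = proportional_stratified_sample(strata, n=50, seed=1)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
# {'female': 30, 'male': 20}
```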

Cluster Random Sample
This is obtained by using
groups as the sampling unit
rather than individuals.

Two-Stage Random Sample
Selects groups randomly and
then chooses individuals
randomly from these groups.
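The difference between cluster sampling and two-stage sampling can be sketched as follows. The eight classes of 40 students, the numbers drawn, and the function names are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: select whole groups and keep every member of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Two-stage sampling: select groups, then individuals within each group."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}


rng = random.Random(7)
# Hypothetical population: 8 classes with 40 students each.
clusters = {
    f"class_{c}": [f"class_{c}_student_{i}" for i in range(1, 41)]
    for c in "ABCDEFGH"
}

by_cluster = cluster_sample(clusters, 2, rng)
by_two_stage = two_stage_sample(clusters, 3, 10, rng)

print({name: len(members) for name, members in by_cluster.items()})
print({name: len(members) for name, members in by_two_stage.items()})
```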

Non-Probability Sampling
1. accidental or convenience
sampling
2. purposive sampling
3. quota sampling
4. snowball or referral sampling
 5. systematic sampling

Systematic Sample
This is obtained by selecting
every nth name in a population
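A sketch of selecting every nth name from a list. The random starting point is an assumption added here (the slide only says "every nth name"), and the frame of 200 ID numbers is hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick a start within the first interval, then take every k-th entry."""
    k = len(frame) // n                         # sampling interval
    start = random.Random(seed).randrange(k)    # assumed random start, 0..k-1
    return [frame[start + i * k] for i in range(n)]


# Hypothetical list of 200 ID numbers; a sample of 20 gives an interval of 10.
frame = list(range(1, 201))
sample = systematic_sample(frame, n=20, seed=3)
print(sample)
```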

Convenience Sampling
Any group of individuals that
is conveniently available to be
studied

Purposive Sampling
Consists of individuals who
have special qualifications of
some sort or are deemed
representative on the basis of
prior evidence

Quota Sampling
In quota sampling, the population is first
segmented into mutually exclusive sub-groups,
just as in stratified sampling. Then judgment is
used to select the subjects or units from each
segment based on a specified proportion. For
example, an interviewer may be told to sample
200 females and 300 males between the ages of
45 and 60. This means that interviewers can
decide whom they want to sample
(targeting).

Snowball Sampling
Snowball sampling is a technique for developing a
research sample where existing study subjects recruit
future subjects from among their acquaintances. Thus
the sample group appears to grow like a rolling
snowball. As the sample builds up, enough data is
gathered to be useful for research. This sampling
technique is often used in hidden populations which
are difficult for researchers to access; example
populations would be drug users or prostitutes. As
sample members are not selected from a sampling
frame, snowball samples are subject to numerous
biases

General Classification of
Collecting Data
1. Census or complete enumeration – the
process of gathering information from every unit
in the population.
- not always possible to get timely, accurate, and
economical data
- costly if the number of units in the population is
too large
2. Survey sampling – the process of obtaining
information from the units in the selected sample.
Advantages: reduced cost, greater speed, greater
scope, and greater accuracy

Sample size
Samples should be as large as a researcher can
obtain with a reasonable expenditure of time and
energy.
As suggested, the minimum number of subjects is 100
for a descriptive study, 50 for a correlational study,
and 30 in each group for experimental and causal-
comparative designs.
According to Padua, for p parameters, the minimum n
can be computed as n >= p(p + 3)/2, where p is the
number of parameters; for example, if p = 4, the minimum n = 14.
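The rule attributed to Padua can be checked with a few lines of code. The function name `minimum_sample_size` is illustrative; the formula is the one quoted above.

```python
import math


def minimum_sample_size(p):
    """Minimum n for p parameters: n >= p(p + 3) / 2."""
    return math.ceil(p * (p + 3) / 2)


for p in (2, 4, 6):
    print(p, minimum_sample_size(p))
# 2 5
# 4 14
# 6 27
```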

Inferential Statistics
These are formalized techniques used to draw
conclusions about populations based on samples
taken from those populations.

Hypothesis
Hypothesis is defined as the tentative theory or
supposition provisionally adopted to explain certain facts
and to guide in the investigation of others.
A statistical hypothesis is an assertion or statement that
may or may not be true concerning one or more
populations.
Example:
1. A leading drug in the treatment of hypertension has an
advertised therapeutic success rate of 83%. A medical
researcher believes he has found a new drug for treating
hypertensive patients that has a higher therapeutic success
rate than the leading drug, with fewer side
effects.

The Statistical Hypothesis :
H0: The new drug is no better than the old one (p
= 0.83)
H1: The new drug is better than the old one (p > 0.83)
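One way to test these two hypotheses is an exact one-sided binomial test, sketched below. The slides do not name a test for this example, so the choice of test is an assumption, and the trial figures (180 successes among 200 patients) are hypothetical, not data from the source.

```python
from math import comb


def binomial_upper_tail(successes, n, p0):
    """P(X >= successes) when X ~ Binomial(n, p0), i.e. assuming H0 is true."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


P0 = 0.83      # success rate under H0 (the leading drug)
ALPHA = 0.05   # chosen level of significance

# Hypothetical trial result for the new drug.
n_patients = 200
n_successes = 180

p_value = binomial_upper_tail(n_successes, n_patients, P0)
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

A small p-value says that a success count this high would be unlikely if the new drug were truly no better than the old one.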
Example 2. A social researcher is conducting a study
to determine if the level of women’s participation in
community extension programs of the barangay can
be affected by their educational attainment ,
occupation, income, civil status, and age.


H0: The level of women’s participation in community
extension programs is not affected by their educational
attainment, occupation, income, civil status, and age.
H1: The level of women’s participation in community
extension programs is affected by their educational attainment,
occupation, income, civil status, and age.
Example 3: A community organizer wants to compare
the three community organizing strategies applied to
cultural minorities in terms of effectiveness.

A. Hypothesis Testing
Steps in Hypothesis Testing
1. Formulate the null hypothesis and
the alternative hypothesis
- these are the statistical hypotheses,
which are assumptions or guesses
about the population involved. In
short, these are statements about
the probability distributions of the
populations

Null Hypothesis
This is a hypothesis of “no effect”.
It is usually formulated for the express
purpose of being rejected, that is, it is the
negation of the point one is trying to
make.
This is the hypothesis that two or more
variables are not related or that two or
more statistics are not significantly
different.

Alternative Hypothesis
This is the operational statement of
the researcher’s hypothesis
This is the hypothesis derived from the
theory of the investigator; it
generally states a specified relationship
between two or more variables or that
two or more statistics significantly
differ.

Two Ways of Stating the
Alternative Hypothesis
1. Predictive - specifies the type of relationship
existing between two or more variables (direct or
indirect) or specifies the direction of the difference
between two or more statistics
2. Non- Predictive - does not specify the type of
relationship or the direction of the difference

C. LEVEL OF SIGNIFICANCE (α)
α is the maximum probability with which we
would be willing to risk a Type I error (the
null hypothesis is inappropriately rejected).
A Type I error is the error of rejecting a null hypothesis when it
is actually true. Plainly speaking, it occurs
when we observe a difference when in
truth there is none, thus indicating a test of
poor specificity. An example of this would be
a test that shows that a woman is pregnant when in
reality she is not.

In other words, the level of significance determines
the risk a researcher would be willing to take in his
test.
The choice of alpha is primarily dependent on the
practical application of the result of the study.

Examples of α
.05 (95 % confident of the claim)
.01 (99 % confident of the claim)
 But take note, α is not always .05 or .01. This could
be computed mathematically from the
formula:
where the variance, the number of samples, and the
difference are predetermined (Chebyshev's sample
size formula).

D. Defining a Region of Rejection
The region of rejection is a region of
the null sampling distribution. It
consists of a set of possible values which
are so extreme that when the null
hypothesis is true the probability is
small (i.e. equal to alpha) that the
sample we observe will yield a value
which is among them.
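For a test statistic that follows the standard normal distribution under the null hypothesis (an assumption; other statistics use other sampling distributions), the region of rejection can be sketched with `statistics.NormalDist`. The function names and the sample z values are illustrative.

```python
from statistics import NormalDist


def critical_z(alpha, tails=1):
    """Cut-off beyond which a standard normal statistic falls with probability alpha."""
    return NormalDist().inv_cdf(1 - alpha / tails)


def in_rejection_region(z, alpha, alternative="greater"):
    """True when the observed z lies in the region of rejection."""
    if alternative == "greater":
        return z > critical_z(alpha, tails=1)
    if alternative == "less":
        return z < -critical_z(alpha, tails=1)
    return abs(z) > critical_z(alpha, tails=2)  # two-sided alternative


print(round(critical_z(0.05, tails=1), 3))  # 1.645
print(round(critical_z(0.05, tails=2), 3))  # 1.96
print(in_rejection_region(2.10, 0.05))                # True
print(in_rejection_region(1.80, 0.05, "two-sided"))   # False
```

The same observed value (z = 1.80) is rejected by a one-tailed test at α = .05 but not by a two-tailed test, because the two-tailed region splits α between both extremes.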

E. Collect the data and compute
the value of the test statistic.
F. State your decision.
G. State your conclusion.

B. Choose an Appropriate Statistical Test for
testing the Null Hypothesis
The choice of a statistical test for the analysis
of your data requires careful and deliberate
judgment.
PRIMARY CONSIDERATIONS:
The choice of a statistical test is dictated by
the questions for which the research is
designed
The level of measurement, the distribution, and the dispersion of
the data also suggest the type of statistical test to
be used

SECONDARY CONSIDERATIONS
The extent of your knowledge in
statistics
Availability of resources in
connection with the computation
and interpretation of data

Choice of Statistical Tests
This is designed to help you
develop a framework for choosing
the correct statistic to test your
hypothesis.
 It begins with a set of questions
you should ask when selecting your
test.
It is followed by demonstrations of
the factors that are important to
consider when choosing your
statistic.

Presented below are four
questions you should ask and
answer when trying to determine
which statistical procedure is most
appropriate to test your
hypothesis.

What are the independent and
dependent variables?
What is the scale of measurement of
the study variables?
How many samples/groups are in
the design?
Have I met the assumptions of the
statistical test selected?

To determine which test should be
used in any given circumstance, we
need to consider the hypothesis that
is being tested, the independent and
dependent variables and their scale of
measurement, the study design, and
the assumptions of the test.

Variables
Before we can begin to choose our
statistical test, we must determine
which is the independent and which is
the dependent variable in our
hypothesis.
Our dependent variable is always the
phenomenon or behavior that we want
to explain or predict.

Defining Independent and Dependent
Variables
The independent variable represents a
predictor or causal variable in the
study.
In any antecedent-consequent
relationship, the antecedent is the
independent variable and the
consequent is the dependent variable.

Variables
With single samples and one dependent
variable, the one-sample Z test, the one-
sample t test, and the chi-square goodness-of-
fit test are the only statistics that can be used.
Students sometimes ask, "but don't you have
population data too, so you have two sets of
data?" Yes and no.
Data have to exist, or else the population
parameters could not be defined. But the researcher
does not collect these data; they already exist.

Variables
So, if you are collecting data on one sample
and comparing those data to information
that has already been gathered and is
published, then you are conducting a one-
sample test using the one sample/set of
data collected in this study.
For the chi-square goodness-of-fit test, you
can also compare the sample against chance
probabilities
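Comparing one sample against chance probabilities can be sketched with the chi-square goodness-of-fit statistic, the sum of (observed - expected)^2 / expected. The die-roll counts below are hypothetical. Only the statistic is computed; judging significance requires a chi-square table or a statistics package.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 60 rolls of a die.
observed = [8, 12, 10, 10, 9, 11]
# Chance expectation: each face equally likely.
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1
print(f"chi-square = {chi2:.2f}, df = {df}")
# chi-square = 1.00, df = 5
# The tabled critical value for df = 5 at alpha = .05 is 11.07,
# so these counts are consistent with chance.
```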

Variables
When we have a single sample and
independent and dependent variables
measured on all subjects, we typically are
testing a hypothesis about the association
between two variables. The statistics that we
have learned to test hypotheses about
association include:
chi-square test of independence
Spearman's rs
Pearson's r
bivariate regression and multiple regression
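Pearson's r and Spearman's rs can both be sketched in plain Python: rs is simply Pearson's r computed on the ranks of the scores. The paired scores are hypothetical, and tied values are given their average rank, which is a common convention assumed here.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical paired scores: y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"Pearson r = {pearson_r(x, y):.3f}, Spearman rs = {spearman_rs(x, y):.3f}")
```

Here rs reaches 1 because the ordering of the two variables agrees perfectly, while r stays below 1 because the relationship is not linear.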

Multiple Sample Tests
Studies that refer to repeated measurements or
pairs of subjects typically collect at least two sets
of scores. Studies that refer to specific subgroups
in the population also collect two or more samples
of data. Once you have determined that the
design uses two or more samples or "groups", then
you must determine how many samples or groups
are in the design. Studies that are limited to two
groups use either the chi-square statistic, Mann-
Whitney U, Wilcoxon test, independent means t
test, or the dependent means t test.

If you have three or more groups in the
design, use the chi-square statistic, Kruskal-
Wallis H Test, Friedman ANOVA for ranks,
One-way Between-Groups ANOVA, or
Factorial ANOVA, depending on the nature
of the relationship between groups. Some of
these tests are designed for dependent or
correlated samples/groups and some are
designed for samples/groups that are
completely independent.

Dependent Means
Dependent groups refer to some type of
association or link in the research design
between sets of scores. This usually occurs
in one of three conditions -- repeated
measures, linked selection, or matching.
Repeated measures designs collect data on
subjects using the same measure on at least
two occasions. This often occurs before and
after a treatment or when the same research
subjects are exposed to two different
experimental conditions.

When subjects are selected into the study because of
natural "links or associations", we want to analyze the
data together. This would occur in studies of parent-
infant interaction, romantic partners, siblings, or best
friends. In a study of parents and their children, a
parent’s data should be associated with his son's, not
some other child's. Subject matching also produces
dependent data. Suppose that an investigator wanted
to control for socioeconomic differences in research
subjects. She might measure socioeconomic status
and then match on that variable. The scores on the
dependent variable would then be treated as a pair in
the statistical test.

All statistical procedures for dependent or
correlated groups treat the data as linked,
therefore it is very important that you
correctly identify dependent groups
designs. The statistics that can be used for
correlated groups are the McNemar Test
(two samples or times of measurement),
Wilcoxon T Test (two samples), Dependent
Means t Test (two samples), Friedman
ANOVA for Ranks (three or more samples),
Simple Repeated Measures ANOVA (three
or more samples).
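To show how dependent-groups procedures treat the data as linked, here is a sketch of the dependent means t statistic, which works on the difference within each pair rather than on the two sets of scores separately. The before/after scores are hypothetical, and only the statistic and its degrees of freedom are computed.

```python
import math


def dependent_means_t(first, second):
    """t statistic for paired scores: mean difference / its standard error."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    standard_error = math.sqrt(var_d / n)
    return mean_d / standard_error, n - 1


# Hypothetical scores for the same five subjects before and after a treatment.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 16]

t, df = dependent_means_t(before, after)
print(f"t = {t:.3f}, df = {df}")
# t = 6.325, df = 4
```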

Independent Means
When there is no subject overlap across groups, we define
the groups as independent. Tests of gender differences are
a good example of independent groups. We cannot be
both male and female at the same time; the groups are
completely independent. If you want to determine
whether samples are independent or not, ask yourself,
"Can a person be in one group at the same time he or she
is in another?" If the answer is no (can't be in a remedial
education program and a regular classroom at the same
time; can't be a freshman in high school and a sophomore
in high school at the same time), then the groups are
independent.

The statistics that can be used for
independent groups include the chi-
square test of independence (two or
more groups), Mann-Whitney U Test
(two groups), Independent Means t
test (two groups), One-Way Between-
Groups ANOVA (three or more
groups), and Factorial ANOVA (two or
more independent variables).
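For contrast with the paired case, the independent means t statistic compares two separate groups with no links between their members. The sketch below uses the pooled-variance form, which assumes equal population variances; the two small groups are hypothetical.

```python
import math


def independent_means_t(group_a, group_b):
    """Pooled-variance t statistic for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / n_a, sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    df = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / df  # assumes equal population variances
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Hypothetical scores from two groups with no subject overlap.
group_a = [5, 7, 9]
group_b = [1, 3, 5]

t, df = independent_means_t(group_a, group_b)
print(f"t = {t:.3f}, df = {df}")
# t = 2.449, df = 4
```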

Scales of Measurements
Once we have identified the independent
and dependent variables, our next step in
choosing a statistical test is to identify the
scale of measurement of the variables.
All of the parametric tests that we have
learned to date require an interval or ratio
scale of measurement for the dependent
variable.

Scales of Measurements
If you are working with a dependent
variable that has a nominal or ordinal
scale of measurement, then you must
choose a nonparametric statistic to
test your hypothesis

How many Samples / Groups are in the
Design
Once you have identified the scale of
measurement of the dependent variable,
you want to determine how many samples
or "groups" are in the study design.
Designs for which one-sample tests (e.g.,
Z test; t test; Pearson and Spearman
correlations; chi-square goodness-of-fit)
are appropriate collect only one set or
"sample" of data.

How many Samples / Groups are in the
Design
There must be at least two sets of
scores or two "samples" for any
statistic that examines differences
between groups (e.g. , t test for
dependent means; t test for
independent means; one-way ANOVA;
Friedman ANOVA; chi-square test of
independence) .

Parametric Tests
Parametric statistics are used when our
data are measured on interval or ratio
scales of measurement
Tend to need larger samples
Data should fit a particular distribution;
otherwise the data are transformed to fit that
distribution
Samples are normally drawn randomly
from the population
Follow the assumption of normality –
meaning the data are normally distributed.

Parametric Assumptions
Listed below are the most frequently
encountered assumptions for parametric tests.
Statistical procedures are available for testing
these assumptions.
The Kolmogorov-Smirnov Test is used to
determine how likely it is that a sample came
from a population that is normally distributed.

Parametric Assumptions
The Levene test is used to test the assumption of
equal variances.
If we violate test assumptions, the statistic chosen
cannot be applied. In this circumstance we have
two options:
We can use a data transformation
We can choose a nonparametric statistic
If data transformations are selected, the
transformation must correct the violated assumption.
If successful, the transformation is applied and the
parametric statistic is used for data analysis.

Types of Parametric Tests
Z test
One-Sample t test
t test for dependent means
t test for independent means
One-way ANOVA
Factorial ANOVA
Pearson’s r
Bivariate/Multiple regression

Non-Parametric TestsInference procedures which are likely
distribution free.
Nonparametric statistics are used when our
data are measured on a nominal or ordinal
scale of measurement.
All other nonparametric statistics are
appropriate when data are measured on an
ordinal scale of measurement.
Example to this is the sign tests. These are
tests designed to draw inferences about
medians.

Types of Non-parametric Tests
Sign Test
Chi-square statistics and their
modifications (e.g., McNemar Test) are
used for nominal data.
Wilcoxon Test – alternative to the
t test for dependent means
Kruskal-Wallis Test – alternative to
one-way ANOVA
Friedman Test – alternative to repeated measures ANOVA

Choosing the Correct Statistical
Tests
Summary
Five issues must be considered when
choosing statistical tests.
Scale of measurement
Number of samples/groups
Nature of the relationship between
groups
Number of variables
Assumptions of statistical tests
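The issues listed above can be sketched as a small lookup table for designs with two or more groups. This is only an illustration built from the tests named in these slides, not a complete decision rule: the keys, labels, and function name are hypothetical, and the assumptions of whichever test is suggested still have to be checked.

```python
# Keys: (scale of the dependent variable, number of groups, relationship).
TESTS = {
    ("nominal", "two", "independent"): "chi-square test of independence",
    ("nominal", "three or more", "independent"): "chi-square test of independence",
    ("nominal", "two", "dependent"): "McNemar test",
    ("ordinal", "two", "independent"): "Mann-Whitney U test",
    ("ordinal", "two", "dependent"): "Wilcoxon test",
    ("ordinal", "three or more", "independent"): "Kruskal-Wallis H test",
    ("ordinal", "three or more", "dependent"): "Friedman ANOVA for ranks",
    ("interval/ratio", "two", "independent"): "t test for independent means",
    ("interval/ratio", "two", "dependent"): "t test for dependent means",
    ("interval/ratio", "three or more", "independent"): "one-way between-groups ANOVA",
    ("interval/ratio", "three or more", "dependent"): "simple repeated measures ANOVA",
}


def suggest_test(scale, groups, relationship):
    """Look up a candidate test; combinations not in the table are not guessed."""
    return TESTS.get((scale, groups, relationship), "not covered in this outline")


print(suggest_test("ordinal", "two", "independent"))
print(suggest_test("interval/ratio", "three or more", "dependent"))
print(suggest_test("nominal", "three or more", "dependent"))
```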

Introduction to Multiple and Non-
Linear Regression

Hands-On Statistical Software

Thank you very much!
Hope you are now
ready to conduct your
study
