This document discusses validity, reliability, and sampling methods in research. It defines validity as measuring what a study aims to measure, and reliability as the consistency of measurement. Reliability is necessary for validity, but reliability alone does not ensure validity. The document discusses various types of validity, including content, criterion, and construct validity. It also discusses reliability measures such as internal consistency, split-half, test-retest, and alternative-forms reliability. Finally, it summarizes various sampling methods, including probability and non-probability techniques such as simple random sampling, stratified sampling, cluster sampling, and their procedures.
4. Validity, Reliability, and Their Relationship
Validity is the degree to which a study measures what it was designed to measure. It deals with the quality of measurement. Reliability is the extent to which a variable is consistent in measuring what it is intended to measure; in other words, it is the consistency, dependability, or repeatability of measures.
Relationship between validity and reliability
Reliability does not necessarily tell whether the measurement is measuring what is supposed to be measured. Validity addresses the issue of what should be measured, whereas reliability is related to how it is measured. Therefore, in order to minimize measurement error, both reliability and validity are examined. A measure may be reliable but not valid, but it cannot be valid without being reliable. That is, reliability is a necessary, but not sufficient, condition for validity.
Validity
Content validity addresses whether the scales adequately measure the domain content of the construct. It is a subjective but systematic evaluation of how well the content of a scale represents the measurement task at hand. There is no objective statistical test to evaluate content validity; researchers must carefully use specified theoretical descriptions of the construct to judge it.
Criterion validity reflects whether a scale performs as expected in relation to other variables selected as meaningful criteria (criterion variables); that is, whether the proposed measures exhibit generally the same direction and magnitude of correlation with other variables as measures that have already been accepted within the social science community.
Construct validity
The establishment of construct validity involves two major subdomains: convergent validity and discriminant validity.
Convergent validity is the extent to which the scale correlates positively with other measures of the same construct.
To test for convergent validity, we can use factor analysis (common factor analysis, i.e., principal axis factoring) and examine the factor loadings and the significance level of each construct. When the factor loadings on the intended constructs are all higher than .50, convergent validity has been achieved. We can also use the average variance extracted (AVE) to test convergent validity.
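As a numerical sketch, AVE can be computed directly from standardized factor loadings. A minimal Python illustration (the loadings below are hypothetical; the usual benchmark is AVE of at least .50):

```python
# Average variance extracted (AVE) from standardized factor loadings.
# AVE is the mean of the squared loadings; AVE >= .50 (and each
# loading > .50) is conventional evidence of convergent validity.
def ave(loadings):
    return sum(l ** 2 for l in loadings) / len(loadings)

loadings = [0.72, 0.81, 0.65, 0.70]     # hypothetical 4-item scale
print(all(l > 0.50 for l in loadings))  # -> True (loadings criterion)
print(round(ave(loadings), 3))          # -> 0.522 (AVE criterion)
```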
Discriminant validity is the extent to which a measure does not
correlate with other constructs from which it is supposed to
differ. In other words, it describes the degree to which one
construct is not similar to any other construct that is
theoretically distinct.
To test for discriminant validity, confirmatory factor analysis (CFA) can be used.
Reliability
Reliability can be defined as the extent to which measures are
free from random error. Researchers must demonstrate
instruments are reliable since without reliability, research
results using the instrument are not replicable.
Reliability is estimated in one of four ways:
Internal consistency
Split-half reliability
Test-retest reliability
Alternative forms
6. Reliability
Internal consistency reliability: estimation based on the
correlation among the variables comprising the set (typically,
Cronbach's alpha).
Split-half reliability: estimation based on the correlation between scores on the two halves of the scale.
Test-retest reliability: estimation based on the correlation between two (or more) administrations of the same item, scale, or instrument at different times, when the two administrations do not differ on other relevant variables.
Alternative-forms reliability: two equivalent forms of the scale
are constructed and the same respondents are measured at two
different times, with a different form being used each time.
Cronbach’s alpha
Cronbach’s alpha, the coefficient of reliability, is frequently
used to measure internal consistency and stability of an
instrument (Churchill, 1979). It is the average of all possible
split-half coefficients resulting from different ways of splitting
the scale items.
Cronbach’s alpha varies from 0 to 1, and a value of 0.6 or less
generally indicates unsatisfactory internal consistency
reliability.
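Because alpha can also be computed from the item variances and the variance of the summed scale, it is easy to sketch outside SPSS. A minimal Python illustration with hypothetical data for a 3-item scale and 5 respondents:

```python
def cronbach_alpha(items):
    """items: one list of scores per scale item (same respondents in each)."""
    k = len(items)

    def var(xs):  # population variance; the divisor cancels in the ratio
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sums
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

# Hypothetical 3-item scale, 5 respondents (scores 1-5).
items = [[4, 5, 3, 4, 2],
         [4, 4, 3, 5, 2],
         [5, 5, 2, 4, 3]]
alpha = cronbach_alpha(items)
print(round(alpha, 3))   # -> 0.886
print(alpha > 0.6)       # satisfactory by the document's cutoff
```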
7. The Sampling Design Process
Define the Population
Determine the Sampling Frame
Select Sampling Technique(s)
Determine the Sample Size
Execute the Sampling Process
Define the Target Population
The target population is the collection of elements or objects
that possess the information sought by the researcher and about
which inferences are to be made. The target population should
be defined in terms of elements, sampling units, extent, and
time.
An element is the object about which or from which the
information is desired, e.g., the respondent.
A sampling unit is an element, or a unit containing the element,
that is available for selection at some stage of the sampling
process.
Extent refers to the geographical boundaries.
Time is the time period under consideration.
Classification of Sampling Techniques
Nonprobability sampling techniques: convenience sampling, judgmental sampling, quota sampling, and snowball sampling.
Probability sampling techniques: simple random sampling, systematic sampling, stratified sampling, cluster sampling, and other sampling techniques.
Convenience Sampling
Convenience sampling attempts to obtain a sample of
convenient elements. Often, respondents are selected because
they happen to be in the right place at the right time.
use of students, and members of social organizations
department stores using charge account lists
A Graphical Illustration of Convenience Sampling
Group D happens to assemble at a convenient time and place.
So all the elements in this Group are selected. The resulting
sample consists of elements 16, 17, 18, 19 and 20. Note, no
elements are selected from group A, B, C and E.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
11. Judgmental Sampling
Judgmental sampling is a form of convenience sampling in
which the population elements are selected based on the
judgment of the researcher.
test markets
purchase engineers selected in industrial marketing research
expert witnesses used in court
Graphical Illustration of Judgmental Sampling
The researcher considers groups B, C and E to be typical and convenient. Within each of these groups one or two elements are selected based on typicality and convenience. The resulting sample consists of elements 8, 10, 11, 13, and 24. Note, no elements are selected from groups A and D.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
13. Quota Sampling
Quota sampling may be viewed as two-stage restricted
judgmental sampling.
The first stage consists of developing control categories, or
quotas, of population elements.
In the second stage, sample elements are selected based on
convenience or judgment.
Control            Population      Sample
characteristic     composition     composition
                   Percentage      Percentage   Number
Sex
  Male                 48              48          480
  Female               52              52          520
                      ___             ___         ____
                      100             100         1000
What sampling technique do you recommend for the DuPont case?
Quota samples are most applicable for mall intercept interviews because they allow for more precision than regular judgmental sampling, and mall intercept interviews are inherently non-probabilistic.
We can create control categories along age groups. For example
Age %
22–30 20
31–45 43
45–60 18
60+ 19
These percentages represent the share of the sample that should be obtained from each category. Respondents are approached in the mall with the goal of achieving this age distribution.
In this case, we also want to bias our selection toward women, since they purchase most carpeting. Thus, we should purposely target women in these age groups at a 2 to 1 ratio to men.
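Turning such quotas into interviewer targets is simple arithmetic. A sketch, assuming a hypothetical total sample of 1,000 and the 2-to-1 women-to-men targeting described above:

```python
# Hypothetical interviewer targets from the age quotas above, assuming a
# total sample of 1,000 and a 2-to-1 women-to-men targeting ratio.
age_quota = {"22-30": 0.20, "31-45": 0.43, "45-60": 0.18, "60+": 0.19}
n = 1000
female_share = 2 / 3   # 2 women for every 1 man

targets = {}
for group, pct in age_quota.items():
    group_n = round(n * pct)
    women = round(group_n * female_share)
    targets[group] = {"women": women, "men": group_n - women}

print(targets["31-45"])   # -> {'women': 287, 'men': 143}
```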
A Graphical Illustration of
Quota Sampling
A quota of one element from each group, A to E, is imposed.
Within each group, one element is selected based on judgment
or convenience. The resulting sample consists of elements 3, 6,
13, 20 and 22. Note, one element is selected from each column
or group.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
Snowball Sampling
In snowball sampling, an initial group of respondents is selected, usually at random.
After being interviewed, these respondents are asked to identify others who belong to the target population of interest.
Subsequent respondents are selected based on these referrals.
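The referral process can be sketched as a small simulation. The referral network below is hypothetical, mirroring the graphical illustration (seeds 2 and 9; referrals 12, 13, and 18):

```python
import random

# Hypothetical referral network (element -> elements it can refer).
referrals = {2: [12, 13], 9: [18], 12: [], 13: [], 18: []}

def snowball(seed_pool, net, n_seeds, rng):
    """Draw random seeds, then add everyone they refer (one wave)."""
    sample = rng.sample(seed_pool, n_seeds)
    for respondent in list(sample):
        sample.extend(net.get(respondent, []))
    return sorted(sample)

print(snowball([2, 9], referrals, 2, random.Random(0)))  # -> [2, 9, 12, 13, 18]
```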
A Graphical Illustration of
Snowball Sampling
Elements 2 and 9 are selected randomly from groups A and B. Element 2 refers elements 12 and 13. Element 9 refers element 18. The resulting sample consists of elements 2, 9, 12, 13, and 18. Note, there are no elements from group E.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
Probability sampling techniques include simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
Simple Random Sampling
Each element in the population has a known and equal
probability of selection.
Each possible sample of a given size (n) has a known and equal
probability of being the sample actually selected.
This implies that every element is selected independently of
every other element.
A Graphical Illustration of
Simple Random Sampling
Select five random numbers from 1 to 25. The resulting sample
consists of population elements 3, 7, 9, 16, and 24.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
22. Systematic Sampling
The sample is chosen by selecting a random starting point and
then picking every ith element in succession from the sampling
frame.
The sampling interval, i, is determined by dividing the
population size N by the sample size n and rounding to the
nearest integer.
When the ordering of the elements is related to the
characteristic of interest, systematic sampling increases the
representativeness of the sample.
Systematic Sampling
If the ordering of the elements produces a cyclical pattern,
systematic sampling may decrease the representativeness of the
sample.
For example, there are 100,000 elements in the population and a sample of 1,000 is desired. In this case the sampling interval, i, is 100. A random number between 1 and 100 is selected. If, for example, this number is 23, the sample consists of elements 23, 123, 223, 323, 423, 523, and so on.
A Graphical Illustration of
Systematic Sampling
Select a random number between 1 and 5, say 2.
The resulting sample consists of population elements 2, (2+5=) 7, (2+5x2=) 12, (2+5x3=) 17, and (2+5x4=) 22. Note, all the elements are selected from a single row.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
Stratified Sampling
Stratified sampling is a two-step process in which the population is first partitioned into subpopulations, or strata. The strata should be mutually exclusive and collectively exhaustive in that every population element should be assigned to one and only one stratum and no population elements should be omitted.
Next, elements are selected from each stratum by a random procedure, usually SRS.
A major objective of stratified sampling is to increase precision
without increasing cost.
The elements within a stratum should be as homogeneous as
possible, but the elements in different strata should be as
heterogeneous as possible.
The stratification variables should also be closely related to the
characteristic of interest.
A Graphical Illustration of
Stratified Sampling
Randomly select a number from 1 to 5
for each stratum, A to E. The resulting
sample consists of population elements
4, 7, 13, 19 and 21. Note, one element
is selected from each column.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
27. Cluster Sampling
The target population is first divided into mutually exclusive
and collectively exhaustive subpopulations, or clusters.
Elements within a cluster should be as heterogeneous as
possible, but clusters themselves should be as homogeneous as
possible. Ideally, each cluster should be a small-scale
representation of the population.
Then a random sample of clusters is selected, based on a
probability sampling technique such as SRS.
For each selected cluster, either all the elements are included in
the sample (one-stage) or a sample of elements is drawn
probabilistically (two-stage).
A Graphical Illustration of
Cluster Sampling (2-Stage)
Randomly select 3 clusters, B, D and E.
Within each cluster, randomly select one
or two elements. The resulting sample
consists of population elements 7, 18, 20, 21, and 23. Note, no
elements are selected from clusters A and C.
A   B   C   D   E
1   6   11  16  21
2   7   12  17  22
3   8   13  18  23
4   9   14  19  24
5   10  15  20  25
29. Strengths and Weaknesses of Basic Sampling Techniques

Nonprobability Sampling
Convenience sampling
  Strengths: least expensive, least time-consuming, most convenient
  Weaknesses: selection bias, sample not representative, not recommended for descriptive or causal research
Judgmental sampling
  Strengths: low cost, convenient, not time-consuming
  Weaknesses: does not allow generalization, subjective
Quota sampling
  Strengths: sample can be controlled for certain characteristics
  Weaknesses: selection bias, no assurance of representativeness
Snowball sampling
  Strengths: can estimate rare characteristics
  Weaknesses: time-consuming

Probability Sampling
Simple random sampling (SRS)
  Strengths: easily understood, results projectable
  Weaknesses: difficult to construct sampling frame, expensive, lower precision, no assurance of representativeness
Systematic sampling
  Strengths: can increase representativeness, easier to implement than SRS, sampling frame not necessary
  Weaknesses: can decrease representativeness
Stratified sampling
  Strengths: includes all important subpopulations, precision
  Weaknesses: difficult to select relevant stratification variables, not feasible to stratify on many variables, expensive
Cluster sampling
  Strengths: easy to implement, cost-effective
  Weaknesses: imprecise, difficult to compute and interpret results
Procedures for Drawing
Probability Samples
Simple Random Sampling
1. Select a suitable sampling frame
2. Each element is assigned a number from 1 to N (pop. size)
3. Generate n (sample size) random numbers between 1 and N
4. The numbers generated denote the elements that should be included in the sample
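A minimal sketch of these steps in Python, using the standard library's random module (the frame of N = 25 elements echoes the earlier illustrations):

```python
import random

def simple_random_sample(N, n, seed=None):
    """Number elements 1..N, draw n distinct random numbers in that
    range, and return them as the selected elements (steps 2-4)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))

sample = simple_random_sample(25, 5, seed=42)  # hypothetical frame of 25
print(len(sample), len(set(sample)))           # -> 5 5 (all distinct)
```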
Systematic Sampling
1. Select a suitable sampling frame
2. Each element is assigned a number from 1 to N (pop. size)
3. Determine the sampling interval i: i = N/n. If i is a fraction, round to the nearest integer
4. Select a random number, r, between 1 and i, as explained in
simple random sampling
5. The elements with the following numbers will comprise the systematic random sample: r, r+i, r+2i, r+3i, ..., r+(n-1)i
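A sketch of the procedure in Python, checked against the 100,000 / 1,000 example from the text:

```python
import random

def systematic_sample(N, n, seed=None):
    """Interval i = round(N/n); random start r in 1..i; take r, r+i, ..."""
    i = round(N / n)
    r = random.Random(seed).randint(1, i)
    return [r + k * i for k in range(n)]

# The 100,000-element population / 1,000-sample example: i = 100.
s = systematic_sample(100_000, 1_000, seed=1)
print(s[1] - s[0])   # -> 100
print(len(s))        # -> 1000
```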
Stratified Sampling
The stratum sample sizes sum to the total sample size: n_1 + n_2 + ... + n_H = n.
1. Select a suitable frame
2. Select the stratification variable(s) and the number of strata, H
3. Divide the entire population into H strata. Based on the classification variable, each element of the population is assigned to one of the H strata
4. In each stratum, number the elements from 1 to N_h (the pop. size of stratum h)
5. Determine the sample size of each stratum, n_h
6. In each stratum, select a simple random sample of size n_h
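A sketch of steps 4 to 6 in Python; the five columns A to E of the illustration grid serve as strata, with a hypothetical allocation of one element per stratum:

```python
import random

def stratified_sample(strata, n_h, seed=None):
    """strata: stratum name -> element labels; n_h: name -> sample size.
    Steps 4-6: draw an independent SRS of size n_h within each stratum."""
    rng = random.Random(seed)
    return {name: sorted(rng.sample(elements, n_h[name]))
            for name, elements in strata.items()}

# The 25-element grid from the illustrations, columns A to E as strata.
strata = {"A": [1, 2, 3, 4, 5],      "B": [6, 7, 8, 9, 10],
          "C": [11, 12, 13, 14, 15], "D": [16, 17, 18, 19, 20],
          "E": [21, 22, 23, 24, 25]}
sample = stratified_sample(strata, {name: 1 for name in strata}, seed=7)
print(sorted(sample))                             # -> ['A', 'B', 'C', 'D', 'E']
print(all(len(v) == 1 for v in sample.values()))  # one element per stratum
```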
Cluster Sampling
1. Assign a number from 1 to N to each element in the population
2. Divide the population into C clusters of which c will be included in the sample
3. Calculate the sampling interval i, i = N/c (round to nearest integer)
4. Select a random number r between 1 and i, as explained in simple random sampling
5. Identify elements with the following numbers: r, r+i, r+2i, ..., r+(c-1)i
6. Select the clusters that contain the identified elements
7. Select sampling units within each selected cluster based on SRS or systematic sampling
8. Remove clusters exceeding sampling interval i. Calculate new population size N*, number of clusters to be selected C* = C-1, and new sampling interval i*
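A simpler two-stage variant (an SRS of clusters, then an SRS within each, rather than the interval-based scheme in steps 3 to 8) can be sketched as follows; the cluster partition mirrors the A to E grid of the illustrations:

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: SRS of clusters; stage 2: SRS of elements within each.
    (A simpler alternative to the interval-based scheme in steps 3-8.)"""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: sorted(rng.sample(clusters[c], n_per_cluster))
            for c in chosen}

# Hypothetical clusters mirroring the A-to-E grid of the illustrations.
clusters = {"A": [1, 2, 3, 4, 5],      "B": [6, 7, 8, 9, 10],
            "C": [11, 12, 13, 14, 15], "D": [16, 17, 18, 19, 20],
            "E": [21, 22, 23, 24, 25]}
sample = two_stage_cluster_sample(clusters, n_clusters=3,
                                  n_per_cluster=2, seed=3)
print(len(sample))                                # -> 3 clusters selected
print(all(len(v) == 2 for v in sample.values()))  # -> True
```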
Tennis' Systematic Sampling
Returns a Smash
Tennis magazine conducted a mail survey of its subscribers to
gain a better understanding of its market. Systematic sampling
was employed to select a sample of 1,472 subscribers from the
publication's domestic circulation list. If we assume that the
subscriber list had 1,472,000 names, the sampling interval
would be 1,000 (1,472,000/1,472). A number from 1 to 1,000
was drawn at random. Beginning with that number, every
1,000th subscriber was selected.
A brand-new dollar bill was included with the questionnaire as an incentive to respondents. An alert postcard was mailed one week before the survey. A second, follow-up questionnaire was sent to the whole sample ten days after the initial questionnaire. The net effective mailing was 1,396. Six weeks after the first mailing, 778 completed questionnaires were returned, yielding a response rate of 56%.
Discussion question 1
Discuss the advantages of convenience samples and when it is appropriate to use them.
Convenience sampling is the least expensive and least time-consuming of all sampling techniques. The sampling units are accessible, easy to measure, and cooperative. In spite of these advantages, this form of sampling has serious limitations. Many potential sources of selection bias are present, including respondent self-selection. Convenience samples are not representative of any definable population. Hence, it is not theoretically meaningful to generalize to any population from a convenience sample, and convenience samples are not appropriate for marketing research projects involving population inferences.
Convenience samples are not recommended for descriptive or causal research, but they can be used in exploratory research for generating ideas, insights, or hypotheses. Convenience samples can be used for focus groups, pretesting questionnaires, or pilot studies. Even in these cases, caution should be exercised in interpreting the results. Nevertheless, this technique is sometimes used even in large surveys.
Discussion question 2
Discuss the advantages of systematic sampling.
Systematic sampling is less costly and easier than simple
random sampling, because random selection is done only once.
Moreover, the random numbers do not have to be matched with
individual elements as in simple random sampling. Because
some lists contain millions of elements, considerable time can
be saved. This reduces the costs of sampling. If information
related to the characteristic of interest is available for the
population, systematic sampling can be used to obtain a more
representative and reliable (lower sampling error) sample than
simple random sampling. Another relative advantage is that
systematic sampling can even be used without knowledge of the
composition (elements) of the sampling frame. For example,
every ith person leaving a department store or mall can be
intercepted. For these reasons, systematic sampling is often
employed in consumer mail, telephone, and mall intercept
interviews.
Discussion question 3
Discuss the uses of nonprobability and probability sampling.
Nonprobability sampling is used in concept tests, package tests,
name tests, and copy tests, where projections to the populations
are usually not needed. In such studies, interest centers on the
proportion of the sample that gives various responses or
expresses various attitudes. Samples for these studies can be
drawn using methods such as mall intercept quota sampling. On
the other hand, probability sampling is used when there is a
need for highly accurate estimates of market share or sales
volume for the entire market. National market tracking studies,
which provide information on product category and brand usage
rates, as well as psychographic and demographic profiles of
users, use probability sampling. Studies that use probability
sampling generally employ telephone interviews. Stratified and
systematic sampling are combined with some form of random-
digit dialing to select the respondents.
Green Guard Care is a managed care company that provides and
finances health care services for
employees of Arc Electric, Inc. Approximately 5,000
employees at Arc are currently enrolled in Green
Guard’s health insurance plan. The number of enrollees has
increased over the past year as Arc Electric
continued to expand its workforce, and more and more Arc
employees have elected to receive this
benefit.
Arc currently pays Green Guard the full cost of health insurance
for its employees. This insurance
provides comprehensive coverage for inpatient and outpatient
hospital care, surgical services, physician
office visits, and other services (e.g., x-rays). The only cost to
employees is a $15 copayment for each
physician office visit.
Louis Pasture is the Director of Strategic Planning and
Forecasting at Green Guard Care. His key
function is to direct the overall development and analysis of all
strategic financial planning initiatives
relating to Green Guard’s managed care programs. His staff is
involved in many activities, including
preparing periodic reports of the costs incurred under the Arc
Electric account. Every time an Arc Electric
employee uses health care services, information about the type
of service and the relevant costs are
recorded in a database. Mr. Pasture recently directed his staff
to perform a financial analysis of the
current utilization and costs incurred under the Arc Electric
account.
Bad News
Marie Curry personally delivered her summary of utilization on
the Arc Electric account to Mr. Pasture
(See Exhibit 1). The data, he noted, indicated a sharp increase
in number of physician office visits over
the past month. He remarked, “The Arc employees’ use of
outpatient physician services has been going
up for the past six months. What’s going on?” He asked Ms.
Curry to provide him with the enrollment
numbers to see if the increase in utilization of physician
services was primarily due to the change in the
number of employees enrolled in the health plan. “No problem,”
she replied. “I have already put the last
six months’ weekly statistics into a spreadsheet.”
Mr. Pasture was concerned about Green Guard’s profitability.
Last year, Green Guard negotiated with Arc
Electric to charge a fixed premium of $250 per employee per
month. The total premium revenue is
allocated as follows: 55% to hospital and surgical services, 30%
to physician visits, and 15% for other
services, administration, and profit. These allocations are used
to establish budgets in the different
departments at Green Guard. The Arc Electric contract would
expire next month, at which time Green
Guard would need to renegotiate the terms of its contract with
Arc Electric. Mr. Pasture feared that Green
Guard would have to request a sharp rate increase to remain
profitable. Green Guard’s monthly cost of
administering the health plan was fixed, but the increases in the
use of health care services were eroding
Green Guard’s profits. He suspected that other health plans
were planning to increase premiums by 5-10
percent, which was reasonable given the recent statistics on
national health expenditures. A report from
2004, the most recent he could find, indicated that total national
health expenditures rose 7.9 percent
from 2003 to 2004 -- over three times the rate of inflation.
Mr. Pasture called in the rest of his staff to assist him in
devising a strategy for renegotiating the Arc
account. “If possible, I would like to figure out how we can
continue providing this service for the rate we
established last year. I’m afraid if we attempt to increase the
per member premium, Arc will contract with
another health insurer. What other options do we have?”
Alex Langdon, who works in Membership Marketing, reported
that he recently conducted a survey of cost
control mechanisms used by other health plans. His analysis
revealed that Green Guard’s competitors
are increasing their use of these mechanisms, which include
copayments, waiting periods,
preauthorization requirements, and exclusions on certain health
care services.
“One of the problems, in my opinion, is that the Arc Electric
employees have nearly full coverage for all
their health care services,” remarked Langdon. “The Arc
employees should pay some part of their health
care services out-of-pocket, so that they share an incentive to
stay healthy. Green Guard only charges a
$15 copayment, but many other health insurance plans require
that enrollees pay $20 – 25 for each
physician office visit. A higher copayment will help us reduce
the use of physician services.” He showed
them the results from a national study that showed a significant
relationship between the amount of a
copayment and the number of visits to a physician (See Exhibit
3), and recommended that Mr. Pasture
consider implementing a larger copayment for each physician
visit when the contract with Arc is
renegotiated.
Faith Monroe, who works in Provider Relations, disagreed. “I
don’t think a higher copayment is going to
reduce the level of physician visits. The demand for health care
services is a derived demand because it
depends on the demand for good health. People don’t
necessarily want to visit their physician, but they
often have to in order to stay healthy. If we want to cut our
costs, we will have to figure out how to pay
the health care providers less.” Green Guard currently pays for
health care services on a fee-for-service
basis. Most of the area hospitals and physicians “participate” in
Green Guard’s health insurance plan.
When Arc employees obtain health care services from
participating health care providers, the providers
are reimbursed for their costs directly by Green Guard. Several
factors have increased health care costs
over time, including the growing availability of medical
technology, such as magnetic resonance imaging
(MRI), and increased medical malpractice litigation.
Ms. Monroe suggested that Mr. Pasture consider negotiating
with physicians to lower the costs of the
services provided. “I’ve heard that some managed care plans
have cut deals with physicians to lower
their charges by 10-25 percent,” she said. “Physicians have
accepted these deals because if they don’t,
they could be cut out of the health insurance plan and they
could lose all their patients.” Mr. Langdon
conceded that this might be possible, but expressed his concern
that if participating physicians accepted
a lower amount per visit, they might reduce the quality of care
they provide to Green Guard’s members.
Mr. Pasture dismissed his staff. Eager to resolve this issue, he
phoned your consulting company for
assistance. Green Guard’s executives would need a full report
of the current situation and evaluation of
his staff’s suggestions to either (a) increase the copayment, or
(b) implement a reduction in charges for
physician office visits.
Required:
Prepare a report of Green Guard’s current financial situation
and include an evaluation of the two options
for controlling costs on the Arc Electric account. Use the
guidelines for writing a report on the course web
site. You may wish to review the following LDC Concepts:
Microeconomics 3 and 5, SOM (Statistics) 1, 4,
and 7.
Exhibit 1
Monthly Report of Health Care Utilization
Total Costs Incurred - Arc Electric, Inc.

Category of Service             July 2006    August 2006
Hospital Services - Inpatient     203,425        212,250
Hospital Services - Outpatient    182,440        180,700
Surgical Services                 101,250        103,400
Physician Office Visits           337,900        391,450
Administrative Expenses            90,000         90,000
TOTAL                             915,015        977,800

Number of members, July 31, 2006: 4129
Number of members, August 31, 2006: 4137
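As an illustrative check (not part of the case text), the Exhibit 1 physician costs can be compared against the 30% premium allocation described earlier, using the August enrollment of 4,137 and the $250 premium:

```python
# Illustrative budget check for August 2006 (figures from Exhibit 1).
members = 4137
revenue = members * 250          # $250 fixed premium per member per month

budget = {
    "hospital_surgical":  0.55 * revenue,
    "physician_visits":   0.30 * revenue,
    "other_admin_profit": 0.15 * revenue,
}

actual_physician = 391_450       # Physician Office Visits, August 2006
print(round(budget["physician_visits"]))              # -> 310275
print(actual_physician > budget["physician_visits"])  # -> True (over budget)
```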
Exhibit 2
healthcare.xls - will be provided on the course website.
Exhibit 3
Sample Means for Annual Use of Health Care Services

Copayment Level    Physician Visits Per Capita
$10                6.3
$15                6.0
$20                5.7
$25                5.4
$30                5.1
$35                4.8
Source: “Demand for Health Care Services at Different Copayment Levels: Results Based on a National Study of Health Insurance Enrollees”
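The Exhibit 3 means fall on a straight line, so the implied effect of a copayment change is easy to compute. A sketch (the $15-to-$25 scenario is hypothetical, echoing Mr. Langdon's suggestion):

```python
copay  = [10, 15, 20, 25, 30, 35]
visits = [6.3, 6.0, 5.7, 5.4, 5.1, 4.8]   # Exhibit 3 sample means

# Slope of the (exactly linear) exhibit data: visits per $1 of copayment.
slope = (visits[-1] - visits[0]) / (copay[-1] - copay[0])
print(round(slope, 2))                    # -> -0.06

# Hypothetical scenario: raising the Arc copayment from $15 to $25.
print(round(slope * (25 - 15), 1))        # -> -0.6 visits per capita
```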
Business Management Homework
Answer the question in 500 words
Question 6
In addition to the options suggested by his staff, Mr.
Pasture recently read an article about
rationing health care services as a method of controlling costs.
The general idea of rationing is that more expensive treatments
are excluded so that basic health benefits can be provided to a
wider population. Health plans can implement rationing by
limiting the types of services they will cover. While they
commonly exclude coverage for experimental treatments and
cosmetic surgery, many are now considering adding physical
therapy, mental health services, and therapies that treat fatal
conditions to the list of excluded services. Would you
recommend that Mr. Pasture consider this approach? Discuss
the ethical considerations.
6. Correlation and Regression
Statistics Associated with Frequency Distribution: Measures of Location
The mean, or average value, is the most commonly used measure of central tendency. The mean, X̄, is given by
X̄ = (1/n) Σ Xi, summed over i = 1 to n
where
Xi = observed values of the variable X
n = number of observations (sample size)
The mode is the value that occurs most frequently. It represents the highest peak of the distribution. The mode is a good measure of location when the variable is inherently categorical or has otherwise been grouped into categories.
The median of a sample is the middle value when the data are arranged in ascending or descending order.
http://www.city-data.com/
Statistics Associated with Frequency Distribution: Measures of Shape
Skewness is the tendency of the deviations from the mean to be larger in one direction than in the other. It can be thought of as the tendency for one tail of the distribution to be heavier than the other.
Kurtosis is a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The excess kurtosis of a normal distribution is zero. If the kurtosis is positive, then the distribution is more peaked than a normal distribution; a negative value means that the distribution is flatter than a normal distribution.
Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
The median is the middle value, so the list must first be rewritten in order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one is the (9 + 1) ÷ 2 = 5th number (if there were an even number of values, the median would be the mean of the middle two): 13, 13, 13, 13, 14, 14, 16, 18, 21. So the median is 14.
The mode is the number that is repeated more often than any other: 13 is the mode.
The largest value in the list is 21, and the smallest is 13, so the range is 21 - 13 = 8.
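The same worked example can be reproduced with Python's standard statistics module:

```python
import statistics as st

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(st.mean(data))          # -> 15
print(st.median(data))        # -> 14
print(st.mode(data))          # -> 13
print(max(data) - min(data))  # range -> 8
```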
The range measures the spread of the data. It is simply the
difference between the largest and smallest values in the
sample.
Range = Xlargest – Xsmallest.
The variance is the mean squared deviation from the mean. The
variance can never be negative. The variance is a measure of
how far a set of numbers is spread out from the mean.
http://www.mathsisfun.com/data/standard-deviation.html
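Continuing the same example, the variance and standard deviation can be computed with the statistics module (pvariance and pstdev treat the nine values as the whole population):

```python
import statistics as st

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]  # same list as the example

print(round(st.pvariance(data), 2))  # population variance -> 7.11
print(round(st.pstdev(data), 2))     # its square root     -> 2.67
```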
Deviation is the difference between the value of an observation and the mean: x - mean(x). The standard deviation is based on the square of this difference.
The standard deviation is the square root of the variance:
s_x = sqrt[ Σ (Xi - X̄)² / (n - 1) ], summed over i = 1 to n
Statistics Associated with Frequency Distribution: Measures of
Variability
Covariance is a measure of how much the deviations of two
variables match. The sample covariance is: cov(x,y) = SUM[(x -
mean(x))(y - mean(y))] / (n - 1). In SPSS, select Analyze, Correlate,
Bivariate; click Options; check Cross-product deviations and
covariances.
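A minimal Python sketch of the sample covariance (n - 1 denominator); the data are made up so that the two variables' deviations match perfectly:

```python
# Sample covariance: the matched products of deviations, summed,
# divided by n - 1.
def sample_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]        # y = 2x, so every deviation matches
c = sample_cov(x, y)        # 5.0
```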
Correlation is a bivariate measure of association (strength) of
the relationship between two variables. It varies from 0 (random
relationship) to 1 (perfect linear relationship) or -1 (perfect
negative linear relationship). It is usually reported in terms of
its square (r2), interpreted as percent of variance explained. For
instance, if r2 is .25, then the independent variable is said to
explain 25% of the variance in the dependent variable. In SPSS,
select Analyze, Correlate, Bivariate; check Pearson.
Correlation
Pearson's r, the most common type of correlation coefficient, is
also called the product-moment correlation.
Pearson's r is a measure of association which varies from -1 to
+1, with 0 indicating no relationship (random pairing of values)
and 1 indicating perfect relationship. In SPSS, select Analyze,
Correlate, Bivariate; check Pearson (the default).
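Pearson's r can be sketched directly from its definition (the cross-products of deviations scaled by the sums of squares); the data are made up to show the two extremes:

```python
import math

# Pearson's product-moment correlation.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # 1.0, perfect linear relationship
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # -1.0, perfect negative relationship
```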
Multiple Regression
The multiple regression equation takes the
form y = b1x1 + b2x2 + ... + bnxn + c.
The b's are the regression coefficients, representing the amount
the dependent variable y changes when the corresponding
independent variable changes by 1 unit. The c is the constant,
where the regression line intercepts the y axis, representing the
value the dependent y takes when all the independent variables are
0.
The standardized version of the b coefficients are the beta
weights, and the ratio of the beta coefficients is the ratio of the
relative predictive power of the independent variables.
Associated with multiple regression is R2, multiple correlation,
which is the percent of variance in the dependent variable
explained collectively by all of the independent variables.
How big a sample size do I need to do multiple regression ?
According to Tabachnick and Fidell (2001: 117), a rule of
thumb for testing b coefficients is to have N >= 104 + m, where
m = number of independent variables.
Another popular rule of thumb is that there must be at least 20
times as many cases as independent variables.
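The regression mechanics above can be sketched with NumPy's least-squares solver (assuming NumPy is available). The data are made up: y is generated from known coefficients, so the fitted b's can be checked against them:

```python
import numpy as np

# A minimal least-squares fit of y = b1*x1 + b2*x2 + c on made-up data.
rng = np.random.default_rng(0)
n = 200                                    # comfortably above 104 + m cases
X = rng.normal(size=(n, 2))                # two independent variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0    # exact linear relationship

design = np.column_stack([X, np.ones(n)])  # append the constant column
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
b1, b2, c = coefs                          # recovers 3.0, -2.0, 5.0
```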
*
Statistics Associated with
Regression Analysis
Regression coefficient. The estimated parameter b is usually
referred to as the non-standardized regression coefficient.
Standardized regression coefficient. Also termed the beta
coefficient or beta weight, it is the regression coefficient
obtained when the variables are standardized. In simple
regression, the standardized coefficient equals the correlation:
Byx = Bxy = rxy.
Sum of squared errors. The distances of all the points from the
regression line are squared and added together to arrive at the
sum of squared errors, SUM(e_j²), which is a measure of total error.
*
53. Conducting Regression Analysis
Plot the Scatter Diagram
A scatter diagram, or scattergram, is a plot of the values of two
variables for all the cases or observations.
The most commonly used technique for fitting a straight line to
a scattergram is the least-squares procedure.
In fitting the line, the least-squares procedure
minimizes the sum of squared errors, SUM(e_j²).
*
Determine the Strength and Significance of Association:
Significance of r with t test
t statistic. A t statistic with n - 2 degrees of freedom (in
simple regression) can be used to test the null hypothesis that
no linear relationship exists between X and Y, or H0: r = 0. One
tests the hypothesis that the population correlation is zero using
this formula: t = [r*SQRT(n-2)]/SQRT(1-r²). If the computed t
value is as high as or higher than the table t value, the
researcher concludes that the correlation is significant (that is,
significantly different from 0). In practice, most computer
programs compute the significance of the correlation for the
researcher without the need for manual methods.
t test: H0: b = 0; H1: b is not equal to 0.
F test: H0: R² = 0; H1: R² is not equal to 0.
Determine the Strength and Significance of Association: F test
Another, equivalent test for examining the significance of the
linear relationship between X and Y (significance of b) is the
test for the significance of the coefficient of determination.
The hypotheses in this case are:
H0: R2pop = 0
H1: R2pop > 0
F test. The F test is used to test the null hypothesis that the
coefficient of multiple determination in the population, R²pop,
is zero. This is equivalent to testing the null hypothesis that
all the regression coefficients are zero, which is the same as
testing the significance of the regression model as a whole. The
test statistic has an F distribution with k and (n - k - 1)
degrees of freedom (in multiple regression), where k = number of
terms in the equation not counting the constant:
F = [R²/k]/[(1 - R²)/(n - k - 1)].
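A sketch of the F statistic from this formula, with made-up values:

```python
# F = [R^2 / k] / [(1 - R^2) / (n - k - 1)], df = k and n - k - 1.
def f_for_r2(r2, n, k):
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Example: R^2 = .25 with n = 53 cases and k = 2 predictors gives
# F = 0.125 / (0.75 / 50), about 8.33.
F = f_for_r2(0.25, 53, 2)
```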
*
Significance level: p value
In statistics, a result is called statistically significant if it
is unlikely to have occurred by chance. The decision is often made
using the p-value (see "Sig." in the SPSS output): if the p-value
is less than the significance level, the null hypothesis is
rejected. The smaller the p-value, the more significant the result
is said to be.
Variables
Dependent variable. The dependent variable is the predicted
variable in the regression equation.
Independent variables are the predictor variables in the
regression equation.
Dummy variables are a way of adding the values of a nominal or
ordinal variable to a regression equation. The standard approach
to modeling categorical variables is to include the categorical
variables in the regression equation by converting each level of
each categorical variable into a variable of its own, usually
coded 0 or 1. For instance, the categorical variable "region"
may be converted into dummy variables such as "East," "West,"
"North," or "South." Typically "1" means the attribute of
interest is present (ex., South = 1 means the case is from the
region South). We have to leave one of the levels out of the
regression model to avoid perfect multicollinearity (singularity;
redundancy), which will prevent a solution (for example, we
may leave out "North" to avoid singularity).
Regression with Dummy Variables
Product Usage     Original        Dummy Variable Code
Category          Variable Code   D1   D2   D3
Nonusers              1            1    0    0
Light Users           2            0    1    0
Medium Users          3            0    0    1
Heavy Users           4            0    0    0

Y_i = a + b1D1 + b2D2 + b3D3

In this case, "heavy users" has been selected as the reference
category and has not been directly included in the regression
equation.
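The coding scheme in the usage-category table can be sketched in Python:

```python
# Dummy coding of the four usage categories, with "Heavy Users"
# (code 4) as the omitted reference category.
def dummy_code(category):
    # Returns (D1, D2, D3) for original codes 1-4.
    return tuple(1 if category == level else 0 for level in (1, 2, 3))

assert dummy_code(1) == (1, 0, 0)   # Nonusers
assert dummy_code(2) == (0, 1, 0)   # Light Users
assert dummy_code(3) == (0, 0, 1)   # Medium Users
assert dummy_code(4) == (0, 0, 0)   # Heavy Users (reference category)
```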
Conducting Multiple Regression Analysis
Strength of Association
R² = SSreg / SSy
R², also called the coefficient of multiple determination, is the
percent of the variance in the dependent variable explained
uniquely or jointly by the independent variables.
*
Conducting Multiple Regression Analysis
Strength of Association
Adjusted R² is an adjustment for the fact that R² tends to rise as
independent variables are added, even when they contribute little
explanatory power. When there are only a few independents, R² and
adjusted R² will be close. When there are many independents,
adjusted R² may be noticeably lower. Always use adjusted R² when
comparing models with different numbers of independents.
R² is adjusted for the number of independent variables and the
sample size by using the following formula:
Adjusted R² = R² - k(1 - R²)/(n - k - 1)
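The adjustment formula can be sketched in Python; the example values are made up to show how the penalty grows with the number of independents:

```python
# Adjusted R^2 = R^2 - k(1 - R^2) / (n - k - 1); algebraically the same
# as the common textbook form 1 - (1 - R^2)(n - 1) / (n - k - 1).
def adjusted_r2(r2, n, k):
    return r2 - k * (1 - r2) / (n - k - 1)

few = adjusted_r2(0.50, 53, 2)     # 0.48, close to R^2
many = adjusted_r2(0.50, 53, 20)   # 0.1875, noticeably lower
```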
*
Assumptions
Normality
The error term is normally distributed. Regression also assumes
that the variables have normal distributions.
Linearity
The means of all the normal distributions of Y, given X, lie
on a straight line with slope b.
Absence of high multicollinearity
No outliers
*
Normality
A histogram of standardized residuals should show a roughly
normal curve.
Skewness and kurtosis can also be used to check normality of
the variables.
P-P plot: Another alternative for the same purpose is the
normal probability plot, with the observed cumulative
probabilities of occurrence of the standardized residuals on the
Y axis and the expected normal probabilities of occurrence on
the X axis; a 45-degree line appears when the observed
distribution conforms to what is normally expected.
Homoscedasticity (also spelled homoskedasticity)
The variance of the error term is constant. Lack of
homoscedasticity may mean (1) there is an interaction effect
between a measured independent variable and an unmeasured
independent variable not in the model; or (2) that some
independent variables are skewed while others are not.
The error terms are uncorrelated. In other words, the
observations have been drawn independently.
The Durbin-Watson statistic is a test to see if the assumption of
independent observations is met, which is the same as testing to
see if autocorrelation is present. As a rule of thumb, a Durbin-
Watson statistic in the range of 1.5 to 2.5 means the researcher
may reject the notion that data are autocorrelated (serially
dependent) and instead may assume independence of
observations.
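The Durbin-Watson statistic mentioned above can be computed directly from the residuals; a minimal sketch with made-up residuals:

```python
# Durbin-Watson statistic: d = SUM[(e_t - e_{t-1})^2] / SUM[e_t^2],
# computed over the time-ordered residuals. Values near 2 suggest
# independent (non-autocorrelated) errors.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Perfectly alternating residuals are negatively autocorrelated,
# so d lands well above 2 (here 20/6, about 3.33).
d = durbin_watson([1, -1, 1, -1, 1, -1])
```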
Homoscedasticity: the variance of the error terms is constant.
Nonconstant error variance (heteroscedasticity) can indicate the
need to respecify the model to include omitted independent
variables. Nonconstant error variance can be observed by
requesting simple residual plots; for example, with "Training" as
the independent variable predicting "Score" as the dependent, plot
the dependent on the X axis against the standardized predicted
values on the Y axis. For the homoscedasticity assumption to be
met, observations should be spread about the regression line
similarly across the entire X axis. In a heteroscedastic plot, the
spread is much narrower for low values than for high values of the
X variable, Score.
The variance of the error term should be constant for all values
of the independent variables. Heteroscedasticity occurs when the
variance of the error term is not constant. The presence of
heteroscedasticity can invalidate statistical tests of
significance.
Residuals
Residuals are the differences between the observed values and
those predicted by the regression equation.
Unstandardized residuals, referenced as RESID in SPSS, refer in a
regression context to the linear difference between the location
of an observation (point) and the regression line (or plane or
surface) in multidimensional space.
Standardized residuals are residuals after they have been
constrained to a mean of zero and a standard deviation of 1. A
rule of thumb is that outliers are points whose standardized
residual is greater than 3.3 (corresponding to the .001 alpha
level). SPSS will list "Std. Residual" if "casewise diagnostics"
is requested under the Statistics button.
Studentized residuals are constrained only to have a standard
deviation of 1; they are not constrained to a mean of 0.
Studentized deleted residuals are residuals constrained to have a
standard deviation of 1, where the standard deviation is
calculated leaving the given case out.
Multicollinearity and its problemsMulticollinearity refers to
excessive correlation of the predictor variables. When
correlation is excessive (some use the rule of thumb of r > .90),
standard errors of the b and beta coefficients become large,
making it difficult or impossible to assess the relative
importance of the predictor variables.
62. Multicollinearity is less important where the research purpose is
sheer prediction since the predicted values of the dependent
remain stable, but multicollinearity is a severe problem when
the research purpose includes causal modeling.
Multicollinearity can result in several problems, including:
The partial regression coefficients may not be estimated
precisely. The standard errors are likely to be high.
The magnitudes as well as the signs of the partial regression
coefficients may change from sample to sample.
It becomes difficult to assess the relative importance of the
independent variables in explaining the variation in the
dependent variable.
Predictor variables may be incorrectly included or removed in
stepwise regression.
Test of multicollinearity
Inspection of the correlation matrix reveals only bivariate
multicollinearity, with the typical criterion being bivariate
correlations > .90. To assess multivariate multicollinearity, one
uses tolerance or the VIF (variance inflation factor).
Tolerance: As a rule of thumb, if tolerance is less than .20, a
problem with multicollinearity is indicated. In SPSS, select
Analyze, Regression, Linear; click Statistics; check Collinearity
diagnostics to get tolerance.
VIF is the variance inflation factor, which is simply the
reciprocal of tolerance. VIF >= 4 is an arbitrary but common
cut-off criterion for deciding when a given independent variable
displays "too much" multicollinearity: values above 4 suggest a
multicollinearity problem. Some researchers use the more
lenient cutoff of 5.0 or even 10.0 to signal when
multicollinearity is a problem.
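Both diagnostics can be sketched from the R² obtained by regressing one predictor on the remaining predictors (the .80 figure below is made up):

```python
# For a given predictor, regress it on the other predictors and take R^2:
# tolerance = 1 - R^2, and VIF is simply the reciprocal of tolerance.
def tolerance_and_vif(r2_with_other_predictors):
    tol = 1 - r2_with_other_predictors
    return tol, 1 / tol

tol, vif = tolerance_and_vif(0.80)   # tol = 0.20, VIF = 5.0
# tol < .20 (equivalently VIF > 5) would flag a multicollinearity problem
# under the rules of thumb above.
```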
Remedies of Multicollinearity
Run simple regressions.
A simple procedure for adjusting for multicollinearity is to use
only variables with low multicollinearity.
Alternatively, the set of independent variables can be
transformed into a new set of predictors that are mutually
independent by using techniques such as principal components
analysis or factor analysis.
More specialized techniques, such as ridge regression, can also
be used.
*
Outliers
The removal of outliers from the data set under analysis can at
times dramatically affect the performance of a regression
model. Outliers should be removed if there is reason to believe
that other variables not in the model explain why the outlier
cases are unusual -- that is, outliers may well be cases which
need a separate model. Alternatively, outliers may suggest that
additional explanatory variables need to be brought into the
model (that is, the model needs re-specification).
We can check outliers with any one of the five measures of case
influence statistics (DfBeta, standardized DfBeta, DfFit,
standardized DfFit, and the covariance ratio) and distance
measures (Mahalanobis, Cook's D, and leverage).
We can use the following three ways to detect outliers: casewise
diagnostics, Mahalanobis distance (D²), and DfBeta.
64. Check Influential cases (outliers) with Influence statistics
Influence statistics in SPSS are selected under the Save button
dialog.
DfBeta, called standardized DfBeta in SPSS, measures the
change in b coefficients (measured in standard errors) due to
excluding a case from the dataset. A DfBeta coefficient is
computed for every observation. If DfBeta > 0, the case
increases the slope; if < 0, the case decreases the slope. The
case may be considered an influential outlier if |DfBeta| > 2. In
an alternative rule of thumb, a case may be an outlier if
|DfBeta|> 2/SQRT(n).
Standardized DfBeta. Once DfBeta is standardized, it is easier
to interpret. The threshold of SDFBETA is usually set at ±2.
DfFit. DfFit measures how much the estimate (predicted value)
changes as a result of a particular observation being dropped
from analysis. The dfFit measure is quite similar to Cook's D.
Standardized DfFit. Once DfFit is standardized, it is easier to
interpret. A rule of thumb flags as outliers those observations
whose standardized DfFit value is greater than twice the square
root of p/N, where p is the number of parameters in the model and
N is the sample size.
Covariance ratio. This ratio compares the determinant of the
covariance matrix with and without inclusion of a given case.
The closer the covariance ratio approaches 1.0, the less
influential the observation.
Check influential case with Distance Measures
Distance measures in SPSS are also selected under the Save
button dialog.
Centered leverage statistic, h, also called the hat-value, is
65. available to identify cases which influence regression
coefficients more than others. The leverage statistic varies from
0 (no influence on the model) to almost 1 (completely
determines the model). The maximum value is (N-1)/N, where N
is sample size. A rule of thumb is that cases with leverage under
.2 are not a problem, but if a case has leverage over .5, the case
has undue leverage and should be examined for the possibility
of measurement error or the need to model such cases
separately.
Mahalanobis distance. The higher the Mahalanobis distance for
a case, the more that case's values on independent variables
diverge from average values. As a rule of thumb, the maximum
Mahalanobis distance should not exceed the critical chi-squared
value with degrees of freedom equal to number of predictors
and alpha =.001, or else outliers may be a problem in the data.
Cook's distance, D, is another measure of the influence of a
case. Observations with larger D values than the rest of the data
are those which have unusual influence or leverage. Fox (1991:
34) suggests as a cut-off for detecting influential cases, values
of D greater than 4/(N - k - 1), where N is sample size and k is
the number of independents. Others suggest D > 1 as the
criterion to constitute a strong indication of an outlier problem,
with D > 4/n the criterion to indicate a possible problem.
Casewise Diagnostics
8. Discriminant Analysis
66. *
Basic Research Question
The primary goal is to find the dimension(s) on which the groups
differ and to create classification functions; i.e., can group
membership be accurately predicted by a set of predictors?
Similarities and Differences between ANOVA, Regression, and
Discriminant Analysis

                                      ANOVA        REGRESSION   DISCRIMINANT
Similarities
  Number of dependent variables       One          One          One
  Number of independent variables     Multiple     Multiple     Multiple
Differences
  Nature of the dependent variable    Metric       Metric       Categorical
  Nature of the independent variables Categorical  Metric       Metric
67. Discriminant Analysis
Discriminant analysis is a technique for analyzing data when the
criterion variable (DV) is categorical and the predictor variables
(IVs) are interval in nature.
The objectives of discriminant analysis are as follows:
Development of discriminant functions (for n groups, n - 1
discriminant functions), i.e., linear combinations of the
predictors (IVs), which will best discriminate between the
categories of the DV (groups).
Examination of whether significant differences exist among the
groups, in terms of the predictor variables. (Tests of Equality of
Group Means)
Determination of which predictor variables contribute to most
of the intergroup differences. (The smaller the variable Wilks'
lambda for an independent variable, the more that variable
contributes to the discriminant function.)
Evaluation of the accuracy of classification. (classification
result table)
Discriminant Analysis Model
The discriminant analysis model involves linear combinations
of
the following form:
D = b0 + b1X1 + b2X2 + b3X3 + . . . + bkXk
Where:
D = discriminant score
b 's = discriminant coefficient or weight
X 's = predictor (independent variable)
The coefficients, or weights (b), are estimated so that the
groups differ as much as possible on the values of the
discriminant function.
This occurs when the ratio of between-group sum of squares to
68. within-group sum of squares for the discriminant scores is at a
maximum.
Key Terms and Concepts
Discriminating variables = IVs, also called predictors.
Criterion variable = DV, also called the grouping variable in
SPSS.
Discriminant function: A discriminant function, also called a
canonical root, is a latent variable (e.g., somebody's
creditworthiness) created as a linear combination of the
discriminating (independent) variables, such that L (or D) =
b1x1 + b2x2 + ... + bnxn + c, where the b's are discriminant
coefficients, the x's are discriminating variables, and c is a
constant.
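A two-group discriminant function can be sketched with NumPy on made-up data (not the resort table): the weight vector b = Sw⁻¹(m1 - m2) maximizes the ratio of between-group to within-group sum of squares, which is the estimation criterion described above.

```python
import numpy as np

# Two illustrative groups of five cases on two discriminating variables.
g1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
g2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)          # group centroids
# Within-group scatter matrix (np.cov uses n - 1, so multiply it back).
Sw = np.cov(g1.T) * (len(g1) - 1) + np.cov(g2.T) * (len(g2) - 1)
b = np.linalg.solve(Sw, m1 - m2)                   # discriminant weights

# Classify by scoring each case and cutting at the midpoint of the two
# centroid scores; every case in this toy sample classifies correctly.
cut = ((m1 @ b) + (m2 @ b)) / 2
```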
Discriminant Analysis Steps
Discriminant analysis has two steps: (1) Wilks' lambda with a
chi-square transformation is used to test whether the discriminant
model as a whole is significant; (2) if the test shows
significance, the individual independent variables are assessed to
see which differ significantly in mean by group, and these are
used to classify the dependent variable.
Assumptions
Discriminant analysis requires the following assumptions: linear
relationships; homoscedasticity; and proper model specification
(inclusion of all important independents and exclusion of
extraneous variables).
Canonical correlation is a measure of the association between
the groups and the given discriminant function. When it is zero,
there is no relation between the groups and the function.
Squared Canonical correlation, Rc2: Squared canonical
correlation, Rc2, is the percent of variation in the dependent
discriminated by the set of independents in DA or MDA. A
canonical correlation square close to 1 means that nearly all the
variance in the discriminant scores can be attributed to group
differences.
Centroid. The centroid is the mean value of the discriminant
scores for a particular group; there is one centroid for each
group. The means for a group on all the functions are the group
centroids.
Statistics Associated with Discriminant Analysis
Unstandardized discriminant function coefficients are used in
the formula for making the classifications in DA, much as b
coefficients are used in regression in making predictions. The
constant plus the sum of products of the unstandardized
coefficients with the observations yields the discriminant
scores.
Standardized discriminant function coefficients. The
standardized discriminant function coefficients are the
coefficients used as multipliers when the variables have been
standardized to a mean of 0 and a variance of 1.
Discriminant scores. The discriminant score, also called the DA
score, is the value resulting from applying a discriminant
function formula to the data for a given case. The Z score is
the discriminant score for
standardized data. To get discriminant scores in SPSS, select
Analyze, Classify, Discriminant; click the Save button; check
"Discriminant scores".
Statistics Associated with Discriminant Analysis
70. Eigenvalue. For each discriminant function, the Eigenvalue is
the ratio of between-group to within-group sums of squares. It
reflects the importance of the discriminant function. There is
one eigenvalue for each discriminant function. For two-group
DA, there is one discriminant function and one eigenvalue. If
there is more than one discriminant function, the first will be
the largest and most important, the second next most important
in explanatory power, and so on.
F values and their significance. F values are calculated from
ANOVA, with the grouping variable serving as the categorical
independent variable. Each predictor (IV), in turn, serves as the
metric dependent variable in the ANOVA.
Statistics Associated with Discriminant Analysis
Structure correlations. Also referred to as discriminant
loadings, the structure correlations represent the simple
correlations between the predictors and the discriminant
function. The correlations then serve like factor loadings in
factor analysis.
(Model) Wilks' lambda (λ) is used to test the significance of
the discriminant function as a whole.
(Variable) Wilks' lambda. Sometimes also called the U
statistic, Wilks' lambda for each predictor is the ratio of the
within-group sum of squares to the total sum of squares. Its
value varies between 0 and 1. Large values of Wilks' lambda
(near 1) indicate that the group means do not seem to be
different; small values (near 0) indicate that the group means
seem to be different. Thus, the smaller the variable Wilks'
lambda for an independent variable, the more that variable
contributes to the discriminant function.
71. Statistics Associated with Discriminant Analysis
Conducting Discriminant Analysis
Formulate the Problem
The criterion variable (DV) must consist of two or more
mutually exclusive and collectively exhaustive categories.
The predictor variables (IV) should be selected based on a
theoretical model or previous research, or the experience of the
researcher.
One part of the sample, called the estimation or analysis
sample, is used for estimation of the discriminant function.
The other part, called the holdout or validation sample, is
reserved for validating the discriminant function.
Information on Resort Visits: Analysis Sample
The total sample contains 42 households; 30 households are
included in the analysis sample and the remaining 12 form the
validation sample. Variables: whether the family visited a resort
(VISIT) during the last two years (1 = visited, 2 = did not);
annual income (INCOME); attitude toward travel (TRAVEL), a 9-point
Likert scale; importance attached to the family vacation
(VACATION), a 9-point scale; household size (HSIZE); and age of
the head of the household (AGE).
Information on Resort Visits: Analysis Sample
Columns: Family No.; Resort Visit; Annual Income ($000); Attitude
Toward Travel; Importance Attached to Family Vacation; Household
Size; Age of Head of Household; Amount Spent on Family Vacation
(L/M/H).
1 1 50.2 5 8 3 43 M (2)
2 1 70.3 6 7 4 61 H (3)
3 1 62.9 7 5 6 52 H (3)
4 1 48.5 7 5 5 36 L (1)
5 1 52.7 6 6 4 55 H (3)
6 1 75.0 8 7 5 68 H (3)
7 1 46.2 5 3 3 62 M (2)
8 1 57.0 2 4 6 51 M (2)
9 1 64.1 7 5 4 57 H (3)
10 1 68.1 7 6 5 45 H (3)
11 1 73.4 6 7 5 44 H (3)
12 1 71.9 5 8 4 64 H (3)
13 1 56.2 1 8 6 54 M (2)
14 1 49.3 4 2 3 56 H (3)
15 1 62.0 5 6 2 58 H (3)
Information on Resort Visits: Analysis Sample (Table 18.2, cont.)
Columns: Family No.; Resort Visit; Annual Income ($000); Attitude
Toward Travel; Importance Attached to Family Vacation; Household
Size; Age of Head of Household; Amount Spent on Family Vacation
(L/M/H).
16 2 32.1 5 4 3 58 L (1)
17 2 36.2 4 3 2 55 L (1)
18 2 43.2 2 5 2 57 M (2)
19 2 50.4 5 2 4 37 M (2)
20 2 44.1 6 6 3 42 M (2)
21 2 38.3 6 6 2 45 L (1)
22 2 55.0 1 2 2 57 M (2)
23 2 46.1 3 5 3 51 L (1)
24 2 35.0 6 4 5 64 L (1)
25 2 37.3 2 7 4 54 L (1)
26 2 41.8 5 1 3 56 M (2)
27 2 57.0 8 3 2 36 M (2)
28 2 33.4 6 8 2 50 L (1)
29 2 37.5 3 2 3 48 L (1)
30 2 41.3 3 3 2 42 L (1)
Information on Resort Visits: Holdout Sample
Columns: Family No.; Resort Visit; Annual Income ($000); Attitude
Toward Travel; Importance Attached to Family Vacation; Household
Size; Age of Head of Household; Amount Spent on Family Vacation
(L/M/H).
1 1 50.8 4 7 3 45 M(2)
2 1 63.6 7 4 7 55 H (3)
3 1 54.0 6 7 4 58 M(2)
4 1 45.0 5 4 3 60 M(2)
5 1 68.0 6 6 6 46 H (3)
6 1 62.1 5 6 3 56 H (3)
7 2 35.0 4 3 4 54 L (1)
8 2 49.6 5 3 5 39 L (1)
9 2 39.4 6 5 3 44 H (3)
10 2 37.0 2 6 5 51 L (1)
11 2 54.5 7 3 3 37 M(2)
12 2 38.2 2 2 3 49 L (1)
Results
In the testing for significance in the vacation resort analysis, we
found the Wilks’ lambda is 0.359, which transforms to a chi-
square of 26.13 with 5 degrees of freedom. This model is
significant (p<0.001).
Conducting Discriminant Analysis
Determine the Significance of the Discriminant Function
The null hypothesis is that, in the population, the means of all
discriminant functions in all groups are equal.
In SPSS this test is based on Wilks' λ. If several functions are
tested simultaneously (as in the case of multiple discriminant
analysis), the Wilks' λ statistic is the product of the univariate
λ for each function. The significance level is estimated based on
a chi-square transformation of the statistic.
If the null hypothesis is rejected, there is significant
discrimination between the groups.
75. Results
The pooled within-groups correlation matrix indicates low
correlations between the predictors, so multicollinearity is
unlikely to be a problem.
The significance of the univariate F ratios indicates that, when
the predictors are considered individually, only income,
importance of vacation, and household size significantly
differentiate between those who visited a resort and those who
did not.
Results
Predictors with relatively large standardized coefficients
contribute more to the discriminating power of the function.
The relative importance of the predictors can also be obtained
by examining the structure correlations: the correlations between
each predictor and the discriminant function represent the
variance that the predictor shares with the function. Thus income,
household size, importance attached to vacation, attitude toward
travel, and age of the head of the household are ordered from more
important to less important. The signs of the coefficients
associated with all the predictors are positive, suggesting that
higher family income, household size, importance attached to
family vacation, attitude toward travel, and age make it more
likely that the family will visit the resort.
The group centroids are also given. Group 1, those who have
visited a resort, has a positive value (1.291), whereas group 2
has an equal negative value.
76. Validation of Discriminant Analysis
Conducting Discriminant Analysis
Assess Validity of Discriminant Analysis
Many computer programs, such as SPSS, offer a leave-one-out
cross-validation option.
The hit ratio, or the percentage of cases correctly classified, can
then be determined by summing the diagonal elements and
dividing by the total number of cases. Here it is
(12+15)/30 = 90%. Leave-one-out cross-validation correctly
classifies only (11+13)/30 = 80% of the cases. Conducting the
classification analysis on an independent holdout set of data, we
obtain a hit ratio of (4+6)/12 = 83%.
Classification accuracy achieved by discriminant analysis
should be at least 25% greater than that obtained by chance.
Given two groups of equal size, by chance one could expect a
hit ratio of 1/2 = 0.5, or 50%. Here, the improvement over chance
is more than 25%, and the validity of the discriminant analysis
is judged as satisfactory.
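The hit-ratio arithmetic can be sketched in Python. The 2×2 table below is hypothetical except for its diagonal, which matches the 12 and 15 correctly classified cases reported in the text:

```python
# Hit ratio from a classification (confusion) table: sum the diagonal
# and divide by the total number of cases.
def hit_ratio(confusion):
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

table = [[12, 3],      # group 1: 12 correct (off-diagonal split is made up)
         [0, 15]]      # group 2: 15 correct
ratio = hit_ratio(table)        # (12 + 15) / 30 = 0.9
# The text's criterion: beat the 50% two-group chance rate by at least 25%.
assert ratio >= 0.50 * 1.25
```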
Results for three group DA
Because there are 3 groups, a maximum of 2 functions can be
extracted. The eigenvalue associated with the first function is
3.819, and this function accounts for 93.93% of the explained
variance. The second function has a small eigenvalue of 0.247
and accounts for only 6.1% of the explained variance.
To test the null hypothesis of equal group centroids, both
functions must be considered simultaneously.
The value of Wilks' lambda is 0.166, which transforms to a
chi-square of 44.831 with 10 df, significant well beyond the 0.05
level (p < 0.001).
Thus, the two functions together discriminate significantly
among the three groups. However, when the first function is
removed, the Wilks' lambda associated with the second function
is 0.8020, which is not significant at the 0.05 level. Therefore,
the second function does not contribute significantly to group
differences.
Results of Three-Group Discriminant Analysis
Income and attitude toward travel significantly separate the
three groups defined by amount spent on family vacation (high,
medium, and low) at the 0.05 level (p < 0.05). The other
predictors are not significant at the 0.05 level (p > 0.05).
Results of Three-Group Discriminant Analysis
Predictors with relatively large standardized coefficients
contribute more to the discriminating power of the function.
Income is a more important predictor than attitude toward travel
on function 1, whereas function 2 has relatively larger
coefficients for travel, vacation, and age. A similar conclusion
is reached by examining the structure matrix.
To help interpret the functions, variables with large coefficients
for a particular function are grouped together; those groups are
shown with asterisks.
Assess validity of discriminant analysis
The classification results indicate that (9+9+8)/30 = 86.7% of
the cases are correctly classified. Leave-one-out
cross-validation correctly classifies only (7+5+8)/30 = 66.7% of
the cases. When the classification analysis is conducted on the
independent holdout sample, a hit ratio of (3+3+3)/12 = 75% is
obtained.
Given three groups of equal size, by chance alone one could
expect a hit ratio of 1/3=33%. Thus, the improvement over
chance is greater than 25%, indicating a satisfactory validity.
SPSS Windows
The DISCRIMINANT program performs both two-group
and multiple discriminant analysis. To select this procedure
using SPSS for Windows click:
Analyze>Classify>Discriminant …
http://www.utexas.edu/courses/schwab/sw388r7/Tutorials/TwoG
roupHatcoDiscriminantAnalysis_doc_html/
9. Factor Analysis
79. *
Factor Analysis
Factor analysis is a general name denoting a class of procedures
primarily used for data reduction and summarization.
In regression, discriminant analysis (DA), and ANOVA, one
variable is considered the dependent variable (DV) and the other
variables independent variables (IVs). However, no such
distinction is made in factor analysis.
In factor analysis (FA) an entire set of interdependent
relationships is examined without making the distinction
between dependent and independent variables.
Factor analysis is used in the following circumstances: To
identify underlying dimensions, or factors, that explain the
correlations among a set of variables. To identify a new,
smaller, set of uncorrelated variables to replace the original set
of correlated variables in subsequent multivariate analysis
(regression or discriminant analysis). To identify a smaller set
of salient variables from a larger set for use in subsequent
multivariate analysis.
FA applications in marketing research
FA is used in market segmentation to identify the underlying
variables on which to group customers. New car buyers might be
grouped based on the relative emphasis they place on economy,
convenience, performance, comfort, and luxury. This might result
in five segments: economy seekers, convenience seeker…
In pricing studies, FA can be used to identify the characteristics
of price-sensitive consumers. For example, these consumers
might be methodical, economy minded, and home centered.
Explain loadings and communalities
Communalities are related to loadings: the sum of squares of a
row in the loading matrix equals the communality of that
variable.
Factor loadings. Factor loadings are simple correlations
between the variables and the factors.
Factor matrix. A factor matrix contains the factor loadings of all
the variables on all the factors extracted.
Communality. The proportion of variance explained by the
common factors.
Statistics Associated with Factor Analysis
Explain loadings and communalities
Loadings are the correlations between a factor (columns) and a
variable (rows). Analogous to Pearson's r, the squared factor
loading is the percentage of variance in that variable explained
by the factor. For the purpose of interpretation, the loading
matrix is the most important output of factor analysis.
Communality (h²) represents the proportion of the variance of
each variable that is explained by the selected factors; in other
words, how much of an item is shared among the factors.
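The relationship between loadings, communalities, and factor variances can be verified numerically. A minimal sketch with a hypothetical 4-variable, 2-factor loading matrix (the values are illustrative, not from the text):

```python
import numpy as np

# Hypothetical loading matrix: 4 variables (rows) loading on
# 2 factors (columns).
loadings = np.array([
    [0.90,  0.10],
    [0.85, -0.05],
    [0.15,  0.80],
    [0.05,  0.75],
])

# Communality of each variable = sum of squared loadings across its row.
communalities = (loadings ** 2).sum(axis=1)

# Eigenvalue (variance explained) of each factor = column sum of
# squared loadings.
eigenvalues = (loadings ** 2).sum(axis=0)

print(communalities)  # first variable: 0.90**2 + 0.10**2 = 0.82
print(eigenvalues)
```

Note that the total of the communalities equals the total of the eigenvalues: both are the overall sum of squared loadings, counted by rows or by columns.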
Statistics Associated with Factor Analysis
Bartlett's test of sphericity. A test statistic used to
examine the hypothesis that the variables are uncorrelated in the
population. In other words, the population correlation matrix is
an identity matrix; each variable correlates perfectly with itself
(r = 1) but has no correlation with the other variables (r = 0).
Correlation matrix. A correlation matrix is a lower triangle
matrix showing the simple correlations, r, between all possible
pairs of variables included in the analysis. The diagonal
elements, which are all 1, are usually omitted.
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is an
index used to examine the appropriateness of factor analysis.
High values (between 0.5 and 1.0) indicate factor analysis is
appropriate. Values below 0.5 imply that factor analysis may
not be appropriate.
Factor scores. Factor scores are composite scores estimated for
each respondent on the derived factors.
Eigenvalue. The column sum of squared loadings for a factor; also
referred to as the latent root. It represents the amount of
variance accounted for by a factor.
Scree plot. A scree plot is a plot of the Eigenvalues against the
number of factors in order of extraction.
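Both appropriateness diagnostics can be computed directly from a correlation matrix. A sketch using only NumPy, with a small hypothetical 3-variable correlation matrix; the Bartlett statistic uses the standard chi-square approximation, and the p-value lookup (from a chi-square table) is omitted:

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test statistic for H0: the population correlation
    matrix is an identity matrix. R: p x p correlation matrix,
    n: sample size. Returns (chi-square statistic, degrees of freedom)."""
    p = R.shape[0]
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return stat, df

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy, computed from
    the simple correlations and the partial correlations (obtained
    via the inverse of R)."""
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    partial = -Rinv / d                       # partial correlation matrix
    off = ~np.eye(R.shape[0], dtype=bool)     # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    return r2 / (r2 + (partial[off] ** 2).sum())

# Hypothetical 3-variable correlation matrix, n = 30 respondents.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
stat, df = bartlett_sphericity(R, n=30)
print(stat, df, kmo(R))
```

A large Bartlett statistic (relative to a chi-square with p(p-1)/2 df) rejects the identity-matrix hypothesis, and a KMO above 0.5 supports the use of factor analysis.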
Scree Plot
[Figure: scree plot of the Eigenvalues (0.0 to 3.0) against
Component Number (1 to 6)]
In principal components analysis, the total variance in the data
is considered. The diagonal of the correlation matrix consists
of unities, and full variance is brought into the factor matrix.
Principal components analysis is recommended when the
primary concern is to determine the minimum number of factors
that will account for maximum variance in the data for use in
subsequent multivariate analysis. The factors are called
principal components.
In common factor analysis, the factors are estimated based only
on the common variance. Communalities are inserted in the
diagonal of the correlation matrix. This method is appropriate
when the primary concern is to identify the underlying
dimensions and the common variance is of interest. This
method is also known as principal axis factoring.
Conducting Factor Analysis
Determine the Method of Factor Analysis
Difference Between PAF and PCA
Principal components analysis inserts 1's on the diagonal of the
correlation matrix, thus considering all of the available
variance. It is most appropriate when the concern is with
deriving the minimum number of factors that explain a maximum
portion of the variance in the original variables, and the
researcher knows that the specific and error variances are small.
Common factor analysis uses only the common variance and places
communality estimates on the diagonal of the correlation matrix.
It is most appropriate when there is a desire to reveal the
latent dimensions of the original variables and the researcher
does not know the nature of the specific and error variance.
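The PCA side of this distinction can be sketched as an eigendecomposition of a correlation matrix with unities on the diagonal, so the eigenvalues partition the total variance. The data here are random stand-ins, not the toothpaste ratings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings: 30 respondents x 6 variables (illustrative
# stand-in data, for demonstration only).
X = rng.integers(1, 8, size=(30, 6)).astype(float)

# Principal components analysis works on the correlation matrix with
# unities (1's) on the diagonal, i.e. the full (total) variance.
R = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigh(R)[0][::-1]      # sort descending

# Each eigenvalue divided by their total (= number of variables)
# is the proportion of total variance that component accounts for.
variance_explained = eigenvalues / eigenvalues.sum()
print(eigenvalues)
print(variance_explained.cumsum())
```

Common factor analysis (principal axis factoring) would instead replace the unit diagonal with communality estimates before the decomposition, so only the common variance is factored.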
Factor Analysis example: toothpaste attribute ratings
RESPONDENT NUMBER   V1    V2    V3    V4    V5    V6
1                   7.00  3.00  6.00  4.00  2.00  4.00
2                   1.00  3.00  2.00  4.00  5.00  4.00
3                   6.00  2.00  7.00  4.00  1.00  3.00
4                   4.00  5.00  4.00  6.00  2.00  5.00
5                   1.00  2.00  2.00  3.00  6.00  2.00
6                   6.00  3.00  6.00  4.00  2.00  4.00
7                   5.00  3.00  6.00  3.00  4.00  3.00
8                   6.00  4.00  7.00  4.00  1.00  4.00
9                   3.00  4.00  2.00  …
Example: toothpaste ratings
Suppose the researcher wants to determine the underlying
benefits consumers seek from the purchase of a toothpaste. A
sample of 30 respondents was interviewed using the mall-intercept
method. The respondents were asked to indicate their degree of
agreement with the following statements using a 7-point scale
(1 = strongly disagree, 7 = strongly agree).
V1: It is important to buy a toothpaste that prevents cavities.
V2: I like a toothpaste that gives shiny teeth.
V3: A toothpaste should strengthen your gums.
V4: I prefer a toothpaste that freshens breath.
V5: Prevention of tooth decay is not an important benefit offered
by a toothpaste.
V6: The most important consideration in buying a toothpaste is
attractive teeth.
The analytical process is based on a matrix of correlations
between the variables. For FA to be appropriate, the variables
must be correlated. Bartlett's test of sphericity can be
used to test the null hypothesis that the variables are
uncorrelated in the population. If this hypothesis cannot be
rejected, then the appropriateness of factor analysis should be
questioned. Another useful statistic is the Kaiser-Meyer-Olkin
(KMO) measure of sampling adequacy. Small values of the
KMO statistic indicate that the correlations between pairs of
variables cannot be explained by other variables and that factor
analysis may not be appropriate. Generally, a value greater than
0.5 is desirable.
Conducting Factor Analysis
Construct the Correlation Matrix
Correlation Matrix
Results of PCA
The null hypothesis, that the population correlation matrix is
an identity matrix (not correlated), is rejected by the Bartlett’s
test of sphericity. The chi-square statistic is 111.314 with
15 df, which is significant at the 0.05 level (p < 0.001).
The value of KMO statistic (0.660) is large (>0.5). Thus, factor
analysis is considered an appropriate technique for this study.
The eigenvalue of factor 1 (F1) is 2.731, and it accounts for a
variance of 45.52% of the total variance. The second factor (F2)
has an eigenvalue of 2.218, which is 36.96% of the total
variance. The first two factors combined account for 82.49% of
the total variance.
The communality table gives relevant information after the
desired number of factors has been extracted.
Results
From the scree plot, the first two components have eigenvalues
greater than 1.
A Priori Determination. Sometimes, because of prior
knowledge, the researcher knows how many factors to expect
and thus can specify the number of factors to be extracted
beforehand.
Determination Based on Eigenvalues. In this approach, only
factors with Eigenvalues greater than 1.0 are retained. An
Eigenvalue represents the amount of variance associated with
the factor. Hence, only factors with a variance greater than 1.0
are included. Factors with variance less than 1.0 are no better
than a single variable, since, due to standardization, each
variable has a variance of 1.0. If the number of variables is less
than 20, this approach will result in a conservative number of
factors.
Conducting Factor Analysis
Determine the Number of Factors
Determination Based on Scree Plot. A scree plot is a plot of
the Eigenvalues against the number of factors in order of
extraction. Experimental evidence indicates that the point at
which the scree begins denotes the true number of factors.
Generally, the number of factors determined by a scree plot will
be one or a few more than that determined by the Eigenvalue
criterion.
Determination Based on Percentage of Variance. In this
approach the number of factors extracted is determined so that
the cumulative percentage of variance extracted by the factors
reaches a satisfactory level. It is recommended that the factors
extracted should account for at least 60% of the variance.
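Both criteria can be applied mechanically. A sketch using the two eigenvalues reported in the results above (2.731 and 2.218); the remaining four values are hypothetical fillers, chosen only so the six eigenvalues sum to 6:

```python
import numpy as np

# Eigenvalues from a 6-variable PCA: the first two are from the text;
# the last four are hypothetical placeholders summing with them to 6.
eigenvalues = np.array([2.731, 2.218, 0.442, 0.341, 0.183, 0.085])

# Eigenvalue criterion: retain factors with eigenvalue > 1.0.
n_eigen = int((eigenvalues > 1.0).sum())

# Percentage-of-variance criterion: retain enough factors to reach
# a cumulative 60% of the total variance.
cum_var = np.cumsum(eigenvalues) / eigenvalues.sum()
n_var = int(np.searchsorted(cum_var, 0.60) + 1)

print(n_eigen, n_var)  # both criteria retain 2 factors here
```

With these values the two factors account for (2.731 + 2.218)/6 = 82.5% of the total variance, consistent with the 82.49% reported in the results.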
Results
The component matrix shows that some variables have cross
loadings (absolute factor loadings greater than 0.3) on both
component 1 and component 2. Such a complex matrix is difficult
to interpret. Through rotation, the component matrix is
transformed into a simpler one that is easier to interpret.
In the rotated component matrix table, F1 has high coefficients
for V1 (prevention of cavities) and V3 (strong gums), and a
negative coefficient for V5 (prevention of tooth decay is not
important). Therefore, this factor can be labeled a health
benefit factor. Note that the negative coefficient for a
negatively worded variable (V5) leads to a positive
interpretation: prevention of tooth decay is important.
F2 is highly related to V2 (shiny teeth), V4 (fresh breath),
and V6 (attractive teeth). Thus F2 may be labeled a social
benefit factor.
Factor Matrix Before and After Rotation
[Table: factor loadings of the variables on components 1 and 2,
with high loadings marked (a) before rotation and (b) after
rotation]
Although the initial or unrotated factor matrix indicates the
relationship between the factors and individual variables, it
seldom results in factors that can be interpreted, because the
factors are correlated with many variables. Therefore, through
rotation the factor matrix (or component matrix) is transformed
into a simpler one that is easier to interpret. The rotation is
called orthogonal rotation if the axes are maintained at right
angles. The most commonly used method for rotation is
the varimax procedure. This is an orthogonal method of
rotation that minimizes the number of variables with high
loadings on a factor, thereby enhancing the interpretability of
the factors. Orthogonal rotation results in factors that are
uncorrelated. The rotation is called oblique rotation when the
axes are not maintained at right angles, and the factors are
correlated. Sometimes, allowing for correlations among factors
can simplify the factor pattern matrix. Oblique rotation should
be used when factors in the population are likely to be strongly
correlated.
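The varimax idea can be sketched compactly with a common textbook algorithm (SVD-based gradient updates); the unrotated loading matrix below is hypothetical. Because the rotation matrix is orthogonal, each variable's communality is unchanged by the rotation:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a loading matrix, via a standard
    SVD-based iteration (a textbook algorithm, not from the text above)."""
    L = loadings.copy()
    p, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient of the varimax criterion with respect to the rotation.
        grad = L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                      # nearest orthogonal matrix
        new_var = s.sum()
        if new_var < var * (1 + tol):   # criterion stopped improving
            break
        var = new_var
    return L @ R

# Hypothetical unrotated loadings: 4 variables on 2 factors.
A = np.array([[0.7, 0.5], [0.6, 0.6], [0.5, -0.6], [0.6, -0.5]])
B = varimax(A)

# Orthogonal rotation preserves communalities (row sums of squares).
print(np.round((A ** 2).sum(axis=1), 4))
print(np.round((B ** 2).sum(axis=1), 4))
```

After rotation, each variable tends to load highly on one factor and near zero on the others, which is exactly the interpretability gain the text describes.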
Conducting Factor Analysis
Rotate Factors
Results
In principal component analysis, component scores are
uncorrelated. In common factor analysis, estimates of these
scores are obtained, and there is no guarantee that the factors
will be uncorrelated with each other.
Fi = Wi1 X1 + Wi2 X2 + Wi3 X3 + . . . + Wik Xk
where Fi is the estimate of the ith factor, the Wij are the
component/factor score coefficients, and k is the number of
variables. The weights are obtained from the component/factor
score coefficient matrix. Using the component score coefficient
matrix, one can compute two component scores for each respondent.
A factor can then be interpreted in terms of the variables that
load high on it. Another useful aid in interpretation is to plot
the variables, using the factor loadings as coordinates.
Variables at the end of an axis are those that have high loadings
on only that factor, and hence describe the factor.
Conducting Factor Analysis
Interpret Factors
Factor Loading Plot
Conducting Factor Analysis
Determine the Model Fit
The final step is to determine the model fit.
A basic assumption underlying FA is that the observed
correlation between variables can be attributed to common
factors. Hence, the correlations between the variables can be
reproduced from the estimated correlations between the
variables and the factors. The difference between the observed
correlations (as given in the input correlation matrix) and the
reproduced correlations (as estimated from the factor matrix)
can be examined to determine model fit. These differences are
called residuals. If there are many large residuals, the model
does not provide a good fit. The following table has only five
residuals greater than 0.05, indicating an acceptable model fit.
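The residual check can be sketched directly: reproduce the correlations as L L' and compare the off-diagonal elements with the observed matrix. The loadings and observed correlations below are illustrative, not the values from the text's table:

```python
import numpy as np

# Hypothetical loadings for 4 variables on 2 factors, and a matching
# hypothetical observed correlation matrix.
L = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.8], [0.2, 0.7]])
R_obs = np.array([[1.00, 0.58, 0.16, 0.22],
                  [0.58, 1.00, 0.24, 0.30],
                  [0.16, 0.24, 1.00, 0.60],
                  [0.22, 0.30, 0.60, 1.00]])

# Reproduced correlations, estimated from the factor matrix.
R_hat = L @ L.T

# Residuals = observed minus reproduced, examined off the diagonal.
residuals = R_obs - R_hat
off = ~np.eye(4, dtype=bool)
n_large = int((np.abs(residuals[off]) > 0.05).sum())
print(n_large)  # 0 here: no large residuals, so an acceptable fit
```

Many residuals above 0.05 would signal that the common factors fail to account for the observed correlations, i.e. a poor model fit.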
Results of Common Factor Analysis
SPSS Windows
To select this procedure using SPSS for Windows click: