VALIDITY AND RELIABILITY OF
MEASURING INSTRUMENTS
By Qurat-ul-Ain (MBA-12-02)
Contents
INTRODUCTION
Measurement
Instruments
Criteria of Measurement
Reliability
Validity
Measurement Errors
Types of Errors
VALIDITY AND RELIABILITY IN MEASURING INSTRUMENTS (QUESTIONNAIRES)
Types of Measuring Instruments
Validity and reliability in Questionnaires
Assessing Validity and Reliability of Questionnaires
Approaches to Validity of Questionnaires
Development of a valid and reliable Questionnaire
Conclusion
References
Validity and Reliability of Measuring Instruments
INTRODUCTION
Before describing the validity and reliability of measuring instruments, let's first look at some
definitions that will help develop a clear understanding of the phenomena.
Definitions:
 Measurement:
Measurement involves the operationalization of theoretical constructs as defined variables and the
development and application of instruments or tests to quantify these variables.
S.S. Stevens defines measurement as
"The assignment of numerals to objects or events according to rules."
This definition incorporates a number of important distinctions.
First, it implies that if rules can be set up, it is theoretically possible to measure anything.
Further, measurement is only as good as the rules that direct its application. The "goodness"
of the rules reflects on the reliability and validity of the measurement.
The second aspect of the definition given by Stevens is the use of the term numeral rather than
number. A numeral is a symbol and has no quantitative meaning unless the researcher
supplies it through the use of rules. The researcher sets up the criteria by which objects or
events are distinguished from one another and also the weights, if any, which are to be
assigned to these distinctions. This results in a scale.
 Instruments
Instrument is the generic term that researchers use for a measurement device (survey, test,
questionnaire, etc.). To help distinguish between instrument and instrumentation, consider
that the instrument is the device and instrumentation is the course of action (the process of
developing, testing, and using the device).
Instruments fall into two broad categories:
 researcher-completed
 subject-completed
Researcher-completed instruments are those instruments that researchers administer
themselves.
Subject-completed instruments, on the other hand, are those completed by the participants themselves.
Researchers choose which type of instrument, or instruments, to use based on the research
question.
Examples are listed below:
Researcher-completed instruments: rating scales, interview schedules/guides, tally sheets, flowcharts, performance checklists, time-and-motion logs, observation forms.
Subject-completed instruments: questionnaires, self-checklists, attitude scales, personality inventories, achievement/aptitude tests, projective devices, sociometric devices.
(Biddix, 2003)
 Criteria of Measurement:
There are two fundamental criteria of measurement, i.e., reliability and validity.
(Quinn, 2000)
 Reliability
Reliability is the degree to which an assessment tool produces stable and consistent results.
Reliability can be thought of as consistency. Does the instrument consistently measure what
it is intended to measure?
Types of Reliability:
1. Test-retest reliability:
Test-retest reliability is a measure of reliability obtained by administering the same
test twice over a period of time to a group of individuals. The scores from Time 1 and
Time 2 can then be correlated in order to evaluate the test for stability over time.
Example: A test designed to assess student learning in psychology could be given to a
group of students twice, with the second administration perhaps coming a week after the
first. The obtained correlation coefficient would indicate the stability of the scores.
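As a rough illustration of how this correlation might be computed, here is a minimal sketch in Python; the scores and variable names are invented purely for illustration. A coefficient near 1 would indicate that scores are stable across the two administrations.

```python
import numpy as np

# Hypothetical scores for the same six students, one week apart.
time1 = np.array([72, 85, 90, 64, 78, 88])
time2 = np.array([70, 88, 91, 60, 80, 85])

# Test-retest reliability: Pearson correlation between the two administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability (Pearson r): {r:.2f}")
```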
2. Parallel forms reliability:
Parallel forms reliability is a measure of reliability obtained by administering different
versions of an assessment tool (both versions must contain items that probe the
same construct, skill, knowledge base, etc.) to the same group of individuals. The
scores from the two versions can then be correlated in order to evaluate the
consistency of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment, you
might create a large set of items that all pertain to critical thinking and then randomly split the
questions up into two sets, which would represent the parallel forms.
3. Inter-rater reliability:
Inter-rater reliability is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-rater reliability is
useful because human observers will not necessarily interpret answers the same
way; raters may disagree as to how well certain responses or material demonstrate
knowledge of the construct or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating the
degree to which art portfolios meet certain standards. Inter-rater reliability is especially
useful when judgments can be considered relatively subjective. Thus, the use of this type of
reliability would probably be more likely when evaluating artwork as opposed to math
problems.
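One common way to quantify inter-rater agreement is Cohen's kappa, which corrects simple percent agreement for the agreement expected by chance. The sketch below shows one way this could be computed; the two judges' ratings are hypothetical, and pass/fail is used only to keep the example small.

```python
from collections import Counter

# Hypothetical ratings of 10 art portfolios by two judges.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]

n = len(rater_a)
# Observed agreement: proportion of portfolios rated identically by both judges.
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance, from each rater's marginal proportions.
count_a, count_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

# Cohen's kappa: agreement beyond chance, scaled to the maximum possible.
kappa = (p_o - p_e) / (1 - p_e)
print(f"Percent agreement: {p_o:.2f}, Cohen's kappa: {kappa:.2f}")
```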
4. Internal consistency reliability:
Internal consistency reliability is a measure of reliability used to evaluate the degree
to which different test items that probe the same construct produce similar results.
A. Average inter-item correlation is a subtype of internal consistency
reliability. It is obtained by taking all of the items on a test that probe the
same construct (e.g., reading comprehension), determining the correlation
coefficient for each pair of items, and finally taking the average of all of these
correlation coefficients. This final step yields the average inter-item
correlation.
B. Split-half reliability is another subtype of internal consistency reliability. The
process of obtaining split-half reliability is begun by “splitting in half” all items
of a test that are intended to probe the same area of knowledge (e.g., World
War II) in order to form two “sets” of items. The entire test is administered to
a group of individuals, the total score for each “set” is computed, and finally
the split-half reliability is obtained by determining the correlation between the
two total “set” scores.
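A minimal sketch of the split-half procedure just described, assuming a small, made-up matrix of item scores (rows are respondents, columns are items probing the same construct):

```python
import numpy as np

# Hypothetical item matrix: 5 respondents x 6 items on the same topic.
scores = np.array([
    [4, 5, 4, 3, 4, 5],
    [2, 1, 2, 2, 1, 2],
    [3, 4, 3, 4, 3, 3],
    [5, 5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2, 1],
])

# Split the items into two sets (here: odd-numbered vs. even-numbered items),
# total each set per respondent, and correlate the two totals.
half_a = scores[:, 0::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)
split_half_r = np.corrcoef(half_a, half_b)[0, 1]
print(f"Split-half correlation: {split_half_r:.2f}")
```

In practice the resulting half-test correlation is usually adjusted upward with the Spearman-Brown correction discussed later in this report.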
 Validity
Validity refers to how well a test measures what it is purported to measure.
Why is it necessary?
While reliability is necessary, it alone is not sufficient: a test can be reliable without being valid.
For example, if your bathroom scale is off by 5 lbs, it reads your weight every day as 5 lbs too
heavy. The scale is reliable because it consistently reports the same weight every day, but it is
not valid because it adds 5 lbs to your true weight. It is not a valid measure of your weight.
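The bathroom-scale example can be made concrete with a few hypothetical numbers: the readings barely vary from day to day (reliable), yet they are consistently offset from the true weight (not valid).

```python
import numpy as np

true_weight = 150.0                                          # the person's actual weight in lbs
readings = np.array([155.1, 154.9, 155.0, 155.2, 154.8])     # daily readings, about 5 lbs too high

# Reliable: the readings barely vary from day to day.
print(f"Spread of readings (SD): {readings.std():.2f} lbs")
# Not valid: the readings are systematically off from the true weight.
print(f"Average error (bias):    {readings.mean() - true_weight:+.1f} lbs")
```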
Types of Validity:
1. Face Validity:
Face Validity ascertains that the measure appears to be assessing the intended
construct under study. The stakeholders can easily assess face validity. Although this
is not a very “scientific” type of validity, it may be an essential component in enlisting
motivation of stakeholders. If the stakeholders do not believe the measure is an
accurate assessment of the ability, they may become disengaged with the task.
Example: If a measure of art appreciation is created, all of the items should be related to the
different components and types of art. If the questions are regarding historical time periods,
with no reference to any artistic movement, stakeholders may not be motivated to give their
best effort or invest in this measure because they do not believe it is a true assessment of
art appreciation.
2. Construct Validity
Construct Validity is used to ensure that the measure actually measures what it is
intended to measure (i.e., the construct), and not other variables. Using a panel of
“experts” familiar with the construct is a way in which this type of validity can be
assessed. The experts can examine the items and decide what that specific item is
intended to measure. Students can be involved in this process to obtain their
feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. The questions are written with complicated wording and
phrasing. This can cause the test to inadvertently become a test of reading comprehension,
rather than a test of women’s studies. It is important that the measure is actually assessing
the intended construct, rather than an extraneous factor.
3. Criterion-Related Validity:
Criterion-Related Validity is used to predict future or current performance - it
correlates test results with another criterion of interest.
Example: Suppose a physics program designed a measure to assess cumulative student learning
throughout the major. The new measure could be correlated with a standardized measure of
ability in this discipline, such as an ETS field test or the GRE subject test. The higher the
correlation between the established measure and new measure, the more faith stakeholders
can have in the new assessment tool.
4. Formative Validity
Formative Validity, when applied to outcomes assessment, is used to assess how
well a measure is able to provide information to help improve the program under
study.
Example: When designing a rubric for history, one could assess students' knowledge across
the discipline. If the measure can provide information that students are lacking knowledge in
a certain area, for instance the Civil Rights Movement, then that assessment tool is providing
meaningful information that can be used to improve the course or program requirements.
5. Sampling Validity
Sampling Validity (similar to content validity) ensures that the measure covers the
broad range of areas within the concept under study. Not everything can be covered,
so items need to be sampled from all of the domains. This may need to be
completed using a panel of “experts” to ensure that the content area is adequately
sampled.
Example: When designing an assessment of learning in the theatre department, it would not
be sufficient to only cover issues related to acting. Other areas of theatre, such as lighting,
sound, and the functions of stage managers, should all be included. The assessment should reflect
the content area in its entirety.
(Wren, 2005-06)
 Measurement Errors:
All measurements may contain some element of error; validity and reliability concern the
different types of error that typically occur, and they also show how we can estimate the
extent of error in a measurement.
There are three chief sources of error:
 In the thing being measured (my weight may fluctuate so it's difficult to get an
accurate picture of it);
 The observer (on Mondays I may knock a pound off my weight if I binged on my
mother's cooking at the week-end. Obviously the binging doesn't reflect my true
weight!);
 Or in the recording device (our clinic weigh scale has been acting up; we really
should get it recalibrated).
 Types of Errors
Following are some types of errors that can occur in measurement:
1. Random error
Random error is that which causes random and uncontrollable effects in measured results
across a sample; for example, rainy weather may depress some respondents and so affect their answers.
The effect of random error is to cause additional spread in the measurement distribution,
causing an increase in the standard deviation of the measurement. The average should not
be affected, which is good news if this is being quoted in results.
The stability of the average is due to the effect of regression to the mean, whereby random
effects make a high score as likely as a low score, so in a random sample they eventually
cancel one another out.
2. True score
The true score is that which is sought. It is not the same as the observed score as this
includes the random error, as follows:
Observed score = True score + random error
When the random error is small, then the observed score will be close to the true score and
thus be a fair representation. If, however, the random error is large, the observed score will
be nothing like the true score and has no value.
Another effect is that if a test score is near a boundary it may incorrectly cross the boundary.
For example, if a school exam result is close to the A/B grade level, then the grade given may
not be a reflection of the actual ability of the student.
Assuming that an observed score is the true score is a dangerous trap, particularly if you have
no real idea of how big the random error may be.
3. Systematic error
In addition to natural error, additional variation from the true score may be introduced when
there is some error caused by problems in the measurement system, such as when bad
weather affects everyone in the study or when poor questions result in answers which do
not reflect true opinions.
There are many ways of allowing for or introducing systematic error, and eliminating it is a
critical part of experimental design, as is assessment of the context and environment at the
time of the experiment.
The effect of systematic error is often to shift the mean of the measurement distribution,
which can be particularly pernicious if this is to be quoted in results.
Measurement error
Measurement error is the real variation from the true score, and includes both random error
and systematic error.
Observed score = True score + random error + systematic error
(Jones, 2000)
So we can say that measurement error can be minimized by choosing an instrument
that is valid and reliable.
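The error model above can be illustrated with a small simulation. Under the stated assumptions (a known true score, normally distributed random error, and a constant systematic offset), random error widens the spread of observed scores while systematic error shifts their average; all the numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_score = 50.0

random_error = rng.normal(loc=0.0, scale=5.0, size=n)   # uncontrolled, averages out over the sample
systematic_error = 3.0                                   # e.g. a miscalibrated instrument

# Observed score = True score + random error + systematic error
observed = true_score + random_error + systematic_error

# Random error inflates the spread but not the average;
# systematic error shifts the average away from the true score.
print(f"Mean observed score: {observed.mean():.2f} (true score is {true_score})")
print(f"SD of observed scores: {observed.std():.2f}")
```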
VALIDITY AND RELIABILITY IN MEASURING INSTRUMENTS
(QUESTIONNAIRES)
Types of Measuring Instruments:
There are basically four types of measuring instruments:
 Observational Research
 Interview
 Survey
 Questionnaire
Observational research:
Observational research (or field research) is a type of correlational (i.e., non-experimental)
research in which a researcher observes ongoing behaviour. There are a variety of types of
observational research, each of which has both strengths and weaknesses. These types are
organized below by the extent to which an experimenter intrudes upon or controls the
environment.
Observational research is particularly prevalent in the social sciences and in marketing. It is
a social research technique that involves the direct observation of phenomena in their
natural setting. This differentiates it from experimental research in which a quasi-artificial
environment is created to control for spurious factors, and where at least one of the variables
is manipulated as part of the experiment. It is typically divided into naturalistic (or
“nonparticipant”) observation and participant observation. Case studies and archival
research are special types of observational research. Naturalistic (or nonparticipant)
observation has no intervention by a researcher. It is simply studying behaviours that occur
naturally in natural contexts, unlike the artificial environment of a controlled laboratory
setting. Importantly, in naturalistic observation, there is no attempt to manipulate variables. It
permits measuring what behaviour is really like. However, its typical limitations are its
inability to explore the actual causes of behaviours and the impossibility of determining
whether a given observation is truly representative of what normally occurs.
Survey
The survey is probably the most commonly used research design; we have all been asked to
take part in a survey at some time. As consumers we are asked about our shopping habits;
as users of services we are asked for our opinions of those services.
The survey is a flexible research approach used to investigate a wide range of topics.
Surveys often employ the questionnaire as a tool for data collection. This resource pack
considers the use of surveys and questionnaires in health and social care research.
Surveys are a very traditional way of conducting research. They are particularly useful for
non-experimental descriptive designs that seek to describe reality. So, for instance, a survey
approach may be used to establish the prevalence or incidence of a particular condition.
Likewise, the survey approach is frequently used to collect information on attitudes and
behaviour. Some issues are best addressed by classical experimental designs, but a survey
provides a snapshot of what is happening in a group at a particular time. Surveys usually take
a descriptive or exploratory form that simply sets out to describe behaviour or attitudes. So,
for example, if you want to
measure some aspect of client satisfaction, then a cross-sectional descriptive survey would
be the recommended approach. Likewise, if you wish to establish the prevalence of
depression amongst new mothers, a postal survey might be an appropriate approach.
Surveys can take many forms. A survey of the entire population would be known as a
census. However, surveys are usually restricted to a representative sample of the potential
group that the researcher is interested in, for reasons of practicality and cost-effectiveness.
Interview
Interviews are an attractive proposition for the project researcher. Interviews are something
more than conversation. They involve a set of assumptions and understandings about the
situation which are not normally associated with a casual conversation. Interviews are also
referred to as an oral questionnaire by some people, but they are indeed much more than that.
A questionnaire involves indirect data collection, whereas interview data are collected directly,
in face-to-face contact. People are generally more hesitant to write something down than to
talk about it. With a friendly relationship and rapport, the interviewer can obtain certain types
of confidential information that respondents might be reluctant to put in writing.
Therefore, a research interview should be systematically arranged; it does not happen by
chance. Interviews are not conducted by secretly recording discussions as research data; the
consent of the subject is obtained for the purpose of the interview.
Validity and reliability in Questionnaires:
In this report we will describe the contents of validity and reliability in Questionnaires.
Questionnaires
Questionnaires are the most frequently used data collection method in educational and
evaluation research. The objective of most questionnaires in research is to obtain relevant
information in the most reliable and valid manner. Questionnaires help gather information on
knowledge, attitudes, opinions, behaviours, facts, and other information. In a review of 748
research studies conducted in agricultural and Extension education, Radhakrishna, Leite,
and Baggett (2003) found that 64% used questionnaires. They also found that a third of the
studies reviewed did not report procedures for establishing validity (31%) or reliability (33%).
Development of a valid and reliable questionnaire is a must to reduce measurement error.
Groves (1987) defines measurement error as the "discrepancy between respondents'
attributes and their survey responses" (p. 162).
(Radhakrishna, 2007)
Assessing Validity and Reliability of Questionnaires:
In order to have confidence in the results of a study, one must be assured that the
questionnaire consistently measures what it purports to measure when properly
administered. In short, the questionnaire must be both valid and reliable.
Warwick and Linninger (1975) point out that there are two basic goals in questionnaire
design.
1. To obtain information relevant to the purposes of the survey.
2. To collect this information with maximal reliability and validity.
How can a researcher be sure that the data gathering instrument being used will measure
what it is supposed to measure and will do this in a consistent manner?
This is a question that can only be answered by examining the definitions for and methods of
establishing the validity and reliability of a research instrument. We can say that the reliability
of a questionnaire refers to the quality of the tool, while validity refers to the process used to
employ the tool, i.e., the way the questionnaire is used.
The basic difference between these two criteria is that they deal with different aspects of
measurement. This difference can be summarized by two different sets of questions asked
when applying the two criteria:
Validity:
a. Does the measure employed really measure the theoretical concept (variable)?
Reliability:
a. Will the measure employed repeatedly on the same individuals yield similar results?
(Stability)
b. Will the measure employed by different investigators yield similar results? (Equivalence)
c. Will a set of different operational definitions of the same concept employed on the same
individuals, using the same data-collecting technique, yield a highly correlated result? Or, will
all items of the measure be internally consistent? (Homogeneity)
Validity
We have already defined validity as the degree to which a test measures what it is
supposed to measure. The validation of a questionnaire forms an important aspect of research
methodology and of the validity of the outcomes. Validity is concerned with the accuracy of our
measurement. So we can say that the validity of a questionnaire seems to emerge from the
internal and external consistency and relevance of the questionnaire.
The overriding principle of validity is that it focuses on how a questionnaire or assessment
process is used. Reliability is a characteristic of the instrument itself, but validity comes
from the way the instrument is employed.
The following ideas support this principle:
 As nearly as possible, the data gathering should match the decisions you need to
make. This means if you need to make a priority-focused decision, such as
allocating resources or eliminating programs, your assessment process should be a
comparative one that ranks the programs or alternatives you will be considering.
 Gather data from all the people who can contribute information, even if they are
hard to contact. For example, if you are conducting a survey of customer service,
try to get a sample of all the customers, not just those who are easy to reach, such
as those who have complained or have made suggestions.
 If you're going after sensitive information, protect your sources. It has been
said that in the Prussian army at the turn of the century, decisions were made twice,
once when officers were sober, again when they were drunk. This concept
acknowledges the power of the "socially acceptable response" to questions or
requests. Don't assume that a simple statement printed on the questionnaire that "all
individual responses will be kept confidential" will make everybody relax and provide
candid answers. Give respondents the freedom to decide which information about
themselves they wish to withhold, and employ other administrative procedures, such
as handing out Login IDs and Passwords separately from the e-mail inviting people
to participate in the survey. (Jackson, 2000)
Approaches to Validity of Questionnaires
There are three basic approaches to the validity of Questionnaires as shown by Mason and
Bramble (1989). These are content validity, construct validity, and criterion-related validity.
1. Content Validity
Once the questionnaire is drafted, one must determine whether the domain has been
adequately covered (content validity). This approach measures the degree to which the test
items represent the domain or universe of the trait or property being measured. In order to
establish the content validity of questionnaire, the researcher must identify the overall
content to be represented. Items must then be randomly chosen from this content that will
accurately represent the information in all areas. By using this method the researcher
should obtain a group of items which is representative of the content of the trait or property
to be measured.
Identifying the universe of content is not an easy task. It is, therefore, usually suggested that
a panel of experts in the field to be studied be used to identify a content area. For
example, suppose it was decided that appetite consisted of three attributes. In order for the
questionnaire to have content validity all three attributes must be questioned sufficiently. A
tally of the number of questions addressing each attribute will immediately indicate any
imbalance. If an imbalance exists the results may be biased, particularly when the
questionnaire yields a single score, as in measurements of functional health status.
2. Construct Validity
Construct validity refers to the extent to which the new questionnaire conforms to existing
ideas or hypotheses concerning the concepts (constructs) that are being measured.
Construct validity presents the greatest challenge in questionnaire development.
Cronbach and Meehl (1955) indicated that, "Construct validity must be investigated
whenever no criterion or universe of content is accepted as entirely adequate to define the
quality to be measured" as quoted by Carmines and Zeller (1979). The term construct in this
instance is defined as a property that is offered to explain some aspect of human behaviour,
such as mechanical ability, intelligence, or introversion (Van Dalen, 1979). The construct
validity approach concerns the degree to which the test measures the construct it was
designed to measure.
There are two parts to the evaluation of the construct validity of a test.
First and most important, the theory underlying the construct to be measured must be
considered.
Second, the adequacy of the test in measuring the construct is evaluated (Mason and
Bramble, 1989). For example, suppose that a researcher is interested in measuring the
introverted nature of first year teachers. The researcher defines introverted as the overall
lack of social skills such as conversing, meeting and greeting people, and attending faculty
social functions. This definition is based upon the researcher’s own observations. A panel of
experts is then asked to evaluate this construct of introversion. The panel cannot agree that
the qualities pointed out by the researcher adequately define the construct of introversion.
Furthermore, the researcher cannot find evidence in the research literature supporting the
introversion construct as defined here. Using this information, the validity of the construct
itself can be questioned. In this case the researcher must reformulate the previous definition
of the construct.
Once the researcher has developed a meaningful, useable construct, the adequacy of the
test used to measure it must be evaluated. First, data concerning the trait being measured
should be gathered and compared with data from the test being assessed. The data from
other sources should be similar or convergent. If convergence exists, construct validity is
supported.
After establishing convergence, the discriminant validity of the test must be determined. This
involves demonstrating that the construct can be differentiated from other constructs that
may be somewhat similar. In other words, the researcher must show that the construct being
measured is not the same as one that was measured under a different name.
3. Criterion-Related Validity
Criterion validity indicates the effectiveness of a questionnaire in measuring what it purports
to measure. The responses on the questionnaire being developed are checked against an
external criterion, or gold standard, which is a direct and independent measure of what the
new questionnaire is designed to measure.
This approach is concerned with detecting the presence or absence of one or more criteria
considered to represent traits or constructs of interest. One of the easiest ways to test for
criterion-related validity is to administer the instrument to a group that is known to exhibit the
trait to be measured. This group may be identified by a panel of experts. A wide range of
items should be developed for the test with invalid questions culled after the control group
has taken the test. Items should be omitted that are drastically inconsistent with respect to
the responses made among individual members of the group. If the researcher has
developed quality items for the instrument, the culling process should leave only those items
that will consistently measure the trait or construct being studied. For example, suppose one
wanted to develop an instrument that would identify teachers who are good at dealing with
abused children. First, a panel of unbiased experts identifies 100 teachers out of a larger
group that they judge to be best at handling abused children. The researcher develops 400
yes/no items that will be administered to the whole group of teachers, including those
identified by the experts. The responses are analysed, and the items to which the expert-identified
teachers and the other teachers respond differently are taken as the questions
that will identify teachers who are good at dealing with abused children.
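A simplified sketch of this item-culling idea is shown below. It assumes hypothetical yes/no responses and an arbitrary cut-off for deciding which items discriminate between the expert-identified group and the rest; a real study would use far more items and teachers and a proper statistical test.

```python
import numpy as np

# Hypothetical yes/no responses (1 = yes): rows are teachers, columns are items.
# 'expert_pick' marks the teachers the expert panel judged best with abused children.
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
])
expert_pick = np.array([True, True, True, False, False, False])

# For each item, compare the proportion answering "yes" in the two groups;
# items with little difference do not discriminate and would be culled.
p_expert = responses[expert_pick].mean(axis=0)
p_other = responses[~expert_pick].mean(axis=0)
for i, diff in enumerate(p_expert - p_other, start=1):
    keep = abs(diff) >= 0.5          # illustrative cut-off, not a standard value
    print(f"Item {i}: difference = {diff:+.2f} -> {'keep' if keep else 'cull'}")
```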
4. Face validity
Face validity is a type of validity in which the wording of items in a scale makes some
reference to what is being measured. Face validity is not really validity but refers to the
appearance of the questionnaire: Does it look "professional" or carelessly and poorly
constructed? Professional-looking questionnaires are more likely to elicit serious responses.
Therefore, face validity is an important consideration for both the pretest and the final
product.
Reliability
The reliability of a research instrument concerns the extent to which the instrument yields the
same results on repeated trials. Reliability, or reproducibility, indicates whether the questionnaire
performs consistently. Although unreliability is always present to a certain extent, there will
generally be a good deal of consistency in the results of a quality instrument gathered at
different times. In other words, the reliability of a questionnaire seems to emerge from the
quality of the questionnaire: the tendency toward consistency found in repeated measurements
is what is referred to as reliability (Carmines & Zeller, 1979).
In scientific research, accuracy in measurement is of great importance. Scientific research
normally measures physical attributes which can easily be assigned a precise value. Many
times numerical assessments of the mental attributes of human beings are accepted as
readily as numerical assessments of their physical attributes. Although we may understand
that the values assigned to mental attributes can never be completely precise, the
imprecision is often looked upon as being too small to be of any practical concern. However,
the magnitude of the imprecision is much greater in the measurement of mental attributes
than in that of physical attributes. This fact makes it very important that the researcher in the
social sciences and humanities determine the reliability of the data gathering instrument to
be used (Willmott & Nuttall, 1975).
There are several ways of examining reliability, described below.
1. Retest Method
It measures the ability of the questionnaire to yield similar results when administered to the
same person on two separate occasions. The more reliable the questionnaire the higher the
correlation between the results. The interval between the administrations is important. If it is
too short the results may be confounded because the subject responds from memory; if it is
too long the attribute being examined may have changed, and the low correlation may
indicate this change rather than poor reliability.
The reliability of the test (instrument) can be estimated by examining the consistency of the
responses between the two tests.
If the researcher obtains the same results on the two administrations of the instrument, then
the reliability coefficient will be 1.00. Normally, the correlation of measurements across time
will be less than perfect due to different experiences and attitudes that respondents have
encountered from the time of the first test.
The test-retest method is a simple, clear cut way to determine reliability, but it can be costly
and impractical. Researchers are often only able to obtain measurements at a single point in
time or do not have the resources for multiple administrations.
2. Alternative Form Method
Like the retest method, this method also requires testing the same people twice.
However, the same test is not given each time. Each of the two tests must be designed to
measure the same thing and should not differ in any systematic way. One way to help
ensure this is to use random procedures to select items for the different tests.
The alternative form method is viewed as superior to the retest method because a
respondent’s memory of test items is not as likely to play a role in the data received. One
drawback of this method is the practical difficulty in developing test items that are consistent
in the measurement of a specific phenomenon.
3. Split-Halves Method:
A more feasible method for testing the consistency of homogeneous (single-attribute)
questionnaires is the split-half method: the even- and odd-numbered questions are
separated and are considered to be two equivalent questionnaires. The internal consistency
of a homogeneous questionnaire can also be examined after a single administration by
applying an appropriate statistical procedure. The split-half method cannot be used with
heterogeneous questionnaires because division of the questionnaire will not yield
"equivalent" forms. In this situation one may repeat questions throughout the questionnaire;
only the original question is kept in the final form.
This method is more practical in that it does not require two administrations of the same or
an alternative form of the test. In the split-halves method, the total number of items is divided
into halves, and a correlation is taken between the two halves. This correlation only estimates
the reliability of each half of the test. It is necessary then to use a statistical correction to
estimate the reliability of the whole test. This correction is known as the Spearman-Brown
prophecy formula (Carmines & Zeller, 1979).
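In its most common form (doubling the test length), the Spearman-Brown correction estimates the reliability of the full test from the correlation between the two halves:

Estimated full-test reliability = (2 × correlation between halves) / (1 + correlation between halves)

For example, if the two halves correlate at 0.60, the estimated reliability of the whole test is (2 × 0.60) / (1 + 0.60) = 0.75.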
There are many ways to divide the items in an instrument into halves. The most typical way
is to assign the odd numbered items to one half and the even numbered items to the other
half of the test. One drawback of the split-halves method is that the correlation between the
two halves is dependent upon the method used to divide the items.
4. Internal Consistency Method
This method requires neither the splitting of items into halves nor the multiple administration
of instruments. The internal consistency method provides a unique estimate of reliability for
the given test administration. The most popular internal consistency reliability estimate is
given by Cronbach’s alpha.
The coefficient alpha is an internal consistency index designed for use with tests containing
items that have no right answer. This is a very useful tool in educational and social science
research because instruments in these areas often ask respondents to rate the degree to
which they agree or disagree with a statement on a particular scale.
5. Homogeneity (Internal Consistency)
There are three ways to check the internal consistency of an index, illustrated here with a four-question index of exposure to televised news.
a) Split-half correlation. We could split the index of "exposure to televised news" in half so
that there are two groups of two questions, and see if the two sub-scales are highly
correlated. That is, do people who score high on the first half also score high on the second
half?
b) Average inter-item correlation. We also can determine internal consistency for each
question on the index. If the index is homogeneous, each question should be highly
correlated with the other three questions.
c) Average item-total correlation. We could correlate each question with the total score of
the TV news exposure index to examine the internal consistency of items. This gives us an
idea of the contribution of each item to the reliability of the index.
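Checks (b) and (c) above can be sketched as follows for a hypothetical four-question index of exposure to televised news; the response values are invented for illustration.

```python
import numpy as np

# Hypothetical four-question index for 8 respondents
# (e.g., days per week watching different news programmes).
index_items = np.array([
    [5, 4, 5, 4],
    [1, 0, 1, 1],
    [3, 3, 2, 3],
    [7, 6, 7, 6],
    [2, 2, 1, 2],
    [4, 5, 4, 4],
    [0, 1, 0, 0],
    [6, 6, 5, 6],
])

# (b) Average inter-item correlation: mean of the correlations between every pair of items.
corr = np.corrcoef(index_items, rowvar=False)
pairwise = corr[np.triu_indices_from(corr, k=1)]
print(f"Average inter-item correlation: {pairwise.mean():.2f}")

# (c) Item-total correlation: each item correlated with the total index score.
total = index_items.sum(axis=1)
for i in range(index_items.shape[1]):
    r = np.corrcoef(index_items[:, i], total)[0, 1]
    print(f"Item {i + 1} vs. total: r = {r:.2f}")
```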
Another approach to the evaluation of reliability is to examine the relative absence of
random measurement error in a measuring instrument. Random measurement errors can
be indexed by a measure of variability of individual item scores around the mean index
score. Thus, an instrument which has a large measure of variability should be less reliable
than one having a smaller measure of variability.
(Johnson, 1999)
Development of a valid and reliable Questionnaire
Development of a valid and reliable questionnaire involves several steps and takes considerable
time. This section describes the sequential steps involved in the development and testing of
questionnaires used for data collection. Figure 1 illustrates the five sequential steps involved
in questionnaire development and testing. Each step depends on the fine-tuning and testing of
the previous steps, which must be completed before moving on to the next step. A brief
description of each of the five steps follows.
Figure 1. Sequence for Questionnaire/Instrument Development
Step 1- Background
In this initial step, the purpose, objectives, research questions, and hypothesis of the
proposed research are examined. Determining who the audience is, their background (especially
their educational/readability levels), their access, and the process used to select the
respondents (sample vs. population) is also part of this step. A thorough understanding of
the problem through a literature search and readings is a must. Good preparation and
understanding of Step 1 provide the foundation for initiating Step 2.
Step 2- Questionnaire Conceptualization
After developing a thorough understanding of the research, the next step is to generate
statements/questions for the questionnaire. In this step, content (from literature/theoretical
framework) is transformed into statements/questions. In addition, a link between the objectives
of the study and their translation into content is established. For example, the researcher
must indicate what the questionnaire is measuring, that is, knowledge, attitudes,
perceptions, opinions, recalling facts, behaviour change, etc. Major variables (independent,
dependent, and moderator variables) are identified and defined in this step.
Step 3- Format and Data Analysis
In Step 3, the focus is on writing statements/questions, selection of appropriate scales of
measurement, questionnaire layout, format, question ordering, font size, front and back
cover, and proposed data analysis. Scales are devices used to quantify a subject's response
on a particular variable. Understanding the relationship between the level of measurement
and the appropriateness of data analysis is important. For example, if ANOVA (analysis of
variance) is one mode of data analysis, the independent variable must be measured on a
nominal scale with two or more levels (yes, no, not sure), and the dependent variable must
be measured on an interval/ratio scale (e.g., strongly agree to strongly disagree).
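As a rough illustration of this pairing of measurement levels with an analysis technique, the sketch below runs a one-way ANOVA on hypothetical data: a nominal grouping variable with three levels and a dependent variable scored on a 1-5 agreement scale (treated here, as in the text, as interval).

```python
from scipy import stats

# Hypothetical: nominal independent variable with three levels (yes / no / not sure),
# dependent variable scored 1-5 (strongly disagree .. strongly agree).
yes      = [4, 5, 4, 4, 5]
no       = [2, 3, 2, 1, 2]
not_sure = [3, 3, 4, 3, 2]

f_stat, p_value = stats.f_oneway(yes, no, not_sure)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```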
Step 4 -Establishing Validity
As a result of Steps 1-3, a draft questionnaire is ready for establishing validity. Validity is the
amount of systematic or built-in error in measurement (Norland, 1990). Validity is established
using a panel of experts and a field test. Which type of validity (content, construct, criterion,
and face) to use depends on the objectives of the study. The following questions are
addressed in Step 4:
1. Is the questionnaire valid? In other words, is the questionnaire measuring what it
intended to measure?
2. Does it represent the content?
3. Is it appropriate for the sample/population?
4. Is the questionnaire comprehensive enough to collect all the information needed to
address the purpose and goals of the study?
5. Does the instrument look like a questionnaire?
Addressing these questions coupled with carrying out a readability test enhances
questionnaire validity. The Fog Index, Flesch Reading Ease, Flesch-Kincaid Readability
Formula, and Gunning-Fog Index are formulas used to determine readability. Approval from
the Institutional Review Board (IRB) must also be obtained. Following IRB approval, the next
step is to conduct a field test using subjects not included in the sample. Make changes, as
appropriate, based on both a field test and expert opinion. Now the questionnaire is ready to
pilot test.
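As one concrete example of such a readability formula, the Gunning Fog index combines average sentence length with the share of complex (three-or-more-syllable) words. The sketch below uses invented counts and is only a rough illustration of how the check might be automated.

```python
def gunning_fog(total_words: int, total_sentences: int, complex_words: int) -> float:
    """Gunning Fog index: a rough estimate of the years of schooling needed to read a passage.

    'complex_words' counts words of three or more syllables.
    """
    avg_sentence_length = total_words / total_sentences
    pct_complex = 100 * complex_words / total_words
    return 0.4 * (avg_sentence_length + pct_complex)

# Hypothetical counts taken from a draft questionnaire's instructions.
print(f"Fog index: {gunning_fog(total_words=120, total_sentences=8, complex_words=12):.1f}")
```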
Step 5- Establishing Reliability
In this final step, the reliability of the questionnaire is assessed using a pilot test. Reliability
refers to random error in measurement. Reliability indicates the accuracy or precision of the
measuring instrument (Norland, 1990). The pilot test seeks to answer the question, does the
questionnaire consistently measure whatever it measures?
The use of reliability types (test-retest, split half, alternate form, internal consistency)
depends on the nature of data (nominal, ordinal, interval/ratio). For example, to assess
reliability of questions measured on an interval/ratio scale, internal consistency is
appropriate to use. To assess reliability of knowledge questions, test-retest or split-half is
appropriate.
Reliability is established using a pilot test by collecting data from 20-30 subjects not included
in the sample. Data collected from the pilot test are analyzed using SPSS (Statistical Package
for the Social Sciences) or other software. SPSS provides two key pieces of information: the
correlation matrix and the "alpha if item deleted" column. Make sure that items/statements
whose correlations are 0, 1, or negative are eliminated. Then view the "alpha if item deleted"
column to determine whether alpha can be raised by deleting items. Delete items whose removal
substantially improves reliability. To preserve content, delete no more than 20% of the items.
The reliability coefficient (alpha) can range from 0 to 1, with 0 representing an instrument
full of error and 1 representing a total absence of error. A reliability coefficient (alpha) of
.70 or higher is considered acceptable reliability.
(william, 1999)
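A rough sketch of the "alpha if item deleted" check is shown below; it is a generic re-computation of alpha with each item removed in turn, not SPSS output, and the pilot data are invented. In this made-up example the fourth statement is worded in the opposite direction from the others, so removing it (or reverse-scoring it) raises alpha sharply.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # Same formula as in the earlier sketch: alpha = (k/(k-1)) * (1 - sum(item var) / var(total)).
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum() / items.sum(axis=1).var(ddof=1))

# Hypothetical pilot-test ratings: 8 respondents x 5 statements (item 4 is reverse-keyed).
pilot = np.array([
    [4, 5, 4, 2, 4],
    [2, 2, 3, 5, 2],
    [5, 4, 5, 1, 5],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 5, 2],
    [4, 4, 5, 2, 4],
    [2, 3, 2, 4, 3],
    [5, 5, 4, 1, 5],
])

print(f"Alpha with all items: {cronbach_alpha(pilot):.2f}")
for i in range(pilot.shape[1]):
    reduced = np.delete(pilot, i, axis=1)
    print(f"Alpha if item {i + 1} deleted: {cronbach_alpha(reduced):.2f}")
```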
Conclusion
In conclusion, the reliability and validity of a measuring instrument are of great importance.
“Validation” is also the process by which any data collection instrument, including
questionnaires, is assessed for its dependability. Validating questionnaires is somewhat
challenging as they usually evaluate subjective measures, which means they can be
influenced by a range of factors that are hard to control. Questionnaires are among the most
widely used tools, especially in social science research. The objective of most questionnaires
is to obtain relevant information in the most reliable and valid manner. Therefore, the validation
of a questionnaire forms an important aspect of research methodology and of the validity of the
outcomes. A researcher is often confused about the objective of validating a questionnaire
and tends to conflate the reliability of a questionnaire with its validity.
The reality is that reliability and validity are two different aspects of an acceptable research
questionnaire. It is important for a researcher to understand the differences between these
two aspects. In its simple explanation, reliability of a questionnaire seems to emerge from
the quality of the questionnaire. On the other hand, validity seems to emerge from the
internal and external consistency and relevance of the questionnaire. In other words, the
reliability of a questionnaire refers to the quality of the tool (the questionnaire itself), while
validity refers to the process used to employ the tool, i.e., the way the questionnaire is used.
References
Biddix, J. P. (2003). Retrieved from https://researchrundowns.wordpress.com/quantitative-methods/instrument-validity-reliability/
Jackson. (2000). Retrieved from http://www.evensenwebs.com/validity.html
Johnson. (1999). Retrieved from http://www.okstate.edu/ag/agedcm4h/academic/aged5980a/5980/newpage18.htm
Jones, W. S. (2000). Retrieved from http://changingminds.org/explanations/research/measurement/measurement_error.htm
Quinn, J. A. (2000). Retrieved from http://www.journalism.wisc.edu/~dshah/j658/Reliability%20and%20Validity.pdf
Radhakrishna, R. B. (2007, February). Retrieved from http://www.joe.org/joe/2007february/tt2.php
william. (1999). Retrieved from Groves, R. M. (1987). Research on survey data quality. Public Opinion Quarterly, 51, 156-172.
Wren, C. P. (2005-06). Retrieved from https://www.uni.edu/chfasoa/reliabilityandvalidity.htm

MBA-12-02

  • 1.
    VALIDITY AND RELIABILITYOF MEASURING INSTRUMENTS By Qurat-ul-Ain (MBA-12-02)
  • 2.
    1 | Pa g e Contents INTRODUCTION........................................................................................................ 2 Measurement:......................................................................................................... 2 Instruments ............................................................................................................. 3 Criteria of Measurement:......................................................................................... 4 Reliability................................................................................................................. 4 Validity .................................................................................................................... 6 Measurement Errors: .............................................................................................. 8 Types of Errors ....................................................................................................... 8 VALIDITY AND RELIABILITY IN MEASURING INSTRUMENTS (QUESTIONNAIRES)............................................................................................... 11 Types of Measuring Instruments:.......................................................................... 11 Validity and reliability in Questionnaires:............................................................... 13 Assessing Validity and Reliability of Questionnaires:............................................ 13 Approaches to Validity of Questionnaires ............................................................. 15 Development of a valid and reliable Questionnaire .................................................. 21 Conclusion ............................................................................................................... 25 References............................................................................................................... 26
  • 3.
    2 | Pa g e Validity and Reliability of Measuring Instruments INTRODUCTION Before describing the validity and Reliability of measuring instruments lets first look at some definitions that would be helpful in developing clear understandings of the phenomena. Definitions:  Measurement: Measurement involves the operationalization of these constructs in defined variables and the development and application of instruments or tests to quantify these variables. S.S. Stevens defines measurement as "The assignment of numerals to objects or events according to rules." This definition incorporates a number of important distinctions. First, it implies that if rules can be set up, it is theoretically possible to measure anything. Further, measurement is only as good as the rules that direct its application. The "goodness" of the rules reflects on the reliability and validity of the measurement. Second aspect of definition given by Stevens is the use of the term numeral rather than number. A numeral is a symbol and has no quantitative meaning unless the researcher supplies it through the use of rules. The researcher sets up the criteria by which objects or events are distinguished from one another and also the weights, if any, which are to be assigned to these distinctions. This results in a scale.
  • 4.
    3 | Pa g e  Instruments Instrument is the generic term that researchers use for a measurement device (survey, test, questionnaire, etc.). To help distinguish between instrument and instrumentation, consider that the instrument is the device and instrumentation is the course of action (the process of developing, testing, and using the device). Instruments fall into two broad categories:  researcher-completed  subject-completed Researcher-completed instruments are those instruments that researchers administer themselves. While, Subject-completed are those that are completed by participants. Researchers chose which type of instrument, or instruments, to use based on the research question. Examples are listed below: Researcher-completed Instruments Subject-completed Instruments Rating scales Questionnaires Interview schedules/guides Self-checklists Tally sheets Attitude scales Flowcharts Personality inventories Performance checklists Achievement/aptitude tests Time-and-motion logs Projective devices Observation forms Sociometric devices (Dr. J. Patrick Biddix (Ph.D., 2003)
  • 5.
    4 | Pa g e  Criteria of Measurement: There are two fundamental criteria of measurement, i.e., reliability and validity. (Quinn, 2000)  Reliability Reliability is the degree to which an assessment tool produces stable and consistent results. Reliability can be thought of as consistency. Does the instrument consistently measure what it is intended to measure? Types of Reliability: 1. Test-retest reliability: Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores. 2. Parallel forms reliability: Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
  • 6.
    5 | Pa g e 3. Inter-rater reliability: Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed. Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems. 4. Internal consistency reliability: Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. A. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation. B. Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
  • 7.
    6 | Pa g e  Validity Validity refers to how well a test measures what it is purported to measure. Why is it necessary? While reliability is necessary, it alone is not sufficient. For a test to be reliable, it also needs to be valid. For example, if your scale is off by 5 lbs, it reads your weight every day with an excess of 5lbs. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5lbs to your true weight. It is not a valid measure of your weight. Types of Validity: 1. Face Validity: Face Validity ascertains that the measure appears to be assessing the intended construct under study. The stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged with the task. Example: If a measure of art appreciation is created all of the items should be related to the different components and types of art. If the questions are regarding historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation. 2. Construct Validity Construct Validity is used to ensure that the measure is actually measure what it is intended to measure (i.e. the construct), and no other variables. Using a panel of “experts” familiar with the construct is a way in which this type of validity can be assessed. The experts can examine the items and decide what that specific item is intended to measure. Students can be involved in this process to obtain their feedback.
  • 8.
    7 | Pa g e Example: A women’s studies program may design a cumulative assessment of learning throughout the major. The questions are written with complicated wording and phrasing. This can cause the test inadvertently becoming a test of reading comprehension, rather than a test of women’s studies. It is important that the measure is actually assessing the intended construct, rather than an extraneous factor. 3. Criterion-Related Validity: Criterion-Related Validity is used to predict future or current performance - it correlates test results with another criterion of interest. Example: If a physics program designed a measure to assess cumulative student learning throughout the major. The new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and new measure, the more faith stakeholders can have in the new assessment tool. 4. Formative Validity Formative Validity when applied to outcomes assessment it is used to assess how well a measure is able to provide information to help improve the program under study. Example: When designing a rubric for history one could assess student’s knowledge across the discipline. If the measure can provide information that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements 5. Sampling Validity Sampling Validity (similar to content validity) ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of “experts” to ensure that the content area is adequately sampled.
Example: When designing an assessment of learning in the theatre department, it would not be sufficient to cover only issues related to acting. Other areas of theatre, such as lighting, sound, and the functions of stage managers, should all be included. The assessment should reflect the content area in its entirety. (Wren, 2005-06)

 Measurement Errors:
All measurements may contain some element of error; validity and reliability concern the different types of error that typically occur, and they also show how we can estimate the extent of error in a measurement. There are three chief sources of error:
 In the thing being measured (my weight may fluctuate, so it's difficult to get an accurate picture of it);
 The observer (on Mondays I may knock a pound off my weight if I binged on my mother's cooking at the weekend; the binging obviously doesn't reflect my true weight);
 The recording device (our clinic weigh scale has been acting up; we really should get it recalibrated).

 Types of Errors
Following are some types of errors that can occur in measurement:

1. Random error
Random error is that which causes random and uncontrollable effects in measured results across a sample, for example where rainy weather may depress some people. The effect of random error is to add spread to the measurement distribution, increasing the standard deviation of the measurement. The average should not be affected, which is good news if the average is being quoted in results.
The stability of the average is due to the effect of regression to the mean: random effects make a high score as likely as a low score, so across a random sample they eventually cancel one another out.

2. True score
The true score is the value being sought. It is not the same as the observed score, because the observed score includes the random error:
Observed score = True score + random error
When the random error is small, the observed score will be close to the true score and thus be a fair representation of it. If, however, the random error is large, the observed score may be nothing like the true score and of little value. Another effect is that a test score near a boundary may incorrectly cross it. For example, if a school exam result is close to the A/B grade boundary, the grade given may not reflect the actual ability of the student. Assuming that an observed score is the true score is a dangerous trap, particularly if you have no real idea of how big the random error may be.

3. Systematic error
In addition to random error, additional variation from the true score may be introduced when there is some error caused by problems in the measurement system, such as when bad weather affects everyone in the study or when poorly worded questions produce answers that do not reflect true opinions. There are many ways of allowing or introducing systematic error, and eliminating it is a critical part of experimental design, as is assessment of the context at the time of the experiment. The effect of systematic error is often to shift the mean of the measurement distribution, which can be particularly pernicious if the mean is to be quoted in results.
Measurement error
Measurement error is the total variation from the true score, and includes both random error and systematic error:
Observed score = True score + random error + systematic error (Jones, 2000)
So we can say that measurement error can be reduced by choosing an instrument that is valid and reliable. A small simulation of these two error components is sketched below.
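To make the two error components concrete, the following minimal Python sketch (not from the cited sources; the true score of 70, the error sizes, and the sample size are invented for illustration) simulates the formula above. It shows that zero-mean random error mainly inflates the standard deviation, while a constant systematic error mainly shifts the mean.

import random
import statistics

random.seed(42)

true_score = 70.0       # the value we are trying to measure
n = 1000                # number of simulated measurements

# Random error only: zero-mean noise (e.g. day-to-day mood or weather)
random_only = [true_score + random.gauss(0, 5) for _ in range(n)]

# Random plus systematic error: every observation is pushed up by 5 points
# (e.g. a weigh scale that always reads 5 lbs too heavy)
with_systematic = [x + 5 for x in random_only]

print("Random error only:    mean %.1f, sd %.1f"
      % (statistics.mean(random_only), statistics.stdev(random_only)))
print("Plus systematic (+5): mean %.1f, sd %.1f"
      % (statistics.mean(with_systematic), statistics.stdev(with_systematic)))

# Expected pattern: the mean stays near 70 in the first case and moves to
# about 75 in the second, while the spread (sd) is similar in both cases.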
VALIDITY AND RELIABILITY IN MEASURING INSTRUMENTS (QUESTIONNAIRES)

Types of Measuring Instruments:
There are basically 4 types of measuring instruments:
 Observational research
 Interview
 Survey
 Questionnaire

Observational research:
Observational research (or field research) is a type of correlational (i.e., non-experimental) research in which a researcher observes ongoing behaviour. There are a variety of types of observational research, each of which has both strengths and weaknesses. These types are organized by the extent to which an experimenter intrudes upon or controls the environment. Observational research is particularly prevalent in the social sciences and in marketing. It is a social research technique that involves the direct observation of phenomena in their natural setting. This differentiates it from experimental research, in which a quasi-artificial environment is created to control for spurious factors and at least one of the variables is manipulated as part of the experiment. It is typically divided into naturalistic (or "non-participant") observation and participant observation. Case studies and archival research are special types of observational research. Naturalistic (or non-participant) observation involves no intervention by the researcher. It is simply studying behaviours that occur naturally in natural contexts, unlike the artificial environment of a controlled laboratory setting. Importantly, in naturalistic observation there is no attempt to manipulate variables. It permits measuring what behaviour is really like. However, its typical limitations are that it cannot explore the actual causes of behaviours and that it is impossible to determine whether a given observation is truly representative of what normally occurs.
Survey
The survey is probably the most commonly used research design; we have all been asked to take part in a survey at some time. As consumers we are asked about our shopping habits; as users of services we are asked for our opinions of those services. The survey is a flexible research approach used to investigate a wide range of topics, and surveys often employ the questionnaire as a tool for data collection. Surveys are a very traditional way of conducting research. They are particularly useful for non-experimental descriptive designs that seek to describe reality. So, for instance, a survey approach may be used to establish the prevalence or incidence of a particular condition. Likewise, the survey approach is frequently used to collect information on attitudes and behaviour. Surveys typically provide a snapshot of what is happening in a group at a particular time. They usually take a descriptive or exploratory form that simply sets out to describe behaviour or attitudes. So, for example, if you want to measure some aspect of client satisfaction, then a cross-sectional descriptive survey would be the recommended approach. Likewise, if you wish to establish the prevalence of depression amongst new mothers, a postal survey might be an appropriate approach. Surveys can take many forms. A survey of the entire population is known as a census. Usually, however, surveys are restricted to a representative sample of the group the researcher is interested in, for reasons of practicality and cost-effectiveness.

Interview
Interviews are an attractive proposition for the project researcher. An interview is something more than a conversation: it involves a set of assumptions and understandings about the situation which are not normally associated with a casual conversation. Interviews are sometimes referred to as oral questionnaires, but they are much more than that. A questionnaire involves indirect data collection, whereas interview data are collected directly from respondents in face-to-face contact. People are often more hesitant to write something down than to talk about it. With a friendly relationship and rapport, the interviewer can obtain certain types of confidential information which respondents might be reluctant to put in writing. A research interview should therefore be systematically arranged; it does not happen by chance. Interviews are not conducted by secretly recording discussions as research data: the consent of the subject is obtained for the purpose of the interview.
Validity and reliability in Questionnaires:
In this report we will describe validity and reliability in the context of questionnaires.

Questionnaires
Questionnaires are the most frequently used data collection method in educational and evaluation research. The objective of most questionnaires is to obtain relevant information in the most reliable and valid manner possible. Questionnaires help gather information on knowledge, attitudes, opinions, behaviours, facts, and other information. In a review of 748 research studies conducted in agricultural and Extension education, Radhakrishna, Leite, and Baggett (2003) found that 64% used questionnaires. They also found that about a third of the studies reviewed did not report procedures for establishing validity (31%) or reliability (33%). Development of a valid and reliable questionnaire is a must to reduce measurement error. Groves (1987) defines measurement error as the "discrepancy between respondents' attributes and their survey responses" (p. 162). (Radhakrishna, 2007)

Assessing Validity and Reliability of Questionnaires:
In order to have confidence in the results of a study, one must be assured that the questionnaire consistently measures what it purports to measure when properly administered. In short, the questionnaire must be both valid and reliable. Warwick and Lininger (1975) point out that there are two basic goals in questionnaire design:
1. To obtain information relevant to the purposes of the survey.
2. To collect this information with maximal reliability and validity.
How can a researcher be sure that the data gathering instrument being used will measure what it is supposed to measure, and will do so in a consistent manner? This question can only be answered by examining the definitions of, and methods of establishing, the validity and reliability of a research instrument. Broadly, the reliability of a questionnaire reflects the quality of the tool itself, while validity also depends on the way the tool is employed, i.e. the process used to conduct the questionnaire.
The basic difference between these two criteria is that they deal with different aspects of measurement. The difference can be summarized by the different questions asked when applying each criterion:
Validity:
a. Does the measure employed really measure the theoretical concept (variable)?
Reliability:
a. Will the measure employed repeatedly on the same individuals yield similar results? (Stability)
b. Will the measure employed by different investigators yield similar results? (Equivalence)
c. Will a set of different operational definitions of the same concept, employed on the same individuals using the same data-collecting technique, yield highly correlated results? Or, will all items of the measure be internally consistent? (Homogeneity)

Validity
We have already defined validity as the degree to which a test measures what it is supposed to measure. The validation of a questionnaire forms an important aspect of research methodology and of the validity of the outcomes. It is concerned with the accuracy of our measurement. The validity of a questionnaire thus emerges from the internal and external consistency and relevance of the questionnaire. The overriding principle of validity is that it focuses on how a questionnaire or assessment process is used. Reliability is a characteristic of the instrument itself, but validity comes from the way the instrument is employed. The following ideas support this principle:
 As nearly as possible, the data gathering should match the decisions you need to make. This means that if you need to make a priority-focused decision, such as allocating resources or eliminating programs, your assessment process should be a comparative one that ranks the programs or alternatives you will be considering.
 Gather data from all the people who can contribute information, even if they are hard to contact. For example, if you are conducting a survey of customer service, try to get a sample of all the customers, not just those who are easy to reach, such as those who have complained or made suggestions.
    15 | Pa g e  If you're going after sensitive information, protect your sources. It has been said that in the Prussian army at the turn of the century, decisions were made twice, once when officers were sober, again when they were drunk. This concept acknowledges the power of the "socially acceptable response" to questions or requests. Don't assume that a simple statement printed on the questionnaire that "all individual responses will be kept confidential" will make everybody relax and provide candid answers. Give respondents the freedom to decide which information about themselves they wish to withhold, and employ other administrative procedures, such as handing out Login IDs and Passwords separately from the e-mail inviting people to participate in the survey. (Jackson, 2000) Approaches to Validity of Questionnaires There are three basic approaches to the validity of Questionnaires as shown by Mason and Bramble (1989). These are content validity, construct validity, and criterion-related validity. 1. Content Validity Once the questionnaire is drafted one must determine whether the domain has been adequately covered (content validity).This approach measures the degree to which the test items represent the domain or universe of the trait or property being measured. In order to establish the content validity of questionnaire, the researcher must identify the overall content to be represented. Items must then be randomly chosen from this content that will accurately represent the information in all areas. By using this method the researcher should obtain a group of items which is representative of the content of the trait or property to be measured. Identifying the universe of content is not an easy task. It is, therefore, usually suggested that a panel of experts in the field to be studied be used to identify a content area. ). For example, suppose it was decided that appetite consisted of three attributes. In order for the questionnaire to have content validity all three attributes must be questioned sufficiently. A tally of the number of questions addressing each attribute will immediately indicate any imbalance. If an imbalance exists the results may be biased, particularly when the questionnaire yields a single score, as in measurements of functional health status.
2. Construct Validity
Construct validity refers to the extent to which the new questionnaire conforms to existing ideas or hypotheses concerning the concepts (constructs) that are being measured. Construct validity presents the greatest challenge in questionnaire development. Cronbach and Meehl (1955) indicated that "construct validity must be investigated whenever no criterion or universe of content is accepted as entirely adequate to define the quality to be measured," as quoted by Carmines and Zeller (1979). The term construct in this instance is defined as a property that is offered to explain some aspect of human behaviour, such as mechanical ability, intelligence, or introversion (Van Dalen, 1979). The construct validity approach concerns the degree to which the test measures the construct it was designed to measure. There are two parts to the evaluation of the construct validity of a test. First, and most important, the theory underlying the construct to be measured must be considered. Second, the adequacy of the test in measuring the construct is evaluated (Mason and Bramble, 1989). For example, suppose that a researcher is interested in measuring the introverted nature of first-year teachers. The researcher defines introversion as an overall lack of social skills such as conversing, meeting and greeting people, and attending faculty social functions. This definition is based upon the researcher's own observations. A panel of experts is then asked to evaluate this construct of introversion. The panel cannot agree that the qualities pointed out by the researcher adequately define the construct of introversion. Furthermore, the researcher cannot find evidence in the research literature supporting the introversion construct as defined here. Using this information, the validity of the construct itself can be questioned. In this case the researcher must reformulate the definition of the construct. Once the researcher has developed a meaningful, usable construct, the adequacy of the test used to measure it must be evaluated. First, data concerning the trait being measured should be gathered and compared with data from the test being assessed. The data from other sources should be similar, or convergent. If convergence exists, construct validity is supported.
After establishing convergence, the discriminant validity of the test must be determined. This involves demonstrating that the construct can be differentiated from other constructs that may be somewhat similar. In other words, the researcher must show that the construct being measured is not the same as one that was measured under a different name.

3. Criterion-Related Validity
Criterion-related validity indicates the effectiveness of a questionnaire in measuring what it purports to measure. The responses on the questionnaire being developed are checked against an external criterion, or gold standard, which is a direct and independent measure of what the new questionnaire is designed to measure (a small correlation sketch follows this section). This approach is concerned with detecting the presence or absence of one or more criteria considered to represent the traits or constructs of interest. One of the easiest ways to test for criterion-related validity is to administer the instrument to a group that is known to exhibit the trait to be measured. This group may be identified by a panel of experts. A wide range of items should be developed for the test, with invalid questions culled after the control group has taken the test. Items should be omitted if the responses to them are drastically inconsistent across individual members of the group. If the researcher has developed quality items for the instrument, the culling process should leave only those items that will consistently measure the trait or construct being studied. For example, suppose one wanted to develop an instrument that would identify teachers who are good at dealing with abused children. First, a panel of unbiased experts identifies 100 teachers out of a larger group whom they judge to be best at handling abused children. The researcher develops 400 yes/no items that are administered to the whole group of teachers, including those identified by the experts. The responses are analysed, and the items to which the expert-identified teachers and the other teachers respond differently are taken as the questions that will identify teachers who are good at dealing with abused children.

4. Face Validity
Face validity is a type of validity in which the wording of items in a scale makes some reference to what is being measured. Face validity is not really validity in the strict sense; it refers to the appearance of the questionnaire: does it look professional, or carelessly and poorly constructed? Professional-looking questionnaires are more likely to elicit serious responses. Therefore, face validity is an important consideration for both the pretest and the final product.
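The correlation sketch referred to in the criterion-related validity discussion above is given below. The scores are hypothetical, and the Pearson correlation is computed by hand so that the sketch needs only the Python standard library.

from math import sqrt

def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores for ten students on a new questionnaire and on an
# established "gold standard" measure of the same construct.
new_measure   = [55, 62, 70, 71, 75, 78, 82, 85, 90, 94]
gold_standard = [150, 160, 172, 168, 180, 178, 190, 195, 205, 210]

print("Criterion-related validity (r) = %.2f"
      % pearson(new_measure, gold_standard))

# The closer r is to 1, the more faith stakeholders can place in the
# new instrument as a substitute for the established criterion.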
Reliability
The reliability of a research instrument concerns the extent to which the instrument yields the same results on repeated trials. Reliability, or reproducibility, indicates whether the questionnaire performs consistently. Although unreliability is always present to a certain extent, there will generally be a good deal of consistency in the results of a quality instrument gathered at different times. In other words, the reliability of a questionnaire emerges from the quality of the tool itself: the tendency toward consistency found in repeated measurements is referred to as reliability (Carmines & Zeller, 1979). In scientific research, accuracy in measurement is of great importance. Scientific research normally measures physical attributes, which can easily be assigned a precise value. Numerical assessments of the mental attributes of human beings are often accepted as readily as numerical assessments of their physical attributes. Although we may understand that the values assigned to mental attributes can never be completely precise, the imprecision is often looked upon as being too small to be of any practical concern. However, the magnitude of the imprecision is much greater in the measurement of mental attributes than in that of physical attributes. This fact makes it very important that researchers in the social sciences and humanities determine the reliability of the data gathering instrument to be used (Willmott & Nuttall, 1975). There are four main ways of examining reliability.

1. Retest Method
The retest method measures the ability of the questionnaire to yield similar results when administered to the same person on two separate occasions. The more reliable the questionnaire, the higher the correlation between the results. The interval between the administrations is important. If it is too short, the results may be confounded because the subject responds from memory; if it is too long, the attribute being examined may have changed, and a low correlation may indicate this change rather than poor reliability. The reliability of the instrument can be estimated by examining the consistency of the responses between the two tests. If the researcher obtains exactly the same results on the two administrations of the instrument, then the reliability coefficient will be 1.00. Normally, the correlation of measurements across time
will be less than perfect due to different experiences and attitudes that respondents have encountered since the first test. The test-retest method is a simple, clear-cut way to determine reliability, but it can be costly and impractical. Researchers are often able to obtain measurements only at a single point in time, or do not have the resources for multiple administrations.

2. Alternative Form Method
Like the retest method, this method also requires two testings of the same people. However, the same test is not given each time. The two tests must be designed to measure the same thing and should not differ in any systematic way. One way to help ensure this is to use random procedures to select items for the different tests. The alternative form method is viewed as superior to the retest method because a respondent's memory of test items is less likely to play a role in the data received. One drawback of this method is the practical difficulty of developing test items that are consistent in the measurement of a specific phenomenon.

3. Split-Halves Method
A more feasible method for testing the consistency of homogeneous (single-attribute) questionnaires is the split-half method: the even- and odd-numbered questions are separated and are considered to be two equivalent questionnaires. The internal consistency of a homogeneous questionnaire can also be examined after a single administration by applying an appropriate statistical procedure. The split-half method cannot be used with heterogeneous questionnaires because division of the questionnaire will not yield "equivalent" forms. In that situation one may repeat questions throughout the questionnaire; only the original question is kept in the final form. This method is more practical in that it does not require two administrations of the same or an alternative form of the test. In the split-halves method, the total number of items is divided into halves, and a correlation is taken between the two halves. This correlation only estimates the reliability of each half of the test. It is then necessary to use a statistical correction to estimate the reliability of the whole test. This correction is known as the Spearman-Brown prophecy formula (Carmines & Zeller, 1979); a small sketch of the calculation follows below.
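The split-half calculation and the Spearman-Brown correction mentioned above can be sketched as follows (hypothetical item scores; statistics.correlation requires Python 3.10 or later). The odd- and even-numbered items are summed into two half-scores, the halves are correlated, and the Spearman-Brown prophecy formula r_full = 2 * r_half / (1 + r_half) steps that half-test correlation up to an estimate for the whole test.

import statistics

# Hypothetical responses from six people to an 8-item questionnaire
# (1 = strongly disagree ... 5 = strongly agree).
responses = [
    [4, 5, 4, 4, 5, 4, 5, 4],
    [2, 1, 2, 2, 1, 2, 2, 1],
    [3, 3, 4, 3, 3, 4, 3, 3],
    [5, 5, 5, 4, 5, 5, 4, 5],
    [1, 2, 1, 1, 2, 1, 1, 2],
    [4, 4, 3, 4, 4, 3, 4, 4],
]

odd_half  = [sum(r[0::2]) for r in responses]   # items 1, 3, 5, 7
even_half = [sum(r[1::2]) for r in responses]   # items 2, 4, 6, 8

r_half = statistics.correlation(odd_half, even_half)   # Python 3.10+
r_full = 2 * r_half / (1 + r_half)                     # Spearman-Brown correction

print("Half-test correlation:              %.2f" % r_half)
print("Spearman-Brown whole-test estimate: %.2f" % r_full)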
There are many ways to divide the items in an instrument into halves. The most typical way is to assign the odd-numbered items to one half and the even-numbered items to the other half of the test. One drawback of the split-halves method is that the correlation between the two halves depends on the method used to divide the items.

4. Internal Consistency Method
This method requires neither the splitting of items into halves nor the multiple administration of instruments. The internal consistency method provides a unique estimate of reliability for the given test administration. The most popular internal consistency reliability estimate is given by Cronbach's alpha (a small sketch of the calculation follows at the end of this section). The coefficient alpha is an internal consistency index designed for use with tests containing items that have no right answer. This is a very useful tool in educational and social science research because instruments in these areas often ask respondents to rate the degree to which they agree or disagree with a statement on a particular scale.

5. Homogeneity (Internal Consistency)
There are three ways to check the internal consistency of an index.
a) Split-half correlation. We could split an index of "exposure to televised news" in half so that there are two groups of two questions, and see if the two sub-scales are highly correlated. That is, do people who score high on the first half also score high on the second half?
b) Average inter-item correlation. We can also determine internal consistency for each question on the index. If the index is homogeneous, each question should be highly correlated with the other three questions.
c) Average item-total correlation. We could correlate each question with the total score of the TV news exposure index to examine the internal consistency of items. This gives us an idea of the contribution of each item to the reliability of the index.
Another approach to the evaluation of reliability is to examine the relative absence of random measurement error in a measuring instrument. Random measurement error can be indexed by a measure of variability of individual item scores around the mean index score. Thus, an instrument with a large measure of variability should be less reliable than one with a smaller measure of variability. (Johnson, 1999)
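As a small illustration of the coefficient alpha mentioned above, the sketch below applies the usual formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) to a hypothetical set of Likert responses (the data are invented for illustration).

import statistics

# Rows = respondents, columns = items rated 1-5 (illustrative data only).
scores = [
    [4, 5, 4, 4],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
]

k = len(scores[0])                                   # number of items
item_variances = [statistics.variance(col) for col in zip(*scores)]
total_variance = statistics.variance([sum(row) for row in scores])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print("Cronbach's alpha = %.2f" % alpha)

# As noted later in this report, an alpha of .70 or higher is usually
# treated as acceptable reliability.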
Development of a valid and reliable Questionnaire
Development of a valid and reliable questionnaire involves several steps and takes considerable time. This section describes the sequential steps involved in the development and testing of questionnaires used for data collection. Figure 1 illustrates the five sequential steps involved in questionnaire development and testing. Each step depends on fine-tuning and testing of the previous steps, which must be completed before the next step. A brief description of each of the five steps follows the figure.

Figure 1. Sequence for Questionnaire/Instrument Development
Step 1 - Background
In this initial step, the purpose, objectives, research questions, and hypotheses of the proposed research are examined. Determining who the audience is, their background (especially their educational and readability levels), their access, and the process used to select the respondents (sample vs. population) is also part of this step. A thorough understanding of the problem through literature search and reading is a must. Good preparation in and understanding of Step 1 provides the foundation for initiating Step 2.

Step 2 - Questionnaire Conceptualization
After developing a thorough understanding of the research, the next step is to generate statements/questions for the questionnaire. In this step, content (from the literature or theoretical framework) is transformed into statements/questions. In addition, a link between the objectives of the study and their translation into content is established. For example, the researcher must indicate what the questionnaire is measuring: knowledge, attitudes, perceptions, opinions, recall of facts, behaviour change, etc. Major variables (independent, dependent, and moderator variables) are identified and defined in this step.

Step 3 - Format and Data Analysis
In Step 3, the focus is on writing the statements/questions, selecting appropriate scales of measurement, questionnaire layout, format, question ordering, font size, front and back cover, and the proposed data analysis. Scales are devices used to quantify a subject's response on a particular variable. Understanding the relationship between the level of measurement and the appropriateness of data analysis is important. For example, if ANOVA (analysis of variance) is one mode of data analysis, the independent variable must be measured on a nominal scale with two or more levels (e.g. yes, no, not sure), and the dependent variable must be measured on an interval/ratio scale (e.g. strongly agree to strongly disagree). A small sketch of this pairing follows below.
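The pairing of measurement levels and analysis referred to above can be sketched with a one-way ANOVA (hypothetical data; this assumes the SciPy package is available): the independent variable is nominal with three levels (yes / no / not sure) and the dependent variable is an interval-level attitude score summed from strongly-agree-to-strongly-disagree items.

from scipy import stats   # assumes SciPy is installed

# Hypothetical summed attitude scores grouped by a nominal response.
attitude_yes      = [18, 20, 22, 19, 21]
attitude_no       = [12, 14, 13, 15, 11]
attitude_not_sure = [16, 15, 17, 16, 18]

f_value, p_value = stats.f_oneway(attitude_yes, attitude_no, attitude_not_sure)
print("F = %.2f, p = %.4f" % (f_value, p_value))

# A small p-value would suggest that mean attitude scores differ across the
# three groups; the point here is simply how the scale levels map onto ANOVA.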
Step 4 - Establishing Validity
As a result of Steps 1-3, a draft questionnaire is ready for establishing validity. Validity is the amount of systematic or built-in error in measurement (Norland, 1990). Validity is established using a panel of experts and a field test. Which type of validity to use (content, construct, criterion, or face) depends on the objectives of the study. The following questions are addressed in Step 4:
1. Is the questionnaire valid? In other words, is the questionnaire measuring what it is intended to measure?
2. Does it represent the content?
3. Is it appropriate for the sample/population?
4. Is the questionnaire comprehensive enough to collect all the information needed to address the purpose and goals of the study?
5. Does the instrument look like a questionnaire?
Addressing these questions, coupled with carrying out a readability test, enhances questionnaire validity. The Gunning Fog Index, Flesch Reading Ease, and Flesch-Kincaid readability formulas can be used to determine readability. Approval from the Institutional Review Board (IRB) must also be obtained. Following IRB approval, the next step is to conduct a field test using subjects not included in the sample. Make changes, as appropriate, based on both the field test and expert opinion. Now the questionnaire is ready to pilot test.

Step 5 - Establishing Reliability
In this final step, the reliability of the questionnaire is assessed using a pilot test. Reliability refers to random error in measurement; it indicates the accuracy or precision of the measuring instrument (Norland, 1990). The pilot test seeks to answer the question: does the questionnaire consistently measure whatever it measures? The choice among reliability types (test-retest, split-half, alternate form, internal consistency) depends on the nature of the data (nominal, ordinal, interval/ratio). For example, to assess
the reliability of questions measured on an interval/ratio scale, internal consistency is appropriate; to assess the reliability of knowledge questions, test-retest or split-half is appropriate. Reliability is established through a pilot test, collecting data from 20-30 subjects not included in the sample. Data collected from the pilot test are analyzed using SPSS (Statistical Package for the Social Sciences) or other software. SPSS provides two key pieces of information: the correlation matrix and the "alpha if item deleted" column. Make sure that items/statements whose correlations are 0, 1, or negative are examined and eliminated. Then view the "alpha if item deleted" column to determine whether alpha can be raised by deleting items, and delete items that substantially improve reliability. To preserve content, delete no more than 20% of the items. The reliability coefficient (alpha) can range from 0 to 1, with 0 representing an instrument full of error and 1 representing a total absence of error. A reliability coefficient (alpha) of .70 or higher is generally considered acceptable. (william, 1999) A small sketch of the "alpha if item deleted" check follows below.
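The "alpha if item deleted" check referred to above can be sketched without SPSS. The pilot data below are hypothetical, and item 4 is deliberately written to be inconsistent with the other items so that deleting it raises alpha.

import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table of scores."""
    k = len(rows[0])
    item_vars = [statistics.variance(col) for col in zip(*rows)]
    total_var = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Rows = pilot-test respondents, columns = questionnaire items (Likert 1-5).
pilot = [
    [4, 5, 4, 2, 4],
    [2, 1, 2, 4, 2],
    [3, 3, 4, 1, 3],
    [5, 5, 5, 3, 4],
    [1, 2, 1, 5, 1],
    [4, 4, 3, 2, 4],
]

print("Alpha with all items: %.2f" % cronbach_alpha(pilot))
for i in range(len(pilot[0])):
    reduced = [row[:i] + row[i + 1:] for row in pilot]
    print("Alpha if item %d deleted: %.2f" % (i + 1, cronbach_alpha(reduced)))

# Deleting the inconsistent item should raise alpha; in practice, delete
# sparingly (no more than about 20% of the items) to preserve content.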
Conclusion
In conclusion, the reliability and validity of the instrument are quite important. "Validation" is the process by which any data collection instrument, including a questionnaire, is assessed for its dependability. Validating questionnaires is somewhat challenging because they usually evaluate subjective measures, which means they can be influenced by a range of factors that are hard to control. Questionnaires are among the most widely used tools, especially in social science research, and the objective of most questionnaires is to obtain relevant information in the most reliable and valid manner possible. Therefore the validation of a questionnaire forms an important aspect of research methodology and of the validity of the outcomes. Often a researcher confuses the objective of validating a questionnaire and tends to equate the reliability of a questionnaire with its validity. In reality, reliability and validity are two different aspects of an acceptable research questionnaire, and it is important for a researcher to understand the differences between them. In simple terms, the reliability of a questionnaire emerges from the quality of the questionnaire itself, while validity emerges from the internal and external consistency and relevance of the questionnaire and from the way it is used. In other words, reliability refers to the quality of the tool (the questionnaire), while validity also depends on the process used to employ the tool, i.e. the process used to conduct the questionnaire.
References
Biddix, J. P. (2003). Retrieved from https://researchrundowns.wordpress.com/quantitative-methods/instrument-validity-reliability/
Groves, R. M. (1987). Research on survey data quality. Public Opinion Quarterly, 51, 156-172.
Jackson. (2000). Retrieved from http://www.evensenwebs.com/validity.html
Johnson. (1999). Retrieved from http://www.okstate.edu/ag/agedcm4h/academic/aged5980a/5980/newpage18.htm
Jones, W. S. (2000). Retrieved from http://changingminds.org/explanations/research/measurement/measurement_error.htm
Quinn, J. A. (2000). Retrieved from http://www.journalism.wisc.edu/~dshah/j658/Reliability%20and%20Validity.pdf
Radhakrishna, R. B. (2007, February). Retrieved from http://www.joe.org/joe/2007february/tt2.php
william. (1999).
Wren, C. P. (2005-06). Retrieved from https://www.uni.edu/chfasoa/reliabilityandvalidity.htm