Steps In Developing A Valid And Reliable Scale of Measurement
BY:
Omnia Samir Elseifi
Assistant Professor of Public Health and Community Medicine.
Faculty of Medicine
Zagazig University
23 January 2020
Scale development process
• Measurement scales are useful tools for obtaining scores on health aspects that cannot be measured directly, such as quality of life.
• The researcher must pass through many steps to reach the ultimate goal: a valid and reliable scale that supports the application of the test results.
Phase I: Item Development
1- Identification of domain
2- Item generation
3- Content validity
Phase II: Scale Development
4- Pretesting (pilot testing of the items)
5- Item reduction
6- Extraction of factors
Phase III: Scale Evaluation
7- Test of dimensionality
8- Test of reliability
9- Test of validity
(1,2,3)
Scale development process Scheme
1- Identification of domain(s):
1- Purpose
2- Justification
3- Describing domains
4- Specify the dimensions
5- Define each dimension
• To specify the boundaries of the domain.
2- Item generation:
1- Appropriate questions
2- Number of items
3- Item wording
4- Translation of items
5- Types of questions
6- Response to items
• To select which items to ask.
3- Content Validity:
CVR
CVI
Face Validity
• To assess if the items adequately measure the content of the domain of interest.
Scale development process Scheme
4- Pretesting:
1- Interview with target population
2- Sample size
3- Distribution of scale
• To gather enough data from the right people.
5- Item reduction:
1- Item difficulty index
2- Item discrimination index
3- Item-item correlation and item-total correlation
4- Distractor efficiency analysis
• To identify items that are not related to the domain, so they can be deleted or modified.
6- Extraction of factors:
Exploratory Factor Analysis (EFA)
Confirmatory Factor Analysis (CFA)
• To explore the number of latent constructs that fit the observed data.
Scale development process Scheme
7- Test of Dimensionality:
Using factor analysis (unidimensional or multidimensional scale)
• To identify the number of latent variables that are measured by the scale.
8- Test of Reliability:
1- Test-retest reliability
2- Internal consistency
3- Parallel form reliability
4- Inter-rater reliability
• To establish if responses are consistent when repeated.
9- Test of Validity:
Criterion validity: concurrent validity, predictive validity
Construct validity: convergent validity, divergent validity, known-group validity
• To ensure the scale measures the intended latent dimension.
Example Of Validated Scale Development Research
A study conducted in Pakistan, “Development of a stress scale for pregnant women in the South Asian context: the A–Z Stress Scale,” will be used as the example for most of the steps.
Phase 1: Item development
Step 1: Identification of the Domain(s)
The purpose: To specify the boundaries of the domain and facilitate item generation.
1- The purpose: to develop a scale based on stressors to measure stress among pregnant women in developing countries.
2- Justification: preexisting scales record the somatic and psychological symptoms of the stressors, not the stressors themselves.
3- Describing domains: the authors agreed on definitions of the different stressors that pregnant women are exposed to.
4- Specify the dimensions: they decided the scale would consist of three dimensions: daily, life event, and pregnancy-related stressors.
5- Define each dimension.
(4,5)
Phase 1: Item development
Step 1: Identification of the Domain(s)
Pitfalls
1. This step is often neglected or dealt with in a superficial manner.
2. Construct underrepresentation (focusing on a narrow aspect of the domain).
These problems lead to a significant number of difficulties later in the validation process (6,7).
Phase 1: Item development
Step 2: Item Generation
The purpose: To create appropriate questions that fit the identified domain.
1- Appropriate questions:
Deductive methods: literature review.
Inductive methods: interviews with 25 experts from different specialties (Psychiatry, Gynecology and Sociology), and interviews with 79 pregnant women about the possible stressors.
2- Number of items (must be 2-5 times the number in the final scale): item pool of 235 items.
3- Item wording
4- Translation of the items
5- Types of questions
6- Response to questions
(5,8-11)
Phase 1: Item development
Step 2: Item Generation
Pitfalls
1. Items that are irrelevant to the defined domain can lead to failure of scale validation, poor quality of data, and invalid conclusions regarding the results and the relationship with other constructs.
2. Improper response formats: a scale that is too short can affect the reliability of the instrument, and so can too many response options (more than 7) (12).
Phase 1: Item development
Step 3: Content Validity
Content validity:
• Content validity ensures that the items of the generated scale measure what they are presumed to measure (the whole content of the domain of interest) (2).
Content validity is assessed by:
• Experts,
• Target population (2)
Phase 1: Item development
Step 3: Content Validity
Purpose: To evaluate the items constituting the domain regarding content relevance and technical quality.
Expert evaluation:
• Content Validity Ratio (CVR)
• Content Validity Index (CVI): I-CVIs and S-CVI
• Kappa coefficient:
- >0.74 is considered excellent.
- Between 0.60 and 0.74 is considered good.
- Between 0.40 and 0.59 is considered fair.
(2)
Phase 1: Item development
Step 3: Content Validity
Content Validity Ratio (CVR):
• The experts are asked to specify whether an item is necessary for the construct or not:
- Score 1 for a [not necessary] item.
- Score 2 for a [useful but not essential] item.
- Score 3 for an [essential] item.
CVR = (number of experts indicating "essential" − total number of experts / 2) / (total number of experts / 2).
• For the minimum number of experts (5 or 6 experts), CVR must be not less than 0.99;
• for 8 experts, not less than 0.85;
• for 10 experts, not less than 0.62;
otherwise the item should be eliminated from the scale.
(13)
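The CVR arithmetic is easy to script. Below is a minimal Python sketch (not part of the original slides) that assumes expert ratings are coded 1 = not necessary, 2 = useful but not essential, 3 = essential, and applies Lawshe's formula to one item.

```python
def content_validity_ratio(ratings):
    """CVR = (n_essential - N/2) / (N/2), where N = number of experts."""
    n = len(ratings)                                 # total number of experts
    n_essential = sum(1 for r in ratings if r == 3)  # experts rating "essential"
    return (n_essential - n / 2) / (n / 2)

# Hypothetical example: 8 experts, 7 of whom rate the item "essential".
ratings = [3, 3, 3, 2, 3, 3, 3, 3]
print(round(content_validity_ratio(ratings), 2))     # 0.75
```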
Phase 1: Item development
Step 3: Content Validity
Content Validity Index (CVI):
Panel members are asked to rate instrument items in terms of clarity and relevance to the construct on a 4-point scale:
- Score 1 for [not relevant or not clear] items.
- Score 2 for [somewhat relevant, or somewhat clear and needing some revision] items.
- Score 3 for [quite relevant or quite clear] items.
- Score 4 for [highly relevant or highly clear] items.
I-CVIs (for each item): number of experts giving a score of 3 or 4 / total number of experts.
• >79%: the item is appropriate and retained within the scale.
• Between 70 and 79%: the item needs revision.
• <70%: the item is eliminated from the scale.
S-CVI/UA: number of items rated relevant by agreement of all experts / total number of items. Should be not less than 0.80.
S-CVI/Ave: sum of the I-CVIs for the items / total number of items. Should be not less than 0.90.
(14)
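As a companion to the definitions above, here is a minimal Python sketch (illustrative data only, not from the study) computing the I-CVI per item, S-CVI/UA and S-CVI/Ave from 4-point relevance ratings, where a rating of 3 or 4 counts as relevant.

```python
# Hypothetical ratings: item -> one 4-point relevance rating per expert.
ratings = {
    "item1": [4, 4, 3, 4, 3],
    "item2": [4, 3, 4, 4, 4],
    "item3": [2, 3, 4, 3, 2],
}

# I-CVI: experts giving a 3 or 4 / total number of experts, per item.
i_cvi = {item: sum(r >= 3 for r in rs) / len(rs) for item, rs in ratings.items()}

# S-CVI/UA: proportion of items rated relevant (3 or 4) by ALL experts.
s_cvi_ua = sum(all(r >= 3 for r in rs) for rs in ratings.values()) / len(ratings)

# S-CVI/Ave: average of the I-CVIs across items.
s_cvi_ave = sum(i_cvi.values()) / len(i_cvi)

print(i_cvi)                # {'item1': 1.0, 'item2': 1.0, 'item3': 0.6}
print(round(s_cvi_ua, 2))   # 0.67 -> below the 0.80 threshold
print(round(s_cvi_ave, 2))  # 0.87 -> below the 0.90 threshold
```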
Phase 1: Item development
Step 3: Content Validity
Face Validity: readability, feasibility, layout, clarity of words.
Face validity is the degree to which the designed measuring instrument is apparently appropriate for and related to the domain under study.
The target population shares with the experts in evaluating the face validity of the scale of measurement (15).
Phase 1: Item development
Step 3: Content Validity
Example for this step:
In the study developing a stress scale for pregnant women in the South Asian context (the A–Z Stress Scale) (5), the researchers stated that they evaluated the content validity of the scale by experts and by the target pregnant women (face validity). Accordingly, 78 items were selected from the item pool.
Pitfalls
• Some researchers fail to assess content validity, perhaps because of a lack of resources or skills. This is expected to affect the final data collected with the scale and the statistical analysis.
• Only a limited number of developing scales undergo target population evaluation, which is an important step because that population is the target of the newly developed scale (16).
Phase 2: Scale Development
Step 4: Pre-testing Questions
The purpose: To ensure the availability of sufficient data for scale development with a minimum level of error.
1- Cognitive interviews with the target population (here, pregnant women).
2- Sample size: a golden rule of thumb is 10 respondents per survey item (10:1). They interviewed 70 pregnant women.
3- Distribution of the scale: paper-based survey or online survey (they used a paper-based, face-to-face interview).
(5,17,18)
Pitfalls
• The sample size in many validation studies falls short of the golden rule, perhaps because this type of study can be difficult to fund.
• Missing data increase the risk of inaccurate conclusions due to the increasing occurrence of errors.
Phase 2: Scale Development
Step 5: Item Reduction
The purpose: To identify items that are not related to the domain under study so they can be deleted or modified.
Techniques:
• Item difficulty index
• Item discrimination test
• Inter-item and item-total correlations
• Distractor efficiency analysis
(5)
Phase 2: Scale Development
Step 5: Item Reduction
Inter-item and Item-Total Correlations
Purpose: To determine the correlations between scale items, as well as the correlations between each item and the sum score of the scale items.
Inter-item correlations: examine the correlation between each item in the scale and the other items.
Item-total correlations: examine the relationship between each item score and the total scale score.
In both techniques, items with low correlations (r < 0.30) are less desirable and could be deleted.
(19,20)
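Both correlation checks can be run in a few lines with pandas. The sketch below uses made-up response data (not from the study) and flags items whose item-total correlation falls below the r = 0.30 guideline.

```python
import pandas as pd

# Hypothetical responses: rows = respondents, columns = items.
df = pd.DataFrame({
    "q1": [1, 2, 4, 5, 3, 4],
    "q2": [2, 2, 5, 4, 3, 5],
    "q3": [5, 4, 1, 2, 3, 1],
})

inter_item = df.corr()                          # item-by-item correlation matrix
total = df.sum(axis=1)                          # total scale score per respondent
item_total = df.apply(lambda col: col.corr(total))

print(inter_item.round(2))
print(item_total.round(2))
print("Items to review:", item_total[item_total < 0.30].index.tolist())
```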
Phase 2: Scale Development
Step 5: Item Reduction
Example:
In the study developing a stress scale for pregnant women in the South Asian context (the A–Z Stress Scale) (5), the researchers conducted an item-total analysis; correlations ranged from r = 0.2 to r = 0.8. As a result, the items were reduced to a final set of 30 items.
Phase 2: Scale Development
Step 5: Item Reduction
Item Difficulty Index
Purpose: To assess the difficulty level of the scale's test items.
Item difficulty index = number of correct answers for the item / total number of answers for that item. It ranges from 0.0 to 1.0.
Item difficulty index and difficulty level:
• 0.86 and above: Very easy
• 0.71 to 0.85: Easy
• 0.30 to 0.70: Moderate
• 0.15 to 0.29: Difficult
• 0.14 and below: Very difficult
A high difficulty index means a greater proportion of the sample answered the question correctly; a low difficulty index means a smaller proportion understood the question and answered correctly.
(2,21)
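A minimal Python sketch of the difficulty index for a dichotomously scored item (illustrative data), using the cut-offs listed above:

```python
def difficulty_index(responses):
    """responses: 0/1 scores for one item across all respondents."""
    return sum(responses) / len(responses)

def difficulty_level(p):
    if p >= 0.86: return "Very easy"
    if p >= 0.71: return "Easy"
    if p >= 0.30: return "Moderate"
    if p >= 0.15: return "Difficult"
    return "Very difficult"

item_scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]    # 7 of 10 respondents correct
p = difficulty_index(item_scores)
print(p, "->", difficulty_level(p))              # 0.7 -> Moderate
```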
Phase 2: Scale Development
Step 5: Item Reduction
Item Discrimination Test
Purpose: To identify the degree to which an item correctly differentiates between respondents.
Item discrimination index = proportion of respondents in the upper group (with high scores) who got the item correct − proportion of respondents in the lower group (with low scores) who got the item correct. It ranges from −1 to +1.
Item discrimination index and discrimination level:
• 0.19 and below: Poor item; should be eliminated or revised.
• 0.20 to 0.29: Marginal item; needs revision.
• 0.30 to 0.39: Good item; may need some improvement.
• 0.40 or above: Very good item.
(22,23)
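The discrimination index can be computed by splitting respondents into upper and lower scoring groups on the total score. The sketch below (hypothetical 0/1 item scores, not the study's data) uses the top and bottom thirds as the two groups; other splits, such as top and bottom 27%, are also common.

```python
import numpy as np

# Hypothetical 0/1 item scores: rows = respondents, columns = items.
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])

totals = scores.sum(axis=1)            # total score per respondent
order = np.argsort(totals)             # respondents sorted from lowest to highest
k = len(scores) // 3                   # size of each extreme group
lower, upper = order[:k], order[-k:]

# Discrimination index per item: upper-group proportion correct
# minus lower-group proportion correct (range -1 to +1).
discrimination = scores[upper].mean(axis=0) - scores[lower].mean(axis=0)
print(discrimination)
```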
Phase 2: Scale Development
Step 5: Item Reduction
Distractor Efficiency Analysis:
Purpose: To determine the distribution of the incorrect options ("distractors") and how they contribute to the quality of the items.
For an appropriate item, the correct option is chosen by:
• 100% of participants in the upper group (with high scores),
• about 50% of participants in the middle group (with middle scores),
• few or none of those in the lower group (with low scores).
If those with adequate knowledge (the upper group) cannot differentiate between the right option of the item and the distractors, the question may need to be modified or deleted.
(24,25)
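As a rough illustration (hypothetical choices, not the study's data), the tally below shows how the keyed answer and the distractors of one multiple-choice item are distributed across the upper, middle and lower groups; a distractor chosen by almost nobody, or mainly by the upper group, signals that the option or the item needs revision.

```python
from collections import Counter

keyed = "B"                                     # correct option for this item
choices_by_group = {
    "upper":  ["B", "B", "B", "B", "A"],
    "middle": ["B", "B", "C", "A", "D"],
    "lower":  ["A", "C", "D", "D", "C"],
}

for group, choices in choices_by_group.items():
    counts = Counter(choices)                   # how often each option was chosen
    share_correct = counts[keyed] / len(choices)
    print(group, dict(counts), f"correct: {share_correct:.0%}")
```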
Phase 2: Scale Development
Step 6: Extraction of Factors
Factor analysis:
It is a method for explaining the structure of data through the correlations between variables. It summarizes the data into a few dimensions by condensing many variables into a smaller set of latent variables or factors.
• Exploratory Factor Analysis (EFA) examines the interrelations between the items of the construct. It is used to reduce the set of observed variables to a smaller, more coherent set of latent variables.
• Confirmatory Factor Analysis (CFA) is used to confirm the factor structure by statistically testing the hypothesized factor loadings (FL) of the observed items on the underlying (latent) factors and the correlations between latent variables.
• Items with factor loadings (slope coefficients) below 0.30 are considered inadequate ("unrelated items") and should be eliminated.
• Items with cross-loadings > 0.4 should be eliminated.
(4,23,26)
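For orientation only, here is a bare-bones exploratory extraction in Python using numpy alone on simulated data: it eigendecomposes the item correlation matrix, keeps factors with eigenvalue > 1 (Kaiser criterion) and prints unrotated loadings, to which the 0.30 cut-off above would be applied. Real analyses would normally use dedicated EFA/CFA software with rotation and fit statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: 200 respondents x 6 items driven by two latent factors.
latent = rng.normal(size=(200, 2))
weights = np.array([[0.8, 0.0], [0.7, 0.1], [0.9, 0.0],
                    [0.0, 0.8], [0.1, 0.7], [0.0, 0.9]])
items = latent @ weights.T + 0.4 * rng.normal(size=(200, 6))

corr = np.corrcoef(items, rowvar=False)         # item correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)         # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]               # reorder: largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_factors = int(np.sum(eigvals > 1))            # Kaiser criterion
loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])

print("factors retained:", n_factors)
print(np.round(loadings, 2))                    # |loading| < 0.30 flags weak items
```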
Phase 2: Scale Development
Step 6: Extraction of Factors
Example:
In a study developing a disease-specific tool for assessment of quality of life in patients with hepatitis C virus associated chronic liver disease (27), the researchers conducted CFA and calculated the factor loadings; any item with a factor loading of less than 0.3 was eliminated.
Pitfalls:
Many scale developers hesitate to use factor analysis, either because:
• it needs a large sample size to be conducted, or
• it involves many confusing and complicated steps and interpretations (16).
Phase 3: Scale Evaluation
Step 7: Test of Dimensionality
• Purpose: To identify the number of latent variables that are measured by the scale (the scale's dimensionality).
• It usually depends on the factor extraction and analysis.
(12)
Phase 3: Scale Evaluation
Step 7: Test of Dimensionality
Example:
In the study developing a stress scale for pregnant women in the South Asian context (the A–Z Stress Scale) (5), the researchers stated that, by multidimensional scaling, their scale has two dimensions:
1- a socioenvironmental-related hassles dimension (items 1-26),
2- a chronic illness dimension (items 27-30).
Pitfalls
• Failure to calculate EFA and CFA effectively leads to misclassification of the dimensions of the construct.
• Many researchers rely on the literature and expert views to divide the dimensions of the construct rather than using factor analysis (12).
Phase 3: Scale Evaluation
Step 8: Tests of Reliability
Reliability is the ability to reproduce the same result consistently under the same conditions.
Purpose: To measure reliability regarding stability, internal consistency, equivalence and inter-rater reliability.
1- Stability (test-retest reliability): the test is administered twice or more to the same participants to ensure that the same results are obtained. Example: the developing scale was tested on 43 pregnant women twice, with a one-week interval (r = 0.86).
2- Internal consistency: measures whether items measuring the same general construct produce the same scores (homogeneity). It is assessed by:
• Cronbach's α (value 0-1; ≥0.7 is acceptable)
• Kuder-Richardson
• Split-halves reliability (the scale is split into two equal halves, which are then compared).
Example: Cronbach's alpha was 0.82 for the scale and ranged between 0.75 and 0.86 for the different items.
3- Equivalence (parallel form reliability): determines the correlation or level of agreement between two or more instruments at the same point of time.
4- Inter-rater reliability: assesses the degree of agreement between two or more raters assessing a certain phenomenon at the same point of time. Example: the developing scale was applied to 50 pregnant women by two interviewers (r = 0.91).
(22, 28, 29)
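A minimal Python sketch (made-up scores, not the study's data) of two of the reliability checks above: Cronbach's alpha for internal consistency and a Pearson correlation for test-retest stability.

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = respondents, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)         # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

items = np.array([[3, 4, 3, 4],
                  [2, 2, 3, 2],
                  [4, 5, 4, 5],
                  [1, 2, 1, 2],
                  [3, 3, 4, 3]])
print(round(cronbach_alpha(items), 2))             # >= 0.70 is usually acceptable

# Test-retest: correlate total scores from two administrations
# (the retest scores here are hypothetical).
time1 = items.sum(axis=1)
time2 = time1 + np.array([0, 1, -1, 0, 1])
print(round(float(np.corrcoef(time1, time2)[0, 1]), 2))
```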
Phase 3: Scale Evaluation
Step 8: Tests of Reliability
Pitfalls:
• Test-retest reliability should be used with caution, as scores can change over time in some types of studies (e.g., intervention studies); in that case the change is not due to an unreliable measure but is a true change in the participants.
• Fewer than 10 items in the scale can lower Cronbach's alpha.
• Lack of standardization between observers decreases inter-rater agreement (1,2).
Phase 3: Scale Evaluation
Step 9: Tests of Validity
Validity: the ability of the measuring scale to evaluate the domain that was intended to be measured.
1- Content validity (including face validity).
2- Criterion validity:
• Concurrent validity: the new measure is compared with a gold standard at the same time, in the same group.
• Predictive validity: the new measure predicts a gold standard or a behavior after a period of time, in the same group.
3- Construct validity:
• Convergent validity: two related measures give the same result.
• Divergent (discriminant) validity: two different measures give different results.
• Known-groups validity: the same measurement gives different results in two different groups.
(22, 28, 30)
Phase 3: Scale Evaluation
Step 9: Tests of Validity
Example: In the study developing a stress scale for pregnant women in the South Asian context (the A–Z Stress Scale) (5), criterion (concurrent) validity was assessed by comparing the new A–Z Stress Scale, at the same time, with a multicultural validated depression scale; there was a moderate correlation between the two scales (r = 0.56).
Pitfalls for validity calculation:
1- Criterion validity cannot be assessed with a small sample size because of sampling error.
2- Criterion validity cannot be used in all circumstances, especially in the social sciences, where a relevant criterion ("gold standard") may not exist; so it is usually ignored and not calculated in most validation studies.
3- Lack of sufficient resources or skills for its calculation and assessment (22).
Phase 3: Scale Evaluation
Step 9: Tests of Validity
Pitfalls for validity calculation (cont.):
4- Scale developers usually use a homogeneous group from the population in the pilot study, which limits the calculation of construct validity; recruiting a heterogeneous group or a random sample of the population is therefore recommended.
5- A single calculation of validity is inaccurate if the variable under study changes with time, as it can produce spurious correlations between variables; it is recommended to conduct longitudinal studies during scale development to obtain accurate validity measures, especially for predictive validity.
6- Social desirability bias: a systematic error in self-report measures in which the participants want to maintain a good image. This is considered one of the important threats to validity (22).
Conclusion
• Valid research results begin with valid and reliable measurement. This can be achieved if a systematic, scientifically based process is followed.
• Developing a valid and reliable scale is a multiphasic procedure that needs a researcher with adequate knowledge and a proper level of skills.
• Poor scale development harms the validity and reliability of the results and, therefore, their applicability in practice. So, the availability of a comprehensive guide for scale development is essential.
References
1. Fabrigar LR., Ebel-Lam A. Questionnaires. In N. J. Salkind (Ed.), Encyclopedia of Measurement and Statistics (2007).Thousand Oaks, CA: Sage. pp. 808-812.
2. DeVellis RF. Scale Development:Theory and Application. (3rd ed.). Los Angeles, CA: Sage Publications (2012).
3. Hinkin TR.A review of scale development practices in the study of organizations. J Manag. 1995; 21:967–88. doi:10.1016/01492063(95)90050-0
4. McCoach DB, Gable RK, Madura, JP. Instrument Development in the Affective Domain. School and Corporate Applications, 3rd Edn. NewYork, NY: Springer (2013).
5. Kazi A, Fatmi Z, Hatcher J, Niaz U, Aziz A. Development of a stress scale for pregnant women in the South Asian context: the A-Z Stress Scale. East Mediterr Health J. 2009 Mar-
Apr;15(2):353-61. PMID: 19554982.
6. Messick S. Validity of psychological assessment: validation of inferences from persons’ responses and performance as scientific inquiry into score meaning. Am Psychol. (1995) 50:741–9. doi: 10.1037/0003-066X.50.9.741
7. MacKenzie, S. B. 2003.“The Dangers of Poor Construct Conceptualization,” Journal of the Academy of Marketing Science (31:3), pp. 323-326.
8. Streiner, D. L., Norman, G. R., & Cairney, J. (2015). Health Measurement Scales:A Practical Guide to Their Development and Use (5th ed.). Oxford, UK: Oxford University Press.
9. Schinka JA,VelicerWF,Weiner IR. Handbook of Psychology, Research Methods in Psychology. Hoboken, NJ: JohnWiley & Sons, Inc. 2012.
10. DeVellis RF. Scale Development:Theory and Applications (4th ed.).Thousand Oaks, CA: Sage. 2017.
11. Price LR. Psychometric Methods:Theory into Practice. NewYork:The Guilford Press. 2017. pp: 190-191.
12. Furr RM. Scale Construction and Psychometrics for Social and Personality Psychology. New Delhi, IN: Sage Publications. 2011.
13. Streiner, DL, Norman GR, Cairney J. Health Measurement Scales:A Practical Guide to Their Development and Use (5th ed.). Oxford, UK: Oxford University Press. 2015.
14. Polit DF, Beck CT, Owen SV. Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Res Nurs Health 2007;30(4):459-67.
15. Haynes SN, Richard DCS, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Pyschol Assess. 1995; 7:238–47
16. Morgado FFR, Meireles JFF, Neves CM, Amaral ACS, Ferreira MEC. Scale development: ten main limitations and recommendations to improve future research practices. Psicol Reflex E
Crítica 2018; 30:3.
17. Greenlaw C, Brown-Welty S.A Comparison of web-based and paper-based survey methods: testing assumptions of survey mode and response cost. EvalRev. 2009; 33:464–80.
18. Fanning J, McAuley E.A Comparison of tablet computer and paper-based questionnaires in healthy aging research. JMIR Res Protoc. 2014; 3:e38.
19-Raykov T, Marcoulides GA. Introduction to Psychometric Theory. NewYork, NY: Routledge,Taylor & Francis Group 2011.
20. Cohen RJ, Swerdlik ME. Psychological testing and assessment:An introduction to tests and measurement (6th ed.). NewYork: McGraw-Hill, 2005.
21. Si-Mui Sim, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false type multiple choice questions of a para-clinical multidisciplinary paper. Ann Acad
Med Singapore 2006; 35: 67-71
22- Whiston SC. Principles and Applications of Assessment in Counseling. Cengage Learning 2008.
23. Zubairi AM, Kassim NLA. Classical and Rasch analysis of dichotomously scored reading comprehension test items. Malaysian J of ELT Res 2006; 2: 1-20.
24- Tarrant M,Ware J, Mohammed AM.An assessment of functioning and nonfunctioning distractors in multiple-choice questions: a descriptive analysis. BMC Med Educ. 2009; 9:40.
25-Fulcher G, Davidson F.The Routledge Handbook of LanguageTesting. NewYork, NY: Routledge 2012.
26- Polit DF Beck CT. Nursing Research: Generating and Assessing Evidence for Nursing Practice, 9th ed. Philadelphia, USA:Wolters Klower Health, Lippincott Williams & Wilkins, 2012.
27- Sobhi SA, Ibrahim AS, Serwah AA, Tawfik MY. In a research for Developing a disease-specific tool for assessment of quality of life of patients with hepatitis C virus associated chronic
liver disease. Suez canal university medical journal.2008; 11(2):207-214.
28. Boateng GO, Neilands TB, Frongillo EA, Melgar-Quiñonez HR and Young SL Best Practices for Developing and Validating Scales for Health, Social, and Behavioral Research: A Primer.
Front. Public Health 2018; 6:149.
29.Wong KL, Ong SF, Kuek TY. Constructing a survey questionnaire to collect data on service quality of business academics. Eur J Soc Sci 2012; 29:209-21.
30. Sackett PR, Lievens F, Berry CM, Landers RN. A cautionary note on the effects of range restriction on predictor intercorrelations. Journal of Applied Psychology 2007; 92(2): 538–544.