SJTs: Avoiding a Grand Failure
David DeGeest
06J:278 Staffing
April 13, 2010
Why are people so interested?
The Big Picture of the SJT Literature
The claims about SJTs
“They seem to represent psychometric alchemy” (Landy, 2007).
Adverse impact is down, validity is up.
Assessees like them.
They seem to address relevant KSAOs.
They assess soft skills and tacit knowledge.
They provide incremental validity above GMA and personality for predicting college GPA (Oswald et al., 2004).
Some SJTs have demonstrated criterion-related validities as high as r = .36 (Pereira and Harvey, 1999).
They measure tacit knowledge and “non-academic intelligence” (Sternberg et al., 1995).
What is an SJT?
Situational Judgment Tests (SJTs) or Inventories (SJIs) are psychological tests that present respondents with realistic, hypothetical scenarios and ask for an appropriate response.
SJTs are often identified as a type of low-fidelity simulation (Motowidlo, 1990).
SJTs can be designed to predict job performance, managerial ability, integrity, personality, and apparently other measures or constructs.
Example of an SJT item from Becker (2005)
11. You’re retiring from a successful business that you started, and must now decide who will replace you. Two of your children want the position and would probably do a fine job. However, three non-family employees are more qualified. Who would you most likely put in charge?
A. The best performing non-family member, because the most qualified person deserves the job.
B. The lowest performing non-family member, because this won’t hurt your children’s feelings.
C. The highest performing child, because you have the right to do what is best for your kids.
D. The child you love the most, as long as he or she is able to do the job.
History of SJTs
First recorded SJT: the George Washington University Social Intelligence Test (1926).
Some usage during WWII by military psychologists.
1990: Motowidlo’s research resurrected interest in SJTs with the idea of the “low-fidelity simulation.”
Commonly used now in industry as a “customized” tool for organizations, consultants, etc.
Takeaway: There is a lot of perceived promise and sunk cost in SJT research.
What the heck is an SJT?
Construct Validity and the Development of SJTs
Item Characteristics
McDaniel et al. (2005) claim that SJTs have eight differentiating characteristics:
Test fidelity
Stem length
Stem complexity
Stem comprehensibility
Nested stems
Nature of responses
Response instructions
Degree of item heterogeneity
There are no prescribed standards for developing an SJT.
Item Characteristics
Examples of response options:
What is the best answer?
What would you most likely do?
Rate each response for effectiveness.
Rate each response on the likelihood you would engage in the behavior.
Knowledge v. Behavioral Tendency
Dichotomization Issue
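One common reading of the dichotomization issue can be sketched in code: pick-the-best instructions collapse a response to a 0/1 score, while rate-each-response instructions retain a continuous score. The item key, ratings, and both scoring functions below are hypothetical illustrations, not any published SJT's scoring scheme.

```python
# Hypothetical item with four response options (A-D) and an expert key.
# Both scoring rules and all values below are invented for illustration.

def score_pick_best(chosen_option, keyed_best):
    """Knowledge-style, dichotomous scoring: 1 only for the keyed-best pick."""
    return 1 if chosen_option == keyed_best else 0

def score_ratings(ratings, expert_ratings):
    """Rate-each-response scoring: closeness of the respondent's
    effectiveness ratings to the expert key (higher = closer)."""
    diffs = [abs(r - e) for r, e in zip(ratings, expert_ratings)]
    return -sum(diffs) / len(diffs)

# A respondent who disagrees with the key on the single "best" option
# scores 0 under dichotomous scoring...
print(score_pick_best("C", keyed_best="A"))        # prints 0
# ...yet nearly matches the key option-by-option under continuous scoring.
print(score_ratings([4, 3, 5, 2], [5, 3, 5, 2]))   # prints -0.25
```

The continuous rule keeps variance that the dichotomous rule throws away, which is one reason the choice of response instructions matters for reliability.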
Item Characteristics and Construct Validity
Construct heterogeneity: most items tend to correlate with GMA, Agreeableness, Conscientiousness, or Emotional Stability (McDaniel, 2005).
Ployhart and Ehrhart (2003) note that measuring multiple constructs with SJTs makes it hard to compare differences across studies.
Takeaway: SJTs are best described as a method, not a construct (Schmitt & Chan, 2006).
Exciting findings from research on SJTs
The Promise of SJTs
Generalizability
McDaniel et al. (2007) meta-analytically demonstrated that SJTs have incremental validity of:
.03 to .05 over GMA
.06 to .07 over the Big Five
.01 to .02 over a GMA/Big Five composite
McDaniel et al. (2001) showed that SJTs are generalizable as predictors of job performance; the 90% credibility value did not contain zero in the meta-analysis.
Potosky et al. (2004) showed a .84 score-equivalence correlation between an SJT administered via paper-and-pencil and via the Internet, with no effects based on beliefs about computer efficacy.
Takeaway: multiple meta-analyses have demonstrated the generalizability of SJTs in predicting job performance.
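For readers unfamiliar with incremental validity, the quantity being reported can be illustrated with the standard two-predictor multiple-correlation formula: the gain in multiple R when the SJT is added to GMA alone. The correlations below are invented for the sketch; they are not the meta-analytic values.

```python
import math

# All correlations below are assumed values for illustration only.
r_y_gma = 0.50    # GMA with job performance (assumed)
r_y_sjt = 0.30    # SJT with job performance (assumed)
r_gma_sjt = 0.40  # GMA with SJT (assumed)

# Squared multiple correlation for the two-predictor model
# (standard two-predictor formula).
r2_both = (r_y_gma**2 + r_y_sjt**2
           - 2 * r_y_gma * r_y_sjt * r_gma_sjt) / (1 - r_gma_sjt**2)
R_both = math.sqrt(r2_both)

# Incremental validity: gain in multiple R when the SJT joins GMA alone.
delta_R = R_both - r_y_gma
print(round(delta_R, 3))  # prints 0.012
```

Note how a predictor that correlates substantially with GMA adds only a small increment over GMA alone, which is why the meta-analytic increments above are modest.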
Variability in SJTs
Lievens and Sackett (2006) showed that video-based SJTs for interpersonal skills have higher validity than written SJTs.
McDaniel et al. (2007) showed that reliabilities for SJTs can range from .63 to .88; the meta-analysis refers to alpha, but other reliability measures matter.
Takeaway: The effects of variations in level of fidelity offer interesting possibilities for research.
Assessment Reactions and Face Validity
Chan & Schmitt (1997) showed that Black–White differences in test performance and face validity reactions were lower for video-based SJTs than for pencil-and-paper tests.
The Race × Method interaction was attributable to reading comprehension differences across subgroups.
Increasing fidelity increased mean performance on the SJT.
Chan (1997) showed that paper-and-pencil SJTs are more consistent with the beliefs, values, and expectations of whites; moving to a video-based SJT increased validity perceptions for both whites and blacks.
Bauer and Truxillo (2006) assert that SJTs always have better face validity than cognitive and personality measures.
Takeaway: SJTs are useful in terms of face validity and justice perceptions, particularly high-fidelity (video) simulations.
The trouble with…
Problems with SJTs
The problem of g
Nguyen (2005) found that knowledge-instruction SJT scores correlated .56 with GMA and behavioral-instruction scores correlated .38 with GMA.
Peeters and Lievens (2005) found that faking-good instructions produced differences in means and criterion-related validities across subgroups.
“Specifically, the fakability of SJTs might depend on their correlation with one particular construct, namely, cognitive ability (see Nguyen & McDaniel, 2001)” (p. 73).
Schmidt and Hunter (2003) found low discriminant validity between SJTs and job knowledge tests.
Retesting Issue
“If subgroup differences on a test exist, policies that permit retests by candidates who were unsuccessful on the test might inflate calculations of adverse impact” (Lievens et al., 2005, p. 1005).
Takeaway: If the degree of fakability of an SJT depends on its GMA load, SJTs might just be contaminated g tests or low-reliability job knowledge tests.
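A toy calculation, with entirely invented applicant counts, shows the retesting mechanism Lievens et al. describe: if each retest attempt is counted as a new application, and the focal group retests more often, its apparent selection ratio shrinks and the four-fifths ratio falls.

```python
# Hypothetical applicant counts illustrating the retesting concern raised by
# Lievens et al. (2005). None of these numbers come from their data.

def selection_ratio(passed, applied):
    return passed / applied

def adverse_impact_ratio(focal_sr, referent_sr):
    """Four-fifths rule: a ratio below 0.80 flags potential adverse impact."""
    return focal_sr / referent_sr

# Counting first attempts only:
ai_single = adverse_impact_ratio(selection_ratio(40, 100),   # focal group
                                 selection_ratio(60, 100))   # referent group
# If unsuccessful candidates retest and each attempt counts as a new
# application, the focal group's apparent selection ratio shrinks more
# (assuming, hypothetically, that its members retest more often):
ai_with_retests = adverse_impact_ratio(selection_ratio(40, 140),
                                       selection_ratio(60, 110))
print(round(ai_single, 2), round(ai_with_retests, 2))  # prints 0.67 0.52
```

The same hiring outcomes thus look substantially worse under the retest-counting policy, which is the inflation the quotation warns about.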
Faking
Nguyen et al. (2005) found that d = .34 for honest instructions and d = .15 for faking.
Ployhart and Ehrhart (2003) also note that behavioral response instructions are both more prone to faking and have more problematic reliability issues.
Hooper et al. (2006) note that the fragmentation of the literature has made a meta-analytic study of this issue impossible.
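As a refresher on the d statistics quoted above, here is a minimal sketch of a pooled-SD Cohen's d. The group means, SDs, and sample sizes are invented; the example only illustrates one hypothetical route (added score variance) by which instruction conditions can move d, not Nguyen et al.'s actual data.

```python
import math

# Invented group statistics; not Nguyen et al.'s (2005) data.
def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# The same raw mean gap yields a smaller d when score spread doubles,
# one (hypothetical) way a condition change can shrink d:
print(round(cohens_d(3.4, 0.5, 100, 3.2, 0.5, 100), 2))  # prints 0.4
print(round(cohens_d(3.4, 1.0, 100, 3.2, 1.0, 100), 2))  # prints 0.2
```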
Response Instructions
Ployhart and Ehrhart (2003) found that response instructions had dramatic effects on the validity, reliability, and performance of SJTs, and showed that the dimensionality of an SJT is crucial to determining which reliability estimate to use.
McDaniel et al. (2007) found meaningful differences between means for tests with different behavioral and knowledge instructions.
Lievens and Sackett (2009) found no meaningful differences between means in a high-stakes testing environment with medical school applicants.
The last two studies found that knowledge instructions for an SJT increased the scores’ correlation with a GMA measure.
Takeaway: meta-analytic integration of these results is needed, but the primary research has yet to support this.
What is the reliability for an SJT?
Bess (2001) points out the elephant in the room: “SJTs by definition are multidimensional and therefore internal consistency is not an appropriate measure of reliability” (p. 29).
Schmitt and Chan (1997) also note this problem.
Examples of reliability estimates:
Ployhart and Ehrhart (2003) used split-half estimates to get .67 and .68 reliabilities.
Lievens and Sackett (2009) found low alphas for their SJT (.55–.56).
Lievens and Sackett (2007) noted that generating alternate forms is difficult for SJTs, given the contextual specificity of items, which makes parallel-forms reliability impractical.
Takeaway: no one is quite sure how to systematically assess reliabilities for SJT measures.
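Since the slide contrasts alpha with split-half estimates, a small sketch may help: computed on the very same (invented) item scores, the two estimates can disagree, which is exactly the "which reliability?" problem. All data and helper functions below are hypothetical.

```python
import math

# Invented scores: 4 SJT items x 5 respondents (one row per item).
ITEMS = [[1, 2, 3, 4, 5],
         [2, 2, 3, 4, 4],
         [3, 1, 4, 2, 5],
         [1, 3, 3, 3, 5]]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """Internal-consistency (alpha) estimate from per-item score lists."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def split_half(items):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    odd = [sum(s) for s in zip(*items[0::2])]
    even = [sum(s) for s in zip(*items[1::2])]
    r = pearson(odd, even)
    return 2 * r / (1 + r)

# The two estimates disagree on the very same data.
print(round(cronbach_alpha(ITEMS), 2), round(split_half(ITEMS), 2))  # prints 0.87 0.92
```

With heterogeneous items the gap between estimates can widen further, which is why which estimate a study reports matters for comparing SJTs.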
What do we know about SJTs?
Conclusions
Things we know fairly clearly
SJTs are primarily a method, not a construct.
SJTs have demonstrated generalizable meta-analytic incremental validity over GMA and Big Five single and composite measures in predicting job performance.
Most SJTs are correlated with GMA to a varying extent and share some benefits and disadvantages with GMA.
SJTs often correlate with the “Big 3” personality traits (Agreeableness, Conscientiousness, and Emotional Stability).
McDaniel et al.’s (2006) integrated model for SJTs
Where do researchers go from here? Practitioners?
Future Directions for SJTs
Ployhart & Weekley (2006) Agenda for Research
Construct validity: correlates are known, but the nomological net is uncertain.
SJTs targeted to constructs: the “holy grail” (p. 348).
What exactly is “judgment”?
Understanding SJT structure: How do we build SJTs to get construct homogeneity? How do we enhance the reliability of these measures?
More experimentation/micro research: correlational studies and meta-analyses show generalizability; experimental studies can enhance understanding.
