Justifying the Use of an English Placement Test
1. LOGO
Justifying the Use of an English
Language Placement Test
with an Assessment Use Argument
Presented by:
Parisa Mehran
Alzahra University
Tehran, Iran
2. Placement Tests
A placement test is considered a fairly high-stakes test (Bachman & Palmer, 1996, 2010),
and the social consequences of placement decisions are of great significance and need to
be investigated, since such decisions can affect the lives of students (Murray, 2001;
Schmitz & delMas, 1991).
Thus, as Brown (1989) emphasizes, it is important to make valid placement decisions to
avoid mismatches that can occur due to inappropriate placement testing.
3. Validity in Language Assessment
Validity has been regarded as the most significant and complex concept in language
assessment, and it has always been under investigation by language testing experts
and researchers (e.g., Bachman, 1990, 2004, 2005; Bachman & Palmer, 2010; Chapelle,
1999; Cronbach & Meehl, 1955; Kane, 2001, 2012, 2013; Lado, 1961; Messick, 1989). As a
result, the conception of validity has undergone a series of reinterpretations
throughout the history of language assessment.
4. Argument-based Approaches to Validity
Argument-based approaches to validity are based on the concept of a validity argument which has
been used in the process of validation for more than twenty years (e.g., Bachman, 2005; Cronbach,
1988, 1989; House, 1980; Kane, 1992; Mislevy, 2003; Mislevy, Steinberg, & Almond, 2002, 2003).
The process is comparable to building a legal case to persuade a judge or a jury. The process of
validation thus becomes ongoing: as long as a test is in use, the collection of relevant evidence
continues (Bachman & Palmer, 2010). Hence, any kind of relevant evidence is gathered
to show the plausibility of the intended interpretations and uses (Bachman, 2004; Kane, 2012).
5. Purpose of the Study
Using Bachman and Palmer's (2010) AUA as a framework, this study examined the
validity of an English language placement test, which is composed of the Oxford Quick
Placement Test (OQPT) and a follow-up oral examination. The following research
question was addressed:
To what extent are the OQPT and the oral examination justifiable in placing students
appropriately according to Bachman and Palmer's (2010) AUA?
6. Methodology: Participants and Setting
This study was conducted at one of the English language institutes in Tehran, Iran.
Three hundred and thirty-two (332) newcomers to the institute who had to take the
placement test participated in this study, and 15 of them were interviewed. The head of the
institute, three examiners of the placement test, ten teachers, and four experts also took part
in the study.
8. The AUA for Justifying the Placement Test
As Vongpumivitch (2010) and Wang et al. (2012) remark, Bachman and Palmer's (2010) framework
has a top-down approach. That is, the four claims are discussed from the perspective of test
development rather than from that of test use. Therefore, in this study, where the aim is to evaluate
the overall usefulness of an English language placement test, the four claims are presented in the
reverse order from that in Bachman and Palmer (2010).
It should also be mentioned that, as Bachman and Palmer (2010) emphasize, not all of the warrants
and rebuttals they list will necessarily be needed in the AUA for any given test. Moreover,
due to practical research limitations, not all of the warrants and rebuttals have been investigated in
the present study.
9. Claim 4: The assessment records of the OQPT and the oral examination are consistent across
different assessment tasks, different aspects of the assessment procedure, and across different
groups of test takers.
Claim 4: Assessment Records
10. Consistency
The first warrant for this claim is that the procedures for administering the OQPT and the oral
examination are followed consistently across different occasions and for all test takers:
The observation of how the OQPT and the oral examination were administered as well as the
interviews with the examiners and the head of the institute revealed that there is a set of
administrative procedures which are strictly followed by the test administrators; hence, the
administrative procedures are consistent across different occasions and for all test taker groups.
11. Consistency (cont.)
Another warrant to support the consistency claim involves the scoring criteria and procedures:
The criteria and procedures for rating test takers' performance on the OQPT are well specified and
are adhered to. Since the OQPT is in multiple-choice format, its rating criteria and procedures are
quite objective, and scoring is done based on an answer key.
However, the criteria and procedures for the oral examination are not well specified and are quite
subjective. A set of questions has been devised based on the coursebook. In this sense, the
administration of the oral examination is consistent, yet its scoring process does not follow any
specific procedures. This lack of evidence could be a rebuttal here.
12. Consistency (cont.)
With respect to the warrant of rater training:
Raters undergo training before administering the placement test.
However, one of the examiners was not satisfied with the training process, and she
believed that what matters is just the examiner's marketing skill to "grab more
customers" for the institute.
13. Consistency (cont.)
To check the internal consistency of the items, as another warrant:
Kuder-Richardson formula 20 (KR-20) was used.
The reliability coefficient (KR-20) obtained for the OQPT Version 1 was .93 and for the OQPT Version 2 was
.88, showing that the OQPT has reasonable internal consistency reliability.
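As a sketch of how KR-20 is computed (with a small synthetic 0/1 response matrix for illustration, not the study's data):

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson formula 20 for a dichotomous (0/1) response matrix.

    Rows are test takers, columns are items. Synthetic illustration only.
    """
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                         # number of items
    p = responses.mean(axis=0)                     # proportion correct per item
    q = 1.0 - p                                    # proportion incorrect per item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Four hypothetical test takers answering four items:
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1]]
print(round(kr20(data), 3))  # → 0.833
```

Values near 1 indicate that the items hang together as a measure of a single ability, which is what the .93 and .88 coefficients above suggest for the two OQPT versions.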
Two main test item indices (item difficulty and item discrimination) were used in the test item analysis for the
OQPT.
In terms of difficulty, the items have been ordered from the easiest to the most difficult, and this is in line
with the view of experts, examiners, and test takers. The analysis of item difficulty showed that, by and
large, most of the items were difficult (56% in Version 1 and 63% in Version 2), and neither version of the
OQPT had an acceptable distribution of difficulty.
In terms of item discrimination, the analysis demonstrated that the items in the OQPT Version 1
had a good amount of discrimination (75%). However, the OQPT Version 2 contained fewer items of fair
discrimination (46%) in comparison to the first version.
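The two item indices can be sketched as follows; the corrected item-total correlation used here is one common discrimination index, and the data are synthetic, not the study's:

```python
import numpy as np

def item_analysis(responses):
    """Item difficulty and discrimination for a 0/1 response matrix.

    Difficulty is the proportion answering correctly (values around .50 are
    usually desirable); discrimination here is the corrected item-total
    correlation (values below roughly .35 discriminate poorly).
    """
    responses = np.asarray(responses, dtype=float)
    difficulty = responses.mean(axis=0)
    totals = responses.sum(axis=1)
    # Correlate each item with the total score excluding that item:
    discrimination = np.array([
        np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
        for j in range(responses.shape[1])
    ])
    return difficulty, discrimination

# Five hypothetical test takers answering three items:
data = [[1, 1, 0],
        [1, 0, 0],
        [0, 1, 1],
        [1, 1, 1],
        [0, 0, 0]]
difficulty, discrimination = item_analysis(data)
print(difficulty)  # proportion correct per item
```

Under this scheme, the .56 and .63 proportions of difficult items reported above would show up as many difficulty values well below .50.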
14. Consistency (cont.)
In regard to inter-rater and intra-rater reliability, inconsistencies between and within human
raters are not a source of measurement error because the scoring of the OQPT is done through
an answer key.
In the case of the oral examination, Cronbach's alpha was computed. The alpha was .93 for
inter-rater reliability and .96 for intra-rater reliability, indicating that, despite the lack of
consistent criteria and procedures for the oral examination, its scoring is reasonably
consistent between and within raters.
The analysis of the two versions of the OQPT brought a serious rebuttal to the consistency claim.
Cronbach's alpha was calculated, and the alpha was .000, indicating that the two versions are not
equivalent. Two experts also remarked that the second version is much more difficult and
cannot be considered equivalent to the first version.
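Cronbach's alpha, used above both for rater reliability and for checking equivalence, can be sketched like this, with hypothetical ratings rather than the study's data:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha; rows = test takers, columns = raters (or items/forms)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-column variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of row totals
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

# Two hypothetical raters scoring four oral-exam candidates:
ratings = [[3, 4],
           [4, 4],
           [5, 5],
           [2, 3]]
print(round(cronbach_alpha(ratings), 3))  # high alpha: raters rank candidates similarly
```

An alpha near zero, as reported for the two OQPT versions, means the columns do not covary, so the row totals carry almost no shared signal between the forms.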
15. Claim 3: The OQPT scores and the oral examination results can be interpreted as test takers' level
of English proficiency and place them in their appropriate level. Such interpretations are
meaningful, impartial, generalizable, relevant, and sufficient.
Claim 3: Interpretations
16. Meaningfulness
Interpretations about the construct to be assessed are meaningful if they are based on a frame of reference like a
course syllabus, a needs analysis, or a general theory of language ability. The head of the institute and the
examiners believed that the OQPT is a suitable placement test because Oxford University Press is the publisher of
both this test and the coursebook taught in the institute (i.e., English Result). The teaching method followed in the
institute is Communicative Language Teaching (CLT), and speaking is thus the primary focus; therefore, the oral
examination has a significant role in placement testing. However, the lack of a listening section in the
OQPT can be a rebuttal to the meaningfulness of the interpretations.
17. Impartiality
To support the impartiality warrant, the assessment items/tasks should be checked for response
formats or content that may either favor or disfavor some test takers, and test takers should be
treated impartially in terms of all aspects of test administration.
As mentioned earlier, interviews with test takers and examiners revealed that, due to the lack of
a specific rubric for the oral examination, interpretations about the ability being assessed
could not be claimed to be free of bias against any group of test takers.
No complaint was made in regard to the appropriateness of the content. Bias and item
sensitivity studies need to be done for deeper analysis.
18. Generalizability
According to the generalizability warrant, the characteristics of the assessment items/tasks (e.g., input,
expected response, type of interaction) as well as the scoring criteria and procedures of the test tasks
should correspond closely to those of the target language use (TLU) domain.
It might be that the items/tasks in the OQPT and the oral examination do not exactly correspond to all
of the TLU tasks; however, the content of the OQPT and the oral examination corresponds to the
content of the textbook taught in the institute. Moreover, in the oral examination, test takers' real world
language performance is examined. Thus, this can to some extent support the generalizability warrant.
Here, it is worth mentioning that some of the teachers, examiners, and experts asserted that a
TOEFL or an IELTS test would be a better, indeed an ideal, placement test because of its writing
and especially listening sections; because of time limitations, however, such tests could not be
used for placement.
Consequently, a TOEFL test was given to those who had taken the OQPT, and the results showed
that the correlation between the OQPT and TOEFL scores was not high (r = .66).
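As a reasoning check on that figure: r = .66 corresponds to r² ≈ .44, i.e., the two tests share under half of their score variance. A minimal sketch with hypothetical paired scores (not the study's data):

```python
import numpy as np

# Hypothetical paired scores for five test takers:
oqpt = np.array([1, 2, 3, 4, 5])
toefl = np.array([2, 1, 4, 3, 5])

r = np.corrcoef(oqpt, toefl)[0, 1]  # Pearson correlation
print(round(r, 2), round(r**2, 2))  # r and shared variance r^2
```

The squared correlation (coefficient of determination) is the usual way to read how much placement information one test carries about the other.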
19. Relevance
The fourth warrant is relevance, according to which the assessment-based interpretations should
provide information that is relevant and helpful for the decision makers to make decisions.
Based on the interviews conducted with the examiners, it was revealed that the OQPT scores were
not sufficiently helpful in placement testing. This could be a rebuttal to the relevance
warrant; yet the oral examination supports this warrant, since it is quite helpful for the
examiners in making their placement decisions. As noted before, the lack of a rubric remains a
serious rebuttal.
20. Sufficiency
The fifth warrant demands that the assessment-based interpretations should provide
information that is sufficient for the decision makers to make decisions.
Again, because the process of placement testing includes both a written and an oral
test, sufficient information is obtained to make placement decisions.
21. Claim 2: The placement decisions that are made on the basis of the OQPT scores and the oral
examination results are sensitive to local values and equitable to all stakeholders.
Claim 2: Decisions
22. Values Sensitivity
According to the values sensitivity warrant, the existing community values and relevant legal
requirements should be carefully and critically considered in the admission decisions that are to be
made and in determining the relative seriousness of false positive and false negative classification
errors.
The interviews and observations revealed that the process of placement testing does not
guarantee test fairness; at this phase, attracting more clients is what matters most. Hence, it is
possible to have potential false positives (i.e., individuals placed in a level higher than their
actual level) and false negatives (i.e., individuals placed in a level lower than their actual
level). Usually the latter happens, because it is less risky; nevertheless, if at the time of
placement testing the institute does not have the level appropriate for the test taker (because
of limitations such as space, time, or a lack of students, not all levels are always offered), the
test taker will be placed in a higher level.
23. Equitability
Due to the subjectivity of the oral examination, it cannot be claimed that the same cut scores
and decision rules are used to classify all students who have applied for the same program and
that no other considerations come into play; economic and practical considerations are always present.
Consequently, test takers and other stakeholders are not fully informed about how decisions will
be made and whether decisions are actually made in the way described to them.
24. Claim 1: The consequences of the placement decisions based on the OQPT scores and the oral
examination results are beneficial to all stakeholders that use the test, including the test takers, the
institution, the teachers, and the supervisor.
Claim 1: Consequences
25. Beneficence
The first warrant is that the consequences of using the assessment that are specific to each
stakeholder will be beneficial.
Some of the test takers were interviewed after their placement test and after attending their
classes. On the whole, they were satisfied, although some believed that their level was higher
and that they had been placed in a lower level; in their view, the reason was essentially the
institute's cost-effectiveness. Most of the teachers believed that their students were
homogeneous in class; yet, two teachers strongly disagreed, believing that their classes were
not at all homogeneous, especially at higher levels.
26. Conclusion
Based on the evidence gathered, this study found that the assessment records of the OQPT and the oral
examination were consistent across different assessment tasks, different aspects of the assessment
procedure, and across different groups of test takers. However, the oral examination required a set of
clear criteria.
Moreover, the analysis of the two versions of the OQPT showed that their parallelism was
questionable, which could threaten the consistency of the assessment records.
The findings also indicated that the OQPT scores and the oral examination results could be interpreted
somewhat as test takers' level of English proficiency and could place them in their appropriate levels.
Such interpretations were meaningful, impartial, relevant, and sufficient, although lack of a listening
section in the OQPT and lack of a rubric for the oral examination could be threatening, and
generalizability of the results was to some extent under question.
In addition, the placement decisions were not sensitive to local values and equitable to all stakeholders
due to the subjectivity of the oral examination and the economic considerations of the institute.
Lastly, by and large, the consequences of the placement decisions were beneficial to all stakeholders
who use the test, including the test takers, the institution, the teachers, and the supervisor.
27. Local Implications
To support the intended test use, it would be helpful to examine the negative evidence that has
been identified in this study and resolve the identified issues or mitigate the potential negative
impact of unresolved issues. For instance, in the case of the current placement test, the oral
examination could be scored with a rubric, a listening section could be added to the written test,
and economic considerations could be given less weight by the institute; the intended uses of the
test would then become much more justifiable with stronger evidence.
28. The Merits/Demerits of Using an AUA
Finally, this study serves as an illustration of the merits/demerits of using an AUA.
On the whole, the AUA provides a sound framework in which the validity of a test and its use
can be justified and the test developers/users can be accountable for their test.
With the help of the AUA framework, the process of assessment justification becomes more
comprehensive, systematic, and coherent. In fact, one of the merits of an AUA is its clear
articulation of which types of evidence should be collected for which claims or warrants.
However, in the process of assessment justification, an AUA demands that the evaluation of the
test be done at many levels and this needs different types of data and analyses. Thus, in practice,
the complexity of the justification study may be a big challenge for a single researcher.
According to Bachman (2004), items within a fairly narrow range of item difficulty, around .50, are desirable.
Oller (1979) asserts that, for item discrimination, correlations of less than .35 are not useful
for discriminating between participants.