Some reflections on task-based language performance assessment
L. F. Bachman (2002)
by Parisa Mehran
Introduction
The complexities of task-based language performance assessment (TBLPA) are leading us to reconsider many of the fundamental issues about:
1. What we want to assess
2. How we go about it
3. What sorts of arguments and evidence we need to provide to justify the inferences and decisions we make on the basis of our assessments
The term TBLPA:
‘Performance’ goes back to the ‘direct testing’ movement of the 1970s
‘Task-based’:
• Has a relatively more recent lineage
• Derives from research in SLA and language pedagogy
Tasks and constructs in language assessment
• The distinction between task-centered and ability- or construct-centered approaches to language assessment can be found in both:
1. Educational measurement
2. Language testing
• An underlying premise of the discussions of task-based language assessment is that the inferences we want to make are about underlying ‘language ability’, ‘capacity for language use’ or ‘ability for use’
A different approach to defining TBLPA, according to Norris et al. and Brown et al.:
• TBLPA as one kind of performance assessment
• Task-based assessment does not simply utilize real-world tasks as a means for eliciting particular components of the language system which are then measured or evaluated; on the contrary, the construct of interest in task-based assessment is performance on the task itself
• Inferences to be made are about ‘students’ abilities to accomplish particular tasks or task types’
• The difference in this approach lies not in the kinds of assessment tasks that are used (e.g. employing ‘authentic’ assessment tasks), but rather in the kinds of inferences claimed to be made on the basis of test-takers’ performance on assessment tasks
• The construct is defined in terms of ‘pragmatic ascription’, or what test-takers can do, and in so doing the interpretation is limited to predictions about future performance on real-world tasks
• Brown et al. make a distinction between TBLPA and other types of performance assessment by considering the way one interprets consistencies in responses across a set of assessment tasks
Two approaches to interpreting consistencies in responses across a set of assessment tasks:
• ‘Behaviorist perspective’ on construct definition, taken by Brown et al. and Chapelle (1998): response consistencies are interpreted as ‘samples of response classes’; consistencies are attributed to contextual factors
• ‘Trait perspective’, taken by other proponents of performance assessment: response consistencies are interpreted as evidence of underlying processes or structures; consistencies are attributed to characteristics of the test-taker
Difference between ‘ability-based’ and ‘task-based’ approaches
‘Ability-based’ approach:
• First, focusing on the construct of interest
• Then, developing tasks based on the performance attributes of the construct, score uses, scoring constraints, …
• Both constructs and tasks are considered
‘Task-based’ approach:
• First, deciding which performances are the desired ones
• Then, score uses, scoring criteria, … become part of the performance test itself
• Only performances on tasks are considered
‘What is this thing called task?’: Content domain specification
• Definitions of ‘task’ range from including virtually anything that is done, to distinguishing between ‘real-world tasks’ and ‘pedagogic tasks’, to Skehan’s (1998) extended definition
Norris et al. (1998)
Define task as ‘those activities that people do in everyday life and which require language for their accomplishment’
A task is essentially a real-world activity
They do not distinguish between these and assessment tasks
Bachman and Palmer (1996)
Define a ‘language use task’ as ‘an activity that involves individuals in using language for the purpose of achieving a particular goal or objective in a particular situation’
This definition focuses on tasks that involve language and adds to this the notions that:
1. Tasks are goal oriented
2. Tasks are situated in specific settings
Two critical issues that must be addressed in design, development and
use of any language assessment
‘Which tasks do we use?’: Identifying and
selecting assessment tasks
Specification of assessment tasks is a
critical issue because
1. The particular tasks included in the
assessment will provide the basis for
one part of a validity argument: content
relevance and representativeness
2. The degree of correspondence between
the test tasks and tasks outside the test
itself provides a basis for investigating
the authenticity of the test tasks
Content relevance and content representativeness
Content relevance: the extent to which the areas of ability to be assessed are in fact assessed by the task
Content representativeness: the extent to which the test adequately samples the content domain of interest and provides a basis for investigating task generalizability and extrapolation
The problem of investigating and demonstrating content relevance and representativeness is two-fold:
1. We must identify the TLU (target language use) domain, defined as ‘a set of specific language use tasks that the test-taker is likely to encounter outside the test itself, and to which we want our inferences about language ability to generalize’
2. We then need to select tasks from that domain which will form the basis for language assessment tasks. Even when a well-defined TLU domain can be identified, selecting specific tasks from within that domain may be problematic.
Bachman and Palmer (1996) suggest three reasons why real-life tasks may not always be appropriate as a basis for developing assessment tasks:
1. Not all TLU tasks will engage the areas of ability we want to assess
2. Some TLU tasks may not be practical to administer in an assessment in their entirety
3. Some TLU tasks may not be appropriate or fair for all test-takers if they presuppose prior knowledge or experience that some test-takers may not possess
There are serious problems with the claim that TBLPA’s
distinctive characteristic is that it enables us to make
predictions about future performance. These problems are
related to:
1. Task selection
2. Generalizability
3. Extrapolation
We cannot demonstrate that performance on one assessment
task generalizes to other assessment tasks, or that it
extrapolates to performance on tasks in the TLU domain
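The slides do not formalize what ‘generalizes’ means; one standard framing, not given in the original, is generalizability theory. For a person-by-task design, the observed-score variance is decomposed into components and a generalizability coefficient for a test with n_t tasks can be computed:

\sigma^2(X_{pt}) = \sigma^2_p + \sigma^2_t + \sigma^2_{pt,e}, \qquad E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt,e}/n_t}

Here \sigma^2_p is variance attributable to test-takers, \sigma^2_t to tasks, and \sigma^2_{pt,e} to the person-by-task interaction (confounded with error). A large interaction component relative to \sigma^2_p is exactly the situation described above: performance on one task tells us little about performance on other tasks.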
‘How hard is it?’: The difficulty with difficulty
The notion that test items or tasks themselves differ in difficulty is
ingrained in both:
1. The way we conceptualize the difficulty of assessment tasks
2. How we operationalize difficulty in most recent measurement
models
Difficulty does not reside in the task alone, but is relative to any
given test-taker
Two general approaches to understanding, explaining or predicting how difficult a given task will be:
1. First approach: to identify a number of task characteristics that are considered to be essentially independent of ability, and then investigate the relationships between these characteristics and empirical indicators of difficulty (a schematic illustration follows this list)
2. Second approach: to explicitly identify ‘difficulty features’, which are essentially combinations of ability requirements and task characteristics that are hypothesized to affect the difficulty of a given task
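As a hedged illustration of the first approach (not given in the slides), the investigation is often operationalized as a regression of empirical difficulty estimates on coded task characteristics; the difficulty index \hat{b}_i and the characteristics x_{i1}, …, x_{ik} below are purely illustrative:

\hat{b}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i

where \hat{b}_i is an empirical indicator of the difficulty of task i (e.g. a mean score or an IRT difficulty estimate) and x_{i1}, …, x_{ik} are coded characteristics of the task (features of the input, the expected response, and so on). Note that \hat{b}_i is itself derived from test-takers’ performance, a point taken up in the later critique of ‘difficulty’.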
The relationships among task characteristics and task difficulty have been researched for over a decade; however, the results have brought us no closer to an understanding of this relationship. Possible explanations offered by researchers:
1. Methodological limitations in the studies
2. Differences between testing and pedagogic contexts, the former producing a cognitive focus on display rather than on task fulfillment or getting the message across
3. Assessment tasks may be fundamentally different from pedagogic or ‘real-life’ tasks
The third explanation raises questions about:
• The validity of assessing certain aspects of language ability with certain types of task
• The generalizability of research with SLA and pedagogic tasks to assessment tasks
Problems with ‘difficulty factors’
Sources of variation or factors that may affect test performance
Bachman (1990)
Factors:
1. Language ability of test-taker
2. Test-task characteristics
3. Personal characteristics of the test-taker
4. Random/unpredictable factors
These factors may well be
correlated with each other except for
random factors
There is no factor identified as
‘difficulty’
Skehan (1996) and the Hawaii group
Three task difficulty features
affecting performance on tasks:
1. Code complexity: language
required to accomplish the task
2. Cognitive complexity: thinking
required to accomplish the task
3. Communicative stress:
performance conditions for
accomplishing a task
[Figure: ‘Code complexity’, ‘Cognitive complexity’, ‘Communicative stress’ and ‘Other factors (?)’ feed into ‘Task difficulty’, which in turn determines ‘Test performance’]
Two problems with this formulation:
1. The difficulty features confound
the effects of the test-taker’s ability
with the effects of the test tasks
2. This approach introduces a
hypothetical entity, ‘task difficulty’,
as a primary determinant of test
performance and as a separate
factor
• Problems with the way difficulty is operationalized in current measurement models
Some indicators of difficulty are averages of performance across facets of measurement and do not consider the differential performance of different individuals
In measurement models, ‘difficulty’ is operationalized either as an average of scores on a given task or facet of measurement across a group of test-takers, or as an interaction between the latent trait and performance on a given task (a minimal formal sketch follows this slide)
‘Difficulty’ is essentially an artifact of test performance and not a characteristic of assessment tasks themselves: empirical estimates of task difficulty are not estimates of a separate entity, ‘difficulty’, but are themselves artifacts of the interaction between the test-taker’s ability and the characteristics of the task
• Problem with trying to predict empirical difficulty from task characteristics
The approach of using task characteristics to predict empirical estimates of item difficulty is problematic because these item statistics are themselves a function of interactions between test-takers and test tasks
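To make the two operationalizations above concrete (a minimal sketch, not from the original slides): in classical test theory the ‘difficulty’ of task i is a group average, for dichotomous items the proportion of N test-takers responding correctly, while in an item response (e.g. Rasch) model the difficulty parameter b_i appears only in interaction with the test-taker’s ability \theta_p:

p_i = \frac{1}{N}\sum_{p=1}^{N} X_{pi}, \qquad P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}

In both cases the difficulty value is estimated from observed responses, which is why it is better read as an artifact of the ability-task interaction than as a property of the task itself.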
Conclusions
A fundamental aim of most language performance assessment is to present test-takers with tasks:
• That correspond to tasks in ‘real-world’ settings
• That will engage test-takers in language use or the creation of discourse
A solely task-based approach is problematic. The most useful assessments in all situations will be those that are based on the planned integration of both tasks and constructs in the way they are designed, developed and used.
Task specification will also present challenges to such an integrated approach; however, an integrated approach:
• Makes it possible for test-users to make a variety of inferences about the capacity for language use that test-takers have, or about what they can and cannot do
• Makes available to test-developers the full range of validation arguments that can be developed in support of a given inference or use