Ministry of Education
DRAFT Technical Report
of the
Pre- and Post-Pilot Testing for the
Continuous Assessment Programme
in Lusaka, Southern and Western Provinces
Coordinated by the
Examinations Council of Zambia
Research and Test Development Department
Under the Direction of the
Continuous Assessment Steering and Technical Committees
Ministry of Education
Lusaka, Zambia
October 2007
Table of Contents

ACKNOWLEDGMENTS
CHAPTER ONE: BACKGROUND
1.1 Introduction to Continuous Assessment
1.2 Definition of Continuous Assessment
1.3 Challenges in the Implementation of Continuous Assessment
1.4 Guidelines for Implementation of Continuous Assessment
1.5 Plan for Implementation of Continuous Assessment
CHAPTER TWO: EVALUATION METHODOLOGY
2.1 Objectives
2.2 Design
2.3 Sample
2.4 Instruments
2.5 Administration
2.6 Data Capture and Scoring
2.7 Data Analysis
CHAPTER THREE: ASSESSMENT RESULTS
3.1 Psychometric Characteristics
3.2 Classical Test Theory
3.3 Item Response Theory
3.4 Scaled Scores
3.5 Vertical Scaled Scores
3.6 Comparison between Pilot and Comparison Groups
3.7 Comparison across Regions
3.8 Performance Categories
CHAPTER FOUR: SUMMARY AND CONCLUSIONS
APPENDIX 1: ITEM STATISTICS BY SUBJECT
APPENDIX 2: SCORES AND FREQUENCIES - GRADE 5 PRE-TESTS
APPENDIX 3: SCORES AND FREQUENCIES - GRADE 5 POST-TESTS
APPENDIX 4: HISTOGRAMS BY SUBJECT AND GROUP
ACKNOWLEDGMENTS
The Continuous Assessment Joint Steering and Technical Committees and the
Examinations Council of Zambia wish to express profound gratitude for the
professional and material support provided by the Provincial Education Offices,
District Education Boards, Educational Zone staff in the different districts, school
administrators, teachers and pupils. Without this support, the baseline and post-pilot
assessment exercises would not have succeeded.
Further appreciation goes to the management in the Directorate for Curriculum and
Assessment in the Ministry of Education for providing professional support towards
the Continuous Assessment programme in general and the assessment exercises in
particular. We wish to specifically thank the Director for Standards and Curriculum,
the Director for the Examinations Council of Zambia, and the Chief Curriculum
Specialist for allowing their personnel to take part in the assessment exercise.
Finally, we wish to express our appreciation to USAID and the EQUIP2 Project
for providing financial and technical support for the Continuous Assessment
programme in Zambia.
All of the participants and stakeholders listed above have played a crucial role
not only in developing and implementing the Continuous Assessment programme,
but also in supporting the quantitative evaluation of the programme presented
in this technical paper. It is because of their interest in improving student learning
outcomes that the Continuous Assessment programme has had the necessary
financial, administrative and technical support. Our hope is that the programme will
prove to be valuable for all of the pupils and teachers in Zambian schools.
Chapter One: Background
1.1 Introduction to Continuous Assessment
Over the years in Zambia, the education system has not been able to provide
enough spaces for all learners to proceed from Grade 7 to Grade 8, from
Grade 9 to Grade 10, and from Grade 12 to higher learning institutions. The
system has used examinations for selection of those to proceed to the next
level and for the certification of candidates; however, this has been done
without formal consideration of the school-based assessment as a component
in the final examinations, with the exception of some practical subjects.
The 1977 Educational Reforms explicitly provided for the use of Continuous
Assessment (CA). Later, national policy documents, particularly Educating
Our Future (1996) and Ministry of Education’s Strategic Plan 2003-2007,
stated the need for integrating school-based continuous assessment into the
education system, including the development of strategies to combine CA
results with the final examination results for purposes of pupil certification and
selection.
Furthermore, the national education policy, as stated in Educating Our Future,
stipulated that the Ministry of Education will develop procedures that will
enable teachers to standardise their assessment methods and tasks for use
as an integral part of school-based CA. The education policy document also
stated that the Directorate of Standards, in cooperation with the Examinations
Council of Zambia (ECZ), will determine how school-based CA can be better
conducted so that it can contribute to the final examination results for pupil
certification and promotion to the subsequent levels. The policy also stated
that the Directorate of Standards, with input from the ECZ, will determine
when school-based CA can be introduced.
In order to set in motion the implementation of school-based CA, the ECZ
convened a preparatory workshop from 16th to 22nd November 2003 in
Kafue. Ninety (90) participants from various stakeholders’ institutions took
part. The objectives of the preparatory workshop were to:
• Recommend a plan for developing and implementing CA;
• Recommend a training plan for preparing teachers in implementing CA;
• Explore ways of ensuring transparency, reliability, validity and
comparability in using CA results;
• Agree on common assessment tasks and learning outcomes to be
identified in the syllabuses for CA;
• Discuss the development of a teacher’s manual on CA; and
• Discuss the nature of summary forms for recording marks that should be
provided to schools.
1.2 Definition of Continuous Assessment
Continuous assessment is defined as an on-going, diagnostic, classroom-
based process that uses a variety of assessment tools to measure learner
performance. CA is a formative evaluation tool conducted during the teaching
and learning process with the aim of influencing and informing the overall
instructional process. It is the assessment of the whole learner on an ongoing
basis over a period of time, where cumulative judgments of the learner’s
abilities in specific areas are made in order to facilitate further positive
learning (Le Grange & Reddy, 1998).[1]
The data generated from CA should be useful in assisting teachers to plan for
the learning by individual pupils. It also should assist teachers in identifying
the unique understanding of each learner in a classroom by informing the
pupil of the level of instructional attainment, helping to target opportunities that
promote learning, and reducing anxiety and other problems associated with
examinations. CA has been shown to have positive impacts on student learning
outcomes in hundreds of educational settings (Black & Wiliam, 1998).[2]
CA is made up of a variety of assessment methods that can be formal or
informal. It takes place during the learning process when it is most necessary,
making use of criterion referencing rather than norm referencing and providing
feedback on how learners are changing.
1.3 Challenges in the Implementation of Continuous Assessment
There are several areas in which the implementation of CA in the classroom
will present challenges. Some of these are listed below.
• Large class sizes in most primary schools are a major problem. It is
common to find classes of 60 and above in Zambian classrooms.
Teachers are expected to mark and keep records of the progress of all of
these learners.
• CA can demand a great deal of teachers’ time. As a result, teachers are
concerned that time spent on remediation and enrichment is excessive, and
many do not believe that they would finish the syllabus with CA.
• CA will not be successfully implemented if there are inadequate teaching
resources / equipment in schools. Teachers need materials and equipment
such as stationery, computers and photocopiers (and electricity).
• There may be cases of resistance from school administrators and teachers
if they feel left out in the process of developing the CA programme.
• CA requires the cooperation of communities and parents. If they do not
understand what is expected of them, they may resist and hence affect the
success of the programme.
[1] Le Grange, L.L. & Reddy, C. (1998). Continuous Assessment: An Introduction and Guidelines to
Implementation. Cape Town, South Africa: Juta.
[2] Black, P. & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1),
7-74.
1.4 Guidelines for Implementation of Continuous Assessment
A teachers’ guide on the implementation of continuous assessment at the
basic school level was developed with the involvement of curriculum
specialists, Standards officers, Examinations specialists, Provincial Education
Officials, District Education Officials, Zonal in-Service training providers,
school administrators and teachers.
The Teachers’ Guide on CA comprises the following:
• Sample record forms;
• Description of the CA schemes;
• Instructions for preparing and administering assessment materials;
• Marking and moderation of the CA marks;
• Recording and reporting assessment results; and
• Monitoring of the implementation of the CA.
The Teachers’ Guide also specifies the roles of stakeholders as follows:
Teachers
• Plan assessment tasks, projects and mark schedules;
• Teach, guide and supervise pupils in implementing given tasks;
• Conduct the assessment in line with given guidelines;
• Mark and record the results;
• Provide correction and remedial work to the pupils;
• Inform the head teacher and parents on the performance of the child;
• Advise and counsel the pupils on their performance in class tasks;
• Take part in internal moderation of pupils’ results.
School Administrators
• Provide an enabling environment, such as the procurement of teaching
and learning materials;
• Act as links between the school and other stakeholders like ECZ,
traditional leaders, politicians and parents;
• Ensure validity, reliability and comparability through moderation of CA;
• Compile CA results and hand them to ECZ.
Parents
• Provide professional, moral, financial and material support to pupils;
• Continuously monitor their children’s attendance and performance;
• Take part in making and enforcing school rules;
• Attend open days and witness the giving of prizes (rewards) to pupils with
outstanding performance.
Standards Officers
• Interpret Government of Zambia policy on education;
• Monitor education policy implementation at various levels of the education
system;
• Advise and evaluate the extent to which the education objectives have
been achieved;
• Ensure that acceptable assessment practices are conducted;
• Monitor the overall standards of education.
Guidance Teachers/School Counsellors
• Prepare and store record cards for CA;
• Counsel pupils, teachers and parents/ guardians on CA and feedback;
• Take care of the pupils’ psycho-social needs;
• Make referrals for pupils to access other specialized assistance/support.
Heads of Department/Senior Teachers/Section Heads
• Monitor and advise teachers in the planning, setting, conducting, marking
and recording of CA results;
• Ensure validity, reliability and dependability of CA by conducting internal
moderation of results;
• Hold departmental meetings to analyze the assessment;
• Provide or make available the teaching and learning materials;
• Compile a final record of CA results and hand them over to Guidance
Teachers for onward submission to the ECZ.
District Resource Centre Coordinators
• Ensure adequate in-service training for teachers in planning, conducting,
marking, moderating and recording results at school level in the district;
• Monitor the conduct of CA in the schools and district;
• Professionally guide teachers to ensure provision of quality education at
school level.
Provincial Resource Centre Coordinators
• Ensure adequate in-service training for teachers for them to be effective in
planning, conducting, marking, moderating and recording CA results;
• Monitor the conduct of CA in the province;
• Professionally guide teachers to ensure provision of quality education at
provincial level.
Examinations Specialist
• Analyse and moderate CA results;
• Integrate CA results with terminal examination results;
• Determine grade boundaries;
• Certify the candidates;
• Disseminate the results of candidates.
Monitors
As monitors of the CA programme, various officials and stakeholders will look
out for the following documents and information:
• Progress chart;
• Record of CA results and analysis;
• Marked evidence of pupils’ CA work on remedial activities;
• Analysis of performance by gender;
• Pupil’s Record Cards;
• CA plans or schedules and schemes;
• Evidence of pupils’ work;
• CA administration;
• Evidence of remedial work;
• Availability of planned remedial work in the classroom;
• Availability of the teacher’s guide;
• Sample CA tasks;
• Evidence of a variety of CA tasks;
• Teacher’s record of pupils’ performance.
1.5 Plan for Implementation of Continuous Assessment
CA in Zambia is planned to roll out over a period of several years. This will
allow for proper stakeholder support and evaluation. The following list
provides the brief timeline of important CA activities through 2008:
• Creation of CA Steering and Technical Committees (2005);
• Development of assessment schemes, teacher’s guides, model
assessment tasks booklets and recordkeeping forms (2005);
• Design of quantitative evaluation methodology with focus on student
learning outcomes (2005);
• Implementation of CA pilot in Phase 1 schools: Lusaka, Southern and
Western regions (2006);
• Baseline report on student learning outcomes (2006);
• Implementation of CA pilot in Phase 2 schools: Central, Copperbelt and
Eastern Regions (2007);
• Expansion of modified CA pilot to community schools (2007);
• Post-test report on student learning outcomes (2007);
• Implementation of CA pilot in Phase 3 schools: Luapula, Northern and
Northwestern Regions (2008);
• Discussion of scaling up of CA pilot and systems-level planning for
combining Grade 7 end-of-cycle summative test scores with CA scores for
selection and certification purposes (2008).
Chapter Two: Evaluation Methodology
2.1 Objectives
The main objective of the quantitative evaluation is to determine whether the
CA programme has had positive effects on student learning outcomes. The
evaluation allows for a determination of whether pupils’ academic
performance has changed as a result of the CA intervention, as well as the
extent of the change in performance.
2.2 Design
The evaluation design is quasi-experimental, with a pre-test and post-tests
administered to intervention (pilot) and control (comparison) groups. It
features a pre-test at the beginning of Grade 5 and post-tests at the end of
Grades 5, 6, and 7. The pilot and comparison groups will be compared at
each time point in 6 subject areas to see if there are differences in test scores
from the baseline to the post-tests by group (see Figures 1 and 2 below).[3]
Figure 1: Pre-Test and Post-Test, Pilot and Control Group Design

                    Grade 5         Grade 5         Grade 6         Grade 7
                    Pre-test        Post-test       Post-test       Post-test
                    Pilot Group     Pilot Group     Pilot Group     Pilot Group
                    Control Group   Control Group   Control Group   Control Group
Figure 2: Expected Results from the Evaluation
(Line graph of expected mean scaled scores at the G5 pre-test and the G5, G6
and G7 post-tests, with the pilot group rising above the control group after
the baseline.)
[3] For more information, refer to the Summary of the Continuous Assessment Program August 2007 by
the Examinations Council of Zambia and the EQUIP2-Zambia project.
With the matched pairs random assignment design, it was expected that the
two groups, pilot and control, would have similar mean scores on the pre-test.
However, with a successful intervention, it was expected that the pilot group
would score higher than the control group on the subsequent post-tests.
2.3 Sample
The sample included all the 2006 (pre-test) and 2007 (post-test) Grade 5
basic school pupils in Lusaka, Southern and Western Provinces in the 24 pilot
(intervention) and 24 comparison (control) schools. The schools were chosen
using matched pairs by geographic location, school size, and grade levels as
matching variables, followed by random assignment to pilot and comparison
status. CA activities were implemented in pilot schools but not in the
comparison schools.
2.4 Instruments
Student achievement for the Grade 5 baseline and post-pilot administrations
was measured using multiple choice tests with 30 items (30 points per test).
The test development process included the following steps:
• Review of the curriculums for each subject area;
• Development of test specifications;
• Development of items;
• Piloting of items;
• Data reviews of item statistics;
• Forms pulling (selecting items for final test papers).
The test instruments were developed by teams of Curriculum Specialists,
Standards Officers, Examination Specialists and Teachers. The baseline tests
(pre-tests) were developed based on the Grade 4 syllabus and the post-pilot
tests (post-tests) were developed based on the Grade 5 syllabus.
2.5 Administration
The ECZ organized the administration of both pre-test and post-test papers.
Teams comprising an Examination Specialist, a Standards Officer and a
Curriculum Specialist were sent to each region to supervise the
administration. District Education officials, School Administrators and
Teachers were involved in the actual administration of the tests. All of the
Grade 5 pupils in the pilot and comparison schools sat for six tests, one in
each of the six subject areas (English, Mathematics, Social and Developmental
Studies, Integrated Science, Creative and Technology Studies and
Community Studies). The baseline tests (Grade 4 syllabus) were administered
to the students at the beginning of Grade 5, in February 2006. The post-pilot
tests (Grade 5 syllabus) were administered in February 2007.
Note that there will be two more administrations of post-tests for the cohort of
students in the three provinces. These will take place in February 2008
(Grade 6 syllabus) and November 2008 (Grade 7 syllabus). This process will
be repeated in Phases 2 and 3 schools (see Table 1 below).
Table 1: Implementation Plan for CA Pilot

Phase                                       2006     2007     2008     2009     2010
Phase 1 (Lusaka, Southern, Western)         Grade 5  Grade 6  Grade 7
Phase 2 (Central, Copperbelt, Eastern)               Grade 5  Grade 6  Grade 7
Phase 3 (Luapula, Northern, Northwestern)                     Grade 5  Grade 6  Grade 7
2.6 Data Capture and Scoring
Data were captured using Optical Mark Readers (OMR) and scored using
the Faim software at the ECZ. Through this process, item scores for all
students were converted into electronic format and data files were produced
for analysis.
2.7 Data Analysis
Data were analysed using the Statistical Package for the Social Sciences
(SPSS). Scores and frequencies by subject were generated. Analysed data
were presented in tabular, chart and graphical forms. Additional analyses
were conducted using WINSTEPS (item response theory Rasch modelling)
software. SPSS was used for scaling the pupils’ scores.
Chapter Three: Assessment Results
3.1 Psychometric Characteristics
An initial step in determining the results from the assessments was to conduct
analyses to determine the psychometric characteristics of the assessments.
Both the Standards for Educational and Psychological Testing (1999)[4] and
the Code of Fair Testing Practices in Education (2004)[5] include standards for
identifying quality items. Items should assess only knowledge or skills that are
identified as part of the domain being tested and should avoid assessing
irrelevant factors (e.g., ambiguity, grammatical errors, sensitive content or
language).
Both quantitative and qualitative analyses were conducted to ensure that
items on both Grade 5 baseline and post-pilot tests met satisfactory
psychometric guidelines. The statistical evaluations of the items are presented
in two parts, using classical test theory (CTT) and item response theory (IRT),
which is sometimes called modern test theory.[6] The two measurement
models generally provide similar results, but IRT is particularly useful for test
scaling and equating. CTT analyses included (1) the difficulty index (p-value),
(2) the discrimination index (item-test correlations), and (3) test reliability
(Cronbach's Alpha as an estimate of internal consistency reliability). IRT
analyses included (1) calibration of items, and (2) examination of the item
difficulty index (i.e., b-parameter).
3.2 Classical Test Theory
Difficulty Indices (p)
All multiple-choice items were evaluated in terms of item difficulty according to
standard classical test theory practices. Difficulty was defined as the average
proportion of points achieved on an item by the students. It was calculated by
obtaining the average score on an item and dividing by the maximum possible
score for the item. Multiple-choice items were scored dichotomously (1 point
vs. no points, or correct vs. incorrect), so the difficulty index was simply the
proportion of students who correctly answered the item. All items on Grade 5
pre-tests and post-tests had four response options. Table 2 shows the
average p-values for each test. Note that this may also be calculated by
taking the average raw score of all students divided by the maximum points
(30) per test.
[4] American Educational Research Association, American Psychological Association, and National
Council on Measurement in Education (1999). Standards for Educational and Psychological Testing.
Washington, DC: American Educational Research Association.
[5] Joint Committee on Testing Practices (2004). Code of Fair Testing Practices in Education.
Washington, DC: American Psychological Association.
[6] For more information, see Crocker, L. and Algina, J. (1986). Introduction to Classical and Modern
Test Theory. New York: Harcourt Brace.
Table 2: Overall Test Difficulty Estimates by Subject Area

                                      Grade 5 Pre-test        Grade 5 Post-test
Subject Area                         # Items  Mean p-value   # Items  Mean p-value
English                                 30       0.40           30       0.37
Social and Developmental Studies        30       0.34           30       0.42
Mathematics                             30       0.41           30       0.40
Integrated Science                      30       0.33           30       0.36
Creative and Technology Studies         30       0.35           30       0.36
Community Studies                       30       0.32           30       0.37
Items that are answered correctly by almost all students provide little
information about differences in student ability, but they do indicate
knowledge or skills that have been mastered by most students. Similarly,
items that are correctly answered by very few students may indicate
knowledge or skills that have not yet been mastered by most students, but
such items provide little information about differences in student ability. In
general, to provide best measurement, difficulty indices should range from
near-chance performance of about 0.20 (for four-option, multiple-choice
items) to 0.90. In general, the item difficulty indices for both Grade 5 pre-tests
and post-tests were within generally acceptable and expected ranges (see
Appendix 1 for a complete list of p-values for all items on each test).
Item Discrimination (Item-Test or Point-Biserial Correlations)
One desirable feature of an item is that the higher performing students do
better on the item than lower performing students. The correlation between
student performance on a single item and total test score is a commonly used
measure of this characteristic of an item. Within classical test theory, the item-
test (or point-biserial) correlation is referred to as the item’s discrimination
because it indicates the extent to which successful performance on an item
discriminates between high and low scores on the test. The theoretical range
of these statistics is –1 to +1, with a typical range from 0.2 to 0.6.
Discrimination indices can be thought of as measures of how closely an item
assesses the same knowledge and skills assessed by other items contributing
to the total score. Discrimination indices for Grade 5 are presented in Table 3.
Table 3: Overall Test Discrimination Estimates by Subject Area

                                      Grade 5 Pre-test        Grade 5 Post-test
Subject Area                         # Items  Mean Pt-bis    # Items  Mean Pt-bis
English                                 30       0.46           30       0.48
Social and Developmental Studies        30       0.38           30       0.45
Mathematics                             30       0.37           30       0.41
Integrated Science                      30       0.35           30       0.43
Creative and Technology Studies         30       0.38           30       0.44
Community Studies                       30       0.29           30       0.43
On average, the discrimination indices were within acceptable and expected
ranges (i.e., 0.20 to 0.60). The positive discrimination indices indicate that
students who performed well on individual items tended to perform well
overall on the test. There were no items on the instruments that had near-zero
discrimination indices (see Appendix 1 for a complete list of the point-biserial
correlations for all items on each pre-test and post-test per subject area).
Test Reliabilities
Although an individual item’s statistical properties are an important focus, a
complete evaluation of an assessment must also address the way items
function together and complement one another.
There are a number of ways to estimate an assessment’s reliability. One
possible approach is to give the same test to the same students at two
different points in time. If students receive the same scores on each test, then
the extraneous factors affecting performance are small and the test is reliable.
(This is referred to as test-retest reliability.) A potential problem with this
approach is that students may remember items from the first administration or
may have gained (or lost) knowledge or skills in the interim between the two
administrations.

A solution to the ‘remembering items’ problem is to give a different, but
parallel test at the second administration. If the student scores on each test
correlate highly, the test is considered reliable. (This is known as alternate
forms reliability, because an alternate form of the test is used in each
administration.) This approach, however, does not address the problem that
students may have gained (or lost) knowledge or skills in the interim between
the two administrations. In addition, the practical challenges of developing and
administering parallel forms generally preclude the use of parallel forms
reliability indices.

One way to address these problems is to split the test in half and then
correlate students’ scores on the two half-tests; this in effect treats each
half-test as a complete test. By doing this, the problems associated with an
intervening time interval, and of creating and administering two parallel forms
of the test, are alleviated. This is known as a split-half estimate of reliability. If
the two half-test scores correlate highly, items on the two half-tests must be
measuring very similar knowledge or skills. This is evidence that the items
complement one another and function well as a group. This also suggests
that measurement error will be minimal.
The split-half method requires a judgment regarding the selection of which
items contribute to which half-test score. This decision may have an impact on
the resulting correlation; different splits will give different estimates of
reliability. Cronbach (1951)[7] provided a statistic, α (alpha), that avoids this
concern about the split-half method. Cronbach’s α gives an estimate of the
average of all possible splits for a given test. Cronbach’s α is often referred to
as a measure of internal consistency because it provides a measure of how
well all the items in the test measure one single underlying ability. Cronbach’s
α is computed using the following formula:
[7] Cronbach, L. J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16,
297-334.
\[ \alpha = \frac{n}{n-1}\left[ 1 - \frac{\sum_{i=1}^{n} \sigma^2(Y_i)}{\sigma_x^2} \right] \]

where i is the item, n is the total number of items, \(\sigma^2(Y_i)\) is the
individual item variance, and \(\sigma_x^2\) is the total test variance.
For standardized tests, reliability estimates should be approximately 0.80 or
higher. According to Table 4, the reliabilities for the pre-tests ranged from
0.63 (Community Studies) to 0.87 (English). The reliability estimate for
Community Studies was low due to the absence of a national curriculum for
use in test construction. In contrast, the reliability estimates for the post-tests
ranged from 0.83 (Mathematics) to 0.89 (English). The post-tests likely had
higher reliability estimates because the test developers had gained experience
since developing the baseline tests.
Table 4: Test Reliability Estimates by Subject Area

                                      Grade 5 Pre-test           Grade 5 Post-test
Subject Area                         # Items  Coefficient Alpha # Items  Coefficient Alpha
English                                 30       0.87              30       0.89
Social and Developmental Studies        30       0.80              30       0.87
Mathematics                             30       0.79              30       0.83
Integrated Science                      30       0.76              30       0.85
Creative and Technology Studies         30       0.80              30       0.86
Community Studies                       30       0.63              30       0.85
3.3 Item Response Theory
Item Response Theory (IRT) uses mathematical models to define a
relationship between an unobserved measure of student ability, usually
referred to as theta (θ), and the probability (p) of getting a dichotomous item
correct. In IRT, it is assumed that all items are independent measures of the
same construct or ability (i.e., the same θ). The process of determining the
specific mathematical relationship between θ and p is referred to as item
calibration. Once items are calibrated, they are defined by a set of parameters
which specify a non-linear relationship between θ and p.[8]
[8] For more information about item calibration, see the following references: Lord, F.M. and Novick,
M.R. (1968). Statistical Theories of Mental Test Scores. Boston, MA: Addison-Wesley; Hambleton,
R.K. and Swaminathan, H. (1984). Item Response Theory: Principles and Applications. New York:
Springer.
For the CA programme, a 1-parameter or Rasch model was implemented.
The Rasch model defines the probability that a student with ability level θ
gives a correct response to item i as:

\[ P_i(\theta) = \frac{\exp[D(\theta - b_i)]}{1 + \exp[D(\theta - b_i)]} \]

where i is the item, b_i is the item difficulty, and D is a normalizing constant
equal to 1.701.
In IRT, item difficulty (b_i) and student ability (θ) are measured on a scale of
−∞ to +∞. A scale of −3.0 to +3.0 is used operationally in educational
assessment programmes, with −3.0 being low student ability or an easy item
and +3.0 being high student ability or a difficult item. The b_i parameter for an
item is the position on the ability scale where the probability of a correct
response is 0.50.
WINSTEPS was the software used for the IRT analyses. The
item parameter files resulting from the analyses are provided in Appendices 2
and 3; this presentation is direct output from WINSTEPS.[9] Raw scores were
then scaled using the item response theory model, with a range of 100-500
(see Appendices 2 and 3 for the raw score to scale score conversion tables
for each subject area).
3.4 Scaled Scores
The Grade 5 pre-test and post-test scores in each subject area are reported
on a scale that ranges from 100 to 500. Students’ raw scores or total number
of points, on the pre-tests and post-tests are translated to scaled scores using
a data analysis process called scaling. Scaling simply converts raw points
from one scale to another. In the same way that distance can be expressed in
miles or kilometres, or monetary value can be expressed in terms of U.S.
dollars or Zambian Kwacha, student scores on both pre and post-tests could
be expressed as raw scores (i.e., number of points) or scaled scores.
Cut points were established on the raw score scale for both the pre-tests and
post-tests (see Section 3.8, “Performance Categories”, for an explanation of how
these cut points were determined). Once the raw score cut points were
determined via standard setting, the next step was to compute theta cuts
using the test characteristic curve (TCC) mapping procedure and then
calculate the transformation coefficients that would be used to place students’
raw scores onto the theta scale then onto the scaled score used for reporting.
As previously stated, student scores on the assessments are reported in
integer values from 100 to 500 with two scores representing cut scores on
each assessment. Two cut points (Unsatisfactory/Satisfactory and
Satisfactory/Advanced) were pre-set at 250 and 350, respectively.
[9] See the WINSTEPS user’s manual for additional details regarding this output (at
http://www.winsteps.com).
Figure 3: Scaled Score Conversion Procedure
(Flow diagram: raw score cuts from standard setting → conversion of the raw
score cuts into theta cuts θ1 and θ2 using TCC mapping → calculation of the
scaling constants m and b from the theta cuts and the scaled score cuts (250
and 350) → calculation of the scaled score as m(θ) + b.)
The scaled scores are obtained by a simple linear transformation of the theta
score using the values of 250 and 350 on the scaled score metric and the
associated theta cut points to define the transformation. The scaling
coefficients were calculated using the following formulae:
\[ m = \frac{350 - 250}{\theta_2 - \theta_1} \]
\[ b = 250 - m\,\theta_1 = 350 - m\,\theta_2 \]
where m is the slope of the line providing the relationship between the theta
and scaled scores, b is the intercept, θ1 is the cut score on the theta score
metric for the Unsatisfactory/Satisfactory cut (i.e., corresponding to the raw
score cut for Unsatisfactory/Satisfactory), and θ2 is the cut score on the theta
score metric for the Satisfactory/Advanced cut (i.e., corresponding to the raw
score cut for Satisfactory/Advanced). Scaled scores were then calculated
using the following linear transformation (see Figure 3):

\[ \text{Scaled Score} = m(\theta) + b \]
where θ represents a student’s theta (or ability) score. The values obtained
using this formula were rounded to the nearest integer and then truncated
such that no student received a score below 100 or above 500. Table 5
presents the mean raw score for each grade/subject area combination on the
pre- and post-tests.
It is important to note that converting from raw scores to scaled scores does
not change the students’ performance-level classifications. For the Zambia
CA programme, a score of 250 is the cut score between Unsatisfactory and
Satisfactory and a score of 350 is the cut score between Satisfactory and
Advanced. This is true regardless of which subject area, grade, or year one
may be concerned with.
Scaled scores supplement the pre-test and post-test results by providing
information about the position of a student’s results within a performance
level. For instance, if the range for a performance level is 200 to 250, a
student with a scaled score of 245 is near the top of the performance level,
and close to the next higher performance level.
School level scaled scores are calculated by computing the average of
student-level scaled scores. Table 5 provides the raw score averages for each
of the subject areas, while Table 6 provides the same information in scaled
scores.
Table 5: Grade 5 Mean Raw Scores by Subject Area

                                              Grade 5 Pre-test        Grade 5 Post-test
Subject Area                       # Items    N     Mean  Std. Dev.   N     Mean  Std. Dev.
English                               30      3798  12.2  6.5         4025  11.7  7.1
Social and Developmental Studies      30      3962  10.1  5.3         4104  13.2  6.6
Mathematics                           30      3883  12.3  5.3         4127  12.4  5.8
Integrated Science                    30      4039   9.9  4.9         4135  11.1  6.3
Creative and Technology Studies       30      4032  10.5  5.3         4097  11.7  6.2
Community Studies                     30      4037   9.5  4.0         4141  11.2  6.4
According to Table 5, overall mean raw scores (with both pilot and
comparison groups taken together) across the subject areas on the pre-test
ranged from 9.5 (Community Studies) to 12.3 (Mathematics) out of a possible
30 points. In contrast, the overall mean raw scores for the post-tests
ranged from 11.1 (Integrated Science and Creative and Technology Studies)
to 13.2 (Social and Developmental Studies). From Table 6, the scaled score
averages for Grade 5 pre-tests ranged from 214 (Community Studies) to 239
(English) on the 100-500 scale. In contrast, the scaled score averages for the
post-tests ranged from 233 (English) to 262 (Mathematics).
Table 6: Grade 5 Mean Scaled Scores by Subject Area

                                              Grade 5 Pre-test         Grade 5 Post-test
Subject Area                       # Items    N     Mean   Std. Dev.   N     Mean   Std. Dev.
English                               30      3798  238.8  83.7        4025  233.4  88.1
Social and Developmental Studies      30      3962  230.5  86.2        4104  241.2  83.9
Mathematics                           30      3883  222.4  89.2        4127  261.9  72.6
Integrated Science                    30      4039  226.5  80.2        4135  245.7  73.7
Creative and Technology Studies       30      4032  224.1  85.3        4097  244.3  83.0
Community Studies                     30      4037  214.0  83.7        4141  236.9  72.3
As stated earlier, the scaled score is a simple linear transformation anchored
at the values of 250 and 350 on the scaled score metric. A student’s relative
position in the raw score distribution does not change under this scale
transformation.
Note that the primary interest of this evaluation is not whether the raw scores
and/or scaled scores increase or decrease from pre-test to post-test. These
differences will occur mainly through variations in test difficulty. The main
analysis will compare the relative changes in the two groups, i.e., pilot and
comparison, across the two time points, i.e., pre-test to post-test. At a later
point, post-tests will also be conducted when the cohort of students is in
Grade 6 and Grade 7, followed by extended analyses for the two additional
time points.
3.5 Vertical Scaled Scores
In vertical scaling, tests that vary in difficulty level, but that are intended to
measure similar constructs, are placed on the same scale. Placing different
tests on the same scale can be accomplished in a number of ways, such as
linking items across the tests or social moderation. For the CA programme, a
social moderation (Linn, 1993) procedure was employed for vertical scaling.[10]
In social moderation, assessments are developed in reference to a common
content framework. Performance of individual students, and schools, is
measured against a single set of common standards. For Zambia, an analysis
of the Grade 4 and 5 curriculums showed that the content was vertically
aligned, i.e., students were expected to progress in their learning along the
same constructs from one grade level to the next. This allowed the test
developers to link the pre-tests and post-tests through common performance
standards. The visual representation of the vertical scaling scheme for the CA
programme is shown below.
Figure 4: Vertical Scaling Scheme
(Cut scores on the vertical scale: Satisfactory cut, Advanced cut)

Grade 5 Pre-test:    250    350
Grade 5 Post-test:   350    450
Grade 6 Post-test:   450    550
Grade 7 Post-test:   550    650
In other words, students who were classified as Advanced on the Grade 5
pre-test (i.e., end of Grade 4 syllabus) would be considered Satisfactory on
the Grade 5 post-test (i.e., end of Grade 5 syllabus), students classified as
Advanced on the Grade 5 post-test would be considered Satisfactory on the
Grade 6 post-test (end of Grade 6 test), and so on through Grade 7. On the
vertical scaled score metric, students who earned a grade level scaled score
of 250 on the Grade 5 post-test would earn a vertical scaled score of 350
(because 350 is the equivalent grade level scaled score on the Grade 5
pre-test scale). Therefore, grade level scaled scores and vertical scaled
scores differ by a constant value of 100 points. The mean vertical scaled
scores for each subject are shown in Table 7.

[10] Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education,
6(1), 83-102.
Table 7: Grade 5 Mean Vertical Scaled Scores by Subject Area

                                              Grade 5 Pre-test         Grade 5 Post-test
Subject Area                       # Items    N     Mean   Std. Dev.   N     Mean   Std. Dev.
English                               30      3798  238.8  83.7        4025  333.4  88.1
Social and Developmental Studies      30      3962  230.5  86.2        4104  341.2  83.9
Mathematics                           30      3883  222.4  89.2        4127  361.9  72.6
Integrated Science                    30      4039  226.5  80.2        4135  345.6  73.7
Creative and Technology Studies       30      4032  224.1  85.3        4097  344.4  83.0
Community Studies                     30      4037  214.0  83.7        4141  336.9  72.3
Figure 5 shows the mean vertical scaled scores on the pre- and post-tests
across the subject areas. Vertical scaled scores for the pre-test are simply the
grade level scaled scores. As expected, vertical scaled scores for the Grade 5
post-test are higher than the Grade 5 pre-test scaled scores.
Figure 5: Vertical Scaled Mean Scores by Subject Area
(Bar chart of pre-test and post-test mean vertical scaled scores for English,
SDS, Mathematics, ISC, CTS and CS.)
3.6 Comparison between Pilot and Comparison Groups

The comparisons between the pilot and comparison groups were made in raw
scores and in vertical scaled scores. Raw scores on the pre- and post-tests
are not on the same scale, since the tests are of varied difficulty; the raw
score comparison is presented for simplicity. The comparison is more
relevant, valid, and beneficial when the groups are compared on the vertical
scaled scores, since the vertical scaled scores for the pre- and post-tests are
on the same scale.
Raw Scores
Table 8 shows that the raw score mean differences between the pilot and
comparison schools on the Grade 5 pre-tests were small for each subject
area. The mean differences, analyzed using t-tests, were statistically
significant only in English and Mathematics, with the pupils in the comparison
group performing better than those in the pilot group (p<.05). In the other four
subjects, the t-tests showed no significant differences between the two groups
on the baseline. In raw scores, the differences in English and Mathematics
were about half a point, while the differences for the other subjects were at
most two-tenths of a point. These results reflected the expectation of
very small differences on the pre-tests, since the schools were randomly
assigned to one of the two groups based on a matched pairs design.
Table 8: Mean Raw Scores by Subject Area and Group

                                                 Grade 5 Pre-test        Grade 5 Post-test
Subject Area                       Group         N†    Mean   Std. Dev.  N†    Mean   Std. Dev.
English                            Pilot         1785  11.9   6.4        1773  13.3*  1.6
                                   Comparison    2013  12.4*  6.6        1967  12.2   1.6
                                   Total         3798  12.2   6.5        3740  12.8   1.6
Social and Developmental Studies   Pilot         1907  10.0   5.2        1895  14.9*  1.3
                                   Comparison    2055  10.2   5.5        2008  13.7   1.3
                                   Total         3962  10.1   5.3        3903  14.3   1.3
Mathematics                        Pilot         1861  12.0   5.3        1849  13.8*  1.4
                                   Comparison    2022  12.6*  5.3        1975  13.2   1.4
                                   Total         3883  12.3   5.3        3824  13.5   1.4
Integrated Science                 Pilot         1961   9.8   4.9        1949  13.2*  1.9
                                   Comparison    2078   9.9   4.9        2031  11.2   1.8
                                   Total         4039   9.9   4.9        3980  12.2   1.9
Creative and Technology Studies    Pilot         1967  10.5   5.2        1955  12.9*  1.5
                                   Comparison    2065  10.6   5.4        2018  11.7   1.5
                                   Total         4032  10.5   5.3        3973  12.3   1.5
Community Studies                  Pilot         1979   9.5   4.0        1967  13.4*  1.6
                                   Comparison    2058   9.5   3.9        2011  12.5   1.6
                                   Total         4037   9.5   4.0        3978  13.0   1.6

* Significant at p<0.05; † N represents adjusted weighted sample size.
The differences between the two groups for all subject areas on the Grade 5
post-test (also in Table 8) were evaluated using an Analysis of Covariance
(ANCOVA), with the pre-test scores as the covariates. In other words, the
pre-test scores were made statistically equivalent so that the groups could be
evaluated on an equal basis on the post-tests. Using the raw scores, the
results were statistically significant in each of the subject areas, with the pilot
group outperforming the comparison group (p<.05).
Note that all statistical comparisons were made at the school level, not at the
student level. This was due to changes in student population at each school
from pre-test to post-test. The design was based on cohorts (student groups
over time) and not on panels (the same students over time). A panel design
would have been statistically possible, but it would also have led to skewed
results due to student attrition.
Vertical Scaled Scores

As stated, vertical scaled scores on the pre- and post-tests were computed
independently for both the pilot and comparison groups and were measured
on the same scale (i.e., the vertical scale). This makes the comparison more
relevant and valid for assessing the impact of CA in the pilot schools relative
to the comparison schools.
Table 9: Mean Vertical Scaled Scores by Subject Area and Group

                                                 Grade 5 Pre-test          Grade 5 Post-test
Subject Area                       Group         N†    Mean    Std. Dev.  N†    Mean    Std. Dev.
English                            Pilot         1785  236.1   82.4       1773  352.3*  20.3
                                   Comparison    2013  241.2*  84.8       1967  339.9   20.3
                                   Total         3798  238.8   83.7       3740  346.1   20.3
Social and Developmental Studies   Pilot         1907  229.1   84.3       1895  362.4*  17.7
                                   Comparison    2055  231.8   87.9       2008  346.2   17.7
                                   Total         3962  230.5   86.2       3903  354.3   17.7
Mathematics                        Pilot         1861  217.8   89.3       1849  380.5*  17.1
                                   Comparison    2022  226.7*  88.9       1975  373.1   17.1
                                   Total         3883  222.4   89.2       3824  376.8   17.1
Integrated Science                 Pilot         1961  225.5   80.1       1949  369.5*  20.4
                                   Comparison    2078  227.4   80.4       2031  348.0   20.4
                                   Total         4039  226.5   80.2       3980  358.8   20.4
Creative and Technology Studies    Pilot         1967  223.0   84.0       1955  357.1*  16.0
                                   Comparison    2065  225.1   86.5       2018  343.5   16.0
                                   Total         4032  224.1   85.3       3973  350.3   16.0
Community Studies                  Pilot         1979  213.7   84.3       1967  365.8*  22.1
                                   Comparison    2058  214.2   83.1       2011  352.8   22.1
                                   Total         4037  214.0   83.7       3978  359.3   22.1

* Significant at p<0.05
Table 9 shows that the vertical scaled score mean differences between the
pilot and comparison schools on the Grade 5 pre-tests were small for each
subject area. The mean differences in all six subject areas, analyzed using
t-tests, were not statistically significant (p>.05). In contrast, when the
differences between the two groups for all subject areas on the Grade 5
post-test (also in Table 9) were evaluated using an ANCOVA (with the pre-test
scores as the covariates), the results were statistically significant in all subject
areas, with the pilot group outperforming the comparison group (p<.05).
Figures 6 through 11 show the differences in vertical scaled scores from the
Grade 5 pre-test to the Grade 5 post-test for each of the subject areas. The
graphs clearly show greater score increases by the pilot group in all subject
areas; in Mathematics the difference is less evident than in the other subjects,
though the pilot group started off lower and finished higher.
Figure 6: English Mean Vertical Scaled Scores by Group
(Line graph of mean vertical scaled scores from Grade 5 pre-test to Grade 5
post-test for the pilot and comparison groups.)

Figure 7: Social & Dev. Studies Mean Vertical Scaled Scores by Group
(Line graph of mean vertical scaled scores from Grade 5 pre-test to Grade 5
post-test for the pilot and comparison groups.)

Figure 8: Mathematics Mean Vertical Scaled Scores by Group
(Line graph of mean vertical scaled scores from Grade 5 pre-test to Grade 5
post-test for the pilot and comparison groups.)
Figure 9: Integrated Science Mean Vertical Scaled Scores by Group
(Line graph of mean vertical scaled scores from Grade 5 pre-test to Grade 5
post-test for the pilot and comparison groups.)

Figure 10: Creative & Tech. Studies Mean Vertical Scaled Scores by Group
(Line graph of mean vertical scaled scores from Grade 5 pre-test to Grade 5
post-test for the pilot and comparison groups.)

Figure 11: Community Studies Mean Vertical Scaled Scores by Group
(Line graph of mean vertical scaled scores from Grade 5 pre-test to Grade 5
post-test for the pilot and comparison groups.)
3.7 Comparison across Regions
While not the focus of the evaluation, the next two sections contain useful
information on student performance. Tables 10 and 11 present a brief analysis
of the scores by region, providing information on the scores on a
disaggregated basis. As with the overall analyses, the comparisons across
the three regions were made in raw scores and vertical scaled scores. Lusaka
Region consistently had the highest mean scores (both raw scores and
vertical scaled scores) in all subjects on the Grade 5 pre-tests, followed by
Western and Southern. The same pattern of results was also observed for
Grade 5 post-tests.
Table 10: Subject Area Mean Raw Scores by Region

                                              Grade 5 Pre-test        Grade 5 Post-test
Subject Area                       Region     N     Mean  Std. Dev.   N     Mean  Std. Dev.
English                            Southern   1010  11.0  6.2         1157  10.4  6.6
                                   Western     994  11.7  5.9         1103  11.9  6.7
                                   Lusaka     1794  13.1  6.9         1765  12.4  7.5
                                   Total      3798  12.2  6.5         4025  11.7  7.1
Social and Developmental Studies   Southern   1014   9.4  4.8         1214  11.7  6.0
                                   Western    1112   9.9  4.9         1125  13.2  6.1
                                   Lusaka     1836  10.7  5.8         1765  14.1  7.0
                                   Total      3962  10.1  5.3         4104  13.2  6.6
Mathematics                        Southern   1002  11.5  5.4         1226  11.1  5.2
                                   Western    1086  12.2  5.2         1120  12.7  5.3
                                   Lusaka     1795  12.9  5.2         1781  13.0  6.3
                                   Total      3883  12.3  5.3         4127  12.4  5.8
Integrated Science                 Southern   1025   9.2  4.4         1212   9.6  5.4
                                   Western    1151   9.4  4.6         1154  11.7  6.4
                                   Lusaka     1863  10.6  5.3         1769  11.8  6.7
                                   Total      4039   9.9  4.9         4135  11.1  6.3
Creative and Technology Studies    Southern   1016   9.6  4.8         1205   9.9  5.6
                                   Western    1140  10.2  5.0         1146  11.3  6.0
                                   Lusaka     1876  11.2  5.7         1790  11.9  6.9
                                   Total      4032  10.5  5.3         4141  11.2  6.4
Community Studies                  Southern   1015   9.0  3.5         1191  10.5  5.3
                                   Western    1146   9.4  4.3         1122  11.5  6.0
                                   Lusaka     1876   9.8  4.0         1784  12.7  6.8
                                   Total      4037   9.5  4.0         4097  11.7  6.2
Table 11: Subject Area Mean Vertical Scaled Scores by Region

                                              Grade 5 Pre-test         Grade 5 Post-test
Subject Area                       Region     N     Mean   Std. Dev.   N     Mean   Std. Dev.
English                            Southern   1010  224.1  80.3        1157  317.3  82.8
                                   Western     994  232.3  72.9        1103  335.0  81.0
                                   Lusaka     1794  250.7  89.3        1765  343.0  94.1
                                   Total      3798  238.8  83.7        4025  333.4  88.1
Social and Developmental Studies   Southern   1014  218.5  77.4        1214  321.7  76.7
                                   Western    1112  226.4  79.1        1125  341.1  78.1
                                   Lusaka     1836  239.6  93.6        1765  354.7  89.5
                                   Total      3962  230.5  86.2        4104  341.2  84.0
Mathematics                        Southern   1002  209.2  91.0        1226  346.6  66.1
                                   Western    1086  219.9  86.2        1120  366.6  65.5
                                   Lusaka     1795  231.3  89.0        1781  369.5  79.3
                                   Total      3883  222.4  89.2        4127  361.9  72.6
Integrated Science                 Southern   1025  215.7  72.1        1212  328.9  63.5
                                   Western    1151  218.1  76.1        1154  353.0  74.2
                                   Lusaka     1863  237.5  85.5        1769  352.4  78.0
                                   Total      4039  226.5  80.2        4135  345.7  73.7
Creative and Technology Studies    Southern   1016  209.8  77.9        1191  327.6  70.7
                                   Western    1140  218.9  79.7        1122  340.7  79.5
                                   Lusaka     1876  234.9  90.8        1784  357.7  90.3
                                   Total      4032  224.1  85.3        4097  344.3  83.0
Community Studies                  Southern   1015  204.2  74.8        1205  323.4  64.3
                                   Western    1146  213.1  88.6        1146  338.7  66.8
                                   Lusaka     1876  219.8  84.6        1790  344.9  79.1
                                   Total      4037  214.0  83.7        4141  336.9  72.3
3.8 Performance Categories
Depending on test difficulty and score distributions, performance categories
were established for each of the tests using a procedure called standard
setting. An Angoff (1971)[11] standard setting method was implemented to set
the cut scores between Unsatisfactory and Satisfactory and between
Satisfactory and Advanced for both the pre-tests and post-tests.
The resultant cut scores are presented in Tables 12 and 13. In English, for
example, students who got a score of 1-12 would be classified as
Unsatisfactory, students who got a score of 13-21 would be classified as
Satisfactory, and students who earned a score of 22-30 would be classified as
Advanced on the pre-test. For Mathematics, the corresponding ranges are
1-13 Unsatisfactory, 14-19 Satisfactory, and 20-30 Advanced for the pre-test.
The post-test ranges for each subject area are different from those on the
pre-tests; the reason is that the pre-tests and post-tests covered different
content and had different levels of difficulty.
[11] Angoff, W. H. (1971). Scales, Norms, and Equivalent Scores. In R.L. Thorndike (Ed.), Educational
Measurement (2nd ed., pp. 508-560). Washington, DC: American Council on Education.
Table 12: Performance Categories for Pre-tests by Subject

                                     Grade 5 Pre-test
                                     1                 2              3
Subject Area                         Unsatisfactory    Satisfactory   Advanced
                                     (Fail)            (Pass)         (Pass)
English                              1-12              13-21          22-30
Social and Developmental Studies     1-10              11-17          18-30
Mathematics                          1-13              14-19          20-30
Integrated Science                   1-10              11-17          18-30
Creative and Technology Studies      1-11              12-18          19-30
Community Studies                    1-10              11-15          16-30
Table 13: Performance Categories for Post-tests by Subject

                                     Grade 5 Post-test
                                     1                 2              3
Subject Area                         Unsatisfactory    Satisfactory   Advanced
                                     (Fail)            (Pass)         (Pass)
English                              1-12              13-21          22-30
Social and Developmental Studies     1-13              14-21          22-30
Mathematics                          1-10              11-19          20-30
Integrated Science                   1-10              11-20          21-30
Creative and Technology Studies      1-11              12-21          22-30
Community Studies                    1-11              12-19          20-30
Tables 14 and 15 provide the percentages of students classified in the three
performance categories by subject. On the pre-test, the percentages in each
category by group were similar for most of the subjects. For instance, in
Integrated Science, similar percentages of students were at Satisfactory or
above for the pilot (34%) and comparison (33%) groups. However, on the
post-test, there were some differences for the groups, mostly in favour of the
pilot group. In Integrated Science, 53% of students in the pilot group were
Satisfactory or above vs. 43% of students in the comparison group. The
percentages for each group favoured the pilot group on the post-test, with the
exception of Mathematics, where the rounded percentage passing was the
same in the pilot (65%) and comparison (65%) groups.
Table 14: Percentages of Students in Performance Categories for Pre-tests

                                                  Grade 5 Pre-test
                                                  1                 2              3
Subject Area                       Group          Unsatisfactory    Satisfactory   Advanced
                                                  (Fail)            (Pass)         (Pass)
English                            Pilot          63.0              27.2            9.8
                                   Comparison     59.7              28.2           12.1
Social and Developmental Studies   Pilot          62.8              26.9           10.3
                                   Comparison     64.4              24.0           11.6
Mathematics                        Pilot          64.3              26.2            9.5
                                   Comparison     60.1              29.4           10.5
Integrated Science                 Pilot          65.9              25.6            8.5
                                   Comparison     67.3              22.9            9.8
Creative and Technology Studies    Pilot          67.5              22.9            9.6
                                   Comparison     68.4              20.1           11.5
Community Studies                  Pilot          66.8              25.4            7.8
                                   Comparison     66.8              24.8            8.4
Table 15: Percentages of Students in Performance Categories for Post-tests

                                                  Grade 5 Post-test
                                                  1                 2              3
Subject Area                       Group          Unsatisfactory    Satisfactory   Advanced
                                                  (Fail)            (Pass)         (Pass)
English                            Pilot          60.0              26.5           13.5
                                   Comparison     64.0              24.0           11.9
Social and Developmental Studies   Pilot          51.4              33.4           15.3
                                   Comparison     59.3              30.6           10.2
Mathematics                        Pilot          35.2              53.9           10.9
                                   Comparison     34.8              56.3            8.9
Integrated Science                 Pilot          46.7              40.2           13.1
                                   Comparison     57.3              36.0            6.7
Creative and Technology Studies    Pilot          54.5              35.1           10.4
                                   Comparison     62.3              31.0            6.7
Community Studies                  Pilot          50.4              33.9           15.6
                                   Comparison     54.4              36.2            9.5
Chapter Four: Summary and Conclusions
The main objective of the evaluation was to determine whether the CA
programme had positive effects on student learning outcomes in the first
year of implementation. This was accomplished by measuring and comparing
the levels of learning achievement of pupils in pilot (intervention) and
comparison (control) schools. A baseline (pre-test) assessment occurred
before implementation of the proposed interventions at the beginning of
Grade 5 in randomly selected pilot schools. This created a basis upon which
the impact of CA was measured at the end of the Grade 5 pilot year.
A sample of 48 schools was selected from Lusaka, Southern and Western
Provinces using a matched pairs design and random assignment, resulting in
24 pilot schools and 24 comparison schools. Student achievement for the
Grade 5 baseline and post-test administrations was measured using multiple
choice tests in 6 subject areas with 30 items each (30 points per test). The
Grade 5 baseline tests were based on the Grade 4 curriculum, while the
Grade 5 post-tests were based on the Grade 5 curriculum. Overall, the
psychometric characteristics of the tests were very satisfactory on both the
pre-tests and post-tests. Items were within acceptable difficulty (p-value)
ranges and discrimination (point-biserial correlation) levels. Overall tests were
found reliable, using Cronbach's Alpha as an estimate of internal consistency
reliability.
Performance of the schools on the baseline and post-tests was compared
using mean raw scores and mean vertical scaled scores. The vertical scaled
score comparison was found more relevant, valid, and beneficial, since the
school mean scores on both the baseline and post-tests were evaluated on
the same measurement scale (i.e., the vertical scale). In addition, statisticians
generally prefer using scaled scores for longitudinal comparisons since the
scale is equal interval, thus making comparisons more accurate.
Overall, the pupils’ scores on the baseline pre-test were very similar in the
pilot and comparison schools. The comparison schools scored slightly higher
on the English and Mathematics tests, but the score differences for the two
groups on the other four tests were minimal. On the post-test, which was
administered after one year of the CA programme, the scores of the pilot
schools on all six tests were significantly higher than those in the comparison
schools. This provides strong initial evidence that the CA programme had a
significantly positive effect on pupil learning outcomes.
When the performance of the schools on the baseline and post-tests was
compared by region, Lusaka Region consistently had the highest mean
scores in all subjects on the Grade 5 pre-tests and post-tests, followed by
Western and Southern. The number of schools by region was too small to
make statistically valid region-by-region comparisons of pre-test to post-test
scores for the pilot and comparison groups.
Students were also classified into three performance level categories
(Unsatisfactory, Satisfactory, and Advanced) in each subject area based on
their performance on the baseline and post-tests. On the pre-tests, the
percentages in each category by group were similar for most of the subjects.
However, on the post-test, there were differences in favour of the pilot group
in virtually all subjects. For instance, in Integrated Science, 53% of students in
the pilot group were Satisfactory and above vs. 43% of students in the
comparison group. This provided strong evidence that a greater percentage of
students in the pilot group were achieving a passing score on the post-test
than those in the comparison group.
The next round of post-tests in the Phase 1 schools will be administered when
the same cohort of pupils completes Grade 6. This will be followed by a final
test administration (a third post-test) when the cohort of pupils completes
Grade 7. At that point, with four time points (a baseline and three post-tests),
more substantial conclusions will be drawn on the effectiveness of the CA
programme.
Note also that the evaluation process is being repeated in the Phase 2 and
Phase 3 schools, which will provide a complete national quantitative
evaluation of the programme at the end of Year 5 of implementation (2010).
Based on guidance from the CA Steering Committee, results from the
evaluation will be used at a selected point in the implementation period as a
criterion for scaling up the CA programme to other primary schools in Zambia.