Formative assessment and personalised feedback are commonly recognised as key factors both for improving students’ performance and for increasing their motivation and engagement (Gibbs, 2005). Currently, in large and massive online courses, technological solutions for giving feedback are reduced to different kinds of quizzes. Multiple-choice questions show serious shortcomings when it comes to assessing learning activities based on written expression and higher-order thinking. Thus, at present, one of our challenges is to give feedback on open-ended questions through semantic technologies in a sustainable way.
To face this challenge, our academic team decided to test a Latent Semantic Analysis-based automatic assessment tool named G-Rubric, developed by researchers at the Developmental and Educational Psychology Department of UNED (Spanish National Distance Education University). The first experience was launched in 2014-2015. Using G-Rubric, automated, iterative formative feedback was provided to our students on different types of open-ended questions (70-800 words). This feedback allowed students to improve their answers and practise their writing skills, thus contributing both to better concept organisation and to the building of knowledge.
Open-ended questions allow instructors to assess the achievement of learning outcomes adequately. In fact, higher-level outcomes such as analytical skills, construction of arguments or precise writing can be assessed more effectively via open-ended questions. Because of this, many instructors show a strong preference for this kind of learning activity, even though it is more time-intensive and harder to grade. Problems arise, however, when it comes to the reliability and fairness of grades. As will be shown, the use of G-Rubric could cope simultaneously with both problems, namely inter-examiner variability and intra-examiner reliability.
In this paper, we present the promising results of our first experiences with UNED Business Degree students over three academic years (2014/15, 2015/16 and 2016/17). These experiences show to what extent automated assessment software such as Gallito-G-Rubric is currently mature enough to be used with students, with quite satisfactory results in giving them enriched and personalised feedback. Furthermore, G-Rubric would help to deal with the grading problems described above. Our final goal is not to replace tutors with semantic tools, but to support tutors in grading students’ assignments.
Using semantic technologies for giving a formative assessment and supporting scoring in large courses and MOOCs: first experiences at UNED (2015-2017)
1. Miguel Santamaría Lancho, Mauro Hernández, Ángeles Sánchez-Elvira, José María Luzón Encabo, Guillermo de Jorge-Botana
UNED, Spain
2. Our goal was to improve formative assessment in online courses giving personalised feedback
Economic History Teachers Team (Faculty of Economics, Department of Economic History and Applied Economics): Miguel Santamaría, José M. Luzón, Mauro Hernández
G-Rubric software developers (Faculty of Psychology, Department of Developmental and Educational Psychology): Guillermo de Jorge
G-Rubric user (Department of Personality): Ángeles Sánchez-Elvira
3. Summary
1. Our challenge: how semantic technologies could help us to:
• give personalised feedback on open-ended questions
• support our tutors in scoring TMAs in a more reliable way
2. What G-Rubric is and how it works
3. Analysis of our experiences giving automatic formative feedback on open-ended questions
4. Proposal about how G-Rubric could cope with problems related to manual grading
5. Results and conclusions
4. How to give personalised feedback on open-ended activities
5. Feedback is the key factor for:
• Personalising learning
• Fostering performance improvement
• Increasing motivation
01/11/2017 msantamaria@cee.uned.es
6. Which kind of feedback do our students expect?
Characteristics of expected feedback:
• Quick
• Iterative
• They love learning by trial and error
Only technology can provide this kind of feedback
7. Feedback based on technologies offers limited solutions
In the classroom
• “Clickers” (Socrative, Kahoot)
In e-learning platforms
• Quizzes
• Adaptive quizzes
8. Quizzes have severe limitations for assessing learning outcomes in the field of economic history
LEARNING OUTCOMES
• Knowledge about Economic History
• Soft skills:
• Analysis
• Critical thinking
ASSESSMENT ACTIVITIES
• Multiple choice questions
• Open-ended short questions about concepts, historical processes, etc.
• Writing comments on texts, maps, graphs, statistical data
Our challenge was how to give:
• quick and iterative feedback
• for open-ended questions
• in a sustainable way
• by using technologies
10. To implement G-Rubric into a subject we need to follow 3 steps
1st step: To build up a specialized linguistic corpus and a Semantic Space
• Corpus: 6 Economic History textbooks
2nd step: Activities based on short open-ended questions should be developed
• Each activity has a canon answer and an in-built rubric space
3rd step: To deliver the activities to our students we use a web interface
• Students submit an answer through the web interface and receive feedback
11. Example of a G-Rubric open-ended activity
Question
Mercantilism: policies and objectives.
Canon answer (or golden text)
“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism).
There are several types of policies, notably: a) those focused on obtaining trade balance surpluses (tariff protectionism, prohibition on exporting gold, silver or raw materials, privileged trading companies, navigation acts, colonial monopolies); b) promotion of manufactures (import tariffs or prohibitions, laws against luxury, royal manufactures); c) other policies: favouring the birth rate, limiting or fixing domestic prices.
They are often associated with the names of Colbert in France, or the English or Dutch East India companies (VOC).”
Conceptual axes
Definition: mercantilism, ideas, practices, state, economy, monarchy, strengthen, reinforce, increase, trade balance, favorable, bullionism, precious metals, gold, silver, privileges.
Trade policies: trade, protectionism, tariffs, prohibition, exports, imports, privileged companies, navigation acts, colonies, monopoly, fleet, merchant, surplus
Manufacturing policies: manufactures, factories, royal, luxury, import substitution
Context: Europe, England, France, Holland, Colbert, XVI, XVII, XVIII, modern, VOC, East Indies, West Indies.
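The conceptual axes can be read as keyword lists against which a student's answer is compared. G-Rubric makes this comparison in an LSA semantic space; the sketch below is not that method. It is a deliberately crude stand-in that assumes plain keyword overlap, with hypothetical names (`AXES`, `normalise`, `axis_coverage`) and shortened keyword lists, and is meant only to show how per-axis feedback could be derived from one answer.

```python
import re

# Shortened keyword lists taken from the conceptual axes on the slide.
AXES = {
    "Definition": {"mercantilism", "ideas", "state", "trade balance", "precious metals"},
    "Trade policies": {"protectionism", "tariffs", "monopoly", "privileged companies"},
    "Manufacturing policies": {"manufactures", "luxury", "import substitution"},
    "Context": {"europe", "france", "colbert", "voc"},
}


def normalise(text):
    """Lowercase the text and turn punctuation and hyphens into single spaces.

    The result is padded with a space on both sides so that whole-word
    matching can be done with a simple substring test.
    """
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def axis_coverage(answer, axes=AXES):
    """Return, for each axis, the fraction of its keywords found in the answer."""
    padded = normalise(answer)
    return {
        name: sum(1 for keyword in keywords if f" {keyword} " in padded) / len(keywords)
        for name, keywords in axes.items()
    }


# First attempt from the example: only the definition has been written so far.
first_attempt = (
    "Mercantilism is a set of ideas and policies deployed in early modern Europe "
    "aimed at strengthening the State through economic power, and specially focused "
    "on trade-balance surpluses and accumulation of precious metals."
)
coverage = axis_coverage(first_attempt)
```

With this first attempt the Definition axis is fully covered while the Trade and Manufacturing axes are empty, which is the kind of signal that tells a student where to add information in the next attempt.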
12. An example to understand how G-Rubric works
G-Rubric web interface
13. The student selects an activity
1.- Mercantilism
2.- Triangular Trade
3.- Coal and Ind. Rev.
4.- Gerschenkron
5.- Second Industrial Revolution
6.- Consequences of WWI
7.- Bretton Woods
Selected activity: 1.- Mercantilism
14. The student enters the answer
“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism).”
15. After submitting an answer, the students receive feedback
“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism).”
The feedback consists of:
• Content grade: to what extent the answer is correct
• Style grade: grammatical accuracy
• Graphical feedback: position of the answer on each conceptual axis (Definition, Trade, Manufact, Context) relative to the acceptance area
16. After checking the feedback
The student improves their answer by adding new information
“Mercantilism is a set of ideas and policies deployed in early modern Europe (16th, 17th and 18th centuries) aimed at strengthening the State through economic power, and specially focused on trade-balance surpluses and accumulation of precious metals (bullionism).
Amongst mercantilist policies, some stand out, i.e. those focused on attaining trade balance surpluses through tariff protection, prohibition of exports of gold, silver and raw materials, creation of chartered trade companies, navigation acts and commercial monopolies.”
17. New feedback is provided
The content grade goes up
The answers for each conceptual axis get closer to the acceptance area
19. Experiences using G-Rubric in 2015 and 2016
• The trials carried out were focused on providing formative assessment
• Our goal was to promote deep learning through iterative feedback
• G-Rubric offers two main advantages regarding formative assessment:
• It allows as many attempts as lecturers set
• It gives the students immediate, rich feedback
• All trials have been conducted with first-year Business Administration Degree students
20. Two experiences (2015 and 2016): goals
2015: 132 volunteers
2016: 120 volunteers
OUR QUESTIONS
• Could G-Rubric give accurate feedback?
• Could the feedback lead to an improvement in subsequent answers?
• Could rich feedback increase the time devoted to the activity?
STUDENTS’ OPINIONS ABOUT
• The impact on their motivation
• Its usefulness for preparing the final exam
• The level of agreement with the grades received
The enriched graphical feedback increases:
• The number of trials performed by the students
• The amount of time devoted to the task
21. Content grade improvement
The average percentage score increases between first and last attempt
[Chart: average percentage score at first and last attempt, Activity 1 to Activity 7]
We could verify that students using the feedback were able to improve their answers
22. Students’ agreement with the grades received
The level of agreement was higher in the last trial
First trial: 47% very much or totally agree
Last trial: 70% very much or totally agree
23. G-Rubric had a positive impact on students’ motivation
Totally or very much: 65%
Totally or very much: 60%
24. Usefulness and positive value
80% of students considered G-Rubric totally or very much useful for exam preparation
More than 80% of students considered this experience very much or totally positive
26. Are humans reliable when marking open-ended questions?
Manual grading has at least two problems:
• Inter-examiner variability, depending on who marked the task
• Intra-examiner reliability, depending on when the same tutor marked the task
Students view manual grading of open-ended questions as subjective
➢ In contrast, automated test assessment is perceived as more objective
27. Accidental double grading (2012 & 2013)
Two members of the academic team, independently and unknowingly, graded the same exams.
• The average differential was 1.5 points out of 8
• Final grades differed substantially (37.5% did not obtain a passing grade)
[Chart: grade differential per exam, exams 1 to 24; series: essay + short questions grade differential, essay grade differential]
Figure 5. Differential in grades for doubly-assessed exams (June 2012)*
*Referred to 24 Economic History final exams from Barcelona-CUXAM Regional Center (June 2012)
28. Accidental double grading (2013)
• We found the same pattern
• The average differential was 0.9 points out of 8
• Final grades differed substantially (21% did not obtain a passing grade)
[Chart: essay + short questions differential (Grade 2 - Grade 1) per exam]
Figure 6. Differential in grades for doubly-assessed exams (June 2013)*
*Referred to 76 Econ. History final exams from Valencia-Alzira Regional Center (June 2013)
29. Correlations between grades assigned by examiners
                    2012    2013
n                   20      76
GLOBAL GRADE        0.82    0.88
SHORT QUESTIONS     0.85    0.87
TEXT COMMENTARY     0.70    0.67
Despite these differences between examiners we found:
• A high correlation on the global score and short questions
• Lower correlation on text commentary grades
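A correlation measures whether two examiners rank exams similarly, not whether they give the same marks, so a high coefficient can coexist with a sizeable average differential. The sketch below illustrates this with made-up grades and a hypothetical helper, `pearson_r`; it is an illustration under those assumptions, not the authors' computation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grades out of 8: examiner B is always 1.5 points more generous.
examiner_a = [2.0, 3.5, 4.0, 5.0, 6.5]
examiner_b = [grade + 1.5 for grade in examiner_a]

r = pearson_r(examiner_a, examiner_b)
mean_gap = sum(b - a for a, b in zip(examiner_a, examiner_b)) / len(examiner_a)
```

In this constructed case the two examiners agree perfectly on the ordering of the exams while still differing by 1.5 points on every one of them, which is why correlations and differentials are worth reporting side by side.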
30. Comparing how tutors and G-Rubric mark TMAs
• G-Rubric could cope simultaneously with both problems:
• Inter-examiner variability
• Intra-examiner reliability
Our first step has been to compare how tutors and G-Rubric grade TMAs
• A fragment of "The Wealth of Nations", by Adam Smith, was selected for students to comment on.
• A rubric was built to minimise inter-examiner variability.
• A G-Rubric object, similar to those described above, was designed and its axes were aligned with the rubric used by tutors to mark the students' assignments
• The tutors graded these assignments using the rubric
• The teaching team used G-Rubric to grade the students' TMAs again
• 252 TMAs were double-graded to compare G-Rubric and tutors' marks
31. What have we found comparing grades given by tutors and G-Rubric?
1.- No significant difference between means.
Main descriptives of tutors’ and G-Rubric marks (N=252)
                  M      SD     Min    Max
Tutors’ marks     5.95   1.45   1.55   8.54
G-Rubric marks    5.92   1.61   2.13   9.20
An independent samples t test yielded no significant differences between the means of tutors’ and G-Rubric marks, t(251), p=.720, ns.
2.- Pearson correlations between G-Rubric’s and tutors’ marks yielded a large effect size (.549**).
** The correlation is significant at the 0.01 level (two-tailed)
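The slide reports t(251) for N = 252, and 251 = N - 1 is the number of degrees of freedom a paired-samples design gives when each TMA carries both a tutor mark and a G-Rubric mark. The sketch below shows how such a paired statistic can be computed. The function name `paired_t` and the marks are hypothetical, and this is not the authors' analysis script.

```python
import math


def paired_t(first, second):
    """Return (t, df) for a paired-samples t test on two equal-length lists.

    The test works on the per-item differences: t is the mean difference
    divided by its standard error, and df is the number of pairs minus one.
    """
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("need two lists of equal length (at least 2 pairs)")
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0:
        raise ValueError("t is undefined when all differences are identical")
    return mean_diff / math.sqrt(var_diff / n), n - 1


# Made-up marks for six assignments, graded by a tutor and by the software.
tutor_marks = [5.5, 6.0, 4.5, 7.0, 6.5, 5.0]
software_marks = [5.0, 6.5, 4.0, 7.5, 6.0, 5.5]
t_value, df = paired_t(tutor_marks, software_marks)
```

A t value close to zero, as in this made-up case where the differences cancel out, corresponds to means that do not differ; the p-value would then be read from a t distribution with `df` degrees of freedom.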
32. Grades distributions: analysis of frequencies
3.- G-Rubric’s marks were more homogeneously distributed, in comparison with the higher concentration of the tutors’ marks in the ranges between 5 and 7 points
Percentages of grades in each points range
Points range    Tutors’ grades    G-Rubric’s grades
0 to 1          -                 -
1 to 2          0.79              -
2 to 3          4.37              4.76
3 to 4          6.75              9.92
4 to 5          6.35              15.08
5 to 6          30.56             17.06
6 to 7          28.57             22.62
7 to 8          14.68             21.83
8 to 9          7.94              7.94
9 to 10         -                 0.79
33. Analysis of the homogeneity of G-Rubric and tutors’ marks
Kruskal-Wallis analyses for the evaluation of marks homogeneity between the 37 tutoring groups
              Tutor mark    G-Rubric mark    Mark difference
Chi-square    69.14         47.21            74.49
df            36            36               36
p             .001          .100             .000
4.- Tutors’ marks presented significant inter-group variability, as did the mark difference.
In contrast, G-Rubric marks did not differ significantly between these same tutorial groups, showing its higher level of homogeneity.
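The Kruskal-Wallis test used here ranks all marks together and asks whether the rank sums of the groups differ more than chance would allow; with 37 tutoring groups it has 37 - 1 = 36 degrees of freedom, as in the table. The sketch below computes the H statistic with the usual tie correction. The helper names (`average_ranks`, `kruskal_wallis_h`) and the marks are hypothetical, and the p-value, which would come from a chi-square distribution, is left out.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, df) for a list of groups, with the standard tie correction."""
    if len(groups) < 2 or any(len(group) == 0 for group in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)

    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * weighted_rank_sums - 3.0 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tied runs.
    ties = sum(count ** 3 - count for count in Counter(pooled).values())
    correction = 1.0 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("H is undefined when all values are identical")
    return h / correction, len(groups) - 1


# Made-up marks for three tutoring groups.
tutoring_groups = [[5.5, 6.0, 6.5, 5.0], [4.0, 4.5, 5.0, 3.5], [7.0, 6.5, 7.5, 8.0]]
h_value, df = kruskal_wallis_h(tutoring_groups)
```

A large H relative to its degrees of freedom indicates that at least one group's marks sit systematically higher or lower than the others, which is the pattern the slide reports for tutors' marks but not for G-Rubric marks.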
35. Main conclusions
• Automated-assessment software such as G-Rubric is currently
mature enough to be used with students.
• The kind of feedback offered was useful to improve the students’
performance
• Results in terms of students’ satisfaction are also encouraging.
• For teachers, the time and effort required is affordable.
36. Regarding how G-Rubric could support grading
• A remarkable correlation and no significant differences between the means were found.
• Tutors’ scores presented significant inter-group variability
• In contrast, G-Rubric’s marks did not differ significantly between these same tutorial groups, showing its higher level of homogeneity
Our proposal:
Students’ essays will be graded first using G-Rubric; afterwards, tutors will grade them again to validate or modify the grades given.