Student Evaluations in Physics: A study on Gender Bias at
the University of Manchester
George Gilliver (8383712), Zac Baker (8446588) and Joshua Brothers (8268765)
Supervised by Anna Scaife
School of Physics and Astronomy, The University of Manchester
17th September 2015
Abstract
In an attempt to understand how the gender of a lecturer affected the feedback received,
three years' worth of student teaching evaluations were collected from the School of Physics and
Astronomy at the University of Manchester and analysed. Male and female lecturer feedback was
compared with a word rate analysis, and a list of 16 words used significantly more often was produced
for each gender. As a test of role congruity theory, the same rate analysis was performed to produce
lists comparing small and large class sizes, which were then cross-referenced against a dictionary
of agentic and communal stereotyped words. There were 0 words of either type in the small against
large male lecturer analysis, and 2 communal words, one in each class size, in the female lecturer
analysis. Using a program called ‘SentiStrength’, all free-text feedback comments were analysed
for their positive and negative sentiments. The results indicated that students have been more
positive towards their male lecturers than their female lecturers, but are equally, mildly negative
towards both genders.
Contents
1 Introduction
2 Data Collection
3 Rate Analysis
   3.1 Method
   3.2 Results
   3.3 Discussion
4 Sentiment Analysis
   4.1 Method
   4.2 Results and Analysis
   4.3 Discussion
5 Overall Discussion and Conclusion
6 Improvements and Future Developments
Appendices
   I Censored Comments
   II Sentiment Analysis Tables
1 Introduction
In discussions of lecturers' career progression it is common, amongst other things, for student
evaluation of teaching (SET) forms to be a major contributor to decision making. These SETs are
subjective judgements of teaching made by students. There have been many studies assessing whether
this opens up university teaching staff to career-restricting biases[1][2][3]. These studies have shown
that a significant gender bias exists not only in numerical ratings, but also in the language used by students.
This bias attaches to the perceived gender of the professor, rather than their actual gender. This
has been shown through online courses in which some students were misinformed about the gender of
their teaching staff and then asked to complete SETs. Results indicated that the quality of the teaching
was roughly equal, but students rated professors they perceived to be female lower than their male
counterparts[4]. A more recent project by Ben Schmidt[5] analysed the frequency of words differentiated
by gender and subject for professor reviews left on the website www.ratemyprofessor.com.
Another area of research has been on ‘Role Congruity Theory’[6]. This states that when individuals
enter social interactions, there are implicit social assumptions about what roles others will play. For
men, this tends to be implicit assumptions that they will conform to the ‘agentic’ type, where people
are more assertive, analytical and authoritative. For women, the opposite is implicitly assumed: the
'non-agentic' (communal) type, where people are more affectionate, understanding and sensitive. When these
assumptions are broken, such as when a woman teaches a large lecture class and is thus required to
be authoritative[6], this can result in negative student reactions.
We decided that it would be of benefit to study the SETs for the School of Physics and Astronomy
at the University of Manchester to see whether similar results could be found. The SETs used by
Manchester ask for ordinal numeric evaluations along with qualitative responses about the course and
lecturer.
2 Data Collection
The School of Physics and Astronomy has found that the best way to get a large number of responses
is to ask lecturers to hand out the forms in a lecture as the course is coming to an end. Thus,
the forms are completed by hand, once the students have had around 10 weeks of teaching.
The School then publishes the SET results on its website (https://www.teaching.physics.manchester.ac.uk/GEN_INFO/QUESSUMM/INDEX.HTM). As time has progressed, the
format of the questionnaire and the way the results are presented have changed. We decided to base
our study on results from the academic years 2011-12 to 2013-14, due to the ease with which
the data could be collected, and because these were the only years for which we could be certain that full information was available.
SETs contain a varying number of ordinal numerical questions depending on the year; however,
these provided no insight into the language used by students, so they were not included in the analysis
in favour of focusing on the qualitative, free-text questions. Additional data provided includes the
number of students providing feedback as a percentage of enrollees, which was used to calculate the
class sizes.
For the qualitative sections, three questions are always asked with the same general basis, but
they differ slightly in wording for 2011-12; see Table 1 for a breakdown of the questions. Although
it would appear that only question 3.2 is relevant to the lecturer, as these are free-text
questions students tended to comment on the lecturer across all three.
Question No. | 2011-12 Wording | After 2011-12 Wording
3.2 | What did you like the most about this lecturer's approach to teaching? | What aspects of the lecturer's approach to teaching best helped your learning?
4.1 | Please provide us with details of what you enjoyed about this course unit | Please provide details of what you valued about this unit
4.2 | Please provide us with details of what you think could be improved on this course unit | Please provide details of what you think could be improved on this unit

Table 1: The variation in the qualitative, written feedback questions in the SETs.
To obtain the class sizes, a simple script was written to scrape the text from the PDF files, and
then convert them into a usable form.
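A minimal sketch of such a script, assuming poppler's pdftotext is installed and that each report states a response count and a response rate; the field labels used here are hypothetical:

import re
import subprocess

def class_size(pdf_path):
    # Dump the report's text to stdout using poppler's pdftotext.
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout
    # Hypothetical field labels: the real reports give the number of
    # respondents and the response rate as a percentage of enrollees.
    responses = int(re.search(r"Responses:\s*(\d+)", text).group(1))
    rate = float(re.search(r"Response rate:\s*([\d.]+)\s*%", text).group(1))
    # Class size follows from respondents / (response rate as a fraction).
    return round(responses / (rate / 100.0))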
The written comments were typed up by hand, and as a result a small number had mistakes
in the transcribed versions. These have been rectified to the best of our knowledge, but small
errors may still persist. A small percentage of comments were not included in the
analysis, either due to illegible handwriting, or because the quality of the scanning process
rendered them unreadable.
As the school has only a small percentage of female teaching staff, it was decided that for the
qualitative sections we would type up all of the female-led lecture courses.
By grouping the comments from students in the same academic semester and at the same stage of
their university careers, and selecting a comparable number of comments left for male and female
lecturers within each group, we minimised the bias from different cohorts and from changes in
response style as students passed through the university. These groups will be referred to simply as
'cohorts'; note that a cohort does not persist over multiple academic years.
To select the comments left for male lecturers (henceforth referred to as 'male comments', and
similarly 'female comments'), the scanned images were taken from the PDFs of all of the male-led
courses and a random selection was chosen to be typed up, ensuring the total number followed
the cohort rules described above. In completing a 'sanity check' of the script that chose the images, we
noticed that some of the images contained slightly more text than appeared in the published PDF
documents: censorship had been applied, mainly to remove certain personal comments and profanities. As the
nature of this study requires us to look at personal comments, we decided to continue without the
censorship, and to go back through the previously typed women's comments and check them using
this new method. Please see Appendix I for examples of comments that were censored.
3 Rate Analysis
3.1 Method
For our analysis of the rates of the words used, we first conducted a word count for each unique
word within the comments. Having read through this list, we realised that many of the words were not
relevant to the lecturer (e.g. words about examples classes, or the times of the lectures), and thus it was
decided to produce a reduced analysis as well, using only comments that were about the lecturer
directly. As question 3.2 was always about the lecturer, its comments were always included. As can
be seen from Table 1, questions 4.1 & 4.2 were not directly about the lecturer; we therefore ran a
script that only took comments from these questions into the reduced analysis if they contained the
lecturer's name or a pronoun, as sketched below. Although this brings in words from sections of a longer comment
that are not entirely about the lecturer, it was agreed that it would be better to keep some extraneous
information within the reduced analysis than to inaccurately remove relevant information.
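A minimal sketch of this filter, assuming comments arrive as plain strings; the pronoun list here is restricted to third-person singular forms:

import re

PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def keep_for_reduced_analysis(comment, lecturer_surname):
    # Tokenise into lower-case words, dropping punctuation.
    words = set(re.findall(r"[a-z']+", comment.lower()))
    # Keep the comment if it names the lecturer or contains a pronoun.
    return lecturer_surname.lower() in words or bool(words & PRONOUNS)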
First we split the words into those used for male lecturers and those used for female lecturers. For
each word, we assumed that its usage, as a sample of the overall comment population, would adhere
to a Poisson distribution with its frequency as the mean. From this, we could use the Poisson error
given in Equation (1),

    σ = √µ,    (1)

where σ is the error and µ is the frequency. To get the error on the rate, we took this error on the
total count and propagated it through the rate calculation.
To then determine whether there was a significant difference for each word, we used the formula of
Equation (2),

    Significance = (Cm − Cw) / √(σCm² + σCw²),    (2)

where Cm and Cw are the counts for men and women respectively, and σCm and σCw are the errors on
those counts. We decided that a difference of more than 3σ would be significant enough to discuss.
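In code, Equations (1) and (2) reduce to a few lines; since the Poisson variance equals the mean, σ² for each count is the count itself. This sketch works on raw counts; the same form applies to the per-1000-word rates once the count errors have been propagated through the rate calculation:

import math

def significance(count_m, count_w):
    # Equation (2) with Poisson errors from Equation (1):
    # sigma^2 = mu, so the combined error is sqrt(count_m + count_w).
    return (count_m - count_w) / math.sqrt(count_m + count_w)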
In order to compare the words used in small and large teaching environments,
we took two samples from the larger pool of comments used in the overall rate analysis. The
comments for the small environments were taken from lecture courses with fewer than 89 respondents;
whilst relatively arbitrary, this threshold gave a selection of optional courses that not all
students were required to take and that are lectured in the smaller theatres. The comments for the large lecturing
environments were chosen from courses with over 242 responses in the SETs. These "core courses",
which all students are required to take, are given in the largest theatres in the university. This
comparison is not the same as in other works[6], which suggested a bias would exist in seminars vs.
lecture theatres, so this study should provide a test of whether the effect is strong enough to affect
the words used by students based on just the size of the lecture theatre. The words in the list were
then compared with a dictionary of agentic and communal words, built from an amalgamation of
words from other studies[7][8][9].
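A sketch of the cross-referencing step; the word lists below are short hypothetical stand-ins for the dictionary amalgamated from [7][8][9]:

# Hypothetical stand-ins for the agentic/communal dictionary of [7][8][9].
AGENTIC = {"assertive", "confident", "independent", "ambitious", "analytical"}
COMMUNAL = {"friendly", "approachable", "supportive", "warm", "helpful"}

def stereotype_hits(significant_words):
    # Return the significant words that appear in each dictionary.
    words = {w.lower() for w in significant_words}
    return {"agentic": words & AGENTIC, "communal": words & COMMUNAL}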
The words used in the small and large lecture courses were compared for each gender, in the same
way as in the reduced overall rate analysis. As there was only a small number of women in the analysis,
we wanted to remove words that were used predominantly for one lecturer, in order to
avoid words describing the characteristics of an individual woman rather than women in general. If a
word was used 90% of the time or more for one female lecturer it was removed from the results. This
"max. percent from a single lecturer" statistic has been recorded for every female-significant word.
Additionally, the percentage distribution of word usage across the questions was recorded, in an attempt
to measure whether an individual word was used in a positive (questions 3.2 and 4.1) or negative (question
4.2) context.
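The 90% rule itself reduces to a simple check over per-lecturer usage counts, sketched here:

def passes_90_percent_rule(per_lecturer_counts):
    # per_lecturer_counts: how often each female lecturer attracted the word.
    total = sum(per_lecturer_counts)
    # Reject the word if one lecturer accounts for 90% or more of its usage.
    return total > 0 and max(per_lecturer_counts) / total < 0.9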
3.2 Results
Word | Significance | Rate/1000 Words (Male) | Rate/1000 Words (Female) | Max. % from a Single Lecturer | Q3.2 (%) | Q4.1 (%) | Q4.2 (%)
handouts | 11.761 | 0.65±0.2 | 10.7±0.83 | 44 | 97 | 2 | 1
examples | 8.126 | 14.8±0.94 | 28.26±1.36 | 44 | 95 | 1 | 4
online | 4.771 | 6.75±0.63 | 11.92±0.88 | 31 | 96 | 1 | 3
mix | 4.35 | 0.18±0.1 | 1.67±0.33 | 72 | 100 | 0 | 0
organised | 4.164 | 2.11±0.35 | 4.87±0.56 | 22 | 100 | 0 | 0
interactive | 4.006 | 0.18±0.1 | 1.47±0.31 | 72 | 100 | 0 | 0
printed | 3.899 | 0.12±0.08 | 1.28±0.29 | 40 | 95 | 5 | 0
notes | 3.804 | 16.85±1 | 22.87±1.22 | 30 | 94 | 1 | 5
were | 3.662 | 7.22±0.65 | 11.15±0.85 | 31 | 94 | 2 | 4
combination | 3.638 | 0.23±0.12 | 1.41±0.3 | 55 | 91 | 5 | 5
mistakes | 3.401 | 0.29±0.13 | 1.41±0.3 | 77 | 40 | 0 | 60
teachweb | 3.395 | 0.06±0.06 | 0.9±0.24 | 88 | 93 | 0 | 7
lots | 3.376 | 2.99±0.42 | 5.45±0.59 | 32 | 93 | 1 | 6
clear | 3.23 | 8.98±0.73 | 12.75±0.91 | 31 | 98 | 1 | 2
helpful | 3.166 | 3.52±0.46 | 5.96±0.62 | 30 | 96 | 2 | 2
worked | 3.146 | 1.41±0.29 | 3.08±0.45 | 49 | 91 | 0 | 9

Table 2: Words used significantly more often in student comments for female lecturers,
once the reduced analysis had been performed. A word is significant when the difference
between its usage for male and female lecturers exceeds 3σ. Also shown are the rate per
1000 words in both male and female comments, the maximum percentage of a word's usage
attributable to one specific lecturer, and the percentage distribution of each word across
the questions of the SETs.
3.3 Discussion
The reduced overall rate analysis, shown in Table 2 and Table 3, found 16 significant words for
each gender, with a further 10 words removed for female lecturers by the 90% rule. For male lecturers,
many of the significant words are associated with the male stereotype (e.g. 'jokes', 'funny', 'humour'),
whereas 'organised' was the only equivalent word for female lecturers.
The effect of the different sampling methods used for each gender shows up in the "max. percent
from a single lecturer" statistic. As there were so few women in our analysis, the range of usage seen
was much larger for their sample, suggesting that some of the significant words were in the list due to
a specific individual rather than the gender of the lecturer. This was alleviated using the 90% rule
discussed in Section 3.1, which removed obviously erroneous words (for example, one lecturer was fond of
giving out cream eggs during lectures; the words 'cream', 'eggs' and 'egg' were all removed for
being used 100% of the time for that lecturer). However, the 90% bar was set fairly arbitrarily, and
words with high usage by a single lecturer should be treated with caution.
Word | Significance | Rate/1000 Words (Male) | Rate/1000 Words (Female) | Max. % from a Single Lecturer | Q3.2 (%) | Q4.1 (%) | Q4.2 (%)
funny | 4.237 | 1.7±0.32 | 0.26±0.13 | 38 | 93 | 7 | 0
interesting | 3.891 | 5.87±0.59 | 3.01±0.44 | 36 | 80 | 15 | 5
subject | 3.825 | 4.17±0.5 | 1.86±0.35 | 37 | 86 | 8 | 6
write | 3.78 | 1.47±0.29 | 0.26±0.13 | 34 | 52 | 5 | 43
jokes | 3.74 | 0.82±0.22 | 0±0 | 31 | 79 | 21 | 0
accent | 3.74 | 0.82±0.22 | 0±0 | 43 | 93 | 0 | 7
enthusiasm | 3.703 | 3.7±0.47 | 1.6±0.32 | 33 | 87 | 10 | 3
previous | 3.635 | 1.12±0.26 | 0.13±0.09 | 43 | 90 | 0 | 10
often | 3.574 | 1.23±0.27 | 0.19±0.11 | 35 | 33 | 0 | 67
humour | 3.498 | 1.59±0.31 | 0.38±0.16 | 41 | 86 | 14 | 0
blackboard | 3.41 | 5.46±0.57 | 3.01±0.44 | 35 | 89 | 3 | 8
because | 3.316 | 0.65±0.19 | 0±0 | 33 | 60 | 0 | 40
enthusiastic | 3.293 | 3.88±0.48 | 1.92±0.35 | 37 | 98 | 2 | 0
entertaining | 3.073 | 0.88±0.23 | 0.13±0.09 | 38 | 93 | 7 | 0
down | 3.041 | 1.7±0.32 | 0.58±0.19 | 36 | 61 | 4 | 36
difficult | 3.008 | 2.58±0.39 | 1.15±0.27 | 46 | 56 | 15 | 29

Table 3: Similar to Table 2, now for words considered significant for male lecturers.
Class Size | Word | Significance | Rate/1000 Words (Small) | Rate/1000 Words (Large) | Q3.2 (%) | Q4.1 (%) | Q4.2 (%)
Small | which | 3.626 | 4.26±0.93 | 0.67±0.34 | 90 | 0 | 10
Small | course | 3.326 | 9.12±1.37 | 3.86±0.81 | 82 | 3 | 15
Small | well | 3.259 | 9.73±1.41 | 4.36±0.86 | 98 | 1 | 1
Large | summaries | 3.096 | 0.2±0.2 | 2.18±0.61 | 100 | 0 | 0
Large | clear | 3.158 | 4.86±0.99 | 10.07±1.31 | 98 | 0 | 2
Large | explaining | 3.173 | 0.61±0.35 | 3.19±0.73 | 92 | 2 | 7
Large | derivations | 3.568 | 1.01±0.45 | 4.53±0.87 | 93 | 0 | 7
Large | explanations | 4.336 | 1.01±0.45 | 5.71±0.98 | 97 | 0 | 3
Large | thought | 4.507 | 0.2±0.2 | 4.03±0.82 | 100 | 0 | 0
Large | blackboard | 4.957 | 2.63±0.73 | 10.07±1.31 | 92 | 2 | 7

Table 4: Results for words used in comments for male lecturers, comparing small and
large class sizes, with rates for small and large classes and the percentage question
distribution.
Class Size | Word | Significance | Rate/1000 Words (Small) | Rate/1000 Words (Large) | Max. % from a Single Lecturer | Q3.2 (%) | Q4.1 (%) | Q4.2 (%)
Small | approachable | 3.912 | 2.84±0.52 | 1.09±0.29 | 60 | 90 | 7 | 3
Small | classes | 3.159 | 0.85±0.28 | 0±0 | 60 | 100 | 0 | 0
Large | how | 3.057 | 0.28±0.16 | 1.8±0.38 | 70 | 91 | 5 | 5
Large | friendly | 3.303 | 0.19±0.13 | 1.64±0.36 | 76 | 100 | 0 | 0
Large | worked | 3.429 | 0.66±0.25 | 3.12±0.49 | 65 | 92 | 0 | 8
Large | were | 4.084 | 3.22±0.55 | 9.76±0.88 | 35 | 94 | 1 | 6
Large | online | 5.393 | 2.84±0.52 | 10.77±0.92 | 51 | 96 | 1 | 3
Large | handouts | 5.548 | 2.56±0.49 | 10.31±0.9 | 41 | 98 | 2 | 0

Table 5: Similar to Table 4, but with words obtained from feedback for female lecturers.
Another unexpected finding from the analysis was the difference in which questions the words
came from. For female lecturers, only 1 out of 16 words wasn’t used 90% of the time in question 3.2,
whereas for men 11 out of 16 words were below this threshold. There are two possible explanations
for this effect: firstly, given the assumption that question 3.2 leads to words used positively, the
unique words people use to describe women are used more often in a positive context than for men.
The second possible reason is that words were only taken from comments in questions 4.1 and 4.2
if they contained a pronoun (or the lecturer's name). For our entire sample the rate of pronoun usage
was twice as high for men as for women, and as only words from the latter two questions were filtered
on pronoun usage, this could explain why we see such a striking difference between genders. This
difference in pronoun rate implies that pronouns may not be the best indicator of whether a particular
comment is talking about the lecturer or some other aspect of the course.
In the small vs. large class size analysis there were 0 words from the agentic and the communal
dictionaries in the male table, and 2 communal words in the female list. The two words ('approachable'
and 'friendly') were split between class sizes, with 'approachable' coming from the small classes and
'friendly' from the large. This split, and the fact that only two agentic/communal words were found to be significant,
imply that just the change in class size is not enough to affect the feedback in terms of agentic or
communal language, as opposed to seminars vs. lectures discussed in other studies[6].
The problem of a small sample size was magnified by further splitting it. There were only 3
unique women in the small classes data set, which led to 5 of the 7 words found being removed due
to the 90% rule. This makes it hard to say whether the difference in number of unique words between
Table 4 and Table 5 was due to the difference in genders, or the small number of female lecturers in
the analysis.
4 Sentiment Analysis
4.1 Method
To complete a sentiment analysis, a program called SentiStrength (available at
http://sentistrength.wlv.ac.uk/) was used. The main ideas
behind SentiStrength are based on psychological research that has shown we process both positive
and negative sentiment in parallel[10]. The software allows for a score to be assigned to any short
informal text (in our case a single comment) for both its positive and negative sentiment. It was
decided that it would be of benefit to look into whether there was any variation in the sentiment of
the language used by students for male and female lecturers.
SentiStrength analyses each word within a comment, and then assigns it a
value for both positive and negative sentiment on a scale of 1 to 5, with 1 being completely neutral,
and 5 being very strongly positive/negative (with negative sentiments denoted by a ‘−’, or referred
to as a ‘minus score’). It then takes into account negating words such as ‘not’, which would turn
a positive into a negative and vice versa, and ‘boosters’ such as ‘very’, which would amplify the
score of the following word as appropriate. For example, the word ‘good’ was given a sentiment
score of +2 and ‘very’ was a booster of +1. To show this methodology, a comment such as: “The
lectures were very good, but the maths wasn’t good!” would then return the following output, where
‘1 Last Word Booster’ indicates a sentiment score boosted by 1 acting on the previous word:
Positive emotion rating: 3
Negative emotion rating: -2
The[0] lectures[0] were[0] very[0] good[2] [1 Last Word Booster]
but[0] the[0] maths[0] wasn't[0] good[2] [Last Word Negator]
[[Pos, Neg Sentence = +3, -2]]
To calibrate SentiStrength to work with our feedback comments, a list of all unique words was
created. Three people were then asked to independently score each of the words using the scale
described above, whilst also identifying any negators and boosters. Where words were identified as
having the same score by all parties, they were assigned that value. For the remaining 820 words, the
value to be assigned was discussed and agreed upon by all parties.
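A sketch of this consensus step, assuming each coder's ratings arrive as a dictionary mapping words to integer scores:

def consensus_scores(coder_a, coder_b, coder_c):
    agreed, disputed = {}, []
    for word, score in coder_a.items():
        if coder_b[word] == score and coder_c[word] == score:
            agreed[word] = score   # unanimous: assign the value directly
        else:
            disputed.append(word)  # discussed and agreed upon manually
    return agreed, disputed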
To check that the assigned scores were accurate in practice, 1000 comments were selected at
random from the total, and the same three people rated each comment from 1 to 5 for both positive
and negative sentiment. The median was taken as the group's average result and was used in checking
the calibration. To check the calibration we ran a 10-fold cross-validation, in which 90% of the
comments were used to adjust the weights of each word and the remaining 10% were used to calculate
the accuracy. A different 10% was then selected and the process run again, until all parts of the 1000
comments had been used.
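A sketch of the fold bookkeeping, where tune_weights and score_accuracy are hypothetical stand-ins for SentiStrength's weight-adjustment and evaluation steps:

def ten_fold_accuracy(comments, human_scores):
    data = list(zip(comments, human_scores))
    fold = len(data) // 10
    accuracies = []
    for i in range(10):
        test = data[i * fold:(i + 1) * fold]             # held-out 10%
        train = data[:i * fold] + data[(i + 1) * fold:]  # remaining 90%
        weights = tune_weights(train)                    # hypothetical helper
        accuracies.append(score_accuracy(weights, test)) # hypothetical helper
    return sum(accuracies) / len(accuracies)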
Please see Table 6 for the full output of our cross-validation. From there, we attempted to optimise
the results as far as possible by going through comments whose scores differed significantly
between the human interpretation and SentiStrength's prediction. On completion we found an
accuracy of 73.27% for the negative sentiment and 57.26% for the positive sentiment (Acc±). Although this
sounds quite low, especially for the positive sentiment, the correlation scores for both sentiments
(Corr±) are quite high (0.6272 and 0.6623 respectively) and far above 0.4, a value the creators of
SentiStrength describe as 'excellent' (see Step 3 at
http://sentistrength.wlv.ac.uk/documentation/language_changes.html). This shows that we have
produced a relatively accurate calibration and that the program is able to replicate human opinion to a
sufficient degree.
Another thing to consider is that humans themselves inherently disagree about how the
sentiment of a single comment should be scored, so for a computer to accurately model human
sentiment scoring, it too should carry a certain level of uncertainty. In fact, all 3 human coders
agreed unanimously only 42.7% of the time in their positive scoring and 78.6% of the time in their
negative scoring of the 1000 comment sample.
Corr+ | Corr- | Acc+ | Acc- | AccWithin1+ | AccWithin1- | MeanAbsErr+ | MeanAbsErr-
0.6272 | 0.6623 | 57.26% | 73.27% | 86.09% | 89.09% | 0.4015 | 0.2518

Table 6: The results of our 10-fold cross-validation test on the calibration of SentiStrength.
From Table 6, we can further see that the accuracy increases greatly, to almost 90%, when the
program is allowed to vary the sentiment score by 1 in either direction (AccWithin1±). The last
values in the table (MeanAbsErr±) show the mean absolute errors on the predictions of the positive
and negative sentiment values.
4.2 Results and Analysis
We ran all typed comments through the now-calibrated program and extracted the positive and
negative sentiment scores for each comment. After splitting the comments into the 3 individual
questions outlined in Table 1, the average sentiment for those comments was calculated for male and
female lecturers on both positive and negative scales. These averages are displayed in Figure 1 with
positive sentiment on the top half, negative on the bottom and the data relative to the full 1-5 range
on the left, followed by the scale reduced to see the finer details on the right.
The errors on these averages were calculated from the mean absolute errors displayed in Table 6,
and are plotted at the 3σ limit so as to be conservative about any claimed discrepancy between
genders, akin to the 3σ threshold of Section 3.1. The data for each average and its associated error
are displayed in Table 8 in Appendix II.
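A sketch of this aggregation, assuming each scored comment is a (question, gender, positive, negative) tuple, and estimating the error on each mean as MeanAbsErr/√N, an assumption broadly consistent with the 1σ errors quoted in Table 8:

import math
from collections import defaultdict

def average_sentiment(records, mean_abs_err):
    # records: iterable of (question, gender, pos_score, neg_score) tuples.
    groups = defaultdict(list)
    for question, gender, pos, neg in records:
        groups[(question, gender)].append((pos, neg))
    out = {}
    for key, scores in groups.items():
        count = len(scores)
        avg_pos = sum(p for p, _ in scores) / count
        avg_neg = sum(n for _, n in scores) / count
        limit = 3 * mean_abs_err / math.sqrt(count)  # 3 sigma error limit
        out[key] = (avg_pos, avg_neg, limit)
    return out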
[Figure 1 image: four panels plotting average sentiment against question number (3.2, 4.1, 4.2) for male and female lecturers with 3σ error limits; the top row shows positive sentiment over the full 1-5 range and zoomed to 1.6-2.6, the bottom row shows negative sentiment over the full −5 to −1 range and zoomed to −1.6 to −1.]
Figure 1: The average positive and negative sentiment for both male and female lecturers,
split into questions. Also included are 3σ error limits. The top and bottom layers are
positive and negative sentiment respectively. Both graphs on the right show the zoomed-in
areas denoted by the dashed lines.
As can be seen in Figure 1, the average sentiment for questions 3.2 and 4.1 appears mildly
positive overall, with very little negative sentiment (close to neutral at -1), whereas the sentiment for
question 4.2 is more negative than positive, with the strength of both sentiments approximately
equal at around 1.65. Looking instead at gender, the average positive sentiment for feedback given to
male lecturers is significantly stronger in all 3 questions than for female lecturers. Conversely, there is
no difference between the average negative sentiment of genders, with the 3σ error limits overlapping
for all questions.
To investigate any possible trends over time (and possibly any inconsistent year groups), the
sentiment scores were grouped into cohorts and an average calculated for both positive and negative
scales. These are displayed in Figure 2 and Figure 3 respectively.
[Figure 2 image: average positive sentiment (axis range 1 to 3.5) against cohort, from 2011/12 Y1/S1 to 2013/14 Y1/S2, for male and female lecturers with 3σ error limits.]
Figure 2: The average positive sentiment for both male and female lecturers, split into
individual academic year, undergraduate year (Y) and semester (S).
[Figure 3 image: average negative sentiment (axis range −3.5 to −1) against cohort, from 2011/12 Y1/S1 to 2013/14 Y1/S2, for male and female lecturers with 3σ error limits.]
Figure 3: The average negative sentiment for both male and female lecturers, split into
cohorts of individual academic year, undergraduate year (Y) and semester (S).
Again, 3σ error limits have been plotted, and the number of significant differences between genders
for the same cohort can be seen in Table 7. The data suggests that of the 13 cohorts studied,
approximately half (46% on the positive scale and 54% on the negative) commented with sentiment
no different between the genders; 46% of cohorts were more positive when leaving feedback for their
male lecturers (compared with 8% for female lecturers), and 31% were more negative towards their
female lecturers (compared with 15% towards male lecturers).
Stronger For: | Neither | Male | Female
Positive Sentiment | 6 (46%) | 6 (46%) | 1 (8%)
Negative Sentiment | 7 (54%) | 2 (15%) | 4 (31%)

Table 7: The number of cohorts that commented with stronger sentiment towards their
male lecturers, their female lecturers, or neither gender. Each number is followed, in
brackets, by its percentage of the total number of cohorts.
In an attempt to further understand how the sentiment of comments was distributed, we plotted
a normalised bar graph of the number of comments with each sentiment score, on both positive
and negative scales, as a percentage of the total number of comments left as feedback. The counts are
split again by question number and plotted for male and female lecturers in Figure 4, with the
constituent percentages available in Table 10 in Appendix II.
Figure 4: Each sentiment value from 1/−1 to 5/−5 as a percentage of total positive and
negative sentiment. The counts are split into the 3 text questions akin to Figure 1, with
male lecturers on the left hand side and female lecturers on the right hand side.
Again, the data suggests that students are far more likely to leave positive comments than negative
comments across all questions, even if question 4.2 leans more negative than questions 3.2
or 4.1. This 3.2/4.1 versus 4.2 split is also reflected in questions 3.2 and 4.1 receiving a very small
proportion of comments with negative sentiment (7.67%/6.56% for male and 6.92%/8.71% for female
lecturers in questions 3.2/4.1 respectively), as opposed to question 4.2 receiving notably more for
both genders (28.58%/30% for male/female).
Also note that a positive sentiment score of 1 (i.e. neutral sentiment) has a higher percentage for
every female question than for the corresponding male question, signifying a higher percentage of
positivity towards male lecturers. The converse is true for negativity in questions 4.1 and 4.2, with only
question 3.2 having a slightly larger percentage of negative sentiment left for male lecturers than for
their female counterparts.
As neutral sentiment dominated the responses, it was excluded from the total count and the
remaining, non-neutral sentiment comments were re-normalised to 100% and are displayed in Figure 5.
Figure 5: The same as Figure 4, but with all neutral sentiments excluded, to further
convey the range of non-neutral sentiment in feedback comments.
The highest non-neutral percentage is always 3/−3 for positive/negative sentiments respectively,
indicating that when students were either being positive or negative, they were doing so only mildly.
In all but the negative sentiment of question 4.2 for female lecturers, the smallest percentage of
comment sentiment is 5/−5 followed by 4/−4, showing that students rarely commented with very
strong positive or negative sentiment.
4.3 Discussion
By attempting to extract the sentiment of the comments, we observed a few overall trends. Students'
comments are much more likely to contain positive sentiment than negative sentiment for either
gender, but are on average more positive towards their male lecturers and more negative towards
their female lecturers. However, students are more likely not to be negative (i.e. to be neutral) whilst
being generally positive, which could be due to the positive-leaning wording of the questions as defined in
Table 1. Any perceived sentiment also tends to be mild, with very small percentages of comments
containing very strong sentiment.
5 Overall Discussion and Conclusion
The overall rate analysis found a definite difference in the language used by students
leaving feedback in the SETs, related to the gender stereotype of the lecturer. However, a rate
analysis alone does not provide the context in which these words are used, nor how this affects the
overall perception of a lecturer in their SETs. This issue can be addressed by analysing the sentiment
of the feedback comments. The analysis seems to indicate that students are more positive towards
their male lecturers than their female counterparts, whilst negative feedback for both genders tends
to be equally neutral. Whether these phenomena are due to the limited sample size, inherent biases
held by students or something else entirely is a topic of further research.
One of the main limitations of the study was the number of female-taught courses compared to
male-taught courses (27 versus 123), which limited the number of comments we could analyse and
the variety of courses within our samples. With the Poisson error presented in Equation (1), this
small sample also led to relatively large errors, which a larger sample size would have reduced.
Another limitation might be the gender of the students leaving the feedback themselves. With
female students making up around 15%-20% of enrolment[11], it would be useful to
obtain the gender breakdown of commenters, to see whether the gender of the student and the gender of the
lecturer together affected the language used in the feedback.
Something to consider could also be the effect of any group phenomenon in classes. Students may
be more likely to join in with the behaviour and attitudes of their peers, where a vocal minority
supersedes the impressionable majority[12].
6 Improvements and Future Developments
To resolve the sample size issues, ideally as much feedback as possible would be collected from a
range of different departments, both throughout the University of Manchester and possibly further
afield, to see how students' feedback changes across a wider range of commenters and over time.
Due to the pronoun selection used in the rate analysis, and the fact that the rate of pronoun usage
was twice as high for men, we removed significantly more comments for women than for men.
If future studies into word usage in SETs want to avoid this cross-contamination of comments, they
should either only use data from questions that ask specifically about the lecturer, or devise a better
method than pronoun tracking to differentiate comments about the lecturer from comments about
other aspects of the course.
Further refining the accuracy and methodology of the SentiStrength program, by honing the word
lists used and iterating the calibration process (or even using a more sophisticated approach to
sentence-level sentiment analysis), would also make the results reflect the true sentiment of
commenters more accurately.
Many different parameters make up a linguistic analysis of feedback. Further research
could combine rate and sentiment analysis with additional surveys and metadata on
the students leaving the feedback, to get a more accurate and true representation of the inherent
biases held.
References
[1] T. L. Lueck, K. L. Endres, and R. E. Caplan, “The interaction effects of gender on teaching
evaluations,” Journalism Educator, vol. 48, no. 3, pp. 46–54, 1993.
[2] K. J. Anderson and G. Smith, "Students' preconceptions of professors: Benefits and barriers
according to ethnicity and gender," Hispanic Journal of Behavioral Sciences, vol. 27, no. 2,
pp. 184–201, 2005.
[3] S. A. Basow, “Student evaluations of college professors: When gender matters,” Journal of
Educational Psychology, vol. 87, no. 4, 1995.
[4] L. MacNell, A. Driscoll, and A. Hunt, "What's in a name: Exposing gender bias in student
ratings of teaching," Innovative Higher Education, vol. 40, no. 4, pp. 291–303, 2015.
[5] B. Schmidt, "Gender and teaching reviews." Available at: http://benschmidt.org/profGender/.
Last accessed: 17th September 2015.
[6] L. Martin, “Gender issues and teaching,” American Political Science Association 2013 Annual
Meeting, 2013.
[7] A. B. Diekman and A. H. Eagly, “Stereotypes as dynamic constructs: Women and men of the past,
present, and future,” Personality and Social Psychology Bulletin, vol. 26, no. 10, pp. 1171–1188,
2000.
[8] W. Wood and A. Eagly, “Two traditions of research on gender identity,” Sex Roles, pp. 1–13,
2015.
[9] A. E. Abele, “The dynamics of masculine-agentic and feminine-communal traits: findings from a
prospective study,” J Pers Soc Psychol, vol. 85, pp. 768–776, Oct 2003.
[10] R. Berrios, P. Totterdell, and S. Kellett, “Eliciting mixed emotions: A meta-analysis comparing
models, types and measures.,” Frontiers in Psychology, vol. 6, no. 428, 2015.
[11] School of Physics and Astronomy, "Undergraduate Admissions Report 2013," p. 8. Available at:
https://www.teaching.physics.manchester.ac.uk/committe/minutes/Board/2013_10_23/Admissions%20Report%202013.pdf.
Last accessed: 17th September 2015.
[12] S. Reicher, The Psychology of Crowd Dynamics, pp. 182–208. Blackwell Publishers Ltd, 2008.
Appendices
I Censored Comments
Please see below for a list of the censored comments found for female lecturers. As these are often
personal and include the lecturers' names, all names have been changed to Jill. The parts that were
censored are shown in italics.
• Wasn’t very challenging, and a lot of stuff we’d already covered in EM/Waves courses. However
the new stuff was good. More jumpers. [Picture of a jumper]
• If a mind bogglingly hard tutorial question will be set provide a similar example because oh my
god stress.
• Jill’s lecturing style is engaging, awesome and covers topics at an in depth and easy to understand
way. Jill’s bitchin!
• Clear, defined structure. Long winded derivations on printouts, generally very good handouts.
Sense of style (Both in teaching, and fashion)
• Thanks Jill. Your a babe!
• Good style. Extremely good clothing, but mainly teaching. Good mixture between lectures and
online notes. Good balance between theory and practical applications. Always happy.
II Sentiment Analysis Tables
Question: | 3.2 | 4.1 | 4.2
Male Positive | 2.503 | 2.360 | 1.729
Male Negative | -1.150 | -1.119 | -1.591
Male error | 0.011 | 0.013 | 0.013
Female Positive | 2.293 | 2.234 | 1.658
Female Negative | -1.130 | -1.157 | -1.609
Female error | 0.008 | 0.009 | 0.009

Table 8: The average positive and negative sentiment for both male and female lecturers,
split into questions 3.2, 4.1 and 4.2. Also included is the 1σ error for each question, which
applies to both positive and negative sentiment averages.
Cohort | Male Positive | Male Negative | Male error | Female Positive | Female Negative | Female error
2011/12 Y1/S1 | 1.934 | -1.195 | 0.018 | 1.764 | -1.377 | 0.011
2011/12 Y1/S2 | 2.221 | -1.226 | 0.016 | 2.027 | -1.409 | 0.014
2011/12 Y2/S1 | 2.125 | -1.216 | 0.043 | 2.213 | -1.187 | 0.015
2011/12 Y2/S2 | 2.090 | -1.192 | 0.045 | 2.305 | -1.232 | 0.028
2011/12 Y3/S1 | 2.350 | -1.342 | 0.018 | 2.198 | -1.148 | 0.028
2011/12 Y3/S2 | 2.290 | -1.234 | 0.020 | 2.514 | -1.389 | 0.030
2012/13 Y1/S2 | 2.317 | -1.300 | 0.027 | 2.235 | -1.188 | 0.027
2012/13 Y2/S1 | 2.012 | -1.404 | 0.031 | 2.075 | -1.188 | 0.014
2012/13 Y2/S2 | 2.833 | -1.143 | 0.062 | 2.224 | -1.159 | 0.013
2012/13 Y3/S1 | 2.363 | -1.277 | 0.018 | 2.158 | -1.500 | 0.020
2012/13 Y3/S2 | 2.435 | -1.333 | 0.039 | 1.966 | -1.230 | 0.027
2012/13 Y4/S1 | 2.293 | -1.377 | 0.039 | 2.350 | -1.330 | 0.025
2013/14 Y1/S2 | 2.308 | -1.339 | 0.050 | 2.141 | -1.125 | 0.031

Table 9: The average positive and negative sentiment for both male and female lecturers, split into the 13 individual cohorts of academic year, undergraduate year (Y) and semester (S). Also included is the 1σ error for each cohort, which applies to both positive and negative sentiment averages.
Sentiment Rating (%) | Male 3.2 | Male 4.1 | Male 4.2 | Female 3.2 | Female 4.1 | Female 4.2
Positive 5 | 2.80 | 1.86 | 0.51 | 1.80 | 1.30 | 0.65
Positive 4 | 17.20 | 12.54 | 4.01 | 11.59 | 10.66 | 1.96
Positive 3 | 33.14 | 38.49 | 23.02 | 31.62 | 32.90 | 22.91
Positive 2 | 21.18 | 14.01 | 12.74 | 24.03 | 20.42 | 11.52
Positive 1 | 25.68 | 33.10 | 59.71 | 30.96 | 34.72 | 62.96
Negative -1 | 92.32 | 93.44 | 71.43 | 93.07 | 91.29 | 70.03
Negative -2 | 1.92 | 2.06 | 6.58 | 2.18 | 2.60 | 6.68
Negative -3 | 4.28 | 3.72 | 14.59 | 3.70 | 5.33 | 15.97
Negative -4 | 1.40 | 0.78 | 6.27 | 0.76 | 0.65 | 7.07
Negative -5 | 0.07 | 0.00 | 1.13 | 0.28 | 0.13 | 0.26
Number of Comments | 1355 | 1021 | 973 | 1053 | 769 | 764

Table 10: Each sentiment value as a percentage of the total number of comments (shown
in the bottom row). The counts are split again into the 3 text questions for both male
and female lecturers.

More Related Content

What's hot

Psychology IA
Psychology IAPsychology IA
Psychology IA
fuyuki31
 
Sigma xi presentation revised
Sigma xi presentation revisedSigma xi presentation revised
Sigma xi presentation revised
mnag56
 
Guidelines in writing items for noncognitive measures
Guidelines in writing items for noncognitive measuresGuidelines in writing items for noncognitive measures
Guidelines in writing items for noncognitive measures
Carlo Magno
 
Effects of Teachers Teaching Strategies and the Academic Performance at Grad...
Effects of  Teachers Teaching Strategies and the Academic Performance at Grad...Effects of  Teachers Teaching Strategies and the Academic Performance at Grad...
Effects of Teachers Teaching Strategies and the Academic Performance at Grad...
Brandon King Albito
 

What's hot (20)

Psychology IA
Psychology IAPsychology IA
Psychology IA
 
Chapter 3
Chapter 3Chapter 3
Chapter 3
 
Test construction tony coloma
Test construction tony colomaTest construction tony coloma
Test construction tony coloma
 
Chapter 3
Chapter 3Chapter 3
Chapter 3
 
Washback paper
Washback paperWashback paper
Washback paper
 
Peer Feedback on Writing: A SoTL work in progress
Peer Feedback on Writing: A SoTL work in progressPeer Feedback on Writing: A SoTL work in progress
Peer Feedback on Writing: A SoTL work in progress
 
Multiple choice-questions
Multiple choice-questionsMultiple choice-questions
Multiple choice-questions
 
Test Reliability and Validity
Test Reliability and ValidityTest Reliability and Validity
Test Reliability and Validity
 
Teacher made tests
Teacher made testsTeacher made tests
Teacher made tests
 
Teacher made tests
Teacher made testsTeacher made tests
Teacher made tests
 
Language testing
Language testingLanguage testing
Language testing
 
Sigma xi presentation revised
Sigma xi presentation revisedSigma xi presentation revised
Sigma xi presentation revised
 
Construction of Tests
Construction of TestsConstruction of Tests
Construction of Tests
 
Effect of scoring patterns on scorer reliability in economics essay tests
Effect of scoring patterns on scorer reliability in economics essay testsEffect of scoring patterns on scorer reliability in economics essay tests
Effect of scoring patterns on scorer reliability in economics essay tests
 
Laos Session 3: Principles of Reliability and Validity (EN)
Laos Session 3: Principles of Reliability and Validity (EN)Laos Session 3: Principles of Reliability and Validity (EN)
Laos Session 3: Principles of Reliability and Validity (EN)
 
Guidelines in writing items for noncognitive measures
Guidelines in writing items for noncognitive measuresGuidelines in writing items for noncognitive measures
Guidelines in writing items for noncognitive measures
 
Writing Test Items
Writing Test ItemsWriting Test Items
Writing Test Items
 
Effects of Teachers Teaching Strategies and the Academic Performance at Grad...
Effects of  Teachers Teaching Strategies and the Academic Performance at Grad...Effects of  Teachers Teaching Strategies and the Academic Performance at Grad...
Effects of Teachers Teaching Strategies and the Academic Performance at Grad...
 
Model Science Enquiry
Model  Science  EnquiryModel  Science  Enquiry
Model Science Enquiry
 
Types of tests in measurement and evaluation
Types of tests in measurement and evaluationTypes of tests in measurement and evaluation
Types of tests in measurement and evaluation
 

Similar to Student Evaluations in Physics - A Study on Gender Bias at the University of Manchester

Table of specification
Table of specificationTable of specification
Table of specification
chungchua17
 
Task Assessment of Fourth and Fifth Grade Teachers
Task Assessment of Fourth and Fifth Grade TeachersTask Assessment of Fourth and Fifth Grade Teachers
Task Assessment of Fourth and Fifth Grade Teachers
Christopher Peter Makris
 
1 Discussion Question Rubric 210 Points Total (30 Poin.docx
1 Discussion Question Rubric  210 Points Total (30 Poin.docx1 Discussion Question Rubric  210 Points Total (30 Poin.docx
1 Discussion Question Rubric 210 Points Total (30 Poin.docx
tarifarmarie
 
EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...
EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...
EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...
ijejournal
 

Similar to Student Evaluations in Physics - A Study on Gender Bias at the University of Manchester (20)

Table of specification
Table of specificationTable of specification
Table of specification
 
TESTA, HEDG Spring Meeting London (March 2013)
 TESTA, HEDG Spring Meeting London (March 2013) TESTA, HEDG Spring Meeting London (March 2013)
TESTA, HEDG Spring Meeting London (March 2013)
 
Task Assessment of Fourth and Fifth Grade Teachers
Task Assessment of Fourth and Fifth Grade TeachersTask Assessment of Fourth and Fifth Grade Teachers
Task Assessment of Fourth and Fifth Grade Teachers
 
Research Presentation keynote (not yet result)
Research Presentation keynote (not yet result)Research Presentation keynote (not yet result)
Research Presentation keynote (not yet result)
 
TESTA, University of Greenwich Keynote (July 2013)
TESTA, University of Greenwich Keynote (July 2013)TESTA, University of Greenwich Keynote (July 2013)
TESTA, University of Greenwich Keynote (July 2013)
 
The Effect of Formative Assessment on English Learning of Higher Vocational ...
 The Effect of Formative Assessment on English Learning of Higher Vocational ... The Effect of Formative Assessment on English Learning of Higher Vocational ...
The Effect of Formative Assessment on English Learning of Higher Vocational ...
 
LAMAO ELEM.pptx
LAMAO ELEM.pptxLAMAO ELEM.pptx
LAMAO ELEM.pptx
 
Fostering Curriculum Development Through Artesol 2010
Fostering Curriculum Development Through Artesol 2010Fostering Curriculum Development Through Artesol 2010
Fostering Curriculum Development Through Artesol 2010
 
Why a programme view? Why TESTA?
Why a programme view? Why TESTA?Why a programme view? Why TESTA?
Why a programme view? Why TESTA?
 
TESTA, Imperial College Education Day (March 2015)
TESTA, Imperial College Education Day (March 2015)TESTA, Imperial College Education Day (March 2015)
TESTA, Imperial College Education Day (March 2015)
 
Data tool paper
Data tool paperData tool paper
Data tool paper
 
Frustrating formative
Frustrating formativeFrustrating formative
Frustrating formative
 
TESTA, Durham University (December 2013)
TESTA, Durham University (December 2013)TESTA, Durham University (December 2013)
TESTA, Durham University (December 2013)
 
1 Discussion Question Rubric 210 Points Total (30 Poin.docx
1 Discussion Question Rubric  210 Points Total (30 Poin.docx1 Discussion Question Rubric  210 Points Total (30 Poin.docx
1 Discussion Question Rubric 210 Points Total (30 Poin.docx
 
Ruben lichtert-popp-2017
Ruben lichtert-popp-2017Ruben lichtert-popp-2017
Ruben lichtert-popp-2017
 
A broken assessment paradigm?
A broken assessment paradigm?A broken assessment paradigm?
A broken assessment paradigm?
 
Assessment tools
Assessment toolsAssessment tools
Assessment tools
 
Changing the assessment narrative
Changing the assessment narrativeChanging the assessment narrative
Changing the assessment narrative
 
EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...
EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...
EMOTION DETECTION AND OPINION MINING FROM STUDENT COMMENTS FOR TEACHING INNOV...
 
CaseStudy2
CaseStudy2CaseStudy2
CaseStudy2
 

Student Evaluations in Physics - A Study on Gender Bias at the University of Manchester

  • 1. Student Evaluations in Physics: A study on Gender Bias at the University of Manchester George Gilliver (8383712), Zac Baker (8446588) and Joshua Brothers (8268765) Supervised by Anna Scaife School of Physics and Astronomy, The University of Manchester 17th September 2015 Abstract In an attempt to understand how the gender of a lecturer affected the feedback received, three years worth of student teaching evaluations were collected from the School of Physics and Astronomy at the University of Manchester and analysed. Male and female lecture feedback was compared with a word rate analysis, and a list of 16 words used significantly more produced for each gender. As a test of role congruity theory the same rate analysis was performed to produce a list comparing small and large class sizes which were then cross referenced against a dictionary of agentic and communal stereotyped words. There were 0 of either type in the small against large male lecturer analysis, and 2 communal words, one in each class size for the female lecturer analysis. Using a program called ‘SentiStrength’, all free-text feedback comments were analysed for their positive and negative sentiments. The results indicated that students have been more positive towards their male lecturers than their female lecturers, but are equally, mildly negative towards both genders. 1
  • 2. Contents 1 Introduction 2 2 Data Collection 3 3 Rate Analysis 4 3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4 Sentiment Analysis 7 4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5 Overall Discussion and Conclusion 13 6 Improvements and Future Developments 13 Appendices 15 I Censored Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 II Sentiment Analysis Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1 Introduction In discussing lecturers career progression it is common, amongst other things, for student evaluation of teaching forms (SETs) to be a major contributor for decision making. These SETs are a subjective judgement of teaching made by their students. There have been many studies assessing whether this opens up university teaching staff to career restricting biases[1][2][3]. These studies have shown that a significant gender bias exists not only in numerical ratings, but also in the language used by students. This bias is present for the perceived gender of the professor, rather than their actual gender. This has been shown through online courses, in which some students were mis-informed of the gender of their teaching staff and then asked to complete SETs. Results indicated that the quality of the teaching was roughly equal, but students evaluated perceived female professors lower than that of their male counterparts[4]. A more recent project by Ben Schmidt[5] analysed the frequency of words differentiated by gender and subject for professor reviews left on the website www.ratemyprofessor.com. Another area of research has been on ‘Role Congruity Theory’[6]. This states that when individuals enter social interactions, there are implicit social assumptions about what roles others will play. For men, this tends to be implicit assumptions that they will conform to the ‘agentic’ type, where people are more assertive, analytical and authoritative. For women, the opposite is implicitly assumed: the ‘non-agentic’ type, where people are more affectionate, understanding and sensitive. When these assumptions are broken, such as when a woman teaches a large lecture class and is thus required to be authoritative[6], this can result in negative student reactions. We decided that it would be of benefit to study the SETs for the School of Physics and Astronomy at the University of Manchester to see whether similar results could be found. The SETs used by Manchester ask for ordinal numeric evaluations along with qualitative responses about the course and lecturer. 2
  • 3. 2 Data Collection The School of Physics and Astronomy has deduced the best way to get a large number of responses is to ask lecturers to hand out the forms in the lecture once the course is coming to an end. Thus, they are completed once the students have had around 10 weeks of teaching and are completed by hand. The School then publishes the results to SETs on their website1 . As time has progressed, the format of the questionnaire and the way the results are presented has changed. We decided to base our study on results from the academic years 2011-12 to 2013-14. This is due to the ease with which the data could be collected, and the only years we could be certain that full information was available. SETs contain a varying amount of ordinal numerical questions depending on the year, however these provided no insight into the language used by students, so they were not included in the analysis in favour of focusing on the qualitative, free-text questions. Additional data provided includes the number of students providing feedback as a percentage of enrollees, which was used to calculate the class sizes. For the qualitative sections, three questions are always asked with the same general basis, but they differ slightly in wording for 2011-12. See Table 1 for a breakdown of the questions. Although it would appear that only question 3.2 would be relevant to the lecturer, as they are free speech questions, people tended to say things about the lecturers over all three questions. Question No. Wording 2011-12 After 2011-12 3.2 What did you like the most about this lecturer’s approach to teaching? What aspects of the lecturer’s approach to teaching best helped your learning? 4.1 Please provide us with details of what you enjoyed about this course unit Please provide details of what you val- ued about this unit 4.2 Please provide us with details of what you think could be improved on this course unit Please provide details of what you think could be improved on this unit Table 1: The variation in the qualitative, written feedback questions in the SETs. To obtain the class sizes, a simple script was written to scrape the text from the PDF files, and then convert them into a usable form. For the written comments, these were hand typed by us. As a result, a small number had mistakes in the transcribed versions. These have all been rectified to the best of our knowledge, but small errors may still persist. There are also a small percentage of comments that were not included in the analysis, either due to illegibility of handwriting, or comments rendered unreadable by the quality of the scanning process. As the school has only a small percentage of female teaching staff, it was decided for the qualitative sections we would type up all of the female led lecture courses. Grouping the comments from students in the same academic semester and the same stage of their university careers, we managed to ensure that the bias of different cohorts and changes in response style as students passed through the university were minimised, by selecting a comparable number of comments left for both male and female lecturers. These groups will be further referred to as simply ‘cohorts’ such that these cohorts don’t persist over multiple academic years. 
To select the comments left for male lecturers (henceforth referred to as ‘male comments’ and similarly ‘female comments’), the scanned images were taken from the PDFs of all of the male led courses and then a random selection were chosen to be typed up, ensuring the total number followed the rules described above. In completing a ‘sanity check’ for the script that chose the images, we 1 These can be found at https://www.teaching.physics.manchester.ac.uk/GEN_INFO/QUESSUMM/INDEX.HTM. 3
  • 4. noticed that some of the images contained slightly more text than appeared in the PDF documents. This censorship had been used mainly to remove certain personal comments and profanities. As the nature of this study requires us to look at personal comments, we decided to continue without the censorship and to go back through the previously typed women’s comments and check them, using this new method. Please see Appendix I for examples of comments that were censored. 3 Rate Analysis 3.1 Method For our analysis of the rates of the words used, we first conducted a word count for each unique word within the comments. Having read through this list, we realised a lot of the words were not relevant to the lecturer (e.g. about examples classes, or the times of the lectures) and thus it was decided to produce a reduced analysis as well. For this we only used comments that were about the lecturer directly. As question 3.2 was always about the lecturer this was always included. For questions 4.1 & 4.2, as can been seen from Table 1, the questions were not directly about the lecturer. Therefore, we ran a script to only take comments for the reduced analysis if they contained the lecturers name, or a pronoun. Although this would bring in words from sections of a longer comment that were not entirely about the lecturer, it was agreed that it would be better to keep information within this reduced analysis, than inaccurately remove relevant information. First we split the words into those used for male lecturers and those for female lecturers. For each word, we assumed that its usage as a sample of the overall comment population would adhere to a poisson distribution with its frequency as the mean. From this, we could use the poisson error indicated in Equation (1) σ = √ µ (1) where σ is the error and µ is the frequency. To get the error on the rate, we used the total error, and then propagated this through. To then determine whether there was a significant difference for each word we used the formula of Equation (2) Significance = Cm − Cw σ2 Cm + σ2 Cw (2) where Cm and Cw are the counts for men and women respectively, and σCm and σCw are the errors on the counts for men and women respectively. We decided that a 3σ difference would be significant enough to talk about. In order to make a comparison between the words used in small and large teaching environments we took two samples from the larger pool of comments used in the overall rate analysis. Firstly the comments for the small environments were taken from lecture courses with less than 89 respondents, and whilst relatively arbitrary, gave a selection of optional courses that weren’t required for all students to take and are lectured in the smaller theatres. The comments for the large lecturing environments were chosen from courses with over 242 responses in the SET’s. These “core courses” which are required to be taken by all students are given in the largest theatres in the university. This comparison is not the same as in other works[6] which suggested a bias would exist in seminars vs. lecture theatres, so this study should provide a test of whether the effect is strong enough to affect the words used by students based off just the size of the lecture theatre. The words in the list were then compared with a dictionary of agentic and communal words, built from an amalgamation of words from other studies[7][8][9]. 
The words used in the small and large lecture courses were compared for each gender, in the same way as in the reduced overall rate analysis. As there was a small number of women in the analysis, we wanted to remove words that were used significantly more for one lecturer than for others, in order to
avoid words describing the characteristics of an individual woman, rather than women in general. If a word was used 90% of the time or more for one female lecturer, it was removed from the results. This “max. percent from a single lecturer” statistic has been recorded for every female-significant word. Additionally, the percentage distribution of word usage within questions was recorded, in an attempt to measure whether an individual word was used in a positive (questions 3.2 and 4.1) or negative (question 4.2) way.

3.2 Results

Word          Significance   Rate per 1000 Words        Max. % from a       Question Distribution (%)
                             Male         Female        Single Lecturer     3.2    4.1    4.2
handouts      11.761         0.65±0.2     10.7±0.83     44                  97     2      1
examples      8.126          14.8±0.94    28.26±1.36    44                  95     1      4
online        4.771          6.75±0.63    11.92±0.88    31                  96     1      3
mix           4.35           0.18±0.1     1.67±0.33     72                  100    0      0
organised     4.164          2.11±0.35    4.87±0.56     22                  100    0      0
interactive   4.006          0.18±0.1     1.47±0.31     72                  100    0      0
printed       3.899          0.12±0.08    1.28±0.29     40                  95     5      0
notes         3.804          16.85±1      22.87±1.22    30                  94     1      5
were          3.662          7.22±0.65    11.15±0.85    31                  94     2      4
combination   3.638          0.23±0.12    1.41±0.3      55                  91     5      5
mistakes      3.401          0.29±0.13    1.41±0.3      77                  40     0      60
teachweb      3.395          0.06±0.06    0.9±0.24      88                  93     0      7
lots          3.376          2.99±0.42    5.45±0.59     32                  93     1      6
clear         3.23           8.98±0.73    12.75±0.91    31                  98     1      2
helpful       3.166          3.52±0.46    5.96±0.62     30                  96     2      2
worked        3.146          1.41±0.29    3.08±0.45     49                  91     0      9

Table 2: The words used in comments by students for female lecturers that were considered significant once the reduced analysis had been performed. A word was significant when the difference between male and female lecturer usage was more than 3σ. Also shown are the rate per 1000 words in both male and female comments for that word, the maximum percentage of usage from one specific lecturer, and the percentage distribution of each word across the questions on the SETs.

3.3 Discussion

The reduced overall rate analysis, shown in Table 2 and Table 3, found 16 significant words for each gender, with 10 words removed for female lecturers. For male lecturers, many of the significant words are associated with the male stereotype, e.g. ‘jokes’, ‘funny’ and ‘humour’, whereas ‘organised’ was the only equivalent word for female lecturers.

The effect of the different sampling methods used for each gender is shown in the comparison of word usage for one lecturer. As there were so few women in our analysis, the range of usage seen was much larger for their sample, suggesting that some of the significant words were in the list due to a specific individual, rather than the gender of the lecturer. This was alleviated using the 90% rule discussed in Section 3.1, removing obviously erroneous words (for example, one lecturer was fond of giving out cream eggs during a lecture; the words ‘cream’, ‘eggs’ and ‘egg’ were all removed due to being used 100% of the time for one lecturer). However, the bar of 90% was set fairly arbitrarily, and words with high usage by a single lecturer should be treated with caution.
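For concreteness, the 90% rule amounts to a short filter over per-lecturer word counts. The following Python sketch is illustrative only; the data structures are assumptions, not the original script:

    def max_share_from_one_lecturer(counts_by_lecturer):
        # counts_by_lecturer maps lecturer -> count for a single word,
        # e.g. {'lecturer_a': 9, 'lecturer_b': 1} gives a share of 0.9.
        total = sum(counts_by_lecturer.values())
        return max(counts_by_lecturer.values()) / total if total else 0.0

    def apply_90_percent_rule(word_counts, cutoff=0.9):
        # word_counts maps word -> {lecturer: count}; drop any word for
        # which a single lecturer accounts for 90% or more of its usage.
        return [w for w, c in word_counts.items()
                if max_share_from_one_lecturer(c) < cutoff]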
Word           Significance   Rate per 1000 Words        Max. % from a       Question Distribution (%)
                              Male         Female        Single Lecturer     3.2    4.1    4.2
funny          4.237          1.7±0.32     0.26±0.13     38                  93     7      0
interesting    3.891          5.87±0.59    3.01±0.44     36                  80     15     5
subject        3.825          4.17±0.5     1.86±0.35     37                  86     8      6
write          3.78           1.47±0.29    0.26±0.13     34                  52     5      43
jokes          3.74           0.82±0.22    0±0           31                  79     21     0
accent         3.74           0.82±0.22    0±0           43                  93     0      7
enthusiasm     3.703          3.7±0.47     1.6±0.32      33                  87     10     3
previous       3.635          1.12±0.26    0.13±0.09     43                  90     0      10
often          3.574          1.23±0.27    0.19±0.11     35                  33     0      67
humour         3.498          1.59±0.31    0.38±0.16     41                  86     14     0
blackboard     3.41           5.46±0.57    3.01±0.44     35                  89     3      8
because        3.316          0.65±0.19    0±0           33                  60     0      40
enthusiastic   3.293          3.88±0.48    1.92±0.35     37                  98     2      0
entertaining   3.073          0.88±0.23    0.13±0.09     38                  93     7      0
down           3.041          1.7±0.32     0.58±0.19     36                  61     4      36
difficult      3.008          2.58±0.39    1.15±0.27     46                  56     15     29

Table 3: Similar to Table 2, now for words considered significant for male lecturers.

Word           Significance   Rate per 1000 Words          Question Distribution (%)
                              Small Class    Large Class   3.2    4.1    4.2
Small
which          3.626          4.26±0.93      0.67±0.34     90     0      10
course         3.326          9.12±1.37      3.86±0.81     82     3      15
well           3.259          9.73±1.41      4.36±0.86     98     1      1
Large
summaries      3.096          0.2±0.2        2.18±0.61     100    0      0
clear          3.158          4.86±0.99      10.07±1.31    98     0      2
explaining     3.173          0.61±0.35      3.19±0.73     92     2      7
derivations    3.568          1.01±0.45      4.53±0.87     93     0      7
explanations   4.336          1.01±0.45      5.71±0.98     97     0      3
thought        4.507          0.2±0.2        4.03±0.82     100    0      0
blackboard     4.957          2.63±0.73      10.07±1.31    92     2      7

Table 4: Results for words used in comments for male lecturers, compared between small and large class sizes, with rates per 1000 words for each class size and the percentage question distribution.
Word           Significance   Rate per 1000 Words          Max. % from a       Question Distribution (%)
                              Small Class    Large Class   Single Lecturer     3.2    4.1    4.2
Small
approachable   3.912          2.84±0.52      1.09±0.29     60                  90     7      3
classes        3.159          0.85±0.28      0±0           60                  100    0      0
Large
how            3.057          0.28±0.16      1.8±0.38      70                  91     5      5
friendly       3.303          0.19±0.13      1.64±0.36     76                  100    0      0
worked         3.429          0.66±0.25      3.12±0.49     65                  92     0      8
were           4.084          3.22±0.55      9.76±0.88     35                  94     1      6
online         5.393          2.84±0.52      10.77±0.92    51                  96     1      3
handouts       5.548          2.56±0.49      10.31±0.9     41                  98     2      0

Table 5: Similar to Table 4, but with words obtained from feedback for female lecturers.

Another unexpected finding from the analysis was the difference in which questions the words came from. For female lecturers, only 1 of the 16 words drew less than 90% of its usage from question 3.2, whereas for male lecturers 11 of the 16 words were below this threshold. There are two possible explanations for this effect. Firstly, given the assumption that question 3.2 leads to words used positively, the unique words students use to describe women are used more often in a positive context than those for men. The second possible reason is that words were only taken from comments in questions 4.1 and 4.2 if they contained pronouns. Across our entire sample the rate of pronoun usage was twice as high for men as it was for women, and as only words from the latter two questions were filtered on pronoun usage, this could explain why we see such a striking difference between genders. This difference in pronoun rate implies that pronoun usage may not be the best indicator of whether a particular comment is talking about the lecturer or some other aspect of the course.

In the small vs. large class size analysis there were 0 words from the agentic and communal dictionaries in the male table, and 2 communal words in the female list. The two words (‘approachable’ and ‘friendly’) were split, with ‘approachable’ coming from the small class sizes and ‘friendly’ from the large. This split, and the fact that only two agentic/communal words were found to be significant, implies that a change in class size alone is not enough to affect the feedback in terms of agentic or communal language, as opposed to the seminars vs. lectures comparison discussed in other studies[6].

The problem of a small sample size was magnified by further splitting it. There were only 3 unique women in the small classes data set, which led to 5 of the 7 words found being removed due to the 90% rule. This makes it hard to say whether the difference in the number of unique words between Table 4 and Table 5 was due to the difference in genders, or to the small number of female lecturers in the analysis.

4 Sentiment Analysis

4.1 Method

To complete a sentiment analysis, a program called SentiStrength² was used. The main ideas behind SentiStrength are based on psychological research that has shown we process both positive and negative sentiment in parallel[10]. The software allows a score to be assigned to any short informal text (in our case a single comment) for both its positive and negative sentiment. It was decided that it would be of benefit to look into whether there was any variation in the sentiment of the language used by students for male and female lecturers.

² The SentiStrength software can be found at http://sentistrength.wlv.ac.uk/

SentiStrength analyses each word within a comment and assigns it a value for both positive and negative sentiment on a scale of 1 to 5, with 1 being completely neutral
and 5 being very strongly positive/negative (with negative sentiments denoted by a ‘−’, or referred to as a ‘minus score’). It then takes into account negating words such as ‘not’, which turn a positive into a negative and vice versa, and ‘boosters’ such as ‘very’, which amplify the score of the following word as appropriate. For example, the word ‘good’ was given a sentiment score of +2 and ‘very’ was a booster of +1. To show this methodology, a comment such as:

“The lectures were very good, but the maths wasn’t good!”

would then return the following output, where ‘1 Last Word Booster’ indicates a sentiment score boosted by 1 acting on the previous word:

    Positive emotion rating: 3
    Negative emotion rating: −2
    The[0] lectures[0] were[0] very[0] good[2] [1 Last Word Booster] but[0] the[0] maths[0] wasn’t[0] good[2] [Last Word Negator] [[Pos, Neg Sentence = +3, −2]]

To calibrate SentiStrength to work with our feedback comments, a list of all unique words was created. Three people were then asked to independently score each of the words using the scale described above, whilst also identifying any negators and boosters. Where words were given the same score by all parties, they were assigned that value. For the remaining 820 words, the value to be assigned was discussed and agreed upon by all parties.

To check that the assigned scores were accurate in practice, 1000 comments were selected at random from the total, and the same three people rated each comment from 1 to 5 for both positive and negative sentiment. The median was taken as the average result for the group and used in checking the calibration. To check the calibration we ran a 10-fold cross-validation, in which 90% of the comments were used to tune the weights of each word and the remaining 10% were used to calculate the accuracy. A different 10% was then selected and the process run again, until all parts of the 1000 comments had been used. Please see Table 6 for the full output of our cross-validation. From there, we attempted to optimise the results as far as possible, by going through comments that had significantly different scores between the human interpretation and SentiStrength’s prediction.

On completion we found an accuracy of 73.27% for negative sentiment and 57.26% for positive sentiment (Acc±). Although this sounds quite low, especially for the positive sentiment, the correlation scores for both sentiments (Corr±) are quite high (0.6272 for positive and 0.6623 for negative) and far above 0.4³, which shows that we have produced a relatively accurate calibration and that the program is able to replicate human opinion to a sufficient degree. Another thing to consider is that humans themselves will inherently disagree about how the sentiment of a single comment should be scored, so for a computer to accurately model human sentiment scoring, it too should have a certain level of uncertainty. In fact, the 3 human coders only agreed with each other 42.7% and 78.6% of the time respectively in their positive and negative scoring of the 1000-comment sample.

Corr+    Corr−    Acc+     Acc−     AccWithin1+   AccWithin1−   MeanAbsErr+   MeanAbsErr−
0.6272   0.6623   57.26%   73.27%   86.09%        89.09%        0.4015        0.2518

Table 6: The results of our 10-fold cross-validation test on the calibration of SentiStrength.

From Table 6, we can further see that the accuracy greatly increases to almost 90% when the program is allowed to vary the sentiment score by 1 in either direction (AccWithin1±).
The last values in the table (MeanAbsErr±) show the average errors assigned to the prediction of both positive and negative sentiment values.

³ A value of over 0.4 has been described as ‘excellent’ for the correlation by the creators of SentiStrength, as can be seen in Step 3 here: http://sentistrength.wlv.ac.uk/documentation/language_changes.html
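To make the scoring scheme described above concrete, the following Python sketch shows one way booster and negator handling can work. The word lists and function are illustrative assumptions only; SentiStrength’s real implementation is considerably more sophisticated:

    WEIGHTS = {'good': 2, 'boring': -2}    # illustrative term weights
    BOOSTERS = {'very': 1}                 # adds to the magnitude of the next scored word
    NEGATORS = {'not', "wasn't"}           # flips the sign of the next scored word

    def score_comment(tokens):
        # Returns (positive, negative) ratings for a tokenised comment,
        # starting from the neutral baseline of (+1, -1) described above.
        pos, neg = 1, -1
        boost, negate = 0, False
        for t in tokens:
            t = t.lower().strip('.,!?')
            if t in BOOSTERS:
                boost += BOOSTERS[t]
                continue
            if t in NEGATORS:
                negate = True
                continue
            if t in WEIGHTS:
                s = WEIGHTS[t]
                s += boost if s > 0 else -boost   # boosters amplify magnitude
                if negate:
                    s = -s                        # negators invert polarity
                pos, neg = max(pos, s), min(neg, s)
            boost, negate = 0, False
        return pos, neg

    # score_comment("The lectures were very good, but the maths wasn't good!".split())
    # returns (3, -2), matching the worked example above.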
4.2 Results and Analysis

Running all typed comments through the now-calibrated program, the positive and negative sentiment scores for each comment were extracted. After splitting the comments into the 3 individual questions outlined in Table 1, the average sentiment of those comments was calculated for male and female lecturers on both the positive and negative scales. These averages are displayed in Figure 1, with positive sentiment on the top half, negative on the bottom, the data shown relative to the full 1–5 range on the left, and the scale reduced to show the finer details on the right. The errors on these averages were calculated from the mean absolute errors in Table 6 and are displayed at the 3σ limit, so that any claimed discrepancy between genders is a conservative one, akin to the reasoning in Section 3.1. The data for each average and its associated error are given in Table 8 in Appendix II.

[Figure 1]

Figure 1: The average positive and negative sentiment for both male and female lecturers, split into questions. Also included are 3σ error limits. The top and bottom rows are positive and negative sentiment respectively. Both graphs on the right show the zoomed-in areas denoted by the dashed lines.

As can be seen in Figure 1, the average sentiment for questions 3.2 and 4.1 appears to be mildly positive overall, with very little negative sentiment (close to neutral at −1), whereas the sentiment for question 4.2 is more negative than positive, with the strengths of both sentiments approximately equal at around 1.65. Looking instead at gender, the average positive sentiment of feedback given to male lecturers is significantly stronger in all 3 questions than that given to female lecturers. Conversely, there is no difference between the average negative sentiment for the two genders, with the 3σ error limits overlapping for all questions.

To investigate any possible trends over time (and possibly any inconsistent year groups), the sentiment scores were grouped into cohorts and an average calculated for both positive and negative scales. These are displayed in Figure 2 and Figure 3 respectively.
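Downstream of the scoring, these grouped averages and their 3σ limits reduce to a simple group-by. The sketch below is one plausible reading of the error treatment described above, assuming (hypothetically) a pandas DataFrame with one row per comment; the column names are our own, not taken from the original analysis:

    import pandas as pd

    MAE_POS, MAE_NEG = 0.4015, 0.2518   # mean absolute errors from Table 6

    def group_averages(comments, by):
        # comments: DataFrame with columns 'question', 'cohort', 'gender',
        # 'pos' and 'neg' (one row per comment). Returns the mean sentiment
        # per group, with 3-sigma limits on the mean built from the
        # calibration mean absolute error divided by sqrt(N).
        g = comments.groupby(by)[['pos', 'neg']].mean()
        n = comments.groupby(by).size()
        g['pos_3sigma'] = 3 * MAE_POS / n.pow(0.5)
        g['neg_3sigma'] = 3 * MAE_NEG / n.pow(0.5)
        return g

    # Per-question averages by gender (Figure 1) and per-cohort averages
    # (Figures 2 and 3):
    #   group_averages(comments, ['question', 'gender'])
    #   group_averages(comments, ['cohort', 'gender'])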
[Figure 2: cohorts run from 2011/12 Y1/S1 to 2013/14 Y1/S2]

Figure 2: The average positive sentiment for both male and female lecturers, split by individual academic year, undergraduate year (Y) and semester (S).

[Figure 3: cohorts run from 2011/12 Y1/S1 to 2013/14 Y1/S2]

Figure 3: The average negative sentiment for both male and female lecturers, split into cohorts of individual academic year, undergraduate year (Y) and semester (S).
Again, 3σ error limits have been plotted, and the number of significant differences between genders for the same cohort can be seen in Table 7. This data suggests that, of the 13 cohorts studied, approximately half (46% on the positive scale and 54% on the negative) commented with sentiment no different for either gender. Of the rest, 46% of all cohorts were more positive when leaving feedback for their male lecturers (as opposed to 8% for female lecturers), and 31% were more negative towards their female lecturers (15% towards male).

Stronger For:        Neither    Male       Female
Positive Sentiment   6 (46%)    6 (46%)    1 (8%)
Negative Sentiment   7 (54%)    2 (15%)    4 (31%)

Table 7: The number of cohorts that commented with a stronger sentiment towards their male lecturers, their female lecturers, or neither gender. Each number is also given as a percentage of the total number of cohorts (in brackets).

In an attempt to further understand how the sentiment of comments was distributed, we plotted a normalised bar chart of the number of comments with each sentiment score, on both positive and negative scales, as a percentage of the total number of comments left as feedback. The counts are split again by question number and plotted for male and female lecturers in Figure 4, with the constituent percentages available in Table 10 in Appendix II.

[Figure 4]

Figure 4: Each sentiment value from 1/−1 to 5/−5 as a percentage of total positive and negative sentiment. The counts are split into the 3 text questions akin to Figure 1, with male lecturers on the left-hand side and female lecturers on the right-hand side.

Again, the data suggests that students are far more likely to leave positive comments than negative comments throughout all questions, even if question 4.2 leans more negatively than questions 3.2
or 4.1. This 3.2/4.1 vs. 4.2 split is also reflected in questions 3.2 and 4.1 receiving a very small proportion of comments with negative sentiment (7.67%/6.56% for male and 6.92%/8.71% for female lecturers in questions 3.2/4.1 respectively), as opposed to question 4.2 receiving notably more for both genders (28.58%/30% for male/female). Also note that a positive sentiment score of 1 (i.e. neutral sentiment) has a higher percentage for all female questions compared to the same male questions, signifying a higher percentage of positivity towards male lecturers. The converse is true in both questions 4.1 and 4.2 for negativity, with only question 3.2 having a slightly larger percentage of negative sentiment left for male lecturers than for their female counterparts.

As neutral sentiment dominated the responses, it was excluded from the total count and the remaining, non-neutral sentiment comments were re-normalised to 100%; these are displayed in Figure 5.

[Figure 5]

Figure 5: The same as Figure 4, bar the exclusion of all neutral sentiments, to further convey the range of non-neutral sentiment in feedback comments.

The highest non-neutral percentage is always 3/−3 for positive/negative sentiment respectively, indicating that when students were being either positive or negative, they were doing so only mildly. In all but the negative sentiment of question 4.2 for female lecturers, the smallest percentage of comment sentiment is 5/−5, followed by 4/−4, showing that students rarely commented with very strong positive or negative sentiment.

4.3 Discussion

By attempting to extract the sentiment of commenters, we observed a few overall trends. Students’ comments are much more likely to contain positive sentiment than negative sentiment for either gender, but are on average more positive towards their male lecturers and more negative towards their female lecturers. However, students are more likely not to be negative (i.e. to be neutral) whilst being
generally positive, which could be due to the positive-leaning wording of the questions as defined in Table 1. Any perceived sentiment also tends to be mild, with only very small percentages of comments containing very strong sentiment.

5 Overall Discussion and Conclusion

The overall rate analysis found that there is a definite difference in the language used by students leaving feedback in the SETs, and that this difference relates to the stereotypes of the lecturer’s gender. However, a rate analysis alone does not provide the context in which these words are used, nor how they affect the overall perception of a lecturer in their SETs. This issue can be addressed by analysing the sentiment of the feedback comments. The sentiment analysis indicates that students are more positive towards their male lecturers than their female counterparts, whilst negative feedback tends to be equally mild for both genders. Whether these phenomena are due to the limited sample size, inherent biases held by students, or something else entirely is a topic for further research.

One of the main limitations of the study was the number of female-taught courses compared to male-taught courses (27 against 123), which limited the number of comments we could analyse and the variety of courses within our samples. With the Poisson error presented in Equation (1), this small sample led to comparatively large errors on the female word rates; a larger sample would have reduced them. Another limitation might be the gender of the students leaving the feedback. With female students making up only around 15%–20% of enrolment[11], it would be useful to obtain the gender breakdown of commenters, to see whether the gender of the student and the gender of the lecturer together affect the language left in the feedback. Something else to consider is the effect of any group phenomena in classes: students may be inclined to join in with the behaviour and attitudes of their peers, where a vocal minority supersedes the impressionable majority[12].

6 Improvements and Future Developments

To resolve the sample size issues, ideally as much feedback as possible would be collected from a range of different departments, throughout both the University of Manchester and possibly further afield, to see how students’ feedback changes across a varying range of commenters and over time.

Because of the pronoun selection used in the reduced rate analysis, and because the rate of pronoun usage was twice as high for men, we removed significantly more comments for women than for men. Future studies of word usage in SETs that want to avoid this cross-contamination of comments should either use only data from questions that ask specifically about a lecturer, or devise a better method than pronoun tracking to differentiate between comments about the lecturer and comments about other aspects of a course.

Further refining the accuracy and methodology of the SentiStrength program, by honing the word lists used and iterating the calibration process (or even using a more sophisticated approach to analysing sentiment in sentences), would also improve how accurately the true sentiment of commenters is reflected.

Many varied parameters make up a linguistic analysis of feedback. Further research could combine rate and sentiment analysis with additional surveys and metadata on the students leaving the feedback, to obtain a more accurate and truer representation of the inherent biases held.

References

[1] T. L. Lueck, K. L. Endres, and R. E.
Caplan, “The interaction effects of gender on teaching evaluations,” Journalism Educator, vol. 48, no. 3, pp. 46–54, 1993.
[2] K. J. Anderson and G. Smith, “Students’ preconceptions of professors: Benefits and barriers according to ethnicity and gender,” Hispanic Journal of Behavioral Sciences, vol. 27, no. 2, pp. 184–201, 2005.

[3] S. A. Basow, “Student evaluations of college professors: When gender matters,” Journal of Educational Psychology, vol. 87, no. 4, 1995.

[4] L. MacNell, A. Driscoll, and A. Hunt, “What’s in a name: Exposing gender bias in student ratings of teaching,” Innovative Higher Education, vol. 40, no. 4, pp. 291–303, 2015.

[5] B. Schmidt, “Gender and teaching reviews.” Available at: http://benschmidt.org/profGender/. Last accessed: 17th September 2015.

[6] L. Martin, “Gender issues and teaching,” American Political Science Association 2013 Annual Meeting, 2013.

[7] A. B. Diekman and A. H. Eagly, “Stereotypes as dynamic constructs: Women and men of the past, present, and future,” Personality and Social Psychology Bulletin, vol. 26, no. 10, pp. 1171–1188, 2000.

[8] W. Wood and A. Eagly, “Two traditions of research on gender identity,” Sex Roles, pp. 1–13, 2015.

[9] A. E. Abele, “The dynamics of masculine-agentic and feminine-communal traits: Findings from a prospective study,” Journal of Personality and Social Psychology, vol. 85, pp. 768–776, Oct 2003.

[10] R. Berrios, P. Totterdell, and S. Kellett, “Eliciting mixed emotions: A meta-analysis comparing models, types and measures,” Frontiers in Psychology, vol. 6, no. 428, 2015.

[11] “School of Physics and Astronomy, undergraduate admissions report 2013,” page 8. Available at: https://www.teaching.physics.manchester.ac.uk/committe/minutes/Board/2013_10_23/Admissions%20Report%202013.pdf. Last accessed: 17th September 2015.

[12] S. Reicher, The Psychology of Crowd Dynamics, pp. 182–208. Blackwell Publishers Ltd, 2008.
Appendices

I Censored Comments

Please see below a list of the censored comments found for female lecturers. As these are often personal and include the lecturers’ names, all names have been changed to Jill. The parts that were censored are in italics.

• Wasn’t very challenging, and a lot of stuff we’d already covered in EM/Waves courses. However the new stuff was good. More jumpers. [Picture of a jumper]

• If a mind bogglingly hard tutorial question will be set provide a similar example because oh my god stress.

• Jill’s lecturing style is engaging, awesome and covers topics at an in depth and easy to understand way. Jill’s bitchin!

• Clear, defined structure. Long winded derivations on printouts, generally very good handouts. Sense of style (Both in teaching, and fashion)

• Thanks Jill. Your a babe!

• Good style. Extremely good clothing, but mainly teaching. Good mixture between lectures and online notes. Good balance between theory and practical applications. Always happy.

II Sentiment Analysis Tables

Question:             3.2      4.1      4.2
Male    Positive      2.503    2.360    1.729
        Negative     −1.150   −1.119   −1.591
        error         0.011    0.013    0.013
Female  Positive      2.293    2.234    1.658
        Negative     −1.130   −1.157   −1.609
        error         0.008    0.009    0.009

Table 8: The average positive and negative sentiment for both male and female lecturers, split into questions 3.2, 4.1 and 4.2. Also included is the 1σ error for each question, which applies to both positive and negative sentiment averages.
Lecturer’s Gender:                Male                       Female
Question:                     3.2      4.1      4.2      3.2      4.1      4.2
Positive Sentiment Ratings (%)
5                             2.80     1.86     0.51     1.80     1.30     0.65
4                            17.20    12.54     4.01    11.59    10.66     1.96
3                            33.14    38.49    23.02    31.62    32.90    22.91
2                            21.18    14.01    12.74    24.03    20.42    11.52
1                            25.68    33.10    59.71    30.96    34.72    62.96
Negative Sentiment Ratings (%)
−1                           92.32    93.44    71.43    93.07    91.29    70.03
−2                            1.92     2.06     6.58     2.18     2.60     6.68
−3                            4.28     3.72    14.59     3.70     5.33    15.97
−4                            1.40     0.78     6.27     0.76     0.65     7.07
−5                            0.07     0.00     1.13     0.28     0.13     0.26
Number of Comments:           1355     1021      973     1053      769      764

Table 10: Each sentiment value as a percentage of the total number of comments (shown on the bottom row). The counts are split again into the 3 text questions for both male and female lecturers.