Innovators of the Early Modern English spelling change: Using DICER to investigate spelling variation trends - Alistair Baron, Paul Rayson and Dawn Archer's presentation at the Helsinki Corpus Festival.
1.
Alistair Baron, Paul Rayson and Dawn Archer
Helsinki Corpus Festival
28th September – 2nd October 2011
2.
• Large amount of spelling variation in Early Modern English texts, despite gradual standardisation between 1500-1700 (Görlach, 1991; Nevalainen, 2006).
• This has an impact on the accuracy of automatic corpus linguistic techniques:
  ◦ From simple searching for words and frequency lists.
  ◦ To key words (Baron et al., 2009) and clusters (Palander-Collin & Hakala, 2011).
  ◦ As well as POS tagging (Rayson et al., 2007) and semantic annotation (Archer et al., 2003).
[Figure: % variant types (y-axis, 0-100) by decade (x-axis, 1400-1800) for ARCHER, EEBO, Innsbruck, Lampeter, EMEMT and Shakespeare, with the average trend.]
4.
• Designed to assist researchers in normalising spelling variation in historical corpora, both manually and automatically.
• Uses methods from modern spellchecking to find spelling variants and offer/select appropriate modern equivalents.
• The original spelling is always retained in the text, with an XML tag surrounding the replacement.
  ◦ <normalised orig="charitie">charity</normalised>
• Used to normalise released historical (and other) corpora, e.g. EMEMT (Lehto et al., 2010) and CEEC (Palander-Collin & Hakala, 2011).
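The tag format above can be illustrated with a minimal Python sketch. It is not VARD's implementation: the lookup table `KNOWN_VARIANTS`, the function names and the whitespace tokenisation are all assumptions made for illustration, and VARD's spellchecking-based candidate selection is not reproduced. The sketch only shows how a replacement can carry the original spelling in an `orig` attribute, and how variant/normalisation pairs can be read back out.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup table; VARD finds and ranks candidates differently.
KNOWN_VARIANTS = {"charitie": "charity", "publick": "public"}

# Matches the tag format shown on the slide (double-quoted orig attribute).
PAIR_PATTERN = re.compile(r'<normalised orig="([^"]*)">([^<]*)</normalised>')


def tag_token(token, variants=KNOWN_VARIANTS):
    """Return the token unchanged, or wrapped in a <normalised> tag."""
    modern = variants.get(token)
    if modern is None:
        return escape(token)
    # quoteattr adds the surrounding quotes and escapes special characters.
    return f"<normalised orig={quoteattr(token)}>{escape(modern)}</normalised>"


def normalise_text(text, variants=KNOWN_VARIANTS):
    """Normalise a whitespace-tokenised string (a deliberate simplification)."""
    return " ".join(tag_token(token, variants) for token in text.split())


def extract_pairs(xml_text):
    """Recover (variant, normalisation) pairs from tagged text."""
    return PAIR_PATTERN.findall(xml_text)


tagged = normalise_text("of publick charitie")
print(tagged)
print(extract_pairs(tagged))
```

Because the original form is kept in the attribute, the normalisation stays reversible and the pairs remain available for later analysis.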
5.
• DICER: Discovery and Investigation of Character Edit Rules.
• Examines variant/normalisation pairs found in the XML output from VARD.
• Determines what letter replacement rules are required to convert the variant form into the normalised form. For example:

Variant    Normalisation    Rules
anie       any              ie → y
publick    public           Delete k
ioynte     joint            i → j, y → i, Delete e

• Frequencies are calculated for each rule, indicating how often each rule occurs, at which position in the variant it is applied and with which surrounding letters.
• Meta-data is also stored to allow for the analysis of spelling rule trends over time, genre or any other meta-data present.
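A rough sketch of how character edit rules could be derived from a variant/normalisation pair is given below. The slides do not describe DICER's actual algorithm, so this uses a standard Levenshtein alignment as a stand-in and merges adjacent edits into one rule. The function names `align` and `derive_rules` and the rule labels are illustrative assumptions.

```python
from collections import Counter


def align(variant, normalised):
    """Levenshtein alignment as a list of (kind, source_char, target_char)."""
    n, m = len(variant), len(normalised)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if variant[i - 1] == normalised[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitute
                             dist[i - 1][j] + 1,         # delete from variant
                             dist[i][j - 1] + 1)         # insert into variant
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace back, preferring the diagonal step
        if i > 0 and j > 0:
            cost = 0 if variant[i - 1] == normalised[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                kind = "match" if cost == 0 else "edit"
                ops.append((kind, variant[i - 1], normalised[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("edit", variant[i - 1], ""))
            i -= 1
        else:
            ops.append(("edit", "", normalised[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def derive_rules(variant, normalised):
    """Merge each run of adjacent edits into a single labelled rule."""
    rules, source, target = [], "", ""
    # The trailing sentinel "match" flushes a run that ends the word.
    for kind, src_char, tgt_char in align(variant, normalised) + [("match", "", "")]:
        if kind == "edit":
            source, target = source + src_char, target + tgt_char
            continue
        if source or target:
            if not target:
                rules.append(f"Delete {source}")
            elif not source:
                rules.append(f"Insert {target}")
            else:
                rules.append(f"Sub. {source} → {target}")
            source, target = "", ""
    return rules


pairs = [("anie", "any"), ("publick", "public"), ("ioynte", "joint")]
rule_counts = Counter(rule for v, n in pairs for rule in derive_rules(v, n))
print(rule_counts)
```

For the three example pairs on the slide, this alignment yields the same rules as the table. Other alignment strategies could split ambiguous cases differently.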
6.
7.
• Corpus of English Dialogues: covers the period 1560-1760 and contains trials, witness depositions, handbooks, prose, comedy drama and miscellaneous (Kytö & Culpeper, 2006).
• Trials and Witness Depositions chosen for current study, and split into two periods: 1560-1639 and 1640-1719.
• VARD 2.4 was trained for each half of the sub-corpus with 10,000 words of randomly selected text. Each half was then automatically normalised with a 75% replacement threshold.
• DICER analysis performed over resulting variants:
  ◦ 1560-1639: 14,782 variant tokens, 2,981 variant types.
  ◦ 1640-1719: 8,273 variant tokens, 1,870 variant types.
8.
• Tracts and pamphlets published 1640-1740 (Schmied, 1994).
• Six domains represented (Religion, Politics, Economy & Trade, Science, Law and Miscellaneous) with two texts for each domain per decade.
• Just Law texts used in current study (1640-1719).
• Spelling variants automatically normalised with VARD 2.4 at 75% threshold after being trained on 10 randomly selected 1,000 word samples.
• DICER analysis performed:
  ◦ 4,637 spelling variant tokens, 1,483 variant types.
9.
• Too many rules to consider everything.
• So, either:
  ◦ Examine trends for rules that we are interested in (hypothesis driven – top down).
  ◦ Use a statistical technique to highlight ‘interesting’ rules (data driven – bottom up).
• Proposal: use keyness method (cf. WordSmith and Wmatrix) to produce Log-Likelihood value for each rule.
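As a sketch of the keyness calculation, the following Python computes a two-corpus log-likelihood for a single rule using the common two-cell formula LL = 2 * (a ln(a/E1) + b ln(b/E2)), with expected values E1 and E2 derived from the corpus totals. It is an assumption that DICER uses exactly this variant. The counts in the usage line are invented for illustration, since the slides report relative frequencies rather than raw counts. The significance labels use the standard chi-square critical values for one degree of freedom.

```python
import math


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Two-cell log-likelihood for one rule observed in two sub-corpora.

    freq_a, freq_b   -- occurrences of the rule in each sub-corpus
    total_a, total_b -- total rule applications in each sub-corpus
    """
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Chi-square critical values, 1 degree of freedom, highest first.
CRITICAL_VALUES = [
    (15.13, "p < 0.0001"),
    (10.83, "p < 0.001"),
    (6.63, "p < 0.01"),
    (3.84, "p < 0.05"),
]


def significance(ll):
    """Map a log-likelihood value to the strictest threshold it passes."""
    for threshold, label in CRITICAL_VALUES:
        if ll >= threshold:
            return label
    return "not significant"


# Invented counts: a rule seen 100 times in 1,000 applications in one
# period and 50 times in 1,000 in the other.
ll = log_likelihood(100, 1000, 50, 1000)
print(round(ll, 2), significance(ll))
```

A rule with the same relative frequency in both sub-corpora scores 0, so ranking rules by this value brings the largest period or text-type differences to the top.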
10.
Rule          Examples                              1640-1679        1680-1719     Log-Likelihood
                                                    Rel. Freq.       Rel. Freq.
Sub. ` → E    ask’d → asked, sign’d → signed        0.01459     ↑    0.14609       571.9 (p < 0.0001)
Delete E      Sheriffe → Sheriff, knowe → know      0.33594     ↓    0.17909       196.9 (p < 0.0001)
Sub. TT → T   att → at, gott → got                  0.05107     ↑    0.13356       166.5 (p < 0.0001)
Sub. LL → L   pistoll → pistol, tryall → trial      0.08008     ↓    0.03821       61.2 (p < 0.0001)
Sub. PP → P   uppon → upon, Chappel → Chapel        0.00208     ↑    0.00947       22.8 (p < 0.0001)
Sub. U → V    deuill → devil, giue → give           0.03248     ↓    0             168.3 (p < 0.0001)
Sub. V → U    vntill → until, vse → use             0.00660     ↓    0             34.2 (p < 0.0001)

Operation     1640-1679        1680-1719     Log-Likelihood
              Rel. Freq.       Rel. Freq.
Deletion      0.39301     ↓    0.22813       182.2 (p < 0.0001)
Substitution  0.54352     ↑    0.69834       196.9 (p < 0.0001)
Insertion     0.06347          0.07352       3.1358

• Decline of “Delete E” could be related to changing practices in printing/publishing?
• “Substitute ` → e” nearly always -`d endings. Why is this feature increasing in use?
• Double to single consonants is changing, but no real pattern in terms of usage increase or decrease.
• “U → V” / “V → U” declines over time, perhaps expected?
• The need for deletion overall for normalisation is declining, whilst substitution is increasing.
11.
Rule          Examples                               1640-1679        1680-1719     Log-Likelihood
                                                     Rel. Freq.       Rel. Freq.
Delete E      onely → only, lesse → less             0.33972     ↓    0.01954       722.9 (p < 0.0001)
Sub. ` → E    call’d → called, joyn’d → joined       0.02557     ↑    0.25237       535.5 (p < 0.0001)
Sub. LL → L   actuall → actual, illegall → illegal   0.15372     ↓    0.02792       205.0 (p < 0.0001)
Sub. MM → M   dammage → damage, summes → sums        0.01566     ↓    0.00168       27.5 (p < 0.0001)
Sub. RR → R   warre → war, Forreign → Foreign        0.02077     ↓    0.00614       18.2 (p < 0.001)
Sub. PP → P   Shipps → Ships, stepp → step           0.00352     ↓    0             10.0 (p < 0.01)

Operation     1640-1679        1680-1719     Log-Likelihood
              Rel. Freq.       Rel. Freq.
Deletion      0.41555     ↓    0.08420       522.2 (p < 0.0001)
Substitution  0.50234     ↑    0.75839       123.8 (p < 0.0001)
Insertion     0.08211     ↑    0.15740       57.5 (p < 0.0001)

• Decline of “Delete E” is present again.
• “Substitute ` → e” increasing is present again.
• Double to single consonants prevalent again, but here a distinct pattern of decline in usage is observed.
• “U → V” does not appear in Lampeter data, only one instance of “V → U”.
• Same trend of deletion rules declining and substitution rules increasing, but insertion rules are increasing also.
14.
Rule          Examples                                 Trials           Witness       Log-Likelihood
                                                       Rel. Freq.       Rel. Freq.
Sub. ` → E    receiv’d → received, alledg’d → alleged  0.12597     <    0.00511       1699.3 (p < 0.0001)
Sub. TT → T   att → at, Cittye → City                  0.08872     <    0.01727       591.3 (p < 0.0001)
Sub. GG → G   dogge → dog, Wigg → Wig                  0.00107     >    0.00500       24.4 (p < 0.0001)
Sub. T → TT   Litle → Little, Scotish → Scottish       0.01511     <    0.00279       105.3 (p < 0.0001)
Sub. EE → E   shee → she, beeing → being               0.01206     >    0.05364       251.7 (p < 0.0001)
Sub. E → EE   bene → been, chese → cheese              0.00199     >    0.00660       23.5 (p < 0.0001)
Sub. U → V    neuer → never, euill → evil              0.01374     >    0.04704       173.1 (p < 0.0001)

Operation     Trials           Witness       Log-Likelihood
              Rel. Freq.       Rel. Freq.
Deletion      0.28253     >    0.42911       290.1 (p < 0.0001)
Substitution  0.65793     <    0.51424       177.8 (p < 0.0001)
Insertion     0.05954          0.05665       0.7

Note: in these tables, < and > point towards the text type with the higher relative frequency; they are not numeric comparisons.

• -’d endings much more prevalent in trials.
• Changes in the use of double consonants instead of single consonants, but no real trend.
• Single consonants instead of double consonants also found, but commonly overused in trials.
• Singling and doubling of vowels both overused in witness depositions.
• Interchanging of U & V found much more in witness depositions.
• Deletion is required more for normalising witness depositions, substitutions more for trials.
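A quick arithmetic check on the operation rows of slides 10, 11 and 14 is sketched below. In every column the three operation values sum to roughly 1, which suggests (as an inference, not something the slides state) that "Rel. Freq." is the share of all rule applications in that sub-corpus. The dictionary keys are labels chosen here for illustration.

```python
# Operation-level relative frequencies copied from slides 10, 11 and 14.
OPERATION_REL_FREQ = {
    "slide 10, 1640-1679": {"Deletion": 0.39301, "Substitution": 0.54352, "Insertion": 0.06347},
    "slide 10, 1680-1719": {"Deletion": 0.22813, "Substitution": 0.69834, "Insertion": 0.07352},
    "slide 11, 1640-1679": {"Deletion": 0.41555, "Substitution": 0.50234, "Insertion": 0.08211},
    "slide 11, 1680-1719": {"Deletion": 0.08420, "Substitution": 0.75839, "Insertion": 0.15740},
    "slide 14, Trials": {"Deletion": 0.28253, "Substitution": 0.65793, "Insertion": 0.05954},
    "slide 14, Witness": {"Deletion": 0.42911, "Substitution": 0.51424, "Insertion": 0.05665},
}


def column_totals(table):
    """Sum the three operation shares for each sub-corpus column."""
    return {column: sum(shares.values()) for column, shares in table.items()}


for column, total in column_totals(OPERATION_REL_FREQ).items():
    print(f"{column}: {total:.5f}")
```

If that reading is right, the rule-level values in the upper tables are fractions of the same totals, so a "Delete E" value of 0.33594 would mean about a third of all rule applications in that period.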
15.
• Found that there are differences in terms of both the text-types examined and also across the period. Not sure, as yet, what is causing these differences. Our hunch is that it is possibly:
  ◦ authorial/editorial (how they’re recorded, rather than because of the text-type)
  ◦ because of [frequency of] usage of particular lexical items (e.g. hee/shee in Witness Depositions).
• This said, our previous work (on Lampeter) has suggested that there are significant differences in terms of variant frequencies across genres (i.e. Religion particularly high).
• Future work: finding the innovators for change (variant rule level > genre/text-type > texts > people). This requires large scale normalisation, which requires more corpora ... over to you!
[Figure: relative frequency of “Substitute U → V” (y-axis, 0 to 0.07) across the periods 1560-1599, 1600-1639, 1640-1679 and 1680-1719, for Trials (LL: 267.3) and Witness (LL: 134.4).]
16.
• Normalisation of spelling variation with VARD 2.
• Study of spelling patterns and trends.
• Increased understanding of the properties of spelling variation.
17.
• Acknowledgements:
  ◦ Thanks to Merja Kytö for providing the CED corpus.
  ◦ Research funded by EPSRC PhD Plus grant at Lancaster University.
• More information:
  ◦ VARD: http://ucrel.lancs.ac.uk/vard
  ◦ DICER: http://corpora.lancs.ac.uk/dicer
18.
Archer, D., McEnery, T., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson and T. McEnery (eds.) Proceedings of Corpus Linguistics 2003, pp. 22–31. Lancaster University, Lancaster, UK.
Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.
Görlach, M. (1991). Introduction to Early Modern English. Cambridge University Press, Cambridge.
Kytö, M. and Culpeper, J. (2006). A Corpus of English Dialogues 1560-1760.
19.
Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.) Early Modern English Medical Texts: Corpus description and studies, pp. 279–290. John Benjamins, Amsterdam.
Palander-Collin, M. and Hakala, M. (2011). Standardizing the Corpus of Early English Correspondence (CEEC). Poster presented at ICAME 32, Oslo, 1-5 June 2011.
Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Davies, M., Rayson, P., Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus Linguistics Conference: CL2007, University of Birmingham, UK, 27-30 July 2007.
Schmied, J. (1994). The Lampeter Corpus of Early Modern English Tracts. In M. Kytö, M. Rissanen and S. Wright (eds.) Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St. Catherine’s College, Cambridge. Rodopi, Amsterdam.