Innovators of the Early Modern English spelling change: Using DICER to investigate spelling variation trends

Alistair Baron, Paul Rayson and Dawn Archer

Helsinki Corpus Festival
28th September – 2nd October 2011

• Large amount of spelling variation in Early Modern English texts, despite gradual standardisation between 1500-1700 (Görlach, 1991; Nevalainen, 2006).

• This has an impact on the accuracy of automatic corpus linguistic techniques:
  ◦ From simple searching for words and frequency lists.
  ◦ To key words (Baron et al., 2009) and clusters (Palander-Collin & Hakala, 2011).
  ◦ As well as POS tagging (Rayson et al., 2007) and semantic annotation (Archer et al., 2003).

[Figure: % variant types by decade (1400-1800) for ARCHER, EEBO, Innsbruck, Lampeter, EMEMT and Shakespeare, with the average trend (Baron et al., 2009).]

• Designed to assist researchers in normalising spelling variation in historical corpora both manually and automatically.

• Uses methods from modern spellchecking to find spelling variants and offer/select appropriate modern equivalents.

• The original spelling is always retained in the text with an XML tag surrounding the replacement.
  ◦ <normalised orig="charitie">charity</normalised>

• Used to normalise released historical (and other) corpora, e.g. EMEMT (Lehto et al., 2010) and CEEC (Palander-Collin & Hakala, 2011).
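As an illustration of how the retained original spellings can be recovered from normalised text, here is a minimal Python sketch. It assumes the normalised text is well-formed XML; the `<text>` wrapper element, the sample sentence and the function name `variant_pairs` are hypothetical and are not taken from VARD itself. Only the `<normalised orig="...">` tag comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the <normalised orig="..."> tag is from the slide;
# the <text> wrapper and the sentence are assumptions for illustration.
sample = (
    '<text>Faith, hope and '
    '<normalised orig="charitie">charity</normalised>.</text>'
)

def variant_pairs(xml_text):
    """Return (original spelling, normalised spelling) pairs found in the text."""
    root = ET.fromstring(xml_text)
    return [(el.get("orig"), el.text) for el in root.iter("normalised")]

print(variant_pairs(sample))
```

Pairs extracted this way are the kind of variant/normalisation input that a rule analysis needs.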

• DICER: Discovery and Investigation of Character Edit Rules.

• Examines variant / normalisation pairs found in the XML output from VARD.

• Determines what letter replacement rules are required to convert the variant form into the normalised form. For example:

  Variant    Normalisation   Rules
  anie       any             ie → y
  publick    public          Delete k
  ioynte     joint           i → j; y → i; Delete e

• Frequencies are calculated for each rule, indicating how often each rule occurs, at which position in the variant it is applied and with which surrounding letters.

• Meta-data is also stored to allow for the analysis of spelling rule trends over time, genre or any other meta-data present.
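The rule derivation described above can be approximated with a standard edit-distance alignment. The sketch below is not DICER's actual algorithm, which the slides do not specify. It assumes unit-cost Levenshtein alignment and then merges adjacent edits into a single rule, which happens to reproduce the three example rows. The function name `edit_rules` and the rule label format are hypothetical.

```python
from collections import Counter

def edit_rules(variant, normalised):
    """Return grouped character edit rules turning `variant` into `normalised`."""
    n, m = len(variant), len(normalised)
    # dist[i][j] = edit distance between variant[:i] and normalised[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if variant[i - 1] == normalised[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitute
                             dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1)         # insert

    # Walk back from the bottom-right cell to recover one optimal alignment.
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (variant[i - 1] != normalised[j - 1])):
            aligned.append((variant[i - 1], normalised[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            aligned.append((variant[i - 1], ""))
            i -= 1
        else:
            aligned.append(("", normalised[j - 1]))
            j -= 1
    aligned.reverse()

    # Merge runs of adjacent edits into one rule; a matching pair ends a run.
    rules, src, tgt = [], "", ""
    for a, b in aligned + [("", "")]:  # the sentinel flushes the final run
        if a == b:
            if src and tgt:
                rules.append(f"Substitute {src} → {tgt}")
            elif src:
                rules.append(f"Delete {src}")
            elif tgt:
                rules.append(f"Insert {tgt}")
            src, tgt = "", ""
        else:
            src, tgt = src + a, tgt + b
    return rules

pairs = [("anie", "any"), ("publick", "public"), ("ioynte", "joint")]
rule_counts = Counter(rule for v, norm in pairs for rule in edit_rules(v, norm))
for rule, count in rule_counts.items():
    print(rule, count)
```

Counting rules over many pairs, as in the last lines, gives the raw frequencies that a trend analysis would start from. Position and surrounding-letter information would need to be recorded during the alignment walk and is left out here.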

• The Corpus of English Dialogues covers the period 1560-1760 and contains trials, witness depositions, handbooks, prose, comedy drama and miscellaneous (Kytö & Culpeper, 2006).

• Trials and Witness Depositions chosen for current study, and split into two periods: 1560-1639 and 1640-1719.

• VARD 2.4 was trained for each half of the sub-corpus with 10,000 words of randomly selected text. Each half was then automatically normalised with a 75% replacement threshold.

• DICER analysis performed over resulting variants:
  ◦ 1560-1639: 14,782 variant tokens, 2,981 variant types.
  ◦ 1640-1719: 8,273 variant tokens, 1,870 variant types.

• Lampeter Corpus: tracts and pamphlets published 1640-1740 (Schmied, 1994).

• Six domains represented (Religion, Politics, Economy & Trade, Science, Law and Miscellaneous) with two texts for each domain per decade.

• Just Law texts used in current study (1640-1719).

• Spelling variants automatically normalised with VARD 2.4 at 75% threshold after being trained on 10 randomly selected 1,000-word samples.

• DICER analysis performed:
  ◦ 4,637 spelling variant tokens, 1,483 variant types.

• Too many rules to consider everything.

• So, either:
  ◦ Examine trends for rules that we are interested in (hypothesis driven – top down)
  ◦ Use a statistical technique to highlight ‘interesting’ rules (data driven – bottom up)

• Proposal: use keyness method (cf. WordSmith and Wmatrix) to produce Log-Likelihood value for each rule.

Rule         Examples                           1640-1679        1680-1719    Log-Likelihood
                                                Rel. Freq.       Rel. Freq.
Sub. ` → E   ask’d → asked; sign’d → signed     0.01459     ↑    0.14609      571.9 (p < 0.0001)
Delete E     Sheriffe → Sheriff; knowe → know   0.33594     ↓    0.17909      196.9 (p < 0.0001)
Sub. TT → T  att → at; gott → got               0.05107     ↑    0.13356      166.5 (p < 0.0001)
Sub. LL → L  pistoll → pistol; tryall → trial   0.08008     ↓    0.03821      61.2 (p < 0.0001)
Sub. PP → P  uppon → upon; Chappel → Chapel     0.00208     ↑    0.00947      22.8 (p < 0.0001)
Sub. U → V   deuill → devil; giue → give        0.03248     ↓    0            168.3 (p < 0.0001)
Sub. V → U   vntill → until; vse → use          0.00660     ↓    0            34.2 (p < 0.0001)

• Decline of “Delete E” could be related to changing practices in printing/publishing?

• “Substitute ` → e” nearly always -`d endings. Why is this feature increasing in use?

• Double to single consonants is changing, but no real pattern in terms of usage increase or decrease.

• “U → V” / “V → U” declines over time, perhaps expected?

Operation      1640-1679        1680-1719    Log-Likelihood
               Rel. Freq.       Rel. Freq.
Deletion       0.39301     ↓    0.22813      182.2 (p < 0.0001)
Substitution   0.54352     ↑    0.69834      196.9 (p < 0.0001)
Insertion      0.06347          0.07352      3.1358

• The need for deletion overall for normalisation is declining, whilst substitution is increasing.
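The three operation frequencies in each period above add up to roughly 1, which suggests that each relative frequency is a rule's share of all rule applications in that sub-corpus. This is an inference from the figures, not something the slides state. The sketch below checks the sums and shows how such shares could be computed; the raw counts and the function name `relative_frequencies` are hypothetical.

```python
# Operation relative frequencies copied from the table above.
periods = {
    "1640-1679": {"Deletion": 0.39301, "Substitution": 0.54352, "Insertion": 0.06347},
    "1680-1719": {"Deletion": 0.22813, "Substitution": 0.69834, "Insertion": 0.07352},
}

def relative_frequencies(counts):
    """Turn raw rule counts into shares of all rule applications."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Each period's shares should sum to about 1 (allowing for rounding).
for period, freqs in periods.items():
    print(period, round(sum(freqs.values()), 5))

# Hypothetical raw counts, only to show the calculation.
example = relative_frequencies({"Deletion": 40, "Substitution": 55, "Insertion": 5})
print(example)
```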

Rule         Examples                              1640-1679        1680-1719    Log-Likelihood
                                                   Rel. Freq.       Rel. Freq.
Delete E     onely → only; lesse → less            0.33972     ↓    0.01954      722.9 (p < 0.0001)
Sub. ` → E   call’d → called; joyn’d → joined      0.02557     ↑    0.25237      535.5 (p < 0.0001)
Sub. LL → L  actuall → actual; illegall → illegal  0.15372     ↓    0.02792      205.0 (p < 0.0001)
Sub. MM → M  dammage → damage; summes → sums       0.01566     ↓    0.00168      27.5 (p < 0.0001)
Sub. RR → R  warre → war; Forreign → Foreign       0.02077     ↓    0.00614      18.2 (p < 0.001)
Sub. PP → P  Shipps → Ships; stepp → step          0.00352     ↓    0            10.0 (p < 0.01)

• Decline of “Delete E” is present again.

• “Substitute ` → e” increasing is present again.

• Double to single consonants prevalent again, but here a distinct pattern of decline in usage is observed.

• “U → V” does not appear in Lampeter data, only one instance of “V → U”.

Operation      1640-1679        1680-1719    Log-Likelihood
               Rel. Freq.       Rel. Freq.
Deletion       0.41555     ↓    0.08420      522.2 (p < 0.0001)
Substitution   0.50234     ↑    0.75839      123.8 (p < 0.0001)
Insertion      0.08211     ↑    0.15740      57.5 (p < 0.0001)

• Same trend of deletion rules declining and substitution rules increasing, but insertion rules are increasing also.

[Figure: four panels plotted over the periods 1640-1659, 1660-1679, 1680-1699 and 1700-1719: Delete E (LL: 763.8, p < 0.0001); Substitute ` → E (LL: 691.6, p < 0.0001); Substitute LL → L (LL: 241.0, p < 0.0001); Substitution Rules (LL: 166.7, p < 0.0001).]

[Figure: four panels plotted over the periods 1560-1599, 1600-1639, 1640-1679 and 1680-1719: Substitute ` → E (LL: 949.9, p < 0.0001); Substitute TT → T (LL: 967.5, p < 0.0001); Substitute GG → G (LL: 1.2); Deletion Rules (LL: 470.4, p < 0.0001).]

Rule         Examples                               Trials           Witness      Log-Likelihood
                                                    Rel. Freq.       Rel. Freq.
Sub. ` → E   receiv’d → received; alledg’d → alleged  0.12597   <    0.00511      1699.3 (p < 0.0001)
Sub. TT → T  att → at; Cittye → City                0.08872     <    0.01727      591.3 (p < 0.0001)
Sub. GG → G  dogge → dog; Wigg → Wig                0.00107     >    0.00500      24.4 (p < 0.0001)
Sub. T → TT  Litle → Little; Scotish → Scottish     0.01511     <    0.00279      105.3 (p < 0.0001)
Sub. EE → E  shee → she; beeing → being             0.01206     >    0.05364      251.7 (p < 0.0001)
Sub. E → EE  bene → been; chese → cheese            0.00199     >    0.00660      23.5 (p < 0.0001)
Sub. U → V   neuer → never; euill → evil            0.01374     >    0.04704      173.1 (p < 0.0001)

(The < / > marker points towards the text type with the higher relative frequency.)

• -’d endings much more prevalent in trials.

• Changes in the use of double consonants instead of single consonants, but no real trend.

• Single consonants instead of double consonants also found, but commonly overused in trials.

• Singling and doubling of vowels both overused in witness depositions.

• Interchanging of U & V found much more in witness depositions.

Operation      Trials           Witness      Log-Likelihood
               Rel. Freq.       Rel. Freq.
Deletion       0.28253     >    0.42911      290.1 (p < 0.0001)
Substitution   0.65793     <    0.51424      177.8 (p < 0.0001)
Insertion      0.05954          0.05665      0.7

• Deletion is required more for normalising witness depositions, substitutions more for trials.

• Found that there are differences in terms of both the text-types examined and also across the period. Not sure, as yet, what is causing these differences. Our hunch is that it is possibly:
  ◦ authorial/editorial (how they’re recorded rather than because of the text-type)
  ◦ because of [frequency of] usage of particular lexical items (e.g. hee/shee in Witness Depositions).

• This said, our previous work (on Lampeter) has suggested that there are significant differences in terms of variant frequencies across genres (i.e. Religion particularly high).

• Future work: finding the innovators for change (variant rule level > genre/text-type > texts > people) requires large-scale normalisation, which requires more corpora ... over to you!

[Figure: Substitute U → V over the periods 1560-1599, 1600-1639, 1640-1679 and 1680-1719, for Trials (LL: 267.3) and Witness (LL: 134.4).]

[Diagram with three boxes: “Normalisation of spelling variation with VARD 2.”; “Study of spelling patterns and trends.”; “Increased understanding of the properties of spelling variation.”]

• Acknowledgements:
  ◦ Thanks to Merja Kytö for providing the CED corpus.
  ◦ Research funded by EPSRC PhD Plus grant at Lancaster University.

• More information:
  ◦ VARD: http://ucrel.lancs.ac.uk/vard
  ◦ DICER: http://corpora.lancs.ac.uk/dicer

Archer, D., McEnery, T., Rayson, P. & Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson & T. McEnery, eds., Proceedings of Corpus Linguistics 2003, 22–31, Lancaster University, Lancaster, UK.

Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.

Görlach, M. (1991). Introduction to Early Modern English. Cambridge University Press, Cambridge.

Kytö, M. and Culpeper, J. (2006). A Corpus of English Dialogues 1560-1760.

Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.) Early Modern English Medical Texts: Corpus description and studies, pp. 279–290. John Benjamins, Amsterdam.

Palander-Collin, M. and Hakala, M. (2011). Standardizing the Corpus of Early English Correspondence (CEEC). Poster presented at ICAME 32, Oslo, 1-5 June 2011.

Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Davies, M., Rayson, P., Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus Linguistics Conference: CL2007, University of Birmingham, UK, 27-30 July 2007.

Schmied, J. (1994). The Lampeter Corpus of Early Modern English Tracts. In M. Kytö, M. Rissanen & S. Wright, eds., Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St. Catherine’s College, Cambridge. Rodopi, Amsterdam.
