Alistair Baron, Paul Rayson and Dawn Archer
Helsinki Corpus Festival, 28th September – 2nd October 2011

• A large amount of spelling variation is found in Early Modern English texts, despite gradual standardisation between 1500-1700 (Görlach, 1991; Nevalainen, 2006).
• This has an impact on the accuracy of automatic corpus linguistic techniques:
  ◦ From simple searching for words and frequency lists.
  ◦ To key words (Baron et al., 2009) and clusters (Palander-Collin & Hakala, 2011).
  ◦ As well as POS tagging (Rayson et al., 2007) and semantic annotation (Archer et al., 2003).

[Figure: % variant types (y-axis, 10 to 100) by decade (x-axis, 1400 to 1800) for ARCHER, EEBO, Innsbruck, Lampeter, EMEMT and Shakespeare, with an average trend line (Baron et al., 2009).]

• VARD is designed to assist researchers in normalising spelling variation in historical corpora both manually and automatically.
• Uses methods from modern spellchecking to find spelling variants and offer/select appropriate modern equivalents.
• The original spelling is always retained in the text with an XML tag surrounding the replacement.
  ◦ <normalised orig="charitie">charity</normalised>
• Used to normalise released historical (and other) corpora, e.g. EMEMT (Lehto et al., 2010) and CEEC (Palander-Collin & Hakala, 2011).
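
As a rough illustration of the normalisation step described above, the sketch below looks up a variant in a small word list and wraps the replacement in the `<normalised>` tag while keeping the original spelling. It uses Python's `difflib` as a stand-in for the spellchecking methods. The word list, the `normalise_token` helper and the 0.75 similarity cutoff are assumptions made for illustration and are not taken from VARD itself.

```python
from difflib import get_close_matches
from xml.sax.saxutils import escape, quoteattr

# Hypothetical modern word list; a real lexicon would be far larger.
MODERN_WORDS = ["any", "charity", "joint", "public"]


def normalise_token(token, lexicon=MODERN_WORDS, cutoff=0.75):
    """Return the token unchanged, or wrapped in a <normalised> tag.

    The original spelling is kept in the orig attribute, mirroring the
    tag format shown on the slide. The cutoff is an illustrative
    similarity threshold, not a VARD setting.
    """
    if token.lower() in lexicon:
        return token  # already a modern spelling
    matches = get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    if not matches:
        return token  # no close candidate: leave the variant alone
    return "<normalised orig={}>{}</normalised>".format(
        quoteattr(token), escape(matches[0])
    )


if __name__ == "__main__":
    print(normalise_token("charitie"))
    print(normalise_token("publick"))
```

Keeping the variant in the `orig` attribute means the normalisation can always be undone, and later analysis can recover the variant / normalisation pair from the tag.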
  
• DICER: Discovery and Investigation of Character Edit Rules.
• Examines variant / normalisation pairs found in the XML output from VARD.
• Determines what letter replacement rules are required to convert the variant form into the normalised form. For example:
                           Variant            Normalisation      Rules
                           anie               any                ie → y
                           publick            public             Delete k
                           ioynte             joint              i→j
                                                                 y→i
                                                                 Delete e


• Frequencies are calculated for each rule, indicating how often each rule occurs, at which position in the variant it applies and with which surrounding letters.
• Meta-data is also stored to allow for the analysis of spelling rule trends over time, genre or any other meta-data present.
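
The rule extraction described above can be sketched as follows. This is a minimal illustration and not DICER's actual implementation: it assumes a plain Levenshtein alignment between variant and normalised form, merges adjacent edits into one rule, and counts rule frequencies with a `Counter`. The function names and the regular expression are hypothetical, and position and surrounding-letter information is left out for brevity.

```python
import re
from collections import Counter

# Matches the tag format shown earlier: <normalised orig="variant">modern</normalised>
TAG = re.compile(r'<normalised orig="([^"]+)">([^<]+)</normalised>')


def align(variant, normalised):
    """Character alignment read off a standard Levenshtein distance table."""
    n, m = len(variant), len(normalised)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if variant[i - 1] == normalised[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitute
                             dist[i - 1][j] + 1,         # delete from variant
                             dist[i][j - 1] + 1)         # insert into variant
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (
                variant[i - 1] != normalised[j - 1]):
            ops.append((variant[i - 1], normalised[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((variant[i - 1], ""))
            i -= 1
        else:
            ops.append(("", normalised[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def edit_rules(variant, normalised):
    """Merge runs of adjacent edits into rules such as 'ie → y' or 'Delete k'."""
    rules, src, dst = [], "", ""
    for a, b in align(variant, normalised) + [("", "")]:  # sentinel flushes last run
        if a != b:
            src, dst = src + a, dst + b
            continue
        if src or dst:
            if not dst:
                rules.append("Delete " + src)
            elif not src:
                rules.append("Insert " + dst)
            else:
                rules.append(src + " → " + dst)
            src, dst = "", ""
    return rules


def rule_frequencies(xml_text):
    """Count how often each rule is needed across all tagged variants."""
    counts = Counter()
    for variant, normalised in TAG.findall(xml_text):
        counts.update(edit_rules(variant.lower(), normalised.lower()))
    return counts
```

On the three example pairs from the slide, `edit_rules` gives `ie → y` for anie/any, `Delete k` for publick/public, and `i → j`, `y → i`, `Delete e` for ioynte/joint.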
  
• Corpus of English Dialogues (CED): covers the period 1560-1760 and contains trials, witness depositions, handbooks, prose, comedy drama and miscellaneous (Kytö & Culpeper, 2006).
• Trials and Witness Depositions chosen for current study, and split into two periods: 1560-1639 and 1640-1719.
• VARD 2.4 was trained for each half of the sub-corpus with 10,000 words of randomly selected text. Each half was then automatically normalised with a 75% replacement threshold.
• DICER analysis performed over resulting variants:
  ◦ 1560-1639: 14,782 variant tokens, 2,981 variant types.
  ◦ 1640-1719: 8,273 variant tokens, 1,870 variant types.

• Lampeter Corpus: tracts and pamphlets published 1640-1740 (Schmied, 1994).
• Six domains represented (Religion, Politics, Economy & Trade, Science, Law and Miscellaneous) with two texts for each domain per decade.
• Just Law texts used in current study (1640-1719).
• Spelling variants automatically normalised with VARD 2.4 at 75% threshold after being trained on 10 randomly selected 1,000 word samples.
• DICER analysis performed:
  ◦ 4,637 spelling variant tokens, 1,483 variant types.

• Too many rules to consider everything.
• So, either:
  ◦ Examine trends for rules that we are interested in (hypothesis driven, top down).
  ◦ Use a statistical technique to highlight ‘interesting’ rules (data driven, bottom up).
• Proposal: use keyness method (cf. WordSmith and Wmatrix) to produce a Log-Likelihood value for each rule.
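
A keyness-style log-likelihood value for a single rule can be sketched as below. This uses the two-corpus form of the statistic that is commonly used for key word comparison. Whether DICER uses exactly this form is an assumption, and the counts in the usage lines are made up, since the slides report only relative frequencies.

```python
import math


def log_likelihood(freq1, total1, freq2, total2):
    """Log-likelihood for one rule compared across two sub-corpora.

    freq1, freq2: how often the rule applies in each sub-corpus.
    total1, total2: all rule applications in each sub-corpus.
    """
    expected1 = total1 * (freq1 + freq2) / (total1 + total2)
    expected2 = total2 * (freq1 + freq2) / (total1 + total2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def significance(ll):
    """Map a value to the chi-square critical values for 1 degree of freedom."""
    for critical, label in ((15.13, "p < 0.0001"), (10.83, "p < 0.001"),
                            (6.63, "p < 0.01"), (3.84, "p < 0.05")):
        if ll >= critical:
            return label
    return "not significant"


if __name__ == "__main__":
    # Hypothetical counts: a rule used 10 times out of 1,000 rule
    # applications in one period and 20 times out of 1,000 in the other.
    value = log_likelihood(10, 1000, 20, 1000)
    print(round(value, 2), significance(value))
```

Ranking rules by this value highlights the ones whose usage differs most between two periods or text-types, which is the bottom-up route described above.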
  
Rule          Examples                            1640-1679        1680-1719     Log-Likelihood
                                                  Rel. Freq.       Rel. Freq.
Sub. ` → E    ask’d → asked, sign’d → signed      0.01459     ↑    0.14609       571.9 (p < 0.0001)
Delete E      Sheriffe → Sheriff, knowe → know    0.33594     ↓    0.17909       196.9 (p < 0.0001)
Sub. TT → T   att → at, gott → got                0.05107     ↑    0.13356       166.5 (p < 0.0001)
Sub. LL → L   pistoll → pistol, tryall → trial    0.08008     ↓    0.03821       61.2 (p < 0.0001)
Sub. PP → P   uppon → upon, Chappel → Chapel      0.00208     ↑    0.00947       22.8 (p < 0.0001)
Sub. U → V    deuill → devil, giue → give         0.03248     ↓    0             168.3 (p < 0.0001)
Sub. V → U    vntill → until, vse → use           0.00660     ↓    0             34.2 (p < 0.0001)

• Decline of “Delete E” could be related to changing practices in printing/publishing?
• “Substitute ` → e” nearly always -`d endings. Why is this feature increasing in use?
• Double to single consonants is changing, but no real pattern in terms of usage increase or decrease.
• “U → V” / “V → U” declines over time, perhaps expected?

Operation     1640-1679        1680-1719     Log-Likelihood
              Rel. Freq.       Rel. Freq.
Deletion      0.39301     ↓    0.22813       182.2 (p < 0.0001)
Substitution  0.54352     ↑    0.69834       196.9 (p < 0.0001)
Insertion     0.06347          0.07352       3.1358

• The need for deletion overall for normalisation is declining, whilst substitution is increasing.
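
The operation-level relative frequencies above can be read as each operation's share of all rule applications in a period; the three figures for each period sum to roughly 1, which is consistent with that reading. The sketch below computes such shares from rule counts. The rule labels, the counts and the label-based classification are assumptions for illustration.

```python
from collections import Counter


def operation_of(rule):
    """Classify a rule label as Deletion, Insertion or Substitution."""
    if rule.startswith("Delete"):
        return "Deletion"
    if rule.startswith("Insert"):
        return "Insertion"
    return "Substitution"


def relative_frequencies(rule_counts):
    """Share of all rule applications taken by each operation type."""
    totals = Counter()
    for rule, count in rule_counts.items():
        totals[operation_of(rule)] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {op: n / grand_total for op, n in totals.items()}


if __name__ == "__main__":
    # Hypothetical counts for one period.
    counts = {"Delete E": 40, "Sub. TT → T": 35, "Sub. LL → L": 15, "Insert E": 10}
    for operation, share in sorted(relative_frequencies(counts).items()):
        print(operation, round(share, 5))
```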

Rule          Examples                              1640-1679        1680-1719     Log-Likelihood
                                                    Rel. Freq.       Rel. Freq.
Delete E      onely → only, lesse → less            0.33972     ↓    0.01954       722.9 (p < 0.0001)
Sub. ` → E    call’d → called, joyn’d → joined      0.02557     ↑    0.25237       535.5 (p < 0.0001)
Sub. LL → L   actuall → actual, illegall → illegal  0.15372     ↓    0.02792       205.0 (p < 0.0001)
Sub. MM → M   dammage → damage, summes → sums       0.01566     ↓    0.00168       27.5 (p < 0.0001)
Sub. RR → R   warre → war, Forreign → Foreign       0.02077     ↓    0.00614       18.2 (p < 0.001)
Sub. PP → P   Shipps → Ships, stepp → step          0.00352     ↓    0             10.0 (p < 0.01)

• Decline of “Delete E” is present again.
• “Substitute ` → e” increasing is present again.
• Double to single consonants prevalent again, but here a distinct pattern of decline in usage is observed.
• “U → V” does not appear in Lampeter data, only one instance of “V → U”.

Operation     1640-1679        1680-1719     Log-Likelihood
              Rel. Freq.       Rel. Freq.
Deletion      0.41555     ↓    0.08420       522.2 (p < 0.0001)
Substitution  0.50234     ↑    0.75839       123.8 (p < 0.0001)
Insertion     0.08211     ↑    0.15740       57.5 (p < 0.0001)

• Same trend of deletion rules declining and substitution rules increasing, but insertion rules are increasing also.

[Figure: relative frequency by period (1640-1659, 1660-1679, 1680-1699, 1700-1719), four panels:
  Delete E, LL: 763.8 (p < 0.0001)
  Substitute ` → E, LL: 691.6 (p < 0.0001)
  Substitute LL → L, LL: 241.0 (p < 0.0001)
  Substitution Rules, LL: 166.7 (p < 0.0001)]

[Figure: relative frequency by period (1560-1599, 1600-1639, 1640-1679, 1680-1719), four panels:
  Substitute ` → E, LL: 949.9 (p < 0.0001)
  Substitute TT → T, LL: 967.5 (p < 0.0001)
  Substitute GG → G, LL: 1.2
  Deletion Rules, LL: 470.4 (p < 0.0001)]

Rule          Examples                                 Trials           Witness       Log-Likelihood
                                                       Rel. Freq.       Rel. Freq.
Sub. ` → E    receiv’d → received, alledg’d → alleged  0.12597     <    0.00511       1699.3 (p < 0.0001)
Sub. TT → T   att → at, Cittye → City                  0.08872     <    0.01727       591.3 (p < 0.0001)
Sub. GG → G   dogge → dog, Wigg → Wig                  0.00107     >    0.00500       24.4 (p < 0.0001)
Sub. T → TT   Litle → Little, Scotish → Scottish       0.01511     <    0.00279       105.3 (p < 0.0001)
Sub. EE → E   shee → she, beeing → being               0.01206     >    0.05364       251.7 (p < 0.0001)
Sub. E → EE   bene → been, chese → cheese              0.00199     >    0.00660       23.5 (p < 0.0001)
Sub. U → V    neuer → never, euill → evil              0.01374     >    0.04704       173.1 (p < 0.0001)

(The marker between the two columns points at the text-type in which the rule is overused: < for Trials, > for Witness Depositions.)

• -’d endings much more prevalent in trials.
• Changes in the use of double consonants instead of single consonants, but no real trend.
• Single consonants instead of double consonants also found, but commonly overused in trials.
• Singling and doubling of vowels both overused in witness depositions.
• Interchanging of U & V found much more in witness depositions.

Operation     Trials           Witness       Log-Likelihood
              Rel. Freq.       Rel. Freq.
Deletion      0.28253     >    0.42911       290.1 (p < 0.0001)
Substitution  0.65793     <    0.51424       177.8 (p < 0.0001)
Insertion     0.05954          0.05665       0.7

• Deletion is required more for normalising witness depositions, substitutions more for trials.

• Found that there are differences in terms of both the text-types examined and also across the period. Not sure, as yet, what is causing these differences. Our hunch is that it is possibly:
  ◦ authorial/editorial (how they’re recorded, rather than because of the text-type)
  ◦ because of [frequency of] usage of particular lexical items (e.g. hee/shee in Witness Depositions).
• This said, our previous work (on Lampeter) has suggested that there are significant differences in terms of variant frequencies across genres (i.e. Religion particularly high).
• Future work: finding the innovators for change (variant rule level > genre/text-type > texts > people). This requires large scale normalisation, which requires more corpora ... over to you!

[Figure: Substitute U → V, relative frequency (0 to 0.07) by period (1560-1599, 1600-1639, 1640-1679, 1680-1719); Trials (LL: 267.3), Witness (LL: 134.4).]

[Diagram with three linked elements:]
  Normalisation of spelling variation with VARD 2.
  Study of spelling patterns and trends.
  Increased understanding of the properties of spelling variation.

• Acknowledgements:
  ◦ Thanks to Merja Kytö for providing the CED corpus.
  ◦ Research funded by EPSRC PhD Plus grant at Lancaster University.
• More information:
  ◦ VARD: http://ucrel.lancs.ac.uk/vard
  ◦ DICER: http://corpora.lancs.ac.uk/dicer

Archer, D., McEnery, T., Rayson, P. & Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson & T. McEnery, eds., Proceedings of Corpus Linguistics 2003, 22–31, Lancaster University, Lancaster, UK.

Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.

Görlach, M. (1991). Introduction to Early Modern English. Cambridge University Press, Cambridge.

Kytö, M. and Culpeper, J. (2006). A Corpus of English Dialogues 1560-1760.

Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.) Early Modern English Medical Texts: Corpus description and studies, pp. 279–290. John Benjamins, Amsterdam.

Palander-Collin, M. and Hakala, M. (2011). Standardizing the Corpus of Early English Correspondence (CEEC). Poster presented at ICAME 32, Oslo, 1-5 June 2011.

Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Davies, M., Rayson, P., Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus Linguistics Conference: CL2007, University of Birmingham, UK, 27-30 July 2007.

Schmied, J. (1994). The Lampeter Corpus of Early Modern English Tracts. In M. Kytö, M. Rissanen & S. Wright, eds., Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, Rodopi, Amsterdam, St. Catherine’s College, Cambridge.


Innovators of the Early Modern English spelling change: Using DICER to investigate spelling variation trends

  • 1. Alistair  Baron,  Paul  Rayson  and  Dawn  Archer   Helsinki Corpus Festival 28th September – 2nd October 2011
  • 2. 100 ARCHER EEBO Large  amount  spelling  variation  in   90 Innsbruck ¡  80 Lampeter EMEMT Shakespeare Early  Modern  English  texts,  despite   Average Trend 70 gradual  standardisation  between   % Variant Types 60 1500-­‐1700  (Görlach,  1991;   50 40 Nevalainen,  2006).   30   20 This  has  an  impact  on  the  accuracy   10 ¡  of  automatic  corpus  linguistic   1400 1450 1500 1550 1600 1650 1700 1750 1800 Decade techniques:   §  From  simple  searching  for  words   and  frequency  lists.   §  To  key  words  (Baron  et  al.,  2009)   and  clusters  (Palander-­‐Collin  &   Hakala,  2011)   §  As  well  as  POS  tagging  (Rayson  et   al.,  2007)  and  semantic  annotation   (Archer  et  al.,  2003).  
  • 3. 100 ARCHER EEBO 90 Innsbruck Lampeter EMEMT 80 Shakespeare Average Trend 70 % Variant Types 60 50 40 30 20 10 1400 1450 1500 1550 1600 1650 1700 1750 1800 Decade (Baron et al., 2009)
  • 4. ¡  Designed  to  assist  researchers  in  normalising   spelling  variation  in  historical  corpora  both   manually  and  automatically.   ¡  Uses  methods  from  modern  spellchecking  to  ind   spelling  variants  and  offer/select  appropriate   modern  equivalents.   ¡  The  original  spelling  is  always  retained  in  the  text   with  an  xml  tag  surrounding  the  replacement.   §  <normalised  orig=”charitie">charity</normalised>   ¡  Used  to  normalise  released  historical  (and  other)   corpora,  e.g.  EMEMT  (Lehto  et  al.,  2010)  and  CEEC   (Palander-­‐Collin  &  Hakala,  2011).  
  • 5. ¡  Discovery  and  Investigation  of  Character  Edit  Rules   ¡  Examines  variant  /  normalisation  pairs  found  in  the  XML  output   from  VARD.   ¡  Determines  what  letter  replacement  rules  are  required  to   convert  the  variant  form  into  the  normalised  form.  For  example:   Variant Normalisation Rules anie any ie → y publick public Delete k ioynte joint i→j y→i Delete e ¡  Frequencies  are  calculated  for  each  rule  indicating  how  often   each  rule  occurs,  which  position  of  the  variant  it  should  be   applied  and  with  which  surrounding  letters.   ¡  Meta-­‐data  is  also  stored  to  allow  for  the  analysis  of  spelling  rule   trends  over  time,  genre  or  any  other  meta-­‐data  present.  
  • 6.
  • 7. ¡  Corpus  of  English  Dialogues,  covers  the  period  1560-­‐1760   and  contains  trials,  witness  depositions,  handbooks,  prose,   comedy  drama  and  miscellaneous  (Kytö  &  Culpeper,   2006).   ¡  Trials  and  Witness  Depositions  chosen  for  current  study,   and  split  into  two  periods:  1560-­‐1639  and  1640-­‐1719.   ¡  VARD  2.4  was  trained  for  each  half  of  the  sub-­‐corpus  with   10,000  words  of  randomly  selected  text.  Each  half  was   then  automatically  normalised  with  a  75%  replacement   threshold.   ¡  DICER  analysis  performed  over  resulting  variants:   §  1560-­‐1639:  14,782  variant  tokens,  2,981  variant  types.   §  1640-­‐1719:  8,273  variant  tokens,  1,870  variant  types.  
  • 8.
    - Tracts and pamphlets published 1640-1740 (Schmied, 1994).
    - Six domains represented (Religion, Politics, Economy & Trade, Science, Law and Miscellaneous) with two texts for each domain per decade.
    - Just Law texts used in current study (1640-1719).
    - Spelling variants automatically normalised with VARD 2.4 at 75% threshold after being trained on 10 randomly selected 1,000-word samples.
    - DICER analysis performed:
        - 4,637 spelling variant tokens, 1,483 variant types.
  • 9.
    - Too many rules to consider everything.
    - So, either:
        - Examine trends for rules that we are interested in (hypothesis driven, top down)
        - Use a statistical technique to highlight ‘interesting’ rules (data driven, bottom up)
    - Proposal: use keyness method (cf. WordSmith and Wmatrix) to produce a Log-Likelihood value for each rule.
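The slides do not give the formula used, so the following Python sketch assumes the two-corpus log-likelihood statistic commonly used for keyness, applied to rule counts instead of word counts. The function and parameter names are illustrative, and the choice of totals as denominators (for example, all rule applications in each period) is also an assumption.

```python
import math


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Two-corpus log-likelihood for a single rule.

    freq_a, freq_b   -- observed count of the rule in sub-corpus A and B
    total_a, total_b -- size of each sub-corpus (assumed here: the total
                        number of rule applications; at least one non-zero)
    """
    combined = freq_a + freq_b
    # Expected counts if the rule were equally likely in both sub-corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: a rule seen 100 times in 1,000 edits in one period
# and 50 times in 1,000 edits in the other.
example = log_likelihood(100, 50, 1000, 1000)  # about 16.99
```

A rule with the same relative frequency in both sub-corpora scores 0, and the value grows as the two relative frequencies diverge, which is what lets the rules be ranked by how 'interesting' their change is.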
  • 10.
    Rule          Examples                              1640-1679         1680-1719     Log-Likelihood
                                                        Rel. Freq.        Rel. Freq.
    Sub. ` → E    ask’d → asked, sign’d → signed        0.01459      ↑    0.14609       571.9 (p < 0.0001)
    Delete E      Sheriffe → Sheriff, knowe → know      0.33594      ↓    0.17909       196.9 (p < 0.0001)
    Sub. TT → T   att → at, gott → got                  0.05107      ↑    0.13356       166.5 (p < 0.0001)
    Sub. LL → L   pistoll → pistol, tryall → trial      0.08008      ↓    0.03821       61.2 (p < 0.0001)
    Sub. PP → P   uppon → upon, Chappel → Chapel        0.00208      ↑    0.00947       22.8 (p < 0.0001)
    Sub. U → V    deuill → devil, giue → give           0.03248      ↓    0             168.3 (p < 0.0001)
    Sub. V → U    vntill → until, vse → use             0.00660      ↓    0             34.2 (p < 0.0001)

    Operation     1640-1679         1680-1719     Log-Likelihood
                  Rel. Freq.        Rel. Freq.
    Deletion      0.39301      ↓    0.22813       182.2 (p < 0.0001)
    Substitution  0.54352      ↑    0.69834       196.9 (p < 0.0001)
    Insertion     0.06347           0.07352       3.1358

    - Decline of “Delete E” could be related to changing practices in printing/publishing?
    - “Substitute ` → e” nearly always -’d endings. Why is this feature increasing in use?
    - Double to single consonants is changing, but no real pattern in terms of usage increase or decrease.
    - “U → V” / “V → U” declines over time, perhaps expected?
    - The need for deletion overall for normalisation is declining, whilst substitution is increasing.
  • 11.
    Rule          Examples                                 1640-1679         1680-1719     Log-Likelihood
                                                           Rel. Freq.        Rel. Freq.
    Delete E      onely → only, lesse → less               0.33972      ↓    0.01954       722.9 (p < 0.0001)
    Sub. ` → E    call’d → called, joyn’d → joined         0.02557      ↑    0.25237       535.5 (p < 0.0001)
    Sub. LL → L   actuall → actual, illegall → illegal     0.15372      ↓    0.02792       205.0 (p < 0.0001)
    Sub. MM → M   dammage → damage, summes → sums          0.01566      ↓    0.00168       27.5 (p < 0.0001)
    Sub. RR → R   warre → war, Forreign → Foreign          0.02077      ↓    0.00614       18.2 (p < 0.001)
    Sub. PP → P   Shipps → Ships, stepp → step             0.00352      ↓    0             10.0 (p < 0.01)

    Operation     1640-1679         1680-1719     Log-Likelihood
                  Rel. Freq.        Rel. Freq.
    Deletion      0.41555      ↓    0.08420       522.2 (p < 0.0001)
    Substitution  0.50234      ↑    0.75839       123.8 (p < 0.0001)
    Insertion     0.08211      ↑    0.15740       57.5 (p < 0.0001)

    - Decline of “Delete E” is present again.
    - “Substitute ` → e” increasing is present again.
    - Double to single consonants prevalent again, but here a distinct pattern of decline in usage is observed.
    - “U → V” does not appear in Lampeter data, only one instance of “V → U”.
    - Same trend of deletion rules declining and substitution rules increasing, but insertion rules are increasing also.
  • 12. [Figure: four charts plotted over four 20-year periods from 1640-1659 to 1700-1719. Panels: Delete E (LL: 763.8, p < 0.0001); Substitute ` → E (LL: 691.6, p < 0.0001); Substitute LL → L (LL: 241.0, p < 0.0001); Substitution Rules (LL: 166.7, p < 0.0001).]
  • 13. [Figure: four charts plotted over the periods 1560-1599, 1600-1639, 1640-1679 and 1680-1719. Panels: Substitute ` → E (LL: 949.9, p < 0.0001); Substitute TT → T (LL: 967.5, p < 0.0001); Substitute GG → G (LL: 1.2); Deletion Rules (LL: 470.4, p < 0.0001).]
  • 14.
    Rule          Examples                                   Trials            Witness       Log-Likelihood
                                                             Rel. Freq.        Rel. Freq.
    Sub. ` → E    receiv’d → received, alledg’d → alleged    0.12597      <    0.00511       1699.3 (p < 0.0001)
    Sub. TT → T   att → at, Cittye → City                    0.08872      <    0.01727       591.3 (p < 0.0001)
    Sub. GG → G   dogge → dog, Wigg → Wig                    0.00107      >    0.00500       24.4 (p < 0.0001)
    Sub. T → TT   Litle → Little, Scotish → Scottish         0.01511      <    0.00279       105.3 (p < 0.0001)
    Sub. EE → E   shee → she, beeing → being                 0.01206      >    0.05364       251.7 (p < 0.0001)
    Sub. E → EE   bene → been, chese → cheese                0.00199      >    0.00660       23.5 (p < 0.0001)
    Sub. U → V    neuer → never, euill → evil                0.01374      >    0.04704       173.1 (p < 0.0001)

    Operation     Trials            Witness       Log-Likelihood
                  Rel. Freq.        Rel. Freq.
    Deletion      0.28253      >    0.42911       290.1 (p < 0.0001)
    Substitution  0.65793      <    0.51424       177.8 (p < 0.0001)
    Insertion     0.05954           0.05665       0.7

    (In every row, < and > point towards the column with the higher relative frequency.)

    - -’d endings much more prevalent in trials.
    - Changes in the use of double consonants instead of single consonants, but no real trend.
    - Single consonants instead of double consonants also found, but commonly overused in trials.
    - Singling and doubling of vowels both overused in witness depositions.
    - Interchanging of U & V found much more in witness depositions.
    - Deletion is required more for normalising witness depositions, substitutions more for trials.
  • 15.
    - Found that there are differences in terms of both the text-types examined and also across the period. Not sure, as yet, what is causing these differences. Our hunch is that it is possibly:
        - authorial/editorial (a matter of how the texts were recorded rather than of the text-type itself)
        - because of [frequency of] usage of particular lexical items (e.g. hee/shee in Witness Depositions).
    - This said, our previous work (on Lampeter) has suggested that there are significant differences in terms of variant frequencies across genres (Religion being particularly high).
    - Future work: finding the innovators for change (variant rule level > genre/text-type > texts > people). This requires large-scale normalisation, which requires more corpora ... over to you!

    [Figure: Substitute U → V over the periods 1560-1599, 1600-1639, 1640-1679 and 1680-1719, with one series for Trials (LL: 267.3) and one for Witness (LL: 134.4).]
  • 16. [Diagram with three linked boxes:]
    - Normalisation of spelling variation with VARD 2.
    - Study of spelling patterns and trends.
    - Increased understanding of the properties of spelling variation.
  • 17.
    - Acknowledgements:
        - Thanks to Merja Kytö for providing the CED corpus.
        - Research funded by EPSRC PhD Plus grant at Lancaster University.
    - More information:
        - VARD: http://ucrel.lancs.ac.uk/vard
        - DICER: http://corpora.lancs.ac.uk/dicer
  • 18.
    Archer, D., McEnery, T., Rayson, P. & Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson & T. McEnery, eds., Proceedings of Corpus Linguistics 2003, 22–31, Lancaster University, Lancaster, UK.

    Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.

    Görlach, M. (1991). Introduction to Early Modern English. Cambridge University Press, Cambridge.

    Kytö, M. and Culpeper, J. (2006). A Corpus of English Dialogues 1560-1760.
  • 19.
    Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In Taavitsainen, I. and Pahta, P. (eds.) Early Modern English Medical Texts: Corpus description and studies, pp. 279–290. John Benjamins, Amsterdam.

    Palander-Collin, M. and Hakala, M. (2011). Standardizing the Corpus of Early English Correspondence (CEEC). Poster presented at ICAME 32, Oslo, 1-5 June 2011.

    Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Davies, M., Rayson, P., Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus Linguistics Conference: CL2007, University of Birmingham, UK, 27-30 July 2007.

    Schmied, J. (1994). The Lampeter Corpus of Early Modern English Tracts. In M. Kytö, M. Rissanen & S. Wright, eds., Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, Rodopi, Amsterdam, St. Catherine’s College, Cambridge.