Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Slides of the presentation of the paper An approach to Unsupervised Historical Text Normalisation by Petar Mitankin, Stefan Gerdjikov and Stoyan Mihov in DATeCH 2014. #digidays

  1. An approach to unsupervised historical text normalisation Petar Mitankin Sofia University FMI Stefan Gerdjikov Sofia University FMI Stoyan Mihov Bulgarian Academy of Sciences IICT DATeCH 2014, May 19-20, Madrid, Spain
  3. Contents ● Supervised Text Normalisation – CULTURA – REBELS Translation Model – Functional Automata ● Unsupervised Text Normalisation – Unsupervised REBELS – Experimental Results – Future Improvements
  4. Co-funded under the 7th Framework Programme of the European Commission ● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English ● CULTURA: CULTivating Understanding and Research through Adaptivity ● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
  6. Supervised Text Normalisation ● Manually created ground truth – 500 documents from the 1641 Depositions – All words: 205 291 – Normalised words: 51 133 ● Statistical Machine Translation from historical language to modern language combines: – Translation model – Language model
  8. REgularities Based Embedding of Language Structures shee REBELS Translation Model he / -1.89 se / -1.69 she / -9.75 shea / -10.04 Automatic Extraction of Historical Spelling Variations
  9. Training of The REBELS Translation Model ● Training pairs from the ground truth: (shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...
  10. Training of The REBELS Translation Model ● Deterministic structure of all historical/modern subwords ● Each word has several hierarchical decompositions in the DAWG: Hierarchical decomposition of each historical word Hierarchical decomposition of each modern word
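As a rough illustration of what a decomposition is, the sketch below enumerates flat segmentations of a word into known subwords. This is only a stand-in: the paper builds a deterministic DAWG over all historical and modern subwords and works with hierarchical decompositions, and the subword set here is invented for the example.

```python
# Flat word-break sketch of "decompose a word into known subwords".
# NOT the paper's DAWG-based hierarchical decomposition; the subword
# set is hand-picked so that "knowth" has several decompositions.

def decompositions(word, subwords):
    """All ways to split word into a concatenation of known subwords."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in subwords:
            for rest in decompositions(word[i:], subwords):
                out.append([prefix] + rest)
    return out

subs = {"k", "now", "know", "th", "knowth"}
for d in decompositions("knowth", subs):
    print("+".join(d))  # k+now+th, know+th, knowth
```

This illustrates why each word has several decompositions: distinct subword boundaries give distinct segmentations, and the training step must choose a mapping between the historical and the modern ones.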
  11. Training of The REBELS Translation Model ● For each training pair (knowth, knows) we find a mapping between the decompositions ● We collect statistics about historical subword -> modern subword correspondences
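The counting step can be pictured as follows; the alignments below are hand-made stand-ins for the DAWG-based decomposition mappings of the paper.

```python
from collections import Counter

# Toy illustration of the statistics-collection step: given already
# aligned historical/modern subword pairs (invented here), count how
# often each historical subword maps to each modern subword.

aligned = [
    [("know", "know"), ("th", "s")],          # knowth -> knows
    [("s", "s"), ("ai", "ay"), ("th", "s")],  # saith  -> says
    [("ty", "ti"), ("me", "me")],             # tyme   -> time
]
stats = Counter()
for decomposition_mapping in aligned:
    for hist_sub, mod_sub in decomposition_mapping:
        stats[(hist_sub, mod_sub)] += 1

print(stats[("th", "s")])  # the 'th' -> 's' correspondence was seen twice
```

Frequencies of this kind are what lets the model score spelling variations such as the `th -> s` ending on unseen words.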
  12. REgularities Based Embedding of Language Structures shee REBELS Translation Model he / -1.89 se / -1.69 she / -9.75 shea / -10.04 REBELS generates normalisation candidates for unseen historical words
  13. shee REBELS knowth REBELS me REBELS shee knowth me
  14. Combination of REBELS with Statistical Bigram Language Model relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm ● Bigram Statistical Model – Smoothing: Absolute Discounting, Backing-off – Gutenberg English language corpus
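The weighted combination on this slide can be sketched as below. The candidate sentences, log scores, and fixed weights are all invented: they stand in for the REBELS TM output, the bigram LM, and the learned coefficients C_tm and C_lm.

```python
# Minimal sketch of the slide's score combination:
#   relevance(s) = REBELS_TM(s) * C_tm + LM(s) * C_lm
# Scores and weights are made up for illustration.

def relevance(log_tm, log_lm, c_tm=1.0, c_lm=0.7):
    return c_tm * log_tm + c_lm * log_lm

candidates = {
    "she knows me":   (-2.1, -4.0),   # (TM log score, LM log score)
    "he knuth my":    (-1.9, -12.5),
    "she knoweth me": (-3.0, -9.0),
}
best = max(candidates, key=lambda s: relevance(*candidates[s]))
print(best)  # prints "she knows me"
```

The point of the combination: the translation model alone slightly prefers the implausible "he knuth my", but the language model term pulls the total score toward the fluent modern sentence.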
  15. Functional Automata L(C_tm, C_lm) is represented with Functional Automata
  16. Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x L(C_tm, C_lm) is optimised with the Conjugate Gradient method
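The slide says L(C_tm, C_lm) is optimised with the Conjugate Gradient method. Below is a generic Fletcher-Reeves sketch on an invented quadratic stand-in loss; the real objective, evaluated via functional automata and their partial-derivative automata, is not reproduced here.

```python
# Generic Fletcher-Reeves conjugate gradients with Armijo backtracking,
# applied to a made-up two-parameter loss standing in for L(C_tm, C_lm).

def grad(f, x, h=1e-6):
    """Central-difference gradient; adequate for a two-parameter toy."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def conjugate_gradient(f, x, iters=200):
    move = lambda x, d, t: [xi + t * di for xi, di in zip(x, d)]
    g = grad(f, x)
    d = [-gi for gi in g]
    for _ in range(iters):
        slope = sum(di * gi for di, gi in zip(d, g))  # < 0 for descent d
        t = 1.0
        while t > 1e-12 and f(move(x, d, t)) > f(x) + 1e-4 * t * slope:
            t *= 0.5  # backtrack until sufficient decrease
        x = move(x, d, t)
        g_new = grad(f, x)
        beta = sum(gn * gn for gn in g_new) / max(sum(gi * gi for gi in g), 1e-12)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        if sum(di * gi for di, gi in zip(d, g_new)) >= 0:
            d = [-gn for gn in g_new]  # restart if d stops being a descent direction
        g = g_new
    return x

# Invented stand-in loss whose minimum is at C_tm = 1.0, C_lm = 0.5
loss = lambda c: (c[0] - 1.0) ** 2 + 4 * (c[1] - 0.5) ** 2
c_tm, c_lm = conjugate_gradient(loss, [0.0, 0.0])
print(c_tm, c_lm)  # converges near (1.0, 0.5)
```

In the paper the gradient comes from automatically constructed partial-derivative automata rather than finite differences; the finite-difference `grad` above is purely for the self-contained toy.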
  17. Supervised Text Normalisation REBELS Translation Model Search Module Based on Functional Automata Ground Truth Training Module Based on Functional Automata Historical text Normalised text
  18. Unsupervised Text Normalisation REBELS Translation Model Unsupervised Generation of Training Pairs (knoweth, knows) Historical text Normalised text Search Module Based on Functional Automata
  19. Unsupervised Generation of the Training Pairs ● We use similarity search to generate training pairs: – For each historical word H: ● If H is a modern word, then generate (H,H), else ● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then ● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then ● Find each modern word M that is at distance 3 from H and generate (H,M). ● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
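The cascade above is simple enough to sketch directly. The distance cut-off of 3 and the cap on candidates follow the slide; the tiny lexicon is invented, and the linear scan over it stands in for the efficient approximate search a full modern wordlist would need.

```python
# Runnable sketch of the slide's pair-generation cascade. The lexicon
# and helper names are invented for the example.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

def generate_pairs(h, modern_lexicon, max_dist=3, max_cands=6):
    """Training pairs (H, M) for one historical word H, or [] when H is
    unmatched within max_dist or too ambiguous (> max_cands matches)."""
    if h in modern_lexicon:
        return [(h, h)]
    for d in range(1, max_dist + 1):
        cands = sorted(m for m in modern_lexicon if levenshtein(h, m) == d)
        if cands:
            return [] if len(cands) > max_cands else [(h, m) for m in cands]
    return []

lexicon = {"she", "see", "may", "time", "says", "knows"}
print(generate_pairs("maye", lexicon))  # [('maye', 'may')]
print(generate_pairs("shee", lexicon))  # [('shee', 'see'), ('shee', 'she')]
```

Note that an ambiguous word like "shee" contributes several pairs; only words with more than six candidates are discarded entirely, so the statistics stay noisy but usable.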
  24. Normalisation of the 1641 Depositions. Experimental results

      Method  Generation of REBELS  Spelling            Language            Accuracy  BLEU
              Training Pairs        Probabilities       Model
      1       ----                  ----                ----                75.59     50.31
      2       Unsupervised          NO                  YES                 67.84     45.52
      3       Unsupervised          YES                 NO                  79.18     56.55
      4       Unsupervised          YES                 YES                 81.79     61.88
      5       Unsupervised          Supervised Trained  Supervised Trained  84.82     68.78
      6       Supervised            Supervised Trained  Supervised Trained  93.96     87.30
  25. Future Improvement REBELS Translation Model Unsupervised Generation of Training Pairs (knoweth, knows) with probabilities Historical text Normalised text Search Module Based on Functional Automata MAP Training Module
  26. Thank You! Comments / Questions? ACKNOWLEDGEMENTS The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme and the project AComIn, grant 316087, funded by the FP7 Programme.
