This project was completed during the Lviv Data Science Summer School 2016 (http://cs.ucu.edu.ua/en/summerschool). The project supervisor was Jordi Carrera Ventura.
The project goal was to create a state-of-the-art automatic spellchecking system using the most recent advances in the industry (word embeddings, automatic word sense disambiguation through neural nets) as well as traditional techniques (collocation extraction, n-gram models, shallow syntactic parsing). The system should be capable of using linguistic information and semantic context both to correct mistakes and to improve users’ word choice by suggesting better keywords whenever less specific ones are used.
2. Input data
Datasets for training: 800,000 paragraphs (2 million words)
Datasets for testing: from the CoNLL corpus
3. What we want to achieve - Part 1
1. A spellchecker focused on detecting and correcting Mec errors
4. Target error
Mec error:
● description: spelling, punctuation, capitalization
● examples:
○ This knowledge maybe relevant to them.
○ To tell his or her ralatives…
○ ...their altitudes will be easily changed.
5. What we want to achieve - Part 2
● Implement several language models and choose the best one for Mec mistakes
● Practice Hidden Markov Models and Python
● Have fun working as a team :)
6. Methods & Solutions Used
● Unigram model
○ gives a better understanding of the errors and of the sensitivity of each parameter
○ both the previous and the posterior context of a word are important
○ part-of-speech tags could help
● Custom N-gram model (a sketch follows below)
○ N-gram models based on both the previous and the posterior context of a word
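Below is a minimal Python sketch of the custom N-gram idea: every word is scored by smoothed forward and backward bigram probabilities, and words whose two-sided score is very low are flagged as likely Mec errors. This only illustrates the approach described above, not the team's actual code; the function names, the smoothing constant and the threshold are placeholder assumptions.

    from collections import Counter

    def train(sentences):
        """Count unigrams and bigrams over pre-tokenized sentences."""
        uni, big = Counter(), Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            uni.update(padded)
            big.update(zip(padded, padded[1:]))
        return uni, big

    def context_score(word, left, right, uni, big, alpha=0.1):
        """Product of add-alpha smoothed P(word | left) and P(word | right)."""
        v = len(uni)
        p_fwd = (big[(left, word)] + alpha) / (uni[left] + alpha * v)    # forward bigram
        p_bwd = (big[(word, right)] + alpha) / (uni[right] + alpha * v)  # backward bigram
        return p_fwd * p_bwd

    def flag_suspects(tokens, uni, big, threshold=1e-6):
        """Return (word, is_suspect) pairs; a low two-sided score suggests a Mec error."""
        padded = ["<s>"] + tokens + ["</s>"]
        return [(padded[i],
                 context_score(padded[i], padded[i - 1], padded[i + 1], uni, big) < threshold)
                for i in range(1, len(padded) - 1)]

Using both bigram directions is what distinguishes this scorer from a plain left-to-right language model: a misspelling tends to break the fit with its right neighbour as well as its left one.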
7. Tested Methods
● Custom N-gram model with part-of-speech tagging
○ no improvement in the results from part-of-speech tagging
● Conditional Random Fields
○ too slow to train
● Unigram model (a sketch follows below)
○ performs well and is really simple
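The unigram model can be sketched in a few lines in the spirit of Norvig's classic corrector: treat words unseen in the training corpus as likely Mec errors and propose the most frequent known word within one edit. This is an illustration under those assumptions, not necessarily the system's exact candidate generation.

    import string
    from collections import Counter

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word, uni):
        """Keep known words; otherwise return the most frequent 1-edit neighbour."""
        if uni[word] > 0:
            return word
        candidates = [c for c in edits1(word) if uni[c] > 0]
        return max(candidates, key=uni.get) if candidates else word

    uni = Counter(["to", "tell", "his", "or", "her", "relatives"])
    print(correct("ralatives", uni))   # -> "relatives"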
8. Results
● Custom N-gram model
○ Precision: 30%
○ Recall: 41%
TYPE      AMU    CAMB   CUUI   IITB   IPN    NARA   NTHU   PKU    POST   RAC    SJTU   UFC    UMC
Vt        11.61  20.00  5.79   1.90   0.98   16.18  12.90  14.16  3.31   29.17  4.59   0.00   17.60
ArtOrDet  18.75  54.74  67.38  1.81   0.36   54.42  37.96  9.65   59.41  0.66   14.63  0.00   33.42
Mec       31.56  30.67  17.47  1.13   4.79   37.28  7.17   31.69  37.88  45.82  1.10   0.00   22.31
Recall (in %) for each error type, with alternative answers, indicating how well each team performs on a particular error type.
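For reference, the precision and recall figures reported above can be computed from counts of proposed and gold-standard corrections, as in the sketch below. The counts used here are made-up numbers chosen only to reproduce the reported 30% / 41%, not the project's actual evaluation counts.

    def precision_recall(true_pos, false_pos, false_neg):
        precision = true_pos / (true_pos + false_pos)  # fraction of proposed corrections that are right
        recall = true_pos / (true_pos + false_neg)     # fraction of gold corrections that were found
        return precision, recall

    # e.g. 41 correct corrections out of 137 proposed, against 100 gold corrections:
    p, r = precision_recall(41, 137 - 41, 100 - 41)
    print(f"Precision: {p:.0%}, Recall: {r:.0%}")   # Precision: 30%, Recall: 41%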