Lexical Normalization of Spanish Tweets
Pablo Ruiz, Montse Cuadros and Thierry Etchegoyhen (pruiz@vicomtech.org)
Speech and Language Technologies, Vicomtech-ik4, Donostia-San Sebastián, Spain
http://www.vicomtech.org | Tweet-Norm Workshop at SEPLN 2013, 29th Conference of the Spanish Society for NLP · 11/20/2013, Madrid

Results
o Tweet-Norm 2013 Task: 564 tweets, 654 OOVs to normalize
o 60.06% accuracy (6th place out of 13) [also see Settings]
o Task average: 56.1 | SD: 12.7 | Range: 33.5 to 78.1
o The settings described below raise accuracy to 65.15 ; 65.42 (+5.09 ; +4.42%) (test ; dev)
Error Sources

Error source                                        % of errors
List-and-Rule-Based Resources                       45%
  Regexes (e.g. onomatopoeias)                      10.5%
  Gaps in domain dictionaries:
    Internet and Social Media Slang                 9%
    Common missegmentations                         8%
    Abbreviations                                   7%
    Known-Words dictionary / Generic-Domain Slang   6.5%
    Entity Databases                                4.5%
Statistical Resources and Workflow                  30%
  Correction Model                                  12%
  Language Model and LM Queries                     7%
  Entity Detection Heuristics                       6%
  Ranking and Selection Criteria                    5%
Other Sources                                       25%
Domain-Appropriate Edit Distance
o Edit costs based on common errors in Spanish microtext (see the sketch after the cost table below)
o 16% gains vs. a cost model where each edit has cost 1
o Can be improved with context sensitivity at the character level

Error Correction Cost Examples

Error                Correction              Cost   Example | total edit cost
a, e, i, o, u, n, d  á, é, í, ó, ú, ñ, null  0.5    alli → allí | 0.5
k, null              q, u                    0.75   kiero → quiero | 1.5
p, a, z              m, u, k                 1      lana → luna | 1
ki, x, wa            qui, ch, gua            0.5    ninia → niña | 0.5
p, t, k              pe, te, ka              0.5    pnsao → pensado | 1
ao$                  ado                     0.5
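To make the cost model concrete, here is a minimal Python sketch of a weighted edit distance, not the system's implementation. The COSTS table below encodes only a subset of the single-character rules above; multi-character patterns such as ao$ → ado would need a pattern-based extension.

    # Cost table: (error_char, correction_char) -> cost; None marks insert/delete.
    COSTS = {
        ('a', 'á'): 0.5, ('e', 'é'): 0.5, ('i', 'í'): 0.5,
        ('o', 'ó'): 0.5, ('u', 'ú'): 0.5, ('n', 'ñ'): 0.5,
        ('k', 'q'): 0.75, (None, 'u'): 0.75,  # insert 'u', as in kiero -> quiero
    }
    DEFAULT = 1.0  # any operation not in the table costs 1

    def op_cost(a, b):
        return 0.0 if a == b else COSTS.get((a, b), DEFAULT)

    def weighted_edit_distance(source, target):
        """Dynamic-programming edit distance with per-operation costs."""
        n, m = len(source), len(target)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + COSTS.get((source[i - 1], None), DEFAULT)
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + COSTS.get((None, target[j - 1]), DEFAULT)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(
                    d[i - 1][j] + COSTS.get((source[i - 1], None), DEFAULT),  # deletion
                    d[i][j - 1] + COSTS.get((None, target[j - 1]), DEFAULT),  # insertion
                    d[i - 1][j - 1] + op_cost(source[i - 1], target[j - 1]),  # substitution
                )
        return d[n][m]

    print(weighted_edit_distance('alli', 'allí'))     # 0.5
    print(weighted_edit_distance('kiero', 'quiero'))  # 1.5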
Language Models
o 3 LMs, 5-gram, case-insensitive, built with the KenLM tool, with unk option
o 1 million sentences per corpus, vocabulary size ca. 140,000

Corpus     Content                         Accuracy (test ; dev)
Tweets     No URLs, mentions or hashtags   59.9% ; 59.4%
Subtitles  Films and documentaries         60.6% ; 61%
Europarl   Parliamentary debates           55% ; 55%
Context Modeling Challenges

Example tweet: @User buenoa dias mi vida , que ttal

o The large proportion of OOVs in the context of the target OOV buenoa limits the usefulness of the n-gram language model.

LM queries (logprob | n-gram order | token) with the OOV dias as context:

  Candidate buenos dias:
    -2.45412969589  1 : @User
    -3.80655765533  1 : buenos
    -7.03040409088  1 : dias

  Candidate buena dias:
    -2.45412969589  1 : @User
    -3.75467634201  1 : buena
    -6.99928283691  1 : dias

LM queries with the IV días as context:

  Candidate buenos días:
    -2.45412969589  1 : @User
    -3.80655765533  1 : buenos
    -1.7721581459   2 : buenos días

  Candidate buena días:
    -2.45412969589  1 : @User
    -3.75467634201  1 : buena
    -4.22779893875  1 : días

o With the OOV dias as context, only unigrams are used, and buena wins over buenos, given its higher LM logprob at equal edit cost.
o With the IV días as context, the bigram buenos días is found in the LM, and the correct candidate, buenos, wins.
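Per-token queries like the ones above can be reproduced with the kenlm Python bindings; a minimal sketch follows, in which the model filename is a placeholder (any 5-gram KenLM model would do, e.g. one built with lmplz -o 5).

    import kenlm

    model = kenlm.Model('tweets.5gram.bin')  # placeholder path to a 5-gram KenLM binary

    def show_scores(candidate_in_context):
        # full_scores yields (log10 prob, matched n-gram order, is_oov) per token.
        words = candidate_in_context.split()
        for word, (logprob, ngram_len, oov) in zip(
                words, model.full_scores(candidate_in_context, eos=False)):
            print(f'{logprob:.11f}  {ngram_len} : {word}  (oov={oov})')

    show_scores('@User buenos dias')  # OOV context: expect unigram matches only
    show_scores('@User buenos días')  # IV context: expect a "buenos días" bigram match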
Conclusions

RESOURCES: Importantly, 45% of errors involve coverage or accuracy issues in dictionaries and rules.
EDIT DISTANCE: Domain-adapted edit costs are useful: 16% gains vs. a cost model where all operations bear a cost of 1.
CONTEXT MODEL: It is difficult to exploit the LM given the abundance of other OOVs in the target OOV's context. It would be useful to explore a context model relying on non-contiguous components (e.g. Han and Baldwin, 2011) [Ref. 2].
TRAINING CORPORA FOR LM: The Subtitles LM and the Tweets LM give equivalent results, but with the Europarl LM accuracy drops 5%. The LM can thus be trained on an off-domain corpus, provided it contains short sentences and colloquial language.
System Architecture

PREPROCESSING
  I: Initial OOV
  Resources: Regexes, Abbreviations, Run-in Expressions
  O: Preprocessed OOV

CANDIDATE GENERATION
  Resources: Aspell Dictionary
  O: Candidate Set

CANDIDATE RANKING
  Weighted factors: Segmental Heuristics, Edit Distance, N-gram Probability (via the Language Model)
  O: Ranked Candidate Set & Scores
  The highest-ranked candidate proceeds to the entity checks; other info (rejected candidates, scores) is passed along.

NAMED-ENTITY CHECKS
  Resources: JRC Names ¹, SAVAS ²
  O: Selected Entity or Non-Entity Candidate

POST-PROCESSING
  Recasing
  O: Recased Selected Candidate
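As an illustration of the data flow only, here is a hypothetical, self-contained Python skeleton of the pipeline with toy stand-ins for every resource; the 70/30 edit-cost/LM weighting comes from the Settings section, while all names and data are invented for the sketch.

    import difflib

    TOY_DICT = ['buenos', 'buena', 'dias', 'días', 'vida']    # stand-in for Aspell
    TOY_LM = {'buenos': -3.81, 'buena': -3.75, 'días': -4.23} # stand-in unigram LM

    def preprocess(oov):
        # Stand-in for regex/abbreviation/run-in handling: just lowercase.
        return oov.lower()

    def generate_candidates(token):
        # Stand-in for Aspell-based candidate generation.
        return difflib.get_close_matches(token, TOY_DICT, n=5, cutoff=0.5)

    def score(token, cand):
        # Lower is better: 70% naive edit cost, 30% (negated) LM logprob.
        edit = sum(a != b for a, b in zip(token, cand)) + abs(len(token) - len(cand))
        return 0.7 * edit - 0.3 * TOY_LM.get(cand, -10.0)

    def normalize(oov):
        token = preprocess(oov)
        candidates = generate_candidates(token) or [token]
        best = min(candidates, key=lambda c: score(token, c))
        # Named-entity checks (JRC Names, SAVAS) and recasing would follow here.
        return best

    print(normalize('buenoa'))  # -> 'buenos' with these toy scores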
Settings
Based on an analysis of 200 errors. Gains are shown as (test ; dev).

Setting                                                       Gain
Weighting the LM at 30% and the Edit Cost at 70%              +1.06% ; +0.76%
Promoting candidates whose only difference with the OOV is
a missing accent mark [Ref. 1]                                +2.07% ; +0.76%
Removing (vs. demoting) candidates at edit cost > 1.5         +0.6% ; +2.29%
Context sensitivity at the character level for edit costs     +1.06% ; +0.6%
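The accent-promotion setting can be approximated with a diacritic-stripping comparison; the sketch below is an assumption about how such a check might look, not the system's code.

    import unicodedata

    def strip_accents(s):
        # Remove combining marks so that 'allí' -> 'alli'.
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')

    def promote_accent_only(oov, ranked_candidates):
        """Move candidates differing from the OOV only in accent marks to the front."""
        accent_only = [c for c in ranked_candidates
                       if strip_accents(c) == strip_accents(oov) and c != oov]
        rest = [c for c in ranked_candidates if c not in accent_only]
        return accent_only + rest

    print(promote_accent_only('alli', ['allá', 'allí', 'ali']))  # ['allí', 'allá', 'ali']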
References

[1] Ramírez, F. and E. López (2006). Spelling Error Patterns in Spanish for Word Processing Applications. Proceedings of LREC 2006, 93-98.
[2] Han, B. and T. Baldwin (2011). Lexical normalization of short text messages: makn sens a #twitter. Proceedings of ACL 49, 1:368-378.

¹ optima.jrc.it/data/entities.gzip
² www.fp-7-savas.eu/savas_project
