The slides present a method for the automatic detection and correction of malapropism errors in documents, using the WordNet lexical database, a search engine (Google) and a paronyms dictionary.
Malapropisms detection and correction: the presentation
1. Author, scientific advisor
Politehnica University of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Malapropisms Detection and Correction
Using a Paronyms Dictionary, a Search
Engine and WordNet
Costin-Gabriel Chiru - costin.chiru@cs.pub.ro
Valentin Cojocaru
Traian Rebedea
Ştefan Trăuşan-Matu
2. Contents
• Introduction
• Tools used
• Application architecture
– Malapropisms detection
– Malapropisms correction
• Walkthrough example
• Experiments and results
• Conclusions and further development
23.07.2010 ICSOFT 2010
3. Introduction
• Purpose: detection and correction of malapropos
words (the unintentional misuse of a word through
confusion with another one).
• Methodology: evaluate the local cohesion of a text to
identify possible malapropisms, then use the coherence
of the whole text, evaluated in terms of lexical chains
built using the linguistic ontology, to correct them.
4. Tools
• Google search engine: estimates the probability that
two words or blocks of words appear together; used to
detect malapropos words;
• A paronym dictionary: supplies the possible
replacements for the malapropos words;
• WordNet: measures how closely related two
words are; used for malapropism correction.
6. Malapropisms Detection
• Responsible for detecting anomalies in the local text
cohesion – using Google.
• Two chunks of text are sent to Google, which returns:
– The number of hits for the 1st chunk (no_pages1);
– The number of hits for the 2nd chunk (no_pages2);
– The number of hits for the co-occurrence of the two
chunks, with the 2nd chunk right after the 1st one (no_combined).
• Based on the mutual information inequality, the module
evaluates whether their co-appearance is statistically plausible.
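The three hit counts can be combined into a single co-occurrence score. The sketch below uses the standard pointwise mutual information; it is not necessarily the exact inequality used by the application. The function name `pmi` and the base-2 logarithm are assumptions, and `pages` stands for the total number of indexed pages.

```python
import math

def pmi(no_pages1: int, no_pages2: int, no_combined: int, pages: int) -> float:
    """Pointwise mutual information of two chunks, estimated from hit counts.

    Standard definition (an assumption here, not the slides' own formula):
    log2( P(c1, c2) / (P(c1) * P(c2)) ), each probability being hits / pages.
    """
    if min(no_pages1, no_pages2, pages) <= 0:
        raise ValueError("no_pages1, no_pages2 and pages must be positive")
    if no_combined <= 0:
        return float("-inf")  # the two chunks never co-occur
    return math.log2(no_combined * pages / (no_pages1 * no_pages2))
```

A score at or below zero means the chunks appear together no more often than chance would predict.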
Why chunks?
7. Malapropisms Detection (2)
• Content words are rarely adjacent, so to check
whether the local text cohesion is damaged we also
need the function words that connect them.
• A chunker decomposes the phrase into chunks, which
are evaluated sequentially using Google.
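A minimal sketch of the sequential evaluation, assuming the chunker output is already available as a list of strings; the helper name `adjacent_chunk_queries` is hypothetical. Each adjacent pair yields the two individual chunks and the combined phrase to be sent to the search engine.

```python
def adjacent_chunk_queries(chunks: list[str]) -> list[tuple[str, str, str]]:
    """Return (first_chunk, second_chunk, combined_phrase) for each adjacent pair."""
    return [(a, b, f"{a} {b}") for a, b in zip(chunks, chunks[1:])]

# Chunks taken from the phrase "I am travelling around the word"
queries = adjacent_chunk_queries(["I", "am travelling", "around the word"])
for first, second, combined in queries:
    print(combined)
```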
8. Malapropisms Detection - Filters
• Cohesion evaluation is done based on six
progressive filters.
• The assumptions behind these six filters are:
– The fewer hits for the co-occurrence of the two
chunks, the greater the probability of a malapropism;
– For the same number of co-occurrences, the more
pages for the individual chunks, the greater the
probability of a malapropism.
9. Malapropisms Detection - Filters (2)
• 1st filter - no_combined has a very small value
(less than 20): a possible malapropism is signaled;
used to eliminate noise.
• For the next five filters, a possible
malapropism is signaled if the following
formula is true:
10. Malapropisms Detection - Filters (3)
• The filter applied depends on the value of no_combined:
– below 20: 1st filter;
– 20 to 500: 2nd filter, beta = 1.05 (higher permission);
– 500 to 12000: 3rd filter, beta = 1 (normal permission) – most often used;
– 12000 to 14000: 4th filter, beta = .95 (smaller permission);
– 14000 to 15000: 5th filter, beta = .9 (even smaller permission);
– 15000 to 16000: 6th filter, beta = .8 (much smaller permission);
– 16000 and above: 7th filter – the formula is no longer used and
no malapropism is signaled.
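The threshold bands can be read as a lookup from no_combined to a filter number and its beta. The sketch below uses the thresholds and beta values from the slides; the function name `select_filter` and the treatment of counts that fall exactly on a threshold are assumptions.

```python
from typing import Optional, Tuple

# (exclusive upper bound on no_combined, filter number, beta)
# Thresholds and betas are the slide values; boundary handling is assumed.
FILTERS = [
    (20, 1, None),       # 1st filter: signals directly, no beta
    (500, 2, 1.05),
    (12_000, 3, 1.0),
    (14_000, 4, 0.95),
    (15_000, 5, 0.9),
    (16_000, 6, 0.8),
]

def select_filter(no_combined: int) -> Tuple[int, Optional[float]]:
    """Return (filter number, beta) for a given co-occurrence hit count."""
    for upper, number, beta in FILTERS:
        if no_combined < upper:
            return number, beta
    return 7, None  # 16000 and above: formula not used, nothing signaled
```

For example, a combined phrase with 4120 hits falls in the 3rd filter, with beta = 1.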
11. Malapropisms Detection - Final Remarks (1)
• Filters depend on:
– Thresholds (20, 500, 12k, 14k, 15k, 16k) and
– Beta – coefficient for the co-occurrence of the two
chunks (1.05, 1, .95, .9, .8).
• These values have been empirically determined
and they are:
– Language dependent – the number of hits differs
for each language;
– Time dependent – the web is continuously growing;
– Text independent – no feature of the text has been
considered.
12. Malapropisms Detection - Final Remarks (2)
• The purpose of this module is to limit as much
as possible the number of misses in
malapropism detection.
• The module also signals many false alarms,
but they will be evaluated in the next module
and some of them will be ignored.
13. Malapropisms Correction
• Purposes:
– Identify and eliminate the false alarms and
– Detect the most probable candidates for the
remaining malapropisms and correct them.
• Uses all three tools (paronyms dictionary, search engine, WordNet).
• Works sequentially: analyzes every pair of
chunks and decides whether a malapropism or
a false alarm has been found.
14. Malapropisms Correction - Methodology
• Correction is done in three stages:
– The replacement candidates that ensure the local
cohesion are identified using the paronyms
dictionary;
– These words are filtered against the local context,
using the search engine in the same manner as for
detection;
– The replacement word is chosen from the remaining
words, based on the text logic (represented by lexical
chains), so that the coherence of the whole text is
maintained.
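The three stages can be sketched as a small pipeline. Everything below is illustrative: `correct_word` and its three callable parameters are hypothetical stand-ins for the paronyms dictionary, the search-engine check and the WordNet lexical-chain check, and returning the first surviving candidate is an assumption about how the final choice is made.

```python
from typing import Callable, Iterable, Optional

def correct_word(
    word: str,
    paronyms_of: Callable[[str], Iterable[str]],  # stage 1: dictionary lookup
    fits_locally: Callable[[str], bool],          # stage 2: search-engine filter
    fits_lexical_chain: Callable[[str], bool],    # stage 3: lexical-chain check
) -> Optional[str]:
    """Return a replacement for `word`, or None if no candidate survives."""
    candidates = [p for p in paronyms_of(word) if fits_locally(p)]
    for candidate in candidates:
        if fits_lexical_chain(candidate):
            return candidate
    return None  # treated as a false alarm in this sketch

# Toy stand-ins for the three tools
toy_dictionary = {"word": ["cord", "lord", "world"]}
result = correct_word(
    "word",
    paronyms_of=lambda w: toy_dictionary.get(w, []),
    fits_locally=lambda p: p == "world",
    fits_lexical_chain=lambda p: p == "world",
)
```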
17. Malapropisms Correction - Possible Situations (3)
• A malapropism chain: multiple consecutive
chunks signaled as possible malapropisms.
• First, try a single correction that fixes both
malapropisms (the 2 chunks are
corrected together) – figure a;
• If this is impossible, each malapropism is treated
separately in order to correct both – figure b;
• If this is still impossible, only 1 of them is corrected.
19. Walkthrough Example (1)
• I am travelling around the word [world].
• Chunker: I; am travelling; around the word.
• Google: “I am travelling” – 1.6 million hits; “am
travelling around the word” – 3 hits.
– The first combination is considered to be correct, while
the second signals a possible malapropism.
• Paronyms dictionary: word - cord, ford, lord, sword,
ward, wyrd, woad, wold, wood, wordy, work, worm,
worn, wort, world.
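Every paronym listed for "word" differs from it by a single letter. The check below assumes that a first-level paronym is a word one edit (substitution, insertion or deletion) away; the slides do not state this definition, and the helper name `one_edit_apart` is hypothetical.

```python
def one_edit_apart(a: str, b: str) -> bool:
    """True if b is one substitution, insertion or deletion away from a."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # single substitution at position i
    return a[i:] == b[i + 1:]          # single extra letter in b at position i

listed = ["cord", "ford", "lord", "sword", "ward", "wyrd", "woad", "wold",
          "wood", "wordy", "work", "worm", "worn", "wort", "world"]
all_one_edit = all(one_edit_apart("word", p) for p in listed)
```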
20. Walkthrough Example (2)
• Google again: “word” is replaced by each of its paronyms
and the number of hits for every combination “am
travelling around the <paronym>” is retrieved.
• Filters: the only combination that passes the filters is “am travelling
around the world”, which has 4120 hits – it passes the 3rd filter (beta = 1).
• WordNet: it is verified that world is part of a lexical chain
that starts from travelling.
• A malapropism is signalled and the corrected form is given:
“I am travelling around the world.”
21. Experiments
• 3 types of corpora have been used for testing:
– 1st corpus – built from individual phrases
containing malapropisms;
– 2nd corpus – contained no malapropisms at all;
– 3rd corpus – consisted of parts of texts published on
the Internet (excerpts from Fox News), modified to
introduce malapropisms as suggested
by (Hirst and St-Onge, 1998) and (Hirst and
Budanitsky, 2005).
22. Results (1)
• 1st corpus:
– 27 out of the 31 examples were correctly detected
(87.10%) and
– 25 of them were properly corrected (80.65%).
• 2nd corpus (587 words):
– 1 false alarm was introduced (.17%)
• Caused by the POS tagger, which wrongly identified
“while” as a noun, so the application replaced it
with the more probable “white”.
23. Results (2)
• 3rd corpus:
– Smaller text (199 words, 1 malapropism)
• The malapropism was corrected, but a false alarm was introduced
(.5%) - it seems we underestimated the false alarm rate.
– Larger text (2083 words, 25 malapropisms)
• 21 malapropisms were detected (84%);
• 17 malapropisms were corrected (68%);
• 10 false alarms were introduced (.48%)
– 6 of these were in the vicinity of a proper noun (e.g. Iran was
replaced by Iraq, the two countries having similar contexts).
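The reported rates are plain ratios of the counts given on the results slides. A short check that recomputes them, rounded to two decimals (the helper name `rate` is hypothetical):

```python
def rate(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100 * part / total, 2)

results = {
    "1st corpus, detected": rate(27, 31),
    "1st corpus, corrected": rate(25, 31),
    "2nd corpus, false alarms": rate(1, 587),
    "3rd corpus (small), false alarms": rate(1, 199),
    "3rd corpus (large), detected": rate(21, 25),
    "3rd corpus (large), corrected": rate(17, 25),
    "3rd corpus (large), false alarms": rate(10, 2083),
}
for label, value in results.items():
    print(f"{label}: {value}%")
```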
24. Conclusions
• Our approach:
– Combines three technologies (WordNet, Google,
Paronyms dictionary);
– Uses thresholds that do not depend on the analyzed
texts;
– Uses chunks of text in order to capture the local
cohesion of texts;
– Is fully automated.
25. Limitations
• Limitations:
– The application has problems with proper
nouns, numbers and metaphors found in
the analyzed texts;
– It depends on the WordNet structure and on the
accuracy of lexical chain construction;
– The paronyms dictionary is limited (at the moment
only first-level paronyms are used).
26. Possible Improvements
• Possible improvements:
– Construct the phrases’ syntactic tree in order to
consider the dependencies between the chunks of
text instead of evaluating them sequentially;
– Evaluate whether the empirically
chosen thresholds hold for any language by
verifying them on a different language;
– Multi-threading.
The POS tagger used is Qtag. The paronyms dictionary has 77,503 words, 22,020 of them (28.4%) having at least one first-level paronym.
The pages parameter in the filter formula represents the number of indexed pages written in the language used.
Every paronym replaces the malapropos word in turn, and the local cohesion of the phrase is tested considering the next/previous chunk of text. If the new word fits better, it is then tested whether it fits in one of the lexical chains of the text. If so, it becomes the replacement candidate and the malapropism is signaled as a real one.
Here, the local cohesion of the phrase is tested considering both the next and previous chunks of text. If the candidate fits with only 1 chunk, it is marked as a possible replacement, but the malapropism is not yet marked as real, nor is it ignored.
The third corpus contains two texts: a small one (199 words) and a larger one (2083 words).