Author, Scientific advisor
Universitatea Politehnica București
Faculty of Automatic Control and Computers
Department of Computers
Malapropisms Detection and Correction
Using a Paronyms Dictionary, a Search
Engine and WordNet
Costin-Gabriel Chiru - costin.chiru@cs.pub.ro
Valentin Cojocaru
Traian Rebedea
Ştefan Trăuşan-Matu
Contents
• Introduction
• Tools used
• Application architecture
– Malapropisms detection
– Malapropisms correction
• Walkthrough example
• Experiments and results
• Conclusions and further development
23.07.2010 ICSOFT 2010
Introduction
• Purpose: detection and correction of malapropisms (unintentional misuse of a word by confusion with another one).
• Methodology: evaluate the local cohesion of a text to identify possible malapropisms, then use the coherence of the whole text, evaluated through lexical chains built with a linguistic ontology, to correct them.
Tools
• Google search engine: estimates the probability that two words or blocks of words appear together → used for the detection of malapropos words;
• A paronym dictionary: supplies the possible replacements for the malapropos words;
• WordNet: detects how closely related two words are → used for malapropism correction.
Application Architecture
Malapropisms Detection
• Responsible for detecting anomalies in the local text cohesion, using Google.
• Two chunks of text are sent to Google, which returns:
– The number of hits for the 1st chunk (no_pages1);
– The number of hits for the 2nd chunk (no_pages2);
– The number of hits for the co-occurrence of the two chunks, with the 2nd chunk right after the 1st one (no_combined).
• Based on the mutual information inequality, the module evaluates whether their co-occurrence is statistically plausible.
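The transcript does not reproduce the inequality itself. A minimal sketch of one common reading, pointwise mutual information estimated from hit counts, is given here. The parameter `total_pages` (the size of the search index) and both function names are assumptions introduced for illustration, not values or names taken from the slides.

```python
import math


def pmi(no_pages1: int, no_pages2: int, no_combined: int, total_pages: int) -> float:
    """Pointwise mutual information (base 2) estimated from hit counts.

    total_pages is a hypothetical estimate of the index size; the slides
    do not say which value, if any, the authors used.
    """
    if min(no_pages1, no_pages2, no_combined) <= 0:
        return float("-inf")
    # P(1,2) / (P(1) * P(2)) with every probability estimated as hits / total_pages
    ratio = no_combined * total_pages / (no_pages1 * no_pages2)
    return math.log2(ratio)


def cooccur_more_than_chance(no_pages1: int, no_pages2: int,
                             no_combined: int, total_pages: int) -> bool:
    """True when the two chunks appear together more often than independence predicts."""
    return pmi(no_pages1, no_pages2, no_combined, total_pages) > 0
```

A positive value means the two chunks occur together more often than independent chunks would; a value at or below zero is the kind of anomaly the detection module looks for.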
Malapropisms Detection (2)
Why chunks?
• Content words are rarely adjacent → to check whether the local text cohesion is damaged, we also need the function words that connect them → a chunker decomposes the phrase into chunks → the chunks are evaluated sequentially using Google.
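The sequential evaluation can be sketched as pairing each chunk with the one that follows it and joining the pair into a single query phrase. The function name `adjacent_queries` is hypothetical; the chunk list is assumed to come from an external chunker.

```python
from typing import List


def adjacent_queries(chunks: List[str]) -> List[str]:
    """Join every chunk with the chunk right after it into one query phrase."""
    return [f"{first} {second}" for first, second in zip(chunks, chunks[1:])]


# Chunks of the sentence "I am travelling around the word."
queries = adjacent_queries(["I", "am travelling", "around the word"])
# queries == ["I am travelling", "am travelling around the word"]
```

A sentence with n chunks yields n - 1 queries, so every chunk except the first and the last is checked against both of its neighbours.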
Malapropisms Detection - Filters
• Cohesion evaluation is based on six progressive filters.
• The assumptions behind these six filters are:
– The fewer hits for the co-occurrence of the two chunks, the greater the probability of a malapropism;
– For the same number of co-occurrences of the two chunks, the more pages there are for the individual chunks, the greater the probability of a malapropism.
Malapropisms Detection - Filters (2)
• 1st filter: no_combined has a very small value (less than 20) → signals a possible malapropism; used to eliminate noise.
• For the next five filters, a possible
malapropism is signaled if the following
formula is true:
Malapropisms Detection - Filters (3)
• 20 → 500: 2nd filter, beta = 1.05 (higher permission)
• 500 → 12000: 3rd filter, beta = 1 (normal permission; most often used!)
• 12000 → 14000: 4th filter, beta = .95 (smaller permission)
• 14000 → 15000: 5th filter, beta = .9 (even smaller permission)
• 15000 → 16000: 6th filter, beta = .8 (much smaller permission)
• 16000 +: 7th filter; the formula is no longer used and no malapropism is signaled!
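The mapping from thresholds to beta values can be sketched as a lookup. Two assumptions are made here that the slides do not state outright: the thresholds are compared against no_combined, and each range includes its lower bound. The function name `select_filter` is hypothetical, and the formula that uses beta is not reproduced because it does not survive in the transcript.

```python
from bisect import bisect_right
from typing import Optional, Tuple

# Thresholds and beta values as listed on the slides.
THRESHOLDS = [20, 500, 12_000, 14_000, 15_000, 16_000]
# Entry i holds the beta of filter i + 1; None means the formula is not applied
# (1st filter: signal directly; 7th filter: signal nothing).
BETAS = [None, 1.05, 1.0, 0.95, 0.9, 0.8, None]


def select_filter(no_combined: int) -> Tuple[int, Optional[float]]:
    """Return (filter number, beta) for a given co-occurrence hit count."""
    index = bisect_right(THRESHOLDS, no_combined)
    return index + 1, BETAS[index]


print(select_filter(4120))  # (3, 1.0)
```

With 4120 hits the lookup lands in the 500 to 12000 range, which agrees with the walkthrough example, where that count is handled by the 3rd filter.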
Malapropisms Detection - Final Remarks (1)
• Filters depend on:
– Thresholds (20, 500, 12k, 14k, 15k, 16k) and
– Beta, the coefficient for the co-occurrence of the two chunks (1.05, 1, .95, .9, .8).
• These values have been empirically determined and they are:
– Language dependent: the number of hits is different for each language;
– Time dependent: the web is continuously growing;
– Text independent: no feature of the text has been considered.
Malapropisms Detection - Final Remarks (2)
• The purpose of this module is to limit as much as possible the number of misses in malapropism detection.
• The module also signals many false alarms, but these are evaluated in the next module, where some of them are discarded.
Malapropisms Correction
• Purposes:
– Identify and eliminate the false alarms and
– Detect the most probable candidates for the
remaining malapropisms and correct them.
• Uses all three tools (Google, the paronym dictionary, WordNet).
• Works sequentially: analyzes every pair of chunks and decides whether a malapropism or a false alarm has been found.
Malapropisms Correction - Methodology
• Correction is done in three stages:
– Replacement candidates that could restore the local cohesion are identified using the paronyms dictionary;
– These candidates are filtered against the local context, using the search engine in the same manner as for detection;
– The replacement word is chosen from the remaining candidates based on the text logic (represented by lexical chains), so that the coherence of the whole text is maintained.
Malapropisms Correction - Possible Situations (1)
• A signaled malapropism at the first/last word of a sentence:
Malapropisms Correction - Possible Situations (2)
• An isolated malapropism in the middle of the
sentence:
Malapropisms Correction - Possible Situations (3)
• A malapropism chain: multiple consecutive chunks signaled as possible malapropisms.
• Try to correct only one of them → the one whose correction fixes both malapropisms (2 chunks are corrected together) – figure a;
• If this is impossible, each malapropism is treated separately in order to correct both – figure b;
• If this is still impossible, only one of them is corrected.
Walkthrough Example (1)
• I am travelling around the word [world].
• Chunker: I; am travelling; around the word.
• Google: “I am travelling” – 1.6 million hits; “am travelling around the word” – 3 hits.
– The first combination is considered correct, while the second signals a possible malapropism.
• Paronyms dictionary: word - cord, ford, lord, sword,
ward, wyrd, woad, wold, wood, wordy, work, worm,
worn, wort, world.
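Every paronym listed for "word" differs from it by a single substituted, inserted or deleted letter. A minimal sketch under that observation follows; treating paronyms as words within one edit is an assumption that matches this example, not a definition given on the slides, and the function names and the small vocabulary are hypothetical.

```python
from typing import Iterable, List


def within_one_edit(a: str, b: str) -> bool:
    """True when a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = sorted((a, b), key=len)
    # Different length: deleting one letter of the longer word must give the shorter.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def candidate_paronyms(word: str, vocabulary: Iterable[str]) -> List[str]:
    """Vocabulary entries within one edit of word, in alphabetical order."""
    return sorted(v for v in vocabulary if within_one_edit(word, v))


# Hypothetical toy vocabulary; a real system would use a full dictionary.
vocabulary = ["world", "sword", "wood", "travel", "word", "words", "around"]
print(candidate_paronyms("word", vocabulary))  # ['sword', 'wood', 'words', 'world']
```

A plain edit-distance lookup returns every near neighbour in the vocabulary, including inflected forms such as "words", so a curated paronym dictionary would normally be narrower than this sketch.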
Walkthrough Example (2)
• Google again: “word” is replaced by each of its paronyms and the number of hits for every combination “am travelling around the <paronym>” is retrieved.
• Filters: the only combination that passes the filters is “am travelling around the world”, which has 4120 hits and passes the 3rd filter (beta = 1).
• WordNet: it is verified that world is part of a lexical chain
that starts from travelling.
• A malapropism is signalled and the corrected form is given:
“I am travelling around the world.”
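The walkthrough can be replayed as a small sketch of the correction stages. Everything except the two hit counts and the paronym list is an assumption made for illustration: the function names are hypothetical, the search engine and WordNet are replaced by stubs, only the first filter (fewer than 20 hits) is applied, and preferring the candidate with the most hits is a simplification rather than the authors' rule.

```python
from typing import Callable, Iterable, Optional


def pick_replacement(context: str,
                     candidates: Iterable[str],
                     hit_count: Callable[[str], int],
                     in_lexical_chain: Callable[[str], bool],
                     min_hits: int = 20) -> Optional[str]:
    """Keep candidates that fit the local context and the text's lexical chains."""
    scored = [(hit_count(f"{context} {c}"), c) for c in candidates]
    survivors = [(hits, c) for hits, c in scored
                 if hits >= min_hits and in_lexical_chain(c)]
    # Assumption: among survivors, prefer the phrase with the most hits.
    return max(survivors)[1] if survivors else None


# The only hit counts reported on the slides.
SLIDE_HITS = {
    "am travelling around the word": 3,
    "am travelling around the world": 4120,
}


def stub_hit_count(phrase: str) -> int:
    # Placeholder: phrases whose counts are not on the slides are treated as 0.
    return SLIDE_HITS.get(phrase, 0)


def stub_in_chain(word: str) -> bool:
    # Stand-in for the WordNet check that the word joins a chain from "travelling".
    return word == "world"


PARONYMS = ["cord", "ford", "lord", "sword", "ward", "wyrd", "woad", "wold",
            "wood", "wordy", "work", "worm", "worn", "wort", "world"]

result = pick_replacement("am travelling around the", PARONYMS,
                          stub_hit_count, stub_in_chain)
print(result)  # world
```

Passing the search engine and the WordNet check in as callables keeps the selection logic testable without network access or an installed lexical database.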
Experiments
• Three types of corpora have been used for testing:
– 1st corpus: built from individual phrases containing malapropisms;
– 2nd corpus: contained no malapropisms at all;
– 3rd corpus: consisted of text published on the Internet (excerpts from Fox News) and modified to introduce malapropisms as suggested by (Hirst and St-Onge, 1998) and (Hirst and Budanitsky, 2005).
Results (1)
• 1st corpus:
– 27 out of the 31 examples were correctly detected (87.10%) and
– 25 of them were properly corrected (80.65%).
• 2nd corpus (587 words):
– 1 false alarm was introduced (.17%)
• Caused by the POS tagger, which wrongly identified “while” as a noun, so the application replaced it with the more probable “white”.
Results (2)
• 3rd corpus:
– Smaller text (199 words, 1 malapropism)
• The malapropism was corrected, but one false alarm was introduced (.5%); it seems we underestimated the false alarm rate.
– Larger text (2083 words, 25 malapropisms)
• 21 malapropisms were detected (84%);
• 17 malapropisms were corrected (68%);
• 10 false alarms were introduced (.48%)
– 6 of these were in the vicinity of a proper noun (e.g. Iran was replaced by Iraq, the two countries having similar contexts).
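The rates on the two results slides can be recomputed from the raw counts. This is a small arithmetic sketch; the helper name `pct` is hypothetical, detection and correction rates are taken over the number of malapropisms, and false alarm rates over the number of words.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


rates = {
    "1st corpus, detected": pct(27, 31),
    "1st corpus, corrected": pct(25, 31),
    "2nd corpus, false alarms per word": pct(1, 587),
    "3rd corpus (small), false alarms per word": pct(1, 199),
    "3rd corpus (large), detected": pct(21, 25),
    "3rd corpus (large), corrected": pct(17, 25),
    "3rd corpus (large), false alarms per word": pct(10, 2083),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```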
Conclusions
• Our approach:
– Combines three technologies (WordNet, Google, paronyms dictionary);
– Uses thresholds that do not depend on the analyzed texts;
– Uses chunks of text in order to capture the local cohesion of texts;
– Is fully automated.
Limitations
• Limitations:
– The application has problems with proper nouns, numbers and metaphors found in the analyzed texts;
– Results depend on the WordNet structure and on the accuracy of lexical chain construction;
– The paronyms dictionary is limited (at the moment only first-level paronyms are used).
Possible Improvements
• Possible improvements:
– Construct the syntactic tree of each phrase in order to consider the dependencies between the chunks of text instead of evaluating them sequentially;
– Evaluate whether the empirically chosen thresholds hold for any language by verifying them on a different language;
– Multi-threading.
Q&A
Thank you for your time!
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 

Malapropisms detection and correction prezentarea

  • 1. Malapropisms Detection and Correction Using a Paronyms Dictionary, a Search Engine and WordNet
      Universitatea Politehnica București, Facultatea de Automatică și Calculatoare, Catedra de Calculatoare
      Costin-Gabriel Chiru (costin.chiru@cs.pub.ro), Valentin Cojocaru, Traian Rebedea, Ştefan Trăuşan-Matu
  • 2. Contents (23.07.2010, ICSOFT 2010)
      • Introduction
      • Used tools
      • Application architecture
        – Malapropisms detection
        – Malapropisms correction
      • Walkthrough example
      • Experiments and results
      • Conclusions and further development
  • 3. Introduction
      • Purpose: detection and correction of malapropos words (unintentional misuse of a word by confusion with another one).
      • Methodology: evaluate the local cohesion of a text to identify possible malapropisms, then use the coherence of the whole text, evaluated through lexical chains built with the linguistic ontology, to correct them.
  • 4. Tools
      • Google search engine, to estimate the probability of co-appearance of two words or blocks of words → used for the detection of malapropos words;
      • A paronym dictionary, to extract the possible replacements for the malapropos words;
      • WordNet, to detect how closely related two words are → used for malapropisms correction.
  • 6. Malapropisms Detection
      • Responsible for detecting anomalies in the local text cohesion, using Google.
      • Two chunks of text are sent to Google, which returns:
        – The number of hits for the 1st chunk (no_pages1);
        – The number of hits for the 2nd chunk (no_pages2);
        – The number of hits for the co-occurrence of the two chunks, with the 2nd chunk right after the 1st one (no_combined).
      • Using the mutual information inequality, the module evaluates whether their co-occurrence is statistically plausible.
      • Why chunks?
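The transcript does not reproduce the formula used on the slides, so the following is only a minimal sketch of the textbook mutual information inequality, P(a, b) ≥ P(a)·P(b), with each probability estimated as a hit count divided by the number of indexed pages. The function name, the `pages` argument and the sample counts are illustrative assumptions, not the authors' implementation.

```python
def cooccurrence_is_plausible(no_pages1: int, no_pages2: int,
                              no_combined: int, pages: int) -> bool:
    """Return True when two chunks co-occur at least as often as chance predicts.

    Assumption: P(chunk) is estimated as hits / pages, so the inequality
    P(a, b) >= P(a) * P(b) becomes
    no_combined * pages >= no_pages1 * no_pages2 (all counts are integers).
    """
    if pages <= 0:
        raise ValueError("pages must be a positive number of indexed pages")
    return no_combined * pages >= no_pages1 * no_pages2


# Made-up counts, for illustration only.
print(cooccurrence_is_plausible(1_000, 1_000, 10, 1_000_000))      # True
print(cooccurrence_is_plausible(100_000, 100_000, 3, 1_000_000))   # False
```

Working with the cross-multiplied integer form avoids dividing by very large page counts.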
  • 7. Malapropisms Detection (2)
      • Content words are rarely adjacent → to check whether the local text cohesion is damaged, we also need the function words that connect them → chunker → phrase decomposed into chunks → chunks sequentially evaluated using Google.
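As a rough illustration of the sequential evaluation, the sketch below pairs each chunk with the one that follows it to form the combined search phrases. It assumes the chunker has already produced the list of chunks; the function name is hypothetical and the sample input is the chunking shown in the walkthrough example.

```python
def adjacent_chunk_queries(chunks: list[str]) -> list[str]:
    """Build one combined phrase per pair of consecutive chunks."""
    return [f"{first} {second}" for first, second in zip(chunks, chunks[1:])]


chunks = ["I", "am travelling", "around the word"]
print(adjacent_chunk_queries(chunks))
# ['I am travelling', 'am travelling around the word']
```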
  • 8. Malapropisms Detection - Filters
      • Cohesion evaluation is done based on six progressive filters.
      • The assumptions behind these six filters are:
        – The fewer hits for the co-occurrence of the two chunks, the greater the probability of a malapropism;
        – The more pages for the individual chunks, given the same number of co-occurrences of the two chunks, the greater the probability of a malapropism.
  • 9. Malapropisms Detection - Filters (2)
      • 1st filter: no_combined has a very small value (less than 20) → signals a possible malapropism; used to eliminate noise.
      • For the next five filters, a possible malapropism is signaled if the following formula is true: (formula not reproduced in this transcript)
  • 10. Malapropisms Detection - Filters (3)
      • Filters by range of no_combined:
        – 20 → 500: 2nd filter, beta = 1.05 (higher permission);
        – 500 → 12000: 3rd filter, beta = 1 (normal permission; most often used);
        – 12000 → 14000: 4th filter, beta = .95 (smaller permission);
        – 14000 → 15000: 5th filter, beta = .9 (even smaller permission);
        – 15000 → 16000: 6th filter, beta = .8 (much smaller permission);
        – 16000+: 7th filter, the formula is no longer used and no malapropism is signaled.
  • 11. Malapropisms Detection Final Remarks (1)
      • Filters depend on:
        – Thresholds (20, 500, 12k, 14k, 15k, 16k) and
        – Beta, the coefficient for the co-occurrence of the two chunks (1.05, 1, .95, .9, .8).
      • These values have been empirically determined and they are:
        – Language dependent: the number of hits is different for each language;
        – Time dependent: the web is continuously growing;
        – Text independent: no feature of the text has been considered.
  • 12. Malapropisms Detection Final Remarks (2)
      • The purpose of this module is to limit as much as possible the number of misses in the malapropisms detection.
      • The module also signals a lot of fake malapropisms, but they will be evaluated in the next module and some of them will be ignored.
  • 13. Malapropisms Correction
      • Purposes:
        – Identify and eliminate the false alarms and
        – Detect the most probable candidates for the remaining malapropisms and correct them.
      • Uses all three tools (paronyms dictionary, search engine, WordNet).
      • Works sequentially: analyzes every pair of chunks and decides whether a malapropism or a false alarm has been found.
  • 14. Malapropisms Correction Methodology
      • Correction is done in three stages:
        – The replacement candidates that could ensure the local cohesion are identified using the paronyms dictionary;
        – These words are filtered against the local context, using the search engine in the same manner as for detection;
        – The replacement word is chosen from the remaining words, based on the text logic (represented by lexical chains), so that the coherence of the whole text is maintained.
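The three stages can be sketched as a small pipeline. This is an illustrative outline only: the three callables stand in for the paronyms dictionary, the search-engine filters and the lexical-chain test, none of which are specified in the slides, and returning the first surviving candidate is a simplification of "chosen based on the text logic".

```python
from typing import Callable, Iterable, Optional


def correct_word(
    word: str,
    paronyms_of: Callable[[str], Iterable[str]],
    fits_local_context: Callable[[str], bool],
    fits_lexical_chain: Callable[[str], bool],
) -> Optional[str]:
    """Return a replacement for `word`, or None if it looks like a false alarm."""
    # Stage 1: candidate replacements from the paronyms dictionary.
    candidates = list(paronyms_of(word))
    # Stage 2: keep the candidates that restore local cohesion.
    locally_cohesive = [c for c in candidates if fits_local_context(c)]
    # Stage 3: pick a candidate that fits a lexical chain of the text.
    for candidate in locally_cohesive:
        if fits_lexical_chain(candidate):
            return candidate
    return None


# Stub components, for illustration only.
toy_dictionary = {"word": ["cord", "lord", "world"]}
replacement = correct_word(
    "word",
    paronyms_of=lambda w: toy_dictionary.get(w, []),
    fits_local_context=lambda c: c == "world",
    fits_lexical_chain=lambda c: c in {"world", "travelling"},
)
print(replacement)  # world
```

Passing the three components in as functions keeps the sketch independent of any particular dictionary, search API or WordNet binding.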
  • 15. Malapropisms Correction Possible Situations (1)
      • A signaled malapropism in the first/last word of a sentence (figure not reproduced in this transcript).
  • 16. Malapropisms Correction Possible Situations (2)
      • An isolated malapropism in the middle of the sentence (figure not reproduced in this transcript).
  • 17. Malapropisms Correction Possible Situations (3)
      • A malapropisms chain: multiple consecutive chunks signaled as possible malapropisms.
      • Try to correct only one of them → the one that corrects both malapropisms (2 chunks are corrected together) – figure a;
      • If this is impossible, each malapropism is treated separately in order to correct both – figure b;
      • If this is still impossible, only one of them is corrected.
  • 19. Walkthrough Example (1)
      • I am travelling around the word [world].
      • Chunker: I; am travelling; around the word.
      • Google: “I am travelling” – 1.6 million hits; “am travelling around the word” – 3 hits.
        – The first combination is considered to be correct, while the second signals a possible malapropism.
      • Paronyms dictionary: word - cord, ford, lord, sword, ward, wyrd, woad, wold, wood, wordy, work, worm, worn, wort, world.
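The slides do not define what makes a paronym "first-level". One observation is that every paronym listed for "word" differs from it by a single-character substitution or insertion. The sketch below checks that property, under the assumption (not confirmed by the source) that first-level means one edit away; the function name is hypothetical.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """True if b is obtained from a by one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure len(a) <= len(b)
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: the rest must match after the differing character.
        return a[i + 1:] == b[i + 1:]
    # Insertion into a: the rest of a must match b after the extra character.
    return a[i:] == b[i + 1:]


paronyms = ["cord", "ford", "lord", "sword", "ward", "wyrd", "woad", "wold",
            "wood", "wordy", "work", "worm", "worn", "wort", "world"]
print(all(is_one_edit_apart("word", p) for p in paronyms))  # True
```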
  • 20. Walkthrough Example (2)
      • Google again: “word” is replaced by each of its paronyms and the number of hits for every combination “am travelling around the <paronym>” is retrieved.
      • Filters: the only candidate that passes the filters is “am travelling around the world”, which has 4120 hits and passes the 3rd filter (beta = 1).
      • WordNet: it is verified that world is part of a lexical chain that starts from travelling.
      • A malapropism is signaled and the corrected form is given: “I am travelling around the world.”
  • 21. Experiments
      • 3 types of corpora have been used for testing:
        – 1st corpus: built from individual phrases containing malapropisms;
        – 2nd corpus: contained no malapropisms at all;
        – 3rd corpus: consisted of parts of text published on the Internet (excerpts from Fox News) and modified to introduce malapropisms as suggested by (Hirst and St-Onge, 1998) and (Hirst and Budanitsky, 2005).
  • 22. Results (1)
      • 1st corpus:
        – 27 out of the 31 examples were correctly detected (87.05%) and
        – 25 of them were properly corrected (80.64%).
      • 2nd corpus (587 words):
        – 1 false alarm was introduced (.17%), due to the POS tagger, which wrongly identified “while” as a noun; the application then replaced it with the more probable “white”.
  • 23. Results (2)
      • 3rd corpus:
        – Smaller text (199 words, 1 malapropism): the malapropism was corrected, but a false alarm was introduced (.5%); it seems we underestimated the false alarms rate.
        – Larger text (2083 words, 25 malapropisms):
          · 21 malapropisms have been detected (84%);
          · 17 malapropisms have been corrected (68%);
          · 10 false alarms were introduced (.48%); 6 of these were in the vicinity of a proper noun (e.g. Iran was replaced by Iraq, the two countries having similar contexts).
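The percentages for the 3rd corpus can be re-derived from the raw counts. The short sketch below assumes that detection and correction rates are computed over the number of malapropisms and that false alarm rates are computed over the number of words, which is the reading that matches the reported figures; the helper name is hypothetical.

```python
def rate(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


print(rate(21, 25))    # 84.0  detected, larger text
print(rate(17, 25))    # 68.0  corrected, larger text
print(rate(10, 2083))  # 0.48  false alarms per word, larger text
print(rate(1, 199))    # 0.5   false alarms per word, smaller text
```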
  • 24. Conclusions
      • Our approach:
        – Combines three technologies (WordNet, Google, paronyms dictionary);
        – Uses thresholds that do not depend on the analyzed texts;
        – Uses chunks of text in order to capture the local cohesion of texts;
        – Is fully automated.
  • 25. Limitations
      • The application has problems with the proper nouns, the numbers and the metaphors found in the analyzed texts;
      • It depends on the WordNet structure and on the accuracy of the lexical chains construction;
      • It depends on the paronyms dictionary (at the moment only first-level paronyms are used).
  • 26. Possible Improvements
      • Construct the syntactic tree of each phrase in order to consider the dependencies between the chunks of text instead of evaluating them sequentially;
      • Evaluate whether the empirically chosen thresholds hold for any language by testing them on a different language;
      • Multi-threading.
  • 27. Q&A Thank you for your time!

Editor's Notes

  1. POS tagger: Qtag. The dictionary has 77,503 words, 22,020 of them (28.4%) having at least one first-level paronym.
  2. The pages parameter in the formula above represents the number of indexed pages written in the language used.
  3. Every paronym replaces the malapropos word and the local cohesion of the phrase is tested considering the next/previous chunk of text. If the new word fits better, then it is tested if it fits in one of the lexical chains of the text. If so, it becomes the replacement candidate and the malapropism is signaled as a real one.
  4. Here, the local cohesion of the phrase is tested considering both the next and previous chunks of text. If the candidate fits with only 1 chunk, then it is marked as a possible replacement, but the malapropism is neither marked as real yet nor ignored.
  5. A small one (199 words) and a larger one (2083 words).