University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Malapropisms Detection and Correction
Using a Paronyms Dictionary, a Search
Engine and WordNet
Costin-Gabriel Chiru - costin.chiru@cs.pub.ro
Valentin Cojocaru
Traian Rebedea
Ştefan Trăuşan-Matu
Contents
• Introduction
• Tools used
• Application architecture
– Malapropisms detection
– Malapropisms correction
• Walkthrough example
• Experiments and results
• Conclusions and further development
23.07.2010, ICSOFT 2010
Introduction
• Purpose: detection and correction of malapropos words (words unintentionally misused through confusion with another word).
• Methodology: evaluate the local cohesion of a text to identify possible malapropisms, then use the coherence of the whole text, evaluated in terms of lexical chains built using the linguistic ontology, to correct them.
Tools
• Google search engine, to estimate the probability that two words or blocks of words co-occur → used for the detection of malapropos words;
• A paronyms dictionary, to extract the possible replacements for the malapropos words;
• WordNet, to determine how closely related two words are → used for malapropism correction.
Application Architecture
Malapropisms Detection
• Responsible for detecting anomalies in the local text cohesion, using Google.
• Two chunks of text are sent to Google, giving three counts:
– The number of hits for the 1st chunk (no_pages1);
– The number of hits for the 2nd chunk (no_pages2);
– The number of hits for the co-occurrence of the two chunks, with the 2nd chunk right after the 1st one (no_combined).
• Based on the mutual information inequality, the module evaluates whether their co-occurrence is statistically plausible.
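A rough illustration of how hit counts can feed a mutual-information test is sketched below. It uses the standard pointwise mutual information definition, not necessarily the exact inequality used by the application; `total_pages` (an estimate of the size of the search index) and the function name are assumptions introduced only for this example.

```python
import math

def pmi_from_hits(no_pages1: int, no_pages2: int, no_combined: int,
                  total_pages: int) -> float:
    """Pointwise mutual information of two chunks, estimated from hit counts.

    total_pages is an assumed estimate of the number of indexed pages;
    it is not a value given in the slides.
    """
    if min(no_pages1, no_pages2, no_combined) <= 0:
        return float("-inf")  # no evidence that the two chunks co-occur
    p1 = no_pages1 / total_pages      # estimated probability of chunk 1
    p2 = no_pages2 / total_pages      # estimated probability of chunk 2
    p12 = no_combined / total_pages   # estimated probability of "chunk1 chunk2"
    return math.log2(p12 / (p1 * p2))
```

A value near zero means the chunks co-occur about as often as chance would predict; strongly negative values point to a suspicious combination.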
Malapropisms Detection (2)
Why chunks?
• Content words are rarely adjacent → to check whether the local text cohesion is damaged, we also need the function words that connect them → chunker → phrase decomposed into chunks → chunks sequentially evaluated using Google.
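The sequential evaluation of chunks can be sketched as pairing each chunk with its right-hand neighbour and building the combined query. This is a minimal sketch; the function name is hypothetical and the chunker itself is assumed to exist elsewhere.

```python
def adjacent_chunk_queries(chunks):
    """Yield (first, second, combined) for each pair of neighbouring chunks."""
    for first, second in zip(chunks, chunks[1:]):
        # The combined query places the 2nd chunk right after the 1st one.
        yield first, second, f"{first} {second}"


# Chunks taken from the deck's walkthrough sentence.
for first, second, combined in adjacent_chunk_queries(
        ["I", "am travelling", "around the word"]):
    print(combined)
```

For the walkthrough sentence this produces the two queries "I am travelling" and "am travelling around the word".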
Malapropisms Detection - Filters
• Cohesion evaluation is based on six progressive filters.
• The assumptions behind these six filters are:
– The fewer hits for the co-occurrence of the two chunks, the greater the probability of a malapropism;
– For the same number of co-occurrences of the two chunks, the more pages for the individual chunks, the greater the probability of a malapropism.
Malapropisms Detection - Filters (2)
• 1st filter: no_combined has a very small value (less than 20) → a possible malapropism is signaled; used to eliminate noise.
• For the next five filters, a possible malapropism is signaled if the following formula is true:
Malapropisms Detection - Filters (3)
Filter applied for each range of no_combined:
• 20 to 500: 2nd filter, beta = 1.05 (higher permission)
• 500 to 12000: 3rd filter, beta = 1 (normal permission); most often used!
• 12000 to 14000: 4th filter, beta = .95 (smaller permission)
• 14000 to 15000: 5th filter, beta = .9 (even smaller permission)
• 15000 to 16000: 6th filter, beta = .8 (much smaller permission)
• 16000 +: 7th filter, the formula is no longer used and no malapropism is signaled!
Malapropisms Detection
Final Remarks (1)
• Filters depend on:
– Thresholds (20, 500, 12k, 14k, 15k, 16k) and
– Beta – coefficient for the co-occurrence of the two
chunks (1.05, 1, .95, .9, .8).
• These values have been determined empirically and they are:
– Language dependent: the number of hits differs for each language;
– Time dependent: the web is continuously growing;
– Text independent: no feature of the text has been considered.
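How the thresholds and beta values might drive the choice of filter is sketched below. The pairing of each beta with an interval of no_combined is an assumption read off the filter diagram, as is the treatment of boundary values; the names are hypothetical and the formula applied with beta is not reproduced here.

```python
NOISE_THRESHOLD = 20  # below this, a possible malapropism is signaled directly

# (upper bound on no_combined, beta), in increasing order.
# Assumption: each beta applies to the interval ending at its upper bound.
BETA_BY_UPPER_BOUND = [
    (500, 1.05),
    (12_000, 1.0),
    (14_000, 0.95),
    (15_000, 0.9),
    (16_000, 0.8),
]

def select_filter(no_combined):
    """Return ("signal", None), ("formula", beta) or ("accept", None)."""
    if no_combined < NOISE_THRESHOLD:
        return ("signal", None)       # 1st filter: too few co-occurrences
    for upper, beta in BETA_BY_UPPER_BOUND:
        if no_combined < upper:
            return ("formula", beta)  # evaluate the formula with this beta
    return ("accept", None)           # above 16000: nothing is signaled
```

Keeping the thresholds in one table makes it easy to re-tune them, which matters because the slides describe them as language and time dependent.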
Malapropisms Detection
Final Remarks (2)
• The purpose of this module is to limit as much
as possible the number of misses in the
malapropisms detection.
• The module also signals many false malapropisms, but these are evaluated in the next module and some of them are discarded.
Malapropisms Correction
• Purposes:
– Identify and eliminate the false alarms;
– Find the most probable candidates for the remaining malapropisms and correct them.
• Uses all three tools.
• Works sequentially: analyzes every pair of chunks and decides whether a malapropism or a false alarm has been found.
Malapropisms Correction
Methodology
• Correction is done in three stages:
– The replacement candidates that could ensure the local cohesion are identified using the paronyms dictionary;
– These words are filtered against the local context, using the search engine in the same manner as for detection;
– The replacement word is chosen from the remaining words, based on the text logic (represented by lexical chains), so that the coherence of the whole text is maintained.
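The three stages can be sketched as a small pipeline. This is only an illustration under assumptions: `paronyms` stands in for the paronyms dictionary, while `fits_local_context` and `in_lexical_chain` are hypothetical callables standing in for the search-engine filter and the WordNet lexical-chain check.

```python
def choose_replacement(word, paronyms, fits_local_context, in_lexical_chain):
    """Return a replacement for word, or None if no candidate survives."""
    # Stage 1: candidates come from the paronyms dictionary.
    candidates = paronyms.get(word, [])
    # Stage 2: keep only candidates that fit the local context.
    locally_ok = [c for c in candidates if fits_local_context(c)]
    # Stage 3: keep only candidates consistent with the lexical chains.
    coherent = [c for c in locally_ok if in_lexical_chain(c)]
    return coherent[0] if coherent else None


# Toy stand-ins for the dictionary and the two checks.
toy_paronyms = {"word": ["cord", "world", "work"]}
print(choose_replacement(
    "word",
    toy_paronyms,
    fits_local_context=lambda c: c in {"world", "work"},
    in_lexical_chain=lambda c: c == "world",
))
```

Passing the two checks in as callables keeps the sketch independent of any particular search engine or WordNet interface.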
Malapropisms Correction
Possible Situations (1)
• A signaled malapropism in the first/last word
in a sentence:
Malapropisms Correction
Possible Situations (2)
• An isolated malapropism in the middle of the
sentence:
Malapropisms Correction
Possible Situations (3)
• A malapropism chain: multiple consecutive chunks signaled as possible malapropisms.
• Try to correct only one of them → the one that corrects both malapropisms (2 chunks are corrected together) – figure a;
• If this is impossible, each malapropism is treated separately in order to correct both – figure b;
• If this is still impossible, only one of them is corrected.
Walkthrough Example (1)
• I am travelling around the word [world].
• Chunker: I; am travelling; around the word.
• Google: “I am travelling” – 1.6 million hits; “am travelling around the word” – 3 hits.
– The first combination is considered correct, while the second signals a possible malapropism.
• Paronyms dictionary: word - cord, ford, lord, sword, ward, wyrd, woad, wold, wood, wordy, work, worm, worn, wort, world.
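Every paronym listed for "word" differs from it by a single letter substitution, insertion or deletion. A minimal sketch of such a check follows; it is only an illustration of that observation, not a description of how the paronyms dictionary used by the application was built, and the function name is hypothetical.

```python
def one_edit_apart(a: str, b: str) -> bool:
    """True if b is one substitution, insertion or deletion away from a."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: the rest after the differing letter must match.
        return a[i + 1:] == b[i + 1:]
    # Insertion into a: dropping b[i] must give the rest of a.
    return a[i:] == b[i + 1:]


listed = ["cord", "ford", "lord", "sword", "ward", "wyrd", "woad", "wold",
          "wood", "wordy", "work", "worm", "worn", "wort", "world"]
print(all(one_edit_apart("word", p) for p in listed))
```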
Walkthrough Example (2)
• Google again: “word” is replaced by each of its paronyms and the number of hits for every combination “am travelling around the <paronym>” is retrieved.
• Filters: the only combination that passes the filters is “am travelling around the world”, which has 4120 hits and passes the 3rd filter (beta = 1).
• WordNet: it is verified that “world” is part of a lexical chain that starts from “travelling”.
• A malapropism is signalled and the corrected form is given: “I am travelling around the world.”
Experiments
• 3 types of corpora have been used for testing:
– 1st corpus – built from individual phrases containing malapropisms;
– 2nd corpus – contained no malapropisms at all;
– 3rd corpus – consisted of parts of texts published on the Internet (parts of some Fox News stories), modified to introduce malapropisms as suggested by (Hirst and St-Onge, 1998) and (Hirst and Budanitsky, 2005).
Results (1)
• 1st corpus:
– 27 out of the 31 examples were correctly detected (87.10%) and
– 25 of them were properly corrected (80.65%).
• 2nd corpus (587 words):
– 1 false alarm was introduced (0.17%)
• Caused by the POS tagger, which wrongly identified “while” as a noun; the application then replaced it with the more probable “white”.
Results (2)
• 3rd corpus:
– Smaller text (199 words, 1 malapropism)
• The malapropism was corrected but a false alarm was introduced (0.5%); it seems we underestimated the false alarm rate.
– Larger text (2083 words, 25 malapropisms)
• 21 malapropisms were detected (84%);
• 17 malapropisms were corrected (68%);
• 10 false alarms were introduced (0.48%)
– 6 of these were in the vicinity of a proper noun (e.g. Iran was replaced by Iraq, the two countries having similar contexts).
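The percentages for the larger text can be recomputed from the reported counts. The helper below is a small sketch with a hypothetical name; detection and correction rates are taken relative to the number of malapropisms, and the false alarm rate relative to the number of words.

```python
def rate(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts reported for the larger text of the 3rd corpus.
words, malapropisms = 2083, 25
detected, corrected, false_alarms = 21, 17, 10

print(rate(detected, malapropisms))   # detection rate
print(rate(corrected, malapropisms))  # correction rate
print(rate(false_alarms, words))      # false alarm rate
```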
Conclusions
• Our approach:
– Combines three technologies (WordNet, Google, paronyms dictionary);
– Uses thresholds that do not depend on the analyzed texts;
– Uses chunks of text in order to capture the local cohesion of texts;
– Is fully automated.
Limitations
• Limitations:
– The application has problems with proper nouns, numbers and metaphors found in the analyzed texts;
– WordNet structure and the accuracy of lexical chain construction;
– Paronyms dictionary (at the moment only first-level paronyms are used).
Possible Improvements
• Possible improvements:
– Construct the syntactic tree of each phrase in order to consider the dependencies between the chunks of text instead of evaluating them sequentially;
– Evaluate whether the empirically chosen thresholds hold for any language by verifying them on a different language;
– Multi-threading.
Q&A
Thank you for your time!
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 

Malapropisms detection and correction: the presentation

  • 1. Malapropisms Detection and Correction Using a Paronyms Dictionary, a Search Engine and WordNet
      • Costin-Gabriel Chiru (costin.chiru@cs.pub.ro), Valentin Cojocaru, Traian Rebedea, Ştefan Trăuşan-Matu
      • Universitatea Politehnica București, Facultatea de Automatică și Calculatoare, Catedra de Calculatoare
  • 2. Contents (ICSOFT 2010, 23.07.2010)
      • Introduction
      • Tools used
      • Application architecture
          – Malapropisms detection
          – Malapropisms correction
      • Walkthrough example
      • Experiments and results
      • Conclusions and further development
  • 3. Introduction
      • Purpose: detect and correct malapropos words (the unintentional misuse of a word through confusion with another one).
      • Methodology: evaluate the local cohesion of a text to identify possible malapropisms, then use the coherence of the whole text, evaluated in terms of lexical chains built with the linguistic ontology, to correct them.
  • 4. Tools
      • Google search engine: estimates the probability that two words or blocks of words appear together → used to detect malapropos words;
      • A paronym dictionary: supplies the possible replacements for the malapropos words;
      • WordNet: measures how closely related two words are → used to correct malapropisms.
  • 6. Malapropisms Detection
      • Detects anomalies in the local text cohesion, using Google.
      • Two chunks of text are sent to Google, which yields three counts:
          – the number of hits for the 1st chunk (no_pages1);
          – the number of hits for the 2nd chunk (no_pages2);
          – the number of hits for the co-occurrence of the two chunks, with the 2nd chunk immediately after the 1st (no_combined).
      • Based on the mutual information inequality, the module evaluates whether their co-appearance is statistically plausible.
      • Why chunks?
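The transcript does not preserve the inequality itself, so the following is only a minimal sketch of a mutual-information-style check on the three hit counts. The function name, the `pages` argument (the total number of indexed pages, used to turn counts into probabilities) and the placement of the `beta` coefficient on the co-occurrence side are assumptions, not the authors' confirmed formula.

```python
def cooccurrence_is_suspicious(no_pages1, no_pages2, no_combined, pages, beta=1.0):
    """Return True when two chunks co-occur less often than independence predicts.

    Independence would give P(chunk1, chunk2) = P(chunk1) * P(chunk2), that is
    no_combined / pages = (no_pages1 / pages) * (no_pages2 / pages).
    Multiplying both sides by pages**2 avoids floating-point division.
    The position of beta is an assumption made for this sketch.
    """
    if pages <= 0:
        raise ValueError("pages must be positive")
    return beta * no_combined * pages < no_pages1 * no_pages2


# Toy numbers (hypothetical, not taken from the slides).
print(cooccurrence_is_suspicious(1_000_000, 2_000_000, 3, 10_000_000_000))      # True
print(cooccurrence_is_suspicious(1_000_000, 2_000_000, 4120, 10_000_000_000))   # False
```

With this form, a larger `beta` makes the check more permissive, because the observed co-occurrence count is weighted up before being compared with the count expected under independence.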
  • 7. Malapropisms Detection (2)
      • Content words are rarely adjacent → to check whether the local text cohesion is damaged, we also need the function words that connect them → chunker → phrase decomposed into chunks → chunks evaluated sequentially using Google.
  • 8. Malapropisms Detection - Filters
      • Cohesion is evaluated with six progressive filters.
      • The assumptions behind these six filters are:
          – the fewer hits for the co-occurrence of the two chunks, the greater the probability of a malapropism;
          – for the same number of co-occurrences of the two chunks, the more pages for the individual chunks, the greater the probability of a malapropism.
  • 9. Malapropisms Detection - Filters (2)
      • 1st filter: no_combined has a very small value (less than 20). This signals a possible malapropism and is used to eliminate noise.
      • For the next five filters, a possible malapropism is signaled if the following formula is true: (formula not preserved in this transcript)
  • 10. Malapropisms Detection - Filters (3)
      • 2nd filter, 20 to 500 hits: beta = 1.05 (higher permission)
      • 3rd filter, 500 to 12000 hits: beta = 1 (normal permission; most often used)
      • 4th filter, 12000 to 14000 hits: beta = .95 (smaller permission)
      • 5th filter, 14000 to 15000 hits: beta = .9 (even smaller permission)
      • 6th filter, 15000 to 16000 hits: beta = .8 (much smaller permission)
      • 7th filter, 16000+ hits: the formula is no longer used and no malapropism is signaled.
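The filter bands can be read as a lookup from the co-occurrence count to a beta value. The sketch below assumes the bands apply to no_combined (as the first filter does) and that each band includes its lower bound and excludes its upper bound; neither detail is confirmed by the slides. `FILTER_BANDS` and `filter_for` are hypothetical names.

```python
# (lower bound, upper bound, beta) for filters 2 to 6; bounds as listed on the slide.
FILTER_BANDS = [
    (20, 500, 1.05),
    (500, 12_000, 1.00),
    (12_000, 14_000, 0.95),
    (14_000, 15_000, 0.90),
    (15_000, 16_000, 0.80),
]


def filter_for(no_combined):
    """Return (filter_number, beta); beta is None where the formula is not applied."""
    if no_combined < 20:
        return 1, None          # 1st filter: signal directly, no formula
    for number, (low, high, beta) in enumerate(FILTER_BANDS, start=2):
        if low <= no_combined < high:   # boundary handling is an assumption
            return number, beta
    return 7, None              # 16000+ hits: nothing is signaled


print(filter_for(3))       # (1, None)
print(filter_for(4120))    # (3, 1.0)
print(filter_for(20_000))  # (7, None)
```

A count of 4120 hits falls into the 3rd filter with beta = 1, which agrees with the walkthrough example in the slides.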
  • 11. Malapropisms Detection Final Remarks (1)
      • Filters depend on:
          – thresholds (20, 500, 12k, 14k, 15k, 16k) and
          – beta, the coefficient for the co-occurrence of the two chunks (1.05, 1, .95, .9, .8).
      • These values have been determined empirically and they are:
          – language dependent: the number of hits differs for each language;
          – time dependent: the web is continuously growing;
          – text independent: no feature of the text has been considered.
  • 12. Malapropisms Detection Final Remarks (2)
      • The purpose of this module is to miss as few malapropisms as possible.
      • The module therefore also signals many false malapropisms, but these are evaluated in the next module and some of them are ignored.
  • 13. Malapropisms Correction
      • Purposes:
          – identify and eliminate the false alarms and
          – detect the most probable candidates for the remaining malapropisms and correct them.
      • Uses all three technologies.
      • Works sequentially: analyzes every pair of chunks and decides whether a malapropism or a false alarm has been found.
  • 14. Malapropisms Correction Methodology
      • Correction is done in three stages:
          – the replacement candidates that could restore the local cohesion are identified using the paronyms dictionary;
          – these words are filtered against the local context, using the search engine in the same manner as for detection;
          – the replacement word is chosen from the remaining words, based on the text logic (represented by lexical chains), so that the coherence of the whole text is maintained.
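The three stages can be sketched as a small pipeline. This is an illustration under stated assumptions rather than the application's code: `choose_replacement` and its three callables (`paronyms`, `fits_locally`, `fits_lexical_chain`) are hypothetical stand-ins for the paronyms dictionary, the search-engine filters and the WordNet lexical-chain test, and returning `None` is used here to mean that no replacement was found.

```python
from typing import Callable, Iterable, Optional


def choose_replacement(
    word: str,
    paronyms: Callable[[str], Iterable[str]],
    fits_locally: Callable[[str], bool],
    fits_lexical_chain: Callable[[str], bool],
) -> Optional[str]:
    """Pick a replacement for a suspected malapropism, or None if none qualifies."""
    # Stage 1: candidates come from the paronyms dictionary.
    candidates = list(paronyms(word))
    # Stage 2: keep only candidates that pass the local-cohesion filters.
    local_survivors = [c for c in candidates if fits_locally(c)]
    # Stage 3: take the first survivor that fits a lexical chain of the text.
    for candidate in local_survivors:
        if fits_lexical_chain(candidate):
            return candidate
    return None


# Toy stand-ins (hypothetical data) for the three resources.
toy_dictionary = {"word": ["cord", "work", "world"]}
result = choose_replacement(
    "word",
    paronyms=lambda w: toy_dictionary.get(w, []),
    fits_locally=lambda c: c in {"work", "world"},
    fits_lexical_chain=lambda c: c == "world",
)
print(result)  # world
```

Passing the three resources in as functions keeps the sketch independent of any particular dictionary, search API or WordNet binding.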
  • 15. Malapropisms Correction Possible Situations (1)
      • A signaled malapropism in the first/last word in a sentence:
  • 16. Malapropisms Correction Possible Situations (2)
      • An isolated malapropism in the middle of the sentence:
  • 17. Malapropisms Correction Possible Situations (3)
      • A malapropism chain: multiple consecutive chunks signaled as possible malapropisms.
      • First try to correct only one word → the one whose correction fixes both signaled malapropisms (the 2 chunks are corrected together) – figure a;
      • If this is impossible, each malapropism is treated separately in order to correct both – figure b;
      • If this is still impossible, only 1 of them is corrected.
  • 19. Walkthrough Example (1)
      • I am travelling around the word [world].
      • Chunker: I; am travelling; around the word.
      • Google: “I am travelling” – 1.6 million hits; “am travelling around the word” – 3 hits.
          – The first combination is considered correct, while the second signals a possible malapropism.
      • Paronyms dictionary: word - cord, ford, lord, sword, ward, wyrd, woad, wold, wood, wordy, work, worm, worn, wort, world.
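Every paronym listed for "word" differs from it by a single letter substitution, insertion or deletion. Assuming that this is what "first-level paronym" means (the slides do not define the term), candidates can be pulled from a word list as sketched below; `is_first_level_paronym`, `first_level_paronyms` and the toy vocabulary are hypothetical.

```python
def is_first_level_paronym(a, b):
    """True when a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Different length: deleting one letter of the longer word must give the shorter.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def first_level_paronyms(word, vocabulary):
    """Return the sorted vocabulary entries that are first-level paronyms of word."""
    return sorted(w for w in vocabulary if is_first_level_paronym(word, w))


toy_vocabulary = {"word", "world", "work", "wood", "sword", "travel", "around"}
print(first_level_paronyms("word", toy_vocabulary))
# ['sword', 'wood', 'work', 'world']
```

A linear scan over the vocabulary is enough for a sketch; a dictionary of tens of thousands of words would more likely be indexed once, ahead of time.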
  • 20. Walkthrough Example (2)
      • Google again: “word” is replaced by each of its paronyms and the number of hits for every combination “am travelling around the <paronym>” is retrieved.
      • Filters: the only combination that passes the filters is “am travelling around the world”, which has 4120 hits and passes the 3rd filter (beta = 1).
      • WordNet: it is verified that “world” is part of a lexical chain that starts from “travelling”.
      • A malapropism is signaled and the corrected form is given: “I am travelling around the world.”
  • 21. Experiments
      • 3 types of corpora have been used for testing:
          – 1st corpus: built from individual phrases containing malapropisms;
          – 2nd corpus: contained no malapropisms at all;
          – 3rd corpus: consisted of parts of text published on the Internet (taken from Fox News) and modified to introduce malapropisms as suggested by (Hirst and St-Onge, 1998) and (Hirst and Budanitsky, 2005).
  • 22. Results (1)
      • 1st corpus:
          – 27 out of the 31 examples were correctly detected (87.09%) and
          – 25 of them were properly corrected (80.64%).
      • 2nd corpus (587 words):
          – 1 false alarm was introduced (0.17%), because the POS tagger wrongly identified “while” as a noun and the application replaced it with the more probable “white”.
  • 23. Results (2)
      • 3rd corpus:
          – Smaller text (199 words, 1 malapropism):
              • the malapropism was corrected, but a false alarm was introduced (0.5%); it seems we underestimated the false alarm rate.
          – Larger text (2083 words, 25 malapropisms):
              • 21 malapropisms were detected (84%);
              • 17 malapropisms were corrected (68%);
              • 10 false alarms were introduced (0.48%); 6 of these were in the vicinity of a proper noun (e.g. Iran was replaced by Iraq, the two countries having similar contexts).
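The reported rates follow directly from the counts on the results slides. The short check below recomputes them; `rate` is a hypothetical helper that rounds to two decimals, so a value may differ in the last digit from a figure that was truncated on a slide.

```python
def rate(part, total):
    """Percentage of part in total, rounded to two decimals."""
    return round(100 * part / total, 2)


# 1st corpus: 31 examples.
print(rate(27, 31))    # 87.1   (detected)
print(rate(25, 31))    # 80.65  (corrected)
# 2nd corpus: 587 words, 1 false alarm.
print(rate(1, 587))    # 0.17
# 3rd corpus, smaller text: 199 words, 1 false alarm.
print(rate(1, 199))    # 0.5
# 3rd corpus, larger text: 25 malapropisms, 2083 words.
print(rate(21, 25))    # 84.0   (detected)
print(rate(17, 25))    # 68.0   (corrected)
print(rate(10, 2083))  # 0.48   (false alarms per word)
```

Detection and correction rates are taken over the number of malapropisms, while false alarm rates are taken over the number of words in the text.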
  • 24. Conclusions
      • Our approach:
          – combines three technologies (WordNet, Google, paronyms dictionary);
          – uses thresholds that do not depend on the analyzed texts;
          – uses chunks of text in order to capture the local cohesion of texts;
          – is fully automated.
  • 25. Limitations
      • The application has problems with the proper nouns, the numbers and the metaphors found in the analyzed texts;
      • WordNet structure and the accuracy of lexical chain construction;
      • Paronyms dictionary (at the moment only first-level paronyms are used).
  • 26. Possible Improvements
      • Construct the syntactic tree of each phrase in order to consider the dependencies between the chunks of text instead of evaluating them sequentially;
      • Evaluate whether the empirically chosen thresholds hold for any language by verifying them on a different language;
      • Multi-threading.
  • 27. Q&A
      • Thank you for your time!

Editor's Notes

  1. POS tagger: Qtag. The dictionary has 77,503 words; 22,020 of them (28.4%) have at least one first-level paronym.
  2. The pages parameter in the formula above represents the number of indexed pages written in the language used.
  3. Every paronym replaces the malapropos word in turn, and the local cohesion of the phrase is tested against the next/previous chunk of text. If the new word fits better, it is then tested whether it fits in one of the lexical chains of the text. If so, it becomes the replacement candidate and the malapropism is signaled as a real one.
  4. Here, the local cohesion of the phrase is tested against both the next and the previous chunks of text. If the candidate fits with only 1 chunk, it is marked as a possible replacement, but the malapropism is neither marked as real yet nor ignored.
  5. A small one (199 words) and a larger one (2083 words).