Commonsense knowledge for Machine Intelligence - part 2
These are the slides of the tutorial on commonsense knowledge for machine intelligence, presented by Dr. Niket Tandon, Dr. Aparna Varde, and Dr. Gerard de Melo at the CIKM conference 2017.


*Part 2/3: Commonsense knowledge for detecting and correcting odd collocations in text*


Website: http://allenai.org/tutorials/csk/


  1. Part 2: Detecting and Correcting Odd Collocations in Text
      Commonsense for Machine Intelligence: Text to Knowledge and Knowledge to Text
  2. Introduction to Collocations
      • Correct native speaker expression in a given language
      • Strong tea (not powerful tea)
      • Clear sky (not pure sky)
      • Go home (not go to home)
      • Go to school (not go school)
      • House arrest (not arrest house)
      • Friend circle (not circle friend)
  3. Collocation Errors or Odd Collocations
      • Expressions that may be grammatically correct but are not typical among native speakers
      • Red meat & white meat are correct collocations in English
      • Their literal translations are odd collocations in German
      • Not usually used by German speakers
      • Machine translation can often cause such collocation errors
      • Can be due to lack of commonsense & world knowledge
  4. Collocations and Idioms
      • Some collocations are idiomatic expressions: “couch potato”
      • Literal idiom translation may be totally absurd: “sofa potato”
      • Note: Correct idiom usage & translation is harder
      • Not all collocations are idioms, e.g., “fast cars” (vs “quick cars”)
      • Yet, correct collocation usage is important in many situations
  5. Motivation to Address Collocations – Daily Communication
      • Tourist wants “black coffee” (regular coffee without milk) in a coffee shop
      • Asks for “dark coffee” using online translation help
      • Server brings coffee with milk, made with darkest coffee beans available
      • This is not what the tourist intended…
      • What if he is lactose intolerant?
      • Note: “Coffee Shop” in Amsterdam might mean something completely different: a place for drugs!
      • Important to address collocations with commonsense & world knowledge
  6. Motivation to Address Collocations – Written Texts
      • Classic Bible quote also in Shakespeare’s Hamlet
      • Literal machine translation can yield different meaning!
      • Collocations, e.g., “willing spirit” & “weak flesh”, must be translated with commonsense & reference to context
  7. Motivation to Address Collocations – Search Engines
      • Odd collocation “quick cars” returns fewer hits & less appropriate results
      • Correct collocation “fast cars” shows better sites & images of cars as good search results
      • Machine translation help for search engines should fix collocation errors
  8. Techniques to Address Odd Collocations
      • Treatment of Collocations
          • Different types of oddly collocated terms
          • Examples of each type with problems caused
      • Linguistic Classification
          • Classifying terms as correct vs incorrect collocations
          • Considering associations / using source language
      • Detection and Correction
          • Finding various incorrectly collocated terms using frequency etc.
          • Providing correct responses, similarity measures, ranking the suggestions
  9. Treatment of Collocations
      • Collocations are typically treated in different categories
          • Insertion Errors: adding a wrong term
          • Deletion Errors: omitting a required term
          • Transposition Errors: changing order of terms
          • Substitution Errors: using one term instead of another
      • We briefly describe each type with examples and the problems they could cause
  10. Insertion Errors
      • These include adding a term not appropriate in a correct native speaker expression
          • “I went to home” vs “I went home”
          • “When will you return back from Singapore?” vs “When will you return from Singapore?”
          • “Take a break for the lunch” vs “Take a break for lunch”
      • Article errors quite common in this category (adding unnecessary articles)
      • Many of these errors involve grammatical mistakes
      • These types of errors create problems in
          • Fluency of speech especially at formal events
          • Clarity of written documents
  11. Deletion Errors
      • These are the opposite of insertion errors & involve missing a term needed in an expression
          • “Einstein was scientist” vs “Einstein was a scientist”
          • “Hire someone to do job” vs “Hire someone to do the job”
          • “Let us wait her” vs “Let us wait for her”
      • They also create similar problems with respect to fluency and clarity
      • Many deletion errors also pertain to odd use of articles (omitting a necessary one)
      • Approaches in the literature for article error treatment are applicable here
      • These also often pertain to grammatical mistakes
  12. Transposition Errors
      • These errors occur when terms are not placed in the appropriate order
      • They could be more problematic than insertion & deletion errors
          • “Don’t talk with your full mouth” vs “Don’t talk with your mouth full”
          • “How to make friendships close” vs “How to make close friendships”
      • They might convey the wrong meaning, e.g., talking with your full mouth is different from talking with your mouth full
      • Sometimes it’s almost the opposite meaning, e.g., close friendships vs friendships close
      • Often, knowing native language of speaker / origin of the source text might help here
  13. Substitution Errors
      • These involve using an inappropriate term in an expression instead of a term in correct usage
          • “This actor does money” vs “This actor makes money”
          • “Where is the nearest quick food place?” vs “Where is the nearest fast food place?”
      • Most common types of collocation errors
      • Often cause miscommunication problems while talking, writing, searching etc.
      • Many approaches in the literature address mainly substitution errors
      • They can be potentially applied to address the other types as well
      • Incorporation of commonsense knowledge is particularly useful here
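
The four error categories on slides 9 to 13 can be told apart mechanically once an erroneous phrase and its correction are both known. The following is a minimal sketch under simplifying assumptions that are not from the slides: phrases are split on whitespace, only single-word edits are recognized, and the function names are hypothetical.

```python
def _is_subsequence(short, long):
    """True if every token of `short` appears in `long` in the same order."""
    remaining = iter(long)
    return all(tok in remaining for tok in short)


def classify_collocation_error(erroneous, correct):
    """Label the word-level edit that separates an erroneous phrase from its fix.

    Assumption (not from the slides): exactly one word is inserted, deleted
    or substituted, or the same words appear in a different order.
    """
    err = erroneous.lower().split()
    cor = correct.lower().split()
    if err == cor:
        return "none"
    if sorted(err) == sorted(cor):
        return "transposition"   # same words, different order
    if len(err) == len(cor) + 1 and _is_subsequence(cor, err):
        return "insertion"       # the erroneous phrase has one extra word
    if len(err) + 1 == len(cor) and _is_subsequence(err, cor):
        return "deletion"        # the erroneous phrase lacks one word
    if len(err) == len(cor) and sum(a != b for a, b in zip(err, cor)) == 1:
        return "substitution"    # one word swapped for another
    return "other"


print(classify_collocation_error("I went to home", "I went home"))
print(classify_collocation_error("This actor does money", "This actor makes money"))
```

In practice the correction is not known in advance, so a labeling step like this is mainly useful for analyzing annotated learner corpora rather than for detection itself.
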
  14. Addressing Odd Collocations by Linguistic Classification
      • Some works focus on classifying collocation errors from a linguistic perspective
      • Using collocation measures on syntactic patterns for lexical classification as correctly collocated term vs error [Futagi et al., 2008]
      • Considering source language (of ESL learner or machine generated text) to classify collocations [Dahlmeier, 2011]
  15. Collocation Measures on Syntactic Patterns [Futagi et al.]
      • This work addresses 7 aspects of lexical collocations
      • Collocation errors lexically classified using candidate word strings
      • POS tagging of texts is conducted followed by pattern matching
  16. Collocation Measures on Syntactic Patterns (Contd.) [Futagi et al.]
      • After spell checking, variants of word strings built with articles, synonyms etc.
      • Word strings looked up in a reference DB (RR DB) to find a match
      • If no match found, it is classified as a collocation error
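
The variant-and-lookup step described on slide 16 can be illustrated with a small sketch. It is not the implementation of [Futagi et al.]: it builds only article variants (synonym and other variants are left out), the reference DB is stood in for by a plain Python set, and all names are hypothetical.

```python
ARTICLES = ("", "a", "an", "the")


def article_variants(head, dependent):
    """Build word-string variants that differ only in the article used."""
    return {
        " ".join(tok for tok in (head, article, dependent) if tok)
        for article in ARTICLES
    }


def is_collocation_error(head, dependent, reference_db):
    """Classify as an error when no variant is found in the reference set."""
    return article_variants(head, dependent).isdisjoint(reference_db)


# Hypothetical reference entries standing in for the RR DB.
reference_db = {"take a break", "make the decision"}
print(is_collocation_error("take", "break", reference_db))   # a variant matches
print(is_collocation_error("do", "decision", reference_db))  # no variant matches
```
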
  17. Collocation Measures on Syntactic Patterns (Contd.) [Futagi et al.]
      • Measure of collocation strength
          • Rank ratio statistic
          • From 1 billion words of native speaker texts
          • Incorporating commonsense knowledge
      • When evaluated against a gold standard with native speakers, this work gives around 85% precision in classification
      • This work does not provide correct suggestions as responses to collocation errors
  18. Source Language to Classify Collocations [Dahlmeier et al.]
      • Errors often caused by semantic similarity of words in source language
      • This is called the L1 language
      • Literal translation to destination language can cause collocation errors
      • Thus, L1 induced paraphrases are proposed for classifying collocations
      • Example: a source-language word with over a dozen English translations (look, see, watch, read etc.) gives the possible translation “I like to look movies” vs “I like to watch movies”
  19. Source Language to Classify Collocations (Contd.) [Dahlmeier et al.]
      • NUCLE: Annotated 1 million word corpus of 1400 essays by ESL university students
      • Annotated with start & end offset, error type, gold standard correction
      • Incorporates commonsense knowledge from professional English instructors
      • They filter out preposition & article errors, focus on collocations involving semantics
      • [Figure: Statistics of NUCLE Analysis]
  20. Source Language to Classify Collocations (Contd.) [Dahlmeier et al.]
      • Detected errors classified as: Spelling, Homophone, Synonyms, L1-transfer
          • Spelling: Edit dist. (erroneous phrase, correction) < threshold
          • Homophone: (erroneous word, correction) have same pronunciation
          • Synonym: (erroneous word, correction) have similar meaning
          • L1-transfer: (erroneous phrase, correction) share a common translation
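
The four tests on slide 20 can be expressed as one labeling function. This is a sketch under stated assumptions rather than the procedure of [Dahlmeier et al.]: the lookup tables are hypothetical plain dictionaries, the distance threshold of 3 is an arbitrary illustrative value, and the categories are allowed to overlap.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def classify_error(wrong, fix, pronunciations, synonyms, translations, max_dist=3):
    """Return every category whose test the (wrong, fix) pair passes.

    pronunciations: word -> pronunciation string       (hypothetical table)
    synonyms:       word -> set of synonyms            (hypothetical table)
    translations:   phrase -> set of L1 translations   (hypothetical table)
    max_dist:       assumed threshold, not a value from the slides
    """
    labels = []
    if edit_distance(wrong, fix) < max_dist:
        labels.append("spelling")
    if wrong in pronunciations and pronunciations[wrong] == pronunciations.get(fix):
        labels.append("homophone")
    if fix in synonyms.get(wrong, set()):
        labels.append("synonym")
    if translations.get(wrong, set()) & translations.get(fix, set()):
        labels.append("l1-transfer")
    return labels


# "kan" is a made-up romanized L1 word shared by both English verbs.
print(classify_error("look", "watch", {}, {}, {"look": {"kan"}, "watch": {"kan"}}))
```
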
  21. Source Language to Classify Collocations (Contd.) [Dahlmeier et al.]
      • Number of errors in L1-transfer > other types
      • Extract English-L1, L1-English phrases max 3 words
      • Phrase extraction heuristic: [equation not captured in the transcript]
          • Here, f: foreign language phrase
          • Translation probabilities p(e1|f), p(f|e2) predicted by max likelihood estimation
          • Only keep phrases with probability > threshold (0.001 in this work)
      • This serves as the basis for suggesting corrections
      • [Figure: Analysis of Collocation Errors]
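
The equation on slide 21 did not survive in the transcript. One common way to combine the two translation probabilities it names is to pivot through the foreign phrase, p(e1|e2) = Σ_f p(e1|f) · p(f|e2); treating this as the slide's formula is an assumption. The sketch below uses that pivot form with the 0.001 threshold from the slide, and the probability tables and function name are hypothetical.

```python
from collections import defaultdict


def paraphrase_probabilities(p_e_given_f, p_f_given_e, threshold=0.001):
    """Pivot-based paraphrase scores: p(e1|e2) = sum over f of p(e1|f) * p(f|e2).

    p_e_given_f: foreign phrase f -> {English phrase e1: p(e1|f)}
    p_f_given_e: English phrase e2 -> {foreign phrase f: p(f|e2)}
    Only paraphrases scoring above `threshold` are kept.
    """
    result = {}
    for e2, foreign_dist in p_f_given_e.items():
        scores = defaultdict(float)
        for f, p_f in foreign_dist.items():
            for e1, p_e1 in p_e_given_f.get(f, {}).items():
                if e1 != e2:  # a phrase is not its own paraphrase
                    scores[e1] += p_e1 * p_f
        result[e2] = {e1: p for e1, p in scores.items() if p > threshold}
    return result


# Made-up numbers; "kan" is a hypothetical romanized L1 phrase.
p_e_given_f = {"kan": {"watch": 0.5, "look": 0.3, "read": 0.0005}}
p_f_given_e = {"look": {"kan": 0.8}}
print(paraphrase_probabilities(p_e_given_f, p_f_given_e))
```

Candidates that survive the threshold, such as “watch” for “look” here, are the kind of L1-induced paraphrases that could then be offered as corrections.
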
  22. Discussion
      • These research works clearly focus more on lexical classification of collocation errors
      • Linguistic perspectives are significant here
      • Commonsense knowledge is included in collocation error classification using corpora from native speakers / English instructors
      • These works provide an insight into the reasons for collocation errors and their grammatical placements
      • Such research heads towards proposing corrective measures
  23. Collocation Error Detection and Correction
      • These approaches develop tools for the actual detection and correction of collocation errors
      • AwkChecker: While a user writes a text document, flag collocation errors and suggest replacements that correspond closely to consensus using word-level statistical n-grams [Park et al., 2008]
      • CollOrder: When a user enters a term in the tool, detect collocation errors and provide correctly ordered collocated responses as outputs using an ensemble of similarity measures [Varghese et al., 2015]
  24. AwkChecker [Park et al.]
      • End-user tool to correct collocation errors in written documents
      • Users write text, Awkward phrases are Checked by highlighting them
      • Users can click awkward phrases to see suggested replacements
      • 1st ever tool for collocation error correction
      • [Figure: AwkChecker’s user interface: A) Flagged phrases in the composition window B) Suggested replacement for “powerful tea”]
  25. AwkChecker (Contd.) [Park et al.]
      • Builds statistical n-grams (sequences of n words) from training corpus & records frequencies
      • Analyzes user input against corpus to find if a phrase is a collocation error
      • Flags error if there exist similar phrases with frequency > input frequency
      • Generates replacements using n-gram frequency based approach
      • Candidates with much higher frequency are potential replacements
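
The frequency-based idea on slide 25 can be sketched in a few lines. This is an illustration under assumptions, not AwkChecker’s actual algorithm: “similar phrases” is taken to mean n-grams of the same length that differ in exactly one word, “much higher frequency” is modeled by a hypothetical `ratio` parameter, and the toy corpus is made up.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every word-level n-gram in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def suggest_replacements(phrase, counts, ratio=2.0):
    """Return similar n-grams that are much more frequent than `phrase`.

    A non-empty result means the phrase would be flagged as awkward.
    """
    target = tuple(phrase.lower().split())
    base = counts[target]  # Counter returns 0 for unseen n-grams
    candidates = [
        (gram, freq) for gram, freq in counts.items()
        if len(gram) == len(target)
        and sum(a != b for a, b in zip(gram, target)) == 1
        and freq > base * ratio
    ]
    candidates.sort(key=lambda item: -item[1])  # most frequent first
    return [" ".join(gram) for gram, _ in candidates]


corpus = ["strong tea please", "i like strong tea", "powerful engine"]
bigrams = ngram_counts(corpus, 2)
print(suggest_replacements("powerful tea", bigrams))
print(suggest_replacements("strong tea", bigrams))
```

With a corpus this small, weak candidates also pass the test, which reflects the sparse-data concern about frequency-based methods raised later in the discussion.
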
  26. AwkChecker (Contd.) [Park et al.]
      • Statistical n-grams are used over relevant corpora including Wikipedia
      • Helpful in capturing commonsense with domain-specific knowledge using frequency-based approach
      • Example: Referring to a medical corpus to flag phrases awkward in medical research writing
      • Assumption: Relevant corpora are correct more frequently than they are incorrect
      • Evaluation reveals usefulness in collocation correction, but details of accuracy not discussed
  27. CollOrder [Varghese et al.]
      • Detects & corrects collocation errors in terms input to the tool
      • Outputs ranked responses of correctly collocated terms
      • Correct collocations source: ANC / BNC (American / British National Corpus)
      • Includes commonsense knowledge from native speakers’ writings
      • Useful in Web queries, text documents, ESL translation etc.
      • [Figure: Approach in the CollOrder tool]
  28. CollOrder (Contd.) [Varghese et al.]
      • Ensemble of measures is used for similarity search and ranking
      • Conditional Probability: Measures relative occurrence of terms A & B
      • Jaccard’s Coefficient: Measures extent of semantic similarity between A & B
      • WebJaccard: To reduce adverse effects of random co-occurrence (due to scale & noise in Web data) [Bollegala et al., 2009]
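
The three measures on slide 28 can be written out over raw occurrence counts. The sketch below uses their textbook forms and is not taken from the CollOrder code; in particular, the WebJaccard cutoff `noise_threshold` and its default of 5 are assumptions, as are the function names.

```python
def conditional_probability(count_ab, count_b):
    """P(A | B): share of B's occurrences in which A also occurs."""
    return count_ab / count_b if count_b else 0.0


def jaccard(count_a, count_b, count_ab):
    """Co-occurrences divided by occurrences of either term."""
    union = count_a + count_b - count_ab
    return count_ab / union if union else 0.0


def web_jaccard(count_a, count_b, count_ab, noise_threshold=5):
    """Jaccard that treats very rare co-occurrence as random noise (score 0)."""
    if count_ab <= noise_threshold:
        return 0.0
    return jaccard(count_a, count_b, count_ab)


# Made-up counts for two terms A and B.
print(conditional_probability(30, 50))
print(jaccard(100, 50, 30))
print(web_jaccard(10, 10, 5))
```

Because each measure reacts differently to rare and frequent terms, combining them in an ensemble rather than relying on a single score is a natural design choice.
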
  29. CollOrder (Contd.) [Varghese et al.]
      • These & other measures (Frequency Normalized, Frequency Ratio) are used [Varghese et al., 2015]
      • Different measures empirically yield good results in different scenarios
      • Ensemble of measures with classifiers thus proposed to optimize performance
      • Classifier used: JRIP, implementation of RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [Cohen, 1995]
      • CollOrder evaluation with MTurk on native speakers: Average accuracy 92.44%
      • Example of ensemble learning by the classifier:
          • “blue sky” is a valid suggestion, classified as “y”
          • “night sky” is not a valid suggestion, classified as “n”
  30. Other Related Works
      • [Ramos et al., 2010] build annotation schema with 3D topology to classify collocations mainly in Spanish & English translation:
          • 1st dimension finds if error is for whole or part of collocation
          • 2nd dimension does language-oriented error analysis
          • 3rd dimension does interpretive error analysis
      • [Li et al., 2009] use a probabilistic approach for collocation correction:
          • Use BNC and WordNet as language learning sources
          • Suggest corrections based on commonly used expressions
          • Do not develop a tool for collocation detection & correction
  31. Discussion
      • Collocation error correction tools in the literature are found useful by users
      • Commonsense knowledge from native speakers is typically entailed in the source corpora used for learning
      • Approaches in linguistic classification as well as in collocation correction rely heavily on frequency
      • Thus, potential issues related to sparse data with correct collocations call for further research
  32. Text to Knowledge and Knowledge to Text
      • Collocation approaches start with text and extract knowledge from corpora
      • Different methods used for knowledge extraction - probabilistic, ensemble
      • Extracted knowledge used for linguistic classification, error correction
      • Statistical text categorization occurs due to analysis in linguistic classification
      • Correctly collocated text responses offered as suggestions in error correction
      • Thus, extracted knowledge serves to provide text based outputs
      • Commonsense knowledge plays a role mainly in source corpora from native speakers & expert writings
      • This contributes to machine intelligence by providing better machine translation incorporating commonsense
  33. References
      • Bollegala, D., Matsuo, Y. and Ishizuka, M., Measuring the similarity between implicit semantic relations using web search engines, WSDM 2009, pp. 104–113.
      • Cohen, W., Fast effective rule induction. In Proceedings of the International Conference on Machine Learning, ICML 1995, pp. 115–123.
      • Dahlmeier, D. and Ng, H.T., Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, pp. 107–117.
      • Futagi, Y., Deane, P., Chodorow, M. and Tetreault, J., A computational approach to detecting collocation errors in the writing of non-native speakers of English, Computer Assisted Language Learning 2008, 21(4):353–367.
      • Li-E, L. A., Wible, D. and Tsao, N-L., Automated suggestions for miscollocations, Proceedings of the 4th Workshop on Innovative Use of NLP for Building Educational Applications, 2009, pp. 47–50.
      • Park, T., Lank, E., Poupart, P. and Terry, M., Is the sky pure today - AwkChecker: An assistive tool for detecting and correcting collocation errors, ACM Symposium on User Interface Software and Technology 2008, pp. 121–130.
      • Ramos, M.A., Wanner, L., Vincze, O., del Bosque, G.C., Veiga, N.V., Suárez, E.M. and González, S.P., Towards a Motivated Annotation Schema of Collocation Errors in Learner Corpora, LREC 2010, pp. 3209–3214.
      • Varghese, A., Varde, A., Peng, J. and Fitzpatrick, E., A framework for collocation error correction in Web pages and text documents, ACM SIGKDD Explorations 2015, 17(1):14–23.
