Commonsense knowledge for Machine Intelligence - part 2

Part	2:	Detecting	and	Correcting	Odd	Collocations	in	Text
Commonsense	for	Machine	Intelligence:	Text	to	
Knowledge	and	Knowledge	to	Text
Introduction	to	Collocations
• Word combinations that sound correct to native speakers of a given language
• Strong	tea	(not	powerful	tea)
• Clear	sky	(not	pure	sky)
• Go	home	(not	go	to	home)
• Go	to	school	(not	go	school)
• House	arrest	(not	arrest	house)
• Friend	circle	(not	circle	friend)
Collocation	Errors	or	Odd	Collocations
• Expressions that may be grammatically correct but are not typical among native speakers
• Red	meat	&	white	meat	are	correct	collocations	
in	English
• Their	literal	translations	are	odd	collocations	in	
German
• Not usually used by German speakers
• Machine	translation	can	often	cause	such	
collocation	errors
• Can	be	due	to	lack	of	commonsense	&	world	
knowledge	
Collocations	and	Idioms
• Some	collocations	are	idiomatic	
expressions:	“couch	potato”
• Literal	idiom	translation	may	be	
totally	absurd:	“sofa	potato”	
• Note:	Correct	idiom	usage	&	
translation	is	harder	
• Not all collocations are idioms, e.g., “fast cars” (vs “quick cars”)
• Yet,	correct	collocation	usage	is	
important	in	many	situations
Motivation	to	Address	Collocations	– Daily	Communication	
• Tourist	wants	“black	coffee”	(regular	
coffee	without	milk)	in	a	coffee	shop
• Asks	for	“dark	coffee”	using	online	
translation	help
• Server	brings	coffee	with	milk,	made	
with	darkest	coffee	beans	available	
• This	is	not	what	the	tourist	intended…	
• What	if	he	is	lactose	intolerant?
• Note: “Coffee Shop” in Amsterdam might mean something completely different :) A place for drugs!
• Important	to	address	collocations	with	
commonsense	&	world	knowledge	
Motivation	to	Address	Collocations	– Written	Texts	
• Classic	Bible	quote	also	in	
Shakespeare’s	Hamlet
• Literal	machine	translation	
can	yield	different	meaning!
• Collocations	e.g.,	“willing	
spirit”	&	“weak	flesh”	must	
be	translated	with	
commonsense	&	reference	
to	context	
Motivation	to	Address	Collocations	– Search	Engines
• Odd	collocation	
“quick	cars”	returns	
fewer	hits	& less	
appropriate	results
• Correct	collocation	
“fast	cars”	shows	
better	site	&	images	
of	cars	as	good	
search	results
• Machine	translation	
help	for	search	
engines	should	fix	
collocation	errors	
Techniques	to	Address	Odd	Collocations
• Treatment	of	Collocations
• Different types of oddly collocated terms
• Examples	of	each	type	with	problems	caused	
• Linguistic	Classification
• Classifying	terms	as	correct	vs	incorrect	collocations	
• Considering	associations	/	using	source	language	
• Detection	and	Correction
• Finding	various	incorrectly	collocated	terms	using	frequency	etc.	
• Providing	correct	responses,	similarity	measures,	ranking	the	suggestions	
Treatment	of	Collocations
• Collocation errors are typically treated in four categories
• Insertion	Errors:	adding	a	wrong	term
• Deletion	Errors:	omitting	a	required	term
• Transposition	Errors:	changing	order	of	terms
• Substitution	Errors:	using	one	term	instead	of	another
• We	briefly	describe	each	type	with	examples	and	the	problems	they	
could	cause
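The four categories above can be told apart mechanically once an erroneous phrase is paired with its correction. The following is a minimal sketch, assuming whitespace tokenization and at most one edited word; the function names are hypothetical and are not taken from any of the cited works.

```python
def _is_subsequence(short, long):
    """True if every token of `short` appears in `long` in the same order."""
    remaining = iter(long)
    return all(token in remaining for token in short)


def classify_collocation_error(erroneous: str, corrected: str) -> str:
    """Label the edit that turns `erroneous` into `corrected`."""
    wrong = erroneous.lower().split()
    right = corrected.lower().split()
    if wrong == right:
        return "none"
    if sorted(wrong) == sorted(right):
        return "transposition"  # same words, different order
    if len(wrong) == len(right) + 1 and _is_subsequence(right, wrong):
        return "insertion"      # the erroneous phrase has one extra word
    if len(wrong) + 1 == len(right) and _is_subsequence(wrong, right):
        return "deletion"       # the erroneous phrase lacks one word
    if len(wrong) == len(right):
        differing = sum(w != r for w, r in zip(wrong, right))
        if differing == 1:
            return "substitution"  # exactly one word was swapped
    return "other"
```

Real learner errors often combine several edits, so a practical system would need alignment rather than this single-edit assumption.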
Insertion	Errors
• These	include	adding	a	term	not	appropriate	in	a	correct	native	speaker	expression
“I went to home” vs “I went home”
“When will you return back from Singapore?” vs “When will you return from Singapore?”
“Take a break for the lunch” vs “Take a break for lunch”
• Article	errors	quite	common	in	this	category	(adding	unnecessary	articles)	
• Many	of	these	errors	involve	grammatical	mistakes
• These	types	of	errors	create	problems	in	
• Fluency	of	speech	especially	at	formal	events
• Clarity of written documents
Deletion	Errors	
• These	are	the	opposite	of	insertion	errors	&	involve	missing	a	term	needed	in	an	expression
“Einstein was scientist” vs “Einstein was a scientist”
“Hire someone to do job” vs “Hire someone to do the job”
“Let us wait her” vs “Let us wait for her”
• They	also	create	similar	problems	with	respect	to	fluency	and	clarity
• Many	deletion	errors	also	pertain	to	odd	use	of	articles	(omitting	a	necessary	one)
• Approaches	in	the	literature	for	article	error	treatment	are	applicable	here
• These also often pertain to grammatical mistakes
Transposition	Errors
• These	errors	occur	when	terms	are	not	placed	in	the	appropriate	order
• They	could	be	more	problematic	than	insertion	&	deletion	errors
“Don’t talk with your full mouth” vs “Don’t talk with your mouth full”
“How to make friendships close” vs “How to make close friendships”
• They	might	convey	the	wrong	meaning,	e.g.,	talking	with	your	full	mouth	is	different	from	
talking	with	your	mouth	full
• Sometimes	it’s	almost	the	opposite	meaning,	e.g.,	close	friendships	vs	friendships	close
• Often,	knowing	native	language	of	speaker	/	origin	of	the	source	text	might	help	here
Substitution	Errors
• These	involve	using	an	inappropriate	term	in	an	expression	instead	of	a	term	in	correct	usage
“This actor does money” vs “This actor makes money”
“Where is the nearest quick food place?” vs “Where is the nearest fast food place?”
• The most common type of collocation error
• Often	cause	miscommunication	problems	while	talking,	writing,	searching	etc.
• Many	approaches	in	the	literature	address	mainly	substitution	errors
• They	can	be	potentially	applied	to	address	the	other	types	as	well	
• Incorporation	of	commonsense	knowledge	is	particularly	useful	here
Addressing	Odd	Collocations	by	Linguistic	Classification
• Some	works	focus	on	classifying	collocation	errors	from	a	linguistic	
perspective
• Using	collocation	measures	on	syntactic	patterns	for	lexical	
classification	as	correctly	collocated	term	vs	error	[Futagi et	al.,	2008]
• Considering source language (of ESL learner or machine generated text) to classify collocations [Dahlmeier and Ng, 2011]
Collocation	Measures	on	Syntactic	Patterns
• This	work	addresses	7	aspects	of	lexical	collocations
• Collocation	errors	lexically	classified	using	candidate	word	strings	
• POS	tagging	of	texts	is	conducted	followed	by	pattern	matching
[Futagi et	al.]
Collocation	Measures	on	Syntactic	Patterns	(Contd.)
• After spell checking, variants of each word string are built with articles, synonyms etc.
• Word strings are looked up in a reference DB (RR DB) to find a match
• If no match is found, the string is classified as a collocation error
[Futagi et	al.]
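The lookup step described above can be sketched as follows. This is an illustrative toy, assuming a tiny in-memory set in place of the reference DB and only article variants (no synonyms or inflections); `REFERENCE_DB`, `ARTICLES`, and the function names are hypothetical and do not come from [Futagi et al.].

```python
# Hypothetical stand-in for the reference DB of attested word strings.
REFERENCE_DB = {
    ("strong", "tea"),
    ("make", "a", "decision"),
    ("fast", "cars"),
}

# Assumption: only article variants are generated in this sketch.
ARTICLES = ("", "a", "the")


def variants(first: str, second: str):
    """Yield the word pair itself plus variants with an article in between."""
    for article in ARTICLES:
        yield tuple(token for token in (first, article, second) if token)


def is_collocation_error(first: str, second: str) -> bool:
    """Flag the pair as an error when no variant is found in the reference DB."""
    return not any(candidate in REFERENCE_DB for candidate in variants(first, second))
```

For example, `is_collocation_error("make", "decision")` finds the variant “make a decision” and returns `False`, while `is_collocation_error("powerful", "tea")` finds no match and returns `True`.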
Collocation	Measures	on	Syntactic	Patterns	(Contd.)
• Measure	of	collocation	strength
• Rank	ratio	statistic	
• Computed from 1 billion words of native speaker texts
• Incorporating	commonsense	knowledge
• When evaluated against a gold standard from native speakers, this work gives around 85% precision in classification
• This	work	does	not	provide	correct	suggestions	as	responses	to	
collocation	errors	
[Futagi et	al.]
Source	Language	to	Classify	Collocations	
• Errors are often caused by semantic similarity of words in the writer's source language
• The source language is called L1 (the writer's first language)
• Literal	translation	to	destination	
language	can	cause	collocation	errors
• Thus,	L1	induced	paraphrases	are	
proposed	for	classifying	collocations
Example [Dahlmeier et al.]:
• A single source-language word can have over a dozen English translations: look, see, watch, read etc.
• Possible translation from source: “I like to look movies” vs “I like to watch movies”
Source	Language	to	Classify	Collocations	(Contd.)
• NUCLE: Annotated 1-million-word corpus of 1,400 essays by ESL university students
• Annotated	with	start	&	end	offset,	error	
type,	gold	standard	correction	
• Incorporates	commonsense	knowledge	
from	professional	English	instructors
• They	filter	out	preposition	&	article	errors,	
focus	on	collocations	involving	semantics
Statistics	of	NUCLE	Analysis
[Dahlmeier et	al.]
Source	Language	to	Classify	Collocations	(Contd.)
• Detected	errors	classified	as:	Spelling,	Homophone,	Synonyms,	L1-transfer
• Spelling: Edit distance between (erroneous phrase, correction) < threshold
• Homophone:	(erroneous	word,	correction)	have	same	pronunciation
• Synonym:	(erroneous	word,	correction)	have	similar	meaning
• L1-transfer:	(erroneous	phrase,	correction)	share	a	common	translation
[Dahlmeier et	al.]
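The four tests above can be sketched in a few lines. This is a toy illustration, assuming small hand-made lookup tables in place of a pronunciation dictionary, a thesaurus, and a bilingual phrase table; the table contents, the placeholder L1 identifiers `f1` and `f2`, the threshold value, and the function names are all hypothetical. The sketch simply returns every category that applies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical toy resources; real systems would use full dictionaries.
HOMOPHONES = {frozenset({"their", "there"})}
SYNONYMS = {frozenset({"big", "large"})}
L1_TRANSLATIONS = {"look": {"f1"}, "watch": {"f1"}, "big": {"f2"}}


def error_categories(wrong: str, fix: str, threshold: int = 3) -> set:
    """Return every category whose test the (wrong, fix) pair passes."""
    found = set()
    if edit_distance(wrong, fix) < threshold:
        found.add("spelling")
    if frozenset({wrong, fix}) in HOMOPHONES:
        found.add("homophone")
    if frozenset({wrong, fix}) in SYNONYMS:
        found.add("synonym")
    if L1_TRANSLATIONS.get(wrong, set()) & L1_TRANSLATIONS.get(fix, set()):
        found.add("L1-transfer")
    return found
```

With these toy tables, the pair (“look”, “watch”) is far apart in edit distance but shares the placeholder translation `f1`, so it falls under L1-transfer only.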
Source	Language	to	Classify	Collocations	(Contd.)
• L1-transfer errors outnumber the other types
• Extract English-L1 and L1-English phrase pairs of at most 3 words
• Phrase	extraction	heuristic:	
• Here,	f:	foreign	language	phrase
• Translation probabilities p(e1|f), p(f|e2) estimated by maximum likelihood estimation
• Only	keep	phrases	with	probability	>	threshold	
(0.001	in	this	work)
• This	serves	as	the	basis	for	suggesting	corrections
[Dahlmeier et	al.]
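One way to combine the two translation probabilities named above is the standard pivot formulation, p(e1|e2) = sum over f of p(e1|f) * p(f|e2), where the sum runs over foreign phrases f shared by e1 and e2. The sketch below assumes that formulation and applies the 0.001 threshold to the resulting paraphrase scores, which is a simplification; the toy probability tables and the placeholder foreign phrases `f1` and `f2` are hypothetical.

```python
from collections import defaultdict


def paraphrase_probs(p_e_given_f, p_f_given_e, e2, threshold=0.001):
    """Score candidates e1 for phrase e2 by pivoting through foreign phrases f.

    Computes p(e1|e2) = sum_f p(e1|f) * p(f|e2) and keeps scores above threshold.
    """
    scores = defaultdict(float)
    for f, prob_f in p_f_given_e.get(e2, {}).items():
        for e1, prob_e1 in p_e_given_f.get(f, {}).items():
            if e1 != e2:  # a phrase is not a correction of itself
                scores[e1] += prob_e1 * prob_f
    return {e1: p for e1, p in scores.items() if p > threshold}


# Hypothetical toy tables; f1 and f2 stand for L1 phrases.
P_F_GIVEN_E = {"look": {"f1": 0.6, "f2": 0.4}}
P_E_GIVEN_F = {
    "f1": {"look": 0.5, "watch": 0.3, "see": 0.2},
    "f2": {"look": 0.9, "glance": 0.1},
}

candidates = paraphrase_probs(P_E_GIVEN_F, P_F_GIVEN_E, "look")
best = max(candidates, key=candidates.get)
```

In this toy setup “watch” receives the highest score for “look”, which mirrors the “look movies” vs “watch movies” example.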
Analysis	of	Collocation	Errors
Discussion	
• These	research	works	clearly	focus	more	on	lexical	
classification	of	collocation	errors
• Linguistic	perspectives	are	significant	here
• Commonsense	knowledge	is	included	in	collocation	
error	classification	using	corpora	from	native	
speakers	/	English	instructors
• These	works	provide	an	insight	into	the	reasons	for	
collocation	errors	and	their	grammatical	placements
• Such	research	heads	towards	proposing	corrective	
measures
Collocation	Error	Detection	and	Correction
• These	approaches	develop	tools	for	the	actual	detection	and	correction	of	
collocation	errors
• AwkChecker:	While	a	user	writes	a	text	document,	flag	collocation	errors	and	
suggest	replacements	that	correspond	closely	to	consensus	using	word-level	
statistical	n-grams	[Park	et	al.,	2008]
• CollOrder:	When	a	user	enters	a	term	in	the	tool,	detect	collocation	errors	
and	provide	correctly	ordered	collocated	responses	as	outputs	using	an	
ensemble	of	similarity	measures	[Varghese	et	al.,	2015]
AwkChecker
• End-user	tool	to	correct	
collocation	errors	in	written	
documents
• Users	write	text,	Awkward	
phrases	are	Checked	by	
highlighting	them	
• Users	can	click	awkward	
phrases	to	see	suggested	
replacements
• First-ever tool for collocation error correction
AwkChecker’s user	interface:
A)	Flagged	phrases	in	the	composition		window
B)	Suggested	replacement	for	“powerful	tea”
[Park	et	al.]
AwkChecker (Contd.)
• Builds	statistical	n-grams	(sequences	of	
n	words)	from	training	corpus	&	records	
frequencies		
• Analyzes	user	input	against	corpus	to	
find	if	a	phrase	is	a	collocation	error
• Flags	error	if	there	exist	similar	phrases	
with	frequency	>	input	frequency	
• Generates	replacements	using	n-gram	
frequency	based	approach
• Candidates	with	much	higher	frequency	
are	potential	replacements
[Park	et	al.]
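The frequency-based flagging and replacement idea above can be sketched with bigram counts. This is a minimal illustration, not the AwkChecker implementation: the six-sentence corpus, the `similar_words` table, the `ratio` parameter that stands in for "much higher frequency", and the function names are all assumptions.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def suggest(phrase, counts, similar_words, ratio=5):
    """Return similar two-word phrases that are much more frequent than the input.

    A candidate qualifies when its count exceeds the input count and is at
    least `ratio` times the input count (treating a zero count as one).
    """
    first, second = phrase.lower().split()
    base = counts[(first, second)]
    better = []
    for alternative in similar_words.get(first, []):
        count = counts[(alternative, second)]
        if count > base and count >= ratio * max(base, 1):
            better.append((count, f"{alternative} {second}"))
    return [text for _, text in sorted(better, reverse=True)]


# Hypothetical toy corpus and similarity table.
CORPUS = [
    "strong tea is popular",
    "he drinks strong tea",
    "strong tea with milk",
    "a cup of strong tea",
    "strong tea please",
    "a powerful engine",
]
COUNTS = bigram_counts(CORPUS)
SIMILAR = {"powerful": ["strong", "mighty"], "strong": ["powerful"]}
```

Here `suggest("powerful tea", COUNTS, SIMILAR)` proposes “strong tea”, while `suggest("strong tea", COUNTS, SIMILAR)` returns an empty list, so the phrase is not flagged.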
AwkChecker (Contd.)
• Statistical	n-grams	are	used	over	relevant	corpora	including	Wikipedia	
• Helpful	in	capturing	commonsense	with	domain-specific	knowledge	
using	frequency-based	approach
• Example:	Referring	to	a	medical	corpus	to	flag	phrases	awkward	in	
medical	research	writing
• Assumption:	Relevant	corpora	are	correct	more	frequently	than	they	
are	incorrect
• Evaluation	reveals	usefulness	in	collocation	correction,	but	details	of	
accuracy	not	discussed
[Park	et	al.]
CollOrder
• Detects	&	corrects	collocation	
errors	in	terms	input	to	the	tool	
• Outputs	ranked	responses	of	
correctly	collocated	terms
• Correct	collocations	source:	ANC	/	
BNC	(American	/	British	National	
Corpus)
• Includes	commonsense	knowledge	
from	native	speakers’	writings
• Useful	in	Web	queries,	text	
documents,	ESL	translation	etc.
Approach	in	the	CollOrder tool
[Varghese	et	al.]
CollOrder (Contd.)
• Ensemble	of	measures	is	used	for	similarity	search	and	ranking
• Conditional	Probability:		Measures	relative	occurrence	of	terms	A	&	B
• Jaccard’s Coefficient:	Measures	extent	of	semantic	similarity	between	A	&	B
• WebJaccard: Reduces adverse effects of random co-occurrence (due to scale & noise in Web data) [Bollegala et al., 2009]
[Varghese	et	al.]
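The three measures above can be computed from occurrence counts of A, B, and A together with B. The sketch below assumes page-count style inputs and a small cutoff `c` under which WebJaccard returns zero; the cutoff value and the function names are assumptions and may differ from what CollOrder uses.

```python
def conditional_probability(hits_ab: int, hits_a: int) -> float:
    """Estimate P(B | A) as count(A and B) / count(A)."""
    return hits_ab / hits_a if hits_a else 0.0


def jaccard(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """Jaccard coefficient: co-occurrences divided by the size of the union."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union else 0.0


def web_jaccard(hits_a: int, hits_b: int, hits_ab: int, c: int = 5) -> float:
    """Jaccard coefficient that treats rare co-occurrence (<= c) as noise."""
    if hits_ab <= c:
        return 0.0
    return jaccard(hits_a, hits_b, hits_ab)
```

For counts of 100 for A, 50 for B, and 25 for both, the conditional probability is 0.25 and both Jaccard variants give 0.2; with only 3 co-occurrences, WebJaccard drops to 0 while plain Jaccard stays positive.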
CollOrder (Contd.)
• These	&	other	measures	(Frequency	Normalized,	Frequency	Ratio)	are	used	[Varghese	et	al.,	2015]	
• Different	measures	empirically	yield	good	results	in	different	scenarios
• Ensemble	of	measures	with	classifiers	thus	proposed	to	optimize	performance
• Classifier used: JRip, an implementation of RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [Cohen, 1995]
• CollOrder evaluation with native speakers on MTurk (Amazon Mechanical Turk): average accuracy 92.44%
Example	of	ensemble	learning	by	
the	classifier	
“blue	sky”	is	a	valid	suggestion,	
classified	as	“y”
“night	sky”	is	not	a	valid	
suggestion,	classified	as	“n”
[Varghese	et	al.]
Other	Related	Works
• [Ramos et al., 2010] build an annotation schema with a 3D typology to classify collocation errors, mainly in Spanish & English translation:
• 1st dimension	finds	if	error	is	for	whole	or	part	of	collocation
• 2nd dimension	does	language-oriented	error	analysis	
• 3rd dimension	does	interpretive	error	analysis	
• [Liu et al., 2009] use a probabilistic approach for collocation correction:
• Use	BNC	and	WordNet	as	language	learning	sources	
• Suggest	corrections	based	on	commonly	used	expressions
• Do	not	develop	a	tool	for	collocation	detection	&	correction
Discussion
• Collocation	error	correction	tools	in	the	literature	are	
found	useful	by	users	
• Commonsense knowledge from native speakers is typically embedded in the source corpora used for learning
• Approaches	in	linguistic	classification	as	well	as	in	
collocation	correction	rely	heavily	on	frequency
• Thus,	potential	issues	related	to	sparse	data	with	correct	
collocations	call	for	further	research		
Text	to	Knowledge	and	Knowledge	to	Text
• Collocation	approaches	start	with	text	and	extract	knowledge	from	corpora	
• Different	methods	used	for	knowledge	extraction - probabilistic,	ensemble	
• Extracted	knowledge	used	for	linguistic	classification,	error	correction	
• Analysis in linguistic classification results in statistical text categorization
• Correctly	collocated	text	responses	offered	as	suggestions	in	error	correction
• Thus,	extracted	knowledge serves	to	provide	text	based	outputs
• Commonsense knowledge	plays	a	role	mainly	in	source	corpora	from	native	
speakers	&	expert	writings
• This	contributes	to	machine	intelligence	by	providing	better	machine	
translation	incorporating	commonsense		
References
• Bollegala, D., Matsuo, Y. and Ishizuka, M., Measuring the similarity between implicit semantic relations using web search engines, WSDM 2009, pp. 104–113.
• Cohen, W., Fast effective rule induction, Proceedings of the International Conference on Machine Learning, ICML 1995, pp. 115–123.
• Dahlmeier, D. and Ng, H.T., Correcting semantic collocation errors with L1-induced paraphrases, Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, pp. 107–117.
• Futagi, Y., Deane, P., Chodorow, M. and Tetreault, J., A computational approach to detecting collocation errors in the writing of non-native speakers of English, Computer Assisted Language Learning 2008, 21(4):353–367.
• Liu, A. L.-E., Wible, D. and Tsao, N.-L., Automated suggestions for miscollocations, Proceedings of the 4th Workshop on Innovative Use of NLP for Building Educational Applications, 2009, pp. 47–50.
• Park, T., Lank, E., Poupart, P. and Terry, M., Is the sky pure today? AwkChecker: An assistive tool for detecting and correcting collocation errors, ACM Symposium on User Interface Software and Technology 2008, pp. 121–130.
• Ramos, M.A., Wanner, L., Vincze, O., del Bosque, G.C., Veiga, N.V., Suárez, E.M. and González, S.P., Towards a Motivated Annotation Schema of Collocation Errors in Learner Corpora, LREC 2010, pp. 3209–3214.
• Varghese, A., Varde, A., Peng, J. and Fitzpatrick, E., A framework for collocation error correction in Web pages and text documents, ACM SIGKDD Explorations 2015, 17(1):14–23.