Human Interface Lab.
Detecting Oxymoron in a
Single Statement
Won Ik Cho
Nov. 01, 2017
Contents
• Introduction
 Word vector representation
 Word analogy test
• Proposed methods
 Oxymoron detection
 Overall scheme and flow chart
• Experiment and discussion
• Conclusion
Introduction
Introduction
• Word meaning for computers
 Use a taxonomy like WordNet that has hypernyms (is-a)
relationships and synonym sets
 Problems with discreteness
Missing nuances
Missing new words
Subjective
Requires human labor
Hard to compute accurate word similarity
ex) One-hot representation
hotel = [0 0 0 … 1 0 0 … 0 0 0]
motel = [0 0 0 … 0 1 0 … 0 0 0]
Semantically hotel ≈ motel, yet the one-hot vectors are orthogonal (hotel ⊥ motel)
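To make the orthogonality problem concrete, here is a minimal sketch (illustrative, not from the original slides): the dot product of any two distinct one-hot vectors is zero, so no similarity can be read off.

```python
import numpy as np

vocab_size = 10
hotel = np.zeros(vocab_size)
motel = np.zeros(vocab_size)
hotel[3] = 1.0  # arbitrary vocabulary index for "hotel"
motel[4] = 1.0  # arbitrary vocabulary index for "motel"

# Distinct one-hot vectors are always orthogonal, so their cosine
# similarity is 0 regardless of how related the words actually are.
print(hotel @ motel)  # 0.0
```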
Introduction
• In statistical NLP…
“You shall know a word by the company it keeps” (J. R. Firth 1957:11)
1) Capture co-occurrence counts directly (count-based)
2) Go through each word of the whole corpus and
predict surrounding words of each word (direct prediction)
Introduction
• Count-based vs. direct prediction
Word vector representation
• Basic idea
 Define a model that predicts the context of a center
word $w_t$: $P(\text{context} \mid w_t)$
 Loss function $J = 1 - P(w_{-t} \mid w_t)$, where $w_{-t}$
denotes the words surrounding $w_t$
 Keep adjusting the vector representations of words to
minimize the loss
[Figure: feedforward neural network based LM, by Y. Bengio and H. Schwenk (2003)]
Main idea of word2vec
• Mikolov et al., 2013
• Two algorithms
 Skip-grams (SG)
Predict context words given target (position independent)
 Continuous bag of words (CBOW)
Predict target word from BOW context
• Two (moderately efficient) training methods
 Hierarchical softmax
 Negative sampling
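As an aside not on the original slides, the two algorithms and two training methods map directly onto parameters of gensim's Word2Vec; a minimal sketch, assuming gensim 4.x and a toy corpus:

```python
from gensim.models import Word2Vec

sentences = [["sugar", "free", "but", "sweet"],
             ["legalized", "robbery"]]  # toy corpus for illustration

# sg=1 selects skip-gram (sg=0 would select CBOW);
# hs=0 together with negative=5 selects negative sampling
# (hs=1 would select hierarchical softmax).
model = Word2Vec(sentences, vector_size=50, window=5,
                 sg=1, hs=0, negative=5, min_count=1)
print(model.wv["sweet"].shape)  # (50,)
```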
Main idea of GloVe
• Pennington et al., 2014
• Count-based:
 Primarily used to capture word similarities
 Do poorly on word analogy tasks
(sub-optimal vector space structure)
• Direct prediction:
 Learn word embeddings by making predictions in local
context windows
 Demonstrate the capacity to capture complex linguistic
patterns
 Fail to make use of the global co-occurrence statistics
How about combining the advantages of each approach?
Word analogy test
• Performed to test how well the representation
describes the relations between words
 Pennington et al. (2014); an example sketch follows below
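A typical test asks whether $F(king) - F(man) + F(woman)$ lands nearest to $F(queen)$. A minimal sketch, assuming `embeddings` is a dict from words to numpy vectors (e.g., loaded from GloVe as on the experiment slide):

```python
import numpy as np

def cos_sim(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(embeddings, a, b, c, topn=1):
    """Rank words by closeness to F(b) - F(a) + F(c), excluding the query words."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    scored = [(w, cos_sim(query, vec))
              for w, vec in embeddings.items() if w not in (a, b, c)]
    return sorted(scored, key=lambda x: -x[1])[:topn]

# e.g., analogy(embeddings, "man", "king", "woman") should rank "queen" first
```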
Proposed methods
Oxymoron detection
• Detecting contradiction caused by semantic
discrepancy between a pair of words
• Includes word analogies of:
antonyms, synonyms (with negation), or
words with an entailment error
• Differs from detecting paradox
 “There’s a pattern of unpredictability.” (oxymoron)
 “I am a compulsive liar.” (paradox)
Oxymoron detection
• Basic idea
 People recognize an oxymoron in a text by the
incongruity between its words
Antonym (ex) Sugar-free/Sweet
Words with entailment error (ex) Legalized/Robbery
Synonym with negation (ex) Much/not Enough
 Finding these relations (with some structural options) in
a single statement may imply the existence of an oxymoron
(especially for short sentences)
 Let's find the relation by comparing word vector offsets!
Proposed scheme
• Offset vector set construction
 Offset vector of words $a$, $b$:
For a word embedding function $F$, the offset vector $rel_{a,b}$
is defined as $rel_{a,b} = F(a) - F(b)$
 Offset vector set for antonyms:
For the set of antonym word pairs $Ant$, the $i$-th antonym
offset vector $ant_i$ for the $i$-th antonym pair $(a_i, b_i)$
is defined as $ant_i = F(a_i) - F(b_i)$
 $Ant$ includes word pairs with an entailment error as well
 The same process is repeated for the synonym pairs $Syn$
(a construction sketch follows below)
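A minimal construction sketch, assuming `F` is a dict from words to numpy vectors and `ant_pairs` / `syn_pairs` are lists of word pairs from the collected data (the names are illustrative, not from the slides):

```python
def build_offset_set(F, pairs):
    """Offset vector rel_{a,b} = F(a) - F(b) for each pair (a, b) in the set."""
    return [F[a] - F[b] for a, b in pairs if a in F and b in F]

# Ant also includes pairs with an entailment error, per the slide above
ant_offsets = build_offset_set(F, ant_pairs)
syn_offsets = build_offset_set(F, syn_pairs)
```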
Proposed scheme
• Antonym/synonym checking
 For an input word pair $(x, y)$, $ant(x, y)$ is defined to
check antonymy/synonymy
 Define $d_{ant,i} = Cos(rel_{x,y}, ant_i)$ for the cosine distance
$Cos(u, v) = 1 - \frac{u \cdot v}{|u||v|}$
 $(x, y)$ is considered an antonym pair if $d = \min_i d_i < D$
for a threshold value $D$
 $D$ is varied in the implementation (a checking sketch
follows below)
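A sketch of the check under the same assumptions as above; the return convention (1 for antonym, 0 for synonym, -1 for neither) is inferred from the mod-2 rule on the negation-counting slide, not stated explicitly:

```python
import numpy as np

def cos_dist(u, v):
    """Cosine distance Cos(u, v) = 1 - (u . v) / (|u||v|)."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ant(x, y, F, ant_offsets, syn_offsets, D):
    """1 if (x, y) looks antonymous, 0 if synonymous, -1 if neither is near."""
    rel = F[x] - F[y]
    d_ant = min(cos_dist(rel, off) for off in ant_offsets)
    d_syn = min(cos_dist(rel, off) for off in syn_offsets)
    if min(d_ant, d_syn) >= D:  # no stored offset is within the threshold
        return -1
    return 1 if d_ant < d_syn else 0
```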
Proposed scheme
• Checking invalid cases
 Assumptions:
(1) Only lexical words (not grammatical ones) can have an
antonym/synonym relationship
(2) A contradiction occurs only if the antonyms indicate the
same object/situation simultaneously
 For (1), only verbs, nouns, adjectives, and adverbs are
analyzed, with lemmatization (see the sketch below)
 For (2), dependency parsing could be applied (not in the
current implementation)
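A sketch of the lexical-word filter with NLTK; the Penn-to-WordNet tag mapping is one reasonable choice, not prescribed by the slides:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download("punkt"),
# nltk.download("averaged_perceptron_tagger"), nltk.download("wordnet")
TAG_MAP = {"V": "v", "N": "n", "J": "a", "R": "r"}  # verbs, nouns, adjectives, adverbs
lemmatizer = WordNetLemmatizer()

def lexical_words(sentence):
    """Return (lemma, position) for verbs/nouns/adjectives/adverbs only."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [(lemmatizer.lemmatize(word.lower(), TAG_MAP[tag[0]]), i)
            for i, (word, tag) in enumerate(tagged) if tag[0] in TAG_MAP]
```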
Proposed scheme
• Negation counting
 Usually negation terms (ex: no, not, never, n't) come a few
words before the word they negate
 Define an indicator $neg(w)$: 1 if a negation term appears a
few words before $w$, and 0 otherwise
• For every word pair $(w_i, w_j)$:
If $ant(w_i, w_j) = ant_{ij} \geq 0$ and both $w_i$, $w_j$ are valid,
$(w_i, w_j)$ is decided to be contradictory if
$ant_{ij} + neg(w_i) + neg(w_j) \equiv 1 \pmod{2}$
If any word pair is decided to be contradictory, the
statement contains an oxymoron (a sketch follows below)
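A sketch of the negation indicator and the parity rule; the 3-word look-back window is an illustrative choice for "a few words before":

```python
NEGATIONS = {"no", "not", "never", "n't"}

def neg(tokens, idx, window=3):
    """1 if a negation term occurs within `window` tokens before tokens[idx]."""
    return int(any(t.lower() in NEGATIONS for t in tokens[max(0, idx - window):idx]))

def contains_oxymoron(tokens, valid, F, ant_offsets, syn_offsets, D):
    """valid: (lemma, position) pairs from lexical_words(); uses ant() above."""
    for a in range(len(valid)):
        for b in range(a + 1, len(valid)):
            (wi, i), (wj, j) = valid[a], valid[b]
            a_ij = ant(wi, wj, F, ant_offsets, syn_offsets, D)
            # An antonym pair with an even number of negations, or a synonym
            # pair with an odd number, makes the pair contradictory.
            if a_ij >= 0 and (a_ij + neg(tokens, i) + neg(tokens, j)) % 2 == 1:
                return True
    return False
```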
Flow chart
Experiment and discussion
Experiment
• Python implementation with the NLTK library (for
tokenizing, POS tagging, and lemmatization)
• Pre-trained word vectors based on GloVe
 glove.6B.50d (see the loading sketch after this slide)
50 dimensions, trained on Wikipedia 2014 and Gigaword 5
• Dataset: constructed by manual search
 For antonym/synonym pairs
Michigan Proficiency Exams (http://www.michigan-
proficiencyexams.com/)
 For test sentences
Oxymoron List (http://www.oxymoronlist.com/)
1001 Truisms! (http://1001truisms.webs.com/truisms.htm)
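A sketch of loading the pre-trained vectors into the dict used by the earlier sketches (glove.6B.50d.txt is the file name in the standard GloVe 6B download; the path is illustrative):

```python
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Each line holds a word followed by 50 space-separated float values."""
    F = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            F[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return F

F = load_glove()
print(len(F), F["sweet"].shape)  # 400000 words in the 6B vocabulary, (50,)
```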
Experiment
Result
• Relatively low performance
 The word vectors were not trained for the purpose of
capturing antonym/synonym relations
 Dependency parsing was not applied
 Determination of a proper $D$ value is necessary
A high $D$ alone can improperly inflate recall, so
F-measure or accuracy should be used as the evaluation
measure
Discussion
• Advantages
 Easy to construct the dataset (many open sources, a
manageable amount of words/phrases)
 Does not need any additional training on sentences
(depends largely on the word vectors)
 Checks how well the word vectors capture semantic relations
• To enhance the accuracy
 Set a (sub)optimal $D$ value via optimization such as
bisection methods (Boyd and Vandenberghe, 2004); see the
sketch below
 Use dependency parsers (Chen, 2014; Andor, 2016) to
check whether the contradictory words really indicate the
same object/situation
 Use word embeddings that account for antonymy
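A sketch of tuning $D$ on a labeled development set; the slides cite bisection (Boyd and Vandenberghe, 2004), and the bracket-shrinking variant below assumes the F-measure is unimodal in $D$, which is an assumption rather than a guarantee:

```python
def f_measure(D, dev_set, detect):
    """dev_set: (sentence, label) pairs; detect(sentence, D) -> bool."""
    tp = fp = fn = 0
    for sentence, label in dev_set:
        pred = detect(sentence, D)
        if pred and label:
            tp += 1
        elif pred:
            fp += 1
        elif label:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_D(dev_set, detect, lo=0.0, hi=2.0, iters=30):
    """Shrink the bracket [lo, hi]; cosine distance lies in [0, 2]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f_measure(m1, dev_set, detect) < f_measure(m2, dev_set, detect):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```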
Future work
• Applying dependency parsing
 Calculating the distance from the root for the
lexical words (e.g., nouns)
 Checking whether two words are directly dependent
(a sketch follows below)
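A sketch of both checks; spaCy is an illustrative choice of parser, since the slides cite Chen (2014) and Andor (2016) but do not fix an implementation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def depth_from_root(token):
    """Number of head links between a token and the sentence root."""
    depth = 0
    while token.head is not token:  # in spaCy, the root is its own head
        token = token.head
        depth += 1
    return depth

def directly_dependent(t1, t2):
    """True if one token is the immediate syntactic head of the other."""
    return t1.head is t2 or t2.head is t1

doc = nlp("There's a pattern of unpredictability.")
for tok in doc:
    print(tok.text, tok.dep_, depth_from_root(tok))
```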
Future work
• Using word embeddings regarding antonymy
 M. Ono, M. Miwa, and Y. Sasaki, “Word Embedding-
based Antonym Detection using Thesauri and
Distributional Information,” In Proceedings of the
Human Language Technologies: The 2015 Annual
Conference of the North American Chapter of the ACL,
2015, pp. 984–989.
 J. Kim, M. De Marneffe, and E. Fosler-Lussier, “Adjusting
Word Embeddings with Semantic Intensity Orders,” In
Proceedings of the 1st Workshop on Representation
Learning for NLP, 2016, pp. 62–69.
Conclusion
• A deterministic scheme to detect oxymora and
evaluate the word vector representation
• Suitable for word vectors that capture
antonym/synonym relations
• Several advantages over other contradiction
detection approaches
 Produces stable results once a few options are fixed
 Does not need training
 Also shows how far other word relations are from
the target relations
Thank you!
