2. Outline
1. Learning Word Embeddings using Lexical Dictionaries
2. Auto-Encoding Dictionary Definitions into Consistent Word Embeddings
3. Definition Auto-encoder with Semantic Injection
4. Our Work
5. Discussion
4. 1. Learning Word Embeddings using Lexical Dictionaries
car: A road vehicle, typically with four wheels, powered by an
internal combustion engine and able to carry a small number
of people
The definition of a word
xe đạp ("bicycle"): a means of transport with two or three wheels, with handlebars connected to the front wheel, propelled by human force applied to the pedals
5. Strong Word Pairs and Weak Word Pairs [Julien Tissier et al.]
In a definition, not every word has the same semantic relevance: in the definition of "car", the words "internal" or "number" are less relevant than "vehicle".
If a word wa appears in the definition of a word wb and wb appears in the definition of wa, they form a strong pair; in addition, each of the K closest words to wa (resp. wb) forms a strong pair with wb (resp. wa).
If wa appears in the definition of wb but wb does not appear in the definition of wa, they form a weak pair.
A weak pair can be promoted to a strong pair if the two words are among each other's K closest neighbours.
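A minimal sketch of these rules, assuming a definitions mapping (word -> set of words in its definition) and a neighbours mapping (word -> list of its closest words); both structures and the function name are illustrative, not the authors' code:

def build_pairs(definitions, neighbours, K=5):
    """definitions: word -> set of words occurring in its definition (illustrative structure).
    neighbours:  word -> list of that word's closest words, most similar first.
    Returns (strong, weak) as sets of frozenset word pairs."""
    strong, weak = set(), set()
    for wa, def_a in definitions.items():
        for wb in def_a:
            pair = frozenset((wa, wb))
            if wa in definitions.get(wb, set()):
                # wa and wb appear in each other's definition: strong pair,
                # extended to the K closest neighbours of each word.
                strong.add(pair)
                for n in neighbours.get(wa, [])[:K]:
                    strong.add(frozenset((n, wb)))
                for n in neighbours.get(wb, [])[:K]:
                    strong.add(frozenset((n, wa)))
            elif (wb in neighbours.get(wa, [])[:K]
                  and wa in neighbours.get(wb, [])[:K]):
                # Weak pair promoted to strong: the two words are mutual
                # K-nearest neighbours.
                strong.add(pair)
            else:
                weak.add(pair)
    weak -= strong  # keep only the strongest label for each pair
    return strong, weak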
6. Positive sampling
Let S(w) be the set of all words forming a strong pair with the word w, and W(w) the set of all words forming a weak pair with w. For each target word wt from the corpus, we build Vs(wt), a random set of ns words drawn with replacement from S(wt), and Vw(wt), a random set of nw words drawn with replacement from W(wt).
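A small sketch of this sampling step, assuming the strong and weak pairs from the previous slide have been indexed per word; the default values of ns and nw are placeholders for the paper's hyperparameters:

import random

def positive_samples(wt, S, W, ns=4, nw=5):
    """S: word -> list of words forming a strong pair with it.
    W: word -> list of words forming a weak pair with it.
    Returns (Vs, Vw), both drawn with replacement for the target word wt."""
    Vs = random.choices(S[wt], k=ns) if S.get(wt) else []
    Vw = random.choices(W[wt], k=nw) if W.get(wt) else []
    return Vs, Vw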
7. Negative sampling replaces the softmax with binary classifiers (the standard objective is recalled after the list below):
1. The unigram distribution only takes into account word frequency,
and provides the same noise distribution when selecting negative
examples for different target words.
2. Labeau and Allauzen (2017) already showed that a context-
dependent noise distribution could be a better solution to learn a
language model.
3. Unlike the positive target words, the meaning of negative examples remains unclear: for a training word, we do not know what a good noise distribution should be, while we do know what a good target word is (one of its surrounding words).
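For reference, the standard skip-gram negative-sampling term these observations refer to is, for a target word w_I and an observed context word w_O (Mikolov et al., 2013):

\[
\log \sigma\!\big({v'_{w_O}}^{\top} v_{w_I}\big)
\;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
\Big[\log \sigma\!\big(-{v'_{w_i}}^{\top} v_{w_I}\big)\Big]
\]

where v and v' are the input and output vectors and P_n(w) is the noise distribution, in practice the unigram distribution raised to the 3/4 power, which is the same whatever the target word w_I.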
8. Controlled negative sampling
Negative sampling treats two random words from the vocabulary V as unrelated. For each word wt from the vocabulary, we generate a set F(wt) of k words randomly selected from the vocabulary, discarding and redrawing any word that already forms a pair with wt. In our experiments, we noticed this filter discards around 2% of the generated negative pairs.
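A sketch of this controlled sampling; the exact filtering rule is an assumption (it is only constrained here to be consistent with the 2% figure above), and the names and default k are illustrative:

import random

def controlled_negatives(wt, vocab, related, k=5):
    """vocab: list of vocabulary words.
    related: word -> set of words forming a pair with it, i.e. words that
             must not be drawn as negative examples for wt (assumed rule)."""
    F = []
    while len(F) < k:
        wi = random.choice(vocab)
        if wi == wt or wi in related.get(wt, ()):
            continue  # discard and redraw (~2% of draws in the authors' experiments)
        F.append(wi)
    return F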
10. Fetching online definitions
• We extract all unique words with more than 5 occurrences from a full Wikipedia
dump, representing around 2.2M words
• We use the English versions of Cambridge, Oxford, Collins and dictionary.com. For each word, we download the 4 corresponding webpages and use regexes matched to each website's specific HTML template to extract the definitions, which makes the extraction exact.
• Our approach does not focus on polysemy, so we concatenate all definitions for each word. We then concatenate the results from all dictionaries, remove stop words and punctuation, and lowercase all words (a preprocessing sketch follows this list).
• Among the 2.2M unique words, only 200K have a definition. We generate strong and weak pairs from the downloaded definitions according to the rule described in subsection 3.1, leading to 417K strong pairs (with the parameter K from 3.1 set to 5) and 3.9M weak pairs.
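A rough sketch of the preprocessing step referenced above (concatenate definitions, strip stop words and punctuation, lowercase); the stop-word list and the function name are placeholders, not the authors' pipeline:

import re

STOP_WORDS = {"a", "an", "the", "of", "and", "to", "or", "with", "by", "in"}  # tiny placeholder list

def merge_definitions(definitions_per_dictionary):
    """definitions_per_dictionary: list of raw definition strings for one word,
    one entry per online dictionary, with all senses already concatenated."""
    text = " ".join(definitions_per_dictionary).lower()
    tokens = re.findall(r"[a-z]+", text)  # keeps letters only, drops punctuation and digits
    return [t for t in tokens if t not in STOP_WORDS]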
11. Two training corpora:
+ Corpus A: containing only data from Wikipedia.
+ Corpus B: data from Wikipedia concatenated with the extracted definitions.
14. Consistency penalty
2. Consistency Penalized Auto Encoder
Three different embeddings:
a) definition embeddings h, produced by the definition encoder, are the
embeddings we are ultimately interested in computing;
b) input embeddings E are used by the encoder as inputs;
c) output embeddings E’ are compared to definition embeddings to yield a
probability distribution over the words in the definition.
A soft weight-tying scheme brings the input embeddings closer to the definition embeddings. We call this term a consistency penalty because its goal is to ensure that the embeddings used by the encoder (input embeddings) and the embeddings produced by the encoder (definition embeddings) are consistent with each other.
Complete objective
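As a sketch (paraphrased, not quoted from the paper), the complete objective combines the auto-encoder reconstruction loss over each definition with the consistency penalty described above, roughly:

\[
\mathcal{L} \;=\; \sum_{w \in \mathcal{D}}
\Big( -\!\!\sum_{w_t \in \mathrm{def}(w)} \log p\big(w_t \mid h_w\big)
\;+\; \lambda \,\lVert h_w - E(w) \rVert^2 \Big)
\]

where h_w is the definition embedding produced by the encoder, E(w) the input embedding of the defined word, and λ weights the consistency penalty; the notation def(w) and λ is ours, not the paper's.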
21. + Two (or more) words are synonyms if they appear in the same contexts.
+ The context is precisely the information that describes a word.
Thank you for your attention