2. DATA/MSML 641
• Assignment 1 was due a couple minutes ago!
• Last lecture, we talked about what constitutes a word, MWE/collocations,
and statistical hypothesis testing
• We’ll take the first 20 minutes of this lecture to go over Assignment 0
Before we begin
3. DATA/MSML 641
• Word - See prior lecture for the hidden complexity!
• Meaning - What do we mean by meaning?
Word Meaning
5. DATA/MSML 641
• Looking up definitions for a word in a dictionary is the lexicographic
tradition
• Dictionaries enumerate word meanings (senses)
• What are the pitfalls of this approach?
• Definitions are not “defining” in the mathematical sense
• but also not in a precise legal sense
• “Assault” - Intentionally placing another person in reasonable apprehension of
imminent harmful or offensive contact; physical injury NOT required
• “Definitions” are text you read that evoke concepts in your head
Lexicographic Traditions
6. DATA/MSML 641
• Homonymy
• “Pen”
• “Place to keep animals”
• “Instrument to write with”
• Coincidental or historical?
• Polysemy
• “Bass”
• “Low register”
• “Person who sings low”
• Related meanings
• Systematic / Structured polysemy
• “Chicken”
• Type of animal
• Food
Types of Word Senses
7. DATA/MSML 641
• Labor-intensive tradition
• Ex. Oxford English Dictionary (O.E.D.), started 1857
• Crowdsourcing by volunteers to find quotations illustrating each word
took more than 50 years to complete!
• cf. Simon Winchester, The Professor and the Madman
• Corpus analysis approach
• Ex. Collins COBUILD dictionary, published 1987
• KWIC: Keyword in context
… sing a bass solo after he …
… lovely bass voice
reach the bass notes …
played bass guitar and …
with a bass and a lead guitarist ...
Lexicography - making dictionaries
8. DATA/MSML 641
• A definition gives necessary and sufficient conditions for something to
count as the thing defined
• Ex. One is a “bachelor” iff one is {human, male, adult, never married}
(can be thought of as features for a model)
• Lexical-conceptual structures (LCS)
• Framework for semantic analysis, developed by Ray Jackendoff in
the 70s
Decompositional representations
Ex. an LCS tree: CAUSE(X, GO(Y, TO(AT(Z)))), built from primitive
semantic elements (CAUSE, GO, TO, AT)
10. DATA/MSML 641
• Set of concepts and relations
between them
• There are also gene, business (e.g. NAICS), astronomy
ontologies, and more!
Ontologies
Ex.
(Kingdom) Animalia
(Phylum) Chordata - dogs, lions, lizards, fish, . . .
(Class) Mammalia - dogs, lions, elephants
(Order) Carnivora - dogs, lions, . . .
(Family) Canidae - coyotes, dogs, jackals, wolves
(Genus) Canis - dogs, wolves, jackals
(Species) Canis lupus - dogs, wolves
(Subspecies) Canis lupus familiaris - domestic dogs
Each level above is linked by IS-A; a specific dog like Remy is linked
to its breed (Beagle, Pocket beagle) by INSTANCE-OF rather than IS-A.
11. DATA/MSML 641
• Conceived by George Miller as an ontology for English
• Has evolved into a broader collection of multilingual
WordNets
• Distinct ontologies for N, V, Adj, Adv
• Core concept: the synset (synonym set) ~ “concept”
• <board, plank, beam>
• <board, committee, panel, council>
• The noun taxonomy is most widely used
• Hyper/hyponym (IS-A)
• Instance (INSTANCE-OF)
• Meronym (PART-OF)
• Antonym (OPPOSITE)
WordNet
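These relations can be explored programmatically; below is a minimal sketch using NLTK's WordNet interface (assumes nltk is installed and the wordnet corpus has been downloaded):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Each synset is one "concept"; a word maps to several synsets (senses)
for syn in wn.synsets("board", pos=wn.NOUN):
    print(syn.name(), "->", [l.name() for l in syn.lemmas()])

# Walk the IS-A taxonomy upward from one sense of "board"
plank = wn.synset("board.n.02")
print([h.name() for h in plank.hypernyms()])

# Meronyms (PART-OF) are also available
print(wn.synset("tree.n.01").part_meronyms())
```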
12. DATA/MSML 641
• Related to the core idea in AI of “inheritance”, which also played a
fundamental role in early semantic networks
• C2 IS-A C1 => for all f, f is a property of C1 => f is also a property of C2
• Related to subclassing in object-oriented programming
WordNet
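To make the subclassing parallel concrete, here is a tiny hypothetical Python sketch in which a property f of C1 is inherited by C2:

```python
class Canis:                      # C1
    def is_mammal(self):          # property f of C1
        return True

class Dog(Canis):                 # C2 IS-A C1
    pass

print(Dog().is_mammal())          # True: f is also a property of C2
```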
14. DATA/MSML 641
• A classic problem in NLP
• Bar-Hillel (1960) argued that fully-automatic high-quality machine
translation (FAHQMT) was infeasible because it would require too
much world knowledge
• “The box is in the pen” requires commonsense reasoning about
relative sizes to disambiguate and translate “pen” correctly
• Long era of work on small sets of individual words (e.g. line, bank)
• Resnik and Yarowsky (1997) created SENSEVAL (later SemEval)
• Community-wide shared task for WSD
WSD
15. DATA/MSML 641
• Given:
• enumerated senses {s1, s2, …, sn} for a word w
• context for w (…. w ….)
• Select:
• “correct” si for w
• Strong baseline: Just the most frequent sense!
• Good pre-deep-learning baseline: Naive Bayes
• Choose argmax_y Pr(y|x) = argmax_y Pr(x|y) Pr(y) (Bayes’ rule; the
denominator Pr(x) is constant in y)
WSD as supervised classification
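As a sketch of the most-frequent-sense baseline: NLTK's WordNet lists a word's synsets in roughly descending order of sense frequency, so taking the first one (and ignoring context entirely) already gives a strong baseline:

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=wn.NOUN):
    """Most-frequent-sense baseline: ignore the context entirely."""
    senses = wn.synsets(word, pos=pos)
    return senses[0] if senses else None

print(most_frequent_sense("pen"))  # typically the writing-implement sense
```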
16. DATA/MSML 641
• Strong baseline: Just the most frequent sense!
• Good pre-deep-learning baseline: Naive Bayes
• Choose ŷ = argmax_y Pr(y) ∏_i Pr(f_i | y)
• Naively assume all features f_i are independent: multiply the
independent probabilities Pr(f_i | y), weighted by the prior Pr(y)
(often uniform); a worked sketch follows below
WSD as supervised classification
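Below is a minimal Naive Bayes WSD sketch using scikit-learn with bag-of-words context features; the sentences and sense labels are invented toy data, not from a real sense-tagged corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled contexts for the ambiguous word "bass" (hypothetical data)
contexts = [
    "he sang a deep bass solo with the choir",
    "the bass notes rumbled through the hall",
    "caught a largemouth bass in the lake",
    "grilled bass with lemon for dinner",
]
senses = ["music", "music", "fish", "fish"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(contexts, senses)  # estimates Pr(f_i|y) and the prior Pr(y) from counts

print(clf.predict(["played the bass line on stage"]))  # likely ['music']
```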
17. DATA/MSML 641
• Other supervised approaches:
• SVM with engineered features (See J&M 18.5.1)
WSD as supervised classification
Ex. extract a feature vector from a window around the target w_i = “bass”:
“An electric guitar and bass player stand . . .”
POS tags: DT JJ NN CC NN NN . . .
Window: w_i-2 w_i-1 w_i w_i+1 w_i+2 (words and tags as features)
18. DATA/MSML 641
• Given
• … sawed a board in half and …
• … nailed together two pine boards …
• Create contextual embedding vectors for each sense, and use a
1-nearest-neighbor (1-NN) classifier (see the sketch after this slide)!
Supervised state-of-the-art (as of J&M 18.4.2)
[Diagram: compute the contextual embedding of “board” in a new context
(…. board ….) and compare it against per-sense vectors v_BOARD_1, …,
v_BOARD_n, i.e. non-engineered contextual representations, one per
sense c_1, …, c_n; select the sense whose vector minimizes cosine
distance]
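A rough sketch of this approach with Hugging Face transformers; the choice of bert-base-uncased, the single labeled example per sense (in practice you would average several), and the single-wordpiece assumption for the target are all simplifications:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def target_embedding(sentence, target):
    # Contextual vector for `target`, assumed to be a single wordpiece
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    target_id = tok.convert_tokens_to_ids(target)
    pos = (enc["input_ids"][0] == target_id).nonzero()[0].item()
    return hidden[pos]

# One vector per sense from labeled examples (toy data)
sense_vecs = {
    "lumber": target_embedding("sawed a board in half", "board"),
    "committee": target_embedding("the board approved the budget", "board"),
}

# 1-NN: choose the sense whose vector is closest in cosine distance
query = target_embedding("nailed the board to the wall", "board")
sims = {s: torch.cosine_similarity(query, v, dim=0).item()
        for s, v in sense_vecs.items()}
print(max(sims, key=sims.get))  # likely "lumber"
```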
20. DATA/MSML 641
• Kilgarriff advocated for task-dependent clustering of corpus
instances
• Schütze: Early distributional representations can form clusters
• Current trend is to use distributional representations as meaning
• “You shall know a word by the company it keeps”
• However, consider: systematic regularities (e.g. is “lamb” a food
or an animal?), sparse data (esp. for specialized domains), and
explainability (supporting understanding and trusting inferences)
Abandoning enumerated senses
21. DATA/MSML 641
• Low performance - hard problem…
• Skew of senses
• The most frequent sense dominates a word’s occurrences (Zipf’s
law), so the word itself is already a strong signal
• “Pen” usually means writing implement, for example
• One sense per discourse
• Implicit disambiguation (bank -> ambiguous, bank … atm … hours
-> unambiguous)
Why doesn’t traditional WSD help a lot of
standard applications?
23. DATA/MSML 641
• Symbolic representations are basically 1-hot encodings
• Dot product: multiply a 1×n matrix by an n×1 matrix to get a 1×1 result
• Or geometrically, measure cosine similarity
Distributional approaches to meaning
24. DATA/MSML 641
• Often measured as cosine similarity: sim(x, y) = (x · y) / (‖x‖ ‖y‖)
• Problem: sim(x, y) = 0 for x != y if 1-hot! Has trouble generalizing {puppy, dog, pup}
Vector Similarity
[Sketch: words plotted as vectors in 2D; related words (e.g. “apple”,
“orange”, “food”) lie close together, unrelated ones (e.g. “wheel”,
“Rome”, “scooter”) far apart; a numpy illustration follows]
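A quick numpy illustration of the problem, using a hypothetical 5-word vocabulary:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# 1-hot vectors over a toy vocabulary: [puppy, dog, pup, scooter, Rome]
puppy = np.array([1, 0, 0, 0, 0], dtype=float)
dog   = np.array([0, 1, 0, 0, 0], dtype=float)

# Every distinct pair of 1-hot vectors is orthogonal: similarity 0
print(cosine(puppy, dog))    # 0.0, so no generalization across near-synonyms
print(cosine(puppy, puppy))  # 1.0
```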
26. DATA/MSML 641
• Problem: sim(x,y) = 0 for x!=y if 1-hot! Has trouble generalizing {puppy, dog, pup}
• Solution: Use weights
• v_{w_i, d_j} = tf_{i,j} × idf_i
• tf_{i,j}: freq. of w_i in d_j
• idf_i: inversely proportional to the # of d_j’s containing w_i
• Usually idf_i = log(N / # docs containing w_i)
• If w_i ∈ all docs, idf = 0!
• Ex. “the” appears in nearly every doc, so its idf is really low (a
worked sketch follows below)
Vector Similarity
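A small sketch computing these weights directly, over a hypothetical 3-document corpus:

```python
import math
from collections import Counter

docs = [
    "the dog chased the ball",
    "the cat slept",
    "the dog barked at the dog",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(word, doc_tokens):
    tf = Counter(doc_tokens)[word]                  # freq. of w_i in d_j
    df = sum(1 for d in tokenized if word in d)     # # docs containing w_i
    idf = math.log(N / df) if df else 0.0
    return tf * idf

print(tf_idf("the", tokenized[0]))  # 0.0: "the" is in every doc, so idf = 0
print(tf_idf("dog", tokenized[2]))  # ~0.81: informative and repeated
```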
27. DATA/MSML 641
• Using contexts to define word vectors
• count(wi, wj) = # times wi, wj co-occur in a window
• Same issue with frequent words; fix it by reweighting with PMI
(pointwise mutual information)
Vector Similarity
[Sketch: V × V term-term matrix, rows and columns w_1 … w_V, each cell a
co-occurrence count; a PMI sketch follows below]
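A hedged sketch of PMI reweighting over a toy co-occurrence matrix (the counts are invented):

```python
import numpy as np

# Toy term-term co-occurrence counts for [the, dog, leash]
C = np.array([
    [0, 20, 5],   # the
    [20, 0, 8],   # dog
    [5, 8, 0],    # leash
], dtype=float)

total = C.sum()
p_xy = C / total                                  # joint probabilities
p_x = C.sum(axis=1, keepdims=True) / total
p_y = C.sum(axis=0, keepdims=True) / total

with np.errstate(divide="ignore"):
    pmi = np.log(p_xy / (p_x * p_y))              # PMI = log p(x,y)/(p(x)p(y))
ppmi = np.maximum(pmi, 0)                         # positive PMI, common in practice
print(np.round(ppmi, 2))
```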
28. DATA/MSML 641
• Each word vector has very high dimension! This can lead to reduced ability to generalize
• What are some dimension reduction techniques we can leverage?
• Latent Semantic Indexing (LSI) uses Singular Value Decomposition (SVD) to
reduce the dimensionality of word/document representations
Dimensionality reduction
M = U Σ V^T
M: m × n original matrix (rows = words, columns = docs)
U: m × m, Σ: m × n diagonal matrix of singular values, V^T: n × n
Truncated SVD: zero out all but the top-k singular values in Σ to get a
rank-k approximation of the original matrix (a numpy sketch follows)
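A minimal numpy sketch of truncated SVD on a toy matrix (the sizes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 4))                 # m x n term-document matrix (toy)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                  # keep only the top-k singular values
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of M

# Rows of U[:, :k] * s[:k] serve as k-dimensional word representations
word_vecs = U[:, :k] * s[:k]
print(M_k.shape, word_vecs.shape)      # (6, 4) (6, 2)
```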
29. DATA/MSML 641
• CBOW - simple! Predict target word from context
• Skip-gram - a bit more involved, predict context words from target (reverse CBOW)
word2vec
[Diagram: CBOW uses the context w_i-2, w_i-1, w_i+1, w_i+2 as input to
predict w_i; skip-gram takes w_i as input to predict each context word;
a gensim sketch follows below]
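A sketch of training both variants with gensim (the corpus here is a toy; real training needs much more text):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "chased", "the", "ball"],
    ["the", "cat", "chased", "the", "mouse"],
    ["a", "dog", "barked", "at", "the", "mailman"],
]

# sg=0 -> CBOW (predict target from context); sg=1 -> skip-gram (reverse)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
sgns = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1,
                negative=5)  # 5 negative samples per positive pair

print(sgns.wv["dog"].shape)            # (50,)
print(sgns.wv.similarity("dog", "cat"))
```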
30. DATA/MSML 641
word2vec: skip-grams
● Goal: predict context words for a given
target word
● Optimize feature vectors:
○ Intuition: we want the score of (target, context-word)
pairs to be high, and the score of (target, random-word)
pairs to be low
[Diagram: the target w_i and its feature vector; each of the L context
words, treated as independent, forms a positive pair: (w_i, w_i-2),
(w_i, w_i-1), (w_i, w_i+1), (w_i, w_i+2); randomly sampled words form
negative pairs (w_i, w_rand); the objective is written out below]
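For reference, the skip-gram negative-sampling objective for a single (target w, context c) pair is usually written as below, with k negative samples n_j drawn from a noise distribution P_n(w); the notation follows the standard formulation rather than these slides:

```latex
\log \sigma(\mathbf{v}_c \cdot \mathbf{v}_w)
  + \sum_{j=1}^{k} \mathbb{E}_{n_j \sim P_n(w)}
      \left[ \log \sigma(-\mathbf{v}_{n_j} \cdot \mathbf{v}_w) \right]
```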
31. DATA/MSML 641
Effect of word2vec
● Words that appear in similar contexts get closer-together feature vectors
● Smaller contexts
○ More syntactic since near-words are more likely to be syntactically
related
● Captures structure in semantic spaces (classic man -> king as woman ->
queen example)
● Can extend to document-level work (e.g. doc2vec!)
[Sketch: “king” − “man” + “woman” ≈ “queen” as parallel offsets in the
vector space; a gensim sketch of the analogy follows]
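A quick sketch of the analogy using pretrained vectors through gensim's downloader (glove-wiki-gigaword-50 is one of the standard hosted models; the first call downloads it):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```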