Distributional semantic word representations allow Natural Language Processing systems to extract and model an immense amount of information about a language. The technique maps words into a high-dimensional continuous space using a single-layer neural network, and it has enabled advances across many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests: questions of the form ``If a is to a', then b is to what?'' are answered by composing word vectors and searching the vector space. During training, each word is examined as a member of its context, which is generally taken to be the elements adjacent to it within a sentence. While some work has examined the effect of expanding this definition, the area remains largely unexplored, and no inquiry has been conducted into the specific linguistic competencies of these models or into whether modifying their contexts changes the information they extract. In this paper we propose a thorough analysis of the lexical and grammatical competencies of distributional semantic models. We leverage analogy tests to evaluate the most advanced distributional model across 14 types of linguistic relationships. With this information we then investigate whether modifying the training context produces differences in quality across these categories. Ideally, we will identify training approaches that increase precision in specific linguistic categories, which will let us investigate whether these improvements can be combined, by joining the information used in different training approaches, into a single, improved model.
2. Agenda
• Background
• Syntactic breakdown of model competency
• Using new models to examine competency
improvements
• Analysis and new approaches
3. Word Representation
We need some way to allow NLP systems to leverage
language information
• Tokenization
• Distributional Semantics
4. Distributional Semantics
“You shall know a word by the company it keeps” - Firth
Premise:
• Words that appear together are more similar than
words that do not appear together
Idea:
• We can define a word by analyzing the frequency
with which every word occurs with every other word
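The co-occurrence counting this idea relies on can be sketched in a few lines (a toy illustration, not the deck's actual pipeline; the function name and window size are hypothetical):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of
    each other word: the raw statistic distributional models compress."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

counts = cooccurrence_counts([["the", "goat", "eats"], ["the", "dog", "eats"]])
# "goat" and "dog" share the contexts {"the", "eats"}, so a model
# built on these counts would place them close together.
```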
5. Distributional Models
• Brown Clustering
• Grouping words by the terms most likely to have
come before
• Word Embeddings
• Mapping words to high-dimensional vectors
9. Neural Network Models
• A Neural Network is a model built with the use of
synthetic neurons
• This is an unsupervised machine learning
technique
Neural Network Language Models generally try to
determine the likelihood of a word relative to its
context, or vice versa
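Skip-gram style pair generation makes this concrete: each word is trained to predict the words near it. A minimal sketch (a hypothetical helper, not code from the talk):

```python
def skipgram_pairs(sentence, window=2):
    """Generate (target, context) training pairs, skip-gram style:
    each word is paired with the words within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

pairs = skipgram_pairs(["he", "throws", "the", "frisbee"], window=1)
# [('he', 'throws'), ('throws', 'he'), ('throws', 'the'),
#  ('the', 'throws'), ('the', 'frisbee'), ('frisbee', 'the')]
```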
10. Word2Vec
• By far the most influential modern Neural Network
Language Model
• Significantly more efficient than previous models
11. Vector Offset
vector(a’) - vector(a)
We can extract the relationship between two word
vectors by taking their offset.
vector(goats) - vector(goat) => relationship(plural)
12. Vector Offset
vector(a’) - vector(a)
We can extract the relationship between two word
vectors by taking their offset.
vector(goats) - vector(goat) => relationship(plural)
vector(dog) + relationship(plural) => vector(dogs)
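The offset arithmetic can be illustrated with hand-built toy vectors (the numbers below are invented for illustration; a real Word2Vec space is high-dimensional and only approximately satisfies these identities):

```python
import numpy as np

# Toy 2-D vectors, hand-placed so the singular-to-plural offset is
# constant, as trained spaces tend to roughly exhibit.
vec = {
    "goat":  np.array([1.0, 0.0]),
    "goats": np.array([1.0, 1.0]),
    "dog":   np.array([2.0, 0.1]),
    "dogs":  np.array([2.0, 1.1]),
}

plural = vec["goats"] - vec["goat"]   # relationship(plural)
predicted = vec["dog"] + plural       # => vector(dogs)
print(np.allclose(predicted, vec["dogs"]))  # True
```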
13. Analogies
a is to a’ as b is to b’
Run is to Runs as Walk is to Walks
Amazing is to Amazingly as Great is to Greatly
Atlanta is to Georgia as Tampa is to Florida
16. Analogy Tests
a is to a’ as b is to b’
vector(a’) - vector(a) => relationship(a’:a)
relationship(a’:a) + vector(b) => c
17. Analogy Tests
a is to a’ as b is to b’
vector(a’) - vector(a) => relationship(a’:a)
relationship(a’:a) + vector(b) => c
We search for the vector most similar to c and check
if it represents b’.
18. Analogy Tests
a is to a’ as b is to b’
vector(a’) - vector(a) => relationship(a’:a)
relationship(a’:a) + vector(b) => c
We search for the vector most similar to c and check
if it represents b’.
vector(a’) - vector(a) + vector(b) => c
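This search can be sketched directly: compute c, then return the vocabulary word whose vector is most cosine-similar to it, excluding the query words. The vectors below are hand-built toys, not a trained model:

```python
import numpy as np

def solve_analogy(vec, a, a_prime, b):
    """Return the word closest (by cosine similarity) to
    vector(a') - vector(a) + vector(b), excluding a, a', and b."""
    c = vec[a_prime] - vec[a] + vec[b]
    best, best_sim = None, -2.0
    for word, v in vec.items():
        if word in (a, a_prime, b):
            continue
        sim = np.dot(c, v) / (np.linalg.norm(c) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy vectors hand-placed so the gender offset is constant:
vec = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, 3.0]),
}
print(solve_analogy(vec, "man", "woman", "king"))  # queen
```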
19. Analogy Tests
We can evaluate a model by measuring how many
analogies it answers correctly
24. Analogy Test Approach
• Train EmoryNLP Word2Vec on Wikipedia 2015 and
NYTimes
• We take the most widely used analogy test set and
analyze results by linguistic category
25. Analogy Test Categories
• Lexical
Common Capital Countries
• Grammatical
Distributed Representations of Words and Phrases and their Compositionality - Mikolov et al., 2013.
Tests available at code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
26. Analogy Test Categories
England is to London
• Lexical
Common Capital Countries
• Grammatical
27. Analogy Test Categories
Nigeria is to Abuja
• Lexical
Common Capital Countries, World Capitals
• Grammatical
28. Analogy Test Categories
Los Angeles is to California
• Lexical
Common Capital Countries, World Capitals, City in State
• Grammatical
29. Analogy Test Categories
England is to Pound
• Lexical
Common Capital Countries, World Capitals, City in State, Currency
• Grammatical
30. Analogy Test Categories
King is to Queen
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender
• Grammatical
31. Analogy Test Categories
Hot is to Cold
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite
• Grammatical
32. Analogy Test Categories
American is to America
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
33. Analogy Test Categories
Amazing is to Amazingly
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb
34. Analogy Test Categories
Warm is to Warmer
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative
35. Analogy Test Categories
Loud is to Loudest
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative
36. Analogy Test Categories
Coding is to Code
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle
37. Analogy Test Categories
Dancing is to Danced
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle, Past Tense
38. Analogy Test Categories
City is to Cities
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle, Past Tense, Plural
39. Analogy Test Categories
Describe is to Describes
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle, Past Tense, Plural, Plural
Verb
42. Word2Vec Analysis
Very high grammatical competency, very low lexical
• Exception: Adjective to Adverb, the only grammatical
category built on a derivational rather than inflectional morpheme
Idea: could we modify the training process to
improve weak model attributes?
43. Contexts
Recall:
• The Distributional Hypothesis posits that full co-
occurrence information would capture all
information about a language
It is possible that existing methods aren’t selecting
the best possible information.
44. Contexts
What if we used linguistic structure to select
contexts?
• Dependency Structure
• Predicate Argument Structure
45. Dependency Structure
• Linguistic units are connected by directed links called
dependencies
He opened the bottle with an opener for her.
[Dependency tree: "opened" heads "He", "bottle", "with", and "for";
"the" modifies "bottle"; "opener" depends on "with"; "an" modifies
"opener"; "her" depends on "for"]
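Extracting dependency-based contexts from such a tree is straightforward once the tree is available. A sketch using a hand-written head-to-children dict for the example sentence (a hypothetical encoding; a real system would obtain the tree from a parser):

```python
# Hand-written parse of "He opened the bottle with an opener for her."
children = {
    "opened": ["He", "bottle", "with", "for"],
    "bottle": ["the"],
    "with":   ["opener"],
    "opener": ["an"],
    "for":    ["her"],
}

def dep_context(word, children, depth=1):
    """Collect descendants of `word` up to `depth` links away:
    depth=1 gives dependency children, depth=2 adds grandchildren."""
    if depth == 0:
        return []
    out = []
    for child in children.get(word, []):
        out.append(child)
        out.extend(dep_context(child, children, depth - 1))
    return out

print(dep_context("opened", children, depth=1))
# ['He', 'bottle', 'with', 'for']
```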
46. Predicate Argument
Structure
• Predicate argument structure is concerned with the
arguments accepted by verbs and predicates in a
sentence
• Ex: “open” might accept an “opener”, “thing opened”,
“instrument” and “benefactive”
52. Previous Models
Word2vec
• Only adjacent contexts, very little categorical
analysis
Levy and Goldberg
• Used some basic dependency context information
• Explored only limited context variations and did not
perform a thorough categorical analysis
55. [Dependency tree: "wants" heads "He" and "throw"; "throw" heads "to",
"her", "frisbee", and "in"; "frisbee" heads "the", "bright", and "red";
"in" heads "park"; "park" heads "the"]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw (positions -3 to 8 span the rest of the sentence)
56. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
57-62. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Dep1 Context: to her frisbee in
Dep1: Dependency Children
63. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Dep2 Context: to her the bright red frisbee in park
Dep2: Dependency Grandchildren
64. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Dep2h Context: wants to her the bright red frisbee in park
Dep2h: Dependency Grandchildren, Head
65. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib1 Context: her in
Sib1: Nearest Siblings
66. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib2 Context: to her in
Sib2: Nearest Two Siblings
67. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib1Dep1 Context: her the bright red in
Sib1Dep1: Dependency Children, Nearest Siblings
68. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib2Dep1 Context: to her the bright red in
Sib2Dep1: Dependency Children, Nearest Two Siblings
69. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Sib1Dep2 Context: He to her the bright red frisbee in park
Sib1Dep2: Dependency Grandchildren, Nearest Siblings
70. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
All Siblings Context: to her in
All Siblings: All Nodes Sharing The Same Head
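The sibling contexts above can be computed from token and head arrays. A sketch with a hand-built parse of the example sentence (the head indices are written by hand here; a parser would produce them automatically):

```python
# Tokens of "He wants to throw her the bright red frisbee in the park",
# each with the index of its dependency head (-1 marks the root).
words = ["He", "wants", "to", "throw", "her", "the", "bright",
         "red", "frisbee", "in", "the", "park"]
heads = [1, -1, 3, 1, 3, 8, 8, 8, 3, 3, 11, 9]

def all_siblings(i, heads, words):
    """All Siblings context: every other token sharing token i's head."""
    return [words[j] for j in range(len(words))
            if j != i and heads[j] == heads[i]]

print(all_siblings(words.index("frisbee"), heads, words))
# ['to', 'her', 'in']
```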
71. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: He
Word2Vec Context: He wants to throw her the bright red
Srl1 Context: wants throw
Srl1: Semantic Role Head
Semantic roles: wants (verb) with wanter: He; throw (verb) with thrower: He
82. Context Analysis
Why are these models doing so well lexically?
• Idea: the data outside of the Word2Vec context
window is providing most of the improvement.
83. Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Dep2 Context: to her the bright red frisbee in park
Word2Vec Context: He wants to her the bright red frisbee
Dep2: Dependency Grandchildren
Context Analysis
• Idea: the data outside of the Word2Vec context window is
providing most of the improvement.
84. Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Dep2 Context: to her the bright red frisbee in park
Word2Vec Context: He wants to her the bright red frisbee
Dep2: Dependency Grandchildren
Context Analysis
New Information: the Dep2 words outside the Word2Vec window ("in", "park")
• Idea: the data outside of the Word2Vec context window is
providing most of the improvement.
88. Context Analysis
Some of the biggest improvements are in models that
aren’t getting nearly as much new information.
• This indicates that the benefit is from the context
selection process
95. Ensemble Models
It seems difficult to construct models that are at least
the sum of their parts.
• What would the ideal composite model look like?
• Can we achieve the same result in a different way?
96. Ensemble Models
Idea: Specific models have specific competencies.
We can build a classifier that chooses which model to
use based on the current task.
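A minimal sketch of that selector, with made-up per-category accuracies (the numbers are invented for illustration, not experimental results):

```python
# Hypothetical per-category accuracies for two models.
accuracy = {
    "word2vec": {"plural": 0.90, "capitals": 0.40},
    "dep1":     {"plural": 0.70, "capitals": 0.80},
}

def pick_model(category, accuracy):
    """Choose the model with the best known accuracy on `category`."""
    return max(accuracy, key=lambda m: accuracy[m].get(category, 0.0))

print(pick_model("plural", accuracy))    # word2vec
print(pick_model("capitals", accuracy))  # dep1
```

In practice the accuracy table would be filled in from the categorical analogy-test breakdown described above.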
111. Ensemble Analysis
• With this method we show a theoretical upper
bound for the information learned by these models
• This can be rapidly adapted to other problems,
allowing NLP systems to categorically select
models based on their specific competencies
112. Conclusion
• The meaning extracted from each context word is not
uniform
• Context selection massively impacts linguistic competencies
• Adjacency contexts are uniquely proficient at extracting
inflectional morpheme information
• Dependency contexts are significantly better at learning
lexical information
• Contexts are not compositional
• Categorically selecting models can be incredibly effective
113. Future Work
• Vector space analysis
• Additional models and linguistic context building