Distributional semantic word representations allow Natural Language Processing systems to extract and model an immense amount of information about a language. The technique maps words into a high-dimensional continuous space using a single-layer neural network, and it has enabled advances across many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests: questions of the form ``If a is to a', then b is to what?'' are answered by composing word vectors and searching the vector space. During training, each word is examined as a member of its context, which is generally taken to be the elements adjacent to it within a sentence. While some work has examined the effect of expanding this definition, the area remains largely unexplored, and no inquiry has been conducted into the specific linguistic competencies of these models or into whether modifying their contexts changes the information they extract. In this paper we propose a thorough analysis of the lexical and grammatical competencies of distributional semantic models. We leverage analogy tests to evaluate the most advanced distributional model across 14 types of linguistic relationships. With this information we then investigate whether modifying the training context produces differences in quality across these categories. Ideally, we will identify training approaches that increase precision in specific linguistic categories, which will let us investigate whether these improvements can be combined, by joining the information used in different training approaches, into a single, improved model.
2. Agenda
• Background
• Syntactic breakdown of model competency
• Using new models to examine competency
improvements
• Analysis and new approaches
3. Word Representation
We need some way to allow NLP systems to leverage
language information
• Tokenization
• Distributional Semantics
4. Distributional Semantics
“You shall know a word by the company it keeps” - Firth
Premise:
• Words that appear together are more similar than
words that do not appear together
Idea:
• We can define a word by analyzing the frequency
with which every word occurs with every other word
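The co-occurrence counting this idea relies on can be sketched in a few lines (a toy illustration, not the deck's actual pipeline; the function name and window size are hypothetical):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of
    each other word: the raw statistic distributional models compress."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

counts = cooccurrence_counts([["the", "goat", "eats"], ["the", "dog", "eats"]])
# "goat" and "dog" share the contexts {"the", "eats"}, so a model
# built on these counts would place them close together.
```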
5. Distributional Models
• Brown Clustering
• Grouping words by the terms most likely to have
come before
• Word Embeddings
• Mapping words to high-dimensional vectors
9. Neural Network Models
• A Neural Network is a model built with the use of
synthetic neurons
• This is an unsupervised machine learning
technique
Neural Network Language Models generally try to
determine the likelihood of a word relative to its
context, or vice versa
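Skip-gram style pair generation makes this concrete: each word is trained to predict the words near it. A minimal sketch (a hypothetical helper, not code from the talk):

```python
def skipgram_pairs(sentence, window=2):
    """Generate (target, context) training pairs, skip-gram style:
    each word is paired with the words within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

pairs = skipgram_pairs(["he", "throws", "the", "frisbee"], window=1)
# [('he', 'throws'), ('throws', 'he'), ('throws', 'the'),
#  ('the', 'throws'), ('the', 'frisbee'), ('frisbee', 'the')]
```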
10. Word2Vec
• By far the most influential modern Neural Network
Language Model
• Significantly more efficient than previous models
11. Vector Offset
vector(a’) - vector(a)
We can extract the relationship between two word
vectors by taking their offset.
vector(goats) - vector(goat) => relationship(plural)
12. Vector Offset
vector(a’) - vector(a)
We can extract the relationship between two word
vectors by taking their offset.
vector(goats) - vector(goat) => relationship(plural)
vector(dog) + relationship(plural) => vector(dogs)
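The offset arithmetic can be illustrated with hand-built toy vectors (the numbers below are invented for illustration; a real Word2Vec space is high-dimensional and only approximately satisfies these identities):

```python
import numpy as np

# Toy 2-D vectors, hand-placed so the singular-to-plural offset is
# constant, as trained spaces tend to roughly exhibit.
vec = {
    "goat":  np.array([1.0, 0.0]),
    "goats": np.array([1.0, 1.0]),
    "dog":   np.array([2.0, 0.1]),
    "dogs":  np.array([2.0, 1.1]),
}

plural = vec["goats"] - vec["goat"]   # relationship(plural)
predicted = vec["dog"] + plural       # => vector(dogs)
print(np.allclose(predicted, vec["dogs"]))  # True
```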
13. Analogies
a is to a’ as b is to b’
Run is to Runs as Walk is to Walks
Amazing is to Amazingly as Great is to Greatly
Atlanta is to Georgia as Tampa is to Florida
16. Analogy Tests
a is to a’ as b is to b’
vector(a’) - vector(a) => relationship(a’:a)
relationship(a’:a) + vector(b) => c
17. Analogy Tests
a is to a’ as b is to b’
vector(a’) - vector(a) => relationship(a’:a)
relationship(a’:a) + vector(b) => c
We search for the vector most similar to c and check
if it represents b’.
18. Analogy Tests
a is to a’ as b is to b’
vector(a’) - vector(a) => relationship(a’:a)
relationship(a’:a) + vector(b) => c
We search for the vector most similar to c and check
if it represents b’.
vector(a’) - vector(a) + vector(b) => c
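This search can be sketched directly: compute c, then return the vocabulary word whose vector is most cosine-similar to it, excluding the query words. The vectors below are hand-built toys, not a trained model:

```python
import numpy as np

def solve_analogy(vec, a, a_prime, b):
    """Return the word closest (by cosine similarity) to
    vector(a') - vector(a) + vector(b), excluding a, a', and b."""
    c = vec[a_prime] - vec[a] + vec[b]
    best, best_sim = None, -2.0
    for word, v in vec.items():
        if word in (a, a_prime, b):
            continue
        sim = np.dot(c, v) / (np.linalg.norm(c) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy vectors hand-placed so the gender offset is constant:
vec = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, 3.0]),
}
print(solve_analogy(vec, "man", "woman", "king"))  # queen
```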
19. Analogy Tests
We can evaluate a model by measuring how many
analogies it answers correctly
24. Analogy Test Approach
• Train EmoryNLP Word2Vec on Wikipedia 2015 and
NYTimes
• We take the most widely used analogy test set and
analyze results by linguistic category
25. Analogy Test Categories
• Lexical
Common Capital Countries
• Grammatical
Distributed Representations of Words and Phrases and their Compositionality - Mikolov et al., 2013.
Tests available at code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
26. Analogy Test Categories
England is to London
• Lexical
Common Capital Countries
• Grammatical
27. Analogy Test Categories
Nigeria is to Abuja
• Lexical
Common Capital Countries, World Capitals
• Grammatical
28. Analogy Test Categories
Los Angeles is to California
• Lexical
Common Capital Countries, World Capitals, City in State
• Grammatical
29. Analogy Test Categories
England is to Pound
• Lexical
Common Capital Countries, World Capitals, City in State, Currency
• Grammatical
30. Analogy Test Categories
King is to Queen
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender
• Grammatical
31. Analogy Test Categories
Hot is to Cold
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite
• Grammatical
32. Analogy Test Categories
American is to America
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
33. Analogy Test Categories
Amazing is to Amazingly
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb
34. Analogy Test Categories
Warm is to Warmer
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative
35. Analogy Test Categories
Loud is to Loudest
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative
36. Analogy Test Categories
Coding is to Code
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle
37. Analogy Test Categories
Dancing is to Danced
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle, Past Tense
38. Analogy Test Categories
City is to Cities
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle, Past Tense, Plural
39. Analogy Test Categories
Describe is to Describes
• Lexical
Common Capital Countries, World Capitals, City in State, Currency, Gender, Opposite,
Nationality Adjective
• Grammatical
Adjective to Adverb, Comparative, Superlative, Present Participle, Past Tense, Plural, Plural
Verb
42. Word2Vec Analysis
Very high grammatical competency, very low lexical
• Exception: Adjective to Adverb, the only grammatical
category built on a derivational rather than inflectional morpheme
Idea: could we modify the training process to
improve weak model attributes?
43. Contexts
Recall:
• The Distributional Hypothesis posits that full co-
occurrence information would capture all
information about a language
It is possible that existing methods aren’t selecting
the best possible information.
44. Contexts
What if we used linguistic structure to select
contexts?
• Dependency Structure
• Predicate Argument Structure
45. Dependency Structure
• Linguistic units are connected by directed links called
dependencies
He opened the bottle with an opener for her.
[Dependency tree: "opened" heads "He", "bottle", "with", and "for";
"the" modifies "bottle"; "opener" depends on "with"; "an" modifies
"opener"; "her" depends on "for"]
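Extracting dependency-based contexts from such a tree is straightforward once the tree is available. A sketch using a hand-written head-to-children dict for the example sentence (a hypothetical encoding; a real system would obtain the tree from a parser):

```python
# Hand-written parse of "He opened the bottle with an opener for her."
children = {
    "opened": ["He", "bottle", "with", "for"],
    "bottle": ["the"],
    "with":   ["opener"],
    "opener": ["an"],
    "for":    ["her"],
}

def dep_context(word, children, depth=1):
    """Collect descendants of `word` up to `depth` links away:
    depth=1 gives dependency children, depth=2 adds grandchildren."""
    if depth == 0:
        return []
    out = []
    for child in children.get(word, []):
        out.append(child)
        out.extend(dep_context(child, children, depth - 1))
    return out

print(dep_context("opened", children, depth=1))
# ['He', 'bottle', 'with', 'for']
```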
46. Predicate Argument
Structure
• Predicate argument structure is concerned with the
arguments accepted by verbs and predicates in a
sentence
• Ex: “open” might accept an “opener”, “thing opened”,
“instrument” and “benefactive”
52. Previous Models
Word2vec
• Only adjacent contexts, very little categorical
analysis
Levy and Goldberg
• Used some basic dependency context information
• Explored only limited context variations and did not
perform a thorough categorical analysis
55. [Dependency tree: "wants" heads "He" and "throw"; "throw" heads "to",
"her", "frisbee", and "in"; "frisbee" heads "the", "bright", and "red";
"in" heads "park"; "park" heads "the"]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw (positions -3 to 8 span the rest of the sentence)
56. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
57-62. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Dep1 Context: to her frisbee in
Dep1: Dependency Children
63. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Dep2 Context: to her the bright red frisbee in park
Dep2: Dependency Grandchildren
64. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Dep2h Context: wants to her the bright red frisbee in park
Dep2h: Dependency Grandchildren, Head
65. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib1 Context: her in
Sib1: Nearest Siblings
66. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib2 Context: to her in
Sib2: Nearest Two Siblings
67. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib1Dep1 Context: her the bright red in
Sib1Dep1: Dependency Children, Nearest Siblings
68. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
Sib2Dep1 Context: to her the bright red in
Sib2Dep1: Dependency Children, Nearest Two Siblings
69. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Word2Vec Context: He wants to her the bright red frisbee
Sib1Dep2 Context: He to her the bright red frisbee in park
Sib1Dep2: Dependency Grandchildren, Nearest Siblings
70. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: frisbee
Word2Vec Context: throw her the bright red in the park
All Siblings Context: to her in
All Siblings: All Nodes Sharing The Same Head
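The sibling contexts above can be computed from token and head arrays. A sketch with a hand-built parse of the example sentence (the head indices are written by hand here; a parser would produce them automatically):

```python
# Tokens of "He wants to throw her the bright red frisbee in the park",
# each with the index of its dependency head (-1 marks the root).
words = ["He", "wants", "to", "throw", "her", "the", "bright",
         "red", "frisbee", "in", "the", "park"]
heads = [1, -1, 3, 1, 3, 8, 8, 8, 3, 3, 11, 9]

def all_siblings(i, heads, words):
    """All Siblings context: every other token sharing token i's head."""
    return [words[j] for j in range(len(words))
            if j != i and heads[j] == heads[i]]

print(all_siblings(words.index("frisbee"), heads, words))
# ['to', 'her', 'in']
```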
71. [dependency tree of the sentence, as above]
Sentence: He wants to throw her the bright red frisbee in the park
Training Word: He
Word2Vec Context: He wants to throw her the bright red
Srl1 Context: wants throw
Srl1: Semantic Role Head
Semantic roles: wants (verb) with wanter: He; throw (verb) with thrower: He
82. Context Analysis
Why are these models doing so well lexically?
• Idea: the data outside of the Word2Vec context
window is providing most of the improvement.
83. Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Dep2 Context: to her the bright red frisbee in park
Word2Vec Context: He wants to her the bright red frisbee
Dep2: Dependency Grandchildren
Context Analysis
• Idea: the data outside of the Word2Vec context window is
providing most of the improvement.
84. Sentence: He wants to throw her the bright red frisbee in the park
Training Word: throw
Dep2 Context: to her the bright red frisbee in park
Word2Vec Context: He wants to her the bright red frisbee
Dep2: Dependency Grandchildren
Context Analysis
New Information: the Dep2 words outside the Word2Vec window ("in", "park")
• Idea: the data outside of the Word2Vec context window is
providing most of the improvement.
88. Context Analysis
Some of the biggest improvements are in models that
aren’t getting nearly as much new information.
• This indicates that the benefit is from the context
selection process
95. Ensemble Models
It seems difficult to construct models that are at least
the sum of their parts.
• What would the ideal composite model look like?
• Can we achieve the same result in a different way?
96. Ensemble Models
Idea: Specific models have specific competencies.
We can build a classifier that chooses which model to
use based on the current task.
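A minimal sketch of that selector, with made-up per-category accuracies (the numbers are invented for illustration, not experimental results):

```python
# Hypothetical per-category accuracies for two models.
accuracy = {
    "word2vec": {"plural": 0.90, "capitals": 0.40},
    "dep1":     {"plural": 0.70, "capitals": 0.80},
}

def pick_model(category, accuracy):
    """Choose the model with the best known accuracy on `category`."""
    return max(accuracy, key=lambda m: accuracy[m].get(category, 0.0))

print(pick_model("plural", accuracy))    # word2vec
print(pick_model("capitals", accuracy))  # dep1
```

In practice the accuracy table would be filled in from the categorical analogy-test breakdown described above.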
111. Ensemble Analysis
• With this method we show a theoretical upper
bound for the information learned by these models
• This can be rapidly adapted to other problems,
allowing NLP systems to categorically select
models based on their specific competencies
112. Conclusion
• The meaning extracted from each context word is not
uniform
• Context selection massively impacts linguistic competencies
• Adjacency contexts are uniquely proficient at extracting
inflectional morpheme information
• Dependency contexts are significantly better at learning
lexical information
• Contexts are not compositional
• Categorically selecting models can be incredibly effective
113. Future Work
• Vector space analysis
• Additional models and linguistic context building