Compositional distributional models of meaning (CDMs) aim to unify the two prominent semantic paradigms in natural language: the type-logical compositional approach of formal semantics, and the quantitative perspective of vector space models of meaning. This presentation gives an overview of state-of-the-art research in the field. We review three generic classes of CDMs: vector mixtures, tensor-based models, and deep-learning models.
Compositional Distributional Models of Meaning
1. Compositional Distributional Models of Meaning
Dimitri Kartsaklis
Department of Computer Science
University of Oxford
Hilary Term, 2015
2. In a nutshell
Compositional distributional models of meaning (CDMs) aim
to unify two orthogonal semantic paradigms:
The type-logical compositional approach of formal semantics
The quantitative perspective of vector space models of
meaning
The goal is to represent sentences as points in some
high-dimensional metric space
Useful in every NLP task: sentence similarity, paraphrase
detection, sentiment analysis, machine translation etc.
We review three generic classes of CDMs: vector mixtures,
tensor-based models and deep-learning models
3. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
4. Compositional semantics
The principle of compositionality
The meaning of a complex expression is determined by the
meanings of its parts and the rules used for combining them.
A lexicon:
(1) a. every Dt : λP.λQ.∀x[P(x) → Q(x)]
b. man N : λy.man(y)
c. walks VI : λz.walk(z)
A parse tree, so syntax guides the semantic composition:
[Parse tree: [S [NP [Dt Every] [N man]] [VI walks]]]
NP → Dt N : [[Dt]]([[N]])
S → NP VI : [[NP]]([[VI]])
5. Syntax-to-semantics correspondence
Logical forms of compounds are computed via β-reduction:
Every (Dt): λP.λQ.∀x[P(x) → Q(x)]
man (N): λy.man(y)
walks (VI): λz.walk(z)
Every man (NP): λQ.∀x[man(x) → Q(x)]
Every man walks (S): ∀x[man(x) → walk(x)]
The semantic value of a sentence can be true or false.
An approach known as Montague Grammar (1970)
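As a toy illustration (not part of the original slides), the lexicon above can be modelled with higher-order functions over a small invented domain; ordinary function application then mirrors β-reduction. Domain and word extensions are made up for the example:

# Minimal sketch: Montague-style composition over an invented toy domain.
domain = {"bob", "carl", "rex"}
man   = lambda y: y in {"bob", "carl"}   # hypothetical extension of 'man'
walk  = lambda z: z in {"bob"}           # hypothetical extension of 'walks'
every = lambda P: lambda Q: all(Q(x) for x in domain if P(x))

every_man = every(man)        # NP: λQ.∀x[man(x) → Q(x)]
sentence  = every_man(walk)   # S:  ∀x[man(x) → walk(x)]
print(sentence)               # False: carl is a man who does not walk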
6. Distributional models of meaning
Distributional hypothesis
The meaning of a word is determined by its context [Harris, 1958].
8. Words as vectors
A word is a vector of co-occurrence statistics with every other
word in the vocabulary:
[Figure: example co-occurrence vector for 'cat' over context words such as milk, cute, dog, bank and money, with counts 12, 8, 5, 0, 1]
Semantic relatedness is usually based on cosine similarity:
sim(\vec{v}, \vec{u}) = \cos(\vec{v}, \vec{u}) = \frac{\vec{v} \cdot \vec{u}}{\|\vec{v}\|\,\|\vec{u}\|}    (1)
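A quick sketch of Eq. (1) with numpy; the vectors and basis words are illustrative, not taken from a real corpus:

import numpy as np

# Illustrative co-occurrence counts over the basis (milk, cute, dog, bank, money).
cat = np.array([12.0, 8.0, 5.0, 0.0, 1.0])
dog = np.array([10.0, 7.0, 4.0, 1.0, 0.0])

def cosine(v, u):
    # Eq. (1): sim(v, u) = (v · u) / (||v|| ||u||)
    return v @ u / (np.linalg.norm(v) * np.linalg.norm(u))

print(cosine(cat, dog))   # close to 1 for distributionally similar words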
9. A real vector space
[Figure: two-dimensional projection of a real vector space, where words cluster by topic: animals (cat, dog, pet, kitten, mouse, horse, lion, tiger, elephant, parrot, eagle, pigeon, raven, seagull), finance (money, stock, bank, finance, currency, profit, credit, market, business, broker, account), computing (laptop, data, processor, technology, ram, megabyte, keyboard, monitor, intel, motherboard, microchip, cpu), and sport (game, football, score, championship, player, league, team)]
10. The necessity for a unified model
Montague grammar is compositional but not quantitative
Distributional models of meaning are quantitative but not
compositional
Note that the distributional hypothesis does not apply to
phrases or sentences: there is not enough data.
Even if we had an infinitely large corpus, what would the
context of a sentence be?
11. The role of compositionality
Compositional distributional models
We can synthetically produce a sentence vector by composing the
vectors of the words in that sentence.
\vec{s} = f(\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_n)    (2)
Three generic classes of CDMs:
Vector mixture models [Mitchell and Lapata, 2008]
Tensor-based models
[Coecke et al., 2010, Baroni and Zamparelli, 2010]
Deep learning models
[Socher et al., 2012, Kalchbrenner et al., 2014].
12. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
13. Element-wise vector composition
We can combine two vectors by working element-wise
[Mitchell and Lapata, 2008]:
\overrightarrow{w_1w_2} = \alpha\vec{w}_1 + \beta\vec{w}_2 = \sum_i (\alpha c_i^{w_1} + \beta c_i^{w_2})\,\vec{n}_i    (3)

\overrightarrow{w_1w_2} = \vec{w}_1 \odot \vec{w}_2 = \sum_i c_i^{w_1} c_i^{w_2}\,\vec{n}_i    (4)
An element-wise “mixture” of the input elements:
[Diagram: vector mixture composition contrasted with tensor-based composition]
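Both operations are one-liners in practice; a minimal numpy sketch of Eqs. (3) and (4), with arbitrary example vectors and mixing weights:

import numpy as np

w1, w2 = np.array([0.2, 0.5, 0.1]), np.array([0.4, 0.0, 0.3])
alpha, beta = 0.7, 0.3                      # arbitrary mixing weights

additive       = alpha * w1 + beta * w2     # Eq. (3): weighted addition
multiplicative = w1 * w2                    # Eq. (4): element-wise product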
14. Vector mixtures: Summary
Distinguishing feature:
All words contribute equally to the final result.
PROS:
Trivial to implement
As computationally light-weight as it gets
Surprisingly effective in practice
CONS:
A bag-of-words approach
Does not distinguish between the type-logical identities of the
words
By definition, sentence space = word space
15. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
16. Relational words as functions
In a vector mixture model, an adjective is of the same order as
the noun it modifies, and both contribute equally to the result.
One step further: Relational words are multi-linear maps
(tensors of various orders) that can be applied to one or more
arguments (vectors).
[Diagram: vector mixture composition contrasted with tensor-based composition]
Formalized in the context of compact closed categories by
Coecke, Sadrzadeh and Clark.
17. A categorical framework for composition
Categorical compositional distributional semantics
Coecke, Sadrzadeh and Clark (2010): Syntax and vector space
semantics can be structurally homomorphic.
Take CF, the free compact closed category over a pregroup
grammar, as the structure accommodating the syntax
Take FVect, the category of finite-dimensional vector spaces
and linear maps, as the semantic counterpart of CF.
CF and FVect share a compact closed structure, so a strongly
monoidal functor Q can be defined such that:
Q : CF → FVect (5)
The meaning of a sentence w_1 w_2 ... w_n with type reduction α
is given as:
Q(\alpha)(\vec{w}_1 \otimes \vec{w}_2 \otimes \ldots \otimes \vec{w}_n)    (6)
18. Categorical composition: Example
[Parse tree: [S [NP [Adj happy] [N kids]] [VP [V play] [N games]]]]

happy kids play games
n·n^l    n    n^r·s·n^l    n
Grammar rules follow the inequalities:
p^l · p ≤ 1 ≤ p · p^l   and   p · p^r ≤ 1 ≤ p^r · p    (7)
Derivation becomes:
(n · n^l) · n · (n^r · s · n^l) · n = n · (n^l · n) · (n^r · s · n^l) · n
≤ n · 1 · (n^r · s · n^l) · n = n · (n^r · s · n^l) · n
= (n · n^r) · s · (n^l · n) ≤ 1 · s · 1 ≤ s
19. Categorical composition: From syntax to semantics
Each atomic grammar type is mapped to a vector space:
Q(n) = N Q(s) = S
Due to strong monoidality, complex types are mapped to
tensor product of vector spaces:
Q(n · n^l) = Q(n) ⊗ Q(n^l) = N ⊗ N
Q(n^r · s · n^l) = Q(n^r) ⊗ Q(s) ⊗ Q(n^l) = N ⊗ S ⊗ N
Finally, each morphism in CF is mapped to a linear map in
FVect.
20. A multi-linear model
The grammatical type of a word defines the vector space in
which the word lives.
A noun lives in a basic vector space N.
Adjectives are linear maps N → N, i.e. elements of N ⊗ N.
Intransitive verbs are linear maps N → S, i.e. elements of N ⊗ S.
Transitive verbs are bi-linear maps N ⊗ N → S, i.e. elements
of N ⊗ S ⊗ N.
And so on.
The composition operation is tensor contraction, based on
inner product.
[Diagram: happy (N ⊗ N^l), kids (N), play (N^r ⊗ S ⊗ N^l), games (N), composed by tensor contraction]
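A sketch of this multi-linear view with numpy; dimensions and tensors are random stand-ins, and the contraction is written with einsum:

import numpy as np

N, S = 4, 3                                  # toy dimensions for noun and sentence spaces
kids, games = np.random.rand(N), np.random.rand(N)
happy = np.random.rand(N, N)                 # adjective: element of N ⊗ N (a linear map N → N)
play  = np.random.rand(N, S, N)              # transitive verb: element of N ⊗ S ⊗ N

happy_kids = happy @ kids                    # adjective applied to its noun
sentence = np.einsum('i,isj,j->s', happy_kids, play, games)   # contract subject and object
print(sentence.shape)                        # (3,): a vector in the sentence space S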
21. Frobenius algebras in language
Kartsaklis, Sadrzadeh, Pulman and Coecke (2013): Use the
co-multiplication of a Frobenius algebra to model verb tensors.
[Diagram: a verb tensor for a subject-verb-object sentence, built with Frobenius copying]
A combination of element-wise and categorical composition:
\overrightarrow{svo} = (\vec{s} \times \overline{v}) \odot \vec{o}
Useful e.g. in intonation:
Who does John like?  John likes Mary
(\overrightarrow{john} \times \overline{likes}) \odot \overrightarrow{mary}
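In vector terms this combination can be sketched as below; the verb is reduced to a matrix, × is matrix-vector multiplication, ⊙ is the element-wise product, and all values are random stand-ins:

import numpy as np

N = 4
john, mary = np.random.rand(N), np.random.rand(N)
likes = np.random.rand(N, N)                 # verb as a matrix (Frobenius-reduced tensor)

svo = (john @ likes) * mary                  # (subject × verb) ⊙ object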
Sadrzadeh, Clark and Coecke (2013): Relative pronouns are
modelled in terms of Frobenius copying and deleting.
22. Composition and lexical ambiguity
Kartsaklis and Sadrzadeh (2013): Explicitly handling lexical
ambiguity improves the performance of tensor-based models
How can we model ambiguity in categorical composition?
Words as mixed states
Piedeleu, Kartsaklis, Coecke and Sadrzadeh (2015): Ambiguous
words are represented as mixed states in CPM(FHilb).
\rho(w_t) = \sum_i p_i\, |w_t^i\rangle\langle w_t^i|    (8)
Von Neumann entropy shows how ambiguity evolves from
words to compounds
Disambiguation = purification: Entropy of ‘vessel’ is 0.25,
but entropy of ‘vessel that sails’ is 0.01.
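A sketch of Eq. (8) and of the von Neumann entropy computation; the sense vectors and their probabilities are made up for illustration:

import numpy as np

# Two hypothetical senses of an ambiguous word, as unit vectors, with prior probabilities.
senses = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]
probs  = [0.5, 0.5]

rho = sum(p * np.outer(w, w) for p, w in zip(probs, senses))   # Eq. (8): density matrix

eigvals = np.linalg.eigvalsh(rho)
eigvals = eigvals[eigvals > 1e-12]
entropy = -np.sum(eigvals * np.log2(eigvals))   # 0 for a pure state, higher for mixed states
print(entropy)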
23. Tensor-based models: Summary
Distinguishing feature
Relational words are functions acting on arguments.
PROS:
Highly justified from a linguistic perspective
More powerful and robust than vector mixtures
A glass box approach that allows theoretical reasoning
CONS:
Every logical and functional word must be assigned an appropriate tensor representation (e.g. how do you represent quantifiers?)
Space complexity problems for functions of higher arity (e.g. a
ditransitive verb is a tensor of order 4)
Limited to linearity
24. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
25. A simple neural net
A feed-forward neural network with one hidden layer:
h_1 = f(w_{11}x_1 + w_{21}x_2 + w_{31}x_3 + w_{41}x_4 + w_{51}x_5 + b_1)
h_2 = f(w_{12}x_1 + w_{22}x_2 + w_{32}x_3 + w_{42}x_4 + w_{52}x_5 + b_2)
h_3 = f(w_{13}x_1 + w_{23}x_2 + w_{33}x_3 + w_{43}x_4 + w_{53}x_5 + b_3)
or \vec{h} = f(W^{(1)}\vec{x} + \vec{b}^{(1)})
Similarly: \vec{y} = f(W^{(2)}\vec{h} + \vec{b}^{(2)})
Note that W^{(1)} ∈ R^{3×5} and W^{(2)} ∈ R^{2×3}
f is a non-linear function such as tanh or sigmoid
(take f = Id and you have a tensor-based model)
Weights are optimized against some objective function
A universal approximator.
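For reference, the forward pass above in a few lines of numpy (random weights, tanh as the non-linearity):

import numpy as np

x = np.random.rand(5)                        # input vector
W1, b1 = np.random.randn(3, 5), np.zeros(3)  # W(1) ∈ R^{3×5}
W2, b2 = np.random.randn(2, 3), np.zeros(2)  # W(2) ∈ R^{2×3}

h = np.tanh(W1 @ x + b1)                     # hidden layer
y = np.tanh(W2 @ h + b2)                     # output layer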
26. Recursive neural networks
Pollack (1990), Socher et al. (2012): Recursive neural nets
for composition in natural language.
f is an element-wise non-linear function such as tanh or
sigmoid.
A supervised approach.
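A minimal sketch of recursive composition over a binary parse, in the simplest single-weight-matrix variant; word vectors and weights are random stand-ins:

import numpy as np

d = 4
W, b = np.random.randn(d, 2 * d), np.zeros(d)     # shared composition weights

def compose(left, right):
    # Parent vector from two child vectors: p = f(W [c1; c2] + b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

happy, kids, play, games = (np.random.rand(d) for _ in range(4))
sentence = compose(compose(happy, kids), compose(play, games))   # ((happy kids) (play games))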
27. Unsupervised learning with NNs
How can we train a NN in an unsupervised manner?
Create an auto-encoder: Train the network to reproduce its
input via an expansion layer.
Use the output of the hidden layer as a compressed version of
the input [Socher et al., 2011]
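A sketch of a single auto-encoding step with untrained weights; in practice the weights are optimised to minimise the reconstruction error:

import numpy as np

d_in, d_hid = 8, 3
W_enc, b_enc = np.random.randn(d_hid, d_in), np.zeros(d_hid)
W_dec, b_dec = np.random.randn(d_in, d_hid), np.zeros(d_in)

x = np.random.rand(d_in)
code  = np.tanh(W_enc @ x + b_enc)       # hidden layer: compressed representation
x_hat = W_dec @ code + b_dec             # expansion layer tries to reproduce the input
loss  = np.sum((x - x_hat) ** 2)         # reconstruction error to be minimised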
28. Convolutional NNs
Originated in pattern recognition [Fukushima, 1980]
Small filters apply on every position of the input vector:
Capable of extracting fine-grained local features independently
of their exact position in the input
Features become increasingly global as more layers are stacked
Each convolutional layer is usually followed by a pooling layer
Top layer is fully connected, usually a soft-max classifier
Application to language: Collobert and Weston (2008)
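A sketch of a single small filter applied at every position of a sentence matrix, followed by max pooling; dimensions are arbitrary, and real models use many filters and stacked layers:

import numpy as np

d, n, width = 4, 6, 3                       # embedding size, sentence length, filter width
sentence = np.random.rand(d, n)             # one column per word
filt = np.random.randn(d, width)            # one convolutional filter

# Apply the filter at every position of the input (narrow convolution).
feature_map = np.array([np.sum(filt * sentence[:, i:i + width])
                        for i in range(n - width + 1)])
pooled = feature_map.max()                  # max pooling over the feature map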
29. DCNNs for modelling sentences
Kalchbrenner, Grefenstette
and Blunsom (2014): A deep
architecture using dynamic
k-max pooling
Syntactic structure is
induced automatically:
(Figures reused with permission)
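The k-max pooling operation itself is simple: keep the k largest values in each feature row while preserving their original order. A sketch is below; the "dynamic" part, not shown, varies k with sentence length and network depth:

import numpy as np

def k_max_pooling(feature_map, k):
    # feature_map: (num_features, sentence_length); keep the top-k per row, in order.
    idx = np.argsort(feature_map, axis=1)[:, -k:]      # indices of the k largest values
    idx = np.sort(idx, axis=1)                         # restore left-to-right order
    return np.take_along_axis(feature_map, idx, axis=1)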
30. Beyond sentence level
Denil, Demiraj, Kalchbrenner, Blunsom, and De Freitas (2014): An
additional convolutional layer can provide document vectors:
(Figure reused with permission)
31. Deep learning models: Summary
Distinguishing feature
Drastic transformation of the sentence space.
PROS:
Non-linearity and layered approach allow the simulation of a
very wide range of functions
Word vectors are parameters of the model, optimized during
training
State-of-the-art results in a number of NLP tasks
CONS:
Requires expensive training
Difficult to discover the right configuration
A black-box approach: not easy to correlate inner workings
with output
32. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
33. A hierarchy of CDMs
Putting everything together:
Note that “power” is only theoretical; actual performance
depends on task, underlying assumptions, and configuration
34. Open issues and future work
No convincing solution for logical connectives, negation,
quantifiers and so on.
Functional words, such as prepositions and relative pronouns,
are also a problem.
Sentence space is usually identified with word space. This is
convenient, but obviously a simplification
Solutions depend on the specific CDM class—e.g. not much
to do in a vector mixture setting
Important: How can we make NNs more linguistically aware?
[Hermann and Blunsom, 2013]
35. Conclusion
CDMs provide quantitative semantic representations for
sentences (or even documents)
Element-wise operations on word vectors constitute an easy
and reasonably effective way to get sentence vectors
Categorical compositional distributional models allow
reasoning on a theoretical level—a glass box approach
Deep learning models are extremely powerful and effective;
still a black-box approach: it is not easy to explain why one
configuration works and another does not.
Convolutional networks seem to constitute a very promising
solution for sentential/discourse semantics
36. References I
Baroni, M. and Zamparelli, R. (2010).
Nouns are Vectors, Adjectives are Matrices.
In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Coecke, B., Sadrzadeh, M., and Clark, S. (2010).
Mathematical Foundations for a Compositional Distributional Model of Meaning. Lambek Festschrift.
Linguistic Analysis, 36:345–384.
Collobert, R. and Weston, J. (2008).
A unified architecture for natural language processing: Deep neural networks with multitask learning.
In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM.
Denil, M., Demiraj, A., Kalchbrenner, N., Blunsom, P., and de Freitas, N. (2014).
Modelling, visualising and summarising documents with a single convolutional neural network.
arXiv preprint arXiv:1406.3830.
Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014).
A convolutional neural network for modelling sentences.
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 655–665, Baltimore, Maryland. Association for Computational Linguistics.
Kartsaklis, D. and Sadrzadeh, M. (2013).
Prior disambiguation of word tensors for constructing sentence vectors.
In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages
1590–1601, Seattle, Washington, USA. Association for Computational Linguistics.
Kartsaklis, D., Sadrzadeh, M., Pulman, S., and Coecke, B. (2014).
Reasoning about meaning in natural language with compact closed categories and Frobenius algebras.
arXiv preprint arXiv:1401.5980.
37. References II
Lambek, J. (2008).
From Word to Sentence.
Polimetrica, Milan.
Mitchell, J. and Lapata, M. (2008).
Vector-based Models of Semantic Composition.
In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 236–244.
Montague, R. (1970a).
English as a formal language.
In Linguaggi nella Società e nella Tecnica, pages 189–224. Edizioni di Comunità, Milan.
Montague, R. (1970b).
Universal grammar.
Theoria, 36:373–398.
Piedeleu, R., Kartsaklis, D., Coecke, B., and Sadrzadeh, M. (2015).
Open System Categorical Quantum Semantics in Natural Language Processing.
arXiv preprint arXiv:1502.00831.
Pollack, J. B. (1990).
Recursive distributed representations.
Artificial Intelligence, 46(1):77–105.
Sadrzadeh, M., Clark, S., and Coecke, B. (2013).
The Frobenius anatomy of word meanings I: subject and object relative pronouns.
Journal of Logic and Computation, Advance Access.
38. References III
Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011).
Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection.
Advances in Neural Information Processing Systems, 24.
Socher, R., Huval, B., Manning, C., and Ng, A. (2012).
Semantic compositionality through recursive matrix-vector spaces.
In Conference on Empirical Methods in Natural Language Processing 2012.