Compositional distributional models of meaning (CDMs) aim to unify the two prominent semantic paradigms in natural language: the type-logical compositional approach of formal semantics, and the quantitative perspective of vector space models of meaning. This presentation gives an overview of state-of-the-art research in the field. We review three generic classes of CDMs: vector mixtures, tensor-based models, and deep-learning models.
Compositional Distributional Models of Meaning
1. Compositional Distributional Models of Meaning
Dimitri Kartsaklis
Department of Computer Science
University of Oxford
Hilary Term, 2015
2. In a nutshell
Compositional distributional models of meaning (CDMs) aim
to unify two orthogonal semantic paradigms:
The type-logical compositional approach of formal semantics
The quantitative perspective of vector space models of
meaning
The goal is to represent sentences as points in some
high-dimensional metric space
Useful in every NLP task: sentence similarity, paraphrase
detection, sentiment analysis, machine translation etc.
We review three generic classes of CDMs: vector mixtures,
tensor-based models and deep-learning models
3. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
4. Compositional semantics
The principle of compositionality
The meaning of a complex expression is determined by the
meanings of its parts and the rules used for combining them.
A lexicon:
(1) a. every Dt : λP.λQ.∀x[P(x) → Q(x)]
b. man N : λy.man(y)
c. walks VI : λz.walk(z)
A parse tree, so syntax guides the semantic composition:
[Parse tree: [S [NP [Dt Every] [N man]] [VI walks]]]
NP → Dt N : [[Dt]]([[N]])
S → NP VI : [[NP]]([[VI]])
5. Syntax-to-semantics correspondence
Logical forms of compounds are computed via β-reduction:
Every (Dt): λP.λQ.∀x[P(x) → Q(x)]
man (N): λy.man(y)
walks (VI): λz.walk(z)
Every man (NP): λQ.∀x[man(x) → Q(x)]
Every man walks (S): ∀x[man(x) → walk(x)]
The semantic value of a sentence can be true or false.
An approach known as Montague Grammar (1970)
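As a toy illustration (not part of the original slides), the lexicon above can be modelled with higher-order functions over a small invented domain; ordinary function application then mirrors β-reduction. Domain and word extensions are made up for the example:

# Minimal sketch: Montague-style composition over an invented toy domain.
domain = {"bob", "carl", "rex"}
man   = lambda y: y in {"bob", "carl"}   # hypothetical extension of 'man'
walk  = lambda z: z in {"bob"}           # hypothetical extension of 'walks'
every = lambda P: lambda Q: all(Q(x) for x in domain if P(x))

every_man = every(man)        # NP: λQ.∀x[man(x) → Q(x)]
sentence  = every_man(walk)   # S:  ∀x[man(x) → walk(x)]
print(sentence)               # False: carl is a man who does not walk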
6. Distributional models of meaning
Distributional hypothesis
The meaning of a word is determined by its context [Harris, 1958].
8. Words as vectors
A word is a vector of co-occurrence statistics with every other
word in the vocabulary:
[Figure: example co-occurrence vector for 'cat' over context words such as milk, cute, dog, bank and money, with counts 12, 8, 5, 0, 1]
Semantic relatedness is usually based on cosine similarity:
sim(\vec{v}, \vec{u}) = \cos(\vec{v}, \vec{u}) = \frac{\vec{v} \cdot \vec{u}}{\|\vec{v}\|\,\|\vec{u}\|}    (1)
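A quick sketch of Eq. (1) with numpy; the vectors and basis words are illustrative, not taken from a real corpus:

import numpy as np

# Illustrative co-occurrence counts over the basis (milk, cute, dog, bank, money).
cat = np.array([12.0, 8.0, 5.0, 0.0, 1.0])
dog = np.array([10.0, 7.0, 4.0, 1.0, 0.0])

def cosine(v, u):
    # Eq. (1): sim(v, u) = (v · u) / (||v|| ||u||)
    return v @ u / (np.linalg.norm(v) * np.linalg.norm(u))

print(cosine(cat, dog))   # close to 1 for distributionally similar words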
9. A real vector space
[Figure: two-dimensional projection of a real vector space, where words cluster by topic: animals (cat, dog, pet, kitten, mouse, horse, lion, tiger, elephant, parrot, eagle, pigeon, raven, seagull), finance (money, stock, bank, finance, currency, profit, credit, market, business, broker, account), computing (laptop, data, processor, technology, ram, megabyte, keyboard, monitor, intel, motherboard, microchip, cpu), and sport (game, football, score, championship, player, league, team)]
10. The necessity for a unified model
Montague grammar is compositional but not quantitative
Distributional models of meaning are quantitative but not
compositional
Note that the distributional hypothesis does not apply to
phrases or sentences: there is not enough data.
Even if we had an infinitely large corpus, what would the
context of a sentence be?
11. The role of compositionality
Compositional distributional models
We can synthetically produce a sentence vector by composing the
vectors of the words in that sentence.
\vec{s} = f(\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_n)    (2)
Three generic classes of CDMs:
Vector mixture models [Mitchell and Lapata, 2008]
Tensor-based models
[Coecke et al., 2010, Baroni and Zamparelli, 2010]
Deep learning models
[Socher et al., 2012, Kalchbrenner et al., 2014].
12. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
13. Element-wise vector composition
We can combine two vectors by working element-wise
[Mitchell and Lapata, 2008]:
\overrightarrow{w_1w_2} = \alpha\vec{w}_1 + \beta\vec{w}_2 = \sum_i (\alpha c_i^{w_1} + \beta c_i^{w_2})\,\vec{n}_i    (3)

\overrightarrow{w_1w_2} = \vec{w}_1 \odot \vec{w}_2 = \sum_i c_i^{w_1} c_i^{w_2}\,\vec{n}_i    (4)
An element-wise “mixture” of the input elements:
[Diagram: vector mixture composition contrasted with tensor-based composition]
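Both operations are one-liners in practice; a minimal numpy sketch of Eqs. (3) and (4), with arbitrary example vectors and mixing weights:

import numpy as np

w1, w2 = np.array([0.2, 0.5, 0.1]), np.array([0.4, 0.0, 0.3])
alpha, beta = 0.7, 0.3                      # arbitrary mixing weights

additive       = alpha * w1 + beta * w2     # Eq. (3): weighted addition
multiplicative = w1 * w2                    # Eq. (4): element-wise product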
14. Vector mixtures: Summary
Distinguishing feature:
All words contribute equally to the final result.
PROS:
Trivial to implement
As computationally light-weight as it gets
Surprisingly effective in practice
CONS:
A bag-of-words approach
Does not distinguish between the type-logical identities of the
words
By definition, sentence space = word space
15. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
16. Relational words as functions
In a vector mixture model, an adjective is of the same order as
the noun it modifies, and both contribute equally to the result.
One step further: Relational words are multi-linear maps
(tensors of various orders) that can be applied to one or more
arguments (vectors).
[Diagram: vector mixture composition contrasted with tensor-based composition]
Formalized in the context of compact closed categories by
Coecke, Sadrzadeh and Clark.
17. A categorical framework for composition
Categorical compositional distributional semantics
Coecke, Sadrzadeh and Clark (2010): Syntax and vector space
semantics can be structurally homomorphic.
Take CF, the free compact closed category over a pregroup
grammar, as the structure accommodating the syntax
Take FVect, the category of finite-dimensional vector spaces
and linear maps, as the semantic counterpart of CF.
CF and FVect share a compact closed structure, so a strongly
monoidal functor Q can be defined such that:
Q : CF → FVect (5)
The meaning of a sentence w_1 w_2 ... w_n with type reduction α
is given as:
Q(\alpha)(\vec{w}_1 \otimes \vec{w}_2 \otimes \ldots \otimes \vec{w}_n)    (6)
18. Categorical composition: Example
[Parse tree: [S [NP [Adj happy] [N kids]] [VP [V play] [N games]]]]

happy kids play games
n·n^l    n    n^r·s·n^l    n
Grammar rules follow the inequalities:
p^l · p ≤ 1 ≤ p · p^l   and   p · p^r ≤ 1 ≤ p^r · p    (7)
Derivation becomes:
(n · n^l) · n · (n^r · s · n^l) · n = n · (n^l · n) · (n^r · s · n^l) · n
≤ n · 1 · (n^r · s · n^l) · n = n · (n^r · s · n^l) · n
= (n · n^r) · s · (n^l · n) ≤ 1 · s · 1 ≤ s
19. Categorical composition: From syntax to semantics
Each atomic grammar type is mapped to a vector space:
Q(n) = N Q(s) = S
Due to strong monoidality, complex types are mapped to
tensor product of vector spaces:
Q(n · n^l) = Q(n) ⊗ Q(n^l) = N ⊗ N
Q(n^r · s · n^l) = Q(n^r) ⊗ Q(s) ⊗ Q(n^l) = N ⊗ S ⊗ N
Finally, each morphism in CF is mapped to a linear map in
FVect.
20. A multi-linear model
The grammatical type of a word defines the vector space in
which the word lives.
A noun lives in a basic vector space N.
Adjectives are linear maps N → N, i.e. elements of N ⊗ N.
Intransitive verbs are linear maps N → S, i.e. elements of N ⊗ S.
Transitive verbs are bi-linear maps N ⊗ N → S, i.e. elements
of N ⊗ S ⊗ N.
And so on.
The composition operation is tensor contraction, based on
inner product.
[Diagram: happy (N ⊗ N^l), kids (N), play (N^r ⊗ S ⊗ N^l), games (N), composed by tensor contraction]
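A sketch of this multi-linear view with numpy; dimensions and tensors are random stand-ins, and the contraction is written with einsum:

import numpy as np

N, S = 4, 3                                  # toy dimensions for noun and sentence spaces
kids, games = np.random.rand(N), np.random.rand(N)
happy = np.random.rand(N, N)                 # adjective: element of N ⊗ N (a linear map N → N)
play  = np.random.rand(N, S, N)              # transitive verb: element of N ⊗ S ⊗ N

happy_kids = happy @ kids                    # adjective applied to its noun
sentence = np.einsum('i,isj,j->s', happy_kids, play, games)   # contract subject and object
print(sentence.shape)                        # (3,): a vector in the sentence space S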
21. Frobenius algebras in language
Kartsaklis, Sadrzadeh, Pulman and Coecke (2013): Use the
co-multiplication of a Frobenius algebra to model verb tensors.
[Diagram: a verb tensor for a subject-verb-object sentence, built with Frobenius copying]
A combination of element-wise and categorical composition:
\overrightarrow{svo} = (\vec{s} \times \overline{v}) \odot \vec{o}
Useful e.g. in intonation:
Who does John like?  John likes Mary
(\overrightarrow{john} \times \overline{likes}) \odot \overrightarrow{mary}
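In vector terms this combination can be sketched as below; the verb is reduced to a matrix, × is matrix-vector multiplication, ⊙ is the element-wise product, and all values are random stand-ins:

import numpy as np

N = 4
john, mary = np.random.rand(N), np.random.rand(N)
likes = np.random.rand(N, N)                 # verb as a matrix (Frobenius-reduced tensor)

svo = (john @ likes) * mary                  # (subject × verb) ⊙ object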
Sadrzadeh, Clark and Coecke (2013): Relative pronouns are
modelled in terms of Frobenius copying and deleting.
22. Composition and lexical ambiguity
Kartsaklis and Sadrzadeh (2013): Explicitly handling lexical
ambiguity improves the performance of tensor-based models
How can we model ambiguity in categorical composition?
Words as mixed states
Piedeleu, Kartsaklis, Coecke and Sadrzadeh (2015): Ambiguous
words are represented as mixed states in CPM(FHilb).
\rho(w_t) = \sum_i p_i\, |w_t^i\rangle\langle w_t^i|    (8)
Von Neumann entropy shows how ambiguity evolves from
words to compounds
Disambiguation = purification: Entropy of ‘vessel’ is 0.25,
but entropy of ‘vessel that sails’ is 0.01.
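A sketch of Eq. (8) and of the von Neumann entropy computation; the sense vectors and their probabilities are made up for illustration:

import numpy as np

# Two hypothetical senses of an ambiguous word, as unit vectors, with prior probabilities.
senses = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]
probs  = [0.5, 0.5]

rho = sum(p * np.outer(w, w) for p, w in zip(probs, senses))   # Eq. (8): density matrix

eigvals = np.linalg.eigvalsh(rho)
eigvals = eigvals[eigvals > 1e-12]
entropy = -np.sum(eigvals * np.log2(eigvals))   # 0 for a pure state, higher for mixed states
print(entropy)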
23. Tensor-based models: Summary
Distinguishing feature
Relational words are functions acting on arguments.
PROS:
Highly justified from a linguistic perspective
More powerful and robust than vector mixtures
A glass box approach that allows theoretical reasoning
CONS:
Every logical and functional word must be assigned an appropriate tensor representation (e.g. how do you represent quantifiers?)
Space complexity problems for functions of higher arity (e.g. a
ditransitive verb is a tensor of order 4)
Limited to linearity
24. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
25. A simple neural net
A feed-forward neural network with one hidden layer:
h_1 = f(w_{11}x_1 + w_{21}x_2 + w_{31}x_3 + w_{41}x_4 + w_{51}x_5 + b_1)
h_2 = f(w_{12}x_1 + w_{22}x_2 + w_{32}x_3 + w_{42}x_4 + w_{52}x_5 + b_2)
h_3 = f(w_{13}x_1 + w_{23}x_2 + w_{33}x_3 + w_{43}x_4 + w_{53}x_5 + b_3)
or \vec{h} = f(W^{(1)}\vec{x} + \vec{b}^{(1)})
Similarly: \vec{y} = f(W^{(2)}\vec{h} + \vec{b}^{(2)})
Note that W^{(1)} ∈ R^{3×5} and W^{(2)} ∈ R^{2×3}
f is a non-linear function such as tanh or sigmoid
(take f = Id and you have a tensor-based model)
Weights are optimized against some objective function
A universal approximator.
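For reference, the forward pass above in a few lines of numpy (random weights, tanh as the non-linearity):

import numpy as np

x = np.random.rand(5)                        # input vector
W1, b1 = np.random.randn(3, 5), np.zeros(3)  # W(1) ∈ R^{3×5}
W2, b2 = np.random.randn(2, 3), np.zeros(2)  # W(2) ∈ R^{2×3}

h = np.tanh(W1 @ x + b1)                     # hidden layer
y = np.tanh(W2 @ h + b2)                     # output layer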
26. Recursive neural networks
Pollack (1990), Socher et al. (2012): Recursive neural nets
for composition in natural language.
f is an element-wise non-linear function such as tanh or
sigmoid.
A supervised approach.
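A minimal sketch of recursive composition over a binary parse, in the simplest single-weight-matrix variant; word vectors and weights are random stand-ins:

import numpy as np

d = 4
W, b = np.random.randn(d, 2 * d), np.zeros(d)     # shared composition weights

def compose(left, right):
    # Parent vector from two child vectors: p = f(W [c1; c2] + b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

happy, kids, play, games = (np.random.rand(d) for _ in range(4))
sentence = compose(compose(happy, kids), compose(play, games))   # ((happy kids) (play games))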
27. Unsupervised learning with NNs
How can we train a NN in an unsupervised manner?
Create an auto-encoder: Train the network to reproduce its
input via an expansion layer.
Use the output of the hidden layer as a compressed version of
the input [Socher et al., 2011]
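A sketch of a single auto-encoding step with untrained weights; in practice the weights are optimised to minimise the reconstruction error:

import numpy as np

d_in, d_hid = 8, 3
W_enc, b_enc = np.random.randn(d_hid, d_in), np.zeros(d_hid)
W_dec, b_dec = np.random.randn(d_in, d_hid), np.zeros(d_in)

x = np.random.rand(d_in)
code  = np.tanh(W_enc @ x + b_enc)       # hidden layer: compressed representation
x_hat = W_dec @ code + b_dec             # expansion layer tries to reproduce the input
loss  = np.sum((x - x_hat) ** 2)         # reconstruction error to be minimised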
28. Convolutional NNs
Originated in pattern recognition [Fukushima, 1980]
Small filters apply on every position of the input vector:
Capable of extracting fine-grained local features independently
of their exact position in the input
Features become increasingly global as more layers are stacked
Each convolutional layer is usually followed by a pooling layer
Top layer is fully connected, usually a soft-max classifier
Application to language: Collobert and Weston (2008)
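A sketch of a single small filter applied at every position of a sentence matrix, followed by max pooling; dimensions are arbitrary, and real models use many filters and stacked layers:

import numpy as np

d, n, width = 4, 6, 3                       # embedding size, sentence length, filter width
sentence = np.random.rand(d, n)             # one column per word
filt = np.random.randn(d, width)            # one convolutional filter

# Apply the filter at every position of the input (narrow convolution).
feature_map = np.array([np.sum(filt * sentence[:, i:i + width])
                        for i in range(n - width + 1)])
pooled = feature_map.max()                  # max pooling over the feature map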
29. DCNNs for modelling sentences
Kalchbrenner, Grefenstette
and Blunsom (2014): A deep
architecture using dynamic
k-max pooling
Syntactic structure is
induced automatically:
(Figures reused with permission)
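The k-max pooling operation itself is simple: keep the k largest values in each feature row while preserving their original order. A sketch is below; the "dynamic" part, not shown, varies k with sentence length and network depth:

import numpy as np

def k_max_pooling(feature_map, k):
    # feature_map: (num_features, sentence_length); keep the top-k per row, in order.
    idx = np.argsort(feature_map, axis=1)[:, -k:]      # indices of the k largest values
    idx = np.sort(idx, axis=1)                         # restore left-to-right order
    return np.take_along_axis(feature_map, idx, axis=1)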
30. Beyond sentence level
Denil, Demiraj, Kalchbrenner, Blunsom, and De Freitas (2014): An
additional convolutional layer can provide document vectors:
(Figure reused with permission)
31. Deep learning models: Summary
Distinguishing feature
Drastic transformation of the sentence space.
PROS:
Non-linearity and layered approach allow the simulation of a
very wide range of functions
Word vectors are parameters of the model, optimized during
training
State-of-the-art results in a number of NLP tasks
CONS:
Requires expensive training
Difficult to discover the right configuration
A black-box approach: not easy to correlate inner workings
with output
32. Outline
1 Introduction
2 Vector mixture models
3 Tensor-based models
4 Deep learning models
5 Afterword
33. A hierarchy of CDMs
Putting everything together:
Note that “power” is only theoretical; actual performance
depends on task, underlying assumptions, and configuration
34. Open issues and future work
No convincing solution for logical connectives, negation,
quantifiers and so on.
Functional words, such as prepositions and relative pronouns,
are also a problem.
Sentence space is usually identified with word space. This is
convenient, but obviously a simplification
Solutions depend on the specific CDM class—e.g. not much
to do in a vector mixture setting
Important: How can we make NNs more linguistically aware?
[Hermann and Blunsom, 2013]
35. Conclusion
CDMs provide quantitative semantic representations for
sentences (or even documents)
Element-wise operations on word vectors constitute an easy
and reasonably effective way to get sentence vectors
Categorical compositional distributional models allow
reasoning on a theoretical level—a glass box approach
Deep learning models are extremely powerful and effective;
still a black-box approach: it is not easy to explain why one
configuration works and another does not.
Convolutional networks seem to constitute a very promising
solution for sentential/discourse semantics
36. References I
Baroni, M. and Zamparelli, R. (2010).
Nouns are Vectors, Adjectives are Matrices.
In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Coecke, B., Sadrzadeh, M., and Clark, S. (2010).
Mathematical Foundations for a Compositional Distributional Model of Meaning. Lambek Festschrift.
Linguistic Analysis, 36:345–384.
Collobert, R. and Weston, J. (2008).
A unified architecture for natural language processing: Deep neural networks with multitask learning.
In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM.
Denil, M., Demiraj, A., Kalchbrenner, N., Blunsom, P., and de Freitas, N. (2014).
Modelling, visualising and summarising documents with a single convolutional neural network.
arXiv preprint arXiv:1406.3830.
Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014).
A convolutional neural network for modelling sentences.
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 655–665, Baltimore, Maryland. Association for Computational Linguistics.
Kartsaklis, D. and Sadrzadeh, M. (2013).
Prior disambiguation of word tensors for constructing sentence vectors.
In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages
1590–1601, Seattle, Washington, USA. Association for Computational Linguistics.
Kartsaklis, D., Sadrzadeh, M., Pulman, S., and Coecke, B. (2014).
Reasoning about meaning in natural language with compact closed categories and Frobenius algebras.
arXiv preprint arXiv:1401.5980.
37. References II
Lambek, J. (2008).
From Word to Sentence.
Polimetrica, Milan.
Mitchell, J. and Lapata, M. (2008).
Vector-based Models of Semantic Composition.
In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 236–244.
Montague, R. (1970a).
English as a formal language.
In Linguaggi nella Società e nella Tecnica, pages 189–224. Edizioni di Comunità, Milan.
Montague, R. (1970b).
Universal grammar.
Theoria, 36:373–398.
Piedeleu, R., Kartsaklis, D., Coecke, B., and Sadrzadeh, M. (2015).
Open System Categorical Quantum Semantics in Natural Language Processing.
arXiv preprint arXiv:1502.00831.
Pollack, J. B. (1990).
Recursive distributed representations.
Artificial Intelligence, 46(1):77–105.
Sadrzadeh, M., Clark, S., and Coecke, B. (2013).
The Frobenius anatomy of word meanings I: subject and object relative pronouns.
Journal of Logic and Computation, Advance Access.
38. References III
Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011).
Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection.
Advances in Neural Information Processing Systems, 24.
Socher, R., Huval, B., Manning, C., and Ng, A. (2012).
Semantic compositionality through recursive matrix-vector spaces.
In Conference on Empirical Methods in Natural Language Processing 2012.