"SSC" - Geometria e Semantica del Linguaggio

Distributional Semantic Models
Pierpaolo Basile
pierpaolo.basile@gmail.com
Storming Science Caffè
December 17, 2014

Meaning in the real world
Extensional representation of meaning!
automobile

Meaning in the real world
Intensional representation of meaning!
automòbile adj. and n. f. [from Fr. automobile ‹otomobìl›, compound of auto-1 and the Lat. adj.
mobĭlis «that moves»]. – 1. adj. That moves by itself, especially with reference to
vehicles that move on the ground (or also in water, such as self-propelled
devices like torpedoes and missiles) by means of their own engine. 2. n. f.
Four-wheeled motor vehicle, generally with an internal-combustion engine, used to transport
a limited number of people on ordinary roads (called, in...
AUTOMOBILE
automobile

Meaning in our mind
The representation of the concept
AUTOMOBILE in our mind
Connectionist
(neural networks)
Symbolic
(logical formalism)

Meaning in text
Distributional semantics: what is an automobile?
AUTOMOBILE is the set of linguistic
contexts in which the word automobile
occurs

A bottle of Tezguno is on the table.
Everyone likes Tezguno.
Tezguno makes you drunk.
We make Tezguno out of corn.
What’s Tezguno?

Distributional semantic models
You shall know a word by
the company it keeps!
Meaning of a word is
determined by its usage

Distributional semantic models
• Computational models that build
semantic representations of words
by analysing corpora
– Words are represented as vectors
– The vectors are built by statistically analysing
the linguistic contexts in which the words occur

Distributional vector
1. count how many times a word occurs in
a given context
2. build a vector from the
counts computed in step 1
SIMILAR WORDS WILL HAVE SIMILAR VECTORS

Matrix = Geometric space
• Matrix: words × contexts
        C1  C2  C3  C4  C5  C6  C7
dog      5   0  11   2   2   9   1
cat      4   1   7   1   1   7   2
bread    0  12   0   0   9   1   9
pasta    0   8   1   2  14   0  10
meat     0   7   1   1  11   1   8
mouse    4   0   8   0   1   8   1

Matrix = Geometric space
[Figure: the words pasta, mouse, dog, and cat plotted in the plane of contexts C3 and C5]
Similarity -> proximity in a
multi-dimensional space
(cosine similarity)
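The cosine similarity mentioned above can be tried directly on the word-by-context counts. What follows is a minimal sketch, assuming plain Python lists for the matrix rows; the names MATRIX, cosine, and most_similar are illustrative and do not come from the slides. Keys are in English, with the Italian originals in comments.

```python
import math

# Rows of the word-by-context matrix (contexts C1..C7).
MATRIX = {
    "dog":   [5, 0, 11, 2, 2, 9, 1],    # cane
    "cat":   [4, 1, 7, 1, 1, 7, 2],     # gatto
    "bread": [0, 12, 0, 0, 9, 1, 9],    # pane
    "pasta": [0, 8, 1, 2, 14, 0, 10],   # pasta
    "meat":  [0, 7, 1, 1, 11, 1, 8],    # carne
    "mouse": [4, 0, 8, 0, 1, 8, 1],     # topo
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(word, matrix=MATRIX):
    """Rank every other word by cosine similarity to `word`."""
    target = matrix[word]
    scores = [(other, cosine(target, vec))
              for other, vec in matrix.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("dog"):
        print(f"{other:6s} {score:.3f}")
```

With these counts, dog ends up much closer to cat and mouse than to bread, pasta, or meat.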

Generalization
• A distributional model can be
defined by <T, C, R, W, M, d, S>
– T: target elements -> the words (usually)
– C: the contexts
– R: the relation that links T to C
– W: weighting scheme
– M: geometric space TxC
– d: space reduction function M -> M’
– S: similarity function in M’

Building a semantic space
1. Pre-process the corpus
2. Identify words and contexts
3. Count word/context co-occurrences
4. Weighting (optional, but recommended)
5. Build the TxC matrix
6. Reduce the matrix (optional)
7. Compute the similarity between vectors

The parameters
• The definition of context
– A window of size n, sentence, paragraph,
document, a particular syntactic context
• Weighting scheme
• Similarity function

An example
• Term-Term matrix
– T: words
– C: words
– R: T occurs «near» C
– W: number of times T and C co-occur
– M: term/term matrix
– d: none or, for example, SVD (Latent Semantic
Analysis)
– S: cosine similarity
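The term-term configuration above can be sketched in a few lines. This is a minimal sketch, assuming whitespace tokenization, lowercasing, and a symmetric window; the function name cooccurrence_counts, the window size, and the two-sentence corpus are illustrative choices and are not taken from the slides.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each target word occurs within `window`
    positions (left or right) of each context word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokens
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts


# Hypothetical toy corpus, used only to exercise the function.
corpus = ["john eats a red apple", "mary eats a green apple"]
counts = cooccurrence_counts(corpus, window=2)

if __name__ == "__main__":
    for context, n in sorted(counts["apple"].items()):
        print("apple", context, n)
```

Each row counts[target] plays the role of one row of the matrix M; weighting (W), reduction (d), and similarity (S) would be applied on top of these raw counts.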

1. Pre-processing
• Tokenization is required!
– PoS tagging
– Lemmatization
– Parsing
• An analysis that is too deep
– Introduces errors
– Requires additional parameters
– Depends on the language
• Pre-processing affects the choice of words and
contexts

2. Definition of context
• The document
– the whole document
– paragraph, sentence, portion of text (passage)
• The other words
– Usually the n most frequent ones are chosen
– Where?
• at a distance fixed a priori (window)
• syntactic dependency
• pattern

3. Weighting
• Frequency or log(frequency) to dampen
contexts that occur many times
• Idea: a rarer co-occurrence indicates
a stronger relation
– Mutual Information, Log-Likelihood Ratio
• Information Retrieval: tf-idf, word-entropy, …

Pointwise Mutual Information
$$P(w_i) = \frac{\mathrm{freq}(w_i)}{N}$$
$$P(w_1, w_2) = \frac{\mathrm{freq}(w_1, w_2)}{N}$$
$$MI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
Local Mutual-Information
$$LMI(w_1, w_2) = \mathrm{freq}(w_1, w_2) \cdot MI(w_1, w_2)$$
P(drink) = 100/10^6
P(beer) = 25/10^6
P(water) = 150/10^6
P(drink, water) = 60/10^6
P(drink, beer) = 20/10^6
MI(drink, beer) = log2(2*10^6/250) = 12.96
MI(drink, water) = log2(6*10^6/1500) = 11.96
LMI(drink, beer) = 0.0002592
LMI(drink, water) = 0.0007176
(the LMI values above use the relative frequency P(w1, w2) as the frequency term)
MI tends to give a lot of weight to
rare events
5. Matrix reduction
• M (TxC) is a high-dimensional
matrix, so it can be useful to reduce it:
1. Identify the latent dimensions: LSI, PCA
2. Reduce the space: approximations of M, for
example Random Indexing

The method
• Assign a “random vector” to each
context: random/context vector
• The semantic vector associated with each target
(e.g. a word) is the sum of the context vectors of
all the contexts in which the target (word) occurs

Context vector
• sparse
• high-dimensional
• ternary, values in {-1, 0, +1}
• a small number of non-zero elements,
randomly distributed
0 0 0 0 0 0 0 -1 0 0 0 0 1 0 0 -1 0 1 0 0 0 0 1 0 0 0 0 -1
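A vector with these properties could be generated as follows. This is a minimal sketch: the function name random_context_vector, the seed parameter, and the even split between +1 and -1 entries are assumptions made for illustration, not details given in the slides.

```python
import random


def random_context_vector(dim, nonzero, seed=None):
    """Return a sparse ternary vector of length `dim` with `nonzero`
    entries set to +1 or -1 at randomly chosen positions."""
    if nonzero > dim:
        raise ValueError("nonzero cannot exceed dim")
    rng = random.Random(seed)
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzero)
    for rank, pos in enumerate(positions):
        # Assumption: alternate signs so +1 and -1 are balanced.
        vec[pos] = 1 if rank % 2 == 0 else -1
    return vec


if __name__ == "__main__":
    print(random_context_vector(dim=28, nonzero=6, seed=42))
```

In practice the dimension would be in the hundreds or thousands, with only a handful of non-zero entries.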

Random Indexing (formal)
$$B^{n,k} = A^{n,m} R^{m,k}, \qquad k \ll m$$
B preserves the distance
between points
(Johnson-Lindenstrauss lemma)
$$d_r \approx c \cdot d$$

Example
John eats a red apple
Rjohn -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
Reat -> (1, 0, 0, 0, -1, 0, 0 ,0 ,0 ,0)
Rred-> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
SVapple= Rjohn+ Reat+Rred=(1, 0, 0, 1, -1, 0, 1, -1, -1, 0)
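The sum above can be reproduced in code. This is a minimal sketch using the three random vectors from the example; the helper names add_vectors and semantic_vector are illustrative, and skipping tokens that have no random vector (here the article "a") is an assumption that matches what the example does.

```python
# Random vectors from the example.
RANDOM_VECTORS = {
    "john": [0, 0, 0, 0, 0, 0, 1, 0, -1, 0],
    "eat":  [1, 0, 0, 0, -1, 0, 0, 0, 0, 0],
    "red":  [0, 0, 0, 1, 0, 0, 0, -1, 0, 0],
}


def add_vectors(*vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def semantic_vector(target, tokens, random_vectors):
    """Sum the random vectors of the tokens co-occurring with `target`.
    Tokens without a random vector are ignored."""
    context = [random_vectors[t] for t in tokens
               if t != target and t in random_vectors]
    return add_vectors(*context)


sentence = ["john", "eat", "a", "red", "apple"]
sv_apple = semantic_vector("apple", sentence, RANDOM_VECTORS)

if __name__ == "__main__":
    print(sv_apple)  # [1, 0, 0, 1, -1, 0, 1, -1, -1, 0]
```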

Random Indexing
• Advantages
– Simple and fast
– Scalable and parallelizable
– Incremental
• Disadvantages
– Requires a lot of memory

Permutations
• Use different permutations of the elements
of the random vector to encode different
contexts
– word order
– (syntactic/relational) dependency between terms
• Before being added, the random vector
is permuted according to the context being
encoded

Example (word order)
Rjohn -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
Reat -> (1, 0, 0, 0, -1, 0, 0 ,0 ,0 ,0)
Rapple-> (0, 1, 0, 0, 0, 0, 0, 0, 0, -1)
SVred= R-2john+ R-1eat+R+1apple=
(0,0,0,0,1,0,-1,0,0,0)+
(0,0,0,-1,0,0,0,0,0,1)+
(-1,0,1,0,0,0,0,0,0,0)=(-1,0,1,-1,1,0,-1,0,0,1)
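The word-order encoding above can be checked with a small rotation helper. This is a minimal sketch, assuming that R+n means a right rotation by n positions and R-n a left rotation by n positions, which is the reading consistent with the vectors in the example; the names rotate and add_vectors are illustrative.

```python
def rotate(vec, n):
    """Rotate `vec` right by n positions; negative n rotates left."""
    n %= len(vec)
    return vec[-n:] + vec[:-n] if n else list(vec)


def add_vectors(*vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


R = {
    "john":  [0, 0, 0, 0, 0, 0, 1, 0, -1, 0],
    "eat":   [1, 0, 0, 0, -1, 0, 0, 0, 0, 0],
    "apple": [0, 1, 0, 0, 0, 0, 0, 0, 0, -1],
}

# "John eats a red apple": relative to "red", john is two positions to
# the left, eat one to the left, and apple one to the right.
sv_red = add_vectors(
    rotate(R["john"], -2),
    rotate(R["eat"], -1),
    rotate(R["apple"], +1),
)

if __name__ == "__main__":
    print(sv_red)  # [-1, 0, 1, -1, 1, 0, -1, 0, 0, 1]
```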

Permutations (query)
• At query time, apply the inverse permutation
according to the context
• Word order: <t> ? (the most similar terms that
occur to the right of <t>)
– Permute the random vector of t by -1: R-1t
• t must appear on the left
– Compute the similarity of R-1t with all the vectors
in the term space

SIMPLE DSMS AND SIMPLE
OPERATORS

Simple DSMs…
Term-term co-occurrence matrix (TTM): each
cell contains the co-occurrences between two
terms within a prefixed distance
          dog  cat  computer  animal  mouse
dog       0    4    0         2       1
cat       4    0    0         3       5
computer  0    0    0         0       3
animal    2    3    0         0       2
mouse     1    5    3         2       0

…Simple DSMs
Latent Semantic Analysis (LSA): relies on the
Singular Value Decomposition (SVD) of the co-
occurrence matrix
Random Indexing (RI): based on the Random
Projection
Latent Semantic Analysis over Random Indexing
(RILSA)

Latent Semantic Analysis over Random
Indexing
1. Reduce the dimension of the co-occurrence
matrix using RI
2. Perform LSA over RI (RILSA)
– reduction of LSA computation time: the RI matrix
has fewer dimensions than the co-occurrence
matrix

Simple operators…
Addition (+): pointwise sum of components
Multiplication (∘): pointwise multiplication of
components
Addition and multiplication are commutative
– do not take into account word order
Complex structures represented summing or
multiplying words which compose them

…Simple operators
Given two word vectors u and v
– composition by sum p = u + v
– composition by multiplication p = u ∘ v
Can be applied to any sequence of words
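The two simple operators can be written directly. This is a minimal sketch on plain Python lists; the function names compose_sum and compose_mult are illustrative, and the two input vectors are the dog and cat rows of the term-term matrix shown earlier in the deck.

```python
def compose_sum(u, v):
    """Pointwise sum: p = u + v."""
    return [a + b for a, b in zip(u, v)]


def compose_mult(u, v):
    """Pointwise (Hadamard) product: p = u ∘ v."""
    return [a * b for a, b in zip(u, v)]


# Rows of the term-term co-occurrence matrix (TTM).
dog = [0, 4, 0, 2, 1]
cat = [4, 0, 0, 3, 5]

if __name__ == "__main__":
    print(compose_sum(dog, cat))   # [4, 4, 0, 5, 6]
    print(compose_mult(dog, cat))  # [0, 0, 0, 6, 5]
```

Both operators give the same result when the arguments are swapped, which is why they cannot capture word order.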

SYNTACTIC DEPENDENCIES IN DSMS

Syntactic dependencies…
John eats a red apple.
[Figure: dependency graph linking John to eats (subject), apple to eats (object), and red to apple (modifier)]

…Syntactic dependencies
John eats a red apple.
[Figure: the same dependency graph, with each arc marked by its HEAD and its DEPENDENT]

Representing dependencies
Use filler/role binding approach to represent a
dependency dep(u, v)
rd⊚u + rh⊚v
rd and rh are vectors which represent
respectively the role of dependent and head
⊚ is a placeholder for a composition operator

Representing dependencies (example)
obj(apple, eat)
rd⊚apple + rh⊚eat
role vectors

Structured DSMs
1. Vector permutation in RI (PERM) to encode
dependencies
2. Circular convolution (CONV) as filler/binding
operator to represent syntactic dependencies
in DSMs
3. LSA over PERM and CONV yields two
spaces: PERMLSA and CONVLSA

Vector permutation in RI (PERM)
Using permutation of elements in context
vectors to encode dependencies
– right rotation of n elements to encode
dependents (permutation)
– left rotation of n elements to encode heads
(inverse permutation)

PERM (method)
Create and assign a context vector to each term
Assign a rotation function Π+1 to the dependent and
Π-1 to the head
Each term is represented by a vector which is
– the sum of the permuted vectors of all the dependent
terms
– the sum of the inverse permuted vectors of all the
head terms
– the sum of the non-permuted vectors of both
dependent and head words

PERM (example…)
John -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat -> (1, 0, 0, 0, -1, 0, 0 ,0 ,0 ,0)
red -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)
TVapple = Π+1(CVred) + Π-1(CVeat) + CVred + CVeat
[Figure: apple with its dependent red and its head eats]

PERM (…example)
John -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat -> (1, 0, 0, 0, -1, 0, 0 ,0 ,0 ,0)
red-> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)
TVapple = Π+1(CVred) + Π-1(CVeat) + CVred + CVeat=…
…=(0, 0, 0, 0, 1, 0, 0, 0, -1, 0) [right shift] + (0, 0, 0, -1, 0, 0, 0, 0, 0, 1) [left shift] +
+ (0, 0, 0, 1, 0, 0, 0, -1, 0, 0) + (1, 0, 0, 0, -1, 0, 0 ,0 ,0 ,0)
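The PERM construction for apple can be reproduced as follows. This is a minimal sketch, assuming Π+1 is a right rotation by one position and Π-1 a left rotation by one position; the names rotate, add_vectors, and perm_term_vector are illustrative and not taken from the slides.

```python
def rotate(vec, n):
    """Rotate `vec` right by n positions; negative n rotates left."""
    n %= len(vec)
    return vec[-n:] + vec[:-n] if n else list(vec)


def add_vectors(*vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


# Context vectors from the example.
CV = {
    "john":  [0, 0, 0, 0, 0, 0, 1, 0, -1, 0],
    "eat":   [1, 0, 0, 0, -1, 0, 0, 0, 0, 0],
    "red":   [0, 0, 0, 1, 0, 0, 0, -1, 0, 0],
    "apple": [1, 0, 0, 0, 0, 0, 0, -1, 0, 0],
}


def perm_term_vector(dependents, heads, cv=CV):
    """Sum permuted and non-permuted context vectors of the terms
    that are dependents or heads of the target term."""
    parts = []
    for dep in dependents:
        parts.append(rotate(cv[dep], +1))   # permutation for dependents
        parts.append(cv[dep])               # non-permuted copy
    for head in heads:
        parts.append(rotate(cv[head], -1))  # inverse permutation for heads
        parts.append(cv[head])              # non-permuted copy
    return add_vectors(*parts)


# For "apple": red is its dependent (modifier), eat is its head (object).
tv_apple = perm_term_vector(dependents=["red"], heads=["eat"])

if __name__ == "__main__":
    print(tv_apple)  # [1, 0, 0, 0, 0, 0, 0, -1, -1, 1]
```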

Convolution (CONV)
Create and assign a context vector to each term
Create two context vectors for head and dependent
roles
Each term is represented by a vector which is
– the sum of the convolution between dependent terms
and the dependent role vector
– the sum of the convolution between head terms and
the head role vector
– the sum of the vectors of both dependent and head
words

Circular convolution operator
Circular convolution
p=u⊛v
defined as:
$$p_j = \sum_{k=1}^{n} u_k \, v_{((j-k) \bmod n) + 1}$$
     U1  U2  U3  U4  U5
V1    1   1  -1  -1   1
V2   -1  -1   1   1  -1
V3    1   1  -1  -1   1
V4   -1  -1   1   1  -1
V5   -1  -1   1   1  -1
[Figure: each component P1..P5 sums the entries along one wrapped diagonal of the table]
U=<1, 1, -1, -1, 1>
V=<1, -1, 1, -1, -1>
P=<-1, 3, -1, -1, -1>
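The example vectors U, V, and P can be verified with a direct implementation. This is a minimal sketch of circular convolution in zero-based indexing, where component j is the sum over k of u[k] * v[(j - k) mod n]; the function name circular_convolution is illustrative.

```python
def circular_convolution(u, v):
    """Direct O(n^2) circular convolution of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    n = len(u)
    # Python's % always returns a non-negative index for positive n.
    return [sum(u[k] * v[(j - k) % n] for k in range(n)) for j in range(n)]


U = [1, 1, -1, -1, 1]
V = [1, -1, 1, -1, -1]
P = circular_convolution(U, V)

if __name__ == "__main__":
    print(P)  # [-1, 3, -1, -1, -1]
```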

Circular convolution by FFTs
Direct circular convolution is computed in O(n^2)
– using FFTs it is computed in O(n log n)
Given f, the discrete Fourier transform, and f^-1, its inverse
– u ⊛ v = f^-1( f(u) ∘ f(v) )
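The identity above (the convolution theorem) can be checked numerically. This is a minimal sketch that uses a direct O(n^2) discrete Fourier transform written with the standard library, standing in for a real FFT; it demonstrates the identity, not the O(n log n) speed-up. The names dft and circular_convolution_dft are illustrative. In practice an FFT routine such as numpy.fft.fft would replace dft.

```python
import cmath


def dft(x, inverse=False):
    """Direct discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def circular_convolution_dft(u, v):
    """Circular convolution via f^-1(f(u) ∘ f(v)); returns real parts."""
    product = [a * b for a, b in zip(dft(u), dft(v))]
    return [c.real for c in dft(product, inverse=True)]


U = [1, 1, -1, -1, 1]
V = [1, -1, 1, -1, -1]

if __name__ == "__main__":
    # Rounded to absorb floating-point noise.
    print([round(x) for x in circular_convolution_dft(U, V)])
```

The rounded result matches the vector P obtained from the direct definition.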

CONV (example)
John -> (0, 0, 0, 0, 0, 0, 1, 0, -1, 0)
eat -> (1, 0, 0, 0, -1, 0, 0, 0, 0, 0)
red -> (0, 0, 0, 1, 0, 0, 0, -1, 0, 0)
apple -> (1, 0, 0, 0, 0, 0, 0, -1, 0, 0)
rd -> (0, 0, 1, 0, -1, 0, 0, 0, 0, 0) (context vector for dependent role)
rh -> (0, -1, 1, 0, 0, 0, 0, 0, 0, 0) (context vector for head role)
apple = eat + red + (rd ⊛ red) + (rh ⊛ eat)
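The CONV vector for apple can be computed from the vectors listed in the example. This is a minimal sketch using a direct circular convolution in zero-based indexing; the names circular_convolution, add_vectors, RD, and RH are illustrative.

```python
def circular_convolution(u, v):
    """Direct circular convolution: p[j] = sum_k u[k] * v[(j - k) mod n]."""
    n = len(u)
    return [sum(u[k] * v[(j - k) % n] for k in range(n)) for j in range(n)]


def add_vectors(*vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


# Context vectors and role vectors from the example.
EAT = [1, 0, 0, 0, -1, 0, 0, 0, 0, 0]
RED = [0, 0, 0, 1, 0, 0, 0, -1, 0, 0]
RD = [0, 0, 1, 0, -1, 0, 0, 0, 0, 0]   # dependent role
RH = [0, -1, 1, 0, 0, 0, 0, 0, 0, 0]   # head role

# apple = eat + red + (rd ⊛ red) + (rh ⊛ eat)
apple = add_vectors(
    EAT,
    RED,
    circular_convolution(RD, RED),
    circular_convolution(RH, EAT),
)

if __name__ == "__main__":
    print(apple)  # [1, 0, 1, 1, -1, 2, -1, -2, 0, -1]
```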

Complex operators
Based on filler/role binding taking into account
syntactic role: rd ⊚ u + rh ⊚ v
– u and v could be recursive structures
Two vector operators to bind the role:
– convolution (⊛)
– tensor (⊗)
– convolution (⊛+): exploits also the sum of term
vectors
rd ⊛ u + rh ⊛ v + v + u

Complex operators (remarks)
Existing operators
– t1 ⊚ t2 ⊚ … ⊚ tn: does not take into account
syntactic role
– t1 ⊛ t2 is commutative
– t1⊗t2 ⊗ … ⊗tn: tensor order depends on the
phrase length
• two phrases of different lengths are not comparable
– t1 ⊗ r1 ⊗t2 ⊗r2 ⊗ … ⊗ tn ⊗rn : also depends on
the sentence length

System setup
• Corpus
– WaCkypedia EN based on a 2009 dump of Wikipedia
– about 800 million tokens
– dependency parsing by MaltParser
• DSMs
– 500 vector dimension (LSA/RI/RILSA)
– 1,000 vector dimension (PERM/CONV/PERMLSA/CONVLSA)
– 50,000 most frequent words
– co-occurrence distance: 4

Evaluation
• GEMS 2011 Shared Task for compositional
semantics
– list of pairs of two-word combinations
(support offer) (help provide) 7
(old person) (right hand) 1
• rated by humans
• 5,833 ratings
• 3 types involved: noun-noun (NN), adjective-noun (AN),
verb-object (VO)
– GOAL: compare the system performance against
human scores
• Spearman correlation
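The Spearman correlation used for the comparison can be computed as the Pearson correlation of the ranks. This is a minimal sketch with average ranks for ties; the function names are illustrative, and the human and model score lists are hypothetical values made up for the example, not data from the shared task.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings (1 to 7) and model cosine similarities.
human = [7, 1, 5, 3]
model = [0.8, 0.1, 0.3, 0.4]

if __name__ == "__main__":
    print(round(spearman(human, model), 2))  # 0.8
```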

Results (simple spaces)…
Simple Semantic Spaces
       NN                          AN                          VO
       TTM    LSA    RI     RILSA  TTM    LSA    RI     RILSA  TTM    LSA    RI     RILSA
+      .21    .36    .25    .42    .22    .35    .33    .41    .23    .31    .28    .31
∘      .31    .15    .23    .22    .21    .20    .22    .18    .13    .10    .18    .21
⊛      .21    .38    .26    .35    .20    .33    .31    .44    .15    .31    .24    .34
⊛+     .21    .34    .28    .43    .23    .32    .31    .37    .20    .31    .25    .29
⊗      .21    .38    .25    .39    .22    .38    .33    .43    .15    .34    .26    .32
human  .49                         .52                         .55

…Results (structured spaces)
Structured Semantic Spaces
       NN                                  AN                                  VO
       CONV     PERM     CONVLSA  PERMLSA  CONV     PERM     CONVLSA  PERMLSA  CONV     PERM     CONVLSA  PERMLSA
+      .36      .39      .43      .42      .34      .39      .42      .45      .27      .23      .30      .31
∘      .22      .17      .10      .13      .23      .27      .13      .15      .20      .15      .06      .14
⊛      .31      .36      .37      .35      .39      .39      .45      .44      .28      .23      .27      .28
⊛+     .30      .36      .40      .36      .38      .32      .48      .44      .27      .22      .30      .32
⊗      .34      .37      .37      .40      .36      .40      .45      .45      .27      .24      .31      .32
human  .49                                 .52                                 .55

Final remarks
• Best results are obtained when complex
operators, structured spaces, or both are involved
• No single best combination of operator and space exists
– it depends on the type of relation (NN, AN, VO)
• Tensor product and convolution provide good results,
despite previously reported results
– filler/role binding is effective
• Future work
– generate several rd and rh vectors for each kind of
dependency
– apply this approach to other directed graph-based
representations

Thank you for your attention!
Questions?
<pierpaolo.basile@uniba.it>
