Cross-lingual Similarity

oeg-upm.net
Carlos Badenes-Olmedo (1)
Jose Luis Redondo García (2)
Oscar Corcho (1)
(1) Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
(2) Amazon Research
Cambridge, UK

Outline
We will learn how to:
• Text Similarity (25min)
- represent texts to calculate distances and similarities among them.
- use Python modules to perform it.
• Document Similarity (30min)
- create topic models to describe and compare documents.
- use Python modules to perform it.
• Document Retrieval (20min)
- efficiently search for documents in large collections.
• Multi-lingual Retrieval (15min)
- create annotations to compare documents in multi-lingual corpora.

First Steps
1. Clone the demo project from GitHub:
    git clone https://github.com/librairy/demo.git
2. Move into the root folder:
    cd demo/
3. Download the Docker images:
    docker-compose pull

Material
1. Browse to http://classroom.google.com and go to Classroom.
2. Sign in using your Google account.
3. Join the class with the code: kbmakz

How similar are these texts?
text1: Tennis is a racket sport played individually or between two teams of two players each.
text2: Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks.
text3: The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick.

Challenges
• identify features
• vector sparsity
• text normalization

Bag-of-Words
[Figure: an empty term-by-text matrix. Rows: text1, text2, text3. Columns: the features to be identified, still unknown (?).]

Tokens
[Figure: text1 split into tokens: Tennis, is, a, racket, sport, played, …]
• Retrieve the minimum processing units (tokens) from a text
• Phrase and word level
• Requires an initial cleaning process and a subsequent normalization
process
• Rules are defined to identify the cut-off points or boundaries of each
segment (regular expressions):
• Phrases: punctuation marks are used to identify segments
• Words: in all modern languages based on Latin, Cyrillic or Greek
writing systems, like English and other European languages, blank
spaces are used to identify segments.
• Tokenization in non-segmented languages (e.g. pictograms) requires
more sophisticated algorithms.
• It depends on the domain
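
A minimal sketch of rule-based tokenization with regular expressions, using only the Python standard library. The two patterns are illustrative assumptions (not the rules used in the notebook): sentences are cut after '.', '!' or '?', and a word is a run of letters with an optional internal apostrophe.

```python
import re

def sentences(text):
    # Phrase level: cut after sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def words(text):
    # Word level: runs of letters, optionally joined by one apostrophe (Witch's).
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)

text1 = ("Tennis is a racket sport played individually "
         "or between two teams of two players each.")
print(words(text1))
```

Real tokenizers add domain-specific rules (URLs, hyphenated terms, abbreviations), which is why the result depends on the domain.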

Bag-of-Words
[Figure: term-by-text matrix after tokenization. Rows: text1, text2, text3. Columns (tokens): a, agrees, and, between, bringing, broomstick, broomsticks, by, each, fictional, flying, grant, him, if, individually, is, of, … An x marks each token that occurs in a text.]

Stopwords
[Figure: text1 tokens with the stopwords "a" and "is" highlighted among Tennis, racket, sport, played, …]
• Definition by Collins Dictionary: “a common word
such as 'a' or 'the' that is not indexed or searchable
in a computer search engine”
• Previously known or generated during the analysis.
• They are removed from the original text during the
pre-processing phase.
• Stopwords ISO: https://github.com/stopwords-iso
• Is it always recommended to delete stopwords?
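
As a sketch of the pre-processing step, stopword removal is a simple filter over the tokens. The stopword list below is a small hand-picked assumption for the three example texts; a real pipeline would load a per-language list such as the ones published by Stopwords ISO.

```python
# Hypothetical, hand-picked stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "between", "by", "each", "him", "if", "is",
             "of", "or", "the", "their", "they", "to", "two", "where"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ("Tennis is a racket sport played individually "
          "or between two teams of two players each").split()
print(remove_stopwords(tokens))
```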

Bag-of-Words
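[Figure: term-by-text matrix after stopword removal. Rows: text1, text2, text3. Columns (tokens): agrees, bringing, broomstick, broomsticks, fictional, flying, grant, individually, played, players, playing, prove, Quidditch, racket, riding, sport, teams, … An x marks each token that occurs in a text.]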

Stemming
• Different variants of the same word according to its grammatical
category: tense, gender, number...
• The categories that share grammatical properties are considered
part-of-speech (PoS): noun, verb, adjective, adverb, pronoun...
• Techniques:
A) Rule-based (stemming): linguistic normalization that reduces the
different grammatical forms a word can adopt to its root
(stem) by removing its affixes (prefixes and suffixes). It uses
logical rules.
E.g. Agreed -> Agree
B) Dictionary-based (lemmas): transformation based on
the context and the grammatical category (verb, noun,
adjective...) of the words. It uses dictionaries.
E.g. Understood -> Understand
• broomstick vs. broomsticks?

N-Gram Stemming
Bi-grams (each word is padded with * at both ends):
broomstick: *b, br, ro, oo, om, ms, st, ti, ic, ck, k* (11)
broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (12)
played: *p, pl, la, ay, ye, ed, d* (7)
players: *p, pl, la, ay, ye, er, rs, s* (8)
playing: *p, pl, la, ay, yi, in, ng, g* (8)

Dice Coefficient = 2 * C / (A+B)
where A and B are the numbers of bi-grams in the two words and C is the number of bi-grams they share.

             broomstick  broomsticks  played  players  playing
broomstick   1           0.87         0.0     0.0      0.0
broomsticks              1            0.0     0.10     0.0
played                                1       0.67     0.53
players                                       1        0.50
playing                                                1

threshold=0.5 -> resulting stems: broomstick, play
Bag-of-Words
[Figure: term-by-text matrix after stemming. Rows: text1, text2, text3. Columns (stems): agree(s), bring(ing), broomstick(s), fiction(al), fly(ing), grant, individual(ly), play(ed|ers|ing), prove, Quidditch, racket, ride(-ing), sport, team(s), …, Tennis, wish(es), witch(es). An x marks each stem that occurs in a text.]

Binary Bag-of-Words
Binary-Coding

term              text1  text2  text3
agree(s)          0      0      1
bring(ing)        0      0      1
broomstick(s)     0      1      1
fiction(al)       0      1      0
fly(ing)          0      1      0
grant             0      0      1
individual(ly)    1      0      0
play(ed|ers|ing)  1      1      0
prove             0      0      1
Quidditch         0      1      0
racket            1      0      0
ride(-ing)        0      1      0
sport             1      1      0
team(s)           1      0      0
Tennis            1      0      0
wish(es)          0      0      1
witch(es)         0      1      1
…                 ..     ..     ..

Term-Frequency Bag-of-Words
• Term Frequency (TF): The importance of a
word depends on the number of times it
appears in the text

play(ed|ers|ing):
text1 2
text2 1
text3 0

Scaled TF Bag-of-Words
• Scaled Term Frequency: the importance of
a word depends on both the number of
times it appears in the text and the size of
the text

play(ed|ers|ing):
text1 2/8
text2 1/9
text3 0

TF-IDF Bag-of-Words
• Term Frequency-Inverse Document Frequency:
the importance of a word depends on both its
importance in a document (TF) and how rare it is
across the collection of documents (IDF).

play(ed|ers|ing):
text1 2/8*log(3/2)
text2 1/9*log(3/2)
text3 0
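
The three weighting schemes can be sketched as follows. The token lists are assumptions: they were chosen so that text1 has 8 stems and text2 has 9, matching the 2/8 and 1/9 on the slides, and they are not the exact output of the notebook pipeline. IDF is taken as log(N / df), with N documents and df documents containing the term.

```python
import math
from collections import Counter

# Assumed stemmed token lists (illustrative, see lead-in).
docs = {
    "text1": ["tennis", "racket", "sport", "play", "individual", "team", "play", "each"],
    "text2": ["quidditch", "fiction", "sport", "witch", "wizard", "play", "ride", "fly", "broomstick"],
    "text3": ["wizard", "oz", "agree", "grant", "wish", "prove", "worth", "bring", "witch", "broomstick"],
}

def tf(term, tokens):
    # Raw term frequency: number of occurrences in the text.
    return Counter(tokens)[term]

def scaled_tf(term, tokens):
    # Term frequency divided by the size of the text.
    return tf(term, tokens) / len(tokens)

def idf(term, docs):
    # log(N / df); 0.0 when the term occurs in no document.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, name, docs):
    return scaled_tf(term, docs[name]) * idf(term, docs)

for name, tokens in docs.items():
    print(name, tf("play", tokens), scaled_tf("play", tokens), tf_idf("play", name, docs))
```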

Vector Space Model [Salton and McGill, 1983]
• A document is represented by a high-dimensional vector in the
space of words
• A basic vocabulary of “words” or “terms” is chosen, and, for each
document in the corpus, a count is formed of the number of
occurrences of each word.
• After suitable normalization, this term frequency count is
compared to an inverse document frequency count, which
measures the number of occurrences of a word in the entire
corpus (generally on a log scale, and again suitably normalized).
• The end result is a term-by-document matrix X whose columns
contain the tf-idf values for each of the documents in the corpus.
• Thus the tf-idf scheme reduces documents of arbitrary length to
fixed-length lists of numbers.
[Figure: example term-by-document matrix with one column per document (doc1, doc2, doc3) and one row per term.]

Distance Metrics
• A metric on a set $I$ is a distance function $d : I \times I \to \mathbb{R}$
when the following conditions are satisfied:
1) Identity: $\forall i \in I,\ d(i, i) = 0$
2) Symmetry: $\forall i, k \in I,\ d(i, k) = d(k, i)$
3) Non-negativity: $\forall i, k \in I,\ d(i, k) \geq 0$
4) Triangle inequality: $\forall i, l, k \in I,\ d(i, k) \leq d(i, l) + d(l, k)$
• Examples: Euclidean (l2-norm), Manhattan (l1-norm), Chebyshev, Minkowski (lp-norm), Mahalanobis
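
A minimal sketch of the norm-based distances listed above, written with plain Python lists. Mahalanobis is left out because it also needs a covariance matrix. Manhattan and Euclidean are the Minkowski distance with p = 1 and p = 2, and Chebyshev is its limit as p grows.

```python
def minkowski(x, y, p):
    # lp-norm of the difference vector.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

def chebyshev(x, y):
    # Largest coordinate-wise difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [0, 0], [3, 4]
print(manhattan(x, y), euclidean(x, y), chebyshev(x, y))
```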

Text Similarity
• Jaccard Index:
It measures similarity between finite
sample sets, and is defined as the
size of the intersection divided by the
size of the union of the sample sets
• Cosine Similarity:
It measures the cosine of the angle
between two vectors
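
Both measures fit in a few lines of standard-library Python. The sketch below applies them to the binary vectors of text1 and text2 from the Binary Bag-of-Words slide; treating an empty pair of sets as fully similar and a zero vector as having similarity 0 are conventions assumed here.

```python
import math

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over two collections of terms.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0

text1 = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
text2 = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1]
print(cosine(text1, text2))

# Jaccard works on the sets of positions (terms) that are present.
present = lambda v: {i for i, bit in enumerate(v) if bit}
print(jaccard(present(text1), present(text2)))
```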

Open the
‘Text Similarity’ Notebook

Text Similarity Notebook
Create vector representations of the texts
using an NLP pipeline with:
1. Tokenization
2. Stopwords Removal
3. Stemming
And calculate the cosine similarities
among them

similarity  text1  text2  text3
text1       1.0    0.158  0.06
text2       0.158  1.0    0.175
text3       0.06   0.175  1.0

How similar are these books?

Challenges
• number of words in vocabulary
• term-frequency matrix size
• high dimensionality of vectors

Dimensionality Reduction
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)

Latent Semantic Analysis (LSA/LSI) [Deerwester et al., 1990]
• Map documents (and terms) to a low-dimensional representation (a k-dimensional concept space) by
SVD
• Dimensionality reduction
• It can capture some aspects of basic linguistic notions such as synonymy and polysemy
• Hard to interpret
[Figure: the terms × documents matrix factorized into terms × dims, dims × dims and dims × documents matrices.]
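
In matrix form, the mapping can be written as a truncated SVD. The notation is an assumption introduced here: $X$ is the terms × documents matrix with $t$ terms and $d$ documents, and $k$ is the number of retained dimensions.

```latex
X \;\approx\; X_k \;=\; U_k \,\Sigma_k\, V_k^{\top},
\qquad
U_k \in \mathbb{R}^{t \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k^{\top} \in \mathbb{R}^{k \times d}
```

Each column of $\Sigma_k V_k^{\top}$ is then a $k$-dimensional representation of one document, and documents can be compared in that space, for example with cosine similarity.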

probabilistic LSA/LSI [Hofmann, 1999]
• Each word is generated from a single topic, and different words in a document may be
generated from different topics.
• Each document (bag-of-words) is DESCRIBED as a list of mixing proportions for these topics
• Mixture components as representation of “topics”
• No generative model at the level of documents -> no inference (given an unseen text, we cannot determine which topics it belongs to)
[Figure: the terms × documents matrix factorized into terms × topics and topics × documents matrices.]

Latent Dirichlet Allocation (LDA) [Blei et al., 2003]
• Each word is generated from a single topic, and different words in a document may be
generated from different topics.
• Each document (bag-of-words) is GENERATED from a mixture of topics
• Generative model of terms and documents
• Parameters do not grow with the size of the training corpus
[Figure: the terms × documents matrix factorized into terms × topics and topics × documents matrices.]

Topic?
Let's take a look at the following model:
• DBpedia Model: https://librairy.linkeddata.es/dbpedia-model

Hyperparameters
• Encodes assumptions
• Defines a factorization of the
joint distribution
• Connects to algorithms for
computing with data
Plate Notation:
• Nodes are random variables
• Edges indicate dependence
• Shaded nodes are observed
• Plates indicate replicated
variables
[Figure: LDA graphical model in plate notation. Node labels: Document-Topic parameter (α), per-document topic proportions, per-word topic proportion, observed word, per-topic word proportions, Topic-Word parameter (β). Plates: Words, Documents, Topics.]
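
The factorization encoded by the graphical model can be written out as follows. Apart from α and β, the symbols are assumptions introduced here for illustration: $\theta_d$ are the per-document topic proportions, $z_{d,n}$ is the topic of the $n$-th word of document $d$, $w_{d,n}$ is the observed word, and $\varphi_k$ are the per-topic word proportions, for $K$ topics, $D$ documents and $N_d$ words in document $d$.

```latex
p(\varphi, \theta, z, w \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \varphi_{z_{d,n}})
```

Each factor corresponds to one node of the plate diagram conditioned on its parents, and each product corresponds to one plate.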

α value?
• 15 documents
• 10 topics
• α = 100, α = 1, α = 0.1, α = 0.01
[Figure: one panel per value of α (100, 1, 0.1, 0.01) for 15 documents and 10 topics.]
• Blei, David M., Lawrence Carin and David B. Dunson. “Probabilistic Topic Models.” IEEE Signal Processing Magazine 27 (2010): 55-65

How similar are these books?

Topics per Books
Topic proportions per book (one row per book, one column per topic; each row sums to 1.0):
0.1 0.6 0.3
0.3 0.2 0.5
0.3 0.1 0.6

Word weights per topic (one column per word: word1, word2, …, word9, ..; each column sums to 1.0):
topic0  0.2 0.1 0.3 0.2 0.3 0.3 0.1 0.5 0.6 0.1
topic1  0.2 0.8 0.4 0.7 0.2 0.3 0.1 0.1 0.3 0.6
topic2  0.6 0.1 0.3 0.1 0.5 0.4 0.8 0.4 0.1 0.3

Dirichlet Distribution
• Iterations of taking 1000 samples from
a Dirichlet distribution using an
increasing alpha value.
• Each dot represents some distribution
or mixture of the three topics like (1.0,
0.0, 0.0) or (0.4, 0.3, 0.3)
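
The experiment can be approximated with the standard library alone. This sketch assumes a symmetric Dirichlet over three topics, sampled by normalizing independent Gamma(α, 1) draws; the list of α values is also an assumption. The average weight of the largest topic serves as a rough summary of how concentrated the samples are.

```python
import random

def sample_dirichlet(alpha, k, rng=random):
    # Symmetric Dirichlet sample: normalize k independent Gamma(alpha, 1) draws.
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against numerical underflow for very small alpha
            return [d / total for d in draws]

def mean_max(alpha, k=3, n=1000, rng=random):
    # Average weight of the dominant topic over n samples.
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

rng = random.Random(0)
for alpha in (0.1, 1.0, 10.0, 100.0):
    print(alpha, round(mean_max(alpha, rng=rng), 2))
```

Small α values push samples towards the corners, such as (1.0, 0.0, 0.0), while large values pull them towards even mixtures.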

Document Similarity
• Distance metrics based on vector-type data
such as Euclidean distance (l2), Manhattan
distance (l1), and angular metric (θ) are not
optimal in this space.
• Information-theoretically motivated metrics
such as Kullback-Leibler (KL) divergence
(Eq.1) (also known as relative entropy),
Jensen-Shannon (JS) divergence (Eq.2) (as
its symmetric version) and Hellinger (He)
distance (Eq.3) are often more reasonable.
• However, not all of these are well-defined
distance metrics: KL and JS divergence do not satisfy
the triangle inequality. S2JSD (Eq.4) was created
to satisfy it.
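
A minimal sketch of these four measures, using the common textbook definitions; the exact normalization of Eq.1 to Eq.4 on the slide may differ. S2JSD is assumed here to be the square root of twice the JS divergence. The example vectors are the three topic distributions from the "Topics per Books" slide.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Jensen-Shannon divergence: symmetrized KL against the midpoint m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)) / 2)

def s2jsd(p, q):
    # Assumed definition: sqrt(2 * JS); max() guards against tiny negatives.
    return math.sqrt(max(0.0, 2 * js(p, q)))

books = [[0.1, 0.6, 0.3], [0.3, 0.2, 0.5], [0.3, 0.1, 0.6]]
a, b, c = books
print(kl(a, b), kl(b, a))  # KL is not symmetric
print(js(a, b), hellinger(a, b), s2jsd(a, b))
```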

Open the
‘Document Similarity’ Notebook

Document Similarity Notebook
• Train a topic model
• Modify NLP pipeline
• Understand LDA
hyperparameters
• Visualize topics
• Create document similarity
matrix

duplicated patents among those
published in the past 20 years in Spain?

Challenges
• avoid all pairwise comparisons
• large-scale document retrieval
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019).
Large-Scale Semantic Exploration of Scientific Literature using Topic-
based Hashing Algorithms. Semantic Web Journal.

Time Complexity
• Exact similarity computations have $O(n^2)$ complexity for
neighbour detection tasks over $n$ documents, or require $O(k \cdot n)$
computations when $k$ queries
are compared against a dataset of $n$
documents
• The computation can be approximated
by treating it as a nearest neighbour (NN) search
problem.
[Figure: n × n similarity matrix over patent-1, patent-2, patent-3, …, patent-n, with 1.0 on the diagonal and unknown (?) values elsewhere.]

Nearest Neighbour Search
• An approximate nearest neighbour (ANN)
search algorithm aims to find the point in the tree that
is nearest to a given input point (kd-tree)
• Hashing techniques transform data points from the original
feature space into a binary-code space, so that similar
data points have a larger probability of collision (i.e.
having the same hash-code)
e.g. (4,7) -> 010/2
(9,6) -> 100/1
• The metric space can handle information-theoretically
motivated metrics such as JS divergence, KL
divergence, He distance or S2JSD.

Similarity Perception vs Similarity Score
• High Dimensional Models:
a) JSDa? b) JSDb?

Similarity Perception vs Similarity Score
• High Dimensional Models:
a) JSDa = 0.74 b) JSDb = 0.71

Hashing Topic Distributions [Badenes-Olmedo et al., 2019]
Hash methods based on a hierarchical set of topics:
Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.

Topic-based Approximate Nearest Neighbour [Badenes-Olmedo et al., 2019]
[Figure: documents placed by their hash codes over two hierarchy levels (L0, L1). Example codes: {(t6),(t5)}, {(t6),(t2)}, {(t5),(t4)}, {(t4),(t5)}, {(t2),(t3)}, {(t4),(t2)}.]
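
The idea of hashing by topic hierarchy can be illustrated with a deliberately simplified sketch. This is not the algorithm from the cited paper: it assumes one topic per level, taking the most relevant topic as level L0 and the second most relevant as L1, and the patent names and distributions are made up. Documents that share a hash code fall into the same bucket, so a query is only compared against that bucket instead of the whole collection.

```python
from collections import defaultdict

def topic_hash(dist, levels=2):
    # Rank topics by weight and keep one topic per hierarchy level (assumption).
    ranked = sorted(range(len(dist)), key=lambda t: dist[t], reverse=True)
    return tuple(f"t{t}" for t in ranked[:levels])

def build_index(docs, levels=2):
    # Bucket document ids by hash code.
    index = defaultdict(list)
    for doc_id, dist in docs.items():
        index[topic_hash(dist, levels)].append(doc_id)
    return index

def candidates(query, index, levels=2):
    # Only documents in the query's bucket need an exact comparison.
    return index.get(topic_hash(query, levels), [])

# Hypothetical topic distributions over topics t0..t6.
docs = {
    "patent-1": [0.05, 0.05, 0.10, 0.05, 0.05, 0.25, 0.45],
    "patent-2": [0.05, 0.05, 0.05, 0.10, 0.05, 0.30, 0.40],
    "patent-3": [0.05, 0.05, 0.30, 0.05, 0.05, 0.05, 0.45],
    "patent-4": [0.05, 0.05, 0.05, 0.05, 0.30, 0.45, 0.05],
}
index = build_index(docs)
print(dict(index))
print(candidates([0.02, 0.03, 0.05, 0.05, 0.05, 0.30, 0.50], index))
```

The candidates can then be ranked with an exact measure such as JS divergence, which keeps the expensive comparison restricted to a small subset.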

Topic Hierarchies
Let's take a look at the following model:
• DBpedia Model: https://librairy.linkeddata.es/dbpedia-model
Guernica is a large 1937 oil painting on canvas by
Spanish artist Pablo Picasso. One of Picasso's best
known works, Guernica is regarded by many art critics
as one of the most moving and powerful anti-war
paintings in history. It is exhibited in the Museo Reina
Sofía in Madrid.

duplicated patents among those
published in the past 20 years in Spain
and the United States?

Challenges
• cross-lingual topics
• unsupervised annotations
Scalable Cross-lingual Document Similarity through Language-specific
Concept Hierarchies. Proceedings of the 10th Knowledge Capture
Conference.

Multilingual Topic Models
• Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning
models that can be used to perform thematic explorations on collections of texts in multiple languages.
• However, these approaches require theme-aligned training data to create a language-independent space.
• This constraint limits the number of scenarios in which the technique can be applied and makes it difficult to
scale up to situations where a huge collection of multi-lingual documents is required during the training phase.

Multilingual Dictionaries
• Multilingual dictionaries are usually easier to obtain and more widely
available than parallel corpora (e.g. PANLEX covers 5,700 languages and
Wiktionary covers 8,070 languages)
• But all these probabilistic topic models are based on prior knowledge.
• Connections at document-level (by parallel or comparable corpora) or at
word-level (by dictionaries) are created in the training data before building
the model.
• In this way, the pre-established language relations condition the creation of
the topics (supervised method), instead of being inferred from the topics
themselves as a posteriori knowledge (unsupervised method)

Multi-lingual Topic Hierarchies
Let's take a look at the following models created from JRC-Acquis corpora:
• English Model: http://librairy.linkeddata.es/jrc-en-model
Fast food is a type of mass-produced food designed for commercial resale and with a
strong priority placed on speed of service versus other relevant factors involved in
culinary science. Fast food was originally created as a commercial strategy to
accommodate the larger numbers of busy commuters, travelers and wage workers
who often did not have the time to sit down at a public house or diner and wait for their
meal.

Multi-lingual Topic Hierarchies
Let's take a look at the following models created from JRC-Acquis corpora:
• Spanish Model: http://librairy.linkeddata.es/jrc-es-model
La comida rápida es un estilo alimentario donde el alimento se prepara y
sirve para consumir rápidamente en establecimientos especializados
(generalmente callejeros o a pie de calle) con un nivel alimenticio bajo.
