oeg-upm.net
Cross-Lingual Similarity
Carlos Badenes-Olmedo 1
Jose Luis Redondo García 2
Oscar Corcho 1
1 Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
2 Amazon Research
Cambridge, UK
Introduce Yourselves
2
Outline
• Text Similarity (25min)
- represent texts to calculate distances and similarities among them.
- use Python modules to perform it.
• Document Similarity (30min)
- create topic models to describe and compare documents.
- use Python modules to perform it.
• Document Retrieval (20min)
- efficiently search for documents in large collections.
• Multi-lingual Retrieval (15min)
- create annotations to compare documents in multi-lingual corpora.
3
We will learn how to..
First Steps
1.Clone the demo project from GitHub:
git clone https://github.com/librairy/demo.git
2.Move into the root folder:
cd demo/
3.Download the docker images:
docker-compose pull
4
Material
5
1. Browse to http://classroom.google.com
2. Sign in using your Google account
3. Join the class with the code: kbmakz
How similar are these texts?
6
Tennis is a racket sport played individually
or between two teams of two players each.
text1
Quidditch is a fictional sport where witches
and wizards playing by riding flying
broomsticks.
text2
The Wizard of Oz agrees to grant their
wishes if they prove their worth by bringing
him the Witch's broomstick.
text3
Challenges
• identify features
• vector sparsity
• text normalization
7
Bag-of-Words
8
text1, text2 and text3 as above.
How can each text be represented as a vector of features?
[Slide figure: an empty term-by-text matrix, one row per text and one column per candidate feature, with every cell still marked "?"]
Tokens
Tennis is a racket sport played individually or between two teams of two players each. (text1)
[Slide figure: tokens extracted from text1, e.g. "Tennis", "is", "a", "racket", "sport", "played", …]
• Retrieve minimum processing units (tokens) from a text
• Phrase and word level
• Requires an initial cleaning process and a subsequent normalization process
• Rules are defined to identify the cut-off points or boundaries of each segment (regular expressions):
  • Phrases: punctuation marks identify segments
  • Words: in all modern languages based on Latin, Cyrillic or Greek writing systems, such as English and other European languages, blank spaces identify segments
• Tokenization in non-segmented languages (e.g. pictograms) requires more sophisticated algorithms
• It depends on the domain
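The boundary rules above can be sketched with plain regular expressions. The helper below is purely illustrative and not part of the librairy demo code:

```python
import re

def tokenize(text):
    """Split a text into phrase and word tokens using simple
    regular-expression boundary rules."""
    # Phrases: punctuation marks identify segment boundaries.
    phrases = [p.strip() for p in re.split(r"[.!?]+", text) if p.strip()]
    # Words: blank spaces (and punctuation) identify segment boundaries.
    words = re.findall(r"[a-z']+", text.lower())
    return phrases, words

phrases, words = tokenize("Tennis is a racket sport played individually "
                          "or between two teams of two players each.")
print(len(phrases), len(words))
```

A rule set this simple already handles the example texts, but, as noted above, it would fail on non-segmented writing systems.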
Bag-of-Words
10
text1, text2 and text3 as above.
Using every token as a feature, the vocabulary begins: a, agrees, and, between, bringing, broomstick, broomsticks, by, each, fictional, flying, grant, him, if, individually, is, of, …
[Slide figure: term-by-text matrix with an 'x' wherever the token occurs in the text]
Stopwords
Tennis is a racket sport played individually or between two teams of two players each. (text1)
[Slide figure: tokens of text1 with stopwords such as "is" and "a" highlighted]
• Definition by Collins Dictionary: “a common word such as 'a' or 'the' that is not indexed or searchable in a computer search engine”
• Previously known, or generated during the analysis
• They are removed from the original text during the pre-processing phase
• Stopwords ISO: https://github.com/stopwords-iso
• Is it always recommended to delete stopwords?
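Stopword removal is then a simple set lookup. The list below is a tiny hand-picked subset used only for illustration; in practice a curated list such as one from the stopwords-iso project would be loaded:

```python
# Tiny illustrative stopword list; real pipelines load a curated one
# (e.g. from the stopwords-iso project).
STOPWORDS = {"a", "an", "the", "is", "or", "of", "by", "and", "to",
             "between", "where", "their", "if", "they", "him", "each"}

def remove_stopwords(tokens):
    """Filter out tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = ("tennis is a racket sport played individually or between "
          "two teams of two players each").split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Whether removal is always a good idea depends on the task: for phrase search, for example, dropping "the" changes the meaning of a query.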
Bag-of-Words
12
text1, text2 and text3 as above.
After removing stopwords, the vocabulary becomes: agrees, bringing, broomstick, broomsticks, fictional, flying, grant, individually, played, players, playing, prove, Quidditch, racket, riding, sport, teams, …
[Slide figure: term-by-text matrix with an 'x' wherever the token occurs in the text]
Stemming
13
• Different variants of the same word exist according to its grammatical category: time, gender, number...
• The categories that share grammatical properties are considered part-of-speech (PoS): noun, verb, adjective, adverb, pronoun..
• Techniques:
A) Rules-based: linguistic normalization that reduces the different grammatical forms a word can adopt to its root (stem) after eliminating its affixes (prefixes and suffixes). It uses logical rules.
E.g. Agreed -> Agree
B) Dictionary-based (lemmas): transformation based on the context and the grammatical category (verb, noun, adjective...) of the words. It uses dictionaries.
E.g. Understood -> Understand
Should "broomstick" and "broomsticks" be reduced to the same stem?
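A rules-based stemmer can be sketched as a suffix stripper. The rule set below is deliberately minimal (real projects use a full algorithm such as Porter's), so its stems are only approximate:

```python
# Minimal rules-based suffix stripping; order matters ("ers" before "s").
SUFFIXES = ("ing", "ed", "ers", "es", "s", "ly", "al")

def stem(word):
    """Strip the first matching suffix, keeping a root of >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("playing"), stem("players"), stem("broomsticks"), stem("fictional"))
```

Note that such rules happily produce non-words ("agreed" becomes "agre" here), which is acceptable as long as all variants collapse to the same stem.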
N-Gram Stemming
14-20
bi-grams, threshold = 0.5
Character bi-grams (with * as a word-boundary marker) and their counts:
broomstick:  *b, br, ro, oo, om, ms, st, ti, ic, ck, k* (12)
broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (13)
played:  *p, pl, la, ay, ye, ed, d* (7)
players: *p, pl, la, ay, ye, er, rs, s* (8)
playing: *p, pl, la, ay, yi, in, ng, g* (8)
Dice Coefficient = 2 * C / (A+B), where C is the number of shared bi-grams and A, B are the bi-gram counts of the two words:

             broomstick  broomsticks  played  players  playing
broomstick   1           0.88         0.0     0.0      0.0
broomsticks              1            0.0     0.0      0.0
played                                1       0.66     0.53
players                                       1        0.50
playing                                                1

With threshold = 0.5, two clusters of word variants emerge:
{played, players, playing} -> play and {broomstick, broomsticks} -> broomstick.
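The Dice coefficient above can be computed directly over character bi-gram sets. Note that with sets the exact scores differ slightly from the slide's, which count duplicate bi-grams:

```python
def bigrams(word):
    """Character bi-grams with '*' boundary markers,
    e.g. played -> {*p, pl, la, ay, ye, ed, d*}."""
    padded = f"*{word}*"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def dice(a, b):
    """Dice Coefficient 2 * C / (A + B) over bi-gram sets."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

threshold = 0.5
for pair in [("broomstick", "broomsticks"), ("played", "players"),
             ("broomstick", "played")]:
    score = dice(*pair)
    print(pair, round(score, 2), "cluster" if score > threshold else "-")
```

Pairs above the threshold are merged into the same cluster, exactly as in the {played, players, playing} example.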
Bag-of-Words
21
text1, text2 and text3 as above.
After stemming, the vocabulary becomes: agree(s), bring(ing), broomstick(s), fiction(al), fly(ing), grant, individual(ly), play(ed|ers|ing), prove, Quidditch, racket, ride(ing), sport, team(s), Tennis, wish(es), witch(es), …
[Slide figure: term-by-text matrix with an 'x' wherever the stem occurs in the text]
Binary Bag-of-Words
22
text1, text2 and text3 as above.
Binary coding: each cell is 1 if the stem occurs in the text and 0 otherwise. Over the vocabulary agree(s), bring(ing), broomstick(s), fiction(al), fly(ing), grant, individual(ly), play(ed|ers|ing), prove, Quidditch, racket, ride(ing), sport, team(s), Tennis, wish(es), witch(es), …:
text1  0 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 ..
text2  0 0 1 1 1 0 0 1 0 1 0 1 1 0 0 0 1 ..
text3  1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 ..
Term-Frequency Bag-of-Words
23
text1, text2 and text3 as above.
For the stem play(ed|ers|ing): text1 -> 2, text2 -> 1, text3 -> 0
• Term Frequency (TF): the importance of a word depends on the number of times it appears in the text
Scaled TF Bag-of-Words
24
text1, text2 and text3 as above.
For the stem play(ed|ers|ing): text1 -> 2/8, text2 -> 1/9, text3 -> 0
• Scaled Term Frequency: the importance of a word depends on both the number of times it appears in the text and the size of the text
TF-IDF Bag-of-Words
25
text1, text2 and text3 as above.
For the stem play(ed|ers|ing): text1 -> 2/8 * log(3/2), text2 -> 1/9 * log(3/2), text3 -> 0
• Term Frequency - Inverse Document Frequency: the importance of a word depends on both its importance in a document (TF) and its importance in a collection of documents (IDF).
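The slide's numbers for the stem play(ed|ers|ing) can be reproduced directly. The base of the logarithm is a free choice (natural log here), so only the relative weights matter:

```python
import math

n_docs = 3            # text1, text2, text3
docs_with_play = 2    # the stem occurs in text1 and text2

tf_text1 = 2 / 8      # 'play' twice among text1's 8 kept terms
tf_text2 = 1 / 9      # once among text2's 9 kept terms
idf = math.log(n_docs / docs_with_play)   # log(3/2)

tfidf_text1 = tf_text1 * idf
tfidf_text2 = tf_text2 * idf
print(round(tfidf_text1, 4), round(tfidf_text2, 4))
```

A term occurring in every document gets idf = log(1) = 0, so it contributes nothing to similarity, which is exactly the desired behaviour for near-stopwords.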
Vector Space Model [Salton and McGill, 1983]
• A document is represented by a high-dimensional vector in the
space of words
• A basic vocabulary of “words” or “terms” is chosen, and, for each
document in the corpus, a count is formed of the number of
occurrences of each word.
• After suitable normalization, this term frequency count is
compared to an inverse document frequency count, which
measures the number of occurrences of a word in the entire
corpus (generally on a log scale, and again suitably normalized).
• The end result is a term-by-document matrix X whose columns
contain the tf-idf values for each of the documents in the corpus.
• Thus the tf-idf scheme reduces documents of arbitrary length to
fixed-length lists of numbers.
26
[Slide figure: an example term-by-document matrix X, with one column per document (doc1, doc2, doc3) and one row per term, holding the tf-idf values]
How similar are these texts?
27
Tennis is a racket sport played individually
or between two teams of two players each.
text1
Quidditch is a fictional sport where witches
and wizards playing by riding flying
broomsticks.
text2
The Wizard of Oz agrees to grant their
wishes if they prove their worth by bringing
him the Witch's broomstick.
text3
Distance Metrics
28
• A metric on a set I is a distance function d : I × I → ℝ satisfying:
  1) Identity: ∀i ∈ I, d(i, i) = 0
  2) Symmetry: ∀i, k ∈ I, d(i, k) = d(k, i)
  3) Non-negativity: ∀i, k ∈ I, d(i, k) ≥ 0
  4) Triangle inequality: ∀i, l, k ∈ I, d(i, k) ≤ d(i, l) + d(l, k)
• Examples: Euclidean (l2-norm), Manhattan (l1-norm), Chebychev, Minkowski (lp-norm), Mahalanobis
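Several of these metrics are instances of the Minkowski lp-norm. A quick sketch, including a check of the triangle inequality:

```python
def minkowski(x, y, p):
    """lp-norm distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    """Limit of the lp-norm as p grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

u, v, w = (0, 0), (3, 4), (1, 1)
euclid = minkowski(u, v, 2)
manhattan = minkowski(u, v, 1)
print(euclid, manhattan, chebyshev(u, v))

# Triangle inequality: d(u, v) <= d(u, w) + d(w, v)
assert minkowski(u, v, 2) <= minkowski(u, w, 2) + minkowski(w, v, 2)
```

Mahalanobis additionally needs a covariance matrix of the data, so it is omitted from this sketch.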
Text Similarity
29
• Jaccard Index: measures similarity between finite sample sets; defined as the size of the intersection divided by the size of the union of the sample sets
• Cosine Similarity: measures the cosine of the angle between two vectors
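Both measures take only a few lines of plain Python. The token sets below are a simplified stemmed vocabulary for text1 and text2, assumed for illustration:

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over finite sample sets."""
    return len(a & b) / len(a | b)

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norms = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norms

t1 = {"tennis", "racket", "sport", "play", "individual", "team"}
t2 = {"quidditch", "fiction", "sport", "witch", "wizard", "play",
      "ride", "fly", "broomstick"}
print(round(jaccard(t1, t2), 3))          # only 'sport' and 'play' are shared
print(round(cosine((1, 0, 1), (1, 1, 0)), 3))
```

Jaccard works on sets (binary bag-of-words), while cosine also exploits TF or TF-IDF weights, which is why the notebook uses cosine.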
How similar are these texts?
30
Tennis is a racket sport played individually
or between two teams of two players each.
text1
Quidditch is a fictional sport where witches
and wizards playing by riding flying
broomsticks.
text2
The Wizard of Oz agrees to grant their
wishes if they prove their worth by bringing
him the Witch's broomstick.
text3
Open the
‘Text Similarity’ Notebook
Text Similarity Notebook
31
Create a vectorial representation of the texts by using an NLP pipeline with:
1. Tokenization
2. Stopwords Removal
3. Stemming
And calculate the cosine similarities among them:

similarity  text1  text2  text3
text1       1.0    0.158  0.06
text2       0.158  1.0    0.175
text3       0.06   0.175  1.0
How similar are these books?
32
Challenges
• number of words in vocabulary
• term-frequency matrix size
• high dimensionality of vectors
33
Dimensionality Reduction
34
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
[Slide figure: the term-by-document matrix is factorized through a k-dimensional concept space, reducing the dimensionality of both terms and documents]
Latent Semantic Analysis (LSA/LSI) [Deerwester et. al, 1990]
35
• Map documents (and terms) to a low-dimensional representation by SVD
• It can capture some aspects of basic linguistic notions such as synonymy and polysemy
• Drawback: the resulting dimensions are hard to interpret
[Slide figure: the terms × documents matrix is factorized into terms × topics and topics × documents matrices]
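A truncated SVD over a toy term-by-document matrix shows the mapping. NumPy is assumed to be available, and the matrix values are made up for illustration:

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 3.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep k = 2 latent dimensions ("concepts").
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in concept space
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # low-rank approximation of X

print(doc_vectors.shape)
```

Each document is now a k-dimensional point, so cosine similarity can be computed in the concept space instead of the sparse word space.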
probabilistic LSA/LSI [Hofmann, 1999]
36
• Each word is generated from a single topic, and different words in a document may be generated from different topics.
• Each document (bag-of-words) is DESCRIBED as a list of mixing proportions for these topics
• Mixture components serve as representations of “topics”
• Drawback: no generative model at the level of documents, so no inference (given an unseen text, we cannot determine which topics it belongs to)
[Slide figure: the terms × documents matrix is factorized into terms × topics and topics × documents matrices]
Latent Dirichlet Allocation (LDA) [Blei et. al, 2003]
37
• Each word is generated from a single topic, and different words in a document may be generated from different topics.
• Each document (bag-of-words) is GENERATED from a mixture of topics
• A generative model of terms and documents: parameters do not grow with the size of the training corpus
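LDA's generative story can be sketched by fixing (rather than learning) the parameters: pick a topic for each word position from the document's topic mixture, then a word from that topic. The two toy topics below are invented for illustration:

```python
import random

# Hand-crafted topics: distributions over words (normally learned by LDA).
topics = {
    "sports":  {"tennis": 0.4, "racket": 0.3, "player": 0.3},
    "fantasy": {"wizard": 0.4, "witch": 0.3, "broomstick": 0.3},
}

def sample(dist):
    """Draw one key from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r <= acc:
            return item
    return item  # guard against floating-point rounding

def generate_document(topic_mixture, length=10):
    words = []
    for _ in range(length):
        topic = sample(topic_mixture)        # a topic per word position
        words.append(sample(topics[topic]))  # a word from that topic
    return words

random.seed(0)
doc = generate_document({"sports": 0.8, "fantasy": 0.2})
print(doc)
```

Training LDA runs this story in reverse: given only the documents, it infers the topics and each document's mixture.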
Topic?
Let's take a look at the following model:
38
• DBpedia Model: https://librairy.linkeddata.es/dbpedia-model
Hyperparameters
39
• Encodes assumptions
• Defines a factorization of the joint distribution
• Connects to algorithms for computing with data
Plate Notation:
• Nodes are random variables
• Edges indicate dependence
• Shaded nodes are observed
• Plates indicate replicated variables
[Slide figure: the LDA plate diagram. α is the document-topic parameter governing the per-document topic proportions; each word position has a per-word topic proportion leading to the observed word; β holds the per-topic word proportions, governed by the topic-word parameter. Plates replicate over words, documents and topics.]
α - value?
40-43
• 15 documents
• 10 topics
• α = ?
[Slide figures: per-document topic proportions drawn with α = 100, α = 1, α = 0.1 and α = 0.01; the smaller α is, the sparser each document's topic mixture becomes]
• Blei, David M., Lawrence Carin and David B. Dunson. “Probabilistic Topic Models.” IEEE Signal Processing Magazine 27 (2010): 55-65
How similar are these books?
44
Topics per Books
45
Per-document topic proportions (one row per book; the columns are the three topics; each row sums to 1.0):
0.1  0.6  0.3
0.3  0.2  0.5
0.3  0.1  0.6
Per-topic word proportions (one row per topic; columns word1, word2, word3, ...):
topic0  0.2  0.1  0.3  0.2  0.3  0.3  0.1  0.5  0.6  0.1
topic1  0.2  0.8  0.4  0.7  0.2  0.3  0.1  0.1  0.3  0.6
topic2  0.6  0.1  0.3  0.1  0.5  0.4  0.8  0.4  0.1  0.3
Dirichlet Distribution
46
• Iterations of taking 1000 samples from
a Dirichlet distribution using an
increasing alpha value.
• Each dot represents some distribution
or mixture of the three topics like (1.0,
0.0, 0.0) or (0.4, 0.3, 0.3)
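The effect of α can be reproduced with the standard library by normalising Gamma draws (a textbook way to sample a Dirichlet); the α values chosen below are illustrative:

```python
import random

def dirichlet(alpha, k):
    """One sample from a symmetric Dirichlet(alpha) via normalised Gammas."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

random.seed(1)
# Small alpha -> sparse mixtures (most mass on one topic);
# large alpha -> near-uniform mixtures over the three topics.
sparse_max = sum(max(dirichlet(0.1, 3)) for _ in range(1000)) / 1000
smooth_max = sum(max(dirichlet(100.0, 3)) for _ in range(1000)) / 1000
print(round(sparse_max, 2), round(smooth_max, 2))
```

With α = 0.1 the dominant topic takes almost all the mass, like the corner dots (1.0, 0.0, 0.0); with α = 100 the samples cluster near the uniform mixture (0.33, 0.33, 0.33).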
Document Similarity
• Distance metrics based on vector-type data, such as Euclidean distance (l2), Manhattan distance (l1), and the angular metric (θ), are not optimal in this space.
• Information-theoretically motivated metrics such as Kullback-Leibler (KL) divergence (Eq.1) (also known as relative entropy), Jensen-Shannon (JS) divergence (Eq.2) (its symmetric version) and Hellinger (He) distance (Eq.3) are often more reasonable.
• However, none of these is a well-defined distance metric: they do not satisfy the triangle inequality. S2JSD (Eq.4) was created to satisfy it.
47
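These measures can be written in a few lines, following the Eq.1-Eq.4 naming above; S2JSD is implemented here as sqrt(2·JS), its usual definition, and the example distributions are arbitrary topic mixtures:

```python
import math

def kl(p, q):
    """Eq.1: Kullback-Leibler divergence (relative entropy); asymmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Eq.2: Jensen-Shannon divergence, the symmetrised, smoothed KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    """Eq.3: Hellinger distance."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q))) / math.sqrt(2)

def s2jsd(p, q):
    """Eq.4: sqrt(2 * JS), which does satisfy the triangle inequality."""
    return math.sqrt(2 * js(p, q))

p, q = [0.1, 0.6, 0.3], [0.3, 0.2, 0.5]
print(round(kl(p, q), 3), round(kl(q, p), 3))   # asymmetric
print(round(js(p, q), 3), round(js(q, p), 3))   # symmetric
```

Swapping the arguments changes KL but not JS, which is why JS (and its metric form S2JSD) is preferred for comparing topic distributions.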
How similar are these texts?
48
Open the
‘Document Similarity’ Notebook
Document Similarity Notebook
• Train a topic model
• Modify NLP pipeline
• Understand LDA
hyperparameters
• Visualize topics
• Create document similarity
matrix
49
Are there duplicated patents among those published in the past 20 years in Spain?
50
Challenges
• avoid all pairwise comparisons
• large-scale document retrieval
51
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
Time Complexity
52
• Exact similarity computations require O(n²) complexity for neighbour-detection tasks, or O(k·n) computations when k queries are compared against a dataset of n documents
• Computation can be approximated by a nearest neighbour (NN) search problem
[Slide figure: an n x n pairwise similarity matrix over patent-1 .. patent-n, with 1.0 on the diagonal and all remaining cells unknown]
Nearest Neighbour Search
53
• An approximate nearest neighbour (ANN) search algorithm aims to find the point in a tree that is nearest to a given input point (kd-tree)
• This technique transforms data points from the original feature space into a binary-code space, so that similar data points have a larger probability of collision (i.e. having the same hash-code)
  e.g. (4,7) -> 010/2
       (9,6) -> 100/1
• The metric space can handle information-theoretically motivated metrics such as JS divergence, KL divergence, He distance or S2JSD.
Similarity Perception vs Similarity Score
54-55
• High Dimensional Models:
[Slide figures: two pairs of topic distributions that look similarly close to the eye, yet a) JSDa = 0.74 and b) JSDb = 0.71]
Hashing Topic Distributions [Badenes-Olmedo et. al, 2019]
56
Hash methods based on a hierarchical set of topics:
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
Topic-based Approximate Nearest Neighbour [Badenes-Olmedo et. al, 2019]
57
[Slide figure: documents indexed by hierarchical topic codes such as {(t6),(t5)}, {(t6),(t2)}, {(t5),(t4)}, {(t4),(t5)}, {(t4),(t2)} and {(t2),(t3)}, organised over hierarchy levels L0 and L1]
Topic Hierarchies
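The idea can be sketched as follows: derive a hash code from the ranked topics of a document's topic distribution, so that thematically close documents collide. This is an illustrative simplification, not the exact algorithm from the paper:

```python
def topic_hash(distribution, levels=2):
    """Hash code = the `levels` most relevant topics, in order, mirroring
    the {(t6),(t5)}-style codes of the hierarchy levels L0, L1, ..."""
    ranked = sorted(range(len(distribution)), key=lambda t: -distribution[t])
    return tuple(f"t{t}" for t in ranked[:levels])

doc_a = [0.05, 0.10, 0.50, 0.05, 0.30]   # dominated by t2, then t4
doc_b = [0.10, 0.05, 0.45, 0.05, 0.35]   # same dominant topics -> collision
doc_c = [0.60, 0.10, 0.10, 0.10, 0.10]   # different theme -> different code

print(topic_hash(doc_a), topic_hash(doc_b), topic_hash(doc_c))
```

Candidate neighbours are then only the documents in the same bucket, avoiding the O(n²) all-pairs comparison.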
Let's take a look at the following model:
58
• DBpedia Model: https://librairy.linkeddata.es/dbpedia-model
Guernica is a large 1937 oil painting on canvas by
Spanish artist Pablo Picasso. One of Picasso's best
known works, Guernica is regarded by many art critics
as one of the most moving and powerful anti-war
paintings in history. It is exhibited in the Museo Reina Sofía in Madrid.
59
Are there duplicated patents among those published in the past 20 years in Spain?
60
Are there duplicated patents among those published in the past 20 years in Spain and the United States?
Challenges
• cross-lingual topics
• unsupervised annotations
61
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies. Proceedings of the 10th Knowledge Capture Conference.
Multilingual Topic Models
62
• Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages.
• However, these approaches require theme-aligned training data to create a language-independent space.
• This constraint limits the range of scenarios in which the technique can be applied, and makes it difficult to scale up to situations where a huge collection of multi-lingual documents is required during the training phase.
Multilingual Dictionaries
• Dictionaries are usually easier to obtain and more widely available than parallel corpora (e.g. PANLEX covers 5,700 languages and Wiktionary covers 8,070 languages)
• But all these probabilistic topic models are based on prior knowledge.
• Connections at document-level (by parallel or comparable corpora) or at word-level (by dictionaries) are created in the training data before building the model.
• In this way, the pre-established language relations condition the creation of the topics (supervised method), instead of being inferred from the topics themselves as a posteriori knowledge (unsupervised method)
63
Multi-lingual Topic Hierarchies
Let's take a look at the following models created from JRC-Acquis corpora:
64
• English Model: http://librairy.linkeddata.es/jrc-en-model
Fast food is a type of mass-produced food designed for commercial resale and with a
strong priority placed on speed of service versus other relevant factors involved in
culinary science. Fast food was originally created as a commercial strategy to
accommodate the larger numbers of busy commuters, travelers and wage workers
who often did not have the time to sit down at a public house or diner and wait for their
meal.
Multi-lingual Topic Hierarchies
Let's take a look at the following models created from the JRC-Acquis corpora:
65
• Spanish Model: http://librairy.linkeddata.es/jrc-es-model
La comida rápida es un estilo alimentario donde el alimento se prepara y sirve para consumir rápidamente en establecimientos especializados (generalmente callejeros o a pie de calle) con un nivel alimenticio bajo.
[English: Fast food is a style of eating in which food is prepared and served to be consumed quickly in specialised establishments (generally street stalls) and with a low nutritional level.]
66
Are there duplicated patents among those published in the past 20 years in Spain and the United States?
oeg-upm.net
Cross-Lingual Similarity
Carlos Badenes-Olmedo 1
Jose Luis Redondo García 2
Oscar Corcho 1
1 Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
2 Amazon Research
Cambridge, UK

Cross-lingual Similarity

  • 1.
    oeg-upm.net Cross-Lingual Similarity Carlos Badenes-Olmedo1 Jose Luis Redondo García 2 Oscar Corcho 1 1 Ontology Engineering Group Universidad Politécnica de Madrid, Spain 2 Amazon Research Cambridge, UK
  • 2.
  • 3.
    Outline • Text Similarity(25min) - represent texts to calculate distances and similarities among them. - use Python modules to perform it. • Document Similarity (30min) - create topic models to describe and compare documents. - use Python modules to perform it. • Document Retrieval (20min) - efficiently search for documents in large collections. • Multi-lingual Retrieval (15min) - create annotations to compare documents in multi-lingual corpora. 3 We will learn how to..
  • 4.
    First Steps 1.Clone thedemo project from Github: git clone https://github.com/librairy/demo.git 2.Move into the root folder: cd demo/ 3.Download the docker images: docker-compose pull 4
  • 5.
    Material 5 1. Browse tohttp://classroom.google.com and go to classroom. 2. Sign in using your Google account 3. Join to the class by code: kbmakz
  • 6.
    How similar arethese texts? 6 Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3
  • 7.
    Challenges • identify features •vector sparsity • text normalization 7
  • 8.
    Bag-of-Words 8 text1 .. text2 .. text3.. ? Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 … ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
  • 9.
    Tokens Tennis sportplayed Tennis is a racket sport played individuallyor between two teams of two players each. text1 a … is racket • Retrieve minimum processing units (tokens) from a text • Phrase and word level • Requires an initial cleaning process and a subsequent normalization process • Rules are defined to identify the cut-off points or boundaries of each segment (regular expressions): • Phrases: punctuation marks to identify segments • Words: In all modern languages based on Latin, Cyrillic or Greek writing systems like English and other European languages, blank spaces are used to identify segments. • Tokenization in non-segmented languages (e.g. pictograms) require more sophisticated algorithms. • It depends on the domain
  • 10.
    Bag-of-Words 10 text1 x xx x x x .. text2 x x x x x x x .. text3 x x x x x x x x .. a agrees and betweenbringing Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 broom stickbroom sticks by each fictionalflying grant him if individuallyis of …
  • 11.
    Stopwords Tennis sportplayed Tennis is a racket sport played individuallyor between two teams of two players each. text1 a … is racket • Definition by Collins Dictionary: “a common word such as 'a' or 'the' that is not indexed or searchable in a computer search engine” • Previously known or generated during the analysis. • They are removed from the original text during the pre-processing phase. • Stopwords ISO: https://github.com/stopwords-iso • is it always recommended to delete stopwords?
  • 12.
    Bag-of-Words 12 text1 x xx x x x .. text2 x x x x x x x .. text3 x x x x x .. agrees bringing broom stick broom sticks fictional Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 flying grant individuallyplayed players playingprove Q uidditchracket riding sport team s …
  • 13.
    Stemming • Different variantsof the same word according to its grammatical category: time, gender, number... • The categories that share grammatical properties are considered part-of-speech (PoS): noun, verb, adjective, adverb, pronoun.. • Techniques: A) Rules-based: Linguistic normalization that reduces the different grammatical forms that a word can adopt at its root (stem) after eliminating its affixes (prefixes and suffixes). It uses logical rules. E.g. Agreed -> Agree B) Dictionary-based (lemmas): Transformation based on the context and the grammatical category (verb, name, adjective...) of the words. It uses dictionaries. E.g. Understood -> Understand 13 broom stick broom sticks ?
  • 14.
    N-Gram Stemming 14 broomstick: broomsticks: played: players: playing: Dice Coefficient= 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 broomsticks 1 played 1 players 1 playing 1 14 bi-grams threshold=0.5
  • 15.
    N-Gram Stemming 15 broomstick: *b,br, ro, oo, om, ms, st, ti, ic, ck, k* (12) broomsticks: played: players: playing: Dice Coefficient = 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 broomsticks 1 played 1 players 1 playing 1 15 bi-grams threshold=0.5
  • 16.
    N-Gram Stemming 16 broomstick: *b,br, ro, oo, om, ms, st, ti, ic, ck, k* (12) broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (13) played: players: playing: Dice Coefficient = 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 0.88 broomsticks 1 played 1 players 1 playing 1 16 bi-grams threshold=0.5
  • 17.
    N-Gram Stemming 17 broomstick: *b,br, ro, oo, om, ms, st, ti, ic, ck, k* (12) broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (13) played: *p, pl, la, ay, ye, ed, d* (7) players: playing: Dice Coefficient = 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 0.88 0.0 broomsticks 1 0.0 played 1 players 1 playing 1 17 bi-grams threshold=0.5
  • 18.
    N-Gram Stemming 18 broomstick: *b,br, ro, oo, om, ms, st, ti, ic, ck, k* (12) broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (13) played: *p, pl, la, ay, ye, ed, d* (7) players: *p, pl, la, ay, ye, er, rs, s* (8) playing: Dice Coefficient = 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 0.88 0.0 0.0 broomsticks 1 0.0 0.0 played 1 0.66 players 1 playing 1 18 bi-grams threshold=0.5
  • 19.
    N-Gram Stemming 19 broomstick: *b,br, ro, oo, om, ms, st, ti, ic, ck, k* (12) broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (13) played: *p, pl, la, ay, ye, ed, d* (7) players: *p, pl, la, ay, ye, er, rs, s* (8) playing: *p, pl, la, ay, yi, in, ng, g* (8) Dice Coefficient = 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 0.88 0.0 0.0 0.0 broomsticks 1 0.0 0.0 0.0 played 1 0.66 0.53 players 1 0.50 playing 1 19 bi-grams threshold=0.5
  • 20.
    N-Gram Stemming 20 broomstick: *b,br, ro, oo, om, ms, st, ti, ic, ck, k* (12) broomsticks: *b, br, ro, oo, om, ms, st, ti, ic, ck, ks, s* (13) played: *p, pl, la, ay, ye, ed, d* (7) players: *p, pl, la, ay, ye, er, rs, s* (8) playing: *p, pl, la, ay, yi, in, ng, g* (8) Dice Coefficient = 2 * C / (A+B)
 broomstick broomsticks played players playing broomstick 1 0.88 0.0 0.0 0.0 broomsticks 1 0.0 0.0 0.0 played 1 0.66 0.53 players 1 0.50 playing 1 20 bi-grams threshold=0.5 play broomstick
  • 21.
    Bag-of-Words 21 text1 x xx x x x .. text2 x x x x x x x x .. text3 x x x x x x x .. agree (s)bring (ing)broom stick (s) fiction (al) Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 fly (ing)grant individual(ly) play (ed|ers|ing) prove Q uidditchracket ride (-ing)sport team (s) … Tennis wish (es)witch (es)
  • 22.
    Binary Bag-of-Words 22 text1 00 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 .. text2 0 0 1 1 1 0 0 1 0 1 0 1 1 0 0 0 1 .. text3 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 .. agree (s)bring (ing)broom stick (s) fiction (al) Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 fly (ing)grant individual(ly) play (ed|ers|ing) prove Q uidditchracket ride (-ing)sport team (s) … Tennis wish (es)witch (es) Binary-Coding
  • 23.
    Term-Frequency Bag-of-Words 23 text1 2 text21 text3 0 Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 play (ed|ers|ing) • Term Frequency (TF): The importance of a word depends on the number of times it appears in the text
  • 24.
    Scaled TF Bag-of-Words 24 text12/8 text2 1/9 text3 0 Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 play (ed|ers|ing) • Scaled Term Frequency: the importance of a word depends on both the number of times it appears in the text and the size of the text
  • 25.
TF-IDF Bag-of-Words 25
• Term Frequency - Inverse Document Frequency (TF-IDF): the importance of a word depends on both its importance in a document (TF) and its importance in the collection of documents (IDF).
For play(ed|ers|ing): text1 = 2/8 · log(3/2), text2 = 1/9 · log(3/2), text3 = 0.
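The tf-idf values on the slide can be checked directly in plain Python (a minimal sketch; the token counts 2/8 and 1/9 and the document frequency behind log(3/2) are taken from the slides):

```python
import math

# Scaled term frequencies of the stem "play" in the three example texts
# (2 of 8 content tokens in text1, 1 of 9 in text2, none in text3):
tf = {"text1": 2 / 8, "text2": 1 / 9, "text3": 0.0}

# Inverse document frequency: log(N / df), with N = 3 documents
# and "play" occurring in 2 of them.
idf = math.log(3 / 2)

tfidf = {doc: f * idf for doc, f in tf.items()}
print(tfidf)  # text3 scores 0.0 because "play" never appears in it
```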
  • 26.
Vector Space Model [Salton and McGill, 1983] 26
• A document is represented by a high-dimensional vector in the space of words.
• A basic vocabulary of "words" or "terms" is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word.
• After suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized).
• The end result is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus.
• Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers.
[figure: term-by-document matrix with one column per document (doc1, doc2, doc3, …) and one row per term]
  • 27.
How similar are these texts? 27 Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3
  • 28.
Distance Metrics 28
• A metric on a set I is a distance function d : I × I → R satisfying the following conditions:
1) Identity: ∀i ∈ I, d(i, i) = 0
2) Symmetry: ∀i, k ∈ I, d(i, k) = d(k, i)
3) Non-negativity: ∀i, k ∈ I, d(i, k) ≥ 0
4) Triangle inequality: ∀i, l, k ∈ I, d(i, k) ≤ d(i, l) + d(l, k)
Examples: Euclidean (l2-norm), Manhattan (l1-norm), Chebyshev, Minkowski (lp-norm), Mahalanobis.
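The four axioms can be checked numerically for these distances; a pure-Python sketch:

```python
def euclidean(a, b):          # l2-norm
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):          # l1-norm
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):          # l-infinity norm
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):     # lp-norm (p=1 gives Manhattan, p=2 Euclidean)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

i, l, k = (0.0, 0.0), (1.0, 2.0), (3.0, 1.0)
for d in (euclidean, manhattan, chebyshev, minkowski):
    assert d(i, i) == 0                  # 1) identity
    assert d(i, k) == d(k, i)            # 2) symmetry
    assert d(i, k) >= 0                  # 3) non-negativity
    assert d(i, k) <= d(i, l) + d(l, k)  # 4) triangle inequality
```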
  • 29.
Text Similarity 29
• Jaccard Index: measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
• Cosine Similarity: measures the cosine of the angle between two vectors.
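Both measures fit in a few lines of plain Python (set-based Jaccard over tokens, cosine over vectors; the token sets below are illustrative stems from the example texts):

```python
def jaccard(a, b):
    """Jaccard index of two token sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)

tokens1 = {"tennis", "racket", "sport", "play", "team"}
tokens2 = {"quidditch", "fiction", "sport", "witch", "wizard", "play", "broomstick"}
print(jaccard(tokens1, tokens2))  # 2 shared tokens out of 10 distinct -> 0.2
```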
  • 30.
How similar are these texts? 30 Tennis is a racket sport played individually or between two teams of two players each. text1 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks. text2 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick. text3 Open the ‘Text Similarity’ Notebook https://hackernoon.com/is-coding-becoming-obsolete-part-ii-d24a91f0a65b
  • 31.
Text Similarity Notebook 31
Create a vectorial representation of the texts by using an NLP pipeline with:
1. Tokenization
2. Stopwords Removal
3. Stemming
And calculate the cosine similarities among them:
similarity  text1  text2  text3
text1       1.0    0.158  0.06
text2       0.158  1.0    0.175
text3       0.06   0.175  1.0
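A minimal stand-in for the notebook's three pipeline steps (the stopword list and the suffix-stripping stemmer below are toy assumptions; a real pipeline would use NLTK's stopword corpus and Porter stemmer):

```python
import re

STOPWORDS = {"is", "a", "or", "of", "the", "by", "where", "and", "to",
             "their", "if", "they", "him", "between", "each"}

def tokenize(text):
    # 1. Tokenization: lowercase word tokens
    return re.findall(r"[a-z']+", text.lower())

def stem(token):
    # 3. Stemming: crude suffix stripping (it maps "tennis" -> "tenni",
    # much like the Porter stemmer does)
    for suffix in ("ing", "ers", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # 2. Stopword removal happens between tokenizing and stemming
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("Tennis is a racket sport played individually "
                 "or between two teams of two players each."))
```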
  • 32.
How similar are these books? 32
  • 33.
Challenges 33
• number of words in vocabulary
• term-frequency matrix size
• high dimensionality of vectors
  • 34.
Dimensionality Reduction 34
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
  • 35.
Latent Semantic Analysis (LSA/LSI) [Deerwester et al., 1990] 35
• Dimensionality reduction: map documents (and terms) from the term-by-document matrix to a low-dimensional concept space (k dimensions) by SVD.
• It can capture some aspects of basic linguistic notions such as synonymy and polysemy.
• Drawback: the resulting dimensions are hard to interpret.
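LSA is essentially a truncated SVD of the term-by-document matrix; a sketch with NumPy on a toy count matrix (the counts are illustrative assumptions):

```python
import numpy as np

# Toy term-by-document count matrix: 5 terms x 4 documents
X = np.array([
    [1, 1, 0, 0],   # "sport"
    [2, 0, 0, 0],   # "tennis"
    [0, 3, 1, 0],   # "broomstick"
    [0, 1, 2, 1],   # "witch"
    [0, 0, 1, 2],   # "wizard"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent "concept" dimensions to keep
docs_k = (np.diag(s[:k]) @ Vt[:k]).T  # documents mapped into the k-dim concept space
terms_k = U[:, :k]                    # terms mapped into the same space

print(docs_k.shape)  # (4, 2): each document is now a fixed 2-dimensional vector
```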
  • 36.
Probabilistic LSA/LSI (pLSA) [Hofmann, 1999] 36
• The term-by-document matrix is factored into term-topic and topic-document matrices: mixture components act as a representation of "topics".
• Each word is generated from a single topic, and different words in a document may be generated from different topics.
• Each document (bag-of-words) is DESCRIBED as a list of mixing proportions for these topics.
• Drawback: no generative model at the level of documents, so no inference (given an unseen text, we cannot determine which topics it belongs to).
  • 37.
Latent Dirichlet Allocation (LDA) [Blei et al., 2003] 37
• Each word is generated from a single topic, and different words in a document may be generated from different topics.
• Each document (bag-of-words) is GENERATED from a mixture of topics.
• Generative model of terms and documents: parameters do not grow with the size of the training corpus.
  • 38.
Topic? Let's take a look at the following model: 38
• DBpedia Model: https://librairy.linkeddata.es/dbpedia-model
  • 39.
Hyperparameters 39
• Encode assumptions
• Define a factorization of the joint distribution
• Connect to algorithms for computing with data
Plate notation (α: document-topic parameter, β: topic-word parameter):
• Nodes are random variables
• Edges indicate dependence
• Shaded nodes are observed (the words)
• Plates indicate replicated variables (documents with per-document topic proportions, words with per-word topic proportions, topics with per-topic word proportions)
  • 40.
α value? 40
• 15 documents • 10 topics • α = 100
Blei, David M., Lawrence Carin and David B. Dunson. "Probabilistic Topic Models." IEEE Signal Processing Magazine 27 (2010): 55-65
  • 41.
α value? 41
• 15 documents • 10 topics • α = 1
  • 42.
α value? 42
• 15 documents • 10 topics • α = 0.1
  • 43.
α value? 43
• 15 documents • 10 topics • α = 0.01
  • 44.
How similar are these books? 44
  • 45.
Topics per Book 45
[figure: each book is described by a distribution over topics (topic0, topic1, topic2) summing to 1.0, and each topic by a distribution over words (word1 … word9, …) summing to 1.0]
  • 46.
Dirichlet Distribution 46
• Iterations of taking 1000 samples from a Dirichlet distribution using an increasing alpha value.
• Each dot represents some distribution or mixture of the three topics, like (1.0, 0.0, 0.0) or (0.4, 0.3, 0.3).
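The effect of α shown in the slides above can be reproduced with NumPy: a large α spreads each document over all topics, a small α concentrates it on a few:

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in (100, 1, 0.1, 0.01):
    # 15 documents, each a mixture over 10 topics (as in the slides)
    theta = rng.dirichlet([alpha] * 10, size=15)
    # the mean of the largest topic weight shows how peaked the mixtures are
    print(alpha, round(float(theta.max(axis=1).mean()), 2))
```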
  • 47.
Document Similarity 47
• Distance metrics based on vector-type data, such as Euclidean distance (l2), Manhattan distance (l1), and the angular metric (θ), are not optimal in this space.
• Information-theoretically motivated metrics, such as Kullback-Leibler (KL) divergence (Eq.1) (also known as relative entropy), Jensen-Shannon (JS) divergence (Eq.2) (its symmetric version) and Hellinger (He) distance (Eq.3), are often more reasonable.
• However, these metrics are not well-defined distance metrics, that is, they do not satisfy the triangle inequality. S2JSD (Eq.4) was created to satisfy it.
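These measures sketched with NumPy (the distributions must be strictly positive for KL; S2JSD is taken here to be the square root of twice the JS divergence, the construction that yields a true metric):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence (relative entropy); not symmetric."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence: symmetrised, smoothed KL."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def s2jsd(p, q):
    """sqrt(2 * JS): satisfies the triangle inequality."""
    return float(np.sqrt(2.0 * js(p, q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(js(p, q), hellinger(p, q), s2jsd(p, q))
```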
  • 48.
How similar are these texts? 48 Open the ‘Document Similarity’ Notebook https://hackernoon.com/is-coding-becoming-obsolete-part-ii-d24a91f0a65b
  • 49.
Document Similarity Notebook 49
• Train a topic model
• Modify the NLP pipeline
• Understand LDA hyperparameters
• Visualize topics
• Create a document similarity matrix
  • 50.
Are there duplicated patents among those published in the past 20 years in Spain? 50
  • 51.
Challenges 51
• avoid all pairwise comparisons
• large-scale document retrieval
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
  • 52.
Time Complexity 52
• Exact similarity computations have O(n²) complexity for neighbour-detection tasks, or O(k·n) when k queries are compared against a dataset of n documents.
• The computation can be approximated as a nearest-neighbour (NN) search problem.
[figure: pairwise similarity matrix over patents, patent-1 … patent-n against patent-1 … patent-n]
  • 53.
Nearest Neighbour Search 53
• Approximate nearest neighbour (ANN) search algorithms aim to find the point in the tree that is nearest to a given input point (kd-tree).
• This technique transforms data points from the original feature space into a binary-code space, so that similar data points have a larger probability of collision (i.e. having the same hash-code), e.g. (4,7) -> 010/2, (9,6) -> 100/1.
• The metric space can handle information-theoretically motivated metrics such as JS divergence, KL divergence, He distance or S2JSD.
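The binary-code idea can be sketched with random-hyperplane hashing (an illustration of hashing for ANN in general, not the paper's topic-based scheme): each hyperplane contributes one bit, so nearby vectors tend to collide in the same bucket.

```python
import numpy as np

rng = np.random.default_rng(42)
planes = rng.normal(size=(8, 5))  # 8 random hyperplanes over a 5-dim feature space

def hash_code(v):
    """One bit per hyperplane: which side of it the vector falls on."""
    return tuple(int(b) for b in (planes @ v > 0))

# Index 100 random points into hash buckets
buckets = {}
for i, v in enumerate(rng.normal(size=(100, 5))):
    buckets.setdefault(hash_code(v), []).append(i)

# Candidate neighbours of a query are only the points sharing its hash-code,
# instead of all 100 points.
query = rng.normal(size=5)
print(len(buckets), "buckets;", len(buckets.get(hash_code(query), [])), "candidates")
```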
  • 54.
Similarity Perception vs Similarity Score 54
• High Dimensional Models: a) JSDa = ?  b) JSDb = ?
  • 55.
Similarity Perception vs Similarity Score 55
• High Dimensional Models: a) JSDa = 0.74  b) JSDb = 0.71
  • 56.
Hashing Topic Distributions [Badenes-Olmedo et al., 2019] 56
Hash methods based on a hierarchical set of topics:
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms. Semantic Web Journal.
  • 57.
Topic-based Approximate Nearest Neighbour [Badenes-Olmedo et al., 2019] 57
Example hash codes over two levels (L0, L1): {(t6),(t5)}, {(t6),(t2)}, {(t5),(t4)}, {(t4),(t5)}, {(t2),(t3)}, {(t4),(t2)}
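A toy version of a hierarchical topic-based hash (the paper's algorithms are more elaborate; the `topic_hash` helper below is a hypothetical simplification that ranks topics by weight and keeps one per level):

```python
def topic_hash(dist, levels=2, per_level=1):
    """Rank topics by weight and group the top ones into levels L0, L1, …"""
    ranked = sorted(range(len(dist)), key=lambda t: dist[t], reverse=True)
    return [tuple(ranked[i * per_level:(i + 1) * per_level]) for i in range(levels)]

# Toy 7-topic document distribution (index = topic id)
doc = [0.02, 0.03, 0.05, 0.05, 0.30, 0.40, 0.15]
print(topic_hash(doc))  # [(5,), (4,)] -> the slide's code {(t5),(t4)}
```

Documents sharing the same per-level topic sets land in the same bucket, so only they need an exact JSD comparison.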
  • 58.
Topic Hierarchies 58
Let's take a look at the following model:
• DBpedia Model: https://librairy.linkeddata.es/dbpedia-model
Guernica is a large 1937 oil painting on canvas by Spanish artist Pablo Picasso. One of Picasso's best known works, Guernica is regarded by many art critics as one of the most moving and powerful anti-war paintings in history. It is exhibited in the Museo Reina Sofía in Madrid.
  • 59.
Are there duplicated patents among those published in the past 20 years in Spain? 59
  • 60.
Are there duplicated patents among those published in the past 20 years in Spain and the United States? 60
  • 61.
Challenges 61
• cross-lingual topics
• unsupervised annotations
Badenes-Olmedo, C., Redondo-García, J. L., & Corcho, O. (2019). Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies. Proceedings of the 10th Knowledge Capture Conference.
  • 62.
Multilingual Topic Models 62
• Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations of collections of texts in multiple languages.
• However, these approaches require theme-aligned training data to create a language-independent space.
• This constraint limits the range of scenarios this technique can offer solutions for, and makes it difficult to scale up to situations where a huge collection of multi-lingual documents is required during the training phase.
  • 63.
Multilingual Dictionaries 63
• These resources are usually easier to obtain and more widely available than parallel corpora (e.g. PANLEX covers 5,700 languages and Wiktionary covers 8,070 languages).
• But all these probabilistic topic models are based on prior knowledge.
• Connections at document-level (by parallel or comparable corpora) or at word-level (by dictionaries) are created in the training data before building the model.
• In this way, the pre-established language relations condition the creation of the topics (supervised method), instead of being inferred from the topics themselves as a posteriori knowledge (non-supervised method).
  • 64.
Multi-lingual Topic Hierarchies 64
Let's take a look at the following models created from the JRC-Acquis corpora:
• English Model: http://librairy.linkeddata.es/jrc-en-model
Fast food is a type of mass-produced food designed for commercial resale and with a strong priority placed on speed of service versus other relevant factors involved in culinary science. Fast food was originally created as a commercial strategy to accommodate the larger numbers of busy commuters, travelers and wage workers who often did not have the time to sit down at a public house or diner and wait for their meal.
  • 65.
Multi-lingual Topic Hierarchies 65
Let's take a look at the following models created from the JRC-Acquis corpora:
• Spanish Model: http://librairy.linkeddata.es/jrc-es-model
La comida rápida es un estilo alimentario donde el alimento se prepara y sirve para consumir rápidamente en establecimientos especializados (generalmente callejeros o a pie de calle) con un nivel alimenticio bajo. [English: Fast food is a style of eating in which the food is prepared and served to be consumed quickly in specialised establishments (generally street stalls or at street level), with a low nutritional level.]
  • 66.
Are there duplicated patents among those published in the past 20 years in Spain and the United States? 66